ModErn Text Analysis
META Enumerates Textual Applications
meta::corpus::corpus Class Reference (abstract)

Provides an interface to multiple corpus input formats.

#include <corpus.h>

Inherited by meta::corpus::file_corpus, meta::corpus::gz_corpus, meta::corpus::libsvm_corpus, and meta::corpus::line_corpus.

Public Member Functions

 corpus (std::string encoding)
 Constructs a new corpus with the given encoding.
 
virtual bool has_next () const =0
 
virtual document next ()=0
 
virtual uint64_t size () const =0
 
virtual metadata::schema_type schema () const
 
virtual ~corpus ()=default
 Destructor.
 
const std::string & encoding () const
 
bool store_full_text () const
 
void set_store_full_text (bool store_full_text)
 

Protected Member Functions

std::vector< metadata::field > next_metadata ()
 Helper function to be used by deriving classes in implementing next() to set the metadata for the current document.
 

Private Member Functions

void set_metadata_parser (metadata_parser &&mdparser)
 

Private Attributes

std::string encoding_
 The type of encoding this document uses.
 
util::optional< metadata_parser > mdata_parser_
 The metadata parser.
 
bool store_full_text_
 Whether to store the original document text.
 

Friends

std::unique_ptr< corpus > make_corpus (const cpptoml::table &)
 Convenience method for creating a corpus using the factory.
 

Detailed Description

Provides an interface to multiple corpus input formats.

Required config parameters:

prefix = "prefix"
dataset = "datasetname" # relative to prefix
corpus = "corpus-spec-file" # e.g. "line.toml"

The corpus spec TOML file must specify a corpus type and may optionally specify an encoding for the corpus text.

Required config parameters:

type = "line-corpus" # for example

Optional config parameters:

encoding = "utf-8" # default value
store-full-text = false # default value; N/A for libsvm-corpus
metadata = # metadata schema; see metadata object

See also
https://meta-toolkit.org/overview-tutorial.html

Constructor & Destructor Documentation

§ corpus()

meta::corpus::corpus::corpus ( std::string  encoding)

Constructs a new corpus with the given encoding.

Parameters
encoding    The encoding to interpret the text as

Member Function Documentation

§ has_next()

virtual bool meta::corpus::corpus::has_next ( ) const
pure virtual
Returns
whether there is another document in this corpus

Implemented in meta::corpus::libsvm_corpus, meta::corpus::line_corpus, meta::corpus::file_corpus, and meta::corpus::gz_corpus.

§ next()

virtual document meta::corpus::corpus::next ( )
pure virtual
Returns
the next document from this corpus

Implemented in meta::corpus::line_corpus, meta::corpus::libsvm_corpus, meta::corpus::file_corpus, and meta::corpus::gz_corpus.

§ size()

virtual uint64_t meta::corpus::corpus::size ( ) const
pure virtual
Returns
the number of documents in this corpus

Implemented in meta::corpus::line_corpus, meta::corpus::file_corpus, meta::corpus::libsvm_corpus, and meta::corpus::gz_corpus.
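The three pure virtual members above (has_next(), next(), and size()) suggest a simple forward-iteration pattern. The following is a minimal sketch of that pattern, not MeTA code: toy_document, toy_corpus, vector_corpus, and total_length are hypothetical stand-ins that only mirror the documented signatures so the example compiles without the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in for meta::corpus::document (assumption: only the text matters here).
struct toy_document
{
    std::string text;
};

// Stand-in that mirrors the three pure virtual members documented above.
class toy_corpus
{
  public:
    virtual ~toy_corpus() = default;
    virtual bool has_next() const = 0;
    virtual toy_document next() = 0;
    virtual std::uint64_t size() const = 0;
};

// Hypothetical in-memory implementation, for illustration only.
class vector_corpus : public toy_corpus
{
  public:
    explicit vector_corpus(std::vector<std::string> docs)
        : docs_(std::move(docs))
    {
    }

    bool has_next() const override
    {
        return pos_ < docs_.size();
    }

    toy_document next() override
    {
        assert(has_next()); // callers are expected to check has_next() first
        return toy_document{docs_[pos_++]};
    }

    std::uint64_t size() const override
    {
        return docs_.size();
    }

  private:
    std::vector<std::string> docs_;
    std::size_t pos_ = 0;
};

// Drains the corpus and returns the total number of characters read.
std::uint64_t total_length(toy_corpus& c)
{
    std::uint64_t total = 0;
    while (c.has_next())
        total += c.next().text.size();
    return total;
}
```

In this sketch size() reports the total number of documents regardless of how many have been consumed, while has_next() turns false once the loop has drained the corpus. Whether a real implementation can be rewound is not stated on this page.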

§ schema()

metadata::schema_type meta::corpus::corpus::schema ( ) const
virtual
Returns
the corpus' metadata schema

Reimplemented in meta::corpus::file_corpus, and meta::corpus::libsvm_corpus.

§ encoding()

const std::string & meta::corpus::corpus::encoding ( ) const
Returns
the encoding for the corpus.

§ store_full_text()

bool meta::corpus::corpus::store_full_text ( ) const
Returns
whether this corpus will create a metadata field for full text (called "content")

§ set_store_full_text()

void meta::corpus::corpus::set_store_full_text ( bool  store_full_text)
Parameters
store_full_text    Tells this corpus to store full document text as metadata
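As a rough illustration of what the store_full_text flag implies, the sketch below adds a "content" field to a document's metadata only when the flag is set. Everything here is an assumption made for the example: toy_fields, toy_text_policy, and fields_for are hypothetical names, and the real library may build and type its metadata quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a document's metadata: field name -> value.
using toy_fields = std::map<std::string, std::string>;

// Stand-in that mirrors only the two accessors documented above.
class toy_text_policy
{
  public:
    bool store_full_text() const
    {
        return store_full_text_;
    }

    void set_store_full_text(bool store_full_text)
    {
        store_full_text_ = store_full_text;
    }

  private:
    bool store_full_text_ = false; // matches the documented default
};

// Builds the metadata for one document, adding "content" only on request.
toy_fields fields_for(const toy_text_policy& policy, const std::string& text)
{
    toy_fields fields;
    if (policy.store_full_text())
        fields["content"] = text;

    // The field exists exactly when the flag is set.
    const std::size_t expected = policy.store_full_text() ? 1 : 0;
    assert(fields.count("content") == expected);
    return fields;
}
```

With the default setting the returned map has no "content" entry; after set_store_full_text(true) the same call returns the original text under that key.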

Friends And Related Function Documentation

§ make_corpus

std::unique_ptr<corpus> make_corpus ( const cpptoml::table &  config)
friend

Convenience method for creating a corpus using the factory.

The configuration object passed here should be the "global" configuration (as in, the one that contains the "prefix", "dataset", and "corpus" keys).
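To make the factory idea concrete, here is a minimal sketch of dispatching on the "type" key of a corpus spec. It is not the library's implementation: sketch_corpus, sketch_config, and make_sketch_corpus are hypothetical stand-ins, a flat string map replaces cpptoml::table, and the step that locates the spec file from the global "prefix", "dataset", and "corpus" keys is skipped entirely.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Stand-in for the abstract corpus base class.
struct sketch_corpus
{
    virtual ~sketch_corpus() = default;
    virtual std::string kind() const = 0;
};

struct sketch_line_corpus : sketch_corpus
{
    std::string kind() const override { return "line-corpus"; }
};

struct sketch_libsvm_corpus : sketch_corpus
{
    std::string kind() const override { return "libsvm-corpus"; }
};

// Hypothetical flat config standing in for an already parsed corpus spec.
using sketch_config = std::map<std::string, std::string>;

// Looks up the required "type" key and builds the matching corpus.
std::unique_ptr<sketch_corpus> make_sketch_corpus(const sketch_config& spec)
{
    using creator = std::function<std::unique_ptr<sketch_corpus>()>;
    static const std::map<std::string, creator> registry = {
        {"line-corpus",
         [] { return std::unique_ptr<sketch_corpus>(new sketch_line_corpus); }},
        {"libsvm-corpus",
         [] { return std::unique_ptr<sketch_corpus>(new sketch_libsvm_corpus); }},
    };

    auto type = spec.find("type");
    if (type == spec.end())
        throw std::runtime_error{"corpus spec is missing the required 'type' key"};

    auto it = registry.find(type->second);
    if (it == registry.end())
        throw std::runtime_error{"unknown corpus type: " + type->second};

    auto made = it->second();
    assert(made != nullptr); // every registered creator must produce an object
    return made;
}
```

The caller only ever sees the base-class pointer, which is the point of a factory: adding a new corpus format means registering one more creator, and code that iterates over documents does not change.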

