ModErn Text Analysis
META Enumerates Textual Applications
meta::corpus::line_corpus Class Reference

Fills document objects with content line-by-line from an input file.

#include <line_corpus.h>

Inheritance diagram for meta::corpus::line_corpus:
meta::corpus::corpus

Public Member Functions

 line_corpus (const std::string &file, std::string encoding, uint64_t num_lines=0)
 
bool has_next () const override
 
document next () override
 
uint64_t size () const override
 
- Public Member Functions inherited from meta::corpus::corpus
 corpus (std::string encoding)
 Constructs a new corpus with the given encoding.
 
virtual metadata::schema_type schema () const
 
virtual ~corpus ()=default
 Destructor.
 
const std::string & encoding () const
 
bool store_full_text () const
 
void set_store_full_text (bool store_full_text)
 

Static Public Attributes

static const util::string_view id = "line-corpus"
 The identifier for this corpus.
 

Private Attributes

doc_id cur_id_
 The current document we are on.
 
uint64_t num_lines_
 The number of lines in the file.
 
std::ifstream infile_
 Parser to read the corpus file.
 
std::unique_ptr< std::ifstream > class_infile_
 Parser to read the class labels.
 

Additional Inherited Members

- Protected Member Functions inherited from meta::corpus::corpus
std::vector< metadata::field > next_metadata ()
 Helper function to be used by deriving classes in implementing next() to set the metadata for the current document.
 

Detailed Description

Fills document objects with content line-by-line from an input file.

The tokenizer in use is responsible for correctly parsing the document content into labels and features.
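
The line-per-document behaviour described above can be illustrated with a minimal standalone sketch. It does not use MeTA and is not the library's implementation: sketch_document and sketch_line_reader are hypothetical names, the reader takes a std::istream instead of a file path, and the has_next() check is an assumption based only on the cur_id_ and num_lines_ attributes listed on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream> // std::istringstream, handy for trying the sketch without a file
#include <string>

// Hypothetical stand-in for a document: an id plus the raw line content.
struct sketch_document
{
    std::uint64_t id;
    std::string content;
};

// Hypothetical reader that hands out one document per input line.
class sketch_line_reader
{
  public:
    // `num_lines` must equal the number of lines available in `in`.
    sketch_line_reader(std::istream& in, std::uint64_t num_lines)
        : in_(in), cur_id_{0}, num_lines_{num_lines}
    {
    }

    // Assumption: more documents remain while the id has not reached the count.
    bool has_next() const
    {
        return cur_id_ != num_lines_;
    }

    // Reads the next line and wraps it in a document with the next id.
    sketch_document next()
    {
        assert(has_next());
        std::string line;
        std::getline(in_, line);
        return {cur_id_++, line};
    }

    std::uint64_t size() const
    {
        return num_lines_;
    }

  private:
    std::istream& in_;
    std::uint64_t cur_id_;
    std::uint64_t num_lines_;
};
```

The content of each document is left as the raw line; splitting it into labels and features is deliberately not done here, since the page assigns that job to the tokenizer.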

Constructor & Destructor Documentation

§ line_corpus()

meta::corpus::line_corpus::line_corpus ( const std::string &  file,
std::string  encoding,
uint64_t  num_lines = 0 
)
Parameters
file       The path to the corpus file, where each line represents a document
encoding   The encoding for the file
num_lines  The number of documents (i.e., lines) in the corpus file, if known beforehand. If unknown, omit this parameter (it defaults to 0) and the value will be calculated in the constructor.
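
The default for the line count ("0 means unknown, so count it") can be sketched as follows. This is an illustration under stated assumptions, not the code the constructor actually runs: count_lines and resolve_num_lines are hypothetical helpers, and they operate on a std::istream rather than on a file path.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream> // std::istringstream, handy for trying the sketch without a file
#include <string>

// Hypothetical helper: counts the lines in `in`, then rewinds the stream
// so the caller can read the documents from the beginning.
inline std::uint64_t count_lines(std::istream& in)
{
    assert(!in.fail()); // the stream must be open and readable
    std::uint64_t count = 0;
    std::string line;
    while (std::getline(in, line))
        ++count;
    in.clear();                 // reset the eof/fail bits set by the last getline
    in.seekg(0, std::ios::beg); // rewind to the first line
    return count;
}

// Hypothetical helper: a caller-supplied count is trusted as-is; 0 is treated
// as "unknown" and triggers a full pass over the input.
inline std::uint64_t resolve_num_lines(std::istream& in, std::uint64_t num_lines)
{
    return num_lines != 0 ? num_lines : count_lines(in);
}
```

Passing a known count avoids the extra pass over the input, which is the practical reason to supply it for large corpus files.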

Member Function Documentation

§ has_next()

bool meta::corpus::line_corpus::has_next ( ) const
override virtual
Returns
whether there is another document in this corpus

Implements meta::corpus::corpus.

§ next()

document meta::corpus::line_corpus::next ( )
override virtual
Returns
the next document from this corpus

Implements meta::corpus::corpus.

§ size()

uint64_t meta::corpus::line_corpus::size ( ) const
override virtual
Returns
the number of documents in this corpus

Implements meta::corpus::corpus.
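
The three overrides documented above form a simple iteration protocol: check has_next(), call next(), and use size() for the total. The sketch below shows that protocol through a base-class reference. It is standalone and hypothetical: sketch_corpus, sketch_memory_corpus, sketch_document, and drain are illustrative names, not MeTA types, and the in-memory corpus stands in for a file-backed one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a document: an id plus the raw content.
struct sketch_document
{
    std::uint64_t id;
    std::string content;
};

// Hypothetical base interface with the three virtuals documented above.
class sketch_corpus
{
  public:
    virtual ~sketch_corpus() = default;
    virtual bool has_next() const = 0;
    virtual sketch_document next() = 0;
    virtual std::uint64_t size() const = 0;
};

// Hypothetical in-memory corpus: one string per document.
class sketch_memory_corpus : public sketch_corpus
{
  public:
    explicit sketch_memory_corpus(std::vector<std::string> lines)
        : lines_(std::move(lines)), cur_id_{0}
    {
    }

    bool has_next() const override
    {
        return cur_id_ != lines_.size();
    }

    sketch_document next() override
    {
        assert(has_next());
        sketch_document doc{cur_id_, lines_[static_cast<std::size_t>(cur_id_)]};
        ++cur_id_;
        return doc;
    }

    std::uint64_t size() const override
    {
        return lines_.size();
    }

  private:
    std::vector<std::string> lines_;
    std::uint64_t cur_id_;
};

// Drains any corpus through the base interface only.
inline std::vector<sketch_document> drain(sketch_corpus& c)
{
    std::vector<sketch_document> docs;
    docs.reserve(static_cast<std::size_t>(c.size()));
    while (c.has_next())
        docs.push_back(c.next());
    return docs;
}
```

Because drain() depends only on the base interface, the same loop would work with any class that implements the three functions, which is the point of declaring them as overrides of the base class.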


The documentation for this class was generated from the following files: