ModErn Text Analysis
META Enumerates Textual Applications
meta::analyzers Namespace Reference

Contains various ways to segment text and deal with preprocessed files (POS tags, parse trees, etc.).

Namespaces

 filters
 Contains filters that mutate existing token streams in a filter chain.
 
 tokenizers
 Contains tokenizers that start off a filter chain.
 

Classes

class  analyzer
 A class that provides a framework to produce token counts from documents.
 
class  analyzer_exception
 Basic exception for analyzer interactions.
 
class  analyzer_factory
 Factory that is responsible for creating analyzers from configuration files.
 
class  branch_featurizer
 Tokenizes parse trees by extracting branching factor features.
 
class  depth_featurizer
 Tokenizes parse trees by extracting depth features.
 
class  diff_analyzer
 Analyzes documents using lm::diff edits; see lm::diff for config file information and further explanation.
 
class  embedding_analyzer
 Analyzes documents by averaging word embeddings for each token.
 
class  featurizer
 Used by analyzers to increment feature values in feature_maps generically.
 
class  featurizer_exception
 Basic exception for featurizer interactions.
 
class  featurizer_factory
 Factory that is responsible for creating tree featurizers from configuration files.
 
class  filter_factory
 Factory that is responsible for creating filters during analyzer construction.
 
class  multi_analyzer
 The multi_analyzer class contains more than one analyzer.
 
class  ngram_analyzer
 Analyzes documents based on an ngram word model, where the value for n is supplied by the user.
 
class  ngram_pos_analyzer
 Analyzes documents based on part-of-speech tags instead of words.
 
class  ngram_word_analyzer
 Analyzes documents using their tokenized words.
 
class  semi_skeleton_featurizer
 Tokenizes parse trees by keeping track of only a single node label and the underlying tree structure.
 
class  skeleton_featurizer
 Tokenizes parse trees by only tokenizing the tree structure itself.
 
class  subtree_featurizer
 Tokenizes parse trees by counting occurrences of subtrees in a document's parse tree.
 
class  tag_featurizer
 Tokenizes parse trees by looking at labels of leaf and interior nodes.
 
class  token_stream
 Base class that represents a stream of tokens that have been extracted from a document.
 
class  token_stream_exception
 Basic exception class for token stream interactions.
 
class  tree_analyzer
 Base class for analyzers that tokenize using parse tree features.
 
class  tree_featurizer
 Base class for featurizers that convert trees into features in a document.
 

Typedefs

template<class T >
using feature_map = hashing::probe_map< std::string, T >
 

Functions

std::unique_ptr< analyzer > load (const cpptoml::table &config)
 
std::unique_ptr< token_stream > default_filter_chain (const cpptoml::table &config)
 
std::unique_ptr< token_stream > default_unigram_chain (const cpptoml::table &config)
 
std::unique_ptr< token_stream > load_filters (const cpptoml::table &global, const cpptoml::table &config)
 
std::unique_ptr< token_stream > load_filter (std::unique_ptr< token_stream > src, const cpptoml::table &config)
 
std::string get_content (const corpus::document &doc)
 
template<class Analyzer >
std::unique_ptr< analyzer > make_analyzer (const cpptoml::table &, const cpptoml::table &)
 Factory method for creating an analyzer.
 
template<class Analyzer >
void register_analyzer ()
 Registration method for analyzers.
 
template<class Tokenizer >
void register_tokenizer ()
 Registration method for tokenizers.
 
template<class Filter >
void register_filter ()
 Registration method for filters.
 
template<>
std::unique_ptr< analyzer > make_analyzer< ngram_word_analyzer > (const cpptoml::table &, const cpptoml::table &)
 Specialization of the factory method for creating ngram_word_analyzers.
 
template<>
std::unique_ptr< analyzer > make_analyzer< embedding_analyzer > (const cpptoml::table &, const cpptoml::table &)
 Specialization of the factory method for creating embedding_analyzers.
 
template<>
std::unique_ptr< analyzer > make_analyzer< diff_analyzer > (const cpptoml::table &, const cpptoml::table &)
 Specialization of the factory method for creating diff_analyzers.
 
template<class Featurizer >
std::unique_ptr< tree_featurizer > make_featurizer ()
 Factory method for creating a featurizer.
 
template<class Featurizer >
void register_featurizer ()
 Registration method for featurizers.
 
template<>
std::unique_ptr< analyzer > make_analyzer< ngram_pos_analyzer > (const cpptoml::table &, const cpptoml::table &)
 Specialization of the factory method for creating ngram_pos_analyzers.
 
template<>
std::unique_ptr< analyzer > make_analyzer< tree_analyzer > (const cpptoml::table &global, const cpptoml::table &config)
 

Detailed Description

Contains various ways to segment text and deal with preprocessed files (POS tags, parse trees, etc.).

Function Documentation

§ load()

std::unique_ptr< analyzer > meta::analyzers::load ( const cpptoml::table &  config)
Parameters
config    The config group from which to create the analyzer
Returns
an analyzer as specified by a config object

§ default_filter_chain()

std::unique_ptr< token_stream > meta::analyzers::default_filter_chain ( const cpptoml::table &  config)
Parameters
config    The config group from which to create the filter chain
Returns
the default filter chain for this version of MeTA, based on a config object

§ default_unigram_chain()

std::unique_ptr< token_stream > meta::analyzers::default_unigram_chain ( const cpptoml::table &  config)
Parameters
config    The config group from which to create the filter chain
Returns
the default filter chain for unigram words for this version of MeTA, based on a config object

§ load_filters()

std::unique_ptr< token_stream > meta::analyzers::load_filters ( const cpptoml::table &  global,
const cpptoml::table &  config 
)
Parameters
global    The original config object with all parameters
config    The config group from which to create the filters
Returns
a filter chain as specified by a config object

§ load_filter()

std::unique_ptr< token_stream > meta::analyzers::load_filter ( std::unique_ptr< token_stream >  src,
const cpptoml::table &  config 
)
Parameters
src       The token stream that will feed into this filter
config    The config group from which to create the filter
Returns
a single filter specified by a config object
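
The src parameter reflects how a filter chain is composed: each filter takes ownership of the stream that feeds it and transforms tokens as they pass through. The following is a minimal, self-contained sketch of that wrapping pattern. The names toy_stream, whitespace_source, lowercase_filter, and run_chain are hypothetical stand-ins invented for illustration; they are not MeTA's token_stream interface or any of its bundled tokenizers and filters.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a token stream; not MeTA's actual interface.
struct toy_stream
{
    virtual ~toy_stream() = default;
    virtual bool has_next() const = 0;
    virtual std::string next() = 0;
};

// Starts a chain: splits its input text on whitespace.
class whitespace_source : public toy_stream
{
  public:
    explicit whitespace_source(const std::string& text)
    {
        std::istringstream in{text};
        for (std::string tok; in >> tok;)
            tokens_.push_back(tok);
    }

    bool has_next() const override { return pos_ < tokens_.size(); }
    std::string next() override { return tokens_[pos_++]; }

  private:
    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};

// A filter: owns the stream feeding it and lowercases each token.
class lowercase_filter : public toy_stream
{
  public:
    explicit lowercase_filter(std::unique_ptr<toy_stream> src)
        : src_{std::move(src)}
    {
    }

    bool has_next() const override { return src_->has_next(); }

    std::string next() override
    {
        std::string tok = src_->next();
        for (char& c : tok)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return tok;
    }

  private:
    std::unique_ptr<toy_stream> src_;
};

// Builds a two-stage chain (source, then one filter) and drains it.
std::vector<std::string> run_chain(const std::string& text)
{
    std::unique_ptr<toy_stream> chain
        = std::make_unique<whitespace_source>(text);
    chain = std::make_unique<lowercase_filter>(std::move(chain));

    std::vector<std::string> out;
    while (chain->has_next())
        out.push_back(chain->next());
    return out;
}
```

Passing the upstream stream by std::unique_ptr makes ownership explicit: once a filter is constructed, the caller holds only the outermost stream, and destroying it tears down the whole chain.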

§ get_content()

std::string meta::analyzers::get_content ( const corpus::document &  doc)
Parameters
doc       The document to get content for
Returns
the contents of the document, as a string

§ make_analyzer()

template<class Analyzer >
std::unique_ptr<analyzer> meta::analyzers::make_analyzer ( const cpptoml::table &  ,
const cpptoml::table &   
)

Factory method for creating an analyzer.

You should specialize this method if you need to customize creation behavior for your analyzer class.
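
Specializing a function template for a single type is standard C++. The sketch below shows the general shape using hypothetical stand-in types: toy_analyzer, simple_analyzer, windowed_analyzer, make_toy_analyzer, and a toy_config map in place of cpptoml::table. The body of the primary template (default construction) and the "ngram" key are assumptions made for illustration, not a description of MeTA's implementation.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for cpptoml::table: a flat map of integer settings.
using toy_config = std::map<std::string, int>;

// Hypothetical stand-in for the analyzer base class.
struct toy_analyzer
{
    virtual ~toy_analyzer() = default;
    virtual int window() const = 0;
};

// An analyzer that needs nothing from the configuration.
struct simple_analyzer : toy_analyzer
{
    int window() const override { return 1; }
};

// An analyzer that needs a value from its own config group.
struct windowed_analyzer : toy_analyzer
{
    explicit windowed_analyzer(int n) : n_{n} {}
    int window() const override { return n_; }
    int n_;
};

// Primary template: assumed here to default-construct the analyzer
// and ignore both configuration arguments.
template <class Analyzer>
std::unique_ptr<toy_analyzer> make_toy_analyzer(const toy_config&,
                                                const toy_config&)
{
    return std::make_unique<Analyzer>();
}

// Full specialization: reads "ngram" from the analyzer's config group
// because windowed_analyzer has no default constructor.
template <>
std::unique_ptr<toy_analyzer>
make_toy_analyzer<windowed_analyzer>(const toy_config& /* global */,
                                     const toy_config& config)
{
    return std::make_unique<windowed_analyzer>(config.at("ngram"));
}
```

Callers use the same spelling for both cases, for example make_toy_analyzer<windowed_analyzer>(global, group); the compiler selects the specialization when one exists for the requested type.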

§ register_analyzer()

template<class Analyzer >
void meta::analyzers::register_analyzer ( )

Registration method for analyzers.

Clients should use this method to register any new analyzers they write.

§ register_tokenizer()

template<class Tokenizer >
void meta::analyzers::register_tokenizer ( )

Registration method for tokenizers.

Clients should use this method to register any new tokenizers they write.

§ register_filter()

template<class Filter >
void meta::analyzers::register_filter ( )

Registration method for filters.

Clients should use this method to register any new filters they write.
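
One common way to implement this kind of registration is a singleton registry that maps a string identifier to a creation function, with a small function template that registers a type under its own static id. The sketch below illustrates that pattern only. The names toy_filter, toy_registry, register_toy_filter, and exclaim_filter are hypothetical, and the use of a static id member is an assumption of this example rather than a statement about MeTA's factories.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a filter base class.
struct toy_filter
{
    virtual ~toy_filter() = default;
    virtual std::string apply(const std::string& token) const = 0;
};

// Hypothetical singleton registry mapping an identifier to a factory function.
class toy_registry
{
  public:
    using factory_fn = std::function<std::unique_ptr<toy_filter>()>;

    static toy_registry& get()
    {
        static toy_registry instance;
        return instance;
    }

    // Adds a factory under the given id; rejects duplicate ids.
    void add(const std::string& id, factory_fn fn)
    {
        if (!factories_.emplace(id, std::move(fn)).second)
            throw std::runtime_error{"duplicate id: " + id};
    }

    // Creates a filter by id; throws if the id was never registered.
    std::unique_ptr<toy_filter> create(const std::string& id) const
    {
        auto it = factories_.find(id);
        if (it == factories_.end())
            throw std::runtime_error{"unknown id: " + id};
        return it->second();
    }

  private:
    std::map<std::string, factory_fn> factories_;
};

// Registers Filter under its static id (the id member is an assumption
// of this sketch).
template <class Filter>
void register_toy_filter()
{
    toy_registry::get().add(Filter::id, []() -> std::unique_ptr<toy_filter> {
        return std::make_unique<Filter>();
    });
}

// A user-written filter that appends an exclamation mark to each token.
struct exclaim_filter : toy_filter
{
    static constexpr const char* id = "exclaim";

    std::string apply(const std::string& token) const override
    {
        return token + "!";
    }
};
```

With this design, client code calls register_toy_filter<exclaim_filter>() once at startup, after which the type can be created from a configuration string via toy_registry::get().create("exclaim").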

§ make_featurizer()

template<class Featurizer >
std::unique_ptr<tree_featurizer> meta::analyzers::make_featurizer ( )

Factory method for creating a featurizer.

You should specialize this method if you need to customize creation behavior for your featurizer class.

§ register_featurizer()

template<class Featurizer >
void meta::analyzers::register_featurizer ( )

Registration method for featurizers.

Clients should use this method to register any new featurizers they write.