forayer.transformation.word_embedding.AttributeVectorizer

class forayer.transformation.word_embedding.AttributeVectorizer(tokenizer: Optional[Callable] = None, embedding_type: str = 'fasttext', vectors_path: Optional[str] = None, default_download_dir: Optional[str] = None)

Vectorizer class to get attribute embeddings of entities with pre-trained embeddings.

tokenizer: Callable: callable that tokenizes a string
embedding_type: str: type of pre-trained embeddings
vectors_path: str: path to pre-trained embeddings
wv: gensim.models.KeyedVectors: word embeddings

__init__(tokenizer: Optional[Callable] = None, embedding_type: str = 'fasttext', vectors_path: Optional[str] = None, default_download_dir: Optional[str] = None)

Initialize an AttributeVectorizer object and load the pre-trained embeddings.

tokenizer: Callable: callable that tokenizes a string
embedding_type: str: type of pre-trained embeddings
vectors_path: str: path to pre-trained embeddings
default_download_dir: str: directory to which the embeddings are downloaded if they are not already present; defaults to “./data/word_embeddings/”

TypeError: if tokenizer is not callable
ValueError: if embedding_type is unknown or vectors_path does not exist
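
The documented error conditions can be illustrated with a small standalone sketch. It does not call forayer. The function name `check_vectorizer_args` and the set `KNOWN_EMBEDDING_TYPES` are hypothetical, and the order of the checks is an assumption; the source only confirms 'fasttext' as the default embedding type.

```python
import os
from typing import Callable, Optional

# Hypothetical set of supported types; only 'fasttext' is confirmed by the docs.
KNOWN_EMBEDDING_TYPES = {"fasttext"}


def check_vectorizer_args(
    tokenizer: Optional[Callable] = None,
    embedding_type: str = "fasttext",
    vectors_path: Optional[str] = None,
) -> None:
    """Raise the documented errors for invalid arguments (illustrative only)."""
    # None is allowed because the parameter is Optional.
    if tokenizer is not None and not callable(tokenizer):
        raise TypeError(f"tokenizer must be callable, got {type(tokenizer).__name__}")
    if embedding_type not in KNOWN_EMBEDDING_TYPES:
        raise ValueError(f"unknown embedding_type: {embedding_type!r}")
    # Assumption: the path is only checked when one is given.
    if vectors_path is not None and not os.path.exists(vectors_path):
        raise ValueError(f"vectors_path does not exist: {vectors_path!r}")


# A callable tokenizer with the default embedding type passes silently.
check_vectorizer_args(tokenizer=str.split)
```

Passing a non-callable such as a string for `tokenizer` would raise TypeError in this sketch, and an unrecognized `embedding_type` would raise ValueError.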

Methods

`__init__`([tokenizer, embedding_type, ...])	Initialize an AttributeVectorizer object and load the pre-trained embeddings.
`reset_token_count`()	Reset .seen_tokens and .ignored_tokens.
`vectorize`(sentence)	Tokenize and vectorize a sentence with the given word embeddings.
`vectorize_entity_attributes`(attributes)	Tokenize and vectorize values of entity attributes.

reset_token_count(): Reset .seen_tokens and .ignored_tokens.

vectorize(sentence: str) → List[numpy.ndarray]

Tokenize and vectorize a sentence with the given word embeddings.

sentence: str: sentence to vectorize

vectorized: List[np.ndarray]: list of token embeddings

Tokens that are not contained in the loaded embeddings are ignored. Ignored tokens will be set to np.NaN.
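
The lookup behaviour described above might look roughly like the following sketch. It is not forayer's implementation: `TOY_WV` is a hypothetical two-word embedding table standing in for the loaded `KeyedVectors`, plain Python lists stand in for `np.ndarray`, and `math.nan` stands in for `np.NaN`. Whether the library uses a scalar NaN or a NaN-filled vector for an ignored token is not stated in the source; a scalar is assumed here.

```python
import math
from typing import Callable, Dict, List, Union

Vector = List[float]  # stand-in for numpy.ndarray

# Hypothetical toy embedding table standing in for the loaded KeyedVectors.
TOY_WV: Dict[str, Vector] = {
    "berlin": [0.1, 0.2],
    "city": [0.3, 0.4],
}


def vectorize_sketch(
    sentence: str,
    wv: Dict[str, Vector] = TOY_WV,
    tokenizer: Callable[[str], List[str]] = str.split,
) -> List[Union[Vector, float]]:
    """Tokenize a sentence and look up each token; unknown tokens become NaN."""
    vectorized: List[Union[Vector, float]] = []
    for token in tokenizer(sentence):
        if token in wv:
            vectorized.append(wv[token])
        else:
            # Out-of-vocabulary token: keep its position and mark it with NaN
            # (assumption: a scalar NaN rather than a NaN-filled vector).
            vectorized.append(math.nan)
    return vectorized


result = vectorize_sketch("berlin is a city")
print(result)  # [[0.1, 0.2], nan, nan, [0.3, 0.4]]
```

Keeping a placeholder per token preserves the alignment between tokens and embeddings, so downstream code can decide whether to drop or impute the NaN entries.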

vectorize_entity_attributes(attributes: Dict[Any, Any]) → Dict[Any, List[numpy.ndarray]]

Tokenize and vectorize values of entity attributes.

attributes: Dict[Any, Any]: dictionary of entity attributes with attribute names as keys

embedded_entity_attributes: Dict[Any, List[np.ndarray]]: dictionary of entity attributes in which each attribute value is replaced with a list of token embeddings
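
As a rough illustration of the input and output shapes, the sketch below applies a toy per-sentence vectorizer to every attribute value. It is a standalone approximation, not the library's code: `TOY_WV`, `vectorize_sketch`, and `vectorize_entity_attributes_sketch` are hypothetical names, lists stand in for `np.ndarray`, and converting non-string values with `str()` is an assumption the source does not confirm.

```python
import math
from typing import Any, Dict, List

# Hypothetical toy embedding table; lists stand in for numpy arrays.
TOY_WV = {"berlin": [0.1, 0.2], "city": [0.3, 0.4]}


def vectorize_sketch(sentence: str) -> List[Any]:
    """Whitespace-tokenize and look up each token; unknown tokens become NaN."""
    return [TOY_WV.get(token, math.nan) for token in sentence.split()]


def vectorize_entity_attributes_sketch(attributes: Dict[Any, Any]) -> Dict[Any, List[Any]]:
    """Replace every attribute value with its list of token embeddings."""
    # Assumption: non-string values are converted with str() before tokenizing.
    return {name: vectorize_sketch(str(value)) for name, value in attributes.items()}


entity = {"name": "berlin", "type": "city", "population": 3645000}
embedded = vectorize_entity_attributes_sketch(entity)
print(embedded["name"])  # [[0.1, 0.2]]
```

The attribute names are kept as keys, so the result has the same keys as the input dictionary.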