forayer.transformation.word_embedding.AttributeVectorizer
- class forayer.transformation.word_embedding.AttributeVectorizer(tokenizer: Optional[Callable] = None, embedding_type: str = 'fasttext', vectors_path: Optional[str] = None, default_download_dir: Optional[str] = None)
Vectorizer class to get attribute embeddings of entities with pre-trained embeddings.
- tokenizer: Callable
callable that tokenizes a string
- embedding_type: str
type of pretrained embeddings
- vectors_path: str
path to pre-trained embeddings
- wv: gensim.models.KeyedVectors
word embeddings
- __init__(tokenizer: Optional[Callable] = None, embedding_type: str = 'fasttext', vectors_path: Optional[str] = None, default_download_dir: Optional[str] = None)
Initialize an AttributeVectorizer object and load the pre-trained embeddings.
- tokenizer: Callable
callable that tokenizes a string
- embedding_type: str
type of pretrained embeddings
- vectors_path: str
path to pre-trained embeddings
- default_download_dir: str
directory to which embeddings are downloaded if they are not already present; the default is “./data/word_embeddings/”
- TypeError
if tokenizer is not callable
- ValueError
if embedding_type is unknown or vectors_path does not exist
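The argument checks listed above can be sketched as follows. This is a minimal illustration and not forayer's implementation: the helper name `check_vectorizer_args` and the set `KNOWN_EMBEDDING_TYPES` are hypothetical, and only 'fasttext' is confirmed as an embedding type by this page.

```python
import os
from typing import Callable, Optional

# Hypothetical set of supported types; this page only confirms 'fasttext'.
KNOWN_EMBEDDING_TYPES = {"fasttext"}


def check_vectorizer_args(
    tokenizer: Optional[Callable] = None,
    embedding_type: str = "fasttext",
    vectors_path: Optional[str] = None,
) -> None:
    """Raise the documented exceptions for invalid constructor arguments."""
    # A tokenizer is optional, but if one is given it must be callable.
    if tokenizer is not None and not callable(tokenizer):
        raise TypeError(
            f"tokenizer must be callable, got {type(tokenizer).__name__}"
        )
    # Unknown embedding types are rejected.
    if embedding_type not in KNOWN_EMBEDDING_TYPES:
        raise ValueError(f"unknown embedding_type: {embedding_type!r}")
    # A path is optional, but if one is given it must exist.
    if vectors_path is not None and not os.path.exists(vectors_path):
        raise ValueError(f"vectors_path does not exist: {vectors_path!r}")


# Valid arguments pass silently.
check_vectorizer_args(tokenizer=str.split, embedding_type="fasttext")
```

Validating before loading means a bad argument fails fast, before any large embedding file is read or downloaded.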
Methods
__init__([tokenizer, embedding_type, ...]) Initialize an AttributeVectorizer object and load the pre-trained embeddings.
Reset .seen_tokens and .ignored_tokens.
vectorize(sentence) Tokenize and vectorize a sentence with the given word embeddings.
vectorize_entity_attributes(attributes) Tokenize and vectorize values of entity attributes.
- vectorize(sentence: str) → List[numpy.ndarray]
Tokenize and vectorize a sentence with the given word embeddings.
- sentence: str
sentence to vectorize
- vectorized: List[np.ndarray]
List of token embeddings.
Tokens that are not contained in the embeddings are ignored. Ignored tokens are set to np.NaN.
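The lookup behaviour described for vectorize can be sketched with a toy embedding table. This is an illustration under stated assumptions, not forayer's code: `vectorize_sketch`, `toy_wv` and the `dim` parameter are hypothetical, a plain dict stands in for the gensim KeyedVectors held in wv, and representing an ignored token as a NaN-filled vector of the embedding dimension is an assumption. The sketch uses np.nan because the np.NaN alias was removed in NumPy 2.0.

```python
from typing import Callable, Dict, List

import numpy as np


def vectorize_sketch(
    sentence: str,
    wv: Dict[str, np.ndarray],
    tokenizer: Callable[[str], List[str]] = str.split,
    dim: int = 3,
) -> List[np.ndarray]:
    """Return one vector per token; unknown tokens become NaN vectors."""
    vectors = []
    for token in tokenizer(sentence):
        if token in wv:
            vectors.append(wv[token])
        else:
            # Assumption: an out-of-vocabulary token is a NaN-filled vector.
            vectors.append(np.full(dim, np.nan))
    return vectors


# Toy stand-in for pre-trained embeddings (hypothetical values).
toy_wv = {
    "red": np.array([1.0, 0.0, 0.0]),
    "car": np.array([0.0, 1.0, 0.0]),
}

result = vectorize_sketch("red flying car", toy_wv)
# "flying" is not in toy_wv, so its entry contains only NaN values.
```

Keeping a placeholder for each unknown token preserves the one-vector-per-token alignment, and NaN-aware reductions such as np.nanmean can skip the placeholders later.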
- vectorize_entity_attributes(attributes: Dict[Any, Any]) → Dict[Any, List[numpy.ndarray]]
Tokenize and vectorize values of entity attributes.
- attributes: Dict[Any, Any]
dictionary of entity attributes with attribute names as keys
- embedded_entity_attributes: Dict[Any, List[np.ndarray]]
entity attribute dictionary in which each attribute value is replaced with a list of token embeddings
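The input and output shapes of vectorize_entity_attributes can be illustrated with a small sketch. The names `TOY_WV`, `DIM`, `embed_tokens` and `vectorize_entity_attributes_sketch` are hypothetical, whitespace splitting stands in for the configured tokenizer, and converting non-string values with str() is an assumption; this page does not say how forayer handles such values.

```python
from typing import Any, Dict, List

import numpy as np

DIM = 2  # dimension of the toy vectors below

# Toy stand-in for pre-trained embeddings (hypothetical values).
TOY_WV: Dict[str, np.ndarray] = {
    "red": np.array([1.0, 0.0]),
    "car": np.array([0.0, 1.0]),
}


def embed_tokens(text: str) -> List[np.ndarray]:
    """Split on whitespace and look up each token; unknown tokens become NaN."""
    return [TOY_WV.get(tok, np.full(DIM, np.nan)) for tok in text.split()]


def vectorize_entity_attributes_sketch(
    attributes: Dict[Any, Any],
) -> Dict[Any, List[np.ndarray]]:
    """Keep the attribute names; replace each value with its token embeddings."""
    # Assumption: non-string values are converted with str() before tokenizing.
    return {name: embed_tokens(str(value)) for name, value in attributes.items()}


entity = {"name": "red car", "year": 1999}
embedded = vectorize_entity_attributes_sketch(entity)
# embedded["name"] holds two vectors; embedded["year"] holds one NaN vector.
```

The result keeps the keys of the input dictionary, so embeddings can be traced back to the attribute they came from.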
