forayer.transformation.word_embedding.AttributeVectorizer

class forayer.transformation.word_embedding.AttributeVectorizer(tokenizer: Optional[Callable] = None, embedding_type: str = 'fasttext', vectors_path: Optional[str] = None, default_download_dir: Optional[str] = None)

Vectorizer class that embeds the attribute values of entities using pre-trained word embeddings.

tokenizer: Callable

callable that tokenizes a string

embedding_type: str

type of pretrained embeddings

vectors_path: str

path to pre-trained embeddings

wv: gensim.models.KeyedVectors

word embeddings

__init__(tokenizer: Optional[Callable] = None, embedding_type: str = 'fasttext', vectors_path: Optional[str] = None, default_download_dir: Optional[str] = None)

Initialize an AttributeVectorizer object and load the pre-trained embeddings.

tokenizer: Callable

callable that tokenizes a string

embedding_type: str

type of pretrained embeddings

vectors_path: str

path to pre-trained embeddings

default_download_dir: str

directory to which embeddings are downloaded if they are not already present; the default is "./data/word_embeddings/"

TypeError

if tokenizer is not callable

ValueError

if embedding_type is unknown or vectors_path does not exist
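
The error conditions listed above can be sketched as a standalone check. This is only an illustration of the documented behaviour, not forayer's actual implementation: the helper name `check_vectorizer_args` and the set `KNOWN_EMBEDDING_TYPES` are hypothetical, and only 'fasttext' is taken from the documented signature.

```python
import os
from typing import Callable, Optional

# Hypothetical set of supported names; only 'fasttext' appears in the
# documented signature, other valid values are not listed here.
KNOWN_EMBEDDING_TYPES = {"fasttext"}


def check_vectorizer_args(
    tokenizer: Optional[Callable],
    embedding_type: str,
    vectors_path: Optional[str],
) -> None:
    """Mirror the documented error conditions (illustrative sketch only)."""
    # TypeError: a tokenizer was given but cannot be called
    if tokenizer is not None and not callable(tokenizer):
        raise TypeError(f"tokenizer must be callable, got {type(tokenizer).__name__}")
    # ValueError: the embedding type is not recognised
    if embedding_type not in KNOWN_EMBEDDING_TYPES:
        raise ValueError(f"unknown embedding_type: {embedding_type!r}")
    # ValueError: an explicit path was given but does not exist
    if vectors_path is not None and not os.path.exists(vectors_path):
        raise ValueError(f"vectors_path does not exist: {vectors_path!r}")


# A callable tokenizer with a known embedding type passes silently.
check_vectorizer_args(str.split, "fasttext", None)
```

Treating `None` as "use the default" for both `tokenizer` and `vectors_path` is an assumption based on the `Optional[...] = None` defaults in the signature.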

Methods

__init__([tokenizer, embedding_type, ...])

Initialize an AttributeVectorizer object and load the pre-trained embeddings.

reset_token_count()

Reset .seen_tokens and .ignored_tokens.

vectorize(sentence)

Tokenize and vectorize a sentence with the given word embeddings.

vectorize_entity_attributes(attributes)

Tokenize and vectorize values of entity attributes.

reset_token_count()[source]

Reset .seen_tokens and .ignored_tokens.

vectorize(sentence: str) → List[numpy.ndarray]

Tokenize and vectorize a sentence with the given word embeddings.

sentence: str

sentence to vectorize

vectorized: List[np.ndarray]

List of token embeddings.

Tokens that are not contained in the loaded embeddings are ignored and set to np.NaN.
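
The tokenize-then-look-up behaviour described above can be illustrated without loading real embeddings. The sketch below is not the library's code: `vectorize_sketch`, the toy dictionary `toy_wv` standing in for `wv`, and the explicit `dim` argument are all assumptions made for the example, as is the choice to represent an ignored token by an all-NaN vector.

```python
from typing import Callable, Dict, List

import numpy as np


def vectorize_sketch(
    sentence: str,
    wv: Dict[str, np.ndarray],
    dim: int,
    tokenizer: Callable[[str], List[str]] = str.split,
) -> List[np.ndarray]:
    """Return one vector per token; unknown tokens become NaN vectors."""
    vectorized = []
    for token in tokenizer(sentence):
        if token in wv:
            vectorized.append(wv[token])
        else:
            # Assumption: an out-of-vocabulary token is kept in position as an
            # all-NaN vector. np.nan is used because the np.NaN alias was
            # removed in NumPy 2.0.
            vectorized.append(np.full(dim, np.nan))
    return vectorized


# Toy two-dimensional "embeddings" (hypothetical values).
toy_wv = {"red": np.array([1.0, 0.0]), "car": np.array([0.0, 1.0])}
vectors = vectorize_sketch("red flying car", toy_wv, dim=2)
```

Keeping a NaN placeholder per unknown token preserves the token positions, so a later aggregation step can skip them with functions such as `np.nanmean`.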

vectorize_entity_attributes(attributes: Dict[Any, Any]) → Dict[Any, List[numpy.ndarray]]

Tokenize and vectorize values of entity attributes.

attributes: Dict[Any, Any]

dictionary of entity attributes with attribute names as keys

embedded_entity_attributes: Dict[Any, List[np.ndarray]]

entity attribute dictionary in which each attribute value is replaced by a list of token embeddings
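
As a rough illustration of the input and output shapes, the sketch below maps each attribute value to a list of token vectors. It is not forayer's implementation: `vectorize_attributes_sketch`, the toy dictionary `toy_wv`, whitespace tokenization, and converting non-string values with `str()` are assumptions made for the example.

```python
from typing import Any, Dict, List

import numpy as np


def vectorize_attributes_sketch(
    attributes: Dict[Any, Any],
    wv: Dict[str, np.ndarray],
    dim: int,
) -> Dict[Any, List[np.ndarray]]:
    """Replace each attribute value with a list of token embeddings."""
    embedded = {}
    for name, value in attributes.items():
        # Assumption: non-string values are converted with str() and
        # tokenized on whitespace.
        tokens = str(value).split()
        # Unknown tokens become all-NaN vectors (assumed representation).
        embedded[name] = [
            wv[token] if token in wv else np.full(dim, np.nan) for token in tokens
        ]
    return embedded


# Toy two-dimensional "embeddings" (hypothetical values).
toy_wv = {"berlin": np.array([0.5, 0.5]), "1990": np.array([0.1, 0.9])}
entity = {"city": "berlin", "founded": 1990, "note": "unknown"}
embedded = vectorize_attributes_sketch(entity, toy_wv, dim=2)
```

The attribute names are preserved as keys, so the result can be matched back to the original entity attribute by attribute.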