fuzzy_search package

Submodules

fuzzy_search.fuzzy_config module

fuzzy_search.fuzzy_context_searcher module

fuzzy_search.fuzzy_match module

fuzzy_search.fuzzy_patterns module

fuzzy_search.fuzzy_phrase module

fuzzy_search.fuzzy_phrase_model module

fuzzy_search.fuzzy_phrase_searcher module

fuzzy_search.fuzzy_searcher module

class fuzzy_search.fuzzy_searcher.FuzzySearcher(char_match_threshold=0.5, ngram_threshold=0.5, levenshtein_threshold=0.5, max_length_variance=1)

Bases: object

disable_strip_suffix()
enable_strip_suffix()
filter_candidates(candidates, keyword, ngram_size=2)
filter_char_match_candidates(candidates, match_term)
filter_levenshtein_candidates(candidates, match_term)
filter_ngram_candidates(candidates, match_term, ngram_size)
find_candidates(text, keyword, ngram_size=2, use_word_boundaries=False)

Find candidate matches that start with the same initial character as the search term and filter them based on default thresholds for character overlap, ngram overlap and Levenshtein distance.
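
The filtering described above can be illustrated with a minimal, self-contained sketch. It does not use the package and is not its implementation: the helper names (`overlap_ratio`, `levenshtein_ratio`, `filter_candidates_sketch`), the way each ratio is normalised, and the single shared `threshold` are all assumptions made for illustration.

```python
from collections import Counter


def make_ngrams(term, n):
    """Return the list of character n-grams of a term."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def overlap_ratio(items1, items2):
    """Fraction of items1 that also occur in items2 (multiset overlap)."""
    if not items1:
        return 0.0
    count1, count2 = Counter(items1), Counter(items2)
    shared = sum(min(freq, count2[item]) for item, freq in count1.items())
    return shared / len(items1)


def levenshtein_distance(s1, s2):
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ratio(term1, term2):
    """Assumed normalisation: 1 minus distance over the longer length."""
    longest = max(len(term1), len(term2))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(term1, term2) / longest


def filter_candidates_sketch(candidates, keyword, ngram_size=2, threshold=0.5):
    """Keep candidates that share the keyword's first character and pass
    all three (hypothetical) threshold checks."""
    keyword_ngrams = make_ngrams(keyword, ngram_size)
    kept = []
    for candidate in candidates:
        if not candidate or candidate[0] != keyword[0]:
            continue
        if overlap_ratio(list(keyword), list(candidate)) < threshold:
            continue
        candidate_ngrams = make_ngrams(candidate, ngram_size)
        if overlap_ratio(keyword_ngrams, candidate_ngrams) < threshold:
            continue
        if levenshtein_ratio(keyword, candidate) < threshold:
            continue
        kept.append(candidate)
    return kept


matches = filter_candidates_sketch(
    ["search", "searsh", "sea", "teach", "s"], "search")
```

In this sketch "teach" is rejected on its first character, "s" on character overlap and "sea" on ngram overlap, so only "search" and "searsh" remain.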

find_start_candidates(text, term, use_word_boundaries)

Find candidate matches that start with the same initial character as the search term.
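
As a rough illustration of the idea, the sketch below locates every position in a text where the first character of the term occurs, optionally only at a word boundary. It is an assumption-laden example, not the package's code: `find_start_positions` is a hypothetical helper, and using a regular expression with `\b` for the word-boundary option is a choice made here.

```python
import re


def find_start_positions(text, term, use_word_boundaries=False):
    """Return offsets in text where the term's first character occurs.

    With use_word_boundaries=True, only occurrences at the start of a
    word are returned (assumed meaning of the flag).
    """
    first_char = re.escape(term[0])
    pattern = r"\b" + first_char if use_word_boundaries else first_char
    return [match.start() for match in re.finditer(pattern, text)]


text = "the cat catches scat"
all_starts = find_start_positions(text, "cat")
word_starts = find_start_positions(text, "cat", use_word_boundaries=True)
```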

find_term_matches(text, term, max_length_variance=None, use_word_boundaries=False)
make_ngrams(term, n)
rank_candidates(candidates, keyword, ngram_size=2)
score_char_overlap(term1, term2)
score_char_overlap_ratio(term1, term2)
score_levenshtein_distance(s1, s2)
score_levenshtein_distance_ratio(term1, term2)
score_ngram_overlap(term1, term2, ngram_size)
score_ngram_overlap_ratio(term1, term2, ngram_size)
strip_suffix(match)

fuzzy_search.fuzzy_searcher.create_term_match(re_match, term)

fuzzy_search.fuzzy_string module

fuzzy_search.fuzzy_template module

fuzzy_search.fuzzy_template_searcher module

fuzzy_search.similarity module

class fuzzy_search.similarity.SkipCooccurrence(vocabulary: Vocabulary, skip_size: int = 1, sentences: Optional[Iterable[List[str]]] = None)

Bases: object

calculate_skip_cooccurrences(sentences: Iterable[List[str]], skip_size: int = 0)

Count the frequency of term (skip) co-occurrences for a given list of sentences.

Parameters:
  • sentences (Iterable[List[str]]) – a list of sentences, where each sentence is itself a list of term tokens

  • skip_size (int) – the maximum number of skips to allow between co-occurring terms
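
A minimal sketch of skip co-occurrence counting, written independently of the package. It assumes that pairs are ordered (left term, right term) and that a skip size of k pairs each token with the next k + 1 tokens; the function names are hypothetical.

```python
from collections import Counter


def skip_cooccurrence_pairs(tokens, skip_size=0):
    """Yield (left, right) pairs with at most skip_size tokens in between."""
    for i, left in enumerate(tokens):
        # Partners sit at distance 1 .. skip_size + 1 to the right.
        last = min(i + skip_size + 2, len(tokens))
        for j in range(i + 1, last):
            yield left, tokens[j]


def count_skip_cooccurrences(sentences, skip_size=0):
    """Count co-occurrence pairs over an iterable of tokenised sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(skip_cooccurrence_pairs(sentence, skip_size))
    return counts


sentences = [["a", "b", "c"], ["a", "c"]]
adjacent = count_skip_cooccurrences(sentences, skip_size=0)
with_skip = count_skip_cooccurrences(sentences, skip_size=1)
```

With `skip_size=0` only adjacent tokens are paired; with `skip_size=1` the first sentence also contributes the pair ("a", "c").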

get_term_coocs(term: str) → Union[None, Generator[Tuple[str, str], None, None]]

class fuzzy_search.similarity.SkipgramSimilarity(ngram_length: int = 3, skip_length: int = 0, terms: Optional[List[str]] = None, max_length_diff: int = 2)

Bases: object

index_terms(terms: List[str], reset_index: bool = True)

Make a frequency index of the skipgrams for a given list of terms. With ‘reset_index=True’ (the default shown in the signature), the index is reset before the given terms are indexed. With ‘reset_index=False’, indexing is cumulative: every time you call index_terms with a list of terms, they are added to the existing index.

Parameters:
  • terms (List[str]) – a list of terms to index

  • reset_index (bool) – whether to reset the index before indexing or to keep the existing index

rank_similar(term: str, top_n: int = 10, score_cutoff: float = 0.5)

Return a ranked list of similar terms from the index for a given input term, based on their character skipgram cosine similarity.

Parameters:
  • term (str) – a term (any string) to match against the indexed terms

  • top_n (int (default 10)) – the number of highest ranked terms to return

  • score_cutoff (float) – the minimum similarity score; terms scoring below it are cut off from the ranking

Returns:

a ranked list of terms and their similarity scores

Return type:

List[Tuple[str, float]]
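
To make the ranking idea concrete, here is a small stand-alone sketch of cosine similarity over character skipgram frequencies. It is not the package's implementation: `char_skipgrams`, `cosine_similarity` and `rank_similar_sketch` are hypothetical helpers, the skipgram definition (every in-order combination inside a window of `ngram_length + skip_length` characters) is an assumption, and the `max_length_diff` filter is left out.

```python
import math
from collections import Counter
from itertools import combinations


def char_skipgrams(term, ngram_length=3, skip_length=0):
    """Character skipgrams anchored at each start position.

    With skip_length=0 these are plain character n-grams.
    """
    window = ngram_length + skip_length
    grams = []
    for start in range(len(term) - ngram_length + 1):
        rest = term[start + 1:start + window]
        for tail in combinations(rest, ngram_length - 1):
            grams.append(term[start] + "".join(tail))
    return grams


def cosine_similarity(freq1, freq2):
    """Cosine of the angle between two skipgram frequency vectors."""
    dot = sum(freq * freq2[gram] for gram, freq in freq1.items())
    norm1 = math.sqrt(sum(freq * freq for freq in freq1.values()))
    norm2 = math.sqrt(sum(freq * freq for freq in freq2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def rank_similar_sketch(term, indexed_terms, top_n=10, score_cutoff=0.5,
                        ngram_length=3, skip_length=0):
    """Return (term, score) pairs at or above the cutoff, best first."""
    term_freq = Counter(char_skipgrams(term, ngram_length, skip_length))
    scored = []
    for candidate in indexed_terms:
        candidate_freq = Counter(
            char_skipgrams(candidate, ngram_length, skip_length))
        score = cosine_similarity(term_freq, candidate_freq)
        if score >= score_cutoff:
            scored.append((candidate, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


ranking = rank_similar_sketch(
    "search", ["teacher", "searches", "search", "zzz"])
```

Here "teacher" and "zzz" share no trigram with "search", so they score 0 and fall below the cutoff, while "searches" shares four of its six trigrams and ranks second after the exact match.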

class fuzzy_search.similarity.Vocabulary

Bases: object

add_terms(terms: List[str], reset_index: bool = True)

Add a list of terms to the vocabulary. Use ‘reset_index=True’ to reset the vocabulary before adding the terms.

Parameters:
  • terms (List[str]) – a list of terms to add to the vocabulary

  • reset_index (bool) – a flag to indicate whether to empty the vocabulary before adding terms

id2term(term_id: int)

Return the term for a given term ID.

reset_index()
term2id(term: str)

Return the term ID for a given term.
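
The methods above describe a two-way mapping between terms and integer IDs. The sketch below shows one way such a mapping could behave; `VocabularySketch` is a hypothetical class, and the choices to assign IDs in insertion order and to return None for unknown terms or IDs are assumptions, not documented behaviour.

```python
class VocabularySketch:
    """Illustrative two-way term/ID mapping (not the package's class)."""

    def __init__(self):
        self.term_to_id = {}
        self.id_to_term = {}

    def reset_index(self):
        """Empty both mappings."""
        self.term_to_id = {}
        self.id_to_term = {}

    def add_terms(self, terms, reset_index=True):
        """Add terms, optionally emptying the vocabulary first."""
        if reset_index:
            self.reset_index()
        for term in terms:
            if term not in self.term_to_id:
                term_id = len(self.term_to_id)  # next free ID
                self.term_to_id[term] = term_id
                self.id_to_term[term_id] = term

    def term2id(self, term):
        """Return the ID for a term, or None if it is unknown."""
        return self.term_to_id.get(term)

    def id2term(self, term_id):
        """Return the term for an ID, or None if it is unknown."""
        return self.id_to_term.get(term_id)


vocab = VocabularySketch()
vocab.add_terms(["a", "b", "a"])
vocab.add_terms(["c"], reset_index=False)  # keeps "a" and "b"
```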

fuzzy_search.similarity.get_begin_sim(phrase1: str, phrase2: str, begin_length: int) → float
fuzzy_search.similarity.get_end_sim(phrase1: str, phrase2: str, end_length: int) → float
fuzzy_search.similarity.get_min_length(phrase1: str, phrase2: str, begin_length: int) → int
fuzzy_search.similarity.get_skip_coocs(seq_ids: List[str], skip_size: int = 0) → Generator[Tuple[int, int], None, None]
fuzzy_search.similarity.vector_length(skipgram_freq)