abydos.distance package
The distance package implements string distance measure and metric classes.
These include traditional Levenshtein edit distance and related algorithms:
Levenshtein distance (Levenshtein)
Optimal String Alignment distance (Levenshtein with mode='osa')
Damerau-Levenshtein distance (DamerauLevenshtein)
Yujian-Bo normalized edit distance (YujianBo)
Higuera-Micó contextual normalized edit distance (HigueraMico)
Indel distance (Indel)
Syllable Alignment Pattern Searching similarity (distance.SAPS)
Meta-Levenshtein distance (MetaLevenshtein)
Covington distance (Covington)
ALINE distance (ALINE)
FlexMetric distance (FlexMetric)
BI-SIM similarity (BISIM)
Discounted Levenshtein distance (DiscountedLevenshtein)
Phonetic edit distance (PhoneticEditDistance)
Hamming distance (Hamming), Relaxed Hamming distance (RelaxedHamming), and the closely related Modified Language-Independent Product Name Search distance (MLIPNS) are provided.
Block edit distances:
Tichy edit distance (Tichy)
Levenshtein distance with block operations (BlockLevenshtein)
Rees-Levenshtein distance (ReesLevenshtein)
Cormode's LZ distance (CormodeLZ)
Shapira-Storer I edit distance with block moves, greedy algorithm (ShapiraStorerI)
Distance metrics developed for the US Census or derived from them are included:
Jaro distance (JaroWinkler with mode='Jaro')
Jaro-Winkler distance (JaroWinkler)
Strcmp95 distance (Strcmp95)
Iterative-SubString (I-Sub) correlation (IterativeSubString)
A large set of multi-set token-based distance metrics are provided, including:
AMPLE similarity (AMPLE)
AZZOO similarity (AZZOO)
Anderberg's D similarity (Anderberg)
Andres & Marzo's Delta correlation (AndresMarzoDelta)
Baroni-Urbani & Buser I similarity (BaroniUrbaniBuserI)
Baroni-Urbani & Buser II correlation (BaroniUrbaniBuserII)
Batagelj & Bren similarity (BatageljBren)
Baulieu I distance (BaulieuI)
Baulieu II distance (BaulieuII)
Baulieu III distance (BaulieuIII)
Baulieu IV distance (BaulieuIV)
Baulieu V distance (BaulieuV)
Baulieu VI distance (BaulieuVI)
Baulieu VII distance (BaulieuVII)
Baulieu VIII distance (BaulieuVIII)
Baulieu IX distance (BaulieuIX)
Baulieu X distance (BaulieuX)
Baulieu XI distance (BaulieuXI)
Baulieu XII distance (BaulieuXII)
Baulieu XIII distance (BaulieuXIII)
Baulieu XIV distance (BaulieuXIV)
Baulieu XV distance (BaulieuXV)
Benini I correlation (BeniniI)
Benini II correlation (BeniniII)
Bennet's S correlation (Bennet)
Braun-Blanquet similarity (BraunBlanquet)
Canberra distance (Canberra)
Cao similarity (Cao)
Chao's Dice similarity (ChaoDice)
Chao's Jaccard similarity (ChaoJaccard)
Chebyshev distance (Chebyshev)
Chord distance (Chord)
Clark distance (Clark)
Clement similarity (Clement)
Cohen's Kappa similarity (CohenKappa)
Cole correlation (Cole)
Consonni & Todeschini I similarity (ConsonniTodeschiniI)
Consonni & Todeschini II similarity (ConsonniTodeschiniII)
Consonni & Todeschini III similarity (ConsonniTodeschiniIII)
Consonni & Todeschini IV similarity (ConsonniTodeschiniIV)
Consonni & Todeschini V correlation (ConsonniTodeschiniV)
Cosine similarity (Cosine)
Dennis similarity (Dennis)
Dice's Asymmetric I similarity (DiceAsymmetricI)
Dice's Asymmetric II similarity (DiceAsymmetricII)
Digby correlation (Digby)
Dispersion correlation (Dispersion)
Doolittle similarity (Doolittle)
Dunning similarity (Dunning)
Euclidean distance (Euclidean)
Eyraud similarity (Eyraud)
Fager & McGowan similarity (FagerMcGowan)
Faith similarity (Faith)
Fidelity similarity (Fidelity)
Fleiss correlation (Fleiss)
Fleiss-Levin-Paik similarity (FleissLevinPaik)
Forbes I similarity (ForbesI)
Forbes II correlation (ForbesII)
Fossum similarity (Fossum)
Generalized Fleiss correlation (GeneralizedFleiss)
Gilbert correlation (Gilbert)
Gilbert & Wells similarity (GilbertWells)
Gini I correlation (GiniI)
Gini II correlation (GiniII)
Goodall similarity (Goodall)
Goodman & Kruskal's Lambda similarity (GoodmanKruskalLambda)
Goodman & Kruskal's Lambda-r correlation (GoodmanKruskalLambdaR)
Goodman & Kruskal's Tau A similarity (GoodmanKruskalTauA)
Goodman & Kruskal's Tau B similarity (GoodmanKruskalTauB)
Gower & Legendre similarity (GowerLegendre)
Guttman Lambda A similarity (GuttmanLambdaA)
Guttman Lambda B similarity (GuttmanLambdaB)
Gwet's AC correlation (GwetAC)
Hamann correlation (Hamann)
Harris & Lahey similarity (HarrisLahey)
Hassanat distance (Hassanat)
Hawkins & Dotson similarity (HawkinsDotson)
Hellinger distance (Hellinger)
Henderson-Heron similarity (HendersonHeron)
Horn-Morisita similarity (HornMorisita)
Hurlbert correlation (Hurlbert)
Jaccard similarity (Jaccard) & Tanimoto coefficient (Jaccard.tanimoto_coeff())
Jaccard-NM similarity (JaccardNM)
Johnson similarity (Johnson)
Kendall's Tau correlation (KendallTau)
Kent & Foster I similarity (KentFosterI)
Kent & Foster II similarity (KentFosterII)
Köppen I correlation (KoppenI)
Köppen II similarity (KoppenII)
Kuder & Richardson correlation (KuderRichardson)
Kuhns I correlation (KuhnsI)
Kuhns II correlation (KuhnsII)
Kuhns III correlation (KuhnsIII)
Kuhns IV correlation (KuhnsIV)
Kuhns V correlation (KuhnsV)
Kuhns VI correlation (KuhnsVI)
Kuhns VII correlation (KuhnsVII)
Kuhns VIII correlation (KuhnsVIII)
Kuhns IX correlation (KuhnsIX)
Kuhns X correlation (KuhnsX)
Kuhns XI correlation (KuhnsXI)
Kuhns XII similarity (KuhnsXII)
Kulczynski I similarity (KulczynskiI)
Kulczynski II similarity (KulczynskiII)
Lorentzian distance (Lorentzian)
Maarel correlation (Maarel)
Manhattan distance (Manhattan)
Morisita similarity (Morisita)
marking distance (Marking)
marking metric (MarkingMetric)
MASI similarity (MASI)
Matusita distance (Matusita)
Maxwell & Pilliner correlation (MaxwellPilliner)
McConnaughey correlation (McConnaughey)
McEwen & Michael correlation (McEwenMichael)
mean squared contingency correlation (MSContingency)
Michael similarity (Michael)
Michelet similarity (Michelet)
Millar distance (Millar)
Minkowski distance (Minkowski)
Mountford similarity (Mountford)
Mutual Information similarity (MutualInformation)
Overlap distance (Overlap)
Pattern difference (Pattern)
Pearson & Heron II correlation (PearsonHeronII)
Pearson II similarity (PearsonII)
Pearson III correlation (PearsonIII)
Pearson's Chi-Squared similarity (PearsonChiSquared)
Pearson's Phi correlation (PearsonPhi)
Peirce correlation (Peirce)
q-gram distance (QGram)
Raup-Crick similarity (RaupCrick)
Rogers & Tanimoto similarity (RogersTanimoto)
Rogot & Goldberg similarity (RogotGoldberg)
Russell & Rao similarity (RussellRao)
Scott's Pi correlation (ScottPi)
Shape difference (Shape)
Size difference (Size)
Sokal & Michener similarity (SokalMichener)
Sokal & Sneath I similarity (SokalSneathI)
Sokal & Sneath II similarity (SokalSneathII)
Sokal & Sneath III similarity (SokalSneathIII)
Sokal & Sneath IV similarity (SokalSneathIV)
Sokal & Sneath V similarity (SokalSneathV)
Sørensen–Dice coefficient (Dice)
Sorgenfrei similarity (Sorgenfrei)
Steffensen similarity (Steffensen)
Stiles similarity (Stiles)
Stuart's Tau correlation (StuartTau)
Tarantula similarity (Tarantula)
Tarwid correlation (Tarwid)
Tetrachoric correlation coefficient (Tetrachronic)
Tulloss' R similarity (TullossR)
Tulloss' S similarity (TullossS)
Tulloss' T similarity (TullossT)
Tulloss' U similarity (TullossU)
Tversky distance (Tversky)
Weighted Jaccard similarity (WeightedJaccard)
Unigram subtuple similarity (UnigramSubtuple)
Unknown A correlation (UnknownA)
Unknown B similarity (UnknownB)
Unknown C similarity (UnknownC)
Unknown D similarity (UnknownD)
Unknown E correlation (UnknownE)
Unknown F similarity (UnknownF)
Unknown G similarity (UnknownG)
Unknown H similarity (UnknownH)
Unknown I similarity (UnknownI)
Unknown J similarity (UnknownJ)
Unknown K distance (UnknownK)
Unknown L similarity (UnknownL)
Unknown M similarity (UnknownM)
Upholt similarity (Upholt)
Warrens I correlation (WarrensI)
Warrens II similarity (WarrensII)
Warrens III correlation (WarrensIII)
Warrens IV similarity (WarrensIV)
Warrens V similarity (WarrensV)
Whittaker distance (Whittaker)
Yates' Chi-Squared similarity (YatesChiSquared)
Yule's Q correlation (YuleQ)
Yule's Q II distance (YuleQII)
Yule's Y correlation (YuleY)
YJHHR distance (YJHHR)
Bhattacharyya distance (Bhattacharyya)
Brainerd-Robinson similarity (BrainerdRobinson)
Quantitative Cosine similarity (QuantitativeCosine)
Quantitative Dice similarity (QuantitativeDice)
Quantitative Jaccard similarity (QuantitativeJaccard)
Roberts similarity (Roberts)
Average linkage distance (AverageLinkage)
Single linkage distance (SingleLinkage)
Complete linkage distance (CompleteLinkage)
Bag distance (Bag)
Soft cosine similarity (SoftCosine)
Monge-Elkan distance (MongeElkan)
TF-IDF similarity (TFIDF)
SoftTF-IDF similarity (SoftTFIDF)
Jensen-Shannon divergence (JensenShannon)
Simplified Fellegi-Sunter distance (FellegiSunter)
MinHash similarity (MinHash)
BLEU similarity (BLEU)
Rouge-L similarity (RougeL)
Rouge-W similarity (RougeW)
Rouge-S similarity (RougeS)
Rouge-SU similarity (RougeSU)
Positional Q-Gram Dice distance (PositionalQGramDice)
Positional Q-Gram Jaccard distance (PositionalQGramJaccard)
Positional Q-Gram Overlap distance (PositionalQGramOverlap)
Three popular sequence alignment algorithms are provided:
Needleman-Wunsch score (NeedlemanWunsch)
Smith-Waterman score (SmithWaterman)
Gotoh score (Gotoh)
Classes relating to substring and subsequence distances include:
Longest common subsequence (LCSseq)
Longest common substring (LCSstr)
Ratcliff-Obershelp distance (RatcliffObershelp)
A number of simple distance classes provided in the package include:
Normalized compression distance classes for a variety of compression algorithms are provided:
Three similarity measures from SeatGeek's FuzzyWuzzy:
FuzzyWuzzy Partial String similarity (FuzzyWuzzyPartialString)
FuzzyWuzzy Token Sort similarity (FuzzyWuzzyTokenSort)
FuzzyWuzzy Token Set similarity (FuzzyWuzzyTokenSet)
A convenience class is also provided. It accepts a list of string transforms (phonetic algorithms, string transforms, and/or stemmers) and, optionally, a string distance measure, and computes the similarity/distance of two strings after each transform has been applied:
Phonetic distance (PhoneticDistance)
The remaining distance measures & metrics include:
Western Airlines' Match Rating Algorithm comparison (distance.MRA)
Editex (Editex)
Bavarian Landesamt für Statistik distance (Baystat)
Eudex distance (distance.Eudex)
Sift4 distance (Sift4, Sift4Simplest, Sift4Extended)
Typo distance (Typo)
Synoname (Synoname)
Ozbay metric (Ozbay)
Indice de Similitude-Guth (ISG)
INClusion Programme (Inclusion)
Guth (Guth)
Victorian Panel Study (VPS)
LIG3 (LIG3)
String subsequence kernel (SSK) (SSK)
Most of the distance and similarity measures have sim and dist methods, which return a measure that is normalized to the range \([0, 1]\). The normalized distance and similarity are always complements, so for a particular measure supplied with the same input, the normalized distance always equals 1 minus the similarity. Some measures have an absolute distance method dist_abs and/or a similarity score method sim_score, which are not limited to any range.
The first three methods can be demonstrated using the DamerauLevenshtein class, while SmithWaterman offers the fourth:
>>> from abydos.distance import DamerauLevenshtein, SmithWaterman
>>> dl = DamerauLevenshtein()
>>> dl.dist_abs('orange', 'strange')
2
>>> dl.dist('orange', 'strange')
0.2857142857142857
>>> dl.sim('orange', 'strange')
0.7142857142857143
>>> sw = SmithWaterman()
>>> sw.sim_score('TGTTACGG', 'GGTTGACTA')
4.0
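The relationship between dist_abs, dist, and sim can be sketched without the library. The class below is a hypothetical stand-in (ToyLevenshtein is not part of abydos, and it uses plain Levenshtein distance rather than Damerau-Levenshtein); it assumes normalization by the length of the longer string, which may differ from what a given abydos class does.

```python
class ToyLevenshtein:
    """Illustrative only: mimics the dist_abs / dist / sim convention."""

    def dist_abs(self, src: str, tar: str) -> int:
        # Classic dynamic-programming edit distance, one row at a time.
        prev = list(range(len(tar) + 1))
        for i, s_char in enumerate(src, 1):
            curr = [i]
            for j, t_char in enumerate(tar, 1):
                cost = 0 if s_char == t_char else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        return prev[-1]

    def dist(self, src: str, tar: str) -> float:
        # Assumption: normalize by the longer string's length.
        if src == tar:
            return 0.0
        return self.dist_abs(src, tar) / max(len(src), len(tar))

    def sim(self, src: str, tar: str) -> float:
        # Normalized similarity is the complement of normalized distance.
        return 1.0 - self.dist(src, tar)


toy = ToyLevenshtein()
absolute = toy.dist_abs('orange', 'strange')
normalized = toy.dist('orange', 'strange')
similarity = toy.sim('orange', 'strange')
```

For this pair the absolute distance is 2 (one insertion and one substitution), and the normalized distance and similarity sum to 1.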
- class abydos.distance.ALINE(epsilon: float = 0.0, c_skip: float = -10, c_sub: float = 35, c_exp: float = 45, c_vwl: float = 10, mode: str = 'local', phones: str = 'aline', normalizer: Callable[[List[float]], float] = max, **kwargs: Any)
Bases: _Distance
ALINE alignment, similarity, and distance.
ALINE alignment was developed by [DHC+08, Kon00, Kon02], and establishes an alignment algorithm based on multivalued phonetic features and feature salience weights. Along with the alignment itself, the algorithm produces a term similarity score.
[DHC+08] develops ALINE's similarity score into a similarity measure & distance measure:
\[sim_{ALINE} = \frac{2 \cdot score_{ALINE}(src, tar)} {score_{ALINE}(src, src) + score_{ALINE}(tar, tar)}\]
However, the average of the two self-similarity scores is not guaranteed to be greater than or equal to the similarity score between the two strings, so this formula can exceed 1. To guarantee that the similarity measure is bounded to [0, 1], it is not used here by default. Instead, Kondrak's similarity measure is employed:
\[sim_{ALINE} = \frac{score_{ALINE}(src, tar)} {\max(score_{ALINE}(src, src), score_{ALINE}(tar, tar))}\]
New in version 0.4.0.
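Both formulas divide the cross score by a term computed from the two self-similarity scores: the maximum for Kondrak's measure, the mean for the [DHC+08] measure. The sketch below shows this with made-up scores; normalized_similarity is a hypothetical helper, not an abydos function, and the numbers are chosen only to show the mean-based variant exceeding 1.

```python
from typing import Callable, List


def normalized_similarity(
    score_st: float,
    self_scores: List[float],
    normalizer: Callable[[List[float]], float] = max,
) -> float:
    """Divide the src/tar score by a term derived from the self scores."""
    return score_st / normalizer(self_scores)


# Hypothetical raw scores: score(src, tar), score(src, src), score(tar, tar).
score_st = 90.0
self_scores = [100.0, 70.0]

# Kondrak's measure: normalize by the larger self score.
kondrak = normalized_similarity(score_st, self_scores)

# [DHC+08] measure: normalize by the mean self score, which is
# algebraically the same as 2 * score_st / (score_ss + score_tt).
downey = normalized_similarity(
    score_st, self_scores, lambda x: sum(x) / len(x)
)
```

With these inputs the max-normalized value stays within [0, 1], while the mean-normalized value is greater than 1.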
Initialize ALINE instance.
- Parameters:
epsilon (float) -- The portion (out of 1.0) of the maximum ALINE score, above which alignments are returned. If set to 0, only the alignments matching the maximum alignment score are returned. If set to 1, all alignments scoring 0 or higher are returned.
c_skip (float) -- The cost of an insertion or deletion
c_sub (float) -- The cost of a substitution
c_exp (float) -- The cost of an expansion or contraction
c_vwl (float) -- The additional cost of a vowel substitution, expansion, or contraction
mode (str) -- Alignment mode, which can be local (default), global, half-local, or semi-global
phones (str) -- Phonetic symbol set, which can be:
aline selects Kondrak's original symbol set
ipa selects IPA symbols
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). For the normalization proposed by Downey, et al. (2008), set this to:
lambda x: sum(x)/len(x)
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
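The description of epsilon is consistent with a score threshold of (1 - epsilon) times the best score. The sketch below assumes that reading; filter_alignments and the candidate tuples are hypothetical and are not taken from the abydos source.

```python
from typing import List, Tuple

Alignment = Tuple[float, str, str]


def filter_alignments(
    candidates: List[Alignment], epsilon: float
) -> List[Alignment]:
    """Keep candidates scoring at least (1 - epsilon) * best score."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError('epsilon must lie in [0, 1]')
    if not candidates:
        return []
    best = max(score for score, _, _ in candidates)
    # Assumed threshold: epsilon=0 keeps only top-scoring alignments,
    # epsilon=1 keeps everything scoring 0 or higher.
    threshold = (1.0 - epsilon) * best
    return [cand for cand in candidates if cand[0] >= threshold]


# Made-up candidate alignments (score, src alignment, tar alignment).
candidates = [
    (65.0, 'a1', 'b1'),
    (65.0, 'a2', 'b2'),
    (40.0, 'a3', 'b3'),
    (0.0, 'a4', 'b4'),
]

only_best = filter_alignments(candidates, 0.0)
half = filter_alignments(candidates, 0.5)
everything = filter_alignments(candidates, 1.0)
```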
- alignment(src: str, tar: str) -> Tuple[float, str, str]
Return the top ALINE alignment of two strings.
The top ALINE alignment is the first alignment with the best score. The purpose of this function is to have a single tuple as a return value.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
ALINE alignment and its score
- Return type:
tuple(float, str, str)
Examples
>>> cmp = ALINE()
>>> cmp.alignment('cat', 'hat')
(50.0, 'c ‖ a t ‖', 'h ‖ a t ‖')
>>> cmp.alignment('niall', 'neil')
(90.0, '‖ n i a ll ‖', '‖ n e i l ‖')
>>> cmp.alignment('aluminum', 'catalan')
(81.5, '‖ a l u m ‖ inum', 'cat ‖ a l a n ‖')
>>> cmp.alignment('atcg', 'tagc')
(65.0, '‖ a t c ‖ g', 't ‖ a g c ‖')
New in version 0.4.1.
- alignments(src: str, tar: str, score_only: bool = False) -> Union[float, List[Tuple[float, str, str]]]
Return the ALINE alignments of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
score_only (bool) -- Return the score only, not the alignments
- Returns:
ALINE alignments and their scores or the top score
- Return type:
list(tuple(float, str, str)) or float
Examples
>>> cmp = ALINE()
>>> cmp.alignments('cat', 'hat')
[(50.0, 'c ‖ a t ‖', 'h ‖ a t ‖')]
>>> cmp.alignments('niall', 'neil')
[(90.0, '‖ n i a ll ‖', '‖ n e i l ‖')]
>>> cmp.alignments('aluminum', 'catalan')
[(81.5, '‖ a l u m ‖ inum', 'cat ‖ a l a n ‖')]
>>> cmp.alignments('atcg', 'tagc')
[(65.0, '‖ a t c ‖ g', 't ‖ a g c ‖'), (65.0, 'a ‖ tc - g ‖', '‖ t a g ‖ c')]
New in version 0.4.0.
Changed in version 0.4.1: Renamed from .alignment to .alignments
- c_features = {'aspirated', 'lateral', 'manner', 'nasal', 'place', 'retroflex', 'syllabic', 'voice'}
- feature_weights = {'affricate': 0.9, 'alveolar': 0.85, 'approximant': 0.6, 'back': 0.0, 'bilabial': 1.0, 'central': 0.5, 'dental': 0.9, 'fricative': 0.8, 'front': 1.0, 'glottal': 0.1, 'high': 1.0, 'high vowel': 0.4, 'labiodental': 0.95, 'low': 0.0, 'low vowel': 0.0, 'mid': 0.5, 'mid vowel': 0.2, 'minus': 0.0, 'palatal': 0.7, 'palato-alveolar': 0.75, 'pharyngeal': 0.3, 'plus': 1.0, 'retroflex': 0.8, 'stop': 1.0, 'tap': 0.5, 'trill': 0.55, 'uvular': 0.5, 'velar': 0.6}
- phones_ipa: Dict[str, Dict[str, str]] = {'a': {'aspirated': 'minus', 'back': 'front', 'high': 'low', 'lateral': 'minus', 'long': 'minus', 'manner': 'low vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'b': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'c': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'd': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'e': {'aspirated': 'minus', 'back': 'front', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'f': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'labiodental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'g': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'h': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'glottal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'i': {'aspirated': 'minus', 'back': 'front', 'high': 'high', 'lateral': 'minus', 'long': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'j': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'k': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 
'syllabic': 'minus', 'voice': 'minus'}, 'l': {'aspirated': 'minus', 'lateral': 'plus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'm': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'n': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'o': {'aspirated': 'minus', 'back': 'back', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'p': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'q': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'uvular', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'r': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'trill', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 's': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 't': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'u': {'aspirated': 'minus', 'back': 'back', 'high': 'high', 'lateral': 'minus', 'long': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'v': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'labiodental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'w': 
{'aspirated': 'minus', 'double': 'bilabial', 'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'x': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'y': {'aspirated': 'minus', 'back': 'front', 'high': 'high', 'lateral': 'minus', 'long': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'z': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'æ': {'aspirated': 'minus', 'back': 'front', 'high': 'low', 'lateral': 'minus', 'long': 'minus', 'manner': 'low vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'ç': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'ð': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'dental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ø': {'aspirated': 'minus', 'back': 'front', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'ħ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'pharyngeal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'ŋ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'œ': {'aspirated': 'minus', 'back': 'front', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 
'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'ɒ': {'aspirated': 'minus', 'back': 'back', 'high': 'low', 'lateral': 'minus', 'long': 'minus', 'manner': 'low vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'ɔ': {'aspirated': 'minus', 'back': 'back', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'ɖ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 'ə': {'aspirated': 'minus', 'back': 'central', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'ɛ': {'aspirated': 'minus', 'back': 'front', 'high': 'mid', 'lateral': 'minus', 'long': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'ɟ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɢ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'uvular', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɣ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɦ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'glottal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɨ': {'aspirated': 'minus', 'back': 'central', 'high': 'high', 'lateral': 'minus', 'long': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'ɬ': 
{'aspirated': 'minus', 'lateral': 'plus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'ɮ': {'aspirated': 'minus', 'lateral': 'plus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɰ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɱ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'labiodental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɲ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'palatal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɳ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɴ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'uvular', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɸ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'ɹ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɻ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɽ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'tap', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 'ɾ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'tap', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 
'voice': 'plus'}, 'ʀ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'trill', 'nasal': 'minus', 'place': 'uvular', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʁ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'uvular', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʂ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'minus'}, 'ʃ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'palato-alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'ʈ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'minus'}, 'ʉ': {'aspirated': 'minus', 'back': 'central', 'high': 'high', 'lateral': 'minus', 'long': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'ʋ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'labiodental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʐ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʒ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'palato-alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʔ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'glottal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'ʕ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'pharyngeal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʙ': {'aspirated': 'minus', 'lateral': 
'minus', 'manner': 'trill', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʝ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'ʰ': {'aspirated': 'plus', 'supplemental': 'True'}, 'ː': {'long': 'plus', 'supplemental': 'True'}, 'β': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'θ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'dental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'χ': {'aspirated': 'minus', 'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'uvular', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}}
- phones_kondrak: Dict[str, Dict[str, str]] = {'A': {'aspirated': 'plus', 'supplemental': 'True'}, 'B': {'back': 'back', 'supplemental': 'True'}, 'C': {'back': 'central', 'supplemental': 'True'}, 'D': {'place': 'dental', 'supplemental': 'True'}, 'F': {'back': 'front', 'supplemental': 'True'}, 'H': {'long': 'plus', 'supplemental': 'True'}, 'N': {'nasal': 'plus', 'supplemental': 'True'}, 'P': {'place': 'palatal', 'supplemental': 'True'}, 'R': {'round': 'plus', 'supplemental': 'True'}, 'S': {'manner': 'fricative', 'supplemental': 'True'}, 'V': {'place': 'palato-alveolar', 'supplemental': 'True'}, 'a': {'back': 'central', 'high': 'low', 'lateral': 'minus', 'manner': 'low vowel', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'b': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'c': {'lateral': 'minus', 'manner': 'affricate', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'd': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'e': {'back': 'front', 'high': 'mid', 'lateral': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'place': 'palatal', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'f': {'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'labiodental', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'g': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'h': {'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'glottal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'i': {'back': 'front', 'high': 'high', 'lateral': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'place': 'palatal', 
'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'j': {'lateral': 'minus', 'manner': 'affricate', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'k': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'l': {'lateral': 'plus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'm': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'n': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'plus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}, 'o': {'back': 'back', 'high': 'mid', 'lateral': 'minus', 'manner': 'mid vowel', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'p': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'bilabial', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'q': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'glottal', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'r': {'lateral': 'minus', 'manner': 'approximant', 'nasal': 'minus', 'place': 'retroflex', 'retroflex': 'plus', 'syllabic': 'minus', 'voice': 'plus'}, 's': {'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 't': {'lateral': 'minus', 'manner': 'stop', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'u': {'back': 'back', 'high': 'high', 'lateral': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'v': {'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 
'place': 'labiodental', 'retroflex': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'w': {'back': 'back', 'double': 'bilabial', 'high': 'high', 'lateral': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'round': 'plus', 'syllabic': 'plus', 'voice': 'plus'}, 'x': {'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'minus'}, 'y': {'back': 'front', 'high': 'high', 'lateral': 'minus', 'manner': 'high vowel', 'nasal': 'minus', 'place': 'velar', 'retroflex': 'minus', 'round': 'minus', 'syllabic': 'plus', 'voice': 'plus'}, 'z': {'lateral': 'minus', 'manner': 'fricative', 'nasal': 'minus', 'place': 'alveolar', 'retroflex': 'minus', 'syllabic': 'minus', 'voice': 'plus'}}
- salience = {'aspirated': 5, 'back': 5, 'high': 5, 'lateral': 10, 'long': 1, 'manner': 50, 'nasal': 10, 'place': 40, 'retroflex': 10, 'round': 5, 'syllabic': 5, 'voice': 10}
- sim(src: str, tar: str) -> float
Return the normalized ALINE similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized ALINE similarity
- Return type:
float
Examples
>>> cmp = ALINE()
>>> cmp.dist('cat', 'hat')
0.4117647058823529
>>> cmp.dist('niall', 'neil')
0.33333333333333337
>>> cmp.dist('aluminum', 'catalan')
0.5925
>>> cmp.dist('atcg', 'tagc')
0.45833333333333337
New in version 0.4.0.
- sim_score(src: str, tar: str) -> float
Return the ALINE alignment score of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
ALINE alignment score
- Return type:
float
Examples
>>> cmp = ALINE()
>>> cmp.sim_score('cat', 'hat')
50.0
>>> cmp.sim_score('niall', 'neil')
90.0
>>> cmp.sim_score('aluminum', 'catalan')
81.5
>>> cmp.sim_score('atcg', 'tagc')
65.0
New in version 0.4.0.
- v_features = {'back', 'high', 'long', 'nasal', 'retroflex', 'round', 'syllabic'}
- class abydos.distance.AMPLE(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
AMPLE similarity.
The AMPLE similarity [AZvanGemund07, DLZ05] is defined in getAverageSequenceWeight() in the AverageSequenceWeightEvaluator.java file of AMPLE's source code. For two sets X and Y and a population N, it is
\[sim_{AMPLE}(X, Y) = \big|\frac{|X \cap Y|}{|X|} - \frac{|Y \setminus X|}{|N \setminus X|}\big|\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{AMPLE} = \big|\frac{a}{a+b}-\frac{c}{c+d}\big|\]
Notes
This measure is asymmetric. The first ratio considers how similar the two strings are, while the second considers how dissimilar the second string is. As a result, both very similar and very dissimilar strings will score high on this measure, provided the unique aspects are present chiefly in the latter string.
New in version 0.4.0.
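As a rough illustration of the confusion-table form and of the asymmetry described in the notes, the following sketch evaluates the formula directly from the four counts. The function name `ample_from_counts` and the example counts are assumptions made for illustration (two strings sharing 2 tokens in a population of 784); this is not the library's implementation.

```python
def ample_from_counts(a: int, b: int, c: int, d: int) -> float:
    """Evaluate |a/(a+b) - c/(c+d)| from 2x2 confusion-table counts."""
    # a: tokens in both X and Y, b: tokens only in X,
    # c: tokens only in Y,       d: tokens in neither.
    if a + b == 0 or c + d == 0:
        # Assumption of this sketch: an empty denominator scores 0.0.
        return 0.0
    return abs(a / (a + b) - c / (c + d))


# Hypothetical counts: X and Y share 2 tokens, X has 4 unique, Y has 3.
forward = ample_from_counts(2, 4, 3, 775)
# Swapping the two strings swaps b and c, so the score changes.
backward = ample_from_counts(2, 3, 4, 775)
print(forward, backward)
```

Because b and c enter the formula through different denominators, swapping the source and target generally gives a different value.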
Initialize AMPLE instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) -> float
Return the AMPLE similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
AMPLE similarity
- Return type:
float
Examples
>>> cmp = AMPLE()
>>> cmp.sim('cat', 'hat')
0.49743589743589745
>>> cmp.sim('Niall', 'Neil')
0.32947729220222793
>>> cmp.sim('aluminum', 'Catalan')
0.10209049255441008
>>> cmp.sim('ATCG', 'TAGC')
0.006418485237483954
New in version 0.4.0.
- class abydos.distance.AZZOO(sigma: float = 0.5, alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
AZZOO similarity.
For two sets X and Y, and alphabet N, and a parameter \(\sigma\), AZZOO similarity [CTY06] is
\[sim_{AZZOO_{\sigma}}(X, Y) = \sum{s_i}\]
where \(s_i = 1\) if \(X_i = Y_i = 1\), \(s_i = \sigma\) if \(X_i = Y_i = 0\), and \(s_i = 0\) otherwise.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{AZZOO} = a + \sigma \cdot d\]
New in version 0.4.0.
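The per-position definition and the confusion-table form can be checked against each other with a small sketch. The function `azzoo` and the toy indicator vectors are hypothetical, chosen only to illustrate the sum; they are not part of the library.

```python
def azzoo(x, y, sigma=0.5):
    """Sum s_i over two equal-length 0/1 indicator vectors."""
    total = 0.0
    for xi, yi in zip(x, y):
        if xi == 1 and yi == 1:
            total += 1.0      # 1-1 match contributes 1
        elif xi == 0 and yi == 0:
            total += sigma    # 0-0 match contributes sigma
        # mismatches contribute nothing
    return total


# Toy vectors over a 5-token alphabet: one 1-1 position (a = 1)
# and two 0-0 positions (d = 2), so the sum equals a + sigma * d.
x = [1, 1, 0, 0, 0]
y = [1, 0, 1, 0, 0]
print(azzoo(x, y))
```

Setting `sigma=0.0` ignores the 0-0 positions entirely and leaves only the count of shared tokens.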
Initialize AZZOO instance.
- Parameters:
sigma (float) -- Sigma designates the contribution to similarity given by the 0-0 samples in the set.
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) -> float
Return the normalized AZZOO similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized AZZOO similarity
- Return type:
float
Examples
>>> cmp = AZZOO()
>>> cmp.sim('cat', 'hat')
0.9923857868020305
>>> cmp.sim('Niall', 'Neil')
0.9860759493670886
>>> cmp.sim('aluminum', 'Catalan')
0.9710327455919395
>>> cmp.sim('ATCG', 'TAGC')
0.9809885931558935
New in version 0.4.0.
- sim_score(src: str, tar: str) -> float
Return the AZZOO similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
AZZOO similarity
- Return type:
float
Examples
>>> cmp = AZZOO()
>>> cmp.sim_score('cat', 'hat')
391.0
>>> cmp.sim_score('Niall', 'Neil')
389.5
>>> cmp.sim_score('aluminum', 'Catalan')
385.5
>>> cmp.sim_score('ATCG', 'TAGC')
387.0
New in version 0.4.0.
- class abydos.distance.Anderberg(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Anderberg's D.
For two sets X and Y and a population N, Anderberg's D [And73] is
\[\begin{split}t_1 = max(|X \cap Y|, |X \setminus Y|)+ max(|Y \setminus X|, |(N \setminus X) \setminus Y|)+\\ max(|X \cap Y|, |Y \setminus X|)+ max(|X \setminus Y|, |(N \setminus X) \setminus Y|)\\ \\ t_2 = max(|Y|, |N \setminus Y|)+max(|X|, |N \setminus X|)\\ \\ sim_{Anderberg}(X, Y) = \frac{t_1-t_2}{2|N|}\end{split}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Anderberg} = \frac{(max(a,b)+max(c,d)+max(a,c)+max(b,d))- (max(a+c,b+d)+max(a+b,c+d))}{2n}\]
Notes
There are various references to another "Anderberg similarity", \(sim_{Anderberg} = \frac{8a}{8a+b+c}\), but I cannot substantiate the claim that this appears in [And73]. In any case, if you want to use this measure, you may instantiate WeightedJaccard with weight=8.
Anderberg states that "[t]his quantity is the actual reduction in the error probability (also the actual increase in the correct prediction) as a consequence of using predictor information" [And73]. It ranges [0, 0.5] so a sim method ranging [0, 1] is provided in addition to sim_score, which gives the value D itself.
It is difficult to term this measure a similarity score. Identical strings often fail to gain high scores. Also, strings that would otherwise be considered quite similar often earn lower scores than those that are less similar.
New in version 0.4.0.
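The remark that identical strings often fail to score highly can be seen by evaluating D from confusion-table counts. The sketch below follows the set formulation, using |X| = a+b, |Y| = a+c, |N \ X| = c+d and |N \ Y| = b+d; the function name `anderberg_d` and the counts are hypothetical, not taken from the library.

```python
def anderberg_d(a: int, b: int, c: int, d: int) -> float:
    """Evaluate Anderberg's D = (t1 - t2) / (2n) from 2x2 counts."""
    n = a + b + c + d
    if n == 0:
        return 0.0  # assumption of this sketch for an empty population
    t1 = max(a, b) + max(c, d) + max(a, c) + max(b, d)
    # t2 = max(|Y|, |N \ Y|) + max(|X|, |N \ X|)
    t2 = max(a + c, b + d) + max(a + b, c + d)
    return (t1 - t2) / (2 * n)


# Two identical 4-token sets in a population of 784 tokens:
# d dominates every max() term, so D stays close to 0.
print(anderberg_d(4, 0, 0, 780))
# A small, balanced population gives a much larger value.
print(anderberg_d(5, 1, 1, 5))
```

When the population is far larger than either token set, d dominates each maximum and the two sums nearly cancel, which is why the score stays near zero.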
Initialize Anderberg instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) -> float
Return the normalized Anderberg's D similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Anderberg similarity
- Return type:
float
Examples
>>> cmp = Anderberg()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) -> float
Return the Anderberg's D similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Anderberg similarity
- Return type:
float
Examples
>>> cmp = Anderberg()
>>> cmp.sim_score('cat', 'hat')
0.0
>>> cmp.sim_score('Niall', 'Neil')
0.0
>>> cmp.sim_score('aluminum', 'Catalan')
0.0
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.AndresMarzoDelta(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Andres & Marzo's Delta correlation.
For two sets X and Y and a population N, Andres & Marzo's \(\Delta\) correlation [AndresM04] is
\[corr_{AndresMarzo_\Delta}(X, Y) = \Delta = \frac{|X \cap Y| + |(N \setminus X) \setminus Y| - 2\sqrt{|X \setminus Y| \cdot |Y \setminus X|}}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{AndresMarzo_\Delta} = \Delta = \frac{a+d-2\sqrt{b \cdot c}}{n}\]
New in version 0.4.0.
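A minimal sketch of the confusion-table form, assuming the four counts are already known. The function name `andres_marzo_delta` and the counts are hypothetical illustrations, not the library's code.

```python
from math import sqrt


def andres_marzo_delta(a: int, b: int, c: int, d: int) -> float:
    """Evaluate Delta = (a + d - 2*sqrt(b*c)) / n from 2x2 counts."""
    n = a + b + c + d
    if n == 0:
        return 0.0  # assumption of this sketch for an empty population
    return (a + d - 2 * sqrt(b * c)) / n


# Hypothetical counts: 2 shared tokens, 2 unique to each string,
# and 778 tokens of the population in neither string.
print(andres_marzo_delta(2, 2, 2, 778))
```

The geometric-mean penalty 2*sqrt(b*c) vanishes whenever one string has no unique tokens, so a strict subset is not penalized at all.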
Initialize AndresMarzoDelta instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) -> float
Return the Andres & Marzo's Delta correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Andres & Marzo's Delta correlation
- Return type:
float
Examples
>>> cmp = AndresMarzoDelta()
>>> cmp.corr('cat', 'hat')
0.9897959183673469
>>> cmp.corr('Niall', 'Neil')
0.9822344346552608
>>> cmp.corr('aluminum', 'Catalan')
0.9618259496215341
>>> cmp.corr('ATCG', 'TAGC')
0.9744897959183674
New in version 0.4.0.
- sim(src: str, tar: str) -> float
Return the Andres & Marzo's Delta similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Andres & Marzo's Delta similarity
- Return type:
float
Examples
>>> cmp = AndresMarzoDelta()
>>> cmp.sim('cat', 'hat')
0.9948979591836735
>>> cmp.sim('Niall', 'Neil')
0.9911172173276304
>>> cmp.sim('aluminum', 'Catalan')
0.980912974810767
>>> cmp.sim('ATCG', 'TAGC')
0.9872448979591837
New in version 0.4.0.
- class abydos.distance.AverageLinkage(tokenizer: Optional[_Tokenizer] = None, metric: Optional[_Distance] = None, **kwargs: Any)
Bases: _TokenDistance
Average linkage distance.
For two lists of tokens X and Y, average linkage distance [DD16] is
\[dist_{AverageLinkage}(X, Y) = \frac{\sum_{i \in X} \sum_{j \in Y} dist(X_i, Y_j)}{|X| \cdot |Y|}\]
New in version 0.4.0.
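The double sum can be sketched in plain Python. Everything here is an assumption made for illustration rather than the library's implementation: tokens are bigrams padded with '$' and '#', and the inner distance is Levenshtein distance divided by the longer token's length. The helper names are hypothetical.

```python
def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]


def norm_lev(s: str, t: str) -> float:
    """Levenshtein distance scaled to [0, 1] by the longer length."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0


def padded_bigrams(word: str) -> list:
    """Bigrams of the word with assumed start/stop symbols '$' and '#'."""
    padded = '$' + word + '#'
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def average_linkage(src: str, tar: str) -> float:
    """Mean pairwise token distance over all |X| * |Y| token pairs."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    return sum(norm_lev(p, q) for p in x for q in y) / (len(x) * len(y))


print(average_linkage('cat', 'hat'))
```

Since every token of one string is compared with every token of the other, even identical strings yield a nonzero distance under this measure.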
Initialize AverageLinkage instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants. (Defaults to Levenshtein distance)
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the average linkage distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
average linkage distance
- Return type:
float
Examples
>>> cmp = AverageLinkage()
>>> cmp.dist('cat', 'hat')
0.8125
>>> cmp.dist('Niall', 'Neil')
0.8333333333333334
>>> cmp.dist('aluminum', 'Catalan')
0.9166666666666666
>>> cmp.dist('ATCG', 'TAGC')
0.8
New in version 0.4.0.
- class abydos.distance.BISIM(qval: int = 2, **kwargs: Any)
Bases: _Distance
BI-SIM similarity.
BI-SIM similarity [KD03] is an n-gram based, edit-distance derived similarity measure.
New in version 0.4.0.
Initialize BISIM instance.
- Parameters:
qval (int) -- The number of characters to consider in each n-gram (q-gram). By default this is 2, hence BI-SIM. But TRI-SIM can be calculated by setting this to 3.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) -> float
Return the BI-SIM similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
BI-SIM similarity
- Return type:
float
Examples
>>> cmp = BISIM()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.4
>>> cmp.sim('aluminum', 'Catalan')
0.3125
>>> cmp.sim('ATCG', 'TAGC')
0.375
New in version 0.4.0.
- class abydos.distance.BLEU(n_min: int = 1, n_max: int = 4, tokenizers: Optional[List[_Tokenizer]] = None, weights: Optional[List[float]] = None, **kwargs: Any)
Bases: _Distance
BLEU similarity.
BLEU similarity [PRWZ02] compares two strings for similarity using a set of tokenizers and a brevity penalty:
\[\begin{split}BP = \left\{ \begin{array}{lrl} 1 & \textup{if} & c > r \\ e^{(1-\frac{r}{c})} & \textup{if} & c \leq r \end{array} \right.\end{split}\]
The BLEU score is then:
\[\textup{B\textsc{leu}} = BP \cdot e^{\sum_{n=1}^N w_n \log p_n}\]
For tokenizers 1 to N, by default q-gram tokenizers for q=1 to N in Abydos, weights \(w_n\), which are uniformly \(\frac{1}{N}\), and \(p_n\):
\[p_n = \frac{\sum_{token \in tar} min(Count(token \in tar), Count(token \in src))}{|tar|}\]
New in version 0.4.0.
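The three formulas above can be combined in a short sketch. Several choices here are assumptions rather than documented behavior: c is taken as the length of tar and r as the length of src, tokens are unpadded character n-grams, weights are uniform, and any zero precision gives a score of 0.0 to avoid log(0). The sketch is therefore not expected to reproduce the library's example values.

```python
from collections import Counter
from math import exp, log


def char_ngrams(text: str, n: int) -> Counter:
    """Count the unpadded character n-grams of text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def bleu_sketch(src: str, tar: str, n_min: int = 1, n_max: int = 4) -> float:
    if not src or not tar:
        return 0.0
    c, r = len(tar), len(src)  # assumption: tar is the candidate
    brevity_penalty = 1.0 if c > r else exp(1 - r / c)
    orders = range(n_min, n_max + 1)
    weight = 1 / len(orders)   # uniform weights w_n = 1/N
    log_sum = 0.0
    for n in orders:
        src_counts, tar_counts = char_ngrams(src, n), char_ngrams(tar, n)
        total = sum(tar_counts.values())
        # Clipped matches: each tar token counts at most as often as in src.
        clipped = sum(min(count, src_counts[token])
                      for token, count in tar_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # assumption: treat log(0) as a zero score
        log_sum += weight * log(clipped / total)
    return brevity_penalty * exp(log_sum)


# tar is half the length of src, so only the brevity penalty lowers the score.
print(bleu_sketch('abab', 'ab', n_max=2))
```

The clipping step is what stops a candidate from gaining credit by repeating a token more often than the reference contains it.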
Initialize BLEU instance.
- Parameters:
n_min (int) -- The minimum q-gram value for BLEU score calculation (1 by default)
n_max (int) -- The maximum q-gram value for BLEU score calculation (4 by default)
tokenizers (list(_Tokenizer)) -- A list of initialized tokenizers
weights (list(float)) -- A list of floats representing the weights of the tokenizers. If tokenizers is set, this must have the same length. If n_min and n_max are used to set tokenizers, this must have length equal to n_max-n_min+1. Otherwise, uniform weights will be used.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) -> float
Return the BLEU similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
BLEU similarity
- Return type:
float
Examples
>>> cmp = BLEU()
>>> cmp.sim('cat', 'hat')
0.7598356856515925
>>> cmp.sim('Niall', 'Neil')
0.7247557929987696
>>> cmp.sim('aluminum', 'Catalan')
0.44815260192961937
>>> cmp.sim('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- class abydos.distance.Bag(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Bag distance.
Bag distance is proposed in [BCP02]. It is defined as
\[dist_{bag}(src, tar) = max(|multiset(src)-multiset(tar)|, |multiset(tar)-multiset(src)|)\]
New in version 0.3.6.
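The multiset differences map naturally onto Python's `collections.Counter`, whose subtraction drops non-positive counts. The sketch below assumes character-level tokens; `bag_distance` is a hypothetical helper, not the library's method.

```python
from collections import Counter


def bag_distance(src: str, tar: str) -> int:
    """Bag distance over character multisets."""
    src_bag, tar_bag = Counter(src), Counter(tar)
    # Counter subtraction keeps only positive counts, i.e. the
    # multiset difference in each direction.
    only_in_src = sum((src_bag - tar_bag).values())
    only_in_tar = sum((tar_bag - src_bag).values())
    return max(only_in_src, only_in_tar)


print(bag_distance('aluminum', 'Catalan'))
```

Because only token counts matter, any two anagrams are at distance 0, as with 'ATCG' and 'TAGC'.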
Initialize Bag instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the normalized bag distance between two strings.
Bag distance is normalized by dividing by \(max( |src|, |tar| )\).
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized bag distance
- Return type:
float
Examples
>>> cmp = Bag()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.4
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str, normalized: bool = False) -> float
Return the bag distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
normalized (bool) -- Normalizes to [0, 1] if True
- Returns:
Bag distance
- Return type:
int or float
Examples
>>> cmp = Bag()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('aluminum', 'Catalan')
5
>>> cmp.dist_abs('ATCG', 'TAGC')
0
>>> cmp.dist_abs('abcdefg', 'hijklm')
7
>>> cmp.dist_abs('abcdefg', 'hijklmno')
8
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.BaroniUrbaniBuserI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baroni-Urbani & Buser I similarity.
For two sets X and Y and a population N, the Baroni-Urbani & Buser I similarity [BUB76] is
\[sim_{BaroniUrbaniBuserI}(X, Y) = \frac{\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} + |X \cap Y|} {\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} + |X \cap Y| + |X \setminus Y| + |Y \setminus X|}\]
This is the second, but more commonly used and referenced of the two similarities proposed by Baroni-Urbani & Buser.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{BaroniUrbaniBuserI} = \frac{\sqrt{ad}+a}{\sqrt{ad}+a+b+c}\]
New in version 0.4.0.
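A minimal sketch of the confusion-table form, assuming the counts are given. The function name and the zero-denominator handling are assumptions of this example, not the library's behavior.

```python
from math import sqrt


def baroni_urbani_buser_i(a: int, b: int, c: int, d: int) -> float:
    """Evaluate (sqrt(ad) + a) / (sqrt(ad) + a + b + c)."""
    root = sqrt(a * d)
    denominator = root + a + b + c
    if denominator == 0:
        return 0.0  # assumption of this sketch: both sets are empty
    return (root + a) / denominator


# Hypothetical counts: 2 shared tokens, 2 unique to each string,
# and 778 tokens of the population in neither string.
print(baroni_urbani_buser_i(2, 2, 2, 778))
# With no shared tokens (a = 0) the numerator vanishes.
print(baroni_urbani_buser_i(0, 5, 5, 774))
```

The sqrt(ad) term lets shared absences raise the score, but only when there is at least one shared token.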
Initialize BaroniUrbaniBuserI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) -> float
Return the Baroni-Urbani & Buser I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baroni-Urbani & Buser I similarity
- Return type:
float
Examples
>>> cmp = BaroniUrbaniBuserI()
>>> cmp.sim('cat', 'hat')
0.9119837740878104
>>> cmp.sim('Niall', 'Neil')
0.8552823175014205
>>> cmp.sim('aluminum', 'Catalan')
0.656992712054851
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.BaroniUrbaniBuserII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baroni-Urbani & Buser II correlation.
For two sets X and Y and a population N, the Baroni-Urbani & Buser II correlation [BUB76] is
\[corr_{BaroniUrbaniBuserII}(X, Y) = \frac{\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} + |X \cap Y| - |X \setminus Y| - |Y \setminus X|} {\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} + |X \cap Y| + |X \setminus Y| + |Y \setminus X|}\]
This is the first, but less commonly used and referenced of the two similarities proposed by Baroni-Urbani & Buser.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{BaroniUrbaniBuserII} = \frac{\sqrt{ad}+a-b-c}{\sqrt{ad}+a+b+c}\]
New in version 0.4.0.
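Since the two Baroni-Urbani & Buser measures share a denominator, the II correlation is an affine rescaling of the I similarity: corr_II = 2 * sim_I - 1. The sketch below checks this identity numerically; the function names and counts are hypothetical, and a + b + c > 0 is assumed so that the denominator is nonzero.

```python
from math import sqrt


def bub_i_sim(a: int, b: int, c: int, d: int) -> float:
    """Baroni-Urbani & Buser I from 2x2 counts (assumes a + b + c > 0)."""
    root = sqrt(a * d)
    return (root + a) / (root + a + b + c)


def bub_ii_corr(a: int, b: int, c: int, d: int) -> float:
    """Baroni-Urbani & Buser II from 2x2 counts (assumes a + b + c > 0)."""
    root = sqrt(a * d)
    return (root + a - b - c) / (root + a + b + c)


counts = (2, 4, 3, 775)  # hypothetical a, b, c, d
# Both expressions give the same value up to floating-point rounding.
print(bub_ii_corr(*counts), 2 * bub_i_sim(*counts) - 1)
```

Mapping the correlation from [-1, 1] back onto [0, 1] therefore recovers the I similarity, which is consistent with the matching sim values listed for the two classes.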
Initialize BaroniUrbaniBuserII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) -> float
Return the Baroni-Urbani & Buser II correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baroni-Urbani & Buser II correlation
- Return type:
float
Examples
>>> cmp = BaroniUrbaniBuserII()
>>> cmp.corr('cat', 'hat')
0.8239675481756209
>>> cmp.corr('Niall', 'Neil')
0.7105646350028408
>>> cmp.corr('aluminum', 'Catalan')
0.31398542410970204
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) -> float
Return the Baroni-Urbani & Buser II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baroni-Urbani & Buser II similarity
- Return type:
float
Examples
>>> cmp = BaroniUrbaniBuserII()
>>> cmp.sim('cat', 'hat')
0.9119837740878105
>>> cmp.sim('Niall', 'Neil')
0.8552823175014204
>>> cmp.sim('aluminum', 'Catalan')
0.656992712054851
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.BatageljBren(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Batagelj & Bren distance.
For two sets X and Y and a population N, the Batagelj & Bren distance [BB95], Batagelj & Bren's \(Q_0\), is
\[dist_{BatageljBren}(X, Y) = \frac{|X \setminus Y| \cdot |Y \setminus X|} {|X \cap Y| \cdot |(N \setminus X) \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BatageljBren} = \frac{bc}{ad}\]
New in version 0.4.0.
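The ratio bc/ad is unbounded, which a short sketch makes concrete. The function name is hypothetical, and returning infinity for a zero denominator is an assumption of this example rather than documented behavior.

```python
def batagelj_bren(a: int, b: int, c: int, d: int) -> float:
    """Evaluate bc / ad from 2x2 confusion-table counts."""
    if a * d == 0:
        # Assumption of this sketch: a zero denominator yields infinity.
        return float('inf')
    return (b * c) / (a * d)


# Hypothetical counts with two shared tokens: a small finite value.
print(batagelj_bren(2, 2, 2, 778))
# No shared tokens (a = 0): the ratio is unbounded.
print(batagelj_bren(0, 5, 5, 774))
```

This unboundedness is why a separate normalized `dist` is useful alongside the raw `dist_abs` value.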
Initialize BatageljBren instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the normalized Batagelj & Bren distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Batagelj & Bren distance
- Return type:
float
Examples
>>> cmp = BatageljBren()
>>> cmp.dist('cat', 'hat')
3.2789465400556106e-06
>>> cmp.dist('Niall', 'Neil')
9.874917709019092e-06
>>> cmp.dist('aluminum', 'Catalan')
9.276668350823718e-05
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) -> float
Return the Batagelj & Bren distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Batagelj & Bren distance
- Return type:
float
Examples
>>> cmp = BatageljBren()
>>> cmp.dist_abs('cat', 'hat')
0.002570694087403599
>>> cmp.dist_abs('Niall', 'Neil')
0.007741935483870968
>>> cmp.dist_abs('aluminum', 'Catalan')
0.07282184655396619
>>> cmp.dist_abs('ATCG', 'TAGC')
inf
New in version 0.4.0.
- class abydos.distance.BaulieuI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu I distance.
For two sets X and Y and a population N, Baulieu I distance [Bau89] is
\[dist_{BaulieuI}(X, Y) = \frac{|X| \cdot |Y| - |X \cap Y|^2}{|X| \cdot |Y|}\]
This is Baulieu's 12th dissimilarity coefficient.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuI} = \frac{(a+b)(a+c)-a^2}{(a+b)(a+c)}\]
New in version 0.4.0.
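This coefficient depends only on the two set sizes and their overlap, so the population size never enters. A minimal sketch, with a hypothetical function name and an assumed value of 0.0 for empty input:

```python
def baulieu_i(a: int, b: int, c: int) -> float:
    """Evaluate ((a+b)(a+c) - a^2) / ((a+b)(a+c)); d is not needed."""
    product = (a + b) * (a + c)
    if product == 0:
        return 0.0  # assumption of this sketch for an empty set
    return (product - a * a) / product


# Two 4-token sets sharing 2 tokens: (16 - 4) / 16.
print(baulieu_i(2, 2, 2))
```

The value reaches 0 only when the two sets coincide (b = c = 0) and 1 when they share no tokens.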
Initialize BaulieuI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu I distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu I distance
- Return type:
float
Examples
>>> cmp = BaulieuI()
>>> cmp.dist('cat', 'hat')
0.75
>>> cmp.dist('Niall', 'Neil')
0.8666666666666667
>>> cmp.dist('aluminum', 'Catalan')
0.9861111111111112
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- class abydos.distance.BaulieuII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu II similarity.
For two sets X and Y and a population N, Baulieu II similarity [Bau89] is
\[sim_{BaulieuII}(X, Y) = \frac{|X \cap Y|^2 \cdot |(N \setminus X) \setminus Y|^2} {|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}\]
This is based on Baulieu's 13th dissimilarity coefficient.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{BaulieuII} = \frac{a^2d^2}{(a+b)(a+c)(b+d)(c+d)}\]
New in version 0.4.0.
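A minimal sketch of the confusion-table form, where (a+b), (a+c), (c+d) and (b+d) stand for |X|, |Y|, |N \ X| and |N \ Y| respectively. The function name, counts, and zero-denominator handling are assumptions of this example.

```python
def baulieu_ii(a: int, b: int, c: int, d: int) -> float:
    """Evaluate a^2 d^2 / ((a+b)(a+c)(b+d)(c+d)) from 2x2 counts."""
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0  # assumption of this sketch for a degenerate table
    return (a * a * d * d) / denominator


# Hypothetical counts: 2 shared tokens, 2 unique to each string,
# and 778 tokens of the population in neither string.
print(baulieu_ii(2, 2, 2, 778))
```

Because a enters squared, sharing only half of each set's tokens already pulls the score down to roughly a quarter.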
Initialize BaulieuII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) -> float
Return the Baulieu II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu II similarity
- Return type:
float
Examples
>>> cmp = BaulieuII()
>>> cmp.sim('cat', 'hat')
0.24871959237343852
>>> cmp.sim('Niall', 'Neil')
0.13213719608444902
>>> cmp.sim('aluminum', 'Catalan')
0.013621892326789235
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.BaulieuIII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu III distance.
For two sets X and Y and a population N, Baulieu III distance [Bau89] is
\[dist_{BaulieuIII}(X, Y) = \frac{|N|^2 - 4(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)}{2 \cdot |N|^2}\]
This is based on Baulieu's 20th dissimilarity coefficient.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuIII} = \frac{n^2 - 4(ad-bc)}{2n^2}\]
Notes
This distance is exactly half of Baulieu's 20th dissimilarity coefficient. According to [Bau89], the 20th dissimilarity should be a value in the range [0.0, 1.0], meeting the article's (P1) property, but the formula given ranges [0.0, 2.0], so dividing by 2 corrects the formula to meet the article's expectations.
New in version 0.4.0.
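The halving described in the notes can be checked with a small stand-alone sketch. The function name `baulieu_iii_dist` is hypothetical and not part of abydos; it works directly on confusion-table counts. The counts used for 'cat' vs. 'hat' (a=2, b=2, c=2, d=778) are inferred from the documented example outputs, assuming padded bigram tokens and a population of 784, and are not taken from the library source.

```python
def baulieu_iii_dist(a: int, b: int, c: int, d: int) -> float:
    """Baulieu III distance from 2x2 confusion table counts (illustrative)."""
    n = a + b + c + d
    # The unhalved coefficient, 1 - 4(ad - bc) / n^2, ranges over [0, 2].
    raw = (n * n - 4 * (a * d - b * c)) / (n * n)
    # Dividing by 2 maps it onto [0, 1].
    return raw / 2


# Assumed counts for 'cat' vs. 'hat' (see the paragraph above).
print(baulieu_iii_dist(2, 2, 2, 778))
# Extremes: perfect agreement and perfect disagreement on a balanced table.
print(baulieu_iii_dist(5, 0, 0, 5), baulieu_iii_dist(0, 5, 5, 0))
```

With a balanced table, perfect agreement gives 0.0 and perfect disagreement gives 1.0, which is the [0.0, 1.0] range the notes refer to.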
Initialize BaulieuIII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu III distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu III distance
- Return type:
float
Examples
>>> cmp = BaulieuIII()
>>> cmp.dist('cat', 'hat')
0.4949500208246564
>>> cmp.dist('Niall', 'Neil')
0.4949955747605165
>>> cmp.dist('aluminum', 'Catalan')
0.49768591017891195
>>> cmp.dist('ATCG', 'TAGC')
0.5000813463140358
New in version 0.4.0.
- class abydos.distance.BaulieuIV(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', positive_irrational: float = 2.718281828459045, **kwargs: Any)
Bases: _TokenDistance
Baulieu IV distance.
For two sets X and Y, a population N, and a positive irrational number k, Baulieu IV distance [Bau97] is
\[dist_{BaulieuIV}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X| - (|X \cap Y| + \frac{1}{2}) \cdot (|(N \setminus X) \setminus Y| + \frac{1}{2}) \cdot |(N \setminus X) \setminus Y| \cdot k}{|N|}\]
This is Baulieu's 22nd dissimilarity coefficient.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuIV} = \frac{b+c-(a+\frac{1}{2})(d+\frac{1}{2})dk}{n}\]
Notes
The default value of k is Euler's number \(e\), but other irrationals such as \(\pi\) or \(\sqrt{2}\) could be substituted at initialization.
New in version 0.4.0.
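As a rough illustration of how k enters the unnormalized value, the sketch below evaluates the confusion-table formula directly. `baulieu_iv_dist_abs` is a hypothetical helper, not the abydos method, and the counts for 'cat' vs. 'hat' (a=2, b=2, c=2, d=778) are inferred from the documented outputs under the assumption of padded bigrams and a population of 784.

```python
import math


def baulieu_iv_dist_abs(a: int, b: int, c: int, d: int, k: float = math.e) -> float:
    """Unnormalized Baulieu IV distance from 2x2 counts (illustrative)."""
    n = a + b + c + d
    # The product term grows roughly with d^2, so it dominates b + c
    # whenever the population is large relative to the token sets.
    return (b + c - (a + 0.5) * (d + 0.5) * d * k) / n


# Default k = e, then pi substituted as the notes suggest.
print(baulieu_iv_dist_abs(2, 2, 2, 778))
print(baulieu_iv_dist_abs(2, 2, 2, 778, k=math.pi))
```

Because d is large for short strings drawn from a large alphabet, the result is strongly negative, which is why the documented `dist_abs` examples are in the thousands below zero.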
Initialize BaulieuIV instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
positive_irrational (float) -- The positive irrational number k used in the formula; Euler's number by default
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the normalized Baulieu IV distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Baulieu IV distance
- Return type:
float
Examples
>>> cmp = BaulieuIV()
>>> cmp.dist('cat', 'hat')
0.49999799606535283
>>> cmp.dist('Niall', 'Neil')
0.49999801148659684
>>> cmp.dist('aluminum', 'Catalan')
0.49999883126809364
>>> cmp.dist('ATCG', 'TAGC')
0.4999996033268451
New in version 0.4.0.
- dist_abs(src: str, tar: str) -> float
Return the Baulieu IV distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu IV distance
- Return type:
float
Examples
>>> cmp = BaulieuIV()
>>> cmp.dist_abs('cat', 'hat')
-5249.96272285802
>>> cmp.dist_abs('Niall', 'Neil')
-5209.561726488335
>>> cmp.dist_abs('aluminum', 'Catalan')
-3073.6070822721244
>>> cmp.dist_abs('ATCG', 'TAGC')
-1039.2151656463932
New in version 0.4.0.
- class abydos.distance.BaulieuIX(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu IX distance.
For two sets X and Y and a population N, Baulieu IX distance [Bau97] is
\[dist_{BaulieuIX}(X, Y) = \frac{|X \setminus Y| + 2 \cdot |Y \setminus X|}{|N| + |Y \setminus X|}\]
This is Baulieu's 27th dissimilarity coefficient. This coefficient fails Baulieu's (P7) property, that \(D(a,b,c,d) = D(a,c,b,d)\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuIX} = \frac{b+2c}{a+b+2c+d}\]
New in version 0.4.0.
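The (P7) failure is an asymmetry in b and c, which a short sketch makes concrete. `baulieu_ix_dist` is a hypothetical stand-alone function, not the abydos implementation. The counts for 'Niall' vs. 'Neil' (a=2, b=4, c=3, d=775) are inferred from the documented outputs, assuming padded bigrams and a population of 784.

```python
def baulieu_ix_dist(a: int, b: int, c: int, d: int) -> float:
    """Baulieu IX distance from 2x2 confusion table counts (illustrative)."""
    # c (tokens only in the target) is weighted twice, b only once.
    return (b + 2 * c) / (a + b + 2 * c + d)


forward = baulieu_ix_dist(2, 4, 3, 775)   # assumed counts for 'Niall' vs. 'Neil'
backward = baulieu_ix_dist(2, 3, 4, 775)  # same pair with b and c swapped
print(forward, backward, forward == backward)
```

Swapping source and target swaps b and c, so the two directions generally give different distances.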
Initialize BaulieuIX instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu IX distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu IX distance
- Return type:
float
Examples
>>> cmp = BaulieuIX()
>>> cmp.dist('cat', 'hat')
0.007633587786259542
>>> cmp.dist('Niall', 'Neil')
0.012706480304955527
>>> cmp.dist('aluminum', 'Catalan')
0.027777777777777776
>>> cmp.dist('ATCG', 'TAGC')
0.019011406844106463
New in version 0.4.0.
- class abydos.distance.BaulieuV(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu V distance.
For two sets X and Y and a population N, Baulieu V distance [Bau97] is
\[dist_{BaulieuV}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X| + 1}{|X \cap Y| + |X \setminus Y| + |Y \setminus X| + 1}\]
This is Baulieu's 23rd dissimilarity coefficient. This coefficient fails Baulieu's (P2) property, that \(D(a,0,0,0) = 0\). Rather, \(D(a,0,0,0) > 0\), but \(\lim_{a \to \infty} D(a,0,0,0) = 0\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuV} = \frac{b+c+1}{a+b+c+1}\]
New in version 0.4.0.
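The (P2) behaviour can be seen by evaluating the formula for identical sets of growing size. This is a minimal sketch on raw counts; `baulieu_v_dist` is a hypothetical name, and the counts a=2, b=2, c=2 for 'cat' vs. 'hat' assume padded bigram tokens.

```python
def baulieu_v_dist(a: int, b: int, c: int) -> float:
    """Baulieu V distance from confusion table counts (illustrative)."""
    return (b + c + 1) / (a + b + c + 1)


print(baulieu_v_dist(2, 2, 2))  # assumed counts for 'cat' vs. 'hat'

# Identical sets (b = c = 0): the distance is 1 / (a + 1), never exactly 0,
# but it shrinks toward 0 as the shared set grows.
for a in (1, 10, 1000):
    print(a, baulieu_v_dist(a, 0, 0))
```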
Initialize BaulieuV instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu V distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu V distance
- Return type:
float
Examples
>>> cmp = BaulieuV()
>>> cmp.dist('cat', 'hat')
0.7142857142857143
>>> cmp.dist('Niall', 'Neil')
0.8
>>> cmp.dist('aluminum', 'Catalan')
0.9411764705882353
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- class abydos.distance.BaulieuVI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu VI distance.
For two sets X and Y and a population N, Baulieu VI distance [Bau97] is
\[dist_{BaulieuVI}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|X \cap Y| + |X \setminus Y| + |Y \setminus X| + 1}\]
This is Baulieu's 24th dissimilarity coefficient. This coefficient fails Baulieu's (P3) property, that \(D(a,b,c,d) = 1\) for some (a,b,c,d). Rather, \(D(a,b,c,d) < 1\), but \(\lim_{b \to \infty, c \to \infty} D(a,b,c,d) = 1\) for \(a = 0\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuVI} = \frac{b+c}{a+b+c+1}\]
New in version 0.4.0.
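A brief sketch of the (P3) behaviour: with an empty intersection the value climbs toward 1 as the disjoint parts grow, but the extra 1 in the denominator keeps it strictly below 1. `baulieu_vi_dist` is a hypothetical helper operating on raw counts, not the abydos method.

```python
def baulieu_vi_dist(a: int, b: int, c: int) -> float:
    """Baulieu VI distance from confusion table counts (illustrative)."""
    return (b + c) / (a + b + c + 1)


# Disjoint sets (a = 0) of increasing size: (b + c) / (b + c + 1).
for size in (5, 500, 50_000):
    print(size, baulieu_vi_dist(0, size, size))
```

With a=0 and b=c=5 this gives 10/11, matching the documented value for 'ATCG' vs. 'TAGC' if those strings share no padded bigrams.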
Initialize BaulieuVI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu VI distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu VI distance
- Return type:
float
Examples
>>> cmp = BaulieuVI()
>>> cmp.dist('cat', 'hat')
0.5714285714285714
>>> cmp.dist('Niall', 'Neil')
0.7
>>> cmp.dist('aluminum', 'Catalan')
0.8823529411764706
>>> cmp.dist('ATCG', 'TAGC')
0.9090909090909091
New in version 0.4.0.
- class abydos.distance.BaulieuVII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu VII distance.
For two sets X and Y and a population N, Baulieu VII distance [Bau97] is
\[dist_{BaulieuVII}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|N| + |X \cap Y| \cdot (|X \cap Y| - 4)^2}\]
This is Baulieu's 25th dissimilarity coefficient. This coefficient fails Baulieu's (P4) property, that \(D(a+1,b,c,d) \leq D(a,b,c,d)\), with equality holding iff \(D(a,b,c,d) = 0\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuVII} = \frac{b+c}{n + a \cdot (a-4)^2}\]
New in version 0.4.0.
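One way to see the (P4) failure is to step a from 3 to 4, where the \(a \cdot (a-4)^2\) term drops to zero and the distance rises instead of falling. The sketch below uses a hypothetical `baulieu_vii_dist` on raw counts; the values b=2, c=2, d=778 are illustrative assumptions rather than output from abydos.

```python
def baulieu_vii_dist(a: int, b: int, c: int, d: int) -> float:
    """Baulieu VII distance from 2x2 confusion table counts (illustrative)."""
    n = a + b + c + d
    return (b + c) / (n + a * (a - 4) ** 2)


at_three = baulieu_vii_dist(3, 2, 2, 778)  # denominator: 785 + 3 * 1
at_four = baulieu_vii_dist(4, 2, 2, 778)   # denominator: 786 + 4 * 0
print(at_three, at_four, at_four > at_three)
```

Adding one more shared token should not increase a well-behaved dissimilarity, yet here it does.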
Initialize BaulieuVII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu VII distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu VII distance
- Return type:
float
Examples
>>> cmp = BaulieuVII()
>>> cmp.dist('cat', 'hat')
0.005050505050505051
>>> cmp.dist('Niall', 'Neil')
0.008838383838383838
>>> cmp.dist('aluminum', 'Catalan')
0.018891687657430732
>>> cmp.dist('ATCG', 'TAGC')
0.012755102040816327
New in version 0.4.0.
- class abydos.distance.BaulieuVIII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu VIII distance.
For two sets X and Y and a population N, Baulieu VIII distance [Bau97] is
\[dist_{BaulieuVIII}(X, Y) = \frac{(|X \setminus Y| - |Y \setminus X|)^2}{|N|^2}\]
This is Baulieu's 26th dissimilarity coefficient. This coefficient fails Baulieu's (P5) property, that \(D(a,b+1,c,d) \geq D(a,b,c,d)\), with equality holding if \(D(a,b,c,d) = 1\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuVIII} = \frac{(b-c)^2}{n^2}\]
New in version 0.4.0.
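Because only the difference b-c appears in the numerator, two strings with equally many unmatched tokens get distance 0 even when they differ, and adding a mismatch can lower the distance. The following sketch uses a hypothetical `baulieu_viii_dist` on raw counts; the counts themselves are assumptions inferred from the documented outputs (padded bigrams, population 784).

```python
def baulieu_viii_dist(a: int, b: int, c: int, d: int) -> float:
    """Baulieu VIII distance from 2x2 confusion table counts (illustrative)."""
    n = a + b + c + d
    return (b - c) ** 2 / n ** 2


print(baulieu_viii_dist(2, 2, 2, 778))  # b == c, so the distance is 0.0
print(baulieu_viii_dist(2, 4, 3, 775))  # assumed counts for 'Niall' vs. 'Neil'

# (P5) failure: raising b from 3 to 4 while c = 4 lowers the distance to 0.
print(baulieu_viii_dist(2, 3, 4, 775), baulieu_viii_dist(2, 4, 4, 775))
```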
Initialize BaulieuVIII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu VIII distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu VIII distance
- Return type:
float
Examples
>>> cmp = BaulieuVIII()
>>> cmp.dist('cat', 'hat')
0.0
>>> cmp.dist('Niall', 'Neil')
1.6269262807163682e-06
>>> cmp.dist('aluminum', 'Catalan')
1.6227838857560144e-06
>>> cmp.dist('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.BaulieuX(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu X distance.
For two sets X and Y and a population N, Baulieu X distance [Bau97] is
\[dist_{BaulieuX}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X| + \max(|X \setminus Y|, |Y \setminus X|)}{|N| + \max(|X \setminus Y|, |Y \setminus X|)}\]
This is Baulieu's 28th dissimilarity coefficient. This coefficient fails Baulieu's (P8) property, that \(D\) is a rational function whose numerator and denominator are both (total) linear.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuX} = \frac{b+c+\max(b,c)}{n+\max(b,c)}\]
New in version 0.4.0.
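The max term is what makes the numerator and denominator non-linear in b and c. A small sketch, with `baulieu_x_dist` as a hypothetical stand-alone function and counts inferred from the documented outputs (assuming padded bigrams and a population of 784):

```python
def baulieu_x_dist(a: int, b: int, c: int, d: int) -> float:
    """Baulieu X distance from 2x2 confusion table counts (illustrative)."""
    n = a + b + c + d
    larger = max(b, c)  # the larger of the two one-sided differences
    return (b + c + larger) / (n + larger)


print(baulieu_x_dist(2, 2, 2, 778))  # assumed counts for 'cat' vs. 'hat'
print(baulieu_x_dist(2, 4, 3, 775))  # assumed counts for 'Niall' vs. 'Neil'
```

Unlike Baulieu IX, this form is symmetric in b and c, since max(b, c) does not depend on their order.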
Initialize BaulieuX instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu X distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu X distance
- Return type:
float
Examples
>>> cmp = BaulieuX()
>>> cmp.dist('cat', 'hat')
0.007633587786259542
>>> cmp.dist('Niall', 'Neil')
0.013959390862944163
>>> cmp.dist('aluminum', 'Catalan')
0.029003783102143757
>>> cmp.dist('ATCG', 'TAGC')
0.019011406844106463
New in version 0.4.0.
- class abydos.distance.BaulieuXI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu XI distance.
For two sets X and Y and a population N, Baulieu XI distance [Bau97] is
\[dist_{BaulieuXI}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|X \setminus Y| + |Y \setminus X| + |(N \setminus X) \setminus Y|}\]
This is Baulieu's 29th dissimilarity coefficient. This coefficient fails Baulieu's (P4) property, that \(D(a+1,b,c,d) \leq D(a,b,c,d)\), with equality holding iff \(D(a,b,c,d) = 0\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuXI} = \frac{b+c}{b+c+d}\]
New in version 0.4.0.
Initialize BaulieuXI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu XI distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu XI distance
- Return type:
float
Examples
>>> cmp = BaulieuXI()
>>> cmp.dist('cat', 'hat')
0.005115089514066497
>>> cmp.dist('Niall', 'Neil')
0.008951406649616368
>>> cmp.dist('aluminum', 'Catalan')
0.01913265306122449
>>> cmp.dist('ATCG', 'TAGC')
0.012755102040816327
New in version 0.4.0.
- class abydos.distance.BaulieuXII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu XII distance.
For two sets X and Y and a population N, Baulieu XII distance [Bau97] is
\[dist_{BaulieuXII}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|X \cap Y| + |X \setminus Y| + |Y \setminus X| - 1}\]
This is Baulieu's 30th dissimilarity coefficient. This coefficient fails Baulieu's (P5) property, that \(D(a,b+1,c,d) \geq D(a,b,c,d)\), with equality holding if \(D(a,b,c,d) = 1\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuXII} = \frac{b+c}{a+b+c-1}\]
Notes
In the special case of comparisons where the intersection (a) contains 0 members, the size of the intersection is set to 1, resulting in a distance of 1.0. This prevents the distance from exceeding 1.0 and similarity from becoming negative.
New in version 0.4.0.
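The special case in the notes can be sketched as follows. `baulieu_xii_dist` is a hypothetical helper on raw counts, not the abydos method. Returning 0.0 when b+c is 0 is an assumption made here to avoid a 0/0 division and is not stated in the documentation.

```python
def baulieu_xii_dist(a: int, b: int, c: int) -> float:
    """Baulieu XII distance from confusion table counts (illustrative)."""
    if b + c == 0:
        # Assumption of this sketch: identical token sets have distance 0.
        return 0.0
    if a == 0:
        # Per the notes: an empty intersection is treated as size 1, so the
        # result is (b + c) / (b + c) = 1.0 rather than a value above 1.0.
        a = 1
    return (b + c) / (a + b + c - 1)


print(baulieu_xii_dist(2, 2, 2))  # assumed counts for 'cat' vs. 'hat'
print(baulieu_xii_dist(0, 5, 5))  # no shared tokens
```

Without the adjustment, a=0 with b=c=5 would give 10/9, which exceeds 1.0.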
Initialize BaulieuXII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu XII distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu XII distance
- Return type:
float
Examples
>>> cmp = BaulieuXII()
>>> cmp.dist('cat', 'hat')
0.8
>>> cmp.dist('Niall', 'Neil')
0.875
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- class abydos.distance.BaulieuXIII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu XIII distance.
For two sets X and Y and a population N, Baulieu XIII distance [Bau97] is
\[dist_{BaulieuXIII}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X|} {|X \cap Y| + |X \setminus Y| + |Y \setminus X| + |X \cap Y| \cdot (|X \cap Y| - 4)^2}\]
This is Baulieu's 31st dissimilarity coefficient. This coefficient fails Baulieu's (P4) property, that \(D(a+1,b,c,d) \leq D(a,b,c,d)\), with equality holding iff \(D(a,b,c,d) = 0\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuXIII} = \frac{b+c}{a+b+c+a \cdot (a-4)^2}\]
New in version 0.4.0.
Initialize BaulieuXIII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu XIII distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu XIII distance
- Return type:
float
Examples
>>> cmp = BaulieuXIII()
>>> cmp.dist('cat', 'hat')
0.2857142857142857
>>> cmp.dist('Niall', 'Neil')
0.4117647058823529
>>> cmp.dist('aluminum', 'Catalan')
0.6
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- class abydos.distance.BaulieuXIV(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu XIV distance.
For two sets X and Y and a population N, Baulieu XIV distance [Bau97] is
\[dist_{BaulieuXIV}(X, Y) = \frac{|X \setminus Y| + 2 \cdot |Y \setminus X|}{|X \cap Y| + |X \setminus Y| + 2 \cdot |Y \setminus X|}\]
This is Baulieu's 32nd dissimilarity coefficient. This coefficient fails Baulieu's (P7) property, that \(D(a,b,c,d) = D(a,c,b,d)\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuXIV} = \frac{b+2c}{a+b+2c}\]
New in version 0.4.0.
Initialize BaulieuXIV instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu XIV distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu XIV distance
- Return type:
float
Examples
>>> cmp = BaulieuXIV()
>>> cmp.dist('cat', 'hat')
0.75
>>> cmp.dist('Niall', 'Neil')
0.8333333333333334
>>> cmp.dist('aluminum', 'Catalan')
0.9565217391304348
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- class abydos.distance.BaulieuXV(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Baulieu XV distance.
For two sets X and Y and a population N, Baulieu XV distance [Bau97] is
\[dist_{BaulieuXV}(X, Y) = \frac{|X \setminus Y| + |Y \setminus X| + \max(|X \setminus Y|, |Y \setminus X|)}{|X \cap Y| + |X \setminus Y| + |Y \setminus X| + \max(|X \setminus Y|, |Y \setminus X|)}\]
This is Baulieu's 33rd dissimilarity coefficient. This coefficient fails Baulieu's (P8) property, that \(D\) is a rational function whose numerator and denominator are both (total) linear.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{BaulieuXV} = \frac{b+c+\max(b, c)}{a+b+c+\max(b, c)}\]
New in version 0.4.0.
Initialize BaulieuXV instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) -> float
Return the Baulieu XV distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Baulieu XV distance
- Return type:
float
Examples
>>> cmp = BaulieuXV()
>>> cmp.dist('cat', 'hat')
0.75
>>> cmp.dist('Niall', 'Neil')
0.8461538461538461
>>> cmp.dist('aluminum', 'Catalan')
0.9583333333333334
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- class abydos.distance.Baystat(min_ss_len: Optional[int] = None, left_ext: Optional[int] = None, right_ext: Optional[int] = None, **kwargs: Any)
Bases: _Distance
Baystat similarity and distance.
Good results for shorter words are reported when setting min_ss_len to 1 and either left_ext OR right_ext to 1.
The Baystat similarity is defined in [FurnrohrRvR02].
This is ostensibly a port of the R module PPRL's implementation: https://github.com/cran/PPRL/blob/master/src/MTB_Baystat.cpp [Ruk18]. As such, this could be made more pythonic.
New in version 0.3.6.
Initialize Baystat instance.
- Parameters:
min_ss_len (int) -- Minimum substring length to be considered
left_ext (int) -- Left-side extension length
right_ext (int) -- Right-side extension length
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) -> float
Return the Baystat similarity.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Baystat similarity
- Return type:
float
Examples
>>> cmp = Baystat()
>>> round(cmp.sim('cat', 'hat'), 12)
0.666666666667
>>> cmp.sim('Niall', 'Neil')
0.4
>>> round(cmp.sim('Colin', 'Cuilen'), 12)
0.166666666667
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.BeniniI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)
Bases: _TokenDistance
Benini I correlation.
For two sets X and Y and a population N, Benini I correlation, Benini's Index of Attraction, [Ben01] is
\[corr_{BeniniI}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|}{|Y| \cdot |N \setminus X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{BeniniI} = \frac{ad-bc}{(a+c)(c+d)}\]
New in version 0.4.0.
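The relationship between the correlation and the similarity reported by this class can be sketched on raw counts. Both function names below are hypothetical. The mapping sim = (1 + corr) / 2 is inferred from the documented example outputs rather than quoted from the library, and the counts for 'cat' vs. 'hat' (a=2, b=2, c=2, d=778) assume padded bigrams and a population of 784.

```python
def benini_i_corr(a: int, b: int, c: int, d: int) -> float:
    """Benini I correlation from 2x2 confusion table counts (illustrative)."""
    # No guard for a zero denominator; this sketch assumes a + c > 0 and c + d > 0.
    return (a * d - b * c) / ((a + c) * (c + d))


def corr_to_sim(corr: float) -> float:
    """Map a correlation in [-1, 1] onto a similarity in [0, 1] (assumed mapping)."""
    return (1.0 + corr) / 2.0


corr = benini_i_corr(2, 2, 2, 778)
print(corr, corr_to_sim(corr))
```

A slightly negative correlation, as for strings with no shared tokens, therefore maps to a similarity just under 0.5.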
Initialize BeniniI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) -> float
Return the Benini I correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Benini I correlation
- Return type:
float
Examples
>>> cmp = BeniniI()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.3953727506426735
>>> cmp.corr('aluminum', 'Catalan')
0.11485180412371133
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Benini I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Benini I similarity
- Return type:
float
Examples
>>> cmp = BeniniI()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6976863753213367
>>> cmp.sim('aluminum', 'Catalan')
0.5574259020618557
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
- class abydos.distance.BeniniII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Benini II correlation.
For two sets X and Y and a population N, Benini II correlation, Benini's Index of Repulsion [Ben01], is
\[corr_{BeniniII}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {min(|Y| \cdot |N \setminus X|, |X| \cdot |N \setminus Y|)}\]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{BeniniII} = \frac{ad-bc}{min((a+c)(c+d), (a+b)(b+d))}\]New in version 0.4.0.
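The only difference from Benini I is the denominator, which takes the smaller of two marginal products. The sketch below is a hedged illustration of that formula; `benini_ii`, its zero-denominator fallback, and the sample counts are hypothetical and not taken from abydos.

```python
def benini_ii(a: int, b: int, c: int, d: int) -> float:
    """Benini II correlation from 2x2 confusion table counts."""
    denominator = min((a + c) * (c + d), (a + b) * (b + d))
    if denominator == 0:
        # Assumption: a degenerate table is reported as uncorrelated.
        return 0.0
    return (a * d - b * c) / denominator


# Hypothetical asymmetric table: here (a+b)(b+d) = 372 is the smaller product.
print(benini_ii(3, 1, 4, 92))
```

An asymmetric table is used so that the two products differ (672 versus 372) and the effect of `min` is visible.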
Initialize BeniniII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Benini II correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Benini II correlation
- Return type:
float
Examples
>>> cmp = BeniniII()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.3953727506426735
>>> cmp.corr('aluminum', 'Catalan')
0.11485180412371133
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Benini II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Benini II similarity
- Return type:
float
Examples
>>> cmp = BeniniII()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6976863753213367
>>> cmp.sim('aluminum', 'Catalan')
0.5574259020618557
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
- class abydos.distance.Bennet(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Bennet's S correlation.
For two sets X and Y and a population N, Bennet's \(S\) correlation [BAG54] is
\[corr_{Bennet}(X, Y) = S = \frac{p_o - p_e^S}{1 - p_e^S}\]where
\[ \begin{align}\begin{aligned}p_o = \frac{|X \cap Y| + |(N \setminus X) \setminus Y|}{|N|}\\p_e^S = \frac{1}{2}\end{aligned}\end{align} \]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[ \begin{align}\begin{aligned}p_o = \frac{a+d}{n}\\p_e^S = \frac{1}{2}\end{aligned}\end{align} \]New in version 0.4.0.
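Because the expected agreement is fixed at 1/2, Bennet's S reduces to a linear rescaling of the observed agreement. The following sketch shows this under stated assumptions: `bennet_s` is a hypothetical helper, the empty-table fallback is an assumption, and the counts are illustrative (padded bigrams of 'cat' and 'hat' over an assumed population of 784).

```python
def bennet_s(a: int, b: int, c: int, d: int) -> float:
    """Bennet's S from 2x2 confusion table counts."""
    n = a + b + c + d
    if n == 0:
        # Assumption: an empty table carries no agreement information.
        return 0.0
    p_o = (a + d) / n  # observed agreement
    p_e = 0.5          # expected agreement, fixed for S
    return (p_o - p_e) / (1 - p_e)


print(bennet_s(2, 2, 2, 778))
```

With these counts the result is 776/784, roughly 0.9898.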
Initialize Bennet instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return Bennet's S correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Bennet's S correlation
- Return type:
float
Examples
>>> cmp = Bennet()
>>> cmp.corr('cat', 'hat')
0.989795918367347
>>> cmp.corr('Niall', 'Neil')
0.9821428571428572
>>> cmp.corr('aluminum', 'Catalan')
0.9617834394904459
>>> cmp.corr('ATCG', 'TAGC')
0.9744897959183674
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Bennet's S similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Bennet's S similarity
- Return type:
float
Examples
>>> cmp = Bennet()
>>> cmp.sim('cat', 'hat')
0.9948979591836735
>>> cmp.sim('Niall', 'Neil')
0.9910714285714286
>>> cmp.sim('aluminum', 'Catalan')
0.9808917197452229
>>> cmp.sim('ATCG', 'TAGC')
0.9872448979591837
New in version 0.4.0.
- class abydos.distance.Bhattacharyya(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Bhattacharyya distance.
For two multisets X and Y drawn from an alphabet S, Bhattacharyya distance [Bha46] is
\[dist_{Bhattacharyya}(X, Y) = -log(\sum_{i \in S} \sqrt{X_iY_i})\]New in version 0.4.0.
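A minimal sketch of the formula over token multisets follows. It is not the library's code: the helper names are hypothetical, the bigram padding with '$' and '#' is an assumption, and token counts are converted to relative frequencies (also an assumption) so that the summed coefficient stays between 0 and 1. For disjoint inputs this sketch returns positive infinity, the mathematical limit of -log(0).

```python
import math
from collections import Counter


def padded_bigrams(word: str) -> Counter:
    """Count bigrams of word padded with '$' and '#' (assumed padding)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def bhattacharyya_coefficient(x: Counter, y: Counter) -> float:
    """Sum of sqrt(X_i * Y_i) over shared tokens, using relative frequencies."""
    x_total, y_total = sum(x.values()), sum(y.values())
    if x_total == 0 or y_total == 0:
        return 0.0
    return sum(
        math.sqrt((x[t] / x_total) * (y[t] / y_total))
        for t in x.keys() & y.keys()
    )


def bhattacharyya_distance(x: Counter, y: Counter) -> float:
    """Negative log of the coefficient; infinite when nothing is shared."""
    coefficient = bhattacharyya_coefficient(x, y)
    return math.inf if coefficient == 0 else -math.log(coefficient)


cat, hat = padded_bigrams("cat"), padded_bigrams("hat")
print(bhattacharyya_coefficient(cat, hat), bhattacharyya_distance(cat, hat))
```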
Initialize Bhattacharyya instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the Bhattacharyya coefficient of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Bhattacharyya coefficient
- Return type:
float
Examples
>>> cmp = Bhattacharyya()
>>> cmp.dist('cat', 'hat')
0.5
>>> cmp.dist('Niall', 'Neil')
0.3651483716701107
>>> cmp.dist('aluminum', 'Catalan')
0.11785113019775792
>>> cmp.dist('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Bhattacharyya distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Bhattacharyya distance
- Return type:
float
Examples
>>> cmp = Bhattacharyya()
>>> cmp.dist_abs('cat', 'hat')
0.6931471805599453
>>> cmp.dist_abs('Niall', 'Neil')
1.0074515102711326
>>> cmp.dist_abs('aluminum', 'Catalan')
2.1383330595080277
>>> cmp.dist_abs('ATCG', 'TAGC')
-inf
New in version 0.4.0.
- class abydos.distance.BlockLevenshtein(cost: Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: Callable[[List[float]], float] = max, **kwargs: Any)[source]
Bases:
Levenshtein
Levenshtein distance with block operations.
In addition to character-level insert, delete, and replace operations, this version of the Levenshtein distance supports block-level insert, delete, and replace, provided that the block occurs in both input strings.
New in version 0.4.0.
Initialize BlockLevenshtein instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized block Levenshtein distance between strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The normalized Levenshtein distance with blocks between src & tar
- Return type:
float
Examples
>>> cmp = BlockLevenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.75
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the block Levenshtein edit distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The block Levenshtein edit distance between src & tar
- Return type:
int
Examples
>>> cmp = BlockLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
3
New in version 0.4.0.
- class abydos.distance.BrainerdRobinson(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Brainerd-Robinson similarity.
For two multisets X and Y drawn from an alphabet S, Brainerd-Robinson similarity [Bra51, Rob51] is
\[sim_{BrainerdRobinson}(X, Y) = 200 - 100 \cdot \sum_{i \in S} |\frac{X_i}{\sum_{i \in S} |X_i|} - \frac{Y_i}{\sum_{i \in S} |Y_i|}|\]New in version 0.4.0.
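The score compares the relative frequency of each token in the two multisets, so it ranges from 0 (nothing shared) to 200 (identical proportions). A hedged sketch under stated assumptions: `brainerd_robinson` and `padded_bigrams` are hypothetical names, the '$'/'#' padding is assumed, and empty inputs are scored 0.

```python
from collections import Counter


def padded_bigrams(word: str) -> Counter:
    """Count bigrams of word padded with '$' and '#' (assumed padding)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def brainerd_robinson(x: Counter, y: Counter) -> float:
    """200 minus 100 times the summed absolute difference of proportions."""
    x_total, y_total = sum(x.values()), sum(y.values())
    if x_total == 0 or y_total == 0:
        # Assumption: an empty multiset shares nothing with the other input.
        return 0.0
    difference = sum(
        abs(x[t] / x_total - y[t] / y_total) for t in x.keys() | y.keys()
    )
    return 200 - 100 * difference


print(brainerd_robinson(padded_bigrams("cat"), padded_bigrams("hat")))
```

Dividing the score by 200 gives a value between 0 and 1.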
Initialize BrainerdRobinson instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Brainerd-Robinson similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Brainerd-Robinson similarity
- Return type:
float
Examples
>>> cmp = BrainerdRobinson()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3333333333333334
>>> cmp.sim('aluminum', 'Catalan')
0.111111111111111
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Brainerd-Robinson similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Brainerd-Robinson similarity
- Return type:
float
Examples
>>> cmp = BrainerdRobinson()
>>> cmp.sim_score('cat', 'hat')
100.0
>>> cmp.sim_score('Niall', 'Neil')
66.66666666666669
>>> cmp.sim_score('aluminum', 'Catalan')
22.2222222222222
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.BraunBlanquet(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Braun-Blanquet similarity.
For two sets X and Y and a population N, the Braun-Blanquet similarity [BB32] is
\[sim_{BraunBlanquet}(X, Y) = \frac{|X \cap Y|}{max(|X|, |Y|)}\]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{BraunBlanquet} = \frac{a}{max(a+b, a+c)}\]New in version 0.4.0.
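In set terms, the measure is simply the shared token count divided by the size of the larger set. The sketch below illustrates this with plain Python sets; the function names, the '$'/'#' bigram padding, and the treatment of two empty sets are assumptions rather than library behaviour.

```python
def padded_bigram_set(word: str) -> set:
    """Set of bigrams of word padded with '$' and '#' (assumed padding)."""
    padded = "$" + word + "#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def braun_blanquet(x: set, y: set) -> float:
    """Shared tokens divided by the size of the larger set."""
    larger = max(len(x), len(y))
    if larger == 0:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(x & y) / larger


print(braun_blanquet(padded_bigram_set("cat"), padded_bigram_set("hat")))
print(braun_blanquet(padded_bigram_set("Niall"), padded_bigram_set("Neil")))
```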
Initialize BraunBlanquet instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Braun-Blanquet similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Braun-Blanquet similarity
- Return type:
float
Examples
>>> cmp = BraunBlanquet()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.1111111111111111
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.Canberra(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Canberra distance.
For two sets X and Y, the Canberra distance [LW66, LW67b] is
\[dist_{Canberra}(X, Y) = \frac{|X \triangle Y|}{|X|+|Y|}\]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{Canberra} = \frac{b+c}{(a+b)+(a+c)}\]New in version 0.4.0.
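For plain sets, the Canberra distance is the size of the symmetric difference divided by the combined size of both sets. A small sketch, assuming hypothetical helper names and '$'/'#' bigram padding that are not taken from the library:

```python
def padded_bigram_set(word: str) -> set:
    """Set of bigrams of word padded with '$' and '#' (assumed padding)."""
    padded = "$" + word + "#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def canberra(x: set, y: set) -> float:
    """Size of the symmetric difference over the summed set sizes."""
    total = len(x) + len(y)
    if total == 0:
        # Assumption: two empty sets are at distance zero.
        return 0.0
    return len(x ^ y) / total


print(canberra(padded_bigram_set("cat"), padded_bigram_set("hat")))
print(canberra(padded_bigram_set("Niall"), padded_bigram_set("Neil")))
```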
Initialize Canberra instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the Canberra distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Canberra distance
- Return type:
float
Examples
>>> cmp = Canberra()
>>> cmp.dist('cat', 'hat')
0.5
>>> cmp.dist('Niall', 'Neil')
0.6363636363636364
>>> cmp.dist('aluminum', 'Catalan')
0.8823529411764706
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- class abydos.distance.Cao(**kwargs: Any)[source]
Bases:
_TokenDistance
Cao's CY dissimilarity.
Given \(X_{ij}\) (the number of individuals of species \(j\) in sample \(i\)), \(X_{kj}\) (the number of individuals of species \(j\) in sample \(k\)), and \(N\) (the total number of species present in both samples), Cao dissimilarity (CYd) [CBW97] is:
\[dist_{Cao}(X, Y) = CYd = \frac{1}{N}\sum\Bigg(\frac{(X_{ij} + X_{kj})log_{10}\big( \frac{X_{ij}+X_{kj}}{2}\big)-X_{ij}log_{10}X_{kj}-X_{kj}log_{10}X_{ij}} {X_{ij}+X_{kj}}\Bigg)\]In the above formula, whenever \(X_{ij} = 0\) or \(X_{kj} = 0\), the value 0.1 is substituted.
Since this measure ranges from 0 to \(\infty\), a similarity measure, CYs, ranging from 0 to 1 was also developed.
\[sim_{Cao}(X, Y) = CYs = 1 - \frac{Observed~CYd}{Maximum~CYd}\]where
\[Observed~CYd = \sum\Bigg(\frac{(X_{ij} + X_{kj})log_{10}\big( \frac{X_{ij}+X_{kj}}{2}\big)-X_{ij}log_{10}X_{kj}-X_{kj}log_{10}X_{ij}} {X_{ij}+X_{kj}}\Bigg)\]and with \(a\) (the number of species present in both samples), \(b\) (the number of species present in sample \(i\) only), and \(c\) (the number of species present in sample \(k\) only),
\[Maximum~CYd = D_1 + D_2 + D_3\]with
\[ \begin{align}\begin{aligned}D_1 = \sum_{j=1}^b \Bigg(\frac{(X_{ij} + 0.1) log_{10} \big( \frac{X_{ij}+0.1}{2}\big)-X_{ij}log_{10}0.1-0.1log_{10}X_{ij}} {X_{ij}+0.1}\Bigg)\\D_2 = \sum_{j=1}^c \Bigg(\frac{(X_{kj} + 0.1) log_{10} \big( \frac{X_{kj}+0.1}{2}\big)-X_{kj}log_{10}0.1-0.1log_{10}X_{kj}} {X_{kj}+0.1}\Bigg)\\D_3 = \sum_{j=1}^a \frac{a}{2} \Bigg(\frac{(D_i + 1) log_{10} \big(\frac{D_i+1}{2}\big)-log_{10}D_i}{D_i+1} + \frac{(D_k + 1) log_{10} \big(\frac{D_k+1}{2}\big)-log_{10}D_k}{D_k+1}\Bigg)\end{aligned}\end{align} \]with
\[ \begin{align}\begin{aligned}D_i = \frac{\sum X_{ij} - \frac{a}{2}}{\frac{a}{2}}\\D_k = \frac{\sum X_{kj} - \frac{a}{2}}{\frac{a}{2}}\end{aligned}\end{align} \]for
\[ \begin{align}\begin{aligned}X_{ij} \geq 1\\X_{kj} \geq 1\end{aligned}\end{align} \]New in version 0.4.1.
Initialize Cao instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
- dist_abs(src: str, tar: str) float [source]
Return Cao's CY dissimilarity (CYd) of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Cao's CY dissimilarity
- Return type:
float
Examples
>>> cmp = Cao()
>>> cmp.dist_abs('cat', 'hat')
0.3247267992925765
>>> cmp.dist_abs('Niall', 'Neil')
0.4132886536450973
>>> cmp.dist_abs('aluminum', 'Catalan')
0.5530666041976232
>>> cmp.dist_abs('ATCG', 'TAGC')
0.6494535985851531
New in version 0.4.1.
- sim(src: str, tar: str) float [source]
Return Cao's CY similarity (CYs) of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Cao's CY similarity
- Return type:
float
Examples
>>> cmp = Cao()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.1.
- class abydos.distance.ChaoDice(**kwargs: Any)[source]
Bases:
ChaoJaccard
Chao's Dice similarity.
Chao's Dice similarity [CCCS04]
New in version 0.4.1.
Initialize ChaoDice instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
- sim(src: str, tar: str) float [source]
Return the normalized Chao's Dice similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Chao's Dice similarity
- Return type:
float
Examples
>>> import random
>>> random.seed(0)
>>> cmp = ChaoDice()
>>> cmp.sim('cat', 'hat')
0.36666666666666664
>>> cmp.sim('Niall', 'Neil')
0.27868852459016397
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.1.
- sim_score(src: str, tar: str) float [source]
Return Chao's Dice similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Chao's Dice similarity
- Return type:
float
Examples
>>> import random
>>> random.seed(0)
>>> cmp = ChaoDice()
>>> cmp.sim_score('cat', 'hat')
0.36666666666666664
>>> cmp.sim_score('Niall', 'Neil')
0.27868852459016397
>>> cmp.sim_score('aluminum', 'Catalan')
0.0
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.1.
- class abydos.distance.ChaoJaccard(**kwargs: Any)[source]
Bases:
_TokenDistance
Chao's Jaccard similarity.
Chao's Jaccard similarity [CCCS04]
New in version 0.4.1.
Initialize ChaoJaccard instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
- sim(src: str, tar: str) float [source]
Return normalized Chao's Jaccard similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Chao's Jaccard similarity
- Return type:
float
Examples
>>> import random
>>> random.seed(0)
>>> cmp = ChaoJaccard()
>>> cmp.sim('cat', 'hat')
0.22448979591836735
>>> cmp.sim('Niall', 'Neil')
0.1619047619047619
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.1.
- sim_score(src: str, tar: str) float [source]
Return Chao's Jaccard similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Chao's Jaccard similarity
- Return type:
float
Examples
>>> import random
>>> random.seed(0)
>>> cmp = ChaoJaccard()
>>> cmp.sim_score('cat', 'hat')
0.22448979591836735
>>> cmp.sim_score('Niall', 'Neil')
0.1619047619047619
>>> cmp.sim_score('aluminum', 'Catalan')
0.0
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.1.
- class abydos.distance.Chebyshev(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = 0, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
Minkowski
Chebyshev distance.
Chebyshev distance is the chessboard distance, equivalent to Minkowski distance in \(L^\infty\)-space.
New in version 0.3.6.
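Over token counts, the \(L^\infty\) distance is the largest absolute difference in the count of any single token. The sketch below shows that idea with `collections.Counter`; `chebyshev` is a hypothetical helper, and tokenizing by single characters is an assumption standing in for q-grams of length 1.

```python
from collections import Counter


def chebyshev(x: Counter, y: Counter) -> float:
    """Largest absolute per-token count difference between two multisets."""
    return float(
        max((abs(x[t] - y[t]) for t in x.keys() | y.keys()), default=0)
    )


# Single characters as tokens (assumed equivalent to q-grams with q = 1).
print(chebyshev(Counter("ATCG"), Counter("TAGC")))
print(chebyshev(Counter("ATCGATTCGGAATTTC"), Counter("TAGCATAATCGCCG")))
```

The first pair contains the same four letters once each, so the distance is 0.0. In the second pair the count of 'T' differs by 3, which is the largest gap.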
Initialize Chebyshev instance.
- Parameters:
alphabet (collection or int) -- The values or size of the alphabet
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(*args: Any, **kwargs: Any) NoReturn [source]
Raise exception when called.
- Parameters:
*args -- Variable length argument list
**kwargs -- Arbitrary keyword arguments
- Raises:
NotImplementedError -- Method disabled for Chebyshev distance
New in version 0.3.6.
- dist_abs(src: str, tar: str, *args: Any, **kwargs: Any) float [source]
Return the Chebyshev distance between two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
The Chebyshev distance
- Return type:
float
Examples
>>> cmp = Chebyshev()
>>> cmp.dist_abs('cat', 'hat')
1.0
>>> cmp.dist_abs('Niall', 'Neil')
1.0
>>> cmp.dist_abs('Colin', 'Cuilen')
1.0
>>> cmp.dist_abs('ATCG', 'TAGC')
1.0
>>> cmp = Chebyshev(qval=1)
>>> cmp.dist_abs('ATCG', 'TAGC')
0.0
>>> cmp.dist_abs('ATCGATTCGGAATTTC', 'TAGCATAATCGCCG')
3.0
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.Chord(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Chord distance.
For two sets X and Y drawn from an alphabet S, the chord distance [Orloci67] is
\[dist_{chord}(X, Y) = \sqrt{\sum_{i \in S}\Big(\frac{X_i}{\sqrt{\sum_{j \in X} X_j^2}} - \frac{Y_i}{\sqrt{\sum_{j \in Y} Y_j^2}}\Big)^2}\]New in version 0.4.0.
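Put differently, each count vector is scaled to unit length and the Euclidean distance between the two unit vectors is taken, so the value lies between 0 and \(\sqrt{2}\) for non-negative counts. A hedged sketch, in which `chord_distance` and the zero-vector handling of empty inputs are assumptions:

```python
import math
from collections import Counter


def chord_distance(x: Counter, y: Counter) -> float:
    """Euclidean distance between the unit-normalized count vectors."""
    x_norm = math.sqrt(sum(v * v for v in x.values()))
    y_norm = math.sqrt(sum(v * v for v in y.values()))
    # Assumption: an empty input is treated as the zero vector.
    x_scale = 1 / x_norm if x_norm else 0.0
    y_scale = 1 / y_norm if y_norm else 0.0
    return math.sqrt(
        sum((x[t] * x_scale - y[t] * y_scale) ** 2 for t in x.keys() | y.keys())
    )


# Bigrams of 'cat' and 'hat' padded with '$' and '#' (assumed padding).
cat = Counter(["$c", "ca", "at", "t#"])
hat = Counter(["$h", "ha", "at", "t#"])
print(chord_distance(cat, hat))
```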
Initialize Chord instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Chord distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized chord distance
- Return type:
float
Examples
>>> cmp = Chord()
>>> cmp.dist('cat', 'hat')
0.707106781186547
>>> cmp.dist('Niall', 'Neil')
0.796775770420944
>>> cmp.dist('aluminum', 'Catalan')
0.94519820240106
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Chord distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Chord distance
- Return type:
float
Examples
>>> cmp = Chord()
>>> cmp.dist_abs('cat', 'hat')
1.0
>>> cmp.dist_abs('Niall', 'Neil')
1.126811100699571
>>> cmp.dist_abs('aluminum', 'Catalan')
1.336712116966249
>>> cmp.dist_abs('ATCG', 'TAGC')
1.414213562373095
New in version 0.4.0.
- class abydos.distance.Clark(**kwargs: Any)[source]
Bases:
_TokenDistance
Clark's coefficient of divergence.
For two sets X and Y and a population N, Clark's coefficient of divergence [Cla52] is:
\[dist_{Clark}(X, Y) = \sqrt{\frac{\sum_{i=0}^{|N|} \big(\frac{x_i-y_i}{x_i+y_i}\big)^2}{|N|}}\]New in version 0.4.1.
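The formula is the root mean square of the per-token relative differences. The sketch below assumes that the population N is the set of tokens observed in either input, which the text above does not state; `clark` and the bigram padding are likewise hypothetical.

```python
import math
from collections import Counter


def clark(x: Counter, y: Counter) -> float:
    """Root mean square of (x_i - y_i) / (x_i + y_i) over observed tokens."""
    tokens = [t for t in x.keys() | y.keys() if x[t] + y[t] > 0]
    if not tokens:
        # Assumption: two empty inputs do not diverge.
        return 0.0
    total = sum(((x[t] - y[t]) / (x[t] + y[t])) ** 2 for t in tokens)
    return math.sqrt(total / len(tokens))


# Bigrams of 'cat' and 'hat' padded with '$' and '#' (assumed padding).
cat = Counter(["$c", "ca", "at", "t#"])
hat = Counter(["$h", "ha", "at", "t#"])
print(clark(cat, hat))
```

Four of the six distinct bigrams occur in only one string and contribute 1 each, giving \(\sqrt{4/6}\), roughly 0.8165.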
Initialize Clark instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
- dist(src: str, tar: str) float [source]
Return Clark's coefficient of divergence of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Clark's coefficient of divergence
- Return type:
float
Examples
>>> cmp = Clark()
>>> cmp.dist('cat', 'hat')
0.816496580927726
>>> cmp.dist('Niall', 'Neil')
0.8819171036881969
>>> cmp.dist('aluminum', 'Catalan')
0.9660917830792959
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.1.
- class abydos.distance.Clement(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Clement similarity.
For two sets X and Y and a population N, Clement similarity [Cle76] is defined as
\[sim_{Clement}(X, Y) = \frac{|X \cap Y|}{|X|}\Big(1-\frac{|X|}{|N|}\Big) + \frac{|(N \setminus X) \setminus Y|}{|N \setminus X|} \Big(1-\frac{|N \setminus X|}{|N|}\Big)\]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Clement} = \frac{a}{a+b}\Big(1 - \frac{a+b}{n}\Big) + \frac{d}{c+d}\Big(1 - \frac{c+d}{n}\Big)\]New in version 0.4.0.
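Each of the two terms weights a conditional proportion by the share of the population outside the corresponding row. A minimal sketch of the confusion-table form, where `clement`, the skipping of empty rows, and the sample counts (padded bigrams of 'cat' and 'hat' over an assumed population of 784) are assumptions:

```python
def clement(a: int, b: int, c: int, d: int) -> float:
    """Clement similarity from 2x2 confusion table counts."""
    n = a + b + c + d
    if n == 0:
        # Assumption: an empty table yields zero similarity.
        return 0.0
    score = 0.0
    if a + b:
        score += a / (a + b) * (1 - (a + b) / n)
    if c + d:
        score += d / (c + d) * (1 - (c + d) / n)
    return score


print(clement(2, 2, 2, 778))
```

With these counts the result is roughly 0.5025.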
Initialize Clement instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Clement similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Clement similarity
- Return type:
float
Examples
>>> cmp = Clement()
>>> cmp.sim('cat', 'hat')
0.5025379382522239
>>> cmp.sim('Niall', 'Neil')
0.33840586363079933
>>> cmp.sim('aluminum', 'Catalan')
0.12119877280918714
>>> cmp.sim('ATCG', 'TAGC')
0.006336616803332366
New in version 0.4.0.
- class abydos.distance.CohenKappa(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Cohen's Kappa similarity.
For two sets X and Y and a population N, Cohen's kappa similarity [Coh60] is
\[sim_{Cohen_\kappa}(X, Y) = \kappa = \frac{p_o - p_e^\kappa}{1 - p_e^\kappa}\]where
\[\begin{split}\begin{array}{l} p_o = \frac{|X \cap Y| + |(N \setminus X) \setminus Y|}{|N|}\\ \\ p_e^\kappa = \frac{|X|}{|N|} \cdot \frac{|Y|}{|N|} + \frac{|N \setminus X|}{|N|} \cdot \frac{|N \setminus Y|}{|N|} \end{array}\end{split}\]In 2x2 confusion table terms, where a+b+c+d=n, this is
\[\begin{split}\begin{array}{l} p_o = \frac{a+d}{n}\\ \\ p_e^\kappa = \frac{a+b}{n} \cdot \frac{a+c}{n} + \frac{c+d}{n} \cdot \frac{b+d}{n} \end{array}\end{split}\]New in version 0.4.0.
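The kappa formula above can be evaluated step by step from a 2x2 table. The following sketch computes the textbook statistic only; `cohen_kappa`, its handling of degenerate tables, and the sample counts are hypothetical, and no claim is made that the library's `sim` method returns this raw value.

```python
def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Cohen's kappa from 2x2 confusion table counts."""
    n = a + b + c + d
    if n == 0:
        return 0.0  # Assumption: an empty table yields no agreement.
    p_o = (a + d) / n
    p_e = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    if p_e == 1:
        # Assumption: chance agreement is total, so kappa is undefined.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical table: p_o = 0.7 and p_e = 0.3 + 0.2 = 0.5.
print(cohen_kappa(20, 5, 10, 15))
```

For this table kappa is (0.7 - 0.5) / (1 - 0.5), or about 0.4.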
Initialize CohenKappa instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Cohen's Kappa similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Cohen's Kappa similarity
- Return type:
float
Examples
>>> cmp = CohenKappa()
>>> cmp.sim('cat', 'hat')
0.9974358974358974
>>> cmp.sim('Niall', 'Neil')
0.9955041746949261
>>> cmp.sim('aluminum', 'Catalan')
0.9903412749517064
>>> cmp.sim('ATCG', 'TAGC')
0.993581514762516
New in version 0.4.0.
- class abydos.distance.Cole(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Cole correlation.
For two sets X and Y and a population N, the Cole correlation [Col49] has three formulae:
If \(|X \cap Y| \cdot |(N \setminus X) \setminus Y| \geq |X \setminus Y| \cdot |Y \setminus X|\) then
\[corr_{Cole}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {(|X \cap Y| + |X \setminus Y|) \cdot (|X \setminus Y| + |(N \setminus X) \setminus Y|)}\]If \(|(N \setminus X) \setminus Y| \geq |X \cap Y|\) then
\[corr_{Cole}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {(|X \cap Y| + |X \setminus Y|) \cdot (|X \cap Y| + |Y \setminus X|)}\]Otherwise
\[corr_{Cole}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {(|X \setminus Y| + |(N \setminus X) \setminus Y|) \cdot (|Y \setminus X| + |(N \setminus X) \setminus Y|)}\]
Cole terms this measurement the Coefficient of Interspecific Association.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[\begin{split}corr_{Cole} = \left\{ \begin{array}{ll} \frac{ad-bc}{(a+b)(b+d)} & \textup{if} ~ad \geq bc \\ \\ \frac{ad-bc}{(a+b)(a+c)} & \textup{if} ~d \geq a \\ \\ \frac{ad-bc}{(b+d)(c+d)} & \textup{otherwise} \end{array} \right.\end{split}\]New in version 0.4.0.
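The three cases share a numerator and differ only in the denominator, chosen by the sign of \(ad-bc\) and then by whether \(d \geq a\). A hedged sketch of that case analysis follows; `cole`, its zero-denominator fallback, and the sample counts are assumptions, not library code.

```python
def cole(a: int, b: int, c: int, d: int) -> float:
    """Cole correlation from 2x2 confusion table counts."""
    numerator = a * d - b * c
    if numerator >= 0:
        denominator = (a + b) * (b + d)
    elif d >= a:
        denominator = (a + b) * (a + c)
    else:
        denominator = (b + d) * (c + d)
    if denominator == 0:
        # Assumption: a degenerate table is reported as uncorrelated.
        return 0.0
    return numerator / denominator


# Hypothetical tables: a positive association and two disjoint sets.
print(cole(2, 2, 2, 778))
print(cole(0, 5, 5, 774))
```

The second table has no shared tokens, so the second case applies and the correlation reaches its minimum of -1.0.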
Initialize Cole instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Cole correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Cole correlation
- Return type:
float
Examples
>>> cmp = Cole()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.3290543431750107
>>> cmp.corr('aluminum', 'Catalan')
0.10195910195910196
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Cole similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for similarity
tar (str) -- Target string (or QGrams/Counter objects) for similarity
- Returns:
Cole similarity
- Return type:
float
Examples
>>> cmp = Cole()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6645271715875054
>>> cmp.sim('aluminum', 'Catalan')
0.550979550979551
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.CompleteLinkage(tokenizer: Optional[_Tokenizer] = None, metric: Optional[_Distance] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Complete linkage distance.
For two multisets X and Y, complete linkage distance [DD16] is
\[dist_{CompleteLinkage}(X, Y) = \max_{i \in X, j \in Y} dist(X_i, Y_j)\]
New in version 0.4.0.
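As a rough sketch of the formula above, complete linkage takes the largest pairwise distance between any token of one string and any token of the other. The helper names below are hypothetical, and the tokenization (bigrams padded with '$' and '#') and the token metric (plain Levenshtein distance) are assumptions chosen to match the documented defaults, not the library's code.

```python
from itertools import product


def levenshtein(s, t):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(t) + 1))
    for i, s_char in enumerate(s, 1):
        cur = [i]
        for j, t_char in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (s_char != t_char)))  # substitution
        prev = cur
    return prev[-1]


def bigrams(word, start="$", stop="#"):
    """Padded character bigrams (padding symbols are an assumption)."""
    padded = start + word + stop
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def complete_linkage(src, tar):
    """Maximum token-to-token distance over all token pairs."""
    return max(levenshtein(x, y)
               for x, y in product(bigrams(src), bigrams(tar)))
```

Because two bigrams can differ in at most two positions, any pair of strings with at least one fully distinct bigram pair yields 2, which is why every documented `dist_abs` example returns 2.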
Initialize CompleteLinkage instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants. (Defaults to Levenshtein distance)
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized complete linkage distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
normalized complete linkage distance
- Return type:
float
Examples
>>> cmp = CompleteLinkage()
>>> cmp.dist('cat', 'hat')
1.0
>>> cmp.dist('Niall', 'Neil')
1.0
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the complete linkage distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
complete linkage distance
- Return type:
float
Examples
>>> cmp = CompleteLinkage()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('aluminum', 'Catalan')
2
>>> cmp.dist_abs('ATCG', 'TAGC')
2
New in version 0.4.0.
- class abydos.distance.ConsonniTodeschiniI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Consonni & Todeschini I similarity.
For two sets X and Y and a population N, Consonni & Todeschini I similarity [CT12] is
\[sim_{ConsonniTodeschiniI}(X, Y) = \frac{log(1+|X \cap Y|+|(N \setminus X) \setminus Y|)} {log(1+|N|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{ConsonniTodeschiniI} = \frac{log(1+a+d)}{log(1+n)}\]
New in version 0.4.0.
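The confusion-table form can be evaluated in a couple of lines. The sketch below is illustrative only: `consonni_todeschini_i` is a hypothetical name, and the counts used in the usage note (a=2, b=2, c=2, d=778) are an assumption inferred from the documented outputs for 'cat' and 'hat'.

```python
from math import log


def consonni_todeschini_i(a, b, c, d):
    """log(1 + a + d) / log(1 + n); assumes a non-empty table (n > 0)."""
    n = a + b + c + d
    return log(1 + a + d) / log(1 + n)
```

With the assumed counts the result is log(781)/log(785), roughly 0.99923. The measure counts shared absences (d) as agreement, so with a large population it stays close to 1 even for strings that share no tokens.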
Initialize ConsonniTodeschiniI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Consonni & Todeschini I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Consonni & Todeschini I similarity
- Return type:
float
Examples
>>> cmp = ConsonniTodeschiniI()
>>> cmp.sim('cat', 'hat')
0.9992336018090547
>>> cmp.sim('Niall', 'Neil')
0.998656222829757
>>> cmp.sim('aluminum', 'Catalan')
0.9971098629456009
>>> cmp.sim('ATCG', 'TAGC')
0.9980766131469967
New in version 0.4.0.
- class abydos.distance.ConsonniTodeschiniII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Consonni & Todeschini II similarity.
For two sets X and Y and a population N, Consonni & Todeschini II similarity [CT12] is
\[sim_{ConsonniTodeschiniII}(X, Y) = \frac{log(1+|N|) - log(1+|X \setminus Y|+|Y \setminus X|)} {log(1+|N|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{ConsonniTodeschiniII} = \frac{log(1+n)-log(1+b+c)}{log(1+n)}\]
New in version 0.4.0.
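A minimal sketch of the confusion-table form follows. The function name is hypothetical, and the example counts (a=2, b=2, c=2, d=778) are an assumption inferred from the documented outputs for 'cat' and 'hat'.

```python
from math import log


def consonni_todeschini_ii(a, b, c, d):
    """(log(1 + n) - log(1 + b + c)) / log(1 + n); assumes n > 0."""
    n = a + b + c + d
    return (log(1 + n) - log(1 + b + c)) / log(1 + n)
```

Only the disagreements b and c lower the score: when b + c = 0 the second logarithm vanishes and the similarity is exactly 1.0. With the assumed counts the value is 1 - log(5)/log(785), roughly 0.75855.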
Initialize ConsonniTodeschiniII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Consonni & Todeschini II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Consonni & Todeschini II similarity
- Return type:
float
Examples
>>> cmp = ConsonniTodeschiniII()
>>> cmp.sim('cat', 'hat')
0.7585487129939101
>>> cmp.sim('Niall', 'Neil')
0.6880377723094788
>>> cmp.sim('aluminum', 'Catalan')
0.5841297898633079
>>> cmp.sim('ATCG', 'TAGC')
0.640262668568961
New in version 0.4.0.
- class abydos.distance.ConsonniTodeschiniIII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Consonni & Todeschini III similarity.
For two sets X and Y and a population N, Consonni & Todeschini III similarity [CT12] is
\[sim_{ConsonniTodeschiniIII}(X, Y) = \frac{log(1+|X \cap Y|)}{log(1+|N|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{ConsonniTodeschiniIII} = \frac{log(1+a)}{log(1+n)}\]
New in version 0.4.0.
Initialize ConsonniTodeschiniIII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Consonni & Todeschini III similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Consonni & Todeschini III similarity
- Return type:
float
Examples
>>> cmp = ConsonniTodeschiniIII()
>>> cmp.sim('cat', 'hat')
0.1648161441769704
>>> cmp.sim('Niall', 'Neil')
0.1648161441769704
>>> cmp.sim('aluminum', 'Catalan')
0.10396755253417303
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.ConsonniTodeschiniIV(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Consonni & Todeschini IV similarity.
For two sets X and Y and a population N, Consonni & Todeschini IV similarity [CT12] is
\[sim_{ConsonniTodeschiniIV}(X, Y) = \frac{log(1+|X \cap Y|)}{log(1+|X \cup Y|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{ConsonniTodeschiniIV} = \frac{log(1+a)}{log(1+a+b+c)}\]
New in version 0.4.0.
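Variants III and IV share a numerator and differ only in the normalizing term: III divides by the log of the whole population, IV by the log of the union. The sketch below contrasts the two. Both function names are hypothetical, and the counts in the usage note (a=2, b=2, c=2, d=778) are an assumption inferred from the documented outputs for 'cat' and 'hat'.

```python
from math import log


def consonni_todeschini_iii(a, b, c, d):
    """log(1 + a) / log(1 + n); assumes n > 0."""
    n = a + b + c + d
    return log(1 + a) / log(1 + n)


def consonni_todeschini_iv(a, b, c):
    """log(1 + a) / log(1 + a + b + c); d is not used. Assumes a + b + c > 0."""
    return log(1 + a) / log(1 + a + b + c)
```

With the assumed counts, variant III gives log(3)/log(785) (about 0.165) while variant IV gives log(3)/log(7) (about 0.565). Variant IV is therefore independent of the alphabet size, whereas III shrinks as the population grows.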
Initialize ConsonniTodeschiniIV instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Consonni & Todeschini IV similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Consonni & Todeschini IV similarity
- Return type:
float
Examples
>>> cmp = ConsonniTodeschiniIV()
>>> cmp.sim('cat', 'hat')
0.5645750340535796
>>> cmp.sim('Niall', 'Neil')
0.4771212547196623
>>> cmp.sim('aluminum', 'Catalan')
0.244650542118226
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.ConsonniTodeschiniV(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Consonni & Todeschini V correlation.
For two sets X and Y and a population N, Consonni & Todeschini V correlation [CT12] is
\[corr_{ConsonniTodeschiniV}(X, Y) = \frac{log(1+|X \cap Y| \cdot |(N \setminus X) \setminus Y|)- log(1+|X \setminus Y| \cdot |Y \setminus X|)} {log(1+\frac{|N|^2}{4})}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{ConsonniTodeschiniV} = \frac{log(1+ad)-log(1+bc)}{log(1+\frac{n^2}{4})}\]
New in version 0.4.0.
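The correlation can be sketched from the four counts as follows. The function names are hypothetical and the example counts are an assumption inferred from the documented outputs. The rescaling (1 + corr) / 2 in `ct_v_sim` is likewise an assumption: it is consistent with the documented `corr` and `sim` values, but this section does not state how `sim` is derived.

```python
from math import log


def ct_v_corr(a, b, c, d):
    """(log(1 + ad) - log(1 + bc)) / log(1 + n^2 / 4); assumes n > 0."""
    n = a + b + c + d
    return (log(1 + a * d) - log(1 + b * c)) / log(1 + n * n / 4)


def ct_v_sim(a, b, c, d):
    """Map the correlation from [-1, 1] onto [0, 1] (assumed rescaling)."""
    return (1 + ct_v_corr(a, b, c, d)) / 2
```

For the assumed 'cat'/'hat' counts (a=2, b=2, c=2, d=778) the correlation is about 0.4807. With no shared tokens (a=0, b=5, c=5, d=774) the numerator becomes -log(26) and the correlation is negative, about -0.2728.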
Initialize ConsonniTodeschiniV instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Consonni & Todeschini V correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Consonni & Todeschini V correlation
- Return type:
float
Examples
>>> cmp = ConsonniTodeschiniV()
>>> cmp.corr('cat', 'hat')
0.48072545510682463
>>> cmp.corr('Niall', 'Neil')
0.4003930264973547
>>> cmp.corr('aluminum', 'Catalan')
0.21794239483504532
>>> cmp.corr('ATCG', 'TAGC')
-0.2728145951429799
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Consonni & Todeschini V similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Consonni & Todeschini V similarity
- Return type:
float
Examples
>>> cmp = ConsonniTodeschiniV()
>>> cmp.sim('cat', 'hat')
0.7403627275534124
>>> cmp.sim('Niall', 'Neil')
0.7001965132486774
>>> cmp.sim('aluminum', 'Catalan')
0.6089711974175227
>>> cmp.sim('ATCG', 'TAGC')
0.36359270242851005
New in version 0.4.0.
- class abydos.distance.CormodeLZ(**kwargs: Any)[source]
Bases:
_Distance
Cormode's LZ distance.
Cormode's LZ distance [Cor03, CPSV00]
New in version 0.4.0.
Initialize CormodeLZ instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Cormode's LZ distance of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Cormode's LZ distance
- Return type:
float
Examples
>>> cmp = CormodeLZ()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.8
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.75
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Cormode's LZ distance of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Cormode's LZ distance
- Return type:
float
Examples
>>> cmp = CormodeLZ()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
5
>>> cmp.dist_abs('aluminum', 'Catalan')
6
>>> cmp.dist_abs('ATCG', 'TAGC')
4
New in version 0.4.0.
- class abydos.distance.Cosine(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Cosine similarity.
For two sets X and Y, the cosine similarity, Otsuka-Ochiai coefficient, or Ochiai coefficient [Och57, Ots36] is
\[sim_{cosine}(X, Y) = \frac{|X \cap Y|}{\sqrt{|X| \cdot |Y|}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{cosine} = \frac{a}{\sqrt{(a+b)(a+c)}}\]
Notes
This measure is also known as the Fowlkes-Mallows index [FM83] for two classes and G-measure, the geometric mean of precision & recall.
New in version 0.3.6.
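A small standard-library sketch of the set formula, computed over padded bigram multisets, is given below. The helper names are hypothetical, and the padding symbols '$' and '#' are an assumption about the default q-gram tokenizer; with them the sketch agrees with the documented example values.

```python
from collections import Counter
from math import sqrt


def bigrams(word, start="$", stop="#"):
    """Multiset of padded character bigrams (padding is an assumption)."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine_sim(src, tar):
    """|X & Y| / sqrt(|X| * |Y|) over bigram multisets."""
    x, y = bigrams(src), bigrams(tar)
    shared = sum((x & y).values())  # multiset intersection size
    return shared / sqrt(sum(x.values()) * sum(y.values()))
```

For 'Niall' and 'Neil' the bigram multisets have 6 and 5 members and share 2 ('$N' and 'l#'), giving 2/sqrt(30). Because the denominator is the geometric mean of the two sizes, this equals the geometric mean of precision (2/6) and recall (2/5).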
Initialize Cosine instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the cosine similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Cosine similarity
- Return type:
float
Examples
>>> cmp = Cosine()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3651483716701107
>>> cmp.sim('aluminum', 'Catalan')
0.11785113019775793
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.Covington(weights: Tuple[int, int, int, int, int, int, int, int] = (0, 5, 10, 30, 60, 100, 40, 50), **kwargs: Any)[source]
Bases:
_Distance
Covington distance.
Covington distance [Cov96]
New in version 0.4.0.
Initialize Covington instance.
- Parameters:
weights (tuple) --
An 8-tuple of costs for each kind of match or mismatch described in Covington's paper:
exact consonant or glide match
exact vowel match
vowel-vowel length mismatch or i and y or u and w
vowel-vowel mismatch
consonant-consonant mismatch
consonant-vowel mismatch
skip preceded by a skip
skip not preceded by a skip
The weights used in Covington's first approximation can be used by supplying the tuple (0.0, 0.0, 0.5, 0.5, 0.5, 1.0, 0.5, 0.5)
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
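To make the role of the eight weights concrete, the sketch below scores an already-aligned pair of strings, using '-' for a skip. It is an illustration only, not the library's alignment search: `score_alignment` is a hypothetical helper, the vowel set is an assumption, vowel length is not modelled, and a skip is treated as "preceded by a skip" when the previous column was also a skip.

```python
VOWELS = set("aeiou")  # assumed vowel inventory; y and w count as glides
DEFAULT_WEIGHTS = (0, 5, 10, 30, 60, 100, 40, 50)


def score_alignment(src_aligned, tar_aligned, weights=DEFAULT_WEIGHTS):
    """Sum the per-column costs of a fixed alignment ('-' marks a skip)."""
    if len(src_aligned) != len(tar_aligned):
        raise ValueError("aligned strings must have equal length")
    total = 0
    prev_skip = False
    for s, t in zip(src_aligned.lower(), tar_aligned.lower()):
        if s == "-" or t == "-":
            # weights[6]: skip preceded by a skip; weights[7]: otherwise
            total += weights[6] if prev_skip else weights[7]
            prev_skip = True
            continue
        prev_skip = False
        s_vowel, t_vowel = s in VOWELS, t in VOWELS
        if s == t:
            total += weights[1] if s_vowel else weights[0]
        elif {s, t} in ({"i", "y"}, {"u", "w"}):
            total += weights[2]
        elif s_vowel and t_vowel:
            total += weights[3]
        elif not s_vowel and not t_vowel:
            total += weights[4]
        else:
            total += weights[5]
    return total
```

Under these assumptions, scoring 'hart--' against 'kordis' gives 60 + 30 + 0 + 60 + 50 + 40 = 240, and '--niy' against 'genu-' gives 50 + 40 + 0 + 30 + 50 = 170, matching the scores in the documented alignment examples.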
- alignment(src: str, tar: str) Tuple[float, str, str] [source]
Return the top Covington alignment of two strings.
This returns only the top alignment in a standard (score, source alignment, target alignment) tuple format.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Covington score & alignment
- Return type:
tuple(float, str, str)
Examples
>>> cmp = Covington()
>>> cmp.alignment('hart', 'kordis')
(240, 'hart--', 'kordis')
>>> cmp.alignment('niy', 'genu')
(170, '--niy', 'genu-')
New in version 0.4.1.
- alignments(src: str, tar: str, top_n: Optional[int] = None) List[Alignment] [source]
Return the Covington alignments of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
top_n (int) -- The number of alignments to return. If None, all alignments will be returned. If 0, all alignments with the top score will be returned.
- Returns:
Covington alignments
- Return type:
list
Examples
>>> cmp = Covington()
>>> cmp.alignments('hart', 'kordis', top_n=1)[0]
Alignment(src='hart--', tar='kordis', score=240)
>>> cmp.alignments('niy', 'genu', top_n=1)[0]
Alignment(src='--niy', tar='genu-', score=170)
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Covington distance of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Covington distance
- Return type:
float
Examples
>>> cmp = Covington()
>>> cmp.dist('cat', 'hat')
0.19117647058823528
>>> cmp.dist('Niall', 'Neil')
0.25555555555555554
>>> cmp.dist('aluminum', 'Catalan')
0.43333333333333335
>>> cmp.dist('ATCG', 'TAGC')
0.45454545454545453
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Covington distance of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Covington distance
- Return type:
float
Examples
>>> cmp = Covington()
>>> cmp.dist_abs('cat', 'hat')
65
>>> cmp.dist_abs('Niall', 'Neil')
115
>>> cmp.dist_abs('aluminum', 'Catalan')
325
>>> cmp.dist_abs('ATCG', 'TAGC')
200
New in version 0.4.0.
- class abydos.distance.DamerauLevenshtein(cost: Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: Callable[[List[float]], float] = max, **kwargs: Any)[source]
Bases:
_Distance
Damerau-Levenshtein distance.
This computes the Damerau-Levenshtein distance [Dam64]. Damerau-Levenshtein code is based on Java code by Kevin L. Stern [Ste14], under the MIT license: https://github.com/KevinStern/software-and-algorithms/blob/master/src/main/java/blogspot/software_and_algorithms/stern_library/string/DamerauLevenshteinAlgorithm.java
Initialize DamerauLevenshtein instance.
- Parameters:
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Damerau-Levenshtein distance of two strings.
Damerau-Levenshtein distance normalized to the interval [0, 1].
The Damerau-Levenshtein distance is normalized by dividing the Damerau-Levenshtein distance by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The normalized Damerau-Levenshtein distance
- Return type:
float
Examples
>>> cmp = DamerauLevenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.5
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
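The normalization described for `dist` can be sketched separately from the edit-distance computation itself. The helper below is hypothetical: it takes an already-computed absolute distance and applies the normalization term as described above, assuming the cost tuple is ordered (insert, delete, substitute, transpose).

```python
def normalize_distance(dist_abs, src, tar, cost=(1, 1, 1, 1), normalizer=max):
    """Divide an absolute edit distance by the described normalization term."""
    if src == tar:
        return 0.0
    ins_cost, del_cost = cost[0], cost[1]
    # Deleting all of src versus inserting all of tar.
    return dist_abs / normalizer([len(src) * del_cost, len(tar) * ins_cost])
```

For example, the documented absolute distance of 3 between 'Niall' and 'Neil' becomes 3/5 = 0.6 with the default `max`, while passing `normalizer=sum` would divide by 5 + 4 = 9 instead and always yields a value no larger than the `max` variant.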
- dist_abs(src: str, tar: str) float [source]
Return the Damerau-Levenshtein distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Damerau-Levenshtein distance between src & tar
- Return type:
int (may return a float if cost has float values)
- Raises:
ValueError -- Unsupported cost assignment; the cost of two transpositions must not be less than the cost of an insert plus a delete.
Examples
>>> cmp = DamerauLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
2
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.Dennis(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Dennis similarity.
For two sets X and Y and a population N, Dennis similarity [Den65] is
\[sim_{Dennis}(X, Y) = \frac{|X \cap Y| - \frac{|X| \cdot |Y|}{|N|}} {\sqrt{\frac{|X|\cdot|Y|}{|N|}}}\]
This is the fourth of Dennis' association measures, and the one she claims is the best of the four.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Dennis} = \frac{a-\frac{(a+b)(a+c)}{n}}{\sqrt{\frac{(a+b)(a+c)}{n}}}\]
New in version 0.4.0.
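The unnormalized score in the confusion-table form can be sketched as below. The function name is hypothetical, and the population size used in the usage note (n=784) is an assumption inferred from the documented `sim_score` outputs rather than something stated in this section.

```python
from math import sqrt


def dennis_score(a, b, c, n):
    """(a - E) / sqrt(E), where E = (a + b)(a + c) / n is the expected overlap."""
    expected = (a + b) * (a + c) / n
    # Returning 0.0 when one set is empty is an assumption of this sketch.
    return (a - expected) / sqrt(expected) if expected else 0.0
```

For 'cat' and 'hat' with a=2, b=2, c=2 and the assumed n=784, the expected overlap is 16/784 and the score is (2 - 1/49) * 7 = 97/7, about 13.857. With no shared tokens (a=0, b=5, c=5) the score is slightly negative, -5/28.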
Initialize Dennis instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Dennis correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Dennis correlation
- Return type:
float
Examples
>>> cmp = Dennis()
>>> cmp.corr('cat', 'hat')
0.494897959183673
>>> cmp.corr('Niall', 'Neil')
0.358162114559075
>>> cmp.corr('aluminum', 'Catalan')
0.107041854561785
>>> cmp.corr('ATCG', 'TAGC')
-0.006377551020408
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Dennis similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Dennis similarity
- Return type:
float
Examples
>>> cmp = Dennis()
>>> cmp.sim('cat', 'hat')
0.6632653061224487
>>> cmp.sim('Niall', 'Neil')
0.5721080763727167
>>> cmp.sim('aluminum', 'Catalan')
0.4046945697078567
>>> cmp.sim('ATCG', 'TAGC')
0.32908163265306134
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Dennis similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Dennis similarity
- Return type:
float
Examples
>>> cmp = Dennis()
>>> cmp.sim_score('cat', 'hat')
13.857142857142858
>>> cmp.sim_score('Niall', 'Neil')
10.028539207654113
>>> cmp.sim_score('aluminum', 'Catalan')
2.9990827802847835
>>> cmp.sim_score('ATCG', 'TAGC')
-0.17857142857142858
New in version 0.4.0.
- class abydos.distance.Dice(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
Tversky
Sørensen–Dice coefficient.
For two sets X and Y, the Sørensen–Dice coefficient [Cze09, Dic45, MDobrzanskiZ50, Sorensen48] is
\[sim_{Dice}(X, Y) = \frac{2 \cdot |X \cap Y|}{|X| + |Y|}\]
This is the complement of Bray & Curtis dissimilarity [BC57], also known as the Lance & Williams dissimilarity [LW67a].
This is identical to the Tanimoto similarity coefficient [Tan58] and the Tversky index [Tve77] for \(\alpha = \beta = 0.5\).
In the Ruby text library this is identified as White similarity, after [Whid.].
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Dice} = \frac{2a}{2a+b+c}\]
Notes
In terms of a confusion matrix, this is equivalent to \(F_1\) score ConfusionTable.f1_score().
The multiset variant is termed Gleason similarity [Gle20].
New in version 0.3.6.
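The equivalence with the \(F_1\) score noted above can be checked numerically. The sketch below uses hypothetical function names and works directly on confusion-table counts; it does not call `ConfusionTable`.

```python
def dice_from_counts(a, b, c):
    """2a / (2a + b + c); returns 0.0 for an empty table (an assumption)."""
    denominator = 2 * a + b + c
    return 2 * a / denominator if denominator else 0.0


def f1_from_counts(a, b, c):
    """Harmonic mean of precision a/(a+b) and recall a/(a+c)."""
    if a == 0:
        return 0.0
    precision = a / (a + b)
    recall = a / (a + c)
    return 2 * precision * recall / (precision + recall)
```

For 'Niall' and 'Neil' with padded bigrams, a=2, b=4 and c=3, so both functions give 4/11, about 0.3636, in line with the documented `sim` value.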
Initialize Dice instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Sørensen–Dice coefficient of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Sørensen–Dice similarity
- Return type:
float
Examples
>>> cmp = Dice()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.36363636363636365
>>> cmp.sim('aluminum', 'Catalan')
0.11764705882352941
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.DiceAsymmetricI(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Dice's Asymmetric I similarity.
For two sets X and Y, Dice's Asymmetric I similarity [Dic45] is
\[sim_{DiceAsymmetricI}(X, Y) = \frac{|X \cap Y|}{|X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{DiceAsymmetricI} = \frac{a}{a+b}\]
Notes
In terms of a confusion matrix, this is equivalent to precision or positive predictive value ConfusionTable.precision().
New in version 0.4.0.
Initialize DiceAsymmetricI instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Dice's Asymmetric I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Dice's Asymmetric I similarity
- Return type:
float
Examples
>>> cmp = DiceAsymmetricI()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.1111111111111111
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.DiceAsymmetricII(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Dice's Asymmetric II similarity.
For two sets X and Y, Dice's Asymmetric II similarity [Dic45] is
\[sim_{DiceAsymmetricII}(X, Y) = \frac{|X \cap Y|}{|Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{DiceAsymmetricII} = \frac{a}{a+c}\]
Notes
In terms of a confusion matrix, this is equivalent to recall, sensitivity, or true positive rate ConfusionTable.recall().
New in version 0.4.0.
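Since the two asymmetric measures differ only in which set supplies the denominator, swapping the arguments of one yields the other. The sketch below illustrates this with padded bigram multisets. The helper names are hypothetical, and the '$' and '#' padding symbols are an assumption about the default tokenizer.

```python
from collections import Counter


def bigrams(word, start="$", stop="#"):
    """Multiset of padded character bigrams (padding is an assumption)."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def dice_asymmetric_i(src, tar):
    """|X & Y| / |X|: share of the source's tokens found in the target."""
    x, y = bigrams(src), bigrams(tar)
    return sum((x & y).values()) / sum(x.values())


def dice_asymmetric_ii(src, tar):
    """|X & Y| / |Y|: share of the target's tokens found in the source."""
    x, y = bigrams(src), bigrams(tar)
    return sum((x & y).values()) / sum(y.values())
```

For 'Niall' (6 bigrams) and 'Neil' (5 bigrams) with 2 shared bigrams, the first measure gives 2/6 and the second 2/5. Calling `dice_asymmetric_i('Neil', 'Niall')` also gives 2/5.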
Initialize DiceAsymmetricII instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Dice's Asymmetric II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Dice's Asymmetric II similarity
- Return type:
float
Examples
>>> cmp = DiceAsymmetricII()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.4
>>> cmp.sim('aluminum', 'Catalan')
0.125
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.Digby(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Digby correlation.
For two sets X and Y and a population N, Digby's approximation of the tetrachoric correlation coefficient [Dig83] is
\[corr_{Digby}(X, Y) = \frac{(|X \cap Y| \cdot |(N \setminus X) \setminus Y|)^\frac{3}{4} - (|X \setminus Y| \cdot |Y \setminus X|)^\frac{3}{4}} {(|X \cap Y| \cdot |(N \setminus X) \setminus Y|)^\frac{3}{4} + (|X \setminus Y| \cdot |Y \setminus X|)^\frac{3}{4}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Digby} = \frac{(ad)^\frac{3}{4}-(bc)^\frac{3}{4}}{(ad)^\frac{3}{4}+(bc)^\frac{3}{4}}\]
New in version 0.4.0.
Initialize Digby instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Digby correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Digby correlation
- Return type:
float
Examples
>>> cmp = Digby()
>>> cmp.corr('cat', 'hat')
0.9774244829419212
>>> cmp.corr('Niall', 'Neil')
0.9491281473458171
>>> cmp.corr('aluminum', 'Catalan')
0.7541039303781305
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Digby similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Digby similarity
- Return type:
float
Examples
>>> cmp = Digby()
>>> cmp.sim('cat', 'hat')
0.9887122414709606
>>> cmp.sim('Niall', 'Neil')
0.9745640736729085
>>> cmp.sim('aluminum', 'Catalan')
0.8770519651890653
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
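The Digby formula can be sketched directly from 2x2 confusion table counts. The function names are hypothetical, the counts passed in are illustrative only, and the rescaling (1 + corr) / 2 used for the similarity is an assumption, although it agrees with the paired corr/sim example values listed above.

```python
def digby_corr(a: float, b: float, c: float, d: float) -> float:
    """Digby correlation: ((ad)^(3/4) - (bc)^(3/4)) / ((ad)^(3/4) + (bc)^(3/4))."""
    concordant = (a * d) ** 0.75
    discordant = (b * c) ** 0.75
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


def digby_sim(a: float, b: float, c: float, d: float) -> float:
    """Map the correlation from [-1, 1] onto [0, 1] (assumed rescaling)."""
    return (1.0 + digby_corr(a, b, c, d)) / 2.0


# Illustrative counts: with no shared tokens (a = 0) the correlation is -1.
print(digby_corr(0, 5, 5, 774))  # -1.0
print(digby_sim(0, 5, 5, 774))   # 0.0
```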
- class abydos.distance.DiscountedLevenshtein(mode: str = 'lev', normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, discount_from: ~typing.Union[int, str] = 1, discount_func: ~typing.Union[str, ~typing.Callable[[float], float]] = 'log', vowels: str = 'aeiou', **kwargs: ~typing.Any)[source]
Bases:
Levenshtein
Discounted Levenshtein distance.
This is a variant of Levenshtein distance for which edits later in a string have discounted cost, on the theory that earlier edits are less likely than later ones.
New in version 0.4.1.
Initialize DiscountedLevenshtein instance.
- Parameters:
mode (str) -- Specifies a mode for computing the discounted Levenshtein distance:
'lev' (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
'osa' computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
discount_from (int or str) -- If an int is supplied, this is the first character whose edit cost will be discounted. If the str 'coda' is supplied, discounting will start with the first non-vowel after the first vowel (the first syllable coda).
discount_func (str or function) -- The two supported str arguments are 'log', for a logarithmic discount function, and 'exp', for an exponential discount function. See the notes below for information on how to supply your own discount function.
vowels (str) -- These are the letters to consider as vowels when discount_from is set to 'coda'. It defaults to the English vowels 'aeiou', but it would be reasonable to localize this to other languages or to add orthographic semi-vowels like 'y', 'w', and even 'h'.
**kwargs -- Arbitrary keyword arguments
Notes
This class is highly experimental and will need additional tuning.
The discount function can be passed as a callable function. It should expect an integer as its only argument and return a float, ideally less than or equal to 1.0. The argument represents the degree of discounting to apply.
New in version 0.4.1.
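To make the contract for a custom discount function concrete, here is a minimal sketch of a callable that accepts an integer degree and returns a float no greater than 1.0. The name `reciprocal_discount` and its formula are hypothetical and are not one of the built-in 'log' or 'exp' discounts.

```python
def reciprocal_discount(degree: int) -> float:
    """Hypothetical discount: full cost at degree 0, shrinking afterwards."""
    return 1.0 / (1.0 + degree)


# The returned factors never exceed 1.0 and decrease as the degree grows.
print([round(reciprocal_discount(i), 3) for i in range(4)])
# [1.0, 0.5, 0.333, 0.25]

# Passing it to the class would look like this (requires abydos):
# cmp = DiscountedLevenshtein(discount_func=reciprocal_discount)
```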
- dist(src: str, tar: str) float [source]
Return the normalized Levenshtein distance between two strings.
The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated in either of the two supported modes) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The normalized Levenshtein distance between src & tar
- Return type:
float
Examples
>>> cmp = DiscountedLevenshtein()
>>> cmp.dist('cat', 'hat')
0.3513958291799864
>>> cmp.dist('Niall', 'Neil')
0.5909885886270658
>>> cmp.dist('aluminum', 'Catalan')
0.8348163322045603
>>> cmp.dist('ATCG', 'TAGC')
0.7217609721523955
New in version 0.4.1.
- dist_abs(src: str, tar: str) float [source]
Return the Levenshtein distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Levenshtein distance between src & tar
- Return type:
float (may return a float if cost has float values)
Examples
>>> cmp = DiscountedLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2.526064024369237
>>> cmp.dist_abs('aluminum', 'Catalan')
5.053867269967515
>>> cmp.dist_abs('ATCG', 'TAGC')
2.594032108779918

>>> cmp = DiscountedLevenshtein(mode='osa')
>>> cmp.dist_abs('ATCG', 'TAGC')
1.7482385137517997
>>> cmp.dist_abs('ACTG', 'TAGC')
3.342270622531718
New in version 0.4.1.
- class abydos.distance.Dispersion(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Dispersion correlation.
For two sets X and Y and a population N, the dispersion correlation [Cor17] is
\[corr_{dispersion}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {|N|^2}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{dispersion} = \frac{ad-bc}{n^2}\]
New in version 0.4.0.
Initialize Dispersion instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Dispersion correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Dispersion correlation
- Return type:
float
Examples
>>> cmp = Dispersion()
>>> cmp.corr('cat', 'hat')
0.002524989587671803
>>> cmp.corr('Niall', 'Neil')
0.002502212619741774
>>> cmp.corr('aluminum', 'Catalan')
0.0011570449105440383
>>> cmp.corr('ATCG', 'TAGC')
-4.06731570179092e-05
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Dispersion similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Dispersion similarity
- Return type:
float
Examples
>>> cmp = Dispersion()
>>> cmp.sim('cat', 'hat')
0.5012624947938359
>>> cmp.sim('Niall', 'Neil')
0.5012511063098709
>>> cmp.sim('aluminum', 'Catalan')
0.500578522455272
>>> cmp.sim('ATCG', 'TAGC')
0.499979663421491
New in version 0.4.0.
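The dispersion correlation reduces to a few arithmetic steps on the confusion table counts, sketched here with hypothetical function names. The counts in the call are illustrative assumptions (2 shared tokens, 2 unique to each string, 778 absent from both), and the (1 + corr) / 2 rescaling is likewise an assumption that matches the paired example values above.

```python
def dispersion_corr(a: float, b: float, c: float, d: float) -> float:
    """Dispersion correlation: (ad - bc) / n^2, with n = a + b + c + d."""
    n = a + b + c + d
    return (a * d - b * c) / (n * n) if n else 0.0


def dispersion_sim(a: float, b: float, c: float, d: float) -> float:
    """Map the correlation from [-1, 1] onto [0, 1] (assumed rescaling)."""
    return (1.0 + dispersion_corr(a, b, c, d)) / 2.0


# Illustrative counts; the result is small because n^2 dominates ad - bc.
corr = dispersion_corr(2, 2, 2, 778)
print(corr, dispersion_sim(2, 2, 2, 778))
```

Since d is usually much larger than a, b, and c for short strings, the correlation stays close to 0 and the similarity close to 0.5, as the documented examples show.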
- class abydos.distance.Doolittle(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Doolittle similarity.
For two sets X and Y and a population N, the Doolittle similarity [Doo84] is
\[sim_{Doolittle}(X, Y) = \frac{(|X \cap Y| \cdot |N| - |X| \cdot |Y|)^2} {|X| \cdot |Y| \cdot |N \setminus Y| \cdot |N \setminus X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Doolittle} = \frac{(an-(a+b)(a+c))^2}{(a+b)(a+c)(b+d)(c+d)}\]
New in version 0.4.0.
Initialize Doolittle instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Doolittle similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Doolittle similarity
- Return type:
float
Examples
>>> cmp = Doolittle()
>>> cmp.sim('cat', 'hat')
0.24744247205785666
>>> cmp.sim('Niall', 'Neil')
0.13009912077202224
>>> cmp.sim('aluminum', 'Catalan')
0.011710186806836291
>>> cmp.sim('ATCG', 'TAGC')
4.1196952743799446e-05
New in version 0.4.0.
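A minimal sketch of the Doolittle formula in confusion table terms follows. The function name and the counts used in the call are illustrative assumptions rather than values taken from the library. Note that the numerator term an - (a+b)(a+c) expands to ad - bc.

```python
def doolittle_sim(a: float, b: float, c: float, d: float) -> float:
    """Doolittle similarity: (an - (a+b)(a+c))^2 / ((a+b)(a+c)(b+d)(c+d))."""
    n = a + b + c + d
    numerator = (a * n - (a + b) * (a + c)) ** 2
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    return numerator / denominator if denominator else 0.0


# Illustrative counts: 2 shared tokens, 2 unique to each side, 778 in neither.
print(doolittle_sim(2, 2, 2, 778))
```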
- class abydos.distance.Dunning(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Dunning similarity.
For two sets X and Y and a population N, Dunning log-likelihood [Dun93], following [CGHH91], is
\[\begin{split}sim_{Dunning}(X, Y) = \lambda = |X \cap Y| \cdot log_2(|X \cap Y|) +\\ |X \setminus Y| \cdot log_2(|X \setminus Y|) + |Y \setminus X| \cdot log_2(|Y \setminus X|) +\\ |(N \setminus X) \setminus Y| \cdot log_2(|(N \setminus X) \setminus Y|) -\\ (|X| \cdot log_2(|X|) + |Y| \cdot log_2(|Y|) +\\ |N \setminus Y| \cdot log_2(|N \setminus Y|) + |N \setminus X| \cdot log_2(|N \setminus X|))\end{split}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[\begin{split}sim_{Dunning} = \lambda = a \cdot log_2(a) +\\ b \cdot log_2(b) + c \cdot log_2(c) + d \cdot log_2(d) - \\ ((a+b) \cdot log_2(a+b) + (a+c) \cdot log_2(a+c) +\\ (b+d) \cdot log_2(b+d) + (c+d) \cdot log_2(c+d))\end{split}\]
Notes
To avoid NaNs, every logarithm is calculated as the logarithm of 1 greater than the value in question. (Python's math.log1p function is used.)
New in version 0.4.0.
Initialize Dunning instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Dunning similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Dunning similarity
- Return type:
float
Examples
>>> cmp = Dunning()
>>> cmp.sim('cat', 'hat')
0.33462839191969423
>>> cmp.sim('Niall', 'Neil')
0.19229445539929793
>>> cmp.sim('aluminum', 'Catalan')
0.03220862737070572
>>> cmp.sim('ATCG', 'TAGC')
0.0010606026735052122
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Dunning similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Dunning similarity
- Return type:
float
Examples
>>> cmp = Dunning()
>>> cmp.sim('cat', 'hat')
0.33462839191969423
>>> cmp.sim('Niall', 'Neil')
0.19229445539929793
>>> cmp.sim('aluminum', 'Catalan')
0.03220862737070572
>>> cmp.sim('ATCG', 'TAGC')
0.0010606026735052122
New in version 0.4.0.
- class abydos.distance.Editex(cost: Tuple[int, int, int] = (0, 1, 2), local: bool = False, taper: bool = False, **kwargs: Any)[source]
Bases:
_Distance
Editex.
As described on pages 3 & 4 of [ZD96].
The local variant is based on [RU09].
New in version 0.3.6.
Changed in version 0.4.0: Added taper option
Initialize Editex instance.
- Parameters:
cost (tuple) -- A 3-tuple representing the cost of the three possible edits: match, same-group, and mismatch respectively (by default: (0, 1, 2))
local (bool) -- If True, the local variant of Editex is used
taper (bool) -- Enables cost tapering. Following [ZD96], it causes edits at the start of the string to "just [exceed] twice the minimum penalty for replacement or deletion at the end of the string".
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Editex distance between two strings.
The Editex distance is normalized by dividing the Editex distance (calculated by any of the three supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Editex distance
- Return type:
float
Examples
>>> cmp = Editex()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.2
>>> cmp.dist('aluminum', 'Catalan')
0.75
>>> cmp.dist('ATCG', 'TAGC')
0.75
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str) float [source]
Return the Editex distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Editex distance
- Return type:
int
Examples
>>> cmp = Editex()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('aluminum', 'Catalan')
12
>>> cmp.dist_abs('ATCG', 'TAGC')
6
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.Euclidean(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = 0, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
Minkowski
Euclidean distance.
Euclidean distance is the straight-line or as-the-crow-flies distance, equivalent to Minkowski distance in \(L^2\)-space.
New in version 0.3.6.
Initialize Euclidean instance.
- Parameters:
alphabet (collection or int) -- The values or size of the alphabet
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Euclidean distance between two strings.
The normalized Euclidean distance is a distance metric in \(L^2\)-space, normalized to [0, 1].
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
The normalized Euclidean distance
- Return type:
float
Examples
>>> cmp = Euclidean()
>>> round(cmp.dist('cat', 'hat'), 12)
0.57735026919
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.683130051064
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.727606875109
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str, normalized: bool = False) float [source]
Return the Euclidean distance between two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
normalized (bool) -- Normalizes to [0, 1] if True
- Returns:
The Euclidean distance
- Return type:
float
Examples
>>> cmp = Euclidean()
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> round(cmp.dist_abs('Niall', 'Neil'), 12)
2.645751311065
>>> cmp.dist_abs('Colin', 'Cuilen')
3.0
>>> round(cmp.dist_abs('ATCG', 'TAGC'), 12)
3.162277660168
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
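The unnormalized Euclidean distance over token counts can be sketched without abydos as follows. The helpers `bigrams` and `euclidean_abs` are hypothetical names, and the `$`/`#` padding is an assumption about the default q-gram tokenizer.

```python
from collections import Counter
from math import sqrt


def bigrams(word: str) -> Counter:
    """Return the multiset of character bigrams of word.

    The '$' and '#' boundary symbols are an assumption made for this sketch.
    """
    padded = f"${word}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def euclidean_abs(src: str, tar: str) -> float:
    """Square root of the summed squared count differences per token."""
    x, y = bigrams(src), bigrams(tar)
    # Counter returns 0 for tokens missing from one side.
    return sqrt(sum((x[tok] - y[tok]) ** 2 for tok in x.keys() | y.keys()))


print(euclidean_abs("cat", "hat"))       # 2.0
print(euclidean_abs("Colin", "Cuilen"))  # 3.0
```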
- class abydos.distance.Eudex(weights: Optional[Union[str, Iterable[float], Callable[[], Generator[float, None, None]]]] = 'exponential', max_length: int = 8, **kwargs: Any)[source]
Bases:
_Distance
Distance between the Eudex hashes of two terms.
Cf. [Tic].
New in version 0.3.6.
Initialize Eudex instance.
- Parameters:
weights (str, iterable, or generator function) -- The weights or weights generator function
If set to None, a simple Hamming distance is calculated.
If set to 'exponential', weight decays by powers of 2, as proposed in the eudex specification: https://github.com/ticki/eudex.
If set to 'fibonacci', weight decays through the Fibonacci series, as in the eudex reference implementation.
If set to a callable function, this assumes it creates a generator and the generator is used to populate a series of weights.
If set to an iterable, the iterable's values should be integers and will be used as the weights.
In all cases, the weights should be ordered or generated from least significant to most significant, so larger values should generally come first.
max_length (int) -- The number of characters to encode as a eudex hash
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return normalized distance between the Eudex hashes of two terms.
This is Eudex distance normalized to [0, 1].
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The normalized Eudex Hamming distance
- Return type:
float
Examples
>>> cmp = Eudex()
>>> round(cmp.dist('cat', 'hat'), 12)
0.062745098039
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.000980392157
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.004901960784
>>> round(cmp.dist('ATCG', 'TAGC'), 12)
0.197549019608
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str, normalized: bool = False) float [source]
Calculate the distance between the Eudex hashes of two terms.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
normalized (bool) -- Normalizes to [0, 1] if True
- Returns:
The Eudex Hamming distance
- Return type:
int
Examples
>>> cmp = Eudex()
>>> cmp.dist_abs('cat', 'hat')
128
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('Colin', 'Cuilen')
10
>>> cmp.dist_abs('ATCG', 'TAGC')
403

>>> cmp = Eudex(weights='fibonacci')
>>> cmp.dist_abs('cat', 'hat')
34
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('Colin', 'Cuilen')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
117

>>> cmp = Eudex(weights=None)
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
1
>>> cmp.dist_abs('Colin', 'Cuilen')
2
>>> cmp.dist_abs('ATCG', 'TAGC')
9

>>> # Using the OEIS A000142:
>>> cmp = Eudex(weights=[1, 1, 2, 6, 24, 120, 720, 5040])
>>> cmp.dist_abs('cat', 'hat')
5040
>>> cmp.dist_abs('Niall', 'Neil')
1
>>> cmp.dist_abs('Colin', 'Cuilen')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
15130
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- static gen_exponential(base: int = 2) Generator[float, None, None] [source]
Yield the next value in an exponential series of the base.
Starts at base**0
- Parameters:
base (int) -- The base to exponentiate
- Yields:
int -- The next power of base
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- static gen_fibonacci() Generator[float, None, None] [source]
Yield the next Fibonacci number.
Based on https://www.python-course.eu/generators.php
Starts at Fibonacci number 3 (the second 1)
- Yields:
int -- The next Fibonacci number
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
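The two weight generators described above might look roughly like the following stand-alone sketch. The names `exponential_weights` and `fibonacci_weights` are hypothetical stand-ins, not the library's methods, and the Fibonacci version assumes the series begins 1, 2, 3, 5, ... (that is, it skips the first 1).

```python
from itertools import islice
from typing import Iterator


def exponential_weights(base: int = 2) -> Iterator[int]:
    """Yield base**0, base**1, base**2, ... without end."""
    exponent = 0
    while True:
        yield base ** exponent
        exponent += 1


def fibonacci_weights() -> Iterator[int]:
    """Yield Fibonacci numbers starting 1, 2, 3, 5, ... (assumed start)."""
    prev, cur = 1, 2
    while True:
        yield prev
        prev, cur = cur, prev + cur


# Take one weight per encoded character, e.g. for max_length = 8.
print(list(islice(exponential_weights(), 8)))  # [1, 2, 4, 8, 16, 32, 64, 128]
print(list(islice(fibonacci_weights(), 8)))    # [1, 2, 3, 5, 8, 13, 21, 34]
```

The largest of the eight weights in each series (128 and 34) coincides with the documented dist_abs values for 'cat' and 'hat' under the exponential and Fibonacci weightings.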
- class abydos.distance.Eyraud(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Eyraud similarity.
For two sets X and Y and a population N, the Eyraud similarity [Eyr38] is
\[sim_{Eyraud}(X, Y) = \frac{|X \cap Y| - |X| \cdot |Y|} {|X| \cdot |Y| \cdot |N \setminus Y| \cdot |N \setminus X|}\]
For lack of access to the original, this formula is based on the concurring formulae presented in [Shi93] and [Hubalek08].
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Eyraud} = \frac{a-(a+b)(a+c)}{(a+b)(a+c)(b+d)(c+d)}\]
New in version 0.4.0.
Initialize Eyraud instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Eyraud similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Eyraud similarity
- Return type:
float
Examples
>>> cmp = Eyraud()
>>> cmp.sim('cat', 'hat')
1.438198553583169e-06
>>> cmp.sim('Niall', 'Neil')
1.5399964580081465e-06
>>> cmp.sim('aluminum', 'Catalan')
1.6354719962967386e-06
>>> cmp.sim('ATCG', 'TAGC')
1.6478781097519779e-06
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Eyraud similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Eyraud similarity
- Return type:
float
Examples
>>> cmp = Eyraud()
>>> cmp.sim_score('cat', 'hat')
-1.438198553583169e-06
>>> cmp.sim_score('Niall', 'Neil')
-1.5399964580081465e-06
>>> cmp.sim_score('aluminum', 'Catalan')
-1.6354719962967386e-06
>>> cmp.sim_score('ATCG', 'TAGC')
-1.6478781097519779e-06
New in version 0.4.0.
- class abydos.distance.FagerMcGowan(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Fager & McGowan similarity.
For two sets X and Y, the Fager & McGowan similarity [Fag57, FM63] is
\[sim_{FagerMcGowan}(X, Y) = \frac{|X \cap Y|}{\sqrt{|X|\cdot|Y|}} - \frac{1}{2\sqrt{max(|X|, |Y|)}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{FagerMcGowan} = \frac{a}{\sqrt{(a+b)(a+c)}} - \frac{1}{2\sqrt{max(a+b, a+c)}}\]
New in version 0.4.0.
Initialize FagerMcGowan instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Fager & McGowan similarity of two strings.
As this similarity ranges over \((-\infty, 1.0)\), this normalization simply clamps the value to the range (0.0, 1.0).
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Fager & McGowan similarity
- Return type:
float
Examples
>>> cmp = FagerMcGowan()
>>> cmp.sim('cat', 'hat')
0.25
>>> cmp.sim('Niall', 'Neil')
0.16102422643817918
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Fager & McGowan similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Fager & McGowan similarity
- Return type:
float
Examples
>>> cmp = FagerMcGowan()
>>> cmp.sim_score('cat', 'hat')
0.25
>>> cmp.sim_score('Niall', 'Neil')
0.16102422643817918
>>> cmp.sim_score('aluminum', 'Catalan')
-0.048815536468908724
>>> cmp.sim_score('ATCG', 'TAGC')
-0.22360679774997896
New in version 0.4.0.
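The relationship between the raw Fager & McGowan score and its clamped, normalized form can be sketched as below. The function names are hypothetical, and the `$`/`#` bigram padding is an assumption about the default q-gram tokenizer.

```python
from collections import Counter
from math import sqrt


def bigrams(word: str) -> Counter:
    """Return the multiset of character bigrams of word.

    The '$' and '#' boundary symbols are an assumption made for this sketch.
    """
    padded = f"${word}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def fager_mcgowan_score(src: str, tar: str) -> float:
    """Raw score: a / sqrt(|X||Y|) - 1 / (2 sqrt(max(|X|, |Y|)))."""
    x, y = bigrams(src), bigrams(tar)
    shared = sum((x & y).values())
    size_x, size_y = sum(x.values()), sum(y.values())
    return shared / sqrt(size_x * size_y) - 1 / (2 * sqrt(max(size_x, size_y)))


def fager_mcgowan_sim(src: str, tar: str) -> float:
    """Normalized form: negative scores are clamped to 0.0."""
    return max(0.0, fager_mcgowan_score(src, tar))


print(fager_mcgowan_score("cat", "hat"))   # 0.25
print(fager_mcgowan_sim("ATCG", "TAGC"))   # 0.0
```

The second term penalizes short token sets, so two strings with no shared tokens receive a slightly negative raw score rather than zero.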
- class abydos.distance.Faith(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Faith similarity.
For two sets X and Y and a population N, the Faith similarity [Fai83] is
\[sim_{Faith}(X, Y) = \frac{|X \cap Y| + \frac{|(N \setminus X) \setminus Y|}{2}}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Faith} = \frac{a+\frac{d}{2}}{n}\]
New in version 0.4.0.
Initialize Faith instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Faith similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Faith similarity
- Return type:
float
Examples
>>> cmp = Faith()
>>> cmp.sim('cat', 'hat')
0.4987244897959184
>>> cmp.sim('Niall', 'Neil')
0.4968112244897959
>>> cmp.sim('aluminum', 'Catalan')
0.4910828025477707
>>> cmp.sim('ATCG', 'TAGC')
0.49362244897959184
New in version 0.4.0.
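A short sketch of the Faith formula shows why the documented values cluster just below 0.5: joint absences (d) count half, and for short strings d makes up nearly all of n. The function name and the counts are illustrative assumptions.

```python
def faith_sim(a: float, b: float, c: float, d: float) -> float:
    """Faith similarity: (a + d/2) / n, with n = a + b + c + d."""
    n = a + b + c + d
    return (a + d / 2) / n if n else 0.0


# Illustrative counts: 2 shared tokens, 2 unique to each side, 778 in neither.
print(faith_sim(2, 2, 2, 778))  # approximately 0.4987
```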
- class abydos.distance.FellegiSunter(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', simplified: bool = False, mismatch_factor: float = 0.5, **kwargs: Any)[source]
Bases:
_TokenDistance
Fellegi-Sunter similarity.
Fellegi-Sunter similarity is based on the description in [CRF03] and implementation in [CRFR03].
New in version 0.4.0.
Initialize FellegiSunter instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
simplified (bool) -- Specifies to use the simplified scoring variant
mismatch_factor (float) -- Specifies the penalty factor for mismatches
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Fellegi-Sunter similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Fellegi-Sunter similarity
- Return type:
float
Examples
>>> cmp = FellegiSunter()
>>> cmp.sim('cat', 'hat')
0.2934477792670495
>>> cmp.sim('Niall', 'Neil')
0.13917536933271363
>>> cmp.sim('aluminum', 'Catalan')
0.056763632331436484
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Fellegi-Sunter similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Fellegi-Sunter similarity
- Return type:
float
Examples
>>> cmp = FellegiSunter()
>>> cmp.sim_score('cat', 'hat')
0.8803433378011485
>>> cmp.sim_score('Niall', 'Neil')
0.6958768466635681
>>> cmp.sim_score('aluminum', 'Catalan')
0.45410905865149187
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.Fidelity(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Fidelity.
For two multisets X and Y drawn from an alphabet S, fidelity is
\[sim_{Fidelity}(X, Y) = \Bigg( \sum_{i \in S} \sqrt{\Big|\frac{X_i}{|X|} \cdot \frac{Y_i}{|Y|}\Big|} \Bigg)^2\]
New in version 0.4.0.
Initialize Fidelity instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the fidelity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
fidelity
- Return type:
float
Examples
>>> cmp = Fidelity()
>>> cmp.sim('cat', 'hat')
0.25
>>> cmp.sim('Niall', 'Neil')
0.1333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.013888888888888888
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
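The fidelity formula can be checked by hand on small token multisets. The following is a standalone sketch for illustration, not the library's implementation. The helper names padded_bigrams and fidelity are hypothetical, and it assumes that tokens are character bigrams padded with $ and #, which is not stated in this section.

```python
from collections import Counter
from math import sqrt


def padded_bigrams(word, start="$", stop="#"):
    """Return a Counter of character bigrams of word, padded at both ends.

    The padding symbols are an assumption made for this sketch.
    """
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def fidelity(x, y):
    """Compute (sum_i sqrt(X_i/|X| * Y_i/|Y|))**2 for two token multisets."""
    size_x = sum(x.values())
    size_y = sum(y.values())
    if size_x == 0 or size_y == 0:
        # Returning 0.0 for an empty multiset is a choice made for this sketch.
        return 0.0
    # Only tokens present in both multisets contribute a non-zero term.
    total = sum(
        sqrt((x[tok] / size_x) * (y[tok] / size_y))
        for tok in x.keys() & y.keys()
    )
    return total ** 2


print(fidelity(padded_bigrams("cat"), padded_bigrams("hat")))  # 0.25
```

Under these assumptions 'cat' and 'hat' share two of their four bigrams (at and t#), so the sum is 1/4 + 1/4 and its square is 0.25, which agrees with the documented example.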
- class abydos.distance.Fleiss(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Fleiss correlation.
For two sets X and Y and a population N, Fleiss correlation [Fle75] is
\[corr_{Fleiss}(X, Y) = \frac{(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|) \cdot (|X| \cdot |N \setminus X| + |Y| \cdot |N \setminus Y|)} {2 \cdot |X| \cdot |N \setminus X| \cdot |Y| \cdot |N \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Fleiss} = \frac{(ad-bc)((a+b)(c+d)+(a+c)(b+d))}{2(a+b)(c+d)(a+c)(b+d)}\]
This is Fleiss' \(M(A_1)\), \(ad-bc\) divided by the harmonic mean of the marginals \(p_1q_1 = (a+b)(c+d)\) and \(p_2q_2 = (a+c)(b+d)\).
New in version 0.4.0.
Initialize Fleiss instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Fleiss correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Fleiss correlation
- Return type:
float
Examples
>>> cmp = Fleiss()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.3621712520061204
>>> cmp.corr('aluminum', 'Catalan')
0.10839724112919989
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Fleiss similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Fleiss similarity
- Return type:
float
Examples
>>> cmp = Fleiss()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6810856260030602
>>> cmp.sim('aluminum', 'Catalan')
0.5541986205645999
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
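The 2x2 confusion table form of the Fleiss correlation can be evaluated directly from the four counts. The sketch below is a standalone illustration rather than the library's code. The function names are hypothetical, the zero-denominator fallback is an arbitrary choice, and the counts used (a=2, b=2, c=2, d=778) are an assumption: they correspond to two strings sharing two of four tokens each within a population of 784.

```python
def fleiss_corr(a, b, c, d):
    """Fleiss correlation from 2x2 confusion table counts.

    Equivalent to (ad - bc) divided by the harmonic mean of
    p1q1 = (a+b)(c+d) and p2q2 = (a+c)(b+d).
    """
    p1q1 = (a + b) * (c + d)
    p2q2 = (a + c) * (b + d)
    if p1q1 == 0 or p2q2 == 0:
        # Fallback chosen for this sketch; the formula is undefined here.
        return 0.0
    return (a * d - b * c) * (p1q1 + p2q2) / (2 * p1q1 * p2q2)


def corr_to_sim(corr):
    """Map a correlation in [-1, 1] onto [0, 1]."""
    return (1.0 + corr) / 2.0


# Assumed counts: 2 shared tokens, 2 unique to each string, population 784.
corr = fleiss_corr(2, 2, 2, 778)
print(corr)
print(corr_to_sim(corr))
```

With these counts the correlation is about 0.4974 and the rescaled value about 0.7487. The documented corr and sim examples for 'cat' and 'hat' are consistent with this (1 + corr) / 2 relationship.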
- class abydos.distance.FleissLevinPaik(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Fleiss-Levin-Paik similarity.
For two sets X and Y and a population N, Fleiss-Levin-Paik similarity [FLP03] is
\[sim_{FleissLevinPaik}(X, Y) = \frac{2|(N \setminus X) \setminus Y|} {2|(N \setminus X) \setminus Y| + |X \setminus Y| + |Y \setminus X|}\]
This is [Mor12]'s 'd Specific Agreement'.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{FleissLevinPaik} = \frac{2d}{2d + b + c}\]
New in version 0.4.0.
Initialize FleissLevinPaik instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Fleiss-Levin-Paik similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Fleiss-Levin-Paik similarity
- Return type:
float
Examples
>>> cmp = FleissLevinPaik()
>>> cmp.sim('cat', 'hat')
0.9974358974358974
>>> cmp.sim('Niall', 'Neil')
0.9955041746949261
>>> cmp.sim('aluminum', 'Catalan')
0.9903412749517064
>>> cmp.sim('ATCG', 'TAGC')
0.993581514762516
New in version 0.4.0.
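Because the Fleiss-Levin-Paik formula depends only on b, c, and d, it stays close to 1 whenever the population is much larger than the two token sets. The sketch below is a standalone illustration with a hypothetical function name; the counts are assumptions (two short strings within a population of 784), and the value returned for an empty table is an arbitrary choice.

```python
def fleiss_levin_paik_sim(b, c, d):
    """Return 2d / (2d + b + c) from 2x2 confusion table counts."""
    denom = 2 * d + b + c
    if denom == 0:
        # Undefined for an empty table; 0.0 is a choice made for this sketch.
        return 0.0
    return 2 * d / denom


# Assumed counts: 2 tokens unique to each string, 778 tokens in neither.
print(fleiss_levin_paik_sim(2, 2, 778))
# No shared tokens at all (5 unique to each string, 774 in neither).
print(fleiss_levin_paik_sim(5, 5, 774))
```

Both calls give values above 0.99, which matches the pattern in the documented examples: even 'ATCG' and 'TAGC', which have a Fidelity of 0.0, score about 0.9936 here because the shared absences d dominate the result.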
- class abydos.distance.FlexMetric(normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, indel_costs: ~typing.Optional[~typing.List[~typing.Tuple[~typing.Union[~typing.Sequence[str], ~typing.Set[str], ~typing.FrozenSet[str]], float]]] = None, subst_costs: ~typing.Optional[~typing.List[~typing.Tuple[~typing.Union[~typing.Sequence[str], ~typing.Set[str], ~typing.FrozenSet[str]], float]]] = None, **kwargs: ~typing.Any)[source]
Bases:
_Distance
FlexMetric distance.
FlexMetric distance [Kem05]
New in version 0.4.0.
Initialize FlexMetric instance.
- Parameters:
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
indel_costs (list of tuples) -- A list of insertion and deletion costs. Each list element should be a tuple consisting of an iterable (sets are best) and a float value. The iterable consists of those letters whose insertion or deletion has a cost equal to the float value.
subst_costs (list of tuples) -- A list of substitution costs. Each list element should be a tuple consisting of an iterable (sets are best) and a float value. The iterable consists of the letters in each letter class, which may be substituted for each other at cost equal to the float value.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
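The shape of the indel_costs and subst_costs arguments is easier to see with concrete values. The sketch below builds two such lists and looks costs up in them. The letter classes, the cost values, the helper functions, the first-match lookup order, and the default cost of 1.0 are all assumptions made for illustration; none of them are the library's defaults or its lookup logic.

```python
# Hypothetical cost tables in the documented shape: a list of
# (iterable of letters, cost) tuples. Values are invented for illustration.
indel_costs = [
    (frozenset("aeiou"), 0.4),
    (frozenset("bcdfghjklmnpqrstvwxyz"), 0.8),
]
subst_costs = [
    (frozenset("aeiou"), 0.3),
    (frozenset("bp"), 0.2),
]


def indel_cost(letter, costs, default=1.0):
    """Return the cost of the first class containing letter, else default."""
    for letters, cost in costs:
        if letter in letters:
            return cost
    return default


def subst_cost(src, tar, costs, default=1.0):
    """Return the cost of substituting src with tar.

    Identical letters cost nothing; letters in the same class cost that
    class's value; anything else falls back to default.
    """
    if src == tar:
        return 0.0
    for letters, cost in costs:
        if src in letters and tar in letters:
            return cost
    return default


print(indel_cost("a", indel_costs))       # 0.4
print(subst_cost("b", "p", subst_costs))  # 0.2
print(subst_cost("a", "b", subst_costs))  # 1.0
```

Sets (or frozensets) are used for the letter classes because membership tests are constant time, which is in line with the note that sets are best.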
- dist(src: str, tar: str) float [source]
Return the normalized FlexMetric distance of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized FlexMetric distance
- Return type:
float
Examples
>>> cmp = FlexMetric()
>>> cmp.dist('cat', 'hat')
0.26666666666666666
>>> cmp.dist('Niall', 'Neil')
0.3
>>> cmp.dist('aluminum', 'Catalan')
0.8375
>>> cmp.dist('ATCG', 'TAGC')
0.5499999999999999
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the FlexMetric distance of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
FlexMetric distance
- Return type:
float
Examples
>>> cmp = FlexMetric()
>>> cmp.dist_abs('cat', 'hat')
0.8
>>> cmp.dist_abs('Niall', 'Neil')
1.5
>>> cmp.dist_abs('aluminum', 'Catalan')
6.7
>>> cmp.dist_abs('ATCG', 'TAGC')
2.1999999999999997
New in version 0.4.0.
- class abydos.distance.ForbesI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Forbes I similarity.
For two sets X and Y and a population N, the Forbes I similarity [For07, Moz36] is
\[sim_{ForbesI}(X, Y) = \frac{|N| \cdot |X \cap Y|}{|X| \cdot |Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{ForbesI} = \frac{na}{(a+b)(a+c)}\]
New in version 0.4.0.
Initialize ForbesI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Forbes I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Forbes I similarity
- Return type:
float
Examples
>>> cmp = ForbesI()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.11125283446712018
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Forbes I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Forbes I similarity
- Return type:
float
Examples
>>> cmp = ForbesI()
>>> cmp.sim_score('cat', 'hat')
98.0
>>> cmp.sim_score('Niall', 'Neil')
52.266666666666666
>>> cmp.sim_score('aluminum', 'Catalan')
10.902777777777779
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.0.
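The unnormalized Forbes I score is the ratio of the observed intersection size a to the size (a+b)(a+c)/n expected if X and Y were independent, which explains why sim_score can be far greater than 1. The sketch below is a standalone illustration with a hypothetical function name; the counts and the population size of 784 are assumptions, and the zero-denominator fallback is an arbitrary choice.

```python
def forbes_i_score(a, b, c, n):
    """Return n*a / ((a+b)(a+c)) from 2x2 confusion table counts."""
    denom = (a + b) * (a + c)
    if denom == 0:
        # Undefined when either set is empty; 0.0 is chosen for this sketch.
        return 0.0
    return n * a / denom


# Assumed counts: 2 shared tokens, 2 unique to each string, population 784.
print(forbes_i_score(2, 2, 2, 784))  # 98.0
```

A score of 98.0 means the two token sets overlap 98 times more than independence would predict for sets of this size in a population of 784.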
- class abydos.distance.ForbesII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Forbes II correlation.
For two sets X and Y and a population N, the Forbes II correlation, as described in [For25], is
\[corr_{ForbesII}(X, Y) = \frac{|X \setminus Y| \cdot |Y \setminus X| - |X \cap Y| \cdot |(N \setminus X) \setminus Y|} {|X| \cdot |Y| - |N| \cdot min(|X|, |Y|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{ForbesII} = \frac{bc-ad}{(a+b)(a+c) - n \cdot min(a+b, a+c)}\]
New in version 0.4.0.
Initialize ForbesII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Forbes II correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Forbes II correlation
- Return type:
float
Examples
>>> cmp = ForbesII()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.3953727506426735
>>> cmp.corr('aluminum', 'Catalan')
0.11485180412371133
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Forbes II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Forbes II similarity
- Return type:
float
Examples
>>> cmp = ForbesII()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6976863753213367
>>> cmp.sim('aluminum', 'Catalan')
0.5574259020618557
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
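The confusion table form of Forbes II can be evaluated with a few lines of arithmetic. The sketch below is a standalone illustration, not the library's implementation; the function name is hypothetical, the counts are assumptions (a population of 784), and the zero-denominator fallback is an arbitrary choice.

```python
def forbes_ii_corr(a, b, c, d):
    """Return (bc - ad) / ((a+b)(a+c) - n*min(a+b, a+c))."""
    n = a + b + c + d
    denom = (a + b) * (a + c) - n * min(a + b, a + c)
    if denom == 0:
        # Undefined in this case; 0.0 is a choice made for this sketch.
        return 0.0
    return (b * c - a * d) / denom


# Assumed counts: 2 shared tokens, 2 unique to each string, population 784.
print(forbes_ii_corr(2, 2, 2, 778))
# Unequal set sizes: 2 shared, 4 and 3 unique, population 784.
print(forbes_ii_corr(2, 4, 3, 775))
```

Numerator and denominator are both negative when the sets overlap more than chance predicts, so the result is positive: about 0.4974 and 0.3954 for the two calls, in line with the first two documented corr examples.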
- class abydos.distance.Fossum(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Fossum similarity.
For two sets X and Y and a population N, the Fossum similarity [FK66] is
\[sim_{Fossum}(X, Y) = \frac{|N| \cdot \Big(|X \cap Y|-\frac{1}{2}\Big)^2}{|X| \cdot |Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Fossum} = \frac{n(a-\frac{1}{2})^2}{(a+b)(a+c)}\]
New in version 0.4.0.
Initialize Fossum instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Fossum similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Fossum similarity
- Return type:
float
Examples
>>> cmp = Fossum()
>>> cmp.sim('cat', 'hat')
0.1836734693877551
>>> cmp.sim('Niall', 'Neil')
0.08925619834710742
>>> cmp.sim('aluminum', 'Catalan')
0.0038927335640138415
>>> cmp.sim('ATCG', 'TAGC')
0.01234567901234568
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Fossum similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Fossum similarity
- Return type:
float
Examples
>>> cmp = Fossum()
>>> cmp.sim_score('cat', 'hat')
110.25
>>> cmp.sim_score('Niall', 'Neil')
58.8
>>> cmp.sim_score('aluminum', 'Catalan')
2.7256944444444446
>>> cmp.sim_score('ATCG', 'TAGC')
7.84
New in version 0.4.0.
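The one-half correction in the Fossum formula has a visible side effect: the score is non-zero even when the two sets share nothing. The sketch below is a standalone illustration with a hypothetical function name; the counts and the population size of 784 are assumptions, and the zero-denominator fallback is an arbitrary choice.

```python
def fossum_score(a, b, c, n):
    """Return n * (a - 1/2)**2 / ((a+b)(a+c))."""
    denom = (a + b) * (a + c)
    if denom == 0:
        # Undefined when either set is empty; 0.0 is chosen for this sketch.
        return 0.0
    return n * (a - 0.5) ** 2 / denom


# Assumed counts: 2 shared tokens, 2 unique to each string, population 784.
print(fossum_score(2, 2, 2, 784))  # 110.25
# No shared tokens: the squared -1/2 term still contributes.
print(fossum_score(0, 5, 5, 784))
```

The second call gives about 7.84 rather than 0, which is consistent with the documented sim_score for 'ATCG' and 'TAGC'.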
- class abydos.distance.FuzzyWuzzyPartialString(**kwargs: Any)[source]
Bases:
_Distance
FuzzyWuzzy Partial String similarity.
This follows the FuzzyWuzzy Partial String similarity algorithm [Coh11]. Rather than returning an integer in the range [0, 100], as demonstrated in the blog post, this implementation returns a float in the range [0.0, 1.0].
New in version 0.4.0.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the FuzzyWuzzy Partial String similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
FuzzyWuzzy Partial String similarity
- Return type:
float
Examples
>>> cmp = FuzzyWuzzyPartialString()
>>> round(cmp.sim('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.75
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.428571428571
>>> cmp.sim('ATCG', 'TAGC')
0.5
New in version 0.4.0.
- class abydos.distance.FuzzyWuzzyTokenSet(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
FuzzyWuzzy Token Set similarity.
This follows the FuzzyWuzzy Token Set similarity algorithm [Coh11]. Rather than returning an integer in the range [0, 100], as demonstrated in the blog post, this implementation returns a float in the range [0.0, 1.0]. Distinct from the
New in version 0.4.0.
Initialize FuzzyWuzzyTokenSet instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package. By default, the regexp tokenizer is employed, matching only letters.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the FuzzyWuzzy Token Set similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
FuzzyWuzzy Token Set similarity
- Return type:
float
Examples
>>> cmp = FuzzyWuzzyTokenSet()
>>> cmp.sim('cat', 'hat')
0.75
>>> cmp.sim('Niall', 'Neil')
0.7272727272727273
>>> cmp.sim('aluminum', 'Catalan')
0.47058823529411764
>>> cmp.sim('ATCG', 'TAGC')
0.6
New in version 0.4.0.
- class abydos.distance.FuzzyWuzzyTokenSort(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
FuzzyWuzzy Token Sort similarity.
This follows the FuzzyWuzzy Token Sort similarity algorithm [Coh11]. Rather than returning an integer in the range [0, 100], as demonstrated in the blog post, this implementation returns a float in the range [0.0, 1.0].
New in version 0.4.0.
Initialize FuzzyWuzzyTokenSort instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package. By default, the regexp tokenizer is employed, matching only letters.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the FuzzyWuzzy Token Sort similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
FuzzyWuzzy Token Sort similarity
- Return type:
float
Examples
>>> cmp = FuzzyWuzzyTokenSort()
>>> cmp.sim('cat', 'hat')
0.6666666666666666
>>> cmp.sim('Niall', 'Neil')
0.6666666666666666
>>> cmp.sim('aluminum', 'Catalan')
0.4
>>> cmp.sim('ATCG', 'TAGC')
0.5
New in version 0.4.0.
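The idea behind a token sort comparison is to make the score insensitive to word order: tokens are extracted, sorted, rejoined, and the resulting strings are compared. The sketch below illustrates that idea with the standard library's difflib. It is not the library's implementation; the function name, the lowercasing step, the letters-only regular expression, and the use of SequenceMatcher.ratio() are assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def token_sort_sim(src, tar):
    """Compare two strings after sorting their letter-only tokens."""

    def normalize(text):
        # Keep runs of letters only, lowercase them, and sort the tokens.
        tokens = re.findall(r"[A-Za-z]+", text.lower())
        return " ".join(sorted(tokens))

    return SequenceMatcher(None, normalize(src), normalize(tar)).ratio()


print(token_sort_sim("new york mets", "mets new york"))  # 1.0
print(token_sort_sim("cat", "hat"))
```

The first call returns 1.0 because both inputs normalize to the same string. For single-word inputs such as 'cat' and 'hat' the sorting step has no effect and the result is a plain sequence ratio of about 0.667, on the same [0.0, 1.0] scale described above.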
- class abydos.distance.GeneralizedFleiss(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', mean_func: str = 'arithmetic', marginals: str = 'a', proportional: bool = False, **kwargs: Any)[source]
Bases:
_TokenDistance
Generalized Fleiss correlation.
For two sets X and Y and a population N, Generalized Fleiss correlation is based on observations from [Fle75].
\[corr_{GeneralizedFleiss}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {\mu_{products~of~marginals}}\]
The mean function \(\mu\) may be any of the mean functions in abydos.stats. The products of marginals may be one of the following:
a: \(|X| \cdot |N \setminus X|\) & \(|Y| \cdot |N \setminus Y|\)
b: \(|X| \cdot |Y|\) & \(|N \setminus X| \cdot |N \setminus Y|\)
c: \(|X| \cdot |N \setminus Y|\) & \(|Y| \cdot |N \setminus X|\)
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{GeneralizedFleiss} = \frac{ad-bc}{\mu_{products~of~marginals}}\]
And the products of marginals are:
a: \(p_1q_1 = (a+b)(c+d)\) & \(p_2q_2 = (a+c)(b+d)\)
b: \(p_1p_2 = (a+b)(a+c)\) & \(q_1q_2 = (c+d)(b+d)\)
c: \(p_1q_2 = (a+b)(b+d)\) & \(p_2q_1 = (a+c)(c+d)\)
New in version 0.4.0.
Initialize GeneralizedFleiss instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
mean_func (str or function) -- Specifies the mean function to use. A function taking a list of numbers as its only required argument may be supplied, or one of the following strings will select the specified mean function from abydos.stats:
arithmetic employs amean(), and this measure will be identical to MaxwellPilliner with otherwise default parameters
geometric employs gmean(), and this measure will be identical to PearsonPhi with otherwise default parameters
harmonic employs hmean(), and this measure will be identical to Fleiss with otherwise default parameters
ag employs the arithmetic-geometric mean agmean()
gh employs the geometric-harmonic mean ghmean()
agh employs the arithmetic-geometric-harmonic mean aghmean()
contraharmonic employs the contraharmonic mean cmean()
identric employs the identric mean imean()
logarithmic employs the logarithmic mean lmean()
quadratic employs the quadratic mean qmean()
heronian employs the Heronian mean heronian_mean()
hoelder employs the Hölder mean hoelder_mean()
lehmer employs the Lehmer mean lehmer_mean()
seiffert employs Seiffert's mean seiffert_mean()
marginals (str) -- Specifies the pairs of marginals to multiply and calculate the resulting mean of. Can be:
a: \(p_1q_1 = (a+b)(c+d)\) & \(p_2q_2 = (a+c)(b+d)\)
b: \(p_1p_2 = (a+b)(a+c)\) & \(q_1q_2 = (c+d)(b+d)\)
c: \(p_1q_2 = (a+b)(b+d)\) & \(p_2q_1 = (a+c)(c+d)\)
proportional (bool) -- If true, each of the values, \(a, b, c, d\) and the marginals will be divided by the total \(a+b+c+d=n\).
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Generalized Fleiss correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Generalized Fleiss correlation
- Return type:
float
Examples
>>> cmp = GeneralizedFleiss()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.35921989956790845
>>> cmp.corr('aluminum', 'Catalan')
0.10803030303030303
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Generalized Fleiss similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Generalized Fleiss similarity
- Return type:
float
Examples
>>> cmp = GeneralizedFleiss()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6796099497839543
>>> cmp.sim('aluminum', 'Catalan')
0.5540151515151515
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
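How the choice of mean and marginal pairing changes the result can be sketched with the standard library's statistics module. This is a standalone illustration, not the library's code: it does not use abydos.stats, it covers only three of the listed means, the function and dictionary names are hypothetical, and the counts (a population of 784) are assumptions. It also assumes all marginal products are positive, since the standard library's geometric mean can fail otherwise.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Three of the listed mean functions, taken from the standard library.
MEANS = {
    "arithmetic": mean,
    "geometric": geometric_mean,
    "harmonic": harmonic_mean,
}


def marginal_products(a, b, c, d, marginals="a"):
    """Return the pair of products of marginals for option a, b, or c."""
    p1, q1 = a + b, c + d
    p2, q2 = a + c, b + d
    return {
        "a": [p1 * q1, p2 * q2],
        "b": [p1 * p2, q1 * q2],
        "c": [p1 * q2, p2 * q1],
    }[marginals]


def generalized_fleiss_corr(a, b, c, d, mean_func="arithmetic", marginals="a"):
    """Return (ad - bc) divided by the chosen mean of the marginal products."""
    mu = MEANS[mean_func](marginal_products(a, b, c, d, marginals))
    return (a * d - b * c) / mu


# Assumed counts: 2 shared tokens, 4 and 3 unique tokens, population 784.
for name in ("arithmetic", "geometric", "harmonic"):
    print(name, generalized_fleiss_corr(2, 4, 3, 775, mean_func=name))
```

With marginals 'a', the arithmetic mean gives about 0.3592 (the documented default for 'Niall' and 'Neil') and the harmonic mean gives about 0.3622 (the documented Fleiss value for the same pair). The geometric result lies between the two, as expected from the ordering of the three means.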
- class abydos.distance.Gilbert(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Gilbert correlation.
For two sets X and Y and a population N, the Gilbert correlation [Gil84] is
\[corr_{Gilbert}(X, Y) = \frac{2(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)} {|N|^2 - |X \cap Y|^2 + |X \setminus Y|^2 + |Y \setminus X|^2 - |(N \setminus X) \setminus Y|^2}\]
For lack of access to the original, this formula is based on the concurring formulae presented in [Pei84] and [Doo84].
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Gilbert} = \frac{2(ad-bc)}{n^2-a^2+b^2+c^2-d^2}\]
New in version 0.4.0.
Initialize Gilbert instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Gilbert correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Gilbert correlation
- Return type:
float
Examples
>>> cmp = Gilbert()
>>> cmp.corr('cat', 'hat')
0.3310580204778157
>>> cmp.corr('Niall', 'Neil')
0.21890122402504983
>>> cmp.corr('aluminum', 'Catalan')
0.057094811018577836
>>> cmp.corr('ATCG', 'TAGC')
-0.003198976327575176
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Gilbert similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Gilbert similarity
- Return type:
float
Examples
>>> cmp = Gilbert()
>>> cmp.sim('cat', 'hat')
0.6655290102389079
>>> cmp.sim('Niall', 'Neil')
0.6094506120125249
>>> cmp.sim('aluminum', 'Catalan')
0.5285474055092889
>>> cmp.sim('ATCG', 'TAGC')
0.4984005118362124
New in version 0.4.0.
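The Gilbert correlation can be computed from the four confusion table counts, using the numerator 2(ad - bc) that corresponds to the set-based formula. The sketch below is a standalone illustration with a hypothetical function name; the counts (a population of 784) are assumptions, and the zero-denominator fallback is an arbitrary choice.

```python
def gilbert_corr(a, b, c, d):
    """Return 2(ad - bc) / (n^2 - a^2 + b^2 + c^2 - d^2)."""
    n = a + b + c + d
    denom = n ** 2 - a ** 2 + b ** 2 + c ** 2 - d ** 2
    if denom == 0:
        # Undefined in this case; 0.0 is a choice made for this sketch.
        return 0.0
    return 2 * (a * d - b * c) / denom


# Assumed counts: 2 shared tokens, 2 unique to each string, population 784.
print(gilbert_corr(2, 2, 2, 778))
# Unequal set sizes: 2 shared, 4 and 3 unique, population 784.
print(gilbert_corr(2, 4, 3, 775))
```

The two calls give about 0.3311 and 0.2189, which agrees with the first two documented corr examples and so supports reading the numerator as ad - bc.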
- class abydos.distance.GilbertWells(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Gilbert & Wells similarity.
For two sets X and Y and a population N, the Gilbert & Wells similarity [GW66] is
\[sim_{GilbertWells}(X, Y) = ln \frac{|N|^3}{2\pi |X| \cdot |Y| \cdot |N \setminus Y| \cdot |N \setminus X|} + 2ln \frac{|N|! \cdot |X \cap Y|! \cdot |X \setminus Y|! \cdot |Y \setminus X|! \cdot |(N \setminus X) \setminus Y|!} {|X|! \cdot |Y|! \cdot |N \setminus Y|! \cdot |N \setminus X|!}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{GilbertWells} = ln \frac{n^3}{2\pi (a+b)(a+c)(b+d)(c+d)} + 2ln \frac{n!a!b!c!d!}{(a+b)!(a+c)!(b+d)!(c+d)!}\]
Notes
Most lists of similarity & distance measures, including [CCT10, Hubalek08, Mor12], have a quite different formula, which would be \(ln~a - ln~n - ln \frac{a+b}{n} - ln \frac{a+c}{n} = ln\frac{an}{(a+b)(a+c)}\). However, neither this formula nor anything similar or equivalent to it appears anywhere within the cited work, [GW66]. See UnknownF for this alternative measure.
New in version 0.4.0.
Initialize GilbertWells instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Gilbert & Wells similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Gilbert & Wells similarity
- Return type:
float
Examples
>>> cmp = GilbertWells()
>>> cmp.sim('cat', 'hat')
0.4116913723876516
>>> cmp.sim('Niall', 'Neil')
0.2457247406857589
>>> cmp.sim('aluminum', 'Catalan')
0.05800001636414742
>>> cmp.sim('ATCG', 'TAGC')
0.028716013247135602
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Gilbert & Wells similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Gilbert & Wells similarity
- Return type:
float
Examples
>>> cmp = GilbertWells()
>>> cmp.sim_score('cat', 'hat')
20.17617447734673
>>> cmp.sim_score('Niall', 'Neil')
16.717742356982733
>>> cmp.sim_score('aluminum', 'Catalan')
5.495096667524002
>>> cmp.sim_score('ATCG', 'TAGC')
1.6845961909440712
New in version 0.4.0.
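Evaluating the Gilbert & Wells formula literally would require factorials of numbers as large as n, which overflow floating point quickly. One way around this, sketched below, is to work with log-factorials through math.lgamma. This is a standalone illustration and not necessarily how the library computes the value; the function names are hypothetical, the counts (a population of 784) are assumptions, and all four marginals are assumed to be positive.

```python
from math import lgamma, log, pi


def log_factorial(k):
    """Return ln(k!) using the log-gamma function."""
    return lgamma(k + 1)


def gilbert_wells_score(a, b, c, d):
    """Gilbert & Wells score from 2x2 confusion table counts, in log space."""
    n = a + b + c + d
    margins = (a + b, a + c, b + d, c + d)
    # ln(n^3 / (2*pi*(a+b)(a+c)(b+d)(c+d)))
    first = 3 * log(n) - log(2 * pi) - sum(log(m) for m in margins)
    # ln(n! a! b! c! d! / ((a+b)! (a+c)! (b+d)! (c+d)!))
    second = (
        log_factorial(n)
        + sum(log_factorial(k) for k in (a, b, c, d))
        - sum(log_factorial(m) for m in margins)
    )
    return first + 2 * second


# Assumed counts: 2 shared tokens, 2 unique to each string, population 784.
print(gilbert_wells_score(2, 2, 2, 778))
```

With these counts the result is about 20.176, in line with the documented sim_score for 'cat' and 'hat'.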
- class abydos.distance.GiniI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', normalizer: str = 'proportional', **kwargs: Any)[source]
Bases:
_TokenDistance
Gini I correlation.
For two sets X and Y and a population N, Gini I correlation [Gin12], using the formula from [GK59], is
\[corr_{GiniI}(X, Y) = \frac{\frac{|X \cap Y|+|(N \setminus X) \setminus Y|}{|N|} - \frac{|X| \cdot |Y|}{|N|} + \frac{|N \setminus Y| \cdot |N \setminus X|}{|N|}} {\sqrt{(1-(\frac{|X|}{|N|}^2+\frac{|Y|}{|N|}^2)) \cdot (1-(\frac{|N \setminus Y|}{|N|}^2 + \frac{|N \setminus X|}{|N|}^2))}}\]
In 2x2 confusion table terms, where a+b+c+d=n, after each term has been converted to a proportion by dividing by n, this is
\[corr_{GiniI} = \frac{(a+d)-(a+b)(a+c) + (b+d)(c+d)} {\sqrt{(1-((a+b)^2+(c+d)^2))\cdot(1-((a+c)^2+(b+d)^2))}}\]
New in version 0.4.0.
Initialize GiniI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
normalizer (str) -- Specifies the normalization type. See the normalizer description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Gini I correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Gini I correlation
- Return type:
float
Examples
>>> cmp = GiniI()
>>> cmp.corr('cat', 'hat')
0.49722814498933254
>>> cmp.corr('Niall', 'Neil')
0.39649090262533215
>>> cmp.corr('aluminum', 'Catalan')
0.14887105223941113
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237489576
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Gini I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Gini I similarity
- Return type:
float
Examples
>>> cmp = GiniI()
>>> cmp.sim('cat', 'hat')
0.7486140724946663
>>> cmp.sim('Niall', 'Neil')
0.6982454513126661
>>> cmp.sim('aluminum', 'Catalan')
0.5744355261197056
>>> cmp.sim('ATCG', 'TAGC')
0.4967907573812552
New in version 0.4.0.
- class abydos.distance.GiniII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', normalizer: str = 'proportional', **kwargs: Any)[source]
Bases:
_TokenDistance
Gini II correlation.
For two sets X and Y and a population N, Gini II correlation [Gin15], using the formula from [GK59], is
\[corr_{GiniII}(X, Y) = \frac{\frac{|X \cap Y| + |(N \setminus X) \setminus Y|}{|N|} - \Big(\frac{|X| \cdot |Y|}{|N|^2} + \frac{|N \setminus Y| \cdot |N \setminus X|}{|N|^2}\Big)} {1 - \Big|\frac{|Y \setminus X| - |X \setminus Y|}{|N|}\Big| - \Big(\frac{|X| \cdot |Y|}{|N|^2} + \frac{|N \setminus Y| \cdot |N \setminus X|}{|N|^2}\Big)}\]
In 2x2 confusion table terms, where a+b+c+d=n, after each term has been converted to a proportion by dividing by n, this is
\[corr_{GiniII} = \frac{(a+d) - ((a+b)(a+c) + (b+d)(c+d))} {1 - |b-c| - ((a+b)(a+c) + (b+d)(c+d))}\]
New in version 0.4.0.
Initialize GiniII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
normalizer (str) -- Specifies the normalization type. See normalizer description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Gini II correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Gini II correlation
- Return type:
float
Examples
>>> cmp = GiniII()
>>> cmp.corr('cat', 'hat')
0.49722814498933254
>>> cmp.corr('Niall', 'Neil')
0.4240703425535771
>>> cmp.corr('aluminum', 'Catalan')
0.15701415701415936
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237489576
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Gini II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Gini II similarity
- Return type:
float
Examples
>>> cmp = GiniII()
>>> cmp.sim('cat', 'hat')
0.7486140724946663
>>> cmp.sim('Niall', 'Neil')
0.7120351712767885
>>> cmp.sim('aluminum', 'Catalan')
0.5785070785070797
>>> cmp.sim('ATCG', 'TAGC')
0.4967907573812552
New in version 0.4.0.
- class abydos.distance.Goodall(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Goodall similarity.
For two sets X and Y and a population N, Goodall similarity [AC77, Goo67] is an angular transformation of Sokal & Michener's simple matching coefficient
\[sim_{Goodall}(X, Y) = \frac{2}{\pi} \sin^{-1}\Big( \sqrt{\frac{|X \cap Y| + |(N \setminus X) \setminus Y|}{|N|}} \Big)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Goodall} =\frac{2}{\pi} \sin^{-1}\Big( \sqrt{\frac{a + d}{n}} \Big)\]
New in version 0.4.0.
Initialize Goodall instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Goodall similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Goodall similarity
- Return type:
float
Examples
>>> cmp = Goodall()
>>> cmp.sim('cat', 'hat')
0.9544884026871964
>>> cmp.sim('Niall', 'Neil')
0.9397552079794624
>>> cmp.sim('aluminum', 'Catalan')
0.9117156301536503
>>> cmp.sim('ATCG', 'TAGC')
0.9279473952929225
New in version 0.4.0.
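The 2x2 form of the Goodall formula can be evaluated directly from the four cell counts. The sketch below is illustrative only: the function name `goodall_from_table` is hypothetical, and the counts a=2, b=2, c=2, d=778 are an assumption (two token sets of four tokens each that share two tokens, in a population of 784), not something read from the library.

```python
import math


def goodall_from_table(a, b, c, d):
    """Angular transform of the simple matching coefficient (a + d) / n."""
    n = a + b + c + d
    return (2 / math.pi) * math.asin(math.sqrt((a + d) / n))


# Assumed counts: 2 shared tokens, 2 unique to each side, 778 in neither.
print(goodall_from_table(2, 2, 2, 778))
```

Under those assumed counts the result is roughly 0.9545, in line with the first example above.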
- class abydos.distance.GoodmanKruskalLambda(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Goodman & Kruskal's Lambda similarity.
For two sets X and Y and a population N, Goodman & Kruskal's lambda [GK54] is
\[sim_{GK_\lambda}(X, Y) = \frac{\frac{1}{2}\Big((max(|X \cap Y|, |X \setminus Y|)+ max(|Y \setminus X|, |(N \setminus X) \setminus Y|)+ max(|X \cap Y|, |Y \setminus X|)+ max(|X \setminus Y|, |(N \setminus X) \setminus Y|))- (max(|X|, |N \setminus X|)+max(|Y|, |N \setminus Y|))\Big)} {|N|-\frac{1}{2}(max(|X|, |N \setminus X|)+ max(|Y|, |N \setminus Y|))}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{GK_\lambda} = \frac{\frac{1}{2}((max(a,b)+max(c,d)+max(a,c)+max(b,d))- (max(a+b,c+d)+max(a+c,b+d)))} {n-\frac{1}{2}(max(a+b,c+d)+max(a+c,b+d))}\]
New in version 0.4.0.
Initialize GoodmanKruskalLambda instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Goodman & Kruskal's Lambda similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Goodman & Kruskal's Lambda similarity
- Return type:
float
Examples
>>> cmp = GoodmanKruskalLambda()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
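The all-zero examples follow from the 2x2 formula whenever d is much larger than a, b, and c: every max() then picks the side containing d, and the numerator cancels. A minimal sketch of the 2x2 form, with the hypothetical name `gk_lambda` and assumed cell counts (the zero-denominator fallback is also an assumption):

```python
def gk_lambda(a, b, c, d):
    """Goodman & Kruskal's lambda from the four cells of a 2x2 table."""
    n = a + b + c + d
    cell_maxes = max(a, b) + max(c, d) + max(a, c) + max(b, d)
    marginal_maxes = max(a + b, c + d) + max(a + c, b + d)
    denominator = n - 0.5 * marginal_maxes
    if denominator == 0:
        return 0.0  # assumed fallback for a degenerate table
    return 0.5 * (cell_maxes - marginal_maxes) / denominator


# A large d (tokens in neither string) swamps the other cells.
print(gk_lambda(2, 2, 2, 778))
# A balanced table gives a non-zero value.
print(gk_lambda(40, 10, 10, 40))
```

With d dominating, the four cell maxima sum to exactly the two marginal maxima, so the score is 0.0; the balanced table yields 0.6.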
- class abydos.distance.GoodmanKruskalLambdaR(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Goodman & Kruskal Lambda-r correlation.
For two sets X and Y and a population N, Goodman & Kruskal \(\lambda_r\) correlation [GK54] is
\[corr_{GK_{\lambda_r}}(X, Y) = \frac{|X \cap Y| + |(N \setminus X) \setminus Y| - \frac{1}{2}(max(|X|, |N \setminus X|) + max(|Y|, |N \setminus Y|))} {|N| - \frac{1}{2}(max(|X|, |N \setminus X|) + max(|Y|, |N \setminus Y|))}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{GK_{\lambda_r}} = \frac{a + d - \frac{1}{2}(max(a+b,c+d)+max(a+c,b+d))} {n - \frac{1}{2}(max(a+b,c+d)+max(a+c,b+d))}\]
New in version 0.4.0.
Initialize GoodmanKruskalLambdaR instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return Goodman & Kruskal Lambda-r correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Goodman & Kruskal Lambda-r correlation
- Return type:
float
Examples
>>> cmp = GoodmanKruskalLambdaR()
>>> cmp.corr('cat', 'hat')
0.0
>>> cmp.corr('Niall', 'Neil')
-0.2727272727272727
>>> cmp.corr('aluminum', 'Catalan')
-0.7647058823529411
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Goodman & Kruskal Lambda-r similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Goodman & Kruskal Lambda-r similarity
- Return type:
float
Examples
>>> cmp = GoodmanKruskalLambdaR()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.36363636363636365
>>> cmp.sim('aluminum', 'Catalan')
0.11764705882352944
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
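As a rough check of the Lambda-r formula, the 2x2 form can be computed by hand. The names `gk_lambda_r_corr` and `gk_lambda_r_sim` are hypothetical, the counts a=2, b=4, c=3, d=775 are assumed, and the mapping sim = (1 + corr) / 2 is inferred from the example values above rather than stated in this section.

```python
def gk_lambda_r_corr(a, b, c, d):
    """Goodman & Kruskal lambda-r from the four cells of a 2x2 table."""
    n = a + b + c + d
    half_marginals = 0.5 * (max(a + b, c + d) + max(a + c, b + d))
    return (a + d - half_marginals) / (n - half_marginals)


def gk_lambda_r_sim(a, b, c, d):
    """Shift the correlation from [-1, 1] to [0, 1] (inferred mapping)."""
    return (1.0 + gk_lambda_r_corr(a, b, c, d)) / 2.0


# Assumed counts: 2 shared tokens, 4 only in src, 3 only in tar, 775 in neither.
print(gk_lambda_r_corr(2, 4, 3, 775))
print(gk_lambda_r_sim(2, 4, 3, 775))
```

With these assumed counts the correlation is -3/11 and the similarity 4/11, which agree with the 'Niall'/'Neil' examples.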
- class abydos.distance.GoodmanKruskalTauA(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', normalizer: str = 'proportional', **kwargs: Any)[source]
Bases:
_TokenDistance
Goodman & Kruskal's Tau A similarity.
For two sets X and Y and a population N, Goodman & Kruskal's \(\tau_a\) similarity [GK54], by analogy with \(\tau_b\), is
\[sim_{GK_{\tau_a}}(X, Y) = \frac{\frac{\frac{|X \cap Y|}{|N|}^2 + \frac{|Y \setminus X|}{|N|}^2}{\frac{|Y|}{|N|}}+ \frac{\frac{|X \setminus Y|}{|N|}^2 + \frac{|(N \setminus X) \setminus Y|}{|N|}^2} {\frac{|N \setminus Y|}{|N|}} - (\frac{|X|}{|N|}^2 + \frac{|N \setminus X|}{|N|}^2)} {1 - (\frac{|X|}{|N|}^2 + \frac{|N \setminus X|}{|N|}^2)}\]
In 2x2 confusion table terms, where a+b+c+d=n, after each term has been converted to a proportion by dividing by n, this is
\[sim_{GK_{\tau_a}} = \frac{ \frac{a^2 + c^2}{a+c} + \frac{b^2 + d^2}{b+d} - ((a+b)^2 + (c+d)^2)} {1 - ((a+b)^2 + (c+d)^2)}\]
New in version 0.4.0.
Initialize GoodmanKruskalTauA instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
normalizer (str) -- Specifies the normalization type. See normalizer description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Goodman & Kruskal's Tau A similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Goodman & Kruskal's Tau A similarity
- Return type:
float
Examples
>>> cmp = GoodmanKruskalTauA()
>>> cmp.sim('cat', 'hat')
0.3304969657208484
>>> cmp.sim('Niall', 'Neil')
0.22137604585914503
>>> cmp.sim('aluminum', 'Catalan')
0.05991264724130685
>>> cmp.sim('ATCG', 'TAGC')
4.119695274745721e-05
New in version 0.4.0.
- class abydos.distance.GoodmanKruskalTauB(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', normalizer: str = 'proportional', **kwargs: Any)[source]
Bases:
_TokenDistance
Goodman & Kruskal's Tau B similarity.
For two sets X and Y and a population N, Goodman & Kruskal's \(\tau_b\) similarity [GK54] is
\[sim_{GK_{\tau_b}}(X, Y) = \frac{\frac{\frac{|X \cap Y|}{|N|}^2 + \frac{|X \setminus Y|}{|N|}^2}{\frac{|X|}{|N|}}+ \frac{\frac{|Y \setminus X|}{|N|}^2 + \frac{|(N \setminus X) \setminus Y|}{|N|}^2} {\frac{|N \setminus X|}{|N|}} - (\frac{|Y|}{|N|}^2 + \frac{|N \setminus Y|}{|N|}^2)} {1 - (\frac{|Y|}{|N|}^2 + \frac{|N \setminus Y|}{|N|}^2)}\]
In 2x2 confusion table terms, where a+b+c+d=n, after each term has been converted to a proportion by dividing by n, this is
\[sim_{GK_{\tau_b}} = \frac{ \frac{a^2 + b^2}{a+b} + \frac{c^2 + d^2}{c+d} - ((a+c)^2 + (b+d)^2)} {1 - ((a+c)^2 + (b+d)^2)}\]
New in version 0.4.0.
Initialize GoodmanKruskalTauB instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
normalizer (str) -- Specifies the normalization type. See normalizer description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Goodman & Kruskal's Tau B similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Goodman & Kruskal's Tau B similarity
- Return type:
float
Examples
>>> cmp = GoodmanKruskalTauB()
>>> cmp.sim('cat', 'hat')
0.3304969657208484
>>> cmp.sim('Niall', 'Neil')
0.2346006486710202
>>> cmp.sim('aluminum', 'Catalan')
0.06533810992392582
>>> cmp.sim('ATCG', 'TAGC')
4.119695274745721e-05
New in version 0.4.0.
- class abydos.distance.Gotoh(gap_open: float = 1, gap_ext: float = 0.4, sim_func: Optional[Callable[[str, str], float]] = None, **kwargs: Any)[source]
Bases:
NeedlemanWunsch
Gotoh score.
The Gotoh score [Got82] is essentially Needleman-Wunsch with affine gap penalties.
New in version 0.3.6.
Initialize Gotoh instance.
- Parameters:
gap_open (float) -- The cost of an open alignment gap (1 by default)
gap_ext (float) -- The cost of an alignment gap extension (0.4 by default)
sim_func (function) -- A function that returns the similarity of two characters (identity similarity by default)
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Gotoh score of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Gotoh score
- Return type:
float
Examples
>>> cmp = Gotoh()
>>> cmp.sim('cat', 'hat')
0.6666666666666667
>>> cmp.sim('Niall', 'Neil')
0.22360679774997896
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.0
>>> cmp.sim('cat', 'hat')
0.6666666666666667
New in version 0.4.1.
- sim_score(src: str, tar: str) float [source]
Return the Gotoh score of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Gotoh score
- Return type:
float
Examples
>>> cmp = Gotoh()
>>> cmp.sim_score('cat', 'hat')
2.0
>>> cmp.sim_score('Niall', 'Neil')
1.0
>>> round(cmp.sim_score('aluminum', 'Catalan'), 12)
-0.4
>>> cmp.sim_score('cat', 'hat')
2.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
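To make "Needleman-Wunsch with affine gap penalties" concrete, here is a minimal three-matrix sketch of the general Gotoh recurrence. It is not the library's code. The name `gotoh_score` is hypothetical, and two assumptions are made: a match scores 1 and a mismatch 0, and a gap of length k costs gap_open + (k - 1) * gap_ext.

```python
def gotoh_score(src, tar, gap_open=1.0, gap_ext=0.4):
    """Global alignment score with affine gap costs (sketch)."""
    neg_inf = float("-inf")
    rows, cols = len(src) + 1, len(tar) + 1
    # m: alignment ends with two characters paired
    # x: alignment ends with a src character against a gap
    # y: alignment ends with a tar character against a gap
    m = [[neg_inf] * cols for _ in range(rows)]
    x = [[neg_inf] * cols for _ in range(rows)]
    y = [[neg_inf] * cols for _ in range(rows)]
    m[0][0] = 0.0
    for i in range(1, rows):
        x[i][0] = -gap_open - (i - 1) * gap_ext
    for j in range(1, cols):
        y[0][j] = -gap_open - (j - 1) * gap_ext
    for i in range(1, rows):
        for j in range(1, cols):
            # Assumed identity similarity: 1 for equal characters, else 0.
            match = 1.0 if src[i - 1] == tar[j - 1] else 0.0
            m[i][j] = match + max(m[i - 1][j - 1], x[i - 1][j - 1], y[i - 1][j - 1])
            # Opening a gap costs gap_open; continuing one costs gap_ext.
            x[i][j] = max(m[i - 1][j] - gap_open, x[i - 1][j] - gap_ext, y[i - 1][j] - gap_open)
            y[i][j] = max(m[i][j - 1] - gap_open, y[i][j - 1] - gap_ext, x[i][j - 1] - gap_open)
    return max(m[-1][-1], x[-1][-1], y[-1][-1])


print(gotoh_score("cat", "hat"))
print(gotoh_score("Niall", "Neil"))
```

Under these assumptions the sketch gives 2.0 for 'cat'/'hat' and 1.0 for 'Niall'/'Neil', the same values as the sim_score examples; the cheaper extension cost is what lets one long gap beat several short ones.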
- class abydos.distance.GowerLegendre(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', theta: float = 0.5, **kwargs: Any)[source]
Bases:
_TokenDistance
Gower & Legendre similarity.
For two sets X and Y and a population N, the Gower & Legendre similarity [GL86] is
\[sim_{GowerLegendre}(X, Y) = \frac{|X \cap Y| + |(N \setminus X) \setminus Y|} {|X \cap Y| + |(N \setminus X) \setminus Y| + \theta \cdot |X \triangle Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{GowerLegendre} = \frac{a+d}{a+\theta(b+c)+d}\]
New in version 0.4.0.
Initialize GowerLegendre instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
theta (float) -- The weight to place on the symmetric difference.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Gower & Legendre similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Gower & Legendre similarity
- Return type:
float
Examples
>>> cmp = GowerLegendre()
>>> cmp.sim('cat', 'hat')
0.9974424552429667
>>> cmp.sim('Niall', 'Neil')
0.9955156950672646
>>> cmp.sim('aluminum', 'Catalan')
0.9903536977491961
>>> cmp.sim('ATCG', 'TAGC')
0.993581514762516
New in version 0.4.0.
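The effect of theta is easiest to see in the 2x2 form. The sketch below uses a hypothetical function name, `gower_legendre`, and assumed cell counts; it only illustrates the formula and is not the class's implementation.

```python
def gower_legendre(a, b, c, d, theta=0.5):
    """(a + d) / (a + d + theta * (b + c)) from a 2x2 table."""
    agreements = a + d
    return agreements / (agreements + theta * (b + c))


# Assumed counts: 2 shared tokens, 2 unique to each side, 778 in neither.
print(gower_legendre(2, 2, 2, 778))             # theta = 0.5 (default)
print(gower_legendre(2, 2, 2, 778, theta=1.0))  # simple matching, (a + d) / n
```

With theta = 1 the denominator becomes n, so the measure reduces to the simple matching coefficient; smaller theta values discount disagreements.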
- class abydos.distance.Guth(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_Distance
Guth matching.
Guth matching [Gut76] uses a simple positional matching rule list to determine whether two names match. Following the original, the sim_score() method returns only 1.0 for matching or 0.0 for non-matching.
The sim() method instead penalizes more distant matches and never outright declares two names non-matching unless no matches can be made between the two strings.
Tokens other than single characters can be matched by specifying a tokenizer during initialization or setting the qval parameter.
New in version 0.4.1.
Initialize Guth instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.1.
- sim(src: str, tar: str) float [source]
Return the relative Guth similarity of two strings.
This deviates from the algorithm described in [Gut76] in that more distant matches are penalized, so that less similar terms score lower than more similar terms.
If no match is found for a particular token in the source string, this does not result in an automatic 0.0 score. Rather, the score is further penalized towards 0.0.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Relative Guth matching score
- Return type:
float
Examples
>>> cmp = Guth()
>>> cmp.sim('cat', 'hat')
0.8666666666666667
>>> cmp.sim('Niall', 'Neil')
0.8800000000000001
>>> cmp.sim('aluminum', 'Catalan')
0.4
>>> cmp.sim('ATCG', 'TAGC')
0.8
New in version 0.4.1.
- sim_score(src: str, tar: str) float [source]
Return the Guth matching score of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Guth matching score (1.0 if matching, otherwise 0.0)
- Return type:
float
Examples
>>> cmp = Guth()
>>> cmp.sim_score('cat', 'hat')
1.0
>>> cmp.sim_score('Niall', 'Neil')
1.0
>>> cmp.sim_score('aluminum', 'Catalan')
0.0
>>> cmp.sim_score('ATCG', 'TAGC')
1.0
New in version 0.4.1.
- class abydos.distance.GuttmanLambdaA(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Guttman's Lambda A similarity.
For two sets X and Y and a population N, Guttman's \(\lambda_a\) similarity [Gut41] is
\[sim_{Guttman_{\lambda_a}}(X, Y) = \frac{max(|X \cap Y|, |Y \setminus X|) + max(|X \setminus Y|, |(N \setminus X) \setminus Y|) - max(|X|, |N \setminus X|)} {|N| - max(|X|, |N \setminus X|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Guttman_{\lambda_a}} = \frac{max(a, c) + max(b, d) - max(a+b, c+d)}{n - max(a+b, c+d)}\]
New in version 0.4.0.
Initialize GuttmanLambdaA instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Guttman Lambda A similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Guttman's Lambda A similarity
- Return type:
float
Examples
>>> cmp = GuttmanLambdaA()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.GuttmanLambdaB(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Guttman's Lambda B similarity.
For two sets X and Y and a population N, Guttman's \(\lambda_b\) similarity [Gut41] is
\[sim_{Guttman_{\lambda_b}}(X, Y) = \frac{max(|X \cap Y|, |X \setminus Y|) + max(|Y \setminus X|, |(N \setminus X) \setminus Y|) - max(|Y|, |N \setminus Y|)} {|N| - max(|Y|, |N \setminus Y|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Guttman_{\lambda_b}} = \frac{max(a, b) + max(c, d) - max(a+c, b+d)}{n - max(a+c, b+d)}\]
New in version 0.4.0.
Initialize GuttmanLambdaB instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Guttman Lambda B similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Guttman's Lambda B similarity
- Return type:
float
Examples
>>> cmp = GuttmanLambdaB()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.GwetAC(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Gwet's AC correlation.
For two sets X and Y and a population N, Gwet's AC correlation [Gwe08] is
\[corr_{Gwet_{AC}}(X, Y) = AC = \frac{p_o - p_e^{AC}}{1 - p_e^{AC}}\]
where
\[\begin{array}{lll} p_o &=& \frac{|X \cap Y| + |(N \setminus X) \setminus Y|}{|N|}\\ p_e^{AC} &=& \frac{1}{2}\Big(\frac{|X|+|Y|}{|N|}\cdot \frac{|N \setminus X| + |N \setminus Y|}{|N|}\Big) \end{array}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[\begin{array}{lll} p_o&=&\frac{a+d}{n}\\ p_e^{AC}&=&\frac{1}{2}\Big(\frac{2a+b+c}{n}\cdot \frac{2d+b+c}{n}\Big) \end{array}\]
New in version 0.4.0.
Initialize GwetAC instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return Gwet's AC correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Gwet's AC correlation
- Return type:
float
Examples
>>> cmp = GwetAC()
>>> cmp.corr('cat', 'hat')
0.9948456319360438
>>> cmp.corr('Niall', 'Neil')
0.990945276504824
>>> cmp.corr('aluminum', 'Catalan')
0.9804734301840141
>>> cmp.corr('ATCG', 'TAGC')
0.9870811678360627
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Gwet's AC similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Gwet's AC similarity
- Return type:
float
Examples
>>> cmp = GwetAC()
>>> cmp.sim('cat', 'hat')
0.9974228159680218
>>> cmp.sim('Niall', 'Neil')
0.995472638252412
>>> cmp.sim('aluminum', 'Catalan')
0.9902367150920071
>>> cmp.sim('ATCG', 'TAGC')
0.9935405839180314
New in version 0.4.0.
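The two-step definition (observed agreement, then chance agreement) can be sketched from the 2x2 form as follows. The names `gwet_ac_corr` and `gwet_ac_sim` are hypothetical, the counts are assumed, and the mapping sim = (1 + corr) / 2 is inferred from the example values rather than documented here.

```python
def gwet_ac_corr(a, b, c, d):
    """Gwet's AC from the four cells of a 2x2 table."""
    n = a + b + c + d
    p_o = (a + d) / n  # observed agreement
    # Chance agreement, following the 2x2 form of p_e^AC given above.
    p_e = 0.5 * ((2 * a + b + c) / n) * ((2 * d + b + c) / n)
    return (p_o - p_e) / (1 - p_e)


def gwet_ac_sim(a, b, c, d):
    """Shift the correlation from [-1, 1] to [0, 1] (inferred mapping)."""
    return (1.0 + gwet_ac_corr(a, b, c, d)) / 2.0


# Assumed counts: 2 shared tokens, 2 unique to each side, 778 in neither.
print(gwet_ac_corr(2, 2, 2, 778))
print(gwet_ac_sim(2, 2, 2, 778))
```

Because d dominates the table, the chance-agreement term stays small and the correlation remains close to the observed agreement.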
- class abydos.distance.Hamann(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Hamann correlation.
For two sets X and Y and a population N, the Hamann correlation [Ham61] is
\[corr_{Hamann}(X, Y) = \frac{|X \cap Y| + |(N \setminus X) \setminus Y| - |X \setminus Y| - |Y \setminus X|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Hamann} = \frac{a+d-b-c}{n}\]
New in version 0.4.0.
Initialize Hamann instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Hamann correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Hamann correlation
- Return type:
float
Examples
>>> cmp = Hamann()
>>> cmp.corr('cat', 'hat')
0.9897959183673469
>>> cmp.corr('Niall', 'Neil')
0.9821428571428571
>>> cmp.corr('aluminum', 'Catalan')
0.9617834394904459
>>> cmp.corr('ATCG', 'TAGC')
0.9744897959183674
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Hamann similarity of two strings.
Hamann correlation, which has a range of [-1, 1], is normalized to [0, 1] by adding 1 and dividing by 2.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Hamann similarity
- Return type:
float
Examples
>>> cmp = Hamann()
>>> cmp.sim('cat', 'hat')
0.9948979591836735
>>> cmp.sim('Niall', 'Neil')
0.9910714285714286
>>> cmp.sim('aluminum', 'Catalan')
0.9808917197452229
>>> cmp.sim('ATCG', 'TAGC')
0.9872448979591837
New in version 0.4.0.
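The confusion-table formula above can be illustrated with a small standalone sketch. It does not call abydos; the helper names (`padded_bigrams`, `hamann_corr`, `hamann_sim`) are hypothetical, and the tokenization (bigrams padded with `$` and `#`) and the alphabet size of 784 are assumptions inferred from the documented example values, not taken from this section.

```python
from collections import Counter


def padded_bigrams(word, start="$", stop="#"):
    """Return the multiset of bigrams of word, padded with start/stop symbols."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def hamann_corr(src, tar, alphabet_size=784):
    """Hamann correlation (a + d - b - c) / n over padded bigrams.

    Assumption: d is the assumed alphabet size minus the number of
    distinct tokens seen in either string, and n = a + b + c + d.
    """
    x, y = padded_bigrams(src), padded_bigrams(tar)
    a = sum((x & y).values())  # tokens in both strings
    b = sum((x - y).values())  # tokens only in src
    c = sum((y - x).values())  # tokens only in tar
    d = alphabet_size - len(set(x) | set(y))  # tokens in neither string
    n = a + b + c + d
    return (a + d - b - c) / n


def hamann_sim(src, tar, alphabet_size=784):
    """Map the correlation from [-1, 1] to [0, 1]."""
    return (1 + hamann_corr(src, tar, alphabet_size)) / 2


print(hamann_corr("cat", "hat"))
print(hamann_sim("cat", "hat"))
```

Under these assumptions, 'cat' and 'hat' give a = 2, b = c = 2 and d = 778, so the correlation is 776/784, which agrees with the documented value up to floating-point rounding.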
- class abydos.distance.Hamming(diff_lens: bool = True, **kwargs: Any)[source]
Bases:
_Distance
Hamming distance.
Hamming distance [Ham50] equals the number of character positions at which two strings differ. For strings of unequal lengths, it is not normally defined. By default, this implementation calculates the Hamming distance of the first n characters where n is the lesser of the two strings' lengths and adds to this the difference in string lengths.
New in version 0.3.6.
Initialize Hamming instance.
- Parameters:
diff_lens (bool) -- If True (default), this returns the Hamming distance for those characters that have a matching character in both strings plus the difference in the strings' lengths. This is equivalent to extending the shorter string with obligatorily non-matching characters. If False, an exception is raised in the case of strings of unequal lengths.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Hamming distance between two strings.
Hamming distance normalized to the interval [0, 1].
The Hamming distance is normalized by dividing it by the greater of the number of characters in src & tar (unless diff_lens is set to False, in which case an exception is raised).
The arguments are identical to those of the hamming() function.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Hamming distance
- Return type:
float
Examples
>>> cmp = Hamming()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> cmp.dist('Niall', 'Neil')
0.6
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str) float [source]
Return the Hamming distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Hamming distance between src & tar
- Return type:
int
- Raises:
ValueError -- Undefined for sequences of unequal length; set diff_lens to True for Hamming distance between strings of unequal lengths.
Examples
>>> cmp = Hamming()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
8
>>> cmp.dist_abs('ATCG', 'TAGC')
4
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
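The behavior described for diff_lens can be sketched in a few lines of plain Python. This is an illustrative reimplementation rather than the library's code; the function names `hamming_abs` and `hamming_norm` are hypothetical.

```python
def hamming_abs(src, tar, diff_lens=True):
    """Count differing positions, plus the length difference if allowed."""
    if not diff_lens and len(src) != len(tar):
        raise ValueError("Undefined for sequences of unequal length")
    # zip() stops at the shorter string, so only shared positions are compared
    mismatches = sum(s != t for s, t in zip(src, tar))
    return mismatches + abs(len(src) - len(tar))


def hamming_norm(src, tar, diff_lens=True):
    """Divide the absolute distance by the length of the longer string."""
    if src == tar:
        return 0.0
    return hamming_abs(src, tar, diff_lens) / max(len(src), len(tar))


print(hamming_abs("Niall", "Neil"))   # 2 mismatches + 1 extra character
print(hamming_norm("Niall", "Neil"))
```

For 'Niall' and 'Neil', the first four positions differ in two places and the lengths differ by one, giving 3, or 3/5 after normalization.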
- class abydos.distance.HarrisLahey(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Harris & Lahey similarity.
For two sets X and Y and a population N, Harris & Lahey similarity [HL78] is
\[sim_{HarrisLahey}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\cdot \frac{|N \setminus Y| + |N \setminus X|}{2|N|}+ \frac{|(N \setminus X) \setminus Y|}{|N \setminus (X \cap Y)|}\cdot \frac{|X| + |Y|}{2|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{HarrisLahey} = \frac{a}{a+b+c}\cdot\frac{2d+b+c}{2n}+ \frac{d}{d+b+c}\cdot\frac{2a+b+c}{2n}\]
Notes
Most catalogs of similarity coefficients [Mor12, War08, Xia13] omit the \(n\) terms in the denominators, but the worked example in [HL78] makes it clear that they are intended in the original.
New in version 0.4.0.
Initialize HarrisLahey instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Harris & Lahey similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Harris & Lahey similarity
- Return type:
float
Examples
>>> cmp = HarrisLahey()
>>> cmp.sim('cat', 'hat')
0.3367085964820711
>>> cmp.sim('Niall', 'Neil')
0.22761577457069784
>>> cmp.sim('aluminum', 'Catalan')
0.07244410503054725
>>> cmp.sim('ATCG', 'TAGC')
0.006296204706372345
New in version 0.4.0.
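As a rough check of the 2x2 formula, the following sketch evaluates it directly from confusion-table counts. The function name `harris_lahey` is hypothetical, and the counts used for 'cat' and 'hat' (a = 2, b = c = 2, d = 778) are an assumption about the default bigram tokenization and alphabet size, inferred from the documented values.

```python
def harris_lahey(a, b, c, d):
    """Harris & Lahey similarity from 2x2 confusion-table counts."""
    n = a + b + c + d
    # Occurrence agreement, weighted by the mean non-occurrence rate
    occurrence = a / (a + b + c) if a + b + c else 0.0
    occurrence_weight = (2 * d + b + c) / (2 * n)
    # Non-occurrence agreement, weighted by the mean occurrence rate
    non_occurrence = d / (d + b + c) if d + b + c else 0.0
    non_occurrence_weight = (2 * a + b + c) / (2 * n)
    return occurrence * occurrence_weight + non_occurrence * non_occurrence_weight


# Assumed counts for 'cat' vs. 'hat': 2 shared bigrams, 2 unique to each
# string, and 778 bigrams of a 784-token alphabet that appear in neither.
print(harris_lahey(2, 2, 2, 778))
```

With these assumed counts the result agrees with the documented value for 'cat' and 'hat' to within floating-point rounding.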
- class abydos.distance.Hassanat(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Hassanat distance.
For two multisets X and Y drawn from an alphabet S, Hassanat distance [Has14] is
\[dist_{Hassanat}(X, Y) = \sum_{i \in S} D(X_i, Y_i)\]
where
\[\begin{split}D(X_i, Y_i) = \left\{\begin{array}{ll} 1-\frac{1+min(X_i, Y_i)}{1+max(X_i, Y_i)}&, min(X_i, Y_i) \geq 0 \\ \\ 1-\frac{1+min(X_i, Y_i)+|min(X_i, Y_i)|} {1+max(X_i, Y_i)+|min(X_i, Y_i)|}&, min(X_i, Y_i) < 0 \end{array}\right.\end{split}\]
New in version 0.4.0.
Initialize Hassanat instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Hassanat distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Hassanat distance
- Return type:
float
Examples
>>> cmp = Hassanat()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.3888888888888889
>>> cmp.dist('aluminum', 'Catalan')
0.4777777777777778
>>> cmp.dist('ATCG', 'TAGC')
0.5
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Hassanat distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Hassanat distance
- Return type:
float
Examples
>>> cmp = Hassanat()
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> cmp.dist_abs('Niall', 'Neil')
3.5
>>> cmp.dist_abs('aluminum', 'Catalan')
7.166666666666667
>>> cmp.dist_abs('ATCG', 'TAGC')
5.0
New in version 0.4.0.
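The piecewise definition of D can be made concrete with a short standalone sketch of the unnormalized distance. It is not the library's implementation: the names `padded_bigrams`, `hassanat_term` and `hassanat_abs` are hypothetical, and the padded-bigram tokenization is an assumption inferred from the documented examples.

```python
from collections import Counter


def padded_bigrams(word, start="$", stop="#"):
    """Return the multiset of bigrams of word, padded with start/stop symbols."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def hassanat_term(x, y):
    """Per-token term D(x, y) of the Hassanat distance."""
    lo, hi = min(x, y), max(x, y)
    if lo >= 0:
        return 1 - (1 + lo) / (1 + hi)
    # Negative values are shifted by |lo| so that the ratio stays in (0, 1]
    return 1 - (1 + lo + abs(lo)) / (1 + hi + abs(lo))


def hassanat_abs(src, tar):
    """Sum D over every token that occurs in either string."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    return sum(hassanat_term(x[tok], y[tok]) for tok in set(x) | set(y))


print(hassanat_abs("cat", "hat"))
```

Token counts are never negative, so only the first branch is exercised here; each bigram found in just one string contributes 1 - 1/2 = 0.5, and the four unshared bigrams of 'cat' and 'hat' sum to 2.0.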
- class abydos.distance.HawkinsDotson(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Hawkins & Dotson similarity.
For two sets X and Y and a population N, Hawkins & Dotson similarity [HD73] is the mean of the occurrence agreement and non-occurrence agreement
\[sim_{HawkinsDotson}(X, Y) = \frac{1}{2}\cdot\Big( \frac{|X \cap Y|}{|X \cup Y|}+ \frac{|(N \setminus X) \setminus Y|}{|N \setminus (X \cap Y)|} \Big)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{HawkinsDotson} = \frac{1}{2}\cdot\Big(\frac{a}{a+b+c}+\frac{d}{b+c+d}\Big)\]
New in version 0.4.0.
Initialize HawkinsDotson instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Hawkins & Dotson similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Hawkins & Dotson similarity
- Return type:
float
Examples
>>> cmp = HawkinsDotson()
>>> cmp.sim('cat', 'hat')
0.6641091219096334
>>> cmp.sim('Niall', 'Neil')
0.606635407786303
>>> cmp.sim('aluminum', 'Catalan')
0.5216836734693877
>>> cmp.sim('ATCG', 'TAGC')
0.49362244897959184
New in version 0.4.0.
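The mean of occurrence and non-occurrence agreement can be computed directly from confusion-table counts, as in this minimal sketch. The function name `hawkins_dotson` is hypothetical, and the example counts are assumptions about the default tokenization and alphabet size rather than values stated in this section.

```python
def hawkins_dotson(a, b, c, d):
    """Mean of occurrence agreement and non-occurrence agreement."""
    occurrence = a / (a + b + c) if a + b + c else 0.0
    non_occurrence = d / (b + c + d) if b + c + d else 0.0
    return (occurrence + non_occurrence) / 2


# Assumed counts for 'ATCG' vs. 'TAGC': no shared bigrams, 5 unique to each
# string, and 774 bigrams of a 784-token alphabet that appear in neither.
print(hawkins_dotson(0, 5, 5, 774))
```

Because d dominates when the alphabet is large, the non-occurrence term stays close to 1, which is why even strings with no shared tokens score just under 0.5.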
- class abydos.distance.Hellinger(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Hellinger distance.
For two multisets X and Y drawn from an alphabet S, Hellinger distance [Hel09] is
\[dist_{Hellinger}(X, Y) = \sqrt{2 \cdot \sum_{i \in S} (\sqrt{|X_i|} - \sqrt{|Y_i|})^2}\]
New in version 0.4.0.
Initialize Hellinger instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Hellinger distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Hellinger distance
- Return type:
float
Examples
>>> cmp = Hellinger()
>>> cmp.dist('cat', 'hat')
0.8164965809277261
>>> cmp.dist('Niall', 'Neil')
0.881917103688197
>>> cmp.dist('aluminum', 'Catalan')
0.9128709291752769
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Hellinger distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Hellinger distance
- Return type:
float
Examples
>>> cmp = Hellinger()
>>> cmp.dist_abs('cat', 'hat')
2.8284271247461903
>>> cmp.dist_abs('Niall', 'Neil')
3.7416573867739413
>>> cmp.dist_abs('aluminum', 'Catalan')
5.477225575051661
>>> cmp.dist_abs('ATCG', 'TAGC')
4.47213595499958
New in version 0.4.0.
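A minimal sketch of the unnormalized Hellinger distance over token counts follows. It is an illustration, not the library's code: `padded_bigrams` and `hellinger_abs` are hypothetical names, and the padded-bigram tokenization is an assumption inferred from the documented examples.

```python
from collections import Counter
from math import sqrt


def padded_bigrams(word, start="$", stop="#"):
    """Return the multiset of bigrams of word, padded with start/stop symbols."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def hellinger_abs(src, tar):
    """sqrt(2 * sum over tokens of (sqrt(count in src) - sqrt(count in tar))^2)."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    total = sum((sqrt(x[tok]) - sqrt(y[tok])) ** 2 for tok in set(x) | set(y))
    return sqrt(2 * total)


print(hellinger_abs("cat", "hat"))
```

For 'cat' and 'hat', four bigrams occur in only one string and each contributes 1 to the sum, so the distance is the square root of 8.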
- class abydos.distance.HendersonHeron(**kwargs: Any)[source]
Bases:
_TokenDistance
Henderson-Heron dissimilarity.
For two sets X and Y and a population N, Henderson-Heron dissimilarity [HH77] is:
New in version 0.4.1.
Initialize HendersonHeron instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
- dist(src: str, tar: str) float [source]
Return the Henderson-Heron dissimilarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Henderson-Heron dissimilarity
- Return type:
float
Examples
>>> cmp = HendersonHeron()
>>> cmp.dist('cat', 'hat')
0.00011668873858680838
>>> cmp.dist('Niall', 'Neil')
0.00048123075776606097
>>> cmp.dist('aluminum', 'Catalan')
0.08534181060514882
>>> cmp.dist('ATCG', 'TAGC')
0.9684367974410505
New in version 0.4.1.
- class abydos.distance.HigueraMico(**kwargs: Any)[source]
Bases:
_Distance
The Higuera-Micó contextual normalized edit distance.
This is presented in [delHigueraMico08].
This measure is not normalized to a particular range. Indeed, for a string of infinite length and a string of length 0, the contextual normalized edit distance would be infinite. But so long as the relative difference in string lengths is not too great, the distance will generally remain below 1.0.
Notes
The "normalized" version of this distance, implemented in the dist method, is merely the minimum of the distance and 1.0.
New in version 0.4.0.
Initialize HigueraMico instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the bounded Higuera-Micó distance between two strings.
This is the distance bounded to the range [0, 1].
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The bounded Higuera-Micó distance between src & tar
- Return type:
float
Examples
>>> cmp = HigueraMico()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.5333333333333333
>>> cmp.dist('aluminum', 'Catalan')
0.7916666666666667
>>> cmp.dist('ATCG', 'TAGC')
0.6000000000000001
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Higuera-Micó distance between two strings.
This is a straightforward implementation of Higuera & Micó's pseudocode from [delHigueraMico08], ported to NumPy.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Higuera-Micó distance between src & tar
- Return type:
float
Examples
>>> cmp = HigueraMico()
>>> cmp.dist_abs('cat', 'hat')
0.3333333333333333
>>> cmp.dist_abs('Niall', 'Neil')
0.5333333333333333
>>> cmp.dist_abs('aluminum', 'Catalan')
0.7916666666666667
>>> cmp.dist_abs('ATCG', 'TAGC')
0.6000000000000001
New in version 0.4.0.
- class abydos.distance.HornMorisita(**kwargs: Any)[source]
Bases:
_TokenDistance
Horn-Morisita index of overlap.
Horn-Morisita index of overlap [Hor66], given two populations X and Y drawn from S species, is:
\[sim_{Horn-Morisita}(X, Y) = C_{\lambda} = \frac{2\sum_{i=1}^S x_i y_i} {(\hat{\lambda}_x + \hat{\lambda}_y)XY}\]
where
\[X = \sum_{i=1}^S x_i ~~;~~ Y = \sum_{i=1}^S y_i\]
\[\hat{\lambda}_x = \frac{\sum_{i=1}^S x_i^2}{X^2} ~~;~~ \hat{\lambda}_y = \frac{\sum_{i=1}^S y_i^2}{Y^2}\]
Observe that this is identical to Morisita similarity, except for the definition of the \(\lambda\) values in the denominator.
New in version 0.4.1.
Initialize HornMorisita instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
- sim(src: str, tar: str) float [source]
Return the Horn-Morisita similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Horn-Morisita similarity
- Return type:
float
Examples
>>> cmp = HornMorisita()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3636363636363636
>>> cmp.sim('aluminum', 'Catalan')
0.10650887573964497
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.1.
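The index of overlap can be traced step by step with a small standalone sketch. It treats each distinct token as a "species" and each string as a population; the names `padded_bigrams` and `horn_morisita` are hypothetical, and the padded-bigram tokenization is an assumption inferred from the documented examples.

```python
from collections import Counter


def padded_bigrams(word, start="$", stop="#"):
    """Return the multiset of bigrams of word, padded with start/stop symbols."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def horn_morisita(src, tar):
    """Horn-Morisita overlap of the token counts of two strings."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    x_total, y_total = sum(x.values()), sum(y.values())
    # lambda-hat values: sum of squared counts over the squared population size
    lam_x = sum(v * v for v in x.values()) / x_total ** 2
    lam_y = sum(v * v for v in y.values()) / y_total ** 2
    # Only tokens present in both strings contribute to the cross product
    cross = sum(count * y[tok] for tok, count in x.items())
    return 2 * cross / ((lam_x + lam_y) * x_total * y_total)


print(horn_morisita("cat", "hat"))
```

For 'cat' and 'hat', the cross product is 2, both populations have size 4, and both lambda values are 0.25, giving 4 / (0.5 * 16) = 0.5.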
- class abydos.distance.Hurlbert(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Hurlbert correlation.
In 2x2 confusion table terms, where a+b+c+d=n, Hurlbert's coefficient of interspecific association [Hur69] is
\[corr_{Hurlbert} = \frac{ad-bc}{|ad-bc|} \sqrt{\frac{Obs_{\chi^2}-Min_{\chi^2}} {Max_{\chi^2}-Min_{\chi^2}}}\]
Where:
\[ \begin{align}\begin{aligned}\begin{array}{lll} Obs_{\chi^2} &= \frac{(ad-bc)^2n}{(a+b)(a+c)(b+d)(c+d)}\\Max_{\chi^2} &= \frac{(a+b)(b+d)n}{(a+c)(c+d)} &\textrm{ when } ad \geq bc\\Max_{\chi^2} &= \frac{(a+b)(a+c)n}{(b+d)(c+d)} &\textrm{ when } ad < bc \textrm{ and } a \leq d\\Max_{\chi^2} &= \frac{(b+d)(c+d)n}{(a+b)(a+c)} &\textrm{ when } ad < bc \textrm{ and } a > d\\Min_{\chi^2} &= \frac{n^3 (\hat{a} - g(\hat{a}))^2} {(a+b)(a+c)(c+d)(b+d)}\\\textrm{where } \hat{a} &= \frac{(a+b)(a+c)}{n}\\\textrm{and } g(\hat{a}) &= \lfloor\hat{a}\rfloor &\textrm{ when } ad < bc,\\\textrm{otherwise } g(\hat{a}) &= \lceil\hat{a}\rceil \end{array}\end{aligned}\end{align} \]
New in version 0.4.0.
Initialize Hurlbert instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Hurlbert correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Hurlbert correlation
- Return type:
float
Examples
>>> cmp = Hurlbert()
>>> cmp.corr('cat', 'hat')
0.497416003373807
>>> cmp.corr('Niall', 'Neil')
0.32899851514665707
>>> cmp.corr('aluminum', 'Catalan')
0.10144329225459262
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Hurlbert similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Hurlbert similarity
- Return type:
float
Examples
>>> cmp = Hurlbert()
>>> cmp.sim('cat', 'hat')
0.7487080016869034
>>> cmp.sim('Niall', 'Neil')
0.6644992575733285
>>> cmp.sim('aluminum', 'Catalan')
0.5507216461272963
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.ISG(full_guth: bool = False, symmetric: bool = True, **kwargs: Any)[source]
Bases:
_Distance
Indice de Similitude-Guth (ISG) similarity.
This is an implementation of Bouchard & Pouyez's Indice de Similitude-Guth (ISG) [BP80]. At its heart, ISG is Jaccard similarity, but limits on token matching are added according to part of Guth's matching criteria [Gut76].
[BP80] is limited in its implementation details. Based on the examples given in the paper, it appears that only the first 4 of Guth's rules are considered (a letter in the first string must match a letter in the second string appearing in the same position, an adjacent position, or two positions ahead). It also appears that the distance in the paper is the greater of the distance from string 1 to string 2 and the distance from string 2 to string 1.
These qualities can be specified as parameters. At initialization, specify full_guth=True to apply all of Guth's rules and symmetric=False to calculate only the distance from string 1 to string 2.
New in version 0.4.1.
Initialize ISG instance.
- Parameters:
full_guth (bool) -- Whether to apply all of Guth's matching rules
symmetric (bool) -- Whether to calculate the symmetric distance
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
- sim(src: str, tar: str) float [source]
Return the Indice de Similitude-Guth (ISG) similarity of two words.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The ISG similarity
- Return type:
float
Examples
>>> cmp = ISG()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.5
>>> cmp.sim('aluminum', 'Catalan')
0.15384615384615385
>>> cmp.sim('ATCG', 'TAGC')
1.0
New in version 0.4.1.
- class abydos.distance.Ident(**kwargs: Any)[source]
Bases:
_Distance
Identity distance and similarity.
New in version 0.3.6.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the identity similarity of two strings.
Identity similarity is 1.0 if the two strings are identical, otherwise 0.0
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Identity similarity
- Return type:
float
Examples
>>> cmp = Ident()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('cat', 'cat')
1.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.Inclusion(**kwargs: Any)[source]
Bases:
_Distance
Inclusion distance.
The INC Programme, developed by [BP80], designates two terms as being "included" when:
One name is shorter than the other
There are at least 3 common characters
There is at most one difference, disregarding unmatching prefixes and suffixes
In addition to these rules, this implementation considers two terms as being "included" if they are identical.
The return value, though a float, can only take one of two values: 0.0, indicating inclusion, or 1.0, indicating non-inclusion.
New in version 0.4.1.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the INClusion Programme value of two words.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The INC Programme distance
- Return type:
float
Examples
>>> cmp = Inclusion()
>>> cmp.dist('cat', 'hat')
1.0
>>> cmp.dist('Niall', 'Neil')
1.0
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.1.
- class abydos.distance.Indel(**kwargs: Any)[source]
Bases:
Levenshtein
Indel distance.
This is equivalent to Levenshtein distance, when only inserts and deletes are possible.
New in version 0.3.6.
Initialize Levenshtein instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized indel distance between two strings.
This is equivalent to normalized Levenshtein distance, when only inserts and deletes are possible.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized indel distance
- Return type:
float
Examples
>>> cmp = Indel()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.333333333333
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.454545454545
>>> cmp.dist('ATCG', 'TAGC')
0.5
New in version 0.3.6.
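When only inserts and deletes are allowed, the edit distance follows from the length of the longest common subsequence (LCS): every character outside the LCS must be deleted from one string or inserted into the other. The sketch below uses that identity instead of the library's Levenshtein code. The function names are hypothetical, and dividing by the combined length of the two strings is an assumption that happens to agree with the documented examples.

```python
def lcs_length(src, tar):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(tar) + 1)
    for s in src:
        curr = [0]
        for j, t in enumerate(tar, 1):
            if s == t:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_abs(src, tar):
    """Number of single-character inserts and deletes needed."""
    return len(src) + len(tar) - 2 * lcs_length(src, tar)


def indel_norm(src, tar):
    """Indel distance divided by the combined length (assumed normalization)."""
    total = len(src) + len(tar)
    return indel_abs(src, tar) / total if total else 0.0


print(indel_abs("Colin", "Cuilen"), indel_norm("Colin", "Cuilen"))
```

'Colin' and 'Cuilen' share an LCS of length 3, so 5 + 6 - 6 = 5 edits are needed, or 5/11 after normalization.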
- class abydos.distance.IterativeSubString(hamacher: float = 0.6, normalize_strings: bool = False, **kwargs: Any)[source]
Bases:
_Distance
Iterative-SubString correlation.
Iterative-SubString (I-Sub) correlation [SSK05]
This is a straightforward port of the primary author's Java implementation: http://www.image.ece.ntua.gr/~gstoil/software/I_Sub.java
New in version 0.4.0.
Initialize IterativeSubString instance.
- Parameters:
hamacher (float) -- The constant factor for the Hamacher product
normalize_strings (bool) -- Normalize the strings by removing the characters in '._ ' and lower casing
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Iterative-SubString correlation of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Iterative-SubString correlation
- Return type:
float
Examples
>>> cmp = IterativeSubString()
>>> cmp.corr('cat', 'hat')
-1.0
>>> cmp.corr('Niall', 'Neil')
-0.9
>>> cmp.corr('aluminum', 'Catalan')
-1.0
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Iterative-SubString similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Iterative-SubString similarity
- Return type:
float
Examples
>>> cmp = IterativeSubString()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.04999999999999999
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.Jaccard(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
Tversky
Jaccard similarity.
For two sets X and Y, the Jaccard similarity coefficient [Jac01, Rruvzivcka58] is
\[sim_{Jaccard}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}.\]
This is identical to the Tanimoto similarity coefficient [Tan58] and the Tversky index [Tve77] for \(\alpha = \beta = 1\).
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Jaccard} = \frac{a}{a+b+c}\]
Notes
The multiset variant is termed Ellenberg similarity [Ell56].
New in version 0.3.6.
Initialize Jaccard instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Jaccard similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Jaccard similarity
- Return type:
float
Examples
>>> cmp = Jaccard()
>>> cmp.sim('cat', 'hat')
0.3333333333333333
>>> cmp.sim('Niall', 'Neil')
0.2222222222222222
>>> cmp.sim('aluminum', 'Catalan')
0.0625
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- tanimoto_coeff(src: str, tar: str) float [source]
Return the Tanimoto distance between two strings.
Tanimoto distance [Tan58] is computed here as \(log_{2} sim_{Tanimoto}(X, Y)\), so the returned value is never positive and is -inf when the similarity is 0.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Tanimoto distance
- Return type:
float
Examples
>>> cmp = Jaccard()
>>> cmp.tanimoto_coeff('cat', 'hat')
-1.5849625007211563
>>> cmp.tanimoto_coeff('Niall', 'Neil')
-2.1699250014423126
>>> cmp.tanimoto_coeff('aluminum', 'Catalan')
-4.0
>>> cmp.tanimoto_coeff('ATCG', 'TAGC')
-inf
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
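The relationship between the Jaccard similarity and the value returned by tanimoto_coeff can be sketched without the library. The names `padded_bigrams`, `jaccard_sim` and `tanimoto_log` are hypothetical; the padded-bigram tokenization and the multiset treatment of repeated tokens are assumptions inferred from the documented examples.

```python
from collections import Counter
from math import log2


def padded_bigrams(word, start="$", stop="#"):
    """Return the multiset of bigrams of word, padded with start/stop symbols."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def jaccard_sim(src, tar):
    """Size of the multiset intersection over size of the multiset union."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    intersection = sum((x & y).values())  # per-token minimum counts
    union = sum((x | y).values())         # per-token maximum counts
    return intersection / union


def tanimoto_log(src, tar):
    """Base-2 logarithm of the similarity; -inf when nothing is shared."""
    sim = jaccard_sim(src, tar)
    return log2(sim) if sim > 0 else float("-inf")


print(jaccard_sim("aluminum", "Catalan"), tanimoto_log("aluminum", "Catalan"))
```

'aluminum' and 'Catalan' share one bigram out of a multiset union of 16, so the similarity is 1/16 and its base-2 logarithm is -4.0, in line with the examples above.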
- class abydos.distance.JaccardNM(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Jaccard-NM similarity.
For two sets X and Y and a population N, Jaccard-NM similarity [NMM11] is
\[sim_{JaccardNM}(X, Y) = \frac{|X \cap Y|} {|N| + |X \cap Y| + |X \setminus Y| + |Y \setminus X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{JaccardNM} = \frac{a}{2(a+b+c)+d}\]
New in version 0.4.0.
Initialize JaccardNM instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Jaccard-NM similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Jaccard-NM similarity
- Return type:
float
Examples
>>> cmp = JaccardNM()
>>> cmp.sim('cat', 'hat')
0.005063291139240506
>>> cmp.sim('Niall', 'Neil')
0.005044136191677175
>>> cmp.sim('aluminum', 'Catalan')
0.0024968789013732834
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Jaccard-NM similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Jaccard-NM similarity
- Return type:
float
Examples
>>> cmp = JaccardNM()
>>> cmp.sim_score('cat', 'hat')
0.002531645569620253
>>> cmp.sim_score('Niall', 'Neil')
0.0025220680958385876
>>> cmp.sim_score('aluminum', 'Catalan')
0.0012484394506866417
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.0.
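The difference between the sim and sim_score examples can be explored with a short sketch built on the 2x2 formula. The function names are hypothetical, the example counts are assumptions about the default tokenization and alphabet size, and the factor of 2 in `jaccard_nm_sim` is inferred from the documented values (each sim result is twice the matching sim_score result) rather than stated in this section.

```python
def jaccard_nm_score(a, b, c, d):
    """Jaccard-NM score a / (2(a + b + c) + d) from confusion-table counts."""
    denominator = 2 * (a + b + c) + d
    return a / denominator if denominator else 0.0


def jaccard_nm_sim(a, b, c, d):
    """Assumed normalization: the score peaks at 1/2, so double it."""
    return 2 * jaccard_nm_score(a, b, c, d)


# Assumed counts for 'cat' vs. 'hat': a = 2, b = c = 2, d = 778
print(jaccard_nm_score(2, 2, 2, 778))
print(jaccard_nm_sim(2, 2, 2, 778))
```

The score reaches its maximum of 1/2 only when b = c = d = 0, which is why doubling it yields a value in [0, 1].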
- class abydos.distance.JaroWinkler(qval: int = 1, mode: str = 'winkler', long_strings: bool = False, boost_threshold: float = 0.7, scaling_factor: float = 0.1, **kwargs: Any)[source]
Bases:
_Distance
Jaro-Winkler distance.
Jaro(-Winkler) distance is a string edit distance initially proposed by Jaro and extended by Winkler [Jar89, Win90].
This is Python based on the C code for strcmp95: http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c [WMJL94]. The above file is a US Government publication and, accordingly, in the public domain.
New in version 0.3.6.
Initialize JaroWinkler instance.
- Parameters:
qval (int) -- The length of each q-gram (defaults to 1: character-wise matching)
mode (str) -- Indicates which variant of this distance metric to compute:
winkler -- computes the Jaro-Winkler distance (default), which increases the score for matches near the start of the word
jaro -- computes the Jaro distance
long_strings (bool) -- Set to True to "Increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers." (Used in 'winkler' mode only.)
boost_threshold (float) -- A value between 0 and 1, below which the Winkler boost is not applied (defaults to 0.7). (Used in 'winkler' mode only.)
scaling_factor (float) -- A value between 0 and 0.25, indicating by how much to boost scores for matching prefixes (defaults to 0.1). (Used in 'winkler' mode only.)
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Jaro or Jaro-Winkler similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Jaro or Jaro-Winkler similarity
- Return type:
float
- Raises:
ValueError -- Unsupported boost_threshold assignment; boost_threshold must be between 0 and 1.
ValueError -- Unsupported scaling_factor assignment; scaling_factor must be between 0 and 0.25.
Examples
>>> cmp = JaroWinkler()
>>> round(cmp.sim('cat', 'hat'), 12)
0.777777777778
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.805
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.60119047619
>>> round(cmp.sim('ATCG', 'TAGC'), 12)
0.833333333333
>>> cmp = JaroWinkler(mode='jaro')
>>> round(cmp.sim('cat', 'hat'), 12)
0.777777777778
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.783333333333
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.60119047619
>>> round(cmp.sim('ATCG', 'TAGC'), 12)
0.833333333333
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
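To show how the boost_threshold and scaling_factor parameters interact, here is a simplified textbook sketch of Jaro and Jaro-Winkler similarity for character-wise matching (qval = 1). It is not a port of the strcmp95-based code: the function names are hypothetical, the long_strings adjustment is omitted, and the four-character prefix limit is the conventional Winkler choice, assumed here rather than taken from this section.

```python
def jaro(src, tar):
    """Jaro similarity using a sliding match window."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    window = max(max(len(src), len(tar)) // 2 - 1, 0)
    src_hit = [False] * len(src)
    tar_hit = [False] * len(tar)
    matches = 0
    for i, ch in enumerate(src):
        # Look for an unused equal character within the window around i
        for j in range(max(0, i - window), min(len(tar), i + window + 1)):
            if not tar_hit[j] and tar[j] == ch:
                src_hit[i] = tar_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    src_seq = [c for c, hit in zip(src, src_hit) if hit]
    tar_seq = [c for c, hit in zip(tar, tar_hit) if hit]
    # Matched characters that appear in a different order count as half each
    transpositions = sum(s != t for s, t in zip(src_seq, tar_seq)) // 2
    return (matches / len(src) + matches / len(tar)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(src, tar, boost_threshold=0.7, scaling_factor=0.1):
    """Boost the Jaro similarity for a shared prefix of up to 4 characters."""
    sim = jaro(src, tar)
    if sim <= boost_threshold:
        return sim
    prefix = 0
    for s, t in zip(src[:4], tar[:4]):
        if s != t:
            break
        prefix += 1
    return sim + prefix * scaling_factor * (1 - sim)


print(round(jaro("Niall", "Neil"), 12), round(jaro_winkler("Niall", "Neil"), 12))
```

For 'Niall' and 'Neil', three characters match with no transpositions, so the Jaro value is (3/5 + 3/4 + 1) / 3; the shared one-character prefix then adds 0.1 of the remaining gap to 1.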
- class abydos.distance.JensenShannon(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Jensen-Shannon divergence.
Jensen-Shannon divergence [DLP99] of two multi-sets X and Y is
\[ \begin{array}{rl} dist_{JS}(X, Y) &= \log 2 + \frac{1}{2} \sum_{i \in X \cap Y} h(p(X_i) + p(Y_i)) - h(p(X_i)) - h(p(Y_i))\\ h(x) &= -x \log x\\ p(X_i \in X) &= \frac{|X_i|}{|X|} \end{array} \]
New in version 0.4.0.
Initialize JensenShannon instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Jensen-Shannon distance of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Jensen-Shannon distance
- Return type:
float
Examples
>>> cmp = JensenShannon()
>>> cmp.dist('cat', 'hat')
0.49999999999999994
>>> cmp.dist('Niall', 'Neil')
0.6355222557917826
>>> cmp.dist('aluminum', 'Catalan')
0.8822392827203127
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Jensen-Shannon divergence of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Jensen-Shannon divergence
- Return type:
float
Examples
>>> cmp = JensenShannon()
>>> cmp.dist_abs('cat', 'hat')
0.3465735902799726
>>> cmp.dist_abs('Niall', 'Neil')
0.44051045978517045
>>> cmp.dist_abs('aluminum', 'Catalan')
0.6115216713968132
>>> cmp.dist_abs('ATCG', 'TAGC')
0.6931471805599453
New in version 0.4.0.
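The Jensen-Shannon formula given for this class can be traced by hand with a short sketch. It is not the library's implementation: the helper names bigrams, h, and jensen_shannon are hypothetical, and the tokenization (2-grams over the word padded with '$' and '#') is an assumption about what the default tokenizer does.

```python
from collections import Counter
from math import log


def bigrams(word):
    """Multiset of 2-grams of word, padded with '$' and '#' (assumed tokenization)."""
    padded = '$' + word + '#'
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def h(x):
    """Entropy term h(x) = -x log x."""
    return -x * log(x)


def jensen_shannon(src, tar):
    """Jensen-Shannon divergence of two strings, following the formula above."""
    x, y = bigrams(src), bigrams(tar)
    size_x, size_y = sum(x.values()), sum(y.values())
    total = 0.0
    for token in x.keys() & y.keys():  # only shared tokens contribute
        p_x, p_y = x[token] / size_x, y[token] / size_y
        total += h(p_x + p_y) - h(p_x) - h(p_y)
    return log(2) + total / 2


divergence = jensen_shannon('cat', 'hat')
print(divergence)           # unnormalized divergence
print(divergence / log(2))  # scaled by its maximum, log 2
```

When the two strings share no tokens the sum is empty and the divergence is log 2, its maximum. Dividing by log 2 appears to be what separates dist from dist_abs in the documented examples, though that is inferred from the listed values rather than stated in the text.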
- class abydos.distance.Johnson(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Johnson similarity.
For two sets X and Y, the Johnson similarity [Joh67] is
\[sim_{Johnson}(X, Y) = \frac{|X \cap Y|}{|X|} + \frac{|Y \cap X|}{|Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Johnson} = \frac{a}{a+b}+\frac{a}{a+c}\]
New in version 0.4.0.
Initialize Johnson instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Johnson similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Johnson similarity
- Return type:
float
Examples
>>> cmp = Johnson()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3666666666666667
>>> cmp.sim('aluminum', 'Catalan')
0.11805555555555555
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Johnson similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Johnson similarity
- Return type:
float
Examples
>>> cmp = Johnson()
>>> cmp.sim_score('cat', 'hat')
1.0
>>> cmp.sim_score('Niall', 'Neil')
0.7333333333333334
>>> cmp.sim_score('aluminum', 'Catalan')
0.2361111111111111
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.0.
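To make the 2x2 confusion table terms concrete, the following sketch derives a, b, and c from two token multisets and evaluates the Johnson formula. The helpers bigrams and johnson_score are hypothetical names, the '$'/'#' padded bigram tokenization is an assumption, and returning 0.0 when nothing is shared is a guard chosen here, not documented library behaviour.

```python
from collections import Counter


def bigrams(word):
    """Multiset of 2-grams of word, padded with '$' and '#' (assumed tokenization)."""
    padded = '$' + word + '#'
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def johnson_score(src, tar):
    """Johnson similarity a/(a+b) + a/(a+c) from token multisets (sketch)."""
    x, y = bigrams(src), bigrams(tar)
    a = sum((x & y).values())  # |X intersect Y|
    b = sum((x - y).values())  # |X minus Y|
    c = sum((y - x).values())  # |Y minus X|
    if a == 0:
        return 0.0  # assumed guard: nothing shared, so no similarity
    return a / (a + b) + a / (a + c)


print(johnson_score('Niall', 'Neil'))
```

For 'Niall' and 'Neil' the shared tokens are '$N' and 'l#', so a=2, b=4, c=3 and the score is 2/6 + 2/5. Since each fraction is at most 1, the score is at most 2; the documented sim values are consistent with halving sim_score.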
- class abydos.distance.KendallTau(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kendall's Tau correlation.
For two sets X and Y and a population N, Kendall's Tau correlation [Ken38] is
\[corr_{KendallTau}(X, Y) = \frac{2 \cdot (|X \cap Y| + |(N \setminus X) \setminus Y| - |X \triangle Y|)}{|N| \cdot (|N|-1)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KendallTau} = \frac{2 \cdot (a+d-b-c)}{n \cdot (n-1)}\]
New in version 0.4.0.
Initialize KendallTau instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kendall's Tau correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kendall's Tau correlation
- Return type:
float
Examples
>>> cmp = KendallTau()
>>> cmp.corr('cat', 'hat')
0.0025282143508744493
>>> cmp.corr('Niall', 'Neil')
0.00250866630176975
>>> cmp.corr('aluminum', 'Catalan')
0.0024535291823735866
>>> cmp.corr('ATCG', 'TAGC')
0.0024891182526650506
Notes
This correlation is not necessarily bounded to [-1.0, 1.0], but will typically be within these bounds for real data.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kendall's Tau similarity of two strings.
The Tau correlation is first clamped to the range [-1.0, 1.0] before being converted to a similarity value to ensure that the similarity is in the range [0.0, 1.0].
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kendall's Tau similarity
- Return type:
float
Examples
>>> cmp = KendallTau()
>>> cmp.sim('cat', 'hat')
0.5012641071754372
>>> cmp.sim('Niall', 'Neil')
0.5012543331508849
>>> cmp.sim('aluminum', 'Catalan')
0.5012267645911868
>>> cmp.sim('ATCG', 'TAGC')
0.5012445591263325
New in version 0.4.0.
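The relationship between corr and sim for this class can be sketched directly from the confusion table form. The function names are hypothetical, and the counts used (a=2, b=2, c=2, d=778, hence n=784) are an assumption about what 'cat' versus 'hat' yields under the default tokenizer and alphabet; they are chosen because they reproduce the documented values. The mapping (1 + corr) / 2 is likewise inferred from the examples.

```python
def kendall_tau_corr(a, b, c, d):
    """Kendall's Tau from 2x2 confusion table counts (sketch)."""
    n = a + b + c + d
    if n < 2:
        return 0.0  # assumed guard against division by zero
    return 2 * (a + d - b - c) / (n * (n - 1))


def kendall_tau_sim(a, b, c, d):
    """Clamp the correlation to [-1, 1], then map it onto [0, 1]."""
    corr = max(-1.0, min(1.0, kendall_tau_corr(a, b, c, d)))
    return (1.0 + corr) / 2.0


print(kendall_tau_corr(2, 2, 2, 778))
print(kendall_tau_sim(2, 2, 2, 778))
```

Because the denominator grows with n squared while the numerator grows with n, the correlation stays very close to 0 for a large population, which is why all four documented sim values sit just above 0.5.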
- class abydos.distance.KentFosterI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kent & Foster I similarity.
For two sets X and Y and a population N, Kent & Foster I similarity [KF77], \(K_{occ}\), is
\[sim_{KentFosterI}(X, Y) = \frac{|X \cap Y| - \frac{|X|\cdot|Y|}{|X \cup Y|}} {|X \cap Y| - \frac{|X|\cdot|Y|}{|X \cup Y|} + |X \setminus Y| + |Y \setminus X|}\]
Kent & Foster derived this from Cohen's \(\kappa\) by "subtracting appropriate chance agreement correction figures from the numerators and denominators" to arrive at an occurrence reliability measure.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{KentFosterI} = \frac{a-\frac{(a+b)(a+c)}{a+b+c}}{a-\frac{(a+b)(a+c)}{a+b+c}+b+c}\]
New in version 0.4.0.
Initialize KentFosterI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Kent & Foster I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Kent & Foster I similarity
- Return type:
float
Examples
>>> cmp = KentFosterI()
>>> cmp.sim('cat', 'hat')
0.8
>>> cmp.sim('Niall', 'Neil')
0.7647058823529411
>>> cmp.sim('aluminum', 'Catalan')
0.6956521739130435
>>> cmp.sim('ATCG', 'TAGC')
0.6666666666666667
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Kent & Foster I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kent & Foster I similarity
- Return type:
float
Examples
>>> cmp = KentFosterI()
>>> cmp.sim_score('cat', 'hat')
-0.19999999999999996
>>> cmp.sim_score('Niall', 'Neil')
-0.23529411764705888
>>> cmp.sim_score('aluminum', 'Catalan')
-0.30434782608695654
>>> cmp.sim_score('ATCG', 'TAGC')
-0.3333333333333333
New in version 0.4.0.
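A short worked evaluation of the Kent & Foster I confusion table form may help explain why the documented scores are negative. This is a sketch with a hypothetical function name; the counts a=2, b=2, c=2 are assumed to correspond to 'cat' versus 'hat', and degenerate inputs (for example b = c = 0) are deliberately not handled.

```python
def kent_foster_i(a, b, c):
    """Kent & Foster I (K_occ) from confusion table counts (sketch)."""
    chance = (a + b) * (a + c) / (a + b + c)  # chance agreement correction
    corrected = a - chance
    return corrected / (corrected + b + c)


print(kent_foster_i(2, 2, 2))
```

With a=2, b=2, c=2 the chance term is 16/6, which exceeds a, so the corrected agreement is negative and the score comes out near -0.2. The documented sim values are consistent with adding 1 to sim_score, although the text does not state that mapping.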
- class abydos.distance.KentFosterII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kent & Foster II similarity.
For two sets X and Y and a population N, Kent & Foster II similarity [KF77], \(K_{nonocc}\), is
\[sim_{KentFosterII}(X, Y) = \frac{|(N \setminus X) \setminus Y| - \frac{|N \setminus Y|\cdot|N \setminus X|} {|N \setminus (X \cap Y)|}} {|(N \setminus X) \setminus Y| - \frac{|N \setminus Y|\cdot|N \setminus X|} {|N \setminus (X \cap Y)|} + |X \setminus Y| + |Y \setminus X|}\]
Kent & Foster derived this from Cohen's \(\kappa\) by "subtracting appropriate chance agreement correction figures from the numerators and denominators" to arrive at a non-occurrence reliability measure.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{KentFosterII} = \frac{d-\frac{(b+d)(c+d)}{b+c+d}}{d-\frac{(b+d)(c+d)}{b+c+d}+b+c}\]
New in version 0.4.0.
Initialize KentFosterII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Kent & Foster II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Kent & Foster II similarity
- Return type:
float
Examples
>>> cmp = KentFosterII()
>>> cmp.sim('cat', 'hat')
0.998719590268876
>>> cmp.sim('Niall', 'Neil')
0.9978030025631628
>>> cmp.sim('aluminum', 'Catalan')
0.9952153110047858
>>> cmp.sim('ATCG', 'TAGC')
0.9968010236724241
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Kent & Foster II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kent & Foster II similarity
- Return type:
float
Examples
>>> cmp = KentFosterII()
>>> cmp.sim_score('cat', 'hat')
-0.0012804097311239404
>>> cmp.sim_score('Niall', 'Neil')
-0.002196997436837158
>>> cmp.sim_score('aluminum', 'Catalan')
-0.004784688995214218
>>> cmp.sim_score('ATCG', 'TAGC')
-0.0031989763275758767
New in version 0.4.0.
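The confusion table form of Kent & Foster II can be evaluated the same way, using the non-occurrence count d in place of a. The sketch below uses a hypothetical function name and assumes b=2, c=2, d=778 for 'cat' versus 'hat', values chosen because they reproduce the documented score; zero denominators are not handled.

```python
def kent_foster_ii(b, c, d):
    """Kent & Foster II (K_nonocc) from confusion table counts (sketch)."""
    chance = (b + d) * (c + d) / (b + c + d)  # chance agreement on non-occurrence
    corrected = d - chance
    return corrected / (corrected + b + c)


print(kent_foster_ii(2, 2, 778))
```

Because d dominates when the population is large, the chance term is almost equal to d and the score is a small negative number.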
- class abydos.distance.KoppenI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', normalizer: str = 'proportional', **kwargs: Any)[source]
Bases:
_TokenDistance
Köppen I correlation.
For two sets X and Y and an alphabet N, provided that \(|X| = |Y|\), Köppen I correlation [GK59, Koppen70] is
\[corr_{KoppenI}(X, Y) = \frac{|X| \cdot |N \setminus X| - |X \setminus Y|} {|X| \cdot |N \setminus X|}\]
To support cases where \(|X| \neq |Y|\), this class implements a slight variation, while still providing the expected results when \(|X| = |Y|\):
\[corr_{KoppenI}(X, Y) = \frac{\frac{|X|+|Y|}{2} \cdot \frac{|N \setminus X|+|N \setminus Y|}{2}- \frac{|X \triangle Y|}{2}} {\frac{|X|+|Y|}{2} \cdot \frac{|N \setminus X|+|N \setminus Y|}{2}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KoppenI} = \frac{\frac{2a+b+c}{2} \cdot \frac{2d+b+c}{2}- \frac{b+c}{2}} {\frac{2a+b+c}{2} \cdot \frac{2d+b+c}{2}}\]
Notes
In the usual case all of the above values should be proportional to the total number of samples n. I.e., a, b, c, d, & n should all be divided by n prior to calculating the coefficient. This class's default normalizer is, accordingly, 'proportional'.
New in version 0.4.0.
Initialize KoppenI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
normalizer (str) -- Specifies the normalization type. See normalizer description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Köppen I correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Köppen I correlation
- Return type:
float
Examples
>>> cmp = KoppenI()
>>> cmp.corr('cat', 'hat')
0.49615384615384617
>>> cmp.corr('Niall', 'Neil')
0.3575056927658083
>>> cmp.corr('aluminum', 'Catalan')
0.1068520131813188
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483896
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Köppen I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Köppen I similarity
- Return type:
float
Examples
>>> cmp = KoppenI()
>>> cmp.sim('cat', 'hat')
0.7480769230769231
>>> cmp.sim('Niall', 'Neil')
0.6787528463829041
>>> cmp.sim('aluminum', 'Catalan')
0.5534260065906594
>>> cmp.sim('ATCG', 'TAGC')
0.49679075738125805
New in version 0.4.0.
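The claim that the averaged variation of Köppen I still gives the original result when |X| = |Y| can be checked numerically. In confusion table terms |X| = |Y| means b = c, so |X| = a+b, |N \ X| = c+d, and |X \ Y| = b. The sketch below uses hypothetical function names and proportional counts chosen only for illustration; it does not attempt to reproduce the library's normalizer or its documented outputs.

```python
def koppen_i_original(a, b, c, d):
    """Original Köppen I form, valid only when |X| = |Y| (b == c)."""
    size_x = a + b        # |X|
    complement_x = c + d  # |N minus X|
    return (size_x * complement_x - b) / (size_x * complement_x)


def koppen_i_averaged(a, b, c, d):
    """Averaged variation from the confusion table form above."""
    product = (2 * a + b + c) / 2 * (2 * d + b + c) / 2
    return (product - (b + c) / 2) / product


# Proportional counts (summing to 1) with b == c, i.e. |X| == |Y|.
counts = (0.25, 0.125, 0.125, 0.5)
print(koppen_i_original(*counts))
print(koppen_i_averaged(*counts))
```

When b differs from c, only the averaged form is symmetric in its two arguments, which is presumably the reason for the variation.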
- class abydos.distance.KoppenII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Köppen II similarity.
For two sets X and Y, Köppen II similarity [GK59, Koppen70] is
\[sim_{KoppenII}(X, Y) = |X \cap Y| + \frac{|X \setminus Y| + |Y \setminus X|}{2}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{KoppenII} = a + \frac{b+c}{2}\]
New in version 0.4.0.
Initialize KoppenII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Köppen II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Köppen II similarity
- Return type:
float
Examples
>>> cmp = KoppenII()
>>> cmp.sim('cat', 'hat')
0.6666666666666666
>>> cmp.sim('Niall', 'Neil')
0.6111111111111112
>>> cmp.sim('aluminum', 'Catalan')
0.53125
>>> cmp.sim('ATCG', 'TAGC')
0.5
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Köppen II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Köppen II similarity
- Return type:
float
Examples
>>> cmp = KoppenII()
>>> cmp.sim_score('cat', 'hat')
4.0
>>> cmp.sim_score('Niall', 'Neil')
5.5
>>> cmp.sim_score('aluminum', 'Catalan')
8.5
>>> cmp.sim_score('ATCG', 'TAGC')
5.0
New in version 0.4.0.
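Köppen II is an unbounded count, so sim and sim_score differ by a normalization step. The sketch below evaluates the score from confusion table counts and then divides by a+b+c; that divisor is inferred from the documented examples and is an assumption, as are the function names and the counts a=2, b=4, c=3 for 'Niall' versus 'Neil'.

```python
def koppen_ii_score(a, b, c):
    """Köppen II: shared tokens plus half of the unshared tokens (sketch)."""
    return a + (b + c) / 2


def koppen_ii_sim(a, b, c):
    """Score divided by |X union Y| = a+b+c (assumed normalization)."""
    union = a + b + c
    if union == 0:
        return 1.0  # assumed convention for two empty inputs
    return koppen_ii_score(a, b, c) / union


print(koppen_ii_score(2, 4, 3))
print(koppen_ii_sim(2, 4, 3))
```

Under this normalization the value can never fall below 0.5, since even with a=0 half of the union is still counted; this matches the 0.5 documented for 'ATCG' versus 'TAGC'.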
- class abydos.distance.KuderRichardson(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuder & Richardson correlation.
For two sets X and Y and a population N, Kuder & Richardson correlation [Cro51, KR37] is
\[corr_{KuderRichardson}(X, Y) = \frac{4(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)} {|X| \cdot |N \setminus X| + |Y| \cdot |N \setminus Y| + 2(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuderRichardson} = \frac{4(ad-bc)}{(a+b)(c+d) + (a+c)(b+d) +2(ad-bc)}\]
New in version 0.4.0.
Initialize KuderRichardson instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kuder & Richardson correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuder & Richardson correlation
- Return type:
float
Examples
>>> cmp = KuderRichardson()
>>> cmp.corr('cat', 'hat')
0.6643835616438356
>>> cmp.corr('Niall', 'Neil')
0.5285677463699631
>>> cmp.corr('aluminum', 'Catalan')
0.19499521400246136
>>> cmp.corr('ATCG', 'TAGC')
-0.012919896640826873
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kuder & Richardson similarity of two strings.
Since Kuder & Richardson correlation is unbounded in the negative, this measure is first clamped to [-1.0, 1.0].
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuder & Richardson similarity
- Return type:
float
Examples
>>> cmp = KuderRichardson()
>>> cmp.sim('cat', 'hat')
0.8321917808219178
>>> cmp.sim('Niall', 'Neil')
0.7642838731849815
>>> cmp.sim('aluminum', 'Catalan')
0.5974976070012307
>>> cmp.sim('ATCG', 'TAGC')
0.4935400516795866
New in version 0.4.0.
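The clamping step described for sim can be illustrated with the confusion table form of the correlation. This is a sketch, not the library code: the function names are hypothetical, the counts (a=2, b=2, c=2, d=778 and a=0, b=5, c=5, d=774) are assumed to correspond to the 'cat'/'hat' and 'ATCG'/'TAGC' examples, and the (1 + corr) / 2 mapping is inferred from the documented values.

```python
def kuder_richardson_corr(a, b, c, d):
    """Kuder & Richardson correlation from confusion table counts (sketch)."""
    cross = a * d - b * c
    denominator = (a + b) * (c + d) + (a + c) * (b + d) + 2 * cross
    if denominator == 0:
        return 0.0  # assumed guard for degenerate tables
    return 4 * cross / denominator


def kuder_richardson_sim(a, b, c, d):
    """Clamp to [-1, 1] first, since the correlation is unbounded below."""
    corr = max(-1.0, min(1.0, kuder_richardson_corr(a, b, c, d)))
    return (1.0 + corr) / 2.0


print(kuder_richardson_corr(2, 2, 2, 778))
print(kuder_richardson_sim(0, 5, 5, 774))
```

The second call shows a case with no shared tokens: bc exceeds ad, the correlation is slightly negative, and the similarity falls just below 0.5.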
- class abydos.distance.KuhnsI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuhns I correlation.
For two sets X and Y and a population N, Kuhns I correlation [Kuh64], the excess of separation over its independence value (S), is
\[corr_{KuhnsI}(X, Y) = \frac{2\delta(X, Y)}{|N|}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuhnsI} = \frac{2\delta(a+b, a+c)}{n}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
New in version 0.4.0.
Initialize KuhnsI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kuhns I correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns I correlation
- Return type:
float
Examples
>>> cmp = KuhnsI()
>>> cmp.corr('cat', 'hat')
0.005049979175343606
>>> cmp.corr('Niall', 'Neil')
0.005004425239483548
>>> cmp.corr('aluminum', 'Catalan')
0.0023140898210880765
>>> cmp.corr('ATCG', 'TAGC')
-8.134631403581842e-05
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kuhns I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns I similarity
- Return type:
float
Examples
>>> cmp = KuhnsI()
>>> cmp.sim('cat', 'hat')
0.5050499791753436
>>> cmp.sim('Niall', 'Neil')
0.5050044252394835
>>> cmp.sim('aluminum', 'Catalan')
0.502314089821088
>>> cmp.sim('ATCG', 'TAGC')
0.49991865368596416
New in version 0.4.0.
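All of the Kuhns coefficients share the term δ, the observed overlap minus the overlap expected under independence. A minimal sketch of δ and of Kuhns I follows; the function names are hypothetical and the counts a=2, b=2, c=2, d=778 are assumed to correspond to 'cat' versus 'hat'.

```python
def kuhns_delta(a, b, c, d):
    """Observed overlap minus the overlap expected under independence."""
    n = a + b + c + d
    return a - (a + b) * (a + c) / n


def kuhns_i(a, b, c, d):
    """Kuhns I: twice delta, scaled by the population size (sketch)."""
    n = a + b + c + d
    return 2 * kuhns_delta(a, b, c, d) / n


print(kuhns_delta(2, 2, 2, 778))
print(kuhns_i(2, 2, 2, 778))
```

Here δ is close to a because the expected overlap, 16/784, is tiny; dividing by n then makes the coefficient itself very small, as in the documented examples.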
- class abydos.distance.KuhnsII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuhns II correlation.
For two sets X and Y and a population N, Kuhns II correlation [Kuh64], the excess of rectangular distance over its independence value (R), is
\[corr_{KuhnsII}(X, Y) = \frac{\delta(X, Y)}{\max(|X|, |Y|)}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuhnsII} = \frac{\delta(a+b, a+c)}{\max(a+b, a+c)}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
New in version 0.4.0.
Initialize KuhnsII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kuhns II correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns II correlation
- Return type:
float
Examples
>>> cmp = KuhnsII()
>>> cmp.corr('cat', 'hat')
0.49489795918367346
>>> cmp.corr('Niall', 'Neil')
0.32695578231292516
>>> cmp.corr('aluminum', 'Catalan')
0.10092002830856334
>>> cmp.corr('ATCG', 'TAGC')
-0.006377551020408163
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kuhns II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns II similarity
- Return type:
float
Examples
>>> cmp = KuhnsII()
>>> cmp.sim('cat', 'hat')
0.663265306122449
>>> cmp.sim('Niall', 'Neil')
0.5513038548752834
>>> cmp.sim('aluminum', 'Catalan')
0.40061335220570893
>>> cmp.sim('ATCG', 'TAGC')
0.32908163265306123
New in version 0.4.0.
- class abydos.distance.KuhnsIII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuhns III correlation.
For two sets X and Y and a population N, Kuhns III correlation [Kuh64], the excess of proportion of overlap over its independence value (P), is
\[corr_{KuhnsIII}(X, Y) = \frac{\delta(X, Y)}{\big(1-\frac{|X \cap Y|}{|X|+|Y|}\big) \big(|X|+|Y|-\frac{|X|\cdot|Y|}{|N|}\big)}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuhnsIII} = \frac{\delta(a+b, a+c)}{\big(1-\frac{a}{2a+b+c}\big) \big(2a+b+c-\frac{(a+b)(a+c)}{n}\big)}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
Notes
The coefficient presented in [Eid14, Mor12] as Kuhns' "Proportion of overlap above independence" is a significantly different coefficient, not evidenced in [Kuh64].
New in version 0.4.0.
Initialize KuhnsIII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kuhns III correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns III correlation
- Return type:
float
Examples
>>> cmp = KuhnsIII()
>>> cmp.corr('cat', 'hat')
0.3307757885763001
>>> cmp.corr('Niall', 'Neil')
0.21873141468207793
>>> cmp.corr('aluminum', 'Catalan')
0.05707545392902886
>>> cmp.corr('ATCG', 'TAGC')
-0.003198976327575176
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kuhns III similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns III similarity
- Return type:
float
Examples
>>> cmp = KuhnsIII()
>>> cmp.sim('cat', 'hat')
0.498081841432225
>>> cmp.sim('Niall', 'Neil')
0.41404856101155846
>>> cmp.sim('aluminum', 'Catalan')
0.29280659044677165
>>> cmp.sim('ATCG', 'TAGC')
0.24760076775431863
New in version 0.4.0.
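Since the Kuhns III denominator has two factors that are easy to misread, a step-by-step evaluation may be useful. The sketch below follows the confusion table form given for this class; the function name is hypothetical and the counts a=2, b=2, c=2, d=778 are assumed to correspond to 'cat' versus 'hat'.

```python
def kuhns_iii(a, b, c, d):
    """Kuhns III from confusion table counts (sketch)."""
    n = a + b + c + d
    expected = (a + b) * (a + c) / n       # overlap expected under independence
    delta = a - expected
    total = 2 * a + b + c                  # |X| + |Y|
    overlap_factor = 1 - a / total         # first factor of the denominator
    size_factor = total - expected         # second factor of the denominator
    return delta / (overlap_factor * size_factor)


print(kuhns_iii(2, 2, 2, 778))
```

For these counts the factors are 0.75 and roughly 7.98, so the result is about δ / 5.98.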
- class abydos.distance.KuhnsIV(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuhns IV correlation.
For two sets X and Y and a population N, Kuhns IV correlation [Kuh64], the excess of conditional probabilities over its independence value (W), is
\[corr_{KuhnsIV}(X, Y) = \frac{\delta(X, Y)}{\min(|X|, |Y|)}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuhnsIV} = \frac{\delta(a+b, a+c)}{\min(a+b, a+c)}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
New in version 0.4.0.
Initialize KuhnsIV instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kuhns IV correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns IV correlation
- Return type:
float
Examples
>>> cmp = KuhnsIV()
>>> cmp.corr('cat', 'hat')
0.49489795918367346
>>> cmp.corr('Niall', 'Neil')
0.3923469387755102
>>> cmp.corr('aluminum', 'Catalan')
0.11353503184713376
>>> cmp.corr('ATCG', 'TAGC')
-0.006377551020408163
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kuhns IV similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns IV similarity
- Return type:
float
Examples
>>> cmp = KuhnsIV()
>>> cmp.sim('cat', 'hat')
0.7474489795918368
>>> cmp.sim('Niall', 'Neil')
0.696173469387755
>>> cmp.sim('aluminum', 'Catalan')
0.5567675159235669
>>> cmp.sim('ATCG', 'TAGC')
0.4968112244897959
New in version 0.4.0.
- class abydos.distance.KuhnsIX(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuhns IX correlation.
For two sets X and Y and a population N, Kuhns IX correlation [Kuh64], the excess of coefficient of linear correlation over its independence value (L), is
\[corr_{KuhnsIX}(X, Y) = \frac{\delta(X, Y)}{\sqrt{|X|\cdot|Y|\cdot(1-\frac{|X|}{|N|}) \cdot(1-\frac{|Y|}{|N|})}}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuhnsIX} = \frac{\delta(a+b, a+c)}{\sqrt{(a+b)(a+c)(1-\frac{a+b}{n}) (1-\frac{a+c}{n})}}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
New in version 0.4.0.
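The sketch below shows one way the set quantities in this formula could be obtained from two strings. It is an illustration under stated assumptions, not the library's implementation: `padded_qgrams` and `kuhns_ix` are hypothetical names, the tokenization (bigrams padded with '$' and '#') is assumed, and the default population n = 784 is inferred from the documented values for inputs without repeated bigrams.

```python
from collections import Counter
from math import sqrt


def padded_qgrams(word: str, q: int = 2) -> Counter:
    """Multiset of q-grams with '$'/'#' padding (assumed tokenization)."""
    padded = "$" * (q - 1) + word + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def kuhns_ix(src: str, tar: str, n: int = 784) -> float:
    """Kuhns IX correlation of two strings (hypothetical helper)."""
    x, y = padded_qgrams(src), padded_qgrams(tar)
    a = sum((x & y).values())                      # |X ∩ Y| as a multiset
    size_x, size_y = sum(x.values()), sum(y.values())
    delta = a - size_x * size_y / n
    denom = sqrt(size_x * size_y * (1 - size_x / n) * (1 - size_y / n))
    return delta / denom if denom else 0.0         # 0.0 fallback is a sketch choice


print(kuhns_ix("Niall", "Neil"))
```

For 'Niall' and 'Neil' the bigram multisets share `$N` and `l#`, giving a=2, |X|=6 and |Y|=5.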
Initialize KuhnsIX instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kuhns IX correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns IX correlation
- Return type:
float
Examples
>>> cmp = KuhnsIX()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.36069255713421955
>>> cmp.corr('aluminum', 'Catalan')
0.10821361655002706
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kuhns IX similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns IX similarity
- Return type:
float
Examples
>>> cmp = KuhnsIX()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6803462785671097
>>> cmp.sim('aluminum', 'Catalan')
0.5541068082750136
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
- class abydos.distance.KuhnsV(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuhns V correlation.
For two sets X and Y and a population N, Kuhns V correlation [Kuh64], the excess of probability differences U over its independence value (U), is
\[corr_{KuhnsV}(X, Y) = \frac{\delta(X, Y)} {max\big(|X|\cdot(1-\frac{|X|}{|N|}), |Y|\cdot(1-\frac{|Y|}{|N|})\big)}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuhnsV} = \frac{\delta(a+b, a+c)} {max\big((a+b)(1-\frac{a+b}{n}), (a+c)(1-\frac{a+c}{n})\big)}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
New in version 0.4.0.
Initialize KuhnsV instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kuhns V correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns V correlation
- Return type:
float
Examples
>>> cmp = KuhnsV()
>>> cmp.corr('cat', 'hat')
0.497435897435897
>>> cmp.corr('Niall', 'Neil')
0.329477292202228
>>> cmp.corr('aluminum', 'Catalan')
0.10209049255441
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237484
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kuhns V similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns V similarity
- Return type:
float
Examples
>>> cmp = KuhnsV()
>>> cmp.sim('cat', 'hat')
0.7487179487179485
>>> cmp.sim('Niall', 'Neil')
0.664738646101114
>>> cmp.sim('aluminum', 'Catalan')
0.551045246277205
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
- class abydos.distance.KuhnsVI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuhns VI correlation.
For two sets X and Y and a population N, Kuhns VI correlation [Kuh64], the excess of probability differences V over its independence value (V), is
\[corr_{KuhnsVI}(X, Y) = \frac{\delta(X, Y)} {min\big(|X|\cdot(1-\frac{|X|}{|N|}), |Y|\cdot(1-\frac{|Y|}{|N|})\big)}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuhnsVI} = \frac{\delta(a+b, a+c)} {min\big((a+b)(1-\frac{a+b}{n}), (a+c)(1-\frac{a+c}{n})\big)}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
New in version 0.4.0.
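Kuhns V and Kuhns VI share the numerator and differ only in taking the max or the min of the same two terms in the denominator. The sketch below computes both side by side from confusion table counts. The function name is hypothetical, and the counts (a=2, b=4, c=3 for 'Niall' and 'Neil' as padded bigrams, with n = 784) are assumptions inferred from the documented example values.

```python
def kuhns_v_and_vi(a: float, b: float, c: float, n: float) -> tuple:
    """Return (Kuhns V, Kuhns VI) from 2x2 table counts (hypothetical helper)."""
    x, y = a + b, a + c
    delta = a - x * y / n
    term_x = x * (1 - x / n)   # (a+b)(1 - (a+b)/n)
    term_y = y * (1 - y / n)   # (a+c)(1 - (a+c)/n)
    return delta / max(term_x, term_y), delta / min(term_x, term_y)


# Assumed counts for 'Niall' vs 'Neil'.
v, vi = kuhns_v_and_vi(a=2, b=4, c=3, n=784)
print(v, vi)
```

Because both denominators are positive and the max is never smaller than the min, the magnitude of Kuhns V never exceeds that of Kuhns VI for the same inputs.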
Initialize KuhnsVI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kuhns VI correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns VI correlation
- Return type:
float
Examples
>>> cmp = KuhnsVI()
>>> cmp.corr('cat', 'hat')
0.497435897435897
>>> cmp.corr('Niall', 'Neil')
0.394865211810013
>>> cmp.corr('aluminum', 'Catalan')
0.11470398970399
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237484
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kuhns VI similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns VI similarity
- Return type:
float
Examples
>>> cmp = KuhnsVI()
>>> cmp.sim('cat', 'hat')
0.7487179487179485
>>> cmp.sim('Niall', 'Neil')
0.6974326059050064
>>> cmp.sim('aluminum', 'Catalan')
0.557351994851995
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
- class abydos.distance.KuhnsVII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuhns VII correlation.
For two sets X and Y and a population N, Kuhns VII correlation [Kuh64], the excess of the angle between vectors over its independence value (G), is
\[corr_{KuhnsVII}(X, Y) = \frac{\delta(X, Y)}{\sqrt{|X|\cdot|Y|}}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuhnsVII} = \frac{\delta(a+b, a+c)}{\sqrt{(a+b)(a+c)}}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
New in version 0.4.0.
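A minimal sketch of this formula from confusion table counts follows. The helper name is hypothetical, and the counts for 'Niall' and 'Neil' (a=2, b=4, c=3, n=784) are assumptions inferred from the documented example values. The mapping from correlation to similarity used here, (1 + 2·corr)/3, is likewise an observation that reproduces the documented `sim` values, not a statement of how the library is implemented.

```python
from math import sqrt


def kuhns_vii(a: float, b: float, c: float, n: float) -> float:
    """Kuhns VII correlation from 2x2 table counts (hypothetical helper)."""
    x, y = a + b, a + c
    # a / sqrt(x * y) is the cosine of the two token vectors;
    # subtracting x * y / n in the numerator removes its independence value.
    return (a - x * y / n) / sqrt(x * y)


corr = kuhns_vii(a=2, b=4, c=3, n=784)   # assumed counts for 'Niall' vs 'Neil'
sim = (1 + 2 * corr) / 3                 # mapping inferred from documented values
print(corr, sim)
```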
Initialize KuhnsVII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kuhns VII correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns VII correlation
- Return type:
float
Examples
>>> cmp = KuhnsVII()
>>> cmp.corr('cat', 'hat')
0.49489795918367346
>>> cmp.corr('Niall', 'Neil')
0.3581621145590755
>>> cmp.corr('aluminum', 'Catalan')
0.10704185456178524
>>> cmp.corr('ATCG', 'TAGC')
-0.006377551020408163
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kuhns VII similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns VII similarity
- Return type:
float
Examples
>>> cmp = KuhnsVII()
>>> cmp.sim('cat', 'hat')
0.663265306122449
>>> cmp.sim('Niall', 'Neil')
0.572108076372717
>>> cmp.sim('aluminum', 'Catalan')
0.40469456970785683
>>> cmp.sim('ATCG', 'TAGC')
0.32908163265306123
New in version 0.4.0.
- class abydos.distance.KuhnsVIII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuhns VIII correlation.
For two sets X and Y and a population N, Kuhns VIII correlation [Kuh64], the excess of coefficient by the arithmetic mean over its independence value (E), is
\[corr_{KuhnsVIII}(X, Y) = \frac{\delta(X, Y)}{|X \cap Y|+\frac{1}{2}\cdot|X \triangle Y|}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuhnsVIII} = \frac{\delta(a+b, a+c)}{a+\frac{1}{2}(b+c)}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
Notes
The coefficient presented in [Eid14, Mor12] as Kuhns' "Coefficient of arithmetic means" is a significantly different coefficient, not evidenced in [Kuh64].
New in version 0.4.0.
Initialize KuhnsVIII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kuhns VIII correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns VIII correlation
- Return type:
float
Examples
>>> cmp = KuhnsVIII()
>>> cmp.corr('cat', 'hat')
0.49489795918367346
>>> cmp.corr('Niall', 'Neil')
0.35667903525046385
>>> cmp.corr('aluminum', 'Catalan')
0.10685650056200824
>>> cmp.corr('ATCG', 'TAGC')
-0.006377551020408163
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kuhns VIII similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns VIII similarity
- Return type:
float
Examples
>>> cmp = KuhnsVIII()
>>> cmp.sim('cat', 'hat')
0.663265306122449
>>> cmp.sim('Niall', 'Neil')
0.5711193568336426
>>> cmp.sim('aluminum', 'Catalan')
0.40457100037467214
>>> cmp.sim('ATCG', 'TAGC')
0.32908163265306123
New in version 0.4.0.
- class abydos.distance.KuhnsX(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuhns X correlation.
For two sets X and Y and a population N, Kuhns X correlation [Kuh64], the excess of Yule's Q over its independence value (Q), is
\[corr_{KuhnsX}(X, Y) = \frac{|N| \cdot \delta(X, Y)}{|X \cap Y| \cdot |(N \setminus X) \setminus Y| + |X \setminus Y| \cdot |Y \setminus X|}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuhnsX} = \frac{n \cdot \delta(a+b, a+c)}{ad+bc}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
New in version 0.4.0.
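A short worked simplification, added here for illustration and not quoted from [Kuh64], shows how this coefficient relates to Yule's Q. Using only a+b+c+d=n:

\[n \cdot \delta(a+b, a+c) = a(a+b+c+d) - (a+b)(a+c) = ad - bc\]

so that

\[corr_{KuhnsX} = \frac{ad-bc}{ad+bc}\]

When the two token sets share nothing (a=0) while b and c are non-zero, this evaluates to -bc/bc = -1, which agrees with the documented value for 'ATCG' and 'TAGC'.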
Initialize KuhnsX instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kuhns X correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns X correlation
- Return type:
float
Examples
>>> cmp = KuhnsX()
>>> cmp.corr('cat', 'hat')
0.994871794871795
>>> cmp.corr('Niall', 'Neil')
0.984635083226633
>>> cmp.corr('aluminum', 'Catalan')
0.864242424242424
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kuhns X similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns X similarity
- Return type:
float
Examples
>>> cmp = KuhnsX()
>>> cmp.sim('cat', 'hat')
0.9974358974358974
>>> cmp.sim('Niall', 'Neil')
0.9923175416133165
>>> cmp.sim('aluminum', 'Catalan')
0.932121212121212
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.KuhnsXI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuhns XI correlation.
For two sets X and Y and a population N, Kuhns XI correlation [Kuh64], the excess of Yule's Y over its independence value (Y), is
\[corr_{KuhnsXI}(X, Y) = \frac{|N| \cdot \delta(X, Y)}{(\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} + \sqrt{|X \setminus Y| \cdot |Y \setminus X|})^2}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{KuhnsXI} = \frac{n \cdot \delta(a+b, a+c)}{(\sqrt{ad}+\sqrt{bc})^2}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
New in version 0.4.0.
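As an illustrative simplification (not quoted from [Kuh64]), expanding the numerator with a+b+c+d=n and factoring it as a difference of squares gives

\[n \cdot \delta(a+b, a+c) = ad - bc = (\sqrt{ad}-\sqrt{bc})(\sqrt{ad}+\sqrt{bc})\]

so that, whenever the denominator is non-zero,

\[corr_{KuhnsXI} = \frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\]

which has the form of Yule's Y. With a=0 and non-zero b and c this is -1, in agreement with the documented value for 'ATCG' and 'TAGC'.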
Initialize KuhnsXI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Kuhns XI correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns XI correlation
- Return type:
float
Examples
>>> cmp = KuhnsXI()
>>> cmp.corr('cat', 'hat')
0.9034892632818761
>>> cmp.corr('Niall', 'Neil')
0.8382551144735259
>>> cmp.corr('aluminum', 'Catalan')
0.5749826820237787
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kuhns XI similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns XI similarity
- Return type:
float
Examples
>>> cmp = KuhnsXI()
>>> cmp.sim('cat', 'hat')
0.951744631640938
>>> cmp.sim('Niall', 'Neil')
0.919127557236763
>>> cmp.sim('aluminum', 'Catalan')
0.7874913410118893
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.KuhnsXII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kuhns XII similarity.
For two sets X and Y and a population N, Kuhns XII similarity [Kuh64], the excess of index of independence over its independence value (I), is
\[sim_{KuhnsXII}(X, Y) = \frac{|N| \cdot \delta(X, Y)}{|X| \cdot |Y|}\]
where
\[\delta(X, Y) = |X \cap Y| - \frac{|X| \cdot |Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{KuhnsXII} = \frac{n \cdot \delta(a+b, a+c)}{(a+b)(a+c)}\]
where
\[\delta(a+b, a+c) = a - \frac{(a+b)(a+c)}{n}\]
New in version 0.4.0.
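The sketch below evaluates the unnormalized score from confusion table counts. The helper name is hypothetical, and the counts and n = 784 are assumptions inferred from the documented example values for inputs without repeated bigrams. It does not attempt to reproduce the normalization applied by `sim`.

```python
def kuhns_xii_score(a: float, b: float, c: float, n: float) -> float:
    """Unnormalized Kuhns XII score from 2x2 table counts (hypothetical helper)."""
    x, y = a + b, a + c
    delta = a - x * y / n
    return n * delta / (x * y)      # algebraically n * a / (x * y) - 1


print(kuhns_xii_score(a=2, b=2, c=2, n=784))   # assumed counts for 'cat' vs 'hat'
print(kuhns_xii_score(a=0, b=5, c=5, n=784))   # assumed counts for 'ATCG' vs 'TAGC'
```

Since the expression equals n·a/((a+b)(a+c)) - 1, it is bounded below by -1 (reached when a=0) but grows with n, which is consistent with the wide range of `sim_score` values in the documented examples.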
Initialize KuhnsXII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Kuhns XII similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Kuhns XII similarity
- Return type:
float
Examples
>>> cmp = KuhnsXII()
>>> cmp.sim('cat', 'hat')
0.2493573264781491
>>> cmp.sim('Niall', 'Neil')
0.1323010752688172
>>> cmp.sim('aluminum', 'Catalan')
0.012877474353417137
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Kuhns XII similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kuhns XII similarity
- Return type:
float
Examples
>>> cmp = KuhnsXII()
>>> cmp.sim_score('cat', 'hat')
97.0
>>> cmp.sim_score('Niall', 'Neil')
51.266666666666666
>>> cmp.sim_score('aluminum', 'Catalan')
9.902777777777779
>>> cmp.sim_score('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- class abydos.distance.KulczynskiI(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kulczynski I similarity.
For two sets X and Y, Kulczynski I similarity [Kulczynski27] is
\[sim_{KulczynskiI}(X, Y) = \frac{|X \cap Y|}{|X \setminus Y| + |Y \setminus X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{KulczynskiI} = \frac{a}{b+c}\]
New in version 0.4.0.
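A minimal sketch of this ratio from confusion table counts follows. The helper name is hypothetical, the counts assume padded-bigram tokenization, and returning infinity when b+c=0 is a choice made for this sketch, not necessarily what abydos does for identical inputs.

```python
import math


def kulczynski_i(a: float, b: float, c: float) -> float:
    """Kulczynski I score: shared tokens over unshared tokens (hypothetical helper)."""
    unshared = b + c
    if unshared == 0:
        return math.inf        # sketch choice: identical token sets have no finite score
    return a / unshared


print(kulczynski_i(a=2, b=2, c=2))   # assumed counts for 'cat' vs 'hat'
print(kulczynski_i(a=1, b=8, c=7))   # assumed counts for 'aluminum' vs 'Catalan'
```

The ratio has no upper bound as b+c approaches zero, so it cannot be read directly as a similarity in [0, 1].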
Initialize KulczynskiI instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(*args: Any, **kwargs: Any) NoReturn [source]
Raise exception when called.
- Parameters:
*args -- Variable length argument list
**kwargs -- Arbitrary keyword arguments
- Raises:
NotImplementedError -- Method disabled for Kulczynski I similarity.
New in version 0.3.6.
- sim(*args: Any, **kwargs: Any) NoReturn [source]
Raise exception when called.
- Parameters:
*args -- Variable length argument list
**kwargs -- Arbitrary keyword arguments
- Raises:
NotImplementedError -- Method disabled for Kulczynski I similarity.
New in version 0.3.6.
- sim_score(src: str, tar: str) float [source]
Return the Kulczynski I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kulczynski I similarity
- Return type:
float
Examples
>>> cmp = KulczynskiI()
>>> cmp.sim_score('cat', 'hat')
0.5
>>> cmp.sim_score('Niall', 'Neil')
0.2857142857142857
>>> cmp.sim_score('aluminum', 'Catalan')
0.06666666666666667
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.KulczynskiII(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Kulczynski II similarity.
For two sets X and Y, Kulczynski II similarity [Kulczynski27] or Driver & Kroeber similarity [DK32] is
\[sim_{KulczynskiII}(X, Y) = \frac{1}{2} \Bigg(\frac{|X \cap Y|}{|X|} + \frac{|X \cap Y|}{|Y|}\Bigg)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{KulczynskiII} = \frac{1}{2}\Bigg(\frac{a}{a+b}+\frac{a}{a+c}\Bigg)\]
New in version 0.4.0.
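The formula is the arithmetic mean of the two fractions a/(a+b) and a/(a+c), that is, the share of each set that falls in the intersection. A minimal sketch from confusion table counts follows. The helper name is hypothetical, the counts assume padded-bigram tokenization, and returning 0.0 for an empty set is a choice of this sketch.

```python
def kulczynski_ii(a: float, b: float, c: float) -> float:
    """Kulczynski II similarity from 2x2 table counts (hypothetical helper)."""
    x, y = a + b, a + c
    if x == 0 or y == 0:
        return 0.0                      # sketch choice for empty inputs
    return 0.5 * (a / x + a / y)        # mean of the two overlap fractions


print(kulczynski_ii(a=2, b=4, c=3))     # assumed counts for 'Niall' vs 'Neil'
```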
Initialize KulczynskiII instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Kulczynski II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Kulczynski II similarity
- Return type:
float
Examples
>>> cmp = KulczynskiII()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3666666666666667
>>> cmp.sim('aluminum', 'Catalan')
0.11805555555555555
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.LCPrefix(**kwargs: Any)[source]
Bases:
_Distance
Longest common prefix.
New in version 0.4.0.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist_abs(src: str, tar: str, *args: str) int [source]
Return the length of the longest common prefix of the strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
*args (strs) -- Additional strings for comparison
- Raises:
ValueError -- All arguments must be of type str
- Returns:
The length of the longest common prefix
- Return type:
int
Examples
>>> pfx = LCPrefix()
>>> pfx.dist_abs('cat', 'hat')
0
>>> pfx.dist_abs('Niall', 'Neil')
1
>>> pfx.dist_abs('aluminum', 'Catalan')
0
>>> pfx.dist_abs('ATCG', 'TAGC')
0
New in version 0.4.0.
- lcprefix(strings: List[str]) str [source]
Return the longest common prefix of a list of strings.
Longest common prefix (LCPrefix).
- Parameters:
strings (list of strings) -- Strings for comparison
- Returns:
The longest common prefix
- Return type:
str
Examples
>>> pfx = LCPrefix()
>>> pfx.lcprefix(['cat', 'hat'])
''
>>> pfx.lcprefix(['Niall', 'Neil'])
'N'
>>> pfx.lcprefix(['aluminum', 'Catalan'])
''
>>> pfx.lcprefix(['ATCG', 'TAGC'])
''
New in version 0.4.0.
- sim(src: str, tar: str, *args: str) float [source]
Return the longest common prefix similarity of two or more strings.
Longest common prefix similarity (\(sim_{LCPrefix}\)).
This employs the LCPrefix function to derive a similarity metric: \(sim_{LCPrefix}(s,t) = \frac{|LCPrefix(s,t)|}{max(|s|, |t|)}\)
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
*args (strs) -- Additional strings for comparison
- Returns:
LCPrefix similarity
- Return type:
float
Examples
>>> pfx = LCPrefix()
>>> pfx.sim('cat', 'hat')
0.0
>>> pfx.sim('Niall', 'Neil')
0.2
>>> pfx.sim('aluminum', 'Catalan')
0.0
>>> pfx.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
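The prefix computation and the similarity derived from it can be sketched in a few lines of plain Python. This is an illustration, not the library's code: the function names are hypothetical, extending the max in the denominator to all supplied strings is an assumption, and returning 1.0 for all-empty inputs is a choice of this sketch.

```python
from itertools import takewhile


def lcprefix(strings: list) -> str:
    """Longest common prefix of a list of strings (hypothetical helper)."""
    if not strings:
        return ""
    # zip(*strings) yields one tuple of characters per position;
    # keep positions only while every string has the same character there.
    same = takewhile(lambda chars: len(set(chars)) == 1, zip(*strings))
    return "".join(chars[0] for chars in same)


def lcprefix_sim(src: str, tar: str, *args: str) -> float:
    """Prefix length divided by the longest input length (sketch)."""
    strings = [src, tar, *args]
    longest = max(len(s) for s in strings)
    if longest == 0:
        return 1.0                      # sketch choice: all inputs empty
    return len(lcprefix(strings)) / longest


print(lcprefix(["Niall", "Neil"]), lcprefix_sim("Niall", "Neil"))
```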
- class abydos.distance.LCSseq(normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, **kwargs: ~typing.Any)[source]
Bases:
_Distance
Longest common subsequence.
Longest common subsequence (LCSseq) is the longest subsequence of characters that two strings have in common.
New in version 0.3.6.
Initialize LCSseq.
- Parameters:
normalizer (function) -- A normalization function for the normalized similarity & distance. By default, the max of the lengths of the input strings. If lambda x: sum(x)/2.0 is supplied, the normalization proposed in [RTS+01] is used, i.e. \(\frac{2 \cdot |LCS(src, tar)|}{|src| + |tar|}\).
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- lcsseq(src: str, tar: str) str [source]
Return the longest common subsequence of two strings.
Based on the dynamic programming algorithm from http://rosettacode.org/wiki/Longest_common_subsequence [Cod18a]. This is licensed GFDL 1.2.
- Modifications include:
conversion to a numpy array in place of a list of lists
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The longest common subsequence
- Return type:
str
Examples
>>> sseq = LCSseq()
>>> sseq.lcsseq('cat', 'hat')
'at'
>>> sseq.lcsseq('Niall', 'Neil')
'Nil'
>>> sseq.lcsseq('aluminum', 'Catalan')
'aln'
>>> sseq.lcsseq('ATCG', 'TAGC')
'AC'
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- sim(src: str, tar: str) float [source]
Return the longest common subsequence similarity of two strings.
Longest common subsequence similarity (\(sim_{LCSseq}\)).
This employs the LCSseq function to derive a similarity metric: \(sim_{LCSseq}(s,t) = \frac{|LCSseq(s,t)|}{max(|s|, |t|)}\)
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
LCSseq similarity
- Return type:
float
Examples
>>> sseq = LCSseq()
>>> sseq.sim('cat', 'hat')
0.6666666666666666
>>> sseq.sim('Niall', 'Neil')
0.6
>>> sseq.sim('aluminum', 'Catalan')
0.375
>>> sseq.sim('ATCG', 'TAGC')
0.5
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.4.0: Added normalization option
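The dynamic programming approach and the role of the normalizer can be sketched as follows. This is a plain-Python illustration using a list of lists where the documentation mentions a numpy array; the function names are hypothetical, and the handling of equal or empty inputs is a choice of this sketch. When several subsequences share the maximal length, which one is returned depends on the backtracking order.

```python
def lcsseq(src: str, tar: str) -> str:
    """Longest common subsequence via dynamic programming (sketch)."""
    rows, cols = len(src), len(tar)
    # lengths[i][j] = LCS length of src[:i] and tar[:j]
    lengths = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i, s in enumerate(src):
        for j, t in enumerate(tar):
            if s == t:
                lengths[i + 1][j + 1] = lengths[i][j] + 1
            else:
                lengths[i + 1][j + 1] = max(lengths[i + 1][j], lengths[i][j + 1])
    # Walk back from the bottom-right corner to recover one subsequence.
    out, i, j = [], rows, cols
    while i and j:
        if lengths[i][j] == lengths[i - 1][j]:
            i -= 1
        elif lengths[i][j] == lengths[i][j - 1]:
            j -= 1
        else:
            out.append(src[i - 1])
            i, j = i - 1, j - 1
    return "".join(reversed(out))


def lcsseq_sim(src: str, tar: str, normalizer=max) -> float:
    """LCS length divided by normalizer([len(src), len(tar)]) (sketch)."""
    if src == tar:
        return 1.0
    if not src or not tar:
        return 0.0
    return len(lcsseq(src, tar)) / normalizer([len(src), len(tar)])


print(lcsseq("Niall", "Neil"), lcsseq_sim("Niall", "Neil"))
print(lcsseq_sim("Niall", "Neil", normalizer=lambda x: sum(x) / 2.0))
```

With the default `max` normalizer the second call divides by 5; with the mean-length normalizer it divides by 4.5, matching the \(\frac{2 \cdot |LCS|}{|src| + |tar|}\) form.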
- class abydos.distance.LCSstr(normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, **kwargs: ~typing.Any)[source]
Bases:
_Distance
Longest common substring.
New in version 0.3.6.
Initialize LCSstr.
- Parameters:
normalizer (function) -- A normalization function for the normalized similarity & distance. By default, the max of the lengths of the input strings. If lambda x: sum(x)/2.0 is supplied, the normalization proposed in [RTS+01] is used, i.e. \(\frac{2 \cdot |LCS(src, tar)|}{|src| + |tar|}\).
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- lcsstr(src: str, tar: str) str [source]
Return the longest common substring of two strings.
Longest common substring (LCSstr).
Based on the code from https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring [Wik18]. This is licensed Creative Commons: Attribution-ShareAlike 3.0.
Modifications include:
conversion to a numpy array in place of a list of lists
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The longest common substring
- Return type:
str
Examples
>>> sstr = LCSstr()
>>> sstr.lcsstr('cat', 'hat')
'at'
>>> sstr.lcsstr('Niall', 'Neil')
'N'
>>> sstr.lcsstr('aluminum', 'Catalan')
'al'
>>> sstr.lcsstr('ATCG', 'TAGC')
'A'
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- sim(src: str, tar: str) float [source]
Return the longest common substring similarity of two strings.
Longest common substring similarity (\(sim_{LCSstr}\)).
This employs the LCSstr function to derive a similarity metric: \(sim_{LCSstr}(s,t) = \frac{|LCSstr(s,t)|}{max(|s|, |t|)}\)
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
LCSstr similarity
- Return type:
float
Examples
>>> sstr = LCSstr()
>>> sstr.sim('cat', 'hat')
0.6666666666666666
>>> sstr.sim('Niall', 'Neil')
0.2
>>> sstr.sim('aluminum', 'Catalan')
0.25
>>> sstr.sim('ATCG', 'TAGC')
0.25
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.4.0: Added normalization option
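The similarity formula above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation (which is based on the Wikibooks code and uses a numpy array); the function names lcsstr_sketch and lcsstr_sim_sketch are hypothetical, and the handling of identical or empty inputs is an assumption.

```python
from typing import Callable, List


def lcsstr_sketch(src: str, tar: str) -> str:
    """Return one longest common substring of src and tar."""
    # prev[j] is the length of the common suffix of src[:i-1] and tar[:j].
    prev = [0] * (len(tar) + 1)
    best_len, best_end = 0, 0
    for i in range(1, len(src) + 1):
        curr = [0] * (len(tar) + 1)
        for j in range(1, len(tar) + 1):
            if src[i - 1] == tar[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return src[best_end - best_len:best_end]


def lcsstr_sim_sketch(
    src: str,
    tar: str,
    normalizer: Callable[[List[float]], float] = max,
) -> float:
    """Divide the substring length by a normalizer of the two lengths."""
    if src == tar:
        return 1.0  # assumption: identical strings are fully similar
    if not src or not tar:
        return 0.0  # assumption: nothing in common with an empty string
    return len(lcsstr_sketch(src, tar)) / normalizer([len(src), len(tar)])
```

When several substrings tie for the longest length, this sketch returns the first one found while scanning src, which may differ from the substring the library returns. Passing normalizer=lambda x: sum(x) / 2.0 selects the alternative normalization described in the parameter list.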
- class abydos.distance.LCSuffix(**kwargs: Any)[source]
Bases:
LCPrefix
Longest common suffix.
New in version 0.4.0.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist_abs(src: str, tar: str, *args: str) int [source]
Return the length of the longest common suffix of the strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
*args (strs) -- Additional strings for comparison
- Raises:
ValueError -- All arguments must be of type str
- Returns:
The length of the longest common suffix
- Return type:
int
Examples
>>> sfx = LCSuffix()
>>> sfx.dist_abs('cat', 'hat')
2
>>> sfx.dist_abs('Niall', 'Neil')
1
>>> sfx.dist_abs('aluminum', 'Catalan')
0
>>> sfx.dist_abs('ATCG', 'TAGC')
0
New in version 0.4.0.
- lcsuffix(strings: List[str]) str [source]
Return the longest common suffix of a list of strings.
Longest common suffix (LCSuffix).
- Parameters:
strings (list of strings) -- Strings for comparison
- Returns:
The longest common suffix
- Return type:
str
Examples
>>> sfx = LCSuffix()
>>> sfx.lcsuffix(['cat', 'hat'])
'at'
>>> sfx.lcsuffix(['Niall', 'Neil'])
'l'
>>> sfx.lcsuffix(['aluminum', 'Catalan'])
''
>>> sfx.lcsuffix(['ATCG', 'TAGC'])
''
New in version 0.4.0.
- sim(src: str, tar: str, *args: str) float [source]
Return the longest common suffix similarity of two or more strings.
Longest common suffix similarity (\(sim_{LCSuffix}\)).
This employs the LCSuffix function to derive a similarity metric: \(sim_{LCSuffix}(s,t) = \frac{|LCSuffix(s,t)|}{max(|s|, |t|)}\)
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
*args (strs) -- Additional strings for comparison
- Returns:
LCSuffix similarity
- Return type:
float
Examples
>>> sfx = LCSuffix()
>>> sfx.sim('cat', 'hat')
0.6666666666666666
>>> sfx.sim('Niall', 'Neil')
0.2
>>> sfx.sim('aluminum', 'Catalan')
0.0
>>> sfx.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
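As a rough illustration of the suffix computation, a common suffix can be found by reversing the strings and taking their common prefix. The sketch below uses the standard library's os.path.commonprefix; the function names are hypothetical, and returning 1.0 when every input is empty is an assumption rather than documented behaviour.

```python
import os.path
from typing import List


def lcsuffix_sketch(strings: List[str]) -> str:
    """Return the longest common suffix of a list of strings."""
    # A common suffix is the common prefix of the reversed strings.
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


def lcsuffix_sim_sketch(src: str, tar: str, *args: str) -> float:
    """Divide the suffix length by the length of the longest input."""
    strings = [src, tar, *args]
    longest = max(len(s) for s in strings)
    if longest == 0:
        return 1.0  # assumption: all-empty inputs count as identical
    return len(lcsuffix_sketch(strings)) / longest
```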
- class abydos.distance.LIG3(**kwargs: Any)[source]
Bases:
_Distance
LIG3 similarity.
[SD02] proposes three Levenshtein-ISG-Guth hybrid similarity measures: LIG1, LIG2, and LIG3. Of these, LIG1 is identical to ISG and LIG2 is identical to normalized Levenshtein similarity. Only LIG3 is a novel measure, defined as:
\[sim_{LIG3}(X, Y) = \frac{2I}{2I+C}\]
Here, I is the number of exact matches between the two words, truncated to the length of the shorter word, and C is the Levenshtein distance between the two words.
New in version 0.4.1.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the LIG3 similarity of two words.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The LIG3 similarity
- Return type:
float
Examples
>>> cmp = LIG3()
>>> cmp.sim('cat', 'hat')
0.8
>>> cmp.sim('Niall', 'Neil')
0.5714285714285714
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.1.
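The LIG3 definition can be sketched directly from the formula: count position-wise exact matches for I and use a unit-cost Levenshtein distance for C. The helper names below are hypothetical and this is only an illustration of the formula, not the library's code.

```python
def levenshtein_sketch(src: str, tar: str) -> int:
    """Unit-cost Levenshtein distance, computed one row at a time."""
    prev = list(range(len(tar) + 1))
    for i, s_ch in enumerate(src, start=1):
        curr = [i]
        for j, t_ch in enumerate(tar, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # delete from src
                    curr[j - 1] + 1,  # insert into src
                    prev[j - 1] + (s_ch != t_ch),  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def lig3_sim_sketch(src: str, tar: str) -> float:
    """Return 2I / (2I + C) for the two words."""
    if src == tar:
        return 1.0  # also avoids 0/0 when both words are empty
    # zip() stops at the shorter word, which gives the truncation.
    exact = sum(a == b for a, b in zip(src, tar))
    cost = levenshtein_sketch(src, tar)
    return 2 * exact / (2 * exact + cost)
```

For 'cat' and 'hat', I = 2 and C = 1, giving 4 / 5 = 0.8, which agrees with the documented example.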
- class abydos.distance.Length(**kwargs: Any)[source]
Bases:
_Distance
Length similarity and distance.
New in version 0.3.6.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the length similarity of two strings.
Length similarity is the ratio of the length of the shorter string to the longer.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Length similarity
- Return type:
float
Examples
>>> cmp = Length()
>>> cmp.sim('cat', 'hat')
1.0
>>> cmp.sim('Niall', 'Neil')
0.8
>>> cmp.sim('aluminum', 'Catalan')
0.875
>>> cmp.sim('ATCG', 'TAGC')
1.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.Levenshtein(mode: str = 'lev', cost: ~typing.Tuple[float, float, float, float] = (1, 1, 1, 1), normalizer: ~typing.Callable[[~typing.List[float]], float] = <built-in function max>, taper: bool = False, **kwargs: ~typing.Any)[source]
Bases:
_Distance
Levenshtein distance.
This is the standard edit distance measure. Cf. [Lev65, Lev66].
Optimal string alignment (aka restricted Damerau-Levenshtein distance) [Boy11] is also supported.
The ordinary Levenshtein & Optimal String Alignment distance both employ the Wagner-Fischer dynamic programming algorithm [WF74].
Levenshtein edit distance ordinarily has unit insertion, deletion, and substitution costs.
New in version 0.3.6.
Changed in version 0.4.0: Added taper option
Initialize Levenshtein instance.
- Parameters:
mode (str) --
Specifies a mode for computing the Levenshtein distance:
lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
taper (bool) -- Enables cost tapering. Following [ZD96], it causes edits at the start of the string to "just [exceed] twice the minimum penalty for replacement or deletion at the end of the string".
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- alignment(src: str, tar: str) Tuple[float, str, str] [source]
Return the Levenshtein alignment of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
A tuple containing the Levenshtein distance and the two strings, aligned.
- Return type:
tuple
Examples
>>> cmp = Levenshtein()
>>> cmp.alignment('cat', 'hat')
(1.0, 'cat', 'hat')
>>> cmp.alignment('Niall', 'Neil')
(3.0, 'N-iall', 'Nei-l-')
>>> cmp.alignment('aluminum', 'Catalan')
(7.0, '-aluminum', 'Catalan--')
>>> cmp.alignment('ATCG', 'TAGC')
(3.0, 'ATCG-', '-TAGC')

>>> cmp = Levenshtein(mode='osa')
>>> cmp.alignment('ATCG', 'TAGC')
(2.0, 'ATCG', 'TAGC')
>>> cmp.alignment('ACTG', 'TAGC')
(4.0, 'ACT-G-', '--TAGC')
New in version 0.4.1.
- dist(src: str, tar: str) float [source]
Return the normalized Levenshtein distance between two strings.
The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated by either of the two supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The normalized Levenshtein distance between src & tar
- Return type:
float
Examples
>>> cmp = Levenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.75
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str) float [source]
Return the Levenshtein distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Levenshtein distance between src & tar
- Return type:
int (may return a float if cost has float values)
Examples
>>> cmp = Levenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
3

>>> cmp = Levenshtein(mode='osa')
>>> cmp.dist_abs('ATCG', 'TAGC')
2
>>> cmp.dist_abs('ACTG', 'TAGC')
4
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
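The two modes and the normalization described above can be illustrated with a compact Wagner-Fischer sketch. The function names are hypothetical, cost tapering is omitted, and the normalizer is fixed to max, so this is a simplified illustration under those assumptions rather than the class's actual code.

```python
from typing import Tuple

Cost = Tuple[float, float, float, float]


def edit_distance_sketch(
    src: str, tar: str, mode: str = 'lev', cost: Cost = (1, 1, 1, 1)
) -> float:
    """Wagner-Fischer table fill for 'lev' or 'osa' mode."""
    ins, dele, sub, trans = cost
    rows, cols = len(src) + 1, len(tar) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * dele  # delete every character of src[:i]
    for j in range(1, cols):
        d[0][j] = j * ins  # insert every character of tar[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = src[i - 1] == tar[j - 1]
            d[i][j] = min(
                d[i - 1][j] + dele,
                d[i][j - 1] + ins,
                d[i - 1][j - 1] + (0 if same else sub),
            )
            # OSA additionally allows swapping two adjacent characters.
            if (
                mode == 'osa'
                and i > 1
                and j > 1
                and src[i - 1] == tar[j - 2]
                and src[i - 2] == tar[j - 1]
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + trans)
    return d[-1][-1]


def normalized_distance_sketch(
    src: str, tar: str, mode: str = 'lev', cost: Cost = (1, 1, 1, 1)
) -> float:
    """Divide by the larger of the delete-all and insert-all costs."""
    if src == tar:
        return 0.0
    ins, dele = cost[0], cost[1]
    worst = max(len(src) * dele, len(tar) * ins)
    return edit_distance_sketch(src, tar, mode, cost) / worst
```

With unit costs, 'ATCG' and 'TAGC' are 3 edits apart in lev mode but only 2 in osa mode, because 'AT' to 'TA' and 'CG' to 'GC' each count as a single transposition.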
- class abydos.distance.Lorentzian(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Lorentzian distance.
For two multisets X and Y drawn from an alphabet S, Lorentzian distance is
\[dist_{Lorentzian}(X, Y) = \sum_{i \in S} \log(1 + |X_i - Y_i|)\]
Notes
No primary source for this measure could be located, but it is included in surveys and catalogues, such as [DD16] and [Cha08].
New in version 0.4.0.
Initialize Lorentzian instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Lorentzian distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Lorentzian distance
- Return type:
float
Examples
>>> cmp = Lorentzian()
>>> cmp.dist('cat', 'hat')
0.6666666666666667
>>> cmp.dist('Niall', 'Neil')
0.7777777777777778
>>> cmp.dist('aluminum', 'Catalan')
0.9358355851062377
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Lorentzian distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Lorentzian distance
- Return type:
float
Examples
>>> cmp = Lorentzian()
>>> cmp.dist_abs('cat', 'hat')
2.772588722239781
>>> cmp.dist_abs('Niall', 'Neil')
4.852030263919617
>>> cmp.dist_abs('aluminum', 'Catalan')
10.1095256359474
>>> cmp.dist_abs('ATCG', 'TAGC')
6.931471805599453
New in version 0.4.0.
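The unnormalized Lorentzian sum can be sketched over q-gram multisets. The sketch assumes bigrams padded with '$' at the start and '#' at the end; that tokenization is an assumption, chosen because it reproduces the documented values, and the helper names are hypothetical.

```python
import math
from collections import Counter


def qgrams_sketch(word: str, q: int = 2) -> Counter:
    """Count padded q-grams; the '$'/'#' padding is an assumption."""
    padded = '$' * (q - 1) + word + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def lorentzian_dist_abs_sketch(src: str, tar: str) -> float:
    """Sum log(1 + |count difference|) over every token seen."""
    x, y = qgrams_sketch(src), qgrams_sketch(tar)
    # Counter returns 0 for tokens absent from one of the multisets.
    return sum(math.log1p(abs(x[tok] - y[tok])) for tok in x.keys() | y.keys())
```

For 'cat' and 'hat' the multisets differ in four tokens ('$c', 'ca', '$h', 'ha'), each by one, so the result is 4 log 2, approximately 2.7726.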
- class abydos.distance.MASI(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
MASI similarity.
Measuring Agreement on Set-valued Items (MASI) similarity [Pas06] for two sets X and Y is based on Jaccard similarity:
\[sim_{Jaccard}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}\]
This Jaccard similarity is scaled by a value M, which is:
1 if \(X = Y\)
\(\frac{2}{3}\) if \(X \subset Y\) or \(Y \subset X\)
\(\frac{1}{3}\) if \(X \cap Y \neq \emptyset\), \(X \setminus Y \neq \emptyset\), and \(Y \setminus X \neq \emptyset\)
0 if \(X \cap Y = \emptyset\)
New in version 0.4.0.
Initialize MASI instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the MASI similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
MASI similarity
- Return type:
float
Examples
>>> cmp = MASI()
>>> cmp.sim('cat', 'hat')
0.1111111111111111
>>> cmp.sim('Niall', 'Neil')
0.07407407407407407
>>> cmp.sim('aluminum', 'Catalan')
0.020833333333333332
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
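The scaling rule for M can be sketched as follows. The sketch assumes multisets of bigrams padded with '$' and '#' (an assumption that reproduces the documented values) and covers only the crisp intersection type; the helper names are hypothetical.

```python
from collections import Counter


def qgrams_sketch(word: str, q: int = 2) -> Counter:
    """Count padded q-grams; the '$'/'#' padding is an assumption."""
    padded = '$' * (q - 1) + word + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def masi_sim_sketch(src: str, tar: str) -> float:
    """Scale Jaccard similarity by the MASI monotonicity factor M."""
    x, y = qgrams_sketch(src), qgrams_sketch(tar)
    inter = sum((x & y).values())  # multiset intersection size
    union = sum((x | y).values())  # multiset union size
    if x == y:
        m = 1.0
    elif inter == 0:
        m = 0.0
    elif not (x - y) or not (y - x):
        m = 2 / 3  # one multiset is contained in the other
    else:
        m = 1 / 3  # overlapping, but each has tokens the other lacks
    return m * inter / union
```

For 'cat' and 'hat', the Jaccard similarity is 2 / 6 and M is 1 / 3, giving 1 / 9.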
- class abydos.distance.MLIPNS(threshold: float = 0.25, max_mismatches: int = 2, **kwargs: Any)[source]
Bases:
_Distance
MLIPNS similarity.
Modified Language-Independent Product Name Search (MLIPNS) is described in [SA10]. This function returns only 1.0 (similar) or 0.0 (not similar). LIPNS similarity is identical to normalized Hamming similarity.
New in version 0.3.6.
Initialize MLIPNS instance.
- Parameters:
threshold (float) -- A number [0, 1] indicating the maximum similarity score, below which the strings are considered 'similar' (0.25 by default)
max_mismatches (int) -- A number indicating the allowable number of mismatches to remove before declaring two strings not similar (2 by default)
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the MLIPNS similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
MLIPNS similarity
- Return type:
float
Examples
>>> cmp = MLIPNS()
>>> cmp.sim('cat', 'hat')
1.0
>>> cmp.sim('Niall', 'Neil')
0.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.MRA(**kwargs: Any)[source]
Bases:
_Distance
Match Rating Algorithm comparison rating.
The Western Airlines Surname Match Rating Algorithm comparison rating, as presented on page 18 of [MKTM77].
New in version 0.3.6.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the MRA comparison rating of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
MRA comparison rating
- Return type:
int
Examples
>>> cmp = MRA()
>>> cmp.dist_abs('cat', 'hat')
5
>>> cmp.dist_abs('Niall', 'Neil')
6
>>> cmp.dist_abs('aluminum', 'Catalan')
0
>>> cmp.dist_abs('ATCG', 'TAGC')
5
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- sim(src: str, tar: str) float [source]
Return the normalized MRA similarity of two strings.
This is the MRA normalized to \([0, 1]\), given that MRA itself is constrained to the range \([0, 6]\).
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized MRA similarity
- Return type:
float
Examples
>>> cmp = MRA()
>>> cmp.sim('cat', 'hat')
0.8333333333333334
>>> cmp.sim('Niall', 'Neil')
1.0
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.8333333333333334
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.MSContingency(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Mean squared contingency correlation.
For two sets X and Y and a population N, the mean squared contingency correlation [Col49] is
\[corr_{MSContingency}(X, Y) = \frac{\sqrt{2}(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)} {\sqrt{(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)^2 + |X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}}\]
[Hubalek08] and [CCT10] identify this as Cole similarity. Although Cole discusses this correlation, he does not claim to have developed it. Rather, he presents his coefficient of interspecific association as being his own development: Cole.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{MSContingency} = \frac{\sqrt{2}(ad-bc)}{\sqrt{(ad-bc)^2+(a+b)(a+c)(b+d)(c+d)}}\]
New in version 0.4.0.
Initialize MSContingency instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the normalized mean squared contingency correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Mean squared contingency correlation
- Return type:
float
Examples
>>> cmp = MSContingency()
>>> cmp.corr('cat', 'hat')
0.6298568508557214
>>> cmp.corr('Niall', 'Neil')
0.4798371954796814
>>> cmp.corr('aluminum', 'Catalan')
0.15214891090821628
>>> cmp.corr('ATCG', 'TAGC')
-0.009076921903905553
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized mean squared contingency similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Mean squared contingency similarity
- Return type:
float
Examples
>>> cmp = MSContingency()
>>> cmp.sim('cat', 'hat')
0.8149284254278607
>>> cmp.sim('Niall', 'Neil')
0.7399185977398407
>>> cmp.sim('aluminum', 'Catalan')
0.5760744554541082
>>> cmp.sim('ATCG', 'TAGC')
0.49546153904804724
New in version 0.4.0.
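The 2x2 confusion table form of the correlation can be sketched from q-gram counts. Two assumptions are made here and are not taken from the library's code: tokens are bigrams padded with '$' and '#', and the population size is n = 784 (28 squared). Both were chosen because they reproduce the documented values; the actual alphabet handling is described under _TokenDistance. The helper names are hypothetical.

```python
import math
from collections import Counter


def qgrams_sketch(word: str, q: int = 2) -> Counter:
    """Count padded q-grams; the '$'/'#' padding is an assumption."""
    padded = '$' * (q - 1) + word + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def ms_contingency_corr_sketch(src: str, tar: str, population: int = 784) -> float:
    """Evaluate sqrt(2)(ad-bc) / sqrt((ad-bc)^2 + (a+b)(a+c)(b+d)(c+d))."""
    x, y = qgrams_sketch(src), qgrams_sketch(tar)
    a = sum((x & y).values())  # tokens in both
    b = sum((x - y).values())  # tokens only in src
    c = sum((y - x).values())  # tokens only in tar
    d = population - a - b - c  # tokens in neither (assumed population)
    num = a * d - b * c
    denom = math.sqrt(num ** 2 + (a + b) * (a + c) * (b + d) * (c + d))
    return math.sqrt(2) * num / denom if denom else 0.0


def ms_contingency_sim_sketch(src: str, tar: str, population: int = 784) -> float:
    """Map the correlation from [-1, 1] onto [0, 1]."""
    return (1.0 + ms_contingency_corr_sketch(src, tar, population)) / 2.0
```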
- class abydos.distance.Maarel(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Maarel correlation.
For two sets X and Y and a population N, Maarel correlation [vandMaarel69] is
\[corr_{Maarel}(X, Y) = \frac{2|X \cap Y| - |X \setminus Y| - |Y \setminus X|}{|X| + |Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Maarel} = \frac{2a - b - c}{2a + b + c}\]
New in version 0.4.0.
Initialize Maarel instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Maarel correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Maarel correlation
- Return type:
float
Examples
>>> cmp = Maarel()
>>> cmp.corr('cat', 'hat')
0.0
>>> cmp.corr('Niall', 'Neil')
-0.2727272727272727
>>> cmp.corr('aluminum', 'Catalan')
-0.7647058823529411
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Maarel similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Maarel similarity
- Return type:
float
Examples
>>> cmp = Maarel()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.36363636363636365
>>> cmp.sim('aluminum', 'Catalan')
0.11764705882352944
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
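Because the Maarel formula needs only a, b, and c, it can be sketched without any population size. The sketch assumes multisets of bigrams padded with '$' and '#' (an assumption that reproduces the documented values) and maps correlation to similarity as (1 + corr) / 2, which is consistent with the examples above; the helper names are hypothetical.

```python
from collections import Counter


def qgrams_sketch(word: str, q: int = 2) -> Counter:
    """Count padded q-grams; the '$'/'#' padding is an assumption."""
    padded = '$' * (q - 1) + word + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def maarel_corr_sketch(src: str, tar: str) -> float:
    """Evaluate (2a - b - c) / (2a + b + c) on q-gram multisets."""
    x, y = qgrams_sketch(src), qgrams_sketch(tar)
    a = sum((x & y).values())  # tokens in both
    b = sum((x - y).values())  # tokens only in src
    c = sum((y - x).values())  # tokens only in tar
    # Padding guarantees at least one token per word, so 2a+b+c > 0.
    return (2 * a - b - c) / (2 * a + b + c)


def maarel_sim_sketch(src: str, tar: str) -> float:
    """Map the correlation from [-1, 1] onto [0, 1]."""
    return (1.0 + maarel_corr_sketch(src, tar)) / 2.0
```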
- class abydos.distance.Manhattan(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = 0, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
Minkowski
Manhattan distance.
Manhattan distance is the city-block or taxi-cab distance, equivalent to Minkowski distance in \(L^1\)-space.
New in version 0.3.6.
Initialize Manhattan instance.
- Parameters:
alphabet (collection or int) -- The values or size of the alphabet
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Manhattan distance between two strings.
The normalized Manhattan distance is a distance metric in \(L^1\)-space, normalized to [0, 1].
This is identical to Canberra distance.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
The normalized Manhattan distance
- Return type:
float
Examples
>>> cmp = Manhattan()
>>> cmp.dist('cat', 'hat')
0.5
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.636363636364
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.692307692308
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str, normalized: bool = False) float [source]
Return the Manhattan distance between two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
normalized (bool) -- Normalizes to [0, 1] if True
- Returns:
The Manhattan distance
- Return type:
float
Examples
>>> cmp = Manhattan()
>>> cmp.dist_abs('cat', 'hat')
4.0
>>> cmp.dist_abs('Niall', 'Neil')
7.0
>>> cmp.dist_abs('Colin', 'Cuilen')
9.0
>>> cmp.dist_abs('ATCG', 'TAGC')
10.0
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
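The Manhattan examples can be reproduced with a short sketch over q-gram multisets. It assumes bigrams padded with '$' and '#', and it normalizes by the total number of tokens in both strings; both choices are inferred from the documented values rather than taken from the library's code, and the helper names are hypothetical.

```python
from collections import Counter


def qgrams_sketch(word: str, q: int = 2) -> Counter:
    """Count padded q-grams; the '$'/'#' padding is an assumption."""
    padded = '$' * (q - 1) + word + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def manhattan_dist_abs_sketch(src: str, tar: str, normalized: bool = False) -> float:
    """Sum the absolute count differences over every token seen."""
    x, y = qgrams_sketch(src), qgrams_sketch(tar)
    total = sum(abs(x[tok] - y[tok]) for tok in x.keys() | y.keys())
    if normalized:
        # Assumed normalizer: total token count of both multisets.
        return total / (sum(x.values()) + sum(y.values()))
    return float(total)
```

For 'cat' and 'hat', four of the eight tokens differ, so the absolute distance is 4.0 and the normalized distance is 4 / 8 = 0.5.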
- class abydos.distance.Marking(**kwargs: Any)[source]
Bases:
_Distance
Ehrenfeucht & Haussler's marking distance.
This edit distance [EH88] is the number of marked characters in one word that must be masked in order for that word to consist entirely of substrings of another word.
It is normalized by the length of the first word.
New in version 0.4.0.
Initialize Marking instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized marking distance of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
marking distance
- Return type:
float
Examples
>>> cmp = Marking()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.5
>>> cmp.dist('cbaabdcb', 'abcba')
0.25
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the marking distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
marking distance
- Return type:
int
Examples
>>> cmp = Marking()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
5
>>> cmp.dist_abs('ATCG', 'TAGC')
2
>>> cmp.dist_abs('cbaabdcb', 'abcba')
2
New in version 0.4.0.
- class abydos.distance.MarkingMetric(**kwargs: Any)[source]
Bases:
Marking
Ehrenfeucht & Haussler's marking metric.
This metric [EH88] is the base 2 logarithm of the product of the marking distances between each term plus 1 computed in both orders. For strings x and y, this is:
\[dist_{MarkingMetric}(x, y) = \log_2((diff(x, y)+1)(diff(y, x)+1))\]
The function diff is Ehrenfeucht & Haussler's marking distance Marking.
New in version 0.4.0.
Initialize MarkingMetric instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized marking distance of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
marking distance
- Return type:
float
Examples
>>> cmp = Marking()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.625
>>> cmp.dist('ATCG', 'TAGC')
0.5
>>> cmp.dist('cbaabdcb', 'abcba')
0.25
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the marking distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
marking distance
- Return type:
float
Examples
>>> cmp = MarkingMetric()
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> cmp.dist_abs('Niall', 'Neil')
3.584962500721156
>>> cmp.dist_abs('aluminum', 'Catalan')
4.584962500721156
>>> cmp.dist_abs('ATCG', 'TAGC')
3.169925001442312
>>> cmp.dist_abs('cbaabdcb', 'abcba')
2.584962500721156
New in version 0.4.0.
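The marking metric formula can be checked by hand once the two directed marking distances are known. The sketch below takes those distances as plain integers instead of computing them; the function name is hypothetical.

```python
import math


def marking_metric_sketch(diff_xy: int, diff_yx: int) -> float:
    """Return log2((diff(x, y) + 1) * (diff(y, x) + 1))."""
    return math.log2((diff_xy + 1) * (diff_yx + 1))
```

For 'cat' and 'hat', the documented marking distance is 1, and by symmetry of this pair it is 1 in the other direction too, so the metric is log2(2 * 2) = 2.0. For 'Niall' and 'Neil', the documented values 3 (marking distance) and 3.584962500721156 (marking metric, equal to log2(12)) imply a reverse-direction marking distance of 2.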
- class abydos.distance.Matusita(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Matusita distance.
For two multisets X and Y drawn from an alphabet S, Matusita distance [Mat55] is
\[dist_{Matusita}(X, Y) = \sqrt{\sum_{i \in S} \Bigg(\sqrt{\frac{|X_i|}{|X|}} - \sqrt{\frac{|Y_i|}{|Y|}}\Bigg)^2}\]
New in version 0.4.0.
Initialize Matusita instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Matusita distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Matusita distance
- Return type:
float
Examples
>>> cmp = Matusita()
>>> cmp.dist('cat', 'hat')
0.707106781186547
>>> cmp.dist('Niall', 'Neil')
0.796775770420944
>>> cmp.dist('aluminum', 'Catalan')
0.939227805062351
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Matusita distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Matusita distance
- Return type:
float
Examples
>>> cmp = Matusita()
>>> cmp.dist_abs('cat', 'hat')
1.0
>>> cmp.dist_abs('Niall', 'Neil')
1.126811100699571
>>> cmp.dist_abs('aluminum', 'Catalan')
1.3282687000770907
>>> cmp.dist_abs('ATCG', 'TAGC')
1.414213562373095
New in version 0.4.0.
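The unnormalized Matusita distance compares the square roots of relative token frequencies. A minimal sketch follows, assuming bigrams padded with '$' and '#' (an assumption that reproduces the documented values); the helper names are hypothetical.

```python
import math
from collections import Counter


def qgrams_sketch(word: str, q: int = 2) -> Counter:
    """Count padded q-grams; the '$'/'#' padding is an assumption."""
    padded = '$' * (q - 1) + word + '#' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def matusita_dist_abs_sketch(src: str, tar: str) -> float:
    """Euclidean distance between square-rooted relative frequencies."""
    x, y = qgrams_sketch(src), qgrams_sketch(tar)
    size_x, size_y = sum(x.values()), sum(y.values())
    return math.sqrt(
        sum(
            (math.sqrt(x[tok] / size_x) - math.sqrt(y[tok] / size_y)) ** 2
            for tok in x.keys() | y.keys()
        )
    )
```

When the two strings share no tokens, every relative frequency contributes in full and the result is the square root of 2, the documented value for 'ATCG' and 'TAGC'.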
- class abydos.distance.MaxwellPilliner(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Maxwell & Pilliner correlation.
For two sets X and Y and a population N, Maxwell & Pilliner correlation [MP68] is
\[corr_{MaxwellPilliner}(X, Y) = \frac{2(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)} {|X| \cdot |N \setminus X| + |Y| \cdot |N \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{MaxwellPilliner} = \frac{2(ad-bc)}{(a+b)(c+d)+(a+c)(b+d)}\]
New in version 0.4.0.
Initialize MaxwellPilliner instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Maxwell & Pilliner correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Maxwell & Pilliner correlation
- Return type:
float
Examples
>>> cmp = MaxwellPilliner()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.35921989956790845
>>> cmp.corr('aluminum', 'Catalan')
0.10803030303030303
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Maxwell & Pilliner similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Maxwell & Pilliner similarity
- Return type:
float
Examples
>>> cmp = MaxwellPilliner()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6796099497839543
>>> cmp.sim('aluminum', 'Catalan')
0.5540151515151515
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
- class abydos.distance.McConnaughey(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
McConnaughey correlation.
For two sets X and Y, McConnaughey correlation [McC64] is
\[corr_{McConnaughey}(X, Y) = \frac{|X \cap Y|^2 - |X \setminus Y| \cdot |Y \setminus X|} {|X| \cdot |Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{McConnaughey} = \frac{a^2-bc}{(a+b)(a+c)}\]
New in version 0.4.0.
Initialize McConnaughey instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the McConnaughey correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
McConnaughey correlation
- Return type:
float
Examples
>>> cmp = McConnaughey()
>>> cmp.corr('cat', 'hat')
0.0
>>> cmp.corr('Niall', 'Neil')
-0.26666666666666666
>>> cmp.corr('aluminum', 'Catalan')
-0.7638888888888888
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the McConnaughey similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
McConnaughey similarity
- Return type:
float
Examples
>>> cmp = McConnaughey()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3666666666666667
>>> cmp.sim('aluminum', 'Catalan')
0.11805555555555558
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
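The McConnaughey formula can be checked by hand with a few lines of plain Python. The following is a minimal sketch, not the library's implementation: the helper bigrams() is hypothetical and assumes that strings are tokenized into 2-grams padded with '$' and '#', an assumption that happens to reproduce the example values listed for corr() above.

```python
from collections import Counter


def bigrams(word, start="$", stop="#"):
    """Hypothetical helper: multiset of padded 2-grams of a word."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def mcconnaughey_corr(src, tar):
    """Sketch of (a^2 - bc) / ((a+b)(a+c)) over padded bigram multisets."""
    x, y = bigrams(src), bigrams(tar)
    a = sum((x & y).values())  # |X intersect Y|
    b = sum((x - y).values())  # |X \ Y|
    c = sum((y - x).values())  # |Y \ X|
    # Padding guarantees at least one bigram per string, so the
    # denominator is never zero.
    return (a * a - b * c) / ((a + b) * (a + c))


print(mcconnaughey_corr("cat", "hat"))
print(mcconnaughey_corr("Niall", "Neil"))
```

For 'cat' and 'hat' this gives a = 2, b = 2, c = 2, so the correlation is (4 - 4) / 16 = 0.0. The similarity values listed for sim() are consistent with (1 + corr) / 2.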
- class abydos.distance.McEwenMichael(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
McEwen & Michael correlation.
For two sets X and Y and a population N, the McEwen & Michael correlation [Mic20] is
\[corr_{McEwenMichael}(X, Y) = \frac{4(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)} {(|X \cap Y| + |(N \setminus X) \setminus Y|)^2 + (|X \setminus Y| + |Y \setminus X|)^2}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{McEwenMichael} = \frac{4(ad-bc)}{(a+d)^2+(b+c)^2}\]
New in version 0.4.0.
Initialize McEwenMichael instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the McEwen & Michael correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
McEwen & Michael correlation
- Return type:
float
Examples
>>> cmp = McEwenMichael()
>>> cmp.corr('cat', 'hat')
0.010203544942933782
>>> cmp.corr('Niall', 'Neil')
0.010189175491654217
>>> cmp.corr('aluminum', 'Catalan')
0.0048084299262381456
>>> cmp.corr('ATCG', 'TAGC')
-0.00016689587032858459
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the McEwen & Michael similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
McEwen & Michael similarity
- Return type:
float
Examples
>>> cmp = McEwenMichael()
>>> cmp.sim('cat', 'hat')
0.5051017724714669
>>> cmp.sim('Niall', 'Neil')
0.5050945877458272
>>> cmp.sim('aluminum', 'Catalan')
0.502404214963119
>>> cmp.sim('ATCG', 'TAGC')
0.4999165520648357
New in version 0.4.0.
- class abydos.distance.MetaLevenshtein(tokenizer: Optional[_Tokenizer] = None, corpus: Optional[UnigramCorpus] = None, metric: Optional[_Distance] = None, normalizer: Callable[[List[float]], float] = <built-in function max>, **kwargs: Any)[source]
Bases:
_Distance
Meta-Levenshtein distance.
Meta-Levenshtein distance [MYCappe08] combines Soft-TFIDF with Levenshtein alignment.
New in version 0.4.0.
Initialize MetaLevenshtein instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
corpus (UnigramCorpus) -- A unigram corpus (UnigramCorpus). If None, a corpus will be created from the two words when a similarity function is called.
metric (_Distance) -- A string distance measure class for making soft matches, by default Jaro-Winkler.
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Meta-Levenshtein distance between two strings.
The Levenshtein distance is normalized by dividing the Levenshtein distance (calculated by any of the three supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The normalized Meta-Levenshtein distance between src & tar
- Return type:
float
Examples
>>> cmp = MetaLevenshtein()
>>> round(cmp.dist('cat', 'hat'), 12)
0.205186754296
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.507780131444
>>> cmp.dist('aluminum', 'Catalan')
0.8675933954313434
>>> cmp.dist('ATCG', 'TAGC')
0.8077801314441113
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str) float [source]
Return the Meta-Levenshtein distance of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Meta-Levenshtein distance
- Return type:
float
Examples
>>> cmp = MetaLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
0.6155602628882225
>>> cmp.dist_abs('Niall', 'Neil')
2.538900657220556
>>> cmp.dist_abs('aluminum', 'Catalan')
6.940747163450747
>>> cmp.dist_abs('ATCG', 'TAGC')
3.2311205257764453
New in version 0.4.0.
- class abydos.distance.Michelet(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Michelet similarity.
For two sets X and Y and a population N, Michelet similarity [TCLM88] is
\[sim_{Michelet}(X, Y) = \frac{|X \cap Y|^2}{|X| \cdot |Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Michelet} = \frac{a^2}{(a+b)(a+c)}\]
Following [Seq18], this is termed "Michelet", though Turner is most often listed as the first author in papers presenting this measure.
New in version 0.4.0.
Initialize Michelet instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Michelet similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Michelet similarity
- Return type:
float
Examples
>>> cmp = Michelet()
>>> cmp.sim('cat', 'hat')
0.25
>>> cmp.sim('Niall', 'Neil')
0.13333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.013888888888888888
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
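As a rough illustration of the Michelet formula, the sketch below computes |X ∩ Y|^2 / (|X| · |Y|) directly. It is not the library's code: the helper bigrams() is hypothetical and assumes padded 2-gram tokenization with '$' and '#', which is consistent with the example values above.

```python
from collections import Counter


def bigrams(word, start="$", stop="#"):
    """Hypothetical helper: multiset of padded 2-grams of a word."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def michelet_sim(src, tar):
    """Sketch of a^2 / ((a+b)(a+c)) over padded bigram multisets."""
    x, y = bigrams(src), bigrams(tar)
    a = sum((x & y).values())  # size of the multiset intersection
    return a * a / (sum(x.values()) * sum(y.values()))


print(michelet_sim("cat", "hat"))      # 2^2 / (4 * 4)
print(michelet_sim("Niall", "Neil"))   # 2^2 / (6 * 5)
```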
- class abydos.distance.Millar(**kwargs: Any)[source]
Bases:
_TokenDistance
Millar's binomial deviance dissimilarity.
For two sets X and Y drawn from a population S, Millar's binomial deviance dissimilarity [AM04] is:
\[dist_{Millar}(X, Y) = \sum_{i=0}^{|S|} \frac{1}{x_i+y_i} \bigg\{x_i \log(\frac{x_i}{x_i+y_i}) + y_i \log(\frac{y_i}{x_i+y_i}) - (x_i+y_i) \log(\frac{1}{2})\bigg\}\]
New in version 0.4.1.
Initialize Millar instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
- dist(*args: Any, **kwargs: Any) NoReturn [source]
Raise exception when called.
- Parameters:
*args -- Variable length argument list
**kwargs -- Arbitrary keyword arguments
- Raises:
NotImplementedError -- Method disabled for Millar dissimilarity.
New in version 0.3.6.
- dist_abs(src: str, tar: str) float [source]
Return Millar's binomial deviance dissimilarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Millar's binomial deviance dissimilarity
- Return type:
float
Examples
>>> cmp = Millar()
>>> cmp.dist_abs('cat', 'hat')
2.772588722239781
>>> cmp.dist_abs('Niall', 'Neil')
4.852030263919617
>>> cmp.dist_abs('aluminum', 'Catalan')
9.704060527839234
>>> cmp.dist_abs('ATCG', 'TAGC')
6.931471805599453
New in version 0.4.1.
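The binomial deviance sum can be evaluated term by term. The sketch below is an illustration under stated assumptions rather than the library's implementation: it uses a hypothetical bigrams() helper (padded 2-grams with '$' and '#'), natural logarithms, and the usual convention that 0 · log(0) = 0.

```python
import math
from collections import Counter


def bigrams(word, start="$", stop="#"):
    """Hypothetical helper: multiset of padded 2-grams of a word."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def millar_dist_abs(src, tar):
    """Sketch of Millar's binomial deviance over padded bigram counts."""
    x, y = bigrams(src), bigrams(tar)
    total = 0.0
    for token in set(x) | set(y):
        xi, yi = x[token], y[token]  # Counter returns 0 for absent tokens
        n = xi + yi
        term = -n * math.log(0.5)
        if xi:  # skip 0 * log(0), taken to be 0
            term += xi * math.log(xi / n)
        if yi:
            term += yi * math.log(yi / n)
        total += term / n
    return total


print(millar_dist_abs("cat", "hat"))
```

A token that occurs equally often in both strings contributes 0, and a token that occurs in only one string contributes log(2). 'cat' and 'hat' differ in four bigrams, giving 4 · log(2) ≈ 2.7726.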
- class abydos.distance.MinHash(tokenizer: Optional[_Tokenizer] = None, k: int = 0, seed: int = 10, **kwargs: Any)[source]
Bases:
_Distance
MinHash similarity.
MinHash similarity [Bro97] is a method of approximating the intersection over the union of two sets. This implementation is based on [Kul15].
New in version 0.4.0.
Initialize MinHash instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
k (int) -- The number of hash functions to use for similarity estimation
seed (int) -- A seed value for the random functions
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the MinHash similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
MinHash similarity
- Return type:
float
Examples
>>> cmp = MinHash()
>>> cmp.sim('cat', 'hat')
0.75
>>> cmp.sim('Niall', 'Neil')
1.0
>>> cmp.sim('aluminum', 'Catalan')
0.5
>>> cmp.sim('ATCG', 'TAGC')
0.6
New in version 0.4.0.
- class abydos.distance.Minkowski(pval: float = 1, alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = 0, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Minkowski distance.
The Minkowski distance [Min10] is a distance metric in \(L^p\)-space.
New in version 0.3.6.
Initialize Minkowski instance.
- Parameters:
pval (float) -- The \(p\)-value of the \(L^p\)-space
alphabet (collection or int) -- The values or size of the alphabet
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return normalized Minkowski distance of two strings.
The normalized Minkowski distance [Min10] is a distance metric in \(L^p\)-space, normalized to [0, 1].
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
The normalized Minkowski distance
- Return type:
float
Examples
>>> cmp = Minkowski()
>>> cmp.dist('cat', 'hat')
0.5
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.636363636364
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.692307692308
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str, normalized: bool = False) float [source]
Return the Minkowski distance (\(L^p\)-norm) of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
normalized (bool) -- Normalizes to [0, 1] if True
- Returns:
The Minkowski distance
- Return type:
float
Examples
>>> cmp = Minkowski()
>>> cmp.dist_abs('cat', 'hat')
4.0
>>> cmp.dist_abs('Niall', 'Neil')
7.0
>>> cmp.dist_abs('Colin', 'Cuilen')
9.0
>>> cmp.dist_abs('ATCG', 'TAGC')
10.0
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
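To make the unnormalized \(L^p\)-norm concrete, the following sketch computes it over token-count differences. It is an illustration only: the bigrams() helper is hypothetical and assumes padded 2-gram tokenization with '$' and '#', and only finite p >= 1 and p = infinity are handled. Normalization and the alphabet parameter are deliberately left out.

```python
from collections import Counter


def bigrams(word, start="$", stop="#"):
    """Hypothetical helper: multiset of padded 2-grams of a word."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def minkowski_abs(src, tar, pval=1):
    """Sketch of the L^p norm of per-token count differences."""
    x, y = bigrams(src), bigrams(tar)
    # Absolute count difference for every token seen in either string.
    diffs = [abs(x[token] - y[token]) for token in set(x) | set(y)]
    if pval == float("inf"):
        return float(max(diffs, default=0))
    return sum(d ** pval for d in diffs) ** (1 / pval)


print(minkowski_abs("cat", "hat"))           # p = 1 (Manhattan)
print(minkowski_abs("cat", "hat", pval=2))   # p = 2 (Euclidean)
```

With p = 1 the four differing bigrams of 'cat' and 'hat' each contribute 1, which matches the dist_abs() value of 4.0 listed above.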
- class abydos.distance.MongeElkan(sim_func: Optional[Union[_Distance, Callable[[str, str], float]]] = None, symmetric: bool = False, **kwargs: Any)[source]
Bases:
_Distance
Monge-Elkan similarity.
Monge-Elkan is defined in [ME96].
Note: Monge-Elkan is NOT a symmetric similarity algorithm. Thus, the similarity of src to tar is not necessarily equal to the similarity of tar to src. If the symmetric argument is True, a symmetric value is calculated, at the cost of doubling the computation time (since \(sim_{Monge-Elkan}(src, tar)\) and \(sim_{Monge-Elkan}(tar, src)\) are both calculated and then averaged).
New in version 0.3.6.
Initialize MongeElkan instance.
- Parameters:
sim_func (function) -- The internal similarity metric to employ
symmetric (bool) -- Return a symmetric similarity measure
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Monge-Elkan similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Monge-Elkan similarity
- Return type:
float
Examples
>>> cmp = MongeElkan()
>>> cmp.sim('cat', 'hat')
0.75
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.666666666667
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.388888888889
>>> cmp.sim('ATCG', 'TAGC')
0.5
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
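The asymmetry described above is easiest to see in a small sketch. The code below is not the library's implementation. It assumes that each string is split into padded 2-grams (hypothetical helper bigram_list()) and that the internal similarity is a normalized Levenshtein similarity (hypothetical helper levenshtein_sim()); these assumptions were chosen because they reproduce the example values listed for sim().

```python
def levenshtein_sim(s, t):
    """Hypothetical inner measure: 1 - edit distance / longer length."""
    if s == t:
        return 1.0
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(s), len(t))


def bigram_list(word, start="$", stop="#"):
    """Hypothetical helper: list of padded 2-grams of a word."""
    padded = start + word + stop
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def monge_elkan(src, tar, symmetric=False):
    """Average, over src tokens, of the best match among tar tokens."""
    src_toks, tar_toks = bigram_list(src), bigram_list(tar)
    score = sum(
        max(levenshtein_sim(s, t) for t in tar_toks) for s in src_toks
    ) / len(src_toks)
    if symmetric:
        # Average both directions, at twice the computational cost.
        return (score + monge_elkan(tar, src)) / 2
    return score


print(monge_elkan("Niall", "Neil"))   # src -> tar
print(monge_elkan("Neil", "Niall"))   # tar -> src, a different value
```

Because the average is taken over the tokens of src only, swapping the arguments changes both the numerator and the denominator, which is why the two directions can disagree.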
- class abydos.distance.Morisita(**kwargs: Any)[source]
Bases:
_TokenDistance
Morisita index of overlap.
Morisita index of overlap [Mor59], following the description of [Hor66], given two populations X and Y drawn from S species, is:
\[sim_{Morisita}(X, Y) = C_{\lambda} = \frac{2\sum_{i=1}^S x_i y_i}{(\lambda_x + \lambda_y)XY}\]
where
\[X = \sum_{i=1}^S x_i ~~;~~ Y = \sum_{i=1}^S y_i\]
\[\lambda_x = \frac{\sum_{i=1}^S x_i(x_i-1)}{X(X-1)} ~~;~~ \lambda_y = \frac{\sum_{i=1}^S y_i(y_i-1)}{Y(Y-1)}\]
New in version 0.4.1.
Initialize Morisita instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
- dist(*args: Any, **kwargs: Any) NoReturn [source]
Raise exception when called.
- Parameters:
*args -- Variable length argument list
**kwargs -- Arbitrary keyword arguments
- Raises:
NotImplementedError -- Method disabled for Morisita similarity.
New in version 0.3.6.
- sim(*args: Any, **kwargs: Any) NoReturn [source]
Raise exception when called.
- Parameters:
*args -- Variable length argument list
**kwargs -- Arbitrary keyword arguments
- Raises:
NotImplementedError -- Method disabled for Morisita similarity.
New in version 0.3.6.
- sim_score(src: str, tar: str) float [source]
Return the Morisita similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Morisita similarity
- Return type:
float
Examples
>>> cmp = Morisita()
>>> cmp.sim_score('cat', 'hat')
0.25
>>> cmp.sim_score('Niall', 'Neil')
0.13333333333333333
>>> cmp.sim_score('aluminum', 'Catalan')
1.0
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.1.
- class abydos.distance.Mountford(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Mountford similarity.
For two sets X and Y, the Mountford similarity [Mou62] is
\[sim_{Mountford}(X, Y) = \frac{2|X \cap Y|}{2|X|\cdot|Y|-(|X|+|Y|)\cdot|X \cap Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Mountford} = \frac{2a}{2(a+b)(a+c)-(2a+b+c)a}\]
New in version 0.4.0.
Initialize Mountford instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Mountford similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Mountford similarity
- Return type:
float
Examples
>>> cmp = Mountford()
>>> cmp.sim('cat', 'hat')
0.25
>>> cmp.sim('Niall', 'Neil')
0.10526315789473684
>>> cmp.sim('aluminum', 'Catalan')
0.015748031496062992
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
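The Mountford formula can be sketched in plain Python as follows. This is an illustration, not the library's code: bigrams() is a hypothetical helper assuming padded 2-gram tokenization with '$' and '#'. The denominator simplifies to ab + ac + 2bc, so it is zero when the two token multisets are identical; how the library treats that case is not covered here, and the sketch raises ValueError instead.

```python
from collections import Counter


def bigrams(word, start="$", stop="#"):
    """Hypothetical helper: multiset of padded 2-grams of a word."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def mountford_sim(src, tar):
    """Sketch of 2a / (2|X||Y| - (|X| + |Y|) a) over padded bigrams."""
    x, y = bigrams(src), bigrams(tar)
    a = sum((x & y).values())
    size_x, size_y = sum(x.values()), sum(y.values())
    denom = 2 * size_x * size_y - (size_x + size_y) * a
    if denom == 0:
        # Assumption of this sketch: identical multisets are rejected
        # rather than assigned a conventional value.
        raise ValueError("formula undefined for identical token multisets")
    return 2 * a / denom


print(mountford_sim("cat", "hat"))      # 4 / (32 - 16)
print(mountford_sim("Niall", "Neil"))   # 4 / (60 - 22)
```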
- class abydos.distance.MutualInformation(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Mutual Information similarity.
For two sets X and Y and a population N, Mutual Information similarity [CGHH91] is
\[sim_{MI}(X, Y) = \log_2(\frac{|X \cap Y| \cdot |N|}{|X| \cdot |Y|})\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{MI} = \log_2(\frac{an}{(a+b)(a+c)})\]
New in version 0.4.0.
Initialize MutualInformation instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Mutual Information similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Mutual Information similarity
- Return type:
float
Examples
>>> cmp = MutualInformation()
>>> cmp.sim('cat', 'hat')
0.933609253088981
>>> cmp.sim('Niall', 'Neil')
0.8911684881725231
>>> cmp.sim('aluminum', 'Catalan')
0.7600321183863901
>>> cmp.sim('ATCG', 'TAGC')
0.17522996523538537
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Mutual Information similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Mutual Information similarity
- Return type:
float
Examples
>>> cmp = MutualInformation()
>>> cmp.sim_score('cat', 'hat')
6.528166795717758
>>> cmp.sim_score('Niall', 'Neil')
5.661433326581222
>>> cmp.sim_score('aluminum', 'Catalan')
3.428560943378589
>>> cmp.sim_score('ATCG', 'TAGC')
-4.700439718141092
New in version 0.4.0.
- class abydos.distance.NCDarith(probs: Optional[Dict[str, Tuple[Fraction, Fraction]]] = None, **kwargs: Any)[source]
Bases:
_Distance
Normalized Compression Distance using arithmetic coding.
Cf. https://en.wikipedia.org/wiki/Arithmetic_coding
Normalized compression distance (NCD) [CV05].
New in version 0.3.6.
Initialize the arithmetic coder object.
- Parameters:
probs (dict) -- A dictionary trained with Arithmetic.train()
New in version 0.3.6.
Changed in version 0.3.6: Encapsulated in class
- dist(src: str, tar: str) float [source]
Return the NCD between two strings using arithmetic coding.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Compression distance
- Return type:
float
Examples
>>> cmp = NCDarith()
>>> cmp.dist('cat', 'hat')
0.5454545454545454
>>> cmp.dist('Niall', 'Neil')
0.6875
>>> cmp.dist('aluminum', 'Catalan')
0.8275862068965517
>>> cmp.dist('ATCG', 'TAGC')
0.6923076923076923
New in version 0.3.5.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.NCDbwtrle(**kwargs: Any)[source]
Bases:
NCDrle
Normalized Compression Distance using BWT plus RLE.
Cf. https://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Normalized compression distance (NCD) [CV05].
New in version 0.3.6.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the NCD between two strings using BWT plus RLE.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Compression distance
- Return type:
float
Examples
>>> cmp = NCDbwtrle()
>>> cmp.dist('cat', 'hat')
0.75
>>> cmp.dist('Niall', 'Neil')
0.8333333333333334
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
0.8
New in version 0.3.5.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.NCDbz2(level: int = 9, **kwargs: Any)[source]
Bases:
_Distance
Normalized Compression Distance using bzip2 compression.
Cf. https://en.wikipedia.org/wiki/Bzip2
Normalized compression distance (NCD) [CV05].
New in version 0.3.6.
Initialize bzip2 compressor.
- Parameters:
level (int) -- The compression level (1 to 9, the range accepted by Python's bz2 module)
New in version 0.3.6.
Changed in version 0.3.6: Encapsulated in class
- dist(src: str, tar: str) float [source]
Return the NCD between two strings using bzip2 compression.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Compression distance
- Return type:
float
Examples
>>> cmp = NCDbz2()
>>> cmp.dist('cat', 'hat')
0.06666666666666667
>>> cmp.dist('Niall', 'Neil')
0.03125
>>> cmp.dist('aluminum', 'Catalan')
0.17647058823529413
>>> cmp.dist('ATCG', 'TAGC')
0.03125
New in version 0.3.5.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.NCDlzma(level: int = 6, **kwargs: Any)[source]
Bases:
_Distance
Normalized Compression Distance using LZMA compression.
Cf. https://en.wikipedia.org/wiki/Lempel-Ziv-Markov_chain_algorithm
Normalized compression distance (NCD) [CV05].
New in version 0.3.6.
Initialize LZMA compressor.
- Parameters:
level (int) -- The compression level (0 to 9)
New in version 0.5.0.
- dist(src: str, tar: str) float [source]
Return the NCD between two strings using LZMA compression.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Compression distance
- Return type:
float
Examples
>>> cmp = NCDlzma()
>>> cmp.dist('cat', 'hat')
0.08695652173913043
>>> cmp.dist('Niall', 'Neil')
0.16
>>> cmp.dist('aluminum', 'Catalan')
0.16
>>> cmp.dist('ATCG', 'TAGC')
0.08695652173913043
New in version 0.3.5.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.NCDlzss(**kwargs: Any)[source]
Bases:
_Distance
Normalized Compression Distance using LZSS compression.
Cf. https://en.wikipedia.org/wiki/Lempel-Ziv-Storer-Szymanski
Normalized compression distance (NCD) [CV05].
New in version 0.4.0.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the NCD between two strings using LZSS compression.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Compression distance
- Return type:
float
- Raises:
ValueError -- Install the PyLZSS module in order to use LZSS
Examples
>>> cmp = NCDlzss()
>>> cmp.dist('cat', 'hat')
0.75
>>> cmp.dist('Niall', 'Neil')
1.0
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
0.8
New in version 0.4.0.
- class abydos.distance.NCDpaq9a(**kwargs: Any)[source]
Bases:
_Distance
Normalized Compression Distance using PAQ9A compression.
Cf. http://mattmahoney.net/dc/#paq9a
Normalized compression distance (NCD) [CV05].
New in version 0.4.0.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the NCD between two strings using PAQ9A compression.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Compression distance
- Return type:
float
- Raises:
ValueError -- Install the paq module in order to use PAQ9A
Examples
>>> cmp = NCDpaq9a()
>>> cmp.dist('cat', 'hat')
0.42857142857142855
>>> cmp.dist('Niall', 'Neil')
0.5555555555555556
>>> cmp.dist('aluminum', 'Catalan')
0.5833333333333334
>>> cmp.dist('ATCG', 'TAGC')
0.5
New in version 0.4.0.
- class abydos.distance.NCDrle(**kwargs: Any)[source]
Bases:
_Distance
Normalized Compression Distance using RLE.
Cf. https://en.wikipedia.org/wiki/Run-length_encoding
Normalized compression distance (NCD) [CV05].
New in version 0.3.6.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the NCD between two strings using RLE.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Compression distance
- Return type:
float
Examples
>>> cmp = NCDrle()
>>> cmp.dist('cat', 'hat')
1.0
>>> cmp.dist('Niall', 'Neil')
1.0
>>> cmp.dist('aluminum', 'Catalan')
1.0
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.3.5.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.NCDzlib(level: int = -1, **kwargs: Any)[source]
Bases:
_Distance
Normalized Compression Distance using zlib compression.
Normalized compression distance (NCD) [CV05].
New in version 0.3.6.
Initialize zlib compressor.
- Parameters:
level (int) -- The compression level (0 to 9, or -1 to select zlib's default level)
New in version 0.3.6.
- dist(src: str, tar: str) float [source]
Return the NCD between two strings using zlib compression.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Compression distance
- Return type:
float
Examples
>>> cmp = NCDzlib()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.45454545454545453
>>> cmp.dist('aluminum', 'Catalan')
0.5714285714285714
>>> cmp.dist('ATCG', 'TAGC')
0.4
New in version 0.3.5.
Changed in version 0.3.6: Encapsulated in class
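All of the NCD classes share the same underlying idea from [CV05]: with C(s) the compressed length of s, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). The sketch below illustrates this with Python's standard zlib module. It is not the library's implementation; the function name ncd_zlib is hypothetical, and because details such as stream headers and flushing affect compressed lengths, its values should not be expected to match the examples above.

```python
import zlib


def ncd_zlib(src, tar, level=-1):
    """Sketch of normalized compression distance using zlib.compress."""
    x, y = src.encode("utf-8"), tar.encode("utf-8")
    cx = len(zlib.compress(x, level))
    cy = len(zlib.compress(y, level))
    # Compression is order-sensitive, so take the shorter of the two
    # concatenation orders (an assumption of this sketch).
    cxy = min(len(zlib.compress(x + y, level)),
              len(zlib.compress(y + x, level)))
    return (cxy - min(cx, cy)) / max(cx, cy)


repetitive = "abc" * 100
other = "xyz" * 100
print(ncd_zlib(repetitive, repetitive))  # near 0: nothing new to encode
print(ncd_zlib(repetitive, other))       # larger: little shared structure
```

On very short strings the fixed overhead of the compressed format dominates, which is one reason the different compressors above give such different values for the same word pairs.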
- class abydos.distance.NeedlemanWunsch(gap_cost: float = 1, sim_func: Optional[Callable[[str, str], float]] = None, **kwargs: Any)[source]
Bases:
_Distance
Needleman-Wunsch score.
The Needleman-Wunsch score [NW70] is a standard edit distance measure.
New in version 0.3.6.
Initialize NeedlemanWunsch instance.
- Parameters:
gap_cost (float) -- The cost of an alignment gap (1 by default)
sim_func (function) -- A function that returns the similarity of two characters (identity similarity by default)
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Needleman-Wunsch score of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Needleman-Wunsch score
- Return type:
float
Examples
>>> cmp = NeedlemanWunsch()
>>> cmp.sim('cat', 'hat')
0.6666666666666667
>>> cmp.sim('Niall', 'Neil')
0.22360679774997896
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.0
>>> cmp.sim('cat', 'hat')
0.6666666666666667
New in version 0.4.1.
- static sim_matrix(src: str, tar: str, mat: Optional[Dict[Tuple[str, str], int]] = None, mismatch_cost: float = 0, match_cost: float = 1, symmetric: bool = True, alphabet: Optional[str] = None) float [source]
Return the matrix similarity of two strings.
With the default parameters, this is identical to sim_ident. It is possible for sim_matrix to return values outside of the range \([0, 1]\), if values outside that range are present in mat, mismatch_cost, or match_cost.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
mat (dict) -- A dict mapping tuples to costs; the tuples are (src, tar) pairs of symbols from the alphabet parameter
mismatch_cost (float) -- The value returned if (src, tar) is absent from mat when src does not equal tar
match_cost (float) -- The value returned if (src, tar) is absent from mat when src equals tar
symmetric (bool) -- True if the cost of src not matching tar is identical to the cost of tar not matching src; in this case, the values in mat need only contain (src, tar) or (tar, src), not both
alphabet (str) -- A collection of tokens from which src and tar are drawn; if this is defined a ValueError is raised if either tar or src have symbols not found in alphabet
- Returns:
Matrix similarity
- Return type:
float
- Raises:
ValueError -- src value not in alphabet
ValueError -- tar value not in alphabet
Examples
>>> NeedlemanWunsch.sim_matrix('cat', 'hat')
0
>>> NeedlemanWunsch.sim_matrix('hat', 'hat')
1
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- sim_score(src: str, tar: str) float [source]
Return the Needleman-Wunsch score of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Needleman-Wunsch score
- Return type:
float
Examples
>>> cmp = NeedlemanWunsch()
>>> cmp.sim_score('cat', 'hat')
2.0
>>> cmp.sim_score('Niall', 'Neil')
1.0
>>> cmp.sim_score('aluminum', 'Catalan')
-1.0
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
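For readers unfamiliar with global alignment, the following is a minimal sketch of the textbook Needleman-Wunsch dynamic program, using the defaults described above (gap cost of 1, identity similarity: 1 for a match and 0 for a mismatch). The function name needleman_wunsch_score is hypothetical and the code is not taken from the library.

```python
def needleman_wunsch_score(src, tar, gap_cost=1.0, sim_func=None):
    """Sketch of the Needleman-Wunsch global alignment score."""
    if sim_func is None:
        # Identity similarity: 1 for equal characters, 0 otherwise.
        def sim_func(a, b):
            return 1.0 if a == b else 0.0

    rows, cols = len(src) + 1, len(tar) + 1
    d = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        d[i][0] = -i * gap_cost
    for j in range(1, cols):
        d[0][j] = -j * gap_cost

    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = max(
                d[i - 1][j - 1] + sim_func(src[i - 1], tar[j - 1]),  # align
                d[i - 1][j] - gap_cost,  # gap in tar
                d[i][j - 1] - gap_cost,  # gap in src
            )
    return d[-1][-1]


print(needleman_wunsch_score("cat", "hat"))
print(needleman_wunsch_score("Niall", "Neil"))
```

For 'cat' and 'hat' the best alignment pairs the strings position by position: one mismatch (0) and two matches (1 each), for a score of 2.0.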
- class abydos.distance.Overlap(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Overlap coefficient.
For two sets X and Y, the overlap coefficient [Sim49, Szy34], also called the Szymkiewicz-Simpson coefficient and Simpson's ecological coexistence coefficient, is
\[sim_{overlap}(X, Y) = \frac{|X \cap Y|}{min(|X|, |Y|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{overlap} = \frac{a}{min(a+b, a+c)}\]
New in version 0.3.6.
Initialize Overlap instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the overlap coefficient of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Overlap similarity
- Return type:
float
Examples
>>> cmp = Overlap()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.4
>>> cmp.sim('aluminum', 'Catalan')
0.125
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.Ozbay(**kwargs: Any)[source]
Bases:
_Distance
Ozbay metric.
The Ozbay metric [Ozb15] is a string distance measure developed by Hakan Ozbay, which combines Jaccard distance, Levenshtein distance, and longest common substring distance.
The normalized variant should be considered experimental.
New in version 0.4.0.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Ozbay distance.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Ozbay distance
- Return type:
float
Examples
>>> cmp = Ozbay()
>>> round(cmp.dist('cat', 'hat'), 12)
0.027777777778
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.24
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.214285714286
>>> cmp.dist('ATCG', 'TAGC')
0.140625
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Ozbay metric.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Ozbay metric
- Return type:
float
Examples
>>> cmp = Ozbay()
>>> round(cmp.dist_abs('cat', 'hat'), 12)
0.75
>>> round(cmp.dist_abs('Niall', 'Neil'), 12)
6.0
>>> round(cmp.dist_abs('Colin', 'Cuilen'), 12)
7.714285714286
>>> cmp.dist_abs('ATCG', 'TAGC')
3.0
New in version 0.4.0.
- class abydos.distance.Pattern(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Pattern difference.
For two sets X and Y and a population N, the pattern difference [BB95], Batagelj & Bren's \(- bc -\) is
\[dist_{pattern}(X, Y) = \frac{4 \cdot |X \setminus Y| \cdot |Y \setminus X|} {|N|^2}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{pattern} = \frac{4bc}{n^2}\]
In [Cor17], the formula omits the 4 in the numerator: \(\frac{bc}{n^2}\).
New in version 0.4.0.
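As a rough sketch of how the confusion-table cells and the pattern difference might be derived for two words: the code below assumes padded character bigrams, an alphabet of 28^2 = 784 possible bigrams, and that d counts the alphabet entries seen in neither word. These are assumptions chosen for illustration, and the names ALPHABET_SIZE, bigrams, and pattern_dist are hypothetical.

```python
from collections import Counter

ALPHABET_SIZE = 28 ** 2  # assumed number of possible bigrams


def bigrams(word):
    """Multiset of character bigrams padded with '$' and '#' (assumed)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def pattern_dist(src, tar, alphabet_size=ALPHABET_SIZE):
    """Pattern difference 4bc / n^2 from the 2x2 confusion table."""
    x, y = bigrams(src), bigrams(tar)
    a = sum((x & y).values())  # tokens in both words
    b = sum((x - y).values())  # tokens only in src
    c = sum((y - x).values())  # tokens only in tar
    # Assumption: d is the number of alphabet entries seen in neither word.
    d = max(alphabet_size - len(x | y), 0)
    n = a + b + c + d
    return 4 * b * c / n ** 2


print(pattern_dist('cat', 'hat'))
```

For 'cat' and 'hat' this gives b = c = 2 and n = 784, so the result is 16 / 784^2, about 2.603e-05.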
Initialize Pattern instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the Pattern difference of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Pattern difference
- Return type:
float
Examples
>>> cmp = Pattern()
>>> cmp.dist('cat', 'hat')
2.6030820491461892e-05
>>> cmp.dist('Niall', 'Neil')
7.809246147438568e-05
>>> cmp.dist('aluminum', 'Catalan')
0.0003635035904093472
>>> cmp.dist('ATCG', 'TAGC')
0.0001626926280716368
New in version 0.4.0.
- class abydos.distance.PearsonChiSquared(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Pearson's Chi-Squared similarity.
For two sets X and Y and a population N, the Pearson's \(\chi^2\) similarity [PH13] is
\[sim_{PearsonChiSquared}(X, Y) = \frac{|N| \cdot (|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)^2} {|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}\]
This is also Pearson I similarity.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{PearsonChiSquared} = \frac{n(ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}\]
New in version 0.4.0.
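The confusion-table form can be evaluated directly once the four cells are known. The sketch below takes a, b, c, d as arguments; the function name pearson_chi_squared and the zero-denominator fallback are assumptions for illustration, and the sample cells (2, 2, 2, 778) are an assumed table for 'cat' and 'hat' using padded bigrams over a 784-token alphabet.

```python
def pearson_chi_squared(a, b, c, d):
    """Pearson's chi-squared: n(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d))."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        # Assumption: treat a degenerate table as having no association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


print(pearson_chi_squared(2, 2, 2, 778))  # about 193.995
```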
Initialize PearsonChiSquared instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return Pearson's Chi-Squared correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Pearson's Chi-Squared correlation
- Return type:
float
Examples
>>> cmp = PearsonChiSquared()
>>> cmp.corr('cat', 'hat')
0.2474424720578567
>>> cmp.corr('Niall', 'Neil')
0.1300991207720222
>>> cmp.corr('aluminum', 'Catalan')
0.011710186806836291
>>> cmp.corr('ATCG', 'TAGC')
-4.1196952743799446e-05
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Pearson's normalized Chi-Squared similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Pearson's Chi-Squared similarity
- Return type:
float
Examples
>>> cmp = PearsonChiSquared()
>>> cmp.corr('cat', 'hat')
0.2474424720578567
>>> cmp.corr('Niall', 'Neil')
0.1300991207720222
>>> cmp.corr('aluminum', 'Catalan')
0.011710186806836291
>>> cmp.corr('ATCG', 'TAGC')
-4.1196952743799446e-05
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return Pearson's Chi-Squared similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Pearson's Chi-Squared similarity
- Return type:
float
Examples
>>> cmp = PearsonChiSquared()
>>> cmp.sim_score('cat', 'hat')
193.99489809335964
>>> cmp.sim_score('Niall', 'Neil')
101.99771068526542
>>> cmp.sim_score('aluminum', 'Catalan')
9.19249664336649
>>> cmp.sim_score('ATCG', 'TAGC')
0.032298410951138765
New in version 0.4.0.
- class abydos.distance.PearsonHeronII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Pearson & Heron II correlation.
For two sets X and Y and a population N, Pearson & Heron II correlation [PH13] is
\[corr_{PearsonHeronII}(X, Y) = \cos \Big(\frac{\pi\sqrt{|X \setminus Y| \cdot |Y \setminus X|}} {\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} + \sqrt{|X \setminus Y| \cdot |Y \setminus X|}}\Big)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{PearsonHeronII} = \cos \Big(\frac{\pi\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\Big)\]
New in version 0.4.0.
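A minimal sketch of the confusion-table form, taking the four cells as arguments. The function name pearson_heron_ii and the fallback for an all-zero table are assumptions; the sample cells are an assumed table for 'cat' and 'hat' (padded bigrams, 784-token alphabet).

```python
import math


def pearson_heron_ii(a, b, c, d):
    """Pearson & Heron II: cos(pi * sqrt(bc) / (sqrt(ad) + sqrt(bc)))."""
    root_ad = math.sqrt(a * d)
    root_bc = math.sqrt(b * c)
    if root_ad + root_bc == 0:
        # Assumption: return 0.0 when the ratio is undefined.
        return 0.0
    return math.cos(math.pi * root_bc / (root_ad + root_bc))


print(pearson_heron_ii(2, 2, 2, 778))  # about 0.98853
print(pearson_heron_ii(0, 5, 5, 774))  # no shared tokens: cos(pi)
```

When ad = 0 and bc > 0 the argument of the cosine is exactly pi, which is why completely disjoint token sets reach the lower bound of -1.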
Initialize PearsonHeronII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Pearson & Heron II correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Pearson & Heron II correlation
- Return type:
float
Examples
>>> cmp = PearsonHeronII()
>>> cmp.corr('cat', 'hat')
0.9885309061036239
>>> cmp.corr('Niall', 'Neil')
0.9678978997263907
>>> cmp.corr('aluminum', 'Catalan')
0.7853000893691571
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Pearson & Heron II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Pearson & Heron II similarity
- Return type:
float
Examples
>>> cmp = PearsonHeronII()
>>> cmp.sim('cat', 'hat')
0.994265453051812
>>> cmp.sim('Niall', 'Neil')
0.9839489498631954
>>> cmp.sim('aluminum', 'Catalan')
0.8926500446845785
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.PearsonII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
PearsonChiSquared
Pearson II similarity.
For two sets X and Y and a population N, the Pearson II similarity [PH13], Pearson's coefficient of mean square contingency, is
\[corr_{PearsonII} = \sqrt{\frac{\chi^2}{|N|+\chi^2}}\]
where
\[\chi^2 = sim_{PearsonChiSquared}(X, Y) = \frac{|N| \cdot (|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)^2} {|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[\chi^2 = sim_{PearsonChiSquared} = \frac{n \cdot (ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}\]
New in version 0.4.0.
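The two-step definition (first chi-squared, then the square-root transform) can be sketched as follows. The function name pearson_ii and the handling of degenerate tables are assumptions; the sample cells are an assumed table for 'cat' and 'hat' (padded bigrams, 784-token alphabet).

```python
import math


def pearson_ii(a, b, c, d):
    """Pearson II: sqrt(chi2 / (n + chi2)), with chi2 from the 2x2 table."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if n == 0 or denom == 0:
        # Assumption: a degenerate table yields 0.0.
        return 0.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    return math.sqrt(chi2 / (n + chi2))


print(pearson_ii(2, 2, 2, 778))  # about 0.44538
```

Because chi-squared never exceeds n for a 2x2 table, the unnormalized value cannot exceed sqrt(1/2).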
Initialize PearsonII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Pearson II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Pearson II similarity
- Return type:
float
Examples
>>> cmp = PearsonII()
>>> cmp.sim('cat', 'hat')
0.6298568508557214
>>> cmp.sim('Niall', 'Neil')
0.47983719547968123
>>> cmp.sim('aluminum', 'Catalan')
0.15214891090821628
>>> cmp.sim('ATCG', 'TAGC')
0.009076921903905551
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Pearson II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Pearson II similarity
- Return type:
float
Examples
>>> cmp = PearsonII()
>>> cmp.sim_score('cat', 'hat')
0.44537605041688455
>>> cmp.sim_score('Niall', 'Neil')
0.3392961347892176
>>> cmp.sim_score('aluminum', 'Catalan')
0.10758552665334761
>>> cmp.sim_score('ATCG', 'TAGC')
0.006418353030552324
New in version 0.4.0.
- class abydos.distance.PearsonIII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
PearsonPhi
Pearson III correlation.
For two sets X and Y and a population N, the Pearson III correlation [PH13], Pearson's coefficient of racial likeness, is
\[corr_{PearsonIII} = \sqrt{\frac{\phi}{|N|+\phi}}\]
where
\[\phi = corr_{PearsonPhi}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {\sqrt{|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[\phi = corr_{PearsonPhi} = \frac{ad-bc} {\sqrt{(a+b)(a+c)(b+d)(c+d)}}\]
New in version 0.4.0.
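A hedged sketch of the transform applied to phi. The formula as written is only real-valued for non-negative phi; for negative phi this sketch returns -sqrt(-phi / (n + phi)), which is an assumption chosen because it agrees with the negative value listed in the corr examples, not a documented rule. The function name pearson_iii is hypothetical.

```python
import math


def pearson_iii(a, b, c, d):
    """Pearson III: sqrt(phi / (n + phi)), with phi from the 2x2 table."""
    n = a + b + c + d
    denom = math.sqrt((a + b) * (a + c) * (b + d) * (c + d))
    if n == 0 or denom == 0:
        # Assumption: a degenerate table yields 0.0.
        return 0.0
    phi = (a * d - b * c) / denom
    if phi >= 0:
        return math.sqrt(phi / (n + phi))
    # Assumption: mirror the sign for negative phi (see the lead-in).
    return -math.sqrt(-phi / (n + phi))


print(pearson_iii(2, 2, 2, 778))  # about 0.02518
```

The sample cells are an assumed table for 'cat' and 'hat' using padded bigrams over a 784-token alphabet.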
Initialize PearsonIII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Pearson III correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Pearson III correlation
- Return type:
float
Examples
>>> cmp = PearsonIII()
>>> cmp.corr('cat', 'hat')
0.025180989806958435
>>> cmp.corr('Niall', 'Neil')
0.021444241017487504
>>> cmp.corr('aluminum', 'Catalan')
0.011740218922356615
>>> cmp.corr('ATCG', 'TAGC')
-0.0028612777635371113
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Pearson III similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Pearson III similarity
- Return type:
float
Examples
>>> cmp = PearsonIII()
>>> cmp.sim('cat', 'hat')
0.5125904949034792
>>> cmp.sim('Niall', 'Neil')
0.5107221205087438
>>> cmp.sim('aluminum', 'Catalan')
0.5058701094611783
>>> cmp.sim('ATCG', 'TAGC')
0.49856936111823147
New in version 0.4.0.
- class abydos.distance.PearsonPhi(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Pearson's Phi correlation.
For two sets X and Y and a population N, the Pearson's \(\phi\) correlation [Guirk, Pea00, PH13] is
\[corr_{PearsonPhi}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {\sqrt{|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}}\]
This is also Pearson & Heron I similarity.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{PearsonPhi} = \frac{ad-bc} {\sqrt{(a+b)(a+c)(b+d)(c+d)}}\]
Notes
In terms of a confusion matrix, this is equivalent to the Matthews correlation coefficient ConfusionTable.mcc().
New in version 0.4.0.
Initialize PearsonPhi instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return Pearson's Phi correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Pearson's Phi correlation
- Return type:
float
Examples
>>> cmp = PearsonPhi()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.36069255713421955
>>> cmp.corr('aluminum', 'Catalan')
0.10821361655002706
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Pearson's Phi similarity of two strings.
This is normalized to [0, 1] by adding 1 and dividing by 2.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Pearson's Phi similarity
- Return type:
float
Examples
>>> cmp = PearsonPhi()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6803462785671097
>>> cmp.sim('aluminum', 'Catalan')
0.5541068082750136
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
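The relationship between corr and sim described for PearsonPhi (add 1, divide by 2) can be sketched from the confusion-table cells. The function names and the zero-denominator fallback are assumptions; the sample cells are an assumed table for 'cat' and 'hat' (padded bigrams, 784-token alphabet).

```python
import math


def pearson_phi(a, b, c, d):
    """Pearson's phi: (ad - bc) / sqrt((a+b)(a+c)(b+d)(c+d))."""
    denom = math.sqrt((a + b) * (a + c) * (b + d) * (c + d))
    if denom == 0:
        # Assumption: a degenerate table yields 0.0.
        return 0.0
    return (a * d - b * c) / denom


def pearson_phi_sim(a, b, c, d):
    """Map the correlation from [-1, 1] onto [0, 1]."""
    return (1.0 + pearson_phi(a, b, c, d)) / 2.0


print(pearson_phi(2, 2, 2, 778))      # about 0.49744
print(pearson_phi_sim(2, 2, 2, 778))  # about 0.74872
```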
- class abydos.distance.Peirce(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Peirce correlation.
For two sets X and Y and a population N, the Peirce correlation [Pei84] is
\[corr_{Peirce}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {|X| \cdot |N \setminus X|}\]
Both [CCT10] and [Hubalek08] present a different formula and incorrectly attribute it to Peirce. Likewise, [Doo84] presents a different formula and incorrectly attributes it to Peirce. This is distinct from the formula he presents and attributes to himself.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Peirce} = \frac{ad-bc}{(a+b)(c+d)}\]
New in version 0.4.0.
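A minimal sketch of the confusion-table form. Unlike Pearson's phi, the denominator involves only the marginals of X, so the measure is not symmetric in src and tar. The function name peirce_corr and the zero-denominator fallback are assumptions; the sample cells are an assumed table for 'Niall' and 'Neil' (padded bigrams, 784-token alphabet).

```python
def peirce_corr(a, b, c, d):
    """Peirce correlation: (ad - bc) / ((a + b)(c + d))."""
    denom = (a + b) * (c + d)
    if denom == 0:
        # Assumption: a degenerate table yields 0.0.
        return 0.0
    return (a * d - b * c) / denom


print(peirce_corr(2, 4, 3, 775))  # src-only = 4, tar-only = 3
print(peirce_corr(2, 3, 4, 775))  # swapping b and c changes the result
```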
Initialize Peirce instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Peirce correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Peirce correlation
- Return type:
float
Examples
>>> cmp = Peirce()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.32947729220222793
>>> cmp.corr('aluminum', 'Catalan')
0.10209049255441008
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Peirce similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Peirce similarity
- Return type:
float
Examples
>>> cmp = Peirce()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.664738646101114
>>> cmp.sim('aluminum', 'Catalan')
0.5510452462772051
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
- class abydos.distance.PhoneticDistance(transforms: Optional[Union[Type[_Phonetic], Type[_Stemmer], Type[_Fingerprint], _Phonetic, _Stemmer, _Fingerprint, Callable[[str], str], Sequence[Union[Type[_Phonetic], Type[_Stemmer], Type[_Fingerprint], _Phonetic, _Stemmer, _Fingerprint, Callable[[str], str]]]]] = None, metric: Optional[Union[Type[_Distance], _Distance]] = None, encode_alpha: bool = False, **kwargs: Any)[source]
Bases:
_Distance
Phonetic distance.
Phonetic distance applies one or more supplied string transformations to words and compares the resulting transformed strings using a supplied distance measure.
A simple example would be to create a 'Soundex distance':
>>> from abydos.phonetic import Soundex
>>> soundex = PhoneticDistance(transforms=Soundex())
>>> soundex.dist('Ashcraft', 'Ashcroft')
0.0
>>> soundex.dist('Robert', 'Ashcraft')
1.0
New in version 0.4.1.
Initialize PhoneticDistance instance.
- Parameters:
transforms (list or _Phonetic or _Stemmer or _Fingerprint or type) -- An instance of a subclass of _Phonetic, _Stemmer, or _Fingerprint, or a list (or other iterable) of such instances to apply to each input word before computing their distance or similarity. If omitted, no transformations will be performed.
metric (_Distance or type) -- An instance of a subclass of _Distance, used for computing the inputs' distance or similarity after being transformed. If omitted, the strings will be compared for identity (returning 0.0 if identical, otherwise 1.0, when distance is computed).
encode_alpha (bool) -- Set to true to use the encode_alpha method of phonetic algorithms whenever possible.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
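The transform-then-compare idea described above can be sketched without any abydos classes. Everything here is hypothetical: phonetic_dist stands in for the general pattern, and drop_vowels is a toy transform used only so the example is self-contained (it is not Soundex).

```python
def phonetic_dist(src, tar, transforms=(), metric=None):
    """Apply each transform to both words in order, then compare them."""
    for transform in transforms:
        src, tar = transform(src), transform(tar)
    if metric is None:
        # Identity comparison: 0.0 if identical, otherwise 1.0.
        return 0.0 if src == tar else 1.0
    return metric(src, tar)


def drop_vowels(word):
    """Toy transform: keep the first letter, drop later vowels."""
    head, tail = word[:1], word[1:]
    return head.upper() + "".join(ch for ch in tail.lower() if ch not in "aeiou")


print(phonetic_dist('Ashcraft', 'Ashcroft', [drop_vowels]))  # 0.0
print(phonetic_dist('Robert', 'Ashcraft', [drop_vowels]))    # 1.0
```

Passing a callable as metric replaces the identity comparison, which mirrors the role of the metric parameter described above.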
- dist(src: str, tar: str) float [source]
Return the normalized Phonetic distance.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The normalized Phonetic distance
- Return type:
float
Examples
>>> from abydos.phonetic import Soundex
>>> cmp = PhoneticDistance(Soundex())
>>> cmp.dist('cat', 'hat')
1.0
>>> cmp.dist('Niall', 'Neil')
0.0
>>> cmp.dist('Colin', 'Cuilen')
0.0
>>> cmp.dist('ATCG', 'TAGC')
1.0
>>> from abydos.distance import Levenshtein
>>> cmp = PhoneticDistance(transforms=[Soundex], metric=Levenshtein)
>>> cmp.dist('cat', 'hat')
0.25
>>> cmp.dist('Niall', 'Neil')
0.0
>>> cmp.dist('Colin', 'Cuilen')
0.0
>>> cmp.dist('ATCG', 'TAGC')
0.75
New in version 0.4.1.
- dist_abs(src: str, tar: str) float [source]
Return the Phonetic distance.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Phonetic distance
- Return type:
float or int
Examples
>>> from abydos.phonetic import Soundex
>>> cmp = PhoneticDistance(Soundex())
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
0
>>> cmp.dist_abs('Colin', 'Cuilen')
0
>>> cmp.dist_abs('ATCG', 'TAGC')
1
>>> from abydos.distance import Levenshtein
>>> cmp = PhoneticDistance(transforms=[Soundex], metric=Levenshtein)
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
0
>>> cmp.dist_abs('Colin', 'Cuilen')
0
>>> cmp.dist_abs('ATCG', 'TAGC')
3
New in version 0.4.1.
- class abydos.distance.PhoneticEditDistance(mode: str = 'lev', cost: Tuple[float, float, float, float] = (1, 1, 1, 0.33333), normalizer: Callable[[List[float]], float] = max, weights: Optional[Union[Iterable[float], Dict[str, float]]] = None, **kwargs: Any)[source]
Bases:
Levenshtein
Phonetic edit distance.
This is a variation on Levenshtein edit distance, intended for strings in IPA, that compares individual phones based on their featural similarity.
New in version 0.4.1.
Initialize PhoneticEditDistance instance.
- Parameters:
mode (str) -- Specifies a mode for computing the edit distance:
lev (default) computes the ordinary Levenshtein distance, in which edits may include inserts, deletes, and substitutions
osa computes the Optimal String Alignment distance, in which edits may include inserts, deletes, substitutions, and transpositions but substrings may only be edited once
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 0.33333)). Note that transpositions cost a relatively low 0.33333. If this were 1.0, no phones would ever be transposed under the normal weighting, since even quite dissimilar phones such as [a] and [p] still agree in nearly 63% of their features.
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
weights (None or list or tuple or dict) -- If None, all features are of equal significance and a simple normalized hamming distance of the features is calculated. If a list or tuple of numeric values is supplied, the values are inferred as the weights for each feature, in order of the features listed in abydos.phones._phones._FEATURE_MASK. If a dict is supplied, its key values should match keys in abydos.phones._phones._FEATURE_MASK to which each weight (value) should be assigned. Missing values in all cases are assigned a weight of 0 and will be omitted from the comparison.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
- dist(src: str, tar: str) float [source]
Return the normalized phonetic edit distance between two strings.
The edit distance is normalized by dividing the edit distance (calculated by either of the two supported methods) by the greater of the number of characters in src times the cost of a delete and the number of characters in tar times the cost of an insert. For the case in which all operations have \(cost = 1\), this is equivalent to the greater of the length of the two strings src & tar.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The normalized phonetic edit distance between src & tar
- Return type:
float
Examples
>>> cmp = PhoneticEditDistance()
>>> round(cmp.dist('cat', 'hat'), 12)
0.059139784946
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.232258064516
>>> cmp.dist('aluminum', 'Catalan')
0.3084677419354839
>>> cmp.dist('ATCG', 'TAGC')
0.2983870967741935
New in version 0.4.1.
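The normalization step described for dist can be sketched on its own, given an absolute distance that has already been computed. The function name normalize_edit_distance is hypothetical, and the exact list handed to the normalizer inside the library is an assumption here; only the max-of-two-terms rule comes from the description above.

```python
def normalize_edit_distance(dist_abs, src, tar, ins_cost=1.0, del_cost=1.0,
                            normalizer=max):
    """Divide dist_abs by normalizer([len(src) * del_cost, len(tar) * ins_cost])."""
    norm = normalizer([len(src) * del_cost, len(tar) * ins_cost])
    if norm == 0:
        # Two empty strings: nothing to normalize.
        return 0.0
    return dist_abs / norm


# Absolute distance taken from the dist_abs example for 'cat' and 'hat'.
print(round(normalize_edit_distance(0.17741935483870974, 'cat', 'hat'), 12))
```

With unit costs the divisor is simply the longer string's length, so 0.1774... / 3 gives the 0.0591... shown in the dist example. Passing normalizer=sum divides by the combined length instead.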
- dist_abs(src: str, tar: str) float [source]
Return the phonetic edit distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The phonetic edit distance between src & tar
- Return type:
int (may return a float if cost has float values)
Examples
>>> cmp = PhoneticEditDistance()
>>> cmp.dist_abs('cat', 'hat')
0.17741935483870974
>>> cmp.dist_abs('Niall', 'Neil')
1.161290322580645
>>> cmp.dist_abs('aluminum', 'Catalan')
2.467741935483871
>>> cmp.dist_abs('ATCG', 'TAGC')
1.193548387096774
>>> cmp = PhoneticEditDistance(mode='osa')
>>> cmp.dist_abs('ATCG', 'TAGC')
0.46236225806451603
>>> cmp.dist_abs('ACTG', 'TAGC')
1.2580645161290323
New in version 0.4.1.
- class abydos.distance.PositionalQGramDice(max_dist: int = 1, tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_Distance
Positional Q-Gram Dice coefficient.
Positional Q-Gram Dice coefficient [Chr06, GIJ+01]
New in version 0.4.0.
Initialize PositionalQGramDice instance.
- Parameters:
max_dist (int) -- The maximum positional distance between two q-grams to count as a match.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Positional Q-Gram Dice coefficient of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Positional Q-Gram Dice coefficient
- Return type:
float
Examples
>>> cmp = PositionalQGramDice()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.36363636363636365
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.PositionalQGramJaccard(max_dist: int = 1, tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_Distance
Positional Q-Gram Jaccard coefficient.
Positional Q-Gram Jaccard coefficient [Chr06, GIJ+01]
New in version 0.4.0.
Initialize PositionalQGramJaccard instance.
- Parameters:
max_dist (int) -- The maximum positional distance between two q-grams to count as a match.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Positional Q-Gram Jaccard coefficient of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Positional Q-Gram Jaccard coefficient
- Return type:
float
Examples
>>> cmp = PositionalQGramJaccard()
>>> cmp.sim('cat', 'hat')
0.3333333333333333
>>> cmp.sim('Niall', 'Neil')
0.2222222222222222
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.PositionalQGramOverlap(max_dist: int = 1, tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_Distance
Positional Q-Gram Overlap coefficient.
Positional Q-Gram Overlap coefficient [Chr06, GIJ+01]
New in version 0.4.0.
Initialize PositionalQGramOverlap instance.
- Parameters:
max_dist (int) -- The maximum positional distance between two q-grams to count as a match.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Positional Q-Gram Overlap coefficient of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Positional Q-Gram Overlap coefficient
- Return type:
float
Examples
>>> cmp = PositionalQGramOverlap()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.4
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
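The three positional q-gram classes above differ only in the final ratio. The sketch below is one plausible reading of positional matching, not the library's algorithm: it assumes padded bigrams tagged with their start index, greedy one-to-one matching of equal q-grams whose positions differ by at most max_dist, and the usual Dice, Jaccard, and overlap ratios over the match count. All function names are hypothetical.

```python
def positional_qgrams(word, q=2, start="$", stop="#"):
    """Return (q-gram, position) pairs for a padded word (padding is assumed)."""
    padded = start * (q - 1) + word + stop * (q - 1)
    return [(padded[i:i + q], i) for i in range(len(padded) - q + 1)]


def positional_coefficients(src, tar, max_dist=1, q=2):
    """Return Dice, Jaccard, and overlap ratios over positional matches."""
    src_grams = positional_qgrams(src, q)
    tar_grams = positional_qgrams(tar, q)
    s, t = len(src_grams), len(tar_grams)
    if s == 0 or t == 0:
        return {"dice": 0.0, "jaccard": 0.0, "overlap": 0.0}
    used = set()  # indices of tar q-grams that are already matched
    matches = 0
    for gram, pos in src_grams:
        for idx, (other, other_pos) in enumerate(tar_grams):
            if idx not in used and gram == other and abs(pos - other_pos) <= max_dist:
                used.add(idx)
                matches += 1
                break
    return {
        "dice": 2 * matches / (s + t),
        "jaccard": matches / (s + t - matches),
        "overlap": matches / min(s, t),
    }


print(positional_coefficients('Niall', 'Neil'))
```

Under these assumptions 'aluminum' and 'Catalan' score 0.0 even though both contain 'al', because the two occurrences sit three positions apart.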
- class abydos.distance.Prefix(**kwargs: Any)[source]
Bases:
_Distance
Prefix similarity and distance.
New in version 0.3.6.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the prefix similarity of two strings.
Prefix similarity is the ratio of the length of the shorter term that exactly matches the longer term to the length of the shorter term, beginning at the start of both terms.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Prefix similarity
- Return type:
float
Examples
>>> cmp = Prefix()
>>> cmp.sim('cat', 'hat')
0.0
>>> cmp.sim('Niall', 'Neil')
0.25
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
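The prefix similarity definition above can be sketched in plain Python. This is a minimal illustration, not the library's implementation; the function name `prefix_similarity` and the handling of empty strings are assumptions made for the example.

```python
def prefix_similarity(src: str, tar: str) -> float:
    """Ratio of the common-prefix length to the length of the shorter string."""
    if src == tar:
        return 1.0
    if not src or not tar:
        # Assumption: an empty string shares no prefix with a non-empty one.
        return 0.0
    common = 0
    for s_char, t_char in zip(src, tar):
        if s_char != t_char:
            break
        common += 1
    return common / min(len(src), len(tar))


print(prefix_similarity('Niall', 'Neil'))  # only 'N' is shared; shorter length is 4
```

With the documented inputs, 'Niall' and 'Neil' share the one-character prefix 'N', and the shorter string has four characters, which gives the 0.25 shown in the examples.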
- class abydos.distance.QGram(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
q-gram distance.
For two multisets X and Y, q-gram distance [Ukk92] is
\[dist_{QGram}(X, Y) = |X \triangle Y|\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{QGram} = b+c\]
Notes
This class uses bigrams without appended start or stop symbols, by default, as in [Ukk92]'s examples. It is described as the \(L_1\) norm of the difference of two strings' q-gram profiles, which are the vectors of q-gram occurrences. But this norm is simply the symmetric difference of the two multisets.
There aren't any limitations on which tokenizer is used with this class, but, as the name would imply, q-grams are expected and the default.
The normalized form uses the union of X and Y, making it equivalent to the Jaccard distance (Jaccard), but the Jaccard class uses bigrams with start & stop symbols by default.
New in version 0.4.0.
Initialize QGram instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized q-gram distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized q-gram distance
- Return type:
float
Examples
>>> cmp = QGram()
>>> cmp.sim('cat', 'hat')
0.33333333333333337
>>> cmp.sim('Niall', 'Neil')
0.0
>>> cmp.sim('aluminum', 'Catalan')
0.08333333333333337
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the q-gram distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
q-gram distance
- Return type:
int
Examples
>>> cmp = QGram()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
7
>>> cmp.dist_abs('aluminum', 'Catalan')
11
>>> cmp.dist_abs('ATCG', 'TAGC')
6
>>> cmp.dist_abs('01000', '001111')
5
New in version 0.4.0.
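The absolute q-gram distance described above, the size of the symmetric difference of two q-gram multisets, can be sketched with `collections.Counter`. This is an illustrative sketch rather than the library's code; the helper names `qgrams` and `qgram_distance` are hypothetical, and the sketch follows the documented default of bigrams without start or stop symbols.

```python
from collections import Counter


def qgrams(text: str, q: int = 2) -> Counter:
    """Multiset of overlapping q-grams, without start or stop padding."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(src: str, tar: str, q: int = 2) -> int:
    """Size of the multiset symmetric difference of the two q-gram profiles."""
    x, y = qgrams(src, q), qgrams(tar, q)
    # Counter subtraction keeps only positive counts, so the two one-sided
    # differences together form the symmetric difference (b + c).
    return sum(((x - y) + (y - x)).values())


print(qgram_distance('cat', 'hat'))  # 'ca' and 'ha' are unshared
```

Because the counts are multiset counts, a repeated bigram such as 'um' in 'aluminum' contributes once per occurrence.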
- class abydos.distance.QuantitativeCosine(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Quantitative Cosine similarity.
For two multisets X and Y drawn from an alphabet S, Quantitative Cosine similarity is
\[sim_{QuantitativeCosine}(X, Y) = \frac{\sum_{i \in S} X_iY_i} {\sqrt{\sum_{i \in S} X_i^2}\sqrt{\sum_{i \in S} Y_i^2}}\]
New in version 0.4.0.
Initialize QuantitativeCosine instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Quantitative Cosine similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Quantitative Cosine similarity
- Return type:
float
Examples
>>> cmp = QuantitativeCosine()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3651483716701107
>>> cmp.sim('aluminum', 'Catalan')
0.10660035817780521
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.QuantitativeDice(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Quantitative Dice similarity.
For two multisets X and Y drawn from an alphabet S, Quantitative Dice similarity is
\[sim_{QuantitativeDice}(X, Y) = \frac{2 \cdot \sum_{i \in S} X_iY_i} {\sum_{i \in S} X_i^2 + \sum_{i \in S} Y_i^2}\]
New in version 0.4.0.
Initialize QuantitativeDice instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Quantitative Dice similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Quantitative Dice similarity
- Return type:
float
Examples
>>> cmp = QuantitativeDice()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.36363636363636365
>>> cmp.sim('aluminum', 'Catalan')
0.10526315789473684
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.QuantitativeJaccard(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Quantitative Jaccard similarity.
For two multisets X and Y drawn from an alphabet S, Quantitative Jaccard similarity is
\[sim_{QuantitativeJaccard}(X, Y) = \frac{\sum_{i \in S} X_iY_i} {\sum_{i \in S} X_i^2 + \sum_{i \in S} Y_i^2 - \sum_{i \in S} X_iY_i}\]
New in version 0.4.0.
Initialize QuantitativeJaccard instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Quantitative Jaccard similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Quantitative Jaccard similarity
- Return type:
float
Examples
>>> cmp = QuantitativeJaccard()
>>> cmp.sim('cat', 'hat')
0.3333333333333333
>>> cmp.sim('Niall', 'Neil')
0.2222222222222222
>>> cmp.sim('aluminum', 'Catalan')
0.05555555555555555
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
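The three quantitative formulas (Cosine, Dice, Jaccard) share the same three sums, which the following sketch computes over bigram counts. It is an illustration only, not the library's implementation. The helper names are hypothetical, and the tokenization (bigrams padded with the start symbol '$' and stop symbol '#') is an assumption chosen because it is consistent with the documented example values.

```python
from collections import Counter
from math import sqrt


def padded_bigrams(text: str, start: str = '$', stop: str = '#') -> Counter:
    """Bigram counts with assumed start/stop padding symbols."""
    padded = start + text + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def _sums(src: str, tar: str):
    """Return (sum X_i*Y_i, sum X_i^2, sum Y_i^2) over the token alphabet."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    return dot, sum(v * v for v in x.values()), sum(v * v for v in y.values())


def quantitative_cosine(src: str, tar: str) -> float:
    dot, xx, yy = _sums(src, tar)
    return dot / (sqrt(xx) * sqrt(yy))


def quantitative_dice(src: str, tar: str) -> float:
    dot, xx, yy = _sums(src, tar)
    return 2 * dot / (xx + yy)


def quantitative_jaccard(src: str, tar: str) -> float:
    dot, xx, yy = _sums(src, tar)
    return dot / (xx + yy - dot)


print(quantitative_cosine('cat', 'hat'), quantitative_dice('cat', 'hat'))
```

For 'cat' and 'hat' the shared padded bigrams are 'at' and 't#', so the dot product is 2 and each squared norm is 4.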
- class abydos.distance.RatcliffObershelp(**kwargs: Any)[source]
Bases:
_Distance
Ratcliff-Obershelp similarity.
This follows the Ratcliff-Obershelp algorithm [RM88] to derive a similarity measure:
1. Find the length of the longest common substring in src & tar.
2. Recurse on the strings to the left & right of this substring in src & tar. The base case is a 0 length common substring, in which case, return 0. Otherwise, return the sum of the current longest common substring and the left & right recursed sums.
3. Multiply this length by 2 and divide by the sum of the lengths of src & tar.
Cf. http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970
New in version 0.3.6.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Ratcliff-Obershelp similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Ratcliff-Obershelp similarity
- Return type:
float
Examples
>>> cmp = RatcliffObershelp()
>>> round(cmp.sim('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.666666666667
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.4
>>> cmp.sim('ATCG', 'TAGC')
0.5
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
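The recursive procedure described for Ratcliff-Obershelp similarity can be sketched as follows. This is a minimal illustration, not the library's implementation. The helper names are hypothetical, and the tie-breaking rule (take the first longest common substring found when scanning src left to right) is an assumption; other tie-breaking choices can give different results for some inputs.

```python
def _longest_common_substring(a: str, b: str):
    """Return (start in a, start in b, length) of the first longest match."""
    best_i = best_j = best_len = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:  # strict '>' keeps the earliest match
                    best_len = cur[j]
                    best_i, best_j = i - best_len, j - best_len
        prev = cur
    return best_i, best_j, best_len


def _matched_chars(a: str, b: str) -> int:
    """Sum of the longest common substring and the left & right recursions."""
    i, j, length = _longest_common_substring(a, b)
    if length == 0:
        return 0
    return (_matched_chars(a[:i], b[:j]) + length
            + _matched_chars(a[i + length:], b[j + length:]))


def ratcliff_obershelp(src: str, tar: str) -> float:
    if src == tar:
        return 1.0
    return 2 * _matched_chars(src, tar) / (len(src) + len(tar))


print(round(ratcliff_obershelp('cat', 'hat'), 12))
```

For 'aluminum' and 'Catalan', the first match is 'al' (length 2), the left recursion finds nothing, and the right recursion finds 'n', giving 3 matched characters and 2 * 3 / 15 = 0.4.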
- class abydos.distance.RaupCrick(**kwargs: Any)[source]
Bases:
_TokenDistance
Raup-Crick similarity.
For two sets X and Y and a population N, Raup-Crick similarity [RC79] is:
Notes
Observe that Raup-Crick similarity is related to Henderson-Heron similarity in that the former is the sum of all Henderson-Heron similarities for an intersection size ranging from 0 to the true intersection size.
New in version 0.4.1.
Initialize RaupCrick instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.1.
- sim(src: str, tar: str) float [source]
Return the Raup-Crick similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Raup-Crick similarity
- Return type:
float
Examples
>>> cmp = RaupCrick()
>>> cmp.sim('cat', 'hat')
0.9999998002120004
>>> cmp.sim('Niall', 'Neil')
0.9999975146378747
>>> cmp.sim('aluminum', 'Catalan')
0.9968397599851411
>>> cmp.sim('ATCG', 'TAGC')
0.9684367974410505
New in version 0.4.1.
- class abydos.distance.ReesLevenshtein(block_limit: int = 2, normalizer: Callable[[List[float]], float] = max, **kwargs: Any)[source]
Bases:
_Distance
Rees-Levenshtein distance.
Rees-Levenshtein distance [Ree14, RB13] is the "Modified Damerau-Levenshtein Distance Algorithm", created by Tony Rees as part of Taxamatch.
New in version 0.4.0.
Initialize ReesLevenshtein instance.
- Parameters:
block_limit (int) -- The block length limit
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Rees-Levenshtein distance of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Rees-Levenshtein distance
- Return type:
float
Examples
>>> cmp = ReesLevenshtein()
>>> cmp.dist('cat', 'hat')
0.3333333333333333
>>> cmp.dist('Niall', 'Neil')
0.6
>>> cmp.dist('aluminum', 'Catalan')
0.875
>>> cmp.dist('ATCG', 'TAGC')
0.5
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Rees-Levenshtein distance of two strings.
This is a straightforward port of the PL/SQL implementation at https://confluence.csiro.au/public/taxamatch/the-mdld-modified-damerau-levenshtein-distance-algorithm
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Rees-Levenshtein distance
- Return type:
float
Examples
>>> cmp = ReesLevenshtein()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
7
>>> cmp.dist_abs('ATCG', 'TAGC')
2
New in version 0.4.0.
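The relationship between dist_abs and dist in the examples above can be illustrated with a small sketch of the normalization step. This is not the library's code: the function name `normalize_distance` is hypothetical, and passing the two string lengths to the normalizer is an assumption that happens to reproduce the documented values when the normalizer is max.

```python
from typing import Callable, List


def normalize_distance(
    dist_abs: float,
    src: str,
    tar: str,
    normalizer: Callable[[List[float]], float] = max,
) -> float:
    """Divide an absolute edit distance by a normalization term."""
    if src == tar:
        return 0.0
    # Assumption: the normalizer receives the lengths of the two strings.
    return dist_abs / normalizer([len(src), len(tar)])


# Absolute distances taken from the dist_abs examples above.
print(normalize_distance(3, 'Niall', 'Neil'))       # max normalizer
print(normalize_distance(3, 'Niall', 'Neil', sum))  # sum normalizer
```

With max, the distance is scaled by the longer string; with sum, it is scaled by the combined length, which yields smaller normalized values.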
- class abydos.distance.RelaxedHamming(tokenizer: Optional[_Tokenizer] = None, maxdist: int = 2, discount: float = 0.2, **kwargs: Any)[source]
Bases:
_Distance
Relaxed Hamming distance.
This is a variant of Hamming distance in which positionally close matches are considered partially matching.
New in version 0.4.1.
Initialize RelaxedHamming instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
maxdist (int) -- The maximum distance to consider for discounting.
discount (float) -- The discount factor multiplied by the distance from the source string position.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.1.
- dist(src: str, tar: str) float [source]
Return the normalized relaxed Hamming distance between strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized relaxed Hamming distance
- Return type:
float
Examples
>>> cmp = RelaxedHamming()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> cmp.dist('Niall', 'Neil')
0.27999999999999997
>>> cmp.dist('aluminum', 'Catalan')
0.8
>>> cmp.dist('ATCG', 'TAGC')
0.2
New in version 0.4.1.
- dist_abs(src: str, tar: str) float [source]
Return the relaxed Hamming distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Relaxed Hamming distance
- Return type:
float
Examples
>>> cmp = RelaxedHamming()
>>> cmp.dist_abs('cat', 'hat')
1.0
>>> cmp.dist_abs('Niall', 'Neil')
1.4
>>> cmp.dist_abs('aluminum', 'Catalan')
6.4
>>> cmp.dist_abs('ATCG', 'TAGC')
0.8
New in version 0.4.1.
- class abydos.distance.Roberts(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Roberts similarity.
For two multisets X and Y drawn from an alphabet S, Roberts similarity [Rob86] is
\[sim_{Roberts}(X, Y) = \frac{\Big[\sum_{i \in S} (X_i + Y_i) \cdot \frac{min(X_i, Y_i)}{max(X_i, Y_i)}\Big]} {\sum_{i \in S} (X_i + Y_i)}\]
New in version 0.4.0.
Initialize Roberts instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Roberts similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Roberts similarity
- Return type:
float
Examples
>>> cmp = Roberts()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.36363636363636365
>>> cmp.sim('aluminum', 'Catalan')
0.11764705882352941
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
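The Roberts formula weights each token's combined count by the ratio of its smaller count to its larger count. A minimal sketch follows; it is illustrative only, the function names are hypothetical, and the padded-bigram tokenization with '$' and '#' is an assumption that is consistent with the documented example values.

```python
from collections import Counter


def padded_bigrams(text: str, start: str = '$', stop: str = '#') -> Counter:
    """Bigram counts with assumed start/stop padding symbols."""
    padded = start + text + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def roberts_similarity(src: str, tar: str) -> float:
    x, y = padded_bigrams(src), padded_bigrams(tar)
    tokens = x.keys() | y.keys()  # every token has a count >= 1 on one side
    numerator = sum(
        (x[t] + y[t]) * min(x[t], y[t]) / max(x[t], y[t]) for t in tokens
    )
    denominator = sum(x[t] + y[t] for t in tokens)
    return numerator / denominator


print(roberts_similarity('cat', 'hat'))
```

Tokens present in only one string contribute to the denominator but add zero to the numerator, since their minimum count is 0.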
- class abydos.distance.RogersTanimoto(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Rogers & Tanimoto similarity.
For two sets X and Y and a population N, the Rogers-Tanimoto similarity [RT60] is
\[sim_{RogersTanimoto}(X, Y) = \frac{|X \cap Y| + |(N \setminus X) \setminus Y|} {|X \setminus Y| + |Y \setminus X| + |N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{RogersTanimoto} = \frac{a+d}{b+c+n}\]
New in version 0.4.0.
Initialize RogersTanimoto instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Rogers & Tanimoto similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Rogers & Tanimoto similarity
- Return type:
float
Examples
>>> cmp = RogersTanimoto()
>>> cmp.sim('cat', 'hat')
0.9898477157360406
>>> cmp.sim('Niall', 'Neil')
0.9823008849557522
>>> cmp.sim('aluminum', 'Catalan')
0.9625
>>> cmp.sim('ATCG', 'TAGC')
0.9748110831234257
New in version 0.4.0.
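The 2x2 confusion table form of the Rogers & Tanimoto formula can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: the helper names are hypothetical, the tokens are bigrams padded with '$' and '#', and the fixed population size of 784 is inferred from the documented example values rather than confirmed by the source. The library derives its population from the alphabet argument, so results can differ, for example when a string contains repeated q-grams.

```python
from collections import Counter


def padded_bigrams(text: str, start: str = '$', stop: str = '#') -> Counter:
    """Bigram counts with assumed start/stop padding symbols."""
    padded = start + text + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def confusion_counts(src: str, tar: str, population: int = 784):
    """Return (a, b, c, d): shared, src-only, tar-only, and absent tokens."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    a = sum((x & y).values())
    b = sum((x - y).values())
    c = sum((y - x).values())
    d = population - (a + b + c)  # assumed fixed population size
    return a, b, c, d


def rogers_tanimoto(src: str, tar: str, population: int = 784) -> float:
    a, b, c, d = confusion_counts(src, tar, population)
    n = a + b + c + d
    return (a + d) / (b + c + n)


print(confusion_counts('cat', 'hat'))
print(rogers_tanimoto('cat', 'hat'))
```

Because d dominates the table for short strings, measures that count shared absences in the numerator, such as this one, sit close to 1 even for dissimilar strings.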
- class abydos.distance.RogotGoldberg(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Rogot & Goldberg similarity.
For two sets X and Y and a population N, Rogot & Goldberg's "second index adjusted agreement" \(A_2\) [RG66] is
\[sim_{RogotGoldberg}(X, Y) = \frac{1}{2}\Bigg( \frac{2|X \cap Y|}{|X|+|Y|} + \frac{2|(N \setminus X) \setminus Y|} {|N \setminus X|+|N \setminus Y|} \Bigg)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{RogotGoldberg} = \frac{1}{2}\Bigg( \frac{2a}{2a+b+c} + \frac{2d}{2d+b+c} \Bigg)\]
New in version 0.4.0.
Initialize RogotGoldberg instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Rogot & Goldberg similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Rogot & Goldberg similarity
- Return type:
float
Examples
>>> cmp = RogotGoldberg()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6795702691656449
>>> cmp.sim('aluminum', 'Catalan')
0.5539941668876179
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
- class abydos.distance.RougeL(**kwargs: Any)[source]
Bases:
_Distance
Rouge-L similarity.
Rouge-L similarity [Lin04]
New in version 0.4.0.
Initialize RougeL instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str, beta: float = 8) float [source]
Return the Rouge-L similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
beta (int or float) -- A weighting factor to prejudice similarity towards src
- Returns:
Rouge-L similarity
- Return type:
float
Examples
>>> cmp = RougeL()
>>> cmp.sim('cat', 'hat')
0.6666666666666666
>>> cmp.sim('Niall', 'Neil')
0.6018518518518519
>>> cmp.sim('aluminum', 'Catalan')
0.3757225433526012
>>> cmp.sim('ATCG', 'TAGC')
0.5
New in version 0.4.0.
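As a rough illustration of how a Rouge-L score with a beta weight can be computed at the character level, the sketch below combines two longest-common-subsequence ratios into a weighted F-measure. It is not the library's implementation. The function names are hypothetical, and treating the src-based ratio as recall and the tar-based ratio as precision is an assumption that reproduces the documented example values.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (not substring) of a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == other else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(src: str, tar: str, beta: float = 8.0) -> float:
    if src == tar:
        return 1.0
    lcs = lcs_length(src, tar)
    if lcs == 0:
        return 0.0
    recall = lcs / len(src)     # assumption: src plays the reference role
    precision = lcs / len(tar)
    beta_sq = beta ** 2
    return (1 + beta_sq) * recall * precision / (recall + beta_sq * precision)


print(rouge_l('Niall', 'Neil'))
```

With beta = 8, the src-based ratio dominates: for 'Niall' and 'Neil' the LCS is 'Nil', the ratios are 3/5 and 3/4, and the result stays close to 3/5.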
- class abydos.distance.RougeS(qval: int = 2, **kwargs: Any)[source]
Bases:
_Distance
Rouge-S similarity.
Rouge-S similarity [Lin04], operating on character-level skipgrams
New in version 0.4.0.
Initialize RougeS instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str, beta: float = 8) float [source]
Return the Rouge-S similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
beta (int or float) -- A weighting factor to prejudice similarity towards src
- Returns:
Rouge-S similarity
- Return type:
float
Examples
>>> cmp = RougeS()
>>> cmp.sim('cat', 'hat')
0.3333333333333333
>>> cmp.sim('Niall', 'Neil')
0.30185758513931893
>>> cmp.sim('aluminum', 'Catalan')
0.10755653612796467
>>> cmp.sim('ATCG', 'TAGC')
0.6666666666666666
New in version 0.4.0.
- class abydos.distance.RougeSU(qval: int = 2, **kwargs: Any)[source]
Bases:
RougeS
Rouge-SU similarity.
Rouge-SU similarity [Lin04], operating on character-level skipgrams
New in version 0.4.0.
Initialize RougeSU instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str, beta: float = 8) float [source]
Return the Rouge-SU similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
beta (int or float) -- A weighting factor to prejudice similarity towards src
- Returns:
Rouge-SU similarity
- Return type:
float
Examples
>>> cmp = RougeSU()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.4020618556701031
>>> cmp.sim('aluminum', 'Catalan')
0.1672384219554031
>>> cmp.sim('ATCG', 'TAGC')
0.8
New in version 0.4.0.
- class abydos.distance.RougeW(f_func: Optional[Callable[[float], float]] = None, f_inv: Optional[Callable[[float], float]] = None, **kwargs: Any)[source]
Bases:
_Distance
Rouge-W similarity.
Rouge-W similarity [Lin04]
New in version 0.4.0.
Initialize RougeW instance.
- Parameters:
f_func (function) -- A weighting function based on the value supplied to this function, such that f(x+y) > f(x) + f(y)
f_inv (function) -- The closed-form inverse of f_func
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str, beta: float = 8) float [source]
Return the Rouge-W similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
beta (int or float) -- A weighting factor to prejudice similarity towards src
- Returns:
Rouge-W similarity
- Return type:
float
Examples
>>> cmp = RougeW()
>>> cmp.sim('cat', 'hat')
0.6666666666666666
>>> cmp.sim('Niall', 'Neil')
0.34747932867894143
>>> cmp.sim('aluminum', 'Catalan')
0.280047049205176
>>> cmp.sim('ATCG', 'TAGC')
0.43301270189221935
New in version 0.4.0.
- wlcs(src: str, tar: str) float [source]
Return the Rouge-W weighted longest common sub-sequence length.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The weighted longest common subsequence length of src & tar
- Return type:
int (may return a float if cost has float values)
Examples
>>> cmp = RougeW()
>>> cmp.wlcs('cat', 'hat')
4
>>> cmp.wlcs('Niall', 'Neil')
3
>>> cmp.wlcs('aluminum', 'Catalan')
5
>>> cmp.wlcs('ATCG', 'TAGC')
3
New in version 0.4.0.
- class abydos.distance.RussellRao(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Russell & Rao similarity.
For two sets X and Y and a population N, the Russell & Rao similarity [RR40] is
\[sim_{RussellRao}(X, Y) = \frac{|X \cap Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{RussellRao} = \frac{a}{n}\]
New in version 0.4.0.
Initialize RussellRao instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Russell & Rao similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Russell & Rao similarity
- Return type:
float
Examples
>>> cmp = RussellRao()
>>> cmp.sim('cat', 'hat')
0.002551020408163265
>>> cmp.sim('Niall', 'Neil')
0.002551020408163265
>>> cmp.sim('aluminum', 'Catalan')
0.0012738853503184713
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.SAPS(cost: Tuple[int, int, int, int, int, int, int] = (1, -1, -4, 6, -2, -1, -3), normalizer: Callable[[List[float]], float] = max, tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_Distance
Syllable Alignment Pattern Searching similarity.
This is the alignment and similarity calculation described on pp. 917-918 of [RY05].
New in version 0.4.0.
Initialize SAPS instance.
- Parameters:
cost (tuple) --
A 7-tuple representing the costs of the seven possible cases:
syllable-internal match
syllable-internal mis-match
syllable-initial match or mismatch with syllable-internal
syllable-initial match
syllable-initial mis-match
syllable-internal gap
syllable-initial gap
(by default: (1, -1, -4, 6, -2, -1, -3))
normalizer (function) -- A function that takes a list and computes a normalization term by which the edit distance is divided (max by default). Another good option is the sum function.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized SAPS similarity between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The normalized SAPS similarity between src & tar
- Return type:
float
Examples
>>> cmp = SAPS()
>>> round(cmp.sim('cat', 'hat'), 12)
0.0
>>> round(cmp.sim('Niall', 'Neil'), 12)
0.2
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the SAPS similarity between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The SAPS similarity between src & tar
- Return type:
int
Examples
>>> cmp = SAPS()
>>> cmp.sim_score('cat', 'hat')
0
>>> cmp.sim_score('Niall', 'Neil')
3
>>> cmp.sim_score('aluminum', 'Catalan')
-11
>>> cmp.sim_score('ATCG', 'TAGC')
-1
>>> cmp.sim_score('Stevenson', 'Stinson')
16
New in version 0.4.0.
- class abydos.distance.SSK(tokenizer: Optional[_Tokenizer] = None, ssk_lambda: float = 0.9, **kwargs: Any)[source]
Bases:
_TokenDistance
String subsequence kernel (SSK) similarity.
This is based on [LSShaweTaylor+02].
New in version 0.4.1.
Initialize SSK instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
ssk_lambda (float or Iterable) -- A value in the range (0.0, 1.0) used for discounting gaps between characters according to the method described in [LSShaweTaylor+02]. To supply multiple values of lambda, provide an Iterable of numeric values, such as (0.5, 0.05) or np.arange(0.05, 0.5, 0.05)
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-skipgram. Using this parameter and tokenizer=None will cause the instance to use the QSkipgrams tokenizer with this q value.
New in version 0.4.1.
- sim(src: str, tar: str) float [source]
Return the normalized SSK similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized string subsequence kernel similarity
- Return type:
float
Examples
>>> cmp = SSK()
>>> cmp.sim('cat', 'hat')
0.3558718861209964
>>> cmp.sim('Niall', 'Neil')
0.4709007822130597
>>> cmp.sim('aluminum', 'Catalan')
0.13760157193822603
>>> cmp.sim('ATCG', 'TAGC')
0.6140899528060498
New in version 0.4.1.
- sim_score(src: str, tar: str) float [source]
Return the SSK similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
String subsequence kernel similarity
- Return type:
float
Examples
>>> cmp = SSK()
>>> cmp.dist_abs('cat', 'hat')
0.6441281138790036
>>> cmp.dist_abs('Niall', 'Neil')
0.5290992177869402
>>> cmp.dist_abs('aluminum', 'Catalan')
0.862398428061774
>>> cmp.dist_abs('ATCG', 'TAGC')
0.38591004719395017
New in version 0.4.1.
- class abydos.distance.ScottPi(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Scott's Pi correlation.
For two sets X and Y and a population N, Scott's \(\pi\) correlation [Sco55] is
\[corr_{Scott_\pi}(X, Y) = \pi = \frac{p_o - p_e^\pi}{1 - p_e^\pi}\]
where
\[\begin{array}{ll} p_o &= \frac{|X \cap Y| + |(N \setminus X) \setminus Y|}{|N|}\\ p_e^\pi &= \Big(\frac{|X| + |Y|}{2 \cdot |N|}\Big)^2 + \Big(\frac{|N \setminus X| + |N \setminus Y|}{2 \cdot |N|}\Big)^2 \end{array}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[\begin{array}{ll} p_o &= \frac{a+d}{n}\\ p_e^\pi &= \Big(\frac{2a+b+c}{2n}\Big)^2 + \Big(\frac{2d+b+c}{2n}\Big)^2 \end{array}\]
New in version 0.4.0.
Initialize ScottPi instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Scott's Pi correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Scott's Pi correlation
- Return type:
float
Examples
>>> cmp = ScottPi()
>>> cmp.corr('cat', 'hat')
0.49743589743589733
>>> cmp.corr('Niall', 'Neil')
0.35914053833129245
>>> cmp.corr('aluminum', 'Catalan')
0.10798833377524023
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237489689
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Scott's Pi similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Scott's Pi similarity
- Return type:
float
Examples
>>> cmp = ScottPi()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6795702691656462
>>> cmp.sim('aluminum', 'Catalan')
0.5539941668876202
>>> cmp.sim('ATCG', 'TAGC')
0.49679075738125517
New in version 0.4.0.
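The confusion table form of Scott's pi can be evaluated directly from the four cell counts, as in the sketch below. This is illustrative rather than the library's code. The function names are hypothetical, the guard for an expected agreement of 1 is an assumption, and the mapping from correlation to similarity, (1 + corr) / 2, is inferred from the documented corr and sim example values rather than stated in the source. The counts used in the demonstration (a=2, b=2, c=2, d=778) are likewise an assumption about how 'cat' and 'hat' tokenize into padded bigrams over a population of 784.

```python
def scott_pi(a: float, b: float, c: float, d: float) -> float:
    """Scott's pi correlation from 2x2 confusion table counts."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((2 * a + b + c) / (2 * n)) ** 2 + ((2 * d + b + c) / (2 * n)) ** 2
    if p_e == 1.0:
        # Assumption: with no room for chance-corrected disagreement,
        # treat the two sets as perfectly correlated.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def scott_pi_sim(a: float, b: float, c: float, d: float) -> float:
    """Map the correlation from [-1, 1] onto [0, 1] (inferred mapping)."""
    return (1 + scott_pi(a, b, c, d)) / 2


print(scott_pi(2, 2, 2, 778))
print(scott_pi_sim(2, 2, 2, 778))
```

A slightly negative correlation, as in the 'ATCG' and 'TAGC' example, maps to a similarity just under 0.5.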
- class abydos.distance.Shape(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Penrose's shape difference.
For two sets X and Y and a population N, Penrose's shape difference [Pen52] is
\[dist_{Shape}(X, Y) = \frac{1}{|N|}\cdot\Big(\sum_{x \in (X \triangle Y)} x^2\Big) - \Big(\frac{|X \triangle Y|}{|N|}\Big)^2\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{Shape} = \frac{1}{n}\Big(\sum_{x \in b} x^2 + \sum_{x \in c} x^2\Big) - \Big(\frac{b+c}{n}\Big)^2\]
In [Cor17], the formula is instead \(\frac{n(b+c)-(b-c)^2}{n^2}\), but it is clear from [Pen52] that this should not be an asymmetric value with respect to the ordering of the two sets, among other errors in this formula. Meanwhile, [DD16] gives the formula \(\sqrt{\sum((x_i-\bar{x})-(y_i-\bar{y}))^2}\).
New in version 0.4.0.
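As a rough illustration of the shape difference formula, the sketch below evaluates it on multisets of character bigrams. The helper names, the '$' and '#' padding, and the population size n=784 are assumptions made for this example (784 was chosen because it is consistent with the 'cat'/'hat' value listed for this class); none of them is a documented default.

```python
from collections import Counter


def padded_bigrams(word, start="$", stop="#"):
    """Return a multiset of character bigrams of the padded word."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def shape_difference(src, tar, n=784):
    """Penrose's shape difference; n is an assumed population size."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    only_x = x - y  # tokens that fall in cell b
    only_y = y - x  # tokens that fall in cell c
    sum_sq = sum(v * v for v in only_x.values())
    sum_sq += sum(v * v for v in only_y.values())
    b_plus_c = sum(only_x.values()) + sum(only_y.values())
    return sum_sq / n - (b_plus_c / n) ** 2


# The complement of the distance, for comparison with the sim() examples
print(1 - shape_difference("cat", "hat"))
```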
Initialize Shape instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the Penrose's shape difference of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Shape difference
- Return type:
float
Examples
>>> cmp = Shape()
>>> cmp.sim('cat', 'hat')
0.994923990004165
>>> cmp.sim('Niall', 'Neil')
0.9911511479591837
>>> cmp.sim('aluminum', 'Catalan')
0.9787090754188811
>>> cmp.sim('ATCG', 'TAGC')
0.9874075905872554
New in version 0.4.0.
- class abydos.distance.ShapiraStorerI(cost: Tuple[int, int] = (1, 1), prime: bool = False, **kwargs: Any)[source]
Bases:
_Distance
Shapira & Storer I edit distance with block moves, greedy algorithm.
Shapira & Storer's greedy edit distance [SS07] is similar to Levenshtein edit distance, but with two important distinctions:
It considers blocks of characters, if they occur in both the source and target strings, so the edit distance between 'abcab' and 'abc' is only 1, since the substring 'ab' occurs in both and can be inserted as a block into 'abc'.
It allows three edit operations: insert, delete, and move (but not substitute). Thus the distance between 'abcde' and 'deabc' is only 1 because the block 'abc' can be moved in 1 move operation, rather than being deleted and inserted in 2 separate operations.
If prime is set to True at initialization, this employs the greedy' algorithm, which limits replacements of blocks in the two strings to matching occurrences of the LCS.
New in version 0.4.0.
Initialize ShapiraStorerI instance.
- Parameters:
cost ((int, int)) -- A tuple representing the insertion & deletion costs
prime (bool) -- If True, employs the greedy' algorithm rather than greedy
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Shapira & Storer I distance.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The normalized Shapira & Storer I distance between src & tar
- Return type:
float
Examples
>>> cmp = ShapiraStorerI()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.333333333333
>>> cmp.dist('aluminum', 'Catalan')
0.6
>>> cmp.dist('ATCG', 'TAGC')
0.25
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Shapira & Storer I edit distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Shapira & Storer I edit distance between src & tar
- Return type:
int
Examples
>>> cmp = ShapiraStorerI()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
3
>>> cmp.dist_abs('aluminum', 'Catalan')
9
>>> cmp.dist_abs('ATCG', 'TAGC')
2
New in version 0.4.0.
- class abydos.distance.Sift4(max_offset: int = 5, max_distance: int = 0, **kwargs: Any)[source]
Bases:
_Distance
Sift4 Common version.
This is an approximation of edit distance, described in [Zac14].
New in version 0.3.6.
Initialize Sift4 instance.
- Parameters:
max_offset (int) -- The number of characters to search for matching letters
max_distance (int) -- The distance at which to stop and exit
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized "common" Sift4 distance between two terms.
This is Sift4 distance, normalized to [0, 1].
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The normalized Sift4 distance
- Return type:
float
Examples
>>> cmp = Sift4()
>>> round(cmp.dist('cat', 'hat'), 12)
0.333333333333
>>> cmp.dist('Niall', 'Neil')
0.4
>>> cmp.dist('Colin', 'Cuilen')
0.5
>>> cmp.dist('ATCG', 'TAGC')
0.5
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str) float [source]
Return the "common" Sift4 distance between two terms.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Sift4 distance according to the common formula
- Return type:
int
Examples
>>> cmp = Sift4()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('Colin', 'Cuilen')
3
>>> cmp.dist_abs('ATCG', 'TAGC')
2
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.Sift4Extended(max_offset: int = 5, max_distance: int = 0, tokenizer: Optional[_Tokenizer] = None, token_matcher: Optional[Callable[[str, str], bool]] = None, matching_evaluator: Optional[Callable[[str, str], float]] = None, local_length_evaluator: Optional[Callable[[float], float]] = None, transposition_cost_evaluator: Optional[Callable[[int, int], float]] = None, transpositions_evaluator: Optional[Callable[[float, float], float]] = None, **kwargs: Any)[source]
Bases:
_Distance
Sift4 Extended version.
This is an approximation of edit distance, described in [Zac14].
New in version 0.4.0.
Initialize Sift4Extended instance.
- Parameters:
max_offset (int) -- The number of characters to search for matching letters
max_distance (int) -- The distance at which to stop and exit
tokenizer (_Tokenizer) -- A tokenizer instance (character tokenization by default)
token_matcher (function) -- A token matcher function of two parameters (equality by default). Sift4Extended.sift4_token_matcher is also supplied.
matching_evaluator (function) -- A token match quality function of two parameters (1 by default). Sift4Extended.sift4_matching_evaluator is also supplied.
local_length_evaluator (function) -- A local length evaluator function (its single parameter by default). Sift4Extended.reward_length_evaluator and Sift4Extended.reward_length_evaluator_exp are also supplied.
transposition_cost_evaluator (function) -- A transposition cost evaluator function of two parameters (1 by default). Sift4Extended.longer_transpositions_are_more_costly is also supplied.
transpositions_evaluator (function) -- A transpositions evaluator function of two parameters (the second parameter subtracted from the first, by default).
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Sift4 Extended distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Sift4 distance according to the extended formula
- Return type:
int
Examples
>>> cmp = Sift4Extended()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('aluminum', 'Catalan')
5
>>> cmp.dist_abs('ATCG', 'TAGC')
2
New in version 0.4.0.
- static longer_transpositions_are_more_costly(pos1: int, pos2: int) float [source]
Longer Transpositions Are More Costly.
- Parameters:
pos1 (int) -- The position of the first transposition
pos2 (int) -- The position of the second transposition
- Returns:
A cost that grows as the difference in the positions increases
- Return type:
float
New in version 0.4.0.
- static reward_length_evaluator(length: int) float [source]
Reward Length Evaluator.
- Parameters:
length (int) -- The length of a local match
- Returns:
A reward value that grows sub-linearly
- Return type:
float
New in version 0.4.0.
- static reward_length_evaluator_exp(length: int) float [source]
Reward Length Evaluator.
- Parameters:
length (int) -- The length of a local match
- Returns:
A reward value that grows exponentially
- Return type:
float
New in version 0.4.0.
- class abydos.distance.Sift4Simplest(max_offset: int = 5, **kwargs: Any)[source]
Bases:
Sift4
Sift4 Simplest version.
This is an approximation of edit distance, described in [Zac14].
New in version 0.3.6.
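For orientation, the following is a minimal sketch of the simplest Sift4 variant as it is commonly published. It is written from the general description of the algorithm, not taken from this library's source; the function name and the extra bounds checks are choices made for this example. It happens to reproduce the dist_abs values listed for this class.

```python
def sift4_simplest(src, tar, max_offset=5):
    """Approximate edit distance with the simplest Sift4 variant (sketch)."""
    if not src:
        return len(tar)
    if not tar:
        return len(src)

    src_cur = tar_cur = 0  # cursors into src and tar
    lcss = 0               # total length of the matched runs found so far
    local_cs = 0           # length of the current run of matches

    while src_cur < len(src) and tar_cur < len(tar):
        if src[src_cur] == tar[tar_cur]:
            local_cs += 1
        else:
            lcss += local_cs
            local_cs = 0
            if src_cur != tar_cur:
                # Realign both cursors at the further position
                src_cur = tar_cur = max(src_cur, tar_cur)
            # Look ahead up to max_offset characters for a match
            for i in range(max_offset):
                src_pos, tar_pos = src_cur + i, tar_cur + i
                if src_pos >= len(src) and tar_pos >= len(tar):
                    break
                if (src_pos < len(src) and tar_cur < len(tar)
                        and src[src_pos] == tar[tar_cur]):
                    src_cur = src_pos
                    local_cs += 1
                    break
                if (tar_pos < len(tar) and src_cur < len(src)
                        and src[src_cur] == tar[tar_pos]):
                    tar_cur = tar_pos
                    local_cs += 1
                    break
        src_cur += 1
        tar_cur += 1

    lcss += local_cs
    return max(len(src), len(tar)) - lcss


print(sift4_simplest("Colin", "Cuilen"))
```

Because the search only looks max_offset characters ahead, the result is an approximation of edit distance rather than an exact value.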
Initialize Sift4Simplest instance.
- Parameters:
max_offset (int) -- The number of characters to search for matching letters
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the "simplest" Sift4 distance between two terms.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Sift4 distance according to the simplest formula
- Return type:
int
Examples
>>> cmp = Sift4Simplest()
>>> cmp.dist_abs('cat', 'hat')
1
>>> cmp.dist_abs('Niall', 'Neil')
2
>>> cmp.dist_abs('Colin', 'Cuilen')
3
>>> cmp.dist_abs('ATCG', 'TAGC')
2
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.SingleLinkage(tokenizer: Optional[_Tokenizer] = None, metric: Optional[_Distance] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Single linkage distance.
For two multisets X and Y, single linkage distance [DD16] is
\[dist_{SingleLinkage}(X, Y) = \min_{i \in X, j \in Y} dist(X_i, Y_j)\]
New in version 0.4.0.
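The minimum over token pairs can be sketched as follows. The tokenization (character bigrams padded with '$' and '#') and the plain Levenshtein helper are assumptions made for this example, and the function names are hypothetical rather than part of the library.

```python
def levenshtein(s, t):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def padded_bigrams(word, start="$", stop="#"):
    """Return the set of character bigrams of the padded word."""
    padded = start + word + stop
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def single_linkage(src, tar):
    """Smallest token-to-token distance between the two token sets."""
    return min(levenshtein(x, y)
               for x in padded_bigrams(src)
               for y in padded_bigrams(tar))


print(single_linkage("cat", "hat"))    # the two words share a bigram
print(single_linkage("ATCG", "TAGC"))  # no bigram is shared
```

Any shared token drives the distance to zero, which is why most of the example pairs for this class score 0.0.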
Initialize SingleLinkage instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants. (Defaults to Levenshtein distance)
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized single linkage distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
normalized single linkage distance
- Return type:
float
Examples
>>> cmp = SingleLinkage()
>>> cmp.dist('cat', 'hat')
0.0
>>> cmp.dist('Niall', 'Neil')
0.0
>>> cmp.dist('aluminum', 'Catalan')
0.0
>>> cmp.dist('ATCG', 'TAGC')
0.5
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the single linkage distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
single linkage distance
- Return type:
float
Examples
>>> cmp = SingleLinkage()
>>> cmp.dist_abs('cat', 'hat')
0.0
>>> cmp.dist_abs('Niall', 'Neil')
0.0
>>> cmp.dist_abs('aluminum', 'Catalan')
0.0
>>> cmp.dist_abs('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- class abydos.distance.Size(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Penrose's size difference.
For two sets X and Y and a population N, Penrose's size difference [Pen52] is
\[dist_{Size}(X, Y) = \frac{(|X \triangle Y|)^2}{|N|^2}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{Size} = \frac{(b+c)^2}{n^2}\]
In [Cor17], the formula is instead \(\frac{(b-c)^2}{n^2}\), but it is clear from [Pen52] that this should not be an asymmetric value with respect to the ordering of the two sets. Meanwhile, [DD16] gives a formula that is equivalent to \(\sqrt{n}\cdot(b+c)\).
New in version 0.4.0.
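A small numeric sketch of the confusion-table form of the size difference follows. The counts b=2, c=2 and n=784 are assumptions corresponding to 'cat' and 'hat' tokenized as padded bigrams; the function name is hypothetical.

```python
def size_difference(b, c, n):
    """Penrose's size difference from the off-diagonal cells b and c."""
    return (b + c) ** 2 / n ** 2


# Assumed counts for 'cat' vs. 'hat': two unmatched bigrams on each side
dist = size_difference(b=2, c=2, n=784)
print(dist)      # a very small distance
print(1 - dist)  # the complement, comparable to the sim() examples
```

Because the squared difference is divided by the squared population size, the distance stays close to 0 whenever n is much larger than b+c.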
Initialize Size instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the Penrose's size difference of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Size difference
- Return type:
float
Examples
>>> cmp = Size()
>>> cmp.sim('cat', 'hat')
0.9999739691795085
>>> cmp.sim('Niall', 'Neil')
0.9999202806122449
>>> cmp.sim('aluminum', 'Catalan')
0.9996348736257049
>>> cmp.sim('ATCG', 'TAGC')
0.9998373073719283
New in version 0.4.0.
- class abydos.distance.SmithWaterman(gap_cost: float = 1.0, sim_func: Optional[Callable[[str, str], float]] = None, **kwargs: Any)[source]
Bases:
NeedlemanWunsch
Smith-Waterman score.
The Smith-Waterman score [SW81] is a standard edit distance measure, differing from Needleman-Wunsch in that it focuses on local alignment and disallows negative scores.
New in version 0.3.6.
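The two properties named above (local alignment, no negative scores) can be seen in the textbook recurrence, sketched below. This is a generic illustration under stated assumptions, not the class's implementation: the default similarity is assumed to be 1 for identical characters and 0 otherwise, and the score returned is the maximum over all cells, so results may not match the outputs listed for this class for every pair.

```python
def smith_waterman_score(src, tar, gap_cost=1.0, sim_func=None):
    """Textbook Smith-Waterman local alignment score (sketch)."""
    if sim_func is None:
        # Assumed identity similarity: 1 for a match, 0 otherwise
        def sim_func(x, y):
            return 1.0 if x == y else 0.0

    prev = [0.0] * (len(tar) + 1)
    best = 0.0
    for cs in src:
        cur = [0.0]
        for j, ct in enumerate(tar, 1):
            score = max(
                0.0,                             # never drop below zero
                prev[j - 1] + sim_func(cs, ct),  # align the two characters
                prev[j] - gap_cost,              # gap in tar
                cur[j - 1] - gap_cost,           # gap in src
            )
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


print(smith_waterman_score("cat", "hat"))
```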
Initialize SmithWaterman instance.
- Parameters:
gap_cost (float) -- The cost of an alignment gap (1 by default)
sim_func (function) -- A function that returns the similarity of two characters (identity similarity by default)
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Smith-Waterman score of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Smith-Waterman score
- Return type:
float
Examples
>>> cmp = SmithWaterman()
>>> cmp.sim('cat', 'hat')
0.6666666666666667
>>> cmp.sim('Niall', 'Neil')
0.22360679774997896
>>> round(cmp.sim('aluminum', 'Catalan'), 12)
0.0
New in version 0.4.1.
- sim_score(src: str, tar: str) float [source]
Return the Smith-Waterman score of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Smith-Waterman score
- Return type:
float
Examples
>>> cmp = SmithWaterman()
>>> cmp.sim_score('cat', 'hat')
2.0
>>> cmp.sim_score('Niall', 'Neil')
1.0
>>> cmp.sim_score('aluminum', 'Catalan')
0.0
>>> cmp.sim_score('ATCG', 'TAGC')
1.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.SoftCosine(tokenizer: Optional[_Tokenizer] = None, metric: Optional[_Distance] = None, sim_method: str = 'a', **kwargs: Any)[source]
Bases:
_TokenDistance
Soft Cosine similarity.
As described in [SGGomezAP14], soft cosine similarity of two multi-sets X and Y, drawn from an alphabet S, is
\[sim_{soft cosine}(X, Y) = \frac{\sum_{i \in S}\sum_{j \in S} s_{ij} X_i Y_j} {\sqrt{\sum_{i \in S}\sum_{j \in S} s_{ij} X_i X_j} \sqrt{\sum_{i \in S}\sum_{j \in S} s_{ij} Y_i Y_j}}\]
where \(s_{ij}\) is the similarity of two tokens, by default a function of Levenshtein distance: \(\frac{1}{1+Levenshtein\_distance(i, j)}\).
Notes
This class implements soft cosine similarity, as defined by [SGGomezAP14]. An alternative formulation of soft cosine similarity using soft (multi-)sets is provided by the Cosine class using intersection_type='soft', based on the soft intersection defined in [RHJF14].
New in version 0.4.0.
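The double sums in the soft cosine formula can be sketched directly. The example below assumes character bigrams padded with '$' and '#' as tokens and the default token similarity 1/(1 + Levenshtein distance); the helper names are hypothetical, and the tokenization is an assumption chosen because it is consistent with the 'cat'/'hat' value listed for this class.

```python
from collections import Counter
from math import sqrt


def levenshtein(s, t):
    """Plain Levenshtein distance with unit costs."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def padded_bigrams(word, start="$", stop="#"):
    """Return a multiset of character bigrams of the padded word."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def soft_dot(x, y):
    """Sum of s_ij * X_i * Y_j with s_ij = 1 / (1 + Levenshtein(i, j))."""
    return sum(xi * yj / (1 + levenshtein(i, j))
               for i, xi in x.items()
               for j, yj in y.items())


def soft_cosine(src, tar):
    """Soft cosine similarity of two strings (sketch)."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    return soft_dot(x, y) / (sqrt(soft_dot(x, x)) * sqrt(soft_dot(y, y)))


print(soft_cosine("cat", "hat"))
```

Tokens with a zero count contribute nothing to any of the sums, so iterating over the tokens present in each string is equivalent to summing over the whole alphabet S.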
Initialize SoftCosine instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package, defaulting to the QGrams tokenizer with q=4
threshold (float) -- The minimum similarity for a pair of tokens to contribute to similarity
metric (_Distance) -- A distance instance from the abydos.distance package, defaulting to Levenshtein distance
sim_method (str) --
Selects the similarity method from the four given in [SGGomezAP14]:
a: \(\frac{1}{1+d}\)
b: \(1-\frac{d}{m}\)
c: \(\sqrt{1-\frac{d}{m}}\)
d: \(\Big(1-\frac{d}{m}\Big)^2\)
where \(d\) is the distance (Levenshtein by default) and \(m\) is the maximum length of the two tokens. Option a is the default, as suggested by the paper.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
- Raises:
ValueError -- sim_method must be one of 'a', 'b', 'c', or 'd'
New in version 0.4.0.
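The four sim_method options described in the parameter list can be written out as a single function. This is an illustrative sketch with a hypothetical function name; d is a token distance and m is the maximum length of the two tokens, as in the parameter description.

```python
from math import sqrt


def token_similarity(d, m, sim_method="a"):
    """Convert a token distance d into a similarity (methods a to d)."""
    if sim_method == "a":
        return 1 / (1 + d)
    if sim_method == "b":
        return 1 - d / m
    if sim_method == "c":
        return sqrt(1 - d / m)
    if sim_method == "d":
        return (1 - d / m) ** 2
    raise ValueError("sim_method must be one of 'a', 'b', 'c', or 'd'")


# Two bigrams that differ in one character: d = 1, m = 2
for method in "abcd":
    print(method, token_similarity(1, 2, method))
```

Method a is the only one that does not depend on the token length m, while c and d respectively soften and sharpen the linear penalty of b.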
- sim(src: str, tar: str) float [source]
Return the Soft Cosine similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Soft Cosine similarity
- Return type:
float
Examples
>>> cmp = SoftCosine()
>>> cmp.sim('cat', 'hat')
0.8750000000000001
>>> cmp.sim('Niall', 'Neil')
0.8844691709074513
>>> cmp.sim('aluminum', 'Catalan')
0.831348688760277
>>> cmp.sim('ATCG', 'TAGC')
0.8571428571428572
New in version 0.4.0.
- class abydos.distance.SoftTFIDF(tokenizer: Optional[_Tokenizer] = None, corpus: Optional[UnigramCorpus] = None, metric: Optional[_Distance] = None, threshold: float = 0.9, **kwargs: Any)[source]
Bases:
_TokenDistance
SoftTF-IDF similarity.
For two sets X and Y and a population N, SoftTF-IDF similarity [CRF03] is
\[\begin{split}\begin{array}{ll} sim_{SoftTF-IDF}(X, Y) &= \sum_{w \in \{sim_{metric}(x, y) \ge \theta | x \in X, y \in Y \}} V(w, S) \cdot V(w, X) \cdot V(w, Y) \\ \\ V(w, S) &= \frac{V'(w, S)}{\sqrt{\sum_{w \in S} V'(w, S)^2}} \\ \\ V'(w, S) &= log(1+TF_{w,S}) \cdot log(1+IDF_w) \end{array}\end{split}\]
Notes
One is added to both the TF & IDF values before taking the logarithm to ensure the logarithms do not fall to 0, which will tend to result in 0.0 similarities even when there is a degree of matching.
Rather than needing to exceed the threshold value, as in [CRF03], the similarity must be greater than or equal to the threshold.
New in version 0.4.0.
Initialize SoftTFIDF instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
corpus (UnigramCorpus) -- A unigram corpus (UnigramCorpus). If None, a corpus will be created from the two words when a similarity function is called.
metric (_Distance) -- A string distance measure class for making soft matches, by default Jaro-Winkler.
threshold (float) -- A threshold value, similarities at or above which are counted as soft matches, by default 0.9.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the SoftTF-IDF similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
SoftTF-IDF similarity
- Return type:
float
Examples
>>> cmp = SoftTFIDF()
>>> cmp.sim('cat', 'hat')
0.30404449697373
>>> cmp.sim('Niall', 'Neil')
0.20108911303601
>>> cmp.sim('aluminum', 'Catalan')
0.05355175631194
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.SokalMichener(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Sokal & Michener similarity.
For two sets X and Y and a population N, Sokal & Michener's simple matching coefficient [SM58], equivalent to the Rand index [Ran71], is
\[sim_{SokalMichener}(X, Y) = \frac{|X \cap Y| + |(N \setminus X) \setminus Y|}{|N|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{SokalMichener} = \frac{a+d}{n}\]
Notes
The associated distance metric is the mean Manhattan distance and 4 times the value of the variance dissimilarity of [Cor17].
In terms of a confusion matrix, this is equivalent to accuracy, ConfusionTable.accuracy().
New in version 0.4.0.
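To make the confusion-table terms concrete, the sketch below derives a, b, c and d from two strings and evaluates (a+d)/n. The padded-bigram tokenization and the population size n=784 are assumptions for this example (784 is consistent with the 'cat'/'hat' value listed for this class but is not a documented default), and the function names are hypothetical.

```python
from collections import Counter


def padded_bigrams(word, start="$", stop="#"):
    """Return a multiset of character bigrams of the padded word."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def confusion_counts(src, tar, n=784):
    """Return (a, b, c, d) for the two token multisets and population n."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    a = sum((x & y).values())  # tokens in both
    b = sum((x - y).values())  # tokens only in src
    c = sum((y - x).values())  # tokens only in tar
    d = n - a - b - c          # tokens in neither
    return a, b, c, d


def sokal_michener(src, tar, n=784):
    """Simple matching coefficient (a + d) / n."""
    a, b, c, d = confusion_counts(src, tar, n)
    return (a + d) / n


print(confusion_counts("cat", "hat"))
print(sokal_michener("cat", "hat"))
```

Because d dominates when n is large, the coefficient stays close to 1 even for strings that share few tokens.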
Initialize SokalMichener instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Sokal & Michener similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Sokal & Michener similarity
- Return type:
float
Examples
>>> cmp = SokalMichener()
>>> cmp.sim('cat', 'hat')
0.9948979591836735
>>> cmp.sim('Niall', 'Neil')
0.9910714285714286
>>> cmp.sim('aluminum', 'Catalan')
0.9808917197452229
>>> cmp.sim('ATCG', 'TAGC')
0.9872448979591837
New in version 0.4.0.
- class abydos.distance.SokalSneathI(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Sokal & Sneath I similarity.
For two sets X and Y and a population N, Sokal & Sneath I similarity [SS63] is
\[sim_{SokalSneathI}(X, Y) = \frac{2(|X \cap Y| + |(N \setminus X) \setminus Y|)} {|X \cap Y| + |(N \setminus X) \setminus Y| + |N|}\]
This is the first of five "Unnamed coefficients" presented in [SS63]. It corresponds to the "Matched pairs carry twice the weight of unmatched pairs in the Denominator" with "Negative Matches in Numerator Included". "Negative Matches in Numerator Excluded" corresponds to the Sørensen–Dice coefficient, Dice.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{SokalSneathI} = \frac{2(a+d)}{a+d+n}\]
New in version 0.4.0.
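A short numeric sketch of the confusion-table form follows. The counts a=2, b=2, c=2, d=778 are assumed values for 'cat' and 'hat' tokenized as padded bigrams with a population of 784; the function name is hypothetical.

```python
def sokal_sneath_i(a, b, c, d):
    """Sokal & Sneath I: matches (a + d) weighted twice as heavily."""
    n = a + b + c + d
    return 2 * (a + d) / (a + d + n)


print(sokal_sneath_i(a=2, b=2, c=2, d=778))
```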
Initialize SokalSneathI instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Sokal & Sneath I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Sokal & Sneath I similarity
- Return type:
float
Examples
>>> cmp = SokalSneathI()
>>> cmp.sim('cat', 'hat')
0.9974424552429667
>>> cmp.sim('Niall', 'Neil')
0.9955156950672646
>>> cmp.sim('aluminum', 'Catalan')
0.9903536977491961
>>> cmp.sim('ATCG', 'TAGC')
0.993581514762516
New in version 0.4.0.
- class abydos.distance.SokalSneathII(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Sokal & Sneath II similarity.
For two sets X and Y, Sokal & Sneath II similarity [SS63] is
\[sim_{SokalSneathII}(X, Y) = \frac{|X \cap Y|} {|X \cap Y| + 2|X \triangle Y|}\]
This is the second of five "Unnamed coefficients" presented in [SS63]. It corresponds to the "Unmatched pairs carry twice the weight of matched pairs in the Denominator" with "Negative Matches in Numerator Excluded". "Negative Matches in Numerator Included" corresponds to the Rogers & Tanimoto similarity, RogersTanimoto.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{SokalSneathII} = \frac{a}{a+2(b+c)}\]
New in version 0.4.0.
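Since this coefficient ignores d, it can be evaluated from the three remaining cells alone. In the sketch below the counts are assumed values for padded-bigram tokenizations of the example pairs, and the function name is hypothetical.

```python
def sokal_sneath_ii(a, b, c):
    """Sokal & Sneath II: unmatched tokens (b + c) weighted twice."""
    return a / (a + 2 * (b + c))


print(sokal_sneath_ii(a=2, b=2, c=2))  # assumed counts for 'cat' vs. 'hat'
print(sokal_sneath_ii(a=2, b=4, c=3))  # assumed counts for 'Niall' vs. 'Neil'
```

Unlike the coefficients that include d, the result does not depend on the population size, which is why the example values for this class are much lower than those of Sokal & Sneath I.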
Initialize SokalSneathII instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Sokal & Sneath II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Sokal & Sneath II similarity
- Return type:
float
Examples
>>> cmp = SokalSneathII()
>>> cmp.sim('cat', 'hat')
0.2
>>> cmp.sim('Niall', 'Neil')
0.125
>>> cmp.sim('aluminum', 'Catalan')
0.03225806451612903
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.SokalSneathIII(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Sokal & Sneath III similarity.
For two sets X and Y and a population N, Sokal & Sneath III similarity [SS63] is
\[sim_{SokalSneathIII}(X, Y) = \frac{|X \cap Y| + |(N \setminus X) \setminus Y|} {|X \triangle Y|}\]
This is the third of five "Unnamed coefficients" presented in [SS63]. It corresponds to the "Unmatched pairs only in the Denominator" with "Negative Matches in Numerator Included". "Negative Matches in Numerator Excluded" corresponds to the Kulczynski I coefficient, KulczynskiI.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{SokalSneathIII} = \frac{a+d}{b+c}\]
New in version 0.4.0.
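The ratio form means the score is not confined to [0, 1], as the sketch below shows. The counts are assumed values for 'cat' and 'hat' tokenized as padded bigrams with a population of 784, and the function name is hypothetical.

```python
def sokal_sneath_iii(a, b, c, d):
    """Sokal & Sneath III: matches divided by mismatches (unbounded)."""
    return (a + d) / (b + c)


print(sokal_sneath_iii(a=2, b=2, c=2, d=778))
```

The denominator is zero when the two token sets are identical, so a caller of this sketch would need to handle that case separately.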
Initialize SokalSneathIII instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(*args: Any, **kwargs: Any) NoReturn [source]
Raise exception when called.
- Parameters:
*args -- Variable length argument list
**kwargs -- Arbitrary keyword arguments
- Raises:
NotImplementedError -- Method disabled for Sokal & Sneath III similarity.
New in version 0.3.6.
- sim(*args: Any, **kwargs: Any) NoReturn [source]
Raise exception when called.
- Parameters:
*args -- Variable length argument list
**kwargs -- Arbitrary keyword arguments
- Raises:
NotImplementedError -- Method disabled for Sokal & Sneath III similarity.
New in version 0.3.6.
- sim_score(src: str, tar: str) float [source]
Return the Sokal & Sneath III similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Sokal & Sneath III similarity
- Return type:
float
Examples
>>> cmp = SokalSneathIII()
>>> cmp.sim_score('cat', 'hat')
195.0
>>> cmp.sim_score('Niall', 'Neil')
111.0
>>> cmp.sim_score('aluminum', 'Catalan')
51.333333333333336
>>> cmp.sim_score('ATCG', 'TAGC')
77.4
New in version 0.4.0.
- class abydos.distance.SokalSneathIV(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Sokal & Sneath IV similarity.
For two sets X and Y and a population N, Sokal & Sneath IV similarity [SS63] is
\[sim_{SokalSneathIV}(X, Y) = \frac{1}{4}\Bigg( \frac{|X \cap Y|}{|X|}+ \frac{|X \cap Y|}{|Y|}+ \frac{|(N \setminus X) \setminus Y|} {|N \setminus Y|}+ \frac{|(N \setminus X) \setminus Y|} {|N \setminus X|} \Bigg)\]
This is the fourth of five "Unnamed coefficients" presented in [SS63]. It corresponds to the first "Marginal totals in the Denominator" with "Negative Matches in Numerator Included". "Negative Matches in Numerator Excluded" corresponds to the Kulczynski II similarity, KulczynskiII. This is also Rogot & Goldberg's "adjusted agreement" \(A_1\) [RG66].
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{SokalSneathIV} = \frac{1}{4}\Big(\frac{a}{a+b}+\frac{a}{a+c}+ \frac{d}{b+d}+\frac{d}{c+d}\Big)\]
New in version 0.4.0.
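The four marginal ratios can be averaged as in the sketch below. The counts are assumed values for 'cat' and 'hat' tokenized as padded bigrams with a population of 784, and the function name is hypothetical.

```python
def sokal_sneath_iv(a, b, c, d):
    """Sokal & Sneath IV: mean of four conditional match rates."""
    return 0.25 * (a / (a + b) + a / (a + c) + d / (b + d) + d / (c + d))


print(sokal_sneath_iv(a=2, b=2, c=2, d=778))
```

Two of the four terms depend only on the positive matches, so the large d cell cannot push the score close to 1 on its own, unlike in the simple matching coefficient.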
Initialize SokalSneathIV instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See the alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result. See the intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Sokal & Sneath IV similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Sokal & Sneath IV similarity
- Return type:
float
Examples
>>> cmp = SokalSneathIV()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6810856260030602
>>> cmp.sim('aluminum', 'Catalan')
0.5541986205645999
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
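As a rough check of the confusion-table form of Sokal & Sneath IV, the sketch below evaluates it directly from the four cell counts. The helper name sokal_sneath_iv is hypothetical, and the counts a=2, b=2, c=2, d=778 assume that 'cat' and 'hat' are split into start/stop-padded bigrams over a population of 784 tokens; that setup is inferred from the example values, not stated in the text above.

```python
def sokal_sneath_iv(a, b, c, d):
    """Mean of the four marginal ratios a/(a+b), a/(a+c), d/(b+d), d/(c+d)."""
    ratios = [(a, a + b), (a, a + c), (d, b + d), (d, c + d)]
    # An empty marginal contributes 0.0; this is a choice made for the sketch.
    return sum(num / den if den else 0.0 for num, den in ratios) / 4


# Assumed counts for 'cat' vs 'hat' (padded bigrams, population of 784).
print(sokal_sneath_iv(2, 2, 2, 778))
```

With these assumed counts the result agrees with the first doctest value to within floating-point rounding.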
- class abydos.distance.SokalSneathV(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Sokal & Sneath V similarity.
For two sets X and Y and a population N, Sokal & Sneath V similarity [SS63] is
\[sim_{SokalSneathV}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} {\sqrt{|X| \cdot |Y| \cdot |N \setminus Y| \cdot |N \setminus X|}}\]
This is the fifth of five "Unnamed coefficients" presented in [SS63]. It corresponds to the second "Marginal totals in the Denominator" with "Negative Matches in Numerator Included", also sometimes referred to as Ochiai II similarity. "Negative Matches in Numerator Excluded" corresponds to the Cosine similarity, Cosine.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{SokalSneathV} = \frac{ad}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\]
New in version 0.4.0.
Initialize SokalSneathV instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Sokal & Sneath V similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Sokal & Sneath V similarity
- Return type:
float
Examples
>>> cmp = SokalSneathV()
>>> cmp.sim('cat', 'hat')
0.4987179487179487
>>> cmp.sim('Niall', 'Neil')
0.3635068033537323
>>> cmp.sim('aluminum', 'Catalan')
0.11671286273067434
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
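The confusion-table form of Sokal & Sneath V can be evaluated with a few lines of plain Python. This is only a sketch: sokal_sneath_v is a hypothetical helper, and the counts used for 'cat' vs 'hat' (a=2, b=2, c=2, d=778) assume padded bigrams and a population of 784 tokens, as inferred from the example values.

```python
import math


def sokal_sneath_v(a, b, c, d):
    """Return ad / sqrt((a+b)(a+c)(b+d)(c+d)), or 0.0 if a marginal is empty."""
    denom = math.sqrt((a + b) * (a + c) * (b + d) * (c + d))
    # Returning 0.0 for a zero denominator is a choice made for this sketch.
    return a * d / denom if denom else 0.0


# Assumed counts for 'cat' vs 'hat' (padded bigrams, population of 784).
print(sokal_sneath_v(2, 2, 2, 778))
```

Because a appears in the numerator, any pair with no shared tokens scores 0.0, which matches the 'ATCG' vs 'TAGC' example.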
- class abydos.distance.Sorgenfrei(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Sorgenfrei similarity.
For two sets X and Y, Sorgenfrei similarity [Sor58] is
\[sim_{Sorgenfrei}(X, Y) = \frac{|X \cap Y|^2}{|X| \cdot |Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Sorgenfrei} = \frac{a^2}{(a+b)(a+c)}\]
New in version 0.4.0.
Initialize Sorgenfrei instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Sorgenfrei similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Sorgenfrei similarity
- Return type:
float
Examples
>>> cmp = Sorgenfrei()
>>> cmp.sim('cat', 'hat')
0.25
>>> cmp.sim('Niall', 'Neil')
0.13333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.013888888888888888
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
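To make the link between strings and the set sizes concrete, here is a minimal standalone sketch of Sorgenfrei similarity that does not use abydos. It assumes tokens are bigrams of the word padded with '$' at the start and '#' at the end, counted as a multiset; the helper names and the padding symbols are assumptions chosen so that the results line up with the examples above.

```python
from collections import Counter


def padded_bigrams(word, start="$", stop="#"):
    """Count the 2-grams of a word after adding start/stop padding (assumed symbols)."""
    padded = start + word + stop
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def sorgenfrei(src, tar):
    """Squared multiset intersection size over the product of the token counts."""
    x, y = padded_bigrams(src), padded_bigrams(tar)
    a = sum((x & y).values())  # |X ∩ Y| as a multiset intersection
    return a * a / (sum(x.values()) * sum(y.values()))


print(sorgenfrei("cat", "hat"))  # 0.25
```

For 'cat' and 'hat' each word yields four bigrams, two of which ('at' and 't#') are shared, giving 2^2 / (4 * 4) = 0.25.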
- class abydos.distance.Steffensen(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', normalizer: str = 'proportional', **kwargs: Any)[source]
Bases:
_TokenDistance
Steffensen similarity.
For two sets X and Y and a population N, Steffensen similarity \(\psi^2\) [Ste34] is
\[\begin{split}\begin{array}{ll} sim_{Steffensen_{\psi}}(X, Y) = \psi^2 &= \sum_{i \in X}\sum_{j \in Y} p_{ij} \phi_{ij}^2 \\ \\ \phi_{ij}^2 &= \frac{(p_{ij} - p_{i*}p_{*j})^2} {p_{i*}(1-p_{i*})p_{*j}(1-p_{*j})} \end{array}\end{split}\]
Where each value \(p_{ij}\) is drawn from the 2x2 contingency table:
\[\begin{array}{c|c|c|c}
 & x \in \text{tar} & x \notin \text{tar} & \\ \hline
x \in \text{src} & p_{11} = a & p_{10} = b & p_{1*} = a+b \\ \hline
x \notin \text{src} & p_{01} = c & p_{00} = d & p_{0*} = c+d \\ \hline
 & p_{*1} = a+c & p_{*0} = b+d & 1
\end{array}\]
New in version 0.4.0.
Initialize Steffensen instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
normalizer (str) -- Specifies the normalization type. See normalizer description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Steffensen similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Steffensen similarity
- Return type:
float
Examples
>>> cmp = Steffensen()
>>> cmp.sim('cat', 'hat')
0.24744247205786737
>>> cmp.sim('Niall', 'Neil')
0.1300991207720166
>>> cmp.sim('aluminum', 'Catalan')
0.011710186806836031
>>> cmp.sim('ATCG', 'TAGC')
4.1196952743871653e-05
New in version 0.4.0.
- class abydos.distance.Stiles(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Stiles similarity.
For two sets X and Y and a population N, Stiles similarity [Sti61] is
\[sim_{Stiles}(X, Y) = log_{10} \frac{|N| \Big(||X \cap Y| \cdot |N| - |X \setminus Y| \cdot |Y \setminus X|| - \frac{|N|}{2}\Big)^2} {|X \setminus Y| \cdot |Y \setminus X| \cdot (|N| - |X \setminus Y|) \cdot (|N| - |Y \setminus X|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Stiles} = log_{10} \frac{n(|an-bc|-\frac{1}{2}n)^2}{bc(n-b)(n-c)}\]
New in version 0.4.0.
Initialize Stiles instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Stiles correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Stiles correlation
- Return type:
float
Examples
>>> cmp = Stiles()
>>> cmp.corr('cat', 'hat')
0.14701542182970487
>>> cmp.corr('Niall', 'Neil')
0.11767566062554877
>>> cmp.corr('aluminum', 'Catalan')
0.022355640924908403
>>> cmp.corr('ATCG', 'TAGC')
-0.046296656196428934
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Stiles similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Stiles similarity
- Return type:
float
Examples
>>> cmp = Stiles()
>>> cmp.sim('cat', 'hat')
0.5735077109148524
>>> cmp.sim('Niall', 'Neil')
0.5588378303127743
>>> cmp.sim('aluminum', 'Catalan')
0.5111778204624542
>>> cmp.sim('ATCG', 'TAGC')
0.4768516719017855
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Stiles similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Stiles similarity
- Return type:
float
Examples
>>> cmp = Stiles()
>>> cmp.sim_score('cat', 'hat')
2.6436977886009236
>>> cmp.sim_score('Niall', 'Neil')
2.1622951406967723
>>> cmp.sim_score('aluminum', 'Catalan')
0.41925115106844024
>>> cmp.sim_score('ATCG', 'TAGC')
-0.8426334527850912
New in version 0.4.0.
- class abydos.distance.Strcmp95(long_strings: bool = False, **kwargs: Any)[source]
Bases:
_Distance
Strcmp95.
This is a Python translation of the C code for strcmp95: http://web.archive.org/web/20110629121242/http://www.census.gov/geo/msb/stand/strcmp.c [WMJL94]. The above file is a US Government publication and, accordingly, in the public domain.
This is based on the Jaro-Winkler distance, but also attempts to correct for some common typos and frequently confused characters. It is also limited to uppercase ASCII characters, so it is appropriate to American names, but not much else.
New in version 0.3.6.
Initialize Strcmp95 instance.
- Parameters:
long_strings (bool) -- Set to True to increase the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the strcmp95 similarity of two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Strcmp95 similarity
- Return type:
float
Examples
>>> cmp = Strcmp95()
>>> cmp.sim('cat', 'hat')
0.7777777777777777
>>> cmp.sim('Niall', 'Neil')
0.8454999999999999
>>> cmp.sim('aluminum', 'Catalan')
0.6547619047619048
>>> cmp.sim('ATCG', 'TAGC')
0.8333333333333334
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
- class abydos.distance.StuartTau(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Stuart's Tau correlation.
For two sets X and Y and a population N, Stuart's Tau-C correlation [Stu53] is
\[corr_{Stuart_{\tau_c}}(X, Y) = \frac{4 \cdot (|X \cap Y| + |(N \setminus X) \setminus Y| - |X \triangle Y|)}{|N|^2}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Stuart_{\tau_c}} = \frac{4 \cdot ((a+d)-(b+c))}{n^2}\]
New in version 0.4.0.
Initialize StuartTau instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Stuart's Tau correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Stuart's Tau correlation
- Return type:
float
Examples
>>> cmp = StuartTau()
>>> cmp.corr('cat', 'hat')
0.005049979175343606
>>> cmp.corr('Niall', 'Neil')
0.005010932944606414
>>> cmp.corr('aluminum', 'Catalan')
0.004900807334983164
>>> cmp.corr('ATCG', 'TAGC')
0.0049718867138692216
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Stuart's Tau similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Stuart's Tau similarity
- Return type:
float
Examples
>>> cmp = StuartTau()
>>> cmp.sim('cat', 'hat')
0.5025249895876718
>>> cmp.sim('Niall', 'Neil')
0.5025054664723032
>>> cmp.sim('aluminum', 'Catalan')
0.5024504036674916
>>> cmp.sim('ATCG', 'TAGC')
0.5024859433569346
New in version 0.4.0.
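The very small correlation values in the examples follow from the n^2 denominator. The sketch below evaluates the confusion-table form directly; the helper names are hypothetical, the counts for 'cat' vs 'hat' assume padded bigrams over a population of 784 tokens, and the mapping sim = (1 + corr) / 2 is inferred from the example values rather than stated in the text.

```python
def stuart_tau_corr(a, b, c, d):
    """Return 4 * ((a + d) - (b + c)) / n**2 with n = a + b + c + d."""
    n = a + b + c + d
    # Returning 0.0 for an empty table is a choice made for this sketch.
    return 4 * ((a + d) - (b + c)) / n ** 2 if n else 0.0


def stuart_tau_sim(a, b, c, d):
    """Rescale the correlation from [-1, 1] to [0, 1] (assumed mapping)."""
    return (1 + stuart_tau_corr(a, b, c, d)) / 2


# Assumed counts for 'cat' vs 'hat': a=2, b=2, c=2, d=778 (n = 784).
print(stuart_tau_corr(2, 2, 2, 778))
print(stuart_tau_sim(2, 2, 2, 778))
```

Since d dominates the table for short strings, the numerator is close to 4n and the correlation is close to 4/n, which is why all four example pairs land near 0.005.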
- class abydos.distance.Suffix(**kwargs: Any)[source]
Bases:
_Distance
Suffix similarity and distance.
New in version 0.3.6.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the suffix similarity of two strings.
Suffix similarity is the length of the longest common suffix of the two terms (the run of characters that match exactly when both terms are compared from the end) divided by the length of the shorter term.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Suffix similarity
- Return type:
float
Examples
>>> cmp = Suffix()
>>> cmp.sim('cat', 'hat')
0.6666666666666666
>>> cmp.sim('Niall', 'Neil')
0.25
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
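The suffix similarity described above can be sketched in plain Python as follows. The function suffix_sim is a hypothetical stand-in, not the library's implementation, and its handling of identical or empty inputs (1.0 and 0.0 respectively) is an assumption made for the sketch.

```python
def suffix_sim(src, tar):
    """Length of the common suffix divided by the length of the shorter string."""
    if src == tar:
        return 1.0  # assumed convention for identical inputs
    if not src or not tar:
        return 0.0  # assumed convention when exactly one input is empty
    shorter = min(len(src), len(tar))
    matched = 0
    # Walk backwards from the end of both strings while the characters agree.
    while matched < shorter and src[-1 - matched] == tar[-1 - matched]:
        matched += 1
    return matched / shorter


print(suffix_sim("cat", "hat"))
print(suffix_sim("Niall", "Neil"))  # 0.25
```

'cat' and 'hat' share the two-character suffix 'at' out of three characters, while 'Niall' and 'Neil' share only the final 'l' out of the four characters of the shorter word.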
- class abydos.distance.Synoname(word_approx_min: float = 0.3, char_approx_min: float = 0.73, tests: Union[int, Iterable[str]] = 4095, ret_name: bool = False, **kwargs: Any)[source]
Bases:
_Distance
Synoname.
Cf. [Gro91, JPGTrust91]
New in version 0.3.6.
Initialize Synoname instance.
- Parameters:
word_approx_min (float) -- The minimum word approximation value to signal a 'word_approx' match
char_approx_min (float) -- The minimum character approximation value to signal a 'char_approx' match
tests (int or Iterable) -- Either an integer indicating tests to perform or a list of test names to perform (defaults to performing all tests)
ret_name (bool) -- If True, returns the match name rather than its integer equivalent
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Synoname distance between two words.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized Synoname distance
- Return type:
float
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: Union[str, Tuple[str, str, str]], tar: Union[str, Tuple[str, str, str]]) int [source]
Return the Synoname similarity type of two words.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Synoname value
- Return type:
int
Examples
>>> cmp = Synoname()
>>> cmp.dist_abs(('Breghel', 'Pieter', ''), ('Brueghel', 'Pieter', ''))
2
New in version 0.6.0.
- sim_type(src: Union[str, Tuple[str, str, str]], tar: Union[str, Tuple[str, str, str]], force_numeric: bool = False) Union[int, str] [source]
Return the Synoname similarity type of two words.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
force_numeric (bool) -- Overrides the instance's ret_name setting
- Returns:
Synoname value
- Return type:
int (or str if ret_name is True)
Examples
>>> cmp = Synoname()
>>> cmp.sim_type(('Breghel', 'Pieter', ''), ('Brueghel', 'Pieter', ''))
2
>>> cmp = Synoname(ret_name=True)
>>> cmp.sim_type(('Breghel', 'Pieter', ''), ('Brueghel', 'Pieter', ''))
'omission'
>>> cmp.sim_type(('Dore', 'Gustave', ''),
... ('Dore', 'Paul Gustave Louis Christophe', ''))
'inclusion'
>>> cmp.sim_type(('Pereira', 'I. R.', ''), ('Pereira', 'I. Smith', ''))
'word_approx'
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
Changed in version 0.6.0: Renamed dist_abs to sim_type and added dist_abs with standard interface
- class abydos.distance.TFIDF(tokenizer: Optional[_Tokenizer] = None, corpus: Optional[UnigramCorpus] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
TF-IDF similarity.
For two sets X and Y and a population N, TF-IDF similarity [CRF03] is
\[ \begin{align}\begin{aligned}sim_{TF-IDF}(X, Y) = \sum_{w \in X \cap Y} V(w, X) \cdot V(w, Y)\\V(w, S) = \frac{V'(w, S)}{\sqrt{\sum_{w \in S} V'(w, S)^2}}\\V'(w, S) = log(1+TF_{w,S}) \cdot log(1+IDF_w)\end{aligned}\end{align} \]
Notes
One is added to both the TF & IDF values before taking the logarithm to ensure the logarithms do not fall to 0, which will tend to result in 0.0 similarities even when there is a degree of matching.
New in version 0.4.0.
Initialize TFIDF instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
corpus (UnigramCorpus) -- A unigram corpus (UnigramCorpus). If None, a corpus will be created from the two words when a similarity function is called.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the TF-IDF similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
TF-IDF similarity
- Return type:
float
Examples
>>> cmp = TFIDF()
>>> cmp.sim('cat', 'hat')
0.30404449697373
>>> cmp.sim('Niall', 'Neil')
0.20108911303601
>>> cmp.sim('aluminum', 'Catalan')
0.05355175631194
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.Tarantula(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Tarantula similarity.
For two sets X and Y and a population N, Tarantula similarity [JH05] is
\[sim_{Tarantula}(X, Y) = \frac{\frac{|X \cap Y|}{|X \cap Y| + |X \setminus Y|}} {\frac{|X \cap Y|}{|X \cap Y| + |X \setminus Y|} + \frac{|Y \setminus X|} {|Y \setminus X| + |(N \setminus X) \setminus Y|}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Tarantula} = \frac{\frac{a}{a+b}}{\frac{a}{a+b} + \frac{c}{c+d}}\]
New in version 0.4.0.
Initialize Tarantula instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Tarantula similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Tarantula similarity
- Return type:
float
Examples
>>> cmp = Tarantula()
>>> cmp.sim('cat', 'hat')
0.9948979591836735
>>> cmp.sim('Niall', 'Neil')
0.98856416772554
>>> cmp.sim('aluminum', 'Catalan')
0.9249106078665077
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
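The high Tarantula scores in the examples come from the second ratio, c/(c+d), being tiny whenever d is large. A minimal sketch of the confusion-table form follows; tarantula is a hypothetical helper, and the counts for 'cat' vs 'hat' assume padded bigrams over a population of 784 tokens, as inferred from the example values.

```python
def tarantula(a, b, c, d):
    """Return (a/(a+b)) / (a/(a+b) + c/(c+d)), guarding empty marginals."""
    x_rate = a / (a + b) if a + b else 0.0
    y_rate = c / (c + d) if c + d else 0.0
    total = x_rate + y_rate
    # Returning 0.0 when both rates vanish is a choice made for this sketch.
    return x_rate / total if total else 0.0


# Assumed counts for 'cat' vs 'hat': a=2, b=2, c=2, d=778.
print(tarantula(2, 2, 2, 778))
```

With these counts the first ratio is 1/2 and the second is 1/390, so the score is 195/196, which agrees with the first doctest value.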
- class abydos.distance.Tarwid(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Tarwid correlation.
For two sets X and Y and a population N, the Tarwid correlation [Tar60] is
\[corr_{Tarwid}(X, Y) = \frac{|N| \cdot |X \cap Y| - |X| \cdot |Y|} {|N| \cdot |X \cap Y| + |X| \cdot |Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Tarwid} = \frac{na-(a+b)(a+c)}{na+(a+b)(a+c)}\]
New in version 0.4.0.
Initialize Tarwid instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Tarwid correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Tarwid correlation
- Return type:
float
Examples
>>> cmp = Tarwid()
>>> cmp.corr('cat', 'hat')
0.9797979797979798
>>> cmp.corr('Niall', 'Neil')
0.9624530663329162
>>> cmp.corr('aluminum', 'Catalan')
0.8319719953325554
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Tarwid similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Tarwid similarity
- Return type:
float
Examples
>>> cmp = Tarwid()
>>> cmp.sim('cat', 'hat')
0.9898989898989898
>>> cmp.sim('Niall', 'Neil')
0.981226533166458
>>> cmp.sim('aluminum', 'Catalan')
0.9159859976662776
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
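The Tarwid correlation compares the observed overlap n·a with the product of the marginals (a+b)(a+c). The sketch below evaluates the confusion-table form; the helper names are hypothetical, the population size n = 784 is an assumption inferred from the example values, and so is the mapping sim = (1 + corr) / 2.

```python
def tarwid_corr(a, b, c, n):
    """Return (n*a - (a+b)(a+c)) / (n*a + (a+b)(a+c))."""
    marginals = (a + b) * (a + c)
    denom = n * a + marginals
    # Returning 0.0 for a zero denominator is a choice made for this sketch.
    return (n * a - marginals) / denom if denom else 0.0


def tarwid_sim(a, b, c, n):
    """Rescale the correlation from [-1, 1] to [0, 1] (assumed mapping)."""
    return (1 + tarwid_corr(a, b, c, n)) / 2


# Assumed counts for 'cat' vs 'hat': a=2, b=2, c=2 with n = 784.
print(tarwid_corr(2, 2, 2, 784))
print(tarwid_corr(0, 5, 5, 784))  # -1.0
```

When the two strings share no tokens (a = 0), the numerator and denominator differ only in sign, so the correlation is exactly -1.0, matching the 'ATCG' vs 'TAGC' example.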
- class abydos.distance.Tetrachoric(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Tetrachoric correlation coefficient.
For two sets X and Y and a population N, the Tetrachoric correlation coefficient [Pea00] is
\[corr_{Tetrachoric}(X, Y) = \cos \Big(\frac{\pi \sqrt{|X \setminus Y| \cdot |Y \setminus X|}} {\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} + \sqrt{|X \setminus Y| \cdot |Y \setminus X|}}\Big)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Tetrachoric} = \cos \frac{\pi\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\]
New in version 0.4.0.
Initialize Tetrachoric instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Tetrachoric correlation coefficient of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Tetrachoric correlation coefficient
- Return type:
float
Examples
>>> cmp = Tetrachoric()
>>> cmp.corr('cat', 'hat')
0.9885309061036239
>>> cmp.corr('Niall', 'Neil')
0.9678978997263907
>>> cmp.corr('aluminum', 'Catalan')
0.7853000893691571
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Tetrachoric similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Tetrachoric similarity
- Return type:
float
Examples
>>> cmp = Tetrachoric()
>>> cmp.sim('cat', 'hat')
0.994265453051812
>>> cmp.sim('Niall', 'Neil')
0.9839489498631954
>>> cmp.sim('aluminum', 'Catalan')
0.8926500446845785
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
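The cosine approximation of the Tetrachoric coefficient can be checked with a short standalone sketch. The helper names are hypothetical; the counts for 'cat' vs 'hat' assume padded bigrams over a population of 784 tokens, and the mapping sim = (1 + corr) / 2 is inferred from the example values rather than stated in the text.

```python
import math


def tetrachoric_corr(a, b, c, d):
    """Return cos(pi * sqrt(bc) / (sqrt(ad) + sqrt(bc)))."""
    concordant = math.sqrt(a * d)
    discordant = math.sqrt(b * c)
    total = concordant + discordant
    if not total:
        return 0.0  # degenerate table; this fallback is the sketch's own choice
    return math.cos(math.pi * discordant / total)


def tetrachoric_sim(a, b, c, d):
    """Rescale the correlation from [-1, 1] to [0, 1] (assumed mapping)."""
    return (1 + tetrachoric_corr(a, b, c, d)) / 2


# Assumed counts for 'cat' vs 'hat': a=2, b=2, c=2, d=778.
print(tetrachoric_corr(2, 2, 2, 778))
print(tetrachoric_corr(0, 5, 5, 774))  # -1.0
```

With no shared tokens the argument of the cosine is exactly pi, giving -1.0; with no unshared tokens it is 0, giving 1.0.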
- class abydos.distance.Tichy(cost: Tuple[int, int] = (1, 1), **kwargs: Any)[source]
Bases:
_Distance
Tichy edit distance.
Tichy described an algorithm, implemented below, in [Tic84]. Following this, [Cor03] identifies an interpretation of this algorithm's output as a distance measure, which is largely followed by the methods below.
Tichy's algorithm locates substrings of a string S to be copied in order to create a string T. The only other operations his algorithm uses for string reconstruction are add operations.
Notes
While [Cor03] counts only move operations to calculate distance, I give the option (enabled by default) of counting add operations as part of the distance measure. To ignore the cost of add operations, set the cost value to (1, 0), for example, when initializing the object. Further, in the case that S and T are identical, a distance of 0 will be returned, even though this would still be counted as a single move operation spanning the whole of string S.
New in version 0.4.0.
Initialize Tichy instance.
- Parameters:
cost (tuple) -- A 2-tuple representing the cost of the two possible edits: block moves and adds (by default: (1, 1))
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Tichy edit distance between two strings.
The Tichy distance is normalized by dividing the distance by the length of the tar string.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The normalized Tichy distance between src & tar
- Return type:
float
Examples
>>> cmp = Tichy()
>>> round(cmp.dist('cat', 'hat'), 12)
0.666666666667
>>> round(cmp.dist('Niall', 'Neil'), 12)
1.0
>>> cmp.dist('aluminum', 'Catalan')
0.8571428571428571
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Tichy distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Tichy distance between src & tar
- Return type:
int (may return a float if cost has float values)
Examples
>>> cmp = Tichy()
>>> cmp.dist_abs('cat', 'hat')
2
>>> cmp.dist_abs('Niall', 'Neil')
4
>>> cmp.dist_abs('aluminum', 'Catalan')
6
>>> cmp.dist_abs('ATCG', 'TAGC')
4
New in version 0.4.0.
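The block-move and add counting described for the Tichy distance can be illustrated with a simple greedy sketch: at each position of the target, copy the longest substring that occurs anywhere in the source (one move), or add a single character if none does (one add). This is an illustrative reading of the description, not the library's code; tichy_distance and tichy_normalized are hypothetical names, and the quadratic substring search is chosen for brevity rather than efficiency.

```python
def tichy_distance(src, tar, cost=(1, 1)):
    """Count block moves and adds needed to rebuild tar from substrings of src."""
    if src == tar:
        return 0  # identical strings are defined to have distance 0
    move_cost, add_cost = cost
    total = 0
    pos = 0
    while pos < len(tar):
        # Greedily extend the block starting at pos while it still occurs in src.
        length = 0
        while pos + length < len(tar) and tar[pos:pos + length + 1] in src:
            length += 1
        if length:
            total += move_cost  # copy a block of `length` characters
            pos += length
        else:
            total += add_cost  # character absent from src: add it
            pos += 1
    return total


def tichy_normalized(src, tar, cost=(1, 1)):
    """Divide the distance by the length of tar, as described for dist."""
    return tichy_distance(src, tar, cost) / len(tar) if tar else 0.0


print(tichy_distance("cat", "hat"))  # 2
print(tichy_distance("aluminum", "Catalan"))  # 6
```

For 'cat' to 'hat', the 'h' must be added and 'at' is copied as one block, giving 2. Passing cost=(1, 0) counts only the moves, as the notes above describe.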
- class abydos.distance.TullossR(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Tulloss' R similarity.
For two sets X and Y and a population N, Tulloss' R similarity [Tul97] is
\[sim_{Tulloss_R}(X, Y) = \frac{log(1+\frac{|X \cap Y|}{|X|}) \cdot log(1+\frac{|X \cap Y|} {|Y|})}{log^2(2)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Tulloss_R} = \frac{log(1+\frac{a}{a+b}) \cdot log(1+\frac{a}{a+c})}{log^2(2)}\]
New in version 0.4.0.
Initialize TullossR instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Tulloss' R similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Tulloss' R similarity
- Return type:
float
Examples
>>> cmp = TullossR()
>>> cmp.sim('cat', 'hat')
0.34218112724994865
>>> cmp.sim('Niall', 'Neil')
0.2014703364316006
>>> cmp.sim('aluminum', 'Catalan')
0.025829125872886074
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
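Tulloss' R depends only on a, b, and c, so it can be checked without knowing the population size. The following sketch evaluates the confusion-table form; tulloss_r is a hypothetical helper, and the counts a=2, b=2, c=2 for 'cat' vs 'hat' assume start/stop-padded bigrams.

```python
import math


def tulloss_r(a, b, c):
    """Return log(1 + a/(a+b)) * log(1 + a/(a+c)) / log(2)**2."""
    if not a:
        return 0.0  # no shared tokens: both logarithms are log(1) = 0
    return (
        math.log(1 + a / (a + b)) * math.log(1 + a / (a + c))
        / math.log(2) ** 2
    )


# Assumed counts for 'cat' vs 'hat': a=2, b=2, c=2.
print(tulloss_r(2, 2, 2))
```

Dividing by log^2(2) makes the score base-independent and equal to 1 when both ratios are 1; for 'cat' vs 'hat' both ratios are 1/2, so the score is the square of log2(1.5).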
- class abydos.distance.TullossS(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Tulloss' S similarity.
For two sets X and Y and a population N, Tulloss' S similarity [Tul97] is
\[sim_{Tulloss_S}(X, Y) = \frac{1}{\sqrt{log_2(2+\frac{min(|X \setminus Y|, |Y \setminus X|)} {|X \cap Y|+1})}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Tulloss_S} = \frac{1}{\sqrt{log_2(2+\frac{min(b,c)}{a+1})}}\]
New in version 0.4.0.
Initialize TullossS instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Tulloss' S similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Tulloss' S similarity
- Return type:
float
Examples
>>> cmp = TullossS()
>>> cmp.sim('cat', 'hat')
0.8406515643305636
>>> cmp.sim('Niall', 'Neil')
0.7943108670863427
>>> cmp.sim('aluminum', 'Catalan')
0.6376503816669968
>>> cmp.sim('ATCG', 'TAGC')
0.5968309535438173
New in version 0.4.0.
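As a rough check of the confusion table form above, the following sketch computes Tulloss' S directly from raw counts. The function name `tulloss_s` is hypothetical, and the counts are an assumption: a=2, b=2, c=2 is what 'cat' and 'hat' yield when split into bigrams padded with a start and a stop symbol.

```python
import math


def tulloss_s(a: int, b: int, c: int) -> float:
    """Tulloss' S from 2x2 confusion-table counts (d is not used)."""
    # 1 / sqrt(log2(2 + min(b, c) / (a + 1)))
    return 1.0 / math.sqrt(math.log2(2.0 + min(b, c) / (a + 1.0)))


# Assumed counts for 'cat' vs 'hat' (padded bigrams): a=2, b=2, c=2.
cat_hat = tulloss_s(2, 2, 2)
# Assumed counts for 'ATCG' vs 'TAGC' (no shared bigrams): a=0, b=5, c=5.
atcg_tagc = tulloss_s(0, 5, 5)
```

Because the argument of the logarithm is never below 2, the denominator is always at least 1 and the sketch needs no zero-division guard.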
- class abydos.distance.TullossT(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Tulloss' T similarity.
For two sets X and Y and a population N, Tulloss' T similarity [Tul97] is
\[\begin{aligned} sim_{Tulloss_T}(X, Y) &= \sqrt{sim_{Tulloss_U}(X, Y) \cdot sim_{Tulloss_S}(X, Y) \cdot sim_{Tulloss_R}(X, Y)}\\ &= \sqrt{ log_2\Big(1+\frac{min(|X \setminus Y|, |Y \setminus X|)+|X \cap Y|} {max(|X \setminus Y|, |Y \setminus X|)+|X \cap Y|}\Big) \cdot \frac{1}{\sqrt{log_2(2+\frac{min(|X \setminus Y|, |Y \setminus X|)} {|X \cap Y|+1})}} \cdot \frac{log(1+\frac{|X \cap Y|}{|X|}) \cdot log(1+\frac{|X \cap Y|} {|Y|})}{log^2(2)}} \end{aligned}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Tulloss_T} = \sqrt{ log_2\Big(1+\frac{min(b, c)+a}{max(b, c)+a}\Big) \cdot \frac{1}{\sqrt{log_2(2+\frac{min(b,c)}{a+1})}} \cdot \frac{log(1+\frac{a}{a+b}) \cdot log(1+\frac{a}{a+c})}{log^2(2)}}\]
New in version 0.4.0.
Initialize TullossT instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Tulloss' T similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Tulloss' T similarity
- Return type:
float
Examples
>>> cmp = TullossT()
>>> cmp.sim('cat', 'hat')
0.5363348766461724
>>> cmp.sim('Niall', 'Neil')
0.3740873705689327
>>> cmp.sim('aluminum', 'Catalan')
0.1229300783095269
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
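The composition of T from the U, S, and R terms can be sketched from confusion-table counts as follows. All function names here are hypothetical, and the counts are assumptions derived from padded bigrams ('cat'/'hat' gives a=2, b=2, c=2; 'Niall'/'Neil' gives a=2, b=4, c=3). This is not the library's implementation, only a direct transcription of the formula above.

```python
import math


def tulloss_u(a: int, b: int, c: int) -> float:
    return math.log2(1.0 + (min(b, c) + a) / (max(b, c) + a))


def tulloss_s(a: int, b: int, c: int) -> float:
    return 1.0 / math.sqrt(math.log2(2.0 + min(b, c) / (a + 1.0)))


def tulloss_r(a: int, b: int, c: int) -> float:
    # log(1 + a/(a+b)) * log(1 + a/(a+c)) / log(2)^2
    return (
        math.log(1.0 + a / (a + b))
        * math.log(1.0 + a / (a + c))
        / math.log(2.0) ** 2
    )


def tulloss_t(a: int, b: int, c: int) -> float:
    """Square root of the product U * S * R."""
    return math.sqrt(tulloss_u(a, b, c) * tulloss_s(a, b, c) * tulloss_r(a, b, c))
```

The sketch assumes both token sets are non-empty (a+b > 0 and a+c > 0); it does not guard against division by zero. When a is 0 the R term is 0, so T is 0 as well.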
- class abydos.distance.TullossU(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Tulloss' U similarity.
For two sets X and Y, Tulloss' U similarity [Tul97] is
\[sim_{Tulloss_U}(X, Y) = log_2\Big(1+\frac{min(|X \setminus Y|, |Y \setminus X|)+|X \cap Y|} {max(|X \setminus Y|, |Y \setminus X|)+|X \cap Y|}\Big)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Tulloss_U} = log_2\Big(1+\frac{min(b, c)+a}{max(b, c)+a}\Big)\]
New in version 0.4.0.
Initialize TullossU instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Tulloss' U similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Tulloss' U similarity
- Return type:
float
Examples
>>> cmp = TullossU()
>>> cmp.sim('cat', 'hat')
1.0
>>> cmp.sim('Niall', 'Neil')
0.8744691179161412
>>> cmp.sim('aluminum', 'Catalan')
0.917537839808027
>>> cmp.sim('ATCG', 'TAGC')
1.0
New in version 0.4.0.
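To show where the counts a, b, and c come from, the sketch below tokenizes two strings into bigrams and evaluates Tulloss' U on the resulting multisets. The helper names are hypothetical, and the padding with `$` and `#` is an assumption about the default q-gram tokenization (q=2), chosen because it reproduces the documented values.

```python
import math
from collections import Counter


def padded_bigrams(word: str) -> Counter:
    """Multiset of bigrams of the word, padded with '$' and '#' (assumed)."""
    padded = f"${word}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def tulloss_u_strings(src: str, tar: str) -> float:
    x, y = padded_bigrams(src), padded_bigrams(tar)
    a = sum((x & y).values())  # |X ∩ Y|
    b = sum((x - y).values())  # |X \ Y|
    c = sum((y - x).values())  # |Y \ X|
    return math.log2(1.0 + (min(b, c) + a) / (max(b, c) + a))
```

Since U depends only on the ratio of the smaller to the larger set size, 'cat'/'hat' and 'ATCG'/'TAGC' both score 1.0 even though the latter pair shares no bigrams.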
- class abydos.distance.Tversky(alpha: float = 1.0, beta: float = 1.0, bias: Optional[float] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Tversky index.
For two sets X and Y, the Tversky index [Tve77] is defined as:
\[sim_{Tversky}(X, Y) = \frac{|X \cap Y|} {|X \cap Y| + \alpha|X - Y| + \beta|Y - X|}\]
\(\alpha = \beta = 1\) is equivalent to the Jaccard & Tanimoto similarity coefficients.
\(\alpha = \beta = 0.5\) is equivalent to the Sørensen-Dice similarity coefficient [Dic45, Sorensen48].
Unequal α and β will tend to emphasize one or the other set's contributions:
\(\alpha > \beta\) emphasizes the contributions of X over Y
\(\alpha < \beta\) emphasizes the contributions of Y over X
Parameter values' relation to 1 emphasizes different types of contributions:
\(\alpha\) and \(\beta > 1\) emphasize unique contributions over the intersection
\(\alpha\) and \(\beta < 1\) emphasize the intersection over unique contributions
The symmetric variant is defined in [JBG13]. This is activated by specifying a bias parameter.
New in version 0.3.6.
Initialize Tversky instance.
- Parameters:
alpha (float) -- Tversky index parameter as described above
beta (float) -- Tversky index parameter as described above
bias (float) -- The symmetric Tversky index bias parameter
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Tversky index of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Tversky similarity
- Return type:
float
- Raises:
ValueError -- Unsupported weight assignment; alpha and beta must be greater than or equal to 0.
Examples
>>> cmp = Tversky()
>>> cmp.sim('cat', 'hat')
0.3333333333333333
>>> cmp.sim('Niall', 'Neil')
0.2222222222222222
>>> cmp.sim('aluminum', 'Catalan')
0.0625
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.1.0.
Changed in version 0.3.6: Encapsulated in class
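The effect of α and β can be illustrated with a minimal, standalone sketch of the asymmetric Tversky index over bigram multisets. The names `padded_bigrams` and `tversky` are hypothetical, the `$`/`#` padding is an assumption, the symmetric (bias) variant is not covered, and returning 0.0 for a zero denominator is a choice made only for this sketch.

```python
from collections import Counter


def padded_bigrams(word: str) -> Counter:
    padded = f"${word}#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def tversky(src: str, tar: str, alpha: float = 1.0, beta: float = 1.0) -> float:
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be greater than or equal to 0")
    x, y = padded_bigrams(src), padded_bigrams(tar)
    inter = sum((x & y).values())   # |X ∩ Y|
    only_x = sum((x - y).values())  # |X - Y|
    only_y = sum((y - x).values())  # |Y - X|
    denom = inter + alpha * only_x + beta * only_y
    return inter / denom if denom else 0.0  # sketch-only convention


jaccard_like = tversky("cat", "hat")          # alpha = beta = 1
dice_like = tversky("cat", "hat", 0.5, 0.5)   # alpha = beta = 0.5
```

With 'cat' and 'hat' the two sets share 2 bigrams and each has 2 unique ones, so the first call evaluates 2/(2+2+2) and the second 2/(2+1+1).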
- class abydos.distance.Typo(metric: str = 'euclidean', cost: Tuple[float, float, float, float] = (1.0, 1.0, 0.5, 0.5), layout: str = 'QWERTY', failsafe: bool = False, **kwargs: Any)[source]
Bases:
_Distance
Typo distance.
This is inspired by Typo-Distance [Son11], and a fair bit of this was copied from that module. Compared to the original, this supports different metrics for substitution.
New in version 0.3.6.
Initialize Typo instance.
- Parameters:
metric (str) -- Supported values include: euclidean, manhattan, log-euclidean, and log-manhattan
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and shifts, respectively (by default: (1, 1, 0.5, 0.5)). The substitution and shift costs should be significantly less than the cost of an insertion or deletion unless a log metric is used.
layout (str) -- Name of the keyboard layout to use (currently supported: QWERTY, Dvorak, AZERTY, QWERTZ, auto). If auto is selected, the class will attempt to determine an appropriate keyboard based on the supplied words.
failsafe (bool) -- If True, substitution of an unknown character (one not present on the selected keyboard) will incur a cost equal to an insertion plus a deletion.
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized typo distance between two strings.
This is typo distance, normalized to [0, 1].
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Normalized typo distance
- Return type:
float
Examples
>>> cmp = Typo()
>>> round(cmp.dist('cat', 'hat'), 12)
0.527046276695
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.565028153987
>>> round(cmp.dist('Colin', 'Cuilen'), 12)
0.569035593729
>>> cmp.dist('ATCG', 'TAGC')
0.625
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
- dist_abs(src: str, tar: str) float [source]
Return the typo distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
Typo distance
- Return type:
float
- Raises:
ValueError -- char not found in any keyboard layouts
Examples
>>> cmp = Typo()
>>> cmp.dist_abs('cat', 'hat')
1.5811388300841898
>>> cmp.dist_abs('Niall', 'Neil')
2.8251407699364424
>>> cmp.dist_abs('Colin', 'Cuilen')
3.414213562373095
>>> cmp.dist_abs('ATCG', 'TAGC')
2.5
>>> cmp = Typo(metric='manhattan')
>>> cmp.dist_abs('cat', 'hat')
2.0
>>> cmp.dist_abs('Niall', 'Neil')
3.0
>>> cmp.dist_abs('Colin', 'Cuilen')
3.5
>>> cmp.dist_abs('ATCG', 'TAGC')
2.5
>>> cmp = Typo(metric='log-manhattan')
>>> cmp.dist_abs('cat', 'hat')
0.8047189562170501
>>> cmp.dist_abs('Niall', 'Neil')
2.2424533248940004
>>> cmp.dist_abs('Colin', 'Cuilen')
2.242453324894
>>> cmp.dist_abs('ATCG', 'TAGC')
2.3465735902799727
New in version 0.3.0.
Changed in version 0.3.6: Encapsulated in class
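The role of the metric parameter in a single substitution can be sketched as below. This is only an illustration under stated assumptions, not the library's code: the keyboard is modeled as an unstaggered three-row letter grid, the log metrics are taken to be log(1 + distance), and `KEY_POS` and `substitution_cost` are hypothetical names. These assumptions were chosen because they reproduce the documented 'cat'/'hat' values, where the only edit is substituting h for c.

```python
import math

# Assumed simplified QWERTY letter grid: (row, column) per key.
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def substitution_cost(k1: str, k2: str, metric: str = "euclidean",
                      weight: float = 0.5) -> float:
    """Weighted key-to-key distance for one substitution (sketch)."""
    (r1, c1), (r2, c2) = KEY_POS[k1], KEY_POS[k2]
    dr, dc = abs(r1 - r2), abs(c1 - c2)
    if metric.endswith("euclidean"):
        dist = math.hypot(dr, dc)
    elif metric.endswith("manhattan"):
        dist = dr + dc
    else:
        raise ValueError(f"unsupported metric: {metric}")
    if metric.startswith("log-"):
        dist = math.log(1.0 + dist)  # assumed form of the log metrics
    return weight * dist
```

The full distance additionally handles insertions, deletions, and shifts through an edit-distance recurrence, which this sketch leaves out.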
- class abydos.distance.UnigramSubtuple(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unigram subtuple similarity.
For two sets X and Y and a population N, unigram subtuple similarity [Pec10] is
\[sim_{unigram~subtuple}(X, Y) = log(\frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} {|X \setminus Y| \cdot |Y \setminus X|}) - 3.29 \cdot \sqrt{\frac{1}{|X \cap Y|} + \frac{1}{|X \setminus Y|} + \frac{1}{|Y \setminus X|} + \frac{1}{|(N \setminus X) \setminus Y|}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{unigram~subtuple} = log(\frac{ad}{bc}) - 3.29 \cdot \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\]
New in version 0.4.0.
Initialize UnigramSubtuple instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the unigram subtuple similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unigram subtuple similarity
- Return type:
float
Examples
>>> cmp = UnigramSubtuple()
>>> cmp.sim('cat', 'hat')
0.6215275850074894
>>> cmp.sim('Niall', 'Neil')
0.39805896767519555
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the unigram subtuple similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unigram subtuple similarity
- Return type:
float
Examples
>>> cmp = UnigramSubtuple()
>>> cmp.sim_score('cat', 'hat')
1.9324426894059226
>>> cmp.sim_score('Niall', 'Neil')
1.4347242883606355
>>> cmp.sim_score('aluminum', 'Catalan')
-1.0866724701675263
>>> cmp.sim_score('ATCG', 'TAGC')
-0.461880260111438
New in version 0.4.0.
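The confusion table form of the unigram subtuple score can be evaluated directly, as in the sketch below. The function name is hypothetical. The counts a=2, b=2, c=2, d=778 for 'cat'/'hat' assume padded bigrams and a population of n=784 (28 squared); that population size is an assumption, not something stated above, adopted because it approximately reproduces the documented score. The sketch requires all four counts to be positive and does not replicate whatever handling the library applies when a count is zero.

```python
import math


def unigram_subtuple_score(a: float, b: float, c: float, d: float) -> float:
    """log(ad / bc) - 3.29 * sqrt(1/a + 1/b + 1/c + 1/d), all counts > 0."""
    if min(a, b, c, d) <= 0:
        raise ValueError("this sketch needs strictly positive counts")
    return math.log(a * d / (b * c)) - 3.29 * math.sqrt(
        1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d
    )


score = unigram_subtuple_score(2, 2, 2, 778)  # assumed 'cat'/'hat' counts
```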
- class abydos.distance.UnknownA(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown A correlation.
For two sets X and Y and a population N, Unknown A correlation is sometimes attributed to [Pei84]. It differs from Peirce in that the denominator is the product of the opposite pair of marginals:
\[corr_{UnknownA}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {|Y| \cdot |N \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{UnknownA} = \frac{ad-bc}{(a+c)(b+d)}\]
New in version 0.4.0.
Initialize UnknownA instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Unknown A correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown A correlation
- Return type:
float
Examples
>>> cmp = UnknownA()
>>> cmp.corr('cat', 'hat')
0.49743589743589745
>>> cmp.corr('Niall', 'Neil')
0.39486521181001283
>>> cmp.corr('aluminum', 'Catalan')
0.1147039897039897
>>> cmp.corr('ATCG', 'TAGC')
-0.006418485237483954
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Unknown A similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown A similarity
- Return type:
float
Examples
>>> cmp = UnknownA()
>>> cmp.sim('cat', 'hat')
0.7487179487179487
>>> cmp.sim('Niall', 'Neil')
0.6974326059050064
>>> cmp.sim('aluminum', 'Catalan')
0.5573519948519948
>>> cmp.sim('ATCG', 'TAGC')
0.496790757381258
New in version 0.4.0.
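The relationship between the corr and sim examples can be sketched from confusion-table counts. The function names are hypothetical, the population n=784 (so d = 784 - a - b - c) is an assumption that reproduces the documented values, and the mapping sim = (1 + corr) / 2 is inferred from those values rather than stated in the text.

```python
def unknown_a_corr(a: float, b: float, c: float, d: float) -> float:
    """(ad - bc) / ((a + c)(b + d))"""
    return (a * d - b * c) / ((a + c) * (b + d))


def unknown_a_sim(a: float, b: float, c: float, d: float) -> float:
    # Assumed rescaling of the correlation from [-1, 1] to [0, 1].
    return (1.0 + unknown_a_corr(a, b, c, d)) / 2.0


# Assumed counts for 'cat' vs 'hat': a=2, b=2, c=2, d=778.
corr = unknown_a_corr(2, 2, 2, 778)
sim = unknown_a_sim(2, 2, 2, 778)
```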
- class abydos.distance.UnknownB(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown B similarity.
For two sets X and Y and a population N, Unknown B similarity, which [Mor12] attributes to [Doo84] but could not be located in that source, is
\[sim_{UnknownB}(X, Y) = \frac{(|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|)^2} {|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{UnknownB} = \frac{(ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)}\]
New in version 0.4.0.
Initialize UnknownB instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Unknown B similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown B similarity
- Return type:
float
Examples
>>> cmp = UnknownB()
>>> cmp.sim('cat', 'hat')
0.24744247205785666
>>> cmp.sim('Niall', 'Neil')
0.13009912077202224
>>> cmp.sim('aluminum', 'Catalan')
0.011710186806836291
>>> cmp.sim('ATCG', 'TAGC')
4.1196952743799446e-05
New in version 0.4.0.
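A direct transcription of the confusion table form of Unknown B is sketched below. The function name is hypothetical, and the counts assume padded bigrams with a population of n=784, an assumption that matches the documented 'cat'/'hat' value. No guard is included for an empty marginal, which would make the denominator zero.

```python
def unknown_b(a: float, b: float, c: float, d: float) -> float:
    """(ad - bc)^2 / ((a + b)(a + c)(b + d)(c + d))"""
    numerator = (a * d - b * c) ** 2
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    return numerator / denominator


value = unknown_b(2, 2, 2, 778)  # assumed counts for 'cat' vs 'hat'
```

Because the numerator is squared, the result is never negative, which is why the 'ATCG'/'TAGC' example above is a small positive number rather than a negative one.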
- class abydos.distance.UnknownC(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown C similarity.
For two sets X and Y and a population N, Unknown C similarity, which [Mor12] attributes to [Gow71] but could not be located in that source, is
\[sim_{UnknownC}(X, Y) = \frac{|X \cap Y| + |(N \setminus X) \setminus Y|} {\sqrt{|X| \cdot |Y| \cdot |N \setminus X| \cdot |N \setminus Y|}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{UnknownC} = \frac{a+d}{\sqrt{(a+b)(a+c)(b+d)(c+d)}}\]
New in version 0.4.0.
Initialize UnknownC instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Unknown C similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown C similarity
- Return type:
float
Examples
>>> cmp = UnknownC()
>>> cmp.sim('cat', 'hat')
0.25
>>> cmp.sim('Niall', 'Neil')
0.18222244271345164
>>> cmp.sim('aluminum', 'Catalan')
0.11686463498390019
>>> cmp.sim('ATCG', 'TAGC')
0.1987163029525032
New in version 0.4.0.
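The following sketch evaluates the confusion table form of Unknown C. The function name is hypothetical, and the counts assume padded bigrams with a population of n=784, which is an assumption that reproduces the documented values. A zero marginal would make the square root zero; the sketch does not handle that case.

```python
import math


def unknown_c(a: float, b: float, c: float, d: float) -> float:
    """(a + d) / sqrt((a + b)(a + c)(b + d)(c + d))"""
    return (a + d) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))


cat_hat = unknown_c(2, 2, 2, 778)    # assumed counts for 'cat' vs 'hat'
atcg_tagc = unknown_c(0, 5, 5, 774)  # assumed counts for 'ATCG' vs 'TAGC'
```

Since d dominates the numerator for short strings in a large population, pairs with no shared tokens can still score well above zero, as the 'ATCG'/'TAGC' example shows.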
- class abydos.distance.UnknownD(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown D similarity.
For two sets X and Y and a population N, Unknown D similarity, which [Mor12] attributes to [Pei84] but could not be located in that source, is
\[sim_{UnknownD}(X, Y) = \frac{|X \cap Y| \cdot |X \setminus Y| + |X \setminus Y| \cdot |Y \setminus X|} {|X \cap Y| \cdot |X \setminus Y| + 2 \cdot |X \setminus Y| \cdot |Y \setminus X| + |Y \setminus X| \cdot |(N \setminus X) \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{UnknownD} = \frac{ab+bc}{ab+2bc+cd}\]
New in version 0.4.0.
Initialize UnknownD instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Unknown D similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown D similarity
- Return type:
float
Examples
>>> cmp = UnknownD()
>>> cmp.sim('cat', 'hat')
0.00510204081632653
>>> cmp.sim('Niall', 'Neil')
0.00848536274925753
>>> cmp.sim('aluminum', 'Catalan')
0.011630019989096857
>>> cmp.sim('ATCG', 'TAGC')
0.006377551020408163
New in version 0.4.0.
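The confusion table form of Unknown D can be checked with a short sketch. The function name is hypothetical, and the counts assume padded bigrams with a population of n=784; both are assumptions chosen because they reproduce the documented values.

```python
def unknown_d(a: float, b: float, c: float, d: float) -> float:
    """(ab + bc) / (ab + 2bc + cd)"""
    return (a * b + b * c) / (a * b + 2 * b * c + c * d)


cat_hat = unknown_d(2, 2, 2, 778)      # assumed counts for 'cat' vs 'hat'
niall_neil = unknown_d(2, 4, 3, 775)   # assumed counts for 'Niall' vs 'Neil'
```

The cd term grows with the population size, so for short strings in a large population the value stays close to zero regardless of how similar the strings are.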
- class abydos.distance.UnknownE(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown E correlation.
For two sets X and Y and a population N, Unknown E correlation, which [Mor12] attributes to [GK54] but could not be located in that source, is
\[corr_{UnknownE}(X, Y) = \frac{2 \cdot min(|X \cap Y|, |(N \setminus X) \setminus Y|) - |X \setminus Y| - |Y \setminus X|} {2 \cdot min(|X \cap Y|, |(N \setminus X) \setminus Y|) + |X \setminus Y| + |Y \setminus X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{UnknownE} = \frac{2 \cdot min(a, d) - b - c}{2 \cdot min(a, d) + b + c}\]
New in version 0.4.0.
Initialize UnknownE instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Unknown E correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown E correlation
- Return type:
float
Examples
>>> cmp = UnknownE()
>>> cmp.corr('cat', 'hat')
0.0
>>> cmp.corr('Niall', 'Neil')
-0.2727272727272727
>>> cmp.corr('aluminum', 'Catalan')
-0.7647058823529411
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Unknown E similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown E similarity
- Return type:
float
Examples
>>> cmp = UnknownE()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.36363636363636365
>>> cmp.sim('aluminum', 'Catalan')
0.11764705882352944
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
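The corr and sim examples for Unknown E can be related with a small sketch over confusion-table counts. The function names are hypothetical, the counts assume padded bigrams with a population of n=784, and the rescaling sim = (1 + corr) / 2 is inferred from the documented values rather than stated in the text.

```python
def unknown_e_corr(a: float, b: float, c: float, d: float) -> float:
    """(2 min(a, d) - b - c) / (2 min(a, d) + b + c)"""
    m = 2 * min(a, d)
    return (m - b - c) / (m + b + c)


def unknown_e_sim(a: float, b: float, c: float, d: float) -> float:
    # Assumed rescaling of the correlation from [-1, 1] to [0, 1].
    return (1.0 + unknown_e_corr(a, b, c, d)) / 2.0


corr = unknown_e_corr(2, 4, 3, 775)  # assumed counts for 'Niall' vs 'Neil'
sim = unknown_e_sim(2, 4, 3, 775)
```

When the strings share no tokens (a = 0), the correlation is exactly -1 and the rescaled similarity is 0, matching the 'ATCG'/'TAGC' examples.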
- class abydos.distance.UnknownF(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown F similarity.
For two sets X and Y and a population N, Unknown F similarity, which [CCT10] attributes to [GW66] but could not be located in that source, is given as
\[sim(X, Y) = log(|X \cap Y|) - log(|N|) - log\Big(\frac{|X|}{|N|}\Big) - log\Big(\frac{|Y|}{|N|}\Big)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim = log(a) - log(n) - log\Big(\frac{a+b}{n}\Big) - log\Big(\frac{a+c}{n}\Big)\]
This formula is not very normalizable, so the following formula is used instead:
\[sim_{UnknownF}(X, Y) = min\Bigg(1, 1+log\Big(\frac{|X \cap Y|}{|N|}\Big) - \frac{1}{2}\Bigg(log\Big(\frac{|X|}{|N|}\Big) + log\Big(\frac{|Y|}{|N|}\Big)\Bigg)\Bigg)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{UnknownF} = min\Bigg(1, 1+log\Big(\frac{a}{n}\Big) - \frac{1}{2}\Bigg(log\Big(\frac{a+b}{n}\Big) + log\Big(\frac{a+c}{n}\Big)\Bigg)\Bigg)\]
New in version 0.4.0.
Initialize UnknownF instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(*args: Any, **kwargs: Any) NoReturn [source]
Raise exception when called.
- Parameters:
*args -- Variable length argument list
**kwargs -- Arbitrary keyword arguments
- Raises:
NotImplementedError -- Method disabled for Unknown F similarity
New in version 0.4.0.
- sim(*args: Any, **kwargs: Any) NoReturn [source]
Raise exception when called.
- Parameters:
*args -- Variable length argument list
**kwargs -- Arbitrary keyword arguments
- Raises:
NotImplementedError -- Method disabled for Unknown F similarity
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Unknown F similarity between two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown F similarity
- Return type:
float
Examples
>>> cmp = UnknownF()
>>> cmp.sim_score('cat', 'hat')
0.3068528194400555
>>> cmp.sim_score('Niall', 'Neil')
-0.007451510271132555
>>> cmp.sim_score('aluminum', 'Catalan')
-1.1383330595080272
>>> cmp.sim_score('ATCG', 'TAGC')
1.0
New in version 0.4.0.
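The modified Unknown F formula can be sketched from confusion-table counts as follows. The function name is hypothetical, and the population n=784 with padded-bigram counts is an assumption that reproduces the documented scores. The sketch requires a > 0 and does not replicate how the library treats an empty intersection (the 'ATCG'/'TAGC' example above).

```python
import math


def unknown_f_score(a: float, b: float, c: float, n: float) -> float:
    """min(1, 1 + log(a/n) - (log((a+b)/n) + log((a+c)/n)) / 2), for a > 0."""
    if a <= 0:
        raise ValueError("this sketch needs a non-empty intersection")
    score = 1.0 + math.log(a / n) - 0.5 * (
        math.log((a + b) / n) + math.log((a + c) / n)
    )
    return min(1.0, score)


cat_hat = unknown_f_score(2, 2, 2, 784)     # assumed counts for 'cat' vs 'hat'
niall_neil = unknown_f_score(2, 4, 3, 784)  # assumed counts for 'Niall' vs 'Neil'
```

Note that n cancels out of this expression algebraically, so the score depends only on a, a+b, and a+c.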
- class abydos.distance.UnknownG(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown G similarity.
For two sets X and Y and a population N, Unknown G similarity, which [CCT10] attributes to [Kulczynski27] but could not be located in that source, is
\[sim_{UnknownG}(X, Y) = \frac{\frac{|X \cap Y|}{2} \cdot (|X| + |Y|)} {|X| \cdot |Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{UnknownG} = \frac{\frac{a}{2} \cdot (2a+b+c)}{(a+b)(a+c)}\]
New in version 0.4.0.
Initialize UnknownG instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Unknown G similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown G similarity
- Return type:
float
Examples
>>> cmp = UnknownG()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.36666666666666664
>>> cmp.sim('aluminum', 'Catalan')
0.11805555555555555
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
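Since 2a+b+c = (a+b) + (a+c), the confusion table form of Unknown G is algebraically equal to the mean of the two ratios a/(a+b) and a/(a+c). The sketch below checks this numerically; both function names are hypothetical, and the counts assume padded bigrams.

```python
def unknown_g(a: float, b: float, c: float) -> float:
    """(a/2)(2a + b + c) / ((a + b)(a + c))"""
    return (a / 2.0) * (2 * a + b + c) / ((a + b) * (a + c))


def mean_of_ratios(a: float, b: float, c: float) -> float:
    """Equivalent form: average of a/(a+b) and a/(a+c)."""
    return 0.5 * (a / (a + b) + a / (a + c))


# Assumed counts for 'Niall' vs 'Neil': a=2, b=4, c=3.
direct = unknown_g(2, 4, 3)
averaged = mean_of_ratios(2, 4, 3)
```

Both forms require non-empty token sets (a+b > 0 and a+c > 0).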
- class abydos.distance.UnknownH(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown H similarity.
For two sets X and Y and a population N, Unknown H similarity is a variant of the Fager-McGowan index of affinity [Fag57, FM63]. It uses minimum rather than maximum in the denominator of the second term, and is sometimes misidentified as the Fager-McGowan index of affinity (cf. [Whi82], for example).
\[sim_{UnknownH}(X, Y) = \frac{|X \cap Y|}{\sqrt{|X|\cdot|Y|}} - \frac{1}{2\sqrt{min(|X|, |Y|)}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{UnknownH} = \frac{a}{\sqrt{(a+b)(a+c)}} - \frac{1}{2\sqrt{min(a+b, a+c)}}\]
New in version 0.4.0.
Initialize UnknownH instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Unknown H similarity of two strings.
As this similarity ranges over \((-\infty, 1.0)\), this normalization simply clamps negative values to 0.0, so the result lies in the range [0.0, 1.0).
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Unknown H similarity
- Return type:
float
Examples
>>> cmp = UnknownH()
>>> cmp.sim('cat', 'hat')
0.25
>>> cmp.sim('Niall', 'Neil')
0.14154157392013175
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Unknown H similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown H similarity
- Return type:
float
Examples
>>> cmp = UnknownH()
>>> cmp.sim('cat', 'hat')
0.25
>>> cmp.sim('Niall', 'Neil')
0.14154157392013175
>>> cmp.sim('aluminum', 'Catalan')
0.0
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
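The relationship between the raw Unknown H score and its clamped, normalized form can be sketched in plain Python. This is a minimal sketch, not the library's implementation: it assumes tokens are character bigrams of the string padded with '$' and '#' (an assumption chosen because it reproduces the example values), and the helper names bigrams, abc, unknown_h_score and unknown_h_sim are hypothetical.

```python
from collections import Counter
from math import sqrt


def bigrams(word):
    """Count bigrams of word padded with '$' and '#' (assumed tokenization)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def abc(src, tar):
    """Return (a, b, c): shared, source-only and target-only token counts."""
    x, y = bigrams(src), bigrams(tar)
    return sum((x & y).values()), sum((x - y).values()), sum((y - x).values())


def unknown_h_score(src, tar):
    """Unclamped score from the 2x2 confusion-table formula."""
    a, b, c = abc(src, tar)
    return a / sqrt((a + b) * (a + c)) - 1 / (2 * sqrt(min(a + b, a + c)))


def unknown_h_sim(src, tar):
    """Clamp the raw score into [0.0, 1.0]."""
    return max(0.0, min(1.0, unknown_h_score(src, tar)))


print(unknown_h_sim("cat", "hat"))             # 0.25
print(unknown_h_score("aluminum", "Catalan"))  # negative before clamping
print(unknown_h_sim("aluminum", "Catalan"))    # 0.0
```

Under these assumptions the raw score for 'aluminum' and 'Catalan' is negative, which is why the normalized value in the examples is 0.0.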
- class abydos.distance.UnknownI(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown I similarity.
For two sets X and Y, the Unknown I similarity is based on Mountford similarity [Mou62] (see Mountford):
\[sim_{UnknownI}(X, Y) = \frac{2(|X \cap Y|+1)}{2((|X|+2)\cdot(|Y|+2))- (|X|+|Y|+4)\cdot(|X \cap Y|+1)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{UnknownI} = \frac{2(a+1)}{2(a+b+2)(a+c+2)-(2a+b+c+4)(a+1)}\]
New in version 0.4.0.
Initialize UnknownI instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Unknown I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown I similarity
- Return type:
float
Examples
>>> cmp = UnknownI()
>>> cmp.sim('cat', 'hat')
0.16666666666666666
>>> cmp.sim('Niall', 'Neil')
0.08955223880597014
>>> cmp.sim('aluminum', 'Catalan')
0.02247191011235955
>>> cmp.sim('ATCG', 'TAGC')
0.023809523809523808
New in version 0.4.0.
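The confusion-table form of Unknown I can be evaluated by hand from token counts. The following is a minimal sketch rather than the library's code; it assumes '$'/'#'-padded character bigrams as tokens (an assumption that matches the example values), and bigrams, abc and unknown_i are hypothetical names.

```python
from collections import Counter


def bigrams(word):
    """Count bigrams of word padded with '$' and '#' (assumed tokenization)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def abc(src, tar):
    """Return (a, b, c): shared, source-only and target-only token counts."""
    x, y = bigrams(src), bigrams(tar)
    return sum((x & y).values()), sum((x - y).values()), sum((y - x).values())


def unknown_i(src, tar):
    """Evaluate 2(a+1) / (2(a+b+2)(a+c+2) - (2a+b+c+4)(a+1))."""
    a, b, c = abc(src, tar)
    numerator = 2 * (a + 1)
    denominator = 2 * (a + b + 2) * (a + c + 2) - (2 * a + b + c + 4) * (a + 1)
    return numerator / denominator


# 'cat' vs 'hat': a=2, b=2, c=2, so the value is 6 / 36
print(unknown_i("cat", "hat"))
```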
- class abydos.distance.UnknownJ(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown J similarity.
For two sets X and Y and a population N, Unknown J similarity, which [Seq18] attributes to "Kocher & Wang" but could not be located, is
\[sim_{UnknownJ}(X, Y) = |X \cap Y| \cdot \frac{|N|}{|X| \cdot |N \setminus X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{UnknownJ} = a \cdot \frac{n}{(a+b)(c+d)}\]
New in version 0.4.0.
Initialize UnknownJ instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Unknown J similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Unknown J similarity
- Return type:
float
Examples
>>> cmp = UnknownJ()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.33333333333333337
>>> cmp.sim('aluminum', 'Catalan')
0.11111111111111112
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Unknown J similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown J similarity
- Return type:
float
Examples
>>> cmp = UnknownJ()
>>> cmp.sim_score('cat', 'hat')
0.5025641025641026
>>> cmp.sim_score('Niall', 'Neil')
0.33590402742073694
>>> cmp.sim_score('aluminum', 'Catalan')
0.11239977090492555
>>> cmp.sim_score('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.UnknownK(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown K distance.
For two sets X and Y and a population N, Unknown K distance, which [Seq18] attributes to "Excoffier" but could not be located, is
\[dist_{UnknownK}(X, Y) = |N| \cdot (1 - \frac{|X \cap Y|}{|N|})\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{UnknownK} = n \cdot (1 - \frac{a}{n})\]
New in version 0.4.0.
Initialize UnknownK instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized Unknown K distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Unknown K distance
- Return type:
float
Examples
>>> cmp = UnknownK()
>>> cmp.dist('cat', 'hat')
0.9974489795918368
>>> cmp.dist('Niall', 'Neil')
0.9974489795918368
>>> cmp.dist('aluminum', 'Catalan')
0.9987261146496815
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Unknown K distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown K distance
- Return type:
float
Examples
>>> cmp = UnknownK()
>>> cmp.dist_abs('cat', 'hat')
782.0
>>> cmp.dist_abs('Niall', 'Neil')
782.0
>>> cmp.dist_abs('aluminum', 'Catalan')
784.0
>>> cmp.dist_abs('ATCG', 'TAGC')
784.0
New in version 0.4.0.
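Because Unknown K depends on the population size n, its values are only reproducible once d is fixed. The sketch below is an illustration under stated assumptions, not the library's implementation: tokens are '$'/'#'-padded bigrams, the alphabet holds 28**2 = 784 possible bigrams, d is the alphabet size minus the number of distinct tokens seen in either string, and the normalized distance divides by n. These choices were inferred from the example outputs and are not confirmed by the documentation; all function names are hypothetical.

```python
from collections import Counter


def bigrams(word):
    """Count bigrams of word padded with '$' and '#' (assumed tokenization)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def table(src, tar, alphabet_size=28 ** 2):
    """Return (a, b, c, d); d uses the assumed alphabet size."""
    x, y = bigrams(src), bigrams(tar)
    a = sum((x & y).values())
    b = sum((x - y).values())
    c = sum((y - x).values())
    d = alphabet_size - len(set(x) | set(y))
    return a, b, c, d


def unknown_k_abs(src, tar):
    """Evaluate n * (1 - a/n), which simplifies to n - a."""
    a, b, c, d = table(src, tar)
    n = a + b + c + d
    return n * (1 - a / n)


def unknown_k_dist(src, tar):
    """Assumed normalization: divide the absolute distance by n."""
    a, b, c, d = table(src, tar)
    return 1 - a / (a + b + c + d)


print(unknown_k_abs("cat", "hat"), unknown_k_dist("cat", "hat"))
```

Note that n is not constant under these assumptions: 'aluminum' contains the bigram 'um' twice, so its multiset count exceeds its distinct-token count and n becomes 785.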
- class abydos.distance.UnknownL(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown L similarity.
For two sets X and Y and a population N, Unknown L similarity, which [Seq18] attributes to "Roux" but could not be located, is
\[sim_{UnknownL}(X, Y) = \frac{|X \cap Y| + |(N \setminus X) \setminus Y|} {min(|X \setminus Y|, |Y \setminus X|) + min(|N|-|X \setminus Y|, |N|-|Y \setminus X|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{UnknownL} = \frac{a+d}{min(b, c) + min(n-b, n-c)}\]
New in version 0.4.0.
Initialize UnknownL instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Unknown L similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown L similarity
- Return type:
float
Examples
>>> cmp = UnknownL()
>>> cmp.sim('cat', 'hat')
0.9948979591836735
>>> cmp.sim('Niall', 'Neil')
0.9923371647509579
>>> cmp.sim('aluminum', 'Catalan')
0.9821428571428571
>>> cmp.sim('ATCG', 'TAGC')
0.9872448979591837
New in version 0.4.0.
- class abydos.distance.UnknownM(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Unknown M similarity.
For two sets X and Y and a population N, Unknown M similarity, which [Seq18] attributes to "Roux" but could not be located, is
\[sim_{UnknownM}(X, Y) = \frac{|N|-|X \cap Y| \cdot |(N \setminus X) \setminus Y|} {\sqrt{|X| \cdot |N \setminus X| \cdot |Y| \cdot |N \setminus Y|}}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{UnknownM} = \frac{n-ad}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}\]
New in version 0.4.0.
Initialize UnknownM instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Unknown M similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Unknown M similarity
- Return type:
float
Examples
>>> cmp = UnknownM()
>>> cmp.sim('cat', 'hat')
0.6237179487179487
>>> cmp.sim('Niall', 'Neil')
0.5898213585061158
>>> cmp.sim('aluminum', 'Catalan')
0.49878582197419324
>>> cmp.sim('ATCG', 'TAGC')
0.3993581514762516
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Unknown M similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Unknown M similarity
- Return type:
float
Examples
>>> cmp = UnknownM()
>>> cmp.sim_score('cat', 'hat')
-0.24743589743589745
>>> cmp.sim_score('Niall', 'Neil')
-0.17964271701223158
>>> cmp.sim_score('aluminum', 'Catalan')
0.0024283560516135103
>>> cmp.sim_score('ATCG', 'TAGC')
0.2012836970474968
New in version 0.4.0.
- class abydos.distance.Upholt(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Upholt similarity.
For two sets X and Y and a population N, Upholt similarity, Upholt's S, [Uph77] is
\[sim_{Upholt}(X, Y) = \frac{1}{2}\Bigg(-\frac{2 \cdot |X \cap Y|}{|X| + |Y|} + \sqrt{\Big(\frac{2 \cdot |X \cap Y|}{|X| + |Y|}\Big)^2 + 8\frac{2 \cdot |X \cap Y|}{|X| + |Y|}}\Bigg)\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Upholt}(X, Y) = \frac{1}{2}\Bigg(-\frac{2a}{2a+b+c} + \sqrt{\Big(\frac{2a}{2a+b+c}\Big)^2 + 8\frac{2a}{2a+b+c}}\Bigg)\]
New in version 0.4.0.
Initialize Upholt instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Upholt similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Upholt similarity
- Return type:
float
Examples
>>> cmp = Upholt()
>>> cmp.sim('cat', 'hat')
0.7807764064044151
>>> cmp.sim('Niall', 'Neil')
0.6901511860568581
>>> cmp.sim('aluminum', 'Catalan')
0.42980140370106323
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
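In the Upholt formula the fraction 2a/(2a+b+c) appears three times, so it is convenient to compute it once. The following is a minimal sketch, not the library's code; it assumes '$'/'#'-padded bigram tokens (an assumption consistent with the example values), and the names bigrams, abc and upholt are hypothetical.

```python
from collections import Counter
from math import sqrt


def bigrams(word):
    """Count bigrams of word padded with '$' and '#' (assumed tokenization)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def abc(src, tar):
    """Return (a, b, c): shared, source-only and target-only token counts."""
    x, y = bigrams(src), bigrams(tar)
    return sum((x & y).values()), sum((x - y).values()), sum((y - x).values())


def upholt(src, tar):
    """Evaluate (-f + sqrt(f^2 + 8f)) / 2 with f = 2a / (2a + b + c)."""
    a, b, c = abc(src, tar)
    f = 2 * a / (2 * a + b + c)
    return (-f + sqrt(f * f + 8 * f)) / 2


# 'cat' vs 'hat': f = 4/8 = 0.5, so the value is (-0.5 + sqrt(4.25)) / 2
print(upholt("cat", "hat"))
```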
- class abydos.distance.VPS(**kwargs: Any)[source]
Bases:
_Distance
Victorian Panel Study (VPS) score.
VPS score is presented in [Schurer07].
New in version 0.4.1.
Initialize _Distance instance.
- Parameters:
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Victorian Panel Study score of two words.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The VPS score
- Return type:
float
Examples
>>> cmp = VPS()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3
>>> cmp.sim('aluminum', 'Catalan')
0.14285714285714285
>>> cmp.sim('ATCG', 'TAGC')
0.3333333333333333
New in version 0.4.1.
- class abydos.distance.WarrensI(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Warrens I correlation.
For two sets X and Y, Warrens I correlation \(S_{NS1}\) [War08] is
\[corr_{WarrensI}(X, Y) = \frac{2|X \cap Y| - |X \setminus Y| - |Y \setminus X|} {2|X \cap Y| + |X \setminus Y| + |Y \setminus X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{WarrensI} = \frac{2a-b-c}{2a+b+c}\]
New in version 0.4.0.
Initialize WarrensI instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Warrens I correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Warrens I correlation
- Return type:
float
Examples
>>> cmp = WarrensI()
>>> cmp.corr('cat', 'hat')
0.0
>>> cmp.corr('Niall', 'Neil')
-0.2727272727272727
>>> cmp.corr('aluminum', 'Catalan')
-0.7647058823529411
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Warrens I similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Warrens I similarity
- Return type:
float
Examples
>>> cmp = WarrensI()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.36363636363636365
>>> cmp.sim('aluminum', 'Catalan')
0.11764705882352944
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
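The Warrens I examples are consistent with the similarity being the correlation rescaled from [-1, 1] to [0, 1], that is sim = (1 + corr) / 2. The sketch below illustrates this under assumptions: '$'/'#'-padded bigram tokens, and a rescaling rule inferred from the example values rather than stated in the documentation. The function names are hypothetical.

```python
from collections import Counter


def bigrams(word):
    """Count bigrams of word padded with '$' and '#' (assumed tokenization)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def abc(src, tar):
    """Return (a, b, c): shared, source-only and target-only token counts."""
    x, y = bigrams(src), bigrams(tar)
    return sum((x & y).values()), sum((x - y).values()), sum((y - x).values())


def warrens_i_corr(src, tar):
    """Evaluate (2a - b - c) / (2a + b + c)."""
    a, b, c = abc(src, tar)
    return (2 * a - b - c) / (2 * a + b + c)


def warrens_i_sim(src, tar):
    """Assumed rescaling of the correlation into [0, 1]."""
    return (1 + warrens_i_corr(src, tar)) / 2


print(warrens_i_corr("cat", "hat"), warrens_i_sim("cat", "hat"))    # 0.0 0.5
print(warrens_i_corr("ATCG", "TAGC"), warrens_i_sim("ATCG", "TAGC"))  # -1.0 0.0
```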
- class abydos.distance.WarrensII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Warrens II similarity.
For two sets X and Y and a population N, Warrens II similarity \(S_{NS2}\) [War08] is
\[sim_{WarrensII}(X, Y) = \frac{2|(N \setminus X) \setminus Y|} {|N \setminus X| + |N \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{WarrensII} = \frac{2d}{b+c+2d}\]
New in version 0.4.0.
Initialize WarrensII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Warrens II similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Warrens II similarity
- Return type:
float
Examples
>>> cmp = WarrensII()
>>> cmp.sim('cat', 'hat')
0.9974358974358974
>>> cmp.sim('Niall', 'Neil')
0.9955041746949261
>>> cmp.sim('aluminum', 'Catalan')
0.9903412749517064
>>> cmp.sim('ATCG', 'TAGC')
0.993581514762516
New in version 0.4.0.
- class abydos.distance.WarrensIII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Warrens III correlation.
For two sets X and Y and a population N, Warrens III correlation \(S_{NS3}\) [War08] is
\[corr_{WarrensIII}(X, Y) = \frac{2|(N \setminus X) \setminus Y| - |X \setminus Y| - |Y \setminus X|}{|N \setminus X| + |N \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{WarrensIII} = \frac{2d-b-c}{2d+b+c}\]
New in version 0.4.0.
Initialize WarrensIII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return the Warrens III correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Warrens III correlation
- Return type:
float
Examples
>>> cmp = WarrensIII()
>>> cmp.corr('cat', 'hat')
0.9948717948717949
>>> cmp.corr('Niall', 'Neil')
0.9910083493898523
>>> cmp.corr('aluminum', 'Catalan')
0.9806825499034127
>>> cmp.corr('ATCG', 'TAGC')
0.9871630295250321
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Warrens III similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Warrens III similarity
- Return type:
float
Examples
>>> cmp = WarrensIII()
>>> cmp.sim('cat', 'hat')
0.9974358974358974
>>> cmp.sim('Niall', 'Neil')
0.9955041746949261
>>> cmp.sim('aluminum', 'Catalan')
0.9903412749517064
>>> cmp.sim('ATCG', 'TAGC')
0.993581514762516
New in version 0.4.0.
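The Warrens III similarity examples coincide with the Warrens II similarity examples. This follows algebraically if the similarity is the rescaled correlation: (1 + (2d-b-c)/(2d+b+c)) / 2 = 2d/(2d+b+c). The sketch below checks this numerically. It is not the library's code: it assumes '$'/'#'-padded bigram tokens, an alphabet of 28**2 = 784 bigrams, d equal to the alphabet size minus the number of distinct tokens in either string, and sim = (1 + corr) / 2. All of these were inferred from the example outputs, and the function names are hypothetical.

```python
from collections import Counter


def bigrams(word):
    """Count bigrams of word padded with '$' and '#' (assumed tokenization)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def table(src, tar, alphabet_size=28 ** 2):
    """Return (a, b, c, d); d uses the assumed alphabet size."""
    x, y = bigrams(src), bigrams(tar)
    a = sum((x & y).values())
    b = sum((x - y).values())
    c = sum((y - x).values())
    d = alphabet_size - len(set(x) | set(y))
    return a, b, c, d


def warrens_iii_corr(src, tar):
    """Evaluate (2d - b - c) / (2d + b + c)."""
    _, b, c, d = table(src, tar)
    return (2 * d - b - c) / (2 * d + b + c)


def warrens_iii_sim(src, tar):
    """Assumed rescaling of the correlation into [0, 1]."""
    return (1 + warrens_iii_corr(src, tar)) / 2


def warrens_ii_sim(src, tar):
    """Evaluate 2d / (b + c + 2d)."""
    _, b, c, d = table(src, tar)
    return 2 * d / (b + c + 2 * d)


print(warrens_iii_corr("cat", "hat"))
print(warrens_iii_sim("cat", "hat"), warrens_ii_sim("cat", "hat"))
```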
- class abydos.distance.WarrensIV(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Warrens IV similarity.
For two sets X and Y and a population N, Warrens IV similarity [War08] is
\[sim_{WarrensIV}(X, Y) = \frac{4|X \cap Y| \cdot |(N \setminus X) \setminus Y|} {4|X \cap Y| \cdot |(N \setminus X) \setminus Y| + (|X \cap Y| + |(N \setminus X) \setminus Y|) (|X \setminus Y| + |Y \setminus X|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{WarrensIV} = \frac{4ad}{4ad + (a+d)(b+c)}\]
New in version 0.4.0.
Initialize WarrensIV instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Warrens IV similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Warrens IV similarity
- Return type:
float
Examples
>>> cmp = WarrensIV()
>>> cmp.sim('cat', 'hat')
0.666095890410959
>>> cmp.sim('Niall', 'Neil')
0.5326918120113412
>>> cmp.sim('aluminum', 'Catalan')
0.21031040612607685
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.WarrensV(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Warrens V similarity.
For two sets X and Y and a population N, Warrens V similarity [War08] is
\[sim_{WarrensV}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {min(|X| \cdot |Y|, |N \setminus X| \cdot |N \setminus Y|)}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{WarrensV} = \frac{ad-bc}{min( (a+b)(a+c), (b+d)(c+d) )}\]
New in version 0.4.0.
Initialize WarrensV instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the normalized Warrens V similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Warrens V similarity
- Return type:
float
Examples
>>> cmp = WarrensV()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.3333333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.11125283446712018
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str) float [source]
Return the Warrens V similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Warrens V similarity
- Return type:
float
Examples
>>> cmp = WarrensV()
>>> cmp.sim_score('cat', 'hat')
97.0
>>> cmp.sim_score('Niall', 'Neil')
51.266666666666666
>>> cmp.sim_score('aluminum', 'Catalan')
9.902777777777779
>>> cmp.sim_score('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
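The unnormalized Warrens V score can be far outside [0, 1] because the denominator is the smaller of two products while d is typically large. The sketch below evaluates the confusion-table formula under assumptions inferred from the example outputs, not taken from the library: '$'/'#'-padded bigram tokens, an alphabet of 28**2 = 784 bigrams, and d equal to the alphabet size minus the number of distinct tokens in either string. The function names are hypothetical.

```python
from collections import Counter


def bigrams(word):
    """Count bigrams of word padded with '$' and '#' (assumed tokenization)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def table(src, tar, alphabet_size=28 ** 2):
    """Return (a, b, c, d); d uses the assumed alphabet size."""
    x, y = bigrams(src), bigrams(tar)
    a = sum((x & y).values())
    b = sum((x - y).values())
    c = sum((y - x).values())
    d = alphabet_size - len(set(x) | set(y))
    return a, b, c, d


def warrens_v_score(src, tar):
    """Evaluate (ad - bc) / min((a+b)(a+c), (b+d)(c+d))."""
    a, b, c, d = table(src, tar)
    return (a * d - b * c) / min((a + b) * (a + c), (b + d) * (c + d))


# 'cat' vs 'hat': a=2, b=2, c=2, d=778, so (1556 - 4) / min(16, 780*780)
print(warrens_v_score("cat", "hat"))    # 97.0
print(warrens_v_score("ATCG", "TAGC"))  # -1.0
```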
- class abydos.distance.WeightedJaccard(tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', weight: int = 3, **kwargs: Any)[source]
Bases:
_TokenDistance
Weighted Jaccard similarity.
For two sets X and Y and a weight w, the Weighted Jaccard similarity [LL98] is
\[sim_{Jaccard_w}(X, Y) = \frac{w \cdot |X \cap Y|} {w \cdot |X \cap Y| + |X \setminus Y| + |Y \setminus X|}\]
Here, the intersection between the two sets is weighted by w. Compare to Jaccard similarity (\(w = 1\)), and to Dice similarity (\(w = 2\)). In the default case, the weight of the intersection is 3, following [LL98].
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Jaccard_w} = \frac{w\cdot a}{w\cdot a+b+c}\]
New in version 0.4.0.
Initialize WeightedJaccard instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
weight (int) -- The weight to apply to the intersection cardinality. (3, by default.)
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Weighted Jaccard similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Weighted Jaccard similarity
- Return type:
float
Examples
>>> cmp = WeightedJaccard()
>>> cmp.sim('cat', 'hat')
0.6
>>> cmp.sim('Niall', 'Neil')
0.46153846153846156
>>> cmp.sim('aluminum', 'Catalan')
0.16666666666666666
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
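The effect of the weight can be seen by evaluating the confusion-table formula for w = 1 (Jaccard), w = 2 (Dice) and w = 3 (the default). This is a minimal sketch, not the library's implementation; it assumes '$'/'#'-padded bigram tokens (an assumption that matches the example values), and the function names are hypothetical.

```python
from collections import Counter


def bigrams(word):
    """Count bigrams of word padded with '$' and '#' (assumed tokenization)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def abc(src, tar):
    """Return (a, b, c): shared, source-only and target-only token counts."""
    x, y = bigrams(src), bigrams(tar)
    return sum((x & y).values()), sum((x - y).values()), sum((y - x).values())


def weighted_jaccard(src, tar, weight=3):
    """Evaluate w*a / (w*a + b + c)."""
    a, b, c = abc(src, tar)
    return weight * a / (weight * a + b + c)


# 'cat' vs 'hat' has a=2, b=2, c=2
for w in (1, 2, 3):
    print(w, weighted_jaccard("cat", "hat", weight=w))
```

Increasing the weight raises the similarity for any pair that shares at least one token and leaves pairs with no shared tokens at 0.0.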
- class abydos.distance.Whittaker(tokenizer: Optional[_Tokenizer] = None, **kwargs: Any)[source]
Bases:
_TokenDistance
Whittaker distance.
For two multisets X and Y drawn from an alphabet S, Whittaker distance [Whi52] is
\[sim_{Whittaker}(X, Y) = 1 - \frac{1}{2}\sum_{i \in S} \Bigg| \frac{|X_i|}{|X|} - \frac{|Y_i|}{|Y|} \Bigg|\]
New in version 0.4.0.
Initialize Whittaker instance.
- Parameters:
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return the Whittaker similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Whittaker similarity
- Return type:
float
Examples
>>> cmp = Whittaker()
>>> cmp.sim('cat', 'hat')
0.5
>>> cmp.sim('Niall', 'Neil')
0.33333333333333
>>> cmp.sim('aluminum', 'Catalan')
0.11111111111111
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
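Unlike the confusion-table measures, the Whittaker formula compares relative token frequencies, so it is computed per token over the union of both multisets. The following is a minimal sketch under the assumption that tokens are '$'/'#'-padded bigrams (which reproduces the example values); it is not the library's code and the function names are hypothetical.

```python
from collections import Counter


def bigrams(word):
    """Count bigrams of word padded with '$' and '#' (assumed tokenization)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def whittaker_sim(src, tar):
    """Evaluate 1 - (1/2) * sum over tokens of |X_i/|X| - Y_i/|Y||."""
    x, y = bigrams(src), bigrams(tar)
    size_x, size_y = sum(x.values()), sum(y.values())
    # Counter returns 0 for tokens absent from one of the strings.
    total = sum(abs(x[t] / size_x - y[t] / size_y) for t in set(x) | set(y))
    return 1 - total / 2


# 'cat' vs 'hat': two shared tokens contribute 0, four unshared contribute 1/4 each
print(whittaker_sim("cat", "hat"))
```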
- class abydos.distance.YJHHR(pval: int = 1, alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
YJHHR distance.
For two sets X and Y and a parameter p, YJHHR distance [YJH+16] is
\[dist_{YJHHR_p}(X, Y) = \sqrt[p]{|X \setminus Y|^p + |Y \setminus X|^p}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{YJHHR} = \sqrt[p]{b^p + c^p}\]
New in version 0.4.0.
Initialize YJHHR instance.
- Parameters:
pval (int) -- The \(p\)-value of the \(L^p\)-space
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the normalized YJHHR distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized YJHHR distance
- Return type:
float
Examples
>>> cmp = YJHHR()
>>> cmp.dist('cat', 'hat')
0.6666666666666666
>>> cmp.dist('Niall', 'Neil')
0.7777777777777778
>>> cmp.dist('aluminum', 'Catalan')
0.9375
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the YJHHR distance of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
YJHHR distance
- Return type:
float
Examples
>>> cmp = YJHHR()
>>> cmp.dist_abs('cat', 'hat')
4.0
>>> cmp.dist_abs('Niall', 'Neil')
7.0
>>> cmp.dist_abs('aluminum', 'Catalan')
15.0
>>> cmp.dist_abs('ATCG', 'TAGC')
10.0
New in version 0.4.0.
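The absolute YJHHR distance is the p-norm of the two one-sided difference counts b and c. The sketch below evaluates it for p = 1 and p = 2. It is an illustration rather than the library's implementation: '$'/'#'-padded bigram tokens are assumed (consistent with the example values for the default p = 1), and the function names are hypothetical.

```python
from collections import Counter


def bigrams(word):
    """Count bigrams of word padded with '$' and '#' (assumed tokenization)."""
    padded = "$" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def yjhhr_abs(src, tar, pval=1):
    """Evaluate (b**p + c**p) ** (1/p) for the one-sided difference counts."""
    x, y = bigrams(src), bigrams(tar)
    b = sum((x - y).values())  # tokens only in src
    c = sum((y - x).values())  # tokens only in tar
    return (b ** pval + c ** pval) ** (1 / pval)


print(yjhhr_abs("cat", "hat"))          # 4.0, since b = c = 2 and p = 1
print(yjhhr_abs("cat", "hat", pval=2))  # square root of 8
```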
- class abydos.distance.YatesChiSquared(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases:
_TokenDistance
Yates's Chi-Squared similarity.
For two sets X and Y and a population N, Yates's \(\chi^2\) similarity [Yat34] is
\[sim_{Yates_{\chi^2}}(X, Y) = \frac{|N| \cdot (||X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|| - \frac{|N|}{2})^2} {|X| \cdot |N \setminus X| \cdot |Y| \cdot |N \setminus Y|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[sim_{Yates_{\chi^2}} = \frac{n \cdot (|ad-bc| - \frac{n}{2})^2}{(a+b)(c+d)(a+c)(b+d)}\]
New in version 0.4.0.
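The 2x2 form of the formula can be evaluated directly. The following is a minimal sketch in plain Python and is not the abydos implementation. The helper name `yates_chi_squared`, the zero-denominator convention, and the example counts are assumptions of this sketch; the counts (2 shared tokens, 2 unique to each string, 778 in neither) were chosen because they reproduce the documented `sim_score('cat', 'hat')` value.

```python
def yates_chi_squared(a: float, b: float, c: float, d: float) -> float:
    """Yates-corrected chi-squared score for a 2x2 table (a, b, c, d)."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # Convention of this sketch only: an empty margin yields 0.0.
        return 0.0
    return n * (abs(a * d - b * c) - n / 2) ** 2 / denominator


# Assumed counts: a=2 shared, b=2 and c=2 unshared, d=778 in neither string.
print(round(yates_chi_squared(2, 2, 2, 778), 4))  # 108.3734
```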
Initialize YatesChiSquared instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Yates's normalized Chi-Squared similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Yates's Chi-Squared similarity
- Return type:
float
Examples
>>> cmp = YatesChiSquared()
>>> cmp.sim('cat', 'hat')
0.18081199852082455
>>> cmp.sim('Niall', 'Neil')
0.08608296705052738
>>> cmp.sim('aluminum', 'Catalan')
0.0026563223707532654
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- sim_score(src: str, tar: str, signed: bool = False) float [source]
Return Yates's Chi-Squared similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
signed (bool) -- If True, negative correlations will carry a negative sign
- Returns:
Yates's Chi-Squared similarity
- Return type:
float
Examples
>>> cmp = YatesChiSquared()
>>> cmp.sim_score('cat', 'hat')
108.37343852728468
>>> cmp.sim_score('Niall', 'Neil')
56.630055670871954
>>> cmp.sim_score('aluminum', 'Catalan')
1.8574215841854373
>>> cmp.sim_score('ATCG', 'TAGC')
6.960385076156687
New in version 0.4.0.
- class abydos.distance.YujianBo(cost: Tuple[int, int, int, int] = (1, 1, 1, 1), **kwargs: Any)[source]
Bases: Levenshtein
Yujian-Bo normalized Levenshtein distance.
Yujian-Bo's normalization of Levenshtein distance [YB07], given Levenshtein distance \(GLD(X, Y)\) between two strings X and Y, is
\[dist_{N-GLD}(X, Y) = \frac{2 \cdot GLD(X, Y)}{|X| + |Y| + GLD(X, Y)}\]
New in version 0.4.0.
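As a worked illustration of the normalization, the sketch below computes a unit-cost Levenshtein distance and applies the formula to it. It is a standalone approximation, not the abydos code: the helper names `levenshtein` and `yujian_bo` are hypothetical, and only the default cost of 1 per insert, delete, and substitution is modeled (no transpositions, no custom `cost` tuple).

```python
def levenshtein(src: str, tar: str) -> int:
    """Unit-cost Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(tar) + 1))
    for i, s_char in enumerate(src, start=1):
        current = [i]
        for j, t_char in enumerate(tar, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute or match
            ))
        previous = current
    return previous[-1]


def yujian_bo(src: str, tar: str) -> float:
    """Normalize the edit distance as 2*GLD / (|X| + |Y| + GLD)."""
    gld = levenshtein(src, tar)
    if gld == 0:
        # Identical strings, including two empty strings.
        return 0.0
    return 2 * gld / (len(src) + len(tar) + gld)


print(round(yujian_bo("cat", "hat"), 6))  # 0.285714
```

For 'cat' and 'hat' the edit distance is 1 and the lengths sum to 6, so the result is 2/7, which agrees with the documented `dist` value.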
Initialize YujianBo instance.
- Parameters:
cost (tuple) -- A 4-tuple representing the cost of the four possible edits: inserts, deletes, substitutions, and transpositions, respectively (by default: (1, 1, 1, 1))
**kwargs -- Arbitrary keyword arguments
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return the Yujian-Bo normalized edit distance between strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Yujian-Bo normalized edit distance between src & tar
- Return type:
float
Examples
>>> cmp = YujianBo()
>>> round(cmp.dist('cat', 'hat'), 12)
0.285714285714
>>> round(cmp.dist('Niall', 'Neil'), 12)
0.5
>>> cmp.dist('aluminum', 'Catalan')
0.6363636363636364
>>> cmp.dist('ATCG', 'TAGC')
0.5454545454545454
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return the Yujian-Bo normalized edit distance between two strings.
- Parameters:
src (str) -- Source string for comparison
tar (str) -- Target string for comparison
- Returns:
The Yujian-Bo normalized edit distance between src & tar
- Return type:
float
Examples
>>> cmp = YujianBo()
>>> cmp.dist_abs('cat', 'hat')
0.2857142857142857
>>> cmp.dist_abs('Niall', 'Neil')
0.5
>>> cmp.dist_abs('aluminum', 'Catalan')
0.6363636363636364
>>> cmp.dist_abs('ATCG', 'TAGC')
0.5454545454545454
New in version 0.4.0.
- class abydos.distance.YuleQ(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases: _TokenDistance
Yule's Q correlation.
For two sets X and Y and a population N, Yule's Q correlation [Yul12] is
\[corr_{Yule_Q}(X, Y) = \frac{|X \cap Y| \cdot |(N \setminus X) \setminus Y| - |X \setminus Y| \cdot |Y \setminus X|} {|X \cap Y| \cdot |(N \setminus X) \setminus Y| + |X \setminus Y| \cdot |Y \setminus X|}\]
Yule himself terms this the coefficient of association.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Yule_Q} = \frac{ad-bc}{ad+bc}\]
New in version 0.4.0.
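The 2x2 form can be checked with a few lines of plain Python. This is a minimal sketch rather than the library's code: `yule_q`, its zero-denominator convention, and the example counts are assumptions. The counts (a=2, b=2, c=2, d=778) were picked because they reproduce the documented `corr('cat', 'hat')` value. The sketch also shows the rescaling (1 + corr) / 2, which is consistent with the documented `sim` values but is inferred from those outputs, not stated by the source.

```python
def yule_q(a: float, b: float, c: float, d: float) -> float:
    """Yule's Q for a 2x2 table: (ad - bc) / (ad + bc)."""
    ad, bc = a * d, b * c
    if ad + bc == 0:
        # Convention of this sketch only: undefined ratio yields 0.0.
        return 0.0
    return (ad - bc) / (ad + bc)


q = yule_q(2, 2, 2, 778)      # assumed counts for 'cat' vs. 'hat'
print(round(q, 6))            # 0.994872
print(round((1 + q) / 2, 6))  # 0.997436, correlation rescaled to [0, 1]
```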
Initialize YuleQ instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return Yule's Q correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Yule's Q correlation
- Return type:
float
Examples
>>> cmp = YuleQ()
>>> cmp.corr('cat', 'hat')
0.9948717948717949
>>> cmp.corr('Niall', 'Neil')
0.9846350832266325
>>> cmp.corr('aluminum', 'Catalan')
0.8642424242424243
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Yule's Q similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Yule's Q similarity
- Return type:
float
Examples
>>> cmp = YuleQ()
>>> cmp.sim('cat', 'hat')
0.9974358974358974
>>> cmp.sim('Niall', 'Neil')
0.9923175416133163
>>> cmp.sim('aluminum', 'Catalan')
0.9321212121212121
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.
- class abydos.distance.YuleQII(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases: _TokenDistance
Yule's Q dissimilarity.
For two sets X and Y and a population N, Yule's Q dissimilarity [YK68] is
\[dist_{Yule_{QII}}(X, Y) = \frac{2 \cdot |X \setminus Y| \cdot |Y \setminus X|} {|X \cap Y| \cdot |(N \setminus X) \setminus Y| + |X \setminus Y| \cdot |Y \setminus X|}\]
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[dist_{Yule_{QII}} = \frac{2bc}{ad+bc}\]
New in version 0.4.0.
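Algebraically, 2bc / (ad + bc) equals 1 - (ad - bc) / (ad + bc), so this dissimilarity is one minus Yule's Q and ranges over [0, 2]. The sketch below checks that identity numerically. It is an illustration only: `yule_q_ii`, the zero-denominator convention, and the example counts (chosen to match the documented `dist_abs('cat', 'hat')` value) are assumptions of this sketch.

```python
def yule_q_ii(a: float, b: float, c: float, d: float) -> float:
    """Yule's Q dissimilarity for a 2x2 table: 2bc / (ad + bc)."""
    ad, bc = a * d, b * c
    if ad + bc == 0:
        # Convention of this sketch only: undefined ratio yields 0.0.
        return 0.0
    return 2 * bc / (ad + bc)


a, b, c, d = 2, 2, 2, 778              # assumed counts for 'cat' vs. 'hat'
dissimilarity = yule_q_ii(a, b, c, d)
q = (a * d - b * c) / (a * d + b * c)  # Yule's Q for the same table

print(round(dissimilarity, 6))                # 0.005128
print(abs((1 - q) - dissimilarity) < 1e-12)   # True
```

The documented `dist` values are half the corresponding `dist_abs` values, which is what dividing by the maximum of 2 would give.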
Initialize YuleQII instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- dist(src: str, tar: str) float [source]
Return normalized Yule's Q dissimilarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Normalized Yule's Q II distance
- Return type:
float
Examples
>>> cmp = YuleQII()
>>> cmp.dist('cat', 'hat')
0.002564102564102564
>>> cmp.dist('Niall', 'Neil')
0.0076824583866837385
>>> cmp.dist('aluminum', 'Catalan')
0.06787878787878789
>>> cmp.dist('ATCG', 'TAGC')
1.0
New in version 0.4.0.
- dist_abs(src: str, tar: str) float [source]
Return Yule's Q dissimilarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Yule's Q II distance
- Return type:
float
Examples
>>> cmp = YuleQII()
>>> cmp.dist_abs('cat', 'hat')
0.005128205128205128
>>> cmp.dist_abs('Niall', 'Neil')
0.015364916773367477
>>> cmp.dist_abs('aluminum', 'Catalan')
0.13575757575757577
>>> cmp.dist_abs('ATCG', 'TAGC')
2.0
New in version 0.4.0.
- class abydos.distance.YuleY(alphabet: Optional[Union[Counter[str], Sequence[str], Set[str], int]] = None, tokenizer: Optional[_Tokenizer] = None, intersection_type: str = 'crisp', **kwargs: Any)[source]
Bases: _TokenDistance
Yule's Y correlation.
For two sets X and Y and a population N, Yule's Y correlation [Yul12] is
\[corr_{Yule_Y}(X, Y) = \frac{\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} - \sqrt{|X \setminus Y| \cdot |Y \setminus X|}} {\sqrt{|X \cap Y| \cdot |(N \setminus X) \setminus Y|} + \sqrt{|X \setminus Y| \cdot |Y \setminus X|}}\]
In [Yul12], this is labeled \(\omega\), so it is sometimes referred to as Yule's \(\omega\). Yule himself terms this the coefficient of colligation.
In 2x2 confusion table terms, where a+b+c+d=n, this is
\[corr_{Yule_Y} = \frac{\sqrt{ad}-\sqrt{bc}}{\sqrt{ad}+\sqrt{bc}}\]
New in version 0.4.0.
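A small sketch of the 2x2 form follows, along with the algebraic link to Yule's Q: substituting the definition of Y into 2Y / (1 + Y^2) gives (ad - bc) / (ad + bc). The function name `yule_y`, the zero-denominator convention, and the example counts (chosen to reproduce the documented `corr('cat', 'hat')` value) are assumptions of this sketch, not part of abydos.

```python
from math import sqrt


def yule_y(a: float, b: float, c: float, d: float) -> float:
    """Yule's Y for a 2x2 table: (sqrt(ad) - sqrt(bc)) / (sqrt(ad) + sqrt(bc))."""
    root_ad, root_bc = sqrt(a * d), sqrt(b * c)
    if root_ad + root_bc == 0:
        # Convention of this sketch only: undefined ratio yields 0.0.
        return 0.0
    return (root_ad - root_bc) / (root_ad + root_bc)


y = yule_y(2, 2, 2, 778)              # assumed counts for 'cat' vs. 'hat'
print(round(y, 6))                    # 0.903489
print(round(2 * y / (1 + y * y), 6))  # 0.994872, Yule's Q for the same table
```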
Initialize YuleY instance.
- Parameters:
alphabet (Counter, collection, int, or None) -- This represents the alphabet of possible tokens. See alphabet description in _TokenDistance for details.
tokenizer (_Tokenizer) -- A tokenizer instance from the abydos.tokenizer package
intersection_type (str) -- Specifies the intersection type, and set type as a result: See intersection_type description in _TokenDistance for details.
**kwargs -- Arbitrary keyword arguments
qval (int) -- The length of each q-gram. Using this parameter and tokenizer=None will cause the instance to use the QGram tokenizer with this q value.
metric (_Distance) -- A string distance measure class for use in the soft and fuzzy variants.
threshold (float) -- A threshold value, similarities above which are counted as members of the intersection for the fuzzy variant.
New in version 0.4.0.
- corr(src: str, tar: str) float [source]
Return Yule's Y correlation of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Yule's Y correlation
- Return type:
float
Examples
>>> cmp = YuleY()
>>> cmp.corr('cat', 'hat')
0.9034892632818762
>>> cmp.corr('Niall', 'Neil')
0.8382551144735259
>>> cmp.corr('aluminum', 'Catalan')
0.5749826820237787
>>> cmp.corr('ATCG', 'TAGC')
-1.0
New in version 0.4.0.
- sim(src: str, tar: str) float [source]
Return Yule's Y similarity of two strings.
- Parameters:
src (str) -- Source string (or QGrams/Counter objects) for comparison
tar (str) -- Target string (or QGrams/Counter objects) for comparison
- Returns:
Yule's Y similarity
- Return type:
float
Examples
>>> cmp = YuleY()
>>> cmp.sim('cat', 'hat')
0.9517446316409381
>>> cmp.sim('Niall', 'Neil')
0.919127557236763
>>> cmp.sim('aluminum', 'Catalan')
0.7874913410118893
>>> cmp.sim('ATCG', 'TAGC')
0.0
New in version 0.4.0.