abydos.fingerprint package

The fingerprint package implements string fingerprints such as:

  • Basic fingerprinters originating in OpenRefine <http://openrefine.org>:

    • String (String)

    • Phonetic, which applies a phonetic algorithm and returns the string fingerprint of the result (Phonetic)

    • QGram, which applies Q-gram tokenization and returns the string fingerprint of the result (QGram)

  • Fingerprints developed by Pollock & Zamora:

    • Skeleton key (SkeletonKey)

    • Omission key (OmissionKey)

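  • Fingerprints developed by Cisłak & Grabowski:

    • Occurrence (Occurrence)

    • Occurrence halved (OccurrenceHalved)

    • Count (Count)

    • Position (Position)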

  • The Synoname toolcode (SynonameToolcode)

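  • Taft's codings:

    • Consonant coding (Consonant)

    • Extract - letter list (Extract)

    • Extract - position & frequency (ExtractPositionFrequency)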

  • L.A. County Sheriff's System (LACSS)

  • Library of Congress Cutter table encoding (LCCutter)

  • Burrows-Wheeler transform (BWTF) and run-length encoded Burrows-Wheeler transform (BWTRLEF)

Each fingerprint class has a fingerprint method that takes a string and returns the string's fingerprint:

>>> from abydos.fingerprint import SkeletonKey
>>> sk = SkeletonKey()
>>> sk.fingerprint('orange')
'ORNGAE'
>>> sk.fingerprint('strange')
'STRNGAE'

class abydos.fingerprint.BWTF(terminator: str = '\x00')

Bases: _Fingerprint

Burrows-Wheeler transform fingerprint.

This is a wrapper of the BWT class in abydos.compression, which provides the same interface as other descendants of _Fingerprint.

New in version 0.4.1.

Initialize BWTF instance.

Parameters:

terminator (str) -- A character added to signal the end of the string

New in version 0.4.1.

fingerprint(word: str) -> str

Return the Burrows-Wheeler transform of a word.

Parameters:

word (str) -- The word to fingerprint

Returns:

The Burrows-Wheeler transform of a word

Return type:

str

Examples

>>> fp = BWTF()
>>> fp.fingerprint('hat')
'th\x00a'
>>> fp.fingerprint('niall')
'linla\x00'
>>> fp.fingerprint('colin')
'n\x00loic'
>>> fp.fingerprint('atcg')
'g\x00tca'
>>> fp.fingerprint('entreatment')
'term\x00teetnan'

New in version 0.4.1.
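
The transform can be illustrated without the library. The following is a minimal sketch, not the implementation in abydos.compression: it assumes the textbook construction (append the terminator, sort all rotations, read off the last column), and bwt_sketch is a hypothetical name. Under that assumption it reproduces the examples above.

```python
def bwt_sketch(word: str, terminator: str = "\x00") -> str:
    """Return the last column of the sorted rotations of word + terminator."""
    if terminator in word:
        raise ValueError("terminator must not occur in the word")
    text = word + terminator
    # Build every rotation of the terminated string and sort them.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the final character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


print(repr(bwt_sketch("hat")))    # 'th\x00a'
print(repr(bwt_sketch("niall")))  # 'linla\x00'
```

Because '\x00' sorts before every letter, the terminator's position in the output depends on the word rather than being fixed.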

class abydos.fingerprint.BWTRLEF(terminator: str = '\x00')

Bases: _Fingerprint

Burrows-Wheeler transform plus run-length encoding fingerprint.

This is a wrapper of the BWT and RLE classes in abydos.compression, which provides the same interface as other descendants of _Fingerprint.

New in version 0.4.1.

Initialize BWTRLEF instance.

Parameters:

terminator (str) -- A character added to signal the end of the string

New in version 0.4.1.

fingerprint(word: str) -> str

Return the run-length encoded Burrows-Wheeler transform of a word.

Parameters:

word (str) -- The word to fingerprint

Returns:

The run-length encoded Burrows-Wheeler transform of a word

Return type:

str

Examples

>>> fp = BWTRLEF()
>>> fp.fingerprint('hat')
'th\x00a'
>>> fp.fingerprint('niall')
'linla\x00'
>>> fp.fingerprint('colin')
'n\x00loic'
>>> fp.fingerprint('atcg')
'g\x00tca'
>>> fp.fingerprint('entreatment')
'term\x00teetnan'

New in version 0.4.1.
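
The examples above match the plain BWTF output because none of the transformed strings contains a run longer than two characters. The sketch below illustrates that behaviour under two labeled assumptions: runs of three or more identical characters are written as count followed by character, and shorter runs are left unchanged. The names bwt_sketch, rle_sketch, and bwt_rle_sketch are hypothetical, and this is not the code in abydos.compression.

```python
from itertools import groupby


def bwt_sketch(word: str, terminator: str = "\x00") -> str:
    """Textbook Burrows-Wheeler transform via sorted rotations."""
    text = word + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def rle_sketch(text: str) -> str:
    """Run-length encode text (assumption: only runs longer than 2 are counted)."""
    pieces = []
    for char, group in groupby(text):
        run = len(list(group))
        pieces.append(f"{run}{char}" if run > 2 else char * run)
    return "".join(pieces)


def bwt_rle_sketch(word: str, terminator: str = "\x00") -> str:
    """Apply the transform first, then run-length encode the result."""
    return rle_sketch(bwt_sketch(word, terminator))


print(repr(bwt_rle_sketch("hat")))  # 'th\x00a' (no runs to compress)
print(repr(bwt_rle_sketch("aaa")))  # '3a\x00'
```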

class abydos.fingerprint.Consonant(variant: int = 1, doubles: bool = True, vowels: Optional[Union[Iterable[str], str]] = None)

Bases: _Fingerprint

Consonant Coding Fingerprint.

Based on the consonant coding from [Taf70], variants 1, 2, 3, 1-D, 2-D, and 3-D.

New in version 0.4.1.

Initialize Consonant instance.

Parameters:
  • variant (int) --

    Selects between Taft's 3 variants, which assign to the vowel set one of:

    1. A, E, I, O, & U

    2. A, E, I, O, U, W, & Y

    3. A, E, I, O, U, W, H, & Y

  • doubles (bool) -- If set to False, multiple consonants in a row are conflated to a single instance.

  • vowels (list, set, or str) -- Setting vowels to a non-None value overrides the variant setting and defines the set of letters to be removed from the input.

New in version 0.4.1.

fingerprint(word: str) -> str

Return the consonant coding.

Parameters:

word (str) -- The word to fingerprint

Returns:

The consonant coding

Return type:

str

Examples

>>> cf = Consonant()
>>> cf.fingerprint('hat')
'HT'
>>> cf.fingerprint('niall')
'NLL'
>>> cf.fingerprint('colin')
'CLN'
>>> cf.fingerprint('atcg')
'ATCG'
>>> cf.fingerprint('entreatment')
'ENTRTMNT'

New in version 0.4.1.
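
The effect of the variant, doubles, and vowels parameters can be sketched in plain Python. This is an illustration rather than the library's code. It assumes that the first letter is always kept (as the 'atcg' example suggests) and that doubles=False collapses adjacent repeats after the vowels have been removed; consonant_sketch and VOWEL_SETS are hypothetical names.

```python
# Vowel sets for Taft's three variants, as listed in the parameter description.
VOWEL_SETS = {1: set("AEIOU"), 2: set("AEIOUWY"), 3: set("AEIOUWHY")}


def consonant_sketch(word, variant=1, doubles=True, vowels=None):
    """Keep the first letter, then drop the selected vowels from the rest."""
    removed = set(vowels) if vowels is not None else VOWEL_SETS[variant]
    removed = {letter.upper() for letter in removed}
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    kept = [letters[0]] + [c for c in letters[1:] if c not in removed]
    if not doubles:
        # Assumption: adjacent repeated letters collapse to one instance.
        kept = [c for i, c in enumerate(kept) if i == 0 or c != kept[i - 1]]
    return "".join(kept)


print(consonant_sketch("niall"))                 # NLL
print(consonant_sketch("niall", doubles=False))  # NL
```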

class abydos.fingerprint.Count(n_bits: int = 16, most_common: Tuple[str, ...] = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))

Bases: _Fingerprint

Count Fingerprint.

Based on the count fingerprint from [CislakG17].

New in version 0.3.6.

Initialize Count instance.

Parameters:
  • n_bits (int) -- Number of bits in the fingerprint returned

  • most_common (list) -- The most common tokens in the target language, ordered by frequency

New in version 0.4.0.

fingerprint(word: str) -> str

Return the count fingerprint.

Parameters:

word (str) -- The word to fingerprint

Returns:

The count fingerprint

Return type:

str

Examples

>>> cf = Count()
>>> cf.fingerprint('hat')
'0001010000000001'
>>> cf.fingerprint('niall')
'0000010001010000'
>>> cf.fingerprint('colin')
'0000000101010000'
>>> cf.fingerprint('atcg')
'0001010000000000'
>>> cf.fingerprint('entreatment')
'1111010000100000'

New in version 0.3.0.

Changed in version 0.3.6: Encapsulated in class

Changed in version 0.6.0: Changed to return a str and added fingerprint_int method

fingerprint_int(word: str) -> int

Return the count fingerprint.

Parameters:

word (str) -- The word to fingerprint

Returns:

The count fingerprint as an int

Return type:

int

Examples

>>> cf = Count()
>>> cf.fingerprint_int('hat')
5121
>>> cf.fingerprint_int('niall')
1104
>>> cf.fingerprint_int('colin')
336
>>> cf.fingerprint_int('atcg')
5120
>>> cf.fingerprint_int('entreatment')
62496

New in version 0.6.0.
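
The bit layout behind these values can be sketched as follows: each of the first n_bits / 2 letters of most_common contributes a two-bit occurrence count. This is a hedged illustration, not the library's code. In particular, clamping counts at 3 is an assumption of this sketch (the library may treat counts above 3 differently), and count_sketch is a hypothetical name.

```python
from collections import Counter

MOST_COMMON = ("e", "t", "a", "o", "i", "n", "s", "h",
               "r", "d", "l", "c", "u", "m", "w", "f")


def count_sketch(word, n_bits=16, most_common=MOST_COMMON):
    """Encode per-letter counts as two bits each, most common letter first."""
    counts = Counter(word)
    bits = ""
    for letter in most_common[: n_bits // 2]:
        # Assumption: counts above 3 are clamped so they fit in two bits.
        bits += format(min(counts[letter], 3), "02b")
    return bits.ljust(n_bits, "0")


fp = count_sketch("hat")
print(fp)          # 0001010000000001
print(int(fp, 2))  # 5121, the value fingerprint_int reports for 'hat'
```

The string and integer forms carry the same information: the integer is the bit string read as a base-2 number.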

class abydos.fingerprint.Extract(letter_list: Union[int, Iterable[str]] = 1)

Bases: _Fingerprint

Extract Letter List fingerprint.

Based on the extract letter list coding from [Taf70], for lists 1, 2, 3, & 4.

New in version 0.4.1.

Initialize Extract instance.

Parameters:

letter_list (int or iterable) -- If an integer (1-4) is supplied, Taft's specified letter lists are used. If an iterable is supplied, its values will be used as the list of letters to remove (in order).

New in version 0.4.1.

fingerprint(word: str) -> str

Return the extract letter list coding.

Parameters:

word (str) -- The word to fingerprint

Returns:

The extract letter list coding

Return type:

str

Examples

>>> fp = Extract()
>>> fp.fingerprint('hat')
'HAT'
>>> fp.fingerprint('niall')
'NILL'
>>> fp.fingerprint('colin')
'CLIN'
>>> fp.fingerprint('atcg')
'ATCG'
>>> fp.fingerprint('entreatment')
'NRMN'

New in version 0.4.1.

class abydos.fingerprint.ExtractPositionFrequency

Bases: _Fingerprint

Extract - Position & Frequency fingerprint.

Based on the extract - position & frequency coding from [Taf70].

New in version 0.4.1.

fingerprint(word: str) -> str

Return the extract - position & frequency coding.

Parameters:

word (str) -- The word to fingerprint

Returns:

The extract - position & frequency coding

Return type:

str

Examples

>>> fp = ExtractPositionFrequency()
>>> fp.fingerprint('hat')
'HAT'
>>> fp.fingerprint('niall')
'NILL'
>>> fp.fingerprint('colin')
'COLN'
>>> fp.fingerprint('atcg')
'ATCG'
>>> fp.fingerprint('entreatment')
'NMNT'

New in version 0.4.1.

class abydos.fingerprint.LACSS

Bases: _Fingerprint

L.A. County Sheriff's System fingerprint.

Based on the description from [Taf70].

New in version 0.4.1.

fingerprint(word: str) -> str

Return the LACSS coding.

Parameters:

word (str) -- The word to fingerprint

Returns:

The L.A. County Sheriff's System fingerprint

Return type:

str

Examples

>>> cf = LACSS()
>>> cf.fingerprint('hat')
'4911211'
>>> cf.fingerprint('niall')
'6488374'
>>> cf.fingerprint('colin')
'3015957'
>>> cf.fingerprint('atcg')
'1772371'
>>> cf.fingerprint('entreatment')
'3882324'

New in version 0.4.1.

Changed in version 0.6.0: Changed to return a str and added fingerprint_int method

fingerprint_int(word: str) -> int

Return the LACSS coding.

Parameters:

word (str) -- The word to fingerprint

Returns:

The L.A. County Sheriff's System fingerprint as an int

Return type:

int

Examples

>>> cf = LACSS()
>>> cf.fingerprint_int('hat')
4911211
>>> cf.fingerprint_int('niall')
6488374
>>> cf.fingerprint_int('colin')
3015957
>>> cf.fingerprint_int('atcg')
1772371
>>> cf.fingerprint_int('entreatment')
3882324

New in version 0.6.0.

class abydos.fingerprint.LCCutter(max_length: int = 64)

Bases: _Fingerprint

Library of Congress Cutter table encoding.

This is based on the Library of Congress Cutter table encoding scheme, as described at https://www.loc.gov/aba/pcc/053/table.html [oC13]. Handling for numerals is not included.

New in version 0.4.1.

Initialize LCCutter instance.

Parameters:

max_length (int) -- The length of the code returned (defaults to 64)

New in version 0.4.1.

fingerprint(word: str) -> str

Return the Library of Congress Cutter table encoding of a word.

Parameters:

word (str) -- The word to fingerprint

Returns:

The Library of Congress Cutter table encoding

Return type:

str

Examples

>>> cf = LCCutter()
>>> cf.fingerprint('hat')
'H38'
>>> cf.fingerprint('niall')
'N5355'
>>> cf.fingerprint('colin')
'C6556'
>>> cf.fingerprint('atcg')
'A834'
>>> cf.fingerprint('entreatment')
'E5874386468'

New in version 0.4.1.

class abydos.fingerprint.Occurrence(n_bits: int = 16, most_common: Tuple[str, ...] = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))

Bases: _Fingerprint

Occurrence Fingerprint.

Based on the occurrence fingerprint from [CislakG17].

New in version 0.3.6.

Initialize Occurrence instance.

Parameters:
  • n_bits (int) -- Number of bits in the fingerprint returned

  • most_common (list) -- The most common tokens in the target language, ordered by frequency

New in version 0.4.0.

fingerprint(word: str) -> str

Return the occurrence fingerprint.

Parameters:

word (str) -- The word to fingerprint

Returns:

The occurrence fingerprint

Return type:

str

Examples

>>> of = Occurrence()
>>> of.fingerprint('hat')
'0110000100000000'
>>> of.fingerprint('niall')
'0010110000100000'
>>> of.fingerprint('colin')
'0001110000110000'
>>> of.fingerprint('atcg')
'0110000000010000'
>>> of.fingerprint('entreatment')
'1110010010000100'

New in version 0.3.0.

Changed in version 0.3.6: Encapsulated in class

Changed in version 0.6.0: Changed to return a str and added fingerprint_int method

fingerprint_int(word: str) -> int

Return the occurrence fingerprint.

Parameters:

word (str) -- The word to fingerprint

Returns:

The occurrence fingerprint as an int

Return type:

int

Examples

>>> of = Occurrence()
>>> of.fingerprint_int('hat')
24832
>>> of.fingerprint_int('niall')
11296
>>> of.fingerprint_int('colin')
7216
>>> of.fingerprint_int('atcg')
24592
>>> of.fingerprint_int('entreatment')
58500

New in version 0.6.0.
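
The occurrence fingerprint can be read as one presence bit per letter of most_common, in order. The sketch below illustrates that reading and reproduces the examples above. It is not the library's implementation, and occurrence_sketch is a hypothetical name.

```python
MOST_COMMON = ("e", "t", "a", "o", "i", "n", "s", "h",
               "r", "d", "l", "c", "u", "m", "w", "f")


def occurrence_sketch(word, n_bits=16, most_common=MOST_COMMON):
    """Set one bit per common letter that occurs anywhere in the word."""
    present = set(word)
    bits = "".join(
        "1" if letter in present else "0" for letter in most_common[:n_bits]
    )
    return bits.ljust(n_bits, "0")


fp = occurrence_sketch("hat")
print(fp)          # 0110000100000000 (bits for t, a, and h)
print(int(fp, 2))  # 24832
```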

class abydos.fingerprint.OccurrenceHalved(n_bits: int = 16, most_common: Tuple[str, ...] = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'))

Bases: _Fingerprint

Occurrence Halved Fingerprint.

Based on the occurrence halved fingerprint from [CislakG17].

New in version 0.3.6.

Initialize OccurrenceHalved instance.

Parameters:
  • n_bits (int) -- Number of bits in the fingerprint returned

  • most_common (list) -- The most common tokens in the target language, ordered by frequency

New in version 0.4.0.

fingerprint(word: str) -> str

Return the occurrence halved fingerprint.

Based on the occurrence halved fingerprint from [CislakG17].

Parameters:

word (str) -- The word to fingerprint

Returns:

The occurrence halved fingerprint

Return type:

str

Examples

>>> ohf = OccurrenceHalved()
>>> ohf.fingerprint('hat')
'0001010000000010'
>>> ohf.fingerprint('niall')
'0000010010100000'
>>> ohf.fingerprint('colin')
'0000001001010000'
>>> ohf.fingerprint('atcg')
'0010100000000000'
>>> ohf.fingerprint('entreatment')
'1111010000110000'

New in version 0.3.0.

Changed in version 0.3.6: Encapsulated in class

Changed in version 0.6.0: Changed to return a str and added fingerprint_int method

fingerprint_int(word: str) -> int

Return the occurrence halved fingerprint.

Based on the occurrence halved fingerprint from [CislakG17].

Parameters:

word (str) -- The word to fingerprint

Returns:

The occurrence halved fingerprint as an int

Return type:

int

Examples

>>> ohf = OccurrenceHalved()
>>> ohf.fingerprint_int('hat')
5122
>>> ohf.fingerprint_int('niall')
1184
>>> ohf.fingerprint_int('colin')
592
>>> ohf.fingerprint_int('atcg')
10240
>>> ohf.fingerprint_int('entreatment')
62512

New in version 0.6.0.
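
The halved variant can be read as two presence bits per letter: one for the first half of the word and one for the second. The sketch below illustrates that reading, assuming the split point is len(word) // 2 (which is consistent with the examples above). It is not the library's implementation, and occurrence_halved_sketch is a hypothetical name.

```python
MOST_COMMON = ("e", "t", "a", "o", "i", "n", "s", "h",
               "r", "d", "l", "c", "u", "m", "w", "f")


def occurrence_halved_sketch(word, n_bits=16, most_common=MOST_COMMON):
    """Emit (in first half, in second half) bits for each common letter."""
    half = len(word) // 2  # assumption: the second half gets the odd letter
    first, second = set(word[:half]), set(word[half:])
    bits = ""
    for letter in most_common[: n_bits // 2]:
        bits += "1" if letter in first else "0"
        bits += "1" if letter in second else "0"
    return bits.ljust(n_bits, "0")


print(occurrence_halved_sketch("hat"))   # 0001010000000010
print(occurrence_halved_sketch("atcg"))  # 0010100000000000
```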

class abydos.fingerprint.OmissionKey

Bases: _Fingerprint

Omission Key.

The omission key of a word is defined in [PZ84].

New in version 0.3.6.

fingerprint(word: str) -> str

Return the omission key.

Parameters:

word (str) -- The word to transform into its omission key

Returns:

The omission key

Return type:

str

Examples

>>> ok = OmissionKey()
>>> ok.fingerprint('The quick brown fox jumped over the lazy dog.')
'JKQXZVWYBFMGPDHCLNTREUIOA'
>>> ok.fingerprint('Christopher')
'PHCTSRIOE'
>>> ok.fingerprint('Niall')
'LNIA'

New in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class
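
A plain-Python sketch of how such a key can be built: the consonants present in the word are emitted in a fixed order, followed by the word's vowels in order of first appearance. The consonant order used here (JKQXZVWYBFMGPDHCLNTSR) is stated as an assumption about [PZ84] rather than taken from this page, although it does reproduce the examples above. omission_key_sketch is a hypothetical name.

```python
# Assumed consonant ordering for the omission key.
OMISSION_ORDER = "JKQXZVWYBFMGPDHCLNTSR"
VOWELS = "AEIOU"


def omission_key_sketch(word):
    """Consonants in OMISSION_ORDER, then unique vowels in order of appearance."""
    letters = [c for c in word.upper() if c.isalpha()]
    present = set(letters)
    consonants = "".join(c for c in OMISSION_ORDER if c in present)
    # dict.fromkeys removes duplicates while preserving first-seen order.
    vowels = "".join(dict.fromkeys(c for c in letters if c in VOWELS))
    return consonants + vowels


print(omission_key_sketch("Christopher"))  # PHCTSRIOE
print(omission_key_sketch("Niall"))        # LNIA
```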

class abydos.fingerprint.Phonetic(phonetic_algorithm: Optional[Union[Callable[[str], str], _Phonetic]] = None, joiner: str = ' ')

Bases: String

Phonetic Fingerprint.

A phonetic fingerprint is identical to a standard string fingerprint, as implemented in String, but performs the fingerprinting function after converting the string to its phonetic form, as determined by some phonetic algorithm. This fingerprint is described at [Ope12].

New in version 0.3.6.

Initialize Phonetic instance.

Parameters:
  • phonetic_algorithm (function) -- A phonetic algorithm that takes a string and returns a string (presumably a phonetic representation of the original string). By default, this function uses double_metaphone().

  • joiner (str) -- The string that will be placed between each word

New in version 0.4.0.

fingerprint(phrase: str) -> str

Return the phonetic fingerprint of a phrase.

Parameters:

phrase (str) -- The string from which to calculate the phonetic fingerprint

Returns:

The phonetic fingerprint of the phrase

Return type:

str

Examples

>>> pf = Phonetic()
>>> pf.fingerprint('The quick brown fox jumped over the lazy dog.')
'0 afr fks jmpt kk ls prn tk'
>>> from abydos.phonetic import Soundex
>>> pf = Phonetic(Soundex())
>>> pf.fingerprint('The quick brown fox jumped over the lazy dog.')
'b650 d200 f200 j513 l200 o160 q200 t000'

New in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class

class abydos.fingerprint.Position(n_bits: int = 16, most_common: Tuple[str, ...] = ('e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'c', 'u', 'm', 'w', 'f'), bits_per_letter: int = 3)

Bases: _Fingerprint

Position Fingerprint.

Based on the position fingerprint from [CislakG17].

New in version 0.3.6.

Initialize Position instance.

Parameters:
  • n_bits (int) -- Number of bits in the fingerprint returned

  • most_common (list) -- The most common tokens in the target language, ordered by frequency

  • bits_per_letter (int) -- The number of bits used to encode each letter's position

New in version 0.4.0.

fingerprint(word: str) -> str

Return the position fingerprint.

Parameters:

word (str) -- The word to fingerprint

Returns:

The position fingerprint

Return type:

str

Examples

>>> pf = Position()
>>> pf.fingerprint('hat')
'1110100011111111'
>>> pf.fingerprint('niall')
'1111110101110010'
>>> pf.fingerprint('colin')
'1111111110010111'
>>> pf.fingerprint('atcg')
'1110010001111111'
>>> pf.fingerprint('entreatment')
'0000101011111111'

New in version 0.3.0.

Changed in version 0.3.6: Encapsulated in class

Changed in version 0.6.0: Changed to return a str and added fingerprint_int method

fingerprint_int(word: str) -> int

Return the position fingerprint.

Parameters:

word (str) -- The word to fingerprint

Returns:

The position fingerprint as an int

Return type:

int

Examples

>>> pf = Position()
>>> pf.fingerprint_int('hat')
59647
>>> pf.fingerprint_int('niall')
64882
>>> pf.fingerprint_int('colin')
65431
>>> pf.fingerprint_int('atcg')
58495
>>> pf.fingerprint_int('entreatment')
2815

New in version 0.6.0.
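
The examples above are consistent with the following reading, sketched here as an assumption rather than as the library's code: each letter of most_common contributes the index of its first occurrence in bits_per_letter bits, an absent letter contributes all ones, and the last field is narrowed to whatever bits remain, with its value clamped to fit. position_sketch is a hypothetical name.

```python
MOST_COMMON = ("e", "t", "a", "o", "i", "n", "s", "h",
               "r", "d", "l", "c", "u", "m", "w", "f")


def position_sketch(word, n_bits=16, most_common=MOST_COMMON, bits_per_letter=3):
    """Encode the first-occurrence index of each common letter."""
    bits = ""
    remaining = n_bits
    for letter in most_common:
        if remaining <= 0:
            break
        # The final field may be narrower than bits_per_letter.
        width = min(bits_per_letter, remaining)
        ceiling = 2 ** width - 1
        index = word.find(letter)
        # Absent letters, and indexes too large for the field, use the ceiling.
        value = ceiling if index < 0 else min(index, ceiling)
        bits += format(value, f"0{width}b")
        remaining -= width
    return bits


print(position_sketch("hat"))    # 1110100011111111
print(position_sketch("niall"))  # 1111110101110010
```

With the defaults, five letters take three bits each and the sixth letter ('n') gets the single remaining bit, which is why only 'niall', where 'n' is at index 0, ends in 0.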

class abydos.fingerprint.QGram(qval: int = 2, start_stop: str = '', joiner: str = '', skip: int = 0)

Bases: _Fingerprint

Q-Gram Fingerprint.

A q-gram fingerprint is a string consisting of all of the unique q-grams in a string, alphabetized & concatenated. This fingerprint is described at [Ope12].

New in version 0.3.6.

Initialize Q-Gram fingerprinter.

Parameters:
  • qval (int) -- The length of each q-gram (by default 2)

  • start_stop (str) -- The start & stop symbol(s) to concatenate on either end of the phrase, as defined in tokenizer.QGrams

  • joiner (str) -- The string that will be placed between each word

  • skip (int or Iterable) -- The number of characters to skip; can be an integer, range object, or list

New in version 0.4.0.

fingerprint(phrase: str) -> str

Return Q-Gram fingerprint.

Parameters:

phrase (str) -- The string from which to calculate the q-gram fingerprint

Returns:

The q-gram fingerprint of the phrase

Return type:

str

Examples

>>> qf = QGram()
>>> qf.fingerprint('The quick brown fox jumped over the lazy dog.')
'azbrckdoedeleqerfoheicjukblampnfogovowoxpequrortthuiumvewnxjydzy'
>>> qf.fingerprint('Christopher')
'cherhehrisopphristto'
>>> qf.fingerprint('Niall')
'aliallni'

New in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class
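
The description above (unique q-grams, alphabetized and concatenated) can be sketched in a few lines. This illustration assumes the phrase is lowercased and stripped of everything but letters and digits before tokenization, and it ignores the start_stop and skip options. qgram_fingerprint_sketch is a hypothetical name, not part of the library.

```python
def qgram_fingerprint_sketch(phrase, qval=2, joiner=""):
    """Join the sorted set of q-grams of the cleaned phrase."""
    # Assumption: case, whitespace, and punctuation are discarded first.
    cleaned = "".join(c for c in phrase.lower() if c.isalnum())
    grams = {cleaned[i:i + qval] for i in range(len(cleaned) - qval + 1)}
    return joiner.join(sorted(grams))


print(qgram_fingerprint_sketch("Niall"))        # aliallni
print(qgram_fingerprint_sketch("Christopher"))  # cherhehrisopphristto
```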

class abydos.fingerprint.SkeletonKey

Bases: _Fingerprint

Skeleton Key.

The skeleton key of a word is defined in [PZ84].

New in version 0.3.6.

fingerprint(word: str) -> str

Return the skeleton key.

Parameters:

word (str) -- The word to transform into its skeleton key

Returns:

The skeleton key

Return type:

str

Examples

>>> sk = SkeletonKey()
>>> sk.fingerprint('The quick brown fox jumped over the lazy dog.')
'THQCKBRWNFXJMPDVLZYGEUIOA'
>>> sk.fingerprint('Christopher')
'CHRSTPIOE'
>>> sk.fingerprint('Niall')
'NLIA'

New in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class
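
The examples above are consistent with the following construction, sketched here as an illustration and not as the library's code: keep the first letter, then the remaining unique consonants in order of appearance, then the remaining unique vowels in order of appearance. skeleton_key_sketch is a hypothetical name.

```python
VOWELS = "AEIOU"


def skeleton_key_sketch(word):
    """First letter, then unique consonants, then unique vowels."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    # dict.fromkeys removes duplicates while preserving first-seen order.
    rest = [c for c in dict.fromkeys(letters) if c != first]
    consonants = "".join(c for c in rest if c not in VOWELS)
    vowels = "".join(c for c in rest if c in VOWELS)
    return first + consonants + vowels


print(skeleton_key_sketch("Christopher"))  # CHRSTPIOE
print(skeleton_key_sketch("orange"))       # ORNGAE
```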

class abydos.fingerprint.String(joiner: str = ' ')

Bases: _Fingerprint

String Fingerprint.

The fingerprint of a string is a string consisting of all of the unique words in a string, alphabetized & concatenated with intervening joiners. This fingerprint is described at [Ope12].

New in version 0.3.6.

Initialize String instance.

Parameters:

joiner (str) -- The string that will be placed between each word

New in version 0.4.0.

fingerprint(phrase: str) -> str

Return string fingerprint.

Parameters:

phrase (str) -- The string from which to calculate the fingerprint

Returns:

The fingerprint of the phrase

Return type:

str

Example

>>> sf = String()
>>> sf.fingerprint('The quick brown fox jumped over the lazy dog.')
'brown dog fox jumped lazy over quick the'

New in version 0.1.0.

Changed in version 0.3.6: Encapsulated in class
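
The description above (unique words, alphabetized and joined) can be sketched with the standard library alone. The normalization steps here (lowercasing, NFKD decomposition, dropping punctuation and combining marks) are assumptions modelled on the usual OpenRefine-style recipe. string_fingerprint_sketch is a hypothetical name, not the library's implementation.

```python
import unicodedata


def string_fingerprint_sketch(phrase, joiner=" "):
    """Join the sorted set of normalized words in the phrase."""
    # Assumption: lowercase and decompose accented characters first.
    phrase = unicodedata.normalize("NFKD", phrase.strip().lower())
    # Keep letters, digits, and whitespace; punctuation and accents drop out.
    cleaned = "".join(c for c in phrase if c.isalnum() or c.isspace())
    return joiner.join(sorted(set(cleaned.split())))


print(string_fingerprint_sketch("The quick brown fox jumped over the lazy dog."))
# brown dog fox jumped lazy over quick the
```

Because word order, case, punctuation, and repeated words are discarded, phrases that differ only in those respects receive the same fingerprint.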

class abydos.fingerprint.SynonameToolcode

Bases: _Fingerprint

Synoname Toolcode.

Cf. [Gro91, JPGTrust91].

New in version 0.3.6.

fingerprint(lname: str, fname: str = '', qual: str = '', normalize: int = 0) -> str

Build the Synoname toolcode.

Parameters:
  • lname (str) -- Last name

  • fname (str) -- First name (can be blank)

  • qual (str) -- Qualifier

  • normalize (int) -- Normalization mode (0, 1, or 2)

Returns:

The transformed names and the synoname toolcode, separated by commas

Return type:

str

Examples

>>> st = SynonameToolcode()
>>> st.fingerprint('hat')
'hat,,0000000003$$h'
>>> st.fingerprint('niall')
'niall,,0000000005$$n'
>>> st.fingerprint('colin')
'colin,,0000000005$$c'
>>> st.fingerprint('atcg')
'atcg,,0000000004$$a'
>>> st.fingerprint('entreatment')
'entreatment,,0000000011$$e'
>>> st.fingerprint('Ste.-Marie', 'Count John II', normalize=2)
'ste.-marie ii,count john,0200491310$015b049a127c$smcji'
>>> st.fingerprint('Michelangelo IV', '', 'Workshop of')
'michelangelo iv,,3000550015$055b$mi'

New in version 0.3.0.

Changed in version 0.3.6: Encapsulated in class

Changed in version 0.6.0: Changed to return a comma-separated string instead of 3-tuple of strs

fingerprint_tuple(lname: str, fname: str = '', qual: str = '', normalize: int = 0) -> Tuple[str, str, str]

Build the Synoname toolcode.

Parameters:
  • lname (str) -- Last name

  • fname (str) -- First name (can be blank)

  • qual (str) -- Qualifier

  • normalize (int) -- Normalization mode (0, 1, or 2)

Returns:

The transformed names and the synoname toolcode

Return type:

tuple

Examples

>>> st = SynonameToolcode()
>>> st.fingerprint_tuple('hat')
('hat', '', '0000000003$$h')
>>> st.fingerprint_tuple('niall')
('niall', '', '0000000005$$n')
>>> st.fingerprint_tuple('colin')
('colin', '', '0000000005$$c')
>>> st.fingerprint_tuple('atcg')
('atcg', '', '0000000004$$a')
>>> st.fingerprint_tuple('entreatment')
('entreatment', '', '0000000011$$e')
>>> st.fingerprint_tuple('Ste.-Marie', 'Count John II', normalize=2)
('ste.-marie ii', 'count john', '0200491310$015b049a127c$smcji')
>>> st.fingerprint_tuple('Michelangelo IV', '', 'Workshop of')
('michelangelo iv', '', '3000550015$055b$mi')

New in version 0.6.0.