http://zorba.io/modules/data-cleaning/token-based-string-similarity

This library module provides token-based string similarity functions that view strings as sets or multi-sets of tokens and use set-related properties to compute similarity scores.

The tokens correspond to groups of characters extracted from the strings being compared, such as individual words or character n-grams.

These functions are particularly useful for matching near duplicate strings in cases where typographical conventions often lead to rearrangement of words (e.g., "John Smith" versus "Smith, John").
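
To illustrate why token sets tolerate word rearrangement, here is a minimal Python sketch. It is an illustration only, not the module's XQuery code; the helper name `tokens` and the delimiter pattern `[ ,]+` are assumptions chosen for this example.

```python
import re

def tokens(s, delimiter):
    """Split s on a delimiter regex and return the set of non-empty tokens."""
    return {t for t in re.split(delimiter, s) if t}

# Hypothetical delimiter: one or more spaces or commas.
a = tokens("John Smith", r"[ ,]+")
b = tokens("Smith, John", r"[ ,]+")
print(a == b)  # True: both strings yield the tokens "John" and "Smith"
```

Because both strings reduce to the same set, any set-based coefficient treats them as identical even though the word order differs.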

The logic contained in this module is not specific to any particular XQuery implementation, although the module requires the trigonometric functions of XQuery 3.0 or a math extension function such as sqrt($x as numeric) for computing square roots.

Function Summary

cosine-ngrams ($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the cosine similarity coefficient between sets of character n-grams extracted from two strings.

cosine-tokens ($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

cosine ($desc1 as xs:string*, $desc2 as xs:string*) as xs:double

Auxiliary function for computing the cosine similarity coefficient between strings, using string descriptors based on sets of character n-grams or sets of tokens extracted from two strings.

dice-ngrams ($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the Dice similarity coefficient between sets of character n-grams extracted from two strings.

dice-tokens ($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the Dice similarity coefficient between sets of tokens extracted from two strings.

jaccard-ngrams ($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the Jaccard similarity coefficient between sets of character n-grams extracted from two strings.

jaccard-tokens ($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the Jaccard similarity coefficient between sets of tokens extracted from two strings.

ngrams ($s as xs:string, $n as xs:integer) as xs:string*

Returns the individual character n-grams forming a string.

overlap-ngrams ($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the overlap similarity coefficient between sets of character n-grams extracted from two strings.

overlap-tokens ($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the overlap similarity coefficient between sets of tokens extracted from two strings.

Functions

cosine-ngrams#3

declare  function simt:cosine-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the cosine similarity coefficient between sets of character n-grams extracted from two strings.

The n-grams from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval).

Example usage :

 cosine-ngrams("DWAYNE", "DUANE", 2 ) 

The function invocation in the example above returns :

 0.2401922307076307 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
n as xs:integer
The number of characters to consider when extracting n-grams.

Returns

xs:double
The cosine similarity coefficient between the sets of n-grams extracted from the two strings.

cosine-tokens#3

declare  function simt:cosine-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval).

Example usage :

 cosine-tokens("The FLWOR Foundation", "FLWOR Found.", " +" ) 

The function invocation in the example above returns :

 0.408248290463863 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
r as xs:string
A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

xs:double
The cosine similarity coefficient between the sets of tokens extracted from the two strings.

cosine#2

declare  function simt:cosine($desc1 as xs:string*, $desc2 as xs:string*) as xs:double

Auxiliary function for computing the cosine similarity coefficient between strings, using string descriptors based on sets of character n-grams or sets of tokens extracted from two strings.

Example usage :

 cosine( ("aa","bb") , ("bb","aa")) 

The function invocation in the example above returns :

 1.0 

Parameters

desc1 as xs:string*
The descriptor for the first string.
desc2 as xs:string*
The descriptor for the second string.

Returns

xs:double
The cosine similarity coefficient between the descriptors for the two strings.
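
As a rough illustration of how a cosine coefficient over descriptors can be computed, the following Python sketch uses the textbook term-frequency formulation. It is not the module's XQuery implementation, which may differ in details; the function name `cosine` mirrors the signature above only for readability, and returning 0.0 for an empty descriptor is an assumption.

```python
import math
from collections import Counter

def cosine(desc1, desc2):
    """Textbook cosine similarity between two descriptor sequences,
    weighting each distinct item by how often it occurs."""
    tf1, tf2 = Counter(desc1), Counter(desc2)
    # Dot product over the items that occur in both descriptors.
    dot = sum(tf1[t] * tf2[t] for t in tf1.keys() & tf2.keys())
    norm_sq1 = sum(c * c for c in tf1.values())
    norm_sq2 = sum(c * c for c in tf2.values())
    if norm_sq1 == 0 or norm_sq2 == 0:
        return 0.0  # assumption: an empty descriptor scores 0
    return dot / math.sqrt(norm_sq1 * norm_sq2)

print(cosine(["aa", "bb"], ["bb", "aa"]))  # 1.0
```

Under the same formulation, the token descriptors ("The", "FLWOR", "Foundation") and ("FLWOR", "Found.") share one item out of three and two, giving 1/sqrt(6), approximately 0.408, which agrees with the cosine-tokens example.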

dice-ngrams#3

declare  function simt:dice-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the Dice similarity coefficient between sets of character n-grams extracted from two strings.

Example usage :

 dice-ngrams("DWAYNE", "DUANE", 2 ) 

The function invocation in the example above returns :

 0.4615384615384616 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
n as xs:integer
The number of characters to consider when extracting n-grams.

Returns

xs:double
The Dice similarity coefficient between the sets of character n-grams extracted from the two strings.

dice-tokens#3

declare  function simt:dice-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the Dice similarity coefficient between sets of tokens extracted from two strings.

Example usage :

 dice-tokens("The FLWOR Foundation", "FLWOR Found.", " +" ) 

The function invocation in the example above returns :

 0.4 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
r as xs:string
A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

xs:double
The Dice similarity coefficient between the sets of tokens extracted from the two strings.
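
The Dice coefficient is conventionally defined as twice the size of the intersection divided by the sum of the two set sizes. The Python sketch below applies that definition to the token example; it illustrates the arithmetic and is not the module's XQuery code. The name `dice_tokens` and the 0.0 result for two empty inputs are assumptions.

```python
import re

def dice_tokens(s1, s2, delimiter):
    """Dice coefficient between the token sets of two strings."""
    a = {t for t in re.split(delimiter, s1) if t}
    b = {t for t in re.split(delimiter, s2) if t}
    if not a and not b:
        return 0.0  # assumption: two empty token sets score 0
    return 2 * len(a & b) / (len(a) + len(b))

# One shared token ("FLWOR"), set sizes 3 and 2: 2 * 1 / (3 + 2)
print(dice_tokens("The FLWOR Foundation", "FLWOR Found.", " +"))  # 0.4
```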

jaccard-ngrams#3

declare  function simt:jaccard-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the Jaccard similarity coefficient between sets of character n-grams extracted from two strings.

Example usage :

 jaccard-ngrams("DWAYNE", "DUANE", 2 ) 

The function invocation in the example above returns :

 0.3 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
n as xs:integer
The number of characters to consider when extracting n-grams.

Returns

xs:double
The Jaccard similarity coefficient between the sets of character n-grams extracted from the two strings.
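
The Jaccard coefficient is conventionally the size of the intersection divided by the size of the union. The Python sketch below works through the example with bigrams; it is an illustration under assumptions, not the module's XQuery code. The helper names `padded_ngrams` and `jaccard_ngrams`, and padding each end of the string with `_`, are assumptions of this sketch.

```python
def padded_ngrams(s, n, pad="_"):
    """Distinct character n-grams of s, with n-1 pad characters added
    to each end (an assumed padding convention)."""
    padded = pad * (n - 1) + s + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard_ngrams(s1, s2, n):
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    a, b = padded_ngrams(s1, n), padded_ngrams(s2, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# "DWAYNE" has 7 distinct bigrams, "DUANE" has 6; they share
# "_D", "NE" and "E_", so the result is 3 / 10.
print(jaccard_ngrams("DWAYNE", "DUANE", 2))  # 0.3
```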

jaccard-tokens#3

declare  function simt:jaccard-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the Jaccard similarity coefficient between sets of tokens extracted from two strings.

Example usage :

 jaccard-tokens("The FLWOR Foundation", "FLWOR Found.", " +" ) 

The function invocation in the example above returns :

 0.25 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
r as xs:string
A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

xs:double
The Jaccard similarity coefficient between the sets of tokens extracted from the two strings.

ngrams#2

declare  function simt:ngrams($s as xs:string, $n as xs:integer) as xs:string*

Returns the individual character n-grams forming a string.

Example usage :

 ngrams("FLWOR", 2 ) 

The function invocation in the example above returns :

 ("_F" , "FL" , "LW" , "WO" , "LW" , "WO" , "OR" , "R_") 

Parameters

s as xs:string
The input string.
n as xs:integer
The number of characters to consider when extracting n-grams.

Returns

xs:string*
The sequence of strings with the extracted n-grams.
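
For readers unfamiliar with character n-grams, the following Python sketch shows one common way to extract them using a sliding window over a padded string. It is not the module's XQuery implementation and does not attempt to reproduce the module's exact output sequence: the documented example lists some n-grams more than once, whereas this sketch emits one n-gram per window position. The `_` padding character and the function body are assumptions.

```python
def ngrams(s, n, pad="_"):
    """Character n-grams of s in order, using a sliding window over
    the string padded with n-1 pad characters at each end."""
    padded = pad * (n - 1) + s + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(ngrams("FLWOR", 2))  # ['_F', 'FL', 'LW', 'WO', 'OR', 'R_']
```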

overlap-ngrams#3

declare  function simt:overlap-ngrams($s1 as xs:string, $s2 as xs:string, $n as xs:integer) as xs:double

Returns the overlap similarity coefficient between sets of character n-grams extracted from two strings.

Example usage :

 overlap-ngrams("DWAYNE", "DUANE", 2 ) 

The function invocation in the example above returns :

 0.5 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
n as xs:integer
The number of characters to consider when extracting n-grams.

Returns

xs:double
The overlap similarity coefficient between the sets of character n-grams extracted from the two strings.

overlap-tokens#3

declare  function simt:overlap-tokens($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the overlap similarity coefficient between sets of tokens extracted from two strings.

Example usage :

 overlap-tokens("The FLWOR Foundation", "FLWOR Found.", " +" ) 

The function invocation in the example above returns :

 0.5 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
r as xs:string
A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

xs:double
The overlap similarity coefficient between the sets of tokens extracted from the two strings.
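
The overlap coefficient is conventionally the size of the intersection divided by the size of the smaller set. The Python sketch below applies that definition to the token example; it illustrates the arithmetic and is not the module's XQuery code. The name `overlap_tokens` and the 0.0 result when either token set is empty are assumptions.

```python
import re

def overlap_tokens(s1, s2, delimiter):
    """Overlap coefficient: |A intersect B| / min(|A|, |B|)."""
    a = {t for t in re.split(delimiter, s1) if t}
    b = {t for t in re.split(delimiter, s2) if t}
    if not a or not b:
        return 0.0  # assumption: an empty token set scores 0
    return len(a & b) / min(len(a), len(b))

# One shared token ("FLWOR"); the smaller set has 2 tokens: 1 / 2
print(overlap_tokens("The FLWOR Foundation", "FLWOR Found.", " +"))  # 0.5
```

Because the denominator is the smaller set, a string whose tokens are all contained in the other string scores 1.0 regardless of how many extra tokens the longer string has.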