http://zorba.io/modules/data-cleaning/hybrid-string-similarity

This library module provides hybrid string similarity functions, combining the properties of character-based string similarity functions and token-based string similarity functions.

The logic in this module is not specific to any particular XQuery implementation, although the module requires the trigonometric functions of XQuery 3.0 or a math extension function such as sqrt($x as numeric) for computing the square root.
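As a minimal sketch of how the module might be used from a query: the import below assumes that the URI at the top of this page is the module namespace and that the processor (for example Zorba) has the module installed. The simh prefix is the one used in the declarations on this page, and the call repeats the documented example for monge-elkan-jaro-winkler.

```xquery
(: Assumption: the page URI is the module namespace, bound here to the simh prefix. :)
import module namespace simh = "http://zorba.io/modules/data-cleaning/hybrid-string-similarity";

(: Same arguments as the documented example: prefix length 4, weighting factor 0.1. :)
simh:monge-elkan-jaro-winkler(
  "Comput. Sci. and Eng. Dept., University of California, San Diego",
  "Department of Computer Scinece, Univ. Calif., San Diego",
  4, 0.1)
```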

Function Summary

monge-elkan-jaro-winkler ($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double

Returns the Monge-Elkan similarity coefficient between two strings, using the Jaro-Winkler similarity function to discover token identity.

soft-cosine-tokens-edit-distance ($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:integer) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

soft-cosine-tokens-jaro-winkler ($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double, $prefix as xs:integer?, $fact as xs:double?) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

soft-cosine-tokens-jaro ($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

soft-cosine-tokens-metaphone ($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

soft-cosine-tokens-soundex ($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

Functions

monge-elkan-jaro-winkler#4

declare  function simh:monge-elkan-jaro-winkler($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double

Returns the Monge-Elkan similarity coefficient between two strings, using the Jaro-Winkler similarity function to discover token identity.

Example usage :

 monge-elkan-jaro-winkler("Comput. Sci. and Eng. Dept., University of California, San Diego", "Department of Computer Scinece, Univ. Calif., San Diego", 4, 0.1) 

The function invocation in the example above returns :

 0.992 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
prefix as xs:integer
The number of characters to consider when testing for equal prefixes with the Jaro-Winkler metric.
fact as xs:double
The weighting factor to consider when the input strings have equal prefixes with the Jaro-Winkler metric.

Returns

xs:double
The Monge-Elkan similarity coefficient between the two strings.
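
As a rough illustration of the Monge-Elkan scheme, the sketch below averages, over the tokens of the first string, the best similarity found among the tokens of the second string. It is not the module's implementation: the whitespace tokenization, the local:monge-elkan and local:sim names, and the exact-match stand-in used in place of Jaro-Winkler are all assumptions made for this example, so its results will differ from those of the function above.

```xquery
xquery version "3.0";

(: Hypothetical stand-in for Jaro-Winkler: 1.0 for equal tokens (ignoring case), else 0.0. :)
declare function local:sim($a as xs:string, $b as xs:string) as xs:double {
  if (lower-case($a) eq lower-case($b)) then 1.0e0 else 0.0e0
};

(: Monge-Elkan: for each token of $s1, take its best match in $s2, then average. :)
declare function local:monge-elkan(
  $s1 as xs:string,
  $s2 as xs:string,
  $sim as function(xs:string, xs:string) as xs:double
) as xs:double {
  (: Assumption: tokens are separated by whitespace. :)
  let $t1 := tokenize(normalize-space($s1), " ")
  let $t2 := tokenize(normalize-space($s2), " ")
  return
    if (empty($t1) or empty($t2)) then 0.0e0
    else avg(for $a in $t1 return max(for $b in $t2 return $sim($a, $b)))
};

(: Two of the three tokens of the first string have an exact match in the second. :)
local:monge-elkan("University of California", "california university", local:sim#2)
```

With the exact-match stand-in, this call averages the per-token scores 1, 0 and 1, giving two thirds. Passing a graded similarity function instead of local:sim is what lets abbreviated or misspelled tokens contribute partial credit.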

soft-cosine-tokens-edit-distance#4

declare  function simh:soft-cosine-tokens-edit-distance($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:integer) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval).

The Edit Distance similarity function is used to discover token identity, and tokens having an edit distance below a given threshold are considered as matching tokens.

Example usage :

 soft-cosine-tokens-edit-distance("The FLWOR Foundation", "FLWOR Found.", " +", 0 ) 

The function invocation in the example above returns :

 0.408248290463863 
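
The value above is consistent with a plain term-frequency cosine in which only identical tokens match, which suggests that a threshold of 0 admits exact matches only. The sketch below illustrates that special case. It is not the module's code: the local:norm and local:tf-cosine helpers are hypothetical names, and approximate matching by edit distance is deliberately left out.

```xquery
xquery version "3.0";
declare namespace math = "http://www.w3.org/2005/xpath-functions/math";

(: Hypothetical helper: Euclidean norm of the term-frequency vector of $tokens. :)
declare function local:norm($tokens as xs:string*) as xs:double {
  math:sqrt(sum(
    for $k in distinct-values($tokens)
    let $n := count($tokens[. eq $k])
    return $n * $n))
};

(: Hypothetical helper: term-frequency cosine with exact token matching only. :)
declare function local:tf-cosine(
  $s1 as xs:string, $s2 as xs:string, $r as xs:string
) as xs:double {
  let $t1 := tokenize($s1, $r)
  let $t2 := tokenize($s2, $r)
  (: Dot product: multiply the frequencies of each token the two strings share. :)
  let $dot := sum(
    for $k in distinct-values($t1)
    return count($t1[. eq $k]) * count($t2[. eq $k]))
  let $norms := local:norm($t1) * local:norm($t2)
  return if ($norms eq 0) then 0.0e0 else $dot div $norms
};

local:tf-cosine("The FLWOR Foundation", "FLWOR Found.", " +")
```

Here the only shared token is "FLWOR", so the expression evaluates to 1 / (sqrt(3) * sqrt(2)), about 0.408.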

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
r as xs:string
A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.
t as xs:integer
A threshold for the similarity function used to discover token identity.

Returns

xs:double
The cosine similarity coefficient between the sets of tokens extracted from the two strings.

soft-cosine-tokens-jaro-winkler#6

declare  function simh:soft-cosine-tokens-jaro-winkler($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double, $prefix as xs:integer?, $fact as xs:double?) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval).

The Jaro-Winkler similarity function is used to discover token identity, and tokens having a Jaro-Winkler similarity above a given threshold are considered as matching tokens.

Example usage :

 soft-cosine-tokens-jaro-winkler("The FLWOR Foundation", "FLWOR Found.", " +", 1, 4, 0.1 ) 

The function invocation in the example above returns :

 0.45 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
r as xs:string
A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.
t as xs:double
A threshold for the similarity function used to discover token identity.
prefix as xs:integer?
The number of characters to consider when testing for equal prefixes with the Jaro-Winkler metric.
fact as xs:double?
The weighting factor to consider when the input strings have equal prefixes with the Jaro-Winkler metric.

Returns

xs:double
The cosine similarity coefficient between the sets of tokens extracted from the two strings.

soft-cosine-tokens-jaro#4

declare  function simh:soft-cosine-tokens-jaro($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval).

The Jaro similarity function is used to discover token identity, and tokens having a Jaro similarity above a given threshold are considered as matching tokens.

Example usage :

 soft-cosine-tokens-jaro("The FLWOR Foundation", "FLWOR Found.", " +", 1 ) 

The function invocation in the example above returns :

 0.5 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
r as xs:string
A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.
t as xs:double
A threshold for the similarity function used to discover token identity.

Returns

xs:double
The cosine similarity coefficient between the sets of tokens extracted from the two strings.

soft-cosine-tokens-metaphone#3

declare  function simh:soft-cosine-tokens-metaphone($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval).

The Metaphone phonetic similarity function is used to discover token identity, which is equivalent to saying that this function returns the cosine similarity coefficient between sets of Metaphone keys.

Example usage :

 soft-cosine-tokens-metaphone("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +" ) 

The function invocation in the example above returns :

 1.0 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
r as xs:string
A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

xs:double
The cosine similarity coefficient between the sets of Metaphone keys extracted from the two strings.

soft-cosine-tokens-soundex#3

declare  function simh:soft-cosine-tokens-soundex($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval).

The Soundex phonetic similarity function is used to discover token identity, which is equivalent to saying that this function returns the cosine similarity coefficient between sets of Soundex keys.

Example usage :

 soft-cosine-tokens-soundex("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +") 

The function invocation in the example above returns :

 1.0 

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
r as xs:string
A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

xs:double
The cosine similarity coefficient between the sets of Soundex keys extracted from the two strings.
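
One way these functions might be applied is to flag likely duplicate names in a list. The query below is only a sketch: it assumes the module is installed and importable under the namespace shown at the top of this page, and the sample names and the 0.9 cut-off are hypothetical values chosen for illustration.

```xquery
(: Assumption: the page URI is the module namespace, bound here to the simh prefix. :)
import module namespace simh = "http://zorba.io/modules/data-cleaning/hybrid-string-similarity";

(: Hypothetical sample data. :)
let $names := ("ALEKSANDER SMITH", "ALEXANDER SMYTH", "JOHN DOE")
for $a at $i in $names, $b at $j in $names
(: Compare each unordered pair once; 0.9 is an arbitrary cut-off for this example. :)
where $i lt $j
  and simh:soft-cosine-tokens-soundex($a, $b, " +") ge 0.9
return concat($a, " ~ ", $b)
```

Given the 1.0 result documented above for the first two names, that pair would be reported. A stricter or looser cut-off trades missed duplicates against false matches.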