http://zorba.io/modules/data-cleaning/character-based-string-similarity


This library module provides character-based string similarity functions that view strings as sequences of characters, generally computing a similarity score that corresponds to the cost of transforming one string into another. These functions are particularly useful for matching near duplicate strings in the presence of typographical errors.

The logic contained in this module is not specific to any particular XQuery implementation.

Function Summary

edit-distance ($s1 as xs:string, $s2 as xs:string) as xs:integer

Returns the edit distance between two strings.

jaro-winkler ($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double

Returns the Jaro-Winkler similarity coefficient between two strings.

jaro ($s1 as xs:string, $s2 as xs:string) as xs:double

Returns the Jaro similarity coefficient between two strings.

needleman-wunsch ($s1 as xs:string, $s2 as xs:string, $score as xs:integer, $penalty as xs:integer) as xs:double

Returns the Needleman-Wunsch distance between two strings.

smith-waterman ($s1 as xs:string, $s2 as xs:string, $score as xs:integer, $penalty as xs:integer) as xs:double

Returns the Smith-Waterman distance between two strings.

Functions

edit-distance#2

declare  function simc:edit-distance($s1 as xs:string, $s2 as xs:string) as xs:integer

Returns the edit distance between two strings.

This distance, also referred to as the Levenshtein distance, is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

Example usage :

edit-distance("FLWOR", "FLOWER")

The function invocation in the example above returns :

2
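The definition above can be sketched with the usual dynamic-programming recurrence. The following is a minimal illustration in Python, not the module's own XQuery source; the name edit_distance is hypothetical. For the example inputs it yields the same value, 2.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    # previous[j] = distance between the processed prefix of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("FLWOR", "FLOWER"))  # 2
```

One insertion ("O" after "FL") and one substitution ("O" to "E") account for the two edits.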

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.

Returns

xs:integer
The edit distance between the two strings.

jaro-winkler#4

declare  function simc:jaro-winkler($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double

Returns the Jaro-Winkler similarity coefficient between two strings.

This similarity coefficient corresponds to an extension of the Jaro similarity coefficient that weights or penalizes strings based on their similarity at the beginning of the string, up to a given prefix size.
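The prefix adjustment can be sketched as follows. This is the textbook Winkler formula, written in Python for illustration only; it is an assumption about what the description means, not the module's source. The helper name winkler_boost is hypothetical, and it takes an already computed Jaro coefficient as input. The module may treat the prefix differently, so its results need not equal this sketch's.

```python
def winkler_boost(jaro_score: float, s1: str, s2: str,
                  prefix: int, fact: float) -> float:
    """Textbook Winkler adjustment: jaro + l * fact * (1 - jaro),
    where l is the common prefix length, capped at `prefix`."""
    limit = max(prefix, 0)
    common = 0
    for c1, c2 in zip(s1[:limit], s2[:limit]):
        if c1 != c2:
            break
        common += 1
    return jaro_score + common * fact * (1.0 - jaro_score)


# Classic illustration: the textbook Jaro value for MARTHA/MARHTA is 17/18,
# and the strings share the three-character prefix "MAR".
print(round(winkler_boost(17 / 18, "MARTHA", "MARHTA", 4, 0.1), 4))  # 0.9611
```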

Example usage :

jaro-winkler("DWAYNE", "DUANE", 4, 0.1 )

The function invocation in the example above returns :

0.8577777777777778

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
prefix as xs:integer
The number of characters to consider when testing for equal prefixes in the strings.
fact as xs:double
The weighting factor to consider when the input strings have equal prefixes.

Returns

xs:double
The Jaro-Winkler similarity coefficient between the two strings.

jaro#2

declare  function simc:jaro($s1 as xs:string, $s2 as xs:string) as xs:double

Returns the Jaro similarity coefficient between two strings.

This similarity coefficient is based on the number of transposed characters and on a weighted sum of the percentage of matched characters held within the strings. The higher the Jaro value is, the more similar the strings are. The coefficient is normalized such that 0 equates to no similarity and 1 is an exact match.
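As a rough illustration of how matched and transposed characters combine, here is a sketch of the textbook Jaro formula in Python. It is an assumption about the computation described, not the module's source; the matching window and the function name jaro are taken from the standard definition, and the module may count matches differently, so its numeric results need not coincide with this sketch.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity: 0.0 = nothing in common, 1.0 = identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```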

Example usage :

jaro("FLWOR Found.", "FLWOR Foundation")

The function invocation in the example above returns :

0.5853174603174603

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.

Returns

xs:double
The Jaro similarity coefficient between the two strings.

needleman-wunsch#4

declare  function simc:needleman-wunsch($s1 as xs:string, $s2 as xs:string, $score as xs:integer, $penalty as xs:integer) as xs:double

Returns the Needleman-Wunsch distance between two strings.

The Needleman-Wunsch distance is similar to the basic edit distance metric, adding a variable cost adjustment to the cost of a gap (i.e., an insertion or deletion) in the distance metric.

Example usage :

needleman-wunsch("KAK", "KQRK", 1, 1)

The function invocation in the example above returns :

0
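A global-alignment sketch in Python helps show how the two numeric parameters interact. It assumes a scoring convention the page does not spell out: a matching pair of characters adds score, while a mismatch or a gap subtracts penalty. The function name needleman_wunsch is hypothetical, and the sketch returns an integer rather than an xs:double. Under that assumption it yields 0 for the example inputs, in line with the value shown above.

```python
def needleman_wunsch(s1: str, s2: str, score: int, penalty: int) -> int:
    """Best global alignment score.

    Assumed convention (not confirmed by the module documentation):
    match = +score, mismatch = -penalty, gap = -penalty.
    """
    # Aligning a prefix of s2 against the empty string costs one gap each.
    previous = [-j * penalty for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [-i * penalty]
        for j, c2 in enumerate(s2, start=1):
            diagonal = previous[j - 1] + (score if c1 == c2 else -penalty)
            current.append(max(
                diagonal,                # align c1 with c2
                previous[j] - penalty,   # gap in s2
                current[j - 1] - penalty,  # gap in s1
            ))
        previous = current
    return previous[-1]


print(needleman_wunsch("KAK", "KQRK", 1, 1))  # 0
```

One alignment that reaches this score is K-AK against KQRK: two matches (+2), one gap (-1), and one mismatch (-1).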

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
score as xs:integer
The score value.
penalty as xs:integer
The penalty value.

Returns

xs:double
The Needleman-Wunsch distance between the two strings.

smith-waterman#4

declare  function simc:smith-waterman($s1 as xs:string, $s2 as xs:string, $score as xs:integer, $penalty as xs:integer) as xs:double

Returns the Smith-Waterman distance between two strings.

Example usage :

smith-waterman("ACACACTA", "AGCACACA", 2, 1)

The function invocation in the example above returns :

12
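Smith-Waterman is the local-alignment counterpart of Needleman-Wunsch: it scores the best-matching pair of substrings rather than the whole strings. The Python sketch below uses the same assumed convention, which the page does not state (match adds score, mismatch or gap subtracts penalty), and the hypothetical name smith_waterman. Under that assumption it yields 12 for the example inputs, in line with the value shown above.

```python
def smith_waterman(s1: str, s2: str, score: int, penalty: int) -> int:
    """Best local alignment score.

    Assumed convention (not confirmed by the module documentation):
    match = +score, mismatch = -penalty, gap = -penalty.
    """
    best = 0
    previous = [0] * (len(s2) + 1)
    for c1 in s1:
        current = [0]
        for j, c2 in enumerate(s2, start=1):
            diagonal = previous[j - 1] + (score if c1 == c2 else -penalty)
            # Clamping at 0 lets an alignment restart anywhere, which is
            # what makes the alignment local rather than global.
            cell = max(0, diagonal, previous[j] - penalty, current[j - 1] - penalty)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(smith_waterman("ACACACTA", "AGCACACA", 2, 1))  # 12
```

One alignment that reaches this score is A-CACACTA against AGCACAC-A: seven matches (+14) and two gaps (-2).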

Parameters

s1 as xs:string
The first string.
s2 as xs:string
The second string.
score as xs:integer
The score value.
penalty as xs:integer
The penalty value.

Returns

xs:double
The Smith-Waterman distance between the two strings.