http://zorba.io/modules/data-cleaning/set-similarity

This library module provides similarity functions for comparing sets of XML nodes (e.g., sets of XML elements, attributes or atomic values).

These functions are particularly useful for matching near-duplicate sets of XML nodes.

The logic contained in this module is not specific to any particular XQuery implementation.

Function Summary

deep-intersect ($s1, $s2) as item()*

Returns the intersection of two sets, using the deep-equal() function to compare the XML nodes from the sets.

deep-union ($s1, $s2) as item()*

Returns the union of two sets, using the deep-equal() function to compare the XML nodes from the sets.

dice ($s1, $s2) as xs:double

Returns the Dice similarity coefficient between two sets of XML nodes.

distinct ($s) as item()*

Removes exact duplicates from a set, using the deep-equal() function to compare the XML nodes from the set.

jaccard ($s1, $s2) as xs:double

Returns the Jaccard similarity coefficient between two sets of XML nodes.

overlap ($s1, $s2) as xs:double

Returns the overlap coefficient between two sets of XML nodes.

Functions

deep-intersect#2

declare  function set:deep-intersect($s1, $s2) as item()*

Returns the intersection of two sets, using the deep-equal() function to compare the XML nodes from the sets.

Example usage :

 deep-intersect ( ( "a", "b", "c") , ( "a", "a",  ) ) 

The function invocation in the example above returns :

 ("a") 

Parameters

s1
The first set.
s2
The second set.

Returns

item()*
The intersection of both sets.
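
As a rough illustration of the semantics described above, the following Python sketch models the intersection with a pairwise equality test. It is not the module's XQuery implementation: the name deep_intersect, the use of Python's == as a stand-in for deep-equal(), and the plain-string inputs are all assumptions made for this sketch.

```python
def deep_intersect(s1, s2):
    """Return the items of s1 that also occur in s2, without duplicates.

    Assumption: Python's == stands in for XQuery's deep-equal().
    """
    result = []
    for item in s1:
        in_s2 = any(item == other for other in s2)
        already_kept = any(item == kept for kept in result)
        if in_s2 and not already_kept:
            result.append(item)
    return result


print(deep_intersect(["a", "b", "c"], ["a", "a"]))  # prints ['a']
```

The order of the result follows the first input here; that ordering is a property of the sketch, not a documented guarantee of the module.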

deep-union#2

declare  function set:deep-union($s1, $s2) as item()*

Returns the union of two sets, using the deep-equal() function to compare the XML nodes from the sets.

Example usage :

 deep-union ( ( "a", "b", "c") , ( "a", "a",  ) ) 

The function invocation in the example above returns :

 ("a", "b", "c",  ) 

Parameters

s1
The first set.
s2
The second set.

Returns

item()*
The union of both sets.
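
The union semantics can be modeled in a few lines of Python. This is a hedged sketch rather than the module's code: deep_union is a hypothetical name, Python's == approximates deep-equal(), and the string "d" in the second input is an illustrative value chosen for this example.

```python
def deep_union(s1, s2):
    """Return every item that occurs in s1 or s2, without duplicates.

    Assumption: Python's == stands in for XQuery's deep-equal().
    """
    result = []
    for item in list(s1) + list(s2):
        # Keep an item only if no equal item has been kept already.
        if not any(item == kept for kept in result):
            result.append(item)
    return result


print(deep_union(["a", "b", "c"], ["a", "a", "d"]))  # prints ['a', 'b', 'c', 'd']
```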

dice#2

declare  function set:dice($s1, $s2) as xs:double

Returns the Dice similarity coefficient between two sets of XML nodes.

The Dice coefficient is defined as twice the shared information between the input sets (i.e., the size of the intersection) over the sum of the cardinalities of the input sets.

Example usage :

 dice ( ( "a", "b",  ) , ( "a", "a", "d") ) 

The function invocation in the example above returns :

 0.4 

Parameters

s1
The first set.
s2
The second set.

Returns

xs:double
The Dice similarity coefficient between the two sets.
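
The definition above can be worked through in Python. This sketch assumes that duplicates are removed from each input before counting and that Python's == approximates deep-equal(); the helper names and the string inputs are hypothetical, and empty inputs are not handled.

```python
def distinct(s):
    """Remove duplicates using pairwise == (a stand-in for deep-equal())."""
    result = []
    for item in s:
        if not any(item == kept for kept in result):
            result.append(item)
    return result


def dice(s1, s2):
    """Twice the intersection size over the sum of the two set sizes."""
    d1, d2 = distinct(s1), distinct(s2)
    shared = [x for x in d1 if any(x == y for y in d2)]
    # Raises ZeroDivisionError when both inputs are empty (not handled here).
    return 2 * len(shared) / (len(d1) + len(d2))


# Distinct sizes are 3 and 2, and one item is shared: 2 * 1 / (3 + 2).
print(dice(["a", "b", "c"], ["a", "a", "d"]))  # prints 0.4
```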

distinct#1

declare  function set:distinct($s) as item()*

Removes exact duplicates from a set, using the deep-equal() function to compare the XML nodes from the set.

Example usage :

 distinct ( ( "a", "a",  ) ) 

The function invocation in the example above returns :

 ("a",  ) 

Parameters

s
A set.

Returns

item()*
The set provided as input without the exact duplicates (i.e., returns the distinct nodes from the set provided as input).
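
A minimal Python model of this behavior is shown below. It assumes Python's == as an approximation of deep-equal() and uses nested lists as a hypothetical stand-in for structured XML nodes; comparing items pairwise, rather than hashing them, is what lets structured values be deduplicated by content.

```python
def distinct(s):
    """Return the items of s in order, dropping any item equal to an earlier one.

    Assumption: Python's == stands in for XQuery's deep-equal().
    """
    result = []
    for item in s:
        if not any(item == kept for kept in result):
            result.append(item)
    return result


print(distinct(["a", "a", "b"]))               # prints ['a', 'b']
print(distinct([["x", 1], ["x", 1], "a"]))     # prints [['x', 1], 'a']
```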

jaccard#2

declare  function set:jaccard($s1, $s2) as xs:double

Returns the Jaccard similarity coefficient between two sets of XML nodes.

The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the input sets.

Example usage :

 jaccard ( ( "a", "b",  ) , ( "a", "a", "d") ) 

The function invocation in the example above returns :

 0.25 

Parameters

s1
The first set.
s2
The second set.

Returns

xs:double
The Jaccard similarity coefficient between the two sets.
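
To make the ratio concrete, here is a small Python sketch of the computation. It is an illustration under stated assumptions, not the module's implementation: duplicates are removed before counting, Python's == approximates deep-equal(), the helper names and string inputs are hypothetical, and the case of two empty inputs is not handled.

```python
def distinct(s):
    """Remove duplicates using pairwise == (a stand-in for deep-equal())."""
    result = []
    for item in s:
        if not any(item == kept for kept in result):
            result.append(item)
    return result


def jaccard(s1, s2):
    """Size of the intersection divided by the size of the union."""
    d1, d2 = distinct(s1), distinct(s2)
    intersection = [x for x in d1 if any(x == y for y in d2)]
    union = distinct(list(s1) + list(s2))
    # Raises ZeroDivisionError when both inputs are empty (not handled here).
    return len(intersection) / len(union)


# One shared item out of four distinct items overall: 1 / 4.
print(jaccard(["a", "b", "c"], ["a", "a", "d"]))  # prints 0.25
```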

overlap#2

declare  function set:overlap($s1, $s2) as xs:double

Returns the overlap coefficient between two sets of XML nodes.

The overlap coefficient is defined as the shared information between the input sets (i.e., the size of the intersection) over the size of the smaller input set.

Example usage :

 overlap ( ( "a", "b",  ) , ( "a", "a", "b" ) ) 

The function invocation in the example above returns :

 1.0 

Parameters

s1
The first set.
s2
The second set.

Returns

xs:double
The overlap coefficient between the two sets.
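
The following Python sketch works through the definition on a small input. It assumes that each input is deduplicated before its size is taken and that Python's == approximates deep-equal(); the helper names and string inputs are hypothetical, and an empty input is not handled.

```python
def distinct(s):
    """Remove duplicates using pairwise == (a stand-in for deep-equal())."""
    result = []
    for item in s:
        if not any(item == kept for kept in result):
            result.append(item)
    return result


def overlap(s1, s2):
    """Size of the intersection divided by the size of the smaller set."""
    d1, d2 = distinct(s1), distinct(s2)
    shared = [x for x in d1 if any(x == y for y in d2)]
    # Raises ZeroDivisionError when either input is empty (not handled here).
    return len(shared) / min(len(d1), len(d2))


# The second input reduces to two distinct items, both found in the first: 2 / 2.
print(overlap(["a", "b", "c"], ["a", "a", "b"]))  # prints 1.0
```

Because the denominator is the smaller set, the coefficient reaches 1.0 whenever one input is entirely contained in the other, even if the larger input has extra items.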