http://zorba.io/modules/data-cleaning/consolidation

This library module provides data consolidation functions that generally take a sequence of XML nodes as input and apply a rule to decide which node is best suited to represent the entire sequence.

The logic contained in this module is not specific to any particular XQuery implementation, although the functions that match sequences against XPath expressions (all-xpaths, some-xpaths, least-xpaths and most-xpaths) require some form of dynamic XPath evaluation.
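
A minimal query that loads the module might look as follows. It assumes that the module is available to the XQuery processor under the namespace URI shown at the top of this page; the con prefix matches the one used in the declarations below, but any prefix could be bound.

```xquery
(: Assumption: the consolidation module is installed and resolvable
   by the XQuery processor under this namespace URI. :)
import module namespace con = "http://zorba.io/modules/data-cleaning/consolidation";

(: Same call as the documented example for longest; documented result: "aaa". :)
con:longest( ( "a", "aa", "aaa" ) )
```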

Function Summary

all-xpaths ($s as element(*)*, $paths as xs:string*) as element(*)*

Returns the elements from an input sequence of elements that, when matched to a given set of XPath expressions, produce a non-empty set of nodes in all the cases.

least-attributes ($s) as element(*)

Returns the single node having the smallest number of descending attributes (attributes at any given depth) in a sequence of nodes provided as input.

least-distinct-attributes ($s) as element(*)

Returns the single node having the smallest number of distinct descending attributes (attributes at any given depth) in a sequence of nodes provided as input.

least-distinct-elements ($s) as element(*)

Returns the single node having the smallest number of distinct descending elements (sub-elements at any given depth) in a sequence of nodes provided as input.

least-distinct-nodes ($s) as element(*)

Returns the single node having the smallest number of distinct descending nodes (sub-nodes at any given depth) in a sequence of nodes provided as input.

least-elements ($s) as element(*)

Returns the single node having the smallest number of descending elements (sub-elements at any given depth) in a sequence of nodes provided as input.

least-frequent ($s) as item()

Returns the single least frequent node in a sequence of nodes provided as input.

least-nodes ($s) as element(*)

Returns the single node having the smallest number of descending nodes (sub-nodes at any given depth) in a sequence of nodes provided as input.

least-similar-edit-distance ($s as xs:string*, $m as xs:string) as xs:string?

Returns the single least similar string, in terms of the edit distance metric towards an input string, in a sequence of strings provided as input.

least-tokens ($s as xs:string*, $r as xs:string) as xs:string?

Returns the single shortest string, in terms of the number of tokens, in a sequence of strings provided as input.

least-xpaths ($s as element(*)*, $paths as xs:string*) as element(*)*

Returns the single element from an input sequence of elements that matches the smallest number of XPath expressions from a given set, producing a non-empty set of nodes.

longest ($s as xs:string*) as xs:string?

Returns the single longest string, in terms of the number of characters, in a sequence of strings provided as input.

matching ($s as xs:string*, $r as xs:string) as xs:string*

Returns the strings from an input sequence of strings that match a particular regular expression.

most-attributes ($s) as element(*)

Returns the single node having the largest number of descending attributes (attributes at any given depth) in a sequence of nodes provided as input.

most-distinct-attributes ($s) as element(*)

Returns the single node having the largest number of distinct descending attributes (attributes at any given depth) in a sequence of nodes provided as input.

most-distinct-elements ($s) as element(*)

Returns the single node having the largest number of distinct descending elements (sub-elements at any given depth) in a sequence of nodes provided as input.

most-distinct-nodes ($s) as element(*)

Returns the single node having the largest number of distinct descending nodes (sub-nodes at any given depth) in a sequence of nodes provided as input.

most-elements ($s) as element(*)

Returns the single node having the largest number of descending elements (sub-elements at any given depth) in a sequence of nodes provided as input.

most-frequent ($s) as item()

Returns the single most frequent node in a sequence of nodes provided as input.

most-nodes ($s) as element(*)

Returns the single node having the largest number of descending nodes (sub-nodes at any given depth) in a sequence of nodes provided as input.

most-similar-edit-distance ($s as xs:string*, $m as xs:string) as xs:string?

Returns the single most similar string, in terms of the edit distance metric towards an input string, in a sequence of strings provided as input.

most-tokens ($s as xs:string*, $r as xs:string) as xs:string?

Returns the single longest string, in terms of the number of tokens, in a sequence of strings provided as input.

most-xpaths ($s as element(*)*, $paths as xs:string*) as element(*)*

Returns the single element from an input sequence of elements that matches the largest number of XPath expressions from a given set, producing a non-empty set of nodes.

shortest ($s as xs:string*) as xs:string?

Returns the single shortest string, in terms of the number of characters, in a sequence of strings provided as input.

some-xpaths ($s as element(*)*, $paths as xs:string*) as element(*)*

Returns the elements from a sequence of elements that, when matched to a given set of XPath expressions, produce a non-empty set of nodes for some of the cases.

superstring ($s as xs:string*) as xs:string?

Returns the single string, from an input sequence of strings, that appears most frequently as part of the other strings in the sequence.

validating-schema ($s as element(*)*, $schema as element(*)) as element(*)*

Returns the nodes from an input sequence of nodes that validate against a given XML Schema.

Functions

all-xpaths#2

declare  function con:all-xpaths($s as element(*)*, $paths as xs:string*) as element(*)*

Returns the elements from an input sequence of elements that, when matched to a given set of XPath expressions, produce a non-empty set of nodes in all the cases.

Example usage :

 all-xpaths( ( <a><b/></a>, <c><d/></c>, <d/>), (".//b") ) 

The function invocation in the example above returns :

 (<a><b/></a>) 

Parameters

s as element(*)
A sequence of elements.
paths as xs:string
A sequence of strings denoting XPath expressions.

Returns

element(*)*
The elements that, when matched to the given set of XPath expressions, always return a non-empty set of nodes.

least-attributes#1

declare  function con:least-attributes($s) as element(*)

Returns the single node having the smallest number of descending attributes (attributes at any given depth) in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

least-attributes( ( <a att1="a1" att2="a2"/>, <b att1="a1" />, <c/> ) )

The function invocation in the example above returns :

(<c/>)

Parameters

s
A sequence of nodes.

Returns

element(*)
The node having the smallest number of descending attributes in the input sequence.
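
The selection rule and the tie-break can be restated as a short local function. This is only a sketch, not the module's source: local:least-attributes is a hypothetical name, and it assumes that the count covers the attributes of each element and of all its descendants.

```xquery
(: Hypothetical re-statement of the rule; not the module's implementation. :)
declare function local:least-attributes($s as element()*) as element()? {
  (: Smallest attribute count over an element and all of its descendants. :)
  let $min := min(for $e in $s return count($e/descendant-or-self::*/@*))
  (: Keep the elements that reach that minimum, then take the first in input order. :)
  return (for $e in $s
          where count($e/descendant-or-self::*/@*) = $min
          return $e)[1]
};

(: The counts are 2, 1 and 0, so this evaluates to <c/>. :)
local:least-attributes( ( <a att1="a1" att2="a2"/>, <b att1="a1"/>, <c/> ) )
```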

least-distinct-attributes#1

declare  function con:least-distinct-attributes($s) as element(*)

Returns the single node having the smallest number of distinct descending attributes (attributes at any given depth) in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

 least-distinct-attributes( ( <a att1="a1" att2="a2"/>, <b att1="a1" />, <c/> ) ) 

The function invocation in the example above returns :

 (<c/>) 

Parameters

s
A sequence of nodes.

Returns

element(*)
The node having the smallest number of distinct descending attributes in the input sequence.

least-distinct-elements#1

declare  function con:least-distinct-elements($s) as element(*)

Returns the single node having the smallest number of distinct descending elements (sub-elements at any given depth) in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

 least-distinct-elements( ( <a><b/></a>, <b><c/></b>, <d/>) ) 

The function invocation in the example above returns :

 (<d/>) 

Parameters

s
A sequence of nodes.

Returns

element(*)
The node having the smallest number of distinct descending elements in the input sequence.

least-distinct-nodes#1

declare  function con:least-distinct-nodes($s) as element(*)

Returns the single node having the smallest number of distinct descending nodes (sub-nodes at any given depth) in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

 least-distinct-nodes( ( <a><b/></a>, <b><c/></b>, <d/>) ) 

The function invocation in the example above returns :

 (<d/>) 

Parameters

s
A sequence of nodes.

Returns

element(*)
The node having the smallest number of distinct descending nodes in the input sequence.

least-elements#1

declare  function con:least-elements($s) as element(*)

Returns the single node having the smallest number of descending elements (sub-elements at any given depth) in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

least-elements( ( <a><b/></a>, <b><c/></b>, <d/>) )

The function invocation in the example above returns :

(<d/>)

Parameters

s
A sequence of nodes.

Returns

element(*)
The node having the smallest number of descending elements in the input sequence.

least-frequent#1

declare  function con:least-frequent($s) as item()

Returns the single least frequent node in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

least-frequent( ( "a", "a", "b") )

The function invocation in the example above returns :

("b")

Parameters

s
A sequence of nodes.

Returns

item()
The least frequent node in the input sequence.
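
For atomic values such as the strings in the example, the rule can be sketched as below. local:least-frequent is a hypothetical name, and restricting the input to atomic values is an assumption made to keep the comparison simple; the module itself accepts arbitrary items.

```xquery
(: Hypothetical sketch restricted to atomic values; not the module's implementation. :)
declare function local:least-frequent($s as xs:anyAtomicType*) as xs:anyAtomicType? {
  (: Lowest number of occurrences among the distinct values. :)
  let $min := min(for $v in distinct-values($s) return count($s[. = $v]))
  (: First value, in input order, that occurs exactly that many times. :)
  return (for $x in $s
          where count($s[. = $x]) = $min
          return $x)[1]
};

(: "a" occurs twice and "b" once, so this evaluates to "b". :)
local:least-frequent( ( "a", "a", "b" ) )
```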

least-nodes#1

declare  function con:least-nodes($s) as element(*)

Returns the single node having the smallest number of descending nodes (sub-nodes at any given depth) in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

least-nodes( ( <a><b/></a>, <b><c/></b>, <d/>) )

The function invocation in the example above returns :

(<d/>)

Parameters

s
A sequence of nodes.

Returns

element(*)
The node having the smallest number of descending nodes in the input sequence.

least-similar-edit-distance#2

declare  function con:least-similar-edit-distance($s as xs:string*, $m as xs:string) as xs:string?

Returns the single least similar string, in terms of the edit distance metric towards an input string, in a sequence of strings provided as input. If more than one string has a minimum similarity (a maximum value for the edit distance metric), return the first string according to the order of the input sequence.

Example usage :

least-similar-edit-distance( ( "aaabbbccc", "aaabbb", "eeefff" ), "aaab" )

The function invocation in the example above returns :

( "eeefff" )

Parameters

s as xs:string
A sequence of strings.
m as xs:string
The string towards which we want to measure the edit distance.

Returns

xs:string?
The least similar string in the input sequence.

least-tokens#2

declare  function con:least-tokens($s as xs:string*, $r as xs:string) as xs:string?

Returns the single shortest string, in terms of the number of tokens, in a sequence of strings provided as input.

If more than one answer is possible, the function returns the first string according to the order of the input sequence.

Example usage :

least-tokens( ( "a b c", "a b", "a"), " +" )

The function invocation in the example above returns :

("a")

Parameters

s as xs:string
A sequence of strings.
r as xs:string
A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

xs:string?
The shortest string in the input sequence, in terms of the number of tokens.
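
The token count can be sketched with the standard fn:tokenize function. Whether the module tokenizes in exactly this way is an assumption, and local:least-tokens is a hypothetical name.

```xquery
(: Hypothetical sketch; assumes tokens are obtained with fn:tokenize. :)
declare function local:least-tokens($s as xs:string*, $r as xs:string) as xs:string? {
  (: Smallest number of tokens produced by splitting on the delimiter pattern. :)
  let $min := min(for $x in $s return count(tokenize($x, $r)))
  (: First string, in input order, with that number of tokens. :)
  return (for $x in $s
          where count(tokenize($x, $r)) = $min
          return $x)[1]
};

(: The token counts are 3, 2 and 1, so this evaluates to "a". :)
local:least-tokens( ( "a b c", "a b", "a" ), " +" )
```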

least-xpaths#2

declare  function con:least-xpaths($s as element(*)*, $paths as xs:string*) as element(*)*

Returns the single element from an input sequence of elements that matches the smallest number of XPath expressions from a given set, producing a non-empty set of nodes.

If more than one answer is possible, the function returns the first element according to the order of the input sequence.

Example usage :

 least-xpaths( ( <a><b/></a>, <d><c/><b/></d>, <d/>) , (".//b", ".//c") ) 

The function invocation in the example above returns :

 ( <d/> ) 

Parameters

s as element(*)
A sequence of elements.
paths as xs:string
A sequence of strings denoting XPath expressions.

Returns

element(*)*
The element that matches the smallest number of XPath expressions producing a non-empty set of nodes.

longest#1

declare  function con:longest($s as xs:string*) as xs:string?

Returns the single longest string, in terms of the number of characters, in a sequence of strings provided as input.

If more than one answer is possible, the function returns the first string according to the order of the input sequence.

Example usage :

con:longest( ( "a", "aa", "aaa") )

The function invocation in the example above returns :

("aaa")

Parameters

s as xs:string
A sequence of strings.

Returns

xs:string?
The longest string in the input sequence.
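
The tie-break is easiest to see with two strings of equal length. The sketch below re-states the rule under the hypothetical name local:longest; it is not the module's source.

```xquery
(: Hypothetical re-statement of the rule; not the module's implementation. :)
declare function local:longest($s as xs:string*) as xs:string? {
  (: Greatest length, in characters, among the input strings. :)
  let $max := max(for $x in $s return string-length($x))
  (: First string, in input order, that has that length. :)
  return ($s[string-length(.) = $max])[1]
};

(: "aaa" and "bbb" tie at three characters; the earlier one, "aaa", is returned. :)
local:longest( ( "a", "aa", "aaa", "bbb" ) )
```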

matching#2

declare  function con:matching($s as xs:string*, $r as xs:string) as xs:string*

Returns the strings from an input sequence of strings that match a particular regular expression.

Example usage :

matching( ( "a A b", "c AAA d", "e BB f"), "A+" )

The function invocation in the example above returns :

( "a A b", "c AAA d")

Parameters

s as xs:string
A sequence of strings.
r as xs:string
The regular expression to be used in the matching.

Returns

xs:string*
The strings in the input sequence that match the input regular expression.
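
The documented result is consistent with the behaviour of the standard fn:matches function, which succeeds when the pattern matches anywhere in the string. The sketch below assumes that behaviour; local:matching is a hypothetical name.

```xquery
(: Hypothetical sketch; assumes the filter behaves like fn:matches. :)
declare function local:matching($s as xs:string*, $r as xs:string) as xs:string* {
  (: fn:matches is unanchored: the pattern may match any substring. :)
  $s[matches(., $r)]
};

(: Evaluates to ("a A b", "c AAA d"); an anchored pattern such as "^A+$"
   would select none of these strings. :)
local:matching( ( "a A b", "c AAA d", "e BB f" ), "A+" )
```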

most-attributes#1

declare  function con:most-attributes($s) as element(*)

Returns the single node having the largest number of descending attributes (attributes at any given depth) in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

most-attributes( ( <a att1="a1" att2="a2"/>, <b att1="a1" />, <c/> ) )

The function invocation in the example above returns :

(<a att1="a1" att2="a2"/>)

Parameters

s
A sequence of nodes.

Returns

element(*)
The node having the largest number of descending attributes in the input sequence.

most-distinct-attributes#1

declare  function con:most-distinct-attributes($s) as element(*)

Returns the single node having the largest number of distinct descending attributes (attributes at any given depth) in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

most-distinct-attributes( ( <a att1="a1" att2="a2" att3="a3"/>, <a att1="a1" att2="a2"><b att2="a2" /></a>, <c/> ) )

The function invocation in the example above returns :

(<a att1="a1" att2="a2" att3="a3"/>)

Parameters

s
A sequence of nodes.

Returns

element(*)
The node having the largest number of distinct descending attributes in the input sequence.

most-distinct-elements#1

declare  function con:most-distinct-elements($s) as element(*)

Returns the single node having the largest number of distinct descending elements (sub-elements at any given depth) in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

most-distinct-elements( ( <a><b/><c/><d/></a>, <a><b/><b/><c/></a>, <a/> ) )

The function invocation in the example above returns :

(<a><b/><c/><d/></a>)

Parameters

s
A sequence of nodes.

Returns

element(*)
The node having the largest number of distinct descending elements in the input sequence.
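
One possible reading of "distinct" is "distinct by element name". The sketch below follows that reading, which is an assumption rather than something this page states; both function names are hypothetical.

```xquery
(: Assumption: two descendant elements count as the same when they share a local name. :)
declare function local:distinct-element-count($e as element()) as xs:integer {
  count(distinct-values(for $d in $e/descendant::* return local-name($d)))
};

(: Hypothetical re-statement of the rule; not the module's implementation. :)
declare function local:most-distinct-elements($s as element()*) as element()? {
  let $max := max(for $e in $s return local:distinct-element-count($e))
  (: First element, in input order, that reaches the maximum. :)
  return (for $e in $s
          where local:distinct-element-count($e) = $max
          return $e)[1]
};

(: The distinct name counts are 3, 2 and 0, so the first element is returned. :)
local:most-distinct-elements( ( <a><b/><c/><d/></a>, <a><b/><b/><c/></a>, <a/> ) )
```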

most-distinct-nodes#1

declare  function con:most-distinct-nodes($s) as element(*)

Returns the single node having the largest number of distinct descending nodes (sub-nodes at any given depth) in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

most-distinct-nodes( ( <a><b/></a>, <a><a/></a>, <b/>) )

The function invocation in the example above returns :

(<a><b/></a>)

Parameters

s
A sequence of nodes.

Returns

element(*)
The node having the largest number of distinct descending nodes in the input sequence.

most-elements#1

declare  function con:most-elements($s) as element(*)

Returns the single node having the largest number of descending elements (sub-elements at any given depth) in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

most-elements( ( <a><b/></a>, <a/>, <b/>) )

The function invocation in the example above returns :

(<a><b/></a>)

Parameters

s
A sequence of nodes.

Returns

element(*)
The node having the largest number of descending elements in the input sequence.

most-frequent#1

declare  function con:most-frequent($s) as item()

Returns the single most frequent node in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

most-frequent( ( "a", "a", "b") )

The function invocation in the example above returns :

("a")

Parameters

s
A sequence of nodes.

Returns

item()
The most frequent node in the input sequence.

most-nodes#1

declare  function con:most-nodes($s) as element(*)

Returns the single node having the largest number of descending nodes (sub-nodes at any given depth) in a sequence of nodes provided as input.

If more than one answer is possible, the function returns the first node according to the order of the input sequence.

Example usage :

most-nodes( ( <a><b/></a>, <a/>, <b/>) )

The function invocation in the example above returns :

(<a><b/></a>)

Parameters

s
A sequence of nodes.

Returns

element(*)
The node having the largest number of descending nodes in the input sequence.

most-similar-edit-distance#2

declare  function con:most-similar-edit-distance($s as xs:string*, $m as xs:string) as xs:string?

Returns the single most similar string, in terms of the edit distance metric towards an input string, in a sequence of strings provided as input. If more than one string has a maximum similarity (a minimum value for the edit distance metric), the function returns the first string according to the order of the input sequence.

Example usage :

most-similar-edit-distance( ( "aaabbbccc", "aaabbb", "eeefff" ), "aaab" )

The function invocation in the example above returns :

( "aaabbb" )

Parameters

s as xs:string
A sequence of strings.
m as xs:string
The string towards which we want to measure the edit distance.

Returns

xs:string?
The most similar string in the input sequence.
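
The documented result can be checked by hand with a plain Levenshtein distance. The sketch below assumes that "edit distance" means Levenshtein distance with unit costs, which this page does not state explicitly; both function names are hypothetical, and the naive recursion is only suitable for short strings.

```xquery
(: Naive recursive Levenshtein distance with unit costs.
   Runs in exponential time, so it is for illustration only. :)
declare function local:edit-distance($a as xs:string, $b as xs:string) as xs:integer {
  if (string-length($a) = 0) then string-length($b)
  else if (string-length($b) = 0) then string-length($a)
  else
    (: Substituting the first character is free when both strings start alike. :)
    let $cost := if (substring($a, 1, 1) = substring($b, 1, 1)) then 0 else 1
    return xs:integer(min((
      local:edit-distance(substring($a, 2), $b) + 1,
      local:edit-distance($a, substring($b, 2)) + 1,
      local:edit-distance(substring($a, 2), substring($b, 2)) + $cost
    )))
};

(: Hypothetical re-statement of the rule; not the module's implementation. :)
declare function local:most-similar($s as xs:string*, $m as xs:string) as xs:string? {
  let $min := min(for $x in $s return local:edit-distance($x, $m))
  (: First string, in input order, at the minimum distance from $m. :)
  return (for $x in $s
          where local:edit-distance($x, $m) = $min
          return $x)[1]
};

(: The distances to "aaab" are 5, 2 and 6, so this evaluates to "aaabbb". :)
local:most-similar( ( "aaabbbccc", "aaabbb", "eeefff" ), "aaab" )
```

Under the same assumption, the least similar string is "eeefff" (distance 6), which agrees with the example given for least-similar-edit-distance.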

most-tokens#2

declare  function con:most-tokens($s as xs:string*, $r as xs:string) as xs:string?

Returns the single longest string, in terms of the number of tokens, in a sequence of strings provided as input.

If more than one answer is possible, the function returns the first string according to the order of the input sequence.

Example usage :

most-tokens( ( "a b c", "a b", "a"), " +" )

The function invocation in the example above returns :

("a b c")

Parameters

s as xs:string
A sequence of strings.
r as xs:string
A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

xs:string?
The longest string in the input sequence, in terms of the number of tokens.

most-xpaths#2

declare  function con:most-xpaths($s as element(*)*, $paths as xs:string*) as element(*)*

Returns the single element from an input sequence of elements that matches the largest number of XPath expressions from a given set, producing a non-empty set of nodes.

If more than one answer is possible, the function returns the first element according to the order of the input sequence.

Example usage :

 most-xpaths( ( <a><b/></a>, <d><c/><b/></d>, <d/>) , (".//b", ".//c") ) 

The function invocation in the example above returns :

 ( <d><c/><b/></d> ) 

Parameters

s as element(*)
A sequence of elements.
paths as xs:string
A sequence of strings denoting XPath expressions.

Returns

element(*)*
The element that matches the largest number of XPath expressions producing a non-empty set of nodes.

shortest#1

declare  function con:shortest($s as xs:string*) as xs:string?

Returns the single shortest string, in terms of the number of characters, in a sequence of strings provided as input.

If more than one answer is possible, the function returns the first string according to the order of the input sequence.

Example usage :

shortest( ( "a", "aa", "aaa") )

The function invocation in the example above returns :

("a")

Parameters

s as xs:string
A sequence of strings.

Returns

xs:string?
The shortest string in the input sequence.

some-xpaths#2

declare  function con:some-xpaths($s as element(*)*, $paths as xs:string*) as element(*)*

Returns the elements from a sequence of elements that, when matched to a given set of XPath expressions, produce a non-empty set of nodes for some of the cases.

Example usage :

 some-xpaths( ( <a><b/></a>, <d><c/></d>, <d/>), (".//b", ".//c") ) 

The function invocation in the example above returns :

 ( <a><b/></a> , <d><c/></d> ) 

Parameters

s as element(*)
A sequence of elements.
paths as xs:string
A sequence of strings denoting XPath expressions.

Returns

element(*)*
The elements that, when matched to the given set of XPath expressions, return a non-empty set of nodes for at least one of the cases.
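
The difference between "some of the cases" and "all the cases" can be illustrated without dynamic evaluation by passing the paths as function items instead of strings. This is a sketch for processors that support XQuery 3.0 function items, not the way the module evaluates its string arguments; local:some-tests and local:all-tests are hypothetical names.

```xquery
xquery version "3.0";

(: Hypothetical variant of some-xpaths: keeps the elements for which
   at least one test yields a non-empty result. :)
declare function local:some-tests($s as element()*, $tests as function(*)*) as element()* {
  for $e in $s
  where some $t in $tests satisfies exists($t($e))
  return $e
};

(: Hypothetical variant of all-xpaths: every test must yield a non-empty result. :)
declare function local:all-tests($s as element()*, $tests as function(*)*) as element()* {
  for $e in $s
  where every $t in $tests satisfies exists($t($e))
  return $e
};

(: Function items standing in for the path strings ".//b" and ".//c". :)
let $tests := ( function($e) { $e//b }, function($e) { $e//c } )
let $s := ( <a><b/></a>, <d><c/></d>, <d/> )
(: Evaluates to (<a><b/></a>, <d><c/></d>); local:all-tests would return
   the empty sequence here, because no element contains both b and c. :)
return local:some-tests($s, $tests)
```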

superstring#1

declare  function con:superstring($s as xs:string*) as xs:string?

Returns the single string, from an input sequence of strings, that appears most frequently as part of the other strings in the sequence. If no such string exists, the function returns an empty sequence.

If more than one answer is possible, the function returns the first string according to the order of the input sequence.

Example usage :

superstring( ( "aaa bbb ccc", "aaa bbb", "aaa ddd", "eee fff" ) )

The function invocation in the example above returns :

( "aaa bbb" )

Parameters

s as xs:string
A sequence of strings.

Returns

xs:string?
The string that appears most frequently as part of the other strings in the sequence.

validating-schema#2

declare  function con:validating-schema($s as element(*)*, $schema as element(*)) as element(*)*

Returns the nodes from an input sequence of nodes that validate against a given XML Schema.

Example usage :

 validating-schema ( ( <a/> , <b/> ), <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"><xs:element name="a" /></xs:schema> ) 

The function invocation in the example above returns :

 ( <a/> ) 

Parameters

s as element(*)
A sequence of elements.
schema as element(*)
An element encoding an XML Schema.

Returns

element(*)*
The nodes that validate against the XML Schema. Note: this function is not yet implemented.