http://zorba.io/modules/full-text

This module provides an XQuery API to full-text functions. For general information about this implementation of the XQuery and XPath Full Text 1.0 specification, as well as instructions for building and installing a thesaurus, see the Full Text Thesaurus documentation.

Notes on languages

To refer to particular human languages, this module uses either the ISO 639-1 or ISO 639-2 language codes. Note that only a subset of the complete list of language codes is supported, and not every function supports the same subset.

Most functions in this module take a language as a parameter using the xs:language XML schema data type.
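
Because language support differs from function to function, it can help to probe support before relying on it. The following is a minimal sketch, assuming the module is imported under the ft prefix used throughout this page; the particular language codes and the element and attribute names in the output are chosen only for illustration.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

(: Probe a few ISO 639-1 and ISO 639-2 codes; the list is illustrative. :)
for $code in ( "en", "eng", "de", "deu" )
let $lang := xs:language( $code )
return
  <language code="{ $code }"
            stemming="{ ft:is-stem-lang-supported( $lang ) }"
            stop-words="{ ft:is-stop-word-lang-supported( $lang ) }"
            tokenizer="{ ft:is-tokenizer-lang-supported( $lang ) }"/>
```

Each attribute holds the boolean returned by the corresponding is-*-lang-supported() function, so the same language may be reported as supported by one function and not by another.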

Notes on stemming

The stem() functions return the stem of a word. The stem of a word itself, however, is not guaranteed to be a word. It is best to consider a stem as an opaque byte sequence. All that is guaranteed about a stem is that, for a given word, the stem of that word will always be the same byte sequence. Hence, you should never compare the result of one of the stem() functions against a non-stemmed string, for example:
  if ( ft:stem( "apples" ) eq "apple" )             (: WRONG :)

Instead, do:
  if ( ft:stem( "apples" ) eq ft:stem( "apple" ) )  (: CORRECT :)
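
As a fuller illustration of the stem-to-stem comparison, here is a small sketch that selects the words sharing a stem with a query word. The word list and the variable names are made up for the example; only ft:stem() and $ft:LANG-EN come from this module.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

(: Stem both sides of the comparison; never compare a stem to a raw string. :)
let $query := "apples"
for $word in ( "apple", "apples", "orange" )
where ft:stem( $word, $ft:LANG-EN ) eq ft:stem( $query, $ft:LANG-EN )
return $word
```

The stems themselves are never inspected or printed, which keeps them opaque as recommended above.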
 

Notes on the thesaurus

The thesaurus-lookup() functions have "levels" and "relationship" parameters. The values for these are implementation-defined. The default implementation uses the WordNet lexical database, version 3.0.

In WordNet, the number of "levels" between two phrases is the number of hierarchical meanings that separate them. For example, "canary" is 5 levels away from "vertebrate" (canary > finch > oscine > passerine > bird > vertebrate).

When using the WordNet implementation, all of the relationships (and their abbreviations) specified by ISO 2788 and ANSI/NISO Z39.19-2005 with the exceptions of "HN" (history note) and "X SN" (see scope note for) are supported. These relationships are:

Rel.  Meaning                   WordNet Rel.
BT    broader term              hypernym
BTG   broader term generic      hypernym
BTI   broader term instance     instance hypernym
BTP   broader term partitive    part meronym
NT    narrower term             hyponym
NTG   narrower term generic     hyponym
NTI   narrower term instance    instance hyponym
NTP   narrower term partitive   part holonym
RT    related term              also see
SN    scope note                n/a
TT    top term                  hypernym
UF    non-preferred term        n/a
USE   preferred term            n/a

Note that you can specify relationships either by their abbreviation or by their meaning. Relationships are case-insensitive. In addition to the ISO 2788 and ANSI/NISO Z39.19-2005 relationships, all of the relationships offered by WordNet are also supported. These relationships are:

Relationship                 Meaning
also see                     A word that is related to another, e.g., for "varnished" (furniture) one should also see "finished."
antonym                      A word opposite in meaning to another, e.g., "light" is an antonym for "heavy."
attribute                    A noun for which adjectives express values, e.g., "weight" is an attribute for which the adjectives "light" and "heavy" express values.
cause                        A verb that causes another, e.g., "show" is a cause of "see."
derivationally related form  A word that is derived from a root word, e.g., "metric" is a derivationally related form of "meter."
derived from adjective       An adverb that is derived from an adjective, e.g., "correctly" is derived from the adjective "correct."
entailment                   A verb that presupposes another, e.g., "snoring" entails "sleeping."
hypernym                     A word with a broad meaning that more specific words fall under, e.g., "meal" is a hypernym of "breakfast."
hyponym                      A word of more specific meaning than a general term applicable to it, e.g., "breakfast" is a hyponym of "meal."
instance hypernym            A word that denotes a category of some specific instance, e.g., "author" is an instance hypernym of "Asimov."
instance hyponym             A term that denotes a specific instance of some general category, e.g., "Asimov" is an instance hyponym of "author."
member holonym               A word that denotes a collection of individuals, e.g., "faculty" is a member holonym of "professor."
member meronym               A word that denotes a member of a larger group, e.g., a "person" is a member meronym of a "crowd."
part holonym                 A word that denotes a larger whole comprised of some part, e.g., "car" is a part holonym of "engine."
part meronym                 A word that denotes a part of a larger whole, e.g., an "engine" is a part meronym of a "car."
participle of verb           An adjective that is the participle of some verb, e.g., "breaking" is the participle of the verb "break."
pertainym                    An adjective that classifies its noun, e.g., "musical" is a pertainym in "musical instrument."
similar to                   Similar, though not necessarily interchangeable, adjectives. For example, "shiny" is similar to "bright", but they have subtle differences.
substance holonym            A word that denotes a larger whole containing some constituent substance, e.g., "bread" is a substance holonym of "flour."
substance meronym            A word that denotes a constituent substance of some larger whole, e.g., "flour" is a substance meronym of "bread."
verb group                   A verb that is a member of a group of similar verbs, e.g., "live" is in the verb group of "dwell", "live", "inhabit", etc.

Notes on tokenization

For general information about the implementation of tokenization, including what constitutes a token, see the Full Text Tokenizer documentation.

Function Summary

current-compare-options () as object() external

Gets the current compare options.

current-lang () as xs:language external

Gets the current language: either the language specified by the declare ft-option using language statement (if any) or the one returned by ft:host-lang() (if none).

host-lang () as xs:language external

Gets the host's current language.

is-stem-lang-supported ($lang as xs:language) as xs:boolean external

Checks whether the given language is supported for stemming.

is-stop-word-lang-supported ($lang as xs:language) as xs:boolean external

Checks whether the given language is supported for stop words.

is-stop-word ($word as xs:string) as xs:boolean external

Checks whether the given word is a stop-word.

is-stop-word ($word as xs:string, $lang as xs:language) as xs:boolean external

Checks whether the given word is a stop-word.

is-thesaurus-lang-supported ($lang as xs:language) as xs:boolean external

Checks whether the given language is supported for look-up using the default thesaurus.

is-thesaurus-lang-supported ($uri as xs:string, $lang as xs:language) as xs:boolean external

Checks whether the given language is supported for look-up using the thesaurus specified by the given URI.

is-tokenizer-lang-supported ($lang as xs:language) as xs:boolean external

Checks whether the given language is supported for tokenization.

stem ($word as xs:string) as xs:string external

Stems the given word.

stem ($word as xs:string, $lang as xs:language) as xs:string external

Stems the given word.

strip-diacritics ($string as xs:string) as xs:string external

Strips all diacritical marks from all characters.

thesaurus-lookup ($phrase as xs:string) as xs:string* external

Looks up the given phrase in the default thesaurus.

thesaurus-lookup ($uri as xs:string, $phrase as xs:string) as xs:string* external

Looks up the given phrase in a thesaurus.

thesaurus-lookup ($uri as xs:string, $phrase as xs:string, $lang as xs:language) as xs:string* external

Looks up the given phrase in the thesaurus specified by the given URI.

thesaurus-lookup ($uri as xs:string, $phrase as xs:string, $lang as xs:language, $relationship as xs:string) as xs:string* external

Looks up the given phrase in a thesaurus.

thesaurus-lookup ($uri as xs:string, $phrase as xs:string, $lang as xs:language, $relationship as xs:string, $level-least as xs:integer, $level-most as xs:integer) as xs:string* external

Looks up the given phrase in a thesaurus.

tokenize-node ($node as node()) as object()* external

Tokenizes the given node and all of its descendants.

tokenize-node ($node as node(), $lang as xs:language) as object()* external

Tokenizes the given node and all of its descendants.

tokenize-nodes ($includes as node()+, $excludes as node()*) as object()* external

Tokenizes the set of nodes comprising $includes (and all of its descendants) but excluding $excludes (and all of its descendants), if any.

tokenize-nodes ($includes as node()+, $excludes as node()*, $lang as xs:language) as object()* external

Tokenizes the set of nodes comprising $includes (and all of its descendants) but excluding $excludes (and all of its descendants), if any.

tokenize-string ($string as xs:string) as xs:string* external

Tokenizes the given string.

tokenize-string ($string as xs:string, $lang as xs:language) as xs:string* external

Tokenizes the given string.

tokenizer-properties () as object() external

Gets properties of the tokenizer for the language returned by ft:current-lang().

tokenizer-properties ($lang as xs:language) as object() external

Gets properties of the tokenizer for the given language.

Functions

current-compare-options#0

declare  function ft:current-compare-options() as object() external
Gets the current compare options.

Parameters

Returns

object()
said compare options.

current-lang#0

declare  function ft:current-lang() as xs:language external
Gets the current language: either the language specified by the declare ft-option using language statement (if any) or the one returned by ft:host-lang() (if none).

Parameters

Returns

xs:language
said language.
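
To illustrate how the current language is determined, here is a minimal sketch that sets the language in the query prolog with the declare ft-option statement mentioned above. The choice of "de" is arbitrary and only for illustration.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

(: Prolog-level language option; per the description above,
   ft:current-lang() should then report this language
   instead of the host's language. :)
declare ft-option using language "de";

ft:current-lang()
```

Removing the declare ft-option line should make the same query fall back to whatever ft:host-lang() returns on the machine running it.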

host-lang#0

declare  function ft:host-lang() as xs:language external
Gets the host's current language. The "host" is the computer on which the software is running. The host's current language is obtained as follows:
  • For *nix systems:
    1. If setlocale(3) returns non-null, the language corresponding to that locale is used.
    2. Else, if the LANG environment variable is set, that language is used.
    3. Otherwise, there is no default language.
  • For Windows systems, the language corresponding to the locale returned by the GetLocaleInfo() function is used.

Parameters

Returns

xs:language
said language.

is-stem-lang-supported#1

declare  function ft:is-stem-lang-supported($lang as xs:language) as xs:boolean external
Checks whether the given language is supported for stemming.

Parameters

lang as xs:language
The language to check.

Returns

xs:boolean
true only if the language is supported.

is-stop-word-lang-supported#1

declare  function ft:is-stop-word-lang-supported($lang as xs:language) as xs:boolean external
Checks whether the given language is supported for stop words.

Parameters

lang as xs:language
The language to check.

Returns

xs:boolean
true only if the language is supported.

is-stop-word#1

declare  function ft:is-stop-word($word as xs:string) as xs:boolean external
Checks whether the given word is a stop-word.

Parameters

word as xs:string
The word to check. The word's language is assumed to be the one returned by ft:current-lang().

Returns

xs:boolean
true only if $word is a stop-word.

is-stop-word#2

declare  function ft:is-stop-word($word as xs:string, $lang as xs:language) as xs:boolean external
Checks whether the given word is a stop-word.

Parameters

word as xs:string
The word to check.
lang as xs:language
The language of $word.

Returns

xs:boolean
true only if $word is a stop-word.
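
A common use of ft:is-stop-word() is to drop stop-words from tokenized text. The sketch below combines it with ft:tokenize-string() from this module; the sample sentence and variable names are illustrative, and the exact tokens returned depend on the tokenizer and stop-word list in use.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

let $lang := $ft:LANG-EN
for $token in ft:tokenize-string( "the quick brown fox jumps over the lazy dog", $lang )
(: Keep only the tokens that are not stop-words in the given language. :)
where not( ft:is-stop-word( $token, $lang ) )
return $token
```

Passing the language explicitly to both functions keeps the tokenizer and the stop-word check consistent, regardless of what ft:current-lang() returns.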

is-thesaurus-lang-supported#1

declare  function ft:is-thesaurus-lang-supported($lang as xs:language) as xs:boolean external
Checks whether the given language is supported for look-up using the default thesaurus.

Parameters

lang as xs:language
The language to check.

Returns

xs:boolean
true only if the language is supported.

is-thesaurus-lang-supported#2

declare  function ft:is-thesaurus-lang-supported($uri as xs:string, $lang as xs:language) as xs:boolean external
Checks whether the given language is supported for look-up using the thesaurus specified by the given URI.

Parameters

uri as xs:string
The URI specifying the thesaurus to use.
lang as xs:language
The language to check.

Returns

xs:boolean
true only if the language is supported.

is-tokenizer-lang-supported#1

declare  function ft:is-tokenizer-lang-supported($lang as xs:language) as xs:boolean external
Checks whether the given language is supported for tokenization.

Parameters

lang as xs:language
The language to check.

Returns

xs:boolean
true only if the language is supported.

stem#1

declare  function ft:stem($word as xs:string) as xs:string external
Stems the given word.

Parameters

word as xs:string
The word to stem. The word's language is assumed to be the one returned by ft:current-lang().

Returns

xs:string
the stem of $word.

stem#2

declare  function ft:stem($word as xs:string, $lang as xs:language) as xs:string external
Stems the given word.

Parameters

word as xs:string
The word to stem.
lang as xs:language
The language of $word.

Returns

xs:string
the stem of $word.

strip-diacritics#1

declare  function ft:strip-diacritics($string as xs:string) as xs:string external
Strips all diacritical marks from all characters.

Parameters

string as xs:string
The string to strip diacritical marks from.

Returns

xs:string
$string with diacritical marks stripped.
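
One plausible use of ft:strip-diacritics() is accent-insensitive comparison. This is a small sketch with made-up sample strings; it strips both operands before comparing, so it does not depend on the exact form of the stripped result.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

(: Compare two spellings after stripping diacritical marks from both. :)
let $stored := "résumé"
let $typed  := "resume"
return ft:strip-diacritics( $stored ) eq ft:strip-diacritics( $typed )
```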

thesaurus-lookup#1

declare  function ft:thesaurus-lookup($phrase as xs:string) as xs:string* external
Looks up the given phrase in the default thesaurus.

Parameters

phrase as xs:string
The phrase to look up. The phrase's language is assumed to be the one returned by ft:current-lang().

Returns

xs:string*
the related phrases if $phrase is found in the thesaurus or the empty sequence if not.

thesaurus-lookup#2

declare  function ft:thesaurus-lookup($uri as xs:string, $phrase as xs:string) as xs:string* external
Looks up the given phrase in a thesaurus.

Parameters

uri as xs:string
The URI specifying the thesaurus to use.
phrase as xs:string
The phrase to look up. The phrase's language is assumed to be the one returned by ft:current-lang().

Returns

xs:string*
the related phrases if $phrase is found in the thesaurus or the empty sequence if not.

thesaurus-lookup#3

declare  function ft:thesaurus-lookup($uri as xs:string, $phrase as xs:string, $lang as xs:language) as xs:string* external
Looks up the given phrase in the thesaurus specified by the given URI.

Parameters

uri as xs:string
The URI specifying the thesaurus to use.
phrase as xs:string
The phrase to look up.
lang as xs:language
The language of $phrase.

Returns

xs:string*
the related phrases if $phrase is found in the thesaurus or the empty sequence if not.

thesaurus-lookup#4

declare  function ft:thesaurus-lookup($uri as xs:string, $phrase as xs:string, $lang as xs:language, $relationship as xs:string) as xs:string* external
Looks up the given phrase in a thesaurus.

Parameters

uri as xs:string
The URI specifying the thesaurus to use.
phrase as xs:string
The phrase to look up.
lang as xs:language
The language of $phrase.
relationship as xs:string
The relationship the results are to have to $phrase.

Returns

xs:string*
the related phrases if $phrase is found in the thesaurus or the empty sequence if not.

thesaurus-lookup#6

declare  function ft:thesaurus-lookup($uri as xs:string, $phrase as xs:string, $lang as xs:language, $relationship as xs:string, $level-least as xs:integer, $level-most as xs:integer) as xs:string* external
Looks up the given phrase in a thesaurus.

Parameters

uri as xs:string
The URI specifying the thesaurus to use.
phrase as xs:string
The phrase to look up.
lang as xs:language
The language of $phrase.
relationship as xs:string
The relationship the results are to have to $phrase.
level-least as xs:integer
The minimum number of levels within the thesaurus to be traversed.
level-most as xs:integer
The maximum number of levels within the thesaurus to be traversed.

Returns

xs:string*
the related phrases if $phrase is found in the thesaurus or the empty sequence if not.
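
The following sketch shows how the relationship and level parameters might be combined, reusing the "canary" example from the notes on the thesaurus. The thesaurus URI is an assumption: it must be replaced by whatever URI your installation maps to its thesaurus. The results depend entirely on the installed thesaurus, so none are shown.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

(: ASSUMPTION: a WordNet thesaurus is installed and mapped to this URI.
   Substitute the URI configured for your own installation. :)
declare variable $thesaurus-uri as xs:string := "http://wordnet.princeton.edu";

(: Ask for broader terms ("BT") that are between 1 and 2 levels away. :)
ft:thesaurus-lookup( $thesaurus-uri, "canary", $ft:LANG-EN, "BT", 1, 2 )
```

Since relationships are case-insensitive and may be given by meaning, "broader term" should be accepted in place of "BT" here.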

tokenize-node#1

declare  function ft:tokenize-node($node as node()) as object()* external
Tokenizes the given node and all of its descendants.

Parameters

node as node()
The node to tokenize. The node's default language is assumed to be the one returned by ft:current-lang().

Returns

object()*
a (possibly empty) sequence of tokens.

tokenize-node#2

declare  function ft:tokenize-node($node as node(), $lang as xs:language) as object()* external
Tokenizes the given node and all of its descendants.

Parameters

node as node()
The node to tokenize.
lang as xs:language
The default language of $node.

Returns

object()*
a (possibly empty) sequence of tokens.

tokenize-nodes#2

declare  function ft:tokenize-nodes($includes as node()+, $excludes as node()*) as object()* external
Tokenizes the set of nodes comprising $includes (and all of its descendants) but excluding $excludes (and all of its descendants), if any.

Parameters

includes as node()+
The set of nodes (and their descendants) to include. The default language is assumed to be the one returned by ft:current-lang().
excludes as node()*
The set of nodes (and their descendants) to exclude.

Returns

object()*
a (possibly empty) sequence of tokens.

tokenize-nodes#3

declare  function ft:tokenize-nodes($includes as node()+, $excludes as node()*, $lang as xs:language) as object()* external
Tokenizes the set of nodes comprising $includes (and all of its descendants) but excluding $excludes (and all of its descendants), if any.

Parameters

includes as node()+
The set of nodes (and their descendants) to include.
excludes as node()*
The set of nodes (and their descendants) to exclude.
lang as xs:language
The default language for nodes.

Returns

object()*
a (possibly empty) sequence of tokens.
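
As an illustration of the includes/excludes pair, this sketch tokenizes a small document while skipping one of its subtrees. The element names and content are made up for the example, and the structure of the returned token objects is not described on this page, so the tokens are returned as-is.

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

let $doc :=
  <article>
    <title>Full text</title>
    <body>The quick brown fox</body>
    <footer>Page one</footer>
  </article>
(: Tokenize the whole article except the footer subtree. :)
return ft:tokenize-nodes( $doc, $doc/footer, $ft:LANG-EN )
```

Passing the empty sequence, ( ), as the second argument would tokenize every included node without exclusions.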

tokenize-string#1

declare  function ft:tokenize-string($string as xs:string) as xs:string* external
Tokenizes the given string.

Parameters

string as xs:string
The string to tokenize. The string's language is assumed to be the one returned by ft:current-lang().

Returns

xs:string*
a (possibly empty) sequence of tokens.

tokenize-string#2

declare  function ft:tokenize-string($string as xs:string, $lang as xs:language) as xs:string* external
Tokenizes the given string.

Parameters

string as xs:string
The string to tokenize.
lang as xs:language
The language of $string.

Returns

xs:string*
a (possibly empty) sequence of tokens.

tokenizer-properties#0

declare  function ft:tokenizer-properties() as object() external
Gets properties of the tokenizer for the language returned by ft:current-lang().

Parameters

Returns

object()
said properties.

tokenizer-properties#1

declare  function ft:tokenizer-properties($lang as xs:language) as object() external
Gets properties of the tokenizer for the given language.

Parameters

lang as xs:language
The language of the tokenizer to get the properties of.

Returns

object()
said properties.

Variables

$ft:LANG-DA as xs:language
Predeclared constant for the Danish xs:language.
$ft:LANG-DE as xs:language
Predeclared constant for the German xs:language.
$ft:LANG-EN as xs:language
Predeclared constant for the English xs:language.
$ft:LANG-ES as xs:language
Predeclared constant for the Spanish xs:language.
$ft:LANG-FI as xs:language
Predeclared constant for the Finnish xs:language.
$ft:LANG-FR as xs:language
Predeclared constant for the French xs:language.
$ft:LANG-HU as xs:language
Predeclared constant for the Hungarian xs:language.
$ft:LANG-IT as xs:language
Predeclared constant for the Italian xs:language.
$ft:LANG-NL as xs:language
Predeclared constant for the Dutch xs:language.
$ft:LANG-NO as xs:language
Predeclared constant for the Norwegian xs:language.
$ft:LANG-PT as xs:language
Predeclared constant for the Portuguese xs:language.
$ft:LANG-RO as xs:language
Predeclared constant for the Romanian xs:language.
$ft:LANG-RU as xs:language
Predeclared constant for the Russian xs:language.
$ft:LANG-SV as xs:language
Predeclared constant for the Swedish xs:language.
$ft:LANG-TR as xs:language
Predeclared constant for the Turkish xs:language.
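
The predeclared constants can be passed wherever a function expects an xs:language. A minimal sketch, using an arbitrary selection of the constants above, that reports which of them are supported for stemming:

```xquery
import module namespace ft = "http://zorba.io/modules/full-text";

(: The selection of languages here is arbitrary. :)
for $lang in ( $ft:LANG-DA, $ft:LANG-DE, $ft:LANG-EN, $ft:LANG-TR )
return concat( string( $lang ), ": ", string( ft:is-stem-lang-supported( $lang ) ) )
```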