A key benefit of XML is its ability to represent a mix of structured and
unstructured (text) data. Querying XML data is a well-explored topic, with
powerful database-style query languages such as XPath and XQuery set to
become W3C standards. An equally compelling paradigm for querying XML
documents is full-text search. Although current XML query languages such as
XPath and XQuery can express rich queries over structured data (e.g.,
navigate in document structure, construct new elements), they can only
express very rudimentary queries over text.
I will present TeXQuery, a full-text extension to XQuery. TeXQuery provides
a rich set of fully composable full-text search primitives, such as Boolean
connectives, phrase matching, proximity distance, stemming and thesauri.
TeXQuery enables users
to seamlessly query over both structured and text data. It supports a
flexible scoring construct that can be used to score query results based on
full-text predicates. TeXQuery is the precursor of the full-text language
extension to XPath 2.0 and XQuery 1.0 that is being developed at the W3C.
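
To make "fully composable" concrete, here is a minimal Python sketch of
full-text primitives that can be nested freely. It is an illustration only,
not TeXQuery syntax or its data model: every name in it (tokenize, word,
phrase, and_, or_, within, contains) is hypothetical, and it assumes each
primitive maps a token list to a list of matches, where a match is a set of
token positions. Stemming and thesauri are left out.

```python
import re
from itertools import product

def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())

def word(w):
    """Match every position at which the single word w occurs."""
    w = w.lower()
    return lambda tokens: [frozenset([i]) for i, t in enumerate(tokens) if t == w]

def phrase(*words):
    """Match the given words at consecutive token positions."""
    ws = [w.lower() for w in words]
    def run(tokens):
        n = len(ws)
        return [frozenset(range(i, i + n))
                for i in range(len(tokens) - n + 1)
                if tokens[i:i + n] == ws]
    return run

def or_(a, b):
    """Boolean OR: a match of either operand is a match."""
    return lambda tokens: a(tokens) + b(tokens)

def and_(a, b):
    """Boolean AND: pair every match of a with every match of b."""
    return lambda tokens: [x | y for x, y in product(a(tokens), b(tokens))]

def within(pred, max_span):
    """Proximity: keep matches whose positions span at most max_span tokens."""
    return lambda tokens: [m for m in pred(tokens) if max(m) - min(m) <= max_span]

def contains(text, pred):
    """True if the predicate has at least one match in the text."""
    return bool(pred(tokenize(text)))

text = "TeXQuery extends XQuery with full-text search over XML documents"
near = within(and_(word("xquery"), phrase("full", "text", "search")), 4)
print(contains(text, near))
```

Because every primitive returns the same kind of value, the result of one
(for example and_) can be passed directly to another (for example within),
which is the sense of composability the sketch is meant to show.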
I will give an overview of the TeXQuery language and its data model and some
of the challenges that arise when designing such a language. I will also
describe possible implementation architectures for TeXQuery. Finally, I will
present several solved and unsolved research issues that arise when
integrating queries on structure with queries on text. Such issues include
phrase matching in XML documents, approximate matching of structure and text
in XML, and efficient ranking and top-K answering.
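
As a baseline for what top-K answering has to compute, the following Python
sketch scores every document and keeps the K best. It is deliberately naive:
the scoring function (fraction of tokens that are query terms) and the names
score and top_k are assumptions made for illustration, not TeXQuery's scoring
construct. The efficiency question is how to obtain the same answer without
scanning and scoring every document as this sketch does.

```python
import heapq
import re

def score(text, terms):
    """Hypothetical score: fraction of the text's tokens that are query terms."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    wanted = {t.lower() for t in terms}
    hits = sum(1 for t in tokens if t in wanted)
    return hits / len(tokens)

def top_k(docs, terms, k):
    """Return up to k (score, doc_id) pairs, best first, skipping zero scores."""
    scored = ((score(text, terms), doc_id) for doc_id, text in docs.items())
    return heapq.nlargest(k, (pair for pair in scored if pair[0] > 0))

docs = {
    "a": "xml xml query",
    "b": "xml and text search here",
    "c": "nothing relevant",
}
for s, doc_id in top_k(docs, ["xml"], 2):
    print(doc_id, round(s, 3))
```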