SYNOPSIS

  use Search::QueryParser;

  my $qp = Search::QueryParser->new;
  my $s = '+mandatoryWord -excludedWord +field:word "exact phrase"';
  my $query = $qp->parse($s)  or die "Error in query : " . $qp->err;
  $someIndexer->search($query);

  # query with comparison operators and implicit plus (second arg is true)
  $query = $qp->parse("txt~'^foo.*' date>='01.01.2001' date<='02.02.2002'", 1);

  # boolean operators (example below is equivalent to "+a +(b c) -d")
  $query = $qp->parse("a AND (b OR c) AND NOT d");

  # subset of rows
  $query = $qp->parse("Id#123,444,555,666 AND (b OR c)");

DESCRIPTION

This module parses a query string into a data structure to be handled by external search engines. For examples of such engines, see File::Tabular and Search::Indexer.

The query string can contain simple terms, "exact phrases", field names and comparison operators, '+/-' prefixes, parentheses, and boolean connectors.

The parser can be parameterized by regular expressions for specific notions of "term", "field name" or "operator"; see the new method. The parser has no support for lemmatization or other term transformations: these should be done externally, before passing the query data structure to the search engine.

The data structure resulting from a parsed query is a tree of terms and operators, as described below in the parse method. The interpretation of the structure is up to the external search engine that will receive the parsed query; the present module does not make any assumption about what it means to be "equal" or to "contain" a term.

QUERY STRING

The query string is decomposed into "items", where each item has an optional sign prefix, an optional field name and comparison operator, and a mandatory value.

Sign prefix

Prefix '+' means that the item is mandatory. Prefix '-' means that the item must be excluded. No prefix means that the item will be searched for, but is not mandatory.

As far as the result set is concerned, "+a +b c" is strictly equivalent to "+a +b": the search engine will return documents containing both terms 'a' and 'b', and possibly also term 'c'. However, if the search engine also returns relevance scores, query "+a +b c" might give a better score to documents that also contain term 'c'.
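One possible reading of the three prefixes can be sketched for a toy engine in which a document is just a list of words. The function name score_doc and the hand-written structure (shaped like the return value of parse for "+a +b c") are illustrative assumptions, not part of this module. The rule that a query without mandatory items needs at least one optional match is likewise an assumption.

```perl
use strict;
use warnings;

# Hypothetical toy engine: returns undef when the document is rejected,
# otherwise a score counting how many optional items matched.
sub score_doc {
    my ($query, $words) = @_;
    my %has = map { $_ => 1 } @$words;

    # every '+' item must be present
    for my $item (@{ $query->{'+'} || [] }) {
        return undef unless $has{ $item->{value} };
    }
    # no '-' item may be present
    for my $item (@{ $query->{'-'} || [] }) {
        return undef if $has{ $item->{value} };
    }
    # items without prefix only influence the score
    my $optional = $query->{''} || [];
    my $score    = grep { $has{ $_->{value} } } @$optional;

    # assumption: with no mandatory item, one optional item has to match
    return undef if !@{ $query->{'+'} || [] } && @$optional && !$score;
    return $score;
}

# Hand-written structure for "+a +b c" (not produced by the parser here)
my $query = {
    '+' => [ { field => '', op => ':', value => 'a', quote => undef },
             { field => '', op => ':', value => 'b', quote => undef } ],
    ''  => [ { field => '', op => ':', value => 'c', quote => undef } ],
    '-' => undef,
};

for my $doc ([qw(a b)], [qw(a b c)], [qw(a c)]) {
    my $score = score_doc($query, $doc);
    printf "%-7s %s\n", "@$doc", defined $score ? "score $score" : "rejected";
}
```

Documents with both 'a' and 'b' are accepted with or without 'c'; only the score differs, which is the distinction the paragraph above draws.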

See also section "Boolean connectors" below, which is another way to combine items into a query.

Field name and comparison operator

Internally, each query item has a field name and comparison operator; if not written explicitly in the query, these take default values '' (empty field name) and ':' (colon operator).

Operators have a left operand (the field name) and a right operand (the value to be compared with); for example, "foo:bar" means "search documents containing term 'bar' in field 'foo'", whereas "foo=bar" means "search documents where field 'foo' has exact value 'bar'".

Here is the list of admitted operators with their intended meaning:

  • ":" treats the value as a term to be searched within the field. This is the default operator.

  • "~" or "=~" treats the value as a regex and matches the field against the regex.

  • "!~" is the negation of the above.

  • "==" or "=", "<=", ">=", "!=", "<", ">" are the classical relational operators.

  • "#" tests inclusion in the set of comma-separated integers supplied on the right-hand side.

Operators ":", "~", "=~", "!~" and "#" admit an empty left operand (so the field name will be ''). Search engines will usually interpret this as "any field" or "the whole data record".
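Since the meaning of each operator is left to the search engine, the following is only a sketch of how an engine might evaluate a single item against one record. The function item_matches, the record layout (a hash of field name to value) and the choice to read an empty field name as "any field" are assumptions made for illustration; only a few operators are covered.

```perl
use strict;
use warnings;

# Hypothetical evaluator: does one leaf item match one record?
sub item_matches {
    my ($item, $record) = @_;
    my ($field, $op, $value) = @{$item}{qw(field op value)};

    # assumption: an empty field name means "any field"
    my @candidates = $field eq '' ? values(%$record) : ($record->{$field});
    @candidates = grep { defined } @candidates;

    my %test = (
        ':'  => sub { $_[0] =~ /\b\Q$value\E\b/ },   # term contained in field
        '~'  => sub { $_[0] =~ /$value/ },           # value used as a regex
        '=~' => sub { $_[0] =~ /$value/ },
        '='  => sub { $_[0] eq $value },             # exact value
        '#'  => sub {                                # member of integer set
            my %set = map { $_ => 1 } split /,/, $value;
            $set{ $_[0] };
        },
    );
    my $check = $test{$op} or die "unsupported operator '$op'";
    my $hits  = grep { $check->($_) } @candidates;
    return $hits;
}

my $record = { Id => 444, txt => 'foobar baz' };
for my $item (
    { field => 'Id',  op => '#', value => '123,444,555,666' },
    { field => 'txt', op => '~', value => '^foo.*' },
    { field => '',    op => ':', value => 'baz' },
    { field => 'Id',  op => '=', value => '123' },
) {
    printf "%s%s%s => %s\n", @{$item}{qw(field op value)},
        item_matches($item, $record) ? 'match' : 'no match';
}
```

A real engine would typically also handle "!~" and the relational operators, and might compare dates or numbers rather than strings.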

Value

A value (right operand to a comparison operator) can be

  • just a term (as recognized by regex "rxTerm", see the new method below)

  • a quoted phrase, i.e. a collection of terms within single or double quotes. Quotes can be used not only for "exact phrases", but also to prevent misinterpretation of some values: for example an unquoted -2 would mean "value '2' with prefix '-'", in other words "exclude term '2'", so if you want to search for value -2, you should write "-2" (with the quotes) instead. In the synopsis example with comparison operators, quotes were used to prevent splitting of dates into several search terms.

  • a subquery within parentheses. Field names and operators distribute over parentheses, so for example "foo:(bar bie)" is equivalent to "foo:bar foo:bie". Nested field names such as "foo:(bar:bie)" are not allowed. Sign prefixes do not distribute: "+(foo bar) +bie" is not equivalent to "+foo +bar +bie".

Boolean connectors

Queries can contain boolean connectors 'AND', 'OR', 'NOT' (or their equivalent in some other languages). This is mere syntactic sugar for the '+' and '-' prefixes: "a AND b" is translated into "+a +b"; "a OR b" is translated into "(a b)"; "NOT a" is translated into "-a". "+a OR b" does not make sense, but it is translated into "(a b)", under the assumption that the user understands "OR" better than a '+' prefix. "-a OR b" does not make sense either, but has no meaningful approximation, so it is rejected.

Combinations of AND/OR clauses must be surrounded by parentheses, i.e. "(a AND b) OR c" or "a AND (b OR c)" are allowed, but "a AND b OR c" is not.

METHODS

new

  new(rxTerm => qr/.../, rxOp => qr/.../, ...)

Creates a new query parser, initialized with (optional) regular expressions:

rxTerm

Regular expression for matching a term. It must not match the empty string. Default value is qr/[^\s()]+/. A term should not be allowed to include parentheses, otherwise the parser might get into trouble.

rxField

Regular expression for matching a field name. Default value is qr/\w+/ (meaning of "\w" according to "use locale").

rxOp

Regular expression for matching an operator. Default value is qr/==|<=|>=|!=|=~|!~|:|=|<|>|~/. Note that the longest operators come first in the regex, because "alternatives are tried from left to right" (see "Version 8 Regular Expressions" in perlre): this is to avoid "a<=3" being parsed as "a < '=3'".
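The effect of alternation order can be checked in plain Perl without the parser. In this sketch, $longest_first is the default value quoted above, while $shortest_first and the helper split_item are hypothetical names introduced only to show what goes wrong with the other ordering.

```perl
use strict;
use warnings;

my $longest_first  = qr/==|<=|>=|!=|=~|!~|:|=|<|>|~/;   # default from the text
my $shortest_first = qr/:|=|<|>|~|==|<=|>=|!=|=~|!~/;   # hypothetical bad ordering

# Split "fieldOPvalue" into its three parts using the given operator regex.
sub split_item {
    my ($rx_op, $item) = @_;
    return $item =~ /^(\w+)($rx_op)(.*)$/ ? ($1, $2, $3) : ();
}

for my $rx ($longest_first, $shortest_first) {
    my ($field, $op, $value) = split_item($rx, 'a<=3');
    print "field=$field op=$op value=$value\n";
}
```

With the shortest alternatives first, "<" matches before "<=" is ever tried, so the "=" ends up in the value. A custom rxOp should therefore keep multi-character operators ahead of their single-character prefixes.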

rxOpNoField

Regular expression for a subset of the operators which admit an empty left operand (no field name). Default value is qr/=~|!~|~|:/. Such operators can be meaningful for comparisons with "any field" or with "the whole record"; the precise interpretation depends on the search engine.

rxAnd

Regular expression for boolean connector AND. Default value is qr/AND|ET|UND|E/.

rxOr

Regular expression for boolean connector OR. Default value is qr/OR|OU|ODER|O/.

rxNot

Regular expression for boolean connector NOT. Default value is qr/NOT|PAS|NICHT|NON/.

defField

If no field is specified in the query, use defField. The default is the empty string "".

parse

  $q = $queryParser->parse($queryString, $implicitPlus);

Returns a data structure corresponding to the parsed string. The second argument is optional; if true, it adds an implicit '+' in front of each term without prefix, so parse("+a b c -d", 1) is equivalent to parse("+a +b +c -d"). This is often seen in common WWW search engines as an option "match all words".

The return value has the following structure:

  { '+' => [{field=>'f1', op=>':', value=>'v1', quote=>'q1'},
            {field=>'f2', op=>':', value=>'v2', quote=>'q2'},
            ...],
    ''  => [...],
    '-' => [...]
  }

In other words, it is a hash ref with 3 keys '+', '' and '-', corresponding to the 3 sign prefixes (mandatory, ordinary or excluded items). Each key holds either a ref to an array of items, or "undef" (no items with this prefix in the query). An item is a hash ref containing

  • field: scalar, field name (may be the empty string)

  • op: scalar, operator

  • quote: scalar, character that was used for quoting the value ('"', "'" or undef)

  • value: either a scalar (simple term), or a recursive ref to another query structure. In that case, "op" is necessarily '()'; this corresponds to a subquery in parentheses.

In case of a parsing error, "parse" returns "undef"; method err can be called to get an explanatory message.
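Because subqueries nest, code that consumes the structure is usually recursive. The sketch below walks a hand-written structure shaped like the one described above for "+a +(b c) -d"; the function name describe_query, the output format, and the empty field name on the subquery item are illustrative assumptions, not part of this module.

```perl
use strict;
use warnings;

# Hypothetical walker: returns one line per item, indented by nesting depth.
sub describe_query {
    my ($query, $depth) = @_;
    $depth ||= 0;
    my @lines;
    for my $sign ('+', '', '-') {
        my $items = $query->{$sign} or next;   # undef: no item with this prefix
        for my $item (@$items) {
            my $indent = '  ' x $depth;
            if ($item->{op} eq '()') {
                # value is itself a query structure: recurse into it
                push @lines, "$indent$sign(subquery)";
                push @lines, describe_query($item->{value}, $depth + 1);
            }
            else {
                push @lines, "$indent$sign$item->{field}$item->{op}$item->{value}";
            }
        }
    }
    return @lines;
}

# Hand-written structure for "+a +(b c) -d" (not produced by the parser here)
my $query = {
    '+' => [
        { field => '', op => ':', value => 'a', quote => undef },
        { field => '', op => '()', quote => undef, value => {
            '' => [ { field => '', op => ':', value => 'b', quote => undef },
                    { field => '', op => ':', value => 'c', quote => undef } ],
        } },
    ],
    '-' => [ { field => '', op => ':', value => 'd', quote => undef } ],
};

print "$_\n" for describe_query($query);
```

The same traversal pattern applies when translating the structure into another query language or evaluating it against records.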

err

  $msg = $queryParser->err;

Returns a message describing the last parse error.

unparse

  $s = $queryParser->unparse($query);

Returns a string representation of the $query data structure.

AUTHOR

Laurent Dami, <laurent.dami AT etat ge ch>

COPYRIGHT AND LICENSE

Copyright (C) 2005, 2007 by Laurent Dami.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.