OpenFTS is Copyright 2000-2002 XWare and licensed under the GNU General Public License, version 2 (June 1991). This means you can use it and modify it in any way you want. If you choose to redistribute OpenFTS, you must do so under the terms of the GNU license.
IMPORTANT NOTICE: This version is incompatible with earlier versions due to changes in the base data type, the structure of the indexing tables, and the interfaces of the dictionaries.OpenFTS is in what is likely to be one of many stages. The OpenFTS developers are experimenting with various features which should eventually result in a full-featured search engine within PostgreSQL.
The latest incarnation has more natural interface which is easier to understand. In the old system, search queries look something like the following:
SELECT
txt.tid,
FROM
txt
WHERE
(txt.fts_index @ '{14054652}')
and the new system uses a natural language approach that supports boolean
operators and it looks like the following:
SELECT * FROM foo WHERE titleidx @@ '(the|this)&!we';
This is quite an improvement over the previous approach. Here's a more complete
list of changes in the latest version:
txtidx
, suitable for text indexing.
It uses words 'as is' without hashing to integers and provides
search interface in more natural way. For example, it's possible
now to test full text search from psql. More information about
tsearch is available here.
lemms
method instead of lemmsid
,
is_stoplexem
instead of is_stoplemm
.
drop
--
removes all OpenFTS tables, indices, dictionaries
(if dictionary provides 'drop' method);
drop_index
--
removes all OpenFTS indices from index tables (INDEX1,,,INDEXN) and
the GiST index on the base table (where the documents
are stored together with their primary key).
gmake-install-headers
during the PostgreSQL installation. If you have not installed tsearch,
please do so now.
tsearch is under the contrib directory of the PostgreSQL source code.
cd /usr/local/src/
tar -xzvf Search-OpenFTS-0.34.tar.gz
cd Search-OpenFTS-0.34
perl Makefile.PL
make
make install
cp -r pgsql_contrib_openfts PGSQL_SRC_HOME/contrib
cd PGSQL_SRC_HOME/contrib/pgsql_contrib_openfts
make
make install
psql DATABASE < openfts.sql
Search::OpenFTS::Index->init
function. You should start by
creating a base table that stores the indexed documents. Here's an example from
the OpenFTS distribution:
create table txt (
tid int not null primary key,
txt varchar,
fts_index txtidx
);
To configure your OpenFTS instance, call
Search::OpenFTS::Index->init
that creates the configuration and
indexing tables. In our example, the call looks like:
my $idx=Search::OpenFTS::Index->init(
dbi=>$dbi,
txttid=>'txt.tid',
use_index_table=>1,
txtidx_field=>'fts_index',
numbergroup=>10,
ignore_id_index=>[ qw( 7 13 14 12 23 ) ],
ignore_headline=>[ qw(13 15 16 17 5) ],
map=>'{ \'19\'=>[1], 18=>[1], 8=>[1], 7=>[1], 6=>[1], 5=>[1], 4=>[1],
}',
dict=>[
'Search::OpenFTS::Dict::PorterEng',
'Search::OpenFTS::Dict::UnknownDict',
]
);
You have to specify the table name to be indexed together with its primary key
(txttid
), the indexing field (txtidx
), the number of
indexing tables (numbergroup
), the types of lexemes that should be
ignored by the indexer (ignore_id_index
) and types ignored while
constructing headlines for search results (ignore_headlines
), the
available dictionaries (dict
), and a mapping of types of lexemes
to dictionaries (map
) that is used for optimization and for
multi-language support.
All configuration parameters are stored in a database table,
fts_conf
, that is created upon the invocation of the
initialization function. ignore_id_index
and
ignore_headline
can only accept types of lexemes as specified
later in this document [Parser]. For example, value 13 is the type ID for an
html tag, namely SYMTAG. Type IDs are also used to map lexemes to dictionaries.
This is helpful for optimizing the search engine and it is also helpful for
indexing multi-languages or exotic-text documents.
You can create more than one OpenFTS instance by passing a character value
(a-z) for prefix upon the invocation of the initialization function. The
initialization function will also create a table,
fts_unknown_lexem
, that stores the lexemes that are not recognized
by any of the available dictionaries and as many indexing tables as you have
specified in the corresponding parameter to init
. Note that
fts_unknown_lexem
is created by
Search::OpenFTS::Dict::UnknownDict dictionary, not the OpenFTS core. If it is
not specified upon initialization it will not be created. Here's what the SQL,
responsible for the creation of the indexing tables, looks like:
create table index1 (
lexem varchar not null,
tid int4 not null,
pos int4[] not null
);
The data model includes fields for storing the lexeme, the document ID and the position of the lexeme in the document. The latest version, no longer requires separate indices for fasting unindexing. The indices are created as follows:
create unique index index1_key on index1 ( tid, lexem );
Search::OpenFTS::Index->index
), the parser reads and converts it
into a stream of lexemes. Lexemes that were marked to be ignored are filtered
and the position of each lexeme is calculated and stored in the corresponding
indexing table.
SELECT
txt.tid,
relor( 1.0, 0.01, 0, txt.tid, txt.tid % 10 + 1 , '{"xware"}' ) as pos
FROM
txt
WHERE
txt.fts_index @@ '\'xware\''
ORDER BY pos desc
Function relor is used for ranking of the search results. The first two
arguments (1.0, 0.01
) denote the weights for words in the title
and in the body, respectively. The third argument (0
) is the
prefix that denotes which OpenFTS instance to use (see Indexer Configuration).
The fourth argument (txt.tid
) is the table name together with its
primary key that identify a document. The fifth argument (txt.tid % 10 +
1
) is the number of indexing table with the identificator
txt.tid
. In the WHERE clause the statement txt.fts_index @@
'\'xware\''
is the "contains" predicate which denotes whether fts_index
(of data type txtidx
) contains the keyword xware
or
not. Here's a more complicated query for the words hello
and
xware
:
SELECT
txt.tid,
relkov( 1.0, 0.01, 0, txt.tid, txt.tid % 10 + 1 , '{"xware", "hello"}'
) as pos
FROM
txt
WHERE
txt.fts_index @@ '\'xware\' & \'hello\''
ORDER BY pos desc
One obvious difference is that, now that we are querying for more words,
relkov
is used instead of relor
. Also note how the
where clause has changed into a conjuction.
NOTICE:
Since 0.34 version it's possible to normalize weight of document by
its size (length). Functions relcov, relor
now accepts
additional argument (last):
lemms
and
is_stoplexem
. In the case of a real dictionary, lemms returns an
array of lexemes. is_stoplexem accepts a lexeme and returns true if it is a stop
word, otherwise it returns false. For example, is_stoplexem("yahoo") returns 0
since "yahoo" is not a stop word while is_stoplexem("are") returns 1 since "are"
is a stop word. An optional method init
, if available, can be used
to initialize the dictionary.
The relevance ranking functions support weights for words found in the title
and the body. The default values are W_TITLE = 1
and W_BODY=
0.01
. As a general rule W_TITLE
must be much greater than
W_BODY
. You can change the default values by passing them as
parameters to Search::OpenFTS->new
.
OpenFTS uses a parser that reads a document or a search query and converts it into a stream of lexemes. You can use different parsers for different projects. The parser distributed with OpenFTS recognizes 19 types of lexemes:
Type | ID | Description | Exam ples |
LATWORD | 1 | latin word | hello |
CYRWORD | 2 | cyrillic word | ... |
UWORD | 3 | mixed word | ... |
4 | email address | teodor@sigaev.ru | |
FURL | 5 | full URL | http://www.yahoo.com/index.html |
HOST | 6 | host name | ... |
SCIENTIFIC | 7 | number in scientific notation | -0.12345e+15 |
VERSIONNUMBER | 8 | integer or version number | 3 7.1.2 |
PARTHYPHENWORD | 9 | part of mixed hyphenated word | ... |
CYRPARTHYPHENWORD | 10 | cyrillic part of hyphenated word | ... |
LATPARTHYPHENWORD | 11 | latin part of hyphenated word | multi in word multi-key |
SPACE | 12 | symbols | $#%^ |
SYMTAG | 13 | HTML tag | <b>
<table> |
HTTP | 14 | HTTP | http:// |
HYPHENWORD | 15 | mixed hyphenated word | ... |
LATHYPHENWORD | 16 | latin hyphenated word | multi-key |
CYRHYPHENWORD | 17 | cyrillic hyphenated word | ... |
URI | 18 | Uniform Resource Identifier | /index.html |
FILEPATH | 19 | filename or path | example.txt |
DECIMAL | 20 | number in decimal notation | 10.345 |
SIGNEDINT | 21 | integer | -4 |
UNSIGNEDINT | 22 | unsigned integer | 4 |
HTMLENTITY | 23 | HTML entity | 4 |
The package Search::OpenFTS::Parser
is a wrapper around the parser
functions. Here's an example of how it can be used:
use Search::OpenFTS::Parser;
my $parser=Search::OpenFTS::Parser->new();
my $txt = 'Hello world. OpenFTS rules!';
$parser->start_parser( \$txt );
while ((($type, $word)=$parser->get_word) && $type) {
print $parser->type_description($type), "\t$word\n";
}
$parser->end_parser;
Save the script (above) into a file called parser-test-1.pl
and then call it with perl parser-test-1.pl
.
The output should look like this:
Latin word Hello
Space symbols
Latin word world
Space symbols .
Space symbols
Latin word OpenFTS
Space symbols
Latin word rules
Space symbols !
To get all types that the parser supports, try this:
use Search::OpenFTS::Parser;
my $parser=Search::OpenFTS::Parser->new();
my @types = $parser->alltypes;
map { print "$_ => $types[$_]\n"; } 1..$#types;
The output should look like this:
1 => Latin word
2 => Cyrillic word
3 => Word
4 => Email
5 => URL
6 => Host
7 => Scientific notation
8 => VERSION
9 => Part of hyphenated word
10 => Cyrillic part of hyphenated word
11 => Latin part of hyphenated word
12 => Space symbols
13 => Char in tag
14 => HTTP head
15 => Hyphenated word
16 => Latin hyphenated word
17 => Cyrillic hyphenated word
18 => URI
19 => File or path name
20 => Decimal notation
21 => Signed integer
22 => Unsigned integer
23 => HTML entity
See simple_parser.pl in examples directory for example of very simple parser written in perl and recognizing space delimited words with length =>2.
echo 'Simple parser written in perl a'|perl simple_parser.pl
Space delimited word (len=>2),word=Simple
Space delimited word (len=>2),word=parser
Space delimited word (len=>2),word=written
Space delimited word (len=>2),word=in
Space delimited word (len=>2),word=perl
0 => Unknown type
1 => Space delimited word (len=>2)
Notice, last word 'a' doesn't recognized because it's too short.
Search::OpenFTS::Search->search
Search::OpenFTS::Search->get_sql
(1) relkov( 1.0, 0.01, 0, txt.tid, txt.tid % 10 + 1 , '{"xware", "hello"}' as pos
(2) txt.fts_index @@ '\'xware\' & \'hello\''
(2) pos desc
Search::OpenFTS::Search->_sql
get_sql
to get the SQL parts and then combines them to the SQL query to be executed.
Search::OpenFTS::Search->get_headline
Search::OpenFTS::Index->init
init
)
about the available dictionaries,
the parser, mapping of lexemes to dictionaries, and types of
lexemes to be ignored in indexing or headlines.
Search::OpenFTS::Index->index
Search::OpenFTS::Index->delete
Search::OpenFTS::Index->drop
Search::OpenFTS::Index->drop_index
Search::OpenFTS::Index->create_index
Search::OpenFTS::Index->start_index
Search::OpenFTS::Index->index_chunk
Search::OpenFTS::Index->flush
Search::OpenFTS::Parser->start_parser
Search::OpenFTS::Parser->end_parser
Search::OpenFTS::Parser->get_word
Search::OpenFTS::Parser->type_description
Search::OpenFTS::Parser->alltypes
$ createdb testfts
$ cd PGSQL_SRC_HOME/contrib/tsearch
$ psql testfts < tsearch.sql
$ cd PGSQL_SRC_HOME/contrib/pgsql_contrib_openfts/
$ /psql testfts < openfts.sql
$ cd /usr/local/src/Search-OpenFTS-0.34/examples
$ ./init.pl testfts
$ ./index.pl testfts file1 [file2 [...] ]
$ ./search.pl -p testfts word1 [word2 [...] ]