OpenFTS Primer

by Oleg Bartunov (oleg@sai.msu.su), Neophytos Demetriou (k2pts@cytanet.com.cy), Teodor Sigaev (teodor@sigaev.ru), and Daniel Wickstrom (danw@rtp.ericsson.se)

1. Introduction

OpenFTS is a PostgreSQL-based full text search engine that provides advanced features such as online indexing of data, proximity-based relevance ranking, multilingual support, and stemming. You can check the full list of features here.

OpenFTS is Copyright 2000-2002 XWare and licensed under the GNU General Public License, version 2 (June 1991). This means you can use it and modify it in any way you want. If you choose to redistribute OpenFTS, you must do so under the terms of the GNU license.

2. Motivation

General search engines such as Google and Altavista are great for finding content on the Web. However, these tools do not have access to information about the framework of a web site, such as access control. For example, a search may return links that the user is not allowed to view, and it is frustrating to click on a link only to get an access-denied page. Thus, an internal site-wide search requires an architecture-aware search tool. On the other hand, most architecture-aware search tools use an inverted index, which is very fast for searching but very slow for online updates. Incremental update of an inverted index is a complex engineering task, while we needed something light and free.

3. Changes

IMPORTANT NOTICE: This version is incompatible with earlier versions due to changes in the base data type, the structure of the indexing tables, and the interfaces of the dictionaries.

OpenFTS is in what is likely to be one of many stages. The OpenFTS developers are experimenting with various features which should eventually result in a full-featured search engine within PostgreSQL.

The latest incarnation has a more natural interface that is easier to understand. In the old system, search queries looked something like the following:


      SELECT
          txt.tid
      FROM
          txt
      WHERE
          (txt.fts_index @ '{14054652}')

The new system uses a natural-language approach that supports boolean operators. A query now looks like the following:


      SELECT * FROM foo WHERE titleidx @@ '(the|this)&!we';

This is quite an improvement over the previous approach. Here's a more complete list of changes in the latest version:

- The latest version is based on tsearch, a PostgreSQL contrib module that implements a special text data type, txtidx, suitable for text indexing. It uses words 'as is', without hashing them to integers, and provides a more natural search interface. For example, it is now possible to test full text search from psql. More information about tsearch is available here.
- Implementations of dictionary interfaces are required to work with lexemes instead of integers: the lemms method replaces lemmsid, and is_stoplexem replaces is_stoplemm.
- Better administration and maintenance API. Added:
  - drop -- removes all OpenFTS tables, indices, and dictionaries (if the dictionary provides a 'drop' method);
  - drop_index -- removes all OpenFTS indices from the index tables (INDEX1, ..., INDEXN) and the GiST index on the base table (where the documents are stored together with their primary key).
- Added generic interfaces to ISpell dictionaries and Snowball stemmers. ISpell dictionaries are free, are available for many languages, and can be used to return the base forms of a word. This is very important for inflective languages such as Russian. Snowball stemmers, available from http://snowball.tartarus.org, can be used to stem a word, i.e. to cut off the word's ending and use the linguistic root for indexing and searching.
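
To make the boolean query syntax shown earlier in this section concrete, here is a minimal sketch in plain Perl that evaluates a query such as '(the|this)&!we' against the words of a document. It is not the tsearch implementation: the function name matches, the grammar, and the operator precedence (! binds tighter than &, which binds tighter than |) are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Toy evaluator for queries like '(the|this)&!we' (NOT the tsearch code).
# Assumed grammar:
#   expr   := term ('|' term)*
#   term   := factor ('&' factor)*
#   factor := '!' factor | '(' expr ')' | word
sub matches {
    my ($query, @words) = @_;
    my %present = map { lc($_) => 1 } @words;
    my @tokens  = $query =~ /([()&|!]|\w+)/g;
    my $pos     = 0;

    my ($expr, $term, $factor);
    $factor = sub {
        my $tok = $tokens[$pos++];
        die "unexpected end of query\n" unless defined $tok;
        return !$factor->() ? 1 : 0 if $tok eq '!';
        if ($tok eq '(') {
            my $value = $expr->();
            die "missing ')'\n"
                unless defined $tokens[$pos] && $tokens[$pos] eq ')';
            $pos++;
            return $value;
        }
        die "unexpected token '$tok'\n" if $tok =~ /^[)&|]$/;
        return $present{ lc $tok } ? 1 : 0;
    };
    $term = sub {
        my $value = $factor->();
        while (defined $tokens[$pos] && $tokens[$pos] eq '&') {
            $pos++;
            my $rhs = $factor->();
            $value = $value && $rhs ? 1 : 0;
        }
        return $value;
    };
    $expr = sub {
        my $value = $term->();
        while (defined $tokens[$pos] && $tokens[$pos] eq '|') {
            $pos++;
            my $rhs = $term->();
            $value = $value || $rhs ? 1 : 0;
        }
        return $value;
    };

    my $result = $expr->();
    die "trailing tokens in query\n" if $pos < @tokens;
    return $result;
}

for my $doc ('this is a text', 'we like the text', 'hello world') {
    my $hit = matches('(the|this)&!we', split ' ', $doc);
    printf "%-18s %s\n", $doc, $hit ? 'match' : 'no match';
}
```

Only the first document matches: it contains "this" and does not contain "we".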

4. Installation

Prerequisites

PostgreSQL-7.2
OpenFTS-0.34
Perl 5.005 + DBI + DBD::Pg

Preparation Tasks

Make sure you installed all headers (for example, spi.h) by running gmake install-all-headers during the PostgreSQL installation. If you have not installed tsearch, please do so now. tsearch is under the contrib directory of the PostgreSQL source code.

Install OpenFTS

Change to the source directory and untar the archive file:

cd /usr/local/src/
tar -xzvf Search-OpenFTS-0.34.tar.gz
cd Search-OpenFTS-0.34
perl Makefile.PL
make
make install

cp -r pgsql_contrib_openfts PGSQL_SRC_HOME/contrib

cd PGSQL_SRC_HOME/contrib/pgsql_contrib_openfts

make
make install
psql DATABASE < openfts.sql

5. How it works

Indexer Configuration

First, you should configure OpenFTS. Indexer configuration is handled mostly by the Search::OpenFTS::Index->init function. You should start by creating a base table that stores the indexed documents. Here's an example from the OpenFTS distribution:


    create table txt (
      tid  int not null primary key, 
      txt varchar,
      fts_index txtidx
    );

To configure your OpenFTS instance, call Search::OpenFTS::Index->init, which creates the configuration and indexing tables. In our example, the call looks like:


my $idx=Search::OpenFTS::Index->init( 
        dbi=>$dbi, 
        txttid=>'txt.tid',
        use_index_table=>1,
        txtidx_field=>'fts_index',
        numbergroup=>10,
        ignore_id_index=>[ qw( 7 13 14 12 23 ) ],
        ignore_headline=>[ qw(13 15 16 17 5) ],
        map=>'{ \'19\'=>[1], 18=>[1], 8=>[1], 7=>[1], 6=>[1], 5=>[1], 4=>[1] }',
        dict=>[
                'Search::OpenFTS::Dict::PorterEng',
                'Search::OpenFTS::Dict::UnknownDict',
        ] 
);

You have to specify the table name to be indexed together with its primary key (txttid), the indexing field (txtidx_field), the number of indexing tables (numbergroup), the types of lexemes that should be ignored by the indexer (ignore_id_index), the types ignored while constructing headlines for search results (ignore_headline), the available dictionaries (dict), and a mapping of types of lexemes to dictionaries (map) that is used for optimization and for multi-language support.

All configuration parameters are stored in a database table, fts_conf, that is created when the initialization function is invoked. ignore_id_index and ignore_headline accept only the types of lexemes specified later in this document [Parser]. For example, value 13 is the type ID for an HTML tag, namely SYMTAG. Type IDs are also used to map lexemes to dictionaries. This helps optimize the search engine and also helps when indexing multilingual or exotic-text documents.
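
As a rough illustration of how type IDs can steer lexemes to dictionaries, the sketch below reuses the values from the init call above. It is not OpenFTS code: the helper dictionaries_for_type is hypothetical, and two points are assumptions rather than documented behavior, namely that the numbers in map are zero-based positions in the dict list and that unmapped types are offered to every dictionary in order.

```perl
use strict;
use warnings;

# Values copied from the init example above.
my @dict = (
    'Search::OpenFTS::Dict::PorterEng',
    'Search::OpenFTS::Dict::UnknownDict',
);
my %map    = (19 => [1], 18 => [1], 8 => [1], 7 => [1], 6 => [1], 5 => [1], 4 => [1]);
my %ignore = map { $_ => 1 } qw(7 13 14 12 23);    # ignore_id_index

# Hypothetical helper: which dictionaries would see a lexeme of this type?
sub dictionaries_for_type {
    my ($type) = @_;
    return () if $ignore{$type};                   # ignored types are never indexed
    # ASSUMPTION: map values are zero-based positions in @dict, and
    # unmapped types fall through to all dictionaries in order.
    my $positions = $map{$type} // [ 0 .. $#dict ];
    return @dict[@$positions];
}

for my $type (1, 13, 19) {
    my @chosen = dictionaries_for_type($type);
    printf "type %2d -> %s\n", $type, @chosen ? join(', ', @chosen) : '(ignored)';
}
```

Under these assumptions, a file name (type 19) skips the English stemmer entirely, which is the kind of optimization the mapping is meant to allow.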

You can create more than one OpenFTS instance by passing a character value (a-z) for prefix when you invoke the initialization function. The initialization function also creates as many indexing tables as you specified in numbergroup, plus a table, fts_unknown_lexem, that stores the lexemes not recognized by any of the available dictionaries. Note that fts_unknown_lexem is created by the Search::OpenFTS::Dict::UnknownDict dictionary, not by the OpenFTS core; if that dictionary is not specified upon initialization, the table is not created. Here's what the SQL responsible for the creation of the indexing tables looks like:


    create table index1 (
        lexem   varchar not null,
        tid     int4 not null,
        pos     int4[] not null
    );

The data model includes fields for storing the lexeme, the document ID, and the positions of the lexeme in the document. The latest version no longer requires separate indices for fast unindexing. The indices are created as follows:


    create unique index index1_key on index1 ( tid, lexem );

Indexing

We have already introduced the initialization function which is part of the indexing module. When a request is received for indexing a document (Search::OpenFTS::Index->index), the parser reads and converts it into a stream of lexemes. Lexemes that were marked to be ignored are filtered and the position of each lexeme is calculated and stored in the corresponding indexing table.
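
The indexing step described above can be sketched in plain Perl as follows. This is only an illustration of the idea, not the OpenFTS indexer: the function index_rows, the tiny stop-word list, and the choice to let ignored words still advance the position counter are all assumptions. Each returned row mirrors the columns of an indexing table (lexem, tid, pos).

```perl
use strict;
use warnings;

# Hypothetical stop-word list; real dictionaries decide this in OpenFTS.
my %stop = map { $_ => 1 } qw(a an the is are of);

# Turn a document into rows shaped like the indexing tables:
# [ lexem, tid, [positions] ].
sub index_rows {
    my ($tid, $text) = @_;
    my %positions;
    my $pos = 0;
    for my $word ($text =~ /(\w+)/g) {
        my $lexeme = lc $word;
        $pos++;                          # ASSUMPTION: ignored words still count
        next if $stop{$lexeme};          # filter lexemes marked to be ignored
        push @{ $positions{$lexeme} }, $pos;
    }
    return map { [ $_, $tid, $positions{$_} ] } sort keys %positions;
}

for my $row (index_rows(7, 'The parser reads the text; the text is indexed')) {
    my ($lexem, $tid, $pos) = @$row;
    printf "%-8s tid=%d pos={%s}\n", $lexem, $tid, join(',', @$pos);
}
```

Keeping all positions of a lexeme in one array per document is what later makes proximity ranking possible.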

Search

When a query is received (function search), it is converted into a stream of lexemes, and then the SQL parts are constructed (function get_sql) and combined to form the final SQL query (function _sql). Here's what the generated SQL query for the word "xware" looks like:


    SELECT 
	txt.tid,
        relor( 1.0, 0.01, 0, txt.tid, txt.tid % 10 + 1 , '{"xware"}' ) as pos
    FROM
	txt
    WHERE
	txt.fts_index @@ '\'xware\''
    ORDER BY pos desc

Function relor is used for ranking the search results. The first two arguments (1.0, 0.01) denote the weights for words in the title and in the body, respectively. The third argument (0) is the prefix that denotes which OpenFTS instance to use (see Indexer Configuration). The fourth argument (txt.tid) is the table name together with its primary key, which identifies a document. The fifth argument (txt.tid % 10 + 1) is the number of the indexing table that holds the entries for the document with identifier txt.tid. In the WHERE clause, the expression txt.fts_index @@ '\'xware\'' is the "contains" predicate, which tests whether fts_index (of data type txtidx) contains the keyword xware. Here's a more complicated query for the words hello and xware:


    SELECT 
	txt.tid,
        relkov( 1.0, 0.01, 0, txt.tid,  txt.tid % 10 + 1 , '{"xware", "hello"}' ) as pos
    FROM
	txt
    WHERE
	txt.fts_index @@ '\'xware\' & \'hello\''
    ORDER BY pos desc

One obvious difference is that, now that we are querying for more than one word, relkov is used instead of relor. Also note how the WHERE clause has changed into a conjunction.
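
The fifth argument of the ranking call, txt.tid % 10 + 1, picks one of the indexing tables for each document. A small sketch of that arithmetic, with 10 standing for numbergroup from the configuration example (the function name index_table_number is made up for this illustration):

```perl
use strict;
use warnings;

# Same arithmetic as "txt.tid % 10 + 1" in the generated SQL.
sub index_table_number {
    my ($tid, $numbergroup) = @_;
    return $tid % $numbergroup + 1;
}

for my $tid (1, 9, 10, 42) {
    printf "tid %2d -> index%d\n", $tid, index_table_number($tid, 10);
}
```

Documents are thus spread over tables index1 through index10 according to their identifier.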

NOTICE: Since version 0.34 it is possible to normalize the weight of a document by its size (length). The functions relkov and relor now accept an additional (last) argument:

0 - no normalization (default)
1 - normalized by log(length of document)
2 - normalized by length of document
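
The effect of the three modes can be pictured with the sketch below. The text above only says that the weight is normalized by the length or by its logarithm; dividing the rank by that quantity, and the guard for very short documents, are assumptions of this sketch and not necessarily what relkov and relor do internally.

```perl
use strict;
use warnings;

# ASSUMPTION: "normalized by X" is taken to mean "divided by X".
sub normalize_rank {
    my ($rank, $length, $mode) = @_;
    return $rank if !$mode || $length < 2;       # mode 0, or too short to normalize
    return $rank / log($length) if $mode == 1;   # natural logarithm of the length
    return $rank / $length      if $mode == 2;
    die "unknown normalization mode $mode\n";
}

for my $mode (0, 1, 2) {
    printf "mode %d: short=%.4f long=%.4f\n", $mode,
        normalize_rank(1.0, 100, $mode),
        normalize_rank(1.0, 10_000, $mode);
}
```

With either normalization, a long document needs a higher raw rank than a short one to reach the same final weight.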

Dictionaries

Dictionaries are required to have two methods: lemms and is_stoplexem. In the case of a real dictionary, lemms returns an array of lexemes. is_stoplexem accepts a lexeme and returns true if it is a stop word, otherwise it returns false. For example, is_stoplexem("yahoo") returns 0 since "yahoo" is not a stop word while is_stoplexem("are") returns 1 since "are" is a stop word. An optional method init, if available, can be used to initialize the dictionary.
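
A minimal dictionary that satisfies this interface might look like the sketch below. The package name My::Dict::Toy, the stop-word list, and the crude suffix stripping are hypothetical, and the exact calling conventions OpenFTS expects (for instance, whether lemms returns a list or an array reference) are assumptions here.

```perl
use strict;
use warnings;

package My::Dict::Toy;    # hypothetical name, not part of OpenFTS

sub new {
    my ($class) = @_;
    my %stop = map { $_ => 1 } qw(a an and are is the);
    return bless { stop => \%stop }, $class;
}

# Optional initialization hook; nothing to set up in this toy.
sub init { return 1 }

# Returns 1 for stop words, 0 otherwise.
sub is_stoplexem {
    my ($self, $word) = @_;
    return $self->{stop}{ lc $word } ? 1 : 0;
}

# Returns the lexemes for a word. The suffix stripping is a crude
# stand-in for a real stemmer such as Porter's.
sub lemms {
    my ($self, $word) = @_;
    my $lexeme = lc $word;
    $lexeme =~ s/(?:ing|ed|s)$// if length($lexeme) > 4;
    return ($lexeme);     # ASSUMPTION: a plain list is returned
}

package main;

my $dict = My::Dict::Toy->new;
$dict->init;
for my $word (qw(yahoo are Indexing)) {
    printf "%-9s stop=%d lemms=%s\n", $word,
        $dict->is_stoplexem($word), join(',', $dict->lemms($word));
}
```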

Proximity Ranking

A prominent feature is the ability to rank documents according to the proximity between words of the search query -- this is accomplished by maintaining coordinate information of the lexemes for each indexed document. For example, given the query "full text search", OpenFTS will rank documents that contain the phrase "full text search" higher than documents in which the words "full", "text", "search" occur in different places. The ranking procedure is a C function that utilizes PostgreSQL's SPI and it can be used inside SQL queries. The ranking function uses methodology proposed by Andrew Kovalenko and Nickolay Kharin.

The relevance ranking functions support weights for words found in the title and the body. The default values are W_TITLE = 1 and W_BODY = 0.01. As a general rule, W_TITLE must be much greater than W_BODY. You can change the default values by passing them as parameters to Search::OpenFTS->new.
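
To see why stored positions make proximity ranking possible, consider the toy score below. It is deliberately simplistic and is not the Kovalenko and Kharin method used by OpenFTS: it just divides a weight by the smallest distance between two lexemes. The function names and the sample position lists are made up for this illustration.

```perl
use strict;
use warnings;

# Smallest distance between any position of one lexeme and any of another.
sub min_distance {
    my ($pos_a, $pos_b) = @_;
    my $best;
    for my $p (@$pos_a) {
        for my $q (@$pos_b) {
            my $d = abs($p - $q);
            $best = $d if !defined $best || $d < $best;
        }
    }
    return $best;    # undef when either lexeme is absent
}

# Toy score: the closer the two lexemes, the higher the rank.
sub toy_rank {
    my ($weight, $pos_a, $pos_b) = @_;
    my $d = min_distance($pos_a, $pos_b);
    return 0 unless defined $d;
    $d = 1 if $d < 1;
    return $weight / $d;
}

my %docs = (
    adjacent  => { full => [1], text => [2] },
    scattered => { full => [3], text => [40] },
);
for my $name (sort keys %docs) {
    printf "%-9s %.4f\n", $name,
        toy_rank(1.0, $docs{$name}{full}, $docs{$name}{text});
}
```

The document in which "full" and "text" are adjacent scores higher than the one in which they are far apart, which is the behavior described above.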

Parser

OpenFTS uses a parser that reads a document or a search query and converts it into a stream of lexemes. You can use different parsers for different projects. The parser distributed with OpenFTS recognizes 23 types of lexemes:

Type	ID	Description	Examples
LATWORD	1	latin word	hello
CYRWORD	2	cyrillic word	...
UWORD	3	mixed word	...
EMAIL	4	email address	teodor@sigaev.ru
FURL	5	full URL	http://www.yahoo.com/index.html
HOST	6	host name	...
SCIENTIFIC	7	number in scientific notation	-0.12345e+15
VERSIONNUMBER	8	integer or version number	3 7.1.2
PARTHYPHENWORD	9	part of mixed hyphenated word	...
CYRPARTHYPHENWORD	10	cyrillic part of hyphenated word	...
LATPARTHYPHENWORD	11	latin part of hyphenated word	multi in word multi-key
SPACE	12	symbols	$#%^
SYMTAG	13	HTML tag	<b> <table>
HTTP	14	HTTP	http://
HYPHENWORD	15	mixed hyphenated word	...
LATHYPHENWORD	16	latin hyphenated word	multi-key
CYRHYPHENWORD	17	cyrillic hyphenated word	...
URI	18	Uniform Resource Identifier	/index.html
FILEPATH	19	filename or path	example.txt
DECIMAL	20	number in decimal notation	10.345
SIGNEDINT	21	integer	-4
UNSIGNEDINT	22	unsigned integer	4
HTMLENTITY	23	HTML entity	4

The package Search::OpenFTS::Parser is a wrapper around the parser functions. Here's an example of how it can be used:

use Search::OpenFTS::Parser;

my $parser=Search::OpenFTS::Parser->new();
my $txt = 'Hello world. OpenFTS rules!';

# Hand the text to the parser by reference.
$parser->start_parser( \$txt );

# Fetch lexemes one at a time until the parser returns a false type.
my ($type, $word);
while ((($type, $word)=$parser->get_word) && $type) {
    print $parser->type_description($type), "\t$word\n";
}

$parser->end_parser;

Save the script (above) into a file called parser-test-1.pl and then call it with perl parser-test-1.pl. The output should look like this:

Latin word	Hello
Space symbols	 
Latin word	world
Space symbols	.
Space symbols	 
Latin word	OpenFTS
Space symbols	 
Latin word	rules
Space symbols	!

To get all types that the parser supports, try this:

use Search::OpenFTS::Parser;

my $parser=Search::OpenFTS::Parser->new();
my @types = $parser->alltypes;

# Print the descriptions starting at type ID 1 (index 0 is skipped).
print "$_ => $types[$_]\n" for 1..$#types;

The output should look like this:

1 => Latin word
2 => Cyrillic word
3 => Word
4 => Email
5 => URL
6 => Host
7 => Scientific notation
8 => VERSION
9 => Part of hyphenated word
10 => Cyrillic part of hyphenated word
11 => Latin part of hyphenated word
12 => Space symbols
13 => Char in tag
14 => HTTP head
15 => Hyphenated word
16 => Latin hyphenated word
17 => Cyrillic hyphenated word
18 => URI
19 => File or path name
20 => Decimal notation
21 => Signed integer
22 => Unsigned integer
23 => HTML entity

See simple_parser.pl in the examples directory for an example of a very simple parser, written in Perl, that recognizes space-delimited words of length >= 2.


echo 'Simple parser written in perl a'|perl simple_parser.pl 
Space delimited word (len=>2),word=Simple
Space delimited word (len=>2),word=parser
Space delimited word (len=>2),word=written
Space delimited word (len=>2),word=in
Space delimited word (len=>2),word=perl
0 => Unknown type
1 => Space delimited word (len=>2)

Note that the last word, 'a', is not recognized because it is too short.
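
The rule applied by such a parser fits in a few lines. The following is a stand-alone sketch of the idea, not the simple_parser.pl shipped in the examples directory, and it does not implement the Search::OpenFTS::Parser interface; the function name parse_words is made up.

```perl
use strict;
use warnings;

# Split on whitespace and keep only words of at least two characters.
sub parse_words {
    my ($text) = @_;
    return grep { length($_) >= 2 } split ' ', $text;
}

print "Space delimited word (len=>2),word=$_\n"
    for parse_words('Simple parser written in perl a');
```

For the same input as above, this prints the five recognized words and silently drops 'a'.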

6. API

Search

Search::OpenFTS::Search->search
Accepts a string (in some cases the ranking function and the frequency too) and returns a reference to the list of document identifiers sorted by relevance.

Search::OpenFTS::Search->get_sql
Returns parts of SQL with respect to the search terms, which can be combined into a valid SQL statement. It filters the search terms from type identifiers that were denoted to be ignored during initialization.
For example, for the query above the SQL parts are:
```
(1) relkov( 1.0, 0.01, 0, txt.tid,  txt.tid % 10 + 1 , '{"xware", "hello"}' ) as pos
(2) txt.fts_index @@ '\'xware\' & \'hello\''
(3) pos desc
```

Search::OpenFTS::Search->_sql
Constructs the SQL statement that returns the results for the given search terms. First, it calls get_sql to get the SQL parts and then combines them into the SQL query to be executed.

Search::OpenFTS::Search->get_headline
Returns the headline of a document with the search terms highlighted.
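
The way _sql assembles the parts returned by get_sql can be pictured with the sketch below. It is an illustration only: the function combine_sql and the named-parameter layout are hypothetical, and the real methods may pass the parts around differently. The part values are the ones listed for the "xware" and "hello" query.

```perl
use strict;
use warnings;

# Hypothetical helper that glues the SQL parts into one statement.
sub combine_sql {
    my (%part) = @_;
    return join "\n",
        "SELECT $part{key}, $part{rank} AS pos",
        "FROM $part{table}",
        "WHERE $part{where}",
        "ORDER BY $part{order}";
}

my $sql = combine_sql(
    key   => 'txt.tid',
    table => 'txt',
    rank  => q{relkov( 1.0, 0.01, 0, txt.tid, txt.tid % 10 + 1, '{"xware", "hello"}' )},
    where => q{txt.fts_index @@ '\'xware\' & \'hello\''},
    order => 'pos desc',
);
print "$sql\n";
```

Keeping the parts separate lets an application add its own conditions, such as access control checks, before the final statement is built.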

Index

Search::OpenFTS::Index->init
It is used only once, at creation of a new index, to initialize an OpenFTS instance. During initialization, the configuration and indexing tables are created. The configuration table includes information (parameters of init) about the available dictionaries, the parser, mapping of lexemes to dictionaries, and types of lexemes to be ignored in indexing or headlines.

Search::OpenFTS::Index->index
Used for indexing text. It accepts a string or a file handle and updates the indexing tables with the new entry. A parser reads the referenced text and converts it into a stream of lexemes. These lexemes are then stored in the indexing tables.

Search::OpenFTS::Index->delete
It accepts a document identifier and deletes all related entries in the indexing tables.

Search::OpenFTS::Index->drop
Removes all OpenFTS tables, indices, dictionaries (if dictionary provides a 'drop' method).

Search::OpenFTS::Index->drop_index
Removes all OpenFTS indices from index tables (INDEX1, ..., INDEXN) and the GiST index on the base table (where the documents are stored together with their primary key).

Search::OpenFTS::Index->create_index
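Creates the OpenFTS indices on the index tables (INDEX1, ..., INDEXN) and the GiST index on the base table; the counterpart of drop_index.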

Search::OpenFTS::Index->start_index
Opens a session for indexing. This prepares the OpenFTS instance to accept text for indexing.

Search::OpenFTS::Index->index_chunk
Adds a part to an index.

Search::OpenFTS::Index->flush
Writes (flushes) the index data to the database.

Parser

Search::OpenFTS::Parser->start_parser
Accepts a string and causes the parser to scan the text. The text is converted into a stream of lexemes.

Search::OpenFTS::Parser->end_parser
Stops parsing and frees all used memory.

Search::OpenFTS::Parser->get_word
Returns the next lexeme together with its type identifier.

Search::OpenFTS::Parser->type_description
Accepts a type identifier and returns a short description of the type.

Search::OpenFTS::Parser->alltypes
Returns descriptions of all types that are supported by the parser.

7. Examples

This section explains how you can run the examples that come with the OpenFTS distribution.

Create a database, for example testfts:

$ createdb testfts

Change to the directory where tsearch is (a PostgreSQL module available under the contrib directory) and load it into the database:

$ cd PGSQL_SRC_HOME/contrib/tsearch

$ psql testfts < tsearch.sql

$ cd PGSQL_SRC_HOME/contrib/pgsql_contrib_openfts/

$ psql testfts < openfts.sql

$ cd /usr/local/src/Search-OpenFTS-0.34/examples

$ ./init.pl testfts

$ ./index.pl testfts file1 [file2 [...] ]

$ ./search.pl -p testfts word1 [word2 [...] ]

8. Interesting Papers

"THE RD-TREE: AN INDEX STRUCTURE FOR SETS", Joseph M. Hellerstein
"Generalized Search Trees for Database Systems", 1995, Joseph M. Hellerstein, Jeffrey F. Naughton, Avi Pfeffer
"R-TREES: A dynamic index structure for spatial searching", A. Guttman
"Index Structures for Databases Containing Data Items with Set-valued Attributes", 1997, Sven Helmer
"On the Analysis of Indexing Schemes", 1997, Joseph M. Hellerstein, Elias Koutsoupias, Christos H. Papadimitriou
"Implementation of Extended Indexes in POSTGRES", 1991, Paul M. Aoki
"Generalizing Database Access Methods", 1999, Ming Zhou
"High-Concurrency Locking in R-Trees", 1995, Marcel Kornacker
"High-Performance Extensible Indexing", 1999, Marcel Kornacker
"Generalizing ''Search'' in Generalized Search Trees", 1997, Paul M. Aoki
"Efficient Concurrency Control in Multidimensional Access Methods", 1999, Kaushik Chakrabarti, Sharad Mehrotra
"Indexing for String Queries using Generalized Search Trees", 199?, Jeff Foster, Megan Thomas

9. Related Links

PostgreSQL -- The official PostgreSQL web site.

GiST's support in PostgreSQL -- Information and patches for the GiST implementation in PostgreSQL.

The GiST Indexing Project -- The GiST project at the University of California, Berkeley. Studies the engineering and mathematics behind content-based indexing for massive amounts of complex content.

Omseek -- A library that enables developers to use advanced information retrieval techniques in their own search engine projects. Omseek includes implementations of Porter's Stemming Algorithm in many languages including Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Spanish, and Swedish.

Snowball -- A language in which stemmers can be exactly defined, and from which fast stemmer programs in ANSI C or Java can be generated. A range of stemmers is presented in parallel algorithmic and Snowball form, including the original Porter stemmer for English.