Library

Indices (InterBase and Firebird)

Alexey Kovyazin, last update 07-Sep-2005
The concept assumed as the basis of indices is simple and visual and is one of the most important bases of databases design. On the basis of indices many database basic objects are grounded, and furthermore the correct use of indices is key to an improvement of the productivity of databases applications. However, what is index? Index is an ordered pointer of the records in the table. Pointer means that the index contains values of one or several fields in the table and the addresses of data pages where these values are located (for details about data pages, see the chapter "InterBase database structure") (part 4). In other words, index consists of pairs of values "field value" – "physical location of this field".

Thus, by value of the field (or fields), included in the index, using index we can find quickly that place in the table where the record containing this value is allocated. Ordered means that values of the fields stored in the index are ordered. Very often index is compared with a library catalogue, in which all the books are recorded to cards and are ordered in some way: according to the alphabet or themes, and in every card contains the information where exactly the given book is allocated in storage.

Why do we need indices?

The only thing indices promote is speeding up of record retrieval by its indexed field (indexed - means included in the index). The main function of indices is to provide fast record retrieval in the table. Any index using comes to this.

How this retrieval function is realized? At the input of this function we have the value of indexed field (or several fields). As a result of retrieval we should receive the whole record, in which the indexed field has a preset value. First in the index (to put it more precisely, in the ordered array of values of the indexed field) the required value is searched, then the address of data page is taken where the required record is located, the server goes to this page and reads the found record. It looks rather inconvenient, however search, using index, is many times faster, than sequential enumeration of all values from the table.

If we continue to make the analogy between index and library catalogue, we will see that the record retrieval, using index, is very similar to the book search, using card. When we find a book in a rather small catalogue (in comparison with the whole library storage), we immediately receive the information about where exactly the book is stored and we can go straight there. Search without using index can be compared with sequential enumeration of all the books in the library!

Enumeration of all records in the table is called direct or natural. We should say that despite of powers of modern computers natural enumeration may be very long if the table contains a great number of records.

How are they organized?

Index is not a part of the table, it is a separate object connected with the table and other database objects. This is a very important point of DBMS implementation allowing to separate information storage from its representation.

InterBase as any other relational database stores records in tables in the unordered way, i.e. does not care at all how records are physically allocated in the table. Unordered storing means that two records added to the table one by one may not be next to each other. Moreover, the data extracted from the table also have no order apart from the one that should be explicitly specified by the user making a retrieval query.

However, we cannot go without ordering the stored data: end-users of applications want to see the data in the defined order - for example, people's last names according to the alphabet. Indices solve the problem of data representation in the ordered way. Field values included in the index are ordered and represented in the special view, optimized for searching the required values (namely, this is essential for creating ordered sequences).

Separating data storage from their representation gives additional benefits in comparison with direct sorting – maybe you will need to sort the initial table in different ways. Then indices will help you – there can be up to 64 indices for every table!

If we speak about implementation of indices at the physical level, they represent a binary tree which nodes represent pairs "field value in index " – "data allocation in the table ". Retrieval of the required record in index is performed using the mechanism of hash-search - one of the fastest search algorithms.

Index application

Now, when it is clear what we can demand from indices, it is time to know about their function in a database. Indices are used in three main cases:
1. Speeding up of inquiry execution. Indices are created for the fields used under the conditions of search of SQL-inquiries.

2. Support of uniqueness of values in fields; a primary key constraint (about which it was told in the chapter "Tables. Primary keys ") demands that in the table there will be no two identical values of the fields included in a primary key. In order to meet this condition, when inserting a new record you should search for the same value that will be inserted. For record retrieval the special variety of index is used - a unique index (see below).

3. Support of reference integrity. Foreign keys constraints (which are considered in the chapter "Database constraints") are used to check that the values inserted into the table necessarily exist in other table. When creating a foreign key index is automatically created. This index is applied for speeding up the inquiries using join of tables, as well as for checking the conditions of the foreign key. We have briefly covered all possible index applications. Now we will consider the peculiarities of every case in more detail and we will answer most frequently arising questions concerning index application.

Speeding up the execution of queries using indices

It is described above that application of indices can greatly speed up the execution of queries. It is really so for the most cases, but there are certain stipulations. First, we will answer the question frequently arising among those who have become familiar with indices. If indices quicken retrieval from a database, why would not we index all the fields in the table? There are two moments, blocking general indexing, - a disk space and costs when modifying the data in the table. Every created index has a size equal to a data size in the indexed field, plus data size of records allocation. If we create indices for every field in the table, their total size will be more than the data size in the table! Therefore, creation of a large number of indexes leads to a huge expenditure of a disk space.

The second moment is more important. These are outlays when modifying the data in the table. In a relational DBMS, as you know, records in tables are unordered and consequently adding/ deleting of records go without significant outlays of resources of a server. Even if a record is deleted from the middle of a database, there is no moving of data sizes to fill this emptiness, – it is not required: the server will simply mark the empty place and will write something there when necessary. As to addition, in most cases it is executed in the end of the table. However, though a server does not move the main data in the table when modifying, the data stored in indexes, is reordered every time when adding/ deleting the records! In other words, the server has to rebuild the index when adding a record to the middle of the table. Certainly, index implementation somehow is intended for frequent reorganizations, but these operations nevertheless take time and resources of the processor and when there is a large number of indexes in the table, data modification within it may be much slower than in the same table without indexes!

These are two main reasons, which interfere with general indexing. Besides them, there are some more remarks restricting index application. The first is a rule of 20 %. It says that if the retrieval query returns more than 20 % of records from the table, index using can slow down data retrieval! Certainly, the situation depends on a concrete query and the conditions set to the retrieval on, but we should remember that 20 % of records are a threshold when efficiency of using indices becomes doubtful. The second remark is not formulated so clear. It is connected with the work of InterBase optimizer.

The optimizer is a collection of mechanisms, which develop the schedule of executing the query. When the user gives any SQL query to InterBase, he specifies what server should return after executing the inquiry, but does not define, HOW server should fulfill the inquiry. The optimizer on the basis of the given query creates the schedule of its execution, i.e. from where and in what order the data for executing the query will be taken, what indices will be used at that. When the server analyzes the retrieval conditions (these are mainly parts of expression WHERE, ORDER BY, etc.) for every field included in the condition, the server tries to use index. Unfortunately, the algorithm of creating the schedule is incomplete and the optimizer frequently uses indexes that are not too effective for the concrete query because of what execution time can be slowed down essentially. Therefore, creation of unnecessary indices can lead to creation of nonoptimal schedules.

It should be marked that in Yaffil clone this problem is solved due to using modern algorithms of making schedules. The third case when index is not necessary are fields with the limited set of values - for example, the field storing the information about the sex of the person and contains only two possible values - "F" and "M"; it is no point in indexing this field. So, we have considered main constraints creating indices. Now we should cover the problem, when it is necessary to use indices to achieve improvement of productivity. There are 3 main cases when a field has to be indexed:

When this field is used under the conditions of retrieval in queries
When joins of tables use this field
When this field is used in the statement of sorting ORDER BY If the field is applied in the way mentioned above, creating the index for it can lead to improvement of query productivity.

Let's consider a syntax of creating indexes. Here is a complete format of DDL command that allows to create indexes:

CREATE [UNIQUE] [ASC[ENDING] | DESC[ENDING]] INDEX index ON table (col [, col ...]);

The minimum expression creating the index is the following:

CREATE INDEX my_index ON Table_example(ID)

In this example index with name my_index is created for table Table_example, and ID field is the indexed field. The index is ascending, i.e. values in it are ordered by ascension, as well as non-unique, and it means that ID field can have several identical values. It is certainly the simplest example of index - the most common. As we can see from the description of syntax, index may contain not one, but a few fields. Such index is used when queries are frequently fulfilled and contain a combination of indexed fields under the conditions of search or sorting. For example, if we have a table containing fields the Surname, the Name, the Patronymic, such index will be applied when making the query that uses sorting by Surname, Name, and Patronymic. In general, it is not necessary to specify the conditions for all 3 fields applied in index to use its advantages. If we want to sort the result of the query, the index will be used in case the first field in a condition of sorting coincides with the first field in the index. For example, our index will be applied in case of sorting by Surname and Name.

According to documentation for optimization of query execution containing in statement WHERE a join of fields with OR condition, we should use not the aggregate index, but a few single ones for all fields included in OR condition.

As to the question of index sort order, it can be either ascending or descending. Why do we need different sort orders? Obviously, for different sortings! If we wish to sort people by the surname in ascending order, we create the ascending index (ASC), and if in descending (from Z to A) – then descending! If we want both, we have to create both indices.

Support of reference integrity using indices

There is one more option in index definition - UNIQUE. If we specify it, the index will allow to insert only unique values into the table. Actually, it is a basis for implementation of unique keys. Unique keys are widely used in databases. That is РК is a unique key-index, but not every UK is РК. We spoke only about РК above. A primary key is the most commonly used type of a unique key. When creating a primary key for the table a unique index is automatically created. It is given a name composed of RDB$PRIMARYNNN, where NNN is a sequential unique number within a database. Thus, two major constraints of reference integrity– a unique key and a primary key are realized due to using a unique index. It is obvious, that the notion of uniqueness is incompatible with the notion of undefined value. In other words, there should not be any values of NULL type in the fields contained in unique indices. Before creating a unique index for a field, it is necessary to set NOT NULL constraint. If the index is created for the data that already exists, then when creating, the indexed field will be checked for containing any repetitive values. If contains, you will be prohibited to create index.

Apart from unique and primary key constraints, the mechanism of indices underlies the implementation of one more constraint of reference integrity - a foreign key. The foreign key constraint is set for one or several fields of any table and prevents from inserting the values into these fields, which are not included in the primary key of the other, parent table. For implementing the foreign key, i.e. for performing the check of whether there is a value in the parent table, a special index is automatically created. Its name is RDB$FOREIGNNN, where NNN is a sequential unique number within a database.

Why the mechanism of indices is used for implementing the constraints of reference integrity? The matter is that indices in InterBase are in special, preferred position – it is said that they are executed out of the context of transactions. This is a very important property. We will speak about transactions later, in the chapter devoted to them. Now we will only mention that when indices are out of transactions, it means all the users working simultaneously with the data in the same table have to observe the constraints of reference integrity.

Optimization of index productivity

In the title of this part we can find out some paradox – indices, as it was said above, are to speed up execution of queries, and it turns out, that they should be optimized too! But what to do (such is life) – someone has to take care of indices. What happens with indices? Why do they "lose shape"? We will have to say once again that indices are realized as a binary tree. And when a new record is added (updated, deleted – as you like) to the table, a new branch is added to the tree. These branches are added not to the middle of the tree, but to the tops of other branches. Gradually the tree becomes more and more branchy (or unbalanced), and search – less effective. Rebuilding of the tree or (in some cases) recalculation of statistics can improve the situation.

Periodically it is required to recreate the index to restore its productivity. Index recreation happens in the following cases:

When rebuilding the index using ALTER INDEX command.
When deleting and recreating the index using DROP INDEX and CREATE INDEX commands.
When backing up and restoring from a backup copy using gbak tool.

Also you can use recalculation of statistics. But must understand that this operation does not change the index state, it just informs the optimizer about the precise information on its state allowing to use this index correctly. In other words, recalculation of statistics is not "cure" of the index, but only the accurate diagnostics of its state. Let’s consider all these ways of index optimization in more detail. Usage of ALTER INDEX command has the following format:

ALTER INDEX name {ACTIVE | INACTIVE};

Here name is the name of index, and ACTIVE and INACTIVE - two states of the index it can be converted to using ALTER INDEX command. Parameter ACTIVE means that the index is active and can be applied in all inquiries and procedures. If you set the index to INACTIVE, it will result in disconnecting its usage. For rearranging the tree two commands should be sequentially executed:

ALTER INDEX name INACTIVE; ALTER INDEX name ACTIVE;

Thus, the index will be rebuilt. Usage of ALTER INDEX has a number of constraints: you cannot rebuild the indices used in primary, unique and foreign keys; you cannot rebuild the index if it is being used by any query at the present moment; and also for altering the index it is necessary to have the rights of administrator (SYSDBA) or to be the creator of the given index.

Recreation of the index using DROP INDEX and CREATE INDEX commands leads to a complete deleting of the index from a database, and then to its creation from a bland print. Syntax of DROP INDEX command is obvious:

DROP INDEX имя_индекса;

After deleting it is necessary to create the index with the same name and parameters using CREATE INDEX command which syntax we have already considered. The way of rebuilding the index by its complete recreation has the constraints similar to the constraints for using ALTER INDEX.

The third way of rebuilding the index is based on the property of backup copies of InterBase databases created by gbak utility. The matter is that, when backing up, the data included in the index is not saved in a backup copy, only index definition is stored. When restoring from a backup copy the index is recreated. If you want to know about the backup in more detail, see the chapter "Backing up and restoring from a backup copy" (part 4).

The fourth way to improve index productivity is to collect statistics on indices using SET STATISTICS command. Table statistics is a value within the range from 0 to 1 which value depends on the number of different records in the table. InterBase optimizer uses statistics for defining the efficiency of application of this or that index in query. When the number of records in the table can alter profoundly (for example, because of a large number of inserts or removals), then recalculation of statistics can considerably improve the productivity. The command of recalculation of statistics following:

SET STATISTICS INDEX name;

Here name is a name of index for which statistics is recalculated. Recalculation of statistics does not rebuild the index and that is why it is free from most of constraints set for the described above ways of improving the productivity except that only the creator of the index or the system administrator (the user with name SYSDBA) can recalculate statistics. Correct statistics enables the optimizer to make true decision about using any index or not.

We have considered a few ways of improving the productivity of indices. Using ALTER INDEX and DROP/CREATE INDEX commands, we can rebuild any indices except for the system indices created automatically, meant for supplying reference integrity. If you want to rebuild these indices, you should use the commands of altering and creating tables - ALTER TABLE and CREATE TABLE as these indices are an integral part of tabular keys.