Firebird 5.0.1 Improvements in Optimizer

A semi-join is an operation that joins two relations, returning rows from only one of the relations without performing the entire join. Unlike other join operators, there is no explicit syntax for specifying whether to perform a semi-join. However, you can perform a semi-join using subqueries in ANY/SOME/IN/EXISTS.

Traditionally, Firebird transforms subqueries in ANY/SOME/IN predicates into correlated subqueries in the EXISTS predicate, and executes the subquery in EXISTS for each record of the outer query. When executing a subquery inside an EXISTS predicate, the FIRST ROWS strategy is used, and its execution stops immediately after the first record is returned.

Starting with Firebird 5.0.1, subqueries in ANY/SOME/IN/EXISTS predicates can be converted to semi-joins. This feature is disabled by default, and can be enabled by setting the SubQueryConversion configuration parameter to true in the firebird.conf or database.conf file.

This feature is experimental, so it is disabled by default. You can enable it and test your queries with subqueries in ANY/SOME/IN/EXISTS predicates, and if the performance is better, leave it enabled, otherwise set the SubQueryConversion parameter back to the default (false).

The default value for the SubQueryConversion configuration parameter may be changed in the future, or the parameter may be removed altogether. This will happen once the new way of doing things is proven to be more optimal in most cases.

Unlike performing ANY/SOME/IN/EXISTS on subqueries directly, i.e. as correlated subqueries, performing them as semi-joins gives more room for optimization. Semi-joins can be performed by various Hash Join (semi) or Nested Loop Join (semi) algorithms, while correlated subqueries are always performed for each record of the outer query.

Let’s try to enable this feature by setting the SubQueryConversion parameter to true in the firebird.conf file. Now let’s do some experiments.

Let’s execute the following query:

SELECT
  COUNT(*)
FROM
  HORSE H
WHERE H.CODE_DEPARTURE = 1
  AND H.CODE_SEX = 2
  AND H.CODE_HORSE IN (
    SELECT COVER.CODE_FATHER
    FROM COVER
    WHERE COVER.CODE_DEPARTURE = 1
      AND EXTRACT(YEAR FROM COVER.BYDATE) = 2023
  )

Select Expression
    -> Aggregate
        -> Filter
            -> Hash Join (semi)
                -> Filter
                    -> Table "HORSE" as "H" Access By ID
                        -> Bitmap And
                            -> Bitmap
                                -> Index "FK_HORSE_DEPARTURE" Range Scan (full match)
                            -> Bitmap
                                -> Index "FK_HORSE_SEX" Range Scan (full match)
                -> Record Buffer (record length: 41)
                    -> Filter
                        -> Table "COVER" Access By ID
                            -> Bitmap And
                                -> Bitmap
                                    -> Index "IDX_COVER_BYYEAR" Range Scan (full match)
                                -> Bitmap
                                    -> Index "FK_COVER_DEPARTURE" Range Scan (full match)

                COUNT
=====================
                  297

Current memory = 552356752
Delta memory = 352
Max memory = 552567920
Elapsed time = 0.045 sec
Buffers = 32768
Reads = 0
Writes = 0
Fetches = 43984
Per table statistics:
--------------------------------+---------+---------+---------+---------+---------+
 Table name                     | Natural | Index   | Insert  | Update  | Delete  |
--------------------------------+---------+---------+---------+---------+---------+
COVER                           |         |     1516|         |         |         |
HORSE                           |         |    37069|         |         |         |
--------------------------------+---------+---------+---------+---------+---------+

In the execution plan we see a new join method Hash Join (semi). The result of the subquery in IN was buffered, which is visible in the plan as Record Buffer (record length: 41). That is, in this case the subquery in IN was executed once, its result was saved in the hash table memory, and then the outer query simply searched in this hash table.

For comparison, let’s run the same query with subquery-to-semi-join conversion disabled.

Sub-query
    -> Filter
        -> Filter
            -> Table "COVER" Access By ID
                -> Bitmap And
                    -> Bitmap
                        -> Index "FK_COVER_FATHER" Range Scan (full match)
                    -> Bitmap
                        -> Index "IDX_COVER_BYYEAR" Range Scan (full match)
Select Expression
    -> Aggregate
        -> Filter
            -> Table "HORSE" as "H" Access By ID
                -> Bitmap And
                    -> Bitmap
                        -> Index "FK_HORSE_DEPARTURE" Range Scan (full match)
                    -> Bitmap
                        -> Index "FK_HORSE_SEX" Range Scan (full match)

                COUNT
=====================
                  297

Current memory = 552046496
Delta memory = 352
Max memory = 552135600
Elapsed time = 0.395 sec
Buffers = 32768
Reads = 0
Writes = 0
Fetches = 186891
Per table statistics:
--------------------------------+---------+---------+---------+---------+---------+
 Table name                     | Natural | Index   | Insert  | Update  | Delete  |
--------------------------------+---------+---------+---------+---------+---------+
COVER                           |         |      297|         |         |         |
HORSE                           |         |    37069|         |         |         |
--------------------------------+---------+---------+---------+---------+---------+

The execution plan shows that the subquery is executed for each record of the main query, but uses an additional index FK_COVER_FATHER. This is also visible in the execution statistics: the number of Fetches is 4 times greater, the execution time is almost 4 times worse.

The reader may ask: why does hash semi-join show 5 times more index reads of the COVER table, but otherwise it is better? The fact is that index reads in statistics show the number of records read using the index, they do not show the total number of index accesses, some of which do not result in retrieving records at all, but these accesses are not free.

What happened? To better understand the transformation of subqueries, let’s introduce an imaginary semi-join operator "SEMI JOIN". As I already said, this type of join is not represented in the SQL language. Our query with the IN operator was transformed into an equivalent form, which can be written as follows:

SELECT
  COUNT(*)
FROM
  HORSE H
  SEMI JOIN (
    SELECT COVER.CODE_FATHER
    FROM COVER
    WHERE COVER.CODE_DEPARTURE = 1
      AND EXTRACT(YEAR FROM COVER.BYDATE) = 2023
  ) TMP ON TMP.CODE_FATHER = H.CODE_HORSE
WHERE H.CODE_DEPARTURE = 1
  AND H.CODE_SEX = 2

Now it’s clearer. The same thing happens for subqueries using EXISTS. Let’s look at another example:

SELECT
  COUNT(*)
FROM
  HORSE H
WHERE H.CODE_DEPARTURE = 1
  AND EXISTS (
    SELECT *
    FROM COVER
    WHERE COVER.CODE_DEPARTURE = 1
      AND COVER.CODE_FATHER = H.CODE_FATHER
      AND COVER.CODE_MOTHER = H.CODE_MOTHER
  )

Currently, it is not possible to write such an EXISTS using IN. Let’s see how it is implemented without transforming it into a semi-join.

Sub-query
    -> Filter
        -> Table "COVER" Access By ID
            -> Bitmap And
                -> Bitmap
                    -> Index "FK_COVER_MOTHER" Range Scan (full match)
                -> Bitmap
                    -> Index "FK_COVER_FATHER" Range Scan (full match)
Select Expression
    -> Aggregate
        -> Filter
            -> Table "HORSE" as "H" Access By ID
                -> Bitmap
                    -> Index "FK_HORSE_DEPARTURE" Range Scan (full match)

                COUNT
=====================
                91908

Current memory = 552240400
Delta memory = 352
Max memory = 554680016
Elapsed time = 19.083 sec
Buffers = 32768
Reads = 0
Writes = 0
Fetches = 935679
Per table statistics:
--------------------------------+---------+---------+---------+---------+---------+
 Table name                     | Natural | Index   | Insert  | Update  | Delete  |
--------------------------------+---------+---------+---------+---------+---------+
COVER                           |         |    91908|         |         |         |
HORSE                           |         |    96021|         |         |         |
--------------------------------+---------+---------+---------+---------+---------+

Very slow. Now let’s set SubQueryConversion = true and run the query again.

Select Expression
    -> Aggregate
        -> Filter
            -> Hash Join (semi)
                -> Filter
                    -> Table "HORSE" as "H" Access By ID
                        -> Bitmap
                            -> Index "FK_HORSE_DEPARTURE" Range Scan (full match)
                -> Record Buffer (record length: 49)
                    -> Filter
                        -> Table "COVER" Access By ID
                            -> Bitmap
                                -> Index "FK_COVER_DEPARTURE" Range Scan (full match)

                COUNT
=====================
                91908

Current memory = 552102000
Delta memory = 352
Max memory = 561520736
Elapsed time = 0.208 sec
Buffers = 32768
Reads = 0
Writes = 0
Fetches = 248009
Per table statistics:
--------------------------------+---------+---------+---------+---------+---------+
 Table name                     | Natural | Index   | Insert  | Update  | Delete  |
--------------------------------+---------+---------+---------+---------+---------+
COVER                           |         |   140254|         |         |         |
HORSE                           |         |    96021|         |         |         |
--------------------------------+---------+---------+---------+---------+---------+

The query was executed 100 times faster! If we rewrite it using our fictitious SEMI JOIN operator, the query will look like this:

SELECT
  COUNT(*)
FROM
  HORSE H
  SEMI JOIN (
    SELECT
      COVER.CODE_FATHER,
      COVER.CODE_MOTHER
    FROM COVER
  ) TMP ON TMP.CODE_FATHER = H.CODE_FATHER AND TMP.CODE_MOTHER = H.CODE_MOTHER
WHERE H.CODE_DEPARTURE = 1

Can any correlated subquery in IN/EXISTS be converted to a semi-join? No, not any, for example, if the subquery contains FETCH/FIRST/SKIP/ROWS filters, then the subquery cannot be converted to a semi-join and it will be executed as a correlated subquery. Here is an example of such a query:

SELECT
  COUNT(*)
FROM
  HORSE H
WHERE H.CODE_DEPARTURE = 1
  AND EXISTS (
    SELECT *
    FROM COVER
    WHERE COVER.CODE_FATHER = H.CODE_HORSE
    OFFSET 0 ROWS
  )

Here the phrase OFFSET 0 ROWS does not change the semantics of the query, and the result of its execution will be the same as without it. Let’s look at the plan and statistics of this query.

Sub-query
    -> Skip N Records
        -> Filter
            -> Table "COVER" Access By ID
                -> Bitmap
                    -> Index "FK_COVER_FATHER" Range Scan (full match)
Select Expression
    -> Aggregate
        -> Filter
            -> Table "HORSE" as "H" Access By ID
                -> Bitmap
                    -> Index "FK_HORSE_DEPARTURE" Range Scan (full match)

                COUNT
=====================
                10971

Current memory = 551912944
Delta memory = 288
Max memory = 552002112
Elapsed time = 0.201 sec
Buffers = 32768
Reads = 0
Writes = 0
Fetches = 408988
Per table statistics:
--------------------------------+---------+---------+---------+---------+---------+
 Table name                     | Natural | Index   | Insert  | Update  | Delete  |
--------------------------------+---------+---------+---------+---------+---------+
COVER                           |         |    10971|         |         |         |
HORSE                           |         |    96021|         |         |         |
--------------------------------+---------+---------+---------+---------+---------+

As you can see, the transformation to a semi-join did not occur. Now let’s remove OFFSET 0 ROWS and take statistics again.

Select Expression
    -> Aggregate
        -> Filter
            -> Hash Join (semi)
                -> Filter
                    -> Table "HORSE" as "H" Access By ID
                        -> Bitmap
                            -> Index "FK_HORSE_DEPARTURE" Range Scan (full match)
                -> Record Buffer (record length: 33)
                    -> Table "COVER" Full Scan

                COUNT
=====================
                10971

Current memory = 552112128
Delta memory = 288
Max memory = 585044592
Elapsed time = 0.405 sec
Buffers = 32768
Reads = 0
Writes = 0
Fetches = 854841
Per table statistics:
--------------------------------+---------+---------+---------+---------+---------+
 Table name                     | Natural | Index   | Insert  | Update  | Delete  |
--------------------------------+---------+---------+---------+---------+---------+
COVER                           |   722465|         |         |         |         |
HORSE                           |         |    96021|         |         |         |
--------------------------------+---------+---------+---------+---------+---------+

Here the conversion to semi-join has happened, and as we can see the execution time has become worse. The reason is that currently the optimizer does not have a cost estimate between the Hash Join (semi) and Nested Loop Join (semi) join algorithms using an index, so the rule is: if the join condition contains only equality, then the Hash Join (semi) algorithm is chosen, otherwise the IN/EXISTS subqueries are executed as usual.

Now let’s disable the semi-join conversion and look at the execution statistics.

Sub-query
    -> Filter
        -> Table "COVER" Access By ID
            -> Bitmap
                -> Index "FK_COVER_FATHER" Range Scan (full match)
Select Expression
    -> Aggregate
        -> Filter
            -> Table "HORSE" as "H" Access By ID
                -> Bitmap
                    -> Index "FK_HORSE_DEPARTURE" Range Scan (full match)

                COUNT
=====================
                10971

Current memory = 551912752
Delta memory = 288
Max memory = 552001920
Elapsed time = 0.193 sec
Buffers = 32768
Reads = 0
Writes = 0
Fetches = 408988
Per table statistics:
--------------------------------+---------+---------+---------+---------+---------+
 Table name                     | Natural | Index   | Insert  | Update  | Delete  |
--------------------------------+---------+---------+---------+---------+---------+
COVER                           |         |    10971|         |         |         |
HORSE                           |         |    96021|         |         |         |
--------------------------------+---------+---------+---------+---------+---------+

As you can see, Fetches is exactly equal to the case when the subquery contained the clause OFFSET 0 ROWS, and the execution time differs within the margin of error. This means that you can use the clause OFFSET 0 ROWS as a hint to disable the semi-join conversion.

Now let’s look at cases where any correlated condition other than equality and IS NOT DISTINCT FROM is used in subqueries.

SELECT
  COUNT(*)
FROM
  HORSE H
WHERE H.CODE_DEPARTURE = 1
  AND EXISTS (
    SELECT *
    FROM COVER
    WHERE COVER.BYDATE > H.BIRTHDAY
  )

Sub-query
    -> Filter
        -> Table "COVER" Access By ID
            -> Bitmap
                -> Index "COVER_IDX_BYDATE" Range Scan (lower bound: 1/1)
Select Expression
    -> Aggregate
        -> Filter
            -> Table "HORSE" as "H" Access By ID
                -> Bitmap
                    -> Index "FK_HORSE_DEPARTURE" Range Scan (full match)

As I said above, no transformation to a semi-join occurred, the subquery is executed for each record of the main query.

Let’s continue the experiments, write a query using equality and one more predicate except equality.

SELECT
  COUNT(*)
FROM
  HORSE H
WHERE H.CODE_DEPARTURE = 1
  AND EXISTS (
    SELECT *
    FROM COVER
    WHERE COVER.CODE_FATHER = H.CODE_FATHER
      AND COVER.BYDATE > H.BIRTHDAY
  )

Select Expression
    -> Aggregate
        -> Nested Loop Join (semi)
            -> Filter
                -> Table "HORSE" as "H" Access By ID
                    -> Bitmap
                        -> Index "FK_HORSE_DEPARTURE" Range Scan (full match)
            -> Filter
                -> Filter
                    -> Table "COVER" Access By ID
                        -> Bitmap
                            -> Index "COVER_IDX_BYDATE" Range Scan (lower bound: 1/1)

Here in the plan we see the first use of the Nested Loop Join (semi) join method, but unfortunately this plan is bad, because the FK_COVER_FATHER index is not used. You will not get any results from such a query. This can be fixed using the OFFSET 0 ROWS hint.

SELECT
  COUNT(*)
FROM
  HORSE H
WHERE H.CODE_DEPARTURE = 1
  AND EXISTS (
    SELECT *
    FROM COVER
    WHERE COVER.CODE_FATHER = H.CODE_FATHER
      AND COVER.BYDATE > H.BIRTHDAY
    OFFSET 0 ROWS
  )

Sub-query
    -> Skip N Records
        -> Filter
            -> Table "COVER" Access By ID
                -> Bitmap
                    -> Index "FK_COVER_FATHER" Range Scan (full match)
Select Expression
    -> Aggregate
        -> Filter
            -> Table "HORSE" as "H" Access By ID
                -> Bitmap
                    -> Index "FK_HORSE_DEPARTURE" Range Scan (full match)

                COUNT
=====================
                72199

Current memory = 554017824
Delta memory = 320
Max memory = 554284480
Elapsed time = 45.548 sec
Buffers = 32768
Reads = 0
Writes = 0
Fetches = 84145713
Per table statistics:
--------------------------------+---------+---------+---------+---------+---------+
 Table name                     | Natural | Index   | Insert  | Update  | Delete  |
--------------------------------+---------+---------+---------+---------+---------+
COVER                           |         | 75894621|         |         |         |
HORSE                           |         |    96021|         |         |         |
--------------------------------+---------+---------+---------+---------+---------+

Not the best execution time, but in this case we at least got the result.

Thus, converting subqueries to ANY/SOME/IN/EXISTS into semi-join allows in some cases to significantly speed up query execution, but at present this feature is still imperfect and therefore disabled by default. In Firebird 6.0, they will try to add cost estimation for this feature, as well as fix a number of other shortcomings. In addition, Firebird 6.0 plans to add conversion of subqueries ALL/NOT IN/NOT EXISTS into anti-join.

In conclusion of the review of the execution of subqueries in IN/EXISTS, I would like to note that if you have a query of the form

SELECT ...
FROM T1
WHERE  IN (SELECT field FROM T2 ...)

SELECT ...
FROM T1
WHERE EXISTS (SELECT ... FROM T2 WHERE T1. = T2.field)

then such queries are almost always more efficient to execute as

SELECT ...
FROM
  T1
  JOIN (SELECT DISTINCT field FROM T2) tmp ON tmp.field = T1.

Let me give you a clear example:

SELECT
  COUNT(*)
FROM
  HORSE H
WHERE H.CODE_HORSE IN (
  SELECT
    CODE_FATHER
  FROM COVER
  WHERE EXTRACT(YEAR FROM COVER.BYDATE) = 2022
)

Execution plan and statistics using Hash Join (semi)

Select Expression
    -> Aggregate
        -> Filter
            -> Hash Join (semi)
                -> Table "HORSE" as "H" Full Scan
                -> Record Buffer (record length: 41)
                    -> Filter
                        -> Table "COVER" Access By ID
                            -> Bitmap
                                -> Index "IDX_COVER_BYYEAR" Range Scan (full match)

                COUNT
=====================
                 1616

Current memory = 554176768
Delta memory = 288
Max memory = 555531328
Elapsed time = 0.229 sec
Buffers = 32768
Reads = 0
Writes = 0
Fetches = 569683
Per table statistics:
--------------------------------+---------+---------+---------+---------+---------+
 Table name                     | Natural | Index   | Insert  | Update  | Delete  |
--------------------------------+---------+---------+---------+---------+---------+
COVER                           |         |     6695|         |         |         |
HORSE                           |   525875|         |         |         |         |
--------------------------------+---------+---------+---------+---------+---------+

Quite fast, but the HORSE table is read in full.

Execution plan and statistics with classic subquery execution

Sub-query
    -> Filter
        -> Filter
            -> Table "COVER" Access By ID
                -> Bitmap And
                    -> Bitmap
                        -> Index "FK_COVER_FATHER" Range Scan (full match)
                    -> Bitmap
                        -> Index "IDX_COVER_BYYEAR" Range Scan (full match)
Select Expression
    -> Aggregate
        -> Filter
            -> Table "HORSE" as "H" Full Scan

                COUNT
=====================
                 1616

Current memory = 553472512
Delta memory = 288
Max memory = 553966592
Elapsed time = 6.862 sec
Buffers = 32768
Reads = 0
Writes = 0
Fetches = 2462726
Per table statistics:
--------------------------------+---------+---------+---------+---------+---------+
 Table name                     | Natural | Index   | Insert  | Update  | Delete  |
--------------------------------+---------+---------+---------+---------+---------+
COVER                           |         |     1616|         |         |         |
HORSE                           |   525875|         |         |         |         |
--------------------------------+---------+---------+---------+---------+---------+

Very slow. The HORSE table is full scan, and the subquery is executed multiple times — for each record in the HORSE table.

And now a quick option with DISTINCT

SELECT
  COUNT(*)
FROM
  HORSE H
  JOIN (
      SELECT
        DISTINCT
        CODE_FATHER
      FROM COVER
      WHERE EXTRACT(YEAR FROM COVER.BYDATE) = 2022
  ) TMP ON TMP.CODE_FATHER = H.CODE_HORSE

Select Expression
    -> Aggregate
        -> Nested Loop Join (inner)
            -> Unique Sort (record length: 44, key length: 12)
                -> Filter
                    -> Table "COVER" as "TMP COVER" Access By ID
                        -> Bitmap
                            -> Index "IDX_COVER_BYYEAR" Range Scan (full match)
            -> Filter
                -> Table "HORSE" as "H" Access By ID
                    -> Bitmap
                        -> Index "PK_HORSE" Unique Scan

                COUNT
=====================
                 1616

Current memory = 554349728
Delta memory = 320
Max memory = 555531328
Elapsed time = 0.011 sec
Buffers = 32768
Reads = 0
Writes = 0
Fetches = 14954
Per table statistics:
--------------------------------+---------+---------+---------+---------+---------+
 Table name                     | Natural | Index   | Insert  | Update  | Delete  |
--------------------------------+---------+---------+---------+---------+---------+
COVER                           |         |     6695|         |         |         |
HORSE                           |         |     1616|         |         |         |
--------------------------------+---------+---------+---------+---------+---------+

No unnecessary readings, the query is executed very quickly. Hence the conclusion — always look at the execution plan of subqueries in IN/EXISTS/ANY/SOME, and check alternative variants of writing queries.

Library

Firebird 5.0.1 Improvements in Optimizer

Converting subqueries to ANY/SOME/IN/EXISTS in semi-join