SELECT
COUNT(*)
FROM
HORSE H
WHERE H.CODE_DEPARTURE = 1
AND H.CODE_SEX = 2
AND H.CODE_HORSE IN (
SELECT COVER.CODE_FATHER
FROM COVER
WHERE COVER.CODE_DEPARTURE = 1
AND EXTRACT(YEAR FROM COVER.BYDATE) = 2023
)
(c) D.Simonov, IBSurgeon, 21-Aug-2024
Recently, a point release of the Firebird 5.0 DBMS was released Firebird 5.0.1. In addition to fixing errors, a new experimental optimizer function was added, which will be discussed in this article.
A semi-join is an operation that joins two relations, returning rows from only one of the relations without performing the entire join. Unlike other join operators, there is no explicit syntax for specifying whether to perform a semi-join. However, you can perform a semi-join using subqueries in ANY/SOME/IN/EXISTS.
Traditionally, Firebird transforms subqueries in ANY/SOME/IN predicates into correlated subqueries in the EXISTS predicate, and executes the subquery in EXISTS for each record of the outer query. When executing a subquery inside an EXISTS predicate, the FIRST ROWS strategy is used, and its execution stops immediately after the first record is returned.
Starting with Firebird 5.0.1, subqueries in ANY/SOME/IN/EXISTS predicates can be converted to semi-joins. This feature is disabled by default, and can be enabled by setting the SubQueryConversion
configuration parameter to true
in the firebird.conf
or database.conf
file.
This feature is experimental, so it is disabled by default. You can enable it and test your queries with subqueries in ANY/SOME/IN/EXISTS predicates, and if the performance is better, leave it enabled, otherwise set the The default value for the |
Unlike performing ANY/SOME/IN/EXISTS on subqueries directly, i.e. as correlated subqueries, performing them as semi-joins gives more room for optimization. Semi-joins can be performed by various Hash Join (semi)
or Nested Loop Join (semi)
algorithms, while correlated subqueries are always performed for each record of the outer query.
Let’s try to enable this feature by setting the SubQueryConversion
parameter to true
in the firebird.conf
file. Now let’s do some experiments.
Let’s execute the following query:
SELECT
COUNT(*)
FROM
HORSE H
WHERE H.CODE_DEPARTURE = 1
AND H.CODE_SEX = 2
AND H.CODE_HORSE IN (
SELECT COVER.CODE_FATHER
FROM COVER
WHERE COVER.CODE_DEPARTURE = 1
AND EXTRACT(YEAR FROM COVER.BYDATE) = 2023
)
Select Expression -> Aggregate -> Filter -> Hash Join (semi) -> Filter -> Table "HORSE" as "H" Access By ID -> Bitmap And -> Bitmap -> Index "FK_HORSE_DEPARTURE" Range Scan (full match) -> Bitmap -> Index "FK_HORSE_SEX" Range Scan (full match) -> Record Buffer (record length: 41) -> Filter -> Table "COVER" Access By ID -> Bitmap And -> Bitmap -> Index "IDX_COVER_BYYEAR" Range Scan (full match) -> Bitmap -> Index "FK_COVER_DEPARTURE" Range Scan (full match) COUNT ===================== 297 Current memory = 552356752 Delta memory = 352 Max memory = 552567920 Elapsed time = 0.045 sec Buffers = 32768 Reads = 0 Writes = 0 Fetches = 43984 Per table statistics: --------------------------------+---------+---------+---------+---------+---------+ Table name | Natural | Index | Insert | Update | Delete | --------------------------------+---------+---------+---------+---------+---------+ COVER | | 1516| | | | HORSE | | 37069| | | | --------------------------------+---------+---------+---------+---------+---------+
In the execution plan we see a new join method Hash Join (semi)
. The result of the subquery in IN
was buffered, which is visible in the plan as Record Buffer (record length: 41)
. That is, in this case the subquery in IN was executed once, its result was saved in the hash table memory, and then the outer query simply searched in this hash table.
For comparison, let’s run the same query with subquery-to-semi-join conversion disabled.
Sub-query -> Filter -> Filter -> Table "COVER" Access By ID -> Bitmap And -> Bitmap -> Index "FK_COVER_FATHER" Range Scan (full match) -> Bitmap -> Index "IDX_COVER_BYYEAR" Range Scan (full match) Select Expression -> Aggregate -> Filter -> Table "HORSE" as "H" Access By ID -> Bitmap And -> Bitmap -> Index "FK_HORSE_DEPARTURE" Range Scan (full match) -> Bitmap -> Index "FK_HORSE_SEX" Range Scan (full match) COUNT ===================== 297 Current memory = 552046496 Delta memory = 352 Max memory = 552135600 Elapsed time = 0.395 sec Buffers = 32768 Reads = 0 Writes = 0 Fetches = 186891 Per table statistics: --------------------------------+---------+---------+---------+---------+---------+ Table name | Natural | Index | Insert | Update | Delete | --------------------------------+---------+---------+---------+---------+---------+ COVER | | 297| | | | HORSE | | 37069| | | | --------------------------------+---------+---------+---------+---------+---------+
The execution plan shows that the subquery is executed for each record of the main query, but uses an additional index FK_COVER_FATHER
. This is also visible in the execution statistics: the number of Fetches is 4 times greater, the execution time is almost 4 times worse.
The reader may ask: why does hash semi-join show 5 times more index reads of the COVER table, but otherwise it is better? The fact is that index reads in statistics show the number of records read using the index, they do not show the total number of index accesses, some of which do not result in retrieving records at all, but these accesses are not free. |
What happened? To better understand the transformation of subqueries, let’s introduce an imaginary semi-join operator "SEMI JOIN". As I already said, this type of join is not represented in the SQL language. Our query with the IN operator was transformed into an equivalent form, which can be written as follows:
SELECT
COUNT(*)
FROM
HORSE H
SEMI JOIN (
SELECT COVER.CODE_FATHER
FROM COVER
WHERE COVER.CODE_DEPARTURE = 1
AND EXTRACT(YEAR FROM COVER.BYDATE) = 2023
) TMP ON TMP.CODE_FATHER = H.CODE_HORSE
WHERE H.CODE_DEPARTURE = 1
AND H.CODE_SEX = 2
Now it’s clearer. The same thing happens for subqueries using EXISTS. Let’s look at another example:
SELECT
COUNT(*)
FROM
HORSE H
WHERE H.CODE_DEPARTURE = 1
AND EXISTS (
SELECT *
FROM COVER
WHERE COVER.CODE_DEPARTURE = 1
AND COVER.CODE_FATHER = H.CODE_FATHER
AND COVER.CODE_MOTHER = H.CODE_MOTHER
)
Currently, it is not possible to write such an EXISTS using IN. Let’s see how it is implemented without transforming it into a semi-join.
Sub-query -> Filter -> Table "COVER" Access By ID -> Bitmap And -> Bitmap -> Index "FK_COVER_MOTHER" Range Scan (full match) -> Bitmap -> Index "FK_COVER_FATHER" Range Scan (full match) Select Expression -> Aggregate -> Filter -> Table "HORSE" as "H" Access By ID -> Bitmap -> Index "FK_HORSE_DEPARTURE" Range Scan (full match) COUNT ===================== 91908 Current memory = 552240400 Delta memory = 352 Max memory = 554680016 Elapsed time = 19.083 sec Buffers = 32768 Reads = 0 Writes = 0 Fetches = 935679 Per table statistics: --------------------------------+---------+---------+---------+---------+---------+ Table name | Natural | Index | Insert | Update | Delete | --------------------------------+---------+---------+---------+---------+---------+ COVER | | 91908| | | | HORSE | | 96021| | | | --------------------------------+---------+---------+---------+---------+---------+
Very slow. Now let’s set SubQueryConversion = true
and run the query again.
Select Expression -> Aggregate -> Filter -> Hash Join (semi) -> Filter -> Table "HORSE" as "H" Access By ID -> Bitmap -> Index "FK_HORSE_DEPARTURE" Range Scan (full match) -> Record Buffer (record length: 49) -> Filter -> Table "COVER" Access By ID -> Bitmap -> Index "FK_COVER_DEPARTURE" Range Scan (full match) COUNT ===================== 91908 Current memory = 552102000 Delta memory = 352 Max memory = 561520736 Elapsed time = 0.208 sec Buffers = 32768 Reads = 0 Writes = 0 Fetches = 248009 Per table statistics: --------------------------------+---------+---------+---------+---------+---------+ Table name | Natural | Index | Insert | Update | Delete | --------------------------------+---------+---------+---------+---------+---------+ COVER | | 140254| | | | HORSE | | 96021| | | | --------------------------------+---------+---------+---------+---------+---------+
The query was executed 100 times faster! If we rewrite it using our fictitious SEMI JOIN operator, the query will look like this:
SELECT
COUNT(*)
FROM
HORSE H
SEMI JOIN (
SELECT
COVER.CODE_FATHER,
COVER.CODE_MOTHER
FROM COVER
) TMP ON TMP.CODE_FATHER = H.CODE_FATHER AND TMP.CODE_MOTHER = H.CODE_MOTHER
WHERE H.CODE_DEPARTURE = 1
Can any correlated subquery in IN/EXISTS be converted to a semi-join? No, not any, for example, if the subquery contains FETCH/FIRST/SKIP/ROWS filters, then the subquery cannot be converted to a semi-join and it will be executed as a correlated subquery. Here is an example of such a query:
SELECT
COUNT(*)
FROM
HORSE H
WHERE H.CODE_DEPARTURE = 1
AND EXISTS (
SELECT *
FROM COVER
WHERE COVER.CODE_FATHER = H.CODE_HORSE
OFFSET 0 ROWS
)
Here the phrase OFFSET 0 ROWS
does not change the semantics of the query, and the result of its execution will be the same as without it. Let’s look at the plan and statistics of this query.
Sub-query -> Skip N Records -> Filter -> Table "COVER" Access By ID -> Bitmap -> Index "FK_COVER_FATHER" Range Scan (full match) Select Expression -> Aggregate -> Filter -> Table "HORSE" as "H" Access By ID -> Bitmap -> Index "FK_HORSE_DEPARTURE" Range Scan (full match) COUNT ===================== 10971 Current memory = 551912944 Delta memory = 288 Max memory = 552002112 Elapsed time = 0.201 sec Buffers = 32768 Reads = 0 Writes = 0 Fetches = 408988 Per table statistics: --------------------------------+---------+---------+---------+---------+---------+ Table name | Natural | Index | Insert | Update | Delete | --------------------------------+---------+---------+---------+---------+---------+ COVER | | 10971| | | | HORSE | | 96021| | | | --------------------------------+---------+---------+---------+---------+---------+
As you can see, the transformation to a semi-join did not occur. Now let’s remove OFFSET 0 ROWS
and take statistics again.
Select Expression -> Aggregate -> Filter -> Hash Join (semi) -> Filter -> Table "HORSE" as "H" Access By ID -> Bitmap -> Index "FK_HORSE_DEPARTURE" Range Scan (full match) -> Record Buffer (record length: 33) -> Table "COVER" Full Scan COUNT ===================== 10971 Current memory = 552112128 Delta memory = 288 Max memory = 585044592 Elapsed time = 0.405 sec Buffers = 32768 Reads = 0 Writes = 0 Fetches = 854841 Per table statistics: --------------------------------+---------+---------+---------+---------+---------+ Table name | Natural | Index | Insert | Update | Delete | --------------------------------+---------+---------+---------+---------+---------+ COVER | 722465| | | | | HORSE | | 96021| | | | --------------------------------+---------+---------+---------+---------+---------+
Here the conversion to semi-join has happened, and as we can see the execution time has become worse. The reason is that currently the optimizer does not have a cost estimate between the Hash Join (semi)
and Nested Loop Join (semi)
join algorithms using an index, so the rule is: if the join condition contains only equality, then the Hash Join (semi)
algorithm is chosen, otherwise the IN/EXISTS subqueries are executed as usual.
Now let’s disable the semi-join conversion and look at the execution statistics.
Sub-query -> Filter -> Table "COVER" Access By ID -> Bitmap -> Index "FK_COVER_FATHER" Range Scan (full match) Select Expression -> Aggregate -> Filter -> Table "HORSE" as "H" Access By ID -> Bitmap -> Index "FK_HORSE_DEPARTURE" Range Scan (full match) COUNT ===================== 10971 Current memory = 551912752 Delta memory = 288 Max memory = 552001920 Elapsed time = 0.193 sec Buffers = 32768 Reads = 0 Writes = 0 Fetches = 408988 Per table statistics: --------------------------------+---------+---------+---------+---------+---------+ Table name | Natural | Index | Insert | Update | Delete | --------------------------------+---------+---------+---------+---------+---------+ COVER | | 10971| | | | HORSE | | 96021| | | | --------------------------------+---------+---------+---------+---------+---------+
As you can see, Fetches is exactly equal to the case when the subquery contained the clause OFFSET 0 ROWS
, and the execution time differs within the margin of error. This means that you can use the clause OFFSET 0 ROWS
as a hint to disable the semi-join conversion.
Now let’s look at cases where any correlated condition other than equality and IS NOT DISTINCT FROM
is used in subqueries.
SELECT
COUNT(*)
FROM
HORSE H
WHERE H.CODE_DEPARTURE = 1
AND EXISTS (
SELECT *
FROM COVER
WHERE COVER.BYDATE > H.BIRTHDAY
)
Sub-query -> Filter -> Table "COVER" Access By ID -> Bitmap -> Index "COVER_IDX_BYDATE" Range Scan (lower bound: 1/1) Select Expression -> Aggregate -> Filter -> Table "HORSE" as "H" Access By ID -> Bitmap -> Index "FK_HORSE_DEPARTURE" Range Scan (full match)
As I said above, no transformation to a semi-join occurred, the subquery is executed for each record of the main query.
Let’s continue the experiments, write a query using equality and one more predicate except equality.
SELECT
COUNT(*)
FROM
HORSE H
WHERE H.CODE_DEPARTURE = 1
AND EXISTS (
SELECT *
FROM COVER
WHERE COVER.CODE_FATHER = H.CODE_FATHER
AND COVER.BYDATE > H.BIRTHDAY
)
Select Expression -> Aggregate -> Nested Loop Join (semi) -> Filter -> Table "HORSE" as "H" Access By ID -> Bitmap -> Index "FK_HORSE_DEPARTURE" Range Scan (full match) -> Filter -> Filter -> Table "COVER" Access By ID -> Bitmap -> Index "COVER_IDX_BYDATE" Range Scan (lower bound: 1/1)
Here in the plan we see the first use of the Nested Loop Join (semi)
join method, but unfortunately this plan is bad, because the FK_COVER_FATHER
index is not used. You will not get any results from such a query. This can be fixed using the OFFSET 0 ROWS
hint.
SELECT
COUNT(*)
FROM
HORSE H
WHERE H.CODE_DEPARTURE = 1
AND EXISTS (
SELECT *
FROM COVER
WHERE COVER.CODE_FATHER = H.CODE_FATHER
AND COVER.BYDATE > H.BIRTHDAY
OFFSET 0 ROWS
)
Sub-query -> Skip N Records -> Filter -> Table "COVER" Access By ID -> Bitmap -> Index "FK_COVER_FATHER" Range Scan (full match) Select Expression -> Aggregate -> Filter -> Table "HORSE" as "H" Access By ID -> Bitmap -> Index "FK_HORSE_DEPARTURE" Range Scan (full match) COUNT ===================== 72199 Current memory = 554017824 Delta memory = 320 Max memory = 554284480 Elapsed time = 45.548 sec Buffers = 32768 Reads = 0 Writes = 0 Fetches = 84145713 Per table statistics: --------------------------------+---------+---------+---------+---------+---------+ Table name | Natural | Index | Insert | Update | Delete | --------------------------------+---------+---------+---------+---------+---------+ COVER | | 75894621| | | | HORSE | | 96021| | | | --------------------------------+---------+---------+---------+---------+---------+
Not the best execution time, but in this case we at least got the result.
Thus, converting subqueries to ANY/SOME/IN/EXISTS into semi-join allows in some cases to significantly speed up query execution, but at present this feature is still imperfect and therefore disabled by default. In Firebird 6.0, they will try to add cost estimation for this feature, as well as fix a number of other shortcomings. In addition, Firebird 6.0 plans to add conversion of subqueries ALL/NOT IN/NOT EXISTS into anti-join.
In conclusion of the review of the execution of subqueries in IN/EXISTS, I would like to note that if you have a query of the form
SELECT ... FROM T1 WHEREIN (SELECT field FROM T2 ...)
or
SELECT ... FROM T1 WHERE EXISTS (SELECT ... FROM T2 WHERE T1.= T2.field)
then such queries are almost always more efficient to execute as
SELECT ... FROM T1 JOIN (SELECT DISTINCT field FROM T2) tmp ON tmp.field = T1.
Let me give you a clear example:
SELECT
COUNT(*)
FROM
HORSE H
WHERE H.CODE_HORSE IN (
SELECT
CODE_FATHER
FROM COVER
WHERE EXTRACT(YEAR FROM COVER.BYDATE) = 2022
)
Execution plan and statistics using Hash Join (semi)
Select Expression -> Aggregate -> Filter -> Hash Join (semi) -> Table "HORSE" as "H" Full Scan -> Record Buffer (record length: 41) -> Filter -> Table "COVER" Access By ID -> Bitmap -> Index "IDX_COVER_BYYEAR" Range Scan (full match) COUNT ===================== 1616 Current memory = 554176768 Delta memory = 288 Max memory = 555531328 Elapsed time = 0.229 sec Buffers = 32768 Reads = 0 Writes = 0 Fetches = 569683 Per table statistics: --------------------------------+---------+---------+---------+---------+---------+ Table name | Natural | Index | Insert | Update | Delete | --------------------------------+---------+---------+---------+---------+---------+ COVER | | 6695| | | | HORSE | 525875| | | | | --------------------------------+---------+---------+---------+---------+---------+
Quite fast, but the HORSE table is read in full.
Execution plan and statistics with classic subquery execution
Sub-query -> Filter -> Filter -> Table "COVER" Access By ID -> Bitmap And -> Bitmap -> Index "FK_COVER_FATHER" Range Scan (full match) -> Bitmap -> Index "IDX_COVER_BYYEAR" Range Scan (full match) Select Expression -> Aggregate -> Filter -> Table "HORSE" as "H" Full Scan COUNT ===================== 1616 Current memory = 553472512 Delta memory = 288 Max memory = 553966592 Elapsed time = 6.862 sec Buffers = 32768 Reads = 0 Writes = 0 Fetches = 2462726 Per table statistics: --------------------------------+---------+---------+---------+---------+---------+ Table name | Natural | Index | Insert | Update | Delete | --------------------------------+---------+---------+---------+---------+---------+ COVER | | 1616| | | | HORSE | 525875| | | | | --------------------------------+---------+---------+---------+---------+---------+
Very slow. The HORSE table is full scan, and the subquery is executed multiple times — for each record in the HORSE table.
And now a quick option with DISTINCT
SELECT
COUNT(*)
FROM
HORSE H
JOIN (
SELECT
DISTINCT
CODE_FATHER
FROM COVER
WHERE EXTRACT(YEAR FROM COVER.BYDATE) = 2022
) TMP ON TMP.CODE_FATHER = H.CODE_HORSE
Select Expression -> Aggregate -> Nested Loop Join (inner) -> Unique Sort (record length: 44, key length: 12) -> Filter -> Table "COVER" as "TMP COVER" Access By ID -> Bitmap -> Index "IDX_COVER_BYYEAR" Range Scan (full match) -> Filter -> Table "HORSE" as "H" Access By ID -> Bitmap -> Index "PK_HORSE" Unique Scan COUNT ===================== 1616 Current memory = 554349728 Delta memory = 320 Max memory = 555531328 Elapsed time = 0.011 sec Buffers = 32768 Reads = 0 Writes = 0 Fetches = 14954 Per table statistics: --------------------------------+---------+---------+---------+---------+---------+ Table name | Natural | Index | Insert | Update | Delete | --------------------------------+---------+---------+---------+---------+---------+ COVER | | 6695| | | | HORSE | | 1616| | | | --------------------------------+---------+---------+---------+---------+---------+
No unnecessary readings, the query is executed very quickly. Hence the conclusion — always look at the execution plan of subqueries in IN/EXISTS/ANY/SOME, and check alternative variants of writing queries.