This post explains how we improved the performance of a database query by replacing two individual timestamptz columns with a single Postgres tstzrange column.
It all began when our operations team complained that a particular report was taking a very long time to generate. Initial analysis revealed the report spent most of its time executing a particular query in our Postgres database. Though the query inner-joins three tables, the join and where clauses are straightforward. We also confirmed that indexes exist for the columns involved in the join and where clauses. So we fired up pgHero and generated the execution plan for the query. Loading the execution plan into PEV indicated that the where clause on the timestamp columns was not using the indexes and had to scan all rows.
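For reference, PEV consumes execution plans in JSON format; a plan can be captured with something like the following (substitute the actual report query, shown further below):

```sql
-- ANALYZE actually executes the query and records real timings;
-- FORMAT JSON produces the output PEV expects.
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT count(*) FROM licenses;  -- placeholder; use the report query here
```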
To understand the problem more deeply, let us consider the following pseudo table models.
theatres table

| id | name |
|---|---|
| 1000 | Theatre 1 |
| 1001 | Theatre 2 |

screens table

| id | name |
|---|---|
| 2000 | Screen 1 |
| 2001 | Screen 2 |

licenses table

| id | screen_id | theatre_id | valid_from | valid_till |
|---|---|---|---|---|
| 3000 | 2000 | 1000 | 2019-04-01T00:00:00+0000 | 2019-05-15T23:59:59+0000 |
| 3001 | 2001 | 1001 | 2019-04-01T00:00:00+0000 | 2019-05-15T23:59:59+0000 |
| 3002 | 2000 | 1000 | 2019-05-01T00:00:00+0000 | 2019-05-31T23:59:59+0000 |
| 3003 | 2001 | 1001 | 2019-05-01T00:00:00+0000 | 2019-05-31T23:59:59+0000 |
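For context, here is a minimal sketch of the schema these models imply (the column types and index names are our assumptions, not the actual production schema):

```sql
CREATE TABLE theatres (
  id   bigint PRIMARY KEY,
  name text NOT NULL
);

CREATE TABLE screens (
  id   bigint PRIMARY KEY,
  name text NOT NULL
);

CREATE TABLE licenses (
  id         bigint PRIMARY KEY,
  screen_id  bigint REFERENCES screens (id),
  theatre_id bigint REFERENCES theatres (id),
  valid_from timestamptz NOT NULL,
  valid_till timestamptz NOT NULL
);

-- The indexes that existed at this point: plain B-tree indexes
-- on the join and filter columns, including the two timestamps.
CREATE INDEX licenses_theatre_id_idx ON licenses (theatre_id);
CREATE INDEX licenses_screen_id_idx  ON licenses (screen_id);
CREATE INDEX licenses_valid_from_idx ON licenses (valid_from);
CREATE INDEX licenses_valid_till_idx ON licenses (valid_till);
```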
The report lists all (theatre, screen) pairs having any valid movie license within a specified date/time range. The offending query was written as below:
```sql
SELECT DISTINCT t.id AS theatre_id, s.id AS screen_id
FROM theatres t
INNER JOIN licenses l ON l.theatre_id = t.id
INNER JOIN screens s ON s.id = l.screen_id
WHERE (l.valid_from, l.valid_till) OVERLAPS ('2019-04-01T00:00:00+0000', '2019-05-31T23:59:59+0000')
```
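As an aside, OVERLAPS treats each (start, end) pair as a half-open interval (start <= time < end), so its boundary behaviour differs subtly from the inclusive comparisons in the manual rewrite below:

```sql
-- (S1, E1) OVERLAPS (S2, E2) is equivalent to S1 < E2 AND S2 < E1
-- (for S < E), i.e. the end points themselves are excluded.
SELECT ('2019-04-01'::timestamptz, '2019-05-31'::timestamptz)
       OVERLAPS ('2019-05-31'::timestamptz, '2019-06-30'::timestamptz);
-- => false: the first period ends exactly where the second begins
```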
The above query took ~450ms to select 140 rows on tables theatres, screens and licenses with row counts of 37339, 147850 and 350693 respectively. The execution plan indicated the query uses indexes for the various id columns but didn't use the indexes for the valid_from and valid_till columns.
Suspecting the timestamp indexes, we rewrote the query as below to check whether the use of the OVERLAPS operator prevents the query from using the indexes.
```sql
SELECT DISTINCT t.id AS theatre_id, s.id AS screen_id
FROM theatres t
INNER JOIN licenses l ON l.theatre_id = t.id
INNER JOIN screens s ON s.id = l.screen_id
WHERE (l.valid_from >= '2019-04-01T00:00:00+0000' AND l.valid_from <= '2019-05-31T23:59:59+0000')
OR (l.valid_till >= '2019-04-01T00:00:00+0000' AND l.valid_till <= '2019-05-31T23:59:59+0000')
OR (l.valid_from <= '2019-04-01T00:00:00+0000' AND l.valid_till >= '2019-05-31T23:59:59+0000')
```
To our surprise the query was fast: it took just ~180ms to select the same 140 rows from the same dataset, and the execution plan showed it uses the timestamp indexes. We couldn't understand why the modified query was able to use the indexes and run faster while the original query with the OVERLAPS operator couldn't and ran slower. We did a little searching on the Internet but couldn't gather much info. But thinking it through, we realized the rationale. The OVERLAPS operator checks whether two timestamp ranges overlap. But the indexes are created on the individual timestamp columns. It is understandable that the operator can't take advantage of the indexes, as it requires a single timestamp range comprising both the start and end timestamps, i.e. valid_from and valid_till. So the query had to scan all rows to construct the timestamp range, the combined value of valid_from and valid_till, to check against the specified timestamp range.
Having realized that separate indexes don't help a time-window overlap condition, we set out to find whether there is a way to create an index for time windows/timestamp ranges. We found that Postgres has built-in timestamp range data types: tsrange (without time zone info) and tstzrange (with time zone info). So we replaced the two columns valid_from and valid_till with a single column validity of type tstzrange. We also created a compatible index, GiST, for the column data. While migrating the data from the existing columns, we also found invalid data where valid_till was earlier than valid_from. The tstzrange type's built-in validation caught these invalid rows and we were able to fix them.
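A minimal sketch of such a migration, assuming the table and column names from the models above (the index name is hypothetical):

```sql
-- Add the range column alongside the old columns.
ALTER TABLE licenses ADD COLUMN validity tstzrange;

-- Backfill from the existing columns; '[]' makes both bounds inclusive,
-- matching the closed intervals in the report. tstzrange() raises an
-- error if valid_till < valid_from, which is how the invalid rows
-- surfaced during migration.
UPDATE licenses SET validity = tstzrange(valid_from, valid_till, '[]');

-- Range types are indexed with GiST, which supports the overlap
-- operator (&&) used in the rewritten query.
CREATE INDEX licenses_validity_idx ON licenses USING gist (validity);

-- Once everything is verified, drop the old columns.
ALTER TABLE licenses DROP COLUMN valid_from, DROP COLUMN valid_till;
```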
licenses table with column validity

| id | screen_id | theatre_id | validity |
|---|---|---|---|
| 3000 | 2000 | 1000 | ["2019-04-01T00:00:00+0000","2019-05-15T23:59:59+0000"] |
| 3001 | 2001 | 1001 | ["2019-04-01T00:00:00+0000","2019-05-15T23:59:59+0000"] |
| 3002 | 2000 | 1000 | ["2019-05-01T00:00:00+0000","2019-05-31T23:59:59+0000"] |
| 3003 | 2001 | 1001 | ["2019-05-01T00:00:00+0000","2019-05-31T23:59:59+0000"] |
We rewrote the query as below to achieve the same result with the new validity column of type tstzrange.
```sql
SELECT DISTINCT t.id AS theatre_id, s.id AS screen_id
FROM theatres t
INNER JOIN licenses l ON l.theatre_id = t.id
INNER JOIN screens s ON s.id = l.screen_id
WHERE l.validity && tstzrange('["2019-04-01T00:00:00+0000", "2019-05-31T23:59:59+0000"]')
```
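Note that tstzrange('[...]') here relies on Postgres's function-style cast of a range literal; an equivalent and perhaps more explicit form is the constructor that takes the bounds separately, which could replace the right-hand side of the && condition above:

```sql
-- The same range built with the three-argument constructor;
-- '[]' marks both bounds as inclusive.
SELECT tstzrange('2019-04-01T00:00:00+0000',
                 '2019-05-31T23:59:59+0000', '[]');
```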
Now the query is as fast as the previous modified version, taking ~180ms to produce the same 140 rows from the same dataset. But the use of the validity column with the tstzrange type's && (overlap) operator made the query succinct. Additionally, the tstzrange type brought in automatic validation for the timestamp range data.
We were happy that we not only made the query fast but also learned something new in Postgres. We hope our experience and learning will be useful to you.