Re: Performance on large, append-only tables

Lists: pgsql-performance
From: David Yeu <david(dot)yeu(at)skype(dot)net>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance on large, append-only tables
Date: 2012-02-10 16:19:57
Message-ID: 4205A216-5AD0-4F23-826F-542537B27E62@skype.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Yeah, Reply-All...

Begin forwarded message:

> From: David Yeu <david(dot)yeu(at)skype(dot)net>
> Subject: Re: [PERFORM] Performance on large, append-only tables
> Date: February 10, 2012 10:59:04 AM EST
> To: Merlin Moncure <mmoncure(at)gmail(dot)com>
>
> On Feb 10, 2012, at 10:19 AM, Merlin Moncure wrote:
>
>> You can probably significantly optimize this. But first, can we see
>> some explain analyze for the affected queries?
>
> Sorry, we should have included these in the original post. Here's the EXPLAIN output for a "id < ?" query:
>
>
> => EXPLAIN ANALYZE SELECT "lines".* FROM "lines" WHERE (lines.deleted_at IS NULL) AND ("lines".group_id = ?) AND (id < ?) ORDER BY id DESC LIMIT 20 OFFSET 0;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=9267.44..9267.45 rows=20 width=1321) (actual time=348.844..348.877 rows=20 loops=1)
> -> Sort (cost=9267.44..9269.76 rows=4643 width=1321) (actual time=348.840..348.852 rows=20 loops=1)
> Sort Key: id
> Sort Method: top-N heapsort Memory: 29kB
> -> Index Scan using index_lines_on_group_id on lines (cost=0.00..9242.73 rows=4643 width=1321) (actual time=6.131..319.835 rows=23038 loops=1)
> Index Cond: (group_id = ?)
> Filter: ((deleted_at IS NULL) AND (id < ?))
> Total runtime: 348.987 ms
>
>
> A quick suggestion from Heroku yesterday was a new index on (group_id, id). After adding it to a database fork, we ended up with:
>
>
> => EXPLAIN ANALYZE SELECT "lines".* FROM "lines" WHERE (lines.deleted_at IS NULL) AND ("lines".group_id = ?) AND (id < ?) ORDER BY id DESC LIMIT 20 OFFSET 0;
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.00..28.88 rows=20 width=1321) (actual time=17.216..109.905 rows=20 loops=1)
> -> Index Scan Backward using index_lines_on_group_id_and_id on lines (cost=0.00..6416.04 rows=4443 width=1321) (actual time=17.207..109.867 rows=20 loops=1)
> Index Cond: ((group_id = ?) AND (id < ?))
> Filter: (deleted_at IS NULL)
> Total runtime: 110.039 ms
>
>
> The result has been pretty dramatic for the "id <> ?" queries, which make up the bulk of the queries. Running a whole bunch of EXPLAIN ANAYLZE queries also showed that some queries were actually choosing to use the index on `id' instead of `group_id', and that performed about as poorly as expected. Thankfully, the new index on (group_id, id) seems to be preferable nearly always.
>
> And for reference, here's the EXPLAIN for the LIMIT, OFFSET query:
>
>
> => EXPLAIN ANALYZE SELECT "lines".* FROM "lines" WHERE (lines.deleted_at IS NULL) AND ("lines".group_id = ?) ORDER BY id DESC LIMIT 20 OFFSET 60;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=9274.45..9274.46 rows=20 width=1321) (actual time=109.674..109.708 rows=20 loops=1)
> -> Sort (cost=9274.42..9276.75 rows=4646 width=1321) (actual time=109.606..109.657 rows=80 loops=1)
> Sort Key: id
> Sort Method: top-N heapsort Memory: 43kB
> -> Index Scan using index_lines_on_group_id on lines (cost=0.00..9240.40 rows=4646 width=1321) (actual time=0.117..98.905 rows=7999 loops=1)
> Index Cond: (group_id = ?)
> Filter: (deleted_at IS NULL)
> Total runtime: 109.753 ms
>
>
> - Dave
>


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To:
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance on large, append-only tables
Date: 2012-02-10 16:26:54
Message-ID: CAGTBQpZiWptgfApcbBB0=EGctWacUayfQRz701WKxF2AttKg9Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Fri, Feb 10, 2012 at 1:19 PM, David Yeu <david(dot)yeu(at)skype(dot)net> wrote:
>> => EXPLAIN ANALYZE SELECT  "lines".* FROM "lines" WHERE (lines.deleted_at IS NULL) AND ("lines".group_id = ?) AND (id < ?) ORDER BY id DESC LIMIT 20 OFFSET 0;

Interesting...

Do you have many "deleted" rows?
Do you always filter them out like this?

Because in that case, you can add the condition to the indices to
exclude deleted rows from the index. This is a big win if you have
many deleted rows, only the index expression has to be exactly the
same (verbatim) as the one used in the query.

That, and an index on "(group_id, created_at) where (deleted_at IS
NULL)" to catch the sorted by date kind of query, and you'll be done I
think.


From: David Yeu <david(dot)yeu(at)skype(dot)net>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance on large, append-only tables
Date: 2012-02-10 16:45:39
Message-ID: 79A88D19-AF27-4598-B616-51B6D130A9F0@skype.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Feb 10, 2012, at 11:26 AM, Claudio Freire wrote:
> That, and an index on "(group_id, created_at) where (deleted_at IS
> NULL)" to catch the sorted by date kind of query, and you'll be done I
> think.

Yeah, I didn't quite get that right -- we're actually sorting all these queries by "id DESC", not "created_at DESC", so that seems to obviate the need for any index on created_at.

Dave


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To:
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance on large, append-only tables
Date: 2012-02-10 16:58:13
Message-ID: CAGTBQpYA9vwL1g2LwpK_bDVwHh_t76JV52E2wFJmk2G5qtjTrw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Fri, Feb 10, 2012 at 1:45 PM, David Yeu <david(dot)yeu(at)skype(dot)net> wrote:
> On Feb 10, 2012, at 11:26 AM, Claudio Freire wrote:
>> That, and an index on "(group_id, created_at) where (deleted_at IS
>> NULL)" to catch the sorted by date kind of query, and you'll be done I
>> think.
>
> Yeah, I didn't quite get that right -- we're actually sorting all these queries by "id DESC", not "created_at DESC", so that seems to obviate the need for any index on created_at.

From your OP:

> * Š WHERE group_id = ? ORDER BY created_at DESC LIMIT 20;


From: David Yeu <david(dot)yeu(at)skype(dot)net>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance on large, append-only tables
Date: 2012-02-10 17:00:50
Message-ID: F68E3C39-84A0-4FB6-9500-A7B0A6ECCBBC@skype.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Feb 10, 2012, at 11:58 AM, Claudio Freire wrote:

> From your OP:
>
>> * Š WHERE group_id = ? ORDER BY created_at DESC LIMIT 20;

Yup, sorry.

Dave


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: David Yeu <david(dot)yeu(at)skype(dot)net>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance on large, append-only tables
Date: 2012-02-10 17:12:11
Message-ID: CAGTBQpaTOF-j7L3uVXo64o5vdyzMBHHC4crTWEz_a0wdj0mOcw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Fri, Feb 10, 2012 at 2:00 PM, David Yeu <david(dot)yeu(at)skype(dot)net> wrote:
>> From your OP:
>>
>>> * Š WHERE group_id = ? ORDER BY created_at DESC LIMIT 20;
>
> Yup, sorry.

Ah, ok, so that should do it.
If you need further improvement, remember to take a look at the deleted stuff.