From: felix(at)crowfix(dot)com
To: pgsql-general(at)postgresql(dot)org
Subject: Asking advice on speeding up a big table
Date: 2006-04-10 06:17:36
Message-ID: 20060410061736.GA14027@crowfix.com
Lists: pgsql-general

I have a simple benchmark which runs too slow on a 100M row table, and
I am not sure what my next step is to make it faster.

It's a simple setup, part of a larger system. There are three data
tables, each with a BIGINT id column and a data column of a different
type. There is a fourth table with BIGINT foreign key references to
the other three tables' id columns.

felix=> \d key
     Table "oddtimes.key"
 Column |  Type  | Modifiers
--------+--------+-----------
 id     | bigint | not null
 data   | text   | not null
Indexes:
    "key_pkey" PRIMARY KEY, btree (id)
    "key_data_key" UNIQUE, btree (data)

felix=> \d val
     Table "oddtimes.val"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | bigint  | not null
 data   | integer | not null
Indexes:
    "val_pkey" PRIMARY KEY, btree (id)
    "val_data_key" UNIQUE, btree (data)

felix=> \d sid
        Table "oddtimes.sid"
 Column |     Type      | Modifiers
--------+---------------+-----------
 id     | bigint        | not null
 data   | character(40) | not null
Indexes:
    "sid_pkey" PRIMARY KEY, btree (id)

felix=> \d glue
    Table "oddtimes.glue"
 Column |  Type  | Modifiers
--------+--------+-----------
 key    | bigint |
 val    | bigint |
 sid    | bigint |
Indexes:
    "glue_key_idx" btree ("key")
    "glue_key_val_idx" btree ("key", val)
    "glue_val_idx" btree (val)
    "glue_val_key_idx" btree (val, "key")
Foreign-key constraints:
    "glue_key" FOREIGN KEY ("key") REFERENCES "key"(id)
    "glue_val" FOREIGN KEY (val) REFERENCES val(id)
    "glue_sid" FOREIGN KEY (sid) REFERENCES sid(id)

Usage is to match data from the key and val tables to fetch the data
value from the sid table. It's sort of a glorified Berkeley DB, except
that you can compare ranges, not just exact matches. If I can make it
fast enough, I may add two more types, date and text.
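
For example, a range lookup looks just like the exact-match case but
with a BETWEEN on val.data (the key string and the bounds here are
only sample values):

SELECT sid.data, val.data
  FROM key, val, sid, glue
 WHERE key.data = 'UajzAQjTJPevVJBuuerjU4pcl8eJcyrIxzkC'
   AND key.id = glue.key
   AND val.data BETWEEN 1984180000 AND 1984190000  -- range, not equality
   AND val.id = glue.val
   AND glue.sid = sid.id;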

The key table has 1K rows, val has 100K, and sid and glue have 100M
rows each. They take about 31G of space, last time I checked.

felix=> SELECT relname, relpages FROM pg_class ORDER BY relpages DESC LIMIT 20;
     relname      | relpages
------------------+----------
 sid              |  1086957
 glue             |   735295
 glue_key_val_idx |   385042
 glue_val_key_idx |   385042
 sid_pkey         |   274194
 glue_key_idx     |   274194
 glue_val_idx     |   274194
 val              |      589
 val_pkey         |      382
 val_data_key     |      283

My benchmark times SQL lookups like the one below. This example was a
repeat and ran at a reasonable speed, but fresh values take around 3-4
seconds. This is PostgreSQL 8.0.3 on a dual Opteron dual-core machine
with only 2G RAM (it is meant for compute-intensive work and was idle
while I ran these tests). The disks are plain vanilla IDE, maybe SATA,
but nothing at all special.

felix=> explain analyze SELECT sid.data, val.data FROM key, val, sid, glue WHERE key.data = 'UajzAQjTJPevVJBuuerjU4pcl8eJcyrIxzkC' AND key.id = glue.key AND val.data = 1984186373 AND val.id = glue.val AND glue.sid = sid.id;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=5.84..4480.89 rows=1 width=48) (actual time=32.157..32.157 rows=0 loops=1)
   ->  Hash Join  (cost=5.84..4477.87 rows=1 width=12) (actual time=32.149..32.149 rows=0 loops=1)
         Hash Cond: ("outer"."key" = "inner".id)
         ->  Nested Loop  (cost=0.00..4467.01 rows=1001 width=20) (actual time=0.205..28.304 rows=1000 loops=1)
               ->  Index Scan using val_data_key on val  (cost=0.00..6.01 rows=1 width=12) (actual time=0.059..0.066 rows=1 loops=1)
                     Index Cond: (data = 1984186373)
               ->  Index Scan using glue_val_idx on glue  (cost=0.00..4447.15 rows=1108 width=24) (actual time=0.131..20.670 rows=1000 loops=1)
                     Index Cond: ("outer".id = glue.val)
         ->  Hash  (cost=5.84..5.84 rows=1 width=8) (actual time=0.123..0.123 rows=0 loops=1)
               ->  Index Scan using key_data_key on "key"  (cost=0.00..5.84 rows=1 width=8) (actual time=0.100..0.105 rows=1 loops=1)
                     Index Cond: (data = 'UajzAQjTJPevVJBuuerjU4pcl8eJcyrIxzkC'::text)
   ->  Index Scan using sid_pkey on sid  (cost=0.00..3.01 rows=1 width=52) (never executed)
         Index Cond: ("outer".sid = sid.id)
 Total runtime: 32.293 ms
(14 rows)

I realize these indices and tables take a lot of space, and that is
at least mostly the reason for the slow speed. But still, several
seconds seems like an eternity. Are there things I can do to speed
this up? Or is this time way out of line and ought it to be faster,
meaning I have a bug somewhere?

--
... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
Felix Finch: scarecrow repairman & rocket surgeon / felix(at)crowfix(dot)com
GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933
I've found a solution to Fermat's Last Theorem but I see I've run out of room o


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: felix-accts-pgsql(at)crowfix(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Asking advice on speeding up a big table
Date: 2006-04-10 06:51:30
Message-ID: 17315.1144651890@sss.pgh.pa.us
Lists: pgsql-general

felix(at)crowfix(dot)com writes:
> I have a simple benchmark which runs too slow on a 100M row table, and
> I am not sure what my next step is to make it faster.

The EXPLAIN ANALYZE you showed ran in 32 msec, which ought to be fast
enough for anyone on that size table. You need to show us data on the
problem case ...

regards, tom lane


From: felix(at)crowfix(dot)com
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Asking advice on speeding up a big table
Date: 2006-04-10 21:20:21
Message-ID: 20060410212021.GA28712@crowfix.com
Lists: pgsql-general

On Mon, Apr 10, 2006 at 02:51:30AM -0400, Tom Lane wrote:
> felix(at)crowfix(dot)com writes:
> > I have a simple benchmark which runs too slow on a 100M row table, and
> > I am not sure what my next step is to make it faster.
>
> The EXPLAIN ANALYZE you showed ran in 32 msec, which ought to be fast
> enough for anyone on that size table. You need to show us data on the
> problem case ...

It is, but it is only 32 msec because the query has already run and
cached the useful bits. And since I have random values, as soon as I
look up some new values, they are cached and no longer new.

What I was hoping for was some general insight from the EXPLAIN
ANALYZE, that maybe extra or different indices would help, or if there
is some better method for finding one row from 100 million. I realize
I am asking a vague question which probably can't be solved as
presented.

--
... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
Felix Finch: scarecrow repairman & rocket surgeon / felix(at)crowfix(dot)com
GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933
I've found a solution to Fermat's Last Theorem but I see I've run out of room o


From: "hubert depesz lubaczewski" <depesz(at)gmail(dot)com>
To: "felix(at)crowfix(dot)com" <felix(at)crowfix(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Asking advice on speeding up a big table
Date: 2006-04-11 07:52:40
Message-ID: 9e4684ce0604110052h71314e84qf49f2b8260151321@mail.gmail.com
Lists: pgsql-general

On 4/10/06, felix(at)crowfix(dot)com <felix(at)crowfix(dot)com> wrote:
>
> It is, but it is only 32 msec because the query has already run and
> cached the useful bits. And since I have random values, as soon as I
> look up some new values, they are cached and no longer new.

according to my experience i would vote for a too-slow filesystem

> What I was hoping for was some general insight from the EXPLAIN
> ANALYZE, that maybe extra or different indices would help, or if there
> is some better method for finding one row from 100 million. I realize
> I am asking a vague question which probably can't be solved as
> presented.
>

hmm .. perhaps you can try to denormalize the table, and then use
multicolumn indices?
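
something along these lines, perhaps (just a sketch against the schema
you posted, not tested; the new table and index names are made up):

CREATE TABLE glue_flat AS
SELECT k.data AS key_data, v.data AS val_data, g.sid
  FROM glue g
  JOIN key k ON k.id = g.key
  JOIN val v ON v.id = g.val;

CREATE INDEX glue_flat_key_val_idx ON glue_flat (key_data, val_data);

-- the lookup then hits a single multicolumn index on a single table:
SELECT s.data
  FROM glue_flat f
  JOIN sid s ON s.id = f.sid
 WHERE f.key_data = 'UajzAQjTJPevVJBuuerjU4pcl8eJcyrIxzkC'
   AND f.val_data = 1984186373;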

depesz


From: felix(at)crowfix(dot)com
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Asking advice on speeding up a big table
Date: 2006-04-11 14:38:33
Message-ID: 20060411143833.GA9781@crowfix.com
Lists: pgsql-general

On Tue, Apr 11, 2006 at 09:52:40AM +0200, hubert depesz lubaczewski wrote:
> On 4/10/06, felix(at)crowfix(dot)com <felix(at)crowfix(dot)com> wrote:

> > What I was hoping for was some general insight from the EXPLAIN
> > ANALYZE, that maybe extra or different indices would help, or if there
> > is some better method for finding one row from 100 million. I realize
> > I am asking a vague question which probably can't be solved as
> > presented.
> >
>
> hmm .. perhaps you can try to denormalize the table, and then use
> multicolumn indices?

That's an idea ... I had thought that, since my two referenced tables
are small in comparison to the third table, denormalizing wouldn't be
of any use, but I will give it a try. Thanks ...

--
... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
Felix Finch: scarecrow repairman & rocket surgeon / felix(at)crowfix(dot)com
GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933
I've found a solution to Fermat's Last Theorem but I see I've run out of room o


From: Alban Hertroys <alban(at)magproductions(dot)nl>
To: felix(at)crowfix(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Asking advice on speeding up a big table
Date: 2006-04-11 14:58:14
Message-ID: 443BC406.6070204@magproductions.nl
Lists: pgsql-general

felix(at)crowfix(dot)com wrote:
> On Tue, Apr 11, 2006 at 09:52:40AM +0200, hubert depesz lubaczewski wrote:
>
>>On 4/10/06, felix(at)crowfix(dot)com <felix(at)crowfix(dot)com> wrote:
>
>
>>>What I was hoping for was some general insight from the EXPLAIN
>>>ANALYZE, that maybe extra or different indices would help, or if there
>>>is some better method for finding one row from 100 million. I realize
>>>I am asking a vague question which probably can't be solved as
>>>presented.
>>>
>>
>>hmm .. perhaps you can try to denormalize the table, and then use
>>multicolumn indices?
>
> That's an idea ... I had thought that, since my two referenced tables
> are small in comparison to the third table, denormalizing wouldn't be
> of any use, but I will give it a try. Thanks ...

You could also experiment with clustering your large tables on some
index, or using one or more partial indexes on relevant partitions of
your data set. The application of such measures really depends on how
your data behaves, so it's hard to predict whether it's going to help
you or not.

We have seen great benefits from using the 'right' index as opposed to
just any index. An EXPLAIN ANALYZE showing an index scan is a good
start, but your design may yet improve.

For example, we have a table with translations of strings in different
languages. Part of that table's contents deals with names of
geographical locations (mostly cities). An index on lowercased location
name translations, covering only the translation records relevant for
locations, sped up our queries immensely (from over 300 ms to less than
1 ms, uncached). That was a pretty amazing performance improvement, to
be sure :)

So it is possible.
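
Translated to the schema in this thread, that could look something like
the following (an untested sketch; the partial-index predicate is only
an illustration, the right one depends entirely on your queries):

-- physically order the big table by the index the lookups actually use
CLUSTER glue_key_val_idx ON glue;

-- and/or index only the slice of rows a given class of query touches
CREATE INDEX glue_hot_vals_idx ON glue ("key", val) WHERE val > 90000;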

--
Alban Hertroys
alban(at)magproductions(dot)nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
7500 AK Enschede

// Integrate Your World //


From: Francisco Reyes <lists(at)stringsutils(dot)com>
To: felix-accts-pgsql(at)crowfix(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Asking advice on speeding up a big table
Date: 2006-04-15 14:31:26
Message-ID: cone.1145111486.252779.47312.1000@zoraida.natserv.net
Lists: pgsql-general

felix(at)crowfix(dot)com writes:

> Usage is to match data from the key and val tables to fetch the data
> value from the sid table.

What is the relation between key and val tables?
Will key.id and val.id be equal?

I have never quite/fully understood the output of EXPLAIN ANALYZE, but
I wonder why you have:

> -> Nested Loop (cost=0.00..4467.01 rows=1001 width=20) (actual time=0.205..28.304 rows=1000 loops=1)

Why 1000 rows?
Do any of your conditions in the select return many records?


From: felix-accts-pgsql(at)crowfix(dot)com
To: pgsql-general(at)postgresql(dot)org
Cc: Francisco Reyes <lists(at)stringsutils(dot)com>
Subject: Re: Asking advice on speeding up a big table
Date: 2006-04-15 15:49:11
Message-ID: 20060415154911.GA12672@crowfix.com
Lists: pgsql-general

On Sat, Apr 15, 2006 at 10:31:26AM -0400, Francisco Reyes wrote:
> felix(at)crowfix(dot)com writes:
>
> >Usage is to match data from the key and val tables to fetch the data
> >value from the sid table.
>
> What is the relation between key and val tables?
> Will key.id and val.id be equal?

This benchmark explores an idea for a simple Berkeley-DB-like lookup,
but faster and allowing ranges, and specialized for looking up info in
some other tables I have. The key table data is table.column, and 1000
is a rough guess at how many unique column names there might be. The
val table is the contents of those columns, and 100K is another rough
guess. The end result, the sid table, is a generic ID I have; it could
be anything, like lat/lon, room-bookshelf-shelf-book, etc.

key.id and val.id have no bearing on each other.
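
To make that concrete, a row in each table might look something like
this (the values are invented, just to show how the pieces relate):

INSERT INTO key  (id, data) VALUES (1, 'books.page_count');     -- a table.column name
INSERT INTO val  (id, data) VALUES (7, 350);                    -- a value seen in that column
INSERT INTO sid  (id, data) VALUES (42, 'room3-shelf2-book9');  -- the generic ID to return
INSERT INTO glue (key, val, sid) VALUES (1, 7, 42);             -- ties the three together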

I have made some minor changes and sped things up to around 15-20
lookups/sec, which is good enough but not exciting :-) and in the
process I have come across some odd misbehavior. I have a writeup
almost ready to post, but I want to make sure I cross my Is and dot my
Ts properly.

--
... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
Felix Finch: scarecrow repairman & rocket surgeon / felix(at)crowfix(dot)com
GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933
I've found a solution to Fermat's Last Theorem but I see I've run out of room o


From: Francisco Reyes <lists(at)stringsutils(dot)com>
To: felix-accts-pgsql(at)crowfix(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Asking advice on speeding up a big table
Date: 2006-04-15 16:22:03
Message-ID: cone.1145118123.188398.47312.1000@zoraida.natserv.net
Lists: pgsql-general

felix-accts-pgsql(at)crowfix(dot)com writes:

> I have made some minor changes and sped things up to around 15-20
> lookups/sec, which is good enough but not exciting :-)

hmm, let me understand this.
You went from 1 query in 3 to 4 seconds to 45 to 60 queries in the same
amount of time... 45 to 60 times faster... and that is not something to
be excited about? :)