Re: Sort performance on large tables

From: Charlie Savage <cfis(at)interserv(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Sort performance on large tables
Date: 2005-11-08 07:05:01
Message-ID: 43704E1D.80708@interserv.com
Lists: pgsql-performance

Hi everyone,

I have a question about the performance of sort.

Setup: Dell Dimension 3000, Suse 10, 1GB ram, PostgreSQL 8.1 RC 1 with
PostGIS, 1 built-in 80 GB IDE drive, 1 SATA Seagate 400GB drive. The
IDE drive has the OS and the WAL files, the SATA drive the database.
From hdparm the max IO for the IDE drive is about 50MB/s and the SATA
drive is about 65MB/s. Thus it is a very low-end machine - but it is
used just for development (i.e., it is not a production machine) and the
only thing it does is run a PostgreSQL database.

I have a staging table called completechain that holds US TIGER data
(i.e., streets and addresses for the US). The table is approximately
18GB. It's big because there is a lot of data, but also because the
table is not normalized (it comes that way).

I want to extract data out of this table, with the most important
values being stored in a column called tlid. The tlid field is an
integer, and the values are 98% unique. There is a second column called
ogc_fid which is unique (it is a serial field). I need to extract the
unique TLIDs (it doesn't matter which duplicates I discard). To do this
I am running this query:

SELECT tlid, min(ogc_fid)
FROM completechain
GROUP BY tlid;

The results from explain analyze are:

"GroupAggregate (cost=10400373.80..11361807.88 rows=48071704 width=8)
(actual time=7311682.715..8315746.835 rows=47599910 loops=1)"
" -> Sort (cost=10400373.80..10520553.06 rows=48071704 width=8)
(actual time=7311682.682..7972304.777 rows=48199165 loops=1)"
" Sort Key: tlid"
" -> Seq Scan on completechain (cost=0.00..2228584.04
rows=48071704 width=8) (actual time=27.514..773245.046 rows=48199165
loops=1)"
"Total runtime: 8486057.185 ms"

Doing a similar query produces the same results:

SELECT DISTINCT ON (tlid) tlid, ogc_fid
FROM completechain;

Note it takes over 10 times longer to do the sort than the full
sequential scan.

Should I expect results like this? I realize that the computer is quite
low-end and is very IO bound for this query, but I'm still surprised
that the sort operation takes so long.

Out of curiosity, I set up an Oracle database on the same machine with
the same data and ran the same query. Oracle was over an order of
magnitude faster. Looking at its query plan, it avoided the sort by
using "HASH GROUP BY." Does such a construct exist in PostgreSQL (I see
only hash joins)?

Also, as an experiment, I forced Oracle to do a sort by running this query:

SELECT tlid, min(ogc_fid)
FROM completechain
GROUP BY tlid
ORDER BY tlid;

Even with this, it was more than an order of magnitude faster than
PostgreSQL, which makes me think I have somehow misconfigured PostgreSQL
(see the relevant parts of postgresql.conf below).

Any idea/help appreciated.

Thanks,

Charlie

-------------------------------

#---------------------------------------------------------------------------
# RESOURCE USAGE (except WAL)
#---------------------------------------------------------------------------

shared_buffers = 40000 # 40000 buffers * 8192 bytes/buffer = 327,680,000 bytes
#shared_buffers = 1000 # min 16 or max_connections*2, 8KB each

temp_buffers = 5000
#temp_buffers = 1000 # min 100, 8KB each
#max_prepared_transactions = 5 # can be 0 or more
# note: increasing max_prepared_transactions costs ~600 bytes of shared memory
# per transaction slot, plus lock space (see max_locks_per_transaction).

work_mem = 16384 # in KB
#work_mem = 1024 # min 64, size in KB

maintenance_work_mem = 262144 # in KB
#maintenance_work_mem = 16384 # min 1024, size in KB
#max_stack_depth = 2048 # min 100, size in KB

# - Free Space Map -

max_fsm_pages = 60000
#max_fsm_pages = 20000 # min max_fsm_relations*16, 6 bytes each

#max_fsm_relations = 1000 # min 100, ~70 bytes each

# - Kernel Resource Usage -

#max_files_per_process = 1000 # min 25
#preload_libraries = ''

# - Cost-Based Vacuum Delay -

#vacuum_cost_delay = 0 # 0-1000 milliseconds
#vacuum_cost_page_hit = 1 # 0-10000 credits
#vacuum_cost_page_miss = 10 # 0-10000 credits
#vacuum_cost_page_dirty = 20 # 0-10000 credits
#vacuum_cost_limit = 200 # 0-10000 credits

# - Background writer -

#bgwriter_delay = 200 # 10-10000 milliseconds between rounds
#bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5 # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333 # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5 # 0-1000 buffers max written/round

#---------------------------------------------------------------------------
# WRITE AHEAD LOG
#---------------------------------------------------------------------------

# - Settings -

fsync = on # turns forced synchronization on or off
#wal_sync_method = fsync # the default is the first option
# supported by the operating system:
# open_datasync
# fdatasync
# fsync
# fsync_writethrough
# open_sync
#full_page_writes = on # recover from partial page writes

wal_buffers = 128
#wal_buffers = 8 # min 4, 8KB each

#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5 # range 1-1000

# - Checkpoints -

checkpoint_segments = 256 # 256 * 16MB = 4,294,967,296 bytes
checkpoint_timeout = 1200 # 1200 seconds (20 minutes)
checkpoint_warning = 30 # in seconds, 0 is off

#checkpoint_segments = 3 # in logfile segments, min 1, 16MB each
#checkpoint_timeout = 300 # range 30-3600, in seconds
#checkpoint_warning = 30 # in seconds, 0 is off

# - Archiving -

#archive_command = '' # command to use to archive a logfile segment

#---------------------------------------------------------------------------
# QUERY TUNING
#---------------------------------------------------------------------------

# - Planner Method Configuration -

#enable_bitmapscan = on
#enable_hashagg = on
#enable_hashjoin = on
#enable_indexscan = on
#enable_mergejoin = on
#enable_nestloop = on
#enable_seqscan = on
#enable_sort = on
#enable_tidscan = on

# - Planner Cost Constants -

effective_cache_size = 80000 # 80000 * 8192 = 655,360,000 bytes
#effective_cache_size = 1000 # typically 8KB each

random_page_cost = 2.5 # units are one sequential page fetch cost
#random_page_cost = 4 # units are one sequential page fetch cost
#cpu_tuple_cost = 0.01 # (same)
#cpu_index_tuple_cost = 0.001 # (same)
#cpu_operator_cost = 0.0025 # (same)

# - Genetic Query Optimizer -

#geqo = on
#geqo_threshold = 12
#geqo_effort = 5 # range 1-10
#geqo_pool_size = 0 # selects default based on effort
#geqo_generations = 0 # selects default based on effort
#geqo_selection_bias = 2.0 # range 1.5-2.0

# - Other Planner Options -

default_statistics_target = 100 # range 1-1000
#default_statistics_target = 10 # range 1-1000
#constraint_exclusion = off
#from_collapse_limit = 8
#join_collapse_limit = 8 # 1 disables collapsing of explicit JOINs

#---------------------------------------------------------------------------
# RUNTIME STATISTICS
#---------------------------------------------------------------------------

# - Statistics Monitoring -

#log_parser_stats = off
#log_planner_stats = off
#log_executor_stats = off
#log_statement_stats = off

# - Query/Index Statistics Collector -

stats_start_collector = on
stats_command_string = on
stats_block_level = on
stats_row_level = on

#stats_start_collector = on
#stats_command_string = off
#stats_block_level = off
#stats_row_level = off
#stats_reset_on_server_start = off

#---------------------------------------------------------------------------
# AUTOVACUUM PARAMETERS
#---------------------------------------------------------------------------

autovacuum = true
autovacuum_naptime = 600

#autovacuum = false # enable autovacuum subprocess?
#autovacuum_naptime = 60 # time between autovacuum runs, in secs
#autovacuum_vacuum_threshold = 1000 # min # of tuple updates before vacuum
#autovacuum_analyze_threshold = 500 # min # of tuple updates before analyze
#autovacuum_vacuum_scale_factor = 0.4 # fraction of rel size before vacuum
#autovacuum_analyze_scale_factor = 0.2 # fraction of rel size before analyze
#autovacuum_vacuum_cost_delay = -1 # default vacuum cost delay for autovac, -1 means use vacuum_cost_delay
#autovacuum_vacuum_cost_limit = -1 # default vacuum cost limit for autovac, -1 means use vacuum_cost_limit

----------------------

CREATE TABLE tiger.completechain
(
  ogc_fid int4 NOT NULL DEFAULT nextval('completechain_ogc_fid_seq'::regclass),
  module varchar(8) NOT NULL,
  tlid int4 NOT NULL,
  side1 int4,
  source varchar(1) NOT NULL,
  fedirp varchar(2),
  fename varchar(30),
  fetype varchar(4),
  fedirs varchar(2),
  cfcc varchar(3) NOT NULL,
  fraddl varchar(11),
  toaddl varchar(11),
  fraddr varchar(11),
  toaddr varchar(11),
  friaddl varchar(1),
  toiaddl varchar(1),
  friaddr varchar(1),
  toiaddr varchar(1),
  zipl int4,
  zipr int4,
  aianhhfpl int4,
  aianhhfpr int4,
  aihhtlil varchar(1),
  aihhtlir varchar(1),
  census1 varchar(1),
  census2 varchar(1),
  statel int4,
  stater int4,
  countyl int4,
  countyr int4,
  cousubl int4,
  cousubr int4,
  submcdl int4,
  submcdr int4,
  placel int4,
  placer int4,
  tractl int4,
  tractr int4,
  blockl int4,
  blockr int4,
  wkb_geometry public.geometry NOT NULL,
  CONSTRAINT enforce_dims_wkb_geometry CHECK (ndims(wkb_geometry) = 2),
  CONSTRAINT enforce_geotype_wkb_geometry CHECK (geometrytype(wkb_geometry) = 'LINESTRING'::text OR wkb_geometry IS NULL),
  CONSTRAINT enforce_srid_wkb_geometry CHECK (srid(wkb_geometry) = 4269)
)
WITHOUT OIDS;
ALTER TABLE tiger.completechain OWNER TO postgres;


From: Richard Huxton <dev(at)archonet(dot)com>
To: Charlie Savage <cfis(at)interserv(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Sort performance on large tables
Date: 2005-11-08 11:14:41
Message-ID: 437088A1.8090304@archonet.com
Lists: pgsql-performance

Charlie Savage wrote:
> Hi everyone,
>
> I have a question about the performance of sort.

> Note it takes over 10 times longer to do the sort than the full
> sequential scan.
>
> Should I expect results like this? I realize that the computer is quite
> low-end and is very IO bound for this query, but I'm still surprised
> that the sort operation takes so long.

The sort will be spilling to disk, which will grind your I/O to a halt.

> work_mem = 16384 # in KB

Try upping this. You should be able to issue "set work_mem = 100000"
before running your query IIRC. That should let PG do its sorting in
larger chunks.
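
For instance, something like this before the big query - a sketch; the
value is in KB, so pick it to fit within your 1GB of RAM:

SET work_mem = 100000;   -- session-local; no postgresql.conf edit or restart needed
EXPLAIN ANALYZE
SELECT tlid, min(ogc_fid)
FROM completechain
GROUP BY tlid;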

Also, if your most common access pattern is ordered via tlid look into
clustering the table on that.
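
Something like the following - a sketch; the index name is invented,
and CLUSTER takes an exclusive lock and rewrites the whole 18GB table,
so expect it to run for a while:

CREATE INDEX completechain_tlid_idx ON completechain (tlid);
CLUSTER completechain_tlid_idx ON completechain;   -- 8.1 syntax: CLUSTER indexname ON tablename
ANALYZE completechain;
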
--
Richard Huxton
Archonet Ltd


From: Charlie Savage <cfis(at)interserv(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Sort performance on large tables
Date: 2005-11-08 22:06:04
Message-ID: dkr7ga$18a4$1@news.hub.org
Lists: pgsql-performance

Thanks everyone for the feedback.

I tried increasing work_mem:

set work_mem to 300000;

select tlid, min(ogc_fid)
from completechain
group by tlid;

The results are:

"GroupAggregate (cost=9041602.80..10003036.88 rows=48071704 width=8)
(actual time=4371749.523..5106162.256 rows=47599910 loops=1)"
" -> Sort (cost=9041602.80..9161782.06 rows=48071704 width=8) (actual
time=4371690.894..4758660.433 rows=48199165 loops=1)"
" Sort Key: tlid"
" -> Seq Scan on completechain (cost=0.00..2228584.04
rows=48071704 width=8) (actual time=49.518..805234.970 rows=48199165
loops=1)"
"Total runtime: 5279988.127 ms"

Thus the time decreased from 8486 seconds to 5279 seconds - which is a
nice improvement. However, that still leaves PostgreSQL about 9 times
slower.

I tried increasing work_mem up to 500000, but at that point the machine
started using its swap partition and performance degraded back to the
original values.

Charlie

Richard Huxton wrote:
> Charlie Savage wrote:
>> Hi everyone,
>>
>> I have a question about the performance of sort.
>
>> Note it takes over 10 times longer to do the sort than the full
>> sequential scan.
>>
>> Should I expect results like this? I realize that the computer is
>> quite low-end and is very IO bound for this query, but I'm still
>> surprised that the sort operation takes so long.
>
> The sort will be spilling to disk, which will grind your I/O to a halt.
>
>> work_mem = 16384 # in KB
>
> Try upping this. You should be able to issue "set work_mem = 100000"
> before running your query IIRC. That should let PG do its sorting in
> larger chunks.
>
> Also, if your most common access pattern is ordered via tlid look into
> clustering the table on that.



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Charlie Savage <cfis(at)interserv(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Sort performance on large tables
Date: 2005-11-08 22:26:21
Message-ID: 9449.1131488781@sss.pgh.pa.us
Lists: pgsql-performance

Charlie Savage <cfis(at)interserv(dot)com> writes:
> Thus the time decreased from 8486 seconds to 5279 seconds - which is a
> nice improvement. However, that still leaves PostgreSQL about 9 times
> slower.

BTW, what data type are you sorting, exactly? If it's a string type,
what is your LC_COLLATE setting?

regards, tom lane


From: Charlie Savage <cfis(at)interserv(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Sort performance on large tables
Date: 2005-11-08 23:47:10
Message-ID: dkrddq$22bd$1@news.hub.org
Lists: pgsql-performance

It's an int4.

Charlie

Tom Lane wrote:
> Charlie Savage <cfis(at)interserv(dot)com> writes:
>> Thus the time decreased from 8486 seconds to 5279 seconds - which is a
>> nice improvement. However, that still leaves PostgreSQL about 9 times
>> slower.
>
> BTW, what data type are you sorting, exactly? If it's a string type,
> what is your LC_COLLATE setting?
>
> regards, tom lane


From: Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To: Charlie Savage <cfis(at)interserv(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Sort performance on large tables
Date: 2005-11-09 01:49:39
Message-ID: 437155B3.3050507@familyhealth.com.au
Lists: pgsql-performance

I'd set up a trigger to maintain summary tables perhaps...
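
Roughly along these lines - an untested sketch with invented names,
assuming plpgsql is installed; note it adds work to every insert on the
big table:

-- Hypothetical summary table: one row per distinct tlid, smallest ogc_fid kept.
CREATE TABLE tlid_summary (
    tlid    int4 PRIMARY KEY,
    ogc_fid int4 NOT NULL
);

CREATE FUNCTION maintain_tlid_summary() RETURNS trigger AS $$
BEGIN
    -- If this tlid is already present, keep the smaller ogc_fid...
    UPDATE tlid_summary
       SET ogc_fid = NEW.ogc_fid
     WHERE tlid = NEW.tlid
       AND ogc_fid > NEW.ogc_fid;
    -- ...and insert the tlid if it has never been seen before.
    INSERT INTO tlid_summary (tlid, ogc_fid)
    SELECT NEW.tlid, NEW.ogc_fid
    WHERE NOT EXISTS (SELECT 1 FROM tlid_summary WHERE tlid = NEW.tlid);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER completechain_summary
    AFTER INSERT ON completechain
    FOR EACH ROW EXECUTE PROCEDURE maintain_tlid_summary();

The extract then becomes a plain "SELECT tlid, ogc_fid FROM tlid_summary;".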

Chris

Charlie Savage wrote:
> Thanks everyone for the feedback.
>
> I tried increasing work_mem:
>
> set work_mem to 300000;
>
> select tlid, min(ogc_fid)
> from completechain
> group by tlid;
>
> The results are:
>
> "GroupAggregate (cost=9041602.80..10003036.88 rows=48071704 width=8)
> (actual time=4371749.523..5106162.256 rows=47599910 loops=1)"
> " -> Sort (cost=9041602.80..9161782.06 rows=48071704 width=8) (actual
> time=4371690.894..4758660.433 rows=48199165 loops=1)"
> " Sort Key: tlid"
> " -> Seq Scan on completechain (cost=0.00..2228584.04
> rows=48071704 width=8) (actual time=49.518..805234.970 rows=48199165
> loops=1)"
> "Total runtime: 5279988.127 ms"
>
> Thus the time decreased from 8486 seconds to 5279 seconds - which is a
> nice improvement. However, that still leaves PostgreSQL about 9 times
> slower.
>
> I tried increasing work_mem up to 500000, but at that point the machine
> started using its swap partition and performance degraded back to the
> original values.
>
> Charlie
>
>
> Richard Huxton wrote:
> > Charlie Savage wrote:
> >> Hi everyone,
> >>
> >> I have a question about the performance of sort.
> >
> >> Note it takes over 10 times longer to do the sort than the full
> >> sequential scan.
> >>
> >> Should I expect results like this? I realize that the computer is
> >> quite low-end and is very IO bound for this query, but I'm still
> >> surprised that the sort operation takes so long.
> >
> > The sort will be spilling to disk, which will grind your I/O to a halt.
> >
> >> work_mem = 16384 # in KB
> >
> > Try upping this. You should be able to issue "set work_mem = 100000"
> > before running your query IIRC. That should let PG do its sorting in
> > larger chunks.
> >
> > Also, if your most common access pattern is ordered via tlid look into
> > clustering the table on that.


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Charlie Savage <cfis(at)interserv(dot)com>, Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Sort performance on large tables
Date: 2005-11-09 09:35:55
Message-ID: 1131528955.8300.2101.camel@localhost.localdomain
Lists: pgsql-performance

On Tue, 2005-11-08 at 00:05 -0700, Charlie Savage wrote:

> Setup: Dell Dimension 3000, Suse 10, 1GB ram, PostgreSQL 8.1 RC 1 with

> I want to extract data out of this table, with the most important
> values being stored in a column called tlid. The tlid field is an
> integer, and the values are 98% unique. There is a second column called
> ogc_fid which is unique (it is a serial field). I need to extract the
> unique TLIDs (it doesn't matter which duplicates I discard). To do this
> I am running this query:
>
> SELECT tlid, min(ogc_fid)
> FROM completechain
> GROUP BY tlid;
>
> The results from explain analyze are:
>
> "GroupAggregate (cost=10400373.80..11361807.88 rows=48071704 width=8)
> (actual time=7311682.715..8315746.835 rows=47599910 loops=1)"
> " -> Sort (cost=10400373.80..10520553.06 rows=48071704 width=8)
> (actual time=7311682.682..7972304.777 rows=48199165 loops=1)"
> " Sort Key: tlid"
> " -> Seq Scan on completechain (cost=0.00..2228584.04
> rows=48071704 width=8) (actual time=27.514..773245.046 rows=48199165
> loops=1)"
> "Total runtime: 8486057.185 ms"

> Should I expect results like this? I realize that the computer is quite
> low-end and is very IO bound for this query, but I'm still surprised
> that the sort operation takes so long.
>
> Out of curiosity, I set up an Oracle database on the same machine with
> the same data and ran the same query. Oracle was over an order of
> magnitude faster. Looking at its query plan, it avoided the sort by
> using "HASH GROUP BY." Does such a construct exist in PostgreSQL (I see
> only hash joins)?

PostgreSQL can do HashAggregates as well as GroupAggregates, just like
Oracle. HashAggs avoid the sort phase, so would improve performance
considerably. The difference in performance you are getting is because
of the different plan used. Did you specifically do anything to Oracle
to help it get that plan, or was it a pure out-of-the-box install (or
maybe even a "set this up for Data Warehousing" install)?

To get a HashAgg plan, you need to be able to fit all of the unique
values in memory. That would be 98% of 48071704 rows, each 8+ bytes
wide, giving a HashAgg memory sizing of over 375MB. You must allocate
memory of the next power of two above the level you want, so we would
need to allocate 512MB to work_mem before it would consider using a
HashAgg.

Can you let us know how high you have to set work_mem before an EXPLAIN
(not EXPLAIN ANALYZE) chooses the HashAgg plan?
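
Something along these lines, stepping the value up until the top node
of the EXPLAIN output flips from "GroupAggregate" to "HashAggregate" -
a sketch; 524288 KB = 512MB, per the estimate above:

SET work_mem = 524288;
EXPLAIN
SELECT tlid, min(ogc_fid)
FROM completechain
GROUP BY tlid;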

Please be aware that publishing Oracle performance results is against
the terms of their licence and we seek to be both fair and legitimate,
especially within this public discussion forum.

Best Regards, Simon Riggs


From: Charlie Savage <cfis(at)interserv(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Sort performance on large tables
Date: 2005-11-09 17:13:46
Message-ID: 43722E4A.8060904@interserv.com
Lists: pgsql-performance

Hi Simon,

Thanks for the response.

> PostgreSQL can do HashAggregates as well as GroupAggregates, just like
> Oracle. HashAggs avoid the sort phase, so would improve performance
> considerably. The difference in performance you are getting is because
> of the different plan used. Did you specifically do anything to Oracle
> to help it get that plan, or was it a pure out-of-the-box install (or
> maybe even a "set this up for Data Warehousing" install)?

It was an out-of-the-box plan with the standard database install option
(wasn't a Data Warehousing install).

> Can you let us know how high you have to set work_mem before an EXPLAIN
> (not EXPLAIN ANALYZE) chooses the HashAgg plan?

The planner picked a HashAggregate only when I set work_mem to 2097151 -
which I gather is the maximum allowed value according to a message
returned from the server.

> Please be aware that publishing Oracle performance results is against
> the terms of their licence and we seek to be both fair and legitimate,
> especially within this public discussion forum.

Sorry, I didn't realize - I'll be more vague next time.

Charlie