Why hash indexes suck

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Hegedus, Tamas (dot)" <Hegedus(dot)Tamas(at)mayo(dot)edu>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: TimeOf(Subselects|Joins)FromLargeTables?
Date: 2004-06-04 20:28:28
Message-ID: D90A5A6C612A39408103E6ECDD77B829BC018D@voyager.corporate.connx.com
Lists: pgsql-general pgsql-hackers

> -----Original Message-----
> From: Hegedus, Tamas . [mailto:Hegedus(dot)Tamas(at)mayo(dot)edu]
> Sent: Friday, June 04, 2004 12:37 PM
> To: Dann Corbit
> Subject: RE: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?
>
>
> Dear Dann,
>
> However, although the pgsql developers do not recommend using hash
> indexes, the speed doubled when I used hash indexes.
> ("Note: Testing has shown PostgreSQL's hash indexes to
> perform no better than B-tree indexes, and the index size and
> build time for hash indexes is much worse. For these reasons,
> hash index use is presently discouraged.")
>
> Question:
> If I created hash indexes, the query planner did not use
> them; it used the b-tree indexes. I had to drop the b-tree
> indexes to use the hash indexes. Why? Can I have 2 types of
> indexes on the same column and force the planner to use what I want?

Hash indexes and btree indexes work very differently. Normally, hash
indexes are best for equality lookups and joins, and btree indexes are best
for any comparison that is not based on exact matching. So (for instance):

WHERE colname = some_constant
or
WHERE colname IN (constant1, constant2, constant3,...,constantk)

would be better handled by a hash index, and

WHERE colname <= some_constant
or
WHERE colname LIKE 'const%'

would be better handled by a btree.
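
For instance (just a sketch; the index names here are made up), you could
keep both kinds of index on the same column:

CREATE INDEX ix_kwx_acc_hash ON kwx USING hash (kw_acc);   -- equality lookups
CREATE INDEX ix_kwx_acc_bt   ON kwx USING btree (kw_acc);  -- range comparisons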

In most database systems I have used, you could create any sort of index on
any columns in any combination, up to some limit (e.g. 16 columns max).

The planner should be smart enough to pick the best plan. I think that
the planner is simply making a mistake here. Perhaps the PostgreSQL
gurus can suggest a way to force the plan.
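
As far as I know there is no way to tell the planner "use this particular
index", but for experimentation you can switch off whole plan types so that
it falls back to the plan you want to see (a rough sketch; these are real
run-time parameters and only affect the current session):

SET enable_seqscan = off;     -- already used in the original post
SET enable_mergejoin = off;   -- likewise enable_nestloop, enable_hashjoin
EXPLAIN ANALYZE
  SELECT p.name, p.seq FROM prots p, kwx k
  WHERE p.fid = k.fid AND k.kw_acc = 812;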

There seems to be something seriously defective with hash indexes in old
versions of PostgreSQL. I thought that it had been repaired starting with
7.4, but maybe it is not completely fixed yet.

> Thanks,
> Tamas
>
> ===================================================================
> explain ANALYZE select p.name, p.seq from prots p, kwx k
> where p.fid=k.fid AND k.kw_acc=812;
>
> QUERY PLAN
> ---------------------------------------------------------------------------
> Nested Loop  (cost=0.00..662253.63 rows=69180 width=344) (actual time=43.337..36076.045 rows=78050 loops=1)
>   ->  Index Scan using ix_kwx_acc on kwx k  (cost=0.00..245382.38 rows=69179 width=4) (actual time=0.109..422.159 rows=78050 loops=1)
>         Index Cond: (kw_acc = 812)
>   ->  Index Scan using prt_fid_ix on prots p  (cost=0.00..6.01 rows=1 width=348) (actual time=0.414..0.450 rows=1 loops=78050)
>         Index Cond: (p.fid = "outer".fid)
> Total runtime: 36134.105 ms
> ===================================================================
>
> -----Original Message-----
> From: Dann Corbit [mailto:DCorbit(at)connx(dot)com]
> Sent: Thursday, June 03, 2004 7:36 PM
> To: Hegedus, Tamas .
> Subject: RE: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?
>
>
> A subselect limits the optimizer's choice of which table to process
> first.
>
> It might be worth trying a hashed index on fid for both
> tables, since you are doing an equal join.
>
> > -----Original Message-----
> > From: Hegedus, Tamas . [mailto:Hegedus(dot)Tamas(at)mayo(dot)edu]
> > Sent: Thursday, June 03, 2004 7:33 PM
> > To: Dann Corbit
> > Subject: RE: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?
> >
> >
> > A little bit better. But I do not understand why?
> >
> > EXPLAIN ANALYZE select p.name, p.seq from prots p, kwx k
> > where p.fid=k.fid AND k.kw_acc=812;
> >
> > QUERY PLAN
> > ---------------------------------------------------------------------------
> > Merge Join  (cost=0.00..160429.66 rows=84473 width=349) (actual time=0.263..69192.828 rows=78050 loops=1)
> >   Merge Cond: ("outer".fid = "inner".fid)
> >   ->  Index Scan using ix_kwx_fid on kwx k  (cost=0.00..44987.55 rows=84473 width=4) (actual time=0.137..5675.701 rows=78050 loops=1)
> >         Filter: (kw_acc = 812)
> >   ->  Index Scan using prots_pkey on prots p  (cost=0.00..112005.24 rows=981127 width=353) (actual time=0.059..61816.725 rows=1210377 loops=1)
> > Total runtime: 69251.488 ms
> >
> >
> > -----Original Message-----
> > From: Dann Corbit [mailto:DCorbit(at)connx(dot)com]
> > Sent: Thursday, June 03, 2004 6:59 PM
> > To: Hegedus, Tamas .; pgsql-general(at)postgresql(dot)org
> > Subject: RE: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?
> >
> >
> > How does this query perform:
> >
> > SELECT p.name, p.seq
> > FROM prots p, kwx k
> > WHERE
> > p.fid=k.fid
> > AND
> > k.kw_acc=812
> > ;
> >
> > > -----Original Message-----
> > > From: Hegedus, Tamas . [mailto:Hegedus(dot)Tamas(at)mayo(dot)edu]
> > > Sent: Thursday, June 03, 2004 6:48 PM
> > > To: 'pgsql-general(at)postgresql(dot)org'
> > > Subject: [GENERAL] TimeOf(Subselects|Joins)FromLargeTables?
> > >
> > >
> > > Dear All,
> > >
> > > I am a biologist and I do not know what to expect from an RDB
> > > (PgSQL). I have large tables: 1215607 rows in prots, 2184596 rows
> > > in kwx (see table details below). I would like to do something
> > > like this:
> > >
> > > SELECT name, seq FROM prots WHERE fid in (SELECT fid FROM kwx
> > > WHERE kw_acc=812);
> > >
> > > After executing this (either as a subquery or as a join) the
> > > best/fastest result I had (SET enable_seqscan=off) was 83643.482
> > > ms (see EXPLAIN ANALYZE below).
> > >
> > > The two (similar) parts of this query execute much faster:
> > > SELECT fid FROM kwx WHERE kw_acc=812 -- takes 302 ms, n(rows)=78050
> > > SELECT name, seq FROM prots WHERE fid < 80000 -- takes 1969.231 ms
> > >
> > > Is this realistic? OK?
> > > If not: how can I increase the speed by fine-tuning the RDB
> > > (indexes, run-time parameters) or my SQL query? (It just came to my
> > > mind: if I decrease the number of columns in the prots table (to
> > > have only 3 fields (fid, name, seq) instead of 20 columns), then
> > > the prots table will have a smaller file size on disk, so the table
> > > may need fewer disk page fetches and queries may be faster. Is this
> > > true?)
> > >
> > > Thanks for your help!
> > > Tamas
> > >
> > > ===============================================
> > > Table "public.prots"
> > > Column | Type | Modifiers
> > > -----------+----------------------+----------
> > > fid | integer | not null
> > > name | character varying(10) | not null
> > > [...other 17 columns...]
> > > seq | text |
> > > Indexes:
> > > "prots_pkey" primary key, btree (fid)
> > > "ix_prots_acc" unique, btree (acc)
> > > "ix_prots_name" unique, btree (name)
> > > "ix_prots_class" btree ("class")
> > > ===============================================
> > > Table "public.kwx"
> > > Column | Type | Modifiers
> > > --------+--------+----------
> > > fid | integer |
> > > kw_acc | integer |
> > > Indexes:
> > > "ix_kwx_acc" btree (kw_acc)
> > > "ix_kwx_fid" btree (fid)
> > > Foreign-key constraints:
> > > "fk_kws_acc" FOREIGN KEY (kw_acc) REFERENCES kw_ref(kw_acc)
> > > "fk_kws_fid" FOREIGN KEY (fid) REFERENCES prots(fid)
> > > ===============================================
> > >
> > > EXPLAIN ANALYZE SELECT name, seq from prots inner join kwx on
> > > (prots.fid=kwx.fid) where kwx.kw_acc = 812;
> > >
> > > QUERY PLAN
> > > ---------------------------------------------------------------------------
> > > Merge Join  (cost=0.00..160429.66 rows=84473 width=349) (actual time=29.039..83505.629 rows=78050 loops=1)
> > >   Merge Cond: ("outer".fid = "inner".fid)
> > >   ->  Index Scan using ix_kwx_fid on kwx  (cost=0.00..44987.55 rows=84473 width=4) (actual time=18.893..5730.468 rows=78050 loops=1)
> > >         Filter: (kw_acc = 812)
> > >   ->  Index Scan using prots_pkey on prots  (cost=0.00..112005.24 rows=981127 width=353) (actual time=0.083..76059.235 rows=1210377 loops=1)
> > > Total runtime: 83643.482 ms
> > > (6 rows)
> > >
> > >
> > >


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Dann Corbit" <DCorbit(at)connx(dot)com>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Why hash indexes suck
Date: 2004-06-05 20:03:39
Message-ID: 499.1086465819@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

"Dann Corbit" <DCorbit(at)connx(dot)com> writes:
> There seems to be something seriously defective with hash indexes in old
> versions of PostgreSQL.

They still suck; I'm not aware of any situation where I'd recommend hash
over btree indexes in Postgres. I think we have fixed the hash indexes'
deadlock problems as of 7.4, but there's still no real performance
advantage.

I just had an epiphany as to the probable reason why the performance sucks.
It's this: the hash bucket size is the same as the page size (ie, 8K).

This means that if you have only one or a few items per bucket, the
information density is awful, and you lose big on I/O requirements
compared to a btree index. On the other hand, if you have enough
items per bucket to make the storage density competitive, you will
be doing linear searches through dozens if not hundreds of items that
are all in the same bucket, and you lose on CPU time (compared to btree
which can do binary search to find an item within a page).
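
(As a rough illustration, with made-up but plausible numbers: an index entry
for an int4 key is on the order of 20 bytes, so one 8K bucket could hold
roughly 400 entries. With one entry per bucket you do an 8K read to get at
~20 useful bytes, where a btree packs hundreds of entries into each leaf
page; and if you do fill the bucket, an equality probe scans up to ~400
entries linearly where a btree needs only about log2(400), i.e. ~9,
comparisons per page.)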

It would probably be interesting to look into making the hash bucket
size be just a fraction of a page, with the intent of having no more
than a couple dozen items per bucket. I'm not sure what the
implications would be for intra-page storage management or index locking
conventions, but offhand it seems like there wouldn't be any
insurmountable problems.

I'm not planning on doing this myself, just throwing it out as a
possible TODO item for anyone who's convinced that hash indexes ought
to work better than they do.

regards, tom lane


From: Sailesh Krishnamurthy <sailesh(at)cs(dot)berkeley(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dann Corbit <DCorbit(at)connx(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why hash indexes suck
Date: 2004-06-05 20:15:25
Message-ID: mjq65a5ep7m.fsf@cs.berkeley.edu
Lists: pgsql-general pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> This means that if you have only one or a few items per
Tom> bucket, the information density is awful, and you lose big on
Tom> I/O requirements compared to a btree index. On the other
Tom> hand, if you have enough items per bucket to make the storage
Tom> density competitive, you will be doing linear searches
Tom> through dozens if not hundreds of items that are all in the
Tom> same bucket, and you lose on CPU time (compared to btree
Tom> which can do binary search to find an item within a page).

This is probably a crazy idea, but is it possible to organize the data
in a page of a hash bucket as a binary tree? Then you wouldn't lose
wrt CPU time, at least.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: sailesh(at)cs(dot)berkeley(dot)edu
Cc: Dann Corbit <DCorbit(at)connx(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why hash indexes suck
Date: 2004-06-05 20:31:12
Message-ID: 689.1086467472@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Sailesh Krishnamurthy <sailesh(at)cs(dot)berkeley(dot)edu> writes:
> This is probably a crazy idea, but is it possible to organize the data
> in a page of a hash bucket as a binary tree ?

Only if you want to require a hash opclass to supply ordering operators,
which sort of defeats the purpose I think. Hash is only supposed to
need equality not ordering.
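
You can see the asymmetry in what the two access methods demand from an
opclass (a sketch; the opclass name is made up, but hashint4 and btint4cmp
are the built-in int4 support functions):

CREATE OPERATOR CLASS my_int4_hash_ops
    FOR TYPE int4 USING hash AS
        OPERATOR 1 =,
        FUNCTION 1 hashint4(int4);

A btree opclass would instead have to supply all five ordering operators
(<, <=, =, >=, >) plus a comparison support function such as btint4cmp.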

regards, tom lane


From: Jeff Davis <jdavis-pgsql(at)empires(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why hash indexes suck
Date: 2004-06-05 22:30:34
Message-ID: 1086474634.13110.199.camel@jeff
Lists: pgsql-general pgsql-hackers

On Sat, 2004-06-05 at 13:31, Tom Lane wrote:
> Only if you want to require a hash opclass to supply ordering operators,
> which sort of defeats the purpose I think. Hash is only supposed to
> need equality not ordering.

Is it possible to assume some kind of ordering (i.e. strcmp() the binary
data of the type) as long as it's consistent?

Regards,
Jeff


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <jdavis-pgsql(at)empires(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why hash indexes suck
Date: 2004-06-06 00:15:54
Message-ID: 12412.1086480954@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Jeff Davis <jdavis-pgsql(at)empires(dot)org> writes:
> On Sat, 2004-06-05 at 13:31, Tom Lane wrote:
>> Only if you want to require a hash opclass to supply ordering operators,
>> which sort of defeats the purpose I think. Hash is only supposed to
>> need equality not ordering.

> Is it possible to assume some kind of ordering (i.e. strcmp() the binary
> data of the type) as long as it's consistent?

Not really; that would assume that equality of the datatype is the same
as bitwise equality, which is not the case in general (consider -0
versus +0 in IEEE floats, or any datatype with pad bits in the struct).
Some time ago we got rid of the assumption that hash should hash on all
the bits without any type-specific intervention, and I don't want to
reintroduce those bugs.

We could safely sort on the hash value, but I'm not sure how effective
that would be, considering that we're talking about values that already
hashed into the same bucket --- there's likely not to be very many
distinct hash values there.

regards, tom lane


From: pgsql(at)mohawksoft(dot)com
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: sailesh(at)cs(dot)berkeley(dot)edu, "Dann Corbit" <dcorbit(at)connx(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Why hash indexes suck
Date: 2004-06-06 14:02:06
Message-ID: 16596.24.91.171.78.1086530526.squirrel@mail.mohawksoft.com
Lists: pgsql-general pgsql-hackers

> Sailesh Krishnamurthy <sailesh(at)cs(dot)berkeley(dot)edu> writes:
>> This is probably a crazy idea, but is it possible to organize the data
>> in a page of a hash bucket as a binary tree ?
>
> Only if you want to require a hash opclass to supply ordering operators,
> which sort of defeats the purpose I think. Hash is only supposed to
> need equality not ordering.
>

A btree is frequently used within the buckets of a hash table, especially
if you expect to have a large number of items in each bucket.

PostgreSQL could create a hash index that is a single top-level hash table
with each hash bucket being a btree. You eliminate a number of btree
searches by hashing, and then fall into btree performance after the first
hash lookup. The administrator should be able to gather statistics about
the population of the hash buckets and rehash if performance begins to
behave like a btree or the data is not distributed evenly. Given a proper
selection of the initial number of buckets, such a hash table could blow
btree out of the water. Given a poor selection of the number of buckets,
e.g. 1, the hash will behave no worse than a btree.

Also, it would be helpful to be able to specify a hash function during the
create or rehash: for a specific class of data, extraction of the random
elements can be more efficient and/or effective given knowledge of the
data. Think of something like bar codes: there are portions of the data
which are usually the same and portions of the data which are usually
different. Focusing on the portions of the data which tend to differ will
generally produce a more evenly distributed hash.
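
There is no way to hand the index a custom hash function today, but a crude
approximation of the same idea (a sketch only; the table, column, and offset
are invented, and queries then have to compare the same expression) is an
expression index over just the part of the key that actually varies:

CREATE TABLE items (barcode text PRIMARY KEY, price numeric);
-- hash only the trailing, high-variance portion of the code
CREATE INDEX ix_items_tail ON items USING hash (substr(barcode, 8));
-- the predicate must use the same expression for the index to apply
SELECT price FROM items WHERE substr(barcode, 8) = substr('4006381333931', 8);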


From: Kenneth Marshall <ktm(at)it(dot)is(dot)rice(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org, xu(at)cs(dot)wisc(dot)edu
Subject: Re: Re: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-12 13:21:17
Message-ID: 20040812132117.GB16756@it.is.rice.edu
Lists: pgsql-general pgsql-hackers

> "Min Xu (Hsu)" <xu(at)cs(dot)wisc(dot)edu> writes:
> > It seems to me this is an interesting phenomenon of interactions between
> > frequent events of transaction commits and infrequent events of system
> > checkpoints. A potential alternative solution to adding a new shared
> > lock to the frequent commit operation is to let the infrequent
> > checkpoint operation take more overhead. I suppose acquiring/releasing
> > an extra lock for each commit would incur extra performance overhead,
> > even when the lock is not contended. On the other hand, letting the
> > checkpoint operation acquire some existing locks (exclusively) to
> > effectively disallow committing transactions from interfering with the
> > checkpoint process might be a better solution, since it incurs higher
> > overhead only when necessary.
>
> Unfortunately, there isn't any pre-existing lock that will serve.
> A transaction that is between XLogInsert'ing its COMMIT record and
> updating the shared pg_clog data area does not hold any lock that
> could be used to prevent a checkpoint from starting. (Or it didn't
> until yesterday's patch, anyway.)
>
> I looked briefly at reorganizing the existing code so that we'd do the
> COMMIT XLogInsert while we're holding lock on the shared pg_clog data,
> which would solve the problem without adding any new lock acquisition.
> But this seemed extremely messy to do. Also it would be optimizing
> transaction commit at the cost of pessimizing other uses of pg_clog,
> which might have to wait longer to get at the shared data. Adding the
> new lock has the advantage that we can be sure it's not blocking
> anything we don't want it to block.
>
> Thanks for thinking about the problem though ...
>
> regards, tom lane
>

One problem with a high-traffic LWLock is that it requires a write
to shared memory for both the shared lock and the exclusive lock. On
the increasingly prevalent SMP machines, this will cause the invalidation
of the cache line containing the lock and the consequent reload with its
inherent delay. Would it be possible to use a latch + version number in
this case to minimize this problem by allowing all but the checkpoint to
perform a read-only action on the latch? This should eliminate the
cache-line shenanigans on SMP machines.

Ken Marshall


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kenneth Marshall <ktm(at)is(dot)rice(dot)edu>
Cc: pgsql-hackers(at)postgreSQL(dot)org, xu(at)cs(dot)wisc(dot)edu
Subject: Re: Re: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-12 13:58:56
Message-ID: 28362.1092319136@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Kenneth Marshall <ktm(at)is(dot)rice(dot)edu> writes:
> Would it be possible to use a latch + version number in
> this case to minimize this problem by allowing all but the checkpoint to
> perform a read-only action on the latch?

How would a read-only action work to block out the checkpoint?

More generally, though, this lock is hardly the one I'd be most
concerned about in an SMP situation. It's only taken once per
transaction, while there are others that may be taken many times.
(At least two of these, the WALInsertLock and the lock on shared
pg_clog, will need to be taken again in the process of recording
transaction commit.)

What I'd most like to find is a way to reduce contention for the
BufMgrLock --- there are at least some behavioral patterns in which
that is demonstrably a dominant cost. See past discussions in the
archives ("context swap storm" should find you some recent threads).

regards, tom lane


From: Kenneth Marshall <ktm(at)it(dot)is(dot)rice(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kenneth Marshall <ktm(at)is(dot)rice(dot)edu>, pgsql-hackers(at)postgreSQL(dot)org, xu(at)cs(dot)wisc(dot)edu
Subject: Re: Re: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-12 17:03:36
Message-ID: 20040812170336.GA11862@it.is.rice.edu
Lists: pgsql-general pgsql-hackers

On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
> Kenneth Marshall <ktm(at)is(dot)rice(dot)edu> writes:
> > Would it be possible to use a latch + version number in
> > this case to minimize this problem by allowing all but the checkpoint to
> > perform a read-only action on the latch?
>
> How would a read-only action work to block out the checkpoint?
>
> More generally, though, this lock is hardly the one I'd be most
> concerned about in an SMP situation. It's only taken once per
> transaction, while there are others that may be taken many times.
> (At least two of these, the WALInsertLock and the lock on shared
> pg_clog, will need to be taken again in the process of recording
> transaction commit.)
>
> What I'd most like to find is a way to reduce contention for the
> BufMgrLock --- there are at least some behavioral patterns in which
> that is demonstrably a dominant cost. See past discussions in the
> archives ("context swap storm" should find you some recent threads).
>
> regards, tom lane
>

The latch + version number is used by the checkpoint process. The
other processes can do a read of the latch to determine if it has
been set. This does not cause a cache invalidation hit. If the
latch is set, the competing processes read until it has been
cleared and the version updated. This means that the general case of
no checkpoint does not incur a write and the consequent cache-line
invalidation and reload by all processors on an SMP system.

Ken


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kenneth Marshall <ktm(at)is(dot)rice(dot)edu>
Cc: pgsql-hackers(at)postgreSQL(dot)org, xu(at)cs(dot)wisc(dot)edu
Subject: Re: Re: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-12 17:13:46
Message-ID: 182.1092330826@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Kenneth Marshall <ktm(at)is(dot)rice(dot)edu> writes:
> On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
>> How would a read-only action work to block out the checkpoint?

> The latch + version number is used by the checkpoint process. The
> other processes can do a read of the latch to determine if it has
> been set. This does not cause a cache invalidation hit. If the
> latch is set, the competing processes read until it has been
> cleared and the version updated. This means that the general case of
> no checkpoint does not incur a write and the consequent cache-line
> invalidation and reload by all processors on an SMP system.

Except that reading the latch and finding it clear offers no guarantee
that a checkpoint isn't about to start. The problem is that we are
performing two separate actions (write a COMMIT xlog record and update
transaction status in clog) and we have to prevent a checkpoint from
starting in between those actions. I don't see that there's any way to
do that with a read-only latch.

regards, tom lane


From: Kenneth Marshall <ktm(at)it(dot)is(dot)rice(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kenneth Marshall <ktm(at)is(dot)rice(dot)edu>, pgsql-hackers(at)postgresql(dot)org, xu(at)cs(dot)wisc(dot)edu
Subject: Re: Re: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-12 18:25:17
Message-ID: 20040812182517.GB11862@it.is.rice.edu
Lists: pgsql-general pgsql-hackers

On Thu, Aug 12, 2004 at 01:13:46PM -0400, Tom Lane wrote:
> Kenneth Marshall <ktm(at)is(dot)rice(dot)edu> writes:
> > On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
> >> How would a read-only action work to block out the checkpoint?
>
> > The latch + version number is used by the checkpoint process. The
> > other processes can do a read of the latch to determine if it has
> > been set. This does not cause a cache invalidation hit. If the
> > latch is set, the competing processes read until it has been
> > cleared and the version updated. This means that the general case of
> > no checkpoint does not incur a write and the consequent cache-line
> > invalidation and reload by all processors on an SMP system.
>
> Except that reading the latch and finding it clear offers no guarantee
> that a checkpoint isn't about to start. The problem is that we are
> performing two separate actions (write a COMMIT xlog record and update
> transaction status in clog) and we have to prevent a checkpoint from
> starting in between those actions. I don't see that there's any way to
> do that with a read-only latch.
>
> regards, tom lane

Yes, you are correct. I missed that part of the previous thread. When
I saw "exclusive lock" I thought "latch", since that is what I am
investigating for other performance issues that I am addressing.

Ken


From: "Simon(at)2ndquadrant(dot)com" <simon(at)2ndquadrant(dot)com>
To: "Kenneth Marshall" <ktm(at)it(dot)is(dot)rice(dot)edu>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>, <xu(at)cs(dot)wisc(dot)edu>
Subject: Re: Re: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-13 23:16:55
Message-ID: NOEFLCFHBPDAFHEIPGBOMEGHCCAA.simon@2ndquadrant.com
Lists: pgsql-general pgsql-hackers


On Thu, Aug 12, 2004 at 01:13:46PM -0400, Tom Lane wrote:
> Kenneth Marshall <ktm(at)is(dot)rice(dot)edu> writes:
> > On Thu, Aug 12, 2004 at 09:58:56AM -0400, Tom Lane wrote:
> >> How would a read-only action work to block out the checkpoint?
>
> > The latch + version number is used by the checkpoint process. The
> > other processes can do a read of the latch to determine if it has
> > been set. This does not cause a cache invalidation hit. If the
> > latch is set, the competing processes read until it has been
> > cleared and the version updated. This means that the general case of
> > no checkpoint does not incur a write and the consequent cache-line
> > invalidation and reload by all processors on an SMP system.
>
> Except that reading the latch and finding it clear offers no guarantee
> that a checkpoint isn't about to start. The problem is that we are
> performing two separate actions (write a COMMIT xlog record and update
> transaction status in clog) and we have to prevent a checkpoint from
> starting in between those actions. I don't see that there's any way to
> do that with a read-only latch.
>

...just caught up on this.

ISTM that more heavily loading the checkpoint process IS possible if the
checkpoint uses a two-phase lock. That would replace 1 write lock with 2
lock reads... which is likely to be beneficial for SMP, given that I have
faith that the other two problems you mention will succumb to some solution
in the mid-term. The first lock is an "intent lock", followed by a second,
heavyweight lock just as you now have it.

Committer:
1. prior to COMMIT: reads for an intent lock; if found, then it attempts to
take the heavyweight lock... if that is not possible, then the commit waits
until after the checkpoint, just as you currently suggest
2. prior to updating clog: reads for an intent lock; if found, then takes
the heavyweight lock... if that is not possible, then report a server error

Checkpointer: (straight to step 4 for a shutdown checkpoint)
1. writes an intent lock (it always can)
2. wait for the group commit timeout
3. wait for 0.5 second more
4. begins to wait on an exclusive heavyweight lock, before starting
checkpoint proper

This is not a provably correct state machine, but the error message should
not occur under current "normal" situations. (It is possible that an intent
lock could be written by Checkpointer step 1 after a Committer reads for it
in its step 1, and then a very long delay occurs before Committer step 2,
such that Checkpointer step 4 begins before Committer step 2.) It is very
likely that this would be noticed by Committer step 2 and reported upon, in
the unlikely event that it occurs.

Is a longer term solution for pg to use a background log writer? That would
make group commit much easier to perform automatically without the
false-delay model currently available.

Best Regards, Simon Riggs


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon(at)2ndquadrant(dot)com" <simon(at)2ndquadrant(dot)com>
Cc: "Kenneth Marshall" <ktm(at)is(dot)rice(dot)edu>, pgsql-hackers(at)postgresql(dot)org, xu(at)cs(dot)wisc(dot)edu
Subject: Re: Re: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-13 23:29:43
Message-ID: 29055.1092439783@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

"Simon(at)2ndquadrant(dot)com" <simon(at)2ndquadrant(dot)com> writes:
> This is not a provably correct state machine

I think the discussion ends right there. You are assuming that the
commit is guaranteed to finish in X amount of time, when it is not
possible to make any such guarantee. We are not putting in an
unreliable commit mechanism in order to save a small amount of lock
contention. (As I tried to point out already, but it doesn't seem
to have sunk in: this newly-added lock is not likely to be that much
more contention added to the commit path, seeing that the path of
control it protects already involves taking at least two exclusive
LWLocks. Those locks will likely each cause as much or more SMP
cache thrashing as this one.)

What we could use is a better way to build LWLocks in general. I do not
know how to do that, though, in the face of SMP machines that seem to
fundamentally not have any cheap locking mechanisms...

regards, tom lane


From: "Simon(at)2ndquadrant(dot)com" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Kenneth Marshall" <ktm(at)is(dot)rice(dot)edu>, <pgsql-hackers(at)postgresql(dot)org>, <xu(at)cs(dot)wisc(dot)edu>
Subject: Re: Re: We have got a serious problem with pg_clog/WAL synchronization
Date: 2004-08-14 00:50:24
Message-ID: NOEFLCFHBPDAFHEIPGBOGEHDCCAA.simon@2ndquadrant.com
Lists: pgsql-general pgsql-hackers

> Tom Lane
>
> "Simon(at)2ndquadrant(dot)com" <simon(at)2ndquadrant(dot)com> writes:
> > This is not a provably correct state machine
>
> I think the discussion ends right there.

Yes...

Negative results are worth documenting too, IMHO.

Best Regards, Simon Riggs