[patch] gsoc, improving hash index v2

Lists: pgsql-hackers
From: "Xiao Meng" <mx(dot)cogito(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Kenneth Marshall" <ktm(at)rice(dot)edu>
Subject: [patch] gsoc, improving hash index v2
Date: 2008-07-25 14:26:05
Message-ID: ded849dd0807250726s6c4cc895oabd8579c375a6538@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi, hackers.
I've post a hash patch in a previous thread
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00794.php
I do apologize for the bad readability of previous patch. Thank you all for
your comments.
Here is a new patch which fixed some bugs in the previous one.
I post it here to get some feedback and further suggestion. Any comment is
welcome.
Changes since v1:
- fix bug that it crashed in _h_spool when test big data set
- adjust the target-fillfactor calculation in _hash_metapinit
- remove the HASHVALUE_ONLY macro
- replace _create_hash_desc with _get_hash_desc to get a hard-coded hash
index tuple.
- replace index_getattr with _hash_get_datum to get the hash key datum and
avoid too many calls to _get_hash_desc and index_getattr

Here is what I intend to do.
Todo:
- get the statistics of block access i/o
- write unit tests using pgunitest to test the following:
(Josh Berkus suggested in this thread
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00535.php )
bulk load, both COPY and INSERT
single-row updates, inserts and deletes
batch update by key
batch update by other index
batch delete by key
batch delete by other index
concurrent index updates (64 connections insert/deleting concurrently)

I makes some simple test mentioned here (
http://archives.postgresql.org/pgsql-hackers/2007-09/msg00208.php)
I'll make some test on bigger data set later.
using a word list of 3628800 unique words
The table size is 139MB.
Index BuildTime IndexSize
---- ---- ----
btree 51961.123 ms 93MB
hash 411069.264 ms 2048MB
hash-patch 36288.931 ms 128MB

dict=# SELECT * from hash-dict where word = '0234567891' ;
word
------------
0234567891
(1 row)

Time: 33.960 ms
dict=# SELECT * from btree-dict where word = '0234567891' ;
word
------------
0234567891
(1 row)

Time: 1.662 ms

dict=# SELECT * from hash2-dict where word = '0234567891' ;
word
------------
0234567891
(1 row)

Time: 1.457 ms

At last, there is a problem I encounter.
I'm confused by the function _hash_checkqual.
IMHO, the index tuple only store one column here and key->sk_attno should
always be 1 here.
And scanKeySize should be 1 since we didn't support multi-column hash yet.
Do I make some misunderstanding?
/*
* _hash_checkqual -- does the index tuple satisfy the scan conditions?
*/
bool
_hash_checkqual(IndexScanDesc scan, IndexTuple itup)
{
TupleDesc tupdesc = RelationGetDescr(scan->indexRelation);
ScanKey key = scan->keyData;
int scanKeySize = scan->numberOfKeys;

IncrIndexProcessed();

while (scanKeySize > 0)
{
Datum datum;
bool isNull;
Datum test;

datum = index_getattr(itup,
key->sk_attno,
tupdesc,
&isNull);

/* assume sk_func is strict */
if (isNull)
return false;
if (key->sk_flags & SK_ISNULL)
return false;

test = FunctionCall2(&key->sk_func, datum, key->sk_argument);

if (!DatumGetBool(test))
return false;

key++;
scanKeySize--;
}

return true;
}

Hope to hear from you.
--
Best Regards,
Xiao Meng

DKERC, Harbin Institute of Technology, China
Gtalk: mx(dot)cogito(at)gmail(dot)com
MSN: cnEnder(at)live(dot)com
http://xiaomeng.yo2.cn

Attachment Content-Type Size
hash-v2.patch text/x-diff 18.3 KB

From: "Xiao Meng" <mx(dot)cogito(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Subject: Re: [patch] gsoc, improving hash index v2
Date: 2008-08-06 01:33:25
Message-ID: ded849dd0808051833v1c855ba7hccba739fe7778911@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi, hackers. Here is some test I run on a bigger set.

Use a word list of 39916800 unique words
The table size is 1529MB.
Index BuildTime IndexSize
---- ---- ----
btree 874470.339ms 1027MB
hash-patch 513026.381 ms 1024MB

I use pgbench to test the time of a custom query script.
There are 2000 statements in the script.
It looks like this:
select * from dict where word='123456789a0'
...
The time of the two index is
btree: 1/0.174700=5.00250125
hash-patch: 1/0.199900=5.724098

---------------btree------------------
$ pgbench -n -f /tmp/query.sql dict
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
tps = 0.174694 (including connections establishing)
tps = 0.174700 (excluding connections establishing)

---------------hash patch-------------
$ pgbench -n -f /tmp/query.sql dict
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
tps = 0.199892 (including connections establishing)
tps = 0.199900 (excluding connections establishing)

--
Best Regards,
Xiao Meng

DKERC, Harbin Institute of Technology, China
Gtalk: mx(dot)cogito(at)gmail(dot)com
MSN: cnEnder(at)live(dot)com
http://xiaomeng.yo2.cn


From: "Xiao Meng" <mx(dot)cogito(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Subject: Re: [patch] gsoc, improving hash index v2
Date: 2008-08-06 01:39:49
Message-ID: ded849dd0808051839w49c96ff6v2247f71bea77c1f6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

sorry, I made some mistake here.
The time of the script on two indexes should be
btree: 1/0.174700=5.724098s
hash-patch: 1/0.199900=5.00250125s

On Wed, Aug 6, 2008 at 9:33 AM, Xiao Meng <mx(dot)cogito(at)gmail(dot)com> wrote:

> Hi, hackers. Here is some test I run on a bigger set.
>
> Use a word list of 39916800 unique words
> The table size is 1529MB.
> Index BuildTime IndexSize
> ---- ---- ----
> btree 874470.339ms 1027MB
> hash-patch 513026.381 ms 1024MB
>
> I use pgbench to test the time of a custom query script.
> There are 2000 statements in the script.
> It looks like this:
> select * from dict where word='123456789a0'
> ...
> The time of the two index is
> btree: 1/0.174700=5.00250125
> hash-patch: 1/0.199900=5.724098
>
> ---------------btree------------------
> $ pgbench -n -f /tmp/query.sql dict
> transaction type: Custom query
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of transactions per client: 10
> number of transactions actually processed: 10/10
> tps = 0.174694 (including connections establishing)
> tps = 0.174700 (excluding connections establishing)
>
> ---------------hash patch-------------
> $ pgbench -n -f /tmp/query.sql dict
> transaction type: Custom query
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of transactions per client: 10
> number of transactions actually processed: 10/10
> tps = 0.199892 (including connections establishing)
> tps = 0.199900 (excluding connections establishing)
>
> --
> Best Regards,
> Xiao Meng
>
> DKERC, Harbin Institute of Technology, China
> Gtalk: mx(dot)cogito(at)gmail(dot)com
> MSN: cnEnder(at)live(dot)com
> http://xiaomeng.yo2.cn
>

--
Best Regards,
Xiao Meng

DKERC, Harbin Institute of Technology, China
Gtalk: mx(dot)cogito(at)gmail(dot)com
MSN: cnEnder(at)live(dot)com
http://xiaomeng.yo2.cn


From: Jens-Wolfhard Schicke <drahflow(at)gmx(dot)de>
To: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [patch] gsoc, improving hash index v2
Date: 2008-08-06 01:40:47
Message-ID: 4899011F.1040403@gmx.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Xiao Meng wrote:
> Hi, hackers. Here is some test I run on a bigger set.
>
> The time of the two index is
> btree: 1/0.174700=5.00250125
> hash-patch: 1/0.199900=5.724098
Just to bring it to attention of everybody:

btree: 1/0.174700=5.724098
hash-patch: 1/0.199900=5.00250125

Hence the hash _is_ actually faster.

> ---------------btree------------------
> $ pgbench -n -f /tmp/query.sql dict
> ...
> tps = 0.174694 (including connections establishing)
> tps = 0.174700 (excluding connections establishing)
>
> ---------------hash patch-------------
> $ pgbench -n -f /tmp/query.sql dict
> transaction type: Custom query
> ...
> tps = 0.199892 (including connections establishing)
> tps = 0.199900 (excluding connections establishing)

Jens
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFImQEZzhchXT4RR5ARAi2nAJ98ujYi+ZOHZybSQaOw11JFpkilIACg5DGu
0Mo+UPGsdd2ZFTGirMplFm4=
=Qj5C
-----END PGP SIGNATURE-----