Re: b-tree index search algorithms

Lists: pgsql-hackers
From: Samuel Vogel <s(at)muel-vogel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: b-tree index search algorithms
Date: 2012-07-16 23:51:08
Message-ID: 5004A8EC.1000307@muel-vogel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

I'm currently on a university research project if performance could be
increased by substituting different inter-node search algorithms instead
of the currently used binary search.

But I'm having troubles understanding how the general b-tree
implementation (nbtree.h) is used to represent for example a simple
primary key on an integer column. I've debug printed the
'scankey->sk_argument' and all attributes of the index tuples on the
pages being traversed (simply ran 'DatumGetInt32' on both) but I never
see one of the integers actually appearing in my table being logged when
I do a select.

This is why I assume that all column values are hashed before they are
pushed into the b-tree, but this hashing would have to preserve the
order of the keys. I have tried to look at stack traces, but the b-tree
implementations seems to be used commonly for many things that it's hard
to find the interesting bits for me.

I would like to know how the b-tree and column indexes interact. The
readmes for the b-tree seem really implementation centric and I'm not
getting a hold the whole picture.

Regards,
Samuel Vogel


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Samuel Vogel <s(at)muel-vogel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: b-tree index search algorithms
Date: 2012-07-17 03:21:47
Message-ID: 13121.1342495307@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Samuel Vogel <s(at)muel-vogel(dot)de> writes:
> I'm currently on a university research project if performance could be
> increased by substituting different inter-node search algorithms instead
> of the currently used binary search.

Hm, what have you got in mind exactly?

> But I'm having troubles understanding how the general b-tree
> implementation (nbtree.h) is used to represent for example a simple
> primary key on an integer column. I've debug printed the
> 'scankey->sk_argument' and all attributes of the index tuples on the
> pages being traversed (simply ran 'DatumGetInt32' on both) but I never
> see one of the integers actually appearing in my table being logged when
> I do a select.

Not clear what you did wrong from this amount of detail, but integer
keys ought to be pretty obvious at the debugger level.

> This is why I assume that all column values are hashed before they are
> pushed into the b-tree,

PG's b-trees do not hash anything. If you're not seeing interpretable
key values then you're doing something wrong in your inspection
methodology.

regards, tom lane


From: Samuel Vogel <s(at)muel-vogel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: b-tree index search algorithms
Date: 2012-07-17 09:34:34
Message-ID: 500531AA.3070800@muel-vogel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am 17.07.12 05:21, schrieb Tom Lane:
> Samuel Vogel <s(at)muel-vogel(dot)de> writes:
>> I'm currently on a university research project if performance could be
>> increased by substituting different inter-node search algorithms instead
>> of the currently used binary search.
> Hm, what have you got in mind exactly?

At first I will try a simple interpolation search, but problems start
there since I need to have a numerical representation of the index keys
(or map them to one) to do the interpolation.

>> But I'm having troubles understanding how the general b-tree
>> implementation (nbtree.h) is used to represent for example a simple
>> primary key on an integer column. I've debug printed the
>> 'scankey->sk_argument' and all attributes of the index tuples on the
>> pages being traversed (simply ran 'DatumGetInt32' on both) but I never
>> see one of the integers actually appearing in my table being logged when
>> I do a select.
> Not clear what you did wrong from this amount of detail, but integer
> keys ought to be pretty obvious at the debugger level.

Okay, to be more specific: Printing
'DatumGetInt32(scankey->sk_argument)' in '_bt_compare' never shows me 50
when I execute this query: SELECT * FROM simpletest WHERE id = 50;

>> This is why I assume that all column values are hashed before they are
>> pushed into the b-tree,
> PG's b-trees do not hash anything. If you're not seeing interpretable
> key values then you're doing something wrong in your inspection
> methodology.

Okay, how are indexes on char/text columns handled then? Are they hashed
before being put into the b-tree or is my assumption correct, that in
that case the Datum is only a link to where the actual data is stored
and only 'scankey->sk_func' knows how to make use of it (compare it).
In that case it would be extremly hard to get to a numeric
representation which can be used for the interpolation.

Regards,
Samuel Vogel


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Samuel Vogel <s(at)muel-vogel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: b-tree index search algorithms
Date: 2012-07-17 17:38:44
Message-ID: 13532.1342546724@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Samuel Vogel <s(at)muel-vogel(dot)de> writes:
> Am 17.07.12 05:21, schrieb Tom Lane:
>> Samuel Vogel <s(at)muel-vogel(dot)de> writes:
>>> I'm currently on a university research project if performance could be
>>> increased by substituting different inter-node search algorithms instead
>>> of the currently used binary search.

>> Hm, what have you got in mind exactly?

> At first I will try a simple interpolation search, but problems start
> there since I need to have a numerical representation of the index keys
> (or map them to one) to do the interpolation.

Dunno about that. btree knows nothing about the datatypes it's working
on except that they have comparison functions. Converting the values
to some sort of numeric scale that you can interpolate on seems
logically dubious and fraught with practical difficulties. Now, we do
have some code in selfuncs.c that tries to do that, for some data types,
but it's only for planner estimation purposes, and we don't rely very
heavily on its results even in that context. Depending on it to be
right for search purposes sounds pretty scary.

>> Not clear what you did wrong from this amount of detail, but integer
>> keys ought to be pretty obvious at the debugger level.

> Okay, to be more specific: Printing
> 'DatumGetInt32(scankey->sk_argument)' in '_bt_compare' never shows me 50
> when I execute this query: SELECT * FROM simpletest WHERE id = 50;

Um, what does it show you? DatumGetInt32 is a macro, and at least in
gdb that won't work at all:

(gdb) p DatumGetInt32(scankey->sk_argument)
No symbol "DatumGetInt32" in current context.

However, just looking at the value produces sane answers for me:

(gdb) p *scankey
$1 = {sk_flags = 196608, sk_attno = 1, sk_strategy = 0, sk_subtype = 23,
sk_collation = 0, sk_func = {fn_addr = 0x486ec0 <btint4cmp>, fn_oid = 351,
fn_nargs = 2, fn_strict = 1 '\001', fn_retset = 0 '\000',
fn_stats = 2 '\002', fn_extra = 0x0, fn_mcxt = 0x2b82fe8, fn_expr = 0x0},
sk_argument = 50}
(gdb) p scankey->sk_argument
$2 = 50

>> PG's b-trees do not hash anything. If you're not seeing interpretable
>> key values then you're doing something wrong in your inspection
>> methodology.

> Okay, how are indexes on char/text columns handled then?

The datum values will be pointers to strings.

> ... is my assumption correct, that in
> that case the Datum is only a link to where the actual data is stored
> and only 'scankey->sk_func' knows how to make use of it (compare it).
> In that case it would be extremly hard to get to a numeric
> representation which can be used for the interpolation.

The btree code is (or reasonably can be) aware that such values are
pass-by-reference, and how to get to the bits. But the comparison
semantics of two different values are not something it knows about
except by asking the comparison function. This can be quite a
nontrivial matter even for text, since we follow strcoll() comparison
rules.

regards, tom lane


From: Samuel Vogel <s(at)muel-vogel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: b-tree index search algorithms
Date: 2012-07-18 21:48:05
Message-ID: 50072F15.2020107@muel-vogel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am 17.07.12 19:38, schrieb Tom Lane:
> btree knows nothing about the datatypes it's working on except that
> they have comparison functions. Converting the values to some sort of
> numeric scale that you can interpolate on seems logically dubious and
> fraught with practical difficulties. Now, we do have some code in
> selfuncs.c that tries to do that, for some data types, but it's only
> for planner estimation purposes, and we don't rely very heavily on its
> results even in that context. Depending on it to be right for search
> purposes sounds pretty scary.

'convert_string_to_scalar' and others look interesting in selfuncs.c,
thanks for the pointer!

>> Okay, how are indexes on char/text columns handled then?
> The datum values will be pointers to strings.

I can simply dereference it and read all bytes until a null-byte appears
(depending on the collation and that I know that it actually is a string)?

> The btree code is (or reasonably can be) aware that such values are
> pass-by-reference, and how to get to the bits. But the comparison
> semantics of two different values are not something it knows about
> except by asking the comparison function. This can be quite a
> nontrivial matter even for text, since we follow strcoll() comparison
> rules. regards, tom lane

How would the b-tree know exactly that a value is only a reference? And
even in that case you say that it could get the bits, but make no use of
it, since it does not know what they represent, right?

Regards,
Samuel


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Samuel Vogel <s(at)muel-vogel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: b-tree index search algorithms
Date: 2012-07-18 21:56:32
Message-ID: 3066.1342648592@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Samuel Vogel <s(at)muel-vogel(dot)de> writes:
> Am 17.07.12 19:38, schrieb Tom Lane:
>> The datum values will be pointers to strings.

> I can simply dereference it and read all bytes until a null-byte appears
> (depending on the collation and that I know that it actually is a string)?

We use a length word and then data, with no trailing null, at least for
text and varchar. This area has been hacked for efficiency until it's
pretty complicated, but you can read about it in postgres.h.

> How would the b-tree know exactly that a value is only a reference? And
> even in that case you say that it could get the bits, but make no use of
> it, since it does not know what they represent, right?

It has access to the data type's basic storage parameters, which are
typbyval, typlen, and typalign; and we have standard conventions for
identifying the length etc of variable-length values. It's just the
meaning of the payload data bytes that's data-type-private.

regards, tom lane


From: Samuel Vogel <s(at)muel-vogel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: b-tree index search algorithms
Date: 2012-07-19 22:14:23
Message-ID: 500886BF.20009@muel-vogel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am 18.07.12 23:56, schrieb Tom Lane:
> Samuel Vogel <s(at)muel-vogel(dot)de> writes:
>> How would the b-tree know exactly that a value is only a reference? And
>> even in that case you say that it could get the bits, but make no use of
>> it, since it does not know what they represent, right?
> It has access to the data type's basic storage parameters, which are
> typbyval, typlen, and typalign; and we have standard conventions for
> identifying the length etc of variable-length values. It's just the
> meaning of the payload data bytes that's data-type-private.

Okay, so with these I know if and how I would have to "dereference" the
data.
But how do I get to this info from inside _bt_binsrch?
RelationGetDescr(rel)->tdtypeid was my closest guess, but I need to get
a reference to FormData_pg_type somehow.

Regards,
Samuel Vogel


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Samuel Vogel <s(at)muel-vogel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: b-tree index search algorithms
Date: 2012-07-19 23:40:02
Message-ID: 18827.1342741202@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Samuel Vogel <s(at)muel-vogel(dot)de> writes:
> Am 18.07.12 23:56, schrieb Tom Lane:
>> It has access to the data type's basic storage parameters, which are
>> typbyval, typlen, and typalign; and we have standard conventions for
>> identifying the length etc of variable-length values. It's just the
>> meaning of the payload data bytes that's data-type-private.

> Okay, so with these I know if and how I would have to "dereference" the
> data.
> But how do I get to this info from inside _bt_binsrch?

RelationGetDescr(rel)->attrs[n]->attbyval etc.

regards, tom lane


From: Samuel Vogel <s(at)muel-vogel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: b-tree index search algorithms
Date: 2012-07-22 22:41:09
Message-ID: 500C8185.7070401@muel-vogel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am 20.07.12 01:40, schrieb Tom Lane:
> RelationGetDescr(rel)->attrs[n]->attbyval

Thanks!
Now does 'Relation' refer to the whole table or only the columns that
are supposed to be scanned? So will RelationGetDescr(rel)->attrs[0] give
me the description of the first column relevant to the current b-tree
scan or simply to the first column in the table?
Naively I tried RelationGetDescr(rel)->attrs[scankey->sk_attno] but it
results in a segmentation fault.

A first version of my algorithm is running now (very simple test case)
but I had to implement a workaround for the following behavior: The
(binary) search is supposed to always find the first matching offset on
a page, but I do not understand how this is guaranteed, since the
semantics of a binary search do not guarantee this. The type being
searched where it throws me off specifically is a 'chunk_id', page
contents are:
1: 12000
2: 12001
3: 12002
4: 12003
5: 12004
6: 12004
7: 12005
8: 12005
9: 12006
10: 12006
11: 16393
12: 16393
13: 16394
14: 16394
15: 16395
16: 16395

Binary search finds 11 in 4 steps, interpolation search finds 12 in 3
steps. Now if there was an additional value like "17: 16396", binary
search should also find 12 first, right?

Regards,
Samuel Vogel