Re: intarray internals

Lists: pgsql-generalpgsql-hackers
From: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>
To: pgsql-general(at)postgresql(dot)org
Subject: intarray internals
Date: 2006-05-05 21:46:01
Message-ID: 20060505214601.GA208@alamut
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi,

I'm reading through the source code of intarray contrib module. Despite
being at the beginning, I've some questions to ask. I'd be so
appreciated if anybody can help.

[1]
What's the function of execute() in _int_bool.c? As far as I can
understand, some other functions (eg. execconsistent()) calling
execute() with specific check methods (like checkcondition_bit()) but
I still couldn't figure out which functionality execute() stands for.

[2]
In g_int_decompress(), shouldn't

if (ARRISVOID(in))
PG_RETURN_POINTER(entry);

part be replaced with

if (ARRISVOID(in))
{
if (in != (ArrayType *) DatumGetPointer(entry->key))
pfree(in);
PG_RETURN_POINTER(entry)
}

[3]
Again, in g_int_decompress(), I couldn't figure out the functionality of
below lines:

din = ARRPTR(in);
lenr = internal_size(din, lenin);

for (i = 0; i < lenin; i += 2)
for (j = din[i]; j <= din[i + 1]; j++)
if ((!i) || *(dr - 1) != j)
*dr++ = j;

If I understand right, above loop, tries to reconstruct array with more
smaller intervals - to be able to make more accurate predicates while
digging into nodes. If so, AFAICS, g_int_compress() and
g_int_decompress() methods can be (quite?) improved.

Furthermore, I've tested above functions with some random input and
couldn't create any cases hold for a[i] == a[i - 1] (which is used
in internal_size() method's loop.) Did I miss something obvious?

Regards.

P.S. Instead of an explanation to questions, pointings to right files to
read (at least for the beginning) would be appreciated too.


From: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: intarray internals
Date: 2006-05-06 09:38:25
Message-ID: 20060506093825.GA202@alamut
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On May 06 12:46, Volkan YAZICI wrote:
> I'm reading through the source code of intarray contrib module. Despite
> being at the beginning, I've some questions to ask. I'd be so
> appreciated if anybody can help.

Sorry, I forgot one:

[4]
In the inner_int_contains() function of _int_tool.c, there's a while
loop like

[Code assumes that arrays are sorted.]

na = ARRNELEMS(a);
nb = ARRNELEMS(b);
da = ARRPTR(a);
db = ARRPTR(b);

i = j = n = 0;
while (i < na && j < nb)
if (da[i] < db[j])
i++;
else if (da[i] == db[j])
{
n++;
i++;
j++;
}
else
j++;

return (n == nb) ? TRUE : FALSE;

AFAICS, last "j++" should be replaced with "return FALSE". Because, "n"
cannot be equal to "nb" no more, if "j" gets incremented without
incrementing "n" (remember "j < nb" in the "while" condition).

Regards.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: intarray internals
Date: 2006-05-06 10:13:03
Message-ID: 20060506101303.GB4413@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Sat, May 06, 2006 at 12:46:01AM +0300, Volkan YAZICI wrote:
> [1]
> What's the function of execute() in _int_bool.c? As far as I can
> understand, some other functions (eg. execconsistent()) calling
> execute() with specific check methods (like checkcondition_bit()) but
> I still couldn't figure out which functionality execute() stands for.

It's a boolean expression evaluator. The query given is some kind of
boolean expression. That function is a recusive function that evaluates
the expression given certain information.

> [2]
> In g_int_decompress(), shouldn't
>
> if (ARRISVOID(in))
> PG_RETURN_POINTER(entry);
>
> part be replaced with
>
> if (ARRISVOID(in))
> {
> if (in != (ArrayType *) DatumGetPointer(entry->key))
> pfree(in);
> PG_RETURN_POINTER(entry)
> }

You very rarely need to pfree() anything explicitly. However, the code
has just tested if in is VOID. If it is, you obviously don't need to
free it. Or it may not be big enough to bother explicitly freeing.

>
> [3]
> Again, in g_int_decompress(), I couldn't figure out the functionality of
> below lines:
>
> din = ARRPTR(in);
> lenr = internal_size(din, lenin);
>
> for (i = 0; i < lenin; i += 2)
> for (j = din[i]; j <= din[i + 1]; j++)
> if ((!i) || *(dr - 1) != j)
> *dr++ = j;
>
> If I understand right, above loop, tries to reconstruct array with more
> smaller intervals - to be able to make more accurate predicates while
> digging into nodes. If so, AFAICS, g_int_compress() and
> g_int_decompress() methods can be (quite?) improved.

Well, it's probably trying to undo whatever g_int_compress. If you can
explain the algorithm used it will probably be clearer.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: intarray internals
Date: 2006-05-06 14:38:24
Message-ID: 20060506143824.GB202@alamut
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi,

First, thanks so much for your response.

On May 06 12:13, Martijn van Oosterhout wrote:
> On Sat, May 06, 2006 at 12:46:01AM +0300, Volkan YAZICI wrote:
> > [1]
> > What's the function of execute() in _int_bool.c? As far as I can
> > understand, some other functions (eg. execconsistent()) calling
> > execute() with specific check methods (like checkcondition_bit()) but
> > I still couldn't figure out which functionality execute() stands for.
>
> It's a boolean expression evaluator. The query given is some kind of
> boolean expression. That function is a recusive function that evaluates
> the expression given certain information.

I thought the same but the code (and its variable handling/coersion
stuff) is quite messy to figure this out, IMHO. Nearly no comments at
all while using curitem->val or calcnot.

> > [2]
> > In g_int_decompress(), shouldn't
> >
> > if (ARRISVOID(in))
> > PG_RETURN_POINTER(entry);
> >
> > part be replaced with
> >
> > if (ARRISVOID(in))
> > {
> > if (in != (ArrayType *) DatumGetPointer(entry->key))
> > pfree(in);
> > PG_RETURN_POINTER(entry)
> > }
>
> You very rarely need to pfree() anything explicitly. However, the code
> has just tested if in is VOID. If it is, you obviously don't need to
> free it. Or it may not be big enough to bother explicitly freeing.

Yep, it shouldn't be so big. I just wanted to follow same style as in
the previous page. (See "if (ARRISVOID(r))" check in g_int_compress().)

> > [3]
> > Again, in g_int_decompress(), I couldn't figure out the functionality of
> > below lines:
> >
> > din = ARRPTR(in);
> > lenr = internal_size(din, lenin);
> >
> > for (i = 0; i < lenin; i += 2)
> > for (j = din[i]; j <= din[i + 1]; j++)
> > if ((!i) || *(dr - 1) != j)
> > *dr++ = j;
> >
> > If I understand right, above loop, tries to reconstruct array with more
> > smaller intervals - to be able to make more accurate predicates while
> > digging into nodes. If so, AFAICS, g_int_compress() and
> > g_int_decompress() methods can be (quite?) improved.
>
> Well, it's probably trying to undo whatever g_int_compress. If you can
> explain the algorithm used it will probably be clearer.

Actually, algorithms used in the g_int_compress() and g_int_decompress()
methods are quite awesome. (I don't know if this is the authors'
creation, but if so, kudos.) But the problem I think is they're quite
lossy compression methods. To clarify, here's a small explanation of
algorithm used (if I understood right):

g_int_compress():
if (integer array length > constant limit)
{
Transfrom {A, B, C, ..., Z} array into
{A, A, B, B, ..., Z, Z}

while (integer array length > constant limit)
{
Select two couples whose difference is minimum
and remove them from the list.
}
}

g_int_decompress():
for (iterate over compressed array items)
{
Transform {..., 47, 50, ...} into {..., 47, 48, 49, 50, ...}
}

As you can see both compression and decompression methods are quite
lossy. I'm not sure if this has any negative impact on the traversing
of nodes stuff for more accurate predicates, but I am currently
considering "performance gain * physical storage gain / cpu
consumation loss" ratio if we'd instead use a lossless data
compression method. I'd be appreciated to hear your ideas (and
experiences).

Regards.


From: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: intarray internals
Date: 2006-05-06 20:43:15
Message-ID: 20060506204315.GC202@alamut
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On May 06 05:38, Volkan YAZICI wrote:
> g_int_compress():
> if (integer array length > constant limit)
> {
> Transfrom {A, B, C, ..., Z} array into
> {A, A, B, B, ..., Z, Z}
>
> while (integer array length > constant limit)
> {
> Select two couples whose difference is minimum
> and remove them from the list.
> }
> }
>
> g_int_decompress():
> for (iterate over compressed array items)
> {
> Transform {..., 47, 50, ...} into {..., 47, 48, 49, 50, ...}
> }
>
> As you can see both compression and decompression methods are quite
> lossy. I'm not sure if this has any negative impact on the traversing
> of nodes stuff for more accurate predicates, but I am currently
> considering "performance gain * physical storage gain / cpu
> consumation loss" ratio if we'd instead use a lossless data
> compression method.

After considering the idea, I conclude up with that, current lossy
compression method seems like the best for now. But here goes my
proposal:

g_int_compress():
/* Input */
arr = {16, 20, 40, 52, 58}
len = 5

orig_len = len

while (len(arr) > LIM)
{
Find the couple with minimum diff: (16, 20)

Remove(arr, 20) /* Removing right bound. If it's the
last item of the list, remove
left item. */
--len
}

/*
* Suppose that, above function reduced array with below steps:
* 0: {16, 20, 40, 52, 58} (16, 20*)
* 1: {16, 40, 52, 58} (52*, 58)
* 2: {16, 40, 58}
*/

/* Prepend orig_len to array */
arr = {_5_, 16, 40, 58}

g_int_decompress():
/* Import info from input array. */
orig_len = 5
arr = {16, 40, 58}
len = 3

while (len < orig_len)
{
Divide every interval into two equal parts:
{16, 40, 58} -> {16, 28*, 40, 49*, 58}
}

I didn't test above method on real-world data chunks but, AFAIC, above
method would result with more accurate ranges in the decompressed arrays
and probably skip some redundant loop recursions when compared to
previous method.

Your comments?

Regards.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: intarray internals
Date: 2006-05-06 20:46:15
Message-ID: 20060506204615.GF4413@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Sat, May 06, 2006 at 05:38:24PM +0300, Volkan YAZICI wrote:
> > Well, it's probably trying to undo whatever g_int_compress. If you can
> > explain the algorithm used it will probably be clearer.
>
> Actually, algorithms used in the g_int_compress() and g_int_decompress()
> methods are quite awesome. (I don't know if this is the authors'
> creation, but if so, kudos.) But the problem I think is they're quite
> lossy compression methods. To clarify, here's a small explanation of
> algorithm used (if I understood right):

Well, you're right, the algorithm is quite neat. Note however that it's
only applied for integer sets with more than 100 values. I don't know
much about the intarray code, but this is a compression algorithm that
probably maps well to the domain. If you can find a lossless one that
will take 1,000 elements and can cram them into the space of 100
losslessly, we'd be glad to hear it.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: intarray internals
Date: 2006-05-07 10:21:36
Message-ID: 20060507102136.GA202@alamut
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi,

I've prepared a Quick & Dirty patch serie for some missing parts in
intarray contrib module. Here they go with some explanation...

On May 06 12:38, Volkan YAZICI wrote:
> [4]
> In the inner_int_contains() function of _int_tool.c, there's a while
> loop like
>
> [Code assumes that arrays are sorted.]
>
> na = ARRNELEMS(a);
> nb = ARRNELEMS(b);
> da = ARRPTR(a);
> db = ARRPTR(b);
>
> i = j = n = 0;
> while (i < na && j < nb)
> if (da[i] < db[j])
> i++;
> else if (da[i] == db[j])
> {
> n++;
> i++;
> j++;
> }
> else
> j++;
>
> return (n == nb) ? TRUE : FALSE;
>
> AFAICS, last "j++" should be replaced with "return FALSE". Because, "n"
> cannot be equal to "nb" no more, if "j" gets incremented without
> incrementing "n" (remember "j < nb" in the "while" condition).

intarray_contains.patch.0 is for above problem.

[5]
ISTM, inner_int_union() of _int_tool.c, makes some redundant visits to
array:

...
/* union */
i = j = 0;
while (i < na && j < nb)
if (da[i] < db[j])
*dr++ = da[i++];
else
*dr++ = db[j++];

while (i < na)
*dr++ = da[i++];
while (j < nb)
*dr++ = db[j++];

}

if (ARRNELEMS(r) > 1)
r = _int_unique(r);

IMHO, uniting unique values (instead of uniting and then removing
duplicates) should work faster. intarray_union.patch.0 is for this
problem. (Patched code, handles uniting for unique values.)

[6]
There's a seperate sorting algorithm isort() in _int_tool.c. Looks like
it executes some kind of shell sort on the array and returns true if
array has any duplicates. It's used for common sorting and deciding on
executing _int_unique() on the related array if isort() says it has
duplicates.

IIRC, our inner qsort() has a smaller O(N) degree when compared to above
sorting algorithm. Also for the code's sake it'd be better to switch
using qsort() in all sorting related stuff. For these reasons,
intarray_sort.patch.0 addresses this (probable) gotcha.

All 3 patches passed regression tests for intarray contrib module. But
these are just for addressing some gotchas I found while reading code.
Your comments for these problems(?) are welcome.

Regards.

Attachment Content-Type Size
intarray_contains.patch.0 text/plain 546 bytes
intarray_sort.patch.0 text/plain 2.3 KB
intarray_union.patch.0 text/plain 2.3 KB

From: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: intarray internals
Date: 2006-05-08 19:38:02
Message-ID: 20060508193802.GA211@alamut
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi,

[I'm trying to share some of my thoughts about intarray contrib module.
If this is the wrong way to achieve this, please warn me. (Should I
first get in touch with Teodor Sigaev and Oleg Bartunov?)]

[6]
_int_same() in _int_op.c looks like making some redundant sorting and
not taking advantage of sorted arrays while comparing each other. Here's
the related code piece:

SORT(a);
SORT(b);
na = ARRNELEMS(a);
nb = ARRNELEMS(b);
da = ARRPTR(a);
db = ARRPTR(b);

result = FALSE;

if (na == nb)
{
result = TRUE;
for (n = 0; n < na; n++)
if (da[n] != db[n])
{
result = FALSE;
break;
}
}

IMHO, SORT() macro should be called after "if (na == nb)" block. (SORT()
doesn't remove duplicates.) Also, in the inner block, while comparing
two arrays, we can take advantage of sorting of arrays. While current
behaviour is like

if (A[0] == B[0] && A[1] == B[1] && ...)

we can replace it with sth like

if (A[0] == B[0] && A[ N] == B[ N] &&
A[1] == B[1] && A[N-1] == B[N-1] &&
...)

Attached patch tries to implement both behaviours mentioned above and
some minor hacking for arrays of 1,2 and 3 items.

Regards.

Attachment Content-Type Size
intarray_same.patch.0 text/plain 2.3 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: intarray internals
Date: 2006-05-08 19:58:29
Message-ID: 29236.1147118309@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr> writes:
> [I'm trying to share some of my thoughts about intarray contrib module.
> If this is the wrong way to achieve this, please warn me. (Should I
> first get in touch with Teodor Sigaev and Oleg Bartunov?)]

Well, they're definitely the people most qualified to review your
proposed changes. More to the point, this discussion is quite off-topic
for pgsql-general. If I were you I'd be posting these comments to
pgsql-hackers with cc's to Teodor and Oleg.

regards, tom lane


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] intarray internals
Date: 2006-05-10 12:01:18
Message-ID: 4461D60E.5020507@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi!

Sorry for delay, I hadn't access to Internet.

> [3]
> Again, in g_int_decompress(), I couldn't figure out the functionality of
> below lines:

gist__int_ops use rangeset compression technique, read about in "THE RD-TREE: AN
INDEX STRUCTURE FOR SETS", Joseph M. Hellerstein,
http://www.sai.msu.su/~megera/postgres/gist/papers/rd-tree.ps

About your patches:
* intarray_contains.patch.0 - applied

* intarray_union.patch.0 - doesn't applied, but make small optimization to
reduce number non-unique values. I don't believe that one pass through array
with a lot of ifs will be faster than two pass with simple ifs. Did you some tests?

* intarray_same.patch.0 - move SORT as you suggest, but don't touch algorithm.
1) if (A[0] == B[0] && A[1] == B[1] && ...)

2) if (A[0] == B[0] && A[ N] == B[ N] &&
A[1] == B[1] && A[N-1] == B[N-1] &&
...)

Why are you sure that second a much faster? Did you make tests? Number of
comparisons is the same...

* intarray_sort.patch.0 - doesn't applied. isort() is very often called for
already sorted and unique arrays (which comes from index), so it should be fast
as possible for sorted arrays. As I remember ordered array is a worst case for
qsort(). May be, it will be better choice to use mergesort.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Volkan YAZICI <yazicivo(at)ttnet(dot)net(dot)tr>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: intarray internals
Date: 2006-05-11 08:25:30
Message-ID: 20060511082529.GA223@alamut
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi,

First, thanks so much for your reply.

On May 10 04:01, Teodor Sigaev wrote:
> > Again, in g_int_decompress(), I couldn't figure out the functionality of
> > below lines:
>
> gist__int_ops use rangeset compression technique, read about in "THE
> RD-TREE: AN INDEX STRUCTURE FOR SETS", Joseph M. Hellerstein,
> http://www.sai.msu.su/~megera/postgres/gist/papers/rd-tree.ps

Thanks so much for the papers. I read the related section (and will read
whole today or tomorrow).

> * intarray_union.patch.0 - doesn't applied, but make small optimization to
> reduce number non-unique values. I don't believe that one pass through
> array with a lot of ifs will be faster than two pass with simple ifs. Did
> you some tests?

IMHO, the only significant improvement in my proposal about _int_union()
is that this method will visit arrays only once (with extra price of x2
condition checks), while current one will also make a second visit to
arrays to remove duplicates (with small condition checks).

You can be right, maybe it doesn't worth for worrying about. Improvement
(if there's any) will probably be available to see for very long arrays.
(Sorry, no tests for this proposal.)

> * intarray_same.patch.0 - move SORT as you suggest, but don't touch
> algorithm.
> 1) if (A[0] == B[0] && A[1] == B[1] && ...)
>
> 2) if (A[0] == B[0] && A[ N] == B[ N] &&
> A[1] == B[1] && A[N-1] == B[N-1] &&
> ...)
>
> Why are you sure that second a much faster? Did you make tests? Number of
> comparisons is the same...

Yep, both algorithms have O(n) comparisions in their worst cases. But
for general purposes, AFAICS, second one will perform better. For
instance consider below examples:

[Best case for 2nd algo.]
Input : 1, 2, 3, ..., 6, 7, *9
1st algo.: O(n)
2nd algo.: O(1)

[Worst case for 2nd algo.]
Input : 1, 2, 3, 4, *4, 6, 7, 8, 9
1st algo.: O(n/2)
2nd algo.: O(n)

But as you can see, because of our arrays are sorted, any missing (or
additional) element in the target array will produce a padding in the
end of the array --- assuming that arrays generally don't hold
duplicate values. Therefore, making comparisons for the tail elements
will perform better beucause of the unmatched values caused by padding.

Hope I managed to explain what I try to mean. Actually, IIRC, I saw this
method (both hacks for small sized arrays and comparisons for the tail
elements of a sorted array) in another FOSS project's source code ---
probably PHP, but I'm not sure.

For about testing, if you'd supply suitable inputs there occurs a quite
much performance improve.

> * intarray_sort.patch.0 - doesn't applied. isort() is very often called for
> already sorted and unique arrays (which comes from index), so it should be
> fast as possible for sorted arrays.

Uh, sorry. I missed that point.

> As I remember ordered array is a worst
> case for qsort(). May be, it will be better choice to use mergesort.

I'll investigate alternative methods to sort already sorted arrays.

Regards.