Re: Looked at TODO:Considering improving performance of computing CHAR() value lengths

Lists: pgsql-hackers
From: John Cochran <j69cochran(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Looked at TODO:Considering improving performance of computing CHAR() value lengths
Date: 2014-08-02 18:43:20
Message-ID: CAGQU3n4eL_dDOnrm8sE6XnwLReY_kaCzWoCYLLVwiDC1iXYEnw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greetings,
I took at look at the TODO list and got interested in the possible
optimization of the bcTruelen() function. Read the archived messages about
that subject and decided to see what could be done.

I tested the performance of 5 different versions of bcTruelen().
1. The code as it exists in postgresql currently.
2. The code with the patch described in the above mentioned messages.
3. A modification of the code mentioned in 2 above using the concept of a
sentinel.
4. A modification of the code mentioned in 1 above using a sentinel.
5. A modification of the code mentioned in 4 using knowledge of 1B and 4B
headers.

First, let me describe a sentinel. The key thing about using a sentinel is
that with on, you can guarantee loop termination without having to check
for an index. What you do is place a value that you're searching for 1 past
the limit of the range you're searching for. By doing this, you can
optimize your loop to have just 1 condition to check for instead of 2
(search value and index limit). This mean that your loop now needs to just
execute 3 opcodes (pointer decrement, search compare, conditional jump)
instead of 5 (pointer decrement, search compare, conditional jump, limit
compare, conditional jump). In the case of bcTruelen(), the sentinel value
is anything that's non-space.

I ran all 5 different versions of bcTruelen() in a benchmark program that
simply allocated a gig of memory, populated the memory with varlena
structures of varying length, alignment, and trailing spaces. Then called
the different versions of bcTruelen() using a pseudo random sequence to try
and prevent processor caching from affecting the values.

Overall, the results were as follows:
1. For small numbers of trailing space, the "4 at a time" approach was a
consistent loser in terms of performance. If there's a large number of
trailing spaces, the performance of the "4 at a time" approach could be
quite impressive however.
2. The best typical performer was the single byte at a time approach using
a sentinel and knowledge of the types of headers. For that routine, the
performance compared against the current version of bcTruelen() was as
follows
CHAR(1) with only spaces. Current version wins. New version loses by about
20%
CHAR(1) with non-space. New version wins by about 10%.
CHAR(2) with only spaces. New version loses by about 4%.
CHAR(2) with one space. New version wins by about 15%.
CHAR(2) with no spaces. New version wins by about 10%

Once CHAR(4) was reached, the new version was faster for all numbers of
trailing spaces. I suspect the reason for the new routine being slower for
short CHAR types with all spaces is because using the sentinel results in
an extra loop iteration over the code that checks for limits each
iteration. But once the number of iterations gets large enough, the shorter
loop wins even though it iterated one extra time. The tests were performed
on an Intel Core2 duo.

The code effectively has 3 identical loops with different surroundings.
1. If the arg is pointing to a 1B style header, then there is no need to
worry about a sentinel. No 1B header matches a space character whether the
target system is big or little endian.
2. For those cases where the arg is pointing to 4B style header, you do
have to worry about a good sentinel existing. For a little endian machine,
such an issue happens if you're dealing with a CHAR(134217728) through
CHAR(138412031). Unlikely to be sure. But for a big endian machine,
CHAR(32), CHAR(288) and other values can cause problems. So for 4B style
headers, a check is made to see if the sentinel location has a value of '
'. If it does, it is replaced with a zero, the scan is then performed, and
then the sentinel is then restored with the original space. It is of course
possible to unconditionally save the value at the sentinel location, stuff
in a zero, then restore the original value, but in doing so, more overhead
in incurred every call, slowing down the function.

If you wish me to submit a patch, I can do so easily.

Code is as follows:
/*
* "True" length (not counting trailing blanks) of a BpChar
*/
int
bcTruelen(BpChar *arg)
{
char *s;
char *p;

if (VARATT_IS_1B(arg))
{
s = (void *)arg;
p = s + VARSIZE_1B(arg);
do
--p;
while(*p == ' ');
}
else
{
s = VARDATA_4B(arg);
p = s + VARSIZE_4B(arg)-VARHDRSZ;
--s; /* Point to where sentinel will be */

if (*s == ' ') {
*s = 0; /* Stuff in a non-blank sentinel */
do
--p;
while(*p == ' ');
*s = ' '; /* Replace blank */
} else {
do
--p;
while(*p == ' ');
}
}

return p-s; /* Return length */
}

--

There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and the other way is to
make it so complicated that there are no obvious deficiencies. — C.A.R.
Hoare


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: John Cochran <j69cochran(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Looked at TODO:Considering improving performance of computing CHAR() value lengths
Date: 2014-08-04 08:48:38
Message-ID: 53DF48E6.4060703@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/02/2014 09:43 PM, John Cochran wrote:
> I took at look at the TODO list and got interested in the possible
> optimization of the bcTruelen() function. Read the archived messages about
> that subject and decided to see what could be done.
>
> I tested the performance of 5 different versions of bcTruelen().
> 1. The code as it exists in postgresql currently.
> 2. The code with the patch described in the above mentioned messages.
> 3. A modification of the code mentioned in 2 above using the concept of a
> sentinel.
> 4. A modification of the code mentioned in 1 above using a sentinel.
> 5. A modification of the code mentioned in 4 using knowledge of 1B and 4B
> headers.
>
> First, let me describe a sentinel. The key thing about using a sentinel is
> that with on, you can guarantee loop termination without having to check
> for an index. What you do is place a value that you're searching for 1 past
> the limit of the range you're searching for. By doing this, you can
> optimize your loop to have just 1 condition to check for instead of 2
> (search value and index limit). This mean that your loop now needs to just
> execute 3 opcodes (pointer decrement, search compare, conditional jump)
> instead of 5 (pointer decrement, search compare, conditional jump, limit
> compare, conditional jump). In the case of bcTruelen(), the sentinel value
> is anything that's non-space.
>
> I ran all 5 different versions of bcTruelen() in a benchmark program that
> simply allocated a gig of memory, populated the memory with varlena
> structures of varying length, alignment, and trailing spaces. Then called
> the different versions of bcTruelen() using a pseudo random sequence to try
> and prevent processor caching from affecting the values.
>
> Overall, the results were as follows:

Can we see the benchmark program and the actual data, please?

> 1. For small numbers of trailing space, the "4 at a time" approach was a
> consistent loser in terms of performance. If there's a large number of
> trailing spaces, the performance of the "4 at a time" approach could be
> quite impressive however.
> 2. The best typical performer was the single byte at a time approach using
> a sentinel and knowledge of the types of headers. For that routine, the
> performance compared against the current version of bcTruelen() was as
> follows
> CHAR(1) with only spaces. Current version wins. New version loses by about
> 20%
> CHAR(1) with non-space. New version wins by about 10%.
> CHAR(2) with only spaces. New version loses by about 4%.
> CHAR(2) with one space. New version wins by about 15%.
> CHAR(2) with no spaces. New version wins by about 10%
>
> Once CHAR(4) was reached, the new version was faster for all numbers of
> trailing spaces. I suspect the reason for the new routine being slower for
> short CHAR types with all spaces is because using the sentinel results in
> an extra loop iteration over the code that checks for limits each
> iteration. But once the number of iterations gets large enough, the shorter
> loop wins even though it iterated one extra time. The tests were performed
> on an Intel Core2 duo.
>
> The code effectively has 3 identical loops with different surroundings.
> 1. If the arg is pointing to a 1B style header, then there is no need to
> worry about a sentinel. No 1B header matches a space character whether the
> target system is big or little endian.
> 2. For those cases where the arg is pointing to 4B style header, you do
> have to worry about a good sentinel existing. For a little endian machine,
> such an issue happens if you're dealing with a CHAR(134217728) through
> CHAR(138412031). Unlikely to be sure. But for a big endian machine,
> CHAR(32), CHAR(288) and other values can cause problems. So for 4B style
> headers, a check is made to see if the sentinel location has a value of '
> '. If it does, it is replaced with a zero, the scan is then performed, and
> then the sentinel is then restored with the original space. It is of course
> possible to unconditionally save the value at the sentinel location, stuff
> in a zero, then restore the original value, but in doing so, more overhead
> in incurred every call, slowing down the function.

You can't modify *arg, even momentarily, because it might point directly
to an on-disk buffer. The same Datum can be read by another process at
the same time, or even written out to disk.

You could simply check if the byte in the header happens to be != 0x20.
It's very likely that it is, so that you can use the optimized version.

The biggest issue with the previously discussed patches was the lack of
performance testing. People posted results of various micro-benchmarks
that showed different aspects, but no-one comprehensively covered all
the interesting cases. I'd like to see a simple suite of performance
tests that show:

1. The worst case performance.
2. Performance with a couple of typical cases. CHAR(n), where n is
between 1-64 would be interesting, and where the number of spaces at the
end varies.

You said you did benchmarking with a custom program, which is good, but
it would be even better to write the benchmark as a self-contained SQL
or bash script or similar. Then post it, so that others can easily
repeat the same benchmarks on different platforms, and try out different
versions of the patch.
- Heikki


From: John Cochran <j69cochran(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Looked at TODO:Considering improving performance of computing CHAR() value lengths
Date: 2014-08-04 21:40:06
Message-ID: CAGQU3n5JpRmbU7nk4NtGQ-5qWP_Y_G-qDN3_aMdXW8POR5uDhw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Made a mistake in replying to commenter, not mailing list. Redoing reply.

On Mon, Aug 4, 2014 at 4:48 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
wrote:

> On 08/02/2014 09:43 PM, John Cochran wrote:
>
>> I took at look at the TODO list and got interested in the possible
>>
> ... SNIP ...

> Can we see the benchmark program and the actual data, please?

Attached as gzipped tar file.

... SNIP ...

>
> You can't modify *arg, even momentarily, because it might point directly
> to an on-disk buffer. The same Datum can be read by another process at the
> same time, or even written out to disk
>
Thank you for the information. Just made some changes to the code so the
values pointed to by *arg aren't modified even momentarily.

> You could simply check if the byte in the header happens to be != 0x20.
> It's very likely that it is, so that you can use the optimized version.
>
I already do that.

> The biggest issue with the previously discussed patches was the lack of
> performance testing. People posted results of various micro-benchmarks that
> showed different aspects, but no-one comprehensively covered all the
> interesting cases. I'd like to see a simple suite of performance tests that
> show:
>
> 1. The worst case performance.
>
I can't provide that to you. Looking at my analysis, the worse case would
be encountered on a big endian machine. On a big endian machine, if the
data is using a 4B style header, the worse case would be for a CHAR(n)
where n is 28 plus any multiple of 256 (about 4 million possibilities in
total). For a little endian machine, the problem case happens for CHAR(n)
where n is between 128 megabytes and 132 megabytes. I seriously doubt that
the worse case would ever be encountered on a little endian machine. But
it's quite possible on a big endian machine.

> 2. Performance with a couple of typical cases. CHAR(n), where n is between
> 1-64 would be interesting, and where the number of spaces at the end varies.
>
Since I've modified the code, I'm rerunning the benchmark. When benchmark
is finished, will have execution times for all CHAR(n) where n ranges from
1 to 256 with trailing spaces ranging from 0 to n for each case. Type 1B
headers are tested for all n from 1 to 126 along with all possible
alignments from being on an even 4 byte boundary to being offset by 1, 2,
and 3. For type 4B headers, they are all aligned on a 4 byte boundary and a
value for all n from 1 to 256. The benchmark has currently gotten up to
CHAR(40) and will take a while to finish. Even so, the data available
should be more than suitable for examination.

>
> You said you did benchmarking with a custom program, which is good, but it
> would be even better to write the benchmark as a self-contained SQL or bash
> script or similar. Then post it, so that others can easily repeat the same
> benchmarks on different platforms, and try out different versions of the
> patch.
> - Heikki
>
> No can do. The benchmark program is testing 4 different variants of
bcTruelen() at the same time. In addition, the benchmark is keeping track
of the fixed overhead involved in actually testing each function so it can
be taken into consideration when examining the numbers. If I incorporated
the code into a functional database, I would be restricted to testing just
one variant at a time and would have no control over other activities
(buffer management, disk I/O, etc) that would add noise to the benchmark
data.

Currently with the available data, I would suggest the optimized
by1sentinel version of the bcTruelen(). As currently written, it attempts
the following methods in sequence.

1. If *arg is pointing to a type 1B header, it simply performs the scan of
trailing spaces using the 3 opcode loop. No other checks. Reason is that a
type 1B head can never have the value 0x20, so a suitable sentinel is
automatically known to be available.

2. The byte just prior to the 1st character is examined to see if it's a
non-space. If so, the 3 opcode loop is used since a suitable sentinel has
been shown to exist. Only reason this is slightly slower is because of the
overhead in checking for the existence of a sentinel. On average, a big
endian machine will use this option 255 times out of 256. On a little
endian machine, it is extremely likely that this will be the only code path
executed if the data starts with a type 4B header.

3. The entire header is checked to see if it's not equal to 0x20202020.
This is almost worse case. Still able to use a 3 opcode loop, but the
pointer will go 1 to 3 bytes lower than it really needs to. Easily
corrected after the loop with a simple if and assignment. A big endian
machine will use this code path 1 time out of 256. A little endian machine
will use this code path for CHAR(134217724) through CHAR(138412027). I
seriously doubt that this code path will ever be encountered on a little
endian machine. On a big endian machine, it will be used only with a type
4B header for CHAR(28) which is unlikely since CHAR(28) should be using a
1B header. It will also be used for CHAR(284), CHAR(540), CHAR(796), etc.

4. Assuming none of the 3 options above are selected, the function fall
back to the standard 5 opcode loop that's currently in use. This will only
happen on a big endian machine using CHAR(538976284) or a little endian
machine using CHAR(134744068). Rather unlikely, but it is covered.

Format of output from vtest (the benchmark program built by the attached
tar file)
The output is just a series of lines of the following format
hdr,offset,maxlen,spaces,overhead,by1,by4,by4sentinel,by1sentinel

hdr = 1 or 4 to indicate test done using a type 1B or 4B header.
offset = 0 to 3 to indicate memory alignment of header. 0 = aligned to 4
byte boundary. 1 through 3 = offset from 4 byte boundary by that amount.
maxlen = length of character data. Equivalent to n in CHAR(n)
spaces = number of trailing spaces. Ranges from 0 to maxlen.
overhead = number of nanoseconds consumed by 100,000 function calls and
benchmark loop overhead.
by1 = number of nanoseconds consumed by 100,000 functions calls to by1
bcTruelen() function. Time includes benchmark overhead. To properly compare
numbers against other versions of bcTruelen functions, subtract the
overhead number given earlier. This function represents the current version
of bcTruelen() being used by postgresql.
by4 = Same as by1 except using the "4 bytes at a time" version that the
patch in the email conversation provided as referenced by the TODO item.
by4sentinel = Same as by1 except it performs the search 4 bytes at a time
and uses a sentinel for loop termination when possible. NOTE: function not
fully tested. It should work on big and little endian machines, but it is
definitely not fully optimized for edge cases.
by1sentinel = Same a by1. It performs the search 1 bytes at a time and uses
a sentinel for loop termination.

Frankly, I don't see any downside to using the by1sentinel version of
bcTruelen() since it is as fast, or faster than the current version of
bcTruelen() for all cases. I do see lots of problems with using either of
the 4 bytes at a time versions of bcTruelen() since they are slower if
there are not many spaces at the end of the character string. The break
even point for 4 at a time style loops seems to be about 8 to 12 trailing
spaces. Below that number, they're slower. Above, faster.

I look forward to any discussions and would be extremely interested in
seeing benchmark data on a big endian architecture.

John Cochran

--

There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and the other way is to
make it so complicated that there are no obvious deficiencies. — C.A.R.
Hoare

Attachment Content-Type Size
bcTruelen_benchmark.tar.gz application/x-gzip 140.2 KB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: John Cochran <j69cochran(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Looked at TODO:Considering improving performance of computing CHAR() value lengths
Date: 2014-10-11 22:51:16
Message-ID: 20141011225116.GN21267@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Aug 4, 2014 at 11:48:38AM +0300, Heikki Linnakangas wrote:
> The biggest issue with the previously discussed patches was the lack
> of performance testing. People posted results of various
> micro-benchmarks that showed different aspects, but no-one
> comprehensively covered all the interesting cases. I'd like to see a
> simple suite of performance tests that show:

Any progress on this?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +