Re: CopyReadLineText optimization

From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: <pgsql-patches(at)postgresql(dot)org>
Subject: CopyReadLineText optimization
Date: 2008-02-24 01:29:47
Message-ID: 47C0C88B.8090904@enterprisedb.com
Lists: pgsql-hackers pgsql-patches

The purpose of CopyReadLineText is to scan the input buffer and find
the next newline, taking into account any escape characters. It
currently operates in a loop, one byte at a time, searching for LF, CR,
or a backslash. That's a bit slow: I've been running oprofile on COPY,
and I've seen CopyReadLine take around 10% of the CPU time, and
Joshua Drake just posted a very similar profile to hackers.

Attached is a patch that modifies CopyReadLineText so that it uses
memchr to speed up the scan. The nice thing about memchr is that we can
take advantage of any clever optimizations that might be in libc or the
compiler.

In the tests I've been running, it roughly halves the time spent in
CopyReadLine (including the new memchr calls), thus reducing the total
CPU overhead by ~5%. I'm planning to run more tests with data that has
backslashes and with different width tables to see what the worst-case
and best-case performance is like. Also, it doesn't work for CSV format
at the moment; that needs to be fixed.

5% isn't exactly breathtaking, but it's a start. I tried the same trick
with CopyReadAttributesText, but unfortunately it doesn't seem to help
there, because you have to "stop" the efficient word-at-a-time scan that
memchr does (at least with glibc, YMMV) at every column separator, while
in CopyReadLineText you get to process the whole line in one call,
assuming there are no backslashes.
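As a rough sketch of the idea (a hypothetical helper, not the actual patch; the real COPY code also has to deal with buffer refills and encodings): memchr searches for a single byte, so one way to cover all three stop characters is to probe for each one and keep the earliest hit, narrowing the search window as you go.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: return the offset of the next LF, CR, or
 * backslash in buf[0..len), or len if there is none.  memchr looks
 * for one byte at a time, so probe for each stop character in turn;
 * each hit narrows the window for the next probe. */
static size_t
next_stop_char(const char *buf, size_t len)
{
    static const char stops[] = {'\n', '\r', '\\'};
    size_t      best = len;

    for (int i = 0; i < 3; i++)
    {
        const char *p = memchr(buf, stops[i], best);

        if (p != NULL)
            best = (size_t) (p - buf);
    }
    return best;
}
```

With no backslashes in the data, a whole line is covered by three memchr calls instead of a per-byte loop, which is where the speedup comes from.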

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
copy-readline-memchr-2.patch text/x-diff 4.2 KB

From: "Luke Lonergan" <LLonergan(at)greenplum(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, <pgsql-patches(at)postgresql(dot)org>
Cc: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>, <eng(at)intranet(dot)greenplum(dot)com>
Subject: Re: CopyReadLineText optimization
Date: 2008-02-24 01:46:40
Message-ID: 014F2941B0A1EA47BD61D21526B806E901075565@MI8NYCMAIL08.Mi8.com

Cool! It's been a while since we've done the same kind of thing :-)

- Luke



From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CopyReadLineText optimization
Date: 2008-02-29 18:24:52
Message-ID: 47C84DF4.30801@enterprisedb.com

Heikki Linnakangas wrote:
> Attached is a patch that modifies CopyReadLineText so that it uses
> memchr to speed up the scan. The nice thing about memchr is that we can
> take advantage of any clever optimizations that might be in libc or
> compiler.

Here's an updated version of the patch. The principle is the same, but
the same optimization is now used for CSV input as well, and there are
more comments.

I still need to do more benchmarking. I mentioned a ~5% speedup on the
test I ran earlier, which was a load of the lineitem table from TPC-H.
It looks like with cheaper data types the gain can be much bigger;
here's an oprofile from loading the TPC-H partsupp table:
Before:

samples % image name symbol name
5146 25.7635 postgres CopyReadLine
4089 20.4716 postgres DoCopy
1449 7.2544 reiserfs (no symbols)
1369 6.8539 postgres pg_verify_mbstr_len
1013 5.0716 libc-2.7.so memcpy
749 3.7499 libc-2.7.so ____strtod_l_internal
598 2.9939 postgres heap_formtuple
548 2.7436 libc-2.7.so ____strtol_l_internal
403 2.0176 libc-2.7.so memset
309 1.5470 libc-2.7.so strlen
208 1.0414 postgres AllocSetAlloc
...

After:

samples % image name symbol name
4165 25.7879 postgres DoCopy
1574 9.7455 postgres pg_verify_mbstr_len
1520 9.4112 reiserfs (no symbols)
1005 6.2225 libc-2.7.so memchr
986 6.1049 libc-2.7.so memcpy
632 3.9131 libc-2.7.so ____strtod_l_internal
589 3.6468 postgres heap_formtuple
546 3.3806 libc-2.7.so ____strtol_l_internal
386 2.3899 libc-2.7.so memset
366 2.2661 postgres CopyReadLine
287 1.7770 libc-2.7.so strlen
215 1.3312 postgres LWLockAcquire
208 1.2878 postgres hash_any
176 1.0897 postgres LWLockRelease
161 0.9968 postgres InputFunctionCall
157 0.9721 postgres AllocSetAlloc
...

The profile shows that with the patch, ~8.5% of the CPU time is spent in
CopyReadLine+memchr, vs. 25.5% before. That's quite a significant speedup.

I still need to test the worst-case performance, with input that has a
lot of escapes. It would be interesting to hear reports with this patch
from people on different platforms. These results are from my laptop
with 32-bit Intel CPU, running Linux. There could be big differences in
the memchr implementations.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
copy-readline-memchr-3.patch text/x-diff 7.5 KB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: CopyReadLineText optimization
Date: 2008-03-03 19:42:16
Message-ID: 200803031942.m23JgG129730@momjian.us


Your patch has been added to the PostgreSQL unapplied patches list at:

http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.


--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CopyReadLineText optimization
Date: 2008-03-05 13:57:00
Message-ID: 47CEA6AC.9090106@enterprisedb.com

Heikki Linnakangas wrote:
> I still need to test the worst-case performance, with input that has a
> lot of escapes.

Ok, I've done some more performance testing with this. I tested COPY
FROM with a table with a single "text" column. There were a million rows
in the table, each with a 1000-character string:

postgres=# CREATE TABLE narrowtable2 (id text);
CREATE TABLE
postgres=# INSERT INTO narrowtable2 SELECT repeat(E'\\', 1000) FROM
generate_series(1, 1000000);
INSERT 0 1000000

After that, I dumped that to a file, and loaded it back using COPY FROM:

time ~/installations/cvshead/bin/psql postgres -c "BEGIN; TRUNCATE narrowtable2; COPY narrowtable2 FROM '/home/perftester/narrowtable3.tbl'; ROLLBACK;"

I repeated the test with different frequencies of backslashes in the
string, with and without the patch, and took the smallest number from
each test case:

backslashes   with patch   without patch
all           24.9         15.6
every 4th     12.7         11.4
every 8th     10.4         10.7
every 16th     8.7         10.3
none           6.8          9.8

So the overhead of using memchr slows us down if there are a lot of
escape or quote characters. The breakeven point seems to be about 1 in 8
characters. I'm not sure if that's a good tradeoff or not...

I also tested a table with a single integer column, and found no
meaningful difference (10.5 s without the patch vs. 10.6 s with it).
oprofile shows that in this test case only ~5% of the CPU time is spent
in CopyReadLineText, and the patch doesn't change that.

Without patch:
samples  %        image name    app name    symbol name
7563     12.7220  no-vmlinux    postgres    (no symbols)
4050      6.8127  postgres      postgres    DoCopy
3334      5.6083  postgres      postgres    LWLockAcquire
3238      5.4468  postgres      postgres    CopyReadLine
2900      4.8782  postgres      postgres    LWLockRelease
2781      4.6780  libc-2.7.so   postgres    __GI_____strtoll_l_internal
2778      4.6730  postgres      postgres    heap_formtuple
2636      4.4341  postgres      postgres    hash_any
2087      3.5106  no-vmlinux    no-vmlinux  (no symbols)
1748      2.9404  libc-2.7.so   postgres    memset
1724      2.9000  postgres      postgres    PinBuffer
1670      2.8092  postgres      postgres    PageAddItem
1645      2.7671  postgres      postgres    heap_insert
1459      2.4542  postgres      postgres    UnpinBuffer
1457      2.4509  postgres      postgres    ReadBuffer_common
1321      2.2221  postgres      postgres    hash_search_with_hash_value
1278      2.1498  postgres      postgres    MarkBufferDirty
1219      2.0505  oprofiled     oprofiled   (no symbols)
972       1.6350  postgres      postgres    pg_verify_mbstr_len
756       1.2717  postgres      postgres    RelationPutHeapTuple
665       1.1186  postgres      postgres    pg_atoi
631       1.0614  postgres      postgres    RelationGetBufferForTuple
613       1.0312  postgres      postgres    AllocSetReset
...

With patch:
samples  %        image name    app name    symbol name
42720    18.1450  no-vmlinux    postgres    (no symbols)
15367     6.5270  postgres      postgres    DoCopy
11831     5.0251  postgres      postgres    LWLockAcquire
11500     4.8845  no-vmlinux    no-vmlinux  (no symbols)
10182     4.3247  postgres      postgres    LWLockRelease
9912      4.2100  libc-2.7.so   postgres    __GI_____strtoll_l_internal
9811      4.1671  postgres      postgres    hash_any
8824      3.7479  postgres      postgres    heap_formtuple
7459      3.1682  postgres      postgres    CopyReadLine
7187      3.0526  postgres      postgres    PageAddItem
6313      2.6814  libc-2.7.so   postgres    memset
5842      2.4813  postgres      postgres    PinBuffer
5230      2.2214  postgres      postgres    UnpinBuffer
5160      2.1917  postgres      postgres    heap_insert
4838      2.0549  postgres      postgres    ReadBuffer_common
4819      2.0468  postgres      postgres    hash_search_with_hash_value
4691      1.9925  postgres      postgres    MarkBufferDirty
3675      1.5609  libc-2.7.so   postgres    memchr
3617      1.5363  postgres      postgres    AllocSetAlloc
3585      1.5227  postgres      postgres    pg_verify_mbstr_len
3326      1.4127  postgres      postgres    AllocSetReset
...

These tests were run on a test server with dual-core 64-bit Intel Xeons.
I'd still like to hear reports from other platforms.

Another thing that seems like an obvious win is to merge CopyReadLine
and CopyReadAttributesText/CSV so that we do just one pass over the
input. But that seems suspiciously obvious; I wonder if I'm missing
something.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CopyReadLineText optimization
Date: 2008-03-05 14:50:02
Message-ID: 47CEB31A.3050406@enterprisedb.com

Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
>> Attached is a patch that modifies CopyReadLineText so that it uses
>> memchr to speed up the scan. The nice thing about memchr is that we
>> can take advantage of any clever optimizations that might be in libc
>> or compiler.
>
> Here's an updated version of the patch. The principle is the same, but
> the same optimization is now used for CSV input as well, and there's
> more comments.

Another update attached: It occurred to me that the memchr approach is
only safe for server encodings, where the non-first bytes of a
multi-byte character always have the hi-bit set.
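To illustrate the hazard (a standalone example, not from the patch): in Shift-JIS, the katakana "ソ" is encoded as 0x83 0x5C, and the second byte is exactly the ASCII backslash, so a raw memchr scan reports a backslash in the middle of a character. In server-safe encodings the non-first bytes always have the high bit set, so that can't happen.

```c
#include <stddef.h>
#include <string.h>

/* Return the offset of the first 0x5C (backslash) byte, or -1 if
 * there is none.  memchr knows nothing about character boundaries. */
static long
find_backslash(const unsigned char *buf, size_t len)
{
    const unsigned char *p = memchr(buf, '\\', len);

    return p ? (long) (p - buf) : -1;
}
```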

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
copy-readline-memchr-4.patch text/x-diff 7.5 KB

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: CopyReadLineText optimization
Date: 2008-03-05 17:46:26
Message-ID: 47CEDC72.2050600@dunslane.net

Heikki Linnakangas wrote:
>
>
> So the overhead of using memchr slows us down if there's a lot of
> escape or quote characters. The breakeven point seems to be about 1 in
> 8 characters. I'm not sure if that's a good tradeoff or not...
>
>

How about we test the first buffer read in from the file and change
strategy accordingly?

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: CopyReadLineText optimization
Date: 2008-03-06 18:45:34
Message-ID: 47D03BCE.9030909@dunslane.net

Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
>> Heikki Linnakangas wrote:
>>> Attached is a patch that modifies CopyReadLineText so that it uses
>>> memchr to speed up the scan. The nice thing about memchr is that we
>>> can take advantage of any clever optimizations that might be in libc
>>> or compiler.
>>
>> Here's an updated version of the patch. The principle is the same,
>> but the same optimization is now used for CSV input as well, and
>> there's more comments.
>
> Another update attached: It occurred to me that the memchr approach is
> only safe for server encodings, where the non-first bytes of a
> multi-byte character always have the hi-bit set.
>

We currently make the following assumption in the code:

* These four characters, and the CSV escape and quote characters, are
* assumed the same in frontend and backend encodings.
*

The four characters are the carriage return, line feed, backslash and dot.

I think the requirement might well actually be somewhat stronger than
that: i.e. that none of these will appear as a non-first byte in any
multi-byte client encoding character. If that's right, then we should be
able to write CopyReadLineText without bothering about multi-byte chars.
If it's not right, then I suspect we have some cases that can fail now
anyway. (I believe some client encodings at least use backslash in
subsequent bytes, and that's a nasty one because the "\." end sequence
is hard-coded.) I believe all the chars up to 0x2f are safe (that
includes both quote chars and dot).

cheers

andrew


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CopyReadLineText optimization
Date: 2008-03-06 18:52:11
Message-ID: 47D03D5B.4010309@enterprisedb.com

Andrew Dunstan wrote:
> Heikki Linnakangas wrote:
>> Another update attached: It occurred to me that the memchr approach is
>> only safe for server encodings, where the non-first bytes of a
>> multi-byte character always have the hi-bit set.
>>
>
> We currently make the following assumption in the code:
>
> * These four characters, and the CSV escape and quote characters, are
> * assumed the same in frontend and backend encodings.
> *
>
> The four characters are the carriage return, line feed, backslash and dot.
>
> I think the requirement might well actually be somewhat stronger than
> that: i.e. that none of these will appear as a non-first byte in any
> multi-byte client encoding character. If that's right, then we should be
> able to write CopyReadLineText without bothering about multi-byte chars.
> If it's not right then I suspect we have some cases that can fail now
> anyway.

No, we don't require that, and we do handle it correctly. We use
pg_encoding_mblen to determine the length of each character in
CopyReadLineText when the encoding is a client-only encoding, and only
look at the first byte of each character. In CopyReadAttributesText,
where we have a similar loop, we've already transformed the input to
server encoding.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: CopyReadLineText optimization
Date: 2008-03-06 19:21:03
Message-ID: 20191.1204831263@sss.pgh.pa.us

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> Andrew Dunstan wrote:
>> We currently make the following assumption in the code:
>>
>> * These four characters, and the CSV escape and quote characters, are
>> * assumed the same in frontend and backend encodings.
>>
>> The four characters are the carriage return, line feed, backslash and dot.
>>
>> I think the requirement might well actually be somewhat stronger than
>> that: i.e. that none of these will appear as a non-first byte in any
>> multi-byte client encoding character.

> No, we don't require that, and we do handle it correctly.

I believe we *have to* handle it correctly, because backslash actually
does appear as a non-first byte in SJIS or some other Far Eastern
encoding. In the CSV case such a restriction would be untenable anyway.

BTW, I notice that the code allows CSV escape and quote characters that
have the high bit set (in single-byte server encodings that is). Is
this a good idea? It seems like such are extremely unlikely to be the
same in two different encodings. Maybe we should restrict to the ASCII
range? Especially if the client encoding is multibyte ...

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: CopyReadLineText optimization
Date: 2008-03-06 19:24:19
Message-ID: 47D044E3.3030604@dunslane.net

Heikki Linnakangas wrote:
> Andrew Dunstan wrote:
>> Heikki Linnakangas wrote:
>>> Another update attached: It occurred to me that the memchr approach is
>>> only safe for server encodings, where the non-first bytes of a
>>> multi-byte character always have the hi-bit set.
>>>
>>
>> We currently make the following assumption in the code:
>>
>> * These four characters, and the CSV escape and quote characters,
>> are
>> * assumed the same in frontend and backend encodings.
>> *
>>
>> The four characters are the carriage return, line feed, backslash and
>> dot.
>>
>> I think the requirement might well actually be somewhat stronger than
>> that: i.e. that none of these will appear as a non-first byte in any
>> multi-byte client encoding character. If that's right, then we should
>> be able to write CopyReadLineText without bothering about multi-byte
>> chars. If it's not right then I suspect we have some cases that can
>> fail now anyway.
>
> No, we don't require that, and we do handle it correctly. We use
> pg_encoding_mblen to determine the length of each character in
> CopyReadLineText when the encoding is a client-only encoding, and only
> look at the first byte of each character. In CopyReadAttributesText,
> where we have a similar loop, we've already transformed the input to
> server encoding.

Oops. I see that now. Funny how I missed it when I went looking for it :-(

I think I understand the patch now :-)

I'm still a bit worried about applying it unless it gets some adaptive
behaviour or something so that we don't cause any serious performance
regressions in some cases. Also, could we perhaps benefit from inlining
some calls, or is your compiler doing that anyway?

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: CopyReadLineText optimization
Date: 2008-03-06 19:29:24
Message-ID: 47D04614.7060109@dunslane.net

Tom Lane wrote:
> BTW, I notice that the code allows CSV escape and quote characters that
> have the high bit set (in single-byte server encodings that is). Is
> this a good idea? It seems like such are extremely unlikely to be the
> same in two different encodings. Maybe we should restrict to the ASCII
> range? Especially if the client encoding is multibyte ...
>
>
>

In the commonest case these are both either " or '. I would not have any
objection to requiring them to be ASCII chars.

cheers

andrew


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CopyReadLineText optimization
Date: 2008-03-06 19:38:36
Message-ID: 47D0483C.50606@enterprisedb.com

Tom Lane wrote:
> BTW, I notice that the code allows CSV escape and quote characters that
> have the high bit set (in single-byte server encodings that is). Is
> this a good idea? It seems like such are extremely unlikely to be the
> same in two different encodings. Maybe we should restrict to the ASCII
> range? Especially if the client encoding is multibyte ...

At least many of the ISO-8859-* encodings share common non-ASCII
characters, and there's no problem as long as the client and server
encodings match. But it does seem risky to allow it if we can't detect
and throw an error in the non-safe cases. Perhaps we could translate the
chars from client to server encoding?

If the client encoding is a multibyte one, then we certainly should
elog(ERROR) if you try to do that.

Though from a practical point of view, I doubt anyone would mind if we
just always restricted them to ASCII range...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CopyReadLineText optimization
Date: 2008-03-06 19:44:31
Message-ID: 47D0499F.301@enterprisedb.com

Andrew Dunstan wrote:
> I'm still a bit worried about applying it unless it gets some adaptive
> behaviour or something so that we don't cause any serious performance
> regressions in some cases.

I'll try to come up with something. At the most conservative end, we
could fall back to the current method on the first escape, quote or
backslash character.

> Also, could we perhaps benefit from inlining
> some calls, or is your compiler doing that anyway?

gcc does inline all static functions that are only called from one site,
and small functions, using some heuristic. I don't think more
aggressive inlining would help.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: CopyReadLineText optimization
Date: 2008-03-06 19:56:32
Message-ID: 47D04C70.7070105@dunslane.net

Heikki Linnakangas wrote:
> Andrew Dunstan wrote:
>> I'm still a bit worried about applying it unless it gets some
>> adaptive behaviour or something so that we don't cause any serious
>> performance regressions in some cases.
>
> I'll try to come up with something. At the most conservative end, we
> could fall back to the current method on the first escape, quote or
> backslash character.
>
>

That's far too conservative, I think. Somewhere a bit short of your
observed breakeven point seems about right.

cheers

andrew


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CopyReadLineText optimization
Date: 2008-03-06 20:08:03
Message-ID: 47D04F23.8060908@enterprisedb.com

Andrew Dunstan wrote:
> Heikki Linnakangas wrote:
>> Andrew Dunstan wrote:
>>> I'm still a bit worried about applying it unless it gets some
>>> adaptive behaviour or something so that we don't cause any serious
>>> performance regressions in some cases.
>>
>> I'll try to come up with something. At the most conservative end, we
>> could fall back to the current method on the first escape, quote or
>> backslash character.
>
> That's far too conservative, I think. Somewhere a bit short of your
> observed breakeven point seems about right.

The problem is, you don't know how many "stop" characters there are
until you've counted them.

We could fall back after X such characters, or only start using memchr
after seeing 8 consecutive non-stop characters. Whatever we choose, the
heuristic needs to be very simple and fast to check, otherwise we just
introduce more overhead trying to decide which method to use.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-patches(at)postgresql(dot)org
Subject: Re: CopyReadLineText optimization
Date: 2008-03-06 20:29:18
Message-ID: Pine.GSO.4.64.0803061510370.380@westnet.com

On Thu, 6 Mar 2008, Heikki Linnakangas wrote:

> At the most conservative end, we could fall back to the current method
> on the first escape, quote or backslash character.

I would just count the number of escape/quote characters on each line,
and then at the end of the line switch between the current code and the
new version based on what the previous line looked like. That way the
only additional overhead is a small bit when escapes show up often, plus
a touch more just once per line. Barely noticeable in the case where
nothing is escaped, a very small regression for escape-heavy stuff, but
certainly better than the drop you reported in the last rev of this patch.

Rev two of that design would keep a weighted moving average of the total
number of escaped characters per line (say wma=(7*wma+current)/8) and
switch modes based on that instead of just the previous line. There's
enough play in the region where the two approaches' performance crosses
over that it should be easy to get a decent transition. Based on your
data I would put the transition at wma>4, which should keep the old code
in play even if only half the lines have the bad regression that shows
up with >8 escapes per line.
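The moving-average switch could look something like this (a hypothetical sketch; the formula and the wma>4 threshold are from the suggestion above, the names and the integer arithmetic are illustrative):

```c
#include <stdbool.h>

typedef struct
{
    int  wma;   /* weighted moving average of escapes per line */
} EscapeStats;

/* Fold this line's escape count into the average and report whether
 * the next line should use the old byte-at-a-time scanner. */
static bool
prefer_byte_scan(EscapeStats *st, int escapes_on_line)
{
    st->wma = (7 * st->wma + escapes_on_line) / 8;
    return st->wma > 4;
}
```

The check is one multiply, one add, one shift-like divide, and one compare per line, so it adds essentially no overhead to the clean-data case.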

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: CopyReadLineText optimization
Date: 2008-03-06 20:44:47
Message-ID: 47D057BF.9030302@dunslane.net

Greg Smith wrote:
> On Thu, 6 Mar 2008, Heikki Linnakangas wrote:
>
>> At the most conservative end, we could fall back to the current
>> method on the first escape, quote or backslash character.
>
> I would just count the number of escaped/quote characters on each
> line, and then at the end of the line switch modes between the current
> code on the new version based on what the previous line looked like.
> That way the only additional overhead is a small bit only when escapes
> show up often, plus a touch more just once per line. Barely noticeable
> in the case where nothing is escaped, a very small regression for
> escape-heavy data, but certainly better than the drop you reported in
> the last rev of this patch.
>
> Rev two of that design would keep a weighted moving average of the
> total number of escaped characters per line (say
> wma=(7*wma+current)/8) and switch modes based on that instead of the
> previous line alone. There's enough play in the crossover region where
> the two approaches perform about equally that it should be easy to get
> a decent transition. Based on your data I would put the transition at
> wma>4, which should keep the old code in play even if only half the
> lines have the bad regression that shows up with >8 escapes per line.
>
>

I'd be inclined just to look at the first buffer of data we read in, and
make a one-off decision there, if we can get away with it. Then the cost
of testing is fixed rather than per line.
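The one-off decision could look something like the sketch below: scan the first buffer once, count the characters that would force the slow path, and commit to a strategy for the rest of the COPY. The function name, the set of "special" characters, and the ~1.5% threshold are all hypothetical:

```c
#include <stddef.h>
#include <stdbool.h>

/*
 * Sketch of a one-off strategy decision made on the first input buffer.
 * Counts backslashes and quotes, then chooses the memchr-based fast
 * scan only if they are rare. Threshold and names are assumptions.
 */
static bool
choose_fast_scan(const char *buf, size_t len)
{
    size_t special = 0;

    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\\' || buf[i] == '"')
            special++;

    /* fast path unless more than ~1.5% of bytes are special */
    return special * 64 < len;
}
```

The cost is one extra pass over a single buffer, after which every line pays nothing for the decision.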

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: CopyReadLineText optimization
Date: 2008-03-07 20:59:17
Message-ID: 47D1ACA5.80308@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> Andrew Dunstan wrote:
>> I'm still a bit worried about applying it unless it gets some
>> adaptive behaviour or something so that we don't cause any serious
>> performance regressions in some cases.
>
> I'll try to come up with something. At the most conservative end, we
> could fall back to the current method on the first escape, quote or
> backslash character.
>
>> Also, could we perhaps benefit from inlining some calls, or is your
>> compiler doing that anyway?
>
> gcc does inline all static functions that are only called from one
> site, and small functions, using some heuristic. I don't think more
> aggressive inlining would help.
>

Another question that occurred to me - did you try using strpbrk() to
look for the next interesting character rather than your homegrown
searcher gadget? If so, how did that perform?

cheers

andrew


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CopyReadLineText optimization
Date: 2008-03-08 09:35:26
Message-ID: 47D25DDE.7000405@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan wrote:
>
>
> Heikki Linnakangas wrote:
>> Andrew Dunstan wrote:
>>> I'm still a bit worried about applying it unless it gets some
>>> adaptive behaviour or something so that we don't cause any serious
>>> performance regressions in some cases.
>>
>> I'll try to come up with something. At the most conservative end, we
>> could fall back to the current method on the first escape, quote or
>> backslash character.
>>
>>> Also, could we perhaps benefit from inlining some calls, or is your
>>> compiler doing that anyway?
>>
>> gcc does inline all static functions that are only called from one
>> site, and small functions, using some heuristic. I don't think more
>> aggressive inlining would help.
>>
>
> Another question that occurred to me - did you try using strpbrk() to
> look for the next interesting character rather than your homegrown
> searcher gadget? If so, how did that perform?

I haven't tried that. There's a small difference: strpbrk stops at '\0'.
But come to think of it, I guess it doesn't matter. Will test...
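That difference is easy to see with two tiny wrappers (hypothetical names, shown only to illustrate the contrast): strpbrk() works on a NUL-terminated string and stops at the first '\0', while memchr() takes an explicit length and scans right past embedded NUL bytes.

```c
#include <string.h>
#include <stddef.h>

/* strpbrk() stops at the first '\0' in the haystack... */
static const char *
find_newline_strpbrk(const char *s)
{
    return strpbrk(s, "\n\r");
}

/* ...while memchr() scans exactly len bytes, NULs included. */
static const char *
find_newline_memchr(const char *s, size_t len)
{
    return memchr(s, '\n', len);
}
```

Given a buffer "abc\0def\n" of 8 data bytes, the strpbrk version returns NULL (it never reaches the newline), while the memchr version finds the newline at offset 7.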

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CopyReadLineText optimization
Date: 2008-03-10 16:08:35
Message-ID: 47D55D03.7010108@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan wrote:
> Another question that occurred to me - did you try using strpbrk() to
> look for the next interesting character rather than your homegrown
> searcher gadget? If so, how did that perform?

It looks like strpbrk() performs poorly:

unpatched:
testname | min duration
----------+-----------------
all | 00:00:08.099656
1/2 | 00:00:06.734241
1/4 | 00:00:06.016946
1/8 | 00:00:05.622122
1/16 | 00:00:05.304252
none | 00:00:05.155755
(6 rows)

strpbrk:

testname | min duration
----------+-----------------
all | 00:00:22.980997
1/2 | 00:00:13.724834
1/4 | 00:00:08.980246
1/8 | 00:00:06.582357
1/16 | 00:00:05.291485
none | 00:00:06.239468
(6 rows)

memchr:

testname | min duration
----------+-----------------
all | 00:00:13.684479
1/2 | 00:00:09.509213
1/4 | 00:00:06.921959
1/8 | 00:00:05.654761
1/16 | 00:00:04.719027
none | 00:00:03.718361
(6 rows)

Attached is the test script and patches I used, if someone wants to test
this on another platform.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
copy-readline-memchr-4.patch text/x-diff 7.5 KB
copy-readline-strpbrk-1.patch text/x-diff 1.5 KB
copy-script.sh application/x-shellscript 1.6 KB

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: CopyReadLineText optimization
Date: 2008-03-10 17:38:33
Message-ID: 47D57219.6080402@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> Andrew Dunstan wrote:
>> Another question that occurred to me - did you try using strpbrk() to
>> look for the next interesting character rather than your homegrown
>> searcher gadget? If so, how did that perform?
>
> It looks like strpbrk() performs poorly:

Yes, not surprising. I just looked at the implementation in glibc, which
I assume you are using, and it seemed rather basic. The one in NetBSD's
libc looks much more efficient.

See

http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/string/strpbrk.c?rev=1.1.2.1&content-type=text/plain&cvsroot=glibc
and
http://cvsweb.de.netbsd.org/cgi-bin/cvsweb.cgi/src/lib/libc/string/strpbrk.c?rev=1.16;content-type=text%2Fx-cvsweb-markup

Not that what you've done isn't good, if a little obscure (as is the
NetBSD implementation).

cheers

andrew


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CopyReadLineText optimization
Date: 2008-03-11 10:25:18
Message-ID: 47D65E0E.8060109@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan wrote:
> Heikki Linnakangas wrote:
>> It looks like strpbrk() performs poorly:
>
> Yes, not surprising. I just looked at the implementation in glibc, which
> I assume you are using, and it seemed rather basic. The one in NetBSD's
> libc looks much more efficient.
>
> See
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/string/strpbrk.c?rev=1.1.2.1&content-type=text/plain&cvsroot=glibc

There are architecture-specific optimized versions of that in glibc,
though. For example, for x86_64:

http://sourceware.org/cgi-bin/cvsweb.cgi/~checkout~/libc/sysdeps/x86_64/strspn.S?rev=1.4&cvsroot=glibc

> Not that what you've done isn't good, if a little obscure (as is the
> NetBSD implementation).

Yeah, it's a lot of code for such a simple thing :-(. On the other hand,
it's very isolated, so it's not that bad.

We chatted about this with Greg Stark yesterday, and we came up with a
few more observations and ideas:

1. CopyReadLineText is all about finding the next end of line;
splitting into fields is done later. We therefore only care about quotes
and escapes when they affect the end of line detection. In text mode, we
only need to care about a backslash that precedes a LF/CR. Therefore,
we could search for the next LF/CR character with memchr(), and check if
the preceding character is a backslash (and if it is, check if there's
yet another backslash behind it, and so forth until we hit a
non-backslash character).
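Idea 1 can be sketched as below. This is an illustration of the approach, not the actual patch: find each LF with memchr(), then count the run of backslashes immediately before it; an odd-length run means the newline is escaped. (A real version would also have to handle CR and CR+LF; the function name is hypothetical.)

```c
#include <string.h>
#include <stddef.h>

/*
 * Sketch of idea 1: locate the next LF with memchr(), then decide
 * whether it is a real end of line by counting the backslashes
 * directly preceding it. An odd run means the newline is escaped.
 */
static const char *
find_unescaped_newline(const char *buf, size_t len)
{
    const char *p = buf;
    const char *end = buf + len;

    while (p < end)
    {
        const char *nl = memchr(p, '\n', end - p);

        if (nl == NULL)
            return NULL;        /* no newline in this buffer */

        /* walk back over the backslashes immediately before the LF */
        const char *q = nl;
        while (q > buf && q[-1] == '\\')
            q--;

        if ((nl - q) % 2 == 0)
            return nl;          /* even run: the newline is real */

        p = nl + 1;             /* escaped newline, keep searching */
    }
    return NULL;
}
```

The common case (no backslashes anywhere near the newline) costs one memchr() call and one backward comparison per line.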

2. The manual has this to say about escaping newlines with a backslash:

"It is strongly recommended that applications generating COPY data
convert data newlines and carriage returns to the \n and \r sequences
respectively. At present it is possible to represent a data carriage
return by a backslash and carriage return, and to represent a data
newline by a backslash and newline. However, these representations might
not be accepted in future releases."

If we disallowed the backslash+LF sequence, like that paragraph suggests
we could, we wouldn't need to care about backslashes in CopyReadLineText
in text mode at all.

3. If we did the client->server encoding conversion before
CopyReadLineText, one input block at a time:

- We could skip calling pg_encoding_mblen for each character, when the
client encoding is one not supported as a server encoding. That would be
a big performance gain when that's the case, and would simplify the code
in CopyReadLineText a little bit as well.

- The conversion might be faster, as it could operate on bigger blocks

- We could merge the CopyReadLine and CopyReadAttributes functions, and
split the input block into fields in one loop. I would imagine that
being more efficient than scanning the input twice, no matter how fast
we make those scans.

There's a small difficulty with doing the conversion one input block at
a time: there's no suitable interface in the conversion routines for
that. The input block boundary can fall between the 1st and 2nd byte of
a multi-byte character, so we'd need a function that converts a block of
bytes, ignoring any incomplete multi-byte sequence at the end.
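To make the boundary problem concrete, here is a UTF-8-only sketch of the "how many leading bytes form complete characters" computation; the trailing remainder would be carried over to the next block before conversion. The real code would need an equivalent per-encoding check (presumably via pg_encoding_mblen); the function name is hypothetical:

```c
#include <stddef.h>

/*
 * UTF-8-only illustration: return how many leading bytes of buf form
 * only complete characters, so any incomplete trailing sequence can be
 * saved and prepended to the next block.
 */
static size_t
complete_prefix_len(const unsigned char *buf, size_t len)
{
    size_t i = len;

    /* back up over trailing continuation bytes (10xxxxxx) */
    while (i > 0 && (buf[i - 1] & 0xC0) == 0x80)
        i--;

    if (i == 0)
        return len;             /* no lead byte found; pass through */

    unsigned char lead = buf[i - 1];
    size_t seqlen;

    if (lead < 0x80)                seqlen = 1;
    else if ((lead & 0xE0) == 0xC0) seqlen = 2;
    else if ((lead & 0xF0) == 0xE0) seqlen = 3;
    else if ((lead & 0xF8) == 0xF0) seqlen = 4;
    else                            seqlen = 1;  /* invalid; pass through */

    /* if the last sequence fits entirely in the buffer, keep everything */
    if (i - 1 + seqlen <= len)
        return len;

    return i - 1;               /* drop the incomplete trailing sequence */
}
```

For example, a block ending in the first byte of a two-byte character would be trimmed by one byte, and that byte re-examined with the next block.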

I think I'll experiment with the 1st idea, to see how much code
restructuring it needs and what the performance is like.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCHES] CopyReadLineText optimization
Date: 2008-03-27 18:28:54
Message-ID: 47EBE766.3070305@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Heikki Linnakangas wrote:
> 1. CopyReadLineText is all about finding the next end of line;
> splitting into fields is done later. We therefore only care about quotes
> and escapes when they affect the end of line detection. In text mode, we
> only need to care about a backslash that precedes a LF/CR. Therefore,
> we could search for the next LF/CR character with memchr(), and check if
> the preceding character is a backslash (and if it is, check if there's
> yet another backslash behind it, and so forth until we hit a
> non-backslash character).

While looking into this, I realized that we also need to detect the
end-of-copy marker, backslash+period+EOL. In CSV mode, we only honor the
end-of-copy marker if it's on a line of its own, as \. can occur in
data, but in text mode we accept it at any point.

Does anyone object to changing that so that we only accept \. on a line
of its own in text mode as well? That way we wouldn't need to care about
backslashes in CopyReadLineText. AFAIK our tools have always output the
\. like that, so this would only affect custom applications that use
COPY and the \. marker.
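Part of the appeal of the proposed change is how simple detection becomes: once a whole line has been assembled, the end-of-copy check reduces to comparing the line against "\.". A hypothetical helper (assuming the newline has already been stripped from the line):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * If \. were only honored on a line of its own, end-of-copy detection
 * reduces to a whole-line comparison after line assembly, and
 * CopyReadLineText itself never has to look for backslashes.
 * Hypothetical helper, not from any patch.
 */
static bool
is_end_of_copy(const char *line, size_t len)
{
    return len == 2 && line[0] == '\\' && line[1] == '.';
}
```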

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com