Re: CopyReadLineText optimization

From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: CopyReadLineText optimization
Date: 2008-03-11 10:25:18
Message-ID: 47D65E0E.8060109@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-patches

Andrew Dunstan wrote:
> Heikki Linnakangas wrote:
>> It looks like strpbrk() performs poorly:
>
> Yes, not surprising. I just looked at the implementation in glibc, which
> I assume you are using, and it seemed rather basic. The one in NetBSD's
> libc looks much more efficient.
>
> See
>
> http://sources.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/libc/string/strpbrk.c?rev=1.1.2.1&content-type=text/plain&cvsroot=glibc

There's architecture-specific optimized versions of that in glibc,
though. For example, for x86_64:

http://sourceware.org/cgi-bin/cvsweb.cgi/~checkout~/libc/sysdeps/x86_64/strspn.S?rev=1.4&cvsroot=glibc

> Not that what you've done isn't good, if a little obscure (as is the
> NetBSD implementation)

Yeah, it's a lot of code for such a simple thing :-(. On the other hand,
it's very isolated, so it's not that bad.

We chatted about this with Greg Stark yesterday, and we came up with a
few more observations and ideas:

1. CopyReadLineText is all about finding the the next end of line;
splitting to fields is done later. We therefore only care about quotes
and escapes when they affect the end of line detection. In text mode, we
only need to care about a backslash that precedes a LF/CR. Therefore,
we could search for the next LF/CR character with memchr(), and check if
the preceding character is a backslash (and if it is, check if there's
yet another backslash behind it, and so forth until we hit a
non-backslash character).

2. The manual has this to say about escaping newlines with a backslash:

"It is strongly recommended that applications generating COPY data
convert data newlines and carriage returns to the \n and \r sequences
respectively. At present it is possible to represent a data carriage
return by a backslash and carriage return, and to represent a data
newline by a backslash and newline. However, these representations might
not be accepted in future releases."

If we disallowed the backslash+LF sequence, like that paragraph suggests
we could, we wouldn't need to care about backslashes in CopyReadLineText
in text mode at all.

3. If we did the client->server encoding conversion before
CopyReadLineText, one input block at a time:

- We could skip calling pg_encoding_mblen for each character, when the
client encoding is one not supported as a server encoding. That would be
a big performance gain when that's the case, and would simplify the code
in CopyReadLineText a little bit as well.

- The conversion might be faster, as it could operate on bigger blocks

- We could merge the CopyReadLine and CopyReadAttributes functions, and
split the input block into fields in one loop. I would imagine that
being more efficient than scanning the input twice, no matter how fast
we make those scans.

There's a small difficulty with doing the conversion one input block at
a time: there's no suitable interface in the conversion routines for
that. The input block boundary can be between 1st and 2nd byte of a
multi-byte character, so we'd need to have a function that converts a
block of bytes, ignoring any incomplete bytes at the end.

I think I'll experiment with the 1st idea, to see how much code
restructuring it needs and what the performance is like.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Richard Huxton 2008-03-11 10:46:17 Re: strange pg_ctl's behavior
Previous Message Tatsuo Ishii 2008-03-11 10:23:08 Re: strange pg_ctl's behavior

Browse pgsql-patches by date

  From Date Subject
Next Message Zoltan Boszormenyi 2008-03-11 10:39:06 Fix HAVE_LONG[_LONG]_INT_64 to really define to 1
Previous Message Heikki Linnakangas 2008-03-11 09:58:20 Re: Very slow (2 tuples/second) sequential scan afterbulk insert; speed returns to ~500 tuples/second after commit