bytea performance tweak

Lists: pgsql-jdbc
From: Andreas Schmitz <schmitz(at)occamlabs(dot)de>
To: pgsql-jdbc(at)postgresql(dot)org
Subject: bytea performance tweak
Date: 2011-06-21 13:08:11
Message-ID: 20110621130713.GB3919@summalogicae.occamlabs.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Hi,

I've noticed that fetching a bytea field produces a lot of overhead when
converting into a byte[] or InputStream. Looking at the code, I've
produced a small patch (a lookup table for the hex codes) which makes
the conversion faster. What do you think?

Best regards, Andreas

Attachment Content-Type Size
patch text/plain 2.7 KB

From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: Andreas Schmitz <schmitz(at)occamlabs(dot)de>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-21 23:59:07
Message-ID: BANLkTimryw=Sm5q4v53PQNkdi+RjAfgJBA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

On 22 June 2011 01:08, Andreas Schmitz <schmitz(at)occamlabs(dot)de> wrote:
> Hi,
>
> I've noticed that fetching a bytea field produces a lot of overhead when
> converting into a byte[] or InputStream. Looking at the code, I've
> produced a small patch (a lookup table for the hex codes) which makes
> the conversion faster. What do you think?

Surprising. How did you benchmark it?

Oliver


From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Oliver Jowett <oliver(at)opencloud(dot)com>
Cc: Andreas Schmitz <schmitz(at)occamlabs(dot)de>, pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-22 03:27:14
Message-ID: 4E016112.3080709@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

On 22/06/11 07:59, Oliver Jowett wrote:
> On 22 June 2011 01:08, Andreas Schmitz <schmitz(at)occamlabs(dot)de> wrote:
>> Hi,
>>
>> I've noticed that fetching a bytea field produces a lot of overhead when
>> converting into a byte[] or InputStream. Looking at the code, I've
>> produced a small patch (a lookup table for the hex codes) which makes
>> the conversion faster. What do you think?
>
> Surprising.

It's not at _all_ surprising that an array index lookup is much faster
than two function calls and a bunch of branches. It'll probably JIT
better and be easier on cache, too.

If Oliver is lobbing around mostly bytea data and/or using big bytea
fields I'm not surprised he's seeing this as at least somewhat of a
hotspot. I've seen comments on bytea performance with JDBC here before,
though nobody seems to have dug into it enough to track things down before.

Thanks Oliver!

--
Craig Ringer


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: Andreas Schmitz <schmitz(at)occamlabs(dot)de>, pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-22 04:24:50
Message-ID: BANLkTik1W1j0Y0LQyJ0ZL1XYxo_TSST6=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

On 22 June 2011 15:27, Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> wrote:
> On 22/06/11 07:59, Oliver Jowett wrote:
>> On 22 June 2011 01:08, Andreas Schmitz <schmitz(at)occamlabs(dot)de> wrote:
>>> Hi,
>>>
>>> I've noticed that fetching a bytea field produces a lot of overhead when
>>> converting into a byte[] or InputStream. Looking at the code, I've
>>> produced a small patch (a lookup table for the hex codes) which makes
>>> the conversion faster. What do you think?
>>
>> Surprising.
>
> It's not at _all_ surprising that an array index lookup is much faster
> than two function calls and a bunch of branches. It'll probably JIT
> better and be easier on cache, too.
>
> If Oliver is lobbing around mostly bytea data and/or using big bytea
> fields I'm not surprised he's seeing this as at least somewhat of a
> hotspot. I've seen comments on bytea performance with JDBC here before,
> though nobody seems to have dug into it enough to track things down before.

(The patch was Andreas' work, not mine)

I'm interested in the benchmark results because

a) I would have thought that a smaller 1D array plus bitshift/or would
be about as fast as a 2D array and probably cache-friendlier.
b) the JIT should be inlining the function calls in the original
version anyway, and I'm surprised that the branches are that expensive

I've seen enough weird performance behavior from the JIT in the past
to be inherently suspicious of anything that claims to be a speedup
without numbers to back it up.
If we have a benchmark to work from, then I'm sure there's even more
performance we can squeeze out of it.

Oliver


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: Andreas Schmitz <schmitz(at)occamlabs(dot)de>, pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-22 07:57:53
Message-ID: BANLkTincjfaot3pNo6Acj6aGbKS+gX+9dg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

On 22 June 2011 16:24, Oliver Jowett <oliver(at)opencloud(dot)com> wrote:

> I've seen enough weird performance behavior from the JIT in the past
> to be inherently suspicious of anything that claims to be a speedup
> without numbers to back it up.
> If we have a benchmark to work from, then I'm sure there's even more
> performance we can squeeze out of it.

FWIW here's a quick microbenchmark.
Andreas' patch is about 5x faster than the current code.
A version using a one-dimensional lookup array is marginally faster again (7%).

$ java -server -version; uname -a; time java -server -classpath . Benchmark 5
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
Linux downpour 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC
2011 x86_64 x86_64 x86_64 GNU/Linux
original code, 1000 1161632 iterations in
5018679 us 4.32 us/iteration
2D array lookup, 1000 6025844 iterations in
4948308 us 0.82 us/iteration
1D array lookup, 1000 6361396 iterations in
4996605 us 0.79 us/iteration
original code, 100000 9196 iterations in
5019373 us 545.82 us/iteration
2D array lookup, 100000 57049 iterations in
4999200 us 87.63 us/iteration
1D array lookup, 100000 61162 iterations in
5000197 us 81.75 us/iteration

real 1m55.117s
user 1m55.400s
sys 0m0.210s

(I also tried some variations with induction variables but got no real
speedup and enough weird results to make me think it's confusing the
JIT)

There is also a bug in the original patch that I noticed along the way
(the initialization of HEX_LOOKUP doesn't use lo2/hi2) - fixed in the
benchmark code.

Oliver

Attachment Content-Type Size
Benchmark.java text/x-java 6.5 KB

From: Andreas Schmitz <schmitz(at)occamlabs(dot)de>
To: Oliver Jowett <oliver(at)opencloud(dot)com>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-22 14:17:12
Message-ID: 20110622141711.GA2652@summalogicae.occamlabs.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Oliver Jowett wrote:

Hi,

> On 22 June 2011 01:08, Andreas Schmitz <schmitz(at)occamlabs(dot)de> wrote:
> > I've noticed that fetching a bytea field produces a lot of overhead when
> > converting into a byte[] or InputStream. Looking at the code, I've
> > produced a small patch (a lookup table for the hex codes) which makes
> > the conversion faster. What do you think?
>
> Surprising. How did you benchmark it?

I did some local tests using a main function, with a static array. I've
seen times like ~1900ms for decoding using the unpatched version, and
~1600ms for the patched version. I've also benchmarked my application
(which makes heavy use of bytea), and have seen some speedup there as
well (around 2-300ms faster when decoding ~23000 bytea rows with ~30MB
of data in total).

I was wondering why the protocol uses such an encoding anyway. Wouldn't
it be a lot faster just sending the bytes as they are?

Best regards, Andreas


From: Andreas Schmitz <schmitz(at)occamlabs(dot)de>
To: Oliver Jowett <oliver(at)opencloud(dot)com>
Cc: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-22 14:22:06
Message-ID: 20110622142205.GB2652@summalogicae.occamlabs.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Oliver Jowett wrote:

Hi,

> On 22 June 2011 16:24, Oliver Jowett <oliver(at)opencloud(dot)com> wrote:
>
> > I've seen enough weird performance behavior from the JIT in the past
> > to be inherently suspicious of anything that claims to be a speedup
> > without numbers to back it up.
> > If we have a benchmark to work from, then I'm sure there's even more
> > performance we can squeeze out of it.
>
> FWIW here's a quick microbenchmark.
> Andreas' patch is about 5x faster than the current code.
> A version using a one-dimensional lookup array is marginally faster again (7%).
>
> $ java -server -version; uname -a; time java -server -classpath . Benchmark 5
> java version "1.6.0_24"
> Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
> Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
> Linux downpour 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC
> 2011 x86_64 x86_64 x86_64 GNU/Linux
> original code, 1000 1161632 iterations in
> 5018679 us 4.32 us/iteration
> 2D array lookup, 1000 6025844 iterations in
> 4948308 us 0.82 us/iteration
> 1D array lookup, 1000 6361396 iterations in
> 4996605 us 0.79 us/iteration
> original code, 100000 9196 iterations in
> 5019373 us 545.82 us/iteration
> 2D array lookup, 100000 57049 iterations in
> 4999200 us 87.63 us/iteration
> 1D array lookup, 100000 61162 iterations in
> 5000197 us 81.75 us/iteration
>
> real 1m55.117s
> user 1m55.400s
> sys 0m0.210s
>
> (I also tried some variations with induction variables but got no real
> speedup and enough weird results to make me think it's confusing the
> JIT)
>
> There is also a bug in the original patch that I noticed along the way
> (the initialization of HEX_LOOKUP doesn't use lo2/hi2) - fixed in the
> benchmark code.

wow, I wouldn't have expected that much of a speedup. I've also tested
using a one dimensional array, but my crappy main method did not see
much of a difference. Thanks for benchmarking in detail!

Best regards, Andreas


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: Andreas Schmitz <schmitz(at)occamlabs(dot)de>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-22 23:23:41
Message-ID: BANLkTimO_hQFR-5DwL5dZsMdS3XxD=WWQQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

On 23 June 2011 02:17, Andreas Schmitz <schmitz(at)occamlabs(dot)de> wrote:

> I was wondering why the protocol uses such an encoding anyway. Wouldn't
> it be a lot faster just sending the bytes as they are?

We can request binary-format data, which for bytea is indeed just the
raw bytes, but the problem is that it's all or nothing: if we request
binary data we have to handle binary results for all datatypes, not
just bytea (because the request is per-column and generally the driver
doesn't know the column type at the point it makes the request)
Some progress has been made with that, but it's not production-ready
yet I think.

Oliver


From: Andreas Schmitz <schmitz(at)occamlabs(dot)de>
To: Oliver Jowett <oliver(at)opencloud(dot)com>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-23 10:06:16
Message-ID: 20110623100616.GA14152@summalogicae.occamlabs.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Oliver Jowett wrote:

Hi,

> On 23 June 2011 02:17, Andreas Schmitz <schmitz(at)occamlabs(dot)de> wrote:
>
> > I was wondering why the protocol uses such an encoding anyway. Wouldn't
> > it be a lot faster just sending the bytes as they are?
>
> We can request binary-format data, which for bytea is indeed just the
> raw bytes, but the problem is that it's all or nothing: if we request
> binary data we have to handle binary results for all datatypes, not
> just bytea (because the request is per-column and generally the driver
> doesn't know the column type at the point it makes the request)
> Some progress has been made with that, but it's not production-ready
> yet I think.

ok, understood. Is there a timeline for fully supporting the binary
protocol?

Best regards, Andreas


From: Florian Weimer <fweimer(at)bfk(dot)de>
To: Andreas Schmitz <schmitz(at)occamlabs(dot)de>
Cc: Oliver Jowett <oliver(at)opencloud(dot)com>, pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-27 13:57:06
Message-ID: 82pqlzs7gt.fsf@mid.bfk.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

* Andreas Schmitz:

> Is there a timeline for fully supporting the binary protocol?

A PostgreSQL installation can contain user-defined types with a custom
binary format, so this will not work with all databases anyway.

A way to tell PostgreSQL to always send some times in binary format
would solve this, though.

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


From: Vitalii Tymchyshyn <tivv00(at)gmail(dot)com>
To: Florian Weimer <fweimer(at)bfk(dot)de>
Cc: Andreas Schmitz <schmitz(at)occamlabs(dot)de>, Oliver Jowett <oliver(at)opencloud(dot)com>, pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-27 15:23:38
Message-ID: 4E08A07A.1090505@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

27.06.11 16:57, Florian Weimer написав(ла):
> * Andreas Schmitz:
>
>> Is there a timeline for fully supporting the binary protocol?
> A PostgreSQL installation can contain user-defined types with a custom
> binary format, so this will not work with all databases anyway.
I'd say it would be up to Java developer to provide it's own decode for
such a types (in a PGObject?). If he wants, he should do server-level
cast to string in a query text.
Of course, I'd leave an option to make non-binary connection.

Best regards, Vitalii Tymchyshyn.


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Florian Weimer <fweimer(at)bfk(dot)de>
Cc: Andreas Schmitz <schmitz(at)occamlabs(dot)de>, Oliver Jowett <oliver(at)opencloud(dot)com>, pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-27 16:24:53
Message-ID: 1309191893.2800.11.camel@hvost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

On Mon, 2011-06-27 at 13:57 +0000, Florian Weimer wrote:
> * Andreas Schmitz:
>
> > Is there a timeline for fully supporting the binary protocol?
>
> A PostgreSQL installation can contain user-defined types with a custom
> binary format, so this will not work with all databases anyway.

Is'nt the same also true for _text_ representation of user-defined
types ?

> A way to tell PostgreSQL to always send some times in binary format
> would solve this, though.
>
> --
> Florian Weimer <fweimer(at)bfk(dot)de>
> BFK edv-consulting GmbH http://www.bfk.de/
> Kriegsstraße 100 tel: +49-721-96201-1
> D-76133 Karlsruhe fax: +49-721-96201-99
>

--
-------
Hannu Krosing
PostgreSQL Infinite Scalability and Performance Consultant
PG Admin Book: http://www.2ndQuadrant.com/books/


From: Maciek Sakrejda <msakrejda(at)truviso(dot)com>
To: Florian Weimer <fweimer(at)bfk(dot)de>
Cc: Andreas Schmitz <schmitz(at)occamlabs(dot)de>, Oliver Jowett <oliver(at)opencloud(dot)com>, pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-27 16:51:17
Message-ID: BANLkTiki7we3y+jph9AUeiZwHpqiTyMx8g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

>> Is there a timeline for fully supporting the binary protocol?
>
> A PostgreSQL installation can contain user-defined types with a custom
> binary format, so this will not work with all databases anyway.

Having to read unknown types in text mode is not any easier if you're
going to do something interesting with them. In either case, the
protocol provides a way to read the type payload off the wire (so,
e.g., a column with a type you can't read won't "mess up" the whole
result set), but it's up to you to process those bytes. As Vitalii
stated, something like PGObject is probably the right way to handle
this.

> A way to tell PostgreSQL to always send some [types] in binary format
> would solve this, though.

(I assume you meant "types" not "times")

Yep, that would be nice. There are per-*column* output facilities in
the protocol, but not per-type. It's possible to ask the server for
the columns in a query before you commit to output types, but that
involves an extra round trip which won't be worth it most of the time
(still, it could be worth it if you're doing something that outputs a
lot of rows with far leaner binary representations for some data
types). Perhaps per-statement encoding choice might be the right
compromise, although that smells like a niche use case.

What's the current state of the binary patch? The wiki (
http://wiki.postgresql.org/wiki/JDBC-BinaryTransfer ) doesn't say.
With a brief look at the list archives, the last activity I see is a
completely new patch submitted by Radosław Smogura, but that
discussion peters out. That discussion implies we want a solution to
the "which-types-to-encode-how" problem, but realistically, if we want
that, we won't have binary support until the protocol changes to allow
per-type encoding selection. Would a polished patch with
all-or-nothing connection-flag-level binary support be acceptable?

---
Maciek Sakrejda | System Architect | Truviso

1065 E. Hillsdale Blvd., Suite 215
Foster City, CA 94404
(650) 242-3500 Main
www.truviso.com


From: Andreas Schmitz <schmitz(at)occamlabs(dot)de>
To: Maciek Sakrejda <msakrejda(at)truviso(dot)com>
Cc: Florian Weimer <fweimer(at)bfk(dot)de>, Oliver Jowett <oliver(at)opencloud(dot)com>, pgsql-jdbc(at)postgresql(dot)org
Subject: Re: bytea performance tweak
Date: 2011-06-27 23:49:57
Message-ID: 20110627234957.GA2653@summalogicae.occamlabs.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Maciek Sakrejda wrote:

Hi,

> >> Is there a timeline for fully supporting the binary protocol?
> >
> > A PostgreSQL installation can contain user-defined types with a custom
> > binary format, so this will not work with all databases anyway.
>
> Having to read unknown types in text mode is not any easier if you're
> going to do something interesting with them. In either case, the
> protocol provides a way to read the type payload off the wire (so,
> e.g., a column with a type you can't read won't "mess up" the whole
> result set), but it's up to you to process those bytes. As Vitalii
> stated, something like PGObject is probably the right way to handle
> this.
>
> > A way to tell PostgreSQL to always send some [types] in binary format
> > would solve this, though.
>
> (I assume you meant "types" not "times")
>
> Yep, that would be nice. There are per-*column* output facilities in
> the protocol, but not per-type. It's possible to ask the server for
> the columns in a query before you commit to output types, but that
> involves an extra round trip which won't be worth it most of the time
> (still, it could be worth it if you're doing something that outputs a
> lot of rows with far leaner binary representations for some data
> types). Perhaps per-statement encoding choice might be the right
> compromise, although that smells like a niche use case.
>
> What's the current state of the binary patch? The wiki (
> http://wiki.postgresql.org/wiki/JDBC-BinaryTransfer ) doesn't say.
> With a brief look at the list archives, the last activity I see is a
> completely new patch submitted by Radosław Smogura, but that
> discussion peters out. That discussion implies we want a solution to
> the "which-types-to-encode-how" problem, but realistically, if we want
> that, we won't have binary support until the protocol changes to allow
> per-type encoding selection. Would a polished patch with
> all-or-nothing connection-flag-level binary support be acceptable?

so it seems to me that switching to the binary protocol is probably
going to be hard, without server side support. What about just applying
my patch in the meantime, including Oliver's fix of course?

Best regards, Andreas