From: "Mikko T(dot)" <mtiihone(at)cc(dot)hut(dot)fi>
To: pgsql-jdbc(at)postgresql(dot)org
Subject: OutOfMemory when inserting stream of unknown length
Date: 2004-08-19 10:44:06
Message-ID: Pine.OSF.4.60.0408191028070.368856@kosh.hut.fi
Lists: pgsql-jdbc


When writing a Java stream of unknown length to a bytea column with
PreparedStatement's setBinaryStream, the JDBC driver always throws an
OutOfMemoryError.

I'm calling the function with the following parameters, as I do not know
the real length of the stream:

stm.setBinaryStream(1, stream, Integer.MAX_VALUE);

and it seems that somewhere along the way the JDBC driver wrongly assumes
that the stream contains gigabytes of data, and a memory allocation fails.

The other JDBC drivers I have tried (Oracle 9.2, Hypersonic 1.7.2.3) do
accept this.
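
For reference, a minimal sketch of the failing pattern (the table and
column names are just placeholders):

    import java.io.InputStream;
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    // Inserts a stream of unknown length into a bytea column. With the
    // stable driver this dies with OutOfMemoryError, apparently because
    // the driver allocates a buffer of the given size up front.
    static void insert(Connection conn, InputStream stream) throws Exception {
        PreparedStatement stm =
            conn.prepareStatement("INSERT INTO files (data) VALUES (?)");
        stm.setBinaryStream(1, stream, Integer.MAX_VALUE);
        stm.executeUpdate();
        stm.close();
    }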

The javadoc itself (reproduced at the bottom of this mail) is a bit
contradictory, as it says on one hand that the stream will have 'length'
bytes, but on the other hand that all data will be read until
end-of-file. I have interpreted this to mean that the length is just an
upper bound: the JDBC driver must not store more bytes than 'length', but
an end-of-file before 'length' bytes is still valid and shouldn't cause
any exceptions. The javadoc might also mean that if the stream hasn't
reached end-of-file when 'length' bytes have been read, the driver should
continue reading but discard the remaining bytes.

In summary my understanding:
stream length | length parameter | result
--------------+------------------+-------------------------------------
           10 |                5 | 5 bytes inserted, 5 bytes discarded
           10 |               10 | 10 bytes inserted
           10 |               15 | 10 bytes inserted

--- javadoc snippet:

The PreparedStatement.setBinaryStream(int parameterIndex,
InputStream x,
int length)

Sets the designated parameter to the given input stream, which will
have the specified number of bytes... The data will be read from the stream as
needed until end-of-file is reached.

Parameters:
parameterIndex - the first parameter is 1, the second is 2, ...
x - the java input stream which contains the binary parameter value
length - the number of bytes in the stream


From: Kris Jurka <books(at)ejurka(dot)com>
To: "Mikko T(dot)" <mtiihone(at)cc(dot)hut(dot)fi>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: OutOfMemory when inserting stream of unknown length
Date: 2004-08-19 18:42:39
Message-ID: Pine.BSO.4.56.0408191336180.14552@leary.csoft.net
Lists: pgsql-jdbc

On Thu, 19 Aug 2004, Mikko T. wrote:

>
> When writing a Java stream of unknown length to a bytea column with
> PreparedStatement's setBinaryStream, the JDBC driver always throws an
> OutOfMemoryError.
>
> I'm calling the function with the following parameters, as I do not know
> the real length of the stream:
>
> stm.setBinaryStream(1, stream, Integer.MAX_VALUE);
>
> and it seems that somewhere along the way the JDBC driver wrongly
> assumes that the stream contains gigabytes of data, and a memory
> allocation fails.

This is true of the stable driver branch: it will try to allocate a huge
amount of memory. The development driver has the ability to stream data
to the backend, but unfortunately for you it requires an exact length
argument. The problem is that the frontend/backend protocol messages are
preceded by a length field. So the options are to read the whole stream
to determine the length (which requires huge amounts of memory to store
the whole stream), or to take the provided length argument at face value
and stream data to the server, hoping the user was right. In the
streaming case it's OK if the length argument is less than the length of
the stream, because then we can just stop reading, but not if the stream
itself is short.
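
In the meantime the only real workaround is on the caller's side: buffer
the stream once to learn its exact length, then hand the driver that
length. A rough sketch, with the obvious cost of holding everything in
memory:

    // Buffer the stream fully to discover its exact length, then pass
    // that exact length to the driver so it can stream to the backend.
    java.io.ByteArrayOutputStream buf = new java.io.ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = stream.read(chunk)) != -1)
        buf.write(chunk, 0, n);
    byte[] data = buf.toByteArray();
    stm.setBinaryStream(1, new java.io.ByteArrayInputStream(data), data.length);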

Kris Jurka


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: "Mikko T(dot)" <mtiihone(at)cc(dot)hut(dot)fi>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: OutOfMemory when inserting stream of unknown length
Date: 2004-08-19 21:27:40
Message-ID: 41251B4C.9020205@opencloud.com
Lists: pgsql-jdbc

Mikko T. wrote:

> The javadoc itself (reproduced at the bottom of this mail) is a bit
> contradictory, as it says on one hand that the stream will have 'length'
> bytes, but on the other hand that all data will be read until
> end-of-file. I have interpreted this to mean that the length is just an
> upper bound: the JDBC driver must not store more bytes than 'length',
> but an end-of-file before 'length' bytes is still valid and shouldn't
> cause any exceptions. The javadoc might also mean that if the stream
> hasn't reached end-of-file when 'length' bytes have been read, the
> driver should continue reading but discard the remaining bytes.

This is not my interpretation (and not the driver's interpretation
either). The driver reads exactly 'length' bytes and does not
subsequently touch the stream. It is an error to provide less than
'length' bytes in the stream (and this will actually also toast your
server connection).

Your OOME is due to older drivers not having proper streaming logic --
they read into heap then call setBytes(). Newer drivers stream the data
directly to the server.

> The PreparedStatement.setBinaryStream(int parameterIndex,
> InputStream x,
> int length)
>
> Sets the designated parameter to the given input stream, which will have
> the specified number of bytes.

This sounds like a requirement to me -- i.e. it is an error to pass a
stream that does not have the specified number of bytes.

> .. The data will be read from the stream
> as needed until end-of-file is reached.

.. but as usual the JDBC javadoc goes on to contradict itself. sigh. I
wish sun could come up with a proper spec for a change, not just a
collection of partially-documented APIs.

As I see it the reason for having a length value there is so that the
driver can stream the data directly to the DB even when it needs to know
the length ahead of time. This is exactly the case with the postgresql
driver. If we can't trust the length field to be accurate, then we must
read the entire stream into heap before starting. In that case
setBinaryStream() is no better than setBytes()..

I could live with an interpretation that says "always store exactly
length bytes, but then read to EOF if there are extra bytes left over".
It would still be an error to supply less than 'length' bytes in the
stream; I think this is better than padding with 0 bytes or anything
similar (by the time the driver sees EOF, it is committed at the
protocol level to writing a parameter of length 'length', so it can't
just stop at EOF).
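
Under that interpretation the driver-side tail would be trivial; purely
as illustration (this is not the driver's code):

    // After exactly 'length' bytes have been copied to the wire,
    // consume and discard whatever is left in the stream.
    static void drain(java.io.InputStream x) throws java.io.IOException {
        byte[] scratch = new byte[8192];
        while (x.read(scratch) != -1)
            ;   // discard extra bytes until EOF
    }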

But I don't know if that's a better interpretation..

-O


From: "Mikko T(dot)" <mtiihone(at)cc(dot)hut(dot)fi>
To: Oliver Jowett <oliver(at)opencloud(dot)com>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: OutOfMemory when inserting stream of unknown length
Date: 2004-08-20 09:23:18
Message-ID: Pine.OSF.4.60.0408201155140.368856@kosh.hut.fi
Lists: pgsql-jdbc

On Fri, 20 Aug 2004, Oliver Jowett wrote:

>> The PreparedStatement.setBinaryStream(int parameterIndex,
>> InputStream x,
>> int length)
>>
>> Sets the designated parameter to the given input stream, which will have
>> the specified number of bytes.
>
> This sounds like a requirement to me -- i.e. it is an error to pass a stream
> that does not have the specified number of bytes.
>
>> .. The data will be read from the stream as needed until end-of-file is
>> reached.
>
> .. but as usual the JDBC javadoc goes on to contradict itself. sigh. I wish
> sun could come up with a proper spec for a change, not just a collection of
> partially-documented APIs.
>
> As I see it the reason for having a length value there is so that the driver
> can stream the data directly to the DB even when it needs to know the length
> ahead of time. This is exactly the case with the postgresql driver. If we
> can't trust the length field to be accurate, then we must read the entire
> stream into heap before starting. In that case setBinaryStream() is no better
> than setBytes()..

But the current implementation forces the user of setBinaryStream to
buffer all of the bytes in memory just to learn how many bytes the stream
actually holds. At that point the user might as well call setBytes
directly, which makes setBinaryStream useless whenever the stream length
is not known beforehand.

> I could live with an interpretation that says "always store exactly length
> bytes, but then read to EOF if there are extra bytes left over". It would
> still be an error to supply less than 'length' bytes in the stream; I think
> this is better than padding with 0 bytes or anything similar (by the time the
> driver sees EOF, it is committed at the protocol level to writing a parameter
> of length 'length', so it can't just stop at EOF).

I think that committing to send very large data values that are not even
guaranteed to exist (or be available) makes it impossible to get the
protocol back to a known state if for some reason the data cannot be sent
or the last command should be aborted. In the current solution the only
option seems to be to kill the whole connection, but I find that quite an
extreme thing to do.

In my opinion the protocol level should support sending the data in
chunks (4K - 16K could be optimal). This way the JDBC driver only has to
buffer one chunk ahead, not the whole stream, and it always knows there
won't be any IOExceptions or EOFs occurring during the sending.
Supporting chunked sending would also allow the JDBC driver to cancel or
roll back the streaming if it needs to, or is asked to do so by the
database, even if there were still megabytes of data to send.
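
To illustrate the one-chunk-ahead idea (the wire format below is made
up; nothing like it exists in the current frontend/backend protocol):

    // Hypothetical chunked framing, in the HTTP/1.1 chunked style: read
    // one chunk ahead so the sender always knows whether the chunk
    // being written is the final one.
    static void sendChunked(java.io.InputStream in, java.io.DataOutputStream out)
            throws java.io.IOException {
        byte[] cur = new byte[8192];
        byte[] next = new byte[8192];
        int curLen = in.read(cur);
        if (curLen == -1) {             // empty stream: one empty final chunk
            out.writeInt(0);
            out.writeBoolean(true);
            out.flush();
            return;
        }
        while (curLen != -1) {
            int nextLen = in.read(next);
            out.writeInt(curLen);                // chunk length prefix
            out.writeBoolean(nextLen == -1);     // last-chunk flag
            out.write(cur, 0, curLen);
            byte[] t = cur; cur = next; next = t;    // swap buffers
            curLen = nextLen;
        }
        out.flush();
    }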

-Mikko


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: "Mikko T(dot)" <mtiihone(at)cc(dot)hut(dot)fi>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: OutOfMemory when inserting stream of unknown length
Date: 2004-08-20 09:46:25
Message-ID: 4125C871.1070608@opencloud.com
Lists: pgsql-jdbc

Mikko T. wrote:

> But the current implementation forces the user of setBinaryStream to
> buffer all of the bytes in memory just to learn how many bytes the
> stream actually holds. At that point the user might as well call
> setBytes directly, which makes setBinaryStream useless whenever the
> stream length is not known beforehand.

Not true -- consider the case where you know the length but the data is
held outside the heap (for example, a large NIO buffer or a file on disk).
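
For example, inserting from a file (a sketch; the file name is made up
and error handling is omitted):

    // The length is known up front via File.length(), so the driver can
    // stream the content without it ever being buffered in the heap.
    java.io.File f = new java.io.File("/tmp/image.jpg");
    java.io.InputStream in = new java.io.FileInputStream(f);
    stm.setBinaryStream(1, in, (int) f.length());   // note: int-sized only
    stm.executeUpdate();
    in.close();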

-O


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: "Mikko T(dot)" <mtiihone(at)cc(dot)hut(dot)fi>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: OutOfMemory when inserting stream of unknown length
Date: 2004-08-20 09:59:11
Message-ID: 4125CB6F.5040003@opencloud.com
Lists: pgsql-jdbc

Oliver Jowett wrote:
> Mikko T. wrote:
>
>> But the current implementation forces the user of setBinaryStream to
>> buffer all of the bytes in memory just to learn how many bytes the
>> stream actually holds. At that point the user might as well call
>> setBytes directly, which makes setBinaryStream useless whenever the
>> stream length is not known beforehand.
>
>
> Not true -- consider the case where you know the length but the data is
> held out of heap (for example, a large NIO buffer or a file on disk).

Eh, sorry, I misread your last sentence. Certainly if you don't know the
length then you have to buffer internally and setBinaryStream() is of no
use.

The common case I've seen so far on the list is inserting from a file.
It seems much less common to not know the length of your stream ahead of
time..

-O


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: "Mikko T(dot)" <mtiihone(at)cc(dot)hut(dot)fi>
Cc: pgsql-jdbc(at)postgresql(dot)org
Subject: Re: OutOfMemory when inserting stream of unknown length
Date: 2004-08-20 10:04:40
Message-ID: 4125CCB8.10801@opencloud.com
Lists: pgsql-jdbc

Mikko T. wrote:

> I think that committing to send very large data values that are not even
> guaranteed to exist (or be available) makes it impossible to get the
> protocol back to a known state if for some reason the data cannot be
> sent or the last command should be aborted. In the current solution the
> only option seems to be to kill the whole connection, but I find that
> quite an extreme thing to do.

I discussed some alternatives with Kris a while ago but no
implementation has happened yet. The most promising approach was to pad
out the parameter in the Bind message, then immediately Close the
resulting portal without executing it and return an error to the
application.
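
Conceptually the recovery path would look like this (a sketch of the
idea only, not driver code):

    // If the stream hits EOF 'remaining' bytes short of the promised
    // length, pad the parameter with zeros so the Bind message stays
    // well-formed, then Close the portal unexecuted and report an error.
    static void padOut(java.io.OutputStream toBackend, long remaining)
            throws java.io.IOException {
        byte[] zeros = new byte[8192];
        while (remaining > 0) {
            int n = (int) Math.min(remaining, zeros.length);
            toBackend.write(zeros, 0, n);
            remaining -= n;
        }
    }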

> In my opinion the protocol level should support sending the data in
> chunks (4K - 16K could be optimal). This way the JDBC driver only has to
> buffer one chunk ahead, not the whole stream, and it always knows there
> won't be any IOExceptions or EOFs occurring during the sending.
> Supporting chunked sending would also allow the JDBC driver to cancel or
> roll back the streaming if it needs to, or is asked to do so by the
> database, even if there were still megabytes of data to send.

Right -- something like HTTP/1.1's chunked transfer scheme. We would
have to have some "partially bound portal" state floating around on the
server side which seems ugly (and coordinating multiple parameters is
going to get nasty). Is it worth it? I have a feeling that pgsql-hackers
may take some convincing; sending data of unknown length doesn't seem
like it would be widely used, and it'd be quite a lot of complexity to add.

There's nothing much we can do about this with the current protocol
unfortunately :(

-O