Re: PQgetlength vs. octet_length()

Lists: pgsql-generalpgsql-hackers
From: Michael Clark <codingninja(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: PQgetlength vs. octet_length()
Date: 2009-08-17 22:28:21
Message-ID: bf5d83510908171528q2ab6ecc6md8d3d08f2c4615f3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hello everyone.
Having a weird issue.

I have a value inserted into a bytea column, which is about 137megs in size.

If I use octet_length() to check the size of the column for this specific
row I get this:
TestDB=# SELECT octet_length(rawdata) FROM LargeData;
octet_length
--------------
143721188

When fetching the row through the C API, and I use PQgetlength() on the
column of the row in question I get:
(gdb) p (int)PQgetlength(result, rowIndex, i)
$3 = 544453159

I am wondering if I am lacking knowledge that explains why these values are
different, or if something fishy is going on.

What led me to investigating this is that fetching this row in a C
application is causing a failure. My programs memory usage balloons to 1.3
gigs after executing this:
const char *valC = PQgetvalue(result, rowIndex, i);

Am I doing something wrong, or is there some ideas what I should investigate
next?
This seems quite puzzling to me.

Thanks in advance for any help/insight offered,
Michael.


From: Michael Clark <codingninja(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PQgetlength vs. octet_length()
Date: 2009-08-18 15:04:24
Message-ID: bf5d83510908180804y760c2f64o4e9417929da63c85@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hello - am I in the wrong mailing list for this sort of problem? :-/
Thanks,
Michael.

On Mon, Aug 17, 2009 at 6:28 PM, Michael Clark <codingninja(at)gmail(dot)com>wrote:

> Hello everyone.
> Having a weird issue.
>
> I have a value inserted into a bytea column, which is about 137megs in
> size.
>
> If I use octet_length() to check the size of the column for this specific
> row I get this:
> TestDB=# SELECT octet_length(rawdata) FROM LargeData;
> octet_length
> --------------
> 143721188
>
> When fetching the row through the C API, and I use PQgetlength() on the
> column of the row in question I get:
> (gdb) p (int)PQgetlength(result, rowIndex, i)
> $3 = 544453159
>
>
> I am wondering if I am lacking knowledge that explains why these values are
> different, or if something fishy is going on.
>
> What led me to investigating this is that fetching this row in a C
> application is causing a failure. My programs memory usage balloons to 1.3
> gigs after executing this:
> const char *valC = PQgetvalue(result, rowIndex, i);
>
> Am I doing something wrong, or is there some ideas what I should
> investigate next?
> This seems quite puzzling to me.
>
> Thanks in advance for any help/insight offered,
> Michael.
>
>


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Michael Clark" <codingninja(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PQgetlength vs. octet_length()
Date: 2009-08-18 15:23:20
Message-ID: 4A8A81180200002500029C22@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Michael Clark <codingninja(at)gmail(dot)com> wrote:
> Hello - am I in the wrong mailing list for this sort of problem? :-/

Probably. If you check here:

http://www.postgresql.org/community/lists/

You'll find this description for the list:

"The PostgreSQL developers team lives here. Discussion of current
development issues, problems and bugs, and proposed new features. If
your question cannot be answered by people in the other lists, and it
is likely that only a developer will know the answer, you may re-post
your question in this list. You must try elsewhere first!"

Your question sounds appropriate for the pgsql-general list.

>> I have a value inserted into a bytea column, which is about 137megs
>> in size.
>>
>> If I use octet_length() to check the size of the column for this
>> specific row I get this:
>> TestDB=# SELECT octet_length(rawdata) FROM LargeData;
>> octet_length
>> --------------
>> 143721188
>>
>> When fetching the row through the C API, and I use PQgetlength() on
>> the column of the row in question I get:
>> (gdb) p (int)PQgetlength(result, rowIndex, i)
>> $3 = 544453159

As long as I'm replying, I'll point out that the relative sizes would
make sense if you were using an interface which got the text
representation of the bytea column, since many bytes would be
represented in octal with a backslash escape (four characters per
byte).

-Kevin


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Michael Clark <codingninja(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PQgetlength vs. octet_length()
Date: 2009-08-18 16:15:39
Message-ID: 407d949e0908180915t1ccabad1r69fbd814373628f9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Tue, Aug 18, 2009 at 4:04 PM, Michael Clark<codingninja(at)gmail(dot)com> wrote:
> Hello - am I in the wrong mailing list for this sort of problem? :-

Probably but it's also a pretty technical point and you're programming
in C so it's kind of borderline.

If you're using text-mode then your datum that you're getting from
libpq is a text representation of the datum. For bytea in released
versions that means anything which isn't a printable ascii character
will be octal encoded like \123. You can use PQunescapeBytea to
unescape it.

If you use binary encoding then you don't have to deal with that.
Though I seem to recall there is still a gotcha you have to worry
about if there are nul bytes in your datum. I don't recall exactly
what that meant you had to do though.

--
greg
http://mit.edu/~gsstark/resume.pdf


From: Michael Clark <codingninja(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org, Greg Stark <gsstark(at)mit(dot)edu>
Subject: PQgetlength vs. octet_length()
Date: 2009-08-18 17:39:30
Message-ID: bf5d83510908181039i365a0a5ag88db6eba829de5d0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

This thread was originally posted (incorrectly by me) to the hackers
mailing list. Moving the discussion to the gerenal.

Hi Greg,
That is what Pierre pointed out, and you are both right. I am using the
text mode.

But it seems pretty crazy that a 140meg bit of data goes to 1.3 gigs. Does
that seem a bit excessive?

I avoided the binary mode because that seemed to be rather confusing when
having to deal with non-bytea data types. The docs make it sound like
binary mode should be avoided because what you get back for a datetime
varies per platform.

Thanks,
Michael.

On Tue, Aug 18, 2009 at 12:15 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:

> On Tue, Aug 18, 2009 at 4:04 PM, Michael Clark<codingninja(at)gmail(dot)com>
> wrote:
> > Hello - am I in the wrong mailing list for this sort of problem? :-
>
> Probably but it's also a pretty technical point and you're programming
> in C so it's kind of borderline.
>
> If you're using text-mode then your datum that you're getting from
> libpq is a text representation of the datum. For bytea in released
> versions that means anything which isn't a printable ascii character
> will be octal encoded like \123. You can use PQunescapeBytea to
> unescape it.
>
> If you use binary encoding then you don't have to deal with that.
> Though I seem to recall there is still a gotcha you have to worry
> about if there are nul bytes in your datum. I don't recall exactly
> what that meant you had to do though.
>
> --
> greg
> http://mit.edu/~gsstark/resume.pdf
>


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Michael Clark <codingninja(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: PQgetlength vs. octet_length()
Date: 2009-08-18 17:48:57
Message-ID: 407d949e0908181048p530a3684j8c6ba3058b3f6888@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Tue, Aug 18, 2009 at 6:39 PM, Michael Clark<codingninja(at)gmail(dot)com> wrote:
> But it seems pretty crazy that a 140meg bit of data goes to 1.3 gigs.  Does
> that seem a bit excessive?

From what you posted earlier it looked like it was turning into about
500M which sounds about right. Presumably either libpq or your code is
holding two copies of it in ram at some point in the process.

8.5 will have an option to use a denser hex encoding but it will still
be 2x as large as the raw data.

> I avoided the binary mode because that seemed to be rather confusing when
> having to deal with non-bytea data types.  The docs make it sound like
> binary mode should be avoided because what you get back for a datetime
> varies per platform.

There are definitely disadvantages. Generally it requires you to know
what the binary representation of your data types is and they're not
all well documented or guaranteed not to change in the future. I
wouldn't recommend someone try to decode a numeric or a postgres array
for example. And floating point numbers are platform dependent. But
bytea is a case where it seems more natural to use binary than text
representation.

--
greg
http://mit.edu/~gsstark/resume.pdf


From: Michael Clark <codingninja(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Fwd: PQgetlength vs. octet_length()
Date: 2009-08-18 18:42:22
Message-ID: bf5d83510908181142i6e423f7fj1ed12afd3851cb2e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Tue, Aug 18, 2009 at 1:48 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:

> On Tue, Aug 18, 2009 at 6:39 PM, Michael Clark<codingninja(at)gmail(dot)com>
> wrote:
> > But it seems pretty crazy that a 140meg bit of data goes to 1.3 gigs.
> Does
> > that seem a bit excessive?
>
> From what you posted earlier it looked like it was turning into about
> 500M which sounds about right. Presumably either libpq or your code is
> holding two copies of it in ram at some point in the process.
>

From what I saw, stopped at this line in my code running through gdb:
const char *valC = PQgetvalue(result, rowIndex, i);
my mem usage was 300megs. Stepping over this line it went to 1.3 gigs.
Unless there is some way to misconfigure something, I can't think how my
code could do that.
I will profile it and see if I can tell who is holding on to that memory.

> 8.5 will have an option to use a denser hex encoding but it will still
> be 2x as large as the raw data.
>

Sweet!

>
> > I avoided the binary mode because that seemed to be rather confusing when
> > having to deal with non-bytea data types. The docs make it sound like
> > binary mode should be avoided because what you get back for a datetime
> > varies per platform.
>
> There are definitely disadvantages. Generally it requires you to know
> what the binary representation of your data types is and they're not
> all well documented or guaranteed not to change in the future. I
> wouldn't recommend someone try to decode a numeric or a postgres array
> for example. And floating point numbers are platform dependent. But
> bytea is a case where it seems more natural to use binary than text
> representation.
>

To do something like this, I guess it would be best for my wrapper to being
to detect when I have a bytea column in a table and do 2 fetchs, one in text
for all other columns, and one in binary for the bytea column. Is this the
best way to handle that do you think?

Thanks,
Michael.


From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Greg Stark *EXTERN*" <gsstark(at)mit(dot)edu>, "Michael Clark" <codingninja(at)gmail(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PQgetlength vs. octet_length()
Date: 2009-08-19 07:11:11
Message-ID: D960CB61B694CF459DCFB4B0128514C203937ED0@exadv11.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Greg Stark wrote:
> If you use binary encoding then you don't have to deal with that.
> Though I seem to recall there is still a gotcha you have to worry
> about if there are nul bytes in your datum. I don't recall exactly
> what that meant you had to do though.

As far as I know, it only means that you shouldn't use strcpy, strlen
and friends on a binary string.

Yours,
Laurenz Albe


From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Michael Clark *EXTERN*" <codingninja(at)gmail(dot)com>, <pgsql-general(at)postgresql(dot)org>, "Greg Stark" <gsstark(at)mit(dot)edu>
Subject: Re: PQgetlength vs. octet_length()
Date: 2009-08-19 07:47:43
Message-ID: D960CB61B694CF459DCFB4B0128514C203937ED1@exadv11.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Michael Clark wrote:
> That is what Pierre pointed out, and you are both right. I
> am using the text mode.
>
> But it seems pretty crazy that a 140meg bit of data goes to
> 1.3 gigs. Does that seem a bit excessive?
>
> I avoided the binary mode because that seemed to be rather
> confusing when having to deal with non-bytea data types. The
> docs make it sound like binary mode should be avoided because
> what you get back for a datetime varies per platform.

That is true.

The best thing would be to retrieve only the bytea columns in
binary format and the rest as text.

The Bind message in the frontend/backend protocol allows to
specify for each individual result column whether it should
be text or binary
( http://www.postgresql.org/docs/current/static/protocol-message-formats.html )
but the C API only allows you to get *all* result columns in either
binary or text.

You could resort to either speaking line protocol with the backend,
(which is probably more than you are ready to do), or you could
create a separate query only for the bytea value.

Yours,
Laurenz Albe