Re: OCTET_LENGTH is wrong

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: OCTET_LENGTH is wrong
Date: 2001-11-18 21:23:16
Message-ID: 200111182123.fAILNGW07403@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com> writes:
> > On Sun, 18 Nov 2001, Tom Lane wrote:
> >> I presume that where you want to come out is OCTET_LENGTH = uncompressed
> >> length in the server's encoding ... but so far no one has really made
> >> a convincing argument why that answer is better or more spec-compliant
> >> than any other answer. In particular, it's not obvious to me why
> >> "number of bytes we're actually using on disk" is wrong.
>
> > I'm not sure, but if we say that the on disk representation is the
> > value of the character value expression whose size is being checked,
> > wouldn't that be inconsistent with the other uses of the character value
>
> Yeah, it would be and is. In fact, the present code has some
> interesting behaviors: if foo.x is a text value long enough to be
> toasted, then you get different results from
>
> SELECT OCTET_LENGTH(x) FROM foo;
>
> SELECT OCTET_LENGTH(x || '') FROM foo;
>
> since the result of the concatenation expression won't be compressed.
>
> I'm not actually here to defend the existing code; in fact I believe the
> XXX comment on textoctetlen questioning its correctness is mine. What
> I am trying to point out is that the spec is so vague that it's not
> clear what the correct answer is.

Well, if the standard is unclear, we should assume to return the most
reasonable answer, which has to be non-compressed length.

In multibyte encodings, when we started returning length() in
_characters_ instead of bytes, I assumed the major use for octet_length
was to return the number of bytes needed to hold the value on the client
side.

In single byte encodings, octet_length is the same as length() so
returning a compressed length may make sense, but I don't think we want
different meanings for the function for single and multi-byte encodings.

I guess the issue is that for single-byte encodings, octet_length is
pretty useless because it is the same as length, but for multi-byte
encodings, octet_length is invaluable and almost has to return
non-compress bytes because uncompressed is that the client sees.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message charles 2001-11-18 21:30:28 Re: pg locking problem
Previous Message Bruce Momjian 2001-11-18 21:17:03 Re: full outer join bug?