Re: "analyze" putting wrong reltuples in pg_class

From: pgsql-bugs(at)postgresql(dot)org
To: pgsql-bugs(at)postgresql(dot)org
Subject: Bug #728: Interactions between bytea and character encoding when doing analyze
Date: 2002-08-01 14:12:05
Message-ID: 20020801141205.2A6424759D4@postgresql.org
Lists: pgsql-bugs

Anders Hammarquist (iko(at)strakt(dot)com) reports a bug with a severity of 3.
The lower the number, the more severe it is.

Short Description
Interactions between bytea and character encoding when doing analyze

Long Description
If a byte string that is not valid unicode is inserted into a bytea
column, analyze will fail unless the data was tagged as bytea in the
insert. However, there appears to be no difference in the data returned
by a select. Both variants of the insert should probably work, but if
not, the untagged one should fail at the time of insertion.

Sample Code
ifctest_iko=# create table foo (key int, value bytea);
CREATE
ifctest_iko=# INSERT INTO foo values ( 42, '\\314\\134'::bytea );
INSERT 374905 1
ifctest_iko=# analyze foo;
ANALYZE
ifctest_iko=# INSERT INTO foo values ( 42, '\\314\\134' );
INSERT 374906 1
ifctest_iko=# analyze foo;
ERROR: Invalid UNICODE character sequence found (0xcc5c)
ifctest_iko=# select * from foo;
key | value
-----+-------
42 | \\
42 | \\
(2 rows)

ifctest_iko=# set client_encoding to sqlascii;
SET VARIABLE
ifctest_iko=# select * from foo;
key | value
-----+--------
42 | (cc)\\
42 | (cc)\\
(2 rows)

No file was uploaded with this report


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: iko(at)strakt(dot)com, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bug #728: Interactions between bytea and character encoding when doing analyze
Date: 2002-08-01 14:23:57
Message-ID: 7252.1028211837@sss.pgh.pa.us
Lists: pgsql-bugs

pgsql-bugs(at)postgresql(dot)org writes:
> If a byte string that is not valid unicode is inserted into a bytea
> column, analyze will fail unless the data was tagged as bytea in the
> insert.

Your example produces no failure for me. You'd better be more specific
about which PG version you're running, on what platform, with what
configure options and what database encoding, etc.

regards, tom lane


From: Anders Hammarquist <iko(at)strakt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bug #728: Interactions between bytea and character encoding when doing analyze
Date: 2002-08-02 10:34:06
Message-ID: 200208021034.g72AY717024386@fido.strakt.com
Lists: pgsql-bugs

> pgsql-bugs(at)postgresql(dot)org writes:
> > If a byte string that is not valid unicode is inserted into a bytea
> > column, analyze will fail unless the data was tagged as bytea in the
> > insert.
>
> Your example produces no failure for me. You'd better be more specific
> about which PG version you're running, on what platform, with what
> configure options and what database encoding, etc.

Ah, sorry about that. It's 7.2.1, in the Debian package incarnation
7.2.1-2. The database and the client encoding are both unicode. These
are the settings from postmaster.conf (nothing strange):

debug_level = 0
log_connections = on
log_pid = on
log_timestamp = on
syslog = 2
silent_mode = off
syslog_facility = LOCAL0
trace_notify = off
max_connections = 64
shared_buffers = 128
tcpip_socket = 1
stats_start_collector = on
stats_reset_on_server_start = off
stats_command_string = on
stats_block_level = on
stats_row_level = on

/Anders


From: Ron Mayer <ron(at)intervideo(dot)com>
To: <pgsql-bugs(at)postgresql(dot)org>
Cc: Ron Mayer <ron(at)intervideo(dot)com>
Subject: "analyze" putting wrong reltuples in pg_class
Date: 2002-08-02 21:36:43
Message-ID: Pine.LNX.4.33.0208021416210.13301-100000@ron
Lists: pgsql-bugs


On a number of my tables, "analyze" seems to be putting the wrong value of
"reltuples" in pg_class. "vacuum" seems to be doing the right thing.

An example of the failure mode is shown below. Please let me know what
additional info would help.

Ron

logs2=# select count(*) from e_ip_full;
count
---------
1697755
(1 row)

logs2=# analyze e_ip_full;

logs2=# select relname,reltuples from pg_class where relname = 'e_ip_full';
relname | reltuples
-----------+-----------
e_ip_full | 7555
(1 row)

logs2=# vacuum e_ip_full;
VACUUM
logs2=# select relname,reltuples from pg_class where relname = 'e_ip_full';
relname | reltuples
-----------+-------------
e_ip_full | 1.69776e+06
(1 row)

logs2=# analyze verbose e_ip_full;
NOTICE: Analyzing e_ip_full
ANALYZE
logs2=# select relname,reltuples from pg_class where relname = 'e_ip_full';
relname | reltuples
-----------+-----------
e_ip_full | 7555
(1 row)

logs2=# \d e_ip_full;
Table "e_ip_full"
Column | Type | Modifiers
----------+-------------------------+-----------
ip | character varying(16) |
dat | date |
dom1 | character varying(255) |
dom2 | character varying(255) |
dom3 | character varying(255) |
dom4 | character varying(255) |
domn | character varying(1024) |
obsolete | boolean |
Indexes: e_ip__domain
Unique keys: e_ip__ip_obsolete_dat

logs2=#


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ron Mayer <ron(at)intervideo(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: "analyze" putting wrong reltuples in pg_class
Date: 2002-08-02 22:50:44
Message-ID: 28671.1028328644@sss.pgh.pa.us
Lists: pgsql-bugs

Ron Mayer <ron(at)intervideo(dot)com> writes:
> On a number of my tables, "analyze" seems to be putting the wrong value of
> "reltuples" in pg_class. "vacuum" seems to be doing the right thing.

Hmm. analyze by itself generates only an approximate estimate of the
row count (since it only examines a random sample of the rows). But I'd
not have expected it to be off by a factor of 200. Need more info.
What does VACUUM VERBOSE show? Also, it would be interesting to see
what contrib/pgstattuple shows, if you can run that conveniently.
Can you say anything about your typical usage pattern on these tables?
(eg, numbers of inserts vs updates vs deletes)

BTW, it's quite likely that VACUUM FULL will make the problem go away,
so don't do that until we fully understand what's happening ...

regards, tom lane


From: Ron Mayer <ron(at)intervideo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: "analyze" putting wrong reltuples in pg_class
Date: 2002-08-03 01:21:03
Message-ID: Pine.LNX.4.33.0208021740280.13301-100000@ron
Lists: pgsql-bugs


On Fri, 2 Aug 2002, Tom Lane wrote:
>
> Ron Mayer <ron(at)intervideo(dot)com> writes:
> > On a number of my tables, "analyze" seems to be putting the wrong value of
> > "reltuples" in pg_class. "vacuum" seems to be doing the right thing.
>
> Hmm. analyze by itself generates only an approximate estimate of the
> row count (since it only examines a random sample of the rows). But I'd
> not have expected it to be off by a factor of 200. Need more info.
> What does VACUUM VERBOSE show?

logs2=# vacuum verbose e_ip_full;
NOTICE: --Relation e_ip_full--
NOTICE: Index e_ip__ip_obsolete_dat: Pages 15905; Tuples 1697755: Deleted
654680.
CPU 5.54s/9.96u sec elapsed 183.97 sec.
NOTICE: Index e_ip__domain: Pages 15891; Tuples 1697755: Deleted 654680.
CPU 10.51s/8.59u sec elapsed 255.68 sec.
NOTICE: Removed 654680 tuples in 8324 pages.
CPU 7.91s/1.91u sec elapsed 52.01 sec.
NOTICE: Pages 37612: Changed 0, Empty 0; Tup 1697755: Vac 654680, Keep 0,
UnUsed 454059.
Total CPU 42.91s/20.83u sec elapsed 570.05 sec.
NOTICE: --Relation pg_toast_110790174--
NOTICE: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0.
Total CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
logs2=#

> Also, it would be interesting to see what contrib/pgstattuple shows,
> if you can run that conveniently.

Gladly, if I'm shown where to find it. A Google search for pgstattuple shows
http://developer.postgresql.org/cvsweb.cgi/pgsql/contrib/pgstatuple/
which serves an error page right now.

> Can you say anything about your typical usage pattern on these tables?
> (eg, numbers of inserts vs updates vs deletes)

Every few days, 10,000 - 200,000 entries (new IP addresses) are loaded.

After loading, an update is run once for each entry (filling in the domain
name that goes with the IP address).

Then "Vacuum Analyze" is run, and no updates or loads happen until
the next large batch.

> BTW, it's quite likely that VACUUM FULL will make the problem go away,
> so don't do that until we fully understand what's happening ...

OK... I have 2 tables that have the same problem, so we can
experiment once. :-)

Ron


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ron Mayer <ron(at)intervideo(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: "analyze" putting wrong reltuples in pg_class
Date: 2002-08-03 03:28:54
Message-ID: 360.1028345334@sss.pgh.pa.us
Lists: pgsql-bugs

Ron Mayer <ron(at)intervideo(dot)com> writes:
>> Also, it would be interesting to see what contrib/pgstattuple shows,
>> if you can run that conveniently.

> Gladly, if I'm shown where to find it.

If you built from a source package, the contrib stuff should be in that
package. If you used RPMs, look for the pgsql-contrib RPM in the same
set.
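
Once you've located it, installing the module is just a matter of loading
its SQL script into the target database (as a database superuser, since it
defines a C-language function) and then calling the function. Roughly (the
exact path to pgstattuple.sql depends on how the package lays things out):

$ psql logs2 -f pgstattuple.sql
logs2=# SELECT pgstattuple('e_ip_full');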

regards, tom lane


From: Ron Mayer <ron(at)intervideo(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: "analyze" putting wrong reltuples in pg_class
Date: 2002-08-03 06:29:59
Message-ID: Pine.LNX.4.33.0208022328000.13301-100000@ron
Lists: pgsql-bugs


On Fri, 2 Aug 2002, Tom Lane wrote:
> >> it would be interesting to see what contrib/pgstattuple shows...
> >> if you can run that conveniently.
> > Gladly, if I'm shown where to find it.
> If you built from a source package, the contrib stuff should be in that
> package. If you used RPMs, look for the pgsql-contrib RPM in the same
> set.

I assume I run it like this... ?

logs2=#
logs2=# select pgstattuple('e_ip_full');
NOTICE: physical length: 293.84MB live tuples: 1697755 (169.26MB, 57.60%)
dead tuples: 0 (0.00MB, 0.00%) free/reusable space: 110.84MB (37.72%)
overhead: 4.67%
pgstattuple
-------------
0
(1 row)

logs2=#


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ron Mayer <ron(at)intervideo(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: "analyze" putting wrong reltuples in pg_class
Date: 2002-08-03 16:53:19
Message-ID: 3564.1028393599@sss.pgh.pa.us
Lists: pgsql-bugs

Ron Mayer <ron(at)intervideo(dot)com> writes:
> logs2=# select pgstattuple('e_ip_full');
> NOTICE: physical length: 293.84MB live tuples: 1697755 (169.26MB, 57.60%)
> dead tuples: 0 (0.00MB, 0.00%) free/reusable space: 110.84MB (37.72%)
> overhead: 4.67%
> pgstattuple
> -------------
> 0
> (1 row)

38% overhead space is awfully high. I am betting that your max_fsm_pages
configuration parameter needs to be kicked up --- it would seem that
your system is failing to reclaim free space effectively. (Check the
mail list archives for recent discussions of this point.)

What I think is happening is that the free space is not evenly
distributed but is concentrated near the start of the table. This
causes ANALYZE to make a faulty estimate of the average number of live
tuples per page, because its initial scan will see mostly free space
and not very many live tuples on the first few hundred pages. So it
extrapolates a too-small estimate for the total number of tuples.
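
As a rough illustration with the numbers upthread (this is just the shape
of the estimate, not the exact code path):

   estimated reltuples ~= (live tuples per sampled page) * total pages

   7555 / 37612 pages ~= 0.2 tuples/page observed in the sample, versus
   1697755 / 37612 ~= 45 tuples/page actually in the table.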

It would probably be good at some point to make ANALYZE more robust,
but your immediate problem is too much wasted space. I'd recommend
bumping up max_fsm_pages to some reasonable fraction of your total
database size, and then doing a VACUUM FULL to get back the space leaked
so far.
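
Concretely, that sequence might look like this (a sketch only: the
max_fsm_pages number is just a placeholder to be scaled to your database,
it lives in postgresql.conf, and changing it requires a postmaster
restart):

# postgresql.conf
max_fsm_pages = 100000

logs2=# VACUUM FULL e_ip_full;
logs2=# ANALYZE e_ip_full;
logs2=# SELECT relname, relpages, reltuples FROM pg_class
logs2-#   WHERE relname = 'e_ip_full';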

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ron Mayer <ron(at)intervideo(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: "analyze" putting wrong reltuples in pg_class
Date: 2002-08-04 00:45:48
Message-ID: 200208040045.g740jnQ13687@candle.pha.pa.us
Lists: pgsql-bugs


Is there any way we can warn users when their fsm parameters are too
small?

---------------------------------------------------------------------------

Tom Lane wrote:
> Ron Mayer <ron(at)intervideo(dot)com> writes:
> > logs2=# select pgstattuple('e_ip_full');
> > NOTICE: physical length: 293.84MB live tuples: 1697755 (169.26MB, 57.60%)
> > dead tuples: 0 (0.00MB, 0.00%) free/reusable space: 110.84MB (37.72%)
> > overhead: 4.67%
> > pgstattuple
> > -------------
> > 0
> > (1 row)
>
> 38% overhead space is awfully high. I am betting that your max_fsm_pages
> configuration parameter needs to be kicked up --- it would seem that
> your system is failing to reclaim free space effectively. (Check the
> mail list archives for recent discussions of this point.)
>
> What I think is happening is that the free space is not evenly
> distributed but is concentrated near the start of the table. This
> causes ANALYZE to make a faulty estimate of the average number of live
> tuples per page, because its initial scan will see mostly free space
> and not very many live tuples on the first few hundred pages. So it
> extrapolates a too-small estimate for the total number of tuples.
>
> It would probably be good at some point to make ANALYZE more robust,
> but your immediate problem is too much wasted space. I'd recommend
> bumping up max_fsm_pages to some reasonable fraction of your total
> database size, and then doing a VACUUM FULL to get back the space leaked
> so far.
>
> regards, tom lane

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Joe Conway <mail(at)joeconway(dot)com>
To: Anders Hammarquist <iko(at)strakt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bug #728: Interactions between bytea and character encoding
Date: 2002-08-04 01:22:07
Message-ID: 3D4C81BF.1020703@joeconway.com
Lists: pgsql-bugs

Anders Hammarquist wrote:
>>pgsql-bugs(at)postgresql(dot)org writes:
>>
>>>If a byte string that is not valid unicode is inserted into a bytea
>>>column, analyze will fail unless the data was tagged as bytea in the
>>>insert.
>>
>>Your example produces no failure for me. You'd better be more specific
>>about which PG version you're running, on what platform, with what
>>configure options and what database encoding, etc.

> Ah, sorry about that. It's 7.2.1, in the Debian package incarnation
> 7.2.1-2. The database and the client encoding are both unicode. These
> are the settings from postmaster.conf (nothing strange):

I can confirm this is a problem on 7.2.1, but cvs tip works fine. It is
not related to the form of the insert, but rather to the fact that with a
one-tuple table, pg_verifymbstr() (where the error is raised) never gets
called. In fact, textin never gets called either. But once there are two
tuples, they do.

Here's the backtrace from 7.2.1:

Breakpoint 1, pg_verifymbstr (mbstr=0x837a698 "42", len=2) at wchar.c:541
541 if (pg_database_encoding_max_length() <= 1)
(gdb) bt
#0 pg_verifymbstr (mbstr=0x837a698 "42", len=2) at wchar.c:541
#1 0x08149c26 in textin (fcinfo=0xbfffeca0) at varlena.c:191
#2 0x08160579 in DirectFunctionCall1 (func=0x8149c00 <textin>,
arg1=137864856) at fmgr.c:657
#3 0x080bbffa in update_attstats (relid=74723, natts=2,
vacattrstats=0x8379f58) at analyze.c:1740
#4 0x080ba180 in analyze_rel (relid=74723, vacstmt=0x8378110) at
analyze.c:350
. . .

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Ron Mayer <ron(at)intervideo(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: "analyze" putting wrong reltuples in pg_class
Date: 2002-08-04 02:13:21
Message-ID: 6176.1028427201@sss.pgh.pa.us
Lists: pgsql-bugs

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Is there any way we can warn users when their fsm parameters are too
> small?

Not until we understand what too small is :-( If anyone's undertaken
any experiments to figure out what an appropriate FSM size setting is,
I'm not aware of it.

The default setting is 10000 pages which would certainly cover all the
free space in 8K*10000 = 80meg of tables, and in practice would cover
significantly more space as long as most of your pages weren't updated
often (and hence didn't have free space to worry about). But obviously
this number is on the low side for production databases, especially
large ones. We need to put "pay attention to FSM size" right after
"pay attention to shared_buffers" in the standard list of tuning tips.

Presumably there's some tradeoff curve that says max_fsm_pages should
cover X% of your physical database page count if you update Y% of the
database rows between vacuums. I'm not sure what the curve looks like
though --- the real issue is how many distinct pages are likely to be
touched when you update so-and-so many rows?
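
For what it's worth, a crude way to eyeball the physical page count
involved (relpages is only refreshed by VACUUM/ANALYZE, so the numbers are
approximate):

logs2=# SELECT sum(relpages) AS total_pages,
logs2-#        sum(relpages) * 8 / 1024 AS approx_mb
logs2-#   FROM pg_class
logs2-#  WHERE relkind IN ('r', 't', 'i');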

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Anders Hammarquist <iko(at)strakt(dot)com>, pgsql-bugs(at)postgresql(dot)org, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Subject: Re: Bug #728: Interactions between bytea and character encoding when doing analyze
Date: 2002-08-04 02:25:54
Message-ID: 6274.1028427954@sss.pgh.pa.us
Lists: pgsql-bugs

Joe Conway <mail(at)joeconway(dot)com> writes:
> (gdb) bt
> #0 pg_verifymbstr (mbstr=0x837a698 "42", len=2) at wchar.c:541
> #1 0x08149c26 in textin (fcinfo=0xbfffeca0) at varlena.c:191
> #2 0x08160579 in DirectFunctionCall1 (func=0x8149c00 <textin>,
> arg1=137864856) at fmgr.c:657
> #3 0x080bbffa in update_attstats (relid=74723, natts=2,
> vacattrstats=0x8379f58) at analyze.c:1740

Ah. So the issue is that ANALYZE tries to do textin(byteaout(...))
in order to produce a textual representation of the most common value
in the BYTEA column, and apparently textin feels that the string
generated by byteaout is not legal text. While Joe says that the
problem has gone away in CVS tip, I'm not sure I believe that.

A possible answer is to change the pg_statistics columns from text to
some other less picky datatype. (bytea maybe ;-)) Or should we
conclude that text is broken and needs to be fixed? Choice #3 would
be "bytea is broken and needs to be fixed", but I don't care for that
answer --- if bytea can produce an output string that will break
pg_statistics, then so can some other future datatype.

Comments?

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Ron Mayer <ron(at)intervideo(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: "analyze" putting wrong reltuples in pg_class
Date: 2002-08-04 02:59:23
Message-ID: 200208040259.g742xNE24820@candle.pha.pa.us
Lists: pgsql-bugs


Could we somehow track how many pages we _couldn't_ get into the free
space map, and then, when the map is empty _and_ the last vacuum found
pages it couldn't store, throw a message to the server logs? (Just
thinking out loud.)

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Is there any way we can warn users when their fsm parameters are too
> > small?
>
> Not until we understand what too small is :-( If anyone's undertaken
> any experiments to figure out what an appropriate FSM size setting is,
> I'm not aware of it.
>
> The default setting is 10000 pages which would certainly cover all the
> free space in 8K*10000 = 80meg of tables, and in practice would cover
> significantly more space as long as most of your pages weren't updated
> often (and hence didn't have free space to worry about). But obviously
> this number is on the low side for production databases, especially
> large ones. We need to put "pay attention to FSM size" right after
> "pay attention to shared_buffers" in the standard list of tuning tips.
>
> Presumably there's some tradeoff curve that says max_fsm_pages should
> cover X% of your physical database page count if you update Y% of the
> database rows between vacuums. I'm not sure what the curve looks like
> though --- the real issue is how many distinct pages are likely to be
> touched when you update so-and-so many rows?
>
> regards, tom lane
>

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Anders Hammarquist <iko(at)strakt(dot)com>, pgsql-bugs(at)postgresql(dot)org, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Subject: Re: Bug #728: Interactions between bytea and character encoding
Date: 2002-08-04 05:26:28
Message-ID: 3D4CBB04.4090906@joeconway.com
Lists: pgsql-bugs

Tom Lane wrote:
> Ah. So the issue is that ANALYZE tries to do textin(byteaout(...))
> in order to produce a textual representation of the most common value
> in the BYTEA column, and apparently textin feels that the string
> generated by byteaout is not legal text. While Joe says that the
> problem has gone away in CVS tip, I'm not sure I believe that.

I didn't either, except I tried it and it worked ;-) But you're
undoubtedly correct that there are other cases which would break the
current code.

> A possible answer is to change the pg_statistics columns from text to
> some other less picky datatype. (bytea maybe ;-)) Or should we
> conclude that text is broken and needs to be fixed? Choice #3 would
> be "bytea is broken and needs to be fixed", but I don't care for that
> answer --- if bytea can produce an output string that will break
> pg_statistics, then so can some other future datatype.

BYTEA sounds like the best answer to me. TEXT is supposed to honor
character set specific peculiarities, while bytea should be able to
represent any arbitrary set of bytes.

Joe


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Anders Hammarquist <iko(at)strakt(dot)com>, pgsql-bugs(at)postgresql(dot)org, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Subject: Re: Bug #728: Interactions between bytea and character encoding
Date: 2002-08-04 05:55:10
Message-ID: 200208040555.g745tAL02963@candle.pha.pa.us
Lists: pgsql-bugs

Joe Conway wrote:
> Tom Lane wrote:
> > Ah. So the issue is that ANALYZE tries to do textin(byteaout(...))
> > in order to produce a textual representation of the most common value
> > in the BYTEA column, and apparently textin feels that the string
> > generated by byteaout is not legal text. While Joe says that the
> > problem has gone away in CVS tip, I'm not sure I believe that.
>
> I didn't either, except I tried it and it worked ;-) But you're
> undoubtedly correct that there are other cases which would break the
> current code.

Does this mean we don't have to escape >0x7f when inputting bytea
anymore?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Joe Conway <mail(at)joeconway(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Anders Hammarquist <iko(at)strakt(dot)com>, pgsql-bugs(at)postgresql(dot)org, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Subject: Re: Bug #728: Interactions between bytea and character encoding
Date: 2002-08-04 06:29:44
Message-ID: 3D4CC9D8.3080602@joeconway.com
Lists: pgsql-bugs

Bruce Momjian wrote:
> Does this mean we don't have to escape >0x7f when inputting bytea
> anymore?

I seem to remember that bytea data was run through the multibyte code
for some reason, and I don't recall seeing that changed. ISTM that we
shouldn't force bytea through multibyte functions at all.

The UNKNOWNIN patch did address part of the problem, just not all of it.
Previously all 'unknown' data was initially cast as TEXT, and thus was
subject to multibyte character set interpretation. But there was another
execution path that was not dealt with. I'll search the archives for the
thread.

Joe


From: Joe Conway <mail(at)joeconway(dot)com>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Anders Hammarquist <iko(at)strakt(dot)com>, pgsql-bugs(at)postgresql(dot)org, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Subject: Re: Bug #728: Interactions between bytea and character encoding
Date: 2002-08-04 07:45:54
Message-ID: 3D4CDBB2.4090304@joeconway.com
Lists: pgsql-bugs

Joe Conway wrote:
> Bruce Momjian wrote:
>
>> Does this mean we don't have to escape >0x7f when inputting bytea
>> anymore?
>
>
> I seem to remember that bytea data was run through the multibyte code
> for some reason, and I don't recall seeing that changed. ISTM that we
> shouldn't force bytea through multibyte functions at all.
>
> The UNKNOWNIN patch did address part of the problem, just not all of it.
> Previously all 'unknown' data was initially cast as TEXT, and thus was
> subject to multibyte character set interpretation. But there was another
> execution path that was not dealt with. I'll search the archives for the
> thread.
>

Here's the remaining issue that I remembered; see:
http://archives.postgresql.org/pgsql-hackers/2002-04/msg00256.php

The gist of this is that when client and server encoding don't match,
pg_do_encoding_conversion() gets called, regardless of data type. This
is the *wrong thing* to do for BYTEA data, I think. Fixing this,
combined with the UNKNOWNIN/OUT fix we did earlier, should eliminate the
need to escape the high bit characters when inputting bytea. The only
characters which *should* need to be escaped are the ones originally
escaped by PQescapeBytea. IMHO of course ;-)

Joe


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Anders Hammarquist <iko(at)strakt(dot)com>, pgsql-bugs(at)postgresql(dot)org, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Subject: Re: Bug #728: Interactions between bytea and character encoding
Date: 2002-08-14 04:56:34
Message-ID: 200208140456.g7E4uYU29659@candle.pha.pa.us
Lists: pgsql-bugs


I am hoping for a patch for this for 7.3. Added to open items:

Fix bytea to not encode input string

---------------------------------------------------------------------------

Joe Conway wrote:
> Joe Conway wrote:
> > Bruce Momjian wrote:
> >
> >> Does this mean we don't have to escape >0x7f when inputting bytea
> >> anymore?
> >
> >
> > I seem to remember that bytea data was run through the multibyte code
> > for some reason, and I don't recall seeing that changed. ISTM that we
> > shouldn't force bytea through multibyte functions at all.
> >
> > The UNKNOWNIN patch did address part of the problem, just not all of it.
> > Previously all 'unknown' data was initially cast as TEXT, and thus was
> > subject to multibyte character set interpretation. But there was another
> > execution path that was not dealt with. I'll search the archives for the
> > thread.
> >
>
> Here's the remaining issue that I remembered; see:
> http://archives.postgresql.org/pgsql-hackers/2002-04/msg00256.php
>
> The gist of this is that when client and server encoding don't match,
> pg_do_encoding_conversion() gets called, regardless of data type. This
> is the *wrong thing* to do for BYTEA data, I think. Fixing this,
> combined with the UNKNOWNIN/OUT fix we did earlier, should eliminate the
> need to escape the high bit characters when inputting bytea. The only
> characters which *should* need to be escaped are the ones originally
> escaped by PQescapeBytea. IMHO of course ;-)
>
> Joe

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Joe Conway <mail(at)joeconway(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Anders Hammarquist <iko(at)strakt(dot)com>, pgsql-bugs(at)postgresql(dot)org, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Subject: Re: Bug #728: Interactions between bytea and character encoding
Date: 2002-08-14 05:05:05
Message-ID: 3D59E501.7000704@joeconway.com
Lists: pgsql-bugs

Bruce Momjian wrote:
> I am hoping for a patch for this for 7.3. Added to open items:
>
> Fix bytea to not encode input string
>

I said:
> Here's the remaining issue that I remembered; see:
> http://archives.postgresql.org/pgsql-hackers/2002-04/msg00256.php
>
> The gist of this is that when client and server encoding don't match,
> pg_do_encoding_conversion() gets called, regardless of data type.
> This is the *wrong thing* to do for BYTEA data, I think. Fixing this,
> combined with the UNKNOWNIN/OUT fix we did earlier, should eliminate
> the need to escape the high bit characters when inputting bytea. The
> only characters which *should* need to be escaped are the ones
> originally escaped by PQescapeBytea. IMHO of course ;-)

Tatsuo or Tom can answer this better than me, but I don't think this can
be fixed without a fe/be protocol change, so I'd guess it's a
7.4 issue. But, if there is a way to do it now, and someone gives me a
clue how to proceed, I'll try to get a patch together.

Joe