BUG #2574: C function: arg TEXT data corrupt

Lists: pgsql-bugspgsql-general
From: "Michael Enke" <michael(dot)enke(at)wincor-nixdorf(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #2574: C function: arg TEXT data corrupt
Date: 2006-08-14 13:53:34
Message-ID: 200608141353.k7EDrYPi038307@wwwmaster.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general


The following bug has been logged online:

Bug reference: 2574
Logged by: Michael Enke
Email address: michael(dot)enke(at)wincor-nixdorf(dot)com
PostgreSQL version: 8.1.4
Operating system: Open Suse 10.1 and CentOS 4.3
Description: C function: arg TEXT data corrupt
Details:

I created a C function:
extern Datum test_arg(PG_FUNCTION_ARGS);
PG_FUNCTION_INFO_V1(test_arg);
Datum test_arg(PG_FUNCTION_ARGS) {
elog(INFO, "arg: %s", VARDATA(PG_GETARG_TEXT_P(0)));
PG_RETURN_INT16(0);
}
and used the
CREATE FUNCTION test_arg(TEXT) RETURNS INT4 AS 'path_to_lib.so' LANGUAGE
'C';

If I call this function "the first time"
after connecting with psql, the info output shows
corrupted data, the second and further call is ok:
me(at)noteme:~/uwx9/SVN/tpl/trunk/db/postgresql/src> psql -U tplinux
Welcome to psql 8.1.3, the PostgreSQL interactive terminal.

Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit

tplinux=> select test_arg('1');
INFO: arg: 1s(at)3
test_arg
----------
0
(1 row)

tplinux=> select test_arg('1');
INFO: arg: 1
test_arg
----------
0
(1 row)

The same problem is with input length of 2, 3, 4, 8 char (and may be more).
With input of 4 char I never get correct value.

No problem for first call with 5, 6 , 7 char,
but with 5 char problem if before called with 6 char.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Michael Enke" <michael(dot)enke(at)wincor-nixdorf(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #2574: C function: arg TEXT data corrupt
Date: 2006-08-14 15:27:00
Message-ID: 9678.1155569220@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

"Michael Enke" <michael(dot)enke(at)wincor-nixdorf(dot)com> writes:
> I created a C function:
> extern Datum test_arg(PG_FUNCTION_ARGS);
> PG_FUNCTION_INFO_V1(test_arg);
> Datum test_arg(PG_FUNCTION_ARGS) {
> elog(INFO, "arg: %s", VARDATA(PG_GETARG_TEXT_P(0)));
> PG_RETURN_INT16(0);

The VARDATA of a TEXT datum is not a C string; in particular it is not
guaranteed to be null-terminated. This is an error in your code not
a bug.

The usual way to get a C string from a TEXT datum is to call textout,
eg

str = DatumGetCString(DirectFunctionCall1(textout, datumval));

regards, tom lane


From: Reece Hart <reece(at)harts(dot)net>
To: pgsql-general(at)postgresql(dot)org
Cc: Michael Enke <michael(dot)enke(at)wincor-nixdorf(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: text datum VARDATA and strings
Date: 2006-08-14 18:04:30
Message-ID: 1155578671.4158.45.camel@tallac.gene.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

Michael Enke recently asked in pgsql-bugs about VARDATA and C strings
(BUG #2574: C function: arg TEXT data corrupt). Since that's not a bug,
I've moved this follow-up to pgsql-general.

On Mon, 2006-08-14 at 11:27 -0400, Tom Lane wrote:
> The usual way to get a C string from a TEXT datum is to call textout,
> eg
> str = DatumGetCString(DirectFunctionCall1(textout, datumval));

Yikes! I've been accessing VARDATA text data like Michael for years
(code below). I account for length and don't expect null-termination,
but I don't use anything like Tom's suggestion above. (I always try to
do what Tom says because that usually hurts less.)

I have three questions:

1) I based everything I did on examples lifted nearly verbatim from a
7.x manual, and I bet Michael did similarly. I've never heard of
DatumGetCString, DirectFunctionCall1, or textout. Are these and other
treasures documented somewhere?

2) Does DatumGetCString(DirectFunctionCall1(textout, datumval)) do
something other than null terminate a string? All of the strings are
from [-A-Z0-1*]; server_encoding has been either SQL_ASCII or UTF8 in
case that's relevant.

3) Is there any reason to believe that the code below is problematic?

Thanks,
Reece

#include <postgres.h>
#include <fmgr.h>
#include <ctype.h>
#include <string.h>

static char* clean_sequence(const char* in, int32 n);

PG_FUNCTION_INFO_V1(pg_clean_sequence);
Datum pg_clean_sequence(PG_FUNCTION_ARGS)
{
text* t0; /* in */
text* t1; /* out */
char* tmp;
int32 tmpl;

if ( PG_ARGISNULL(0) )
{ PG_RETURN_NULL(); }

t0 = PG_GETARG_TEXT_P(0);

tmp = clean_sequence( VARDATA(t0), VARSIZE(t0)-VARHDRSZ );
tmpl = (int32) strlen(tmp);

/* copy temp sequence into new pg variable */
t1 = (text*) palloc( tmpl + VARHDRSZ );
if (!t1)
{ elog( ERROR, "couldn't palloc (%d bytes)", tmpl+VARHDRSZ ); }
memcpy(VARDATA(t1),tmp,tmpl);
VARATT_SIZEP(t1) = tmpl + VARHDRSZ;

pfree(tmp);

PG_RETURN_TEXT_P(t1);
}

/* clean_sequence -- strip non-IUPAC symbols
The intent is to strip non-sequence data which might result from
copy-pasting a fasta file or some such.

in: char*, length
out: char*, |out|<=length, NULL-TERMINATED
out is palloc'd memory; caller must free

allow chars from IUPAC std 20
+ selenocysteine (U) + ambiguity (BZX) + gap (-) + stop (*)
*/

#define isseq(c) ( ((c)>='A' && (c)<='Z' && (c)!='J' && (c)!='O') \
|| ((c)=='-') \
|| ((c)=='*') )

char* clean_sequence(const char* in, int32 n) {
char* out;
char* oi;
int32 i;

out = palloc( n + 1 ); /* w/null */
if (!out)
{ elog( ERROR, "couldn't palloc (%d bytes)", n+1 ); }

for( i=0, oi=out; i<=n-1; i++ ) {
char c = toupper(in[i]);
if ( isseq(c) ) {
*oi++ = c;
}
}
*oi = '\0';
return(out);
}

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Reece Hart <reece(at)harts(dot)net>
Cc: pgsql-general(at)postgresql(dot)org, Michael Enke <michael(dot)enke(at)wincor-nixdorf(dot)com>
Subject: Re: text datum VARDATA and strings
Date: 2006-08-14 19:51:22
Message-ID: 13746.1155585082@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

Reece Hart <reece(at)harts(dot)net> writes:
> On Mon, 2006-08-14 at 11:27 -0400, Tom Lane wrote:
>> The usual way to get a C string from a TEXT datum is to call textout,
>> eg
>> str = DatumGetCString(DirectFunctionCall1(textout, datumval));

> Yikes! I've been accessing VARDATA text data like Michael for years
> (code below). I account for length and don't expect null-termination,
> but I don't use anything like Tom's suggestion above.

Sure, that works. The problem with what Michael was doing was that he
was passing the string to elog, which expects a null-terminated string.
Possibly I should have written "usual way to get a null-terminated string"
above, just to be clear.

> 1) I based everything I did on examples lifted nearly verbatim from a
> 7.x manual, and I bet Michael did similarly. I've never heard of
> DatumGetCString, DirectFunctionCall1, or textout. Are these and other
> treasures documented somewhere?

Whose 7.x manual? This stuff has been there since we invented the
"version 1" function call convention, which was 7.3 or before. There
is some documentation in the SGML docs, but really we kind of expect you
to look at the standard built-in functions to see how things are done...

> 2) Does DatumGetCString(DirectFunctionCall1(textout, datumval)) do
> something other than null terminate a string?

At the moment that's all it does ... assuming that you've already
detoasted the text datum ... but it's not impossible that someday
it will do something different. For instance, sooner or later we are
going to support multiple locales/encodings within a single database,
and I wouldn't be surprised if that involves sticking extra data into
text values. So it's best not to assume that you know what is inside a
text datum, if possible,

> 3) Is there any reason to believe that the code below is problematic?

The only thing I'd suggest is that checking for a null return from
palloc is a waste of effort. It doesn't return to you if it runs
out of memory.

regards, tom lane


From: Reece Hart <reece(at)harts(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org, Michael Enke <michael(dot)enke(at)wincor-nixdorf(dot)com>
Subject: Re: text datum VARDATA and strings
Date: 2006-08-14 22:11:49
Message-ID: 1155593509.4158.134.camel@tallac.gene.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-general

On Mon, 2006-08-14 at 15:51 -0400, Tom Lane wrote:

> Whose 7.x manual? This stuff has been there since we invented the
> "version 1" function call convention, which was 7.3 or before. There
> is some documentation in the SGML docs, but really we kind of expect you
> to look at the standard built-in functions to see how things are done...

The PostgreSQL manual. I wrote these functions early in the 7.x series
and I don't know which manual version exactly. For example, the sec
9.5.4 of
http://www.postgresql.org/docs/7.3/static/xfunc-c.html shows code for
concat_text, which I remember was the basis for what I wrote. I now see
and understand the text regarding detoasting the 'DatumGetXXX' macros;
the relevance of these were not obvious to me at the time.

> So it's best not to assume that you know what is inside a
> text datum, if possible,

Okay. Does that mean that code in 9.5.4 should have a warning to that
effect?

> > 3) Is there any reason to believe that the code below is problematic?
>
> The only thing I'd suggest is that checking for a null return from
> palloc is a waste of effort. It doesn't return to you if it runs
> out of memory.

Okay. Thanks for the advice, Tom.

-Reece

--
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0