Re: NOLOGGING option, or ?

Lists: pgsql-hackers
From: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
To: "Steve Atkins" <steve(at)blighty(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
Subject: Re: NOLOGGING option, or ?
Date: 2005-06-02 04:30:01
Message-ID: BB05A27C22288540A3A3E8F3749B45AB15E162@MI8NYCMAIL06.Mi8.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>I've been following this thread, and I'm a little confused. Could you
possibly clarify what you mean, by providing a couple of lines of
input as it would be formatted with escape processing turned off -
containing a text field with an embedded newline and tab and a null field.

yeah, this is all a bit confusing, but I *hope* I can clarify things here as I think I got a better understanding now.

before that, let me just clarify that the performance improvements in the patch have nothing to do with the escaping mechanizm. Escapes could change. The performance gain in due to a buffered processing with minimal line/attribute buffer loads.

I think that the basic issue is that there are some database users that would like to take their data and put it into the database without pre-processing it - regardless if there are any backslashes in it or 0x0D (CR's) etc... these are the users I am targeting in my patch as these are the users I ran into in the field. The only responsibility of these users is to explicitly escape any delimiter or 0x0A (LF) characters that they intend to have as DATA. that's all.

On the other hand there are users that would like to pre-process their data with C-escape sequences (or alternatevly, users that already have their data escaped) - this is what the postgres COPY targets these days.

2 different ways to do it... none of them is right or wrong.

Examples:

users that my patch targets may have a data row as such (delim = '|', EOL = [LF]):

c:\one\two|d:\ten\nine[LF]

using the way i do escaping in my patch those 2 fields of data will end up in the DB as
Field 1: c:\one\two
Field 2: d:\ten\nine

which is what the user would want. If they wanted to have a pipe char in the second field they could escape it as such: d:\ten\nine here is a pipe \| [LF] and no error will occur, and result will be:

Field 2: d:\ten\nine here is a pipe |

If you try to insert that first data line above using the existing COPY command you will get an undesired result:

Field 1: c:one wo
Field 2: d:
ine

Now, the other way around, users that do intend for their data to have escape sequences in it may have a line like this:

that's a \t tab| and this is a \nline feed [LF]

and will get the desired result of:

Field 1: that's a tab
Field 2: and this is a
line feed

while using my code they will get undesired results:
Field 1: that's a \t tab
Field 2: and this is a \nline feed

so, basically it really depends on the target audience...

Bruce, does that sounds right to you?
Alon.

-----Original Message-----
From: pgsql-hackers-owner(at)postgresql(dot)org on behalf of Steve Atkins
Sent: Wed 6/1/2005 10:47 PM
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] NOLOGGING option, or ?

On Wed, Jun 01, 2005 at 07:35:33PM -0700, Luke Lonergan wrote:
> >> I propose an extended syntax to COPY with a change in semantics to remove
> >> the default of "WITH ESCAPE '\'".
> >
> > Er, doesn't this break existing database dumps?
>
> Yes, one of the previously stated reasons to create another command for
> loading data.
>
> Another possible approach is to keep the default, but allow the escape
> processing to be turned off.

I've been following this thread, and I'm a little confused. Could you
possibly clarify what you mean, by providing a couple of lines of
input as it would be formatted with escape processing turned off -
containing a text field with an embedded newline and tab and a null field.

Cheers,
Steve

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

--------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster


From: Alvaro Herrera <alvherre(at)surnet(dot)cl>
To: Alon Goldshuv <agoldshuv(at)greenplum(dot)com>
Cc: Steve Atkins <steve(at)blighty(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: NOLOGGING option, or ?
Date: 2005-06-02 05:17:26
Message-ID: 20050602051726.GA3694@surnet.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jun 02, 2005 at 12:30:01AM -0400, Alon Goldshuv wrote:

> before that, let me just clarify that the performance improvements in
> the patch have nothing to do with the escaping mechanizm. Escapes
> could change. The performance gain in due to a buffered processing
> with minimal line/attribute buffer loads.

May I suggest you present them as separate issues and separate patches?
This way, a patch with the performance improvements is very likely to
get merged; simultaneously we can discuss changes to the escaping
mechanism until everyone agrees (or not) and then produce a patch as
appropiate.

--
Alvaro Herrera (<alvherre[a]surnet.cl>)
Este mail se entrega garantizadamente 100% libre de sarcasmo.


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Alon Goldshuv <agoldshuv(at)greenplum(dot)com>
Cc: Steve Atkins <steve(at)blighty(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: NOLOGGING option, or ?
Date: 2005-06-03 16:23:26
Message-ID: 200506031623.j53GNQp10876@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alon Goldshuv wrote:
> >I've been following this thread, and I'm a little confused. Could you
> possibly clarify what you mean, by providing a couple of lines of input
> as it would be formatted with escape processing turned off - containing
> a text field with an embedded newline and tab and a null field.
>
>
> yeah, this is all a bit confusing, but I *hope* I can clarify things
> here as I think I got a better understanding now.

The basic problem with this thread is that it started with _conclusions_
(we need a LOAD DATA command, escapes are broken), and not with
statements of fact (we need another way of specifying escapes, we have
performance improvements).

Any discussion that starts with conclusions instead of facts is bound to
have this problem, and it often happens when a group discusses among
themselves, outside the community, and appears with the conclusions,
thinking they are helping us by not going into the details. As you can
see, lack of facts actually hampers the discussion.

What has me particularly concerned is someone saying that loading C:\TMP
must be broken, and not understanding that the doubling of escapes is a
major requirement to have data loaded reliably. Now, you can argue that
a different escape should be possible, or that some other escape syntax
could be used, but the existing mechanism is clearly 100% reliable when
used properly and not broken.

A quick email asking why C:\TMP doesn't load in properly would have
yielded a much clearer conversation about why escaping is required in
our current system, and the other options that should be explored.
Saying escapes are broken and here is the fix really didn't get very
far.

I recommend you just start a new thread, with a new topic, and head in
the _facts_ direction.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
To: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: "Steve Atkins" <steve(at)blighty(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: NOLOGGING option, or ?
Date: 2005-06-03 16:57:00
Message-ID: BEC5D9EC.5239%agoldshuv@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce,

Point taken.

> Now, you can argue that
> a different escape should be possible, or that some other escape syntax
> could be used, but the existing mechanism is clearly 100% reliable when
> used properly and not broken.

I think that having an option for another escape syntax (such as using
ESCAPE AS clause in delimited format (non-csv) COPY) with a default of '\\'
is a good compromise that will allow users to escape their data (like COPY
currently is), or by specifying another escape character allow all of their
backslashes to be treated as data.

I'll start a new discussion about it on a new thread soon.

Thx,
Alon.


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Alon Goldshuv <agoldshuv(at)greenplum(dot)com>
Cc: Steve Atkins <steve(at)blighty(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: NOLOGGING option, or ?
Date: 2005-06-03 18:12:48
Message-ID: 200506031812.j53ICm300339@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alon Goldshuv wrote:
> Bruce,
>
> Point taken.
>
> > Now, you can argue that
> > a different escape should be possible, or that some other escape syntax
> > could be used, but the existing mechanism is clearly 100% reliable when
> > used properly and not broken.
>
> I think that having an option for another escape syntax (such as using
> ESCAPE AS clause in delimited format (non-csv) COPY) with a default of '\\'
> is a good compromise that will allow users to escape their data (like COPY
> currently is), or by specifying another escape character allow all of their
> backslashes to be treated as data.
>
> I'll start a new discussion about it on a new thread soon.

Yep, great. Right now we only support single-byte escapes (and
delimiters), so there should be a discussion if multiple byte values are
useful too. Also, there was discussion of what the default should be.
I don't think there is any agreement to change the existing defaults.
You might argue for a new mode that sets certain defaults to be easier
to load, but I am afraid of a new format that isn't 100% reliable,
because it assumes that some sequence of bytes will never appear in the
data.

Anyway, these are ideas you can consider when making a proposal.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Alon Goldshuv" <agoldshuv(at)greenplum(dot)com>
Cc: "Steve Atkins" <steve(at)blighty(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: NOLOGGING option, or ?
Date: 2005-06-03 19:14:19
Message-ID: BEC5FA1B.6D60%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce,

Is there a good source of multi-byte copy data test cases? What is
currently done to test the trans-coding support? (where client and server
encodings are different)

I notice that the regression data in the CVS version of postgres does not
seem to include cases other than the ASCII data, is there another source of
data/cases we're missing?

Also - Alon's looking into this, but it would appear that the presumption on
EOL for two-byte encodings is 0x0a+0xNN, where 0x0a is followed by any byte.
Similar for other current control characters (escape, delimiter). Is there
a definition of format and semantics for COPY with 2-byte encodings we
should look at?

I've looked at the code and the docs like sql-copy.html and the question is
relevant because of the following case:
if newline were defined as 0x0a+0x00 as opposed to 0x0a+0xNN where N is
arbitrary, we could parse using 16-bit logic.
however
if newline were defined as 0x0a+0xNN, we must use byte-wise parsing

TIA

- Luke


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Alon Goldshuv <agoldshuv(at)greenplum(dot)com>, Steve Atkins <steve(at)blighty(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: NOLOGGING option, or ?
Date: 2005-06-03 21:16:20
Message-ID: 200506032116.j53LGKY28253@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Luke Lonergan wrote:
> Bruce,
>
> Is there a good source of multi-byte copy data test cases? What is
> currently done to test the trans-coding support? (where client and server
> encodings are different)
>
> I notice that the regression data in the CVS version of postgres does not
> seem to include cases other than the ASCII data, is there another source of
> data/cases we're missing?
>
> Also - Alon's looking into this, but it would appear that the presumption on
> EOL for two-byte encodings is 0x0a+0xNN, where 0x0a is followed by any byte.
> Similar for other current control characters (escape, delimiter). Is there
> a definition of format and semantics for COPY with 2-byte encodings we
> should look at?
>
> I've looked at the code and the docs like sql-copy.html and the question is
> relevant because of the following case:
> if newline were defined as 0x0a+0x00 as opposed to 0x0a+0xNN where N is
> arbitrary, we could parse using 16-bit logic.
> however
> if newline were defined as 0x0a+0xNN, we must use byte-wise parsing

We have two and three-byte encodings, so 16-bit seems like it wouldn't
work. I am not aware of any specs except the C code itself.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: NOLOGGING option, or ?
Date: 2005-06-03 21:31:04
Message-ID: BEC61A28.6D6E%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce,

> We have two and three-byte encodings, so 16-bit seems like it wouldn't
> work. I am not aware of any specs except the C code itself.

Ok - no problem.

How about test data and cases? I see the SQL encoding examples used in
src/test/regress/sql for testing encoding in SQL, but are there regressions
for QA/test of multi-byte encoding support? If not, that's OK, but it would
save us time if some were already written.

WRT the COPY command, I'd like to have regressions that test the input of
matched client/server encodings with different (standardized) multi-byte
control characters.

The current code seems to allow for an arbitrary second byte in control
characters, so if we made a statement about control character support, I
think it would be
<ctl-byte><any-byte>...mblen...<any-byte>

is allowed for specification of control characters (newline, delimiter).

Luke


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: NOLOGGING option, or ?
Date: 2005-06-03 21:38:07
Message-ID: 200506032138.j53Lc7902105@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Luke Lonergan wrote:
> Bruce,
>
> > We have two and three-byte encodings, so 16-bit seems like it wouldn't
> > work. I am not aware of any specs except the C code itself.
>
> Ok - no problem.
>
> How about test data and cases? I see the SQL encoding examples used in
> src/test/regress/sql for testing encoding in SQL, but are there regressions
> for QA/test of multi-byte encoding support? If not, that's OK, but it would
> save us time if some were already written.

No, I don't think so, but the good news is that the existing code has
always worked flawlessly.

> WRT the COPY command, I'd like to have regressions that test the input of
> matched client/server encodings with different (standardized) multi-byte
> control characters.

Makes sense, but how do we know what encodings the client supports? We
would need some tests for that.

> The current code seems to allow for an arbitrary second byte in control
> characters, so if we made a statement about control character support, I
> think it would be
> <ctl-byte><any-byte>...mblen...<any-byte>
>
> is allowed for specification of control characters (newline, delimiter).

I have no idea what you are talking about. Again, give me facts about
what we currently don't do and what you want to do.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: NOLOGGING option, or ?
Date: 2005-06-03 22:15:19
Message-ID: BEC62487.6D79%llonergan@greenplum.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce,

>
> I have no idea what you are talking about. Again, give me facts about
> what we currently don't do and what you want to do.

Currently:
- No statement of multi-byte control character format
- No tests to define or prove "works flawlessly" or identify when something
breaks the current operational state

Desired:
- Clear statement of multi-byte control character format
- Tests that define what "works flawlessly" means

- Luke