Re: binary compat

Lists: pgsql-hackers
From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: binary compat
Date: 2009-09-01 07:45:35
Message-ID: m2d46b3y74.fsf@hi-media.com

Hi,

I've been idly thinking about binary COPY and the recent performance
efforts spent on it. The main problem I have with binary output is that
I never know when it will be useful (other than on the very same
hardware and PostgreSQL setup)... useful meaning I get to read it back
into some database.

Would providing the following two functions help?

SELECT * FROM pg_binary_compat();
-> version, timestamp ondisk format, architecture, you name it

SELECT pg_binary_compat(version, timestamp ondisk format, ...)
-> bool

Equipped with this, I would first set up a (somewhat portable) virtual
machine where I'm sure I'm able to load the binary files, then use
COPY BINARY everywhere possible: I'd have an easy way to tell when not
to use it. A first example would be to teach londiste about this...

Now, writing the function doesn't sound like a lot of fun, but I don't
know what to compare to decide whether COPY output will be read
correctly on input. It even looks to me like in some cases this would
work across major PostgreSQL versions? Or would that depend on the
types used, so we'd need a variant taking those into consideration?

Is anyone interested in spelling out the checks needed to implement
the functions, or in implementing them, provided there's no obvious
reason why this cannot be made trustworthy?

Regards,
--
dim


From: Marko Kreen <markokr(at)gmail(dot)com>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: binary compat
Date: 2009-09-01 09:16:23
Message-ID: e51f66da0909010216h791388bvff20a617a974a6eb@mail.gmail.com

On 9/1/09, Dimitri Fontaine <dfontaine(at)hi-media(dot)com> wrote:
> I've been idly thinking about binary COPY and the recent performance
> efforts spent on it. The main problem I have with binary output is
> that I never know when it will be useful (other than on the very same
> hardware and PostgreSQL setup)... useful meaning I get to read it back
> into some database.
>
> Would providing the following two functions help?
>
> SELECT * FROM pg_binary_compat();
> -> version, timestamp ondisk format, architecture, you name it
>
> SELECT pg_binary_compat(version, timestamp ondisk format, ...)
> -> bool
>
> Equipped with this, I would first set up a (somewhat portable) virtual
> machine where I'm sure I'm able to load the binary files, then use
> COPY BINARY everywhere possible: I'd have an easy way to tell when not
> to use it. A first example would be to teach londiste about this...
>
> Now, writing the function doesn't sound like a lot of fun, but I don't
> know what to compare to decide whether COPY output will be read
> correctly on input. It even looks to me like in some cases this would
> work across major PostgreSQL versions? Or would that depend on the
> types used, so we'd need a variant taking those into consideration?
>
> Is anyone interested in spelling out the checks needed to implement
> the functions, or in implementing them, provided there's no obvious
> reason why this cannot be made trustworthy?

Based on plproxy experience with binary I/O, the requirements are:

- same major.minor version [Perhaps checking catversion would be better?]
- same integer_datetimes
- same server_encoding

This seems to cover most (?) built-in types.

Architecture details (32/64-bit, little/big-endian) should not matter,
as binary I/O functions are supposed to use an architecture-independent
format.

As plproxy uses explicit casting ($2::typname), it did not need to worry
about any OID differences. I have not looked at the binary COPY details,
so I don't know whether it can also ignore OIDs.
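The checklist above can be sketched as a pure function over the relevant
server settings. The setting names (server_version, integer_datetimes,
server_encoding) are the real GUCs; the function itself and the sample
dicts are only an illustration, not an existing API:

```python
# Hypothetical helper implementing the checklist above; the input dicts
# mimic the output of SHOW server_version / integer_datetimes /
# server_encoding on each side.

def binary_copy_compatible(src: dict, dst: dict) -> bool:
    """True if binary COPY output from src should load into dst."""
    # Same major.minor version (e.g. "8.4" from "8.4.1").
    if src["server_version"].split(".")[:2] != dst["server_version"].split(".")[:2]:
        return False
    # Timestamps must use the same on-disk representation.
    if src["integer_datetimes"] != dst["integer_datetimes"]:
        return False
    # Text must be in the same encoding.
    if src["server_encoding"] != dst["server_encoding"]:
        return False
    return True

a = {"server_version": "8.4.1", "integer_datetimes": "on", "server_encoding": "UTF8"}
b = {"server_version": "8.4.0", "integer_datetimes": "on", "server_encoding": "UTF8"}
c = {"server_version": "8.4.0", "integer_datetimes": "off", "server_encoding": "UTF8"}
print(binary_copy_compatible(a, b))  # True  -- only the patch level differs
print(binary_copy_compatible(a, c))  # False -- timestamp format differs
```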

--
marko


From: Paul Matthews <plm(at)netspace(dot)net(dot)au>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: binary compat
Date: 2009-09-01 10:48:49
Message-ID: 4A9CFC11.2020300@netspace.net.au

Dimitri Fontaine wrote:
> Hi,
>
> I've been idly thinking about binary COPY and the recent performance
> efforts spent on it. The main problem I have with binary output is
> that I never know when it will be useful (other than on the very same
> hardware and PostgreSQL setup)... useful meaning I get to read it back
> into some database.
>
If you want binary cross-compatibility, how about something well-known,
standardized, simple, and cross-platform like XDR?
http://www.faqs.org/rfcs/rfc1014.html
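For reference, XDR's encoding rules are simple: all quantities are
big-endian and every item is padded to a 4-byte boundary. A sketch of
two basic encoders following RFC 1014 (the function names are just for
illustration):

```python
import struct

def xdr_int(n: int) -> bytes:
    # XDR integer: 4 bytes, big-endian, two's complement.
    return struct.pack(">i", n)

def xdr_string(s: bytes) -> bytes:
    # XDR string: 4-byte length word, then the bytes, zero-padded to a
    # 4-byte boundary.
    pad = (4 - len(s) % 4) % 4
    return struct.pack(">I", len(s)) + s + b"\x00" * pad

print(xdr_int(258).hex())       # 00000102
print(xdr_string(b"hi").hex())  # 0000000268690000
```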

--
Fools ignore complexity. Pragmatists suffer it.
Some can avoid it. Geniuses remove it.


From: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
To: "Dimitri Fontaine" <dfontaine(at)hi-media(dot)com>, Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: binary compat
Date: 2009-09-01 10:54:07
Message-ID: op.uzkq8hmjcke6l8@soyouz


> I've been idly thinking about binary COPY and the recent performance
> efforts spent on it. The main problem I have with binary output is
> that I never know when it will be useful (other than on the very same
> hardware and PostgreSQL setup)... useful meaning I get to read it back
> into some database.

I posted about patching binary COPY to improve speed some time ago...
I'm a bit short on time right now to work on it, but I should have time
in September to submit a patch.

> Would providing the following two functions help?
>
> SELECT * FROM pg_binary_compat();
> -> version, timestamp ondisk format, architecture, you name it
>
> SELECT pg_binary_compat(version, timestamp ondisk format, ...)
> -> bool
>
> Equipped with this, I would first set up a (somewhat portable) virtual
> machine where I'm sure I'm able to load the binary files, then use
> COPY BINARY everywhere possible: I'd have an easy way to tell when not
> to use it. A first example would be to teach londiste about this...
>
> Now, writing the function doesn't sound like a lot of fun, but I don't
> know what to compare to decide whether COPY output will be read
> correctly on input. It even looks to me like in some cases this would
> work across major PostgreSQL versions? Or would that depend on the
> types used, so we'd need a variant taking those into consideration?
>
> Is anyone interested in spelling out the checks needed to implement
> the functions, or in implementing them, provided there's no obvious
> reason why this cannot be made trustworthy?

Actually, I've been surprised to discover that if you COPY a table
containing integers to a file in binary format, you can reload it into a
table containing dates (still using binary format) and Postgres will not
complain, since a date is an integer too, of the same length. So you get
completely garbled data, of course, and no error message. Well, the data
isn't really garbled; it's just interpreted differently (you could dump
it again, reload it into an integer table, and find your original
integers, interpreted as integers this time). You could probably load a
bunch of 4-byte text strings into an INT column, too.
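The reinterpretation above can be reproduced offline. In the COPY BINARY
stream, int4 is sent as a 4-byte big-endian integer and date as a 4-byte
big-endian count of days since 2000-01-01 (the PostgreSQL epoch), so the
same field bytes are valid for either column; a small sketch, with no
server involved:

```python
import struct
from datetime import date, timedelta

# A COPY BINARY field holding the int4 value 7305.
payload = struct.pack(">i", 7305)

# Read back into an integer column: the original value.
as_int = struct.unpack(">i", payload)[0]

# Read back into a date column: days since the PostgreSQL epoch.
as_date = date(2000, 1, 1) + timedelta(days=struct.unpack(">i", payload)[0])

print(as_int)    # 7305
print(as_date)   # 2020-01-01 -- same bytes, silently reinterpreted
```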

Thinking about it again, I wonder why I was surprised, since the binary
format doesn't include any column type information.

But given how careful Postgres is about correctness, this could be
considered a bug: it is very careful when parsing text data, and not at
all with binary.

What makes text COPY slow is all the parsing: turning text into datums
is very complex (look at the source code of strtod() in glibc...).

Validation, on the other hand, is simple: including it in binary COPY
shouldn't slow it down.

So, I believe the COPY binary file format should include this information:

1- everything you proposed
- version
- timestamp format
- etc.

(or perhaps a "binary format version" which is known for each type
included in the table)

2- a list of column names and types
- check it at COPY FROM time and issue appropriate errors: "cannot copy
a column of type DATE from binary data of type INTEGER"
- have a special command or a little tool (anything, even a simple
Python script) that can look at a COPY header and write the appropriate
CREATE TABLE statement for your convenience: have you ever grumbled,
while developing an app, "hmm, I want to reload this dump I did a week
ago for testing... what was the table like at that time?..."
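Sketching point 2: the real COPY BINARY header is an 11-byte signature,
a 32-bit flags field, and a 32-bit header extension length, and the
extension area is currently unused; the layout below, which packs
(column name, type OID) pairs into that area, is purely hypothetical:

```python
import struct

SIGNATURE = b"PGCOPY\n\xff\r\n\x00"  # the real COPY BINARY signature

def make_header(columns):
    """Hypothetical: pack (name, type_oid) pairs into the header's
    extension area, which the current format leaves empty."""
    ext = struct.pack(">H", len(columns))
    for name, oid in columns:
        raw = name.encode()
        ext += struct.pack(">H", len(raw)) + raw + struct.pack(">I", oid)
    return SIGNATURE + struct.pack(">II", 0, len(ext)) + ext

def read_columns(header):
    """Parse the hypothetical extension area back into (name, oid) pairs."""
    assert header.startswith(SIGNATURE)
    pos = len(SIGNATURE) + 8  # skip the flags and extension-length words
    (ncols,) = struct.unpack_from(">H", header, pos); pos += 2
    cols = []
    for _ in range(ncols):
        (nlen,) = struct.unpack_from(">H", header, pos); pos += 2
        name = header[pos:pos + nlen].decode(); pos += nlen
        (oid,) = struct.unpack_from(">I", header, pos); pos += 4
        cols.append((name, oid))
    return cols

hdr = make_header([("id", 23), ("created", 1082)])  # 23 = int4, 1082 = date
print(read_columns(hdr))  # [('id', 23), ('created', 1082)]
```

A COPY FROM that read such a header could then reject a DATE column fed
from INTEGER data instead of silently reinterpreting the bytes.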