Re: Status of server side Large Object support?

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Thomas Hallgren <thhal(at)mailblocks(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Status of server side Large Object support?
Date: 2004-11-28 13:13:18
Message-ID: 200411281413.18242.peter_e@gmx.net
Lists: pgsql-hackers

On Sunday, 28 November 2004 12:33, Thomas Hallgren wrote:
> Hmm, ok. But there's no way to stream them in and out from disk. From
> what I can see, you have to bring all of it into memory. Not so ideal
> perhaps if you want to provide streaming media for thousands of users.

You can use the substring function to read the pieces you need.
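
For example, to fetch one 4 kB slice of a bytea column (just a sketch;
"images" and "data" are made-up names):

SELECT substring(data FROM 1 FOR 4096) FROM images WHERE id = 42;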

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Thomas Hallgren <thhal(at)mailblocks(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Status of server side Large Object support?
Date: 2004-11-28 13:55:29
Message-ID: thhal-0VHGFAoUmcC4pgMoJF+KCg4PnkFjXIf@mailblocks.com
Lists: pgsql-hackers

Peter Eisentraut wrote:
> On Sunday, 28 November 2004 12:33, Thomas Hallgren wrote:
>
>>Hmm, ok. But there's no way to stream them in and out from disk. From
>>what I can see, you have to bring all of it into memory. Not so ideal
>>perhaps if you want to provide streaming media for thousands of users.
>
>
> You can use the substring function to read the pieces you need.
>
Won't the substring function bring the whole thing into memory in the
backend before it passes you the piece you need? Let's assume I want to
stream 4 kB at a time of a 40 MB image; that's a whole lot of byte swapping
if that's the case.

How do you handle writes without first creating the whole image in memory?

From what I can see, the current JDBC driver uses the lo_<xxx> client
APIs, and they seem to map to the inv_<xxx> server APIs.

Regards,
Thomas Hallgren


From: Bernd Helmle <mailings(at)oopsware(dot)de>
To: Thomas Hallgren <thhal(at)mailblocks(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Status of server side Large Object support?
Date: 2004-11-28 16:03:57
Message-ID: 34D0E696AF272FEEE781180B@sparkey.oopsware.intra
Lists: pgsql-hackers

--On Sunday, November 28, 2004 14:55:29 +0100 Thomas Hallgren
<thhal(at)mailblocks(dot)com> wrote:

>
> From what I can see, the current JDBC driver uses the lo_<xxx> client
> APIs, and they seem to map to the inv_<xxx> server APIs.
>

Huh, does that mean that libpq's lo_*() API is deprecated, too? That would
be bad news...

--

Bernd


From: Joe Conway <mail(at)joeconway(dot)com>
To: Thomas Hallgren <thhal(at)mailblocks(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, bryan(at)bulten(dot)ca
Subject: Re: Status of server side Large Object support?
Date: 2004-11-28 18:50:16
Message-ID: 41AA1DE8.9050202@joeconway.com
Lists: pgsql-hackers

Thomas Hallgren wrote:
> Peter Eisentraut wrote:
>> On Sunday, 28 November 2004 12:33, Thomas Hallgren wrote:
>>
>>> Hmm, ok. But there's no way to stream them in and out from disk. From
>>> what I can see, you have to bring all of it into memory. Not so ideal
>>> perhaps if you want to provide streaming media for thousands of users.
>>
>> You can use the substring function to read the pieces you need.
>>
> Won't the substring function bring the whole thing into memory in the
> backend before it passes you the piece you need? Let's assume I want to
> stream 4 kB at a time of a 40 MB image; that's a whole lot of byte swapping
> if that's the case.

Not if the column is storage type EXTERNAL. See a past discussion here:
http://archives.postgresql.org/pgsql-general/2003-07/msg01447.php
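
For instance (a sketch; "images" and "data" are made-up names):

ALTER TABLE images ALTER COLUMN data SET STORAGE EXTERNAL;

EXTERNAL stores the value out of line and uncompressed, which is what
lets substring fetch just the slices it needs instead of detoasting the
whole value.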

> How do you handle writes without first creating the whole image in memory?

You can't currently, but it would be a nice addition ;-)

I agree with Peter -- I think effort is better spent improving bytea.

BTW, someone on GENERAL just started a very similar thread
("implementing a lo_truncate operation, and possibly a lo_size"). I've
cc'd him here.

Joe


From: Thomas Hallgren <thhal(at)mailblocks(dot)com>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, bryan(at)bulten(dot)ca
Subject: Re: Status of server side Large Object support?
Date: 2004-11-28 22:23:29
Message-ID: thhal-0ZOiFApspcC4e9YU0/ZILPSg586XNCd@mailblocks.com
Lists: pgsql-hackers

Joe Conway wrote:
> Thomas Hallgren wrote:
>
>> Peter Eisentraut wrote:
>>
>>> On Sunday, 28 November 2004 12:33, Thomas Hallgren wrote:
>>>
>>>> Hmm, ok. But there's no way to stream them in and out from disk. From
>>>> what I can see, you have to bring all of it into memory. Not so ideal
>>>> perhaps if you want to provide streaming media for thousands of users.
>>>
>>>
>>> You can use the substring function to read the pieces you need.
>>>
>> Won't the substring function bring the whole thing into memory in the
>> backend before it passes you the piece you need? Let's assume I want to
>> stream 4 kB at a time of a 40 MB image; that's a whole lot of byte
>> swapping if that's the case.
>
>
> Not if the column is storage type EXTERNAL. See a past discussion here:
> http://archives.postgresql.org/pgsql-general/2003-07/msg01447.php
>
>
>> How do you handle writes without first creating the whole image in
>> memory?
>
>
> You can't currently, but it would be a nice addition ;-)
>
> I agree with Peter -- I think effort is better spent improving bytea.
>
> BTW, someone on GENERAL just started a very similar thread
> ("implementing a lo_truncate operation, and possibly a lo_size"). I've
> cc'd him here.
>
Seems to me both solutions have things (and people) speaking for them.

My goal is to provide a handle to a large amount of data. In Java, this
is a java.sql.Blob/Clob and it maps to an SQL locator. This object
supports random access so you can do the equivalent of seek, tell, read,
write, and truncate. A Blob/Clob object must be valid for the duration
of a transaction.

Mapping this object to a LargeObjectDesc and getting full functionality
seems to be fairly easy. The only thing missing is "truncate", since a
"size" function can use seek with SEEK_END on a temporary descriptor copy.

Another fairly significant advantage of using large objects is that the
client JDBC driver already uses them. I'd like to keep my driver as
similar as possible.

Mapping a Blob to a bytea appears to be much more difficult and the
result is more limited (half the size).

I understand from Peter and Joe that they consider bytea to be the
future. One thing to really consider for the future is the current 1 GB
limit. I'd like to see a future where bytea remains more or less as it
is today, i.e. optimal for sizes of 1 GB and less, and LargeObjectDesc is
rewritten to use 64-bit quantities for seek, size, and tell, thus
allowing much larger objects to be stored.

What is the quality of the large object solution today? Does it have
known flaws that nobody cares about because it's discontinued, or is it
considered a maintained and worthy part of the overall solution?

Regards,
Thomas Hallgren


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Hallgren <thhal(at)mailblocks(dot)com>
Cc: Joe Conway <mail(at)joeconway(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, bryan(at)bulten(dot)ca
Subject: Re: Status of server side Large Object support?
Date: 2004-11-28 23:17:07
Message-ID: 21866.1101683827@sss.pgh.pa.us
Lists: pgsql-hackers

Thomas Hallgren <thhal(at)mailblocks(dot)com> writes:
> What is the quality of the large object solution today? Does it have
> known flaws that nobody cares about because it's discontinued, or is it
> considered a maintained and worthy part of the overall solution?

More the former than the latter, I think, at least in the minds of
the usual suspects for backend work.

The main problem I'd see with the idea of supporting over-2GB LOs is
that we store all LOs in a database in the same table (pg_largeobject)
and so you would run into the table size limit (around 16TB IIRC) with
not an amazingly large number of such LOs. We used to store each LO in
its own table but that was not better, as a few thousand LOs could
easily bring the filesystem to its knees (on platforms where the
directory lookup mechanism doesn't scale to huge numbers of entries in
a single directory). I don't think there'd be any point in upgrading
the LO support to 64 bits without some rethinking of the underlying
storage structure.

A generic issue with LOs is the extreme pain involved in dump/reload;
not only the difficulty of transporting the LOs themselves, but that
of updating references to them from the database. Vacuuming
no-longer-referenced LOs is a serious problem too. If LOs were
considered a first-class feature then I'd want to see more interest
in dealing with those problems.

Lesser issues with LOs are protection (there isn't any), user-accessible
locking (there isn't any), MVCC (there isn't any). The latter has been
on the to-do list since
http://archives.postgresql.org/pgsql-hackers/2002-05/msg00875.php
I think it could actually be fixed now without too much pain because
there is a mechanism for finding out the surrounding query's snapshot,
which functions could not do before 8.0.

regards, tom lane


From: David Garamond <lists(at)zara(dot)6(dot)isreserved(dot)com>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Thomas Hallgren <thhal(at)mailblocks(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, bryan(at)bulten(dot)ca
Subject: Re: Status of server side Large Object support?
Date: 2004-11-29 00:22:18
Message-ID: 41AA6BBA.2050800@zara.6.isreserved.com
Lists: pgsql-hackers

Joe Conway wrote:
> Not if the column is storage type EXTERNAL. See a past discussion here:
> http://archives.postgresql.org/pgsql-general/2003-07/msg01447.php

What is the reasoning behind this syntax?

ALTER TABLE [ ONLY ] table [ * ]
ALTER [ COLUMN ] column SET STORAGE
{ PLAIN | EXTERNAL | EXTENDED | MAIN }

I find it nonintuitive and hard to remember. Perhaps something like this
is better (I know, it's probably too late):

ALTER [ COLUMN ] column SET STORAGE { INLINE | EXTERNAL }
ALTER [ COLUMN ] column SET COMPRESSION { YES | NO }

--
dave


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Garamond <lists(at)zara(dot)6(dot)isreserved(dot)com>
Cc: Joe Conway <mail(at)joeconway(dot)com>, Thomas Hallgren <thhal(at)mailblocks(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, bryan(at)bulten(dot)ca
Subject: Re: Status of server side Large Object support?
Date: 2004-11-29 00:57:07
Message-ID: 22832.1101689827@sss.pgh.pa.us
Lists: pgsql-hackers

David Garamond <lists(at)zara(dot)6(dot)isreserved(dot)com> writes:
> I find it nonintuitive and hard to remember. Perhaps something like this
> is better (I know, it's probably too late):

> ALTER [ COLUMN ] column SET STORAGE { INLINE | EXTERNAL }
> ALTER [ COLUMN ] column SET COMPRESSION { YES | NO }

The semantics are not simply two independent variables, however.
In particular, IIRC the precedence of different possible actions
is such that you couldn't cleanly express it that way.

regards, tom lane


From: Thomas Hallgren <thhal(at)mailblocks(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joe Conway <mail(at)joeconway(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, bryan(at)bulten(dot)ca
Subject: Re: Status of server side Large Object support?
Date: 2004-11-29 17:50:17
Message-ID: thhal-03fmGAs0xcC4caeCxbEL4Y/wlzeFACD@mailblocks.com
Lists: pgsql-hackers

Tom,
Here's an attempt to do some major rethinking and solve all open issues
(and of course create some new ones).

The idea is based on the use of normal tables with a bytea column that
stores one LO page per row (henceforth referred to as LO-page tables).
Several such tables can be used in order to overcome the table size
limit and to enable distribution of LOs over different tablespaces.
Also, the use of normal tables will resolve the issues concerning
protection, user-accessible locking, and the current lack of MVCC.

I feel that a simpler approach using already-present mechanisms
would make the design easier to maintain. There's no reason why LOs
should require special attention from a vacuum or dump/restore
perspective, nor why they should be excluded from MVCC.

This is what I think is needed:

A new composite datatype must be used in place of today's oid to identify
a large object. The type will look something like this:

CREATE TYPE lo_locator AS (
    lo_page_table int,
    lo_id int,
    lo_xact_id int
);

The lo_page_table will hold the Oid of the associated LO-page table. The
lo_id is the Oid used by the data pages within that table. The
lo_xact_id is set to the current transaction id each time a data page is
changed. Its purpose is to resolve the concurrency issue that arises when
several transactions simultaneously change the same LO but on different
data pages.

I suggest that the system have a way to set a default LO-page table on a
per-schema basis. This table would be used unless the user (schema
owner) explicitly declares another table. If no table has been declared
for a schema, the default should be the table declared for 'public'. If
no table is declared there either, some global default can be used.
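
Purely hypothetical syntax for such a declaration, just to illustrate
the idea (no such command exists today):

ALTER SCHEMA myschema SET LARGE OBJECT TABLE myschema.lo_pages;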

Among other things, a default LO-page table will make it possible to
retain backward compatibility.

Protection can be obtained using normal grant/revoke permissions on the
LO-page tables, i.e. they will serve as permission groups. LOs
requiring specific permissions must be stored in a separate LO-page table.

The LargeObjectDesc is changed as follows:
- It must have an additional Oid that identifies the table it makes use of.
- The uint32 used for the offset can be changed to an int64 at the same
time.
- The current SubTransactionId will become obsolete since all changes
made to the LO-page tables are under sub-transaction control anyway.
- Something to quickly find our way back to the row containing the
lo_locator must be added so that it's easy to update the lo_xact_id that
resides there. I'm not sure how to do that in the most efficient manner,
so it's represented by a comment here. Please fill in :-)

Thus we'd get:

typedef struct LargeObjectDesc
{
    Oid     pageTableId;  /* page table in use for this LO */
    Oid     id;           /* LO's identifier within the LO-page table */
    int64   offset;       /* current seek pointer */
    int     flags;        /* locking info, etc. */
    /* + something that enables us to find our way
     * back so that the lo_xact_id can be updated
     * efficiently */
} LargeObjectDesc;

Tables hosting LO pages must be created using the following declaration:

CREATE TABLE <table name>
(
    lo_id     oid NOT NULL,
    lo_pageno int NOT NULL,
    lo_data   bytea,
    PRIMARY KEY (lo_id, lo_pageno)
);
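
Reading at a given offset then boils down to fetching the right page
row. A sketch, assuming a fixed page size of 2048 bytes (the LOBLKSIZE
pg_largeobject uses today) and made-up names:

SELECT lo_data FROM my_lo_pages
 WHERE lo_id = 12345 AND lo_pageno = 40;  -- bytes 81920..83967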

Two restrictions concerning an LO-page table:
1. Each row (page) must be considered fixed in size.
2. Normal (I mean through SQL) access to the LO-page tables must be
discouraged somehow.

The lo_<xxx> protocol needs to change so that lo_seek and lo_tell
use 64-bit quantities. The lo_creat, lo_open, and lo_drop will all act
on the default LO-page table. A new set of functions that allow the
LO-page table to be explicitly stated for the create, open, and drop
operations will be needed. Finally, three new functions, lo_size(int
lod), lo_truncate(int lod, int64 new_size), and lo_get_page_table_id(int
lod), should be added.
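
Expressed as client-side prototypes, the proposed additions might look
like this (a sketch of the proposal only, not existing API):

int64 lo_size(PGconn *conn, int lod);
int   lo_truncate(PGconn *conn, int lod, int64 new_size);
Oid   lo_get_page_table_id(PGconn *conn, int lod);
int64 lo_seek(PGconn *conn, int lod, int64 offset, int whence); /* widened */
int64 lo_tell(PGconn *conn, int lod);                           /* widened */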

Comments, suggestions?

Regards,
Thomas Hallgren


From: Hannu Krosing <hannu(at)tm(dot)ee>
To: David Garamond <lists(at)zara(dot)6(dot)isreserved(dot)com>
Cc: Joe Conway <mail(at)joeconway(dot)com>, Thomas Hallgren <thhal(at)mailblocks(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, bryan(at)bulten(dot)ca
Subject: Re: Status of server side Large Object support?
Date: 2004-12-08 00:33:38
Message-ID: 1102466018.4020.11.camel@fuji.krosing.net
Lists: pgsql-hackers

On Mon, 2004-11-29 at 02:22, David Garamond wrote:
> Joe Conway wrote:
> > Not if the column is storage type EXTERNAL. See a past discussion here:
> > http://archives.postgresql.org/pgsql-general/2003-07/msg01447.php
>
> what is the reasoning behind this syntax?
>
> ALTER TABLE [ ONLY ] table [ * ]
> ALTER [ COLUMN ] column SET STORAGE
> { PLAIN | EXTERNAL | EXTENDED | MAIN }
>
> I find it nonintuitive and hard to remember. Perhaps something like this
> is better (I know, it's probably too late):
>
> ALTER [ COLUMN ] column SET STORAGE { INLINE | EXTERNAL }
> ALTER [ COLUMN ] column SET COMPRESSION { YES | NO }

It would also be beneficial if the threshold size for moving a column to
TOAST (either COMPRESS or EXTERNAL) could be set on a per-column basis.

This is a design decision on the same level as the others.

------------
Hannu


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Hannu Krosing <hannu(at)tm(dot)ee>
Cc: David Garamond <lists(at)zara(dot)6(dot)isreserved(dot)com>, Joe Conway <mail(at)joeconway(dot)com>, Thomas Hallgren <thhal(at)mailblocks(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, bryan(at)bulten(dot)ca
Subject: Re: Status of server side Large Object support?
Date: 2004-12-13 03:07:09
Message-ID: 200412130307.iBD379w04484@candle.pha.pa.us
Lists: pgsql-hackers

Hannu Krosing wrote:
> On Mon, 2004-11-29 at 02:22, David Garamond wrote:
> > Joe Conway wrote:
> > > Not if the column is storage type EXTERNAL. See a past discussion here:
> > > http://archives.postgresql.org/pgsql-general/2003-07/msg01447.php
> >
> > what is the reasoning behind this syntax?
> >
> > ALTER TABLE [ ONLY ] table [ * ]
> > ALTER [ COLUMN ] column SET STORAGE
> > { PLAIN | EXTERNAL | EXTENDED | MAIN }
> >
> > I find it nonintuitive and hard to remember. Perhaps something like this
> > is better (I know, it's probably too late):
> >
> > ALTER [ COLUMN ] column SET STORAGE { INLINE | EXTERNAL }
> > ALTER [ COLUMN ] column SET COMPRESSION { YES | NO }
>
> It would also be beneficial if the threshold size for moving a column to
> TOAST (either COMPRESS or EXTERNAL) could be set on a per-column basis.
>
> This is a design decision on the same level as the others.

The threshold is currently per-row, with longer rows being toasted
first. I can't see having a per-column specification being a big win.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073