Re: Slicing TOAST

Lists: pgsql-hackerspgsql-students
From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-students(at)postgresql(dot)org
Subject: Slicing TOAST
Date: 2013-05-14 07:05:03
Message-ID: CA+U5nMJGgJNt5VXqkR=crtDqXFmuyzwEF23-fD5NuSns+6N5dA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students

I'm proposing this now as a possible GSoC project:

In 1-byte character encodings (i.e. not UTF-8), SUBSTR() is optimised
to allow seeking straight to the exact slice when retrieving a large
toasted value. This reduces I/O considerably when you have large
toasted values since it is an O(1) action rather than an O(N).

This is possible because the slicing of toasted values is predictable
on 1 byte encodings.

It would be useful to have a predictable function perform the slicing,
so we could use that knowledge later to optimise searches in a wider
range of situations. More specifically, since UTF-8 is so common, to
allow optimisations in that encoding of common data: text, XML, JSON.

e.g. if we knew that an XML document has a required element called
TITLE and that occurs only once and always in the first slice, it
would be useful information to use in search functions. (Not sure, but
it may be possible to assign non-consecutive slice numbers to allow
variable data mid-way through a column value if needed).

e.g. in UTF-8 free text we could put 500 characters in each slice, so
that even if that could be anywhere between 500 and 2000 bytes it
would still fit just fine.

e.g. for arrays, if we put say 200 elements per slice, then accessing
particular elements would require only 1 slice retrieval.

Doing this would *possibly* reduce packing density, but not certainly
so. But it would greatly improve access times to large structured
toast values.

Implementation would be to have a slicing function that gets called
iteratively on a column value until it returns no further slices.

There is no proposal for search functions. It would be up to the
search function to confirm the details of the slicing function prior
to using that knowledge in a search. We'd need a way to check that
the function inputs matched the slicing of the column, so we'd need to
have some requirement for input on the function to be matched against
metadata on the column. So presumably some decoration of the input
parameters of functions, which sounds like a little too much
difficulty.

So the proposal would be to provide the slicing/chunking function at
the datatype level not the column level. The user would create a
binary compatible type, that is effectively XML or whatever, just with
extra constraints on usage for slicing.

But now I consider the syntax, I'll call it a splitter function since
slicer, chunker sound silly to me.

CREATE TYPE my_xml
LIKE xml
SPLITTER my_toast_function;

Search functions would then be designed that take such datatypes as
input and would be able to rely with certainty upon the toast slicing
algorithms in order to retrieve data.

Doing it this way means that different XML or JSON schemas could have
specific search functions optimised for them.

I'm proposing this now as a possible GSoC project; I don't propose to
actively work on it myself.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-students(at)postgresql(dot)org
Subject: Re: [HACKERS] Slicing TOAST
Date: 2013-05-14 07:50:50
Message-ID: 5191ECDA.3080204@krosing.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students

On 05/14/2013 10:05 AM, Simon Riggs wrote:
> I'm proposing this now as a possible GSoC project:
>
> In 1-byte character encodings (i.e. not UTF-8), SUBSTR() is optimised
> to allow seeking straight to the exact slice when retrieving a large
> toasted value. This reduces I/O considerably when you have large
> toasted values since it is an O(1) action rather than an O(N).
>
> This is possible because the slicing of toasted values is predictable
> on 1 byte encodings.
>
> It would be useful to have a predictable function perform the slicing,
> so we could use that knowledge later to optimise searches in a wider
> range of situations. More specifically, since UTF-8 is so common, to
> allow optimisations in that encoding of common data: text, XML, JSON.
>
> e.g. if we knew that an XML document has a required element called
> TITLE and that occurs only once and always in the first slice, it
> would be useful information to use in search functions. (Not sure, but
> it may be possible to assign non-consecutive slice numbers to allow
> variable data mid-way through a column value if needed).
>
> e.g. in UTF-8 free text we could put 500 characters in each slice, so
> that even if that could be anywhere between 500 and 2000 bytes it
> would still fit just fine.
>
> e.g. for arrays, if we put say 200 elements per slice, then accessing
> particular elements would require only 1 slice retrieval.
>
> Doing this would *possibly* reduce packing density, but not certainly
> so. But it would greatly improve access times to large structured
> toast values.
On the contrary, as it would enable us to pack the chunks fitting
more on the page, especially for :)

That is, first chunk into N bytes, then compress each chunk

-----------------
Hannu


From: Thom Brown <thom(at)linux(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-students(at)postgresql(dot)org
Subject: Re: Slicing TOAST
Date: 2013-05-14 17:21:00
Message-ID: CAA-aLv6b+whXRQVfJS0yJ56JCY0peo92Gt6gE7aBaSb6Dg+YLA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students

On 14 May 2013 08:05, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I'm proposing this now as a possible GSoC project:

Unfortunately the deadline for project submissions for students was
3rd May. If this isn't worked on before next year, it can of course
be put forward as an idea for GSoC 2014.

--
Thom


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Thom Brown <thom(at)linux(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, pgsql-students(at)postgresql(dot)org
Subject: Re: Slicing TOAST
Date: 2013-05-14 17:34:55
Message-ID: CA+U5nML9muHT=eUc5j6CePqCxUqwiYfWq2MZAKpGQQSXsxqtGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students

On 14 May 2013 18:21, Thom Brown <thom(at)linux(dot)com> wrote:

> On 14 May 2013 08:05, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> I'm proposing this now as a possible GSoC project:
>
> Unfortunately the deadline for project submissions for students was
> 3rd May. If this isn't worked on before next year, it can of course
> be put forward as an idea for GSoC 2014.

Having reviewed the list of project ideas, I thought I'd submit an
alternative, in case we have difficulties.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Slicing TOAST
Date: 2013-05-14 18:36:35
Message-ID: 51928433.1040700@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students


> I'm proposing this now as a possible GSoC project; I don't propose to
> actively work on it myself.

The deadline for submitting GSOC projects (by students) was a week ago.
So is this a project suggestion for next year ...?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Slicing TOAST
Date: 2013-05-14 18:47:23
Message-ID: 519286BB.6080901@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students

On 14.05.2013 21:36, Josh Berkus wrote:
>
>> I'm proposing this now as a possible GSoC project; I don't propose to
>> actively work on it myself.
>
> The deadline for submitting GSOC projects (by students) was a week ago.
> So is this a project suggestion for next year ...?

I've been thinking, we should already start collecting ideas for next
year, and collect them throughout the year. I know I come up with some
ideas every now and then, but when it's time for another GSoC, I can't
remember any of them.

I just created a GSoC2014 ideas pages on the wiki, for collecting these:
https://wiki.postgresql.org/wiki/GSoC_2014. Let's keep the ideas coming,
throughout the year.

- Heikki


From: Thom Brown <thom(at)linux(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Slicing TOAST
Date: 2013-05-14 19:00:26
Message-ID: CAA-aLv5k7ffSp_K5jNpD+=WYkZ=0Z4=WF_=B=B34P7RCSz7o_g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students

On 14 May 2013 19:47, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:
> On 14.05.2013 21:36, Josh Berkus wrote:
>>
>>
>>> I'm proposing this now as a possible GSoC project; I don't propose to
>>> actively work on it myself.
>>
>>
>> The deadline for submitting GSOC projects (by students) was a week ago.
>> So is this a project suggestion for next year ...?
>
>
>
> I've been thinking, we should already start collecting ideas for next year,
> and collect them throughout the year. I know I come up with some ideas every
> now and then, but when it's time for another GSoC, I can't remember any of
> them.
>
> I just created a GSoC2014 ideas pages on the wiki, for collecting these:
> https://wiki.postgresql.org/wiki/GSoC_2014. Let's keep the ideas coming,
> throughout the year.

Thanks Heikki, that's a capital idea.

--
Thom


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Slicing TOAST
Date: 2013-05-14 19:04:33
Message-ID: CAM3SWZQ5B+MSEvXmcmCThKNx4iaOSmD1Ek6iZ8QTi5GWzaYkxA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students

On Tue, May 14, 2013 at 11:47 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> I've been thinking, we should already start collecting ideas for next year,
> and collect them throughout the year. I know I come up with some ideas every
> now and then, but when it's time for another GSoC, I can't remember any of
> them.

It seems like the PostgreSQL Wiki Todo list has a lot of deadwood. I
wouldn't tell a novice hacker to go and pick something from there.
Maintaining a list of good beginner projects is actually a pretty hard
undertaking.

One thing I've heard multiple times in the past is that the archetypal
beginner project is to add some feature to psql. Well, psql is fairly
feature complete these days, so finding something to do there that's
likely to be accepted is probably not that easy.

--
Peter Geoghegan


From: Thom Brown <thom(at)linux(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Slicing TOAST
Date: 2013-05-14 19:12:29
Message-ID: CAA-aLv4FRqC=KyHEZe8uM=J6c7Hedvu4y3mLZF8r0_vtKSf2qA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students

On 14 May 2013 20:04, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Tue, May 14, 2013 at 11:47 AM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> I've been thinking, we should already start collecting ideas for next year,
>> and collect them throughout the year. I know I come up with some ideas every
>> now and then, but when it's time for another GSoC, I can't remember any of
>> them.
>
> It seems like the PostgreSQL Wiki Todo list has a lot of deadwood. I
> wouldn't tell a novice hacker to go and pick something from there.
> Maintaining a list of good beginner projects is actually a pretty hard
> undertaking.

I think that's why Heikki is proposing we collect GSoC-friendly ideas
separately and put them on the list for next year.

--
Thom


From: David Fetter <david(at)fetter(dot)org>
To: pgsql-students(at)postgresql(dot)org
Subject: Re: Slicing TOAST
Date: 2013-05-14 20:09:47
Message-ID: 20130514200947.GE14908@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students

On Tue, May 14, 2013 at 06:34:55PM +0100, Simon Riggs wrote:
> On 14 May 2013 18:21, Thom Brown <thom(at)linux(dot)com> wrote:
>
> > On 14 May 2013 08:05, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> >> I'm proposing this now as a possible GSoC project:
> >
> > Unfortunately the deadline for project submissions for students was
> > 3rd May. If this isn't worked on before next year, it can of course
> > be put forward as an idea for GSoC 2014.
>
> Having reviewed the list of project ideas, I thought I'd submit an
> alternative, in case we have difficulties.

It being past 3 May, we have two and only two actual options:

1. Make a go with what we have, with a high risk of failure for some of the projects.

2. Give up slots.

Option 1 is preferable in my mind to option 2 because it will not
impact our ability to get slots next year, or if it does, it will not
impact them as severely.

Simon,

Is there anyone in your organization who can mentor one or more of the
projects we have which doesn't currently have a mentor?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: pgsql-students(at)postgresql(dot)org
Subject: Re: Slicing TOAST
Date: 2013-05-14 22:45:58
Message-ID: CA+U5nMJiWieve4a=n3Ss7B7JdjGrNoqpYy26EYg_Vuf=MhXFcQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students

On 14 May 2013 21:09, David Fetter <david(at)fetter(dot)org> wrote:

> Is there anyone in your organization who can mentor one or more of the
> projects we have which doesn't currently have a mentor?

Depends upon the project. At least half the projects looked like
things I would vote -1 for, which is why I proposed something
straightforward and useful.

Where is the list of projects/mentors?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-students(at)postgresql(dot)org
Subject: Re: Slicing TOAST
Date: 2013-05-14 23:40:50
Message-ID: 5192CB82.5070107@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students


> Depends upon the project. At least half the projects looked like
> things I would vote -1 for, which is why I proposed something
> straightforward and useful.
>
> Where is the list of projects/mentors?

You'd need to register as a mentor on the GSOC app so that we can let
you see the proposals.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Pavel Golub <pavel(at)microolap(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: Slicing TOAST
Date: 2013-05-15 09:01:51
Message-ID: 1886757050.20130515120151@gf.microolap.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-students

Hello, Heikki.

You wrote:

HL> On 14.05.2013 21:36, Josh Berkus wrote:
>>
>>> I'm proposing this now as a possible GSoC project; I don't propose to
>>> actively work on it myself.
>>
>> The deadline for submitting GSOC projects (by students) was a week ago.
>> So is this a project suggestion for next year ...?

HL> I've been thinking, we should already start collecting ideas for next
HL> year, and collect them throughout the year. I know I come up with some
HL> ideas every now and then, but when it's time for another GSoC, I can't
HL> remember any of them.

HL> I just created a GSoC2014 ideas pages on the wiki, for collecting these:
HL> https://wiki.postgresql.org/wiki/GSoC_2014. Let's keep the ideas coming,
HL> throughout the year.

Good idea! It reminds about feature proposed by Pavel Stehule while
ago here: http://www.postgresql.org/message-id/BANLkTini+ChGKfnyjkF1rsHSQ2kMktSDjg@mail.gmail.com

It's about streaming functionality for BYTEA type. But I think
streaming must be added to BYTEA, TEXT and VARCHAR without length
specifier too.

As Pavel stated: "A very large bytea are limited by
query size - processing long query needs too RAM". This is the holy
true, which came up suddenly in the project of one of my client.
Becuase he used bytea for images storing and text format in
PQexec, which as you know doubles-triples size of the data.

Some more details from Pavel:
<quote>
There is a few disadvantages LO against bytea, so there are requests
for "smarter" API for bytea.

Significant problem is different implementation of LO for people who
have to port application to PostgreSQL from Oracle, DB2. There are
some JDBC issues too.

For me - main disadvantage of LO in one space for all. Bytea removes
this disadvantage, but it is slower for lengths > 20 MB. It could be
really very practical have a possibility insert some large fields in
second NON SQL stream. Same situation is when large bytea is read.
</quote>

I'm not sure if the whole project is simple enough for GSOC, but I
suppose it may be splitted somehow.

PS Should we start separate thread for proposals, because I've spent
an hour since I found wiki for GSOC14 mention.

HL> - Heikki

--
With best wishes,
Pavel mailto:pavel(at)gf(dot)microolap(dot)com