Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

From: "Etsuro Fujita" <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
To: "'Robert Haas'" <robertmhaas(at)gmail(dot)com>, "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "'PostgreSQL-development'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP Patch: Use sortedness of CSV foreign tables for query planning
Date: 2012-08-07 06:02:22
Message-ID: 002501cd7462$35f41ca0$a1dc55e0$@lab.ntt.co.jp
Lists: pgsql-hackers

> From: Robert Haas [mailto:robertmhaas(at)gmail(dot)com]

> On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
> >> <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >>> I think file_fdw is useful for managing log files such as PG CSV logs. Since
> >>> often, such files are sorted by timestamp, I think the patch can improve the
> >>> performance of log analysis, though I have to admit my demonstration was not
> >>> realistic.
> >
> >> Hmm, I guess I could buy that as a plausible use case.
> >
> > In the particular case of PG log files, I'd bet good money against them
> > being *exactly* sorted by timestamp. Clock skew between backends, or
> > varying amounts of time to construct and send messages, will result in
> > small inconsistencies. This would generally not matter, until the
> > planner relied on the claim of sortedness for something like a mergejoin
> > ... and then it would matter a lot.
>
> Hmm, true.
>
> > In general I'm quite suspicious of the idea of believing that externally
> > supplied data is sorted in exactly the way that PG thinks it should
> > sort. If we implement this you can bet that people will screw up, for
> > instance by using the wrong locale/collation to sort text data.
>
> I think that optimizations like this are going to be essential for
> things like pgsql_fdw (or other_rdms_fdw). Despite the thorny
> semantic issues, we're just not going to be able to get around it.
> There will even be people who want SELECT * FROM ft ORDER BY 1 to
> order by the remote side's notion of ordering rather than ours,
> despite the fact that the remote side has some insane-by-PG-standards
> definition of ordering. People are going to find ways to do that kind
> of thing whether we condone it or not, so we might as well start
> thinking now about how we're going to live with it. But that doesn't
> answer the question of whether or not we ought to support it for
> file_fdw in particular, which seems like a more arguable point.

For file_fdw, I am inclined to simply have it (1) verify at execution time,
i.e., during the (first) scan of a data file, that the key column is sorted in
the specified way, doing so only when pathkeys are set, and (2) abort the
transaction if it detects that the data file is not sorted.
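
To make the idea concrete, here is a minimal sketch of such a check. It is written in Python rather than file_fdw's C, and it is only an illustration under stated assumptions, not the patch itself. The names scan_verifying_order and UnsortedInputError are hypothetical, the check_order flag stands in for "pathkeys are set", and keys are compared as plain strings by code point, which sidesteps the locale/collation question raised above.

```python
import csv
import io


class UnsortedInputError(Exception):
    """Raised when a row breaks the declared ordering (hypothetical name)."""


def scan_verifying_order(lines, key_index, check_order=True):
    """Yield CSV rows, verifying that column key_index is non-decreasing.

    check_order stands in for "pathkeys are set": when it is False, rows
    are passed through without any comparison.
    """
    previous = None
    for row_number, row in enumerate(csv.reader(lines), start=1):
        key = row[key_index]
        # Assumption: plain string comparison, not a PostgreSQL collation.
        if check_order and previous is not None and key < previous:
            # Corresponds to aborting the transaction in the proposal.
            raise UnsortedInputError(
                f"row {row_number}: key {key!r} sorts before {previous!r}"
            )
        previous = key
        yield row


# Usage: a log whose second row is out of order is rejected during the scan.
skewed_log = io.StringIO(
    "2012-08-07 06:00:02,backend A\n"
    "2012-08-07 06:00:01,backend B\n"
)
try:
    for row in scan_verifying_order(skewed_log, key_index=0):
        pass
except UnsortedInputError as err:
    print("scan aborted:", err)
```

Note that the check is streaming: rows before the first violation have already been returned when the error is raised, so it relies on the transaction abort to discard any results computed from them.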

Thanks,

Best regards,
Etsuro Fujita
