Re: WIP: Collecting statistics on CSV file data

From: Shigeru HANADA <shigeru(dot)hanada(at)gmail(dot)com>
To: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: Collecting statistics on CSV file data
Date: 2012-04-05 12:10:39
Message-ID: 4F7D8BBF.4050004@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

(2011/12/13 22:00), Etsuro Fujita wrote:
> Thank you for your effectiveness experiments and proposals for
> improvements. I updated the patch according to your proposals.
> Attached is the updated version of the patch.

Hi all,

I've revised the v6.1 patch and created v7 patch, though dead line of
this CF is coming closer. I think that this feature provides a way to
improve plans for foreign tables significantly, so I hope that this
feature is available in 9.2.

I'd like to show overview of the patch again for ease of review.

New FDW API function
====================
This patch adds new FDW API function AnalyzeForeignTable to FdwRoutine
which can be used to support updating local statistics of foreign table.
This function is invoked when ANALYZE command is executed against a
foreign table explicitly. This handler function is optional, so if
underlying FDW set NULL for the pointer, PostgreSQL doesn't touch
statistics but emits a message about skipping.

It's not required to FDWs to implement fully featured analyzer by
itself. They can use core routines, do_analyze_rel and others, for most
difficult part of analyzing. What FDWs should do is to provide a
sampling function and call do_analyze_rel with passing the sampling
function as argument in their concrete AnalyzeForeignTable.

AnalyzeForeignTable (or sampling function) can report FDW-specific
additional information by calling ereport() with given elevel.

Once we've considered an idea that FDW stores statistics information
without calling do_analyze_rel, but it seems very hard to implement, and
not so efficient. I tried to implement such handler in pgsql_fdw (which
seeems easiest to achieve) by getting remote statistics by "SELECT *
FROM pg_statistics", but it has several issues such as:

1) Highly privileged user on remote side should be mapped to ANALYZE invoker
2) Structure and semantics might be difficult on remote side, if
versions are not same.
3) We need to convert anyarray to anyarray through text representation.
We know type of elements, but bothering works are needed.

DDL changes
===========
ALTER FOREIGN TABLE supports SET STATISTICS clause and n_distinct
option. Former changes per-attribute statistics target, and latter
overrides calculated statistics. n_distinct_inherited is not available
because foreign tables can't be inherited.

psql support
============
psql completes foreign table names after the keyword "ANALYZE" in
addition to ordinary tables. Of course newly added statistics target is
shown in \d+ command.

file_fdw
========
This patch contains a use case of new handler function in
contrib/file_fdw. Since file_fdw reads data from a flat file,
fileAnalyzeForeignTable uses similar algorithm to ordinary tables; it
samples first N rows first, and replaces them randomly with subsequent
rows. Also file_fdw updates pg_class.relpages by calculating number of
pages from size of the data file.

To allow FDWs to implement sampling argorighm like this, several
functions are exported from analyze.c, e.g. random_fract,
init_selection_state, and get_next_S.

pgsql_fdw
=========
Though it's not fully finished, I've implemented ANALYZE handler for
pgsql_fdw. Please extract pgsql_fdw.tar.gz into contrib (or use pgxs)
to try it.

Regards,
--
Shigeru HANADA

Attachment Content-Type Size
postgresql-analyze-v7.patch text/plain 47.5 KB
pgsql_fdw.tar.gz application/gzip 31.7 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2012-04-05 12:30:51 Re: patch: improve SLRU replacement algorithm
Previous Message Robert Haas 2012-04-05 11:56:30 Re: patch: improve SLRU replacement algorithm