Re: WIP: Collecting statistics on CSV file data

From: Shigeru Hanada <shigeru(dot)hanada(at)gmail(dot)com>
To: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: Collecting statistics on CSV file data
Date: 2011-11-18 12:00:10
Message-ID: 4EC648CA.5050904@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

(2011/11/18 16:25), Etsuro Fujita wrote:
> Thank you for your testing. I updated the patch according to your
> comments. Attached is the updated version of the patch.

I'd like to share result of my review even though it's not fully
finished. So far I looked from viewpoint of API design, code
formatting, and documentation. I'll examine effectiveness of the patch
and details of implementation next week, and hopefully try writing
ANALYZE handler for pgsql_fdw :)

New patch has correct format, and it could be applied to HEAD of master
branch cleanly. All regression tests including those of contrib modules
have passed. It contains changes of codes and regression tests related
to the issue, and they have enough comments. IMO the document in this
patch is not enough to show how to write analyze handler to FDW authors,
but it can be enhanced afterward. With this patch, FDW author can
provide optional ANALYZE handler which updates statistics of foreign
tables. Planner would be able to generate better plan by using statistics.

> Yes. But in the updated version, I've refactored analyze.c a little bit
> to allow FDW authors to simply call do_analyze_rel().
<snip>
> The updated version enables FDW authors to just write their own
> acquire_sample_rows(). On the other hand, by retaining to hook
> AnalyzeForeignTable routine at analyze_rel(), higher level than
> acquire_sample_rows() as before, it allows FDW authors to write
> AnalyzeForeignTable routine for foreign tables on a remote server to ask
> the server for its current stats instead, as pointed out earlier by Tom
> Lane.

IIUC, this patch offers three options to FDWs: a) set
AnalyzeForeignTable to NULL to indicate lack of capability, b) provide
AnalyzeForeignTable which calls do_analyze_rel with custom
sample_row_acquirer, and c) create statistics data from scratch by FDW
itself by doing similar things to do_analyze_rel with given argument or
copying statistics from remote PostgreSQL server.

ISTM that this design is well-balanced between simplicity and
flexibility. Maybe these three options would suit web-based wrappers,
file-based or RDBMS wrappers, and pgsql_fdw respectively. I think that
adding more details of FdwRoutine, such as purpose of new callback
function and difference from required ones, would help FDW authors,
including me :)

I have some random comments:

- I think separated typedef of sample_acquire_rows would make codes more
readable. In addition, parameters of the function should be described
explicitly.
- I couldn't see the reason why file_fdw sets ctid of sample tuples,
though I guess it's for Vitter's random sampling algorithm. If every
FDW must set valid ctid to sample tuples, it should be mentioned in
document of AnalyzeForeignTable. Exporting some functions from
analyze.c relates this issue?
- Why file_fdw skips sample tuples which have NULL value? AFAIS
original acquire_sample_rows doesn't do so.
- Some comment lines go past 80 columns.
- Some headers included in file_fdw.c seems unnecessary.

Regards,
--
Shigeru Hanada

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Florian Pflug 2011-11-18 12:32:47 Re: range_adjacent and discrete ranges
Previous Message Pavan Deolasee 2011-11-18 11:26:34 Re: FlexLocks