Re: WIP: Collecting statistics on CSV file data

From: Shigeru Hanada <shigeru(dot)hanada(at)gmail(dot)com>
To: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: Collecting statistics on CSV file data
Date: 2011-10-17 17:27:15
Message-ID: 4E9C6573.8050507@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

(2011/10/07 18:09), Etsuro Fujita wrote:
> Thank you for the review and the helpful information.
> I rebased. Please find attached a patch. I'll add the patch to the next CF.
>
> Changes:
>
> * cleanups and fixes
> * addition of the following to ALTER FOREIGN TABLE
> ALTER [COLUMN] column SET STATISTICS integer
> ALTER [COLUMN] column SET ( n_distinct = val ) (n_distinct only)
> ALTER [COLUMN] column RESET ( n_distinct )
> * reflection of the force_not_null info in acquiring sample rows
> * documentation

The new patch could be applied with some shifts. Regression tests of
core and file_fdw have passed cleanly. Though I've tested only simple
tests, ANALYZE works for foreign tables for file_fdw, and estimation of
costs and selectivity seem appropriate.

New API AnalyzeForeignTable
===========================
In your design, new handler function is called instead of
do_analylze_rel. IMO this hook point would be good for FDWs which can
provide statistics in optimized way. For instance, pgsql_fdw can simply
copy statistics from remote PostgreSQL server if they are compatible.
Possible another idea is to replace acquire_sample_rows so that other
FDWs can reuse most part of fileAnalyzeForeignTable and
file_fdw_do_analyze_rel.

And I think that AnalyzeForeignTable should be optional, and it would be
very useful if a default handler is provided. Probably a default
handler can use basic FDW APIs to acquire sample rows from the result of
"SELECT * FROM foreign_table" with skipping periodically. It won't be
efficient but I think it's not so unreasonable.

Other issues
============
There are some other comments about non-critical issues.
- When there is no analyzable column, vac_update_relstats is not called.
Is this behavior intentional?
- psql can't complete foreign table name after ANALYZE.
- A new parameter has been added to vac_update_relstats in a recent
commit. Perhaps 0 is OK for that parameter.
- ANALYZE without relation name ignores foreign tables because
get_rel_oids doesn't list foreign tables.
- IMO logging "analyzing foo.bar" should not be done in
AnalyzeForeignTable handler of each FDW because some FDW might forget to
do it. Maybe it should be pulled up to analyze_rel or somewhere in core.
- It should be mentioned in a document that foreign tables are not
analyzed automatically because they are read-only.

Regards,
--
Shigeru Hanada

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2011-10-17 18:33:36 HeapTupleHeaderAdvanceLatestRemovedXid doing the wrong thing with multixacts
Previous Message Tom Lane 2011-10-17 15:37:56 Re: Underspecified window queries in regression tests