Re: WIP: Collecting statistics on CSV file data

From: Shigeru Hanada <shigeru(dot)hanada(at)gmail(dot)com>
To: Etsuro Fujita <fujita(dot)etsuro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP: Collecting statistics on CSV file data
Date: 2012-02-10 13:05:42
Message-ID: 4F351626.4030500@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

(2011/12/15 11:30), Etsuro Fujita wrote:
> (2011/12/14 15:34), Shigeru Hanada wrote:
>> I think this patch could be marked as "Ready for committer" with some
>> minor fixes. Please find attached a revised patch (v6.1).

I've tried to make pgsql_fdw work with this feature, and found that few
static functions to be needed to exported to implement ANALYZE handler
in short-cut style. The "Short-cut style" means the way to generate
statistics (pg_class and pg_statistic) for foreign tables without
retrieving sample data from foreign server.

Attached patch (export_funcs.patch) exports examine_attribute and
update_attstats which are necessary to implement ANALYZE handler for
pgsql_fdw. In addition to exporting, update_attstats is also renamed to
vac_update_attstats to fit with already exported function
vac_update_relstats.

I also attached archive of WIP pgsql_fdw with ANALYZE support. This
version has better estimation than original pgsql_fdw, because it can
use selectivity of qualifiers evaluated on local side to estimate number
of result rows. To show the effect of ANALYZE clearly, WHERE push-down
feature is disabled. Please see pgsqlAnalyzeForeignTable and
store_remote_stats in pgsql_fdw.c.

I used pgbench_accounts tables with 30000 records, and got reasonable
rows estimation for queries below.

<on remote side>
postgres=# UPDATE pgbench_accounts SET filler = NULL
postgres-# WHERE aid % 3 = 0;
postgres=# ANALYZE;

<on local side>
postgres=# ANALYZE pgbench_accounts; -- needs explicit table name
postgres=# EXPLAIN SELECT * FROM pgbench_accounts WHERE filler IS NULL;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Foreign Scan on pgbench_accounts (cost=100.00..40610.00 rows=100030
width=97)
Filter: (filler IS NULL)
Remote SQL: DECLARE pgsql_fdw_cursor_13 SCROLL CURSOR FOR SELECT aid,
bid, abalance, filler FROM public.pgbench_accounts
(3 rows)

postgres=# EXPLAIN SELECT * FROM pgbench_accounts WHERE aid < 100;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Foreign Scan on pgbench_accounts (cost=100.00..40610.00 rows=96 width=97)
Filter: (aid < 100)
Remote SQL: DECLARE pgsql_fdw_cursor_14 SCROLL CURSOR FOR SELECT aid,
bid, abalance, filler FROM public.pgbench_accounts
(3 rows)

postgres=# EXPLAIN SELECT * FROM pgbench_accounts WHERE aid < 1000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Foreign Scan on pgbench_accounts (cost=100.00..40610.00 rows=1004
width=97)
Filter: (aid < 1000)
Remote SQL: DECLARE pgsql_fdw_cursor_15 SCROLL CURSOR FOR SELECT aid,
bid, abalance, filler FROM public.pgbench_accounts
(3 rows)

In implementing ANALYZE handler, hardest part was copying anyarray
values from remote to local. If we can make it common in core, it would
help FDW authors who want to implement ANALYZE handler without
retrieving sample rows from remote server.

Regards,
--
Shigeru Hanada

Attachment Content-Type Size
export_funcs.patch text/plain 5.5 KB
pgsql_fdw.tar.gz application/gzip 24.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-02-10 13:50:35 Re: psql tab completion for SELECT
Previous Message Noah Misch 2012-02-10 11:42:05 Re: RFC: Making TRUNCATE more "MVCC-safe"