Aggressive memory consumption in {ts,array}_typanalyze

Lists: pgsql-hackers
From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Aggressive memory consumption in {ts,array}_typanalyze
Date: 2012-04-16 20:58:51
Message-ID: 20120416205851.GA23994@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Both $SUBJECT functions pass to hash_create() an expected hash table size of
10000 * attstattarget. Based on header comments, this represents a near-worst
case. These typanalyze functions scan the hash tables sequentially, thereby
visiting the entire allocation. Per the recommendation in comments at
hash_create(), we should be more conservative. On my system, this tiny test
case runs in 2.8s and dirties 1.0 GiB of local memory:

CREATE TEMP TABLE t AS SELECT '{}'::int[];
SET default_statistics_target = 10000;
ANALYZE t;

Rather arbitrarily, I reduced the hash_create() size hint by 99.9%, to the
width of the histograms destined for pg_statistic. This streamlined the test
case to <20ms runtime and 2 MiB of memory.

To verify that nothing awful happens when the hash table sees considerable
dynamic growth, I used a subject table entailing a 9M-element hash table at
ANALYZE time:

CREATE UNLOGGED TABLE t AS SELECT array[n,3000000+n,6000000+n]
FROM generate_series(1,3000000) t(n);

Unpatched master takes 15s and dirties 2.1 GiB; patched takes 15s and dirties
1.2 GiB. The timing noise overlapped any systematic difference, but the
patched version might have been around 500ms slower. Based on that, I'm
comfortable trusting that improving smaller cases in this way will not greatly
harm larger cases.

The lack of field complaints about ts_typanalyze() resource usage does argue
against the need for a change here, but I think adding array_typanalyze() in
PostgreSQL 9.2 significantly increases our risk exposure. Sites may have
cranked up the statistics target on array columns to compensate for the lack
of explicit statistical support. Every cluster has several array columns in
the system catalogs.

The size hint I chose is fairly arbitrary. Any suggestions for principled
alternatives?

Thanks,
nm

Attachment Content-Type Size
array-ts-analyze-mem-v1.patch text/plain 1.4 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Aggressive memory consumption in {ts,array}_typanalyze
Date: 2012-04-18 15:09:12
Message-ID: CA+Tgmobt2Zq-7mEN9WWzWDoOM1CyOC=FKXUV7aeEi47+5TQWFg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 16, 2012 at 4:58 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> The size hint I chose is fairly arbitrary.  Any suggestions for principled
> alternatives?

Based on your test results, it doesn't seem like it matters very much
what you put in there, so I'm inclined to think that num_mcelem is
fine. I thought about maybe allowing for a little slop, like
num_mcelem * 10, but maybe the way you did it is better. It's
possible that people will set ridiculously overblown stats targets on
some columns, and that's certainly going to cost something no matter
what we do, but there's no point in making that worse than it has to
be without some clear reason for doing so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Aggressive memory consumption in {ts,array}_typanalyze
Date: 2012-04-24 02:07:29
Message-ID: CA+Tgmobd2NJ1p46xyW6z=yD8ffL4kDznYFq-+yru1+Z=ixAZow@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Apr 18, 2012 at 11:09 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Apr 16, 2012 at 4:58 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> The size hint I chose is fairly arbitrary.  Any suggestions for principled
>> alternatives?
>
> Based on your test results, it doesn't seem like it matters very much
> what you put in there, so I'm inclined to think that num_mcelem is
> fine.  I thought about maybe allowing for a little slop, like
> num_mcelem * 10, but maybe the way you did it is better.  It's
> possible that people will set ridiculously overblown stats targets on
> some columns, and that's certainly going to cost something no matter
> what we do, but there's no point in making that worse than it has to
> be without some clear reason for doing so.

Hearing no further comments, I have committed your patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company