Re: postgres performance: comparing 2 data centers

From: "Michael Nonemacher" <Michael_Nonemacher(at)messageone(dot)com>
To: "Rod Taylor" <ports(at)rbt(dot)ca>
Cc: "Postgresql Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: postgres performance: comparing 2 data centers
Date: 2004-06-04 23:12:52
Message-ID: E3A41572DB871B42AB6939873D95E8CA038750@auscorpex-1.austin.messageone.com
Lists: pgsql-performance

Agreed.

We originally created the indexes this way because we sometimes do
searches where one of the columns is constrained using =, and the other
using a range search, but it's not clear to me how much Postgres
understands multi-column indexes. Will I get the gain I'd expect from a
(member_id, group_id) index on a query like "where member_id = ? and
group_id > ?"?

I've since found a few other often-used tables where the reltuples
counts generated by 'analyze' are off by a factor of 5 or more. In the
short term, I'm just trying to eliminate the automatic analyzes where
possible, and where that's not possible, make sure they're followed up
quickly with a 'vacuum'.
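
For concreteness, one way to spot the skew (using a hypothetical
'members' table as an example) is to compare the planner's estimate in
pg_class.reltuples against an actual count:

    -- row estimate and page count recorded by the last ANALYZE
    SELECT relname, reltuples, relpages
    FROM pg_class
    WHERE relname = 'members';

    -- actual row count, for comparison
    SELECT count(*) FROM members;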

Is "analyze generating bad stats" a known issue? Is there anything I
could be doing to aggravate or work around the problem?

mike

-----Original Message-----
From: Rod Taylor [mailto:ports(at)rbt(dot)ca]
Sent: Friday, June 04, 2004 5:27 PM
To: Michael Nonemacher
Cc: Postgresql Performance
Subject: Re: [PERFORM] postgres performance: comparing 2 data centers

> The members table contains about 500k rows. It has an index on
> (group_id, member_id) and on (member_id, group_id).

Yes, bad stats are causing it to pick a poor plan, but you're giving it
too many options (which doesn't help) and using up space unnecessarily.

Keep (group_id, member_id)
Remove (member_id, group_id)
Add (member_id)
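
In SQL, that restructuring would look something like this (the index
names here are only illustrative; use whatever yours are actually
called):

    -- drop the redundant two-column index on (member_id, group_id)
    DROP INDEX members_member_group_idx;

    -- replace it with a single-column index on member_id
    CREATE INDEX members_member_idx ON members (member_id);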

An index on just member_id is actually going to perform better than
(member_id, group_id), since it has a smaller footprint on disk.

Any time both group_id and member_id are in the query, the
(group_id, member_id) index will likely be used.
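
You can see the footprint difference directly in pg_class (again, the
index names are just examples):

    -- on-disk size of each index, in 8 kB pages (single vs. two-column)
    SELECT relname, relpages
    FROM pg_class
    WHERE relname IN ('members_member_idx', 'members_group_member_idx');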

--
Rod Taylor <rbt [at] rbt [dot] ca>

Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
PGP Key: http://www.rbt.ca/signature.asc


From: Greg Stark <gsstark(at)mit(dot)edu>
To: "Michael Nonemacher" <Michael_Nonemacher(at)messageone(dot)com>
Cc: "Rod Taylor" <ports(at)rbt(dot)ca>, "Postgresql Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: postgres performance: comparing 2 data centers
Date: 2004-06-05 00:33:26
Message-ID: 873c5a2695.fsf@stark.xeocode.com
Lists: pgsql-performance

"Michael Nonemacher" <Michael_Nonemacher(at)messageone(dot)com> writes:

> Agreed.
>
> We originally created the indexes this way because we sometimes do
> searches where one of the columns is constrained using =, and the other
> using a range search, but it's not clear to me how much Postgres
> understands multi-column indexes. Will I get the gain I'd expect from a
> (member_id, group_id) index on a query like "where member_id = ? and
> group_id > ?"?

It will use them; whether you see a gain depends on the distribution of
your data. Does the group_id > ? exclude enough records to be worth all
the extra I/O the bigger index requires?
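
A quick way to check is to run the query through EXPLAIN ANALYZE and
see which index the planner actually picks (the table and column names
follow the earlier messages; the literal values are placeholders):

    -- does the planner use (member_id, group_id) for the =/range combo?
    EXPLAIN ANALYZE
    SELECT *
    FROM members
    WHERE member_id = 12345
      AND group_id > 100;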

Personally I think the other poster was a bit hasty to assert
unconditionally that it's never worth it. If you have a lot of records
for every member_id, very few of which will have group_id greater than
'?', then it might be worth it. If, however, you'll only ever have on
the order of a hundred or fewer records per member_id, and a
significant chunk of them will have group_id > '?', then it will
probably be a wash or worse.

There's another side to the story though. In a web site or other OLTP
application you may find you're better off with the multi-column index,
even if it performs less well on average than the smaller single-column
index when users have reasonable numbers of groups. That's because
you're guaranteed (assuming Postgres is using it) that even if a user
someday has an obscene number of groups, he won't suddenly break your
web site by driving your database into the ground.

There is a difference between latency and bandwidth, and between average and
worst-case. Sometimes it's necessary to keep an eye on worst-case scenarios
and not just average bandwidth.

That said, if you are reasonably certain that you'll never or rarely
have thousands of groups per user, you're probably better off with the
indexes the other poster described.

> I've since found a few other often-used tables where the reltuples
> counts generated by 'analyze' are off by a factor of 5 or more. In the
> short term, I'm just trying to eliminate the automatic analyzes where
> possible, and where that's not possible, make sure they're followed up
> quickly with a 'vacuum'.
>
> Is "analyze generating bad stats" a known issue? Is there anything I
> could be doing to aggravate or work around the problem?

I would suggest trying a VACUUM FULL and then retrying the ANALYZE. I
suspect you might have a lot of dead tuples at the beginning of your
table, which are confusing the sampling. If that's it, then yes, it's
known, and in fact it's already improved in what will be 7.5. You may
be able to avoid the situation by vacuuming more frequently.
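
Concretely, something like this (table name is just an example; note
that VACUUM FULL takes an exclusive lock on the table while it runs):

    -- compact the table, reclaiming the dead space at the start
    VACUUM FULL members;

    -- resample now that dead tuples won't skew the estimate
    ANALYZE members;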

If that doesn't solve it then I would suggest trying to raise the statistics
targets for the columns in question with

ALTER TABLE name ALTER column SET STATISTICS integer

The default is 10 iirc. You could try 100 or 200, or even more.
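
For the earlier example, that might look like this (the table, column,
and target are illustrative; a fresh ANALYZE is needed for the new
target to take effect):

    -- keep more detailed stats for the column used in the range filter
    ALTER TABLE members ALTER COLUMN group_id SET STATISTICS 200;

    -- re-gather statistics at the new target
    ANALYZE members;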

--
greg