Skip site navigation (1) Skip section navigation (2)

Peripheral Links

Header And Logo

PostgreSQL
| The world's most advanced open source database.

Site Navigation

Search for
  Advanced Search

Re: Group by more efficient than distinct?



PFC wrote:
Actually, the memory used by the hash depends on the number of distinct values, not the number of rows which are processed...
    Consider :

SELECT a GROUP BY a
SELECT a,count(*) GROUP BY a

In both cases the hash only holds discinct values. So if you have 1 million rows to process but only 10 distinct values of "a", the hash will only contain those 10 values (and the counts), so it will be very small and fast, it will absorb a huge seq scan without problem. If however, you have (say) 100 million distinct values for a, using a hash would be a bad idea. As usual, divide the size of your RAM by the number of concurrent connections or something. Note that "a" could be a column, several columns, anything, the size of the hash will be proportional to the number of distinct values, ie. the number of rows returned by the query, not the number of rows processed (read) by the query. Same with hash joins etc, that's why when you join a very small table to a large one Postgres likes to use seq scan + hash join on the small table.

This surprises me - hash values are lossy, so it must still need to confirm against the real list of values, which at a minimum should require references to the rows to check against?

Is PostgreSQL doing something beyond my imagination? :-)

Cheers,
mark

--
Mark Mielke <mark(at)mielke(dot)cc>




Home | Main Index | Thread Index

Privacy Policy | PostgreSQL Archives hosted by Command Prompt, Inc. | Designed by tinysofa
Copyright © 1996 – 2008 PostgreSQL Global Development Group