From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: mal advice in FAQ 4.1.
Date: 2007-10-09 10:15:55
Message-ID: 162867790710090315h7a3ae50boc4e423c8dd519313@mail.gmail.com
Lists: pgsql-hackers

Hello

I found a lot of slow queries in some databases I checked, all based on
advice 4.1 from the FAQ:

To SELECT a random row, use:
SELECT col
FROM tab
ORDER BY random()
LIMIT 1;

It's robust, but slow on bigger tables. Can we add some better solutions?

Regards
Pavel Stehule


From: "Nikolay Samokhvalov" <samokhvalov(at)gmail(dot)com>
To: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mal advice in FAQ 4.1.
Date: 2007-10-09 12:52:30
Message-ID: e431ff4c0710090552y79af2b31gb159eb8e9ccf9bd6@mail.gmail.com
Lists: pgsql-hackers

Hubert recently posted his thoughts on this topic:
http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/

I've encountered this problem several times in web development, and every
time found that the best (in terms of performance) solution is to use some
pseudo-random approach (such as ">= random() limit 1", "limit 1 offset
random()*N", or even pre-caching rows on the app side).
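For illustration, the two pseudo-random tricks could be sketched like this (a sketch only, assuming a hypothetical table `tab` with a pre-filled, indexed column `r` of uniform random values in [0,1), and an application-supplied approximate row count in place of `N`):

```sql
-- ">= random() LIMIT 1": one index probe on r, but rows preceded by a
-- larger gap in r are picked more often.
SELECT col
FROM tab
WHERE r >= (SELECT random())   -- subselect so random() is computed once
ORDER BY r
LIMIT 1;

-- "LIMIT 1 OFFSET random()*N": N is a placeholder the application fills
-- in (e.g. from pg_class.reltuples); the skipped rows are still read.
SELECT col
FROM tab
LIMIT 1 OFFSET floor(random() * N);
```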

On 10/9/07, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>
> Hello
>
> I found a lot of slow queries in some databases I checked, all based on
> advice 4.1 from the FAQ:
>
> To SELECT a random row, use:
> SELECT col
> FROM tab
> ORDER BY random()
> LIMIT 1;
>
> It's robust, but slow on bigger tables. Can we add some better solutions?
>

--
Best regards,
Nikolay


From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: nikolay(at)samokhvalov(dot)com
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mal advice in FAQ 4.1.
Date: 2007-10-09 13:05:13
Message-ID: 162867790710090605o312195e0h6dde0d00cb939c24@mail.gmail.com
Lists: pgsql-hackers

2007/10/9, Nikolay Samokhvalov <samokhvalov(at)gmail(dot)com>:
> Hubert recently posted his thoughts on this topic:
> http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
>
> I've encountered this problem several times in web development, and every
> time found that the best (in terms of performance) solution is to use some
> pseudo-random approach (such as ">= random() limit 1", "limit 1 offset
> random()*N", or even pre-caching rows on the app side).
>

I know this article, but we cannot link from the FAQ to a private
(unstable) blog. It would have to be an article on techdoc.postgresql.org.

Pavel

On 10/9/07, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
> > Hello
> >
> > I found a lot of slow queries in some databases I checked, all based on
> > advice 4.1 from the FAQ:
> >
> > To SELECT a random row, use:
> > SELECT col
> > FROM tab
> > ORDER BY random()
> > LIMIT 1;
> >
> > It's robust, but slow on bigger tables. Can we add some better solutions?
> >
>
>
> --
> Best regards,
> Nikolay


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: <nikolay(at)samokhvalov(dot)com>
Cc: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mal advice in FAQ 4.1.
Date: 2007-10-09 16:09:12
Message-ID: 87lkacclsn.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Nikolay Samokhvalov" <samokhvalov(at)gmail(dot)com> writes:

> Hubert recently posted his thoughts on this topic:
> http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
>
> I've encountered this problem several times in web development, and every
> time found that the best (in terms of performance) solution is to use some
> pseudo-random approach (such as ">= random() limit 1", "limit 1 offset
> random()*N", or even pre-caching rows on the app side).

"ORDER BY random() LIMIT 1" should be faster in 8.3 due to the bounded-sort
optimization. It should be basically the same as the two options above as far
as how many comparisons are done and how much memory is used. It does have to
call random() for every record whereas the solutions above only call random()
once.

But I think all of these are basically the same to a first degree
approximation. They all have to do a scan of all the records being considered.
If you want something faster you need a solution which can use an index to
scan only the target record. There are ways of doing that but they require
some application knowledge.
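One sketch of such an index-using approach, assuming a hypothetical table `tab` with a dense serial primary key `id`:

```sql
-- Probe the primary-key index at a random point. Gaps in id bias the
-- result toward rows following large gaps -- this is the "application
-- knowledge" required: you must know how dense the keys are.
SELECT *
FROM tab
WHERE id >= (SELECT floor(random() * max(id))::int + 1 FROM tab)
ORDER BY id
LIMIT 1;
-- If the chosen id falls into a gap at the top end and no row comes
-- back, the application should retry or fall back to ORDER BY id LIMIT 1.
```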

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: nikolay(at)samokhvalov(dot)com, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mal advice in FAQ 4.1.
Date: 2007-10-09 16:40:23
Message-ID: 162867790710090940wdf3d1dv7cc03e9d3a7483f7@mail.gmail.com
Lists: pgsql-hackers

2007/10/9, Gregory Stark <stark(at)enterprisedb(dot)com>:
> "Nikolay Samokhvalov" <samokhvalov(at)gmail(dot)com> writes:
>
> > Hubert recently posted his thoughts on this topic:
> > http://www.depesz.com/index.php/2007/09/16/my-thoughts-on-getting-random-row/
> >
> > I've encountered this problem several times in web development, and every
> > time found that the best (in terms of performance) solution is to use some
> > pseudo-random approach (such as ">= random() limit 1", "limit 1 offset
> > random()*N", or even pre-caching rows on the app side).
>
> "ORDER BY random() LIMIT 1" should be faster in 8.3 due to the bounded-sort
> optimization. It should be basically the same as the two options above as far
> as how many comparisons are done and how much memory is used. It does have to
> call random() for every record whereas the solutions above only call random()
> once.
>
> But I think all of these are basically the same to a first degree
> approximation. They all have to do a scan of all the records being considered.
> If you want something faster you need a solution which can use an index to
> scan only the target record. There are ways of doing that but they require
> some application knowledge.
>

It always needs a seq scan :(, and takes space in the buffer cache. A
solution based on a randomly generated PK is much faster. I work with one
customer who shows random products out of 10K products on every page of an
eShop, and he could not understand why ORDER BY random() LIMIT is a bad
trick, since this trick is in the PostgreSQL FAQ.

Pavel


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, nikolay(at)samokhvalov(dot)com, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mal advice in FAQ 4.1.
Date: 2007-10-09 19:30:59
Message-ID: 20071009193059.GE8062@svana.org
Lists: pgsql-hackers

On Tue, Oct 09, 2007 at 06:40:23PM +0200, Pavel Stehule wrote:
> It always needs a seq scan :(, and takes space in the buffer cache. A
> solution based on a randomly generated PK is much faster. I work with one
> customer who shows random products out of 10K products on every page of an
> eShop, and he could not understand why ORDER BY random() LIMIT is a bad
> trick, since this trick is in the PostgreSQL FAQ.

It's the only trick that works in all situations, though. There are
of course faster methods, but they require information about the
distribution of the values, the type, PKs, indexes, etc.

The SQL standard does have provisions for extracting samples from
tables, but they're not implemented.
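For reference, the standard clause in question is TABLESAMPLE (SQL:2003); it was not implemented in PostgreSQL at the time of writing, so this is only what the syntax would look like:

```sql
-- SQL:2003 sampling: BERNOULLI samples individual rows, SYSTEM samples
-- whole pages; the argument is the approximate percentage of rows kept.
SELECT col
FROM tab TABLESAMPLE BERNOULLI (1)
LIMIT 1;
```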

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, nikolay(at)samokhvalov(dot)com, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: mal advice in FAQ 4.1.
Date: 2007-10-09 19:37:50
Message-ID: 162867790710091237o7e8fb862i417d220c90292cf5@mail.gmail.com
Lists: pgsql-hackers

2007/10/9, Martijn van Oosterhout <kleptog(at)svana(dot)org>:
> On Tue, Oct 09, 2007 at 06:40:23PM +0200, Pavel Stehule wrote:
> > It always needs a seq scan :(, and takes space in the buffer cache. A
> > solution based on a randomly generated PK is much faster. I work with one
> > customer who shows random products out of 10K products on every page of an
> > eShop, and he could not understand why ORDER BY random() LIMIT is a bad
> > trick, since this trick is in the PostgreSQL FAQ.
>
> It's the only trick that works in all situations, though. There are
> of course faster methods, but they require information about the
> distribution of the values, the type, PKs, indexes, etc.
>
> The SQL standard does have provisions for extracting samples from
> tables, but they're not implemented.
>

I agree. I don't want to remove it from the FAQ. I would like to add a
note that sometimes it is necessary to find another trick.

Regards
Pavel