Re: Get master-detail relationship metadata

From: Andy Colson <andy(at)squeakycode(dot)net>
To: Laszlo Nagy <gandalf(at)shopzeus(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, Daniel Fekete <danieleff(at)gmail(dot)com>
Subject: Re: Get master-detail relationship metadata
Date: 2011-02-03 15:26:26
Message-ID: 4D4AC922.8050904@squeakycode.net
Lists: pgsql-performance

On 2/3/2011 5:40 AM, Laszlo Nagy wrote:
>
> Hi All,
>
> I'm working on a client program that iterates over master-detail
> relationships in nested loops.
>
> Pseudo code:
>
> for row_1 in table_1:
>     table_2 = get_details(row_1, "table2")
>     for row_2 in table_2:
>         row_3 = get_details(row_2, "table3")
>         .... etc.
>         process_data(row_1, row_2, row_3, ....)
>
> My task is to write the "get_details" iterator efficiently. The obvious
> way to do it is to query the details in every get_details() call, but
> that is not efficient. We have relationships where one master has only
> a few details. For 1 million master rows, that would result in the
> execution of millions of SQL SELECT commands, degrading performance by
> orders of magnitude. My idea was that the iterator should pre-fetch and
> cache data for many master records at once. The get_details() would use
> the cached rows, thus reducing the number of SQL SELECT statements
> needed. I actually wrote the iterator, and it works fine in some cases.
> For example:
>
> producers = get_rows("producer")
> for producer in producers:
>     products = get_details(producer, "product")
>     for product in products:
>         prices = get_details(product, "prices")
>         for price in prices:
>             process_product_price(producer, product, price)
>
> This works fine if one producer has no more than 1000 products and one
> product has no more than 10 prices. I can easily keep 10,000 records in
> memory. The actual code executes about 15 SQL queries while iterating
> over 1 million rows. Compared to the original "obvious" method, that is
> roughly a 15-fold performance increase.
>
> But sometimes it just doesn't work. If a producer has 1 million
> products, and one product has 100 prices, then it won't work, because I
> cannot keep 100 million prices in memory. My program should somehow
> figure out how many rows it will get for one master, and choose between
> the cached and non-cached methods.
>
> So here is the question: is there a way to get this information from
> PostgreSQL itself? I know that the query plan contains information about
> this, but I'm not sure how to extract it. Should I run an EXPLAIN
> command of some kind and parse the result as a string? For example:
>
> EXPLAIN select * from product where producer_id=1008;
> QUERY PLAN
> ----------------------------------------------------------------------
> Seq Scan on product (cost=0.00..1018914.74 rows=4727498 width=1400)
> Filter: (producer_id = 1008)
> (2 rows)
>
>
> Then I could extract "rows=4727498" to get an idea of how many detail
> rows I'll get for the master.
>
> Is there any better way to do it? And how reliable is this?
>
>
> Thanks,
>
> Laszlo
>
>

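Your batched get_details() is a sound idea when the fan-out per master is
small. For reference, the prefetch pattern might look something like this
(an untested sketch; it assumes psycopg2, and the names producer, product,
id and producer_id are guesses where your post didn't spell them out):

import psycopg2
from collections import defaultdict

BATCH = 1000  # master ids to prefetch per query; a tuning knob

def prefetch_details(cur, master_ids, detail_table, fk_column):
    # One SELECT covers a whole batch of masters; rows are grouped by
    # foreign key so that get_details() becomes a dict lookup.
    # detail_table and fk_column are pasted into the SQL text, so they
    # must be trusted identifiers, never user input.
    sql = "SELECT * FROM %s WHERE %s = ANY(%%s)" % (detail_table, fk_column)
    cur.execute(sql, (list(master_ids),))
    fk_idx = [d[0] for d in cur.description].index(fk_column)
    cache = defaultdict(list)
    for row in cur.fetchall():
        cache[row[fk_idx]].append(row)
    return cache

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()
cur.execute("SELECT id FROM producer ORDER BY id")
producer_ids = [r[0] for r in cur.fetchall()]
for i in range(0, len(producer_ids), BATCH):
    batch = producer_ids[i:i + BATCH]
    products = prefetch_details(cur, batch, "product", "producer_id")
    for pid in batch:
        for product in products.get(pid, []):
            pass  # prefetch prices the same way, then process

As you say, though, it falls over when a single master has millions of
details.
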
One way would be to join the master to the detail, and write your code
expecting duplicates.

q = get_rows("select * from product inner join price ... "
             "order by productid, priceid")

lastprodid = None
for x in q:
    prodid = x.prodid           # the current row, not the result set
    if prodid != lastprodid:
        # we saw the last row of the previous product; move to the next
        lastprodid = prodid

    ... etc
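
Fleshed out a little (again an untested sketch; the price column name and
the connection string are guesses), a named psycopg2 cursor keeps the
joined result set server-side so the client never holds it all in memory,
and itertools.groupby does the boundary detection:

import psycopg2
from itertools import groupby
from operator import itemgetter

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor("joined")  # a named cursor is server-side, streams rows
cur.execute("""
    SELECT p.productid, pr.priceid, pr.price
    FROM product p
    INNER JOIN price pr ON pr.productid = p.productid
    ORDER BY p.productid, pr.priceid
""")
# groupby depends on the ORDER BY: all rows for one product are adjacent
for productid, price_rows in groupby(cur, key=itemgetter(0)):
    for (pid, priceid, price) in price_rows:
        pass  # process one (product, price) pair here

The named cursor fetches rows in chunks (see cursor.itersize), so memory
use stays flat no matter how big one product's group gets.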

> Is there any better way to do it? And how reliable is this?

It makes the SQL really easy, but the code complex... so pick your poison.
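
As for getting the planner's estimate programmatically: parsing the text
output works, but on 9.0 and later EXPLAIN (FORMAT JSON) is less fragile
to consume. A minimal sketch with psycopg2 (connection string
hypothetical):

import json
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute("EXPLAIN (FORMAT JSON) SELECT * FROM product "
            "WHERE producer_id = %s", (1008,))
# EXPLAIN returns a single text row holding the whole JSON document
plan = json.loads(cur.fetchone()[0])
estimated_rows = plan[0]["Plan"]["Plan Rows"]

How reliable is it? Only as good as the statistics: keep the tables
ANALYZEd, and treat the number as a hint for picking a strategy (cache
vs. no cache), not as an exact count.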

-Andy
