Re: blobs

From: Steve Holdoway <steve(dot)holdoway(at)firetrust(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: blobs
Date: 2007-01-31 21:46:06
Message-ID: 20070201104606.5ea9e44e.steve.holdoway@firetrust.com
Lists: pgsql-admin

I've got the enviable task of redesigning a MySQL-based project from scratch. We need a proper rdbms for this project, and I want to use PG 8.2.

The table I'm concerned with at the moment has (currently) 5 million rows, with a churn of about 300,000 rows a week. The table has about a million hits a day, which makes it the main potential bottleneck in this database.

We need to store some large ( 0 -> 100kB ) data with each row. Would you recommend adding it as columns in this table, given that blobs will be stored in the pg_largeobject table anyway, or would you recommend a daughter table for this?

Any other suggestions on how to avoid performance problems with this table? ( Hardware is dual Xeon, 4GB mem, 2 hardware RAID channels for storage + 1 for logs, all running 32-bit Debian. )

Cheers,

Steve


From: "Phillip Smith" <phillip(dot)smith(at)weatherbeeta(dot)com(dot)au>
To: "'Steve Holdoway'" <steve(dot)holdoway(at)firetrust(dot)com>, <pgsql-admin(at)postgresql(dot)org>
Subject: Re: blobs
Date: 2007-01-31 22:25:08
Message-ID: 007201c74586$aa596cf0$9b0014ac@wbaus090
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

I don't know about your table question, but may I ask why you're running
32-bit when you have dual Xeon processors?

I have dual Xeons in my DWH, and I used to run 32-bit, which I upgraded to
64-bit over Christmas. We run a nightly import to that database which used
to take around 5 hours and now completes in less than 1 hour.

Many of our large queries also run much faster - the only thing I did was
reload the box with RedHat ES 4 64-bit instead of 32-bit.

My 2.2 cents (Aust. GST inclusive!)

Cheers,
-p

-----Original Message-----
From: pgsql-admin-owner(at)postgresql(dot)org
[mailto:pgsql-admin-owner(at)postgresql(dot)org] On Behalf Of Steve Holdoway
Sent: Thursday, 1 February 2007 08:46
To: pgsql-admin(at)postgresql(dot)org
Subject: [ADMIN] blobs

I'm got the enviable task of redesigning an MySQL based project from
scratch. We need a proper rdbms for this project, and I want to use PG 8.2.

The table I'm concerned with at the moment have (currently) 5 million rows,
with a churn of about 300,000 rows a week. The table has about a million
hits a day, which makes it the main potential bottleneck in this database.

We need to store some large ( 0 -> 100kB ) data with each row. Would you
recommend adding it as columns in this table, given that blobs will be
stored in the pg_largeobject table anyway, or would you recommend a daughter
table for this?

Any other suggestions on how to avoid performance problems with this table (
hardware is dual Xeon, 4GB mem, 2 hardware raid channels for storage + 1 for
logs, all running debian 32 bit ).

Cheers,

Steve



From: Steve Holdoway <steve(dot)holdoway(at)firetrust(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: blobs
Date: 2007-01-31 23:33:24
Message-ID: 20070201123324.c69e73f0.steve.holdoway@firetrust.com
Lists: pgsql-admin

Availability of hardware monitoring software, and my personally being sick of things falling over. I have to run MySQL 4.0 on this server at the moment, and wasn't prepared to take the risk (:

Maybe by the time we implement, 64 bit will be reliable enough.

Steve

On Thu, 1 Feb 2007 09:25:08 +1100
"Phillip Smith" <phillip(dot)smith(at)weatherbeeta(dot)com(dot)au> wrote:

> I don't know about your table question, but may I ask why you're running
> 32-bit when you have dual Xeon processors?
>
> I have dual Xeon's in my DWH, and I used to run 32-bit which I upgraded to
> 64-bit over Christmas. We run a nightly import to that database which used
> to take around 5 hours which now completes in less than 1 hour.
>
> Many of our large queries also run much faster - the only thing I did was
> reload the box with RedHat ES 4 64-bit instead of 32-bit.
>
> My 2.2 cents (Aust. GST inclusive!)
>
> Cheers,
> -p
>
> -----Original Message-----
> From: pgsql-admin-owner(at)postgresql(dot)org
> [mailto:pgsql-admin-owner(at)postgresql(dot)org] On Behalf Of Steve Holdoway
> Sent: Thursday, 1 February 2007 08:46
> To: pgsql-admin(at)postgresql(dot)org
> Subject: [ADMIN] blobs
>
> I'm got the enviable task of redesigning an MySQL based project from
> scratch. We need a proper rdbms for this project, and I want to use PG 8.2.
>
> The table I'm concerned with at the moment have (currently) 5 million rows,
> with a churn of about 300,000 rows a week. The table has about a million
> hits a day, which makes it the main potential bottleneck in this database.
>
> We need to store some large ( 0 -> 100kB ) data with each row. Would you
> recommend adding it as columns in this table, given that blobs will be
> stored in the pg_largeobject table anyway, or would you recommend a daughter
> table for this?
>
> Any other suggestions on how to avoid performance problems with this table (
> hardware is dual Xeon, 4GB mem, 2 hardware raid channels for storage + 1 for
> logs, all running debian 32 bit ).
>
> Cheers,
>
> Steve
>


From: "Chad Wagner" <chad(dot)wagner(at)gmail(dot)com>
To: "Steve Holdoway" <steve(dot)holdoway(at)firetrust(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: blobs
Date: 2007-01-31 23:45:15
Message-ID: 81961ff50701311545h50e457c4scb108037d6cf5c16@mail.gmail.com
Lists: pgsql-admin

On 1/31/07, Steve Holdoway <steve(dot)holdoway(at)firetrust(dot)com> wrote:
>
> The table I'm concerned with at the moment have (currently) 5 million
> rows, with a churn of about 300,000 rows a week. The table has about a
> million hits a day, which makes it the main potential bottleneck in this
> database.

Why would it be a "bottleneck"? If you are updating or deleting 300K a
week, definitely make sure you take a look at autovacuum and turn it ON.
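
On 8.2 autovacuum is not on by default and it needs the row-level stats
collector, so something along these lines in postgresql.conf would be a
starting point (a sketch only -- the threshold values below are examples,
tune them to your actual churn):

    stats_start_collector = on       # prerequisite for autovacuum on 8.2
    stats_row_level = on             # prerequisite for autovacuum on 8.2
    autovacuum = on
    autovacuum_naptime = 60          # seconds between autovacuum checks
    autovacuum_vacuum_scale_factor = 0.2   # vacuum after ~20% of a table has changed

A nightly VACUUM ANALYZE from cron would cover much the same ground if you
prefer not to rely on autovacuum.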

> We need to store some large ( 0 -> 100kB ) data with each row. Would you
> recommend adding it as columns in this table, given that blobs will be
> stored in the pg_largeobject table anyway, or would you recommend a daughter
> table for this?

Depends on how you are querying the table. This is really a database
modeling question, and leads into many many more questions. I would say if
you're frequently range scanning the table (selecting several rows) and in
those cases you rarely need the "blob", then I would fork it off into a
child table. If the "blob" is rarely accessed, and only accessed directly,
then definitely a child table in my book.

The reason is if you're frequently fetching rows from this table and rarely
using the blob, all you are doing is consuming memory that could be better
used for other things, and spinning I/O when it is not necessary.
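
As a rough sketch of what I mean (the table and column names here are just
made up for illustration):

    create table messages (
        id      bigint primary key,
        status  integer not null,
        updated timestamp not null default now()
        -- ...the columns the million-hits-a-day queries actually need...
    );

    create table message_payloads (
        message_id bigint primary key references messages (id) on delete cascade,
        payload    bytea not null      -- the 0 -> 100kB blob
    );

    -- the hot queries never touch the payload:
    select id, status from messages where status = 1;

    -- and the blob is pulled only when it is actually wanted:
    select payload from message_payloads where message_id = 42;

That way the busy table stays narrow, and the wide data is only read when you
join to it or fetch it explicitly.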

> Any other suggestions on how to avoid performance problems with this table (
> hardware is dual Xeon, 4GB mem, 2 hardware raid channels for storage + 1 for
> logs, all running debian 32 bit ).
>

It really has to do with how you design your schema.

--
Chad
http://www.postgresqlforums.com/


From: Steve Holdoway <steve(dot)holdoway(at)firetrust(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: blobs
Date: 2007-02-01 18:39:37
Message-ID: 20070202073937.b80f6ccc.steve.holdoway@firetrust.com
Lists: pgsql-admin

On Wed, 31 Jan 2007 18:45:15 -0500
"Chad Wagner" <chad(dot)wagner(at)gmail(dot)com> wrote:

> On 1/31/07, Steve Holdoway <steve(dot)holdoway(at)firetrust(dot)com> wrote:
> >
> > The table I'm concerned with at the moment have (currently) 5 million
> > rows, with a churn of about 300,000 rows a week. The table has about a
> > million hits a day, which makes it the main potential bottleneck in this
> > database.
>
>
> Why would it be a "bottleneck"? If you are updating or deleting 300K a
> week, definitely make sure you take a look at autovacuum and turn it ON.
It is a potential bottleneck as I said. I am replacing between 5 and 10 percent of the data in the table every week. There are, in addition, a daily total of about a million hits ( read/update/delete ), as I said.

This is the busiest table in the database. Therefore, it's the most likely candidate to cause performance problems. That's what I call a bottleneck.
>
>
> > We need to store some large ( 0 -> 100kB ) data with each row. Would you
> > recommend adding it as columns in this table, given that blobs will be
> > stored in the pg_largeobject table anyway, or would you recommend a daughter
> > table for this?
>
>
> Depends on how you are querying the table. This is really a database
> modeling question, and leads into many many more questions. I would say if
> your frequently range scanning the table (selecting several rows) and in
> those cases you rarely need the "blob", then I would fork it off into a
> child table. If the "blob" is rarely accessed, and only accessed directly,
> then definitely a child table in my book.
>
> The reason is if your frequently fetching rows from this table and rarely
> using the blob all you are doing is consuming memory that could be better
> used for other things, and spinning I/O when it is not necessary.
Sorry, you're completely wrong. If you consider that the only way of getting info is select *, then this is true. Personally, I think that anyone who does this in code is plain lazy and should find a job more suited to them (:

I am asking for input from those who have been in this situation before, and have experience in the tradeoff of running a separate table for the big stuff as against the extra ( 8 byte? ) column that would be added to the master table. Why am I asking this? Because Postgres has a unique way of handling this kind of data, different even from the last time I used Postgres in anger - 7.4. It's different from every other rdbms ( and MySQL ), and I have no practical experience of it in the wild.
>
>
> > Any other suggestions on how to avoid performance problems with this table (
> > hardware is dual Xeon, 4GB mem, 2 hardware raid channels for storage + 1 for
> > logs, all running debian 32 bit ).
> >
>
> It really has to do with how you design your schema.
No. It really has to do with gathering the relevant information so that I can design my schema properly. Academically, 3NF may be the solution to every design, but in the real world there are many other things to take into account. I've monitored the data flows, sized the hardware to handle the IO, and can either spend a week or two benchmarking different solutions from cold, or I can take the advice of those with relevant experience ( who I expect to find on this list ) to point me in the right direction first.
>
>
> --
> Chad
> http://www.postgresqlforums.com/
>
Steve.

PS. Please fix your headers so replies to your mails appear on the list directly.


From: "Chad Wagner" <chad(dot)wagner(at)gmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Cc: "Steve Holdoway" <steve(dot)holdoway(at)firetrust(dot)com>
Subject: Re: blobs
Date: 2007-02-02 04:04:57
Message-ID: 81961ff50702012004m3aeff4d9k2193917682314fb@mail.gmail.com
Lists: pgsql-admin

On 2/1/07, Steve Holdoway <steve(dot)holdoway(at)firetrust(dot)com> wrote:
>
> > Why would it be a "bottleneck"? If you are updating or deleting 300K a
> > week, definitely make sure you take a look at autovacuum and turn it ON.
> It is a potential bottleneck as I said. I am replacing between 5 and 10
> percent of the data in the table every week. There are, in addition, a daily
> total of about a million hits ( read/update/delete ), as I said.
>
> This is the busiest table in the database. Therefore, it's the most likely
> candidate to cause performance problems. That's what I call a bottleneck.

The reason I am asking why you think it is a bottleneck is that many
MySQL users see 'reading' a table as a bottleneck, and this is usually
because they are using MyISAM tables. With MyISAM's table-level locking,
writers block readers and readers block writers. This is NOT the case
with PostgreSQL: readers never block other readers, writers never block
readers, and readers never block writers. PostgreSQL does this using
multi-versioning (MVCC), similar to how Oracle works.

> > Depends on how you are querying the table. This is really a database
> > modeling question, and leads into many many more questions. I would say if
> > you're frequently range scanning the table (selecting several rows) and in
> > those cases you rarely need the "blob", then I would fork it off into a
> > child table. If the "blob" is rarely accessed, and only accessed directly,
> > then definitely a child table in my book.
> >
> > The reason is if you're frequently fetching rows from this table and rarely
> > using the blob, all you are doing is consuming memory that could be better
> > used for other things, and spinning I/O when it is not necessary.
>
> Sorry, you're completely wrong. If you consider that the only way of
> getting info is select *, then this is true. Personally, I think that anyone
> who does this in code is plain lazy and should find a job more suited to
> them (:

Okay, I would love to see your test case. My statement is based on the
suggestion that you were embedding your data in the same row as a bytea
column. Here is a test case that I ran:

NOTE: Fill factor is set to 10% to exaggerate a wide row; this also
demonstrates that EMPTY space in a table affects performance.

drop table foo;
create table foo (x integer not null primary key, y text not null)
    with (fillfactor = 10);
insert into foo (x, y)
    select generate_series(1,100000), repeat('1234567890', 10240);
checkpoint;
analyze foo;

select relpages, reltuples, pg_size_pretty(pg_relation_size(relname))
    from pg_class where relname = 'foo';
 relpages | reltuples | pg_size_pretty
----------+-----------+----------------
   100000 |    100000 | 781 MB
(1 row)

-- fetch just the integer column
explain analyze select x from foo;
                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..101000.00 rows=100000 width=4) (actual time=10.389..19288.848 rows=100000 loops=1)
 Total runtime: 19700.804 ms
(2 rows)

-- fetch just the text column
explain analyze select y from foo;
                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..101000.00 rows=100000 width=32) (actual time=9.234..19863.485 rows=100000 loops=1)
 Total runtime: 20290.618 ms
(2 rows)

-- fetch both the integer and text column
explain analyze select * from foo;
                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..101000.00 rows=100000 width=36) (actual time=40.478..20470.648 rows=100000 loops=1)
 Total runtime: 20904.611 ms
(2 rows)

-- Runtime is the same for all three cases against the "wide" table. This
-- disproves the theory that selecting or not selecting a "column" has
-- something to do with performance. I would think logically it does when you
-- are talking about sorting and merging data sets, not when we are talking
-- about physical reads.
-- Now let's separate the "large" text column from the rest of the data.

drop table bar;
create table bar (x integer not null primary key);
insert into bar (x) select generate_series(1,100000);
checkpoint;
analyze bar;

select relpages, reltuples, pg_size_pretty(pg_relation_size(relname))
    from pg_class where relname = 'bar';
 relpages | reltuples | pg_size_pretty
----------+-----------+----------------
      441 |    100000 | 3528 kB
(1 row)

-- fetch from the "skinny" table
explain analyze select * from bar;
                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on bar  (cost=0.00..1441.00 rows=100000 width=4) (actual time=19.552..416.575 rows=100000 loops=1)
 Total runtime: 790.901 ms
(2 rows)

-- Notice how much faster it is; this seems to suggest that a table with a
-- narrower row width is faster to scan.

I think this case proves the point that a "wide" blob, if not accessed
frequently, shouldn't be included in the core table. Naturally this makes no
sense if you are going to fetch the blob and use it with every single fetch
against the table.

> I am asking for input from those who have been in this situation before, and
> have experience in the tradeoff of running a separate table for the big
> stuff as against the extra ( 8 byte? ) column that would be added to the
> master table. Why am I asking this? Because Postgres has a unique way of
> handling this kind of data, different even from the last time I used Postgres
> in anger - 7.4. It's different from every other rdbms ( and MySQL ), and I
> have no practical experience of it in the wild.

I think you are referring to large objects, which are completely different
from what your original statement asked about. Your original statement seemed
to ask about including binary data in the table versus not in the table. My
understanding of large objects is that they are stored in chunks in the
pg_largeobject system table, and you store the oid of the large object in
your own table. Basically you have bytea (byte arrays), which are stored
inline with the rest of the row, and large objects, where you store a
reference to a separately stored object.
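
A rough sketch of the two shapes (the table names are made up for
illustration):

    -- bytea: the data lives in the row itself; large values are
    -- automatically moved out of line into the table's TOAST relation
    create table docs_bytea (
        id   integer primary key,
        body bytea not null
    );
    insert into docs_bytea values (1, decode('deadbeef', 'hex'));

    -- large object: the row stores only an oid pointing into pg_largeobject
    create table docs_lo (
        id   integer primary key,
        body oid not null
    );
    insert into docs_lo values (1, lo_creat(-1));
    -- contents are then written via lo_import() or the lo_open()/lowrite() API

That is my understanding of it, anyway -- worth checking against the 8.2
documentation before committing to either approach.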

> > It really has to do with how you design your schema.
> No. It really has to do with the gathering of relevant information so that
> I can design my schema properly. Academically, 3NF may be the solution to
> every design, but in the real world, there are many other things to take in
> to account. I've monitored the data flows, sized the hardware to handle the
> IO, and can either spend a week or two benchmarking different solutions from
> cold, or I can take the advice of those with relevant experience ( who I
> expect to find on this list ) to point me in the right direction first.
>

That is my point here: the schema should be designed properly. A poor
schema design can yield awful I/O performance, which is essentially my point
in my previous message.

Anyways, if you are typically this sarcastic then I would suggest you learn
either to phrase your responses better or not to respond at all. Frankly,
no one has to help you, and making statements like the ones above is rude to
everyone on this list.

I have plenty of experience building large distributed OLTP systems and
multi-terabyte data warehouses, so you can either take the advice or leave
it.


From: Steve Holdoway <steve(dot)holdoway(at)firetrust(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: blobs
Date: 2007-02-04 21:48:29
Message-ID: 20070205104829.1cea225a.steve.holdoway@firetrust.com
Lists: pgsql-admin

On Thu, 1 Feb 2007 23:04:57 -0500
"Chad Wagner" <chad(dot)wagner(at)gmail(dot)com> wrote:

> I have plenty of experience building large distributed OLTP systems and
> multi terabyte data warehouses, so you can either take the advice or leave
> it.
>
Me too, since Oracle 6, Informix and Ingres in the late 80's until today. What I don't have experience of, and what you omitted from this informative post, is Postgres 8.x's handling of blobs - the question I asked - rather than the effect of using text columns, which is a different question altogether.

Steve