Storing blobs in PG DB

From: Nikolay Moskvichev <mnv(at)papillon(dot)ru>
To: pgsql-general(at)postgresql(dot)org
Subject: Storing blobs in PG DB
Date: 2007-04-04 05:44:57
Message-ID: 1175665498.977711@jet.ru
Lists: pgsql-general

Hi All!

The question is: how suitable is PG for storing about 2,000,000 binary
files of 0.5-2.0 MB each? A large number of clients or frequent updates
is not planned. It is like a photo album on a local host.


From: Listmail <lists(at)peufeu(dot)com>
To: "Nikolay Moskvichev" <mnv(at)papillon(dot)ru>, pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-04 20:25:56
Message-ID: op.tp90dim0zcizji@apollo13
Lists: pgsql-general


You don't like filesystems?

On Wed, 04 Apr 2007 07:44:57 +0200, Nikolay Moskvichev <mnv(at)papillon(dot)ru>
wrote:

> Hi All!
>
> The question is: how suitable is PG for storing about 2,000,000 binary
> files of 0.5-2.0 MB each? A large number of clients or frequent updates
> is not planned. It is like a photo album on a local host.
>


From: Nikolay Moskvichev <mnv(at)papillon(dot)ru>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-05 02:25:15
Message-ID: 1175739916.373909@jet.ru
Lists: pgsql-general

>
> You don't like filesystems?
>
Do you know a file system which supports SQL, referential integrity, and
managed transactions?


From: "Postgres User" <postgres(dot)developer(at)gmail(dot)com>
To: "Nikolay Moskvichev" <mnv(at)papillon(dot)ru>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-05 03:38:19
Message-ID: b88c3460704042038wec806dam1f73715ed11e95e5@mail.gmail.com
Lists: pgsql-general

I recently heard from Josh Berkus that at least one major CMS
application uses Postgres to store entire HTML pages (including image
files) in order to support full versioning.

As a general rule, I prefer not to store BLOBs in a DB; I'd rather
leave the BLOB in the file system and let the db save only a pointer
to it. However, if you need to store BLOBs in a database, Postgres is
one of the better platforms for it.
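
For illustration, a minimal sketch of the two approaches (table and
column names are made up):

-- pointer approach: the bytes stay on disk, the DB keeps metadata + path
CREATE TABLE photo (
    id        serial PRIMARY KEY,
    title     text NOT NULL,
    file_path text NOT NULL   -- e.g. '/data/photos/1234.jpg'
);

-- in-database approach: the bytes go into a bytea column; large values
-- are TOASTed (compressed and stored out of line) automatically
CREATE TABLE photo_blob (
    id    serial PRIMARY KEY,
    title text NOT NULL,
    data  bytea NOT NULL
);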

On 4/3/07, Nikolay Moskvichev <mnv(at)papillon(dot)ru> wrote:
> Hi All!
>
> The question is: how suitable is PG for storing about 2,000,000 binary
> files of 0.5-2.0 MB each? A large number of clients or frequent updates
> is not planned. It is like a photo album on a local host.
>


From: Nikolay Moskvichev <mnv(at)papillon(dot)ru>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-05 04:25:59
Message-ID: 1175747163.410425@jet.ru
Lists: pgsql-general

Postgres User wrote:
> I recently heard from Josh Berkus that at least one major CMS
> application uses Postgres to store entire HTML pages (including image
> files) in order to support full versioning.
>
> As a general rule, I prefer not to store BLOBs in a DB; I'd rather
> leave the BLOB in the file system and let the db save only a pointer

Yes, the current solution uses a similar approach.

> to it. However, if you need to store BLOBs in a database, Postgres is
> one of the better platforms for it.
>
Thanks


From: Naz Gassiep <naz(at)mira(dot)net>
To: Nikolay Moskvichev <mnv(at)papillon(dot)ru>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-05 17:41:25
Message-ID: 461534C5.1040705@mira.net
Lists: pgsql-general

This doesn't answer your question, but I thought I'd throw my opinion in
anyway.

My personal view is that in general, binary files have no place in
databases. Filesystems are for files, databases are for data. My design
choice is to store the files in a filesystem and use the database to hold
metadata as well as a pointer to the file.

If you *must* put files into the database, then you can do so, and PG
will handle that many files of those sizes with ease. For all intents
and purposes, PG can store an unlimited number of files. You're far more
likely to run into walls in the form of limitations in your disk I/O
system than limitations in what PG will handle.

- Naz.

Nikolay Moskvichev wrote:
> Hi All!
>
> The question is: how suitable is PG for storing about 2,000,000 binary
> files of 0.5-2.0 MB each? A large number of clients or frequent updates
> is not planned. It is like a photo album on a local host.
>


From: Listmail <lists(at)peufeu(dot)com>
To: "Naz Gassiep" <naz(at)mira(dot)net>, "Nikolay Moskvichev" <mnv(at)papillon(dot)ru>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-05 21:08:21
Message-ID: op.tqbwz7hmzcizji@apollo13
Lists: pgsql-general


> My personal view is that in general, binary files have no place in
> databases. Filesystems are for files, databases are for data. My design
> choice is to store the files in a filesystem and use the database to hold
> metadata as well as a pointer to the file.
>
> If you *must* put files into the database, then you can do so, and PG
> will handle that many files of those sizes with ease. For all intents
> and purposes, PG can store an unlimited number of files. You're far more
> likely to run into walls in the form of limitations in your disk I/O
> system than limitations in what PG will handle.

And you can't back up with rsync...


From: "Merlin Moncure" <mmoncure(at)gmail(dot)com>
To: Listmail <lists(at)peufeu(dot)com>
Cc: "Naz Gassiep" <naz(at)mira(dot)net>, "Nikolay Moskvichev" <mnv(at)papillon(dot)ru>, pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-05 21:24:06
Message-ID: b42b73150704051424w2ccda73evb9f7864f9f4a7e6d@mail.gmail.com
Lists: pgsql-general

On 4/5/07, Listmail <lists(at)peufeu(dot)com> wrote:
>
> > My personal view is that in general, binary files have no place in
> > databases. Filesystems are for files, databases are for data. My design
> > choice is to store the files in a filesystem and use the database to hold
> > metadata as well as a pointer to the file.
> >
> > If you *must* put files into the database, then you can do so, and PG
> > will handle that many files of those sizes with ease. For all intents
> > and purposes, PG can store an unlimited number of files. You're far more
> > likely to run into walls in the form of limitations in your disk I/O
> > system than limitations in what PG will handle.
>
> And you can't backup with rsync...

no, but you can do incrementals with PITR, which is just as good as (if
not better than) rsync, because you are backing up your database
'indexer' and binaries in one swoop... so the backup argument doesn't
fly, imo.
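
For example, a minimal WAL-archiving setup looks roughly like this (the
archive path is made up):

# postgresql.conf
archive_command = 'cp %p /mnt/backup/wal_archive/%f'

-- take one base backup; afterwards only WAL segments need shipping
SELECT pg_start_backup('weekly_base');
-- ... copy the $PGDATA directory with tar or rsync ...
SELECT pg_stop_backup();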

imo, sql is a richer language for storing and extracting any type of
data, binaries included, than hierarchical filesystem-style
organization. I think most reasons not to store binaries in the
database boil down to performance.

merlin


From: Thomas Kellerer <spam_eater(at)gmx(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-05 21:43:11
Message-ID: ev3qhf$nlh$1@sea.gmane.org
Lists: pgsql-general

Merlin Moncure wrote on 05.04.2007 23:24:
> I think most reasons not to store binaries in the
> database boil down to performance.

Having implemented an application where the files were stored in the
filesystem instead of the database, I have to say that, based on my
experience, I would store the files in the DB next time. Once the number of
files in a directory exceeds a certain limit, that directory becomes very
hard to handle.

Things like "dir" or "ls", or listing the contents through an FTP
connection, become extremely slow (on HP/UX as well as Windows).

And you have to back up only _one_ source (the database), not two. Moving
the data around from system A to system B (e.g. staging (Windows) ->
production (HP/UX)) is a lot easier when you can simply back up and restore
the database (in our case it was an Oracle database, but this would be the
same for PG).
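
For example, with PG that is just a dump and restore (database name made
up):

pg_dump -Fc -f app.dump appdb    # tables, indexes and bytea columns in one file
pg_restore -d appdb app.dump     # restore on the target system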

Thomas


From: William Garrison <postgres(at)mobydisk(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-05 22:22:52
Message-ID: 461576BC.7040400@mobydisk.com
Lists: pgsql-general

I have actually never stored data in the database. But in a recent
project I've realized it might have been smart. We store terabytes of
data on the file system, and many times I would love to have an ACID
compliant file system. For example, if I delete an entry, I need to
delete it from disk and from the database. How can I be sure that was
transactional? Or if I append data to the file and then update the
database, what then? I wind up writing "tricky" code that renames a file,
updates the DB, and renames it back if there is an error, in an attempt to
fake out atomicity and transactions.
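
With the bytes in the database, the same operation would be one plain
transaction (table name hypothetical):

BEGIN;
DELETE FROM document WHERE id = 42;  -- metadata and blob column go together
COMMIT;                              -- or ROLLBACK, and nothing happened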

Of course, I may have come up with even more issues if the company had put
this data into a SQL server. Who knows.

Where exactly does PostgreSQL put large blobs? Does it ensure ACID
compliance if I add a 2GB blob in a column?

Merlin Moncure wrote:
> On 4/5/07, Listmail <lists(at)peufeu(dot)com> wrote:
>>
>> > My personal view is that in general, binary files have no place in
>> > databases. Filesystems are for files, databases are for data. My design
>> > choice is to store the files in a filesystem and use the database to
>> > hold metadata as well as a pointer to the file.
>> >
>> > If you *must* put files into the database, then you can do so, and PG
>> > will handle that many files of those sizes with ease. For all intents
>> > and purposes, PG can store an unlimited number of files. You're far more
>> > likely to run into walls in the form of limitations in your disk I/O
>> > system than limitations in what PG will handle.
>>
>> And you can't back up with rsync...
>
> no, but you can do incrementals with PITR, which is just as good as (if
> not better than) rsync, because you are backing up your database
> 'indexer' and binaries in one swoop... so the backup argument doesn't
> fly, imo.
>
> imo, sql is a richer language for storing and extracting any type of
> data, binaries included, than hierarchical filesystem-style
> organization. I think most reasons not to store binaries in the
> database boil down to performance.
>
> merlin
>


From: Listmail <lists(at)peufeu(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Sort and Limit - really nasty query and feature of the day
Date: 2007-04-05 22:30:29
Message-ID: op.tqb0s3sbzcizji@apollo13
Lists: pgsql-general


Today I rewrote a particularly nasty query involving a UNION ALL between
an active table and a huge archive table, some left joins, and an ORDER BY
with LIMIT; it went from 5 minutes to under one second. However, one query
became four, with some glue in between.

EXPLAIN
SELECT * FROM (
SELECT 0 AS archived, id, price, surface, coords, detect_time, type_id,
vente, zipcode, city_id, description FROM annonces
UNION ALL
SELECT 1 AS archived, a.id, price, surface, coords, detect_time, type_id,
vente, zipcode, city_id, description FROM archive_data a LEFT JOIN
archive_ext d ON a.id=d.id ) AS foo
WHERE
detect_time >= '2006-10-30 16:17:45.064793'
AND type_id IN
(1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6)
AND vente
AND (zipcode IN (69001,69002,69003,69004,69005,69006,69007,69008,69009)
OR city_id IN (27595) OR coords &&
'(45.74101689082,4.8371263505564),(45.75898310918,4.8628736494436)'::BOX)
AND surface IS NOT NULL AND price IS NOT NULL
ORDER BY price/surface
LIMIT 100;

Here is the messy explain:

Limit (cost=333560.35..333560.60 rows=100 width=103)
-> Sort (cost=333560.35..333656.88 rows=38610 width=103)
Sort Key: (foo.price / (foo.surface)::double precision)
-> Result (cost=133.21..328438.41 rows=38610 width=103)
-> Append (cost=133.21..328245.36 rows=38610 width=103)
-> Bitmap Heap Scan on annonces
(cost=133.21..7520.56 rows=1426 width=190)
Recheck Cond: ((vente AND (zipcode = ANY
('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[])))
OR (vente AND (city_id = 27595)) OR (coords &&
'(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box))
Filter: ((detect_time >= '2006-10-30
16:17:45.064793'::timestamp without time zone) AND (type_id = ANY
('{1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6}'::integer[]))
AND vente AND (surface IS NOT NULL) AND (price IS NOT NULL))
-> BitmapOr (cost=133.21..133.21 rows=4294
width=0)
-> Bitmap Index Scan on annonces_zip
(cost=0.00..55.91 rows=1761 width=0)
Index Cond: ((vente = true) AND
(zipcode = ANY
('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[])))
-> Bitmap Index Scan on annonces_city
(cost=0.00..42.85 rows=1859 width=0)
Index Cond: ((vente = true) AND
(city_id = 27595))
-> Bitmap Index Scan on annonces_coords
(cost=0.00..33.37 rows=675 width=0)
Index Cond: (coords &&
'(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box)
-> Merge Right Join (cost=59679.03..320338.70
rows=37184 width=182)
Merge Cond: (d.id = a.id)
-> Index Scan using archive_ext_pkey on
archive_ext d (cost=0.00..252661.12 rows=2976314 width=119)
-> Sort (cost=59679.03..59771.99 rows=37184
width=67)
Sort Key: a.id
-> Bitmap Heap Scan on archive_data a
(cost=3951.02..56856.32 rows=37184 width=67)
Recheck Cond: ((vente AND (zipcode
= ANY
('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[])))
OR (vente AND (city_id = 27595)) OR (coords &&
'(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box))
Filter: ((detect_time >=
'2006-10-30 16:17:45.064793'::timestamp without time zone) AND (type_id =
ANY
('{1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6}'::integer[]))
AND vente AND (surface IS NOT NULL) AND (price IS NOT NULL))
-> BitmapOr
(cost=3951.02..3951.02 rows=171699 width=0)
-> Bitmap Index Scan on
archive_data_zip (cost=0.00..1692.62 rows=80610 width=0)
Index Cond: ((vente =
true) AND (zipcode = ANY
('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[])))
-> Bitmap Index Scan on
archive_data_city (cost=0.00..1695.31 rows=80683 width=0)
Index Cond: ((vente =
true) AND (city_id = 27595))
-> Bitmap Index Scan on
archive_data_coords (cost=0.00..535.20 rows=10406 width=0)
Index Cond: (coords &&
'(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box)

I didn't redo the EXPLAIN ANALYZE since it takes too long, but the stats
and count estimates are pretty good, and it takes a good 5 minutes.

However, the interesting parts of the query are very fast. Let's
disassemble it:

EXPLAIN ANALYZE SELECT 0 AS archived, id, price, surface, coords,
detect_time, type_id, vente, zipcode, city_id, description FROM annonces
WHERE
detect_time >= '2006-10-30 16:17:45.064793'
AND type_id IN
(1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6)
AND vente
AND (zipcode IN (69001,69002,69003,69004,69005,69006,69007,69008,69009) OR
city_id IN (27595) OR coords &&
'(45.74101689082,4.8371263505564),(45.75898310918,4.8628736494436)'::BOX)
AND surface IS NOT NULL AND price IS NOT NULL
ORDER BY price/surface
LIMIT 100;
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=7602.40..7602.65 rows=100 width=190) (actual
time=19.102..19.163 rows=100 loops=1)
-> Sort (cost=7602.40..7605.96 rows=1426 width=190) (actual
time=19.100..19.123 rows=100 loops=1)
Sort Key: (price / (surface)::double precision)
-> Bitmap Heap Scan on annonces (cost=133.21..7527.69 rows=1426
width=190) (actual time=4.255..16.725 rows=910 loops=1)
Recheck Cond: ((vente AND (zipcode = ANY
('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[])))
OR (vente AND (city_id = 27595)) OR (coords &&
'(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box))
Filter: ((detect_time >= '2006-10-30
16:17:45.064793'::timestamp without time zone) AND (type_id = ANY
('{1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6}'::integer[]))
AND vente AND (surface IS NOT NULL) AND (price IS NOT NULL))
-> BitmapOr (cost=133.21..133.21 rows=4294 width=0)
(actual time=2.662..2.662 rows=0 loops=1)
-> Bitmap Index Scan on annonces_zip
(cost=0.00..55.91 rows=1761 width=0) (actual time=0.518..0.518 rows=1367
loops=1)
Index Cond: ((vente = true) AND (zipcode = ANY
('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[])))
-> Bitmap Index Scan on annonces_city
(cost=0.00..42.85 rows=1859 width=0) (actual time=0.364..0.364 rows=1316
loops=1)
Index Cond: ((vente = true) AND (city_id =
27595))
-> Bitmap Index Scan on annonces_coords
(cost=0.00..33.37 rows=675 width=0) (actual time=1.776..1.776 rows=2449
loops=1)
Index Cond: (coords &&
'(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box)

Total runtime: 19.327 ms

Bitmap saves the day. Now for the other part:

EXPLAIN ANALYZE SELECT 1 AS archived, id, price, surface, coords,
detect_time, type_id, vente, zipcode, city_id FROM archive_data
WHERE
detect_time >= '2006-10-30 16:17:45.064793'
AND type_id IN
(1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6)
AND vente
AND (zipcode IN (69001,69002,69003,69004,69005,69006,69007,69008,69009) OR
city_id IN (27595) OR coords &&
'(45.74101689082,4.8371263505564),(45.75898310918,4.8628736494436)'::BOX)
AND surface IS NOT NULL AND price IS NOT NULL
ORDER BY price/surface
LIMIT 100;
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=59864.95..59865.20 rows=100 width=67) (actual
time=490.718..490.793 rows=100 loops=1)
-> Sort (cost=59864.95..59957.91 rows=37184 width=67) (actual
time=490.716..490.765 rows=100 loops=1)
Sort Key: (price / (surface)::double precision)
-> Bitmap Heap Scan on archive_data (cost=3951.02..57042.24
rows=37184 width=67) (actual time=223.720..344.485 rows=27761 loops=1)
Recheck Cond: ((vente AND (zipcode = ANY
('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[])))
OR (vente AND (city_id = 27595)) OR (coords &&
'(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box))
Filter: ((detect_time >= '2006-10-30
16:17:45.064793'::timestamp without time zone) AND (type_id = ANY
('{1,12,24,17,18,19,20,33,35,50,51,7,52,4,13,41,14,16,26,28,43,53,15,29,30,31,45,32,34,46,47,6}'::integer[]))
AND vente AND (surface IS NOT NULL) AND (price IS NOT NULL))
-> BitmapOr (cost=3951.02..3951.02 rows=171699 width=0)
(actual time=144.175..144.175 rows=0 loops=1)
-> Bitmap Index Scan on archive_data_zip
(cost=0.00..1692.62 rows=80610 width=0) (actual time=38.715..38.715
rows=86909 loops=1)
Index Cond: ((vente = true) AND (zipcode = ANY
('{69001,69002,69003,69004,69005,69006,69007,69008,69009}'::integer[])))
-> Bitmap Index Scan on archive_data_city
(cost=0.00..1695.31 rows=80683 width=0) (actual time=26.576..26.576
rows=85868 loops=1)
Index Cond: ((vente = true) AND (city_id =
27595))
-> Bitmap Index Scan on archive_data_coords
(cost=0.00..535.20 rows=10406 width=0) (actual time=78.880..78.880
rows=117333 loops=1)
Index Cond: (coords &&
'(45.75898310918,4.8628736494436),(45.74101689082,4.8371263505564)'::box)
Total runtime: 492.530 ms

So, taken individually, postgres is exceedingly good at the "hard" parts
of this query (i.e. finding what I want).
The problem is that, on such a query, it would really pay to:

- Move the ORDER BY and LIMIT inside the UNION ALL.
Postgres already moves the WHERE conditions inside.
Obviously, if both sides of the UNION have the same ORDER BY and LIMIT,
moving them inside would work well.
This only works when LIMIT is present, of course.

- Continue moving the ORDER BY and LIMIT inside the LEFT JOIN so they can
be performed before the MERGE JOIN.
This would allow merging 100 rows instead of 27,761, and could even be done
with some other join type, like a nested loop.
This also only works with LIMIT.

- And re-sort the final query result, since it's a UNION.

I don't think it would change the result: since it is a left join, all the
rows from the left side are in the end result anyway.
Is it planned to add this to Postgres sometime?

In the end I used the two fast SELECTs to get the ids of the items to
display, and had PHP shove them back into the original view, which has more
stuff going on inside (I was too lazy to redo it and my search engine). The
end result is about 300 times faster...
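
For reference, in pure SQL the manual rewrite would boil down to something
like this (the long filter list is abbreviated to the short conditions; the
LEFT JOIN on archive_ext can then be done on the 100 surviving ids):

SELECT * FROM (
     (SELECT 0 AS archived, id, price, surface FROM annonces
       WHERE vente AND surface IS NOT NULL AND price IS NOT NULL
       -- plus the detect_time / type_id / zipcode / city_id / coords filters
       ORDER BY price/surface LIMIT 100)
   UNION ALL
     (SELECT 1 AS archived, id, price, surface FROM archive_data
       WHERE vente AND surface IS NOT NULL AND price IS NOT NULL
       -- plus the same filters
       ORDER BY price/surface LIMIT 100)
) AS foo
ORDER BY price/surface
LIMIT 100;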


From: Thomas Kellerer <spam_eater(at)gmx(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-05 22:38:07
Message-ID: ev3tof$1ag$1@sea.gmane.org
Lists: pgsql-general

William Garrison wrote on 06.04.2007 00:22:
> I have actually never stored data in the database.

Hmm, funny statement somehow ;)


From: William Garrison <postgres(at)mobydisk(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-06 00:21:05
Message-ID: 46159271.8030808@mobydisk.com
Lists: pgsql-general

lol.

yeah, I meant binary blobs. :-)

Thomas Kellerer wrote:
> William Garrison wrote on 06.04.2007 00:22:
>> I have actually never stored data in the database.
>
> Hmm, funny statement somehow ;)
>
>


From: Nikolay Moskvichev <mnv(at)papillon(dot)ru>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-06 03:06:25
Message-ID: 1175828788.348751@jet.ru
Lists: pgsql-general


> This doesn't answer your question, but I thought I'd throw my opinion in
> anyway.
>
> My personal view is that in general, binary files have no place in
> databases. Filesystems are for files, databases are for data. My design
> choice is to store the files in a filesystem and use the database to hold
> metadata as well as a pointer to the file.
>

Yes, the current design is exactly the same.

> If you *must* put files into the database, then you can do so, and PG
> will handle that many files of those sizes with ease. For all intents
> and purposes, PG can store an unlimited number of files. You're far more
> likely to run into walls in the form of limitations in your disk I/O
> system than limitations in what PG will handle.
>
Thanks


From: Nikolay Moskvichev <mnv(at)papillon(dot)ru>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-06 06:46:16
Message-ID: 1175841962.931010@jet.ru
Lists: pgsql-general

Naz Gassiep wrote:
> This doesn't answer your question, but I thought I'd throw my opinion in
> anyway.
>
> My personal view is that in general, binary files have no place in
> databases. Filesystems are for files, databases are for data. My design
> choice is to store the files in a filesystem and use the database to hold
> metadata as well as a pointer to the file.
>

By the way, it's not exactly the same case, but here are some pros and cons:

http://en.wikibooks.org/wiki/Programming:WebObjects/Web_Applications/Development/Database_vs_Filesystem


From: Dave Page <dpage(at)postgresql(dot)org>
To: Thomas Kellerer <spam_eater(at)gmx(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Storing blobs in PG DB
Date: 2007-04-06 07:56:33
Message-ID: 4615FD31.1040005@postgresql.org
Lists: pgsql-general

Thomas Kellerer wrote:
> Merlin Moncure wrote on 05.04.2007 23:24:
>> I think most reasons not to store binaries in the
>> database boil down to performance.
>
> Having implemented an application where the files were stored in the
> filesystem instead of the database, I have to say that, based on my
> experience, I would store the files in the DB next time. Once the number
> of files in a directory exceeds a certain limit, that directory becomes
> very hard to handle.
>
> Things like "dir" or "ls", or listing the contents through an FTP
> connection, become extremely slow (on HP/UX as well as Windows).

This is very true - I've ended up with data stores containing directory
hierarchies to handle this issue:

1/
    1/
    ...
    16/
2/
    1/
    ...
    16/
3/
...
16/

And so on. The more files, the more directories. The files are then
stored in the lower-level directories using an appropriate algorithm to
distribute them fairly equally.
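
For example, hashing the primary key spreads the files evenly over a
two-level tree; a sketch with made-up names:

-- id 1234 hashes to '81dc9b...', giving the path '/data/81/dc/1234'
SELECT '/data/' || substr(md5(id::text), 1, 2)
                || '/' || substr(md5(id::text), 3, 2)
                || '/' || id::text AS file_path
  FROM files;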

> And you have to back up only _one_ source (the database), not two. Moving
> the data around from system A to system B (e.g. staging (Windows) ->
> production (HP/UX)) is a lot easier when you can simply back up and
> restore the database (in our case it was an Oracle database, but this
> would be the same for PG).

Well, this is the big problem - on a busy system your database backup can
easily become out of sync with your filesystem backup; coupled with
that, you have no automatic transactional control over anything you
store in the file system.

Consequently, the more recent systems I've built have stored the blobs
in PostgreSQL.

Regards, Dave


From: Peter Wilson <petew(at)yellowhawk(dot)co(dot)uk>
To: Nikolay Moskvichev <mnv(at)papillon(dot)ru>
Subject: Re: Storing blobs in PG DB
Date: 2007-04-06 08:44:31
Message-ID: 4616086F.2060606@yellowhawk.co.uk
Lists: pgsql-general

Nikolay Moskvichev wrote:
> Hi All!
>
> The question is: how suitable is PG for storing about 2,000,000 binary
> files of 0.5-2.0 MB each? A large number of clients or frequent updates
> is not planned. It is like a photo album on a local host.

In our CMS we store all page data in the database - either text for HTML
pages, or images and other truly binary data.

I agree with most of the comments on why *not* to store binary application
data in the file system, but I'd add the following:

For scalability, we use a very high-performance database server
(battery-backed RAID controller, lots of disks, lots of RAM, processors,
etc.). Virtually the only thing it runs is Postgres. It connects to a
number of 'front-end' web servers. These are lower-performance, cheaper
boxes than the database server: generally 1U, single disk, not RAID, etc.
The web servers run Apache and the processor-intensive stuff like
server-side JavaScript.

A load balancer shares web traffic across the front-end servers and can detect
if any of those fail and switch them out of servicing requests.

The front-end servers connect to the database server over gigabit Ethernet to
cut latency to a minimum.

We've found that the single high-spec database server is more than capable of
servicing quite a lot of front-end web servers.

Now in that environment, if you were to choose to store things in the file
system, not only have you got the issue of synchronising the file system
with the database, but also of synchronising all the file systems.

You could get round some of those issues by keeping all the files on one system
and using an NFS share or some complex FTP system or something, but actually
it's a lot easier to hold the data in the database.

So - when you're developing your application, it's worth thinking about
what happens as it starts to get busier. What's your route to scaling? Many
web applications are written to work on a single machine with no thought to
what happens when that reaches its limit, other than getting a bigger
server.

All the best
Peter Wilson
--
http://www.whitebeam.org
http://www.yellowhawk.co.uk
---