Re: Database design - best practice

From: Vitalii Tymchyshyn <tivv00(at)gmail(dot)com>
To: Niels Kristian Schjødt <nielskristian(at)autouncle(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Database design - best practice
Date: 2012-11-28 14:41:14
Message-ID: CABWW-d22dFS9fL_AmrdbCUZEwY7qrgKzzr5pryCShJro=R1SKw@mail.gmail.com
Lists: pgsql-performance

Let me be devil's advocate here :)
First of all, even if you have read the basics of normalization, don't take
them too much to heart :) Think.
Know that each normalization/denormalization step has its pros and cons.
E.g. in the NoSQL world they often don't normalize much.
What's interesting about PostgreSQL is that it is quite well suited for
NoSQL-like scenarios.
First of all, each unfilled (null) data column takes only 1 bit. This, BTW,
leads to the interesting consequence that performance-wise it can be better
to have a null/true boolean than a false/true one, especially if you've got
a lot of "false" values.
So, PostgreSQL should be fine with tens, possibly hundreds, of data columns
with most columns empty. A record of 151 null columns would take header +
roundup(151/8) = 19 bytes for the null bitmap. Not much. NoSQL stores usually
put column names into the records, and that costs more.
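
If you want to see what a row actually weighs, something like this works
(assuming the wide table is simply called car):

  -- rough per-row data size in bytes for one sample row
  SELECT pg_column_size(c.*) AS row_bytes FROM car AS c LIMIT 1;
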
Any null columns at the end of the record take no space at all (so you could
think about reordering your columns to put the least used ones at the end of
the record).
Adding a column with NULL as the default is a cheap operation that does not
require a table scan.
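
For example (the column name here is hypothetical), this kind of statement
only updates the catalog and returns immediately, even on a 4-million-row table:

  -- nullable and without a default, so existing rows are not rewritten
  ALTER TABLE car ADD COLUMN last_inspection_date date;
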
You can have partial indexes to speed things up, like create index on car
(car_id) where (has_automatic_transmission);
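
Spelled out (car_id as in the line above), the index and a query that benefits
from it would look roughly like:

  -- only rows where the flag is TRUE are kept in the index, so it stays
  -- small even with millions of cars
  CREATE INDEX car_automatic_idx ON car (car_id)
   WHERE has_automatic_transmission;

  SELECT * FROM car WHERE has_automatic_transmission;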

On the other side, when you normalize you need to join. Instead of select *
from car where has_automatic_transmission (which can use the index above), you
will have to write something like "select * from car where id in (select id
from car_with_automatic_transmission)". The plan is much more complex here and
it will be slower.
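
A rough side-by-side, with car_with_automatic_transmission as the hypothetical
split-off table, so you can compare the plans on your own data:

  -- denormalized: a single table, and the partial index applies
  EXPLAIN ANALYZE
  SELECT * FROM car WHERE has_automatic_transmission;

  -- normalized: the flag lives in its own table, so a semi-join is needed
  EXPLAIN ANALYZE
  SELECT c.* FROM car AS c
   WHERE c.id IN (SELECT a.id FROM car_with_automatic_transmission AS a);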

The main plus of normalization for you is that otherwise you work with the
record as a whole: if there is a lot of information in there that is rarely
used, you "pay" for its access every time, both on selects and updates.

So, as a conclusion, I agree with the others that you should check. But
remember, joining two tables with millions of records is never cheap :)

Best regards, Vitalii Tymchyshyn

2012/11/28 Niels Kristian Schjødt <nielskristian(at)autouncle(dot)com>

> Hi,
>
> I'm on the hunt for some solid knowledge on a theoretical level about the
> performance of postgresql. My question is regarding best practices, and how
> architectural decisions might influence the performance. First a little
> background:
>
> The setup:
> I have a database which holds information on used cars. The database has
> mainly 3 tables of interest for this case:
> A cars table, an adverts table and a sellers table. One car has many
> adverts and one seller has many adverts. One advert belongs to one car and
> one seller.
> The database is powering a website for searching used cars. When searching
> for used cars, the cars table is mainly used, and a lot of the columns
> should be directly available for searching, e.g. color, mileage, price,
> has_automatic_transmission etc.
>
> So my main concern is actually about the cars table, since this one
> currently has a lot of columns (151 - I expect that's quite a lot?), and a
> lot of data (4 mil. rows, and growing). Now you might start by thinking,
> this could sound like a regular need for some normalization, but wait a
> second and let me explain :-)
> The columns in this table are for the most part very short strings, integers,
> decimals or booleans. So take for example has_automatic_transmission
> (boolean): I can't see why it would make sense to put that into a separate
> table and join in the values. Or the mileage or the price as another
> example. The cars table used for search is indexed quite a lot.
>
> The questions:
> Having the above setup in mind, what impact on performance, in terms of
> read performance and write performance, does it have, whether I do the
> following:
>    1) In general, would the read and/or the write on the database be
> faster if I serialized some of the non-searched columns in the table into
> a single text column instead of, let's say, 20 booleans?
>    2) Let's say I'm updating a timestamp in a single one of the 151
> columns in the cars table. The update statement is using the id to find the
> car. Would the write performance of that UPDATE be affected, if the table
> had fewer columns?
>    3) When adding a new column to the table, I know that it becomes
> slower the more rows are in the table, but does the "width" of the
> table also affect the performance when adding new columns?
> 4) In general what performance downsides do you get when adding a
> lot of columns to one table instead of having them in separate tables?
>    5) Is it significantly faster to select * from a table with 20
> columns than to select the same 20 from a table with 150 columns?
>
> Hope there is some good answers out there :-)
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

--
Best regards,
Vitalii Tymchyshyn
