Re: SSD + RAID

From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Laszlo Nagy <gandalf(at)shopzeus(dot)com>
Cc: Ivan Voras <ivoras(at)freebsd(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: SSD + RAID
Date: 2009-11-15 10:17:24
Message-ID: 4AFFD534.40103@postnewspapers.com.au
Lists: pgsql-performance

On 15/11/2009 2:05 PM, Laszlo Nagy wrote:
>
>> A change has been written to the WAL and fsync()'d, so Pg knows it's hit
>> disk. It can now safely apply the change to the tables themselves, and
>> does so, calling fsync() to tell the drive containing the tables to
>> commit those changes to disk.
>>
>> The drive lies, returning success for the fsync when it's just cached
>> the data in volatile memory. Pg carries on, shortly deleting the WAL
>> archive the changes were recorded in or recycling it and overwriting it
>> with new change data. The SSD is still merrily buffering data to write
>> cache, and hasn't got around to writing your particular change yet.
>>
> All right. I believe you. In the current Pg implementation, I need to
> turn off the disk cache.

That's certainly my understanding. I've been wrong many times before :S

> #1. user wants to change something, resulting in a write_to_disk(data) call
> #2. data is written into the WAL and fsync()-ed
> #3. at this point the write_to_disk(data) call CAN RETURN, the user can
> continue his work (the WAL is already written, changes cannot be lost)
> #4. Pg can continue writing data onto the disk, and fsync() it.
> #5. Then WAL archive data can be deleted.
>
> Now maybe I'm wrong, but between #3 and #5, the data to be written is
> kept in memory. This is basically a write cache, implemented in OS
> memory. We could really handle it like a write cache. E.g. everything
> would remain the same, except that we add some latency. We can wait some
> time after the last modification of a given block, and then write it out.
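For reference, steps #1 to #5 above could be sketched roughly like this.
It's a toy in Python, not how Pg actually does it: the write_to_disk()
helper, the file names and the "record" format are all made up for
illustration, and os.fsync() only gives you durability if the drive
really honours the flush.

```python
import os
import tempfile


def write_to_disk(wal_path, table_path, data):
    """Toy version of steps #1-#5; every name here is hypothetical."""
    # #2: append the change to the WAL and fsync it.
    wal_fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(wal_fd, data)
        os.fsync(wal_fd)
    finally:
        os.close(wal_fd)
    # #3: a real system could report success to the user at this point.

    # #4: apply the change to the table file and fsync that as well.
    tbl_fd = os.open(table_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(tbl_fd, data)
        os.fsync(tbl_fd)
    finally:
        os.close(tbl_fd)

    # #5: only now may the WAL record be discarded, and only if the
    # fsync in #4 really reached stable storage.
    os.truncate(wal_path, 0)


workdir = tempfile.mkdtemp()
wal = os.path.join(workdir, "wal")
table = os.path.join(workdir, "table")
write_to_disk(wal, table, b"row-1\n")
```

The whole argument in this thread is about step #5: if the fsync in #4
returns early, the WAL gets thrown away while the change exists nowhere
durable.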

I don't know enough about the whole affair to give you a good
explanation (I tried, and it just showed me how much I didn't know)
but here are a few issues:

- Pg doesn't know the erase block sizes or positions. It can't group
writes up by erase block except by hoping that, within a given file,
writing in page order will get the blocks to the disk in roughly
erase-block order. So your write caching isn't going to do anywhere near
as good a job as the SSD's own cache can.

- The only way to make this help the SSD out much would be to use a LOT
of RAM for write cache and maintain a LOT of WAL archives. That's RAM
not being used for caching read data. The large number of WAL archives
means incredibly long WAL replay times after a crash.

- You still need a reliable way to tell the SSD "really flush your cache
now" after you've flushed the changes from your huge chunks of WAL files
and are getting ready to recycle them.

I was thinking that write ordering would be an issue too, as some
changes in the WAL would hit main disk before others that were earlier
in the WAL. However, I don't think that matters if full_page_writes are
on. If you replay from the start, you'll reapply some changes with older
versions, but they'll be corrected again by a later WAL record. So
ordering during WAL replay shouldn't be a problem. On the other hand,
the INCREDIBLY long WAL replay times during recovery would be a nightmare.
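A rough illustration of why I think ordering doesn't matter, as a Python
toy. It assumes every WAL record is a complete page image, which is a
simplification (as I understand it, with full_page_writes Pg only logs
the whole page the first time it's touched after a checkpoint). The page
contents and the replay() helper are invented for the example.

```python
def replay(disk, wal):
    """Apply whole-page images in WAL order; each one overwrites the page."""
    for page_no, image in wal:
        disk[page_no] = image
    return disk


# Hypothetical WAL: two pages, each modified twice.
wal = [(0, "A1"), (1, "B1"), (0, "A2"), (1, "B2")]

# State after a crash: page 0 already holds its newest image, page 1 is
# still stale, i.e. the writes reached the main disk out of WAL order.
disk_after_crash = {0: "A2", 1: "B0"}

# Replay from the start: page 0 briefly goes back to "A1", then the
# later record restores "A2".
recovered = replay(dict(disk_after_crash), wal)
```

Whatever mix of old and new pages was on disk at the crash, replaying
the full WAL ends with the last image of each page.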

> I don't think that any SSD drive has more than some
> megabytes of write cache.

The big, lots-of-$$ ones have HUGE battery backed caches for exactly
this reason.

> The same amount of write cache could easily be
> implemented in OS memory, and then Pg would always know what hit the disk.

Really? How does Pg know what order the SSD writes things out from its
cache?
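To make that concrete, here's a toy model in Python of a drive that
acknowledges a flush while the data is still in volatile cache and then
writes it out in an order of its own choosing. The LyingDrive class and
its "highest block first" policy are pure invention; real firmware
behaviour is not something I can describe.

```python
class LyingDrive:
    """Hypothetical drive with a volatile write cache."""

    def __init__(self):
        self.media = {}   # what has really been written
        self.cache = {}   # volatile, lost on power failure

    def write(self, block, data):
        self.cache[block] = data

    def flush(self):
        # The lie: report success without writing anything to media.
        return True

    def destage(self, count):
        # The drive picks its own order; here, highest block number first.
        for block in sorted(self.cache, reverse=True)[:count]:
            self.media[block] = self.cache.pop(block)

    def power_loss(self):
        self.cache.clear()


drive = LyingDrive()
for block, data in [(1, "first"), (2, "second"), (3, "third")]:
    drive.write(block, data)

acked = drive.flush()   # host is told that everything is safe
drive.destage(1)        # drive only gets around to one block
drive.power_loss()
```

The host was told all three writes were safe, yet only the last one
issued survived, and nothing on the host side could have predicted
which.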

--
Craig Ringer
