ice-broker scan thread

Lists:	pgsql-hackers

From:	Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	ice-broker scan thread
Date:	2005-11-29 03:22:33
Message-ID:	Pine.LNX.4.58.0511282217470.13586@josh.db
Lists:	pgsql-hackers

I am considering adding an "ice-broker scan thread" to accelerate PostgreSQL
sequential scan IO speed. The basic idea of this thread is just like the
"read-ahead" method, but the difference is that it does not read the data
into the shared buffer pool directly; instead, it reads the data into the
file system cache, which makes the integration easy, and this is unique to
PostgreSQL.

What happens in the original sequential scan:
for (;;)
{
    /*
     * a physical read may happen, depending on the current content of
     * the file system cache and on whether the kernel is smart enough
     * to understand that you want to do a sequential scan
     */
    physical or logical read a page;
    process the page;
}

What happens in the sequential scan with the ice-broker:
for (;;)
{
    /* since the ice-broker has most likely read the page in already */
    logical read a page with high probability;
    process the page;
}

I wrote a program to simulate the sequential scan in PostgreSQL
with/without the ice-broker. The results indicate this technique has the
following characteristics:
(1) The most important factor in the speedup is how much CPU time PostgreSQL
spends on each data page. If PG is fast enough, then no speedup occurs;
otherwise a 10% to 20% speedup is expected according to my test.
(2) It uses more CPU - this is easy to understand, since it does more
work;
(3) The benefit also depends on other factors, like how smart your file
system is ...

Here are the test results on my machine:
---
$#uname -a
Linux josh.db 2.4.29-1 #2 Tue Jan 25 17:03:33 EST 2005 i686 unknown
$#cat /proc/meminfo | grep MemTotal
MemTotal: 1030988 kB
$#cat /proc/cpuinfo | grep CPU
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
$#./seqscan 10 $HOME/pginstall/bin/data/base/10794/18986 50
PostgreSQL sequential scan simulator configuration:
Memory size: 943718400
CPU cost per page: 50
Scan thread read unit size: 4

With scan threads off - duration: 56862.738 ms
With scan threads on - duration: 40611.101 ms
With scan threads off - duration: 46859.207 ms
With scan threads on - duration: 38598.234 ms
With scan threads off - duration: 56919.572 ms
With scan threads on - duration: 47023.606 ms
With scan threads off - duration: 52976.825 ms
With scan threads on - duration: 43056.506 ms
With scan threads off - duration: 54292.979 ms
With scan threads on - duration: 42946.526 ms
With scan threads off - duration: 51893.590 ms
With scan threads on - duration: 42137.684 ms
With scan threads off - duration: 46552.571 ms
With scan threads on - duration: 41892.628 ms
With scan threads off - duration: 45107.800 ms
With scan threads on - duration: 38329.785 ms
With scan threads off - duration: 47527.787 ms
With scan threads on - duration: 38293.581 ms
With scan threads off - duration: 48810.656 ms
With scan threads on - duration: 39018.500 ms
---
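
As a rough cross-check of the 10% to 20% figure, the ten pairs of timings
above can be averaged. The sketch below is illustrative arithmetic only: the
helper names mean_ms and time_reduction are made up here and are not part of
seqscan.c. With the durations listed above it gives a mean of about 50780 ms
with scan threads off and about 41191 ms with them on, i.e. roughly a 19%
reduction in elapsed time for this one run on this one machine.

```c
#include <assert.h>
#include <stddef.h>

/* Durations in ms, copied from the run above. */
const double scan_off_ms[10] = {
    56862.738, 46859.207, 56919.572, 52976.825, 54292.979,
    51893.590, 46552.571, 45107.800, 47527.787, 48810.656
};
const double scan_on_ms[10] = {
    40611.101, 38598.234, 47023.606, 43056.506, 42946.526,
    42137.684, 41892.628, 38329.785, 38293.581, 39018.500
};

/* Arithmetic mean of n durations (hypothetical helper). */
double
mean_ms(const double *v, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        sum += v[i];
    return sum / n;
}

/* Fraction of elapsed time saved when going from off_ms to on_ms. */
double
time_reduction(double off_ms, double on_ms)
{
    return (off_ms - on_ms) / off_ms;
}
```

Calling time_reduction(mean_ms(scan_off_ms, 10), mean_ms(scan_on_ms, 10))
yields a value between 0.18 and 0.19 for these numbers.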

Note that cpu_cost=50 above might look too big (if you look into the
code), but in a concurrent situation it is not that huge. Also, on my
Windows box (PIII, 800), cpu_cost=5 is enough to show a benefit
of 10%.

So in general, it does help in some situations, but it is not rocket science,
since we can't predict the performance of the file system. It is fairly
easy to integrate, and we should add a GUC parameter to control it.

We need more tests; any comments and tests are welcome.

Regards,
Qingqing

---

/*
 * seqscan.c
 *      PostgreSQL sequential scan simulator with helper scan thread
 *
 * Note
 *      I wrote this simulator to see if there is any benefit for a sequential
 *      scan in doing read-ahead from another thread. The only thing you may
 *      want to change in the source file is MEMSZ; make it big enough to
 *      thrash your file system cache.
 *
 * Use the following command to compile (-lm goes after the source file so
 * that the linker resolves pow() and sin()):
 *      $gcc -O2 -Wall -pthread seqscan.c -o seqscan -lm
 * To use it:
 *      $./seqscan <rounds> <datafile> <cpu_cost>
 * where rounds is how many times you want to run the test (note that each
 * round includes two disk-burn tests), datafile is the path to any file
 * (suggested size > 100M), and cpu_cost is the cost of processing each page
 * of the file. Try different cpu_cost values.
 */

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <math.h>

#ifdef WIN32
#include <io.h>
#include <windows.h>
#define PG_BINARY O_BINARY
#else
#include <unistd.h>
#include <pthread.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/file.h>
#define PG_BINARY 0
#endif

typedef char bool;
#define true ((bool) 1)
#define false ((bool) 0)

#define BLCKSZ 8192
#define UNITSZ 4
#define MEMSZ (950*1024*1024)

char *data_file;
int cpu_cost;
volatile bool stop_scan;
char thread_buffer[BLCKSZ*UNITSZ];

static void
cleanup_cache(void)
{
    char *p;

    if (NULL == (p = (char *)malloc(MEMSZ)))
    {
        fprintf(stderr, "insufficient memory\n");
        exit(-1);
    }

    memset(p, 'a', MEMSZ);
    free(p);
}

#ifdef WIN32
bool enable_aio = false;

static const unsigned __int64 epoch = 116444736000000000L;
static int gettimeofday(struct timeval * tp, struct timezone * tzp)
{
    FILETIME file_time;
    SYSTEMTIME system_time;
    ULARGE_INTEGER ularge;

    GetSystemTime(&system_time);
    SystemTimeToFileTime(&system_time, &file_time);
    ularge.LowPart = file_time.dwLowDateTime;
    ularge.HighPart = file_time.dwHighDateTime;

    tp->tv_sec = (long) ((ularge.QuadPart - epoch) / 10000000L);
    tp->tv_usec = (long) (system_time.wMilliseconds * 1000);

    return 0;
}

static void
sleep(int secs)
{
    SleepEx(secs*1000, true);
}

static int
thread_open()
{
    HANDLE fd;
    SECURITY_ATTRIBUTES sa;

    sa.nLength = sizeof(sa);
    sa.bInheritHandle = TRUE;
    sa.lpSecurityDescriptor = NULL;

    fd = CreateFile(data_file,
                    GENERIC_READ,
                    FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
                    &sa,
                    OPEN_EXISTING,
                    FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN
                    | (enable_aio?FILE_FLAG_OVERLAPPED:0),
                    NULL);

    if (fd == INVALID_HANDLE_VALUE)
    {
        int errCode;

        switch (errCode = GetLastError())
        {
            /* EMFILE, ENFILE should not occur from CreateFile. */
            case ERROR_PATH_NOT_FOUND:
            case ERROR_FILE_NOT_FOUND: errno = ENOENT; break;
            case ERROR_FILE_EXISTS: errno = EEXIST; break;
            case ERROR_ACCESS_DENIED: errno = EACCES; break;
            default:
                fprintf(stderr, "thread_open failed: %d\n", errCode);
                errno = EINVAL;
        }

        return -1;
    }

    return (int)fd;
}

static int
thread_read(int fd, int blkno, size_t nblk, char *buf)
{
    long offset = BLCKSZ*blkno;
    DWORD nbytes = 0;   /* stays 0 if the read is pending or hit EOF */
    OVERLAPPED ol;

    memset(&ol, 0, sizeof(OVERLAPPED));
    ol.Offset = offset;
    ol.OffsetHigh = 0;

    if (ReadFile((HANDLE)fd, buf, BLCKSZ*nblk, &nbytes, &ol))
    {
        /* successfully done without delay */
    }
    else
    {
        int errCode;
        switch (errCode = GetLastError())
        {
            case ERROR_IO_PENDING:
                break;
            case ERROR_HANDLE_EOF:
                break;
            default:
                /* unknown error occurred */
                fprintf(stderr, "asyncread failed: %d\n", errCode);
                exit(-1);
        }
    }

    return (int) nbytes;
}

static void
thread_close(int fd)
{
    CloseHandle((HANDLE)fd);
}

#else /* non-windows platforms */

static int
thread_open()
{
    int fd;

    fd = open(data_file, O_RDWR | PG_BINARY, 0600);
    if (fd < 0)
    {
        fprintf(stderr, "thread_open failed: %d\n", errno);
        exit(-1);
    }

    return (int)fd;
}

static int
thread_read(int fd, int blkno, size_t nblk, char *buf)
{
    long offset = BLCKSZ*blkno;
    long nbytes;

    /* position the file first; a failed seek is as fatal as a failed read */
    if (lseek(fd, offset, SEEK_SET) < 0)
    {
        fprintf(stderr, "thread_read seek failed: %d\n", errno);
        exit(-1);
    }
    nbytes = read(fd, buf, BLCKSZ*nblk);
    if (nbytes <= 0)
    {
        fprintf(stderr, "thread_read failed: %d\n", errno);
        exit(-1);
    }

    return nbytes;
}

static void
thread_close(int fd)
{
    close(fd);
}
#endif

#ifdef WIN32
static DWORD WINAPI
scan_thread(LPVOID args)
#else
static void *
scan_thread(void *args)
#endif
{
    int i, fd;
    int start, end;

    start = 0;
    end = (size_t)args;

    fd = thread_open();
    for (i = start; i < end; i+=UNITSZ)
    {
        thread_read(fd, i, UNITSZ, (char *)thread_buffer);

        /* check if I was asked to stop */
        if (stop_scan == true)
            break;
    }
    thread_close(fd);

    return 0;
}

static int
init_scan(bool with_threads, size_t *nblocks)
{
    int fd;
    long len;

    /* open file for do_scan */
    fd = open(data_file, O_RDWR | PG_BINARY, 0600);
    if (fd < 0)
    {
        fprintf(stderr, "failed to open file %s\n", data_file);
        exit(-1);
    }

    /* size_t is unsigned, so check the lseek result before converting it */
    len = lseek(fd, 0, SEEK_END);
    if (len < 0)
    {
        fprintf(stderr, "failed to get file length %s\n", data_file);
        exit(-1);
    }
    *nblocks = len / BLCKSZ;

    if (with_threads)
    {
#ifndef WIN32
        pthread_t thread;
#endif
        /* create scan threads */
        stop_scan = false;
#ifdef WIN32
        if (NULL == CreateThread(NULL, 0,
                                 scan_thread, (void *)(*nblocks),
                                 0, NULL))
#else
        if (pthread_create(&thread, NULL,
                           scan_thread, (void *)(*nblocks)))
#endif
        {
            fprintf(stderr, "failed to start scan thread\n");
            exit(-1);
        }
    }

    return fd;
}

static void
do_scan(int fd, size_t nblocks)
{
    int i, j, k, nbytes;
    char buffer[BLCKSZ];

    for (i = 0; i < nblocks; i++)
    {
        nbytes = lseek(fd, i*BLCKSZ, SEEK_SET);
        nbytes = read(fd, buffer, BLCKSZ);
        if (nbytes != BLCKSZ)
        {
            fprintf(stderr, "do_scan read failed\n");
            exit(-1);
        }

        /* pretend to do some CPU intensive analysis */
        for (k = 0; k < cpu_cost; k++)
        {
            for (j = (k*sizeof(int))%BLCKSZ;
                 j < BLCKSZ / (5 * sizeof(int));
                 j += sizeof(int))
            {
                int x, y;

                x = ((int *)buffer)[j];
                x = (int)pow((double)x, (double)(x+1));
                y = (int)sin((double)x*x);
                ((int *)buffer)[j] = x*y;
            }
        }
    }
}

static void
close_scan(int fd)
{
    /* ask the helper thread (if any) to stop, then release the file */
    stop_scan = true;
    close(fd);
}

int
main(int argc, char *argv[])
{
    int i, rounds, fd;
    size_t nblocks;

    if (argc != 4)
    {
        fprintf(stderr, "usage: seqscan <rounds> <datafile> <cpu_cost>\n");
        exit(-1);
    }

    rounds = atoi(argv[1]);
    data_file = argv[2];
    cpu_cost = atoi(argv[3]);
    fd = init_scan(false, &nblocks);
    close_scan(fd);
    fprintf(stdout, "PostgreSQL sequential scan simulator configuration:\n"
            "\tMemory size: %u\n"
            "\tCPU cost per page: %d\n"
            "\tScan thread read unit size: %d\n\n",
            MEMSZ, cpu_cost, UNITSZ);

    for (i = 0; i < 2*rounds; i++)
    {
        struct timeval start_t, stop_t;
        bool enable = i%2?true:false;

        /* eliminate system cached data */
        cleanup_cache();
        sleep(2);

        /* do the scan task */
        gettimeofday(&start_t, NULL);
        fd = init_scan(enable, &nblocks);
        do_scan(fd, nblocks);
        close_scan(fd);
        gettimeofday(&stop_t, NULL);

        /* measure the time; borrow one second if usec would go negative */
        if (stop_t.tv_usec < start_t.tv_usec)
        {
            stop_t.tv_sec--;
            stop_t.tv_usec += 1000000;
        }
        fprintf (stdout, "With scan threads %s - duration: %ld.%03ld ms\n",
                 enable?"on":"off",
                 (long) ((stop_t.tv_sec - start_t.tv_sec) * 1000 +
                         (stop_t.tv_usec - start_t.tv_usec) / 1000),
                 (long) (stop_t.tv_usec - start_t.tv_usec) % 1000);

        sleep(2);
    }

    exit(0);
}

From:	David Boreham <david_list(at)boreham(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 03:50:43
Message-ID:	438BD013.1000804@boreham.org
Lists:	pgsql-hackers

Qingqing Zhou wrote:

>I am considering add an "ice-broker scan thread" to accelerate PostgreSQL
>sequential scan IO speed. The basic idea of this thread is just like the
>"read-ahead" method, but the difference is this one does not read the data
>into shared buffer pool directly, instead, it reads the data into file
>system cache, which makes the integration easy and this is unique to
>PostgreSQL.
>
>
Interesting, and I wondered about this too. But for my taste the
demonstrated benefit really isn't large enough to make it worthwhile.
BTW, I heard a long time ago that NTFS has quite fancy read-ahead, where
it attempts to detect the application's access pattern, including whether
it is reading sequentially and even whether there is a 'stride' to the
accesses when they're not contiguous. I would imagine that other
filesystems attempt similar tricks. So one might expect a simple linear
prefetch to not help much in the presence of such a filesystem.

Were you worried about the icebreaker thread getting too far ahead of
the scan? If it did, it might page out the data you're about to read, I
think. Of course this could be fixed by having the read-ahead thread
periodically check the current location being read by the query thread
and pause if it has got too far ahead.
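
The throttling idea in the paragraph above could look roughly like the
following. This is a sketch only, not code from the thread:
MAX_LEAD_BLOCKS, scan_position, too_far_ahead and prefetch_throttled are
hypothetical names, the limit of 256 blocks is an arbitrary guess, and it
assumes a POSIX system with C11 atomics.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ 8192
#define MAX_LEAD_BLOCKS 256     /* hypothetical limit, tune per system */

/* Block currently being processed; the query thread stores to it. */
atomic_long scan_position;

/* True if the prefetcher is more than max_lead blocks ahead of the scan. */
int
too_far_ahead(long prefetch_blk, long scan_blk, long max_lead)
{
    assert(max_lead >= 0);
    return prefetch_blk - scan_blk > max_lead;
}

/*
 * Body of the read-ahead thread: warm the file system cache, but pause
 * whenever it runs too far ahead of the query thread.
 */
void
prefetch_throttled(int fd, long nblocks)
{
    static char scratch[BLCKSZ];
    const struct timespec nap = {0, 1000000};   /* 1 ms */
    long blk;

    for (blk = 0; blk < nblocks; blk++)
    {
        while (too_far_ahead(blk, atomic_load(&scan_position),
                             MAX_LEAD_BLOCKS))
            nanosleep(&nap, NULL);

        /* the data is discarded; only the cache side effect matters */
        if (pread(fd, scratch, BLCKSZ, (off_t) blk * BLCKSZ) <= 0)
            break;
    }
}
```

The query thread would call atomic_store(&scan_position, i) once per page
so that the helper can see how far the scan has progressed.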

Anyway, the recent performance thread has been interesting to me because
in all my career I've never seen a database that scanned scads of data
from disk to process a query. Typically the problems I work on arrange to
read the entire database into memory.
I think I need to get out more... ;)

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 03:53:36
Message-ID:	Pine.LNX.4.58.0511291429330.18112@linuxworld.com.au
Lists:	pgsql-hackers

On Mon, 28 Nov 2005, Qingqing Zhou wrote:

>
> I am considering add an "ice-broker scan thread" to accelerate PostgreSQL
> sequential scan IO speed. The basic idea of this thread is just like the
> "read-ahead" method, but the difference is this one does not read the data
> into shared buffer pool directly, instead, it reads the data into file
> system cache, which makes the integration easy and this is unique to
> PostgreSQL.
>

MySQL, Oracle and others implement read-ahead threads to simulate async IO
'pre-fetching'. I've been experimenting with two ideas. The first is to
increase the readahead when we're doing sequential scans (see prototype
patch using posix fadvise attached). I've not got any hardware at the
moment which I can test this patch on but I am waiting on some dbt-3
results which should indicate whether fadvise is a good idea or a bad one.
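
For readers unfamiliar with the call, a readahead hint of this kind might
be issued as below. This is not the attached fadvise.diff, whose contents
are not shown here; hint_sequential and hint_willneed are hypothetical
helpers, and only posix_fadvise() and its POSIX_FADV_* constants are real
API. Whether the kernel acts on the advice is system dependent.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>      /* posix_fadvise() returns an errno-style code */
#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192

/*
 * Tell the kernel the whole file will be read sequentially (len 0 means
 * "to end of file"). Returns 0 on success or an error number such as EBADF.
 */
int
hint_sequential(int fd)
{
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}

/* Ask the kernel to start reading nblks blocks beginning at blkno. */
int
hint_willneed(int fd, long blkno, long nblks)
{
    assert(blkno >= 0 && nblks >= 0);
    return posix_fadvise(fd, (off_t) blkno * BLCKSZ,
                         (off_t) nblks * BLCKSZ, POSIX_FADV_WILLNEED);
}
```

Note that posix_fadvise() reports failure through its return value rather
than through errno, so callers should test the result directly.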

The second idea is using posix async IO at key points within the system
to better parallelise CPU and IO work. The areas where I think we could use
async IO are: during sequential scans, use async IO to do pre-fetching of
blocks; inside WAL, begin flushing WAL buffers to disk before we commit;
and, inside the background writer/check point process, asynchronously
write out pages and, potentially, asynchronously build new checkpoint segments.

The motivation for using async IO is twofold: first, the results of this
paper[1] are compelling; second, modern OSs support async IO. I know that
Linux[2], Solaris[3], AIX and Windows all have async IO and I presume that
all their rivals have it as well.

The fundamental premise of the paper mentioned above is that if the
database is busy, IO should be busy. With our current block-at-a-time
processing, this isn't always the case. This is why Qingqing's read-ahead
thread makes sense. My reason for mailing is, however, that the async IO
results are more compelling than the read ahead thread.

I haven't had time to prototype whether we can easily implement async IO
but I am planning to work on it in December. The two main goals will be to
a) integrate and utilise async IO, at least within the executor context,
and b) build a primitive kind of scheduler so that we stop prefetching
when we know that there are a certain number of outstanding IOs for a
given device.

Thanks,

Gavin

[1] http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf
[2] http://lse.sourceforge.net/io/aionotes.txt
[3] http://developers.sun.com/solaris/articles/event_completion.html - I'm
fairly sure they have a posix AIO wrapper around these routines, but I
cannot see it documented anywhere :-(

Attachment	Content-Type	Size
fadvise.diff	text/plain	10.2 KB

From:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To:
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 03:56:08
Message-ID:	438BD158.5050109@familyhealth.com.au
Lists:	pgsql-hackers

Qingqing,

>> I am considering add an "ice-broker scan thread" to accelerate PostgreSQL
>> sequential scan IO speed. The basic idea of this thread is just like the
>> "read-ahead" method, but the difference is this one does not read the
>> data
>> into shared buffer pool directly, instead, it reads the data into file
>> system cache, which makes the integration easy and this is unique to
>> PostgreSQL.

You probably mean "ice-breaker" by the way :)

Chris

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 04:01:53
Message-ID:	29795.1133236913@sss.pgh.pa.us
Lists:	pgsql-hackers

Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> I haven't had time to prototype whether we can easily implement async IO

Just as with any suggestion to depend on threads, you are going to have
to show results that border on astounding to have any chance of getting
this in. Otherwise the portability issues are just going to make it not
worth the trouble.

regards, tom lane

From:	David Boreham <david_list(at)boreham(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 04:09:55
Message-ID:	438BD493.3070507@boreham.org
Lists:	pgsql-hackers

Gavin Sherry wrote:

> MySQL, Oracle and others implement read-ahead threads to simulate async IO

I always believed that Oracle used async file I/O. Not that I've seen their
code, but I'm fairly sure they funded the addition of kernel aio to Linux
a few years back.

But....Oracle comes from a time long ago when threads and decent
filesystems didn't exist, so some of the things they do may not be
appropriate to add to a product that doesn't have them today.

Now...network async I/O...that'd be really useful in my world...

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	David Boreham <david_list(at)boreham(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 04:14:38
Message-ID:	Pine.LNX.4.58.0511291513260.18370@linuxworld.com.au
Lists:	pgsql-hackers

On Mon, 28 Nov 2005, David Boreham wrote:

> Gavin Sherry wrote:
>
> > MySQL, Oracle and others implement read-ahead threads to simulate async IO
>
> I always believed that Oracle used async file I/O. Not that I've seen their
> code, but I'm fairly sure they funded the addition of kernel aio to Linux
> a few years back.

That's right.

>
> But....Oracle comes from a time long ago when threads and decent
> filesystems didn't exist, so some of the things they do may not be
> appropriate
> to add to a product that doesn't have them today.

The paper I linked to seemed to suggest that they weren't using async IO
in 9.2 -- which is fairly old. I'm not sure why the authors didn't test
10g.

Gavin

From:	Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 04:19:50
Message-ID:	438BD6E6.1000107@paradise.net.nz
Lists:	pgsql-hackers

Tom Lane wrote:
> Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
>
>>I haven't had time to prototype whether we can easily implement async IO
>
>
> Just as with any suggestion to depend on threads, you are going to have
> to show results that border on astounding to have any chance of getting
> this in. Otherwise the portability issues are just going to make it not
> worth the trouble.

Do these ideas require threads in principle? ISTM that there could be
(additional) process(es) waiting to perform pre-fetching or async io,
and we could use the usual IPC machinery to talk between them...

cheers

Mark
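
A process-based variant of the helper might be sketched as follows. It is
only an illustration of the idea above under assumed names:
start_prefetch_process is hypothetical, there is no real IPC here beyond
fork() and the child's exit status, and it assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLCKSZ 8192
#define UNITSZ 4

/*
 * Fork a child that reads the file front to back, discarding the data, so
 * that the pages end up in the file system cache. Returns the child's pid
 * in the parent, or -1 if fork() failed. The child never returns.
 */
pid_t
start_prefetch_process(const char *path)
{
    static char scratch[BLCKSZ * UNITSZ];
    ssize_t n;
    pid_t pid;
    int fd;

    assert(path != NULL);

    pid = fork();
    if (pid != 0)
        return pid;         /* parent, or fork failure (-1) */

    /* child: sequential read in UNITSZ-block chunks */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        _exit(1);
    while ((n = read(fd, scratch, sizeof(scratch))) > 0)
        ;
    close(fd);
    _exit(n < 0 ? 1 : 0);
}
```

The parent can stop the helper early with kill(pid, SIGTERM) and should
reap it with waitpid() when the scan ends.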

From:	David Boreham <david_list(at)boreham(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 04:34:26
Message-ID:	438BDA52.7090601@boreham.org
Lists:	pgsql-hackers

>
>The paper I linked to seemed to suggest that they weren't using async IO
>in 9.2 -- which is fairly old. I'm not sure why the authors didn't test
>10g.
>
>
...<reads paper>... ok, interesting. Did they say that Oracle isn't using
aio? I can't see that. They say that Oracle has no more than one outstanding
I/O operation in flight per concurrent query, and they appear to think
that's a bad thing. I'm not seeing that myself. Perhaps once I sleep on it,
it'll become clear what they're getting at.

One theory for the lack of aio in Oracle as tested in that paper would be
that they were testing on Linux. Since aio is relatively new in Linux I
wouldn't be surprised if Oracle didn't actually use it until it's known to
be widely deployed in the field and to have proven reliability. Perhaps
we've reached that state around now, and so Oracle may not yet have
released an aio-capable Linux version of their RDBMS. Just a
theory...someone from those tubular towers lurking here could tell us for
sure I guess...

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 04:38:39
Message-ID:	Pine.LNX.4.58.0511291524470.18419@linuxworld.com.au
Lists:	pgsql-hackers

On Mon, 28 Nov 2005, Tom Lane wrote:

> Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> > I haven't had time to prototype whether we can easily implement async IO
>
> Just as with any suggestion to depend on threads, you are going to have
> to show results that border on astounding to have any chance of getting
> this in. Otherwise the portability issues are just going to make it not
> worth the trouble.

The architecture I am looking at would not rely on threads.

I didn't want to jump on list and wave my hands until I had something to
show, but since Qingqing is looking at the issue I thought I had better
raise it.

Gavin

From:	Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	David Boreham <david_list(at)boreham(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 04:45:17
Message-ID:	438BDCDD.40600@paradise.net.nz
Lists:	pgsql-hackers

Gavin Sherry wrote:

>
> The paper I linked to seemed to suggest that they weren't using async IO
> in 9.2 -- which is fairly old. I'm not sure why the authors didn't test
> 10g.
>

There have been async io type parameters in Oracle's init.ora files from
(at least) 8i (disk_async_io=true IIRC) - on Solaris anyway. Whether
this enabled real or simulated async io is probably a good question - I
recall during testing turning it off and seeing kio()? or similar type
calls become write()/read() in truss output.

regards

Mark

From:	Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
To:	Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 04:49:42
Message-ID:	Pine.LNX.4.58.0511282348020.13833@josh.db
Lists:	pgsql-hackers

On Mon, 28 Nov 2005, Mark Kirkwood wrote:
>
> Do these ideas require threads in principle? ISTM that there could be
> (additional) process(es) waiting to perform pre-fetching or async io,
> and we could use the usual IPC machinary to talk between them...
>

Right. I used threads because they make it easy to write a simulation
program :-)

Regards,
Qingqing

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 04:51:30
Message-ID:	36e682920511282051q432bb176r24f4a025b2f16d40@mail.gmail.com
Lists:	pgsql-hackers

FYI, I've personally used Oracle 9.2.0.4's async IO on Linux and have seen
several installations which make use of it also.

On 11/28/05, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> wrote:
>
> On Mon, 28 Nov 2005, Tom Lane wrote:
>
> > Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> > > I haven't had time to prototype whether we can easily implement async
> IO
> >
> > Just as with any suggestion to depend on threads, you are going to have
> > to show results that border on astounding to have any chance of getting
> > this in. Otherwise the portability issues are just going to make it not
> > worth the trouble.
>
> The architecture I am looking at would not rely on threads.
>
> I didn't want to jump on list and waive my hands until I had something to
> show, but since Qingqing is looking at the issue I thought I better raise
> it.
>
> Gavin

From:	Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 04:52:38
Message-ID:	Pine.LNX.4.58.0511282350370.13833@josh.db
Lists:	pgsql-hackers

On Mon, 28 Nov 2005, Gavin Sherry wrote:
>
> MySQL, Oracle and others implement read-ahead threads to simulate async IO
> 'pre-fetching'.

According to my tests on Windows (using the attached program with
enable_aio changed to true), it seems aio doesn't help the way a separate
thread does - but maybe that is because my usage is wrong ...

Regards,
Qingqing

From:	Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 04:58:07
Message-ID:	Pine.LNX.4.58.0511282357080.13833@josh.db
Lists:	pgsql-hackers

On Mon, 28 Nov 2005, Gavin Sherry wrote:
>
> I didn't want to jump on list and waive my hands until I had something to
> show, but since Qingqing is looking at the issue I thought I better raise
> it.
>

Don't worry :-) I separated the logic into a standalone program so that
more people can help with this issue.

Regards,
Qingqing

From:	"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 05:09:11
Message-ID:	dmgno1$21n8$1@news.hub.org
Lists:	pgsql-hackers

"David Boreham" <david_list(at)boreham(dot)org> wrote
>
> BTW, I heard a long time ago that NTFS has quite fancy read-ahead, where
> it attempts to detect the application's access pattern including if it is
> reading sequentially and even if there is a 'stride' to the accesses when
> they're not contiguous. I would imagine that other filesystems attempt
> similar tricks. So one might expect a simple linear prefectch
> to not help much in the presence of such a filesystem.
>

So we need more tests. I understand that current file systems are smart,
but it seems their behavior depends on the interval at which you send the
next file block read request (determined by the cpu_cost parameter in my
program).

I imagine that on a multi-way machine with a strong IO device, the
ice-breaker could do much better ...

> Were you worried about the icebreaker thread getting too far ahead of the
> scan ? If it did it might page out the data you're about to read, I think.
> Of course this could be fixed by having the read ahead thread perodically
> check the current location being read by the query thread and pausing if
> it's got too far ahead.
>

Right.

Regards,
Qingqing

From:	David Boreham <david_list(at)boreham(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 05:20:59
Message-ID:	438BE53B.1090106@boreham.org
Lists:	pgsql-hackers

Qingqing Zhou wrote:

>On Mon, 28 Nov 2005, Gavin Sherry wrote:
>
>
>>MySQL, Oracle and others implement read-ahead threads to simulate async IO
>>'pre-fetching'.
>>
>>
>
>Due to my tests on Windows (using the attached program and change
>enable_aio=true), seems aio doesn't help as a separate thread - but maybe
>because my usage is wrong ...
>
>
I don't think your NT overlapped I/O code is quite right. At least
I think it will issue reads at a high rate without waiting for any of them
to complete. Beyond some point that has to give the kernel gut-rot.
But anyway, I wouldn't expect the use of aio to make any significant
difference in an already threaded test program. The point of aio is to
allow I/O concurrency _without_ the use of threads or multiple processes.
You could re-write your program to have a single thread but use aio.
In that case it should show the same read-ahead benefit that you see
with the thread.
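
A single-threaded rewrite along those lines might be sketched with POSIX
AIO as below. This is an assumption-laden illustration, not code from the
thread: scan_with_aio, count_page and pages_seen are hypothetical names, it
keeps exactly one read in flight (two buffers), and on older glibc it needs
-lrt at link time.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

long pages_seen;        /* updated by the example callback below */

/* Example page processor: just counts the pages it is handed. */
void
count_page(const char *page)
{
    (void) page;
    pages_seen++;
}

/* Queue an asynchronous read of block blk into buf. */
static int
start_read(struct aiocb *cb, int fd, char *buf, long blk)
{
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf = buf;
    cb->aio_nbytes = BLCKSZ;
    cb->aio_offset = (off_t) blk * BLCKSZ;
    return aio_read(cb);
}

/*
 * Scan fd block by block. While block i is being processed, the read of
 * block i+1 is already in flight. Returns the number of full blocks
 * processed, or -1 on error.
 */
long
scan_with_aio(int fd, void (*process)(const char *page))
{
    static char bufs[2][BLCKSZ];
    struct aiocb cb;
    const struct aiocb *list[1] = { &cb };
    long blk = 0;
    int cur = 0;

    assert(process != NULL);
    if (start_read(&cb, fd, bufs[cur], 0) != 0)
        return -1;

    for (;;)
    {
        char *page = bufs[cur];
        ssize_t n;

        /* wait for the read that is currently in flight */
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        if (aio_error(&cb) != 0)
            return -1;
        n = aio_return(&cb);
        if (n < BLCKSZ)
            break;          /* end of file, or a trailing partial block */

        /* queue the next block into the other buffer, then do CPU work */
        cur = 1 - cur;
        if (start_read(&cb, fd, bufs[cur], blk + 1) != 0)
            return -1;
        process(page);
        blk++;
    }
    return blk;
}
```

Unlike the helper thread in seqscan.c, this variant reads directly into the
buffers the scan processes, so it overlaps IO and CPU work without warming
the file system cache separately.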

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 05:25:17
Message-ID:	Pine.LNX.4.58.0511291616390.18662@linuxworld.com.au
Lists:	pgsql-hackers

On Mon, 28 Nov 2005, Qingqing Zhou wrote:

>
>
> On Mon, 28 Nov 2005, Gavin Sherry wrote:
> >
> > MySQL, Oracle and others implement read-ahead threads to simulate async IO
> > 'pre-fetching'.
>
> Due to my tests on Windows (using the attached program and change
> enable_aio=true), seems aio doesn't help as a separate thread - but maybe
> because my usage is wrong ...

Right, I would imagine that it's very close. I intend to use kernel based
async IO so that we can have the prefetch effect of your sample program
without the need for threads.

Thanks,

Gavin

From:	"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 05:55:36
Message-ID:	dmgqf2$2e82$1@news.hub.org
Lists:	pgsql-hackers

"David Boreham" <david_list(at)boreham(dot)org> wrote
>>
> I don't think your NT overlapped I/O code is quite right. At least
> I think it will issue reads at a high rate without waiting for any of them
> to complete. Beyond some point that has to give the kernel gut-rot.
>

[also in reply to Gavin] I looked up "gut-rot" in the dictionary, got it ...
Uh, this behavior is intended - I try to push enough requests to the kernel
in a short time so that it understands that I am doing a sequential scan
and pulls the data from disk into the file system cache more efficiently.
Some file systems may have a "free-behind" mechanism, but our main thread
(which really processes the query) should be fast enough to get to the data
before it vanishes.

>
> You could re-write your program to have a single thread but use aio.
> In that case it should show the same read ahead benefit that you see
> with the thread.
>

I guess this is also Gavin's point - as I understand it, these are two different
methodologies for handling "read-ahead". If no other thread/process is involved,
then the main thread is responsible for grabbing a free buffer page from the
buffer pool and asking the kernel to put the data there by sync IO (as current
PostgreSQL does) or async IO. And that's what I want to avoid. I'd like to
use a dedicated thread/process to "break the ice" only, i.e., pull data from
disk into the file system cache, so that the main thread only issues *logical*
reads.
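
A minimal sketch of the "break the ice" idea in C (not the attached test
program): a thread entry point that reads blocks into a scratch buffer and
throws them away, so the only lasting effect is that the pages land in the
file system cache. IceBreakerArgs and icebreaker_main are hypothetical names,
and BLCKSZ is assumed to be PostgreSQL's default 8192-byte page size.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* assumed: PostgreSQL's default page size */

/* Hypothetical argument block for the warming thread. */
typedef struct IceBreakerArgs
{
    int  fd;                    /* file being scanned */
    long start_blk;             /* first block to warm */
    long nblocks;               /* how many blocks to warm */
    long warmed;                /* out: blocks actually read */
} IceBreakerArgs;

/*
 * Read each block into a scratch buffer and discard it. Nothing is
 * copied into a shared buffer pool; the reads exist only to populate
 * the file system cache ahead of the real scan.
 */
static void *
icebreaker_main(void *arg)
{
    IceBreakerArgs *a = arg;
    char *scratch = malloc(BLCKSZ);

    a->warmed = 0;
    if (scratch == NULL)
        return NULL;

    for (long i = 0; i < a->nblocks; i++)
    {
        off_t   off = (off_t) (a->start_blk + i) * BLCKSZ;
        ssize_t n = pread(a->fd, scratch, BLCKSZ, off);

        if (n <= 0)
            break;              /* EOF or error: stop warming */
        a->warmed++;
    }
    free(scratch);
    return NULL;
}
```

The function has the signature pthread_create() expects, so a scan could
start it with pthread_create(&tid, NULL, icebreaker_main, &args) and then
read the same file through its normal path.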

Regards,
Qingqing

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	Qingqing Zhou <zhouqq(at)cs(dot)toronto(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 11:02:22
Message-ID:	20051129110214.GA31333@svana.org
Lists:	pgsql-hackers

On Tue, Nov 29, 2005 at 02:53:36PM +1100, Gavin Sherry wrote:
> The second idea is using posix async IO at key points within the system
> to better parallelise CPU and IO work. The areas I think we could use
> async IO are: during sequential scans, use async IO to do pre-fetching of
> blocks; inside WAL, begin flushing WAL buffers to disk before we commit;
> and, inside the background writer/check point process, asynchronously
> write out pages and, potentially, asynchronously build new checkpoint segments.

I actually worked on this and got it to the stage where it wouldn't
crash anymore. It basically added a function to bufmgr.c called
PrefetchBuffer() which would initiate a request but not block. I then
hooked a few strategic places to call this. In particular during an
index scan, it would prefetch the next index block and the next few
data blocks and then return them in order as they came in.
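
For illustration only, here is one way a "start the request but don't block"
call can look at the file-descriptor level in C. This is not the patch
described above: it swaps in posix_fadvise() with POSIX_FADV_WILLNEED as the
non-blocking hint, and prefetch_block / prefetch_range are hypothetical
helper names. BLCKSZ is assumed to be the default 8192-byte page size.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192             /* assumed: PostgreSQL's default page size */

/*
 * Hint to the kernel that one block will be needed soon. The call
 * returns without waiting for the data. Like posix_fadvise() itself,
 * it returns 0 on success or an error number on failure.
 */
static int
prefetch_block(int fd, long blkno)
{
    return posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}

/* Hint the next n blocks starting at first; stop at the first error. */
static int
prefetch_range(int fd, long first, int n)
{
    for (int i = 0; i < n; i++)
    {
        int rc = prefetch_block(fd, first + i);

        if (rc != 0)
            return rc;
    }
    return 0;
}
```

An index scan could call prefetch_range() for the next few heap blocks it
knows it will visit and then read them in order as usual.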

Unfortunately I can't really test it at its full potential because it
uses glibc's default POSIX AIO which is *lame*. No more than one
outstanding request per fd which for PostgreSQL is crappy. There was
some evidence that in an index scan of a highly uncorrelated index that
it did make a small difference, but I never got around to testing it
fully. But bitmap scans already hugely reduce the cost of uncorrelated
indexes.

It doesn't pass regression because index_getmulti doesn't do backward
scans. Everything else works though.

If anyone is interested in the code I can send it to them. The results
on my system just weren't good enough to justify a lot more effort.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

From:	Andrew Piskorski <atp(at)piskorski(dot)com>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	David Boreham <david_list(at)boreham(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 15:22:56
Message-ID:	20051129152256.GA16183@tehun.pair.com
Lists:	pgsql-hackers

On Tue, Nov 29, 2005 at 03:14:38PM +1100, Gavin Sherry wrote:
> On Mon, 28 Nov 2005, David Boreham wrote:
> > Gavin Sherry wrote:
> > > MySQL, Oracle and others implement read-ahead threads to simulate async IO
> >
> > I always believed that Oracle used async file I/O. Not that I've seen their

> The paper I linked to seemed to suggest that they weren't using async IO
> in 9.2 -- which is fairly old.

http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf
"Getting Priorities Straight: Improving Linux Support for Database I/O"
by Hall and Bonnet
Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

I think you've misread that paper. AFAICT it neither says nor even
suggests that Oracle 9.2 does not use asynchronous I/O on Linux. In
fact, it seems to strongly suggest exactly the opposite, that Oracle
does use async I/O wherever it can.

Note they also reference this document, which as of 2002 and Linux
kernel 2.4.x, was urging Oracle DBAs to use Oracle's kernel-based
asynchronous I/O support whenever possible:

http://www.ixora.com.au/tips/use_asynchronous_io.htm

What Hall and Bonnet's paper DOES say, is that both Oracle and MySQL
InnoDB appear to use a "conservative" I/O submission policy, but
Oracle does so more efficiently. They also argue that both Oracle and
MySQL fail to utilize the "full potential" of Linux async I/O because
of their conservative submission policies, and that an "aggressive" I/O
submission policy would work better, but only if support for
Prioritized I/O is added to Linux. They then proceed to add that
support, and make some basic changes to InnoDB to partially take
advantage of it.

Also interesting is their casual mention that for RDBMS workloads, the
default Linux 2.6 disk scheduler "anticipatory" is inferior to the
"deadline" scheduler. They base their (simple sounding) Prioritized
I/O support on the deadline scheduler.

--
Andrew Piskorski <atp(at)piskorski(dot)com>
http://www.piskorski.com/

From:	David Boreham <david_list(at)boreham(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 15:42:18
Message-ID:	438C76DA.4060801@boreham.org
Lists:	pgsql-hackers

>Unfortunately I can't really test it at its full potential because it
>uses glibc's default POSIX AIO which is *lame*. No more than one
>outstanding request per fd which for PostgreSQL is crappy. There was
>
>
I had the impression from the kernel aio mailing list a while back that
post-<some kernel version> linux, the POSIX aio calls were forwarded
to the kernel aio interface. Or are you saying that the POSIX API itself
imposes that limitation?

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	David Boreham <david_list(at)boreham(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 16:47:00
Message-ID:	20051129164651.GF31333@svana.org
Lists:	pgsql-hackers

On Tue, Nov 29, 2005 at 08:42:18AM -0700, David Boreham wrote:
>
> >Unfortunately I can't really test it at its full potential because it
> >uses glibc's default POSIX AIO which is *lame*. No more than one
> >outstanding request per fd which for PostgreSQL is crappy. There was
> >
> I had the impression from the kernel aio mailing list a while back
> that post-<some kernel version> linux, the POSIX aio calls were
> forwarded to the kernel aio interface. Or are you saying that the
> POSIX API itself imposes that limitation ?

By default when you use aio you get the version in libc (-lrt IIRC)
which has the issue I mentioned, probably because it's
optimised for the lots-of-network-connections type program where
multiple outstanding requests on a single fd are not meaningful. You
can however link in some other library which gives you kernel support.
However, I don't have a new enough kernel to have the kernel support so
I haven't tested that.

POSIX AIO doesn't prescribe either way.
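
As a sketch of what "more than one outstanding request per fd" means at the
API level, the C fragment below queues several aio_read() calls on a single
descriptor before waiting for any of them. Whether they actually run in
parallel depends on the AIO implementation underneath, which is exactly the
point under discussion. aio_read_blocks, NREQ and BLCKSZ are hypothetical
names chosen for this example; older glibc versions need -lrt at link time.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>

#define BLCKSZ 8192             /* assumed: PostgreSQL's default page size */
#define NREQ   4                /* requests queued on the same fd */

/*
 * Queue NREQ block reads on one fd, then wait for all of them.
 * Returns the total number of bytes read, or -1 if any request failed.
 */
static long
aio_read_blocks(int fd, long first_blk, char bufs[NREQ][BLCKSZ])
{
    struct aiocb cbs[NREQ];
    long         total = 0;
    int          queued = 0;
    int          failed = 0;

    /* Submit everything first; nothing blocks in this loop. */
    for (; queued < NREQ; queued++)
    {
        struct aiocb *cb = &cbs[queued];

        memset(cb, 0, sizeof *cb);
        cb->aio_fildes = fd;
        cb->aio_buf = bufs[queued];
        cb->aio_nbytes = BLCKSZ;
        cb->aio_offset = (off_t) (first_blk + queued) * BLCKSZ;
        cb->aio_sigevent.sigev_notify = SIGEV_NONE;   /* we poll instead */
        if (aio_read(cb) != 0)
            break;
    }
    if (queued < NREQ)
        failed = 1;

    /* Collect every request that was queued, even after a failure. */
    for (int i = 0; i < queued; i++)
    {
        const struct aiocb *one[1] = { &cbs[i] };
        ssize_t             n;

        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(one, 1, NULL);
        n = aio_return(&cbs[i]);
        if (n < 0)
            failed = 1;
        else
            total += n;
    }
    return failed ? -1 : total;
}
```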

From:	David Boreham <david_list(at)boreham(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 17:28:57
Message-ID:	438C8FD9.3090302@boreham.org
Lists:	pgsql-hackers

>By default when you use aio you get the version in libc (-lrt IIRC)
>which has the issue I mentioned, probably because it's
>optimised for the lots-of-network-connections type program where
>multiple outstanding requests on a single fd are not meaningful. You
>can however link in some other library which gives you kernel support.
>However, I don't have a new enough kernel to have the kernel support so
>I haven't tested that.
>
>
Actually, after reading up on the current state of things, I'm not sure you
can even get POSIX aio on top of kernel aio in Linux. There are also a
few limitations in the 2.6 aio implementation that might prove troublesome:
for example it only works with O_DIRECT.

libaio gives userland access to the kernel aio api (which is different
from POSIX aio).

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	David Boreham <david_list(at)boreham(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 17:52:18
Message-ID:	20051129175217.GG31333@svana.org
Lists:	pgsql-hackers

On Tue, Nov 29, 2005 at 10:28:57AM -0700, David Boreham wrote:
> Actually, after reading up on the current state of things, I'm not sure you
> can even get POSIX aio on top of kernel aio in Linux. There are also a
> few limitations in the 2.6 aio implementation that might prove troublesome:
> for example it only works with O_DIRECT.

Which is bizarre because it's semantically equivalent to having a
separate thread doing the read() and sending you a signal when it's
done. What I'm thinking of testing is a join across two large tables so
there is actually more than one outstanding request at a time. But it's
irritating to have to code to a special api...

From:	"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 18:03:32
Message-ID:	dmi53v$1cvr$1@news.hub.org
Lists:	pgsql-hackers

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> wrote
>
> I wrote a program to simulate the sequential scan in PostgreSQL
> with/without ice-broker.
>
> We need more tests
>

If anybody has test results then I'd love to see them ...

Thanks,
Qingqing

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	Andrew Piskorski <atp(at)piskorski(dot)com>
Cc:	David Boreham <david_list(at)boreham(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 21:22:36
Message-ID:	Pine.LNX.4.58.0511300816390.23317@linuxworld.com.au
Lists:	pgsql-hackers

On Tue, 29 Nov 2005, Andrew Piskorski wrote:

> On Tue, Nov 29, 2005 at 03:14:38PM +1100, Gavin Sherry wrote:
> > On Mon, 28 Nov 2005, David Boreham wrote:
> > > Gavin Sherry wrote:
> > > > MySQL, Oracle and others implement read-ahead threads to simulate async IO
> > >
> > > I always believed that Oracle used async file I/O. Not that I've seen their
>
> > The paper I linked to seemed to suggest that they weren't using async IO
> > in 9.2 -- which is fairly old.
>
> http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf
> "Getting Priorities Straight: Improving Linux Support for Database I/O"
> by Hall and Bonnet
> Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005
>
> I think you've misread that paper. AFAICT it neither says nor even
> suggests that Oracle 9.2 does not use asynchronous I/O on Linux. In
> fact, it seems to strongly suggest exactly the opposite, that Oracle
> does use async I/O wherever it can.
>
> Note they also reference this document, which as of 2002 and Linux
> kernel 2.4.x, was urging Oracle DBAs to use Oracle's kernel-based
> asynchronous I/O support whenever possible:
>
> http://www.ixora.com.au/tips/use_asynchronous_io.htm
>
> What Hall and Bonnet's paper DOES say, is that both Oracle and MySQL
> InnoDB appear to use a "conservative" I/O submission policy, but
> Oracle does so more efficiently. They also argue that both Oracle and
> MySQL fail to utilize the "full potential" of Linux async I/O because
> of their conservative submission policies, and that an "aggressive" I/O
> submission policy would work better, but only if support for
> Prioritized I/O is added to Linux. They then proceed to add that
> support, and make some basic changes to InnoDB to partially take
> advantage of it.
>
> Also interesting is their casual mention that for RDBMS workloads, the
> default Linux 2.6 disk scheduler "anticipatory" is inferior to the
> "deadline" scheduler. They base their (simple sounding) Prioritized
> I/O support on the deadline scheduler.
>

Right. I had seemed to recall that they configured Oracle to use a kind of
readahead thread not native async IO, but I am wrong. That's not material
to the discussion at hand.

What we need to find out is if we can easily integrate prefetching into
PostgreSQL for some subset of the work we do, find non-trivial performance
gains and demonstrate it on more than one OS. Ideally, we'd see some
non-trivial gain irrespective of the IO scheduler being used.

Thanks,

Gavin

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	David Boreham <david_list(at)boreham(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 21:30:28
Message-ID:	Pine.LNX.4.58.0511300826120.23317@linuxworld.com.au
Lists:	pgsql-hackers

On Tue, 29 Nov 2005, David Boreham wrote:

>
> >By default when you use aio you get the version in libc (-lrt IIRC)
> >which has the issue I mentioned, probably because it's
> >optimised for the lots-of-network-connections type program where
> >multiple outstanding requests on a single fd are not meaningful. You
> >can however link in some other library which gives you kernel support.
> >However, I don't have a new enough kernel to have the kernel support so
> >I haven't tested that.
> >
> >
> Actually, after reading up on the current state of things, I'm not sure you
> can even get POSIX aio on top of kernel aio in Linux. There are also a
> few limitations in the 2.6 aio implementation that might prove troublesome:
> for example it only works with O_DIRECT.
>
> libaio gives userland access to the kernel aio api (which is different
> from POSIX aio).

Yes. The O_DIRECT issue is my biggest concern about Linux at the moment.
That being said, the plan is to only pre-fetch the next N blocks, where N
< 32, and to read them into the local buffer cache. In a situation where
space in the cache is low (and prefetched pages might be pushed out before we
even get to read them), we need to provide such information to the
readahead mechanism so that it can reduce the number of blocks which it
prefetches.
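
To make the feedback idea concrete, here is a small C sketch of a policy
that shrinks the prefetch window when buffer space is tight. The cap of 31
comes from the "N < 32" figure above; the function name prefetch_window and
the rule of using at most half of the free buffers are assumptions made up
for this example, not part of any existing plan.

```c
#define MAX_PREFETCH 31         /* from the thread: N < 32 */

/*
 * Hypothetical policy: given the number of currently free buffers,
 * return how many blocks the readahead mechanism should prefetch.
 * Using at most half of the free buffers leaves room so prefetched
 * pages are less likely to be evicted before the scan reaches them.
 */
static int
prefetch_window(int free_buffers)
{
    int n = free_buffers / 2;

    if (n < 0)
        n = 0;                  /* defensive: nonsense input */
    if (n > MAX_PREFETCH)
        n = MAX_PREFETCH;
    return n;
}
```

With plenty of free buffers the window stays at the cap; with ten free
buffers it drops to five, and with none it turns prefetching off.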

Gavin

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	David Boreham <david_list(at)boreham(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-29 22:19:27
Message-ID:	1133302767.2906.462.camel@localhost.localdomain
Lists:	pgsql-hackers

On Wed, 2005-11-30 at 08:30 +1100, Gavin Sherry wrote:
> On Tue, 29 Nov 2005, David Boreham wrote:
>
> >
> > >By default when you use aio you get the version in libc (-lrt IIRC)
> > >which has the issue I mentioned, probably because it's
> > >optimised for the lots-of-network-connections type program where
> > >multiple outstanding requests on a single fd are not meaningful. You
> > >can however link in some other library which gives you kernel support.
> > >However, I don't have a new enough kernel to have the kernel support so
> > >I haven't tested that.
> > >
> > >
> > Actually, after reading up on the current state of things, I'm not sure you
> > can even get POSIX aio on top of kernel aio in Linux. There are also a
> > few limitations in the 2.6 aio implementation that might prove troublesome:
> > for example it only works with O_DIRECT.
> >
> > libaio gives userland access to the kernel aio api (which is different
> > from POSIX aio).
>
> Yes. The O_DIRECT issue is my biggest concern about Linux at the moment.
> That being said, the plan is to only pre-fetch the next N blocks, where N
> < 32, and to read them into the local buffer cache. In a situation where
> space in the cache is low (and prefetched pages might be pushed out before we
> even get to read them), we need to provide such information to the
> readahead mechanism so that it can reduce the number of blocks which it
> prefetches.

My understanding was that Linux at least has a reasonable readahead
mechanism that works on the scale you suggest.

I think it's fair to assume that anybody that wants this can afford
sufficient memory to make it worthwhile. Multiple processes per scan
implies (low numbers of users or I/O overkill).

Best Regards, Simon Riggs

From:	David Boreham <david_list(at)boreham(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-30 14:33:18
Message-ID:	438DB82E.7060507@boreham.org
Lists:	pgsql-hackers

>Yes. The O_DIRECT issue is my biggest concern about Linux at the moment.
>That being said, the plan is to only pre-fetch the next N blocks, where N
>< 32, and to read them into the local buffer cache. In a situation where
>space in the cache is low (and prefetched pages might be pushed out before we
>even get to read them), we need to provide such information to the
>readahead mechanism so that it can reduce the number of blocks which it
>prefetches.
>
>
>
>
Would you open a separate handle O_DIRECT, just for the prefetch ?

My experience with O_DIRECT and databases in the past has not been
great: what you gain with being able to control your own caching you lose
(and more) in other ways.

BTW, has anyone tried O_DIRECT and the prefetch idea on Linux ?
I'm wondering if it may not work (because the read data won't get cached
in the fs cache due to O_DIRECT).

From:	David Boreham <david_list(at)boreham(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: ice-broker scan thread
Date:	2005-11-30 14:38:42
Message-ID:	438DB972.5050201@boreham.org
Lists:	pgsql-hackers

Qingqing Zhou wrote:

>[also with reply to Gavin] look up dictionary for "gut-rot", got it ... Uh,
>this behavior is intended - I try to push enough requests shortly to kernel
>so that it understands that I am doing sequential scan, so it would pull the
>data from disk to file system cache more efficiently. Some file systems may
>have "free-behind" mechanism, but our main thread (who really process the
>query) should be fast enough before the data vanished.
>
>
I guess I was concerned that very large numbers of concurrent operations
in flight on the same file handle at the same time might lead to poor
performance or even instability. e.g. the kernel may make long linked
lists, it might create lock contention with itself, that kind of bad
stuff. My thinking being that the kernel wasn't designed with
applications that fire off 10,000 concurrent reads against the same file
in mind.

>I guess this is also Gavin's point - I understand that will be two different
>methodologies to handle "read-ahead". If no other thread/process involved,
>then the main thread will be responsible to grab a free buffer page from
>bufferpool and ask the kernel to put the data there by sync IO (current
>PostgreSQL does) or async IOs. And that's what I want to avoid. I'd like to
>use a dedicated thread/process to "break the ice" only, i.e., pull data from
>disk to file system cache, so that the main thread will only issue *logical*
>read.
>
>
Right, understood. My point was that a thread with sync I/O and the
query thread with async I/O are in fact logically identical. They're
just two different implementation techniques for the same fundamental
functionality. In some cases the non-thread implementation might be
preferred (for example on a platform with no threads).