Re: [HACKERS] A Better External Sort?

From: mark(at)mark(dot)mielke(dot)cc
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: "Steinar H(dot) Gunderson" <sgunderson(at)bigfoot(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: [HACKERS] A Better External Sort?
Date: 2005-10-08 13:31:06
Message-ID: 20051008133106.GB23913@mark.mielke.cc
Lists: pgsql-hackers pgsql-performance

On Fri, Oct 07, 2005 at 09:20:59PM -0700, Luke Lonergan wrote:
> On 10/7/05 5:17 PM, "mark(at)mark(dot)mielke(dot)cc" <mark(at)mark(dot)mielke(dot)cc> wrote:
> > On Fri, Oct 07, 2005 at 04:55:28PM -0700, Luke Lonergan wrote:
> >> On 10/5/05 5:12 PM, "Steinar H. Gunderson" <sgunderson(at)bigfoot(dot)com> wrote:
> >>> What? strlen is definitely not in the kernel, and thus won't count as
> >>> system time.
> >> System time on Linux includes time spent in glibc routines.
> > Do you have a reference for this?
> > I believe this statement to be 100% false.
> How about 99%? OK, you're right, I had this confused with the profiling
> problem where glibc routines aren't included in dynamic linked profiles.

Sorry to emphasize the 100%. It wasn't meant to judge you. It was meant
to indicate that I believe 100% of system time is accounted for while a
system call is actually active, which cannot be the case while glibc
code is running in user space.

I believe the way it works is that a periodic timer interrupt
increments a specific integer every time it fires. If it finds
itself within the kernel, it increments the system time for the active
process; if it finds itself outside the kernel, it increments the
user time for the active process.
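
To make that concrete, here is a toy model of the tick accounting as I
described it. It is only a sketch of my understanding, not the actual
kernel code: the function name account_ticks, the HZ value, and the
30/70 split of samples are all made up for illustration.

```python
def account_ticks(samples):
    """Charge one tick to 'system' or 'user' per timer interrupt.

    samples: iterable of booleans, True when the interrupt landed while
    the process was executing inside the kernel (hypothetical input).
    Returns (user_ticks, system_ticks).
    """
    user_ticks = 0
    system_ticks = 0
    for in_kernel in samples:
        if in_kernel:
            system_ticks += 1   # interrupted while in a system call
        else:
            user_ticks += 1     # interrupted in user code, glibc included
    return user_ticks, system_ticks


HZ = 100  # assumed tick rate in ticks per second, for illustration only

# 100 ticks: 30 land inside the kernel, 70 land in user space.
samples = [True] * 30 + [False] * 70
user, system = account_ticks(samples)
print(f"user {user / HZ:.2f}s  system {system / HZ:.2f}s")
```

Under this model a routine like strlen can only ever be charged to user
time, because the tick never finds the process inside the kernel while
it runs.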

> Back to the statements earlier - the output of time had much of time for a
> dd spent in system, which means kernel, so where in the kernel would that be
> exactly?

Not really an expert here. I only play around. At a minimum, there is a
cost to switching from user context to system context and back, and then
filling in the zero bits. There may be other inefficiencies, however.
Perhaps /dev/zero always fills in a whole block (8192 usually) before
allowing the standard file system code to read only one byte.
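
One rough way to look at the per-call cost without dd is to read a fixed
number of bytes from /dev/zero with different block sizes and compare
the user and system deltas reported by os.times(). This is only a
sketch: measure_reads is a name I made up, the byte counts are
arbitrary, and the Python interpreter adds its own user-time overhead
per call, so the numbers will not match dd.

```python
import os


def measure_reads(block_size, total_bytes, path="/dev/zero"):
    """Read total_bytes from path in block_size chunks.

    Returns (calls, user_seconds, system_seconds), where the times are
    the CPU time this process was charged while the loop ran.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        before = os.times()
        calls = 0
        remaining = total_bytes
        while remaining > 0:
            # Each os.read() is one read() system call.
            chunk = os.read(fd, min(block_size, remaining))
            if not chunk:
                break  # unexpected end of file
            remaining -= len(chunk)
            calls += 1
        after = os.times()
    finally:
        os.close(fd)
    return calls, after.user - before.user, after.system - before.system


if __name__ == "__main__":
    for bs in (1, 10, 100, 1000):
        calls, user, system = measure_reads(bs, 200_000)
        print(f"bs={bs:<5} calls={calls:<7} "
              f"user {user:.3f}s  system {system:.3f}s")
```

If the overhead is mostly per call rather than per byte, both columns
should shrink roughly in proportion to the number of calls.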

I dunno.

But, I see this oddity too:

$ time dd if=/dev/zero of=/dev/zero bs=1 count=10000000
10000000+0 records in
10000000+0 records out
dd if=/dev/zero of=/dev/zero bs=1 count=10000000 4.05s user 11.13s system 94% cpu 16.061 total

$ time dd if=/dev/zero of=/dev/zero bs=10 count=1000000
1000000+0 records in
1000000+0 records out
dd if=/dev/zero of=/dev/zero bs=10 count=1000000 0.37s user 1.37s system 100% cpu 1.738 total

From my numbers, it looks like 1 byte reads are hard in both the user context
and the system context. It looks almost linear, even:

$ time dd if=/dev/zero of=/dev/zero bs=100 count=100000
100000+0 records in
100000+0 records out
dd if=/dev/zero of=/dev/zero bs=100 count=100000 0.04s user 0.15s system 95% cpu 0.199 total

$ time dd if=/dev/zero of=/dev/zero bs=1000 count=10000
10000+0 records in
10000+0 records out
dd if=/dev/zero of=/dev/zero bs=1000 count=10000 0.01s user 0.02s system 140% cpu 0.021 total
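
To check the "almost linear" claim, the four runs above can be divided
out into a cost per record (each dd record being one read() plus one
write()). The figures are copied from the output above; the helper name
per_record_us is just for illustration. Keep in mind that time only
reports to 0.01s, so the last two rows are very coarse.

```python
# (block size, record count, user seconds, system seconds) from the runs above
runs = [
    (1, 10_000_000, 4.05, 11.13),
    (10, 1_000_000, 0.37, 1.37),
    (100, 100_000, 0.04, 0.15),
    (1000, 10_000, 0.01, 0.02),
]


def per_record_us(count, seconds):
    """Average cost of one dd record in microseconds."""
    return seconds / count * 1e6


for bs, count, user, system in runs:
    print(f"bs={bs:<5} user {per_record_us(count, user):.2f} us/record  "
          f"system {per_record_us(count, system):.2f} us/record")
```

The system cost comes out between roughly 1 and 2 microseconds per
record across all four block sizes, so the total tracks the number of
calls much more closely than the number of bytes moved.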

At least some of this gets into the very in-depth discussions as to
whether kernel threads or user threads are more efficient. Depending
on the application, user threads can switch many times faster than
kernel threads. Other parts of this may just mean that /dev/zero isn't
implemented optimally.

Cheers,
mark

--
mark(at)mielke(dot)cc / markm(at)ncf(dot)ca / markm(at)nortel(dot)com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/
