Questions/observations about set_ps_display ()

Lists: pgsql-hackers
From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Phantom Command ID
Date: 2006-09-20 18:27:27
Message-ID: 4511880F.6030102@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Per discussion on reducing heap tuple header, I've started to work on
the phantom cid idea.

I'm thinking of having an array of cmin,cmax pairs, indexed by phantom
cid number. Looking up cmin,cmax of a phantom id is then a simple array
lookup. To allow reusing phantom cids, we have a hash table that allows
looking up a phantomid by cmin,cmax pair.

A big question is, do we need to implement spilling to disk?

With the above data structures, each phantom cid is going to take 28
bytes of backend-private memory [See math below]. Transactions that
actually need phantom cids are not that common, but I suppose that
applications that make heavy use of plpgsql functions or do a lot of
repeated UPDATES of same rows might need millions.

[quick sizing math]
array element = sizeof(cmin) + sizeof(cmax) = 4 + 4 = 8
hash table key + data + hash element overhead = sizeof(cmin) +
sizeof(cmax) + sizeof(phantomcid) + sizeof(HASHELEMENT) = 20
Total: 28 bytes (or 32 if MAXALIGN is 8-bytes)

this excludes overhead of hash table buckets etc.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-20 20:02:00
Message-ID: 18025.1158782520@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> A big question is, do we need to implement spilling to disk?

My thought is no, at least not in the first cut ... this is something
that can be added later if it proves critical, and right at the moment
my guess is that it never will. The data structure design sounds fine.

regards, tom lane


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-20 20:13:04
Message-ID: 20060920201304.GW28987@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 20, 2006 at 04:02:00PM -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> > A big question is, do we need to implement spilling to disk?
>
> My thought is no, at least not in the first cut ... this is something
> that can be added later if it proves critical, and right at the moment
> my guess is that it never will. The data structure design sounds fine.

What would the failure mode be? Would we just keep going until the box
ran out of memory? I think it'd be better to have some kind of hard
limit so that a single backend can't grind a production server into a
swap-storm. (Arguably, not having a limit is exposing a DoS
vulnerability).
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-20 20:22:47
Message-ID: 18367.1158783767@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Jim C. Nasby" <jim(at)nasby(dot)net> writes:
> What would the failure mode be? Would we just keep going until the box
> ran out of memory? I think it'd be better to have some kind of hard
> limit so that a single backend can't grind a production server into a
> swap-storm. (Arguably, not having a limit is exposing a DoS
> vulnerability).

[ shrug... ] If we tried to guarantee such a thing we'd be putting
arbitrary limits into hundreds if not thousands of different bits of the
backend. I think the correct answer for an admin who is worried about
such a thing is to make sure that the process ulimit is a sufficiently
small fraction of the machine's available RAM. Only if we can't
gracefully handle running up against ulimit is it our problem (hence,
we have a stack-size overflow check, but not any such thing for data size).

regards, tom lane


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-20 21:30:57
Message-ID: 20060920213056.GA28987@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Sep 20, 2006 at 04:22:47PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim(at)nasby(dot)net> writes:
> > What would the failure mode be? Would we just keep going until the box
> > ran out of memory? I think it'd be better to have some kind of hard
> > limit so that a single backend can't grind a production server into a
> > swap-storm. (Arguably, not having a limit is exposing a DoS
> > vulnerability).
>
> [ shrug... ] If we tried to guarantee such a thing we'd be putting
> arbitrary limits into hundreds if not thousands of different bits of the
> backend. I think the correct answer for an admin who is worried about
> such a thing is to make sure that the process ulimit is a sufficiently
> small fraction of the machine's available RAM. Only if we can't
> gracefully handle running up against ulimit is it our problem (hence,
> we have a stack-size overflow check, but not any such thing for data size).

I didn't realize we had a lot of ways a backend could run a machine out
of memory, or at least ways that didn't have some kind of limit (ie:
work_mem). Are any of them very easy to run into?
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-20 21:43:40
Message-ID: 19549.1158788620@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Jim C. Nasby" <jim(at)nasby(dot)net> writes:
> I didn't realize we had a lot of ways a backend could run a machine out
> of memory, or at least ways that didn't have some kind of limit (ie:
> work_mem). Are any of them very easy to run into?

work_mem has nothing to do with trying to guarantee "no swapping DoS".
If it did, it wouldn't be USERSET, and it wouldn't be per query step.
The fact is that ulimit does what you want in that regard already;
why should we try to reinvent that wheel?

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-21 09:46:22
Message-ID: 45125F6E.2010108@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>
>> A big question is, do we need to implement spilling to disk?
>>
>
> My thought is no, at least not in the first cut ... this is something
> that can be added later if it proves critical, and right at the moment
> my guess is that it never will. The data structure design sounds fine.
>

I thought so too.

We could also limit the size of the hash table, which takes up most of
the memory, and only keep the latest phantom cids there. Presumably, if
current command id is 1000, you're not likely to set cmax to 500 on any
tuple in that transaction anymore.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-21 11:00:56
Message-ID: 451270E8.5090905@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Another question is, what should cmin and cmax system columns return? If
we overlay cmin and cmax, cmin and cmax on-disk will always be the same
value. And with phantom cids, it wouldn't be meaningful outside the
inserting/deleting transaction.

The options that I can think of are:

1. Only return cmin and cmax when they mean something, that is within
the inserting / deleting transaction. This is not good if you want to
use them for debugging (and what other use do they have?)

2. Cmin and cmax return the value that's stored on disk, whether or not
they make sense.

3. Remove cmin and cmax system columns to avoid confusion, and replace
them with cminmax, that returns what's on disk.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-21 13:52:36
Message-ID: 29290.1158846756@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> We could also limit the size of the hash table, which takes up most of
> the memory, and only keep the latest phantom cids there. Presumably, if
> current command id is 1000, you're not likely to set cmax to 500 on any
> tuple in that transaction anymore.

The downside of that though is that if you did generate any such, you'd
assign a fresh (duplicate) phantom cid --- so you're bloating the array
in exchange for reducing the hash size.

It is quite easy to have current command counter much greater than the
CID of a still-live command: consider for example an UPDATE that is
firing triggers as it goes, and each trigger executes some queries.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-21 14:48:58
Message-ID: 387.1158850138@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Another question is, what should cmin and cmax system columns return?

If we're going to fool with these, I'd like to renew the suggestion I
made awhile back that none of the system columns should have explicit
entries in pg_attribute, but rather their lookup should be special-cased
in the parser. And whatever we do with cmin/cmax, the infomask should
become exposed as well.

> 2. Cmin and cmax return the value that's stored on disk, whether or not
> they make sense.

This is pretty much the approach we've been taking with the past overlay
hacks --- what is returned is not always what you might expect from the
column header. I think this is tolerable as long as the infomask can be
examined to determine what's really being shown, but it's probably not
the cleanest way.

> 3. Remove cmin and cmax system columns to avoid confusion, and replace
> them with cminmax, that returns what's on disk.

Don't forget it could be xvac or "cphantom" too ;-). I think I agree
with this approach but not that particular name exactly. I'm inclined
to suggest that we just continue to use "cmin" for the field --- "cmax"
could be dropped or become an alias for "cmin".

A fourth possibility is to abandon the rule that these columns never
read as null, and to have them show their contents when meaningful
(as determined by infomask) and null otherwise. However, then we'd have
to support all of cmin, cmax, cphantom, and xvac in order to ensure that
we always have a column that can show the on-disk value.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-26 10:31:08
Message-ID: 4519016C.3040308@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Another question is, what should cmin and cmax system columns return?
>
> If we're going to fool with these, I'd like to renew the suggestion I
> made awhile back that none of the system columns should have explicit
> entries in pg_attribute, but rather their lookup should be special-cased
> in the parser. And whatever we do with cmin/cmax, the infomask should
> become exposed as well.

I just looked back at that discussion in the archives
(http://archives.postgresql.org/pgsql-hackers/2005-02/msg00615.php).
What was the original reason for the proposal? Space savings?

We could rename pg_attribute as pg_userattribute, and remove all the
system attributes from that. To stay backwards-compatible, we could have
a pg_attribute view on top of that contained the system attributes as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-26 11:19:44
Message-ID: 22980.1159269584@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> If we're going to fool with these, I'd like to renew the suggestion I
>> made awhile back that none of the system columns should have explicit
>> entries in pg_attribute, but rather their lookup should be special-cased
>> in the parser.

> What was the original reason for the proposal? Space savings?

Partly that, and partly that it'd make it much easier to alter the set
of system attributes.

> We could rename pg_attribute as pg_userattribute, and remove all the
> system attributes from that. To stay backwards-compatible, we could have
> a pg_attribute view on top of that contained the system attributes as well.

I don't really think this is necessary. How many client programs have
you seen that don't explicitly exclude attnum<0 anyway? The places that
will need work are inside the backend, and a view won't help them.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-26 11:35:54
Message-ID: 4519109A.60804@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
>> We could rename pg_attribute as pg_userattribute, and remove all the
>> system attributes from that. To stay backwards-compatible, we could have
>> a pg_attribute view on top of that contained the system attributes as well.
>>
>
> I don't really think this is necessary. How many client programs have
> you seen that don't explicitly exclude attnum<0 anyway? The places that
> will need work are inside the backend, and a view won't help them.
>

None, there probably isn't any client programs like that. It would be
nice for programs to be able to discover what system attributes there
is, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Phantom Command ID
Date: 2006-09-26 15:13:29
Message-ID: 20060926151329.GS19827@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Sep 26, 2006 at 12:35:54PM +0100, Heikki Linnakangas wrote:
> Tom Lane wrote:
> >>We could rename pg_attribute as pg_userattribute, and remove all the
> >>system attributes from that. To stay backwards-compatible, we could have
> >>a pg_attribute view on top of that contained the system attributes as
> >>well.
> >>
> >
> >I don't really think this is necessary. How many client programs have
> >you seen that don't explicitly exclude attnum<0 anyway? The places that
> >will need work are inside the backend, and a view won't help them.
> >
>
> None, there probably isn't any client programs like that. It would be
> nice for programs to be able to discover what system attributes there
> is, though.

+1; we need to have some way for users to find that info out, and I
can't think of a better way than pg_attribute.

If we want to create a set of views that are more human friendly I'm all
for it (it's why we started the newsysviews project afterall), but I
don't know if y'all want to open that can of worms back up.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: "Strong, David" <david(dot)strong(at)unisys(dot)com>
To: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Questions/observations about set_ps_display ()
Date: 2006-09-26 15:59:24
Message-ID: B6419AF36AC8524082E1BC17DA2506E803101769@USMV-EXCH2.na.uis.unisys.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

We're looking for some advice and/or comments.

During performance and scalability testing with 8.1.4 and also 8.2Beta1,
OProfile reported that the strncpy () C library call was taking a large
amount of CPU time while we were running one of our benchmarks.

We traced a partial benchmark run using ltrace and collected statistics
on all the strncpy () calls that were being made. We noticed that a
number of the calls were similar to the following:

strncpy (0x80808080, "BIND", 2500);

This usage of strncpy () can be traced back to the set_ps_display ()
function call. We could probably turn off set_ps_display (), but it's a
useful tool.

The specification for strncpy () indicates that when the length of the
source string (4 bytes) is less than the length of the number of bytes
to copy (2500 bytes), the remainder of the destination string will be
padded with NULL bytes (2496). Based on the GLIBC source code, strncpy
() is written to pad the destination string a byte at a time and this is
where all the time was taken, according to OProfile.

We only have access to SLES 9 SP3 and RHEL 4 U3 based environments. To
assess the performance improvement, we have replaced the strncpy () call
in set_ps_display () with a strcpy () call, as this seemed safe for our
environments. We're seeing ~3% performance improvement. However, we
realize that our environments do not represent all the environments
where Postgres is available.

The NULL padding side effect of strncpy () does mean that the process
status buffer would be cleared of any previous output. We are not sure
if there might be any security concerns when replacing the strncpy ()
with strcpy ().

A strncpy () call could still be used with a calculated length for the
PS activity, rather than the environment size. Although, this might
incur a penalty for using strlen () against the activity and perhaps a
bounds check to make sure it would fit into the process status buffer.

Are there any implications using PS_USE_CHANGE_ARGV on Linux rather than
PS_USE_CLOBBER_ARGV, as this seems to only use a 256 byte buffer for the
process status buffer?

We're wondering if a patch worth pursuing? There could be issues with
different platforms and the performance gain is very narrow. However, we
thought we would pass this information on.

Thanks

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Strong, David" <david(dot)strong(at)unisys(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Questions/observations about set_ps_display ()
Date: 2006-09-26 16:23:09
Message-ID: 26756.1159287789@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Strong, David" <david(dot)strong(at)unisys(dot)com> writes:
> The specification for strncpy () indicates that when the length of the
> source string (4 bytes) is less than the length of the number of bytes
> to copy (2500 bytes), the remainder of the destination string will be
> padded with NULL bytes (2496). Based on the GLIBC source code, strncpy
> () is written to pad the destination string a byte at a time and this is
> where all the time was taken, according to OProfile.

Hm. In the PS_USE_CLOBBER_ARGV case, this is pretty silly considering
we're going to MemSet the rest of the space anyway. We should probably
replace the StrNCpy with something that doesn't uselessly fill the rest
of the buffer, perhaps something like
memcpy(dest, src, Min(strlen(dest) + 1, avail space));

I believe that most of our uses of StrNCpy actually do not expect the
pad-out behavior, so maybe there are other uses for something along
this line ...

regards, tom lane