PRIVATE columns

Lists: pgsql-hackers
From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Subject: PRIVATE columns
Date: 2012-12-12 18:12:27
Message-ID: CA+U5nMJtFsNdm7fp=s2w07nSFSRKt9yrXmU=g040fOP8pDpEiQ@mail.gmail.com

Currently, ANALYZE collects data on all columns and stores these
samples in pg_statistic where they can be seen via the view pg_stats.

In some cases we have data that is private and we do not wish others
to see it, such as patient names. This becomes more important when we
have row security.

Perhaps that data can be protected, but it would be even better if we
simply didn't store value-revealing statistic data at all. Such
private data is seldom the target of searches, or if it is, it is
mostly evenly distributed anyway.

It would be good if we could collect the overall stats
* NULL fraction
* average width
* ndistinct
yet without storing either the MFVs or histogram.
Doing that would avoid inadvertent leaking of potentially private information.
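
As a rough illustration of the idea, the sketch below computes only the three summary numbers from a sample and keeps no individual values. It is not how ANALYZE is implemented: the function name and the dict layout are hypothetical, width is measured as string length rather than on-disk bytes, and ndistinct is just the distinct count within the sample (the real code extrapolates to the whole table).

```python
def summary_stats(sample):
    """Return value-free summary statistics for a column sample.

    Assumption for this sketch: `sample` is a list whose NULLs are
    represented as None; no MFV list or histogram is produced.
    """
    if not sample:
        return {"null_frac": 0.0, "avg_width": 0.0, "n_distinct": 0}

    non_null = [v for v in sample if v is not None]
    null_frac = 1 - len(non_null) / len(sample)
    # Width is approximated by the length of the text form of each value.
    avg_width = (
        sum(len(str(v)) for v in non_null) / len(non_null) if non_null else 0.0
    )
    # Distinct count within the sample only; no extrapolation is attempted.
    n_distinct = len(set(non_null))
    return {"null_frac": null_frac, "avg_width": avg_width, "n_distinct": n_distinct}


stats = summary_stats(["alice", "bob", None, "bob"])
print(stats)
```

Nothing in the returned dict is a column value, which is the property the proposal is after.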

SET STATISTICS 0
is not enough here, because it simply skips collection of statistics altogether.

To implement this, one way would be to allow

ALTER TABLE foo
ALTER COLUMN foo1 SET STATISTICS PRIVATE;

Or we could use another magic value like -2 to request this case.

(Yes, I am aware we could use a custom datatype with a custom
typanalyze for this, but that breaks other things)

Thoughts?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Subject: Re: PRIVATE columns
Date: 2012-12-12 19:13:04
Message-ID: 50C8D740.5000001@Yahoo.com

On 12/12/2012 1:12 PM, Simon Riggs wrote:
> Currently, ANALYZE collects data on all columns and stores these
> samples in pg_statistic where they can be seen via the view pg_stats.
>
> In some cases we have data that is private and we do not wish others
> to see it, such as patient names. This becomes more important when we
> have row security.
>
> Perhaps that data can be protected, but it would be even better if we
> simply didn't store value-revealing statistic data at all. Such
> private data is seldom the target of searches, or if it is, it is
> mostly evenly distributed anyway.

Would protecting it the same way we protect the passwords in pg_authid
be sufficient?

Jan

>
> It would be good if we could collect the overall stats
> * NULL fraction
> * average width
> * ndistinct
> yet without storing either the MFVs or histogram.
> Doing that would avoid inadvertent leaking of potentially private information.
>
> SET STATISTICS 0
> simply skips collection of statistics altogether
>
> To implement this, one way would be to allow
>
> ALTER TABLE foo
> ALTER COLUMN foo1 SET STATISTICS PRIVATE;
>
> Or we could use another magic value like -2 to request this case.
>
> (Yes, I am aware we could use a custom datatype with a custom
> typanalyze for this, but that breaks other things)
>
> Thoughts?
>

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jan Wieck <JanWieck(at)yahoo(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Subject: Re: PRIVATE columns
Date: 2012-12-12 20:12:26
Message-ID: CA+U5nMLYVzKs=jAnc+Ss_z99F=5dJEu-o0RAzYzdKV9xEfXhEg@mail.gmail.com

On 12 December 2012 19:13, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
> On 12/12/2012 1:12 PM, Simon Riggs wrote:
>>
>> Currently, ANALYZE collects data on all columns and stores these
>> samples in pg_statistic where they can be seen via the view pg_stats.
>>
>> In some cases we have data that is private and we do not wish others
>> to see it, such as patient names. This becomes more important when we
>> have row security.
>>
>> Perhaps that data can be protected, but it would be even better if we
>> simply didn't store value-revealing statistic data at all. Such
>> private data is seldom the target of searches, or if it is, it is
>> mostly evenly distributed anyway.
>
>
> Would protecting it the same way we protect the passwords in pg_authid be
> sufficient?

The user backend does need to be able to access the stats data during
optimization. It's hard to have data accessible and yet impose limits
on the uses to which that can be put. If we have row security on the
table but no equivalent capability on the stats, then we'll have
leakage. e.g. set statistics 10000, ANALYZE, then leak 10000 credit
card numbers.

Selectivity functions are not marked leakproof, nor do people think
they can easily be made so, which means the data might be leaked by
various means: error messages, plan selection, skulduggery, etc.

If it ain't in the bucket, the bucket can't leak it.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Jan Wieck <JanWieck(at)yahoo(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Subject: Re: PRIVATE columns
Date: 2012-12-12 20:41:50
Message-ID: 50C8EC0E.4080109@commandprompt.com


On 12/12/2012 12:12 PM, Simon Riggs wrote:

>> Would protecting it the same way we protect the passwords in pg_authid be
>> sufficient?
>
> The user backend does need to be able to access the stats data during
> optimization. It's hard to have data accessible and yet impose limits
> on the uses to which that can be put. If we have row security on the
> table but no equivalent capability on the stats, then we'll have
> leakage. e.g. set statistics 10000, ANALYZE, then leak 10000 credit
> card numbers.
>
> Selectivity functions are not marked leakproof, nor do people think
> they can easily be made so. Which means the data might be leaked by
> various means through error messages, plan selection, skullduggery
> etc..
>
> If it ain't in the bucket, the bucket can't leak it.
>

I accidentally responded to Simon off-list on this. I understand the
need and think it would be a good thing to have. However, the real
opportunity here is to make statistics invisible to ordinary users. I
can't think of any reason that they need to be visible to the standard
user. Even if it is only that, when we set the statistics private, just
that column becomes non-visible.

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Subject: Re: PRIVATE columns
Date: 2012-12-12 20:57:54
Message-ID: 14642.1355345874@sss.pgh.pa.us

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> Currently, ANALYZE collects data on all columns and stores these
> samples in pg_statistic where they can be seen via the view pg_stats.

Only if you have appropriate privileges.

> In some cases we have data that is private and we do not wish others
> to see it, such as patient names. This becomes more important when we
> have row security.

> Perhaps that data can be protected, but it would be even better if we
> simply didn't store value-revealing statistic data at all.

SET STATISTICS 0 seems like a sufficient solution for people who don't
trust the have_column_privilege() protection in the pg_stats view.

In practice I think this is a waste of time, though. Anyone who can
bypass the view restriction can probably just read the original table.

(I suppose we could consider marking pg_stats as a security_barrier
view to make this even safer. Not sure it's worth the trouble though;
the interesting columns are anyarray so it's hard to do much with them
mechanically.)

> It would be good if we could collect the overall stats
> * NULL fraction
> * average width
> * ndistinct
> yet without storing either the MFVs or histogram.

Do you have any evidence whatsoever that that's worth the trouble?
I'd bet against it. And if we're being paranoid, who's to say that
those numbers couldn't reveal useful data in themselves?

regards, tom lane


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Subject: Re: PRIVATE columns
Date: 2012-12-13 04:03:20
Message-ID: 50C95388.7000107@Yahoo.com

On 12/12/2012 3:12 PM, Simon Riggs wrote:
> On 12 December 2012 19:13, Jan Wieck <JanWieck(at)yahoo(dot)com> wrote:
>> On 12/12/2012 1:12 PM, Simon Riggs wrote:
>>>
>>> Currently, ANALYZE collects data on all columns and stores these
>>> samples in pg_statistic where they can be seen via the view pg_stats.
>>>
>>> In some cases we have data that is private and we do not wish others
>>> to see it, such as patient names. This becomes more important when we
>>> have row security.
>>>
>>> Perhaps that data can be protected, but it would be even better if we
>>> simply didn't store value-revealing statistic data at all. Such
>>> private data is seldom the target of searches, or if it is, it is
>>> mostly evenly distributed anyway.
>>
>>
>> Would protecting it the same way we protect the passwords in pg_authid be
>> sufficient?
>
> The user backend does need to be able to access the stats data during
> optimization. It's hard to have data accessible and yet impose limits
> on the uses to which that can be put. If we have row security on the
> table but no equivalent capability on the stats, then we'll have
> leakage. e.g. set statistics 10000, ANALYZE, then leak 10000 credit
> card numbers.

As with the encrypted password column of pg_authid, I don't see any
reason why the values in the stats columns need to be readable by
anyone but a superuser at all. Do you?

Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Subject: Re: PRIVATE columns
Date: 2012-12-13 09:32:30
Message-ID: CA+U5nM+oTDdT4c_KSGLRJdMsgfvH6B0-N-0G98eVxR9cGms6fg@mail.gmail.com

On 12 December 2012 20:57, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> SET STATISTICS 0 seems like a sufficient solution for people who don't
> trust the have_column_privilege() protection in the pg_stats view.

The point here is that a user may *have* privilege on the column and
have rights to see some, but not all, rows of the table.

But we cannot apply row level security to individual column values, so
neither the row nor column security applies here and it appears there
is a greater level of risk at this point.

> In practice I think this is a waste of time, though. Anyone who can
> bypass the view restriction can probably just read the original table.

Where the row security would apply.

> (I suppose we could consider marking pg_stats as a security_barrier
> view to make this even safer. Not sure it's worth the trouble though;
> the interesting columns are anyarray so it's hard to do much with them
> mechanically.)

I'm trying to respond in useful ways to your statements that row
security might not be very secure.

Please advise.

>> It would be good if we could collect the overall stats
>> * NULL fraction
>> * average width
>> * ndistinct
>> yet without storing either the MFVs or histogram.
>
> Do you have any evidence whatsoever that that's worth the trouble?
> I'd bet against it.

All I can say is that uniformly distributed data that is accessed only
by equality has no need of MFVs or histograms. Much personal data is
so evenly distributed that MFVs and histograms are not worth storing,
though in some cases it isn't. We don't search for credit cards with a
BETWEEN, so estimating the ends of ranges isn't needed.

Yet knowing the number of distinct values is important to ensure that we
use an index scan. Without stats we tend to do a bitmap index scan,
which seems to be significantly more expensive in practice.
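
To make the equality case concrete, here is a minimal sketch of a row estimate that needs only the NULL fraction and ndistinct. It assumes a uniform distribution and a positive ndistinct count; the function name and parameters are hypothetical, and this is a simplification, not the planner's actual selectivity code.

```python
def estimate_equality_rows(reltuples, null_frac, n_distinct):
    """Estimate rows matching `col = constant` with no MFV list or histogram.

    Assumes every non-NULL value is equally frequent, so each distinct
    value accounts for (1 - null_frac) / n_distinct of the table.
    """
    if n_distinct <= 0:
        raise ValueError("n_distinct must be a positive count in this sketch")
    selectivity = (1.0 - null_frac) / n_distinct
    return reltuples * selectivity


# Illustrative numbers only: a table of unique, non-NULL card numbers.
rows = estimate_equality_rows(reltuples=226_768, null_frac=0.0, n_distinct=226_768)
print(f"estimated matching rows: {rows:.2f}")
```

An estimate of about one row per lookup is the kind of number that favours a plain index scan, and it is derived without storing any column value.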

> And if we're being paranoid, who's to say that
> those numbers couldn't reveal useful data in themselves?

I'm talking about privacy. Knowing there are 226,768 credit cards in a
table, 0% of them are NULL and they are on average 16 digits wide
tells me nothing about individual credit card numbers. Same with
patient names.

In edge cases we might infer something more when mixed with some
external knowledge, but that's a matter for the military.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PRIVATE columns
Date: 2012-12-13 09:35:47
Message-ID: CADyhKSWo9YPAyP_s8u==Uvna5qj0JwVAyh5heDyAY+N0tLd_rw@mail.gmail.com

2012/12/12 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> Currently, ANALYZE collects data on all columns and stores these
>> samples in pg_statistic where they can be seen via the view pg_stats.
>
> Only if you have appropriate privileges.
>
>> In some cases we have data that is private and we do not wish others
>> to see it, such as patient names. This becomes more important when we
>> have row security.
>
>> Perhaps that data can be protected, but it would be even better if we
>> simply didn't store value-revealing statistic data at all.
>
> SET STATISTICS 0 seems like a sufficient solution for people who don't
> trust the have_column_privilege() protection in the pg_stats view.
>
> In practice I think this is a waste of time, though. Anyone who can
> bypass the view restriction can probably just read the original table.
>
> (I suppose we could consider marking pg_stats as a security_barrier
> view to make this even safer. Not sure it's worth the trouble though;
> the interesting columns are anyarray so it's hard to do much with them
> mechanically.)
>
I agree with Tom's opinion. Even though pg_stats does not have the
security_barrier flag now, rows for columns the user has no privilege on are
filtered out by have_column_privilege(). That seems to me sufficient
protection against the scenario in which users read sampled contents of
columns they have no privilege on.

Admittedly, it is not sufficient protection once we have row security features;
for example, "SET STATISTICS 1000" on a table with fewer than 1000 rows
will eventually put a full copy of the column into the pg_statistic catalog...
Unlike column privileges, the statistics do not record which rows the sampled
data came from, so it is not feasible to control access based on the user's
privileges. If we try to protect statistical data collected from tables with
row security, one option is to prohibit access to the entries for relations
with row security. On the other hand, that will affect compatibility with
third-party system monitoring tools that assume these statistics are
visible...
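
The "full copy" effect can be shown with a toy model. The sketch below picks evenly spaced histogram boundaries from a sorted sample; it is a simplified stand-in, not the ANALYZE algorithm, and the function name and sample values are made up for illustration.

```python
def histogram_bounds(values, target):
    """Pick up to target + 1 evenly spaced boundary values from a sample.

    Simplified model: when the sample has no more values than there are
    boundaries, every value ends up stored as a boundary.
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    ordered = sorted(values)
    n = len(ordered)
    if n <= target + 1:
        return ordered  # the "histogram" is a complete copy of the column
    step = (n - 1) / target
    return [ordered[round(i * step)] for i in range(target + 1)]


# Made-up values standing in for a small table's sensitive column.
cards = ["4000-0005", "4000-0001", "4000-0004", "4000-0002", "4000-0003"]

print(histogram_bounds(cards, target=2))     # a few boundary values
print(histogram_bounds(cards, target=1000))  # every value in the column
```

With a target larger than the row count, the stored boundaries are the column itself, regardless of which rows a given user would be allowed to see.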

Thanks,
--
KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>