Re: record identical operator

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: record identical operator
Date: 2013-09-14 18:25:52
Message-ID: 1379183152.69958.YahooMailNeo@web162901.mail.bf1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

> what I am talking about is that
> e.g.: SELECT (ARRAY[1,2,3,NULL])[1:3] = ARRAY[1,2,3];
> obviously should be true.

The patch does not change the behavior of the = operator for any
type under any circumstances.

> But both arrays don't have the same binary representation since
> the former has a null bitmap, the latter not. So, if you had a
> composite type like (int4[]) and would compare that without
> invoking operators you'd return something false in some cases
> because of the null bitmaps.

Not for the = operator.  The new "identical" operator would find
them to not be identical, though.

Since the new operator is only for the record type, I need to wrap
the values in your example:

test=# SELECT row ((ARRAY[1,2,3,NULL])[1:3])::record
test-#      = row (ARRAY[1,2,3])::record;
 ?column?
----------
 t
(1 row)

test=# SELECT row ((ARRAY[1,2,3,NULL])[1:3])::record
test-#    === row (ARRAY[1,2,3])::record;
 ?column?
----------
 f
(1 row)

Or, to borrow from the citext example using arrays:

test=# CREATE TABLE array_table (
test(#   id serial primary key,
test(#   nums int4[]
test(# );
CREATE TABLE
test=# INSERT INTO array_table (nums)
test-#   VALUES ((ARRAY[1,2,3,NULL])[1:3]), (ARRAY[1,2,3]),
test-#          (ARRAY[1,2,3]), (NULL), (NULL);
INSERT 0 5
test=# CREATE MATERIALIZED VIEW array_matview AS
test-#   SELECT * FROM array_table;
SELECT 5
test=# CREATE UNIQUE INDEX array_matview_id
test-#   ON array_matview (id);
CREATE INDEX
test=# select * from array_matview;
 id |  nums   
----+---------
  1 | {1,2,3}
  2 | {1,2,3}
  3 | {1,2,3}
  4 |
  5 |
(5 rows)

Note that the on-disk representation of the row where id = 1
differs from the on-disk representation where id in (2,3), both in
the table and the matview.

test=# SELECT *
test-#   FROM array_matview m
test-#   FULL JOIN array_table t ON (t.id = m.id AND t === m)
test-#   WHERE t.id IS NULL OR m.id IS NULL;
 id | nums | id | nums
----+------+----+------
(0 rows)

... so the query looking for work for the RMVC statement finds
nothing to do.

test=# UPDATE array_table SET nums = (ARRAY[1,2,3,NULL])[1:3]
test-#   WHERE id between 1 and 2;
UPDATE 2

Now we have added an unnecessary bitmap to the on-disk storage of
the value where id = 2.

test=# SELECT *
test-#   FROM array_matview m
test-#   FULL JOIN array_table t ON (t.id = m.id AND t === m)
test-#   WHERE t.id IS NULL OR m.id IS NULL;
 id |  nums   | id |  nums   
----+---------+----+---------
    |         |  2 | {1,2,3}
  2 | {1,2,3} |    |
(2 rows)

... and the query sees that they differ.

test=# REFRESH MATERIALIZED VIEW CONCURRENTLY array_matview;
REFRESH MATERIALIZED VIEW
test=# SELECT *
test-#   FROM array_matview m
test-#   FULL JOIN array_table t ON (t.id = m.id AND t === m)
test-#   WHERE t.id IS NULL OR m.id IS NULL;
 id | nums | id | nums
----+------+----+------
(0 rows)

The REFRESH causes them to match again, and later REFRESH runs
won't see a need to do any work there unless the on-disk
representation changes again.

As far as I can see, we have four choices:

(1)  Never update values that are "equal", even if they appear
different to the users, as was demonstrated with the citext
example.

(2)  Require every data type which can be used in a matview to
implement some new operator or function for "identical".  Perhaps
that could be mitigated to only implementat it if equal values can
have user-visible differences.

(3)  Embed special cases into record identical tests for types
known to allow multiple on-disk representations which have no
user-visible differences.

(4)  Base the need to update a matview column on whether its
on-disk representation is identical to what a new run of the
defining query would generate.  If this causes performance problems
for use of a given type in a matview, one possible solution would
be to modify that particular type to use a canonical format when
storing a value into a record.  For example, storing an array which
has a bitmap of null values even though there are no nulls in the
array could strip the bitmap as it is stored to the record.

Currently we are using (4).  I only included (3) for completeness;
even just typing it as a hypothetical made me want to take a
shower.  (1) seems pretty horrid, too.  (2) isn't evil, exactly,
but any types which allowed user-visible differences in equal
values would not exhibit correct behavior in matviews until and
unless an "identical" operator was added.  It might also perform
noticeably worse than (4).

Option (4) has the advantage of showing correct logical behavior in
all types immediately, and restricting any performance fixes to the
types with the inconsistent storage formats.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2013-09-14 18:37:29 Re: pg_stat_statements: calls under-estimation propagation
Previous Message Abhijit Menon-Sen 2013-09-14 18:14:24 [PATCH] bitmap indexes