Re: temporal support patch

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Vlad Arkhipov <arhipov(at)dc(dot)baikal(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: temporal support patch
Date: 2012-08-19 22:28:32
Message-ID: 1345415312.20987.56.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, 2012-06-25 at 17:46 +0900, Vlad Arkhipov wrote:
> It's not sufficient to store only a period of validity for a row. If two
> transactions started in the same time change the same record, you have a
> problem with TSTZRANGE type because it's normalized to empty interval.

That's an interesting point.

Let's say you tried setting it to [T1, T2) where T1 is the time of the
last transaction to update it and T2 is the time of the current
transaction. If T2 <= T1, then TSTZRANGE will throw an error, not store
the empty interval.

And we don't want to store the empty interval, because it would be a
lie. There could have been some transaction T3 that happened during T2
that saw the value from T1, so saying that there were no times where
that was visible to the system is false. Throwing an error allows you to
retry T2, which should allow a microsecond or so to pass, and the
problem should resolve itself (assuming your clock didn't move
backwards, which is a different issue).

We could also argue about the start versus end times of transactions,
and snapshot acquisition times, because that could cause confusion if
there are long-running transactions. It might be a good reason to store
the modifying transaction ID as well, but then you get into transaction
wraparound problems.

> The other issue is how to handle multiple changes of the same record
> within the transaction. Should they be stored or not?

In a typical audit log, I don't see any reason to. The internals of a
transaction should be implementation details; invisible to the outside,
right?

> Also it's necessary to store some kind of operation type that was
> applied to the record (insert/update/delete). For example, there is a
> table with one record with validity period [0, ) and value 'A'.
>
> First way
> 1. Delete this record in time 1, now there is [0, 1), A in the history
> table.
> 2. Insert a new record in time 1, now there is [0, 1), A in the history
> table and [1, ), B record in the current data table.
>
> Second way
> 1. Update this record in time 1, now there is [0, 1), A in the history
> table and [1, ), B record in the current data table.
>
> So you have the same data in the tables but the actions that led to this
> configuration were different and the history has been lost partly.

Right. Those are yet more possible options that people might want for an
audit log.

> > * There is other useful information that could be recorded, such as the
> > user who inserted/updated/deleted the record.
> I'm not sure that the database user is the proper thing to be stored in
> the history table. Many applications usually connect to a database using
> some virtual user and have their own users/roles tables to handle with
> privileges. There should be some way to substitute the stored user in
> the history table with the application's one. It's also helpful to store
> transaction id that inserted/updated/deleted the record.

If the system is recording it for audit purposes, then it better be sure
that it's true. You can't allow the application to pick and choose what
gets stored there.

While it may be true that many applications just all use the same DB
user, if you want an audit log that includes user information you have
to let the DB do some authentication.

> It's a great proposal but seems to be impossible to implement with
> triggers only solution, isn't it? Is there any kind of hooks on ALTER
> TABLE ... in PostgreSQL to update changed columns bitmaps when table
> structure changes?

Column numbers are never reused, so I think it would be stable. But if
you do need to be notified of schema changes, the new event triggers
mechanism may be able to do that.

> In SQL2011 there is only one table with the all data, historical and
> current. So it's not very convenient to specifiy WHERE condition on
> system time everywhere and for all tables in the query. By default only
> the current data is selected with a query like SELECT * FROM table.

If there is some syntax that offers a convenient shorthand for WHERE,
that's fine with me. Or using two tables, one called foo and one called
foo_history, is also fine. But I don't want the DML syntax to introduce
new mechanisms that aren't available without the fancy syntax (though
new DDL arrangements might be fine).

Regards,
Jeff Davis

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2012-08-20 03:13:37 Re: gistchoose vs. bloat
Previous Message Alexander Korotkov 2012-08-19 21:31:36 Re: Statistics and selectivity estimation for ranges