Re: State of the on-disk bitmap index

Lists: pgsql-hackers
From: Daniel Bausch <bausch(at)dvs(dot)tu-darmstadt(dot)de>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: State of the on-disk bitmap index
Date: 2012-08-16 10:12:38
Message-ID: 502CC796.9020709@dvs.tu-darmstadt.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello Jonah, Simon, and the hackers,

I am going to implement a simple kind of "encoded bitmap indexes" (EBI).
That is an index type where the bitmap columns may not only contain
only a single '1' in the set of bits belonging to a tuple. Instead, an
additional mapping table translates the distinct values of the table
column into a unique encoding. To select for a given value all bitmap
columns must be compared instead of only one. Queries that match
multiple different values (like IN lists or range queries) simplify to
less than the full set of bitmaps that needs to be compared because of
boolean logic. The total number of bitmaps required to represent unique
encodings for all different values is ceil(ld(n)), where n is the number
of distinct values. Compared to normal bitmap indexes this solves the
problem of high-cardinality columns. It is targetet at data warehousing
scenarios with insert only data.

The respective scientific paper can be found at
http://www.dvs.tu-darmstadt.de/publications/pdf/ebi_a4.pdf

I thought, it could be a good idea to base my work on the long proposed
on-disk bitmap index implementation. Regarding to the wiki, you, Jonah
and Simon, were the last devs that touched this thing. Unfortunately I
could not find the patch representing your state of that work. I could
only capture the development history up to Gianni Ciolli & Gabriele
Bartolini from the old pgsql-patches archives. Other people involved
were Jie Zhang, Gavin Sherry, Heikki Linnakangas, and Leonardo F. Are
you and the others still interested in getting this into PG? A rebase
of the most current bitmap index implementation onto master HEAD will be
the first 'byproduct' that I am going to deliver back to you.

1. Is anyone working on this currently?
2. Who has got the most current source code?
3. Is there a git of that or will I need to reconstruct the history from
the patches I collected?

Disclaimer: Although I read and manually processed most of the
executor's code during my last work, I would still call myself a newbie
and therefore will need some assistance most probably.

Regards,
Daniel

--
Daniel Bausch
Wissenschaftlicher Mitarbeiter
Technische Universität Darmstadt
Fachbereich Informatik
Fachgebiet Datenbanken und Verteilte Systeme

Hochschulstraße 10
64289 Darmstadt
Germany

Tel.: +49 6151 16 6706
Fax: +49 6151 16 6229


From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Daniel Bausch *EXTERN*" <bausch(at)dvs(dot)tu-darmstadt(dot)de>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: State of the on-disk bitmap index
Date: 2012-08-20 07:40:19
Message-ID: D960CB61B694CF459DCFB4B0128514C2084EEFA5@exadv11.host.magwien.gv.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Daniel Bausch wrote:
> Hello Jonah, Simon, and the hackers,
>
> I am going to implement a simple kind of "encoded bitmap indexes"
(EBI).
> That is an index type where the bitmap columns may not only contain
> only a single '1' in the set of bits belonging to a tuple. Instead,
an
> additional mapping table translates the distinct values of the table
> column into a unique encoding. To select for a given value all bitmap
> columns must be compared instead of only one. Queries that match
> multiple different values (like IN lists or range queries) simplify to
> less than the full set of bitmaps that needs to be compared because of
> boolean logic. The total number of bitmaps required to represent
unique
> encodings for all different values is ceil(ld(n)), where n is the
number
> of distinct values. Compared to normal bitmap indexes this solves the
> problem of high-cardinality columns. It is targetet at data
warehousing
> scenarios with insert only data.
>
> The respective scientific paper can be found at
> http://www.dvs.tu-darmstadt.de/publications/pdf/ebi_a4.pdf

I cannot answer your questions, but I read the paper and have some
questions myself.

1) As you mention, a WHERE clause that checks for only one value
will be more expensive with an encoded bitmap index than with
a regular bitmap index. If you want to implement encoded bitmap
indexes, wouldn't it be good to also implement regular bitmap
indexes so that the user has a choice?

2) The paper mentions that finding a good encoding and simplifying
bitmap access for a certain query are nontrivial problems.
Moreover, an encoding is good or bad only with respect to
certain queries, which the system does not know at index
creation time.
Do you have any ideas how to approach that?
If not, the paper suggests that, with enough values to check for,
even a non-optimized encoded bitmap index should perform
much better than a normal bitmap index, so maybe that's the way
to go (maybe only encode the NULL value as all zeros).

Yours,
Laurenz Albe


From: Daniel Bausch <bausch(at)dvs(dot)tu-darmstadt(dot)de>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: State of the on-disk bitmap index
Date: 2012-08-20 09:44:42
Message-ID: 5032070A.302@dvs.tu-darmstadt.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am 20.08.2012 09:40, schrieb Albe Laurenz:
> Daniel Bausch wrote:
>> Hello Jonah, Simon, and the hackers,
>>
>> I am going to implement a simple kind of "encoded bitmap indexes"
> (EBI).
>> That is an index type where the bitmap columns may not only contain
>> only a single '1' in the set of bits belonging to a tuple. Instead,
> an
>> additional mapping table translates the distinct values of the table
>> column into a unique encoding. To select for a given value all bitmap
>> columns must be compared instead of only one. Queries that match
>> multiple different values (like IN lists or range queries) simplify to
>> less than the full set of bitmaps that needs to be compared because of
>> boolean logic. The total number of bitmaps required to represent
> unique
>> encodings for all different values is ceil(ld(n)), where n is the
> number
>> of distinct values. Compared to normal bitmap indexes this solves the
>> problem of high-cardinality columns. It is targetet at data
> warehousing
>> scenarios with insert only data.
>>
>> The respective scientific paper can be found at
>> http://www.dvs.tu-darmstadt.de/publications/pdf/ebi_a4.pdf
>
> I cannot answer your questions, but I read the paper and have some
> questions myself.
>
> 1) As you mention, a WHERE clause that checks for only one value
> will be more expensive with an encoded bitmap index than with
> a regular bitmap index. If you want to implement encoded bitmap
> indexes, wouldn't it be good to also implement regular bitmap
> indexes so that the user has a choice?

Sorry if that one was not clear: The first thing, I am going to do, is
to work on the normal bitmap indexes (the one based on the Bizgres
patch). I want to port it to master HEAD and give it back to the
community. After that I want to base my EBI implementation on that.
Eventually, I will publish that implementation, too. (After doing
tuning, experiments, and make sure it works well.)

> 2) The paper mentions that finding a good encoding and simplifying
> bitmap access for a certain query are nontrivial problems.
> Moreover, an encoding is good or bad only with respect to
> certain queries, which the system does not know at index
> creation time.

Actually, I was not involved in writing that paper. I want to use that
idea to show something different. I know of a follow up work by Golam
Rabilul Alam et al. that uses the query history and data mining on that
to optimize for the most common cases. There may be others. A more
detailed discussion of EBI can also be found in:

http://www-old.dvs.informatik.tu-darmstadt.de/staff/wu/query.TR.ps.gz

> Do you have any ideas how to approach that?
> If not, the paper suggests that, with enough values to check for,
> even a non-optimized encoded bitmap index should perform
> much better than a normal bitmap index, so maybe that's the way
> to go (maybe only encode the NULL value as all zeros).

Actually "all zeros" is reserved for "non-existent" (a.k.a. "deleted" or
"invisible").

The thing with the "enough values" is a bit problematic, indeed, because
even a DBA cannot influence how the queries of the user or the user
application look like. You will not use encoded bitmap indexes or
normal bitmap indexes for a column that is usually point accessed like
the ID column. For that you will stick to hash or tree indexes.

Kind regards,
Daniel

--
Daniel Bausch
Wissenschaftlicher Mitarbeiter
Technische Universität Darmstadt
Fachbereich Informatik
Fachgebiet Datenbanken und Verteilte Systeme

Hochschulstraße 10
64289 Darmstadt
Germany

Tel.: +49 6151 16 6706
Fax: +49 6151 16 6229


From: Daniel Bausch <bausch(at)dvs(dot)tu-darmstadt(dot)de>
To: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: State of the on-disk bitmap index
Date: 2012-08-20 10:33:19
Message-ID: 5032126F.50605@dvs.tu-darmstadt.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am 20.08.2012 11:44, schrieb Daniel Bausch:
> Actually, I was not involved in writing that paper. I want to use that
> idea to show something different. I know of a follow up work by Golam
> Rabilul Alam et al. that uses the query history and data mining on that
> to optimize for the most common cases. There may be others. A more
> detailed discussion of EBI can also be found in:
>
> http://www-old.dvs.informatik.tu-darmstadt.de/staff/wu/query.TR.ps.gz

Oops, that was the wrong link. I meant this one:
http://www-old.dvs.informatik.tu-darmstadt.de/staff/wu/bitmap.ps.gz

--
Daniel Bausch
Wissenschaftlicher Mitarbeiter
Technische Universität Darmstadt
Fachbereich Informatik
Fachgebiet Datenbanken und Verteilte Systeme

Hochschulstraße 10
64289 Darmstadt
Germany

Tel.: +49 6151 16 6706
Fax: +49 6151 16 6229