Re: Bit data type header reduction in some cases

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bit data type header reduction in some cases
Date: 2014-02-25 08:00:18
Message-ID: 530C4D92.7010703@vmware.com
Lists: pgsql-hackers

On 02/25/2014 08:23 AM, Haribabu Kommi wrote:
> This concerns the TODO item "Bit data type header reduction" in some
> cases. The header has two parts: 1) the varlena header, which is
> automatically shrunk from 4 bytes to 1 byte for small data; 2) the
> bit-length field called "bit_len", which stores the actual bit length
> and is 4 bytes in size. I want to reduce bit_len to 1 byte in some
> cases, similar to the varlena header. With this change each value in
> the column shrinks by 3 bytes, which reduces disk usage.
>
> I am planning to base the change on the total length of the bit data. If
> the number of bits is less than 0x7F, a 1-byte header can be chosen
> instead of the 4-byte header. When reading the data, the header size
> can be determined in the same way.

Since we're designing a new format, how about using bit padding instead
of an explicit length field? Add one 1-bit to the end of the bit data,
followed by zeros. That's even more compact. See
https://en.wikipedia.org/wiki/Padding_%28cryptography%29#Bit_padding
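
A minimal sketch of the bit-padding idea, not taken from any patch: the data bits are stored MSB-first, a single 1-bit is appended, and the rest of the last byte is zero-filled. The function names (padded_bytes, pad_bits, unpad_bits) and the MSB-first bit order are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes needed for nbits data bits plus the one marker bit. */
size_t padded_bytes(size_t nbits)
{
    return nbits / 8 + 1;
}

/*
 * Copy nbits bits (MSB-first) from src to dst, append a 1-bit and
 * zero-fill the remainder of the last byte. Returns bytes written.
 */
size_t pad_bits(const uint8_t *src, size_t nbits, uint8_t *dst)
{
    size_t   full = nbits / 8;  /* whole data bytes */
    unsigned rem = nbits % 8;   /* data bits in the final byte */
    uint8_t  last = 0;

    assert(src != NULL && dst != NULL);
    memcpy(dst, src, full);
    if (rem > 0)
        last = src[full] & (uint8_t) (0xFF << (8 - rem)); /* keep top rem bits */
    last |= (uint8_t) (0x80 >> rem);                      /* the marker 1-bit */
    dst[full] = last;
    return full + 1;
}

/*
 * Recover the number of data bits from a padded buffer of len bytes.
 * Returns -1 if the last byte holds no marker bit.
 */
long unpad_bits(const uint8_t *buf, size_t len)
{
    uint8_t  last;
    unsigned trailing = 0;

    if (len == 0 || buf[len - 1] == 0)
        return -1;
    last = buf[len - 1];
    while ((last & 1) == 0)     /* count zero bits below the marker */
    {
        last >>= 1;
        trailing++;
    }
    return (long) ((len - 1) * 8 + (7 - trailing));
}
```

In this sketch the bit count is implied by the varlena length plus the position of the last 1-bit, so no separate length field is stored. The cost is one extra byte only when the bit count is an exact multiple of 8.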

> The problem I see is that this can break old databases that have bit
> columns when they are upgraded to a newer version without using
> pg_dumpall. Is there any way to handle this problem? Or is there a
> better way to approach the problem itself? Please let me know your
> suggestions.

On a big-endian system, you could easily use the most-significant bit
(sign bit) of the bit_len field to indicate that it's a new-format datum,
because the byte that holds it is the first byte of the field:

/*
 * Big-endian
 */
typedef struct
{
    int32       vl_len_;
    int32       bit_len;
    bits8       bit_dat[1];
} VarBit_Old;

typedef struct
{
    int32       vl_len_;
    uint8       bit_len;
    bits8       bit_dat[1];     /* var len */
} VarBit_New;

#define IS_NEW_FORMAT(x) (((VarBit_New *) (x))->bit_len & 0x80)

On a little-endian system that's more difficult, because the
most-significant byte would not be the first byte of the field. You could
still make it work if the new format had the length byte in the fourth
byte of the data. Something like this:

/*
 * Little-endian
 */
typedef struct
{
    int32       vl_len_;
    int32       bit_len;
    bits8       bit_dat[1];
} VarBit_Old;

typedef struct
{
    int32       vl_len_;
    bits8       bit_dat_first_three[3];
    uint8       bit_len;
    bits8       bit_dat_rest[1];    /* var len */
} VarBit_New;

#define IS_NEW_FORMAT(x) (((VarBit_New *) (x))->bit_len & 0x80)

That's not a complete solution yet: you also need a special case for a
new-format value with fewer than 4 bytes of bits. Such a value would
presumably not have the bit_len field at the fourth byte, because it
would be shorter than that, but you could distinguish it by looking at
the length in the varsize header. So on little-endian, you'd have one more
format for very small varbits:

/* Little-endian, less than 4 bytes of bits data */
typedef struct
{
    int32       vl_len_;
    uint8       bit_len;
    bits8       bit_dat[1];     /* var len */
} VarBit_New_Tiny;

The same basic idea also works with bit padding.
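
As a rough illustration of the flag test, here is a standalone sketch that works on the four bytes where an old-format datum keeps bit_len. Which of those bytes holds the sign bit depends on the host's byte order, so the sketch locates it at run time instead of hard-coding an offset. All names here (NEW_FORMAT_FLAG, sign_byte_offset, is_new_format, write_old_bit_len, write_new_bit_len) are hypothetical, and the sketch ignores both the varlena header and the special case for values shorter than 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NEW_FORMAT_FLAG 0x80    /* hypothetical name for the marker bit */

/*
 * Offset, within a 4-byte integer, of the byte that holds the sign bit
 * on this host. Assumes a plain big- or little-endian layout.
 */
size_t sign_byte_offset(void)
{
    const uint32_t probe = UINT32_C(0x80000000);
    uint8_t        bytes[4];

    memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x80 ? 0 : 3;
}

/* field points at the 4 bytes where the old format stores bit_len. */
int is_new_format(const uint8_t *field)
{
    return (field[sign_byte_offset()] & NEW_FORMAT_FLAG) != 0;
}

/* Store a 4-byte bit_len the way existing on-disk data has it. */
void write_old_bit_len(uint8_t *field, int32_t bit_len)
{
    assert(bit_len >= 0);       /* so the sign bit is always clear */
    memcpy(field, &bit_len, sizeof bit_len);
}

/* Store a 7-bit length together with the new-format marker. */
void write_new_bit_len(uint8_t *field, uint8_t bit_len)
{
    assert(bit_len < 0x80);
    field[sign_byte_offset()] = (uint8_t) (NEW_FORMAT_FLAG | bit_len);
}
```

Because an old-format bit_len is never negative, its sign bit is always clear, which is what lets the same byte double as the marker for new-format data.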

- Heikki
