querying with index on jsonb slower than standard column. Why?

From: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-07 22:59:38
Message-ID: 5484DBDA.6090405@gmail.com
Lists: pgsql-performance pgsql-sql

I was doing some performance profiling of queries against jsonb
columns and found something I can't explain.
I created json and standard-column versions of some data, indexed the
json 'fields' and the normal columns, and executed equivalent queries
against both.
I find that the json version is quite a bit (approx 3x) slower, which I
can't explain, as both should be (and, according to the plans, are)
working against what I would expect to be equivalent indexes.

Can anyone explain this?

Example code is here:

create table json_test (
id SERIAL,
assay1_ic50 FLOAT,
assay2_ic50 FLOAT,
data JSONB
);

DO
$do$
DECLARE
    val1 FLOAT;
    val2 FLOAT;
BEGIN
    FOR i IN 1..10000000 LOOP
        val1 = random() * 100;
        val2 = random() * 100;
        INSERT INTO json_test (assay1_ic50, assay2_ic50, data) VALUES
            (val1, val2, ('{"assay1_ic50": ' || val1 || ', "assay2_ic50": ' ||
             val2 || ', "mod": "="}')::jsonb);
    END LOOP;
END
$do$;

create index idx_data_json_assay1_ic50 on json_test (((data ->>
'assay1_ic50')::float));
create index idx_data_json_assay2_ic50 on json_test (((data ->>
'assay2_ic50')::float));

create index idx_data_col_assay1_ic50 on json_test (assay1_ic50);
create index idx_data_col_assay2_ic50 on json_test (assay2_ic50);

select count(*) from json_test;
select * from json_test limit 10;

select count(*) from json_test where (data->>'assay1_ic50')::float > 90
and (data->>'assay2_ic50')::float < 10;
select count(*) from json_test where assay1_ic50 > 90 and assay2_ic50 < 10;

Thanks
Tim


From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 00:19:51
Message-ID: 5484EEA7.1030403@aklaver.com
Lists: pgsql-performance pgsql-sql

On 12/07/2014 02:59 PM, Tim Dudgeon wrote:
> I was doing some performance profiling regarding querying against jsonb
> columns and found something I can't explain.
> I created json version and standard column versions of some data, and
> indexed the json 'fields' and the normal columns and executed equivalent
> queries against both.
> I find that the json version is quite a bit (approx 3x) slower which I
> can't explain as both should (and are according to plans are) working
> against what I would expect are equivalent indexes.
>
> Can anyone explain this?

The docs can:

http://www.postgresql.org/docs/9.4/interactive/datatype-json.html#JSON-INDEXING

>
> Example code is here:
>
>
> create table json_test (
> id SERIAL,
> assay1_ic50 FLOAT,
> assay2_ic50 FLOAT,
> data JSONB
> );
>
> DO
> $do$
> DECLARE
> val1 FLOAT;
> val2 FLOAT;
> BEGIN
> for i in 1..10000000 LOOP
> val1 = random() * 100;
> val2 = random() * 100;
> INSERT INTO json_test (assay1_ic50, assay2_ic50, data) VALUES
> (val1, val2, ('{"assay1_ic50": ' || val1 || ', "assay2_ic50": ' ||
> val2 || ', "mod": "="}')::jsonb);
> end LOOP;
> END
> $do$
>
> create index idx_data_json_assay1_ic50 on json_test (((data ->>
> 'assay1_ic50')::float));
> create index idx_data_json_assay2_ic50 on json_test (((data ->>
> 'assay2_ic50')::float));
>
> create index idx_data_col_assay1_ic50 on json_test (assay1_ic50);
> create index idx_data_col_assay2_ic50 on json_test (assay2_ic50);
>
> select count(*) from json_test;
> select * from json_test limit 10;
>
> select count(*) from json_test where (data->>'assay1_ic50')::float > 90
> and (data->>'assay2_ic50')::float < 10;
> select count(*) from json_test where assay1_ic50 > 90 and assay2_ic50 < 10;
>
>
>
> Thanks
> Tim
>
>
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com


From: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 00:43:35
Message-ID: 5484F437.2080402@gmail.com
Lists: pgsql-performance pgsql-sql


On 07/12/2014 21:19, Adrian Klaver wrote:
> On 12/07/2014 02:59 PM, Tim Dudgeon wrote:
>> I was doing some performance profiling regarding querying against jsonb
>> columns and found something I can't explain.
>> I created json version and standard column versions of some data, and
>> indexed the json 'fields' and the normal columns and executed equivalent
>> queries against both.
>> I find that the json version is quite a bit (approx 3x) slower which I
>> can't explain as both should (and are according to plans are) working
>> against what I would expect are equivalent indexes.
>>
>> Can anyone explain this?
>
> The docs can:
>
> http://www.postgresql.org/docs/9.4/interactive/datatype-json.html#JSON-INDEXING
>

If so then I'm missing it.
The index created is not a GIN index; it's a standard btree index on the
data extracted from the json. So the indexes on the standard columns and
the ones on the 'fields' extracted from the json seem to be equivalent,
but perform differently.

Tim
>
>>
>> Example code is here:
>>
>>
>> create table json_test (
>> id SERIAL,
>> assay1_ic50 FLOAT,
>> assay2_ic50 FLOAT,
>> data JSONB
>> );
>>
>> DO
>> $do$
>> DECLARE
>> val1 FLOAT;
>> val2 FLOAT;
>> BEGIN
>> for i in 1..10000000 LOOP
>> val1 = random() * 100;
>> val2 = random() * 100;
>> INSERT INTO json_test (assay1_ic50, assay2_ic50, data) VALUES
>> (val1, val2, ('{"assay1_ic50": ' || val1 || ', "assay2_ic50": ' ||
>> val2 || ', "mod": "="}')::jsonb);
>> end LOOP;
>> END
>> $do$
>>
>> create index idx_data_json_assay1_ic50 on json_test (((data ->>
>> 'assay1_ic50')::float));
>> create index idx_data_json_assay2_ic50 on json_test (((data ->>
>> 'assay2_ic50')::float));
>>
>> create index idx_data_col_assay1_ic50 on json_test (assay1_ic50);
>> create index idx_data_col_assay2_ic50 on json_test (assay2_ic50);
>>
>> select count(*) from json_test;
>> select * from json_test limit 10;
>>
>> select count(*) from json_test where (data->>'assay1_ic50')::float > 90
>> and (data->>'assay2_ic50')::float < 10;
>> select count(*) from json_test where assay1_ic50 > 90 and assay2_ic50
>> < 10;
>>
>>
>>
>> Thanks
>> Tim
>>
>>
>>
>
>


From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 00:53:51
Message-ID: 5484F69F.2030006@aklaver.com
Lists: pgsql-performance pgsql-sql

On 12/07/2014 04:43 PM, Tim Dudgeon wrote:
>
> On 07/12/2014 21:19, Adrian Klaver wrote:
>> On 12/07/2014 02:59 PM, Tim Dudgeon wrote:
>>> I was doing some performance profiling regarding querying against jsonb
>>> columns and found something I can't explain.
>>> I created json version and standard column versions of some data, and
>>> indexed the json 'fields' and the normal columns and executed equivalent
>>> queries against both.
>>> I find that the json version is quite a bit (approx 3x) slower which I
>>> can't explain as both should (and are according to plans are) working
>>> against what I would expect are equivalent indexes.
>>>
>>> Can anyone explain this?
>>
>> The docs can:
>>
>> http://www.postgresql.org/docs/9.4/interactive/datatype-json.html#JSON-INDEXING
>>
>
> If so them I'm missing it.
> The index created is not a gin index. Its a standard btree index on the
> data extracted from the json. So the indexes on the standard columns and
> the ones on the 'fields' extracted from the json seem to be equivalent.
> But perform differently.

Further down in that section there is this:

"jsonb also supports btree and hash indexes. These are usually useful
only if it's important to check equality of complete JSON documents. The
btree ordering for jsonb datums is seldom of great interest, but for
completeness it is:

Object > Array > Boolean > Number > String > Null

Object with n pairs > object with n - 1 pairs

Array with n elements > array with n - 1 elements

Objects with equal numbers of pairs are compared in the order:

key-1, value-1, key-2 ...

Note that object keys are compared in their storage order; in
particular, since shorter keys are stored before longer keys, this can
lead to results that might be unintuitive, such as:

{ "aa": 1, "c": 1} > {"b": 1, "d": 1}

Similarly, arrays with equal numbers of elements are compared in the order:

element-1, element-2 ...

Primitive JSON values are compared using the same comparison rules as
for the underlying PostgreSQL data type. Strings are compared using the
default database collation.
"

As I understand it, to get useful indexing into the jsonb datum
(document) you need to use GIN indexes.
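
For reference, a GIN index over the whole jsonb column would look like the
sketch below. Note as an aside that the default jsonb_ops GIN opclass supports
containment/existence operators such as @>, not the >/< range comparisons used
in the OP's queries, so it would not help here:

```sql
-- Sketch: a GIN index on the whole document (default jsonb_ops opclass).
CREATE INDEX idx_data_gin ON json_test USING gin (data);

-- A containment query like this one can use the GIN index:
SELECT count(*) FROM json_test WHERE data @> '{"mod": "="}';
```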

>
> Tim
>>
>>>
>>> Example code is here:
>>>
>>>
>>> create table json_test (
>>> id SERIAL,
>>> assay1_ic50 FLOAT,
>>> assay2_ic50 FLOAT,
>>> data JSONB
>>> );
>>>
>>> DO
>>> $do$
>>> DECLARE
>>> val1 FLOAT;
>>> val2 FLOAT;
>>> BEGIN
>>> for i in 1..10000000 LOOP
>>> val1 = random() * 100;
>>> val2 = random() * 100;
>>> INSERT INTO json_test (assay1_ic50, assay2_ic50, data) VALUES
>>> (val1, val2, ('{"assay1_ic50": ' || val1 || ', "assay2_ic50": ' ||
>>> val2 || ', "mod": "="}')::jsonb);
>>> end LOOP;
>>> END
>>> $do$
>>>
>>> create index idx_data_json_assay1_ic50 on json_test (((data ->>
>>> 'assay1_ic50')::float));
>>> create index idx_data_json_assay2_ic50 on json_test (((data ->>
>>> 'assay2_ic50')::float));
>>>
>>> create index idx_data_col_assay1_ic50 on json_test (assay1_ic50);
>>> create index idx_data_col_assay2_ic50 on json_test (assay2_ic50);
>>>
>>> select count(*) from json_test;
>>> select * from json_test limit 10;
>>>
>>> select count(*) from json_test where (data->>'assay1_ic50')::float > 90
>>> and (data->>'assay2_ic50')::float < 10;
>>> select count(*) from json_test where assay1_ic50 > 90 and assay2_ic50
>>> < 10;
>>>
>>>
>>>
>>> Thanks
>>> Tim
>>>
>>>
>>>
>>
>>
>
>
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com


From: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 01:05:39
Message-ID: 5484F963.40009@gmail.com
Lists: pgsql-performance pgsql-sql


On 07/12/2014 21:53, Adrian Klaver wrote:
> On 12/07/2014 04:43 PM, Tim Dudgeon wrote:
>>
>> On 07/12/2014 21:19, Adrian Klaver wrote:
>>> On 12/07/2014 02:59 PM, Tim Dudgeon wrote:
>>>> I was doing some performance profiling regarding querying against
>>>> jsonb
>>>> columns and found something I can't explain.
>>>> I created json version and standard column versions of some data, and
>>>> indexed the json 'fields' and the normal columns and executed
>>>> equivalent
>>>> queries against both.
>>>> I find that the json version is quite a bit (approx 3x) slower which I
>>>> can't explain as both should (and are according to plans are) working
>>>> against what I would expect are equivalent indexes.
>>>>
>>>> Can anyone explain this?
>>>
>>> The docs can:
>>>
>>> http://www.postgresql.org/docs/9.4/interactive/datatype-json.html#JSON-INDEXING
>>>
>>>
>>
>> If so them I'm missing it.
>> The index created is not a gin index. Its a standard btree index on the
>> data extracted from the json. So the indexes on the standard columns and
>> the ones on the 'fields' extracted from the json seem to be equivalent.
>> But perform differently.
>
> Down into the section there is this:
>
> "jsonb also supports btree and hash indexes. These are usually useful
> only if it's important to check equality of complete JSON documents.
> The btree ordering for jsonb datums is seldom of great interest, but
> for completeness it is:
>
> Object > Array > Boolean > Number > String > Null
>
> Object with n pairs > object with n - 1 pairs
>
> Array with n elements > array with n - 1 elements
>
> Objects with equal numbers of pairs are compared in the order:
>
> key-1, value-1, key-2 ...
>
> Note that object keys are compared in their storage order; in
> particular, since shorter keys are stored before longer keys, this can
> lead to results that might be unintuitive, such as:
>
> { "aa": 1, "c": 1} > {"b": 1, "d": 1}
>
> Similarly, arrays with equal numbers of elements are compared in the
> order:
>
> element-1, element-2 ...
>
> Primitive JSON values are compared using the same comparison rules as
> for the underlying PostgreSQL data type. Strings are compared using
> the default database collation.
> "
>
> As I understand it to get useful indexing into the jsonb
> datum(document) you need to use the GIN indexes.

Yes, but if my understanding is correct I'm not indexing the JSON, I'm
indexing the PostgreSQL float type extracted from a field of the JSON,
and indexing using a btree index:

create index idx_data_json_assay2_ic50 on json_test (((data ->>
'assay2_ic50')::float));

The data ->> 'assay2_ic50' bit extracts the value from the JSON as text,
the ::float bit casts to a float, and the index is built on the
resulting float type.

And the index is being used, and is reasonably fast, just not as fast as
the equivalent index on the 'normal' float column.
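
As a sketch, one thing worth keeping in mind with expression indexes is that
they are only usable when the query repeats the indexed expression exactly
(same extraction and same cast); the second query below is a hypothetical
variant that would not match:

```sql
-- Same expression and cast as the index definition: can use the index.
EXPLAIN SELECT count(*) FROM json_test
WHERE (data ->> 'assay2_ic50')::float < 10;

-- Different cast, so a different expression: cannot use that index.
EXPLAIN SELECT count(*) FROM json_test
WHERE (data ->> 'assay2_ic50')::numeric < 10;
```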

Tim
>
>>
>> Tim
>>>
>>>>
>>>> Example code is here:
>>>>
>>>>
>>>> create table json_test (
>>>> id SERIAL,
>>>> assay1_ic50 FLOAT,
>>>> assay2_ic50 FLOAT,
>>>> data JSONB
>>>> );
>>>>
>>>> DO
>>>> $do$
>>>> DECLARE
>>>> val1 FLOAT;
>>>> val2 FLOAT;
>>>> BEGIN
>>>> for i in 1..10000000 LOOP
>>>> val1 = random() * 100;
>>>> val2 = random() * 100;
>>>> INSERT INTO json_test (assay1_ic50, assay2_ic50, data) VALUES
>>>> (val1, val2, ('{"assay1_ic50": ' || val1 || ', "assay2_ic50":
>>>> ' ||
>>>> val2 || ', "mod": "="}')::jsonb);
>>>> end LOOP;
>>>> END
>>>> $do$
>>>>
>>>> create index idx_data_json_assay1_ic50 on json_test (((data ->>
>>>> 'assay1_ic50')::float));
>>>> create index idx_data_json_assay2_ic50 on json_test (((data ->>
>>>> 'assay2_ic50')::float));
>>>>
>>>> create index idx_data_col_assay1_ic50 on json_test (assay1_ic50);
>>>> create index idx_data_col_assay2_ic50 on json_test (assay2_ic50);
>>>>
>>>> select count(*) from json_test;
>>>> select * from json_test limit 10;
>>>>
>>>> select count(*) from json_test where (data->>'assay1_ic50')::float
>>>> > 90
>>>> and (data->>'assay2_ic50')::float < 10;
>>>> select count(*) from json_test where assay1_ic50 > 90 and assay2_ic50
>>>> < 10;
>>>>
>>>>
>>>>
>>>> Thanks
>>>> Tim
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>


From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 01:15:17
Message-ID: 5484FBA5.8060902@aklaver.com
Lists: pgsql-performance pgsql-sql

On 12/07/2014 05:05 PM, Tim Dudgeon wrote:
>
> On 07/12/2014 21:53, Adrian Klaver wrote:
>> On 12/07/2014 04:43 PM, Tim Dudgeon wrote:
>>>
>>> On 07/12/2014 21:19, Adrian Klaver wrote:
>>>> On 12/07/2014 02:59 PM, Tim Dudgeon wrote:
>>>>> I was doing some performance profiling regarding querying against
>>>>> jsonb
>>>>> columns and found something I can't explain.
>>>>> I created json version and standard column versions of some data, and
>>>>> indexed the json 'fields' and the normal columns and executed
>>>>> equivalent
>>>>> queries against both.
>>>>> I find that the json version is quite a bit (approx 3x) slower which I
>>>>> can't explain as both should (and are according to plans are) working
>>>>> against what I would expect are equivalent indexes.
>>>>>
>>>>> Can anyone explain this?
>>>>
>>>> The docs can:
>>>>
>>>> http://www.postgresql.org/docs/9.4/interactive/datatype-json.html#JSON-INDEXING
>>>>
>>>>
>>>
>>> If so them I'm missing it.
>>> The index created is not a gin index. Its a standard btree index on the
>>> data extracted from the json. So the indexes on the standard columns and
>>> the ones on the 'fields' extracted from the json seem to be equivalent.
>>> But perform differently.
>>
>> Down into the section there is this:
>>
>> "jsonb also supports btree and hash indexes. These are usually useful
>> only if it's important to check equality of complete JSON documents.
>> The btree ordering for jsonb datums is seldom of great interest, but
>> for completeness it is:
>>
>> Object > Array > Boolean > Number > String > Null
>>
>> Object with n pairs > object with n - 1 pairs
>>
>> Array with n elements > array with n - 1 elements
>>
>> Objects with equal numbers of pairs are compared in the order:
>>
>> key-1, value-1, key-2 ...
>>
>> Note that object keys are compared in their storage order; in
>> particular, since shorter keys are stored before longer keys, this can
>> lead to results that might be unintuitive, such as:
>>
>> { "aa": 1, "c": 1} > {"b": 1, "d": 1}
>>
>> Similarly, arrays with equal numbers of elements are compared in the
>> order:
>>
>> element-1, element-2 ...
>>
>> Primitive JSON values are compared using the same comparison rules as
>> for the underlying PostgreSQL data type. Strings are compared using
>> the default database collation.
>> "
>>
>> As I understand it to get useful indexing into the jsonb
>> datum(document) you need to use the GIN indexes.
>
> Yes, but if my understanding is correct I'm not indexing the JSON, I'm
> indexing the PostgreSQL float type extracted from a field of the JSON,
> and indexing using a btree index:
>
> create index idx_data_json_assay2_ic50 on json_test (((data ->>
> 'assay2_ic50')::float));
>
> The data ->> 'assay2_ic50' bit extracts the value from the JSON as text,

Which is where I would say your slowdown happens. I have not spent a
lot of time on jsonb, as I have been waiting for the dust to settle from
the recent big changes, so my empirical evidence is lacking.
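
One rough way to isolate the cost of the extraction itself is to time both
predicates over plain sequential scans, with index scans disabled for the
session (a sketch; the planner settings are session-local and reset after):

```sql
-- Sketch: compare per-row evaluation cost with no index involved.
SET enable_bitmapscan = off;
SET enable_indexscan = off;

EXPLAIN ANALYZE SELECT count(*) FROM json_test
WHERE (data ->> 'assay1_ic50')::float > 90;  -- jsonb extraction + cast per row

EXPLAIN ANALYZE SELECT count(*) FROM json_test
WHERE assay1_ic50 > 90;                      -- plain column comparison per row

RESET enable_bitmapscan;
RESET enable_indexscan;
```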

> the ::float bit casts to a float, and the index is built on the
> resulting float type.
>
> And the index is being used, and is reasonably fast, just not as fast as
> the equivalent index on the 'normal' float column.
>
> Tim
>>
>>>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 01:28:10
Message-ID: 16147.1418002090@sss.pgh.pa.us
Lists: pgsql-performance pgsql-sql

Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com> writes:
> The index created is not a gin index. Its a standard btree index on the
> data extracted from the json. So the indexes on the standard columns and
> the ones on the 'fields' extracted from the json seem to be equivalent.
> But perform differently.

I don't see any particular difference ...

regression=# explain analyze select count(*) from json_test where (data->>'assay1_ic50')::float > 90
and (data->>'assay2_ic50')::float < 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=341613.79..341613.80 rows=1 width=0) (actual time=901.207..901.208 rows=1 loops=1)
-> Bitmap Heap Scan on json_test (cost=123684.69..338836.02 rows=1111111 width=0) (actual time=497.982..887.128 rows=100690 loops=1)
Recheck Cond: ((((data ->> 'assay2_ic50'::text))::double precision < 10::double precision) AND (((data ->> 'assay1_ic50'::text))::double precision > 90::double precision))
Heap Blocks: exact=77578
-> BitmapAnd (cost=123684.69..123684.69 rows=1111111 width=0) (actual time=476.585..476.585 rows=0 loops=1)
-> Bitmap Index Scan on idx_data_json_assay2_ic50 (cost=0.00..61564.44 rows=3333333 width=0) (actual time=219.287..219.287 rows=999795 loops=1)
Index Cond: (((data ->> 'assay2_ic50'::text))::double precision < 10::double precision)
-> Bitmap Index Scan on idx_data_json_assay1_ic50 (cost=0.00..61564.44 rows=3333333 width=0) (actual time=208.197..208.197 rows=1000231 loops=1)
Index Cond: (((data ->> 'assay1_ic50'::text))::double precision > 90::double precision)
Planning time: 0.128 ms
Execution time: 904.196 ms
(11 rows)

regression=# explain analyze select count(*) from json_test where assay1_ic50 > 90 and assay2_ic50 < 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=197251.24..197251.25 rows=1 width=0) (actual time=895.238..895.238 rows=1 loops=1)
-> Bitmap Heap Scan on json_test (cost=36847.25..197003.24 rows=99197 width=0) (actual time=495.427..881.033 rows=100690 loops=1)
Recheck Cond: ((assay2_ic50 < 10::double precision) AND (assay1_ic50 > 90::double precision))
Heap Blocks: exact=77578
-> BitmapAnd (cost=36847.25..36847.25 rows=99197 width=0) (actual time=474.201..474.201 rows=0 loops=1)
-> Bitmap Index Scan on idx_data_col_assay2_ic50 (cost=0.00..18203.19 rows=985434 width=0) (actual time=219.060..219.060 rows=999795 loops=1)
Index Cond: (assay2_ic50 < 10::double precision)
-> Bitmap Index Scan on idx_data_col_assay1_ic50 (cost=0.00..18594.21 rows=1006637 width=0) (actual time=206.066..206.066 rows=1000231 loops=1)
Index Cond: (assay1_ic50 > 90::double precision)
Planning time: 0.129 ms
Execution time: 898.237 ms
(11 rows)

regression=# \timing
Timing is on.
regression=# select count(*) from json_test where (data->>'assay1_ic50')::float > 90
and (data->>'assay2_ic50')::float < 10;
count
--------
100690
(1 row)

Time: 882.607 ms
regression=# select count(*) from json_test where assay1_ic50 > 90 and assay2_ic50 < 10;
count
--------
100690
(1 row)

Time: 881.071 ms

regards, tom lane


From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 15:31:21
Message-ID: 5485C449.4020204@aklaver.com
Lists: pgsql-performance pgsql-sql

On 12/07/2014 05:28 PM, Tom Lane wrote:
> Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com> writes:
>> The index created is not a gin index. Its a standard btree index on the
>> data extracted from the json. So the indexes on the standard columns and
>> the ones on the 'fields' extracted from the json seem to be equivalent.
>> But perform differently.
>
> I don't see any particular difference ...
>
> regression=# explain analyze select count(*) from json_test where (data->>'assay1_ic50')::float > 90
> and (data->>'assay2_ic50')::float < 10;
> QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=341613.79..341613.80 rows=1 width=0) (actual time=901.207..901.208 rows=1 loops=1)
> -> Bitmap Heap Scan on json_test (cost=123684.69..338836.02 rows=1111111 width=0) (actual time=497.982..887.128 rows=100690 loops=1)
> Recheck Cond: ((((data ->> 'assay2_ic50'::text))::double precision < 10::double precision) AND (((data ->> 'assay1_ic50'::text))::double precision > 90::double precision))
> Heap Blocks: exact=77578
> -> BitmapAnd (cost=123684.69..123684.69 rows=1111111 width=0) (actual time=476.585..476.585 rows=0 loops=1)
> -> Bitmap Index Scan on idx_data_json_assay2_ic50 (cost=0.00..61564.44 rows=3333333 width=0) (actual time=219.287..219.287 rows=999795 loops=1)
> Index Cond: (((data ->> 'assay2_ic50'::text))::double precision < 10::double precision)
> -> Bitmap Index Scan on idx_data_json_assay1_ic50 (cost=0.00..61564.44 rows=3333333 width=0) (actual time=208.197..208.197 rows=1000231 loops=1)
> Index Cond: (((data ->> 'assay1_ic50'::text))::double precision > 90::double precision)
> Planning time: 0.128 ms
> Execution time: 904.196 ms
> (11 rows)
>
> regression=# explain analyze select count(*) from json_test where assay1_ic50 > 90 and assay2_ic50 < 10;
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=197251.24..197251.25 rows=1 width=0) (actual time=895.238..895.238 rows=1 loops=1)
> -> Bitmap Heap Scan on json_test (cost=36847.25..197003.24 rows=99197 width=0) (actual time=495.427..881.033 rows=100690 loops=1)
> Recheck Cond: ((assay2_ic50 < 10::double precision) AND (assay1_ic50 > 90::double precision))
> Heap Blocks: exact=77578
> -> BitmapAnd (cost=36847.25..36847.25 rows=99197 width=0) (actual time=474.201..474.201 rows=0 loops=1)
> -> Bitmap Index Scan on idx_data_col_assay2_ic50 (cost=0.00..18203.19 rows=985434 width=0) (actual time=219.060..219.060 rows=999795 loops=1)
> Index Cond: (assay2_ic50 < 10::double precision)
> -> Bitmap Index Scan on idx_data_col_assay1_ic50 (cost=0.00..18594.21 rows=1006637 width=0) (actual time=206.066..206.066 rows=1000231 loops=1)
> Index Cond: (assay1_ic50 > 90::double precision)
> Planning time: 0.129 ms
> Execution time: 898.237 ms
> (11 rows)
>
> regression=# \timing
> Timing is on.
> regression=# select count(*) from json_test where (data->>'assay1_ic50')::float > 90
> and (data->>'assay2_ic50')::float < 10;
> count
> --------
> 100690
> (1 row)
>
> Time: 882.607 ms
> regression=# select count(*) from json_test where assay1_ic50 > 90 and assay2_ic50 < 10;
> count
> --------
> 100690
> (1 row)
>
> Time: 881.071 ms
>
> regards, tom lane
>
>

Running the above on my machine I do see the slowdown the OP reports. I ran it several times
and it stayed around 3.5x. It might be interesting to get the OS and architecture information
from the OP.

test=# select version();
version
------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.4rc1 on i686-pc-linux-gnu, compiled by gcc (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388], 32-bit
(1 row)

test=# \timing
Timing is on.
test=# select count(*) from json_test where (data->>'assay1_ic50')::float > 90
test-# and (data->>'assay2_ic50')::float < 10;
count
-------
99288
(1 row)

Time: 9092.966 ms

test=# select count(*) from json_test where assay1_ic50 > 90 and assay2_ic50 < 10;
count
-------
99288
(1 row)

Time: 2542.294 ms

explain analyze select count(*) from json_test where (data->>'assay1_ic50')::float > 90
and (data->>'assay2_ic50')::float < 10;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=332209.79..332209.80 rows=1 width=0) (actual time=8980.009..8980.009 rows=1 loops=1)
-> Bitmap Heap Scan on json_test (cost=123684.69..329432.02 rows=1111111 width=0) (actual time=538.688..8960.308 rows=99288 loops=1)
Recheck Cond: ((((data ->> 'assay2_ic50'::text))::double precision < 10::double precision) AND (((data ->> 'assay1_ic50'::text))::double precision > 90::double precision))
Rows Removed by Index Recheck: 7588045
Heap Blocks: exact=20894 lossy=131886
-> BitmapAnd (cost=123684.69..123684.69 rows=1111111 width=0) (actual time=531.066..531.066 rows=0 loops=1)
-> Bitmap Index Scan on idx_data_json_assay2_ic50 (cost=0.00..61564.44 rows=3333333 width=0) (actual time=258.717..258.717 rows=998690 loops=1)
Index Cond: (((data ->> 'assay2_ic50'::text))::double precision < 10::double precision)
-> Bitmap Index Scan on idx_data_json_assay1_ic50 (cost=0.00..61564.44 rows=3333333 width=0) (actual time=251.664..251.664 rows=997880 loops=1)
Index Cond: (((data ->> 'assay1_ic50'::text))::double precision > 90::double precision)
Planning time: 0.391 ms
Execution time: 8980.391 ms
(12 rows)

test=# explain analyze select count(*) from json_test where assay1_ic50 > 90 and assay2_ic50 < 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=196566.38..196566.39 rows=1 width=0) (actual time=2609.545..2609.545 rows=1 loops=1)
-> Bitmap Heap Scan on json_test (cost=37869.00..196304.39 rows=104796 width=0) (actual time=550.273..2590.093 rows=99288 loops=1)
Recheck Cond: ((assay2_ic50 < 10::double precision) AND (assay1_ic50 > 90::double precision))
Rows Removed by Index Recheck: 7588045
Heap Blocks: exact=20894 lossy=131886
-> BitmapAnd (cost=37869.00..37869.00 rows=104796 width=0) (actual time=542.666..542.666 rows=0 loops=1)
-> Bitmap Index Scan on idx_data_col_assay2_ic50 (cost=0.00..18871.73 rows=1021773 width=0) (actual time=263.959..263.959 rows=998690 loops=1)
Index Cond: (assay2_ic50 < 10::double precision)
-> Bitmap Index Scan on idx_data_col_assay1_ic50 (cost=0.00..18944.62 rows=1025624 width=0) (actual time=257.912..257.912 rows=997880 loops=1)
Index Cond: (assay1_ic50 > 90::double precision)
Planning time: 0.834 ms
Execution time: 2609.960 ms
(12 rows)

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 15:46:12
Message-ID: 26002.1418053572@sss.pgh.pa.us
Lists: pgsql-performance pgsql-sql

Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> writes:
> On 12/07/2014 05:28 PM, Tom Lane wrote:
>> I don't see any particular difference ...

> Running the above on my machine I do see the slow down the OP reports. I
> ran it several times and it stayed around 3.5x.

Interesting. A couple of points that might be worth checking:

* I tried this on a 64-bit build, whereas you were evidently using 32-bit.

* The EXPLAIN ANALYZE output shows that my bitmaps didn't go lossy,
whereas yours did. This is likely because I had cranked up work_mem to
make the index builds go faster.

It's not apparent to me why either of those things would have an effect
like this, but *something* weird is happening here.

(Thinks for a bit...) A possible theory, seeing that the majority of the
blocks are lossy in your runs, is that the reduction to lossy form is
making worse choices about which blocks to make lossy in one case than in
the other. I don't remember exactly how those decisions are made.

Another thing that seems odd about your printout is the discrepancy
in planning time ... the two cases have just about the same planning
time for me, but not for you.
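
The lossy-bitmap point could be tested with something like the sketch below
(the work_mem value is an arbitrary assumption; with enough memory the plan
should report only "Heap Blocks: exact=..." with no "lossy=..." component,
removing the recheck of the jsonb expressions against the heap):

```sql
-- Sketch: raise work_mem so the bitmaps stay exact, then compare timings.
SET work_mem = '256MB';
EXPLAIN ANALYZE
SELECT count(*) FROM json_test
WHERE (data ->> 'assay1_ic50')::float > 90
  AND (data ->> 'assay2_ic50')::float < 10;
RESET work_mem;
```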

regards, tom lane


From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 15:50:18
Message-ID: 5485C8BA.7040704@aklaver.com
Lists: pgsql-performance pgsql-sql

On 12/08/2014 07:46 AM, Tom Lane wrote:
> Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> writes:
>> On 12/07/2014 05:28 PM, Tom Lane wrote:
>>> I don't see any particular difference ...
>
>> Running the above on my machine I do see the slowdown the OP reports. I
>> ran it several times and it stayed around 3.5x.
>
> Interesting. A couple of points that might be worth checking:
>
> * I tried this on a 64-bit build, whereas you were evidently using 32-bit.

My laptop is 64-bit, so when I get a chance I will set up the test there
and run it to see what happens.

>
> * The EXPLAIN ANALYZE output shows that my bitmaps didn't go lossy,
> whereas yours did. This is likely because I had cranked up work_mem to
> make the index builds go faster.
>
> It's not apparent to me why either of those things would have an effect
> like this, but *something* weird is happening here.
>
> (Thinks for a bit...) A possible theory, seeing that the majority of the
> blocks are lossy in your runs, is that the reduction to lossy form is
> making worse choices about which blocks to make lossy in one case than in
> the other. I don't remember exactly how those decisions are made.
>
> Another thing that seems odd about your printout is the discrepancy
> in planning time ... the two cases have just about the same planning
> time for me, but not for you.
>
> regards, tom lane
>
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com


From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 16:44:52
Message-ID: 5485D584.8080105@aklaver.com

On 12/08/2014 07:50 AM, Adrian Klaver wrote:
> On 12/08/2014 07:46 AM, Tom Lane wrote:
>> Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> writes:
>>> On 12/07/2014 05:28 PM, Tom Lane wrote:
>>>> I don't see any particular difference ...
>>
>>> Running the above on my machine I do see the slowdown the OP reports. I
>>> ran it several times and it stayed around 3.5x.
>>
>> Interesting. A couple of points that might be worth checking:
>>
>> * I tried this on a 64-bit build, whereas you were evidently using
>> 32-bit.
>
> My laptop is 64-bit, so when I get a chance I will set up the test there
> and run it to see what happens.
>
>>

Seems work_mem is the key:

postgres(at)test=# select version();
version

-------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.4rc1 on x86_64-unknown-linux-gnu, compiled by gcc (SUSE
Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388], 64-bit
(1 row)

The default:

postgres(at)test=# show work_mem ;
work_mem
----------
4MB
(1 row)

postgres(at)test=# \timing
Timing is on.
postgres(at)test=# explain analyze select count(*) from json_test where
(data->>'assay1_ic50')::float > 90
test-# and (data->>'assay2_ic50')::float < 10;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=198713.45..198713.46 rows=1 width=0) (actual
time=8564.799..8564.799 rows=1 loops=1)
-> Bitmap Heap Scan on json_test (cost=36841.42..198465.53
rows=99168 width=0) (actual time=1043.226..8550.183 rows=99781 loops=1)
Recheck Cond: ((((data ->> 'assay1_ic50'::text))::double
precision > 90::double precision) AND (((data ->>
'assay2_ic50'::text))::double precision < 10::double precision))
Rows Removed by Index Recheck: 7236280
Heap Blocks: exact=30252 lossy=131908
-> BitmapAnd (cost=36841.42..36841.42 rows=99168 width=0)
(actual time=1034.738..1034.738 rows=0 loops=1)
-> Bitmap Index Scan on idx_data_json_assay1_ic50
(cost=0.00..18157.96 rows=983136 width=0) (actual time=513.878..513.878
rows=1001237 loops=1)
Index Cond: (((data ->>
'assay1_ic50'::text))::double precision > 90::double precision)
-> Bitmap Index Scan on idx_data_json_assay2_ic50
(cost=0.00..18633.62 rows=1008691 width=0) (actual time=502.396..502.396
rows=1000930 loops=1)
Index Cond: (((data ->>
'assay2_ic50'::text))::double precision < 10::double precision)
Planning time: 121.962 ms
Execution time: 8565.609 ms
(12 rows)

Time: 9110.408 ms
postgres(at)test=# explain analyze select count(*) from json_test where
assay1_ic50 > 90 and assay2_ic50 < 10;

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=197225.91..197225.92 rows=1 width=0) (actual
time=1848.769..1848.769 rows=1 loops=1)
-> Bitmap Heap Scan on json_test (cost=36841.41..196977.99
rows=99168 width=0) (actual time=405.110..1839.299 rows=99781 loops=1)
Recheck Cond: ((assay1_ic50 > 90::double precision) AND
(assay2_ic50 < 10::double precision))
Rows Removed by Index Recheck: 7236280
Heap Blocks: exact=30252 lossy=131908
-> BitmapAnd (cost=36841.41..36841.41 rows=99168 width=0)
(actual time=397.138..397.138 rows=0 loops=1)
-> Bitmap Index Scan on idx_data_col_assay1_ic50
(cost=0.00..18157.96 rows=983136 width=0) (actual time=196.304..196.304
rows=1001237 loops=1)
Index Cond: (assay1_ic50 > 90::double precision)
-> Bitmap Index Scan on idx_data_col_assay2_ic50
(cost=0.00..18633.62 rows=1008691 width=0) (actual time=182.845..182.845
rows=1000930 loops=1)
Index Cond: (assay2_ic50 < 10::double precision)
Planning time: 0.212 ms
Execution time: 1848.814 ms
(12 rows)

Time: 1849.570 ms

****************************************************************************

Set work_mem up:

postgres(at)test=# set work_mem='16MB';
SET
Time: 0.143 ms
postgres(at)test=# show work_mem;
work_mem
----------
16MB
(1 row)

postgres(at)test=# explain analyze select count(*) from json_test where
(data->>'assay1_ic50')::float > 90
and (data->>'assay2_ic50')::float < 10;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=198713.45..198713.46 rows=1 width=0) (actual
time=861.413..861.413 rows=1 loops=1)
-> Bitmap Heap Scan on json_test (cost=36841.42..198465.53
rows=99168 width=0) (actual time=588.969..852.720 rows=99781 loops=1)
Recheck Cond: ((((data ->> 'assay1_ic50'::text))::double
precision > 90::double precision) AND (((data ->>
'assay2_ic50'::text))::double precision < 10::double precision))
Heap Blocks: exact=77216
-> BitmapAnd (cost=36841.42..36841.42 rows=99168 width=0)
(actual time=564.927..564.927 rows=0 loops=1)
-> Bitmap Index Scan on idx_data_json_assay1_ic50
(cost=0.00..18157.96 rows=983136 width=0) (actual time=265.318..265.318
rows=1001237 loops=1)
Index Cond: (((data ->>
'assay1_ic50'::text))::double precision > 90::double precision)
-> Bitmap Index Scan on idx_data_json_assay2_ic50
(cost=0.00..18633.62 rows=1008691 width=0) (actual time=256.225..256.225
rows=1000930 loops=1)
Index Cond: (((data ->>
'assay2_ic50'::text))::double precision < 10::double precision)
Planning time: 0.126 ms
Execution time: 861.453 ms
(11 rows)

Time: 861.965 ms
postgres(at)test=# explain analyze select count(*) from json_test where
assay1_ic50 > 90 and assay2_ic50 < 10;

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=197225.91..197225.92 rows=1 width=0) (actual
time=848.410..848.410 rows=1 loops=1)
-> Bitmap Heap Scan on json_test (cost=36841.41..196977.99
rows=99168 width=0) (actual time=578.360..839.659 rows=99781 loops=1)
Recheck Cond: ((assay1_ic50 > 90::double precision) AND
(assay2_ic50 < 10::double precision))
Heap Blocks: exact=77216
-> BitmapAnd (cost=36841.41..36841.41 rows=99168 width=0)
(actual time=554.387..554.387 rows=0 loops=1)
-> Bitmap Index Scan on idx_data_col_assay1_ic50
(cost=0.00..18157.96 rows=983136 width=0) (actual time=263.961..263.961
rows=1001237 loops=1)
Index Cond: (assay1_ic50 > 90::double precision)
-> Bitmap Index Scan on idx_data_col_assay2_ic50
(cost=0.00..18633.62 rows=1008691 width=0) (actual time=247.268..247.268
rows=1000930 loops=1)
Index Cond: (assay2_ic50 < 10::double precision)
Planning time: 0.128 ms
Execution time: 848.453 ms
(11 rows)

*****************************************************************

Then set it back:

postgres(at)test=# set work_mem='4MB';
SET
Time: 0.213 ms
postgres(at)test=# show work_mem ;
work_mem
----------
4MB
(1 row)

postgres(at)test=# explain analyze select count(*) from json_test where
(data->>'assay1_ic50')::float > 90
and (data->>'assay2_ic50')::float < 10;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=198713.45..198713.46 rows=1 width=0) (actual
time=6607.650..6607.650 rows=1 loops=1)
-> Bitmap Heap Scan on json_test (cost=36841.42..198465.53
rows=99168 width=0) (actual time=400.598..6594.442 rows=99781 loops=1)
Recheck Cond: ((((data ->> 'assay1_ic50'::text))::double
precision > 90::double precision) AND (((data ->>
'assay2_ic50'::text))::double precision < 10::double precision))
Rows Removed by Index Recheck: 7236280
Heap Blocks: exact=30252 lossy=131908
-> BitmapAnd (cost=36841.42..36841.42 rows=99168 width=0)
(actual time=392.622..392.622 rows=0 loops=1)
-> Bitmap Index Scan on idx_data_json_assay1_ic50
(cost=0.00..18157.96 rows=983136 width=0) (actual time=191.598..191.598
rows=1001237 loops=1)
Index Cond: (((data ->>
'assay1_ic50'::text))::double precision > 90::double precision)
-> Bitmap Index Scan on idx_data_json_assay2_ic50
(cost=0.00..18633.62 rows=1008691 width=0) (actual time=183.107..183.107
rows=1000930 loops=1)
Index Cond: (((data ->>
'assay2_ic50'::text))::double precision < 10::double precision)
Planning time: 0.126 ms
Execution time: 6607.692 ms
(12 rows)

Time: 6608.197 ms
postgres(at)test=# explain analyze select count(*) from json_test where
assay1_ic50 > 90 and assay2_ic50 < 10;

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=197225.91..197225.92 rows=1 width=0) (actual
time=1836.383..1836.383 rows=1 loops=1)
-> Bitmap Heap Scan on json_test (cost=36841.41..196977.99
rows=99168 width=0) (actual time=396.414..1826.818 rows=99781 loops=1)
Recheck Cond: ((assay1_ic50 > 90::double precision) AND
(assay2_ic50 < 10::double precision))
Rows Removed by Index Recheck: 7236280
Heap Blocks: exact=30252 lossy=131908
-> BitmapAnd (cost=36841.41..36841.41 rows=99168 width=0)
(actual time=388.498..388.498 rows=0 loops=1)
-> Bitmap Index Scan on idx_data_col_assay1_ic50
(cost=0.00..18157.96 rows=983136 width=0) (actual time=187.928..187.928
rows=1001237 loops=1)
Index Cond: (assay1_ic50 > 90::double precision)
-> Bitmap Index Scan on idx_data_col_assay2_ic50
(cost=0.00..18633.62 rows=1008691 width=0) (actual time=182.743..182.743
rows=1000930 loops=1)
Index Cond: (assay2_ic50 < 10::double precision)
Planning time: 0.109 ms
Execution time: 1836.422 ms
(12 rows)

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 16:56:04
Message-ID: 27877.1418057764@sss.pgh.pa.us

Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> writes:
> Seems work_mem is the key:

Fascinating. So there's some bad behavior in the lossy-bitmap stuff
that's exposed by one case but not the other. The set of heap rows we
actually need to examine is presumably identical in both cases. The
only idea that comes to mind is that the order in which the TIDs get
inserted into the bitmaps might be entirely different between the two
index types. We might have to write it off as bad luck, if the
lossification algorithm doesn't have enough information to do better;
but it seems worth looking into.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [SQL] querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 20:53:09
Message-ID: 3936.1418071989@sss.pgh.pa.us

I wrote:
> Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> writes:
>> Seems work_mem is the key:

> Fascinating. So there's some bad behavior in the lossy-bitmap stuff
> that's exposed by one case but not the other.

Meh. I was overthinking it. A bit of investigation with oprofile exposed
the true cause of the problem: whenever the bitmap goes lossy, we have to
execute the "recheck" condition for each tuple in the page(s) that the
bitmap has a lossy reference to. So in the fast case we are talking about

Recheck Cond: ((assay1_ic50 > 90::double precision) AND (assay2_ic50 < 10::double precision))

which involves little except pulling the float8 values out of the tuple
and executing float8gt and float8lt. In the slow case we have got

Recheck Cond: ((((data ->> 'assay1_ic50'::text))::double precision > 90::double precision) AND (((data ->> 'assay2_ic50'::text))::double precision < 10::double precision))

which means we have to pull the JSONB value out of the tuple, search
it to find the 'assay1_ic50' key, convert the associated value to text
(which is not exactly cheap because *the value is stored as a numeric*),
then reparse that text string into a float8, after which we can use
float8gt. And then probably do an equivalent amount of work on the way
to making the other comparison.

So this says nothing much about the lossy-bitmap code, and a lot about
how the JSONB code isn't very well optimized yet. In particular, the
decision not to provide an operator that could extract a numeric field
without conversion to text is looking pretty bad here.

For reference, the oprofile results down to the 1% level for
the jsonb query:

samples % symbol name
7646 8.1187 get_str_from_var
7055 7.4911 AllocSetAlloc
4447 4.7219 AllocSetCheck
4000 4.2473 BitmapHeapNext
3945 4.1889 lengthCompareJsonbStringValue
3713 3.9425 findJsonbValueFromContainer
3637 3.8618 ExecMakeFunctionResultNoSets
3624 3.8480 hash_search_with_hash_value
3452 3.6654 cstring_to_text
2993 3.1780 slot_deform_tuple
2566 2.7246 jsonb_object_field_text
2225 2.3625 palloc
2176 2.3105 heap_tuple_untoast_attr
1993 2.1162 AllocSetReset
1926 2.0451 findJsonbValueFromContainerLen
1846 1.9601 GetPrivateRefCountEntry
1563 1.6596 float8gt
1486 1.5779 float8in
1477 1.5683 InputFunctionCall
1365 1.4494 getJsonbOffset
1137 1.2073 slot_getattr
1083 1.1500 init_var_from_num
1058 1.1234 ExecEvalConst
1056 1.1213 float8_cmp_internal
1053 1.1181 cstring_to_text_with_len
1032 1.0958 text_to_cstring
988 1.0491 ExecClearTuple
969 1.0289 ResourceOwnerForgetBuffer

and for the other:

samples % symbol name
14010 12.1898 BitmapHeapNext
13479 11.7278 hash_search_with_hash_value
8201 7.1355 GetPrivateRefCountEntry
7524 6.5465 slot_deform_tuple
6091 5.2997 ExecMakeFunctionResultNoSets
4459 3.8797 ExecClearTuple
4456 3.8771 slot_getattr
3876 3.3724 ExecStoreTuple
3112 2.7077 ReleaseBuffer
3086 2.6851 float8_cmp_internal
2890 2.5145 ExecQual
2794 2.4310 HeapTupleSatisfiesMVCC
2737 2.3814 float8gt
2130 1.8533 ExecEvalScalarVarFast
2102 1.8289 IncrBufferRefCount
2100 1.8272 ResourceOwnerForgetBuffer
1896 1.6497 hash_any
1752 1.5244 ResourceOwnerRememberBuffer
1567 1.3634 DatumGetFloat8
1543 1.3425 ExecEvalConst
1486 1.2929 LWLockAcquire
1454 1.2651 _bt_checkkeys
1424 1.2390 check_stack_depth
1374 1.1955 ResourceOwnerEnlargeBuffers
1354 1.1781 pgstat_end_function_usage
1164 1.0128 tbm_iterate
1158 1.0076 CheckForSerializableConflictOut

Just to add insult to injury, this is only counting cycles in postgres
proper; it appears that in the jsonb case 30% of the overall runtime is
spent in strtod() :-(

regards, tom lane


From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 21:14:48
Message-ID: 548614C8.90203@aklaver.com

On 12/08/2014 12:53 PM, Tom Lane wrote:
> I wrote:
>> Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> writes:
>>> Seems work_mem is the key:
>
>> Fascinating. So there's some bad behavior in the lossy-bitmap stuff
>> that's exposed by one case but not the other.
>
> Meh. I was overthinking it. A bit of investigation with oprofile exposed
> the true cause of the problem: whenever the bitmap goes lossy, we have to
> execute the "recheck" condition for each tuple in the page(s) that the
> bitmap has a lossy reference to. So in the fast case we are talking about
>
> Recheck Cond: ((assay1_ic50 > 90::double precision) AND (assay2_ic50 < 10::double precision))
>
> which involves little except pulling the float8 values out of the tuple
> and executing float8gt and float8lt. In the slow case we have got
>
> Recheck Cond: ((((data ->> 'assay1_ic50'::text))::double precision > 90::double precision) AND (((data ->> 'assay2_ic50'::text))::double precision < 10::double precision))
>
> which means we have to pull the JSONB value out of the tuple, search
> it to find the 'assay1_ic50' key, convert the associated value to text
> (which is not exactly cheap because *the value is stored as a numeric*),
> then reparse that text string into a float8, after which we can use
> float8gt. And then probably do an equivalent amount of work on the way
> to making the other comparison.
>
> So this says nothing much about the lossy-bitmap code, and a lot about
> how the JSONB code isn't very well optimized yet. In particular, the
> decision not to provide an operator that could extract a numeric field
> without conversion to text is looking pretty bad here.
>

I think I understand the above.

I redid the test on my 32-bit machine, setting work_mem=16MB, and I got comparable results
to what I saw on the 64-bit machine. So what I am still puzzled by is why work_mem seems
to make the two paths equivalent in time:

Fast case, assay1_ic50 > 90 and assay2_ic50 < 10:
1183.997 ms

Slow case, (data->>'assay1_ic50')::float > 90 and (data->>'assay2_ic50')::float < 10;:
1190.187 ms

>
> regards, tom lane
>
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [SQL] querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 21:22:56
Message-ID: 4746.1418073776@sss.pgh.pa.us

Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> writes:
> I redid the test on my 32-bit machine, setting work_mem=16MB, and I got
> comparable results to what I saw on the 64-bit machine. So what I am
> still puzzled by is why work_mem seems to make the two paths
> equivalent in time:

If work_mem is large enough that we never have to go through
tbm_lossify(), then the recheck condition will never be executed,
so its speed doesn't matter.

(So the near-term workaround for Tim is to raise work_mem when
working with tables of this size.)
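Concretely, that could look like the following (16MB is just the value that avoided lossification in Adrian's runs; the right setting depends on the table and query):

```sql
-- Raise work_mem for this session only, so the bitmaps stay exact
SET work_mem = '16MB';

SELECT count(*) FROM json_test
WHERE (data->>'assay1_ic50')::float > 90
  AND (data->>'assay2_ic50')::float < 10;

RESET work_mem;
```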

regards, tom lane


From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 21:34:38
Message-ID: 5486196E.4080602@aklaver.com

On 12/08/2014 01:22 PM, Tom Lane wrote:
> Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com> writes:
>> I redid the test on my 32-bit machine, setting work_mem=16MB, and I got
>> comparable results to what I saw on the 64-bit machine. So what I am
>> still puzzled by is why work_mem seems to make the two paths
>> equivalent in time:
>
> If work_mem is large enough that we never have to go through
> tbm_lossify(), then the recheck condition will never be executed,
> so its speed doesn't matter.

Aah, peeking into tidbitmap.c is enlightening. Thanks.

>
> (So the near-term workaround for Tim is to raise work_mem when
> working with tables of this size.)
>
> regards, tom lane
>
>

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com


From: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>
To:
Cc: pgsql-sql(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [SQL] querying with index on jsonb slower than standard column. Why?
Date: 2014-12-08 21:39:13
Message-ID: 54861A81.8030509@gmail.com

On 08/12/2014 18:14, Adrian Klaver wrote:
> Recheck Cond: ((((data ->> 'assay1_ic50'::text))::double precision > 90::double precision) AND (((data ->> 'assay2_ic50'::text))::double precision < 10::double precision))
> >
> >which means we have to pull the JSONB value out of the tuple, search
> >it to find the 'assay1_ic50' key, convert the associated value to text
> >(which is not exactly cheap because*the value is stored as a numeric*),
> >then reparse that text string into a float8, after which we can use
> >float8gt. And then probably do an equivalent amount of work on the way
> >to making the other comparison.
> >
> >So this says nothing much about the lossy-bitmap code, and a lot about
> >how the JSONB code isn't very well optimized yet. In particular, the
> >decision not to provide an operator that could extract a numeric field
> >without conversion to text is looking pretty bad here.
Yes, that bit seemed strange to me. As I understand it, the value is stored
internally as numeric, but the only way to access it is as text and then
cast back to numeric.
I *think* this is the only way to do it presently?

Tim


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: Re: [SQL] querying with index on jsonb slower than standard column. Why?
Date: 2014-12-12 21:24:04
Message-ID: 548B5CF4.4030009@agliodbs.com

On 12/08/2014 01:39 PM, Tim Dudgeon wrote:
> On 08/12/2014 18:14, Adrian Klaver wrote:
>> Recheck Cond: ((((data ->> 'assay1_ic50'::text))::double precision > 90::double precision) AND (((data ->> 'assay2_ic50'::text))::double precision < 10::double precision))
>> >
>> > which means we have to pull the JSONB value out of the tuple, search
>> > it to find the 'assay1_ic50' key, convert the associated value to text
>> > (which is not exactly cheap because *the value is stored as a numeric*),
>> > then reparse that text string into a float8, after which we can use
>> > float8gt. And then probably do an equivalent amount of work on the way
>> > to making the other comparison.
>> >
>> > So this says nothing much about the lossy-bitmap code, and a lot about
>> > how the JSONB code isn't very well optimized yet. In particular, the
>> > decision not to provide an operator that could extract a numeric field
>> > without conversion to text is looking pretty bad here.
> Yes, that bit seemed strange to me. As I understand the value is stored
> internally as numeric, but the only way to access it is as text and then
> cast back to numeric.
> I *think* this is the only way to do it presently?

Yeah, I believe the core problem is that Postgres currently doesn't have
any way to have variadic return types from a function which don't match
variadic input types. Returning a value as an actual numeric from JSONB
would require returning a numeric from a function whose input type is
text or json. So a known issue but one which would require a lot of
replumbing to fix.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: Re: [SQL] querying with index on jsonb slower than standard column. Why?
Date: 2014-12-12 21:44:10
Message-ID: 8867.1418420650@sss.pgh.pa.us

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Yeah, I believe the core problem is that Postgres currently doesn't have
> any way to have variadic return types from a function which don't match
> variadic input types. Returning a value as an actual numeric from JSONB
> would require returning a numeric from a function whose input type is
> text or json. So a known issue but one which would require a lot of
> replumbing to fix.

Well, it'd be easy to fix if we were willing to invent distinct operators
depending on which type you wanted out (perhaps ->> for text output as
today, add ->># for numeric output, etc). Doesn't seem terribly nice
from a usability standpoint though.

The usability issue could be fixed by teaching the planner to fold a
construct like (jsonb ->> 'foo')::numeric into (jsonb ->># 'foo').
But I'm not sure how we do that except in a really ugly and ad-hoc
fashion.
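
To illustrate the idea (`->>#` here is purely hypothetical; no such operator exists):

```sql
-- What users write today: extract as text, then reparse as float8
SELECT count(*) FROM json_test
WHERE (data ->> 'assay1_ic50')::float > 90;

-- What the planner could fold it into, avoiding the numeric-to-text
-- round trip (hypothetical operator):
-- SELECT count(*) FROM json_test
-- WHERE (data ->># 'assay1_ic50') > 90;
```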

regards, tom lane


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: [PERFORM] Re: querying with index on jsonb slower than standard column. Why?
Date: 2014-12-12 22:10:29
Message-ID: CAGTBQpZDYYF+UYgFQrKUutGk1pHJ2i0m7DS8sgte0GyZf7A8tg@mail.gmail.com

On Fri, Dec 12, 2014 at 6:44 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The usability issue could be fixed by teaching the planner to fold a
> construct like (jsonb ->> 'foo')::numeric into (jsonb ->># 'foo').
> But I'm not sure how we do that except in a really ugly and ad-hoc
> fashion.

It would be doable if you could have polymorphism on return type, and
teach the planner to interpret (jsonb ->> 'foo')::numeric as the
operator with a numeric return type.

That's an even trickier business, but it could be far more useful and
generically helpful than ->>#.

The tricky part is what to do when the cast is missing.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: Re: [SQL] querying with index on jsonb slower than standard column. Why?
Date: 2014-12-12 23:27:31
Message-ID: 548B79E3.5070301@dunslane.net


On 12/12/2014 04:44 PM, Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> Yeah, I believe the core problem is that Postgres currently doesn't have
>> any way to have variadic return types from a function which don't match
>> variadic input types. Returning a value as an actual numeric from JSONB
>> would require returning a numeric from a function whose input type is
>> text or json. So a known issue but one which would require a lot of
>> replumbing to fix.
> Well, it'd be easy to fix if we were willing to invent distinct operators
> depending on which type you wanted out (perhaps ->> for text output as
> today, add ->># for numeric output, etc).

That was my immediate reaction. Not sure about the operator name. I'd
tentatively suggest -># (taking an int or text argument) and #># taking
a text[] argument, both returning numeric, and erroring out if the value
is a string, boolean, object or array.

> Doesn't seem terribly nice
> from a usability standpoint though.
>
> The usability issue could be fixed by teaching the planner to fold a
> construct like (jsonb ->> 'foo')::numeric into (jsonb ->># 'foo').
> But I'm not sure how we do that except in a really ugly and ad-hoc
> fashion.
>
>

I would be inclined to add the operator and see how cumbersome people
find it. I suspect in many cases it might be sufficient.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: Re: [SQL] querying with index on jsonb slower than standard column. Why?
Date: 2014-12-13 01:20:37
Message-ID: 15863.1418433637@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> On 12/12/2014 04:44 PM, Tom Lane wrote:
>> Well, it'd be easy to fix if we were willing to invent distinct operators
>> depending on which type you wanted out (perhaps ->> for text output as
>> today, add ->># for numeric output, etc).

> That was my immediate reaction. Not sure about the operator name. I'd
> tentatively suggest -># (taking an int or text argument) and #># taking
> a text[] argument, both returning numeric, and erroring out if the value
> is a string, boolean, object or array.

>> The usability issue could be fixed by teaching the planner to fold a
>> construct like (jsonb ->> 'foo')::numeric into (jsonb ->># 'foo').
>> But I'm not sure how we do that except in a really ugly and ad-hoc
>> fashion.

> I would be inclined to add the operator and see how cumbersome people
> find it. I suspect in many cases it might be sufficient.

We can't just add the operator and worry about usability later;
if we're thinking we might want to introduce such an automatic
transformation, we have to be sure the new operator is defined in a
way that allows the transformation to not change any semantics.
What that means in this case is that if (jsonb ->> 'foo')::numeric
would have succeeded, (jsonb ->># 'foo') has to succeed; which means
it'd better be willing to attempt conversion of string values to
numeric, not just throw an error on sight.
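
For instance, since the text path today happily reparses JSON string values, a hypothetical direct-numeric operator would have to do the same:

```sql
-- Both of these succeed today via the text extraction path:
SELECT ('{"x": 1.5}'::jsonb ->> 'x')::numeric;    -- number value
SELECT ('{"x": "1.5"}'::jsonb ->> 'x')::numeric;  -- string value, reparsed
-- So a hypothetical (jsonb ->># 'x') would also have to accept the
-- string case, or the planner transformation would change semantics.
```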

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: Re: [SQL] querying with index on jsonb slower than standard column. Why?
Date: 2014-12-13 03:05:20
Message-ID: 548BACF0.5040909@dunslane.net


On 12/12/2014 08:20 PM, Tom Lane wrote:
> We can't just add the operator and worry about usability later;
> if we're thinking we might want to introduce such an automatic
> transformation, we have to be sure the new operator is defined in a
> way that allows the transformation to not change any semantics.
> What that means in this case is that if (jsonb ->> 'foo')::numeric
> would have succeeded, (jsonb ->># 'foo') has to succeed; which means
> it'd better be willing to attempt conversion of string values to
> numeric, not just throw an error on sight.
>
>

Well, I'm not 100% convinced about the magic transformation being a good
thing.

Json numbers are distinct from strings, and part of the justification
for this is to extract a numeric datum from jsonb exactly as stored, on
performance grounds. So turning around now and having it convert a string
into a number where possible seems to me to be going in the wrong direction.

cheers

andrew


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, Tim Dudgeon <tdudgeon(dot)ml(at)gmail(dot)com>, pgsql-sql(at)postgresql(dot)org, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Re: [SQL] querying with index on jsonb slower than standard column. Why?
Date: 2014-12-13 05:38:41
Message-ID: CAGTBQpYb_yUV5Gy5UXn2tpmrnzn0CwCa5ifKYp1Z-8qRbDiu6A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance pgsql-sql

On Sat, Dec 13, 2014 at 12:05 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> On 12/12/2014 08:20 PM, Tom Lane wrote:
>>
>> We can't just add the operator and worry about usability later;
>> if we're thinking we might want to introduce such an automatic
>> transformation, we have to be sure the new operator is defined in a
>> way that allows the transformation to not change any semantics.
>> What that means in this case is that if (jsonb ->> 'foo')::numeric
>> would have succeeded, (jsonb ->># 'foo') has to succeed; which means
>> it'd better be willing to attempt conversion of string values to
>> numeric, not just throw an error on sight.
>>
>>
>
>
> Well, I'm not 100% convinced about the magic transformation being a good
> thing.
>
> Json numbers are distinct from strings, and part of the justification for
> this is to extract a numeric datum from jsonb exactly as stored, on
> performance grounds. So turning around now and having it convert a string
> into a number where possible seems to me to be going in the wrong direction.

It's still better than redoing the conversion for every row touched.
The niceness of that implementation aside, I don't see how it can be
considered the wrong direction.
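For context, the workaround the thread started from, an expression index over today's cast, already precomputes the conversion for index scans; the proposed operator (shown here hypothetically, it does not exist) would avoid the text round-trip everywhere, including seq scans and rechecks:

```sql
-- Works today: index the text-extraction-plus-cast expression, so the
-- per-row conversion cost is paid at insert/index-build time.
CREATE INDEX idx_data_json_assay1_ic50
    ON json_test (((data ->> 'assay1_ic50')::float));

-- With the hypothetical ->># operator the extraction itself would be
-- binary, with no text intermediate to parse:
--   CREATE INDEX ... ON json_test ((data ->># 'assay1_ic50'));
```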