Re: How to implement Gin method?

Lists: pgsql-hackers
From: Kenji uno <ku(at)digitaldolphins(dot)jp>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: How to implement Gin method?
Date: 2013-07-07 01:00:16
Message-ID: 201307070059.r670xw8F028361@mail.digitaldolphins.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi.

I want to try GIN and know programming information of GIN technology.

Please teach me about these functions extractValue, extractQuery and consistent.

I have posted question at stack overflow.

http://stackoverflow.com/questions/17489703/postgresql-how-to-implement-gin

Please help my question.

Thanks

Kenji uno

Windows Phoneから送信


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Kenji uno <ku(at)digitaldolphins(dot)jp>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: How to implement Gin method?
Date: 2013-07-07 09:05:05
Message-ID: 20130707090504.GA10149@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jul 07, 2013 at 10:00:16AM +0900, Kenji uno wrote:
> Hi.
>
> I want to try GIN and know programming information of GIN technology.
>
> Please teach me about these functions extractValue, extractQuery and consistent.
>
> I have posted question at stack overflow.
>
> http://stackoverflow.com/questions/17489703/postgresql-how-to-implement-gin

The documentation refers to the authors pages:
http://www.sai.msu.su/~megera/wiki/Gin

Did they help at all?

Also, GIN cannot be just applied to anything. It works to be able to
index certain types of which are difficult any other way, like
full-text search. If you give some idea of what you'd like to index
then we can give an idea of what the functions should do.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: kenji uno <ku(at)digitaldolphins(dot)jp>
To: kleptog(at)svana(dot)org
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: How to implement Gin method?
Date: 2013-07-08 06:21:09
Message-ID: 201307080620.r686KoYI030408@mail.digitaldolphins.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi.

Ok, ok thanks.

My problem is to shorten time of searching full text stored in text field.

The table definition is like following:

CREATE TABLE xxx
(
...
title character varying,
...
fts1body text,
...
)

If user requests keywords, we use a kind of "stristr" that is targeting Japanese text encoded in UTF-8.

"aaa bbb ccc" [Click here to search!]

SELECT * FROM xxx
WHERE TRUE
AND (ddstrike(title,'aaa') OR ddstrike(fts1body,'aaa') OR ...)
AND (ddstrike(title,'bbb') OR ddstrike(fts1body,'bbb') OR ...)
AND ...

As you can imagine easily, yes, it is very slow!

So I need trial and error for speeding up.

My trial is "Insert a light weight filter done by integer key, before text searching!"

For example,
filter('A') -> 1
filter('B') -> 2
filter('C') -> 4
filter('AAABBC') -> 7 or {1,2,4}

It may fit to inverse index like GIN!

So I began to study GIN.

I'm sorry to say. Today I found I could apply int array GIN support at contrib/_int.sql.

I made GIN index.

CREATE INDEX xxx_idx_filter ON xxx USING GIN (filter(fts1body) gist__int_ops);

The following sample query is very very fast! 11065 hits in 22 milli secs (total 215,278 records).

SELECT COUNT(*) FROM xxx WHERE filter(fts1body) @> filter('ABC');

However the following query is very slow! 9,400ms. It uses "Seq Scan" lol.

SELECT * FROM xxx
WHERE TRUE
AND (ddstrike(title,'ABC') OR (filter(fts1body) @> filter('AAA') AND ddstrike(fts1body,'AAA')))

Apply filter to "title" column too.

The best query result costs 3,700ms. 18 hits. It surely uses expected query plan: two "Bitmap index scan" -> "Bitmap Or" -> "Bitmap Heap Scan".

SELECT * FROM xxx
WHERE TRUE
AND (filter(title) @> filter('ABC') OR filter(fts1body) @> filter('ABC'))
AND (ddstrike(title,'ABC') OR ddstrike(fts1body,'ABC'))

The pure query costs 3,800ms. 18 hits. Single "Seq Scan".

SELECT * FROM xxx
WHERE TRUE
AND (ddstrike(title,'ABC') OR ddstrike(fts1body,'ABC'))

Finally I noticed I had spent useless time, and need to find another good one.

Sorry.

However, I may think good idea which uses inverted index.

So I want to know...
- the actual work of extractQuery and consistant.
- the detail interface of extractValue/extractQuery/consistant. It may help understanding.

I looked at contrib/_int.sql of PG8.2.22

There are definitions of int[] GIN support.

---
CREATE OPERATOR CLASS gin__int_ops
FOR TYPE _int4 USING gin
AS
OPERATOR 3 &&,
OPERATOR 6 = (anyarray, anyarray) RECHECK,
OPERATOR 7 @>,
OPERATOR 8 <@ RECHECK,
OPERATOR 13 @,
OPERATOR 14 ~ RECHECK,
OPERATOR 20 @@ (_int4, query_int),
FUNCTION 1 btint4cmp (int4, int4),
FUNCTION 2 ginarrayextract (anyarray, internal),
FUNCTION 3 ginint4_queryextract (internal, internal, int2),
FUNCTION 4 ginint4_consistent (internal, int2, internal),
STORAGE int4;
---

I checked the PG8.2.22 source code.

Both ginint4_queryextract and ginint4_consistent assume that "query" argument is a PGARRAY (ArrayType *). Where is it decided? Is it array of STORAGE type?

Both extractQuery(ginint4_queryextract) and extractValue(ginarrayextract) seem to return similar value type. They return Datum array of int4. Is it array of STORAGE type?

I want to understand the overview of GIN extension.

Thanks

kenji uno


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: kenji uno <ku(at)digitaldolphins(dot)jp>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: How to implement Gin method?
Date: 2013-07-08 19:56:39
Message-ID: 20130708195638.GC15254@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jul 08, 2013 at 03:21:09PM +0900, kenji uno wrote:
> Hi.
>
> Ok, ok thanks.
>
> My problem is to shorten time of searching full text stored in text field.

Ok, your explanation of your problem really helps, thanks.

> However the following query is very slow! 9,400ms. It uses "Seq Scan" lol.
>
> SELECT * FROM xxx
> WHERE TRUE
> AND (ddstrike(title,'ABC') OR (filter(fts1body) @> filter('AAA') AND ddstrike(fts1body,'AAA')))

Well, in this case it still needs to scan the whole table to search the
title obviously.

> Apply filter to "title" column too.
>
> The best query result costs 3,700ms. 18 hits. It surely uses expected query plan: two "Bitmap index scan" -> "Bitmap Or" -> "Bitmap Heap Scan".
>
> SELECT * FROM xxx
> WHERE TRUE
> AND (filter(title) @> filter('ABC') OR filter(fts1body) @> filter('ABC'))
> AND (ddstrike(title,'ABC') OR ddstrike(fts1body,'ABC'))

It would be useful to see the "explain analyze" of this query. Note that
looking up 11,000 entries in an index could very take as long as
sequentially scanning the whole table.

> However, I may think good idea which uses inverted index.

I think your above idea is a good one, but you need to work out why
your above implementation didn't work out and why you think
implementing it directly will be better.
>
> So I want to know...
> - the actual work of extractQuery and consistant.
> - the detail interface of extractValue/extractQuery/consistant. It may help understanding.
>
> I looked at contrib/_int.sql of PG8.2.22

Whoa, very old version, please look at something newer. For example the
RECHECK flag below is no longer used.

> There are definitions of int[] GIN support.
> ---
> CREATE OPERATOR CLASS gin__int_ops
> FOR TYPE _int4 USING gin
> AS
> OPERATOR 3 &&,
> OPERATOR 6 = (anyarray, anyarray) RECHECK,
> OPERATOR 7 @>,
> OPERATOR 8 <@ RECHECK,
> OPERATOR 13 @,
> OPERATOR 14 ~ RECHECK,
> OPERATOR 20 @@ (_int4, query_int),
> FUNCTION 1 btint4cmp (int4, int4),
> FUNCTION 2 ginarrayextract (anyarray, internal),
> FUNCTION 3 ginint4_queryextract (internal, internal, int2),
> FUNCTION 4 ginint4_consistent (internal, int2, internal),
> STORAGE int4;
> ---
> Both ginint4_queryextract and ginint4_consistent assume that "query" argument is a PGARRAY (ArrayType *). Where is it decided? Is it array of STORAGE type?

Remember the above uses operators which are what is indexed. The left
hand side is the array. The right hand side is whatever is defined.
intarray defines the operator &&(int[], int[]) hence the "query"
argument is int[] in that case. Apparently intarray accepts many kinds
of queries, it is the operators that define what actually happens.

> Both extractQuery(ginint4_queryextract) and extractValue(ginarrayextract) seem to return similar value type. They return Datum array of int4. Is it array of STORAGE type?

From my reading of
http://www.postgresql.org/docs/9.2/static/gin-extensibility.html, yes
they must return an array of the STORAGE type. The last paragraph on
that page says:

The actual data types of the various Datum values mentioned above
vary depending on the operator class. The item values passed to
extractValue are always of the operator class's input type, and all
key values must be of the class's STORAGE type. The type of the
query argument passed to extractQuery and consistent is whatever is
specified as the right-hand input type of the class member operator
identified by the strategy number. This need not be the same as
the item type, so long as key values of the correct type can be
extracted from it.

> I want to understand the overview of GIN extension.

Please let us know what the documentation is missing so it can be
improved.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer