Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)

Lists: pgsql-hackers
From: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-01-14 15:06:42
Message-ID: CADyhKSWORjVYOWxP7x6XMtSt4yUQErSDLf5EjVNM4a+U=1EfMA@mail.gmail.com

Hello,

The attached patch implements what we discussed just before the November commit-fest.

It implements an alternative way to scan a particular table, using an
in-memory cache instead of the usual heap access method. Unlike the buffer
cache, this mechanism caches only a limited number of columns in memory,
so memory consumption per tuple is much smaller than with the regular heap
access method, which allows a much larger number of tuples to be kept in
memory.

I'd like to extend this idea to cache data in a column-oriented data
structure, in order to utilize parallel calculation processors such as
CPU SIMD operations or simple GPU cores. (It probably makes sense to
evaluate multiple records with a single vector instruction if the contents
of a particular column are stored as one large array.)
However, this patch still keeps all the tuples in row-oriented format,
because row <=> column translation would make the patch bigger than its
current form (about 2,000 lines), and GPU integration would require linking
a proprietary library (CUDA or OpenCL), which I thought was not preferable
for upstream code.

Also note that this patch requires the part-1 ~ part-3 patches of the
CustomScan APIs as prerequisites, because it is implemented on top of those APIs.

One thing I have to apologize for is the lack of documentation and source
code comments around the contrib/ code. Please give me a couple of days to
clean up the code.
Aside from the extension code, I made two enhancements to the core code,
as follows. I'd like to discuss the adequacy of these enhancements.

The first enhancement is a hook in heap_page_prune() to synchronize an
extension's internal state with changes to the heap image on disk.
The cache unavoidably accumulates garbage over time, so it needs cleanup
just as the vacuum process does for the heap. The best time to do this is
when dead tuples are reclaimed, because at that point it is certain that
nobody will reference those tuples any more.

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
bool marked[MaxHeapTuplesPerPage + 1];
} PruneState;

+/* Callback for each page pruning */
+heap_page_prune_hook_type heap_page_prune_hook = NULL;
+
/* Local functions */
static int heap_prune_chain(Relation relation, Buffer buffer,
OffsetNumber rootoffnum,
@@ -294,6 +297,16 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
* and update FSM with the remaining space.
*/

+ /*
+ * This callback allows extensions to synchronize their own state with
+ * the heap image on disk when this buffer page is vacuumed.
+ */
+ if (heap_page_prune_hook)
+ (*heap_page_prune_hook)(relation,
+ buffer,
+ ndeleted,
+ OldestXmin,
+ prstate.latestRemovedXid);
return ndeleted;
}

The second enhancement makes SetHintBits() accept InvalidBuffer and skip
all of its work in that case. We need to check the visibility of cached
tuples when the custom-scan node scans the cached table instead of the heap.
Even though we can use an MVCC snapshot to check tuple visibility, the
check may internally set tuples' hint bits, so we would otherwise always
need to give a valid buffer pointer to HeapTupleSatisfiesVisibility().
Unfortunately, it would kill all the benefit of the table cache if we had
to load the heap buffer associated with each cached tuple.
So, I'd like SetHintBits() to handle a special dry-run case when
InvalidBuffer is given.

diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index f626755..023f78e 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -103,11 +103,18 @@ static bool XidInMVCCSnapshot(TransactionId xid,
Snapshot snapshot);
*
* The caller should pass xid as the XID of the transaction to check, or
* InvalidTransactionId if no check is needed.
+ *
+ * In the case where the supplied HeapTuple is not associated with a
+ * particular buffer, this function returns without doing anything. That
+ * can happen when an extension caches tuples in its own way.
*/
static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
uint16 infomask, TransactionId xid)
{
+ if (BufferIsInvalid(buffer))
+ return;
+
if (TransactionIdIsValid(xid))
{
/* NB: xid must be known committed here! */

Thanks,

2013/11/13 Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>:
> 2013/11/12 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
>> Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> writes:
>>> So, are you thinking it is a feasible approach to focus on custom-scan
>>> APIs during the upcoming CF3, then table-caching feature as use-case
>>> of this APIs on CF4?
>>
>> Sure. If you work on this extension after CF3, and it reveals that the
>> custom scan stuff needs some adjustments, there would be time to do that
>> in CF4. The policy about what can be submitted in CF4 is that we don't
>> want new major features that no one has seen before, not that you can't
>> make fixes to previously submitted stuff. Something like a new hook
>> in vacuum wouldn't be a "major feature", anyway.
>>
> Thanks for this clarification.
> 3 days are too short to write a patch, however, 2 month may be sufficient
> to develop a feature on top of the scheme being discussed in the previous
> comitfest.
>
> Best regards,
> --
> KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>

--
KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>

Attachment Content-Type Size
pgsql-v9.4-custom-scan.part-4.v5.patch application/octet-stream 62.3 KB

From: KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-01-21 04:54:51
Message-ID: 52DDFD9B.8050400@ak.jp.nec.com
Lists: pgsql-hackers

Hello,

I revisited the patch for the contrib/cache_scan extension.
The previous version had a problem when a T-tree node had to be rebalanced:
it crashed while merging nodes.

Even though the contrib/cache_scan portion is more than 2,000 lines of
code, what I'd like to discuss first is the core enhancements: running an
MVCC snapshot check on cached tuples, and getting a callback on vacuumed
pages for cache synchronization.

Any comments are welcome.

Thanks,

(2014/01/15 0:06), Kohei KaiGai wrote:
> [...]

--
OSS Promotion Center / The PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>

Attachment Content-Type Size
pgsql-v9.4-custom-scan.part-4.v5.patch text/plain 77.5 KB

From: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
To: KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-02-08 02:09:29
Message-ID: CADyhKSXCKu33hbLfdwi5_GGfqf20fd1F6QSScO+2pMOPkjBncg@mail.gmail.com
Lists: pgsql-hackers

Hello,

Because of time pressure in the January commit-fest, I tried to simplify
the patch for cache-only scan into three portions: (1) add a hook in
heap_page_prune() for cache invalidation when a particular page is vacuumed,
(2) add a check to accept InvalidBuffer in SetHintBits(), and (3) a
proof-of-concept module for cache-only scan.

(1) pgsql-v9.4-heap_page_prune_hook.v1.patch
Once an in-memory columnar cache is constructed, it needs to be invalidated
whenever a heap page backing the cache is modified. In the usual DML cases,
an extension can get control for invalidation using row-level trigger
functions; however, we currently have no way to get control when a page is
vacuumed, which is usually done by the autovacuum process.
This patch adds a callback to heap_page_prune(), allowing extensions to
prune dead entries in their caches, not only in heap pages.
I'd also like to hear of any other scenarios where we would need to
invalidate columnar cache entries, if any exist. It seems to me that
object_access_hook makes sense to cover the DDL and VACUUM FULL scenarios...

(2) pgsql-v9.4-HeapTupleSatisfies-accepts-InvalidBuffer.v1.patch
When we want to check the visibility of tuples in cache entries (which
have no particular shared buffer associated with them) using
HeapTupleSatisfiesVisibility, it internally tries to update the tuples'
hint bits. However, that makes no sense for tuples not associated with a
particular shared buffer; by definition, tuple entries in the cache are
not connected to one. If we had to load the whole buffer page just to set
hint bits, it would defeat the purpose of the in-memory cache, which is
to reduce disk accesses.
This patch adds an exceptional condition to SetHintBits() to skip
everything if the given buffer is InvalidBuffer. This allows extensions
to check tuple visibility using the regular visibility-check functions,
without reinventing the wheel themselves.

(3) pgsql-v9.4-contrib-cache-scan.v1.patch
Unlike (1) and (2), this patch is just a proof of concept that implements
cache-only scan on top of the custom-scan interface.
It offers an alternative scan path on a table, with row-level triggers for
cache invalidation, if the total width of the referenced columns is less
than 30% of the total width of the table definition. Thus, it can keep a
larger number of records, holding only their meaningful portion, in main
memory.
The cache is invalidated in step with updates of the main heap in three
ways: by row-level triggers, by object_access_hook on DDL, and by the
heap_page_prune hook. Once a column-reduced tuple gets cached, it is
copied into cache memory from the shared buffer, so the visibility-check
functions need the ability to ignore InvalidBuffer.

Please volunteer to review the patches, especially (1) and (2), which are
very small.

Thanks,

2014-01-21 KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>:
> [...]

--
KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>

Attachment Content-Type Size
pgsql-v9.4-contrib-cache-scan.v1.patch application/octet-stream 74.8 KB
pgsql-v9.4-heap_page_prune_hook.v1.patch application/octet-stream 1.7 KB
pgsql-v9.4-HeapTupleSatisfies-accepts-InvalidBuffer.v1.patch application/octet-stream 840 bytes

From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-02-12 05:59:50
Message-ID: CAJrrPGcjd0xgR5cyFTerVbweSHCScDCdNjMh9ey4H432XKVWXw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 8, 2014 at 1:09 PM, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> wrote:

> [...]

I reviewed all three patches. The first two, the core PostgreSQL patches,
are fine.
I have comments on the third patch, related to cache scan.

1. +# contrib/dbcache/Makefile

The Makefile header comment does not match the file's actual location.

2.+ /*
+ * Estimation of average width of cached tuples - it does not make
+ * sense to construct a new cache if its average width is more than
+ * 30% of the raw data.
+ */

Move the estimation of the average width of cached tuples into the case
where the cache is not found; otherwise it is an overhead in the
cache-hit scenario.

3. + if (old_cache)
+ attrs_used = bms_union(attrs_used, &old_cache->attrs_used);

Don't we need a check that the average width stays below 30% here? The
estimation doesn't include the existing other attributes.

4. + lchunk->right = cchunk;
+ lchunk->l_depth = TTREE_DEPTH(lchunk->right);

I think lchunk->r_depth is what needs to be set in a clockwise rotation.

5. Can you add some comments to the code explaining how a block is used?

6. In the do_insert_tuple function, I felt that moving the tuples and
rearranging their addresses is a little costly. How about the following
approach?

Always insert the tuple starting from the bottom of the block, where the
empty space begins, and store the corresponding reference pointers in an
array at the start of the block. As new tuples are inserted, this array
grows from the block start while the tuples grow from the block end.
We then only need to sort this array based on item pointers; there is no
need to update the reference pointers themselves.

In this case, movement is required only when a tuple is moved from one
block to another, or when contiguous free space is not available to
insert a new tuple. You can decide based on how frequently the sorting
would happen in general.

7. In the ccache_find_tuple function, the Assert(i_min + 1 < cchunk->ntups)
can fail when only one tuple is present in the block, with an item
pointer equal to the one we are searching for, in the forward scan
direction.

8. I am not able to find a protection mechanism for insert/delete etc. of
a tuple in the T-tree. As this is shared memory, that can cause problems.

9. + /* merge chunks if this chunk has enough space to merge */
+ ccache_merge_chunk(ccache, cchunk);

Calling the chunk merge on every heap-page-prune callback is an overhead
for vacuum. The merge may also lead straight back to node splits when new
data arrives.

10. The misspelling "columner" is present in some places in the patch;
please correct it.

11. In the cache_scan_next function, if the cache insert fails for lack
of shared memory, the tuple pointer is not reset while the cache is NULL.
Because of this, the next record fetch trips the Assert that cache !=
NULL.

12. + if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
+ cs_put_ccache(ccache);

The cache is created with refcnt 2; sometimes put-cache is called twice
to release it, while at other times a different approach is used. It is a
little confusing; can you explain in comments why 2 is required and how
the count is maintained?

13. A performance report is required, to see how much impact the cache
synchronizer has on insert/delete/update operations.

14. The GUC variable "cache_scan_disabled" is missing from the docs description.

Please let me know if you need any support.

Regards,
Hari Babu
Fujitsu Australia


From: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-02-12 15:42:29
Message-ID: CADyhKSVsKkXn1QH3i49ozWv-hT5WCX7z7xB-_5o3B0m+KUNv7A@mail.gmail.com
Lists: pgsql-hackers

2014-02-12 14:59 GMT+09:00 Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>:
> I reviewed all the three patches. The first 1 and 2 core PostgreSQL patches
> are fine.
> And I have comments in the third patch related to cache scan.
>
Thanks for your volunteering.

> 1. +# contrib/dbcache/Makefile
>
> Makefile header comment is not matched with file name location.
>
Ahh, that's an old name from when I started the implementation.

> 2.+ /*
> + * Estimation of average width of cached tuples - it does not make
> + * sense to construct a new cache if its average width is more than
> + * 30% of the raw data.
> + */
>
> Move the estimation of average width calculation of cached tuples into
> the case where the cache is not found,
> otherwise it is an overhead for cache hit scenario.
>
You are right. If an existing cache is found and matches, its width is
obviously already less than 30% of the total width.

> 3. + if (old_cache)
> + attrs_used = bms_union(attrs_used, &old_cache->attrs_used);
>
> can't we need the check to see the average width is more than 30%? During
> estimation it doesn't
> include the existing other attributes.
>
Indeed. It should drop some attributes from the existing cache if the
total average width grows beyond the threshold. Probably, we need a
statistical variable to track how often, or how recently, each attribute
is referenced.

> 4. + lchunk->right = cchunk;
> + lchunk->l_depth = TTREE_DEPTH(lchunk->right);
>
> I think it should be lchunk->r_depth needs to be set in a clock wise
> rotation.
>
Oops, nice catch.

> 5. can you add some comments in the code with how the block is used?
>
Sorry, I'll add them. A block is consumed from the head to store pointers
to the tuples, and from the tail to store the tuples' contents. A block
can hold multiple tuples as long as the tuple pointers growing from the
head do not cross into the area for tuple contents. Anyway, I'll put this
in the source code.

> 6. In do_insert_tuple function I felt moving the tuples and rearranging
> their addresses is little bit costly. How about the following way?
>
> Always insert the tuple from the bottom of the block where the empty
> space is started and store their corresponding reference pointers
> in the starting of the block in an array. As and when the new tuple
> inserts this array increases from block start and tuples from block end.
> Just need to sort this array based on item pointers, no need to update
> their reference pointers.
>
> In this case the movement is required only when the tuple is moved from
> one block to another block and also whenever if the continuous
> free space is not available to insert the new tuple. you can decide based
> on how frequent the sorting will happen in general.
>
It seems a reasonable suggestion to me.
Probably an easier implementation is to replace an old block containing dead
space with a new block that contains only the valid tuples, if and when the
dead space exceeds a threshold fraction of block usage.

> 7. In ccache_find_tuple function this Assert(i_min + 1 < cchunk->ntups); can
> go wrong when only one tuple present in the block
> with the equal item pointer what we are searching in the forward scan
> direction.
>
It shouldn't happen, because the first or second ItemPointerCompare handles
that condition. Suppose cchunk->ntups == 1: any given ctid compares less
than, equal to, or greater than the single cached tuple, so it either
matches or the search moves to the right or left node according to the scan
direction.

> 8. I am not able to find a protection mechanism in insert/delete and etc of
> a tuple in Ttree. As this is a shared memory it can cause problems.
>
For design simplification, I put a giant lock per columnar cache.
So, the routines in cscan.c acquire an exclusive LWLock prior to invoking
ccache_insert_tuple / ccache_delete_tuple.

> 9. + /* merge chunks if this chunk has enough space to merge */
> + ccache_merge_chunk(ccache, cchunk);
>
> calling the merge chunks for every call back of heap page prune is a
> overhead for vacuum. After the merge which may again leads
> to node splits because of new data.
>
OK, I'll add a condition check before merging chunks, to prevent overly
frequent merge/split cycles.

> 10. "columner" is present in some places of the patch. correct it.
>
Ahh, it should be "columnar".

> 11. In the cache_scan_next function, in case the cache insert fails because
> of shared memory shortage, the tuple pointer is not reset and the cache is
> NULL. Because of this, the next record fetch hits the assert that cache !=
> NULL.
>
You are right. I should reset the scan state to fall back to the normal
seqscan path, not just set csstate->ccache to NULL.
That was left over from trial and error during implementation.

> 12. + if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
> + cs_put_ccache(ccache);
>
> The cache is created with refcnt as 2 and in some times two times put
> cache is called to eliminate it and in some times with a different approach.
> It is little bit confusing, can you explain in with comments with why 2
> is required and how it maintains?
>
I thought of 2 as create + get, so that putting the cache at the end of the
scan does not release it. However, it might be confusing as you pointed out.
The process that creates the cache knows it is the creator, so all it needs
to do is exit the scan without putting the cache, if it successfully created
the cache and wants to leave it for later scans.
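To illustrate that convention, here is a hedged sketch with stand-in names (ccache_create/ccache_get/ccache_put are illustrative, not the patch's cs_* functions): creation pins the cache twice, once for the shared directory and once for the creating scan.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for ccache_head; not the patch's definition. */
typedef struct
{
    int     refcnt;
    int     destroyed;      /* set instead of freeing, so callers can check */
} ccache;

/* Creation pins the cache twice: once for the shared directory that
 * tracks it, and once for the creating scan (create + get). */
static ccache *
ccache_create(void)
{
    ccache *c = calloc(1, sizeof(ccache));
    c->refcnt = 2;
    return c;
}

static void
ccache_get(ccache *c)
{
    c->refcnt++;
}

static void
ccache_put(ccache *c)
{
    if (--c->refcnt == 0)
        c->destroyed = 1;   /* real code would release shared memory here */
}
```

With this convention, the end-of-scan put drops only the scan's own pin, so the cache survives for later scans; on a creation error, one extra put releases it entirely.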

> 13. A performance report is required to see how much impact it can cause on
> insert/delete/update operations because of cache synchronizer.
>
OK, I'll try to measure the difference between them for the next patch submission.

> 14. The Guc variable "cache_scan_disabled" is missed in docs description.
>
OK,

Thanks, I'll submit a revised one within a couple of days.
--
KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>


From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-02-12 23:30:55
Message-ID: CAJrrPGdBUh2KyCRC+Ve=_rv5bCQyGtw4Fex_LkKoo2M0DO8VjQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 13, 2014 at 2:42 AM, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> wrote:

> 2014-02-12 14:59 GMT+09:00 Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>:
> > 7. In ccache_find_tuple function this Assert(i_min + 1 < cchunk->ntups);
> can
> > go wrong when only one tuple present in the block
> > with the equal item pointer what we are searching in the forward scan
> > direction.
> >
> It shouldn't happen, because the first or second ItemPointerCompare will
> handle the condition. Please assume the cchunk->ntups == 1. In this case,
> any given ctid shall match either of them, because any ctid is less, equal
> or
> larger to the tuple being only cached, thus, it moves to the right or left
> node
> according to the scan direction.

Yes, you are correct. Sorry for the noise.

> > 8. I am not able to find a protection mechanism in insert/delete and
> etc of
> > a tuple in Ttree. As this is a shared memory it can cause problems.
> >
> For design simplification, I put a giant lock per columnar-cache.
> So, routines in cscan.c acquires exclusive lwlock prior to invocation of
> ccache_insert_tuple / ccache_delete_tuple.

Correct. But this lock can become a concurrency bottleneck. Better to
analyze it once we have the performance report.

Regards,
Hari Babu
Fujitsu Australia


From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-02-13 04:27:43
Message-ID: 9A28C8860F777E439AA12E8AEA7694F8F7A166@BPXM15GP.gisp.nec.co.jp
Lists: pgsql-hackers

> > 8. I am not able to find a protection mechanism in insert/delete
> and etc of
> > a tuple in Ttree. As this is a shared memory it can cause problems.
> >
>
> For design simplification, I put a giant lock per columnar-cache.
> So, routines in cscan.c acquires exclusive lwlock prior to
> invocation of
> ccache_insert_tuple / ccache_delete_tuple.
>
>
> Correct. But this lock can be a bottleneck for the concurrency. Better to
> analyze the same once we have the performance report.
>
Well, concurrent updates to a particular table may cause lock contention
due to a giant lock.
On the other hand, my headache is how to avoid deadlock if we try to
implement finer-grained locking. Consider per-chunk locking: moving a record
out also requires taking locks on the neighboring nodes, while some other
process may concurrently try to move another record in the inverse order.
That is a recipe for deadlock.

Is there an idea or reference for implementing concurrent tree-structure updates?

Anyway, it is a good idea to measure the impact of concurrent updates on
cached tables, to find out the significance of lock splitting.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>



From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
Cc: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-02-14 05:21:24
Message-ID: CAJrrPGeU5cDLhFEePQyobrQ1mGC+ruOBmgRKrasz6Bi_xBkDZg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 13, 2014 at 3:27 PM, Kouhei Kaigai wrote:

> > > 8. I am not able to find a protection mechanism in insert/delete
> > and etc of
> > > a tuple in Ttree. As this is a shared memory it can cause
> problems.
> > >
> >
> > For design simplification, I put a giant lock per columnar-cache.
> > So, routines in cscan.c acquires exclusive lwlock prior to
> > invocation of
> > ccache_insert_tuple / ccache_delete_tuple.
> >
> >
> > Correct. But this lock can be a bottleneck for the concurrency. Better to
> > analyze the same once we have the performance report.
> >
> Well, concurrent updates towards a particular table may cause lock
> contention
> due to a giant lock.
> On the other hands, one my headache is how to avoid dead-locking if we try
> to
> implement it using finer granularity locking. Please assume per-chunk
> locking.
> It also needs to take a lock on the neighbor nodes when a record is moved
> out.
> Concurrently, some other process may try to move another record with
> inverse
> order. That is a ticket for dead-locking.
>
> Is there idea or reference to implement concurrent tree structure updating?
>
> Anyway, it is a good idea to measure the impact of concurrent updates on
> cached tables, to find out the significance of lock splitting.

We can do some of the following:
1. Let only insert take the exclusive lock.
2. Always follow the locking order from root to children.
3. For delete, take the exclusive lock only on the exact node where the
delete happens.
We will identify more based on the performance data.

And while searching for concurrency in T-tree, I found an interesting
document which says that a B-tree can outperform a T-tree as an in-memory
index as well, at the cost of somewhat higher memory usage. The document is
attached to this mail.

Regards,
Hari Babu
Fujitsu Australia

Attachment Content-Type Size
T-Tree_or_B-Tree_Main_Memory_Database_Index_Structure_Revisited.pdf application/pdf 161.4 KB

From: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-02-20 15:19:24
Message-ID: CADyhKSXmK6g37RRTVexFc8b1wT=M7Mc6iWN3Btck+mVMPaOx4Q@mail.gmail.com
Lists: pgsql-hackers

Hello,

The attached patch is a revised one for cache-only scan module
on top of custom-scan interface. Please check it.
The points I changed are below.

Also, the custom-scan patch, the basis of this functionality, is still
looking for folks who can volunteer to review it.
https://commitfest.postgresql.org/action/patch_view?id=1282

> 1. +# contrib/dbcache/Makefile
>
> Makefile header comment is not matched with file name location.
>
Fixed.

> 2.+ /*
> + * Estimation of average width of cached tuples - it does not make
> + * sense to construct a new cache if its average width is more than
> + * 30% of the raw data.
> + */
>
>
> Move the estimation of average width calculation of cached tuples into
> the case where the cache is not found,
> otherwise it is an overhead for cache hit scenario.
>
This logic was moved into the block taken only when no existing cache is
hit, so the usual path does not perform the average-width calculation.

In addition, I added one additional GUC (cache_scan.width_threshold) to
specify the threshold to determine usage of cache-scan.

> 3. + if (old_cache)
> + attrs_used = bms_union(attrs_used, &old_cache->attrs_used);
>
>
> Don't we need a check that the average width stays under 30%? The
> estimation doesn't include the other existing attributes.
>
See the new ccache_new_attribute_set(). It works almost like bms_union if
the total width of the union set of attributes is less than the threshold.
Otherwise, it drops attributes that are included in the existing cache but
not required by this scan. Ideally, I would want statistical information on
how frequently/recently each column is referenced, but no such information
exists right now, so I put in logic that drops larger columns first.
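As a rough illustration of that fallback policy (a sketch only; the real code works on bitmapsets of attribute numbers, and choose_columns/the width array are hypothetical names): keep columns smallest-first until a width budget is exhausted, which is equivalent to dropping the larger columns first.

```c
#include <assert.h>

/* Given per-column widths and a width budget, mark the columns to keep
 * (smallest first) until the budget would be exceeded. keep[i] is set
 * to 1 for kept columns; returns the total kept width. */
static int
choose_columns(const int *width, int ncols, int budget, int *keep)
{
    int     total = 0;

    for (int i = 0; i < ncols; i++)
        keep[i] = 0;

    /* repeatedly pick the smallest not-yet-kept column */
    for (;;)
    {
        int     best = -1;

        for (int i = 0; i < ncols; i++)
            if (!keep[i] && (best < 0 || width[i] < width[best]))
                best = i;
        if (best < 0 || total + width[best] > budget)
            break;              /* budget exhausted: larger columns dropped */
        keep[best] = 1;
        total += width[best];
    }
    return total;
}
```

A statistics-aware version would weight columns by reference frequency instead of pure width, as the mail suggests.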

> 4. + lchunk->right = cchunk;
> + lchunk->l_depth = TTREE_DEPTH(lchunk->right);
>
>
> I think it should be lchunk->r_depth needs to be set in a clock wise
> rotation.
>
Fixed. I also found and fixed a bug: the upper pointers of child nodes were
not updated when a rotation was performed.

> 5. can you add some comments in the code with how the block is used?
>
I added a source code comment around the definition of shmseg_head.
Is it understandable?

> 6. In do_insert_tuple function I felt moving the tuples and rearranging
> their addresses is little bit costly. How about the following way?
>
> Always insert the tuple from the bottom of the block where the empty
> space is started and store their corresponding reference pointers
> in the starting of the block in an array. As and when the new tuple inserts
> this array increases from block start and tuples from block end.
> Just need to sort this array based on item pointers, no need to update
> their reference pointers.
>
> In this case the movement is required only when the tuple is moved from
> one block to another block and also whenever if the continuous
> free space is not available to insert the new tuple. you can decide based
> on how frequent the sorting will happen in general.
>
I newly added a "deadspace" field to ccache_chunk. It tracks the payload
consumed by tuples that have already been deleted. Because of the format of
ccache_chunk, the actual free space is
cchunk->usage - offsetof(ccache_chunk, tuples[cchunk->ntups]),
whereas deadspace is potential free space. So I adjusted do_insert to kick
off the chunk compaction code when the target chunk has enough "potential"
free space but no "actual" free space available to store the new tuple.
This keeps the overhead around tuple insertion low, since all we need to
move is a pointer within cchunk->tuples, not the contents themselves.
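The insert-path decision can be sketched as follows (hedged: check_fit and the three-way result are illustrative names, not the patch's code). Actual free space is what remains between the pointer array and the tuple contents; deadspace is what compaction could reclaim.

```c
#include <assert.h>

typedef enum
{
    FIT_NOW,            /* enough actual free space: insert directly */
    COMPACT_THEN_FIT,   /* fits only after reclaiming deadspace */
    NO_FIT              /* caller must split or pick another chunk */
} fit_result;

/* Decide whether a tuple of the given size fits in a chunk, and whether
 * compaction would be worthwhile first. */
static fit_result
check_fit(int actual_free, int deadspace, int required)
{
    if (required <= actual_free)
        return FIT_NOW;
    if (required <= actual_free + deadspace)
        return COMPACT_THEN_FIT;
    return NO_FIT;
}
```

Compaction is thus triggered only when it would actually make the insert succeed, rather than eagerly on every delete.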

> 8. I am not able to find a protection mechanism in insert/delete and etc
> of a tuple in Ttree. As this is a shared memory it can cause problems.
>
I didn't change the existing logic (a giant lock per ccache_head) yet,
because I still wonder whether this kind of optimization would cost us the
simplicity of an implementation intended as a proof of concept for the hook.
Is it really needed for that purpose?

In addition, it seems to me we don't have an intent-lock facility, even
though the paper you suggested gave me the impression that one is assumed...

> 9. + /* merge chunks if this chunk has enough space to merge */
> + ccache_merge_chunk(ccache, cchunk);
>
> calling the merge chunks for every call back of heap page prune is a
> overhead for vacuum. After the merge which may again leads
> to node splits because of new data.
>
I adjusted the logic a little. Now ccache_merge_chunk() is called only if
the vacuumed cache chunk differs from the previous one, which reduces the
number of (trial) merge operations.

> 10. "columner" is present in some places of the patch. correct it.
>
Fixed.

> 11. In the cache_scan_next function, in case the cache insert fails because
> of shared memory shortage, the tuple pointer is not reset and the cache is
> NULL. Because of this, the next record fetch hits the assert that cache !=
> NULL.
>
Fixed.

> 12. + if (ccache->status != CCACHE_STATUS_IN_PROGRESS)
> + cs_put_ccache(ccache);
>
> The cache is created with refcnt as 2 and in some times two times put
> cache is called to eliminate it and in some times with a different approach.
> It is little bit confusing, can you explain in with comments with why
> 2 is required and how it maintains?
>
I adjusted the logic around the reference count. When a cache is created,
the creator process does not decrease the reference count if creation was
successful. If some error happens during creation, it is decreased, so the
cache can then be dropped with a single "put" operation.

> 13. A performance report is required to see how much impact it can cause
> on insert/delete/update operations because of cache synchronizer.
>
I assigned the synchronizer trigger to "pgbench_accounts" and then ran
pgbench against the table with and without the columnar cache.
It seems to me there is no difference larger than the run-to-run variance.

* with columnar-cache

$ pgbench -T 15 postgres
tps = 113.586430 (including connections establishing)
tps = 113.597228 (excluding connections establishing)

* without columnar-cache
$ pgbench -T 15 postgres
tps = 110.765080 (including connections establishing)
tps = 110.775890 (excluding connections establishing)

> 14. The Guc variable "cache_scan_disabled" is missed in docs description.
>
Its description was added.

Thanks,


--
KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>

Attachment Content-Type Size
pgsql-v9.4-custom-scan.part-4.v7.patch application/octet-stream 88.0 KB

From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-02-24 03:41:57
Message-ID: CAJrrPGfq3i8ro7gpSgs7rkvXnmpgC0iA0EE_xAP=-DM+WqBAPA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 21, 2014 at 2:19 AM, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> wrote:

> Hello,
>
> The attached patch is a revised one for cache-only scan module
> on top of custom-scan interface. Please check it.
>

Thanks for the revised patch. Please find some minor comments.

1. memcpy(dest, tuple, HEAPTUPLESIZE);
+ memcpy((char *)dest + HEAPTUPLESIZE,
+ tuple->t_data, tuple->t_len);

For a normal tuple these two addresses are different, but in the case of
ccache it is contiguous memory.
Better to write a comment that even though the memory is contiguous, the
two parts are treated as distinct.

2. + uint32 required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);

t_len is already MAXALIGNed, so there is no problem with aligning it again,
but the required-length calculation differs from function to function.
For example, in a later part of the same function, the same t_len is used
directly. It doesn't cause any problem, but it may cause some confusion.

3. + cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
+ if (pchunk != NULL && pchunk != cchunk)
+ ccache_merge_chunk(ccache, pchunk);
+ pchunk = cchunk;

The merge_chunk is called only when the heap tuples are spread across two
cache chunks. Actually, one cache chunk can accommodate one or more heap
pages; it needs some other way of handling.

4. for (i=0; i < 20; i++)

Better to replace this magic number with a meaningful macro.

5. "columner" is present in sgml file. correct it.

6. "max_cached_attnum" value in the document saying as 128 by default but
in the code it set as 256.

I will start regress and performance tests. I will inform you the same once
i finish.

Regards,
Hari Babu
Fujitsu Australia


From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-02-24 23:15:54
Message-ID: CAJrrPGef6PNZkd5X17hjD6XouZ97toF4fpuc70-gh+6Q=7M1dA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 24, 2014 at 2:41 PM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com> wrote:

> On Fri, Feb 21, 2014 at 2:19 AM, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> wrote:
>
>> Hello,
>>
>> The attached patch is a revised one for cache-only scan module
>> on top of custom-scan interface. Please check it.
>>
>
> I will start regress and performance tests. I will inform you the same
> once i finish.
>

Getting some compilation warnings while compiling the extension and also I
am not able to load the extension because of undefined symbol
"get_restriction_qual_cost".

cscan.c: In function 'cs_estimate_costs':
cscan.c:163: warning: implicit declaration of function
'get_restriction_qual_cost'
cscan.c: In function 'cs_add_scan_path':
cscan.c:437: warning: implicit declaration of function 'bms_to_string'
cscan.c:437: warning: passing argument 1 of 'makeString' makes pointer from
integer without a cast
cscan.c: In function 'cs_begin_custom_scan':
cscan.c:493: warning: implicit declaration of function 'bms_from_string'
cscan.c:493: warning: assignment makes pointer from integer without a cast

FATAL: could not load library "/postgresql/cache_scan.so":
/postgresql/cache_scan.so: undefined symbol: get_restriction_qual_cost

Regards,
Hari Babu
Fujitsu Australia


From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-02-24 23:44:57
Message-ID: 9A28C8860F777E439AA12E8AEA7694F8F7E7A8@BPXM15GP.gisp.nec.co.jp
Lists: pgsql-hackers

Thanks for your testing,

> Getting some compilation warnings while compiling the extension and also
> I am not able to load the extension because of undefined symbol
> "get_restriction_qual_cost".
>
It seems to me you applied only the part-1 portion of the custom-scan patches.

get_restriction_qual_cost() is redefined from a static function to an
extern one in the part-2 portion (ctidscan), so that extensions can estimate
the cost of qualifiers. Also, bms_to_string() and bms_from_string() are
added in the part-3 portion (postgres_fdw) to carry a bitmapset according
to the copyObject manner.

It may make sense to include these supplemental changes in the cache_scan
feature as well, until either of those patches gets upstreamed.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>



From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
Cc: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-02-25 00:13:31
Message-ID: CAJrrPGfzLHzL8+45G3cRkjPg8hbYExF9Z-HS7D-nMdvSD+nBxw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 25, 2014 at 10:44 AM, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:

> Thanks for your testing,
>
> > Getting some compilation warnings while compiling the extension and also
> > I am not able to load the extension because of undefined symbol
> > "get_restriction_qual_cost".
> >
> It seems to me you applied only part-1 portion of the custom-scan patches.
>
> The get_restriction_qual_cost() is re-defined as extern function, from
> static one, on the part-2 portion (ctidscan) to estimate cost value of the
> qualifiers in extension. Also, bms_to_string() and bms_from_string() are
> added on the part-3 portion (postgres_fdw) portion to carry a bitmap-set
> according to the copyObject manner.
>
> It may make sense to include the above supplemental changes in the cache-
> scan feature also, until either of them getting upstreamed.

Thanks for the information. I will apply the other patches as well and
start testing.

Regards,
Hari Babu
Fujitsu Australia


From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
Cc: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-02-26 04:00:48
Message-ID: CAJrrPGedhk0vG9029KEiW=87CMDnOs9R-GWKmeBp-WOmE72DQw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Feb 25, 2014 at 11:13 AM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com> wrote:

> Thanks for the information, I will apply other patches also and start
> testing.
>

When trying to run the pgbench test, the cache-scan plan is not chosen by
default because of its higher cost. So I either increased
cpu_index_tuple_cost to a very large value or turned off enable_indexscan,
so that the planner chooses cache_scan as the cheapest plan.

The configuration parameters changed during the test are,

shared_buffers - 2GB, cache_scan.num_blocks - 1024
wal_buffers - 16MB, checkpoint_segments - 255
checkpoint_timeout - 15 min, cpu_index_tuple_cost - 100000 or
enable_indexscan=off

Test procedure:
1. Initialize the database with pgbench with 75 scale factor.
2. Create the triggers on pgbench_accounts
3. Use a select query to load all the data into cache.
4. Run a simple update pgbench test.

Plan details of pgbench simple update queries:

postgres=# explain update pgbench_accounts set abalance = abalance where
aid = 100000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Update on pgbench_accounts (cost=0.43..100008.44 rows=1 width=103)
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts
(cost=0.43..100008.44 rows=1 width=103)
Index Cond: (aid = 100000)
Planning time: 0.045 ms
(4 rows)

postgres=# explain select abalance from pgbench_accounts where aid = 100000;
QUERY PLAN
------------------------------------------------------------------------------------
Custom Scan (cache scan) on pgbench_accounts (cost=0.00..99899.99 rows=1
width=4)
Filter: (aid = 100000)
Planning time: 0.042 ms
(3 rows)

I am observing a large slowdown in the performance results. The performance
test script is attached to this mail.
Please let me know if you find any problem in the test.

Regards,
Hari Babu
Fujitsu Australia

Attachment Content-Type Size
run_reading.sh application/x-sh 863 bytes

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-03-04 03:35:23
Message-ID: 9A28C8860F777E439AA12E8AEA7694F8F837C1@BPXM15GP.gisp.nec.co.jp

Thanks for your reviewing.

According to the discussion in the Custom-Scan API thread, I moved
all the supplemental facilities (like bms_to/from_string) into the
main patch, so you no longer need to apply the ctidscan and
postgres_fdw patches for testing.
(I'll submit the revised one later)

> 1. memcpy(dest, tuple, HEAPTUPLESIZE);
> + memcpy((char *)dest + HEAPTUPLESIZE,
>
> + tuple->t_data, tuple->t_len);
>
> For a normal tuple these two addresses are different but in case of ccache,
> it is a continuous memory.
> Better write a comment as even if it continuous memory, it is treated
> as different only.
>
OK, I put a source code comment as follows:

/*
 * Even though we usually place the body of HeapTupleHeaderData
 * immediately after HeapTupleData, there is no guarantee that both
 * structures are located at contiguous memory addresses.
 * So we explicitly adjust tuple->t_data to point to the area just
 * behind itself, so that a HeapTuple on the columnar cache can be
 * referenced like a regular one.
 */

> 2. + uint32 required = HEAPTUPLESIZE + MAXALIGN(tuple->t_len);
>
> t_len is already maxaligned. No problem of using it again, The required
> length calculation is differing function to function.
> For example, in below part of the same function, the same t_len is used
> directly. It didn't generate any problem, but it may give some confusion.
>
I once tried to trust that t_len is already MAXALIGNed, but an Assert()
showed that this assumption is wrong. See heap_compute_data_size(): it
computes the length of the tuple body and adjusts alignment according to
the "attalign" value in pg_attribute, which is not necessarily the same
as sizeof(Datum).

> 4. + cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
> + if (pchunk != NULL && pchunk != cchunk)
>
> + ccache_merge_chunk(ccache, pchunk);
>
> + pchunk = cchunk;
>
>
> The merge_chunk is called only when the heap tuples are spread across
> two cache chunks. Actually one cache chunk can accommodate one or more than
> heap pages. it needs some other way of handling.
>
I adjusted the logic to merge the chunks as follows:

Once a tuple is vacuumed from a chunk, we also check whether the chunk can
be merged with its child leafs. A chunk has up to two child leafs; the left
one has a smaller ctid than the parent, and the right one a greater ctid.
That means a chunk without a right child in the left sub-tree, or a chunk
without a left child in the right sub-tree, is a neighbor of the chunk being
vacuumed. In addition, if the vacuumed chunk does not have one (or both) of
its children, it can be merged with its parent node.
I modified ccache_vacuum_tuple() to merge chunks during the t-tree
walk-down, if the vacuumed chunk has enough free space.

> 4. for (i=0; i < 20; i++)
>
> Better to replace this magic number with a meaningful macro.
>
On reflection, there is no good reason to construct multiple
ccache_entries at once. So I adjusted the code to create a new
ccache_entry on demand, to track a columnar cache being acquired.

> 5. "columner" is present in sgml file. correct it.
>
Sorry, fixed it.

> 6. "max_cached_attnum" value in the document saying as 128 by default but
> in the code it set as 256.
>
Sorry, fixed it.

Also, I ran a benchmark of cache_scan in the case where this module
performs most effectively.

The table t1 is declared as follows:
create table t1 (a int, b float, c float, d text, e date, f char(200));
Each record is roughly 256 bytes wide, and the table contains 4 million
records, so the total table size is about 1GB.

* 1st trial - it takes longer than a sequential scan because of columnar-
cache construction
postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=200791.62..200791.64 rows=1 width=0) (actual time=63105.036..63105.037 rows=1 loops=1)
-> Custom Scan (cache scan) on t1 (cost=0.00..200741.62 rows=20000 width=0) (actual time=7.397..62832.728 rows=400000 loops=1)
Filter: ((a % 10) = 5)
Rows Removed by Filter: 3600000
Planning time: 214.506 ms
Total runtime: 64629.296 ms
(6 rows)

* 2nd trial - it runs much faster than a sequential scan because there is no disk access
postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=67457.53..67457.54 rows=1 width=0) (actual time=7833.313..7833.313 rows=1 loops=1)
-> Custom Scan (cache scan) on t1 (cost=0.00..67407.53 rows=20000 width=0) (actual time=0.154..7615.914 rows=400000 loops=1)
Filter: ((a % 10) = 5)
Rows Removed by Filter: 3600000
Planning time: 1.019 ms
Total runtime: 7833.761 ms
(6 rows)

* 3rd trial - turn off the cache_scan, so planner chooses the built-in SeqScan.
postgres=# set cache_scan.enabled = off;
SET
postgres=# explain analyze select count(*) from t1 where a % 10 = 5;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=208199.08..208199.09 rows=1 width=0) (actual time=59700.810..59700.810 rows=1 loops=1)
-> Seq Scan on t1 (cost=0.00..208149.08 rows=20000 width=0) (actual time=715.489..59518.095 rows=400000 loops=1)
Filter: ((a % 10) = 5)
Rows Removed by Filter: 3600000
Planning time: 0.630 ms
Total runtime: 59701.104 ms
(6 rows)

The reason for such an extreme result:
I adjusted system page cache usage to constrain disk cache hits at the
operating-system level, so the sequential scan is dominated by disk access
performance in this case. On the other hand, the columnar cache was able to
hold all of the records, because it does not cache unreferenced columns.

* GUCs
shared_buffers = 512MB
shared_preload_libraries = 'cache_scan'
cache_scan.num_blocks = 400

[kaigai(at)iwashi backend]$ free -m
total used free shared buffers cached
Mem: 7986 7839 146 0 2 572
-/+ buffers/cache: 7265 721
Swap: 8079 265 7814

Please don't throw me stones. :-)
The primary purpose of this extension is to demonstrate usage of custom-scan
interface and heap_page_prune_hook().

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>


Attachment Content-Type Size
pgsql-v9.4-custom-scan.part-4.v9.patch application/octet-stream 89.6 KB

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-03-04 04:07:11
Message-ID: 9A28C8860F777E439AA12E8AEA7694F8F8382D@BPXM15GP.gisp.nec.co.jp

Sorry, the previous one still has "columner" in the sgml files.
Please see the attached one, instead.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>


Attachment Content-Type Size
pgsql-v9.4-custom-scan.part-4.v9.patch application/octet-stream 89.6 KB

From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
Cc: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-03-06 09:17:20
Message-ID: CAJrrPGd87ozXrHzWyao_W9E6b6Ub8thmPm+nBeme7gvTqF+MNA@mail.gmail.com

On Tue, Mar 4, 2014 at 3:07 PM, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:

> > > 4. + cchunk = ccache_vacuum_tuple(ccache, ccache->root_chunk, &ctid);
> > > + if (pchunk != NULL && pchunk != cchunk)
> > >
> > > + ccache_merge_chunk(ccache, pchunk);
> > >
> > > + pchunk = cchunk;
> > >
> > >
> > > The merge_chunk is called only when the heap tuples are spread
> > > across two cache chunks. Actually one cache chunk can accommodate one
> > > or more than heap pages. it needs some other way of handling.
> > >
> > I adjusted the logic to merge the chunks as follows:
> >
> > Once a tuple is vacuumed from a chunk, it also checks whether it can be
> > merged with its child leafs. A chunk has up to two child leafs; left one
> > has less ctid that the parent, and right one has greater ctid. It means
> > a chunk without right child in the left sub-tree or a chunk without left
> > child in the right sub-tree are neighbor of the chunk being vacuumed. In
> > addition, if vacuumed chunk does not have either (or both) of children,
> > it can be merged with parent node.
> > I modified ccache_vacuum_tuple() to merge chunks during t-tree walk-down,
> > if vacuumed chunk has enough free space.
> >
>

Patch looks good.

Regarding merging of the nodes: instead of checking whether a merge is
possible for every tuple that is vacuumed, can we use some kind of
threshold, such as: whenever a node becomes 50% free, try to merge it with
its leaf nodes until it is 90% full? The remaining 10% would be left for
future inserts into the node. What do you say?

I will update you later regarding the performance test results.

Regards,
Hari Babu
Fujitsu Australia


From: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-03-06 11:15:59
Message-ID: CADyhKSWjHEC0XWi5-ES2n7Wz+M3rjS=LhfN=gQPgsPTTTApWSg@mail.gmail.com

2014-03-06 18:17 GMT+09:00 Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>:
>
>
> Patch looks good.
>
Thanks for volunteering.

> Regarding merging of the nodes, instead of checking whether merge is
> possible or not for every tuple which is vacuumed,
> can we put some kind of threshold as whenever the node is 50% free then try
> to merge it from leaf nodes until 90% is full.
> The rest of the 10% will be left for the next inserts on the node. what do
> you say?
>
Hmm, indeed, that makes sense. How about kicking off chunk merging only if
the "expected" free space of the merged chunk would be less than 50%?
If the threshold depends on the (expected) usage of the merged chunk, it
can avoid over-merging.

> I will update you later regarding the performance test results.
>
Thanks,

Also, I'll rebase the patch on top of the new custom-scan interfaces
according to Tom's suggestion, even though the main logic of cache_scan
is unchanged.

Best regards,
--
KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>


From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-03-12 04:14:01
Message-ID: CAJrrPGfjcpnj+6DkWz_8ni+1CKPHH7LLgZ8k_fLGf0MKZ7U+Lw@mail.gmail.com

On Thu, Mar 6, 2014 at 10:15 PM, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> wrote:
> 2014-03-06 18:17 GMT+09:00 Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>:
>> I will update you later regarding the performance test results.
>>

I ran the performance test on the cache scan patch and below are the readings.

Configuration:

Shared_buffers - 512MB
cache_scan.num_blocks - 600
checkpoint_segments - 255

Machine:
OS - centos - 6.4
CPU - 4 core 2.5 GHZ
Memory - 4GB

                 Head      Patched     Diff
Select - 500K    772ms     2659ms     -200%
Insert - 400K    3429ms    1948ms      +43%  (I am not sure how it improved in this case)
delete - 200K    2066ms    3978ms      -92%
update - 200K    3915ms    5899ms      -50%

This patch shows nicely how custom scan can be used, but the patch itself
has some performance problems that need to be investigated.

I attached the test script file used for the performance test.

Regards,
Hari Babu
Fujitsu Australia

Attachment Content-Type Size
cache_scan_test.txt text/plain 3.3 KB

From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-03-12 06:26:10
Message-ID: 9A28C8860F777E439AA12E8AEA7694F8F897B8@BPXM15GP.gisp.nec.co.jp

Thanks for your efforts!
> Head patched
> Diff
> Select - 500K 772ms 2659ms -200%
> Insert - 400K 3429ms 1948ms 43% (I am
> not sure how it improved in this case)
> delete - 200K 2066ms 3978ms -92%
> update - 200K 3915ms 5899ms -50%
>
> This patch shown how the custom scan can be used very well but coming to
> patch as It is having some performance problem which needs to be
> investigated.
>
> I attached the test script file used for the performance test.
>
First of all, it seems to me your test case's data set is too small, so
all the data fits in memory: roughly, 500K records of 200 bytes each
consume about 100MB. Your configuration allocates 512MB of
shared_buffers, and about 3GB of OS-level page cache is available.
(Note that Linux adaptively uses free memory as disk cache.)

This cache is designed to hide the latency of disk accesses, so this test
case does not match its intention.
(Also, the primary purpose of this module is to demonstrate how
heap_page_prune_hook can hook into vacuuming, so simple code was preferred
over a more complicated but better-performing implementation.)

I could reproduce the overall trend: the plain scan is faster than the
cache scan when the buffers are already in memory. Probably this comes
from the cost of walking down the T-tree index by ctid for each reference.
The performance penalty on UPDATE and DELETE likely comes from trigger
invocation per row.
I observed a small performance gain on INSERT.
That is strange to me, too. :-(

On the other hand, the discussion around the custom-plan interface
affects this module because it uses that API as its foundation.
Please give me a few days to rebase the cache_scan module onto the
newer custom-plan interface that I submitted just a moment ago.

Also, is it really necessary to tune the performance of this example
module for heap_page_prune_hook?
Even though I have a few ideas to improve the cache performance, such as
inserting multiple rows at once or using a local chunk copy instead of the
t-tree walk-down, I'm not sure whether that is productive within the
current v9.4 timeframe. ;-(

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>



From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
Cc: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-03-12 06:43:18
Message-ID: CAJrrPGcoJTof2bOW3SGzch0TrATt5WLmTi9cMAfub=k=BSsfiQ@mail.gmail.com

On Wed, Mar 12, 2014 at 5:26 PM, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:
> Thanks for your efforts!
> First of all, it seems to me your test case has too small data set that
> allows to hold all the data in memory - briefly 500K of 200bytes record
> will consume about 100MB. Your configuration allocates 512MB of
> shared_buffer, and about 3GB of OS-level page cache is available.
> (Note that Linux uses free memory as disk cache adaptively.)

Thanks for the information, and a small correction: the total number of
records is 5 million. The select operation retrieves 500K of them, and
the total table size is around 1GB.

Once I get your new patch rebased on the custom-scan patch, I will test
the performance again with a database larger than RAM, and I will also
make sure that less memory is available for the disk cache.

Regards,
Hari Babu
Fujitsu Australia


From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-03-17 00:45:42
Message-ID: 9A28C8860F777E439AA12E8AEA7694F8F8B056@BPXM15GP.gisp.nec.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

The attached patches are revised according to the latest custom-plan
interface patch (v11).
The cache-scan module was re-implemented on the newer interface. I also
noticed that the extension did not handle redirected tuples correctly,
so I entirely revised the logic in ccache_vacuum_page(). It now
synchronizes the cached tuples per page rather than per tuple, and also
tries to merge t-tree chunks on a per-page basis.

Also, I split the patches again because the *demonstration* part is much
larger than the patches to the core backend; this should help reviewing.
* pgsql-v9.4-vacuum_page_hook.v11.patch
-> Adds a hook invoked for each page being vacuumed, which is needed to
synchronize the status of the in-memory cache managed by the extension.
* pgsql-v9.4-mvcc_allows_cache.v11.patch
-> Allows HeapTupleSatisfiesVisibility() to run against tuples held in
the in-memory cache, not just on the heap.
* pgsql-v9.4-example-cache_scan.v11.patch
-> Demonstrates the usage of the above two patches. It allows scanning
a relation without storage access when possible.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>


Attachment Content-Type Size
pgsql-v9.4-example-cache_scan.v11.patch application/octet-stream 94.7 KB
pgsql-v9.4-mvcc_allows_cache.v11.patch application/octet-stream 918 bytes
pgsql-v9.4-vacuum_page_hook.v11.patch application/octet-stream 1.8 KB

From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
Cc: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-03-18 02:14:21
Message-ID: CAJrrPGffdN0eQjeCei4SYNJ+erqHXLUbtbWvDKkU1vJeOeeX_w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 17, 2014 at 11:45 AM, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote:
> Hello,
>
> The attached patches are revised according to the latest custom-plan
> interface patch (v11).
> The cache-scan module was re-implemented on the newer interface. I also
> noticed that the extension did not handle redirected tuples correctly,
> so I entirely revised the logic in ccache_vacuum_page(). It now
> synchronizes the cached tuples per page rather than per tuple, and also
> tries to merge t-tree chunks on a per-page basis.
>
> Also, I split the patches again because the *demonstration* part is much
> larger than the patches to the core backend; this should help reviewing.
> * pgsql-v9.4-vacuum_page_hook.v11.patch
> -> Adds a hook invoked for each page being vacuumed, which is needed to
> synchronize the status of the in-memory cache managed by the extension.
> * pgsql-v9.4-mvcc_allows_cache.v11.patch
> -> Allows HeapTupleSatisfiesVisibility() to run against tuples held in
> the in-memory cache, not just on the heap.
> * pgsql-v9.4-example-cache_scan.v11.patch
> -> Demonstrates the usage of the above two patches. It allows scanning
> a relation without storage access when possible.

All the patches look good. The cache scan extension patch may need
further refinement in terms of performance, but that can be handled
later as well.
So I am marking the patch as "ready for committer". Thanks for the patch.

Regards,
Hari Babu
Fujitsu Australia


From: Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PgHacker <pgsql-hackers(at)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: contrib/cache_scan (Re: What's needed for cache-only table scan?)
Date: 2014-03-18 02:31:04
Message-ID: 9A28C8860F777E439AA12E8AEA7694F8F8BAC0@BPXM15GP.gisp.nec.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> > Also, I split the patches again because the *demonstration* part is
> > much larger than the patches to the core backend; this should help
> > reviewing.
> > * pgsql-v9.4-vacuum_page_hook.v11.patch
> > -> Adds a hook invoked for each page being vacuumed, which is needed to
> > synchronize the status of the in-memory cache managed by the extension.
> > * pgsql-v9.4-mvcc_allows_cache.v11.patch
> > -> Allows HeapTupleSatisfiesVisibility() to run against tuples held in
> > the in-memory cache, not just on the heap.
> > * pgsql-v9.4-example-cache_scan.v11.patch
> > -> Demonstrates the usage of the above two patches. It allows scanning
> > a relation without storage access when possible.
>
> All the patches look good. The cache scan extension patch may need
> further refinement in terms of performance, but that can be handled
> later as well.
> So I am marking the patch as "ready for committer". Thanks for the patch.
>
Thanks for your dedicated efforts on these patches.

The smaller portions of the above submission can be applied independently
of the custom-plan interface, and they are much smaller in scale than the
contrib/ portion (about 30 lines in total).
It would also make sense for someone to pick them up individually,
separate from the extension portion. I intended to keep the extension
portion as simple as I could, for demonstration purposes rather than
performance; even so, it is about 2.5KL. :-(
Yes, I know the time pressure towards the v9.4 final feature freeze....

Best regards,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>
