Re: 64-bit API for large object

From: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
To: Nozomi Anzai <anzai(at)sraoss(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 64-bit API for large object
Date: 2012-09-27 04:01:18
Message-ID: CADyhKSUOxOnLcmbEiUPffHYf=hCUjUxmWpAJGpEcVecSmoKn9Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I checked this patch. It looks good, but here are still some points to be
discussed.

* I have a question. What is the meaning of INT64_IS_BUSTED?
It seems to me a marker to indicate a platform without 64bit support.
However, the commit 901be0fad4034c9cf8a3588fd6cf2ece82e4b8ce
says as follows:
| Remove all the special-case code for INT64_IS_BUSTED, per decision that
| we're not going to support that anymore.

* At inv_seek(), it seems to me it checks offset correctness with wrong way,
as follows:
| case SEEK_SET:
| if (offset < 0)
| elog(ERROR, "invalid seek offset: " INT64_FORMAT, offset);
| obj_desc->offset = offset;
| break;
It is a right assumption, if large object size would be restricted to 2GB.
But the largest positive int64 is larger than expected limitation.
So, it seems to me it should be compared with (INT_MAX * PAGE_SIZE)
instead.

* At inv_write(), it definitely needs a check to prevent data-write upper 4TB.
In case when obj_desc->offset is a bit below 4TB, an additional 1GB write
will break head of the large object because of "pageno" overflow.

* Please also add checks on inv_read() to prevent LargeObjectDesc->offset
unexpectedly overflows 4TB boundary.

* At inv_truncate(), variable "off" is re-defined to int64. Is it really needed
change? All its usage is to store the result of "len % LOBLKSIZE".

Thanks,

2012/9/24 Nozomi Anzai <anzai(at)sraoss(dot)co(dot)jp>:
> Here is 64-bit API for large object version 2 patch.
>
>> I checked this patch. It can be applied onto the latest master branch
>> without any problems. My comments are below.
>>
>> 2012/9/11 Tatsuo Ishii <ishii(at)postgresql(dot)org>:
>> > Ok, here is the patch to implement 64-bit API for large object, to
>> > allow to use up to 4TB large objects(or 16TB if BLCKSZ changed to
>> > 32KB). The patch is based on Jeremy Drake's patch posted on September
>> > 23, 2005
>> > (http://archives.postgresql.org/pgsql-hackers/2005-09/msg01026.php)
>> > and reasonably updated/edited to adopt PostgreSQL 9.3 by Nozomi Anzai
>> > for the backend part and Yugo Nagata for the rest(including
>> > documentation patch).
>> >
>> > Here are changes made in the patch:
>> >
>> > 1) Frontend lo_* libpq functions(fe-lobj.c)(Yugo Nagata)
>> >
>> > lo_initialize() gathers backend 64-bit large object handling
>> > function's oid, namely lo_lseek64, lo_tell64, lo_truncate64.
>> >
>> > If client calls lo_*64 functions and backend does not support them,
>> > lo_*64 functions return error to caller. There might be an argument
>> > since calls to lo_*64 functions can automatically be redirected to
>> > 32-bit older API. I don't know this is worth the trouble though.
>> >
>> I think it should definitely return an error code when user tries to
>> use lo_*64 functions towards the backend v9.2 or older, because
>> fallback to 32bit API can raise unexpected errors if application
>> intends to seek the area over than 2GB.
>>
>> > Currently lo_initialize() throws an error if one of oids are not
>> > available. I doubt we do the same way for 64-bit functions since this
>> > will make 9.3 libpq unable to access large objects stored in pre-9.2
>> > PostgreSQL servers.
>> >
>> It seems to me the situation to split the case of pre-9.2 and post-9.3
>> using a condition of "conn->sversion >= 90300".
>
> Fixed so, and tested it by deleteing the lo_tell64's row from pg_proc.
>
>
>> > To pass 64-bit integer to PQfn, PQArgBlock is used like this: int *ptr
>> > is a pointer to 64-bit integer and actual data is placed somewhere
>> > else. There might be other way: add new member to union u to store
>> > 64-bit integer:
>> >
>> > typedef struct
>> > {
>> > int len;
>> > int isint;
>> > union
>> > {
>> > int *ptr; /* can't use void (dec compiler barfs) */
>> > int integer;
>> > int64 bigint; /* 64-bit integer */
>> > } u;
>> > } PQArgBlock;
>> >
>> > I'm a little bit worried about this way because PQArgBlock is a public
>> > interface.
>> >
>> I'm inclined to add a new field for the union; that seems to me straight
>> forward approach.
>> For example, the manner in lo_seek64() seems to me confusable.
>> It set 1 on "isint" field even though pointer is delivered actually.
>>
>> + argv[1].isint = 1;
>> + argv[1].len = 8;
>> + argv[1].u.ptr = (int *) &len;
>
> Your proposal was not adopted per discussion.
>
>
>> > Also we add new type "pg_int64":
>> >
>> > #ifndef NO_PG_INT64
>> > #define HAVE_PG_INT64 1
>> > typedef long long int pg_int64;
>> > #endif
>> >
>> > in postgres_ext.h per suggestion from Tom Lane:
>> > http://archives.postgresql.org/pgsql-hackers/2005-09/msg01062.php
>> >
>> I'm uncertain about context of this discussion.
>>
>> Does it make matter if we include <stdint.h> and use int64_t instead
>> of the self defined data type?
>
> Your proposal was not adopted per discussion.
> Per discussion, endiannness translation was moved to fe-lobj.c.
>
>
>> > 2) Backend lo_* functions (be-fsstubs.c)(Nozomi Anzai)
>> >
>> > Add lo_lseek64, lo_tell64, lo_truncate64 so that they can handle
>> > 64-bit seek position and data length. loread64 and lowrite64 are not
>> > added because if a program tries to read/write more than 2GB at once,
>> > it would be a sign that the program need to be re-designed anyway.
>> >
>> I think it is a reasonable.
>>
>> > 3) Backend inv_api.c functions(Nozomi Anzai)
>> >
>> > No need to add new functions. Just extend them to handle 64-bit data.
>> >
>> > BTW , what will happen if older 32-bit libpq accesses large objects
>> > over 2GB?
>> >
>> > lo_read and lo_write: they can read or write lobjs using 32-bit API as
>> > long as requested read/write data length is smaller than 2GB. So I
>> > think we can safely allow them to access over 2GB lobjs.
>> >
>> > lo_lseek: again as long as requested offset is smaller than 2GB, there
>> > would be no problem.
>> >
>> > lo_tell:if current seek position is beyond 2GB, returns an error.
>> >
>> Even though iteration of lo_lseek() may move the offset to 4TB, it also
>> makes unavailable to use lo_tell() to obtain the current offset, so I think
>> it is reasonable behavior.
>>
>> However, error code is not an appropriate one.
>>
>> + if (INT_MAX < offset)
>> + {
>> + ereport(ERROR,
>> + (errcode(ERRCODE_UNDEFINED_OBJECT),
>> + errmsg("invalid large-object
>> descriptor: %d", fd)));
>> + PG_RETURN_INT32(-1);
>> + }
>>
>> According to the manpage of lseek(2)
>> EOVERFLOW
>> The resulting file offset cannot be represented in an off_t.
>>
>> Please add a new error code such as ERRCODE_BLOB_OFFSET_OVERFLOW.
>
> Changed the error code and error message. We added a new error code
> "ERRCODE_UNDEFINED_OBJECT (22P07)".
>
>
>> > 4) src/test/examples/testlo64.c added for 64-bit API example(Yugo Nagata)
>> >
>> > Comments and suggestions are welcome.
>> >
>> miscellaneous comments are below.
>>
>> Regression test is helpful. Even though no need to try to create 4TB large
>> object, it is helpful to write some chunks around the design boundary.
>> Could you add some test cases that writes some chunks around 4TB offset.
>
> Added 64-bit lobj test items into regression test and confirmed it worked
> rightly.
>
>
>> Thanks,
>> --
>> KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>
>>
>>
>> --
>> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-hackers
>
>
> --
> Nozomi Anzai
> SRA OSS, Inc. Japan
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kohei KaiGai 2012-09-27 04:06:41 Re: [v9.3] Extra Daemons (Re: elegant and effective way for running jobs inside a database)
Previous Message Noah Misch 2012-09-27 03:41:36 Re: [WIP] Performance Improvement by reducing WAL for Update Operation