Lists: | pgsql-general |
---|
From: | Sim Zacks <sim(at)compulab(dot)co(dot)il> |
---|---|
To: | PostgreSQL general <pgsql-general(at)postgresql(dot)org> |
Subject: | finding rows with invalid characters |
Date: | 2010-11-21 07:54:23 |
Message-ID: | 4CE8D02F.8060200@compulab.co.il |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-general |
I am using PG 8.2.17 with UTF8 encoding.
"PostgreSQL 8.2.17 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.1
(Gentoo 4.1.1)"
One of my tables somehow has invalid characters in it:
> ERROR: invalid byte sequence for encoding "UTF8": 0xa9
> HINT: This error can also happen if the byte sequence does not match
> the encoding expected by the server, which is controlled by
> "client_encoding".
I have already manually found a number of the bad rows by running
queries with text functions (upper) between groups of IDs until I found
the specific bad row.
1) Is there a quicker way to get a list of all rows with invalid characters
2) Shouldn't the database prevent these rows from being entered in the
first place?
3) I have backups of this database (using -Fc) and I noticed that on
restore, this table is not restored because of this error. Is there a
way to fix the existing backups, or tell the restore to ignore bad rows
instead of erroring out the whole table?
Thanks
Sim
From: | Dmitriy Igrishin <dmitigr(at)gmail(dot)com> |
---|---|
To: | Sim Zacks <sim(at)compulab(dot)co(dot)il> |
Cc: | PostgreSQL general <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: finding rows with invalid characters |
Date: | 2010-11-21 15:55:53 |
Message-ID: | AANLkTik9ApeYg+McXQd25Ag1A2rEbJWKod80WrghGMeG@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-general |
Hey Sim,
Maybe this helps:
http://blog.tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_2.html
2010/11/21 Sim Zacks <sim(at)compulab(dot)co(dot)il>
> I am using PG 8.2.17 with UTF8 encoding.
> "PostgreSQL 8.2.17 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.1
> (Gentoo 4.1.1)"
>
> One of my tables somehow has invalid characters in it:
>
>> ERROR: invalid byte sequence for encoding "UTF8": 0xa9
>> HINT: This error can also happen if the byte sequence does not match the
>> encoding expected by the server, which is controlled by "client_encoding".
>>
> I have already manually found a number of the bad rows by running queries
> with text functions (upper) between groups of IDs until I found the specific
> bad row.
>
> 1) Is there a quicker way to get a list of all rows with invalid characters
> 2) Shouldn't the database prevent these rows from being entered in the
> first place?
> 3) I have backups of this database (using -Fc) and I noticed that on
> restore, this table is not restored because of this error. Is there a way to
> fix the existing backups, or tell the restore to ignore bad rows instead of
> erroring out the whole table?
>
> Thanks
> Sim
>
> --
> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>
--
// Dmitriy.
From: | Sim Zacks <sim(at)compulab(dot)co(dot)il> |
---|---|
To: | Dmitriy Igrishin <dmitigr(at)gmail(dot)com> |
Cc: | PostgreSQL general <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: finding rows with invalid characters |
Date: | 2010-11-23 08:39:26 |
Message-ID: | 4CEB7DBE.9@compulab.co.il |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-general |
Thanks for your help.
I think a trigger will actually be the easiest. The way i can tell if
there is invalid data is simply to do an Upper(text) and if it has
invalid data it fails.
I dumped the fixed database. Now I have a years worth of backups that I
can't restore a specific table from. It most probably will never mean
anything. Every once in a while, I get asked to check what was in the db
against what is in there, but this table will probably never be audited.
Sim
On 11/23/2010 10:33 AM, Dmitriy Igrishin wrote:
> Hey Sim,
>
> 2010/11/23 Sim Zacks <sim(at)compulab(dot)co(dot)il <mailto:sim(at)compulab(dot)co(dot)il>>
>
> On 11/21/2010 05:55 PM, Dmitriy Igrishin wrote:
>> Hey Sim,
>>
>> Maybe this helps:
>> http://blog.tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_2.html
> That worked to find some of them. I still needed to find a bunch
> of others manually, such as 0xa0 and 0xd725 which weren't found
> with that function. I finally figured out that
> select * from emaildetails where emailbody like '%\xa0%' and
> select * from emaildetails where emailbody like '%\xd7\x25%' would
> show me all those rows.
>
> My 2 big problems now are:
>
> A) how to make sure that these chars are not inserted in the
> future. The database should prevent them from being inserted.
>
> Consider to use domains -- generic-based types with constraints --
> instead of generic types.
> Use regular expressions in constraints.
> As alternative, you can use triggers for more complex validation. But
> domains in you case
> IMO will work good.
>
>
> B) How to fix the backups that I have so that I can restore them.
> As I mentioned, they are being taken with -Fc
>
> Oops. Why not dump fixed database ?
>
>
> Sim
>
>
>
>> 2010/11/21 Sim Zacks <sim(at)compulab(dot)co(dot)il <mailto:sim(at)compulab(dot)co(dot)il>>
>>
>> I am using PG 8.2.17 with UTF8 encoding.
>> "PostgreSQL 8.2.17 on i686-pc-linux-gnu, compiled by GCC gcc
>> (GCC) 4.1.1 (Gentoo 4.1.1)"
>>
>> One of my tables somehow has invalid characters in it:
>>
>> ERROR: invalid byte sequence for encoding "UTF8": 0xa9
>> HINT: This error can also happen if the byte sequence
>> does not match the encoding expected by the server, which
>> is controlled by "client_encoding".
>>
>> I have already manually found a number of the bad rows by
>> running queries with text functions (upper) between groups of
>> IDs until I found the specific bad row.
>>
>> 1) Is there a quicker way to get a list of all rows with
>> invalid characters
>> 2) Shouldn't the database prevent these rows from being
>> entered in the first place?
>> 3) I have backups of this database (using -Fc) and I noticed
>> that on restore, this table is not restored because of this
>> error. Is there a way to fix the existing backups, or tell
>> the restore to ignore bad rows instead of erroring out the
>> whole table?
>>
>> Thanks
>> Sim
>>
>> --
>> Sent via pgsql-general mailing list
>> (pgsql-general(at)postgresql(dot)org
>> <mailto:pgsql-general(at)postgresql(dot)org>)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>>
>>
>>
>>
>> --
>> // Dmitriy.
>>
>>
>
>
>
>
> --
> // Dmitriy.
>
>
From: | Dmitriy Igrishin <dmitigr(at)gmail(dot)com> |
---|---|
To: | Sim Zacks <sim(at)compulab(dot)co(dot)il> |
Cc: | PostgreSQL general <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: finding rows with invalid characters |
Date: | 2010-11-23 09:02:18 |
Message-ID: | AANLkTinEj=CYHCkqci7Cv3_CtFZPUpDCHsOmpn9Y1w63@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-general |
Sim, are you sure that encoding of the dump and the database
in which you are trying to restore it are same ?
2010/11/23 Sim Zacks <sim(at)compulab(dot)co(dot)il>
> Thanks for your help.
>
> I think a trigger will actually be the easiest. The way i can tell if there
> is invalid data is simply to do an Upper(text) and if it has invalid data it
> fails.
>
> I dumped the fixed database. Now I have a years worth of backups that I
> can't restore a specific table from. It most probably will never mean
> anything. Every once in a while, I get asked to check what was in the db
> against what is in there, but this table will probably never be audited.
>
>
> Sim
>
>
>
> On 11/23/2010 10:33 AM, Dmitriy Igrishin wrote:
>
> Hey Sim,
>
> 2010/11/23 Sim Zacks <sim(at)compulab(dot)co(dot)il>
>
>> On 11/21/2010 05:55 PM, Dmitriy Igrishin wrote:
>>
>> Hey Sim,
>>
>> Maybe this helps:
>>
>> http://blog.tapoueh.org/articles/blog/_Getting_out_of_SQL_ASCII,_part_2.html
>>
>> That worked to find some of them. I still needed to find a bunch of
>> others manually, such as 0xa0 and 0xd725 which weren't found with that
>> function. I finally figured out that
>> select * from emaildetails where emailbody like '%\xa0%' and
>> select * from emaildetails where emailbody like '%\xd7\x25%' would show me
>> all those rows.
>>
>> My 2 big problems now are:
>>
>> A) how to make sure that these chars are not inserted in the future. The
>> database should prevent them from being inserted.
>>
> Consider to use domains -- generic-based types with constraints -- instead
> of generic types.
> Use regular expressions in constraints.
> As alternative, you can use triggers for more complex validation. But
> domains in you case
> IMO will work good.
>
>
>> B) How to fix the backups that I have so that I can restore them. As I
>> mentioned, they are being taken with -Fc
>>
> Oops. Why not dump fixed database ?
>
>>
>> Sim
>>
>>
>>
>> 2010/11/21 Sim Zacks <sim(at)compulab(dot)co(dot)il>
>>
>>> I am using PG 8.2.17 with UTF8 encoding.
>>> "PostgreSQL 8.2.17 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.1
>>> (Gentoo 4.1.1)"
>>>
>>> One of my tables somehow has invalid characters in it:
>>>
>>>> ERROR: invalid byte sequence for encoding "UTF8": 0xa9
>>>> HINT: This error can also happen if the byte sequence does not match
>>>> the encoding expected by the server, which is controlled by
>>>> "client_encoding".
>>>>
>>> I have already manually found a number of the bad rows by running queries
>>> with text functions (upper) between groups of IDs until I found the specific
>>> bad row.
>>>
>>> 1) Is there a quicker way to get a list of all rows with invalid
>>> characters
>>> 2) Shouldn't the database prevent these rows from being entered in the
>>> first place?
>>> 3) I have backups of this database (using -Fc) and I noticed that on
>>> restore, this table is not restored because of this error. Is there a way to
>>> fix the existing backups, or tell the restore to ignore bad rows instead of
>>> erroring out the whole table?
>>>
>>> Thanks
>>> Sim
>>>
>>> --
>>> Sent via pgsql-general mailing list (pgsql-general(at)postgresql(dot)org)
>>> To make changes to your subscription:
>>> http://www.postgresql.org/mailpref/pgsql-general
>>>
>>
>>
>>
>> --
>> // Dmitriy.
>>
>>
>>
>>
>
>
> --
> // Dmitriy.
>
>
>
>
--
// Dmitriy.
From: | Jasen Betts <jasen(at)xnet(dot)co(dot)nz> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: finding rows with invalid characters |
Date: | 2010-11-30 19:08:47 |
Message-ID: | id3i3v$j4l$1@reversiblemaps.ath.cx |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-general |
On 2010-11-21, Sim Zacks <sim(at)compulab(dot)co(dot)il> wrote:
> I am using PG 8.2.17 with UTF8 encoding.
> "PostgreSQL 8.2.17 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.1.1
> (Gentoo 4.1.1)"
>
> One of my tables somehow has invalid characters in it:
>> ERROR: invalid byte sequence for encoding "UTF8": 0xa9
>> HINT: This error can also happen if the byte sequence does not match
>> the encoding expected by the server, which is controlled by
>> "client_encoding".
> I have already manually found a number of the bad rows by running
> queries with text functions (upper) between groups of IDs until I found
> the specific bad row.
>
> 1) Is there a quicker way to get a list of all rows with invalid characters
dumpthe table, run it through iconv , diff agaist the original.
> 2) Shouldn't the database prevent these rows from being entered in the
> first place?
it should have, but that bug has now been fixed.
> 3) I have backups of this database (using -Fc) and I noticed that on
> restore, this table is not restored because of this error. Is there a
> way to fix the existing backups, or tell the restore to ignore bad rows
> instead of erroring out the whole table?
translate them to SQL (use pg_resore with no databse name)
then you can again use iconv to clean them.
use iconv something like this.
iconv --from-code UTF8 --to-code UTF8 -c < input_file > output_file
This will translate surrogates and drop other invalid characters.
if you have any constraints that place lower bounds on string-length
this has the potential to break them.
--
⚂⚃ 100% natural