Some Improper Character Encoding

Started by DeltaMikeCharlie, March 23, 2021, 03:34:28 PM

Previous topic - Next topic

DeltaMikeCharlie

I've found that both my PVR (JSON feed) and your web site show what appear to be invalid characters, for example:

Your web site EPG shows:

Setsuko, a 55-year-old single â€˜office ladyâ€™ in Tokyo
(MOVIE: Oh Lucy! - SBS VICELAND HD)

These appear to be Unicode characters that have had their UTF-8 encoding expanded unnecessarily.

â€˜ in hex is E2 80 98, which is the UTF-8 encoding of 'LEFT SINGLE QUOTATION MARK' (U+2018)
Likewise, â€™ is E2 80 99, the UTF-8 encoding of 'RIGHT SINGLE QUOTATION MARK' (U+2019)

There may be more, but these are the ones that I have noticed.  A search for "â€" on the web site should find them.  I have seen them in both the title and description.

The JSON feed is slightly worse:

Setsuko, a 55-year-old single Ã¢â‚¬Ëœoffice ladyÃ¢â‚¬â„¢ in Tokyo

Ã¢â‚¬Ëœ = C3 A2 E2 82 AC CB 9C

Ã¢â‚¬â„¢ = C3 A2 E2 82 AC E2 84 A2

It appears to have been encoded as UTF-8 twice.
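That hex sequence is consistent with a CP1252 round trip: the correct UTF-8 bytes E2 80 98 are misread as the CP1252 characters â, € and ˜, and then that three-character string is encoded to UTF-8 again. A quick sketch reproducing it with iconv (assumes iconv with the CP1252 charset, as in GNU libiconv/glibc; octal escapes are used in printf for portability):

```shell
# Octal \342\200\230 = hex E2 80 98, the correct UTF-8 for U+2018.
# Misread those bytes as CP1252 characters and re-encode them as UTF-8:
printf '\342\200\230' | iconv -f CP1252 -t UTF-8 | od -An -tx1
# Output: c3 a2 e2 82 ac cb 9c  (the doubly-encoded sequence above)
```

Note that the presence of E2 82 AC (the euro sign, U+20AC) in the corrupted output pins the misinterpretation to CP1252 rather than ISO-8859-1, since byte 80 is a control character in the latter.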

From what I've read online, this seems to be a common issue regarding incorrectly interpreting encoding schemes.

Perhaps ICE is receiving these characters pre-corrupted from an upstream source, or perhaps it is incorrectly encoding them for web and JSON presentation.

Perhaps ICE TV could search for these character sequences and replace them with their originally intended UTF-8 equivalents before making the data available on their platform.  As these particular sequences are nonsensical, the risk of inadvertently changing some intended meaning would be fairly low.

prl

This is an issue that's been known to IceTV for some time. It seems to happen only where non-ASCII Unicode characters appear in the text: they seem to be encoded into UTF-8, and then encoded into UTF-8 a second time.

The reason they often appear where there should be apostrophes is that IceTV often uses the non-ASCII Unicode RIGHT SINGLE QUOTATION MARK (U+2019) rather than the ASCII APOSTROPHE (U+0027), but there are similar problems with other non-ASCII characters, such as accented letters (e.g. ü, é), ligatures (e.g. æ, œ) and non-English letters (e.g. å, ø).
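The same round trip explains the accented-letter cases. For example, ü is C3 BC in UTF-8; misread as CP1252 that becomes "Ã¼", which re-encodes to four bytes. A sketch, again assuming iconv with CP1252 support:

```shell
# Octal \303\274 = hex C3 BC, the UTF-8 for ü (U+00FC).
# The CP1252 misreading "Ã¼" re-encodes to four bytes:
printf '\303\274' | iconv -f CP1252 -t UTF-8 | od -An -tx1
# Output: c3 83 c2 bc  (displays as the familiar mojibake "Ã¼")
```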
Peter
Beyonwiz T4 in-use
Beyonwiz T2, T3, T4, U4 & V2 for testing

DeltaMikeCharlie

For now, I have a working AWK command that searches for the twice-encoded sequences and restores them to their once-encoded values.

awk '{gsub(/\xC3\xA2\xE2\x82\xAC\xC2\x9D/,"\xE2\x80\x9D");gsub(/\xC3\xA2\xE2\x82\xAC\xC5\x93/,"\xE2\x80\x9C");gsub(/\xC3\xA2\xE2\x82\xAC\xCB\x9C/,"\xE2\x80\x98");gsub(/\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2/,"\xE2\x80\x99")}1' input.json > output.json

I will have to insert this command between the EPG fetch from ICE and the EPG import command for my application.  Luckily, the process is already controlled by a script, so adding a new line will be trivial.

I'm sure that there will be other characters that I will need to address in the future, but these cover the ones that I have noticed at the moment.
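Rather than enumerating every corrupted sequence, the whole class can be reversed in one step by undoing the round trip: decode the outer UTF-8 layer and write the resulting characters back out as CP1252 bytes, which are the original single-encoded UTF-8. A sketch of the idea; the caveat is that it assumes the entire file was uniformly double-encoded (a mix of clean and corrupted text would be mangled, and any corrupted sequence involving CP1252's five unmapped bytes would fail):

```shell
# The sample bytes (octal for hex C3 A2 E2 82 AC CB 9C) are the
# doubly-encoded U+2018. Decoding the outer UTF-8 layer and emitting
# the characters as CP1252 bytes restores the original UTF-8:
printf '\303\242\342\202\254\313\234' | iconv -f UTF-8 -t CP1252 | od -An -tx1
# Output: e2 80 98  (the correct UTF-8 for U+2018)
```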

I may even consider converting the fancy Unicode quotes and apostrophes into standard ASCII characters too.
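That flattening could be done with a couple more gsub() calls in the same AWK pass; a sketch, assuming an awk that supports \xNN escapes (GNU awk does):

```shell
# Map the curly quotes U+2018/U+2019 to ASCII ' and U+201C/U+201D to ":
printf 'a \342\200\230b\342\200\231 \342\200\234c\342\200\235\n' |
awk '{gsub(/\xE2\x80\x98/,"\x27");gsub(/\xE2\x80\x99/,"\x27");
      gsub(/\xE2\x80\x9C/,"\"");gsub(/\xE2\x80\x9D/,"\"")}1'
# Output: a 'b' "c"
```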