Systemic strange characters in descriptions

Started by franklinoffner, September 14, 2021, 02:29:02 PM

franklinoffner

I am noticing strange character sequences showing up in the EPG descriptions. For instance, from the web display of the upcoming War of the Worlds:
Quote: Bill explains his plan to infect the aliens with a virus to Ethan, Dominic and Michael but can’t bring himself to support
You would expect a single quote/apostrophe. Is there an issue? I see the same sequence on my recorder and on the web.

When I dig into the PVR storage of the description, I find the strange characters are coming from a sequence of 8 bytes (decimal) 195,162,226,130,172,226,132,162, in the midst of the text.

Alan.
Alan
BeyonWiz T4, BeyonWiz U4.

IanL-S

The EPG data is provided to IceTV by SBS and it frequently has odd characters in it. It seems to have been worse recently - just the impression I get.

Ian
IceTV: IceBox + BYOB IceBox + 2xTRF-2400 + 2xTF7100HDPVRtPlus + SKIPPA [RIP] + T2 + U4 + V2
No IceTV: a few Toppys and T2
Synology NAS
Check out the oztoppy wiki and oztoppy Forum for Toppy help

prl

It's pretty much always been like that. It's an encoding fault somewhere along the way. The JSON that the Beyonwiz receives is supposed to carry all its text as Unicode, but in what actually arrives, every codepoint greater than U+FF has been encoded as UTF-8 and each resulting byte sent as if it were a codepoint of its own, when the original codepoint should have been represented directly.

Coincidentally, this morning I was looking at some code I was writing a while ago to work around the problem. I think that it can be made to work around the problem, but the problem is actually somewhere along the chain from SBS to what IceTV sends to the Beyonwiz (and possibly other PVRs), not in the Beyonwiz.

You see it in apostrophes because they are often encoded as Right Single Quotation Mark (U+2019) rather than the simple ASCII Apostrophe (U+0027). But you also see it for some foreign-language letters and accented letters like ü, å, ø, é, etc.
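
The fault described above can be imitated in a few lines of Python. This is only a sketch of the symptom, not of whatever actually happens between SBS and IceTV; the function name `mangle` is made up for illustration.

```python
def mangle(text):
    """Imitate the symptom: codepoints above U+FF become UTF-8 bytes,
    and each byte is then stored as a separate codepoint."""
    return "".join(
        # Decoding as Latin-1 maps byte 0xNN straight to codepoint U+00NN.
        ch if ord(ch) <= 0xFF else ch.encode("utf-8").decode("latin-1")
        for ch in text
    )

print(ascii(mangle("Frasier\u2019s")))  # 'Frasier\xe2\x80\x99s'
print(ascii(mangle("caf\u00e9")))       # 'caf\xe9'
```

The apostrophe turns into three codepoints, while "é" (below U+0100) passes through untouched.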
Peter
Beyonwiz T4 in-use
Beyonwiz T2, T3, T4, U4 & V2 for testing

IanL-S

Peter, the problem also arises with Toppys; I noticed it quite recently. It's been an ongoing issue for as long as I can remember. I have been using IceTV since analogue days - had a Topfield 5000 Masterpiece.

Ian

franklinoffner

Yes, it is clearly an issue at the IceTV end.  I'd started to write up a query in the BeyonWiz forum, but then thought to check further up the food chain and found that the same issue presents in Windows browsers, Mac browsers and in the IceTV iPad app.

I tried looking at the bytes to see if I could find a rational correlation to some Unicode glyph but without success. I am perplexed to see how an expected single glyph ended up as 8 bytes.

I raised it here as an issue (although a minor one) so that it might prompt somebody at IceTV to chase it down.
Alan

prl

I'm not surprised that it's also a problem in Topfields (and possibly other clients, too). I simply only have detailed information about it on Beyonwizes.

The problem, it seems to me, is that somewhere between SBS and where IceTV produces its JSON feed (and possibly its XML feeds too), the string is encoded from Unicode to UTF-8 when it shouldn't be.
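
If that diagnosis is right, the unwanted step can in principle be undone on the receiving side. The sketch below is not the workaround code mentioned earlier in the thread; `repair` and its regular expression are illustrative assumptions. It looks for runs of codepoints in the range U+0080 to U+00FF that have the shape of a UTF-8 sequence and decodes them, leaving anything that isn't valid UTF-8 untouched. The test string is a fragment of one of the 2017 capture examples in this post.

```python
import json
import re

# UTF-8 sequences whose bytes have each been stored as a separate codepoint:
# a lead byte followed by the number of continuation bytes that lead implies.
_UTF8_AS_CODEPOINTS = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"
)

def repair(text):
    """Decode byte-per-codepoint UTF-8 runs; leave everything else alone."""
    def fix(match):
        try:
            return match.group().encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            return match.group()  # right shape, but not valid UTF-8
    return _UTF8_AS_CODEPOINTS.sub(fix, text)

received = json.loads(
    r'"A retirement party for Frasier\u00E2\u0080\u0099s old friend Cliff"'
)
print(ascii(repair(received)))
# 'A retirement party for Frasier\u2019s old friend Cliff'
```

This is a heuristic rather than a true fix: a genuine pair such as "Â©" (U+00C2 U+00A9) would be wrongly collapsed to "©". The real fix belongs upstream.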

Here are some examples of the JSON strings sent by IceTV from a capture I did in 2017 (so the problem has been around for at least 4 years).

Encoding HORIZONTAL ELLIPSIS:
"A Palm Springs real estate developer is murdered and Jessica is quick to point the finger\u00E2\u0080\u00A6 but that could have been a big mistake!"
the encoding should be
"A Palm Springs real estate developer is murdered and Jessica is quick to point the finger\u2026 but that could have been a big mistake!"

Encoding RIGHT SINGLE QUOTATION MARK, LEFT DOUBLE QUOTATION MARK and UNDERTIE ("‿"):
"A retirement party for Frasier\u00E2\u0080\u0099s old friend Cliff provides the perfect opportunity for a \u00E2\u0080\u009CCheers\u00E2\u0080\u00BF cast reunion."
the encoding should be
"A retirement party for Frasier\u2019s old friend Cliff provides the perfect opportunity for a \u201CCheers\u203F cast reunion."
In this case, I suspect that the UNDERTIE is a typo for RIGHT DOUBLE QUOTATION MARK, and the actual correct string is:
"A retirement party for Frasier\u2019s old friend Cliff provides the perfect opportunity for a \u201CCheers\u201D cast reunion."

Encoding EN DASH (note that in this case, the ASCII APOSTROPHE character is used rather than a LEFT SINGLE QUOTATION MARK/RIGHT SINGLE QUOTATION MARK pair, so that doesn't cause an encoding problem):
"The Aztecs also devised the idea of 'chinampas' \u00E2\u0080\u0093 man-made artificial islands used to grow crops."
should be
"The Aztecs also devised the idea of 'chinampas' \u2013 man-made artificial islands used to grow crops."

In this case, the actor's name, Elisabeth Röhm, has been mangled by replacing LATIN SMALL LETTER O WITH DIAERESIS with REPLACEMENT CHARACTER ("�"), and then encoding it incorrectly:
"Elisabeth R\u00EF\u00BF\u00BDhm"
should be encoded as:
"Elisabeth R\uFFFDhm"
but the correct string would be:
"Elisabeth R\u00F6hm"

In the case of the actor's name, that's not displayed on the Beyonwiz unless you use adoxa's plugin that attaches credits to long descriptions, but the data is a mess anyway.
REPLACEMENT CHARACTER appears in a couple of other places in the same grab - in places where it doesn't seem to make any sense, because the character apparently being replaced looks unexceptional.
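
To make the two layers of damage in the Elisabeth Röhm example concrete, here is a small Python sketch. The second half shows one way a U+FFFD could have arisen (Latin-1 bytes decoded as UTF-8 with substitution); that route is purely an assumption, not something the capture can confirm.

```python
import json

received = json.loads(r'"Elisabeth R\u00EF\u00BF\u00BDhm"')

# Layer 1: undo the byte-per-codepoint encoding.
unwrapped = received.encode("latin-1").decode("utf-8")
print(ascii(unwrapped))  # 'Elisabeth R\ufffdhm'

# Layer 2 (assumed cause): the Latin-1 byte for "ö" (0xF6) is not valid UTF-8,
# so a lenient UTF-8 decoder substitutes U+FFFD and the original letter is lost.
lossy = "Elisabeth R\u00f6hm".encode("latin-1").decode("utf-8", errors="replace")
print(lossy == unwrapped)  # True
```

Once U+FFFD is in the data, no decoding step can recover the "ö"; the name would have to be corrected at the source.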

I noticed in some other IceTV EPG grabs that the actor Sofie Gråbøl's name (Sarah Lund, The Killing; Hildur Odegard, Fortitude) had been changed to Sofie Grabol, so that the name encodes correctly, but is incorrectly spelt.

I'm happy to post examples from the current IceTV EPG if that would help anyone.
Peter

prl

Quote from: franklinoffner on September 15, 2021, 10:22:11 AM
I tried looking at the bytes to see if I could find a rational correlation to some Unicode glyph but without success. I am perplexed to see how an expected single glyph ended up as 8 bytes.

In the example you showed:
Quote: Bill explains his plan to infect the aliens with a virus to Ethan, Dominic and Michael but can’t bring himself to support

It seems that IceTV is now not sending "\u" encodings of the UTF-8 bytes of Unicode codepoints, but is instead sending bare UTF-8 bytes. However, I can't explain how the RIGHT SINGLE QUOTATION MARK ("\u2019") is being encoded as "’" ("\u00e2\u20ac\u2122"). I'd expect it to be encoded as "\u00e2\u0080\u0099", where "\u00e2" is "â" and "\u0080" and "\u0099" are unnamed Unicode <control> codepoints.

The 8-byte decimal byte sequence (195,162,226,130,172,226,132,162) is simply the UTF-8 byte encoding of "’" (2 bytes for the first character, 3 bytes each for the other two).
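
The byte arithmetic can be checked in Python. The second half of the sketch tests one possible explanation for the "\u00e2\u20ac\u2122" form: that the UTF-8 bytes of U+2019 (E2 80 99) were at some point read as Windows-1252, in which 0x80 is "€" and 0x99 is "™". That is an assumption about the upstream processing, not something confirmed by IceTV.

```python
stored = bytes([195, 162, 226, 130, 172, 226, 132, 162])

# The eight stored bytes are valid UTF-8 for three characters.
shown = stored.decode("utf-8")
print(len(shown), ascii(shown))  # 3 '\xe2\u20ac\u2122'

# Assumed cause: the UTF-8 bytes of U+2019 decoded as Windows-1252 (cp1252).
misread = "\u2019".encode("utf-8").decode("cp1252")
print(misread == shown)  # True

# Under that assumption the damage is reversible.
print(ascii(shown.encode("cp1252").decode("utf-8")))  # '\u2019'
```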

So it seems that while the problem remains, its details have changed somewhere along the line.

In the current EPG, the credits of the repeats of The Killing are misspelling the names of actors Sofie Gråbøl & Søren Malling as Sofie Grabol & Soren Malling, but avoiding problems with encoding, while in the credits of War Of The Worlds, Léa Drucker's name is being both spelt and encoded correctly.
Peter

franklinoffner

Thanks Peter, that gives me a better understanding of the eight bytes against the 3 glyphs. I guess we will just have to hope that the resources are eventually allocated to find and fix it. If it is changing over time, then even a lookup-table approach of "find this pattern, replace it with that" is going to fail.
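
A fixed lookup table isn't the only option, though. As a hedged sketch (the function name `undo_mojibake` and the choice of candidate encodings are assumptions, based only on the two variants seen in this thread), the damage can be reversed by re-encoding the text with the suspected wrong encoding and checking whether the result is valid UTF-8:

```python
def undo_mojibake(text, wrong_encodings=("cp1252", "latin-1")):
    """Return repaired text, or the input unchanged if no candidate fits."""
    if text.isascii():
        return text  # nothing to repair
    for encoding in wrong_encodings:
        try:
            return text.encode(encoding).decode("utf-8")
        except UnicodeError:  # covers both encode and decode failures
            continue
    return text

print(ascii(undo_mojibake("can\u00e2\u20ac\u2122t")))       # 'can\u2019t'
print(ascii(undo_mojibake("Frasier\u00e2\u0080\u0099s")))   # 'Frasier\u2019s'
print(ascii(undo_mojibake("L\u00e9a Drucker")))             # 'L\xe9a Drucker'
```

It works on whole strings, so a description that mixes mangled and correctly encoded non-ASCII characters would be left unchanged.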
Alan

prl

I've just had a look at all the non-ASCII characters in the title, subtitle and description fields in the current IceTV guide.

The most commonly mangled characters are LEFT SINGLE QUOTATION MARK, RIGHT SINGLE QUOTATION MARK, LEFT DOUBLE QUOTATION MARK, RIGHT DOUBLE QUOTATION MARK and EM DASH.

It appears that most Unicode characters in the C1 Controls and Latin-1 Supplement block (U+0080 - U+00FF) are being encoded correctly. That block contains most of the accented characters used in western European languages, though I didn't find any instances of characters with diaereses (like German ä, ö, ü), any of the Scandinavian characters (like å, ø, æ), or other characters like ß that are also in that block.

One exception to that was in the description for Jandal Burn/Prank Day, which contained "\u0085" (<control>NEXT LINE) in a position where I'd have expected "\u2026" (HORIZONTAL ELLIPSIS) or perhaps "\u2014" (EM DASH).
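
One guess at the cause, and it is only a guess: byte 0x85 is HORIZONTAL ELLIPSIS in Windows-1252, so a Windows-1252 ellipsis read as Latin-1 would come out as U+0085. In Python:

```python
# HORIZONTAL ELLIPSIS is a single byte in Windows-1252.
ellipsis_byte = "\u2026".encode("cp1252")
print(ellipsis_byte)  # b'\x85'

# Reading that byte as Latin-1 instead of Windows-1252 yields the control
# character seen in the guide data.
print(ascii(ellipsis_byte.decode("latin-1")))  # '\x85'
```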

A few characters in the General Punctuation block (U+2000 - U+206F) are encoded correctly.
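
A survey like this is easy to script. The sketch below assumes the guide text has already been extracted into a list of strings; the sample list is made up for illustration, and this is not the script used for the survey above.

```python
from collections import Counter
import unicodedata

def describe(ch):
    """Format a character as codepoint plus Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

def survey(texts):
    """Count every non-ASCII character across the given strings."""
    return Counter(ch for text in texts for ch in text if ord(ch) > 0x7F)

# Hypothetical sample data, not taken from the current guide.
sample = ["Frasier\u00e2\u0080\u0099s old friend", "L\u00e9a Drucker", "caf\u00e9"]
for ch, count in survey(sample).most_common():
    print(count, describe(ch))
```

Control codepoints such as U+0080 have no Unicode name, so they show up as `<unnamed>`, which makes stray UTF-8 bytes easy to spot in the output.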
Peter

prl

Another incorrect representation of non-ASCII characters: in the synopsis for Artefact 'Tangata Whenua' (NITV, 6:30pm, Sun, 10 Oct), the XML numeric character reference &#257; appears literally in the word "Māori" instead of the character itself, U+0101 (LATIN SMALL LETTER A WITH MACRON).
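
A client could decode such references itself. As a minimal sketch, Python's standard `html.unescape` handles numeric character references like this one; whether applying it to every synopsis is safe is an open question, since it would also rewrite any text that legitimately contains an ampersand sequence.

```python
import html

# &#257; is decimal 257, i.e. U+0101.
decoded = html.unescape("M&#257;ori")
print(decoded)         # Māori
print(ascii(decoded))  # 'M\u0101ori'
```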
Peter