Strange JSON escape characters in EPG data.

Started by DeltaMikeCharlie, February 19, 2018, 05:12:14 PM


DeltaMikeCharlie

I have noticed a difference in the "show" description text depending on whether the selected format is JSON or XML.

Here is an XML example.  Note the quotes around the first line of the description.

--SNIP--
  <desc lang="en">"I wanted to know what their plan was. I was their plan!"

The Doctor has been summoned by an old friend, but in the Cabinet War Rooms far below the streets of blitz-torn London, it's his oldest enemy he finds waiting for him...

The Daleks are back - but can Winston Churchill be in league with them? </desc>
  <credits>
--SNIP--

However, the JSON version has a series of escape sequences that do not appear to correspond to the characters shown in the XML version.

--SNIP--
    "desc": "\u00E2\u0080\u009CI wanted to know what their plan was. I was their plan!\u00E2\u0080\u009D\r\n\r\nThe Doctor has been summoned by an old friend, but in the Cabinet War Rooms far below the streets of blitz-torn London, it's his oldest enemy he finds waiting for him...\r\n\r\nThe Daleks are back - but can Winston Churchill be in league with them? ",
--SNIP--


"\u00E2\u0080\u009C" actually appears to be the three UTF-8 bytes (E2 80 9C) of the "LEFT DOUBLE QUOTATION MARK" character, which is U+201C, with each byte escaped as though it were a separate character.

I have encountered a number of web sites suggesting that the correct JSON encoding should actually be "\u201C".

https://www.fileformat.info/info/unicode/char/201c/index.htm
http://graphemica.com/%E2%80%9C
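To illustrate the difference from a JSON parser's point of view, here is a small Python sketch (Python is my choice for illustration, it is not what the server uses as far as I know). It parses both escape forms and lists the code points that come out. The variable names are made up for this example.

```python
import json

# The escape sequence as served, and the one the references above suggest.
# Raw strings keep the backslashes literal so json.loads sees the escapes.
observed = json.loads(r'"\u00E2\u0080\u009C"')
expected = json.loads(r'"\u201C"')

# The served form decodes to three separate characters.
print(len(observed), [f"U+{ord(c):04X}" for c in observed])
# 3 ['U+00E2', 'U+0080', 'U+009C']

# The suggested form decodes to the single quotation mark.
print(len(expected), [f"U+{ord(c):04X}" for c in expected])
# 1 ['U+201C']

# The UTF-8 bytes of U+201C match the three values in the served form.
utf8_bytes = "\u201C".encode("utf-8")
print(utf8_bytes.hex(" ").upper())
# E2 80 9C
```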

I have also seen this occur with "\u00E2\u0080\u009D" (right double quote "\u201D") and "\u00E2\u0080\u0093" (en dash "\u2013").  Perhaps there are others.
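Until the feed is fixed, a client-side workaround might look like the following Python sketch. It assumes the damage is exactly what is described above (each UTF-8 byte turned into its own character), and the function name repair_mojibake is hypothetical. Strings that do not fit that pattern are returned unchanged.

```python
import json


def repair_mojibake(text: str) -> str:
    """Undo byte-per-character escaping, if that is what happened.

    Assumption: every character in a damaged string is in the range
    U+0000..U+00FF and the values, read as bytes, form valid UTF-8.
    """
    try:
        # latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable as bytes, or not valid UTF-8: leave it alone.
        return text


# Usage with a shortened copy of the description from the JSON sample.
raw = r'"\u00E2\u0080\u009CI was their plan!\u00E2\u0080\u009D"'
fixed = repair_mojibake(json.loads(raw))
print([f"U+{ord(c):04X}" for c in (fixed[0], fixed[-1])])
# ['U+201C', 'U+201D']
```

The fallback matters because a correctly encoded string (one already containing U+201C, or a genuine Latin-1 character such as an accented "e") fails one of the two conversions and is passed through untouched.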

I'm happy to be proven wrong, but I thought that files containing JSON were already supposed to be in Unicode and that only reserved characters (such as double quotes, backslashes and control characters) needed to be escaped.

It appears that the process that is creating the JSON output is reading the source as single-byte text (ASCII or Latin-1) rather than as UTF-8 text, and is therefore escaping each byte of a multi-byte character individually and not as a single code point.
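The hypothesis above can be reproduced in a few lines of Python. This is only a guess at the kind of mistake involved, not a description of the actual server code. Latin-1 stands in for "single-byte text" here because Python's strict ASCII decoder would reject these bytes outright. Note that Python writes the hex digits in lower case, whereas the feed uses upper case.

```python
import json

# Source text as it would be stored on disk or in a database: UTF-8 bytes.
source_bytes = "\u201cI was their plan!\u201d".encode("utf-8")

# Suspected bug: bytes decoded one-per-character, then JSON-escaped.
wrong = json.dumps(source_bytes.decode("latin-1"))
print(wrong)
# "\u00e2\u0080\u009cI was their plan!\u00e2\u0080\u009d"

# Expected behaviour: bytes decoded as UTF-8, then JSON-escaped.
right = json.dumps(source_bytes.decode("utf-8"))
print(right)
# "\u201cI was their plan!\u201d"
```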

I am using cURL with [--header "Accept: application/json"].  I was willing to concede that perhaps cURL is mangling the data on the way in; however, the trace file shows the data arriving pre-mangled.

Is there something wrong with my request or is the server genuinely serving up these seemingly erroneously escaped characters?

prl

I sent a long and detailed email about this and a related problem to Daniel Hall a few weeks ago. I haven't heard back.
Peter
Beyonwiz T4 in-use
Beyonwiz T2, T3, T4, U4 & V2 for testing

DeltaMikeCharlie

Quote from: prl on February 19, 2018, 09:44:12 PM
I sent a long and detailed email about this and a related problem to Daniel Hall a few weeks ago. I haven't heard back.
Thanks prl.

Daniel Hall at IceTV

Regards,

Daniel.
CTO.

prl
