Invalid Characters / Dodgy Characters

Started by BJReplay, April 11, 2005, 06:21:47 PM


BJReplay

E.g. ABC Sydney & Melbourne (Channel 2 & 9), Australian Story for tonight (11 April), ICETV Ep Num 35611 & 36672:
The description contains "climbed Ball’s Pyramid" and ends with "waiting….".

Other characters (e.g. é in café) are correctly escaped (é), but it looks like apostrophes, amongst other things, are being handled incorrectly.

Cheers

BJ

Russell at IceTV

Thanks for letting us know about this, BJ. I've found the problem and it should be fixed very soon.

Russell

tonymy01

I notice it for tonight too, for Rove: "perhaps his polar opposite &#8211; highly acclaimed actor"..
edit: I guess this may actually be OK, and I should be talking to John (the TED author) about interpreting these correctly when he builds out the ASCII file format..
Regards
Tony

Beyonwiz DP-S1 & Topfield 5K (using PerlTGD to upload ICE EPG/timers for the 5K, normal ICE interactive for the Wiz).

Russell at IceTV

Quote:
I notice it for tonight too, for Rove: "perhaps his polar opposite &#8211; highly acclaimed actor"..
edit: I guess this may actually be OK, and I should be talking to John (the TED author) about interpreting these correctly when he builds out the ASCII file format..
Regards
Tony

Hi Tony,

Yes, in this case the ampersand code is an "en dash". The problem is what to do with characters that fall outside the basic ASCII range, such as accented characters from foreign languages. We're converting all of them to ampersand escape sequences like the one you found, but some PVRs may not be displaying them correctly. As far as I know this is the only good way of handling these characters, but I'm not actually sure what the official word from XMLTV is on this subject. We're always open to suggestions though, if there's a method that's compatible with more PVRs.
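
For anyone following along, here is a minimal Python sketch of that kind of conversion. It is only an illustration, not IceTV's server code: the function name to_char_refs is made up for this example, and it relies on Python's standard "xmlcharrefreplace" error handler.

```python
def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    # "xmlcharrefreplace" is a standard Python codec error handler; ASCII text passes through untouched.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# \u2013 is the en dash, \u00e9 is e-acute.
sample = "perhaps his polar opposite \u2013 highly acclaimed actor"
print(to_char_refs(sample))       # perhaps his polar opposite &#8211; highly acclaimed actor
print(to_char_refs("caf\u00e9"))  # caf&#233;
```

A PVR or converter that does not decode numeric references would show the &#8211; sequence literally, which would match what Tony saw.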

Thanks,
Russell

trapper

One of the problems at the moment with the XML ICEguide data is that the conversions that cater for extended characters etc. are not happening in the correct sequence.

Take for example the following snippet from Aug 21:
A not-to-be missed analysis of the week&amp;#8217;s political news
This has resulted from 2 separate conversions done in the wrong order.

The apostrophe in week's was initially converted to <ampersand>#8217; but ampersands (as in Law & Order) also have to be converted to <ampersand>amp;

The problem in the above code snippet is that the apostrophe was converted before the ampersands. This has resulted in the ampersands which were introduced from the first conversion themselves being converted. So apostrophe becomes <ampersand>#8217 but then the ampersand there is further converted so we end up with &amp;#8217;
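
The ordering issue can be reproduced with a small Python sketch. Both function names and the sample string are invented for this illustration; it is not a claim about how the real conversion is implemented.

```python
def escape_wrong_order(text: str) -> str:
    # Step 1: non-ASCII characters become numeric references (this introduces new ampersands).
    step1 = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    # Step 2: every ampersand is escaped, including the ones step 1 just introduced.
    return step1.replace("&", "&amp;")


def escape_right_order(text: str) -> str:
    # Step 1: escape the ampersands that are really in the text.
    step1 = text.replace("&", "&amp;")
    # Step 2: only now introduce numeric references, so their ampersands stay intact.
    return step1.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "Law & Order: the week\u2019s news"  # \u2019 is the curly apostrophe
print(escape_wrong_order(sample))  # Law &amp; Order: the week&amp;#8217;s news
print(escape_right_order(sample))  # Law &amp; Order: the week&#8217;s news
```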

The problem with that is that no ready-made HTML decode will fix it. The fix has to be purpose-built.
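
As a rough idea of what a purpose-built fix might look like, the Python sketch below undoes only the double-escaped numeric references and leaves genuine &amp; sequences alone. The pattern and the function name repair_double_escaped are assumptions made for this example.

```python
import html
import re

# Matches "&amp;" only when it is immediately followed by a numeric reference body such as "#8217;".
DOUBLE_ESCAPED = re.compile(r"&amp;(#[0-9]+;|#[xX][0-9a-fA-F]+;)")


def repair_double_escaped(text: str) -> str:
    """Turn '&amp;#8217;' back into '&#8217;' without touching a real '&amp;'."""
    return DOUBLE_ESCAPED.sub(r"&\1", text)


broken = "Law &amp; Order: the week&amp;#8217;s news"
repaired = repair_double_escaped(broken)

print(html.unescape(broken))    # a single standard decode still leaves &#8217; in the text
print(repaired)                 # Law &amp; Order: the week&#8217;s news
print(html.unescape(repaired))  # decodes cleanly once the repair has been applied
```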

Cheers...

EDIT: Just as a matter of interest, I've found from my work on TED/S that the only conversion which seems necessary is & to &amp;. Extended characters seem to display fine in HTML... at least in IE. Internet Explorer is actually very useful for checking XML files. It will report any parsing errors, highlighting where and what caused the error.
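
If IE isn't handy, much the same check can be done with Python's standard XML parser. This is just a sketch; check_xml is a name made up for the example, and it only tests well-formedness, not whether a PVR will display the result.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def check_xml(source: str) -> Optional[str]:
    """Return None if the XML is well-formed, otherwise a message saying where parsing failed."""
    try:
        ET.fromstring(source)
    except ET.ParseError as err:
        line, column = err.position  # position of the error as reported by the parser
        return f"line {line}, column {column}: {err}"
    return None


print(check_xml("<desc>Law & Order</desc>"))      # reports a parse error at the bare ampersand
print(check_xml("<desc>Law &amp; Order</desc>"))  # None
print(check_xml("<desc>caf\u00e9</desc>"))        # None: the extended character parses as-is
```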

Russell at IceTV

Hi trapper,

Looking at the guide data for Brisbane on August 21st, I see the following XML data:

<programme start="20050820230000 +0000" stop="20050820234500 +0000" channel="15">
    <title lang="en">Insiders</title>
    <sub-title lang="en"></sub-title>
    <desc lang="en">A not-to-be missed analysis of the week&#8217;s political news, with interviews, discussion and analysis with Barrie Cassidy and guests.</desc>


Is it possible you're loading the data into another program that's then modifying the & to be &amp; ?

Try this URL and take a look at the source, then search for the show "Insiders" to see what our server is sending (you'll be prompted for your Ice User ID and Password):

http://www.icetv.com.au/cgi-bin/epg/iceguide.cgi?op=xmlguide&start_date=050821&end_date=050821

Let me know what you find, and if I've misunderstood your post.

Thanks,
Russell

trapper

Hi Russell... yep I did find a problem here.  :-[

I was wondering though why you use a mix of HTML 4.0 entities and Unicode numbers. E.g., for the apostrophe you use the HTML 4.0 entity &apos; yet for the left quote you use the Unicode number 8216 rather than &lsquo;

Would it be possible to standardise on HTML4.0 entities?
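
One thing worth checking before standardising: XML itself only predefines five named entities (amp, lt, gt, apos, quot), so names like lsquo need a DTD or a lenient parser. The Python sketch below shows that the named and numeric forms refer to the same characters, and how a plain XML parser treats each form. The parses helper is invented for this example.

```python
from html.entities import name2codepoint
import xml.etree.ElementTree as ET

# The HTML entity names map onto the same Unicode numbers used in the guide data.
for name in ("lsquo", "rsquo", "ndash", "eacute"):
    print(f"&{name}; is the same character as &#{name2codepoint[name]};")


def parses(fragment: str) -> bool:
    """Return True if a plain XML parser (no DTD) accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(parses("<desc>&#8216;quoted&#8217;</desc>"))  # True
print(parses("<desc>&lsquo;quoted&rsquo;</desc>"))  # False: undefined entity
```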

Also, in the data for 24 August the word Mélodine has become M<ampersand>#3927;dine

Cheers...

Russell at IceTV

Hi trapper,

Glad to hear you found the problem.

We use the Unicode number for characters that are multibyte characters, such as the accented é character you mentioned.  But for simple ones like the apostrophe, we use the HTML entities.  The XML spec doesn't allow things like the é character to be left unquoted, and it's a lot easier to just convert them to their numeric value than to have to use a large lookup table for all the characters.  I'd have to check to be sure, but the left quote you mentioned was probably a "smart" left quote, or something similar, that wasn't a normal character, and thus was actually a multibyte character that got converted to Unicode.
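
A rough Python sketch of the scheme described above: named entities for the simple markup characters, numeric values for everything outside ASCII. The function name encode_description and the sample text are made up for this illustration, and this is not our actual server code.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def encode_description(text: str) -> str:
    # escape() handles &, < and > first; the extra mapping adds the two quote entities.
    escaped = escape(text, {"'": "&apos;", '"': "&quot;"})
    # Anything still outside ASCII becomes a numeric reference, so no lookup table is needed.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


original = "M\u00e9lodine's caf\u00e9 & \u2018bar\u2019"
encoded = encode_description(original)
print(encoded)  # M&#233;lodine&apos;s caf&#233; &amp; &#8216;bar&#8217;

# An XML parser should turn the encoded text back into the original characters.
round_trip = ET.fromstring(f"<desc>{encoded}</desc>").text
print(round_trip == original)  # True
```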

Thanks for spotting the problem with Mélodine -- it was a typo actually.  I've fixed it, and it should now appear as M<ampersand>#233;lodine

Russell