Author Topic: Diacriticals in IceTV text gone?  (Read 2291 times)

Offline prl

  • Guru
  • *****
  • Posts: 3199
    • View Profile
Diacriticals in IceTV text gone?
« on: June 07, 2010, 10:52:38 PM »
In the ACT guide, all the diacriticals and non-Latin characters have been removed from text in the guide, so that Concert Schönbrunn 2010 is listed as Concert Schonbrunn 2010 (though the correct transliteration is Schoenbrunn). In The Killing, Sofie Gråbøl and Søren Malling have become Sofie Grabol and Soren Malling. What gives?
Peter
Beyonwiz T4 in-use
Beyonwiz T2, T3 & T4 for testing

Offline Mitch IceGuide

  • Guru
  • *****
  • Posts: 501
    • View Profile
Re: Diacriticals in IceTV text gone?
« Reply #1 on: June 12, 2010, 12:32:32 PM »
Does this create technical probs for you?

Offline prl

  • Guru
  • *****
  • Posts: 3199
    • View Profile
Re: Diacriticals in IceTV text gone?
« Reply #2 on: June 12, 2010, 02:47:49 PM »
Does this create technical probs for you?
No, in fact the opposite. Beyonwiz appears to make a mess of multi-byte UTF-8 characters in its EPG. Just wondering why they seem to have gone. I thought that perhaps there had been complaints about them not being presented correctly on some devices, and so they'd been removed. However, when I looked back at IceTV summaries for the Swedish versions of Wallander, I found Johanna Sällström's surname spelt that way (correct), as well as Sällstrom and Sallstrom, so maybe it's just random whether diacriticals are in the guide or not.

Offline raymondjpg

  • IceTV Beta
  • Guru
  • *
  • Posts: 511
    • View Profile
Re: Diacriticals in IceTV text gone?
« Reply #3 on: June 15, 2010, 12:08:25 PM »
Does this create technical probs for you?
Perhaps more to the point, does this create technical problems for you?

Otherwise why not use them?

If the program information supplied is bereft of diacriticals I could understand that it may not be worthwhile investing time and effort in restoring them.

Regards
Topfield TF7100HDPVRt, Beyonwiz DP-P2, Beyonwiz T2, Beyonwiz U4, Hauppauge WinTV-HVR-2200, Hauppauge WinTV-MiniStick 01240, Hauppauge WinTV-dualHD (x2)

Offline tonymy01

  • Guru
  • *****
  • Posts: 740
    • View Profile
Re: Diacriticals in IceTV text gone?
« Reply #4 on: June 15, 2010, 02:51:55 PM »
I would prefer them to be there, and for us to drill the various PVR manufacturers to support the text encoding properly (or get a general consensus of the typical text encoding used on the various devices, and have ICE provide that encoding). I guess the EIT standard would be a good start in this respect.
Regards
Tony

Beyonwiz DP-S1 & Topfield 5K (using PerlTGD to upload ICE EPG/timers for the 5K, normal ICE interactive for the Wiz).

Offline prl

  • Guru
  • *****
  • Posts: 3199
    • View Profile
Re: Diacriticals in IceTV text gone?
« Reply #5 on: June 15, 2010, 05:07:57 PM »
It looks as though the source of the problem may be SBS (the Channel Most Likely To Use Diacriticals). I had a look at their synopsis for the last episode of The Killing, and it also had Gråbøl and Søren as Grabol and Soren respectively.

Scandinavian letters with diacritical marks are regarded as completely different letters from their "base" letters, and most don't really have a transliteration. This is different from the use of the umlaut in German, where ö, for example, is not regarded as a separate letter from o, but rather just as an indication that the pronunciation is altered, and there is an accepted transliteration (an 'e' following the base letter). Umlauts are commonly used in forming plurals in German (ein Apfel, zwei Äpfel), and words that differ only by an umlaut can also mean very different things (schön as in Schönbrunn=beautiful; schon=already), so they shouldn't just be dropped.

I've also noticed that the IceTV guide appears to have stopped using one of the higher-numbered Unicode characters for the apostrophe (possibly U+2019, Right Single Quotation Mark) and is using the ASCII Apostrophe (U+0027) instead.
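
For anyone who wants to check which apostrophe a piece of guide text contains, here is a rough Python sketch. The helper names describe and straighten are made up for illustration, and I'm only assuming the older guide text used U+2019; I haven't confirmed that.

```python
import unicodedata

CURLY = "\u2019"   # assumed: the character the guide may have used before
STRAIGHT = "'"     # U+0027, the plain ASCII apostrophe

def describe(ch: str) -> tuple[str, str, int]:
    """Return the code point, Unicode name and UTF-8 byte length of one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), len(ch.encode("utf-8")))

def straighten(text: str) -> str:
    """Replace curly single quotes (U+2018, U+2019) with the ASCII apostrophe."""
    return text.replace("\u2018", "'").replace("\u2019", "'")

print(describe(CURLY))
print(describe(STRAIGHT))
print(straighten("Schindler\u2019s List"))
```

The curly form takes three bytes in UTF-8 and the ASCII one takes a single byte, which is the sort of difference a device with poor multi-byte handling would trip over.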

Offline futzle

  • IceTV Beta
  • Senior Member
  • *
  • Posts: 136
    • View Profile
Re: Diacriticals in IceTV text gone?
« Reply #6 on: June 16, 2010, 12:31:46 PM »
(In my day job I am frustrated by developers who think that if it works for ASCII, that's good enough, so I'm not surprised at how hard this is.  But I'll spare everyone my Unicode lecture.)

I'm concerned about how diacriticals would affect keyword search.  If I've got a keyword search with a non-ASCII character, will it be found when the Ice guide contains an ASCII-ified version of the word?

IceTV could mitigate this problem by doing keyword matches that decompose characters, in both the needle and the haystack, down to unadorned ones.  (The keyword for the techies is "compatibility equivalence".) Perhaps it already does this?  This would be a good thing no matter whether or not IceTV settles on using only ASCII in its descriptions.
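
To make the idea concrete, here is a minimal Python sketch of that kind of matching. It is not how IceTV does (or would do) it; the names fold and keyword_matches are hypothetical, and the small table for letters such as ø is an illustrative assumption, needed because those letters have no Unicode decomposition and so survive the mark-stripping step.

```python
import unicodedata

# Letters that Unicode does not decompose into base letter + combining mark.
# This table is illustrative only and far from complete.
NO_DECOMPOSITION = str.maketrans({
    "ø": "o", "Ø": "O",
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
})

def fold(text: str) -> str:
    """Reduce text to unadorned, case-insensitive form for comparison."""
    # Canonical decomposition: 'ä' becomes 'a' followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(NO_DECOMPOSITION).casefold()

def keyword_matches(needle: str, haystack: str) -> bool:
    """True if the folded needle occurs in the folded haystack."""
    return fold(needle) in fold(haystack)

print(keyword_matches("Sallstrom", "Wallander, with Johanna Sällström"))
print(keyword_matches("Gråbøl", "The Killing, with Sofie Grabol"))
```

Both sides are folded the same way, so it doesn't matter whether the diacriticals are in the keyword, the guide text, both or neither.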

Offline prl

  • Guru
  • *****
  • Posts: 3199
    • View Profile
Re: Diacriticals in IceTV text gone?
« Reply #7 on: June 16, 2010, 02:25:42 PM »
...
IceTV could mitigate this problem by doing keyword matches that decompose characters, in both the needle and the haystack, down to unadorned ones.  (The keyword for the techies is "compatibility equivalence".) Perhaps it already does this?  This would be a good thing no matter whether or not IceTV settles on using only ASCII in its descriptions.
It doesn't appear to do this. Searching for Sällström only finds Wallander episodes that have the name spelt that way, and searching for Sallstrom only finds episodes with the name spelt without diacriticals.

The problem is a bit more complicated, because you may want to match correct transliterations (Schoenbrunn for Schönbrunn) as well as naive transliterations (Schonbrunn for Schönbrunn). Ideally this should work both ways, but it's not possible in general to invert German umlaut transliterations, though this may not be a big problem for searching if it generates only a few false hits.
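
A rough Python sketch of one way to handle both kinds of transliteration: generate every plausible ASCII spelling of each word and call it a match if the two sets overlap. The names naive, variants and words_match are hypothetical, and the table only covers the German umlauts.

```python
import unicodedata

# Accepted German transliteration: an 'e' following the base letter.
GERMAN = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})

def naive(text: str) -> str:
    """Naive transliteration: strip combining marks and ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()

def variants(word: str) -> set[str]:
    """All ASCII spellings this sketch accepts for a word."""
    composed = unicodedata.normalize("NFC", word)  # so the table sees 'ö' as one character
    return {naive(composed), naive(composed.translate(GERMAN))}

def words_match(a: str, b: str) -> bool:
    """True if the two words share at least one accepted spelling."""
    return bool(variants(a) & variants(b))

print(words_match("Schönbrunn", "Schoenbrunn"))
print(words_match("Schönbrunn", "Schonbrunn"))
print(words_match("Schoenbrunn", "Schonbrunn"))
```

The last case doesn't match, which is the inversion problem: from the ASCII spellings alone you can't tell that Schoenbrunn and Schonbrunn came from the same word. The sketch also applies the German rule blindly to every language, so Sällström picks up a "saellstroem" variant it probably shouldn't have.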

For Scandinavian names the naive transliteration may be the only sensible way to go, because most of those letters have no recognised transliteration. I have no idea what the rules are like for transliteration of diacriticals in eastern European languages that use the Latin alphabet as a basis, where there are markings on consonants as well as vowels.

Edit: this may all be a bit moot anyway, if SBS isn't using more than ASCII in what they provide to IceTV. I don't really expect that Ice would go to the effort of adding the diacriticals to what SBS has provided them. After all, SBS would probably have much more knowledge about foreign orthography than Ice.
« Last Edit: June 16, 2010, 02:41:41 PM by prl »

Offline futzle

  • IceTV Beta
  • Senior Member
  • *
  • Posts: 136
    • View Profile
Re: Diacriticals in IceTV text gone?
« Reply #8 on: June 16, 2010, 08:22:38 PM »
The problem is a bit more complicated, because you may want to match correct transliterations (Schoenbrunn for Schönbrunn) as well as naive transliterations (Schonbrunn for Schönbrunn).

Agreed.  That is a part of my Unicode lecture, which I promised not to use, so thank you for bringing it up. :)

Offline prl

  • Guru
  • *****
  • Posts: 3199
    • View Profile
Re: Diacriticals in IceTV text gone?
« Reply #9 on: June 16, 2010, 08:33:14 PM »
The problem is a bit more complicated, because you may want to match correct transliterations (Schoenbrunn for Schönbrunn) as well as naive transliterations (Schonbrunn for Schönbrunn).

Agreed.  That is a part of my Unicode lecture, which I promised not to use, so thank you for bringing it up. :)

:)

