Google forces a translation to Japanese

shimi linux-il at shimi.net
Mon Sep 14 13:29:35 IDT 2009


2009/9/14 Shachar Shemesh <shachar at shemesh.biz>

>  Hi all,
>
> One of my clients is having a weird problem, and I'm pretty much at my
> wit's end as to what to do about it.
>
> The site is called "Tzofit" (at tzofit.co.il), and is an index and
> publisher for Zimmers. When you search Google for "צימרים" the site appears
> on the second page, and when you search Google for "צופית" it is the first
> result. In both cases, you cannot miss it - Google displays the site's title
> and summary in Japanese!
>
> Now here's where it gets really strange. While the main site is proclaimed
> to be in Japanese, all the deep links are in Hebrew. If you ask to see the
> Google cache, the site appears in Hebrew. If you search for its address
> directly (tzofit.co.il), the site appears with the correct title and summary.
> The only explanation I have is that this is a Google index bug.
>
> The problem is that even if that is the case, I cannot see what I can do
> about it. I tried to ask about it on the Google forums (
> http://www.google.com/support/forum/p/Web+Search/thread?tid=08c423ea40d5c1ab&hl=en),
> but, as expected, got no replies. On the other hand, I did not manage to
> find anything wrong with the actual page.
>
> Translating the Japanese text back to English with Google Translate
> shows that the text does translate, but not into coherent sentences.
> Then again, looking at the raw encoding, this does not appear to be
> Hebrew interpreted with the wrong encoding (or am I missing something?)
>
> If anyone has any clue, it would be much appreciated.
>
>
 I would try the following:

   - Remove the extra newlines from the beginning of the document. An XML
   document should begin with its XML declaration. Maybe leading newlines
   are valid, I never checked, but documents usually don't begin that way,
   so why do it... :)
   - In an HTML document, you define the language inside the opening html
   tag, with lang="he". The meta tag that does this is redundant, and I
   would assume Google likes the html attribute better.
   - The newlines in the file appear to be DOS-style. Maybe you want to try
   running the file through dos2unix.
   - It could be this windows-1255 thing. Maybe try putting iso-8859-8-i
   there, or even better, switch to UTF-8 altogether. "Everybody loves
   UTF-8" :)
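
The mechanical parts of the list above (leading newlines, DOS newlines,
the switch to UTF-8) could be tried in one pass with something like the
following. This is only a rough sketch, not a tested fix for the Google
problem: the function name normalize_page is made up, it assumes the page
really is encoded as windows-1255 (Python's "cp1255" codec), and the regex
that rewrites the declared charset is crude and would also touch any
"charset=" or "encoding=" that happens to appear in the page text.

```python
import re


def normalize_page(raw: bytes, source_encoding: str = "cp1255") -> bytes:
    """Return the page as UTF-8, with Unix newlines and no leading blank lines."""
    # Decode using the encoding the page is assumed to be stored in.
    text = raw.decode(source_encoding)

    # Same effect as dos2unix: CRLF (and stray CR) become LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Drop blank lines and whitespace before the first declaration or tag.
    text = text.lstrip()

    # Point the declared charset (meta tag or XML declaration) at utf-8.
    text = re.sub(
        r'((?:charset|encoding)\s*=\s*["\']?)[\w-]+',
        r"\g<1>utf-8",
        text,
        flags=re.IGNORECASE,
    )

    return text.encode("utf-8")
```

To use it, read the file in binary mode, pass the bytes through
normalize_page(), and write the result back in binary mode. The lang="he"
attribute on the html tag is not handled here and still has to be added
by hand.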


These are my ideas...

HTH,

-- Shimi