Colors and how they are perceived in different cultures

While doing some research on colors and how they can be used and perceived differently in different regions and cultures, I came across an excellent data visualization called ‘Colours and Cultures’, designed by AlwaysWithHonor.com and David McCandless. It’s also used on the cover of the book ‘Information is Beautiful‘ (Note: the US edition apparently sports a different cover and is called ‘The Visual Miscellaneum: A Colorful Guide to the World’s Most Consequential Trivia’).

Image: “Colours In Culture”, from “Information Is Beautiful”

Fascinating stuff, and it certainly provides some food for thought when designing web pages for global audiences.

From Windows to Bing Webmaster Tools

I realize it’s been seriously quiet on my blog since I first posted last year. Well, things got busy over in Windows: shipping Windows 8, building the next-gen search experiences for Windows.com, and optimizing designs, controls, and processes for SEO. However, as it turns out, I won’t see this exciting release all the way through to RTM: I have finally succumbed to the alluring promise of being able to work in the very “warp core” of search at Microsoft, and have moved to Bing to work with an exceptional team as Program Manager for the Bing Webmaster Tools.

So, hope to see you all at SMX Advanced Seattle next week in my new capacity! If you’re visiting, come look me up at the Bing booth!

Robots.txt, UTF-8 and the UTF-8 Signature

In today’s post I wanted to talk a little bit about a specific issue I came across on one of the sites I work on, hoping that you might find it useful. In this particular case, I was surprised to see the first line in one of our sites’ robots.txt files being marked as “invalid” in Google Webmaster Tools. In the Site Configuration > Crawler Access section, Google’s test tool showed a question mark at the beginning of the first line (the sitemap reference). I immediately had the feeling this had something to do with the encoding of the file and the possible presence of a UTF-8 signature. But wait, that doesn’t make sense:

Looking at http://code.google.com/web/controlcrawlindex/docs/robots_txt.html Google clearly states the following (the highlight is mine):

File format

The expected file format is plain text encoded in UTF-8. The file consists of records (lines) separated by CR, CR/LF or LF.

Only valid records will be considered; all other content will be ignored. For example, if the resulting document is a HTML page, only valid text lines will be taken into account, the rest will be discarded without warning or error.

If a character encoding is used that results in characters being used which are not a subset of UTF-8, this may result in the contents of the file being parsed incorrectly.

An optional Unicode BOM (byte order mark) at the beginning of the robots.txt file is ignored.

So, clearly Google supports UTF-8 encoded robots.txt. Also, the highlighted section clearly states that they will ignore the Unicode “byte order mark” (BOM) at the beginning of the file, if present. (The reason I put “byte order mark” in quotes here is that there really is no such thing as byte order in the case of UTF-8. “UTF-8 signature” is probably a more accurate name for the byte sequence 0xEF 0xBB 0xBF -- more in the Wikipedia article on UTF-8, which Google links to above.) So what gives?

In this particular case, I didn’t have immediate access to the published robots.txt, but at the same time I did not want my assessment that the UTF-8 signature was causing this to be just an “educated guess”. So, I needed another way to demonstrate that (see the sketch after this list for a scripted version of both checks):

  1. The robots.txt file in question was in fact UTF-8 encoded
  2. There was a UTF-8 signature present in the file
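
For anyone who prefers to script this, both checks are easy to automate. Here is a minimal Python sketch; it assumes the robots.txt is reachable over HTTP, and the URL below is a placeholder rather than the actual site:

    import urllib.request

    # Placeholder URL -- substitute the robots.txt you want to check.
    url = "https://example.com/robots.txt"

    with urllib.request.urlopen(url) as resp:
        body = resp.read()  # raw bytes, exactly as served

    # Check 2: the UTF-8 signature is the byte sequence EF BB BF.
    print("UTF-8 signature present:", body.startswith(b"\xef\xbb\xbf"))

    # Check 1: decoding raises UnicodeDecodeError if the body is not valid UTF-8.
    try:
        body.decode("utf-8")
        print("Body decodes cleanly as UTF-8")
    except UnicodeDecodeError as e:
        print("Not valid UTF-8:", e)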

Fiddler to the Rescue

Of all the tools I use in day-to-day SEO, Fiddler, the free network debugging tool written by the illustrious Eric Lawrence on the Internet Explorer team, is probably the one I use most. I could spend a whole post on the benefits of Fiddler for the more technically inclined SEO (and I’m sure someday I will), but if you are interested in its capabilities, check out the video Getting Started with Fiddler (8 min; 9MB WMV).

In any case, using Fiddler, I was able to quickly verify at least one of the things called out above. Let’s take a look.

Fiddler Hex View

Fiddler has a HexView tab that shows the response headers and body as a string of hexadecimal values alongside a textual representation. This makes it easy to spot the UTF-8 signature byte sequence (sorry folks, but I’ve hidden the actual file name from view, no need to expose that poor robots.txt file’s bits and make it feel uncomfortable in public):

Screenshot: Fiddler HexView of the robots.txt response, showing the UTF-8 signature at the start of the body

That’s it: EF BB BF (rendered as ï»¿ when misread as a single-byte encoding such as Windows-1252) at the beginning of the response body, which is proof that the file used the UTF-8 signature. But given Google’s guidance quoted above, why would Google Webmaster Tools be thrown off by this byte sequence? Google’s engineers are a smart bunch, so might there be more out of whack?
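
Incidentally, those three bytes are also why BOM-unaware parsers show odd characters (or a replacement question mark) at the start of the file. A tiny Python illustration:

    # The UTF-8 signature decoded as Windows-1252 yields the familiar mojibake.
    print(b"\xef\xbb\xbf".decode("cp1252"))           # prints: ï»¿

    # Decoded with the BOM-aware codec, it disappears entirely.
    print(repr(b"\xef\xbb\xbf".decode("utf-8-sig")))  # prints: ''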

What I also noticed in Fiddler is that the server response did not declare a character set. I would have expected something like: Content-Type: text/plain; charset=utf-8. Instead, the server returned: Content-Type: text/plain

Could it be that the combination of not declaring a charset in the Content-Type header plus having the UTF-8 signature byte sequence is what is throwing off Google Webmaster Tools in this case? (Note that the charset is a parameter of the Content-Type header; the separate Content-Encoding header is about compression such as gzip, not character sets.) At this point this is pure speculation until I find (or publish) a robots.txt with both the signature present and a correct charset declared in the header. I will update the post if I do find one in the wild.
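
If you want to check your own server’s headers without firing up Fiddler, the same urllib approach from earlier works; again, the URL is a placeholder:

    import urllib.request

    url = "https://example.com/robots.txt"  # placeholder

    with urllib.request.urlopen(url) as resp:
        # The charset, if declared, is a parameter of the Content-Type header.
        print(resp.headers.get("Content-Type"))    # e.g. "text/plain; charset=utf-8"
        print(resp.headers.get_content_charset())  # parsed charset, or None if absent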

Mitigating the Issue

Since I take it that Google processes robots.txt line by line and only ignores lines that are invalid, adding an empty first line or a comment (starting with #) as the first line may be sufficient to fix this particular issue. That said, it’s probably a good idea to simply not use the UTF-8 signature and save your robots.txt file without it (a versatile editor like Notepad++ for Windows makes this easy). Also, it makes sense to ensure the server response declares the character set correctly in the Content-Type header.
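
If you would rather strip the signature from an existing file programmatically, Python’s built-in utf-8-sig codec silently consumes a leading signature on read. A small sketch (the file name is a placeholder):

    # Read with utf-8-sig, which drops a leading UTF-8 signature if present;
    # newline="" preserves the file's original CR/LF line endings.
    with open("robots.txt", "r", encoding="utf-8-sig", newline="") as f:
        content = f.read()

    # Write back as plain UTF-8, without the signature.
    with open("robots.txt", "w", encoding="utf-8", newline="") as f:
        f.write(content)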

On the other hand, if you are not really using any extended characters, you could also consider just saving your robots.txt in ANSI. A good, comprehensive post is Robots Speaking Many Languages over on the Bing Webmaster blog. It also discusses how to percent-encode entries that use extended characters – good examples of which can be found in http://en.wikipedia.org/robots.txt.
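
For completeness, here is what percent-encoding such an entry can look like, using Python’s standard library (the path is a made-up example):

    from urllib.parse import quote

    path = "/wiki/Günter"  # hypothetical path with a non-ASCII character

    # quote() percent-encodes the UTF-8 bytes; "/" stays unescaped by default.
    print("Disallow: " + quote(path))  # Disallow: /wiki/G%C3%BCnter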

What are your experiences with robots.txt and UTF-8? Do you have any examples of robots.txt files that use the UTF-8 signature and work just fine in Google Webmaster Tools?
