In today’s post I wanted to talk a little bit about a specific issue I came across for one of the sites I work on, hoping that you might find it useful. In this particular case, I was surprised to see the first line in one of our sites’ robots.txt files being marked as “invalid” in Google Webmaster Tools. Looking at the Site Configuration > Crawler Access section, Google’s test tool showed a question mark at the beginning of the sitemap. I immediately had the feeling this had something to do with the encoding of the file and the possible presence of a UTF-8 signature. But wait, that doesn’t make sense:
Looking at http://code.google.com/web/controlcrawlindex/docs/robots_txt.html Google clearly states the following (the highlight is mine):
The expected file format is plain text encoded in UTF-8. The file consists of records (lines) separated by CR, CR/LF or LF.
Only valid records will be considered; all other content will be ignored. For example, if the resulting document is a HTML page, only valid text lines will be taken into account, the rest will be discarded without warning or error.
If a character encoding is used that results in characters being used which are not a subset of UTF-8, this may result in the contents of the file being parsed incorrectly.
An optional Unicode BOM (byte order mark) at the beginning of the robots.txt file is ignored.
So, clearly Google supports UTF-8 encoded robots.txt. Also, the highlighted section clearly states that they will ignore the Unicode “byte order mark” (BOM) at the beginning of the file, if present. (The reason I quote “byte order mark” here is that there really is no such thing as byte order in the case of UTF-8. “UTF-8 signature” is probably a more accurate name for the byte-sequence
0xEF 0xBB 0xBF -- more at this link about UTF-8 on Wikipedia which Google includes above). So what gives?
In this particular case, I didn’t have immediate access to the published robots.txt but at the same time did not want my assessment that the UTF-8 signature was causing this to just be an “educated guess”. So, I needed another way to demonstrate that:
- The robots.txt file in question was in fact UTF-8 encoded
- There was a UTF-8 signature present in the file
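If you do have a local copy of the file, both points reduce to looking at its first three bytes. Here is a minimal sketch in Python (the filenames are placeholders I made up for illustration):

```python
# The UTF-8 signature ("BOM") is the byte sequence 0xEF 0xBB 0xBF.
UTF8_SIG = b"\xef\xbb\xbf"

def has_utf8_signature(path):
    # Read the first three bytes in binary mode and compare.
    with open(path, "rb") as f:
        return f.read(3) == UTF8_SIG

# Example: a file saved with the signature is detected...
with open("robots_test.txt", "wb") as f:
    f.write(UTF8_SIG + b"Sitemap: http://example.com/sitemap.xml\n")
print(has_utf8_signature("robots_test.txt"))  # True

# ...while a plain ASCII file is not.
with open("plain_test.txt", "wb") as f:
    f.write(b"User-agent: *\n")
print(has_utf8_signature("plain_test.txt"))  # False
```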
Fiddler to the Rescue
Of all tools that I use in SEO day-to-day, Fiddler, the free network debugging tool written by the illustrious Eric Lawrence on the Internet Explorer team, is probably the one that I use most. I could spend a whole post on the benefits of Fiddler to the more technically inclined SEO (and I’m sure someday I will), but if you are interested in its capabilities, check out the video Getting Started with Fiddler (8 min; 9MB WMV).
In any case, using Fiddler, I was able to confirm at least one of the things called out above quickly. Let’s take a look.
Fiddler Hex View
Fiddler has a HexView tab that shows the response headers and body as a string of hexadecimal values alongside a textual representation. This makes it easy to spot the UTF-8 signature byte sequence (sorry folks, but I’ve hidden the actual file name from view, no need to expose that poor robots.txt file’s bits and make it feel uncomfortable in public):
That’s it, EF BB BF (or
ï»¿) at the beginning of the response body, which is proof that the file used the UTF-8 signature. But given Google’s guidance quoted above, why would Google Webmaster Tools be thrown off by this byte sequence? Google’s engineers are a smart bunch, so might something else be out of whack?
What I also noticed in Fiddler is that the server response did not declare a character encoding. I would have expected something like: Content-Type: text/plain; charset=utf-8. Instead, the server returned: Content-Type: text/plain
Could it be that the combination of a missing charset in the Content-Type header plus the UTF-8 signature byte sequence is what is throwing off Google Webmaster Tools in this case? At this point this is pure speculation until I find (or publish) a robots.txt with both the signature present and a correct charset declaration in the header. I will update the post if I do find one in the wild.
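For those who don’t have Fiddler handy, the same check can be made over HTTP with a few lines of Python (a sketch using only the standard library; the example URL is a placeholder):

```python
# Fetch a robots.txt and inspect both the Content-Type header
# and the first bytes of the body for the UTF-8 signature.
from urllib.request import urlopen

UTF8_SIG = b"\xef\xbb\xbf"

def has_utf8_signature(body):
    # True if the raw response body starts with 0xEF 0xBB 0xBF.
    return body.startswith(UTF8_SIG)

def inspect_robots(url):
    with urlopen(url) as resp:
        content_type = resp.headers.get("Content-Type", "(none)")
        body = resp.read()
    return content_type, has_utf8_signature(body)

# Example usage (placeholder URL):
# ctype, bom = inspect_robots("http://example.com/robots.txt")
# print(ctype, bom)
```

If the returned header lacks a charset parameter and the body starts with the signature, you are looking at exactly the combination described above.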
Mitigating the Issue
Since I take it that Google processes robots.txt line by line and only ignores lines that are invalid, adding an empty first line or a comment (starting with #) as the first line may be sufficient to fix this particular issue. That said, it’s probably a good idea to simply not use the UTF-8 signature and save your robots.txt file without it (a versatile editor like Notepad++ for Windows makes this easy). It also makes sense to ensure the server response declares the charset in the Content-Type header correctly.
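Stripping the signature can also be scripted if you’d rather not open an editor; a minimal sketch (the filename is a placeholder):

```python
# Remove a leading UTF-8 signature from a file, if present.
UTF8_SIG = b"\xef\xbb\xbf"

def strip_utf8_signature(path):
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(UTF8_SIG):
        # Rewrite the file without the three signature bytes.
        with open(path, "wb") as f:
            f.write(data[len(UTF8_SIG):])
        return True   # signature was removed
    return False      # nothing to do

# Example: create a signature-prefixed file, then clean it.
with open("robots_test.txt", "wb") as f:
    f.write(UTF8_SIG + b"User-agent: *\n")
print(strip_utf8_signature("robots_test.txt"))  # True
```

A second run on the same file returns False, since the signature is already gone.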
On the other hand, if you are not really using any extended characters, you could also consider just saving your robots.txt in ANSI. A good, comprehensive post is Robots Speaking Many Languages over at the Bing Webmaster Central blog. It also discusses how to percent-encode entries that use extended characters – good examples of which can be found in http://en.wikipedia.org/robots.txt.
What are your experiences with robots.txt and UTF-8? Do you have any examples of robots.txt files that use the UTF-8 signature and work just fine with Google Webmaster Tools?