Posting from: Philipsburg, MT
Listening to: Bloodthirsty Vegetarians, BV Podcast #0153
This morning I was listening to the second talk segment (roughly 14:50 in) of BV Podcast #0153, in which hosts Rich and John discuss the robots.txt file on whitehouse.gov. According to John, under the Bush administration this file contained thousands of entries excluding countless pages on the site from being indexed by public search engines. One of the first things the Obama administration did, he said, was to remove the file, making it clear that everything on the website was for public consumption and that the administration was doing something to make its operation more transparent.
Interesting, I thought. Let’s look into it.
First, I Googled “robots.txt White House” and found that the White House actually does have a robots.txt file - here it is. So it does exist, and at the time of this blog entry it has three disallow entries. It was not clear to me what significance, if any, this had, so I did some more looking.
Another of the top results from that Google search was a blog entry by Jason Kottke documenting the contents of robots.txt the day before and the day of the inauguration. Apparently it went from over 2,400 entries down to 1 entry, and now it is back up to 3. But I still didn’t understand the potential significance of this.
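For anyone unfamiliar with what a “disallow entry” actually does: robots.txt is just a plain-text list of crawler directives, and you can check its effect with Python’s standard-library `urllib.robotparser`. The entries below are hypothetical examples for illustration, not the actual whitehouse.gov contents.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with three disallow entries (illustrative only;
# the real whitehouse.gov file differs).
robots_txt = """\
User-agent: *
Disallow: /search/
Disallow: /print/
Disallow: /includes/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved crawler consults these rules before fetching a URL.
print(rp.can_fetch("*", "https://www.whitehouse.gov/search/?q=budget"))  # False
print(rp.can_fetch("*", "https://www.whitehouse.gov/blog/"))             # True
```

Note that these rules are purely advisory - they keep polite search engines from indexing a page, but they don’t hide it from anyone who types the URL directly.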
According to CNET News, the significance is not much:
White House expands use of search-blocking code
While it would be accurate to state that the White House has in one day tripled the number of sites it excludes from Google crawling, it is also important to note that this is not a big deal–in fact, it doesn’t matter at all.
For the most part, the Bush White House’s use of robots.txt was totally legitimate, something that Kevin Fox, an engineer at Friendfeed told the folks at Google Blogoscoped:
This is a bit silly. The old robots.txt excludes internal search result pages and redundant text versions of HTML pages. This is exactly what robots.txt is for. Google’s Webmaster Guidelines state “Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.”
It’s understandable that the robots.txt of an 8-year-old site is longer than that of a 1-day-old site, and it’s not as if ‘/secrets/top’ or ‘/katrina/response/’ were put in the robots file.
Fun as it may be, this is a nonstory.
Those bloggers drunk on hope who desperately wanted to see proof of Obama’s commitment to his campaign promises of transparency and Google Government now find themselves with a difficult choice: they can either accept and acknowledge that robots.txt files are not a set of digital tea leaves through which you can read the new administration, or, if robots.txt does carry weight, they can try to come up with a way of explaining a 200 percent increase in the number of directories blocked by Obama’s Web team as anything but Cheney-esque secrecy.
So Team Obama gets no points from me for the robots.txt file schtick.
But Obama does get points off for the following. The CNET article continues:
As for the president’s commitment to transparency, he has already violated his pledge to post all nonemergency bills on the Whitehouse.gov Web site for five days before signing them. The text of the Lilly Ledbetter Fair Pay Act of 2009, which was signed into law yesterday, was certainly not posted to Whitehouse.gov for anywhere near five days.
Obama’s broken commitment to transparency remains advertised on the White House blog:
One significant addition to WhiteHouse.gov reflects a campaign promise from the president: we will publish all nonemergency legislation to the Web site for five days, and allow the public to review and comment before the president signs it.
It is by looking at these kinds of concrete issues that we can judge the president, not robots.txt.
Same old, same old - just with a D after his name instead of an R.