In response to a Wired article that ran yesterday, Google is fixing its archives of Usenet posts, one of the richest and oldest repositories of user-generated content ever to exist online.
For those of you under the age of 30, Usenet began in 1979 in Chapel Hill as a collection of newsgroups. In the years that followed, Internet history unfolded, jargon was coined, and lore was created in these discussions. In 2001, Google acquired two Usenet archives comprising 700 million posts and failed to index them in any meaningful way. As of today, that wrong is being righted.
In the past, searching Usenet posts archived in Google Groups often yielded few or no results. For example, this recent discussion thread is all about the brokenness of Google's Usenet archives and search capabilities.
"None of my posts are showing up (using advanced search, trying email and name in the author field, even limiting the date range to the right years)," wrote one user.
Noting that Google's Usenet search "often... returns no results for queries which obviously shouldn't," another user said, "You just have to cross your fingers and hope that they [Google] notice the problem themselves and fix it."
Fortunately, after media attention and user complaints, the search giant has responded and rectified the situation.
Today, Google rep Victoria Katsarou told Wired, "It turns out there was a bug, a specific bug, that affected search within a specific group. That bug is something we're working on fixing, and I think that will be fixed by tomorrow. Thanks for writing this, because that's how we discovered this specific bug."
Just one bug wrecking search results for archives spanning 700 million posts and more than 20 years of data? Seems hardly likely.
Search results are particularly buggy when users filter them by date. As an example, searching alt.usenet.kooks for "godwin" produced 6,520 results, until we tried to look at the results sorted by date. At that point, we got only 93 results. And searching alt.comp.freeware for "MS-DOS" yielded no results after 2000, even though we eventually found posts dating back to 1995 when we browsed without narrowing the dates.
If complaints by Internet old-timers, Slashdot threads and detailed email exchanges aren't enough to get Google to tend this garden of information and ensure it is searchable, and if media attention is really what it takes, then we must add our voices to those at Wired in asking Google to keep Usenet useful. And we ask that like-minded individuals do the same in the comments.
For a nice Usenet history lesson in timeline form, check out Google's highlights of Usenet posts dating back as far as 1981. Of particular interest to us Web geeks at ReadWriteWeb is Tim Berners-Lee's announcement of the World Wide Web.
Comments
All of those old Usenet posts have been retro-modded by the CancelMoose and will no longer be propagated.
Whether you find it likely or not, single small bugs in code can happily affect large data sets with great ease. This _was_ a specific bug. Date-constrained search is (believe it or not) another specific issue. We're looking into it.
Lovely. I used to belong to the ASCII art groups. I always hoped Google would get around to fixing Usenet.
*IT STILL DOESN'T WORK*
Two or three years ago, I could search for old hip hop mixtapes and tracklists using Google Groups. Since I was looking for mixtapes from the mid-90's, I would filter it so no posts after 1997 would come up. I got many results and exactly what I was looking for.
Now, even today (after they claimed to fix it), advanced search brings up nothing.
"Just one bug wrecking search results for archives spanning 700 million posts and more than 20 years of data? Seems hardly likely."
Wrong. Fail. Go back to comp sci 101 and do not pass go.
Single *character* typos have been responsible for internet worms causing billions of dollars in damage.
I started on News groups around 1994.
When one thinks of the Usenet as a temporary, transitory medium, with conversations that went for weeks (or sometimes months in the old days) it is wonderful that it is preserved, a great wealth of history. Cultural history and records (September 11, 2001 reactions for instance) along with a look at the development of technology. (Remember when bandwidth was a huge issue on discussion groups?)
For Google, they also have to deal with a whole lot of different storage methods and media used in the different archives, as well as preserving files against degradation over time. Plus they may have to do recovery efforts on older archives. Technically, this is no small challenge.
Kudos to Google for doing this, it is much appreciated.
This is good news. Next on Google's list of things to do with their usenet archive is to start enforcing an antispam policy on posts made to usenet through Google's servers. Groups like misc.invest.futures have become useless due to spam about knockoff sporting goods from China being injected through Google's servers. Complaints have yielded no results. A fix would be trivially simple, yet Google seems to prefer archiving this junk instead.
Oh, I never realized there was a date problem; I just complained that it has been impossible to filter by newsgroup for a few years. How can you find relevant info without filtering by newsgroup?
bug = management oversight
I'm sure the bug was very specific. I imagine it went a little something like this:
if( 0 ) {
    // TODO: Stubbed for now
    SearchUsenet( searchTerms );
}
else {
    DisplayNResults( rnd(), rnd() );
}
Sure, a single bug can have far-reaching effects, but the problems described here are multi-faceted and clearly indicate a complete lack of attention to developing appropriate search mechanisms for this archive...
And why does Wired have to write a piece before Google looks into these problems? If there have been numerous and ongoing threads about it, they should have known already... Unless they simply aren't paying attention to this.
Found this while searching for information on why I can no longer find my old posts that used to be available via Google search. I tried every method I could think of to search for a specific post, and it still doesn't work almost two months after this article.
Seems like Google either doesn't care or is unable to fix the problem and isn't very forthcoming about it.
If I were in charge, and wanted to make sure the bug was fixed, I'd have a link for people to post if there are continued problems to make sure the bug was fixed. Oh, wait, that means I'd have to admit there was a problem.
Guess we need another Wired article?
Posted by: marty.fried.myopenid.com | December 7, 2009 9:29 AM