October 2007
The Technorati 100 rank is based on unique links in the past 6 months. We computed a similar Google Reader "position" by ordering blogs by number of subscribers. With lots of caveats and exceptions, blogs that get the most incoming links tend to have the most feed subscribers and general Web traffic. Although the correlation is very rough, I think the above chart shows this trend -- e.g. the bottom right is empty. (The top left is less so, but the 4 biggest outliers are easy to explain.)

We haven't had time to analyze the Google Reader data in much detail, but wanted to post an early draft so that others have the same tools. If you're interested, use the interactive version to explore outliers, then post likely explanations in the comments.

Bleg: If you know someone who is working on automating the collection of subscriber stats, please let us know. There are well over 50 "top" blogs missing (especially outside of tech and business), and the existing data has known problems.

Update at 4:12 p.m.: on the new interactive chart, mouse over any point for details on the blog: name, URL, tagline, description, Alexa rank, Technorati 100 rank and Google Reader subscriber count.

How many people subscribe to feeds from the most popular blogs? Moving down the list, how quickly do the numbers fall off?

The answer: it's a power law distribution, popularly known as a "long tail" thanks to Chris Anderson's 2006 book and ongoing blog. A key earlier work: Clay Shirky's 2003 essay, Power Laws, Weblogs and Inequality.

Granting that this weekend's subscriber stats are flawed and that the manually assembled list is incomplete, the curve shows the expected shape. The above image is a static teaser; I hope to post an interactive chart by the end of the day. [update: done]

(Kevin Burton charted similar data on Sunday. We omitted non-blogs and added color coding by category. FYI: Kevin's Tailrank meme tracker currently leads with Sunday's TechCrunch post about Google Reader.
I wonder whether the Mashable post will appear later today.)

First, a very sincere kudos to Pete Cashmore of Mashable for digging up hard data that shows how Google Reader's "feed bundles" can skew results. It's clear that subscribing to a bunch of feeds with a single click isn't the same as choosing individual feeds by hand. What Pete added: real numbers.
(I haven't dug into the specific examples to see if Pete's conclusions are the most likely explanation, though his results aren't that surprising.)

Where Pete went wrong: concluding that these specific errors make the whole dataset worthless. ("**" added)

"Google Reader stats, in case you don't know, are bulls**t. In fact, all Feedburner stats for most top blogs are bulls**t due to the effect of default feeds."

I've looked at lots of data over the years, including as part of corporate data quality projects. Even without gaming, any non-trivial dataset I've seen has flaws. The answer is to understand the flaws and attempt to fix or work around them, not to throw out the whole thing or wait for flawless data. The perfect is the enemy of the good.

Even if the data is flawed for some or all of the 91 blogs included in the feed bundles, much of that can be corrected for based on analysis like Pete has done. And there's a whole world of blogs that aren't affected.

Sure, there are similar problems with other feed readers, home pages with RSS subscriptions, etc. Switching from feeds to sites, there's also no shortage of problems measuring page views (bots!), unique visitors (cookie deletion!), time spent on site (tabs!), etc. But data is useful and important, so people make do with what's available given time and budget constraints. (Mashable isn't shy about mentioning some stats on every page of the site: "in excess of 5 million monthly pageviews" and "ranks among the Top 100 blogs worldwide".)

Has anyone automated the process of harvesting the subscriber stats? If so, please send details. I'll bet there's meaningful data to be found.

One more bit of overreach:

"The easiest way to get a default feed on one of these startpages is to own it, promise to promote it on your blog or be friends with the person who runs it."

Did the post include any evidence? (I'm sure these things happen, but making implied accusations without backing them up doesn't lend credibility to the post.)
On Crunchnotes, Mike Arrington presents another side to the story, including a useful data point:

"the complainers rallied around the notion that the stats are somehow fixed. In particular, some of the feeds are included in bundles that users can add to the reader, jacking up their stats."

Google Reader recently began to show the number of subscribers for each feed in the result list when you search for a new feed. (Ionut Alex Chitu on Friday, Oct. 12)

On Sunday, TechCrunch published a preliminary list from Gabe Rivera, weighted towards tech and general news sites. (A reasonable starting point considering the source.) Robert Scoble posted numbers from the new TechMeme Leaderboard plus 55 of his favorite feeds. Together, the posts hit the top of TechMeme late Sunday night and gathered reactions all day Monday. Here's a roundup.

On the Official Google Reader Blog, Mihai Parparita provided some details:

"Google subscriber counts: These numbers include subscribers across all Google services, including Reader, iGoogle, and Orkut.

FeedBurner numbers: If you use FeedBurner to manage and track your feed, you will see a subscriber count there that is attributed to "Google Feedfetcher." This number is a sum of all the feeds that you have redirecting to your FeedBurner feed URL."

Alas, he also delivered the bad news that the weekend data collection wasn't accurate:

"Reader's feed search was recently showing stale and incomplete data, but as of today (October 15) the numbers should be the same everywhere."

The update resolved a discrepancy between Google Reader and Feedfetcher that Tim Bray observed on Sunday.

To find some data for yourself, click on "Add subscription" and type a keyword, e.g. a blog name or the domain name without the ".com". (Entering the full domain name will subscribe you without showing the feed count.)

To search using the domain name, click "Browse" and then use the "Search and browse" option at the bottom. Pro: fewer false-positive matches.
Con: likely to miss some actual feeds, e.g. those hosted at FeedBurner or another service.

As shown in the (edited) screenshot above, a blog or site may have several feeds. It's tricky to get an accurate count: add numbers for actual feeds, skip duplicates, skip keyword matches on other sites, and make a judgement call for comments and other feeds tied to specific tags or sections of a site. (Someone may subscribe to a tag rather than the whole site, but the same person may subscribe to multiple tags.) I suspect that most people who subscribe to comment feeds are already subscribed to a blog's main feed -- though perhaps that's less true for blogs that have a separate comment feed for each post.

At first, the several URLs for a single feed look like a bug -- e.g. a URL with or without the trailing slash ("/"). However, Andy Beard points out that it's a great hidden feature to see how someone subscribed to your feed:

1. Using A Subscription Button
2. Autodiscovery
3. Javascript Bookmark

Alas, some of the duplication is in fact a bug, e.g. the exact same URL but a different title or description. See Nusuni Dot Com for a good example.

Another bug (per Technosailor): Google continues to fetch the old feed URL despite a 301 "permanent redirect" code in the HTTP header. (A 302 "temporary redirect" is one way to have the feed URL on your domain but have the actual feed served by FeedBurner or another service. TBD: research how this affects the reported stats.)

Try gathering data for several sites: it takes time. Now that Google is publishing the data, they should publish their own "top feeds" list. Louis Gray pointed out that he requested this feature over 7 months ago. (Might be interesting to see whether Google Reader has made any progress on his other 9 suggestions.) I'm not holding my breath, e.g. FeedBurner never released such a list based on their data. With Google Reader's use of AJAX, the data isn't easy to grab with a script either.
Perhaps the solution will come in the form of an API, e.g. FeedBurner provides one. (Though that only works for blogs that use FeedBurner and haven't opted out of making their data public.)

Meanwhile, please stay tuned for further analysis of the data that's available today.
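
For anyone tallying counts by hand, the rules described above (add numbers for actual feeds, skip duplicates, skip keyword matches on other sites, make a judgement call on comment feeds) can be sketched in a few lines of Python. This is only a rough illustration, not how our numbers were produced: the function name `count_subscribers`, the `(feed_url, subscribers)` input format, the "comments" path test and all of the sample figures are assumptions invented for the example. It also assumes that two listings with the exact same URL describe the same subscribers, which may not hold.

```python
from urllib.parse import urlparse


def count_subscribers(results, site_domains, include_comments=False):
    """Sum subscriber counts for the feeds that belong to one site.

    results: iterable of (feed_url, subscribers) pairs, e.g. typed in by
        hand from a feed search (hypothetical input format).
    site_domains: set of hostnames treated as the site's own, without "www.".
    """
    seen = set()
    total = 0
    for url, subscribers in results:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host not in site_domains:
            continue  # keyword match on some other site
        if url in seen:
            continue  # exact same URL listed twice: count it once
        seen.add(url)
        if not include_comments and "comments" in parsed.path.lower():
            continue  # judgement call: leave out comment feeds
        total += subscribers
    return total


# Made-up sample data, for illustration only.
sample = [
    ("http://example.com/feed/", 1200),
    ("http://example.com/feed", 300),       # no trailing slash: kept as its own entry
    ("http://example.com/feed/", 1200),     # exact duplicate listing
    ("http://example.com/comments/feed/", 50),
    ("http://other.example.org/about-example", 9000),
]

print(count_subscribers(sample, {"example.com"}))                         # 1500
print(count_subscribers(sample, {"example.com"}, include_comments=True))  # 1550
```

URL variants such as with and without the trailing slash are deliberately summed rather than merged, since they appear to carry separate counts. Filtering by hostname has the same drawback as searching by domain name: feeds hosted at FeedBurner or another service are missed unless handled separately.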