Twitter has established itself as a top social networking destination, mentioned in the same breath as sites like Facebook, LinkedIn, or YouTube, as well as a go-to destination for breaking news. But as a search engine that could be the Google for real-time media, Twitter still fails. For Twitter data partner Topsy, that was an opportunity.
If the web is now being cobbled together by status updates and hashtagged posts as much as it is by PageRanked websites, then a lot is being lost. In Twitter's case, the company has only been scratching the surface of a history of tweets stretching back to 2006. An archive, to date, containing some 425 billion tweets.
Topsy, one of only four Certified Resellers of Twitter's data, says it has now indexed every tweet ever posted, something Twitter doesn't do and couldn't easily reproduce due to the infrastructure and costs involved. (Topsy has raised $35 million in venture capital since 2008 to get to this point, the company says.)
Meanwhile, today's Twitter is more interested in the now, and the recent, not the distant past. A visit to search.twitter.com pulls up tweet results that only stretch back a matter of days, not months, and certainly not years. And with every passing season, that time frame compresses even more. Twitter's index currently only goes back a week, it states. In 2009, it stretched back a week and a half. Before that, it was a month.
Topsy was previously able to dig into Twitter's archive all the way back to 2010, with an expansion announced in August. Now, it can go back the full seven years. That makes it the largest and most comprehensive archive of Twitter's data that has ever existed for free, public access. Outside of Twitter, only data partners like Gnip and the Library of Congress have had access to this data before, but it was not in a format everyday users could access and search. And it definitely wasn't free.
According to Topsy co-founder and CTO Vipul Ved Prakash, the ability to index every tweet from Twitter's beginning onwards (now 425 billion items across 3,500 servers) was a big data feat. "The third generation of our indexing technology has increased the density of the number of documents we can index on a server, so that means we can now run a massive index that includes every tweet," he says, and, he adds, Topsy will eventually be able to scale that to trillions of documents. Topsy's competitors not building infrastructure-based businesses, Prakash challenges, won't be able to keep up.
Though companies often like to make bold claims such as this, there is some truth to that statement. Today's web is changing. Twitter, for example, is now pumping out some 400 million to 600 million new tweets daily, each of which becomes indexed on Topsy within 150 milliseconds. Put another way: the amount of data that Twitter will produce between now and this time next year is more than every tweet it has ever produced to date.
And when you also take Facebook into account, you realize the web Google understands is now just a slice. "The amount of data being created on Twitter plus Facebook today is more than the data being created on the rest of the web," Prakash explains. "Social data has become the bigger public corpus." (There's your answer to "why does Google+ exist?")
And if the social web is now the larger web, then it's not surprising that Topsy's ambitions expand beyond Twitter. The company's technology is already capable of indexing every public page on other social media sites, like Facebook, in addition to links users tweet. It also has an archive of all of Google+'s public posts.
Link:
Topsy, Now With Every Tweet From 2006 On, Has Other Social Media Indexes In The Works