Link Rot and Digital Decay on Government, News and Other Webpages – Pew Research Center
Pew Research Center conducted the analysis to examine how often online content that once existed becomes inaccessible. One part of the study looks at a representative sample of webpages that existed over the past decade to see how many are still accessible today. For this analysis, we collected a sample of pages from the Common Crawl web repository for each year from 2013 to 2023. We then tried to access those pages to see how many still exist.
A second part of the study looks at the links on existing webpages to see how many of those links are still functional. We did this by collecting a large sample of pages from government websites, news websites and the online encyclopedia Wikipedia.
We identified relevant news domains using data from the audience metrics company comScore and relevant government domains (at multiple levels of government) using data from get.gov, the official administrator for the .gov domain. We collected the news and government pages via Common Crawl and the Wikipedia pages from an archive maintained by the Wikimedia Foundation. For each collection, we identified the links on those pages and followed them to their destination to see what share of those links point to sites that are no longer accessible.
A third part of the study looks at how often individual posts on social media sites are deleted or otherwise removed from public view. We did this by collecting a large sample of public tweets on the social media platform X (then known as Twitter) in real time using the Twitter Streaming API. We then tracked the status of those tweets for a period of three months using the Twitter Search API to monitor how many were still publicly available. Refer to the report methodology for more details.
The internet is an unimaginably vast repository of modern life, with hundreds of billions of indexed webpages. But even as users across the world rely on the web to access books, images, news articles and other resources, this content sometimes disappears from view.
A new Pew Research Center analysis shows just how fleeting online content actually is:
This digital decay occurs in many different online spaces. We examined the links that appear on government and news websites, as well as in the References section of Wikipedia pages as of spring 2023. This analysis found that:
To see how digital decay plays out on social media, we also collected a real-time sample of tweets during spring 2023 on the social media platform X (then known as Twitter) and followed them for three months. We found that:
There are many ways of defining whether something on the internet that used to exist is now inaccessible to people trying to reach it today. For instance, inaccessible could mean that:
For this report, we focused on the first of these: pages that no longer exist. The other definitions of accessibility are beyond the scope of this research.
Our approach is a straightforward way of measuring whether something online is accessible or not. But even so, there is some ambiguity.
First, there are dozens of status codes indicating a problem that a user might encounter when they try to access a page. Not all of them definitively indicate whether the page is permanently defunct or just temporarily unavailable. Second, for security reasons, many sites actively try to prevent the sort of automated data collection that we used to test our full list of links.
For these reasons, we used the most conservative estimate possible for deciding whether a site was actually accessible or not. We counted pages as inaccessible only if they returned one of nine error codes that definitively indicate that the page and/or its host server no longer exist or have become nonfunctional regardless of how they are being accessed, and by whom. The full list of error codes that we included in our definition is in the methodology.
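To make the conservative rule concrete, here is a minimal Python sketch. The function names are hypothetical, and the two codes in `DEFINITIVE_ERROR_CODES` are illustrative placeholders only. The report's actual list of nine codes is in the methodology and is not reproduced here.

```python
# Assumption: 404 and 410 are placeholders chosen for illustration.
# The report's real list of nine codes lives in its methodology.
DEFINITIVE_ERROR_CODES = frozenset({404, 410})


def is_inaccessible(status_code, definitive_codes=DEFINITIVE_ERROR_CODES):
    """Count a page as inaccessible only for a definitive error code.

    Any other response, including ambiguous errors such as a block on
    automated requests, is conservatively treated as still accessible.
    """
    return status_code in definitive_codes


def share_inaccessible(status_codes, definitive_codes=DEFINITIVE_ERROR_CODES):
    """Return the fraction of responses that count as inaccessible."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no status codes supplied")
    broken = sum(1 for code in codes if is_inaccessible(code, definitive_codes))
    return broken / len(codes)
```

Under this rule a response such as 403, which may only mean that automated requests are blocked, is not counted as a broken page, so the resulting share is a lower bound rather than an upper one.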
Here are some of the findings from our analysis of digital decay in various online spaces.
To conduct this part of our analysis, we collected a random sample of just under 1 million webpages from the archives of Common Crawl, an internet archive service that periodically collects snapshots of the internet as it exists at different points in time. We sampled pages collected by Common Crawl each year from 2013 through 2023 (approximately 90,000 pages per year) and checked to see if those pages still exist today.
We found that 25% of all the pages we collected from 2013 through 2023 were no longer accessible as of October 2023. This figure is the sum of two different types of broken pages: 16% of pages are individually inaccessible but come from an otherwise functional root-level domain; the other 9% are inaccessible because their entire root domain is no longer functional.
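The two-part breakdown can be sketched as a simple tally. This is an illustration under stated assumptions, not the Center's code: each record is a hypothetical pair of booleans saying whether the page itself responded and whether its root domain responded.

```python
from collections import Counter


def classify(page_ok, domain_ok):
    """Label one page as accessible, page-level broken or domain-level broken."""
    if page_ok:
        return "accessible"
    return "page_broken" if domain_ok else "domain_broken"


def breakdown(records):
    """Return the share of pages in each category.

    `records` is an iterable of (page_ok, domain_ok) boolean pairs.
    """
    counts = Counter(classify(page_ok, domain_ok) for page_ok, domain_ok in records)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records supplied")
    return {
        label: counts[label] / total
        for label in ("accessible", "page_broken", "domain_broken")
    }
```

With shares like those reported above, the overall inaccessible figure is the sum of the two broken categories: 16% + 9% = 25%.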
Not surprisingly, the older snapshots in our collection had the largest share of inaccessible links. Of the pages collected from the 2013 snapshot, 38% were no longer accessible in 2023. But even for pages collected in the 2021 snapshot, about one-in-five were no longer accessible just two years later.
We sampled around 500,000 pages from government websites using the Common Crawl March/April 2023 snapshot of the internet, including a mix of different levels of government (federal, state, local and others). We found every link on each page and followed a random selection of those links to their destination to see if the pages they refer to still exist.
Across the government websites we sampled, there were 42 million links. The vast majority of those links (86%) were internal, meaning they link to a different page on the same website. An explainer resource on the IRS website that links to other documents or forms on the IRS site would be an example of an internal link.
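One way to separate internal from external links is to compare hostnames. The sketch below uses Python's standard `urllib.parse` module and assumes that "same website" means "same hostname, ignoring a leading `www.`". That definition and the function names are assumptions for illustration and may differ from the one used in the report.

```python
from urllib.parse import urljoin, urlparse


def _host(url):
    """Return the lowercase hostname of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_internal(page_url, href):
    """Return True when `href`, found on `page_url`, stays on the same host.

    Relative links are resolved against the page URL before comparing.
    """
    target = urljoin(page_url, href)
    return _host(target) == _host(page_url)
```

For example, a relative link such as `/forms` on an `irs.gov` page resolves to the same host and counts as internal, while a link to another agency's site counts as external.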
Around three-quarters of government webpages we sampled contained at least one on-page link. The typical (median) page contains 50 links, but many pages contain far more. A page in the 90th percentile contains 190 links, and a page in the 99th percentile (that is, the top 1% of pages by number of links) has 740 links.
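Figures such as the median and the 90th and 99th percentiles can be computed from a list of per-page link counts. The sketch below uses the nearest-rank definition of a percentile, which is an assumption. The report does not say which definition it used, and the function names are hypothetical.

```python
def percentile(values, p):
    """Nearest-rank percentile of `values` for an integer p between 1 and 100."""
    if not values:
        raise ValueError("no values supplied")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, so no floating-point rounding is involved.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def share_with_links(link_counts):
    """Return the fraction of pages that contain at least one link."""
    if not link_counts:
        raise ValueError("no pages supplied")
    return sum(1 for count in link_counts if count > 0) / len(link_counts)
```

The median is `percentile(counts, 50)`. A large gap between the median and the 99th percentile, as in the figures above, indicates that a small number of pages carry very long lists of links.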
Other facts about government webpage links:
When we followed these links, we found that 6% point to pages that are no longer accessible. Similar shares of internal and external links are no longer functional.
Overall, 21% of all the government webpages we examined contained at least one broken link. Across every level of government we looked at, there were broken links on at least 14% of pages; city government pages had the highest rates of broken links.
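The link-level and page-level figures measure different things, which is why a small share of broken links can coexist with a much larger share of affected pages. A minimal sketch, assuming each page is represented as a hypothetical list of booleans in which `True` marks a broken link:

```python
def broken_link_rates(pages):
    """Return (share of links that are broken, share of pages with a broken link).

    `pages` is a list of pages; each page is a list of booleans,
    where True means the link is broken.
    """
    total_links = sum(len(links) for links in pages)
    broken_links = sum(sum(links) for links in pages)
    pages_with_broken = sum(1 for links in pages if any(links))
    link_rate = broken_links / total_links if total_links else 0.0
    page_rate = pages_with_broken / len(pages) if pages else 0.0
    return link_rate, page_rate
```

Because a single broken link is enough to flag a page, pages with many links are more likely to be flagged, so the page-level rate is typically the higher of the two.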
For this analysis, we sampled 500,000 pages from 2,063 websites classified as News/Information by the audience metrics firm comScore. The pages were collected from the Common Crawl March/April 2023 snapshot of the internet.
Across the news sites sampled, this collection contained more than 14 million links pointing to an outside website. Some 94% of these pages contain at least one external-facing link. The median page contains 20 links, and pages in the top 10% by link count have 56 links.
Like government websites, the vast majority of these links go to secure HTTP pages (those with a URL beginning with https://). Around 12% of links on these news sites point to a static file, like a PDF document. And 32% of links on news sites redirected to a different URL than the one they originally pointed to, slightly less than the 39% of external links on government sites that redirect.
When we tracked these links to their destination, we found that 5% of all links on news site pages are no longer accessible. And 23% of all the pages we sampled contained at least one broken link.
Broken links are about as prevalent on the most-trafficked news websites as they are on the least-trafficked sites. Some 25% of pages on news websites in the top 20% by site traffic have at least one broken link. That is nearly identical to the 26% of pages on sites in the bottom 20% by site traffic.
For this analysis, we collected a random sample of 50,000 English-language Wikipedia pages and examined the links in their References section. The vast majority of these pages (82%) contain at least one reference link, that is, one that directs the reader to a webpage other than Wikipedia itself.
In total, there are just over 1 million reference links across all the pages we collected. The typical page has four reference links.
The analysis indicates that 11% of all references linked on Wikipedia are no longer accessible. On about 2% of source pages containing reference links, every link on the page was broken or otherwise inaccessible, while another 53% of pages contained at least one broken link.
For this analysis, we collected nearly 5 million tweets posted from March 8 to April 27, 2023, on the social media platform X, which at the time was known as Twitter. We did this using Twitter's Streaming API, collecting 3,000 public tweets every 30 minutes in real time. This provided us with a representative sample of all tweets posted on the platform during that period. We monitored those tweets until June 15, 2023, and checked each day to see if they were still available on the site or not.
At the end of the observation period, we found that 18% of the tweets from our initial collection window were no longer publicly visible on the site. In a majority of cases, this was because the account that originally posted the tweet was made private, suspended or deleted entirely. For the remaining tweets, the account that posted the tweet was still visible on the site, but the individual tweet had been deleted.
Tweets were especially likely to be deleted or removed over the course of our collection period if they were:
We also found that removed or deleted tweets tended to come from newer accounts with relatively few followers and modest activity on the site. On average, tweets that were no longer visible on the site were posted by accounts around eight months younger than those whose tweets stayed on the site.
And when we analyzed the types of tweets that were no longer available, we found that retweets, quote tweets and original tweets did not differ much from the overall average. But replies were relatively unlikely to be removed: just 12% of replies were inaccessible at the end of our monitoring period.
Most tweets that are removed from the site tend to disappear soon after being posted. In addition to looking at how many tweets from our collection were still available at the end of our tracking period, we conducted a survival analysis to see how long these tweets tended to remain available. We found that:
Put another way: Half of tweets that are eventually removed from the platform are unavailable within the first six days of being posted. And 90% of these tweets are unavailable within 46 days.
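Statements of this kind can be read directly off the removal times of the tweets that eventually disappeared. The sketch below is a simple empirical calculation, not the survival model used in the report. It assumes a hypothetical list in which each entry is the number of days until a tweet became unavailable, or `None` if the tweet was still available when monitoring ended.

```python
def share_removed_within(days_to_removal, cutoff_days):
    """Among tweets that were eventually removed, return the share that
    became unavailable within `cutoff_days` of being posted.

    Entries equal to None (tweets never observed as removed) are excluded.
    """
    removed = [days for days in days_to_removal if days is not None]
    if not removed:
        raise ValueError("no removed tweets in the sample")
    within = sum(1 for days in removed if days <= cutoff_days)
    return within / len(removed)
```

In these terms, the finding above says that `share_removed_within(sample, 6)` is about 0.5 and `share_removed_within(sample, 46)` is about 0.9 for the tweets in the collection.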
Tweets don't always disappear forever, though. Some 6% of the tweets we collected disappeared and then became available again at a later point. This could be due to an account going private and then returning to public status, or to the account being suspended and later reinstated. Of those reappeared tweets, the vast majority (90%) were still accessible on Twitter at the end of the monitoring period.