What is the Wayback Machine? | How the Internet Archive uses Web Crawlers to Preserve Internet History
Published 5 months ago | Last update 2 months ago
Strange things are afoot at the Circle K.
- What is this so-called Wayback Machine?
- Who Runs the Wayback Machine?
- How Does the Wayback Machine Work?
- You Too Can Contribute to the Wayback Machine
- How to Use the Wayback Machine?
- Step-by-Step Guide
- What are the best uses for the Wayback Machine?
- What About Pages That Aren’t Included in the Wayback Machine?
- Legal Status
- Have fun not being a part of Internet history!
- Legal Issues
- The Wayback Machine is Capitalist Propaganda
- Take an Excellent Adventure Through Time
- Additional Sources
- Official Pages
- Unofficial Resources
- Related Articles
What is this so-called Wayback Machine?
The Wayback Machine is a vast digital archive of web pages that was launched by Brewster Kahle and Bruce Gilliat in 2001 and consists of snapshots of webpages taken over time. The Wayback Machine was made to address the problem of content disappearing from the Internet whenever a website was changed or taken down.
In this way, the Wayback Machine serves to preserve Internet history and provide “universal access to all knowledge.” The name of the Wayback Machine is a reference to the WABAC Machine, a time travel device famously used by Mr. Peabody and Sherman.
The project started in 1996 with the archiving of cached web pages that were stored on digital tape. The project was unveiled and officially launched in a ceremony at the University of California, Berkeley in 2001, at which point it already contained 10 billion snapshots.
Who Runs the Wayback Machine?
The Wayback Machine is run by the Internet Archive, a non-profit organization based in San Francisco, California that has been actively documenting Internet history since 1996. Today, the Internet Archive works with over 450 libraries and a variety of other partners. It is a member of the International Internet Preservation Consortium.
Aside from the database of 330 billion web pages available through the Wayback Machine, the Internet Archive contains a wide variety of digital and digitized media, including 20 million books, 4.5 million audio recordings, 4 million videos, 3 million images, and 200,000 software programs. The archive is stored on over 60 petabytes of server space. About 30 petabytes of this contain unique content (as a rule, everything is stored at least twice).
The Internet Archive is also an activist site that advocates for a free and open Internet. Therefore, all of the resources that you can find on the Internet Archive are available to access for free.
In the case of a huge loss of data due to a natural disaster or other unfortunate occurrences, the Archive also has a partial copy of its collection stored at the Bibliotheca Alexandrina in Egypt and a facility in Amsterdam.
How Does the Wayback Machine Work?
The majority of the content for the Wayback Machine is gathered using web-crawling software which identifies a domain and then follows a series of rules to retrieve and archive its content. These crawls were originally completed by the web-crawling company Alexa Internet (not to be confused with Despacito Alexa) which was also founded by Kahle and Gilliat. Now, a number of other sources assist with “Worldwide Web Crawls,” including the NARA, Internet Memory Foundation, and Common Crawl.
These “Worldwide Web Crawls,” which are intended to capture the global web, have been running since 2010. They can take months or even years to complete, though several typically run concurrently.
The frequency of crawling varies substantially per website. The Internet Archive has not released information detailing exactly how websites are chosen for crawling (and how frequently), but a website’s traffic and overall popularity are very likely factors. This is likely tied to a website’s Alexa ranking.
When a page is crawled, the snapshot usually includes all of the HTML/source code of the page, as well as functioning hyperlinks and images, when possible. When the hyperlinks work, navigating the Wayback Machine is very much like navigating the original website.
However, web crawlers have some limitations. For one, they can only follow a certain number of hyperlinks based on a preset depth limit and will stop archiving links once the limit is reached.
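To make the depth limit concrete, here is a minimal sketch of a breadth-first crawl that stops following links once a preset depth is reached. It is not the Internet Archive's crawler: the `LINKS` table is a made-up, in-memory stand-in for real pages, and the `crawl` function and its `max_depth` parameter are hypothetical names chosen for illustration.

```python
from collections import deque

# Hypothetical in-memory "web": each page maps to the pages it links to.
LINKS = {
    "home": ["about", "blog"],
    "about": ["team"],
    "blog": ["post-1", "post-2"],
    "team": ["alice"],
    "post-1": [],
    "post-2": ["home"],
    "alice": [],
}


def crawl(start, max_depth):
    """Breadth-first crawl that does not follow links past max_depth."""
    archived = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth limit reached: this page's links are ignored
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                archived.append(link)
                queue.append((link, depth + 1))
    return archived


print(crawl("home", 2))
```

With a depth limit of 2, the page "alice" is never archived because it sits three links away from "home". Pages buried deep in a site can end up missing from an archive for the same reason.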
The crawlers also will not discover or archive pages that the website owner has blocked with a robots.txt file. The Internet Archive also generally removes archived pages upon request from website owners.
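For a rough idea of how a well-behaved crawler checks these rules, the sketch below uses Python's standard `urllib.robotparser` module on a sample rules file. The rules, the `example-crawler` user agent, and the example.com URLs are all made up for illustration, and this is not necessarily how the Internet Archive's own crawlers apply them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that blocks every crawler from /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "example-crawler", "https://example.com/index.html"))
print(allowed(ROBOTS_TXT, "example-crawler", "https://example.com/private/page.html"))
```

The first URL is allowed and the second is blocked, so a crawler that honors the file would skip everything under /private/.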
The Wayback Machine also has a lag of 3 to 10 hours between when a site was crawled and when it was added to the archive. Back in 2014, however, the lag was closer to six months long, so a few hours really isn’t anything to complain about.
The Internet Archive currently uses over 60 petabytes of server space to store content. This includes at least 30 petabytes of unique content, with about 13-15 terabytes of additional content being added each day.
The archive practices “data mirroring” (or “paired storage”) and stores two copies of all content.
Once a new item is created, automated systems quickly replicate that item across two distinct disk drives in separate servers that are (usually) in separate physical data centers. This “mirroring” of content is done both to minimize the likelihood of data loss or data corruption (due to unexpected harddrive or system failures) and to increase the efficiency of access to the content. Both of these storage locations (called “primary” and “secondary”) are immediately available to serve their copy of the content to patrons… and if one storage location becomes unavailable, the content remains available from the alternate storage location.
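The paired-storage idea can be sketched as a toy in-memory store. This is an illustration only and not the Archive's actual system: the `MirroredStore` class, its method names, and the `online` flags are hypothetical. Every write goes to both locations, and a read falls back to the other copy when one location is unavailable.

```python
class MirroredStore:
    """Toy paired storage: every item is written to two locations."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        # Flags that simulate a storage location going offline.
        self.online = {"primary": True, "secondary": True}

    def put(self, key, data):
        # Replicate the item to both locations.
        self.primary[key] = data
        self.secondary[key] = data

    def get(self, key):
        # Serve from whichever location is currently available.
        locations = (("primary", self.primary), ("secondary", self.secondary))
        for name, store in locations:
            if self.online[name] and key in store:
                return store[key]
        raise KeyError(key)


store = MirroredStore()
store.put("snapshot-1", b"<html>...</html>")
store.online["primary"] = False  # simulate a failed server
print(store.get("snapshot-1"))   # still served from the secondary copy
```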
The content is stored on about 20,000 hard drives, which are housed in specialized computers. Here is an in-depth description of how the Internet Archive stores data.
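A quick back-of-the-envelope calculation shows how these figures fit together. It assumes decimal units (1 petabyte = 1,000 terabytes) and treats "over 60 petabytes" and "about 20,000 hard drives" as round numbers, so the results are rough estimates rather than official statistics. The article also does not say whether the daily growth figure counts both stored copies.

```python
TB_PER_PB = 1000  # assumption: decimal units

total_pb = 60               # total server space, from the article
drives = 20_000             # approximate number of hard drives
copies = 2                  # everything is stored at least twice
daily_growth_tb = (13, 15)  # low and high estimates of daily additions

avg_tb_per_drive = total_pb * TB_PER_PB / drives
unique_pb = total_pb / copies
yearly_growth_pb = tuple(tb * 365 / TB_PER_PB for tb in daily_growth_tb)

print(avg_tb_per_drive)   # average terabytes per drive
print(unique_pb)          # petabytes of unique content
print(yearly_growth_pb)   # petabytes added per year (low, high)
```

Under these assumptions, the average works out to about 3 terabytes per drive, half of the total space holds unique content, and the archive grows by roughly 5 petabytes a year.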
You Too Can Contribute to the Wayback Machine
Though most pages are chosen by crawlers, some are manually curated. An especially cool aspect of this is that any user can submit pages to be added, so long as the chosen domain allows web-crawling.
To add a page yourself, simply go to the Wayback Machine homepage and paste the URL of the page you want to save into the text box under “Save Page Now.”
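If you would rather script this than use the text box, a capture can also be requested by visiting a URL under web.archive.org/save/. The sketch below only builds that URL. The `submit` helper is a hypothetical convenience that assumes a plain GET request is enough, which may not hold if the service rate-limits or changes its behavior.

```python
from urllib.request import urlopen


def save_page_now_url(page_url):
    """Build the address that asks the Wayback Machine to capture page_url."""
    return "https://web.archive.org/save/" + page_url


def submit(page_url, timeout=30):
    """Request a capture and return the HTTP status (performs a network call)."""
    with urlopen(save_page_now_url(page_url), timeout=timeout) as response:
        return response.status


print(save_page_now_url("https://example.com/"))
```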
How to Use the Wayback Machine?
The Wayback Machine is pretty simple to use:
First, direct yourself to the Wayback Machine through this link, and type the URL of the website you wish to visit into the search bar. For example, you could type “dkvine.com” and hit enter.
A timeline will appear showing all of the snapshots that have been saved of that particular page. In this case, dkvine.com has been saved 635 times, with the earliest instance being July 11, 2000, and the most recent being January 6, 2019. Select the year you want. We’ll choose 2000 in order to get an early look at the website.
A calendar for the selected year will come up, with small blue circles over the days when snapshots took place. The number of snapshots depends mostly on how much traffic the website was getting at the time. Additionally, earlier years tend to have fewer snapshots overall, probably because the Internet Archive had fewer resources.
Anyway, choose a day that has an associated snapshot and take a look. We’ll choose August 15, 2000, which shows us the retro dkvine.com homepage, which happens to feature a picture of Bottles the Mole and Danny DeVito. By choosing the snapshot from January 19, 2001, we see that the website changed its logo and look not long after.
Keep in mind that in some cases the attempt to crawl a website on that particular day may have failed, and you will just see an error message.
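The steps above can also be skipped by going straight to a snapshot's address. Archived pages live at web.archive.org/web/ followed by a timestamp and the original URL, and the Wayback Machine generally redirects to the capture closest to the requested date. The sketch below builds such an address for the dkvine.com example, plus a query for the Archive's availability API, which reports the closest snapshot as JSON. The function names are hypothetical, and neither function makes a network request.

```python
from datetime import date
from urllib.parse import urlencode


def snapshot_url(site, day):
    """Address of the capture closest to the given day."""
    stamp = day.strftime("%Y%m%d")  # timestamps use the form YYYYMMDD[hhmmss]
    return f"https://web.archive.org/web/{stamp}/{site}"


def availability_query(site, day):
    """Query URL asking the availability API for the closest snapshot."""
    params = urlencode({"url": site, "timestamp": day.strftime("%Y%m%d")})
    return f"https://archive.org/wayback/available?{params}"


print(snapshot_url("dkvine.com", date(2000, 8, 15)))
print(availability_query("dkvine.com", date(2000, 8, 15)))
```

Opening the first address in a browser should land on or near the August 15, 2000 snapshot described above, assuming that capture is still in the archive.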
What are the best uses for the Wayback Machine?
One of the most obvious uses for the Wayback Machine is using it to view outdated websites as they appeared in the late 90s and early 2000s. It can be pretty humorous and educational to see how much websites have changed.
The Wayback Machine can also be extremely useful for research purposes, especially when it comes to accessing websites that covered niche or once-timely topics that are no longer in the public eye and have all but disappeared. For example, several of the sources used for this article on Hiroyuki Nishimura were only accessible using the Wayback Machine archives.
Journalists have also used the Wayback Machine to view old pages and hold politicians accountable for statements and expose battlefield lies. As one example:
In 2014 an archived social media page of Igor Girkin, a pro-Russia rebel leader in Ukraine, showed he boasted his troops shot down a suspected Ukrainian military airplane before it became known that the plane actually was a Malaysian Airlines jet aboard which 298 civilians died. Girkin deleted the post and subsequently blamed Ukraine’s military.
The machine can also be used to improve your SEO, which is something that people seem to like. This page has some imaginative ideas on other ways to use the Wayback Machine, including illegal things like blackmail.
What About Pages That Aren’t Included in the Wayback Machine?
Despite the Internet Archive’s best efforts, the Wayback Machine is not truly comprehensive, as it isn’t possible to take a snapshot of every single web page day in and day out. If an old page isn’t available on the Wayback Machine, it might not be accessible anywhere.
And of course, if website owners place a robots.txt file on their site or request that the Archive remove their materials, that content also won’t be available.
Have fun not being a part of Internet history!
The Wayback Machine is generally legal, as content is stored and displayed with no intention to depict it as original or to profit off of it. However, it has a shakier legal standing in Europe, where content creators can decide where their work is published or duplicated, meaning that the Internet Archive has to delete archived content upon request.
Apparently, the US Nuclear Regulatory Commission even had to request that pages containing sensitive information about off-the-grid nuclear power plants be removed.
There have been several lawsuits and other legal actions taken against the Internet Archive for items stored on the Wayback Machine. In one example, an “adult movie star” named Daniel Davydiuk fought to have sensitive photos of himself removed from the archive.
The Wayback Machine is Capitalist Propaganda
Despite the somewhat socialist leaning of the Internet Archive (free knowledge for all!), the entire site is blocked in China and Russia, due to their censorship laws. So you’re out of luck if you live in one of those countries and want to play Oregon Trail, I guess.
Take an Excellent Adventure Through Time
The Wayback Machine is an extremely valuable resource for a number of academic, journalistic, and professional pursuits. It is also a great way to spend several hours if you just want to click around and explore.
Additional Sources
Official Pages
- The Internet Archive
- The Wayback Machine
- Internet Archive Blog - from the team at Archive.org
- Internet Archive Twitter
- Wayback Machine Twitter
Unofficial Resources
- Internet Archive - Wikipedia
- Wayback Machine - Wikipedia
- Alexa Internet - Wikipedia
- Brewster Kahle - Wikipedia
- Bruce Gilliat - Wikipedia
- Forbes - How Much of the Internet Does the Wayback Machine Really Archive?
Related Articles
- Search No Further... 7+ Search Engines That Are Worth Checking Out
- Why Go to College? | Top 5 Educational Resources on the Internet
- What is DuckDuckGo and What Advantages Does it Have Over Other Search Engines?
- Remember Oregon Trail? | A Brief History of the Most Popular Educational Video Game of All Time
- Biography of Christopher “moot” Poole: The Hacker Known as “4chan”
- Biography of Hiroyuki Nishimura: The Father of 2channel
Louis Cicalese is a person who has written about the hacker known as 4chan, the hacker known as 2channel 5channel, lesser-known search engines, CSS color names, Leeroy Jenkins, hiring Kermit the Frog impersonators and various other topics.