Want to suggest a website to add to our collections? Interested in developing a web archive collection for a research project? Contact us through our request form.
For other inquiries, please email us directly at digitization.centre@ubc.ca.
Web archiving is the process of collecting and preserving resources from the internet in order to make the content available for future historians, researchers, and the public. The web archiving process allows you to view both a website's content and its design as they appeared at a particular point in time.
Here is how UBC's main webpage (www.ubc.ca) looked at the time of its first web capture by the Internet Archive in 1997, and how it looks now:
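You can look up captures like this one programmatically. The Internet Archive's Wayback Machine exposes an availability endpoint that returns the closest archived snapshot for a given URL and target date. The sketch below builds a query for that endpoint and extracts the snapshot URL from a response shaped like the one the API documents; the sample response values here are illustrative, not a real capture record.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Wayback Machine availability endpoint (documented by the Internet Archive).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"

def availability_query(url: str, timestamp: str = "") -> str:
    """Build a query URL for the availability API.
    `timestamp` is YYYYMMDDhhmmss; partial values like "1997" are accepted."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)

def closest_snapshot(response_json: str) -> Optional[str]:
    """Extract the archived snapshot URL from the API's JSON response,
    or return None if no snapshot was found."""
    data = json.loads(response_json)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

# Ask for the capture of www.ubc.ca nearest to 1997:
print(availability_query("www.ubc.ca", "1997"))
# → https://archive.org/wayback/available?url=www.ubc.ca&timestamp=1997

# A response in the documented shape (snapshot values are illustrative):
sample = ('{"archived_snapshots": {"closest": {"available": true, '
          '"url": "http://web.archive.org/web/19970101000000/http://www.ubc.ca/", '
          '"timestamp": "19970101000000", "status": "200"}}}')
print(closest_snapshot(sample))
```

Fetching the query URL with any HTTP client returns JSON in this shape; an empty `archived_snapshots` object means no capture exists for that URL.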
The web archiving workflow is similar to the traditional process of archiving paper documents: information is selected, processed, stored, described, and made available.
Archive-It has a comprehensive glossary of web archiving terms. Here are some of the most common ones you will hear used when talking about web archiving:
Term | Definition
--- | ---
Archive-It | Archive-It is a subscription service from the Internet Archive that allows institutions to build and preserve collections of web content.
Application Programming Interface (API) | A set of rules that allows one piece of software to communicate with another in order to provide a certain service, tool, or mechanism.
Crawl | The operation conducted by an automated program (a "web crawler" or "spider") to identify materials on the live web for archiving.
Crawler | Also known as a "web crawler" or a "spider", this term refers to the software that browses the Internet and captures web pages.
Crawler trap | Part of a website that can generate an infinite number of (often invalid) URLs.
Domain/sub-domain | The domain is the root of a host name (e.g. ubc.ca), ending in a top-level domain such as .com, .ca, or .gov. The sub-domain is the part named before the domain (e.g. "library" in library.ubc.ca).
Host | Where web content is stored, usually designated by its internet host name (e.g. ubc.ca).
Internet Archive | The Internet Archive is a non-profit digital library of Internet sites and other cultural artifacts in digital form. The Internet Archive is responsible for the Wayback Machine and the Archive-It subscription service.
Link rot | When a hyperlink no longer leads to the content it once did: the target page may have been deleted, producing an error message, or moved, so that the original link no longer returns the original content.
Robots.txt | A file that a site's owner can add to their website to ask crawlers not to access all or parts of it.
Seed | A starting-point URL from which a crawl begins.
Uniform Resource Locator (URL) | The location of a resource on the web.
WARC | Short for Web ARChive, a WARC is an open format (standardized as ISO 28500) used for the long-term preservation of web content.
Wayback Machine | Launched by the Internet Archive in 2001, the Wayback Machine is a digital archive for the World Wide Web. The Wayback Machine displays archived web pages as if they were live on the web.
Web scraping | A method of extracting content from a website, usually limited to textual data.