Want to suggest a website to add to our collections? Interested in developing a web archive collection for a research project? Contact us through our request form.
For other inquiries, please email us directly at digitization.centre@ubc.ca.
Web archiving is the process of collecting and preserving resources from the internet in order to make the content available for future historians, researchers, and the public. The web archiving process allows you to view both a website's content and its design as they appeared at a particular point in time.
Here is how UBC's main webpage (www.ubc.ca) looked at the time of its first web capture by the Internet Archive in 1997, and how it looks now:
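You can look up captures like this one programmatically. The Internet Archive's Wayback Machine exposes an availability endpoint that returns the closest archived snapshot for a given URL and target date. The sketch below builds a query for that endpoint and extracts the snapshot URL from a response shaped like the one the API documents; the sample response values here are illustrative, not a real capture record.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Wayback Machine availability endpoint (documented by the Internet Archive).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"

def availability_query(url: str, timestamp: str = "") -> str:
    """Build a query URL for the availability API.
    `timestamp` is YYYYMMDDhhmmss; partial values like "1997" are accepted."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)

def closest_snapshot(response_json: str) -> Optional[str]:
    """Extract the archived snapshot URL from the API's JSON response,
    or return None if no snapshot was found."""
    data = json.loads(response_json)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None

# Ask for the capture of www.ubc.ca nearest to 1997:
print(availability_query("www.ubc.ca", "1997"))
# → https://archive.org/wayback/available?url=www.ubc.ca&timestamp=1997

# A response in the documented shape (snapshot values are illustrative):
sample = ('{"archived_snapshots": {"closest": {"available": true, '
          '"url": "http://web.archive.org/web/19970101000000/http://www.ubc.ca/", '
          '"timestamp": "19970101000000", "status": "200"}}}')
print(closest_snapshot(sample))
```

Fetching the query URL with any HTTP client returns JSON in this shape; an empty `archived_snapshots` object means no capture exists for that URL.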
The web archiving workflow is similar to the traditional process of archiving paper documents: information is selected, processed, stored, described, and made available.
Archive-It has a comprehensive glossary of web archiving terms. Here are some of the most common ones you will hear used when talking about web archiving:
Term | Definition
--- | ---
Archive-It | Archive-It is a subscription service from the Internet Archive that allows institutions to build and preserve collections of web content.
Application Programming Interface (API) | A set of rules that allows one piece of software to communicate with another in order to provide a certain service, tool, or mechanism.
Crawl | The operation conducted by an automated program (a "web crawler" or "spider") to identify materials on the live web for archiving.
Crawler | Also known as a "web crawler" or a "spider", this term refers to the software that browses the Internet and captures web pages.
Crawler trap | Part of a website that can generate an infinite number of (often invalid) URLs.
Domain/sub-domain | The domain is the root of a host name (e.g. ubc.ca), ending in a top-level domain such as .com, .ca, or .gov. The sub-domain is the part named before the domain (e.g. "library" in library.ubc.ca).
Host | Where web content is stored, usually designated by its internet host name (e.g. ubc.ca).
Internet Archive | The Internet Archive is a non-profit digital library of Internet sites and other cultural artifacts in digital form. The Internet Archive is responsible for the Wayback Machine and the Archive-It subscription service.
Link rot | When a hyperlink no longer leads to the content it once did: the target page may have been deleted, producing an error message, or moved, so that the original link no longer returns the original content.
Robots.txt | A file that a site's owner can add to their website to ask crawlers not to access all or parts of it.
Seed | A starting-point URL from which a crawl begins.
Uniform Resource Locator (URL) | The location of a resource on the web.
WARC | Short for Web ARChive, a WARC is an open format (standardized as ISO 28500) used for the long-term preservation of web content.
Wayback Machine | Launched by the Internet Archive in 2001, the Wayback Machine is a digital archive for the World Wide Web. The Wayback Machine displays archived web pages as if they were live on the web.
Web scraping | A method of extracting content from a website, usually limited to textual data.