Internet Archive Web Crawls

Wide crawls of the Internet conducted by Internet Archive. Please visit the Wayback Machine to explore archived web sites. Since September 10th, 2010, the Internet Archive has been running Worldwide Web Crawls of the global web, capturing web elements, pages, sites and parts of sites. Each Worldwide Web Crawl was initiated from one or more lists of URLs that are known as "Seed Lists". Descriptions of the Seed Lists associated with each crawl may be provided as part of the metadata for...

9.5B 9.5B

Live Web Proxy Crawls

108,915

ITEMS

9.5B

VIEWS

Apr 26, 2011 04/11

eye 9.5B

Content crawled via the Wayback Machine Live Proxy mostly by the Save Page Now feature on web.archive.org. Liveweb proxy is a component of Internet Archive’s wayback machine project. The liveweb proxy captures the content of a web page in real time, archives it into a ARC or WARC file and returns the ARC/WARC record back to the wayback machine to process. The recorded ARC/WARC file becomes part of the wayback machine in due course of time.

9.7B 9.7B

Survey Crawls

100,903

ITEMS

9.7B

VIEWS

Nov 17, 2012 11/12

eye 9.7B

Survey crawls are run about twice a year, on average, and attempt to capture the content of the front page of every web host ever seen by the Internet Archive since 1996.
Topic: survey crawls

1.7B 1.7B

Wide Crawl Number 17: Started August 3rd, 2018

69,707

ITEMS

1.7B

VIEWS

Aug 2, 2018 08/18

eye 1.7B

Wide17 was seeded with the "Total Domains" list of 256,796,456 URLs provided by  Domains Index   on June 26th, and crawled with max-hops set to "3" and de-duplication set "on".   

2.2B 2.2B

Wide Crawl Number 14 - Started Mar 4th, 2016 - Ended Sep 15th, 2016

71,768

ITEMS

2.2B

VIEWS

Mar 4, 2016 03/16

eye 2.2B

The seed for Wide00014 was: - Slash pages from every domain on the web: -- a list of domains using Survey crawl seeds -- a list of domains using Wide00012 web graph -- a list of domains using Wide00013 web graph - Top ranked pages (up to a max of 100) from every linked-to domain using the Wide00012 inter-domain navigational link graph -- a ranking of all URLs that have more than one incoming inter-domain link (rank was determined by number of incoming links using Wide00012 inter domain links)...

1B 1.0B

Survey Crawl Number 8

8,758

ITEMS

VIEWS

Oct 29, 2018 10/18

eye 1B

1.9B 1.9B

Survey Crawl Number 0 - Started May 18th, 2013 - Ended May 15, 2014

16,282

ITEMS

1.9B

VIEWS

Nov 17, 2012 11/12

eye 1.9B

The seed for this crawl was a list of every host in the Wayback Machine This crawl was run at a level 1 (URLs including their embeds, plus the URLs of all outbound links including their embeds) The WARC files associated with this crawl are not currently available to the general public.

1.1B 1.1B

Wide Crawl Number 16: Started June 3rd, 2017 - Still running

83,809

ITEMS

1.1B

VIEWS

Jan 9, 2015 01/15

eye 1.1B

Web wide crawl number 16 The seed list for Wide00016 was made from the join of the top 1 million domains from CISCO and the top 1 million domains from Alexa.

1.3B 1.3B

Wide Crawl Number 15: Started Oct 1st, 2016 - Ended May 8th, 2017

38,827

ITEMS

1.3B

VIEWS

Jan 9, 2015 01/15

eye 1.3B

Web wide crawl.

1B 1.0B

GDELT

57,656

ITEMS

VIEWS

Aug 27, 2014 08/14

eye 1B

A daily crawl of more than 200,000 home pages of news sites, including the pages linked from those home pages. Site list provided by The GDELT Project
Topics: GDELT, News

1.2B 1.2B

Wide Crawl started June 2014

45,341

ITEMS

1.2B

VIEWS

Jun 6, 2014 06/14

eye 1.2B

Web wide crawl with initial seedlist and crawler configuration from June 2014.

1.1B 1.1B

Survey Crawl Number 6: Sep 11th, 2017 - running now

13,723

ITEMS

1.1B

VIEWS

Jan 9, 2015 01/15

eye 1.1B

The seeds for this crawl came from: 251 million Domains that had at least one link from a different domain in the Wayback Machine, across all time ~ 300 million Domains that we had in the Wayback, across all time 55,945,067 Domains from https://archive.org/details/wide00016 This crawl was run with a Heritrix setting of "maxHops=0" (URLs including their embeds) The WARC files associated with this crawl are not currently available to the general public.

1.2B 1.2B

Wide Crawl Number 12 - started March, 14th 2015

49,621

ITEMS

1.2B

VIEWS

Jan 9, 2015 01/15

eye 1.2B

Web wide crawl with initial seedlist and crawler configuration from January 2015.

1.3B 1.3B

Wide Crawl started April 2013

25,035

ITEMS

1.3B

VIEWS

Apr 18, 2013 04/13

eye 1.3B

Web wide crawl with initial seedlist and crawler configuration from April 2013.

1.3B 1.3B

Survey Crawl Number 2 - Started Dec 17th, 2014 - Ended Jul 31st, 2015

13,890

ITEMS

1.3B

VIEWS

Dec 17, 2014 12/14

eye 1.3B

888.6M 889M

Wide Crawl Number 13

46,050

ITEMS

888.6M

VIEWS

Jan 9, 2015 01/15

eye 888.6M

Web Wide Crawl Number 13

1B 1.0B

Survey Crawl Number 4 - Started Jan 9th, 2016 - Ended Feb 25th, 2016

12,622

ITEMS

VIEWS

Jan 9, 2015 01/15

eye 1B

794.7M 795M

Survey Crawl Number 7

6,605

ITEMS

794.7M

VIEWS

Feb 24, 2018 02/18

eye 794.7M

This "Survey" crawl was started on Feb. 24, 2018. This crawl was run with a Heritrix setting of "maxHops=0" (URLs including their embeds) Survey 7 is based on a seed list of 339,249,218 URLs which is all the URLs in the Wayback Machine that we saw a 200 response code from in 2017 based on a query we ran on Feb. 1st, 2018.   The WARC files associated with this crawl are not currently available to the general public.

812M 812M

Wikipedia Outlinks March 2016

27,594

ITEMS

812M

VIEWS

Mar 3, 2016 03/16

eye 812M

Crawl of outlinks from wikipedia.org started March, 2016. These files are currently not publicly accessible. Properties of this collection. It has been several years since the last time we did this. For this collection, several things were done: 1. Turned off duplicate detection. This collection will be complete, as there is a good chance we will share the data, and sharing data with pointers to random other collections, is a complex problem. 2. For the first time, did all the different wikis....

1.1B 1.1B

Wayback Indexes

554

ITEMS

1.1B

VIEWS

Apr 4, 2012 04/12

eye 1.1B

Wayback indexes. This data is currently not publicly accessible.

825.6M 826M

Wide Crawl started August 2013

21,932

ITEMS

825.6M

VIEWS

Jul 30, 2013 07/13

eye 825.6M

Web wide crawl with initial seedlist and crawler configuration from August 2013.

719.8M 720M

Wide Crawl started January 2012

30,373

ITEMS

719.8M

VIEWS

Dec 30, 2011 12/11

eye 719.8M

Web wide crawl with initial seedlist and crawler configuration from January 2012 using HQ software.

528.7M 529M

Wide Crawl started February 2014

9,806

ITEMS

528.7M

VIEWS

Nov 15, 2013 11/13

eye 528.7M

Web wide crawl with initial seedlist and crawler configuration from February 2014.

720.9M 721M

Survey Crawl Number 3 - Started Aug 1st, 2015 - Ended Feb 11th, 2016

10,137

ITEMS

720.9M

VIEWS

Jan 9, 2015 01/15

eye 720.9M

629.4M 629M

Wide Crawl started April 2012

39,279

ITEMS

629.4M

VIEWS

Mar 31, 2012 03/12

eye 629.4M

Web wide crawl with initial seedlist and crawler configuration from April 2012.

621.1M 621M

Survey Crawl Number 1 - Started May 17th, 2014 - Ended Dec 10th, 2014

6,909

ITEMS

621.1M

VIEWS

Apr 25, 2014 04/14

eye 621.1M

464.5M 465M

Wide Crawl started September 2012

22,423

ITEMS

464.5M

VIEWS

Aug 24, 2012 08/12

eye 464.5M

Web wide crawl with initial seedlist and crawler configuration from September 2012.

469.8M 470M

Wide Crawl Started January 2013

15,157

ITEMS

469.8M

VIEWS

Jan 1, 2013 01/13

eye 469.8M

Wide crawls of the Internet conducted by Internet Archive. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.

483.7M 484M

Wide Crawl started October 2010

15,839

ITEMS

483.7M

VIEWS

Oct 5, 2010 10/10

eye 483.7M

Web wide crawl with initial seedlist and crawler configuration from October 2010

444.4M 444M

Wide Crawl started October 2011

12,648

ITEMS

444.4M

VIEWS

Sep 30, 2011 09/11

eye 444.4M

Web wide crawl with initial seedlist and crawler configuration from March 2011 using HQ software.

155.2M 155M

Host Screen Captures

17,413

ITEMS

155.2M

VIEWS

Nov 10, 2012 11/12

eye 155.2M

Screen captures of hosts discovered during wide crawls. This data is currently not publicly accessible.

472.6M 473M

.com survey started January 2011

2,535

ITEMS

472.6M

VIEWS

Jan 20, 2011 01/11

eye 472.6M

Survey crawl of .com domains started January 2011.
Topic: webcrawl

411.2M 411M

Wide Crawl started March 2011

8,528

ITEMS

411.2M

VIEWS

Oct 5, 2010 10/10

eye 411.2M

Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi. What’s in the data set: Crawl start date: 09 March, 2011 Crawl end date: 23 December, 2011 Number of captures: 2,713,676,341 Number of unique URLs: 2,273,840,159 Number of hosts: 29,032,069 The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT)...

176.5M 177M

Survey Crawl Number 9

561

ITEMS

176.5M

VIEWS

Apr 20, 2019 04/19

eye 176.5M

8.9M 8.9M

Wide Crawl Number 18

4,264

ITEMS

8.9M

VIEWS

Aug 10, 2021 08/21

eye 8.9M

346.1M 346M

Survey Crawl Number 5: Oct 21st, 2016 to Sep 10th, 2017

3,888

ITEMS

346.1M

VIEWS

Jan 9, 2015 01/15

eye 346.1M

325.4M 325M

Wikipedia Outlinks February 2012

2,951

ITEMS

325.4M

VIEWS

Feb 3, 2012 02/12

eye 325.4M

Crawl of outlinks from wikipedia.org started February, 2012. These files are currently not publicly accessible.

225.7M 226M

International News Crawls

8,425

ITEMS

225.7M

VIEWS

Oct 4, 2010 10/10

eye 225.7M

Crawls of International News Sites

378.7M 379M

Around The World Crawl

2,150

ITEMS

378.7M

VIEWS

Jul 16, 2012 07/12

eye 378.7M

Data crawled by Sloan Foundation on behalf of Internet Archive

Live Web Proxy Crawls

36.4M 36M

liveweb-20180313231434

Mar 13, 2018 03/18

eye 36.4M

favorite 4

comment 1

( 1 reviews )

120.7M 121M

Collections News Crawls v3

6,102

ITEMS

120.7M

VIEWS

Nov 12, 2013 11/13

by [email protected]

eye 120.7M

Miscellaneous high-value news sites
Topics: World news, US news, news

158.5M 159M

Hackernews crawl number 0

6,164

ITEMS

158.5M

VIEWS

Mar 3, 2016 03/16

eye 158.5M

Hacker News Crawl of their links.

160.5M 160M

Wikipedia Outlinks May 2011

1,638

ITEMS

160.5M

VIEWS

Jul 11, 2011 07/11

eye 160.5M

Crawl of outlinks from wikipedia.org started May, 2011. These files are currently not publicly accessible.

157.9M 158M

Shallow Crawls

1,042

ITEMS

157.9M

VIEWS

Nov 17, 2012 11/12

eye 157.9M

Shallow crawls that collect content 1 level deep including embeds. This data is currently not publicly accessible.

137.3M 137M

Wayback CDX Shards

1,214

ITEMS

137.3M

VIEWS

Aug 21, 2013 08/13

eye 137.3M

CDX Index shards for the Wayback Machine. The Wayback Machine works by looking for historic URL's based on a query. This is done by searching an index of all the web objects (pages, images, etc) that have been archived over the years. This collection holds the index used for this purpose, which is broken up into 300 pieces so they fit into items more naturally and distribute the lookup load. Each of these 300 pieces is stored in at least 2 items, and then those are also stored on the backup...

111.7M 112M

Wikipedia Outlinks July 2011

1,011

ITEMS

111.7M

VIEWS

Jul 11, 2011 07/11

eye 111.7M

Crawl of outlinks from wikipedia.org started July, 2011. These files are currently not publicly accessible.

87.8M 88M

Geocities Closing Crawl

149

ITEMS

87.8M

VIEWS

Apr 4, 2012 04/12

eye 87.8M

Geocities crawl performed by Internet Archive. This data is currently not publicly accessible. from Wikipedia : Yahoo! GeoCities is a Web hosting service. GeoCities was originally founded by David Bohnett and John Rezner in late 1994 as Beverly Hills Internet (BHI), and by 1999 GeoCities was the third-most visited Web site on the World Wide Web. In its original form, site users selected a "city" in which to place their Web pages. The "cities" were metonymously named after...

107.8M 108M

End of Term Web Crawls

22,254

ITEMS

107.8M

VIEWS

Sep 13, 2012 09/12

by IA

eye 107.8M

This collection includes web crawls of the Federal Executive, Legislative, and Judicial branches of government performed at the end of US presidential terms of office.
Topics: web, end of term, US, federal government

76M 76M

Youtube Videos

662,763

ITEMS

76M

VIEWS

Jan 26, 2011 01/11

eye 76M

Captures of pages from YouTube. Currently these are discovered by searching for YouTube links on Twitter.
Topics: YouTube, Twitter, Video

73M 73M

COM Survey Crawl 2009-2010

729

ITEMS

73M

VIEWS

Apr 6, 2012 04/12

eye 73M

COM survey crawl data collected by Internet Archive in 2009-2010. This data is currently not publicly accessible.

Live Web Proxy Crawls

27.4M 27M

liveweb-20170906011614

Sep 6, 2017 09/17

eye 27.4M

favorite 0

comment 0

63.2M 63M

Shallow Crawl Started 2013

544

ITEMS

63.2M

VIEWS

Feb 19, 2013 02/13

eye 63.2M

Shallow crawl started 2013 that collects content 1 level deep, including embeds. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.

64M 64M

Shallow Crawl Started 2013

252

ITEMS

64M

VIEWS

Mar 5, 2013 03/13

eye 64M

Shallow crawl started 2013 that collects content 1 level deep, including embeds. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.

58.2M 58M

survey_net00000

300

ITEMS

58.2M

VIEWS

Nov 24, 2010 11/10

eye 58.2M

Survey crawl of .net domains started December 2010.
Topic: webcrawl

36.3M 36M

End of Term 2016 Web Crawls

19,083

ITEMS

36.3M

VIEWS

Nov 21, 2016 11/16

eye 36.3M

This collection contains web crawls performed as part of the End of Term Web Archive, a collaborative project that aims to preserve the U.S. federal government web presence at each change of administration. Content includes publicly-accessible government websites hosted on .gov, .mil, and relevant non-.gov domains, as well as government social media materials. The web archiving was performed in the Fall and Winter of 2016 and Spring of 2017. For more information, see...
Topics: end of term, federal government, 2016, president, congress, government data

46.3M 46M

nsdlweb

ITEMS

46.3M

VIEWS

Jul 12, 2012 07/12

eye 46.3M

this data is currently not publicly accessible.

37.6M 38M

Collections news crawls v2

820

ITEMS

37.6M

VIEWS

Dec 13, 2012 12/12

by [email protected]

eye 37.6M

Live Web Proxy Crawls

6.2M 6.2M

liveweb-20190719071056

Jul 19, 2019 07/19

eye 6.2M

favorite 0

comment 0

32M 32M

ORG Survey Crawls

191

ITEMS

32M

VIEWS

Jun 20, 2011 06/11

eye 32M

Survey of .org domains. This data is currently not publicly accessible.

36.2M 36M

International News Crawl started in September 2010

1,083

ITEMS

36.2M

VIEWS

Oct 4, 2010 10/10

eye 36.2M

Crawl of International News Sites with initial seedlist and crawler configuration from Sep 1, 2010.

Live Web Proxy Crawls

1.4M 1.4M

liveweb-20190114034752

Jan 14, 2019 01/19

eye 1.4M

favorite 0

comment 0

Live Web Proxy Crawls

3.7M 3.7M

liveweb-20141219115441

Dec 19, 2014 12/14

eye 3.7M

favorite 0

comment 0

Live Web Proxy Crawls

3.8M 3.8M

liveweb-20160731004441

Jul 31, 2016 07/16

eye 3.8M

favorite 1

comment 0

Live Web Proxy Crawls

3.8M 3.8M

liveweb-20160731032528

Jul 31, 2016 07/16

eye 3.8M

favorite 0

comment 0

Live Web Proxy Crawls

3.7M 3.7M

liveweb-20141219031006

Dec 19, 2014 12/14

eye 3.7M

favorite 0

comment 0

32M 32M

End of Term 2012 Web Crawls

2,383

ITEMS

32M

VIEWS

Sep 13, 2012 09/12

eye 32M

This collection contains web crawls performed on the US Federal Executive, Legislative & Judicial branches of government in 2012-2013.
Topics: end of term, US, Federal government, 2012, Obama

21.2M 21M

survey_net00001

170

ITEMS

21.2M

VIEWS

Dec 29, 2011 12/11

eye 21.2M

Survey crawl of .net domains started October 2011.
Topics: webwidecrawl, net

19.4M 19M

Edu & Gov Crawl, June 2010

704

ITEMS

19.4M

VIEWS

Jun 21, 2010 06/10

eye 19.4M

TEST COLLECTION: Crawl of .edu and .gov sites started in June 2010.
Topic: crawldata

12.6M 13M

2004 Election

178

ITEMS

12.6M

VIEWS

Apr 4, 2012 04/12

eye 12.6M

2004 Election crawl performed by Internet Archive. This data is currently not publicly accessible.

Live Web Proxy Crawls

1.7M 1.7M

liveweb-20201010054705

Oct 10, 2020 10/20

eye 1.7M

favorite 0

comment 0

Live Web Proxy Crawls

1.9M 1.9M

liveweb-20160727084029

Jul 27, 2016 07/16

eye 1.9M

favorite 0

comment 0

19.1M 19M

crawl_UNK

32,949

ITEMS

19.1M

VIEWS

Jan 26, 2011 01/11

eye 19.1M

Crawl data. This data is currently not publicly accessible.

11.3M 11M

End Of Term 2016 UNT Crawls

1,275

ITEMS

11.3M

VIEWS

Aug 28, 2017 08/17

eye 11.3M

End of Term 2016 Web Archive government web crawls by project partner the University of North Texas.
Topics: end of term, federal government, 2016, president, congress, university of north texas

16.3M 16M

End of Term 2008 California Digital Library Crawl

415

ITEMS

16.3M

VIEWS

Apr 3, 2012 04/12

eye 16.3M

End of term 2008 crawl data gathered by Internet Archive on behalf of the California Digital Library. This data is currently not publicly accessible.

13.4M 13M

End Of Term 2016 Pre-Inauguration Crawls

4,693

ITEMS

13.4M

VIEWS

Dec 15, 2016 12/16

eye 13.4M

This collection contains web crawls performed as the pre-inauguration crawl for part of the End of Term Web Archive, a collaborative project that aims to preserve the U.S. federal government web presence at each change of administration. Content includes publicly-accessible government websites hosted on .gov, .mil, and relevant non-.gov domains, as well as government social media materials. The web archiving was performed in the Fall and Winter of 2016 to capture websites prior to the January...
Topics: end of term, federal government, 2016, president, congress

MORE RESULTS
Fetching more results

DESCRIPTION

The Internet Archive discovers and captures web pages through many different web crawls.

At any given time several distinct crawls are running, some for months, and some every day or longer.

View the web archive through the Wayback Machine.

ACTIVITY

comment

Collection Info

Access-restricted: true

Addeddate: 2010-06-11 18:34:15

Collection: web

Identifier: webwidecrawl

Mediatype: collection

Public-format: Metadata
Symlink Instructions
Collection Header
JPEG
JPEG Thumb
PNG
Animated GIF
Item Tile

Publicdate: 2010-06-11 18:34:36

Show_hidden_subcollections: true

Subject: webwidecrawl

Title: Internet Archive Web Crawls

Created on

June 11
2010

ARossi
Archivist

ADDITIONAL CONTRIBUTORS

Wayback Machine Web Crawling
Archivist

VIEWS

Total Views 38,975,157,788 (Older Stats)

ITEMS

Total Items 1,602,417 (Older Stats)

TOP REGIONS (LAST 30 DAYS)

(data not available)

Internet Archive Web Crawls

Filters 1,607,577 RESULTS

Media TypeMedia Type

YearYear

Topics & SubjectsTopics & Subjects

CollectionCollection

CreatorCreator

LanguageLanguage

eye 16B

eye 9.5B

eye 9.7B

eye 1.7B

eye 2.2B

eye 1B

eye 1.9B

eye 1.1B

eye 1.3B

eye 1B

eye 1.2B

eye 1.1B

eye 1.2B

eye 1.3B

eye 1.3B

eye 888.6M

eye 1B

eye 794.7M

eye 812M

eye 1.1B

eye 825.6M

eye 719.8M

eye 528.7M

eye 720.9M

eye 629.4M

eye 621.1M

eye 464.5M

eye 469.8M

eye 483.7M

eye 444.4M

eye 155.2M

eye 472.6M

eye 411.2M

eye 176.5M

eye 8.9M

eye 346.1M

eye 325.4M

eye 225.7M

eye 378.7M

eye 36.4M

favorite 4

comment 1

eye 120.7M

eye 158.5M

eye 160.5M

eye 157.9M

eye 137.3M

eye 111.7M

eye 87.8M

eye 107.8M

eye 76M

eye 73M

eye 27.4M

favorite 0

comment 0

eye 63.2M

eye 64M

eye 58.2M

eye 36.3M

eye 46.3M

eye 37.6M

eye 6.2M

favorite 0

comment 0

eye 32M

eye 36.2M

eye 1.4M

favorite 0

comment 0

eye 3.7M

favorite 0

comment 0

1,607,577
RESULTS

Media Type

Year

Topics & Subjects

Collection

Creator

Language