The universe of content on the Web that could be indexed, in theory, by standard search engines is known as the “publicly indexable Web.” The publicly indexable Web is limited to those pages that are accessible by following a link from another Web page that is recognized by a search engine. This limitation exists because the online indexing techniques used by popular search engines and directories, such as Yahoo, Lycos and AltaVista, are based on “spidering” technology, which finds sites to index by following links from site to site in a continuous search for new content. If a Web page or site is not linked to by others, then spidering will not discover that page or site.
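By way of illustration only, the link-following process described above can be sketched as a breadth-first traversal of a link graph. The sketch below is a simplified model and not the method of any particular search engine; the page names, the `LINKS` table, and the `spider` function are all hypothetical. It shows why a page that no recognized page links to is never discovered.

```python
from collections import deque

# Hypothetical link graph: each key is a page, each value lists the pages it links to.
LINKS = {
    "portal.example/home": ["news.example/front", "shop.example/catalog"],
    "news.example/front": ["news.example/archive", "portal.example/home"],
    "news.example/archive": [],
    "shop.example/catalog": ["shop.example/item-1"],
    "shop.example/item-1": [],
    # This page links out, but no other page links to it.
    "hidden.example/offer": ["shop.example/catalog"],
}


def spider(seeds, links):
    """Return every page reachable from the seed pages by following links."""
    discovered = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in discovered:
                discovered.add(target)
                queue.append(target)
    return discovered


# Start from a single page already known to the (hypothetical) search engine.
indexed = spider(["portal.example/home"], LINKS)
unreached = sorted(set(LINKS) - indexed)

print(sorted(indexed))
print(unreached)
```

In this model, `hidden.example/offer` remains outside the discovered set even though it is publicly accessible to anyone who is given its address, because the traversal only ever follows links outward from pages it already knows.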
Furthermore, many larger Web sites contain instructions, through software, that prevent spiders from investigating that site, and therefore the contents of such sites also cannot be indexed using spidering technology. Because of the vast size and decentralized structure of the Web, no search engine or directory indexes all of the content on the publicly indexable Web. We credit current estimates that no more than 50% of the content currently on the publicly indexable Web has been indexed by all search engines and directories combined. No currently available method or combination of methods for collecting URLs can collect all of the URLs on the Web.
The portion of the Web that is not theoretically indexable through the use of “spidering” technology, because other Web pages do not link to it, is called the “Deep Web.” Such sites or pages can still be made publicly accessible without being made publicly indexable by, for example, using individual or mass emailings (also known as “spam”) to distribute the URL to potential readers or customers, or by using types of Web links that cannot be found by spiders but can be seen and used by readers. “Spamming” is a common method of distributing to potential customers links to sexually explicit content that is not indexable.
Because the Web is decentralized, it is impossible to say exactly how large it is. A 2000 study estimated a total of 7.1 million unique Web sites, which, at the Web’s historical rate of growth, would have increased to 11 million unique sites as of September 2001. Estimates of the total number of Web pages vary, but a figure of 2 billion is a reasonable estimate of the number of Web pages that can be reached, in theory, by standard search engines. We need not make a specific finding as to a figure, for by any measure the Web is extremely vast, and it is constantly growing. The indexable Web is growing at a rate of approximately 1.5 million pages per day.
The size of the un-indexable Web, or the “Deep Web,” while impossible to determine precisely, is estimated to be two to ten times that of the publicly indexable Web.