Convergence of search engines & Weblogs
With about 3 billion pages in its database, Google is the most comprehensive search engine today. Crawling approximately 150 million pages a day, Google could visit all 3 billion pages in about 20 days.
In practice, however, this is not how it works. Microdoc News comments on the Grub and LookSmart initiative to build and operate a distributed crawler, and identifies yet another step toward the convergence of search engines and weblogs.
Grub: Distributed Crawling
Microdoc News registered and downloaded the Grub client. Microdoc News is one of 1,096 clients currently running, which together crawled 47,840,029 URLs in the last 24 hours. Microdoc News dedicated one computer on its network to Grub crawling. In the six hours so far, that computer has crawled 12,750 URLs: 14% of these had changed, 73% were identified as unchanged, and 5% were down.
Microdoc News can see the benefits of distributed computing and the power each additional computer adds to the network -- we disagree with jimlog 2.0, where it was argued that little would be gained:
The big problem with this project is the bandwidth needed. With a non-distributed crawler, the pages have to be downloaded to the main servers just once. With a distributed crawler, they have to be downloaded at the client, and then uploaded to the server. Uploading to the server from the client is the same as having the server download the page in the first place. So the work is doubled. While it's possible to reduce the size of the data uploaded to the central server by parsing the web pages, to build an effective search engine you need all the data, so the client can't reduce the size much. For example, Google keeps the entire page intact. [Search: By the People, For the People at jimlog 2.0]
For roughly 75% of the URLs it checked, Microdoc News did not have to communicate page content back to the central Grub servers. It would seem that each computer added to the network contributes at least a 70% increment in crawling power, with an overhead of about 30% that must be carried by the central Grub servers.
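The arithmetic behind that estimate can be sketched from the six-hour run reported above. This is a back-of-envelope calculation, not anything from Grub itself; the fractions are the ones Microdoc News observed (73% unchanged, 5% down), and "no upload" assumes those two categories require only a lightweight status report, not a page transfer.

```python
# Back-of-envelope estimate of how much traffic a client keeps off
# the central Grub servers, using the figures from the crawl above.
urls_checked = 12_750   # URLs crawled in six hours
unchanged = 0.73        # fraction identified as unchanged
down = 0.05             # fraction of URLs that were down

# Checks in these two categories need no page upload to the central servers,
# only a small status report.
no_upload = unchanged + down                  # 0.78
uploads = urls_checked * (1 - no_upload)      # pages that must still be sent

print(f"{no_upload:.0%} of checks needed no page upload; "
      f"only {uploads:.0f} of {urls_checked} pages reached the central servers")
```

On these assumptions, close to four out of five checks cost the central servers almost nothing, which is where the "at least 70% increment per client" figure comes from.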
Convergence of Weblogs and Search Engines
Weblog writing is a highly distributed activity, particularly when a weblogger hosts her/his weblog on a server other than a main hosting server. Grub is an example of crawling the web using a distributed model. Both are information management activities conducted using a distributed model.
Blogging, for its part, is a mechanism Google already uses to good effect: it helps identify which parts of the Internet readers of the web consider important, and therefore which pages should be crawled more often, as Peter Norvig suggests:
"I want more clues about which page to look at rather than another page. . . "It isn't a problem of computing resources but deciding what parts of the Web should be updated more frequently than others," he said. [Wired News: Building a Bigger Search Engine].
Crawling could move toward a more distributed model as well. One important step that has been glossed over so far is Grub's "localized crawling" feature: I, as both crawling agent and webmaster, place a grub.txt file in the root of each of my websites. This gives Grub a clear pointer to a website it did not have to locate itself -- the site is located for Grub by the distributed crawling agent, and it is a site the webmaster/crawling agent wants listed in this new search engine.
Now you could have blogger and crawling agent as one and the same person. Get the word out that the best way to get your site listed is to also become a crawling agent . . . what is the potential? Every webmaster, every blogger, every person who wants traffic then has an incentive to be a crawling agent.
The next step is to build each crawling client with a detection device on the machine that hosts it, to identify when new pages are added. Since the client most likely sits on a machine that is also hosting a web site or blog site, it becomes a distributed spy that detects when that site has pages needing an update.
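A minimal sketch of the kind of local change detector described above: hash each page, compare against a remembered manifest, and report what is new or changed. Everything here is hypothetical -- the `site` directory, the `manifest.json` file, and the function names are illustrative assumptions, not part of the Grub client.

```python
import hashlib
import json
from pathlib import Path

SITE_ROOT = Path("site")          # hypothetical: local root of the hosted website
MANIFEST = Path("manifest.json")  # hypothetical: where the client remembers past hashes

def hash_file(path):
    """Return a SHA-1 digest of a file's contents."""
    return hashlib.sha1(path.read_bytes()).hexdigest()

def detect_changes():
    """Compare current page hashes against the stored manifest.

    Returns (new, changed): lists of page paths relative to SITE_ROOT
    that would be reported to the scheduling machine."""
    old = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    current = {str(p.relative_to(SITE_ROOT)): hash_file(p)
               for p in SITE_ROOT.rglob("*.html")}
    new = [p for p in current if p not in old]
    changed = [p for p in current if p in old and current[p] != old[p]]
    MANIFEST.write_text(json.dumps(current))   # remember state for next run
    return new, changed
```

Because the client lives on the same machine as the site, this check costs no network traffic at all; only the short list of new and changed pages would ever be sent upstream.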
To ensure that a webmaster does not game the results to get a better listing in the distributed system than otherwise, other bloggers' sites need to provide the information used to schedule the updating of my site. My client indicates there are new pages; the scheduling machine lists those pages to be crawled soon; and, according to the number of links pointing to the site containing each page, the pages are queued in order of link importance.
Now this is all conjecture and pulling ideas out of a hat, and some of it may never happen. This space, however, is one to watch. Convergence is happening now, and the next months promise to bring exciting new things.
Source: Microdoc News