Monday, August 21, 2006

Google Search Result - Number of pages for any search

I was playing with Google search and found some interesting things about the maximum number of results displayed for any search.

Google displays at most 1000 results, i.e. 100 pages at 10 results per page, for any search. Even that is not a one-step process. At first it shows fewer than 100 pages (somewhere in the 70s, 80s or 90s) for the given query, and the last page says "Click here to see omitted results". When the user clicks that link, it shows the omitted pages, but it still won't include everything from their repository.

For example, if you search for the word "binary", the results page shows the estimated number of results for the query and which of them it is currently showing, i.e.

Results 1 - 10 of about 207,000,000 for binary

This gives the impression that it will show every page where it has found the word "binary". They do have all of those web pages in their repository. But the count shown here, the number of occurrences of a particular word, is fetched using some kind of dictionary data structure.
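To make the dictionary idea concrete, here is a minimal sketch of a toy inverted index in Python. It is only an illustration under my own assumptions, not how Google actually stores its data; the names `build_index` and `result_count` and the three sample documents are made up. The point is that the result count can come from a single lookup, without reading any of the matching pages.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def result_count(index, word):
    """Count matching documents with one dictionary lookup; no document is read."""
    return len(index.get(word.lower(), ()))

# Made-up sample documents, for illustration only.
docs = {
    1: "binary search tree",
    2: "binary file format",
    3: "text file format",
}
index = build_index(docs)
print(result_count(index, "binary"))
```

A real search engine would keep far more per word (positions, weights, and so on), but the count itself is cheap once such a structure exists.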

If you are patient enough and don't get tired of clicking hyperlinks, follow the results to the last page and you should get something like this for any search.



Now if you click the omitted results link, it takes you back to the first page, ordered by relevance, i.e. page ranking, for the given search. If you keep clicking Next again, you eventually hit the following:



As can be seen from the above image, it shows a maximum of 1000 results only.
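The arithmetic behind the cap is simple enough to sketch. The snippet below is only my illustration of the behaviour described in this post: the constants come from what I observed (10 results per page, 1000 results at most), and the function name `page_window` is hypothetical.

```python
RESULTS_PER_PAGE = 10   # page size seen in "Results 1 - 10"
MAX_RESULTS = 1000      # cap observed in this post

def page_window(page, estimated_total):
    """Return (first, last) result numbers for a 1-based page, or None past the cap."""
    reachable = min(estimated_total, MAX_RESULTS)
    first = (page - 1) * RESULTS_PER_PAGE + 1
    if page < 1 or first > reachable:
        return None
    last = min(first + RESULTS_PER_PAGE - 1, reachable)
    return first, last

print(page_window(1, 207_000_000))
print(page_window(100, 207_000_000))
print(page_window(101, 207_000_000))
```

So even with about 207,000,000 estimated matches, page 100 is the last one that can exist, and page 101 has nothing to show.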

A few things to note:


  • It relies on the fact that if the user cannot find what he is looking for in the first 100 pages, then there is no need to show all the results.

  • Google does have big distributed file systems, with Bigtable, Big Files, MapReduce et al., and using these components records are fetched quickly. They keep a huge dictionary, an n-gram database and many other data structures to give users results quickly using their page ranking and relevance system. All of these are used together to show only the first 1000 results for any query, which makes it really fast.

  • They cache every result and query, and since Google is used by gazillions of people, they can keep a good amount of those data structures in cache.

  • They have plenty of data, so the only thing left is using this huge data judiciously and showing results from different perspectives. They have categorized the data into different buckets like jobs, base, health, school and so on.
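Two of the points above, keeping only the best 1000 results and caching repeated queries, can be sketched together in a few lines of Python. This is a guess at the general idea, not Google's implementation: the `score` function is a made-up stand-in for a real relevance score, and `top_results` and `corpus_size` are hypothetical names.

```python
import heapq
from functools import lru_cache

MAX_RESULTS = 1000  # cap observed in this post

def score(query, doc_id):
    """Made-up stand-in for a relevance score; not a real ranking function."""
    return (doc_id * 7919 + len(query)) % 10007

@lru_cache(maxsize=1024)
def top_results(query, corpus_size=50000):
    """Return the ids of the MAX_RESULTS best-scoring documents, best first."""
    best = heapq.nlargest(MAX_RESULTS, range(corpus_size),
                          key=lambda doc_id: score(query, doc_id))
    return tuple(best)

first = top_results("binary")
second = top_results("binary")   # same query again: answered from the cache
print(len(first), top_results.cache_info().hits)
```

Selecting only the top 1000 means the full ordering of all matches is never needed, and a popular query asked a second time costs almost nothing.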




There are some really good documents available on the net about the internals of their systems, i.e. Google File System, Bigtable, Big Files, MapReduce, and the first paper in which they described how Google actually works. So if you are interested, google for these documents and you will be able to gain some knowledge of their infrastructure.
