Monday, August 21, 2006

Google Search Result - Number of pages for any search

I was playing with Google search and found some interesting things about the maximum number of results displayed for any search.

Google displays a maximum of 1000 results, i.e. 100 pages, for any search. Even that is not a one-step process. It first displays fewer than 100 pages of results (somewhere in the 70s, 80s or 90s) for the given query, and on the last page it says "Click here to see omitted results". When the user clicks that, it shows the omitted pages, but it still won't include everything from their repository.

For example, if you search for the word "binary", it displays the number of results for the query and which of them it is currently showing, i.e.

Results 1 - 10 of about 207,000,000 for binary

It gives the impression that it will show every page in which it has found the word "binary". They do have all those web pages in their repository, but the number of occurrences of a particular word shown here is presumably fetched from some kind of dictionary data structure.
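To make the "dictionary data structure" idea concrete, here is a minimal sketch of an inverted index that maps each word to the documents containing it, so a hit count can be read off without scanning any pages. This is only an illustration under my own assumptions: the names build_index and estimated_hits and the three sample documents are made up, and Google's real structures are certainly far more elaborate.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def estimated_hits(index, word):
    """Number of documents containing the word, via a single lookup."""
    return len(index.get(word.lower(), ()))


# Hypothetical toy corpus, purely for illustration.
docs = {
    1: "binary search tree",
    2: "binary file format",
    3: "plain text file",
}
index = build_index(docs)
print(estimated_hits(index, "binary"))
```

The count comes from one dictionary lookup, which is why a figure like "about 207,000,000" can be shown instantly even though only a small slice of those pages is ever listed.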

If you are patient enough and don't get tired of clicking hyperlinks, follow the results to the last record and you should get something like this for any search.

Now if you click the omitted results link, it takes you back to the first page, ordered by relevance, i.e. page ranking, for the given search. If you then keep clicking Next, you will eventually hit the following:

As can be seen from the above image, it shows a maximum of 1000 results only.
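The arithmetic behind the cap can be sketched in a few lines. This assumes 10 results per page and a hard limit of 1000 results, as observed above; the names MAX_RESULTS, PAGE_SIZE, last_page and page_slice are mine and say nothing about how Google actually implements it.

```python
import math

MAX_RESULTS = 1000  # observed cap on listed results (assumption from this post)
PAGE_SIZE = 10      # results per page (assumption from this post)


def last_page(total_matches):
    """Highest page number that can be shown for a given match count."""
    shown = min(total_matches, MAX_RESULTS)
    return math.ceil(shown / PAGE_SIZE)


def page_slice(ranked, page):
    """Results for a 1-based page number, never going past the cap."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * PAGE_SIZE
    end = min(start + PAGE_SIZE, MAX_RESULTS, len(ranked))
    return ranked[start:end]  # empty list once start is past the cap


print(last_page(207_000_000))  # 100
```

So however large the estimated count is, only the top-ranked 1000 entries ever need to be materialised for paging.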

A few things to note:

  • It relies on the fact that if the user is not able to find what he is looking for in the first 100 pages, then there is no need to show all the results.

  • Google has big distributed file systems along with Bigtable, BigFiles, MapReduce et al., and records are fetched quickly using these components. They keep a huge dictionary, an n-gram database and many other data structures to give users results quickly through their page ranking and relevance system. All of these are used together to show the first 1000 results for any query, which makes it really fast.

  • They cache every result and query, and since Google is used by gazillions of people, a good share of those data structures can be served straight from cache.

  • They have plenty of data, so the only thing left is to use this huge amount of data judiciously and show results from different perspectives. They have categorized the data into different buckets like jobs, base, health, school and so on.

There are some really good documents available on the net about the internals of their infrastructure, i.e. Google File System, Bigtable, BigFiles, MapReduce and the first paper in which they described how Google actually works. So if you are interested, google for these documents and you will pick up some knowledge of their infrastructure.

Friday, August 18, 2006

Google Maps Flight Simulator

Today I came across this really cool hack for Google Maps -> Google Maps Flight Simulator.

It allows you to pick the city you want to take off from and then fly your plane with the arrow keys. It also has a link to add your own city to it.

Enjoy and play with it.

Saturday, August 12, 2006

Snap - Another Search Engine

Snap is another search engine in this heavily crowded search engine market. It claims that what the market offers is not good, or not what users want to see, and that they are the ones who have understood what users want and provide them with fast, accurate and relevant results.

They mention on their site that all the big search engines use a Text-In Text-Out method for searching, and that the end user usually ends up with the following experience:

1. Type a keyword into a search box and hit enter
2. Wait a bit, then
3. get a list of 20 or so text results split into non-paid and sponsored links
4. Proceed to decipher each link and excerpt (does this site have what I’m looking for? Is it spam?)
5. Click on a link and go to the site. If you’re lucky, that’s it.
6. But most of us aren’t lucky, so you go back to your results page and begin the process again.
7. And repeat as often as needed for as long as your patience can muster. (Most searchers don’t get off the first page of results, even though we’re comforted that there are millions of results for some of our queries…)
8. If this doesn’t yield the result you’re looking for, you probably go back to the original query and try a new modification.
9. And repeat the process again as needed.

And their search engine is different in the following ways:

  • Quick visual display of a results preview. (An ajax request on mouse over, on key down or on select fetches the data into the right pane.)

  • Activity anticipating the user's intent. (Autocomplete by sending an ajax request on key up.)

  • Direct interactivity with your search (As search result and web page contents are in two panes this can be achieved without much hassle)

  • Better relevance through successful past results. (Implementing a voting and ranking system that takes user input.)
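As a rough idea of what the server side of such an autocomplete could look like, here is a small sketch of prefix lookup over a sorted term list using binary search. It is my own guess at one possible approach, not Snap's implementation: the function name suggest, the limit parameter and the sample terms are all hypothetical.

```python
import bisect


def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms starting with `prefix`.

    `sorted_terms` must already be sorted; binary search finds the first
    candidate, and matching terms are contiguous from that position.
    """
    matches = []
    i = bisect.bisect_left(sorted_terms, prefix)
    while i < len(sorted_terms) and len(matches) < limit:
        if not sorted_terms[i].startswith(prefix):
            break
        matches.append(sorted_terms[i])
        i += 1
    return matches


# Hypothetical term list; a real engine would draw this from query logs.
terms = sorted(["binary", "bing", "bind", "google", "binder"])
print(suggest(terms, "bin", limit=3))
```

Each key up would send the current prefix to something like suggest and render whatever comes back, which is cheap per call but multiplies the number of requests the servers have to handle.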

Well, I don't feel that providing such features is enough to deliver scalability, relevant results and user satisfaction. The first thing they need to think about is providing relevant results using techniques like page ranking, content filtering, anchor text etc. Secondly, and most importantly, they need to think about scalability. The big search companies have enough money and infrastructure in place, which has allowed them to keep their search engines scalable over time.

I would love to see how they grow from where they are now. All the best to them.

Monday, August 07, 2006

Listen to PDF documents rather than reading them

Sometimes you want to listen to a document while you are busy doing something else. This is possible with Adobe Acrobat Reader (I have tested the following shortcuts with version 7.* and they work fine):

Ctrl+Shift+B - to hear the entire document
Ctrl+Shift+V - to hear the current page
Ctrl+Shift+C - to pause or resume
Ctrl+Shift+E - to stop

Well, the reading is not great, but it does the job if you want something humming along while you perform some other task. Nice way of multitasking :)
The reading is more like the running commentary we hear in some sports.

Software Configuration Management - All about source control

In my opinion, every software developer and every computer science or engineering student who writes software should know about Software Configuration Management, or SCM.

It helps you maintain the code base of each program over a period of time. You can always go back to an older version or see how you have refactored your source code over time. It also helps in learning agile development methodology, which is essential to writing great software products.

Eric Sink is writing an online book about source control.

Some of the SCM tools available for free are
1) Subversion
2) Perforce

I wouldn't recommend using VSS even if you get it for free.

Another SCM tool which I have used for a while is ClearCase by IBM (formerly Rational).