Refining Internet Searches

27 - Refining Internet Searches, Part 2

Last issue we reviewed searching in the proprietary databases that Yahoo and other portals use. This time we’ll track feral pages out on the Wild Wild Web itself.

Proprietary information stored in databases is organized as it’s stored, and it is therefore much easier to deal with than the chaos of the Web. When search engines check the full Web they must scan hundreds of millions of pages, many without proper titles or other useful clues as to what they contain. Searching the Web therefore calls for methods that are similar to, but slightly different from, those used for checking a database.

How can you tell whether your search is restricted to a proprietary database? Look in your search engine’s advanced features. For example, Yahoo defaults to its own database but you can click a check box in the advanced area to get out into the Web.

Why would you want to search the entire Web instead of a nice tidy database?

Several portals like Yahoo, MSN, and Alta Vista now charge a fee to register certain sites in their databases. There were already millions of unregistered sites out there because the portals simply couldn’t keep up with the Web’s growth, and now more pages will be missing because of the registration fee. What does this mean to you? When you search one of these proprietary databases, it may have fewer sites to choose from and you may see more business ad sites instead of real Web pages. If you can’t find what you’re looking for in the databases, it’s worth trying the Web simply because there are many more options there.

Once you’re out in the Web itself it helps to know a little about how search engines work. A good search engine will sort through millions of sites that mention the subject you want and then display only the best choices at the top of a list. How does your engine decide which URLs to present to you? It uses a system called relevance ranking.

Relevance ranking works like this. Each engine uses its own algorithm or formula to determine relevance and then displays results based on that calculation. Let’s use a made-up example to show how this might work. Suppose we’re looking for the word "astrolabe". Here’s our pretend formula:

(M x 15) + (T x 10) + (P x 3) = R

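where M is the number of times the word appears in the page's <metatag> code, T is the number of times it appears in the <title>, P is the number of times it appears in the text of the page, and R is the page's rank. (Metatags and HTML code were covered in previous articles.)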
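As a minimal sketch, the pretend formula can be written as a short Python function. The function and constant names are invented for illustration, and M, T, and P are read here as counts of how often the word appears in each part of the page. This is only the made-up formula from this article, not the formula of any real search engine.

```python
# Weights from the made-up formula (M x 15) + (T x 10) + (P x 3) = R.
META_WEIGHT = 15
TITLE_WEIGHT = 10
PAGE_WEIGHT = 3


def relevance_score(meta_hits: int, title_hits: int, page_hits: int) -> int:
    """Return R for one page, given how often the search word appears
    in its metatags (M), its title (T), and its page text (P)."""
    return (meta_hits * META_WEIGHT
            + title_hits * TITLE_WEIGHT
            + page_hits * PAGE_WEIGHT)


# A hypothetical page that uses the word once in each place.
print(relevance_score(1, 1, 1))  # 15 + 10 + 3 = 28
```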

Metatags (written in HTML as <meta> tags) are specifically designed to convey information to browsers and search engines. A word listed in the <metatag> portion of a page carries much more weight than it would anywhere else on a page. The <title> section of a page should usually have a strong correlation with the subject of the page and so it's a strong second. Words in the text portion of a page are the least useful because they are often unrelated to the page's main subject. For example, astrolabe, a nautical instrument, may appear in a Web site about the comedy group the Firesign Theatre because it's used on one of their albums.
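To make the three locations concrete, here is a rough sketch of how a program might count a word in each of them, using Python's standard html.parser module. The class name, the sample page, and the choice to look only at "keywords" and "description" metatags are assumptions for illustration; a real engine would do far more, such as skipping script code.

```python
import re
from html.parser import HTMLParser


class WordCounter(HTMLParser):
    """Count one word in metatag content, the title, and all other text."""

    def __init__(self, word: str) -> None:
        super().__init__()
        self.word = word.lower()
        self.in_title = False
        self.meta_hits = 0
        self.title_hits = 0
        self.page_hits = 0

    def _count(self, text: str) -> int:
        # Whole-word, case-insensitive match.
        return re.findall(r"\w+", text.lower()).count(self.word)

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta_hits += self._count(attributes.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_hits += self._count(data)
        else:
            self.page_hits += self._count(data)


# A tiny invented page, used only to exercise the counter.
SAMPLE = """
<html><head>
<title>Bob's Naval</title>
<meta name="keywords" content="astrolabe, sextant, compass">
</head><body>
<p>We sell a brass astrolabe. Each astrolabe ships free.</p>
</body></html>
"""

counter = WordCounter("astrolabe")
counter.feed(SAMPLE)
print(counter.meta_hits, counter.title_hits, counter.page_hits)  # 1 0 2
```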

Assume that your search engine finds a site named Bob’s Naval that uses the word astrolabe in its <metatag>. The engine multiplies 1 x 15 = 15. Suppose that Bob’s page does not use astrolabe in its <title> code. That produces a 0 x 10 = 0. But Bob’s page does mention astrolabe 8 times in its text. The engine multiplies 8 x 3 = 24.

After all the multiplication is completed the engine sums the results, in this case 15 + 0 + 24 = 39. As each new page is processed, the search engine stores the results in a file. Larger numbers are at the top and smaller numbers at the bottom, so Bob's Naval with its 39 will be listed above Sam’s Tales of the Sea, a site with a score of only 27. Once the search is completely finished, the search engine displays the top URLs for you to select from.
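The arithmetic and the sorting step can be sketched in a few lines of Python. The function name is invented for illustration. Bob's counts come from the example above; Sam's score of 27 is entered directly because the article does not break it down.

```python
def relevance_score(meta_hits: int, title_hits: int, page_hits: int) -> int:
    """The pretend formula: (M x 15) + (T x 10) + (P x 3) = R."""
    return meta_hits * 15 + title_hits * 10 + page_hits * 3


results = [
    ("Sam's Tales of the Sea", 27),             # score given in the text
    ("Bob's Naval", relevance_score(1, 0, 8)),  # 15 + 0 + 24
]

# Larger scores go to the top of the list.
ranked = sorted(results, key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(score, name)
# 39 Bob's Naval
# 27 Sam's Tales of the Sea
```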

When you get baffling results back from a search, it may not be the fault of the search engine. For example, an online business that sells nautical instruments can trick search engines into placing it at the top of your astrolabe search by putting that word in its <metatag> code dozens of times. An inexperienced Web developer who has a great site devoted entirely to astrolabes may not know about the value of <metatag> and <title> and could end up way down at the bottom of the list. There's no way to completely avoid this problem, but you can try several different searches using various words related to the subject you're looking for and hope for the best.
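Plugging some numbers into the pretend formula shows how lopsided this can get. All of the counts below are hypothetical, chosen only to illustrate the point: a store that repeats the word two dozen times in its metatags, and an enthusiast's site that mentions astrolabes 60 times in its text but never in a metatag or title.

```python
def relevance_score(meta_hits: int, title_hits: int, page_hits: int) -> int:
    """The pretend formula: (M x 15) + (T x 10) + (P x 3) = R."""
    return meta_hits * 15 + title_hits * 10 + page_hits * 3


# Hypothetical store: word stuffed into metatags 24 times,
# once in the title, and 5 times in the page text.
store_score = relevance_score(24, 1, 5)

# Hypothetical enthusiast site: no metatags or title, 60 mentions in text.
hobby_score = relevance_score(0, 0, 60)

print(store_score)  # 385
print(hobby_score)  # 180
print(store_score > hobby_score)  # True
```

Under these assumed counts the store scores more than twice as high even though the enthusiast's site has far more to say about astrolabes.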