Add to this the fact that the Web lacks the bibliographic control standards we take for granted in the print world: there is no equivalent to the ISBN to uniquely identify a document; no standard system of cataloguing or classification, analogous to those developed by the Library of Congress; and no central catalogue of the Web's holdings. In fact, many, if not most, Web documents lack even the name of the author and the date of publication.
Imagine you are searching for information in the world's largest library, where the books and journals (stripped of their covers and title pages) are shelved in no particular order, and without reference to a central catalogue. A researcher's nightmare? Without question. The World Wide Web defined? Not exactly. Instead of a central catalogue, the Web offers the choice of dozens of different search tools, each with its own database, command language, search capabilities, and method of displaying results.
Given the above, it is clear that you need to familiarize yourself with a variety of search tools and develop effective search techniques if you hope to take advantage of the resources offered by the Web without spending many fruitless hours flailing about, and eventually drowning, in a sea of irrelevant information.
Search engines allow the user to enter keywords that are run against a database (most often created automatically, by "spiders" or "robots"). Based on a combination of criteria (established by the user and/or the search engine), the search engine retrieves WWW documents from its database that match the keywords entered by the searcher. It is important to note that when you are using a search engine you are not searching the Internet "live", as it exists at this very moment. Rather, you are searching a fixed database that was compiled some time before your search.
While all search engines are intended to perform the same task, each goes about this task in a different way, which leads to sometimes amazingly different results. Factors that influence results include the size of the database, the frequency of updating, and the search capabilities. Search engines also differ in their search speed, the design of the search interface, the way in which they display results, and the amount of help they offer.
In most cases, search engines are best used to locate a specific piece of information, such as a known document, an image, or a computer program, rather than a general subject.
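To make the idea of searching a prebuilt database rather than the live Web concrete, here is a minimal sketch in Python. It assumes a toy snapshot of three made-up pages (the URLs and text are hypothetical) and stores them in a simple inverted index. This is only one plausible way to structure such a database; nothing here describes how any particular search engine actually works.

```python
from collections import defaultdict

# Hypothetical snapshot gathered by a "spider" at some earlier time.
CRAWLED_PAGES = {
    "http://example.com/a": "Hurricane Floyd forecast and storm track",
    "http://example.com/b": "Growing Echinacea purpurea in the garden",
    "http://example.com/c": "Storm damage reports after Hurricane Floyd",
}

def build_index(pages):
    """Map each lower-cased word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs containing every query word.

    The query is answered from the stored index only; pages changed or
    created after the snapshot was taken cannot appear in the results.
    """
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

INDEX = build_index(CRAWLED_PAGES)
print(sorted(search(INDEX, "hurricane floyd")))
```

Because the index is built once and queried many times, how often it is rebuilt determines how current the results are.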
Examples of search engines include:
The growth in the number of search engines has led to the creation of "meta" search tools, often referred to as multi-threaded search engines. These search engines allow the user to search multiple databases simultaneously, via a single interface. While they do not offer the same level of control over the search interface and search logic as do individual search engines, most of the multi-threaded engines are very fast. Recently, the capabilities of meta-tools have been improved to include such useful features as the ability to sort results by site, by type of resource, or by domain, the ability to select which search engines to include, and the ability to modify results. These modifications have greatly increased the effectiveness and utility of the meta-tools.
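As a rough illustration of what a meta search tool does with the answers it collects, the Python sketch below merges ranked result lists from several engines, removes duplicates, and groups the merged list by site. The engine names, URLs, and the ordering rule (pages found by more engines first) are all assumptions made for the example; a real meta tool would also send the queries to the engines at the same time, which is omitted here.

```python
from urllib.parse import urlparse

def merge_results(results_by_engine):
    """Merge ranked URL lists from several engines, dropping duplicates.

    URLs returned by more engines sort first; ties are broken by the best
    (lowest) rank seen, then alphabetically.
    """
    seen = {}  # url -> [number of engines, best rank]
    for urls in results_by_engine.values():
        for rank, url in enumerate(urls):
            entry = seen.setdefault(url, [0, rank])
            entry[0] += 1
            entry[1] = min(entry[1], rank)
    return sorted(seen, key=lambda u: (-seen[u][0], seen[u][1], u))

def group_by_site(urls):
    """Group URLs by host name, preserving the incoming order."""
    groups = {}
    for url in urls:
        groups.setdefault(urlparse(url).netloc, []).append(url)
    return groups

# Hypothetical result lists standing in for real engine responses.
SAMPLE = {
    "engine_a": ["http://x.example/1", "http://y.example/1"],
    "engine_b": ["http://y.example/1", "http://x.example/2"],
}
MERGED = merge_results(SAMPLE)
print(MERGED)
print(group_by_site(MERGED))
```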
Popular multi-threaded search engines include:
Subject-specific search engines do not attempt to index the entire Web. Instead, they focus on searching for Web sites or pages within a defined subject area, geographical area, or type of resource. Because these specialized search engines aim for depth of coverage within a single area, rather than breadth of coverage across subjects, they are often able to index documents that are not included even in the largest search engine databases. For this reason, they offer a useful starting point for certain searches. The table below lists some of the subject-specific search engines by category. For a more comprehensive list of subject-specific search engines, see one of the following directories of search tools:
Regional (Canada) | Regional (Other) |
Companies |
People (E-mail addresses) | People (Postal addresses & telephone numbers) |
Images | Jobs |
Games | Software |
Health/Medicine | Education/Children's Sites |
Subject directories are hierarchically organized indexes of subject categories that allow the Web searcher to browse through lists of Web sites by subject in search of relevant information. They are compiled and maintained by humans and many include a search engine for searching their own database.
Subject directory databases tend to be smaller than those of the search engines, which means that result lists tend to be smaller as well. However, there are other differences between search engines and subject directories that can lead to the latter producing more relevant results. For example, while a search engine typically indexes every page of a given Web site, a subject directory is more likely to provide a link only to the site's home page. Furthermore, because their maintenance includes human intervention, subject directories greatly reduce the probability of retrieving results out of context.
Because subject directories are arranged by category and because they usually return links to the top level of a web site rather than to individual pages, they lend themselves best to searching for information about a general subject, rather than for a specific piece of information.
Examples of subject directories include:
Specialized subject directories
Due to the Web's immense size and constant transformation, keeping up with important sites in all subject areas is humanly impossible. Therefore, a guide compiled by a subject specialist to important resources in his or her area of expertise is more likely than a general subject directory to produce relevant information and is usually more comprehensive than a general guide. Such guides exist for virtually every topic. For example, Voice of the Shuttle (http://vos.ucsb.edu) provides an excellent starting point for humanities research. Film buffs should consider starting their search with the Internet Movie Database (http://us.imdb.com).
Just as multi-threaded search engines attempt to provide simultaneous access to a number of different search engines, some web sites act as collections or clearinghouses of specialized subject directories. Many of these sites offer reviews and annotations of the subject directories included and most work on the principle of allowing subject experts to maintain the individual subject directories. Some clearinghouses maintain the specialized guides on their own web site while others link to guides located at various remote sites.
Examples of clearinghouses include:
Search logic refers to the way in which you, and the search engine you are using, combine your search terms. For example, the search I Love Cricket could be interpreted as a search for any of the three search terms, all of the search terms, or the exact phrase. Depending on the logic applied, the results of each of the three searches would differ greatly. All search engines have some default method of combining terms, but their documentation does not always make it easy to ascertain which method is in use. Reading online Help and experimenting with different combinations of words can both help in this regard. Most search engines also allow the searcher to modify the default search logic, either with the use of pull-down menus or special operators, such as the + sign to require that a search term be present and the - sign to exclude a term from a search.
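The three interpretations of the search I Love Cricket can be sketched in a few lines of Python. The sample documents are invented, and the matching is deliberately naive (whole words, case ignored); it is meant only to show how much the result list changes with the logic applied.

```python
def matches(text, query, mode):
    """Test one document against a query under 'any', 'all' or 'phrase' logic."""
    doc_words = text.lower().split()
    terms = query.lower().split()
    if mode == "any":
        return any(t in doc_words for t in terms)
    if mode == "all":
        return all(t in doc_words for t in terms)
    if mode == "phrase":
        n = len(terms)
        # Slide a window of n words across the document.
        return any(doc_words[i:i + n] == terms
                   for i in range(len(doc_words) - n + 1))
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical documents.
DOCS = [
    "i love cricket and tennis",
    "cricket is a sport i love",
    "a cricket chirped all night",
]

for mode in ("any", "all", "phrase"):
    print(mode, [d for d in DOCS if matches(d, "I Love Cricket", mode)])
```

With these documents, "any" retrieves all three, "all" retrieves the first two, and "phrase" retrieves only the first.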
Boolean logic is the term used to describe certain logical operations that are used to combine search terms in many databases. The basic Boolean operators are represented by the words AND, OR and NOT. Variations on these operators, sometimes called proximity operators, that are supported by some search engines include ADJACENT, NEAR and FOLLOWED BY. Whether or not a search engine supports Boolean logic, and the way in which it implements it, is another important consideration when selecting a search tool. The following diagrams illustrate the basic Boolean operations.
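AND | Retrieves only documents that contain both search terms. |
OR | Retrieves documents that contain either search term, or both. |
NOT | Retrieves documents that contain the first search term but not the second. |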
Boolean operators are most useful for complex searches, while the + and - operators are often adequate for simple searches.
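The basic Boolean operators correspond directly to set operations, which the following Python sketch illustrates. The document IDs and term names are hypothetical, and the plus_minus helper is only one assumed reading of how the + and - shorthand relates to AND and NOT.

```python
# Hypothetical sets of document IDs that contain each term.
cats = {1, 2, 3, 4}
dogs = {3, 4, 5}

print(sorted(cats & dogs))  # cats AND dogs: documents containing both terms
print(sorted(cats | dogs))  # cats OR dogs: documents containing either term
print(sorted(cats - dogs))  # cats NOT dogs: documents with cats but not dogs

def plus_minus(required, excluded, all_docs):
    """Keep documents found in every required set and in no excluded set."""
    result = set(all_docs)
    for term_docs in required:
        result &= term_docs
    for term_docs in excluded:
        result -= term_docs
    return result

# Equivalent of the query: +cats -dogs
print(sorted(plus_minus([cats], [dogs], cats | dogs)))
```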
Ctrl-F: After following a link to a document retrieved with a search engine, it is sometimes not immediately apparent why the document has been retrieved. This may be because the words for which you searched appear near the bottom of the document. A quick method of finding the relevant words is to type Ctrl-F to search for the text in the current document. |
Bookmark your results: If you are likely to want to repeat a search at a later date, add a bookmark (or favorite) to your current search results. |
Right truncation of URLs: Often, a search will retrieve links to many documents at one site. For example, searching for "Harish's Web page" will retrieve not only my home page (http://harishonline.htmlplanet.com), but also any pages that contain the phrase "Harish's Web page", whether or not they are linked to the home page. Rather than clicking on each URL in succession to find the desired document, truncate the URL at the point at which it appears most likely to represent the document you are seeking and type this URL in the Location box of your web browser. |
Guessing URLs: Basic knowledge of the way in which URLs are constructed will help you to guess the correct URL for a given web site. For example, most large American companies will have registered a domain name in the format www.company_name.com (e.g. Microsoft - www.microsoft.com); American universities are almost always in the .edu domain (e.g. Cornell - www.cornell.edu or UCLA - www.ucla.edu); and Canadian universities follow the format www.university_name.ca (e.g. Simon Fraser University - www.sfu.ca or the University of Toronto - www.utoronto.ca). |
Wildcards: Some search engines allow the use of "wildcard" characters in search statements. Wildcards are useful for retrieving variant spellings (e.g. color, colour) and words with a common root (e.g. psychology, psychological, psychologist, psychologists, etc.). Wildcard characters vary from one search engine to another, the most common ones being *, #, and ?. Some search engines permit only right truncation (e.g. psycholog*), while others also support middle truncation (e.g. colo*r). |
Relevance ranking: All of the search engines covered in this workshop use an algorithm to rank retrieved documents in order of decreasing relevance. Consequently, it is often not necessary to browse through more than the first few pages of results, even when the total results number in the thousands. Furthermore, some search engines (e.g. AltaVista) allow the searcher to determine which terms are the most "important", while others have a "more like this" feature that permits the searcher to generate new queries based on relevant documents retrieved by the initial search. These features are discussed in more detail in the following section of this document. |
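The right-truncation tip above can be sketched with Python's standard urllib.parse module. The function below lists progressively shorter versions of a URL, any of which could be typed into the browser's Location box. The example path is made up for illustration; only the host www.sfu.ca appears in the text above.

```python
from urllib.parse import urlsplit, urlunsplit

def truncations(url):
    """Return progressively shorter versions of a URL, ending at the site root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = []
    # Drop one trailing path segment at a time.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth:
            path += "/"
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates

# Hypothetical deep link on a real host.
for candidate in truncations("http://www.sfu.ca/library/guides/search.html"):
    print(candidate)
```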
AltaVista
URL: http://www.altavista.com |
Size: 331 million pages |
Retrieved: Hurricane Floyd: 736 documents; Echinacea purpurea: 2661 documents |
Currency: Hurricane Floyd: 3 days (September 29, 1999); Echinacea purpurea: 1 day (October 1, 1999) |
Interface: AltaVista includes simple and advanced interfaces. They are less intuitive than most other search engine interfaces but they are well documented and allow some of the most powerful searching on the Web if you are willing to learn how to use them. Both interfaces allow the use of Boolean logic, though different syntax is used in the two interfaces. The simple interface includes a single search box and a pull-down menu that allows you to limit your search to one of 25 languages. The advanced interface includes a search box, limit by language, and options to limit a search by date and to rank results according to keywords of your choice. |
Search Features:
Search logic and syntax: AltaVista defaults to Boolean OR; that is, it will retrieve results containing any of the search words. However, the more search terms a document contains, the more highly it is ranked. In its simple interface, AltaVista supports the use of + and - to require and exclude terms. In its advanced interface, it supports all Boolean operators (AND, OR and AND NOT), plus the proximity operator NEAR (terms within 10 words of each other). In both interfaces, enclosing search terms in quotation marks searches for an exact phrase.
Limit options: Searches may be limited by date (Advanced only) and language (Simple and Advanced). AltaVista also allows you to restrict a search to certain fields (or sections) within a document, and by type of document, e.g. Title, URL, Image, Java applets, and Links to a specified page. For example, the search title:"search engines" will retrieve only those pages that include the phrase "search engines" in their title. The search link:harishonline.htmlplanet.com will retrieve all pages in AltaVista's database that include links to the specified URL.
Truncation: AltaVista uses the * character to support both right (e.g. psycholog*) and middle (e.g. colo*r) truncation.
Case sensitivity: If you begin a word with an upper case letter, AltaVista searches only for that word with an upper case letter. If you use lower case, AltaVista retrieves upper and lower case. For example, dodge retrieves dodge and Dodge; Dodge retrieves Dodge only.
Unique features: AltaVista has partnered with the AskJeeves search engine to allow you to input natural language queries, which are searched against the AskJeeves database, not the AltaVista database. For example: What is the distance between Vancouver and Toronto? |
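Two of the rules described above, case sensitivity and the NEAR operator, are easy to misread, so here is a small Python sketch of each. It is an interpretation of the description, not AltaVista's actual code: it assumes that any capital letter in a query word forces an exact-case match (the text only describes an initial capital), and that "within 10 words" means positions at most 10 apart in either order.

```python
def case_match(query_word, doc_word):
    """All-lower-case query words ignore case; any capital requires an exact match."""
    if query_word == query_word.lower():
        return query_word == doc_word.lower()
    return query_word == doc_word

def near(words, a, b, window=10):
    """True if words a and b occur within `window` positions of each other."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

print(case_match("dodge", "Dodge"))  # lower-case query matches either case
print(case_match("Dodge", "dodge"))  # capitalized query needs the capital
print(near("the storm was later named floyd".split(), "storm", "floyd"))
```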
Results:
What is displayed: The result display includes the document title, URL, first two lines of the document text, language, date and size (in bytes). In Advanced Search, you may choose to display only the number of results matching your search rather than the results themselves.
Order of results: Results are displayed in order of decreasing relevance. In Advanced Search, you may specify "ranking keywords" that force documents that contain these words to appear near the top of the result list.
Refining results: There is no way to refine your search results. AltaVista has partnered with LookSmart, AskJeeves and About.com to allow you to run variations of your search in their databases. |
Other features:
|
Excite
URL: http://www.excite.com |
Size: 159 million pages |
Retrieved: Hurricane Floyd: 18 Web pages; 210 news stories; Echinacea purpurea: 698 Web pages; 0 news stories |
Currency: Hurricane Floyd: Web pages - unknown; News stories - less than one day (Oct. 2, 1999); Echinacea purpurea: Web pages - unknown; News stories - Not applicable |
Interface: Excite offers two interfaces: simple and Advanced Web Search. The simple interface consists of a single search box with no options for modifying or limiting a search. Advanced Web Search presents the searcher with a series of search boxes that allow you to perform either word or phrase searching and to instruct Excite which words and/or phrases the document CAN contain, MUST contain, and MUST NOT contain. It also allows limiting by language, country, and domain (.com, .edu, .org, etc.). |
Search Features:
Search logic and syntax: Excite defaults to Boolean OR; that is, it will retrieve results containing any of the search words. However, the more search terms a document contains, the more highly it is ranked. Excite supports the use of the Boolean operators AND, OR and NOT (all of which must be in capital letters), the + and - signs to require and exclude words from your search, and phrase searching using quotation marks.
Limit options: Simple Search offers no limit options. Advanced Web Search offers limiting by language, country and domain.
Truncation: None
Case sensitivity: None |
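The default behaviour described above (match any term, but rank documents containing more of the terms higher) can be sketched as follows. This is a simplified illustration with invented documents; it counts distinct matching terms only and does not attempt to reproduce the relevance calculation of Excite or any other engine.

```python
def rank_any(docs, query):
    """Return docs matching any query term, those with the most distinct terms first."""
    terms = set(query.lower().split())
    scored = []
    for doc in docs:
        hits = len(terms & set(doc.lower().split()))
        if hits:
            scored.append((hits, doc))
    # sorted() is stable, so documents with equal scores keep their original order.
    return [doc for hits, doc in sorted(scored, key=lambda pair: -pair[0])]

# Hypothetical documents.
SAMPLE_DOCS = ["echinacea tea", "echinacea purpurea seeds", "rose garden"]
print(rank_any(SAMPLE_DOCS, "Echinacea purpurea"))
```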
Results:
What is displayed: For each document, Excite displays title, URL, brief summary and "relevance" level, as a percentage (how this number is calculated is not explained). There is an option to display document titles only.
Order of results: The first page of results displays relevant directory categories, the first ten Web site results, followed by the first five news stories and relevant discussion groups. By default, results are displayed in order of decreasing relevance. You may choose instead to display the forty most relevant results grouped by Web site. This is a useful feature that would be even more so if it were not limited to forty documents.
Refining results: Excite offers two methods of refining search results:
|
Other features:
|
FAST Search
URL: http://www.alltheweb.com |
Size: 327 million pages |
Retrieved: Hurricane Floyd: 28 documents; Echinacea purpurea: 3,863 documents |
Currency: Unknown |
Interface: Single search box with a pull-down menu to select "All the words", "Any of the words" or "Exact phrase". |
Search Features:
Search logic and syntax: Select "All of the words" (default), "Any of the words" or "Exact phrase". |
Results:
What is displayed: FAST displays title, brief summary and URL for each document found.
Order of results: Results are displayed ten per page, ranked by relevance.
Refining results: Not available |
Other features:
|
Google
URL: http://www.google.com |
Size: 355 million pages |
Retrieved: Hurricane Floyd: 18 documents; Echinacea purpurea: 1,117 documents |
Currency: Unknown |
Interface: Single search box with two search buttons: "Google Search" and "I'm Feeling Lucky". The latter automatically displays the page deemed most relevant rather than displaying a list of results. |
Search Features:
Search logic and syntax: Google defaults to Boolean AND. Enclose phrases in quotation marks. Use + and - to require and exclude search terms.
Limit options: None
Truncation: None
Case sensitivity: None |
Results:
What is displayed: Results include document title, first few words of text, URL, and a link to a previously cached version of the page.
Order of results: Google's PageRank algorithm ranks pages based on the number of pages that link to a given document. That is, the more frequently a document is linked to, the "better" it is. Google groups results by site, although this feature does not always appear to function properly.
Refining results: Clicking the "GoogleScout" link retrieves pages that are "related" to the current result. Like Excite's "more like this" feature, it sometimes has the effect of retrieving pages related by subject while at other times it simply retrieves other pages from the same site as the original result. |
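The link-based ranking idea described above can be illustrated with a simple count of incoming links. Note that this Python sketch is a deliberate simplification using a made-up link graph: the published PageRank algorithm also weights each link by the rank of the page it comes from, which is not attempted here.

```python
from collections import Counter

def inlink_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # count each linking page once
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts

def rank_by_inlinks(candidates, links):
    """Order candidate pages by descending number of incoming links."""
    counts = inlink_counts(links)
    return sorted(candidates, key=lambda page: (-counts[page], page))

# Hypothetical link graph: page -> pages it links to.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c", "c"]}
print(rank_by_inlinks(["a", "b", "c", "d"], LINKS))
```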
Other features:
|
HotBot
URL: http://hotbot.lycos.com |
Size: 280 million pages |
Retrieved: Hurricane Floyd: 20,600 documents; Echinacea purpurea: 760 documents |
Currency: Hurricane Floyd: 2 weeks (September 18, 1999); Echinacea purpurea: 2 to 3 weeks (September 16, 1999) |
Interface: HotBot offers two interfaces: a default (not to say simple, as it offers more options than most search engines' advanced interfaces) and an Advanced Search. Both interfaces feature pull-down menus for modifying search criteria (for example, to switch between word and phrase searching), and for restricting searches by date, geographical location, language and domain name. These pull-down menus make available advanced search features to users who otherwise might be intimidated by the complex Boolean logic needed to perform a similar search in AltaVista or another search engine. Check boxes are used to limit searches to particular types of media. |
Search features:
Search logic and syntax: Pull-down menus in both the default and advanced interfaces allow you to select between "all of the words", "any of the words", "exact phrase", "the person" (HotBot automatically rotates search terms, so a search for "Bill Gates" will look for "Bill Gates" and "Gates, Bill"), "links to this URL", and "Boolean phrase". HotBot supports the Boolean operators AND, OR and NOT, the + and - signs to require or exclude search terms, and quotation marks to specify phrase searching.
Limit options: The default search screen allows limiting by date, language and media type (e.g. Javascript, Image, Video). Advanced Search provides the same options plus an increased number of media types, and limiting by Internet domain (e.g. .edu), geographical region, and page depth. "Word Stemming" (Advanced Search only) searches for grammatical variations of your search terms (e.g. searches for "thought" will also find "think" and "thinking").
Truncation: HotBot offers the most sophisticated truncation features of any search engine. The * character may be used to replace any number of characters, while the ? character replaces a single character. Both of these symbols may be used at the end, in the middle, or at the beginning of a word.
Case sensitivity: Searches with all lower-case letters are not case-sensitive. The use of an upper-case letter anywhere in a search string limits search results to documents that match the exact search, including the case. |
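The truncation and case rules described above happen to line up with Python's standard fnmatch module, where * matches any run of characters and ? matches exactly one. The sketch below combines the two rules as described; it is an approximation for illustration, not HotBot's implementation, and fnmatch additionally treats square brackets as special characters, which the text does not mention.

```python
from fnmatch import fnmatchcase

def wildcard_match(pattern, word):
    """Match a word against a pattern using * (any run) and ? (one character).

    An all-lower-case pattern ignores case; a pattern containing any
    upper-case letter must match the word's case exactly.
    """
    if pattern == pattern.lower():
        return fnmatchcase(word.lower(), pattern)
    return fnmatchcase(word, pattern)

print(wildcard_match("psycholog*", "Psychologist"))  # right truncation
print(wildcard_match("colo*r", "colour"))            # middle truncation
print(wildcard_match("*ology", "Biology"))           # left truncation
print(wildcard_match("wom?n", "women"))              # single-character wildcard
```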
Results:
What is displayed: HotBot offers three options: full descriptions, brief descriptions, and URLs only. The full display includes the document title, the first few lines of text, and URL. The brief display includes the title and first few lines of text.
Order of results: Matching directory categories, if any, are followed by matching Web pages, in order of decreasing relevance.
Refining results: HotBot offers two methods for refining results:
|
Other features:
|
Northern Light
URL: http://www.northernlight.com |
Size: 282 million pages |
Retrieved: Hurricane Floyd: 821 Web pages; 350 news stories; Echinacea purpurea: 4,331 Web pages; 0 news stories |
Currency: Hurricane Floyd: Web pages - 1 to 2 weeks (Sept. 24, 1999); News stories - less than one day (Oct. 2, 1999); Echinacea purpurea: Web pages - less than 2 weeks (September 21, 1999); News stories - Not applicable |
Interface: Northern Light offers two interfaces: a simple interface that consists of a search box (into which you may enter search terms combined with Boolean and other search operators) and a pull-down menu to specify which part of the database will be searched; and Power Search, which offers the choice of searching for words anywhere, in the title, or in the URL and limiting by date, language, country, type of source, and subject category. |
Search features:
Search logic and syntax: The default search logic is Boolean AND. In both its interfaces, Northern Light supports the Boolean operators AND, OR and NOT, the + and - symbols to require and exclude terms, and quotation marks to indicate a phrase.
Limit options: Searches may be limited by location in the document (anywhere, title or URL), date, type of source (e.g. commercial sites, personal pages, educational sites, non-profit sites), language, country, subject (e.g. Arts, Business, Education, Travel), and document type (e.g. company information, for sale, learning materials, press releases, reviews).
Truncation: Northern Light uses two truncation symbols: * to represent multiple characters and % to represent a single character. Both symbols may be used at the end or in the middle of a word.
Case sensitivity: None. |
Results:
What is displayed: Result display includes document title, document type, brief summary, date, type of site, and URL.
Order of results: Results are displayed in decreasing order of relevance. Power Search also allows you to sort results by date instead.
Refining results: Northern Light uses Custom Search Folders to group results that share similar characteristics - subject, source, document type and language - into categories that are defined "on the fly" for each search. These folders are displayed on the left side of the window, next to the search results. By clicking on a folder, you can view only those results that are within a particular category. |
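The idea of grouping results into folders "on the fly" can be sketched as a simple group-by over whatever attribute is chosen for a given search. The result records and field names below are hypothetical, and the sketch says nothing about how Northern Light actually decides which folders to create.

```python
from collections import defaultdict

def group_into_folders(results, field):
    """Group result titles by the value of one attribute, e.g. subject or doc_type."""
    folders = defaultdict(list)
    for result in results:
        folders[result[field]].append(result["title"])
    return dict(folders)

# Hypothetical search results with made-up attributes.
RESULTS = [
    {"title": "Floyd storm track", "subject": "Weather", "doc_type": "press release"},
    {"title": "Floyd relief fund", "subject": "Charity", "doc_type": "press release"},
    {"title": "Hurricane basics", "subject": "Weather", "doc_type": "learning material"},
]

print(group_into_folders(RESULTS, "subject"))
print(group_into_folders(RESULTS, "doc_type"))
```

Viewing one folder then amounts to looking up a single key in the returned dictionary.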
Other: Along with its database of WWW documents, Northern Light offers fee-based access to a "Special Collection" of millions of documents from 4,500 information sources including books, magazines, academic journals, and online news services. While some of these sources are available elsewhere on the Web for free, most are not, and as Northern Light states in its documentation, nowhere else are they available from a single source. Prices are set by each information provider and range from US $1.00 to $4.00 per article. Free abstracts are available for all Special Collection documents. You may choose to search either or both of Northern Light's collections in a single search. Northern Light also offers an Industry search and a Current News search. |