Session 1: Search Engines, MetaSearch Engines, and Keywords
  • Introduction to the Web and finding things on it

  • Searching the Internet: the Recall Problem

  • Choices for Searching the Web

  • Searching the Web through Search Engines

  • Searching with Google

  • An Example of an Under-Documented Google Feature:
    Google's Little-Known Synonym Operator

  • Searching Multiple Search Engines through MetaSearch Engines

  • Clustering

 

Introduction to the Web and finding things on it

Although the Internet is not the be-all and end-all of good scholarly research resources, it is a huge, timely, and continually growing resource that the academic searcher simply cannot ignore.  So we will start this semester's LIS5703 course content with the Internet, and searching it, as our topic.  We do so not because the Internet contains all the information you will need, or even the best information on your topic (it really isn't likely to), but because of its sheer size and timeliness.

But, clearly, the Web is a 500-pound gorilla in the living room: you simply can't ignore it.  It is growing at an unheard-of rate, trying to become the uncontrolled archive of humanity's recorded communications.

Today there are estimated to be something over 11 billion (yes, that is a "b") web pages on the Internet.
 

Searching the Internet: the Recall Problem

When you carry out a search of the Web for something you are looking for, you have two essential questions about the results you get from your search engine:

  1. Did you get all of it? (the Recall problem)

    Searching the Web for information is a lot riskier than most people who search it realize. The problem is an old one, known to information scientists as the recall problem: when you don't know what is "out there," you tend to assume that getting something back from a search is tantamount to finding everything there was to be found. As this material will point out to you, that just isn't very likely to be the case. You can't be sure that your search strategy and the search engine you used actually retrieved ("recalled") all of the pertinent information available on the Internet.  There are two reasons for this:
     
    • You don't know if the keywords you used are the words that the authors of web pages used to talk about the topic you are actually interested in. You may have searched for "abortion," but some web sites may have used other, related words or phrases, like "unwanted pregnancies" or "choice" or "pro-choice" or "pro-life," and not the keyword you thought of searching by.
       
    • You just don't know if the search engine you happen to be using even indexes all of the possible web pages that actually exist somewhere on the World Wide Web.
       
  2. Is a particular search result Quality information?

Knowing how to search to find information on the Internet is only one of the steps involved in the process of deciding that you have located the "good" material on your search topic--getting good recall.  But, assessing the quality and veracity of your search results is another important issue.  You see, unlike searching in a library's catalog for books on the shelves of that library's collection, searching the Internet means searching across a range of information sources (web pages) that have not been made available to you under any controlled circumstances--no one has used any selection criteria to assess the worthiness or accuracy or validity of web pages.  The World Wide Web is essentially an uncontrolled form of communication: there is lots of spurious junk and fluff on the Internet.  

This fundamental nature of the Web (essentially, an uncontrolled, open medium of communication), coupled with the rather thoughtless user searching that is facilitated by search engines, leads to the difficulties that most of us have finding something we are looking for on the Web.  We call up a search engine, put a word or a phrase into the search box, and get hundreds or even thousands of web pages returned to us, too many for us to look through seriously.  We are overwhelmed with pages and pages of results, and invariably end up accepting one of the few results at the beginning of the list that we actually took the time to look at (current research indicates that users typically don't attend to more than the first 25 or so results listed).

In the next session we will develop a model for determining the quality of the information you find, one that asks you to decide for yourself what the likely motives were of the individuals, agency, or company responsible for serving up that information.  Our point right now, though, is this: a search of the Web usually returns more information than finally turns out to be of high enough quality for your purposes.  Put another way: quality information is only a subset of all the information returned to you by the major search engines.
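The two questions above can be put into numbers.  Here is a minimal sketch, assuming we somehow already knew which pages were the "good" ones (on the real Web we never do, and that is exactly the recall problem).  The word "precision" is the standard information-science companion to recall and is not a term used elsewhere in this session; treating "quality" as simple set membership is a simplification, and the page identifiers are invented for illustration.

```python
def recall(retrieved, relevant):
    """Share of all the relevant pages that the search actually returned."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Share of the returned pages that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical page identifiers, invented for this example.
relevant_pages = {"p1", "p2", "p3", "p4", "p5"}   # everything worth finding
returned_pages = {"p1", "p2", "x1", "x2"}         # what the search gave back

print(recall(returned_pages, relevant_pages))     # 2 of the 5 relevant pages were found
print(precision(returned_pages, relevant_pages))  # 2 of the 4 returned pages were relevant
```

Low recall means you missed material that exists; low precision means much of what you got back is not worth keeping.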

Choices for Searching the Web: Finding Tools for the Web

A recent text on how to search the Internet, The Information Specialist's Guide to Searching and Researching on the Internet and the World Wide Web by Ernest Ackermann and Karen Hartman, categorizes Web finding processes as falling into three main areas:

  1. Using Smaller Directories, which are subdivided into either

    1. Subject Guides, or
    2. Reference Works

  2. Browsing the Web through the larger, less selective Directories of Web pages (also called Subject Directories)

  3. Searching the Web through Search Engines

The type of search tool we are going to go over with you in this week's material, Session 1, is the third category, what they label as "Searching the Web through Search Engines."  In subsequent sessions we will cover reference works, subject directories (both general and academic), and finally, what we call subject guides.

Searching the Web through Search Engines

There are two different types of search engines that we need you to consider:

  1. Regular, everyday search engines themselves, like Google, AskJeeves, and Teoma.
  2. "Collective," super search engines, called MetaSearch engines, like Vivisimo, Fazzle, and Ixquick.

First, let's consider search engines. We don't particularly care what your personal choice of search engine is.  You just need to recognize that we are going to make our references to you in terms of Google.  It is the most frequently used search engine, and it is also one of the two current search engines that search the largest number of web sites.  It is the preference of most professional searchers, and it has the most written about its use (both its documented features and its huge number of un- or lightly-documented search features).

You will find the highest-ranked and best-reviewed search engines on a Searching/Browsing page that I maintain.  The search engines linked to there follow the rankings in an April 2004 article in SearchEngineWatch by Danny Sullivan.  I recommend that you read this article:

"Major Search Engines and Directories"

As you can see in this article, Sullivan gives three levels of rankings of search engines:

  1. Top Choices (Google, Yahoo!, AskJeeves)

  2. Strongly Consider (AllTheWeb, AOL Search, Hotbot)

  3. Other Choices (Alta Vista, Gigablast)

Clearly, Google is still the most frequently used search engine today (46.3%), and its index of web pages is probably the largest.  But, Yahoo! is offering some competition to Google, capturing 23.4%.  Keep watching those two "supermarket" vendors of search services.

Searching with Google

This is not a course about the fundamentals of Google searching, but I will at least point out to you where to find good information and tutorials, both online and in print, on using Google.  What the course is about, instead, is the best strategies and techniques for using good search engines like Google to find reliable, quality information.

You should first take a look at the helpful examples and explanations given in the Google Guide, an online tutorial by Nancy Blachman, an instructor at Stanford and one of Google's early users in the late 1990s:

Please note that her Google Guide is also available as a downloadable PDF file, for less than $2.00.  It runs to over 140 printed pages, but it is well worth the money if you want a handy printed version to read off-line.

For even more detailed written information about searching Google, you need to consider purchasing one of the several trade paperback books that have appeared recently.  Among the best are these three:

The first one, How to Do Everything with Google, is written by two software engineers at Google and Nancy Blachman.  This book has at least the informal "seal of approval" of Google, inasmuch as it was written by two "insiders" who assisted in Google's construction.   The book gives details about how Google's search engine works that are only guessed at by other authors of Google books--mainly because Google is a very, very proprietary organization that typically says little about how its search features work. 

As you will see only if you wade through Google's un-promoted help pages and other sub-pages, there are a ton of other search-related features and services that Google is experimenting with "in the background."  You can have access to these features too, but you really have to hunt for them.  Any of the books above (especially the first one), as well as the online tutorial, can help you do that.

You will be surprised (and amazed) at what Google can do for you.
 

An Example of an Under-Documented Google Feature:
Google's Little-Known Synonym Operator

The tutorial and books about Google noted above concentrate on something different from what we will be concentrating on in this course.  They are technical guides to the fundamentals of using the Google interface and the syntax of Google's search engine.  What this course is about, however, is deciding on the most effective and efficient ways of asking search systems (like Web search engines, or subject directories, or library catalogs, or indexing and abstracting databases) for something you are looking for.

You see, the development of a list of good keywords for searching purposes is not a "technical matter."  It is an intellectual activity that has to do with a consideration of how authors use words (nouns and adjectives) to describe what they are writing about, and how we, from such an analysis, can build a searching vocabulary of nouns and adjectives.  This is not the usual fare of computer hobbyists.  So, put away your King-of-the-Computer costume and get ready to think instead about how authors use words, and what useful nouns and adjectives are likely to be closely associated with a topic on web pages: what keywords or key phrases are most likely to assist you in finding other web pages about the same or similar topics.

Most folks who sit down at their web browser to do searching quickly fall into the trap of believing that the rest of the universe shares their own personal way of looking at topics: we all tend to have an egocentrism about the "right" way (our way) to search for something.  What we usually do is begin searching for something we are interested in by giving the search engine a keyword that expresses our personal understanding of what it is we are looking for.  This is fine . . . as an initial starting place . . . but if we never start thinking about how others might verbalize the same idea, we are not going to find out very effectively what else is "out there" on the Web.  In other words, the words we would personally choose are not likely to overlap completely with the words or phrases that the authors of topically-appropriate web pages chose to express their understandings of the topic.

I am pointing out this mismatch in search vocabulary to you now because most searches done on the Web are built by end users who only know how they each think about a topic, not how others (authors of web content) have actually described it on web pages.  What I need you to learn to do is figure out how others are thinking (and therefore, writing) about a topic, because those other keywords are going to be extremely important to you in finding all of the relevant web sites that are of interest to you.

Let me give you the hypothetical example of a college student using the Internet to find materials on the topic of abortion.  Well, the usual (and correct) first search to try is the term itself, abortion, in the Google search box: 

I have noted (with black circles in the beginnings of the results records) the indications of where Google found the keyword abortion in the very abbreviated entries of the web pages it returns to us for our consideration. 

However, for the fuller development of a list of different keywords that all (somehow) have to do with the topic of abortion, there is another, undocumented (or at least little known) feature of Google that can, under some circumstances, be very helpful to you.

That feature is the use of the "approximate" symbol (the tilde character, ~) directly in front of a single keyword.  In other words, if we enter the search not for abortion but instead ~abortion, we will be given a list of web pages that include other synonyms (and antonyms) for the topic of "abortion" (again, highlighted in bold black) . . . as well as the ones that had the word abortion in them.  This is rather like going to your handy-dandy Roget's Thesaurus and finding all of those other words or phrases that are similar to or opposite of abortion.

Doing a search on ~abortion in Google, we were able to find 8 related words or phrases in just the first 40 results entries.  They are listed below, along with the word abortion itself:

  • abortion
  • abortions
  • birth
  • birth control
  • choice
  • partial-birth
  • pregnancy
  • pro-choice
  • pro-life

Here are a few examples of what those entries looked like:

So, you now know how to allow Google to assist you in finding other keywords that have a similar or opposite meaning to the keyword that you already knew about: Google itself can be used to help you develop your search strategy's most effective list of keywords and phrases.  However, Google itself does a poor job of telling its users that this feature is available: as I mentioned above, you really need the assistance of a good tutorial or book to get a good overview of everything that Google can do for a searcher.

These new synonyms and related terms we got from Google are what professionals in Library and Information Studies label as "uncontrolled" vocabulary--vocabulary that was simply "found" in the documents (the web pages). 

Later in this course I will be introducing you to another form of vocabulary, "controlled" vocabulary: vocabulary that is used "officially" in a particular information retrieval system (like catalogs and indexing services).  For now, though, you just need to know about this way of getting Google to assist you in figuring out what other related terms and phrases authors are using for the topic behind the keyword you already knew to search by.
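Once you have collected a list of uncontrolled vocabulary like the one above, one way to put it to work is to search for all of the terms at once.  The sketch below is only an illustration: it assumes a search engine that, like Google, accepts an uppercase OR between alternatives and double quotes around a phrase.  The function name build_or_query is made up for this example.

```python
def build_or_query(terms):
    """Join keywords with OR, putting quotes around any multi-word phrase."""
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip empty entries
        parts.append(f'"{term}"' if " " in term else term)
    return " OR ".join(parts)


# A few of the related terms found in the ~abortion example above.
keywords = ["abortion", "birth control", "pro-choice", "pro-life"]
print(build_or_query(keywords))
# abortion OR "birth control" OR pro-choice OR pro-life
```

The resulting string can be pasted into the search box, so that pages using any one of the authors' alternative words have a chance of being retrieved.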
 

An Interesting Aside: 

The word "abortion" brings about rather strident emotional reactions from many of us who see it or hear it.  These moral reactions, on both "sides" of any issue, lead to the substitution of other words by those who wish for the issue to be "framed" in words that do not have the (to them) negative impact of the offending word.  However, don't expect Google's synonym procedure to understand the sensibilities of the searchers who use it: its algorithm for finding and displaying words synonymous with the original word is not sensitive to your moral or political views.  For an interesting look at opposing metaphors in political speech, see two books by George Lakoff, a cognitive scientist:


Searching Multiple Search Engines through MetaSearch Engines

It will amaze you to learn this, but it is true: according to estimates by Lawrence and Giles a few years ago (see the summary of their 1999 article here: Access and Distribution of Information on the Internet), no single search engine indexes more than about 1/5th of the web pages available through the publicly-accessible Web.  Yes, at most about 20% (actually, 16%!) of the Web is indexed by any single search engine.

Well, each search engine has its own way of identifying web pages to index, and so indexes a different part of the Web.  One way of overcoming this is to rely on search tools that don't do any searching of the Web themselves, but instead submit your search request to a large number of search engines for you, and then combine the results they receive into a single list that is reported to you.
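To make the idea of "combining the results" concrete, here is a minimal sketch of one possible merging strategy: take the first result from each engine, then the second from each, and so on, keeping each web page only once.  This is an assumption made for illustration, not a description of how any particular metasearch engine actually works, and the engine lists and addresses are invented.

```python
from itertools import zip_longest


def merge_results(result_lists):
    """Interleave ranked result lists round-robin, keeping each URL only once."""
    merged, seen = [], set()
    # zip_longest pads the shorter lists with None so no results are dropped.
    for same_rank in zip_longest(*result_lists):
        for url in same_rank:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical ranked results from three search engines.
engine_a = ["a.example/1", "shared.example/x", "a.example/2"]
engine_b = ["shared.example/x", "b.example/1"]
engine_c = ["c.example/1", "a.example/1", "c.example/2"]

for url in merge_results([engine_a, engine_b, engine_c]):
    print(url)
```

Notice that a page found by more than one engine appears only once in the merged list, while a page found by only one engine still makes it in; that second point is what improves your recall.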

If you look at the metasearch engines that are listed for your inspection on the Searching / Browsing page of my Find it on the Web site, you will find the metasearch engines listed in two categories:

  1. Award Winners

  2. Other Top Choices

These two categories were created by Chris Sherman, an associate editor of SearchEngineWatch, in his March 2005 article about "Metacrawlers and Metasearch Engines."  I recommend that you read this article, since you, like most Internet users, are probably not up-to-date about what tools are actually available for searching the Web, or about which tools have appeared recently and moved up in the ratings and rankings.  Most users simply start with some particular search engine . . . and stay with it for longer than they probably should.  Familiarity, not features, is what drives most persistent usage.  What starts as a casual choice turns into a tenaciously-held personal preference over time, and unfortunately no contradictory evidence is likely to get you to "give up" the familiar and comfortable for anything else.

There is a feature that is special about a few of the metasearch engines that you need to know about, called clustering. 

Clustering

Several of the metasearch engines are beginning to introduce conceptual "clustering" into their software, allowing you, the user, to see how web pages are associated with other web pages.  How the metasearch engines do this differs from engine to engine (if they do it at all, and most don't), but let's just say there are exotic measures of "relatedness" that allow some of these metasearch engines to group web pages together for you in ways that let you discover new, relevant web pages that would have been difficult for you to find using your straightforward searching, or even Google's related terms searching (the tilde operator).

As an example, take the first metasearch engine listed on the Searching / Browsing page, Vivisimo.  If we type the keyword abortion into the Vivisimo search box, here is the result we get:

Notice, immediately, that Vivisimo not only returns pages in which the word abortion occurs (that is what is in the right-side window), but it also gives you an analysis of, and labeling for, clusters of web pages that all seem to be "similar" in some way.  So, Vivisimo tells us that there are web pages that are about abortion as an "issue," about the concept of "life," about the area labeled "clinic," and so forth.

If you notice, though, some of those clusters of web pages have labels that we already recognize as being "synonyms" or "antonyms" for the concept of abortion:

  • life

  • pro-choice

  • partial-birth abortion

  • ....

And there were more clusters; the image above of a Vivisimo result page just captured what was on one page of the browser.
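If you are curious about what "grouping similar pages" could mean mechanically, here is a deliberately simple sketch.  It groups page titles under any word (other than the search keyword itself) that at least two titles share.  This is a toy measure of "relatedness" chosen for illustration; it is not Vivisimo's method, which is not described here, and the titles are invented.

```python
from collections import Counter, defaultdict

# A tiny, hand-picked list of words too common to make useful cluster labels.
STOPWORDS = {"the", "a", "an", "of", "and", "on", "in", "for", "to", "is"}


def cluster_titles(titles, query, min_pages=2):
    """Group titles under each non-query word shared by at least min_pages titles."""
    words_per_title = []
    for title in titles:
        words = {w.strip(".,:;!?").lower() for w in title.split()}
        words_per_title.append(words - STOPWORDS - {query.lower(), ""})
    # Count in how many titles each word appears.
    counts = Counter(w for words in words_per_title for w in words)
    clusters = defaultdict(list)
    for title, words in zip(titles, words_per_title):
        for word in sorted(words):
            if counts[word] >= min_pages:
                clusters[word].append(title)
    return dict(clusters)


# Hypothetical result titles for the keyword "abortion".
titles = [
    "Abortion clinic directory",
    "Pro-choice views on abortion",
    "Abortion: the pro-life position",
    "Finding a clinic",
    "Pro-choice organizations",
]

for label, pages in cluster_titles(titles, "abortion").items():
    print(label, "->", pages)
```

Even this crude approach surfaces labels such as "clinic" and "pro-choice" that you could then add to your keyword list.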

One final example of metasearch clustering is the visual result offered by KartOO:

Clicking on an area within this KartOO graphic display of the interconnectedness of web page clusters will open up newer, nested areas, and finally actual web pages.

 


Exercise 1: Trying out some Search Engines and MetaSearch Engines


In just a few weeks, you are going to be asked to submit to me the social sciences or policy topic you would like to work on for your bibliographic project.  After that topic is either approved, changed, or negotiated with me, you will begin your search for the scholarly literature, and the related, reliable information, that will go into your bibliography.  At that point, your ability to determine the appropriate vocabulary (keywords) to use in finding materials on a topic will be very important to you.

Before you begin to develop that uncontrolled and controlled vocabulary, though, I need to know that you know the variety of searching alternatives available to you on the Web.  So, this first exercise, Exercise 1, simply asks you to experiment with the top search engines and metasearch engines.  Kick the tires and take the search engines for a test drive, trying out the same search in each one.  Make up some topic (a concept or an issue or a trend, for example) that has to do with your degree's subject area, and look for it on the web, using 3 to 5 of the top-ranked search engines, and 3 to 5 of the top-ranked metasearch engines.

Remember, you can find links to those search engines and metasearch engines on this page:

Searching / Browsing
 

Exercise 1: Search Engine and MetaSearch Engine Tryout 

Please experiment with the search tools introduced to you above in Session 1 by trying to find materials having to do with some concept that applies to your degree's subject area. 

This is not intended to be a thorough search of any of these tools; it is just a way to have you investigate search tools you may not have used before.  Spend no more than an hour or two doing this, but realize that what you learn about these tools in this process will be of invaluable assistance to you later on in this course, as you proceed to build your bibliographic project.