Introduction to the Web
and finding things on it
Although the Internet is not the be-all and end-all of good academic and
scholarly research resources, it is a huge, timely, and continually growing
resource that the academic searcher simply cannot ignore.
So we will start this semester's LIS5703 course content with the Internet, and searching it, as
our topic. We do so not because the Internet contains all the information you will
need, or because it necessarily contains the best information you are looking
for; it really isn't likely to.
But, clearly, the Web is a 500-pound gorilla in the living room--you simply
can't ignore it--and it is growing at an unheard-of rate, on its way to becoming
the uncontrolled archive of Humanity's recorded communications.
Today there are estimated to be something over
11 billion (yes, that is a "b") web pages
on the Internet.
Searching the Internet: the Recall Problem
When you carry out a search of the Web for something you
are looking for, you have two essential questions about the results you get from
your search engine:
- Did you get all of it (the Recall problem)?
Searching the Web for information is a lot riskier than most people who
search it realize. The problem is an old one, known to information scientists
as the recall problem: when you don't know what is "out there," you tend to
assume that getting something back as your search result is tantamount to
finding everything that there was to be found. As this material will point out
to you, that just isn't very likely to be the case. You can't be sure that your
search strategy and the search engine you used actually retrieved ("recalled")
all of the pertinent information there was to get on the Internet. This is for
a couple of different reasons:
- You don't know if the keywords you
used were the words that the authors of web pages used to talk about
the topic you are actually interested in (you may have searched for
"abortion," but some web sites may have used other, related words or
phrases like "unwanted pregnancies" or "choice" or "pro-choice" or
"pro-life" or . . . and not the keyword you thought of searching by).
- You just don't know if the search engine you
happen to be using even indexes all of the possible web pages that
actually exist somewhere on the World Wide Web.
- Is a particular search result Quality information?
Knowing how to search to find information on
the Internet is only one of the steps involved in the process of deciding that you
have located the "good" material on your search topic--getting good
recall. But, assessing the
quality and veracity of your search results is another
important issue. You see, unlike searching in a library's catalog
for books on the shelves of that library's collection, searching the Internet means searching
across a range of information sources (web pages) that have not been made
available to you under any controlled circumstances--no one has used any selection
criteria to assess the worthiness or accuracy or validity of web pages. The World Wide Web is essentially
an uncontrolled form of communication: there is lots of spurious junk and fluff
on the Internet.
This fundamental nature of the Web--essentially,
an uncontrolled, open medium of communication--coupled with the rather
thoughtless user searching that is facilitated by search engines, leads to the
difficulties that most of us have finding something we are looking for on the
Web. We call up a search engine, put a word or a phrase into the search
box, and get hundreds or even thousands of web pages returned to us--too
many for us to look through seriously. We are overwhelmed with pages and
pages of results, and invariably end up accepting one of the few at the
beginning of the results list that we actually took the time to look at (current research
indicates that users typically don't attend to more than the first 25 or so
results listed).
In the next session we will develop a model for determining the quality of
the information you find, and for deciding on your own what the
likely motives were of the individuals, agency, or company responsible for
serving up that information. Our point right now, though, is this: a search
of the Web usually returns more information than finally turns out
to be of high enough quality for your purposes. Put another way:
quality information is a definite subset of all
information returned to you by the major search engines.
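The two questions above (did you get all of it, and is what you got any good?) can be put into numbers. Here is a minimal sketch in Python, assuming a tiny made-up collection of six "pages" and a made-up judgment of which ones are relevant; the page texts, the RELEVANT set, and the function names are all invented for illustration and do not describe how any real search engine works.

```python
# Toy illustration of recall and precision. The pages and the relevance
# judgments below are invented for this example.
PAGES = {
    "page1": "abortion law overview",
    "page2": "pro-choice advocacy group",
    "page3": "pro-life advocacy group",
    "page4": "abortion of a rocket launch",
    "page5": "unwanted pregnancies and choice",
    "page6": "gardening tips",
}

# Pages a human judge would call relevant to the topic (an assumption).
RELEVANT = {"page1", "page2", "page3", "page5"}


def search(keywords):
    """Return the ids of pages whose text contains any of the keywords."""
    return {pid for pid, text in PAGES.items()
            if any(kw in text for kw in keywords)}


def recall(retrieved, relevant):
    """Share of all relevant pages that the search actually found."""
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Share of the retrieved pages that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


# One keyword versus a broader list of related keywords.
narrow = search(["abortion"])
broad = search(["abortion", "pro-choice", "pro-life", "pregnancies"])

print("narrow:", recall(narrow, RELEVANT), precision(narrow, RELEVANT))
print("broad: ", recall(broad, RELEVANT), precision(broad, RELEVANT))
```

In this toy collection the single keyword finds only one of the four relevant pages, and it also drags in an irrelevant one. On the real Web you never know the full set of relevant pages, so recall can only be estimated, never measured this directly.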
Choices for Searching the Web: Finding Tools for the Web
A recent text on how to search the Internet (The
Information Specialist's Guide to Searching and Researching on the Internet and
the World Wide Web by Ernest Ackermann and Karen Hartman: Amazon
listing) categorizes Web finding processes as falling into three main areas:
- Using Smaller Directories, which are subdivided into either
  - Subject Guides, or
  - Reference Works
- Browsing the Web through the larger, less selective Directories of Web pages (also called Subject Directories)
- Searching the Web through Search Engines
The type of search tool we are going to go over
with you in this week's material, Session 1, is the third category, what they
label as "Searching the Web through Search Engines." In subsequent
sessions we will cover reference works, subject directories (both
general and academic), and finally, what we call subject guides.
Searching the Web through Search Engines
There are two different types of search engines that we need you to consider:
- Regular, everyday search engines themselves, like Google, AskJeeves, Teoma.
- "Collective," super search engines, called MetaSearch engines, like Vivisimo, Fazzle, and Ixquick.
First, let's consider search engines. We don't particularly care what your personal choice of search
engine is. You just need to recognize that we are going to make
our references to you in terms of the most frequently used search engine,
Google. It is also one of the two current search engines that search the
largest number of web sites. It is the preference of most professional searchers,
and it has the most written about its use (both its documented features and
its huge number of un- or lightly-documented search features). You will find the
highest-ranked, best-received search engines on a Searching/Browsing page that
I maintain. The search engines linked there follow the rankings from an
April 2004 article in SearchEngineWatch by Danny Sullivan.
I recommend that you read this article: "Major Search Engines and Directories"
As you can see in this article,
Sullivan gives three levels of rankings of search engines:
- Top Choices (Google, Yahoo!, AskJeeves)
- Strongly Consider (AllTheWeb, AOL Search, Hotbot)
- Other Choices (Alta Vista, Gigablast)
Clearly, Google is still the
most frequently used search engine today
(46.3%), and its index of web pages is probably
the largest. But, Yahoo! is offering some competition to Google,
capturing 23.4%.
Keep watching those two "supermarket" vendors of search services.
Searching with Google
This is not a course about the fundamentals of
Google searching, but I will at least point out to you where to find good information and
tutorials--both online and in print--on using Google.
What the course is about, instead, are the best strategies and techniques for
using good search engines like Google to find reliable, quality
information. You should first take a look
at the helpful examples and explanations given in the
Google Guide, an online tutorial
by Nancy Blachman, an instructor at Stanford and one of the early
(late 1990s) users of Google.
Please note that her Google Guide
is also available as a downloadable PDF file, for less than $2.00. It
runs to over 140 printed pages, but it is well worth the money if you want a handy
printed version to read off-line.
For even more detailed written information about searching Google, you need to consider
purchasing one of the several trade paperback books that have appeared recently. Among the best are these
three:
The first one, How to Do Everything
with Google, is written by two software engineers at Google
and Nancy Blachman.
This
book has at least the informal "seal of approval" of Google,
inasmuch as it was written by two "insiders" who assisted in Google's
construction. The book gives details about how Google's search engine works that
are only guessed at by other authors of Google books--mainly
because Google is a very, very proprietary organization that typically
says little about how its search features work.
As you will see only if you wade through
Google's un-promoted help pages and other sub-pages, there are a ton of other
search-related features and services that Google is experimenting
with "in the background." You can have access to these
features too, but you
really have to hunt for them. Any of the books above (especially the first
one), as well as the online tutorial, can help you do that.
You will be surprised (and amazed) at what
Google can do for you.
An Example of an Under-Documented Google Feature:
Google's Little-Known Synonym Operator
The tutorial and books about Google noted above concentrate on something
different from what we will be concentrating on in this course. The
tutorial and books
are technical guides to the fundamentals of using the Google
interface and the syntax of Google's search engine. This course, however,
is about deciding on the most effective and efficient ways of
asking search systems (like Web search engines, or subject directories, or
library catalogs, or indexing and abstracting databases) for something you are
looking for.
You see, the development of a list of good
keywords for searching purposes is not a "technical matter." It is an
intellectual activity that has to do with a consideration of how authors use
words (nouns and adjectives) to describe what they are writing about, and how
we, from such an analysis, can build a searching vocabulary of nouns and
adjectives--not the usual fare of computer hobbyists. So, put
away your King-of-the-Computer costume and get ready to think instead about how authors
use words, and what useful nouns and adjectives are likely to be closely
associated with a topic on
web pages--what keywords or key phrases are most likely to assist you in finding other web pages
about the same or similar topics.
Most folks who sit down at their web browser to do
searching quickly fall into the trap
of believing that the rest of the universe shares their own personal way of
looking at topics: we all tend to have an egocentrism about the
"right" way (our way) to search for something. What we usually do is begin searching for
something we are interested in by giving the search
engine a keyword that expresses our personal
understanding of what it is we are looking for. This is
fine . . . as an initial starting place . . . but if
we never start
thinking about how others might verbalize the same idea, we are
not going to find out very effectively what else is "out there" on the Web.
In other words, our personal point of
view is not likely to overlap completely with that of the people who
supplied the content of topically appropriate web pages: they
may have chosen different words or phrases to express their understanding of the
topic.
I am pointing out this mismatch in search
vocabulary
to you now because most searches done on the Web are built by end users who only
know how they each think about a topic, not how others (authors of web
content) have actually described it on web pages. What
I need you to learn to do is
figure out how others are thinking (and therefore, writing) about
a topic, because those other keywords are going to be extremely important to you in
finding all of the relevant web sites that are of interest to you.
Let me give you the hypothetical example of a
college student using the Internet to find materials on the topic of
abortion. Well, the usual (and correct) first search to try
is the term itself,
abortion, in the Google search box:

I have noted (with black circles in
the beginnings of the results records) the
indications of where Google found the keyword
abortion in
the very abbreviated entries of the web pages it returns to us for
our consideration.
However, for the fuller development of a list
of different keywords that all (somehow) have to do with the topic of
abortion,
there is another, undocumented (or at least little known)
feature of Google that can, under some circumstances, be very helpful to you.
That feature is the use of the
"approximate" symbol (the tilde character, ~) directly in front of a
single keyword. In other words, if we enter the search not for
abortion but instead
~abortion, we will be given a list of web
pages that include other synonyms (and antonyms) for the topic of "abortion"
(again, highlighted in bold black) . . . as well as the ones that
had the word
abortion in them. This is rather like going
to your handy-dandy Roget's Thesaurus and finding
all of those other words or phrases that are similar to or
opposite of
abortion.
Doing a search on ~abortion in Google, we were able to find the 8 related
words or phrases listed below (alongside the word abortion itself) in just
the first 40 result entries:
- abortion
- abortions
- birth
- birth control
- choice
- partial-birth
- pregnancy
- pro-choice
- pro-life
Here are a few examples of what those
entries looked like:

So, you now know how to allow Google
to assist you in finding other keywords that have a similar or
opposite meaning
to the keyword that you already knew about: Google
itself can be used to help you develop your search strategy's
most effective list of keywords and phrases. However,
Google itself does a poor job of telling its users that this
feature is available: as I mentioned above, you really need the
assistance of a good tutorial or book to get a good overview of
everything that Google can do for a searcher.
These new synonyms and related terms we got from Google are what
professionals in Library and Information Studies label
as "uncontrolled" vocabulary--vocabulary that was simply "found" in the
documents (the web pages).
Later in this course I will be
introducing you to another form of vocabulary, "controlled"
vocabulary--vocabulary that is used "officially" in a particular
information retrieval system (like catalogs and indexing services).
For now, though, you just need to know about this way
of getting Google to help you figure out what other
related terms and phrases authors are using for the topic behind
the keyword you already knew to use for searching purposes.
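Once you have collected a list of related terms, one way to put it to work is to search for all of them at once. The sketch below, in Python, simply joins a list of terms into a single query string. It assumes the search engine accepts an uppercase OR between terms and double quotes around multi-word phrases, as Google's query syntax does; the function name build_or_query is my own invention for this illustration.

```python
def build_or_query(terms):
    """Join keywords into one query string, quoting multi-word phrases."""
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip empty entries
        # A phrase with a space must be quoted to be searched as a unit.
        parts.append(f'"{term}"' if " " in term else term)
    return " OR ".join(parts)


# A few of the related terms found with the ~abortion search.
related_terms = ["abortion", "birth control", "pro-choice", "pro-life"]
query = build_or_query(related_terms)
print(query)  # abortion OR "birth control" OR pro-choice OR pro-life
```

The resulting string can be pasted into the search box as is. Which terms belong in the list is still your intellectual decision; the code only saves you the typing.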
An Interesting Aside:
The word "abortion" brings about rather strident emotional reactions
from many of us who see it or hear it. These moral reactions,
on both "sides" of any issue, lead to the substitution of other
words by those who wish for the issue to be "framed" in words that
do not have the (to them) negative impact of the offending word.
However, don't expect Google's synonym procedure to
understand the various sensibilities of the searchers who use
it. Google's algorithm for finding and displaying
words synonymous with the original word is not
"politically sensitive" to your moral or political sensibilities.
For an interesting look at opposing metaphors in political speech,
see two books by George Lakoff, a cognitive scientist:
Searching Multiple Search Engines through MetaSearch Engines
It will amaze you to learn this, but
it is true: according to estimates by Lawrence and Giles a few years ago (see
the summary of their 1999 article here:
Access and Distribution of Information on the Internet),
no single search engine
indexes more than about 1/5th of the web pages available through the publicly-accessible Web. Yes, only about
20% (actually, 16%!) of the Web is indexed by any single search engine.
Because each search engine has its own way of identifying web pages to
index, each one covers a different part of the Web. One way of overcoming
this is to rely on search tools that don't actually do any searching
themselves, but instead submit your search request to a large number of search engines for you, and
then combine the
results received from them into a single list that is reported to you.
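To make the "combine the results" step concrete, here is a minimal sketch in Python of merging the ranked lists returned by several engines. The three result lists are invented, and the scoring rule (each page earns 1/rank points from every engine that returned it) is just one simple way of merging that I have chosen for illustration; it is not a description of how Vivisimo, Ixquick, or any other real metasearch engine works.

```python
from collections import defaultdict


def merge_results(result_lists):
    """Merge ranked URL lists from several engines into one ranked list.

    Each URL earns 1/rank points from every engine that returned it, so
    pages found near the top by several engines rise to the top.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    # Highest score first; ties are broken alphabetically for stability.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Invented results from three hypothetical engines.
engine_a = ["example.org/a", "example.org/b", "example.org/c"]
engine_b = ["example.org/b", "example.org/d"]
engine_c = ["example.org/b", "example.org/a", "example.org/e"]

merged = merge_results([engine_a, engine_b, engine_c])
for url in merged:
    print(url)
```

Notice that the merged list contains five distinct pages, although no single engine returned more than three of them. That is the whole appeal of metasearching: each engine contributes pages the others missed.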
If you look at the metasearch
engines that are listed for your inspection on the
Searching / Browsing page of
my Find it on the Web site, you will find
the metasearch engines listed in two categories:
- Award Winners
- Other Top Choices
These two categories were created by Chris Sherman, an associate editor of
SearchEngineWatch, in its March
2005 article about "Metacrawlers
and Metasearch Engines." I recommend that
you read this article, since you, like most Internet users, are probably not
up to date
on what tools are actually available for searching the Web and which tools
have appeared recently and moved up in the ratings and the rankings: most
users simply start with some particular search engine . . . and stay with it for
longer than they probably should. Familiarity, not features, is what drives
most persistent usage. The choice
is often a casual one when first made, but it turns into a
tenaciously held personal preference over time, and unfortunately no
contradictory evidence is likely to get you to "give up" the familiar
and comfortable for
anything else.
There is a feature that is
special about a few of the metasearch engines that you need to know about,
called clustering.
Clustering
Several of the metasearch engines are beginning to introduce conceptual "clustering" into
their software, allowing you, the user, to see how web pages are associated with
other web pages. How the metasearch engines do this differs from engine to
engine (if they do it at all--most don't), but let's just say there are exotic measures of "relatedness" that
allow some of these metasearch engines to group web pages together for you in
ways that let you discover new, relevant web pages that would
have been difficult for you to find using straightforward searching, or
even Google's related terms searching (the tilde operator).
As an example, take the first
metasearch engine listed on the
Searching / Browsing page, Vivisimo. If we type the keyword
abortion
into the Vivisimo search box, here is the result we get:

Notice, immediately, that
Vivisimo not only returns pages in which the word
abortion
occurs (that is what is in the right-side window), but it also gives you an analysis
of, and labeling for, clusters of web pages that all seem to be "similar" in
some way. So, Vivisimo tells us that there are web pages that are
about abortion
as an "issue," about the concept of "life," about the area labeled "clinic," and
so forth.
Notice, though, that some of those
clusters of web pages have labels that we already recognize as being "synonyms"
or "antonyms" for the concept of
abortion:
- life
- pro-choice
- partial-birth abortion
- ....
And there were more clusters; the
image above of a Vivisimo result page just captured what was on one page
of the browser.
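To give you a rough feel for what "clustering" means, here is a deliberately simple sketch in Python. It groups result titles under every label term that appears in them. Both the titles and the label terms are invented, and the labels are supplied by hand; a real clustering engine works out its own labels using far more sophisticated (and unpublished) measures of relatedness, so treat this only as an illustration of the idea of grouping, not of Vivisimo's method.

```python
from collections import defaultdict


def cluster_by_label(titles, labels):
    """Group result titles under every label term that appears in them."""
    clusters = defaultdict(list)
    for title in titles:
        lowered = title.lower()
        matched = [label for label in labels if label in lowered]
        for label in matched:
            clusters[label].append(title)
        if not matched:
            # Titles matching no label are collected separately.
            clusters["other"].append(title)
    return dict(clusters)


# Invented result titles and hand-picked label terms.
titles = [
    "Pro-Choice Action Network",
    "Right to Life committee",
    "Women's health clinic directory",
    "Pro-choice and pro-life positions compared",
]
labels = ["pro-choice", "life", "clinic"]

clusters = cluster_by_label(titles, labels)
for label, members in clusters.items():
    print(label, "->", members)
```

A page can land in more than one cluster here (the last title falls under both "pro-choice" and "life"), which is also what you will often see in a clustering engine's results.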
One final example of metasearch
clustering is the visual result offered by KartOO:

Clicking on an area within this
KartOO graphic display of the interconnectedness of web page clusters will
open up newer, nested areas, and finally actual web pages.