Searching with Google

For the last several years, Google has maintained the largest index of pages on the World Wide Web (see SearchEngineWatch.com).  Today, in the latter half of 2003, Google's database is estimated to index over 3.3 billion web pages.

This document is about searching Google.  As you will soon find out, Google has its own system of search features and devices.  You must know them in order to make effective use of Google for your Internet searching needs.

Search Engines vs Web Directories

First, understand that Google is a search engine.  In other words, it uses a "spider" to "crawl" the World Wide Web, looking for pages (and other documents, like Word files and PDF files) that are new or that have changed since the last time the spider visited a particular URL.  It then adds that new data to its already huge database that indexes the World Wide Web.
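
The new-versus-changed bookkeeping described above can be sketched in a few lines of Python.  This is only a toy illustration, not Google's actual method: the "web" here is an in-memory dictionary rather than pages fetched over the network, the example.com addresses are made up, and the names fingerprint and crawl are hypothetical.

```python
import hashlib

def fingerprint(content):
    """Return a short, stable fingerprint of a page's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def crawl(web, index):
    """Compare every page in `web` with `index` and update the index in place.

    `web` maps URL -> page text (a stand-in for fetching real pages).
    `index` maps URL -> fingerprint recorded on the previous crawl.
    Returns two lists: URLs seen for the first time, and URLs that changed.
    """
    new, changed = [], []
    for url, content in web.items():
        digest = fingerprint(content)
        if url not in index:
            new.append(url)
        elif index[url] != digest:
            changed.append(url)
        else:
            continue  # page is already known and unchanged
        index[url] = digest
    return new, changed

# Made-up pages, for illustration only.
toy_web = {"http://example.com/a": "first page", "http://example.com/b": "second page"}
toy_index = {}
first_pass = crawl(toy_web, toy_index)    # both pages are new
toy_web["http://example.com/a"] = "first page, revised"
second_pass = crawl(toy_web, toy_index)   # only page "a" has changed
```

A real spider also discovers URLs by following links and has to cope with pages that disappear; both are left out of this sketch.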

It is not, in and of itself, a web directory, as the very popular Yahoo! site was when it started in the 1990s.  Web directories are datasets that are much, much smaller: they range from a few thousand web sites up to a couple of million in a very large web directory like the Open Directory Project (see below), as opposed to the over 3 billion web pages referred to in a search engine like Google.  Web directories, however, are very useful finding tools because they are filled with web sites that have been classified by human beings who are experts in some area of information.  Search engines are built by programs that scan the publicly accessible World Wide Web for new or updated pages; web directories are organized, carefully chosen lists of the better and best web sites on a wide range of topics.

Indeed, web directories are so important that many search engines offer web directory services at their sites alongside their all-web search engines.  Google, for example, has a separate tab or button labeled "Directory" which lays out 16 general categories under which, altogether, over 3 million web sites are organized and pointed to.  Google didn't create this information; Google licensed the right to list the web directory services of the Open Directory Project (ODP).

The ODP is also known as DMOZ, an acronym for Directory Mozilla.  This name reflects its loose association with Netscape's Mozilla project, an Open Source browser initiative.  The ODP was developed in the spirit of Open Source, where development and maintenance are done by net-citizens, and results are made freely available for all net-citizens. 

Basic Search Mechanisms

The default page brought up at Google's main address, http://google.com/, shown above, presents the input box for searching the World Wide Web.

There are three main types of web searches that Google facilitates:

  1. searching the Web itself--the first tab (and the default condition)

  2. browsing through a Directory of web sites made available by Open Directory Project (ODP)

  3. searching through an advanced search page (the Advanced Search text link)

You should also notice that Google has five tabs above its search box: Web, Images, Groups, Directory, and News.  One of them, Directory (circled item #2), has already been mentioned; it is Google's presentation of the Open Directory Project.  The other tabs are just as easy to use: Images locates images that are used on web pages on the Internet; Groups searches through the millions of messages posted to newsgroups since the early 1980s; News provides current news in a variety of areas and lets users search for past news stories.

The area to the right of the search box contains three text links:

Advanced Search
Preferences
Language Tools

We will return to Preferences and Language Tools later; Advanced Search will be taken up below.

Basic Indexing Features

  • Case Doesn't Matter.  Capitalization of search words has no effect on the results one gets from Google: all words are stored in Google's index in lower case, so searches using any combination of upper and lower case letters obtain the same results:

    Homer

    HOMER

    homer

  • Stop Words.  There are a small number of very common words that Google does not use for indexing purposes.  Called "stop words" or "delete words," these thirty or so words (listed below) are used frequently in English text and are therefore of little retrieval value: they are common-language "placeholders" that do not help us discriminate, for search purposes, between pages that have them and pages that don't.  It makes little sense, for example, to search for only those web pages that contain the word "in."

    If a glossary of all the words appearing in all the WWW's pages were produced, most of the occurrences would be of words like "of," "for," "by," "with," "to," etc.  Because these words have almost no search significance, the producers of search engine indexes like Google's save extraordinary amounts of database space by not indexing them.  There are, however, special circumstances under which a user might wish to search on these words, and Google provides special procedures for forcing a word such as "the" into a search ("the Who," for example).

    Google stop words

    a        at       in       that     when
    about    be       is       the      where
    an       by       it       this     which
    and      for      of       to       who
    are      from     on       was      will
    as       I        or       what     with


  • To Force a Stop Word to be Searched by Google.  Although stop words are normally not searched for by Google, they can be forced into a Google search specification in one of two ways. 

    First, one may place a plus sign (+) directly in front of the word that Google would usually leave out of a search: the plus means that the following word must be found in the search results.

    already paid +for

    Second, one may simply include the usually excluded word in a bound phrase (enclosed in quotes):

    "already paid for"

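The indexing rules above (case folding, stop words, and the two ways of forcing a stop word) can be illustrated with a minimal Python sketch.  It is an assumption-laden model of the behavior this section describes, not Google's code: the function name effective_terms is hypothetical, the word list is copied from the table above, and the logical operators (AND, OR) are not handled here.

```python
import re

# Stop words copied from the table above, stored in lower case.
STOP_WORDS = {
    "a", "about", "an", "and", "are", "as", "at", "be", "by", "for",
    "from", "i", "in", "is", "it", "of", "on", "or", "that", "the",
    "this", "to", "was", "what", "when", "where", "which", "who",
    "will", "with",
}

def effective_terms(query):
    """Return the terms that would actually be searched for.

    Every term is lower-cased.  Stop words are dropped unless they are
    forced with a leading plus sign or sit inside a quoted phrase.
    """
    terms = []
    # Each match is either a quoted phrase or a run of non-space characters.
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            terms.append('"' + phrase.lower() + '"')   # bound phrase kept whole
        elif word.startswith("+") and len(word) > 1:
            terms.append(word[1:].lower())             # forced stop word
        elif word.lower() not in STOP_WORDS:
            terms.append(word.lower())                 # ordinary search word
    return terms
```

Under this sketch, effective_terms("already paid for") keeps only "already" and "paid", while both effective_terms("already paid +for") and effective_terms('"already paid for"') retain the word "for".
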
Basic Search Features

  • Default logical AND operation. When the user puts more than one word into the search box, the engine assumes that the user wishes for the two (or more) words to be ANDed together.  In other words, the default logical operation used by Google in the absence of any specification at all is the logical AND operator.  Please also note that the logical operators are always specified in capital letters by the user.

    cats dogs

    . . .
    will retrieve the same results as . . .

    cats AND dogs

  • Logical OR condition.  When the user seeks to expand search results by giving Google several alternative words, such as a list of synonyms or near-synonyms, the user should specify the OR logical operator between each pair of words.  Unlike the AND operator, the OR operator is never implied: one must indicate the logical OR by placing it, in capital letters, between two words:

    cats OR dogs

  • Logical negation.  The last logical operation--negation--is performed in Google by placing a minus sign directly in front of the word (no space between the word and the minus sign!).  Therefore, were we searching for web pages that were about cats, but not about dogs, we would use this formulation:

cats -dogs

  • Phrase Binding.  To ask Google to search for a string of words occurring in a fixed order, enclose the words (the phrase) in quotes. 

As shown above in the discussion of stop words, a bound phrase can include stop words that Google would normally leave out of the search:

"gone with the wind"

"vitamin a"

This phrase searching technique is particularly useful for finding a well-known text by way of one of its distinctive phrases.  Were you searching for a copy of Lincoln's Gettysburg Address, for example, you could specify . . .

"four score and seven years ago our fathers"

Punctuation, by the way, is ignored by Google, so this specification, with a comma added, returns the same result:

"four score and seven years ago, our fathers"

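The search features above behave like set operations, which the following Python sketch illustrates on a tiny, made-up collection of "pages."  It is a toy model under stated assumptions (the page texts and the names PAGES, words, pages_with and pages_with_phrase are all hypothetical), not a description of how Google is implemented.

```python
import re

# A tiny, made-up collection standing in for web pages.
PAGES = {
    "page1": "Cats and dogs can share a home.",
    "page2": "Cats sleep most of the day.",
    "page3": "Dogs enjoy long walks.",
    "page4": "Four score and seven years ago, our fathers brought forth",
}

def words(text):
    """Lower-case the text and drop punctuation, keeping word order."""
    return re.findall(r"[a-z0-9]+", text.lower())

def pages_with(word):
    """Names of the pages whose text contains the word (any capitalization)."""
    return {name for name, text in PAGES.items() if word.lower() in words(text)}

def pages_with_phrase(phrase):
    """Names of the pages containing the phrase's words in the same order."""
    target = words(phrase)
    if not target:
        return set()
    hits = set()
    for name, text in PAGES.items():
        seq = words(text)
        # Slide a window of the phrase's length across the page's words.
        for start in range(len(seq) - len(target) + 1):
            if seq[start:start + len(target)] == target:
                hits.add(name)
                break
    return hits

both = pages_with("cats") & pages_with("dogs")        # cats dogs, or cats AND dogs
either = pages_with("cats") | pages_with("dogs")      # cats OR dogs
only_cats = pages_with("cats") - pages_with("dogs")   # cats -dogs
```

Because words() discards punctuation, pages_with_phrase gives the same answer for the Gettysburg phrase with or without the comma.
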
Advanced Search Mechanisms

Advanced Search Page

  • Google gives users a menu-driven means of using some of its advanced features, although search word qualifiers can also be typed directly into the Basic Search box.  We will first deal with Google's menu page of advanced search features.

    By clicking on the "Advanced Search" text link on the right side of the main page's search box (circled item #3 in the illustration), one gets the Advanced Search page.

  • Find results area:  The first area, on a blue background, corresponds to the logical operations supported by Google in its basic search box:

    with all of the words             is comparable to the AND search
    with the exact phrase             is comparable to phrase searching
    with at least one of the words    is comparable to the OR search
    without the words                 is comparable to negation of a term

  • Language area:  The row just under the Find results area contains the language specification of the web page content.  By default, it is set to "any language."  But it can be set to any one of a very large number of specific languages (over 30).
     

  • File Format area:  By default, Google searches for your specification in any of a number of different file formats, including, of course, HTML.  However, you can restrict the search to one specific file format by choosing from this list:

any format
Adobe PDF (pdf)
Adobe Postscript (ps)
Microsoft Word (doc)
Microsoft Excel (xls)
Microsoft Powerpoint (ppt)
Rich Text Format (rtf)

There are a number of other file formats that can be searched on if one uses the term qualification technique, which is presented later in this tutorial.

  • Date area:  This feature allows the user to specify, from a choice of four categories, how recently the pages in the search results must have been updated:

anytime (default)
past 3 months
past 6 months
past year

  • Occurrences area:  This feature allows the user to indicate where the search words must occur, either on the page itself or in links to the page:

anywhere in the page
in the title of the page
in the text of the page
in the URL of the page
in links to the page

  • Domain area:  Here you are allowed to specify that your results can only come from, or must be excluded from coming from, a particular domain (com, edu, gov, mil, net, . . . .).
     

  • SafeSearch area:  This area tells Google to exclude from your search results sites that contain pornography and explicit sexual content.  Through the SafeSearch section of the Preferences page (reached from the Preferences link to the right of the main page's search box), you may set the strictness of Google's exclusionary criteria to one of three levels:

    exclusions turned off
    moderate strictness
    very strict
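
The correspondence between the four "Find results" fields and the basic search syntax, described earlier in this list, can be sketched as a small Python function.  The function name build_query and its parameter names are hypothetical, and the sketch assumes that an OR group may simply be placed next to the other terms; it does not model the language, file format, date, occurrences, domain, or SafeSearch settings.

```python
def build_query(all_words="", exact_phrase="", any_words="", without_words=""):
    """Translate the four 'Find results' fields into one basic search string."""
    parts = []
    # "with all of the words": plain terms, ANDed together by default.
    parts.extend(all_words.split())
    # "with the exact phrase": a phrase bound in quotes.
    if exact_phrase.strip():
        parts.append('"' + exact_phrase.strip() + '"')
    # "with at least one of the words": terms joined by a capitalized OR.
    alternatives = any_words.split()
    if alternatives:
        parts.append(" OR ".join(alternatives))
    # "without the words": each term negated with a leading minus sign.
    parts.extend("-" + word for word in without_words.split())
    return " ".join(parts)
```

For example, build_query(exact_phrase="gone with the wind", without_words="movie") produces the string "gone with the wind" -movie, quotes included.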

Term Qualification in Basic Search

The original usability philosophy of Google was "make Google easy to search, and give the user advanced options in a menu-driven format."  So, instead of forcing users to know how to use the logical operators, Google's leadership team decided to offer a default AND logical operator, and to allow users to combine terms in other ways through a menu of specific alternatives (the Advanced Search menu page).  In the background, however, Google had a list of term qualification devices that could do the same things (and sometimes more).  That is what this section is about: search syntax that is not actively promoted to Google's user community.

A frustrating characteristic of Google for proficient, professional searchers is the existence of these "undocumented" search techniques, which Google only slowly and cautiously brings to the attention of its users for retrieval purposes.   But then again, Google's management team probably doesn't want to complicate the simple searching done by the vast majority of its users (the "80/20 rule" applies: 80% of all Google searches are performed using only 20% of its search features).

However, there are a number of command-line term qualifiers that work in Google's search syntax (and are being written about by authors of recent trade books).  These qualifiers are described in this section.

Please note that the form of all Google qualifiers is

qualifier_word:search_words_or_phrase  

as in

intitle:admissions

(For further examples of the special, advanced term qualification features discussed below, see Google's own explanations on its Advanced Search Operators page.)
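
As a minimal sketch of the qualifier form shown above, the following Python helper joins a qualifier word and its search words with a colon, binding multi-word values in quotes.  The helper name qualify is hypothetical; it only formats text and does not check whether Google recognizes the qualifier.

```python
def qualify(qualifier, value):
    """Format a term qualifier as qualifier:value, with no space after the colon.

    A value of more than one word is bound in quotes so that the whole
    phrase, not just its first word, belongs to the qualifier.
    """
    value = value.strip()
    if any(ch.isspace() for ch in value):
        value = '"' + value + '"'
    return qualifier + ":" + value

# Qualified terms can be combined with ordinary search words.
query = '"customer service" ' + qualify("site", "www.ibm.com")
```

Here qualify("intitle", "admissions") gives intitle:admissions, and qualify("intitle", "help page") gives intitle:"help page".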

  • intitle qualifier:  Use this search qualifier to restrict the search to words that appear in a web page's title field.  (Before using this qualifier, please remember that the contents of the title field probably do not appear on the web page itself.  The title field is part of a page's standard HTML coding, but it is not necessarily also displayed in the viewable contents of the page.)

intitle:"help page"

  • inurl qualifier:  Use this qualifier to restrict the search results to pages or documents whose URL contains the specified search word:

inurl:ibm

  • site qualifier:  Use this qualifier to restrict the results to a given site. For instance, this search will find pages with the phrase "customer service" that are also within the IBM domain:

"customer service" site:www.ibm.com

Another example of the site search is searching a university's domain (say, www.ou.edu) for something about the Sociology Department:

site:www.ou.edu "sociology department"

  • info qualifier:  The info qualifier presents some very abbreviated information that the Google database has about a specified web page:

info:www.ibm.com

  • related qualifier:  This qualifier will show you web pages that are "similar" to a specified web page. 

related:www.nbc.com

Indeed, the related qualifier produces the same results as clicking on the "Similar pages" link on the last line of an entry in a Google search response set.

What you would expect to get in a response to this query is other "entities" (in this case, broadcast companies) in this category.  If you notice in the next to the last line of the entry shown above, the breakdown of the category is Television > Networks > NBC.  NBC is a specific entity within the category Networks: you would therefore expect the related qualifier (or the "Similar pages" link at the end of a search response entry) to return other specifics within the Networks category.

  • link qualifier:  Interestingly, this qualifier returns a list of web pages that link to the web site you specify.  In other words, if you would like to see who is linking to a website, use this qualifier:

link:www.ou.edu

  • cache qualifier:  This qualifier returns the most recent copy of the page that Google has stored and used for its indexing.  

cache:www.cnn.com

Like "Similar pages," this information is also available from a link on the last line of a response entry (see illustration above).
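
To see how a query mixing the qualifiers above with ordinary search words breaks down, here is a rough Python sketch that separates the two.  It is an illustration under assumptions, not Google's parser: the names KNOWN_QUALIFIERS, TOKEN and parse_query are hypothetical, and the qualifier list contains only the seven qualifiers covered in this section.

```python
import re

# Only the qualifiers discussed in this section.
KNOWN_QUALIFIERS = {"intitle", "inurl", "site", "info", "related", "link", "cache"}

# An optional "word:" prefix, followed by a quoted phrase or a run of non-spaces.
TOKEN = re.compile(r'(?:(\w+):)?(?:"([^"]+)"|(\S+))')

def parse_query(query):
    """Split a query into (qualifier, value) pairs and plain search terms."""
    qualified, plain = [], []
    for match in TOKEN.finditer(query):
        qualifier, phrase, word = match.groups()
        if qualifier and qualifier.lower() in KNOWN_QUALIFIERS:
            qualified.append((qualifier.lower(), phrase or word))
        else:
            # No recognized qualifier: keep the text exactly as typed.
            plain.append(match.group(0))
    return qualified, plain
```

With this sketch, a prefix that is not on the list is left as plain text, which is one reason to check the spelling of a qualifier when a search returns unexpected results.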

[stopped here, Sept 24, 2003]

Google's Results Layout

How to Interpret Your Search Result

Other Features in the Background

  • Phone numbers

  • Addresses

  • Street maps

  • Stock quotes

  • Translations

  • Dictionary definitions

  • Synonyms

  • Calculator

Still Farther in the Background: the Labs

http://labs.google.com/
