Publishing Best Practices
When working with the Google Search Appliance, use these tips and
guidelines provided by Google to improve the search experience for
users trying to find your content.
Make web pages for users, not for search engines
Create a useful, information-rich site. Write pages that clearly and
accurately describe your content. Don't load pages with irrelevant words. Think
about the words users would type to find your pages, and make sure that your
site actually includes those words within it.
Focus on text
Focus on the text on your site. Make sure that your TITLE tags and ALT
attributes are descriptive and accurate. Because the Google crawler doesn't
recognize text contained in images, avoid graphical text and instead place the
information in the alt text and anchor text of pictures. When linking to
non-HTML documents, use anchor text that clearly describes the document being
linked to.
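One way to audit a page for the advice above is to list images that have no usable alt text. The following is a minimal sketch using only the Python standard library; the class name, function name, and sample markup are illustrative assumptions, not part of any Google tool.

```python
from html.parser import HTMLParser


class AltTextChecker(HTMLParser):
    """Collects the src of every <img> that lacks a non-empty alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # A valueless or absent alt attribute comes back as None.
        alt = attr_map.get("alt") or ""
        if not alt.strip():
            self.missing_alt.append(attr_map.get("src") or "(no src)")


def images_missing_alt(html):
    """Return the src values of images whose alt text is missing or blank."""
    checker = AltTextChecker()
    checker.feed(html)
    checker.close()
    return checker.missing_alt


# Hypothetical sample markup: the second image has no alt text.
sample = (
    '<p><img src="logo.gif" alt="Columbia University logo">'
    '<img src="banner.gif"></p>'
)
print(images_missing_alt(sample))  # ['banner.gif']
```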
Make your site easy to navigate
Make a site with a clear hierarchy of hypertext links. Every page should be
reachable from at least one hypertext link. Offer a site map to your users with
hypertext links that point to the important parts of your site. Keep the links
on a given page to a reasonable number (fewer than 100).
Ensure that your site is linked
Ensure that your site is linked from all relevant sites within your network.
Links between sites and within sites give the Google crawler more ways to find
content and improve the quality of search results.
Make sure that the Google crawler can read your content
Validate all HTML content to ensure that the HTML is well-formed. Use a text
browser such as Lynx to examine your site, because most search engine spiders
see your site much as Lynx would. If extra features such as JavaScript,
cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your
site in a text browser, then search engine crawlers may have trouble crawling
your site.
Allow search bots to crawl your sites without session IDs or arguments that
track their path through the site. These techniques are useful for tracking
individual user behavior, but the access pattern of bots is entirely different.
Using these techniques may result in multiple copies of the same document being
indexed for your site, as crawl robots will see each unique URL (including
session ID) as a unique document.
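The duplicate-document effect can be illustrated with a short sketch. It assumes the session ID travels in a query parameter; the parameter names and URLs below are hypothetical, and the helper only models how a crawler that compares raw URLs sees the pages.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; real sites use whatever their framework emits.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}


def strip_session_params(url):
    """Return the URL with any session-tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


# The same document reached three times, twice with different session IDs.
crawled = [
    "http://www.example.edu/news/story.html?sid=abc123",
    "http://www.example.edu/news/story.html?sid=def456",
    "http://www.example.edu/news/story.html",
]

print(len(set(crawled)))                               # 3 distinct URLs
print(len({strip_session_params(u) for u in crawled}))  # 1 underlying document
```

A crawler cannot reliably guess which parameters are session IDs, which is why the fix belongs on the publishing side: serve bots the plain URL.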
Ensure that your site's internal link structure provides a hypertext link path
to all of your pages. The Google search engine follows hypertext links from one
page to the next, so pages that are not linked to by others may be missed.
Additionally, you should consult the administrator of your Google Search
Appliance to ensure that your site's home page is accessible to the search
engine.
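Whether every page has a link path from the home page can be checked with a breadth-first walk over the site's link graph. This is a sketch under the assumption that the links have already been collected into a dictionary; the page names are hypothetical.

```python
from collections import deque


def unreachable_pages(links, start):
    """Return the pages that no chain of links from `start` reaches."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # Every page that appears as a source or as a link target.
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return sorted(all_pages - seen)


# Hypothetical site: each key is a page, each value lists the pages it links to.
site = {
    "/index.html": ["/about.html", "/courses/index.html"],
    "/about.html": ["/index.html"],
    "/courses/index.html": ["/courses/fall.html"],
    "/courses/fall.html": [],
    "/archive/old-news.html": ["/index.html"],  # links out, but nothing links in
}

print(unreachable_pages(site, "/index.html"))  # ['/archive/old-news.html']
```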
Use robots standards to control search engine interaction with your content
Make use of the robots.txt file on your web server. This file tells crawlers
which files and directories can or cannot be crawled, including various file
types. If the search engine gets an error when getting this file, no content
will be crawled on that server. The robots.txt file will be checked on a
regular basis, but changes may not have immediate results. Each port (including
HTTP and HTTPS) requires its own robots.txt file.
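To see how a set of robots.txt rules would be interpreted before publishing them, the rules can be fed to the parser in Python's standard library. The rules, the URLs, and the "gsa-crawler" user-agent string below are assumptions for illustration; check with your administrator for the user agent your appliance actually sends.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every crawler out of two directories.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "gsa-crawler" is an assumed user-agent name used only for this example.
home_allowed = parser.can_fetch("gsa-crawler", "http://www.example.edu/index.html")
private_allowed = parser.can_fetch(
    "gsa-crawler", "http://www.example.edu/private/notes.html"
)

print(home_allowed)     # True
print(private_allowed)  # False
```

The standard-library parser follows the original exclusion standard; a particular crawler may support extensions that this check does not model.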
Use robots meta tags to control whether individual documents are indexed,
whether the links on a document should be crawled, and whether the document
should be cached. The "NOARCHIVE" value for robots meta tags is supported by
the Google search engine to block cached content, even though it is not
mentioned in the robots standard.
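The directives a page declares can be read back with a small parser, which is a convenient way to confirm that a template emits what you intended. This is a minimal sketch; the class name, function name, and sample page are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Gathers the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def robots_directives(html):
    """Return the lower-cased set of robots meta directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical page: indexable, but links are not followed and no cached copy.
page = (
    '<html><head><meta name="ROBOTS" content="INDEX, NOFOLLOW, NOARCHIVE">'
    "</head><body></body></html>"
)
found = robots_directives(page)
print("may index:       ", "noindex" not in found)
print("may follow links:", "nofollow" not in found)
print("may cache:       ", "noarchive" not in found)
```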
For information on how robots.txt files and ROBOTS meta tags work, see the
Robots Exclusion standard at:
http://www.robotstxt.org/wc/exclusion.html
If the search engine is generating too much traffic on your site during peak
hours, contact your Google Search Appliance administrator to have the crawl
load for your site adjusted.
Let the search engine know how fresh your content is
Make sure your web server supports the If-Modified-Since HTTP header. This
feature allows your web server to tell the Google Search Appliance whether your
content has changed since it last crawled your site. Supporting this feature
saves you bandwidth and overhead. Columbia's web servers support this feature.
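The decision a web server makes when it receives an If-Modified-Since header can be sketched as follows. This illustrates the HTTP conditional-request logic only and is not how any particular server implements it; the function name and dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def response_status(last_modified, if_modified_since=None):
    """Return 304 if the document is unchanged since the given date, else 200."""
    if if_modified_since is None:
        return 200  # unconditional request: send the full document
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: ignore it and send the document
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if last_modified.replace(microsecond=0) <= since:
        return 304  # Not Modified: no body is sent
    return 200


# Hypothetical document timestamp and the header a crawler would send back.
modified = datetime(2008, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
header = format_datetime(modified, usegmt=True)

print(response_status(modified, header))                       # 304
print(response_status(modified + timedelta(days=1), header))   # 200
```

A 304 response carries no body, which is where the bandwidth saving comes from.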
Understand why some documents may be missing from the index
Each time that the Google Search Appliance updates its database of web pages,
the documents in the index can change. Here are a few examples of reasons why
pages may not appear in the index.
- Your content pages may have been intentionally blocked by a robots.txt file
  or ROBOTS meta tags.
- Your web site was inaccessible when the crawl robot attempted to access it,
  due to network or server outage. If this happens, the Google Search Appliance
  will retry multiple times; but if the site cannot be crawled, it will not be
  included in the index.
- The Google crawl robot cannot find a path of links to your site from the
  starting points it was given.
- Your content pages may not be considered relevant to the query you entered.
  Ensure that the query terms exist on your target page.
- Your content pages contain invalid HTML code.
- Your content pages were manually removed from the index by the Google Search
  Appliance administrator.
If you still have questions, contact your Google Search Appliance administrator
to get more information.
Avoid using frames
The Google search engine supports frames to the extent that it can. Frames tend
to cause problems with search engines, bookmarks, e-mail links and so on,
because frames don't fit the conceptual model of the web (where every document
corresponds to a single URL).
Searches that match framed pages will most likely produce hits only against
the "body" HTML page and return it without the original framed "Menu" or
"Header" pages. Google recommends that you use tables or dynamically generate
content into a single page (using ASP, JSP, PHP, etc.) instead of using FRAME
tags. This preserves the content owner's intended look and feel and allows
most search engines to index your content properly.
Avoid placing content and links in script code
Most search engines do not read any information found in SCRIPT tags within an
HTML document. This means that content within script code will not be indexed,
and hypertext links within script code will not be followed when crawling. When
using a scripting language, make sure that your content and links are outside
SCRIPT tags. Investigate HTML-based alternatives to script-generated dynamic
pages, such as HTML layers.
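The effect can be reproduced with a simple link extractor that, like the crawlers described above, does not execute script code. This is a sketch using Python's standard HTML parser as a stand-in for a crawler; the sample page and names are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags found in the HTML markup itself."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return the links a non-scripting parser can see in `html`."""
    extractor = LinkExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.links


# Hypothetical page: one plain link, one link written out by script code.
page = """
<html><body>
<a href="/visible.html">Visible page</a>
<script type="text/javascript">
document.write('<a href="/hidden.html">Hidden page</a>');
</script>
</body></html>
"""

print(extract_links(page))  # ['/visible.html']
```

The parser treats everything between the SCRIPT tags as opaque text, so the second link is never reported.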