
The ultimate guide to bot herding and spider wrangling – Part Two


In the first part of our three-part series, we learned what bots are and why crawl budgets are important. Let's take a look at how to let search engines know what's important and some common coding issues.

How to let search engines know what's important

When a bot crawls your site, a number of signposts direct it through your files.

Like humans, bots follow links to get a sense of the information on your site. But they also scan your code and directories for specific files, tags, and elements. Let's take a look at a number of these elements.

Robots.txt

The first thing that a bot will look for on your site is your robots.txt file.

For complex sites, a robots.txt file is essential. For smaller sites with just a handful of pages, a robots.txt file may not be necessary – without one, search engine bots will simply crawl everything on your site.
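If you do use one, the file must sit at the root of your domain (e.g., yourdomain.com/robots.txt). As an illustrative sketch (not taken from any site discussed here), a minimal robots.txt that lets every bot crawl everything uses an empty disallow value:

User-agent: *
Disallow: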

There are two main ways to guide robots using your robots.txt file.


1. First, you can use the disallow directive. This instructs bots to ignore specific URLs, files, file extensions, or even whole sections of your site:

User-agent: Googlebot
Disallow: /example/

Although the disallow directive keeps bots from crawling certain parts of your site (thereby saving crawl budget), it will not necessarily prevent those pages from being indexed and displayed in search results:

The cryptic and unhelpful "No information is available for this page" message is not something you will want to see in your search listings.


The above example occurred because of this disallow directive in census.gov/robots.txt:


User-agent: Googlebot
Crawl-delay: 3

Disallow: /cgi-bin/
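Disallow rules can also match URL patterns. Google and Bing both support the * wildcard (any sequence of characters) and the $ end-of-URL anchor, so a sketch like the following – the paths here are placeholders, not taken from the census.gov file above – would block an entire directory plus every PDF on a site:

User-agent: *
Disallow: /private/
Disallow: /*.pdf$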

2. Another method is to use the noindex directive. Noindexing a certain page or file will not stop it from being crawled; however, it will stop it from being indexed (or remove it from the index). This robots.txt directive is unofficially supported by Google, and is not supported by Bing at all (so be sure to have a User-agent: * set of disallows for Bingbot and other bots other than Googlebot):

User-agent: Googlebot
Noindex: /example/
User-agent: *
Disallow: /example/
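Given that spotty support, the officially supported way to keep a page out of the index is a robots meta tag in the page's HTML head rather than a robots.txt rule. A minimal sketch:

<meta name="robots" content="noindex">

Note that a bot can only see this tag if it is allowed to crawl the page, so pairing an on-page noindex with a robots.txt disallow for the same URL may mean the noindex is never read.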

Obviously, since these pages are …

[Read the full article on Search Engine Land.]


The opinions expressed in this article are those of the guest author and not necessarily Marketing Land. Staff authors are listed here.


About the author

Stephan Spencer is the creator of the 3-day immersive SEO seminar Traffic Control; an author of the O'Reilly books The Art of SEO, Google Power Search and Social eCommerce; founder of the SEO agency Netconcepts (acquired in 2010); inventor of the SEO proxy technology GravityStream; and the host of two podcast shows, The Optimized Geek and Marketing Speak.
