
The Robots.txt Protocol Standard (with the Original English Text): Solving Your Search Engine Indexing Problems

Posted by zoozi | 2007-6-16

Robots.txt is a plain-text file placed in the root directory of a site. Although it is simple to set up, it is powerful. Webmasters in China do not seem to pay enough attention to robots.txt, yet its importance should not be overlooked. This article pairs a plain-language walkthrough of how to write robots.txt with the original English text of the Robots.txt protocol standard.

Search engines use spider programs to automatically visit pages on the Internet and collect page information. When a spider visits a site, it first checks whether a plain-text file named robots.txt exists under the site's root domain. You can create such a robots.txt file on your site to declare which parts of the site you do not want robots to visit, or to specify that search engines should index only certain parts.
Note that you only need a robots.txt file if your site contains content you do not want search engines to index. If you want search engines to index everything on your site, do not create a robots.txt file at all, or create one with empty content. Also remember that robots.txt must be placed in the root directory of the site, and the filename must be entirely lowercase.
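
As a minimal sketch (using the hypothetical domain www.example.com), a robots.txt that lets everything be indexed contains only an empty Disallow rule, and it must be reachable at http://www.example.com/robots.txt rather than in any subdirectory:

User-agent: *
Disallow: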

Baiduspider normally fetches a site's robots.txt about once a day, so changes to robots.txt generally take effect within two working days, i.e. 48 hours. Keep in mind, however, that using robots.txt to remove content Baidu has already indexed may take several months.

Examples of robots.txt usage
Example 1:
User-agent: *
Disallow: /images/123/
Disallow: /123/
Disallow: /123.html
Block all search engine spiders from crawling the "/images/123/" directory, the "/123/" directory, and the file /123.html.

Example 2:
User-agent: 123
Disallow: /123/
Block only the spider named "123" from crawling the "/123/" directory; spiders with other names are not restricted by this record.
 

Example 3: Block every search engine from crawling the entire site:
User-agent: *
Disallow: /

Example 4:
User-agent: *
Disallow: /*?*
Disallow: /123/*.htm
Block access to all dynamic pages on the site, and to every URL ending in ".htm" under the "/123/" directory (including subdirectories). Note that wildcard patterns such as "*" and "?" are extensions supported by major engines like Google and Baidu; they are not part of the original 1994 standard reproduced below.
 

The original English text of the Robots.txt protocol standard follows:

A Standard for Robot Exclusion

Table of contents:

  • Status of this document
  • Introduction
  • The Method
  • The Format
  • Examples

Status of this document

This document represents a consensus on 30 June 1994 on the robots mailing list (robots-request@nexor.co.uk), between the majority of robot authors and other people with an interest in robots. It has also been open for discussion on the Technical World Wide Web mailing list (www-talk@info.cern.ch). This document is based on a previous working draft under the same title.

It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody, and there is no guarantee that all current and future robots will use it. Consider it a common facility the majority of robot authors offer the WWW community to protect WWW servers against unwanted accesses by their robots.

The latest version of this document can be found on http://www.robotstxt.org/wc/robots.html.

Introduction

WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page.

In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

The Method

The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt". The contents of this file are specified below.

This approach was chosen because it can be easily implemented on any existing WWW server, and a robot can find the access policy with only a single document retrieval.

A possible drawback of this single-file approach is that only a server administrator can maintain such a list, not the individual document maintainers on the server. This can be resolved by a local process to construct the single file from a number of others, but if, or how, this is done is outside of the scope of this document.

The choice of the URL was motivated by several criteria:

  • The filename should fit in file naming restrictions of all common operating systems.
  • The filename extension should not require extra server configuration.
  • The filename should indicate the purpose of the file and be easy to remember.
  • The likelihood of a clash with existing files should be minimal.

The Format

The format and semantics of the "/robots.txt" file are as follows:

The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is case insensitive.

Comments can be included in the file using UNIX Bourne shell conventions: the '#' character is used to indicate that preceding space (if any) and the remainder of the line up to the line termination is discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a record boundary.

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.

User-agent
The value of this field is the name of the robot the record is describing access policy for.

If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.
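
A minimal sketch of such a record, using two hypothetical robot names, is:

User-agent: examplebot
User-agent: otherbot
Disallow: /tmp/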

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
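
For example, under this recommendation a record starting with "User-agent: cybermapper" should also apply to a robot that identifies itself with a version string such as "CyberMapper/1.2" (the version string here is hypothetical).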

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
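
As an additional sketch along the same lines (the extra paths are hypothetical), matching is a plain prefix test on the URL path:

User-agent: *
Disallow: /help # blocks /help.html, /help/index.html, and also /helpdesk.html
Disallow: /tmp/ # blocks /tmp/log.txt but not /tmp.html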

An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.

The presence of an empty "/robots.txt" file has no explicit associated semantics, it will be treated as if it was not present, i.e. all robots will consider themselves welcome.

Examples

The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/", or /foo.html:

# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":

# robots.txt for http://www.example.com/

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:

This example indicates that no robots should visit this site further:

# go away
User-agent: *
Disallow: /

If your English is not strong, the original text above may be a bit hard to follow, so now is a good time to brush up. That's it for this article; go set up your robots.txt file now. Let's go!