robots.txt and PHP

PHP: Parsing robots.txt

If you’re writing any kind of script that involves fetching HTML pages or files from another server, you really need to make sure that you follow netiquette, the "unofficial rules defining proper behaviour on the Internet".

This means that your script needs to:

  1. identify itself using the User Agent string including a URL;
  2. check the site's robots.txt file to see if they want you to have access to the pages in question; and
  3. not flood their server with too-frequent, repetitive or otherwise unnecessary requests.

If you don’t meet these requirements then don’t be surprised if they retaliate by blocking your IP address and/or filing a complaint. This article presents methods for achieving the first two goals, but the third is up to you.

Setting a User Agent

Before using any of the PHP file functions on a remote server you should decide on and set a sensible User Agent string. There are no real restrictions on what this can be, but some commonality is beginning to emerge.


The following formats are widely recognised:

  • www.example.net
  • NameOfAgent (http://www.example.net)
  • NameOfAgent/1.0 (http://www.example.net/bot.html)
  • NameOfAgent/1.1 (link checker; http://www.example.net/bot.html)
  • NameOfAgent/2.0 (link checker; http://www.example.net; webmaster@example.net)

The detail you provide should be proportionate to the amount of activity you’re going to generate on the targeted sites/servers. The NameOfAgent value should be chosen with care as there are a lot of established user agents and you don’t want to have to change this later. Check your server log files and our directory of user agents for examples.

Once you’ve settled on a name, using it is as simple as adding the following line to the start of your script:
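One such line (the agent string below is just an example) sets PHP's user_agent value, which the built-in file functions then send with every request:

ini_set('user_agent', 'NameOfAgent/1.0 (http://www.example.net/bot.html)');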

By passing a User Agent string with all requests you run less risk of your IP address being blocked, but you also take on some extra responsibility. People will want to know why your script is accessing their site. They may also expect it to follow any restrictions defined in their robots.txt file.

Parsing robots.txt

That brings us to the purpose of this article — how to fetch and parse a robots.txt file.

The following script is useful if you only want to fetch one or two pages from a site (to check for links to your site for example). It will tell you whether a given user agent can access a specific page.

For a search engine spider or a script that intends to download a lot of files you should implement a caching mechanism so that the robots.txt file only needs to be fetched once every day or so.
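A very simple approach is a file-based cache keyed on the host name. The function below is only a sketch: the cache location, the lifetime and the http_get_contents() helper (sketched later in this article) are all illustrative choices, not part of the original script.

// Sketch only: keep a per-host copy of robots.txt in the system temp directory.
function cached_robots_txt($host, $maxAge = 86400)
{
    $cacheFile = sys_get_temp_dir() . '/robots_' . md5($host) . '.txt';
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAge) {
        return file_get_contents($cacheFile);   // cached copy is still fresh
    }
    $robots = http_get_contents('http://' . $host . '/robots.txt');   // cURL helper, see below
    file_put_contents($cacheFile, (string) $robots);
    return $robots;
}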

This script is designed to parse a well-formed robots.txt file with no in-line comments. Each call to the script will result in the robots.txt file being downloaded again. A missing robots.txt file or a Disallow statement with no argument will result in a return value of TRUE granting access.

We have recently rewritten this code to remove file() and file_get_contents(), which are now blocked on many PHP servers, replacing them with our own http_get_contents() function. We have also enabled the following of redirects (CURLOPT_FOLLOWLOCATION).
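The original listing is not reproduced here, so the following is only a minimal sketch of what a robots_allowed() function along those lines might look like. The http_get_contents() helper, the variable names and the regular expressions are illustrative rather than the article's exact code.

// Simple cURL-based fetcher; follows redirects (CURLOPT_FOLLOWLOCATION)
// and sends the User Agent configured with ini_set('user_agent', ...).
function http_get_contents($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, ini_get('user_agent'));
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

// Sketch of the behaviour described above: TRUE means access is allowed.
function robots_allowed($url, $useragent = false)
{
    // Split the URL into scheme, host and path.
    $parsed = parse_url($url);
    $path = isset($parsed['path']) ? $parsed['path'] : '/';

    // Which user agents do we check rules for? Always '*', plus ours if given.
    $agents = array(preg_quote('*'));
    if ($useragent) {
        $agents[] = preg_quote($useragent, '/');
    }
    $agents = implode('|', $agents);

    // Fetch robots.txt; a missing file means everything is allowed.
    $robotstxt = http_get_contents($parsed['scheme'] . '://' . $parsed['host'] . '/robots.txt');
    if (!$robotstxt) {
        return true;
    }

    $rules = array();
    $ruleApplies = false;
    foreach (explode("\n", $robotstxt) as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue;   // skip blank lines and comments
        }
        if (preg_match('/^User-agent:\s*(.*)/i', $line, $match)) {
            // Does this block apply to '*' or to the named agent?
            $ruleApplies = (bool) preg_match("/($agents)/i", $match[1]);
        } elseif ($ruleApplies && preg_match('/^Disallow:\s*(.*)/i', $line, $match)) {
            $value = trim($match[1]);
            if ($value === '') {
                return true;   // an empty Disallow grants access
            }
            $rules[] = preg_quote($value, '/');
        }
    }

    // If any applicable Disallow rule matches the start of the path, block it.
    foreach ($rules as $rule) {
        if (preg_match('/^' . $rule . '/', $path)) {
            return false;
        }
    }
    return true;
}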

The script can be called as follows:

$canaccess = robots_allowed("http://www.example.net/links.php");
$canaccess = robots_allowed("http://www.example.net/links.php", "NameOfAgent");

If you don’t pass a value for the second parameter then the script will only check for global rules — those under ‘*’ in the robots.txt file. If you do pass the name of an agent then the script also finds and applies rules specific to that agent.
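As an illustration, given the hypothetical robots.txt below, robots_allowed("http://www.example.net/links.php") returns TRUE because the global block doesn't mention /links.php, while robots_allowed("http://www.example.net/links.php", "NameOfAgent") returns FALSE:

User-agent: *
Disallow: /cgi-bin/

User-agent: NameOfAgent
Disallow: /links.php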

For more information on the robots.txt format, see The Web Robots Pages at robotstxt.org.

Allowing for the Allow directive

A modified version of the code has been supplied by Eric at LinkUp.com. It fixes a bug where a missing (404) robots.txt file would result in a FALSE return value. It also adds extra code to cater for the Allow directive, which is now recognised by some search engines.

The 404 checking requires the cURL module to be compiled into PHP, and we haven't tested the Allow directive parsing ourselves, but I'm sure it works. Please report any transcription errors.

Another option for the last section might be to first sort the $rules by length and then only check the longest ones for an Allow or Disallow directive as they will override any shorter rules.
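That idea isn't part of Eric's code as described; a rough sketch of it (the rule format and the function name are made up for illustration) could look like this:

// Sketch: let the longest matching rule decide between Allow and Disallow.
// Each rule is assumed to be array('allow' => bool, 'path' => string).
function rule_decision(array $rules, $path)
{
    // Sort so that longer (more specific) paths are checked first.
    usort($rules, function ($a, $b) {
        return strlen($b['path']) - strlen($a['path']);
    });

    foreach ($rules as $rule) {
        if (strpos($path, $rule['path']) === 0) {
            return $rule['allow'];   // the first (longest) matching rule wins
        }
    }
    return true;   // no rule matched, so access is allowed
}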

Previously, robots.txt could only be used to Disallow spiders from accessing specific directories, or the whole website. The Allow directive lets you grant access back to specific subdirectories that would otherwise be blocked by Disallow rules.

You should be careful using this, however, as it's not part of the original standard and not all search engines will understand it. On the other hand, if you're running a web spider, taking Allow rules into account will give you access to more pages.
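For example, a robots.txt that blocks a directory but re-opens one subdirectory with Allow might look like this (the paths are illustrative):

User-agent: *
Allow: /archive/public/
Disallow: /archive/

With longest-match handling, /archive/public/index.html is allowed while /archive/2009/ remains blocked.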


Dynamic/automatic robots.txt file with PHP and .htaccess

Worried about accidentally publishing your test site's robots.txt to your live site and thereby blocking the search engines from your site?

It's an easy mistake to make and one that can be costly if not noticed right away. What if your robots.txt file adjusted itself automatically based on whether it is being served from the test site or the live site?

Here’s an easy way to do it with Apache’s .htaccess file and PHP.

.htaccess

Add this to your .htaccess file:

RewriteEngine on
RewriteRule ^robots.txt$ /robots.php [L]

If RewriteEngine on has already been called in your .htaccess, omit the first line. /robots.php can be changed to any PHP page.

PHP

Now create robots.php in the root (or whichever file/location you chose) and use PHP code along the following lines, adjusting the host-name check to your own test domain:

<?php if ($_SERVER['HTTP_HOST'] == 'test.example.com') { // replace with your own test host ?>
User-agent: *
Disallow: /
<?php } else { // Enter your live site robots.txt here ?>
User-agent: *
Disallow: /cms/
<?php } ?>

The above code should be self-explanatory and can be adapted to handle multiple hosts if you need to — but this simple example should be sufficient for most cases.
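If you do need to handle several hosts, one possible approach (the host names below are placeholders) is to switch on $_SERVER['HTTP_HOST'] inside robots.php:

<?php
// Sketch: serve a different robots.txt body per host. Host names are examples only.
header('Content-Type: text/plain');

switch ($_SERVER['HTTP_HOST']) {
    case 'test.example.com':
    case 'staging.example.com':
        // Block everything on non-production hosts.
        echo "User-agent: *\nDisallow: /\n";
        break;
    default:
        // Live site rules.
        echo "User-agent: *\nDisallow: /cms/\n";
        break;
}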

Once set up, when you visit robots.txt you will see that it adjusts itself automatically based on which site you are accessing. You never need to worry about uploading the wrong robots.txt file again!

Tim Bennett is a Leeds-based web designer from Yorkshire. He has a First Class Honours degree in Computing from Leeds Metropolitan University and currently runs his own one-man web design company, Texelate.


Configuring robots.txt via a PHP file for subdomains

If a project has several subdomains, sooner or later you will face the question of serving a different robots.txt file for each of them. But what do you do if .htaccess refuses to switch the file for each condition?

How do you give each subdomain its own robots.txt with the help of PHP and .htaccess?

What does this file look like? As a rule, its structure is something like this:

User-Agent: *
Disallow: *?*
Disallow: /video/
Sitemap: http://site.ru/sitemap.xml

The main reasons this article takes the PHP route:
1. you do not have to create a separate robots.txt file for every subdomain (a plus if there are more than three of them, for example regional subdomains);
2. on some hosting platforms, .htaccess does not process the rewrite rules properly.

Which rewrite rules do I mean? Well, something like this:

RewriteCond %{HTTP_HOST} ^site.ru$
RewriteRule ^robots.txt$ /robots-main.txt [L]
RewriteCond %{HTTP_HOST} ^subdomain.site.ru$
RewriteRule ^robots.txt$ /robots-subdomains.txt [L]

And now the solution.
Since I work with Bitrix, in my case this block of settings looks like this:

Options +FollowSymLinks
RewriteEngine On
RewriteBase /
RewriteRule ^robots.txt$ robots_for_domain.php [L]
...

The key line is the rewrite rule that sends every robots.txt request to the PHP script:

RewriteRule ^robots.txt$ robots_for_domain.php [L]

And robots_for_domain.php itself looks roughly like this (adjust the host-name check to your own main domain):

User-Agent: *
Disallow: *?*
Disallow: /video/
<?php if ($_SERVER['HTTP_HOST'] != 'site.ru') { // extra rules for subdomains only; 'site.ru' is the main domain ?>
Disallow: /news/
<?php } ?>
Sitemap: http://<?php echo $_SERVER['HTTP_HOST']; ?>/sitemap.xml
Host: http://<?php echo $_SERVER['HTTP_HOST']; ?>

Note that for the subdomains we added a check: if the request is not for the main site (site.ru), an extra Disallow rule is appended.

In this simple way we routed the handling of every robots.txt request through a single file.
In my view this method has only one significant drawback: if you need unique lines for one particular subdomain, you will have to add them to the PHP file using extra checks, switch statements and other such methods.

But that, as they say, is a completely different story 🙂

If your project needs a city or region selector that switches the visitor to a subdomain, feel free to contact me.



bopoda/robots-txt-parser

PHP class for parsing all directives from robots.txt files according to the specifications.


RobotsTxtParser — PHP class for parsing all the directives of a robots.txt file.

RobotsTxtValidator — PHP class for checking whether a URL is allowed or disallowed according to robots.txt rules.

Try the RobotsTxtParser demo online on live domains.

Parsing is carried out in accordance with the Google and Yandex specifications:

  1. Parses the Clean-param directive according to the Clean-param syntax.
  2. Deletes comments (everything following the '#' character, up to the first line break, is disregarded).
  3. Improved parsing of Host: this cross-section directive should refer to the user agent '*'; if there are multiple Host lines, search engines take the value of the first one.
  4. Unused methods have been removed from the class, the code has been refactored, and the scope of the class properties has been corrected.
  5. More test cases have been added, including test cases for the new functionality.
  6. The RobotsTxtValidator class has been added to check whether a URL may be crawled.
  7. With version 2.0, the speed of RobotsTxtParser was significantly improved.
The class defines constants for the supported directives:

  • DIRECTIVE_ALLOW = 'allow';
  • DIRECTIVE_DISALLOW = 'disallow';
  • DIRECTIVE_HOST = 'host';
  • DIRECTIVE_SITEMAP = 'sitemap';
  • DIRECTIVE_USERAGENT = 'user-agent';
  • DIRECTIVE_CRAWL_DELAY = 'crawl-delay';
  • DIRECTIVE_CLEAN_PARAM = 'clean-param';
  • DIRECTIVE_NOINDEX = 'noindex';

Install the latest version with

$ composer require bopoda/robots-txt-parser

Run the phpunit tests using the phpunit command.

You can start the parser by getting the content of a robots.txt file from a website:

$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
var_dump($parser->getRules());

Or simply pass the contents of the file as input (i.e. when the content is already cached):

$parser = new RobotsTxtParser("
    User-Agent: *
    Disallow: /ajax
    Disallow: /search
    Clean-param: param1 /path/file.php

    User-agent: Yahoo
    Disallow: /
    Host: example.com
    Host: example2.com
");
var_dump($parser->getRules());
array(2) {
  ["*"]=>
  array(3) {
    ["disallow"]=>
    array(2) {
      [0]=>
      string(5) "/ajax"
      [1]=>
      string(7) "/search"
    }
    ["clean-param"]=>
    array(1) {
      [0]=>
      string(21) "param1 /path/file.php"
    }
    ["host"]=>
    string(11) "example.com"
  }
  ["yahoo"]=>
  array(1) {
    ["disallow"]=>
    array(1) {
      [0]=>
      string(1) "/"
    }
  }
}

In order to validate a URL, use the RobotsTxtValidator class:

$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
$validator = new RobotsTxtValidator($parser->getRules());

$url = '/';
$userAgent = 'MyAwesomeBot';

if ($validator->isUrlAllow($url, $userAgent)) {
    // Crawl the site URL and do nice stuff
}

Feel free to create a PR in this repository. Please follow the PSR coding style.

See the list of contributors who participated in this project.

Please use version 2.0+, which works by the same rules but with much higher performance.
