minkphp/Mink
PHP web browser emulator abstraction
README.md
```php
<?php
use Behat\Mink\Mink,
    Behat\Mink\Session,
    Behat\Mink\Driver\GoutteDriver,
    Behat\Mink\Driver\Goutte\Client as GoutteClient;

$startUrl = 'http://example.com';

// init Mink and register sessions
$mink = new Mink(array(
    'goutte1' => new Session(new GoutteDriver(new GoutteClient())),
    'goutte2' => new Session(new GoutteDriver(new GoutteClient())),
    'custom'  => new Session(new MyCustomDriver($startUrl)),
));

// set the default session name
$mink->setDefaultSessionName('goutte2');

// visit a page
$mink->getSession()->visit($startUrl);

// calling getSession() without an argument always returns the default session,
// if one is set (goutte2 here)
$mink->getSession()->getPage()->findLink('Downloads')->click();
echo $mink->getSession()->getPage()->getContent();

// calling getSession() with an argument returns the session with that name
$mink->getSession('custom')->getPage()->findLink('Downloads')->click();
echo $mink->getSession('custom')->getPage()->getContent();

// all of this makes it possible to mix sessions
$mink->getSession('goutte1')->getPage()->findLink('Chat')->click();
$mink->getSession('goutte2')->getPage()->findLink('Chat')->click();
```
```shell
$> curl -sS https://getcomposer.org/installer | php
$> php composer.phar install
```
- Konstantin Kudryashov everzet [lead developer]
- Christophe Coevoet stof [lead developer]
- Alexander Obuhovich aik099 [lead developer]
- Other awesome developers
How to disguise your PHP script as a browser?
We’ve been using information from a site for a while now (something the site allows as long as you mention the source, which we do), and we’ve been copying the information by hand. As you can imagine, this becomes tedious pretty fast, so I’ve been trying to automate the process by fetching the information with a PHP script. The URL I’m trying to fetch is:
http://mediaforest.ro/weeklycharts/viewchart.aspx?r=WeeklyChartRadioLocal&y=2010&w=46 08-11-10 14-11-10
If I enter it in a browser it works; if I try a file_get_contents() I get Bad Request. I figured they check whether the client is a browser, so I rolled a cURL-based solution:
```php
$ch = curl_init();
$header = array(
    'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12',
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language: en-us,en;q=0.5',
    'Accept-Encoding: gzip,deflate',
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
    'Keep-Alive: 115',
    'Connection: keep-alive',
);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
$result = curl_exec($ch);
curl_close($ch);
```
I’ve checked, and the headers are identical to my browser’s headers, yet I still get Bad Request. So I tried another solution:
http://www.php.net/manual/en/function.curl-setopt.php#78046
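One thing worth ruling out before blaming browser detection: the URL as pasted above ends with "08-11-10 14-11-10", which looks like the page's date-range label fused onto the query string, and raw spaces in a request line are by themselves enough to make a server answer 400 Bad Request even though a browser would encode them for you. A sketch (parameter names taken from the URL above; the helper variables are mine) of building the query string safely:

```php
<?php
// Build the query string with http_build_query() so every value is URL-encoded;
// an unencoded space in the request line can itself trigger "Bad Request".
$params = array(
    'r' => 'WeeklyChartRadioLocal',
    'y' => 2010,
    'w' => 46,
);
$url = 'http://mediaforest.ro/weeklycharts/viewchart.aspx?' . http_build_query($params);

// If a value did contain spaces, rawurlencode() would escape them:
$label = rawurlencode('08-11-10 14-11-10');
```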
How do I fully emulate a browser in PHP? [closed]
Closed. This question needs to be clarified or expanded with details. It is not currently accepting answers.
Want to improve this question? Add more details and clarify the problem by editing this post.
The task is this: there is a site (http://anistar.ru/new/) that opens normally in a browser. I need to download the page from PHP code so that I can then parse it, but the site protects itself against this with JavaScript. How can I fully emulate a browser in PHP? Or maybe there is some library for this? I searched Google and tried curl, but I couldn't get it to work:
I'll repeat once more for the author and everyone answering: "CURL HAS NOTHING TO DO WITH IT." The site has anti-scraping protection in the form of a JS script that performs calculations in the user's browser before granting access to the site.
1 Answer
Try using cookies. Set the CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR options; see curl_setopt for details.
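A minimal sketch of that advice (the helper function and file name are mine, not from the answer): one curl handle whose cookies persist between requests.

```php
<?php
// Hypothetical helper: CURLOPT_COOKIEJAR writes cookies out when the handle is
// closed, and CURLOPT_COOKIEFILE reads them back in on the next request, so the
// same file serves both roles for a browser-like session.
function make_cookie_session($url, $cookieFile)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_COOKIEJAR      => $cookieFile,
    ));
    return $ch;
}

$ch = make_cookie_session('http://anistar.ru/new/', sys_get_temp_dir() . '/cookies.txt');
```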
In 90% of cases curl copes, but if the target site uses something like CloudFlare, where there is a 3-5 second delay followed by a redirect via JavaScript, then what personally saved me was PhantomJS. If you need to wait until some JavaScript has finished running on the target site and only then fetch the page, PhantomJS handles that task. For example, you can use this library: jonnyw/php-phantomjs
```php
<?php
// (The opening lines of this snippet were lost in extraction; the client setup
// below is reconstructed from the jonnyw/php-phantomjs usage it implies.)
require 'vendor/autoload.php';

use JonnyW\PhantomJs\Client;

$client = Client::getInstance();
// keep cookies in a file
$client->addOption('--cookies-file=some_file.txt');
// all options can also be put in a config file (JSON format) and loaded from there
$client->addOption('--config=path/to/config');

// create the request and response objects
$request  = $client->getMessageFactory()->createRequest('http://somepage.ru', 'GET');
$response = $client->getMessageFactory()->createResponse();

// add a header
$request->addHeader('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.132 Safari/537.36')
        ->setDelay(5); // timeout in seconds: how long to wait before returning the page

// send the request
$client->send($request, $response);

// fetch the page itself; by default it comes back in UTF-8
$content = $response->getContent();
```
This lib uses jakoch/phantomjs-installer and is supposed to install all the required binaries itself, but for some reason that didn't work for me; I had to download the binaries manually and then point the client at the folder containing them.
How can I emulate a get request exactly like a web browser?
There are websites where, when I open a specific ajax request in the browser, I get the resulting page. But when I try to load them with curl, I receive an error from the server. How can I properly emulate a GET request that simulates a browser? This is what I am doing:
```php
$url = "https://new.aol.com/productsweb/subflows/ScreenNameFlow/AjaxSNAction.do?s=username&f=firstname&l=lastname";
ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
print $result;
```
2 Answers
Are you sure the curl module honors ini_set('user_agent', ...)? There is a CURLOPT_USERAGENT option described at http://docs.php.net/function.curl-setopt.
Could there also be a cookie tested by the server? That you can handle by using CURLOPT_COOKIE, CURLOPT_COOKIEFILE and/or CURLOPT_COOKIEJAR.
Edit: since the request uses https, there might also be an error verifying the certificate; see CURLOPT_SSL_VERIFYPEER.
```php
$url = "https://new.aol.com/productsweb/subflows/ScreenNameFlow/AjaxSNAction.do?s=username&f=firstname&l=lastname";
$agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
var_dump($result);
```
Hey, what about when I want to go from a 'front' page to a 'target' page (same domain)? I don't know why, but when I access the 'target' page directly the response is 'TRY TO MAKE ROBOT?'. When I visit the 'front' page first (in a browser), the response is normal.
Did you save cookies to reuse in the second step? Use curl_setopt($ch, CURLOPT_COOKIEJAR, 'file or path'); on the first step and curl_setopt($ch, CURLOPT_COOKIEFILE, 'file or path'); to read them back on the second step. You may also need a referer, e.g. curl_setopt($ch, CURLOPT_REFERER, 'http://yourdomain/front'); (note that CURLOPT_REFERER takes a URL string, not true); you can use the domain (or IP) there.
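The two-step flow from these comments can be sketched as follows; example.com stands in for the real domain, and the URLs and jar path are placeholders of mine.

```php
<?php
// Reusing one handle lets the cookie jar and referer carry over between steps.
$jar = sys_get_temp_dir() . '/session_cookies.txt';
$ch  = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMEOUT        => 5,     // don't hang forever if the site is unreachable
    CURLOPT_COOKIEJAR      => $jar,  // step 1 saves the session cookie here...
    CURLOPT_COOKIEFILE     => $jar,  // ...and step 2 sends it back
));

// Step 1: visit the front page, as a browser would.
curl_setopt($ch, CURLOPT_URL, 'http://example.com/front');
curl_exec($ch);

// Step 2: request the target page with the front page as referer.
// CURLOPT_REFERER takes a URL string, not a boolean.
curl_setopt($ch, CURLOPT_URL, 'http://example.com/target');
curl_setopt($ch, CURLOPT_REFERER, 'http://example.com/front');
$page = curl_exec($ch);
```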
I'll make an example. First decide which browser you want to emulate; in this case I chose Firefox 60.6.1esr (64-bit), and checked what GET request it issues. This can be observed with a simple netcat server (macOS bundles netcat, most Linux distributions bundle netcat, and Windows users can get netcat from Cygwin.org, among other places).
Setting up the netcat server to listen on port 9999:

```shell
nc -l 9999
```
Now, hitting http://127.0.0.1:9999 in Firefox, I get:
```
$ nc -l 9999
GET / HTTP/1.1
Host: 127.0.0.1:9999
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
```
Now let us compare that with a bare curl request with no options set (the snippet itself was garbled in extraction; this minimal version is a reconstruction):

```php
<?php
curl_exec(curl_init('http://127.0.0.1:9999'));
```

Running that, the netcat server gets:

```
$ nc -l 9999
GET / HTTP/1.1
Host: 127.0.0.1:9999
Accept: */*
```
There are several missing headers here. They can all be added with the CURLOPT_HTTPHEADER option of curl_setopt(), with two exceptions. The User-Agent should be set with CURLOPT_USERAGENT instead: it will be persistent across multiple calls to curl_exec(), and if you use CURLOPT_FOLLOWLOCATION it will persist across HTTP redirections as well. And the Accept-Encoding header should be set with CURLOPT_ENCODING instead: if it's set via CURLOPT_ENCODING, curl will automatically decompress the response if the server chooses to compress it, but if you set it via CURLOPT_HTTPHEADER you must manually detect and decompress the content yourself, which is a pain and completely unnecessary, generally speaking. So adding those, we get:
```php
<?php
// (The opening lines of this snippet were lost in extraction; the handle setup
// is reconstructed from the surrounding explanation.)
$ch = curl_init('http://127.0.0.1:9999');
curl_setopt_array($ch, array(
    CURLOPT_USERAGENT  => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0',
    CURLOPT_ENCODING   => 'gzip, deflate',
    CURLOPT_HTTPHEADER => array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.5',
        'Connection: keep-alive',
        'Upgrade-Insecure-Requests: 1',
    ),
));
curl_exec($ch);
```
now running that code, our netcat server gets:
```
$ nc -l 9999
GET / HTTP/1.1
Host: 127.0.0.1:9999
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Connection: keep-alive
Upgrade-Insecure-Requests: 1
```
And voilà! Our PHP-emulated browser GET request should now be indistinguishable from the real Firefox GET request 🙂
This next part is just nitpicking, but if you look very closely, you'll see that the headers are stacked in the wrong order: Firefox puts the Accept-Encoding header on line 6, while our emulated GET request puts it on line 3. To fix this, we can manually place the Accept-Encoding header on the right line:
```php
<?php
// (Handle setup reconstructed as above.)
$ch = curl_init('http://127.0.0.1:9999');
curl_setopt_array($ch, array(
    CURLOPT_USERAGENT  => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0',
    CURLOPT_ENCODING   => 'gzip, deflate', // still set, so curl decompresses the response
    CURLOPT_HTTPHEADER => array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.5',
        'Accept-Encoding: gzip, deflate', // duplicated here only to control its position
        'Connection: keep-alive',
        'Upgrade-Insecure-Requests: 1',
    ),
));
curl_exec($ch);
```
running that, our netcat server gets:
```
$ nc -l 9999
GET / HTTP/1.1
Host: 127.0.0.1:9999
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
```
Problem solved! Now the headers are even in the correct order, and the request seems to be completely indistinguishable from the real Firefox request 🙂 (I don't actually recommend this last step: it's a maintenance burden to keep CURLOPT_ENCODING in sync with the custom Accept-Encoding header, and I've never run into a situation where the order of the headers was significant.)
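As a footnote to the CURLOPT_ENCODING point above: if you ship Accept-Encoding only via CURLOPT_HTTPHEADER, curl hands you the raw compressed bytes, and the manual step it would otherwise do for you is essentially gzdecode(). A tiny sketch with PHP's zlib functions (the sample payload is mine):

```php
<?php
// Stand-in for a gzip-compressed response body as it arrives on the wire.
$wire = gzencode('<html>hello</html>');

// The manual decompression step that CURLOPT_ENCODING automates for you.
$body = gzdecode($wire);
```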