Scraping Google Search with Python
Ever since Google shut down its Google Web Search API in 2011, it has been very hard to find an alternative. We needed to get links out of Google search results with a Python script, so we built our own. Here is a short guide on how to scrape Google search using the requests and Beautiful Soup libraries.
First, let's install the dependencies. Save the following to a file called requirements.txt.
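Judging by the imports used in the script below, two packages suffice (Beautiful Soup is published on PyPI as beautifulsoup4):

requests
beautifulsoup4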
Now run pip install -r requirements.txt on the command line to install those dependencies. Then import the modules in the script:
import urllib
import requests
from bs4 import BeautifulSoup
To perform a search, Google expects the query in the URL parameters, with every space replaced by a '+' sign. To build the URL, we format the query accordingly and put it in the q parameter.
query = "hackernoon How To Scrape Google With Python"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"
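Incidentally, this is where the urllib import can earn its keep: a plain replace only handles spaces, while urllib.parse.quote_plus escapes every special character. A small sketch of that variant:

from urllib.parse import quote_plus

query = "hackernoon How To Scrape Google With Python"
# quote_plus turns spaces into '+' and percent-escapes other special characters
URL = "https://google.com/search?q=" + quote_plus(query)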
Google returns different search results for mobile and desktop clients, so depending on the use case we have to specify an appropriate user-agent.
# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"
Making the request is easy. However, requests expects the user-agent to be set in the headers, so to set the headers correctly we pass in a dictionary.
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
Now we need to check whether our request went through. The easiest way is to check the status code: if it is 200, the request succeeded. Then we feed the response into Beautiful Soup to parse the contents.
if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")
Next comes analyzing the data and extracting all the anchor links from the page, which Beautiful Soup makes easy. As we iterate over the anchors, we save the results in a list.
results = []
for g in soup.find_all('div', class_='r'):
    anchors = g.find_all('a')
    if anchors:
        link = anchors[0]['href']
        title = g.find('h3').text
        item = {
            "title": title,
            "link": link
        }
        results.append(item)
print(results)
And that's it. The script is fairly simple and prone to all sorts of errors, but you have to start somewhere. You can clone or download the whole script from the git repository.
There are also some caveats to keep in mind when scraping Google. If you send too many requests within a short period of time, Google will start serving you captchas. That is annoying, and it will limit the speed and number of your requests.
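If you do keep scraping directly, the least you can do is pace yourself and back off when Google pushes back. A rough sketch (the 30-second delay, the example queries, and the HTTP 429 check are assumptions, not documented limits):

import time
import requests

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}

# Example queries; real usage would loop over your own list
for query in ["python requests tutorial", "beautiful soup tutorial"]:
    url = "https://google.com/search?q=" + query.replace(" ", "+")
    resp = requests.get(url, headers=headers)
    if resp.status_code == 429:  # "Too Many Requests" -- back off instead of hammering
        break
    # ... parse resp.content with Beautiful Soup here ...
    time.sleep(30)  # assumption: a generous pause between queries to avoid captchas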
That is why the RapidAPI Google Search API was created: it lets you run unlimited searches without worrying about captchas.
How to Scrape Google Search Results with Python
By Dirk Hoekstra on 28 Dec, 2020
In this article, we’re going to build a Google search result scraper in Python!
We’ll start with creating everything ourselves. And then, we’ll make our lives easier by using a SERP API.
Setting up the project
Let’s start by creating a folder that will hold our project.
mkdir google-scraper
cd google-scraper
touch scraper.py
Next, we should have a way to retrieve the HTML of Google.
As a first test, I add the following code to get the HTML of the Google home page.
# Use urllib to perform the request
import urllib.request

url = 'https://google.com'

# Perform the request
request = urllib.request.Request(url)
raw_response = urllib.request.urlopen(request).read()

# Read the response as a utf-8 string
html = raw_response.decode("utf-8")
print(html)
And when we run this, everything works as expected! 🎉
python3 scraper.py
// A lot of HTML gibberish here
Getting a Search Result page
We’ve got the HTML of the Google home page. But, there is not a lot of interesting information there.
Let’s update our script to get a search result page.
The url format of Google is https://google.com/search?q=Your+Search+Query
Note that spaces are replaced with + symbols.
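Hand-writing the plus signs works for a fixed query; for arbitrary input you could let the standard library do the escaping. A small sketch (not part of the original article):

import urllib.parse

question = "What is the answer to life the universe and everything"
# urlencode percent-escapes the value; spaces become '+' signs
url = "https://google.com/search?" + urllib.parse.urlencode({"q": question})
print(url)  # https://google.com/search?q=What+is+the+answer+to+life+the+universe+and+everything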
For the next step, I update the url variable to a question that has been burning in my mind: "What is the answer to life the universe and everything"
url = 'https://google.com/search?q=What+is+the+answer+to+life+the+universe+and+everything'
Let’s run the program and see the result.
python3 scraper.py
urllib.error.HTTPError: HTTP Error 403: Forbidden
Hmm, something is going wrong.
It turns out that Google is not too keen on automated programs getting the search result page.
A solution is to mask the fact that we are an automated program by setting a normal User-Agent header.
# Use urllib to perform the request
import urllib.request

url = 'https://google.com/search?q=What+is+the+answer+to+life+the+universe+and+everything'

# Perform the request
request = urllib.request.Request(url)

# Set a normal User-Agent header
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')

raw_response = urllib.request.urlopen(request).read()

# Read the response as a utf-8 string
html = raw_response.decode("utf-8")
print(html)
And when we run the program again, it prints the HTML gibberish of the search result page! 🙌
Setting up BeautifulSoup
To extract information from the raw HTML I’m going to use the BeautifulSoup package.
pip3 install beautifulsoup4
Next, we should import the package.
from bs4 import BeautifulSoup
Then, we can construct a soup object from the HTML.
# Other code here

# Construct the soup object
soup = BeautifulSoup(html, 'html.parser')

# Let's print the title to see if everything works
print(soup.title)
For now, we use the soup object to print out the page title. Just to see if everything works correctly.
python3 scraper.py
Great, it extracts the title of our search page! 🔥
Extracting the Search Results
Let’s take it a step further and extract the actual search results from the page.
To figure out how to access the search results I fire up Chrome and inspect a Google search result page.
There are 2 things I notice: all the search results live inside a div with the id search, and each individual result is wrapped in a div with the class g.
We can use this information to extract the search results with BeautifulSoup.
# Other code here

# Construct the soup object
soup = BeautifulSoup(html, 'html.parser')

# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # For now just print the text contents.
    print(div.get_text() + "\n\n")
Let’s run the program and see if it works.
python3 scraper.py

Results
42 (number) - Wikipediaen.wikipedia.org › wiki › 42_(num. en.wikipedia.org › wiki › 42_(num. In cacheVergelijkbaarVertaal deze paginaThe number 42 is, in The Hitchhiker's Guide to the Galaxy by Douglas Adams, the "Answer to the Ultimate Question of Life, the Universe, and Everything", calculated by an enormous supercomputer named Deep Thought over a period of 7.5 million years.Phrases from The Hitchhiker's . · 43 (number) · 41 (number) · Pronic number

// And many more results
The good news: It kind of works.
The bad news: A lot of gibberish is still included.
Let’s only extract the search titles. When I inspect the page I see that the search titles are contained in h3 tags.
We can use that information to extract the titles.
# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # Search for an h3 tag
    results = div.select("h3")

    # Check if we have found a result
    if (len(results) >= 1):
        # Print the title
        h3 = results[0]
        print(h3.get_text())
And now the moment of truth. Let’s run it and see if it works.
python3 scraper.py
42 (number) - Wikipedia
Phrases from The Hitchhiker's Guide to the Galaxy - Wikipedia
the answer to life, universe and everything - YouTube
The answer to life, the universe, and everything | MIT News .
42: The answer to life, the universe and everything | The .
42 | Dictionary.com
Five Reasons Why 42 May Actually Be the Answer to Life, the .
Ultimate Question | Hitchhikers | Fandom
For Math Fans: A Hitchhiker's Guide to the Number 42 .
Why 42 is NOT the answer to life, the universe and everything .
Amazing! We’ve just confirmed that the answer to everything is 42.
Scraping a lot of pages
Nice, we’ve just constructed a basic Google search scraper!
There is a catch though. Google will quickly figure out that this is a bot and block the IP address.
A possible solution would be to scrape very sparsely and wait 10 seconds between requests. However, this is not the best solution if you need to scrape a lot of search queries.
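A rough sketch of that sparse approach, reusing the request code from earlier (the query list is just an example):

import time
import urllib.request

# Example queries, already encoded with '+' for spaces
queries = ["what+is+a+serp", "what+is+a+proxy"]

for q in queries:
    request = urllib.request.Request("https://google.com/search?q=" + q)
    request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
    html = urllib.request.urlopen(request).read().decode("utf-8")
    # ... hand html to BeautifulSoup as before ...
    time.sleep(10)  # wait 10 seconds before the next request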
Another solution would be to buy proxy servers. This way you can scrape from different IP addresses.
But once again, there is a catch here. A lot of people want to scrape Google search results, so most proxies have already been blocked by Google.
You could buy dedicated or residential proxies, but that can quickly become very expensive.
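For what it's worth, routing urllib through a proxy looks roughly like this; the address below is a placeholder from the documentation IP range, not a working server:

import urllib.request

# Placeholder proxy address -- substitute one you actually have access to
proxy = urllib.request.ProxyHandler({
    'http': 'http://203.0.113.10:8080',
    'https': 'http://203.0.113.10:8080',
})
opener = urllib.request.build_opener(proxy)

request = urllib.request.Request('https://google.com/search?q=test')
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
html = opener.open(request).read().decode("utf-8")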
In my opinion, the best and simplest solution is to use a SERP API!
Using a SERP API
SERP stands for Search Engine Results Page. In this example, I’m going to use the ScraperBox google search api.
The documentation shows an example, so let’s use that as our base and tweak it a bit.
import urllib.parse
import urllib.request
import ssl
import json

ssl._create_default_https_context = ssl._create_unverified_context

# Urlencode the query string
q = urllib.parse.quote_plus("What is the answer to life the universe and everything")

# Create the query URL.
query = "https://api.scraperbox.com/google"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&q=%s" % q

# Call the API.
request = urllib.request.Request(query)
raw_response = urllib.request.urlopen(request).read()
raw_json = raw_response.decode("utf-8")
response = json.loads(raw_json)

# Print the result titles.
for result in response['organic_results']:
    print(result['title'])
Make sure to replace YOUR_API_TOKEN with your scraperbox API token.
And when running this it displays the search result titles!
python3 scraper.py
42 (number) - Wikipedia
Phrases from The Hitchhiker's Guide to the Galaxy - Wikipedia
The answer to life, the universe, and everything | MIT News .
42: The answer to life, the universe and everything | The .
the answer to life, universe and everything - YouTube
Answer To The Ultimate Question - The . - YouTube
42 | Dictionary.com
For Math Fans: A Hitchhiker's Guide to the Number 42 .
The Answer to Life, the Universe, and Everything - MATLAB .
And once again the search results are shown! 🎉
Conclusion
We’ve set up a Google scraper in Python using the BeautifulSoup package.
Then, we created a program that uses a SERP API.
And, not to forget: we’ve figured out that the answer to the universe is 42.
Dirk Hoekstra has a Computer Science and Artificial Intelligence degree and is the co-founder of Scraperbox. He is a technical author on Medium where his articles have been read over 100,000 times. Founder of multiple tech companies of which one was acquired in 2020.