Web scraping with Python (using webbrowser module)
Web scraping is the term for using a program to download and process content from the Web. For example, Google runs many web scraping programs to index web pages for its search engine. Python has several modules that make it easy to scrape web pages. Some of these modules are:
- webbrowser: Comes with Python and opens a browser to a specific page.
- Requests: Downloads files and web pages from the Internet.
- Beautiful Soup: Parses HTML, the format that web pages are written in.
- Selenium: Launches and controls a web browser. Selenium is able to fill in forms and simulate mouse clicks in this browser.
Let’s use the webbrowser module and make some programs to understand web scraping. Usually this module is used to launch a new browser with a specified URL. This is achieved through the webbrowser module’s open() function. Try this code:
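A minimal sketch of such a program, assuming the URL mentioned in the next paragraph. The core is a single call to webbrowser.open(); the launch helper and its opener parameter are illustrative names, added only so the call can be tried without a real browser:

```python
import webbrowser

# URL taken from the surrounding text.
URL = "http://covrisolutions.com"


def launch(url=URL, opener=webbrowser.open):
    """Open url in the default browser; return True if a browser was launched."""
    return opener(url)


if __name__ == "__main__":
    launch()
```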
When we run this program, a web browser tab will open to the URL http://covrisolutions.com. Now let’s make another program that automatically launches a map in your browser using an address taken from the command line or from your clipboard. This way, you only have to copy the address to the clipboard and run the script, and the map will be loaded for you.
Make sure your PATH settings are configured so that you can run this program from the command line as well as from your IDE, which in my case is Geany. I’ll use the address of Covri Comunication and Management Solutions, which is Chapel Rd, Bagher Complex, Fateh Maidan, Abids, Hyderabad, Telangana 500001. You may use your own or continue with mine. What I intend to do is type this:
C:\Users\Python>mapit1.py Chapel Rd, Bagher Complex, Fateh Maidan, Abids, Hyderabad, Telangana 500001
in the command prompt, and my program should open a browser showing the Google Maps page for this address. See the code below:
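#! python3
import webbrowser, sys
if len(sys.argv) > 1:
    # Join the command line arguments into a single address string.
    address = ' '.join(sys.argv[1:])
    webbrowser.open('https://www.google.com/maps/place/' + address)  # URL pattern assumed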
The first line is the program’s #! shebang line, a directive that tells your command line interpreter how to execute the script. Next, we import the webbrowser module for launching the browser and the sys module for reading any command line arguments. The sys.argv variable stores a list of the program’s filename and command line arguments. If this list has more than just the filename in it, then len(sys.argv) evaluates to an integer greater than 1, meaning that command line arguments have indeed been provided.
Command line arguments are usually separated by spaces, but in this case you want to interpret all of the arguments as a single string. Since sys.argv is a list of strings, you can pass it to the join() method of a single-space string, which returns one string with the arguments separated by spaces. You don’t want the program name in this string, so instead of sys.argv, you should pass sys.argv[1:] to chop off the first element of the list. The final string that this expression evaluates to is stored in the address variable.
The last line calls webbrowser.open() to open a browser with the resulting URL.
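To make the argument handling concrete, here is a small sketch that splits it into two helpers. The names address_from_argv and maps_url, as well as the Google Maps URL prefix, are assumptions for illustration. The sketch also percent-encodes the address with urllib.parse.quote, a step the program above does not perform:

```python
import sys
import webbrowser
from urllib.parse import quote

# Assumed Google Maps URL prefix; it is not given in the text above.
MAPS_PREFIX = "https://www.google.com/maps/place/"


def address_from_argv(argv):
    """Join every argument after the script name into one address string."""
    return " ".join(argv[1:])


def maps_url(address):
    """Percent-encode the address so spaces and commas are safe in a URL."""
    return MAPS_PREFIX + quote(address)


# Only open the browser when an address was actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    webbrowser.open(maps_url(address_from_argv(sys.argv)))
```

Most browsers will encode the spaces for you, so the encoding step is optional, but it keeps the URL well formed if the address contains characters such as # or ?.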
Now run the program by entering this into the command line . . .
mapit1.py Chapel Rd, Bagher Complex, Fateh Maidan, Abids, Hyderabad, Telangana 500001
A new page opens showing Covri Comunication and Management Solutions on Google Maps (see the screenshot).
Now let’s consider a scenario where there are no command line arguments. In that case, our program will assume that the address is stored on the clipboard. See the modified program below:
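#! python3
# mapIt.py - Launches a map in the browser using an address from the
# command line or clipboard.
import webbrowser, sys, pyperclip
# Use the command line arguments if given, otherwise the clipboard.
address = ' '.join(sys.argv[1:]) if len(sys.argv) > 1 else pyperclip.paste()
webbrowser.open('https://www.google.com/maps/place/' + address)  # URL pattern assumed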
Now, when no command line arguments are given, our program assumes that the address is stored on the clipboard. We can get the clipboard content with pyperclip.paste() and store it in a variable named address. Notice that we have imported the pyperclip module to use its paste() function. pyperclip is a third-party module, so if it is not present, install it first (for example with pip install pyperclip); otherwise this program won’t work.
Finally, to launch a web browser with the Google Maps URL, call webbrowser.open(). Now copy the address to the clipboard and run this program. Again, a new page opens showing Covri Comunication and Management Solutions on Google Maps.
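The choice between the two sources can be sketched as a small function. This is only one possible arrangement: pick_address, clipboard_text and the Google Maps URL prefix are assumed names, and the sketch prints a hint instead of crashing when pyperclip is missing:

```python
import sys
import webbrowser

# Assumed Google Maps URL prefix; it is not given in the text above.
MAPS_PREFIX = "https://www.google.com/maps/place/"


def clipboard_text():
    """Return the clipboard contents, or '' if pyperclip is not installed."""
    try:
        import pyperclip  # third-party: pip install pyperclip
    except ImportError:
        print("pyperclip is missing; install it with: pip install pyperclip")
        return ""
    return pyperclip.paste()


def pick_address(argv, read_clipboard=clipboard_text):
    """Prefer command line arguments; otherwise fall back to the clipboard."""
    if len(argv) > 1:
        return " ".join(argv[1:])
    return read_clipboard().strip()


if __name__ == "__main__":
    address = pick_address(sys.argv)
    if address:  # skip opening the browser when there is nothing to look up
        webbrowser.open(MAPS_PREFIX + address)
```

Passing the clipboard reader in as a parameter makes it easy to try pick_address with a stand-in function when no clipboard is available.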
The webbrowser module lets users cut out the step of opening the browser and directing themselves to a website. Other programs could use this functionality to do the following:
• Open all links on a page in separate browser tabs.
• Open the browser to the URL for your local weather.
• Open several social network sites that you regularly check.
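As a starting point for the last idea, here is a minimal sketch that opens a list of pages in separate tabs. The URLs are placeholders and open_all is an illustrative name, so adapt both to your needs:

```python
import webbrowser

# Placeholder URLs; replace them with the sites you actually visit.
SITES = [
    "https://www.python.org",
    "https://docs.python.org/3/library/webbrowser.html",
]


def open_all(urls, opener=webbrowser.open_new_tab):
    """Open each URL in its own tab; return the URLs that failed to open."""
    return [url for url in urls if not opener(url)]


if __name__ == "__main__":
    failed = open_all(SITES)
    if failed:
        print("Could not open:", ", ".join(failed))
```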
Try to write programs that implement the ideas above. Once again, remember to set up the PATH variable so that you can run the programs from the command prompt. That ends today’s discussion. In the next post we shall look into the requests module, so until then, keep practicing and learning Python, as Python is easy to learn!
webbrowser: Convenient web-browser controller
The webbrowser module provides a high-level interface to allow displaying web-based documents to users. Under most circumstances, simply calling the open() function from this module will do the right thing.
Under Unix, graphical browsers are preferred under X11, but text-mode browsers will be used if graphical browsers are not available or an X11 display isn’t available. If text-mode browsers are used, the calling process will block until the user exits the browser.
If the environment variable BROWSER exists, it is interpreted as the os.pathsep-separated list of browsers to try ahead of the platform defaults. When the value of a list part contains the string %s, then it is interpreted as a literal browser command line to be used with the argument URL substituted for %s; if the part does not contain %s, it is simply interpreted as the name of the browser to launch.
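The rule for BROWSER entries can be illustrated with a short sketch. This is not the module's own code: describe_browser_env is a hypothetical helper that only mirrors the interpretation described above, so that the two kinds of entries are easy to tell apart.

```python
import os


def describe_browser_env(value, url):
    """Show how each entry of a BROWSER-style value would be treated.

    Illustrative only: this mirrors the rule described above; it is not
    the webbrowser module's own implementation.
    """
    plan = []
    for entry in value.split(os.pathsep):
        if not entry:
            continue  # ignore empty parts such as a trailing separator
        if "%s" in entry:
            # Literal command line with the URL substituted for %s.
            plan.append(("command", entry.replace("%s", url)))
        else:
            # Bare entry: treated as the name of a browser to launch.
            plan.append(("name", entry))
    return plan
```

On Unix, where os.pathsep is a colon, a value such as firefox:lynx %s would therefore yield one browser name followed by one literal command line.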
For non-Unix platforms, or when a remote browser is available on Unix, the controlling process will not wait for the user to finish with the browser, but allow the remote browser to maintain its own windows on the display. If remote browsers are not available on Unix, the controlling process will launch a new browser and wait.
The script webbrowser can be used as a command-line interface for the module. It accepts a URL as the argument. It accepts the following optional parameters: -n opens the URL in a new browser window, if possible; -t opens the URL in a new browser page (“tab”). The options are, naturally, mutually exclusive. Usage example:
python -m webbrowser -t "https://www.python.org"
This module does not work or is not available on WebAssembly platforms wasm32-emscripten and wasm32-wasi. See WebAssembly platforms for more information.
The following exception is defined:
exception webbrowser.Error
Exception raised when a browser control error occurs.
The following functions are defined:
webbrowser.open(url, new=0, autoraise=True)
Display url using the default browser. If new is 0, the url is opened in the same browser window if possible. If new is 1, a new browser window is opened if possible. If new is 2, a new browser page (“tab”) is opened if possible. If autoraise is True, the window is raised if possible (note that under many window managers this will occur regardless of the setting of this variable).
Note that on some platforms, trying to open a filename using this function may work and start the operating system’s associated program. However, this is neither supported nor portable.
Raises an auditing event webbrowser.open with argument url.
webbrowser.open_new(url)
Open url in a new window of the default browser, if possible, otherwise, open url in the only browser window.
webbrowser.open_new_tab(url)
Open url in a new page (“tab”) of the default browser, if possible, otherwise equivalent to open_new().
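The relationship between the three functions can be sketched as follows: open_new(url) behaves like open(url, new=1) and open_new_tab(url) like open(url, new=2). The show helper, its where parameter and the constant names are illustrative and not part of the module:

```python
import webbrowser

# Values accepted by the `new` parameter of webbrowser.open().
SAME_WINDOW, NEW_WINDOW, NEW_TAB = 0, 1, 2


def show(url, where="tab", opener=webbrowser.open):
    """Open url in the same window, a new window, or a new tab."""
    modes = {"same": SAME_WINDOW, "window": NEW_WINDOW, "tab": NEW_TAB}
    return opener(url, new=modes[where])
```

Whether the browser actually honours the request depends on the platform and the browser, as the descriptions above note with "if possible".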
webbrowser.get(using=None)
Return a controller object for the browser type using. If using is None, return a controller for a default browser appropriate to the caller’s environment.
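When no runnable browser of the requested type can be located, get() raises webbrowser.Error. A minimal sketch of handling that case, where controller_or_default is a hypothetical helper that falls back to the default browser:

```python
import webbrowser


def controller_or_default(name):
    """Return a controller for `name`, or the default one if it is unknown.

    Returns None when no browser at all can be located.
    """
    for choice in (name, None):
        try:
            return webbrowser.get(choice)
        except webbrowser.Error:
            continue  # this choice cannot be located; try the next one
    return None
```

A caller can then check the result for None before calling its open() method, instead of wrapping every call in its own try block.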
webbrowser.register(name, constructor, instance=None, *, preferred=False)
Register the browser type name. Once a browser type is registered, the get() function can return a controller for that browser type. If instance is not provided, or is None, constructor will be called without parameters to create an instance when needed. If instance is provided, constructor will never be called, and may be None.
Setting preferred to True makes this browser a preferred result for a get() call with no argument. Otherwise, this entry point is only useful if you plan to either set the BROWSER variable or call get() with a nonempty argument matching the name of a handler you declare.
Changed in version 3.7: preferred keyword-only parameter was added.
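As an illustration of register() and get() working together, the sketch below registers a stand-in controller that records URLs instead of displaying them. RecordingBrowser and the name "recorder" are hypothetical; the class simply provides the open(), open_new() and open_new_tab() methods that callers of a controller would use:

```python
import webbrowser


class RecordingBrowser:
    """Stand-in controller that records URLs instead of displaying them."""

    def __init__(self):
        self.opened = []

    def open(self, url, new=0, autoraise=True):
        self.opened.append(url)
        return True

    def open_new(self, url):
        return self.open(url, 1)

    def open_new_tab(self, url):
        return self.open(url, 2)


recorder = RecordingBrowser()
# An instance is supplied, so the constructor argument can be None.
webbrowser.register("recorder", None, recorder)

# get() now returns the registered instance for this name.
controller = webbrowser.get("recorder")
controller.open("https://www.python.org")
print(recorder.opened)
```

Because preferred is left at False, the stand-in is only used when it is requested by name or when it is the only remaining candidate, which makes this pattern convenient for exercising code without launching a real browser.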
A number of browser types are predefined. This table gives the type names that may be passed to the get() function and the corresponding instantiations for the controller classes, all defined in this module.