Python parse google docs

Содержание

Reading Google Sheets data using Python
printminion / parseGData.py
How to read Google Sheets data in Pandas with GSpread
The GSpread package makes it quick and easy to read Google Sheets spreadsheets from Google Drive and load them into Pandas dataframes. Here’s a quick guide to using it.
Install the packages
Authenticate with Google Sheets
Open the Google Sheet with Python
Select the specific worksheet
Bring it all together

Reading Google Sheets data using Python

Google docs are one of the widely used tools across the industry and the spreadsheets are used to store lot of our data, which we would want to access anytime for data analysis or any other purpose. Check my previous posts where I have talked about analyzing & visualizing the data using google spreadsheets.

Many a times there is a need to access this data at the run time and there are different ways you can access these data from the google spreadsheet, onw way by using Google API’s, which requires you to Turn on google sheets API, install google client library, Authentication and then write a script to access the data. This is a wonderful and a secure way to access the data from the drive.

However I was looking for a much simpler way to access the data Per say in a Pythonic way, within 2-3 lines of code and which can be easily consumed by the Pandas Dataframe and can help me for further data analysis and I do not have to spend much time writing the script for accessing the google spreadsheet.

Читайте также: Триггер в python это

In this post we will see how Python Requests and Pandas Dataframe can be used in conjunction to pull the data from the Google spreadsheet and will be consumed directly into Pandas dataframe.

Pandas is an open source and fast & easy high performance data analysis library which is developed by Wes Mckinney. you can find more details on it’s official page here. Pandas has two types of data structures Series & Dataframe, Series is like one dimensional objects like an array list and Dataframe is a two-dimensional spreadsheet structure having rows & columns. We are going to use Dataframes for this post.

Requests is a simple to use HTTP library for python, which we would be using to scrape the content of our google spreadsheet.

Dataset to be used for this post is Booker Prize winners list from Wikipedia:

Check my previous posts here on how to import data from web in google spreadsheet in one simple step.

Once you have the data from the wikipedia link in your google spreadsheet, Save it on your drive and then navigate to the following menu File > Publish to the Web and select Comma-Separated values(.csv) and copy the link.

Now using Python requests we will write a simple two line of code to get this data in pandas data frame:

import pandas as pd #Paste copied link here pathtoCsv = r’https://docs.google.com/spreadsheets/d/12mrYK9ouCRLq6tPje8p4XJIM9ceotAEZJZpXhLa6-uc/pub?output=csv’ df = pd.read_csv(pathtoCsv, encoding = ‘utf8’)

The data is now consumed in the pandas dataframe(df) and can be used for further analysis. Let’s see how to explore the data using dataframe.

I’m using ipython notebook for analysis but you can do this exercise in any editor of your choice.

Let’s find out the info about the dataframe:

There are total 5 columns and 51 rows in this dataset and you can also check the datatype for each column as well.

Basic Analysis using Pandas Dataframe:

United Kingdom has topped the list, Followed by Australia and South Africa. We can further use this dataframe api’s and function in conjunction with plotting libraries to plot the graphs and visualize the dataset.

Updated: July 4, 2017

Источник

printminion / parseGData.py

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

»’

@desc Parse Google Drive spreadsheet data via python

@author Misha M.-Kupriyanov https://plus.google.com/104512463398531242371/

@link https://gist.github.com/3969255

»’

# Spreadsheet https://docs.google.com/spreadsheet/pub?key=0Akgh73WhU1qHdFg4UmRhaThfUFNBaFR3N3BMVW9uZmc&output=html

import logging

import urllib2

import json

def getRowValue ( row , format , column_name ):

logging . info ( ‘getRowValue[%s]:%s’ % ( column_name , row ))

if str ( column_name ) == » :

raise ValueError ( ‘column_name must not empty’ )

begin = row . find ( ‘%s:’ % column_name )

logging . info ( ‘begin:%s’ % begin )

if begin == — 1 :

return »

begin = begin + len ( column_name ) + 1

end = — 1

found_begin = False

for entity in format :

logging . info ( ‘checking:%s’ % entity )

if found_begin and row . find ( entity ) != — 1 :

end = row . find ( entity )

break

if entity == column_name :

found_begin = True

#check if last element

if format [ len ( format ) — 1 ] == column_name :

end = len ( row )

else :

if end == — 1 :

end = len ( row )

else :

end = end — 2

logging . info ( ‘%s:%s’ % ( column_name , row ) )

#logging.info(‘speakertitle[%s]’ % speaker_title )

#logging.info(‘%s:%s’ % (column_name, row.find(column_name)))

# logging.info(‘%s — %s’ % (begin, end))

value = row [ begin : end ]. strip ()

logging . info ( ‘%s[%s-%s]:[%s]’ % ( column_name , begin , end , value ))

return value

# JSON Representation

url = ‘https://spreadsheets.google.com/feeds/list/0Akgh73WhU1qHdFg4UmRhaThfUFNBaFR3N3BMVW9uZmc/od6/public/basic?prettyprint=true&alt=json’ ;

response = urllib2 . urlopen ( url )

html = response . read ()

html = json . loads ( html )

format = [ ‘column1’ , ‘column2’ , ‘column3’ ]

for entry in html [ ‘feed’ ][ ‘entry’ ]:

row = entry [ ‘content’ ][ ‘$t’ ]. encode ( ‘utf-8’ ). strip ()

column1 = column2 = column3 = »

column1 = getRowValue ( row , format , ‘column1’ )

column2 = getRowValue ( row , format , ‘column2’ )

column3 = getRowValue ( row , format , ‘column3’ )

print ‘column1:%s column2:%s column3:%s’ % ( column1 , column2 , column3 )

Источник

How to read Google Sheets data in Pandas with GSpread

The GSpread package makes it quick and easy to read Google Sheets spreadsheets from Google Drive and load them into Pandas dataframes. Here’s a quick guide to using it.

GSpread is a Python package that makes it quick and easy to read and write data from Google Sheets spreadsheets stored in your Google Drive into Python. With a tiny bit of extra code, you can also load your Google Sheets data straight into a Pandas dataframe.

Once your GSheets data is in a Pandas dataframe, you can manipulate as you wish, connect it to other data sources, or push it into other systems, such as databases, or push it back into Google Sheets so you can access it via Google Data Studio and other tools. Here’s a quick guide to how it’s done.

Install the packages

First, open a Jupyter notebook and install the GSpread Python package by entering !pip3 install gspread into a cell and then executing it by pressing shift and enter. Then import the gspread and pandas packages. !pip3 install gspread

import gspread as gs import pandas as pd

Authenticate with Google Sheets

Next, use the service_account() function from GSpread to authenticate on your Google Sheets account using a client secrets JSON keyfile for your service account. You can get one of these from the Google API Console.

gc = gs.service_account(filename='service_account.json')

Open the Google Sheet with Python

Now we have a connection to Google Sheets stored in the gc object, we can append the open_by_url() function to this and pass in the URL of the Google Sheets spreadsheet we want to load in Pandas. This will then use the authenticated connection to fetch the data and pull it into your Jupyter notebook as a Gspread spreadsheet object.

sh = gc.open_by_url('https://docs.google.com/spreadsheets/d/1vxyaK4dDdXzDJ5Axakrvik4CUxzGyHvgmmBQJ3Qo7dE/edit?usp=sharing')

Select the specific worksheet

Spreadsheets in Google Sheets and Excel are divided into worksheets. Whether your spreadsheet contains a single worksheet or multiple named worksheets, you’ll need to load the specific one. To do this we append the worksheet() function to the sh spreadsheet object and give it the name of the seet.

To access the content from the worksheet, we append the get_all_records() function to the object containing the worksheet. This returns a Python dictionary containing all the data in the worksheet. ws.get_all_records() Since we want to display the contents of our Google Sheet in a Pandas dataframe, we need to create a dataframe from that dictionary. This is actually really easy. A single line of code will do that for us.

df = pd.DataFrame(ws.get_all_records()) df.head()

course_title	course_duration	course_type	course_categories
0	Recurrent Neural Networks for Language Modelin.	4 hours	Course	Machine Learning
1	Market Basket Analysis in Python	4 hours	Course	Machine Learning
2	Machine Learning for Everyone	4 hours	Course	Machine Learning
3	Machine Learning for Business	4 hours	Course	Machine Learning
4	Practicing Machine Learning Interview Question.	4 hours	Course	Machine Learning

Bring it all together

To wrap up, here’s the whole set of code in one neat block. To recap, all you need to do is import GSpread and Pandas, create a service account connection, load your spreadsheet, select the worksheet, and load the records into a Pandas dataframe. You can then manipulate

import gspread as gs import pandas as pd gc = gs.service_account(filename='service_account.json') sh = gc.open_by_url('https://docs.google.com/spreadsheets/d/1vxyaK4dDdXzDJ5Axakrvik4CUxzGyHvgmmBQJ3Qo7dE/edit?usp=sharing') ws = sh.worksheet('Courses') df = pd.DataFrame(ws.get_all_records()) df.head()

course_title	course_duration	course_type	course_categories
0	Recurrent Neural Networks for Language Modelin.	4 hours	Course	Machine Learning
1	Market Basket Analysis in Python	4 hours	Course	Machine Learning
2	Machine Learning for Everyone	4 hours	Course	Machine Learning
3	Machine Learning for Business	4 hours	Course	Machine Learning
4	Practicing Machine Learning Interview Question.	4 hours	Course	Machine Learning

Matt Clarke, Saturday, June 05, 2021

Источник