How to read a CSV file from a URL with Python?
but it didn’t work and I got an error: No such file or directory: 'http://example.com/passkey=wedsmdjsjmdd'. Thanks!
You need to open the URL and read it in as a big text string (see urllib/requests). Then I assume you can initialize the csv reader with a string instead of a file object, but I don't know; I've always used it with an open file handle.
@JoranBeasley, I think your method is correct. Maybe I need something like http://processing.org/reference/loadStrings_.html but in Python.
FYI: the read_csv function in the pandas library (pandas.pydata.org) accepts URLs. See pandas.pydata.org/pandas-docs/stable/generated/…
9 Answers
Using pandas, it is very simple to read a CSV file directly from a URL:

import pandas as pd

data = pd.read_csv('https://example.com/passkey=wedsmdjsjmdd')
This will read your data in tabular format, which will be very easy to process.
Didn’t work for me (maybe I ran out of memory?): pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 33, saw 2
Is there any way to use this with a retry? Many times I get a 500 error, and when I call read_csv again it works. This happens a lot when I am reading from Google Sheets.
This answer worked. The other one, with csv.reader(), always gave me _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?).
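For the intermittent 500 errors mentioned in the comments, one option is a small retry wrapper around the call. This is a generic sketch (the with_retry name, attempt count, and delay are made up for illustration); you would use it as with_retry(lambda: pd.read_csv(url)):

```python
import time


def with_retry(fn, attempts=3, delay=1.0):
    """Call fn(), retrying on failure up to `attempts` times.

    A generic sketch: in practice you may want to catch only
    HTTP-related errors rather than bare Exception.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; re-raise the last error
            time.sleep(delay)
```

The same wrapper works for any flaky call, not just pd.read_csv.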
You need to replace open with urllib.urlopen or urllib2.urlopen.
import csv
import urllib2

url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)

for row in cr:
    print row
This would output the following:

Year,City,Sport,Discipline,NOC,Event,Event gender,Medal
1924,Chamonix,Skating,Figure skating,AUT,individual,M,Silver
1924,Chamonix,Skating,Figure skating,AUT,individual,W,Gold
...
The original question is tagged "python-2.x", but for a Python 3 implementation (which requires only minor changes) see below.
Can you pass that to csv.reader? I guess so; it's pretty "file-like", but I've never done it or even thought to do that.
I think urllib2.urlopen returns a file-like object, so you can probably just remove the .read() and pass response to csv.reader.
@mongotop that means it is working; that shows you where the object is in memory. It looks like it only reads a line at a time, so maybe calling cr.next() inside a loop is what you are looking for. (I haven't used the csv reader myself.)
You could do it with the requests module as well:
import csv
import requests

url = 'http://winterolympicsmedals.com/medals.csv'
r = requests.get(url)
text = r.iter_lines()
reader = csv.reader(text, delimiter=',')
One question: the reader variable is a _csv.reader object. When I iterate through this object to print the contents, I get the following error: Error: iterator should return strings, not bytes (did you open the file in text mode?). How do I read the contents of the csv reader object and, say, load it into a pandas DataFrame?
@Harikrishna this is probably a problem in Python 3, and this case is answered here: stackoverflow.com/questions/18897029/…
This reads the whole thing into memory, which isn't really necessary, especially if you are going to use csv.reader. At that point, just use pandas.
To increase performance when downloading a large file, the below may work a bit more efficiently:
import csv
from contextlib import closing

import requests

url = "http://download-and-process-csv-efficiently/python.csv"

with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(r.iter_lines(), delimiter=',', quotechar='"')
    for row in reader:
        # Handle each row here.
        print row
By setting stream=True in the GET request and passing r.iter_lines() to csv.reader(), we hand csv.reader() a generator, which lets it lazily iterate over each line in the response with for row in reader.
This avoids loading the entire file into memory before we start processing it, drastically reducing memory overhead for large files.
Great solution, but I also had to import codecs and wrap the r.iter_lines() within codecs.iterdecode(), like so: codecs.iterdecode(r.iter_lines(), 'utf-8'), in order to solve bytes-vs-str issues, unicode decoding problems, and universal newline problems.
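To illustrate that bytes-to-str fix, here is a minimal Python 3 sketch. It feeds csv.reader with in-memory byte lines standing in for r.iter_lines(), so it runs without a network connection:

```python
import codecs
import csv

# Stand-in for r.iter_lines(): an iterator over raw byte lines
byte_lines = iter([b"Year,City,Sport", b"1924,Chamonix,Skating"])

# iterdecode lazily turns each bytes chunk into str, which csv.reader requires
rows = list(csv.reader(codecs.iterdecode(byte_lines, "utf-8")))
print(rows)
```

Because both iterdecode and csv.reader consume iterators lazily, swapping the list for a streaming response keeps the memory benefits described above.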
This question is tagged python-2.x, so it didn't seem right to tamper with the original question, or the accepted answer. However, Python 2 is now unsupported, and this question still has good google juice for "python csv urllib", so here's an updated Python 3 solution.
It's now necessary to decode urlopen's response (which is bytes) into a valid local encoding, so the accepted answer has to be modified slightly:
import csv
import urllib.request

url = 'http://winterolympicsmedals.com/medals.csv'
response = urllib.request.urlopen(url)
lines = [l.decode('utf-8') for l in response.readlines()]
cr = csv.reader(lines)

for row in cr:
    print(row)
Note the extra line beginning with lines = , the fact that urlopen is now in the urllib.request module, and print of course requires parentheses.
It’s hardly advertised, but yes, csv.reader can read from a list of strings.
And since someone else mentioned pandas, here’s a pandas rendition that displays the CSV in a console-friendly output:
python3 -c 'import pandas
df = pandas.read_csv("http://winterolympicsmedals.com/medals.csv")
print(df.to_string())'
Pandas is not a lightweight library, though. If you don’t need the things that pandas provides, or if startup time is important (e.g. you’re writing a command line utility or any other program that needs to load quickly), I’d advise that you stick with the standard library functions.
Fetch a file from a local url with Python requests?
I am using Python’s requests library in one method of my application. The body of the method looks like this:
def handle_remote_file(url, **kwargs):
    response = requests.get(url, ...)
    buff = StringIO.StringIO()
    buff.write(response.content)
    ...
    return True
I'd like to write some unit tests for that method; however, what I want to do is to pass a fake local URL such as:
class RemoteTest(TestCase):

    def setUp(self):
        self.url = 'file:///tmp/dummy.txt'

    def test_handle_remote_file(self):
        self.assertTrue(handle_remote_file(self.url))
When I call requests.get with a local URL, I get the KeyError exception below:
requests.get('file:///tmp/dummy.txt')

/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/packages/urllib3/poolmanager.pyc in connection_from_host(self, host, port, scheme)
     76
     77         # Make a fresh ConnectionPool of the desired type
     78         pool_cls = pool_classes_by_scheme[scheme]
     79         pool = pool_cls(host, port, **self.connection_pool_kw)
     80

KeyError: 'file'
The question is: how can I pass a local URL to requests.get? P.S. I made up the above example; it possibly contains many errors.
7 Answers
As @WooParadog explained, the requests library doesn't know how to handle local files. However, the current version allows you to define transport adapters.
Therefore you can simply define your own adapter which will be able to handle local files, e.g.:
import os

import requests
from requests_testadapter import Resp


class LocalFileAdapter(requests.adapters.HTTPAdapter):
    def build_response_from_file(self, request):
        file_path = request.url[7:]
        with open(file_path, 'rb') as file:
            buff = bytearray(os.path.getsize(file_path))
            file.readinto(buff)
            resp = Resp(buff)
            r = self.build_response(request, resp)
            return r

    def send(self, request, stream=False, timeout=None,
             verify=True, cert=None, proxies=None):
        return self.build_response_from_file(request)


requests_session = requests.session()
requests_session.mount('file://', LocalFileAdapter())
requests_session.get('file://')
I'm using the requests-testadapter module in the above example.
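As an aside for the unit-testing use case: if all the test needs is to avoid the network, a stub can replace requests.get entirely, with no file:// support required. A minimal sketch, where handle_remote_file is simplified and the getter is injected for illustration (in real code you might mock.patch('requests.get') instead):

```python
from unittest import mock


def handle_remote_file(url, getter):
    """Simplified stand-in for the method under test; `getter` plays requests.get."""
    response = getter(url)
    return len(response.content) > 0


# A fake response object carrying just the attribute the code reads
fake_response = mock.Mock(content=b"dummy data")
fake_get = mock.Mock(return_value=fake_response)

assert handle_remote_file("file:///tmp/dummy.txt", fake_get)
fake_get.assert_called_once_with("file:///tmp/dummy.txt")
```

This tests the method's logic without deciding how file:// URLs should behave; the transport-adapter answers are the right tool when you genuinely want requests to serve local files.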
Here’s a transport adapter I wrote which is more featureful than b1r3k’s and has no additional dependencies beyond Requests itself. I haven’t tested it exhaustively yet, but what I have tried seems to be bug-free.
import os
import sys

import requests

if sys.version_info.major < 3:
    from urllib import url2pathname
else:
    from urllib.request import url2pathname


class LocalFileAdapter(requests.adapters.BaseAdapter):
    """Protocol Adapter to allow Requests to GET file:// URLs

    @todo: Properly handle non-empty hostname portions.
    """

    @staticmethod
    def _chkpath(method, path):
        """Return an HTTP status for the given filesystem path."""
        if method.lower() in ('put', 'delete'):
            return 501, "Not Implemented"  # TODO
        elif method.lower() not in ('get', 'head'):
            return 405, "Method Not Allowed"
        elif os.path.isdir(path):
            return 400, "Path Not A File"
        elif not os.path.isfile(path):
            return 404, "File Not Found"
        elif not os.access(path, os.R_OK):
            return 403, "Access Denied"
        else:
            return 200, "OK"

    def send(self, req, **kwargs):  # pylint: disable=unused-argument
        """Return the file specified by the given request

        @type req: C{PreparedRequest}
        @todo: Should I bother filling `response.headers` and processing
               If-Modified-Since and friends using `os.stat`?
        """
        path = os.path.normcase(os.path.normpath(url2pathname(req.path_url)))
        response = requests.Response()

        response.status_code, response.reason = self._chkpath(req.method, path)
        if response.status_code == 200 and req.method.lower() != 'head':
            try:
                response.raw = open(path, 'rb')
            except (OSError, IOError) as err:
                response.status_code = 500
                response.reason = str(err)

        if isinstance(req.url, bytes):
            response.url = req.url.decode('utf-8')
        else:
            response.url = req.url

        response.request = req
        response.connection = self

        return response

    def close(self):
        pass
(Despite the name, it was completely written before I thought to check Google, so it has nothing to do with b1r3k's.) As with the other answer, follow this with:
requests_session = requests.session()
requests_session.mount('file://', LocalFileAdapter())
r = requests_session.get('file:///path/to/your/file')
Given a URL to a text file, what is the simplest way to read the contents of the text file?
In Python, when given the URL for a text file, what is the simplest way to access the contents of the text file and print the contents of the file out locally, line by line, without saving a local copy of the text file?
TargetURL = "http://www.myhost.com/SomeFile.txt"
# read the file
# print first line
# print second line
# etc.
14 Answers
Edit 09/2016: In Python 3 and up use urllib.request instead of urllib2
Actually the simplest way is:
import urllib2  # the lib that handles the url stuff

data = urllib2.urlopen(target_url)  # it's a file-like object and works just like a file
for line in data:  # files are iterable
    print line
You don't even need "readlines", as Will suggested. You could even shorten it to: *
import urllib2

for line in urllib2.urlopen(target_url):
    print line
But remember in Python, readability matters.
However, while this is the simplest way, it is not the safe way, because most of the time with network programming you don't know if the amount of data to expect will be respected. So you'd generally better read a fixed and reasonable amount of data, something you know to be enough for the data you expect but that will prevent your script from being flooded:
import urllib2

data = urllib2.urlopen("http://www.google.com").read(20000)  # read only 20 000 chars
data = data.split("\n")  # then split it into lines

for line in data:
    print line
* Second example in Python 3:
import urllib.request  # the lib that handles the url stuff

for line in urllib.request.urlopen(target_url):
    print(line.decode('utf-8'))  # utf-8 or iso8859-1 or whatever the page encoding scheme is
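The capped-read idea also translates to Python 3; here is a sketch wrapped in a small helper (the read_capped name and the 20000-byte default are just illustrative, echoing the cap above):

```python
import urllib.request


def read_capped(url, max_bytes=20000):
    """Read at most max_bytes from url and return the decoded lines."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read(max_bytes)  # stop after max_bytes, even if the server sends more
    return data.decode("utf-8", errors="replace").splitlines()
```

Conveniently, urllib.request also understands data: URLs, so the helper can be tried without any server, e.g. read_capped("data:,line1%0Aline2").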