How to know the byte position of a row of a CSV file in Python?
If by "byte position" you mean the byte offset you would get by reading the file as a normal text file, then my suggestion is to do just that. Read the file line by line as text and track the position of each line yourself. You can still parse the CSV data row by row using the csv module:
    # Parse each physical line as a one-row CSV document
    for line in myfile:
        row = next(csv.reader([line]))
I think it is perfectly good design for the CSV reader not to provide a byte position of this kind, because it doesn't make much sense in a CSV context. After all, "data" and data are the exact same four bytes of data as far as CSV is concerned, but the d might be the 2nd byte or the 1st byte depending on whether the optional surrounding quotes were used.
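A minimal sketch of that line-by-line idea, assuming a UTF-8 file with "\n" or "\r\n" line endings and no quoted fields that contain line breaks (such fields would be split across iterations). The name rows_with_offsets is made up for illustration. The file is opened in binary mode so that len() counts bytes rather than characters:

```python
import csv


def rows_with_offsets(path, encoding="utf-8"):
    """Yield (start_byte_offset, parsed_row) for each physical line.

    Hypothetical helper: assumes one CSV row per physical line.
    """
    offset = 0
    with open(path, "rb") as f:  # binary mode: lengths are byte counts
        for raw in f:
            # Decode just this line and parse it as a one-row CSV document
            row = next(csv.reader([raw.decode(encoding)]))
            yield offset, row
            offset += len(raw)  # the next line starts right after this one
```

Each yielded offset can later be passed to seek() on a file opened in binary mode to jump straight to that line.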
Similar question
I got hit by the same issue. It turns out CSV is not that difficult to parse for this limited use case. In case more people come across this in the future, here is what I used (it assumes the Excel way of escaping, where a quote inside a quoted field is doubled):
    from typing import Iterator, Tuple


    def index_csv(
        input_path: str, read_size: int = 16 * 1024, eol_char: str = "\n"
    ) -> Iterator[Tuple[int, int]]:
        QUOTE_CHAR = ord('"')
        NEW_LINE_CHAR = ord(eol_char)
        in_quote = False
        char_count = 0
        row_number = 1
        last_output_char_count = 0
        with open(input_path, "rb") as csvf:
            while True:
                chunk = csvf.read(read_size)
                if not chunk:
                    break
                for c in chunk:
                    char_count += 1
                    if in_quote:
                        if c == QUOTE_CHAR:
                            in_quote = False
                    else:
                        if c == NEW_LINE_CHAR:
                            yield (row_number, char_count)
                            last_output_char_count = char_count
                            row_number += 1
                        elif c == QUOTE_CHAR:
                            in_quote = True
        # The last row might not be valid CSV row
        if last_output_char_count != char_count:
            yield (row_number, char_count)
Tested on a 10 GB CSV file with 100k rows (one column is huge). It took 5 minutes to index.
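The pairs yielded above are (row number, byte count up to and including the end of that row), so each row starts where the previous one ended. A sketch of how such an index might be used to fetch a single row without scanning the file; read_row is a hypothetical helper, and UTF-8 is an assumption:

```python
import csv
import io


def read_row(path, index, row_number, encoding="utf-8"):
    """Return one parsed row using (row_number, end_offset) pairs.

    Hypothetical helper: `index` is a list of pairs in the shape
    produced by an indexer like the one above.
    """
    ends = dict(index)
    # Row 1 starts at byte 0; every other row starts where the previous ended
    start = 0 if row_number == 1 else ends[row_number - 1]
    end = ends[row_number]
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    # newline="" keeps line breaks inside quoted fields intact
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))
```

For a file that large it may be worth storing the index on disk once and reusing it, since building it is the slow part.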
The csv module does indeed read in blocks using a read-ahead buffer as suggested in responses to this post:
I had a similar need to you and generalized my solution for anyone else who might be doing similar things:
Short answer: not possible. The byte position is not available through the csv.reader API.
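The reader does not report offsets itself, but one possible workaround is to count bytes in the iterator that feeds it. The sketch below is not part of the csv API; rows_with_end_offsets is a made-up name. It assumes CPython's behavior of pulling lines from the input iterator only as each row needs them, an ASCII-compatible encoding such as UTF-8, and "\n" or "\r\n" line endings:

```python
import csv


def rows_with_end_offsets(path, encoding="utf-8"):
    """Yield (end_byte_offset, row) for each logical CSV row.

    Hypothetical helper: relies on csv.reader consuming input lines
    lazily, one at a time, as it assembles each row.
    """
    consumed = 0  # bytes handed to the reader so far

    def counted_lines(f):
        nonlocal consumed
        for raw in f:
            consumed += len(raw)
            yield raw.decode(encoding)

    with open(path, "rb") as f:
        for row in csv.reader(counted_lines(f)):
            # Everything consumed so far belongs to rows up to this one
            yield consumed, row
```

Unlike a purely line-based approach, this keeps a quoted field containing line breaks together as one row, because the reader decides where each row ends.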