Python decode text file

Contents

Decoding text file using lists in Python
How to encode/decode this file in Python?
Python: how to convert from Windows 1251 to Unicode?

Decoding text file using lists in Python

I encoded this sentence:

    This is an amazing "abstract" AND this: is the end of this amazing abstract.

to this:

    1 2 3 4 "5" 6 7: 2 8 9 10 7 4 5.

The corresponding index table (as a text file) is:

    word,index
    This,1
    is,2
    an,3
    amazing,4
    abstract,5
    AND,6
    this,7
    the,8
    end,9
    of,10

Now I want to go from these numbers, ' 1 2 3 4 "5" 6 7: 2 8 9 10 7 4 5. ', to their corresponding words using the index table. I used this code to read the index table text file into a list:

    index_file = open("decompress.txt", "r")
    content_index = index_file.read().split()
    print(content_index)

['word,index', 'This,1', 'is,2', 'an,3', 'amazing,4', 'abstract,5', 'AND,6', 'this,7', 'the,8', 'end,9', 'of,10']

    for line in content_index:
        fields = line.split(",")
        print(fields)

    ['word', 'index']
    ['This', '1']
    ['is', '2']
    ['an', '3']
    ['amazing', '4']
    ['abstract', '5']
    ['AND', '6']
    ['this', '7']
    ['the', '8']
    ['end', '9']
    ['of', '10']

I tried decoding the numbers using fields[0] and fields[1] and for loops, but I did not succeed. Any help would be greatly appreciated!

This begs for csv.DictReader. You won't have to do any splitting, deal with the header row, or deal with indexes. Each "row" will become a dictionary with 'word' and 'index' keys. If needed, this list of dictionaries can then be converted to two dicts: one from word to index and the other from index to word.
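A minimal sketch of the csv.DictReader suggestion, assuming the index table has one `word,index` pair per line. The `INDEX_TABLE` string and the `load_index` helper are illustrative names, not part of the question; with the real file you would pass `open("decompress.txt", newline="")` instead of the in-memory stand-in.

```python
import csv
import io

# Stand-in for decompress.txt (assumed layout: one "word,index" pair per line).
INDEX_TABLE = """word,index
This,1
is,2
an,3
amazing,4
abstract,5
AND,6
this,7
the,8
end,9
of,10
"""


def load_index(lines):
    """Return (word_to_index, index_to_word) built from CSV rows."""
    word_to_index = {}
    index_to_word = {}
    # DictReader consumes the header row and yields one dict per data row.
    for row in csv.DictReader(lines):
        word_to_index[row["word"]] = row["index"]
        index_to_word[row["index"]] = row["word"]
    return word_to_index, index_to_word


word_to_index, index_to_word = load_index(io.StringIO(INDEX_TABLE))
print(index_to_word["5"])
```

The indexes stay strings here, which is convenient when the lookup keys come out of a text string anyway.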

Please post the source code of what you’ve tried so far in decoding your strings and what errors you encountered.

2 Answers

First of all, it’s better to use dict and replace your code:

    for line in content_index:
        fields = line.split(",")

with:

    fields = {}
    for line in content_index:
        word, number = line.split(',')
        fields[number] = word

Then you can use regular expressions to replace specific patterns (in your case, numbers) with other strings. The regular expression for finding a number is \d+, where \d means a digit and + means one or more. So:

    import re

    original_string = ' 1 2 3 4 "5" 6 7: 2 8 9 10 7 4 5. '

    def replacement(match):
        """
        This function accepts a regular expression match and returns
        the corresponding replacement if it's found in `fields`.
        """
        # Learn more about match groups in the `re` documentation.
        return fields.get(match.group(0), '')

    # This line iterates through the original string, calling `replacement`
    # for each number and substituting the return value into the string.
    result = re.sub(r'\d+', replacement, original_string)

So the final code will be:

    import re

    fields = {}
    with open('decompress.txt') as f:
        for line in f:
            line = line.strip()  # drop the trailing newline so keys are '1', not '1\n'
            if not line:
                continue  # skip blank lines
            word, number = line.split(',')
            fields[number] = word

    original_string = ' 1 2 3 4 "5" 6 7: 2 8 9 10 7 4 5. '

    def replacement(match):
        """
        This function accepts a regular expression match and returns
        the corresponding replacement if it's found in `fields`.
        """
        return fields.get(match.group(0), '')

    result = re.sub(r'\d+', replacement, original_string)
    print(result)

You can learn more about regular expressions in the Python documentation for the re library. It's a very powerful tool for text processing and parsing.


How to encode/decode this file in Python?

I am planning to make a little Python game that will randomly print keys (English) out of a dictionary, and the user has to input the value (in German). If the value is correct, it prints 'correct' and continues. If the value is wrong, it prints 'wrong' and breaks. I thought this would be an easy task, but I got stuck on the way. My problem is that I do not know how to print the German characters. Let's say I have a file 'dictionary.txt' with this text:

    cat:Katze
    dog:Hund
    exercise:Übung
    solve:lösen
    door:Tür
    cheese:Käse

    # -*- coding: UTF-8 -*-
    words = {}  # empty dictionary
    with open('dictionary.txt') as my_file:
        for line in my_file.readlines():
            if len(line.strip()) > 0:  # ignoring blank lines
                elem = line.split(':')  # split on ":"
                words[elem[0]] = elem[1].strip()  # appending elements to dictionary
    print words

2 Answers

You are looking at byte string values, printed as repr() results because they are contained in a dictionary. String representations can be re-used as Python string literals and non-printable and non-ASCII characters are shown using string escape sequences. Container values are always represented with repr() to ease debugging.

Thus, the string ‘K\xc3\xa4se’ contains two non-ASCII bytes with hex values C3 and A4, a UTF-8 combo for the U+00E4 codepoint.
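The byte values can be checked directly. This is a small sketch in Python 3 (an assumption on my part, since the code in this thread is Python 2), where text and bytes are separate types, so the encoding step is explicit:

```python
text = "K\u00e4se"            # "Käse", with U+00E4 written as an escape
raw = text.encode("utf-8")    # the bytes as they would be stored in the file

print(raw)                    # the bytes literal shows the two-byte sequence for U+00E4
print(len(text), len(raw))    # 4 characters, 5 bytes
print(raw.decode("utf-8") == text)
```

The one character ä becomes the two bytes C3 A4, which is why the byte string is one element longer than the text.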

You should decode the values to unicode objects:

    with open('dictionary.txt') as my_file:
        for line in my_file:  # just loop over the file
            if line.strip():  # ignoring blank lines
                key, value = line.decode('utf8').strip().split(':')
                words[key] = value

or better still, use codecs.open() to decode the file as you read it:

    import codecs

    with codecs.open('dictionary.txt', 'r', 'utf8') as my_file:
        for line in my_file:
            if line.strip():  # ignoring blank lines
                key, value = line.strip().split(':')
                words[key] = value

Printing the resulting dictionary will still use repr() results for the contents, so now you'll see u'cheese': u'K\xe4se' instead, because \xe4 is the escape code for Unicode code point 00E4, the ä character. Print individual words (for example, print words['cheese']) if you want the actual characters to be written to the terminal.

But now you can compare these values with other data that you decoded, provided you know their correct encoding, manipulate them, and encode them again to whatever target codec you need. print will do this automatically, for example, when printing unicode values to your terminal.
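For readers on Python 3, here is a rough equivalent of the approach above, assuming the file is UTF-8 encoded. The built-in open() takes an encoding argument there, so neither codecs.open() nor a manual decode() is needed. The `load_words` helper and the temporary file are illustrative stand-ins for the question's code and its dictionary.txt.

```python
import os
import tempfile


def load_words(path, encoding="utf-8"):
    """Read key:value lines from path into a dict of decoded text."""
    words = {}
    with open(path, encoding=encoding) as my_file:
        for line in my_file:
            line = line.strip()
            if not line:  # ignoring blank lines
                continue
            key, value = line.split(":", 1)
            words[key] = value
    return words


# Write a small UTF-8 sample so the sketch runs without the real file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dictionary.txt")
    with open(path, "wb") as f:
        f.write("cat:Katze\ncheese:K\u00e4se\n\ndoor:T\u00fcr\n".encode("utf-8"))
    words = load_words(path)

print(words["cheese"])
```

The values are already text after reading, so they can be compared directly with whatever the user types in.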

You may want to read up on Unicode and Python.


Python: how to convert from Windows 1251 to Unicode?

I’m trying to convert file content from Windows-1251 (Cyrillic) to Unicode with Python. I found this function, but it doesn’t work.

    #!/usr/bin/env python

    import os
    import sys
    import shutil

    def convert_to_utf8(filename):
        # gather the encodings you think that the file may be
        # encoded in, inside a tuple
        encodings = ('windows-1253', 'iso-8859-7', 'macgreek')

        # try to open the file and exit if some IOError occurs
        try:
            f = open(filename, 'r').read()
        except Exception:
            sys.exit(1)

        # now start iterating over our encodings tuple and try to
        # decode the file
        for enc in encodings:
            try:
                # try to decode the file with the first encoding
                # from the tuple.
                # if it succeeds then it will reach break, so we
                # will be out of the loop (something we want on
                # success).
                # the data variable will hold our decoded text
                data = f.decode(enc)
                break
            except Exception:
                # if the first encoding fails, then the continue
                # keyword will start again with the second encoding
                # from the tuple, and so on, until it succeeds.
                # if for some reason it reaches the last encoding of
                # our tuple without success, then exit the program.
                if enc == encodings[-1]:
                    sys.exit(1)
                continue

        # now get the absolute path of our filename and append .bak
        # to the end of it (for our backup file)
        fpath = os.path.abspath(filename)
        newfilename = fpath + '.bak'
        # and make our backup file with shutil
        shutil.copy(filename, newfilename)

        # and at last convert it to utf-8
        f = open(filename, 'w')
        try:
            f.write(data.encode('utf-8'))
        except Exception, e:
            print e
        finally:
            f.close()
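Note that the encodings tuple in the function above lists Greek code pages (windows-1253, iso-8859-7, macgreek), not Cyrillic ones. Below is a minimal Python 3 sketch, assuming the file is known to be Windows-1251, so it decodes with that codec directly instead of guessing. 'cp1251' is Python's name for that codec; the `source_encoding` parameter and the sample file are my own illustrative additions, not part of the original function.

```python
import os
import shutil
import tempfile


def convert_to_utf8(filename, source_encoding="cp1251"):
    """Rewrite filename as UTF-8, keeping the original as filename + '.bak'."""
    with open(filename, "rb") as f:
        raw = f.read()
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    text = raw.decode(source_encoding)
    # Back up the untouched original before overwriting it.
    shutil.copy(filename, filename + ".bak")
    # newline="" writes line endings exactly as they were read.
    with open(filename, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return text


# Demo on a temporary file holding a Cyrillic word encoded as Windows-1251.
sample = "\u041f\u0440\u0438\u0432\u0435\u0442"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(sample.encode("cp1251"))
    convert_to_utf8(path)
    with open(path, "rb") as f:
        converted = f.read()
    with open(path + ".bak", "rb") as f:
        backup = f.read()

print(converted.decode("utf-8") == sample)
```

Decoding happens before the backup and the rewrite, so a file in an unexpected encoding fails early and is left untouched.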
