Python dataframe to vector

How to Vectorize Text in DataFrames for NLP Tasks — 3 Simple Techniques

Simple code examples using Texthero, Gensim, and Tensorflow

Start Learning NLP Today!

You might have heard that Natural Language Processing (NLP) is one of the most transformative technologies of the decade. It is a fantastic time to get into NLP, as new libraries have abstracted away many of the complexities, allowing users to get state-of-the-art results with little model training and little code.

While a computer can actually be quite good at finding patterns and summarizing documents, it must transform words into numbers before making sense of them. Instead of vectorizing the text at the word level like I’ve done in the past, I’ll explore three techniques, using Texthero, Gensim, and Tensorflow, to vectorize text in dataframes at the sentence, paragraph or document level.

As a bonus, I’ll visualize the vectors using t-SNE too! Complete Code is towards the bottom!
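To make "transforming words into numbers" concrete, here is a minimal sketch of a bag-of-words count built with pandas only. This is a deliberately simpler technique than the three libraries named above, and the toy data and the column name `text` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column name "text" is an assumption.
df = pd.DataFrame({"text": ["draw a card", "draw two cards", "destroy target card"]})

# Split each document into lowercase tokens, one row per (document, token) pair.
tokens = df["text"].str.lower().str.split().explode()

# Count how often each token occurs in each document, then spread the tokens
# across columns so that every document becomes one row of counts.
bow = tokens.groupby(level=0).value_counts().unstack(fill_value=0)

# Each row of `vectors` is the numeric representation of one document.
vectors = bow.to_numpy()
print(bow)
```

Each document ends up as one fixed-length vector, which is the same shape of result the document-level techniques produce, only without any notion of word meaning.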

Download the Dataset

Instead of scraping data from one of my favorite sources, I’m going to explore cards from the game Magic: The Gathering. I’ve been into card games almost my entire life and started playing Magic: The Gathering (MTG) around 2nd grade in the mid 90’s. You can play it for free online if you’re interested in trying it!

I was excited to find free datasets of all the cards. To follow the examples, download the AllPrintings SQLITE file from MTGJSON.com

MTGJSON is an open-source project that catalogs all Magic: The Gathering cards in a portable format.

Import Dependencies and Data

It is easy to connect and load the data into a dataframe since it is already a SQLite file. Follow three steps to load the libraries, data and DataFrame!

1. Import pandas and sqlite3 libraries
2. Connect to the sqlite file
3. Load the…
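The steps above can be sketched roughly as follows. Since the third step is cut off in the source, this is only an assumption about its shape: the table name `cards`, its columns, and the in-memory database are hypothetical stand-ins for the downloaded file.

```python
import sqlite3

import pandas as pd

# Stand-in for the downloaded file: an in-memory database with a made-up table.
# With the real data you would pass the path of the downloaded SQLite file instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (name TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO cards VALUES (?, ?)",
    [("Card A", "Draw a card."), ("Card B", "Destroy target creature.")],
)

# Run a query and load the result straight into a DataFrame.
df = pd.read_sql_query("SELECT name, text FROM cards", conn)
conn.close()
print(df.shape)
```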

Converting a Pandas Dataframe Column into a Vector: A Step-by-Step Guide

To pull known keywords out of a text column, you can apply .str.extract() with a regular expression. The pattern searches for any of the items either at the beginning of the string with whitespace after, in the middle with whitespace before and after, or at the end with whitespace before. If multiple matches are required, you can extract them with .str.findall() and then join the results. Alternatively, if you only wish to match items at the beginning or end of the sentences, you can split the text and check only the first and last words.

How to extract a pandas dataframe column to a vector

my_col = test['my_column'].tolist() 
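If a NumPy array is more convenient than a Python list, `to_numpy()` is the usual counterpart. A small sketch, with made-up data in a DataFrame named `test` to match the line above:

```python
import pandas as pd

# Hypothetical data; only the names `test` and `my_column` come from the snippet above.
test = pd.DataFrame({"my_column": [1.5, 2.0, 3.5], "other": ["a", "b", "c"]})

as_list = test["my_column"].tolist()        # plain Python list
as_array = test["my_column"].to_numpy()     # 1-D NumPy array
as_column = test[["my_column"]].to_numpy()  # 2-D array of shape (n_rows, 1)
```

Selecting with double brackets keeps a DataFrame, which is why the last call yields a column vector instead of a flat array.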

How to extract numbers from a DataFrame column in python?

To get the feet and inches parts, use .split() . If you are confident that only a couple of NaN rows need to be handled, a basic version may suffice.

df['height_feet'] = df['height'].dropna().apply(lambda x: str(x).split("'")[0])
df['height_inches'] = df['height'].dropna().apply(lambda x: str(x).split("'")[-1][0:-1])
df[['height', 'height_feet', 'height_inches']]

The initial part of the split is the feet section, while the final part is the inches section, excluding the final character.

>>> print(df[['height', 'height_feet', 'height_inches']])
     height height_feet height_inches
0      5'4"           5             4
1      5'7"           5             7
2      5'7"           5             7
3      5'0"           5             0
4      5'5"           5             5
...     ...         ...           ...
2562   5'3"           5             3
2563  5'11"           5            11
2564   5'3"           5             3
2565  4'11"           4            11
2566   5'2"           5             2

[2567 rows x 3 columns]
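
To convert the height to centimeters in one step, extract both parts with named groups and multiply by the conversion factors:

# The group names "feet" and "inches" are assumed; they were lost in the source.
df['height2'] = df['height'].str.extract(r'''(?P<feet>\d*)'(?P<inches>\d+)"''') \
    .astype(float).mul([30.48, 2.54]).sum(axis=1)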

Alternatively, combine str.rstrip and str.split :

df['height3'] = df['height'].str.rstrip('"').str.split("'", expand=True) \
    .astype(float).mul([30.48, 2.54]).sum(axis=1)

>>> df.filter(like='height')
     height  height2  height3
0      5'4"   162.56   162.56
1      5'7"   170.18   170.18
2      5'7"   170.18   170.18
3      5'0"   152.40   152.40
4      5'5"   165.10   165.10
...     ...      ...      ...
2562   5'3"   160.02   160.02
2563  5'11"   180.34   180.34
2564   5'3"   160.02   160.02
2565  4'11"   149.86   149.86
2566   5'2"   157.48   157.48

[2567 rows x 3 columns]

Python: Extracting specific columns from ... I’m trying to use Python to read my CSV file, extract specific columns into a pandas.DataFrame, and show that DataFrame. However, I don’t see the DataFrame; I receive Series([], dtype: object) as output. Code sample:

input_file = "C:\\.\\consumer_complaints.csv"
dataset = pd.read_csv(input_file)
df = pd.DataFrame(dataset)
cols = [1, 2, 3, 4]
df = df[df.columns[cols]]

How to extract a class from column using pandas

Use str.contains to check whether the column contains your search value.

df[df['col_name'].str.contains('truck')] 

Alternatively, chain str.get calls to reach the nested values.

df = pd.DataFrame()
df['col1'] = [[['truck', 3, ('a', 2)]], [['car', 2, ('b', 2)]]]

                   col1
0  [[truck, 3, (a, 2)]]
1    [[car, 2, (b, 2)]]

df.col1.str.get(0).str.get(0)

0    truck
1      car
Name: col1, dtype: object

df.loc[df.col1.str.get(0).str.get(0).eq('truck')]

                   col1
0  [[truck, 3, (a, 2)]]

How to select columns from a groupby object in pandas? If you perform an operation on a single column, the result is a Series with a MultiIndex; you can simply apply pd.DataFrame to it and then call reset_index.

How to extract strings from a list in a column in a python pandas dataframe?

I think this solves your problem.

import pandas as pd

lst = ["fi", "ap", "ko", "co", "ex"]
df = pd.DataFrame(
    [["fi doesn't work correctly"], ["apples are cool"], ["this works but translation is ko"]],
    columns=["Explanation"],
)

extracted = []
for index, row in df.iterrows():
    tempList = []
    rowSplit = row['Explanation'].split(" ")
    # Keep every word of the row that appears in lst
    for val in rowSplit:
        if val in lst:
            tempList.append(val)
    if len(tempList) > 0:
        extracted.append(','.join(tempList))
    else:
        extracted.append('N/A')

df['Explanation Extracted'] = extracted

You can also use pandas' apply with a helper function.

def extract_explanation(row):
    custom_substring = ["fi", "ap", "ko", "co", "ex"]
    # The key must match the column name "Explanation" exactly (it is case-sensitive)
    substrings = row['Explanation'].split(" ")
    explanation = "N/A"
    for string in substrings:
        if string in custom_substring:
            explanation = string
    return explanation

# axis=1 passes each row to the function
df['Explanation Extracted'] = df.apply(extract_explanation, axis=1)

The caveat is that this assumes a single match per row (if several items match, only the last one is kept). If multiple matches are expected, you can collect them in a list instead.

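df = pd.DataFrame(
    {"Explanation": [
        "fi doesn't co work correctly",
        "apples are cool",
        "this works but translation is ko",
    ]},
    index=["a", "b", "c"],
)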

You can use .str.extract() with a pattern built from the list:

lst = ["fi", "ap", "ko", "co", "ex"]
pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = df.Explanation.str.extract(pattern, expand=False)

                        Explanation Explanation Extracted
a      fi doesn't co work correctly                    fi
b                   apples are cool                   NaN
c  this works but translation is ko                    ko

The pattern r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)" searches for any item in lst either at the beginning with whitespace afterwards, in the middle with whitespace before and after, or at the end with whitespace before. By using str.extract() , we can extract the capture group (the middle part in () ). If there is no match, the outcome will be NaN .
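One caveat with building a pattern by joining strings: items that contain regex metacharacters would be interpreted as regex syntax. A minimal sketch of guarding against that with `re.escape`, where the item "c++" and the sample sentences are hypothetical and not part of the original answer:

```python
import re

import pandas as pd

lst = ["fi", "c++", "ko"]  # "c++" is a hypothetical item containing regex metacharacters
df = pd.DataFrame({"Explanation": ["fi is slow", "we use c++ here", "nothing to see"]})

# Escape each item so that characters such as "+" are matched literally.
pattern = r"(?:^|\s+)(" + "|".join(map(re.escape, lst)) + r")(?:\s+|$)"
extracted = df["Explanation"].str.extract(pattern, expand=False)
```

For the plain alphabetic items used in this answer the escaping changes nothing, so it only matters once the list comes from user input or contains punctuation.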

To obtain multiple matches, you can use .str.findall() followed by .str.join(", ") to combine the results.

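import numpy as np

pattern = r"(?:^|\s+)(" + "|".join(lst) + r")(?:\s+|$)"
df["Explanation Extracted"] = (
    df.Explanation.str.findall(pattern)
    .str.join(", ")
    # The argument to replace() was lost in the source; mapping empty strings
    # to NaN is an assumption that mirrors the str.extract() result.
    .replace({"": np.nan})
)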

Alternative without regex:

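import numpy as np

df.index = df.index.astype("category")
matches = df.Explanation.str.split().explode().loc[lambda s: s.isin(lst)]
df["Explanation Extracted"] = (
    matches.groupby(level=0)
    .agg(set)
    .str.join(", ")
    # The argument to replace() was lost in the source; mapping empty strings
    # to NaN is an assumption.
    .replace({"": np.nan})
)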

To solely match at the start or end of the sentences, substitute the initial section with:

df.index = df.index.astype("category")
splitted = df.Explanation.str.split()
matches = (
    (splitted.str[:1] + splitted.str[-1:]).explode().loc[lambda s: s.isin(lst)]
)
...

Extracting JSON data from a column in a pandas DataFrame: Below I have created code that creates a new DataFrame that solves your issue. This DataFrame can be merged back onto your original DataFrame.

# DataFrame.append was removed in pandas 2.0, so the frames are collected and concatenated
frames = [
    pd.DataFrame([ast.literal_eval(df1['car_info'][row])])
    for row in range(len(df1['car']))
]
df = pd.concat(frames)

The function literal_eval can be imported …


How to turn a Pandas DataFrame into a table of vectors

#python #pandas #dataframe #vector #transformation

Question:

I have a Pandas DataFrame with two columns, containing a list of user IDs and some URLs they have visited. It looks like this:

   users  urls
0  user1  url1
1  user1  url3
2  user1  url5
3  user2  url2
4  user2  url4
5  user2  url5
6  user3  url1
7  user3  url4
8  user3  url5

I want to create a vector representation of it, for example:

       url1  url2  url3  url4  url5
user1   1.0   NaN   1.0   NaN   1.0
user2   NaN   1.0   NaN   1.0   1.0
user3   1.0   NaN   NaN   1.0   1.0

I have tried various things but keep hitting a wall. Any ideas?

Comments:

1. Check out pd.get_dummies(df) . Also, don't post questions with images as input data; that makes it harder for people to help you.

2. And please don't post images. The question now has nice tables formatted as text, so we can easily copy their contents.
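The first comment's suggestion can be sketched as follows. This is one possible reading of it, not the commenter's own code: the indicators are built per row and then collapsed to one row per user, and unvisited URLs come out as 0 instead of NaN.

```python
import pandas as pd

df = pd.DataFrame(
    [["user1", "url1"], ["user1", "url3"], ["user1", "url5"],
     ["user2", "url2"], ["user2", "url4"], ["user2", "url5"],
     ["user3", "url1"], ["user3", "url4"], ["user3", "url5"]],
    columns=["users", "urls"],
)

# One indicator column per URL, with the user as the row label.
dummies = pd.get_dummies(df.set_index("users")["urls"], dtype=int)

# Collapse the per-visit rows into a single 0/1 vector per user.
vectors = dummies.groupby(level=0).max()
print(vectors)
```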

Answer #1:

What you are describing is a pivot on the urls column.

import pandas as pd

# Make data
df = pd.DataFrame([
    ['user1', 'url1'],
    ['user1', 'url3'],
    ['user1', 'url5'],
    ['user2', 'url2'],
    ['user2', 'url4'],
    ['user2', 'url5'],
    ['user3', 'url1'],
    ['user3', 'url4'],
    ['user3', 'url5']
], columns=['users', 'urls'])

# Add a column to fill the pivoted values
df['count'] = 1
new_df = df.pivot(index='users', columns='urls', values='count').fillna(0)
new_df
# urls   url1  url2  url3  url4  url5
# users
# user1   1.0   0.0   1.0   0.0   1.0
# user2   0.0   1.0   0.0   1.0   1.0
# user3   1.0   0.0   0.0   1.0   1.0

This puts the users column in the index, but you can use reset_index to make it a regular column again.
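One thing to watch for with the pivot approach: `pivot` raises a ValueError when the same user and URL pair appears more than once. As a hedged alternative (not part of the original answer), `pd.crosstab` counts the pairs instead, and `reset_index` then restores `users` as a regular column. The duplicated visit in the sample data is made up to show the difference.

```python
import pandas as pd

# Hypothetical data in which user1 visits url1 twice.
df = pd.DataFrame(
    [["user1", "url1"], ["user1", "url1"], ["user1", "url3"], ["user2", "url2"]],
    columns=["users", "urls"],
)

# Count visits per (user, url) pair; repeated pairs are simply added up.
counts = pd.crosstab(df["users"], df["urls"])

# Turn the counts into 0/1 indicators, then move users back into a column.
visited = (counts > 0).astype(int)
flat = visited.reset_index()
print(flat)
```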

Answer #2:

Recreating your problem with:

import pandas as pd

df = pd.DataFrame([
    ['user1', 'url1'],
    ['user1', 'url3'],
    ['user1', 'url5'],
    ['user2', 'url2'],
    ['user2', 'url4'],
    ['user2', 'url5'],
    ['user3', 'url1'],
    ['user3', 'url4'],
    ['user3', 'url5']
], columns=['users', 'urls'])

I used a for loop for my solution. I'm sure someone more competent than me can find a more elegant solution.

import numpy as np

rows = []
for user in np.unique(df['users']):
    # Sum the indicator columns of this user's URLs into one row
    s = pd.get_dummies(df[df['users'] == user]['urls']).sum()
    s.name = user
    rows.append(s)

# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list
new_df = pd.DataFrame(rows)
new_df

       url1  url3  url5  url2  url4
user1   1.0   1.0   1.0   NaN   NaN
user2   NaN   NaN   1.0   1.0   1.0
user3   1.0   NaN   1.0   NaN   1.0

If you want your columns to be ordered, you can simply apply this:

new_df = new_df.reindex(columns=np.unique(df['urls']))

This reorders the columns so that your unique URLs appear in sorted order.

       url1  url2  url3  url4  url5
user1   1.0   NaN   1.0   NaN   1.0
user2   NaN   1.0   NaN   1.0   1.0
user3   1.0   NaN   NaN   1.0   1.0
