Delete stop words in Python

How to remove stop words using nltk or python

I have a dataset from which I would like to remove stop words. I used NLTK to get a list of stop words:

from nltk.corpus import stopwords
stopwords.words('english')

Exactly how do I compare the data to the list of stop words, and thus remove the stop words from the data?

You also need to run nltk.download('stopwords') once to make the stopword corpus available.

Note that a word like 'not' is also considered a stopword in NLTK. If you do something like sentiment analysis or spam filtering, a negation may change the entire meaning of the sentence, and if you remove it during the preprocessing phase you might not get accurate results.


from nltk.corpus import stopwords

filtered_words = [word for word in word_list if word not in stopwords.words('english')]

Thanks to both answers; they both work, although it would seem I have a flaw in my code preventing the stop list from working correctly. Should this be a new question post? Not sure how things work around here just yet!

The words in stopwords.words('english') are lower case, so make sure to use only lower-case words in your list, e.g. [w.lower() for w in word_list].


You could also do a set diff, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english'))) 

Note: this converts the sentence to a set, which removes all duplicate words, so you will not be able to do frequency counting on the result.

Converting to a set might remove viable information from the sentence by dropping multiple occurrences of an important word.

To exclude all types of stop words, including the NLTK stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))        # about 900 stopwords
nltk_words = list(stopwords.words('english'))  # about 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]

I suppose you have a list of words (word_list) from which you want to remove stopwords. You could do something like this:

filtered_word_list = word_list[:]  # make a copy of the word_list
for word in word_list:  # iterate over word_list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove word from filtered_word_list if it is a stopword

There’s a very simple, light-weight Python package stop-words just for this purpose.

First install the package using: pip install stop-words

Then you can remove your words in one line using list comprehension:

from stop_words import get_stop_words

filtered_words = [word for word in dataset if word not in get_stop_words('english')]

This package is very light-weight to download (unlike nltk), works for both Python 2 and Python 3, and has stop words for many other languages, such as:

Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Indonesian, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, Ukrainian

Here is my take on this, in case you want to immediately get the answer into a string (instead of a list of filtered words):

STOPWORDS = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in STOPWORDS])  # delete stopwords from text

Use textcleaner library to remove stopwords from your data.

Follow these steps to do so with this library.

import textcleaner as tc

data = tc.document()  # you can also pass a list of sentences to the document class constructor
data.remove_stpwrds()  # inplace is set to False by default

Use the above code to remove the stop words.

Although the question is a bit old, here is a new library worth mentioning that can do extra tasks.

In some cases, you don’t only want to remove stop words. Rather, you want to find the stopwords in the text data and store them in a list so that you can find the noise in the data and make it more interactive.

The library is called ‘textfeatures’. You can use it as follows:

!pip install textfeatures

import textfeatures as tf
import pandas as pd

For example, suppose you have the following set of strings:

texts = [
    "blue car and blue window",
    "black crow in the window",
    "i see my reflection in the window"
]
df = pd.DataFrame(texts)  # convert to a dataframe
df.columns = ['text']     # give a name to the column
df

Now, call the stopwords() function and pass the parameters you want:

tf.stopwords(df, "text", "stopwords")  # extract stop words into a new column
df[["text", "stopwords"]].head()       # preview the two columns

The result is going to be:

                                text         stopwords
0           blue car and blue window             [and]
1           black crow in the window         [in, the]
2  i see my reflection in the window  [i, my, in, the]

As you can see, the last column has the stop words included in that document (record).


How to remove Stop Words in Python using NLTK?

Remove Stop Words

In this tutorial, we will learn how to remove stop words from a piece of text in Python. Removing stop words from text comes under pre-processing of data before using machine learning models on it.

What are stop words?

Stop words are words in a natural language that carry very little meaning. These are words like ‘is’, ‘the’, and ‘and’.

While extracting information from text, these words don’t provide anything meaningful. Therefore it is a good practice to remove stop words from the text before using it to train machine learning models.

Another advantage of removing stop words is that it reduces the size of the dataset and the time taken to train the model.

The practice of removing stop words is also common among search engines. Search engines like Google remove stop words from search queries to yield a quicker response.

In this tutorial, we will be using the NLTK module to remove stop words.

NLTK module is the most popular module when it comes to natural language processing.

To start we will first download the corpus with stop words from the NLTK module.

Download the corpus with stop words from NLTK

To download the corpus use :

import nltk
nltk.download('stopwords')


Now we can start using the corpus.

Let’s print out the list of stop words from the corpus. To do that use:

from nltk.corpus import stopwords
print(stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

This is the list of stop words for English language. There are other languages available too.

To print the list of languages available use :

from nltk.corpus import stopwords
print(stopwords.fileids())
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']

These are the languages for which stop words are available in the NLTK ‘stopwords‘ corpus.

How to add your own stop words to the corpus?

To add stop words of your own to the list use :

new_stopwords = stopwords.words('english')
new_stopwords.append('SampleWord')

Now you can use ‘new_stopwords‘ as the new corpus. Let’s learn how to remove stop words from a sentence using this corpus.

How to remove stop words from the text?

In this section, we will learn how to remove stop words from a piece of text. Before we can move on, you should read this tutorial on tokenization.

Tokenization is the process of breaking down a piece of text into smaller units called tokens. These tokens form the building block of NLP.

We will use tokenization to convert a sentence into a list of words. Then we will remove the stop words from that Python list.

nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "This is a sentence in English that contains the SampleWord"
text_tokens = word_tokenize(text)
remove_sw = [word for word in text_tokens if word not in stopwords.words()]
print(remove_sw)
['This', 'sentence', 'English', 'contains', 'SampleWord']

You can see that the output still contains ‘SampleWord‘. That is because we used the default corpus for removing stop words. Let’s use the corpus that we created. We’ll use list comprehension for the same.

nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "This is a sentence in English that contains the SampleWord"
text_tokens = word_tokenize(text)
remove_sw = [word for word in text_tokens if word not in new_stopwords]
print(remove_sw)
['This', 'sentence', 'English', 'contains']

Conclusion

This tutorial was about removing stop words from text in Python. We used the NLTK module to remove stop words from the text. We hope you had fun learning with us!

