Natural language processing in Python: generating word clouds with WordCloud

Original link: tecdat.cn/?p=8585

Original source: Tuoduan Data Tribe Official Account

 

 

Learn how to perform exploratory data analysis for natural language processing using WordCloud in Python.

What is WordCloud?

 

You have probably seen a cloud filled with words of different sizes, where the size of each word represents its frequency or importance. This is called a tag cloud or word cloud. In this tutorial, you will learn how to create your own word cloud in Python with the WordCloud library and customize it as needed.

Prerequisites

The numpy library is one of the most popular and useful libraries for handling multidimensional arrays and matrices. It is also used in conjunction with the Pandas library to perform data analysis.

Installing wordcloud can be a bit tricky. If you only need it to draw a basic word cloud, then either of the following will suffice:

pip install wordcloud

conda install -c conda-forge wordcloud

Otherwise, you can install the package from source:

git clone https://github.com/amueller/word_cloud.git
cd word_cloud
pip install .

Data set:

First, load all the necessary libraries:

# Start with loading all necessary libraries
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
%matplotlib inline

# Suppress library warnings in the notebook output
import warnings
warnings.filterwarnings("ignore")


Load the data frame. Note that index_col=0 tells pandas to use the first column of the file as the row index, so the row names are not read in as a separate column.

# Load in the dataframe
df = pd.read_csv("data/winemag-data-130k-v2.csv", index_col=0)

# Looking at first 5 rows of the dataset
df.head()

 

 

Print some basic information about the dataset:

print("There are {} observations and {} features in this dataset.\n".format(df.shape[0], df.shape[1]))
print("There are {} types of wine in this dataset such as {}...\n".format(len(df.variety.unique()), ", ".join(df.variety.unique()[0:5])))
print("There are {} countries producing wine in this dataset such as {}...\n".format(len(df.country.unique()), ", ".join(df.country.unique()[0:5])))

There are 129971 observations and 13 features in this dataset.

There are 708 types of wine in this dataset such as White Blend, Portuguese Red, Pinot Gris, Riesling, Pinot Noir...

There are 44 countries producing wine in this dataset such as Italy, Portugal, US, Spain, France...
df[["country", "description", "points"]].head()

   country   description                                              points
0  Italy     The aroma includes tropical fruits, broom, brimston...   87
1  Portugal  This is a ripe fruity, silky wine...                     87
2  US        Tart and lively, the taste of lime pulp and...           87
3  US        Pineapple peel, lemon pith and orange blossom...         87
4  US        Just like regular bottling since 2012, this...           87

Use groupby() to calculate summary statistics.

With the wine dataset, you can group by country and view summary statistics for the points and prices of the wines in every country.
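The grouping step can be sketched as follows. This is only a minimal illustration on a tiny hand-made frame: the sample values and the names sample, by_country and top are assumptions made for this example, not part of the wine dataset. With the real data you would call the same pandas methods on df.

```python
import pandas as pd

# Tiny stand-in for the wine data (values invented for illustration only).
sample = pd.DataFrame({
    "country": ["US", "US", "France", "France", "Italy"],
    "points": [90, 84, 92, 94, 88],
    "price": [30.0, 20.0, 50.0, 70.0, 25.0],
})

# Group the rows by country; each group holds that country's wines.
by_country = sample.groupby("country")

# Summary statistics (count, mean, std, quartiles, ...) per country.
print(by_country[["points", "price"]].describe())

# Average points and price per country, highest average points first.
top = (
    by_country[["points", "price"]]
    .mean()
    .sort_values(by="points", ascending=False)
    .head(5)
)
print(top)
```

Sorting the per-country means by points and keeping the first five rows is what yields a "top 5 by average score" table.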

 

 

This will select the top 5 highest average scores among all 44 countries:

country          points     price
United Kingdom   91.581081  51.681159
India            90.222222  13.333333
Austria          90.101345  30.762772
Germany          89.851732  42.257547
Canada           89.369650  35.712598

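You can use the plot method of a Pandas DataFrame together with Matplotlib to plot the number of wines by country.

# Count the wines per country and plot the counts, largest first
df.groupby("country").size().sort_values(ascending=False).plot.bar()
plt.ylabel("Number of Wines")
plt.show()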

 

 

Among the 44 wine-producing countries, the US has more than 50,000 wines in the review dataset, more than twice as many as the second-ranked country, France, which is famous for its wine. Italy also produces a large number of high-quality wines, with nearly 20,000 wines available for review.

Does quantity exceed quality?

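Now, plot all 44 countries by their highest-rated wine:

# Highest score per country, sorted from highest to lowest
df.groupby("country")["points"].max().sort_values(ascending=False).plot.bar()
plt.ylabel("Highest point of Wines")
plt.show()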

 

 

Australia, the United States, Portugal, Italy and France all have 100-point wines. Note that Portugal ranks 5th and Australia 9th by number of wines in the dataset, and both countries have fewer than 8,000 wines.

Set up basic WordCloud

Before using any function, it is a good idea to check its docstring and review all required and optional parameters. To do this, type ?function and run it to get all the information.

?WordCloud

Init signature: WordCloud(font_path=None, width=400, height=200, margin=2, ranks_only=None, prefer_horizontal=0.9, mask=None, scale=1, color_func=None, max_words=200, min_font_size=4, stopwords=None, random_state=None, background_color='black', max_font_size=None, font_step=1, mode='RGB', relative_scaling=0.5, regexp=None, collocations=True, colormap=None, normalize_plurals=True, contour_width=0, contour_color='black')
Docstring:
Word cloud object for generating and drawing.

Parameters
----------
font_path : string
    Font path to the font that will be used (OTF or TTF). Defaults to DroidSansMono path on a Linux machine. If you are on another OS or don't have this font, you need to adjust this path.
width : int (default=400)
    Width of the canvas.
height : int (default=200)
    Height of the canvas.
prefer_horizontal : float (default=0.90)
    The ratio of times to try horizontal fitting as opposed to vertical. If prefer_horizontal < 1, the algorithm will try rotating the word if it doesn't fit. (There is currently no built-in way to get only vertical words.)
mask : nd-array or None (default=None)
    If not None, gives a binary mask on where to draw words. If mask is not None, width and height will be ignored, and the shape of mask will be used instead. All white (#FF or #FFFFFF) entries will be considered "masked out" while other entries will be free to draw on. [This changed in the most recent version!]
contour_width : float (default=0)
    If mask is not None and contour_width > 0, draw the mask contour.
contour_color : color value (default="black")
    Mask contour color.
scale : float (default=1)
    Scaling between computation and drawing. For large word-cloud images, using scale instead of larger canvas size is significantly faster, but might lead to a coarser fit for the words.
min_font_size : int (default=4)
    Smallest font size to use. Will stop when there is no more room in this size.
font_step : int (default=1)
    Step size for the font. font_step > 1 might speed up computation but give a worse fit.
max_words : number (default=200)
    The maximum number of words.
stopwords : set of strings or None
    The words that will be eliminated. If None, the build-in STOPWORDS list will be used.
background_color : color value (default="black")
    Background color for the word cloud image.
max_font_size : int or None (default=None)
    Maximum font size for the largest word. If None, the height of the image is used.
mode : string (default="RGB")
    Transparent background will be generated when mode is "RGBA" and background_color is None.
relative_scaling : float (default=.5)
    Importance of relative word frequencies for font-size. With relative_scaling=0, only word-ranks are considered. With relative_scaling=1, a word that is twice as frequent will have twice the size. If you want to consider the word frequencies and not only their rank, relative_scaling around .5 often looks good.
    .. versionchanged: 2.0
       Default is now 0.5.
color_func : callable, default=None
    Callable with parameters word, font_size, position, orientation, font_path, random_state that returns a PIL color for each word. Overwrites "colormap". See colormap for specifying a matplotlib colormap instead.
regexp : string or None (optional)
    Regular expression to split the input text into tokens in process_text. If None is specified, ``r"\w[\w']+"`` is used.
collocations : bool, default=True
    Whether to include collocations (bigrams) of two words.
    .. versionadded: 2.0
colormap : string or matplotlib colormap, default="viridis"
    Matplotlib colormap to randomly draw colors from for each word. Ignored if "color_func" is specified.
    .. versionadded: 2.0
normalize_plurals : bool, default=True
    Whether to remove trailing 's' from words. If True and a word appears with and without a trailing 's', the one with trailing 's' is removed and its counts are added to the version without trailing 's' - unless the word ends with 'ss'.

Attributes
----------
``words_`` : dict of string to float
    Word tokens with associated frequency.
    .. versionchanged: 2.0
       ``words_`` is now a dictionary
``layout_`` : list of tuples (string, int, (int, int), int, color))
    Encodes the fitted word cloud. Encodes for each word the string, font size, position, orientation, and color.

Notes
-----
Larger canvases will make the code significantly slower. If you need a large word cloud, try a lower canvas size, and set the scale parameter.

The algorithm might give more weight to the ranking of the words then their actual frequencies, depending on the ``max_font_size`` and the scaling heuristic.
File: c:\intelpython3\lib\site-packages\wordcloud\wordcloud.py
Type: type

You can see that every parameter of the WordCloud constructor has a default value, so all of them are optional. The only input you must supply is the text, which you pass to generate().

So let's start with a simple example: use the first observation description as input to the wordcloud. The three steps are:

  • Extract the review (text document)
  • Create and generate wordcloud images
  • Use matplotlib to display the cloud
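# Start with one review:
text = df.description[0]

# Create and generate a word cloud image:
wordcloud = WordCloud().generate(text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()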

 

 

You can see that the first review mentioned a lot about the aroma of wine.

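Now, change some optional WordCloud parameters, such as max_font_size, max_words, and background_color.

# Example values (assumed here); adjust them to taste
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(df.description[0])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()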

 

 

If you want to save the image, WordCloud provides the to_file function:

# Save the image in the img folder:
wordcloud.to_file("img/first_review.png")

<wordcloud.wordcloud.WordCloud at 0x16f1d704978>

When you load the saved image, the result looks like this:

 

 

Now, merge all the wine reviews into one big text and create one big word cloud to see the most common characteristics of these wines.

 

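text = " ".join(review for review in df.description)
# Note: len(text) counts characters, not words
print("There are {} words in the combination of all review.".format(len(text)))

There are 31661073 words in the combination of all review.

# Generate a word cloud image, excluding the built-in stop words
wordcloud = WordCloud(stopwords=set(STOPWORDS), background_color="white").generate(text)

# Display the generated image the matplotlib way:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()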

 

 

It seems that black cherry and full-bodied are the most frequently mentioned characteristics, while Cabernet Sauvignon is the most popular variety. This fits with Cabernet Sauvignon being one of the best-known red wine grape varieties in the world.

Now, let's pour these words into a glass of wine!

To create a shape for your word cloud, you first need a PNG file to serve as the mask. The following is a good one that is available on the Internet:

 

 

To ensure that the mask works properly, let's view it as a numpy array:
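A sketch of that check is shown here. It assumes the mask would normally be read from a file such as img/wine_mask.png, which is a hypothetical path. So that the snippet runs without that file, it first draws a small stand-in mask with Pillow; the make_demo_mask helper and the shape it draws are illustrative assumptions.

```python
import numpy as np
from PIL import Image, ImageDraw


def make_demo_mask(width=60, height=120):
    """Draw a grayscale stand-in mask on a black background (hypothetical shape)."""
    img = Image.new("L", (width, height), 0)    # mode "L": 8-bit grayscale, all zeros
    draw = ImageDraw.Draw(img)
    draw.ellipse([10, 30, 50, 110], fill=128)   # bottle "body"
    draw.rectangle([24, 5, 36, 40], fill=128)   # bottle "neck"
    return img


# With a real file you would use: Image.open("img/wine_mask.png")  (assumed path)
wine_mask = np.array(make_demo_mask())

# Inspect the shape, the dtype and the raw values of the mask.
print(wine_mask.shape, wine_mask.dtype)
print(wine_mask)
```

Looking at the raw array shows which value the background uses, which matters because WordCloud treats only white (255) entries as masked out.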

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

 

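WordCloud treats white (255) entries of the mask as masked out, so the background of this mask must be 255, not 0. First, use the transform_format() function to swap the number 0 for 255.

def transform_format(val):
    if val == 0:
        return 255
    else:
        return val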

Then, create a new mask with the same shape as the existing mask, and apply the transform_format() function to every value in every row of the previous mask.
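One way to carry out this step is sketched below. The small wine_mask array is a stand-in with assumed values, and the names transformed_wine_mask and vectorized are chosen for illustration. The loop follows the description literally; the np.where line is an equivalent vectorized alternative, not necessarily what the original author used.

```python
import numpy as np


def transform_format(val):
    """Map the background value 0 to white (255); leave other values unchanged."""
    return 255 if val == 0 else val


# Stand-in for the original mask (assumed values, for illustration only).
wine_mask = np.array([[0, 0, 0, 0],
                      [0, 128, 128, 0],
                      [0, 128, 128, 0]], dtype=np.uint8)

# Element-by-element version: a new array of the same shape, filled row by row.
transformed_wine_mask = np.empty(wine_mask.shape, dtype=np.int32)
for i in range(wine_mask.shape[0]):
    transformed_wine_mask[i] = [transform_format(v) for v in wine_mask[i]]

# Vectorized equivalent: replace every 0 with 255 in a single call.
vectorized = np.where(wine_mask == 0, 255, wine_mask).astype(np.int32)

print(transformed_wine_mask)
```

On a full-size image the vectorized form is usually much faster than the Python loop, while producing the same array.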

Now you have a new mask in the correct form:

array([[255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       ...,
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255]])

Ok! With the correct mask, you can start to make a wordcloud with the selected shape.

# show
plt.figure(figsize=[20,10])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

 

 

You have created a word cloud in the shape of a wine bottle! It seems that black cherry, fruit flavors and full-bodied characteristics are mentioned most often in the wine descriptions. Now, let's take a closer look at each country's reviews:

 

 

Create wordcloud according to color patterns

You can combine all the reviews of the five countries with the most wines. To find these countries, you can either look at the plot of the number of wines by country above, or use the groups from earlier to count the observations in each country (each group) and sort them in descending order with sort_values() and the argument ascending=False.
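The counting and sorting can be sketched like this. The sample frame and the names counts and top_countries are assumptions for illustration; on the real data the same calls would run on df.

```python
import pandas as pd

# Tiny stand-in for the wine data (invented rows, for illustration only).
sample = pd.DataFrame({
    "country": ["US", "US", "US", "France", "France", "Italy"],
    "points": [87, 90, 88, 92, 91, 89],
})

# Number of observations in each country group, largest group first.
counts = sample.groupby("country").size().sort_values(ascending=False)
print(counts.head(5))

# Names of the countries with the most wines.
top_countries = counts.head(5).index.tolist()
print(top_countries)
```

Changing head(5) to head(10) gives a longer ranking in the same way.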

country
US          54504
France      22093
Italy       19540
Spain        6645
Portugal     5691
dtype: int64

So now you have 5 popular countries: the United States, France, Italy, Spain, and Portugal.

country
US           54504
France       22093
Italy        19540
Spain         6645
Portugal      5691
Chile         4472
Argentina     3800
Austria       3345
Australia     2329
Germany       2165
dtype: int64

For now, five countries are enough.

To get all the reviews for each country, you can concatenate them with the " ".join(list) syntax, which joins all the elements of a list into one string separated by spaces.
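A short sketch of that concatenation, using an invented three-row frame; the variable names usa and fra are assumptions for illustration:

```python
import pandas as pd

# Tiny stand-in for the wine data (invented reviews, for illustration only).
sample = pd.DataFrame({
    "country": ["US", "France", "US"],
    "description": ["Tart and lively.", "Ripe and fruity.", "Pineapple rind."],
})

# Select one country's reviews and join them with single spaces.
usa = " ".join(sample[sample["country"] == "US"]["description"])
fra = " ".join(sample[sample["country"] == "France"]["description"])

print(usa)
print(fra)
```

Joining with a space rather than an empty string keeps the last word of one review from merging with the first word of the next.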


Then, create the wordcloud as described above.

# store to file
plt.savefig("img/us_wine.png", format="png")
plt.show()

 

 

Looks great! Now, repeat the process with the French reviews.

# store to file
plt.savefig("img/fra_wine.png", format="png")
#plt.show()

Note that you should save the image after plotting so that the word cloud has the desired color pattern.

 

 

# store to file
plt.savefig("img/ita_wine.png", format="png")
#plt.show()

 

 

After Italy comes Spain:

# store to file
plt.savefig("img/spa_wine.png", format="png")
#plt.show()

 

Finally, Portugal:

# store to file
plt.savefig("img/por_wine.png", format="png")
#plt.show()

 

 

The final results are in the table below.