Original link: tecdat.cn/?p=8585
Original source: Tuoduan Data Tribe Official Account
Learn how to perform exploratory data analysis for natural language processing using WordCloud in Python.
What is WordCloud?
You have probably seen a cloud filled with words of different sizes, where the size of each word represents its frequency or importance. This is called a tag cloud or word cloud. In this tutorial, you will learn how to create your own WordCloud in Python and customize it as needed.
Prerequisites
Install the wordcloud package from its source repository:

    git clone https://github.com/amueller/word_cloud.git
    cd word_cloud
    pip install .
Data set:
First, load all the necessary libraries:

    # Start with loading all necessary libraries
    import numpy as np
    import pandas as pd
    from os import path
    from PIL import Image
    from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

    import matplotlib.pyplot as plt
    %matplotlib inline

    import warnings
    warnings.filterwarnings("ignore")
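Load the data frame. Please note that index_col=0 tells pandas to use the first column of the file as the row index instead of reading it as a separate column.

    # Load in the dataframe
    df = pd.read_csv("data/winemag-data-130k-v2.csv", index_col=0)

    # Looking at first 5 rows of the dataset
    df.head()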
Print a short summary of the dataset:

    print("There are {} observations and {} features in this dataset.\n".format(df.shape[0], df.shape[1]))

    print("There are {} types of wine in this dataset such as {}...\n".format(len(df.variety.unique()), ", ".join(df.variety.unique()[0:5])))

    print("There are {} countries producing wine in this dataset such as {}...\n".format(len(df.country.unique()), ", ".join(df.country.unique()[0:5])))
    There are 129971 observations and 13 features in this dataset.

    There are 708 types of wine in this dataset such as White Blend, Portuguese Red, Pinot Gris, Riesling, Pinot Noir...

    There are 44 countries producing wine in this dataset such as Italy, Portugal, US, Spain, France...
    df[["country", "description", "points"]].head()
| | country | description | points |
|---|---|---|---|
| 0 | Italy | The aroma includes tropical fruits, broom, brimston... | 87 |
| 1 | Portugal | This is a ripe fruity, silky wine... | 87 |
| 2 | US | Tart and lively, the taste of lime pulp and... | 87 |
| 3 | US | Pineapple peel, lemon pith and orange blossom... | 87 |
| 4 | US | Just like regular bottling since 2012, this... | 87 |
To compare groups of a feature, you can use grouping. Using the wine dataset, you can group by country and view summary statistics of points and prices for all countries.
This selects the five countries with the highest average points among all 44 countries:
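A grouping of this kind can be sketched as follows. This is a minimal illustration, not the author's original cell: the small data frame and its values are made up, and the column names country, points, and price are assumed to match the wine dataset.

```python
import pandas as pd

# Tiny illustrative sample; the values are made up and are NOT taken from the wine dataset.
sample = pd.DataFrame({
    "country": ["US", "US", "France", "France", "Italy", "Italy"],
    "points": [90, 88, 92, 94, 86, 87],
    "price": [30.0, 20.0, 45.0, 55.0, 15.0, 25.0],
})

# Group the rows by country and average the two numeric columns of interest.
country_means = sample.groupby("country")[["points", "price"]].mean()

# Sort by average points, highest first, and keep at most five rows.
top_countries = country_means.sort_values(by="points", ascending=False).head(5)
print(top_countries)
```

On the real data, the same chain would be applied to df instead of the sample frame.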
| country | points | price |
|---|---|---|
| United Kingdom | 91.581081 | 51.681159 |
| India | 90.222222 | 13.333333 |
| Austria | 90.101345 | 30.762772 |
| Germany | 89.851732 | 42.257547 |
| Canada | 89.369650 | 35.712598 |
You can use Pandas DataFrame and Matplotlib's plot method to plot the number of wines by country/region.
    plt.ylabel("Number of Wines")
    plt.show()
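The counting and plotting step can be sketched end to end as below. This is a hedged example under stated assumptions: the sample rows are made up, the x-axis label is a hypothetical choice, and the Agg backend is selected only so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption: not running in a notebook)
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the wine reviews; only the country column matters here.
sample = pd.DataFrame({"country": ["US", "US", "US", "France", "France", "Italy"]})

# Count the rows in each country group and put the largest group first.
wine_counts = sample.groupby("country").size().sort_values(ascending=False)

# Draw one bar per country using the pandas plotting wrapper around Matplotlib.
ax = wine_counts.plot.bar(figsize=(8, 5))
ax.set_xlabel("Country of Origin")  # hypothetical label
ax.set_ylabel("Number of Wines")
# In a notebook, plt.show() would display the figure at this point.
```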
Among the 44 wine-producing countries, the US has more than 50,000 wines in the wine review dataset, more than twice as many as the second-ranked country, France, a country famous for its wine. Italy also produces a large number of high-quality wines, with nearly 20,000 wines available for review.
Does quantity exceed quality?
Now, look at a plot of all 44 countries ranked by their highest-rated wine:

    plt.ylabel("Highest point of Wines")
    plt.show()
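The ranking behind such a plot can be computed as in the sketch below. The data is made up for illustration, and the column names country and points are assumed to match the dataset.

```python
import pandas as pd

# Made-up rows; NOT taken from the wine dataset.
sample = pd.DataFrame({
    "country": ["US", "US", "France", "Portugal", "Portugal"],
    "points": [100, 88, 97, 100, 85],
})

# For each country keep the best score, then order countries by that score.
best_by_country = sample.groupby("country")["points"].max().sort_values(ascending=False)
print(best_by_country)
```

Calling best_by_country.plot.bar() would then draw the bars in ranked order.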
Australia, the United States, Portugal, Italy and France all have 100-point wines. Note that, in terms of the number of wines in the dataset, Portugal ranks 5th and Australia ranks 9th, and both countries have fewer than 8,000 wines.
Set up basic WordCloud
Before using any function, the first thing you might want to do is check its docstring and review all required and optional parameters. To do this, type ? followed by the function name and run the cell:

    ?WordCloud
    Init signature: WordCloud(font_path=None, width=400, height=200, margin=2, ranks_only=None, prefer_horizontal=0.9, mask=None, scale=1, color_func=None, max_words=200, min_font_size=4, stopwords=None, random_state=None, background_color='black', max_font_size=None, font_step=1, mode='RGB', relative_scaling=0.5, regexp=None, collocations=True, colormap=None, normalize_plurals=True, contour_width=0, contour_color='black')
    Docstring:
    Word cloud object for generating and drawing.

    Parameters
    ----------
    font_path: string
        Font path to the font that will be used (OTF or TTF).
        Defaults to DroidSansMono path on a Linux machine. If you are on
        another OS or don't have this font, you need to adjust this path.

    width: int (default=400)
        Width of the canvas.

    height: int (default=200)
        Height of the canvas.

    prefer_horizontal: float (default=0.90)
        The ratio of times to try horizontal fitting as opposed to vertical.
        If prefer_horizontal < 1, the algorithm will try rotating the word
        if it doesn't fit. (There is currently no built-in way to get only
        vertical words.)

    mask: nd-array or None (default=None)
        If not None, gives a binary mask on where to draw words. If mask is not
        None, width and height will be ignored, and the shape of mask will be
        used instead. All white (#FF or #FFFFFF) entries will be considered
        "masked out" while other entries will be free to draw on. [This
        changed in the most recent version!]

    contour_width: float (default=0)
        If mask is not None and contour_width > 0, draw the mask contour.

    contour_color: color value (default="black")
        Mask contour color.

    scale: float (default=1)
        Scaling between computation and drawing. For large word-cloud images,
        using scale instead of larger canvas size is significantly faster, but
        might lead to a coarser fit for the words.

    min_font_size: int (default=4)
        Smallest font size to use. Will stop when there is no more room in this
        size.

    font_step: int (default=1)
        Step size for the font. font_step > 1 might speed up computation but
        give a worse fit.

    max_words: number (default=200)
        The maximum number of words.

    stopwords: set of strings or None
        The words that will be eliminated. If None, the build-in STOPWORDS
        list will be used.

    background_color: color value (default="black")
        Background color for the word cloud image.

    max_font_size: int or None (default=None)
        Maximum font size for the largest word. If None, the height of the image
        is used.

    mode: string (default="RGB")
        Transparent background will be generated when mode is "RGBA" and
        background_color is None.

    relative_scaling: float (default=.5)
        Importance of relative word frequencies for font-size. With
        relative_scaling=0, only word-ranks are considered. With
        relative_scaling=1, a word that is twice as frequent will have twice
        the size. If you want to consider the word frequencies and not only
        their rank, relative_scaling around .5 often looks good.

        .. versionchanged: 2.0
            Default is now 0.5.

    color_func: callable, default=None
        Callable with parameters word, font_size, position, orientation,
        font_path, random_state that returns a PIL color for each word.
        Overwrites "colormap".
        See colormap for specifying a matplotlib colormap instead.

    regexp: string or None (optional)
        Regular expression to split the input text into tokens in process_text.
        If None is specified, ``r"\w[\w']+"`` is used.

    collocations: bool, default=True
        Whether to include collocations (bigrams) of two words.

        .. versionadded: 2.0

    colormap: string or matplotlib colormap, default="viridis"
        Matplotlib colormap to randomly draw colors from for each word.
        Ignored if "color_func" is specified.

        .. versionadded: 2.0

    normalize_plurals: bool, default=True
        Whether to remove trailing 's' from words. If True and a word
        appears with and without a trailing 's', the one with trailing 's'
        is removed and its counts are added to the version without
        trailing 's' - unless the word ends with 'ss'.

    Attributes
    ----------
    ``words_``: dict of string to float
        Word tokens with associated frequency.

        .. versionchanged: 2.0
            ``words_`` is now a dictionary

    ``layout_``: list of tuples (string, int, (int, int), int, color))
        Encodes the fitted word cloud. Encodes for each word the string, font
        size, position, orientation, and color.

    Notes
    -----
    Larger canvases will make the code significantly slower. If you need a
    large word cloud, try a lower canvas size, and set the scale parameter.

    The algorithm might give more weight to the ranking of the words
    then their actual frequencies, depending on the ``max_font_size`` and the
    scaling heuristic.
    File: c:\intelpython3\lib\site-packages\wordcloud\wordcloud.py
    Type: type
You can see that every parameter of the WordCloud constructor is optional; the only input you must supply is the text, which is passed to the generate() method.
So let's start with a simple example: use the description of the first observation as the input to the wordcloud. The three steps are:
- Extract the review (the text document)
- Create and generate wordcloud images
- Use matplotlib to display the cloud
    # Display the generated image:
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
You can see that the first review mentioned a lot about the aroma of wine.
Now, change some of the optional parameters of WordCloud and generate the cloud again:

    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
If you want to save the image, WordCloud provides the to_file function:

    # Save the image in the img folder:
    wordcloud.to_file("img/first_review.png")

    <wordcloud.wordcloud.WordCloud at 0x16f1d704978>
When you load the saved image, the result looks like this:
Now merge all the wine reviews into one big text and create a big fat cloud to see the most common characteristics of these wines.
    print("There are {} words in the combination of all review.".format(len(text)))

    There are 31661073 words in the combination of all review.
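Merging the reviews can be sketched as below, using a made-up three-row frame in place of the dataset. One detail worth knowing: len() applied to a string counts characters, so a true word count needs the text to be split first.

```python
import pandas as pd

# Made-up reviews standing in for the description column of the dataset.
sample = pd.DataFrame({
    "description": ["Ripe and fruity.", "Tart and snappy.", "Soft tannins."],
})

# Concatenate every review into one string, separated by single spaces.
text = " ".join(review for review in sample["description"])

# len(text) is the number of characters; splitting on whitespace gives the words.
n_chars = len(text)
n_words = len(text.split())
print("{} characters, {} words".format(n_chars, n_words))
```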
    # Display the generated image:
    # the matplotlib way:
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

It seems that black cherry and full-bodied are the most frequently mentioned characteristics, and Cabernet Sauvignon is the most popular variety. This is consistent with Cabernet Sauvignon being one of the most well-known red wine grape varieties in the world.
Now, let's pour these words into a glass of wine!
To create a shape for your wordcloud, you first need a PNG file to use as the mask. Suitable images can be found on the Internet.
To ensure that the mask works properly, let's view it as a numpy array:
    array([[0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           ...,
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
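The masking function expects the masked-out part to be 255 (white) rather than 0. First, use the transform_format() function to swap 0 for 255:

    def transform_format(val):
        if val == 0:
            return 255
        else:
            return val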
Then, create a new mask with the same shape as your existing mask, and apply the function transform_format() to each value in each row of the existing mask.
Now you have a new mask in the correct form.

    array([[255, 255, 255, ..., 255, 255, 255],
           [255, 255, 255, ..., 255, 255, 255],
           [255, 255, 255, ..., 255, 255, 255],
           ...,
           [255, 255, 255, ..., 255, 255, 255],
           [255, 255, 255, ..., 255, 255, 255],
           [255, 255, 255, ..., 255, 255, 255]])
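The mask conversion can be sketched on a tiny array as below. The 3x4 array is a made-up stand-in for a mask loaded from a PNG, and the np.where variant is an alternative offered here for comparison, not something taken from the original cell.

```python
import numpy as np

# Small stand-in for a loaded mask: 0 marks the background in this image.
mask = np.array([[0, 0, 0, 0],
                 [0, 120, 200, 0],
                 [0, 0, 0, 0]], dtype=np.uint8)


def transform_format(val):
    """Swap 0 for 255 so the background counts as masked out."""
    return 255 if val == 0 else val


# Row-by-row version: allocate a new array of the same shape and fill each row.
transformed = np.ndarray(mask.shape, np.int32)
for i in range(len(mask)):
    transformed[i] = list(map(transform_format, mask[i]))

# Vectorized alternative that produces the same values without a Python loop.
transformed_fast = np.where(mask == 0, 255, mask)

print(transformed)
```

On a full-size image the vectorized form is usually much faster than calling a Python function per pixel.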
Ok! With the correct mask, you can start to make a wordcloud with the selected shape.
    # show
    plt.figure(figsize=[20,10])
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.show()
You have created a wordcloud in the shape of a wine bottle! It seems that black cherry, fruit flavors and the full-bodied character of the wine are mentioned most often in the wine descriptions. Now, let's take a closer look at each country's reviews:
Create wordcloud according to color patterns
You can also combine all the reviews of the five countries with the most wines. To find these countries, you can either look at the plot of country versus number of wines above, or use the groups created above to count the observations in each country (each group) and sort the counts in descending order:

    country
    US          54504
    France      22093
    Italy       19540
    Spain        6645
    Portugal     5691
    dtype: int64
So now you have 5 popular countries: the United States, France, Italy, Spain, and Portugal.
    country
    US           54504
    France       22093
    Italy        19540
    Spain         6645
    Portugal      5691
    Chile         4472
    Argentina     3800
    Austria       3345
    Australia     2329
    Germany       2165
    dtype: int64

For now, five countries are enough.
To get all the reviews for each country, you can join the descriptions of that country's rows into a single string.
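Collecting the reviews of one country can be sketched as below. The three rows are made up, and the groupby variant that builds the text for every country at once is an addition for illustration rather than part of the original walkthrough.

```python
import pandas as pd

# Made-up rows; column names are assumed to match the wine dataset.
sample = pd.DataFrame({
    "country": ["US", "France", "US"],
    "description": ["Ripe cherry.", "Crisp apple.", "Toasty oak."],
})

# Select the rows of a single country and join their descriptions.
us_reviews = " ".join(sample[sample["country"] == "US"]["description"])

# Alternative: build one merged text per country in a single pass.
reviews_by_country = sample.groupby("country")["description"].apply(" ".join)

print(us_reviews)
```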
Then, create the wordcloud as described above.
    # store to file
    plt.savefig("img/us_wine.png", format="png")
    plt.show()
Looks great! Now, let's repeat this for the French reviews.

    # store to file
    plt.savefig("img/fra_wine.png", format="png")
    #plt.show()
Note that the image should be saved after plotting so that the saved word cloud has the desired color pattern.

    # store to file
    plt.savefig("img/ita_wine.png", format="png")
    #plt.show()
After Italy is Spain:
    # store to file
    plt.savefig("img/spa_wine.png", format="png")
    #plt.show()
Finally, Portugal:
    # store to file
    plt.savefig("img/por_wine.png", format="png")
    #plt.show()
The final results are in the table below.