Python for natural language processing (NLP): using the Facebook FastText library

Original source: Tuoduan Data Tribe Official Account



In this article, we will study FastText, another extremely useful module for word embedding and text classification.

The article is divided into two parts. In the first part, we will see how the FastText library creates vector representations that can be used to find semantic similarity between words. In the second part, we will apply the FastText library to text classification.

FastText for semantic similarity

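FastText supports both the continuous bag of words (CBOW) and skip-gram models. In this article, we will implement the skip-gram model to learn vector representations of words from the Wikipedia articles on artificial intelligence, machine learning, deep learning, and neural networks. Since these topics are very similar, choosing them gives us a substantial amount of related data from which to create a corpus. You can add more topics of a similar nature as needed.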

As a first step, we need to install the wikipedia library for Python, which we will use to download the articles:

    $ pip install wikipedia

Import library

The following script imports the required libraries into our application:

    from keras.preprocessing.text import Tokenizer
    from gensim.models.fasttext import FastText
    import numpy as np
    import matplotlib.pyplot as plt
    import nltk
    from string import punctuation
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import sent_tokenize
    from nltk import WordPunctTokenizer

    import wikipedia

    # Download the NLTK resources used for tokenization, lemmatization and stop words
    nltk.download('punkt')
    nltk.download('wordnet')
    nltk.download('stopwords')
    en_stop = set(nltk.corpus.stopwords.words('english'))

    %matplotlib inline

For word representation and semantic similarity, we can use the Gensim model for FastText.

Wikipedia article

In this step, we will scrape the required Wikipedia articles. Look at the following script:

    # Download the text of the four articles
    artificial_intelligence = wikipedia.page("Artificial Intelligence").content
    machine_learning = wikipedia.page("Machine Learning").content
    deep_learning = wikipedia.page("Deep Learning").content
    neural_network = wikipedia.page("Neural Network").content

    # Split each article into sentences
    artificial_intelligence = sent_tokenize(artificial_intelligence)
    machine_learning = sent_tokenize(machine_learning)
    deep_learning = sent_tokenize(deep_learning)
    neural_network = sent_tokenize(neural_network)

    # Merge all sentences into a single list
    artificial_intelligence.extend(machine_learning)
    artificial_intelligence.extend(deep_learning)
    artificial_intelligence.extend(neural_network)

To scrape a Wikipedia page, we can use the page method from the wikipedia module. The name of the page you want to scrape is passed as a parameter to the page method. The method returns a WikipediaPage object, and you can then use that object's content attribute to retrieve the content of the page, as shown in the script above.

The scraped content from the four Wikipedia pages is then split into sentences with the sent_tokenize method. The sent_tokenize method returns a list of sentences. The sentences of the four pages are tokenized separately. Finally, the extend method joins the sentences from the four articles into a single list.
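The merging step can be sketched without network access. The sketch below uses a naive regular-expression splitter, naive_sent_tokenize, as a hypothetical stand-in for NLTK's sent_tokenize, and two short made-up strings in place of the Wikipedia page contents. Only the use of extend mirrors the script above.

```python
import re

def naive_sent_tokenize(text):
    """Hypothetical stand-in for nltk's sent_tokenize: split after ., ! or ?"""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

# Made-up page contents, used here instead of real Wikipedia articles.
page_a = "Machines can learn. They need data."
page_b = "Networks have layers. Deep networks have many!"

sentences_a = naive_sent_tokenize(page_a)
sentences_b = naive_sent_tokenize(page_b)

# extend() appends the items of the second list to the first, in place.
sentences_a.extend(sentences_b)
print(len(sentences_a))
```

Because extend modifies the list in place, the first list ends up holding the sentences of all pages, which is why the article keeps using the artificial_intelligence variable as the whole corpus.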

Data preprocessing

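The next step is to clean the text data by removing punctuation marks and numbers.

The function defined below performs the preprocessing tasks.

    import re
    from nltk.stem import WordNetLemmatizer

    stemmer = WordNetLemmatizer()

    def preprocess_text(document):
        # The body below is a reconstruction (assumption): keep letters only and lowercase,
        # lemmatize, then drop stop words and words of three characters or fewer
        document = re.sub(r'[^a-zA-Z]+', ' ', str(document)).lower()
        tokens = [stemmer.lemmatize(word) for word in document.split()]
        tokens = [word for word in tokens if word not in en_stop and len(word) > 3]
        preprocessed_text = ' '.join(tokens)
        return preprocessed_text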

Let's see if our function performs the required task by preprocessing a dummy sentence:

    sent = preprocess_text("Artificial intelligence, is the most advanced technology of the present era")
    print(sent)

The preprocessed sentence looks like this:

    artificial intelligence advanced technology present

You will see that the punctuation marks and stop words have been removed.
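The model-training script later in this article expects a variable named word_tokenized_corpus, a list of token lists. A minimal sketch of how such a corpus could be built from the sentence list follows. Here simple_preprocess, STOP_WORDS and final_corpus are hypothetical names, and the simplified preprocessing (no lemmatization, a tiny stop-word set) is an assumption made to keep the example self-contained.

```python
import re

# Tiny illustrative stop-word set; the article uses NLTK's English list.
STOP_WORDS = {"is", "the", "of", "a", "and", "most"}

def simple_preprocess(sentence):
    """Hypothetical, simplified version of preprocess_text (no lemmatization)."""
    sentence = re.sub(r'[^a-zA-Z]+', ' ', sentence).lower()
    tokens = [w for w in sentence.split() if w not in STOP_WORDS and len(w) > 3]
    return ' '.join(tokens)

sentences = [
    "Artificial intelligence, is the most advanced technology of the present era",
    "   ",
    "Deep learning uses neural networks.",
]

# Preprocess every non-empty sentence, then split each one into word tokens.
final_corpus = [simple_preprocess(s) for s in sentences if s.strip() != '']
word_tokenized_corpus = [s.split() for s in final_corpus]
print(word_tokenized_corpus)
```

After preprocessing, the text contains only letters and single spaces, so a plain split is enough here; a dedicated tokenizer such as NLTK's WordPunctTokenizer would give the same tokens on such input.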

Create word representation

We have preprocessed the corpus. Now it's time to use FastText to create word representations. First, let us define the hyperparameters for the FastText model:

    embedding_size = 60
    window_size = 40
    min_word = 5
    down_sampling = 1e-2


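Here embedding_size is the size of the embedding vector, and window_size is the size of the context window, that is, the number of words before and after a word that are treated as its context.

The next hyperparameter is min_word, which specifies the minimum frequency a word must have in the corpus for a representation to be generated. Finally, the most frequently occurring words are downsampled by the value specified in down_sampling.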
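To make the last two hyperparameters more concrete, here is a rough sketch of what a minimum-count filter and frequency-based downsampling do. The keep-probability formula is the one used in the original word2vec C implementation; that Gensim's FastText applies exactly the same formula is an assumption here, and filter_rare and keep_probability are hypothetical helper names.

```python
import math
from collections import Counter

def filter_rare(tokens, min_count):
    """Drop words that occur fewer than min_count times."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

def keep_probability(word_count, total_words, sample):
    """Probability of keeping one occurrence of a word (word2vec-style, assumed)."""
    freq = word_count / total_words
    prob = (math.sqrt(freq / sample) + 1) * (sample / freq)
    return min(1.0, prob)

tokens = ["network"] * 6 + ["deep"] * 5 + ["rare"] * 2
kept = filter_rare(tokens, min_count=5)
print(sorted(set(kept)))

# A word making up 10% of the corpus is frequently dropped; a rare word is always kept.
print(keep_probability(100, 1000, sample=1e-2))
print(keep_probability(1, 1000, sample=1e-2))
```

The smaller the sample value, the more aggressively very frequent words are thinned out during training.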

Now let us create the FastText model for word representation.

    %%time
    # Note: Gensim 4.x renamed the size argument to vector_size and iter to epochs
    ft_model = FastText(word_tokenized_corpus,
                        size=embedding_size,
                        window=window_size,
                        min_count=min_word,
                        sample=down_sampling,
                        sg=1,
                        iter=100)


The sg parameter defines the type of model we want to create. A value of 1 means that we want to create a skip-gram model. Zero specifies the continuous bag of words (CBOW) model, which is also the default value.
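As a rough illustration of what skip-gram training consumes, the sketch below generates (center word, context word) pairs for a fixed window. It is a simplification under stated assumptions: skipgram_pairs is a hypothetical helper, and real implementations typically also shrink the window at random and, in FastText's case, work with character n-grams as well.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["deep", "neural", "network", "learning"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

With skip-gram the model learns to predict each context word from the center word, whereas CBOW predicts the center word from its surrounding context.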

Execute the above script. It may take some time to run. On my machine, the timing statistics for the above code are as follows:

    CPU times: user 1min 45s, sys: 434 ms, total: 1min 45s
    Wall time: 57.2 s

Let's now look at the word representation for the word "artificial":

    print(ft_model.wv['artificial'])

This is the output:

    [-3.7653010e-02 -4.5558015e-01  3.2035065e-01 -1.5289043e-01
      4.0645871e-02 -1.8946664e-01  7.0426887e-01  2.8806925e-01
     -1.8166199e-01  1.7566417e-01  1.1522485e-01 -3.6525184e-01
     -6.4378887e-01 -1.6650060e-01  7.4625671e-01 -4.8166099e-01
      2.0884991e-01  1.8067230e-01 -6.2647951e-01  2.7614883e-01
     -3.6478557e-02  1.4782918e-02 -3.3124462e-01  1.9372456e-01
      4.3028224e-02 -8.2326338e-02  1.0356739e-01  4.0792203e-01
     -2.0596240e-02 -3.5974573e-02  9.9928051e-02  1.7191900e-01
     -2.1196717e-01  6.4424530e-02 -4.4705093e-02  9.7391091e-02
     -2.8846195e-01  8.8607501e-03  1.6520244e-01 -3.6626378e-01
     -6.2017748e-04 -1.5083785e-01 -1.7499258e-01  7.1994811e-02
     -1.9868813e-01 -3.1733567e-01  1.9832127e-01  1.2799081e-01
     -7.6522082e-01  5.2335665e-02 -4.5766738e-01 -2.7947658e-01
      3.7890410e-03 -3.8761377e-01 -9.3001537e-02 -1.7128626e-01
     -1.2923178e-01  3.9627206e-01 -3.6673656e-01  2.2755004e-01]

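Now let us find the five most similar words for each of the words "artificial", "intelligence", "machine", "network", "recurrent", and "deep". You can choose any number of words. The following script prints each specified word along with its 5 most similar words.

    # The dictionary definition is a reconstruction (assumption) based on the output below
    semantically_similar_words = {
        word: [item[0] for item in ft_model.wv.most_similar([word], topn=5)]
        for word in ['artificial', 'intelligence', 'machine', 'network', 'recurrent', 'deep']
    }

    for k, v in semantically_similar_words.items():
        print(k + ":" + str(v))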

The output is as follows:

    artificial:['intelligence', 'inspired', 'book', 'academic', 'biological']
    intelligence:['artificial', 'human', 'people', 'intelligent', 'general']
    machine:['ethic', 'learning', 'concerned', 'argument', 'intelligence']
    network:['neural', 'forward', 'deep', 'backpropagation', 'hidden']
    recurrent:['rnns', 'short', 'schmidhuber', 'shown', 'feedforward']
    deep:['convolutional', 'speech', 'network', 'generative', 'neural']

We can also find the cosine similarity between the vectors of any two words, as shown below:

    print(ft_model.wv.similarity(w1='artificial', w2='intelligence'))

The output shows a value of "0.7481". Cosine similarity can range from -1 to 1. A higher value indicates a higher degree of similarity.
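For reference, the cosine similarity between two vectors is their dot product divided by the product of their lengths. The following is a minimal pure-Python sketch of that definition, not the Gensim implementation, using toy two-dimensional vectors instead of real word vectors.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Because the measure depends only on the angle between the vectors, scaling a vector does not change its similarity to other vectors.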


Visualizing word similarity

Although each word in the model is represented as a 60-dimensional vector, we can use principal component analysis (PCA) to find two principal components. These two components can then be used to plot the words in a two-dimensional space.

    # Flatten the lists of similar words into one list (assumed reconstruction)
    all_similar_words = [word for words in semantically_similar_words.values() for word in words]

    print(all_similar_words)
    print(type(all_similar_words))
    print(len(all_similar_words))

Each key in the dictionary is a word. The corresponding value is a list of all semantically similar words. Since we found the top 5 most similar words for each of the 6 words "artificial", "intelligence", "machine", "network", "recurrent", and "deep", you will find that there are 30 words in the all_similar_words list.


Next, we must find the word vectors of all these 30 words, and then use PCA to reduce the dimension of the word vectors from 60 to 2. We can then use plt, which is an alias for matplotlib.pyplot, to plot the words in a two-dimensional vector space.

Execute the following script to visualize the words:

    from sklearn.decomposition import PCA

    word_vectors = ft_model.wv[all_similar_words]

    # The PCA and scatter steps are a reconstruction (assumption); the source shows only the annotation loop
    p_comps = PCA(n_components=2).fit_transform(word_vectors)
    plt.scatter(p_comps[:, 0], p_comps[:, 1], c='red')

    for word, x, y in zip(all_similar_words, p_comps[:, 0], p_comps[:, 1]):
        plt.annotate(word, xy=(x+0.06, y+0.03), xytext=(0, 0), textcoords='offset points')

In the plot produced by the above script, it can be seen that words that often appear together in the text are also close to each other on the two-dimensional plane.
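The dimensionality reduction itself can be sketched with NumPy alone. The example below is one common way to compute PCA, via the singular value decomposition of the centered data, and is not necessarily what the article's own PCA routine does internally. The function pca_2d is a hypothetical helper, and random numbers stand in for the 30 real 60-dimensional word vectors.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T

rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(30, 60))  # stand-in for real word vectors
points = pca_2d(fake_vectors)
print(points.shape)
```

Each row of points holds the x and y coordinates that would be passed to the plotting calls for the corresponding word.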

FastText for text classification

Text classification refers to classifying text data into predefined categories based on the content of the text. Sentiment analysis, spam detection, and tag detection are some of the most common examples of use cases for text classification.

The dataset

The dataset contains multiple files, but we are only interested in the file containing the reviews. The file contains 5.2 million reviews about different businesses (including restaurants, bars, dentists, doctors, beauty salons, etc.). However, due to memory limitations, we will only use the first 50,000 records to train our model. You can try more records if you wish.

Let's import the required libraries and load the dataset:

    import pandas as pd
    import numpy as np

    yelp_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/yelp_review_short.csv")

In the above script, the pd.read_csv function loads the yelp_review_short.csv file, which contains 50,000 reviews.

We can simplify our problem by converting the numerical value of each review into a categorical value. This is done by adding a new column, reviews_score, to the dataset.
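The conversion can be sketched as a simple threshold on the rating. Everything specific in this sketch is an assumption made for illustration: the ratings are taken to be star values from 1 to 5, the cut-off of 2 stars is hypothetical, and score_to_sentiment is not a function from the article.

```python
def score_to_sentiment(stars, negative_max=2):
    """Map a numeric star rating to a label (the threshold is an assumption)."""
    return "negative" if stars <= negative_max else "positive"

ratings = [1, 2, 3, 4, 5]
reviews_score = [score_to_sentiment(s) for s in ratings]
print(reviews_score)
```

With a pandas data frame, the same mapping could be applied to a whole column at once, for example with the apply method of a Series.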

Install FastText

The next step is to download FastText. You can fetch it from the GitHub repository with the wget command, as shown in the following script:

    !wget <URL of the v0.1.0.zip archive in the FastText GitHub repository>

If you run the above script and see the following results, it means that FastText has been downloaded successfully:

    --2019-08-16 15:05:05--
    Resolving ( Connecting to (||:443... connected.
    HTTP request sent, awaiting response... 302 Found
    Location: [following]
    --2019-08-16 15:05:05--
    Resolving ( Connecting to (||:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [application/zip]
    Saving to: ''

    [<=>] 92.06K --.-KB/s in 0.03s

    2019-08-16 15:05:05 (3.26 MB/s) - '' saved [94267]

The next step is to unzip the FastText archive. Just type the following command:

    !unzip v0.1.0.zip

Next, you must navigate to the directory where you extracted FastText and execute the !make command to build the C++ binaries. Perform the following steps:

    cd fastText-0.1.0
    !make

If you see the following output, it means that FastText has been successfully installed on your computer.

    c++ -pthread -std=c++0x -O3 -funroll-loops -c src/
    c++ -pthread -std=c++0x -O3 -funroll-loops -c src/
    c++ -pthread -std=c++0x -O3 -funroll-loops -c src/
    c++ -pthread -std=c++0x -O3 -funroll-loops -c src/
    c++ -pthread -std=c++0x -O3 -funroll-loops -c src/
    c++ -pthread -std=c++0x -O3 -funroll-loops -c src/
    c++ -pthread -std=c++0x -O3 -funroll-loops -c src/
    c++ -pthread -std=c++0x -O3 -funroll-loops -c src/
    c++ -pthread -std=c++0x -O3 -funroll-loops -c src/
    c++ -pthread -std=c++0x -O3 -funroll-loops args.o dictionary.o productquantizer.o matrix.o qmatrix.o vector.o model.o utils.o fasttext.o src/ -o fasttext

To verify the installation, execute the following command:

    !./fasttext

You should see that FastText supports the following commands:

    usage: fasttext <command> <args>

    The commands supported by FastText are:

      supervised              train a supervised classifier
      quantize                quantize a model to reduce the memory usage
      test                    evaluate a supervised classifier
      predict                 predict most likely labels
      predict-prob            predict most likely labels with probabilities
      skipgram                train a skipgram model
      cbow                    train a cbow model
      print-word-vectors      print word vectors given a trained model
      print-sentence-vectors  print sentence vectors given a trained model
      nn                      query for nearest neighbors
      analogies               query for analogies

Text classification

Before training the FastText model for text classification, it is worth mentioning that FastText accepts data in a special format, as follows:

    __label__tag This is sentence 1
    __label__tag2 This is sentence 2.

If we look at our dataset, it is not in the desired format. A review with positive sentiment should look like this:

    __label__positive burgers are very big portions here.

Similarly, a negative review should look like this:

    __label__negative They do not use organic ingredients, but I thi ...

The following script selects the reviews_score and text columns from the dataset, then adds the __label__ prefix before all values in the reviews_score column. Similarly, whitespace characters such as newlines in the text column are replaced with spaces. Finally, the updated data frame is written to disk as yelp_reviews_updated.txt.

    import pandas as pd
    from io import StringIO
    import csv

    col = ['reviews_score', 'text']
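The remaining steps of that conversion can be sketched with plain Python. This is not the article's pandas code: to_fasttext_line is a hypothetical helper, the two reviews are shortened examples, and the output goes to a temporary file rather than to the Google Drive path used elsewhere in the article.

```python
import os
import tempfile

def to_fasttext_line(label, text):
    """Build one training line: the label prefix, the label, then the text."""
    flat_text = " ".join(text.split())  # collapses newlines and tabs into spaces
    return "__label__" + label + " " + flat_text

rows = [
    ("positive", "Super simple place\nbut amazing nonetheless."),
    ("negative", "They do not use\torganic ingredients."),
]
lines = [to_fasttext_line(label, text) for label, text in rows]

# Write one example per line to a temporary file (the path is illustrative).
out_path = os.path.join(tempfile.mkdtemp(), "reviews_updated.txt")
with open(out_path, "w", encoding="utf-8") as handle:
    handle.write("\n".join(lines) + "\n")

print(lines[0])
```

Flattening the text matters because FastText reads one example per line, so a newline inside a review would otherwise split it into two malformed examples.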



Now let's print the head of the updated yelp_reviews data frame.

    yelp_reviews.head()

You should see the following results:

           reviews_score                                               text
    0  __label__positive  Super simple place but amazing nonetheless. It...
    1  __label__positive  Small unassuming place that changes their menu...
    2  __label__positive  Lester's is located in a beautiful neighborhoo...
    3  __label__positive  Love coming here. Yes the place always needs t...
    4  __label__positive  Had their chocolate almond croissant and it wa...

Similarly, the tail of the data frame looks like this:

               reviews_score                                               text
    49995  __label__positive  This is an awesome consignment store! They hav...
    49996  __label__positive  Awesome laid back atmosphere with made-to-orde...
    49997  __label__positive  Today was my first appointment and I can hones...
    49998  __label__positive  I love this chic salon. They use the best prod...
    49999  __label__positive  This place is delicious. All their meats and s...

We have transformed the data set into the desired shape. The next step is to divide our data into training and test sets. 80% of the data (that is, the first 40,000 records out of 50,000 records) will be used for training data, and 20% of the data (the last 10,000 records) will be used to evaluate the performance of the algorithm.

The following script divides the data into a training set and a test set:

    !head -n 40000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt"
    !tail -n 10000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"

Once the above script is executed, the yelp_reviews_train.txt file containing the training data will be generated. Similarly, the newly generated yelp_reviews_test.txt file will contain the test data.
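For readers not working in a shell, the same first-N and last-N split can be sketched in Python. The helper head_tail_split is a hypothetical name, and a short in-memory list stands in for the lines of the updated reviews file.

```python
def head_tail_split(lines, n_train, n_test):
    """Mimic `head -n n_train` and `tail -n n_test` on a list of lines."""
    return lines[:n_train], lines[len(lines) - n_test:]

lines = ["example %d" % i for i in range(10)]
train, test = head_tail_split(lines, n_train=8, n_test=2)
print(len(train), len(test))
```

Note that this split keeps the file order; if the records were sorted in some way, shuffling them before splitting might give a more representative test set.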

Now it's time to train our FastText text classification algorithm.

    %%time
    !./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" -output model_yelp_reviews

In order to train the algorithm, we must use the supervised command and pass it the input file with the -input option. The name of the model is specified after the -output option. This is the output of the above script:

    Read 4M words
    Number of words:  177864
    Number of labels: 2
    Progress: 100.0%  words/sec/thread: 2548017  lr: 0.000000  loss: 0.246120  eta: 0h0m
    CPU times: user 212 ms, sys: 48.6 ms, total: 261 ms
    Wall time: 15.6 s

You can view the model files with the following !ls command:

    !ls

This is the output:

    args.o Makefile matrix.o model.o src model_yelp_reviews.bin tutorials dictionary.o model_yelp_reviews.vec utils.o PATENTS vector.o fasttext fasttext.o productquantizer.o qmatrix.o yelp_reviews_train.txt LICENSE


You can see model_yelp_reviews.bin in the list of files above.

Finally, you can use the following test command to test the model. You have to specify the model name and the test file after the test command, as shown below:

    !./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"

The output of the above script is as follows:

    N       10000
    P@1     0.909
    R@1     0.909
    Number of examples: 10000


Here P@1 refers to precision and R@1 refers to recall. You can see that our model achieved a precision and recall of 0.909, which is pretty good.
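To see why the two numbers are identical here, consider how precision and recall at 1 are usually defined: precision divides the number of correct top predictions by the number of predictions made, and recall divides it by the number of true labels. The sketch below follows that definition. It is an illustration under these assumptions, not FastText's own code, and precision_recall_at_1 is a hypothetical helper.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """true_labels: one set of gold labels per example; predicted_labels: top-1 predictions."""
    correct = sum(1 for gold, pred in zip(true_labels, predicted_labels) if pred in gold)
    n_predictions = len(predicted_labels)            # one prediction per example
    n_gold = sum(len(gold) for gold in true_labels)  # total number of gold labels
    return correct / n_predictions, correct / n_gold

gold = [{"positive"}, {"negative"}, {"positive"}, {"positive"}]
pred = ["positive", "positive", "positive", "negative"]
p_at_1, r_at_1 = precision_recall_at_1(gold, pred)
print(p_at_1, r_at_1)
```

When every example has exactly one true label, as in our review data, both denominators are equal, so precision at 1 and recall at 1 coincide.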

Now, let's try to normalize the text to improve its consistency: the following command separates punctuation and special characters from the words by padding them with spaces, and converts the text to lowercase.

    !cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt"

And the following script cleans the test set:

    !cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt" | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"

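A Python equivalent of this cleaning step might look as follows. It assumes that the sed expression pads each listed punctuation mark with spaces, as in the FastText tutorial, rather than deleting it. The helper clean_line and the sample review are hypothetical.

```python
import re

# The same set of characters as in the sed expression: . ! ? , ' / ( )
PUNCTUATION = re.compile(r"([.!?,'/()])")

def clean_line(line):
    """Pad the listed punctuation marks with spaces, then lowercase the line."""
    return PUNCTUATION.sub(r" \1 ", line).lower()

cleaned = clean_line("__label__positive Great food, friendly staff!")
print(cleaned.split())
```

Padding punctuation turns marks such as "!" into separate tokens, so that "staff!" and "staff" are no longer treated as two different words.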
Now, we will train the model on the cleaned training set:

    %%time
    !./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews

Finally, we will use the model trained on the cleaned training set to make predictions on the cleaned test set:

    !./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"

The output of the above script is as follows:

    N       10000
    P@1     0.915
    R@1     0.915
    Number of examples: 10000

You will see a small increase in precision and recall. To further improve the model, you can increase the number of epochs and the learning rate of the model. The following script sets the number of epochs to 30 and the learning rate to 0.5.

    %%time
    !./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews -epoch 30 -lr 0.5



The FastText model has recently been shown to work well for word embedding and text classification tasks on many datasets. Compared with other word embedding models, it is very easy to use and lightning fast.