Original link: tecdat.cn/?p=8572
Original source: Tuoduan Data Tribe Official Account
In this article, we will study FastText, another extremely useful module for word embedding and text classification. The article is divided into two parts. In the first part, we will see how the FastText library creates vector representations that can be used to find semantic similarities between words. In the second part, we will see the application of the FastText library to text classification.
FastText for semantic similarity
FastText supports the continuous-bag-of-words (CBOW) and skip-gram models. In this article, we will implement the skip-gram model. As our corpus we will scrape several Wikipedia articles on closely related topics; because the topics are so similar, they give us a large amount of semantically related text. You can add more topics of a similar nature as needed.
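The skip-gram model learns word vectors by predicting the context words surrounding each target word. As a minimal illustration (not part of the FastText API), the (target, context) training pairs for a given window size can be generated like this:

```python
def skipgram_pairs(sentence, window):
    """Generate (target, context) pairs for the skip-gram objective."""
    pairs = []
    for i, target in enumerate(sentence):
        # Context words lie within `window` positions of the target word
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs

print(skipgram_pairs(["deep", "learning", "is", "fun"], window=1))
```

During training, the model is optimized so that the vector of each target word scores its observed context words higher than random words.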
In the first step, we need to import the required libraries.
```shell
$ pip install wikipedia
```
The following script imports the required libraries into our application:
```python
from keras.preprocessing.text import Tokenizer
from gensim.models.fasttext import FastText
import numpy as np
import matplotlib.pyplot as plt
import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk import WordPunctTokenizer

import wikipedia
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

%matplotlib inline
```

For word representation and semantic similarity, we use the Gensim implementation of FastText.
In this step, we will crawl the required Wikipedia articles. Look at the following script:
```python
artificial_intelligence = wikipedia.page("Artificial Intelligence").content
machine_learning = wikipedia.page("Machine Learning").content
deep_learning = wikipedia.page("Deep Learning").content
neural_network = wikipedia.page("Neural Network").content

artificial_intelligence = sent_tokenize(artificial_intelligence)
machine_learning = sent_tokenize(machine_learning)
deep_learning = sent_tokenize(deep_learning)
neural_network = sent_tokenize(neural_network)

artificial_intelligence.extend(machine_learning)
artificial_intelligence.extend(deep_learning)
artificial_intelligence.extend(neural_network)
```
To scrape Wikipedia pages, we use the `page` method of the `wikipedia` module; its `content` attribute returns the raw text of the article. Each article is then split into sentences with `sent_tokenize`, and the sentence lists are merged into a single corpus with `extend`.
The next step is to clean the text data by removing punctuation marks and numbers.
```python
import re
from nltk.stem import WordNetLemmatizer

stemmer = WordNetLemmatizer()

def preprocess_text(document):
    # Remove everything that is not a letter and collapse whitespace
    document = re.sub(r'[^a-zA-Z\s]', ' ', document)
    document = re.sub(r'\s+', ' ', document).strip().lower()

    # Lemmatize, then drop stop words and very short tokens
    tokens = document.split()
    tokens = [stemmer.lemmatize(word) for word in tokens]
    tokens = [word for word in tokens if word not in en_stop and len(word) > 3]

    preprocessed_text = ' '.join(tokens)
    return preprocessed_text
```
Let's see if our function performs the required task by preprocessing a pseudo sentence:
```python
sent = preprocess_text("Artificial intelligence, is the most advanced technology of the present era")
print(sent)
```
The preprocessed sentence looks like this:
```
artificial intelligence advanced technology present
```
You will see that the punctuation marks and stop words have been removed.
Create word representation
We have preprocessed the corpus. Now it is time to create word representations with FastText. First, let us define the hyperparameters for the FastText model:
```python
embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2
```
Here `embedding_size` is the dimensionality of the word vectors, `window_size` is the number of words before and after the target word considered as context, `min_word` is the minimum frequency a word must have in the corpus for a vector to be generated for it, and `down_sampling` sets the rate at which the most frequent words are down-sampled.
Now let us tokenize the preprocessed corpus and train the FastText model. The tokenization step shown here applies the `preprocess_text` function defined above to every sentence:

```python
%%time
# Preprocess every sentence and split the corpus into lists of word tokens
final_corpus = [preprocess_text(sentence) for sentence in artificial_intelligence
                if sentence.strip() != '']

word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = [word_punctuation_tokenizer.tokenize(sent) for sent in final_corpus]

ft_model = FastText(word_tokenized_corpus,
                    size=embedding_size,
                    window=window_size,
                    min_count=min_word,
                    sample=down_sampling,
                    sg=1,
                    iter=100)
```
Execute the above script. It may take some time to run. On my machine, the time statistics of the above code running are as follows:
```
CPU times: user 1min 45s, sys: 434 ms, total: 1min 45s
Wall time: 57.2 s
```
Let us now look at the vector for the word "artificial":

```python
print(ft_model.wv['artificial'])
```
This is the output:
```
[-3.7653010e-02 -4.5558015e-01  3.2035065e-01 -1.5289043e-01
  4.0645871e-02 -1.8946664e-01  7.0426887e-01  2.8806925e-01
 -1.8166199e-01  1.7566417e-01  1.1522485e-01 -3.6525184e-01
 -6.4378887e-01 -1.6650060e-01  7.4625671e-01 -4.8166099e-01
  2.0884991e-01  1.8067230e-01 -6.2647951e-01  2.7614883e-01
 -3.6478557e-02  1.4782918e-02 -3.3124462e-01  1.9372456e-01
  4.3028224e-02 -8.2326338e-02  1.0356739e-01  4.0792203e-01
 -2.0596240e-02 -3.5974573e-02  9.9928051e-02  1.7191900e-01
 -2.1196717e-01  6.4424530e-02 -4.4705093e-02  9.7391091e-02
 -2.8846195e-01  8.8607501e-03  1.6520244e-01 -3.6626378e-01
 -6.2017748e-04 -1.5083785e-01 -1.7499258e-01  7.1994811e-02
 -1.9868813e-01 -3.1733567e-01  1.9832127e-01  1.2799081e-01
 -7.6522082e-01  5.2335665e-02 -4.5766738e-01 -2.7947658e-01
  3.7890410e-03 -3.8761377e-01 -9.3001537e-02 -1.7128626e-01
 -1.2923178e-01  3.9627206e-01 -3.6673656e-01  2.2755004e-01]
```
Now let us find the five most similar words for each of the words "artificial", "intelligence", "machine", "network", "recurrent", and "deep". You can choose any number of words. The following script prints each specified word together with its 5 most similar words.
```python
semantically_similar_words = {word: [item[0] for item in ft_model.wv.most_similar([word], topn=5)]
                              for word in ['artificial', 'intelligence', 'machine',
                                           'network', 'recurrent', 'deep']}

for k, v in semantically_similar_words.items():
    print(k + ":" + str(v))
```
The output is as follows:
```
artificial:['intelligence','inspired','book','academic','biological']
intelligence:['artificial','human','people','intelligent','general']
machine:['ethic','learning','concerned','argument','intelligence']
network:['neural','forward','deep','backpropagation','hidden']
recurrent:['rnns','short','schmidhuber','shown','feedforward']
deep:['convolutional','speech','network','generative','neural']
```
We can also find the cosine similarity between the vectors of any two words, as shown below:
```python
print(ft_model.wv.similarity(w1='artificial', w2='intelligence'))
```
The output value is 0.7481. Cosine similarity can in general range from -1 to 1, though for related word vectors it is usually positive; a higher value indicates a higher degree of similarity.
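Under the hood, `similarity` computes the cosine of the angle between the two word vectors. A minimal sketch with plain NumPy, using toy vectors rather than the trained embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: dot product over product of norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])
print(cosine_similarity(u, v))  # 0.96
```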
Visualizing word similarity
Although each word in the model is represented as a 60-dimensional vector, we can use principal component analysis (PCA) to find the first two principal components. These two components can then be used to plot the words in a two-dimensional space.
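As a minimal, self-contained sketch of this dimensionality reduction (random vectors stand in for the trained embeddings; the shapes mirror our setting of 36 words with 60 dimensions):

```python
import numpy as np
from sklearn.decomposition import PCA

# 36 stand-in "word vectors" with 60 dimensions each
rng = np.random.RandomState(0)
vectors = rng.rand(36, 60)

# Project onto the first two principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(vectors)
print(reduced.shape)  # (36, 2)
```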
```python
# Flatten the dictionary of similar words into a single list:
# each query word followed by its five similar words
all_similar_words = sum([[k] + v for k, v in semantically_similar_words.items()], [])

print(all_similar_words)
print(type(all_similar_words))
print(len(all_similar_words))
```
Each key in the dictionary is a word, and the corresponding value is the list of its semantically similar words. Since we found the top 5 most similar words for each of the 6 query words "artificial", "intelligence", "machine", "network", "recurrent", and "deep", the list contains 30 similar words plus the 6 query words themselves.
Next, we must find the word vectors for all of these words and use PCA to reduce their dimensionality from 60 to 2. The resulting two components can then be plotted with matplotlib.
Execute the following script to visualize words:
```python
from sklearn.decomposition import PCA

word_vectors = ft_model.wv[all_similar_words]

# Reduce the 60-dimensional word vectors to their first two principal components
pca = PCA(n_components=2)
p_comps = pca.fit_transform(word_vectors)
word_names = all_similar_words

plt.figure(figsize=(18, 10))
plt.scatter(p_comps[:, 0], p_comps[:, 1], c='red')

for word_names, x, y in zip(word_names, p_comps[:, 0], p_comps[:, 1]):
    plt.annotate(word_names, xy=(x+0.06, y+0.03), xytext=(0, 0), textcoords='offset points')
```
The output of the above script is a scatter plot of the words. It can be seen that words that often appear together in the text are also close to each other in the two-dimensional plane.
FastText for text classification
Text classification refers to classifying text data into predefined categories based on the content of the text. Sentiment analysis, spam detection, and tag detection are some of the most common examples of use cases for text classification.
We will use the Yelp reviews dataset for this section. The dataset contains multiple files, but we are only interested in the file containing the review texts and review scores; a subset of 50,000 reviews is used here.
Let's import the required libraries and load the dataset:
```python
import pandas as pd
import numpy as np

yelp_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/yelp_review_short.csv")
```
In the above script, we load the yelp_review_short.csv file, which contains the 50,000 reviews, with the `read_csv` method. We simplify our problem by converting the numeric review scores into categorical values: each review will carry the label `__label__positive` or `__label__negative`.
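A hypothetical illustration of such a conversion; the `stars` column name and the 3-star threshold are assumptions for the sketch, not taken from the original dataset:

```python
import pandas as pd

# Toy data frame standing in for the Yelp reviews
df = pd.DataFrame({"stars": [5, 2, 4, 1],
                   "text": ["great food", "bad service", "good value", "awful"]})

# Assumed rule: more than 3 stars -> positive, otherwise negative
df["reviews_score"] = df["stars"].apply(
    lambda s: "__label__positive" if s > 3 else "__label__negative")
print(df["reviews_score"].tolist())
```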
The next step is to download the FastText source code, which can be done with the following wget command:

```shell
!wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
```
If you run the above script and see the following results, it means that FastText has been downloaded successfully:
```
--2019-08-16 15:05:05--  https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
Resolving github.com (github.com)... 220.127.116.11
Connecting to github.com (github.com)|18.104.22.168|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0 [following]
--2019-08-16 15:05:05--  https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0
Resolving codeload.github.com (codeload.github.com)... 22.214.171.124
Connecting to codeload.github.com (codeload.github.com)|126.96.36.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: 'v0.1.0.zip'

v0.1.0.zip              [ <=> ]  92.06K  --.-KB/s    in 0.03s

2019-08-16 15:05:05 (3.26 MB/s) - 'v0.1.0.zip' saved
```
The next step is to unzip the FastText module. Just type the following command:
```shell
!unzip v0.1.0.zip
```
Next, navigate to the directory where you unzipped FastText and execute the make command to compile it:
```shell
cd fastText-0.1.0
!make
```
If you see the following output, it means that FastText has been successfully installed on your computer.
```
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/productquantizer.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/matrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/qmatrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/vector.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/model.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/utils.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/fasttext.cc
c++ -pthread -std=c++0x -O3 -funroll-loops args.o dictionary.o productquantizer.o matrix.o qmatrix.o vector.o model.o utils.o fasttext.o src/main.cc -o fasttext
```
To verify the installation, execute the `./fasttext` command without any arguments:
You should see that FastText supports the following commands:
```
usage: fasttext <command> <args>

The commands supported by fasttext are:

  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  nn                      query for nearest neighbors
  analogies               query for analogies
```
Before training the FastText model for text classification, it is necessary to mention that FastText accepts data in a special format, as follows:
```
__label__tag This is sentence 1
__label__tag2 This is sentence 2.
```
If we look at our data set, it is not in the desired format. The text with positive sentiment should look like this:
```
__label__positive burgers are very big portions here.
```
Similarly, negative comments should look like this:
```
__label__negative They do not use organic ingredients, but I thi ...
```
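As an illustration, the required format can be produced from (label, text) pairs with a few lines of Python (toy data standing in for the actual reviews):

```python
# Toy (label, text) pairs standing in for the labeled reviews
samples = [("positive", "burgers are very big portions here."),
           ("negative", "they do not use organic ingredients.")]

# Prefix each label with "__label__" as FastText expects
lines = ["__label__{} {}".format(label, text) for label, text in samples]
for line in lines:
    print(line)
```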
The following script selects the two columns we need from the dataset and saves the reviews to a text file in the format FastText expects (the `to_csv` options disable quoting so that each line is simply the label followed by the review text):

```python
import pandas as pd
from io import StringIO
import csv

col = ['reviews_score', 'text']
yelp_reviews = yelp_reviews[col]

# Write "__label__<tag> <text>" lines to the file used for training below
yelp_reviews.to_csv('/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt',
                    index=False, sep=' ', header=False,
                    quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
```
Now let's print the updated data frame:
```python
yelp_reviews.head()
```
You should see the following results:
```
       reviews_score                                               text
0  __label__positive  Super simple place but amazing nonetheless. It...
1  __label__positive  Small unassuming place that changes their menu...
2  __label__positive  Lester's is located in a beautiful neighborhoo...
3  __label__positive  Love coming here. Yes the place always needs t...
4  __label__positive  Had their chocolate almond croissant and it wa...
```
Similarly, the tail of the data frame looks like this:
```
           reviews_score                                               text
49995  __label__positive  This is an awesome consignment store! They hav...
49996  __label__positive  Awesome laid back atmosphere with made-to-orde...
49997  __label__positive  Today was my first appointment and I can hones...
49998  __label__positive  I love this chic salon. They use the best prod...
49999  __label__positive  This place is delicious. All their meats and s...
```
We have transformed the data set into the desired shape. The next step is to divide our data into training and test sets. 80% of the data (that is, the first 40,000 records out of 50,000 records) will be used for training data, and 20% of the data (the last 10,000 records) will be used to evaluate the performance of the algorithm.
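The head/tail split used below can be sketched in Python as follows (a stand-in list replaces the actual file):

```python
# Stand-in for the 50,000 labeled reviews
reviews = ["__label__positive review {}".format(i) for i in range(50000)]

# First 80% for training, last 20% for testing
train, test = reviews[:40000], reviews[40000:]
print(len(train), len(test))  # prints "40000 10000"
```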
The following script divides the data into a training set and a test set:
```shell
!head -n 40000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt"
!tail -n 10000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"
```
Now it's time to train our FastText text classification algorithm.
```shell
%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" -output model_yelp_reviews
```
To train the algorithm, we use the supervised command; its -input parameter points to the training file, and -output is the name under which the trained model will be saved. The output of the above script is as follows:
```
Read 4M words
Number of words:  177864
Number of labels: 2
Progress: 100.0%  words/sec/thread: 2548017  lr: 0.000000  loss: 0.246120  eta: 0h0m
CPU times: user 212 ms, sys: 48.6 ms, total: 261 ms
Wall time: 15.6 s
```
You can view the trained model files with the `!ls` command:
This is the output:
```
args.o                     Makefile                 quantization-results.sh
classification-example.sh  matrix.o                 README.md
classification-results.sh  model.o                  src
CONTRIBUTING.md            model_yelp_reviews.bin   tutorials
dictionary.o               model_yelp_reviews.vec   utils.o
eval.py                    PATENTS                  vector.o
fasttext                   pretrained-vectors.md    wikifil.pl
fasttext.o                 productquantizer.o       word-vector-example.sh
get-wikimedia.sh           qmatrix.o                yelp_reviews_train.txt
LICENSE                    quantization-example.sh
```
Finally, you can evaluate the model on the test set with the test command:

```shell
!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"
```
The output of the above script is as follows:
```
N       10000
P@1     0.909
R@1     0.909
Number of examples: 10000
```
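Here P@1 and R@1 are precision and recall at one: since each review carries exactly one label and the model predicts exactly one label, both values reduce to plain accuracy. A toy illustration:

```python
# Toy gold labels and single-label predictions
gold = ["positive", "negative", "positive", "positive"]
pred = ["positive", "negative", "negative", "positive"]

# With one label per example, P@1 == R@1 == accuracy
p_at_1 = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(p_at_1)  # 0.75
```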
Now, let's try to improve the consistency of the text: the following command separates punctuation marks and special characters from the words in the training set and converts everything to lowercase.
```shell
!cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt"
```
And the following script cleans the test set:
```shell
!cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt" | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"
```
Now, we will train the model on the cleaned training set:
```shell
%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews
```
Finally, we will use the model trained on the cleaned training set to make predictions on the cleaned test set:
```shell
!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"
```
The output of the above script is as follows:
```
N       10000
P@1     0.915
R@1     0.915
Number of examples: 10000
```
You will see a small increase in precision and recall. To improve the model further, you can increase the number of epochs and the learning rate. The following script sets the number of epochs to 30 and the learning rate to 0.5:
```shell
%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews -epoch 30 -lr 0.5
```
The FastText model has proven useful for word embedding and text classification tasks on many datasets. Compared with other word embedding models, it is very easy to use and lightning fast.