Binary classification: hands-on text classification with BERT (complete code included)

Datawhale

Author: Gao Baoli, Excellent Learner of Datawhale

Foreword: BERT is a natural fit for classification tasks. There are many methods for text classification, such as fastText and TextCNN, but next to BERT they look rather weak.

Recommended-review display means selecting one of a shop's many user reviews to show as the reason for recommending it, in the hope that more users will click through to the shop.

This resembles a recommender system, because suitable reviews have to be matched to different users. Take a Cantonese restaurant: user A cares a lot about the environment, so if the recommendation reason is "good environment", A will click in; user B cares more about how the dishes taste and less about the environment, so B is more likely to click in if the recommendation reason is something like "delicious". In other words, for the same shop, different people see different recommendation reasons according to their preferences.

This task is a typical short-text (up to 20 characters) binary classification problem, solved here with pre-trained BERT. Below I go through the task description, the approach, and the code implementation.

Task description

Background description

The goal of this task is to mine short sentences that are suitable as recommendation reasons from real user reviews. A recommendation reason displayed by the review app should have the following three characteristics:

  • Has a length limit
  • High content relevance
  • Has strong text appeal

Some real reasons for recommendation are shown in the blue box below:

Data set

This is a binary classification task, so the ratio of positive to negative samples matters. The training set has 16,000 items in total, with a positive-to-negative ratio of about 1:2: somewhat imbalanced, but not seriously so.

Data link : pan.baidu.com/s/1z_SJ5KhH...

Alternatively, reply with the keyword "recommendation data" to the Datawhale official account to obtain it.

Problem-solving ideas

The premise of ML/DL

Both machine learning and deep learning rest on the premise that the training set and the test set are independent and identically distributed; only when this holds can the model be expected to perform well. As a simple check, analyze the text length: if the training set were short texts and the test set long texts, the model would not perform well.

    # train and test are pandas DataFrames with a 'content' text column
    train['length'] = train['content'].apply(lambda row: len(row))
    test['length'] = test['content'].apply(lambda row: len(row))

The results of data analysis are as follows:

Regarding the length of the comment, the following two characteristics can be seen:

  • The quantiles of the training set and the test set are almost exactly the same:

  • The mean and standard deviation of the training set and the test set are also roughly the same:

                     Mean    Standard deviation
      Training set   8.67    3.18
      Test set       8.63    3.11

Therefore, in terms of comment length, the training set and the test set appear to be identically distributed. The lengths of label-0 and label-1 texts also differ little, so text length as a feature contributes little to classification. It also suggests that if our model performs well on the training set, there is reason to believe it will perform well on the test set.
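A comparison like the one above can be reproduced with a few lines of pandas. The sketch below is illustrative only: the tiny data frames stand in for the real training and test sets, and the helper name `length_summary` is hypothetical. Only the `content` column name is taken from the complete code.

```python
import pandas as pd


def length_summary(df, name):
    """Return mean, std and quartiles of the character length of df['content']."""
    lengths = df["content"].apply(len)
    stats = lengths.describe()[["mean", "std", "25%", "50%", "75%"]]
    return stats.rename(name)


# Toy stand-ins for the real data (assumption: a 'content' text column).
train = pd.DataFrame({"content": ["good food", "nice place", "ok", "very tasty dishes"]})
test = pd.DataFrame({"content": ["great taste", "so so", "clean room", "bad"]})

# One column per data set, one row per statistic, for a side-by-side comparison.
summary = pd.concat(
    [length_summary(train, "train"), length_summary(test, "test")], axis=1
)
print(summary)
```

If the two columns of `summary` are close, as in the table above, length alone gives no sign of a distribution shift between the two sets.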

The main idea

There are many methods for text classification: fastText, TextCNN, RNN-based models, and so on. Next to BERT, however, they fall short; BERT is a natural fit for classification tasks.

The official method is to pass the hidden state corresponding to [CLS] through a fully connected layer to get the classification result. To make fuller use of the information at every time step, I take the last layer of BERT and apply a few simple operations:

  • Run BERT to get a hidden representation for each time step; the number of time steps t is the sentence length.
  • Aggregate the per-time-step hidden representations in three ways: global average pooling, global max pooling, and the attention scores between [CLS] and the other positions in the sequence.
  • Feed the aggregated information into a fully connected layer for text classification.
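To make the shapes in the three steps above concrete, here is a minimal NumPy sketch. It is not the article's Keras model: the random array stands in for BERT's last-layer output, and the scaled dot-product attention is a simplified stand-in for the `SeqSelfAttention` layer used in the actual code. The function names `softmax` and `aggregate` are chosen here for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate(hidden):
    """hidden: (batch, t, h) last-layer states -> (batch, 3 * h) features."""
    avg_pool = hidden.mean(axis=1)  # global average pooling over time steps
    max_pool = hidden.max(axis=1)   # global max pooling over time steps
    # Simplified attention (assumption): [CLS] at position 0 scores every
    # position with a scaled dot product, then takes the weighted sum.
    cls = hidden[:, :1, :]                                   # (batch, 1, h)
    scores = cls @ hidden.transpose(0, 2, 1)                 # (batch, 1, t)
    scores = scores / np.sqrt(hidden.shape[-1])
    attn = (softmax(scores) @ hidden)[:, 0, :]               # (batch, h)
    return np.concatenate([avg_pool, max_pool, attn], axis=-1)


rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 8))  # batch=2, t=5 time steps, h=8 hidden units
features = aggregate(hidden)
print(features.shape)  # (2, 24)
```

The concatenated vector is what the final fully connected layer would receive; with real BERT-base outputs h would be 768, giving 2304 features per sentence.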

Model training

Five-fold cross-validation is used: the training set is split into five parts, and in each round one part serves as the validation set while the remaining four serve as the training set, which yields five models. As the figure below shows, the five validation sets together make up the whole training set. The predictions of the five models on the test set are averaged to obtain the final prediction.

Because BERT has a very large number of parameters and the training set has only 16,000 items, early stopping is used to prevent overfitting.

The Keras implementation is as follows:
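The bookkeeping behind this scheme can be sketched without any deep learning code. In the example below the "model" is a deliberately trivial threshold classifier standing in for BERT, and `run_cv_sketch` is a hypothetical name; the point is only how out-of-fold predictions are filled in and how test predictions are averaged.

```python
import numpy as np


def run_cv_sketch(x, y, x_test, n_folds=5, seed=2020):
    """Out-of-fold predictions for x plus fold-averaged predictions for x_test."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    oof_pred = np.zeros(len(x))
    test_pred = np.zeros(len(x_test))
    for k, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Placeholder "model": a threshold halfway between the class means.
        pos_mean = x[train_idx][y[train_idx] == 1].mean()
        neg_mean = x[train_idx][y[train_idx] == 0].mean()
        threshold = (pos_mean + neg_mean) / 2

        def predict(values):
            return (values > threshold).astype(float)

        oof_pred[valid_idx] = predict(x[valid_idx])  # each sample is predicted exactly once
        test_pred += predict(x_test) / n_folds       # average over the five models
    return oof_pred, test_pred


x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 1.5, 1.6, 1.7, 1.8, 1.9])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
oof, test_avg = run_cv_sketch(x, y, x_test=np.array([0.0, 2.0]))
print(oof, test_avg)
```

The out-of-fold vector covers every training sample once, so it can be scored against the labels (for example with AUC) as an offline estimate.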
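The early stopping rule itself is simple. The sketch below shows the idea in plain Python for a metric where higher is better; the function name and the accuracy values are made up for illustration, and it is written in the spirit of a patience-based callback rather than as a copy of any library's implementation.

```python
def early_stopping_epoch(val_scores, patience=3):
    """Return (stop_epoch, best_epoch) for a metric where higher is better."""
    best_score, best_epoch, waited = float("-inf"), -1, 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                # No improvement for `patience` consecutive epochs: stop here.
                return epoch, best_epoch
    return len(val_scores) - 1, best_epoch  # ran out of epochs first


# Hypothetical validation accuracies, one per epoch.
scores = [0.90, 0.93, 0.95, 0.94, 0.95, 0.94, 0.93, 0.92]
stop_epoch, best_epoch = early_stopping_epoch(scores, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

Training stops at epoch 5, while the weights worth keeping are those from epoch 2, which is why early stopping is usually paired with a checkpoint that saves only the best model.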

    import keras
    from keras.layers import Input, Lambda, Dense, GlobalAveragePooling1D, GlobalMaxPooling1D
    from keras.models import Model
    from keras.optimizers import Adam
    from keras_bert import load_trained_model_from_checkpoint, Tokenizer
    from keras_self_attention import SeqSelfAttention


    def build_bert(nclass, selfloss, lr, is_train):
        """
        nclass: number of nodes in the output layer
        selfloss: loss function
        lr: learning rate
        is_train: whether to fine-tune the BERT layers
        """
        # config_path and checkpoint_path point to the pre-trained BERT files (set in the complete code)
        bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)
        for l in bert_model.layers:
            l.trainable = is_train

        x1_in = Input(shape=(None,))
        x2_in = Input(shape=(None,))
        x = bert_model([x1_in, x2_in])
        x = Lambda(lambda x: x[:, :])(x)  # keep every time step of the last layer

        avg_pool_3 = GlobalAveragePooling1D()(x)
        max_pool_3 = GlobalMaxPooling1D()(x)
        attention_3 = SeqSelfAttention(attention_activation='softmax')(x)
        attention_3 = Lambda(lambda x: x[:, 0])(attention_3)  # attention output at the [CLS] position

        x = keras.layers.concatenate([avg_pool_3, max_pool_3, attention_3])
        p = Dense(nclass, activation='sigmoid')(x)

        model = Model([x1_in, x2_in], p)
        model.compile(loss=selfloss, optimizer=Adam(lr), metrics=['acc'])
        print(model.summary())
        return model

I also tried some more complex operations (for example, adding a CNN or a GRU layer on top), and tried taking the features of the last three layers and operating on them. These did not improve the results, but they were not bad either.

Optimization and improvement

The ratio of positive to negative samples in the training set is 1:2. The imbalance is not severe, but the classes are not balanced either. The usual loss function is cross-entropy, but the relationship between cross-entropy and AUC is not strictly monotonic: a decrease in cross-entropy does not necessarily bring an improvement in AUC. The best approach would be to optimize AUC directly, but AUC is difficult to compute as a loss.

When the samples are balanced, AUC, F1, and accuracy behave similarly. When the samples are imbalanced, however, accuracy is no longer a suitable evaluation metric, and F1 or AUC should be used instead. On reflection, AUC and F1 both depend on recall (AUC through the true positive rate, F1 through precision and recall), so I chose to optimize F1 directly. F1 is not differentiable, but there are ways around that. I recommend Su Jianlin's article on function smoothing, which covers differentiable approximations of non-differentiable functions. The resulting f1_loss is used directly as the loss function.
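The smoothing step can be written out as follows. This is a sketch of the usual argument, with symbols chosen here for illustration: $y_i \in \{0,1\}$ is the true label, $\hat{y}_i \in \{0,1\}$ a hard prediction, and $p_i \in [0,1]$ the predicted probability of the positive class.

```latex
\begin{align}
\mathrm{TP} &= \sum_i y_i \hat{y}_i, \qquad
\mathrm{TP} + \mathrm{FN} = \sum_i y_i, \qquad
\mathrm{TP} + \mathrm{FP} = \sum_i \hat{y}_i, \\
F_1 &= \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
     = \frac{2 \sum_i y_i \hat{y}_i}{\sum_i \left( y_i + \hat{y}_i \right)}, \\
\tilde{F}_1 &= \frac{2 \sum_i y_i p_i}{\sum_i \left( y_i + p_i \right)}
\qquad \text{(replace } \hat{y}_i \text{ by } p_i \text{)}.
\end{align}
```

The last expression is differentiable in every $p_i$, so its negative can serve as a loss: minimizing $-\tilde{F}_1$ pushes the smoothed F1 up.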

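    import tensorflow as tf
    import keras.backend as K


    def f1_loss(y_true, y_pred):
        # y_true: true label, 0 or 1; y_pred: predicted probability of the positive class
        # Smoothed F1; K.epsilon() in the denominator guards against division by zero
        loss = 2 * tf.reduce_sum(y_true * y_pred) / (tf.reduce_sum(y_true + y_pred) + K.epsilon())
        return -loss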
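A quick numerical check helps show what this loss measures. The NumPy sketch below is not the training code: the labels and probabilities are made up, the function names are hypothetical, and placing `eps` in the denominator is an assumption of this sketch. It compares the smoothed F1 with the ordinary F1 of thresholded predictions.

```python
import numpy as np


def soft_f1(y_true, y_pred, eps=1e-7):
    """Smoothed F1: 2 * sum(y * p) / (sum(y + p) + eps)."""
    return 2 * np.sum(y_true * y_pred) / (np.sum(y_true + y_pred) + eps)


def hard_f1(y_true, y_label):
    """Ordinary F1 computed from true positives, false positives, false negatives."""
    tp = np.sum((y_true == 1) & (y_label == 1))
    fp = np.sum((y_true == 0) & (y_label == 1))
    fn = np.sum((y_true == 1) & (y_label == 0))
    return 2 * tp / (2 * tp + fp + fn)


y_true = np.array([1, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.3, 0.7, 0.1])  # made-up probabilities
y_label = (y_prob > 0.5).astype(int)               # thresholded at 0.5

print(hard_f1(y_true, y_label))  # F1 of the thresholded predictions
print(soft_f1(y_true, y_label))  # same value up to eps when fed hard labels
print(soft_f1(y_true, y_prob))   # smoothed value on the raw probabilities
```

On hard 0/1 predictions the two functions agree, which is what justifies treating the smoothed version as a surrogate for F1 during training.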

Result analysis

Model 1: batch size 16, cross-entropy loss, learning rate 1e-5, with the BERT layers fine-tuned, i.e.:

    build_bert(1, 'binary_crossentropy', 1e-5, True)

Model 2: load Model 1, freeze the BERT layers and fine-tune only the fully connected layer; the batch size is still 16 and the learning rate is 1e-7, i.e.:

    build_bert(1, f1_loss, 1e-7, False)

The comparison is as follows:

Complete code

It takes about 1 hour to run on a GPU. It can also run on a CPU, where it may take four or five hours.

    import keras
    from keras.utils import to_categorical
    from keras.layers import *
    from keras.callbacks import *
    from keras.models import Model
    import keras.backend as K
    from keras.optimizers import Adam
    import codecs
    import gc
    import numpy as np
    import pandas as pd
    import time
    import os
    from keras.utils.training_utils import multi_gpu_model
    import tensorflow as tf
    from keras.backend.tensorflow_backend import set_session
    from sklearn.model_selection import KFold
    from keras_bert import load_trained_model_from_checkpoint, Tokenizer
    from keras_self_attention import SeqSelfAttention
    from sklearn.metrics import roc_auc_score

    # offline 0.9552568091358987, batch = 16, cross-entropy, 1e-5, online 0.96668
    # offline 0.9603767202619631, batch = 16, on top of the previous step: f1_loss, BERT layers frozen, 1e-7, online 0.97010


    class OurTokenizer(Tokenizer):
        def _tokenize(self, text):
            R = []
            for c in text:
                if c in self._token_dict:
                    R.append(c)
                elif self._is_space(c):
                    R.append('[unused1]')  # whitespace is mapped to the untrained [unused1] token
                else:
                    R.append('[UNK]')  # all remaining characters become [UNK]
            return R


    def f1_loss(y_true, y_pred):
        # y_true: true label, 0 or 1; y_pred: predicted probability of the positive class
        # Smoothed F1; K.epsilon() in the denominator guards against division by zero
        loss = 2 * tf.reduce_sum(y_true * y_pred) / (tf.reduce_sum(y_true + y_pred) + K.epsilon())
        return -loss


    def seq_padding(X, padding=0):
        # Pad every sequence in the batch to the length of the longest one
        L = [len(x) for x in X]
        ML = max(L)
        return np.array([
            np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x
            for x in X
        ])


    class data_generator:
        def __init__(self, data, batch_size=8, shuffle=True):
            self.data = data
            self.batch_size = batch_size
            self.shuffle = shuffle
            self.steps = len(self.data) // self.batch_size
            if len(self.data) % self.batch_size != 0:
                self.steps += 1

        def __len__(self):
            return self.steps

        def __iter__(self):
            while True:
                idxs = list(range(len(self.data)))
                if self.shuffle:
                    np.random.shuffle(idxs)
                X1, X2, Y = [], [], []
                for i in idxs:
                    d = self.data[i]
                    text = d[0][:maxlen]
                    # indices, segments = tokenizer.encode(first='unaffable', second='steel', max_len=10)
                    x1, x2 = tokenizer.encode(first=text)
                    y = np.float32(d[1])
                    X1.append(x1)
                    X2.append(x2)
                    Y.append([y])
                    if len(X1) == self.batch_size or i == idxs[-1]:
                        X1 = seq_padding(X1)
                        X2 = seq_padding(X2)
                        Y = seq_padding(Y)
                        # print('Y', Y)
                        yield [X1, X2], Y[:, 0]
                        [X1, X2, Y] = [], [], []


    def build_bert(nclass, selfloss, lr, is_train):
        bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path, seq_len=None)
        for l in bert_model.layers:
            l.trainable = is_train

        x1_in = Input(shape=(None,))
        x2_in = Input(shape=(None,))
        x = bert_model([x1_in, x2_in])
        x = Lambda(lambda x: x[:, :])(x)

        avg_pool_3 = GlobalAveragePooling1D()(x)
        max_pool_3 = GlobalMaxPooling1D()(x)
        # Official document: https://www.cnpython.com/pypi/keras-self-attention
        # Source code: https://github.com/CyberZHG/keras-self-attention/blob/master/keras_self_attention/seq_self_attention.py
        attention_3 = SeqSelfAttention(attention_activation='softmax')(x)
        attention_3 = Lambda(lambda x: x[:, 0])(attention_3)

        x = keras.layers.concatenate([avg_pool_3, max_pool_3, attention_3], name="fc")
        p = Dense(nclass, activation='sigmoid')(x)

        model = Model([x1_in, x2_in], p)
        model.compile(loss=selfloss, optimizer=Adam(lr), metrics=['acc'])
        print(model.summary())
        return model


    def run_cv(nfold, data, data_test):
        kf = KFold(n_splits=nfold, shuffle=True, random_state=2020).split(data)
        train_model_pred = np.zeros((len(data), 1))
        test_model_pred = np.zeros((len(data_test), 1))

        lr = 1e-7  # 1e-5
        # categorical_crossentropy (options: 'binary_crossentropy', f1_loss)
        selfloss = f1_loss
        is_train = False  # True False

        for i, (train_fold, test_fold) in enumerate(kf):
            print('***************%d-th****************' % i)
            t = time.time()
            X_train, X_valid, = data[train_fold, :], data[test_fold, :]

            model = build_bert(1, selfloss, lr, is_train)
            early_stopping = EarlyStopping(monitor='val_acc', patience=3)
            plateau = ReduceLROnPlateau(monitor="val_acc", verbose=1, mode='max', factor=0.5, patience=2)
            checkpoint = ModelCheckpoint('/home/codes/news_classify/comment_classify/expriments/' + str(i) + '_2.hdf5',
                                         monitor='val_acc', verbose=2, save_best_only=True,
                                         mode='max', save_weights_only=False)

            batch_size = 16
            train_D = data_generator(X_train, batch_size=batch_size, shuffle=True)
            valid_D = data_generator(X_valid, batch_size=batch_size, shuffle=False)
            test_D = data_generator(data_test, batch_size=batch_size, shuffle=False)

            # Start from the weights of Model 1 for this fold
            model.load_weights('/home/codes/news_classify/comment_classify/expriments/' + str(i) + '.hdf5')

            model.fit_generator(
                train_D.__iter__(),
                steps_per_epoch=len(train_D),
                epochs=8,
                validation_data=valid_D.__iter__(),
                validation_steps=len(valid_D),
                callbacks=[early_stopping, plateau, checkpoint],
            )

            # return model
            train_model_pred[test_fold] = model.predict_generator(valid_D.__iter__(), steps=len(valid_D), verbose=1)
            test_model_pred += model.predict_generator(test_D.__iter__(), steps=len(test_D), verbose=1)

            del model
            gc.collect()
            K.clear_session()
            print('time:', time.time() - t)

        return train_model_pred, test_model_pred


    if __name__ == '__main__':
        config = tf.ConfigProto()
        config.gpu_options.per_process_gpu_memory_fraction = 0.8  # fixed fraction of GPU memory
        config.gpu_options.allow_growth = True  # allocate on demand
        set_session(tf.Session(config=config))

        t = time.time()
        maxlen = 20  # the longest text in the data set has length 19

        config_path = '/home/codes/news_classify/chinese_L-12_H-768_A-12/bert_config.json'
        checkpoint_path = '/home/codes/news_classify/chinese_L-12_H-768_A-12/bert_model.ckpt'
        dict_path = '/home/codes/news_classify/chinese_L-12_H-768_A-12/vocab.txt'

        token_dict = {}
        with codecs.open(dict_path, 'r', 'utf8') as reader:
            for line in reader:
                token = line.strip()
                token_dict[token] = len(token_dict)
        tokenizer = OurTokenizer(token_dict)

        data_dir = '/home/codes/news_classify/comment_classify/'
        train_df = pd.read_csv(os.path.join(data_dir, 'union_train.csv'))
        test_df = pd.read_csv(os.path.join(data_dir, 'test.csv'))
        print(len(train_df), len(test_df))

        DATA_LIST = []
        for data_row in train_df.iloc[:].itertuples():
            DATA_LIST.append((data_row.content, data_row.label))
        DATA_LIST = np.array(DATA_LIST)

        DATA_LIST_TEST = []
        for data_row in test_df.iloc[:].itertuples():
            DATA_LIST_TEST.append((data_row.content, 0))
        DATA_LIST_TEST = np.array(DATA_LIST_TEST)

        n_cv = 5
        train_model_pred, test_model_pred = run_cv(n_cv, DATA_LIST, DATA_LIST_TEST)

        train_df['Prediction'] = train_model_pred
        test_df['Prediction'] = test_model_pred / n_cv  # average of the five fold models

        train_df.to_csv(os.path.join(data_dir, 'train_union_submit2.csv'), index=False)
        test_df['ID'] = test_df.index
        test_df[['ID', 'Prediction']].to_csv(os.path.join(data_dir, 'submit2.csv'), index=False)

        auc = roc_auc_score(np.array(train_df['label']), np.array(train_df['Prediction']))
        print('auc', auc)
        print('time is', time.time() - t)  # 2853s

Reference

1. How to Fine-Tune BERT for Text Classification?

2. Su Jianlin, A Talk on Function Smoothing: Differentiable Approximations of Non-Differentiable Functions
