CN113239700A - Text semantic matching device, system, method and storage medium for improving BERT - Google Patents

Text semantic matching device, system, method and storage medium for improving BERT

Info

Publication number
CN113239700A
CN113239700A (application CN202110459186.7A)
Authority
CN
China
Prior art keywords
text
bert
vector
model
semantic matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110459186.7A
Other languages
Chinese (zh)
Inventor
王庆岩
顾金铭
殷楠楠
谢金宝
梁欣涛
沈涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202110459186.7A priority Critical patent/CN113239700A/en
Publication of CN113239700A publication Critical patent/CN113239700A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

A text semantic matching device, system, method and storage medium for improving BERT, and in particular a device, system, method and storage medium for text semantic matching that combine BERT, word granularity, relative position coding and attention pooling, belonging to the field of natural language processing. The invention aims to solve three problems of the BERT model: its training time is long, its absolute position coding cannot indicate the relative positions between words in a sentence, and its output text representation does not fully exploit the text representation sequence produced by the model. The invention processes the text by establishing a word embedding mechanism in the input representation layer, a relative position coding mechanism in the encoding layer and an attention pooling mechanism in the output layer, and then completes the subsequent text semantic matching. The method and device not only improve the accuracy of text matching and more accurately reflect the different positions in a sentence and the information between those positions, but also use attention pooling to obtain a dimension-reduced text representation that contains more semantic information.

Description

Text semantic matching device, system, method and storage medium for improving BERT
Technical Field
The invention discloses a text semantic matching method for improving BERT, and in particular a matching device, system, method and storage medium for text semantic matching that combine BERT, word granularity, relative position coding and attention pooling, belonging to the field of natural language processing.
Background
Text semantic matching is one of the basic tasks in the field of natural language processing (NLP); it aims at modeling the semantics of two texts and classifying the relationship between them. Research on text semantic matching can be applied to natural language processing tasks such as automatic question answering, machine translation, dialogue systems and paraphrase identification, all of which can, to some extent, be abstracted as text matching tasks.
The primary problem faced by the text semantic matching task is text representation, i.e. mapping the words in a text into word vector representations so that a computer can process the text. In recent years, with the development of large-scale pre-training, text representation technology has advanced greatly, and a variety of pre-trained models based on large-scale text prediction have emerged in rapid succession, such as ELMo, OpenAI GPT, BERT and XLNet. Since the great success of the BERT pre-trained model, improvements based on it have been proposed continuously, such as RoBERTa and ALBERT.
Although these models have achieved good results, the three previous methods for reducing the dimension of the output are [CLS] vector extraction, average pooling and maximum pooling; all three use the three-dimensional text representation sequence output by the model in a one-sided way. The proposed method therefore integrates the relationship between the [CLS] vector and the remaining vectors to obtain a text representation that reflects the text semantics more accurately.
Pooling the output text sequence extracted from the text by the pre-trained model to generate a text representation is an important step of a text semantic matching model. Collobert et al. proposed a global max pooling method that generates a semantically matched text representation from the maximum of each element across the vectors in the text representation sequence. Conneau et al. combined a bi-directional long short-term memory (Bi-LSTM) network with global max pooling and global average pooling respectively to encode sentence-level semantic information, and found by comparison that the structure combining Bi-LSTM with global max pooling gives the best sentence-level semantic encoding. Kim generated a text representation sequence based on the word2vec embedding model and combined a convolutional neural network (CNN) with global max pooling for text classification. Hu et al. combined a CNN with global max pooling to propose a text semantic matching model that requires no prior knowledge. BERT extracts the vector of the special character [CLS] as the semantic matching text representation. All of the above methods use only a part of the output text sequence and do not combine the special [CLS] vector in BERT with the other sequence vectors; the present invention adopts attention pooling to solve this problem.
Disclosure of Invention
In text matching tasks the BERT model achieves good performance, but it still has the problems that its training time is long, that absolute position coding cannot indicate the relative positions between words in a sentence, and that the output text representation does not fully exploit the text representation sequence output by the BERT model. The invention therefore proposes AP _ REP _ WordBERT, a text matching model that improves BERT with word embedding, Attention Pooling (AP) and Relative Position coding (REP). The technical scheme of the invention is as follows:
the first scheme is as follows: the text semantic matching system for improving BERT comprises a data preprocessing subsystem and a BERT model subsystem; the data preprocessing subsystem is responsible for organizing the acquired text and passing it to the BERT model subsystem, the BERT model subsystem is used for building the model and producing its output, and finally the output layer subsystem refines the model output and outputs the matching result.
Specifically, the data preprocessing subsystem comprises a text acquisition module, a splicing module and a word segmentation module; the BERT model subsystem comprises an input representation layer, an encoding layer and an output layer; the output layer includes an attention pooling module and a classifier.
Scheme II: the method is realized on the basis of the system by establishing a word embedding mechanism in the transmission layer, a relative position coding mechanism of the coding layer and processing a text by the attention mechanism after pooling of the output layer, so as to complete subsequent text semantic matching; the method comprises the following specific steps:
step one, inputting a text by the text acquisition module and inserting a special element vector to complete the initialization operation of a text matching task;
splicing the main vectors by the splicing module by using a self-attention mechanism;
thirdly, the word segmentation module utilizes a word embedding mechanism to segment the text vector according to word granularity and serves as a final word segmentation result;
fourthly, coding the text by using a relative position coding mechanism and outputting the relative position learned by the model;
step five, using the special element vector inserted in the step one to perform attention pooling calculation with other output vector sequences in the output text sequence;
and step six, performing function calculation by using the classifier to complete text semantic matching.
Further, in the step one, the text matching task specifically includes two parts:
the first part: the text pair is spliced; a special symbol [CLS] is added in front of the first sentence of the text pair, a special symbol [SEP] is added at the end of the first sentence, the second sentence is then appended, a special symbol [SEP] is added at the end of the second sentence, and the spliced sentence is segmented according to character granularity;
the second part: the word vector, segment vector and position vector of each word are summed to form the vector representation that is finally fed into the BERT model.
Further, in the second step, the self-attention mechanism specifically comprises the following steps:
step two-one, performing similarity calculation between the query set Q of the current word and each key K to obtain weights;
step two-two, normalizing these weights with a Softmax function;
step two-three, weighting and summing the weights with the corresponding values V to obtain the final attention result.
Further, in step three, the word embedding mechanism specifically comprises the following steps:
step three-one, adding the Chinese words in the text into the original word list;
step three-two, inputting a sentence; the sentence is first segmented once with the jieba word segmentation tool to obtain a word sequence w_i, w_i ∈ [w_1, w_2, ..., w_l];
step three-three, traversing w_i: if w_i is in the word list it is kept; otherwise it is re-segmented once with BERT's built-in word segmentation function;
step three-four, the segmentation results of every w_i are spliced together in order as the final word segmentation result.
Further, in step four, the relative position coding means adding two groups of vectors representing relationships between words in the self-attention mechanism and taking the vectors as parameters to participate in training, and the specific steps are as follows:
step four-one, the two groups of vectors representing relationships between words interact with the word vectors;
step four-two, the attention scores are calculated;
step four-three, the weighted output vectors are obtained.
Further, in step five, the relative position coding depends on a coding mode in which positions are represented by two-dimensional coordinates; by converting the multi-dimensional position vectors into relative positions expressed by two-dimensional coordinates, the relative position coding is shared in the self-attention mechanism of every layer, and what is expressed in the relative position coding of any layer is the relative information between positions.
Further, in step six, the classifier is used as a text semantic matching model for the multilayer perceptron, and the classifier consists of a forward propagation neural network, a Softmax normalization function and an Argmax maximum index function:
the forward propagation neural network has two hidden layers in total, all neurons of the first hidden layer are fully connected with a semantic matching representation vector v, and the v is mapped into a high-dimensional semantic space to analyze semantic matching information contained in the high-dimensional semantic space; fully connecting the neurons in the second hidden layer with all the neurons in the first hidden layer, and respectively outputting activation values corresponding to labels 0 with different representative semantics and labels 1 with the same representative semantics to obtain a two-dimensional activation vector;
the Softmax normalization function is used for normalizing the two-dimensional activation vector obtained from the forward propagation network so that the elements of the vector sum to 1, yielding a two-dimensional prediction vector; this vector is the model's prediction of the synonymy relation between the two input sentences to be matched, and its two elements correspond to the prediction probabilities of label 0 and label 1 respectively and are used to calculate the model loss function;
and the Argmax maximum-index function compares the probability values of the two elements in the two-dimensional probability vector and returns the index corresponding to the element with the maximum probability; this index is the final prediction label of the text semantic matching model.
The third scheme is as follows: the text semantic matching equipment for improving the BERT comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the text semantic matching system and the text semantic matching method for improving the BERT when executing the computer program.
The invention has the beneficial effects that:
the model AP _ REP _ WordBERT provided by the invention mainly makes the following improvements: first, in the choice of pre-trained model, a pre-trained model with words as the segmentation granularity is adopted, which improves the accuracy of text matching and accelerates training; second, the absolute position coding of the BERT model is removed and relative position coding is adopted, so that the different positions in a sentence and the information between them are reflected more accurately and the positional information within the text is expressed more explicitly; finally, in the text output stage, attention pooling is adopted to obtain a dimension-reduced text representation that contains more semantic information.
Drawings
FIG. 1 is a block diagram of an AP _ REP _ WordBERT model;
FIG. 2 is a block diagram of the BERT model;
FIG. 3 is a schematic diagram of the encoding of a Transformer model;
FIG. 4 is a block diagram of an attention mechanism;
FIG. 5 is a schematic diagram of a BERT input generation;
FIG. 6 is a diagram of relative position encoded vector encoding;
FIG. 7 is an attention pooling block diagram;
FIG. 8 is a diagram of a classifier structure;
FIG. 9 is a schematic diagram of the proportion of portions of a data set;
FIG. 10 is a graph comparing accuracy of BERT models;
FIG. 11 is a run-time comparison graph with the BERT model;
FIG. 12 is a comparison of different pooling schemes;
FIG. 13 is a graph comparing accuracy of derived models with BERT;
fig. 14 is a comparison graph of different learning rates.
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Detailed Description
The first embodiment is as follows:
the model provided by the embodiment is mainly improved based on a BERT model, firstly, the AP _ REP _ WordBERT model expands a word list through a jieba word segmentation tool on the basis of the BERT pre-training model, and words are changed into basic units of word segmentation by taking the words as the basic units of word segmentation in the word segmentation stage. The improvement not only improves the accuracy of text representation, but also saves memory and improves the training speed of the model; secondly, a position coding mode is improved, most of the pre-training models adopt the same position coding mode, namely an absolute position coding mode, the position coding mode only explicitly expresses information of different positions of the whole sentence and does not reflect position information between words between sentences, for example, the absolute position coding mode can only reflect that the position 1 and the position 2 are different positions, but can not reflect that the distance between the position 1 and the position 2 is shorter than the distance between the position 1 and the position 4, and the relative position coding solves the problem, so that the finally generated text is represented more closely to text semantics; and finally, reducing the dimension of the text representation sequence output by the BERT model by adopting an attention pooling method. Because the text representation sequence output by the pre-training model is a three-dimensional vector, the text representation sequence needs to be reduced into a two-dimensional vector and then is sent to a classifier for classification judgment.
The overall structure of the text semantic matching model provided by the embodiment is shown in fig. 1, the input of the model is two texts to be matched, and in order to utilize the interaction information between the two texts in the encoding process, the two texts to be matched are spliced into a text sequence and then used as the input of the AP _ REP _ WordBERT model.
The model improves the BERT model in three places: first, the word segmentation part is changed from character segmentation to word segmentation; second, relative position coding is added and absolute position coding is removed; third, attention pooling is adopted to obtain the text representation. These three aspects are discussed below:
1.1 BERT model:
the BERT model mainly consists of three parts: an input presentation layer, an encoding layer, and an output layer for data pre-processing. The block diagram of the BERT model is shown in FIG. 2.
1.1.1 input representation layer for data preprocessing:
in the text matching task, the input representation layer of BERT mainly completes two parts of work. The first part: the text pair is spliced; a special symbol [CLS] is added before the first sentence of the text pair, a special symbol [SEP] is added at the end of the first sentence, the second sentence is then appended, a special symbol [SEP] is added at the end of the second sentence, and the spliced sentence is segmented according to character granularity. The second part: the word vector, segment vector and position vector of each word are summed to form the vector representation that is finally fed into the BERT model.
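As an illustration of this preprocessing step, the following minimal sketch (in Python; the helper function, example sentences and character-level tokenization are illustrative assumptions, not the patent's implementation) builds the spliced token sequence together with its segment and position ids:

```python
# Minimal sketch of the input-representation step (illustrative assumptions only).
def build_bert_input(tokens_a, tokens_b):
    """Splice a text pair as [CLS] A [SEP] B [SEP] and build segment/position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 for the first sentence (with [CLS] and its [SEP]), 1 for the second.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

# Character-granularity segmentation of two hypothetical Chinese sentences.
tokens, seg, pos = build_bert_input(list("今天天气很好"), list("今天天气不错"))
# Inside the model, the input vector of each token is the sum of its token
# embedding, segment embedding and position embedding.
```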
1.1.2 coding layer:
the Transformer model is an essential component of the BERT model coding layer and is denoted Trm in fig. 1 and 2. The Transformer model is divided into a coding part and a decoding part, and because the nature of the model is a classification model, only the coding part of the Transformer is applied, and the internal structure of the coding part is shown in fig. 3:
as can be seen from FIG. 3, apart from the input and output, the Transformer consists of three parts: a multi-head attention mechanism, a feed-forward network, and layer normalization applied with a residual connection, where Nx denotes the number of times the parts within the box are repeated. The multi-head attention mechanism is the main component for extracting text information; it is realized by running the attention mechanism several times and finally splicing the result of each attention head together as the final result of the multi-head attention. The attention calculation process used by the Transformer model is shown in FIG. 4:
the calculation of the attention mechanism is divided into three steps. The first step is to calculate the similarity between the query set Q of the current word and each key K_i to obtain a weight; a common similarity function is the dot product, as shown in formula (1):

F(Q, K_i) = Q^T K_i    (1)

The second step is to normalize these weights with the Softmax function, as shown in formula (2):

a_i = Softmax(F(Q, K_i)) = exp(F(Q, K_i)) / Σ_j exp(F(Q, K_j))    (2)

Finally, the normalized weights a_i are used to weight and sum the corresponding values V (Value), giving the final attention result (Attention Value), where Q is the query of the current word, K are the keys other than the current word and V are the values other than the current word, as shown in formula (3):

Attention(Q, K, V) = Σ_i a_i V_i    (3)
the Q, K and V calculations used in the Transformer model all use the same input sequence, so this attention mechanism is called the self-attention mechanism in the Transformer.
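For concreteness, the following NumPy sketch implements the plain self-attention of equations (1) to (3); the scaling by the square root of the dimension and the matrix shapes are common Transformer conventions assumed here rather than details taken from the patent:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Plain self-attention: dot-product similarity, Softmax normalisation,
    weighted sum of values. X has shape (n, d_x); Wq/Wk/Wv map to d_z."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # Q, K, V from the same input sequence
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # F(Q, K_i), eq. (1) with scaling
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # Softmax, eq. (2)
    return weights @ V                               # weighted sum, eq. (3)
```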
1.1.3 output layer:
and after the input vectors have been encoded by the Transformer, the output text sequence is processed according to the specific downstream task. For the text matching task, the [CLS] vector, which best expresses the meaning of the whole input, is taken as the final text representation of the sentence pair and fed into a Softmax layer for classification.
In this embodiment, the model first improves the sentence segmentation of the input representation layer, changing segmentation at character granularity into segmentation at word granularity; second, it removes absolute position coding and introduces relative position coding in the encoding layer; finally, it introduces attention pooling in the output layer to process the output text representation, finally obtaining a text representation that fits the text semantics more closely.
2.1 segmenting text by word granularity:
At present, almost all pre-trained models segment text on a character basis, because character-based segmentation has the following advantages: fewer parameters, no dependence on a word segmentation algorithm, and essentially no out-of-vocabulary words. For these reasons most models segment sentences at character granularity. Although character-granularity segmentation has the above advantages, it also has several drawbacks. First, although the number of parameters is smaller than for word-granularity segmentation, the segmentation granularity is so small that more parameters are updated in each step than with word-granularity segmentation. For example, take the sentence "The mouse in 'cat and mouse' manages to escape every time." Segmented at character granularity, it is split character by character; segmented at word granularity (taking the jieba word segmentation tool as an example) it becomes "cat and mouse / in / mouse / every time / all / can / successfully / escape". Character-granularity segmentation updates 15 character vectors, whereas word-granularity segmentation updates only 10 word vectors; the update is therefore faster and each update occupies less memory. Second, although character segmentation does not depend on a word segmentation algorithm, it can also create ambiguity. In the example above, "mouse" is part of a title in "cat and mouse" but, combined with other characters, is an animal name elsewhere in the sentence. If characters are used as the segmentation granularity, both uses receive the same uniform processing when fed into the computer; this problem does not arise when words are used as the segmentation granularity.
Word-granularity segmentation does not share the advantages of character-granularity segmentation listed above, but each of its drawbacks can be addressed. First, taking words as the granularity increases the number of parameters, which inevitably tends to cause overfitting; however, overfitting can be greatly alleviated by pre-training, so the problem is not serious in practice. Second, regarding the dependence on a word segmentation algorithm, the model only keeps a subset of the most common words, so the results produced by different word segmentation tools are almost the same and the differences are small. Third, segmentation boundary errors are hard to avoid, but text semantic matching, unlike sequence labeling tasks, does not require strict boundary segmentation. Fourth, most single characters are also added to the word list by the model, so excessive unknown words will not appear.
Because the model is improved on the basis of the BERT model, the original word segmentation mode is inevitably improved, and the word segmentation mode of the model is as follows:
adding the Chinese words into the original word list;
inputting a sentence; the sentence is first segmented once with the jieba word segmentation tool to obtain a word sequence w_i, w_i ∈ [w_1, w_2, ..., w_l];
traversing w_i: if w_i is in the word list it is kept; otherwise it is re-segmented once with BERT's built-in word segmentation function;
the segmentation results of every w_i are spliced together in order as the final word segmentation result.
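The segmentation procedure above can be sketched as follows; the code assumes the jieba library and the Hugging Face transformers tokenizer as a stand-in for "BERT's built-in word segmentation function", and the extended word list is a hypothetical set, so this is an illustration rather than the patent's exact implementation:

```python
import jieba
from transformers import BertTokenizer  # assumed tooling; any WordPiece tokenizer with a vocab works

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
# extended_vocab would be BERT's original word list plus the added Chinese words.
extended_vocab = set(tokenizer.get_vocab())

def word_level_tokenize(sentence):
    """Word-granularity segmentation: jieba first, then fall back to BERT's own
    tokenizer for any word that is not in the (extended) word list."""
    pieces = []
    for w in jieba.lcut(sentence):               # first pass: jieba word segmentation
        if w in extended_vocab:
            pieces.append(w)                     # keep the whole word
        else:
            pieces.extend(tokenizer.tokenize(w)) # re-segment with BERT's tokenizer
    return pieces
```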
3.1 coding:
3.1.1 relative position coding:
all BERTs and their derivatives encode words almost exclusively in the manner of fig. 5, which indicates that each word is summed from a word vector, a segment vector, and a position vector. The position vector is mainly generated by absolute position coding, but the coding mode can only indicate the relation of different positions of each word in a sentence and cannot indicate the relative relation between different positions, so that the model introduces relative position coding to improve the BERT model.
The model eliminates absolute position codes in BERT, as shown in an orange part of FIG. 5, adds two groups of vectors representing relations between words into a self-attention mechanism by improving the calculation mode of the self-attention mechanism, and takes part in the training process as parameters.
Let the input sequence be x = (x_1, ..., x_n), where each x_i ∈ R^{d_x}. The self-attention mechanism generates an output sequence z = (z_1, ..., z_n) of the same length as the input, where each z_i ∈ R^{d_z}. Since the BERT model adopts a self-attention mechanism, equations (1) to (3) are refined further; the calculation of the self-attention mechanism requires the following three steps.

First, x_i and x_j interact via formula (4):

e_ij = (x_i W^Q)(x_j W^K)^T / sqrt(d_z)    (4)

Next, the attention score is calculated by equation (5):

α_ij = exp(e_ij) / Σ_{k=1}^{n} exp(e_ik)    (5)

Finally, the output is obtained by weighting via formula (6):

z_i = Σ_{j=1}^{n} α_ij (x_j W^V)    (6)

Here W^Q, W^K, W^V ∈ R^{d_x × d_z} are parameter matrices that are not shared across the self-attention mechanisms of the different layers. x W^Q, x W^K and x W^V give Q, K and V respectively, e_ij is the similarity of x_i and x_j, and α_ij is the attention score.
In order to express the relationship between different positions in the same sentence, a group of trainable parameters expressing relative positions is added to the calculation of the attention score and of the final output, and these parameters are shared in the self-attention mechanism of every layer. The specific steps are as follows:

First, when words at different positions interact, the interaction is still carried out as a dot product, but a first parameter a_ij^K representing the relative position is added during the interaction, as shown in formula (7):

e_ij = (x_i W^Q)(x_j W^K + a_ij^K)^T / sqrt(d_z)    (7)

Next, the attention score is calculated with the Softmax formula in the same way as in the original self-attention calculation (equation (5)).

Finally, a second parameter a_ij^V representing the relative position is added when calculating the output, as shown in equation (8):

z_i = Σ_{j=1}^{n} α_ij (x_j W^V + a_ij^V)    (8)
3.1.2 feasibility analysis of relative position coding:
suppose the input word vectors are [x_1, x_2, ..., x_i] and the vectors produced by absolute position coding are [p_1, p_2, ..., p_i]. Feeding the absolute position codes and the word vectors into the self-attention mechanism gives the following operations, as shown in equations (9) to (13):

q_i = (x_i + p_i) W^Q    (9)

k_j = (x_j + p_j) W^K    (10)

v_j = (x_j + p_j) W^V    (11)

a_ij = Softmax(q_i k_j^T)    (12)

o_i = Σ_j a_ij v_j    (13)

where q_i is the query at position i, k_j is the key at position j, v_j is the value at position j, a_ij is the similarity between the words at positions i and j, and o_i is the output vector at position i.

Unfolding formula (12) gives formula (14):

a_ij = Softmax( (x_i W^Q + p_i W^Q)(x_j W^K + p_j W^K)^T )    (14)
After introducing relative position coding, the term p_i W^Q in the first bracket is removed and the term p_j W^K in the second bracket is replaced by a binary (two-coordinate) relative position vector R_{i,j}^K, as shown in equation (15):

a_ij = Softmax( x_i W^Q (x_j W^K + R_{i,j}^K)^T )    (15)

Similarly, expanding formula (13) yields formula (16):

o_i = Σ_j a_ij (x_j W^V + p_j W^V)    (16)

Replacing p_j W^V in the formula with R_{i,j}^V gives formula (17):

o_i = Σ_j a_ij (x_j W^V + R_{i,j}^V)    (17)

It can be seen that relative position coding depends on a coding mode in which positions are represented by the two-dimensional coordinates (i, j), and that through the vectors R_{i,j}^K and R_{i,j}^V it reduces to a relative position that depends only on i - j. It is for this reason that the relative position coding is shared in the self-attention mechanisms of the different layers, and that what is expressed in the relative position coding of any layer is the relative information between positions.
Although the two relative position parameters a_ij^K and a_ij^V can capture the relative position information between input elements of a text sequence, as shown in FIG. 6, the maximum relative position is limited to a range |k|, since an exact relative position is not useful beyond a certain distance. The model therefore adopts clipping, which not only reduces the number of training parameters but also allows the model to generalize to sequence lengths not seen during training. The relative positions are thus taken from the k words before and the k words after the current word, and k = 4 in this model.

The clipping is shown in equations (18) to (20):

a_ij^K = w^K_{clip(j-i, k)}    (18)

a_ij^V = w^V_{clip(j-i, k)}    (19)

clip(x, k) = max(-k, min(k, x))    (20)

Finally, the model learns the relative position representations w^K = (w^K_{-k}, ..., w^K_{k}) and w^V = (w^V_{-k}, ..., w^V_{k}), where w_i^K, w_i^V ∈ R^{d_z}.
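As a concrete illustration of equations (7), (8) and (18) to (20), the following NumPy sketch adds clipped relative-position terms to self-attention; the array shapes, function names and the scaling by sqrt(d_z) are assumptions made for the example, not details taken from the patent:

```python
import numpy as np

def clip(x, k):
    """Equation (20): restrict a relative distance to the range [-k, k]."""
    return max(-k, min(k, x))

def rel_pos_self_attention(X, Wq, Wk, Wv, wK, wV, k=4):
    """Self-attention with relative-position terms (eqs. (7)-(8)).
    wK and wV are the learned tables w^K, w^V of shape (2k+1, d_z)."""
    n, dz = X.shape[0], Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # a^K_{ij} = w^K[clip(j-i,k)+k], a^V_{ij} = w^V[clip(j-i,k)+k]   (eqs. (18)-(19))
    idx = np.array([[clip(j - i, k) + k for j in range(n)] for i in range(n)])
    aK, aV = wK[idx], wV[idx]                                        # shapes (n, n, d_z)
    e = np.einsum('id,ijd->ij', Q, K[None, :, :] + aK) / np.sqrt(dz) # eq. (7)
    alpha = np.exp(e - e.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)                       # Softmax, eq. (5)
    return np.einsum('ij,ijd->id', alpha, V[None, :, :] + aV)        # eq. (8)
```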
4.1 attention pooling:
The structure of attention pooling is shown in FIG. 7.

In the attention pooling method, the special element [CLS] inserted when preprocessing the text performs an attention calculation with the remaining output vectors of the output text sequence, giving a semantic matching text representation v_Att as the representation corresponding to the text representation sequence E.

The attention pooling calculation is shown in equation (21):

v_Att = Attention(e_[CLS], K_E, V_E)    (21)

where e_[CLS] is the vector corresponding to the special element [CLS], and K_E, V_E are formed from the rest of the text representation sequence excluding e_[CLS].

The attention calculation formula is shown in formula (22):

Attention(Q, K, V) = Softmax(Q K^T / sqrt(d)) V    (22)

where v_Att is the result after attention pooling, K_E, V_E ∈ R^{(n-1)×d} (with d the hidden dimension), and n is the input sequence length. As can be seen from the description above, this calculation is similar to the attention mechanism described earlier. Current applications of the attention mechanism mainly take the form of self-attention, i.e. the Q, K and V matrices are all generated from the same input sequence X; here, by contrast, the Q matrix is generated from the [CLS] vector, while the K and V matrices are generated from the text representation sequence excluding the [CLS] vector. In the BERT model, the [CLS] vector is taken as the representation that best expresses the overall meaning of the text, but the remaining vectors also carry the sentence information contained at their specific positions; this processing therefore integrates the whole with the parts, so that the output text representation conforms better to the real text semantics.
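A minimal NumPy sketch of this attention pooling step, assuming the output sequence E has shape (n, d) with its first row equal to e_[CLS]:

```python
import numpy as np

def attention_pooling(E):
    """Attention pooling (eq. (21)): the [CLS] vector queries the remaining
    output vectors; the Softmax-weighted sum is the matching representation v_Att."""
    e_cls, rest = E[0], E[1:]                      # e_[CLS] and the rest of the sequence
    scores = rest @ e_cls / np.sqrt(E.shape[1])    # dot-product similarity (eq. (22))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # Softmax normalisation
    return weights @ rest                          # v_Att
```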
5.1 classifier:
the multi-layer perceptron is used as a classifier of a text semantic matching model, and the classifier consists of a forward propagation neural network, a Softmax normalization function and an Argmax maximum index function. The structure of the classifier is shown in fig. 8.
Wherein the forward propagating neural network has two hidden layers in common. All neurons of the first hidden layer are fully connected with the semantic matching representation vector v, and the v is mapped to a high-dimensional semantic space to analyze semantic matching information contained in the high-dimensional semantic space. And fully connecting the neurons in the second hidden layer with all the neurons in the first hidden layer, and respectively outputting the corresponding activation values of a label 0 (with different semantics) and a label 1 (with the same semantics) to obtain a two-dimensional activation vector.
And the Softmax normalization function is used for normalizing the two-dimensional activation vector obtained by the forward propagation network to enable the sum of all elements in the vector to be 1, so that a two-dimensional prediction vector is obtained. The vector, namely the prediction of the synonymy relation between two input sentences to be matched by the text semantic matching model, and two elements in the vector respectively correspond to the prediction probabilities of a label 0 and a label 1 and are used for model loss function calculation.
The Argmax maximum index function compares the probability values of the two elements in the two-dimensional probability vector and returns the index corresponding to the maximum probability value element in the vector (the index starts from 0, if the value of the first element is larger than that of the second element, the index 0 is returned, otherwise, the index 1 is returned). The index is the final predicted label y of the text semantic matching model.
The activation functions used by the classifier all adopt Gaussian Error Linear Units (GELU), and the calculation formula of the GELU is as follows:
Figure BDA0003041737040000114
the calculation formulas of the classifier are as follows:

f_1 = GELU(W_1 v + b_1)    (24)

f_2 = GELU(W_2 f_1 + b_2)    (25)

p = Softmax(f_2)    (26)

y = Argmax(p)    (27)

where f_1 is the output of the first hidden layer of the forward propagation network; W_1 and b_1 are the weights and bias of the first hidden layer; f_2 ∈ R^2 is the output of the second layer of the forward propagation network; W_2 and b_2 ∈ R^2 are the weights and bias of the second layer; and p is the probability vector obtained by the Softmax normalization function.
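The classifier of equations (23) to (27) can be sketched as follows; the parameter names and the tanh approximation of GELU are assumptions of the example rather than details from the patent:

```python
import numpy as np

def gelu(x):
    """GELU activation (tanh approximation), eq. (23)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def classify(v, W1, b1, W2, b2):
    """Two-hidden-layer classifier of eqs. (24)-(27): forward network,
    Softmax normalisation, Argmax. W2/b2 project to the two labels
    (0: different semantics, 1: same semantics)."""
    f1 = gelu(v @ W1 + b1)                        # eq. (24): first hidden layer
    f2 = gelu(f1 @ W2 + b2)                       # eq. (25): two activation values
    p = np.exp(f2 - f2.max()); p /= p.sum()       # eq. (26): Softmax -> prediction vector
    y = int(np.argmax(p))                         # eq. (27): final predicted label
    return p, y
```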
6.1 data set and Pre-trained model:
the dataset used is the large-scale Chinese question matching corpus (LCQMC), which is used to determine the semantic relationship between two Chinese question sentences. The LCQMC dataset is divided into a training set, a validation set and a test set, containing 260,068 samples in total: 238,766 training samples, 8,802 validation samples and 12,500 test samples; the proportions are shown in FIG. 9.
Each sample consists of a pair of chinese question sentences and corresponding tags. The labels are divided into two types of 0 and 1, the label 0 represents that the semantics of the two Chinese question sentences are different, the label 1 represents that the semantics of the two Chinese question sentences are the same, and the sample number ratio of the labels 0 to 1 is 1: 1.34. The data set presentation is shown in table 1:
TABLE 1 Sentence pairs in the dataset
The adopted pre-training model is a pursuit-science pre-training model, the pre-training method of the model is to carry out continuous pre-training on the basis of BERT-Chinese of Haohang open source, and the pre-training task is MLM. And in the initialization stage, each word is divided into words by using a Bert self-contained Tokenizer file, and then the average of word embedding is used as the initialization of word embedding. The model is trained by using a single 24G RTX for 100 ten thousand steps (about 10 days), the sequence length is 512, the learning rate is 5e-6, the batch size is 16, the accumulated gradient is 16 steps, and the training is equivalent to that the batch size is about 6 ten thousand steps trained by 256; the corpus is approximately 30 or more G of general-purpose corpus.
The GPU used for all experiments was NVIDIA GTX1080Ti (11G).
6.2 evaluation index
The accuracy is used as the evaluation index to verify the matching performance. By comparing the model's predictions on the validation and test sets with the real labels, True (T) is defined as the number of correct model predictions, False (F) as the number of incorrect model predictions, and Number (N) as the total number of samples predicted by the model. The calculation formula of the accuracy is shown in formula (28):

Accuracy = T / N    (28)

In general, the larger the accuracy, the better the performance of the model.
6.3 objective function:
text matching is a classification problem, and a sparse categorical cross-entropy loss function is used as the objective function to optimize the model, as shown in formula (29):

J = -(1/D) Σ_{i=1}^{D} Σ_{c=1}^{C} y_{i,c} log(ŷ_{i,c}) + λ‖θ‖²    (29)

where D is the size of the training set, C is the number of classes of the dataset (here C = 2), ŷ is the predicted class probability, y is the actual class of the data, and λ‖θ‖² is the L2 regularization term.
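The objective of equation (29) can be written out as a small NumPy sketch; the regularization coefficient and the set of regularized parameters are illustrative assumptions:

```python
import numpy as np

def objective(y_true, y_prob, params, lam=1e-4):
    """Sparse cross entropy over the training data plus an L2 term, eq. (29).
    y_true: integer labels (0/1); y_prob: predicted probabilities of shape (D, 2)."""
    D = len(y_true)
    ce = -np.mean(np.log(y_prob[np.arange(D), y_true] + 1e-12))   # cross entropy
    l2 = lam * sum(np.sum(w ** 2) for w in params)                # λ‖θ‖²
    return ce + l2
```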
6.4 parameter setting:
model training uses the Adam optimizer to optimize and update all parameters; the embedding dimension of BERT is 768, the biases are initialized to 0, the learning rate is set to 2e-5, Dropout is set to 0.1, the batch size is 32, the sequence length is 512, the L2 regularization coefficient a is 10, and the activation function is ReLU.
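Collected as a configuration sketch (the key names are assumptions made for illustration, not taken from the original implementation), the settings above read:

```python
# Hyperparameters listed in section 6.4 (key names are illustrative assumptions).
train_config = {
    "optimizer": "Adam",
    "embedding_size": 768,     # BERT embedding dimension
    "bias_init": 0.0,
    "learning_rate": 2e-5,
    "dropout": 0.1,
    "batch_size": 32,
    "max_seq_length": 512,
    "l2_coefficient": 10,      # L2 regularization coefficient a
    "activation": "ReLU",
}
```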
6.5 comparative experiment:
6.5.1 comparison with the BERT model:
as shown in FIG. 10 and FIG. 11, the three improvements proposed in this embodiment are compared with the BERT model in terms of both accuracy and running time. The four models compared are: the BERT model, the WordBERT model (a model that performs segmentation at word granularity), the REP_WordBERT model (adding relative position coding on the basis of WordBERT) and the AP_REP_WordBERT model (adding attention pooling on the basis of REP_WordBERT).
As can be seen from FIG. 10, the accuracy of the final model is improved by 2.04% compared with the BERT model. In terms of running time, under the current laboratory conditions the server memory overflowed for the BERT model when the batch size was 32, so it was trained with a batch size of 16; the other three models did not have this problem, so the model proposed in this embodiment is more memory-efficient. To compare running times, the BERT model's running time trained with batch size 16 was multiplied by 2 and taken as its running time at batch size 32. As can be seen from FIG. 11, the running time is also greatly improved.
6.5.2 pooling comparison:
at present, the mainstream methods for reducing the dimension of the BERT output sequence are global max pooling (Max Pooling), average pooling (Average Pooling) and extracting the [CLS] vector; this work proposes attention pooling, and in this part the four pooling modes are compared on the basis of the REP_WordBERT model. The experimental results are shown in FIG. 12.
It can be seen from the figure that the pooling method described in this embodiment has certain advantages over the other pooling methods.
6.5.3 comparison with the derivative pre-trained model of BERT:
the models compared in this part all take the BERT model as their main structure and improve it in certain respects.
The ERNIE model greatly enhances general semantic representation capability by jointly modelling lexical structure, syntactic structure and semantic information in the training data;
the BERT-wwm model masks whole words with [MASK] in the Chinese pre-training phase, whereas BERT masks individual characters;
the RoBERTa model uses a longer pre-training time, a larger batch size and more training data, and dynamically adjusts the masking mechanism in the pre-training stage;
the ALBERT-xlarge model provides two methods capable of greatly reducing the parameter quantity of the model, so that the model structure of the ALBERT can be expanded to the xlarge version.
The accuracy figures for these models come from their official releases, except that the BERT model and AP_REP_WordBERT were run on the laboratory server. FIG. 13 shows that, likewise starting from an improved BERT model, the proposed model achieves higher accuracy on the LCQMC dataset than the other BERT-derived models.
6.5.4 super parameter tuning:
a controlled-variable method is adopted, and the optimal learning rate of the AP_REP_WordBERT model is selected through multiple comparative experiments; the experimental results are shown in FIG. 14.
As can be seen from the figure, the accuracy is highest when the learning rate is 2×10^-5. Analysing the experimental results: when the learning rate is too small, training is too slow, and for the same number of training epochs the network has not converged to an optimal value; when the learning rate is too large, the network may fail to converge, resulting in a decrease in accuracy.
The text semantic matching experiment result based on the LCQMC data set shows that the accuracy of the AP _ REP _ WordBERT model is improved by 2.04% compared with the BERT model, and the speed is 1.4 times of that of the BERT model; compared with other BERT derivative models, the method has a certain improvement.
According to the above method example, the functional modules may be divided according to the block diagram shown in fig. 1 of the specification, for example, the functional modules may be divided corresponding to the functions, or two or more functions may be integrated into one processing module; the integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, the division of the modules in the embodiment of the present invention is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
Specifically, the system includes a processor, a memory, a bus, and a communication device; the memory is used for storing computer execution instructions, the processor is connected with the memory through the bus, the processor executes the computer execution instructions stored in the memory, and the communication equipment is responsible for being connected with an external network and carrying out a data receiving and sending process; the processor is connected with the memory, and the memory comprises database software;
specifically, the database software is a database of SQL Server 2005 or a later version and is stored in a computer-readable storage medium; the processor and the memory contain instructions for causing the personal computer, server or network device to perform all or part of the steps of the method; the processor used may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof; the storage medium includes a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk.
Specifically, the software system is loaded on a Central Processing Unit (CPU), a general purpose Processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, DSPs, and microprocessors, among others. The communication device for communication between the relevant person and the user may utilize a transceiver, a transceiver circuit, a communication interface, or the like.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.

Claims (10)

1. The text semantic matching system for improving BERT is characterized in that: the system comprises a data preprocessing subsystem and a BERT subsystem; the data preprocessing subsystem is responsible for arranging the acquired text and transmitting the text to the BERT model subsystem, the BERT model subsystem is used for establishing a model and outputting the model, and finally the output layer subsystem is used for improving the model and outputting a matching result.
2. The BERT-improved text semantic matching system of claim 1, wherein: the data preprocessing subsystem comprises a text acquisition module, a splicing module and a word segmentation module; the BERT model subsystem comprises an input representation layer, an encoding layer and an output layer; the output layer includes an attention pooling module and a classifier.
3. The text semantic matching method for improving BERT, which differs from existing BERT-based text semantic matching and is implemented on the basis of the system of any one of claims 1-2, characterized in that: the method processes the text by establishing a word embedding mechanism in the input representation layer, a relative position coding mechanism in the encoding layer and an attention pooling mechanism in the output layer, and then completes the subsequent text semantic matching; the specific steps are as follows:
step one, inputting a text by the text acquisition module and inserting a special element vector to complete the initialization operation of a text matching task;
splicing the main vectors by the splicing module by using a self-attention mechanism;
thirdly, the word segmentation module utilizes a word embedding mechanism to segment the text vector according to word granularity and serves as a final word segmentation result;
fourthly, coding the text by using a relative position coding mechanism and outputting the relative position learned by the model;
step five, using the special element vector inserted in the step one to perform attention pooling calculation with other output vector sequences in the output text sequence;
and step six, performing function calculation by using the classifier to complete text semantic matching.
4. The method of claim 3 for text semantic matching for improved BERT, wherein: in step one, the text matching task specifically includes two parts:
the first part: the text pair is spliced; a special symbol [CLS] is added in front of the first sentence of the text pair, a special symbol [SEP] is added at the end of the first sentence, the second sentence is then appended, a special symbol [SEP] is added at the end of the second sentence, and the spliced sentence is segmented according to character granularity;
the second part: the word vector, segment vector and position vector of each word are summed to form the vector representation that is finally fed into the BERT model.
5. The method of claim 3 for text semantic matching for improved BERT, wherein: in the second step, the self-attention mechanism comprises the following specific steps:
step two-one, performing similarity calculation between the query set Q of the current word and each key K to obtain weights;
step two-two, normalizing these weights with a Softmax function;
step two-three, weighting and summing the weights with the corresponding values V to obtain the final attention result.
6. The method of claim 3 for text semantic matching for improved BERT, wherein: in step three, the word embedding mechanism specifically comprises the following steps:
step three-one, adding the Chinese words in the text into the original word list;
step three-two, inputting a sentence; the sentence is first segmented once with the jieba word segmentation tool to obtain a word sequence w_i, w_i ∈ [w_1, w_2, ..., w_l];
step three-three, traversing w_i: if w_i is in the word list it is kept; otherwise it is re-segmented once with BERT's built-in word segmentation function;
step three-four, the segmentation results of every w_i are spliced together in order as the final word segmentation result.
7. The method of claim 3 for text semantic matching for improved BERT, wherein: in the fourth step, the relative position coding means adding two groups of vectors representing the relationship between words in the self-attention mechanism and taking the vectors as parameters to participate in training, and the specific steps are as follows:
step four-one, the two groups of vectors representing relationships between words interact with the word vectors;
step four-two, the attention scores are calculated;
step four-three, the weighted output vectors are obtained.
8. The method of text semantic matching for improved BERT according to claim 7, wherein: in step five, the relative position coding depends on a coding mode in which positions are represented by two-dimensional coordinates; by converting the multi-dimensional position vectors into relative positions expressed by two-dimensional coordinates, the relative position coding is shared in the self-attention mechanism of every layer, and what is expressed in the relative position coding of any layer is the relative information between positions.
9. The method of claim 3 for text semantic matching for improved BERT, wherein: in the sixth step, the classifier is used as a text semantic matching model for the multilayer perceptron, and the classifier consists of a forward propagation neural network, a Softmax normalization function and an Argmax maximum index function:
the forward propagation neural network has two hidden layers in total, all neurons of the first hidden layer are fully connected with a semantic matching representation vector v, and the v is mapped into a high-dimensional semantic space to analyze semantic matching information contained in the high-dimensional semantic space; fully connecting the neurons in the second hidden layer with all the neurons in the first hidden layer, and respectively outputting activation values corresponding to labels 0 with different representative semantics and labels 1 with the same representative semantics to obtain a two-dimensional activation vector;
the Softmax normalization function is used for normalizing the two-dimensional activation vector obtained from the forward propagation network so that the elements of the vector sum to 1, yielding a two-dimensional prediction vector; this vector is the model's prediction of the synonymy relation between the two input sentences to be matched, and its two elements correspond to the prediction probabilities of label 0 and label 1 respectively and are used to calculate the model loss function;
and the Argmax maximum-index function compares the probability values of the two elements in the two-dimensional probability vector and returns the index corresponding to the element with the maximum probability; this index is the final prediction label of the text semantic matching model.
10. Text semantic matching equipment for improved BERT, characterized in that: the equipment comprises a memory and a processor, the memory storing a computer program and the processor, when executing the computer program, implementing the text semantic matching system and method for improved BERT according to any one of claims 1 to 9.
CN202110459186.7A 2021-04-27 2021-04-27 Text semantic matching device, system, method and storage medium for improving BERT Pending CN113239700A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110459186.7A CN113239700A (en) 2021-04-27 2021-04-27 Text semantic matching device, system, method and storage medium for improving BERT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110459186.7A CN113239700A (en) 2021-04-27 2021-04-27 Text semantic matching device, system, method and storage medium for improving BERT

Publications (1)

Publication Number Publication Date
CN113239700A true CN113239700A (en) 2021-08-10

Family

ID=77129382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110459186.7A Pending CN113239700A (en) 2021-04-27 2021-04-27 Text semantic matching device, system, method and storage medium for improving BERT

Country Status (1)

Country Link
CN (1) CN113239700A (en)

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414481A (en) * 2020-03-19 2020-07-14 哈尔滨理工大学 Chinese semantic matching method based on pinyin and BERT embedding
CN111414481B (en) * 2020-03-19 2023-09-26 哈尔滨理工大学 Chinese semantic matching method based on pinyin and BERT embedding
CN113609855A (en) * 2021-08-12 2021-11-05 上海金仕达软件科技有限公司 Information extraction method and device
CN113641793A (en) * 2021-08-16 2021-11-12 国网安徽省电力有限公司电力科学研究院 Retrieval system for long text matching optimization aiming at power standard
CN113641793B (en) * 2021-08-16 2024-05-07 国网安徽省电力有限公司电力科学研究院 Retrieval system for long text matching optimization aiming at electric power standard
CN113656547A (en) * 2021-08-17 2021-11-16 平安科技(深圳)有限公司 Text matching method, device, equipment and storage medium
CN113656547B (en) * 2021-08-17 2023-06-30 平安科技(深圳)有限公司 Text matching method, device, equipment and storage medium
WO2023020522A1 (en) * 2021-08-18 2023-02-23 京东方科技集团股份有限公司 Methods for natural language processing and training natural language processing model, and device
CN113779987A (en) * 2021-08-23 2021-12-10 科大国创云网科技有限公司 Event co-reference disambiguation method and system based on self-attention enhanced semantics
CN113688621A (en) * 2021-09-01 2021-11-23 四川大学 Text matching method and device for texts with different lengths under different granularities
CN113688621B (en) * 2021-09-01 2023-04-07 四川大学 Text matching method and device for texts with different lengths under different granularities
CN113792541B (en) * 2021-09-24 2023-08-11 福州大学 Aspect-level emotion analysis method introducing mutual information regularizer
CN113792541A (en) * 2021-09-24 2021-12-14 福州大学 Aspect-level emotion analysis method introducing mutual information regularizer
CN114154496A (en) * 2022-02-08 2022-03-08 成都四方伟业软件股份有限公司 Coal prison classification scheme comparison method and device based on deep learning BERT model
CN114818698A (en) * 2022-04-28 2022-07-29 华中师范大学 Mixed word embedding method of natural language text and mathematical language text
CN114818698B (en) * 2022-04-28 2024-04-16 华中师范大学 Mixed word embedding method for natural language text and mathematical language text
CN114742035A (en) * 2022-05-19 2022-07-12 北京百度网讯科技有限公司 Text processing method and network model training method based on attention mechanism optimization
CN114818644A (en) * 2022-06-27 2022-07-29 北京云迹科技股份有限公司 Text template generation method, device, equipment and storage medium
CN114818644B (en) * 2022-06-27 2022-10-04 北京云迹科技股份有限公司 Text template generation method, device, equipment and storage medium
CN115329883A (en) * 2022-08-22 2022-11-11 桂林电子科技大学 Semantic similarity processing method, device and system and storage medium
WO2024046316A1 (en) * 2022-09-01 2024-03-07 国网智能电网研究院有限公司 Power domain model pre-training method and apparatus, and fine-tuning method and apparatus, device, storage medium and computer program product
CN115617990B (en) * 2022-09-28 2023-09-05 浙江大学 Power equipment defect short text classification method and system based on deep learning algorithm
CN115617990A (en) * 2022-09-28 2023-01-17 浙江大学 Electric power equipment defect short text classification method and system based on deep learning algorithm
CN115357719A (en) * 2022-10-20 2022-11-18 国网天津市电力公司培训中心 Power audit text classification method and device based on improved BERT model
CN115357718A (en) * 2022-10-20 2022-11-18 佛山科学技术学院 Method, system, device and storage medium for discovering repeated materials of theme integration service
CN115811630A (en) * 2023-02-09 2023-03-17 成都航空职业技术学院 Education informatization method based on artificial intelligence
CN117573813A (en) * 2024-01-17 2024-02-20 清华大学 Method, system, equipment and medium for positioning and detecting internal knowledge of large language model
CN117573813B (en) * 2024-01-17 2024-03-19 清华大学 Method, system, equipment and medium for positioning and detecting internal knowledge of large language model

Similar Documents

Publication Publication Date Title
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
CN111368996B (en) Retraining projection network capable of transmitting natural language representation
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
Young et al. Recent trends in deep learning based natural language processing
CN108733792B (en) Entity relation extraction method
US20220180073A1 (en) Linguistically rich cross-lingual text event embeddings
WO2022057776A1 (en) Model compression method and apparatus
CN111930942B (en) Text classification method, language model training method, device and equipment
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN111221944B (en) Text intention recognition method, device, equipment and storage medium
CN112232087B (en) Specific aspect emotion analysis method of multi-granularity attention model based on Transformer
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
KR20220114495A (en) Interaction layer neural network for search, retrieval, and ranking
CN113239169A (en) Artificial intelligence-based answer generation method, device, equipment and storage medium
Bokka et al. Deep Learning for Natural Language Processing: Solve your natural language processing problems with smart deep neural networks
CN111241232A (en) Business service processing method and device, service platform and storage medium
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN113535897A (en) Fine-grained emotion analysis method based on syntactic relation and opinion word distribution
CN112269874A (en) Text classification method and system
Liu et al. A hybrid neural network BERT-cap based on pre-trained language model and capsule network for user intent classification
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
WO2021129411A1 (en) Text processing method and device
Seilsepour et al. Self-supervised sentiment classification based on semantic similarity measures and contextual embedding using metaheuristic optimizer
CN111581365B (en) Predicate extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination