CN117131153A - Text matching method, device, system and storage medium - Google Patents

Text matching method, device, system and storage medium

Info

Publication number
CN117131153A
Authority
CN
China
Prior art keywords
text
sentence group
text sentence
segmented
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310928224.8A
Other languages
Chinese (zh)
Inventor
林乐平
石玉博
蔡晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202310928224.8A priority Critical patent/CN117131153A/en
Publication of CN117131153A publication Critical patent/CN117131153A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a text matching method, device, system and storage medium, belonging to the field of text matching. The method comprises the following steps: importing an original text data set and performing word segmentation on it to obtain a plurality of segmented text sentences; grouping the segmented text sentences in pairs to obtain a plurality of segmented text sentence groups; updating each segmented text sentence group to obtain a target text sentence group; and predicting each target text sentence group to obtain a text matching result. The method improves feature extraction, enhances the data efficiency and generalization capability of the model, reduces the amount of computation, saves training cost, better captures the global information of sentences, and alleviates problems such as long-range dependence in long text matching.

Description

Text matching method, device, system and storage medium
Technical Field
The application mainly relates to the technical field of text matching, in particular to a text matching method, a device, a system and a storage medium.
Background
Existing text matching methods generally struggle to model long texts well: when handling a long-text matching task, the model must consider more semantic information and context, which increases computational complexity and slows training and inference. Moreover, because long texts typically have long sequence lengths, the model may be limited by sequence truncation, causing the loss of important semantic information.
Disclosure of Invention
The application aims to solve the technical problem of providing a text matching method, a device, a system and a storage medium aiming at the defects of the prior art.
The technical scheme for solving the technical problems is as follows: a text matching method comprising the steps of:
importing an original text data set, and performing word segmentation on the original text data set to obtain a plurality of segmented text sentences;
grouping all the segmented text sentences in pairs to obtain a plurality of segmented text sentence groups;
updating each word-segmented text sentence group respectively to obtain a target text sentence group corresponding to each word-segmented text sentence group;
and respectively predicting each target text sentence group to obtain prediction scores corresponding to each target text sentence group, and taking all the prediction scores as text matching results.
The other technical scheme for solving the technical problems is as follows: a text matching device, comprising:
an importing module for importing an original text data set;
the word segmentation processing module is used for carrying out word segmentation processing on the original text data set to obtain a plurality of segmented text sentences;
the grouping module is used for grouping all the segmented text sentences in pairs to obtain a plurality of segmented text sentence groups;
the updating module is used for updating each word-segmented text sentence group respectively to obtain a target text sentence group corresponding to each word-segmented text sentence group;
and the text matching result obtaining module is used for respectively predicting each target text sentence group to obtain prediction scores corresponding to each target text sentence group, and taking all the prediction scores as text matching results.
Based on the text matching method, the application further provides a text matching system.
The other technical scheme for solving the technical problems is as follows: a text matching system comprising a memory, a processor and a computer program stored in the memory and executable on the processor, which when executed by the processor, implements a text matching method as described above.
Based on the text matching method, the application further provides a computer readable storage medium.
The other technical scheme for solving the technical problems is as follows: a computer readable storage medium storing a computer program which, when executed by a processor, implements a text matching method as described above.
The beneficial effects of the application are as follows: segmented text sentences are obtained by word segmentation of the original text data set; the segmented text sentences are grouped in pairs to obtain segmented text sentence groups; the segmented text sentence groups are updated to obtain target text sentence groups; and text matching results are obtained by predicting the target text sentence groups. This improves feature extraction, enhances the data efficiency and generalization capability of the model, reduces the amount of computation, saves training cost, better captures the global information of sentences, and alleviates problems such as long-range dependence in long text matching.
Drawings
Fig. 1 is a schematic flow chart of a text matching method according to an embodiment of the present application;
fig. 2 is a block diagram of a text matching device according to an embodiment of the present application.
Detailed Description
The principles and features of the present application are described below with reference to the drawings, the examples are illustrated for the purpose of illustrating the application and are not to be construed as limiting the scope of the application.
Fig. 1 is a schematic flow chart of a text matching method according to an embodiment of the present application.
As shown in fig. 1, a text matching method includes the following steps:
importing an original text data set, and performing word segmentation on the original text data set to obtain a plurality of segmented text sentences;
grouping all the segmented text sentences in pairs to obtain a plurality of segmented text sentence groups;
updating each word-segmented text sentence group respectively to obtain a target text sentence group corresponding to each word-segmented text sentence group;
and respectively predicting each target text sentence group to obtain prediction scores corresponding to each target text sentence group, and taking all the prediction scores as text matching results.
In the above embodiment, word segmentation is performed on the original text data set to obtain segmented text sentences; the segmented text sentences are grouped in pairs to obtain segmented text sentence groups; each segmented text sentence group is updated to obtain a target text sentence group; and each target text sentence group is predicted to obtain the text matching result. This improves feature extraction, enhances the data efficiency and generalization capability of the model, reduces the amount of computation, saves training cost, better captures the global information of sentences, and alleviates problems such as long-range dependence in long text matching.
Optionally, as an embodiment of the present application, the process of performing word segmentation on the original text data set to obtain a plurality of segmented text sentences includes:
and performing word segmentation processing on the original text data set by utilizing a jieba word segmentation library to obtain a plurality of segmented text sentences.
It should be appreciated that the original data set is first pre-processed: the data set is segmented using the Python toolkit jieba (i.e., the jieba word segmentation library), the words in each sentence are separated by spaces, and a dictionary file is created.
Specifically, jieba is a popular Chinese word segmentation library for splitting Chinese text into individual words. It is an open-source project, easy to use and high-performing, and is widely applied in Chinese natural language processing tasks. As a powerful Chinese word segmentation library, it provides a simple, easy-to-use interface and several segmentation modes, and plays an important role in Chinese text processing. Whether for information retrieval, text classification, or sentiment analysis, jieba offers a reliable solution for the word segmentation needs of Chinese text.
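As an illustration of this preprocessing step, the following is a minimal Python sketch that segments sentences with jieba and writes a simple dictionary file; the file name and the toy sentences are illustrative assumptions, not the application's actual data.

```python
import jieba

def preprocess(sentences):
    """Segment each raw sentence with jieba; join the tokens with spaces."""
    segmented = [" ".join(jieba.cut(s)) for s in sentences]
    # Build a simple dictionary file: one unique token per line.
    vocab = sorted({tok for s in segmented for tok in s.split()})
    with open("dict.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(vocab))
    return segmented

pairs = preprocess(["今天天气很好", "今天天气不错"])  # toy sentence pair
```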
In this embodiment, the jieba word segmentation library is used to segment the original text data set into a plurality of segmented text sentences, and a dictionary file can be established, laying a foundation for subsequent data processing, enhancing the data efficiency and generalization capability of the model, reducing the amount of computation, and saving training cost.
Optionally, as an embodiment of the present application, the process of updating each of the word-segmented text sentence groups to obtain a target text sentence group corresponding to each of the word-segmented text sentence groups includes:
respectively carrying out vectorization processing on each segmented text sentence group through a pre-training model SimBERT to obtain hidden layer text vectors corresponding to each segmented text sentence group;
vector updating is carried out on each hidden layer text vector respectively to obtain updated text vectors corresponding to each word-segmented text sentence group;
and respectively splicing each hidden layer text vector and the updated text vector corresponding to each word-segmented text sentence group to obtain a target text sentence group corresponding to each word-segmented text sentence group.
It should be appreciated that the pre-trained SimBERT model is a BERT-based model built on the UniLM idea (proposed by Microsoft), combining the retrieval and generation tasks to further fine-tune the model. It has similar-question generation and similar-sentence retrieval capabilities, and is used in various applications such as synonym generation and text similarity retrieval.
Specifically, sentence 1 and sentence 2 (i.e., the segmented text sentence group) are fed into the pre-trained SimBERT model; each sentence comprises three parts: a position vector, a segment vector, and a word vector. The marks [CLS] and [SEP] distinguish different sentences, where [CLS] is a special symbol marking the classification output and [SEP] a special symbol separating discontinuous token sequences; the position information of each sentence is preserved. The word vector is the vector corresponding to each token in the input sentence. The weight-sharing property of the twin network is exploited. The results are fed into a Transformer encoder, and each token is represented by the bidirectional encoding result. The Transformer encoder comprises a self-attention layer, a residual layer, a normalization layer, and a feed-forward neural network layer. The encoder takes the superimposed character-level vectors as input and finally yields hidden layer vectors carrying semantic information, namely the last-layer output of the pre-trained model, which contains [CLS] and [SEP]. The hidden layer vectors (P, Q) (i.e., the hidden layer text vectors) are thus obtained.
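The following is a minimal sketch of the twin encoding step described above, using the Hugging Face transformers API; the checkpoint name is an assumption (any BERT-style SimBERT weights would serve), and this is a sketch of the idea rather than the application's exact implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Checkpoint name is a placeholder; any BERT-style SimBERT weights would do.
NAME = "WangZeJun/simbert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(NAME)
encoder = AutoModel.from_pretrained(NAME)

def encode(sentence: str) -> torch.Tensor:
    """Last hidden layer of the encoder, including [CLS] and [SEP] positions."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state  # (1, seq_len, hidden)

# Twin (Siamese) setup: the SAME encoder, hence shared weights, for both sentences.
P = encode("今天天气很好")
Q = encode("今天天气不错")
```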
In the above embodiment, each segmented text sentence group is updated to obtain the target text sentence groups, so that the global information and contextual relations of sentences are better extracted and weight sharing between sentences is realized, addressing the difficulty of capturing the global information and contextual relations of sentences during matching.
Optionally, as an embodiment of the present application, the process of updating the vector of each hidden layer text vector to obtain updated text vectors corresponding to each segmented text sentence group includes:
extracting global word sense from each hidden layer text vector through a Bi-LSTM model to obtain global word sense vectors corresponding to each segmented text sentence group, wherein each global word sense vector comprises a plurality of global word sense nodes;
node updating is carried out on a plurality of global word sense nodes corresponding to each segmented text sentence group respectively, so that a plurality of updated global word sense nodes corresponding to each segmented text sentence group are obtained;
performing maximum pooling processing on a plurality of updated global word sense nodes corresponding to each segmented text sentence group through a first formula to obtain updated text vectors corresponding to each segmented text sentence group, wherein the first formula is as follows:
$$\mathrm{conv}_x = \mathop{\mathrm{maxpool}}_{i}\big(\mathrm{ReLU}(U\,\tilde{h}_i^{(x)} + b)\big),$$

wherein $\mathrm{conv}_x$ is the updated text vector corresponding to the $x$-th segmented text sentence group, maxpool is the maximum pooling function, ReLU is the activation function, $\tilde{h}_i^{(x)}$ is the $i$-th updated global word sense node corresponding to the $x$-th segmented text sentence group, $U$ is a weight matrix, and $b$ is a bias.
It should be appreciated that the output (P, Q) of the upper network (i.e., the hidden layer text vectors) is fed into the Bi-LSTM layer (i.e., the Bi-LSTM model) to obtain global word sense information, where the forward and backward hidden states are concatenated at each time step:

$$h_t = \big[\overrightarrow{h_t} \,\|\, \overleftarrow{h_t}\big].$$
It should be appreciated that the Bi-LSTM model, i.e., the bidirectional long short-term memory network (Bidirectional LSTM, BiLSTM for short), is a model based on the recurrent neural network (RNN). Compared with the traditional unidirectional LSTM, BiLSTM can consider historical and future information simultaneously, improving the model's ability to model sequence data. BiLSTM obtains its final output by feeding the input sequence into two LSTM layers, one in time order and one in reverse time order, and splicing their outputs along the time axis. In this way the model extracts features from both past and future context and better captures long-term dependencies in sequence data. BiLSTM is widely used in natural language processing, audio signal processing, handwriting recognition, and other fields; for tasks such as classifying, labeling, or generating sequence data, it has become a common model.
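As an illustration of the Bi-LSTM layer just described, a minimal PyTorch sketch follows; the input dimension of 768 (matching a BERT-style hidden size), the hidden size of 128, and the sequence length are assumptions.

```python
import torch
import torch.nn as nn

# 768 matches a BERT-style hidden size; 128 is an arbitrary illustrative choice.
bilstm = nn.LSTM(input_size=768, hidden_size=128,
                 batch_first=True, bidirectional=True)

P = torch.randn(1, 20, 768)   # hidden layer text vectors, (batch, seq, dim)
out, _ = bilstm(P)            # (1, 20, 256): forward and backward states spliced
```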
Specifically, the DPCNN mainly consists of a text region embedding layer, two convolution blocks (each block composed of two convolutions with a fixed kernel size of 3; the layers built from the two blocks can be connected directly through pre-activation), and a repeat structure; a max-pooling layer is added before each convolution and after pre-activation.
Specifically, the two re-computed node information vectors (i.e., the updated global word sense nodes) are calculated as follows:

$$X = f(U\,\tilde{h} + b), \qquad X_1 = f(U\,X + b),$$

where the function $f$ is the activation function ReLU, $U$ is a weight matrix, and $b$ is a bias. The maximum pooling over the two sentences (i.e., the updated text vector) is then calculated as follows:

$$\mathrm{conv} = \mathrm{maxpool}(X,\, X_1).$$
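A minimal PyTorch sketch of the DPCNN-style computation above follows; the channel count, sequence length, and the reading of maxpool(X, X1) as pooling over the stacked pair are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPCNNBlock(nn.Module):
    """Two kernel-3 pre-activation convolutions, as in a DPCNN block."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, h):
        x = self.conv1(F.relu(h))    # X  = f(U·h + b), f = ReLU
        x1 = self.conv2(F.relu(x))   # X1 = f(U·X + b)
        return x, x1

h = torch.randn(1, 256, 40)          # (batch, features F, nodes n)
x, x1 = DPCNNBlock(256)(h)
# conv = maxpool(X, X1): pool the pair of re-computed vectors, read here as
# a max over their stack followed by a max over the sequence dimension.
stacked = torch.stack([x, x1], dim=0)
conv = stacked.max(dim=0).values.max(dim=-1).values   # (1, 256)
```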
In the above embodiment, the updated text vectors are obtained by updating each hidden layer text vector, so that the global information of the text can be captured better, the gradient vanishing problem alleviated, and feature representations learned more effectively, which helps extract contextual relations.
Optionally, as an embodiment of the present application, the process of updating nodes of a plurality of global word sense nodes corresponding to each segmented text sentence group to obtain a plurality of updated global word sense nodes corresponding to each segmented text sentence group includes:
calculating the attention coefficients of each global word sense node and the rest global word sense nodes respectively through a second formula to obtain a plurality of attention coefficients corresponding to each segmented text sentence group, wherein the second formula is as follows:
$$(\alpha_{ij})_x = \frac{\exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}\,[W h_i^{(x)} \,\|\, W h_j^{(x)}]\big)\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}\,[W h_i^{(x)} \,\|\, W h_k^{(x)}]\big)\big)},$$

wherein $(\alpha_{ij})_x$ is the attention coefficient of the $i$-th and $j$-th global word sense nodes corresponding to the $x$-th segmented text sentence group, LeakyReLU is an activation function, $\vec{a}$ is the self-attention matrix, $W$ is a weight matrix, $h_i^{(x)}$, $h_j^{(x)}$ and $h_k^{(x)}$ are the $i$-th, $j$-th and $k$-th global word sense nodes corresponding to the $x$-th segmented text sentence group, and $\mathcal{N}_i$ is the set of global word sense nodes other than the $i$-th one;
and respectively carrying out node update calculation on a plurality of attention coefficients corresponding to each segmented text sentence group and a plurality of global word sense nodes corresponding to each segmented text sentence group through a third formula to obtain a plurality of updated global word sense nodes corresponding to each segmented text sentence group, wherein the third formula is as follows:
$$\tilde{h}_i^{(x)} = \sigma\Big(\sum_{j \in \mathcal{N}_i} (\alpha_{ij})_x\, W\, h_j^{(x)}\Big),$$

wherein $\tilde{h}_i^{(x)}$ is the $i$-th updated global word sense node corresponding to the $x$-th segmented text sentence group, $(\alpha_{ij})_x$ is the attention coefficient of the $i$-th and $j$-th global word sense nodes corresponding to the $x$-th segmented text sentence group, $h_j^{(x)}$ is the $j$-th global word sense node corresponding to the $x$-th segmented text sentence group, $\sigma$ is an activation function, $\mathcal{N}_i$ is the set of global word sense nodes other than the $i$-th one, and $W$ is a weight matrix.
It should be appreciated that the output vectors of the upper network, $h = \{\vec{h}_1, \vec{h}_2, \dots, \vec{h}_n\}$, $\vec{h}_i \in \mathbb{R}^F$ (i.e., the plurality of global word sense nodes), are used as the input of the GAT layer, where $n$ is the number of nodes and $F$ is the number of features of each node.
Specifically, the attention score between the center node of the sentence word vectors and a neighboring node is calculated as follows:

$$e_{ij} = a\big(W\vec{h}_i,\, W\vec{h}_j\big),$$

where $i$ and $j$ denote two neighboring nodes, $W$ is a weight matrix used to train the nodes, and $\vec{h}_i$, $\vec{h}_j$ are two word vectors in a sentence. Self-attention with a shared attention mechanism $a$ is then performed on the nodes. Introducing softmax regularizes, for each node $i$, all adjacent nodes $j$:

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}.$$

The attention mechanism $a$ is a single-layer feed-forward neural network with the LeakyReLU nonlinear activation function, which yields:

$$\alpha_{ij} = \frac{\exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}[W\vec{h}_i \,\|\, W\vec{h}_j]\big)\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}[W\vec{h}_i \,\|\, W\vec{h}_k]\big)\big)}.$$

It should be appreciated that the output of the new node information (i.e., the updated global word sense nodes) is then derived as follows:

$$\vec{h}_i{}' = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}\, W\, \vec{h}_j\Big).$$
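A single-head graph attention layer matching the formulas above can be sketched in PyTorch as follows; the fully connected neighborhood (every other node treated as a neighbor, as in the sentence graph described here), the choice of sigmoid for $\sigma$, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer over a fully connected node set."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)  # shared attention a

    def forward(self, h):                 # h: (n, F) node features
        Wh = self.W(h)                    # (n, F')
        n = Wh.size(0)
        # e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) for every node pair (i, j)
        pairs = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                           Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))     # (n, n)
        alpha = torch.softmax(e, dim=-1)                # normalize over j
        # h'_i = sigma(sum_j alpha_ij · W h_j); sigma taken as sigmoid here
        return torch.sigmoid(alpha @ Wh)

h = torch.randn(40, 256)                  # n = 40 nodes, F = 256 features
h_updated = GATLayer(256, 256)(h)
```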
In the above embodiment, the plurality of global word sense nodes are updated to obtain the updated global word sense nodes, so that the global information of the text can be captured better, the gradient vanishing problem alleviated, and feature representations learned more effectively, which helps extract contextual relations.
Optionally, as an embodiment of the present application, the process of respectively splicing each hidden layer text vector and the updated text vector corresponding to each segmented text sentence group to obtain the target text sentence group corresponding to each segmented text sentence group includes:
splicing each hidden layer text vector and the updated text vector corresponding to each word-segmented text sentence group through a fourth formula to obtain a target text sentence group corresponding to each word-segmented text sentence group, wherein the fourth formula is as follows:
$$y_{\mathrm{out}}^{(x)} = \mathrm{concat}\big(\mathrm{conv}_x,\, B_x\big),$$

wherein $y_{\mathrm{out}}^{(x)}$ is the target text sentence group corresponding to the $x$-th segmented text sentence group, $\mathrm{conv}_x$ is the updated text vector corresponding to the $x$-th segmented text sentence group, and $B_x$ is the hidden layer text vector corresponding to the $x$-th segmented text sentence group.
It should be understood that the two sentences pass through the pre-trained encoding layer and are sent to the GAT layer for whole-sentence modeling, the DPCNN is used to extract sentence features, and the two outputs (i.e., the updated text vector and the hidden layer text vector) are finally connected to the fully connected layer for the final classification prediction over the target text sentence group.
Specifically, the output of the pre-trained encoding layer (i.e., the hidden layer text vector) and the output of the DPCNN (i.e., the updated text vector) are connected by a residual-style concatenation to obtain the output (i.e., the target text sentence group), as follows:

$$y_{\mathrm{out}} = \mathrm{concat}\big(\mathrm{conv},\, (P, Q)\big),$$

where concat is a vector splicing operation commonly used in deep learning: it joins two or more tensors along a given dimension, stitching the input tensors together in the specified dimension to form a new tensor. The operation can be performed with the concatenation routines of NumPy or PyTorch. It better reflects the difference between the two texts and reduces the complexity of the network.
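As a small illustration of this residual-style concatenation, assuming illustrative dimensions:

```python
import torch

conv = torch.randn(1, 256)    # DPCNN output: the updated text vector
PQ = torch.randn(1, 768)      # pooled hidden layer text vector for (P, Q)
y_out = torch.cat([conv, PQ], dim=-1)   # y_out = concat(conv, (P, Q))
```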
In the above embodiment, each hidden layer text vector and the updated text vector are respectively spliced by the fourth formula to obtain the target text sentence group, so that the difference between the two texts can be better reflected, and the complexity of the network is reduced.
Optionally, as an embodiment of the present application, the process of predicting each target text sentence group to obtain the prediction score corresponding to each target text sentence group includes:
predicting each target text sentence group through a fifth formula to obtain a prediction score corresponding to each target text sentence group, wherein the fifth formula is as follows:
$$\hat{y}_x = \mathrm{softmax}\big(U_1\, y_{\mathrm{out}}^{(x)} + b\big),$$

wherein $\hat{y}_x$ is the prediction score corresponding to the $x$-th segmented text sentence group, softmax is the activation function, $y_{\mathrm{out}}^{(x)}$ is the target text sentence group corresponding to the $x$-th segmented text sentence group, $U_1$ is a weight matrix, and $b$ is a bias.
Specifically, a softmax classification prediction is performed as follows:

$$\hat{y} = \mathrm{softmax}\big(U_1\, y_{\mathrm{out}} + b\big),$$

where $U_1$ represents the weight matrix and $b$ represents the bias.
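A minimal sketch of this final classification step, with an assumed two-class output and illustrative dimensions:

```python
import torch
import torch.nn as nn

y_out = torch.randn(1, 1024)                 # concatenated target vector
fc = nn.Linear(1024, 2)                      # weight U1 and bias b
score = torch.softmax(fc(y_out), dim=-1)     # prediction score per class
```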
In the above embodiment, each target text sentence group is predicted through the fifth formula to obtain the prediction scores, which improves feature extraction, enhances the data efficiency and generalization capability of the model, reduces the amount of computation, saves training cost, better captures the global information of sentences, and alleviates problems such as long-range dependence in long text matching.
Alternatively, as another embodiment of the present application, the original data set is first processed to construct vector sentences suitable for input into the network. Second, the two sentences to be matched are respectively input into the pre-trained twin BERT network (SimBERT) to obtain vector representations. Then a graph expressing the sentence relations is constructed and sent into the graph attention network (GAT). The obtained sentence features are extracted through the DPCNN network, and finally the fully connected layer is attached to classify the result.
The application mainly introduces the concept of the graph into the text matching task, taking each word vector in a sentence as a node of the graph and the relations between the vectors as its edges. Using the graph attention network, the global information of sentences can be better acquired, and problems such as long-range dependence in long text matching are alleviated. Meanwhile, the weight sharing mechanism of the SimBERT pre-training model is exploited; weight sharing reduces the number of model parameters, improves feature extraction, enhances data efficiency, and strengthens the generalization capability of the model. The DPCNN also reduces the amount of computation and saves training cost.
Alternatively, as another embodiment of the present application, the application uses a graph, a data structure that can better model global information, to better extract the global information and context of sentences. Meanwhile, the SimBERT twin pre-training model and the DPCNN network save training cost, reduce the amount of computation, and improve the efficiency of the matching process. The method enhances the interactivity between sentences, so that the model better obtains the global contextual relations of the text.
Optionally, as another embodiment of the present application, the technical problem to be solved by the present application is as follows:
the existing text matching model aims at the relation between the long text sentence before and after the matching and capturing, and the global theme information is not well acquired, so that the semantic matching effect is greatly weakened.
Meanwhile, current long text matching models have large computation and parameter counts, which is unfavorable for saving matching cost.
Optionally, as another embodiment of the present application, the technical method for solving the problem of the present application is as follows:
for the problems that the global information of sentences and the context association are difficult to capture in the matching process, and the like. A text representation method based on combination of a twin pre-training model and a graph meaning network is constructed. The data structure which can better model the global information is utilized, and the global information and the context relation of sentences are better extracted. And constructing a twin pre-training model SimBERT network, realizing weight sharing among sentences, and constructing a matching model which is favorable for extracting context relations.
Finally, in the sentence feature extraction process, the DPCNN (deep pyramid convolutional neural network) is used, with residual connections transferring information across the hierarchy, allowing the model to better capture the global information of the text. Residual connections alleviate the gradient vanishing problem and help the model learn feature representations more efficiently.
Fig. 2 is a block diagram of a text matching device according to an embodiment of the present application.
Alternatively, as another embodiment of the present application, as shown in fig. 2, a text matching apparatus includes:
an importing module for importing an original text data set;
the word segmentation processing module is used for carrying out word segmentation processing on the original text data set to obtain a plurality of segmented text sentences;
the grouping module is used for grouping all the segmented text sentences in pairs to obtain a plurality of segmented text sentence groups;
the updating module is used for updating each word-segmented text sentence group respectively to obtain a target text sentence group corresponding to each word-segmented text sentence group;
and the text matching result obtaining module is used for respectively predicting each target text sentence group to obtain prediction scores corresponding to each target text sentence group, and taking all the prediction scores as text matching results.
Alternatively, another embodiment of the present application provides a text matching system including a memory, a processor, and a computer program stored in the memory and executable on the processor, which when executed by the processor, implements the text matching method as described above. The system may be a computer or the like.
Alternatively, another embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the text matching method as described above.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus and units described above may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present application.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.

Claims (10)

1. A text matching method, comprising the steps of:
importing an original text data set, and performing word segmentation on the original text data set to obtain a plurality of segmented text sentences;
grouping all the segmented text sentences in pairs to obtain a plurality of segmented text sentence groups;
updating each word-segmented text sentence group respectively to obtain a target text sentence group corresponding to each word-segmented text sentence group;
and respectively predicting each target text sentence group to obtain prediction scores corresponding to each target text sentence group, and taking all the prediction scores as text matching results.
2. The text matching method according to claim 1, wherein the process of word segmentation of the original text data set to obtain a plurality of segmented text sentences comprises:
and performing word segmentation processing on the original text data set by utilizing a jieba word segmentation library to obtain a plurality of segmented text sentences.
3. The text matching method according to claim 1, wherein the process of updating each of the segmented text sentence groups to obtain a target text sentence group corresponding to each of the segmented text sentence groups includes:
respectively carrying out vectorization processing on each segmented text sentence group through a pre-training model SimBERT to obtain hidden layer text vectors corresponding to each segmented text sentence group;
vector updating is carried out on each hidden layer text vector respectively to obtain updated text vectors corresponding to each word-segmented text sentence group;
and respectively splicing each hidden layer text vector and the updated text vector corresponding to each word-segmented text sentence group to obtain a target text sentence group corresponding to each word-segmented text sentence group.
4. A text matching method according to claim 3, wherein the process of respectively performing vector update on each hidden layer text vector to obtain updated text vectors corresponding to each segmented text sentence group comprises:
extracting global word sense from each hidden layer text vector through a Bi-LSTM model to obtain global word sense vectors corresponding to each segmented text sentence group, wherein each global word sense vector comprises a plurality of global word sense nodes;
node updating is carried out on a plurality of global word sense nodes corresponding to each segmented text sentence group respectively, so that a plurality of updated global word sense nodes corresponding to each segmented text sentence group are obtained;
performing maximum pooling processing on a plurality of updated global word sense nodes corresponding to each segmented text sentence group through a first formula to obtain updated text vectors corresponding to each segmented text sentence group, wherein the first formula is as follows:
$$\mathrm{conv}_x = \mathop{\mathrm{maxpool}}_{i}\big(\mathrm{ReLU}(U\,\tilde{h}_i^{(x)} + b)\big),$$

wherein $\mathrm{conv}_x$ is the updated text vector corresponding to the $x$-th segmented text sentence group, maxpool is the maximum pooling function, ReLU is the activation function, $\tilde{h}_i^{(x)}$ is the $i$-th updated global word sense node corresponding to the $x$-th segmented text sentence group, $U$ is a weight matrix, and $b$ is a bias.
5. The text matching method according to claim 4, wherein the step of updating the nodes of the plurality of global word sense nodes corresponding to the respective segmented text sentence groups to obtain the plurality of updated global word sense nodes corresponding to the respective segmented text sentence groups includes:
calculating the attention coefficients of each global word sense node and the rest global word sense nodes respectively through a second formula to obtain a plurality of attention coefficients corresponding to each segmented text sentence group, wherein the second formula is as follows:
$$(\alpha_{ij})_x = \frac{\exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}\,[W h_i^{(x)} \,\|\, W h_j^{(x)}]\big)\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(\mathrm{LeakyReLU}\big(\vec{a}^{\top}\,[W h_i^{(x)} \,\|\, W h_k^{(x)}]\big)\big)},$$

wherein $(\alpha_{ij})_x$ is the attention coefficient of the $i$-th and $j$-th global word sense nodes corresponding to the $x$-th segmented text sentence group, LeakyReLU is an activation function, $\vec{a}$ is the self-attention matrix, $W$ is a weight matrix, $h_i^{(x)}$, $h_j^{(x)}$ and $h_k^{(x)}$ are the $i$-th, $j$-th and $k$-th global word sense nodes corresponding to the $x$-th segmented text sentence group, and $\mathcal{N}_i$ is the set of global word sense nodes other than the $i$-th one;
and respectively carrying out node update calculation on a plurality of attention coefficients corresponding to each segmented text sentence group and a plurality of global word sense nodes corresponding to each segmented text sentence group through a third formula to obtain a plurality of updated global word sense nodes corresponding to each segmented text sentence group, wherein the third formula is as follows:
$$\tilde{h}_i^{(x)} = \sigma\Big(\sum_{j \in \mathcal{N}_i} (\alpha_{ij})_x\, W\, h_j^{(x)}\Big),$$

wherein $\tilde{h}_i^{(x)}$ is the $i$-th updated global word sense node corresponding to the $x$-th segmented text sentence group, $(\alpha_{ij})_x$ is the attention coefficient of the $i$-th and $j$-th global word sense nodes corresponding to the $x$-th segmented text sentence group, $h_j^{(x)}$ is the $j$-th global word sense node corresponding to the $x$-th segmented text sentence group, $\sigma$ is an activation function, $\mathcal{N}_i$ is the set of global word sense nodes other than the $i$-th one, and $W$ is a weight matrix.
6. The text matching method according to claim 3, wherein the process of concatenating each of the hidden layer text vectors and the updated text vectors corresponding to each of the segmented text sentence sets to obtain the target text sentence set corresponding to each of the segmented text sentence sets includes:
splicing each hidden layer text vector and the updated text vector corresponding to each word-segmented text sentence group through a fourth formula to obtain a target text sentence group corresponding to each word-segmented text sentence group, wherein the fourth formula is as follows:
$$y_{\mathrm{out}}^{(x)} = \mathrm{concat}\big(\mathrm{conv}_x,\, B_x\big),$$

wherein $y_{\mathrm{out}}^{(x)}$ is the target text sentence group corresponding to the $x$-th segmented text sentence group, $\mathrm{conv}_x$ is the updated text vector corresponding to the $x$-th segmented text sentence group, and $B_x$ is the hidden layer text vector corresponding to the $x$-th segmented text sentence group.
7. The text matching method according to claim 1, wherein the predicting each of the target text sentence groups to obtain a prediction score corresponding to each of the target text sentence groups includes:
predicting each target text sentence group through a fifth formula to obtain a prediction score corresponding to each target text sentence group, wherein the fifth formula is as follows:
$$\hat{y}_x = \mathrm{softmax}\big(U_1\, y_{\mathrm{out}}^{(x)} + b\big),$$

wherein $\hat{y}_x$ is the prediction score corresponding to the $x$-th segmented text sentence group, softmax is the activation function, $y_{\mathrm{out}}^{(x)}$ is the target text sentence group corresponding to the $x$-th segmented text sentence group, $U_1$ is a weight matrix, and $b$ is a bias.
8. A text matching apparatus, comprising:
an importing module for importing an original text data set;
the word segmentation processing module is used for carrying out word segmentation processing on the original text data set to obtain a plurality of segmented text sentences;
the grouping module is used for grouping all the text sentences subjected to word segmentation into one group in pairs to obtain a plurality of text sentence groups subjected to word segmentation;
the updating module is used for updating each word-segmented text sentence group respectively to obtain a target text sentence group corresponding to each word-segmented text sentence group;
and the text matching result obtaining module is used for respectively predicting each target text sentence group to obtain prediction scores corresponding to each target text sentence group, and taking all the prediction scores as text matching results.
9. A text matching system comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the text matching method according to any of claims 1 to 7 is implemented when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, characterized in that the text matching method according to any one of claims 1 to 7 is implemented when the computer program is executed by a processor.
CN202310928224.8A 2023-07-26 2023-07-26 Text matching method, device, system and storage medium Pending CN117131153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310928224.8A CN117131153A (en) 2023-07-26 2023-07-26 Text matching method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310928224.8A CN117131153A (en) 2023-07-26 2023-07-26 Text matching method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN117131153A 2023-11-28

Family

ID=88851810

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310928224.8A Pending CN117131153A (en) 2023-07-26 2023-07-26 Text matching method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN117131153A (en)

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
Wang et al. Application of convolutional neural network in natural language processing
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN108388560B (en) GRU-CRF conference name identification method based on language model
CN111783462A (en) Chinese named entity recognition model and method based on dual neural network fusion
CN110263325B (en) Chinese word segmentation system
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
CN110990555B (en) End-to-end retrieval type dialogue method and system and computer equipment
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN112163429B (en) Sentence correlation obtaining method, system and medium combining cyclic network and BERT
CN112163092B (en) Entity and relation extraction method, system, device and medium
CN111400494B (en) Emotion analysis method based on GCN-Attention
Xing et al. A convolutional neural network for aspect-level sentiment classification
CN115545041B (en) Model construction method and system for enhancing semantic vector representation of medical statement
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
Li et al. Multimodal fusion with co-attention mechanism
Touati-Hamad et al. Arabic quran verses authentication using deep learning and word embeddings
CN114881038B (en) Chinese entity and relation extraction method and device based on span and attention mechanism
CN113792121B (en) Training method and device of reading and understanding model, reading and understanding method and device
CN116257601A (en) Illegal word stock construction method and system based on deep learning
CN115687576A (en) Keyword extraction method and device represented by theme constraint
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN114510569A (en) Chemical emergency news classification method based on Chinesebert model and attention mechanism
CN115130475A (en) Extensible universal end-to-end named entity identification method
CN114722798A (en) Ironic recognition model based on convolutional neural network and attention system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination