CN115712713A - Text matching method, device and system and storage medium



Publication number
CN115712713A
Authority
CN
China
Prior art keywords
vector
original sentence
sentence
pair
sentence pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211476656.1A
Other languages
Chinese (zh)
Inventor
蔡晓东
董丽芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202211476656.1A priority Critical patent/CN115712713A/en
Publication of CN115712713A publication Critical patent/CN115712713A/en
Pending legal-status Critical Current

Abstract

The invention provides a text matching method, device, system and storage medium, belonging to the field of language processing, wherein the method comprises the following steps: performing annotation analysis on each original sentence pair to obtain an annotated sentence pair; encoding each annotated sentence pair by using an encoder to obtain a sentence pair hidden vector; performing vector analysis according to each original sentence pair and its sentence pair hidden vector to obtain a difference vector, a first initial global vector and a second initial global vector; and calculating a sentence pair similarity matching score according to the difference vector, the first initial global vector and the second initial global vector to obtain a text matching result. The method highlights the importance of keywords in sentence matching and realizes more accurate text matching; compared with the prior art, it judges the similarity of texts more accurately and improves the accuracy of text matching.

Description

Text matching method, device and system and storage medium
Technical Field
The invention relates mainly to the technical field of language processing, and in particular to a text matching method, a text matching device, a text matching system and a storage medium.
Background
Text matching is an important and challenging task in natural language processing. It is used to judge the similarity of two pieces of text and is widely applied in scenarios such as search engines, recommendation systems and question-answering systems. Most existing advanced text matching models process every word uniformly and compare the texts directly. However, this ignores the matching granularity of the text and thereby reduces the accuracy of matching.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a text matching method, apparatus, system and storage medium that overcome the defects of the prior art.
The technical scheme for solving the technical problems is as follows: a text matching method comprises the following steps:
importing a plurality of original sentence pairs, and respectively carrying out label analysis on each original sentence pair to obtain a label sentence pair of each original sentence pair;
constructing an encoder, and encoding the labeled sentence pairs of the original sentence pairs by using the encoder to obtain sentence pair hidden vectors of the original sentence pairs;
performing vector analysis according to each original sentence pair and the sentence pair hidden vector of each original sentence pair, to obtain a difference vector, a first initial global vector and a second initial global vector of each original sentence pair;
and respectively calculating sentence pair similarity matching scores according to the difference vector, the first initial global vector and the second initial global vector of each original sentence pair to obtain the sentence pair similarity matching scores of each original sentence pair, and taking all the sentence pair similarity matching scores as text matching results.
Another technical solution of the present invention for solving the above technical problems is as follows: a text matching apparatus comprising:
the system comprises a label analysis module, which is used for importing a plurality of original sentence pairs and respectively performing label analysis on each original sentence pair to obtain a labeled sentence pair of each original sentence pair;
the coding analysis module is used for constructing a coder, and coding the labeled sentence pairs of the original sentence pairs by utilizing the coder to obtain the sentence pair hidden vectors of the original sentence pairs;
the vector analysis module is used for performing vector analysis according to each original sentence pair and the sentence pair hidden vector of each original sentence pair, to obtain a difference vector, a first initial global vector and a second initial global vector of each original sentence pair;
and the matching result obtaining module is used for calculating sentence pair similarity matching scores according to the difference vector, the first initial global vector and the second initial global vector of each original sentence pair respectively to obtain the sentence pair similarity matching scores of each original sentence pair, and taking all the sentence pair similarity matching scores as text matching results.
Based on the text matching method, the invention also provides a text matching system.
Another technical solution of the present invention for solving the above technical problems is as follows: a text matching system comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said computer program, when executed by said processor, implementing a text matching method as described above.
Based on the text matching method, the invention also provides a computer readable storage medium.
Another technical solution of the present invention for solving the above technical problems is as follows: a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a text matching method as set forth above.
The beneficial effects of the invention are as follows: a labeled sentence pair is obtained through label analysis of each original sentence pair; a sentence pair hidden vector is obtained by encoding the labeled sentence pair with an encoder; a difference vector, a first initial global vector and a second initial global vector are obtained through vector analysis of the original sentence pair and the sentence pair hidden vector; and a text matching result is obtained by calculating sentence pair similarity matching scores from the difference vector, the first initial global vector and the second initial global vector.
Drawings
Fig. 1 is a schematic flowchart of a text matching method according to an embodiment of the present invention;
fig. 2 is a block diagram of a text matching apparatus according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a schematic flowchart of a text matching method according to an embodiment of the present invention.
As shown in fig. 1, a text matching method includes the following steps:
importing a plurality of original sentence pairs, and respectively carrying out label analysis on each original sentence pair to obtain a label sentence pair of each original sentence pair;
constructing an encoder, and encoding the labeled sentence pairs of the original sentence pairs by using the encoder to obtain sentence pair hidden vectors of the original sentence pairs;
performing vector analysis according to each original sentence pair and the sentence pair hidden vector of each original sentence pair, to obtain a difference vector, a first initial global vector and a second initial global vector of each original sentence pair;
and respectively calculating sentence pair similarity matching scores according to the difference vector, the first initial global vector and the second initial global vector of each original sentence pair to obtain the sentence pair similarity matching scores of each original sentence pair, and taking all the sentence pair similarity matching scores as text matching results.
It should be appreciated that the keywords of the sentence pairs in the dataset (i.e., the original sentence pairs) are labeled.
It should be understood that the NLTK text processing library may also be utilized to perform annotation analysis on each of the original sentence pairs.
It should be understood that the NLTK dataset (i.e., the NLTK text processing library) is a Python library commonly used in the field of NLP research.
In the above embodiment, a labeled sentence pair is obtained by performing label analysis on each original sentence pair; a sentence pair hidden vector is obtained by encoding the labeled sentence pair with the encoder; a difference vector, a first initial global vector and a second initial global vector are obtained by performing vector analysis on the original sentence pair and the sentence pair hidden vector; and a text matching result is obtained by calculating the sentence pair similarity matching scores from the difference vector, the first initial global vector and the second initial global vector.
Optionally, as an embodiment of the present invention, the process of performing label analysis on each original sentence pair to obtain a labeled sentence pair of each original sentence pair includes:
extracting potential keywords from each original sentence pair respectively to obtain a plurality of potential keywords of each original sentence pair;
respectively matching each potential keyword of each original sentence pair according to a preset knowledge base to obtain a plurality of matched keywords of each original sentence pair;
and based on a named entity recognition method, labeling the corresponding original sentence pairs according to the matched keywords of the original sentence pairs respectively to obtain the labeled sentence pairs of the original sentence pairs.
It should be appreciated that a keyword discriminator is designed to label keywords in sentence pairs (i.e., the original sentence pairs) in the data set.
It should be understood that the preset knowledge base may be the Wikipedia entity graph or the Sogou knowledge graph; the Wikipedia entity graph is used for English-language corpora, and the Sogou knowledge graph is used for the Chinese-language corpus SM.
Specifically, potential keywords (i.e. the potential keywords) are first extracted using the NLTK text processing library, keeping words whose part-of-speech tags are nouns, verbs or adjectives. These potential keywords are then matched against an external knowledge base (i.e. the preset knowledge base), comprising: the Wikipedia entity graph for English-language corpora, and the Sogou knowledge graph for the Chinese-language corpus SM. Finally, using a Named Entity Recognition (NER) method (i.e. the named entity recognition method), words (entities) with specific meanings in the text, mainly including person names, place names, organization names and proper nouns, are labeled in the text sequence (i.e. the original sentence pair) and used as keywords.
In the above embodiment, the original sentence pairs are respectively labeled and analyzed to obtain labeled sentence pairs, words with specific meanings can be labeled, more accurate text matching is realized, and compared with the prior art, the similarity of texts can be more accurately judged and the accuracy of text matching is improved.
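As a rough, hypothetical sketch of the labelling step described above: the part-of-speech tagger and the external knowledge base are stubbed out with plain Python data (a real pipeline would call nltk.pos_tag and query the Wikipedia entity graph or the Sogou knowledge graph), and the [KW] marker, tag prefixes and function names are illustrative assumptions rather than the patent's implementation.

```python
# Hypothetical sketch of the keyword-labelling step.
# The POS-tagger output and the knowledge base are stubbed with plain data;
# a real system would use nltk.pos_tag and an external entity graph.

KNOWLEDGE_BASE = {"paris", "capital", "france"}  # stand-in entity set

def extract_potential_keywords(tagged_tokens):
    """Keep nouns, verbs and adjectives as potential keywords."""
    keep_prefixes = ("NN", "VB", "JJ")  # Penn Treebank tag prefixes
    return [tok for tok, tag in tagged_tokens if tag.startswith(keep_prefixes)]

def label_sentence(tagged_tokens, knowledge_base):
    """Mark knowledge-base-matched keywords in the token sequence."""
    candidates = extract_potential_keywords(tagged_tokens)
    matched = {t for t in candidates if t.lower() in knowledge_base}
    return ["[KW]" + tok if tok in matched else tok
            for tok, _tag in tagged_tokens]

# Toy (token, POS-tag) pairs, as a tagger such as nltk.pos_tag would emit.
tagged = [("Paris", "NNP"), ("is", "VBZ"), ("the", "DT"),
          ("capital", "NN"), ("of", "IN"), ("France", "NNP")]
print(label_sentence(tagged, KNOWLEDGE_BASE))
# → ['[KW]Paris', 'is', 'the', '[KW]capital', 'of', '[KW]France']
```

In the method above, the same labelling would be applied to both sentences of each original sentence pair before encoding.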
Optionally, as an embodiment of the present invention, the pair of annotated sentences includes a first annotated sentence and a second annotated sentence, and the encoder includes a BERT model and a max-pooling layer;
the process of respectively encoding the labeled sentence pairs of the original sentence pairs by using the encoder to obtain the sentence pair hidden vectors of the original sentence pairs comprises the following steps:
respectively coding the first labeled sentence of each original sentence pair by using the BERT model to obtain a first hidden component of each original sentence pair;
respectively coding a second labeled sentence of each original sentence pair by using the BERT model to obtain a second hidden component of each original sentence pair;
performing maximum pooling processing on the first hidden component of each original sentence pair by using the maximum pooling layer to obtain a first sentence hidden vector of each original sentence pair;
performing maximum pooling processing on the second hidden component of each original sentence pair by using the maximum pooling layer to obtain a second sentence hidden vector of each original sentence pair;
wherein the sentence pair hidden vector of the original sentence pair comprises a first sentence hidden vector of the original sentence pair and a second sentence hidden vector of the original sentence pair.
It should be appreciated that constructing a "keyword-attention" layer in parallel with the last layer of the BERT model makes the model focus more on keywords.
Specifically, after the keywords of a sentence pair are labeled, the keyword-attention layer is stacked in parallel with the last layer of BERT (i.e. the BERT model) to inject the keywords. The keyword-attention layer forces the model to learn the essential differences of sentence pairs by attending to the differences between their keywords. It is implemented by adding a keyword self-attention mask layer in parallel with the last Transformer layer of BERT, followed by the max-pooling layer, which generates the sentence pair hidden vectors.
It should be understood that the keyword self-attention mask layer reuses the functionality of the Transformer layer in the BERT model.
In particular, assume that the sentence pair is represented by A (i.e. the first annotated sentence) and B (i.e. the second annotated sentence), and that the input of the keyword-attention layer is the set of word representations of the annotated sentence pair. The hidden components of the sentence pair obtained after the keyword self-attention mask layer are denoted H_A^kw and H_B^kw (i.e. the first hidden component and the second hidden component), representing the labeled hidden components of sentences A and B respectively. Then, through the max-pooling operation, the final sentence hidden vectors (i.e. the first sentence hidden vector and the second sentence hidden vector), denoted H_kw(A) and H_kw(B), are obtained, namely:

H_kw(A) = Maxpooling(H_A^kw)

H_kw(B) = Maxpooling(H_B^kw)

where Maxpooling(·) represents the max-pooling operation.
In the above embodiment, the sentence pair hidden vector is obtained by encoding the labeled sentence pair with the encoder, so that the model pays more attention to the keywords and is forced to attend to the differences between the keywords of the sentence pair, thereby learning the essential differences of the sentence pair. This realizes more accurate text matching; compared with the prior art, the similarity of texts can be judged more accurately and the accuracy of text matching is improved.
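The pooling step of this encoder can be illustrated numerically. The sketch below is an assumption-laden simplification: the keyword self-attention mask is reduced to zeroing out non-keyword token vectors before element-wise max-pooling, whereas the patent applies the mask inside a Transformer layer; all names and dimensions are made up.

```python
# Simplified illustration: element-wise max-pooling over token hidden states,
# with the keyword mask approximated by suppressing non-keyword vectors.

def max_pool(hidden_states):
    """Element-wise max over a list of equal-length token vectors."""
    return [max(column) for column in zip(*hidden_states)]

def masked_sentence_vector(hidden_states, keyword_mask):
    """Zero out non-keyword token vectors, then max-pool the rest."""
    masked = [h if is_kw else [0.0] * len(h)
              for h, is_kw in zip(hidden_states, keyword_mask)]
    return max_pool(masked)

# Three toy 2-dim token vectors for sentence A; tokens 0 and 2 are keywords.
h_a = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.4]]
print(masked_sentence_vector(h_a, [True, False, True]))  # → [0.5, 0.9]
```

The same pooling would be applied to sentence B to obtain the second sentence hidden vector.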
Optionally, as an embodiment of the present invention, the process of performing vector analysis according to each original sentence pair and the sentence pair hidden vector of each original sentence pair, to obtain the difference vector, the first initial global vector and the second initial global vector of each original sentence pair, includes:
respectively coding each original sentence pair by utilizing the BERT model to obtain a first initial global vector and a second initial global vector of each original sentence pair;
and calculating a difference vector according to the first sentence hiding vector, the second sentence hiding vector, the first initial global vector and the second initial global vector of each original sentence pair to obtain the difference vector of each original sentence pair.
It should be appreciated that the difference vector between the sentence pair keyword match (i.e. the first sentence hidden vector or the second sentence hidden vector) and the standard sample sentence pair match (i.e. the first initial global vector or the second initial global vector) is calculated, prompting the model to learn their differences.
In the above embodiment, the BERT model is used to encode each original sentence pair to obtain a first initial global vector and a second initial global vector, and the difference vector is obtained by calculating the difference vector according to each first sentence hidden vector, each second sentence hidden vector, each first initial global vector, and each second initial global vector, so that the difference between the keyword matching of the model learning sentence pair and the matching of the standard sample sentence pair can be promoted.
Optionally, as an embodiment of the present invention, the process of calculating a difference vector according to the first sentence hidden vector, the second sentence hidden vector, the first initial global vector and the second initial global vector of each original sentence pair, to obtain the difference vector of each original sentence pair, includes:
calculating, based on a first formula, a difference vector according to the first sentence hidden vector, the second sentence hidden vector, the first initial global vector and the second initial global vector of each original sentence pair, to obtain the difference vector of each original sentence pair, wherein the first formula is:
K_diff = [H_A(CLS) - H_B(CLS)] · [H_kw(A) - H_kw(B)],

wherein K_diff is the difference vector, H_A(CLS) is the first initial global vector, H_B(CLS) is the second initial global vector, H_kw(A) is the first sentence hidden vector, H_kw(B) is the second sentence hidden vector, and · is the dot-product operator.
Understandably, the sentence hidden vectors H_kw(A) (i.e. the first sentence hidden vector) and H_kw(B) (i.e. the second sentence hidden vector) are combined with the global vectors H_A(CLS) (i.e. the first initial global vector) and H_B(CLS) (i.e. the second initial global vector), obtained by encoding the standard sentence pair, to perform the difference vector calculation. The resulting difference vector is denoted K_diff:

K_diff = [H_A(CLS) - H_B(CLS)] · [H_kw(A) - H_kw(B)]

where · represents the dot-product operator, used to compute the difference between the global vectors H_A(CLS), H_B(CLS) of sentences A (i.e. the first annotated sentence) and B (i.e. the second annotated sentence) and the sentence hidden vectors H_kw(A), H_kw(B).
In the above embodiment, the difference vector is calculated, based on the first formula, from each first sentence hidden vector, second sentence hidden vector, first initial global vector and second initial global vector, which prompts the model to learn the difference between sentence pair keyword matching and standard sample sentence pair matching.
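A toy computation of the difference vector K_diff from the first formula. One caveat: the patent names "·" a dot-product operator, but a strict dot product would yield a scalar, while the later concatenation step treats K_diff as a vector; the sketch below therefore interprets the operation element-wise (a Hadamard product), which is an assumption, not the patent's stated definition.

```python
# Illustrative K_diff = (H_A(CLS) - H_B(CLS)) * (H_kw(A) - H_kw(B)),
# with "*" taken element-wise so the result stays a vector (an assumption).

def diff_vector(h_a_cls, h_b_cls, h_kw_a, h_kw_b):
    d_global = [a - b for a, b in zip(h_a_cls, h_b_cls)]   # global difference
    d_keyword = [a - b for a, b in zip(h_kw_a, h_kw_b)]    # keyword difference
    return [g * k for g, k in zip(d_global, d_keyword)]

k_diff = diff_vector([3.0, 2.0], [1.0, 1.0], [2.0, 4.0], [1.0, 2.0])
print(k_diff)  # → [2.0, 2.0]
```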
Optionally, as an embodiment of the present invention, the calculating the sentence pair similarity matching scores according to the difference vector, the first initial global vector, and the second initial global vector of each original sentence pair respectively to obtain the sentence pair similarity matching scores of each original sentence pair includes:
based on a second formula, respectively calculating sentence pair similarity matching scores according to the difference vector, the first initial global vector and the second initial global vector of each original sentence pair to obtain the sentence pair similarity matching scores of each original sentence pair, wherein the second formula is as follows:
sim_scores = softmax[W · H(CLS) + b],

wherein

H(CLS) = H_A(CLS) ⊕ H_B(CLS) ⊕ K_diff,

where sim_scores is the sentence pair similarity matching score, softmax[·] is the activation function, W is the weight matrix, H(CLS) is the target global vector, b is the bias vector, · is the dot-product operator, H_A(CLS) is the first initial global vector, H_B(CLS) is the second initial global vector, K_diff is the difference vector, and ⊕ is the concatenation operator.
It should be understood that the sentence pair similarity matching score is calculated by the classification layer by concatenating the standard sample sentence pair matching vectors (i.e. the first initial global vector and the second initial global vector) with the difference vector.
Specifically, by concatenating the standard sample sentence pair matching vectors (i.e. the first initial global vector and the second initial global vector) with the difference vector, the classification layer calculates the sentence pair similarity matching score sim_scores, specifically:

H(CLS) = H_A(CLS) ⊕ H_B(CLS) ⊕ K_diff

sim_scores = softmax(W · H(CLS) + b)

where ⊕ represents the concatenation operator, softmax(·) is the activation function, W is the weight matrix, b is the bias vector, and · represents the dot-product operator.
In the above embodiment, the sentence pair similarity matching score is calculated by the classification layer, based on the second formula, from each difference vector, first initial global vector and second initial global vector. This highlights the importance of keywords, an important matching granularity, in sentence matching and realizes more accurate text matching; compared with the prior art, the similarity of texts can be judged more accurately and the accuracy of text matching is improved.
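The scoring step of this embodiment can be sketched end-to-end in plain Python: concatenate the two global vectors with the difference vector to form H(CLS), apply a linear layer, then softmax. The tiny weight matrix, bias and input vectors below are made-up illustrative values; in practice W and b would be learned parameters.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_scores(h_a_cls, h_b_cls, k_diff, W, b):
    """sim_scores = softmax(W · H(CLS) + b), with H(CLS) = H_A ⊕ H_B ⊕ K_diff."""
    h_cls = h_a_cls + h_b_cls + k_diff  # list concatenation plays the role of ⊕
    logits = [sum(w * x for w, x in zip(row, h_cls)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

# 2-dim toy vectors → H(CLS) has 6 dims; two classes (match / no match).
W = [[0.1] * 6, [-0.1] * 6]
b = [0.0, 0.0]
scores = similarity_scores([1.0, 0.0], [0.0, 1.0], [0.5, 0.5], W, b)
print(scores)  # the two scores sum to 1
```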
Optionally, as another embodiment of the present invention, the invention designs a keyword discriminator to label the keywords of sentence pairs in the data set; constructs a "keyword-attention" layer in parallel with the last layer of the BERT model, so that the model focuses more on the keywords; calculates the difference vector between sentence pair keyword matching and standard sample sentence pair matching, prompting the model to learn the difference; and concatenates the standard sample sentence pair matching vectors with the difference vector, so that the classification layer obtains the sentence pair similarity matching score. The method can accurately label the keyword information of sentence pairs and, by adding the keyword-attention layer, highlights the importance of keywords, an important matching granularity, in sentence matching, realizing more accurate text matching; compared with the prior art, the similarity of texts can be judged more accurately and the accuracy of text matching is improved.
Fig. 2 is a block diagram of a text matching apparatus according to an embodiment of the present invention.
Alternatively, as another embodiment of the present invention, as shown in fig. 2, a text matching apparatus includes:
the system comprises a label analysis module, which is used for importing a plurality of original sentence pairs and respectively performing label analysis on each original sentence pair to obtain a labeled sentence pair of each original sentence pair;
the coding analysis module is used for constructing a coder, and coding the labeled sentence pairs of the original sentence pairs by using the coder to obtain the sentence pair hidden vectors of the original sentence pairs;
the vector analysis module is used for performing vector analysis according to each original sentence pair and the sentence pair hidden vector of each original sentence pair, to obtain a difference vector, a first initial global vector and a second initial global vector of each original sentence pair;
and the matching result obtaining module is used for calculating sentence pair similarity matching scores according to the difference vector, the first initial global vector and the second initial global vector of each original sentence pair respectively to obtain the sentence pair similarity matching scores of each original sentence pair, and taking all the sentence pair similarity matching scores as text matching results.
Optionally, as an embodiment of the present invention, the annotation analysis module is specifically configured to:
extracting potential keywords from each original sentence pair respectively to obtain a plurality of potential keywords of each original sentence pair;
respectively matching each potential keyword of each original sentence pair according to a preset knowledge base to obtain a plurality of matched keywords of each original sentence pair;
and based on a named entity recognition method, labeling the corresponding original sentence pairs according to the matched keywords of the original sentence pairs respectively to obtain the labeled sentence pairs of the original sentence pairs.
Alternatively, another embodiment of the present invention provides a text matching system, including a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the computer program is executed by the processor, the text matching method as described above is implemented. The system may be a computer or the like.
Alternatively, another embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the text matching method as described above.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partly contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a portable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A text matching method is characterized by comprising the following steps:
importing a plurality of original sentence pairs, and performing label analysis on each original sentence pair to obtain a label sentence pair of each original sentence pair;
constructing an encoder, and encoding the labeled sentence pairs of the original sentence pairs by using the encoder to obtain sentence pair hidden vectors of the original sentence pairs;
performing vector analysis according to each original sentence pair and the sentence pair hidden vector of each original sentence pair, to obtain a difference vector, a first initial global vector and a second initial global vector of each original sentence pair;
and respectively calculating sentence pair similarity matching scores according to the difference vector, the first initial global vector and the second initial global vector of each original sentence pair to obtain the sentence pair similarity matching scores of each original sentence pair, and taking all the sentence pair similarity matching scores as text matching results.
2. The method of claim 1, wherein the process of performing annotation analysis on each original sentence pair to obtain an annotated sentence pair of each original sentence pair comprises:
extracting potential keywords from each original sentence pair respectively to obtain a plurality of potential keywords of each original sentence pair;
respectively matching each potential keyword of each original sentence pair according to a preset knowledge base to obtain a plurality of matched keywords of each original sentence pair;
and based on a named entity recognition method, labeling the corresponding original sentence pairs according to the matched keywords of the original sentence pairs respectively to obtain the labeled sentence pairs of the original sentence pairs.
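The annotation step of claims 1–2 (extract candidate keywords, match them against a knowledge base, then label the matches in the sentence) can be sketched as follows. This is a toy illustration: the in-memory knowledge base, the `[KW]…[/KW]` tag format, and the substring matcher are all assumptions, since the patent does not specify the knowledge-base format or the exact named-entity-recognition labeling scheme.

```python
# Toy sketch of claim 2: match candidate keywords against a small
# knowledge base and mark each match in the sentence with [KW] tags.
# The knowledge base, tag format, and matcher are illustrative assumptions.

KNOWLEDGE_BASE = {"neural network", "text matching", "encoder"}

def annotate(sentence: str) -> str:
    """Wrap every knowledge-base keyword found in the sentence with [KW]...[/KW]."""
    annotated = sentence
    # Try longer keywords first so a long phrase is not split by a shorter one.
    for kw in sorted(KNOWLEDGE_BASE, key=len, reverse=True):
        if kw in annotated:
            annotated = annotated.replace(kw, f"[KW]{kw}[/KW]")
    return annotated

print(annotate("the encoder performs text matching"))
# → the [KW]encoder[/KW] performs [KW]text matching[/KW]
```

A production version would use a real NER model and dictionary matching with tokenization, but the labeling idea is the same: the matched keywords are made explicit so the encoder can attend to them.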
3. The text matching method of claim 1, wherein the pair of annotated sentences comprises a first annotated sentence and a second annotated sentence, and wherein the encoder comprises a BERT model and a max-pooling layer;
the process of respectively encoding the labeled sentence pairs of the original sentence pairs by using the encoder to obtain the sentence pair hidden vectors of the original sentence pairs comprises the following steps:
respectively coding the first labeled sentence of each original sentence pair by using the BERT model to obtain a first hidden component of each original sentence pair;
respectively coding a second labeled sentence of each original sentence pair by using the BERT model to obtain a second hidden component of each original sentence pair;
performing maximum pooling processing on the first hidden component of each original sentence pair by using the maximum pooling layer to obtain a first sentence hidden vector of each original sentence pair;
performing maximum pooling processing on the second hidden component of each original sentence pair by using the maximum pooling layer to obtain a second sentence hidden vector of each original sentence pair;
wherein the sentence-pair hidden vector of an original sentence pair comprises the first sentence hidden vector and the second sentence hidden vector of that original sentence pair.
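The pooling half of claim 3 (BERT token hidden states reduced by a max-pooling layer to one sentence hidden vector) can be sketched with NumPy. The BERT model itself is replaced by a stand-in token-vector matrix here, since only the pooling logic is being illustrated:

```python
import numpy as np

def max_pool_sentence(token_hiddens: np.ndarray) -> np.ndarray:
    """Max-pool a (num_tokens, hidden_dim) matrix of token hidden states
    into a single (hidden_dim,) sentence hidden vector, as in claim 3."""
    return token_hiddens.max(axis=0)

# Stand-in for BERT output: 4 tokens, hidden size 6.
H = np.array([[0.1, 0.9, 0.0, 0.2, 0.5, 0.3],
              [0.4, 0.1, 0.8, 0.0, 0.2, 0.6],
              [0.3, 0.2, 0.1, 0.7, 0.0, 0.1],
              [0.0, 0.3, 0.2, 0.1, 0.9, 0.4]])

h_sent = max_pool_sentence(H)
print(h_sent)  # element-wise maximum over the token axis
```

Each dimension of the sentence hidden vector keeps the strongest activation of that dimension across all tokens, which is why the keyword labels added in claim 2 can dominate the pooled representation.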
4. The text matching method according to claim 3, wherein the process of performing vector analysis according to each original sentence pair and its sentence-pair hidden vector to obtain the difference vector, the first initial global vector, and the second initial global vector of each original sentence pair comprises:
respectively encoding each original sentence pair by using the BERT model to obtain a first initial global vector and a second initial global vector of each original sentence pair;
and calculating a difference vector according to the first sentence hiding vector, the second sentence hiding vector, the first initial global vector and the second initial global vector of each original sentence pair to obtain the difference vector of each original sentence pair.
5. The text matching method according to claim 4, wherein the process of calculating a difference vector according to the first sentence hiding vector, the second sentence hiding vector, the first initial global vector and the second initial global vector of each original sentence pair to obtain a difference vector of each original sentence pair comprises:
calculating a difference vector according to a first sentence hiding vector, a second sentence hiding vector, a first initial global vector and a second initial global vector of each original sentence pair based on a first equation to obtain the difference vector of each original sentence pair, wherein the first equation is as follows:
K_diff = [H_A(CLS) − H_B(CLS)] · [H_kw(A) − H_kw(B)],
wherein K_diff is the difference vector, H_A(CLS) is the first initial global vector, H_B(CLS) is the second initial global vector, H_kw(A) is the first sentence hidden vector, H_kw(B) is the second sentence hidden vector, and · is the dot-product operator.
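The first equation multiplies the difference of the two global vectors by the difference of the two keyword-pooled hidden vectors. A NumPy sketch with small stand-in vectors; note one assumption: the claim names K_diff a *vector* yet calls `·` a dot product, so the product is taken element-wise here (a scalar dot product could not be concatenated into H(CLS) later):

```python
import numpy as np

def difference_vector(h_a_cls, h_b_cls, h_kw_a, h_kw_b):
    """K_diff = [H_A(CLS) - H_B(CLS)] * [H_kw(A) - H_kw(B)].
    '*' is element-wise multiplication (an assumption: the claim calls
    K_diff a vector, which rules out a scalar dot product)."""
    return (np.asarray(h_a_cls) - np.asarray(h_b_cls)) * \
           (np.asarray(h_kw_a) - np.asarray(h_kw_b))

k = difference_vector(h_a_cls=[1.0, 2.0], h_b_cls=[0.5, 1.0],
                      h_kw_a=[0.2, 0.4], h_kw_b=[0.1, 0.2])
print(k)  # [0.05 0.2 ]
```

Dimensions where both the global and the keyword representations disagree get amplified, which is how the difference vector highlights the mismatching keywords between the two sentences.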
6. The text matching method according to claim 4, wherein the step of calculating sentence-pair similarity matching scores according to the difference vector, the first initial global vector and the second initial global vector of each original sentence pair to obtain the sentence-pair similarity matching score of each original sentence pair comprises:
based on a second formula, respectively calculating sentence pair similarity matching scores according to the difference vector, the first initial global vector and the second initial global vector of each original sentence pair to obtain the sentence pair similarity matching scores of each original sentence pair, wherein the second formula is as follows:
sim_scores = softmax[W · H(CLS) + b],
wherein H(CLS) = H_A(CLS) ⊕ K_diff ⊕ H_B(CLS),
and wherein sim_scores is the sentence-pair similarity matching score, softmax[·] is the activation function, W is the weight matrix, H(CLS) is the target global vector, b is the bias vector, · is the dot-product operator, H_A(CLS) is the first initial global vector, H_B(CLS) is the second initial global vector, K_diff is the difference vector, and ⊕ is the join operator.
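Combining claims 5 and 6: the target global vector joins the two initial global vectors with the difference vector, and the matching score is a linear layer followed by softmax. A hedged NumPy sketch; the weight matrix, bias, input vectors, the two-class output, and the concatenation order H_A(CLS) ⊕ K_diff ⊕ H_B(CLS) are stand-in assumptions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def sim_scores(h_a_cls, h_b_cls, k_diff, W, b):
    """sim_scores = softmax(W @ H(CLS) + b), where H(CLS) is the
    concatenation H_A(CLS) + K_diff + H_B(CLS) (order assumed)."""
    h_cls = np.concatenate([h_a_cls, k_diff, h_b_cls])
    return softmax(W @ h_cls + b)

rng = np.random.default_rng(0)
d = 4                                # toy hidden size
W = rng.standard_normal((2, 3 * d))  # 2 outputs: match / no match (assumed)
b = np.zeros(2)
scores = sim_scores(rng.standard_normal(d), rng.standard_normal(d),
                    rng.standard_normal(d), W, b)
print(scores, scores.sum())          # two probabilities summing to 1
```

In training, W and b would be learned jointly with the BERT encoder; the softmax turns the linear score into a probability distribution over match/no-match.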
7. A text matching apparatus, comprising:
the system comprises a label analysis module, a label analysis module and a database, wherein the label analysis module is used for importing a plurality of original sentence pairs and respectively carrying out label analysis on each original sentence pair to obtain a label sentence pair of each original sentence pair;
the coding analysis module is used for constructing a coder, and coding the labeled sentence pairs of the original sentence pairs by using the coder to obtain the sentence pair hidden vectors of the original sentence pairs;
the vector analysis module is used for performing vector analysis according to each original sentence pair and its sentence-pair hidden vector to obtain a difference vector, a first initial global vector, and a second initial global vector for each original sentence pair;
and the matching result obtaining module is used for calculating sentence pair similarity matching scores according to the difference vector, the first initial global vector and the second initial global vector of each original sentence pair respectively to obtain the sentence pair similarity matching scores of each original sentence pair, and taking all the sentence pair similarity matching scores as text matching results.
8. The text matching apparatus of claim 7, wherein the annotation analysis module is specifically configured to:
extracting potential keywords from each original sentence pair respectively to obtain a plurality of potential keywords of each original sentence pair;
matching each potential keyword of each original sentence pair according to a preset knowledge base to obtain a plurality of matched keywords of each original sentence pair;
and based on a named entity recognition method, labeling the corresponding original sentence pairs according to the matched keywords of the original sentence pairs respectively to obtain the labeled sentence pairs of the original sentence pairs.
9. A text matching system comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the text matching method according to any of claims 1 to 6 is implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a text matching method according to any one of claims 1 to 6.
CN202211476656.1A 2022-11-23 2022-11-23 Text matching method, device and system and storage medium Pending CN115712713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211476656.1A CN115712713A (en) 2022-11-23 2022-11-23 Text matching method, device and system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211476656.1A CN115712713A (en) 2022-11-23 2022-11-23 Text matching method, device and system and storage medium

Publications (1)

Publication Number Publication Date
CN115712713A true CN115712713A (en) 2023-02-24

Family

ID=85234630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211476656.1A Pending CN115712713A (en) 2022-11-23 2022-11-23 Text matching method, device and system and storage medium

Country Status (1)

Country Link
CN (1) CN115712713A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116522165A (en) * 2023-06-27 2023-08-01 武汉爱科软件技术股份有限公司 Public opinion text matching system and method based on twin structure
CN116522165B (en) * 2023-06-27 2024-04-02 武汉爱科软件技术股份有限公司 Public opinion text matching system and method based on twin structure

Similar Documents

Publication Publication Date Title
US11288593B2 (en) Method, apparatus and device for extracting information
US11914954B2 (en) Methods and systems for generating declarative statements given documents with questions and answers
CN107679039B (en) Method and device for determining statement intention
US9830315B1 (en) Sequence-based structured prediction for semantic parsing
Zhu et al. Knowledge-based question answering by tree-to-sequence learning
Berardi et al. Word Embeddings Go to Italy: A Comparison of Models and Training Datasets.
US8073877B2 (en) Scalable semi-structured named entity detection
Toledo et al. Information extraction from historical handwritten document images with a context-aware neural model
CN109165380B (en) Neural network model training method and device and text label determining method and device
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN111563384B (en) Evaluation object identification method and device for E-commerce products and storage medium
CN113282711B (en) Internet of vehicles text matching method and device, electronic equipment and storage medium
US20230069935A1 (en) Dialog system answering method based on sentence paraphrase recognition
Mozafari et al. BAS: an answer selection method using BERT language model
CN111524593A (en) Medical question-answering method and system based on context language model and knowledge embedding
CN112069312A (en) Text classification method based on entity recognition and electronic device
Zavala et al. A Hybrid Bi-LSTM-CRF model for Knowledge Recognition from eHealth documents.
Hassani et al. LVTIA: A new method for keyphrase extraction from scientific video lectures
CN115712713A (en) Text matching method, device and system and storage medium
Zhang et al. Chinese-English mixed text normalization
CN111159405B (en) Irony detection method based on background knowledge
CN111783425B (en) Intention identification method based on syntactic analysis model and related device
Dündar et al. A Hybrid Approach to Question-answering for a Banking Chatbot on Turkish: Extending Keywords with Embedding Vectors.
CN116821285A (en) Text processing method, device, equipment and medium based on artificial intelligence
Celikyilmaz et al. An empirical investigation of word class-based features for natural language understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination