WO2021169263A1 - Semantic matching method and device based on internal adversarial mechanism, and storage medium - Google Patents

Semantic matching method and device based on internal adversarial mechanism, and storage medium

Info

Publication number
WO2021169263A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
sentence
question sentence
matched
vector
Prior art date
Application number
PCT/CN2020/117422
Other languages
French (fr)
Chinese (zh)
Inventor
骆迅
王科强
郝新东
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021169263A1 publication Critical patent/WO2021169263A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of artificial intelligence technology, and more specifically to a semantic matching method based on an internal adversarial mechanism.
  • NLP: Natural Language Processing
  • The core module of a patient-education question-answering system is the semantic recall module.
  • Its main function is to search the answer database, based on the patient's question, for the answer closest to the patient's request and return it. The performance of the patient-education question-answering system therefore depends mainly on the accuracy of the semantic recall module.
  • Most semantic recall modules are based on deep learning networks, such as CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory), ESIM (Enhanced LSTM), Decomposable Attention, and Multihead (multi-head attention) networks.
  • CNN: Convolutional Neural Network
  • LSTM: Long Short-Term Memory
  • ESIM: Enhanced LSTM (enhanced long short-term memory network)
  • Decomposable Attention: decomposable attention mechanism network
  • Multihead: multi-head attention mechanism network
  • The purpose of this application is to provide a semantic matching method based on an internal adversarial mechanism.
  • A value evaluation network is added. This network evaluates the result of each question recall and feeds it back to the semantic matching network (that is, the question recall module, built with a deep learning model) as new training data; the semantic matching network is retrained and its output is again scored by the value evaluation network, until the evaluation score reaches a threshold, at which point the adversarial process terminates. This improves the robustness and transfer-learning performance of the semantic matching system, as well as the quality and accuracy of question recall.
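The alternating loop between the two networks described above might be sketched as follows; `internal_adversarial_training`, `mock_train_step`, and `mock_evaluate` are hypothetical names, and the mock functions merely simulate a recall module whose quality improves by a fixed amount each round.

```python
def internal_adversarial_training(train_step, evaluate, score_threshold, max_rounds=10):
    """Alternate between retraining the semantic recall network and scoring
    its recall results with the value evaluation network; stop once the
    evaluation score reaches the threshold (the adversarial process ends)."""
    score = 0.0
    history = []
    for _ in range(max_rounds):
        recall_output = train_step()     # retrain recall module on the fed-back data
        score = evaluate(recall_output)  # value network scores the recall results
        history.append(score)
        if score >= score_threshold:
            break
    return score, history

# Toy simulation: each retraining round raises the (mock) quality by 0.25.
state = {"quality": 0.0}

def mock_train_step():
    state["quality"] += 0.25
    return state["quality"]

def mock_evaluate(output):
    return output

final_score, scores = internal_adversarial_training(mock_train_step, mock_evaluate, 0.7)
```

With these mock functions the loop stops after the third round, when the score first reaches the 0.7 threshold.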
  • A semantic matching method based on an internal adversarial mechanism includes the following steps:
  • S110: Perform word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively;
  • S120: Perform word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the word-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; and perform character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the character-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; where the candidate question sentences are at least one question sentence, retrieved from a specified database through a search engine, that has a set similarity to the question sentence to be matched;
  • S130: Concatenate the word-based sentence-pair vector feature set and the character-based sentence-pair vector feature set to determine the similarity between each candidate question sentence and the question sentence to be matched;
  • S140: Sort the similarities between the candidate question sentences and the question sentence to be matched from high to low, and select the candidate question sentences whose similarity falls within a set ranking as the similar candidate question sentences;
  • S150: Perform word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched, to determine, respectively, the word-based feature set of the question sentence to be matched, the word-based feature set of each similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of each similar candidate question sentence; after the four feature sets are concatenated, determine the similarity between each similar candidate question sentence and the question sentence to be matched;
  • S160: Sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each from high to low, to obtain the two ranking results within the set ranking; use the two ranking results as the two variables of the Pearson correlation coefficient formula to compute the correlation coefficient; if the correlation coefficient reaches the set threshold, take the candidate question sentence ranked first by similarity to the question sentence to be matched as the semantic matching result; if the correlation coefficient is below the set threshold, use the search engine to retrieve from the specified database, again, at least one question sentence with a set similarity to the question sentence to be matched, and return to S120.
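The Pearson correlation check in S160 can be sketched as below; treating the two top-k rank positions of the same candidate sentences as the two variables is an assumption about how the two ranking results are compared.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Rank positions of the same five candidate sentences under the two scores.
coarse_ranks = [1, 2, 3, 4, 5]  # ranking from the recall-stage similarity (S130)
fine_ranks   = [1, 3, 2, 4, 5]  # ranking from the refined similarity (S150)
r = pearson(coarse_ranks, fine_ranks)
```

Here the two rankings agree except for one swapped pair, giving a coefficient of 0.9; when the coefficient falls below the set threshold, the method retrieves fresh candidates and repeats from S120.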
  • A semantic matching system based on an internal adversarial mechanism includes:
  • a segmentation unit, used to perform word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively;
  • a sentence-vector feature set formation unit, used to perform word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the word-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; and to perform character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the character-based sentence-pair vector feature set; where the candidate question sentences are at least one question sentence, retrieved from a specified database through a search engine,
  • that has a set similarity to the question sentence to be matched;
  • a semantic similarity determination unit for the question sentence to be matched and the candidate question sentences, used to concatenate the word-based and character-based sentence-pair vector feature sets to determine the similarity between each candidate question sentence and the question sentence to be matched;
  • a similar-candidate determination unit, used to sort the similarities between the candidate question sentences and the question sentence to be matched from high to low, and to select the candidate question sentences whose similarity falls within a set ranking as the similar candidate question sentences;
  • a semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences, used to perform word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched,
  • and character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched; to determine, respectively, the word-based feature set of the question sentence to be matched, the word-based feature set of each similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of each similar candidate question sentence; and, after the four feature sets are concatenated, to determine the similarity between each similar candidate question sentence and the question sentence to be matched;
  • a semantic matching result determination unit, used to sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each from high to low, to obtain the two ranking results within the set ranking; to use the two ranking results as the two variables of the Pearson correlation coefficient formula and compute the correlation coefficient; and, if the correlation coefficient reaches the set threshold, to take the candidate question sentence ranked first by similarity to the question sentence to be matched as the semantic matching result, or, if the correlation coefficient is below the set threshold, to use the search engine again to retrieve from the specified database at least one question sentence with a set similarity to the question sentence to be matched and recompute the similarity.
  • An electronic device includes a memory and a processor, with a computer program stored in the memory.
  • When the computer program is executed by the processor, the following steps of the semantic matching method based on an internal adversarial mechanism are implemented:
  • S110: Perform word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively;
  • S120: Perform word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the word-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; and perform character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the character-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; where the candidate question sentences are at least one question sentence, retrieved from a specified database through a search engine, that has a set similarity to the question sentence to be matched;
  • S130: Concatenate the word-based sentence-pair vector feature set and the character-based sentence-pair vector feature set to determine the similarity between each candidate question sentence and the question sentence to be matched;
  • S140: Sort the similarities between the candidate question sentences and the question sentence to be matched from high to low, and select the candidate question sentences whose similarity falls within a set ranking as the similar candidate question sentences;
  • S150: Perform word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched, to determine, respectively, the word-based feature set of the question sentence to be matched, the word-based feature set of each similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of each similar candidate question sentence; after the four feature sets are concatenated, determine the similarity between each similar candidate question sentence and the question sentence to be matched;
  • S160: Sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each from high to low, to obtain the two ranking results within the set ranking; use the two ranking results as the two variables of the Pearson correlation coefficient formula to compute the correlation coefficient; if the correlation coefficient reaches the set threshold, take the candidate question sentence ranked first by similarity to the question sentence to be matched as the semantic matching result; if the correlation coefficient is below the set threshold, use the search engine to retrieve from the specified database, again, at least one question sentence with a set similarity to the question sentence to be matched, and return to S120.
  • A computer-readable storage medium contains a semantic matching program based on an internal adversarial mechanism. When this program is executed by a processor, the following steps of the semantic matching method based on an internal adversarial mechanism are implemented:
  • S110: Perform word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively;
  • S120: Perform word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the word-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; and perform character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the character-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; where the candidate question sentences are at least one question sentence, retrieved from a specified database through a search engine, that has a set similarity to the question sentence to be matched;
  • S130: Concatenate the word-based sentence-pair vector feature set and the character-based sentence-pair vector feature set to determine the similarity between each candidate question sentence and the question sentence to be matched;
  • S140: Sort the similarities between the candidate question sentences and the question sentence to be matched from high to low, and select the candidate question sentences whose similarity falls within a set ranking as the similar candidate question sentences;
  • S150: Perform word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched, to determine, respectively, the word-based feature set of the question sentence to be matched, the word-based feature set of each similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of each similar candidate question sentence; after the four feature sets are concatenated, determine the similarity between each similar candidate question sentence and the question sentence to be matched;
  • S160: Sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each from high to low, to obtain the two ranking results within the set ranking; use the two ranking results as the two variables of the Pearson correlation coefficient formula to compute the correlation coefficient; if the correlation coefficient reaches the set threshold, take the candidate question sentence ranked first by similarity to the question sentence to be matched as the semantic matching result; if the correlation coefficient is below the set threshold, use the search engine to retrieve from the specified database, again, at least one question sentence with a set similarity to the question sentence to be matched, and return to S120.
  • An adversarial mechanism is formed between the two networks, the semantic recall network and the value evaluation network, which better evaluates the similarity between the candidate question sentences and the user's question without affecting efficiency,
  • improves the accuracy and precision of the question recall module, and pushes higher-quality answers to users.
  • Both the word segmentation and the character segmentation are fed into the neural network for training at the same time, which improves matching accuracy.
  • FIG. 1 is a flowchart of the semantic matching method based on an internal adversarial mechanism according to Embodiment 1 of the present application;
  • FIG. 2 is a schematic diagram of the logical structure of the semantic matching system based on an internal adversarial mechanism according to Embodiment 2 of the present application;
  • FIG. 3 is a schematic diagram of the logical structure of the electronic device according to Embodiment 3 of the present application.
  • FIG. 1 is a flowchart of the semantic matching method based on an internal adversarial mechanism according to Embodiment 1 of the present application.
  • A semantic matching method based on an internal adversarial mechanism includes the following steps:
  • S110: Perform word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively.
  • In step S110, the word segmentation process includes: removing stop words and special symbols from the question sentence to be matched and then segmenting it into words with a deep-learning tokenizer; and removing stop words and special symbols from each candidate question sentence and then segmenting it into words with the same tokenizer.
  • Tokenizer: deep-learning tokenizer
  • For example, if the question sentence to be matched is "What does diabetes eat?", after word segmentation it becomes "diabetes / eat / what".
  • The character segmentation process includes: removing stop words and special symbols from the question sentence to be matched and then segmenting it into characters with the deep-learning tokenizer; and removing stop words and special symbols from each candidate question sentence and then segmenting it into characters.
  • For example, if the question sentence to be matched is "What does diabetes eat?", after character segmentation each Chinese character becomes its own segment, rendered literally as "sugar / urine / disease / eat / what", since the Chinese word for "diabetes" splits into three characters.
  • Stop words are words or characters that, in information retrieval, are automatically filtered out before or after processing natural-language text in order to save storage space and improve search efficiency. Stop words mainly include English characters, numbers, mathematical characters, punctuation marks, extremely frequent single Chinese characters, and the like. Special characters are infrequently used symbols that are difficult to input directly, such as mathematical symbols, unit symbols, and tab characters, beyond traditional or commonly used symbols. Removing stop words and special symbols makes the sentence to be matched more concise and improves the efficiency of semantic matching.
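The two segmentation granularities of S110 can be illustrated as follows. This is only a sketch: a deployed system would use a trained deep-learning tokenizer for word segmentation, and the `STOP_WORDS` and `SPECIAL_SYMBOLS` sets, the greedy longest-match `word_segment` strategy, and the tiny vocabulary are all hypothetical placeholders.

```python
STOP_WORDS = {"的", "了", "吗"}       # hypothetical stop-word list
SPECIAL_SYMBOLS = set("?？!！,，。")   # hypothetical special-symbol list

def clean(sentence):
    """Remove stop words and special symbols before segmentation."""
    return "".join(ch for ch in sentence
                   if ch not in STOP_WORDS and ch not in SPECIAL_SYMBOLS)

def char_segment(sentence):
    """Character segmentation: every character becomes its own segment."""
    return list(clean(sentence))

def word_segment(sentence, vocabulary):
    """Toy greedy longest-match word segmentation against a vocabulary;
    a real system would use a trained tokenizer instead."""
    text = clean(sentence)
    words, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocabulary:
                words.append(piece)
                i += length
                break
    return words

vocab = {"糖尿病", "什么"}
question = "糖尿病吃什么?"           # "What does diabetes eat?"
words = word_segment(question, vocab)  # word-level segments
chars = char_segment(question)         # character-level segments
```

On this example the word branch yields "diabetes / eat / what" while the character branch splits the same sentence into its six individual characters, matching the two examples in the text.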
  • S120: In the embedding layer of a pre-established semantic recall network, perform word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the word-based sentence-pair vector feature set of the question sentence to be matched and each candidate question sentence; and perform character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the question sentence to be matched and the candidate question sentence's
  • character-based sentence-pair vector feature set. A candidate question sentence is a question sentence similar to the question sentence to be matched.
  • Specifically, the candidate question sentences are at least one question sentence, retrieved from a designated database through es (Elasticsearch, a search engine), that has a set similarity to the question sentence to be matched.
  • The number of retrieved candidates may be 128 or more.
  • For example, if the question sentence to be matched is "What does diabetes eat?", es retrieves 128 candidate question sentences such as "Definition of diabetes" and "How to exercise with diabetes".
  • A large number of question sentences, collected in advance and potentially related to candidate questions, are stored in the designated database; these question sentences can also be stored in word-segmented and character-segmented form to facilitate matching queries.
  • The pre-established semantic recall network includes an embedding (vectorization) layer, a convolutional layer, and a pooling layer.
  • The embedding layer includes a Pre-train Embedding (pre-trained vectorization) layer and a train Embedding (trainable vectorization) layer.
  • The question sentence to be matched and each candidate question sentence retrieved by es are input into the semantic recall network for matching, and the word-based sentence-pair vector feature set of the question sentence to be matched and each candidate question sentence is computed.
  • The specific process is as follows:
  • Pre-train Embedding and train Embedding are applied to the word-segmented question sentence to be matched, and the word vectors output by the Pre-train Embedding layer and by the train Embedding layer are concatenated to form the first word-vector matrix.
  • Pre-train Embedding and train Embedding are likewise applied to each word-segmented candidate question sentence, and the two sets of output word vectors are concatenated to form the second word-vector matrix.
  • The Pre-train Embedding dimension and the train Embedding dimension can both be set to 300.
  • Each word in the two resulting word-vector matrices is then represented by a 600-dimensional vector, which describes the word more precisely.
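The concatenation of the two 300-dimensional embeddings into one 600-dimensional vector per token can be sketched as below. The random lookup tables are stand-ins: in practice one table would hold frozen pre-trained vectors and the other would be learned during training.

```python
import random

DIM = 300
random.seed(0)

def make_table(tokens, dim):
    """Stand-in embedding table; a real system loads pre-trained vectors
    for one table and learns the other during training."""
    return {t: [random.uniform(-1, 1) for _ in range(dim)] for t in tokens}

tokens = ["diabetes", "eat", "what"]
pretrain_emb = make_table(tokens, DIM)  # plays the frozen Pre-train Embedding
train_emb = make_table(tokens, DIM)     # plays the trainable train Embedding

def embed(sentence_tokens):
    """Concatenate the two 300-d vectors token by token into 600-d rows."""
    return [pretrain_emb[t] + train_emb[t] for t in sentence_tokens]

matrix = embed(tokens)  # (num_tokens x 600) word-vector matrix
```

The resulting matrix has one 600-dimensional row per token, combining the general-purpose pre-trained representation with the task-specific trainable one.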
  • The first word-vector matrix is input into the convolutional layer for feature extraction, producing the word-based sentence vector feature set of the question sentence to be matched; this feature set is input into the pooling layer, where dimensionality reduction is performed and irrelevant data are discarded to prevent overfitting.
  • The second word-vector matrix is input into the convolutional layer for feature extraction, producing the word-based sentence vector feature set of the candidate question sentence.
  • This feature set is input into the pooling layer, where dimensionality reduction is performed and irrelevant data are discarded to prevent overfitting.
  • The word-based sentence vector feature set of the question sentence to be matched and the word-based sentence vector feature set of the candidate question sentence, both output after dimensionality reduction by the pooling layer, are concatenated to obtain the word-based sentence-pair vector feature set.
  • Similarly, the question sentence to be matched and each candidate question sentence retrieved by es are input into the semantic recall network, and the character-based sentence-pair vector feature set of the question sentence to be matched and each candidate question sentence is computed.
  • The process is as follows:
  • The character-segmented question sentence to be matched undergoes Pre-train Embedding and train Embedding, and the character vectors output by the two layers are concatenated to form the first character-vector matrix.
  • Pre-train Embedding and train Embedding are likewise applied to each character-segmented candidate question sentence, and the two sets of output character vectors are concatenated to form the second character-vector matrix.
  • Each character in the two resulting character-vector matrices is represented by a 600-dimensional vector, which describes the character more finely.
  • The first character-vector matrix is input into the convolutional layer for feature extraction, producing the character-based sentence vector feature set of the question sentence to be matched; this feature set is input into the pooling layer, where dimensionality reduction is performed and irrelevant data are discarded to prevent overfitting.
  • The second character-vector matrix is input into the convolutional layer for feature extraction, producing the character-based sentence vector feature set of the candidate question sentence; this feature set is input into the pooling layer, where dimensionality reduction is performed and irrelevant data are discarded to prevent overfitting.
  • The character-based sentence vector feature set of the question sentence to be matched and the character-based sentence vector feature set of the candidate question sentence, both output after dimensionality reduction by the pooling layer, are concatenated to obtain the character-based sentence-pair vector feature set.
  • The convolutional layer may include three convolutional neural networks, whose kernel sizes are 1, 2, and 3, respectively, and whose filter counts are 256, 192, and 128, respectively.
  • The word-vector matrix and the character-vector matrix are each input into the three convolutional neural networks for training and feature extraction.
  • The pooling layer includes avg-pooling (average pooling) and max-pooling (maximum pooling).
  • The sentence vector feature sets are passed through avg-pooling and max-pooling in turn; the order in which avg-pooling and max-pooling are applied does not matter.
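The multi-width convolution and dual pooling described above might be sketched as follows. This is a toy illustration only: the convolution weights are uniform placeholders, the input is a small 4-token, 3-dimensional matrix rather than a real 600-dimensional one, and the filter counts are shrunk from 256/192/128 to keep the example readable; a real implementation would use a deep-learning framework.

```python
def conv1d(matrix, kernel_size, num_filters, weight):
    """Valid 1-D convolution over a (seq_len x dim) token matrix.
    weight(f, k, d) supplies the coefficient for filter f, offset k, dim d."""
    seq_len, dim = len(matrix), len(matrix[0])
    out = []
    for start in range(seq_len - kernel_size + 1):
        row = []
        for f in range(num_filters):
            s = 0.0
            for k in range(kernel_size):
                for d in range(dim):
                    s += weight(f, k, d) * matrix[start + k][d]
            row.append(s)
        out.append(row)
    return out

def avg_pool(features):
    """Average over the sequence positions, per filter."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def max_pool(features):
    """Maximum over the sequence positions, per filter."""
    return [max(col) for col in zip(*features)]

# Toy 4-token, 3-dim input matrix and uniform placeholder weights.
matrix = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [1.0, 1.0, 1.0]]
uniform = lambda f, k, d: 1.0

# Three parallel convolutions mirroring kernel sizes 1 / 2 / 3.
branches = [conv1d(matrix, ks, nf, uniform) for ks, nf in [(1, 4), (2, 3), (3, 2)]]

# avg-pooling and max-pooling applied to each branch; since each pooling
# reduces over sequence positions independently, their order does not matter.
pooled = [avg_pool(b) + max_pool(b) for b in branches]
```

Each branch collapses a variable-length sequence of convolution outputs into a fixed-size vector, which is why sentences of different lengths end up with comparable feature sets.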
  • S130: Concatenate the word-based sentence-pair vector feature set and the character-based sentence-pair vector feature set, and determine the similarity between each candidate question sentence and the question sentence to be matched through a sigmoid function.
  • In step S130, the specific process includes:
  • The word-based sentence-pair vector feature set is subjected to diff (vector subtraction), mul (vector multiplication), and max (vector maximization) operations, and the character-based sentence-pair
  • vector feature set is likewise subjected to vector subtraction, vector multiplication, and vector maximization.
  • The six resulting outputs are concatenated to form the final first text feature vector set; after dimensionality reduction, this set is input into the sigmoid function, which outputs a value that is the similarity between the question sentence to be matched and the candidate question sentence.
  • the output value of the sigmoid function is a score between 0 and 1.
  • The sigmoid function is the S-shaped logistic function common in biology, also known as the sigmoid growth curve.
  • Because it is monotonically increasing and has a monotonically increasing inverse, the sigmoid function is often used as the threshold function of neural networks, mapping variables into the interval (0, 1).
  • the dimensionality reduction processing includes: BatchNormalization (normalization) processing to bring the first text feature vector set into the same standard scale, followed by Dense (fully connected) processing, ReLU processing (which mitigates gradient vanishing), and dropout processing (which prevents the model from overfitting).
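A minimal plain-Python sketch of the feature-combination step described above. The two 3-dimensional sentence-pair vectors are invented toy values, and summing the features is only a stand-in for the real dimensionality reduction (BatchNormalization/Dense/ReLU/dropout) before the sigmoid:

```python
import math

def combine(u, v):
    # diff, mul and max applied element-wise to a sentence-pair vector.
    diff = [a - b for a, b in zip(u, v)]
    mul = [a * b for a, b in zip(u, v)]
    mx = [max(a, b) for a, b in zip(u, v)]
    return diff + mul + mx

def sigmoid(x):
    # Maps any real value into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

# Toy word-based and character-based sentence-pair vectors.
word_u, word_v = [0.2, 0.5, 0.1], [0.4, 0.1, 0.3]
char_u, char_v = [0.6, 0.2, 0.0], [0.1, 0.2, 0.5]

# Six outputs (diff/mul/max per granularity) spliced into the
# first text feature vector set.
features = combine(word_u, word_v) + combine(char_u, char_v)

# Stand-in for the reduced representation: a plain sum, where the
# real model applies BatchNormalization, Dense, ReLU and dropout.
score = sigmoid(sum(features))
print(round(score, 4))  # similarity in (0, 1)
```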
  • the essence of the algorithm in this application is to convert two sentences into vector representations with certain characteristic information, and then calculate the similarity of the sentence vectors to obtain the similarity between the question sentence to be matched and the candidate question sentence.
  • S140 Sort the similarities between the candidate question sentences and the question sentence to be matched in descending order, and select the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences.
  • in step S140, the set ranking is the top five: the values output by the sigmoid function are sorted from highest to lowest, the top five values are selected, and the candidate question sentences corresponding to those five values are the similar candidate question sentences.
  • word vectorization processing is performed on each similar candidate question sentence and the word-segmented question sentence to be matched, and character vectorization processing is performed on each similar candidate question sentence and the character-segmented question sentence to be matched, so as to respectively determine the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentences; after the four determined feature sets are concatenated, the similarity between the similar candidate question sentence and the question sentence to be matched is determined through the sigmoid function.
  • the pre-established value evaluation network includes: an embedding layer, a neural network layer, and a pooling layer.
  • the embedding layer includes Pre-train Embedding layer and train Embedding layer.
  • the neural network layer further includes a BiGRU (bidirectional gated recurrent unit) neural network layer, an encoded layer and a soft attention (soft attention mechanism) layer.
  • the question sentence to be matched and each similar candidate question sentence are input into the value evaluation network, and the similarity is calculated.
  • the specific process includes: performing Pre-train Embedding processing and train Embedding processing on the word-segmented question sentence to be matched, and then splicing the word vectors output by Pre-train Embedding with the word vectors output by train Embedding to form a third word vector matrix; each similar candidate question sentence is likewise processed by Pre-train Embedding and train Embedding, and the word vectors output by Pre-train Embedding are spliced with the word vectors output by train Embedding to form a fourth word vector matrix.
  • both the Pre-train Embedding dimension and the train Embedding dimension can be set to 300.
  • each word in the two word vector matrices formed in this way is thus represented by a 600-dimensional vector, which describes the word more accurately.
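The dual-embedding concatenation can be sketched as follows. This uses toy dimensions of 4 rather than 300, and `fake_embedding` with random vectors is a hypothetical stand-in for the real Pre-train Embedding and train Embedding lookups; the example words are also invented:

```python
import random

DIM = 4  # the patent uses 300 for each embedding

def fake_embedding(vocab, dim, seed):
    # Hypothetical stand-in lookup table; a real system would load
    # pre-trained vectors (Pre-train Embedding) or learn them
    # during training (train Embedding).
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}

words = ["persistent", "cough", "treatment"]
pretrain = fake_embedding(words, DIM, seed=1)
train = fake_embedding(words, DIM, seed=2)

# Each word's two embeddings are spliced into one 2*DIM vector,
# and the rows stack into the word vector matrix.
matrix = [pretrain[w] + train[w] for w in words]
print(len(matrix), len(matrix[0]))  # 3 rows, each 2*DIM = 8 wide
```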
  • the specific process further includes:
  • Pre-train Embedding and train Embedding are performed on the character-segmented question sentence to be matched, and the character vectors output by Pre-train Embedding are spliced with the character vectors output by train Embedding to form a third character vector matrix;
  • each similar candidate question sentence is likewise processed by Pre-train Embedding and train Embedding, and the character vectors output by Pre-train Embedding are spliced with the character vectors output by train Embedding to form a fourth character vector matrix.
  • the specific process includes:
  • the third word vector matrix is input into the BiGRU neural network layer for deep-level feature extraction, then passed through the encoded layer and the soft attention layer respectively; the outputs of the encoded layer and the soft attention layer are spliced and input into the BiGRU neural network layer again to extract features; dimensionality reduction is then performed through the pooling layer, and the word-based feature set of the question sentence to be matched is output.
  • the specific process includes:
  • the fourth word vector matrix is input into the BiGRU neural network layer for deep-level feature extraction, then passed through the encoded layer and the soft attention layer respectively; the outputs of the encoded layer and the soft attention layer are spliced and input into the BiGRU neural network layer again to extract features; dimensionality reduction is then performed through the pooling layer, and the word-based feature set of the similar candidate question sentences is output.
  • the specific process further includes:
  • the third character vector matrix is input into the BiGRU neural network layer for deep-level feature extraction, then passed through the encoded layer and the soft attention layer respectively; the outputs of the encoded layer and the soft attention layer are spliced and input into the BiGRU neural network layer again to extract features; dimensionality reduction is then performed through the pooling layer, and the character-based feature set of the question sentence to be matched is output.
  • the specific process further includes:
  • the fourth character vector matrix is input into the BiGRU neural network layer for deep-level feature extraction, then passed through the encoded layer and the soft attention layer respectively; the outputs of the encoded layer and the soft attention layer are spliced and input into the BiGRU neural network layer again to extract features; dimensionality reduction is then performed through the pooling layer, and the character-based feature set of the similar candidate question sentences is output.
  • BiGRU is a variant of the LSTM structure: it has an update gate and a reset gate, which strengthen the semantic understanding of contextual relationships; the soft attention layer aligns the deep-level information after feature extraction; the encoded layer encodes the extracted feature information.
  • the pooling layer includes avg-pooling and max-pooling.
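The soft attention alignment mentioned above can be illustrated with a minimal dot-product attention in plain Python. This is a generic sketch of the mechanism, not the patent's exact layer; the 2-dimensional hidden states are invented:

```python
import math

def softmax(xs):
    # Numerically stable softmax: positive weights summing to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def soft_attention(query, keys):
    # Align one hidden state against a sequence of hidden states:
    # weights from dot products, output as the weighted sum.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    return [sum(w * key[j] for w, key in zip(weights, keys)) for j in range(dim)]

# Toy BiGRU outputs for a 3-step sequence, 2 hidden units each.
states = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3]]
aligned = soft_attention([0.6, 0.4], states)
print(aligned)  # a 2-dimensional aligned vector
```

Because the weights form a convex combination, each component of the aligned vector stays within the range spanned by the corresponding hidden-state components.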
  • the similarity between the similar candidate question sentence and the question sentence to be matched is obtained through the sigmoid function.
  • the specific process includes:
  • the word-based feature set of the question sentences to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentences to be matched, and the character-based feature set of the similar candidate question sentences are concatenated to form the final second text feature vector set.
  • after dimensionality reduction processing is performed on the second text feature vector set, it is input into the sigmoid function, which outputs a value that is the similarity between the question sentence to be matched and the similar candidate question sentence.
  • S160 Sort the similarities between the similar candidate question sentences and the question sentence to be matched in order from high to low, and obtain the sorting result within the set ranking; sort the similarities between the candidate question sentences and the question sentence to be matched from high to low, and obtain the sorting result within the set ranking; use the two sorting results as the two variables in the Pearson correlation coefficient calculation formula to compute the correlation coefficient. If the correlation coefficient reaches the set threshold, the candidate question sentence ranked first in similarity to the question sentence to be matched is the result of semantic matching. If the correlation coefficient is lower than the set threshold, the search engine retrieves again, from the specified database, at least one question sentence having a set similarity to the question sentence to be matched, and the process proceeds to S120.
  • in step S160, the Pearson correlation coefficient is denoted by the lowercase letter r and is calculated by the standard formula: r = Σᵢ(Xᵢ − X̄)(Yᵢ − Ȳ) / √(Σᵢ(Xᵢ − X̄)² · Σᵢ(Yᵢ − Ȳ)²), where X̄ and Ȳ are the means of the sequences X and Y.
  • r is the Pearson correlation coefficient, ranging from -1 to 1; the larger the value, the stronger the correlation. X is the similarity ranking sequence calculated by the semantic recall network, and Y is the similarity ranking sequence calculated by the value evaluation network; n is the set ranking, chosen as 5 in this embodiment. A high coefficient means the matching effect of the semantic recall network is good, and a low coefficient means it is poor.
  • the threshold can be set to 0.7.
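The decision rule of S160 can be sketched in plain Python. The two top-5 similarity sequences below are invented toy values; the 0.7 threshold follows the embodiment:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient r of two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy top-5 similarity sequences from the semantic recall network (X)
# and the value evaluation network (Y).
X = [0.92, 0.88, 0.75, 0.70, 0.61]
Y = [0.90, 0.85, 0.80, 0.66, 0.60]

r = pearson(X, Y)
THRESHOLD = 0.7
if r >= THRESHOLD:
    print("accept top-ranked candidate, r =", round(r, 3))
else:
    print("re-retrieve candidates and repeat S120, r =", round(r, 3))
```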
  • the present application can further store, as training data, the similarity ranking sequence between the question sentence to be matched and the similar candidate question sentences, the ranking sequence of the top five similarities between the question sentence to be matched and the candidate question sentences, and the Pearson correlation coefficient. It can also record the customer's likes and dislikes and send this feedback data back to the value evaluation network as training data.
  • the respective results of the semantic recall network and the value evaluation network are used as new training data and training is performed again; compared with the first round, the data is thereby augmented, and vector operations are performed to obtain the final similarity.
  • in the adversarial training process, the training data is fully and repeatedly used; applying it in a patient-education question-answering system can effectively compensate for insufficient matching data for some diseases. On the one hand, it saves data-collection time; on the other hand, it greatly reduces the burden of manual maintenance and iterative upgrades.
  • FIG. 2 is a schematic diagram of the logical structure of a semantic matching system based on an internal adversarial mechanism according to Embodiment 2 of the present application.
  • a semantic matching system based on an internal adversarial mechanism includes: a word and character segmentation processing unit for the question sentence to be matched and the candidate question sentences, a sentence vector feature set forming unit for the question sentence to be matched and the candidate question sentences, a semantic similarity determination unit for the question sentence to be matched and the candidate question sentences, a similar candidate question sentence determination unit, a semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences, and a semantic matching result determination unit.
  • the word and character segmentation processing unit for the question sentence to be matched and the candidate question sentences is used to perform word segmentation processing and character segmentation processing on the question sentence to be matched and the candidate question sentences respectively;
  • the sentence vector feature set forming unit for the question sentence to be matched and the candidate question sentences is used to perform word vectorization processing on each word-segmented candidate question sentence and the word-segmented question sentence to be matched, so as to determine the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences; and to perform character vectorization processing on each character-segmented candidate question sentence and the character-segmented question sentence to be matched, so as to determine the sentence vector feature set of the character-based sentence pairs; wherein the candidate question sentences are at least one question sentence, retrieved from a specified database by a search engine, having a set similarity to the question sentence to be matched;
  • the semantic similarity determination unit for the question sentence to be matched and the candidate question sentences is used to concatenate the sentence vector feature set of the word-based sentence pairs and the sentence vector feature set of the character-based sentence pairs, and to determine the similarity between each candidate question sentence and the question sentence to be matched;
  • the similar candidate question sentence determination unit is used to sort the similarities between the candidate question sentences and the question sentence to be matched in order from high to low, and to select the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences;
  • the semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences is used to perform word vectorization processing on each similar candidate question sentence and the word-segmented question sentence to be matched, and character vectorization processing on each similar candidate question sentence and the character-segmented question sentence to be matched, so as to respectively determine the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentences; and, after concatenating the four determined feature sets, to determine the similarity between each similar candidate question sentence and the question sentence to be matched;
  • the semantic matching result determination unit is used to sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each in order from high to low, and to obtain the sorting results within the set ranking; to use the two sorting results as the two variables of the Pearson correlation coefficient calculation formula and compute the correlation coefficient; if the correlation coefficient reaches the set threshold, to take the candidate question sentence ranked first in similarity to the question sentence to be matched as the result of semantic matching; and if the correlation coefficient is lower than the set threshold, to retrieve again from the specified database, through the search engine, at least one question sentence having a set similarity to the question sentence to be matched, and to calculate the similarity.
  • FIG. 3 is a schematic diagram of a logical structure of an electronic device according to Embodiment 3 of the present application.
  • an electronic device 1 includes a memory 3 and a processor 2.
  • the memory 3 stores a computer program 4, and the computer program 4, when executed by the processor 2, implements the steps of the semantic matching method based on the internal adversarial mechanism.
  • a computer-readable storage medium is provided, which may be non-volatile or volatile.
  • the computer-readable storage medium includes a semantic matching program based on an internal adversarial mechanism, and when the semantic matching program is executed by a processor, the steps of the semantic matching method based on the internal adversarial mechanism of Embodiment 1 are implemented.

Abstract

A semantic matching method and device based on an internal adversarial mechanism, and a storage medium, relating to the technical field of artificial intelligence. The method comprises the following steps: respectively performing word segmentation processing and character segmentation processing on a question sentence to be matched and candidate question sentences; respectively calculating similarities between the candidate question sentences and said question sentence; sorting the similarities between the candidate question sentences and said question sentence, and taking the candidate question sentences within a set rank as similar candidate question sentences; respectively calculating similarities between the similar candidate question sentences and said question sentence; and using the sorting result, within the set rank, of the similarities between the similar candidate question sentences and said question sentence and the sorting result, within the set rank, of the similarities between the candidate question sentences and said question sentence as two variables of the Pearson correlation coefficient calculation formula, and determining the matching result according to the correlation coefficient. The method can effectively improve semantic matching quality and precision.

Description

Semantic matching method, device and storage medium based on internal adversarial mechanism
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 26, 2020, with application number 202010119430.0 and the invention title "Semantic matching method, device and storage medium based on internal adversarial mechanism", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence technology and, more specifically, to a semantic matching method based on an internal adversarial mechanism.
Background
Human-machine dialogue is currently a very popular application scenario in the field of Natural Language Processing (NLP). From traditional artificial intelligence (AI) customer service to voice chatbots, the core technologies are semantic recognition, semantic understanding, and semantic matching.
At present, most human-machine dialogue systems on the market are concentrated in fields such as finance, customer service, and entertainment, while human-machine question answering in the medical field is still relatively in its infancy. On the one hand, medical scenarios are more complex and involve more technical terms, making it difficult for AI to fully understand a patient's demands. On the other hand, because medical scenarios have low fault tolerance, correspondingly higher requirements are placed on AI recognition accuracy. Some patient-education question-answering systems already exist on the market, such as Doctor Thumb and Kang Fuzi. However, these systems generally suffer from problems such as only being able to answer simple questions, being helpless with complex demands, and giving irrelevant answers. The main reason is that current semantic matching models generally have shortcomings such as poor robustness and insufficient transfer learning performance.
The core module of a patient-education question-answering system is the semantic recall module, whose main function is to search the answer database, based on the patient's question, for the answer closest to the patient's demand and respond with it. Therefore, the performance of such a system mainly depends on the accuracy of the semantic recall module. Currently, most semantic recall modules are built on deep learning networks, such as CNN (convolutional neural network), LSTM (Long Short-Term Memory), ESIM (Enhanced-LSTM), Decomposable Attention, Multi-head attention, and so on. These deep learning networks each have their own advantages and disadvantages and are suitable for different scenarios. The inventors realized that all of these models suffer from shortcomings such as overfitting and high sensitivity to data quality.
Summary of the invention
In view of the above problems, the purpose of this application is to provide a semantic matching method based on an internal adversarial mechanism. A value evaluation network is added on the basis of the original question recall module. This network evaluates the quality of each result of the question recall module and feeds it back to the semantic matching network (i.e., the question recall module, built with a deep learning model) as new training data; the network is retrained and its output is sent to the value evaluation network again, and the adversarial process terminates only when the evaluation score of the value evaluation network reaches a threshold. This can improve the robustness and transfer learning performance of the semantic matching system and improve the quality and accuracy of question recall.
According to one aspect of this application, a semantic matching method based on an internal adversarial mechanism is provided, including the following steps:
S110: Perform word segmentation processing and character segmentation processing on the question sentence to be matched and the candidate question sentences respectively;
S120: Perform word vectorization processing on each word-segmented candidate question sentence and the word-segmented question sentence to be matched, to determine the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences; and perform character vectorization processing on each character-segmented candidate question sentence and the character-segmented question sentence to be matched, to determine the sentence vector feature set of the character-based sentence pairs of the question sentence to be matched and the candidate question sentences; wherein the candidate question sentences are at least one question sentence, retrieved from a specified database by a search engine, having a set similarity to the question sentence to be matched;
S130: Concatenate the sentence vector feature set of the word-based sentence pairs and the sentence vector feature set of the character-based sentence pairs, and determine the similarity between each candidate question sentence and the question sentence to be matched;
S140: Sort the similarities between the candidate question sentences and the question sentence to be matched in order from high to low, and select the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences;
S150: Perform word vectorization processing on each similar candidate question sentence and the word-segmented question sentence to be matched, and perform character vectorization processing on each similar candidate question sentence and the character-segmented question sentence to be matched, so as to respectively determine the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentences; after concatenating the four determined feature sets, determine the similarity between each similar candidate question sentence and the question sentence to be matched;
S160: Sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each in order from high to low, and obtain the sorting results within the set ranking; use the two sorting results as the two variables of the Pearson correlation coefficient calculation formula to compute the correlation coefficient; if the correlation coefficient reaches a set threshold, take the candidate question sentence ranked first in similarity to the question sentence to be matched as the result of semantic matching; if the correlation coefficient is lower than the set threshold, retrieve again from the specified database, through the search engine, at least one question sentence having a set similarity to the question sentence to be matched, and perform S120.
According to another aspect of this application, a semantic matching system based on an internal adversarial mechanism is provided, including:
A word and character segmentation processing unit for the question sentence to be matched and the candidate question sentences: used to perform word segmentation processing and character segmentation processing on the question sentence to be matched and the candidate question sentences respectively;
A sentence vector feature set forming unit for the question sentence to be matched and the candidate question sentences: used to perform word vectorization processing on each word-segmented candidate question sentence and the word-segmented question sentence to be matched, to determine the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences; and to perform character vectorization processing on each character-segmented candidate question sentence and the character-segmented question sentence to be matched, to determine the sentence vector feature set of the character-based sentence pairs; wherein the candidate question sentences are at least one question sentence, retrieved from a specified database by a search engine, having a set similarity to the question sentence to be matched;
A semantic similarity determination unit for the question sentence to be matched and the candidate question sentences: used to concatenate the sentence vector feature set of the word-based sentence pairs and the sentence vector feature set of the character-based sentence pairs, and to determine the similarity between each candidate question sentence and the question sentence to be matched;
A similar candidate question sentence determination unit: used to sort the similarities between the candidate question sentences and the question sentence to be matched in order from high to low, and to select the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences;
A semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences: used to perform word vectorization processing on each similar candidate question sentence and the word-segmented question sentence to be matched, and character vectorization processing on each similar candidate question sentence and the character-segmented question sentence to be matched, so as to respectively determine the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentences; and, after concatenating the four determined feature sets, to determine the similarity between each similar candidate question sentence and the question sentence to be matched;
A semantic matching result determination unit: used to sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each in order from high to low, and to obtain the sorting results within the set ranking; to use the two sorting results as the two variables of the Pearson correlation coefficient calculation formula and compute the correlation coefficient; if the correlation coefficient reaches a set threshold, to take the candidate question sentence ranked first in similarity to the question sentence to be matched as the result of semantic matching; and if the correlation coefficient is lower than the set threshold, to retrieve again from the specified database, through the search engine, at least one question sentence having a set similarity to the question sentence to be matched, and to calculate the similarity.
According to another aspect of the present application, an electronic device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the steps of the following semantic matching method based on an internal adversarial mechanism:
S110: performing word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively;
S120: performing word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the sentence vector feature set of the word-based sentence pair formed by the question sentence to be matched and the candidate question sentence; and performing character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the sentence vector feature set of the character-based sentence pair; wherein the candidate question sentences are at least one question sentence, retrieved by a search engine from a designated database, having a set similarity to the question sentence to be matched;
S130: concatenating the sentence vector feature set of the word-based sentence pair with the sentence vector feature set of the character-based sentence pair, and determining the similarity between the candidate question sentence and the question sentence to be matched;
S140: sorting the similarities between the candidate question sentences and the question sentence to be matched in descending order, and selecting the candidate question sentences whose similarity ranks within a set number of places as similar candidate question sentences;
S150: performing word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and performing character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched; determining the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence; and, after concatenating the four feature sets so determined, determining the similarity between the similar candidate question sentence and the question sentence to be matched;
S160: sorting the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each in descending order, and taking the ranking results within the set number of places; using the two ranking results as the two variables of the Pearson correlation coefficient formula to compute a correlation coefficient; if the correlation coefficient reaches a set threshold, taking the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and, if the correlation coefficient is below the set threshold, again retrieving, via the search engine, from the designated database at least one question sentence having a set similarity to the question sentence to be matched, and performing S120.
According to another aspect of the present application, a computer-readable storage medium is provided, the computer-readable storage medium containing a semantic matching program based on an internal adversarial mechanism which, when executed by a processor, implements the steps of the following semantic matching method based on an internal adversarial mechanism:
S110: performing word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively;
S120: performing word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the sentence vector feature set of the word-based sentence pair formed by the question sentence to be matched and the candidate question sentence; and performing character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the sentence vector feature set of the character-based sentence pair; wherein the candidate question sentences are at least one question sentence, retrieved by a search engine from a designated database, having a set similarity to the question sentence to be matched;
S130: concatenating the sentence vector feature set of the word-based sentence pair with the sentence vector feature set of the character-based sentence pair, and determining the similarity between the candidate question sentence and the question sentence to be matched;
S140: sorting the similarities between the candidate question sentences and the question sentence to be matched in descending order, and selecting the candidate question sentences whose similarity ranks within a set number of places as similar candidate question sentences;
S150: performing word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and performing character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched; determining the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence; and, after concatenating the four feature sets so determined, determining the similarity between the similar candidate question sentence and the question sentence to be matched;
S160: sorting the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each in descending order, and taking the ranking results within the set number of places; using the two ranking results as the two variables of the Pearson correlation coefficient formula to compute a correlation coefficient; if the correlation coefficient reaches a set threshold, taking the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and, if the correlation coefficient is below the set threshold, again retrieving, via the search engine, from the designated database at least one question sentence having a set similarity to the question sentence to be matched, and performing S120.
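The Pearson-correlation check in S160 can be sketched in plain Python. The two top-5 score lists below (one per network) and the 0.8 threshold are illustrative assumptions; the application only speaks of a "set threshold" and does not fix these values.

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length score lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical top-5 similarity scores from the two networks,
# both already sorted in descending order as S160 requires.
recall_scores = [0.92, 0.85, 0.77, 0.60, 0.55]  # semantic recall network
value_scores  = [0.90, 0.88, 0.70, 0.66, 0.50]  # value evaluation network

THRESHOLD = 0.8  # illustrative stand-in for the "set threshold"
r = pearson(recall_scores, value_scores)
agree = r >= THRESHOLD  # if False, retrieval is redone and S120 is performed again
```

When `agree` is true, the two networks rank the candidates consistently and the top-ranked candidate is returned as the semantic matching result.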
With the above semantic matching method based on the internal adversarial mechanism of the present application, the semantic recall network and the value evaluation network form an adversarial mechanism between the two networks, which, without sacrificing efficiency, better evaluates the similarity between the candidate question sentences and the user's question, improves the accuracy and precision of the question recall module, and pushes higher-quality answers to the user. Feeding both the word-level and the character-level segmentations into the neural network for training improves matching accuracy.
Description of the Drawings
Fig. 1 is a flowchart of a semantic matching method based on an internal adversarial mechanism according to Embodiment 1 of the present application;
Fig. 2 is a schematic diagram of the logical structure of a semantic matching system based on an internal adversarial mechanism according to Embodiment 2 of the present application;
Fig. 3 is a schematic diagram of the logical structure of an electronic device according to Embodiment 3 of the present application.
Detailed Description
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It will be apparent, however, that the embodiments may also be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing one or more embodiments.
Specific embodiments of the present application are described in detail below with reference to the accompanying drawings.
Embodiment 1
Fig. 1 is a flowchart of the semantic matching method based on an internal adversarial mechanism according to Embodiment 1 of the present application.
As shown in Fig. 1, a semantic matching method based on an internal adversarial mechanism includes the following steps.
S110: performing word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively.
In step S110, the word segmentation process includes: removing stop words and special symbols from the question sentence to be matched and then applying a deep learning tokenizer (Tokenizer) for word segmentation; and, after removing stop words and special symbols from the candidate question sentences, applying the deep learning tokenizer for word segmentation.
For example, if the question sentence to be matched is "糖尿病吃什么？" ("What should a diabetic eat?"), after word segmentation it becomes "糖尿病/吃/什么" ("diabetes / eat / what").
The character segmentation process includes: removing stop words and special symbols from the question sentence to be matched and then applying the deep learning tokenizer for character segmentation; and, after removing stop words and special symbols from the candidate question sentences, applying the deep learning tokenizer for character segmentation.
For example, after character segmentation the same question sentence becomes "糖/尿/病/吃/什/么" (one token per Chinese character).
Stop words are characters or words that, in information retrieval, are automatically filtered out before or after processing natural language text in order to save storage space and improve search efficiency; they mainly include English characters, digits, mathematical characters, punctuation marks, and extremely frequent single Chinese characters. Special characters are symbols that, compared with traditional or commonly used symbols, occur less frequently and are difficult to input directly, such as mathematical symbols, unit symbols, and tabs. The purpose of removing stop words and special symbols is to make the question sentence to be matched more concise and to improve the efficiency of semantic matching.
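A minimal sketch of the S110 preprocessing follows. The stop-word set, the punctuation pattern, and the toy dictionary-driven forward-maximum-match segmenter are illustrative stand-ins for the deep learning tokenizer the application actually uses.

```python
import re

STOP_WORDS = {"的", "了", "吗", "?", "？"}   # illustrative stop-word set
VOCAB = {"糖尿病", "什么", "吃"}              # toy segmentation dictionary

def clean(sentence):
    # Remove special symbols (anything that is not a word character), then stop words.
    sentence = re.sub(r"[^\w\u4e00-\u9fff]", "", sentence)
    return "".join(ch for ch in sentence if ch not in STOP_WORDS)

def word_segment(sentence, max_len=4):
    # Greedy forward-maximum-match segmentation (stand-in for the tokenizer).
    tokens, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in VOCAB or j == i + 1:
                tokens.append(sentence[i:j])
                i = j
                break
    return tokens

def char_segment(sentence):
    # Character segmentation: one token per Chinese character.
    return list(sentence)

q = clean("糖尿病吃什么？")
words = word_segment(q)   # word-level tokens of the sentence to be matched
chars = char_segment(q)   # character-level tokens of the same sentence
```

The same two functions are applied to every candidate question sentence, yielding the word-segmented and character-segmented forms consumed by S120.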
S120: in the embedding layer of a pre-established semantic recall network, performing word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the sentence vector feature set of the word-based sentence pair formed by the question sentence to be matched and the candidate question sentence; and performing character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the sentence vector feature set of the character-based sentence pair; wherein the candidate question sentences are question sentences similar to the question sentence to be matched.
In step S120, the candidate question sentences are at least one question sentence, retrieved from a designated database through es (Elasticsearch, a search engine), having a set similarity to the question sentence to be matched. Their number may be 128 or more. For example, for the question sentence to be matched "糖尿病吃什么？", es may retrieve 128 candidate question sentences such as "糖尿病的定义" ("the definition of diabetes") and "糖尿病如何运动" ("how diabetics should exercise").
The designated database stores a large number of pre-collected question sentences that may be related to the candidate questions; these question sentences may likewise all be stored in the database in both word-segmented and character-segmented form, so as to facilitate matching queries.
The pre-established semantic recall network includes an embedding (vectorization) layer, a convolutional layer, and a pooling layer. The embedding layer in turn comprises a Pre-train Embedding (pre-trained vectorization) layer and a train Embedding (trained vectorization) layer.
The word-segmented question sentence to be matched and each candidate question sentence retrieved by es are fed into the semantic recall network for matching, and the sentence vector feature set of the word-based sentence pair formed by the question sentence to be matched and each candidate question sentence is computed.
The word vectorization of each word-segmented candidate question sentence and of the word-segmented question sentence to be matched, which determines the sentence vector feature set of the word-based sentence pair, specifically includes the following.
The word-segmented question sentence to be matched is subjected to Pre-train Embedding processing and train Embedding processing respectively, and the word vectors output by the Pre-train Embedding are concatenated with the word vectors output by the train Embedding to form a first word vector matrix.
The word-segmented candidate question sentence is subjected to Pre-train Embedding processing and train Embedding processing respectively, and the word vectors output by the Pre-train Embedding are concatenated with the word vectors output by the train Embedding to form a second word vector matrix.
Both the Pre-train Embedding dimension and the train Embedding dimension may be set to 300. In the two word vector matrices thus formed, each word is represented by a 600-dimensional vector, which describes the word more precisely.
The first word vector matrix is fed into the convolutional layer for feature extraction, which outputs the word-based sentence vector feature set of the question sentence to be matched; this feature set is fed into the pooling layer for dimensionality reduction, discarding genuinely irrelevant data and preventing overfitting.
The second word vector matrix is fed into the convolutional layer for feature extraction, which outputs the word-based sentence vector feature set of the candidate question sentence; this feature set is fed into the pooling layer for dimensionality reduction, discarding genuinely irrelevant data and preventing overfitting.
The word-based sentence vector feature set of the question sentence to be matched and the word-based sentence vector feature set of the candidate question sentence, as output by the pooling layer after dimensionality reduction, are concatenated to obtain the sentence vector feature set of the word-based sentence pair.
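The word-branch pipeline just described (look up both embedding tables, concatenate into a 600-dimensional vector per token, convolve, pool over time, then concatenate the two sentences' features) can be sketched with NumPy. The random embedding tables, the small vocabulary, and the tiny filter count are placeholders for the trained Pre-train Embedding / train Embedding weights and the real layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, PRE_DIM, TRAIN_DIM = 50, 300, 300
pre_table = rng.standard_normal((VOCAB_SIZE, PRE_DIM))      # stands in for Pre-train Embedding
train_table = rng.standard_normal((VOCAB_SIZE, TRAIN_DIM))  # stands in for train Embedding

def embed(token_ids):
    # Look up both tables and concatenate -> one 600-dim vector per token.
    return np.concatenate([pre_table[token_ids], train_table[token_ids]], axis=1)

def conv_and_pool(matrix, kernel_size=2, n_filters=8, seed=1):
    # 1-D convolution over the token axis, followed by max pooling over time.
    w = np.random.default_rng(seed).standard_normal((n_filters, kernel_size * matrix.shape[1]))
    windows = np.stack([matrix[i:i + kernel_size].ravel()
                        for i in range(matrix.shape[0] - kernel_size + 1)])
    feature_map = windows @ w.T          # (n_windows, n_filters)
    return feature_map.max(axis=0)       # pooled sentence vector, (n_filters,)

query_mat = embed([3, 7, 11])     # word vector matrix of the sentence to be matched (3 tokens)
cand_mat = embed([3, 7, 20, 5])   # word vector matrix of a candidate sentence (4 tokens)

# Concatenate the two pooled feature sets -> sentence-pair feature set.
pair_features = np.concatenate([conv_and_pool(query_mat), conv_and_pool(cand_mat)])
```

The character branch is identical in shape, only fed with character-level token ids instead of word-level ones.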
The character-segmented question sentence to be matched and each candidate question sentence retrieved by es are fed into the semantic recall network for matching, and the sentence vector feature set of the character-based sentence pair formed by the question sentence to be matched and each candidate question sentence is computed.
The character vectorization of each character-segmented candidate question sentence and of the character-segmented question sentence to be matched, which determines the sentence vector feature set of the character-based sentence pair, specifically includes the following.
The character-segmented question sentence to be matched is subjected to Pre-train Embedding processing and train Embedding processing respectively, and the character vectors output by the Pre-train Embedding are concatenated with the character vectors output by the train Embedding to form a first character vector matrix.
The character-segmented candidate question sentence is subjected to Pre-train Embedding processing and train Embedding processing respectively, and the character vectors output by the Pre-train Embedding are concatenated with the character vectors output by the train Embedding to form a second character vector matrix.
Both the Pre-train Embedding dimension and the train Embedding dimension are set to 300. In the two character vector matrices thus formed, each character is represented by a 600-dimensional vector, which describes the character more finely.
The first character vector matrix is fed into the convolutional layer for feature extraction, which outputs the character-based sentence vector feature set of the question sentence to be matched; this feature set is fed into the pooling layer for dimensionality reduction, discarding genuinely irrelevant data and preventing overfitting.
The second character vector matrix is fed into the convolutional layer for feature extraction, which outputs the character-based sentence vector feature set of the candidate question sentence; this feature set is fed into the pooling layer for dimensionality reduction, discarding genuinely irrelevant data and preventing overfitting.
The character-based sentence vector feature set of the question sentence to be matched and the character-based sentence vector feature set of the candidate question sentence, as output by the pooling layer after dimensionality reduction, are concatenated to obtain the sentence vector feature set of the character-based sentence pair.
The convolutional layer may include three convolutional neural networks, with kernel sizes of 1, 2, and 3 and with 256, 192, and 128 filters, respectively. The word vector matrices and the character vector matrices are each fed in turn into the three convolutional neural networks for training and feature extraction.
The pooling layer comprises avg-pooling (average pooling) and max-pooling (maximum pooling); the sentence vector feature set is fed into both avg-pooling and max-pooling, and the order in which it is fed into avg-pooling and max-pooling is immaterial.
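The order-independence of the two poolings follows from both reading the same feature map. A minimal sketch, assuming a hypothetical 4-window by 3-filter convolution output:

```python
import numpy as np

# Hypothetical convolution output: 4 sliding windows x 3 filters.
feature_map = np.array([[ 0.2, -1.0,  0.5],
                        [ 0.8,  0.3, -0.2],
                        [ 0.1,  0.9,  0.4],
                        [-0.5,  0.2,  0.6]])

avg_pooled = feature_map.mean(axis=0)   # avg-pooling over the window axis
max_pooled = feature_map.max(axis=0)    # max-pooling over the window axis

# Both poolings read the same feature map, so the order in which they
# are applied does not matter; their outputs are concatenated.
sentence_vec = np.concatenate([avg_pooled, max_pooled])
```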
S130: concatenating the sentence vector feature set of the word-based sentence pair with the sentence vector feature set of the character-based sentence pair, and determining the similarity between the candidate question sentence and the question sentence to be matched through a sigmoid function.
In step S130, the specific process includes the following.
The outputs of diff (vector subtraction), mul (vector multiplication), and max (vector maximization) applied to the sentence vector feature set of the word-based sentence pair, together with the outputs of vector subtraction, vector multiplication, and vector maximization applied to the sentence vector feature set of the character-based sentence pair, are concatenated; these six outputs form the final first text feature vector set. After dimensionality reduction, the first text feature vector set is fed into the sigmoid function, which outputs a value, namely the similarity between the question sentence to be matched and the candidate question sentence. The value output by the sigmoid function is a score between 0 and 1.
The sigmoid function is an S-shaped function common in biology, also known as the S-shaped growth curve. In information science, owing to properties such as being monotonically increasing and having a monotonically increasing inverse, the sigmoid function is often used as the threshold function of a neural network, mapping variables into the interval between 0 and 1.
The dimensionality reduction includes: applying BatchNormalization to convert the first text feature vector set into a common standard scale, followed by Dense processing, relu processing (to prevent vanishing gradients), and dropout processing (to prevent model overfitting).
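The diff/mul/max interaction and the final sigmoid score of S130 can be sketched as follows. The four sentence vectors are hypothetical pooled features, and the random projection stands in for the trained Dense layer (BatchNormalization and dropout are omitted for brevity).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def interact(u, v):
    # diff, mul and max applied to a sentence-vector pair, then concatenated.
    return np.concatenate([u - v, u * v, np.maximum(u, v)])

rng = np.random.default_rng(0)
u_word, v_word = rng.standard_normal(8), rng.standard_normal(8)  # word-based pair features
u_char, v_char = rng.standard_normal(8), rng.standard_normal(8)  # character-based pair features

# Six interaction outputs (3 word-based + 3 character-based) concatenated
# into the first text feature vector set.
text_features = np.concatenate([interact(u_word, v_word), interact(u_char, v_char)])

dense_w = rng.standard_normal(text_features.shape[0])  # stands in for the trained Dense layer
similarity = sigmoid(text_features @ dense_w)          # a score in (0, 1)
```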
In essence, the algorithm of the present application converts two sentences into vector representations carrying certain feature information, and then computes the similarity of the sentence vectors to obtain the similarity between the question sentence to be matched and the candidate question sentence.
S140: sorting the similarities between the candidate question sentences and the question sentence to be matched in descending order, and selecting the candidate question sentences whose similarity ranks within a set number of places as similar candidate question sentences.
In step S140, the set number of places is the top five. The values output by the sigmoid function are sorted from largest to smallest and the top five values are selected; the candidate question sentences corresponding to the top five values are the similar candidate question sentences.
S150: in the embedding layer of a pre-established value evaluation network, performing word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and performing character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched; determining the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence; and, after concatenating the four feature sets so determined, determining the similarity between the similar candidate question sentence and the question sentence to be matched through the sigmoid function.
In step S150, the pre-established value evaluation network includes an embedding layer, a neural network layer, and a pooling layer. The embedding layer in turn comprises a Pre-train Embedding layer and a train Embedding layer. The neural network layer in turn comprises a BiGRU (bidirectional gated recurrent unit) neural network layer, an encoded layer, and a soft attention layer.
The question sentence to be matched is matched against each similar candidate question sentence in the value evaluation network, and the similarity is computed.
The word vectorization of each similar candidate question sentence and of the word-segmented question sentence to be matched specifically includes: subjecting the word-segmented question sentence to be matched to Pre-train Embedding processing and train Embedding processing respectively, and concatenating the word vectors output by the Pre-train Embedding with the word vectors output by the train Embedding to form a third word vector matrix; and subjecting the similar candidate question sentence to Pre-train Embedding processing and train Embedding processing respectively, and concatenating the word vectors output by the Pre-train Embedding with the word vectors output by the train Embedding to form a fourth word vector matrix.
Both the Pre-train Embedding dimension and the train Embedding dimension may be set to 300. In the two word vector matrices thus formed, each word is represented by a 600-dimensional vector, which describes the word more precisely.
The character vectorization of each similar candidate question sentence and of the character-segmented question sentence to be matched specifically includes the following.
The character-segmented question sentence to be matched is subjected to Pre-train Embedding processing and train Embedding processing respectively, and the character vectors output by the Pre-train Embedding are concatenated with the character vectors output by the train Embedding to form a third character vector matrix; the similar candidate question sentence is subjected to Pre-train Embedding processing and train Embedding processing respectively, and the character vectors output by the Pre-train Embedding are concatenated with the character vectors output by the train Embedding to form a fourth character vector matrix.
Both the Pre-train Embedding dimension and the train Embedding dimension are set to 300; in the two character vector matrices thus formed, each character is represented by a 600-dimensional vector, which describes the character more precisely.
The word-based feature set of the question sentence to be matched is determined as follows:
the third word-vector matrix is input into a BiGRU neural network layer for deep feature extraction, then passed through an encoded layer and a soft-attention layer respectively; the outputs of the encoded layer and the soft-attention layer are spliced and input into a BiGRU neural network layer again for feature extraction; dimensionality is then reduced through a pooling layer, and the word-based feature set of the question sentence to be matched is output.
The word-based feature set of a similar candidate question sentence is determined as follows:
the fourth word-vector matrix is input into a BiGRU neural network layer for deep feature extraction, then passed through an encoded layer and a soft-attention layer respectively; the outputs of the encoded layer and the soft-attention layer are spliced and input into a BiGRU neural network layer again for feature extraction; dimensionality is then reduced through a pooling layer, and the word-based feature set of the similar candidate question sentence is output.
The character-based feature set of the question sentence to be matched is determined as follows:
the third character-vector matrix is input into a BiGRU neural network layer for deep feature extraction, then passed through an encoded layer and a soft-attention layer respectively; the outputs of the encoded layer and the soft-attention layer are spliced and input into a BiGRU neural network layer again for feature extraction; dimensionality is then reduced through a pooling layer, and the character-based feature set of the question sentence to be matched is output.
The character-based feature set of a similar candidate question sentence is determined as follows:
the fourth character-vector matrix is input into a BiGRU neural network layer for deep feature extraction, then passed through an encoded layer and a soft-attention layer respectively; the outputs of the encoded layer and the soft-attention layer are spliced and input into a BiGRU neural network layer again for feature extraction; dimensionality is then reduced through a pooling layer, and the character-based feature set of the similar candidate question sentence is output.
BiGRU is a variant of the LSTM structure with an update gate and a reset gate, which strengthens the semantic understanding of contextual relationships; the soft-attention layer aligns the deep information obtained after feature extraction; the encoded layer encodes the information obtained after feature extraction. The pooling layer includes avg-pooling and max-pooling.
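The soft-attention alignment and the avg-/max-pooling mentioned above can be illustrated with a minimal numerical sketch. The BiGRU encoders are replaced here by random "encoded" sequences, and all shapes and names are assumptions for demonstration, not the application's trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention(a, b):
    """Align each position of sequence a against sequence b, and vice versa."""
    scores = a @ b.T                           # (len_a, len_b) similarity scores
    a_aligned = softmax(scores, axis=1) @ b    # b summarized for each position of a
    b_aligned = softmax(scores.T, axis=1) @ a  # a summarized for each position of b
    return a_aligned, b_aligned

def pool(seq):
    """Reduce a (length, dim) sequence to a fixed vector via avg- and max-pooling."""
    return np.concatenate([seq.mean(axis=0), seq.max(axis=0)])

rng = np.random.default_rng(1)
enc_a = rng.normal(size=(5, 6))   # stand-in "encoded" question to be matched
enc_b = rng.normal(size=(7, 6))   # stand-in "encoded" similar candidate question
a_att, b_att = soft_attention(enc_a, enc_b)
# Splice the encoded sequence with its attention-aligned counterpart, then pool.
feat_a = pool(np.concatenate([enc_a, a_att], axis=1))
```

Sequences of different lengths (5 and 7 here) still produce fixed-size feature vectors after pooling, which is what allows the four feature sets to be spliced later.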
After the four feature sets are spliced, the similarity between a similar candidate question sentence and the question sentence to be matched is obtained through a sigmoid function. Specifically:
the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence are spliced to form a final second text-feature-vector set; after dimensionality reduction, the second text-feature-vector set is input into a sigmoid function, which outputs a value that is the similarity between the question sentence to be matched and the similar candidate question sentence.
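The final similarity computation can be sketched as follows. The feature dimensions are made up, and the random projection weights stand in for a trained dense (dimensionality-reduction) layer; both are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
word_query = rng.normal(size=24)   # word-based features, question to be matched
word_cand = rng.normal(size=24)    # word-based features, similar candidate
char_query = rng.normal(size=24)   # character-based features, question to be matched
char_cand = rng.normal(size=24)    # character-based features, similar candidate

# Splice the four feature sets into the "second text-feature-vector set".
features = np.concatenate([word_query, word_cand, char_query, char_cand])
# A random vector stands in for the trained projection that reduces it to a scalar.
w = rng.normal(size=features.shape[0]) / np.sqrt(features.shape[0])
similarity = sigmoid(features @ w)  # a single value in (0, 1)
```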
S160: the similarities between the similar candidate question sentences and the question sentence to be matched are sorted in descending order, and the sorting result within a set number of places is obtained; the similarities between the candidate question sentences and the question sentence to be matched are likewise sorted in descending order, and the sorting result within the set number of places is obtained; the two sorting results are taken as the two variables of the Pearson correlation coefficient formula, and the correlation coefficient is calculated. If the correlation coefficient reaches a set threshold, the candidate question sentence ranked first in similarity to the question sentence to be matched is the semantic matching result; if the correlation coefficient is below the set threshold, at least one question sentence having the set similarity to the question sentence to be matched is retrieved again from the specified database by the search engine, and S120 is performed.
In step S160, the Pearson correlation coefficient, denoted by the lowercase letter r, is calculated as follows:
r = \frac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2}}
r is the Pearson correlation coefficient, ranging from -1 to 1; the larger the value, the stronger the correlation. X is the ranked similarity sequence computed by the semantic recall network, Y is the ranked similarity sequence computed by the value matching network, and n is the set number of places, chosen as 5 in this embodiment. A high coefficient indicates that the semantic recall network matches well; a low coefficient indicates that it matches poorly. The threshold may be set to 0.7.
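Step S160 can be sketched directly from the formula above, with n = 5 and the 0.7 threshold from this embodiment; the two top-5 similarity sequences below are made-up sample values for illustration:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient, term-by-term from the formula above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

# Hypothetical top-5 similarity rankings from the two networks (X and Y).
recall_top5 = [0.92, 0.88, 0.81, 0.77, 0.70]  # semantic recall network (X)
value_top5 = [0.95, 0.90, 0.84, 0.74, 0.69]   # value matching network (Y)

r = pearson(recall_top5, value_top5)
match_ok = r >= 0.7  # threshold from the embodiment; otherwise re-retrieve and redo S120
```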
Further, the present application may store the ranked similarity sequence between the question sentence to be matched and the similar candidate question sentences, the top-five ranked similarity sequence between the question sentence to be matched and the candidate question sentences, and the Pearson correlation coefficient, as training data. Customers' like/dislike behavior may also be recorded, and the feedback data returned to the value evaluation network as training data.
In the present application, the respective results of the semantic recall network and the value evaluation network are used as new training data for retraining, which amounts to a round of data augmentation. The final similarity is obtained mainly by vector operations through the different layers of the neural network. During the adversarial training process, the training data is used fully and repeatedly; applied in a patient-education question-answering system, this effectively compensates for insufficient matching data for some diseases. On the one hand, it saves data-collection time; on the other hand, it greatly reduces the burden of manual maintenance and iterative upgrades.
Embodiment 2
FIG. 2 is a schematic diagram of the logical structure of a semantic matching system based on an internal adversarial mechanism according to Embodiment 2 of the present application.
As shown in FIG. 2, a semantic matching system based on an internal adversarial mechanism includes: a word and character segmentation unit for the question sentence to be matched and the candidate question sentences; a sentence-vector feature set forming unit for the question sentence to be matched and the candidate question sentences; a semantic similarity determination unit for the question sentence to be matched and the candidate question sentences; a similar candidate question sentence determination unit; a semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences; and a semantic matching result determination unit.
Word and character segmentation unit for the question sentence to be matched and the candidate question sentences: configured to perform word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively.
Sentence-vector feature set forming unit for the question sentence to be matched and the candidate question sentences: configured to perform word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine a sentence-vector feature set of the word-based sentence pair of the question sentence to be matched and the candidate question sentence; and to perform character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine a sentence-vector feature set of the character-based sentence pair of the question sentence to be matched and the candidate question sentence; wherein the candidate question sentences are at least one question sentence retrieved from a specified database by a search engine and having a set similarity to the question sentence to be matched.
Semantic similarity determination unit for the question sentence to be matched and the candidate question sentences: configured to splice the sentence-vector feature set of the word-based sentence pair with the sentence-vector feature set of the character-based sentence pair, and determine the similarity between the candidate question sentence and the question sentence to be matched.
Similar candidate question sentence determination unit: configured to sort the similarities between the candidate question sentences and the question sentence to be matched in descending order, and select the candidate question sentences whose similarity ranks within a set number of places as similar candidate question sentences.
Semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences: configured to perform word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched, so as to respectively determine the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence; and, after splicing the four determined feature sets, determine the similarity between the similar candidate question sentence and the question sentence to be matched.
Semantic matching result determination unit: configured to sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, in descending order, respectively, to obtain the sorting results within the set number of places; take the two sorting results as the two variables of the Pearson correlation coefficient formula and calculate the correlation coefficient; if the correlation coefficient reaches a set threshold, take the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and, if the correlation coefficient is below the set threshold, retrieve again, by the search engine, at least one question sentence having the set similarity to the question sentence to be matched from the specified database, and calculate the similarity anew.
Embodiment 3
FIG. 3 is a schematic diagram of the logical structure of an electronic device according to Embodiment 3 of the present application.
As shown in FIG. 3, an electronic device 1 includes a memory 3 and a processor 2. The memory 3 stores a computer program 4, and when the computer program 4 is executed by the processor 2, the steps of the semantic matching method based on an internal adversarial mechanism described in Embodiment 1 are implemented.
Embodiment 4
A computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium includes a semantic matching program based on an internal adversarial mechanism; when the semantic matching program based on an internal adversarial mechanism is executed by a processor, the steps of the semantic matching method based on an internal adversarial mechanism described in Embodiment 1 are implemented.
The semantic matching method, system, device, and storage medium based on an internal adversarial mechanism according to the present application have been described above by way of example with reference to FIG. 1, FIG. 2, and FIG. 3. However, those skilled in the art should understand that various improvements can be made to the above method, system, device, and storage medium without departing from the content of the present application. Therefore, the protection scope of the present application should be determined by the content of the appended claims.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can easily conceive within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A semantic matching method based on an internal adversarial mechanism, comprising the following steps:
    S110: performing word segmentation and character segmentation on a question sentence to be matched and on candidate question sentences, respectively;
    S120: performing word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine a sentence-vector feature set of the word-based sentence pair of the question sentence to be matched and the candidate question sentence; and performing character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine a sentence-vector feature set of the character-based sentence pair of the question sentence to be matched and the candidate question sentence; wherein the candidate question sentences are at least one question sentence retrieved from a specified database by a search engine and having a set similarity to the question sentence to be matched;
    S130: splicing the sentence-vector feature set of the word-based sentence pair with the sentence-vector feature set of the character-based sentence pair, and determining the similarity between the candidate question sentence and the question sentence to be matched;
    S140: sorting the similarities between the candidate question sentences and the question sentence to be matched in descending order, and selecting the candidate question sentences whose similarity ranks within a set number of places as similar candidate question sentences;
    S150: performing word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and performing character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched, so as to respectively determine a word-based feature set of the question sentence to be matched, a word-based feature set of the similar candidate question sentence, a character-based feature set of the question sentence to be matched, and a character-based feature set of the similar candidate question sentence; and, after splicing the four determined feature sets, determining the similarity between the similar candidate question sentence and the question sentence to be matched;
    S160: sorting the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, in descending order, respectively, to obtain the sorting results within the set number of places; taking the two sorting results as the two variables of the Pearson correlation coefficient formula and calculating the correlation coefficient; if the correlation coefficient reaches a set threshold, taking the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and, if the correlation coefficient is below the set threshold, retrieving again, by the search engine, at least one question sentence having the set similarity to the question sentence to be matched from the specified database, and performing S120.
  2. The semantic matching method based on an internal adversarial mechanism according to claim 1, wherein, in S110,
    the word segmentation comprises: after removing stop words and special symbols from the question sentence to be matched, performing word segmentation with a deep-learning tokenizer; and after removing stop words and special symbols from the candidate question sentences, performing word segmentation with a deep-learning tokenizer;
    the character segmentation comprises: after removing stop words and special symbols from the question sentence to be matched, performing character segmentation with a deep-learning tokenizer; and after removing stop words and special symbols from the candidate question sentences, performing character segmentation with a deep-learning tokenizer.
  3. The semantic matching method based on an internal adversarial mechanism according to claim 1, wherein, in S120, the process of performing word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the sentence-vector feature set of the word-based sentence pair of the question sentence to be matched and the candidate question sentence, comprises:
    splicing the word vectors obtained by subjecting the word-segmented question sentence to be matched to Pre-train Embedding processing and train Embedding processing, to form a first word-vector matrix; and splicing the word vectors obtained by subjecting the candidate question sentence to Pre-train Embedding processing and train Embedding processing, to form a second word-vector matrix;
    performing feature extraction on the first word-vector matrix and the second word-vector matrix, respectively, to determine a word-based sentence-vector feature set of the question sentence to be matched and a word-based sentence-vector feature set of the candidate question sentence;
    performing dimensionality reduction on the word-based sentence-vector feature set of the question sentence to be matched and on the word-based sentence-vector feature set of the candidate question sentence, respectively;
    splicing the dimension-reduced word-based sentence-vector feature sets of the question sentence to be matched and of the candidate question sentence together, to obtain the sentence-vector feature set of the word-based sentence pair of the question sentence to be matched and the candidate question sentence.
  4. The semantic matching method based on an internal adversarial mechanism according to claim 3, wherein, in S120, the process of performing character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the sentence-vector feature set of the character-based sentence pair of the question sentence to be matched and the candidate question sentence, comprises:
    splicing the character vectors obtained by subjecting the character-segmented question sentence to be matched to Pre-train Embedding processing and train Embedding processing, to form a first character-vector matrix; and splicing the character vectors obtained by subjecting the candidate question sentence to Pre-train Embedding processing and train Embedding processing, to form a second character-vector matrix;
    performing feature extraction on the first character-vector matrix and the second character-vector matrix, respectively, to determine a character-based sentence-vector feature set of the question sentence to be matched and a character-based sentence-vector feature set of the candidate question sentence;
    performing dimensionality reduction on the character-based sentence-vector feature set of the question sentence to be matched and on the character-based sentence-vector feature set of the candidate question sentence, respectively;
    splicing the dimension-reduced character-based sentence-vector feature sets of the question sentence to be matched and of the candidate question sentence together, to obtain the sentence-vector feature set of the character-based sentence pair of the question sentence to be matched and the candidate question sentence.
  5. The semantic matching method based on an internal adversarial mechanism according to claim 1, wherein, in S130,
    the outputs obtained by respectively performing vector subtraction, vector multiplication, and vector maximization on the sentence-vector feature set of the word-based sentence pair, and the outputs obtained by respectively performing vector subtraction, vector multiplication, and vector maximization on the sentence-vector feature set of the character-based sentence pair, are spliced to form a first text-feature-vector set; after dimensionality reduction, the first text-feature-vector set is input into a sigmoid function to determine the similarity between the question sentence to be matched and the candidate question sentence.
  6. The semantic matching method based on an internal adversarial mechanism according to claim 1, wherein, in S150, the process of performing word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and performing character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched, comprises:
    splicing the word vectors obtained by subjecting the word-segmented question sentence to be matched to Pre-train Embedding processing and train Embedding processing, to form a third word-vector matrix; and splicing the word vectors obtained by subjecting the similar candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a fourth word-vector matrix; and
    splicing the character vectors obtained by subjecting the character-segmented question sentence to be matched to Pre-train Embedding processing and train Embedding processing, to form a third character-vector matrix; and splicing the character vectors obtained by subjecting the similar candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a fourth character-vector matrix.
  7. The semantic matching method based on an internal adversarial mechanism according to claim 6, wherein, in S150, the process of respectively determining the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence, and, after splicing the four determined feature sets, determining the similarity between the similar candidate question sentence and the question sentence to be matched, comprises:
    performing feature extraction and dimensionality reduction on the third word-vector matrix, the fourth word-vector matrix, the third character-vector matrix, and the fourth character-vector matrix, respectively, to obtain the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence;
    splicing the four feature sets to form a second text-feature-vector set; and, after dimensionality reduction, inputting the second text-feature-vector set into a sigmoid function to determine the similarity between the question sentence to be matched and the similar candidate question sentence.
  8. A semantic matching system based on an internal adversarial mechanism, comprising:
    a word and character segmentation unit for the question sentence to be matched and the candidate question sentences, configured to perform word segmentation and character segmentation on the question sentence to be matched and the candidate question sentences respectively;
    a sentence vector feature set forming unit for the question sentence to be matched and the candidate question sentences, configured to perform word vectorization on each candidate question sentence after word segmentation and on the question sentence to be matched after word segmentation, to determine a sentence vector feature set of word-based sentence pairs of the question sentence to be matched and the candidate question sentences; and to perform character vectorization on each candidate question sentence after character segmentation and on the question sentence to be matched after character segmentation, to determine a sentence vector feature set of character-based sentence pairs of the question sentence to be matched and the candidate question sentences; wherein the candidate question sentences are at least one question sentence that is retrieved from a specified database through a search engine and has a set similarity to the question sentence to be matched;
    a semantic similarity determination unit for the question sentence to be matched and the candidate question sentences, configured to concatenate the sentence vector feature set of the word-based sentence pairs and the sentence vector feature set of the character-based sentence pairs, to determine the similarity between the candidate question sentences and the question sentence to be matched;
    a similar candidate question sentence determination unit, configured to sort the similarities between the candidate question sentences and the question sentence to be matched in descending order, and select the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences;
    a semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences, configured to perform word vectorization on each similar candidate question sentence and on the question sentence to be matched after word segmentation, and to perform character vectorization on each similar candidate question sentence and on the question sentence to be matched after character segmentation; to separately determine a word-based feature set of the question sentence to be matched, a word-based feature set of the similar candidate question sentences, a character-based feature set of the question sentence to be matched, and a character-based feature set of the similar candidate question sentences; and, after concatenating the four determined feature sets, to determine the similarity between the similar candidate question sentences and the question sentence to be matched;
    a semantic matching result determination unit, configured to sort, in descending order, the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, respectively, and obtain the sorting results within a set ranking; use the two sorting results as the two variables of the Pearson correlation coefficient formula to calculate a correlation coefficient; if the correlation coefficient reaches a set threshold, take the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and if the correlation coefficient is lower than the set threshold, retrieve again, through the search engine, at least one question sentence having a set similarity to the question sentence to be matched from the specified database and calculate the similarity.
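The ranking-consistency check performed by the semantic matching result determination unit can be sketched as follows (illustrative only; the threshold value 0.8 and the example top-5 rankings are hypothetical, since the patent leaves the set threshold and set ranking unspecified):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length rankings."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def rankings_agree(stage1_ranks, stage2_ranks, threshold=0.8):
    """True when the coarse (search-engine) ranking and the fine
    (feature-based) ranking are consistent enough to accept the
    top-1 candidate as the semantic-matching result."""
    return pearson(stage1_ranks, stage2_ranks) >= threshold

# Hypothetical rank positions assigned to five candidates by each stage.
stage1 = [1, 2, 3, 4, 5]
stage2 = [1, 3, 2, 4, 5]
agree = rankings_agree(stage1, stage2)  # r = 0.9 >= 0.8, so True
```

If `rankings_agree` returns `False`, the unit falls back to re-retrieving candidates and recomputing similarities, mirroring the re-retrieval branch of the claim.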
  9. An electronic device, comprising a memory and a processor connected to each other, wherein the memory is configured to store a computer program, the computer program is configured to be executed by the processor, and the computer program is configured to perform a semantic matching method based on an internal adversarial mechanism:
    wherein the method comprises:
    S110: performing word segmentation and character segmentation on a question sentence to be matched and candidate question sentences respectively;
    S120: performing word vectorization on each candidate question sentence after word segmentation and on the question sentence to be matched after word segmentation, to determine a sentence vector feature set of word-based sentence pairs of the question sentence to be matched and the candidate question sentences; and performing character vectorization on each candidate question sentence after character segmentation and on the question sentence to be matched after character segmentation, to determine a sentence vector feature set of character-based sentence pairs of the question sentence to be matched and the candidate question sentences; wherein the candidate question sentences are at least one question sentence that is retrieved from a specified database through a search engine and has a set similarity to the question sentence to be matched;
    S130: concatenating the sentence vector feature set of the word-based sentence pairs and the sentence vector feature set of the character-based sentence pairs to determine the similarity between the candidate question sentences and the question sentence to be matched;
    S140: sorting the similarities between the candidate question sentences and the question sentence to be matched in descending order, and selecting the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences;
    S150: performing word vectorization on each similar candidate question sentence and on the question sentence to be matched after word segmentation, and performing character vectorization on each similar candidate question sentence and on the question sentence to be matched after character segmentation; separately determining a word-based feature set of the question sentence to be matched, a word-based feature set of the similar candidate question sentences, a character-based feature set of the question sentence to be matched, and a character-based feature set of the similar candidate question sentences; and, after concatenating the four determined feature sets, determining the similarity between the similar candidate question sentences and the question sentence to be matched;
    S160: sorting, in descending order, the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, respectively, and obtaining the sorting results within a set ranking; using the two sorting results as the two variables of the Pearson correlation coefficient formula to calculate a correlation coefficient; if the correlation coefficient reaches a set threshold, taking the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and if the correlation coefficient is lower than the set threshold, retrieving again, through the search engine, at least one question sentence having a set similarity to the question sentence to be matched from the specified database and performing S120.
  10. The electronic device according to claim 9, wherein in S110,
    the word segmentation comprises: after removing stop words and special symbols from the question sentence to be matched, performing word segmentation using a deep-learning tokenizer; and after removing stop words and special symbols from the candidate question sentences, performing word segmentation using a deep-learning tokenizer;
    the character segmentation comprises: after removing stop words and special symbols from the question sentence to be matched, performing character segmentation using a deep-learning tokenizer; and after removing stop words and special symbols from the candidate question sentences, performing character segmentation using a deep-learning tokenizer.
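The pre-processing and two-granularity segmentation of S110 might be sketched as follows (an illustrative Python sketch; the stop-word list and the symbol-stripping regular expression are hypothetical, and the pluggable `tokenizer` callable stands in for the unspecified deep-learning tokenizer):

```python
import re

STOP_WORDS = {"的", "了", "吗", "呢"}  # hypothetical stop-word list

def clean(sentence):
    """Remove special symbols: keep only word characters and CJK
    ideographs, per the de-noising step of S110."""
    return re.sub(r"[^\w\u4e00-\u9fff]", "", sentence)

def char_segment(sentence):
    """Character-level segmentation: one token per character,
    with stop words dropped."""
    return [c for c in clean(sentence) if c not in STOP_WORDS]

def word_segment(sentence, tokenizer):
    """Word-level segmentation via a pluggable tokenizer callable
    (standing in for the deep-learning tokenizer of the claim)."""
    return [w for w in tokenizer(clean(sentence)) if w not in STOP_WORDS]

chars = char_segment("什么是语义匹配？")  # fullwidth '？' is stripped
```

Both granularities share the same cleaning pass, so the word-based and character-based pipelines downstream see consistent text.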
  11. The electronic device according to claim 9, wherein in S120, the process of performing word vectorization on each candidate question sentence after word segmentation and on the question sentence to be matched after word segmentation, to determine the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences, comprises:
    concatenating the word vectors obtained by separately subjecting the question sentence to be matched after word segmentation to Pre-train Embedding processing and train Embedding processing, to form a first word vector matrix; and concatenating the word vectors obtained by separately subjecting the candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a second word vector matrix;
    performing feature extraction on the first word vector matrix and the second word vector matrix respectively, to determine a word-based sentence vector feature set of the question sentence to be matched and a word-based sentence vector feature set of the candidate question sentences;
    performing dimensionality reduction on the word-based sentence vector feature set of the question sentence to be matched and on the word-based sentence vector feature set of the candidate question sentences respectively;
    concatenating the dimension-reduced word-based sentence vector feature set of the question sentence to be matched with the dimension-reduced word-based sentence vector feature set of the candidate question sentences, to obtain the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences.
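The dual-channel embedding and pooling pipeline of claim 11 can be sketched as follows (illustrative only; `DIM`, the randomly initialized embedding tables, and mean-pooling as "feature extraction" plus a linear projection as "dimensionality reduction" are hypothetical simplifications of the learned layers):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hypothetical per-channel embedding size

def dual_channel_matrix(tokens, pretrained, trainable):
    """Per claim 11: for each token, concatenate its Pre-train
    Embedding vector with its train Embedding vector, yielding an
    (n_tokens, 2*DIM) word vector matrix."""
    rows = []
    for t in tokens:
        pre = pretrained.setdefault(t, rng.standard_normal(DIM))
        trn = trainable.setdefault(t, rng.standard_normal(DIM))
        rows.append(np.concatenate([pre, trn]))
    return np.stack(rows)

def sentence_feature(matrix, w):
    """Toy 'feature extraction + dimensionality reduction':
    mean-pool over tokens, then apply a linear projection."""
    return matrix.mean(axis=0) @ w

pretrained, trainable = {}, {}
w = rng.standard_normal((2 * DIM, 4))  # reduce to a 4-D sentence feature
q = dual_channel_matrix(["什么", "是", "语义"], pretrained, trainable)
c = dual_channel_matrix(["语义", "匹配"], pretrained, trainable)
# Sentence-pair feature set: query features next to candidate features.
pair_features = np.concatenate([sentence_feature(q, w), sentence_feature(c, w)])
```

Because the two sentences share the same embedding tables, a token that occurs in both (here "语义") contributes identical row content to both matrices, which is what makes the pair features comparable.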
  12. The electronic device according to claim 11, wherein in S120, the process of performing character vectorization on each candidate question sentence after character segmentation and on the question sentence to be matched after character segmentation, to determine the sentence vector feature set of the character-based sentence pairs of the question sentence to be matched and the candidate question sentences, comprises:
    concatenating the character vectors obtained by separately subjecting the question sentence to be matched after character segmentation to Pre-train Embedding processing and train Embedding processing, to form a first character vector matrix; and concatenating the character vectors obtained by separately subjecting the candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a second character vector matrix;
    performing feature extraction on the first character vector matrix and the second character vector matrix respectively, to determine a character-based sentence vector feature set of the question sentence to be matched and a character-based sentence vector feature set of the candidate question sentences;
    performing dimensionality reduction on the character-based sentence vector feature set of the question sentence to be matched and on the character-based sentence vector feature set of the candidate question sentences respectively;
    concatenating the dimension-reduced character-based sentence vector feature set of the question sentence to be matched with the dimension-reduced character-based sentence vector feature set of the candidate question sentences, to obtain the sentence vector feature set of the character-based sentence pairs of the question sentence to be matched and the candidate question sentences.
  13. The electronic device according to claim 9, wherein in S130,
    the outputs obtained by separately applying vector subtraction, vector multiplication, and vector maximization to the sentence vector feature set of the word-based sentence pairs, and the outputs obtained by separately applying vector subtraction, vector multiplication, and vector maximization to the sentence vector feature set of the character-based sentence pairs, are concatenated to form a first text feature vector set; after dimensionality reduction is performed on the first text feature vector set, the result is input into a sigmoid function to determine the similarity between the question sentence to be matched and the candidate question sentences.
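The three interaction operations of S130 can be sketched as follows (illustrative; the 2-dimensional example vectors are hypothetical, and elementwise subtraction, multiplication, and maximum are assumed, as is common for such matching features):

```python
import numpy as np

def interaction_features(a, b):
    """Per S130: elementwise subtraction, multiplication, and
    maximization of the two sentence vectors, concatenated."""
    return np.concatenate([a - b, a * b, np.maximum(a, b)])

def first_text_feature_set(word_a, word_b, char_a, char_b):
    """Concatenate the word-level and character-level interaction
    outputs into the first text feature vector set."""
    return np.concatenate([interaction_features(word_a, word_b),
                           interaction_features(char_a, char_b)])

# Hypothetical word-based and character-based sentence vectors.
wa, wb = np.array([1.0, 2.0]), np.array([3.0, 1.0])
ca, cb = np.array([0.5, -1.0]), np.array([0.0, 2.0])
fused = first_text_feature_set(wa, wb, ca, cb)  # 12-dimensional
```

The fused vector would then pass through the dimensionality reduction and sigmoid of S130 to yield the similarity score.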
  14. The electronic device according to claim 9, wherein in S150, the process of performing word vectorization on each similar candidate question sentence and on the question sentence to be matched after word segmentation, and performing character vectorization on each similar candidate question sentence and on the question sentence to be matched after character segmentation, comprises:
    concatenating the word vectors obtained by separately subjecting the question sentence to be matched after word segmentation to Pre-train Embedding processing and train Embedding processing, to form a third word vector matrix; and concatenating the word vectors obtained by separately subjecting the similar candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a fourth word vector matrix; and
    concatenating the character vectors obtained by separately subjecting the question sentence to be matched after character segmentation to Pre-train Embedding processing and train Embedding processing, to form a third character vector matrix; and concatenating the character vectors obtained by separately subjecting the similar candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a fourth character vector matrix.
  15. The electronic device according to claim 14, wherein in S150, the process of separately determining the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentences, and determining the similarity between the similar candidate question sentences and the question sentence to be matched after concatenating the four determined feature sets, comprises:
    performing feature extraction and dimensionality reduction on the third word vector matrix, the fourth word vector matrix, the third character vector matrix, and the fourth character vector matrix respectively, to obtain the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentences;
    concatenating the four feature sets to form a second text feature vector set, performing dimensionality reduction on the second text feature vector set, and then inputting the result into a sigmoid function to determine the similarity between the question sentence to be matched and the similar candidate question sentences.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements a semantic matching method based on an internal adversarial mechanism, the method comprising the following steps:
    S110: performing word segmentation and character segmentation on a question sentence to be matched and candidate question sentences respectively;
    S120: performing word vectorization on each candidate question sentence after word segmentation and on the question sentence to be matched after word segmentation, to determine a sentence vector feature set of word-based sentence pairs of the question sentence to be matched and the candidate question sentences; and performing character vectorization on each candidate question sentence after character segmentation and on the question sentence to be matched after character segmentation, to determine a sentence vector feature set of character-based sentence pairs of the question sentence to be matched and the candidate question sentences; wherein the candidate question sentences are at least one question sentence that is retrieved from a specified database through a search engine and has a set similarity to the question sentence to be matched;
    S130: concatenating the sentence vector feature set of the word-based sentence pairs and the sentence vector feature set of the character-based sentence pairs to determine the similarity between the candidate question sentences and the question sentence to be matched;
    S140: sorting the similarities between the candidate question sentences and the question sentence to be matched in descending order, and selecting the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences;
    S150: performing word vectorization on each similar candidate question sentence and on the question sentence to be matched after word segmentation, and performing character vectorization on each similar candidate question sentence and on the question sentence to be matched after character segmentation; separately determining a word-based feature set of the question sentence to be matched, a word-based feature set of the similar candidate question sentences, a character-based feature set of the question sentence to be matched, and a character-based feature set of the similar candidate question sentences; and, after concatenating the four determined feature sets, determining the similarity between the similar candidate question sentences and the question sentence to be matched;
    S160: sorting, in descending order, the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, respectively, and obtaining the sorting results within a set ranking; using the two sorting results as the two variables of the Pearson correlation coefficient formula to calculate a correlation coefficient; if the correlation coefficient reaches a set threshold, taking the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and if the correlation coefficient is lower than the set threshold, retrieving again, through the search engine, at least one question sentence having a set similarity to the question sentence to be matched from the specified database and performing S120.
  17. The computer-readable storage medium according to claim 16, wherein in S110,
    the word segmentation comprises: after removing stop words and special symbols from the question sentence to be matched, performing word segmentation using a deep-learning tokenizer; and after removing stop words and special symbols from the candidate question sentences, performing word segmentation using a deep-learning tokenizer;
    the character segmentation comprises: after removing stop words and special symbols from the question sentence to be matched, performing character segmentation using a deep-learning tokenizer; and after removing stop words and special symbols from the candidate question sentences, performing character segmentation using a deep-learning tokenizer.
  18. The computer-readable storage medium according to claim 16, wherein in S120, the process of performing word vectorization on each candidate question sentence after word segmentation and on the question sentence to be matched after word segmentation, to determine the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences, comprises:
    concatenating the word vectors obtained by separately subjecting the question sentence to be matched after word segmentation to Pre-train Embedding processing and train Embedding processing, to form a first word vector matrix; and concatenating the word vectors obtained by separately subjecting the candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a second word vector matrix;
    performing feature extraction on the first word vector matrix and the second word vector matrix respectively, to determine a word-based sentence vector feature set of the question sentence to be matched and a word-based sentence vector feature set of the candidate question sentences;
    performing dimensionality reduction on the word-based sentence vector feature set of the question sentence to be matched and on the word-based sentence vector feature set of the candidate question sentences respectively;
    concatenating the dimension-reduced word-based sentence vector feature set of the question sentence to be matched with the dimension-reduced word-based sentence vector feature set of the candidate question sentences, to obtain the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences.
  19. The computer-readable storage medium according to claim 18, wherein in S120, the process of performing character vectorization on each candidate question sentence after character segmentation and on the question sentence to be matched after character segmentation, to determine the sentence vector feature set of the character-based sentence pairs of the question sentence to be matched and the candidate question sentences, comprises:
    concatenating the character vectors obtained by separately subjecting the question sentence to be matched after character segmentation to Pre-train Embedding processing and train Embedding processing, to form a first character vector matrix; and concatenating the character vectors obtained by separately subjecting the candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a second character vector matrix;
    performing feature extraction on the first character vector matrix and the second character vector matrix respectively, to determine a character-based sentence vector feature set of the question sentence to be matched and a character-based sentence vector feature set of the candidate question sentences;
    performing dimensionality reduction on the character-based sentence vector feature set of the question sentence to be matched and on the character-based sentence vector feature set of the candidate question sentences respectively;
    concatenating the dimension-reduced character-based sentence vector feature set of the question sentence to be matched with the dimension-reduced character-based sentence vector feature set of the candidate question sentences, to obtain the sentence vector feature set of the character-based sentence pairs of the question sentence to be matched and the candidate question sentences.
  20. The computer-readable storage medium according to claim 16, wherein in S130,
    the outputs obtained by separately applying vector subtraction, vector multiplication, and vector maximization to the sentence vector feature set of the word-based sentence pairs, and the outputs obtained by separately applying vector subtraction, vector multiplication, and vector maximization to the sentence vector feature set of the character-based sentence pairs, are concatenated to form a first text feature vector set; after dimensionality reduction is performed on the first text feature vector set, the result is input into a sigmoid function to determine the similarity between the question sentence to be matched and the candidate question sentences.
PCT/CN2020/117422 2020-02-26 2020-09-24 Semantic matching method and device based on internal adversarial mechanism, and storage medium WO2021169263A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010119430.0A CN111427995B (en) 2020-02-26 2020-02-26 Semantic matching method, device and storage medium based on internal countermeasure mechanism
CN202010119430.0 2020-02-26

Publications (1)

Publication Number Publication Date
WO2021169263A1 2021-09-02

Family

ID=71547247

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117422 WO2021169263A1 (en) 2020-02-26 2020-09-24 Semantic matching method and device based on internal adversarial mechanism, and storage medium

Country Status (2)

Country Link
CN (1) CN111427995B (en)
WO (1) WO2021169263A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114048354A (en) * 2022-01-10 2022-02-15 广州启辰电子科技有限公司 Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN116312968A (en) * 2023-02-09 2023-06-23 广东德澳智慧医疗科技有限公司 Psychological consultation and healing system based on man-machine conversation and core algorithm

Families Citing this family (11)

Publication number Priority date Publication date Assignee Title
CN111427995B (en) * 2020-02-26 2023-05-26 平安科技(深圳)有限公司 Semantic matching method, device and storage medium based on internal countermeasure mechanism
CN111859986B (en) * 2020-07-27 2023-06-20 中国平安人寿保险股份有限公司 Semantic matching method, device, equipment and medium based on multi-task twin network
CN112149424A (en) * 2020-08-10 2020-12-29 招联消费金融有限公司 Semantic matching method and device, computer equipment and storage medium
CN112287069B (en) * 2020-10-29 2023-07-25 平安科技(深圳)有限公司 Information retrieval method and device based on voice semantics and computer equipment
CN113204973A (en) * 2021-04-30 2021-08-03 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of answer-questions recognition model
CN112991346B (en) * 2021-05-13 2022-04-26 深圳科亚医疗科技有限公司 Training method and training system for learning network for medical image analysis
CN113407767A (en) * 2021-06-29 2021-09-17 北京字节跳动网络技术有限公司 Method and device for determining text relevance, readable medium and electronic equipment
CN113656547B (en) * 2021-08-17 2023-06-30 平安科技(深圳)有限公司 Text matching method, device, equipment and storage medium
CN114090747A (en) * 2021-10-14 2022-02-25 特斯联科技集团有限公司 Automatic question answering method, device, equipment and medium based on multiple semantic matching
CN113988073A (en) * 2021-10-26 2022-01-28 迪普佰奥生物科技(上海)股份有限公司 Text recognition method and system suitable for life science
CN116361839B (en) * 2023-05-26 2023-07-28 四川易景智能终端有限公司 Secret-related shielding method based on NLP

Citations (7)

Publication number Priority date Publication date Assignee Title
US20050164152A1 (en) * 2004-01-28 2005-07-28 Lawson James D. Compatibility assessment method
CN105893523A (en) * 2016-03-31 2016-08-24 华东师范大学 Method for calculating problem similarity with answer relevance ranking evaluation measurement
CN107980130A (en) * 2017-11-02 2018-05-01 深圳前海达闼云端智能科技有限公司 Automatic answering method, apparatus, storage medium and electronic device
CN108345585A (en) * 2018-01-11 2018-07-31 浙江大学 Automatic question-answering method based on deep learning
CN109271505A (en) * 2018-11-12 2019-01-25 深圳智能思创科技有限公司 Question-answering system implementation method based on question-answer pairs
CN110032632A (en) * 2019-04-04 2019-07-19 平安科技(深圳)有限公司 Intelligent customer service answering method, device and storage medium based on text similarity
CN111427995A (en) * 2020-02-26 2020-07-17 平安科技(深圳)有限公司 Semantic matching method and device based on internal countermeasure mechanism and storage medium

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN107291783B (en) * 2016-04-12 2021-04-30 芋头科技(杭州)有限公司 Semantic matching method and intelligent equipment
CN109101494A (en) * 2018-08-10 2018-12-28 哈尔滨工业大学(威海) Method, device and computer-readable storage medium for computing Chinese sentence semantic similarity
CN109948143B (en) * 2019-01-25 2023-04-07 网经科技(苏州)有限公司 Answer extraction method of community question-answering system
CN109992788B (en) * 2019-04-10 2023-08-29 鼎富智能科技有限公司 Deep text matching method and device based on unregistered word processing

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN114048354A (en) * 2022-01-10 2022-02-15 广州启辰电子科技有限公司 Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN114048354B (en) * 2022-01-10 2022-04-26 广州启辰电子科技有限公司 Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN116312968A (en) * 2023-02-09 2023-06-23 广东德澳智慧医疗科技有限公司 Psychological consultation and healing system based on man-machine conversation and core algorithm

Also Published As

Publication number Publication date
CN111427995B (en) 2023-05-26
CN111427995A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
WO2021169263A1 (en) Semantic matching method and device based on internal adversarial mechanism, and storage medium
CN111415740B (en) Method and device for processing inquiry information, storage medium and computer equipment
WO2022227207A1 (en) Text classification method, apparatus, computer device, and storage medium
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN111709243B (en) Knowledge extraction method and device based on deep learning
CN109635083B (en) Document retrieval method for searching topic type query in TED (tele) lecture
US11874862B2 (en) Community question-answer website answer sorting method and system combined with active learning
CN116134432A (en) System and method for providing answers to queries
CN107832326B (en) Natural language question-answering method based on deep convolutional neural network
CN111737426A (en) Method for training question-answering model, computer equipment and readable storage medium
US11860932B2 (en) Scene graph embeddings using relative similarity supervision
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN112925918B (en) Question-answer matching system based on disease field knowledge graph
US20220284321A1 (en) Visual-semantic representation learning via multi-modal contrastive training
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN113221530A (en) Text similarity matching method and device based on circle loss, computer equipment and storage medium
CN112632250A (en) Question and answer method and system under multi-document scene
CN113673252A (en) Automatic join recommendation method for data table based on field semantics
Li Text recognition and classification of English teaching content based on SVM
US10970488B2 (en) Finding of asymmetric relation between words
WO2022061877A1 (en) Event extraction and extraction model training method, apparatus and device, and medium
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN110674293B (en) Text classification method based on semantic migration
CN112836027A (en) Method for determining text similarity, question answering method and question answering system
Li et al. Otcmr: Bridging heterogeneity gap with optimal transport for cross-modal retrieval

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20922386

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20922386

Country of ref document: EP

Kind code of ref document: A1