WO2021169263A1 - Semantic matching method and device based on internal adversarial mechanism, and storage medium - Google Patents

Semantic matching method and device based on internal adversarial mechanism, and storage medium

Info

Publication number
WO2021169263A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
sentence
question sentence
matched
vector
Prior art date
Application number
PCT/CN2020/117422
Other languages
French (fr)
Chinese (zh)
Inventor
骆迅
王科强
郝新东
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021169263A1 publication Critical patent/WO2021169263A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of artificial intelligence technology, and more specifically to a semantic matching method based on an internal adversarial mechanism.
  • NLP: Natural Language Processing
  • The core module of a patient-education question-answering system is the semantic recall module.
  • Its main function is to search the answer database, based on the patient's question, for the answer closest to the patient's request and return it. The performance of the patient-education question-answering system therefore depends mainly on the accuracy of the semantic recall module.
  • Most semantic recall modules are based on deep learning networks, such as CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory), ESIM (Enhanced LSTM), Decomposable Attention, and Multihead (multi-head attention) networks.
  • CNN: Convolutional Neural Network
  • LSTM: Long Short-Term Memory
  • ESIM: Enhanced LSTM (enhanced long short-term memory network)
  • Decomposable Attention: decomposable attention mechanism network
  • Multihead: multi-head attention mechanism network
  • The purpose of this application is to provide a semantic matching method based on an internal adversarial mechanism.
  • A value evaluation network is added. This network evaluates the result of each question recall and feeds it back to the semantic matching network (that is, the question recall module, built with a deep learning model) as new training data; the semantic matching network is retrained and its output is again scored by the value evaluation network, until the evaluation score reaches a threshold, at which point the adversarial process terminates. This improves the robustness and transfer-learning performance of the semantic matching system, as well as the quality and accuracy of question recall.
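The alternating loop between the two networks described above might be sketched as follows; `internal_adversarial_training`, `mock_train_step`, and `mock_evaluate` are hypothetical names, and the mock functions merely simulate a recall module whose quality improves by a fixed amount each round.

```python
def internal_adversarial_training(train_step, evaluate, score_threshold, max_rounds=10):
    """Alternate between retraining the semantic recall network and scoring
    its recall results with the value evaluation network; stop once the
    evaluation score reaches the threshold (the adversarial process ends)."""
    score = 0.0
    history = []
    for _ in range(max_rounds):
        recall_output = train_step()     # retrain recall module on the fed-back data
        score = evaluate(recall_output)  # value network scores the recall results
        history.append(score)
        if score >= score_threshold:
            break
    return score, history

# Toy simulation: each retraining round raises the (mock) quality by 0.25.
state = {"quality": 0.0}

def mock_train_step():
    state["quality"] += 0.25
    return state["quality"]

def mock_evaluate(output):
    return output

final_score, scores = internal_adversarial_training(mock_train_step, mock_evaluate, 0.7)
```

With these mock functions the loop stops after the third round, when the score first reaches the 0.7 threshold.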
  • A semantic matching method based on an internal adversarial mechanism includes the following steps:
  • S110: Perform word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively;
  • S120: Perform word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the word-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; and perform character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the character-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; where the candidate question sentences are at least one question sentence, retrieved from a specified database through a search engine, that has a set similarity to the question sentence to be matched;
  • S130: Concatenate the word-based sentence-pair vector feature set and the character-based sentence-pair vector feature set to determine the similarity between each candidate question sentence and the question sentence to be matched;
  • S140: Sort the similarities between the candidate question sentences and the question sentence to be matched from high to low, and select the candidate question sentences whose similarity falls within a set ranking as the similar candidate question sentences;
  • S150: Perform word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched, to determine, respectively, the word-based feature set of the question sentence to be matched, the word-based feature set of each similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of each similar candidate question sentence; after the four feature sets are concatenated, determine the similarity between each similar candidate question sentence and the question sentence to be matched;
  • S160: Sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each from high to low, to obtain the two ranking results within the set ranking; use the two ranking results as the two variables of the Pearson correlation coefficient formula to compute the correlation coefficient; if the correlation coefficient reaches the set threshold, take the candidate question sentence ranked first by similarity to the question sentence to be matched as the semantic matching result; if the correlation coefficient is below the set threshold, use the search engine to retrieve from the specified database, again, at least one question sentence with a set similarity to the question sentence to be matched, and return to S120.
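The Pearson correlation check in S160 can be sketched as below; treating the two top-k rank positions of the same candidate sentences as the two variables is an assumption about how the two ranking results are compared.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Rank positions of the same five candidate sentences under the two scores.
coarse_ranks = [1, 2, 3, 4, 5]  # ranking from the recall-stage similarity (S130)
fine_ranks   = [1, 3, 2, 4, 5]  # ranking from the refined similarity (S150)
r = pearson(coarse_ranks, fine_ranks)
```

Here the two rankings agree except for one swapped pair, giving a coefficient of 0.9; when the coefficient falls below the set threshold, the method retrieves fresh candidates and repeats from S120.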
  • A semantic matching system based on an internal adversarial mechanism includes:
  • a segmentation unit, used to perform word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively;
  • a sentence-vector feature set formation unit, used to perform word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the word-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; and to perform character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the character-based sentence-pair vector feature set; where the candidate question sentences are at least one question sentence, retrieved from a specified database through a search engine,
  • that has a set similarity to the question sentence to be matched;
  • a semantic similarity determination unit for the question sentence to be matched and the candidate question sentences, used to concatenate the word-based and character-based sentence-pair vector feature sets to determine the similarity between each candidate question sentence and the question sentence to be matched;
  • a similar-candidate determination unit, used to sort the similarities between the candidate question sentences and the question sentence to be matched from high to low, and to select the candidate question sentences whose similarity falls within a set ranking as the similar candidate question sentences;
  • a semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences, used to perform word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched,
  • and character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched; to determine, respectively, the word-based feature set of the question sentence to be matched, the word-based feature set of each similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of each similar candidate question sentence; and, after the four feature sets are concatenated, to determine the similarity between each similar candidate question sentence and the question sentence to be matched;
  • a semantic matching result determination unit, used to sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each from high to low, to obtain the two ranking results within the set ranking; to use the two ranking results as the two variables of the Pearson correlation coefficient formula and compute the correlation coefficient; and, if the correlation coefficient reaches the set threshold, to take the candidate question sentence ranked first by similarity to the question sentence to be matched as the semantic matching result, or, if the correlation coefficient is below the set threshold, to use the search engine again to retrieve from the specified database at least one question sentence with a set similarity to the question sentence to be matched and recompute the similarity.
  • An electronic device includes a memory and a processor, with a computer program stored in the memory.
  • When the computer program is executed by the processor, the following steps of the semantic matching method based on an internal adversarial mechanism are implemented:
  • S110: Perform word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively;
  • S120: Perform word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the word-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; and perform character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the character-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; where the candidate question sentences are at least one question sentence, retrieved from a specified database through a search engine, that has a set similarity to the question sentence to be matched;
  • S130: Concatenate the word-based sentence-pair vector feature set and the character-based sentence-pair vector feature set to determine the similarity between each candidate question sentence and the question sentence to be matched;
  • S140: Sort the similarities between the candidate question sentences and the question sentence to be matched from high to low, and select the candidate question sentences whose similarity falls within a set ranking as the similar candidate question sentences;
  • S150: Perform word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched, to determine, respectively, the word-based feature set of the question sentence to be matched, the word-based feature set of each similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of each similar candidate question sentence; after the four feature sets are concatenated, determine the similarity between each similar candidate question sentence and the question sentence to be matched;
  • S160: Sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each from high to low, to obtain the two ranking results within the set ranking; use the two ranking results as the two variables of the Pearson correlation coefficient formula to compute the correlation coefficient; if the correlation coefficient reaches the set threshold, take the candidate question sentence ranked first by similarity to the question sentence to be matched as the semantic matching result; if the correlation coefficient is below the set threshold, use the search engine to retrieve from the specified database, again, at least one question sentence with a set similarity to the question sentence to be matched, and return to S120.
  • A computer-readable storage medium contains a semantic matching program based on an internal adversarial mechanism. When this program is executed by a processor, the following steps of the semantic matching method based on an internal adversarial mechanism are implemented:
  • S110: Perform word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively;
  • S120: Perform word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the word-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; and perform character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the character-based sentence-pair vector feature set of the question sentence to be matched and the candidate question sentence; where the candidate question sentences are at least one question sentence, retrieved from a specified database through a search engine, that has a set similarity to the question sentence to be matched;
  • S130: Concatenate the word-based sentence-pair vector feature set and the character-based sentence-pair vector feature set to determine the similarity between each candidate question sentence and the question sentence to be matched;
  • S140: Sort the similarities between the candidate question sentences and the question sentence to be matched from high to low, and select the candidate question sentences whose similarity falls within a set ranking as the similar candidate question sentences;
  • S150: Perform word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched, to determine, respectively, the word-based feature set of the question sentence to be matched, the word-based feature set of each similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of each similar candidate question sentence; after the four feature sets are concatenated, determine the similarity between each similar candidate question sentence and the question sentence to be matched;
  • S160: Sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each from high to low, to obtain the two ranking results within the set ranking; use the two ranking results as the two variables of the Pearson correlation coefficient formula to compute the correlation coefficient; if the correlation coefficient reaches the set threshold, take the candidate question sentence ranked first by similarity to the question sentence to be matched as the semantic matching result; if the correlation coefficient is below the set threshold, use the search engine to retrieve from the specified database, again, at least one question sentence with a set similarity to the question sentence to be matched, and return to S120.
  • An adversarial mechanism is formed between the two networks, the semantic recall network and the value evaluation network, which better evaluates the similarity between the candidate question sentences and the user's question without affecting efficiency,
  • improves the accuracy and precision of the question recall module, and pushes higher-quality answers to users.
  • Both the word segmentation and the character segmentation are fed into the neural network for training at the same time, which improves matching accuracy.
  • FIG. 1 is a flowchart of the semantic matching method based on an internal adversarial mechanism according to Embodiment 1 of the present application;
  • FIG. 2 is a schematic diagram of the logical structure of the semantic matching system based on an internal adversarial mechanism according to Embodiment 2 of the present application;
  • FIG. 3 is a schematic diagram of the logical structure of the electronic device according to Embodiment 3 of the present application.
  • FIG. 1 is a flowchart of the semantic matching method based on an internal adversarial mechanism according to Embodiment 1 of the present application.
  • A semantic matching method based on an internal adversarial mechanism includes the following steps:
  • S110: Perform word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively.
  • In step S110, the word segmentation process includes: removing stop words and special symbols from the question sentence to be matched and then segmenting it into words with a deep-learning tokenizer; and removing stop words and special symbols from each candidate question sentence and then segmenting it into words with the same tokenizer.
  • Tokenizer: deep-learning tokenizer
  • For example, if the question sentence to be matched is "What does diabetes eat?", after word segmentation it becomes "diabetes / eat / what".
  • The character segmentation process includes: removing stop words and special symbols from the question sentence to be matched and then segmenting it into characters with the deep-learning tokenizer; and removing stop words and special symbols from each candidate question sentence and then segmenting it into characters.
  • For example, if the question sentence to be matched is "What does diabetes eat?", after character segmentation each Chinese character becomes its own segment, rendered literally as "sugar / urine / disease / eat / what", since the Chinese word for "diabetes" splits into three characters.
  • Stop words are words or characters that, in information retrieval, are automatically filtered out before or after processing natural-language text in order to save storage space and improve search efficiency. Stop words mainly include English characters, numbers, mathematical characters, punctuation marks, extremely frequent single Chinese characters, and the like. Special characters are infrequently used symbols that are difficult to input directly, such as mathematical symbols, unit symbols, and tab characters, beyond traditional or commonly used symbols. Removing stop words and special symbols makes the sentence to be matched more concise and improves the efficiency of semantic matching.
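The two segmentation granularities of S110 can be illustrated as follows. This is only a sketch: a deployed system would use a trained deep-learning tokenizer for word segmentation, and the `STOP_WORDS` and `SPECIAL_SYMBOLS` sets, the greedy longest-match `word_segment` strategy, and the tiny vocabulary are all hypothetical placeholders.

```python
STOP_WORDS = {"的", "了", "吗"}       # hypothetical stop-word list
SPECIAL_SYMBOLS = set("?？!！,，。")   # hypothetical special-symbol list

def clean(sentence):
    """Remove stop words and special symbols before segmentation."""
    return "".join(ch for ch in sentence
                   if ch not in STOP_WORDS and ch not in SPECIAL_SYMBOLS)

def char_segment(sentence):
    """Character segmentation: every character becomes its own segment."""
    return list(clean(sentence))

def word_segment(sentence, vocabulary):
    """Toy greedy longest-match word segmentation against a vocabulary;
    a real system would use a trained tokenizer instead."""
    text = clean(sentence)
    words, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocabulary:
                words.append(piece)
                i += length
                break
    return words

vocab = {"糖尿病", "什么"}
question = "糖尿病吃什么?"           # "What does diabetes eat?"
words = word_segment(question, vocab)  # word-level segments
chars = char_segment(question)         # character-level segments
```

On this example the word branch yields "diabetes / eat / what" while the character branch splits the same sentence into its six individual characters, matching the two examples in the text.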
  • S120: In the embedding layer of a pre-established semantic recall network, perform word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the word-based sentence-pair vector feature set of the question sentence to be matched and each candidate question sentence; and perform character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the question sentence to be matched and the candidate question sentence's
  • character-based sentence-pair vector feature set. A candidate question sentence is a question sentence similar to the question sentence to be matched.
  • Specifically, the candidate question sentences are at least one question sentence, retrieved from a designated database through es (Elasticsearch, a search engine), that has a set similarity to the question sentence to be matched.
  • The number of retrieved candidates may be 128 or more.
  • For example, if the question sentence to be matched is "What does diabetes eat?", es retrieves 128 candidate question sentences such as "Definition of diabetes" and "How to exercise with diabetes".
  • A large number of question sentences, collected in advance and potentially related to candidate questions, are stored in the designated database; these question sentences can also be stored in word-segmented and character-segmented form to facilitate matching queries.
  • The pre-established semantic recall network includes an embedding (vectorization) layer, a convolutional layer, and a pooling layer.
  • The embedding layer includes a Pre-train Embedding (pre-trained vectorization) layer and a train Embedding (trainable vectorization) layer.
  • The question sentence to be matched and each candidate question sentence retrieved by es are input into the semantic recall network for matching, and the word-based sentence-pair vector feature set of the question sentence to be matched and each candidate question sentence is computed.
  • The specific process is as follows:
  • Pre-train Embedding and train Embedding are applied to the word-segmented question sentence to be matched, and the word vectors output by the Pre-train Embedding layer and by the train Embedding layer are concatenated to form the first word-vector matrix.
  • Pre-train Embedding and train Embedding are likewise applied to each word-segmented candidate question sentence, and the two sets of output word vectors are concatenated to form the second word-vector matrix.
  • The Pre-train Embedding dimension and the train Embedding dimension can both be set to 300.
  • Each word in the two resulting word-vector matrices is then represented by a 600-dimensional vector, which describes the word more precisely.
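The concatenation of the two 300-dimensional embeddings into one 600-dimensional vector per token can be sketched as below. The random lookup tables are stand-ins: in practice one table would hold frozen pre-trained vectors and the other would be learned during training.

```python
import random

DIM = 300
random.seed(0)

def make_table(tokens, dim):
    """Stand-in embedding table; a real system loads pre-trained vectors
    for one table and learns the other during training."""
    return {t: [random.uniform(-1, 1) for _ in range(dim)] for t in tokens}

tokens = ["diabetes", "eat", "what"]
pretrain_emb = make_table(tokens, DIM)  # plays the frozen Pre-train Embedding
train_emb = make_table(tokens, DIM)     # plays the trainable train Embedding

def embed(sentence_tokens):
    """Concatenate the two 300-d vectors token by token into 600-d rows."""
    return [pretrain_emb[t] + train_emb[t] for t in sentence_tokens]

matrix = embed(tokens)  # (num_tokens x 600) word-vector matrix
```

The resulting matrix has one 600-dimensional row per token, combining the general-purpose pre-trained representation with the task-specific trainable one.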
  • The first word-vector matrix is input into the convolutional layer for feature extraction, producing the word-based sentence vector feature set of the question sentence to be matched; this feature set is input into the pooling layer, where dimensionality reduction is performed and irrelevant data are discarded to prevent overfitting.
  • The second word-vector matrix is input into the convolutional layer for feature extraction, producing the word-based sentence vector feature set of the candidate question sentence.
  • This feature set is input into the pooling layer, where dimensionality reduction is performed and irrelevant data are discarded to prevent overfitting.
  • The word-based sentence vector feature set of the question sentence to be matched and the word-based sentence vector feature set of the candidate question sentence, both output after dimensionality reduction by the pooling layer, are concatenated to obtain the word-based sentence-pair vector feature set.
  • Similarly, the question sentence to be matched and each candidate question sentence retrieved by es are input into the semantic recall network, and the character-based sentence-pair vector feature set of the question sentence to be matched and each candidate question sentence is computed.
  • The process is as follows:
  • The character-segmented question sentence to be matched undergoes Pre-train Embedding and train Embedding, and the character vectors output by the two layers are concatenated to form the first character-vector matrix.
  • Pre-train Embedding and train Embedding are likewise applied to each character-segmented candidate question sentence, and the two sets of output character vectors are concatenated to form the second character-vector matrix.
  • Each character in the two resulting character-vector matrices is represented by a 600-dimensional vector, which describes the character more finely.
  • The first character-vector matrix is input into the convolutional layer for feature extraction, producing the character-based sentence vector feature set of the question sentence to be matched; this feature set is input into the pooling layer, where dimensionality reduction is performed and irrelevant data are discarded to prevent overfitting.
  • The second character-vector matrix is input into the convolutional layer for feature extraction, producing the character-based sentence vector feature set of the candidate question sentence; this feature set is input into the pooling layer, where dimensionality reduction is performed and irrelevant data are discarded to prevent overfitting.
  • The character-based sentence vector feature set of the question sentence to be matched and the character-based sentence vector feature set of the candidate question sentence, both output after dimensionality reduction by the pooling layer, are concatenated to obtain the character-based sentence-pair vector feature set.
  • The convolutional layer may include three convolutional neural networks, whose kernel sizes are 1, 2, and 3, respectively, and whose filter counts are 256, 192, and 128, respectively.
  • The word-vector matrix and the character-vector matrix are each input into the three convolutional neural networks for training and feature extraction.
  • The pooling layer includes avg-pooling (average pooling) and max-pooling (maximum pooling).
  • The sentence vector feature sets are passed through avg-pooling and max-pooling in turn; the order in which avg-pooling and max-pooling are applied does not matter.
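The multi-width convolution and dual pooling described above might be sketched as follows. This is a toy illustration only: the convolution weights are uniform placeholders, the input is a small 4-token, 3-dimensional matrix rather than a real 600-dimensional one, and the filter counts are shrunk from 256/192/128 to keep the example readable; a real implementation would use a deep-learning framework.

```python
def conv1d(matrix, kernel_size, num_filters, weight):
    """Valid 1-D convolution over a (seq_len x dim) token matrix.
    weight(f, k, d) supplies the coefficient for filter f, offset k, dim d."""
    seq_len, dim = len(matrix), len(matrix[0])
    out = []
    for start in range(seq_len - kernel_size + 1):
        row = []
        for f in range(num_filters):
            s = 0.0
            for k in range(kernel_size):
                for d in range(dim):
                    s += weight(f, k, d) * matrix[start + k][d]
            row.append(s)
        out.append(row)
    return out

def avg_pool(features):
    """Average over the sequence positions, per filter."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def max_pool(features):
    """Maximum over the sequence positions, per filter."""
    return [max(col) for col in zip(*features)]

# Toy 4-token, 3-dim input matrix and uniform placeholder weights.
matrix = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [1.0, 1.0, 1.0]]
uniform = lambda f, k, d: 1.0

# Three parallel convolutions mirroring kernel sizes 1 / 2 / 3.
branches = [conv1d(matrix, ks, nf, uniform) for ks, nf in [(1, 4), (2, 3), (3, 2)]]

# avg-pooling and max-pooling applied to each branch; since each pooling
# reduces over sequence positions independently, their order does not matter.
pooled = [avg_pool(b) + max_pool(b) for b in branches]
```

Each branch collapses a variable-length sequence of convolution outputs into a fixed-size vector, which is why sentences of different lengths end up with comparable feature sets.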
  • S130: Concatenate the word-based sentence-pair vector feature set and the character-based sentence-pair vector feature set, and determine the similarity between each candidate question sentence and the question sentence to be matched through a sigmoid function.
  • In step S130, the specific process includes:
  • The word-based sentence-pair vector feature set is subjected to diff (vector subtraction), mul (vector multiplication), and max (vector maximization) operations, and the character-based sentence-pair
  • vector feature set is likewise subjected to vector subtraction, vector multiplication, and vector maximization.
  • The six resulting outputs are concatenated to form the final first text feature vector set; after dimensionality reduction, this set is input into the sigmoid function, which outputs a value that is the similarity between the question sentence to be matched and the candidate question sentence.
  • the output value of the sigmoid function is a score between 0 and 1.
  • The sigmoid function is the S-shaped logistic function common in biology, also known as the sigmoid growth curve.
  • Because it is monotonically increasing and has a monotonically increasing inverse, the sigmoid function is often used as the threshold function of neural networks, mapping variables into the interval (0, 1).
  • the dimensionality reduction processing includes: BatchNormalization (normalization) processing to bring the first text feature vector set into the same standard scale, followed by Dense (fully connected) processing, ReLU processing (which mitigates gradient vanishing), and dropout processing (which prevents the model from overfitting).
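A minimal plain-Python sketch of the feature-combination step described above. The two 3-dimensional sentence-pair vectors are invented toy values, and summing the features is only a stand-in for the real dimensionality reduction (BatchNormalization/Dense/ReLU/dropout) before the sigmoid:

```python
import math

def combine(u, v):
    # diff, mul and max applied element-wise to a sentence-pair vector.
    diff = [a - b for a, b in zip(u, v)]
    mul = [a * b for a, b in zip(u, v)]
    mx = [max(a, b) for a, b in zip(u, v)]
    return diff + mul + mx

def sigmoid(x):
    # Maps any real value into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

# Toy word-based and character-based sentence-pair vectors.
word_u, word_v = [0.2, 0.5, 0.1], [0.4, 0.1, 0.3]
char_u, char_v = [0.6, 0.2, 0.0], [0.1, 0.2, 0.5]

# Six outputs (diff/mul/max per granularity) spliced into the
# first text feature vector set.
features = combine(word_u, word_v) + combine(char_u, char_v)

# Stand-in for the reduced representation: a plain sum, where the
# real model applies BatchNormalization, Dense, ReLU and dropout.
score = sigmoid(sum(features))
print(round(score, 4))  # similarity in (0, 1)
```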
  • the essence of the algorithm in this application is to convert two sentences into vector representations with certain characteristic information, and then calculate the similarity of the sentence vectors to obtain the similarity between the question sentence to be matched and the candidate question sentence.
  • S140 Sort the similarities between the candidate question sentences and the question sentence to be matched in descending order, and select the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences.
  • in step S140, the set ranking is the top five: the values output by the sigmoid function are sorted from highest to lowest, the top five values are selected, and the candidate question sentences corresponding to those five values are the similar candidate question sentences.
  • word vectorization processing is performed on each similar candidate question sentence and the word-segmented question sentence to be matched, and character vectorization processing is performed on each similar candidate question sentence and the character-segmented question sentence to be matched, so as to respectively determine the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentences; after the four determined feature sets are concatenated, the similarity between the similar candidate question sentence and the question sentence to be matched is determined through the sigmoid function.
  • the pre-established value evaluation network includes: an embedding layer, a neural network layer, and a pooling layer.
  • the embedding layer includes Pre-train Embedding layer and train Embedding layer.
  • the neural network layer further includes a BiGRU (bidirectional gated recurrent unit) neural network layer, an encoded layer and a soft attention (soft attention mechanism) layer.
  • the question sentence to be matched and each similar candidate question sentence are input into the value evaluation network, and the similarity is calculated.
  • the specific process includes: performing Pre-train Embedding processing and train Embedding processing on the word-segmented question sentence to be matched, and then splicing the word vectors output by Pre-train Embedding with the word vectors output by train Embedding to form a third word vector matrix; each similar candidate question sentence is likewise processed by Pre-train Embedding and train Embedding, and the word vectors output by Pre-train Embedding are spliced with the word vectors output by train Embedding to form a fourth word vector matrix.
  • both the Pre-train Embedding dimension and the train Embedding dimension can be set to 300.
  • each word in the two word vector matrices formed in this way is thus represented by a 600-dimensional vector, which describes the word more accurately.
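The dual-embedding concatenation can be sketched as follows. This uses toy dimensions of 4 rather than 300, and `fake_embedding` with random vectors is a hypothetical stand-in for the real Pre-train Embedding and train Embedding lookups; the example words are also invented:

```python
import random

DIM = 4  # the patent uses 300 for each embedding

def fake_embedding(vocab, dim, seed):
    # Hypothetical stand-in lookup table; a real system would load
    # pre-trained vectors (Pre-train Embedding) or learn them
    # during training (train Embedding).
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}

words = ["persistent", "cough", "treatment"]
pretrain = fake_embedding(words, DIM, seed=1)
train = fake_embedding(words, DIM, seed=2)

# Each word's two embeddings are spliced into one 2*DIM vector,
# and the rows stack into the word vector matrix.
matrix = [pretrain[w] + train[w] for w in words]
print(len(matrix), len(matrix[0]))  # 3 rows, each 2*DIM = 8 wide
```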
  • the specific process further includes:
  • Pre-train Embedding and train Embedding are performed on the character-segmented question sentence to be matched, and the character vectors output by Pre-train Embedding are spliced with the character vectors output by train Embedding to form a third character vector matrix;
  • each similar candidate question sentence is likewise processed by Pre-train Embedding and train Embedding, and the character vectors output by Pre-train Embedding are spliced with the character vectors output by train Embedding to form a fourth character vector matrix.
  • the specific process includes:
  • the third word vector matrix is input into the BiGRU neural network layer for deep-level feature extraction, then passed through the encoded layer and the soft attention layer respectively; the outputs of the encoded layer and the soft attention layer are spliced and input into the BiGRU neural network layer again to extract features; dimensionality reduction is then performed through the pooling layer, and the word-based feature set of the question sentence to be matched is output.
  • the specific process includes:
  • the fourth word vector matrix is input into the BiGRU neural network layer for deep-level feature extraction, then passed through the encoded layer and the soft attention layer respectively; the outputs of the encoded layer and the soft attention layer are spliced and input into the BiGRU neural network layer again to extract features; dimensionality reduction is then performed through the pooling layer, and the word-based feature set of the similar candidate question sentences is output.
  • the specific process further includes:
  • the third character vector matrix is input into the BiGRU neural network layer for deep-level feature extraction, then passed through the encoded layer and the soft attention layer respectively; the outputs of the encoded layer and the soft attention layer are spliced and input into the BiGRU neural network layer again to extract features; dimensionality reduction is then performed through the pooling layer, and the character-based feature set of the question sentence to be matched is output.
  • the specific process further includes:
  • the fourth character vector matrix is input into the BiGRU neural network layer for deep-level feature extraction, then passed through the encoded layer and the soft attention layer respectively; the outputs of the encoded layer and the soft attention layer are spliced and input into the BiGRU neural network layer again to extract features; dimensionality reduction is then performed through the pooling layer, and the character-based feature set of the similar candidate question sentences is output.
  • BiGRU is a variant of the LSTM structure: it has an update gate and a reset gate, which strengthen the semantic understanding of contextual relationships; the soft attention layer aligns the deep-level information after feature extraction; the encoded layer encodes the extracted feature information.
  • the pooling layer includes avg-pooling and max-pooling.
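The soft attention alignment mentioned above can be illustrated with a minimal dot-product attention in plain Python. This is a generic sketch of the mechanism, not the patent's exact layer; the 2-dimensional hidden states are invented:

```python
import math

def softmax(xs):
    # Numerically stable softmax: positive weights summing to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def soft_attention(query, keys):
    # Align one hidden state against a sequence of hidden states:
    # weights from dot products, output as the weighted sum.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    return [sum(w * key[j] for w, key in zip(weights, keys)) for j in range(dim)]

# Toy BiGRU outputs for a 3-step sequence, 2 hidden units each.
states = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3]]
aligned = soft_attention([0.6, 0.4], states)
print(aligned)  # a 2-dimensional aligned vector
```

Because the weights form a convex combination, each component of the aligned vector stays within the range spanned by the corresponding hidden-state components.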
  • the similarity between the similar candidate question sentence and the question sentence to be matched is obtained through the sigmoid function.
  • the specific process includes:
  • the word-based feature set of the question sentences to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentences to be matched, and the character-based feature set of the similar candidate question sentences are concatenated to form the final second text feature vector set.
  • after dimensionality reduction processing is performed on the second text feature vector set, it is input into the sigmoid function, which outputs a value that is the similarity between the question sentence to be matched and the similar candidate question sentence.
  • S160 Sort the similarities between the similar candidate question sentences and the question sentence to be matched in order from high to low, and obtain the sorting result within the set ranking; sort the similarities between the candidate question sentences and the question sentence to be matched from high to low, and obtain the sorting result within the set ranking; use the two sorting results as the two variables in the Pearson correlation coefficient calculation formula to compute the correlation coefficient. If the correlation coefficient reaches the set threshold, the candidate question sentence ranked first in similarity to the question sentence to be matched is the result of semantic matching. If the correlation coefficient is lower than the set threshold, the search engine retrieves again, from the specified database, at least one question sentence having a set similarity to the question sentence to be matched, and the process proceeds to S120.
  • in step S160, the Pearson correlation coefficient is denoted by the lowercase letter r and is calculated by the standard formula: r = Σᵢ(Xᵢ − X̄)(Yᵢ − Ȳ) / √(Σᵢ(Xᵢ − X̄)² · Σᵢ(Yᵢ − Ȳ)²), where X̄ and Ȳ are the means of the sequences X and Y.
  • r is the Pearson correlation coefficient, ranging from -1 to 1; the larger the value, the stronger the correlation. X is the similarity ranking sequence calculated by the semantic recall network, and Y is the similarity ranking sequence calculated by the value evaluation network; n is the set ranking, chosen as 5 in this embodiment. A high coefficient means the matching effect of the semantic recall network is good, and a low coefficient means it is poor.
  • the threshold can be set to 0.7.
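The decision rule of S160 can be sketched in plain Python. The two top-5 similarity sequences below are invented toy values; the 0.7 threshold follows the embodiment:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient r of two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy top-5 similarity sequences from the semantic recall network (X)
# and the value evaluation network (Y).
X = [0.92, 0.88, 0.75, 0.70, 0.61]
Y = [0.90, 0.85, 0.80, 0.66, 0.60]

r = pearson(X, Y)
THRESHOLD = 0.7
if r >= THRESHOLD:
    print("accept top-ranked candidate, r =", round(r, 3))
else:
    print("re-retrieve candidates and repeat S120, r =", round(r, 3))
```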
  • the present application can further store, as training data, the similarity ranking sequence between the question sentence to be matched and the similar candidate question sentences, the ranking sequence of the top five similarities between the question sentence to be matched and the candidate question sentences, and the Pearson correlation coefficient. It can also record the customer's likes and dislikes and send this feedback data back to the value evaluation network as training data.
  • the respective results of the semantic recall network and the value evaluation network are used as new training data and training is performed again; compared with the first round, the data is thereby augmented, and vector operations are performed to obtain the final similarity.
  • in the adversarial training process, the training data is fully and repeatedly used; applying it in a patient-education question-answering system can effectively compensate for insufficient matching data for some diseases. On the one hand, it saves data-collection time; on the other hand, it greatly reduces the burden of manual maintenance and iterative upgrades.
  • FIG. 2 is a schematic diagram of the logical structure of a semantic matching system based on an internal adversarial mechanism according to Embodiment 2 of the present application.
  • a semantic matching system based on an internal adversarial mechanism includes: a word and character segmentation processing unit for the question sentence to be matched and the candidate question sentences, a sentence vector feature set forming unit for the question sentence to be matched and the candidate question sentences, a semantic similarity determination unit for the question sentence to be matched and the candidate question sentences, a similar candidate question sentence determination unit, a semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences, and a semantic matching result determination unit.
  • the word and character segmentation processing unit for the question sentence to be matched and the candidate question sentences is used to perform word segmentation processing and character segmentation processing on the question sentence to be matched and the candidate question sentences respectively;
  • the sentence vector feature set forming unit for the question sentence to be matched and the candidate question sentences is used to perform word vectorization processing on each word-segmented candidate question sentence and the word-segmented question sentence to be matched, so as to determine the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences; and to perform character vectorization processing on each character-segmented candidate question sentence and the character-segmented question sentence to be matched, so as to determine the sentence vector feature set of the character-based sentence pairs; wherein the candidate question sentences are at least one question sentence, retrieved from a specified database by a search engine, having a set similarity to the question sentence to be matched;
  • the semantic similarity determination unit for the question sentence to be matched and the candidate question sentences is used to concatenate the sentence vector feature set of the word-based sentence pairs and the sentence vector feature set of the character-based sentence pairs, and to determine the similarity between each candidate question sentence and the question sentence to be matched;
  • the similar candidate question sentence determination unit is used to sort the similarities between the candidate question sentences and the question sentence to be matched in order from high to low, and to select the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences;
  • the semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences is used to perform word vectorization processing on each similar candidate question sentence and the word-segmented question sentence to be matched, and character vectorization processing on each similar candidate question sentence and the character-segmented question sentence to be matched, so as to respectively determine the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentences; and, after concatenating the four determined feature sets, to determine the similarity between each similar candidate question sentence and the question sentence to be matched;
  • the semantic matching result determination unit is used to sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each in order from high to low, and to obtain the sorting results within the set ranking; to use the two sorting results as the two variables of the Pearson correlation coefficient calculation formula and compute the correlation coefficient; if the correlation coefficient reaches the set threshold, to take the candidate question sentence ranked first in similarity to the question sentence to be matched as the result of semantic matching; and if the correlation coefficient is lower than the set threshold, to retrieve again from the specified database, through the search engine, at least one question sentence having a set similarity to the question sentence to be matched, and to calculate the similarity.
  • FIG. 3 is a schematic diagram of a logical structure of an electronic device according to Embodiment 3 of the present application.
  • an electronic device 1 includes a memory 3 and a processor 2.
  • the memory 3 stores a computer program 4, and the computer program 4, when executed by the processor 2, implements the steps of the semantic matching method based on the internal adversarial mechanism.
  • a computer-readable storage medium is provided, which may be non-volatile or volatile.
  • the computer-readable storage medium includes a semantic matching program based on an internal adversarial mechanism, and when the semantic matching program is executed by a processor, the steps of the semantic matching method based on the internal adversarial mechanism of Embodiment 1 are implemented.

Abstract

A semantic matching method and device based on an internal adversarial mechanism, and a storage medium, relating to the technical field of artificial intelligence. The method comprises the following steps: respectively performing word segmentation processing and character segmentation processing on a question sentence to be matched and candidate question sentences; respectively calculating similarities between the candidate question sentences and said question sentence; sorting the similarities between the candidate question sentences and said question sentence, and taking the candidate question sentences within a set rank as similar candidate question sentences; respectively calculating similarities between the similar candidate question sentences and said question sentence; and using the sorting result, within the set rank, of the similarities between the similar candidate question sentences and said question sentence and the sorting result, within the set rank, of the similarities between the candidate question sentences and said question sentence as two variables of the Pearson correlation coefficient calculation formula, and determining the matching result according to the correlation coefficient. The method can effectively improve semantic matching quality and precision.

Description

Semantic matching method, device and storage medium based on internal adversarial mechanism
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on February 26, 2020, with application number 202010119430.0 and the invention title "Semantic matching method, device and storage medium based on internal adversarial mechanism", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence technology and, more specifically, to a semantic matching method based on an internal adversarial mechanism.
Background
Human-machine dialogue is currently a very popular application scenario in the field of Natural Language Processing (NLP). From traditional artificial intelligence (AI) customer service to voice chatbots, the core technologies are semantic recognition, semantic understanding, and semantic matching.
At present, most human-machine dialogue systems on the market are concentrated in fields such as finance, customer service, and entertainment, while human-machine question answering in the medical field is still relatively in its infancy. On the one hand, medical scenarios are more complex and involve more technical terms, making it difficult for AI to fully understand a patient's demands. On the other hand, because medical scenarios have low fault tolerance, correspondingly higher requirements are placed on AI recognition accuracy. Some patient-education question-answering systems already exist on the market, such as Doctor Thumb and Kang Fuzi. However, these systems generally suffer from problems such as only being able to answer simple questions, being helpless with complex demands, and giving irrelevant answers. The main reason is that current semantic matching models generally have shortcomings such as poor robustness and insufficient transfer learning performance.
The core module of a patient-education question-answering system is the semantic recall module, whose main function is to search the answer database, based on the patient's question, for the answer closest to the patient's demand and respond with it. Therefore, the performance of such a system mainly depends on the accuracy of the semantic recall module. Currently, most semantic recall modules are built on deep learning networks, such as CNN (convolutional neural network), LSTM (Long Short-Term Memory), ESIM (Enhanced-LSTM), Decomposable Attention, Multi-head attention, and so on. These deep learning networks each have their own advantages and disadvantages and are suitable for different scenarios. The inventors realized that all of these models suffer from shortcomings such as overfitting and high sensitivity to data quality.
Summary of the invention
In view of the above problems, the purpose of this application is to provide a semantic matching method based on an internal adversarial mechanism. A value evaluation network is added on the basis of the original question recall module. This network evaluates the quality of each result of the question recall module and feeds it back to the semantic matching network (i.e., the question recall module, built with a deep learning model) as new training data; the network is retrained and its output is sent to the value evaluation network again, and the adversarial process terminates only when the evaluation score of the value evaluation network reaches a threshold. This can improve the robustness and transfer learning performance of the semantic matching system and improve the quality and accuracy of question recall.
According to one aspect of this application, a semantic matching method based on an internal adversarial mechanism is provided, including the following steps:
S110: Perform word segmentation processing and character segmentation processing on the question sentence to be matched and the candidate question sentences respectively;
S120: Perform word vectorization processing on each word-segmented candidate question sentence and the word-segmented question sentence to be matched, to determine the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences; and perform character vectorization processing on each character-segmented candidate question sentence and the character-segmented question sentence to be matched, to determine the sentence vector feature set of the character-based sentence pairs of the question sentence to be matched and the candidate question sentences; wherein the candidate question sentences are at least one question sentence, retrieved from a specified database by a search engine, having a set similarity to the question sentence to be matched;
S130: Concatenate the sentence vector feature set of the word-based sentence pairs and the sentence vector feature set of the character-based sentence pairs, and determine the similarity between each candidate question sentence and the question sentence to be matched;
S140: Sort the similarities between the candidate question sentences and the question sentence to be matched in order from high to low, and select the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences;
S150: Perform word vectorization processing on each similar candidate question sentence and the word-segmented question sentence to be matched, and perform character vectorization processing on each similar candidate question sentence and the character-segmented question sentence to be matched, so as to respectively determine the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentences; after concatenating the four determined feature sets, determine the similarity between each similar candidate question sentence and the question sentence to be matched;
S160: Sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each in order from high to low, and obtain the sorting results within the set ranking; use the two sorting results as the two variables of the Pearson correlation coefficient calculation formula to compute the correlation coefficient; if the correlation coefficient reaches a set threshold, take the candidate question sentence ranked first in similarity to the question sentence to be matched as the result of semantic matching; if the correlation coefficient is lower than the set threshold, retrieve again from the specified database, through the search engine, at least one question sentence having a set similarity to the question sentence to be matched, and perform S120.
According to another aspect of this application, a semantic matching system based on an internal adversarial mechanism is provided, including:
A word and character segmentation processing unit for the question sentence to be matched and the candidate question sentences: used to perform word segmentation processing and character segmentation processing on the question sentence to be matched and the candidate question sentences respectively;
A sentence vector feature set forming unit for the question sentence to be matched and the candidate question sentences: used to perform word vectorization processing on each word-segmented candidate question sentence and the word-segmented question sentence to be matched, to determine the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences; and to perform character vectorization processing on each character-segmented candidate question sentence and the character-segmented question sentence to be matched, to determine the sentence vector feature set of the character-based sentence pairs; wherein the candidate question sentences are at least one question sentence, retrieved from a specified database by a search engine, having a set similarity to the question sentence to be matched;
A semantic similarity determination unit for the question sentence to be matched and the candidate question sentences: used to concatenate the sentence vector feature set of the word-based sentence pairs and the sentence vector feature set of the character-based sentence pairs, and to determine the similarity between each candidate question sentence and the question sentence to be matched;
A similar candidate question sentence determination unit: used to sort the similarities between the candidate question sentences and the question sentence to be matched in order from high to low, and to select the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences;
A semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences: used to perform word vectorization processing on each similar candidate question sentence and the word-segmented question sentence to be matched, and character vectorization processing on each similar candidate question sentence and the character-segmented question sentence to be matched, so as to respectively determine the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentences; and, after concatenating the four determined feature sets, to determine the similarity between each similar candidate question sentence and the question sentence to be matched;
A semantic matching result determination unit: used to sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each in order from high to low, and to obtain the sorting results within the set ranking; to use the two sorting results as the two variables of the Pearson correlation coefficient calculation formula and compute the correlation coefficient; if the correlation coefficient reaches a set threshold, to take the candidate question sentence ranked first in similarity to the question sentence to be matched as the result of semantic matching; and if the correlation coefficient is lower than the set threshold, to retrieve again from the specified database, through the search engine, at least one question sentence having a set similarity to the question sentence to be matched, and to calculate the similarity.
According to another aspect of the present application, an electronic device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the steps of the following semantic matching method based on an internal adversarial mechanism:
S110: performing word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively;
S120: performing word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the sentence vector feature set of the word-based sentence pair formed by the question sentence to be matched and the candidate question sentence; and performing character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the sentence vector feature set of the character-based sentence pair; wherein the candidate question sentences are at least one question sentence, retrieved by a search engine from a designated database, having a set similarity to the question sentence to be matched;
S130: concatenating the sentence vector feature set of the word-based sentence pair with the sentence vector feature set of the character-based sentence pair, and determining the similarity between the candidate question sentence and the question sentence to be matched;
S140: sorting the similarities between the candidate question sentences and the question sentence to be matched in descending order, and selecting the candidate question sentences whose similarity ranks within a set number of places as similar candidate question sentences;
S150: performing word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and performing character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched; determining the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence; and, after concatenating the four feature sets so determined, determining the similarity between the similar candidate question sentence and the question sentence to be matched;
S160: sorting the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each in descending order, and taking the ranking results within the set number of places; using the two ranking results as the two variables of the Pearson correlation coefficient formula to compute a correlation coefficient; if the correlation coefficient reaches a set threshold, taking the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and, if the correlation coefficient is below the set threshold, again retrieving, via the search engine, from the designated database at least one question sentence having a set similarity to the question sentence to be matched, and performing S120.
According to another aspect of the present application, a computer-readable storage medium is provided, the computer-readable storage medium containing a semantic matching program based on an internal adversarial mechanism which, when executed by a processor, implements the steps of the following semantic matching method based on an internal adversarial mechanism:
S110: performing word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively;
S120: performing word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the sentence vector feature set of the word-based sentence pair formed by the question sentence to be matched and the candidate question sentence; and performing character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the sentence vector feature set of the character-based sentence pair; wherein the candidate question sentences are at least one question sentence, retrieved by a search engine from a designated database, having a set similarity to the question sentence to be matched;
S130: concatenating the sentence vector feature set of the word-based sentence pair with the sentence vector feature set of the character-based sentence pair, and determining the similarity between the candidate question sentence and the question sentence to be matched;
S140: sorting the similarities between the candidate question sentences and the question sentence to be matched in descending order, and selecting the candidate question sentences whose similarity ranks within a set number of places as similar candidate question sentences;
S150: performing word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and performing character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched; determining the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence; and, after concatenating the four feature sets so determined, determining the similarity between the similar candidate question sentence and the question sentence to be matched;
S160: sorting the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, each in descending order, and taking the ranking results within the set number of places; using the two ranking results as the two variables of the Pearson correlation coefficient formula to compute a correlation coefficient; if the correlation coefficient reaches a set threshold, taking the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and, if the correlation coefficient is below the set threshold, again retrieving, via the search engine, from the designated database at least one question sentence having a set similarity to the question sentence to be matched, and performing S120.
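The Pearson-correlation check in S160 can be sketched in plain Python. The two top-5 score lists below (one per network) and the 0.8 threshold are illustrative assumptions; the application only speaks of a "set threshold" and does not fix these values.

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length score lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical top-5 similarity scores from the two networks,
# both already sorted in descending order as S160 requires.
recall_scores = [0.92, 0.85, 0.77, 0.60, 0.55]  # semantic recall network
value_scores  = [0.90, 0.88, 0.70, 0.66, 0.50]  # value evaluation network

THRESHOLD = 0.8  # illustrative stand-in for the "set threshold"
r = pearson(recall_scores, value_scores)
agree = r >= THRESHOLD  # if False, retrieval is redone and S120 is performed again
```

When `agree` is true, the two networks rank the candidates consistently and the top-ranked candidate is returned as the semantic matching result.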
With the above semantic matching method based on the internal adversarial mechanism of the present application, the semantic recall network and the value evaluation network form an adversarial mechanism between the two networks, which, without sacrificing efficiency, better evaluates the similarity between the candidate question sentences and the user's question, improves the accuracy and precision of the question recall module, and pushes higher-quality answers to the user. Feeding both the word-level and the character-level segmentations into the neural network for training improves matching accuracy.
Description of the Drawings
Fig. 1 is a flowchart of a semantic matching method based on an internal adversarial mechanism according to Embodiment 1 of the present application;
Fig. 2 is a schematic diagram of the logical structure of a semantic matching system based on an internal adversarial mechanism according to Embodiment 2 of the present application;
Fig. 3 is a schematic diagram of the logical structure of an electronic device according to Embodiment 3 of the present application.
Detailed Description
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It will be apparent, however, that the embodiments may also be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing one or more embodiments.
Specific embodiments of the present application are described in detail below with reference to the accompanying drawings.
Embodiment 1
Fig. 1 is a flowchart of the semantic matching method based on an internal adversarial mechanism according to Embodiment 1 of the present application.
As shown in Fig. 1, a semantic matching method based on an internal adversarial mechanism includes the following steps.
S110: performing word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively.
In step S110, the word segmentation process includes: removing stop words and special symbols from the question sentence to be matched and then applying a deep learning tokenizer (Tokenizer) for word segmentation; and, after removing stop words and special symbols from the candidate question sentences, applying the deep learning tokenizer for word segmentation.
For example, if the question sentence to be matched is "糖尿病吃什么？" ("What should a diabetic eat?"), after word segmentation it becomes "糖尿病/吃/什么" ("diabetes / eat / what").
The character segmentation process includes: removing stop words and special symbols from the question sentence to be matched and then applying the deep learning tokenizer for character segmentation; and, after removing stop words and special symbols from the candidate question sentences, applying the deep learning tokenizer for character segmentation.
For example, after character segmentation the same question sentence becomes "糖/尿/病/吃/什/么" (one token per Chinese character).
Stop words are characters or words that, in information retrieval, are automatically filtered out before or after processing natural language text in order to save storage space and improve search efficiency; they mainly include English characters, digits, mathematical characters, punctuation marks, and extremely frequent single Chinese characters. Special characters are symbols that, compared with traditional or commonly used symbols, occur less frequently and are difficult to input directly, such as mathematical symbols, unit symbols, and tabs. The purpose of removing stop words and special symbols is to make the question sentence to be matched more concise and to improve the efficiency of semantic matching.
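A minimal sketch of the S110 preprocessing follows. The stop-word set, the punctuation pattern, and the toy dictionary-driven forward-maximum-match segmenter are illustrative stand-ins for the deep learning tokenizer the application actually uses.

```python
import re

STOP_WORDS = {"的", "了", "吗", "?", "？"}   # illustrative stop-word set
VOCAB = {"糖尿病", "什么", "吃"}              # toy segmentation dictionary

def clean(sentence):
    # Remove special symbols (anything that is not a word character), then stop words.
    sentence = re.sub(r"[^\w\u4e00-\u9fff]", "", sentence)
    return "".join(ch for ch in sentence if ch not in STOP_WORDS)

def word_segment(sentence, max_len=4):
    # Greedy forward-maximum-match segmentation (stand-in for the tokenizer).
    tokens, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in VOCAB or j == i + 1:
                tokens.append(sentence[i:j])
                i = j
                break
    return tokens

def char_segment(sentence):
    # Character segmentation: one token per Chinese character.
    return list(sentence)

q = clean("糖尿病吃什么？")
words = word_segment(q)   # word-level tokens of the sentence to be matched
chars = char_segment(q)   # character-level tokens of the same sentence
```

The same two functions are applied to every candidate question sentence, yielding the word-segmented and character-segmented forms consumed by S120.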
S120: in the embedding layer of a pre-established semantic recall network, performing word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the sentence vector feature set of the word-based sentence pair formed by the question sentence to be matched and the candidate question sentence; and performing character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the sentence vector feature set of the character-based sentence pair; wherein the candidate question sentences are question sentences similar to the question sentence to be matched.
In step S120, the candidate question sentences are at least one question sentence, retrieved from a designated database through es (Elasticsearch, a search engine), having a set similarity to the question sentence to be matched. Their number may be 128 or more. For example, for the question sentence to be matched "糖尿病吃什么？", es may retrieve 128 candidate question sentences such as "糖尿病的定义" ("the definition of diabetes") and "糖尿病如何运动" ("how diabetics should exercise").
The designated database stores a large number of pre-collected question sentences that may be related to the candidate questions; these question sentences may likewise all be stored in the database in both word-segmented and character-segmented form, so as to facilitate matching queries.
The pre-established semantic recall network includes an embedding (vectorization) layer, a convolutional layer, and a pooling layer. The embedding layer in turn comprises a Pre-train Embedding (pre-trained vectorization) layer and a train Embedding (trained vectorization) layer.
The word-segmented question sentence to be matched and each candidate question sentence retrieved by es are fed into the semantic recall network for matching, and the sentence vector feature set of the word-based sentence pair formed by the question sentence to be matched and each candidate question sentence is computed.
The word vectorization of each word-segmented candidate question sentence and of the word-segmented question sentence to be matched, which determines the sentence vector feature set of the word-based sentence pair, specifically includes the following.
The word-segmented question sentence to be matched is subjected to Pre-train Embedding processing and train Embedding processing respectively, and the word vectors output by the Pre-train Embedding are concatenated with the word vectors output by the train Embedding to form a first word vector matrix.
The word-segmented candidate question sentence is subjected to Pre-train Embedding processing and train Embedding processing respectively, and the word vectors output by the Pre-train Embedding are concatenated with the word vectors output by the train Embedding to form a second word vector matrix.
Both the Pre-train Embedding dimension and the train Embedding dimension may be set to 300. In the two word vector matrices thus formed, each word is represented by a 600-dimensional vector, which describes the word more precisely.
The first word vector matrix is fed into the convolutional layer for feature extraction, which outputs the word-based sentence vector feature set of the question sentence to be matched; this feature set is fed into the pooling layer for dimensionality reduction, discarding genuinely irrelevant data and preventing overfitting.
The second word vector matrix is fed into the convolutional layer for feature extraction, which outputs the word-based sentence vector feature set of the candidate question sentence; this feature set is fed into the pooling layer for dimensionality reduction, discarding genuinely irrelevant data and preventing overfitting.
The word-based sentence vector feature set of the question sentence to be matched and the word-based sentence vector feature set of the candidate question sentence, as output by the pooling layer after dimensionality reduction, are concatenated to obtain the sentence vector feature set of the word-based sentence pair.
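The word-branch pipeline just described (look up both embedding tables, concatenate into a 600-dimensional vector per token, convolve, pool over time, then concatenate the two sentences' features) can be sketched with NumPy. The random embedding tables, the small vocabulary, and the tiny filter count are placeholders for the trained Pre-train Embedding / train Embedding weights and the real layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, PRE_DIM, TRAIN_DIM = 50, 300, 300
pre_table = rng.standard_normal((VOCAB_SIZE, PRE_DIM))      # stands in for Pre-train Embedding
train_table = rng.standard_normal((VOCAB_SIZE, TRAIN_DIM))  # stands in for train Embedding

def embed(token_ids):
    # Look up both tables and concatenate -> one 600-dim vector per token.
    return np.concatenate([pre_table[token_ids], train_table[token_ids]], axis=1)

def conv_and_pool(matrix, kernel_size=2, n_filters=8, seed=1):
    # 1-D convolution over the token axis, followed by max pooling over time.
    w = np.random.default_rng(seed).standard_normal((n_filters, kernel_size * matrix.shape[1]))
    windows = np.stack([matrix[i:i + kernel_size].ravel()
                        for i in range(matrix.shape[0] - kernel_size + 1)])
    feature_map = windows @ w.T          # (n_windows, n_filters)
    return feature_map.max(axis=0)       # pooled sentence vector, (n_filters,)

query_mat = embed([3, 7, 11])     # word vector matrix of the sentence to be matched (3 tokens)
cand_mat = embed([3, 7, 20, 5])   # word vector matrix of a candidate sentence (4 tokens)

# Concatenate the two pooled feature sets -> sentence-pair feature set.
pair_features = np.concatenate([conv_and_pool(query_mat), conv_and_pool(cand_mat)])
```

The character branch is identical in shape, only fed with character-level token ids instead of word-level ones.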
The character-segmented question sentence to be matched and each candidate question sentence retrieved by es are fed into the semantic recall network for matching, and the sentence vector feature set of the character-based sentence pair formed by the question sentence to be matched and each candidate question sentence is computed.
The character vectorization of each character-segmented candidate question sentence and of the character-segmented question sentence to be matched, which determines the sentence vector feature set of the character-based sentence pair, specifically includes the following.
The character-segmented question sentence to be matched is subjected to Pre-train Embedding processing and train Embedding processing respectively, and the character vectors output by the Pre-train Embedding are concatenated with the character vectors output by the train Embedding to form a first character vector matrix.
The character-segmented candidate question sentence is subjected to Pre-train Embedding processing and train Embedding processing respectively, and the character vectors output by the Pre-train Embedding are concatenated with the character vectors output by the train Embedding to form a second character vector matrix.
Both the Pre-train Embedding dimension and the train Embedding dimension are set to 300. In the two character vector matrices thus formed, each character is represented by a 600-dimensional vector, which describes the character more finely.
The first character vector matrix is fed into the convolutional layer for feature extraction, which outputs the character-based sentence vector feature set of the question sentence to be matched; this feature set is fed into the pooling layer for dimensionality reduction, discarding genuinely irrelevant data and preventing overfitting.
The second character vector matrix is fed into the convolutional layer for feature extraction, which outputs the character-based sentence vector feature set of the candidate question sentence; this feature set is fed into the pooling layer for dimensionality reduction, discarding genuinely irrelevant data and preventing overfitting.
The character-based sentence vector feature set of the question sentence to be matched and the character-based sentence vector feature set of the candidate question sentence, as output by the pooling layer after dimensionality reduction, are concatenated to obtain the sentence vector feature set of the character-based sentence pair.
The convolutional layer may include three convolutional neural networks, with kernel sizes of 1, 2, and 3 and with 256, 192, and 128 filters, respectively. The word vector matrices and the character vector matrices are each fed in turn into the three convolutional neural networks for training and feature extraction.
The pooling layer comprises avg-pooling (average pooling) and max-pooling (maximum pooling); the sentence vector feature set is fed into both avg-pooling and max-pooling, and the order in which it is fed into avg-pooling and max-pooling is immaterial.
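The order-independence of the two poolings follows from both reading the same feature map. A minimal sketch, assuming a hypothetical 4-window by 3-filter convolution output:

```python
import numpy as np

# Hypothetical convolution output: 4 sliding windows x 3 filters.
feature_map = np.array([[ 0.2, -1.0,  0.5],
                        [ 0.8,  0.3, -0.2],
                        [ 0.1,  0.9,  0.4],
                        [-0.5,  0.2,  0.6]])

avg_pooled = feature_map.mean(axis=0)   # avg-pooling over the window axis
max_pooled = feature_map.max(axis=0)    # max-pooling over the window axis

# Both poolings read the same feature map, so the order in which they
# are applied does not matter; their outputs are concatenated.
sentence_vec = np.concatenate([avg_pooled, max_pooled])
```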
S130: concatenating the sentence vector feature set of the word-based sentence pair with the sentence vector feature set of the character-based sentence pair, and determining the similarity between the candidate question sentence and the question sentence to be matched through a sigmoid function.
In step S130, the specific process includes the following.
The outputs of diff (vector subtraction), mul (vector multiplication), and max (vector maximization) applied to the sentence vector feature set of the word-based sentence pair, together with the outputs of vector subtraction, vector multiplication, and vector maximization applied to the sentence vector feature set of the character-based sentence pair, are concatenated; these six outputs form the final first text feature vector set. After dimensionality reduction, the first text feature vector set is fed into the sigmoid function, which outputs a value, namely the similarity between the question sentence to be matched and the candidate question sentence. The value output by the sigmoid function is a score between 0 and 1.
The sigmoid function is an S-shaped function common in biology, also known as the S-shaped growth curve. In information science, owing to properties such as being monotonically increasing and having a monotonically increasing inverse, the sigmoid function is often used as the threshold function of a neural network, mapping variables into the interval between 0 and 1.
The dimensionality reduction includes: applying BatchNormalization to convert the first text feature vector set into a common standard scale, followed by Dense processing, relu processing (to prevent vanishing gradients), and dropout processing (to prevent model overfitting).
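The diff/mul/max interaction and the final sigmoid score of S130 can be sketched as follows. The four sentence vectors are hypothetical pooled features, and the random projection stands in for the trained Dense layer (BatchNormalization and dropout are omitted for brevity).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def interact(u, v):
    # diff, mul and max applied to a sentence-vector pair, then concatenated.
    return np.concatenate([u - v, u * v, np.maximum(u, v)])

rng = np.random.default_rng(0)
u_word, v_word = rng.standard_normal(8), rng.standard_normal(8)  # word-based pair features
u_char, v_char = rng.standard_normal(8), rng.standard_normal(8)  # character-based pair features

# Six interaction outputs (3 word-based + 3 character-based) concatenated
# into the first text feature vector set.
text_features = np.concatenate([interact(u_word, v_word), interact(u_char, v_char)])

dense_w = rng.standard_normal(text_features.shape[0])  # stands in for the trained Dense layer
similarity = sigmoid(text_features @ dense_w)          # a score in (0, 1)
```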
In essence, the algorithm of the present application converts two sentences into vector representations carrying certain feature information, and then computes the similarity of the sentence vectors to obtain the similarity between the question sentence to be matched and the candidate question sentence.
S140: sorting the similarities between the candidate question sentences and the question sentence to be matched in descending order, and selecting the candidate question sentences whose similarity ranks within a set number of places as similar candidate question sentences.
In step S140, the set number of places is the top five. The values output by the sigmoid function are sorted from largest to smallest and the top five values are selected; the candidate question sentences corresponding to the top five values are the similar candidate question sentences.
S150: in the embedding layer of a pre-established value evaluation network, performing word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and performing character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched; determining the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence; and, after concatenating the four feature sets so determined, determining the similarity between the similar candidate question sentence and the question sentence to be matched through the sigmoid function.
In step S150, the pre-established value evaluation network includes an embedding layer, a neural network layer, and a pooling layer. The embedding layer in turn comprises a Pre-train Embedding layer and a train Embedding layer. The neural network layer in turn comprises a BiGRU (bidirectional gated recurrent unit) neural network layer, an encoded layer, and a soft attention layer.
The question sentence to be matched is matched against each similar candidate question sentence in the value evaluation network, and the similarity is computed.
The word vectorization of each similar candidate question sentence and of the word-segmented question sentence to be matched specifically includes: subjecting the word-segmented question sentence to be matched to Pre-train Embedding processing and train Embedding processing respectively, and concatenating the word vectors output by the Pre-train Embedding with the word vectors output by the train Embedding to form a third word vector matrix; and subjecting the similar candidate question sentence to Pre-train Embedding processing and train Embedding processing respectively, and concatenating the word vectors output by the Pre-train Embedding with the word vectors output by the train Embedding to form a fourth word vector matrix.
Both the Pre-train Embedding dimension and the train Embedding dimension may be set to 300. In the two word vector matrices thus formed, each word is represented by a 600-dimensional vector, which describes the word more precisely.
The character vectorization of each similar candidate question sentence and of the character-segmented question sentence to be matched specifically includes the following.
The character-segmented question sentence to be matched is subjected to Pre-train Embedding processing and train Embedding processing respectively, and the character vectors output by the Pre-train Embedding are concatenated with the character vectors output by the train Embedding to form a third character vector matrix; the similar candidate question sentence is subjected to Pre-train Embedding processing and train Embedding processing respectively, and the character vectors output by the Pre-train Embedding are concatenated with the character vectors output by the train Embedding to form a fourth character vector matrix.
Both the Pre-train Embedding dimension and the train Embedding dimension are set to 300; in the two character vector matrices thus formed, each character is represented by a 600-dimensional vector, which describes the character more precisely.
The word-based feature set of the question sentence to be matched is determined as follows:
the third word-vector matrix is input into a BiGRU neural network layer for deep feature extraction, then passed through an encoded layer and a soft-attention layer respectively; the outputs of the encoded layer and the soft-attention layer are spliced and input into a BiGRU neural network layer again for feature extraction; dimensionality is then reduced through a pooling layer, and the word-based feature set of the question sentence to be matched is output.
The word-based feature set of a similar candidate question sentence is determined as follows:
the fourth word-vector matrix is input into a BiGRU neural network layer for deep feature extraction, then passed through an encoded layer and a soft-attention layer respectively; the outputs of the encoded layer and the soft-attention layer are spliced and input into a BiGRU neural network layer again for feature extraction; dimensionality is then reduced through a pooling layer, and the word-based feature set of the similar candidate question sentence is output.
The character-based feature set of the question sentence to be matched is determined as follows:
the third character-vector matrix is input into a BiGRU neural network layer for deep feature extraction, then passed through an encoded layer and a soft-attention layer respectively; the outputs of the encoded layer and the soft-attention layer are spliced and input into a BiGRU neural network layer again for feature extraction; dimensionality is then reduced through a pooling layer, and the character-based feature set of the question sentence to be matched is output.
The character-based feature set of a similar candidate question sentence is determined as follows:
the fourth character-vector matrix is input into a BiGRU neural network layer for deep feature extraction, then passed through an encoded layer and a soft-attention layer respectively; the outputs of the encoded layer and the soft-attention layer are spliced and input into a BiGRU neural network layer again for feature extraction; dimensionality is then reduced through a pooling layer, and the character-based feature set of the similar candidate question sentence is output.
BiGRU is a variant of the LSTM structure with an update gate and a reset gate, which strengthens the semantic understanding of contextual relationships; the soft-attention layer aligns the deep information obtained after feature extraction; the encoded layer encodes the information obtained after feature extraction. The pooling layer includes avg-pooling and max-pooling.
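The soft-attention alignment and the avg-/max-pooling mentioned above can be illustrated with a minimal numerical sketch. The BiGRU encoders are replaced here by random "encoded" sequences, and all shapes and names are assumptions for demonstration, not the application's trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention(a, b):
    """Align each position of sequence a against sequence b, and vice versa."""
    scores = a @ b.T                           # (len_a, len_b) similarity scores
    a_aligned = softmax(scores, axis=1) @ b    # b summarized for each position of a
    b_aligned = softmax(scores.T, axis=1) @ a  # a summarized for each position of b
    return a_aligned, b_aligned

def pool(seq):
    """Reduce a (length, dim) sequence to a fixed vector via avg- and max-pooling."""
    return np.concatenate([seq.mean(axis=0), seq.max(axis=0)])

rng = np.random.default_rng(1)
enc_a = rng.normal(size=(5, 6))   # stand-in "encoded" question to be matched
enc_b = rng.normal(size=(7, 6))   # stand-in "encoded" similar candidate question
a_att, b_att = soft_attention(enc_a, enc_b)
# Splice the encoded sequence with its attention-aligned counterpart, then pool.
feat_a = pool(np.concatenate([enc_a, a_att], axis=1))
```

Sequences of different lengths (5 and 7 here) still produce fixed-size feature vectors after pooling, which is what allows the four feature sets to be spliced later.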
After the four feature sets are spliced, the similarity between a similar candidate question sentence and the question sentence to be matched is obtained through a sigmoid function. Specifically:
the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence are spliced to form a final second text-feature-vector set; after dimensionality reduction, the second text-feature-vector set is input into a sigmoid function, which outputs a value that is the similarity between the question sentence to be matched and the similar candidate question sentence.
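The final similarity computation can be sketched as follows. The feature dimensions are made up, and the random projection weights stand in for a trained dense (dimensionality-reduction) layer; both are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
word_query = rng.normal(size=24)   # word-based features, question to be matched
word_cand = rng.normal(size=24)    # word-based features, similar candidate
char_query = rng.normal(size=24)   # character-based features, question to be matched
char_cand = rng.normal(size=24)    # character-based features, similar candidate

# Splice the four feature sets into the "second text-feature-vector set".
features = np.concatenate([word_query, word_cand, char_query, char_cand])
# A random vector stands in for the trained projection that reduces it to a scalar.
w = rng.normal(size=features.shape[0]) / np.sqrt(features.shape[0])
similarity = sigmoid(features @ w)  # a single value in (0, 1)
```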
S160: the similarities between the similar candidate question sentences and the question sentence to be matched are sorted in descending order, and the sorting result within a set number of places is obtained; the similarities between the candidate question sentences and the question sentence to be matched are likewise sorted in descending order, and the sorting result within the set number of places is obtained; the two sorting results are taken as the two variables of the Pearson correlation coefficient formula, and the correlation coefficient is calculated. If the correlation coefficient reaches a set threshold, the candidate question sentence ranked first in similarity to the question sentence to be matched is the semantic matching result; if the correlation coefficient is below the set threshold, at least one question sentence having the set similarity to the question sentence to be matched is retrieved again from the specified database by the search engine, and S120 is performed.
In step S160, the Pearson correlation coefficient, denoted by the lowercase letter r, is calculated as follows:
r = \frac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2}}
r is the Pearson correlation coefficient, ranging from -1 to 1; the larger the value, the stronger the correlation. X is the ranked similarity sequence computed by the semantic recall network, Y is the ranked similarity sequence computed by the value matching network, and n is the set number of places, chosen as 5 in this embodiment. A high coefficient indicates that the semantic recall network matches well; a low coefficient indicates that it matches poorly. The threshold may be set to 0.7.
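Step S160 can be sketched directly from the formula above, with n = 5 and the 0.7 threshold from this embodiment; the two top-5 similarity sequences below are made-up sample values for illustration:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient, term-by-term from the formula above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

# Hypothetical top-5 similarity rankings from the two networks (X and Y).
recall_top5 = [0.92, 0.88, 0.81, 0.77, 0.70]  # semantic recall network (X)
value_top5 = [0.95, 0.90, 0.84, 0.74, 0.69]   # value matching network (Y)

r = pearson(recall_top5, value_top5)
match_ok = r >= 0.7  # threshold from the embodiment; otherwise re-retrieve and redo S120
```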
Further, the present application may store the ranked similarity sequence between the question sentence to be matched and the similar candidate question sentences, the top-five ranked similarity sequence between the question sentence to be matched and the candidate question sentences, and the Pearson correlation coefficient, as training data. Customers' like/dislike behavior may also be recorded, and the feedback data returned to the value evaluation network as training data.
In the present application, the respective results of the semantic recall network and the value evaluation network are used as new training data for retraining, which amounts to a round of data augmentation. The final similarity is obtained mainly by vector operations through the different layers of the neural network. During the adversarial training process, the training data is used fully and repeatedly; applied in a patient-education question-answering system, this effectively compensates for insufficient matching data for some diseases. On the one hand, it saves data-collection time; on the other hand, it greatly reduces the burden of manual maintenance and iterative upgrades.
Embodiment 2
FIG. 2 is a schematic diagram of the logical structure of a semantic matching system based on an internal adversarial mechanism according to Embodiment 2 of the present application.
As shown in FIG. 2, a semantic matching system based on an internal adversarial mechanism includes: a word and character segmentation unit for the question sentence to be matched and the candidate question sentences; a sentence-vector feature set forming unit for the question sentence to be matched and the candidate question sentences; a semantic similarity determination unit for the question sentence to be matched and the candidate question sentences; a similar candidate question sentence determination unit; a semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences; and a semantic matching result determination unit.
Word and character segmentation unit for the question sentence to be matched and the candidate question sentences: configured to perform word segmentation and character segmentation on the question sentence to be matched and on the candidate question sentences, respectively.
Sentence-vector feature set forming unit for the question sentence to be matched and the candidate question sentences: configured to perform word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine a sentence-vector feature set of the word-based sentence pair of the question sentence to be matched and the candidate question sentence; and to perform character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine a sentence-vector feature set of the character-based sentence pair of the question sentence to be matched and the candidate question sentence; wherein the candidate question sentences are at least one question sentence retrieved from a specified database by a search engine and having a set similarity to the question sentence to be matched.
Semantic similarity determination unit for the question sentence to be matched and the candidate question sentences: configured to splice the sentence-vector feature set of the word-based sentence pair with the sentence-vector feature set of the character-based sentence pair, and determine the similarity between the candidate question sentence and the question sentence to be matched.
Similar candidate question sentence determination unit: configured to sort the similarities between the candidate question sentences and the question sentence to be matched in descending order, and select the candidate question sentences whose similarity ranks within a set number of places as similar candidate question sentences.
Semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences: configured to perform word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched, so as to respectively determine the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence; and, after splicing the four determined feature sets, determine the similarity between the similar candidate question sentence and the question sentence to be matched.
Semantic matching result determination unit: configured to sort the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, in descending order, respectively, to obtain the sorting results within the set number of places; take the two sorting results as the two variables of the Pearson correlation coefficient formula and calculate the correlation coefficient; if the correlation coefficient reaches a set threshold, take the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and, if the correlation coefficient is below the set threshold, retrieve again, by the search engine, at least one question sentence having the set similarity to the question sentence to be matched from the specified database, and calculate the similarity anew.
Embodiment 3
FIG. 3 is a schematic diagram of the logical structure of an electronic device according to Embodiment 3 of the present application.
As shown in FIG. 3, an electronic device 1 includes a memory 3 and a processor 2. The memory 3 stores a computer program 4, and when the computer program 4 is executed by the processor 2, the steps of the semantic matching method based on an internal adversarial mechanism described in Embodiment 1 are implemented.
Embodiment 4
A computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium includes a semantic matching program based on an internal adversarial mechanism; when the semantic matching program based on an internal adversarial mechanism is executed by a processor, the steps of the semantic matching method based on an internal adversarial mechanism described in Embodiment 1 are implemented.
The semantic matching method, system, device, and storage medium based on an internal adversarial mechanism according to the present application have been described above by way of example with reference to FIG. 1, FIG. 2, and FIG. 3. However, those skilled in the art should understand that various improvements can be made to the above method, system, device, and storage medium without departing from the content of the present application. Therefore, the protection scope of the present application should be determined by the content of the appended claims.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can easily conceive within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A semantic matching method based on an internal adversarial mechanism, comprising the following steps:
    S110: performing word segmentation and character segmentation on a question sentence to be matched and on candidate question sentences, respectively;
    S120: performing word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine a sentence-vector feature set of the word-based sentence pair of the question sentence to be matched and the candidate question sentence; and performing character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine a sentence-vector feature set of the character-based sentence pair of the question sentence to be matched and the candidate question sentence; wherein the candidate question sentences are at least one question sentence retrieved from a specified database by a search engine and having a set similarity to the question sentence to be matched;
    S130: splicing the sentence-vector feature set of the word-based sentence pair with the sentence-vector feature set of the character-based sentence pair, and determining the similarity between the candidate question sentence and the question sentence to be matched;
    S140: sorting the similarities between the candidate question sentences and the question sentence to be matched in descending order, and selecting the candidate question sentences whose similarity ranks within a set number of places as similar candidate question sentences;
    S150: performing word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and performing character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched, so as to respectively determine a word-based feature set of the question sentence to be matched, a word-based feature set of the similar candidate question sentence, a character-based feature set of the question sentence to be matched, and a character-based feature set of the similar candidate question sentence; and, after splicing the four determined feature sets, determining the similarity between the similar candidate question sentence and the question sentence to be matched;
    S160: sorting the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, in descending order, respectively, to obtain the sorting results within the set number of places; taking the two sorting results as the two variables of the Pearson correlation coefficient formula and calculating the correlation coefficient; if the correlation coefficient reaches a set threshold, taking the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and, if the correlation coefficient is below the set threshold, retrieving again, by the search engine, at least one question sentence having the set similarity to the question sentence to be matched from the specified database, and performing S120.
  2. The semantic matching method based on an internal adversarial mechanism according to claim 1, wherein, in S110,
    the word segmentation comprises: after removing stop words and special symbols from the question sentence to be matched, performing word segmentation with a deep-learning tokenizer; and after removing stop words and special symbols from the candidate question sentences, performing word segmentation with a deep-learning tokenizer;
    the character segmentation comprises: after removing stop words and special symbols from the question sentence to be matched, performing character segmentation with a deep-learning tokenizer; and after removing stop words and special symbols from the candidate question sentences, performing character segmentation with a deep-learning tokenizer.
  3. The semantic matching method based on an internal adversarial mechanism according to claim 1, wherein, in S120, the process of performing word vectorization on each word-segmented candidate question sentence and on the word-segmented question sentence to be matched, to determine the sentence-vector feature set of the word-based sentence pair of the question sentence to be matched and the candidate question sentence, comprises:
    splicing the word vectors obtained by subjecting the word-segmented question sentence to be matched to Pre-train Embedding processing and train Embedding processing, to form a first word-vector matrix; and splicing the word vectors obtained by subjecting the candidate question sentence to Pre-train Embedding processing and train Embedding processing, to form a second word-vector matrix;
    performing feature extraction on the first word-vector matrix and the second word-vector matrix, respectively, to determine a word-based sentence-vector feature set of the question sentence to be matched and a word-based sentence-vector feature set of the candidate question sentence;
    performing dimensionality reduction on the word-based sentence-vector feature set of the question sentence to be matched and on the word-based sentence-vector feature set of the candidate question sentence, respectively;
    splicing the dimension-reduced word-based sentence-vector feature sets of the question sentence to be matched and of the candidate question sentence together, to obtain the sentence-vector feature set of the word-based sentence pair of the question sentence to be matched and the candidate question sentence.
  4. The semantic matching method based on an internal adversarial mechanism according to claim 3, wherein, in S120, the process of performing character vectorization on each character-segmented candidate question sentence and on the character-segmented question sentence to be matched, to determine the sentence-vector feature set of the character-based sentence pair of the question sentence to be matched and the candidate question sentence, comprises:
    splicing the character vectors obtained by subjecting the character-segmented question sentence to be matched to Pre-train Embedding processing and train Embedding processing, to form a first character-vector matrix; and splicing the character vectors obtained by subjecting the candidate question sentence to Pre-train Embedding processing and train Embedding processing, to form a second character-vector matrix;
    performing feature extraction on the first character-vector matrix and the second character-vector matrix, respectively, to determine a character-based sentence-vector feature set of the question sentence to be matched and a character-based sentence-vector feature set of the candidate question sentence;
    performing dimensionality reduction on the character-based sentence-vector feature set of the question sentence to be matched and on the character-based sentence-vector feature set of the candidate question sentence, respectively;
    splicing the dimension-reduced character-based sentence-vector feature sets of the question sentence to be matched and of the candidate question sentence together, to obtain the sentence-vector feature set of the character-based sentence pair of the question sentence to be matched and the candidate question sentence.
  5. The semantic matching method based on an internal adversarial mechanism according to claim 1, wherein, in S130,
    the outputs obtained by respectively performing vector subtraction, vector multiplication, and vector maximization on the sentence-vector feature set of the word-based sentence pair, and the outputs obtained by respectively performing vector subtraction, vector multiplication, and vector maximization on the sentence-vector feature set of the character-based sentence pair, are spliced to form a first text-feature-vector set; after dimensionality reduction, the first text-feature-vector set is input into a sigmoid function to determine the similarity between the question sentence to be matched and the candidate question sentence.
  6. The semantic matching method based on an internal adversarial mechanism according to claim 1, wherein, in S150, the process of performing word vectorization on each similar candidate question sentence and on the word-segmented question sentence to be matched, and performing character vectorization on each similar candidate question sentence and on the character-segmented question sentence to be matched, comprises:
    splicing the word vectors obtained by subjecting the word-segmented question sentence to be matched to Pre-train Embedding processing and train Embedding processing, to form a third word-vector matrix; and splicing the word vectors obtained by subjecting the similar candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a fourth word-vector matrix; and
    splicing the character vectors obtained by subjecting the character-segmented question sentence to be matched to Pre-train Embedding processing and train Embedding processing, to form a third character-vector matrix; and splicing the character vectors obtained by subjecting the similar candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a fourth character-vector matrix.
  7. The semantic matching method based on an internal adversarial mechanism according to claim 6, wherein, in S150, the process of respectively determining the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence, and, after splicing the four determined feature sets, determining the similarity between the similar candidate question sentence and the question sentence to be matched, comprises:
    performing feature extraction and dimensionality reduction on the third word-vector matrix, the fourth word-vector matrix, the third character-vector matrix, and the fourth character-vector matrix, respectively, to obtain the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentence, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentence;
    splicing the four feature sets to form a second text-feature-vector set; and, after dimensionality reduction, inputting the second text-feature-vector set into a sigmoid function to determine the similarity between the question sentence to be matched and the similar candidate question sentence.
  8. A semantic matching system based on an internal adversarial mechanism, comprising:
    a word and character segmentation unit for the question sentence to be matched and the candidate question sentences, configured to perform word segmentation and character segmentation on the question sentence to be matched and the candidate question sentences respectively;
    a sentence vector feature set forming unit for the question sentence to be matched and the candidate question sentences, configured to perform word vectorization on each candidate question sentence after word segmentation and on the question sentence to be matched after word segmentation, to determine a sentence vector feature set of word-based sentence pairs of the question sentence to be matched and the candidate question sentences; and to perform character vectorization on each candidate question sentence after character segmentation and on the question sentence to be matched after character segmentation, to determine a sentence vector feature set of character-based sentence pairs of the question sentence to be matched and the candidate question sentences; wherein the candidate question sentences are at least one question sentence that is retrieved from a specified database through a search engine and has a set similarity to the question sentence to be matched;
    a semantic similarity determination unit for the question sentence to be matched and the candidate question sentences, configured to concatenate the sentence vector feature set of the word-based sentence pairs and the sentence vector feature set of the character-based sentence pairs, to determine the similarity between the candidate question sentences and the question sentence to be matched;
    a similar candidate question sentence determination unit, configured to sort the similarities between the candidate question sentences and the question sentence to be matched in descending order, and select the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences;
    a semantic similarity determination unit for the question sentence to be matched and the similar candidate question sentences, configured to perform word vectorization on each similar candidate question sentence and on the question sentence to be matched after word segmentation, and to perform character vectorization on each similar candidate question sentence and on the question sentence to be matched after character segmentation; to separately determine a word-based feature set of the question sentence to be matched, a word-based feature set of the similar candidate question sentences, a character-based feature set of the question sentence to be matched, and a character-based feature set of the similar candidate question sentences; and, after concatenating the four determined feature sets, to determine the similarity between the similar candidate question sentences and the question sentence to be matched;
    a semantic matching result determination unit, configured to sort, in descending order, the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, respectively, and obtain the sorting results within a set ranking; use the two sorting results as the two variables of the Pearson correlation coefficient formula to calculate a correlation coefficient; if the correlation coefficient reaches a set threshold, take the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and if the correlation coefficient is lower than the set threshold, retrieve again, through the search engine, at least one question sentence having a set similarity to the question sentence to be matched from the specified database and calculate the similarity.
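The ranking-consistency check performed by the semantic matching result determination unit can be sketched as follows (illustrative only; the threshold value 0.8 and the example top-5 rankings are hypothetical, since the patent leaves the set threshold and set ranking unspecified):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length rankings."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def rankings_agree(stage1_ranks, stage2_ranks, threshold=0.8):
    """True when the coarse (search-engine) ranking and the fine
    (feature-based) ranking are consistent enough to accept the
    top-1 candidate as the semantic-matching result."""
    return pearson(stage1_ranks, stage2_ranks) >= threshold

# Hypothetical rank positions assigned to five candidates by each stage.
stage1 = [1, 2, 3, 4, 5]
stage2 = [1, 3, 2, 4, 5]
agree = rankings_agree(stage1, stage2)  # r = 0.9 >= 0.8, so True
```

If `rankings_agree` returns `False`, the unit falls back to re-retrieving candidates and recomputing similarities, mirroring the re-retrieval branch of the claim.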
  9. An electronic device, comprising a memory and a processor connected to each other, wherein the memory is configured to store a computer program, the computer program is configured to be executed by the processor, and the computer program is configured to perform a semantic matching method based on an internal adversarial mechanism:
    wherein the method comprises:
    S110: performing word segmentation and character segmentation on a question sentence to be matched and candidate question sentences respectively;
    S120: performing word vectorization on each candidate question sentence after word segmentation and on the question sentence to be matched after word segmentation, to determine a sentence vector feature set of word-based sentence pairs of the question sentence to be matched and the candidate question sentences; and performing character vectorization on each candidate question sentence after character segmentation and on the question sentence to be matched after character segmentation, to determine a sentence vector feature set of character-based sentence pairs of the question sentence to be matched and the candidate question sentences; wherein the candidate question sentences are at least one question sentence that is retrieved from a specified database through a search engine and has a set similarity to the question sentence to be matched;
    S130: concatenating the sentence vector feature set of the word-based sentence pairs and the sentence vector feature set of the character-based sentence pairs to determine the similarity between the candidate question sentences and the question sentence to be matched;
    S140: sorting the similarities between the candidate question sentences and the question sentence to be matched in descending order, and selecting the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences;
    S150: performing word vectorization on each similar candidate question sentence and on the question sentence to be matched after word segmentation, and performing character vectorization on each similar candidate question sentence and on the question sentence to be matched after character segmentation; separately determining a word-based feature set of the question sentence to be matched, a word-based feature set of the similar candidate question sentences, a character-based feature set of the question sentence to be matched, and a character-based feature set of the similar candidate question sentences; and, after concatenating the four determined feature sets, determining the similarity between the similar candidate question sentences and the question sentence to be matched;
    S160: sorting, in descending order, the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, respectively, and obtaining the sorting results within a set ranking; using the two sorting results as the two variables of the Pearson correlation coefficient formula to calculate a correlation coefficient; if the correlation coefficient reaches a set threshold, taking the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and if the correlation coefficient is lower than the set threshold, retrieving again, through the search engine, at least one question sentence having a set similarity to the question sentence to be matched from the specified database and performing S120.
  10. The electronic device according to claim 9, wherein in S110,
    the word segmentation comprises: after removing stop words and special symbols from the question sentence to be matched, performing word segmentation using a deep-learning tokenizer; and after removing stop words and special symbols from the candidate question sentences, performing word segmentation using a deep-learning tokenizer;
    the character segmentation comprises: after removing stop words and special symbols from the question sentence to be matched, performing character segmentation using a deep-learning tokenizer; and after removing stop words and special symbols from the candidate question sentences, performing character segmentation using a deep-learning tokenizer.
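The pre-processing and two-granularity segmentation of S110 might be sketched as follows (an illustrative Python sketch; the stop-word list and the symbol-stripping regular expression are hypothetical, and the pluggable `tokenizer` callable stands in for the unspecified deep-learning tokenizer):

```python
import re

STOP_WORDS = {"的", "了", "吗", "呢"}  # hypothetical stop-word list

def clean(sentence):
    """Remove special symbols: keep only word characters and CJK
    ideographs, per the de-noising step of S110."""
    return re.sub(r"[^\w\u4e00-\u9fff]", "", sentence)

def char_segment(sentence):
    """Character-level segmentation: one token per character,
    with stop words dropped."""
    return [c for c in clean(sentence) if c not in STOP_WORDS]

def word_segment(sentence, tokenizer):
    """Word-level segmentation via a pluggable tokenizer callable
    (standing in for the deep-learning tokenizer of the claim)."""
    return [w for w in tokenizer(clean(sentence)) if w not in STOP_WORDS]

chars = char_segment("什么是语义匹配？")  # fullwidth '？' is stripped
```

Both granularities share the same cleaning pass, so the word-based and character-based pipelines downstream see consistent text.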
  11. The electronic device according to claim 9, wherein in S120, the process of performing word vectorization on each candidate question sentence after word segmentation and on the question sentence to be matched after word segmentation, to determine the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences, comprises:
    concatenating the word vectors obtained by separately subjecting the question sentence to be matched after word segmentation to Pre-train Embedding processing and train Embedding processing, to form a first word vector matrix; and concatenating the word vectors obtained by separately subjecting the candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a second word vector matrix;
    performing feature extraction on the first word vector matrix and the second word vector matrix respectively, to determine a word-based sentence vector feature set of the question sentence to be matched and a word-based sentence vector feature set of the candidate question sentences;
    performing dimensionality reduction on the word-based sentence vector feature set of the question sentence to be matched and on the word-based sentence vector feature set of the candidate question sentences respectively;
    concatenating the dimension-reduced word-based sentence vector feature set of the question sentence to be matched with the dimension-reduced word-based sentence vector feature set of the candidate question sentences, to obtain the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences.
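The dual-channel embedding and pooling pipeline of claim 11 can be sketched as follows (illustrative only; `DIM`, the randomly initialized embedding tables, and mean-pooling as "feature extraction" plus a linear projection as "dimensionality reduction" are hypothetical simplifications of the learned layers):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hypothetical per-channel embedding size

def dual_channel_matrix(tokens, pretrained, trainable):
    """Per claim 11: for each token, concatenate its Pre-train
    Embedding vector with its train Embedding vector, yielding an
    (n_tokens, 2*DIM) word vector matrix."""
    rows = []
    for t in tokens:
        pre = pretrained.setdefault(t, rng.standard_normal(DIM))
        trn = trainable.setdefault(t, rng.standard_normal(DIM))
        rows.append(np.concatenate([pre, trn]))
    return np.stack(rows)

def sentence_feature(matrix, w):
    """Toy 'feature extraction + dimensionality reduction':
    mean-pool over tokens, then apply a linear projection."""
    return matrix.mean(axis=0) @ w

pretrained, trainable = {}, {}
w = rng.standard_normal((2 * DIM, 4))  # reduce to a 4-D sentence feature
q = dual_channel_matrix(["什么", "是", "语义"], pretrained, trainable)
c = dual_channel_matrix(["语义", "匹配"], pretrained, trainable)
# Sentence-pair feature set: query features next to candidate features.
pair_features = np.concatenate([sentence_feature(q, w), sentence_feature(c, w)])
```

Because the two sentences share the same embedding tables, a token that occurs in both (here "语义") contributes identical row content to both matrices, which is what makes the pair features comparable.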
  12. The electronic device according to claim 11, wherein in S120, the process of performing character vectorization on each candidate question sentence after character segmentation and on the question sentence to be matched after character segmentation, to determine the sentence vector feature set of the character-based sentence pairs of the question sentence to be matched and the candidate question sentences, comprises:
    concatenating the character vectors obtained by separately subjecting the question sentence to be matched after character segmentation to Pre-train Embedding processing and train Embedding processing, to form a first character vector matrix; and concatenating the character vectors obtained by separately subjecting the candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a second character vector matrix;
    performing feature extraction on the first character vector matrix and the second character vector matrix respectively, to determine a character-based sentence vector feature set of the question sentence to be matched and a character-based sentence vector feature set of the candidate question sentences;
    performing dimensionality reduction on the character-based sentence vector feature set of the question sentence to be matched and on the character-based sentence vector feature set of the candidate question sentences respectively;
    concatenating the dimension-reduced character-based sentence vector feature set of the question sentence to be matched with the dimension-reduced character-based sentence vector feature set of the candidate question sentences, to obtain the sentence vector feature set of the character-based sentence pairs of the question sentence to be matched and the candidate question sentences.
  13. The electronic device according to claim 9, wherein in S130,
    the outputs obtained by separately applying vector subtraction, vector multiplication, and vector maximization to the sentence vector feature set of the word-based sentence pairs, and the outputs obtained by separately applying vector subtraction, vector multiplication, and vector maximization to the sentence vector feature set of the character-based sentence pairs, are concatenated to form a first text feature vector set; after dimensionality reduction is performed on the first text feature vector set, the result is input into a sigmoid function to determine the similarity between the question sentence to be matched and the candidate question sentences.
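The three interaction operations of S130 can be sketched as follows (illustrative; the 2-dimensional example vectors are hypothetical, and elementwise subtraction, multiplication, and maximum are assumed, as is common for such matching features):

```python
import numpy as np

def interaction_features(a, b):
    """Per S130: elementwise subtraction, multiplication, and
    maximization of the two sentence vectors, concatenated."""
    return np.concatenate([a - b, a * b, np.maximum(a, b)])

def first_text_feature_set(word_a, word_b, char_a, char_b):
    """Concatenate the word-level and character-level interaction
    outputs into the first text feature vector set."""
    return np.concatenate([interaction_features(word_a, word_b),
                           interaction_features(char_a, char_b)])

# Hypothetical word-based and character-based sentence vectors.
wa, wb = np.array([1.0, 2.0]), np.array([3.0, 1.0])
ca, cb = np.array([0.5, -1.0]), np.array([0.0, 2.0])
fused = first_text_feature_set(wa, wb, ca, cb)  # 12-dimensional
```

The fused vector would then pass through the dimensionality reduction and sigmoid of S130 to yield the similarity score.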
  14. The electronic device according to claim 9, wherein in S150, the process of performing word vectorization on each similar candidate question sentence and on the question sentence to be matched after word segmentation, and performing character vectorization on each similar candidate question sentence and on the question sentence to be matched after character segmentation, comprises:
    concatenating the word vectors obtained by separately subjecting the question sentence to be matched after word segmentation to Pre-train Embedding processing and train Embedding processing, to form a third word vector matrix; and concatenating the word vectors obtained by separately subjecting the similar candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a fourth word vector matrix; and
    concatenating the character vectors obtained by separately subjecting the question sentence to be matched after character segmentation to Pre-train Embedding processing and train Embedding processing, to form a third character vector matrix; and concatenating the character vectors obtained by separately subjecting the similar candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a fourth character vector matrix.
  15. The electronic device according to claim 14, wherein in S150, the process of separately determining the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentences, and determining the similarity between the similar candidate question sentences and the question sentence to be matched after concatenating the four determined feature sets, comprises:
    performing feature extraction and dimensionality reduction on the third word vector matrix, the fourth word vector matrix, the third character vector matrix, and the fourth character vector matrix respectively, to obtain the word-based feature set of the question sentence to be matched, the word-based feature set of the similar candidate question sentences, the character-based feature set of the question sentence to be matched, and the character-based feature set of the similar candidate question sentences;
    concatenating the four feature sets to form a second text feature vector set, performing dimensionality reduction on the second text feature vector set, and then inputting the result into a sigmoid function to determine the similarity between the question sentence to be matched and the similar candidate question sentences.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, implements a semantic matching method based on an internal adversarial mechanism, the method comprising the following steps:
    S110: performing word segmentation and character segmentation on a question sentence to be matched and candidate question sentences respectively;
    S120: performing word vectorization on each candidate question sentence after word segmentation and on the question sentence to be matched after word segmentation, to determine a sentence vector feature set of word-based sentence pairs of the question sentence to be matched and the candidate question sentences; and performing character vectorization on each candidate question sentence after character segmentation and on the question sentence to be matched after character segmentation, to determine a sentence vector feature set of character-based sentence pairs of the question sentence to be matched and the candidate question sentences; wherein the candidate question sentences are at least one question sentence that is retrieved from a specified database through a search engine and has a set similarity to the question sentence to be matched;
    S130: concatenating the sentence vector feature set of the word-based sentence pairs and the sentence vector feature set of the character-based sentence pairs to determine the similarity between the candidate question sentences and the question sentence to be matched;
    S140: sorting the similarities between the candidate question sentences and the question sentence to be matched in descending order, and selecting the candidate question sentences whose similarity ranks within a set ranking as similar candidate question sentences;
    S150: performing word vectorization on each similar candidate question sentence and on the question sentence to be matched after word segmentation, and performing character vectorization on each similar candidate question sentence and on the question sentence to be matched after character segmentation; separately determining a word-based feature set of the question sentence to be matched, a word-based feature set of the similar candidate question sentences, a character-based feature set of the question sentence to be matched, and a character-based feature set of the similar candidate question sentences; and, after concatenating the four determined feature sets, determining the similarity between the similar candidate question sentences and the question sentence to be matched;
    S160: sorting, in descending order, the similarities between the similar candidate question sentences and the question sentence to be matched, and the similarities between the candidate question sentences and the question sentence to be matched, respectively, and obtaining the sorting results within a set ranking; using the two sorting results as the two variables of the Pearson correlation coefficient formula to calculate a correlation coefficient; if the correlation coefficient reaches a set threshold, taking the candidate question sentence ranked first in similarity to the question sentence to be matched as the semantic matching result; and if the correlation coefficient is lower than the set threshold, retrieving again, through the search engine, at least one question sentence having a set similarity to the question sentence to be matched from the specified database and performing S120.
  17. The computer-readable storage medium according to claim 16, wherein in S110,
    the word segmentation comprises: after removing stop words and special symbols from the question sentence to be matched, performing word segmentation using a deep-learning tokenizer; and after removing stop words and special symbols from the candidate question sentences, performing word segmentation using a deep-learning tokenizer;
    the character segmentation comprises: after removing stop words and special symbols from the question sentence to be matched, performing character segmentation using a deep-learning tokenizer; and after removing stop words and special symbols from the candidate question sentences, performing character segmentation using a deep-learning tokenizer.
  18. The computer-readable storage medium according to claim 16, wherein in S120, the process of performing word vectorization on each candidate question sentence after word segmentation and on the question sentence to be matched after word segmentation, to determine the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences, comprises:
    concatenating the word vectors obtained by separately subjecting the question sentence to be matched after word segmentation to Pre-train Embedding processing and train Embedding processing, to form a first word vector matrix; and concatenating the word vectors obtained by separately subjecting the candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a second word vector matrix;
    performing feature extraction on the first word vector matrix and the second word vector matrix respectively, to determine a word-based sentence vector feature set of the question sentence to be matched and a word-based sentence vector feature set of the candidate question sentences;
    performing dimensionality reduction on the word-based sentence vector feature set of the question sentence to be matched and on the word-based sentence vector feature set of the candidate question sentences respectively;
    concatenating the dimension-reduced word-based sentence vector feature set of the question sentence to be matched with the dimension-reduced word-based sentence vector feature set of the candidate question sentences, to obtain the sentence vector feature set of the word-based sentence pairs of the question sentence to be matched and the candidate question sentences.
  19. The computer-readable storage medium according to claim 18, wherein in S120, the process of performing character vectorization on each candidate question sentence after character segmentation and on the question sentence to be matched after character segmentation, to determine the sentence vector feature set of the character-based sentence pairs of the question sentence to be matched and the candidate question sentences, comprises:
    concatenating the character vectors obtained by separately subjecting the question sentence to be matched after character segmentation to Pre-train Embedding processing and train Embedding processing, to form a first character vector matrix; and concatenating the character vectors obtained by separately subjecting the candidate question sentences to Pre-train Embedding processing and train Embedding processing, to form a second character vector matrix;
    performing feature extraction on the first character vector matrix and the second character vector matrix respectively, to determine a character-based sentence vector feature set of the question sentence to be matched and a character-based sentence vector feature set of the candidate question sentences;
    performing dimensionality reduction on the character-based sentence vector feature set of the question sentence to be matched and on the character-based sentence vector feature set of the candidate question sentences respectively;
    concatenating the dimension-reduced character-based sentence vector feature set of the question sentence to be matched with the dimension-reduced character-based sentence vector feature set of the candidate question sentences, to obtain the sentence vector feature set of the character-based sentence pairs of the question sentence to be matched and the candidate question sentences.
  20. The computer-readable storage medium according to claim 16, wherein in S130,
    the outputs obtained by separately applying vector subtraction, vector multiplication, and vector maximization to the sentence vector feature set of the word-based sentence pairs, and the outputs obtained by separately applying vector subtraction, vector multiplication, and vector maximization to the sentence vector feature set of the character-based sentence pairs, are concatenated to form a first text feature vector set; after dimensionality reduction is performed on the first text feature vector set, the result is input into a sigmoid function to determine the similarity between the question sentence to be matched and the candidate question sentences.
PCT/CN2020/117422 2020-02-26 2020-09-24 Semantic matching method and device based on internal adversarial mechanism, and storage medium WO2021169263A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010119430.0A CN111427995B (en) 2020-02-26 2020-02-26 Semantic matching method, device and storage medium based on internal countermeasure mechanism
CN202010119430.0 2020-02-26

Publications (1)

Publication Number Publication Date
WO2021169263A1 2021-09-02

Family

ID=71547247

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117422 WO2021169263A1 (en) 2020-02-26 2020-09-24 Semantic matching method and device based on internal adversarial mechanism, and storage medium

Country Status (2)

Country Link
CN (1) CN111427995B (en)
WO (1) WO2021169263A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114048354A (en) * 2022-01-10 2022-02-15 广州启辰电子科技有限公司 Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN116312968A (en) * 2023-02-09 2023-06-23 广东德澳智慧医疗科技有限公司 Psychological consultation and healing system based on man-machine conversation and core algorithm

Families Citing this family (11)

Publication number Priority date Publication date Assignee Title
CN111427995B (en) * 2020-02-26 2023-05-26 平安科技(深圳)有限公司 Semantic matching method, device and storage medium based on internal countermeasure mechanism
CN111859986B (en) * 2020-07-27 2023-06-20 中国平安人寿保险股份有限公司 Semantic matching method, device, equipment and medium based on multi-task twin network
CN112149424A (en) * 2020-08-10 2020-12-29 招联消费金融有限公司 Semantic matching method and device, computer equipment and storage medium
CN112287069B (en) * 2020-10-29 2023-07-25 平安科技(深圳)有限公司 Information retrieval method and device based on voice semantics and computer equipment
CN113204973A (en) * 2021-04-30 2021-08-03 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of answer-questions recognition model
CN112991346B (en) * 2021-05-13 2022-04-26 深圳科亚医疗科技有限公司 Training method and training system for learning network for medical image analysis
CN113407767A (en) * 2021-06-29 2021-09-17 北京字节跳动网络技术有限公司 Method and device for determining text relevance, readable medium and electronic equipment
CN113656547B (en) * 2021-08-17 2023-06-30 平安科技(深圳)有限公司 Text matching method, device, equipment and storage medium
CN114090747A (en) * 2021-10-14 2022-02-25 特斯联科技集团有限公司 Automatic question answering method, device, equipment and medium based on multiple semantic matching
CN113988073A (en) * 2021-10-26 2022-01-28 迪普佰奥生物科技(上海)股份有限公司 Text recognition method and system suitable for life science
CN116361839B (en) * 2023-05-26 2023-07-28 四川易景智能终端有限公司 Secret-related shielding method based on NLP

Citations (7)

Publication number Priority date Publication date Assignee Title
US20050164152A1 (en) * 2004-01-28 2005-07-28 Lawson James D. Compatibility assessment method
CN105893523A (en) * 2016-03-31 2016-08-24 华东师范大学 Method for calculating problem similarity with answer relevance ranking evaluation measurement
CN107980130A (en) * 2017-11-02 2018-05-01 深圳前海达闼云端智能科技有限公司 Automatic answering method, apparatus, storage medium and electronic device
CN108345585A (en) * 2018-01-11 2018-07-31 浙江大学 Automatic question-answering method based on deep learning
CN109271505A (en) * 2018-11-12 2019-01-25 深圳智能思创科技有限公司 Question-answering system implementation method based on question-answer pairs
CN110032632A (en) * 2019-04-04 2019-07-19 平安科技(深圳)有限公司 Intelligent customer service answering method, device and storage medium based on text similarity
CN111427995A (en) * 2020-02-26 2020-07-17 平安科技(深圳)有限公司 Semantic matching method and device based on internal countermeasure mechanism and storage medium

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN107291783B (en) * 2016-04-12 2021-04-30 芋头科技(杭州)有限公司 Semantic matching method and intelligent equipment
CN109101494A (en) * 2018-08-10 2018-12-28 哈尔滨工业大学(威海) Method, device and computer-readable storage medium for computing Chinese sentence semantic similarity
CN109948143B (en) * 2019-01-25 2023-04-07 网经科技(苏州)有限公司 Answer extraction method of community question-answering system
CN109992788B (en) * 2019-04-10 2023-08-29 鼎富智能科技有限公司 Deep text matching method and device based on unregistered word processing

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN114048354A (en) * 2022-01-10 2022-02-15 广州启辰电子科技有限公司 Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN114048354B (en) * 2022-01-10 2022-04-26 广州启辰电子科技有限公司 Test question retrieval method, device and medium based on multi-element characterization and metric learning
CN116312968A (en) * 2023-02-09 2023-06-23 广东德澳智慧医疗科技有限公司 Psychological consultation and healing system based on man-machine conversation and core algorithm

Also Published As

Publication number Publication date
CN111427995B (en) 2023-05-26
CN111427995A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
WO2021169263A1 (en) Semantic matching method and device based on internal adversarial mechanism, and storage medium
CN111415740B (en) Method and device for processing inquiry information, storage medium and computer equipment
WO2022227207A1 (en) Text classification method, apparatus, computer device, and storage medium
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN111709243B (en) Knowledge extraction method and device based on deep learning
CN109635083B (en) Document retrieval method for searching topic type query in TED (tele) lecture
US11874862B2 (en) Community question-answer website answer sorting method and system combined with active learning
CN116134432A (en) System and method for providing answers to queries
CN107832326B (en) Natural language question-answering method based on deep convolutional neural network
CN111737426A (en) Method for training question-answering model, computer equipment and readable storage medium
US11860932B2 (en) Scene graph embeddings using relative similarity supervision
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN112925918B (en) Question-answer matching system based on disease field knowledge graph
US20220284321A1 (en) Visual-semantic representation learning via multi-modal contrastive training
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN113221530A (en) Text similarity matching method and device based on circle loss, computer equipment and storage medium
CN112632250A (en) Question and answer method and system under multi-document scene
CN113673252A (en) Automatic join recommendation method for data table based on field semantics
Li Text recognition and classification of English teaching content based on SVM
US10970488B2 (en) Finding of asymmetric relation between words
WO2022061877A1 (en) Event extraction and extraction model training method, apparatus and device, and medium
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN110674293B (en) Text classification method based on semantic migration
CN112836027A (en) Method for determining text similarity, question answering method and question answering system
Li et al. Otcmr: Bridging heterogeneity gap with optimal transport for cross-modal retrieval

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20922386

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20922386

Country of ref document: EP

Kind code of ref document: A1