CN112364634A - Synonym matching method based on question sentence - Google Patents
Synonym matching method based on question sentence
- Publication number
- CN112364634A (application CN202011203497.9A)
- Authority
- CN
- China
- Prior art keywords
- question
- word
- vocabulary
- words
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Abstract
The invention discloses a synonym matching method based on question sentences. A mask-based processing method is applied to the user question: words in the user question that never appear in the question-answer pair vocabulary set are masked, their positions are then predicted from the other words in the question, and a probability distribution over the question-answer pair vocabulary set is output. In addition, a vocabulary vector table of the corpus vocabulary is obtained from a massive corpus set; the vocabulary covered by this table is far larger than the question-answer pair word set. From it, the probability that a word absent from the question-answer pair vocabulary set generates each word in that set is calculated. Finally, the local probability and the global probability are jointly considered to find the most similar word in the question-answer pair vocabulary set.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a question sentence-based synonym matching method.
Background
In a conventional vertical-domain automatic question-answering system based on FAQ, the questions and answers are usually collected and organized by domain experts, with one question corresponding to one answer; such a pair is also called a question-answer pair. In an existing question-answering system implemented with FAQ technology, the question posed by the user is compared against the questions listed in the system by sentence-similarity calculation one by one. When a word in the user's question does not appear in the question-answer pair set formed by the FAQ, word vectors obtained from massive corpora are used to process the user's question. The main drawbacks of this method are:
1. current question context information cannot be used. And the information is exactly that the word x does not appeariAnd key information of similarity of each word in the query-answer pair word set VF. The absence of this information will directly render the calculated similarity unusable.
2. Existing word vectors are computed with word2vec, a shallow neural network that has no multi-head representation, no self-attention, and uses no sentence-level information. Its word-vector representations are therefore of limited quality.
3. BERT cannot be used effectively. BERT, a recent natural-language-processing result, models language well by exploiting context through a self-attention mechanism and two pre-training tasks: masked-word prediction and next-sentence prediction. Its disadvantage is that the vocabulary cannot be too large, typically 30,000 to 50,000 entries. If BERT's vocabulary were to cover every word a user question might contain, it would need to expand to 500,000 entries or even more, which would make BERT untrainable.
In summary, in current word-vector-based methods for handling user-question words that do not appear in the question-answer pair word set V_F, the word-vector representation quality is poor and the context information of the current question cannot be used, so the matching accuracy is low.
Disclosure of Invention
To address these defects in the prior art, the question-based synonym matching method of the invention solves the problem of low matching accuracy caused by poor word-vector representation quality and the inability to use the context information of the current question.
In order to achieve the above purpose, the invention adopts the following technical scheme: a synonym matching method based on question sentences, comprising the following steps:
s1, constructing a question-answer vocabulary set, a vocabulary vector table of a corpus vocabulary table and a BERT training vocabulary table;
s2, training the BERT language model by adopting a BERT training vocabulary to obtain a trained BERT language model;
s3, obtaining a user question word sequence, performing mask processing on predicted words which do not appear in a question-answer pair word set in the user question word sequence, and calculating a distribution probability set, namely a local probability set, of mask words after the mask processing on the question-answer pair word set based on a trained BERT language model;
s4, calculating a global probability set of the predicted words on the question-answer word set according to a word vector table of a corpus word table;
s5, solving intersection of the local probability set and the global probability set, judging whether the intersection is empty, if not, skipping to S6, if yes, inputting a new user question word sequence, and skipping to S3;
s6, calculating the comprehensive probability of the predicted words according to the local probability set and the global probability set, finding out synonyms in the question-answer pair vocabulary set according to the comprehensive probability of the predicted words, and replacing the predicted words in the user question word sequence with the synonyms to obtain the standard question.
Further, the step S1 includes the following steps:
s11, constructing an original standard question-answer pair set FQA, and constructing a question-answer pair word collection based on words in the original standard question-answer pair set FQA;
s12, collecting corpora, constructing a corpus set, and constructing a BERT training vocabulary based on the corpus set;
and S13, constructing a corpus vocabulary according to the corpus set, and processing the corpus vocabulary by adopting a skip-gram method to obtain a vocabulary vector table of the corpus vocabulary.
Further, the step S3 includes the following sub-steps:
s31, obtaining a user question word sequence, and masking predicted words which do not appear in the question and answer pair vocabulary set in the user question word sequence to obtain masked words;
s32, calculating the distribution probability of the mask words on the BERT training vocabulary by adopting the trained BERT language model to obtain a distribution probability set of the mask words on the BERT training vocabulary;
s33, carrying out normalization processing on the distribution probability set on the BERT training vocabulary to obtain the distribution probability set of the mask words on the question-answer pair vocabulary, namely a local probability set.
Further, the step S4 includes the following sub-steps:
s41, finding out a predicted word vector corresponding to a predicted word and a question-answer word-pair vector of each word in a question-answer word set according to a word vector table of a corpus vocabulary table;
and S42, respectively solving inner products of the predicted word vectors and the question-answer pair word vectors, and carrying out normalization processing on the inner products by adopting softmax to obtain a global probability set of the predicted words on the question-answer pair word set.
Further, the calculation formula of the local probability set in step S3 is as follows:

p̂_j = p_j / Σ_{k=1}^{|V_F|} p_k

wherein p̂_j is the jth local probability in the local probability set, p_j is the jth distribution probability in the distribution probability set, and |V_F| is the total number of words in the question-answer pair vocabulary.
Further, the calculation formula of the global probability set in step S4 is as follows:

p(w_j | x_i) = softmax(v(w_j) · v(x_i)), w_j ∈ V_F

x = {x_1, x_2, …, x_i, …, x_T}

wherein p(w_j | x_i) is the jth global probability in the global probability set, v(w_j) is the word vector of the jth word in the question-answer pair word set, v(x_i) is the predicted-word vector, w_j is the jth word of the question-answer pair vocabulary, V_F is the question-answer pair word set, x is the user question word sequence, x_i is the ith word in x (i.e. the predicted word), and T is the total number of words in x.
Further, the formula for calculating the comprehensive probability of the predicted word in step S6 is as follows:

w = p̂_j · p(w_j | x_i)

wherein w is the comprehensive probability of the predicted word, p̂_j is the jth local probability in the local probability set, and p(w_j | x_i) is the jth global probability in the global probability set.
In conclusion, the beneficial effects of the invention are as follows:
(1) The invention provides a mask-based method for processing the user question: a word x_i that never appears in the question-answer pair vocabulary set is masked, its position is predicted from the other words in the sentence, and a probability distribution over the question-answer pair word set is output. In addition, a vocabulary vector table of the corpus vocabulary is obtained from the massive corpus set, and the vocabulary it covers is far larger than the question-answer pair word set. From it, the probability that x_i generates each word in the question-answer pair vocabulary set is calculated. Finally, the local probability and the global probability are jointly considered to find the word in the question-answer pair vocabulary set most similar to x_i.
(2) The invention can not only fully utilize the context information of the current user question, but also consider the prior context, thereby greatly improving the performance of sentence similarity calculation.
Drawings
Fig. 1 is a flowchart of a synonym matching method based on a question sentence.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. For those skilled in the art, various changes are possible without departing from the spirit and scope of the invention as defined in the appended claims, and everything produced using the inventive concept is protected.
As shown in fig. 1, a method for matching synonyms based on question sentences includes the following steps:
s1, constructing a question-answer vocabulary set, a vocabulary vector table of a corpus vocabulary table and a BERT training vocabulary table;
step S1 includes the following steps:
s11, constructing an original standard question-answer pair set FQA, and constructing a question-answer pair word collection based on words in the original standard question-answer pair set FQA;
s12, collecting corpora, constructing a corpus set, and constructing a BERT training vocabulary based on the corpus set;
and S13, constructing a corpus vocabulary according to the corpus set, and processing the corpus vocabulary by adopting a skip-gram method to obtain a vocabulary vector table of the corpus vocabulary.
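As a concrete illustration of step S13, the skip-gram method can be sketched as a minimal full-softmax implementation over a tokenized corpus. This is a toy sketch for illustration only: a real system would use an optimized library, and the toy corpus and all hyperparameters below are assumptions, not values from the patent.

```python
import numpy as np

def train_skipgram(corpus, dim=16, window=2, epochs=20, lr=0.05, seed=0):
    """Minimal full-softmax skip-gram: returns a word -> vector table,
    a toy stand-in for the vocabulary vector table of step S13."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    w_in = rng.normal(scale=0.1, size=(n, dim))   # input (word) vectors
    w_out = rng.normal(scale=0.1, size=(n, dim))  # output (context) vectors
    # (center, context) index pairs within the window
    pairs = [(idx[s[i]], idx[s[j]])
             for s in corpus for i in range(len(s))
             for j in range(max(0, i - window), min(len(s), i + window + 1))
             if j != i]
    for _ in range(epochs):
        for c, o in pairs:
            scores = w_out @ w_in[c]
            p = np.exp(scores - scores.max())
            p /= p.sum()                          # softmax over the vocabulary
            p[o] -= 1.0                           # gradient of cross-entropy loss
            grad_in = w_out.T @ p
            w_out -= lr * np.outer(p, w_in[c])
            w_in[c] -= lr * grad_in
    return {w: w_in[idx[w]] for w in vocab}

# Toy corpus standing in for the massive corpus set:
corpus = [["how", "to", "reset", "password"],
          ["how", "to", "change", "password"]]
vector_table = train_skipgram(corpus, dim=8)
```

The returned mapping plays the role of the vocabulary vector table: each corpus word is associated with a dense vector usable for the inner products of step S42.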
S2, training the BERT language model by adopting a BERT training vocabulary to obtain a trained BERT language model;
s3, obtaining a user question word sequence, performing mask processing on predicted words which do not appear in a question-answer pair word set in the user question word sequence, and calculating a distribution probability set, namely a local probability set, of mask words after the mask processing on the question-answer pair word set based on a trained BERT language model;
step S3 includes the following substeps:
s31, obtaining a user question word sequence, and masking predicted words which do not appear in the question and answer pair vocabulary set in the user question word sequence to obtain masked words;
s32, calculating the distribution probability of the mask words on the BERT training vocabulary by adopting the trained BERT language model to obtain a distribution probability set of the mask words on the BERT training vocabulary;
s33, carrying out normalization processing on the distribution probability set on the BERT training vocabulary to obtain the distribution probability set of the mask words on the question-answer pair vocabulary, namely a local probability set.
The calculation formula of the local probability set in step S33 is:

p̂_j = p_j / Σ_{k=1}^{|V_F|} p_k

wherein p̂_j is the jth local probability in the local probability set, p_j is the jth distribution probability in the distribution probability set, and |V_F| is the total number of words in the question-answer pair vocabulary.
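Steps S32-S33, restricting the masked-word distribution over the BERT training vocabulary to the question-answer pair word set V_F and renormalizing, can be sketched as follows. The distribution below is a hand-made toy stand-in for real BERT masked-LM output, and the vocabularies are assumed for illustration.

```python
def local_probabilities(bert_dist, bert_vocab, qa_vocab):
    """Restrict a masked-word distribution over the BERT training vocabulary
    to the words of the question-answer pair word set V_F, then renormalize
    so the restricted probabilities sum to 1 (the local probability set)."""
    pos = {w: i for i, w in enumerate(bert_vocab)}
    p = [bert_dist[pos[w]] for w in qa_vocab]    # p_j for each w_j in V_F
    total = sum(p)                               # normalization constant
    return [pj / total for pj in p]

# Toy masked-LM output standing in for a trained BERT model:
bert_vocab = ["reset", "change", "update", "delete", "login"]
bert_dist = [0.40, 0.30, 0.15, 0.10, 0.05]
qa_vocab = ["reset", "change"]                   # question-answer pair word set V_F
local = local_probabilities(bert_dist, bert_vocab, qa_vocab)
# local[0] = 0.40 / 0.70 and local[1] = 0.30 / 0.70
```

Because only the V_F entries survive, the result is a proper probability distribution over the question-answer pair vocabulary, as the normalization of step S33 requires.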
S4, calculating a global probability set of the predicted words on the question-answer word set according to a word vector table of a corpus word table;
step S4 includes the following substeps:
s41, finding out a predicted word vector corresponding to a predicted word and a question-answer word-pair vector of each word in a question-answer word set according to a word vector table of a corpus vocabulary table;
and S42, respectively solving inner products of the predicted word vectors and the question-answer pair word vectors, and carrying out normalization processing on the inner products by adopting softmax to obtain a global probability set of the predicted words on the question-answer pair word set.
The calculation formula of the global probability set is:

p(w_j | x_i) = softmax(v(w_j) · v(x_i)), w_j ∈ V_F

x = {x_1, x_2, …, x_i, …, x_T}

wherein p(w_j | x_i) is the jth global probability in the global probability set, v(w_j) is the word vector of the jth word in the question-answer pair word set, v(x_i) is the predicted-word vector, w_j is the jth word of the question-answer pair vocabulary, V_F is the question-answer pair word set, x is the user question word sequence, x_i is the ith word in x (i.e. the predicted word), and T is the total number of words in x.
S5, solving intersection of the local probability set and the global probability set, judging whether the intersection is empty, if not, skipping to S6, if yes, inputting a new user question word sequence, and skipping to S3;
In practical applications, for operating efficiency, the top N words (e.g., N = 50) can be taken from the local-probability ranking and from the global-probability ranking respectively, and the two lists then intersected.
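The top-N shortcut of step S5 can be sketched as follows; the word-to-probability mappings are toy assumptions.

```python
def top_n_intersection(local, global_, n=50):
    """Step S5 with the efficiency shortcut: rank the words of V_F by
    local probability and by global probability, keep the top n of each
    ranking, and intersect the two sets."""
    top_local = set(sorted(local, key=local.get, reverse=True)[:n])
    top_global = set(sorted(global_, key=global_.get, reverse=True)[:n])
    return top_local & top_global

# Toy probability sets over a three-word V_F:
local = {"reset": 0.6, "change": 0.3, "delete": 0.1}
global_ = {"change": 0.5, "reset": 0.4, "delete": 0.1}
common = top_n_intersection(local, global_, n=2)
# common == {"reset", "change"}; if the intersection were empty,
# a new user question word sequence would be requested (back to S3)
```
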
S6, calculating the comprehensive probability of the predicted words according to the local probability set and the global probability set, finding out synonyms in the question-answer pair vocabulary set according to the comprehensive probability of the predicted words, and replacing the predicted words in the user question word sequence with the synonyms to obtain the standard question.
The formula for calculating the comprehensive probability of the predicted word is:

w = p̂_j · p(w_j | x_i)

wherein w is the comprehensive probability of the predicted word, p̂_j is the jth local probability in the local probability set, and p(w_j | x_i) is the jth global probability in the global probability set.
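A sketch of step S6, assuming the comprehensive probability combines the local and global probabilities by multiplication (the product is an assumption of this sketch): the candidate with the highest product is chosen as the synonym and substituted into the user question word sequence.

```python
def best_synonym(local, global_, candidates):
    """Choose the V_F word maximizing the (assumed) comprehensive
    probability: the product of local and global probabilities."""
    return max(candidates, key=lambda w: local[w] * global_[w])

def standardize(question, predicted_word, synonym):
    """Replace the predicted word with its synonym to obtain the
    standard question."""
    return [synonym if w == predicted_word else w for w in question]

# Toy probabilities for the two surviving candidates from step S5:
local = {"reset": 0.6, "change": 0.3}
global_ = {"reset": 0.4, "change": 0.5}
synonym = best_synonym(local, global_, ["reset", "change"])
# products: reset 0.6*0.4 = 0.24, change 0.3*0.5 = 0.15 -> "reset"
standard = standardize(["how", "to", "renew", "password"], "renew", synonym)
```
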
Claims (7)
1. A synonym matching method based on question sentences is characterized by comprising the following steps:
s1, constructing a question-answer vocabulary set, a vocabulary vector table of a corpus vocabulary table and a BERT training vocabulary table;
s2, training the BERT language model by adopting a BERT training vocabulary to obtain a trained BERT language model;
s3, obtaining a user question word sequence, performing mask processing on predicted words which do not appear in a question-answer pair word set in the user question word sequence, and calculating a distribution probability set, namely a local probability set, of mask words after the mask processing on the question-answer pair word set based on a trained BERT language model;
s4, calculating a global probability set of the predicted words on the question-answer word set according to a word vector table of a corpus word table;
s5, solving intersection of the local probability set and the global probability set, judging whether the intersection is empty, if not, skipping to S6, if yes, inputting a new user question word sequence, and skipping to S3;
s6, calculating the comprehensive probability of the predicted words according to the local probability set and the global probability set, finding out synonyms in the question-answer pair vocabulary set according to the comprehensive probability of the predicted words, and replacing the predicted words in the user question word sequence with the synonyms to obtain the standard question.
2. The question sentence-based synonym matching method of claim 1, wherein the step S1 includes the steps of:
s11, constructing an original standard question-answer pair set FQA, and constructing a question-answer pair word collection based on words in the original standard question-answer pair set FQA;
s12, collecting corpora, constructing a corpus set, and constructing a BERT training vocabulary based on the corpus set;
and S13, constructing a corpus vocabulary according to the corpus set, and processing the corpus vocabulary by adopting a skip-gram method to obtain a vocabulary vector table of the corpus vocabulary.
3. The question sentence-based synonym matching method of claim 1, wherein the step S3 includes the following substeps:
s31, obtaining a user question word sequence, and masking predicted words which do not appear in the question and answer pair vocabulary set in the user question word sequence to obtain masked words;
s32, calculating the distribution probability of the mask words on the BERT training vocabulary by adopting the trained BERT language model to obtain a distribution probability set of the mask words on the BERT training vocabulary;
s33, carrying out normalization processing on the distribution probability set on the BERT training vocabulary to obtain the distribution probability set of the mask words on the question-answer pair vocabulary, namely a local probability set.
4. The question sentence-based synonym matching method of claim 1, wherein the step S4 includes the following substeps:
s41, finding out a predicted word vector corresponding to a predicted word and a question-answer word-pair vector of each word in a question-answer word set according to a word vector table of a corpus vocabulary table;
and S42, respectively solving inner products of the predicted word vectors and the question-answer pair word vectors, and carrying out normalization processing on the inner products by adopting softmax to obtain a global probability set of the predicted words on the question-answer pair word set.
5. The question-based synonym matching method according to claim 1, wherein the calculation formula of the local probability set in the step S3 is as follows:

p̂_j = p_j / Σ_{k=1}^{|V_F|} p_k

wherein p̂_j is the jth local probability in the local probability set, p_j is the jth distribution probability in the distribution probability set, and |V_F| is the total number of words in the question-answer pair vocabulary.
6. The question-based synonym matching method according to claim 1, wherein the global probability set in step S4 is calculated as:

p(w_j | x_i) = softmax(v(w_j) · v(x_i)), w_j ∈ V_F

x = {x_1, x_2, …, x_i, …, x_T}

wherein p(w_j | x_i) is the jth global probability in the global probability set, v(w_j) is the word vector of the jth word in the question-answer pair word set, v(x_i) is the predicted-word vector, w_j is the jth word of the question-answer pair vocabulary, V_F is the question-answer pair word set, x is the user question word sequence, x_i is the ith word in x (i.e. the predicted word), and T is the total number of words in x.
7. The question sentence-based synonym matching method of claim 1, wherein the formula for calculating the comprehensive probability of the predicted word in the step S6 is as follows:

w = p̂_j · p(w_j | x_i)

wherein w is the comprehensive probability of the predicted word, p̂_j is the jth local probability in the local probability set, and p(w_j | x_i) is the jth global probability in the global probability set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011203497.9A CN112364634A (en) | 2020-11-02 | 2020-11-02 | Synonym matching method based on question sentence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011203497.9A CN112364634A (en) | 2020-11-02 | 2020-11-02 | Synonym matching method based on question sentence |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112364634A true CN112364634A (en) | 2021-02-12 |
Family
ID=74512617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011203497.9A Pending CN112364634A (en) | 2020-11-02 | 2020-11-02 | Synonym matching method based on question sentence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112364634A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1320086A1 (en) * | 2001-12-13 | 2003-06-18 | Sony International (Europe) GmbH | Method for generating and/or adapting language models |
CN110110054A (en) * | 2019-03-22 | 2019-08-09 | 北京中科汇联科技股份有限公司 | A method of obtaining question and answer pair in the slave non-structured text based on deep learning |
CN110442675A (en) * | 2019-06-27 | 2019-11-12 | 平安科技(深圳)有限公司 | Question and answer matching treatment, model training method, device, equipment and storage medium |
CN111597319A (en) * | 2020-05-26 | 2020-08-28 | 成都不问科技有限公司 | Question matching method based on FAQ question-answering system |
CN111832292A (en) * | 2020-06-03 | 2020-10-27 | 北京百度网讯科技有限公司 | Text recognition processing method and device, electronic equipment and storage medium |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1320086A1 (en) * | 2001-12-13 | 2003-06-18 | Sony International (Europe) GmbH | Method for generating and/or adapting language models |
CN110110054A (en) * | 2019-03-22 | 2019-08-09 | 北京中科汇联科技股份有限公司 | A method of obtaining question and answer pair in the slave non-structured text based on deep learning |
CN110442675A (en) * | 2019-06-27 | 2019-11-12 | 平安科技(深圳)有限公司 | Question and answer matching treatment, model training method, device, equipment and storage medium |
CN111597319A (en) * | 2020-05-26 | 2020-08-28 | 成都不问科技有限公司 | Question matching method based on FAQ question-answering system |
CN111832292A (en) * | 2020-06-03 | 2020-10-27 | 北京百度网讯科技有限公司 | Text recognition processing method and device, electronic equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
XIA Yuanyuan; WANG Yu: "Construction of a Question Retrieval Model for Community Question Answering Systems Based on HNC Theory" (基于HNC理论的社区问答系统问句检索模型构建), Computer Applications and Software (计算机应用与软件), no. 08, 12 August 2018 (2018-08-12) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110147436B (en) | Education knowledge map and text-based hybrid automatic question-answering method | |
Li et al. | A co-attention neural network model for emotion cause analysis with emotional context awareness | |
CN108363743B (en) | Intelligent problem generation method and device and computer readable storage medium | |
CN107798140B (en) | Dialog system construction method, semantic controlled response method and device | |
Sukkarieh et al. | Automarking: using computational linguistics to score short free-text responses | |
CN102663129A (en) | Medical field deep question and answer method and medical retrieval system | |
Sahu et al. | Prashnottar: a Hindi question answering system | |
CN110851599A (en) | Automatic scoring method and teaching and assisting system for Chinese composition | |
CN114416942A (en) | Automatic question-answering method based on deep learning | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
CN115858750A (en) | Power grid technical standard intelligent question-answering method and system based on natural language processing | |
Nugraha et al. | Typographic-based data augmentation to improve a question retrieval in short dialogue system | |
CN111666374A (en) | Method for integrating additional knowledge information into deep language model | |
CN112417823A (en) | Chinese text word order adjusting and quantitative word completion method and system | |
Liu | Research on the development of computer intelligent proofreading system based on the perspective of English translation application | |
CN117251455A (en) | Intelligent report generation method and system based on large model | |
Rosset et al. | The LIMSI participation in the QAst track | |
He et al. | [Retracted] Application of Grammar Error Detection Method for English Composition Based on Machine Learning | |
CN112364634A (en) | Synonym matching method based on question sentence | |
Alwaneen et al. | Stacked dynamic memory-coattention network for answering why-questions in Arabic | |
CN114722153A (en) | Intention classification method and device | |
CN111581326B (en) | Method for extracting answer information based on heterogeneous external knowledge source graph structure | |
Khandait et al. | Automatic question generation through word vector synchronization using lamma | |
Guo | An automatic scoring method for Chinese-English spoken translation based on attention LSTM | |
Sreeram et al. | Language modeling for code-switched data: Challenges and approaches |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |