CN112364634A - Synonym matching method based on question sentence - Google Patents
Synonym matching method based on question sentence
- Publication number
- CN112364634A (application CN202011203497.9A)
- Authority
- CN
- China
- Prior art keywords
- question
- word
- vocabulary
- words
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Abstract
The invention discloses a synonym matching method based on question sentences. A mask-based processing method is applied to the user question: words in the user question that never appear in the question-answer pair vocabulary set are masked, their positions are then predicted from the other words in the question, and a probability distribution over the question-answer pair vocabulary set is output. In addition, a vocabulary vector table of the corpus vocabulary is obtained from a massive corpus set; the vocabulary covered by this table is far larger than the question-answer pair word set. From it, the probability that a word absent from the question-answer pair vocabulary set generates each word in that set is calculated. Finally, the local probability and the global probability are jointly considered to find the most similar word in the question-answer pair vocabulary set.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a question sentence-based synonym matching method.
Background
In a conventional vertical-domain automatic question-answering system based on FAQ, the questions and answers are usually collected and organized by domain experts, with one question corresponding to one answer; such a pair is also called a question-answer pair. In an existing question-answering system implemented with FAQ technology, the question posed by the user is compared against the questions listed in the system by sentence-similarity calculation one by one. When a word in the user's question does not appear in the question-answer pair set formed by the FAQ, word vectors obtained from massive corpora are used to process the user's question. The main drawbacks of this method are:
1. current question context information cannot be used. And the information is exactly that the word x does not appeariAnd key information of similarity of each word in the query-answer pair word set VF. The absence of this information will directly render the calculated similarity unusable.
2. Existing word vectors are computed with word2vec, a shallow neural network that has no multi-head representation, no self-attention, and uses no sentence-level information. Its word-vector representations are therefore of limited quality.
3. BERT cannot be used effectively. BERT, a recent natural-language-processing result, models language well by exploiting context through a self-attention mechanism and two pre-training tasks: masked-word prediction and next-sentence prediction. Its disadvantage is that the vocabulary cannot be too large, typically 30,000 to 50,000 entries. If BERT's vocabulary were to cover every word a user question might contain, it would need to expand to 500,000 entries or even more, which would make BERT untrainable.
In summary, in current word-vector-based methods for handling user-question words that do not appear in the question-answer pair word set V_F, the word-vector representation quality is poor and the context information of the current question cannot be used, so the matching accuracy is low.
Disclosure of Invention
To address these defects in the prior art, the question-based synonym matching method of the invention solves the problem of low matching accuracy caused by poor word-vector representation quality and the inability to use the context information of the current question.
In order to achieve the above purpose, the invention adopts the following technical scheme: a synonym matching method based on question sentences, comprising the following steps:
s1, constructing a question-answer vocabulary set, a vocabulary vector table of a corpus vocabulary table and a BERT training vocabulary table;
s2, training the BERT language model by adopting a BERT training vocabulary to obtain a trained BERT language model;
s3, obtaining a user question word sequence, performing mask processing on predicted words which do not appear in a question-answer pair word set in the user question word sequence, and calculating a distribution probability set, namely a local probability set, of mask words after the mask processing on the question-answer pair word set based on a trained BERT language model;
s4, calculating a global probability set of the predicted words on the question-answer word set according to a word vector table of a corpus word table;
s5, solving intersection of the local probability set and the global probability set, judging whether the intersection is empty, if not, skipping to S6, if yes, inputting a new user question word sequence, and skipping to S3;
s6, calculating the comprehensive probability of the predicted words according to the local probability set and the global probability set, finding out synonyms in the question-answer pair vocabulary set according to the comprehensive probability of the predicted words, and replacing the predicted words in the user question word sequence with the synonyms to obtain the standard question.
Further, the step S1 includes the following steps:
s11, constructing an original standard question-answer pair set FQA, and constructing a question-answer pair word collection based on words in the original standard question-answer pair set FQA;
s12, collecting corpora, constructing a corpus set, and constructing a BERT training vocabulary based on the corpus set;
and S13, constructing a corpus vocabulary according to the corpus set, and processing the corpus vocabulary by adopting a skip-gram method to obtain a vocabulary vector table of the corpus vocabulary.
Further, the step S3 includes the following sub-steps:
s31, obtaining a user question word sequence, and masking predicted words which do not appear in the question and answer pair vocabulary set in the user question word sequence to obtain masked words;
s32, calculating the distribution probability of the mask words on the BERT training vocabulary by adopting the trained BERT language model to obtain a distribution probability set of the mask words on the BERT training vocabulary;
s33, carrying out normalization processing on the distribution probability set on the BERT training vocabulary to obtain the distribution probability set of the mask words on the question-answer pair vocabulary, namely a local probability set.
Further, the step S4 includes the following sub-steps:
s41, finding out a predicted word vector corresponding to a predicted word and a question-answer word-pair vector of each word in a question-answer word set according to a word vector table of a corpus vocabulary table;
and S42, respectively solving inner products of the predicted word vectors and the question-answer pair word vectors, and carrying out normalization processing on the inner products by adopting softmax to obtain a global probability set of the predicted words on the question-answer pair word set.
Further, the calculation formula of the local probability set in step S3 is as follows:

p̂_j = p_j / Σ_{k=1}^{|V_F|} p_k

wherein p̂_j is the jth local probability in the local probability set, p_j is the jth distribution probability in the distribution probability set, and |V_F| is the total number of words in the question-answer pair vocabulary.
Further, the calculation formula of the global probability set in step S4 is as follows:

p(w_j | x_i) = softmax(v(w_j) · v(x_i)), w_j ∈ V_F

x = {x_1, x_2, …, x_i, …, x_T}

wherein p(w_j | x_i) is the jth global probability in the global probability set, v(w_j) is the word vector of the jth word in the question-answer pair word set, v(x_i) is the predicted-word vector, w_j is the jth word of the question-answer pair vocabulary, V_F is the question-answer pair word set, x is the user question word sequence, x_i is the ith word in x (i.e. the predicted word), and T is the total number of words in x.
Further, the formula for calculating the comprehensive probability of the predicted word in step S6 is as follows:

w = p̂_j · p(w_j | x_i)

wherein w is the comprehensive probability of the predicted word, p̂_j is the jth local probability in the local probability set, and p(w_j | x_i) is the jth global probability in the global probability set.
In conclusion, the beneficial effects of the invention are as follows:
(1) The invention provides a mask-based method for processing the user question: a word x_i that never appears in the question-answer pair vocabulary set is masked, its position is predicted from the other words in the sentence, and a probability distribution over the question-answer pair word set is output. In addition, a vocabulary vector table of the corpus vocabulary is obtained from the massive corpus set, and the vocabulary it covers is far larger than the question-answer pair word set. From it, the probability that x_i generates each word in the question-answer pair vocabulary set is calculated. Finally, the local probability and the global probability are jointly considered to find the word in the question-answer pair vocabulary set most similar to x_i.
(2) The invention can not only fully utilize the context information of the current user question, but also consider the prior context, thereby greatly improving the performance of sentence similarity calculation.
Drawings
Fig. 1 is a flowchart of a synonym matching method based on a question sentence.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. For those skilled in the art, various changes are possible without departing from the spirit and scope of the invention as defined in the appended claims, and everything produced using the inventive concept is protected.
As shown in fig. 1, a method for matching synonyms based on question sentences includes the following steps:
s1, constructing a question-answer vocabulary set, a vocabulary vector table of a corpus vocabulary table and a BERT training vocabulary table;
step S1 includes the following steps:
s11, constructing an original standard question-answer pair set FQA, and constructing a question-answer pair word collection based on words in the original standard question-answer pair set FQA;
s12, collecting corpora, constructing a corpus set, and constructing a BERT training vocabulary based on the corpus set;
and S13, constructing a corpus vocabulary according to the corpus set, and processing the corpus vocabulary by adopting a skip-gram method to obtain a vocabulary vector table of the corpus vocabulary.
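As a concrete illustration of step S13, the skip-gram method can be sketched as a minimal full-softmax implementation over a tokenized corpus. This is a toy sketch for illustration only: a real system would use an optimized library, and the toy corpus and all hyperparameters below are assumptions, not values from the patent.

```python
import numpy as np

def train_skipgram(corpus, dim=16, window=2, epochs=20, lr=0.05, seed=0):
    """Minimal full-softmax skip-gram: returns a word -> vector table,
    a toy stand-in for the vocabulary vector table of step S13."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    w_in = rng.normal(scale=0.1, size=(n, dim))   # input (word) vectors
    w_out = rng.normal(scale=0.1, size=(n, dim))  # output (context) vectors
    # (center, context) index pairs within the window
    pairs = [(idx[s[i]], idx[s[j]])
             for s in corpus for i in range(len(s))
             for j in range(max(0, i - window), min(len(s), i + window + 1))
             if j != i]
    for _ in range(epochs):
        for c, o in pairs:
            scores = w_out @ w_in[c]
            p = np.exp(scores - scores.max())
            p /= p.sum()                          # softmax over the vocabulary
            p[o] -= 1.0                           # gradient of cross-entropy loss
            grad_in = w_out.T @ p
            w_out -= lr * np.outer(p, w_in[c])
            w_in[c] -= lr * grad_in
    return {w: w_in[idx[w]] for w in vocab}

# Toy corpus standing in for the massive corpus set:
corpus = [["how", "to", "reset", "password"],
          ["how", "to", "change", "password"]]
vector_table = train_skipgram(corpus, dim=8)
```

The returned mapping plays the role of the vocabulary vector table: each corpus word is associated with a dense vector usable for the inner products of step S42.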
S2, training the BERT language model by adopting a BERT training vocabulary to obtain a trained BERT language model;
s3, obtaining a user question word sequence, performing mask processing on predicted words which do not appear in a question-answer pair word set in the user question word sequence, and calculating a distribution probability set, namely a local probability set, of mask words after the mask processing on the question-answer pair word set based on a trained BERT language model;
step S3 includes the following substeps:
s31, obtaining a user question word sequence, and masking predicted words which do not appear in the question and answer pair vocabulary set in the user question word sequence to obtain masked words;
s32, calculating the distribution probability of the mask words on the BERT training vocabulary by adopting the trained BERT language model to obtain a distribution probability set of the mask words on the BERT training vocabulary;
s33, carrying out normalization processing on the distribution probability set on the BERT training vocabulary to obtain the distribution probability set of the mask words on the question-answer pair vocabulary, namely a local probability set.
The calculation formula of the local probability set in step S33 is:

p̂_j = p_j / Σ_{k=1}^{|V_F|} p_k

wherein p̂_j is the jth local probability in the local probability set, p_j is the jth distribution probability in the distribution probability set, and |V_F| is the total number of words in the question-answer pair vocabulary.
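Steps S32-S33, restricting the masked-word distribution over the BERT training vocabulary to the question-answer pair word set V_F and renormalizing, can be sketched as follows. The distribution below is a hand-made toy stand-in for real BERT masked-LM output, and the vocabularies are assumed for illustration.

```python
def local_probabilities(bert_dist, bert_vocab, qa_vocab):
    """Restrict a masked-word distribution over the BERT training vocabulary
    to the words of the question-answer pair word set V_F, then renormalize
    so the restricted probabilities sum to 1 (the local probability set)."""
    pos = {w: i for i, w in enumerate(bert_vocab)}
    p = [bert_dist[pos[w]] for w in qa_vocab]    # p_j for each w_j in V_F
    total = sum(p)                               # normalization constant
    return [pj / total for pj in p]

# Toy masked-LM output standing in for a trained BERT model:
bert_vocab = ["reset", "change", "update", "delete", "login"]
bert_dist = [0.40, 0.30, 0.15, 0.10, 0.05]
qa_vocab = ["reset", "change"]                   # question-answer pair word set V_F
local = local_probabilities(bert_dist, bert_vocab, qa_vocab)
# local[0] = 0.40 / 0.70 and local[1] = 0.30 / 0.70
```

Because only the V_F entries survive, the result is a proper probability distribution over the question-answer pair vocabulary, as the normalization of step S33 requires.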
S4, calculating a global probability set of the predicted words on the question-answer word set according to a word vector table of a corpus word table;
step S4 includes the following substeps:
s41, finding out a predicted word vector corresponding to a predicted word and a question-answer word-pair vector of each word in a question-answer word set according to a word vector table of a corpus vocabulary table;
and S42, respectively solving inner products of the predicted word vectors and the question-answer pair word vectors, and carrying out normalization processing on the inner products by adopting softmax to obtain a global probability set of the predicted words on the question-answer pair word set.
The calculation formula of the global probability set is:

p(w_j | x_i) = softmax(v(w_j) · v(x_i)), w_j ∈ V_F

x = {x_1, x_2, …, x_i, …, x_T}

wherein p(w_j | x_i) is the jth global probability in the global probability set, v(w_j) is the word vector of the jth word in the question-answer pair word set, v(x_i) is the predicted-word vector, w_j is the jth word of the question-answer pair vocabulary, V_F is the question-answer pair word set, x is the user question word sequence, x_i is the ith word in x (i.e. the predicted word), and T is the total number of words in x.
S5, solving intersection of the local probability set and the global probability set, judging whether the intersection is empty, if not, skipping to S6, if yes, inputting a new user question word sequence, and skipping to S3;
In practical applications, for operating efficiency, the top N words (e.g., N = 50) can be taken from the local-probability ranking and from the global-probability ranking respectively, and the two lists then intersected.
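The top-N shortcut of step S5 can be sketched as follows; the word-to-probability mappings are toy assumptions.

```python
def top_n_intersection(local, global_, n=50):
    """Step S5 with the efficiency shortcut: rank the words of V_F by
    local probability and by global probability, keep the top n of each
    ranking, and intersect the two sets."""
    top_local = set(sorted(local, key=local.get, reverse=True)[:n])
    top_global = set(sorted(global_, key=global_.get, reverse=True)[:n])
    return top_local & top_global

# Toy probability sets over a three-word V_F:
local = {"reset": 0.6, "change": 0.3, "delete": 0.1}
global_ = {"change": 0.5, "reset": 0.4, "delete": 0.1}
common = top_n_intersection(local, global_, n=2)
# common == {"reset", "change"}; if the intersection were empty,
# a new user question word sequence would be requested (back to S3)
```
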
S6, calculating the comprehensive probability of the predicted words according to the local probability set and the global probability set, finding out synonyms in the question-answer pair vocabulary set according to the comprehensive probability of the predicted words, and replacing the predicted words in the user question word sequence with the synonyms to obtain the standard question.
The formula for calculating the comprehensive probability of the predicted word is:

w = p̂_j · p(w_j | x_i)

wherein w is the comprehensive probability of the predicted word, p̂_j is the jth local probability in the local probability set, and p(w_j | x_i) is the jth global probability in the global probability set.
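A sketch of step S6, assuming the comprehensive probability combines the local and global probabilities by multiplication (the product is an assumption of this sketch): the candidate with the highest product is chosen as the synonym and substituted into the user question word sequence.

```python
def best_synonym(local, global_, candidates):
    """Choose the V_F word maximizing the (assumed) comprehensive
    probability: the product of local and global probabilities."""
    return max(candidates, key=lambda w: local[w] * global_[w])

def standardize(question, predicted_word, synonym):
    """Replace the predicted word with its synonym to obtain the
    standard question."""
    return [synonym if w == predicted_word else w for w in question]

# Toy probabilities for the two surviving candidates from step S5:
local = {"reset": 0.6, "change": 0.3}
global_ = {"reset": 0.4, "change": 0.5}
synonym = best_synonym(local, global_, ["reset", "change"])
# products: reset 0.6*0.4 = 0.24, change 0.3*0.5 = 0.15 -> "reset"
standard = standardize(["how", "to", "renew", "password"], "renew", synonym)
```
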
Claims (7)
1. A synonym matching method based on question sentences is characterized by comprising the following steps:
s1, constructing a question-answer vocabulary set, a vocabulary vector table of a corpus vocabulary table and a BERT training vocabulary table;
s2, training the BERT language model by adopting a BERT training vocabulary to obtain a trained BERT language model;
s3, obtaining a user question word sequence, performing mask processing on predicted words which do not appear in a question-answer pair word set in the user question word sequence, and calculating a distribution probability set, namely a local probability set, of mask words after the mask processing on the question-answer pair word set based on a trained BERT language model;
s4, calculating a global probability set of the predicted words on the question-answer word set according to a word vector table of a corpus word table;
s5, solving intersection of the local probability set and the global probability set, judging whether the intersection is empty, if not, skipping to S6, if yes, inputting a new user question word sequence, and skipping to S3;
s6, calculating the comprehensive probability of the predicted words according to the local probability set and the global probability set, finding out synonyms in the question-answer pair vocabulary set according to the comprehensive probability of the predicted words, and replacing the predicted words in the user question word sequence with the synonyms to obtain the standard question.
2. The question sentence-based synonym matching method of claim 1, wherein the step S1 includes the steps of:
s11, constructing an original standard question-answer pair set FQA, and constructing a question-answer pair word collection based on words in the original standard question-answer pair set FQA;
s12, collecting corpora, constructing a corpus set, and constructing a BERT training vocabulary based on the corpus set;
and S13, constructing a corpus vocabulary according to the corpus set, and processing the corpus vocabulary by adopting a skip-gram method to obtain a vocabulary vector table of the corpus vocabulary.
3. The question sentence-based synonym matching method of claim 1, wherein the step S3 includes the following substeps:
s31, obtaining a user question word sequence, and masking predicted words which do not appear in the question and answer pair vocabulary set in the user question word sequence to obtain masked words;
s32, calculating the distribution probability of the mask words on the BERT training vocabulary by adopting the trained BERT language model to obtain a distribution probability set of the mask words on the BERT training vocabulary;
s33, carrying out normalization processing on the distribution probability set on the BERT training vocabulary to obtain the distribution probability set of the mask words on the question-answer pair vocabulary, namely a local probability set.
4. The question sentence-based synonym matching method of claim 1, wherein the step S4 includes the following substeps:
s41, finding out a predicted word vector corresponding to a predicted word and a question-answer word-pair vector of each word in a question-answer word set according to a word vector table of a corpus vocabulary table;
and S42, respectively solving inner products of the predicted word vectors and the question-answer pair word vectors, and carrying out normalization processing on the inner products by adopting softmax to obtain a global probability set of the predicted words on the question-answer pair word set.
5. The question-based synonym matching method according to claim 1, wherein the calculation formula of the local probability set in the step S3 is as follows:

p̂_j = p_j / Σ_{k=1}^{|V_F|} p_k

wherein p̂_j is the jth local probability in the local probability set, p_j is the jth distribution probability in the distribution probability set, and |V_F| is the total number of words in the question-answer pair vocabulary.
6. The question-based synonym matching method according to claim 1, wherein the global probability set in step S4 is calculated as:

p(w_j | x_i) = softmax(v(w_j) · v(x_i)), w_j ∈ V_F

x = {x_1, x_2, …, x_i, …, x_T}

wherein p(w_j | x_i) is the jth global probability in the global probability set, v(w_j) is the word vector of the jth word in the question-answer pair word set, v(x_i) is the predicted-word vector, w_j is the jth word of the question-answer pair vocabulary, V_F is the question-answer pair word set, x is the user question word sequence, x_i is the ith word in x (i.e. the predicted word), and T is the total number of words in x.
7. The question sentence-based synonym matching method of claim 1, wherein the formula for calculating the comprehensive probability of the predicted word in the step S6 is as follows:

w = p̂_j · p(w_j | x_i)

wherein w is the comprehensive probability of the predicted word, p̂_j is the jth local probability in the local probability set, and p(w_j | x_i) is the jth global probability in the global probability set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011203497.9A CN112364634A (en) | 2020-11-02 | 2020-11-02 | Synonym matching method based on question sentence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011203497.9A CN112364634A (en) | 2020-11-02 | 2020-11-02 | Synonym matching method based on question sentence |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112364634A true CN112364634A (en) | 2021-02-12 |
Family
ID=74512617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011203497.9A Pending CN112364634A (en) | 2020-11-02 | 2020-11-02 | Synonym matching method based on question sentence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112364634A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1320086A1 (en) * | 2001-12-13 | 2003-06-18 | Sony International (Europe) GmbH | Method for generating and/or adapting language models |
CN110110054A (en) * | 2019-03-22 | 2019-08-09 | 北京中科汇联科技股份有限公司 | A method of obtaining question and answer pair in the slave non-structured text based on deep learning |
CN110442675A (en) * | 2019-06-27 | 2019-11-12 | 平安科技(深圳)有限公司 | Question and answer matching treatment, model training method, device, equipment and storage medium |
CN111597319A (en) * | 2020-05-26 | 2020-08-28 | 成都不问科技有限公司 | Question matching method based on FAQ question-answering system |
CN111832292A (en) * | 2020-06-03 | 2020-10-27 | 北京百度网讯科技有限公司 | Text recognition processing method and device, electronic equipment and storage medium |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1320086A1 (en) * | 2001-12-13 | 2003-06-18 | Sony International (Europe) GmbH | Method for generating and/or adapting language models |
CN110110054A (en) * | 2019-03-22 | 2019-08-09 | 北京中科汇联科技股份有限公司 | A method of obtaining question and answer pair in the slave non-structured text based on deep learning |
CN110442675A (en) * | 2019-06-27 | 2019-11-12 | 平安科技(深圳)有限公司 | Question and answer matching treatment, model training method, device, equipment and storage medium |
CN111597319A (en) * | 2020-05-26 | 2020-08-28 | 成都不问科技有限公司 | Question matching method based on FAQ question-answering system |
CN111832292A (en) * | 2020-06-03 | 2020-10-27 | 北京百度网讯科技有限公司 | Text recognition processing method and device, electronic equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
XIA Yuanyuan; WANG Yu: "Construction of a Question Retrieval Model for Community Question Answering Systems Based on HNC Theory" (基于HNC理论的社区问答系统问句检索模型构建), Computer Applications and Software (计算机应用与软件), no. 08, 12 August 2018 (2018-08-12) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110147436B (en) | Education knowledge map and text-based hybrid automatic question-answering method | |
Li et al. | A co-attention neural network model for emotion cause analysis with emotional context awareness | |
CN108363743B (en) | Intelligent problem generation method and device and computer readable storage medium | |
CN107798140B (en) | Dialog system construction method, semantic controlled response method and device | |
Sukkarieh et al. | Automarking: using computational linguistics to score short free-text responses | |
CN102663129A (en) | Medical field deep question and answer method and medical retrieval system | |
Sahu et al. | Prashnottar: a Hindi question answering system | |
CN110851599A (en) | Automatic scoring method and teaching and assisting system for Chinese composition | |
CN114416942A (en) | Automatic question-answering method based on deep learning | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
CN115858750A (en) | Power grid technical standard intelligent question-answering method and system based on natural language processing | |
Nugraha et al. | Typographic-based data augmentation to improve a question retrieval in short dialogue system | |
CN111666374A (en) | Method for integrating additional knowledge information into deep language model | |
CN112417823A (en) | Chinese text word order adjusting and quantitative word completion method and system | |
Liu | Research on the development of computer intelligent proofreading system based on the perspective of English translation application | |
CN117251455A (en) | Intelligent report generation method and system based on large model | |
Rosset et al. | The LIMSI participation in the QAst track | |
He et al. | [Retracted] Application of Grammar Error Detection Method for English Composition Based on Machine Learning | |
CN112364634A (en) | Synonym matching method based on question sentence | |
Alwaneen et al. | Stacked dynamic memory-coattention network for answering why-questions in Arabic | |
CN114722153A (en) | Intention classification method and device | |
CN111581326B (en) | Method for extracting answer information based on heterogeneous external knowledge source graph structure | |
Khandait et al. | Automatic question generation through word vector synchronization using lamma | |
Guo | An automatic scoring method for Chinese-English spoken translation based on attention LSTM | |
Sreeram et al. | Language modeling for code-switched data: Challenges and approaches |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |