CN112364634A - Synonym matching method based on question sentence - Google Patents

Synonym matching method based on question sentence

Info

Publication number
CN112364634A
Authority
CN
China
Prior art keywords
question
word
vocabulary
words
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011203497.9A
Other languages
Chinese (zh)
Inventor
陈兴元
金澎
陈可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Buwen Technology Co ltd
Original Assignee
Chengdu Buwen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Buwen Technology Co ltd filed Critical Chengdu Buwen Technology Co ltd
Priority to CN202011203497.9A priority Critical patent/CN112364634A/en
Publication of CN112364634A publication Critical patent/CN112364634A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/247 - Thesauruses; Synonyms
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a synonym matching method based on question sentences. A mask-based processing method is applied to the user question: a word in the user question that never appears in the question-answer pair vocabulary set is masked, the masked position is then predicted from the other words in the user question, and a probability distribution over the question-answer pair vocabulary set is output. In addition, a word vector table of the corpus vocabulary is obtained from a massive corpus set, and the vocabulary it covers is far larger than the question-answer pair vocabulary set. From this table, the probability that a word absent from the question-answer pair vocabulary set generates each word in that set is calculated. Finally, the local probability and the global probability are jointly considered to find the most similar word in the question-answer pair vocabulary set.

Description

Synonym matching method based on question sentence
Technical Field
The invention relates to the technical field of natural language processing, in particular to a question sentence-based synonym matching method.
Background
The questions and answers of a conventional FAQ-based vertical-domain automatic question-answering system are usually collected and curated by domain experts, with one question corresponding to one answer; such a unit is also called a question-answer pair. In existing question-answering systems built on FAQ technology, the user's question is compared with the questions listed in the system by computing sentence similarity one by one. When a word in the user's question does not appear in the question-answer pair set formed by the FAQ, word vectors obtained from massive corpora are used to process the user's question. The main drawbacks of this method are:
1. current question context information cannot be used. And the information is exactly that the word x does not appeariAnd key information of similarity of each word in the query-answer pair word set VF. The absence of this information will directly render the calculated similarity unusable.
2. The existing word vectors are computed with word2vec, a shallow neural network that has no multi-head representation, no self-attention, and makes no use of sentence-level information. Its word vector representations are therefore of poor quality.
3. BERT cannot be effectively utilized. BERT is a recent advance in natural language processing; through its self-attention mechanism and its two training tasks of predicting masked words and the next sentence, it makes better use of context information to model language. Its disadvantage is that the vocabulary cannot be too large, typically 30,000 to 50,000 entries. If the BERT vocabulary were to cover every possible wording of a user question, it would need to be expanded to 500,000 entries or even more, which would make BERT untrainable.
In summary, current word-vector-based methods for handling user question words that do not appear in the question-answer pair vocabulary set V_F suffer from poor word vector representation quality and cannot exploit the context information of the current question, so their matching accuracy is not high.
Disclosure of Invention
To address the above defects in the prior art, the question-based synonym matching method of the invention solves the problem of low matching accuracy caused by poor word vector representation quality and the inability to use the context information of the current question.
To achieve this purpose, the invention adopts the following technical scheme: a synonym matching method based on question sentences, comprising the following steps:
S1, constructing a question-answer pair vocabulary set, a word vector table of the corpus vocabulary, and a BERT training vocabulary;
S2, training the BERT language model with the BERT training vocabulary to obtain a trained BERT language model;
S3, obtaining a user question word sequence, masking the predicted word in the user question word sequence that does not appear in the question-answer pair vocabulary set, and calculating, based on the trained BERT language model, the distribution probability set of the masked word over the question-answer pair vocabulary set, i.e. the local probability set;
S4, calculating the global probability set of the predicted word over the question-answer pair vocabulary set from the word vector table of the corpus vocabulary;
S5, taking the intersection of the local probability set and the global probability set and judging whether it is empty; if it is not empty, jumping to S6; if it is empty, inputting a new user question word sequence and jumping to S3;
S6, calculating the comprehensive probability of the predicted word from the local probability set and the global probability set, finding the synonym in the question-answer pair vocabulary set according to this comprehensive probability, and replacing the predicted word in the user question word sequence with the synonym to obtain the standard question.
Further, the step S1 includes the following steps:
S11, constructing an original standard question-answer pair set FQA, and building the question-answer pair vocabulary set from the words in FQA;
S12, collecting corpora to construct a corpus set, and building the BERT training vocabulary from the corpus set;
S13, constructing the corpus vocabulary from the corpus set, and processing it with the skip-gram method to obtain the word vector table of the corpus vocabulary.
Further, the step S3 includes the following sub-steps:
S31, obtaining a user question word sequence, and masking the predicted word that does not appear in the question-answer pair vocabulary set to obtain the masked word;
S32, calculating, with the trained BERT language model, the distribution probability of the masked word over the BERT training vocabulary to obtain the distribution probability set of the masked word over the BERT training vocabulary;
S33, normalizing the distribution probability set over the BERT training vocabulary to obtain the distribution probability set of the masked word over the question-answer pair vocabulary set, i.e. the local probability set.
Further, the step S4 includes the following sub-steps:
S41, looking up, in the word vector table of the corpus vocabulary, the predicted word vector of the predicted word and the word vector of each word in the question-answer pair vocabulary set;
S42, computing the inner product of the predicted word vector with each of these word vectors, and normalizing the inner products with softmax to obtain the global probability set of the predicted word over the question-answer pair vocabulary set.
Further, the local probability set in step S3 is calculated as:
p̂_j = p_j / Σ_{k=1}^{|V_F|} p_k
where p̂_j is the jth local probability in the local probability set, p_j is the jth distribution probability in the distribution probability set, and |V_F| is the total number of words in the question-answer pair vocabulary set.
Further, the global probability set in step S4 is calculated as:
p(w_j | x_i) = softmax(v(w_j) · v(x_i)), w_j ∈ V_F
x = {x_1, x_2, …, x_i, …, x_T}
where p(w_j | x_i) is the jth global probability in the global probability set, v(w_j) is the word vector of the jth word w_j in the question-answer pair vocabulary set V_F, v(x_i) is the predicted word vector, x is the user question word sequence, x_i is the ith word in x, i.e. the predicted word, and T is the total number of words in x.
Further, the comprehensive probability of the predicted word in step S6 is calculated as:
w = p̂_j · p(w_j | x_i)
or
w = (p̂_j + p(w_j | x_i)) / 2
where w is the comprehensive probability of the predicted word, p̂_j is the jth local probability in the local probability set, and p(w_j | x_i) is the jth global probability in the global probability set.
In conclusion, the beneficial effects of the invention are as follows:
(1) The invention provides a mask-based method for processing the user question: a word x_i that never appears in the question-answer pair vocabulary set is masked, its position is predicted from the other words in the sentence, and a probability distribution over the question-answer pair vocabulary set is output. In addition, a word vector table of the corpus vocabulary is obtained from a massive corpus set, and the vocabulary it covers is far larger than the question-answer pair vocabulary set. From this table, the probability that x_i generates each word in the question-answer pair vocabulary set is calculated. Finally, the local probability and the global probability are jointly considered to find the word in the question-answer pair vocabulary set most similar to x_i.
(2) The invention can not only fully utilize the context information of the current user question, but also consider the prior context, thereby greatly improving the performance of sentence similarity calculation.
Drawings
Fig. 1 is a flowchart of a synonym matching method based on a question sentence.
Detailed Description
The following description of the embodiments of the invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are permissible as long as they fall within the spirit and scope of the invention as defined by the appended claims, and everything produced using the inventive concept is protected.
As shown in fig. 1, a method for matching synonyms based on question sentences includes the following steps:
S1, constructing a question-answer pair vocabulary set, a word vector table of the corpus vocabulary, and a BERT training vocabulary;
step S1 includes the following steps:
S11, constructing an original standard question-answer pair set FQA, and building the question-answer pair vocabulary set from the words in FQA;
S12, collecting corpora to construct a corpus set, and building the BERT training vocabulary from the corpus set;
S13, constructing the corpus vocabulary from the corpus set, and processing it with the skip-gram method to obtain the word vector table of the corpus vocabulary.
Here the question-answer pair vocabulary set V_F is contained in the BERT training vocabulary, which is in turn built from the corpus set, so the corpus vocabulary covers far more words than V_F.
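As an illustration of step S13, the following is a minimal sketch using gensim's skip-gram implementation; the corpus file name, the pre-tokenized input format, and the hyperparameters are assumptions for illustration rather than details fixed by the invention.

    # Sketch of step S13: train skip-gram word vectors over the corpus set.
    # Assumes a pre-tokenized corpus file, one whitespace-separated sentence per line.
    from gensim.models import Word2Vec

    with open("corpus.txt", encoding="utf-8") as f:
        sentences = [line.split() for line in f]

    # sg=1 selects the skip-gram architecture named in step S13.
    w2v = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=1)

    # The word vector table of the corpus vocabulary: one vector per corpus word.
    vector_table = {word: w2v.wv[word] for word in w2v.wv.index_to_key}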
S2, training the BERT language model by adopting a BERT training vocabulary to obtain a trained BERT language model;
s3, obtaining a user question word sequence, performing mask processing on predicted words which do not appear in a question-answer pair word set in the user question word sequence, and calculating a distribution probability set, namely a local probability set, of mask words after the mask processing on the question-answer pair word set based on a trained BERT language model;
step S3 includes the following substeps:
S31, obtaining a user question word sequence, and masking the predicted word that does not appear in the question-answer pair vocabulary set to obtain the masked word;
S32, calculating, with the trained BERT language model, the distribution probability of the masked word over the BERT training vocabulary to obtain the distribution probability set of the masked word over the BERT training vocabulary;
S33, normalizing the distribution probability set over the BERT training vocabulary to obtain the distribution probability set of the masked word over the question-answer pair vocabulary set, i.e. the local probability set.
The local probability set in step S33 is calculated as:
p̂_j = p_j / Σ_{k=1}^{|V_F|} p_k
where p̂_j is the jth local probability in the local probability set, p_j is the jth distribution probability in the distribution probability set, and |V_F| is the total number of words in the question-answer pair vocabulary set.
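To make steps S31-S33 concrete, here is a minimal sketch using the Hugging Face transformers API. The checkpoint bert-base-chinese stands in for the BERT model trained in step S2, each word of V_F is assumed to map to a single token, and the function name is illustrative.

    # Sketch of steps S31-S33: mask the predicted word, predict the masked position
    # with BERT, and renormalize the resulting distribution over the question-answer
    # pair vocabulary set V_F to obtain the local probability set.
    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # stand-in checkpoint
    model = BertForMaskedLM.from_pretrained("bert-base-chinese")
    model.eval()

    def local_probabilities(question_tokens, predicted_index, vf):
        # S31: replace the predicted word with the [MASK] token.
        masked = list(question_tokens)
        masked[predicted_index] = tokenizer.mask_token
        inputs = tokenizer(" ".join(masked), return_tensors="pt")
        mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()

        # S32: distribution probability of the masked word over the BERT training vocabulary.
        with torch.no_grad():
            logits = model(**inputs).logits[0, mask_pos]
        dist = torch.softmax(logits, dim=-1)

        # S33: keep only the words of V_F and renormalize to get the local probabilities.
        raw = {w: dist[tokenizer.convert_tokens_to_ids(w)].item() for w in vf}
        total = sum(raw.values())
        return {w: p / total for w, p in raw.items()}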
S4, calculating a global probability set of the predicted words on the question-answer word set according to a word vector table of a corpus word table;
step S4 includes the following substeps:
S41, looking up, in the word vector table of the corpus vocabulary, the predicted word vector of the predicted word and the word vector of each word in the question-answer pair vocabulary set;
S42, computing the inner product of the predicted word vector with each of these word vectors, and normalizing the inner products with softmax to obtain the global probability set of the predicted word over the question-answer pair vocabulary set.
The global probability set is calculated as:
p(w_j | x_i) = softmax(v(w_j) · v(x_i)), w_j ∈ V_F
x = {x_1, x_2, …, x_i, …, x_T}
where p(w_j | x_i) is the jth global probability in the global probability set, v(w_j) is the word vector of the jth word w_j in the question-answer pair vocabulary set V_F, v(x_i) is the predicted word vector, x is the user question word sequence, x_i is the ith word in x, i.e. the predicted word, and T is the total number of words in x.
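A corresponding sketch of steps S41-S42, where vector_table is the word vector table from step S13; the function name is illustrative.

    # Sketch of steps S41-S42: inner products of the predicted word vector with the
    # word vector of each word in V_F, normalized with softmax, give the global
    # probability set.
    import numpy as np

    def global_probabilities(predicted_word, vf, vector_table):
        x = vector_table[predicted_word]            # S41: predicted word vector v(x_i)
        scores = np.array([np.dot(vector_table[w], x) for w in vf])
        exp = np.exp(scores - scores.max())         # numerically stable softmax
        return dict(zip(vf, exp / exp.sum()))       # S42: p(w_j | x_i) for each w_j in V_F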
S5, solving intersection of the local probability set and the global probability set, judging whether the intersection is empty, if not, skipping to S6, if yes, inputting a new user question word sequence, and skipping to S3;
In practical applications, for operating efficiency, the words can first be ranked by local probability and by global probability respectively, the top N words (e.g., N = 50) taken from each ranking, and the two lists then intersected, as in the sketch below.
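A minimal sketch of this top-N variant of step S5; N = 50 follows the example above, and the function name is illustrative.

    # Sketch of step S5: intersect the top-N candidates of the two probability sets.
    def top_n_intersection(local_probs, global_probs, n=50):
        top_local = sorted(local_probs, key=local_probs.get, reverse=True)[:n]
        top_global = sorted(global_probs, key=global_probs.get, reverse=True)[:n]
        return set(top_local) & set(top_global)  # empty set -> request a new user question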
S6, calculating the comprehensive probability of the predicted words according to the local probability set and the global probability set, finding out synonyms in the question-answer pair vocabulary set according to the comprehensive probability of the predicted words, and replacing the predicted words in the user question word sequence with the synonyms to obtain the standard question.
The comprehensive probability of the predicted word is calculated as:
w = p̂_j · p(w_j | x_i)
or
w = (p̂_j + p(w_j | x_i)) / 2
where w is the comprehensive probability of the predicted word, p̂_j is the jth local probability in the local probability set, and p(w_j | x_i) is the jth global probability in the global probability set.
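Finally, a sketch of step S6, assuming (as in the formulas above) that the comprehensive probability is either the product or the arithmetic mean of the local and global probabilities; the function names are illustrative.

    # Sketch of step S6: combine local and global probability for each candidate
    # word, pick the best-scoring synonym, and substitute it into the user question.
    def best_synonym(candidates, local_probs, global_probs, combine="product"):
        def score(w):
            if combine == "product":
                return local_probs[w] * global_probs[w]
            return 0.5 * (local_probs[w] + global_probs[w])  # arithmetic-mean variant
        return max(candidates, key=score)

    def to_standard_question(question_tokens, predicted_index, synonym):
        tokens = list(question_tokens)
        tokens[predicted_index] = synonym  # replace the predicted word with the synonym
        return tokens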

Claims (7)

1. A synonym matching method based on question sentences, characterized by comprising the following steps:
S1, constructing a question-answer pair vocabulary set, a word vector table of the corpus vocabulary, and a BERT training vocabulary;
S2, training the BERT language model with the BERT training vocabulary to obtain a trained BERT language model;
S3, obtaining a user question word sequence, masking the predicted word in the user question word sequence that does not appear in the question-answer pair vocabulary set, and calculating, based on the trained BERT language model, the distribution probability set of the masked word over the question-answer pair vocabulary set, i.e. the local probability set;
S4, calculating the global probability set of the predicted word over the question-answer pair vocabulary set from the word vector table of the corpus vocabulary;
S5, taking the intersection of the local probability set and the global probability set and judging whether it is empty; if it is not empty, jumping to S6; if it is empty, inputting a new user question word sequence and jumping to S3;
S6, calculating the comprehensive probability of the predicted word from the local probability set and the global probability set, finding the synonym in the question-answer pair vocabulary set according to this comprehensive probability, and replacing the predicted word in the user question word sequence with the synonym to obtain the standard question.
2. The question-based synonym matching method according to claim 1, wherein the step S1 comprises the following steps:
S11, constructing an original standard question-answer pair set FQA, and building the question-answer pair vocabulary set from the words in FQA;
S12, collecting corpora to construct a corpus set, and building the BERT training vocabulary from the corpus set;
S13, constructing the corpus vocabulary from the corpus set, and processing it with the skip-gram method to obtain the word vector table of the corpus vocabulary.
3. The question-based synonym matching method according to claim 1, wherein the step S3 comprises the following substeps:
S31, obtaining a user question word sequence, and masking the predicted word that does not appear in the question-answer pair vocabulary set to obtain the masked word;
S32, calculating, with the trained BERT language model, the distribution probability of the masked word over the BERT training vocabulary to obtain the distribution probability set of the masked word over the BERT training vocabulary;
S33, normalizing the distribution probability set over the BERT training vocabulary to obtain the distribution probability set of the masked word over the question-answer pair vocabulary set, i.e. the local probability set.
4. The question-based synonym matching method according to claim 1, wherein the step S4 comprises the following substeps:
S41, looking up, in the word vector table of the corpus vocabulary, the predicted word vector of the predicted word and the word vector of each word in the question-answer pair vocabulary set;
S42, computing the inner product of the predicted word vector with each of these word vectors, and normalizing the inner products with softmax to obtain the global probability set of the predicted word over the question-answer pair vocabulary set.
5. The question-based synonym matching method according to claim 1, wherein the local probability set in step S3 is calculated as:
p̂_j = p_j / Σ_{k=1}^{|V_F|} p_k
where p̂_j is the jth local probability in the local probability set, p_j is the jth distribution probability in the distribution probability set, and |V_F| is the total number of words in the question-answer pair vocabulary set.
6. The question-based synonym matching method according to claim 1, wherein the global probability set in step S4 is calculated as:
p(w_j | x_i) = softmax(v(w_j) · v(x_i)), w_j ∈ V_F
x = {x_1, x_2, …, x_i, …, x_T}
where p(w_j | x_i) is the jth global probability in the global probability set, v(w_j) is the word vector of the jth word w_j in the question-answer pair vocabulary set V_F, v(x_i) is the predicted word vector, x is the user question word sequence, x_i is the ith word in x, i.e. the predicted word, and T is the total number of words in x.
7. The question-based synonym matching method according to claim 1, wherein the comprehensive probability of the predicted word in step S6 is calculated as:
w = p̂_j · p(w_j | x_i)
or
w = (p̂_j + p(w_j | x_i)) / 2
where w is the comprehensive probability of the predicted word, p̂_j is the jth local probability in the local probability set, and p(w_j | x_i) is the jth global probability in the global probability set.
CN202011203497.9A 2020-11-02 2020-11-02 Synonym matching method based on question sentence Pending CN112364634A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011203497.9A CN112364634A (en) 2020-11-02 2020-11-02 Synonym matching method based on question sentence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011203497.9A CN112364634A (en) 2020-11-02 2020-11-02 Synonym matching method based on question sentence

Publications (1)

Publication Number Publication Date
CN112364634A (en) 2021-02-12

Family

ID=74512617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011203497.9A Pending CN112364634A (en) 2020-11-02 2020-11-02 Synonym matching method based on question sentence

Country Status (1)

Country Link
CN (1) CN112364634A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1320086A1 (en) * 2001-12-13 2003-06-18 Sony International (Europe) GmbH Method for generating and/or adapting language models
CN110110054A (en) * 2019-03-22 2019-08-09 北京中科汇联科技股份有限公司 A method of obtaining question and answer pair in the slave non-structured text based on deep learning
CN110442675A (en) * 2019-06-27 2019-11-12 平安科技(深圳)有限公司 Question and answer matching treatment, model training method, device, equipment and storage medium
CN111597319A (en) * 2020-05-26 2020-08-28 成都不问科技有限公司 Question matching method based on FAQ question-answering system
CN111832292A (en) * 2020-06-03 2020-10-27 北京百度网讯科技有限公司 Text recognition processing method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1320086A1 (en) * 2001-12-13 2003-06-18 Sony International (Europe) GmbH Method for generating and/or adapting language models
CN110110054A (en) * 2019-03-22 2019-08-09 北京中科汇联科技股份有限公司 A method of obtaining question and answer pair in the slave non-structured text based on deep learning
CN110442675A (en) * 2019-06-27 2019-11-12 平安科技(深圳)有限公司 Question and answer matching treatment, model training method, device, equipment and storage medium
CN111597319A (en) * 2020-05-26 2020-08-28 成都不问科技有限公司 Question matching method based on FAQ question-answering system
CN111832292A (en) * 2020-06-03 2020-10-27 北京百度网讯科技有限公司 Text recognition processing method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
夏远远; 王宇: "Construction of a question retrieval model for community question-answering systems based on HNC theory" (基于HNC理论的社区问答系统问句检索模型构建), Computer Applications and Software (计算机应用与软件), no. 08, 12 August 2018 (2018-08-12) *

Similar Documents

Publication Publication Date Title
CN110147436B (en) Education knowledge map and text-based hybrid automatic question-answering method
Li et al. A co-attention neural network model for emotion cause analysis with emotional context awareness
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
Sukkarieh et al. Automarking: using computational linguistics to score short, free-text responses
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
Sahu et al. Prashnottar: a Hindi question answering system
CN110851599A (en) Automatic scoring method and teaching and assisting system for Chinese composition
CN114416942A (en) Automatic question-answering method based on deep learning
CN112328800A (en) System and method for automatically generating programming specification question answers
CN115858750A (en) Power grid technical standard intelligent question-answering method and system based on natural language processing
Nugraha et al. Typographic-based data augmentation to improve a question retrieval in short dialogue system
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN112417823A (en) Chinese text word order adjusting and quantitative word completion method and system
Liu Research on the development of computer intelligent proofreading system based on the perspective of English translation application
CN117251455A (en) Intelligent report generation method and system based on large model
Rosset et al. The LIMSI participation in the QAst track
He et al. [Retracted] Application of Grammar Error Detection Method for English Composition Based on Machine Learning
CN112364634A (en) Synonym matching method based on question sentence
Alwaneen et al. Stacked dynamic memory-coattention network for answering why-questions in Arabic
CN114722153A (en) Intention classification method and device
CN111581326B (en) Method for extracting answer information based on heterogeneous external knowledge source graph structure
Khandait et al. Automatic question generation through word vector synchronization using lamma
Guo An automatic scoring method for Chinese-English spoken translation based on attention LSTM
Sreeram et al. Language modeling for code-switched data: Challenges and approaches

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination