CN109344236B - Problem similarity calculation method based on multiple characteristics - Google Patents

Problem similarity calculation method based on multiple characteristics

Info

Publication number
CN109344236B
CN109344236B
Authority
CN
China
Prior art keywords
new
rel
similarity
question
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811041071.0A
Other languages
Chinese (zh)
Other versions
CN109344236A (en)
Inventor
刘波 (Liu Bo)
彭永幸 (Peng Yongxing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN201811041071.0A
Publication of CN109344236A
Application granted
Publication of CN109344236B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a question similarity calculation method based on multiple features, which comprises the following steps: for an input new question sentence, compare it with the stored historical questions and their corresponding answers, and calculate the similarity between the new question and each historical question based on character features, word semantic features, sentence semantic features, sentence implicit topic features, and answer semantic features; the final similarity is the weighted sum of these 5 similarities, with the weights obtained by training with a linear regression method. The invention adopts multiple features to increase the diversity of sample attributes and improve the generalization capability of the model. Meanwhile, the soft cosine distance is used to fuse TF-IDF with edit-distance and word-semantic information, which overcomes the semantic gap between words and improves the accuracy of similarity calculation.

Description

Problem similarity calculation method based on multiple characteristics
Technical Field
The invention relates to the field of computer natural language processing and automatic question-answering system research, in particular to a problem similarity calculation method based on multiple characteristics.
Background
With the rapid growth of digital information, acquiring the required information resources from the network has become increasingly difficult. Finding the needed information for users accurately and quickly within massive amounts of digital data poses a serious challenge to natural language processing (NLP) and information retrieval technology. Therefore, in order to provide users with a highly real-time and accurate channel for information acquisition, research institutions and technology companies have begun to study automatic question-answering (QA) systems. In an automatic question-answering system, a user obtains the corresponding answer simply by inputting a question, without having to extract keywords, search, and read a large number of web pages to find the answer. Compared with traditional search engines, an automatic question-answering system is simpler, easier to use, real-time, and accurate; it provides a comfortable human-computer interaction experience and has become a research hotspot of a new generation of information technology. An automatic question-answering system allows users to describe questions in natural language form, accurately understands the user's question, organizes answers by retrieving information from a question-answer repository or the Internet, and finally returns a refined, accurate result, providing an efficient channel for information acquisition.
Question similarity calculation is the first step in an automatic question-answering system: its goal is to find, from the existing question set, the historical question most similar to a newly posed question, so that the answer to the new question can be given according to the answer set of that historical question.
At present, there are already some achievements in the field of automatic question answering. General community question-answering systems include Quora and Baidu Knows (Baidu Zhidao), among others, while professional community question-answering systems cover many specialties, for example the IT-related systems Stack Overflow and CSDN. The question similarity calculation method directly affects the accuracy of such question-answering systems, and therefore has good industrial prospects.
Through years of research, automatic question-answering systems have converged on a general framework consisting mainly of three modules: information retrieval, question analysis, and answer acquisition. The question analysis module analyzes the question input by the user and finds the historical questions most similar to it in the existing question set; its research content involves question similarity analysis and question ranking, the most important part being the similarity calculation between questions, by which the historical question set is ranked. The answer acquisition module obtains the corresponding answer set according to the similar question set returned by question retrieval.
Text similarity techniques are the basis of question similarity techniques (both questions and answers are text). There are three main approaches to calculating text similarity.
The first is similarity calculation based on the vector space model (VSM): a text is mapped to a point in a vector space, and the distance between points is calculated mathematically. Researchers have applied the VSM to similar-question retrieval for Frequently Asked Questions (FAQ) and improved it for the characteristics of the FAQ task. However, the sparseness of text makes the dimensionality too large and easily causes the semantic gap problem.
The second is similarity calculation based on syntactic analysis, which introduces a graphical representation to describe the governing and governed relations among the phrases in a sentence. Some researchers propose an analysis method based on deep structure: first analyze the dependency relations of the questions, select the most important words in the sentence and the effective words directly attached to them for pairing, and then calculate the text similarity of Chinese based on the dependency structure. However, the syntactic and dependency analysis of this method is complex, requires a linguistic background, and performs poorly on long sentences with complex structure.
The third is similarity calculation based on semantics, covering both word semantics and sentence semantics. Word-level semantic similarity calculation generally uses semantic dictionaries such as WordNet and HowNet, which record semantic relations between words. Researchers argue that the complete expression of a phrase depends not only on the syntactic structure but also on the words and their weights, and therefore use WordNet to improve the semantic representation of words. For sentence-level semantic similarity, researchers have used IBM's machine translation model to learn the conversion probability between two question sentences, thereby expressing their semantic similarity and retrieving similar questions. However, these methods depend heavily on the semantic dictionary, whose completeness and correctness affect the accuracy of the similarity calculation; likewise, semantics-based similarity calculation performs poorly on long, syntactically complex sentences.
Meanwhile, most existing methods extract text representation features from a single type of information and focus on single-type features, without considering that the meaning of a text is formed by multi-aspect, multi-level information, so the accuracy of the calculated similarity is not high.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a question similarity calculation method based on multiple features, which is suitable for calculating the similarity between questions in an English question-answering system and has the advantage of high accuracy.
The purpose of the invention is achieved by the following technical scheme: a question similarity calculation method based on multiple features comprises the following steps: for an input new question sentence, compare it with the stored historical questions and their corresponding answers, and calculate the similarity between the new question and each historical question based on character features, word semantic features, sentence semantic features, sentence implicit topic features, and answer semantic features; the final similarity is the weighted sum of the 5 similarities with their corresponding weights, the weights being obtained by training with a linear regression method.
Preferably, the new question sentence, the historical questions, and the corresponding answers used for the comparison are preprocessed, including punctuation removal, case conversion (all capital letters converted to lowercase), and removal of stop words and low-frequency words.
Preferably, the method for calculating the similarity based on character features is as follows: first obtain a relation matrix between words by calculating the edit distance between each pair of words, then compute the soft cosine distance from the TF-IDF (term frequency-inverse document frequency) representations of the question sentences and the relation matrix, as the similarity based on character features.
Furthermore, the calculation method of the relation matrix between the words is as follows:
Define the corpus as the question-and-answer text data set used for training and testing the model. If the dictionary of the corpus contains n words, the edit distances between words form a relation matrix M_lev ∈ R^{n×n}, where R^{n×n} is the set of real matrices of size n×n (the same meaning below), and the element m_{i,j} of M_lev is the edit-distance score between the i-th word w_i and the j-th word w_j in the dictionary. The score is calculated as follows:

$$m_{i,j}=\begin{cases}\alpha, & i=j\\ \left(1-\dfrac{lev(w_i,w_j)}{\max(\|w_i\|,\|w_j\|)}\right)^{\beta}, & i\neq j\end{cases}$$

where ||w_i|| is the number of characters in word w_i, ||w_j|| is the number of characters in word w_j, α is a weighting factor for the diagonal elements, and β is an enhancement factor for the distance score. The edit distance lev(w_i, w_j) is given by the recursion

$$lev_{w_i,w_j}(m,n)=\begin{cases}\max(m,n), & \min(m,n)=0\\ \min\bigl(lev(m-1,n)+1,\ lev(m,n-1)+1,\ lev(m-1,n-1)+cost\bigr), & \text{otherwise}\end{cases}$$

where m and n are the lengths (numbers of characters) of w_i and w_j, and cost is the cost of replacing the m-th character of w_i with the n-th character of w_j: 0 if the two characters are identical, 1 otherwise.
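As an illustration of this step, the following is a minimal Python sketch of the edit-distance relation matrix; the function names and the α, β defaults are illustrative (the embodiment below uses α = 1.8, β = 5), and the off-diagonal scoring follows the normalized form given above.

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Standard edit distance computed by dynamic programming."""
    m, n = len(a), len(b)
    dp = np.zeros((m + 1, n + 1), dtype=int)
    dp[:, 0] = np.arange(m + 1)                   # delete all of a's prefix
    dp[0, :] = np.arange(n + 1)                   # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution cost
            dp[i, j] = min(dp[i - 1, j] + 1,          # deletion
                           dp[i, j - 1] + 1,          # insertion
                           dp[i - 1, j - 1] + cost)   # substitution
    return int(dp[m, n])

def edit_relation_matrix(vocab, alpha=1.8, beta=5):
    """M_lev: alpha on the diagonal; off the diagonal, the normalized
    edit-distance score raised to the power beta (reconstructed form)."""
    n = len(vocab)
    M = np.zeros((n, n))
    for i, wi in enumerate(vocab):
        for j, wj in enumerate(vocab):
            if i == j:
                M[i, j] = alpha
            else:
                sim = 1.0 - levenshtein(wi, wj) / max(len(wi), len(wj))
                M[i, j] = max(0.0, sim) ** beta
    return M
```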
Furthermore, the TF-IDF representation of a question sentence is calculated as follows: in a sentence, for each word w_i, compute the TF value, which represents the frequency of the word in the current sentence, and the IDF value, which is the inverse document frequency index; the TF-IDF value of word w_i is then

TFIDF_i = TF_i * IDF_i
Furthermore, for a newly posed question Q_new and a historical question Q_rel, the soft cosine distance is calculated as follows:
The TF-IDF representations of Q_new and Q_rel are denoted TFIDF_new and TFIDF_rel respectively:

TFIDF_new = [d_new,1, d_new,2, …, d_new,n]^T
TFIDF_rel = [d_rel,1, d_rel,2, …, d_rel,n]^T

where d_new,i is the TF-IDF value of w_i in Q_new, d_rel,j is the TF-IDF value of w_j in Q_rel, n is the number of words in the dictionary of the corpus, and T denotes vector transpose.
Then, using the relation matrix M_lev between words, the similarity between Q_new and Q_rel based on character features, R_lev(Q_new, Q_rel), is calculated with the soft cosine distance:

$$R_{lev}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{lev}\cdot TFIDF_{rel}}{\sqrt{TFIDF_{new}\cdot M_{lev}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{rel}\cdot M_{lev}\cdot TFIDF_{rel}}}$$

where "·" denotes the dot product of vectors and matrices (the same meaning below), calculated as

$$x\cdot M\cdot y=\sum_{i=1}^{n}\sum_{j=1}^{n}x_i\,m_{i,j}\,y_j$$
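As a minimal sketch, the soft cosine distance above can be computed as follows, assuming NumPy vectors for the TF-IDF representations and a precomputed relation matrix; all names are illustrative.

```python
import numpy as np

def soft_cosine(x: np.ndarray, y: np.ndarray, M: np.ndarray) -> float:
    """Soft cosine distance: x·M·y normalized by sqrt(x·M·x)·sqrt(y·M·y)."""
    num = x @ M @ y
    den = np.sqrt(x @ M @ x) * np.sqrt(y @ M @ y)
    return float(num / den) if den > 0 else 0.0

# e.g. R_lev = soft_cosine(tfidf_new, tfidf_rel, M_lev)
```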
Preferably, the method for calculating the similarity based on word semantic features is as follows:
(1) Train with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e., each word corresponds to a K-dimensional real vector.
(2) For a corpus with a dictionary of size n, calculate the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between their word vectors, obtaining the relation matrix M_w2v ∈ R^{n×n}.
(3) Read the TF-IDF representations of questions Q_new and Q_rel, denoted TFIDF_new and TFIDF_rel.
(4) Calculate the similarity between Q_new and Q_rel based on word semantic features, R_w2v(Q_new, Q_rel), using the soft cosine distance:

$$R_{w2v}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{rel}}{\sqrt{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{rel}\cdot M_{w2v}\cdot TFIDF_{rel}}}$$
Preferably, the method for calculating the similarity based on sentence semantic features is as follows:
Let w_i ∈ R^K be the word vectors obtained by word2vec training. If a question sentence contains M words, it can be represented as the matrix Q_matrix = (w_1, w_2, …, w_M), Q_matrix ∈ R^{K×M}.
The question sentence is then expressed as the arithmetic mean of the word vectors in the sentence, i.e., the vector

$$AVG=\frac{1}{M}\sum_{i=1}^{M}w_i$$

According to the above formula, questions Q_new and Q_rel yield the vectors AVG_new and AVG_rel by averaging the word vectors in each sentence; the cosine distance between AVG_new and AVG_rel gives the similarity between Q_new and Q_rel based on sentence semantic features, R_vec(Q_new, Q_rel).
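A sketch of this sentence-vector similarity, assuming `wv` maps a word to its trained word2vec vector (for example a Gensim KeyedVectors object); the names are illustrative.

```python
import numpy as np

def sentence_vector(tokens, wv):
    """AVG: arithmetic mean of the word vectors of a sentence."""
    vecs = [wv[t] for t in tokens if t in wv]   # skip out-of-vocabulary words
    return np.mean(vecs, axis=0)

def cosine(a, b):
    """Plain cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# R_vec = cosine(sentence_vector(q_new_tokens, wv), sentence_vector(q_rel_tokens, wv))
```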
Preferably, the method for calculating the similarity based on sentence implicit topic features is as follows:
The corpus is used as the input of the LdaModel function in Gensim, and an LDA model is obtained by training. Then the newly posed question Q_new and the historical question Q_rel, whose topic distributions are to be calculated, are input into the trained LDA model, giving vector representations of the two question sentences based on topic distribution, denoted LDA_new and LDA_rel respectively. The cosine distance between LDA_new and LDA_rel gives the similarity between Q_new and Q_rel based on sentence implicit topic features, R_lda(Q_new, Q_rel).
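Using Gensim as described, here is a sketch of training the LdaModel and mapping a question sentence to a dense topic-distribution vector; the topic count and variable names are assumptions.

```python
from gensim import corpora, models
import numpy as np

def train_lda(tokenized_docs, num_topics=20):
    """Train an LDA model on the tokenized corpus with Gensim's LdaModel."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bows = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    return lda, dictionary

def topic_vector(lda, dictionary, tokens):
    """LDA_q: the question's probability over every topic, as a dense vector."""
    bow = dictionary.doc2bow(tokens)
    dist = lda.get_document_topics(bow, minimum_probability=0.0)  # keep all topics
    return np.array([prob for _, prob in dist])

# R_lda = cosine similarity between topic_vector(...) of Q_new and of Q_rel
```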
Preferably, the method for calculating the similarity based on answer semantic features is as follows:
In the question-answering system, each historical question corresponds to a set of candidate answers. For a newly posed question Q_new, a historical question Q_rel, and a candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is calculated as the similarity of the two question sentences based on answer semantic features, R_qa(Q_new, Q_rel). The specific process is as follows:
(1) Train with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e., each word corresponds to a K-dimensional real vector.
(2) For a corpus with a dictionary of size n, calculate the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between their word vectors, obtaining the relation matrix M_w2v ∈ R^{n×n}.
(3) Read the TF-IDF representation TFIDF_new of question Q_new; represent A_rel by TF-IDF to obtain TFIDF_ans = [d_ans,1, d_ans,2, …, d_ans,n]^T, where d_ans,i is the TF-IDF value of w_i in A_rel, n is the number of words in the dictionary of the corpus, and T denotes vector transpose.
(4) Calculate the similarity between Q_new and Q_rel based on answer semantic features, R_qa(Q_new, Q_rel), using the soft cosine distance:

$$R_{qa}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{ans}}{\sqrt{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{ans}\cdot M_{w2v}\cdot TFIDF_{ans}}}$$
Preferably, the final similarity is calculated as:

$$Sim(Q_{new},Q_{rel})=\sum_{k}\lambda_k R_k(Q_{new},Q_{rel})$$

where R_k(Q_new, Q_rel) denotes the similarity based on feature k, i.e., the similarity based on character features, word semantic features, sentence semantic features, sentence implicit topic features, and answer semantic features respectively, and λ_k are the parameters to be trained, obtained by training with linear regression analysis.
Furthermore, in training with the linear regression analysis method, the iteration steps are as follows:
(1) Randomly initialize the weight λ_k corresponding to each feature, and calculate the squared loss of the current iteration from the weights; the squared loss function is

$$L(\lambda)=\sum_{i=1}^{I}\bigl(\hat{Y}^{(i)}-Y^{(i)}\bigr)^2$$

where I is the number of given training samples; the training samples are known question pairs together with the similarity of each pair, \hat{Y}^{(i)} is the predicted similarity of the i-th sample, and Y^{(i)} is the known similarity of the i-th sample.
(2) Take the partial derivative of the squared loss with respect to each feature weight λ_k to obtain the gradient of the weight at the current iteration, ∇λ_k^{(t)} = ∂L/∂λ_k, where t denotes the t-th iteration.
(3) Update each feature weight according to λ_k^{(t+1)} = λ_k^{(t)} − α·∇λ_k^{(t)}, where α is the step size.
(4) Recalculate the squared loss with the new weights. If the current squared loss is not less than the squared loss of the previous iteration, stop iterating and take the current weights λ_k as the final values; otherwise, repeat steps (2) to (4).
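The iteration above amounts to batch gradient descent on the squared loss; below is a sketch under the assumption that the per-feature similarities of the training pairs are stacked into a matrix (all names illustrative).

```python
import numpy as np

def train_weights(R, y, alpha=0.1, max_iter=1000, seed=0):
    """Fit the weights lambda_k by gradient descent on the squared loss.
    R: (I, 5) array, R[i, k] = similarity of training pair i under feature k.
    y: (I,) array of known similarities."""
    rng = np.random.default_rng(seed)
    lam = rng.uniform(-1.0, 1.0, R.shape[1])   # step (1): random initialization
    prev_loss = np.inf
    for _ in range(max_iter):
        pred = R @ lam                          # predicted similarities
        loss = np.sum((pred - y) ** 2)          # squared loss
        if loss >= prev_loss:                   # step (4): stop when no improvement
            break
        prev_loss = loss
        grad = 2.0 * R.T @ (pred - y)           # step (2): gradient dL/d(lambda_k)
        lam = lam - alpha * grad                # step (3): update with step size alpha
    return lam
```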
Compared with the prior art, the invention has the following advantages and technical effects:
(1) In the field of machine learning, a training sample is described by a set of attributes, and different attribute subsets provide different views of the observed data. The present invention observes question and answer sentences described in natural language from five different perspectives and extracts five types of features. Compared with representations based on a single type of feature, the use of multiple features increases the diversity of sample attributes and improves the generalization capability of the model.
(2) The method uses the soft cosine distance to fuse TF-IDF with edit-distance and word-semantic information when calculating the similarity between questions. Compared with traditional similarity calculation methods, this overcomes the semantic gap between words and improves the accuracy of similarity calculation.
Drawings
FIG. 1 is a flowchart of the method of the present embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
In this embodiment, a question similarity calculation method based on multiple features measures the similarity between two question sentences with 5 features: character features, word semantic features, sentence semantic features, sentence implicit topic features, and answer semantic features. The similarities based on these 5 features are combined into the final similarity between the new question and the historical question. Referring to FIG. 1, the steps of the method are described in detail through an example.
(1) Input a new question sentence Q_new: Where I can buy good oil for massage?
(2) Read a historical question sentence Q_rel: is there any place i can find scented massage oils in qatar?
At the same time, read the answer A_rel of the historical question: Yes. It is right behind Kahrama in the National area.
(3) Preprocess Q_new, Q_rel, and A_rel respectively, including: punctuation removal, case conversion (all capital letters converted to lowercase), and removal of stop words and low-frequency words, obtaining:
Q_new: buy good oil massage
Q_rel: place find scented massage oils qatar
A_rel: yes right behind kahrama national area
(4) Calculate the similarity between Q_new and Q_rel based on each of the following 5 features.
In the following, assume the corpus dictionary is {area, behind, buy, find, good, kahrama, massage, national, oil, oils, place, qatar, right, scented, yes}.
(4-1) Similarity based on character features
Character features measure the similarity between words at the character level, using the edit distance. First, the relation matrix M_lev between words is obtained by calculating the edit distance between each pair of words; then the soft cosine distance is calculated from the TF-IDF representations of the question sentences and the relation matrix M_lev, as the similarity based on character features. The specific steps are as follows:
(4-1-1) Calculate the relation matrix M_lev
Assume the dictionary of the corpus (i.e., the question-and-answer text data set used for training and testing the model; "corpus" has the same meaning below) contains n words. The edit distances between words form the relation matrix M_lev ∈ R^{n×n}, where R^{n×n} is the set of real matrices of size n×n, and the element m_{i,j} of M_lev is the edit-distance score between the i-th word w_i and the j-th word w_j in the dictionary:

$$m_{i,j}=\begin{cases}\alpha, & i=j\\ \left(1-\dfrac{lev(w_i,w_j)}{\max(\|w_i\|,\|w_j\|)}\right)^{\beta}, & i\neq j\end{cases}$$

where ||w_i|| is the number of characters in word w_i, ||w_j|| is the number of characters in word w_j, α is a weighting factor for the diagonal elements, and β is an enhancement factor for the distance score; here α = 1.8 and β = 5. lev(w_i, w_j) is given by the recursion

$$lev_{w_i,w_j}(m,n)=\begin{cases}\max(m,n), & \min(m,n)=0\\ \min\bigl(lev(m-1,n)+1,\ lev(m,n-1)+1,\ lev(m-1,n-1)+cost\bigr), & \text{otherwise}\end{cases}$$

where m and n are the lengths (numbers of characters) of w_i and w_j, and cost is the cost of replacing the m-th character of w_i with the n-th character of w_j: 0 if the two characters are identical, 1 otherwise.
Since the dictionary contains 15 words, M_lev is a 15 × 15 matrix.
(4-1-2) Calculate the TF-IDF representation of the question sentences
TF-IDF consists of TF and IDF.
In a sentence, for word w_i, the TF value is calculated as

$$TF_i=\frac{n_i}{\sum_k n_k}$$

where TF_i is the frequency of the i-th word in the current sentence, n_i is the number of times the word occurs in the current sentence, and n_k is the number of occurrences of the k-th word in the current sentence.
For a given corpus, the IDF value of each word is fixed; for word w_i it is calculated as

$$IDF_i=\log\frac{|D|}{|\{d: w_i\in d\}|}$$

where |D| is the total number of texts and the denominator is the number of texts containing the word.
In a sentence, the TF-IDF value of word w_i is

TFIDF_i = TF_i * IDF_i
The question sentences Q_new and Q_rel are expressed as the following vectors:

TFIDF_new = [d_new,1, d_new,2, …, d_new,n]^T
TFIDF_rel = [d_rel,1, d_rel,2, …, d_rel,n]^T

where d_new,i is the TF-IDF value of w_i in Q_new, d_rel,j is the TF-IDF value of w_j in Q_rel, n is the number of words in the dictionary of the corpus, and T denotes vector transpose.
The question sentences Q_new and Q_rel obtained by the preprocessing in step (3) are represented as the following 15-dimensional vectors:

TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
TFIDF_rel = [0.0, 0.0, 0.0, 0.423394, 0.0, …]^T
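Here is a sketch of these TF-IDF vectors over the corpus dictionary; the +1 in the IDF denominator is a common smoothing choice assumed here, not stated in the text.

```python
import math
from collections import Counter

def idf_table(tokenized_docs, vocab):
    """IDF_i = log(|D| / number of texts containing w_i); +1 avoids division by zero."""
    D = len(tokenized_docs)
    doc_sets = [set(d) for d in tokenized_docs]
    return {w: math.log(D / (1 + sum(w in s for s in doc_sets))) for w in vocab}

def tfidf_vector(tokens, vocab, idf):
    """TF-IDF representation of one sentence over the corpus dictionary."""
    counts = Counter(tokens)
    total = sum(counts.values())        # total word occurrences in the sentence
    return [counts[w] / total * idf[w] for w in vocab]
```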
(4-1-3) Calculate the soft cosine distance
The soft cosine distance is a variant of the cosine distance: in 2014, Sidorov proposed an improved cosine similarity calculation method named the soft cosine distance (soft cosine), which introduces a relation matrix to express the relations between words when calculating the cosine distance.
With Q_new and Q_rel expressed as TFIDF_new and TFIDF_rel by step (4-1-2), and the relation matrix M_lev between words obtained by step (4-1-1), the similarity between Q_new and Q_rel based on character features, R_lev(Q_new, Q_rel), is calculated with the soft cosine distance:

$$R_{lev}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{lev}\cdot TFIDF_{rel}}{\sqrt{TFIDF_{new}\cdot M_{lev}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{rel}\cdot M_{lev}\cdot TFIDF_{rel}}}$$

where "·" denotes the dot product of vectors and matrices (the same meaning below), calculated as

$$x\cdot M\cdot y=\sum_{i=1}^{n}\sum_{j=1}^{n}x_i\,m_{i,j}\,y_j$$

In this example, R_lev(Q_new, Q_rel) = 0.225969.
(4-2) Similarity based on word semantic features
(4-2-1) Train with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e., each word corresponds to a 200-dimensional word vector.
(4-2-2) For the corpus with a dictionary of size n, calculate the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between their word vectors, obtaining the relation matrix M_w2v ∈ R^{n×n}. m_{i,j} is calculated as

$$m_{i,j}=\max\bigl(0,\ \cos(w_i,w_j)\bigr)^2$$

where w_i and w_j are the K-dimensional real vectors of the i-th and j-th words in the corpus, w_i, w_j ∈ R^K, with R^K the set of one-dimensional vectors of length K (the same meaning below), and "·" the standard dot product between vectors (the same meaning below):

$$\cos(w_i,w_j)=\frac{w_i\cdot w_j}{\|w_i\|\,\|w_j\|},\qquad w_i\cdot w_j=\sum_{m=1}^{K}w_{i,m}\,w_{j,m}$$

where w_{i,m} is the m-th component of w_i and w_{j,m} is the m-th component of w_j.
In this embodiment the dictionary contains 15 words, so M_w2v is a 15 × 15 matrix.
(4-2-3) Read the TF-IDF representations of Q_new and Q_rel calculated in (4-1-2):

TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
TFIDF_rel = [0.0, 0.0, 0.0, 0.423394, 0.0, …]^T

(4-2-4) Obtain the similarity between Q_new and Q_rel based on word semantic features from the soft cosine distance formula:

$$R_{w2v}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{rel}}{\sqrt{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{rel}\cdot M_{w2v}\cdot TFIDF_{rel}}}=0.304225$$
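The relation matrix M_w2v can be built from the trained word vectors as below; `wv` is assumed to be the word-to-vector mapping from the word2vec training step.

```python
import numpy as np

def w2v_relation_matrix(vocab, wv):
    """M_w2v with m_ij = max(0, cos(w_i, w_j))^2, per the formula above."""
    V = np.array([wv[w] for w in vocab])          # (n, K) matrix of word vectors
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    cos = (V @ V.T) / (norms @ norms.T)           # pairwise cosine similarities
    return np.maximum(0.0, cos) ** 2
```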
(4-3) Similarity based on sentence semantic features
Let w_i ∈ R^K be the word vectors obtained by word2vec training. If a question sentence contains M words, it can be represented as Q_matrix = (w_1, w_2, …, w_M), Q_matrix ∈ R^{K×M}.
The question sentence is expressed as the arithmetic mean of the word vectors in the sentence, i.e., the vector

$$AVG=\frac{1}{M}\sum_{i=1}^{M}w_i$$

According to the above formula, averaging the word vectors of Q_new and Q_rel gives the vectors AVG_new and AVG_rel:

AVG_new = [0.014657, 0.075914, −0.042454, 0.219559, −0.117374, …]
AVG_rel = [−0.088187, −0.025432, −0.05328, 0.17098376, −0.13033055, …]

The cosine distance between AVG_new and AVG_rel gives the similarity between Q_new and Q_rel based on sentence semantic features:

$$R_{vec}(Q_{new},Q_{rel})=\frac{AVG_{new}\cdot AVG_{rel}}{\|AVG_{new}\|\,\|AVG_{rel}\|}=0.738933$$
(4-4) Similarity based on sentence implicit topic features
The invention uses the LDA (Latent Dirichlet Allocation) implicit topic model to obtain the implicit topics of a question sentence. After LDA training, the implicit topic distribution of each document in the document set is obtained, and from it the topic vector of a sentence. For example, the implicit topic distribution of sentence Q_m is (θ_1, θ_2, …, θ_I), where θ_i is the probability that Q_m belongs to the i-th topic and I is the number of implicit topics; Q_m is then expressed as the vector

LDA_m = [θ_1, θ_2, …, θ_I]

The topic distribution of a question sentence is calculated with the Gensim (https://radimrehurek.com/gensim/) topic-model open-source tool. First, the corpus is used as the input of the LdaModel function in Gensim and an LDA model is obtained by training; then a question sentence whose topic distribution is to be calculated is input into the trained LDA model, the output is the topic distribution of the question, and the sentence is expressed as a vector.
For the newly posed question Q_new and the historical question Q_rel, the vector representations of the two question sentences based on topic distribution are obtained through the Gensim topic-model tool, denoted LDA_new and LDA_rel respectively:

LDA_new = [0.001784, 0.001934, 0.002056, 0.002072, 0.001772, …]
LDA_rel = [0.001706, 0.001850, 0.001967, 0.001982, 0.001695, …]

The cosine distance between LDA_new and LDA_rel gives the similarity based on sentence implicit topic features:

$$R_{lda}(Q_{new},Q_{rel})=\frac{LDA_{new}\cdot LDA_{rel}}{\|LDA_{new}\|\,\|LDA_{rel}\|}=0.685844$$
(4-5) Similarity based on answer semantic features
In the question-answering system, each historical question corresponds to a set of candidate answers. For the newly posed question Q_new, the historical question Q_rel, and the candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is calculated by a method similar to (4-2), as the similarity of the two question sentences based on answer semantic features, R_qa(Q_new, Q_rel). The specific process is as follows:
(4-5-1) Read the 200-dimensional word vectors of the words in the dictionary calculated in (4-2-1);
(4-5-2) Read the matrix M_w2v calculated in (4-2-2);
(4-5-3) Read the TF-IDF representation of Q_new obtained in (4-1-2):

TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T

Calculate the TF-IDF representation of the A_rel obtained by the preprocessing in step (3), giving the following 15-dimensional vector:

TFIDF_ans = [0.408248, 0.408248, 0.0, 0.0, 0.0, …]^T

(4-5-4) Calculate the similarity between Q_new and Q_rel based on answer semantic features with the soft cosine distance:

$$R_{qa}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{ans}}{\sqrt{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{ans}\cdot M_{w2v}\cdot TFIDF_{ans}}}=0.018413$$
(5) Calculate the final similarity
The final similarity Sim(Q_new, Q_rel) between Q_new and Q_rel is calculated as

$$Sim(Q_{new},Q_{rel})=\sum_{k}\lambda_k R_k(Q_{new},Q_{rel})$$

where R_k(Q_new, Q_rel) denotes the similarity based on feature k, obtained in (4-1) to (4-5), namely:

R_lev(Q_new, Q_rel) = 0.225969, R_w2v(Q_new, Q_rel) = 0.304225, R_vec(Q_new, Q_rel) = 0.738933, R_lda(Q_new, Q_rel) = 0.685844, R_qa(Q_new, Q_rel) = 0.018413

λ_k are the parameters to be trained, obtained by training with linear regression analysis. The training method is forward stepwise regression, reducing the error as much as possible at each step. A squared loss function is used, and the iteration runs until the loss function is minimized:

$$L(\lambda)=\sum_{i=1}^{I}\bigl(\hat{Y}^{(i)}-Y^{(i)}\bigr)^2$$

where I is the number of given training samples; the training samples are known question pairs together with the similarity of each pair, \hat{Y}^{(i)} is the predicted similarity of the i-th sample, and Y^{(i)} is the known similarity of the i-th sample.
The iterative process is as follows:
1. Randomly initialize the weight λ_k corresponding to each feature, and calculate the squared loss of the current iteration from the weights;
2. Take the partial derivative of the squared loss with respect to each feature weight λ_k to obtain the gradient of the weight at the current iteration, ∇λ_k^{(t)} = ∂L/∂λ_k, where t denotes the t-th iteration;
3. Update each feature weight according to λ_k^{(t+1)} = λ_k^{(t)} − α·∇λ_k^{(t)}, where the step size α is 0.1;
4. Recalculate the squared loss with the new weights. If the current squared loss is not less than the squared loss of the previous iteration, stop iterating and take the current weights λ_k as the final values; otherwise, repeat steps 2 to 4.
In this embodiment, training by the above steps gives the weights of R_k(Q_new, Q_rel):

λ_lev = 0.055985, λ_w2v = 0.753228, λ_vec = 0.207070, λ_lda = 0.475735, λ_qa = −0.122604

The final similarity is then

Sim(Q_new, Q_rel) = 0.055985 × 0.225969 + 0.753228 × 0.304225 + 0.207070 × 0.738933 + 0.475735 × 0.685844 − 0.122604 × 0.018413 ≈ 0.7188
the techniques described herein may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. For a hardware implementation, the processing modules may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Programmable Logic Devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, micro-controllers, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.
For a firmware and/or software implementation, the techniques may be implemented with modules (e.g., procedures, steps, flows, and so on) that perform the functions described herein. The firmware and/or software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (3)

1. A question similarity calculation method based on multiple features, characterized by comprising the following steps: for an input new question sentence, comparing it with stored historical questions and corresponding answers, and calculating the similarity between the new question and the historical question based on character features, the similarity based on word semantic features, the similarity based on sentence semantic features, the similarity based on sentence implicit topic features, and the similarity based on answer semantic features; the final similarity is the weighted sum of the 5 similarities with their corresponding weights, the weights being obtained by training with a linear regression method;
the method for calculating the similarity based on character features is as follows: obtain a relation matrix between words by calculating the edit distance between each pair of words, then calculate the soft cosine distance from the TF-IDF representations of the question sentences and the relation matrix, as the similarity based on character features;
the relation matrix between words is calculated as follows:
define the corpus as the question-and-answer text data set used for training and testing the model; if the dictionary of the corpus contains n words, the edit distances between words form a relation matrix M_lev, whose element m_{i,j} is the edit-distance score between the i-th word w_i and the j-th word w_j in the dictionary:

$$m_{i,j}=\begin{cases}\alpha, & i=j\\ \left(1-\dfrac{lev(w_i,w_j)}{\max(\|w_i\|,\|w_j\|)}\right)^{\beta}, & i\neq j\end{cases}$$

wherein ||w_i|| is the number of characters in word w_i, ||w_j|| is the number of characters in word w_j, α is a weighting factor for the diagonal elements, β is an enhancement factor for the distance score, and lev(w_i, w_j) is given by the recursion

$$lev_{w_i,w_j}(m,n)=\begin{cases}\max(m,n), & \min(m,n)=0\\ \min\bigl(lev(m-1,n)+1,\ lev(m,n-1)+1,\ lev(m-1,n-1)+cost\bigr), & \text{otherwise}\end{cases}$$

wherein m and n denote the lengths (numbers of characters) of w_i and w_j, and cost denotes the cost of replacing the m-th character of w_i with the n-th character of w_j: if the two characters are identical the cost is 0, otherwise 1;
the TF-IDF representation of a question sentence is calculated as follows: in a sentence, for each word w_i, calculate the TF value, representing the frequency of the word in the current sentence, and the IDF value, representing the inverse document frequency index; the TF-IDF value of word w_i is

TFIDF_i = TF_i * IDF_i

for the newly posed question Q_new and the historical question Q_rel, the soft cosine distance is calculated as follows:
the TF-IDF representations of Q_new and Q_rel are denoted TFIDF_new and TFIDF_rel respectively:

TFIDF_new = [d_new,1, d_new,2, …, d_new,n]^T
TFIDF_rel = [d_rel,1, d_rel,2, …, d_rel,n]^T

wherein d_new,i denotes the TF-IDF value of w_i in Q_new, d_rel,j denotes the TF-IDF value of w_j in Q_rel, n denotes the number of words in the dictionary of the corpus, and T denotes vector transpose;
meanwhile, according to the relation matrix M_lev between words, the similarity between Q_new and Q_rel based on character features, R_lev(Q_new, Q_rel), is calculated with the soft cosine distance:

$$R_{lev}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{lev}\cdot TFIDF_{rel}}{\sqrt{TFIDF_{new}\cdot M_{lev}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{rel}\cdot M_{lev}\cdot TFIDF_{rel}}}$$

wherein "·" is the dot product of vectors and matrices;
the method for calculating the similarity based on word semantic features is as follows:
(6-1) train with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e., each word corresponds to a K-dimensional real vector;
(6-2) for the corpus with a dictionary of size n, calculate the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between word vectors, obtaining the relation matrix M_w2v ∈ R^{n×n};
(6-3) read the TF-IDF representations of questions Q_new and Q_rel, denoted TFIDF_new and TFIDF_rel;
(6-4) calculate the similarity between Q_new and Q_rel based on word semantic features, R_w2v(Q_new, Q_rel), with the soft cosine distance:

$$R_{w2v}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{rel}}{\sqrt{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{rel}\cdot M_{w2v}\cdot TFIDF_{rel}}}$$
the method for calculating the similarity based on sentence semantic features is as follows:
let w_i ∈ R^K be the word vectors obtained by word2vec training; if a question sentence contains M words, it is represented as Q_matrix = (w_1, w_2, …, w_M), Q_matrix ∈ R^{K×M};
the question sentence is expressed as the arithmetic mean of the word vectors in the sentence, i.e., the vector

$$AVG=\frac{1}{M}\sum_{i=1}^{M}w_i$$

according to the above formula, questions Q_new and Q_rel yield the vectors AVG_new and AVG_rel by averaging the word vectors in each sentence; the cosine distance between AVG_new and AVG_rel gives the similarity between Q_new and Q_rel based on sentence semantic features, R_vec(Q_new, Q_rel);
the method for calculating the similarity based on sentence implicit topic features is as follows:
the corpus is used as the input of the LdaModel function in Gensim, and an LDA model is obtained by training; then the newly posed question Q_new and the historical question Q_rel, whose topic distributions are to be calculated, are input into the trained LDA model, giving vector representations of the two question sentences based on topic distribution, denoted LDA_new and LDA_rel respectively; the cosine distance between LDA_new and LDA_rel gives the similarity between Q_new and Q_rel based on sentence implicit topic features, R_lda(Q_new, Q_rel);
the method for calculating the similarity based on answer semantic features is as follows:
for the newly posed question Q_new, the historical question Q_rel, and the candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is calculated as the similarity of the two question sentences based on answer semantic features, R_qa(Q_new, Q_rel); the specific process is as follows:
(9-1) train with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e., each word corresponds to a K-dimensional real vector;
(9-2) for the corpus with a dictionary of size n, calculate the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between word vectors, obtaining the relation matrix M_w2v ∈ R^{n×n};
(9-3) read the TF-IDF representation TFIDF_new of question Q_new; represent A_rel by TF-IDF to obtain TFIDF_ans = [d_ans,1, d_ans,2, …, d_ans,n]^T, wherein d_ans,i denotes the TF-IDF value of w_i in A_rel, n denotes the number of words in the dictionary of the corpus, and T denotes vector transpose;
(9-4) calculate the similarity between Q_new and Q_rel based on answer semantic features, R_qa(Q_new, Q_rel), with the soft cosine distance:

$$R_{qa}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{ans}}{\sqrt{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{ans}\cdot M_{w2v}\cdot TFIDF_{ans}}}$$
2. The question similarity calculation method based on multiple features according to claim 1, wherein the new question sentence, the historical questions, and the corresponding answers used for the comparison are preprocessed, the preprocessing comprising punctuation removal, case conversion, and removal of stop words and low-frequency words.
3. The question similarity calculation method based on multiple features according to claim 1, wherein the final similarity is calculated as:

$$Sim(Q_{new},Q_{rel})=\sum_{k}\lambda_k R_k(Q_{new},Q_{rel})$$

wherein R_k(Q_new, Q_rel) denotes the similarity between Q_new and Q_rel based on feature k, i.e., the similarity based on character features, the similarity based on word semantic features, the similarity based on sentence semantic features, the similarity based on sentence implicit topic features, and the similarity based on answer semantic features; λ_k are the parameters to be trained, obtained by training with linear regression analysis.
CN201811041071.0A 2018-09-07 2018-09-07 Problem similarity calculation method based on multiple characteristics Active CN109344236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811041071.0A CN109344236B (en) 2018-09-07 2018-09-07 Problem similarity calculation method based on multiple characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811041071.0A CN109344236B (en) 2018-09-07 2018-09-07 Problem similarity calculation method based on multiple characteristics

Publications (2)

Publication Number Publication Date
CN109344236A CN109344236A (en) 2019-02-15
CN109344236B true CN109344236B (en) 2020-09-04

Family

ID=65304890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811041071.0A Active CN109344236B (en) 2018-09-07 2018-09-07 Problem similarity calculation method based on multiple characteristics

Country Status (1)

Country Link
CN (1) CN109344236B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399615B (en) * 2019-07-29 2023-08-18 中国工商银行股份有限公司 Transaction risk monitoring method and device
CN110532565B (en) * 2019-08-30 2022-03-25 联想(北京)有限公司 Statement processing method and device and electronic equipment
CN110543551B (en) * 2019-09-04 2022-11-08 北京香侬慧语科技有限责任公司 Question and statement processing method and device
CN110825857B (en) * 2019-09-24 2023-07-21 平安科技(深圳)有限公司 Multi-round question and answer identification method and device, computer equipment and storage medium
CN110738049B (en) * 2019-10-12 2023-04-18 招商局金融科技有限公司 Similar text processing method and device and computer readable storage medium
CN110781662B (en) * 2019-10-21 2022-02-01 腾讯科技(深圳)有限公司 Method for determining point-to-point mutual information and related equipment
CN111723297B (en) * 2019-11-20 2023-05-12 中共南通市委政法委员会 Dual-semantic similarity judging method for grid society situation research and judgment
CN113139034A (en) * 2020-01-17 2021-07-20 深圳市优必选科技股份有限公司 Statement matching method, statement matching device and intelligent equipment
CN111191464A (en) * 2020-01-17 2020-05-22 珠海横琴极盛科技有限公司 Semantic similarity calculation method based on combined distance
CN111368177B (en) * 2020-03-02 2023-10-24 北京航空航天大学 Answer recommendation method and device for question-answer community
CN111401031A (en) * 2020-03-05 2020-07-10 支付宝(杭州)信息技术有限公司 Target text determination method, device and equipment
CN111414765B (en) * 2020-03-20 2023-07-25 北京百度网讯科技有限公司 Sentence consistency determination method and device, electronic equipment and readable storage medium
CN111582498B (en) * 2020-04-30 2023-05-12 重庆富民银行股份有限公司 QA auxiliary decision-making method and system based on machine learning
CN111259668B (en) * 2020-05-07 2020-08-18 腾讯科技(深圳)有限公司 Reading task processing method, model training device and computer equipment
CN111680515B (en) * 2020-05-21 2022-05-03 平安国际智慧城市科技股份有限公司 Answer determination method and device based on AI (Artificial Intelligence) recognition, electronic equipment and medium
CN113779183B (en) * 2020-06-08 2024-05-24 北京沃东天骏信息技术有限公司 Text matching method, device, equipment and storage medium
CN112380830B (en) * 2020-06-18 2024-05-17 达观数据有限公司 Matching method, system and computer readable storage medium for related sentences in different documents
CN113392176B (en) * 2020-09-28 2023-08-22 腾讯科技(深圳)有限公司 Text similarity determination method, device, equipment and medium
CN112507097B (en) * 2020-12-17 2022-11-18 神思电子技术股份有限公司 Method for improving generalization capability of question-answering system
CN112632252B (en) * 2020-12-25 2021-09-17 中电金信软件有限公司 Dialogue response method, dialogue response device, computer equipment and storage medium
CN112926340B (en) * 2021-03-25 2024-05-07 东南大学 Semantic matching model for knowledge point positioning
CN113377927A (en) * 2021-06-28 2021-09-10 成都卫士通信息产业股份有限公司 Similar document detection method and device, electronic equipment and storage medium
CN113792125B (en) * 2021-08-25 2024-04-02 北京库睿科技有限公司 Intelligent retrieval ordering method and device based on text relevance and user intention
CN113722459A (en) * 2021-08-31 2021-11-30 平安国际智慧城市科技股份有限公司 Question and answer searching method based on natural language processing model and related device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN103729381A (en) * 2012-10-16 2014-04-16 佳能株式会社 Method and device used for recognizing semantic information in series of documents
CN105701253A (en) * 2016-03-04 2016-06-22 南京大学 Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method
CN106997376A (en) * 2017-02-28 2017-08-01 浙江大学 The problem of one kind is based on multi-stage characteristics and answer sentence similarity calculating method
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9135240B2 (en) * 2013-02-12 2015-09-15 International Business Machines Corporation Latent semantic analysis for application in a question answer system
US9953027B2 (en) * 2016-09-15 2018-04-24 International Business Machines Corporation System and method for automatic, unsupervised paraphrase generation using a novel framework that learns syntactic construct while retaining semantic meaning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN103729381A (en) * 2012-10-16 2014-04-16 佳能株式会社 Method and device used for recognizing semantic information in series of documents
CN105701253A (en) * 2016-03-04 2016-06-22 南京大学 Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method
CN106997376A (en) * 2017-02-28 2017-08-01 浙江大学 The problem of one kind is based on multi-stage characteristics and answer sentence similarity calculating method
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"A noval similarity calculation method based on Chinese sentence keyword weight";Yu Y等;《Journal of software》;20140530;第1151-1155页 *

Also Published As

Publication number Publication date
CN109344236A (en) 2019-02-15

Similar Documents

Publication Publication Date Title
CN109344236B (en) Problem similarity calculation method based on multiple characteristics
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
Liu et al. Learning semantic word embeddings based on ordinal knowledge constraints
US20170177563A1 (en) Methods and systems for automated text correction
Xie et al. Topic enhanced deep structured semantic models for knowledge base question answering
US11068653B2 (en) System and method for context-based abbreviation disambiguation using machine learning on synonyms of abbreviation expansions
US20240111956A1 (en) Nested named entity recognition method based on part-of-speech awareness, device and storage medium therefor
CN111428490A (en) Reference resolution weak supervised learning method using language model
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
Hussein A plagiarism detection system for arabic documents
Zhang et al. Term recognition using conditional random fields
Celikyilmaz et al. A graph-based semi-supervised learning for question-answering
CN111400487A (en) Quality evaluation method of text abstract
CN110991193A (en) Translation matrix model selection system based on OpenKiwi
Jian et al. English text readability measurement based on convolutional neural network: A hybrid network model
Zhang et al. Disease prediction and early intervention system based on symptom similarity analysis
CN111581365B (en) Predicate extraction method
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
CN112926340B (en) Semantic matching model for knowledge point positioning
Ghasemi et al. Farsick: A persian semantic textual similarity and natural language inference dataset
Zhang et al. Dual attention model for citation recommendation with analyses on explainability of attention mechanisms and qualitative experiments
CN111767388B (en) Candidate pool generation method
CN114265924A (en) Method and device for retrieving associated table according to question
Reshmi et al. Textual entailment based on semantic similarity using wordnet
Rei et al. Parser lexicalisation through self-learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant