CN109344236B - Problem similarity calculation method based on multiple characteristics - Google Patents
- Publication number
- CN109344236B (application CN201811041071.0A)
- Authority
- CN
- China
- Prior art keywords
- new
- rel
- similarity
- question
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
Abstract
The invention discloses a question similarity calculation method based on multiple features, comprising the following steps: an input new question is compared with the stored historical questions and their corresponding answers, and five similarities between the new question and each historical question are calculated: a similarity based on character features, a similarity based on word semantic features, a similarity based on sentence semantic features, a similarity based on sentence implicit topic features, and a similarity based on answer semantic features. The final similarity is the sum of the products of the 5 similarities and their corresponding weights, the weights being obtained by training with a linear regression method. The invention uses multiple types of features to increase the diversity of sample attributes and improve the generalization ability of the model. At the same time, the soft cosine distance is used to fuse TF-IDF with edit-distance and word-semantic information, which bridges the semantic gap between words and improves the accuracy of the similarity calculation.
Description
Technical Field
The invention relates to the fields of computer natural language processing and automatic question-answering systems, and in particular to a question similarity calculation method based on multiple features.
Background
With the rapid growth of digital information, acquiring the needed information resources from the network has become more difficult. Finding the information a user needs accurately and quickly in massive digital information poses a serious challenge to natural language processing (NLP) and information retrieval technology. To give users an information acquisition channel with high timeliness and high accuracy, research institutions and technology companies have therefore begun to study automatic question-answering (QA) systems. In an automatic question-answering system, a user obtains the corresponding answer simply by entering a question, without having to extract keywords from the question, search, and read a large number of web pages to find the answer. Compared with a traditional search engine, an automatic question-answering system is simpler, easier to use, timely and accurate; it provides a comfortable human-computer interaction experience and has become a research hotspot of the new generation of information technology. An automatic question-answering system allows the user to describe a question in natural language, then accurately understands the user's question, organizes an answer by retrieving information from a question-answer base or the Internet, and finally returns a refined, accurate result, providing an efficient information acquisition channel.
Question similarity calculation is the first link in an automatic question-answering system. Its goal is to find, in the existing question set, the historical question most similar to a newly posed question, so that an answer to the new question can be given from the answer set of that historical question.
At present, there are already some achievements in the field of automatic question answering in China and abroad. General community question-answering systems include Quora, Baidu Zhidao and the like, while professional community question-answering systems cover many specialties, such as the IT-related systems Stack Overflow and CSDN. The question similarity calculation method therefore directly influences the accuracy of a question-answering system and has good industrial prospects.
Through years of research, automatic question-answering systems have converged on a general framework consisting mainly of three modules: information retrieval, question analysis and answer acquisition. The question analysis module analyzes the question input by the user and finds, in the existing question set, the historical questions most similar to the newly posed question; its research content involves question similarity analysis and question ranking, the most important part being the similarity calculation between questions, by which the historical question set is ranked. The answer acquisition module obtains the corresponding answer set from the similar question set returned by question retrieval.
Text similarity techniques are the basis of question similarity techniques (both questions and answers are text). There are three main approaches to text similarity calculation.
The first is similarity calculation based on the vector space model (VSM): a text is mapped to a point in a vector space, and the distance between points in that space is calculated mathematically. Researchers have proposed applying the VSM to the similar-question retrieval task for frequently asked questions (FAQ) and have improved the VSM for the characteristics of the FAQ task. However, text sparseness makes the dimensionality of this method too large and easily causes the semantic gap problem.
The second is similarity calculation based on syntactic analysis, which introduces a graphical mode to describe how the phrases in a sentence govern and are governed by one another. Some researchers have proposed an analysis method based on deep structure: the dependency relations of the questions are analyzed first, the most important words in the sentence and the effective words directly attached to them are selected for pairing, and the text similarity of Chinese sentences is then calculated on the basis of the dependency structure. However, the syntactic and dependency analysis of this method is complex, requires a linguistic background, and performs poorly on long sentences with complex structure.
The third is similarity calculation based on semantics, which includes word semantics and sentence semantics. Semantic similarity calculation for words generally uses semantic dictionaries such as WordNet and HowNet, which contain the semantic relations between words. Some researchers hold that the complete expression of a phrase depends not only on the syntactic structure but also on the words and their weights, and therefore improve the semantic representation of words with WordNet. For semantic similarity calculation of sentences, researchers have used an IBM machine translation model to learn the conversion probability between two questions, thereby expressing the semantic similarity of the sentences and retrieving similar questions. However, such methods depend excessively on the semantic dictionary, so the completeness and correctness of the dictionary affect the accuracy of the similarity calculation; likewise, semantics-based similarity calculation performs poorly on long sentences with complex syntax.
Meanwhile, most existing methods extract text representation features from a single type of information and focus on single-type features, without considering that the meaning of a text is formed by information at multiple levels and from multiple aspects, so the accuracy of the calculated similarity is not high.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention aims to provide a question similarity calculation method based on multiple features, which is suitable for calculating the similarity between questions in an English question-answering system and has the advantage of high accuracy.
The purpose of the invention is achieved by the following technical scheme: a question similarity calculation method based on multiple features comprises the following steps: an input new question is compared with the stored historical questions and their corresponding answers, and the similarity between the new question and a historical question based on character features, the similarity based on word semantic features, the similarity based on sentence semantic features, the similarity based on sentence implicit topic features and the similarity based on answer semantic features are calculated; the final similarity is the sum of the products of the 5 similarities and their corresponding weights, the weights being obtained by training with a linear regression method.
Preferably, the new question, the historical questions and the corresponding answers used in the comparison are preprocessed, the preprocessing including punctuation removal, case folding (converting all uppercase letters to lowercase), and removal of stop words and low-frequency words.
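For illustration only, the preprocessing step could be sketched in Python as follows; the stop-word list and the low-frequency threshold below are assumptions for demonstration, not part of the claimed method:

```python
import re

# Assumed miniature stop-word list for demonstration.
STOP_WORDS = {"is", "there", "any", "i", "can", "for", "in", "it", "the", "where"}

def preprocess(sentence, corpus_counts=None, min_freq=1):
    """Remove punctuation, lowercase, and drop stop words and low-frequency words."""
    tokens = re.sub(r"[^\w\s]", " ", sentence).lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    if corpus_counts is not None:  # drop words rarer than min_freq in the corpus
        tokens = [t for t in tokens if corpus_counts.get(t, 0) >= min_freq]
    return tokens

print(preprocess("Where I can buy good oil for massage?"))
# ['buy', 'good', 'oil', 'massage']
```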
Preferably, the similarity based on character features is calculated as follows: a relation matrix between words is first obtained by calculating the edit distance between each pair of words, and the soft cosine distance is then calculated from the TF-IDF (term frequency-inverse document frequency) representations of the questions and the relation matrix as the similarity based on character features.
Furthermore, the relation matrix between words is calculated as follows:
The corpus is defined as the question-and-answer text data set used for training and testing the model. If the dictionary of the corpus contains n words, the edit distances between words form a relation matrix M_lev ∈ R^{n×n}, where R^{n×n} is the set of real matrices of size n × n (the same meaning below); the element m_{i,j} of M_lev measures the edit-distance similarity between the i-th word w_i and the j-th word w_j in the dictionary and is calculated as

m_{i,j} = α · (1 − lev(w_i, w_j) / max(||w_i||, ||w_j||))^β for i ≠ j, and m_{i,i} = 1,

where ||w_i|| is the number of characters in the word w_i, ||w_j|| is the number of characters in the word w_j, α is a weighting factor relative to the diagonal elements, and β is an enhancement factor for the distance score. The edit distance lev(w_i, w_j) is computed by the recursion

lev(m, n) = max(m, n) if min(m, n) = 0, and otherwise lev(m, n) = min( lev(m−1, n) + 1, lev(m, n−1) + 1, lev(m−1, n−1) + cost ),

where m and n are the lengths of w_i and w_j (i.e. the numbers of characters), and cost is the cost of replacing the m-th character of w_i with the n-th character of w_j: 0 if the two characters are identical, 1 otherwise.
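As a sketch under the parameterization above (with the values α = 1.8 and β = 5 used later in the embodiment as defaults; function names are illustrative):

```python
import numpy as np

def lev(a, b):
    """Levenshtein edit distance by the dynamic-programming recursion above."""
    m, n = len(a), len(b)
    d = np.zeros((m + 1, n + 1), dtype=int)
    d[:, 0] = np.arange(m + 1)
    d[0, :] = np.arange(n + 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution cost
            d[i, j] = min(d[i - 1, j] + 1,            # deletion
                          d[i, j - 1] + 1,            # insertion
                          d[i - 1, j - 1] + cost)     # substitution
    return int(d[m, n])

def relation_matrix_lev(vocab, alpha=1.8, beta=5.0):
    """M_lev: diagonal 1, off-diagonal alpha * (1 - lev/max_len) ** beta."""
    n = len(vocab)
    M = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            max_len = max(len(vocab[i]), len(vocab[j]))
            M[i, j] = M[j, i] = alpha * (1 - lev(vocab[i], vocab[j]) / max_len) ** beta
    return M
```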
Furthermore, the TF-IDF representation of a question is calculated as follows: in a sentence, for each word w_i, a TF value representing the frequency of the word in the current sentence and an IDF value representing the inverse document frequency are calculated, and the TF-IDF value of w_i is

TFIDF_i = TF_i × IDF_i.
Furthermore, for the newly posed question Q_new and the historical question Q_rel, the soft cosine distance is calculated as follows:
The TF-IDF representations of Q_new and Q_rel are respectively

TFIDF_new = [d_new,1, d_new,2, …, d_new,n]^T
TFIDF_rel = [d_rel,1, d_rel,2, …, d_rel,n]^T

where d_new,i is the TF-IDF value of w_i in Q_new, d_rel,j is the TF-IDF value of w_j in Q_rel, n is the number of words in the dictionary of the corpus, and T denotes the transpose of the vector.
Then, according to the relation matrix M_lev between words, the similarity between Q_new and Q_rel based on character features, R_lev(Q_new, Q_rel), is calculated with the soft cosine distance:

R_lev(Q_new, Q_rel) = (TFIDF_new · M_lev · TFIDF_rel) / ( sqrt(TFIDF_new · M_lev · TFIDF_new) × sqrt(TFIDF_rel · M_lev · TFIDF_rel) )

where "·" denotes the dot product of a vector and a matrix (the same meaning below), calculated as x · M · y = Σ_{i=1}^{n} Σ_{j=1}^{n} x_i m_{i,j} y_j.
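A minimal sketch of the soft cosine computation, assuming x and y are tf-idf vectors over the dictionary and M is a word-word relation matrix:

```python
import numpy as np

def soft_cosine(x, y, M):
    """Soft cosine similarity of tf-idf vectors x and y under relation matrix M."""
    num = x @ M @ y
    den = np.sqrt(x @ M @ x) * np.sqrt(y @ M @ y)
    return float(num / den) if den else 0.0
```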
Preferably, the similarity based on word semantic features is calculated as follows:
(1) train with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e. each word corresponds to a K-dimensional real vector;
(2) for a corpus with a dictionary of size n, calculate the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between their word vectors, obtaining a relation matrix M_w2v ∈ R^{n×n};
(3) read the TF-IDF representations of the questions Q_new and Q_rel, denoted TFIDF_new and TFIDF_rel respectively;
(4) calculate the similarity between Q_new and Q_rel based on word semantic features, R_w2v(Q_new, Q_rel), with the soft cosine distance:

R_w2v(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_rel) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_rel · M_w2v · TFIDF_rel) )
Preferably, the similarity based on sentence semantic features is calculated as follows:
Given the word vectors w_i ∈ R^K trained with the word2vec tool, a question containing M words can be represented as the matrix Q_matrix = (w_1, w_2, …, w_M).
The question is then expressed as the arithmetic mean of the word vectors in the sentence, i.e. the vector

AVG = (1/M) Σ_{i=1}^{M} w_i.

According to the above formula, the questions Q_new and Q_rel yield the vectors AVG_new and AVG_rel respectively by taking the arithmetic mean of the word vectors in each sentence; the similarity based on sentence semantic features, R_vec(Q_new, Q_rel), is obtained as the cosine distance between AVG_new and AVG_rel.
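For illustration, a sketch of the sentence-vector similarity, assuming wv maps each in-vocabulary word to its trained vector (names are illustrative):

```python
import numpy as np

def sentence_vector(tokens, wv):
    """Arithmetic mean of the word vectors of a sentence."""
    return np.mean([wv[t] for t in tokens], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# R_vec(Q_new, Q_rel) = cosine(sentence_vector(tokens_new, wv),
#                              sentence_vector(tokens_rel, wv))
```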
Preferably, the method for calculating the similarity based on the implicit topic features of the sentences comprises the following steps:
the corpus is used as the input of an LdaModel function in Gensim, and an LDA model is obtained through training; then, a new question Q for calculating the distribution of the topics is put forwardnewAnd historical question QrelInputting the data into a trained LDA model to obtain vector representations of two question sentences based on topic distribution, and respectively recording the vector representations as LDAnew、LDArel. By evaluating LDAnewAnd LDArelTo obtain Qnew、QrelSimilarity R based on sentence implicit topic characteristicslda(Qnew,Qrel)。
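A minimal Gensim sketch of this step; the toy corpus and the topic count are demonstration assumptions:

```python
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["buy", "good", "oil", "massage"],
    ["place", "find", "scented", "massage", "oils", "qatar"],
    ["yes", "right", "behind", "kahrama", "national", "area"],
]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# Train the LDA model; the number of implicit topics is an assumption.
lda = LdaModel(bow_corpus, num_topics=10, id2word=dictionary)

def topic_vector(tokens):
    """Dense topic-distribution vector of a question under the trained model."""
    bow = dictionary.doc2bow(tokens)
    return [p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)]

lda_new = topic_vector(["buy", "good", "oil", "massage"])
lda_rel = topic_vector(["place", "find", "scented", "massage", "oils", "qatar"])
# R_lda is then the cosine similarity between lda_new and lda_rel.
```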
Preferably, the method for calculating the similarity based on the answer semantic features comprises the following steps:
in the question-answering system, each historical question corresponds to a candidate answer set. For newly proposed question QnewAnd historical question QrelAnd QrelCorresponding toCandidate answer ArelCalculating QnewAnd ArelThe semantic similarity of the word level as the similarity R of two question sentences based on the answer semantic featuresqa(Qnew,Qrel) The specific process is as follows:
(1) and (3) training by using a word2vec tool to obtain the distributed representation of each word in the corpus, namely each word corresponds to a K-dimensional real number vector.
(2) Calculating the semantic relation m between every two words in the dictionary by solving the cosine distance between word vectors for the corpus with the dictionary size of ni,j,i,j∈[1,n]To obtain a relation matrix Mw2v,Mw2v∈Rn×n。
(3) Read question QnewTF-IDF of (A) represents TFIDFnew;ArelExpressed by TF-IDF to obtain TFIDFans=[dans,1,dans,2,...,dans,n]TWherein d isans,iDenotes wiIn ArelThe TF-IDF value in (1), n represents the number of words contained in the dictionary of the corpus, and T represents the transpose of the vector.
(4) Computing Q using soft cosine distancesnew、QrelSimilarity R based on answer semantic features between the twoqa(Qnew,Qrel) The formula is as follows:
Preferably, the final similarity is calculated as

Sim(Q_new, Q_rel) = Σ_k λ_k R_k(Q_new, Q_rel),

where R_k(Q_new, Q_rel) is the similarity based on feature k, i.e. the similarity based on character features, the similarity based on word semantic features, the similarity based on sentence semantic features, the similarity based on sentence implicit topic features, and the similarity based on answer semantic features respectively; λ_k are the parameters to be trained, obtained by training with a linear regression analysis method.
Furthermore, when training with the linear regression analysis method, the iteration steps are as follows:
(1) randomly initialize the weight λ_k corresponding to each feature and calculate the squared loss of the current iteration from these weights; the squared loss function is

L = Σ_{i=1}^{I} (Ŷ^(i) − Y^(i))²,

where I is the number of given training samples, each training sample being a known question pair together with the similarity of that pair, Ŷ^(i) is the predicted similarity of the i-th sample, and Y^(i) is the known similarity of the i-th sample;
(2) take the partial derivative of the squared loss with respect to each weight λ_k to obtain the gradient of that weight at the current iteration, g_k^(t) = ∂L/∂λ_k, where t denotes the t-th iteration;
(3) update each weight along the negative gradient direction, λ_k^(t+1) = λ_k^(t) − η·g_k^(t), where η is the learning step size;
(4) recalculate the squared loss with the new weights; if the current squared loss is not less than the squared loss of the previous iteration, stop the iteration and take the current λ_k as the final weight of each feature; otherwise repeat steps (2) to (4).
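For illustration, the iteration above could be sketched as plain gradient descent on the squared loss; the learning rate and iteration cap are assumptions:

```python
import numpy as np

def train_weights(R, y, lr=0.01, max_iter=1000):
    """Fit the weights lambda_k by gradient descent on the squared loss.
    R: (I, 5) matrix holding the five similarities of each training pair;
    y: (I,) known similarities. lr and max_iter are illustrative choices."""
    rng = np.random.default_rng(0)
    lam = rng.normal(size=R.shape[1])            # (1) random initialization
    prev_loss = np.inf
    for _ in range(max_iter):
        pred = R @ lam
        loss = np.sum((pred - y) ** 2)           # squared loss of this iteration
        if loss >= prev_loss:                    # (4) stop when loss stops decreasing
            break
        prev_loss = loss
        grad = 2.0 * R.T @ (pred - y)            # (2) gradient w.r.t. each weight
        lam = lam - lr * grad                    # (3) step along the negative gradient
    return lam
```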
Compared with the prior art, the invention has the following advantages and technical effects:
(1) In the field of machine learning, a training sample is described by a set of attributes, and different attribute subsets provide different views of the observed data. The present invention extracts five types of features by observing question and answer sentences described in natural language from five different perspectives. Compared with representations based on a single type of feature, using multiple features increases the diversity of sample attributes and improves the generalization ability of the model.
(2) The method uses the soft cosine distance to fuse TF-IDF with edit-distance and word-semantic information when calculating the similarity between questions. Compared with traditional similarity calculation methods, this bridges the semantic gap between words and improves the accuracy of the similarity calculation.
Drawings
FIG. 1 is a flowchart of the method of the present embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
In this embodiment, a question similarity calculation method based on multiple features measures the similarity between two questions with 5 features: a character feature, a word semantic feature, a sentence semantic feature, a sentence implicit topic feature, and an answer semantic feature. The similarities based on these 5 features are combined into the final similarity between the new question and the historical question. Referring to FIG. 1, the steps of the method are described in detail through an example.
(1) Input a new question Q_new: Where I can buy good oil for massage?
(2) Read a historical question Q_rel: is there any place i can find scented massage oils in qatar?
At the same time, read the answer A_rel of the historical question: Yes. It is right behind Kahrama in the National area.
(3) Preprocess Q_new, Q_rel and A_rel respectively, including punctuation removal, case folding (converting all uppercase letters to lowercase), and removal of stop words and low-frequency words, obtaining:

Q_new: buy good oil massage
Q_rel: place find scented massage oils qatar
A_rel: yes right behind kahrama national area
(4) Calculate the similarity between Q_new and Q_rel based on each of the following 5 features.
In the following, {area, behind, buy, find, good, kahrama, massage, national, oil, oils, place, qatar, right, scented, yes} is taken as the dictionary of the corpus.
(4-1) Similarity based on character features
The character feature measures the similarity between words at the character level, using the edit distance. First, the relation matrix M_lev between words is obtained by calculating the edit distance between each pair of words; then the soft cosine distance is calculated from the TF-IDF representations of the questions and the relation matrix M_lev as the similarity based on character features. The specific steps are as follows:
(4-1-1) Calculate the relation matrix M_lev
Assume the dictionary of the corpus (meaning the question-and-answer text data set used to train and test the model; "corpus" has the same meaning hereinafter) contains n words. The edit distances between words form a relation matrix M_lev ∈ R^{n×n}, where R^{n×n} is the set of real matrices of size n × n; the element m_{i,j} of M_lev measures the edit-distance similarity between the i-th word w_i and the j-th word w_j in the dictionary:

m_{i,j} = α · (1 − lev(w_i, w_j) / max(||w_i||, ||w_j||))^β for i ≠ j, and m_{i,i} = 1,

where ||w_i|| is the number of characters in w_i, ||w_j|| is the number of characters in w_j, α is a weighting factor relative to the diagonal elements, and β is an enhancement factor for the distance score; in this embodiment α = 1.8 and β = 5. lev(w_i, w_j) is computed by the recursion given above, where m and n are the lengths (numbers of characters) of w_i and w_j, and cost is the cost of replacing the m-th character of w_i with the n-th character of w_j: 0 if the two characters are identical, 1 otherwise.
Since the dictionary contains 15 words, M_lev is a 15 × 15 matrix.
(4-1-2) Calculate the TF-IDF representation of the questions
TF-IDF consists of TF and IDF.
In a sentence, the TF value of the word w_i is calculated as

TF_i = n_i / Σ_k n_k,

where TF_i is the frequency of the i-th word in the current sentence, n_i is the number of times the word occurs in the current sentence, and n_k is the number of occurrences of the k-th word in the current sentence.
For a given corpus, the IDF value of each word is fixed; for the word w_i it is calculated as

IDF_i = log( |D| / |{d : w_i ∈ d}| ),

where |D| is the total number of texts and the denominator is the number of texts containing the word.
In a sentence, the TF-IDF value of the word w_i is then

TFIDF_i = TF_i × IDF_i.
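A sketch of the TF-IDF computation over a fixed dictionary, following the TF and IDF formulas above; treating a dictionary word absent from all documents as having IDF 0 is an assumption:

```python
import math
from collections import Counter

def tfidf_vector(tokens, vocab, docs):
    """TF-IDF vector of one tokenized sentence over the dictionary vocab.
    TF_i = n_i / sum_k n_k within the sentence; IDF_i = log(|D| / df_i)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vec = []
    for w in vocab:
        tf = counts[w] / total if total else 0.0
        df = sum(1 for d in docs if w in d)
        idf = math.log(len(docs) / df) if df else 0.0
        vec.append(tf * idf)
    return vec
```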
The questions Q_new and Q_rel are expressed as the vectors

TFIDF_new = [d_new,1, d_new,2, …, d_new,n]^T
TFIDF_rel = [d_rel,1, d_rel,2, …, d_rel,n]^T

where d_new,i is the TF-IDF value of w_i in Q_new, d_rel,j is the TF-IDF value of w_j in Q_rel, n is the number of words in the dictionary of the corpus, and T denotes the transpose of the vector.
The questions Q_new and Q_rel obtained by the preprocessing in step (3) are represented as the following 15-dimensional vectors:

TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
TFIDF_rel = [0.0, 0.0, 0.0, 0.423394, 0.0, …]^T
(4-1-3) Calculate the soft cosine distance
The soft cosine distance is a variant of the cosine distance: in 2014, Sidorov proposed an improved cosine similarity measure named the soft cosine, which introduces a relation matrix to express the relations between words when calculating the cosine distance.
With Q_new and Q_rel expressed as TFIDF_new and TFIDF_rel by step (4-1-2) and the relation matrix M_lev between words obtained by step (4-1-1), the similarity between Q_new and Q_rel based on character features, R_lev(Q_new, Q_rel), is calculated with the soft cosine distance:

R_lev(Q_new, Q_rel) = (TFIDF_new · M_lev · TFIDF_rel) / ( sqrt(TFIDF_new · M_lev · TFIDF_new) × sqrt(TFIDF_rel · M_lev · TFIDF_rel) ) = 0.225969

where "·" denotes the dot product of a vector and a matrix (the same meaning below), calculated as x · M · y = Σ_{i=1}^{n} Σ_{j=1}^{n} x_i m_{i,j} y_j.
(4-2) Similarity based on word semantic features
(4-2-1) Train with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e. each word corresponds to a 200-dimensional word vector.
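For illustration, the word vectors could be trained with Gensim's word2vec implementation roughly as follows; the toy corpus below (the three preprocessed sentences of this example) and min_count=1 are demonstration assumptions, and the parameter is named vector_size in Gensim 4.x (size in older releases):

```python
from gensim.models import Word2Vec

sentences = [
    ["buy", "good", "oil", "massage"],
    ["place", "find", "scented", "massage", "oils", "qatar"],
    ["yes", "right", "behind", "kahrama", "national", "area"],
]

# 200-dimensional vectors as in the embodiment; min_count=1 keeps every
# word of the toy dictionary in the vocabulary.
model = Word2Vec(sentences, vector_size=200, min_count=1)
vec = model.wv["massage"]  # one 200-dimensional real vector per word
```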
(4-2-2) For the corpus with a dictionary of size n, calculate the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between their word vectors, obtaining a relation matrix M_w2v ∈ R^{n×n}. m_{i,j} is calculated as

m_{i,j} = max(0, cos(w_i, w_j))²,

where w_i and w_j are the K-dimensional real vectors of the i-th and j-th words in the corpus, w_i, w_j ∈ R^K, and R^K is the space of one-dimensional vectors of length K (the same meaning below). The cosine is

cos(w_i, w_j) = (w_i · w_j) / (||w_i|| × ||w_j||),

where "·" is the standard dot product between vectors (the same meaning below), w_i · w_j = Σ_{m=1}^{K} w_{i,m} w_{j,m}, with w_{i,m} denoting the m-th component of w_i and w_{j,m} the m-th component of w_j.
In this embodiment the dictionary contains 15 words, so M_w2v is a 15 × 15 matrix.
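A sketch of building M_w2v from trained word vectors with the m_{i,j} = max(0, cos(w_i, w_j))² relation above:

```python
import numpy as np

def relation_matrix_w2v(vocab, wv):
    """M_w2v with m_ij = max(0, cos(w_i, w_j))^2 over the dictionary."""
    V = np.stack([wv[w] for w in vocab])
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # row-normalize
    C = V @ V.T                                       # pairwise cosines
    return np.maximum(C, 0.0) ** 2                    # diagonal stays 1
```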
(4-2-3) Read the TF-IDF representations of Q_new and Q_rel calculated in (4-1-2):

TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
TFIDF_rel = [0.0, 0.0, 0.0, 0.423394, 0.0, …]^T
(4-2-4) Calculate the similarity between Q_new and Q_rel based on word semantic features with the soft cosine formula:

R_w2v(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_rel) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_rel · M_w2v · TFIDF_rel) ) = 0.304225
(4-3) Similarity based on sentence semantic features
Given the word vectors w_i ∈ R^K obtained by word2vec training, a question containing M words can be represented as the matrix Q_matrix = (w_1, w_2, …, w_M). The question is expressed as the arithmetic mean of the word vectors in the sentence, i.e. the vector

AVG = (1/M) Σ_{i=1}^{M} w_i.
According to the above formula, the questions Q_new and Q_rel yield the vectors AVG_new and AVG_rel respectively by taking the arithmetic mean of the word vectors in each sentence:
AVGnew=[0.014657,0.075914,-0.042454,0.219559,-0.117374,…]
AVGrel=[-0.088187,-0.025432,-0.05328,0.17098376,-0.13033055,…]
The similarity based on sentence semantic features, R_vec(Q_new, Q_rel), is obtained as the cosine distance between AVG_new and AVG_rel:

R_vec(Q_new, Q_rel) = (AVG_new · AVG_rel) / (||AVG_new|| × ||AVG_rel||) = 0.738933
(4-4) Similarity based on sentence implicit topic features
The invention uses the LDA (Latent Dirichlet Allocation) implicit topic model to obtain the implicit topics of a question. After LDA training, the implicit topic distribution of each document in the document set is obtained, and from it the topic vector of a sentence. For example, the implicit topic distribution of a sentence Q_m is [θ_{m,1}, θ_{m,2}, …, θ_{m,I}], where θ_{m,i} represents the probability that the sentence belongs to the i-th topic and I is the number of implicit topics; Q_m is thus expressed as this vector.
The topic distribution of a question is calculated with the Gensim (https://radimrehurek.com/gensim/) topic-model open-source tool. First, the corpus is used as the input of the LdaModel function in Gensim and an LDA model is obtained by training; then a question whose topic distribution is required is input into the trained LDA model, and the output is the topic distribution of the question, expressing the sentence as a vector.
For the newly posed question Q_new and the historical question Q_rel, the vector representations of the two questions based on topic distribution are calculated with the Gensim topic-model open-source tool and denoted LDA_new and LDA_rel respectively:
LDAnew=[0.001784,0.001934,0.002056,0.002072,0.001772,…]
LDArel=[0.001706,0.001850,0.001967,0.001982,0.001695,…]
The similarity based on sentence implicit topic features, R_lda(Q_new, Q_rel), is obtained as the cosine distance between LDA_new and LDA_rel:

R_lda(Q_new, Q_rel) = (LDA_new · LDA_rel) / (||LDA_new|| × ||LDA_rel||) = 0.685844
(4-5) Similarity based on answer semantic features
In the question-answering system, each historical question corresponds to a set of candidate answers. For the newly posed question Q_new, the historical question Q_rel, and the candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is calculated by a method similar to (4-2) and taken as the similarity of the two questions based on answer semantic features, R_qa(Q_new, Q_rel). The specific process is as follows:
(4-5-1) Read the 200-dimensional word vector of each word in the dictionary calculated in (4-2-1);
(4-5-2) read the matrix M_w2v calculated in (4-2-2);
(4-5-3) read the TF-IDF representation of Q_new obtained in (4-1-2):

TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T

and calculate the TF-IDF representation of the A_rel obtained by the preprocessing in step (3), giving the 15-dimensional vector

TFIDF_ans = [0.408248, 0.408248, 0.0, 0.0, 0.0, …]^T
(4-5-4) Calculate the similarity between Q_new and Q_rel based on answer semantic features with the soft cosine distance:

R_qa(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_ans) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_ans · M_w2v · TFIDF_ans) ) = 0.018413
(5) Calculate the final similarity
The final similarity between Q_new and Q_rel, Sim(Q_new, Q_rel), is calculated as

Sim(Q_new, Q_rel) = Σ_k λ_k R_k(Q_new, Q_rel),

where R_k(Q_new, Q_rel) is the similarity based on feature k obtained in (4-1) to (4-5), i.e.:

R_lev(Q_new, Q_rel) = 0.225969, R_w2v(Q_new, Q_rel) = 0.304225, R_vec(Q_new, Q_rel) = 0.738933, R_lda(Q_new, Q_rel) = 0.685844, R_qa(Q_new, Q_rel) = 0.018413
λ_k are the parameters to be trained, obtained by a linear regression analysis method. The training uses forward stepwise regression, reducing the error as much as possible at each step, with a squared loss function, iterating until the loss function is minimal:

L = Σ_{i=1}^{I} (Ŷ^(i) − Y^(i))²,

where I is the number of given training samples, each training sample being a known question pair together with the similarity of that pair, Ŷ^(i) is the predicted similarity of the i-th sample, and Y^(i) is the known similarity of the i-th sample.
The iterative process is as follows:
1. randomly initialize the weight λ_k corresponding to each feature and calculate the squared loss of the current iteration from these weights;
2. take the partial derivative of the squared loss with respect to each weight λ_k to obtain the gradient of that weight at the current iteration, g_k^(t) = ∂L/∂λ_k, where t denotes the t-th iteration;
3. update each weight along the negative gradient direction, λ_k^(t+1) = λ_k^(t) − η·g_k^(t), where η is the learning step size;
4. recalculate the squared loss with the new weights; if the current squared loss is not less than the squared loss of the previous iteration, stop the iteration and take the current λ_k as the final weight of each feature; otherwise repeat steps 2, 3 and 4.
In this embodiment, training according to the above steps yields the following weights for R_k(Q_new, Q_rel):

λ_lev = 0.055985, λ_w2v = 0.753228, λ_vec = 0.207070, λ_lda = 0.475735, λ_qa = −0.122604

Substituting these weights and the five similarities above gives the final similarity Sim(Q_new, Q_rel) ≈ 0.7188.
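Checking the arithmetic of the weighted combination for this embodiment:

```python
# Weights and per-feature similarities from this embodiment,
# ordered (lev, w2v, vec, lda, qa).
lams = [0.055985, 0.753228, 0.207070, 0.475735, -0.122604]
sims = [0.225969, 0.304225, 0.738933, 0.685844, 0.018413]

final = sum(l * s for l, s in zip(lams, sims))
print(round(final, 6))  # ≈ 0.718835
```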
the techniques described herein may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. For a hardware implementation, the processing modules may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Programmable Logic Devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, micro-controllers, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.
For a firmware and/or software implementation, the techniques may be implemented with modules (e.g., procedures, steps, flows, and so on) that perform the functions described herein. The firmware and/or software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (3)
1. A question similarity calculation method based on multiple features, characterized by comprising the following steps: an input new question is compared with the stored historical questions and their corresponding answers, and the similarity between the new question and a historical question based on character features, the similarity based on word semantic features, the similarity based on sentence semantic features, the similarity based on sentence implicit topic features and the similarity based on answer semantic features are calculated; the final similarity is the sum of the products of the 5 similarities and their corresponding weights, the weights being obtained by training with a linear regression method;
the similarity based on character features is calculated as follows: a relation matrix between words is obtained by calculating the edit distance between each pair of words, the soft cosine distance is then calculated from the TF-IDF representations of the questions and the relation matrix, and the soft cosine distance is taken as the similarity based on character features;
the relation matrix between words is calculated as follows:
the corpus is defined as the question-and-answer text data set used for training and testing the model; if the dictionary of the corpus contains n words, the edit distances between words form a relation matrix M_lev, whose element m_{i,j} measures the edit-distance similarity between the i-th word w_i and the j-th word w_j in the dictionary:

m_{i,j} = α · (1 − lev(w_i, w_j) / max(||w_i||, ||w_j||))^β for i ≠ j, and m_{i,i} = 1,

where ||w_i|| is the number of characters in the word w_i, ||w_j|| is the number of characters in the word w_j, α is a weighting factor relative to the diagonal elements, and β is an enhancement factor for the distance score; lev(w_i, w_j) is computed by the recursion

lev(m, n) = max(m, n) if min(m, n) = 0, and otherwise lev(m, n) = min( lev(m−1, n) + 1, lev(m, n−1) + 1, lev(m−1, n−1) + cost ),

where m and n are the lengths of w_i and w_j, and cost is the cost of replacing the m-th character of w_i with the n-th character of w_j: 0 if the two characters are identical, 1 otherwise;
the TF-IDF representation of a question is calculated as follows: in a sentence, for each word w_i, a TF value representing the frequency of the word in the current sentence and an IDF value representing the inverse document frequency are calculated, and the TF-IDF value of w_i is

TFIDF_i = TF_i × IDF_i;
for the newly posed question Q_new and the historical question Q_rel, the soft cosine distance is calculated as follows:
the TF-IDF representations of Q_new and Q_rel are respectively

TFIDF_new = [d_new,1, d_new,2, …, d_new,n]^T
TFIDF_rel = [d_rel,1, d_rel,2, …, d_rel,n]^T

where d_new,i is the TF-IDF value of w_i in Q_new, d_rel,j is the TF-IDF value of w_j in Q_rel, n is the number of words in the dictionary of the corpus, and T denotes the transpose of the vector;
then, according to the relation matrix M_lev between words, the similarity between Q_new and Q_rel based on character features, R_lev(Q_new, Q_rel), is calculated with the soft cosine distance:

R_lev(Q_new, Q_rel) = (TFIDF_new · M_lev · TFIDF_rel) / ( sqrt(TFIDF_new · M_lev · TFIDF_new) × sqrt(TFIDF_rel · M_lev · TFIDF_rel) )

where "·" denotes the dot product of a vector and a matrix;
the similarity based on word semantic features is calculated as follows:
(6-1) training with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e. each word corresponds to a K-dimensional real vector;
(6-2) for the corpus with a dictionary of size n, calculating the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between their word vectors, obtaining a relation matrix M_w2v ∈ R^{n×n};
(6-3) reading the TF-IDF representations of the questions Q_new and Q_rel, denoted TFIDF_new and TFIDF_rel respectively;
(6-4) calculating the similarity between Q_new and Q_rel based on word semantic features, R_w2v(Q_new, Q_rel), with the soft cosine distance:

R_w2v(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_rel) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_rel · M_w2v · TFIDF_rel) );
the similarity based on sentence semantic features is calculated as follows:
given the word vectors w_i ∈ R^K trained with the word2vec tool, a question containing M words is represented as Q_matrix = (w_1, w_2, …, w_M);
the question is expressed as the arithmetic mean of the word vectors in the sentence, i.e. the vector AVG = (1/M) Σ_{i=1}^{M} w_i;
according to the above formula, the questions Q_new and Q_rel yield the vectors AVG_new and AVG_rel respectively by taking the arithmetic mean of the word vectors in each sentence, and the similarity based on sentence semantic features, R_vec(Q_new, Q_rel), is obtained as the cosine distance between AVG_new and AVG_rel;
the similarity based on sentence implicit topic features is calculated as follows:
the corpus is used as the input of the LdaModel function in Gensim and an LDA model is obtained by training; the newly posed question Q_new and the historical question Q_rel are then input into the trained LDA model to obtain vector representations of the two questions based on topic distribution, denoted LDA_new and LDA_rel respectively; the similarity based on sentence implicit topic features, R_lda(Q_new, Q_rel), is obtained as the cosine distance between LDA_new and LDA_rel;
the similarity based on answer semantic features is calculated as follows:
for the newly posed question Q_new, the historical question Q_rel and the candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is calculated as the similarity of the two questions based on answer semantic features, R_qa(Q_new, Q_rel), by the following process:
(9-1) training with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e. each word corresponds to a K-dimensional real vector;
(9-2) for the corpus with a dictionary of size n, calculating the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between their word vectors, obtaining a relation matrix M_w2v ∈ R^{n×n};
(9-3) reading the TF-IDF representation TFIDF_new of the question Q_new, and representing A_rel with TF-IDF to obtain TFIDF_ans = [d_ans,1, d_ans,2, …, d_ans,n]^T, where d_ans,i is the TF-IDF value of w_i in A_rel, n is the number of words in the dictionary of the corpus, and T denotes the transpose of the vector;
(9-4) calculating the similarity between Q_new and Q_rel based on answer semantic features, R_qa(Q_new, Q_rel), with the soft cosine distance:

R_qa(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_ans) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_ans · M_w2v · TFIDF_ans) ).
2. The question similarity calculation method based on multiple features of claim 1, characterized in that the new question, the historical questions and the corresponding answers used in the comparison are preprocessed, the preprocessing comprising punctuation removal, case folding, and removal of stop words and low-frequency words.
3. The question similarity calculation method based on multiple features of claim 1, characterized in that the final similarity is calculated as

Sim(Q_new, Q_rel) = Σ_k λ_k R_k(Q_new, Q_rel),

where R_k(Q_new, Q_rel) is the similarity between Q_new and Q_rel based on feature k, i.e. the similarity based on character features, the similarity based on word semantic features, the similarity based on sentence semantic features, the similarity based on sentence implicit topic features and the similarity based on answer semantic features; λ_k are the parameters to be trained, obtained by training with a linear regression analysis method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811041071.0A CN109344236B (en) | 2018-09-07 | 2018-09-07 | Problem similarity calculation method based on multiple characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109344236A CN109344236A (en) | 2019-02-15 |
CN109344236B true CN109344236B (en) | 2020-09-04 |
Family
ID=65304890
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811041071.0A Active CN109344236B (en) | 2018-09-07 | 2018-09-07 | Problem similarity calculation method based on multiple characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109344236B (en) |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399615B (en) * | 2019-07-29 | 2023-08-18 | 中国工商银行股份有限公司 | Transaction risk monitoring method and device |
CN110532565B (en) * | 2019-08-30 | 2022-03-25 | 联想(北京)有限公司 | Statement processing method and device and electronic equipment |
CN110543551B (en) * | 2019-09-04 | 2022-11-08 | 北京香侬慧语科技有限责任公司 | Question and statement processing method and device |
CN110825857B (en) * | 2019-09-24 | 2023-07-21 | 平安科技(深圳)有限公司 | Multi-round question and answer identification method and device, computer equipment and storage medium |
CN110738049B (en) * | 2019-10-12 | 2023-04-18 | 招商局金融科技有限公司 | Similar text processing method and device and computer readable storage medium |
CN110781662B (en) * | 2019-10-21 | 2022-02-01 | 腾讯科技(深圳)有限公司 | Method for determining point-to-point mutual information and related equipment |
CN111723297B (en) * | 2019-11-20 | 2023-05-12 | 中共南通市委政法委员会 | Dual-semantic similarity judging method for grid society situation research and judgment |
CN113139034A (en) * | 2020-01-17 | 2021-07-20 | 深圳市优必选科技股份有限公司 | Statement matching method, statement matching device and intelligent equipment |
CN111191464A (en) * | 2020-01-17 | 2020-05-22 | 珠海横琴极盛科技有限公司 | Semantic similarity calculation method based on combined distance |
CN111368177B (en) * | 2020-03-02 | 2023-10-24 | 北京航空航天大学 | Answer recommendation method and device for question-answer community |
CN111401031A (en) * | 2020-03-05 | 2020-07-10 | 支付宝(杭州)信息技术有限公司 | Target text determination method, device and equipment |
CN111414765B (en) * | 2020-03-20 | 2023-07-25 | 北京百度网讯科技有限公司 | Sentence consistency determination method and device, electronic equipment and readable storage medium |
CN111582498B (en) * | 2020-04-30 | 2023-05-12 | 重庆富民银行股份有限公司 | QA auxiliary decision-making method and system based on machine learning |
CN111259668B (en) * | 2020-05-07 | 2020-08-18 | 腾讯科技(深圳)有限公司 | Reading task processing method, model training device and computer equipment |
CN111680515B (en) * | 2020-05-21 | 2022-05-03 | 平安国际智慧城市科技股份有限公司 | Answer determination method and device based on AI (Artificial Intelligence) recognition, electronic equipment and medium |
CN113779183B (en) * | 2020-06-08 | 2024-05-24 | 北京沃东天骏信息技术有限公司 | Text matching method, device, equipment and storage medium |
CN112380830B (en) * | 2020-06-18 | 2024-05-17 | 达观数据有限公司 | Matching method, system and computer readable storage medium for related sentences in different documents |
CN113392176B (en) * | 2020-09-28 | 2023-08-22 | 腾讯科技(深圳)有限公司 | Text similarity determination method, device, equipment and medium |
CN112507097B (en) * | 2020-12-17 | 2022-11-18 | 神思电子技术股份有限公司 | Method for improving generalization capability of question-answering system |
CN112632252B (en) * | 2020-12-25 | 2021-09-17 | 中电金信软件有限公司 | Dialogue response method, dialogue response device, computer equipment and storage medium |
CN112926340B (en) * | 2021-03-25 | 2024-05-07 | 东南大学 | Semantic matching model for knowledge point positioning |
CN113377927A (en) * | 2021-06-28 | 2021-09-10 | 成都卫士通信息产业股份有限公司 | Similar document detection method and device, electronic equipment and storage medium |
CN113792125B (en) * | 2021-08-25 | 2024-04-02 | 北京库睿科技有限公司 | Intelligent retrieval ordering method and device based on text relevance and user intention |
CN113722459A (en) * | 2021-08-31 | 2021-11-30 | 平安国际智慧城市科技股份有限公司 | Question and answer searching method based on natural language processing model and related device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101286161A (en) * | 2008-05-28 | 2008-10-15 | 华中科技大学 | Intelligent Chinese request-answering system based on concept |
CN103729381A (en) * | 2012-10-16 | 2014-04-16 | 佳能株式会社 | Method and device used for recognizing semantic information in series of documents |
CN105701253A (en) * | 2016-03-04 | 2016-06-22 | 南京大学 | Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method |
CN106997376A (en) * | 2017-02-28 | 2017-08-01 | 浙江大学 | The problem of one kind is based on multi-stage characteristics and answer sentence similarity calculating method |
CN107092596A (en) * | 2017-04-24 | 2017-08-25 | 重庆邮电大学 | Text emotion analysis method based on attention CNNs and CCR |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9135240B2 (en) * | 2013-02-12 | 2015-09-15 | International Business Machines Corporation | Latent semantic analysis for application in a question answer system |
US9953027B2 (en) * | 2016-09-15 | 2018-04-24 | International Business Machines Corporation | System and method for automatic, unsupervised paraphrase generation using a novel framework that learns syntactic construct while retaining semantic meaning |
- 2018-09-07: application CN201811041071.0A filed in China; granted as patent CN109344236B (status: Active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101286161A (en) * | 2008-05-28 | 2008-10-15 | 华中科技大学 | Intelligent Chinese request-answering system based on concept |
CN103729381A (en) * | 2012-10-16 | 2014-04-16 | 佳能株式会社 | Method and device used for recognizing semantic information in series of documents |
CN105701253A (en) * | 2016-03-04 | 2016-06-22 | 南京大学 | Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method |
CN106997376A (en) * | 2017-02-28 | 2017-08-01 | 浙江大学 | The problem of one kind is based on multi-stage characteristics and answer sentence similarity calculating method |
CN107092596A (en) * | 2017-04-24 | 2017-08-25 | 重庆邮电大学 | Text emotion analysis method based on attention CNNs and CCR |
Non-Patent Citations (1)
Title |
---|
"A noval similarity calculation method based on Chinese sentence keyword weight"; Yu Y et al.; Journal of Software; 2014-05-30; pp. 1151-1155 *
Also Published As
Publication number | Publication date |
---|---|
CN109344236A (en) | 2019-02-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109344236B (en) | Problem similarity calculation method based on multiple characteristics | |
CN108363790B (en) | Method, device, equipment and storage medium for evaluating comments | |
Liu et al. | Learning semantic word embeddings based on ordinal knowledge constraints | |
US20170177563A1 (en) | Methods and systems for automated text correction | |
Xie et al. | Topic enhanced deep structured semantic models for knowledge base question answering | |
US11068653B2 (en) | System and method for context-based abbreviation disambiguation using machine learning on synonyms of abbreviation expansions | |
US20240111956A1 (en) | Nested named entity recognition method based on part-of-speech awareness, device and storage medium therefor | |
CN111428490A (en) | Reference resolution weak supervised learning method using language model | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
Hussein | A plagiarism detection system for arabic documents | |
Zhang et al. | Term recognition using conditional random fields | |
Celikyilmaz et al. | A graph-based semi-supervised learning for question-answering | |
CN111400487A (en) | Quality evaluation method of text abstract | |
CN110991193A (en) | Translation matrix model selection system based on OpenKiwi | |
Jian et al. | English text readability measurement based on convolutional neural network: A hybrid network model | |
Zhang et al. | Disease prediction and early intervention system based on symptom similarity analysis | |
CN111581365B (en) | Predicate extraction method | |
CN116108840A (en) | Text fine granularity emotion analysis method, system, medium and computing device | |
CN112926340B (en) | Semantic matching model for knowledge point positioning | |
Ghasemi et al. | Farsick: A persian semantic textual similarity and natural language inference dataset | |
Zhang et al. | Dual attention model for citation recommendation with analyses on explainability of attention mechanisms and qualitative experiments | |
CN111767388B (en) | Candidate pool generation method | |
CN114265924A (en) | Method and device for retrieving associated table according to question | |
Reshmi et al. | Textual entailment based on semantic similarity using wordnet | |
Rei et al. | Parser lexicalisation through self-learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |