CN109344236A - A question similarity calculation method based on multiple features - Google Patents
A question similarity calculation method based on multiple features
- Publication number: CN109344236A
- Application number: CN201811041071.0A
- Authority: CN (China)
- Prior art keywords: new, rel, similarity, word, sentence
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
Abstract
The invention discloses a question similarity calculation method based on multiple features, comprising the steps of: for an input new question sentence, comparing it with stored historical questions and their corresponding answers, and calculating, between the new question and a historical question, a similarity based on character features, a similarity based on word semantic features, a similarity based on sentence semantic features, a similarity based on latent sentence topic features, and a similarity based on answer semantic features. The final similarity is the sum of the products of these five similarities and their respective weights, where the weights are obtained by linear regression training. By using multiple features, the invention increases the diversity of sample attributes and improves the generalization ability of the model. At the same time, by fusing TF-IDF with edit distance, word semantics and other information through the soft cosine distance, the invention overcomes the semantic gap between words and improves the accuracy of similarity calculation.
Description
Technical field
The present invention relates to the research fields of computer natural language processing and automatic question answering systems, and in particular to a question similarity calculation method based on multiple features.
Background art
With the rapid growth of digital information, it is increasingly difficult for people to obtain the information resources they need from the network. How to find the required information for users quickly and precisely in massive digital information poses a severe challenge to natural language processing (NLP) and information retrieval technology. Therefore, in order to provide users with a real-time and accurate channel for obtaining information, research institutions and technology companies have begun to study automatic question answering (QA) systems. In an automatic question answering system, the user only needs to input a question to obtain the corresponding answer directly; the user no longer needs to extract keywords from the question for retrieval and read a large number of web pages to find the answer. Automatic question answering systems are simpler, easier to use, more real-time and more accurate than traditional search engines, provide users with a comfortable human-computer interaction experience, and have become a research hotspot of the new generation of information technology. An automatic question answering system allows the user to describe a problem in natural language, accurately understands the user's question, organizes an answer from information retrieved from a question-answer base or the Internet, and finally returns a refined and accurate result, providing an efficient channel for information acquisition.
Question similarity calculation is a key link in automatic question answering: the goal is to find, in an existing question set, the historical questions most similar to a newly posed question, so that the answer to the new question can be provided from the answer set of those historical questions.
At present, there have also been achievements in the field of automatic question answering in China. General community question answering systems include Quora, Toutiao Q&A and Baidu Knows, while professional community question answering systems cover many specialties, such as the IT-related systems Stack Overflow and CSDN. The question similarity calculation method therefore directly affects the accuracy of a question answering system and has good industrial prospects.
Through years of research, automatic question answering systems have formed a general framework, mainly composed of three modules: information retrieval, question analysis and answer acquisition. The main task of the question analysis module is to analyze the question input by the user and find, in the existing question set, the historical questions most similar to the newly posed question; its research content covers question similarity analysis and question ranking, of which the most important part is the similarity calculation between questions, by which the historical question set is ranked. The answer acquisition module mainly obtains the corresponding answer set according to the set of similar questions retrieved for the question.
Text similarity technology is the basis of question similarity technology (questions and answers are both text). There are mainly three kinds of text similarity calculation methods.
The first is similarity calculation based on the vector space model (VSM), which maps a text to a point in a vector space and then uses mathematical methods to calculate the distance between points. Researchers have applied the VSM to the similar-question retrieval task of frequently asked questions (FAQ) and improved the VSM for the characteristics of the FAQ task. However, in this method sparse text leads to excessive dimensionality, and the semantic gap problem easily arises.
The second is similarity calculation based on syntactic analysis, which introduces a graphical representation to describe how the phrases in a sentence govern and are governed by one another. Researchers have proposed an analysis method based on deep structure, which first performs dependency analysis on the question, selects the most important word in the sentence together with the effective words directly attached to it for matching, and then carries out Chinese text similarity calculation based on the dependency structure. However, tools such as syntactic analysis and dependency analysis are complex, require a linguistics background, and perform poorly on complicated long sentence patterns.
The third is semantics-based similarity calculation, which includes two kinds of similarity: word semantics and sentence semantics. Word semantic similarity is usually calculated using semantic dictionaries such as WordNet and HowNet, which contain the semantic relations between words. Some researchers hold that the complete expression of a short sentence depends not only on its syntactic structure but also on its words and their weights, and therefore use WordNet to improve the semantic representation of words. For sentence semantic similarity, researchers have used IBM's machine translation model to learn the transition probability between two question sentences, representing the semantic similarity of the sentences in order to retrieve similar questions. However, these methods depend too heavily on semantic dictionaries, and the accuracy of the similarity calculation is affected by the completeness and correctness of the dictionary; likewise, semantics-based similarity calculation methods perform poorly on syntactically complex long sentences.
Meanwhile, most methods in the prior art extract text representation features from a single type of information and focus on single-type features, ignoring the fact that the meaning of a text representation is composed of many-sided, multi-level information; the accuracy of the calculated similarity is therefore not high.
Summary of the invention
To overcome the deficiencies of the prior art, the object of the present invention is to provide a question similarity calculation method based on multiple features. The method is used for similarity calculation between questions in English question answering systems and has the advantage of high accuracy.
The object of the present invention is achieved by the following technical solution: a question similarity calculation method based on multiple features, comprising the steps of: for an input new question sentence, comparing it with stored historical questions and their corresponding answers, and calculating, between the new question and a historical question, a similarity based on character features, a similarity based on word semantic features, a similarity based on sentence semantic features, a similarity based on latent sentence topic features, and a similarity based on answer semantic features; the final similarity is the sum of the products of the above five similarities and their respective weights, where the weights are obtained by linear regression training.
Preferably, the new question sentence, the historical questions and the corresponding answers used for comparison are first preprocessed, including removing punctuation, case conversion (all capital letters are converted to lowercase), and removing stop words and low-frequency words.
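As a concrete illustration, this preprocessing step can be sketched in Python as follows. The stop-word list here is a small hypothetical sample (not the one used by the invention), and low-frequency-word filtering, which requires corpus-wide counts, is omitted:

```python
import re

# Hypothetical sample stop-word list; the invention's actual list is not specified.
STOP_WORDS = {"where", "i", "can", "for", "is", "there", "any", "in", "it"}

def preprocess(sentence, stop_words=STOP_WORDS):
    """Remove punctuation, lowercase, tokenize, and drop stop words."""
    sentence = re.sub(r"[^\w\s]", " ", sentence)       # remove punctuation marks
    tokens = sentence.lower().split()                  # case conversion + tokenization
    return [t for t in tokens if t not in stop_words]  # remove stop words

print(preprocess("Where I can buy good oil for massage?"))
# → ['buy', 'good', 'oil', 'massage'], matching Q_new after preprocessing in the embodiment
```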
Preferably, the method for calculating the similarity based on character features is: first, the edit distance between each pair of words is calculated to obtain a relation matrix between words; then the TF-IDF (term frequency-inverse document frequency) representations of the question sentences and the relation matrix are used to calculate a soft cosine distance, which serves as the similarity based on character features.
Further, the method for calculating the relation matrix between words is:
Define the corpus as the question-sentence and answer text data set used for training and testing the model. Assume the size of the dictionary of the corpus is n; then the edit distances between words form a relation matrix M_lev ∈ R^(n×n), where R^(n×n) is the set of real matrices of size n × n (the same meaning hereinafter), and the element m_(i,j) of M_lev is the edit-distance-based similarity between the i-th word w_i and the j-th word w_j in the dictionary. The edit-distance similarity is calculated as follows:
m_(i,j) = α · (1 − lev(w_i, w_j) / max(‖w_i‖, ‖w_j‖))^β
where ‖w_i‖ is the number of characters contained in word w_i, ‖w_j‖ is the number of characters contained in word w_j, α is the weighting factor of the diagonal elements, and β is the intensifier of the distance score. The recursive formula for lev(w_i, w_j) is as follows:
lev(m, n) = max(m, n), if min(m, n) = 0;
lev(m, n) = min( lev(m−1, n) + 1, lev(m, n−1) + 1, lev(m−1, n−1) + cost ), otherwise
where m and n are the lengths (numbers of characters) of w_i and w_j, and cost is the cost of replacing the m-th character of w_i with the n-th character of w_j: if the two characters are identical, cost is 0; otherwise cost is 1.
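Assuming the Gensim-style weighting suggested by the description above (similarity = α(1 − lev/max length)^β, with the embodiment's values α = 1.8 and β = 5 — the exact form of the patent's formula image is a reconstruction), the relation matrix can be sketched as:

```python
from functools import lru_cache

def levenshtein(a: str, b: str) -> int:
    """Classic edit-distance recursion: insertion, deletion, substitution (cost 1)."""
    @lru_cache(maxsize=None)
    def lev(m, n):
        if min(m, n) == 0:
            return max(m, n)
        cost = 0 if a[m - 1] == b[n - 1] else 1
        return min(lev(m - 1, n) + 1, lev(m, n - 1) + 1, lev(m - 1, n - 1) + cost)
    return lev(len(a), len(b))

def relation_matrix_lev(vocab, alpha=1.8, beta=5.0):
    """M_lev[i][j] = alpha * (1 - lev(wi, wj) / max(|wi|, |wj|)) ** beta (assumed form)."""
    n = len(vocab)
    M = [[0.0] * n for _ in range(n)]
    for i, wi in enumerate(vocab):
        for j, wj in enumerate(vocab):
            sim = 1.0 - levenshtein(wi, wj) / max(len(wi), len(wj))
            M[i][j] = alpha * sim ** beta
    return M

M = relation_matrix_lev(["oil", "oils", "massage"])
print(M[0][1])  # "oil" vs "oils": lev = 1, so 1.8 * (1 - 1/4) ** 5
```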
Further, the TF-IDF representation of a question sentence is calculated as follows: in a sentence, for word w_i, the TF value and the IDF value are calculated; the TF value is the frequency with which the word appears in the current sentence, and the IDF value is the inverse document frequency. The TF-IDF of word w_i is calculated as:
TFIDF_i = TF_i × IDF_i.
Further, for a newly posed question sentence Q_new and a historical question sentence Q_rel, the soft cosine distance is calculated as follows:
The question sentences Q_new and Q_rel are represented as TFIDF_new and TFIDF_rel:
TFIDF_new = [d_(new,1), d_(new,2), …, d_(new,n)]^T
TFIDF_rel = [d_(rel,1), d_(rel,2), …, d_(rel,n)]^T
where d_(new,i) is the TF-IDF value of w_i in Q_new, d_(rel,j) is the TF-IDF value of w_j in Q_rel, n is the number of words in the corpus dictionary, and T denotes the transpose of a vector.
With the relation matrix between words M_lev obtained above, the similarity R_lev(Q_new, Q_rel) between Q_new and Q_rel based on character features is calculated by the soft cosine distance, with the following formula:
R_lev(Q_new, Q_rel) = (TFIDF_new · M_lev · TFIDF_rel) / ( sqrt(TFIDF_new · M_lev · TFIDF_new) × sqrt(TFIDF_rel · M_lev · TFIDF_rel) )
where "·" is the dot product of vectors and matrices (the same meaning hereinafter), calculated as:
a · M · b = Σ_(i=1..n) Σ_(j=1..n) a_i m_(i,j) b_j
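The soft cosine distance itself is only a few lines; the sketch below also checks that with the identity relation matrix it reduces to the ordinary cosine similarity:

```python
import math

def soft_cosine(a, b, M):
    """Soft cosine: (a·M·b) / (sqrt(a·M·a) * sqrt(b·M·b))."""
    def quad(x, y):
        # Bilinear form x^T M y over the word relation matrix.
        return sum(x[i] * M[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))
    denom = math.sqrt(quad(a, a)) * math.sqrt(quad(b, b))
    return quad(a, b) / denom if denom else 0.0

# With the identity matrix, soft cosine equals ordinary cosine similarity.
I3 = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
print(soft_cosine([1.0, 0.0, 1.0], [0.0, 1.0, 1.0], I3))  # → 0.5
```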
Preferably, the similarity based on word semantic features is calculated in the following steps:
(1) The distributed representation of each word in the corpus is obtained by training with the word2vec tool, i.e. each word corresponds to a K-dimensional real vector.
(2) For a corpus whose dictionary size is n, the pairwise semantic relation m_(i,j) (i, j ∈ [1, n]) between words in the dictionary is calculated from the cosine distance between word vectors, yielding the relation matrix M_w2v, M_w2v ∈ R^(n×n).
(3) The TF-IDF representations of the question sentences Q_new and Q_rel, namely TFIDF_new and TFIDF_rel, are read.
(4) The similarity R_w2v(Q_new, Q_rel) between Q_new and Q_rel based on word semantic features is calculated by the soft cosine distance, with the following formula:
R_w2v(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_rel) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_rel · M_w2v · TFIDF_rel) )
Preferably, the similarity based on sentence semantic features is calculated in the following steps:
According to the word vectors w_i ∈ R^K obtained by word2vec training, and assuming a question sentence contains M words, the question sentence can be expressed as Q_matrix = (w_1, w_2, …, w_M). The question sentence is then expressed as the arithmetic mean of the word vectors in the sentence, i.e. the vector:
AVG = (1/M) Σ_(i=1..M) w_i
According to the above formula, the vectors AVG_new and AVG_rel are obtained by calculating the arithmetic mean of the word vectors in Q_new and Q_rel respectively, and the similarity R_vec(Q_new, Q_rel) between Q_new and Q_rel based on sentence semantic features is obtained as the cosine distance between AVG_new and AVG_rel.
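The sentence-semantic similarity above can be sketched as follows, using toy 3-dimensional vectors in place of the trained word2vec vectors:

```python
import math

def sentence_vector(word_vectors):
    """Arithmetic mean of the word vectors in a sentence: AVG = (1/M) * sum(w_i)."""
    M = len(word_vectors)
    K = len(word_vectors[0])
    return [sum(w[k] for w in word_vectors) / M for k in range(K)]

def cosine(u, v):
    """Ordinary cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Toy 3-dimensional word vectors standing in for word2vec output:
q_new = [[1.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
q_rel = [[1.0, 1.0, 1.0]]
avg_new = sentence_vector(q_new)  # [0.5, 1.0, 1.0]
r_vec = cosine(avg_new, sentence_vector(q_rel))
print(r_vec)
```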
Preferably, the similarity based on latent sentence topic features is calculated in the following steps:
The corpus is used as the input of the LdaModel function in Gensim, and an LDA model is obtained by training. Then the newly posed question sentence Q_new and the historical question sentence Q_rel, whose topic distributions need to be calculated, are input into the trained LDA model to obtain the vector representations of the two question sentences based on topic distribution, denoted LDA_new and LDA_rel respectively. The similarity R_lda(Q_new, Q_rel) between Q_new and Q_rel based on latent sentence topic features is obtained as the cosine distance between LDA_new and LDA_rel.
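The topic-based similarity can be sketched as follows. The topic distributions here are toy values standing in for the output of Gensim's LdaModel (inferring real distributions requires a trained model and a bag-of-words representation of each sentence):

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Toy topic distributions over 3 latent topics; in Gensim these would come
# from querying a trained LdaModel with the sentence's bag-of-words.
lda_new = [0.7, 0.2, 0.1]
lda_rel = [0.6, 0.3, 0.1]
r_lda = cosine(lda_new, lda_rel)
print(round(r_lda, 6))
```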
Preferably, the similarity based on answer semantic features is calculated in the following steps:
In a question answering system, each historical question corresponds to a set of candidate answers. For a newly posed question sentence Q_new, a historical question sentence Q_rel and the candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is calculated as the similarity R_qa(Q_new, Q_rel) of the two question sentences based on answer semantic features. The detailed process is as follows:
(1) The distributed representation of each word in the corpus is obtained by training with the word2vec tool, i.e. each word corresponds to a K-dimensional real vector.
(2) For a corpus whose dictionary size is n, the pairwise semantic relation m_(i,j) (i, j ∈ [1, n]) between words in the dictionary is calculated from the cosine distance between word vectors, yielding the relation matrix M_w2v, M_w2v ∈ R^(n×n).
(3) The TF-IDF representation TFIDF_new of the question sentence Q_new is read; A_rel is represented by TF-IDF to obtain TFIDF_ans = [d_(ans,1), d_(ans,2), …, d_(ans,n)]^T, where d_(ans,i) is the TF-IDF value of w_i in A_rel, n is the number of words in the corpus dictionary, and T denotes the transpose of a vector.
(4) The similarity R_qa(Q_new, Q_rel) between Q_new and Q_rel based on answer semantic features is calculated by the soft cosine distance, with the following formula:
R_qa(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_ans) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_ans · M_w2v · TFIDF_ans) )
Preferably, the final similarity is calculated as:
Sim(Q_new, Q_rel) = Σ_k λ_k R_k(Q_new, Q_rel)
where R_k(Q_new, Q_rel) is the similarity based on feature k, namely the similarity based on character features, the similarity based on word semantic features, the similarity based on sentence semantic features, the similarity based on latent sentence topic features, and the similarity based on answer semantic features, respectively; λ_k is a parameter to be trained, obtained by linear regression analysis.
Further, the iterative steps of the linear regression training process are as follows:
(1) The weight λ_k corresponding to each feature is randomly initialized, and the squared loss of the current iteration is calculated from the weights. The squared loss function is as follows:
L = Σ_(i=1..I) (Ŷ^(i) − Y^(i))²
where I is the number of given training samples, a training sample being a known question pair together with the similarity of that pair; Ŷ^(i) is the predicted similarity of the i-th sample, and Y^(i) is the given similarity of the i-th sample.
(2) According to the squared loss, the partial derivative with respect to the weight λ_k of each feature is taken, yielding the gradient ∇λ_k^t of the weight at the current iteration, where t denotes the t-th iteration.
(3) The weight of each feature is updated according to λ_k^(t+1) = λ_k^t − α∇λ_k^t, where α is the step size.
(4) The squared loss is recalculated with the new weights; if the current squared loss is not less than the squared loss of the previous iteration, the iteration stops and the final value of the weight λ_k of each feature is obtained; otherwise steps (2)-(4) are repeated.
Compared with the prior art, the present invention has the following advantages and technical effects:
(1) In machine learning, a training sample is described by a set of attributes, and different attribute sets provide different perspectives on the observed data. The present invention describes question and answer sentences expressed in natural language from five different perspectives and extracts five types of features. Compared with representations based on a single type of feature, multiple features increase the diversity of sample attributes and improve the generalization ability of the model.
(2) The present invention uses the soft cosine distance to fuse TF-IDF with edit distance, word semantics and other information when calculating the similarity between questions. Compared with traditional similarity calculation methods, the present invention overcomes the semantic gap between words and improves the accuracy of the similarity calculation.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present embodiment.
Specific embodiment
The present invention will now be described in further detail with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
The question similarity calculation method based on multiple features of the present embodiment measures the similarity between two question sentences using five kinds of features: character features, word semantic features, sentence semantic features, latent sentence topic features and answer semantic features. The similarities based on these five kinds of features are integrated into the final similarity between the new question and the historical question. Referring to Fig. 1, each step of the method is described in detail below with an example.
(1) A new question sentence Q_new is input: Where I can buy good oil for massage?
(2) A historical question sentence Q_rel is read: Is there any place i can find scented massage oils in qatar?
At the same time, the answer A_rel of the historical question is read: Yes. It is right behind Kahrama in the National area.
(3) Q_new, Q_rel and A_rel are preprocessed respectively, including removing punctuation, case conversion (all capital letters are converted to lowercase), and removing stop words and low-frequency words. This yields:
Q_new: buy good oil massage
Q_rel: place find scented massage oils qatar
A_rel: yes right behind kahrama national area
(4) The similarities between Q_new and Q_rel based on the following five kinds of features are calculated separately.
In the following it is assumed that {area, behind, buy, find, good, kahrama, massage, national, oil, oils, place, qatar, right, scented, yes} is the dictionary set of the corpus.
(4-1) Similarity based on character features
Character features measure the similarity between words at the character level, and the character-level similarity of words is obtained with the edit distance. First, the relation matrix M_lev between words is obtained by calculating the edit distance between each pair of words; then the soft cosine distance between the TF-IDF representations of the question sentences under the relation matrix M_lev is calculated as the similarity based on character features. The details are as follows:
(4-1-1) Calculating the relation matrix M_lev
Assuming the size of the dictionary of the corpus (the corpus refers to the question-sentence and answer text data set used for training and testing the model; the same meaning hereinafter) is n, the edit distances between words form a relation matrix M_lev ∈ R^(n×n), where R^(n×n) is the set of real matrices of size n × n, and the element m_(i,j) of M_lev is the edit-distance-based similarity between the i-th word w_i and the j-th word w_j in the dictionary. The edit-distance similarity is calculated as follows:
m_(i,j) = α · (1 − lev(w_i, w_j) / max(‖w_i‖, ‖w_j‖))^β
where ‖w_i‖ is the number of characters contained in word w_i, ‖w_j‖ is the number of characters contained in word w_j, α is the weighting factor of the diagonal elements, and β is the intensifier of the distance score; here α is taken as 1.8 and β as 5. The recursive formula for lev(w_i, w_j) is:
lev(m, n) = max(m, n), if min(m, n) = 0;
lev(m, n) = min( lev(m−1, n) + 1, lev(m, n−1) + 1, lev(m−1, n−1) + cost ), otherwise
where m and n are the lengths (numbers of characters) of w_i and w_j, and cost is the cost of replacing the m-th character of w_i with the n-th character of w_j: if the two characters are identical, cost is 0; otherwise cost is 1.
Since the dictionary contains 15 words, M_lev is a 15 × 15 matrix.
(4-1-2) Calculating the TF-IDF representations of the question sentences
TF-IDF is composed of TF and IDF.
In a sentence, the TF value of word w_i is calculated as follows:
TF_i = n_i / Σ_k n_k
where TF_i is the frequency of the i-th word in the current sentence, n_i is the number of times the word appears in the current sentence, and n_k is the number of times the k-th word appears in the current sentence.
For a given corpus, the IDF value of each word is fixed; for word w_i it is calculated as follows:
IDF_i = log( |D| / |{d : w_i ∈ d}| )
where |D| is the total number of texts and the denominator is the number of texts containing the word.
In a sentence, the TF-IDF of word w_i is calculated as:
TFIDF_i = TF_i × IDF_i
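A minimal sketch of this TF-IDF calculation over the three preprocessed texts of the embodiment follows. Note that the embodiment's example vectors appear to use a different IDF variant with normalization, so the exact numbers below are illustrative rather than those of the patent:

```python
import math

def tf(sentence):
    """TF_i = count of word i in the sentence / total words in the sentence."""
    total = len(sentence)
    counts = {}
    for w in sentence:
        counts[w] = counts.get(w, 0) + 1
    return {w: c / total for w, c in counts.items()}

def idf(corpus):
    """IDF_i = log(|D| / number of texts containing word i)."""
    D = len(corpus)
    vocab = {w for doc in corpus for w in doc}
    return {w: math.log(D / sum(1 for doc in corpus if w in doc)) for w in vocab}

def tfidf(sentence, idf_values):
    """TFIDF_i = TF_i * IDF_i for each word of the sentence."""
    weights = tf(sentence)
    return {w: weights[w] * idf_values[w] for w in weights}

corpus = [["buy", "good", "oil", "massage"],
          ["place", "find", "scented", "massage", "oils", "qatar"],
          ["yes", "right", "behind", "kahrama", "national", "area"]]
v = tfidf(corpus[0], idf(corpus))
print(v)
```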
The question sentences Q_new and Q_rel are expressed as the following vectors:
TFIDF_new = [d_(new,1), d_(new,2), …, d_(new,n)]^T
TFIDF_rel = [d_(rel,1), d_(rel,2), …, d_(rel,n)]^T
where d_(new,i) is the TF-IDF value of w_i in Q_new, d_(rel,j) is the TF-IDF value of w_j in Q_rel, n is the number of words in the corpus dictionary, and T denotes the transpose of a vector.
The question sentences Q_new and Q_rel obtained by the preprocessing of step (3) are expressed as 15-dimensional vectors of the following form:
TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
TFIDF_rel = [0.0, 0.0, 0.0, 0.423394, 0.0, …]^T
(4-1-3) Calculating the soft cosine distance
The soft cosine distance (Soft Cosine) is an improved cosine similarity calculation method proposed by Sidorov in 2014, which introduces a relation matrix into the cosine distance to express the relations between words.
With Q_new and Q_rel represented as TFIDF_new and TFIDF_rel by step (4-1-2), and the relation matrix between words M_lev obtained by step (4-1-1), the similarity R_lev(Q_new, Q_rel) between Q_new and Q_rel based on character features is calculated by the soft cosine distance, with the following formula:
R_lev(Q_new, Q_rel) = (TFIDF_new · M_lev · TFIDF_rel) / ( sqrt(TFIDF_new · M_lev · TFIDF_new) × sqrt(TFIDF_rel · M_lev · TFIDF_rel) )
where "·" is the dot product of vectors and matrices (the same meaning hereinafter), calculated as a · M · b = Σ_i Σ_j a_i m_(i,j) b_j.
In the present embodiment, R_lev(Q_new, Q_rel) = 0.225969.
(4-2) Similarity based on word semantic features
(4-2-1) The distributed representation of each word in the corpus is obtained by training with the word2vec tool, i.e. each word corresponds to a 200-dimensional word vector.
(4-2-2) For a corpus whose dictionary size is n, the pairwise semantic relation m_(i,j) (i, j ∈ [1, n]) between words in the dictionary is calculated from the cosine distance between word vectors, yielding the relation matrix M_w2v, M_w2v ∈ R^(n×n). m_(i,j) is calculated as:
m_(i,j) = max(0, cos(w_i, w_j))²
where w_i and w_j are the K-dimensional real vectors of the i-th and j-th words in the corpus, w_i, w_j ∈ R^K, R^K being the set of one-dimensional vectors of length K (the same meaning hereinafter), and "·" here is the standard dot product between vectors (the same meaning hereinafter):
w_i · w_j = Σ_(m=1..K) w_(i,m) w_(j,m)
where w_(i,m) is the m-th component of w_i and w_(j,m) is the m-th component of w_j.
Since the dictionary size in the present embodiment is 15 words, M_w2v is a 15 × 15 matrix.
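The construction of M_w2v can be sketched as follows, using toy 3-dimensional vectors in place of the 200-dimensional word2vec vectors:

```python
import math

def cosine(u, v):
    """Standard cosine between two word vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def relation_matrix_w2v(vectors):
    """m_ij = max(0, cos(w_i, w_j)) ** 2 over all dictionary word vectors."""
    n = len(vectors)
    return [[max(0.0, cosine(vectors[i], vectors[j])) ** 2 for j in range(n)]
            for i in range(n)]

# Toy vectors standing in for word2vec output:
vecs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
M = relation_matrix_w2v(vecs)
print(M[0][1], M[0][2])  # cos = 1/sqrt(2) → 0.5; negative cosine is clipped to 0
```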
(4-2-3) The TF-IDF representations of Q_new and Q_rel calculated in (4-1-2) are read:
TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
TFIDF_rel = [0.0, 0.0, 0.0, 0.423394, 0.0, …]^T
(4-2-4) The similarity between Q_new and Q_rel based on word semantic features is obtained by the soft cosine distance formula: R_w2v(Q_new, Q_rel) = 0.304225.
(4-3) Similarity based on sentence semantic features
According to the word vectors w_i ∈ R^K obtained by word2vec training, and assuming a question sentence contains M words, the question sentence can be expressed as Q_matrix = (w_1, w_2, …, w_M). The question sentence is expressed as the arithmetic mean of the word vectors in the sentence, i.e. the vector:
AVG = (1/M) Σ_(i=1..M) w_i
According to the above formula, the vectors AVG_new and AVG_rel are obtained by calculating the arithmetic mean of the word vectors in Q_new and Q_rel respectively:
AVG_new = [0.014657, 0.075914, −0.042454, 0.219559, −0.117374, …]
AVG_rel = [−0.088187, −0.025432, −0.05328, 0.17098376, −0.13033055, …]
The similarity R_vec(Q_new, Q_rel) between Q_new and Q_rel based on sentence semantic features is obtained as the cosine distance between AVG_new and AVG_rel: R_vec(Q_new, Q_rel) = 0.738933.
(4-4) Similarity based on latent sentence topic features
The present invention uses the LDA (Latent Dirichlet Allocation) topic model to obtain the latent topics of the question sentences. After LDA training, the latent topic distribution of every document in the document set is available, from which the topic vector of a sentence is obtained. For example, if the latent topic distribution of sentence Q_m is (p_1, p_2, …, p_I), where p_i is the probability of belonging to the i-th topic and I is the number of latent topics, then Q_m is expressed as the vector [p_1, p_2, …, p_I].
The topic distribution of a question sentence is calculated with the Gensim (https://radimrehurek.com/gensim/) topic-model open-source tool. First the corpus is used as the input of the LdaModel function in Gensim, and an LDA model is obtained by training; then a question sentence whose topic distribution needs to be calculated is input into the trained LDA model, the output is the topic distribution of that question sentence, and the sentence is expressed as a vector.
For the newly posed question sentence Q_new and the historical question sentence Q_rel, the vector representations of the two question sentences based on topic distribution are calculated with the Gensim topic-model open-source tool, denoted LDA_new and LDA_rel respectively:
LDA_new = [0.001784, 0.001934, 0.002056, 0.002072, 0.001772, …]
LDA_rel = [0.001706, 0.001850, 0.001967, 0.001982, 0.001695, …]
The similarity R_lda(Q_new, Q_rel) based on latent sentence topic features is obtained as the cosine distance between LDA_new and LDA_rel: R_lda(Q_new, Q_rel) = 0.685844.
(4-5) Similarity based on answer semantic features
In a question answering system, each historical question corresponds to a set of candidate answers. For the newly posed question sentence Q_new, the historical question sentence Q_rel and the candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is calculated with a method similar to that of (4-2), as the similarity R_qa(Q_new, Q_rel) of the two question sentences based on answer semantic features. The detailed process is as follows:
(4-5-1) The 200-dimensional word vector of each word in the dictionary calculated in (4-2-1) is read;
(4-5-2) The matrix M_w2v calculated in (4-2-2) is read;
(4-5-3) The TF-IDF representation of Q_new obtained in (4-1-2) is read:
TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
The TF-IDF representation of the A_rel obtained by the preprocessing of step (3) is calculated, yielding the following 15-dimensional vector:
TFIDF_ans = [0.408248, 0.408248, 0.0, 0.0, 0.0, …]^T
(4-5-4) The similarity R_qa(Q_new, Q_rel) between Q_new and Q_rel based on answer semantic features is calculated by the soft cosine distance:
R_qa(Q_new, Q_rel) = (TFIDF_new · M_w2v · TFIDF_ans) / ( sqrt(TFIDF_new · M_w2v · TFIDF_new) × sqrt(TFIDF_ans · M_w2v · TFIDF_ans) ) = 0.018413
(5) Calculating the final similarity
The final similarity Sim(Q_new, Q_rel) between Q_new and Q_rel is calculated with the following formula:
Sim(Q_new, Q_rel) = Σ_k λ_k R_k(Q_new, Q_rel)
where R_k(Q_new, Q_rel) is the similarity based on feature k, obtained by (4-1)~(4-5) respectively:
R_lev(Q_new, Q_rel) = 0.225969, R_w2v(Q_new, Q_rel) = 0.304225, R_vec(Q_new, Q_rel) = 0.738933, R_lda(Q_new, Q_rel) = 0.685844, R_qa(Q_new, Q_rel) = 0.018413
λ_k is a parameter to be trained, obtained by linear regression analysis. The training uses forward stepwise regression, reducing the error as much as possible at every step, with the squared loss iterated for a certain number of times until the loss function is minimized. The squared loss function is as follows:
L = Σ_(i=1..I) (Ŷ^(i) − Y^(i))²
where I is the number of given training samples, a training sample being a known question pair together with the similarity of that pair; Ŷ^(i) is the predicted similarity of the i-th sample, and Y^(i) is the given similarity of the i-th sample.
The iterative process is as follows:
1. The weight λ_k corresponding to each feature is randomly initialized, and the squared loss of the current iteration is calculated from the weights;
2. According to the squared loss, the partial derivative with respect to the weight λ_k of each feature is taken, yielding the gradient ∇λ_k^t of the weight at the current iteration, where t denotes the t-th iteration;
3. The weight of each feature is updated according to λ_k^(t+1) = λ_k^t − α∇λ_k^t, where the step size α is taken as 0.1;
4. The squared loss is recalculated with the new weights; if the current squared loss is not less than the squared loss of the previous iteration, the iteration stops and the final value of the weight λ_k of each feature is obtained; otherwise steps 2-4 are repeated.
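Under the assumption of plain batch gradient descent on the squared loss (the forward stepwise variant mentioned above may differ in detail), the four steps can be sketched as follows; the training data here are synthetic:

```python
import random

def train_weights(features, targets, alpha=0.1, max_iter=1000, seed=0):
    """Batch gradient descent on the squared loss L = sum((y_hat - y)^2).

    features: one list of per-feature similarities R_k per training question pair.
    targets:  the given similarity Y of each pair.
    """
    rng = random.Random(seed)
    K = len(features[0])
    lam = [rng.uniform(-0.5, 0.5) for _ in range(K)]       # step 1: random init

    def predict(w, r):
        return sum(wk * rk for wk, rk in zip(w, r))

    def loss(w):
        return sum((predict(w, r) - y) ** 2 for r, y in zip(features, targets))

    prev = loss(lam)
    for _ in range(max_iter):
        grad = [sum(2 * (predict(lam, r) - y) * r[k]       # step 2: dL/d lambda_k
                    for r, y in zip(features, targets))
                for k in range(K)]
        lam = [wk - alpha * g for wk, g in zip(lam, grad)]  # step 3: update
        cur = loss(lam)
        if cur >= prev:                                     # step 4: stop criterion
            break
        prev = cur
    return lam, loss(lam)

# Synthetic check: targets generated from known weights [0.5, 0.25].
feats = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.5], [0.7, 0.3]]
ys = [0.5 * a + 0.25 * b for a, b in feats]
w, final_loss = train_weights(feats, ys)
print(w, final_loss)
```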
According to the above training steps, the present embodiment obtains the weights of R_k(Q_new, Q_rel):
λ_lev = 0.055985, λ_w2v = 0.753228, λ_vec = 0.207070, λ_lda = 0.475735, λ_qa = −0.122604
The final similarity is then calculated as:
Sim(Q_new, Q_rel) = 0.055985 × 0.225969 + 0.753228 × 0.304225 + 0.207070 × 0.738933 + 0.475735 × 0.685844 + (−0.122604) × 0.018413 ≈ 0.718835
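The final weighted sum over the embodiment's values can be checked directly:

```python
# Per-feature similarities and trained weights from the embodiment above:
R = {"lev": 0.225969, "w2v": 0.304225, "vec": 0.738933,
     "lda": 0.685844, "qa": 0.018413}
LAMBDA = {"lev": 0.055985, "w2v": 0.753228, "vec": 0.207070,
          "lda": 0.475735, "qa": -0.122604}

sim = sum(LAMBDA[k] * R[k] for k in R)  # Sim = sum_k lambda_k * R_k
print(round(sim, 6))  # → 0.718835
```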
The technology described in the present invention can be implemented by various means. For example, the technology may be implemented in hardware, firmware, software, or a combination thereof. For a hardware implementation, the processing modules may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, electronic devices, other electronic units designed to perform the functions described in the invention, or a combination thereof.
For a firmware and/or software implementation, the technology may be implemented with modules (e.g. procedures, steps, processes, etc.) that perform the functions described herein. The firmware and/or software code may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be carried out by program instructions running on related hardware; the aforementioned program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments; the aforementioned storage medium includes any medium capable of storing program code, such as ROM, RAM, magnetic disk, or optical disk.
The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited thereto; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent replacement and is included within the protection scope of the present invention.
Claims (10)
1. A multi-feature-based question similarity calculation method, characterized by comprising the steps of: for an input new question sentence, comparing it against the stored historical questions and their corresponding answers, and calculating, between the new question and a historical question, a similarity based on character features, a similarity based on word semantic features, a similarity based on sentence semantic features, a similarity based on latent sentence topic features, and a similarity based on answer semantic features; the final similarity is the sum of the products of the above 5 similarities and their respective weights, the weights being obtained by linear regression training.
2. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the new question sentence, historical questions, and corresponding answers used for the comparison are first preprocessed, the preprocessing including removing punctuation, case conversion, and removing stop words and low-frequency words.
3. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the method of calculating the character-feature-based similarity is: obtaining a relational matrix between words by calculating the edit distance between each pair of words, then calculating the soft cosine distance from the TF-IDF representations of the question sentences and the relational matrix, and taking the soft cosine distance as the character-feature-based similarity.
4. The multi-feature-based question similarity calculation method according to claim 3, characterized in that the relational matrix between words is calculated as follows:
Define the corpus as the question and answer text dataset used for training and testing the model, and assume the dictionary size of the corpus is n. The edit distances between words then form the relational matrix Mlev, in which element mi,j is the edit-distance score between the i-th word wi and the j-th word wj in the dictionary, calculated as:

mi,j = α, if i = j;  mi,j = (1 − lev(wi,wj) / max(||wi||, ||wj||))^β, otherwise

where ||wi|| is the number of characters contained in word wi, ||wj|| is the number of characters contained in word wj, α is the weighting factor of the diagonal elements, and β is an intensifier of the distance score. lev(wi,wj) is computed by the recursion:

lev(m, n) = max(m, n), if min(m, n) = 0;
lev(m, n) = min( lev(m−1, n) + 1, lev(m, n−1) + 1, lev(m−1, n−1) + cost ), otherwise

where m and n denote the lengths of the words wi and wj, and cost is the cost of replacing the m-th character of wi with the n-th character of wj: if the two characters agree, cost is 0, otherwise cost is 1.
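A sketch of claim 4 in Python: `lev` implements the standard dynamic-programming form of the recursion, while the exact off-diagonal form used in `relation_matrix` (a normalized edit similarity raised to β, with α on the diagonal) is an assumption for illustration, since the original formula image is not reproduced here:

```python
def lev(a, b):
    """Edit distance between strings a and b (iterative DP form of the recursion)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def relation_matrix(vocab, alpha=1.0, beta=2.0):
    """Edit-distance relation matrix M_lev over a small dictionary."""
    n = len(vocab)
    M = [[0.0] * n for _ in range(n)]
    for i, wi in enumerate(vocab):
        for j, wj in enumerate(vocab):
            if i == j:
                M[i][j] = alpha  # alpha weights the diagonal
            else:
                sim = 1.0 - lev(wi, wj) / max(len(wi), len(wj))
                M[i][j] = sim ** beta  # beta sharpens the score
    return M
```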
5. The multi-feature-based question similarity calculation method according to claim 3, characterized in that the TF-IDF representation of a question sentence is calculated as follows: in a sentence, for each word wi, a TF value and an IDF value are calculated, where the TF value is the frequency with which the word occurs in the current sentence and the IDF value is its inverse document frequency; the TF-IDF of word wi is:
TFIDFi = TFi * IDFi
For a newly posed question Qnew and a historical question Qrel, the soft cosine distance is calculated as follows. Qnew and Qrel are represented as TFIDFnew and TFIDFrel:
TFIDFnew = [dnew,1, dnew,2, …, dnew,n]T
TFIDFrel = [drel,1, drel,2, …, drel,n]T
where dnew,i is the TF-IDF value of wi in Qnew, drel,j is the TF-IDF value of wj in Qrel, n is the number of words in the corpus dictionary, and T denotes vector transposition.
Then, using the inter-word relational matrix Mlev obtained above, the character-feature-based similarity Rlev(Qnew,Qrel) between Qnew and Qrel is calculated as the soft cosine distance:

Rlev(Qnew,Qrel) = (TFIDFnewT·Mlev·TFIDFrel) / ( √(TFIDFnewT·Mlev·TFIDFnew) · √(TFIDFrelT·Mlev·TFIDFrel) )

where "·" denotes the dot product of vectors and matrices.
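The soft cosine distance of claim 5 can be sketched as follows for dense TF-IDF vectors `x`, `y` and a relational matrix `M`; all inputs here are toy values:

```python
import math

def soft_cosine(x, y, M):
    """Soft cosine similarity of dense vectors x, y under relation matrix M."""
    def quad(u, v):
        # computes u^T M v
        return sum(ui * M[i][j] * vj
                   for i, ui in enumerate(u)
                   for j, vj in enumerate(v))

    den = math.sqrt(quad(x, x)) * math.sqrt(quad(y, y))
    return quad(x, y) / den if den else 0.0
```

With M the identity matrix this reduces to the ordinary cosine similarity; off-diagonal entries let related words contribute to the score even when the questions share no exact word.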
6. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the method of calculating the similarity based on word semantic features comprises the steps of:
(6-1) training a distributed representation of each word in the corpus with the word2vec tool, i.e. each word corresponds to a K-dimensional real-valued vector;
(6-2) for a corpus with dictionary size n, computing the pairwise semantic relation mi,j, i,j ∈ [1,n], between words in the dictionary as the cosine distance between their word vectors, yielding the relational matrix Mw2v, Mw2v ∈ R^{n×n};
(6-3) reading the TF-IDF representations of the questions Qnew and Qrel, namely TFIDFnew and TFIDFrel;
(6-4) calculating the word-semantic-feature-based similarity Rw2v(Qnew,Qrel) between Qnew and Qrel as the soft cosine distance:

Rw2v(Qnew,Qrel) = (TFIDFnewT·Mw2v·TFIDFrel) / ( √(TFIDFnewT·Mw2v·TFIDFnew) · √(TFIDFrelT·Mw2v·TFIDFrel) )
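Step (6-2), building Mw2v from pairwise cosines, can be sketched as follows; the toy 2-dimensional vectors stand in for trained word2vec vectors:

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0

def w2v_relation_matrix(word_vectors):
    """Pairwise cosine matrix M_w2v over the dictionary; word_vectors[i]
    stands in for the word2vec vector of the i-th dictionary word."""
    n = len(word_vectors)
    return [[cosine(word_vectors[i], word_vectors[j]) for j in range(n)]
            for i in range(n)]
```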
7. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the method of calculating the similarity based on sentence semantic features comprises the steps of:
Given the word vectors wi ∈ R^K trained by the word2vec tool, and assuming a question sentence contains M words, the question is expressed as Qmatrix = (w1, w2, …, wM).
The question sentence is then represented as the arithmetic mean of the word vectors in the sentence, i.e. the vector:

AVG = (1/M) · Σ_{i=1}^{M} wi

According to the above formula, the questions Qnew and Qrel yield the vectors AVGnew and AVGrel by averaging the word vectors of their respective sentences, and the sentence-semantic-feature-based similarity Rvec(Qnew,Qrel) between Qnew and Qrel is obtained as the cosine distance between AVGnew and AVGrel.
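The sentence-averaging similarity of claim 7 can be sketched as follows; the nested lists stand in for per-word word2vec vectors:

```python
import math

def avg_vector(word_vectors):
    """Represent a sentence as the arithmetic mean of its word vectors."""
    m = len(word_vectors)
    k = len(word_vectors[0])
    return [sum(v[d] for v in word_vectors) / m for d in range(k)]

def r_vec(sent_a_vectors, sent_b_vectors):
    """Cosine similarity of the two averaged sentence vectors."""
    u, v = avg_vector(sent_a_vectors), avg_vector(sent_b_vectors)
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0
```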
8. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the method of calculating the similarity based on latent sentence topics comprises the steps of:
The corpus is used as the input of the LdaModel function in Gensim, and an LDA model is obtained by training; the newly posed question Qnew and the historical question Qrel whose topic distributions are to be computed are then fed to the trained LDA model, yielding topic-distribution vector representations of the two questions, denoted LDAnew and LDArel respectively; the sentence-latent-topic-feature-based similarity Rlda(Qnew,Qrel) between Qnew and Qrel is obtained as the cosine distance between LDAnew and LDArel.
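Assuming the topic distributions LDAnew and LDArel have already been inferred (in the method, via Gensim's LdaModel), the final step of claim 8 reduces to a cosine over the two distributions:

```python
import math

def topic_similarity(lda_new, lda_rel):
    """Cosine similarity between two LDA topic distributions.

    The distributions would come from a trained topic model, e.g.
    gensim's LdaModel(corpus, num_topics=T, id2word=dictionary);
    here they are given directly as dense probability vectors.
    """
    num = sum(a * b for a, b in zip(lda_new, lda_rel))
    den = (math.sqrt(sum(a * a for a in lda_new))
           * math.sqrt(sum(b * b for b in lda_rel)))
    return num / den if den else 0.0
```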
9. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the method of calculating the similarity based on answer semantic features is: for the newly posed question Qnew, the historical question Qrel, and the candidate answer Arel corresponding to Qrel, calculating the word-level semantic similarity between Qnew and Arel as the answer-semantic-feature-based similarity Rqa(Qnew,Qrel) of the two questions; the detailed process is as follows:
(9-1) training a distributed representation of each word in the corpus with the word2vec tool, i.e. each word corresponds to a K-dimensional real-valued vector;
(9-2) for a corpus with dictionary size n, computing the pairwise semantic relation mi,j, i,j ∈ [1,n], between words in the dictionary as the cosine distance between their word vectors, yielding the relational matrix Mw2v, Mw2v ∈ R^{n×n};
(9-3) reading the TF-IDF representation TFIDFnew of the question Qnew; representing Arel by TF-IDF to obtain TFIDFans = [dans,1, dans,2, …, dans,n]T, where dans,i is the TF-IDF value of wi in Arel, n is the number of words in the corpus dictionary, and T denotes vector transposition;
(9-4) calculating the answer-semantic-feature-based similarity Rqa(Qnew,Qrel) as the soft cosine distance:

Rqa(Qnew,Qrel) = (TFIDFnewT·Mw2v·TFIDFans) / ( √(TFIDFnewT·Mw2v·TFIDFnew) · √(TFIDFansT·Mw2v·TFIDFans) )
10. The multi-feature-based question similarity calculation method according to claim 1, characterized in that the final similarity is calculated as:

R(Qnew,Qrel) = Σ_k λk·Rk(Qnew,Qrel)

where Rk(Qnew,Qrel) denotes the similarity between Qnew and Qrel based on feature k, namely the character-feature-based similarity, the word-semantic-feature-based similarity, the sentence-semantic-feature-based similarity, the sentence-latent-topic-feature-based similarity, and the answer-semantic-feature-based similarity, respectively; λk is a parameter obtained by linear regression training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811041071.0A CN109344236B (en) | 2018-09-07 | 2018-09-07 | Problem similarity calculation method based on multiple characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109344236A true CN109344236A (en) | 2019-02-15 |
CN109344236B CN109344236B (en) | 2020-09-04 |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399615A (en) * | 2019-07-29 | 2019-11-01 | 中国工商银行股份有限公司 | Transaction risk monitoring method and device |
CN110532565A (en) * | 2019-08-30 | 2019-12-03 | 联想(北京)有限公司 | Sentence processing method and processing device and electronic equipment |
CN110543551A (en) * | 2019-09-04 | 2019-12-06 | 北京香侬慧语科技有限责任公司 | question and statement processing method and device |
CN110738049A (en) * | 2019-10-12 | 2020-01-31 | 招商局金融科技有限公司 | Similar text processing method and device and computer readable storage medium |
CN110781662A (en) * | 2019-10-21 | 2020-02-11 | 腾讯科技(深圳)有限公司 | Method for determining point-to-point mutual information and related equipment |
CN110825857A (en) * | 2019-09-24 | 2020-02-21 | 平安科技(深圳)有限公司 | Multi-turn question and answer identification method and device, computer equipment and storage medium |
CN111191464A (en) * | 2020-01-17 | 2020-05-22 | 珠海横琴极盛科技有限公司 | Semantic similarity calculation method based on combined distance |
CN111259668A (en) * | 2020-05-07 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Reading task processing method, model training device and computer equipment |
CN111368177A (en) * | 2020-03-02 | 2020-07-03 | 北京航空航天大学 | Answer recommendation method and device for question-answer community |
CN111401031A (en) * | 2020-03-05 | 2020-07-10 | 支付宝(杭州)信息技术有限公司 | Target text determination method, device and equipment |
CN111414765A (en) * | 2020-03-20 | 2020-07-14 | 北京百度网讯科技有限公司 | Sentence consistency determination method and device, electronic equipment and readable storage medium |
CN111582498A (en) * | 2020-04-30 | 2020-08-25 | 重庆富民银行股份有限公司 | QA (quality assurance) assistant decision method and system based on machine learning |
CN111680515A (en) * | 2020-05-21 | 2020-09-18 | 平安国际智慧城市科技股份有限公司 | Answer determination method and device based on AI (Artificial Intelligence) recognition, electronic equipment and medium |
CN111723297A (en) * | 2019-11-20 | 2020-09-29 | 中共南通市委政法委员会 | Grid social situation research and judgment-oriented dual semantic similarity discrimination method |
CN112380830A (en) * | 2020-06-18 | 2021-02-19 | 达而观信息科技(上海)有限公司 | Method, system and computer readable storage medium for matching related sentences in different documents |
CN112507097A (en) * | 2020-12-17 | 2021-03-16 | 神思电子技术股份有限公司 | Method for improving generalization capability of question-answering system |
CN112632252A (en) * | 2020-12-25 | 2021-04-09 | 中电金信软件有限公司 | Dialogue response method, dialogue response device, computer equipment and storage medium |
CN112926340A (en) * | 2021-03-25 | 2021-06-08 | 东南大学 | Semantic matching model for knowledge point positioning |
CN113139034A (en) * | 2020-01-17 | 2021-07-20 | 深圳市优必选科技股份有限公司 | Statement matching method, statement matching device and intelligent equipment |
CN113377927A (en) * | 2021-06-28 | 2021-09-10 | 成都卫士通信息产业股份有限公司 | Similar document detection method and device, electronic equipment and storage medium |
CN113392176A (en) * | 2020-09-28 | 2021-09-14 | 腾讯科技(深圳)有限公司 | Text similarity determination method, device, equipment and medium |
CN113722459A (en) * | 2021-08-31 | 2021-11-30 | 平安国际智慧城市科技股份有限公司 | Question and answer searching method based on natural language processing model and related device |
CN113779183A (en) * | 2020-06-08 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | Text matching method, device, equipment and storage medium |
CN113792125A (en) * | 2021-08-25 | 2021-12-14 | 北京库睿科技有限公司 | Intelligent retrieval sorting method and device based on text relevance and user intention |
CN113779183B (en) * | 2020-06-08 | 2024-05-24 | 北京沃东天骏信息技术有限公司 | Text matching method, device, equipment and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101286161A (en) * | 2008-05-28 | 2008-10-15 | 华中科技大学 | Intelligent Chinese request-answering system based on concept |
CN103729381A (en) * | 2012-10-16 | 2014-04-16 | 佳能株式会社 | Method and device used for recognizing semantic information in series of documents |
US20140229163A1 (en) * | 2013-02-12 | 2014-08-14 | International Business Machines Corporation | Latent semantic analysis for application in a question answer system |
CN105701253A (en) * | 2016-03-04 | 2016-06-22 | 南京大学 | Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method |
CN106997376A (en) * | 2017-02-28 | 2017-08-01 | 浙江大学 | The problem of one kind is based on multi-stage characteristics and answer sentence similarity calculating method |
CN107092596A (en) * | 2017-04-24 | 2017-08-25 | 重庆邮电大学 | Text emotion analysis method based on attention CNNs and CCR |
US20180075016A1 (en) * | 2016-09-15 | 2018-03-15 | International Business Machines Corporation | System and method for automatic, unsupervised paraphrase generation using a novel framework that learns syntactic construct while retaining semantic meaning |
Non-Patent Citations (1)
Title |
---|
YU Y et al.: "A noval similarity calculation method based on Chinese sentence keyword weight", Journal of Software * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||