CN109344236B - Problem similarity calculation method based on multiple characteristics - Google Patents

Problem similarity calculation method based on multiple characteristics

Info

Publication number
CN109344236B
CN109344236B
Authority
CN
China
Prior art keywords
new
rel
similarity
question
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811041071.0A
Other languages
Chinese (zh)
Other versions
CN109344236A (en)
Inventor
刘波 (Liu Bo)
彭永幸 (Peng Yongxing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN201811041071.0A
Publication of CN109344236A
Application granted
Publication of CN109344236B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a question similarity calculation method based on multiple features, which comprises the following steps: for an input new question sentence, compare it with the stored historical questions and their corresponding answers, and calculate the similarity between the new question and each historical question based on character features, word semantic features, sentence semantic features, sentence implicit topic features, and answer semantic features; the final similarity is the weighted sum of these 5 similarities, with the weights obtained by training with a linear regression method. The invention adopts multiple features to increase the diversity of sample attributes and improve the generalization capability of the model. Meanwhile, the soft cosine distance is used to fuse TF-IDF with edit-distance and word-semantic information, which overcomes the semantic gap between words and improves the accuracy of similarity calculation.

Description

Problem similarity calculation method based on multiple characteristics
Technical Field
The invention relates to the field of computer natural language processing and automatic question-answering system research, in particular to a problem similarity calculation method based on multiple characteristics.
Background
With the rapid growth of digital information, acquiring the required information resources from the network has become increasingly difficult. Finding the needed information for users accurately and quickly within massive amounts of digital data poses a serious challenge to natural language processing (NLP) and information retrieval technology. Therefore, in order to provide users with a highly real-time and accurate channel for information acquisition, research institutions and technology companies have begun to study automatic question-answering (QA) systems. In an automatic question-answering system, a user obtains the corresponding answer simply by inputting a question, without having to extract keywords, search, and read a large number of web pages to find the answer. Compared with traditional search engines, an automatic question-answering system is simpler, easier to use, real-time, and accurate; it provides a comfortable human-computer interaction experience and has become a research hotspot of a new generation of information technology. An automatic question-answering system allows users to describe questions in natural language form, accurately understands the user's question, organizes answers by retrieving information from a question-answer repository or the Internet, and finally returns a refined, accurate result, providing an efficient channel for information acquisition.
Question similarity calculation is the first step in an automatic question-answering system: its goal is to find, from the existing question set, the historical question most similar to a newly posed question, so that the answer to the new question can be given according to the answer set of that historical question.
At present, there are already some achievements in the field of automatic question answering. General community question-answering systems include Quora and Baidu Knows (Baidu Zhidao), among others, while professional community question-answering systems cover many specialties, for example the IT-related systems Stack Overflow and CSDN. The question similarity calculation method directly affects the accuracy of such question-answering systems, and therefore has good industrial prospects.
Through years of research, automatic question-answering systems have converged on a general framework consisting mainly of three modules: information retrieval, question analysis, and answer acquisition. The question analysis module analyzes the question input by the user and finds the historical questions most similar to it in the existing question set; its research content involves question similarity analysis and question ranking, the most important part being the similarity calculation between questions, by which the historical question set is ranked. The answer acquisition module obtains the corresponding answer set according to the similar question set returned by question retrieval.
Text similarity techniques are the basis of question similarity techniques (both questions and answers are text). There are three main approaches to calculating text similarity.
The first is similarity calculation based on the vector space model (VSM): a text is mapped to a point in a vector space, and the distance between points is calculated mathematically. Researchers have applied the VSM to similar-question retrieval for Frequently Asked Questions (FAQ) and improved it for the characteristics of the FAQ task. However, the sparseness of text makes the dimensionality too large and easily causes the semantic gap problem.
The second is similarity calculation based on syntactic analysis, which introduces a graphical representation to describe the governing and governed relations among the phrases in a sentence. Some researchers propose an analysis method based on deep structure: first analyze the dependency relations of the questions, select the most important words in the sentence and the effective words directly attached to them for pairing, and then calculate the text similarity of Chinese based on the dependency structure. However, the syntactic and dependency analysis of this method is complex, requires a linguistic background, and performs poorly on long sentences with complex structure.
The third is similarity calculation based on semantics, covering both word semantics and sentence semantics. Word-level semantic similarity calculation generally uses semantic dictionaries such as WordNet and HowNet, which record semantic relations between words. Researchers argue that the complete expression of a phrase depends not only on the syntactic structure but also on the words and their weights, and therefore use WordNet to improve the semantic representation of words. For sentence-level semantic similarity, researchers have used IBM's machine translation model to learn the conversion probability between two question sentences, thereby expressing their semantic similarity and retrieving similar questions. However, these methods depend heavily on the semantic dictionary, whose completeness and correctness affect the accuracy of the similarity calculation; likewise, semantics-based similarity calculation performs poorly on long, syntactically complex sentences.
Meanwhile, most existing methods extract text representation features from a single type of information and focus on single-type features, without considering that the meaning of a text is formed by multi-aspect, multi-level information, so the accuracy of the calculated similarity is not high.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a question similarity calculation method based on multiple features, which is suitable for calculating the similarity between questions in an English question-answering system and has the advantage of high accuracy.
The purpose of the invention is achieved by the following technical scheme: a question similarity calculation method based on multiple features comprises the following steps: for an input new question sentence, compare it with the stored historical questions and their corresponding answers, and calculate the similarity between the new question and each historical question based on character features, word semantic features, sentence semantic features, sentence implicit topic features, and answer semantic features; the final similarity is the weighted sum of the 5 similarities with their corresponding weights, the weights being obtained by training with a linear regression method.
Preferably, the new question sentence, the historical questions, and the corresponding answers used for the comparison are preprocessed, including punctuation removal, case conversion (all capital letters converted to lowercase), and removal of stop words and low-frequency words.
Preferably, the method for calculating the similarity based on character features is as follows: first obtain a relation matrix between words by calculating the edit distance between each pair of words, then compute the soft cosine distance from the TF-IDF (term frequency-inverse document frequency) representations of the question sentences and the relation matrix, as the similarity based on character features.
Furthermore, the calculation method of the relation matrix between the words is as follows:
Define the corpus as the question-and-answer text data set used for training and testing the model. If the dictionary of the corpus contains n words, the edit distances between words form a relation matrix M_lev ∈ R^{n×n}, where R^{n×n} is the set of real matrices of size n×n (the same meaning below), and the element m_{i,j} of M_lev is the edit-distance score between the i-th word w_i and the j-th word w_j in the dictionary. The score is calculated as follows:

$$m_{i,j}=\begin{cases}\alpha, & i=j\\ \left(1-\dfrac{lev(w_i,w_j)}{\max(\|w_i\|,\|w_j\|)}\right)^{\beta}, & i\neq j\end{cases}$$

where ||w_i|| is the number of characters in word w_i, ||w_j|| is the number of characters in word w_j, α is a weighting factor for the diagonal elements, and β is an enhancement factor for the distance score. The edit distance lev(w_i, w_j) is given by the recursion

$$lev_{w_i,w_j}(m,n)=\begin{cases}\max(m,n), & \min(m,n)=0\\ \min\bigl(lev(m-1,n)+1,\ lev(m,n-1)+1,\ lev(m-1,n-1)+cost\bigr), & \text{otherwise}\end{cases}$$

where m and n are the lengths (numbers of characters) of w_i and w_j, and cost is the cost of replacing the m-th character of w_i with the n-th character of w_j: 0 if the two characters are identical, 1 otherwise.
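As an illustration of this step, the following is a minimal Python sketch of the edit-distance relation matrix; the function names and the α, β defaults are illustrative (the embodiment below uses α = 1.8, β = 5), and the off-diagonal scoring follows the normalized form given above.

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Standard edit distance computed by dynamic programming."""
    m, n = len(a), len(b)
    dp = np.zeros((m + 1, n + 1), dtype=int)
    dp[:, 0] = np.arange(m + 1)                   # delete all of a's prefix
    dp[0, :] = np.arange(n + 1)                   # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # substitution cost
            dp[i, j] = min(dp[i - 1, j] + 1,          # deletion
                           dp[i, j - 1] + 1,          # insertion
                           dp[i - 1, j - 1] + cost)   # substitution
    return int(dp[m, n])

def edit_relation_matrix(vocab, alpha=1.8, beta=5):
    """M_lev: alpha on the diagonal; off the diagonal, the normalized
    edit-distance score raised to the power beta (reconstructed form)."""
    n = len(vocab)
    M = np.zeros((n, n))
    for i, wi in enumerate(vocab):
        for j, wj in enumerate(vocab):
            if i == j:
                M[i, j] = alpha
            else:
                sim = 1.0 - levenshtein(wi, wj) / max(len(wi), len(wj))
                M[i, j] = max(0.0, sim) ** beta
    return M
```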
Furthermore, the TF-IDF representation of a question sentence is calculated as follows: in a sentence, for each word w_i, compute the TF value, which represents the frequency of the word in the current sentence, and the IDF value, which is the inverse document frequency index; the TF-IDF value of word w_i is then

TFIDF_i = TF_i * IDF_i
Furthermore, for a newly posed question Q_new and a historical question Q_rel, the soft cosine distance is calculated as follows:
The TF-IDF representations of Q_new and Q_rel are denoted TFIDF_new and TFIDF_rel respectively:

TFIDF_new = [d_new,1, d_new,2, …, d_new,n]^T
TFIDF_rel = [d_rel,1, d_rel,2, …, d_rel,n]^T

where d_new,i is the TF-IDF value of w_i in Q_new, d_rel,j is the TF-IDF value of w_j in Q_rel, n is the number of words in the dictionary of the corpus, and T denotes vector transpose.
Then, using the relation matrix M_lev between words, the similarity between Q_new and Q_rel based on character features, R_lev(Q_new, Q_rel), is calculated with the soft cosine distance:

$$R_{lev}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{lev}\cdot TFIDF_{rel}}{\sqrt{TFIDF_{new}\cdot M_{lev}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{rel}\cdot M_{lev}\cdot TFIDF_{rel}}}$$

where "·" denotes the dot product of vectors and matrices (the same meaning below), calculated as

$$x\cdot M\cdot y=\sum_{i=1}^{n}\sum_{j=1}^{n}x_i\,m_{i,j}\,y_j$$
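As a minimal sketch, the soft cosine distance above can be computed as follows, assuming NumPy vectors for the TF-IDF representations and a precomputed relation matrix; all names are illustrative.

```python
import numpy as np

def soft_cosine(x: np.ndarray, y: np.ndarray, M: np.ndarray) -> float:
    """Soft cosine distance: x·M·y normalized by sqrt(x·M·x)·sqrt(y·M·y)."""
    num = x @ M @ y
    den = np.sqrt(x @ M @ x) * np.sqrt(y @ M @ y)
    return float(num / den) if den > 0 else 0.0

# e.g. R_lev = soft_cosine(tfidf_new, tfidf_rel, M_lev)
```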
Preferably, the method for calculating the similarity based on word semantic features is as follows:
(1) Train with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e., each word corresponds to a K-dimensional real vector.
(2) For a corpus with a dictionary of size n, calculate the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between their word vectors, obtaining the relation matrix M_w2v ∈ R^{n×n}.
(3) Read the TF-IDF representations of questions Q_new and Q_rel, denoted TFIDF_new and TFIDF_rel.
(4) Calculate the similarity between Q_new and Q_rel based on word semantic features, R_w2v(Q_new, Q_rel), using the soft cosine distance:

$$R_{w2v}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{rel}}{\sqrt{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{rel}\cdot M_{w2v}\cdot TFIDF_{rel}}}$$
Preferably, the method for calculating the similarity based on sentence semantic features is as follows:
Let w_i ∈ R^K be the word vectors obtained by word2vec training. If a question sentence contains M words, it can be represented as the matrix Q_matrix = (w_1, w_2, …, w_M), Q_matrix ∈ R^{K×M}.
The question sentence is then expressed as the arithmetic mean of the word vectors in the sentence, i.e., the vector

$$AVG=\frac{1}{M}\sum_{i=1}^{M}w_i$$

According to the above formula, questions Q_new and Q_rel yield the vectors AVG_new and AVG_rel by averaging the word vectors in each sentence; the cosine distance between AVG_new and AVG_rel gives the similarity between Q_new and Q_rel based on sentence semantic features, R_vec(Q_new, Q_rel).
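A sketch of this sentence-vector similarity, assuming `wv` maps a word to its trained word2vec vector (for example a Gensim KeyedVectors object); the names are illustrative.

```python
import numpy as np

def sentence_vector(tokens, wv):
    """AVG: arithmetic mean of the word vectors of a sentence."""
    vecs = [wv[t] for t in tokens if t in wv]   # skip out-of-vocabulary words
    return np.mean(vecs, axis=0)

def cosine(a, b):
    """Plain cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# R_vec = cosine(sentence_vector(q_new_tokens, wv), sentence_vector(q_rel_tokens, wv))
```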
Preferably, the method for calculating the similarity based on sentence implicit topic features is as follows:
The corpus is used as the input of the LdaModel function in Gensim, and an LDA model is obtained by training. Then the newly posed question Q_new and the historical question Q_rel, whose topic distributions are to be calculated, are input into the trained LDA model, giving vector representations of the two question sentences based on topic distribution, denoted LDA_new and LDA_rel respectively. The cosine distance between LDA_new and LDA_rel gives the similarity between Q_new and Q_rel based on sentence implicit topic features, R_lda(Q_new, Q_rel).
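Using Gensim as described, here is a sketch of training the LdaModel and mapping a question sentence to a dense topic-distribution vector; the topic count and variable names are assumptions.

```python
from gensim import corpora, models
import numpy as np

def train_lda(tokenized_docs, num_topics=20):
    """Train an LDA model on the tokenized corpus with Gensim's LdaModel."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bows = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)
    return lda, dictionary

def topic_vector(lda, dictionary, tokens):
    """LDA_q: the question's probability over every topic, as a dense vector."""
    bow = dictionary.doc2bow(tokens)
    dist = lda.get_document_topics(bow, minimum_probability=0.0)  # keep all topics
    return np.array([prob for _, prob in dist])

# R_lda = cosine similarity between topic_vector(...) of Q_new and of Q_rel
```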
Preferably, the method for calculating the similarity based on answer semantic features is as follows:
In the question-answering system, each historical question corresponds to a set of candidate answers. For a newly posed question Q_new, a historical question Q_rel, and a candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is calculated as the similarity of the two question sentences based on answer semantic features, R_qa(Q_new, Q_rel). The specific process is as follows:
(1) Train with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e., each word corresponds to a K-dimensional real vector.
(2) For a corpus with a dictionary of size n, calculate the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between their word vectors, obtaining the relation matrix M_w2v ∈ R^{n×n}.
(3) Read the TF-IDF representation TFIDF_new of question Q_new; represent A_rel by TF-IDF to obtain TFIDF_ans = [d_ans,1, d_ans,2, …, d_ans,n]^T, where d_ans,i is the TF-IDF value of w_i in A_rel, n is the number of words in the dictionary of the corpus, and T denotes vector transpose.
(4) Calculate the similarity between Q_new and Q_rel based on answer semantic features, R_qa(Q_new, Q_rel), using the soft cosine distance:

$$R_{qa}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{ans}}{\sqrt{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{ans}\cdot M_{w2v}\cdot TFIDF_{ans}}}$$
Preferably, the final similarity is calculated as:

$$Sim(Q_{new},Q_{rel})=\sum_{k}\lambda_k R_k(Q_{new},Q_{rel})$$

where R_k(Q_new, Q_rel) denotes the similarity based on feature k, i.e., the similarity based on character features, word semantic features, sentence semantic features, sentence implicit topic features, and answer semantic features respectively, and λ_k are the parameters to be trained, obtained by training with linear regression analysis.
Furthermore, in training with the linear regression analysis method, the iteration steps are as follows:
(1) Randomly initialize the weight λ_k corresponding to each feature, and calculate the squared loss of the current iteration from the weights; the squared loss function is

$$L(\lambda)=\sum_{i=1}^{I}\bigl(\hat{Y}^{(i)}-Y^{(i)}\bigr)^2$$

where I is the number of given training samples; the training samples are known question pairs together with the similarity of each pair, \hat{Y}^{(i)} is the predicted similarity of the i-th sample, and Y^{(i)} is the known similarity of the i-th sample.
(2) Take the partial derivative of the squared loss with respect to each feature weight λ_k to obtain the gradient of the weight at the current iteration, ∇λ_k^{(t)} = ∂L/∂λ_k, where t denotes the t-th iteration.
(3) Update each feature weight according to λ_k^{(t+1)} = λ_k^{(t)} − α·∇λ_k^{(t)}, where α is the step size.
(4) Recalculate the squared loss with the new weights. If the current squared loss is not less than the squared loss of the previous iteration, stop iterating and take the current weights λ_k as the final values; otherwise, repeat steps (2) to (4).
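The iteration above amounts to batch gradient descent on the squared loss; below is a sketch under the assumption that the per-feature similarities of the training pairs are stacked into a matrix (all names illustrative).

```python
import numpy as np

def train_weights(R, y, alpha=0.1, max_iter=1000, seed=0):
    """Fit the weights lambda_k by gradient descent on the squared loss.
    R: (I, 5) array, R[i, k] = similarity of training pair i under feature k.
    y: (I,) array of known similarities."""
    rng = np.random.default_rng(seed)
    lam = rng.uniform(-1.0, 1.0, R.shape[1])   # step (1): random initialization
    prev_loss = np.inf
    for _ in range(max_iter):
        pred = R @ lam                          # predicted similarities
        loss = np.sum((pred - y) ** 2)          # squared loss
        if loss >= prev_loss:                   # step (4): stop when no improvement
            break
        prev_loss = loss
        grad = 2.0 * R.T @ (pred - y)           # step (2): gradient dL/d(lambda_k)
        lam = lam - alpha * grad                # step (3): update with step size alpha
    return lam
```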
Compared with the prior art, the invention has the following advantages and technical effects:
(1) In the field of machine learning, a training sample is described by a set of attributes, and different attribute subsets provide different views of the observed data. The present invention observes question and answer sentences described in natural language from five different perspectives and extracts five types of features. Compared with representations based on a single type of feature, the use of multiple features increases the diversity of sample attributes and improves the generalization capability of the model.
(2) The method uses the soft cosine distance to fuse TF-IDF with edit-distance and word-semantic information when calculating the similarity between questions. Compared with traditional similarity calculation methods, this overcomes the semantic gap between words and improves the accuracy of similarity calculation.
Drawings
FIG. 1 is a flowchart of the method of the present embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
In this embodiment, a question similarity calculation method based on multiple features measures the similarity between two question sentences with 5 features: character features, word semantic features, sentence semantic features, sentence implicit topic features, and answer semantic features. The similarities based on these 5 features are combined into the final similarity between the new question and the historical question. Referring to FIG. 1, the steps of the method are described in detail through an example.
(1) Input a new question sentence Q_new: Where I can buy good oil for massage?
(2) Read a historical question sentence Q_rel: is there any place i can find scented massage oils in qatar?
At the same time, read the answer A_rel of the historical question: Yes. It is right behind Kahrama in the National area.
(3) Preprocess Q_new, Q_rel, and A_rel respectively, including: punctuation removal, case conversion (all capital letters converted to lowercase), and removal of stop words and low-frequency words, obtaining:
Q_new: buy good oil massage
Q_rel: place find scented massage oils qatar
A_rel: yes right behind kahrama national area
(4) Calculate the similarity between Q_new and Q_rel based on each of the following 5 features.
In the following, assume the corpus dictionary is {area, behind, buy, find, good, kahrama, massage, national, oil, oils, place, qatar, right, scented, yes}.
(4-1) Similarity based on character features
Character features measure the similarity between words at the character level, using the edit distance. First, the relation matrix M_lev between words is obtained by calculating the edit distance between each pair of words; then the soft cosine distance is calculated from the TF-IDF representations of the question sentences and the relation matrix M_lev, as the similarity based on character features. The specific steps are as follows:
(4-1-1) Calculate the relation matrix M_lev
Assume the dictionary of the corpus (i.e., the question-and-answer text data set used for training and testing the model; "corpus" has the same meaning below) contains n words. The edit distances between words form the relation matrix M_lev ∈ R^{n×n}, where R^{n×n} is the set of real matrices of size n×n, and the element m_{i,j} of M_lev is the edit-distance score between the i-th word w_i and the j-th word w_j in the dictionary:

$$m_{i,j}=\begin{cases}\alpha, & i=j\\ \left(1-\dfrac{lev(w_i,w_j)}{\max(\|w_i\|,\|w_j\|)}\right)^{\beta}, & i\neq j\end{cases}$$

where ||w_i|| is the number of characters in word w_i, ||w_j|| is the number of characters in word w_j, α is a weighting factor for the diagonal elements, and β is an enhancement factor for the distance score; here α = 1.8 and β = 5. lev(w_i, w_j) is given by the recursion

$$lev_{w_i,w_j}(m,n)=\begin{cases}\max(m,n), & \min(m,n)=0\\ \min\bigl(lev(m-1,n)+1,\ lev(m,n-1)+1,\ lev(m-1,n-1)+cost\bigr), & \text{otherwise}\end{cases}$$

where m and n are the lengths (numbers of characters) of w_i and w_j, and cost is the cost of replacing the m-th character of w_i with the n-th character of w_j: 0 if the two characters are identical, 1 otherwise.
Since the dictionary contains 15 words, M_lev is a 15 × 15 matrix.
(4-1-2) Calculate the TF-IDF representation of the question sentences
TF-IDF consists of TF and IDF.
In a sentence, for word w_i, the TF value is calculated as

$$TF_i=\frac{n_i}{\sum_k n_k}$$

where TF_i is the frequency of the i-th word in the current sentence, n_i is the number of times the word occurs in the current sentence, and n_k is the number of occurrences of the k-th word in the current sentence.
For a given corpus, the IDF value of each word is fixed; for word w_i it is calculated as

$$IDF_i=\log\frac{|D|}{|\{d: w_i\in d\}|}$$

where |D| is the total number of texts and the denominator is the number of texts containing the word.
In a sentence, the TF-IDF value of word w_i is

TFIDF_i = TF_i * IDF_i
The question sentences Q_new and Q_rel are expressed as the following vectors:

TFIDF_new = [d_new,1, d_new,2, …, d_new,n]^T
TFIDF_rel = [d_rel,1, d_rel,2, …, d_rel,n]^T

where d_new,i is the TF-IDF value of w_i in Q_new, d_rel,j is the TF-IDF value of w_j in Q_rel, n is the number of words in the dictionary of the corpus, and T denotes vector transpose.
The question sentences Q_new and Q_rel obtained by the preprocessing in step (3) are represented as the following 15-dimensional vectors:

TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
TFIDF_rel = [0.0, 0.0, 0.0, 0.423394, 0.0, …]^T
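Here is a sketch of these TF-IDF vectors over the corpus dictionary; the +1 in the IDF denominator is a common smoothing choice assumed here, not stated in the text.

```python
import math
from collections import Counter

def idf_table(tokenized_docs, vocab):
    """IDF_i = log(|D| / number of texts containing w_i); +1 avoids division by zero."""
    D = len(tokenized_docs)
    doc_sets = [set(d) for d in tokenized_docs]
    return {w: math.log(D / (1 + sum(w in s for s in doc_sets))) for w in vocab}

def tfidf_vector(tokens, vocab, idf):
    """TF-IDF representation of one sentence over the corpus dictionary."""
    counts = Counter(tokens)
    total = sum(counts.values())        # total word occurrences in the sentence
    return [counts[w] / total * idf[w] for w in vocab]
```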
(4-1-3) Calculate the soft cosine distance
The soft cosine distance is a variant of the cosine distance: in 2014, Sidorov proposed an improved cosine similarity calculation method named the soft cosine distance (soft cosine), which introduces a relation matrix to express the relations between words when calculating the cosine distance.
With Q_new and Q_rel expressed as TFIDF_new and TFIDF_rel by step (4-1-2), and the relation matrix M_lev between words obtained by step (4-1-1), the similarity between Q_new and Q_rel based on character features, R_lev(Q_new, Q_rel), is calculated with the soft cosine distance:

$$R_{lev}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{lev}\cdot TFIDF_{rel}}{\sqrt{TFIDF_{new}\cdot M_{lev}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{rel}\cdot M_{lev}\cdot TFIDF_{rel}}}$$

where "·" denotes the dot product of vectors and matrices (the same meaning below), calculated as

$$x\cdot M\cdot y=\sum_{i=1}^{n}\sum_{j=1}^{n}x_i\,m_{i,j}\,y_j$$

In this example, R_lev(Q_new, Q_rel) = 0.225969.
(4-2) Similarity based on word semantic features
(4-2-1) Train with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e., each word corresponds to a 200-dimensional word vector.
(4-2-2) For the corpus with a dictionary of size n, calculate the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between their word vectors, obtaining the relation matrix M_w2v ∈ R^{n×n}. m_{i,j} is calculated as

$$m_{i,j}=\max\bigl(0,\ \cos(w_i,w_j)\bigr)^2$$

where w_i and w_j are the K-dimensional real vectors of the i-th and j-th words in the corpus, w_i, w_j ∈ R^K, with R^K the set of one-dimensional vectors of length K (the same meaning below), and "·" the standard dot product between vectors (the same meaning below):

$$\cos(w_i,w_j)=\frac{w_i\cdot w_j}{\|w_i\|\,\|w_j\|},\qquad w_i\cdot w_j=\sum_{m=1}^{K}w_{i,m}\,w_{j,m}$$

where w_{i,m} is the m-th component of w_i and w_{j,m} is the m-th component of w_j.
In this embodiment the dictionary contains 15 words, so M_w2v is a 15 × 15 matrix.
(4-2-3) Read the TF-IDF representations of Q_new and Q_rel calculated in (4-1-2):

TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T
TFIDF_rel = [0.0, 0.0, 0.0, 0.423394, 0.0, …]^T

(4-2-4) Obtain the similarity between Q_new and Q_rel based on word semantic features from the soft cosine distance formula:

$$R_{w2v}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{rel}}{\sqrt{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{rel}\cdot M_{w2v}\cdot TFIDF_{rel}}}=0.304225$$
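The relation matrix M_w2v can be built from the trained word vectors as below; `wv` is assumed to be the word-to-vector mapping from the word2vec training step.

```python
import numpy as np

def w2v_relation_matrix(vocab, wv):
    """M_w2v with m_ij = max(0, cos(w_i, w_j))^2, per the formula above."""
    V = np.array([wv[w] for w in vocab])          # (n, K) matrix of word vectors
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    cos = (V @ V.T) / (norms @ norms.T)           # pairwise cosine similarities
    return np.maximum(0.0, cos) ** 2
```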
(4-3) Similarity based on sentence semantic features
Let w_i ∈ R^K be the word vectors obtained by word2vec training. If a question sentence contains M words, it can be represented as Q_matrix = (w_1, w_2, …, w_M), Q_matrix ∈ R^{K×M}.
The question sentence is expressed as the arithmetic mean of the word vectors in the sentence, i.e., the vector

$$AVG=\frac{1}{M}\sum_{i=1}^{M}w_i$$

According to the above formula, averaging the word vectors of Q_new and Q_rel gives the vectors AVG_new and AVG_rel:

AVG_new = [0.014657, 0.075914, −0.042454, 0.219559, −0.117374, …]
AVG_rel = [−0.088187, −0.025432, −0.05328, 0.17098376, −0.13033055, …]

The cosine distance between AVG_new and AVG_rel gives the similarity between Q_new and Q_rel based on sentence semantic features:

$$R_{vec}(Q_{new},Q_{rel})=\frac{AVG_{new}\cdot AVG_{rel}}{\|AVG_{new}\|\,\|AVG_{rel}\|}=0.738933$$
(4-4) Similarity based on sentence implicit topic features
The invention uses the LDA (Latent Dirichlet Allocation) implicit topic model to obtain the implicit topics of a question sentence. After LDA training, the implicit topic distribution of each document in the document set is obtained, and from it the topic vector of a sentence. For example, the implicit topic distribution of sentence Q_m is (θ_1, θ_2, …, θ_I), where θ_i is the probability that Q_m belongs to the i-th topic and I is the number of implicit topics; Q_m is then expressed as the vector

LDA_m = [θ_1, θ_2, …, θ_I]

The topic distribution of a question sentence is calculated with the Gensim (https://radimrehurek.com/gensim/) topic-model open-source tool. First, the corpus is used as the input of the LdaModel function in Gensim and an LDA model is obtained by training; then a question sentence whose topic distribution is to be calculated is input into the trained LDA model, the output is the topic distribution of the question, and the sentence is expressed as a vector.
For the newly posed question Q_new and the historical question Q_rel, the vector representations of the two question sentences based on topic distribution are obtained through the Gensim topic-model tool, denoted LDA_new and LDA_rel respectively:

LDA_new = [0.001784, 0.001934, 0.002056, 0.002072, 0.001772, …]
LDA_rel = [0.001706, 0.001850, 0.001967, 0.001982, 0.001695, …]

The cosine distance between LDA_new and LDA_rel gives the similarity based on sentence implicit topic features:

$$R_{lda}(Q_{new},Q_{rel})=\frac{LDA_{new}\cdot LDA_{rel}}{\|LDA_{new}\|\,\|LDA_{rel}\|}=0.685844$$
(4-5) Similarity based on answer semantic features
In the question-answering system, each historical question corresponds to a set of candidate answers. For the newly posed question Q_new, the historical question Q_rel, and the candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is calculated by a method similar to (4-2), as the similarity of the two question sentences based on answer semantic features, R_qa(Q_new, Q_rel). The specific process is as follows:
(4-5-1) Read the 200-dimensional word vectors of the words in the dictionary calculated in (4-2-1);
(4-5-2) Read the matrix M_w2v calculated in (4-2-2);
(4-5-3) Read the TF-IDF representation of Q_new obtained in (4-1-2):

TFIDF_new = [0.0, 0.0, 0.528634, 0.0, 0.528634, …]^T

Calculate the TF-IDF representation of the A_rel obtained by the preprocessing in step (3), giving the following 15-dimensional vector:

TFIDF_ans = [0.408248, 0.408248, 0.0, 0.0, 0.0, …]^T

(4-5-4) Calculate the similarity between Q_new and Q_rel based on answer semantic features with the soft cosine distance:

$$R_{qa}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{ans}}{\sqrt{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{ans}\cdot M_{w2v}\cdot TFIDF_{ans}}}=0.018413$$
(5) Calculate the final similarity
The final similarity Sim(Q_new, Q_rel) between Q_new and Q_rel is calculated as

$$Sim(Q_{new},Q_{rel})=\sum_{k}\lambda_k R_k(Q_{new},Q_{rel})$$

where R_k(Q_new, Q_rel) denotes the similarity based on feature k, obtained in (4-1) to (4-5), namely:

R_lev(Q_new, Q_rel) = 0.225969, R_w2v(Q_new, Q_rel) = 0.304225, R_vec(Q_new, Q_rel) = 0.738933, R_lda(Q_new, Q_rel) = 0.685844, R_qa(Q_new, Q_rel) = 0.018413

λ_k are the parameters to be trained, obtained by training with linear regression analysis. The training method is forward stepwise regression, reducing the error as much as possible at each step. A squared loss function is used, and the iteration runs until the loss function is minimized:

$$L(\lambda)=\sum_{i=1}^{I}\bigl(\hat{Y}^{(i)}-Y^{(i)}\bigr)^2$$

where I is the number of given training samples; the training samples are known question pairs together with the similarity of each pair, \hat{Y}^{(i)} is the predicted similarity of the i-th sample, and Y^{(i)} is the known similarity of the i-th sample.
The iterative process is as follows:
1. Randomly initialize the weight λ_k corresponding to each feature, and calculate the squared loss of the current iteration from the weights;
2. Take the partial derivative of the squared loss with respect to each feature weight λ_k to obtain the gradient of the weight at the current iteration, ∇λ_k^{(t)} = ∂L/∂λ_k, where t denotes the t-th iteration;
3. Update each feature weight according to λ_k^{(t+1)} = λ_k^{(t)} − α·∇λ_k^{(t)}, where the step size α is 0.1;
4. Recalculate the squared loss with the new weights. If the current squared loss is not less than the squared loss of the previous iteration, stop iterating and take the current weights λ_k as the final values; otherwise, repeat steps 2 to 4.
In this embodiment, training by the above steps gives the weights of R_k(Q_new, Q_rel):

λ_lev = 0.055985, λ_w2v = 0.753228, λ_vec = 0.207070, λ_lda = 0.475735, λ_qa = −0.122604

The final similarity is then

Sim(Q_new, Q_rel) = 0.055985 × 0.225969 + 0.753228 × 0.304225 + 0.207070 × 0.738933 + 0.475735 × 0.685844 − 0.122604 × 0.018413 ≈ 0.7188
the techniques described herein may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. For a hardware implementation, the processing modules may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Programmable Logic Devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, micro-controllers, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.
For a firmware and/or software implementation, the techniques may be implemented with modules (e.g., procedures, steps, flows, and so on) that perform the functions described herein. The firmware and/or software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (3)

1. A question similarity calculation method based on multiple features, characterized by comprising the following steps: for an input new question sentence, comparing it with stored historical questions and corresponding answers, and calculating the similarity between the new question and the historical question based on character features, the similarity based on word semantic features, the similarity based on sentence semantic features, the similarity based on sentence implicit topic features, and the similarity based on answer semantic features; the final similarity is the weighted sum of the 5 similarities with their corresponding weights, the weights being obtained by training with a linear regression method;
the method for calculating the similarity based on character features is as follows: obtain a relation matrix between words by calculating the edit distance between each pair of words, then calculate the soft cosine distance from the TF-IDF representations of the question sentences and the relation matrix, as the similarity based on character features;
the relation matrix between words is calculated as follows:
define the corpus as the question-and-answer text data set used for training and testing the model; if the dictionary of the corpus contains n words, the edit distances between words form a relation matrix M_lev, whose element m_{i,j} is the edit-distance score between the i-th word w_i and the j-th word w_j in the dictionary:

$$m_{i,j}=\begin{cases}\alpha, & i=j\\ \left(1-\dfrac{lev(w_i,w_j)}{\max(\|w_i\|,\|w_j\|)}\right)^{\beta}, & i\neq j\end{cases}$$

wherein ||w_i|| is the number of characters in word w_i, ||w_j|| is the number of characters in word w_j, α is a weighting factor for the diagonal elements, β is an enhancement factor for the distance score, and lev(w_i, w_j) is given by the recursion

$$lev_{w_i,w_j}(m,n)=\begin{cases}\max(m,n), & \min(m,n)=0\\ \min\bigl(lev(m-1,n)+1,\ lev(m,n-1)+1,\ lev(m-1,n-1)+cost\bigr), & \text{otherwise}\end{cases}$$

wherein m and n denote the lengths (numbers of characters) of w_i and w_j, and cost denotes the cost of replacing the m-th character of w_i with the n-th character of w_j: if the two characters are identical the cost is 0, otherwise 1;
the TF-IDF representation of a question sentence is calculated as follows: in a sentence, for each word w_i, calculate the TF value, representing the frequency of the word in the current sentence, and the IDF value, representing the inverse document frequency index; the TF-IDF value of word w_i is

TFIDF_i = TF_i * IDF_i

for the newly posed question Q_new and the historical question Q_rel, the soft cosine distance is calculated as follows:
the TF-IDF representations of Q_new and Q_rel are denoted TFIDF_new and TFIDF_rel respectively:

TFIDF_new = [d_new,1, d_new,2, …, d_new,n]^T
TFIDF_rel = [d_rel,1, d_rel,2, …, d_rel,n]^T

wherein d_new,i denotes the TF-IDF value of w_i in Q_new, d_rel,j denotes the TF-IDF value of w_j in Q_rel, n denotes the number of words in the dictionary of the corpus, and T denotes vector transpose;
meanwhile, according to the relation matrix M_lev between words, the similarity between Q_new and Q_rel based on character features, R_lev(Q_new, Q_rel), is calculated with the soft cosine distance:

$$R_{lev}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{lev}\cdot TFIDF_{rel}}{\sqrt{TFIDF_{new}\cdot M_{lev}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{rel}\cdot M_{lev}\cdot TFIDF_{rel}}}$$

wherein "·" is the dot product of vectors and matrices;
the method for calculating the similarity based on word semantic features is as follows:
(6-1) train with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e., each word corresponds to a K-dimensional real vector;
(6-2) for the corpus with a dictionary of size n, calculate the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between word vectors, obtaining the relation matrix M_w2v ∈ R^{n×n};
(6-3) read the TF-IDF representations of questions Q_new and Q_rel, denoted TFIDF_new and TFIDF_rel;
(6-4) calculate the similarity between Q_new and Q_rel based on word semantic features, R_w2v(Q_new, Q_rel), with the soft cosine distance:

$$R_{w2v}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{rel}}{\sqrt{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{rel}\cdot M_{w2v}\cdot TFIDF_{rel}}}$$
the method for calculating the similarity based on sentence semantic features is as follows:
let w_i ∈ R^K be the word vectors obtained by word2vec training; if a question sentence contains M words, it is represented as Q_matrix = (w_1, w_2, …, w_M), Q_matrix ∈ R^{K×M};
the question sentence is expressed as the arithmetic mean of the word vectors in the sentence, i.e., the vector

$$AVG=\frac{1}{M}\sum_{i=1}^{M}w_i$$

according to the above formula, questions Q_new and Q_rel yield the vectors AVG_new and AVG_rel by averaging the word vectors in each sentence; the cosine distance between AVG_new and AVG_rel gives the similarity between Q_new and Q_rel based on sentence semantic features, R_vec(Q_new, Q_rel);
the method for calculating the similarity based on sentence implicit topic features is as follows:
the corpus is used as the input of the LdaModel function in Gensim, and an LDA model is obtained by training; then the newly posed question Q_new and the historical question Q_rel, whose topic distributions are to be calculated, are input into the trained LDA model, giving vector representations of the two question sentences based on topic distribution, denoted LDA_new and LDA_rel respectively; the cosine distance between LDA_new and LDA_rel gives the similarity between Q_new and Q_rel based on sentence implicit topic features, R_lda(Q_new, Q_rel);
the method for calculating the similarity based on answer semantic features is as follows:
for the newly posed question Q_new, the historical question Q_rel, and the candidate answer A_rel corresponding to Q_rel, the word-level semantic similarity between Q_new and A_rel is calculated as the similarity of the two question sentences based on answer semantic features, R_qa(Q_new, Q_rel); the specific process is as follows:
(9-1) train with the word2vec tool to obtain a distributed representation of each word in the corpus, i.e., each word corresponds to a K-dimensional real vector;
(9-2) for the corpus with a dictionary of size n, calculate the semantic relation m_{i,j}, i, j ∈ [1, n], between every two words in the dictionary from the cosine distance between word vectors, obtaining the relation matrix M_w2v ∈ R^{n×n};
(9-3) read the TF-IDF representation TFIDF_new of question Q_new; represent A_rel by TF-IDF to obtain TFIDF_ans = [d_ans,1, d_ans,2, …, d_ans,n]^T, wherein d_ans,i denotes the TF-IDF value of w_i in A_rel, n denotes the number of words in the dictionary of the corpus, and T denotes vector transpose;
(9-4) calculate the similarity between Q_new and Q_rel based on answer semantic features, R_qa(Q_new, Q_rel), with the soft cosine distance:

$$R_{qa}(Q_{new},Q_{rel})=\frac{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{ans}}{\sqrt{TFIDF_{new}\cdot M_{w2v}\cdot TFIDF_{new}}\ \sqrt{TFIDF_{ans}\cdot M_{w2v}\cdot TFIDF_{ans}}}$$
2. The question similarity calculation method based on multiple features according to claim 1, wherein the new question sentence, the historical questions, and the corresponding answers used for the comparison are preprocessed, the preprocessing comprising punctuation removal, case conversion, and removal of stop words and low-frequency words.
3. The question similarity calculation method based on multiple features according to claim 1, wherein the final similarity is calculated as:

$$Sim(Q_{new},Q_{rel})=\sum_{k}\lambda_k R_k(Q_{new},Q_{rel})$$

wherein R_k(Q_new, Q_rel) denotes the similarity between Q_new and Q_rel based on feature k, i.e., the similarity based on character features, the similarity based on word semantic features, the similarity based on sentence semantic features, the similarity based on sentence implicit topic features, and the similarity based on answer semantic features; λ_k are the parameters to be trained, obtained by training with linear regression analysis.
CN201811041071.0A 2018-09-07 2018-09-07 Problem similarity calculation method based on multiple characteristics Active CN109344236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811041071.0A CN109344236B (en) 2018-09-07 2018-09-07 Problem similarity calculation method based on multiple characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811041071.0A CN109344236B (en) 2018-09-07 2018-09-07 Problem similarity calculation method based on multiple characteristics

Publications (2)

Publication Number Publication Date
CN109344236A CN109344236A (en) 2019-02-15
CN109344236B true CN109344236B (en) 2020-09-04

Family

ID=65304890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811041071.0A Active CN109344236B (en) 2018-09-07 2018-09-07 Problem similarity calculation method based on multiple characteristics

Country Status (1)

Country Link
CN (1) CN109344236B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399615B (en) * 2019-07-29 2023-08-18 中国工商银行股份有限公司 Transaction risk monitoring method and device
CN110532565B (en) * 2019-08-30 2022-03-25 联想(北京)有限公司 Statement processing method and device and electronic equipment
CN110543551B (en) * 2019-09-04 2022-11-08 北京香侬慧语科技有限责任公司 Question and statement processing method and device
CN110825857B (en) * 2019-09-24 2023-07-21 平安科技(深圳)有限公司 Multi-round question and answer identification method and device, computer equipment and storage medium
CN110738049B (en) * 2019-10-12 2023-04-18 招商局金融科技有限公司 Similar text processing method and device and computer readable storage medium
CN110781662B (en) * 2019-10-21 2022-02-01 腾讯科技(深圳)有限公司 Method for determining point-to-point mutual information and related equipment
CN111723297B (en) * 2019-11-20 2023-05-12 中共南通市委政法委员会 Dual-semantic similarity judging method for grid society situation research and judgment
CN113139034A (en) * 2020-01-17 2021-07-20 深圳市优必选科技股份有限公司 Statement matching method, statement matching device and intelligent equipment
CN111191464A (en) * 2020-01-17 2020-05-22 珠海横琴极盛科技有限公司 Semantic similarity calculation method based on combined distance
CN111368177B (en) * 2020-03-02 2023-10-24 北京航空航天大学 Answer recommendation method and device for question-answer community
CN111401031A (en) * 2020-03-05 2020-07-10 支付宝(杭州)信息技术有限公司 Target text determination method, device and equipment
CN111414765B (en) * 2020-03-20 2023-07-25 北京百度网讯科技有限公司 Sentence consistency determination method and device, electronic equipment and readable storage medium
CN111582498B (en) * 2020-04-30 2023-05-12 重庆富民银行股份有限公司 QA auxiliary decision-making method and system based on machine learning
CN111259668B (en) * 2020-05-07 2020-08-18 腾讯科技(深圳)有限公司 Reading task processing method, model training device and computer equipment
CN111680515B (en) * 2020-05-21 2022-05-03 平安国际智慧城市科技股份有限公司 Answer determination method and device based on AI (Artificial Intelligence) recognition, electronic equipment and medium
CN113779183B (en) * 2020-06-08 2024-05-24 北京沃东天骏信息技术有限公司 Text matching method, device, equipment and storage medium
CN112380830B (en) * 2020-06-18 2024-05-17 达观数据有限公司 Matching method, system and computer readable storage medium for related sentences in different documents
CN113392176B (en) * 2020-09-28 2023-08-22 腾讯科技(深圳)有限公司 Text similarity determination method, device, equipment and medium
CN112507097B (en) * 2020-12-17 2022-11-18 神思电子技术股份有限公司 Method for improving generalization capability of question-answering system
CN112632252B (en) * 2020-12-25 2021-09-17 中电金信软件有限公司 Dialogue response method, dialogue response device, computer equipment and storage medium
CN112926340B (en) * 2021-03-25 2024-05-07 东南大学 Semantic matching model for knowledge point positioning
CN113377927A (en) * 2021-06-28 2021-09-10 成都卫士通信息产业股份有限公司 Similar document detection method and device, electronic equipment and storage medium
CN113792125B (en) * 2021-08-25 2024-04-02 北京库睿科技有限公司 Intelligent retrieval ordering method and device based on text relevance and user intention
CN113722459A (en) * 2021-08-31 2021-11-30 平安国际智慧城市科技股份有限公司 Question and answer searching method based on natural language processing model and related device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN103729381A (en) * 2012-10-16 2014-04-16 佳能株式会社 Method and device used for recognizing semantic information in series of documents
CN105701253A (en) * 2016-03-04 2016-06-22 南京大学 Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method
CN106997376A (en) * 2017-02-28 2017-08-01 浙江大学 The problem of one kind is based on multi-stage characteristics and answer sentence similarity calculating method
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9135240B2 (en) * 2013-02-12 2015-09-15 International Business Machines Corporation Latent semantic analysis for application in a question answer system
US9953027B2 (en) * 2016-09-15 2018-04-24 International Business Machines Corporation System and method for automatic, unsupervised paraphrase generation using a novel framework that learns syntactic construct while retaining semantic meaning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
CN103729381A (en) * 2012-10-16 2014-04-16 佳能株式会社 Method and device used for recognizing semantic information in series of documents
CN105701253A (en) * 2016-03-04 2016-06-22 南京大学 Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method
CN106997376A (en) * 2017-02-28 2017-08-01 浙江大学 The problem of one kind is based on multi-stage characteristics and answer sentence similarity calculating method
CN107092596A (en) * 2017-04-24 2017-08-25 重庆邮电大学 Text emotion analysis method based on attention CNNs and CCR

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"A noval similarity calculation method based on Chinese sentence keyword weight";Yu Y等;《Journal of software》;20140530;第1151-1155页 *

Also Published As

Publication number Publication date
CN109344236A (en) 2019-02-15

Similar Documents

Publication Publication Date Title
CN109344236B (en) Problem similarity calculation method based on multiple characteristics
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
Liu et al. Learning semantic word embeddings based on ordinal knowledge constraints
US20170177563A1 (en) Methods and systems for automated text correction
Xie et al. Topic enhanced deep structured semantic models for knowledge base question answering
US11068653B2 (en) System and method for context-based abbreviation disambiguation using machine learning on synonyms of abbreviation expansions
US20240111956A1 (en) Nested named entity recognition method based on part-of-speech awareness, device and storage medium therefor
CN111428490A (en) Reference resolution weak supervised learning method using language model
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
Hussein A plagiarism detection system for arabic documents
Zhang et al. Term recognition using conditional random fields
Celikyilmaz et al. A graph-based semi-supervised learning for question-answering
CN111400487A (en) Quality evaluation method of text abstract
CN110991193A (en) Translation matrix model selection system based on OpenKiwi
Jian et al. English text readability measurement based on convolutional neural network: A hybrid network model
Zhang et al. Disease prediction and early intervention system based on symptom similarity analysis
CN111581365B (en) Predicate extraction method
CN116108840A (en) Text fine granularity emotion analysis method, system, medium and computing device
CN112926340B (en) Semantic matching model for knowledge point positioning
Ghasemi et al. Farsick: A persian semantic textual similarity and natural language inference dataset
Zhang et al. Dual attention model for citation recommendation with analyses on explainability of attention mechanisms and qualitative experiments
CN111767388B (en) Candidate pool generation method
CN114265924A (en) Method and device for retrieving associated table according to question
Reshmi et al. Textual entailment based on semantic similarity using wordnet
Rei et al. Parser lexicalisation through self-learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant