CN104899188A - Problem similarity calculation method based on subjects and focuses of problems - Google Patents

Problem similarity calculation method based on subjects and focuses of problems Download PDF

Info

Publication number
CN104899188A
CN104899188A (application CN201510270876.2A)
Authority
CN
China
Prior art keywords
word
similarity
theme
focus
relevant issues
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510270876.2A
Other languages
Chinese (zh)
Inventor
鲁伟明
余瑶
吴江琴
庄越挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201510270876.2A priority Critical patent/CN104899188A/en
Publication of CN104899188A publication Critical patent/CN104899188A/en
Pending legal-status Critical Current

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a question similarity calculation method based on the topics and focuses of questions. Basic preprocessing, such as word segmentation, is performed on the question data with a tokenizer; on that basis, a tree cut model based on the minimum description length divides each question into a question topic and a question focus. For the topic structures and focus structures of two questions, a language model and a translation-based language model are used respectively to calculate similarity scores, and a joint similarity is obtained by weighted summation. A topic similarity between the two questions is then calculated with a method based on the BTM topic model, and the two similarities are finally combined by weighted summation to obtain the final question similarity. The present invention introduces the structural features and topic information of questions into question similarity calculation, makes fuller use of the information in the questions, and, by bringing question topic information beyond word statistics into the calculation, improves the accuracy of question similarity calculation.

Description

Question similarity calculation method based on question topic and question focus
Technical field
The present invention relates to a question similarity calculation method, and in particular to a question similarity calculation method based on question topic and question focus.
Background art
With the rapid development of the Internet, the channels through which people obtain information and knowledge have become increasingly diverse, and question answering systems based on frequently asked question (FAQ) sets are one effective approach. Research on question similarity calculation is of great significance to FAQ-based question answering systems, and the accuracy of question similarity calculation plays an important role in their performance. How to improve the accuracy of question similarity calculation has therefore naturally become a focus of current research.
Existing question similarity calculation methods fall mainly into four categories: retrieval-model methods based on word statistics; methods based on semantic dictionaries; methods based on large-scale document collections; and methods based on edit distance.
TF-IDF methods calculate the similarity between questions from word frequency statistics and require no deep understanding of the sentences. Because questions are very short, the resulting feature vectors are sparse, so TF-IDF does not work well for question similarity calculation.
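For context only, a minimal sketch (not part of the patented method) of this kind of TF-IDF similarity using scikit-learn; the two sample questions are invented:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two invented short questions; with so few words the TF-IDF
# vectors are extremely sparse, which is the weakness noted above.
questions = ["why does my computer always crash",
             "what causes a computer to freeze"]
vectors = TfidfVectorizer().fit_transform(questions)
print(cosine_similarity(vectors[0], vectors[1])[0, 0])
```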
Methods based on a semantic dictionary split the question text into a sequence of words, calculate the similarity between words from the dictionary, and then derive the similarity between questions from the word similarities. For English, the commonly used semantic dictionary is WordNet; for Chinese, it is HowNet. Dictionary-based similarity calculation is simple to use and fast to compute, but it has two obvious shortcomings: a semantic dictionary cannot cover all words, and polysemous words make it hard to choose the right sense for word similarity calculation.
Statistical methods based on large-scale document collections are among the short-text similarity methods most studied recently. Latent semantic analysis (LSA), proposed by Deerwester et al., is a popular similarity calculation method based on text collections. Computing question similarity with LSA also has problems: the question a user enters may contain new words that are not in the semantic space, and because the constructed concept space is fixed, the dimension of the vector representing a question is also fixed, which can make the question vectors very sparse and hurt the precision of the similarity calculation.
Edit distance was originally a character-level measure that ignores semantics; it is widely applied in string similarity calculation, data cleaning, spell checking, and other fields, and it has also been applied to sentence similarity. For example, Leusch et al. used edit distance to compute sentence similarity for machine translation. Later work combined edit distance with a semantic dictionary: on top of the ordinary edit distance algorithm, words rather than single Chinese characters are used as the basic edit unit, semantic distance is used as the substitution cost between words, and different weights are assigned to the insertion, deletion, and substitution operations. This approach takes word order and semantics into account, is fairly simple to compute and implement, and achieves good results. However, all of these methods rely on statistical properties of the text and cannot capture its semantic similarity well.
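For illustration only, here is a minimal Python sketch of the word-level edit distance variant just described; it is not the prior-art algorithm itself, and word_sim is a hypothetical placeholder for the semantic-dictionary similarity lookup:
```python
def word_edit_distance(a, b, word_sim, w_ins=1.0, w_del=1.0, w_sub=2.0):
    """Edit distance over word sequences a and b.

    word_sim(x, y) in [0, 1] is a hypothetical semantic-dictionary
    similarity; the substitution cost falls to 0 for identical or
    synonymous words. w_ins/w_del/w_sub weight the three operations.
    """
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del
    for j in range(1, n + 1):
        d[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = w_sub * (1.0 - word_sim(a[i - 1], b[j - 1]))
            d[i][j] = min(d[i - 1][j] + w_del,    # delete a[i-1]
                          d[i][j - 1] + w_ins,    # insert b[j-1]
                          d[i - 1][j - 1] + sub)  # substitute
    return d[m][n]
```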
Summary of the invention
To overcome the shortcomings of current question similarity calculation methods and improve the accuracy of question retrieval, the present invention provides a question similarity calculation method based on question topic and question focus, used to calculate the similarity between a question posed by a user in a question answering system and the questions of a frequently-asked-question set; this is of great significance and value both to question answering and to the maintenance of FAQ sets.
The technical solution adopted by the present invention to solve its technical problem comprises the following steps:
1) Preprocess the FAQ data: segment the question set data into words with natural language processing tools, remove invalid words, and record the category of each question;
2) Divide each question into topic and focus structures: build a word space from the segmentation results and calculate the specificity score of each word in it; reorder the words of each question by their specificity scores to form the question's topic chain; then split the topic chains of the target question and the related questions with a tree cut model based on the minimum description length, obtaining the topic structure and focus structure of each question;
3) Calculate the joint similarity between questions from their topics and foci: for the topic structures of the target question and a related question, calculate a similarity with a language model; for the focus structures, calculate a similarity with a translation-based language model; finally obtain the joint similarity of question topic and focus as the weighted sum of these two similarities;
4) Calculate the question similarity: calculate the topic similarity between the target question and the related question with the BTM topic model, and obtain the final question similarity as the weighted sum of this topic similarity and the joint similarity calculated in step 3).
Said step 2) comprises:
2.1) Build a word space from the segmentation results of step 1), and calculate the specificity score of each word in the word space from the category statistics of the question data, using the following formulas:
S(w) = \frac{1}{-\sum_{c \in C} P(c|w) \log P(c|w) + \varepsilon}
P(c|w) = \frac{\mathrm{count}(c,w)}{\sum_{c' \in C} \mathrm{count}(c',w)}
where S(w) is the specificity score of word w, c is a question category, C is the set of all categories of the question data, P(c|w) is the probability that word w occurs in category c, count(c,w) is the number of times word w occurs in category c, and ε is a smoothing factor; S(w) is thus the reciprocal of the smoothed entropy of w's category distribution, so words concentrated in a few categories score high.
2.2) For each question, reorder the words of the question by the specificity scores of its segmented words to obtain the question's topic chain;
2.3) Combine the topic chain of the target question with the topic chains of its related questions into a question tree whose root node is empty; cut this tree with the tree cut model based on the minimum description length. For a given tree and cutting, the description length L(M,S) is calculated as:
L(M,S) = L(\Gamma) + L(\theta \mid \Gamma) + L(S \mid \Gamma, \theta)
M = (\Gamma, \theta)
\Gamma = (C_1, C_2, \ldots, C_k)
\theta = [P(C_1), P(C_2), \ldots, P(C_k)]
where Γ is the set of node classes of the tree after cutting, θ is the corresponding probability distribution vector, M is the tree cut model determined by Γ and θ, S is the sample set, k is the number of classes, and P(C_i) is the probability corresponding to class C_i;
Select the cutting that minimizes the tree description length and use the resulting tree cut model M to divide question topic from focus: cutting the question tree splits each branch in two, where the part near the root node forms the topic structure of the branch's question and the remainder forms its focus structure.
Said step 3) comprises:
3.1) For the topic structure parts of the target question T and a related question Q, calculate the topic structure similarity with a language model, using the following formulas:
P_{LM}(T_t \mid Q_t) = \prod_{w \in T_t} P_{LM}(w \mid Q_t)
P_{LM}(w \mid Q_t) = (1-\lambda)\,\frac{\#(w, Q_t)}{|Q_t|} + \lambda\,\frac{\#(w, C)}{|C|}
where T_t and Q_t are the topic structures of the target question T and the related question Q respectively, P_LM(T_t|Q_t) is the topic structure similarity of T and Q, P_LM(w|Q_t) is the probability that Q_t generates word w, #(w,Q_t) is the number of times word w occurs in Q_t, #(w,C) is the number of times word w occurs in the collection C, and λ is the Jelinek-Mercer smoothing factor;
3.2) For the focus structure parts of the target question T and the related question Q, calculate the focus structure similarity with a translation-based language model, using the following formulas:
P_{TRLM}(T_f \mid Q_f) = \prod_{w \in T_f} P_{TRLM}(w \mid Q_f)
P_{TRLM}(w \mid Q_f) = (1-\lambda)\left[\alpha \sum_{t \in Q_f} P(w \mid t)\,\frac{\#(t, Q_f)}{|Q_f|} + (1-\alpha)\,\frac{\#(w, Q_f)}{|Q_f|}\right] + \lambda\,\frac{\#(w, C)}{|C|}
where P(w|t) is the translation probability from word t to word w, P_TRLM(T_f|Q_f) is the focus structure similarity of T and Q, α is the weight of the translation probability part, P_TRLM(w|Q_f) is the probability that Q_f generates word w, T_f and Q_f are the focus structures of T and Q respectively, #(t,Q_f) is the number of times word t occurs in Q_f, and #(w,Q_f) is the number of times word w occurs in Q_f;
3.3) After calculating the topic and focus similarities of the target question T and the related question Q, calculate the joint similarity as a weighted sum, using the following formula:
Dis_{T\&F}(T,Q) = \tau\,P_{LM}(T_t \mid Q_t) + (1-\tau)\,P_{TRLM}(T_f \mid Q_f)
where Dis_{T&F}(T,Q) is the joint similarity of the target question T and the related question Q, and τ is a weighting coefficient.
Said step 4) comprises:
4.1) Train a BTM topic model on the question set data to obtain the topic space and the topic vector of each question, and calculate the topic similarity between two questions with the Euclidean distance;
4.2) Combine the topic similarity of the target question T and the related question Q from 4.1) with their joint similarity in a weighted sum, using the following formula, to obtain the final question similarity between T and Q:
Dis(T,Q) = \mu\,Dis_{T\&F}(T,Q) + (1-\mu)\,Dis_{Topic}(T,Q)
where μ is a weighting coefficient, μ = 0.9; Dis_{T&F}(T,Q) is the joint similarity between questions T and Q, and Dis_{Topic}(T,Q) is the topic similarity between questions T and Q.
The specific calculation of the topic similarity in said step 4.1) is as follows:
4.1.1) Compute the biterm set B from the question data and the dictionary. A biterm is an unordered pair of two different words that appear in the same text fragment after preprocessing; for question data, each question can be treated as an independent text fragment. Assign each word a random topic as initialization;
4.1.2) From the result of 4.1.1), calculate the biterm-topic distribution P(z|b) with the following formula:
P(z \mid b) = \frac{P(z)\,P(w_i \mid z)\,P(w_j \mid z)}{\sum_{z'} P(z')\,P(w_i \mid z')\,P(w_j \mid z')}
where z is a topic, b is a biterm, w_i and w_j are the two words of biterm b, P(z) is the probability of topic z, and P(w_i|z) is the probability of word w_i under topic z;
4.1.3) Calculate the question-biterm distribution P(b|d) with the following formula:
P(b \mid d) = \frac{n_d(b)}{\sum_{b'} n_d(b')}
where d is a question and n_d(b) is the number of times biterm b occurs in question d;
4.1.4) From the results of 4.1.2) and 4.1.3), calculate the question-topic distribution with the following formula:
P(z \mid d) = \sum_{b} P(z \mid b)\,P(b \mid d)
These four steps map a question from its word vector space into the topic vector space learned by the BTM topic model, giving the probability distribution of each question over the topics and hence the question's topic vector, whose dimension equals the number of topics in the topic space;
4.1.5) Finally, calculate the Euclidean distance between the topic vectors of the two questions and use this distance as the topic similarity between the two questions.
The natural language processing tools in said step 1) include FudanNLP, the HIT Language Technology Platform (LTP), jieba segmentation, and the like. These tools segment the FAQ data into words, remove invalid words, build the word vector space, and record the category of each question.
Compared with the prior art, the method of the invention has the following beneficial effects:
1. The method uses the structural features of the question data itself to divide each question into a topic part and a focus part, making fuller use of the question information and thereby making the question similarity results more accurate;
2. For question short texts, the method calculates the similarity between two questions with an approach based on the BTM topic model, which is designed for short text; by bringing question topic information beyond word statistics into the similarity calculation, it makes the results more accurate;
3. The method treats the question topic and focus parts with different models and, by incorporating word-to-word translation probabilities into the similarity calculation, takes the semantic similarity between questions into account, again making the results more accurate.
Brief description of the drawings
Fig. 1 is the overall flowchart of the present invention;
Fig. 2 is the flowchart of step 2);
Fig. 3 is the flowchart of step 3);
Fig. 4 is the flowchart of step 4);
Fig. 5 is a schematic diagram of the question tree formed in the embodiment;
Fig. 6 is a schematic diagram of the tree cut result in the embodiment;
Fig. 7 shows partial results (the P@k metric) of the embodiment.
Detailed description of the embodiments
As shown in Fig. 1, the method of the invention comprises the following steps:
1) Preprocess the FAQ data: segment the question set data into words with natural language processing tools, remove invalid words, and record the category of each question;
The natural language processing tools in step 1) include FudanNLP, the HIT Language Technology Platform (LTP), jieba segmentation, and the like. These tools segment the FAQ data into words, remove invalid words, build the word vector space, and record the category of each question.
2) Divide each question into topic and focus structures:
As shown in Fig. 2, build a word space from the segmentation results and calculate the specificity score of each word in it; reorder the words of each question by their specificity scores to form the question's topic chain; then split the topic chains of the target question and the related questions with the tree cut model based on the minimum description length, obtaining the topic structure and focus structure of each question.
2.1) Build a word space from the segmentation results of step 1), and calculate the specificity score of each word in the word space from the category statistics of the question data, using the following formulas:
S(w) = \frac{1}{-\sum_{c \in C} P(c|w) \log P(c|w) + \varepsilon}
P(c|w) = \frac{\mathrm{count}(c,w)}{\sum_{c' \in C} \mathrm{count}(c',w)}
where S(w) is the specificity score of word w, c is a question category, C is the set of all categories of the question data, P(c|w) is the probability that word w occurs in category c, count(c,w) is the number of times word w occurs in category c, and ε is a smoothing factor, ε = 0.001 in the specific implementation.
2.2) For each question, reorder the words of the question by the specificity scores of its segmented words to obtain the question's topic chain;
For example, the topic chain of a question q is the word sequence obtained by sorting its words by specificity score: w_1 → w_2 → … → w_i → … → w_n, where each word w_i is contained in question q, 1 ≤ i ≤ n, and S(w_h) > S(w_l) for 1 ≤ h < l ≤ n. Thus the words of a question with lower specificity scores better represent its focus, while the words with higher scores better represent its topic.
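A minimal sketch of steps 2.1) and 2.2) under stated assumptions (the question set is given as (category, word-list) pairs; all names are illustrative):
```python
import math
from collections import Counter, defaultdict

def specificity_scores(questions, eps=0.001):
    """S(w) = 1 / (entropy of w's category distribution + eps).

    questions: iterable of (category, list_of_words) pairs.
    Words concentrated in few categories have low entropy and
    therefore a high specificity score.
    """
    count = defaultdict(Counter)            # count[w][c] = count(c, w)
    for category, words in questions:
        for w in words:
            count[w][category] += 1
    scores = {}
    for w, by_cat in count.items():
        total = sum(by_cat.values())
        entropy = -sum((n / total) * math.log(n / total)
                       for n in by_cat.values())
        scores[w] = 1.0 / (entropy + eps)
    return scores

def topic_chain(words, scores):
    """Step 2.2): reorder a question's words by descending specificity."""
    return sorted(words, key=lambda w: scores.get(w, 0.0), reverse=True)
```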
2.3) Combine the topic chain of the target question with the topic chains of its related questions into a question tree whose root node is empty;
Cut this tree with the tree cut model based on the minimum description length: any cutting may be applied, and the description length of the cut tree can be calculated for each cutting. For a given tree and cutting, the description length L(M,S) is calculated as:
L(M,S) = L(\Gamma) + L(\theta \mid \Gamma) + L(S \mid \Gamma, \theta)
M = (\Gamma, \theta)
\Gamma = (C_1, C_2, \ldots, C_k)
\theta = [P(C_1), P(C_2), \ldots, P(C_k)]
where Γ is the set of node classes of the tree after cutting, θ is the corresponding probability distribution vector, M is the tree cut model determined by Γ and θ, S is the sample set, k is the number of classes, and P(C_i) is the probability corresponding to class C_i;
Select the cutting that minimizes the tree description length and use the resulting tree cut model M to divide question topic from focus: cutting the question tree splits each branch in two, where the part near the root node forms the topic structure of the branch's question and the remainder forms its focus structure.
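A much-simplified sketch of the MDL selection in step 2.3), restricted to splitting a single topic chain into a topic prefix and a focus suffix. The description-length approximations used here (parameter cost (k/2)·log₂ n, uniform word cost within a class) are standard MDL choices assumed for illustration, not taken from the patent text:
```python
import math

def description_length(classes, samples):
    """Approximate L(M,S) = L(Gamma) + L(theta|Gamma) + L(S|Gamma,theta).

    classes: list of word sets C_1..C_k induced by one cut.
    samples: list of observed words (the sample set S).
    """
    k, n = len(classes), len(samples)
    counts = [sum(1 for w in samples if w in c) for c in classes]
    total = sum(counts) or 1
    l_model = k + (k / 2.0) * math.log2(max(n, 2))   # L(Gamma) + L(theta|Gamma)
    l_data = 0.0                                     # L(S|Gamma,theta)
    for c, cnt in zip(classes, counts):
        if cnt:  # each word in C_i costs -log2(P(C_i) / |C_i|) bits
            l_data += cnt * -math.log2((cnt / total) / len(c))
    return l_model + l_data

def split_topic_focus(chain, samples):
    """Try every split of a topic chain into (topic prefix, focus suffix)
    and keep the cut with the shortest description length."""
    if len(chain) < 2:
        return chain, []
    best = None
    for i in range(1, len(chain)):
        score = description_length([set(chain[:i]), set(chain[i:])], samples)
        if best is None or score < best[0]:
            best = (score, chain[:i], chain[i:])
    return best[1], best[2]
```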
3) Calculate the joint similarity between questions from their topics and foci: as shown in Fig. 3, for the topic structures of the target question and the related question, calculate a similarity with a language model; for the focus structures, calculate a similarity with a translation-based language model; finally obtain the joint similarity of question topic and focus as the weighted sum of these two similarities.
3.1) For the topic structure parts of the target question T and a related question Q, calculate the topic structure similarity with a language model, using the following formulas:
P_{LM}(T_t \mid Q_t) = \prod_{w \in T_t} P_{LM}(w \mid Q_t)
P_{LM}(w \mid Q_t) = (1-\lambda)\,\frac{\#(w, Q_t)}{|Q_t|} + \lambda\,\frac{\#(w, C)}{|C|}
where T_t and Q_t are the topic structures of the target question T and the related question Q respectively, P_LM(T_t|Q_t) is the topic structure similarity of T and Q, P_LM(w|Q_t) is the probability that Q_t generates word w, #(w,Q_t) is the number of times word w occurs in Q_t, #(w,C) is the number of times word w occurs in the collection C, and λ is the Jelinek-Mercer smoothing factor, λ = 0.1 in the specific implementation;
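A direct sketch of the smoothed language-model score of step 3.1), assuming collection word counts are supplied as a Counter (names are illustrative):
```python
from collections import Counter

def p_lm(target_topic, related_topic, coll_counts, coll_size, lam=0.1):
    """P_LM(T_t|Q_t) with Jelinek-Mercer smoothing (lambda = 0.1)."""
    q = Counter(related_topic)
    q_len = max(len(related_topic), 1)
    score = 1.0
    for w in target_topic:
        # (1-lam) * #(w,Q_t)/|Q_t| + lam * #(w,C)/|C|
        score *= ((1 - lam) * q[w] / q_len
                  + lam * coll_counts.get(w, 0) / coll_size)
    return score
```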
3.2) For the focus structure parts of the target question T and the related question Q, calculate the focus structure similarity with a translation-based language model, using the following formulas:
P_{TRLM}(T_f \mid Q_f) = \prod_{w \in T_f} P_{TRLM}(w \mid Q_f)
P_{TRLM}(w \mid Q_f) = (1-\lambda)\left[\alpha \sum_{t \in Q_f} P(w \mid t)\,\frac{\#(t, Q_f)}{|Q_f|} + (1-\alpha)\,\frac{\#(w, Q_f)}{|Q_f|}\right] + \lambda\,\frac{\#(w, C)}{|C|}
where P(w|t) is the translation probability from word t to word w, P_TRLM(T_f|Q_f) is the focus structure similarity of T and Q, α is the weight of the translation probability part, P_TRLM(w|Q_f) is the probability that Q_f generates word w, T_f and Q_f are the focus structures of T and Q respectively, #(t,Q_f) is the number of times word t occurs in Q_f, #(w,Q_f) is the number of times word w occurs in Q_f, and #(w,C) is the number of times word w occurs in the collection C;
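A sketch of the translation-based score of step 3.2); trans_prob is assumed to be a dict mapping a (source word t, target word w) pair to P(w|t), as would be trained from a parallel corpus of similar question pairs, and the value of alpha is an assumption, since the patent does not fix it:
```python
from collections import Counter

def p_trlm(target_focus, related_focus, trans_prob, coll_counts, coll_size,
           lam=0.1, alpha=0.8):
    """P_TRLM(T_f|Q_f): mixes word translation, exact match, and
    collection smoothing. alpha weights the translation part
    (its value here is an assumption, not from the patent)."""
    q = Counter(related_focus)
    q_len = max(len(related_focus), 1)
    score = 1.0
    for w in target_focus:
        # translation term: sum_t P(w|t) * #(t,Q_f)/|Q_f|
        p_tr = sum(trans_prob.get((t, w), 0.0) * n / q_len
                   for t, n in q.items())
        score *= ((1 - lam) * (alpha * p_tr + (1 - alpha) * q[w] / q_len)
                  + lam * coll_counts.get(w, 0) / coll_size)
    return score
```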
3.3) After calculating the topic and focus similarities of the target question T and the related question Q, calculate the joint similarity as a weighted sum, using the following formula:
Dis_{T\&F}(T,Q) = \tau\,P_{LM}(T_t \mid Q_t) + (1-\tau)\,P_{TRLM}(T_f \mid Q_f)
where Dis_{T&F}(T,Q) is the joint similarity of the target question T and the related question Q, and τ is a weighting coefficient, τ = 0.4 in the specific implementation.
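The weighted combination of step 3.3) then reduces to one line, with τ = 0.4 as in the specific implementation:
```python
def joint_similarity(p_topic, p_focus, tau=0.4):
    """Dis_T&F(T,Q) = tau * P_LM(T_t|Q_t) + (1 - tau) * P_TRLM(T_f|Q_f)."""
    return tau * p_topic + (1 - tau) * p_focus
```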
4) Calculate the question similarity: as shown in Fig. 4, calculate the topic similarity between the target question and the related question with the BTM topic model, and obtain the final question similarity as the weighted sum of this topic similarity and the joint similarity calculated in step 3).
4.1) Train a BTM topic model on the question set data to obtain the topic space and the topic vector of each question, and calculate the topic similarity between two questions with the Euclidean distance;
4.1.1) Compute the biterm set B from the question data and the dictionary. A biterm is an unordered pair of two different words that appear in the same text fragment after preprocessing; for question data, each question can be treated as an independent text fragment. Assign each word a random topic as initialization; then estimate the BTM topic model parameters θ and φ by Gibbs sampling;
4.1.2) From the result of 4.1.1), calculate the biterm-topic distribution P(z|b) with the following formula:
P(z \mid b) = \frac{P(z)\,P(w_i \mid z)\,P(w_j \mid z)}{\sum_{z'} P(z')\,P(w_i \mid z')\,P(w_j \mid z')}
where z is a topic, b is a biterm, w_i and w_j are the two words of biterm b, P(z) is the probability of topic z, and P(w_i|z) is the probability of word w_i under topic z;
4.1.3) Calculate the question-biterm distribution P(b|d) with the following formula:
P(b \mid d) = \frac{n_d(b)}{\sum_{b'} n_d(b')}
where d is a question and n_d(b) is the number of times biterm b occurs in question d;
4.1.4) From the results of 4.1.2) and 4.1.3), calculate the question-topic distribution with the following formula:
P(z \mid d) = \sum_{b} P(z \mid b)\,P(b \mid d)
These four steps map a question from its word vector space into the topic vector space learned by the BTM topic model, giving the probability distribution of each question over the topics and hence the question's topic vector, whose dimension equals the number of topics in the topic space;
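A sketch of steps 4.1.2)-4.1.4), assuming a BTM already trained elsewhere and supplied as the topic prior p_z (a list of P(z)) and the per-topic word distributions p_w_z (a list of dicts giving P(w|z)); the Gibbs sampling training itself is omitted:
```python
from collections import Counter
from itertools import combinations

def topic_vector(words, p_z, p_w_z):
    """Map one question to its topic vector P(z|d) via its biterms."""
    k = len(p_z)
    # Unordered pairs of distinct words within the question.
    biterms = Counter(frozenset(pair) for pair in combinations(words, 2)
                      if pair[0] != pair[1])
    if not biterms:
        return [1.0 / k] * k          # degenerate one-word question
    n_total = sum(biterms.values())   # normalizer for P(b|d)
    p_z_d = [0.0] * k
    for b, n in biterms.items():
        wi, wj = tuple(b)
        # P(z|b) proportional to P(z) * P(w_i|z) * P(w_j|z)
        joint = [p_z[z] * p_w_z[z].get(wi, 1e-12) * p_w_z[z].get(wj, 1e-12)
                 for z in range(k)]
        norm = sum(joint)
        for z in range(k):            # P(z|d) = sum_b P(z|b) * P(b|d)
            p_z_d[z] += (joint[z] / norm) * (n / n_total)
    return p_z_d
```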
4.1.5) Finally, calculate the Euclidean distance between the topic vectors of the two questions and use this distance as the topic similarity between the two questions.
4.2) Combine the topic similarity of the target question T and the related question Q from 4.1) with their joint similarity in a weighted sum, using the following formula, to obtain the final question similarity between T and Q:
Dis(T,Q) = \mu\,Dis_{T\&F}(T,Q) + (1-\mu)\,Dis_{Topic}(T,Q)
where μ is a weighting coefficient, μ = 0.9; Dis_{T&F}(T,Q) is the joint similarity between questions T and Q, and Dis_{Topic}(T,Q) is the topic similarity between questions T and Q.
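Finally, steps 4.1.5) and 4.2) as a sketch; the patent uses the Euclidean distance itself as the topic "similarity", and the two terms are combined as given, with μ = 0.9:
```python
import math

def topic_similarity(v1, v2):
    """Step 4.1.5): Euclidean distance between two topic vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def question_similarity(dis_tf, dis_topic, mu=0.9):
    """Step 4.2): Dis(T,Q) = mu * Dis_T&F(T,Q) + (1 - mu) * Dis_Topic(T,Q)."""
    return mu * dis_tf + (1 - mu) * dis_topic
```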
The concrete steps of an example implementation are described in detail below in conjunction with the method of the present invention, as follows:
(1) The data set used in the example comes entirely from question-and-answer books in a digital library. A total of 610 Q&A books were extracted from the engineering, science, and education collections, containing 137,888 questions in total. The questions cover 25 broad categories, including agriculture, biology, chemical engineering, computers, electronics, machine manufacturing, aerospace, medicine, and automation. The preprocessing of step 1) yields a word space with a vocabulary of 54,074 words.
(2) With the information from (1), calculate the specificity score of each word in the word space, then reorder the words in each question's segmentation result by specificity score from high to low to form the question's topic chain. Take the question "The computer always crashes; what is the reason?" as an example: its segmentation result is "computer / always / crash / what / reason", and calculating the specificity score of each word and sorting gives the topic chain "computer -> crash -> always -> reason -> what". In this way the topic chain of the target question and the topic chains of the related questions are combined into a question tree. Taking the three questions "The computer always crashes; what is the reason?", "What is the working principle of a computer?", and "What are the main components of a computer?" as an example, the question tree they form is shown in Fig. 5.
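Using the topic_chain sketch from step 2.2) above, with hypothetical specificity scores chosen only to reproduce the chain in this example:
```python
# Hypothetical scores consistent with the chain shown above.
scores = {"computer": 5.1, "crash": 3.8, "always": 2.0,
          "reason": 1.2, "what": 0.3}
words = ["computer", "always", "crash", "what", "reason"]  # segmentation result
print(" -> ".join(topic_chain(words, scores)))
# computer -> crash -> always -> reason -> what
```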
(3) The tree cut method based on the minimum description length cuts the question tree obtained in (2); after cutting, each corresponding branch of the question tree is split in two, with the part near the root node representing the topic structure of the branch's question and the other part representing its focus structure. For the tree of Fig. 5, the final cutting result is shown in Fig. 6: above the separator line are the topic structures of the questions, below it their focus structures. For example, the topic structure of the question "The computer always crashes; what is the reason?" is (computer, crash) and its focus structure is (always, reason, what).
(4) Calculate the similarity of the question topic structure parts obtained in (3) with the language model. Use a public collection of similar question pairs as a parallel corpus and train a translation model on it to obtain the word-to-word translation probabilities used by the translation-based language model; then calculate the similarity of the question focus structure parts obtained in (3) with the translation-based language model. Finally, obtain the joint similarity as the weighted sum of the topic structure part similarity and the focus structure part similarity.
(5) Convert each question's word vector into a topic feature vector in the topic space with the BTM topic model, and calculate the topic similarity between two questions from these vectors with the Euclidean distance.
(6) Combine the joint similarity calculated in (4) and the topic similarity calculated in (5) in a weighted sum to obtain the final question similarity between the two questions, and return it.
Results of this example: the method used in the present invention is compared with traditional question similarity calculation methods based on the vector space model (VSM) and the language model (LM) under two evaluation metrics, P@k and NDCG@k. The P@k results are shown in Fig. 7; the NDCG@k results are shown in the following table:
Method        NDCG@1   NDCG@3   NDCG@5
VSM           79.2%    76.28%   70.8%
LM            80.3%    77.89%   71.45%
This method   82%      80.9%    77.86%
The comparison shows that the accuracy of this method in question similarity calculation is clearly higher than that of current question similarity calculation methods. This question similarity calculation method based on question topic and focus therefore has good practical value and application prospects.

Claims (6)

1. A question similarity calculation method based on question topic and question focus, characterized by comprising the following steps:
1) Preprocess the FAQ data: segment the question set data into words with natural language processing tools, remove invalid words, and record the category of each question;
2) Divide each question into topic and focus structures: build a word space from the segmentation results and calculate the specificity score of each word in it; reorder the words of each question by their specificity scores to form the question's topic chain; then split the topic chains of the target question and the related questions with a tree cut model based on the minimum description length, obtaining the topic structure and focus structure of each question;
3) Calculate the joint similarity between questions from their topics and foci: for the topic structures of the target question and a related question, calculate a similarity with a language model; for the focus structures, calculate a similarity with a translation-based language model; finally obtain the joint similarity of question topic and focus as the weighted sum of these two similarities;
4) Calculate the question similarity: calculate the topic similarity between the target question and the related question with the BTM topic model, and obtain the final question similarity as the weighted sum of this topic similarity and the joint similarity calculated in step 3).
2. The question similarity calculation method based on question topic and question focus according to claim 1, characterized in that said step 2) comprises:
2.1) Build a word space from the segmentation results of step 1), and calculate the specificity score of each word in the word space from the category statistics of the question data, using the following formulas:
S(w) = \frac{1}{-\sum_{c \in C} P(c|w) \log P(c|w) + \varepsilon}
P(c|w) = \frac{\mathrm{count}(c,w)}{\sum_{c' \in C} \mathrm{count}(c',w)}
where S(w) is the specificity score of word w, c is a question category, C is the set of all categories of the question data, P(c|w) is the probability that word w occurs in category c, count(c,w) is the number of times word w occurs in category c, and ε is a smoothing factor;
2.2) For each question, reorder the words of the question by the specificity scores of its segmented words to obtain the question's topic chain;
2.3) Combine the topic chain of the target question with the topic chains of its related questions into a question tree whose root node is empty; cut this tree with the tree cut model based on the minimum description length. For a given tree and cutting, the description length L(M,S) is calculated as:
L(M,S) = L(\Gamma) + L(\theta \mid \Gamma) + L(S \mid \Gamma, \theta)
M = (\Gamma, \theta)
\Gamma = (C_1, C_2, \ldots, C_k)
\theta = [P(C_1), P(C_2), \ldots, P(C_k)]
where Γ is the set of node classes of the tree after cutting, θ is the corresponding probability distribution vector, M is the tree cut model determined by Γ and θ, S is the sample set, k is the number of classes, and P(C_i) is the probability corresponding to class C_i;
Select the cutting that minimizes the tree description length and use the resulting tree cut model M to divide question topic from focus: cutting the question tree splits each branch in two, where the part near the root node forms the topic structure of the branch's question and the remainder forms its focus structure.
3. The question similarity calculation method based on question topic and question focus according to claim 1, characterized in that said step 3) comprises:
3.1) For the topic structure parts of the target question T and a related question Q, calculate the topic structure similarity with a language model, using the following formulas:
P_{LM}(T_t \mid Q_t) = \prod_{w \in T_t} P_{LM}(w \mid Q_t)
P_{LM}(w \mid Q_t) = (1-\lambda)\,\frac{\#(w, Q_t)}{|Q_t|} + \lambda\,\frac{\#(w, C)}{|C|}
where T_t and Q_t are the topic structures of the target question T and the related question Q respectively, P_LM(T_t|Q_t) is the topic structure similarity of T and Q, P_LM(w|Q_t) is the probability that Q_t generates word w, #(w,Q_t) is the number of times word w occurs in Q_t, #(w,C) is the number of times word w occurs in the collection C, and λ is the Jelinek-Mercer smoothing factor;
3.2) For the focus structure parts of the target question T and the related question Q, calculate the focus structure similarity with a translation-based language model, using the following formulas:
P_{TRLM}(T_f \mid Q_f) = \prod_{w \in T_f} P_{TRLM}(w \mid Q_f)
P_{TRLM}(w \mid Q_f) = (1-\lambda)\left[\alpha \sum_{t \in Q_f} P(w \mid t)\,\frac{\#(t, Q_f)}{|Q_f|} + (1-\alpha)\,\frac{\#(w, Q_f)}{|Q_f|}\right] + \lambda\,\frac{\#(w, C)}{|C|}
where P(w|t) is the translation probability from word t to word w, P_TRLM(T_f|Q_f) is the focus structure similarity of T and Q, α is the weight of the translation probability part, P_TRLM(w|Q_f) is the probability that Q_f generates word w, T_f and Q_f are the focus structures of T and Q respectively, #(t,Q_f) is the number of times word t occurs in Q_f, and #(w,Q_f) is the number of times word w occurs in Q_f;
3.3) After calculating the topic and focus similarities of the target question T and the related question Q, calculate the joint similarity as a weighted sum, using the following formula:
Dis_{T\&F}(T,Q) = \tau\,P_{LM}(T_t \mid Q_t) + (1-\tau)\,P_{TRLM}(T_f \mid Q_f)
where Dis_{T&F}(T,Q) is the joint similarity of the target question T and the related question Q, and τ is a weighting coefficient.
4. The question similarity calculation method based on question topic and question focus according to claim 1, characterized in that said step 4) comprises:
4.1) Train a BTM topic model on the question set data to obtain the topic space and the topic vector of each question, and calculate the topic similarity between two questions with the Euclidean distance;
4.2) Combine the topic similarity of the target question T and the related question Q from 4.1) with their joint similarity in a weighted sum, using the following formula, to obtain the final question similarity between T and Q:
Dis(T,Q) = \mu\,Dis_{T\&F}(T,Q) + (1-\mu)\,Dis_{Topic}(T,Q)
where μ is a weighting coefficient, μ = 0.9; Dis_{T&F}(T,Q) is the joint similarity between questions T and Q, and Dis_{Topic}(T,Q) is the topic similarity between questions T and Q.
5. The question similarity calculation method based on question topic and question focus according to claim 4, characterized in that the specific calculation of the topic similarity in said step 4.1) is as follows:
4.1.1) Compute the biterm set B from the question data and the dictionary. A biterm is an unordered pair of two different words that appear in the same text fragment after preprocessing; for question data, each question can be treated as an independent text fragment. Assign each word a random topic as initialization;
4.1.2) From the result of 4.1.1), calculate the biterm-topic distribution P(z|b) with the following formula:
P(z \mid b) = \frac{P(z)\,P(w_i \mid z)\,P(w_j \mid z)}{\sum_{z'} P(z')\,P(w_i \mid z')\,P(w_j \mid z')}
where z is a topic, b is a biterm, w_i and w_j are the two words of biterm b, P(z) is the probability of topic z, and P(w_i|z) is the probability of word w_i under topic z;
4.1.3) Calculate the question-biterm distribution P(b|d) with the following formula:
P(b \mid d) = \frac{n_d(b)}{\sum_{b'} n_d(b')}
where d is a question and n_d(b) is the number of times biterm b occurs in question d;
4.1.4) From the results of 4.1.2) and 4.1.3), calculate the question-topic distribution with the following formula:
P(z \mid d) = \sum_{b} P(z \mid b)\,P(b \mid d)
These four steps map a question from its word vector space into the topic vector space learned by the BTM topic model, giving the probability distribution of each question over the topics and hence the question's topic vector, whose dimension equals the number of topics in the topic space;
4.1.5) Finally, calculate the Euclidean distance between the topic vectors of the two questions and use this distance as the topic similarity between the two questions.
6. The question similarity calculation method based on question topic and question focus according to claim 1, characterized in that the natural language processing tools in said step 1) include FudanNLP, the HIT Language Technology Platform (LTP), jieba segmentation, and the like; these tools segment the FAQ data into words, remove invalid words, build the word vector space, and record the category of each question.
CN201510270876.2A 2015-03-11 2015-05-25 Problem similarity calculation method based on subjects and focuses of problems Pending CN104899188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510270876.2A CN104899188A (en) 2015-03-11 2015-05-25 Problem similarity calculation method based on subjects and focuses of problems

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN2015101063939 2015-03-11
CN201510106393 2015-03-11
CN201510270876.2A CN104899188A (en) 2015-03-11 2015-05-25 Problem similarity calculation method based on subjects and focuses of problems

Publications (1)

Publication Number Publication Date
CN104899188A true CN104899188A (en) 2015-09-09

Family

ID=54031857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510270876.2A Pending CN104899188A (en) 2015-03-11 2015-05-25 Problem similarity calculation method based on subjects and focuses of problems

Country Status (1)

Country Link
CN (1) CN104899188A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101566998A (en) * 2009-05-26 2009-10-28 华中师范大学 Chinese question-answering system based on neural network
CN101694659A (en) * 2009-10-20 2010-04-14 浙江大学 Individual network news recommending method based on multitheme tracing
CN103823848A (en) * 2014-02-11 2014-05-28 浙江大学 LDA (latent dirichlet allocation) and VSM (vector space model) based similar Chinese herb literature recommendation method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUIZHONG DUAN ET AL: "Searching Questions by Identifying Question Topic and Question Focus", 《PROCEEDINGS OF ACL-08:HLT》 *
LI CAI ET AL: "Learning the Latent Topics for Question Retrieval in Community QA", 《INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING》 *
XIAOHUI YAN ET AL: "A Biterm Topic Model for Short Texts", 《WWW '13: PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON WORLD WIDE WEB》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786794B (en) * 2016-02-05 2018-09-04 青岛理工大学 A kind of question and answer are to search method and community's question and answer searching system
CN105786794A (en) * 2016-02-05 2016-07-20 青岛理工大学 Question-answer pair search method and community question-answer search system
CN106202574A (en) * 2016-08-19 2016-12-07 清华大学 The appraisal procedure recommended towards microblog topic and device
CN106599196A (en) * 2016-12-14 2017-04-26 竹间智能科技(上海)有限公司 Artificial intelligence conversation method and system
CN107273913B (en) * 2017-05-11 2020-04-21 武汉理工大学 Short text similarity calculation method based on multi-feature fusion
CN107273913A (en) * 2017-05-11 2017-10-20 武汉理工大学 A kind of short text similarity calculating method based on multi-feature fusion
CN107729300A (en) * 2017-09-18 2018-02-23 百度在线网络技术(北京)有限公司 Processing method, device, equipment and the computer-readable storage medium of text similarity
CN107729300B (en) * 2017-09-18 2021-12-24 百度在线网络技术(北京)有限公司 Text similarity processing method, device and equipment and computer storage medium
CN108536852A (en) * 2018-04-16 2018-09-14 上海智臻智能网络科技股份有限公司 Question and answer exchange method and device, computer equipment and computer readable storage medium
CN108595619A (en) * 2018-04-23 2018-09-28 海信集团有限公司 A kind of answering method and equipment
CN108874772A (en) * 2018-05-25 2018-11-23 太原理工大学 A kind of polysemant term vector disambiguation method
CN110895656A (en) * 2018-09-13 2020-03-20 武汉斗鱼网络科技有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN110895656B (en) * 2018-09-13 2023-12-29 北京橙果转话科技有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN109522479A (en) * 2018-11-09 2019-03-26 广东美的制冷设备有限公司 Search processing method and device
CN111191034A (en) * 2019-12-30 2020-05-22 科大讯飞股份有限公司 Human-computer interaction method, related device and readable storage medium
CN111191034B (en) * 2019-12-30 2023-01-17 科大讯飞股份有限公司 Human-computer interaction method, related device and readable storage medium
CN113821639A (en) * 2021-09-18 2021-12-21 支付宝(杭州)信息技术有限公司 Text focus analysis method and system

Similar Documents

Publication Publication Date Title
CN104899188A (en) Problem similarity calculation method based on subjects and focuses of problems
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN103207905B (en) A kind of method of calculating text similarity of based target text
Tulkens et al. Evaluating unsupervised Dutch word embeddings as a linguistic resource
CN102622338B (en) Computer-assisted computing method of semantic distance between short texts
CN108052593A (en) A kind of subject key words extracting method based on descriptor vector sum network structure
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN107305539A (en) A kind of text tendency analysis method based on Word2Vec network sentiment new word discoveries
CN107992542A (en) A kind of similar article based on topic model recommends method
CN105843897A (en) Vertical domain-oriented intelligent question and answer system
CN106202032A (en) A kind of sentiment analysis method towards microblogging short text and system thereof
CN111143672B (en) Knowledge graph-based professional speciality scholars recommendation method
CN106372061A (en) Short text similarity calculation method based on semantics
CN103049569A (en) Text similarity matching method on basis of vector space model
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN103049470A (en) Opinion retrieval method based on emotional relevancy
CN104484380A (en) Personalized search method and personalized search device
CN104008187B (en) Semi-structured text matching method based on the minimum edit distance
Bilgin et al. Sentiment analysis with term weighting and word vectors
CN104699797A (en) Webpage data structured analytic method and device
CN110134934A (en) Text emotion analysis method and device
CN110705247A (en) Based on x2-C text similarity calculation method
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
CN104881399A (en) Event identification method and system based on probability soft logic PSL
Hindocha et al. Short-text Semantic Similarity using GloVe word embedding

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150909