CN104899188A - Problem similarity calculation method based on subjects and focuses of problems - Google Patents

Problem similarity calculation method based on subjects and focuses of problems Download PDF

Info

Publication number
CN104899188A
CN104899188A (application CN201510270876.2A)
Authority
CN
China
Prior art keywords
word
similarity
theme
focus
relevant issues
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510270876.2A
Other languages
Chinese (zh)
Inventor
鲁伟明
余瑶
吴江琴
庄越挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201510270876.2A priority Critical patent/CN104899188A/en
Publication of CN104899188A publication Critical patent/CN104899188A/en
Pending legal-status Critical Current

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a question similarity calculation method based on the topics and focuses of questions. Basic preprocessing, such as word segmentation, is performed on the question data with a tokenizer; on that basis, a tree cut model based on the minimum description length divides each question into a question topic and a question focus. For the topic structures and focus structures of two questions, a language model and a translation-based language model are used respectively to calculate similarity scores, and a joint similarity is obtained by weighted summation. A topic similarity between the two questions is then calculated with a method based on the BTM topic model, and the two similarities are finally combined by weighted summation to obtain the final question similarity. The present invention introduces the structural features and topic information of questions into question similarity calculation, makes fuller use of the information in the questions, and, by bringing question topic information beyond word statistics into the calculation, improves the accuracy of question similarity calculation.

Description

Question similarity calculation method based on question topic and question focus
Technical field
The present invention relates to a question similarity calculation method, and in particular to a question similarity calculation method based on question topic and question focus.
Background art
With the rapid development of the Internet, the channels through which people obtain information and knowledge have become increasingly diverse, and question answering systems based on frequently asked question (FAQ) sets are one effective approach. Research on question similarity calculation is of great significance to FAQ-based question answering systems, and the accuracy of question similarity calculation plays an important role in their performance. How to improve the accuracy of question similarity calculation has therefore naturally become a focus of current research.
Existing question similarity calculation methods fall mainly into four categories: retrieval-model methods based on word statistics; methods based on semantic dictionaries; methods based on large-scale document collections; and methods based on edit distance.
TF-IDF methods calculate the similarity between questions from word frequency statistics and require no deep understanding of the sentences. Because questions are very short, the resulting feature vectors are sparse, so TF-IDF does not work well for question similarity calculation.
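For context only, a minimal sketch (not part of the patented method) of this kind of TF-IDF similarity using scikit-learn; the two sample questions are invented:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two invented short questions; with so few words the TF-IDF
# vectors are extremely sparse, which is the weakness noted above.
questions = ["why does my computer always crash",
             "what causes a computer to freeze"]
vectors = TfidfVectorizer().fit_transform(questions)
print(cosine_similarity(vectors[0], vectors[1])[0, 0])
```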
Methods based on a semantic dictionary split the question text into a sequence of words, calculate the similarity between words from the dictionary, and then derive the similarity between questions from the word similarities. For English, the commonly used semantic dictionary is WordNet; for Chinese, it is HowNet. Dictionary-based similarity calculation is simple to use and fast to compute, but it has two obvious shortcomings: a semantic dictionary cannot cover all words, and polysemous words make it hard to choose the right sense for word similarity calculation.
Statistical methods based on large-scale document collections are among the short-text similarity methods most studied recently. Latent semantic analysis (LSA), proposed by Deerwester et al., is a popular similarity calculation method based on text collections. Computing question similarity with LSA also has problems: the question a user enters may contain new words that are not in the semantic space, and because the constructed concept space is fixed, the dimension of the vector representing a question is also fixed, which can make the question vectors very sparse and hurt the precision of the similarity calculation.
Edit distance was originally a character-level measure that ignores semantics; it is widely applied in string similarity calculation, data cleaning, spell checking, and other fields, and it has also been applied to sentence similarity. For example, Leusch et al. used edit distance to compute sentence similarity for machine translation. Later work combined edit distance with a semantic dictionary: on top of the ordinary edit distance algorithm, words rather than single Chinese characters are used as the basic edit unit, semantic distance is used as the substitution cost between words, and different weights are assigned to the insertion, deletion, and substitution operations. This approach takes word order and semantics into account, is fairly simple to compute and implement, and achieves good results. However, all of these methods rely on statistical properties of the text and cannot capture its semantic similarity well.
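For illustration only, here is a minimal Python sketch of the word-level edit distance variant just described; it is not the prior-art algorithm itself, and word_sim is a hypothetical placeholder for the semantic-dictionary similarity lookup:
```python
def word_edit_distance(a, b, word_sim, w_ins=1.0, w_del=1.0, w_sub=2.0):
    """Edit distance over word sequences a and b.

    word_sim(x, y) in [0, 1] is a hypothetical semantic-dictionary
    similarity; the substitution cost falls to 0 for identical or
    synonymous words. w_ins/w_del/w_sub weight the three operations.
    """
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * w_del
    for j in range(1, n + 1):
        d[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = w_sub * (1.0 - word_sim(a[i - 1], b[j - 1]))
            d[i][j] = min(d[i - 1][j] + w_del,    # delete a[i-1]
                          d[i][j - 1] + w_ins,    # insert b[j-1]
                          d[i - 1][j - 1] + sub)  # substitute
    return d[m][n]
```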
Summary of the invention
To overcome the shortcomings of current question similarity calculation methods and improve the accuracy of question retrieval, the present invention provides a question similarity calculation method based on question topic and question focus, used to calculate the similarity between a question posed by a user in a question answering system and the questions of a frequently-asked-question set; this is of great significance and value both to question answering and to the maintenance of FAQ sets.
The technical solution adopted by the present invention to solve its technical problem comprises the following steps:
1) Preprocess the FAQ data: segment the question set data into words with natural language processing tools, remove invalid words, and record the category of each question;
2) Divide each question into topic and focus structures: build a word space from the segmentation results and calculate the specificity score of each word in it; reorder the words of each question by their specificity scores to form the question's topic chain; then split the topic chains of the target question and the related questions with a tree cut model based on the minimum description length, obtaining the topic structure and focus structure of each question;
3) Calculate the joint similarity between questions from their topics and foci: for the topic structures of the target question and a related question, calculate a similarity with a language model; for the focus structures, calculate a similarity with a translation-based language model; finally obtain the joint similarity of question topic and focus as the weighted sum of these two similarities;
4) Calculate the question similarity: calculate the topic similarity between the target question and the related question with the BTM topic model, and obtain the final question similarity as the weighted sum of this topic similarity and the joint similarity calculated in step 3).
Said step 2) comprises:
2.1) Build a word space from the segmentation results of step 1), and calculate the specificity score of each word in the word space from the category statistics of the question data, using the following formulas:
S(w) = \frac{1}{-\sum_{c \in C} P(c|w) \log P(c|w) + \varepsilon}
P(c|w) = \frac{\mathrm{count}(c,w)}{\sum_{c' \in C} \mathrm{count}(c',w)}
where S(w) is the specificity score of word w, c is a question category, C is the set of all categories of the question data, P(c|w) is the probability that word w occurs in category c, count(c,w) is the number of times word w occurs in category c, and ε is a smoothing factor; S(w) is thus the reciprocal of the smoothed entropy of w's category distribution, so words concentrated in a few categories score high.
2.2) For each question, reorder the words of the question by the specificity scores of its segmented words to obtain the question's topic chain;
2.3) Combine the topic chain of the target question with the topic chains of its related questions into a question tree whose root node is empty; cut this tree with the tree cut model based on the minimum description length. For a given tree and cutting, the description length L(M,S) is calculated as:
L(M,S) = L(\Gamma) + L(\theta \mid \Gamma) + L(S \mid \Gamma, \theta)
M = (\Gamma, \theta)
\Gamma = (C_1, C_2, \ldots, C_k)
\theta = [P(C_1), P(C_2), \ldots, P(C_k)]
where Γ is the set of node classes of the tree after cutting, θ is the corresponding probability distribution vector, M is the tree cut model determined by Γ and θ, S is the sample set, k is the number of classes, and P(C_i) is the probability corresponding to class C_i;
Select the cutting that minimizes the tree description length and use the resulting tree cut model M to divide question topic from focus: cutting the question tree splits each branch in two, where the part near the root node forms the topic structure of the branch's question and the remainder forms its focus structure.
Said step 3) comprises:
3.1) For the topic structure parts of the target question T and a related question Q, calculate the topic structure similarity with a language model, using the following formulas:
P_{LM}(T_t \mid Q_t) = \prod_{w \in T_t} P_{LM}(w \mid Q_t)
P_{LM}(w \mid Q_t) = (1-\lambda)\,\frac{\#(w, Q_t)}{|Q_t|} + \lambda\,\frac{\#(w, C)}{|C|}
where T_t and Q_t are the topic structures of the target question T and the related question Q respectively, P_LM(T_t|Q_t) is the topic structure similarity of T and Q, P_LM(w|Q_t) is the probability that Q_t generates word w, #(w,Q_t) is the number of times word w occurs in Q_t, #(w,C) is the number of times word w occurs in the collection C, and λ is the Jelinek-Mercer smoothing factor;
3.2) For the focus structure parts of the target question T and the related question Q, calculate the focus structure similarity with a translation-based language model, using the following formulas:
P_{TRLM}(T_f \mid Q_f) = \prod_{w \in T_f} P_{TRLM}(w \mid Q_f)
P_{TRLM}(w \mid Q_f) = (1-\lambda)\left[\alpha \sum_{t \in Q_f} P(w \mid t)\,\frac{\#(t, Q_f)}{|Q_f|} + (1-\alpha)\,\frac{\#(w, Q_f)}{|Q_f|}\right] + \lambda\,\frac{\#(w, C)}{|C|}
where P(w|t) is the translation probability from word t to word w, P_TRLM(T_f|Q_f) is the focus structure similarity of T and Q, α is the weight of the translation probability part, P_TRLM(w|Q_f) is the probability that Q_f generates word w, T_f and Q_f are the focus structures of T and Q respectively, #(t,Q_f) is the number of times word t occurs in Q_f, and #(w,Q_f) is the number of times word w occurs in Q_f;
3.3) After calculating the topic and focus similarities of the target question T and the related question Q, calculate the joint similarity as a weighted sum, using the following formula:
Dis_{T\&F}(T,Q) = \tau\,P_{LM}(T_t \mid Q_t) + (1-\tau)\,P_{TRLM}(T_f \mid Q_f)
where Dis_{T&F}(T,Q) is the joint similarity of the target question T and the related question Q, and τ is a weighting coefficient.
Said step 4) comprises:
4.1) Train a BTM topic model on the question set data to obtain the topic space and the topic vector of each question, and calculate the topic similarity between two questions with the Euclidean distance;
4.2) Combine the topic similarity of the target question T and the related question Q from 4.1) with their joint similarity in a weighted sum, using the following formula, to obtain the final question similarity between T and Q:
Dis(T,Q) = \mu\,Dis_{T\&F}(T,Q) + (1-\mu)\,Dis_{Topic}(T,Q)
where μ is a weighting coefficient, μ = 0.9; Dis_{T&F}(T,Q) is the joint similarity between questions T and Q, and Dis_{Topic}(T,Q) is the topic similarity between questions T and Q.
The specific calculation of the topic similarity in said step 4.1) is as follows:
4.1.1) Compute the biterm set B from the question data and the dictionary. A biterm is an unordered pair of two different words that appear in the same text fragment after preprocessing; for question data, each question can be treated as an independent text fragment. Assign each word a random topic as initialization;
4.1.2) From the result of 4.1.1), calculate the biterm-topic distribution P(z|b) with the following formula:
P(z \mid b) = \frac{P(z)\,P(w_i \mid z)\,P(w_j \mid z)}{\sum_{z'} P(z')\,P(w_i \mid z')\,P(w_j \mid z')}
where z is a topic, b is a biterm, w_i and w_j are the two words of biterm b, P(z) is the probability of topic z, and P(w_i|z) is the probability of word w_i under topic z;
4.1.3) Calculate the question-biterm distribution P(b|d) with the following formula:
P(b \mid d) = \frac{n_d(b)}{\sum_{b'} n_d(b')}
where d is a question and n_d(b) is the number of times biterm b occurs in question d;
4.1.4) From the results of 4.1.2) and 4.1.3), calculate the question-topic distribution with the following formula:
P(z \mid d) = \sum_{b} P(z \mid b)\,P(b \mid d)
These four steps map a question from its word vector space into the topic vector space learned by the BTM topic model, giving the probability distribution of each question over the topics and hence the question's topic vector, whose dimension equals the number of topics in the topic space;
4.1.5) Finally, calculate the Euclidean distance between the topic vectors of the two questions and use this distance as the topic similarity between the two questions.
The natural language processing tools in said step 1) include FudanNLP, the HIT Language Technology Platform (LTP), jieba segmentation, and the like. These tools segment the FAQ data into words, remove invalid words, build the word vector space, and record the category of each question.
Compared with the prior art, the method of the invention has the following beneficial effects:
1. The method uses the structural features of the question data itself to divide each question into a topic part and a focus part, making fuller use of the question information and thereby making the question similarity results more accurate;
2. For question short texts, the method calculates the similarity between two questions with an approach based on the BTM topic model, which is designed for short text; by bringing question topic information beyond word statistics into the similarity calculation, it makes the results more accurate;
3. The method treats the question topic and focus parts with different models and, by incorporating word-to-word translation probabilities into the similarity calculation, takes the semantic similarity between questions into account, again making the results more accurate.
Brief description of the drawings
Fig. 1 is the overall flowchart of the present invention;
Fig. 2 is the flowchart of step 2);
Fig. 3 is the flowchart of step 3);
Fig. 4 is the flowchart of step 4);
Fig. 5 is a schematic diagram of the question tree formed in the embodiment;
Fig. 6 is a schematic diagram of the tree cut result in the embodiment;
Fig. 7 shows partial results (the P@k metric) of the embodiment.
Detailed description of the embodiments
As shown in Fig. 1, the method of the invention comprises the following steps:
1) Preprocess the FAQ data: segment the question set data into words with natural language processing tools, remove invalid words, and record the category of each question;
The natural language processing tools in step 1) include FudanNLP, the HIT Language Technology Platform (LTP), jieba segmentation, and the like. These tools segment the FAQ data into words, remove invalid words, build the word vector space, and record the category of each question.
2) Divide each question into topic and focus structures:
As shown in Fig. 2, build a word space from the segmentation results and calculate the specificity score of each word in it; reorder the words of each question by their specificity scores to form the question's topic chain; then split the topic chains of the target question and the related questions with the tree cut model based on the minimum description length, obtaining the topic structure and focus structure of each question.
2.1) Build a word space from the segmentation results of step 1), and calculate the specificity score of each word in the word space from the category statistics of the question data, using the following formulas:
S(w) = \frac{1}{-\sum_{c \in C} P(c|w) \log P(c|w) + \varepsilon}
P(c|w) = \frac{\mathrm{count}(c,w)}{\sum_{c' \in C} \mathrm{count}(c',w)}
where S(w) is the specificity score of word w, c is a question category, C is the set of all categories of the question data, P(c|w) is the probability that word w occurs in category c, count(c,w) is the number of times word w occurs in category c, and ε is a smoothing factor, ε = 0.001 in the specific implementation.
2.2) For each question, reorder the words of the question by the specificity scores of its segmented words to obtain the question's topic chain;
For example, the topic chain of a question q is the word sequence obtained by sorting its words by specificity score: w_1 → w_2 → … → w_i → … → w_n, where each word w_i is contained in question q, 1 ≤ i ≤ n, and S(w_h) > S(w_l) for 1 ≤ h < l ≤ n. Thus the words of a question with lower specificity scores better represent its focus, while the words with higher scores better represent its topic.
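A minimal sketch of steps 2.1) and 2.2) under stated assumptions (the question set is given as (category, word-list) pairs; all names are illustrative):
```python
import math
from collections import Counter, defaultdict

def specificity_scores(questions, eps=0.001):
    """S(w) = 1 / (entropy of w's category distribution + eps).

    questions: iterable of (category, list_of_words) pairs.
    Words concentrated in few categories have low entropy and
    therefore a high specificity score.
    """
    count = defaultdict(Counter)            # count[w][c] = count(c, w)
    for category, words in questions:
        for w in words:
            count[w][category] += 1
    scores = {}
    for w, by_cat in count.items():
        total = sum(by_cat.values())
        entropy = -sum((n / total) * math.log(n / total)
                       for n in by_cat.values())
        scores[w] = 1.0 / (entropy + eps)
    return scores

def topic_chain(words, scores):
    """Step 2.2): reorder a question's words by descending specificity."""
    return sorted(words, key=lambda w: scores.get(w, 0.0), reverse=True)
```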
2.3) Combine the topic chain of the target question with the topic chains of its related questions into a question tree whose root node is empty;
Cut this tree with the tree cut model based on the minimum description length: any cutting may be applied, and the description length of the cut tree can be calculated for each cutting. For a given tree and cutting, the description length L(M,S) is calculated as:
L(M,S) = L(\Gamma) + L(\theta \mid \Gamma) + L(S \mid \Gamma, \theta)
M = (\Gamma, \theta)
\Gamma = (C_1, C_2, \ldots, C_k)
\theta = [P(C_1), P(C_2), \ldots, P(C_k)]
where Γ is the set of node classes of the tree after cutting, θ is the corresponding probability distribution vector, M is the tree cut model determined by Γ and θ, S is the sample set, k is the number of classes, and P(C_i) is the probability corresponding to class C_i;
Select the cutting that minimizes the tree description length and use the resulting tree cut model M to divide question topic from focus: cutting the question tree splits each branch in two, where the part near the root node forms the topic structure of the branch's question and the remainder forms its focus structure.
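A much-simplified sketch of the MDL selection in step 2.3), restricted to splitting a single topic chain into a topic prefix and a focus suffix. The description-length approximations used here (parameter cost (k/2)·log₂ n, uniform word cost within a class) are standard MDL choices assumed for illustration, not taken from the patent text:
```python
import math

def description_length(classes, samples):
    """Approximate L(M,S) = L(Gamma) + L(theta|Gamma) + L(S|Gamma,theta).

    classes: list of word sets C_1..C_k induced by one cut.
    samples: list of observed words (the sample set S).
    """
    k, n = len(classes), len(samples)
    counts = [sum(1 for w in samples if w in c) for c in classes]
    total = sum(counts) or 1
    l_model = k + (k / 2.0) * math.log2(max(n, 2))   # L(Gamma) + L(theta|Gamma)
    l_data = 0.0                                     # L(S|Gamma,theta)
    for c, cnt in zip(classes, counts):
        if cnt:  # each word in C_i costs -log2(P(C_i) / |C_i|) bits
            l_data += cnt * -math.log2((cnt / total) / len(c))
    return l_model + l_data

def split_topic_focus(chain, samples):
    """Try every split of a topic chain into (topic prefix, focus suffix)
    and keep the cut with the shortest description length."""
    if len(chain) < 2:
        return chain, []
    best = None
    for i in range(1, len(chain)):
        score = description_length([set(chain[:i]), set(chain[i:])], samples)
        if best is None or score < best[0]:
            best = (score, chain[:i], chain[i:])
    return best[1], best[2]
```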
3) Calculate the joint similarity between questions from their topics and foci: as shown in Fig. 3, for the topic structures of the target question and the related question, calculate a similarity with a language model; for the focus structures, calculate a similarity with a translation-based language model; finally obtain the joint similarity of question topic and focus as the weighted sum of these two similarities.
3.1) For the topic structure parts of the target question T and a related question Q, calculate the topic structure similarity with a language model, using the following formulas:
P_{LM}(T_t \mid Q_t) = \prod_{w \in T_t} P_{LM}(w \mid Q_t)
P_{LM}(w \mid Q_t) = (1-\lambda)\,\frac{\#(w, Q_t)}{|Q_t|} + \lambda\,\frac{\#(w, C)}{|C|}
where T_t and Q_t are the topic structures of the target question T and the related question Q respectively, P_LM(T_t|Q_t) is the topic structure similarity of T and Q, P_LM(w|Q_t) is the probability that Q_t generates word w, #(w,Q_t) is the number of times word w occurs in Q_t, #(w,C) is the number of times word w occurs in the collection C, and λ is the Jelinek-Mercer smoothing factor, λ = 0.1 in the specific implementation;
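A direct sketch of the smoothed language-model score of step 3.1), assuming collection word counts are supplied as a Counter (names are illustrative):
```python
from collections import Counter

def p_lm(target_topic, related_topic, coll_counts, coll_size, lam=0.1):
    """P_LM(T_t|Q_t) with Jelinek-Mercer smoothing (lambda = 0.1)."""
    q = Counter(related_topic)
    q_len = max(len(related_topic), 1)
    score = 1.0
    for w in target_topic:
        # (1-lam) * #(w,Q_t)/|Q_t| + lam * #(w,C)/|C|
        score *= ((1 - lam) * q[w] / q_len
                  + lam * coll_counts.get(w, 0) / coll_size)
    return score
```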
3.2) For the focus structure parts of the target question T and the related question Q, calculate the focus structure similarity with a translation-based language model, using the following formulas:
P_{TRLM}(T_f \mid Q_f) = \prod_{w \in T_f} P_{TRLM}(w \mid Q_f)
P_{TRLM}(w \mid Q_f) = (1-\lambda)\left[\alpha \sum_{t \in Q_f} P(w \mid t)\,\frac{\#(t, Q_f)}{|Q_f|} + (1-\alpha)\,\frac{\#(w, Q_f)}{|Q_f|}\right] + \lambda\,\frac{\#(w, C)}{|C|}
where P(w|t) is the translation probability from word t to word w, P_TRLM(T_f|Q_f) is the focus structure similarity of T and Q, α is the weight of the translation probability part, P_TRLM(w|Q_f) is the probability that Q_f generates word w, T_f and Q_f are the focus structures of T and Q respectively, #(t,Q_f) is the number of times word t occurs in Q_f, #(w,Q_f) is the number of times word w occurs in Q_f, and #(w,C) is the number of times word w occurs in the collection C;
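A sketch of the translation-based score of step 3.2); trans_prob is assumed to be a dict mapping a (source word t, target word w) pair to P(w|t), as would be trained from a parallel corpus of similar question pairs, and the value of alpha is an assumption, since the patent does not fix it:
```python
from collections import Counter

def p_trlm(target_focus, related_focus, trans_prob, coll_counts, coll_size,
           lam=0.1, alpha=0.8):
    """P_TRLM(T_f|Q_f): mixes word translation, exact match, and
    collection smoothing. alpha weights the translation part
    (its value here is an assumption, not from the patent)."""
    q = Counter(related_focus)
    q_len = max(len(related_focus), 1)
    score = 1.0
    for w in target_focus:
        # translation term: sum_t P(w|t) * #(t,Q_f)/|Q_f|
        p_tr = sum(trans_prob.get((t, w), 0.0) * n / q_len
                   for t, n in q.items())
        score *= ((1 - lam) * (alpha * p_tr + (1 - alpha) * q[w] / q_len)
                  + lam * coll_counts.get(w, 0) / coll_size)
    return score
```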
3.3) After calculating the topic and focus similarities of the target question T and the related question Q, calculate the joint similarity as a weighted sum, using the following formula:
Dis_{T\&F}(T,Q) = \tau\,P_{LM}(T_t \mid Q_t) + (1-\tau)\,P_{TRLM}(T_f \mid Q_f)
where Dis_{T&F}(T,Q) is the joint similarity of the target question T and the related question Q, and τ is a weighting coefficient, τ = 0.4 in the specific implementation.
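The weighted combination of step 3.3) then reduces to one line, with τ = 0.4 as in the specific implementation:
```python
def joint_similarity(p_topic, p_focus, tau=0.4):
    """Dis_T&F(T,Q) = tau * P_LM(T_t|Q_t) + (1 - tau) * P_TRLM(T_f|Q_f)."""
    return tau * p_topic + (1 - tau) * p_focus
```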
4) Calculate the question similarity: as shown in Fig. 4, calculate the topic similarity between the target question and the related question with the BTM topic model, and obtain the final question similarity as the weighted sum of this topic similarity and the joint similarity calculated in step 3).
4.1) Train a BTM topic model on the question set data to obtain the topic space and the topic vector of each question, and calculate the topic similarity between two questions with the Euclidean distance;
4.1.1) Compute the biterm set B from the question data and the dictionary. A biterm is an unordered pair of two different words that appear in the same text fragment after preprocessing; for question data, each question can be treated as an independent text fragment. Assign each word a random topic as initialization; then estimate the BTM topic model parameters θ and φ by Gibbs sampling;
4.1.2) From the result of 4.1.1), calculate the biterm-topic distribution P(z|b) with the following formula:
P(z \mid b) = \frac{P(z)\,P(w_i \mid z)\,P(w_j \mid z)}{\sum_{z'} P(z')\,P(w_i \mid z')\,P(w_j \mid z')}
where z is a topic, b is a biterm, w_i and w_j are the two words of biterm b, P(z) is the probability of topic z, and P(w_i|z) is the probability of word w_i under topic z;
4.1.3) Calculate the question-biterm distribution P(b|d) with the following formula:
P(b \mid d) = \frac{n_d(b)}{\sum_{b'} n_d(b')}
where d is a question and n_d(b) is the number of times biterm b occurs in question d;
4.1.4) From the results of 4.1.2) and 4.1.3), calculate the question-topic distribution with the following formula:
P(z \mid d) = \sum_{b} P(z \mid b)\,P(b \mid d)
These four steps map a question from its word vector space into the topic vector space learned by the BTM topic model, giving the probability distribution of each question over the topics and hence the question's topic vector, whose dimension equals the number of topics in the topic space;
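A sketch of steps 4.1.2)-4.1.4), assuming a BTM already trained elsewhere and supplied as the topic prior p_z (a list of P(z)) and the per-topic word distributions p_w_z (a list of dicts giving P(w|z)); the Gibbs sampling training itself is omitted:
```python
from collections import Counter
from itertools import combinations

def topic_vector(words, p_z, p_w_z):
    """Map one question to its topic vector P(z|d) via its biterms."""
    k = len(p_z)
    # Unordered pairs of distinct words within the question.
    biterms = Counter(frozenset(pair) for pair in combinations(words, 2)
                      if pair[0] != pair[1])
    if not biterms:
        return [1.0 / k] * k          # degenerate one-word question
    n_total = sum(biterms.values())   # normalizer for P(b|d)
    p_z_d = [0.0] * k
    for b, n in biterms.items():
        wi, wj = tuple(b)
        # P(z|b) proportional to P(z) * P(w_i|z) * P(w_j|z)
        joint = [p_z[z] * p_w_z[z].get(wi, 1e-12) * p_w_z[z].get(wj, 1e-12)
                 for z in range(k)]
        norm = sum(joint)
        for z in range(k):            # P(z|d) = sum_b P(z|b) * P(b|d)
            p_z_d[z] += (joint[z] / norm) * (n / n_total)
    return p_z_d
```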
4.1.5) Finally, calculate the Euclidean distance between the topic vectors of the two questions and use this distance as the topic similarity between the two questions.
4.2) Combine the topic similarity of the target question T and the related question Q from 4.1) with their joint similarity in a weighted sum, using the following formula, to obtain the final question similarity between T and Q:
Dis(T,Q) = \mu\,Dis_{T\&F}(T,Q) + (1-\mu)\,Dis_{Topic}(T,Q)
where μ is a weighting coefficient, μ = 0.9; Dis_{T&F}(T,Q) is the joint similarity between questions T and Q, and Dis_{Topic}(T,Q) is the topic similarity between questions T and Q.
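Finally, steps 4.1.5) and 4.2) as a sketch; the patent uses the Euclidean distance itself as the topic "similarity", and the two terms are combined as given, with μ = 0.9:
```python
import math

def topic_similarity(v1, v2):
    """Step 4.1.5): Euclidean distance between two topic vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def question_similarity(dis_tf, dis_topic, mu=0.9):
    """Step 4.2): Dis(T,Q) = mu * Dis_T&F(T,Q) + (1 - mu) * Dis_Topic(T,Q)."""
    return mu * dis_tf + (1 - mu) * dis_topic
```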
The concrete steps of an example implementation are described in detail below in conjunction with the method of the present invention, as follows:
(1) The data set used in the example comes entirely from question-and-answer books in a digital library. A total of 610 Q&A books were extracted from the engineering, science, and education collections, containing 137,888 questions in total. The questions cover 25 broad categories, including agriculture, biology, chemical engineering, computers, electronics, machine manufacturing, aerospace, medicine, and automation. The preprocessing of step 1) yields a word space with a vocabulary of 54,074 words.
(2) With the information from (1), calculate the specificity score of each word in the word space, then reorder the words in each question's segmentation result by specificity score from high to low to form the question's topic chain. Take the question "The computer always crashes; what is the reason?" as an example: its segmentation result is "computer / always / crash / what / reason", and calculating the specificity score of each word and sorting gives the topic chain "computer -> crash -> always -> reason -> what". In this way the topic chain of the target question and the topic chains of the related questions are combined into a question tree. Taking the three questions "The computer always crashes; what is the reason?", "What is the working principle of a computer?", and "What are the main components of a computer?" as an example, the question tree they form is shown in Fig. 5.
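Using the topic_chain sketch from step 2.2) above, with hypothetical specificity scores chosen only to reproduce the chain in this example:
```python
# Hypothetical scores consistent with the chain shown above.
scores = {"computer": 5.1, "crash": 3.8, "always": 2.0,
          "reason": 1.2, "what": 0.3}
words = ["computer", "always", "crash", "what", "reason"]  # segmentation result
print(" -> ".join(topic_chain(words, scores)))
# computer -> crash -> always -> reason -> what
```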
(3) The tree cut method based on the minimum description length cuts the question tree obtained in (2); after cutting, each corresponding branch of the question tree is split in two, with the part near the root node representing the topic structure of the branch's question and the other part representing its focus structure. For the tree of Fig. 5, the final cutting result is shown in Fig. 6: above the separator line are the topic structures of the questions, below it their focus structures. For example, the topic structure of the question "The computer always crashes; what is the reason?" is (computer, crash) and its focus structure is (always, reason, what).
(4) Calculate the similarity of the question topic structure parts obtained in (3) with the language model. Use a public collection of similar question pairs as a parallel corpus and train a translation model on it to obtain the word-to-word translation probabilities used by the translation-based language model; then calculate the similarity of the question focus structure parts obtained in (3) with the translation-based language model. Finally, obtain the joint similarity as the weighted sum of the topic structure part similarity and the focus structure part similarity.
(5) Convert each question's word vector into a topic feature vector in the topic space with the BTM topic model, and calculate the topic similarity between two questions from these vectors with the Euclidean distance.
(6) Combine the joint similarity calculated in (4) and the topic similarity calculated in (5) in a weighted sum to obtain the final question similarity between the two questions, and return it.
Results of this example: the method used in the present invention is compared with traditional question similarity calculation methods based on the vector space model (VSM) and the language model (LM) under two evaluation metrics, P@k and NDCG@k. The P@k results are shown in Fig. 7; the NDCG@k results are shown in the following table:
Method        NDCG@1   NDCG@3   NDCG@5
VSM           79.2%    76.28%   70.8%
LM            80.3%    77.89%   71.45%
This method   82%      80.9%    77.86%
The comparison shows that the accuracy of this method in question similarity calculation is clearly higher than that of current question similarity calculation methods. This question similarity calculation method based on question topic and focus therefore has good practical value and application prospects.

Claims (6)

1. A question similarity calculation method based on question topic and question focus, characterized by comprising the following steps:
1) Preprocess the FAQ data: segment the question set data into words with natural language processing tools, remove invalid words, and record the category of each question;
2) Divide each question into topic and focus structures: build a word space from the segmentation results and calculate the specificity score of each word in it; reorder the words of each question by their specificity scores to form the question's topic chain; then split the topic chains of the target question and the related questions with a tree cut model based on the minimum description length, obtaining the topic structure and focus structure of each question;
3) Calculate the joint similarity between questions from their topics and foci: for the topic structures of the target question and a related question, calculate a similarity with a language model; for the focus structures, calculate a similarity with a translation-based language model; finally obtain the joint similarity of question topic and focus as the weighted sum of these two similarities;
4) Calculate the question similarity: calculate the topic similarity between the target question and the related question with the BTM topic model, and obtain the final question similarity as the weighted sum of this topic similarity and the joint similarity calculated in step 3).
2. The question similarity calculation method based on question topic and question focus according to claim 1, characterized in that said step 2) comprises:
2.1) Build a word space from the segmentation results of step 1), and calculate the specificity score of each word in the word space from the category statistics of the question data, using the following formulas:
S(w) = \frac{1}{-\sum_{c \in C} P(c|w) \log P(c|w) + \varepsilon}
P(c|w) = \frac{\mathrm{count}(c,w)}{\sum_{c' \in C} \mathrm{count}(c',w)}
where S(w) is the specificity score of word w, c is a question category, C is the set of all categories of the question data, P(c|w) is the probability that word w occurs in category c, count(c,w) is the number of times word w occurs in category c, and ε is a smoothing factor;
2.2) For each question, reorder the words of the question by the specificity scores of its segmented words to obtain the question's topic chain;
2.3) Combine the topic chain of the target question with the topic chains of its related questions into a question tree whose root node is empty; cut this tree with the tree cut model based on the minimum description length. For a given tree and cutting, the description length L(M,S) is calculated as:
L(M,S) = L(\Gamma) + L(\theta \mid \Gamma) + L(S \mid \Gamma, \theta)
M = (\Gamma, \theta)
\Gamma = (C_1, C_2, \ldots, C_k)
\theta = [P(C_1), P(C_2), \ldots, P(C_k)]
where Γ is the set of node classes of the tree after cutting, θ is the corresponding probability distribution vector, M is the tree cut model determined by Γ and θ, S is the sample set, k is the number of classes, and P(C_i) is the probability corresponding to class C_i;
Select the cutting that minimizes the tree description length and use the resulting tree cut model M to divide question topic from focus: cutting the question tree splits each branch in two, where the part near the root node forms the topic structure of the branch's question and the remainder forms its focus structure.
3. The question similarity calculation method based on question topic and question focus according to claim 1, characterized in that said step 3) comprises:
3.1) For the topic structure parts of the target question T and a related question Q, calculate the topic structure similarity with a language model, using the following formulas:
P_{LM}(T_t \mid Q_t) = \prod_{w \in T_t} P_{LM}(w \mid Q_t)
P_{LM}(w \mid Q_t) = (1-\lambda)\,\frac{\#(w, Q_t)}{|Q_t|} + \lambda\,\frac{\#(w, C)}{|C|}
where T_t and Q_t are the topic structures of the target question T and the related question Q respectively, P_LM(T_t|Q_t) is the topic structure similarity of T and Q, P_LM(w|Q_t) is the probability that Q_t generates word w, #(w,Q_t) is the number of times word w occurs in Q_t, #(w,C) is the number of times word w occurs in the collection C, and λ is the Jelinek-Mercer smoothing factor;
3.2) For the focus structure parts of the target question T and the related question Q, calculate the focus structure similarity with a translation-based language model, using the following formulas:
P_{TRLM}(T_f \mid Q_f) = \prod_{w \in T_f} P_{TRLM}(w \mid Q_f)
P_{TRLM}(w \mid Q_f) = (1-\lambda)\left[\alpha \sum_{t \in Q_f} P(w \mid t)\,\frac{\#(t, Q_f)}{|Q_f|} + (1-\alpha)\,\frac{\#(w, Q_f)}{|Q_f|}\right] + \lambda\,\frac{\#(w, C)}{|C|}
where P(w|t) is the translation probability from word t to word w, P_TRLM(T_f|Q_f) is the focus structure similarity of T and Q, α is the weight of the translation probability part, P_TRLM(w|Q_f) is the probability that Q_f generates word w, T_f and Q_f are the focus structures of T and Q respectively, #(t,Q_f) is the number of times word t occurs in Q_f, and #(w,Q_f) is the number of times word w occurs in Q_f;
3.3) After calculating the topic and focus similarities of the target question T and the related question Q, calculate the joint similarity as a weighted sum, using the following formula:
Dis_{T\&F}(T,Q) = \tau\,P_{LM}(T_t \mid Q_t) + (1-\tau)\,P_{TRLM}(T_f \mid Q_f)
where Dis_{T&F}(T,Q) is the joint similarity of the target question T and the related question Q, and τ is a weighting coefficient.
4. The question similarity calculation method based on question topic and question focus according to claim 1, characterized in that said step 4) comprises:
4.1) Train a BTM topic model on the question set data to obtain the topic space and the topic vector of each question, and calculate the topic similarity between two questions with the Euclidean distance;
4.2) Combine the topic similarity of the target question T and the related question Q from 4.1) with their joint similarity in a weighted sum, using the following formula, to obtain the final question similarity between T and Q:
Dis(T,Q) = \mu\,Dis_{T\&F}(T,Q) + (1-\mu)\,Dis_{Topic}(T,Q)
where μ is a weighting coefficient, μ = 0.9; Dis_{T&F}(T,Q) is the joint similarity between questions T and Q, and Dis_{Topic}(T,Q) is the topic similarity between questions T and Q.
5. The question similarity calculation method based on question topic and question focus according to claim 4, characterized in that the specific calculation of the topic similarity in said step 4.1) is as follows:
4.1.1) Compute the biterm set B from the question data and the dictionary. A biterm is an unordered pair of two different words that appear in the same text fragment after preprocessing; for question data, each question can be treated as an independent text fragment. Assign each word a random topic as initialization;
4.1.2) From the result of 4.1.1), calculate the biterm-topic distribution P(z|b) with the following formula:
P(z \mid b) = \frac{P(z)\,P(w_i \mid z)\,P(w_j \mid z)}{\sum_{z'} P(z')\,P(w_i \mid z')\,P(w_j \mid z')}
where z is a topic, b is a biterm, w_i and w_j are the two words of biterm b, P(z) is the probability of topic z, and P(w_i|z) is the probability of word w_i under topic z;
4.1.3) Calculate the question-biterm distribution P(b|d) with the following formula:
P(b \mid d) = \frac{n_d(b)}{\sum_{b'} n_d(b')}
where d is a question and n_d(b) is the number of times biterm b occurs in question d;
4.1.4) From the results of 4.1.2) and 4.1.3), calculate the question-topic distribution with the following formula:
P(z \mid d) = \sum_{b} P(z \mid b)\,P(b \mid d)
These four steps map a question from its word vector space into the topic vector space learned by the BTM topic model, giving the probability distribution of each question over the topics and hence the question's topic vector, whose dimension equals the number of topics in the topic space;
4.1.5) Finally, calculate the Euclidean distance between the topic vectors of the two questions and use this distance as the topic similarity between the two questions.
6. The question similarity calculation method based on question topic and question focus according to claim 1, characterized in that the natural language processing tools in said step 1) include FudanNLP, the HIT Language Technology Platform (LTP), jieba segmentation, and the like; these tools segment the FAQ data into words, remove invalid words, build the word vector space, and record the category of each question.
CN201510270876.2A 2015-03-11 2015-05-25 Problem similarity calculation method based on subjects and focuses of problems Pending CN104899188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510270876.2A CN104899188A (en) 2015-03-11 2015-05-25 Problem similarity calculation method based on subjects and focuses of problems

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN2015101063939 2015-03-11
CN201510106393 2015-03-11
CN201510270876.2A CN104899188A (en) 2015-03-11 2015-05-25 Problem similarity calculation method based on subjects and focuses of problems

Publications (1)

Publication Number Publication Date
CN104899188A true CN104899188A (en) 2015-09-09

Family

ID=54031857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510270876.2A Pending CN104899188A (en) 2015-03-11 2015-05-25 Problem similarity calculation method based on subjects and focuses of problems

Country Status (1)

Country Link
CN (1) CN104899188A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101566998A (en) * 2009-05-26 2009-10-28 华中师范大学 Chinese question-answering system based on neural network
CN101694659A (en) * 2009-10-20 2010-04-14 浙江大学 Individual network news recommending method based on multitheme tracing
CN103823848A (en) * 2014-02-11 2014-05-28 浙江大学 LDA (latent dirichlet allocation) and VSM (vector space model) based similar Chinese herb literature recommendation method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUIZHONG DUAN ET AL: "Searching Questions by Identifying Question Topic and Question Focus", 《PROCEEDINGS OF ACL-08:HLT》 *
LI CAI ET AL: "Learning the Latent Topics for Question Retrieval in Community QA", 《INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING》 *
XIAOHUI YAN ET AL: "A Biterm Topic Model for Short Texts", 《WWW '13: PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON WORLD WIDE WEB》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786794B (en) * 2016-02-05 2018-09-04 青岛理工大学 A kind of question and answer are to search method and community's question and answer searching system
CN105786794A (en) * 2016-02-05 2016-07-20 青岛理工大学 Question-answer pair search method and community question-answer search system
CN106202574A (en) * 2016-08-19 2016-12-07 清华大学 The appraisal procedure recommended towards microblog topic and device
CN106599196A (en) * 2016-12-14 2017-04-26 竹间智能科技(上海)有限公司 Artificial intelligence conversation method and system
CN107273913B (en) * 2017-05-11 2020-04-21 武汉理工大学 Short text similarity calculation method based on multi-feature fusion
CN107273913A (en) * 2017-05-11 2017-10-20 武汉理工大学 A kind of short text similarity calculating method based on multi-feature fusion
CN107729300A (en) * 2017-09-18 2018-02-23 百度在线网络技术(北京)有限公司 Processing method, device, equipment and the computer-readable storage medium of text similarity
CN107729300B (en) * 2017-09-18 2021-12-24 百度在线网络技术(北京)有限公司 Text similarity processing method, device and equipment and computer storage medium
CN108536852A (en) * 2018-04-16 2018-09-14 上海智臻智能网络科技股份有限公司 Question and answer exchange method and device, computer equipment and computer readable storage medium
CN108595619A (en) * 2018-04-23 2018-09-28 海信集团有限公司 A kind of answering method and equipment
CN108874772A (en) * 2018-05-25 2018-11-23 太原理工大学 A kind of polysemant term vector disambiguation method
CN110895656A (en) * 2018-09-13 2020-03-20 武汉斗鱼网络科技有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN110895656B (en) * 2018-09-13 2023-12-29 北京橙果转话科技有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN109522479A (en) * 2018-11-09 2019-03-26 广东美的制冷设备有限公司 Search processing method and device
CN111191034A (en) * 2019-12-30 2020-05-22 科大讯飞股份有限公司 Human-computer interaction method, related device and readable storage medium
CN111191034B (en) * 2019-12-30 2023-01-17 科大讯飞股份有限公司 Human-computer interaction method, related device and readable storage medium
CN113821639A (en) * 2021-09-18 2021-12-21 支付宝(杭州)信息技术有限公司 Text focus analysis method and system

Similar Documents

Publication Publication Date Title
CN104899188A (en) Problem similarity calculation method based on subjects and focuses of problems
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN103207905B (en) A kind of method of calculating text similarity of based target text
Tulkens et al. Evaluating unsupervised Dutch word embeddings as a linguistic resource
CN102622338B (en) Computer-assisted computing method of semantic distance between short texts
CN108052593A (en) A kind of subject key words extracting method based on descriptor vector sum network structure
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN107305539A (en) A kind of text tendency analysis method based on Word2Vec network sentiment new word discoveries
CN107992542A (en) A kind of similar article based on topic model recommends method
CN105843897A (en) Vertical domain-oriented intelligent question and answer system
CN106202032A (en) A kind of sentiment analysis method towards microblogging short text and system thereof
CN111143672B (en) Knowledge graph-based professional speciality scholars recommendation method
CN106372061A (en) Short text similarity calculation method based on semantics
CN103049569A (en) Text similarity matching method on basis of vector space model
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN103049470A (en) Opinion retrieval method based on emotional relevancy
CN104484380A (en) Personalized search method and personalized search device
CN104008187B (en) Semi-structured text matching method based on the minimum edit distance
Bilgin et al. Sentiment analysis with term weighting and word vectors
CN104699797A (en) Webpage data structured analytic method and device
CN110134934A (en) Text emotion analysis method and device
CN110705247A (en) Based on x2-C text similarity calculation method
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
CN104881399A (en) Event identification method and system based on probability soft logic PSL
Hindocha et al. Short-text Semantic Similarity using GloVe word embedding

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150909