CN101520802A

CN101520802A - Question-answer pair quality evaluation method and system

Info

Publication number: CN101520802A
Application number: CN200910081558A
Authority: CN
Inventors: 方高林; 刘怀军; 郑全战
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2009-04-13
Filing date: 2009-04-13
Publication date: 2009-09-02

Abstract

The invention discloses a question-answer pair quality evaluation method comprising: clustering input question-answer pairs according to question content to obtain clusters, each consisting of semantically identical or similar questions and their answers; performing, on each cluster, quality evaluation between question-answer pairs and quality evaluation within question-answer pairs, and obtaining an inter-pair quality evaluation result and an intra-pair quality evaluation result respectively; and fusing the inter-pair and intra-pair quality evaluation results and outputting the high-quality question-answer pairs. The invention also provides a question-answer pair quality evaluation system. The method and system evaluate question-answer pair quality more effectively and improve the generality of the evaluation.

Description

Question-answer pair quality evaluation method and system

Technical field

The present invention relates to Internet information processing technology, and in particular to a quality evaluation method and system for question-answer pairs.

Background technology

With the development of the Internet, information is becoming ever more abundant, and obtaining useful knowledge from massive amounts of information is an urgent problem. To provide better knowledge services, a number of knowledge question-answering interactive platforms have emerged. On these platforms, users are both consumers and creators of content: they can seek entertainment and help, interact socially, ask and answer questions, and rate the answers to questions. A typical question-answer pair is produced as follows: a user posts a question on the platform, other users answer it, and the asker confirms one satisfactory answer among those submitted.

As the number of questions grows, so does the number of semantically duplicate questions, and most users do not check whether the same question and answer already exist in the system before asking. As a result, current question-answering platforms contain many duplicate question-answer pairs. Although every resolved question has been confirmed by its asker, different askers apply different standards; some give a very high rating simply to thank the answerer, regardless of the quality of the answer. On platforms with duplicate questions and answers, it is therefore necessary to distinguish high-quality question-answer pairs from low-quality ones.

The prior art includes a method that classifies question-answer pairs by fusing various features in a decision tree framework. The features used comprise text content features and usage features. Text content features include word N-grams, word length, the number of distinct words, the number of words in the answer whose frequency exceeds a threshold, the entropy of a character-based trigram language model, and so on. Usage features mainly include the numbers of user votes for and against a question-answer pair, the level of the answerer, the level of the asker, and so on. This method studies the role of the different features and fuses them in a decision tree framework to distinguish high-quality question-answer pairs from medium- and low-quality ones.

However, this method does not consider the degree of semantic matching between a question and its answer, which is the foundation of a high-quality question-answer pair. Nor does it consider how the relationships among multiple duplicate question-answer pairs affect their quality. In addition, question-answer data often lacks the usage features generated while the pair was produced, and because this method relies heavily on usage features, its generality suffers. Prior-art quality evaluation of question-answer pairs is therefore unsatisfactory in effect and poor in generality.

Summary of the invention

In view of this, the main object of the present invention is to provide a quality evaluation method and system for question-answer pairs that evaluate quality more effectively and improve generality.

To achieve the above object, the technical solution of the present invention is realized as follows:

The invention provides a quality evaluation method for question-answer pairs, the method comprising:

clustering the input question-answer pairs according to question content to obtain clusters, each consisting of semantically identical or similar questions and their answers;

performing inter-pair quality evaluation and intra-pair quality evaluation on each cluster, and obtaining an inter-pair quality evaluation result and an intra-pair quality evaluation result respectively;

fusing the inter-pair quality evaluation result and the intra-pair quality evaluation result, and outputting the high-quality question-answer pairs.

The clustering includes k-means clustering and single-pass clustering.

The single-pass clustering is specifically:

computing the similarity between each newly input question and each existing class in turn; if the similarity between the question and one of the classes exceeds a preset similarity threshold, merging the question into that class; if the similarity between the question and every existing class is below the preset similarity threshold, creating a new class for the question.

The inter-pair quality evaluation is specifically:

performing word segmentation, part-of-speech tagging, and stop-word removal on each answer in the cluster;

counting the document frequency of each word, and taking the words whose document frequency exceeds a frequency threshold as the topic center of all answers in the cluster;

computing the distance between each answer and the topic center with a generalized cosine distance function, and ranking the answers by the magnitude of this distance weight;

eliminating similarity relations and containment relations among the ranked answers by means of sentence-level similarity computation, to obtain the inter-pair quality evaluation result.

The intra-pair quality evaluation comprises: evaluating question and answer quality, computing the degree of matching between question and answer, and evaluating the quality of a single question-answer pair.

The evaluation of question and answer quality covers at least one of the following: question format information, answer length, visual feature information in the answer, question positive/negative dictionary features, answer positive/negative dictionary features, and non-text features from the formation of the question-answer pair.

The method further comprises obtaining the degree of matching between the question and the answer by a topic-clustering-based approach.

The evaluation of the quality of a single question-answer pair is specifically:

fusing the following features with a maximum entropy statistical model to obtain a quality evaluation score for each question-answer pair:

question format information, answer length, visual feature information in the answer, question positive/negative dictionary features, answer positive/negative dictionary features, non-text features from the formation of the question-answer pair, and the degree of matching between question and answer.

The present invention also provides a quality evaluation system for question-answer pairs, the system comprising:

a clustering module, configured to cluster the input question-answer pairs according to question content to obtain clusters, each consisting of semantically identical or similar questions and their answers;

a first quality evaluation module, configured to perform inter-pair quality evaluation on each cluster to obtain an inter-pair quality evaluation result;

a second quality evaluation module, configured to perform intra-pair quality evaluation on each cluster to obtain an intra-pair quality evaluation result;

a fusion module, configured to fuse the inter-pair quality evaluation result and the intra-pair quality evaluation result and to output the high-quality question-answer pairs.

In the question-answer pair quality evaluation method and system provided by the present invention, the input question-answer pairs are clustered according to question content to obtain clusters, each consisting of semantically identical or similar questions and their answers; inter-pair and intra-pair quality evaluation are then performed on each cluster to obtain the two evaluation results; the two results are then fused and the high-quality question-answer pairs are output. The invention thereby evaluates question-answer pairs more effectively and with greater generality.

The invention can separate high-quality question-answer pairs from medium- and low-quality ones to form a high-quality knowledge base. As a data source for a search engine, the high-quality data can be made part of the search index and placed directly near the top of the search results; as a knowledge base for automatic question answering, the selected high-quality data can serve directly as the knowledge source from which users are answered. In addition, the invention can process not only knowledge question-answering data but also user-generated data from blogs, forums, and bulletin board systems (BBS), as well as frequently asked questions (FAQ) question-answer data; the high-quality data obtained after evaluation can be used directly to build encyclopedic knowledge.

Description of drawings

Fig. 1 is a flowchart of the question-answer pair quality evaluation method of the present invention;

Fig. 2 is a schematic diagram of the fusion of quality evaluation results in an embodiment of the present invention;

Fig. 3 is a structural diagram of the question-answer pair quality evaluation system of the present invention.

Embodiment

The technical solution of the present invention is further elaborated below in conjunction with the drawings and specific embodiments.

As shown in Fig. 1, the question-answer pair quality evaluation method provided by the present invention mainly comprises the following steps:

Step 101: cluster the input question-answer pairs according to question content to obtain clusters, each consisting of semantically identical or similar questions and their answers.

In practice, the same question can be phrased in different ways. Question clustering groups semantically identical or similar questions into one class, and these questions together with their answers form a cluster in which multiple questions correspond to multiple answers. The invention may implement question clustering with algorithms such as k-means clustering or single-pass clustering, but it is not limited to these and may be extended as needed. Single-pass clustering is used as the example below. Its principle is: the similarity between each newly input question and each existing class is computed in turn; if the similarity to some class exceeds a preset similarity threshold, the question is merged into that class; if the similarity to every existing class is below the preset similarity threshold, a new class is created for the question. To improve the ability of the clustering operation to handle large-scale data, the invention may use the topic words of the questions as the index of each class.

The single-pass clustering operation is as follows:

Step A: analyze the newly input question, including word segmentation, part-of-speech tagging, and stop-word removal on its sentences.

Step B: normalize the words obtained from the analysis.

A synonym dictionary is built, and all synonyms are represented by a single word according to it; for example, all full names are replaced by their abbreviations, and error-prone words are represented by their correct forms. The synonym dictionary is compiled by human editors. It contains equivalent words, such as computer = computing machine and dad = father; full names and their abbreviations, such as Olympic Games (full name) = Olympics (abbreviation); and corrections of commonly miswritten words, each mapping an erroneous form to its correct form.

Step C: extract topic words, ranked by weight, from the normalized question.

The topic words are extracted as follows: look up the term frequency tf and the document frequency df of each word of the question in the statistical corpus, and assign each word the weight λ log(tf) log(1/df); sort all words of the question in descending order of weight, and take the top few (for example, the top 3) as topic words. Here λ is a parameter set per part of speech, and its value usually decreases in the order noun, adjective, verb, adverb. The term frequency is the frequency with which the word occurs in the statistical corpus; the document frequency is the frequency of documents in the statistical corpus that contain the word.
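The weighting in Step C can be sketched as follows. This is a minimal illustration, not the patent's implementation: the λ values in `POS_LAMBDA`, the function names, and the shape of `corpus_stats` are all hypothetical, and the sketch assumes tf is a raw corpus count (at least 1) and df is a document fraction in (0, 1], so that both logarithms are non-negative.

```python
import math

# Hypothetical per-part-of-speech lambda values. The text only states that
# lambda usually decreases from noun to adjective to verb to adverb.
POS_LAMBDA = {"noun": 1.0, "adj": 0.8, "verb": 0.6, "adv": 0.4}

def word_weight(pos, tf, df):
    """lambda * log(tf) * log(1/df); assumes tf >= 1 and 0 < df <= 1."""
    return POS_LAMBDA.get(pos, 0.0) * math.log(tf) * math.log(1.0 / df)

def extract_topic_words(tagged_words, corpus_stats, top_n=3):
    """tagged_words: list of (word, pos); corpus_stats: word -> (tf, df)."""
    scored = []
    for word, pos in tagged_words:
        if word not in corpus_stats:
            continue  # words unseen in the statistical corpus get no weight
        tf, df = corpus_stats[word]
        scored.append((word_weight(pos, tf, df), word))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:top_n]]
```

A rare noun therefore outranks a frequent adverb even when the adverb has a much higher raw count, which is the behaviour the part-of-speech ordering of λ is meant to produce.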

Step D: using the topic words extracted from the question, compute the similarity between the question and each existing class. If the similarity to some class exceeds the preset similarity threshold, merge the question into that class; if the similarity to every existing class is below the preset similarity threshold, create a new class for the question and use the question's topic words as the index of the new class.

Because the words in the question have been normalized, computing the similarity between a question and a class amounts to comparing the words they have in common. The similarity value is defined as [formula missing in the source text]

where k indexes the shared words, tf_k is the term frequency of the k-th shared word, df_k is the document frequency of the k-th shared word, and λ_k is the parameter value corresponding to the k-th shared word.

After question clustering, semantically identical or similar questions are gathered into one class, and these questions together with their answers form a cluster in which multiple questions correspond to multiple answers.
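The single-pass procedure of Steps A to D can be sketched as below. Because the similarity formula is not available in this text, the sketch substitutes an assumed similarity: the sum of per-word weights over the topic words that a question shares with a class index, which reduces to a count of shared words when no weights are given. The function name, the cluster representation, and the choice not to update a class index on merge are all assumptions made for illustration.

```python
def single_pass_cluster(questions, threshold, weight=None):
    """questions: list of sets of normalized topic words, in arrival order.

    Returns a list of clusters, each {"index": set of words, "members": [ids]}.
    The similarity used here (weighted count of shared topic words) is an
    assumed stand-in, not the formula from the source.
    """
    weight = weight or {}
    clusters = []
    for qid, words in enumerate(questions):
        best, best_sim = None, 0.0
        for cluster in clusters:
            shared = words & cluster["index"]
            sim = sum(weight.get(w, 1.0) for w in shared)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > threshold:
            best["members"].append(qid)  # merge into the most similar class
        else:
            # No class is similar enough: the question's topic words become
            # the index of a new class.
            clusters.append({"index": set(words), "members": [qid]})
    return clusters
```

Each question is compared only against class indexes rather than against every earlier question, which is what lets a single pass scale to large inputs.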

Step 102: perform inter-pair quality evaluation and intra-pair quality evaluation on each cluster, and obtain an inter-pair quality evaluation result and an intra-pair quality evaluation result respectively.

Analyzing the different answers in a cluster and judging the quality of question-answer pairs from the relationships among those answers is the problem that inter-pair quality evaluation solves. For the evaluation among the answers in each cluster, the invention adopts a topic-center-based quality evaluation method, which comprises:

Step a: treat each answer in the cluster as a document, and perform word segmentation, part-of-speech tagging, and stop-word removal.

Step b: count the document frequency of each word, and take the words whose document frequency exceeds a frequency threshold as the topic center of all answers in the cluster.

The frequency threshold can be set as needed. For example, with a frequency threshold of 1, all nouns, verbs, adjectives, and adverbs whose document frequency exceeds 1 form the topic center of the whole answer set. Each answer in the cluster is treated as one document, so a document frequency greater than 1 means that the word occurs in at least two answers in the cluster.
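Step b can be illustrated with a short sketch. It assumes the answers have already been segmented and stripped of stop words, and it omits the restriction to nouns, verbs, adjectives, and adverbs for brevity; the function name `topic_center` is hypothetical.

```python
from collections import Counter

def topic_center(answers, freq_threshold=1):
    """answers: list of token lists, one per answer in the cluster.

    Returns the words whose document frequency (number of answers that
    contain the word) exceeds freq_threshold.
    """
    doc_freq = Counter()
    for tokens in answers:
        doc_freq.update(set(tokens))  # count each word once per answer
    return {word for word, n in doc_freq.items() if n > freq_threshold}
```

Converting each answer to a set before counting is what makes this a document frequency: a word repeated many times inside a single answer still contributes only 1.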

Step c: compute the distance between each answer and the topic center with a generalized cosine distance function, and sort all answers by the magnitude of this distance weight.

Let the feature word set of the topic center be O = {w₁, w₂, ..., w_n} and the word set of the current answer A be A = {c₁, c₂, ..., c_m}. The cosine between answer A and the topic center is then:

\cos(A, O) = \frac{\sum_{i \in A \cap O} W_{O,i} W_{A,i}}{\sqrt{\sum_{x \in O} (W_{O,x})^{2}} \sqrt{\sum_{y \in A} (W_{A,y})^{2}}}

Here W_{O,x} is the weight of word x in the topic center O, with W_{O,x} = tf_{L,x} log(tf_x) log(1/df_x), where tf_{L,x} is the local frequency of word x counted over the answers in this cluster; W_{A,y} is the weight of word y in answer A, with W_{A,y} = log(tf_y) log(1/df_y). The term frequency tf_x and the document frequency df_x of each word are obtained from statistics over the whole corpus.

After the cosine distance between each answer and the topic center has been computed, all answers are sorted in descending order of this weight; the larger the weight, the closer the answer is to the topic center.
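The ranking in Step c can be sketched as follows, assuming the word weights for the topic center and for each answer have already been computed and are passed in as dictionaries. The function names are hypothetical, and returning 0 for an empty vector is a convention chosen for this sketch.

```python
import math

def cosine_to_center(answer_weights, center_weights):
    """Cosine between an answer and the topic center.

    Both arguments map word -> weight; only shared words contribute to the
    numerator, while each norm runs over all words of its own vector.
    """
    shared = answer_weights.keys() & center_weights.keys()
    dot = sum(answer_weights[w] * center_weights[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in answer_weights.values()))
    norm_o = math.sqrt(sum(v * v for v in center_weights.values()))
    if norm_a == 0 or norm_o == 0:
        return 0.0  # convention for empty vectors in this sketch
    return dot / (norm_a * norm_o)

def rank_answers(answers, center_weights):
    """Return (cosine, answer_index) pairs sorted from closest to farthest."""
    scored = [(cosine_to_center(a, center_weights), i) for i, a in enumerate(answers)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

An answer made up entirely of topic-center words in the same proportions scores 1, and an answer that shares no words with the center scores 0.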

Step d: eliminate similarity relations and containment relations in the sorted answer set.

The sorted answer set is analyzed to determine whether it contains identical or similar answers (a similarity relation), or an answer that is a subset of another answer (a containment relation). If two answers are identical or similar, their ranking weights are essentially the same; if one answer is a subset of another, the ranking weight of the superset is greater than that of the subset.

When a similarity relation exists in the answer set, only the answer with the largest ranking weight needs to be kept; the rest are redundant and can be removed. When a containment relation exists, only the superset answer needs to be kept.

To identify the similarity and containment relations in the answer set, the invention uses a sentence-level similarity computation, as follows:

Step 01: split each answer into sentences. Sentences are split at punctuation marks such as the full stop and the question mark, and each sentence is about 50 characters long.

Step 02: use a hash algorithm to convert the text of each sentence into a 4-byte fingerprint. An answer in the set is then represented by a set of 4-byte fingerprints A = {s₁, s₂, ..., s_n}. Treating each s_i as a word, a document inverted index can be built in which all texts sharing the same fingerprint s_i form one class. The fingerprint repetition degree is then computed pairwise for the texts within a class. If it exceeds a preset threshold (for example, 40%), a similarity or containment relation is judged to exist; of the two answers involved, the one with the lower ranking weight is flagged with a removal mark, and the relation between the two answers is recorded. Conversely, if the fingerprint repetition degree is below the preset threshold, no similarity or containment relation is judged to exist, and no further processing is needed.

Step 03: repeat the above process until the similarity and containment relations in all classes have been eliminated.
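Steps 01 to 03 can be sketched as below. Several details are assumptions made for illustration: CRC-32 (via `zlib.crc32`) stands in for the unspecified 4-byte hash, the repetition degree is taken as the number of shared fingerprints divided by the size of the smaller fingerprint set, and the limit of about 50 characters per sentence is not enforced.

```python
import re
import zlib
from collections import defaultdict
from itertools import combinations

def fingerprints(text):
    """Split text at sentence punctuation and hash each sentence to 32 bits."""
    sentences = [s.strip() for s in re.split(r"[。.?？!！]", text) if s.strip()]
    return {zlib.crc32(s.encode("utf-8")) for s in sentences}

def find_redundant(ranked_answers, threshold=0.4):
    """ranked_answers: answer texts sorted from highest to lowest weight.

    Returns the indices of answers to flag for removal.
    """
    prints = [fingerprints(a) for a in ranked_answers]
    inverted = defaultdict(set)  # fingerprint -> answers containing it
    for idx, fps in enumerate(prints):
        for fp in fps:
            inverted[fp].add(idx)
    removed = set()
    for group in inverted.values():
        # Only answers that share at least one fingerprint are compared.
        for i, j in combinations(sorted(group), 2):
            overlap = len(prints[i] & prints[j]) / min(len(prints[i]), len(prints[j]))
            if overlap > threshold:
                removed.add(j)  # j > i, so j is the lower-ranked answer
    return removed
```

Dividing by the smaller set makes a fully contained answer score 1 regardless of how long the containing answer is, so containment and near-duplication are caught by the same test.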

After the above operations are complete, every answer in the set that has a similarity or containment relation with another answer and has the lower ranking weight carries a removal mark. A corresponding evaluation score is produced on this basis and serves as the inter-pair quality evaluation result.

The intra-pair quality evaluation comprises: evaluating question and answer quality, computing the degree of matching between question and answer, and evaluating the quality of a single question-answer pair.

The evaluation of question and answer quality may cover at least one of the following: (1) question format information, including the length of the question, its punctuation, and whether it contains an interrogative word; a question that satisfies the prescribed format is of higher quality, while a question that does not and is unclearly expressed is usually not of high quality; (2) answer length; statistically, answers of moderate length are usually of higher quality; (3) visual feature information in the answer, including the number of words in each paragraph, whether paragraphs begin with bold or emphasis markers, and so on; a higher-quality answer usually has good visual features in addition to moderate length; (4) question positive/negative dictionary features, that is, the proportions of the question's words found in the positive example dictionary and in the negative example dictionary; (5) answer positive/negative dictionary features, that is, the proportions of the answer's words found in the positive and negative example dictionaries; (6) non-text features from the formation of the question-answer pair, for example user ratings, the answerer's level, and the answerer's number of answers and acceptance rate.

It should be noted that, to reflect the quality of questions and answers, the invention defines a positive example dictionary and a negative example dictionary. If a large proportion of the words in a question or answer are in the positive example dictionary, the question or answer is more likely to be of high quality; conversely, if a large proportion are in the negative example dictionary, it is less likely to be of high quality.

The positive and negative example dictionaries are built as follows. First, a corpus of many question-answer pairs (for example, 5,000) is extracted and labeled into two classes: a high-quality data set D1 and a medium- and low-quality data set D2. All words occurring in the extracted questions and answers are counted. If the frequency of a word in D1 divided by its frequency in the whole data set (D1 and D2 together) is greater than a preset threshold α₁, the word enters the positive example dictionary; if this ratio is less than a preset threshold α₂, the word enters the negative example dictionary. Words occurring in questions enter the question positive/negative dictionaries, and words occurring in answers enter the answer positive/negative dictionaries.
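The dictionary construction can be sketched as follows. The threshold values 0.8 and 0.2 and the function name are hypothetical, since the text gives no concrete values for α₁ and α₂; the sketch builds one pair of dictionaries, and would be run once on question text and once on answer text.

```python
from collections import Counter

def build_dictionaries(high_quality_docs, low_quality_docs, alpha1=0.8, alpha2=0.2):
    """Each argument is a list of token lists (D1 and D2 respectively).

    A word goes to the positive dictionary when its share of occurrences in
    D1 exceeds alpha1, and to the negative dictionary when it is below alpha2.
    """
    high = Counter(w for doc in high_quality_docs for w in doc)
    low = Counter(w for doc in low_quality_docs for w in doc)
    positive, negative = set(), set()
    for word in high.keys() | low.keys():
        ratio = high[word] / (high[word] + low[word])
        if ratio > alpha1:
            positive.add(word)
        elif ratio < alpha2:
            negative.add(word)
    return positive, negative
```

Words whose ratio falls between the two thresholds enter neither dictionary, so words that are common to both data sets do not influence the dictionary features.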

The invention proposes a topic-clustering-based method for computing the degree of matching between question and answer, as follows:

Step 001: collect a sufficiently large general corpus (for example, 80 GB) as the statistical corpus for pointwise mutual information, segment it into words, and use the formula

PMI (w_{1}, w_{2}) = \log_{2} \frac{P (w_{1}, w_{2})}{P (w_{1}) P (w_{2})}

to compute the pointwise mutual information between words. Here PMI(w₁, w₂) is the pointwise mutual information between words w₁ and w₂, P(w₁) is the frequency of occurrence of w₁ in the statistics, P(w₂) is the frequency of occurrence of w₂ in the statistics, and P(w₁, w₂) is the co-occurrence frequency of w₁ and w₂. The words w₁ and w₂ are considered to co-occur if they appear within several consecutive sentences whose total length is below a length threshold (for example, 150 Chinese characters). In addition, multiple occurrences of w₁ and w₂ within one document are counted only once.
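A small-scale sketch of the PMI statistics is given below. It assumes each document has already been cut into co-occurrence windows (runs of consecutive sentences under the length threshold) and that each window is represented as a set of words; probabilities are estimated as the fraction of documents in which a word or a co-occurring pair appears, which is one way to honour the rule that repeats within a document count once. The function name and input layout are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(documents):
    """documents: list of documents, each a list of windows (sets of words).

    Returns {(w1, w2): PMI} for every pair that co-occurs in some window,
    with w1 < w2 alphabetically.
    """
    n_docs = len(documents)
    single, pair = Counter(), Counter()
    for windows in documents:
        words, pairs = set(), set()
        for window in windows:
            words |= window
            pairs |= {frozenset(p) for p in combinations(sorted(window), 2)}
        single.update(words)  # each word counted once per document
        pair.update(pairs)    # each co-occurring pair counted once per document
    table = {}
    for key, count in pair.items():
        w1, w2 = sorted(key)
        p_joint = count / n_docs
        p1, p2 = single[w1] / n_docs, single[w2] / n_docs
        table[(w1, w2)] = math.log2(p_joint / (p1 * p2))
    return table
```

Pairs that never share a window are simply absent from the table, so a caller can treat a missing key as no evidence of association.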

Step 002: perform word segmentation and part-of-speech tagging on the questions in the cluster, and keep the nouns q₁, q₂, ..., q_m; the number of nouns is denoted m.

Step 003: process the answer. If the answer is longer than a length threshold (for example, 150 Chinese characters), extract topic words from it: look up the term frequency tf and the document frequency df of each word of the answer in the global statistical corpus, assign each word the weight TF log(tf) log(1/df), sort all words of the answer in descending order of weight, and take the top n (for example, n = 50) nouns as topic words. Here TF is the local frequency of the word within the answer in which it occurs. If the answer is shorter than the length threshold, directly perform word segmentation and part-of-speech tagging on it and extract the nouns a₁, a₂, ..., a_n; their number is denoted n.

Step 004: taking each q_i as a topic starting point, check whether the pointwise mutual information between a_j and q_i exceeds a PMI threshold. If it does, add a_j to the center chain; if the pointwise mutual information between a_j and every q_i is below the threshold, discard a_j. The number of words finally contained in the center chain is denoted k, and the degree of matching between question and answer is defined as k/(m+n). Under this definition, the more keywords in the answer are related to keywords in the question, the larger this value, and the more relevant the answer is to the question.
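Step 004 can be sketched as below, reading the matching degree as k divided by the sum m + n. The PMI lookup is passed in as a callable so that the sketch does not depend on how the statistics are stored; the function name and the zero result for empty input are assumptions.

```python
def matching_degree(question_nouns, answer_nouns, pmi, pmi_threshold):
    """pmi: callable (question_word, answer_word) -> float.

    An answer noun joins the center chain when its PMI with at least one
    question noun exceeds the threshold. Returns k / (m + n).
    """
    m, n = len(question_nouns), len(answer_nouns)
    if m + n == 0:
        return 0.0  # convention for empty input in this sketch
    chain = [a for a in answer_nouns
             if any(pmi(q, a) > pmi_threshold for q in question_nouns)]
    return len(chain) / (m + n)
```

For instance, with two question nouns and three answer nouns of which two are linked to the question, the matching degree is 2 / 5.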

In addition, to fuse the above features, the invention uses a maximum entropy statistical model as the fusion framework for evaluating the quality of a single question-answer pair. The fusion framework may also be implemented with other types of classifier, such as a support vector machine or a Bayesian classifier, and is not limited to those listed.

The fusion of the features is described in detail below, taking a maximum entropy evaluation classifier as the example. As shown in Fig. 2, the input features of the maximum entropy evaluation classifier comprise: question format information, answer length, visual feature information in the answer, question positive/negative dictionary features, answer positive/negative dictionary features, non-text features from the formation of the question-answer pair, and the degree of matching between question and answer.

The question and answer positive/negative dictionary features are produced as follows: for each word in the question and in the answer, obtain from the positive and negative example dictionaries its probability of belonging to high-quality data and to low-quality data; then use Bayes' formula to compute P(good|Q) and P(good|A), which serve as the maximum entropy inputs for the question positive/negative dictionary feature and the answer positive/negative dictionary feature respectively.

The answer length feature is defined as the probability P(good|L) that an answer of length L belongs to high-quality data, where

P (good | L) = \frac{P (good) p (good | L)}{P (good) p (good | L) + P (bad) p (bad | L)} .

The probabilities p(good|L) and p(bad|L) are obtained by counting during training.
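The length feature can be sketched as a Bayes computation over training counts. This sketch interprets the lower-case p(·) terms of the formula as per-class likelihoods of the length, estimated by counting; that reading, the use of length buckets, and the function name are assumptions rather than details given in the text. Both training classes are assumed to be non-empty.

```python
def length_posterior(length_bucket, good_counts, bad_counts):
    """good_counts / bad_counts map a length bucket to the number of
    high-quality / low-quality training answers falling in that bucket."""
    n_good, n_bad = sum(good_counts.values()), sum(bad_counts.values())
    prior_good = n_good / (n_good + n_bad)
    prior_bad = 1.0 - prior_good
    like_good = good_counts.get(length_bucket, 0) / n_good
    like_bad = bad_counts.get(length_bucket, 0) / n_bad
    denom = prior_good * like_good + prior_bad * like_bad
    # Fall back to the prior when the bucket was never seen in training.
    return prior_good * like_good / denom if denom else prior_good
```

With equal class sizes, a bucket that holds 60% of the high-quality answers and 20% of the low-quality ones yields a posterior of 0.75.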

The non-text feature from the formation of the question-answer pair is a single value obtained by averaging the ratio of the user rating to the maximum rating, the ratio of the answerer's level to the highest level, and the answerer's acceptance rate when the answerer's number of answers exceeds a certain value; this value is the input for the non-text feature.

The question format information is defined as P(good|Q) = λ₁P(good|L_Q) + λ₂ + λ₃, where λ₁ + λ₂ + λ₃ = 1. Here λ₁ is the weight of the probability P(good|L_Q) that a question of length L_Q is of high quality, λ₂ is the weight for high quality when the question has the punctuation feature, and λ₃ is the weight for high quality when the question has the interrogative-word feature.
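The question format score can be sketched as below. The sketch assumes that the λ₂ and λ₃ terms are added only when the corresponding feature is present, which is an interpretation of the definition above; the default weights and the function name are hypothetical.

```python
def question_format_score(p_good_given_length, has_punctuation, has_interrogative,
                          lambdas=(0.5, 0.2, 0.3)):
    """lambda1 * P(good | L_Q), plus lambda2 if the question is punctuated,
    plus lambda3 if it contains an interrogative word."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("the three weights must sum to 1")
    return l1 * p_good_given_length + l2 * has_punctuation + l3 * has_interrogative
```

Because the weights sum to 1 and the length probability is at most 1, the score stays within [0, 1] and can be fed to the classifier alongside the other probability-valued features.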

The visual feature information is a value determined by whether the final layout of the answer satisfies the formatting criteria: 1 if it does, and 0 otherwise.

The training process is as follows: first train the maximum entropy model parameters on 10,000 training samples, then use the trained parameters for recognition, so that each question-answer pair finally receives an evaluation score, which serves as the intra-pair quality evaluation result. Question-answer pairs whose score is below a certain threshold are regarded as medium- or low-quality and are deleted directly.

Step 103: fuse the quality evaluation result between question-answer pairs with the quality evaluation result within question-answer pairs, and output the high-quality question-answer pairs.

The quality evaluation result of a single question-answer pair and the quality evaluation result between question-answer pairs in the cluster can be fused by weighting, or by a classifier that takes the evaluation score of the single question-answer pair and the evaluation score between question-answer pairs as two features. Based on experimental statistics, the present invention adopts the following scheme:

First, the number N of all question-answer pairs in the cluster is counted, and all question-answer pairs in the cluster are passed through the single question-answer pair evaluation classifier, so that only high-quality question-answer pairs remain;

For these high-quality question-answer pairs, question-answer pairs that are contained in other pairs are removed, and the ranking weight of similar question-answer pairs is lowered;

Tiered scoring is then applied according to the number of question-answer pairs contained in each cluster: if N > 50, the maximum ranking value in the cluster is normalized to 1 and the top three are taken as high-quality question-answer pairs; if N > 20, the maximum ranking value in the cluster is normalized to 0.9 and the top two are taken as high-quality question-answer pairs; if N > 10, the maximum ranking value in the cluster is normalized to 0.8 and the top one is taken as the high-quality question-answer pair; if N > 5, the maximum ranking value in the cluster is normalized to 0.7 and the top one is taken as the high-quality question-answer pair; if N > 1, the maximum ranking value in the cluster is normalized to 0.6 and averaged with the evaluation score within question-answer pairs, and if the maximum score exceeds 0.7 the pair is kept as a high-quality question-answer pair, otherwise it is deleted; if N = 1, the score of the question-answer pair is set to 0.5 and averaged with the evaluation score within question-answer pairs, and if the maximum score exceeds 0.7 the pair is kept as a high-quality question-answer pair, otherwise it is deleted.
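The tiered rule above can be expressed as a small function. This is a sketch under stated assumptions: "normalizing the maximum ranking value to x" is read as rescaling all between-pair scores so that the best one equals x, and in the two smallest tiers only the best fused pair is kept if it exceeds 0.7. The `ScoredPair` type and the function name are hypothetical, and the removal of contained or similar pairs is assumed to have been done beforehand.

```python
from dataclasses import dataclass

@dataclass
class ScoredPair:
    pair_id: str
    between_score: float  # ranking score from the evaluation between pairs
    within_score: float   # score from the single-pair classifier

# (exclusive lower bound on N, value the cluster maximum is normalized to, pairs kept)
TIERS = [(50, 1.0, 3), (20, 0.9, 2), (10, 0.8, 1), (5, 0.7, 1)]
KEEP_THRESHOLD = 0.7

def fuse_cluster(n_total, survivors):
    """Return [(pair_id, fused_score)] for one cluster.

    n_total is N, the number of pairs in the cluster before filtering;
    survivors are the pairs that passed the single-pair classifier.
    """
    if not survivors:
        return []
    ranked = sorted(survivors, key=lambda p: p.between_score, reverse=True)
    top = ranked[0].between_score
    for lower, cap, keep in TIERS:
        if n_total > lower:
            scale = cap / top if top > 0 else 0.0
            return [(p.pair_id, p.between_score * scale) for p in ranked[:keep]]
    if n_total > 1:
        # Small cluster: cap the between-pair score at 0.6, then average.
        scale = 0.6 / top if top > 0 else 0.0
        fused = [(p.pair_id, (p.between_score * scale + p.within_score) / 2)
                 for p in ranked]
    else:
        # Single pair: fixed between-pair score of 0.5, then average.
        fused = [(p.pair_id, (0.5 + p.within_score) / 2) for p in ranked]
    best = max(fused, key=lambda item: item[1])
    return [best] if best[1] > KEEP_THRESHOLD else []
```

Larger clusters are trusted more: a question asked many times yields up to three kept answers with a full score, while an isolated pair has to earn its place almost entirely through its within-pair score.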

To implement the above quality evaluation method for question-answer pairs, the present invention also provides a quality evaluation system for question-answer pairs. As shown in Figure 3, the system comprises: a clustering module 10, a first quality evaluation module 20, a second quality evaluation module 30 and a fusion module 40. The clustering module 10 is configured to cluster the input question-answer pairs according to question content, obtaining clusters composed of semantically identical or similar questions and their answers. The first quality evaluation module 20 is connected to the clustering module 10 and is configured to perform the quality evaluation between question-answer pairs on the clusters, obtaining the quality evaluation result between question-answer pairs. The second quality evaluation module 30 is connected to the clustering module 10 and is configured to perform the quality evaluation within question-answer pairs on the clusters, obtaining the quality evaluation result within question-answer pairs. The fusion module 40 is connected to the first quality evaluation module 20 and the second quality evaluation module 30 and is configured to fuse the quality evaluation result between question-answer pairs with the quality evaluation result within question-answer pairs and to output the high-quality question-answer pairs.
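The wiring of the four modules can be sketched as a thin pipeline. The class name, the callable signatures and the score-list representation are hypothetical; the sketch only shows that both evaluation modules consume the clusters produced by the clustering module and that the fusion module consumes both results.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (question, answer)
Cluster = List[Pair]

class QualityEvaluationSystem:
    """Connects modules 10, 20, 30 and 40 in the order described in the text."""

    def __init__(self,
                 cluster_module: Callable[[List[Pair]], List[Cluster]],
                 between_module: Callable[[Cluster], List[float]],
                 within_module: Callable[[Cluster], List[float]],
                 fusion_module: Callable[[Cluster, List[float], List[float]], List[Pair]]):
        self.cluster_module = cluster_module  # module 10
        self.between_module = between_module  # module 20
        self.within_module = within_module    # module 30
        self.fusion_module = fusion_module    # module 40

    def run(self, pairs: List[Pair]) -> List[Pair]:
        output: List[Pair] = []
        for cluster in self.cluster_module(pairs):
            between = self.between_module(cluster)  # scores between pairs
            within = self.within_module(cluster)    # scores within pairs
            output.extend(self.fusion_module(cluster, between, within))
        return output
```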

In summary, the present invention can separate high-quality question-answer pairs from low-quality ones to form a high-quality knowledge base. As a data source for a search engine, the high-quality data can be made part of the search engine index and placed directly near the top of the search results; as a knowledge base for automatic question answering, the selected high-quality data can be used directly as the knowledge source, providing answers to users. In addition, the present invention can process not only knowledge question-answer data but also user-generated data such as blogs, forums, BBS and FAQ question-answer data; the high-quality data obtained after evaluation can be used directly to build encyclopedic knowledge.

The above is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention.

Claims

1. A quality evaluation method for question-answer pairs, characterized in that the method comprises:

2. The quality evaluation method for question-answer pairs according to claim 1, characterized in that the clustering comprises: k-means clustering and single-pass clustering.

3. The quality evaluation method for question-answer pairs according to claim 2, characterized in that the single-pass clustering is specifically:

4. The quality evaluation method for question-answer pairs according to claim 1, characterized in that the quality evaluation between question-answer pairs is specifically:

5. The quality evaluation method for question-answer pairs according to claim 1, characterized in that the quality evaluation within question-answer pairs comprises: evaluation of the quality of the question and of the answer, calculation of the degree of matching between the question and the answer, and evaluation of the quality of the single question-answer pair.

6. The quality evaluation method for question-answer pairs according to claim 5, characterized in that the evaluation of the quality of the question and of the answer covers at least one of the following: question format information, answer length, visual feature information in the answer, the question positive/negative-example dictionary feature, the answer positive/negative-example dictionary feature, and the non-text feature of the question-answer pair formation process.

7. The quality evaluation method for question-answer pairs according to claim 5, characterized in that the method further comprises: obtaining the degree of matching between the question and the answer by means of topic clustering.

8. The quality evaluation method within question-answer pairs according to claim 5, characterized in that the evaluation of the quality of the single question-answer pair is specifically:

9. A quality evaluation system for question-answer pairs, characterized in that the system comprises: