CN103425635A - Method and device for recommending answers - Google Patents

Method and device for recommending answers

Info

Publication number
CN103425635A
CN103425635A (application CN201210151044; granted as CN103425635B)
Authority
CN
China
Prior art keywords
answer
weight
classification
semantic primitive
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012101510445A
Other languages
Chinese (zh)
Other versions
CN103425635B (en)
Inventor
陈庆轩
梁丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210151044.5A priority Critical patent/CN103425635B/en
Publication of CN103425635A publication Critical patent/CN103425635A/en
Application granted granted Critical
Publication of CN103425635B publication Critical patent/CN103425635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for recommending answers. The method includes acquiring questions and text content corresponding to the questions and segmenting to obtain semantic units of the questions and semantic units of answers, searching weights of the semantic units of the questions in different categories according to a built question domain dictionary to compute the theme weight of the questions in different categories, searching weights of the semantic units of the answers in different categories according to a built answer domain dictionary to compute the theme weight of the answers in different categories, computing the theme similarities of the various answers and the questions respectively according to the theme weight of the questions and the theme weight of the answers, and finally recommending the answers according to the computing result of the theme similarity. Compared with the prior art, the method and the device for recommending answers have the advantages that accuracy of semantic similarities between questions and answers is improved effectively and recall rate is increased since the question domain dictionary and the answer domain dictionary are generated respectively.

Description

Answer recommendation method and device
[Technical field]
The present invention relates to the field of Internet information processing technology, and in particular to an answer recommendation method and device.
[Background]
With the development of communication technology and the Internet, interactive question-and-answer (Q&A) communities such as Baidu Knows, Sina iAsk, Google Answers, Soso Ask, and Yahoo! Answers have attracted growing attention. These communities provide a platform on which netizens can interact: users can freely ask questions, browse questions, answer questions, help one another, and share knowledge. As the number of community participants grows, so does the number of candidate answers, and Q&A communities therefore commonly rank answers automatically in order to recommend preferred answers to users.
In current automatic answer ranking, text-topic analysis is mostly used to judge how well an answer satisfies a question, for example by analyzing the semantic relevance of a question-answer pair, and answers are then ranked accordingly. Topic analysis is mainly based on topic models: a text is mapped to a topic vector, and a topic vector is in turn represented by a distribution over words. Topic similarity between texts therefore reduces to similarity between topic vectors, which can be measured by cosine similarity.
Existing topic-analysis methods mostly rest on an assumption: all texts belong to the same topic space, and each topic follows the same word distribution. However, the question and the answer in a pair may use different wording. For example, in the computing domain, a question's vocabulary tends to consist of everyday or colloquial computer terms such as "computer" or "operating system", while an answer's vocabulary tends toward professional terms such as "PC" or "win7". Similarly, a user may ask about a skill in a game, while the answer describes the skill concretely without repeating the words of the question. In such cases the semantic relevance computed between answer and question by existing methods is low; a genuinely matching answer may fail to be recalled or may be ranked low, degrading the accuracy of question-answer quality judgment and preventing users from finding a preferred answer.
[Summary of the invention]
In view of this, the present invention provides an answer recommendation method and device that build a question-domain dictionary and an answer-domain dictionary separately, expanding the domain-specific wording of the question and the answers in a pair. This effectively improves the accuracy of semantic-similarity judgment between question and answer and increases the recall rate.
The specific technical solution is as follows:
An answer recommendation method comprising the following steps:
S1: obtain a question and the text content of the answers corresponding to the question, and segment them to obtain the semantic units of the question and of each answer;
S2: using a pre-built question-domain dictionary, look up the weight of each of the question's semantic units in each category and compute the question's topic weight in each category;
and, using a pre-built answer-domain dictionary, look up the weight of each of each answer's semantic units in each category and compute each answer's topic weight in each category;
S3: using the obtained topic weights of the question and of each answer, compute the topic similarity between each answer and the question, and recommend answers according to the computed topic similarities.
According to a preferred embodiment of the present invention, the question-domain dictionary is built as follows:
obtain the content of the questions in a Q&A corpus, and segment it to obtain the questions' semantic units;
compute the weight of each of the questions' semantic units in each category;
assemble the semantic units and their per-category weights into the question-domain dictionary.
According to a preferred embodiment of the present invention, the answer-domain dictionary is built as follows:
obtain the content of the answers in the Q&A corpus, and segment it to obtain the answers' semantic units;
compute the weight of each of the answers' semantic units in each category;
assemble the semantic units and their per-category weights into the answer-domain dictionary.
According to a preferred embodiment of the present invention, after the semantic units of the question or of the answers are obtained, the method further comprises:
filtering out semantic units whose word frequency is below a preset frequency threshold;
only the semantic units remaining after filtering are used when computing the per-category weights.
According to a preferred embodiment of the present invention, the weight of a semantic unit in each category is computed from one or any combination of the following:
the variability of the semantic unit's word frequency across categories, the frequency with which the semantic unit appears in each category, and the semantic unit's inverse frequency.
According to a preferred embodiment of the present invention, the weight of a semantic unit in each category is computed as:

w(token_i, C_j) = [Σ_j (p_ij − p̄_i)² / Σ_j p_ij] × (log(N / N(token_i)))² × p_ij^n

where w(token_i, C_j) is the weight of semantic unit token_i in category C_j;
p_ij = T_ij / L_j, where L_j is the total number of semantic-unit occurrences contained in category C_j and T_ij is the number of occurrences of token_i in C_j;
p̄_i = Σ_j p_ij / m, where m is the number of categories;
p_ij^n is the frequency term of token_i in C_j, with n the word-frequency influence factor;
N is the total number of semantic-unit occurrences in the corpus, and N(token_i) is the number of occurrences of token_i.
According to a preferred embodiment of the present invention, before the semantic units and their per-category weights are assembled into the question-domain or answer-domain dictionary, the method further comprises:
applying near-duplicate weight filtering across categories: for a given semantic unit, if the number of its weights falling into the same weight interval exceeds a preset threshold, those weights are filtered out;
only the unit's weights in the remaining categories are used to form the question-domain or answer-domain dictionary.
According to a preferred embodiment of the present invention, the weight intervals are arranged according to the magnitudes of the semantic unit's weights in the categories.
According to a preferred embodiment of the present invention, before the semantic units and their per-category weights are assembled into a dictionary, the method further comprises:
filtering out semantic units that are single characters, repeated-digit strings, or numeric strings exceeding a preset length threshold;
only the semantic units remaining after filtering are used to form the question-domain or answer-domain dictionary.
According to a preferred embodiment of the present invention, the topic similarity of an answer and the question is computed by:
computing the topic similarity of the answer and the question under each category;
and taking the maximum of the computed values as the topic similarity of the answer and the question.
According to a preferred embodiment of the present invention, the topic similarity of an answer and the question is computed as:

sim(query, ans) = Max_j { weight(query, C_j) × weight(ans, C_j) }

where sim(query, ans) is the topic similarity of the answer and the question, weight(query, C_j) is the question's topic weight in category C_j, and weight(ans, C_j) is the answer's topic weight in category C_j.
An answer recommendation device comprising:
a text acquisition module, which obtains a question and the text content of the answers corresponding to the question, and segments them to obtain the semantic units of the question and of each answer;
a topic-weight computation module, which uses a pre-built question-domain dictionary to look up the weight of each of the question's semantic units in each category and computes the question's topic weight in each category,
and uses a pre-built answer-domain dictionary to look up the weight of each of each answer's semantic units in each category and computes each answer's topic weight in each category;
a similarity computation module, which uses the topic weights of the question and of each answer obtained by the topic-weight computation module to compute the topic similarity between each answer and the question, and recommends answers according to the computed topic similarities.
According to a preferred embodiment of the present invention, the question-domain dictionary is built in advance by a question-dictionary building module, which comprises:
a question acquisition submodule, which obtains the content of the questions in the Q&A corpus and segments it to obtain the questions' semantic units;
a first weight-computation submodule, which computes the weight of each of the questions' semantic units in each category;
a first assembly submodule, which assembles the semantic units and their per-category weights into the question-domain dictionary.
According to a preferred embodiment of the present invention, the answer-domain dictionary is built in advance by an answer-dictionary building module, which comprises:
an answer acquisition submodule, which obtains the content of the answers in the Q&A corpus and segments it to obtain the answers' semantic units;
a second weight-computation submodule, which computes the weight of each of the answers' semantic units in each category;
a second assembly submodule, which assembles the semantic units and their per-category weights into the answer-domain dictionary.
According to a preferred embodiment of the present invention, the question-dictionary or answer-dictionary building module further comprises:
a word-frequency filtering submodule, which filters out semantic units whose word frequency is below a preset frequency threshold;
the semantic units remaining after filtering are supplied to the first or second weight-computation submodule.
According to a preferred embodiment of the present invention, the first or second weight-computation submodule computes the weight of a semantic unit in each category from one or any combination of the following:
the variability of the semantic unit's word frequency across categories, the frequency with which the semantic unit appears in each category, and the semantic unit's inverse frequency.
According to a preferred embodiment of the present invention, the first or second weight-computation submodule computes the weight of a semantic unit in each category as:

w(token_i, C_j) = [Σ_j (p_ij − p̄_i)² / Σ_j p_ij] × (log(N / N(token_i)))² × p_ij^n

where w(token_i, C_j) is the weight of semantic unit token_i in category C_j;
p_ij = T_ij / L_j, where L_j is the total number of semantic-unit occurrences contained in category C_j and T_ij is the number of occurrences of token_i in C_j;
p̄_i = Σ_j p_ij / m, where m is the number of categories;
p_ij^n is the frequency term of token_i in C_j, with n the word-frequency influence factor;
N is the total number of semantic-unit occurrences in the corpus, and N(token_i) is the number of occurrences of token_i.
According to a preferred embodiment of the present invention, the question-dictionary or answer-dictionary building module further comprises:
a weight filtering submodule, which applies near-duplicate weight filtering across categories: for a given semantic unit, if the number of its weights falling into the same weight interval exceeds a preset threshold, those weights are filtered out;
only the unit's weights in the remaining categories are supplied to the first or second assembly submodule to form the question-domain or answer-domain dictionary.
According to a preferred embodiment of the present invention, the weight intervals are arranged according to the magnitudes of the semantic unit's weights in the categories.
According to a preferred embodiment of the present invention, the question-dictionary or answer-dictionary building module further comprises:
a semantic-unit filtering submodule, which filters out semantic units that are single characters, repeated-digit strings, or numeric strings exceeding a preset length threshold;
the semantic units remaining after filtering are supplied to the first or second assembly submodule to form the question-domain or answer-domain dictionary.
According to a preferred embodiment of the present invention, the similarity computation module computes the topic similarity of each answer and the question under each category, and takes the maximum computed value as the topic similarity of the answer and the question.
According to a preferred embodiment of the present invention, the similarity computation module computes the topic similarity of an answer and the question as:

sim(query, ans) = Max_j { weight(query, C_j) × weight(ans, C_j) }

where sim(query, ans) is the topic similarity of the answer and the question, weight(query, C_j) is the question's topic weight in category C_j, and weight(ans, C_j) is the answer's topic weight in category C_j.
As can be seen from the above technical solutions, the answer recommendation method and device provided by the present invention generate a question-domain dictionary and an answer-domain dictionary separately from a Q&A corpus, thereby expanding the domain-specific wording of question-answer pairs. This effectively improves the accuracy of semantic-similarity computation for question-answer pairs, addresses the matching failures that occur when a question and its answer describe the same topic in different words, and increases the recall rate.
[Brief description of the drawings]
Fig. 1 is a flowchart of the answer recommendation method provided in Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the method for building the question-domain dictionary in Embodiment 1;
Fig. 3 is a flowchart of the method for building the answer-domain dictionary in Embodiment 1;
Fig. 4 is a schematic diagram of the answer recommendation device provided in Embodiment 2;
Fig. 5 is a schematic diagram of the question-dictionary building module in Embodiment 2;
Fig. 6 is a schematic diagram of the answer-dictionary building module in Embodiment 2.
[Detailed description]
To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described below with reference to the drawings and specific embodiments.
In the question-answering process of an interactive Q&A community, the wording used for the same topic in a question and in an answer differs with the participants' knowledge backgrounds, for example <compression software, winrar>, <slideshow, PPT>, <system software, win7>. Although the wording of such pairs differs, they have high semantic similarity against a specific domain background.
The present invention exploits this property: it builds a question-domain dictionary and an answer-domain dictionary from the words in the various categories of questions and of answers respectively, and computes the semantic similarity between question and answer per domain, so that answers can be recommended according to the similarity results.
Embodiment 1
Fig. 1 is the flowchart of the answer recommendation method provided by this embodiment. As shown in Fig. 1, the method comprises:
Step S10: obtain a question and the text content of the answers corresponding to the question, and segment them to obtain the semantic units of the question and of each answer.
A question may have multiple corresponding answers. The text content of the question and of each answer is segmented and filtered to obtain the semantic units contained in the question and in each answer.
The text content of a question or an answer can be segmented with any existing segmentation method, such as N-gram segmentation, forward maximum matching, or reverse maximum matching. Taking N-gram segmentation as an example: unigram segmentation yields unigram semantic units such as "text", "data", "form"; bigram segmentation yields bigram units such as "text box", "data packet", "new form"; trigram segmentation yields trigram units such as "multi-line text box", "packet capture", "new form download"; and so on for N-gram semantic units. An N-gram semantic unit consists of N contiguous terms in the question or answer, with no separator (other word, punctuation, or space) between them.
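As a concrete illustration, the N-gram enumeration described above can be sketched as follows. This assumes the text has already been tokenized by some word segmenter; the function and variable names (`ngram_units`, `tokens`) are illustrative, not from the patent, and units are joined without separators as the patent describes for Chinese text.

```python
def ngram_units(tokens, max_n=3):
    """Enumerate all contiguous 1- to max_n-gram semantic units of a token list."""
    units = []
    for n in range(1, max_n + 1):                  # unigrams first, then bigrams, ...
        for i in range(len(tokens) - n + 1):
            units.append("".join(tokens[i:i + n]))  # N-grams join with no separator
    return units

print(ngram_units(["text", "box", "download"], max_n=2))
# → ['text', 'box', 'download', 'textbox', 'boxdownload']
```

For English text one would join with spaces instead; the joining convention is a detail of the segmenter, not of the method itself.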
A question or an answer may contain content in several fields. For example, a question may comprise three fields: a title, a body, and a supplementary note. The text content of each field is extracted and segmented separately, so that the N-gram semantic units of the question or answer are obtained per field (title, body, and supplementary content).
For example, a user posts the question:
"Ask a computer expert
The things my computer downloaded before have disappeared after a restart, but I didn't delete anything. What happened?"
This question comprises the title "Ask a computer expert" and the body text quoted above. Taking the title as an example, its segmentation yields the unigram semantic units "ask", "computer", "expert", the bigram units "ask computer" and "computer expert", and the trigram unit "ask computer expert".
Step S20: using the pre-built question-domain dictionary, look up the weight of each of the question's semantic units in each category and compute the question's topic weight in each category; and, using the pre-built answer-domain dictionary, look up the weight of each of each answer's semantic units in each category and compute each answer's topic weight in each category.
The question-domain and answer-domain dictionaries contain semantic units and the weight of each unit in each category. The categories are a preset set of domain classes, for which an encyclopedia taxonomy can be adopted, e.g. computing, medicine, education, maps, songs, films.
The specific process of building the question-domain dictionary and the answer-domain dictionary in advance from an existing Q&A corpus is described in detail later.
Using the question-domain dictionary, the weight of each of the question's semantic units in each category is looked up, and the weights of all the question's units are summed per category to obtain the question's topic weight in each category. For example, looking up the semantic unit "computer" in the question-domain dictionary might yield a weight of 15 in the computing category, 30 in the education category, and 10 in the medicine category. Each of the question's semantic units obtained in step S10 is looked up in turn.
Per category, the units' weights under that category are summed to give the question's topic weight under that category. If a unit's weight under some category cannot be found, it is taken as zero. For example, if among the question's semantic units only "computer" and "expert" have weights in the medicine category, the question's topic weight in medicine is the sum of the weights of "computer" and "expert".
Likewise, using the answer-domain dictionary, the weight of each of an answer's semantic units in each category is looked up, and the weights of all the answer's units are summed per category to obtain the answer's topic weight in each category.
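The per-category summation above can be sketched as follows, assuming a domain dictionary is represented as a nested mapping from semantic unit to per-category weights. The dictionary entries below are invented for illustration (they echo the "computer" example above) and are not values from the patent.

```python
from collections import defaultdict

def topic_weights(units, domain_dict):
    """Sum the per-category weights of all units; a missing (unit, category) counts as zero."""
    totals = defaultdict(float)
    for unit in units:
        for category, weight in domain_dict.get(unit, {}).items():
            totals[category] += weight
    return dict(totals)

question_dict = {
    "computer": {"computing": 15, "education": 30, "medicine": 10},
    "expert":   {"computing": 8,  "medicine": 5},
}
print(topic_weights(["computer", "expert"], question_dict))
# → {'computing': 23.0, 'education': 30.0, 'medicine': 15.0}
```

The same function applies unchanged to answers with the answer-domain dictionary.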
Step S30: using the obtained topic weights of the question and of each answer, compute the topic similarity between each answer and the question, and recommend answers according to the computed topic similarities.
The per-category topic weights of the question and of each answer, computed in step S20, are used to compute the topic similarity of each answer and the question.
The topic similarity of an answer and the question can be computed, without limitation, from products of the question's and the answer's topic weights. Specifically, the topic similarity under each category is computed first, and the maximum computed value is taken as the topic similarity of the answer and the question, i.e.:

sim(query, ans) = Max_j { weight(query, C_j) × weight(ans, C_j) }

where sim(query, ans) is the topic similarity of the answer and the question, weight(query, C_j) is the question's topic weight in category C_j, and weight(ans, C_j) is the answer's topic weight in category C_j.
After the per-category topic weights of the question and of an answer have been computed, only the topic weights of the question's and the answer's top five categories need be used in the similarity computation.
If the question's highest topic weight is 0, the question has no clear topic, and the topic similarity of the question-answer pair cannot be computed; in that case an existing semantic-relevance measure is used to assess the relevance of the pair.
Likewise, if an answer's highest topic weight is 0, the answer has no clear topic and its topic similarity with the question cannot be computed; an existing semantic-relevance measure is used instead.
The question's and an answer's weights in each corresponding category are multiplied to give the topic relevance under that category, and the maximum product is taken as the topic relevance of the answer and the question.
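A minimal sketch of this step, under the same assumption as above that topic weights are held in per-category mappings; the maximum over per-category products follows the sim(query, ans) formula given earlier, and the zero fallback stands in for the existing semantic-relevance measure mentioned above.

```python
def topic_similarity(q_weights, a_weights):
    """Max over shared categories of the product of question and answer topic weights."""
    shared = set(q_weights) & set(a_weights)
    if not shared:
        return 0.0  # no common category: caller falls back to ordinary semantic relevance
    return max(q_weights[c] * a_weights[c] for c in shared)

print(topic_similarity({"computing": 23, "education": 30},
                       {"computing": 2, "medicine": 4}))
# → 46 (computing is the only shared category: 23 × 2)
```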
The above computation yields the topic relevance of question-answer pairs, as illustrated in Table 1 below:
Table 1
(The contents of Table 1 appear only as images in the source text and are not recoverable.)
From the topic relevance of question and answer, pairs sharing the same topic can be identified well, and topic-similarity judgments of comparatively high weight can be produced. This provides an effective means of judging Q&A quality from the topical content of the text, so that answers can be recommended more accurately.
The methods for building the question-domain dictionary and the answer-domain dictionary in advance are described below with reference to Fig. 2 and Fig. 3.
Fig. 2 is the flowchart of the method for building the question-domain dictionary provided by this embodiment. As shown in Fig. 2, the method comprises:
Step S401: obtain the content of the questions in the Q&A corpus, and segment it to obtain the questions' semantic units.
The text content of all questions in the Q&A corpus is obtained and segmented, and the resulting terms are filtered to remove stop words, punctuation, and the like, yielding the questions' semantic units. The processing is similar to step S10 and is not repeated here.
Step S402: filter out semantic units whose word frequency is below a preset threshold.
To improve efficiency, the semantic units are first filtered by word frequency: units whose frequency is below a preset threshold, for example fewer than 5 occurrences, are removed.
This step is optional and may be skipped when processing efficiency is not a concern.
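The optional frequency filter of step S402 can be sketched as follows; the threshold of 5 echoes the example above, and the function name is illustrative.

```python
from collections import Counter

def filter_low_frequency(units, min_count=5):
    """Drop semantic units that occur fewer than min_count times in the corpus."""
    counts = Counter(units)
    return [u for u in units if counts[u] >= min_count]

# Units below the threshold are removed before any weight computation:
corpus_units = ["computer"] * 6 + ["typo"]
print(filter_low_frequency(corpus_units))  # → ['computer', 'computer', 'computer', 'computer', 'computer', 'computer']
```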
Step S403: compute the weight of each of the questions' semantic units in each category.
The weight of a semantic unit in each category is computed from one or any combination of the following:
the variability of the unit's word frequency across categories, the frequency with which the unit appears in each category, and the unit's inverse frequency.
Taking the combination of all three as an example, the weight of a semantic unit in each category can be computed, without limitation, as the product of the three factors, i.e.:
w(token_i, C_j) = [Σ_j(p_ij − p̄_i)² / Σ_j p_ij] × (log(N/N(token_i)))² × p_ij^n
where w(token_i, C_j) denotes the weight of semantic unit token_i in category C_j.
p_ij = T_ij/L_j, where L_j is the total number of occurrences of all semantic units contained in category C_j, and T_ij is the number of occurrences of token_i in category C_j.
p̄_i = Σ_j p_ij/m, where m is the number of categories.
Σ_j(p_ij − p̄_i)² / Σ_j p_ij measures the difference of the word frequency of token_i between categories.
p_ij^n reflects the word frequency of token_i within category C_j, where n is a word-frequency influence factor. The factor n can be set according to the actual situation to adjust the influence of word frequency, e.g. n = 5.
N is the total number of occurrences of all semantic units in the corpus, N(token_i) is the number of occurrences of token_i, and log(N/N(token_i)) is the inverse word frequency of token_i. The inverse document frequency used in natural language processing corpora may be used directly instead.
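Under the definitions above, the weight computation of step S403 can be sketched as follows (a minimal sketch; the function name and toy counts are illustrative, and N(token_i) is taken here as the unit's total count across categories):

```python
import math

def category_weight(T_i, L, N, n=5):
    """Per-category weights of one semantic unit token_i.

    T_i[j]: occurrences T_ij of the unit in category C_j; L[j]: total
    occurrences L_j of all units in C_j; N: total occurrences of all
    units in the corpus; n: word-frequency influence factor (n = 5
    in the text's example).
    """
    m = len(T_i)
    p = [T_i[j] / L[j] for j in range(m)]                 # p_ij = T_ij / L_j
    p_bar = sum(p) / m                                     # mean of p_ij over categories
    diff = sum((pij - p_bar) ** 2 for pij in p) / sum(p)   # between-category difference
    # N(token_i) approximated by the unit's total count across categories
    idf = math.log(N / sum(T_i)) ** 2                      # squared inverse word frequency
    return [diff * idf * pij ** n for pij in p]

# A unit that is frequent in category 0 gets a much higher weight there.
w = category_weight(T_i=[10, 2], L=[100, 100], N=1000)
print(w[0] > w[1] > 0)  # -> True
```

The p_ij^n term makes the weight grow sharply with within-category frequency, so a unit concentrated in one category dominates there.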
Step S404: filter similar weights of each semantic unit across categories.
To distinguish how important a semantic unit is to each category, after the unit's per-category weights are computed, weights that occur repeatedly within the same weight interval must be filtered out. That is, for a given semantic unit, weights whose count within one weight interval exceeds a preset threshold are filtered out.
The weight intervals (e.g. the interval [0, 10)) are set according to the magnitudes of the unit's weights in the categories. Concretely, the following method may be, but is not limited to being, used:
divide the difference between the unit's maximum and minimum weights over all categories by the number of intervals, thereby determining the unit's weight intervals.
For example, the intervals can be determined by a heuristic rule: if the highest weight of a semantic unit over the categories is Score_max and the lowest is Score_min, the interval length is (Score_max − Score_min)/L, where L is the preset number of intervals; L = 6 is used in this embodiment. The similar-weight count threshold is M/2, where M is the number of categories in which the unit holds a weight.
For example, suppose the weights of the semantic unit "stock" in the categories are: 1: 1.65, 2: 2.32, 3: 58.62, 4: 3.12, 5: 3.62, 7: 14.82, 8: 24.31, 11: 14.85. The interval length is (58.62 − 0)/6 ≈ 10, so the weight intervals are [0, 10), [10, 20), and so on. "stock" holds weights in 8 categories, so the similar-weight count threshold is 4. The weights of "stock" in categories 1, 2, 4 and 5 all fall in the interval [0, 10), so those four weights are filtered out, leaving the weights of the four categories 3: 58.62, 7: 14.82, 8: 24.31 and 11: 14.85.
It is worth mentioning that this step may also be skipped when the requirements on processing efficiency and accuracy are not high.
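The similar-weight filtering of step S404 can be sketched as follows, reproducing the "stock" example (a sketch under the stated heuristic; following the worked example, the interval lower bound is taken as 0 and an interval holding at least the threshold M/2 of the weights is treated as over-crowded):

```python
def filter_similar_weights(weights, L=6, lo=0.0):
    """Drop categories whose weight falls in an over-crowded interval.

    `weights` maps category id -> weight.  Interval width is
    (max - lo) / L; the similar-weight count threshold is M/2, where
    M is the number of categories holding a weight.
    """
    width = (max(weights.values()) - lo) / L
    threshold = len(weights) / 2
    buckets = {}
    for cat, w in weights.items():
        buckets.setdefault(int((w - lo) // width), []).append(cat)
    crowded = {c for cats in buckets.values() if len(cats) >= threshold for c in cats}
    return {c: w for c, w in weights.items() if c not in crowded}

stock = {1: 1.65, 2: 2.32, 3: 58.62, 4: 3.12, 5: 3.62, 7: 14.82, 8: 24.31, 11: 14.85}
print(sorted(filter_similar_weights(stock)))  # -> [3, 7, 8, 11]
```

Categories 1, 2, 4 and 5 all land in the first interval and are removed, matching the worked example.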
Step S405: filter out semantic units that are single characters, repeated digit strings, or digit strings exceeding a preset length threshold.
After the per-category weights are computed, the semantic units are further filtered as follows:
Single-character semantic units, i.e. Chinese characters or words of length 1, are removed.
Digit strings exceeding the preset length threshold are removed; for example, digit strings longer than 10 are meaningless and are filtered out.
Repeated digit strings are removed; for example, digit strings with long runs of a repeated digit (such as 00001) are meaningless and are filtered out.
It is worth mentioning that the filtering of this step may also be performed before the per-category weights are computed, i.e. either before or after step S402.
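The unit filtering of step S405 can be sketched as follows (a minimal sketch; the thresholds mirror the examples above, treating a digit string longer than 10, or one containing a run of 4 or more identical digits as in 00001, as meaningless):

```python
import re

def filter_units(units, max_digits=10, min_run=4):
    """Drop single characters, over-long digit strings, and repetitive
    digit strings (steps S405/S505).  Thresholds follow the examples
    in the text; treating 00001 (run of 4 zeros) as repetitive is an
    interpretation of the example.
    """
    kept = []
    for u in units:
        if len(u) == 1:                                       # single character
            continue
        if u.isdigit():
            if len(u) > max_digits:                           # e.g. an 11-digit string
                continue
            if re.search(r"(\d)\1{%d,}" % (min_run - 1), u):  # run of >= min_run digits
                continue
        kept.append(u)
    return kept

print(filter_units(["text box", "a", "00001", "2024", "12345678901"]))
# -> ['text box', '2024']
```

A short year-like string such as "2024" survives; the single character, the repetitive string and the over-long string are all dropped.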
Step S406: assemble the semantic units and their weights in each category into the question-domain dictionary.
That is, the question-domain dictionary contains at least the semantic units and the weight of each semantic unit in each category.
Similarly, Fig. 3 is a flowchart of the method for building the answer-domain dictionary provided by this embodiment. As shown in Fig. 3, the method comprises:
Step S501: obtain the content of the answers in the question-answer corpus and segment it to obtain the semantic units of the answers.
Step S502: filter out semantic units whose word frequency is below a preset frequency threshold.
Step S503: compute the weight of each semantic unit of the answers in each category.
Step S504: filter similar weights of each semantic unit across categories; for a given semantic unit, weights whose count within one weight interval exceeds a preset threshold are filtered out.
Step S505: filter out semantic units that are single characters, repeated digit strings, or digit strings exceeding a preset length threshold.
Step S506: assemble the semantic units and their weights in each category into the answer-domain dictionary.
The processing of steps S501 to S506 is similar to that of steps S401 to S406 and is not repeated here.
The above building methods produce a question-domain dictionary and an answer-domain dictionary covering the categories, as illustrated in Tables 2 and 3 below.
Table 2

| Question-domain bigram unit | Weight | Answer-domain bigram unit | Weight |
|---|---|---|---|
| Text box | 45.226 | Control end | 51.5122 |
| Share online | 45.2149 | Mitnick | 51.3074 |
| Default gateway | 45.1803 | Stop message | 50.968 |
| Packet | 45.1551 | Click cancel | 50.8755 |
| In Java | 45.1044 | Partition table | 50.8634 |
| Excel spreadsheet | 45.0597 | Robot dog | 50.7862 |
| Enter DOS | 45.004 | Gray pigeon | 50.533 |
Table 2 shows the distribution of bigram semantic units in the question domain and the answer domain of the computer category. As can be seen from Table 2, the question-domain units mainly describe desired functions or effects, while the answer-domain units mainly describe actions to perform or techniques to apply.
Table 3

| Question-domain bigram unit | Weight | Answer-domain bigram unit | Weight |
|---|---|---|---|
| Normal value | 45.4417 | HBV antibody | 46.8926 |
| Each menstruation | 45.4238 | Superficial suggestion | 46.6657 |
| Ovarian cyst | 45.4168 | Liver function test | 46.468 |
| Pleurisy | 45.3994 | Vaccine booster | 46.3076 |
| Hepatitis B core | 45.3889 | Fish contain | 46.2249 |
Table 3 shows the distribution of bigram semantic units in the question domain and the answer domain of the medical category. As can be seen from Table 3, the question-domain units are mainly inquiries about illnesses, while the answer-domain units are mainly treatments and suggestions.
By processing questions and answers separately, the present invention better captures the semantic units that questions and answers each use within a given domain. At the same time, it fully accounts for the uneven distribution of N-gram semantic units across categories, achieving the intended goal well.
The above is a detailed description of the method provided by the present invention; the answer recommendation device provided by the present invention is described in detail below.
Embodiment 2
Fig. 4 is a schematic diagram of the answer recommendation device provided by this embodiment. As shown in Fig. 4, the device comprises:
Text acquisition module 10, configured to obtain the text content of a question and its corresponding answers, and to segment it to obtain the semantic units of the question and of each answer.
A question may have multiple corresponding answers. The text content of the question and of each answer is segmented and filtered to obtain the semantic units contained in the question and in each answer.
Text acquisition module 10 may segment the text content of the question or an answer with an existing segmentation method, such as N-gram segmentation, forward maximum matching, or reverse maximum matching. Taking N-gram segmentation as an example: unigram segmentation yields unigram semantic units such as "text", "data" and "form"; bigram segmentation yields bigram semantic units such as "text box", "packet" and "new form"; trigram segmentation yields trigram semantic units such as "multiline text box", "packet capture" and "new form download"; and so on for N-gram semantic units. An N-gram semantic unit is a sequence of N contextually adjacent lexical items in the question or answer, with no separator (word, punctuation or space) between the N consecutive items.
A question or answer may contain content in several domains. For instance, a question may comprise three domains (title, body and supplementary remarks); the text content of each domain is extracted and segmented to obtain the corresponding semantic units. The N-gram semantic units of a question or answer are thus obtained separately from the title, body and supplementary content.
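The N-gram segmentation described above can be sketched as follows (a minimal sketch; units are space-joined here for readability, whereas in Chinese text the adjacent lexical items would be concatenated directly):

```python
def ngram_units(tokens, max_n=3):
    """Build 1-gram .. max_n-gram semantic units from adjacent lexical
    items.  `tokens` is the stop-word-filtered token list of one domain
    (title, body or supplement); units never cross the list boundary.
    """
    units = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            units.append(" ".join(tokens[i:i + n]))
    return units

print(ngram_units(["text", "box", "data"], max_n=2))
# -> ['text', 'box', 'data', 'text box', 'box data']
```

Each domain of a question or answer is segmented independently, and the resulting unit lists are what the dictionary lookups below operate on.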
Topic weight computing module 20, configured to look up, in the pre-built question-domain dictionary, the weight in each category of each semantic unit of the question obtained by text acquisition module 10, and to compute the topic weight of the question in each category;
and to look up, in the pre-built answer-domain dictionary, the weight in each category of each semantic unit of each answer obtained by text acquisition module 10, and to compute the topic weight of each answer in each category.
The question-domain dictionary and the answer-domain dictionary each contain semantic units and the weight of each semantic unit in each category. The categories are a number of preset domain categories; an encyclopedia taxonomy may be adopted, e.g. computer, medicine, education, maps, songs, films and so on.
The device for building the question-domain dictionary and the answer-domain dictionary in advance from an existing question-answer corpus is described in detail later.
Using the question-domain dictionary, the weight of each semantic unit of the question in each category is looked up, and the weights of all semantic units of the question are summed per category to obtain the topic weight of the question in each category. For example, looking up the semantic unit "computer" in the question-domain dictionary might yield a weight of 15 in the computer category, 30 in the education category, and 10 in the medicine category. The weight in each category of every semantic unit of the question obtained by text acquisition module 10 is looked up in turn.
The weights of the semantic units under each category are then summed to obtain the topic weight of the question under that category. If a unit's weight in some category cannot be found, it is taken as zero. For example, if among the question's semantic units only "computer" and "master-hand" hold weights in the medicine category, the sum of their two weights is the question's topic weight in the medicine category.
Similarly, using the answer-domain dictionary, the weight of each semantic unit of each answer in each category is looked up, and the weights of all semantic units of the answer are summed per category to obtain the topic weight of the answer in each category.
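The per-category summation performed by topic weight computing module 20 can be sketched as follows (a minimal sketch; the dictionary entries echo the "computer" example above and are purely illustrative):

```python
def topic_weights(units, domain_dict):
    """Sum unit weights per category to get the topic weight of a
    question or answer.  `domain_dict` maps unit -> {category: weight};
    units missing from the dictionary (or from a category) contribute
    zero, as described in the text.
    """
    totals = {}
    for u in units:
        for cat, w in domain_dict.get(u, {}).items():
            totals[cat] = totals.get(cat, 0.0) + w
    return totals

# toy dictionary echoing the "computer" example in the text
qdict = {"computer": {"computer": 15, "education": 30, "medicine": 10},
         "master-hand": {"medicine": 5}}
print(topic_weights(["computer", "master-hand", "unknown"], qdict))
# -> {'computer': 15.0, 'education': 30.0, 'medicine': 15.0}
```

The unit "unknown" has no dictionary entry and simply contributes nothing, which is the zero-weight rule in action.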
Similarity computing module 30, configured to compute the topic similarity between each answer and the question from the topic weights of the question and of each answer obtained by topic weight computing module 20, and to recommend answers according to the topic-similarity results.
The topic similarity between an answer and the question is computed from the per-category topic weights of the question and of the answer computed by topic weight computing module 20.
The topic similarity may be, but is not limited to being, computed as a product of the topic weights of the question and the answer. Concretely, the topic similarity of the answer and the question under each category is computed first, and the maximum over the categories is taken as the topic similarity of the answer and the question, that is:
sim(query, ans) = Max_j{weight(query, C_j) × weight(ans, C_j)}
where sim(query, ans) is the topic similarity between the answer and the question, weight(query, C_j) is the topic weight of the question in category C_j, and weight(ans, C_j) is the topic weight of the answer in category C_j.
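A sketch of this topic-similarity computation (illustrative names; returning None stands in for falling back to an ordinary semantic relevance measure when either side has no clear topic, and `top_k` stands in for restricting to the top categories):

```python
def topic_similarity(q_weights, a_weights, top_k=None):
    """sim(query, ans) = max_j weight(query, C_j) * weight(ans, C_j).

    Returns None when either side has no clear topic (all weights
    zero or empty).  `top_k`, if given, restricts the products to the
    question's top-k categories.
    """
    if not q_weights or max(q_weights.values()) == 0 or \
       not a_weights or max(a_weights.values()) == 0:
        return None  # no clear topic: caller falls back to semantic relevance
    cats = sorted(q_weights, key=q_weights.get, reverse=True)
    if top_k is not None:
        cats = cats[:top_k]
    return max(q_weights[c] * a_weights.get(c, 0.0) for c in cats)

q = {"computer": 45.0, "education": 3.0}
a = {"computer": 51.0, "medicine": 2.0}
print(topic_similarity(q, a))  # -> 2295.0
```

The shared "computer" category dominates the product, so a question and answer on the same topic score far above a mismatched pair.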
Similarity computing module 30 may restrict the similarity computation to the topic weights of the question and the answer in their top 5 categories as computed by topic weight computing module 20.
If the question's highest topic weight is 0, no clear topic can be determined for the question and the topic similarity of the question-answer pair cannot be computed; in this case an existing semantic relevance measure is used to assess the relevance of the pair.
Likewise, if an answer's highest topic weight is 0, no clear topic can be determined for that answer and its topic similarity with the question cannot be computed; an existing semantic relevance measure is then used instead.
The product of the question's and the answer's weights in a category serves as the topic relevance for that category, and the maximum product is taken as the topic relevance between the answer and the question.
The topic relevance between a question and its answers makes it possible to identify question-answer pairs that share the same topic and to assign them a comparatively high topic-similarity score. This provides an effective means of judging question-answer quality from the subject domain of the text content, so that answers can be recommended more accurately.
The devices for building the question-domain dictionary and the answer-domain dictionary in advance are described below with reference to Fig. 5 and Fig. 6.
Fig. 5 is a schematic diagram of the device for building the question-domain dictionary provided by this embodiment. As shown in Fig. 5, it comprises:
Question acquisition submodule 401, configured to obtain the content of the questions in the question-answer corpus and segment it to obtain the semantic units of the questions.
The text content of all questions in the question-answer corpus is obtained and segmented, and the resulting lexical items are filtered to remove stop words, punctuation and the like, yielding the semantic units of the questions. The concrete processing is similar to that of text acquisition module 10 and is not repeated here.
Word-frequency filtering submodule 402, configured to filter out semantic units whose word frequency is below a preset frequency threshold.
To improve efficiency, the semantic units are first filtered by word frequency: units whose frequency is below the preset threshold are discarded. For example, semantic units occurring fewer than 5 times are removed.
Of course, this submodule is optional and may be omitted when processing efficiency is not critical.
First weight computing submodule 403, configured to compute the weight of each semantic unit of the questions in each category.
The weight of a semantic unit in each category is computed from one of the following quantities, or any combination of them:
the difference of the unit's word frequency between categories, the unit's word frequency within each category, or the unit's inverse word frequency.
Taking the combination of all three as an example, the weight of a semantic unit in a category may be, but is not limited to, the product of the three quantities, that is:
w(token_i, C_j) = [Σ_j(p_ij − p̄_i)² / Σ_j p_ij] × (log(N/N(token_i)))² × p_ij^n
where w(token_i, C_j) denotes the weight of semantic unit token_i in category C_j.
p_ij = T_ij/L_j, where L_j is the total number of occurrences of all semantic units contained in category C_j, and T_ij is the number of occurrences of token_i in category C_j.
p̄_i = Σ_j p_ij/m, where m is the number of categories.
Σ_j(p_ij − p̄_i)² / Σ_j p_ij measures the difference of the word frequency of token_i between categories.
p_ij^n reflects the word frequency of token_i within category C_j, where n is a word-frequency influence factor. The factor n can be set according to the actual situation to adjust the influence of word frequency, e.g. n = 5.
N is the total number of occurrences of all semantic units in the corpus, N(token_i) is the number of occurrences of token_i, and log(N/N(token_i)) is the inverse word frequency of token_i. The inverse document frequency used in natural language processing corpora may be used directly instead.
Weight filtering submodule 404, configured to filter similar weights of each semantic unit across categories.
To distinguish how important a semantic unit is to each category, after the unit's per-category weights are computed, weights that occur repeatedly within the same weight interval must be filtered out. That is, for a given semantic unit, weights whose count within one weight interval exceeds a preset threshold are filtered out.
The weight intervals (e.g. the interval [0, 10)) are set according to the magnitudes of the unit's weights in the categories. Concretely, the following method may be, but is not limited to being, used:
divide the difference between the unit's maximum and minimum weights over all categories by the number of intervals, thereby determining the unit's weight intervals.
For example, the intervals can be determined by a heuristic rule: if the highest weight of a semantic unit over the categories is Score_max and the lowest is Score_min, the interval length is (Score_max − Score_min)/L, where L is the preset number of intervals; L = 6 is used in this embodiment. The similar-weight count threshold is M/2, where M is the number of categories in which the unit holds a weight.
For example, suppose the weights of the semantic unit "stock" in the categories are: 1: 1.65, 2: 2.32, 3: 58.62, 4: 3.12, 5: 3.62, 7: 14.82, 8: 24.31, 11: 14.85. The interval length is (58.62 − 0)/6 ≈ 10, so the weight intervals are [0, 10), [10, 20), and so on. "stock" holds weights in 8 categories, so the similar-weight count threshold is 4. The weights of "stock" in categories 1, 2, 4 and 5 all fall in the interval [0, 10), so those four weights are filtered out, leaving the weights of the four categories 3: 58.62, 7: 14.82, 8: 24.31 and 11: 14.85.
It is worth mentioning that this submodule may also be omitted when the requirements on processing efficiency and accuracy are not high.
Semantic-unit filtering submodule 405, configured to filter out semantic units that are single characters, repeated digit strings, or digit strings exceeding a preset length threshold.
Semantic-unit filtering submodule 405 filters the semantic units as follows:
Single-character semantic units, i.e. Chinese characters or words of length 1, are removed.
Digit strings exceeding the preset length threshold are removed; for example, digit strings longer than 10 are meaningless and are filtered out.
Repeated digit strings are removed; for example, digit strings with long runs of a repeated digit (such as 00001) are meaningless and are filtered out.
It is worth mentioning that this submodule may also be arranged before first weight computing submodule 403, specifically either before or after word-frequency filtering submodule 402.
First assembling submodule 406, configured to assemble the semantic units and their weights in each category into the question-domain dictionary. That is, the question-domain dictionary contains at least the semantic units and the weight of each semantic unit in each category.
Similarly, Fig. 6 is a schematic diagram of the device for building the answer-domain dictionary provided by this embodiment. As shown in Fig. 6, it comprises:
Answer acquisition submodule 501, configured to obtain the content of the answers in the question-answer corpus and segment it to obtain the semantic units of the answers.
Word-frequency filtering submodule 502, configured to filter out semantic units whose word frequency is below a preset frequency threshold.
Second weight computing submodule 503, configured to compute the weight of each semantic unit of the answers in each category.
Weight filtering submodule 504, configured to filter similar weights of each semantic unit across categories; for a given semantic unit, weights whose count within one weight interval exceeds a preset threshold are filtered out.
Semantic-unit filtering submodule 505, configured to filter out semantic units that are single characters, repeated digit strings, or digit strings exceeding a preset length threshold.
Second assembling submodule 506, configured to assemble the semantic units and their weights in each category into the answer-domain dictionary.
The arrangement of submodules 501 to 506 is similar to that of submodules 401 to 406 and is not repeated here.
The above building devices produce a question-domain dictionary and an answer-domain dictionary covering the categories, as illustrated in Tables 2 and 3 above.
The answer recommendation method and device provided by the present invention build, from a question-answer corpus, a question-domain dictionary and an answer-domain dictionary covering the categories, thereby expanding the domain vocabulary by which question-answer pairs are expressed. This effectively improves the accuracy of question-answer semantic similarity, solves the matching inaccuracy that arises when a question and an answer describe the same topic with different words, and improves the recall rate. The present invention can be used for answer recommendation in various online interactive question-answer communities, domain-relevant content recommendation, search-result recommendation and the like.
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (22)

1. An answer recommendation method, characterized by comprising:
S1: obtaining the text content of a question and its corresponding answers, and segmenting it to obtain the semantic units of the question and of each answer;
S2: using a pre-built question-domain dictionary, looking up the weight in each category of each semantic unit of the question, and computing the topic weight of the question in each category;
and
using a pre-built answer-domain dictionary, looking up the weight in each category of each semantic unit of each answer, and computing the topic weight of each answer in each category;
S3: computing, from the obtained topic weights of the question and of each answer, the topic similarity between each answer and the question, and recommending answers according to the results of the topic-similarity computation.
2. The method according to claim 1, characterized in that building the question-domain dictionary comprises:
obtaining the content of the questions in a question-answer corpus, and segmenting it to obtain the semantic units of the questions;
computing the weight of each semantic unit of the questions in each category;
assembling the semantic units and their weights in each category into the question-domain dictionary.
3. The method according to claim 1, characterized in that building the answer-domain dictionary comprises:
obtaining the content of the answers in a question-answer corpus, and segmenting it to obtain the semantic units of the answers;
computing the weight of each semantic unit of the answers in each category;
assembling the semantic units and their weights in each category into the answer-domain dictionary.
4. The method according to claim 2 or 3, characterized in that, after obtaining the semantic units of the questions or answers, the method further comprises:
filtering out semantic units whose word frequency is below a preset frequency threshold;
computing the per-category weights only for the semantic units remaining after the filtering.
5. The method according to claim 2 or 3, characterized in that the weight of a semantic unit in each category is computed from one of the following quantities, or any combination of them:
the difference of the semantic unit's word frequency between categories, the semantic unit's word frequency within each category, or the semantic unit's inverse word frequency.
6. The method according to claim 5, characterized in that the weight of a semantic unit in a category is computed as:
w(token_i, C_j) = [Σ_j(p_ij − p̄_i)² / Σ_j p_ij] × (log(N/N(token_i)))² × p_ij^n
wherein w(token_i, C_j) denotes the weight of semantic unit token_i in category C_j;
p_ij = T_ij/L_j, where L_j is the total number of occurrences of all semantic units contained in category C_j, and T_ij is the number of occurrences of token_i in category C_j;
p̄_i = Σ_j p_ij/m, where m is the number of categories;
p_ij^n reflects the word frequency of token_i in category C_j, where n is a word-frequency influence factor;
N is the total number of occurrences of all semantic units in the corpus, and N(token_i) is the number of occurrences of token_i.
7. The method according to claim 2 or 3, characterized in that, before assembling the semantic units and their weights in each category into the question-domain dictionary or the answer-domain dictionary, the method further comprises:
filtering similar weights of each semantic unit across categories: for a given semantic unit, filtering out weights whose count within one weight interval exceeds a preset threshold;
using only the weights of the semantic units in the remaining categories to form the question-domain dictionary or the answer-domain dictionary.
8. The method according to claim 7, wherein the weight intervals are arranged according to the magnitudes of the weights of the semantic unit in each category.
9. The method according to claim 2 or 3, wherein before forming each semantic unit and its weights in each category into the question-domain dictionary, the method further comprises:
filtering out semantic units that are single characters, repeated-character strings, or numeric strings whose length exceeds a preset length threshold;
only the semantic units remaining after filtering are used to form the question-domain dictionary or the answer-domain dictionary.
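The noise filter of claim 9 can be sketched as follows. The exact test for a "repeated-character string" is not specified in the claim, so the rule below (all characters identical) is an assumption, as is the default length threshold.

```python
def filter_semantic_units(units, max_numeric_len=8):
    """Drop noisy semantic units per claim 9 (sketch; the repeated-character
    rule and the default threshold are assumptions)."""
    kept = []
    for u in units:
        if len(u) == 1:                        # single character
            continue
        if len(u) > 1 and len(set(u)) == 1:    # repeated-character string, e.g. "aaaa"
            continue
        if u.isdigit() and len(u) > max_numeric_len:  # over-long numeric string
            continue
        kept.append(u)
    return kept
```

Such units carry little topical signal, so removing them before weight calculation keeps the domain dictionaries compact.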
10. The method according to claim 1, wherein calculating the topic similarity between the answer and the question comprises:
calculating the topic similarity between the answer and the question under each category respectively;
selecting the maximum of the calculated topic similarities as the topic similarity between the answer and the question.
11. The method according to claim 10, wherein the topic similarity between the answer and the question is calculated as:
sim(query, ans) = Max_j { weight(query, C_j) × weight(ans, C_j) }
wherein sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
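The similarity measure of claims 10 and 11 can be sketched directly. The dictionary-based argument shape is an assumption; the max-over-categories product is taken from the claim.

```python
def topic_similarity(query_weights, ans_weights):
    """sim(query, ans) = Max_j { weight(query, C_j) * weight(ans, C_j) }
    per claim 11. Both arguments map category name -> topic weight (sketch)."""
    common = set(query_weights) & set(ans_weights)
    if not common:
        return 0.0
    return max(query_weights[c] * ans_weights[c] for c in common)
```

Taking the maximum over categories, rather than a sum, rewards an answer that matches the question strongly on its single dominant topic instead of weakly on many.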
12. An answer recommendation device, comprising:
a text acquisition module, configured to acquire a question and the text content of the answers corresponding to the question, and to obtain the semantic units of the question and the semantic units of the answers by word segmentation;
a topic weight calculation module, configured to look up, using a pre-built question-domain dictionary, the weight of each semantic unit of the question in each category, and to calculate the topic weight of the question in each category;
and configured to look up, using a pre-built answer-domain dictionary, the weight of each semantic unit of each answer in each category, and to calculate the topic weight of each answer in each category respectively;
a similarity calculation module, configured to calculate the topic similarity between each answer and the question respectively using the topic weight of the question and the topic weight of each answer obtained by the topic weight calculation module, and to recommend an answer according to the calculation result of the topic similarity.
13. The device according to claim 12, wherein the question-domain dictionary is built in advance by a question dictionary building module, and the question dictionary building module specifically comprises:
a question acquisition submodule, configured to acquire the content of the questions in a question-answer-pair corpus and to obtain the semantic units of the questions by word segmentation;
a first weight calculation submodule, configured to calculate the weight of each semantic unit of the questions in each category respectively;
a first integration submodule, configured to form each semantic unit and its weights in each category into the question-domain dictionary.
14. The device according to claim 12, wherein the answer-domain dictionary is built in advance by an answer dictionary building module, and the answer dictionary building module specifically comprises:
an answer acquisition submodule, configured to acquire the content of the answers in a question-answer-pair corpus and to obtain the semantic units of the answers by word segmentation;
a second weight calculation submodule, configured to calculate the weight of each semantic unit of the answers in each category respectively;
a second integration submodule, configured to form each semantic unit and its weights in each category into the answer-domain dictionary.
15. The device according to claim 13 or 14, wherein the question dictionary building module or the answer dictionary building module further comprises:
a word-frequency filtering submodule, configured to filter out semantic units whose word frequency is lower than a preset word-frequency threshold;
the semantic units remaining after filtering are provided to the first weight calculation submodule or the second weight calculation submodule.
16. The device according to claim 13 or 14, wherein the first weight calculation submodule or the second weight calculation submodule calculates the weight of the semantic unit in each category according to one or any combination of the following:
the difference in word frequency of the semantic unit between categories, the word frequency of the semantic unit within each category, or the inverse word frequency of the semantic unit.
17. The device according to claim 16, wherein the first weight calculation submodule or the second weight calculation submodule calculates the weight of the semantic unit in each category as:
w(token_i, C_j) = (Σ_j (p_ij − p̄_i)² / Σ_j p_ij) × (log(N / N(token_i)))² × p_ij^n
wherein w(token_i, C_j) denotes the weight of semantic unit token_i in category C_j;
p_ij = T_ij / L_j, where L_j denotes the total number of occurrences of all semantic units contained in category C_j, and T_ij denotes the number of occurrences of semantic unit token_i in category C_j;
p̄_i = Σ_j p_ij / m, where m is the number of categories;
p_ij denotes the word frequency of semantic unit token_i in category C_j, and n is the word-frequency influence factor;
N denotes the total number of occurrences of all semantic units in the corpus, and N(token_i) denotes the number of occurrences of semantic unit token_i in the corpus.
18. The device according to claim 13 or 14, wherein the question dictionary building module or the answer dictionary building module further comprises:
a weight filtering submodule, configured to perform similar-weight filtering on the weights of each semantic unit across categories: for a given semantic unit, filtering out weights whose number of occurrences within the same weight interval exceeds a preset threshold;
only the weights of the semantic unit in the remaining categories are provided to the first integration submodule or the second integration submodule to form the question-domain dictionary or the answer-domain dictionary.
19. The device according to claim 18, wherein the weight intervals are arranged according to the magnitudes of the weights of the semantic unit in each category.
20. The device according to claim 13 or 14, wherein the question dictionary building module or the answer dictionary building module further comprises:
a semantic unit filtering submodule, configured to filter out semantic units that are single characters, repeated-character strings, or numeric strings whose length exceeds a preset length threshold;
only the semantic units remaining after filtering are provided to the first integration submodule or the second integration submodule to form the question-domain dictionary or the answer-domain dictionary.
21. The device according to claim 12, wherein the similarity calculation module calculates the topic similarity between the answer and the question under each category respectively, and selects the maximum of the calculated topic similarities as the topic similarity between the answer and the question.
22. The device according to claim 21, wherein the similarity calculation module calculates the topic similarity between the answer and the question as:
sim(query, ans) = Max_j { weight(query, C_j) × weight(ans, C_j) }
wherein sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
CN201210151044.5A 2012-05-15 2012-05-15 Method and device for recommending answers Active CN103425635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210151044.5A CN103425635B (en) 2012-05-15 2012-05-15 Method and device for recommending answers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210151044.5A CN103425635B (en) 2012-05-15 2012-05-15 Method and device for recommending answers

Publications (2)

Publication Number Publication Date
CN103425635A true CN103425635A (en) 2013-12-04
CN103425635B CN103425635B (en) 2018-02-02

Family

ID=49650400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210151044.5A Active CN103425635B (en) 2012-05-15 2012-05-15 Method and apparatus are recommended in a kind of answer

Country Status (1)

Country Link
CN (1) CN103425635B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1489089A (en) * 2002-08-19 2004-04-14 松下电器产业株式会社 Document search system and question answer system
CN1790332A (en) * 2005-12-28 2006-06-21 刘文印 Display method and system for reading and browsing problem answers
CN1928864A (en) * 2006-09-22 2007-03-14 浙江大学 FAQ based Chinese natural language ask and answer method
CN101174259A (en) * 2007-09-17 2008-05-07 张琰亮 Intelligent interactive request-answering system
US20080126319A1 (en) * 2006-08-25 2008-05-29 Ohad Lisral Bukai Automated short free-text scoring method and system
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
US20090089876A1 (en) * 2007-09-28 2009-04-02 Jamie Lynn Finamore Apparatus system and method for validating users based on fuzzy logic
CN101520802A (en) * 2009-04-13 2009-09-02 腾讯科技(深圳)有限公司 Question-answer pair quality evaluation method and system

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714488A (en) * 2014-01-03 2014-04-09 无锡清华信息科学与技术国家实验室物联网技术中心 Method for optimizing question answering platform in social network
CN105005564B (en) * 2014-04-17 2019-09-03 北京搜狗科技发展有限公司 A kind of data processing method and device based on answer platform
CN105005564A (en) * 2014-04-17 2015-10-28 北京搜狗科技发展有限公司 Data processing method and apparatus based on question-and-answer platform
CN104298735B (en) * 2014-09-30 2018-06-05 北京金山安全软件有限公司 Method and device for identifying application program type
CN104298735A (en) * 2014-09-30 2015-01-21 北京金山安全软件有限公司 Method and device for identifying application program type
CN105786874A (en) * 2014-12-23 2016-07-20 北京奇虎科技有限公司 Method and device for constructing question-answer knowledge base data items based on encyclopedic entries
CN106294505B (en) * 2015-06-10 2020-07-07 华中师范大学 Answer feedback method and device
CN106294505A (en) * 2015-06-10 2017-01-04 华中师范大学 A kind of method and apparatus feeding back answer
CN106610932A (en) * 2015-10-27 2017-05-03 中兴通讯股份有限公司 Corpus processing method and device and corpus analyzing method and device
CN105740310B (en) * 2015-12-21 2019-08-02 哈尔滨工业大学 A kind of automatic answer method of abstracting and system in question answering system
CN105740310A (en) * 2015-12-21 2016-07-06 哈尔滨工业大学 Automatic answer summarizing method and system for question answering system
CN105653840A (en) * 2015-12-21 2016-06-08 青岛中科慧康科技有限公司 Similar case recommendation system based on word and phrase distributed representation, and corresponding method
CN105786793A (en) * 2015-12-23 2016-07-20 百度在线网络技术(北京)有限公司 Method and device for analyzing semanteme of spoken language text information
CN105786793B (en) * 2015-12-23 2019-05-28 百度在线网络技术(北京)有限公司 Parse the semantic method and apparatus of spoken language text information
CN107168967B (en) * 2016-03-07 2020-12-04 创新先进技术有限公司 Target knowledge point acquisition method and device
CN107168967A (en) * 2016-03-07 2017-09-15 阿里巴巴集团控股有限公司 The acquisition methods and device of object knowledge point
CN106844686A (en) * 2017-01-26 2017-06-13 武汉奇米网络科技有限公司 Intelligent customer service question and answer robot and its implementation based on SOLR
CN106997375B (en) * 2017-02-28 2020-08-18 浙江大学 Customer service reply recommendation method based on deep learning
CN106997375A (en) * 2017-02-28 2017-08-01 浙江大学 Recommendation method is replied in customer service based on deep learning
CN106997342A (en) * 2017-03-27 2017-08-01 上海奔影网络科技有限公司 Intension recognizing method and device based on many wheel interactions
CN107145573A (en) * 2017-05-05 2017-09-08 上海携程国际旅行社有限公司 The problem of artificial intelligence customer service robot, answers method and system
CN107329995A (en) * 2017-06-08 2017-11-07 北京神州泰岳软件股份有限公司 A kind of controlled answer generation method of semanteme, apparatus and system
CN107844531A (en) * 2017-10-17 2018-03-27 东软集团股份有限公司 Answer output intent, device and computer equipment
CN107844531B (en) * 2017-10-17 2020-05-22 东软集团股份有限公司 Answer output method and device and computer equipment
CN108446320A (en) * 2018-02-09 2018-08-24 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing
CN108345672A (en) * 2018-02-09 2018-07-31 平安科技(深圳)有限公司 Intelligent response method, electronic device and storage medium
WO2019153607A1 (en) * 2018-02-09 2019-08-15 平安科技(深圳)有限公司 Intelligent response method, electronic device and storage medium
CN109033318A (en) * 2018-07-18 2018-12-18 北京市农林科学院 Intelligent answer method and device
CN109033318B (en) * 2018-07-18 2020-11-27 北京市农林科学院 Intelligent question and answer method and device
CN110852094B (en) * 2018-08-01 2023-11-03 北京京东尚科信息技术有限公司 Method, apparatus and computer readable storage medium for searching target
CN110852094A (en) * 2018-08-01 2020-02-28 北京京东尚科信息技术有限公司 Method, apparatus and computer-readable storage medium for retrieving a target
CN109299478A (en) * 2018-12-05 2019-02-01 长春理工大学 Intelligent automatic question-answering method and system based on two-way shot and long term Memory Neural Networks
CN113342950A (en) * 2021-06-04 2021-09-03 北京信息科技大学 Answer selection method and system based on semantic union
CN113342950B (en) * 2021-06-04 2023-04-21 北京信息科技大学 Answer selection method and system based on semantic association

Also Published As

Publication number Publication date
CN103425635B (en) 2018-02-02

Similar Documents

Publication Publication Date Title
CN103425635A (en) Method and device for recommending answers
Waitelonis et al. Linked data enabled generalized vector space model to improve document retrieval
CN103778214B (en) A kind of item property clustering method based on user comment
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
Hartawan et al. Using vector space model in question answering system
CN105843897A (en) Vertical domain-oriented intelligent question and answer system
CN106294744A (en) Interest recognition methods and system
CN106970910A (en) A kind of keyword extracting method and device based on graph model
CN111143672B (en) Knowledge graph-based professional speciality scholars recommendation method
CN103886034A (en) Method and equipment for building indexes and matching inquiry input information of user
CN103885937A (en) Method for judging repetition of enterprise Chinese names on basis of core word similarity
Wu et al. Using relation selection to improve value propagation in a conceptnet-based sentiment dictionary
CN106126619A (en) A kind of video retrieval method based on video content and system
CN109992674B (en) Recommendation method fusing automatic encoder and knowledge graph semantic information
CN108681574A (en) A kind of non-true class quiz answers selection method and system based on text snippet
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN107193883B (en) Data processing method and system
KR20060122276A (en) Relation extraction from documents for the automatic construction of ontologies
CN110633464A (en) Semantic recognition method, device, medium and electronic equipment
CN103646099A (en) Thesis recommendation method based on multilayer drawing
CN108804595A (en) A kind of short text representation method based on word2vec
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN105630890A (en) Neologism discovery method and system based on intelligent question-answering system session history
Hasanati et al. Implementation of support vector machine with lexicon based for sentimenT ANALYSIS ON TWITter

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant