CN103425635A - Method and device for recommending answers - Google Patents

Method and device for recommending answers

Info

Publication number
CN103425635A
CN103425635A (application CN201210151044; granted as CN103425635B)
Authority
CN
China
Prior art keywords
answer
weight
classification
semantic primitive
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012101510445A
Other languages
Chinese (zh)
Other versions
CN103425635B (en)
Inventor
陈庆轩
梁丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201210151044.5A priority Critical patent/CN103425635B/en
Publication of CN103425635A publication Critical patent/CN103425635A/en
Application granted granted Critical
Publication of CN103425635B publication Critical patent/CN103425635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for recommending answers. The method includes acquiring questions and text content corresponding to the questions and segmenting to obtain semantic units of the questions and semantic units of answers, searching weights of the semantic units of the questions in different categories according to a built question domain dictionary to compute the theme weight of the questions in different categories, searching weights of the semantic units of the answers in different categories according to a built answer domain dictionary to compute the theme weight of the answers in different categories, computing the theme similarities of the various answers and the questions respectively according to the theme weight of the questions and the theme weight of the answers, and finally recommending the answers according to the computing result of the theme similarity. Compared with the prior art, the method and the device for recommending answers have the advantages that accuracy of semantic similarities between questions and answers is improved effectively and recall rate is increased since the question domain dictionary and the answer domain dictionary are generated respectively.

Description

Answer recommendation method and device
[Technical field]
The present invention relates to the field of Internet information processing technology, and in particular to an answer recommendation method and device.
[Background]
With the development of communication technology and the Internet, interactive question-and-answer (Q&A) communities such as Baidu Knows, Sina iAsk, Google Answers, Soso Ask, and Yahoo! Answers have attracted growing attention. These communities provide a platform on which netizens can interact: users can freely ask questions, browse questions, answer questions, help one another, and share knowledge. As the number of community participants grows, so does the number of candidate answers, and Q&A communities therefore commonly rank answers automatically in order to recommend preferred answers to users.
In current automatic answer ranking, text-topic analysis is mostly used to judge how well an answer satisfies a question, for example by analyzing the semantic relevance of a question-answer pair, and answers are then ranked accordingly. Topic analysis is mainly based on topic models: a text is mapped to a topic vector, and a topic vector is in turn represented by a distribution over words. Topic similarity between texts therefore reduces to similarity between topic vectors, which can be measured by cosine similarity.
Existing topic-analysis methods mostly rest on an assumption: all texts belong to the same topic space, and each topic follows the same word distribution. However, the question and the answer in a pair may use different wording. For example, in the computing domain, a question's vocabulary tends to consist of everyday or colloquial computer terms such as "computer" or "operating system", while an answer's vocabulary tends toward professional terms such as "PC" or "win7". Similarly, a user may ask about a skill in a game, while the answer describes the skill concretely without repeating the words of the question. In such cases the semantic relevance computed between answer and question by existing methods is low; a genuinely matching answer may fail to be recalled or may be ranked low, degrading the accuracy of question-answer quality judgment and preventing users from finding a preferred answer.
[Summary of the invention]
In view of this, the present invention provides an answer recommendation method and device that build a question-domain dictionary and an answer-domain dictionary separately, expanding the domain-specific wording of the question and the answers in a pair. This effectively improves the accuracy of semantic-similarity judgment between question and answer and increases the recall rate.
The specific technical solution is as follows:
An answer recommendation method comprising the following steps:
S1: obtain a question and the text content of the answers corresponding to the question, and segment them to obtain the semantic units of the question and of each answer;
S2: using a pre-built question-domain dictionary, look up the weight of each of the question's semantic units in each category and compute the question's topic weight in each category;
and, using a pre-built answer-domain dictionary, look up the weight of each of each answer's semantic units in each category and compute each answer's topic weight in each category;
S3: using the obtained topic weights of the question and of each answer, compute the topic similarity between each answer and the question, and recommend answers according to the computed topic similarities.
According to a preferred embodiment of the present invention, the question-domain dictionary is built as follows:
obtain the content of the questions in a Q&A corpus, and segment it to obtain the questions' semantic units;
compute the weight of each of the questions' semantic units in each category;
assemble the semantic units and their per-category weights into the question-domain dictionary.
According to a preferred embodiment of the present invention, the answer-domain dictionary is built as follows:
obtain the content of the answers in the Q&A corpus, and segment it to obtain the answers' semantic units;
compute the weight of each of the answers' semantic units in each category;
assemble the semantic units and their per-category weights into the answer-domain dictionary.
According to a preferred embodiment of the present invention, after the semantic units of the question or of the answers are obtained, the method further comprises:
filtering out semantic units whose word frequency is below a preset frequency threshold;
only the semantic units remaining after filtering are used when computing the per-category weights.
According to a preferred embodiment of the present invention, the weight of a semantic unit in each category is computed from one or any combination of the following:
the variability of the semantic unit's word frequency across categories, the frequency with which the semantic unit appears in each category, and the semantic unit's inverse frequency.
According to a preferred embodiment of the present invention, the weight of a semantic unit in each category is computed as:

w(token_i, C_j) = [Σ_j (p_ij − p̄_i)² / Σ_j p_ij] × (log(N / N(token_i)))² × p_ij^n

where w(token_i, C_j) is the weight of semantic unit token_i in category C_j;
p_ij = T_ij / L_j, where L_j is the total number of semantic-unit occurrences contained in category C_j and T_ij is the number of occurrences of token_i in C_j;
p̄_i = Σ_j p_ij / m, where m is the number of categories;
p_ij^n is the frequency term of token_i in C_j, with n the word-frequency influence factor;
N is the total number of semantic-unit occurrences in the corpus, and N(token_i) is the number of occurrences of token_i.
According to a preferred embodiment of the present invention, before the semantic units and their per-category weights are assembled into the question-domain or answer-domain dictionary, the method further comprises:
applying near-duplicate weight filtering across categories: for a given semantic unit, if the number of its weights falling into the same weight interval exceeds a preset threshold, those weights are filtered out;
only the unit's weights in the remaining categories are used to form the question-domain or answer-domain dictionary.
According to a preferred embodiment of the present invention, the weight intervals are arranged according to the magnitudes of the semantic unit's weights in the categories.
According to a preferred embodiment of the present invention, before the semantic units and their per-category weights are assembled into a dictionary, the method further comprises:
filtering out semantic units that are single characters, repeated-digit strings, or numeric strings exceeding a preset length threshold;
only the semantic units remaining after filtering are used to form the question-domain or answer-domain dictionary.
According to a preferred embodiment of the present invention, the topic similarity of an answer and the question is computed by:
computing the topic similarity of the answer and the question under each category;
and taking the maximum of the computed values as the topic similarity of the answer and the question.
According to a preferred embodiment of the present invention, the topic similarity of an answer and the question is computed as:

sim(query, ans) = Max_j { weight(query, C_j) × weight(ans, C_j) }

where sim(query, ans) is the topic similarity of the answer and the question, weight(query, C_j) is the question's topic weight in category C_j, and weight(ans, C_j) is the answer's topic weight in category C_j.
An answer recommendation device comprising:
a text acquisition module, which obtains a question and the text content of the answers corresponding to the question, and segments them to obtain the semantic units of the question and of each answer;
a topic-weight computation module, which uses a pre-built question-domain dictionary to look up the weight of each of the question's semantic units in each category and computes the question's topic weight in each category,
and uses a pre-built answer-domain dictionary to look up the weight of each of each answer's semantic units in each category and computes each answer's topic weight in each category;
a similarity computation module, which uses the topic weights of the question and of each answer obtained by the topic-weight computation module to compute the topic similarity between each answer and the question, and recommends answers according to the computed topic similarities.
According to a preferred embodiment of the present invention, the question-domain dictionary is built in advance by a question-dictionary building module, which comprises:
a question acquisition submodule, which obtains the content of the questions in the Q&A corpus and segments it to obtain the questions' semantic units;
a first weight-computation submodule, which computes the weight of each of the questions' semantic units in each category;
a first assembly submodule, which assembles the semantic units and their per-category weights into the question-domain dictionary.
According to a preferred embodiment of the present invention, the answer-domain dictionary is built in advance by an answer-dictionary building module, which comprises:
an answer acquisition submodule, which obtains the content of the answers in the Q&A corpus and segments it to obtain the answers' semantic units;
a second weight-computation submodule, which computes the weight of each of the answers' semantic units in each category;
a second assembly submodule, which assembles the semantic units and their per-category weights into the answer-domain dictionary.
According to a preferred embodiment of the present invention, the question-dictionary or answer-dictionary building module further comprises:
a word-frequency filtering submodule, which filters out semantic units whose word frequency is below a preset frequency threshold;
the semantic units remaining after filtering are supplied to the first or second weight-computation submodule.
According to a preferred embodiment of the present invention, the first or second weight-computation submodule computes the weight of a semantic unit in each category from one or any combination of the following:
the variability of the semantic unit's word frequency across categories, the frequency with which the semantic unit appears in each category, and the semantic unit's inverse frequency.
According to a preferred embodiment of the present invention, the first or second weight-computation submodule computes the weight of a semantic unit in each category as:

w(token_i, C_j) = [Σ_j (p_ij − p̄_i)² / Σ_j p_ij] × (log(N / N(token_i)))² × p_ij^n

where w(token_i, C_j) is the weight of semantic unit token_i in category C_j;
p_ij = T_ij / L_j, where L_j is the total number of semantic-unit occurrences contained in category C_j and T_ij is the number of occurrences of token_i in C_j;
p̄_i = Σ_j p_ij / m, where m is the number of categories;
p_ij^n is the frequency term of token_i in C_j, with n the word-frequency influence factor;
N is the total number of semantic-unit occurrences in the corpus, and N(token_i) is the number of occurrences of token_i.
According to a preferred embodiment of the present invention, the question-dictionary or answer-dictionary building module further comprises:
a weight filtering submodule, which applies near-duplicate weight filtering across categories: for a given semantic unit, if the number of its weights falling into the same weight interval exceeds a preset threshold, those weights are filtered out;
only the unit's weights in the remaining categories are supplied to the first or second assembly submodule to form the question-domain or answer-domain dictionary.
According to a preferred embodiment of the present invention, the weight intervals are arranged according to the magnitudes of the semantic unit's weights in the categories.
According to a preferred embodiment of the present invention, the question-dictionary or answer-dictionary building module further comprises:
a semantic-unit filtering submodule, which filters out semantic units that are single characters, repeated-digit strings, or numeric strings exceeding a preset length threshold;
the semantic units remaining after filtering are supplied to the first or second assembly submodule to form the question-domain or answer-domain dictionary.
According to a preferred embodiment of the present invention, the similarity computation module computes the topic similarity of each answer and the question under each category, and takes the maximum computed value as the topic similarity of the answer and the question.
According to a preferred embodiment of the present invention, the similarity computation module computes the topic similarity of an answer and the question as:

sim(query, ans) = Max_j { weight(query, C_j) × weight(ans, C_j) }

where sim(query, ans) is the topic similarity of the answer and the question, weight(query, C_j) is the question's topic weight in category C_j, and weight(ans, C_j) is the answer's topic weight in category C_j.
As can be seen from the above technical solutions, the answer recommendation method and device provided by the present invention generate a question-domain dictionary and an answer-domain dictionary separately from a Q&A corpus, thereby expanding the domain-specific wording of question-answer pairs. This effectively improves the accuracy of semantic-similarity computation for question-answer pairs, addresses the matching failures that occur when a question and its answer describe the same topic in different words, and increases the recall rate.
[Brief description of the drawings]
Fig. 1 is a flowchart of the answer recommendation method provided in Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the method for building the question-domain dictionary in Embodiment 1;
Fig. 3 is a flowchart of the method for building the answer-domain dictionary in Embodiment 1;
Fig. 4 is a schematic diagram of the answer recommendation device provided in Embodiment 2;
Fig. 5 is a schematic diagram of the question-dictionary building module in Embodiment 2;
Fig. 6 is a schematic diagram of the answer-dictionary building module in Embodiment 2.
[Detailed description]
To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described below with reference to the drawings and specific embodiments.
In the question-answering process of an interactive Q&A community, the wording used for the same topic in a question and in an answer differs with the participants' knowledge backgrounds, for example <compression software, winrar>, <slideshow, PPT>, <system software, win7>. Although the wording of such pairs differs, they have high semantic similarity against a specific domain background.
The present invention exploits this property: it builds a question-domain dictionary and an answer-domain dictionary from the words in the various categories of questions and of answers respectively, and computes the semantic similarity between question and answer per domain, so that answers can be recommended according to the similarity results.
Embodiment 1
Fig. 1 is the flowchart of the answer recommendation method provided by this embodiment. As shown in Fig. 1, the method comprises:
Step S10: obtain a question and the text content of the answers corresponding to the question, and segment them to obtain the semantic units of the question and of each answer.
A question may have multiple corresponding answers. The text content of the question and of each answer is segmented and filtered to obtain the semantic units contained in the question and in each answer.
The text content of a question or an answer can be segmented with any existing segmentation method, such as N-gram segmentation, forward maximum matching, or reverse maximum matching. Taking N-gram segmentation as an example: unigram segmentation yields unigram semantic units such as "text", "data", "form"; bigram segmentation yields bigram units such as "text box", "data packet", "new form"; trigram segmentation yields trigram units such as "multi-line text box", "packet capture", "new form download"; and so on for N-gram semantic units. An N-gram semantic unit consists of N contiguous terms in the question or answer, with no separator (other word, punctuation, or space) between them.
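As a concrete illustration, the N-gram enumeration described above can be sketched as follows. This assumes the text has already been tokenized by some word segmenter; the function and variable names (`ngram_units`, `tokens`) are illustrative, not from the patent, and units are joined without separators as the patent describes for Chinese text.

```python
def ngram_units(tokens, max_n=3):
    """Enumerate all contiguous 1- to max_n-gram semantic units of a token list."""
    units = []
    for n in range(1, max_n + 1):                  # unigrams first, then bigrams, ...
        for i in range(len(tokens) - n + 1):
            units.append("".join(tokens[i:i + n]))  # N-grams join with no separator
    return units

print(ngram_units(["text", "box", "download"], max_n=2))
# → ['text', 'box', 'download', 'textbox', 'boxdownload']
```

For English text one would join with spaces instead; the joining convention is a detail of the segmenter, not of the method itself.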
A question or an answer may contain content in several fields. For example, a question may comprise three fields: a title, a body, and a supplementary note. The text content of each field is extracted and segmented separately, so that the N-gram semantic units of the question or answer are obtained per field (title, body, and supplementary content).
For example, a user posts the question:
"Ask a computer expert
The things my computer downloaded before have disappeared after a restart, but I didn't delete anything. What happened?"
This question comprises the title "Ask a computer expert" and the body text quoted above. Taking the title as an example, its segmentation yields the unigram semantic units "ask", "computer", "expert", the bigram units "ask computer" and "computer expert", and the trigram unit "ask computer expert".
Step S20: using the pre-built question-domain dictionary, look up the weight of each of the question's semantic units in each category and compute the question's topic weight in each category; and, using the pre-built answer-domain dictionary, look up the weight of each of each answer's semantic units in each category and compute each answer's topic weight in each category.
The question-domain and answer-domain dictionaries contain semantic units and the weight of each unit in each category. The categories are a preset set of domain classes, for which an encyclopedia taxonomy can be adopted, e.g. computing, medicine, education, maps, songs, films.
The specific process of building the question-domain dictionary and the answer-domain dictionary in advance from an existing Q&A corpus is described in detail later.
Using the question-domain dictionary, the weight of each of the question's semantic units in each category is looked up, and the weights of all the question's units are summed per category to obtain the question's topic weight in each category. For example, looking up the semantic unit "computer" in the question-domain dictionary might yield a weight of 15 in the computing category, 30 in the education category, and 10 in the medicine category. Each of the question's semantic units obtained in step S10 is looked up in turn.
Per category, the units' weights under that category are summed to give the question's topic weight under that category. If a unit's weight under some category cannot be found, it is taken as zero. For example, if among the question's semantic units only "computer" and "expert" have weights in the medicine category, the question's topic weight in medicine is the sum of the weights of "computer" and "expert".
Likewise, using the answer-domain dictionary, the weight of each of an answer's semantic units in each category is looked up, and the weights of all the answer's units are summed per category to obtain the answer's topic weight in each category.
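The per-category summation above can be sketched as follows, assuming a domain dictionary is represented as a nested mapping from semantic unit to per-category weights. The dictionary entries below are invented for illustration (they echo the "computer" example above) and are not values from the patent.

```python
from collections import defaultdict

def topic_weights(units, domain_dict):
    """Sum the per-category weights of all units; a missing (unit, category) counts as zero."""
    totals = defaultdict(float)
    for unit in units:
        for category, weight in domain_dict.get(unit, {}).items():
            totals[category] += weight
    return dict(totals)

question_dict = {
    "computer": {"computing": 15, "education": 30, "medicine": 10},
    "expert":   {"computing": 8,  "medicine": 5},
}
print(topic_weights(["computer", "expert"], question_dict))
# → {'computing': 23.0, 'education': 30.0, 'medicine': 15.0}
```

The same function applies unchanged to answers with the answer-domain dictionary.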
Step S30: using the obtained topic weights of the question and of each answer, compute the topic similarity between each answer and the question, and recommend answers according to the computed topic similarities.
The per-category topic weights of the question and of each answer, computed in step S20, are used to compute the topic similarity of each answer and the question.
The topic similarity of an answer and the question can be computed, without limitation, from products of the question's and the answer's topic weights. Specifically, the topic similarity under each category is computed first, and the maximum computed value is taken as the topic similarity of the answer and the question, i.e.:

sim(query, ans) = Max_j { weight(query, C_j) × weight(ans, C_j) }

where sim(query, ans) is the topic similarity of the answer and the question, weight(query, C_j) is the question's topic weight in category C_j, and weight(ans, C_j) is the answer's topic weight in category C_j.
After the per-category topic weights of the question and of an answer have been computed, only the topic weights of the question's and the answer's top five categories need be used in the similarity computation.
If the question's highest topic weight is 0, the question has no clear topic, and the topic similarity of the question-answer pair cannot be computed; in that case an existing semantic-relevance measure is used to assess the relevance of the pair.
Likewise, if an answer's highest topic weight is 0, the answer has no clear topic and its topic similarity with the question cannot be computed; an existing semantic-relevance measure is used instead.
The question's and an answer's weights in each corresponding category are multiplied to give the topic relevance under that category, and the maximum product is taken as the topic relevance of the answer and the question.
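A minimal sketch of this step, under the same assumption as above that topic weights are held in per-category mappings; the maximum over per-category products follows the sim(query, ans) formula given earlier, and the zero fallback stands in for the existing semantic-relevance measure mentioned above.

```python
def topic_similarity(q_weights, a_weights):
    """Max over shared categories of the product of question and answer topic weights."""
    shared = set(q_weights) & set(a_weights)
    if not shared:
        return 0.0  # no common category: caller falls back to ordinary semantic relevance
    return max(q_weights[c] * a_weights[c] for c in shared)

print(topic_similarity({"computing": 23, "education": 30},
                       {"computing": 2, "medicine": 4}))
# → 46 (computing is the only shared category: 23 × 2)
```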
The above computation yields the topic relevance of question-answer pairs, as illustrated in Table 1 below:
Table 1
(The contents of Table 1 appear only as images in the source text and are not recoverable.)
From the topic relevance of question and answer, pairs sharing the same topic can be identified well, and topic-similarity judgments of comparatively high weight can be produced. This provides an effective means of judging Q&A quality from the topical content of the text, so that answers can be recommended more accurately.
The methods for building the question-domain dictionary and the answer-domain dictionary in advance are described below with reference to Fig. 2 and Fig. 3.
Fig. 2 is the flowchart of the method for building the question-domain dictionary provided by this embodiment. As shown in Fig. 2, the method comprises:
Step S401: obtain the content of the questions in the Q&A corpus, and segment it to obtain the questions' semantic units.
The text content of all questions in the Q&A corpus is obtained and segmented, and the resulting terms are filtered to remove stop words, punctuation, and the like, yielding the questions' semantic units. The processing is similar to step S10 and is not repeated here.
Step S402: filter out semantic units whose word frequency is below a preset threshold.
To improve efficiency, the semantic units are first filtered by word frequency: units whose frequency is below a preset threshold, for example fewer than 5 occurrences, are removed.
This step is optional and may be skipped when processing efficiency is not a concern.
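The optional frequency filter of step S402 can be sketched as follows; the threshold of 5 echoes the example above, and the function name is illustrative.

```python
from collections import Counter

def filter_low_frequency(units, min_count=5):
    """Drop semantic units that occur fewer than min_count times in the corpus."""
    counts = Counter(units)
    return [u for u in units if counts[u] >= min_count]

# Units below the threshold are removed before any weight computation:
corpus_units = ["computer"] * 6 + ["typo"]
print(filter_low_frequency(corpus_units))  # → ['computer', 'computer', 'computer', 'computer', 'computer', 'computer']
```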
Step S403: compute the weight of each of the questions' semantic units in each category.
The weight of a semantic unit in each category is computed from one or any combination of the following:
the variability of the unit's word frequency across categories, the frequency with which the unit appears in each category, and the unit's inverse frequency.
Taking the combination of all three as an example, the weight of a semantic unit in each category can be computed, without limitation, as the product of the three factors, i.e.:
w(token_i, C_j) = [Σ_j(p_ij − p̄_i)² / Σ_j p_ij] × (log(N/N(token_i)))² × p_ij^n
where w(token_i, C_j) denotes the weight of semantic unit token_i in category C_j.
p_ij = T_ij/L_j, where L_j is the total number of occurrences of all semantic units contained in category C_j, and T_ij is the number of occurrences of token_i in category C_j.
p̄_i = Σ_j p_ij/m, where m is the number of categories.
Σ_j(p_ij − p̄_i)² / Σ_j p_ij measures the difference of the word frequency of token_i between categories.
p_ij^n reflects the word frequency of token_i within category C_j, where n is a word-frequency influence factor. The factor n can be set according to the actual situation to adjust the influence of word frequency, e.g. n = 5.
N is the total number of occurrences of all semantic units in the corpus, N(token_i) is the number of occurrences of token_i, and log(N/N(token_i)) is the inverse word frequency of token_i. The inverse document frequency used in natural language processing corpora may be used directly instead.
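Under the definitions above, the weight computation of step S403 can be sketched as follows (a minimal sketch; the function name and toy counts are illustrative, and N(token_i) is taken here as the unit's total count across categories):

```python
import math

def category_weight(T_i, L, N, n=5):
    """Per-category weights of one semantic unit token_i.

    T_i[j]: occurrences T_ij of the unit in category C_j; L[j]: total
    occurrences L_j of all units in C_j; N: total occurrences of all
    units in the corpus; n: word-frequency influence factor (n = 5
    in the text's example).
    """
    m = len(T_i)
    p = [T_i[j] / L[j] for j in range(m)]                 # p_ij = T_ij / L_j
    p_bar = sum(p) / m                                     # mean of p_ij over categories
    diff = sum((pij - p_bar) ** 2 for pij in p) / sum(p)   # between-category difference
    # N(token_i) approximated by the unit's total count across categories
    idf = math.log(N / sum(T_i)) ** 2                      # squared inverse word frequency
    return [diff * idf * pij ** n for pij in p]

# A unit that is frequent in category 0 gets a much higher weight there.
w = category_weight(T_i=[10, 2], L=[100, 100], N=1000)
print(w[0] > w[1] > 0)  # -> True
```

The p_ij^n term makes the weight grow sharply with within-category frequency, so a unit concentrated in one category dominates there.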
Step S404: filter similar weights of each semantic unit across categories.
To distinguish how important a semantic unit is to each category, after the unit's per-category weights are computed, weights that occur repeatedly within the same weight interval must be filtered out. That is, for a given semantic unit, weights whose count within one weight interval exceeds a preset threshold are filtered out.
The weight intervals (e.g. the interval [0, 10)) are set according to the magnitudes of the unit's weights in the categories. Concretely, the following method may be, but is not limited to being, used:
divide the difference between the unit's maximum and minimum weights over all categories by the number of intervals, thereby determining the unit's weight intervals.
For example, the intervals can be determined by a heuristic rule: if the highest weight of a semantic unit over the categories is Score_max and the lowest is Score_min, the interval length is (Score_max − Score_min)/L, where L is the preset number of intervals; L = 6 is used in this embodiment. The similar-weight count threshold is M/2, where M is the number of categories in which the unit holds a weight.
For example, suppose the weights of the semantic unit "stock" in the categories are: 1: 1.65, 2: 2.32, 3: 58.62, 4: 3.12, 5: 3.62, 7: 14.82, 8: 24.31, 11: 14.85. The interval length is (58.62 − 0)/6 ≈ 10, so the weight intervals are [0, 10), [10, 20), and so on. "stock" holds weights in 8 categories, so the similar-weight count threshold is 4. The weights of "stock" in categories 1, 2, 4 and 5 all fall in the interval [0, 10), so those four weights are filtered out, leaving the weights of the four categories 3: 58.62, 7: 14.82, 8: 24.31 and 11: 14.85.
It is worth mentioning that this step may also be skipped when the requirements on processing efficiency and accuracy are not high.
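The similar-weight filtering of step S404 can be sketched as follows, reproducing the "stock" example (a sketch under the stated heuristic; following the worked example, the interval lower bound is taken as 0 and an interval holding at least the threshold M/2 of the weights is treated as over-crowded):

```python
def filter_similar_weights(weights, L=6, lo=0.0):
    """Drop categories whose weight falls in an over-crowded interval.

    `weights` maps category id -> weight.  Interval width is
    (max - lo) / L; the similar-weight count threshold is M/2, where
    M is the number of categories holding a weight.
    """
    width = (max(weights.values()) - lo) / L
    threshold = len(weights) / 2
    buckets = {}
    for cat, w in weights.items():
        buckets.setdefault(int((w - lo) // width), []).append(cat)
    crowded = {c for cats in buckets.values() if len(cats) >= threshold for c in cats}
    return {c: w for c, w in weights.items() if c not in crowded}

stock = {1: 1.65, 2: 2.32, 3: 58.62, 4: 3.12, 5: 3.62, 7: 14.82, 8: 24.31, 11: 14.85}
print(sorted(filter_similar_weights(stock)))  # -> [3, 7, 8, 11]
```

Categories 1, 2, 4 and 5 all land in the first interval and are removed, matching the worked example.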
Step S405: filter out semantic units that are single characters, repeated digit strings, or digit strings exceeding a preset length threshold.
After the per-category weights are computed, the semantic units are further filtered as follows:
Single-character semantic units, i.e. Chinese characters or words of length 1, are removed.
Digit strings exceeding the preset length threshold are removed; for example, digit strings longer than 10 are meaningless and are filtered out.
Repeated digit strings are removed; for example, digit strings with long runs of a repeated digit (such as 00001) are meaningless and are filtered out.
It is worth mentioning that the filtering of this step may also be performed before the per-category weights are computed, i.e. either before or after step S402.
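The unit filtering of step S405 can be sketched as follows (a minimal sketch; the thresholds mirror the examples above, treating a digit string longer than 10, or one containing a run of 4 or more identical digits as in 00001, as meaningless):

```python
import re

def filter_units(units, max_digits=10, min_run=4):
    """Drop single characters, over-long digit strings, and repetitive
    digit strings (steps S405/S505).  Thresholds follow the examples
    in the text; treating 00001 (run of 4 zeros) as repetitive is an
    interpretation of the example.
    """
    kept = []
    for u in units:
        if len(u) == 1:                                       # single character
            continue
        if u.isdigit():
            if len(u) > max_digits:                           # e.g. an 11-digit string
                continue
            if re.search(r"(\d)\1{%d,}" % (min_run - 1), u):  # run of >= min_run digits
                continue
        kept.append(u)
    return kept

print(filter_units(["text box", "a", "00001", "2024", "12345678901"]))
# -> ['text box', '2024']
```

A short year-like string such as "2024" survives; the single character, the repetitive string and the over-long string are all dropped.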
Step S406: assemble the semantic units and their weights in each category into the question-domain dictionary.
That is, the question-domain dictionary contains at least the semantic units and the weight of each semantic unit in each category.
Similarly, Fig. 3 is a flowchart of the method for building the answer-domain dictionary provided by this embodiment. As shown in Fig. 3, the method comprises:
Step S501: obtain the content of the answers in the question-answer corpus and segment it to obtain the semantic units of the answers.
Step S502: filter out semantic units whose word frequency is below a preset frequency threshold.
Step S503: compute the weight of each semantic unit of the answers in each category.
Step S504: filter similar weights of each semantic unit across categories; for a given semantic unit, weights whose count within one weight interval exceeds a preset threshold are filtered out.
Step S505: filter out semantic units that are single characters, repeated digit strings, or digit strings exceeding a preset length threshold.
Step S506: assemble the semantic units and their weights in each category into the answer-domain dictionary.
The processing of steps S501 to S506 is similar to that of steps S401 to S406 and is not repeated here.
The above building methods produce a question-domain dictionary and an answer-domain dictionary covering the categories, as illustrated in Tables 2 and 3 below.
Table 2

| Question-domain bigram unit | Weight | Answer-domain bigram unit | Weight |
|---|---|---|---|
| Text box | 45.226 | Control end | 51.5122 |
| Share online | 45.2149 | Mitnick | 51.3074 |
| Default gateway | 45.1803 | Stop message | 50.968 |
| Packet | 45.1551 | Click cancel | 50.8755 |
| In Java | 45.1044 | Partition table | 50.8634 |
| Excel spreadsheet | 45.0597 | Robot dog | 50.7862 |
| Enter DOS | 45.004 | Gray pigeon | 50.533 |
Table 2 shows the distribution of bigram semantic units in the question domain and the answer domain of the computer category. As can be seen from Table 2, the question-domain units mainly describe desired functions or effects, while the answer-domain units mainly describe actions to perform or techniques to apply.
Table 3

| Question-domain bigram unit | Weight | Answer-domain bigram unit | Weight |
|---|---|---|---|
| Normal value | 45.4417 | HBV antibody | 46.8926 |
| Each menstruation | 45.4238 | Superficial suggestion | 46.6657 |
| Ovarian cyst | 45.4168 | Liver function test | 46.468 |
| Pleurisy | 45.3994 | Vaccine booster | 46.3076 |
| Hepatitis B core | 45.3889 | Fish contain | 46.2249 |
Table 3 shows the distribution of bigram semantic units in the question domain and the answer domain of the medical category. As can be seen from Table 3, the question-domain units are mainly inquiries about illnesses, while the answer-domain units are mainly treatments and suggestions.
By processing questions and answers separately, the present invention better captures the semantic units that questions and answers each use within a given domain. At the same time, it fully accounts for the uneven distribution of N-gram semantic units across categories, achieving the intended goal well.
The above is a detailed description of the method provided by the present invention; the answer recommendation device provided by the present invention is described in detail below.
Embodiment 2
Fig. 4 is a schematic diagram of the answer recommendation device provided by this embodiment. As shown in Fig. 4, the device comprises:
Text acquisition module 10, configured to obtain the text content of a question and its corresponding answers, and to segment it to obtain the semantic units of the question and of each answer.
A question may have multiple corresponding answers. The text content of the question and of each answer is segmented and filtered to obtain the semantic units contained in the question and in each answer.
Text acquisition module 10 may segment the text content of the question or an answer with an existing segmentation method, such as N-gram segmentation, forward maximum matching, or reverse maximum matching. Taking N-gram segmentation as an example: unigram segmentation yields unigram semantic units such as "text", "data" and "form"; bigram segmentation yields bigram semantic units such as "text box", "packet" and "new form"; trigram segmentation yields trigram semantic units such as "multiline text box", "packet capture" and "new form download"; and so on for N-gram semantic units. An N-gram semantic unit is a sequence of N contextually adjacent lexical items in the question or answer, with no separator (word, punctuation or space) between the N consecutive items.
A question or answer may contain content in several domains. For instance, a question may comprise three domains (title, body and supplementary remarks); the text content of each domain is extracted and segmented to obtain the corresponding semantic units. The N-gram semantic units of a question or answer are thus obtained separately from the title, body and supplementary content.
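The N-gram segmentation described above can be sketched as follows (a minimal sketch; units are space-joined here for readability, whereas in Chinese text the adjacent lexical items would be concatenated directly):

```python
def ngram_units(tokens, max_n=3):
    """Build 1-gram .. max_n-gram semantic units from adjacent lexical
    items.  `tokens` is the stop-word-filtered token list of one domain
    (title, body or supplement); units never cross the list boundary.
    """
    units = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            units.append(" ".join(tokens[i:i + n]))
    return units

print(ngram_units(["text", "box", "data"], max_n=2))
# -> ['text', 'box', 'data', 'text box', 'box data']
```

Each domain of a question or answer is segmented independently, and the resulting unit lists are what the dictionary lookups below operate on.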
Topic weight computing module 20, configured to look up, in the pre-built question-domain dictionary, the weight in each category of each semantic unit of the question obtained by text acquisition module 10, and to compute the topic weight of the question in each category;
and to look up, in the pre-built answer-domain dictionary, the weight in each category of each semantic unit of each answer obtained by text acquisition module 10, and to compute the topic weight of each answer in each category.
The question-domain dictionary and the answer-domain dictionary each contain semantic units and the weight of each semantic unit in each category. The categories are a number of preset domain categories; an encyclopedia taxonomy may be adopted, e.g. computer, medicine, education, maps, songs, films and so on.
The device for building the question-domain dictionary and the answer-domain dictionary in advance from an existing question-answer corpus is described in detail later.
Using the question-domain dictionary, the weight of each semantic unit of the question in each category is looked up, and the weights of all semantic units of the question are summed per category to obtain the topic weight of the question in each category. For example, looking up the semantic unit "computer" in the question-domain dictionary might yield a weight of 15 in the computer category, 30 in the education category, and 10 in the medicine category. The weight in each category of every semantic unit of the question obtained by text acquisition module 10 is looked up in turn.
The weights of the semantic units under each category are then summed to obtain the topic weight of the question under that category. If a unit's weight in some category cannot be found, it is taken as zero. For example, if among the question's semantic units only "computer" and "master-hand" hold weights in the medicine category, the sum of their two weights is the question's topic weight in the medicine category.
Similarly, using the answer-domain dictionary, the weight of each semantic unit of each answer in each category is looked up, and the weights of all semantic units of the answer are summed per category to obtain the topic weight of the answer in each category.
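The per-category summation performed by topic weight computing module 20 can be sketched as follows (a minimal sketch; the dictionary entries echo the "computer" example above and are purely illustrative):

```python
def topic_weights(units, domain_dict):
    """Sum unit weights per category to get the topic weight of a
    question or answer.  `domain_dict` maps unit -> {category: weight};
    units missing from the dictionary (or from a category) contribute
    zero, as described in the text.
    """
    totals = {}
    for u in units:
        for cat, w in domain_dict.get(u, {}).items():
            totals[cat] = totals.get(cat, 0.0) + w
    return totals

# toy dictionary echoing the "computer" example in the text
qdict = {"computer": {"computer": 15, "education": 30, "medicine": 10},
         "master-hand": {"medicine": 5}}
print(topic_weights(["computer", "master-hand", "unknown"], qdict))
# -> {'computer': 15.0, 'education': 30.0, 'medicine': 15.0}
```

The unit "unknown" has no dictionary entry and simply contributes nothing, which is the zero-weight rule in action.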
Similarity computing module 30, configured to compute the topic similarity between each answer and the question from the topic weights of the question and of each answer obtained by topic weight computing module 20, and to recommend answers according to the topic-similarity results.
The topic similarity between an answer and the question is computed from the per-category topic weights of the question and of the answer computed by topic weight computing module 20.
The topic similarity may be, but is not limited to being, computed as a product of the topic weights of the question and the answer. Concretely, the topic similarity of the answer and the question under each category is computed first, and the maximum over the categories is taken as the topic similarity of the answer and the question, that is:
sim(query, ans) = Max_j{weight(query, C_j) × weight(ans, C_j)}
where sim(query, ans) is the topic similarity between the answer and the question, weight(query, C_j) is the topic weight of the question in category C_j, and weight(ans, C_j) is the topic weight of the answer in category C_j.
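A sketch of this topic-similarity computation (illustrative names; returning None stands in for falling back to an ordinary semantic relevance measure when either side has no clear topic, and `top_k` stands in for restricting to the top categories):

```python
def topic_similarity(q_weights, a_weights, top_k=None):
    """sim(query, ans) = max_j weight(query, C_j) * weight(ans, C_j).

    Returns None when either side has no clear topic (all weights
    zero or empty).  `top_k`, if given, restricts the products to the
    question's top-k categories.
    """
    if not q_weights or max(q_weights.values()) == 0 or \
       not a_weights or max(a_weights.values()) == 0:
        return None  # no clear topic: caller falls back to semantic relevance
    cats = sorted(q_weights, key=q_weights.get, reverse=True)
    if top_k is not None:
        cats = cats[:top_k]
    return max(q_weights[c] * a_weights.get(c, 0.0) for c in cats)

q = {"computer": 45.0, "education": 3.0}
a = {"computer": 51.0, "medicine": 2.0}
print(topic_similarity(q, a))  # -> 2295.0
```

The shared "computer" category dominates the product, so a question and answer on the same topic score far above a mismatched pair.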
Similarity computing module 30 may restrict the similarity computation to the topic weights of the question and the answer in their top 5 categories as computed by topic weight computing module 20.
If the question's highest topic weight is 0, no clear topic can be determined for the question and the topic similarity of the question-answer pair cannot be computed; in this case an existing semantic relevance measure is used to assess the relevance of the pair.
Likewise, if an answer's highest topic weight is 0, no clear topic can be determined for that answer and its topic similarity with the question cannot be computed; an existing semantic relevance measure is then used instead.
The product of the question's and the answer's weights in a category serves as the topic relevance for that category, and the maximum product is taken as the topic relevance between the answer and the question.
The topic relevance between a question and its answers makes it possible to identify question-answer pairs that share the same topic and to assign them a comparatively high topic-similarity score. This provides an effective means of judging question-answer quality from the subject domain of the text content, so that answers can be recommended more accurately.
The devices for building the question-domain dictionary and the answer-domain dictionary in advance are described below with reference to Fig. 5 and Fig. 6.
Fig. 5 is a schematic diagram of the device for building the question-domain dictionary provided by this embodiment. As shown in Fig. 5, it comprises:
Question acquisition submodule 401, configured to obtain the content of the questions in the question-answer corpus and segment it to obtain the semantic units of the questions.
The text content of all questions in the question-answer corpus is obtained and segmented, and the resulting lexical items are filtered to remove stop words, punctuation and the like, yielding the semantic units of the questions. The concrete processing is similar to that of text acquisition module 10 and is not repeated here.
Word-frequency filtering submodule 402, configured to filter out semantic units whose word frequency is below a preset frequency threshold.
To improve efficiency, the semantic units are first filtered by word frequency: units whose frequency is below the preset threshold are discarded. For example, semantic units occurring fewer than 5 times are removed.
Of course, this submodule is optional and may be omitted when processing efficiency is not critical.
First weight computing submodule 403, configured to compute the weight of each semantic unit of the questions in each category.
The weight of a semantic unit in each category is computed from one of the following quantities, or any combination of them:
the difference of the unit's word frequency between categories, the unit's word frequency within each category, or the unit's inverse word frequency.
Taking the combination of all three as an example, the weight of a semantic unit in a category may be, but is not limited to, the product of the three quantities, that is:
w(token_i, C_j) = [Σ_j(p_ij − p̄_i)² / Σ_j p_ij] × (log(N/N(token_i)))² × p_ij^n
where w(token_i, C_j) denotes the weight of semantic unit token_i in category C_j.
p_ij = T_ij/L_j, where L_j is the total number of occurrences of all semantic units contained in category C_j, and T_ij is the number of occurrences of token_i in category C_j.
p̄_i = Σ_j p_ij/m, where m is the number of categories.
Σ_j(p_ij − p̄_i)² / Σ_j p_ij measures the difference of the word frequency of token_i between categories.
p_ij^n reflects the word frequency of token_i within category C_j, where n is a word-frequency influence factor. The factor n can be set according to the actual situation to adjust the influence of word frequency, e.g. n = 5.
N is the total number of occurrences of all semantic units in the corpus, N(token_i) is the number of occurrences of token_i, and log(N/N(token_i)) is the inverse word frequency of token_i. The inverse document frequency used in natural language processing corpora may be used directly instead.
Weight filtering submodule 404, configured to filter similar weights of each semantic unit across categories.
To distinguish how important a semantic unit is to each category, after the unit's per-category weights are computed, weights that occur repeatedly within the same weight interval must be filtered out. That is, for a given semantic unit, weights whose count within one weight interval exceeds a preset threshold are filtered out.
The weight intervals (e.g. the interval [0, 10)) are set according to the magnitudes of the unit's weights in the categories. Concretely, the following method may be, but is not limited to being, used:
divide the difference between the unit's maximum and minimum weights over all categories by the number of intervals, thereby determining the unit's weight intervals.
For example, the intervals can be determined by a heuristic rule: if the highest weight of a semantic unit over the categories is Score_max and the lowest is Score_min, the interval length is (Score_max − Score_min)/L, where L is the preset number of intervals; L = 6 is used in this embodiment. The similar-weight count threshold is M/2, where M is the number of categories in which the unit holds a weight.
For example, suppose the weights of the semantic unit "stock" in the categories are: 1: 1.65, 2: 2.32, 3: 58.62, 4: 3.12, 5: 3.62, 7: 14.82, 8: 24.31, 11: 14.85. The interval length is (58.62 − 0)/6 ≈ 10, so the weight intervals are [0, 10), [10, 20), and so on. "stock" holds weights in 8 categories, so the similar-weight count threshold is 4. The weights of "stock" in categories 1, 2, 4 and 5 all fall in the interval [0, 10), so those four weights are filtered out, leaving the weights of the four categories 3: 58.62, 7: 14.82, 8: 24.31 and 11: 14.85.
It is worth mentioning that this submodule may also be omitted when the requirements on processing efficiency and accuracy are not high.
Semantic-unit filtering submodule 405, configured to filter out semantic units that are single characters, repeated digit strings, or digit strings exceeding a preset length threshold.
Semantic-unit filtering submodule 405 filters the semantic units as follows:
Single-character semantic units, i.e. Chinese characters or words of length 1, are removed.
Digit strings exceeding the preset length threshold are removed; for example, digit strings longer than 10 are meaningless and are filtered out.
Repeated digit strings are removed; for example, digit strings with long runs of a repeated digit (such as 00001) are meaningless and are filtered out.
It is worth mentioning that this submodule may also be arranged before first weight computing submodule 403, specifically either before or after word-frequency filtering submodule 402.
First assembling submodule 406, configured to assemble the semantic units and their weights in each category into the question-domain dictionary. That is, the question-domain dictionary contains at least the semantic units and the weight of each semantic unit in each category.
Similarly, Fig. 6 is a schematic diagram of the device for building the answer-domain dictionary provided by this embodiment. As shown in Fig. 6, it comprises:
Answer acquisition submodule 501, configured to obtain the content of the answers in the question-answer corpus and segment it to obtain the semantic units of the answers.
Word-frequency filtering submodule 502, configured to filter out semantic units whose word frequency is below a preset frequency threshold.
Second weight computing submodule 503, configured to compute the weight of each semantic unit of the answers in each category.
Weight filtering submodule 504, configured to filter similar weights of each semantic unit across categories; for a given semantic unit, weights whose count within one weight interval exceeds a preset threshold are filtered out.
Semantic-unit filtering submodule 505, configured to filter out semantic units that are single characters, repeated digit strings, or digit strings exceeding a preset length threshold.
Second assembling submodule 506, configured to assemble the semantic units and their weights in each category into the answer-domain dictionary.
The arrangement of submodules 501 to 506 is similar to that of submodules 401 to 406 and is not repeated here.
The above building devices produce a question-domain dictionary and an answer-domain dictionary covering the categories, as illustrated in Tables 2 and 3 above.
The answer recommendation method and device provided by the present invention build, from a question-answer corpus, a question-domain dictionary and an answer-domain dictionary covering the categories, thereby expanding the domain vocabulary by which question-answer pairs are expressed. This effectively improves the accuracy of question-answer semantic similarity, solves the matching inaccuracy that arises when a question and an answer describe the same topic with different words, and improves the recall rate. The present invention can be used for answer recommendation in various online interactive question-answer communities, domain-relevant content recommendation, search-result recommendation and the like.
The above are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (22)

1. An answer recommendation method, characterized by comprising:
S1: obtaining the text content of a question and its corresponding answers, and segmenting it to obtain the semantic units of the question and of each answer;
S2: using a pre-built question-domain dictionary, looking up the weight in each category of each semantic unit of the question, and computing the topic weight of the question in each category;
and
using a pre-built answer-domain dictionary, looking up the weight in each category of each semantic unit of each answer, and computing the topic weight of each answer in each category;
S3: computing, from the obtained topic weights of the question and of each answer, the topic similarity between each answer and the question, and recommending answers according to the results of the topic-similarity computation.
2. The method according to claim 1, characterized in that building the question-domain dictionary comprises:
obtaining the content of the questions in a question-answer corpus, and segmenting it to obtain the semantic units of the questions;
computing the weight of each semantic unit of the questions in each category;
assembling the semantic units and their weights in each category into the question-domain dictionary.
3. The method according to claim 1, characterized in that building the answer-domain dictionary comprises:
obtaining the content of the answers in a question-answer corpus, and segmenting it to obtain the semantic units of the answers;
computing the weight of each semantic unit of the answers in each category;
assembling the semantic units and their weights in each category into the answer-domain dictionary.
4. The method according to claim 2 or 3, characterized in that, after obtaining the semantic units of the questions or answers, the method further comprises:
filtering out semantic units whose word frequency is below a preset frequency threshold;
computing the per-category weights only for the semantic units remaining after the filtering.
5. The method according to claim 2 or 3, characterized in that the weight of a semantic unit in each category is computed from one of the following quantities, or any combination of them:
the difference of the semantic unit's word frequency between categories, the semantic unit's word frequency within each category, or the semantic unit's inverse word frequency.
6. The method according to claim 5, characterized in that the weight of a semantic unit in a category is computed as:
w(token_i, C_j) = [Σ_j(p_ij − p̄_i)² / Σ_j p_ij] × (log(N/N(token_i)))² × p_ij^n
wherein w(token_i, C_j) denotes the weight of semantic unit token_i in category C_j;
p_ij = T_ij/L_j, where L_j is the total number of occurrences of all semantic units contained in category C_j, and T_ij is the number of occurrences of token_i in category C_j;
p̄_i = Σ_j p_ij/m, where m is the number of categories;
p_ij^n reflects the word frequency of token_i in category C_j, where n is a word-frequency influence factor;
N is the total number of occurrences of all semantic units in the corpus, and N(token_i) is the number of occurrences of token_i.
7. The method according to claim 2 or 3, characterized in that, before assembling the semantic units and their weights in each category into the question-domain dictionary or the answer-domain dictionary, the method further comprises:
filtering similar weights of each semantic unit across categories: for a given semantic unit, filtering out weights whose count within one weight interval exceeds a preset threshold;
using only the weights of the semantic units in the remaining categories to form the question-domain dictionary or the answer-domain dictionary.
8. The method according to claim 7, wherein the weight intervals are arranged according to the magnitudes of the weights of the semantic unit in each category.
9. The method according to claim 2 or 3, wherein before forming each semantic unit and its weights in each category into the question-domain dictionary, the method further comprises:
filtering out semantic units that are single characters, repeated-character strings, or numeric strings whose length exceeds a preset length threshold;
only the semantic units remaining after filtering are used to form the question-domain dictionary or the answer-domain dictionary.
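The noise filter of claim 9 can be sketched as follows. The exact test for a "repeated-character string" is not specified in the claim, so the rule below (all characters identical) is an assumption, as is the default length threshold.

```python
def filter_semantic_units(units, max_numeric_len=8):
    """Drop noisy semantic units per claim 9 (sketch; the repeated-character
    rule and the default threshold are assumptions)."""
    kept = []
    for u in units:
        if len(u) == 1:                        # single character
            continue
        if len(u) > 1 and len(set(u)) == 1:    # repeated-character string, e.g. "aaaa"
            continue
        if u.isdigit() and len(u) > max_numeric_len:  # over-long numeric string
            continue
        kept.append(u)
    return kept
```

Such units carry little topical signal, so removing them before weight calculation keeps the domain dictionaries compact.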
10. The method according to claim 1, wherein calculating the topic similarity between the answer and the question comprises:
calculating the topic similarity between the answer and the question under each category respectively;
selecting the maximum of the calculated topic similarities as the topic similarity between the answer and the question.
11. The method according to claim 10, wherein the topic similarity between the answer and the question is calculated as:
sim(query, ans) = Max_j { weight(query, C_j) × weight(ans, C_j) }
wherein sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
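The similarity measure of claims 10 and 11 can be sketched directly. The dictionary-based argument shape is an assumption; the max-over-categories product is taken from the claim.

```python
def topic_similarity(query_weights, ans_weights):
    """sim(query, ans) = Max_j { weight(query, C_j) * weight(ans, C_j) }
    per claim 11. Both arguments map category name -> topic weight (sketch)."""
    common = set(query_weights) & set(ans_weights)
    if not common:
        return 0.0
    return max(query_weights[c] * ans_weights[c] for c in common)
```

Taking the maximum over categories, rather than a sum, rewards an answer that matches the question strongly on its single dominant topic instead of weakly on many.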
12. An answer recommendation device, comprising:
a text acquisition module, configured to acquire a question and the text content of the answers corresponding to the question, and to obtain the semantic units of the question and the semantic units of the answers by word segmentation;
a topic weight calculation module, configured to look up, using a pre-built question-domain dictionary, the weight of each semantic unit of the question in each category, and to calculate the topic weight of the question in each category;
and configured to look up, using a pre-built answer-domain dictionary, the weight of each semantic unit of each answer in each category, and to calculate the topic weight of each answer in each category respectively;
a similarity calculation module, configured to calculate the topic similarity between each answer and the question respectively using the topic weight of the question and the topic weight of each answer obtained by the topic weight calculation module, and to recommend an answer according to the calculation result of the topic similarity.
13. The device according to claim 12, wherein the question-domain dictionary is built in advance by a question dictionary building module, and the question dictionary building module specifically comprises:
a question acquisition submodule, configured to acquire the content of the questions in a question-answer-pair corpus and to obtain the semantic units of the questions by word segmentation;
a first weight calculation submodule, configured to calculate the weight of each semantic unit of the questions in each category respectively;
a first integration submodule, configured to form each semantic unit and its weights in each category into the question-domain dictionary.
14. The device according to claim 12, wherein the answer-domain dictionary is built in advance by an answer dictionary building module, and the answer dictionary building module specifically comprises:
an answer acquisition submodule, configured to acquire the content of the answers in a question-answer-pair corpus and to obtain the semantic units of the answers by word segmentation;
a second weight calculation submodule, configured to calculate the weight of each semantic unit of the answers in each category respectively;
a second integration submodule, configured to form each semantic unit and its weights in each category into the answer-domain dictionary.
15. The device according to claim 13 or 14, wherein the question dictionary building module or the answer dictionary building module further comprises:
a word-frequency filtering submodule, configured to filter out semantic units whose word frequency is lower than a preset word-frequency threshold;
the semantic units remaining after filtering are provided to the first weight calculation submodule or the second weight calculation submodule.
16. The device according to claim 13 or 14, wherein the first weight calculation submodule or the second weight calculation submodule calculates the weight of the semantic unit in each category according to one or any combination of the following:
the difference in word frequency of the semantic unit between categories, the word frequency of the semantic unit within each category, or the inverse word frequency of the semantic unit.
17. The device according to claim 16, wherein the first weight calculation submodule or the second weight calculation submodule calculates the weight of the semantic unit in each category as:
w(token_i, C_j) = (Σ_j (p_ij − p̄_i)² / Σ_j p_ij) × (log(N / N(token_i)))² × p_ij^n
wherein w(token_i, C_j) denotes the weight of semantic unit token_i in category C_j;
p_ij = T_ij / L_j, where L_j denotes the total number of occurrences of all semantic units contained in category C_j, and T_ij denotes the number of occurrences of semantic unit token_i in category C_j;
p̄_i = Σ_j p_ij / m, where m is the number of categories;
p_ij denotes the word frequency of semantic unit token_i in category C_j, and n is the word-frequency influence factor;
N denotes the total number of occurrences of all semantic units in the corpus, and N(token_i) denotes the number of occurrences of semantic unit token_i in the corpus.
18. The device according to claim 13 or 14, wherein the question dictionary building module or the answer dictionary building module further comprises:
a weight filtering submodule, configured to perform similar-weight filtering on the weights of each semantic unit across categories: for a given semantic unit, filtering out weights whose number of occurrences within the same weight interval exceeds a preset threshold;
only the weights of the semantic unit in the remaining categories are provided to the first integration submodule or the second integration submodule to form the question-domain dictionary or the answer-domain dictionary.
19. The device according to claim 18, wherein the weight intervals are arranged according to the magnitudes of the weights of the semantic unit in each category.
20. The device according to claim 13 or 14, wherein the question dictionary building module or the answer dictionary building module further comprises:
a semantic unit filtering submodule, configured to filter out semantic units that are single characters, repeated-character strings, or numeric strings whose length exceeds a preset length threshold;
only the semantic units remaining after filtering are provided to the first integration submodule or the second integration submodule to form the question-domain dictionary or the answer-domain dictionary.
21. The device according to claim 12, wherein the similarity calculation module calculates the topic similarity between the answer and the question under each category respectively, and selects the maximum of the calculated topic similarities as the topic similarity between the answer and the question.
22. The device according to claim 21, wherein the similarity calculation module calculates the topic similarity between the answer and the question as:
sim(query, ans) = Max_j { weight(query, C_j) × weight(ans, C_j) }
wherein sim(query, ans) denotes the topic similarity between the answer and the question, weight(query, C_j) denotes the topic weight of the question in category C_j, and weight(ans, C_j) denotes the topic weight of the answer in category C_j.
CN201210151044.5A 2012-05-15 2012-05-15 Method and device for recommending answers Active CN103425635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210151044.5A CN103425635B (en) 2012-05-15 2012-05-15 Method and device for recommending answers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210151044.5A CN103425635B (en) 2012-05-15 2012-05-15 Method and device for recommending answers

Publications (2)

Publication Number Publication Date
CN103425635A true CN103425635A (en) 2013-12-04
CN103425635B CN103425635B (en) 2018-02-02

Family

ID=49650400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210151044.5A Active CN103425635B (en) 2012-05-15 2012-05-15 Method and apparatus are recommended in a kind of answer

Country Status (1)

Country Link
CN (1) CN103425635B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1489089A (en) * 2002-08-19 2004-04-14 松下电器产业株式会社 Document search system and question answer system
CN1790332A (en) * 2005-12-28 2006-06-21 刘文印 Display method and system for reading and browsing problem answers
CN1928864A (en) * 2006-09-22 2007-03-14 浙江大学 FAQ based Chinese natural language ask and answer method
CN101174259A (en) * 2007-09-17 2008-05-07 张琰亮 Intelligent interactive request-answering system
US20080126319A1 (en) * 2006-08-25 2008-05-29 Ohad Lisral Bukai Automated short free-text scoring method and system
CN101286161A (en) * 2008-05-28 2008-10-15 华中科技大学 Intelligent Chinese request-answering system based on concept
US20090089876A1 (en) * 2007-09-28 2009-04-02 Jamie Lynn Finamore Apparatus system and method for validating users based on fuzzy logic
CN101520802A (en) * 2009-04-13 2009-09-02 腾讯科技(深圳)有限公司 Question-answer pair quality evaluation method and system

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714488A (en) * 2014-01-03 2014-04-09 无锡清华信息科学与技术国家实验室物联网技术中心 Method for optimizing question answering platform in social network
CN105005564B (en) * 2014-04-17 2019-09-03 北京搜狗科技发展有限公司 A kind of data processing method and device based on answer platform
CN105005564A (en) * 2014-04-17 2015-10-28 北京搜狗科技发展有限公司 Data processing method and apparatus based on question-and-answer platform
CN104298735B (en) * 2014-09-30 2018-06-05 北京金山安全软件有限公司 Method and device for identifying application program type
CN104298735A (en) * 2014-09-30 2015-01-21 北京金山安全软件有限公司 Method and device for identifying application program type
CN105786874A (en) * 2014-12-23 2016-07-20 北京奇虎科技有限公司 Method and device for constructing question-answer knowledge base data items based on encyclopedic entries
CN106294505B (en) * 2015-06-10 2020-07-07 华中师范大学 Answer feedback method and device
CN106294505A (en) * 2015-06-10 2017-01-04 华中师范大学 A kind of method and apparatus feeding back answer
CN106610932A (en) * 2015-10-27 2017-05-03 中兴通讯股份有限公司 Corpus processing method and device and corpus analyzing method and device
CN105740310B (en) * 2015-12-21 2019-08-02 哈尔滨工业大学 A kind of automatic answer method of abstracting and system in question answering system
CN105740310A (en) * 2015-12-21 2016-07-06 哈尔滨工业大学 Automatic answer summarizing method and system for question answering system
CN105653840A (en) * 2015-12-21 2016-06-08 青岛中科慧康科技有限公司 Similar case recommendation system based on word and phrase distributed representation, and corresponding method
CN105786793A (en) * 2015-12-23 2016-07-20 百度在线网络技术(北京)有限公司 Method and device for analyzing semanteme of spoken language text information
CN105786793B (en) * 2015-12-23 2019-05-28 百度在线网络技术(北京)有限公司 Parse the semantic method and apparatus of spoken language text information
CN107168967B (en) * 2016-03-07 2020-12-04 创新先进技术有限公司 Target knowledge point acquisition method and device
CN107168967A (en) * 2016-03-07 2017-09-15 阿里巴巴集团控股有限公司 The acquisition methods and device of object knowledge point
CN106844686A (en) * 2017-01-26 2017-06-13 武汉奇米网络科技有限公司 Intelligent customer service question and answer robot and its implementation based on SOLR
CN106997375B (en) * 2017-02-28 2020-08-18 浙江大学 Customer service reply recommendation method based on deep learning
CN106997375A (en) * 2017-02-28 2017-08-01 浙江大学 Recommendation method is replied in customer service based on deep learning
CN106997342A (en) * 2017-03-27 2017-08-01 上海奔影网络科技有限公司 Intension recognizing method and device based on many wheel interactions
CN107145573A (en) * 2017-05-05 2017-09-08 上海携程国际旅行社有限公司 The problem of artificial intelligence customer service robot, answers method and system
CN107329995A (en) * 2017-06-08 2017-11-07 北京神州泰岳软件股份有限公司 A kind of controlled answer generation method of semanteme, apparatus and system
CN107844531A (en) * 2017-10-17 2018-03-27 东软集团股份有限公司 Answer output intent, device and computer equipment
CN107844531B (en) * 2017-10-17 2020-05-22 东软集团股份有限公司 Answer output method and device and computer equipment
CN108446320A (en) * 2018-02-09 2018-08-24 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing
CN108345672A (en) * 2018-02-09 2018-07-31 平安科技(深圳)有限公司 Intelligent response method, electronic device and storage medium
WO2019153607A1 (en) * 2018-02-09 2019-08-15 平安科技(深圳)有限公司 Intelligent response method, electronic device and storage medium
CN109033318A (en) * 2018-07-18 2018-12-18 北京市农林科学院 Intelligent answer method and device
CN109033318B (en) * 2018-07-18 2020-11-27 北京市农林科学院 Intelligent question and answer method and device
CN110852094B (en) * 2018-08-01 2023-11-03 北京京东尚科信息技术有限公司 Method, apparatus and computer readable storage medium for searching target
CN110852094A (en) * 2018-08-01 2020-02-28 北京京东尚科信息技术有限公司 Method, apparatus and computer-readable storage medium for retrieving a target
CN109299478A (en) * 2018-12-05 2019-02-01 长春理工大学 Intelligent automatic question-answering method and system based on two-way shot and long term Memory Neural Networks
CN113342950A (en) * 2021-06-04 2021-09-03 北京信息科技大学 Answer selection method and system based on semantic union
CN113342950B (en) * 2021-06-04 2023-04-21 北京信息科技大学 Answer selection method and system based on semantic association

Also Published As

Publication number Publication date
CN103425635B (en) 2018-02-02

Similar Documents

Publication Publication Date Title
CN103425635A (en) Method and device for recommending answers
Waitelonis et al. Linked data enabled generalized vector space model to improve document retrieval
CN103778214B (en) A kind of item property clustering method based on user comment
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
Hartawan et al. Using vector space model in question answering system
CN105843897A (en) Vertical domain-oriented intelligent question and answer system
CN106294744A (en) Interest recognition methods and system
CN106970910A (en) A kind of keyword extracting method and device based on graph model
CN111143672B (en) Knowledge graph-based professional speciality scholars recommendation method
CN103886034A (en) Method and equipment for building indexes and matching inquiry input information of user
CN103885937A (en) Method for judging repetition of enterprise Chinese names on basis of core word similarity
Wu et al. Using relation selection to improve value propagation in a conceptnet-based sentiment dictionary
CN106126619A (en) A kind of video retrieval method based on video content and system
CN109992674B (en) Recommendation method fusing automatic encoder and knowledge graph semantic information
CN108681574A (en) A kind of non-true class quiz answers selection method and system based on text snippet
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN107193883B (en) Data processing method and system
KR20060122276A (en) Relation extraction from documents for the automatic construction of ontologies
CN110633464A (en) Semantic recognition method, device, medium and electronic equipment
CN103646099A (en) Thesis recommendation method based on multilayer drawing
CN108804595A (en) A kind of short text representation method based on word2vec
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN105630890A (en) Neologism discovery method and system based on intelligent question-answering system session history
Hasanati et al. Implementation of support vector machine with lexicon based for sentimenT ANALYSIS ON TWITter

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant