CN104933097A - Data processing method and device for retrieval - Google Patents

Info

Publication number
CN104933097A
Authority
CN
China
Prior art keywords
answer
fragment
template
negative
ratio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510279830.7A
Other languages
Chinese (zh)
Other versions
CN104933097B (en)
Inventor
王石
宗明
孙兴武
蒋祥涛
张希娟
马艳军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510279830.7A
Publication of CN104933097A
Application granted
Publication of CN104933097B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data processing method and device for retrieval. The method comprises the following steps: acquiring a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative; generating, from the question and the page data, question-answer template pairs <question, answer> matching the question; extracting one or more answer fragments from the page data according to the matching degree between the question and the answer fragments in the page data; and determining whether the viewpoint of the one or more answer fragments is affirmative or negative according to the number of negation indicator words in the extracted answer fragments and the number of negation indicator words in the question. The method and device greatly increase the efficiency of processing the retrieval results of yes/no questions.

Description

Data processing method and device for retrieval
Technical Field
The present invention relates to the Internet field, and in particular to a data processing method and device for retrieval.
Background
When searching the Internet, or browsing Internet resources such as Q&A communities, forums, and encyclopedias, users often encounter questions such as "Can pregnant women eat watermelon?" or "Is it OK to mix baby formula with mineral water?". The answer to such a question is generally "yes" (affirmative) or "no" (negative); we refer to these as yes/no questions (also called YES-NO questions or polar questions). At present, an Internet user looking for the answer to a yes/no question can only obtain scattered related web pages from a search engine, and must then manually filter out irrelevant pages and analyze the viewpoints of the answers. As a result, analyzing or processing the retrieval results for such answers is inefficient.
Summary of the Invention
To solve the above technical problem, the present invention provides a data processing method and device for retrieval. For a yes/no question and the answer web pages corresponding to it, corresponding question-answer template pairs can be generated; the matching degree between the yes/no question and each answer fragment is determined from these template pairs, and answer fragments are extracted using the matching degree as the metric, which greatly improves the efficiency and accuracy of processing the retrieval results. Whether the viewpoint on the yes/no question is affirmative or negative is then determined from the extracted answer fragments, which improves the efficiency and reliability of obtaining viewpoint data for yes/no questions and lets users view the retrieval results for yes/no questions conveniently and quickly.
According to a first aspect of embodiments of the present invention, a data processing method for retrieval is provided. The method may comprise: acquiring a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative; generating, from the question and the page data, question-answer template pairs <question, answer> matching the question; extracting one or more answer fragments from the page data according to the matching degree between the question and the answer fragments in the page data, wherein the matching degree between the question and a first answer fragment in the page data is calculated as the following ratio: the weighted sum of the terms shared by each answer in the question-answer template pairs <question, answer> and the first answer fragment, as a proportion of the first answer fragment; and determining whether the viewpoint of the one or more answer fragments is affirmative or negative according to the number of negation indicator words in the extracted answer fragments and the number of negation indicator words in the question.
In certain embodiments of the present invention, the method may further comprise: computing the ratio of affirmative to negative viewpoints among the one or more answer fragments, extracting the answer fragments whose viewpoint is affirmative or negative as additional information for the ratio, and displaying the ratio and the additional information to the user.
In certain embodiments of the present invention, the method may further comprise displaying the ratio in one or more of the following forms: percentage, table, bar chart, line chart.
In certain embodiments of the present invention, generating the question-answer template pairs <question, answer> matching the question from the question and the page data may comprise: analyzing one or more first trunk structures of the question and one or more second trunk structures of one of the answer fragments of the web page data; constructing the first trunk structures and the second trunk structures into first-class question-answer template pairs <question, answer>; obtaining one or more answer fragments corresponding to a group of questions whose trunk structure is identical to the one or more first trunk structures; selecting one or more n-grams and n-skipgrams of the answer fragments corresponding to the group of questions as answer constituents; constructing the trunk structure of the selected group of questions and the trunk structure of the answer fragments corresponding to the group of questions into second-class question-answer template pairs <question, answer>; and merging the first-class and second-class question-answer template pairs <question, answer> to obtain the question-answer template pairs <question, answer>.
In certain embodiments of the present invention, the weight of a term shared by an answer in the question-answer template pairs <question, answer> and the first answer fragment is the arithmetic product of a first component and a second component, wherein the first component is the ratio of the number of occurrences of the shared term in all answers of the question-answer template pairs <question, answer> to the number of occurrences of all words in all answers of the question-answer template pairs <question, answer>, and the second component is the logarithm of the ratio of the number of all answers of the question-answer template pairs <question, answer> to the number of those answers that contain the shared term.
According to a second aspect of embodiments of the present invention, a data processing device for retrieval is provided. The device may comprise: an acquisition module, for acquiring a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative; a generation module, for generating, from the question and the page data, question-answer template pairs <question, answer> matching the question; an extraction module, for extracting one or more answer fragments from the page data according to the matching degree between the question and the answer fragments in the page data, wherein the matching degree between the question and a first answer fragment in the page data is calculated as the following ratio: the weighted sum of the terms shared by each answer in the question-answer template pairs <question, answer> and the first answer fragment, as a proportion of the first answer fragment; and a judgment module, for determining whether the viewpoint of the one or more answer fragments is affirmative or negative according to the number of negation indicator words in the extracted answer fragments and the number of negation indicator words in the question.
In certain embodiments of the present invention, the device may further comprise a display module, for computing the ratio of affirmative to negative viewpoints among the one or more answer fragments, extracting the answer fragments whose viewpoint is affirmative or negative as additional information for the ratio, and displaying the ratio and the additional information to the user.
In certain embodiments of the present invention, the display module may further be used to display the ratio in one or more of the following forms: percentage, table, bar chart, line chart.
In certain embodiments of the present invention, the generation module may be used to perform the following operations: analyzing one or more first trunk structures of the question and one or more second trunk structures of one of the answer fragments of the web page data; constructing the first trunk structures and the second trunk structures into first-class question-answer template pairs <question, answer>; obtaining one or more answer fragments corresponding to a group of questions whose trunk structure is identical to the one or more first trunk structures; selecting one or more n-grams and n-skipgrams of the answer fragments corresponding to the group of questions as answer constituents; constructing the trunk structure of the selected group of questions and the trunk structure of the answer fragments corresponding to the group of questions into second-class question-answer template pairs <question, answer>; and merging the first-class and second-class question-answer template pairs <question, answer> to obtain the question-answer template pairs <question, answer>.
In certain embodiments of the present invention, in the extraction module, the weight of a term shared by an answer in the question-answer template pairs <question, answer> and the first answer fragment is the arithmetic product of a first component and a second component, wherein the first component is the ratio of the number of occurrences of the shared term in all answers of the question-answer template pairs <question, answer> to the number of occurrences of all words in all answers of the question-answer template pairs <question, answer>, and the second component is the logarithm of the ratio of the number of all answers of the question-answer template pairs <question, answer> to the number of those answers that contain the shared term.
The method and device provided by embodiments of the present invention extract answer fragments according to the matching degree between the yes/no question and the answer fragments, which makes the retrieval result data markedly more specific to the question and improves its accuracy and reliability; perform viewpoint analysis on the extracted answer fragments, which improves the efficiency of processing retrieval results for yes/no questions and helps obtain the answer to the question efficiently; and present the viewpoint ratio for the yes/no question together with the corresponding answer fragments in a simple, intuitive display format, so that users can quickly view and compare the retrieval result data.
Brief Description of the Drawings
Fig. 1 is a schematic flow chart of a data processing method for retrieval according to one embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a data processing device for retrieval according to one embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of embodiments of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, which shows a schematic flow chart of a data processing method for retrieval according to one embodiment of the present invention, the method may comprise:
S101: acquiring a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative;
S102: generating, from the question and the page data, question-answer template pairs <question, answer> matching the question;
S103: extracting one or more answer fragments from the page data according to the matching degree between the question and the answer fragments in the page data, wherein the matching degree between the question and a first answer fragment in the page data is calculated as the following ratio: the weighted sum of the terms shared by each answer in the question-answer template pairs <question, answer> and the first answer fragment, as a proportion of the first answer fragment;
S104: determining whether the viewpoint of the one or more answer fragments is affirmative or negative according to the number of negation indicator words in the extracted answer fragments and the number of negation indicator words in the question.
In embodiments of the present invention, the question is one whose answer is generally affirmative (for example, "yes") or negative (for example, "no"), referred to herein as a yes/no question. The data processing method for retrieval of embodiments of the present invention can be used to process the retrieval results of yes/no questions.
The data processing method for retrieval of the present invention may comprise performing step S101: acquiring a yes/no question and page data containing answers to the question. The yes/no question may come from a variety of sources; for example, it may come from a search query on a search platform, or from Internet resources such as Q&A communities, forums, and encyclopedias. Correspondingly, the page data containing answers to the yes/no question may also come from a variety of sources; for example, it may come from one or more (for example, two or more) pages containing answers to the question that were retrieved by a search engine, or from the pages on which users of Q&A communities, forums, encyclopedias, and the like answered the question.
Next, step S102 is performed: question-answer template pairs <question, answer> matching the question are generated from the question obtained in step S101 and the page data containing answers to it. Specifically, this may comprise: analyzing one or more first trunk structures of the yes/no question and one or more second trunk structures of one of the answer fragments of the web page data; constructing the first trunk structures and the second trunk structures into first-class question-answer template pairs <question, answer>, also called initial question-answer template pairs <question, answer>; obtaining one or more answer fragments corresponding to a group of questions whose trunk structure is identical to the one or more first trunk structures; selecting one or more n-grams and n-skipgrams of the answer fragments corresponding to the group of questions as answer constituents; constructing the trunk structure of the group of questions and the trunk structure of the corresponding answer fragments into second-class question-answer template pairs <question, answer>, also called expanded question-answer template pairs <question, answer>; and merging the initial and expanded question-answer template pairs <question, answer> to obtain all question-answer template pairs <question, answer> matching the question.
Constructing the initial question-answer template pairs <question, answer> may comprise analyzing the trunk structure of the yes/no question, that is, the sentence trunk structure of the interrogative sentence. In addition to basic analysis results such as word segmentation, part-of-speech tagging, named-entity recognition, and term importance, the analysis of the trunk structure of a yes/no question further generalizes the word segmentation result of the question based on synonyms, hypernyms and hyponyms, and auxiliary verbs. The goal is to identify the core word and the trunk structure of the yes/no question based on the characteristics of yes/no questions. The core word of a yes/no question is the word that can be used to answer it directly. For example, for the yes/no question "Can pregnant women eat watermelon?", the core word is "can". Dependency parsing can be performed on yes/no questions; by annotating the core words in a large number of dependency parsing results, an extraction model can be trained to perform core-word identification. The sentence trunk structure refers to the constituents that express the main meaning of the question, which usually include the subject, predicate, and object. In embodiments of the present invention, a yes/no question can be analyzed at several different levels to obtain several different sentence trunk structures. For example, for the yes/no question "Can pregnant women eat watermelon?", at the three levels of part of speech, term, and semantic class, and centered on the core word "can", some of the syntactic constituents around the core word can be omitted in turn to form multiple candidate trunk structures, including:
Term level:
1. pregnant women can eat watermelon
2. can eat watermelon
3. pregnant women can eat
4. pregnant women can
5. can eat
Semantic-class level (where {...} denotes a set of synonyms or hypernyms/hyponyms):
6. {people} {can} {eat} {fruit}
7. {can} {eat} {fruit}
8. {people} {can} {eat}
9. {people} {can}
10. {can} {eat}
Part-of-speech level (where n is a noun and v is a verb):
11. n v v n
12. v v n
13. n v v
14. n v
15. v n
Mixed level:
16. pregnant women {auxiliary verb_can} {eat} watermelon
17. pregnant women {auxiliary verb_can} v {fruit}
The sentence trunk analysis of an answer fragment of the web page data (which may, for example, comprise one or more clauses) is similar to the sentence trunk analysis of the question described above and is not repeated here. Note that in the part-of-speech-level analysis, only words whose terms are identical to those in the question's sentence trunk structure may be retained, mainly because the answer to a yes/no question largely repeats terms from the question. For example, for "pregnant women {can} v {fruit}", one of the trunk structures of the question "Can pregnant women eat watermelon?", if the answer fragment is "better not eat", the trunk structures of the answer fragment may include: 1. eat; 2. {eat}; 3. v (here the term corresponding to v is "eat", consistent with the term corresponding to v in the question's trunk structure).
The one or more trunk structures obtained for the question (denoted pat_q herein) and the one or more trunk structures of the answer fragment (denoted pat_a herein) can be paired correspondingly to obtain multiple initial question-answer template pairs <pat_q, pat_a>.
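The pairing step can be illustrated with a minimal Python sketch. It assumes the "corresponding pairing" is a full Cartesian product of question trunks and answer trunks, which the text does not state explicitly; the function name `build_initial_pairs` and the sample trunk strings are hypothetical.

```python
from itertools import product


def build_initial_pairs(question_trunks, answer_trunks):
    """Pair every question trunk (pat_q) with every answer trunk (pat_a).

    Assumption: all combinations are kept; a real system might filter them.
    """
    return list(product(question_trunks, answer_trunks))


# Hypothetical trunk structures for "Can pregnant women eat watermelon?"
pat_qs = ["pregnant women can eat watermelon", "{people} {can} {eat} {fruit}"]
pat_as = ["eat", "{eat}", "v"]
pairs = build_initial_pairs(pat_qs, pat_as)
```

Each element of `pairs` is one candidate <pat_q, pat_a> tuple; later evaluation steps would decide which candidates to keep.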
In addition to the initial question-answer template pairs <pat_q, pat_a> obtained above, the question-answer template pairs can also be expanded: one or more answer fragments corresponding to a group of questions (for example, one or more questions) whose trunk structure is identical to that of the question are obtained; one or more n-grams and n-skipgrams of these answer fragments are selected as constituents of the answer; and the trunk structure of the group of questions and the trunk structure of the corresponding answer fragments are constructed into expanded question-answer template pairs <question, answer>. For example, for the trunk structure "{can} {eat}" of a yes/no question, statistics are collected from a database over the n-grams and n-skipgrams (where n may be, for example, 1, 2, or 3) in all answer fragments (for example, "do not eat", "good for {question_agent}") of yes/no questions whose sentence trunk structure is this structure, and the n-grams and n-skipgrams whose score exceeds a predetermined threshold are selected as constituents of the answer in the question-answer template pair. The importance of an n-gram can be scored quantitatively using gram_score(n-gram, q) in formula (1) below:
gram_score(n-gram,q)=tf(n-gram,q)*idf(n-gram) (1)
Here, gram_score(n-gram, q) is the importance score of the n-gram in q; * denotes arithmetic multiplication; q is the sentence of a yes/no question; an n-gram is a sequence of n consecutive words; tf(n-gram, q) = (number of occurrences of the n-gram in the answers corresponding to q) / (number of occurrences of the n-gram in the answers to all questions); idf(n-gram) = log((number of all answers corresponding to q) / (number of answers containing the n-gram)); / denotes arithmetic division; and log is the logarithm. An n-gram whose gram_score(n-gram, q) exceeds a predetermined threshold can be used as a component of the answer template in a question-answer template pair.
Similarly, replacing n-gram with n-skipgram, the importance of an n-skipgram can be assessed in the same way as gram_score(n-gram, q) above.
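Formula (1) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: answers are assumed to be pre-tokenized lists of words, the idf denominator is assumed to count only the answers corresponding to q, and a score of 0 is returned when a count is zero. The helper names and the sample data are hypothetical.

```python
import math


def ngrams(tokens, n):
    """Return all contiguous n-word sequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_occurrences(gram, answers):
    """Total number of times the n-gram occurs across a list of answers."""
    return sum(ngrams(a, len(gram)).count(gram) for a in answers)


def gram_score(gram, q_answers, all_answers):
    """Formula (1): tf(n-gram, q) * idf(n-gram)."""
    total = count_occurrences(gram, all_answers)
    # Assumption: "answers containing the n-gram" is counted among q's answers.
    containing = sum(1 for a in q_answers if gram in ngrams(a, len(gram)))
    if total == 0 or containing == 0:
        return 0.0
    tf = count_occurrences(gram, q_answers) / total
    idf = math.log(len(q_answers) / containing)
    return tf * idf


# Hypothetical tokenized answers for questions sharing the trunk "{can} {eat}"
q_answers = [["better", "not", "eat"], ["can", "eat"], ["not", "recommended"]]
all_answers = q_answers + [["do", "not", "eat", "it"]]
score = gram_score(("not", "eat"), q_answers, all_answers)
```

Here tf is 1/2 and idf is log(3/1), so the score is 0.5 * log(3); n-grams scoring above a chosen threshold would be kept as answer-template constituents.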
For the expanded question-answer template pairs, a small number of answer fragments can first be annotated manually, and a batch of question-answer template pairs learned from them by a machine learning algorithm; more answer fragments are then obtained from the learned template pairs, which in turn yield more template pairs. The learning process is iterated until the number of template pairs obtained no longer increases significantly. After each iteration, the template pairs are evaluated and only the higher-scoring answer fragments are kept, to avoid accumulating errors. For example, a template pair <pat_q, pat_a> can be evaluated by the precision of the answer fragments it retrieves. Clearly, if both the question template and the answer template of a pair are at term-level granularity with no omitted sentence constituents, the precision of the pair is high. For example, for the question "Can pregnant women eat watermelon?", the pair <pat_q = pregnant women can eat watermelon, pat_a = better not eat> has very high precision, but its generalization ability is weak and its recall is low: it can only recall sentences containing "better not eat". Recall can therefore also be considered when evaluating a pair <pat_q, pat_a>. Those skilled in the art can weigh both precision and recall and select template pairs for which both are suitable.
In the manner described above, multiple initial question-answer template pairs and multiple expanded question-answer template pairs can be obtained for a yes/no question; merging them yields the full set of question-answer template pairs for that question.
Next, step S103 is performed: one or more (that is, at least one) answer fragments are extracted from the page data according to the matching degree between the yes/no question and the answer fragments in the page data. The page data containing answers to the question may contain one or more answer fragments; some of them are selected according to their matching degree with the question. The selected answer fragments are more specific to the yes/no question, which improves the efficiency of processing retrieval results for yes/no questions and helps obtain the answer to the question efficiently. The matching degree between the question and an answer fragment in the page data can be scored quantitatively by match_score(q, a) in formula (2) below:
$$\mathrm{match\_score}(q,a)=\max\left(\frac{\sum_{w\in QAPats(q)\cap a}\mathrm{weight}(w,q)}{\sum_{w\in a}\mathrm{weight}(w,q)}\right)\qquad(2)$$
Here, match_score(q, a) is the matching degree between question q and an answer fragment a in the web page data for q; QAPats(q) is the set of question-answer template pairs of question q generated in step S102, which may contain one or more question-answer template pairs; w ∈ QAPats(q) ∩ a means that term w appears both in an answer template matching question q and in answer fragment a; w ∈ a means that term w appears in answer fragment a; and max takes the maximum value. In formula (2), for each answer template in QAPats(q), the weighted sum of the terms w it shares with answer fragment a is computed as a proportion of the whole sentence of answer fragment a; among all answer templates, the one with the largest matching ratio with a is selected, and its matching degree with a is taken as the matching degree between question q and answer fragment a. In short, formula (2) can be understood as the maximum proportion of the terms in the answer fragment that is covered by a question-answer template pair. The weight(w, q) in formula (2) is obtained by formula (3):
weight(w,q)=tf(w,q)*idf(w) (3)
Here, tf(w, q) = (number of occurrences of term w in all answer templates corresponding to q) / (number of occurrences of all words in all answer templates corresponding to q), and idf(w) = log((number of all answer templates) / (number of answer templates containing term w)).
The matching degree between question q and answer fragment a is calculated by formulas (2) and (3) above and compared with a matching-degree threshold to decide whether to extract answer fragment a: if the matching degree is greater than the threshold, the answer fragment is extracted; otherwise it is not. Besides the matching degree, which is the main criterion for selecting an answer fragment, other factors can also be considered, such as the position of the answer fragment within a paragraph of the page (beginning, middle, or end of the paragraph), whether the answer was accepted, the number of sentences in the answer, and the authority of the answer's contributor; these can be analyzed by a nonlinear regression model to decide whether the answer fragment is selected.
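Formulas (2) and (3) and the threshold test can be sketched as follows. This is a minimal sketch under stated assumptions: answer templates and fragments are pre-tokenized word lists, a term that appears in no answer template is given weight 0 (its tf is 0 under formula (3)), a zero denominator yields a score of 0, and the default threshold of 0.5 is a hypothetical value. The additional regression-based features mentioned above are not modeled.

```python
import math


def term_weights(answer_templates):
    """Formula (3): weight(w, q) = tf(w, q) * idf(w) over q's answer templates."""
    total_words = sum(len(t) for t in answer_templates)
    weights = {}
    for w in {w for t in answer_templates for w in t}:
        tf = sum(t.count(w) for t in answer_templates) / total_words
        containing = sum(1 for t in answer_templates if w in t)
        weights[w] = tf * math.log(len(answer_templates) / containing)
    return weights


def match_score(answer_templates, fragment):
    """Formula (2): best weighted coverage of the fragment by one answer template."""
    weights = term_weights(answer_templates)
    terms = sorted(set(fragment))
    # Assumption: terms absent from every template contribute weight 0.
    denom = sum(weights.get(w, 0.0) for w in terms)
    if denom == 0:
        return 0.0
    return max(
        sum(weights.get(w, 0.0) for w in terms if w in template) / denom
        for template in answer_templates
    )


def extract_fragments(answer_templates, fragments, threshold=0.5):
    """Keep fragments whose matching degree exceeds the (hypothetical) threshold."""
    return [f for f in fragments if match_score(answer_templates, f) > threshold]


templates = [["better", "not", "eat"], ["can", "eat"], ["not", "recommended"]]
covered = ["doctors", "say", "better", "not", "eat"]
unrelated = ["the", "weather", "is", "nice"]
kept = extract_fragments(templates, [covered, unrelated])
```

Note that under a literal reading of formula (3), a term occurring in every answer template has idf = log(1) = 0 and therefore contributes nothing; a production system might smooth the weights, which this sketch does not attempt.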
After the clause with the highest matching-degree score in the answer fragment is obtained by formula (2), the fragment can be expanded forward and backward from that clause, adding neighboring clauses whose score exceeds the matching-degree threshold, to form the answer fragment. Two classes of sentence also require special handling. The first class is the expansion of conditional sentences: if the highest-scoring clause is the condition clause of a conditional sentence (for example, "if ..."), the following result clause (for example, "then ...") is also included. The second class is the expansion of adversative sentences: if the highest-scoring clause is the first clause of an adversative sentence (for example, "although ..."), the following adversative clause (for example, "...") is also included.
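The forward and backward expansion around the best clause can be sketched as below. This is an illustrative sketch only: it assumes per-clause matching scores have already been computed, it stops at the first neighboring clause that does not exceed the threshold, and it omits the special handling of conditional and adversative sentences. The function name and sample scores are hypothetical.

```python
def expand_fragment(scores, threshold):
    """Return (start, end) clause indices of the span grown around the best clause.

    scores: non-empty list of per-clause matching-degree scores.
    """
    best = max(range(len(scores)), key=scores.__getitem__)
    start = end = best
    # Grow backward while the preceding clause exceeds the threshold.
    while start > 0 and scores[start - 1] > threshold:
        start -= 1
    # Grow forward while the following clause exceeds the threshold.
    while end < len(scores) - 1 and scores[end + 1] > threshold:
        end += 1
    return start, end


clause_scores = [0.1, 0.6, 0.9, 0.7, 0.2]
span = expand_fragment(clause_scores, threshold=0.5)
```

With these sample scores the best clause is at index 2 and the span grows to cover clauses 1 through 3.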
Next, step S104 is performed: whether the viewpoint of the one or more answer fragments is affirmative or negative is determined from the number of negation indicator words in the answer fragments extracted in step S103 and the number of negation indicator words in the question. Negation indicator words may include negation words (for example, "not"), negative-sentiment words (which may be, for example, verbs or adjectives), antonyms, and the like. Specifically, for the yes/no question, it is determined whether its core word carries a negation prefix; if so, the count of negation indicator words for the question is incremented by 1. If the core word of the question is an adjective or a verb, its sentiment orientation is analyzed; if the sentiment orientation of the core word is negative, the count for the question is likewise incremented by 1. For example, in the question "Are bitter almonds poisonous?", the core word "poisonous" is a negative-sentiment word. The numbers of negation prefixes and negative-sentiment words in the question are summed and denoted query_neg_cnt. Then the negation indicator words in the answer fragment are counted; for an answer fragment, negation indicator words include, in addition to negation prefixes and negative-sentiment words, antonyms of terms in the question. The numbers of negation prefixes, negative-sentiment words, and antonyms of question terms in the answer fragment are counted separately and summed, and the sum is denoted answer_neg_cnt. Once query_neg_cnt and answer_neg_cnt have been obtained, they are added together: if the sum is even, the viewpoint of the answer fragment is considered affirmative; if the sum is odd, it is considered negative.
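The parity rule on query_neg_cnt and answer_neg_cnt can be sketched as follows. The word lists are tiny hypothetical lexicons used only for illustration; a real system would rely on negation, sentiment, and antonym resources that the text does not specify, and would count question indicators on the core word as described above.

```python
# Hypothetical toy lexicons, for illustration only.
NEGATION_WORDS = {"not", "no", "never"}
NEGATIVE_SENTIMENT_WORDS = {"poisonous", "harmful"}


def count_negation_indicators(tokens, antonyms=frozenset()):
    """Count negation words, negative-sentiment words, and question-term antonyms."""
    return sum(
        1
        for t in tokens
        if t in NEGATION_WORDS or t in NEGATIVE_SENTIMENT_WORDS or t in antonyms
    )


def viewpoint(query_neg_cnt, answer_neg_cnt):
    """Even total means affirmative; odd total means negative."""
    return "affirmative" if (query_neg_cnt + answer_neg_cnt) % 2 == 0 else "negative"


# Question: "Are bitter almonds poisonous?" (core word "poisonous")
query_neg_cnt = count_negation_indicators(["poisonous"])
yes_answer = viewpoint(query_neg_cnt, count_negation_indicators(["they", "are", "poisonous"]))
no_answer = viewpoint(query_neg_cnt, count_negation_indicators(["not", "poisonous"]))
```

In the example, the question contributes one indicator; the answer "they are poisonous" contributes one (total 2, affirmative), while "not poisonous" contributes two (total 3, negative).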
In embodiments of the present invention, the method may further comprise: computing the proportions of the extracted answer fragments whose viewpoint is affirmative and whose viewpoint is negative, and extracting the corresponding answer fragments as additional information for each proportion, to serve as supporting arguments for the affirmative or the negative viewpoint. The proportions and their corresponding additional information can be presented to the user in a more intuitive form, for example as one or more of a percentage, a table, a histogram, a line chart, and the like. In some embodiments, answers may also be scored on factors such as answer fragment length, authority of the answer website, and authority of the answer provider, and the web pages corresponding to higher-scoring answer fragments are shown to the user preferentially. In some embodiments, the affirmative and negative proportions and their corresponding answer fragments may also be displayed side by side, so that the user can compare the retrieval results quickly.
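As a rough illustration of the statistics step, the sketch below computes the affirmative and negative proportions and groups the fragments as supporting evidence. The function name `summarize_viewpoints` and the shape of its input and output are assumptions made for the example.

```python
def summarize_viewpoints(fragments):
    """fragments: list of (text, viewpoint) pairs, where viewpoint is
    'affirmative' or 'negative'. Returns the proportion and the supporting
    fragment texts for each viewpoint."""
    summary = {"affirmative": {"ratio": 0.0, "support": []},
               "negative": {"ratio": 0.0, "support": []}}
    for text, viewpoint in fragments:
        summary[viewpoint]["support"].append(text)
    total = len(fragments)
    if total:
        for entry in summary.values():
            entry["ratio"] = len(entry["support"]) / total
    return summary

result = summarize_viewpoints([
    ("yes, in moderation", "affirmative"),
    ("had better not eat", "negative"),
    ("it is fine", "affirmative"),
    ("can eat a little", "affirmative"),
])
for viewpoint, entry in result.items():
    print(f"{viewpoint}: {entry['ratio']:.0%} ({len(entry['support'])} fragments)")
```

The returned structure keeps the fragment texts next to each ratio, so a display layer could render either a percentage or a side-by-side comparison from the same data.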
The flow of the data processing method for retrieval of the present invention has been described above with reference to embodiments; a device that applies this data processing method is described below with reference to embodiments.
Referring to Fig. 2, which shows a schematic structural diagram of a data processing device for retrieval according to one embodiment of the present invention, the device 200 may comprise:
An acquisition module 201, configured to obtain a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative;
A generation module 202, configured to generate, from the question and the page data, question-answer template pairs <question, answer> matching the question;
An extraction module 203, configured to extract one or more answer fragments from the page data according to the matching degree between the question and the answer fragments in the page data, wherein the matching degree between the question and a first answer fragment in the page data is calculated as the following ratio: the weighted sum of the terms shared by each answer of the question-answer template pairs <question, answer> and the first answer fragment, as a proportion of the first answer fragment;
A judgment module 204, configured to determine whether the viewpoint of the one or more answer fragments is affirmative or negative according to the number of negation indicator words in the extracted answer fragments and the number of negation indicator words in the question.
The data processing device 200 for retrieval of embodiments of the present invention may comprise the acquisition module 201, the generation module 202, the extraction module 203, and the judgment module 204. These modules may be deployed on the server side of a search engine and connected to other functional modules of the search engine, so that they can call, and be called by, those other functional modules.
In embodiments of the present invention, the question refers to a question whose answer is generally affirmative (for example, "yes") or negative (for example, "no"); such questions are referred to herein as yes/no questions.
The acquisition module 201 can obtain a yes/no question and page data containing answers to the question. Yes/no questions may come from multiple sources: for example, from search queries on a search platform, or from Internet resources such as Q&A communities, forums, and encyclopedias. Correspondingly, the page data containing answers to a yes/no question may also come from multiple sources: for example, from one or more (for example, two or more) pages containing answers to the question that are retrieved by a search engine, or from answer pages for the question contributed by users of Q&A communities, forums, encyclopedias, and the like.
The generation module 202 can generate question-answer template pairs <question, answer> matching the question from the question obtained by the acquisition module 201 and the page data containing answers to the question. Specifically, the generation module may be configured to perform the following operations: analyzing one or more first trunk structures of the yes/no question and one or more second trunk structures of one of the answer fragments of the page data, and constructing the first trunk structures and the second trunk structures into first-class question-answer template pairs <question, answer>, which may also be called initial question-answer template pairs <question, answer>; obtaining one or more answer fragments corresponding to a group of questions that have the same trunk structure as the one or more first trunk structures, screening one or more n-grams and n-skipgrams of the answer fragments corresponding to the group of questions as answer constituents, and constructing the trunk structure of the group of questions and the trunk structures of the corresponding answer fragments into second-class question-answer template pairs <question, answer>, which may also be called expanded question-answer template pairs <question, answer>; and merging the initial question-answer template pairs <question, answer> with the expanded question-answer template pairs <question, answer> to obtain all question-answer template pairs <question, answer> matching the question.
The construction of the initial question-answer template pairs <question, answer> can include analyzing the trunk structure of the yes/no question, that is, the sentence trunk structure of the question sentence. Beyond basic analysis results such as word segmentation, part-of-speech tagging, proper-name recognition, and term importance, the analysis of the trunk structure of a yes/no question further generalizes the segmentation result of the question on the basis of synonyms, hypernyms and hyponyms, and auxiliary verbs; the aim is to identify, from the characteristics of yes/no questions, the core word and the trunk structure of the question. The core word of a yes/no question is the word that can be used to answer the question directly. For example, for the yes/no question "Can pregnant women eat watermelon?", the core word is "can". Dependency parsing can be performed on yes/no questions, and by annotating the core words in a large number of dependency parsing results, an extraction model can be trained to perform core word identification. The sentence trunk structure refers to the constituents that convey the main meaning of the question, and typically comprises a subject, a predicate, and an object. In embodiments of the present invention, a yes/no question can be analyzed at multiple levels to obtain multiple different sentence trunk structures.
The sentence trunk analysis of an answer fragment of the page data (which may, for example, comprise one or more clauses) is similar to the sentence trunk analysis of the question described above and is not repeated here. It should be noted that, at the part-of-speech level of the sentence analysis, only words matching terms in the sentence trunk structure of the question may be retained, mainly because answers to yes/no questions largely repeat terms from the question. For example, for one trunk structure of the question "Can pregnant women eat watermelon?", namely "pregnant woman {can} v {fruit}", and the answer fragment "had better not eat", the trunk structures of the answer fragment may include: 1. eat; 2. edible; 3. v (here the term corresponding to v is "eat", consistent with the term corresponding to v in the trunk structure of the question).
In addition to obtaining the initial question-answer template pairs <pat_q, pat_a> as described above, the question-answer template pairs can also be expanded: one or more answer fragments corresponding to a group of questions (for example, one or more questions) having the same trunk structure as the question are obtained, one or more n-grams and n-skipgrams of these answer fragments are screened as answer constituents, and the trunk structure of the group of questions and the trunk structures of the corresponding answer fragments are constructed into expanded question-answer template pairs <question, answer>. For example, for the trunk structure "{can} {edible}" of a yes/no question, the n-grams and n-skipgrams (n may be, for example, 1, 2, or 3) are counted over all answer fragments in the database (for example, "do not eat", "good for {question, agent}") that belong to yes/no questions with this sentence trunk structure, and the n-grams and n-skipgrams whose score exceeds a predetermined threshold are selected as answer constituents of the question-answer template pairs. The gram_score(n-gram, q) of formula (1) can be used to score an n-gram. An n-skipgram can be assessed with a formula similar to the one used for n-grams.
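A minimal sketch of collecting and screening n-grams and n-skipgrams is given below. Two things are assumptions here, not taken from the text: an n-skipgram is interpreted as n tokens in order with at most `max_skip` tokens skipped in total, and the screening score is a simple stand-in (the share of answer fragments containing the gram) used in place of gram_score of formula (1).

```python
from collections import Counter
from itertools import combinations

def ngrams(tokens, n):
    """Contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, max_skip):
    """In-order n-token tuples with at most max_skip skipped tokens in total.
    Contiguous n-grams (zero skips) are included."""
    grams = []
    for i in range(len(tokens)):
        # Positions the remaining n - 1 tokens may be drawn from.
        window = range(i + 1, min(i + n + max_skip, len(tokens)))
        for rest in combinations(window, n - 1):
            grams.append((tokens[i],) + tuple(tokens[j] for j in rest))
    return grams

def screen_grams(fragments, n, max_skip, threshold):
    """Keep grams whose stand-in score (fraction of fragments that contain
    the gram) exceeds the threshold."""
    doc_freq = Counter()
    for tokens in fragments:
        doc_freq.update(set(ngrams(tokens, n)) | set(skipgrams(tokens, n, max_skip)))
    return {g for g, c in doc_freq.items() if c / len(fragments) > threshold}

fragments = [["do", "not", "eat"], ["better", "not", "eat"], ["can", "eat"]]
print(screen_grams(fragments, n=2, max_skip=1, threshold=0.5))
```

In the toy data only the bigram ("not", "eat") occurs in more than half of the fragments, so it is the only constituent that survives screening.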
The expanded question-answer template pairs can be obtained by first manually annotating a small number of answer fragments and learning a batch of question-answer template pairs from them with a machine learning algorithm; more answer fragments are then obtained with the learned template pairs, which in turn yield more template pairs. The learning process is iterated until the set of template pairs obtained no longer grows significantly. After each iteration, the template pairs are evaluated and the answer fragments with higher evaluation scores are retained, to avoid accumulating errors. For example, a template pair <pat_q, pat_a> can be evaluated on the precision of the answer fragments it retrieves. Clearly, if both the question template and the answer template of a pair are at term-level granularity and no sentence constituent is left as an open (default) slot, the precision of that pair is high. For example, for the question "Can pregnant women eat watermelon?", the template pair <pat_q = Can pregnant women eat watermelon, pat_a = had better not eat> has very high precision, but its generalization ability is very weak and its recall is very low: it can only recall sentences containing "had better not eat". Recall can therefore also be taken into account when evaluating a template pair <pat_q, pat_a>. When performing the evaluation, those skilled in the art can weigh both precision and recall and select template pairs for which both are suitable.
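The trade-off between precision and recall when evaluating template pairs can be illustrated with a small sketch. The function names, the use of fixed minimum thresholds as the selection rule, and the toy fragment identifiers are all assumptions; the text only says that both measures should be weighed.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of the answer fragments a template pair retrieves,
    measured against a set of fragments known to be correct answers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def select_templates(candidates, relevant, min_precision, min_recall):
    """Keep template pairs that clear both thresholds (assumed selection rule)."""
    kept = []
    for name, retrieved in candidates.items():
        p, r = precision_recall(retrieved, relevant)
        if p >= min_precision and r >= min_recall:
            kept.append(name)
    return kept

relevant = {"a1", "a2", "a3", "a4"}
candidates = {
    "term-level": {"a1"},                     # precise, but recalls little
    "generalized": {"a1", "a2", "a3", "x"},   # one wrong hit, much better recall
}
print(select_templates(candidates, relevant, min_precision=0.7, min_recall=0.5))
```

The term-level template reaches a precision of 1.0 but a recall of only 0.25, so it is dropped, while the generalized template scores 0.75 on both measures and is kept.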
In the manner described above, multiple initial question-answer template pairs and multiple expanded question-answer template pairs can be obtained for a yes/no question; combining them yields the full set of question-answer template pairs for that question.
The extraction module 203 can extract one or more answer fragments from the page data according to the matching degree between the yes/no question and the answer fragments in the page data. The page data containing answers to the question may contain one or more answer fragments, and a number of them are selected according to their matching degree with the question. The selected fragments are more specific to the yes/no question, which improves the efficiency of processing the retrieval results for the question and helps obtain answers to it efficiently. The matching degree between the question and an answer fragment in the page data can be scored quantitatively by match_score(q, a) in formula (2).
The matching degree between question q and answer fragment a is calculated by formulas (2) and (3) above, and whether to extract answer fragment a is decided by comparing this matching degree with a matching threshold: if the matching degree exceeds the threshold, the answer fragment is extracted; otherwise, it is not. Besides using the matching degree as the main criterion, the selection of an answer fragment may also consider the position of the fragment within the paragraph of the page (beginning, middle, or end of the paragraph), whether the answer has been accepted, the number of sentences in the answer, the authority of the answer contributor, and the like, with a nonlinear regression model used to decide whether the answer fragment is selected.
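The threshold-based extraction can be sketched as follows, using the term weighting as it is worded in claim 5 (a frequency ratio multiplied by the logarithm of an answer-count ratio). This is only an approximation under stated assumptions: each distinct shared term is summed once, the sum is divided by the token count of the fragment, and the function names are hypothetical; formulas (2) and (3) themselves may differ.

```python
import math
from collections import Counter

def term_weights(template_answers):
    """template_answers: token lists, one per answer side of a template pair.
    weight(t) = (occurrences of t / all word occurrences)
                * log(number of answers / number of answers containing t)"""
    total_words = sum(len(a) for a in template_answers)
    term_freq = Counter(t for a in template_answers for t in a)
    answer_freq = Counter(t for a in template_answers for t in set(a))
    n = len(template_answers)
    return {t: (term_freq[t] / total_words) * math.log(n / answer_freq[t])
            for t in term_freq}

def match_score(template_answers, fragment):
    """Weighted sum of terms shared with the template answers, normalized by
    fragment length (the normalization is an assumption)."""
    if not fragment:
        return 0.0
    weights = term_weights(template_answers)
    shared = [t for t in set(fragment) if t in weights]
    return sum(weights[t] for t in shared) / len(fragment)

def extract_fragments(template_answers, fragments, threshold):
    """Keep fragments whose matching degree exceeds the threshold."""
    return [f for f in fragments if match_score(template_answers, f) > threshold]

answers = [["not", "eat"], ["can", "eat"], ["not", "good"], ["fine"]]
fragments = [["better", "not", "eat"], ["the", "weather", "is", "nice"]]
print(extract_fragments(answers, fragments, threshold=0.05))
```

Note that under this weighting a term that appears in every template answer gets a weight of zero, because the logarithm of 1 is 0.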
After the clause with the highest matching score has been obtained by formula (2), the fragment can be expanded forward and backward from that clause, taking in adjacent clauses whose matching score exceeds the matching threshold, to form the answer fragment. Two classes of sentences require special handling. The first class is conditional sentences: if the highest-scoring clause is the condition clause of a conditional sentence (for example, "if ..."), the expansion continues to the result clause that follows (for example, "then ..."). The second class is adversative sentences: if the highest-scoring clause is the leading clause of an adversative sentence (for example, "although ..."), the expansion continues to the adversative clause that follows (for example, "...").
The judgment module 204 can determine whether the viewpoint of each of the one or more answer fragments extracted by the extraction module 203 is affirmative or negative from the number of negation indicator words in the fragment and the number of negation indicator words in the question. Negation indicator words may include negation words (for example, "not"), words with a negative sentiment orientation (for example, verbs or adjectives), antonyms, and the like. Specifically, for a yes/no question, it is determined whether its core word carries a negation prefix; if so, the negation indicator count of the question is set to 1. If the core word of the question is an adjective or a verb, its sentiment orientation is analyzed; if that orientation is negative, the negation indicator count of the question is likewise set to 1. For example, in the question "Are bitter almonds poisonous?", the core word "poisonous" is a negative-sentiment word. The numbers of negation prefixes and negative-sentiment words in the question are summed and recorded as query_neg_cnt. The negation indicator words in the answer fragment are then counted. For an answer fragment, the negation indicator words may include, in addition to negation prefixes and negative-sentiment words, antonyms of terms in the question. The numbers of negation prefixes, negative-sentiment words, and antonyms of question terms in the answer fragment are counted separately and summed, and the total is recorded as answer_neg_cnt. Once query_neg_cnt and answer_neg_cnt have been obtained, the two are added: if the sum is even, the viewpoint of the answer fragment is taken to be affirmative; if the sum is odd, it is taken to be negative.
In embodiments of the present invention, the device 200 may further comprise a display module, configured to compute the proportions of the extracted answer fragments whose viewpoint is affirmative and whose viewpoint is negative, and to extract the corresponding answer fragments as additional information for each proportion, to serve as supporting arguments for the affirmative or the negative viewpoint. The proportions and their corresponding additional information can be presented to the user in a more intuitive form, for example as one or more of a percentage, a table, a histogram, a line chart, and the like, so that the user can review the retrieval results quickly.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software combined with a hardware platform, or entirely in hardware. On this understanding, all or part of the contribution that the technical solution of the present invention makes over the background art can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and comprises instructions that cause a computer device (which may be a personal computer, a server, a smartphone, a network device, or the like) to perform the methods described in the embodiments of the present invention or in certain parts of the embodiments.
The terms and expressions used in the specification of the present invention are for illustration only and are not meant to be limiting. Those skilled in the art will appreciate that various changes can be made to the details of the above embodiments without departing from the basic principles of the disclosed embodiments. The scope of the present invention is therefore determined only by the claims, in which, unless otherwise indicated, all terms should be understood in their broadest reasonable sense.

Claims (10)

1. A data processing method for retrieval, characterized by comprising:
obtaining a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative;
generating, from the question and the page data, question-answer template pairs <question, answer> matching the question;
extracting one or more answer fragments from the page data according to the matching degree between the question and the answer fragments in the page data, wherein the matching degree between the question and a first answer fragment in the page data is calculated as the following ratio: the weighted sum of the terms shared by each answer of the question-answer template pairs <question, answer> and the first answer fragment, as a proportion of the first answer fragment; and
determining whether the viewpoint of the one or more answer fragments is affirmative or negative according to the number of negation indicator words in the extracted one or more answer fragments and the number of negation indicator words in the question.
2. The method according to claim 1, characterized by further comprising:
computing the proportions of the one or more answer fragments whose viewpoint is affirmative or negative, extracting the corresponding answer fragments whose viewpoint is affirmative or negative as additional information for the proportions, and presenting the proportions and the additional information to the user.
3. The method according to claim 2, characterized by further comprising presenting the proportions in one or more of the following forms: a percentage, a table, a histogram, a line chart.
4. The method according to any one of claims 1 to 3, characterized in that generating, from the question and the page data, the question-answer template pairs <question, answer> matching the question comprises:
analyzing one or more first trunk structures of the question and one or more second trunk structures of one of the answer fragments of the page data, and constructing the first trunk structures and the second trunk structures into first-class question-answer template pairs <question, answer>;
obtaining one or more answer fragments corresponding to a group of questions that have the same trunk structure as the one or more first trunk structures, screening one or more n-grams and n-skipgrams of the answer fragments corresponding to the group of questions as answer constituents, and constructing the trunk structure of the group of questions and the trunk structures of the answer fragments corresponding to the group of questions into second-class question-answer template pairs <question, answer>; and
merging the first-class question-answer template pairs <question, answer> and the second-class question-answer template pairs <question, answer> to obtain the question-answer template pairs <question, answer>.
5. The method according to claim 4, characterized in that the weight of a term shared by each answer of the question-answer template pairs <question, answer> and the first answer fragment is the arithmetic product of the following first component and second component, wherein
the first component is the ratio of the number of occurrences of the shared term in all answers of the question-answer template pairs <question, answer> to the number of occurrences of all words in all answers of the question-answer template pairs <question, answer>;
the second component is the logarithm of the ratio of the number of all answers of the question-answer template pairs <question, answer> to the number of answers of the question-answer template pairs <question, answer> that contain the shared term.
6. A data processing device for retrieval, characterized by comprising:
an acquisition module, configured to obtain a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative;
a generation module, configured to generate, from the question and the page data, question-answer template pairs <question, answer> matching the question;
an extraction module, configured to extract one or more answer fragments from the page data according to the matching degree between the question and the answer fragments in the page data, wherein the matching degree between the question and a first answer fragment in the page data is calculated as the following ratio: the weighted sum of the terms shared by each answer of the question-answer template pairs <question, answer> and the first answer fragment, as a proportion of the first answer fragment; and
a judgment module, configured to determine whether the viewpoint of the one or more answer fragments is affirmative or negative according to the number of negation indicator words in the extracted one or more answer fragments and the number of negation indicator words in the question.
7. The device according to claim 6, characterized by further comprising:
a display module, configured to compute the proportions of the one or more answer fragments whose viewpoint is affirmative or negative, to extract the corresponding answer fragments whose viewpoint is affirmative or negative as additional information for the proportions, and to present the proportions and the additional information to the user.
8. The device according to claim 7, characterized in that the display module is further configured to present the proportions in one or more of the following forms: a percentage, a table, a histogram, a line chart.
9. The device according to any one of claims 6 to 8, characterized in that the generation module is configured to perform the following operations:
analyzing one or more first trunk structures of the question and one or more second trunk structures of one of the answer fragments of the page data, and constructing the first trunk structures and the second trunk structures into first-class question-answer template pairs <question, answer>;
obtaining one or more answer fragments corresponding to a group of questions that have the same trunk structure as the one or more first trunk structures, screening one or more n-grams and n-skipgrams of the answer fragments corresponding to the group of questions as answer constituents, and constructing the trunk structure of the group of questions and the trunk structures of the answer fragments corresponding to the group of questions into second-class question-answer template pairs <question, answer>; and
merging the first-class question-answer template pairs <question, answer> and the second-class question-answer template pairs <question, answer> to obtain the question-answer template pairs <question, answer>.
10. The device according to claim 9, characterized in that, in the extraction module, the weight of a term shared by each answer of the question-answer template pairs <question, answer> and the first answer fragment is the arithmetic product of the following first component and second component, wherein
the first component is the ratio of the number of occurrences of the shared term in all answers of the question-answer template pairs <question, answer> to the number of occurrences of all words in all answers of the question-answer template pairs <question, answer>;
the second component is the logarithm of the ratio of the number of all answers of the question-answer template pairs <question, answer> to the number of answers of the question-answer template pairs <question, answer> that contain the shared term.
CN201510279830.7A 2015-05-27 2015-05-27 A kind of data processing method and device for retrieval Active CN104933097B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510279830.7A CN104933097B (en) 2015-05-27 2015-05-27 A kind of data processing method and device for retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510279830.7A CN104933097B (en) 2015-05-27 2015-05-27 A kind of data processing method and device for retrieval

Publications (2)

Publication Number Publication Date
CN104933097A true CN104933097A (en) 2015-09-23
CN104933097B CN104933097B (en) 2019-04-16

Family

ID=54120265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510279830.7A Active CN104933097B (en) 2015-05-27 2015-05-27 A kind of data processing method and device for retrieval

Country Status (1)

Country Link
CN (1) CN104933097B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003263438A (en) * 2002-03-08 2003-09-19 Nippon Telegr & Teleph Corp <Ntt> Yes/No TYPE QUESTION TREE PREPARING DEVICE, Yes/No TYPE QUESTION TREE PREPARING METHOD, PROGRAM AND RECORDING MEDIUM
CN101377777A (en) * 2007-09-03 2009-03-04 北京百问百答网络技术有限公司 Automatic inquiring and answering method and system
CN103377245A (en) * 2012-04-27 2013-10-30 腾讯科技(深圳)有限公司 Automatic question and answer method and device
CN104216913A (en) * 2013-06-04 2014-12-17 Sap欧洲公司 Problem answering frame
CN103927381A (en) * 2014-04-29 2014-07-16 北京百度网讯科技有限公司 Right-and-wrong problem processing method and device
CN104063497A (en) * 2014-07-04 2014-09-24 百度在线网络技术(北京)有限公司 Viewpoint processing method and device and searching method and device
CN104573028A (en) * 2015-01-14 2015-04-29 百度在线网络技术(北京)有限公司 Intelligent question-answer implementing method and system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229675A (en) * 2017-04-28 2017-10-03 北京神州泰岳软件股份有限公司 Question and answer base construction method, method of answering, the apparatus and system of list type knowledge
CN107832374A (en) * 2017-10-26 2018-03-23 平安科技(深圳)有限公司 Construction method, electronic installation and the storage medium in standard knowledge storehouse
CN108804627A (en) * 2018-05-31 2018-11-13 科大讯飞股份有限公司 Information acquisition method and device
CN108804627B (en) * 2018-05-31 2021-04-06 科大讯飞股份有限公司 Information acquisition method and device

Also Published As

Publication number Publication date
CN104933097B (en) 2019-04-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant