CN104933097B - A kind of data processing method and device for retrieval - Google Patents

A data processing method and device for retrieval

Info

Publication number
CN104933097B
Authority
CN
China
Prior art keywords
answer
segment
template
negative
described problem
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510279830.7A
Other languages
Chinese (zh)
Other versions
CN104933097A (en
Inventor
王石
宗明
孙兴武
蒋祥涛
张希娟
马艳军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510279830.7A priority Critical patent/CN104933097B/en
Publication of CN104933097A publication Critical patent/CN104933097A/en
Application granted granted Critical
Publication of CN104933097B publication Critical patent/CN104933097B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a data processing method and device for retrieval. The method comprises: obtaining a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative; generating, from the question and the page data, question-answer template pairs <question, answer> that match the question; extracting one or more answer segments from the page data according to the matching degree between the question and the answer segments in the page data; and determining whether the viewpoint of the one or more answer segments is affirmative or negative according to the number of negation indicator words in the extracted answer segments and the number of negation indicator words in the question. The method and device substantially improve the efficiency of processing search results for yes/no questions.

Description

A kind of data processing method and device for retrieval
Technical field
The present invention relates to the Internet field, and in particular to a data processing method and device for retrieval.
Background art
In Internet search and in Internet resources such as Q&A communities, forums and encyclopedias, there are often questions such as "Can pregnant women eat watermelon?" or "Is it OK to mix milk for a baby with mineral water?". The answer to such questions is usually "yes" (affirmative) or "no" (negative); we call them yes/no questions (also known as YES-NO questions or polar questions). At present, an Internet user looking for answers to this kind of question can only obtain scattered related web pages through a search engine, and must then manually filter out irrelevant pages and analyze the viewpoints of the answers personally. As a result, analyzing or processing the search results related to the answer is inefficient.
Summary of the invention
To solve the above technical problem, the present invention provides a data processing method and device for retrieval. For a yes/no question and the web pages containing answers to it, corresponding question-answer template pairs can be generated, the matching degree between the yes/no question and each answer segment can be determined from those template pairs, and answer segments can be extracted using the matching degree as the measure. This substantially improves the efficiency and accuracy of processing the search results. Whether the viewpoint on the yes/no question is affirmative or negative is then determined from the extracted answer segments, which improves the efficiency and reliability of obtaining viewpoint data for yes/no questions, so that users can check the search results for a yes/no question quickly and conveniently.
According to a first aspect of embodiments of the present invention, a data processing method for retrieval is provided. The method may include: obtaining a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative; generating, from the question and the page data, question-answer template pairs <question, answer> that match the question; extracting one or more answer segments from the page data according to the matching degree between the question and the answer segments in the page data, wherein the matching degree between the question and a first answer segment in the page data is calculated as the following ratio: the proportion of the first answer segment accounted for by the weighted sum of the terms shared between each answer in the question-answer template pairs <question, answer> and the first answer segment; and determining whether the viewpoint of the one or more answer segments is affirmative or negative according to the number of negation indicator words in the extracted answer segments and the number of negation indicator words in the question.
In some embodiments of the present invention, the method may further include: computing the proportion of the one or more answer segments whose viewpoint is affirmative or negative, extracting the corresponding answer segments whose viewpoint is affirmative or negative as additional information for the proportion, and displaying the proportion and the additional information to the user.
In some embodiments of the present invention, the method may further include displaying the proportion in one or more of the following forms: percentage, table, bar chart, line chart.
In some embodiments of the present invention, generating the question-answer template pairs <question, answer> that match the question from the question and the page data may include: analyzing one or more first trunk structures of the question and one or more second trunk structures of one of the answer segments of the web page data, and constructing first-class question-answer template pairs <question, answer> from the first trunk structures and the second trunk structures; obtaining one or more answer segments corresponding to a first group of questions that share the one or more first trunk structures, selecting one or more n-grams and n-skipgrams of the answer segments corresponding to the first group of questions as answer components, and constructing second-class question-answer template pairs <question, answer> from the trunk structures of the selected first group of questions and the trunk structures of the answer segments corresponding to the first group of questions; and merging the first-class question-answer template pairs <question, answer> with the second-class question-answer template pairs <question, answer> to obtain the question-answer template pairs <question, answer>.
In some embodiments of the present invention, the weight of a term shared between each answer in the question-answer template pairs <question, answer> and the first answer segment is the arithmetic product of a first component and a second component. The first component is the ratio of the number of occurrences of the shared term in all answers of the question-answer template pairs <question, answer> to the number of occurrences of all words in all answers of the question-answer template pairs <question, answer>. The second component is the logarithm of the ratio of the number of all answers in the question-answer template pairs <question, answer> to the number of those answers that contain the shared term.
According to a second aspect of embodiments of the present invention, a data processing device for retrieval is provided. The device may include: an obtaining module, configured to obtain a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative; a generation module, configured to generate, from the question and the page data, question-answer template pairs <question, answer> that match the question; an extraction module, configured to extract one or more answer segments from the page data according to the matching degree between the question and the answer segments in the page data, wherein the matching degree between the question and a first answer segment in the page data is calculated as the following ratio: the proportion of the first answer segment accounted for by the weighted sum of the terms shared between each answer in the question-answer template pairs <question, answer> and the first answer segment; and a judgment module, configured to determine whether the viewpoint of the one or more answer segments is affirmative or negative according to the number of negation indicator words in the extracted answer segments and the number of negation indicator words in the question.
In some embodiments of the present invention, the device may further include: a display module, configured to compute the proportion of the one or more answer segments whose viewpoint is affirmative or negative, extract the corresponding answer segments whose viewpoint is affirmative or negative as additional information for the proportion, and display the proportion and the additional information to the user.
In some embodiments of the present invention, the display module may further be configured to display the proportion in one or more of the following forms: percentage, table, bar chart, line chart.
In some embodiments of the present invention, the generation module may be configured to perform the following operations: analyzing one or more first trunk structures of the question and one or more second trunk structures of one of the answer segments of the web page data, and constructing first-class question-answer template pairs <question, answer> from the first trunk structures and the second trunk structures; obtaining one or more answer segments corresponding to a first group of questions that share the one or more first trunk structures, selecting one or more n-grams and n-skipgrams of the answer segments corresponding to the first group of questions as answer components, and constructing second-class question-answer template pairs <question, answer> from the trunk structures of the selected first group of questions and the trunk structures of the answer segments corresponding to the first group of questions; and merging the first-class question-answer template pairs <question, answer> with the second-class question-answer template pairs <question, answer> to obtain the question-answer template pairs <question, answer>.
In some embodiments of the present invention, in the extraction module, the weight of a term shared between each answer in the question-answer template pairs <question, answer> and the first answer segment is the arithmetic product of a first component and a second component. The first component is the ratio of the number of occurrences of the shared term in all answers of the question-answer template pairs <question, answer> to the number of occurrences of all words in all answers of the question-answer template pairs <question, answer>. The second component is the logarithm of the ratio of the number of all answers in the question-answer template pairs <question, answer> to the number of those answers that contain the shared term.
The method and device provided by embodiments of the present invention extract answer segments according to the matching degree between the yes/no question and the answer segments, which makes the search result data markedly more specific to the question and improves its accuracy and reliability. Viewpoint analysis is performed on the extracted answer segments, which improves the efficiency of processing search results for yes/no questions and helps obtain the answer to the question efficiently. The viewpoint proportions for the yes/no question and the corresponding answer segments are shown in a simple, intuitive display form, so that users can check and compare the search result data quickly.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the data processing method for retrieval according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the data processing device for retrieval according to an embodiment of the present invention.
Detailed description of embodiments
To make the purposes, technical solutions and advantages of embodiments of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1, which shows a schematic flowchart of the data processing method for retrieval according to an embodiment of the present invention, the data processing method for retrieval may include:
S101: obtain a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative;
S102: generate, from the question and the page data, question-answer template pairs <question, answer> that match the question;
S103: extract one or more answer segments from the page data according to the matching degree between the question and the answer segments in the page data, wherein the matching degree between the question and a first answer segment in the page data is calculated as the following ratio: the proportion of the first answer segment accounted for by the weighted sum of the terms shared between each answer in the question-answer template pairs <question, answer> and the first answer segment;
S104: determine whether the viewpoint of the one or more answer segments is affirmative or negative according to the number of negation indicator words in the extracted answer segments and the number of negation indicator words in the question.
In embodiments of the present invention, the targeted questions are those whose answer is usually affirmative (for example, "yes") or negative (for example, "no"); we refer to them here as yes/no questions. The data processing method for retrieval of embodiments of the present invention can be used to process the search results for yes/no questions.
The data processing method for retrieval of the present invention may include executing step S101: obtaining a yes/no question and page data containing answers to the question. The yes/no question may come from a variety of sources; for example, it may come from a search query on a search platform, or from Internet resources such as Q&A communities, forums and encyclopedias. Correspondingly, the page data containing answers to the yes/no question may also come from a variety of sources; for example, it may come from one or more (for example, 2 or more) pages containing answers to the question that are retrieved through a search engine, or from pages in Q&A communities, forums, encyclopedias and the like where users answer the question.
Next, step S102 is executed: from the question obtained in step S101 and the page data containing answers to the question, question-answer template pairs <question, answer> that match the question are generated. Specifically, this may include: analyzing one or more first trunk structures of the yes/no question and one or more second trunk structures of one of the answer segments of the web page data, and constructing first-class question-answer template pairs <question, answer> from the first trunk structures and the second trunk structures, which may be called initial question-answer template pairs <question, answer>; obtaining one or more answer segments corresponding to a first group of questions that share the one or more first trunk structures, selecting one or more n-grams and n-skipgrams of the answer segments corresponding to the first group of questions as answer components, and constructing second-class question-answer template pairs <question, answer> from the trunk structures of the first group of questions and the trunk structures of the answer segments corresponding to the first group of questions, which may be called extended question-answer template pairs <question, answer>; and merging the initial question-answer template pairs <question, answer> with the extended question-answer template pairs <question, answer> to obtain all question-answer template pairs <question, answer> matched with the question.
Constructing the initial question-answer template pairs <question, answer> may include analyzing the trunk structure of the yes/no question, that is, the sentence trunk structure of the interrogative sentence. In addition to basic analysis results such as word segmentation, part-of-speech tagging, proper-name recognition and term importance, the analysis of the trunk structure of a yes/no question may further generalize the word segmentation result of the question based on synonyms, hypernyms and hyponyms, and auxiliary verbs. The goal is to identify, based on the characteristics of yes/no questions, the core word and the trunk structure of the question. The core word of a yes/no question is the word that can be used to answer the question directly. For example, for the yes/no question "Can pregnant women eat watermelon?", the core word is "can". Dependency parsing can be performed on yes/no questions; by annotating the core words in a large number of dependency parsing results, an extraction model can be trained to perform core word identification. The sentence trunk structure refers to the constituents that carry the main meaning of the question, generally including, for example, the subject, predicate and object. In embodiments of the present invention, a yes/no question can be analyzed at several different levels to obtain several different sentence trunk structures. For example, for the yes/no question "Can pregnant women eat watermelon?", at the three levels of part of speech, term and semantic class, with the core word "can" as the center, some of the syntactic constituents to its left and right are omitted to form multiple candidate trunk structures, including:
Term level:
1, pregnant woman can eat watermelon
2, can eat watermelon
3, pregnant woman can eat
4, pregnant woman can
5, can eat
Semantic class level (where { ... } denotes a set of synonyms or of hypernyms and hyponyms):
6, {crowd} {can} {edible} {fruit}
7, {can} {edible} {fruit}
8, {crowd} {can} {edible}
9, {crowd} {can}
10, {can} {edible}
Part-of-speech level (where n is a noun and v is a verb):
11, n v v n
12, v v n
13, n v v
14, n v
15, v n
Mixed level:
16, pregnant woman {auxiliary verb_can} {edible} watermelon
17, pregnant woman {auxiliary verb_can} v {fruit}
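The candidate enumeration above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Token` type, the function name `candidate_trunks` and the rule "every contiguous span that contains the core word and at least one neighbour" are assumptions chosen because they reproduce the five term-level candidates listed for "pregnant woman can eat watermelon".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str  # surface word
    pos: str   # part-of-speech tag, e.g. "n" or "v"
    sem: str   # semantic class label, e.g. "fruit"


def candidate_trunks(tokens, core_idx, level):
    """Enumerate contiguous spans containing the core word, rendered at one level."""
    render = {
        "term": lambda t: t.term,
        "sem": lambda t: "{" + t.sem + "}",
        "pos": lambda t: t.pos,
    }[level]
    trunks = []
    for start in range(0, core_idx + 1):
        for end in range(core_idx + 1, len(tokens) + 1):
            if end - start >= 2:  # keep at least one neighbour of the core word
                trunks.append(" ".join(render(t) for t in tokens[start:end]))
    return trunks


# Hypothetical analysis result for the example question; core word "can" is at index 1.
question = [
    Token("pregnant woman", "n", "crowd"),
    Token("can", "v", "can"),
    Token("eat", "v", "edible"),
    Token("watermelon", "n", "fruit"),
]

for trunk in candidate_trunks(question, core_idx=1, level="term"):
    print(trunk)
```

A mixed-level candidate such as item 16 would need a per-token choice of level, which this sketch leaves out.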
The sentence trunk analysis of an answer segment of the web page data (which may, for example, include one or more clauses) is similar to the sentence trunk analysis of the question described above and is not repeated here. Note that in the part-of-speech level analysis of the sentence, only words identical to terms in the sentence trunk structure of the question may be retained, mainly because the answer to a yes/no question largely repeats terms from the question. For example, for "pregnant woman {can} v {fruit}", one of the trunk structures of the question "Can pregnant women eat watermelon?", and the answer segment "better not eat", the trunk structures of the answer segment may include: 1, It eats; 2, it eats; 3, v (here the term corresponding to v is "eat", consistent with the term corresponding to v in the trunk structure of the question).
Pairing each of the one or more trunk structures of the question obtained by the analysis (denoted here pat_q) with each of the one or more trunk structures of the answer segment (denoted here pat_a) yields multiple initial question-answer template pairs <pat_q, pat_a>.
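The pairing step amounts to a Cartesian product of the two sets of trunk structures. A minimal sketch, assuming trunk structures are represented as plain strings (the function name `initial_template_pairs` is hypothetical):

```python
from itertools import product


def initial_template_pairs(question_trunks, answer_trunks):
    """Pair every question trunk pat_q with every answer trunk pat_a."""
    return list(product(question_trunks, answer_trunks))


pairs = initial_template_pairs(
    ["pregnant woman can eat watermelon", "can eat"],
    ["better not eat", "eat", "v"],
)
print(len(pairs))  # 2 question trunks x 3 answer trunks
```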
In addition to the initial question-answer template pairs <pat_q, pat_a> obtained above, the question-answer template pairs can also be extended: one or more answer segments corresponding to a group of questions (for example, one or more questions) that share the trunk structure of the question are obtained, one or more n-grams and n-skipgrams of these answer segments are selected as answer components, and extended question-answer template pairs <question, answer> are constructed from the trunk structures of the group of questions and the trunk structures of the corresponding answer segments. For example, for the yes/no question trunk structure "{can} {edible}", the n-grams and n-skipgrams (for example, n may be 1, 2, 3 and so on) are counted in all answer segments (for example, "not eat", "beneficial to {problem, agent}" and so on) of the yes/no questions in the database whose sentence trunk structure is that structure, and the n-grams and n-skipgrams whose score is greater than a predetermined threshold are selected as the answer components of the question-answer template pairs. An n-gram can be assessed with the quantitative score gram_score(n-gram, q) of formula (1):
gram_score(n-gram, q) = tf(n-gram, q) * idf(n-gram)    (1)
Here gram_score(n-gram, q) is the importance score of the n-gram in q; * is the arithmetic product; q is the sentence of the yes/no question; an n-gram is a sequence of n consecutive words in a sentence; tf(n-gram, q) = (number of occurrences of the n-gram in the answers corresponding to q) / (number of occurrences of the n-gram in the answers to all questions); idf(n-gram) = log((number of all answers corresponding to q) / (number of answers containing the n-gram)); / is arithmetic division; and log takes the logarithm. An n-gram whose gram_score(n-gram, q) is greater than a predetermined threshold can be used as a component of the answer template in the question-answer template pairs.
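Formula (1) can be sketched in a few lines. This is an illustration under stated assumptions: answers are already tokenized, `answers_all` is a superset of `answers_q` (so the tf denominator is never zero), the "answers containing the n-gram" in idf are counted among the answers corresponding to q, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All sequences of n consecutive words in a tokenized sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def gram_scores(answers_q, answers_all, n):
    """Score each n-gram seen in the answers of q with tf * idf as in formula (1)."""
    count_q = Counter(g for a in answers_q for g in ngrams(a, n))
    count_all = Counter(g for a in answers_all for g in ngrams(a, n))
    containing = Counter(g for a in answers_q for g in set(ngrams(a, n)))
    scores = {}
    for gram, occurrences in count_q.items():
        tf = occurrences / count_all[gram]
        idf = math.log(len(answers_q) / containing[gram])
        scores[gram] = tf * idf
    return scores


def select_components(scores, threshold):
    """Keep the n-grams whose score exceeds the predetermined threshold."""
    return {gram for gram, score in scores.items() if score > threshold}


answers_q = [["better", "not", "eat"], ["do", "not", "eat"], ["fine"]]
answers_all = answers_q + [["not", "eat", "late"]]
scores = gram_scores(answers_q, answers_all, n=2)
selected = select_components(scores, threshold=0.5)
```

Note that with these definitions an n-gram that occurs in every answer of q gets idf = log(1) = 0 and is never selected.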
Similarly, replacing the n-gram with an n-skipgram, the importance of an n-skipgram is assessed in a way analogous to gram_score(n-gram, q) for n-grams above.
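The text does not define n-skipgram precisely. A common reading, assumed here, is an ordered selection of n words from the sentence that may skip over intermediate words, bounded by a maximum number of skipped positions; the `max_skip` parameter and the function name are hypothetical, and n is assumed to be at least 1.

```python
from itertools import combinations


def skipgrams(tokens, n, max_skip):
    """All n-word subsequences (order kept) spanning at most n + max_skip tokens."""
    found = set()
    for idx in combinations(range(len(tokens)), n):
        if idx[-1] - idx[0] + 1 <= n + max_skip:
            found.add(tuple(tokens[i] for i in idx))
    return found


sentence = ["better", "not", "eat", "it"]
with_one_skip = skipgrams(sentence, n=2, max_skip=1)
no_skip = skipgrams(sentence, n=2, max_skip=0)  # degenerates to plain bigrams
```

The resulting tuples can be scored with the same tf * idf scheme as n-grams by swapping the extraction function.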
For the extended question-answer template pairs, a small number of answer segments can first be annotated manually, and a batch of question-answer template pairs can be learned from them by a machine learning algorithm. Based on the learned question-answer template pairs, more answer segments can be obtained, and from those, more question-answer template pairs. The learning process is iterated until the set of question-answer template pairs obtained no longer grows significantly. After each iteration, all question-answer template pairs are assessed, and the answer segments with higher assessment scores are selected, to avoid accumulating errors. For example, a question-answer template pair <pat_q, pat_a> can be assessed by the precision of the answer segments it obtains. Clearly, if both the question template and the answer template in a pair are at the term level and no sentence constituent is omitted, the precision of the pair is higher. For example, for the question "Can pregnant women eat watermelon?", the template pair <pat_q = pregnant woman can eat watermelon, pat_a = better not eat> has very high precision, but its generalization ability is very weak and its recall is very low, since it can only recall sentences containing "better not eat". The assessment of a question-answer template pair <pat_q, pat_a> may therefore also take recall into account. Those skilled in the art can weigh both precision and recall in the assessment and select question-answer template pairs for which both are suitable.
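The assessment step of each iteration can be sketched as a precision/recall filter. The text does not say how the two measures are combined, so the separate minimum thresholds below, the set-based representation of answer segments and the function names are all assumptions.

```python
def evaluate_pair(extracted, relevant):
    """Precision and recall of the answer segments pulled in by one template pair.

    extracted: set of segments the pair retrieved
    relevant:  set of segments labelled as correct answers
    """
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def keep_pairs(stats, min_precision, min_recall):
    """Keep the template pairs that meet both (assumed) minimum thresholds."""
    return [
        pair
        for pair, (precision, recall) in stats.items()
        if precision >= min_precision and recall >= min_recall
    ]


stats = {
    ("pregnant woman can eat watermelon", "better not eat"): (1.0, 0.05),
    ("{can} {edible}", "not eat"): (0.8, 0.6),
}
kept = keep_pairs(stats, min_precision=0.7, min_recall=0.3)
```

In this made-up example the fully lexical pair is dropped for low recall, matching the trade-off described above.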
In the manner described above, multiple initial question-answer template pairs and multiple extended question-answer template pairs can be obtained for the yes/no question; merging them yields the full set of question-answer template pairs for the yes/no question.
Next, step S103 is executed: one or more (that is, at least 1) answer segments are extracted from the page data according to the matching degree between the yes/no question and the answer segments in the page data. The page data containing answers to the question may include one or more answer segments; some of them are selected according to their matching degree with the question. The selected answer segments are more specific to the yes/no question, which improves the efficiency of processing search results for yes/no questions and helps obtain the answer to the question efficiently. The matching degree between the question and an answer segment in the page data can be quantified by match_score(q, a) in formula (2):
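Formula (2) itself is not reproduced in this text. One reading that is consistent with the symbols defined for it (w ∈ QAPats(q) ∩ a, w ∈ a, max) is sketched below. The subscript pat_a under the max and the weighted form of the denominator are assumptions and are not taken from the original.

```latex
\mathrm{match\_score}(q, a) \;=\;
\max_{\mathit{pat\_a} \,\in\, \mathrm{QAPats}(q)}
\frac{\sum_{w \,\in\, \mathit{pat\_a} \,\cap\, a} \mathrm{weight}(w, q)}
     {\sum_{w \,\in\, a} \mathrm{weight}(w, q)}
\tag{2}
```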
Here match_score(q, a) is the matching degree between the question q and an answer segment a in the web page data containing answers to q; QAPats(q) is the set of question-answer template pairs generated for q in step S102, which may include one or more question-answer template pairs; w ∈ QAPats(q) ∩ a means that the term w appears both in an answer template matched with the question q and in the answer segment a; w ∈ a means that the term w appears in the answer segment a; and max takes the maximum. Formula (2) computes, for each answer template in the question-answer template pairs QAPats(q) of q, the weighted sum of the terms w it shares with the answer segment a as a proportion of the whole sentence of a, selects among all answer templates the one with the largest proportion, and takes the matching degree between that template and a as the matching degree between the question q and the answer segment a. Put simply, formula (2) can be understood as the largest proportion of the terms in the answer segment that is covered by a question-answer template pair. The weight(w, q) in formula (2) is given by formula (3):
weight(w, q) = tf(w, q) * idf(w)    (3)
Here tf(w, q) = (number of occurrences of the term w in all answer templates corresponding to q) / (number of occurrences of all words in all answer templates corresponding to q), and idf(w) = log((number of all answer templates) / (number of answer templates containing the term w)).
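Formula (3) and the verbal description of formula (2) can be sketched together. Several points are assumptions rather than statements of the original: answer templates and segments are token lists, the denominator of the ratio is the weighted sum over all tokens of the segment, terms absent from every template receive a caller-supplied `default_weight`, and all function names are hypothetical.

```python
import math
from collections import Counter


def template_weights(answer_templates):
    """weight(w, q) = tf(w, q) * idf(w) over the answer templates paired with q."""
    total_words = sum(len(t) for t in answer_templates)
    occurrences = Counter(w for t in answer_templates for w in t)
    containing = Counter(w for t in answer_templates for w in set(t))
    n_templates = len(answer_templates)
    return {
        w: (occurrences[w] / total_words) * math.log(n_templates / containing[w])
        for w in occurrences
    }


def match_score(segment, answer_templates, weights, default_weight):
    """Largest weighted share of the segment covered by any single answer template."""
    def weight(term):
        return weights.get(term, default_weight)

    denominator = sum(weight(t) for t in segment)
    if denominator == 0:
        return 0.0
    ratios = []
    for template in answer_templates:
        vocab = set(template)
        ratios.append(sum(weight(t) for t in segment if t in vocab) / denominator)
    return max(ratios, default=0.0)


templates = [["better", "not", "eat"], ["can", "eat"], ["no", "problem"]]
weights = template_weights(templates)
segment = ["better", "not", "eat", "it"]
full = match_score(segment, templates, weights, default_weight=0.0)
partial = match_score(segment, templates, weights, default_weight=0.1)
```

With `default_weight=0.0`, words outside every template are ignored, so a segment fully covered by one template scores 1.0. A positive default penalizes segments padded with unrelated words.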
The matching degree between the question q and the answer segment a is calculated by formulas (2) and (3), and whether to extract the answer segment a is decided by comparing the matching degree with a matching degree threshold: if the matching degree is greater than the threshold, the answer segment is extracted; otherwise, it is not. Besides using the matching degree as the main criterion for selecting an answer segment, other factors may also be considered, such as the position of the answer segment within the paragraph of the page (beginning, middle or end of the paragraph), whether the answer was adopted, the number of sentences in the answer, and the authority of the answer contributor; these are analyzed by a nonlinear regression model to decide whether the answer segment is selected.
After the clause with the highest matching degree score in the answer segment is obtained by formula (2), the segment can be extended forward and backward from that clause to include the clauses that exceed the matching degree threshold, forming the answer segment. Two classes of sentences need special handling. The first class is the extension of conditional sentences: if the highest-scoring clause is the condition clause of a conditional sentence (for example, "if ..."), extension continues to the subsequent result clause (for example, "then ..."). The second class is the extension of adversative sentences: if the clause with the highest matching degree score is the front clause of an adversative sentence (for example, "although ..."), extension continues to the subsequent adversative clause (for example, "but ...").
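The window expansion described above can be sketched as follows. The English opener list ("if ", "although ") stands in for whatever condition and concession markers an implementation would actually detect, and the function name is hypothetical; per-clause scores are assumed to be precomputed.

```python
def expand_segment(clauses, scores, threshold):
    """Grow a window around the best clause while neighbours beat the threshold."""
    best = max(range(len(clauses)), key=scores.__getitem__)
    lo = hi = best
    while lo > 0 and scores[lo - 1] > threshold:
        lo -= 1
    while hi < len(clauses) - 1 and scores[hi + 1] > threshold:
        hi += 1
    # A conditional or concessive opener pulls in the clause that follows it,
    # even when that clause scores below the threshold (markers are illustrative).
    opens_pair = clauses[best].lower().startswith(("if ", "although "))
    if opens_pair and hi == best and best + 1 < len(clauses):
        hi = best + 1
    return clauses[lo:hi + 1]


clauses = [
    "watermelon is cold in nature",
    "although pregnant women can eat it",
    "they should not eat too much",
    "ask a doctor",
]
segment = expand_segment(clauses, [0.1, 0.9, 0.3, 0.2], threshold=0.5)
```

Here the second clause scores highest; its neighbours fall below the threshold, but because it opens with "although", the following clause is still attached.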
Next, step S104 is executed: according to the number of negation indicator words in the one or more answer segments extracted in step S103 and the number of negation indicator words in the question, it is determined whether the viewpoint of the one or more answer segments is affirmative or negative. Negation indicator words may include negation words (for example, "not", "no"), negative-sentiment words (which may be, for example, verbs or adjectives), antonyms and so on. Specifically, for the yes/no question, it is determined whether its core word has a negation prefix; if it does, the negation indicator word count of the question is recorded as 1. If the core word of the question is an adjective or a verb, the sentiment polarity of the core word is analyzed; if the polarity is negative, the negation indicator word count of the question is also recorded as 1. For example, in the question "Are bitter almonds toxic?", the core word "toxic" is a negative-sentiment word. The numbers of negation prefixes and negative-sentiment words in the question are summed and denoted query_neg_cnt. Then the negation indicator words in the answer segment are counted; for an answer segment, besides negation prefixes and negative-sentiment words, they may also include antonyms of terms in the question. The numbers of negation prefixes, negative-sentiment words and antonyms of question terms in the answer segment are counted separately and summed, and the result is denoted answer_neg_cnt. After query_neg_cnt and answer_neg_cnt are obtained, the two are added: if the sum is even, the viewpoint of the answer segment is considered affirmative; if the sum is odd, the viewpoint of the answer segment is considered negative.
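The parity rule can be sketched directly. The word lists below are tiny hypothetical lexicons for illustration only, and the counting function simplifies the description by treating every token alike instead of analyzing only the core word of the question.

```python
def count_negation(tokens, negation_words, negative_sentiment, antonyms=frozenset()):
    """Count negation indicator words: negations, negative-sentiment words, antonyms."""
    return sum(
        1
        for t in tokens
        if t in negation_words or t in negative_sentiment or t in antonyms
    )


def viewpoint(query_neg_cnt, answer_neg_cnt):
    """Even total of negation indicators means affirmative, odd means negative."""
    total = query_neg_cnt + answer_neg_cnt
    return "affirmative" if total % 2 == 0 else "negative"


NEGATION = {"not", "no"}          # hypothetical lexicon
NEG_SENTIMENT = {"toxic"}         # hypothetical lexicon

# Question: "Are bitter almonds toxic?" -> one negative-sentiment core word.
query_neg_cnt = count_negation(["bitter", "almonds", "toxic"], NEGATION, NEG_SENTIMENT)

# Answer "not toxic" has two indicators; 1 + 2 = 3 is odd, so the answer says "no".
answer_a = count_negation(["not", "toxic"], NEGATION, NEG_SENTIMENT)
# Answer "toxic" has one indicator; 1 + 1 = 2 is even, so the answer says "yes".
answer_b = count_negation(["toxic"], NEGATION, NEG_SENTIMENT)

print(viewpoint(query_neg_cnt, answer_a), viewpoint(query_neg_cnt, answer_b))
```

The even/odd test works because each negation indicator flips the polarity once, so two flips cancel out.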
Embodiments of the present invention may further include: computing the proportion of the extracted answer segments whose viewpoint is affirmative or negative, and extracting the corresponding answer segments whose viewpoint is affirmative or negative as additional information for the proportion, serving as supporting arguments for the affirmative or negative viewpoint. The proportion of affirmative or negative viewpoints and the corresponding additional information can be shown to the user in an intuitive form, for example through one or more of a percentage, a table, a bar chart, a line chart and so on. In some embodiments, factors such as the length of the answer segment, the authority of the answer website and the authority of the answer provider may also be considered to give each answer a quantitative score, and the web pages corresponding to the answer segments with high scores are shown to the user first. In some embodiments, the proportions of affirmative and negative viewpoints and the corresponding answer segments can also be displayed side by side, so that the user can check and compare the search results quickly.
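The viewpoint statistics can be sketched as a small aggregation. The input format (a list of (segment, label) tuples) and the output dictionary layout are assumptions made for illustration.

```python
from collections import Counter


def viewpoint_summary(labelled_segments):
    """Share of affirmative and negative segments, with the segments as evidence."""
    counts = Counter(label for _, label in labelled_segments)
    total = sum(counts.values())
    summary = {}
    for label in ("affirmative", "negative"):
        summary[label] = {
            "ratio": counts[label] / total if total else 0.0,
            "segments": [s for s, l in labelled_segments if l == label],
        }
    return summary


labelled = [
    ("pregnant women can eat watermelon in moderation", "affirmative"),
    ("yes, it is fine", "affirmative"),
    ("better not eat", "negative"),
    ("it is ok to eat", "affirmative"),
]
summary = viewpoint_summary(labelled)
for label, info in summary.items():
    print(f"{label}: {info['ratio']:.0%}")
```

The percentages correspond to the "percentage" display form; the stored segments are the supporting arguments shown alongside.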
The flow of the data processing method for retrieval according to the present invention has been described above in conjunction with specific embodiments. A device that applies this data processing method is described below in conjunction with specific embodiments.
Referring to Fig. 2, which shows a schematic structural diagram of a data processing device for retrieval according to an embodiment of the present invention, the device 200 may include:
an acquisition module 201, configured to obtain a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative,
a generation module 202, configured to generate, from the question and the page data, question-answer template pairs <question, answer> matched to the question,
an extraction module 203, configured to extract one or more answer segments from the page data according to the matching degree between the question and the answer segments in the page data, wherein the matching degree between the question and a first answer segment in the page data is calculated as the following ratio: the sum of the weights of the terms common to each answer in the question-answer template pairs <question, answer> and the first answer segment, as a proportion of the first answer segment,
a judgment module 204, configured to determine whether the viewpoint of the one or more extracted answer segments is affirmative or negative according to the number of negation indicator words in the extracted answer segments and the number of negation indicator words in the question.
The data processing device 200 for retrieval according to embodiments of the present invention may include the acquisition module 201, the generation module 202, the extraction module 203, and the judgment module 204. These modules may be deployed on the server side of a search engine and may be connected to other functional modules of the search engine; they can call other functional modules and be called by them.
In embodiments of the present invention, the questions addressed are those whose answer is typically affirmative (for example, "yes") or negative (for example, "no"); they are referred to herein as yes/no questions.
The acquisition module 201 may obtain a yes/no question and page data containing answers to the question. The yes/no question may come from a variety of sources; for example, it may come from a query submitted to a search platform, or from Internet resources such as Q&A communities, forums, and encyclopedias. Correspondingly, the page data containing answers to the yes/no question may also come from a variety of sources; for example, it may come from one or more (for example, two or more) pages containing answers to the question that are retrieved by a search engine, or from pages in Q&A communities, forums, encyclopedias, and the like in which users answer the question.
The generation module 202 may generate, from the question obtained by the acquisition module 201 and the page data containing answers to the question, question-answer template pairs <question, answer> matched to the question. Specifically, the generation module may be configured to perform the following operations: analyzing one or more first trunk structures of the yes/no question and one or more second trunk structures of an answer segment in the page data, and constructing the first trunk structures and the second trunk structures into first-type question-answer template pairs <question, answer>, which may be called initial question-answer template pairs <question, answer>; obtaining one or more answer segments corresponding to a group of questions having the same trunk structure as the one or more first trunk structures, selecting one or more n-grams and n-skipgrams of the answer segments corresponding to the group of questions as answer components, and constructing the trunk structure of the group of questions and the trunk structures of the corresponding answer segments into second-type question-answer template pairs <question, answer>, which may be called extended question-answer template pairs <question, answer>; and merging the initial question-answer template pairs <question, answer> with the extended question-answer template pairs <question, answer> to obtain all question-answer template pairs <question, answer> matched to the question.
Constructing the initial question-answer template pairs <question, answer> may include analyzing the trunk structure of the yes/no question, that is, the sentence trunk structure of the interrogative sentence. In addition to basic analysis results such as word segmentation, part-of-speech tagging, proper-noun recognition, and term importance, the analysis of the trunk structure of a yes/no question may further generalize the word segmentation results of the question based on synonyms, hypernyms and hyponyms, and auxiliary verbs. The aim is to identify the core word and the trunk structure of the yes/no question based on the characteristics of such questions. The core word of a yes/no question is the word that can be used to answer the question directly. For example, for the yes/no question "Can pregnant women eat watermelon?", the core word is "can". Dependency parsing may be performed on yes/no questions; by annotating the core words in a large number of dependency parsing results, an extraction model can be trained to perform core word identification. The sentence trunk structure refers to the constituents that carry the main meaning of the question, which generally include, for example, the subject, predicate, and object. In embodiments of the present invention, a yes/no question may be analyzed at several different levels, yielding several different sentence trunk structures.
The sentence trunk analysis of an answer segment in the page data (which may include, for example, more than one clause) is similar to the sentence trunk analysis of the question described above and is not repeated here. It should be noted that, in the part-of-speech-level analysis of the sentence, only the words that are identical to terms in the sentence trunk structure of the question may be retained, mainly because the answers to yes/no questions largely repeat terms from the question. For example, for one trunk structure of the question "Can pregnant women eat watermelon?", namely "pregnant woman {can} v {fruit}", and the answer segment "had better not eat", the trunk structures of the answer segment may include: 1, eat; 2, eat; 3, v (here the term corresponding to v is "eat", consistent with the term corresponding to v in the trunk structure of the question).
In addition to the initial question-answer template pairs <pat_q, pat_a> obtained above, the question-answer template pairs may also be extended: one or more answer segments corresponding to a group of questions (for example, more than one question) having the same trunk structure as the question are obtained, one or more n-grams and n-skipgrams of these answer segments are selected as answer components, and the trunk structure of the group of questions and the trunk structures of the corresponding answer segments are constructed into extended question-answer template pairs <question, answer>. For example, for the yes/no question trunk structure "{can} {edible}", the n-grams and n-skipgrams (where n may be, for example, 1, 2, or 3) are counted over all answer segments (for example, "not eat", "beneficial to {problem, agent}", and so on) of the yes/no questions in the database whose sentence trunk structure is this structure, and the n-grams and n-skipgrams whose score exceeds a predetermined threshold are selected as the answer components of the question-answer template pairs. The evaluation of an n-gram may be quantified by gram_score(n-gram, q) in formula (1). An n-skipgram may be evaluated with a formula similar to that used for n-grams.
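The candidate generation described above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the skip-gram definition (up to `max_skip` skipped tokens in total) is one common convention chosen for the example, and a plain segment count with `min_count` stands in for gram_score of formula (1), which is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Contiguous n-token sequences."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens, n, max_skip=1):
    """n-token subsequences that skip between 1 and max_skip tokens in total."""
    result = []
    window = n + max_skip
    for start in range(len(tokens)):
        rest = range(start + 1, min(start + window, len(tokens)))
        for idx in combinations(rest, n - 1):
            positions = (start,) + idx
            span = positions[-1] - positions[0] + 1
            if span > n:  # contiguous sequences are already covered by ngrams()
                result.append(tuple(tokens[p] for p in positions))
    return result


def frequent_candidates(answer_segments, n_values=(1, 2, 3), min_count=2):
    """Keep candidates that occur in at least min_count answer segments.

    The raw count is an assumed stand-in for the score of formula (1).
    """
    counts = Counter()
    for tokens in answer_segments:
        seen = set()
        for n in n_values:
            seen.update(ngrams(tokens, n))
            seen.update(skipgrams(tokens, n))
        counts.update(seen)
    return {gram: c for gram, c in counts.items() if c >= min_count}


segments = [["had", "better", "not", "eat"], ["better", "not", "eat"]]
print(skipgrams(segments[0], 2))
print(sorted(frequent_candidates(segments)))
```

Counting each candidate at most once per segment keeps a single long answer from dominating the statistics; whether the original scoring does the same is not stated in the text.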
For the extended question-answer template pairs, a small number of answer segments may first be annotated manually, and a batch of question-answer template pairs may be learned from them by a machine learning algorithm. Based on the learned question-answer template pairs, more answer segments can be obtained, and from these more question-answer template pairs. The learning process is iterated until the number of question-answer template pairs obtained no longer increases significantly. After each iteration, all question-answer template pairs are evaluated and the answer segments with higher evaluation scores are selected, so as to avoid the accumulation of errors. For example, a question-answer template pair <pat_q, pat_a> may be evaluated based on the precision of the answer segments it yields. Clearly, if both the question template and the answer template of a pair are at the granularity of individual terms, with no sentence constituents left unspecified, the precision of the pair is higher. For example, for the question "Can pregnant women eat watermelon?", if the template pair is <pat_q = "Can pregnant women eat watermelon", pat_a = "had better not eat">, its precision is very high, but its generalization ability is very weak and its recall is very low: it can only recall sentences containing "had better not eat". The evaluation of a question-answer template pair <pat_q, pat_a> may therefore also take recall into account. Those skilled in the art may weigh precision and recall together and select question-answer template pairs for which both are suitable.
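The iterative procedure can be outlined with the following Python sketch. The three callables `learn`, `apply_templates`, and `score` are hypothetical placeholders for the learning algorithm, the template matcher, and the evaluation measure, none of which the text specifies; filtering the learned templates by score before they produce new segments is one assumed way of limiting error accumulation.

```python
def bootstrap_templates(seed_segments, learn, apply_templates, score,
                        min_score=0.5, min_growth=1, max_rounds=10):
    """Grow a template set from seed answer segments until growth stalls.

    learn(segments)            -> iterable of candidate templates
    apply_templates(templates) -> iterable of answer segments they retrieve
    score(template)            -> evaluation score, e.g. precision
    """
    segments = set(seed_segments)
    templates = set()
    for _ in range(max_rounds):
        # Evaluate every learned template and keep only the well-scored ones.
        learned = {t for t in learn(segments) if score(t) >= min_score}
        new_templates = learned - templates
        if len(new_templates) < min_growth:
            break  # the template set no longer grows significantly
        templates |= new_templates
        segments |= set(apply_templates(templates))
    return templates


# Toy stand-ins, assumed for illustration only.
result = bootstrap_templates(
    seed_segments=["eat"],
    learn=lambda segs: {s.upper() for s in segs},
    apply_templates=lambda temps: {"eat", "drink"},
    score=lambda t: 0.0 if t == "DRINK" else 1.0,
)
print(result)
```

In the toy run the low-scoring template "DRINK" is filtered out in the second round, no new template remains, and the loop stops.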
In the manner described above, multiple initial question-answer template pairs and multiple extended question-answer template pairs can be obtained for the yes/no question; merging these pairs yields the full set of question-answer template pairs for the yes/no question.
The extraction module 203 may extract one or more (that is, at least one) answer segments from the page data according to the matching degree between the yes/no question and the answer segments in the page data. The page data containing answers to the question may include more than one answer segment; selecting some of them according to their matching degree with the question yields answer segments that are more targeted to the yes/no question, which improves the efficiency of processing search results for yes/no questions and helps obtain answers to the question efficiently. The matching degree between the question and an answer segment in the page data may be scored quantitatively by match_score(q, a) in formula (2).
The matching degree between question q and answer segment a is calculated by formula (2) and formula (3) above, and whether to extract answer segment a is determined by comparing the matching degree with a matching degree threshold: if the matching degree is greater than the threshold, the answer segment is extracted; otherwise it is not. Besides the matching degree, which is the main criterion for selecting an answer segment, other factors may also be considered, such as the position of the answer segment within a paragraph of the page (beginning, middle, or end of the paragraph), whether the answer has been accepted, the number of sentences in the answer, and the authority of the answer contributor. These factors are analyzed with a nonlinear regression model to determine whether the answer segment is selected.
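A Python sketch of the matching score follows, based on the wording of claims 1 and 5: each common term is weighted by the product of its relative frequency across all template answers and the logarithm of the number of answers divided by the number of answers containing the term. The function names are hypothetical, and dividing the summed weights by the token length of the segment is an assumption, since the claims only say that the sum is taken as a proportion of the answer segment.

```python
import math
from collections import Counter


def term_weights(template_answers):
    """template_answers: list of token lists, one per template answer."""
    term_counts = Counter(tok for ans in template_answers for tok in ans)
    total_terms = sum(term_counts.values())
    n_answers = len(template_answers)
    weights = {}
    for term, count in term_counts.items():
        containing = sum(1 for ans in template_answers if term in ans)
        first = count / total_terms                # frequency component
        second = math.log(n_answers / containing)  # rarity component
        weights[term] = first * second
    return weights


def match_score(segment_tokens, template_answers):
    """Summed weights of common terms, normalised by segment length (assumed)."""
    if not segment_tokens:
        return 0.0
    weights = term_weights(template_answers)
    common = set(segment_tokens) & set(weights)
    return sum(weights[t] for t in common) / len(segment_tokens)


answers = [["not", "eat"], ["can", "eat"]]
score = match_score(["better", "not", "eat"], answers)
print(score)
```

Note that a term occurring in every template answer ("eat" in the toy data) receives weight zero under this definition, so only the distinguishing terms contribute to the score.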
After the clause with the highest matching degree score is obtained by formula (2), the answer segment may be formed by taking that clause as the center and extending forward and backward to include clauses whose matching degree exceeds the matching degree threshold. Two classes of sentences require special treatment. The first is conditional sentence extension: if the highest-scoring clause is the condition clause of a conditional sentence (for example, "if ..."), the extension continues to the following result clause (for example, "then ..."). The second is concessive sentence extension: if the highest-scoring clause is the leading clause of a concessive sentence (for example, "although ..."), the extension continues to the following contrasting clause (for example, "but ...").
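The expansion around the best clause might look like the following Python sketch. The function name and the English opener words are hypothetical stand-ins; detecting conditional and concessive clauses by a leading keyword is a simplifying assumption made for this example.

```python
PAIRED_OPENERS = ("if", "although")  # assumed English stand-ins for the cue words


def expand_segment(clauses, scores, threshold):
    """Build an answer segment around the highest-scoring clause.

    clauses: list of clause strings; scores: matching degree of each clause.
    """
    if not scores:
        return []
    best = max(range(len(scores)), key=scores.__getitem__)
    start = end = best
    # Extend backward and forward while neighbours exceed the threshold.
    while start > 0 and scores[start - 1] > threshold:
        start -= 1
    while end < len(scores) - 1 and scores[end + 1] > threshold:
        end += 1
    # A condition or concession clause also needs the clause completing it.
    opens_pair = clauses[best].lower().startswith(PAIRED_OPENERS)
    if opens_pair and end == best and end < len(clauses) - 1:
        end += 1
    return clauses[start:end + 1]


clauses = [
    "watermelon is cold in nature",
    "although pregnant women can eat it",
    "they should not eat too much",
    "ask a doctor if unsure",
]
print(expand_segment(clauses, [0.1, 0.9, 0.2, 0.1], threshold=0.5))
```

In the usage example the best clause opens with "although", so the following clause is included even though its own score is below the threshold.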
The judgment module 204 may determine whether the viewpoint of the one or more answer segments extracted by the extraction module 203 is affirmative or negative, based on the number of negation indicator words in those answer segments and the number of negation indicator words in the question. Negation indicator words may include negation words (for example, "not"), negative-sentiment words (which may be, for example, verbs or adjectives), antonyms, and so on. Specifically, for a yes/no question, it is determined whether its core word has a negation prefix; if so, the negation indicator count of the question is recorded as 1. If the core word of the question is an adjective or a verb, the sentiment orientation of the core word is analyzed; if that orientation is negative, the negation indicator count of the question is likewise recorded as 1. For example, in the question "Are bitter almonds poisonous?", the core word "poisonous" is a negative-sentiment word. The numbers of negation prefixes and negative-sentiment words in the question are added arithmetically, and the sum is denoted query_neg_cnt. Then the negation indicator words in the answer segment are counted. For an answer segment, the negation indicator words may include, in addition to negation prefixes and negative-sentiment words, antonyms of terms in the question. The numbers of negation prefixes, negative-sentiment words, and antonyms of question terms in the answer segment are counted separately and added arithmetically; the sum is denoted answer_neg_cnt. Once query_neg_cnt and answer_neg_cnt are obtained, the two are added: if the sum is even, the viewpoint of the answer segment is considered affirmative; if the sum is odd, the viewpoint is considered negative.
In embodiments of the present invention, the device 200 may further include a display module, configured to compute the proportions of the extracted answer segments whose viewpoint is affirmative or negative, and to extract the corresponding answer segments as additional information for each proportion, to serve as supporting arguments for the affirmative or negative viewpoint. Moreover, the proportions and the corresponding additional information may be presented to the user in a more intuitive form, for example as one or more of a percentage, a table, a histogram, a line chart, and the like, so that the user can quickly review the search results.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention may be implemented by software combined with a hardware platform, and may of course also be implemented entirely in hardware. Based on this understanding, the part of the technical solution of the present invention that contributes over the background art may be embodied, in whole or in part, in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a smartphone, a network device, or the like) to execute the methods described in the embodiments of the present invention or in certain parts of the embodiments.
The terms and wording used in the description of the present invention are for illustration only and are not intended to be limiting. Those skilled in the art should understand that various changes may be made to the details of the above embodiments without departing from the basic principles of the disclosed embodiments. Therefore, the scope of the present invention is determined only by the claims, in which, unless otherwise stated, all terms should be understood in their broadest reasonable meaning.

Claims (10)

1. A data processing method for retrieval, characterized by comprising:
obtaining a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative,
generating, from the question and the page data, question-answer template pairs <question, answer> matched to the question,
extracting one or more answer segments from the page data according to the matching degree between the question and the answer segments in the page data, wherein the matching degree between the question and a first answer segment in the page data is calculated as the following ratio: the sum of the weights of the terms common to each answer in the question-answer template pairs <question, answer> and the first answer segment, as a proportion of the first answer segment,
determining whether the viewpoint of the one or more answer segments is affirmative or negative according to the number of negation indicator words in the one or more extracted answer segments and the number of negation indicator words in the question.
2. The method according to claim 1, characterized by further comprising:
computing the proportions of the one or more answer segments whose viewpoint is affirmative or negative, extracting the corresponding answer segments whose viewpoint is affirmative or negative as additional information for the proportions, and presenting the proportions and the additional information to the user.
3. The method according to claim 2, characterized by further comprising presenting the proportions in one or more of the following forms: percentage, table, histogram, line chart.
4. The method according to claim 1, characterized in that generating, from the question and the page data, the question-answer template pairs <question, answer> matched to the question comprises:
analyzing one or more first trunk structures of the question and one or more second trunk structures of an answer segment in the page data, and constructing the first trunk structures and the second trunk structures into first-type question-answer template pairs <question, answer>,
obtaining one or more answer segments corresponding to a group of questions having the same trunk structure as the one or more first trunk structures, selecting one or more n-grams and n-skipgrams of the answer segments corresponding to the group of questions as answer components, and constructing the trunk structure of the group of questions and the trunk structures of the selected answer segments corresponding to the group of questions into second-type question-answer template pairs <question, answer>,
merging the first-type question-answer template pairs <question, answer> and the second-type question-answer template pairs <question, answer> to obtain the question-answer template pairs <question, answer>.
5. The method according to claim 4, characterized in that the weight of a term common to each answer in the question-answer template pairs <question, answer> and the first answer segment is the arithmetic product of the following first component and second component, wherein
the first component is the ratio of the number of occurrences of the common term in all answers of the question-answer template pairs <question, answer> to the number of occurrences of all words in all answers of the question-answer template pairs <question, answer>,
the second component is the logarithm of the ratio of the number of all answers of the question-answer template pairs <question, answer> to the number of answers in the question-answer template pairs <question, answer> that contain the common term.
6. A data processing device for retrieval, characterized by comprising:
an acquisition module, configured to obtain a question and page data containing answers to the question, wherein the question is one whose answer is affirmative or negative,
a generation module, configured to generate, from the question and the page data, question-answer template pairs <question, answer> matched to the question,
an extraction module, configured to extract one or more answer segments from the page data according to the matching degree between the question and the answer segments in the page data, wherein the matching degree between the question and a first answer segment in the page data is calculated as the following ratio: the sum of the weights of the terms common to each answer in the question-answer template pairs <question, answer> and the first answer segment, as a proportion of the first answer segment,
a judgment module, configured to determine whether the viewpoint of the one or more answer segments is affirmative or negative according to the number of negation indicator words in the one or more extracted answer segments and the number of negation indicator words in the question.
7. The device according to claim 6, characterized by further comprising:
a display module, configured to compute the proportions of the one or more answer segments whose viewpoint is affirmative or negative, to extract the corresponding answer segments whose viewpoint is affirmative or negative as additional information for the proportions, and to present the proportions and the additional information to the user.
8. The device according to claim 7, characterized in that the display module is further configured to present the proportions in one or more of the following forms: percentage, table, histogram, line chart.
9. The device according to any one of claims 6 to 8, characterized in that the generation module is configured to perform the following operations:
analyzing one or more first trunk structures of the question and one or more second trunk structures of an answer segment in the page data, and constructing the first trunk structures and the second trunk structures into first-type question-answer template pairs <question, answer>,
obtaining one or more answer segments corresponding to a group of questions having the same trunk structure as the one or more first trunk structures, selecting one or more n-grams and n-skipgrams of the answer segments corresponding to the group of questions as answer components, and constructing the trunk structure of the group of questions and the trunk structures of the selected answer segments corresponding to the group of questions into second-type question-answer template pairs <question, answer>,
merging the first-type question-answer template pairs <question, answer> and the second-type question-answer template pairs <question, answer> to obtain the question-answer template pairs <question, answer>.
10. The device according to claim 9, characterized in that, in the extraction module, the weight of a term common to each answer in the question-answer template pairs <question, answer> and the first answer segment is the arithmetic product of the following first component and second component, wherein
the first component is the ratio of the number of occurrences of the common term in all answers of the question-answer template pairs <question, answer> to the number of occurrences of all words in all answers of the question-answer template pairs <question, answer>,
the second component is the logarithm of the ratio of the number of all answers of the question-answer template pairs <question, answer> to the number of answers in the question-answer template pairs <question, answer> that contain the common term.
CN201510279830.7A 2015-05-27 2015-05-27 A kind of data processing method and device for retrieval Active CN104933097B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510279830.7A CN104933097B (en) 2015-05-27 2015-05-27 A kind of data processing method and device for retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510279830.7A CN104933097B (en) 2015-05-27 2015-05-27 A kind of data processing method and device for retrieval

Publications (2)

Publication Number Publication Date
CN104933097A CN104933097A (en) 2015-09-23
CN104933097B true CN104933097B (en) 2019-04-16

Family

ID=54120265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510279830.7A Active CN104933097B (en) 2015-05-27 2015-05-27 A kind of data processing method and device for retrieval

Country Status (1)

Country Link
CN (1) CN104933097B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229675B (en) * 2017-04-28 2019-02-05 北京神州泰岳软件股份有限公司 Question and answer base construction method, method, apparatus of answering and the system of list type knowledge
CN107832374A (en) * 2017-10-26 2018-03-23 平安科技(深圳)有限公司 Construction method, electronic installation and the storage medium in standard knowledge storehouse
CN108804627B (en) * 2018-05-31 2021-04-06 科大讯飞股份有限公司 Information acquisition method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003263438A (en) * 2002-03-08 2003-09-19 Nippon Telegr & Teleph Corp <Ntt> Yes/No TYPE QUESTION TREE PREPARING DEVICE, Yes/No TYPE QUESTION TREE PREPARING METHOD, PROGRAM AND RECORDING MEDIUM
CN101377777A (en) * 2007-09-03 2009-03-04 北京百问百答网络技术有限公司 Automatic inquiring and answering method and system
CN103377245A (en) * 2012-04-27 2013-10-30 腾讯科技(深圳)有限公司 Automatic question and answer method and device
CN103927381A (en) * 2014-04-29 2014-07-16 北京百度网讯科技有限公司 Right-and-wrong problem processing method and device
CN104063497A (en) * 2014-07-04 2014-09-24 百度在线网络技术(北京)有限公司 Viewpoint processing method and device and searching method and device
CN104216913A (en) * 2013-06-04 2014-12-17 Sap欧洲公司 Problem answering frame
CN104573028A (en) * 2015-01-14 2015-04-29 百度在线网络技术(北京)有限公司 Intelligent question-answer implementing method and system

Also Published As

Publication number Publication date
CN104933097A (en) 2015-09-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant