CN101777046A - Searching method and system

Info

Publication number: CN101777046A
Application number: CN200910001619A
Authority: CN
Inventors: 谭诚; 黄耀海
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2009-01-09
Filing date: 2009-01-09
Publication date: 2010-07-14

Abstract

The invention discloses a searching method and a searching system. The searching method comprises the following steps: primary searching, namely, performing primary searching on a plurality of documents by using an initial searching model to acquire result documents and selecting feedback documents from the result documents; selection, namely, filtering the feedback documents by using the number of phrases in the feedback documents as a standard, selecting documents from the feedback documents and using the documents as selected feedback documents, wherein the phrases are formed by words in the initial searching model; feedback information acquisition, namely, acquiring feedback information from the selected feedback documents based on the dependency relation between the words in the selected feedback documents and the words in the initial searching model; generation, namely, generating a new searching model by adding the feedback information into the initial searching model; and secondary searching, namely, performing secondary searching on the plurality of documents by using the new searching model.

Description

Search method and searching system

Technical field

The present invention relates to a search method and a search system, and in particular to a search method and a search system that use relevance feedback.

Background technology

Among all search systems, those based on keyword search engines are the most widely used. With the development of retrieval technology, a newer search method that uses feedback has become increasingly effective and common. In this method, the top N documents obtained by the first retrieval (N being a positive integer that can be set as appropriate) are used as feedback documents, and information extracted from the feedback documents is used in the next retrieval.

Fig. 7 is a flowchart illustrating the retrieval process used in a conventional search system that uses relevance feedback information.

In step S701, the system obtains an initial query and performs a first retrieval based on it to obtain a list of result documents, which may be returned to the user. Any keyword retrieval method known to those skilled in the art can be used for the first retrieval, as long as the result documents can be ranked by a score indicating how relevant each document is to the initial query.

In step S703, the result documents in the list are sorted in descending order of that score. The system selects the first N documents (the top N documents) from the list as feedback documents. N is a positive integer that can be chosen arbitrarily by the user or set in some other suitable way.

In step S705, the system loops over the top N documents and obtains a word segmentation result using a lexical parser.

In step S707, the system calculates a relevance score for each word in the feedback documents (the top N documents), for example according to formula 1 below.

$$\mathrm{relevance\_score}(w_j) = \sum_{i=1}^{N} \mathrm{word\_score}(w_j, doc_i)$$

(formula 1)

Here, w_j denotes the j-th word in the top N documents and ranges over all words in all top N documents; relevance_score(w_j) denotes the relevance score of w_j, which indicates how relevant the word w_j is to the initial query; doc_i denotes the i-th of the top N documents, with i running from 1 to N; and word_score(w_j, doc_i) is, for example, the number of times w_j occurs in the current document doc_i.

After the relevance score relevance_score(w_j) of every word has been calculated, the words are sorted by relevance score, and the M words with the highest relevance scores are selected as feedback information. Here, M is any positive integer, chosen by the user as needed or predetermined automatically by the system.
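The count-based scoring and top-M selection of step S707 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the feedback documents have already been segmented into word lists, and the function names `relevance_scores` and `top_m_words`, the optional `exclude` parameter, and the alphabetical tie-break are all hypothetical choices made for the sketch.

```python
from collections import Counter

def relevance_scores(feedback_docs):
    """Sum, over all feedback documents, the number of times each word occurs."""
    scores = Counter()
    for doc_words in feedback_docs:
        scores.update(doc_words)  # adds the occurrence count of each word in this document
    return scores

def top_m_words(scores, m, exclude=()):
    """Return the m highest-scoring words, skipping any word in `exclude`."""
    candidates = [(w, s) for w, s in scores.items() if w not in exclude]
    # Sort by descending score; ties are broken alphabetically (an assumption).
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [w for w, _ in candidates[:m]]

docs = [["a", "b", "a"], ["a", "c"]]
scores = relevance_scores(docs)
feedback_words = top_m_words(scores, 2)
```

The `exclude` argument is one way to keep words that are already in the query out of the feedback information; the source does not say how this is handled.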

In step S709, the system takes the feedback information into account in addition to the initial query and generates a new query. For example, the system adds the M words with the highest relevance scores calculated in step S707 to the initial query to obtain the new query.

In step S711, the system performs a second retrieval using the new query obtained in step S709.

In step S713, the system takes the result of the second retrieval as the final retrieval result and returns it to the user.

More information about relevance feedback is disclosed in various documents, such as J. J. Rocchio, "Relevance Feedback in Information Retrieval", in The SMART Retrieval System: Experiments in Automatic Document Processing, pp. 313-323, Englewood Cliffs, NJ: Prentice Hall Inc., 1971; G. Salton and Chris Buckley, "Improving Retrieval Performance by Relevance Feedback", JASIS 41, pp. 288-297, 1990; and C. T. Yu, W. S. Luk and T. Y. Cheung, "A Statistical Model for Relevance Feedback in Information Retrieval", Journal of the Association for Computing Machinery, Vol. 23, No. 2, April 1976, pp. 273-286. Because relevance feedback is well known to those skilled in the art, a more detailed description of it is omitted here.

However, in the prior-art processing of a search system that uses relevance feedback information as described above, the system calculates the relevance scores using only the word segmentation result of the lexical parser. In other words, only information about individual words is used in calculating the relevance scores; the dependency relations between words are not considered.

Furthermore, the feedback information affects the performance of the second retrieval. In a conventional search system using relevance feedback information, every item of feedback information is used uniformly in the second retrieval. However, the relevance scores of the items of feedback information are not equal. This means that the words in the feedback information contribute unequally, so they should preferably be used differentially.

Furthermore, in a conventional search system using relevance feedback information, the top N documents produced by the first retrieval are used directly for feedback retrieval without further processing. The inventors found that the performance of the feedback retrieval also affects the precision of the second retrieval, yet the top N documents obtained by the first retrieval are sometimes not good enough for feedback retrieval.

Furthermore, in a conventional search system using relevance feedback information, document length also affects the calculation of the feedback information. When the feedback information is calculated, long documents can have an unfair "advantage", so an adjustment by length normalization is preferable.

Therefore, a new relevance feedback search method and system are needed that improve the accuracy of the feedback documents and the performance of the search system.

Summary of the invention

An object of the present invention is to provide a relevance feedback search method and system that solve at least one of the above technical problems and that improve the accuracy of the feedback documents and the performance of the search system.

According to a first aspect of the invention, there is provided a search method comprising: a first retrieval step of performing a first retrieval on a plurality of documents using an initial query to obtain result documents, and selecting feedback documents from the result documents; a selection step of filtering the feedback documents using the number of phrases in each feedback document as a criterion, and selecting some of the feedback documents as selected feedback documents, the phrases being formed from the words in the initial query; a feedback information acquisition step of obtaining feedback information from the selected feedback documents based on the dependency relations between words in the selected feedback documents and words in the initial query; a generation step of generating a new query by adding the feedback information to the initial query; and a second retrieval step of performing a second retrieval on the plurality of documents using the new query.

Preferably, the feedback information acquisition step comprises a relevance score calculation step of calculating relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query.

Preferably, the relevance score calculation step calculates the relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query, and on the number of times each word occurs in the selected feedback documents.

Preferably, the feedback information acquisition step obtains the feedback information based on the dependency relations between words in the selected feedback documents and words in the initial query, and on the fundamental relation scores between words in the selected feedback documents and words in the initial query.

Preferably, the feedback information acquisition step comprises a relevance score calculation step of calculating relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query, and on the fundamental relation scores between words in the selected feedback documents and words in the initial query.

Preferably, the relevance score calculation step calculates the relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query, on the fundamental relation scores between words in the selected feedback documents and words in the initial query, and on the number of times each word occurs in the selected feedback documents.

Preferably, the second retrieval step comprises a weight adjustment step of adjusting the weight of each word in the feedback information using the relevance scores, the weights being used during the second retrieval.

Preferably, the feedback information acquisition step comprises a word selection step of selecting a predetermined number of words with the highest relevance scores as the feedback information.

Preferably, the feedback information acquisition step comprises a document length normalization step of calculating a document length normalization ratio from the length of each selected feedback document and applying that ratio in the calculation of the feedback information.

Preferably, the dependency relations are obtained using a syntax parser, more preferably a shallow syntax parser.

According to a second aspect of the invention, there is provided a search system comprising: a first retrieval unit for performing a first retrieval on a plurality of documents using an initial query to obtain result documents, and selecting feedback documents from the result documents; a selection unit for filtering the feedback documents using the number of phrases in each feedback document as a criterion, and selecting some of the feedback documents as selected feedback documents, the phrases being formed from the words in the initial query; a feedback information acquisition unit for obtaining feedback information from the selected feedback documents based on the dependency relations between words in the selected feedback documents and words in the initial query; a generation unit for generating a new query by adding the feedback information to the initial query; and a second retrieval unit for performing a second retrieval on the plurality of documents using the new query.

According to the invention, the dependency relations between words in the feedback documents and words in the initial query are determined, so that the dependency relation between each word and the initial query is also taken into account when the relevance scores are calculated to obtain the feedback information.

According to a preferred embodiment, the relevance scores of the feedback information are also used to adjust the weights of the words in the feedback information during the second retrieval, so that the differences between those words are taken into account.

According to another preferred embodiment, the feedback documents are also filtered by phrase count so as to select the more relevant documents from the candidate documents, thereby improving the accuracy of the feedback documents.

According to another preferred embodiment, document length normalization is also used to correct the relevance scores of the words in the feedback documents, so as to reduce the influence of long documents.

Description of drawings

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.

Fig. 1 is a block diagram showing the arrangement of a computing device for implementing the search system according to the invention.

Fig. 2 is a block diagram showing the configuration of a search system using relevance feedback information according to the first embodiment of the invention.

Fig. 3 is a flowchart showing the retrieval process performed by the search system using relevance feedback information according to the first embodiment of the invention.

Fig. 4 is a diagram showing, for one example, the word segmentation result obtained by the lexical parser.

Fig. 5 is a diagram showing, for one example, the syntactic result obtained by the syntax parser.

Fig. 6 is a flowchart illustrating the filtering process used in the third embodiment.

Fig. 7 is a flowchart showing the retrieval process performed by a conventional search system using relevance feedback information.

Fig. 8 is a block diagram showing a preferred configuration of the feedback information acquisition unit 205 in Fig. 2.

Embodiments

Embodiments of the invention are described in detail below with reference to the accompanying drawings.

Fig. 1 is a block diagram showing the arrangement of a computing device for implementing the search system according to the invention. For brevity, the search system is built in a single computing device. However, the search system is effective whether it is built in a single computing device or in a plurality of computing devices forming a network system.

As shown in Fig. 1, the computing device 100 is used to carry out the retrieval process. The computing device 100 may include a CPU 101, a chipset 102, a RAM 103, a memory controller 104, a display controller 105, a hard disk drive 106, a CD-ROM drive 107 and a display 108. The computing device 100 may also include a signal line 111 connected between the CPU 101 and the chipset 102, a signal line 112 connected between the chipset 102 and the RAM 103, a peripheral bus 113 connected between the chipset 102 and various peripheral devices, a signal line 114 connected between the memory controller 104 and the hard disk drive 106, a signal line 115 connected between the memory controller 104 and the CD-ROM drive 107, and a signal line 116 connected between the display controller 105 and the display 108.

A client 120 may be connected to the computing device 100 either via a network 130 or directly. The client 120 can send a retrieval task to the computing device 100, and the computing device 100 can return the retrieval result to the client 120.

As shown in Fig. 2, the search system comprises: a first retrieval unit 201 for performing a first retrieval on a plurality of documents using an initial query to obtain result documents, and selecting feedback documents from the result documents; a selection unit 203 for filtering the feedback documents using the number of phrases in each feedback document as a criterion, and selecting some of the feedback documents as selected feedback documents, the phrases being formed from the words in the initial query; a feedback information acquisition unit 205 for obtaining feedback information from the selected feedback documents based on the dependency relations between words in the selected feedback documents and words in the initial query; a generation unit 207 for generating a new query by adding the feedback information to the initial query; and a second retrieval unit 209 for performing a second retrieval on the plurality of documents using the new query.

Fig. 8 is a block diagram showing a preferred configuration of the feedback information acquisition unit 205. As shown in Fig. 8, the feedback information acquisition unit 205 preferably comprises: a word selection unit 801 for selecting a predetermined number of words with the highest relevance scores as the feedback information; a document length normalization unit 803 for calculating a document length normalization ratio from the length of each selected feedback document and applying that ratio in the calculation of the feedback information; a relevance score calculation unit 805 for calculating relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query; a word sorting unit 807 for sorting the words in the feedback documents in descending order of relevance score; and a feedback information determination unit 809 for determining the first M words with the highest relevance scores as the feedback information, where M is a positive integer.

Embodiments of the invention are described in detail below.

(first embodiment)

The first embodiment is described with reference to Fig. 3.

The process starts at step S301, in which the system obtains an initial query and performs a first retrieval based on it to obtain a list of result documents to be returned to the user. The first retrieval can use any keyword retrieval method known to those skilled in the art, as long as the documents in the retrieval result can be ranked by a score indicating the relevance between each document and the query. For example, the score of a document for the query can be calculated from the number of times the words of the initial query occur in the document, as follows.

$$\mathrm{score}(doc_i) = \sum_{k} n(doc_i, q_k)$$

(formula 2)

Here, doc_i denotes the i-th document, score(doc_i) denotes the score of doc_i, q_k denotes the k-th word in the query, and n(doc_i, q_k) denotes the number of occurrences of q_k in doc_i.

Those skilled in the art know many other methods that can be used for the first retrieval and for document ranking.

For example, preferably but not necessarily, the system assigns a corresponding weight to each query word. Formula 2 is then modified as follows.

$$\mathrm{score}(doc_i) = \sum_{k} W_k \cdot n(doc_i, q_k)$$

(formula 3)

Here, W_k denotes the weight of query word q_k. Those skilled in the art can devise various methods of assigning weights to query words. For example, if a query word occurs frequently in irrelevant documents, it is assigned a lower weight. For example, query words that occur frequently in all kinds of unrelated documents, such as "Yes", " ", " ", are assigned a very low weight. For example, the techniques disclosed in Min Zhang et al., "DF or IDF? On the Use of Primary Feature Model for Web Information Retrieval", Journal of Software, Vol. 16, No. 5, 2005, and Shaohan Liu et al., "Applying Relevance Feedback to Information Retrieval Using Keyword and Weight Algorithms", Journal of the China Society for Scientific and Technical Information, Vol. 21, No. 6, December 2002, can be adopted.

In step S303, the result documents in the list are sorted in descending order of the above score. The system obtains the top N documents from the document list. N is a positive integer that can be chosen arbitrarily by the user or set as appropriate by the system.
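The first-retrieval scoring and the selection of the top N documents can be sketched as below, purely as an illustration. The sketch assumes documents are already available as word lists; the names `score_document` and `top_n_documents`, the `weights` dictionary, and the default weight of 1.0 for unlisted query words are hypothetical and not taken from the source.

```python
from collections import Counter

def score_document(doc_words, query_words, weights=None):
    """Score a document by (optionally weighted) counts of the query words in it."""
    counts = Counter(doc_words)
    if weights is None:
        # Unweighted form: total number of query-word occurrences.
        return sum(counts[q] for q in query_words)
    # Weighted form: each count is multiplied by the weight of its query word.
    return sum(weights.get(q, 1.0) * counts[q] for q in query_words)

def top_n_documents(docs, query_words, n, weights=None):
    """Return the indices of the n best-scoring documents, best first."""
    ranked = sorted(
        range(len(docs)),
        key=lambda i: -score_document(docs[i], query_words, weights),
    )  # sorted() is stable, so tied documents keep their original order
    return ranked[:n]

docs = [["x"], ["q", "q"], ["q", "y"]]
best = top_n_documents(docs, ["q"], 2)
```

A real system would normally compute these scores from an inverted index rather than by scanning every document; the linear scan here only keeps the sketch short.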

In step S305, the system loops over the top N documents and obtains a word segmentation result using a lexical parser. Any lexical parser can be used to obtain the word segmentation result, such as those disclosed in Jianfeng Gao et al., "Dependence Language Model for Information Retrieval", Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 170-177, 2004; Deniz Yuret, "Discovery of Linguistic Relations Using Lexical Attraction", PhD dissertation, Massachusetts Institute of Technology, 1998; Peng Wang et al., "Researches on Rule-based Chinese Parsing Techniques", Computer Engineering and Applications, Vol. 29, 2003; and Liu Qun, "Summary of Chinese syntax parsing and lexical parsing technology", Students' Workshop on Computational Linguistics, 2002.

In step S307, the system loops over the top N documents and obtains a syntactic result using a syntax parser. A syntax parser is a parser that takes the list of words of a sentence as input and outputs the dependency relations (associations) between those words.

Syntax parsing is an important technique in systems related to natural language processing (such as text search systems, machine translation systems, information extraction systems and text-to-speech systems). The task of a syntax parser is to analyze the syntactic structure of a sentence automatically and then convert the sentence into a structured syntax graph.

Among the various syntax parsers, a special kind that gives the syntactic dependency relations between the words in a sentence, the shallow syntax parser, is increasingly widely used, because both its precision and its speed are much better than those of a full syntax parser.

Fig. 5 gives the syntactic analysis results for two sentences. In Fig. 5, arcs exist between words, and each arc is acyclic, planar and undirected. In shallow syntax parsing, each arc indicates an association or dependency relation between the words at its two ends. The associations or dependency relations in shallow syntax parsing represent the best relations among all possible relations for the given sentence.

Each arc in Fig. 5 that shows a dependency relation between two words carries a score called the fundamental relation score. Preferably, in the present invention, the dependency relations and the fundamental relation scores between words in the feedback documents and words in the initial query can be used in combination. The fundamental relation score indicates the degree of relatedness of the words. An example using both the dependency relations and the fundamental relation scores is shown below. Note, however, that the dependency relations can be used alone, without considering the fundamental relation scores.

Any syntax parser can be used to obtain the syntactic result. For example, the shallow syntax parsers disclosed in Jianfeng Gao et al., "Dependence Language Model for Information Retrieval", SIGIR 2004, Sheffield, UK, July 25-29, and Deniz Yuret, "Discovery of Linguistic Relations Using Lexical Attraction", PhD dissertation, MIT, 1998, can be adopted. A full syntax parser can also be adopted. However, because both the precision and the speed of a shallow syntax parser are much better than those of a full syntax parser, a shallow syntax parser is preferred.

In step S309, the system obtains feedback information by calculating relevance scores based on the word segmentation result obtained in step S305 and the syntactic result obtained in step S307. Specifically, the system uses the word segmentation result and the syntactic result to calculate the relevance score of each word in the top N documents. For example, formula 4 below can be used to calculate the relevance score of each word.

$$\mathrm{relevance\_score}(w_j) = \mathrm{word\_score}(w_j) + \mathrm{relation\_score}(w_j)$$

(formula 4)

Here, w_j denotes the j-th word in the top N documents and ranges over all words in all feedback documents; relevance_score(w_j) denotes the relevance score of w_j; word_score(w_j) denotes the score of w_j that depends only on information about w_j itself; relation_score(w_j) denotes the score of w_j that indicates its degree of dependence on the query words; doc_i denotes the i-th of the top N documents; word_score(w_j, doc_i) is the number of times w_j occurs in the current document; q_k denotes the k-th word in the initial query; relation_score(w_j, q_k) denotes the fundamental relation score of w_j and q_k, which is zero if w_j and q_k have no dependency relation; and relation_score(w_j, q_k, doc_i) is the fundamental relation score of w_j and q_k in doc_i, representing the dependency relation between w_j and q_k in doc_i. Note that the fundamental relation scores can be set manually as needed or taken from a predetermined dictionary. Alternatively, the fundamental relation scores can be obtained by using a syntax parser. The dictionary can be generated by the following steps:

- collect a corpus;

- split all sentences in the corpus into word nodes;

- count, for each pair of adjacent word nodes, the number of times the pair occurs in the corpus; and

- record the word node pairs, and normalize their occurrence counts to serve as their fundamental relation scores in the fundamental relation score dictionary.
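The dictionary-building steps above can be sketched as follows. This is an illustration under stated assumptions: the corpus is given as sentences that are already split into word nodes, pairs are treated as unordered, and counts are normalized by dividing by the largest pair count. The source does not specify the normalization, and the function name `build_relation_dictionary` is hypothetical.

```python
from collections import Counter

def build_relation_dictionary(corpus_sentences):
    """Map each adjacent word pair to a normalized co-occurrence score in (0, 1]."""
    pair_counts = Counter()
    for words in corpus_sentences:
        # zip(words, words[1:]) yields every pair of adjacent word nodes.
        for left, right in zip(words, words[1:]):
            pair_counts[frozenset((left, right))] += 1  # unordered pair (assumption)
    if not pair_counts:
        return {}
    max_count = max(pair_counts.values())
    # Normalize by the largest count (assumption: the source only says "normalize").
    return {pair: count / max_count for pair, count in pair_counts.items()}

corpus = [["music", "player"], ["music", "player", "app"]]
relation_dict = build_relation_dictionary(corpus)
```

Using `frozenset` keys makes a lookup for (w_j, q_k) independent of word order; a directed variant would use tuples instead.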

Note that formula 4 is only one example of calculating relevance scores from both the word segmentation result and the syntactic result. Those skilled in the art can, as needed, choose other ways of calculating relevance scores from any combination of the word segmentation result and the syntactic result. For example, the formula relevance_score(w_j) = word_score(w_j) × relation_score(w_j) can also be used.

After the relevance score relevance_score(w_j) of each word in each of the top N documents has been calculated, the words are sorted by relevance score, and the M words with the highest relevance scores are selected as feedback information. Here, M is any positive integer, chosen by the user as needed or predetermined automatically by the system.
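A relevance score that adds a dependency-based term to the occurrence count can be sketched as follows. The sketch rests on several assumptions that go beyond the text: the syntax parser's output is represented as a list of (word, word, fundamental relation score) arcs per document, the relation term is summed over all feedback documents and query words, and query words themselves are not scored. The function name `dependency_relevance_scores` is hypothetical.

```python
from collections import Counter

def dependency_relevance_scores(docs, arcs_per_doc, query_words):
    """Occurrence count of each word plus the scores of its arcs to query words."""
    query = set(query_words)
    scores = Counter()
    for words, arcs in zip(docs, arcs_per_doc):
        for w in words:
            if w not in query:
                scores[w] += 1  # word_score term: one per occurrence
        for a, b, arc_score in arcs:
            # relation_score term: only arcs linking a query word to a non-query word.
            if a in query and b not in query:
                scores[b] += arc_score
            elif b in query and a not in query:
                scores[a] += arc_score
    return scores

docs = [["new", "music", "player", "today"]]
arcs = [[("new", "music", 0.5), ("music", "player", 0.9)]]
scores = dependency_relevance_scores(docs, arcs, ["music", "player"])
```

In this toy input, "new" is linked to the query word "music" and therefore outranks "today", which occurs equally often but has no dependency on the query.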

In step S311, the system adds the feedback information to the initial query to generate a new query.

In step S313, the system performs the second retrieval using the new query that includes the feedback information.

In step S315, the system takes the result obtained from the second retrieval as the final result and returns it to the user, and the process ends.

According to the above process, the feedback information is obtained not only from the simple word segmentation result but also from the syntactic result. Word segmentation only separates the individual words, whereas syntactic analysis additionally identifies the dependency relations between words in the feedback documents and words in the initial query.

In a simulation experiment conducted by the inventors, the above search method using relevance feedback information according to this embodiment was evaluated on a Mandarin TREC corpus of 139353 KB, and the results shown in Table 1 were obtained.

Table 1

	Conventional relevance feedback method	Relevance feedback method of the invention
Recall	0.7599	0.7724
Precision	0.2238	0.2299
R-precision	0.4537	0.4753

Here, recall, precision and R-precision are three commonly used measures for evaluating a search method or system. Recall equals the ratio of the number of answer documents in the result document list to the total number of answer documents. Precision equals the ratio of the number of answer documents in the result document list to the total number of result documents. R-precision equals the ratio of the number of answer documents among the first R result documents to the total number of answer documents (R being a positive integer). Here, a result document is a document retrieved or found by the search system, and an answer document is a document that the user actually needs. The larger the values of these measures, the better the corresponding performance. Since recall, precision and R-precision are commonly used by those skilled in the art, a detailed description of them is omitted.

As Table 1 shows, the relevance feedback method of the invention improves performance compared with the conventional method.
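The three measures can be computed as in the following sketch, which follows the definitions given in the text. The document identifiers and the function names are illustrative only, and the sketch assumes that both the result list and the set of answer documents are non-empty.

```python
def recall(results, answers):
    """Answer documents found in the result list / all answer documents."""
    return len(set(results) & set(answers)) / len(answers)

def precision(results, answers):
    """Answer documents found in the result list / all result documents."""
    return len(set(results) & set(answers)) / len(results)

def r_precision(results, answers, r):
    """Answer documents among the first r results / all answer documents."""
    return len(set(results[:r]) & set(answers)) / len(answers)

results = ["d1", "d2", "d3", "d4"]   # ranked result documents (hypothetical IDs)
answers = {"d1", "d3", "d5"}         # documents the user actually needs
```

When R is set to the number of answer documents, as is common practice, the value computed by `r_precision` coincides with precision over the first R results.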

[example]

For ease of understanding of the principles of the invention, an example of the above process according to the first embodiment is now given.

Note that the following example serves only to illustrate the principles of the invention; none of the specific values, equations or expressions is intended to limit the scope of the invention.

In this example, the query Q is "music player", and there are three candidate documents in total, namely:

D1: The U.S. company Apple releases the iPod digital music player

D2: Some U.S. companies release a novel music player

D3: iPod shuffle has recently been selling well worldwide

In this example, N (the number of feedback documents) is set to 2, and M (the number of words to be taken as feedback information) is set to 3.

First, the case of the conventional search method shown in Fig. 7 is described. In step S701, the query "music player" is input, and the first retrieval is performed over D1 to D3. Any conventional search method can be used in the first retrieval, and processing such as word segmentation can be applied to the query "music player"; these are techniques commonly used by those skilled in the art. As a result, D1 and D2 are obtained as result documents. In step S703, the top N (N = 2) documents, namely D1 and D2, are taken as the feedback documents of the first retrieval. In step S705, a word segmentation result is obtained using a lexical parser, as shown in Fig. 4. Fig. 4 is a diagram showing the word segmentation result obtained by the lexical parser; in Fig. 4, each sentence is divided into individual words. Word segmentation can be performed with any lexical analysis technique known in the art.

In step S707, the relevance score of each word in the feedback documents is calculated from the word segmentation result according to formula 1. Here, i = 1, 2, and the following relevance scores are obtained.

relevance_score(U.S.) = 2;

relevance_score(apple) = 1;

relevance_score(company) = 2;

relevance_score(release) = 2;

relevance_score(iPod) = 1;

relevance_score(numeral) = 1;

relevance_score(some) = 1;

relevance_score(novel) = 1.

After the words are sorted in descending order of their relevance scores, the candidate words for feedback are listed as follows:

U.S., company, release, apple, numeral, iPod, some, novel.

In step S709, the M (M=3) words with the highest relevance scores are selected as feedback information and added to the initial query, so that the new query becomes:

Q: music, player, U.S., company, release.

In step S711, the second retrieval is performed, and in step S713 the final retrieval results D1 and D2 are obtained and presented to the user.
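The conventional selection in steps S707 to S709 can be sketched as follows. This is a minimal illustration, assuming that the conventional relevance score of a word is simply its occurrence count in the feedback documents; the token lists are a hypothetical segmentation of D1 and D2 (Fig. 4 is not reproduced here), and the function name is an assumption.

```python
from collections import Counter


def conventional_feedback(query_words, feedback_docs, m):
    """Pick the m most frequent non-query words in the feedback documents."""
    counts = Counter()
    for tokens in feedback_docs:
        for word in tokens:
            if word not in query_words:
                counts[word] += 1
    # sorted() is stable, so ties keep the order of first occurrence.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return ranked[:m]


# Hypothetical word segmentation of the two feedback documents.
d1 = ["U.S.", "apple", "company", "release", "iPod", "numeral", "music", "player"]
d2 = ["some", "U.S.", "company", "release", "novel", "music", "player"]

query = ["music", "player"]
new_query = query + conventional_feedback(query, [d1, d2], 3)
print(new_query)
```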

Next, the case of using the search method of the present invention shown in Fig. 3 is described.

In step S301, the query "music player" is input, and the first retrieval is performed over D1 to D3. Any conventional search method can be used for the first retrieval, and processing such as word segmentation can be applied to the query "music player"; these are techniques commonly used by those skilled in the art.

According to a common segmentation method, the query "music player" is split into two query words, "music" and "player". Among the various search methods there is a simple one: count the occurrences of each query word in each document, and compute a score for each document from the weighted sum of these counts according to Formula 3.

score(doc_{i}) = \sum_{k} W_{k} \cdot n(doc_{i}, q_{k})

(Formula 3)

where doc_{i} denotes the i-th document, score(doc_{i}) denotes the score of doc_{i}, q_{k} denotes the k-th word in the query, W_{k} denotes the weight of query word q_{k}, and n(doc_{i}, q_{k}) denotes the number of occurrences of q_{k} in doc_{i}.

Those skilled in the art can devise any method for assigning weights to query words. For example, a query word that occurs frequently in irrelevant documents is assigned a lower weight; very common words such as "Yes", which occur frequently across many unrelated documents, are assigned very low weights. In this example, "music" and "player" are assigned equal weights. Thus, D1 and D2 are obtained as result documents.
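The simple scoring scheme just described (a weighted sum of query-word counts per document) can be sketched as below. The token lists are a hypothetical segmentation of D1 to D3, the weight value 0.5 is an assumed instance of "equal weights", and the rule that only documents with a positive score are returned is also an assumption made for illustration.

```python
def score(doc_tokens, weights):
    """Weighted sum of the occurrence counts of each query word in one document."""
    return sum(w * doc_tokens.count(q) for q, w in weights.items())


# Hypothetical word segmentation of the three candidate documents.
docs = {
    "D1": ["U.S.", "apple", "company", "release", "iPod", "numeral", "music", "player"],
    "D2": ["some", "U.S.", "company", "release", "novel", "music", "player"],
    "D3": ["iPod", "shuffle", "recently", "worldwide", "sell"],
}
weights = {"music": 0.5, "player": 0.5}  # equal weights, value assumed

scores = {name: score(tokens, weights) for name, tokens in docs.items()}
# Keep documents with a positive score, best first (stable for ties).
results = [n for n in sorted(scores, key=lambda n: -scores[n]) if scores[n] > 0]
print(results)
```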

In step S303, the first N (N=2) documents, namely D1 and D2, are taken as the feedback documents of the first retrieval.

In step S305, a word segmentation result is obtained by using a lexical analyzer, as shown in Fig. 4. Fig. 4 is a diagram showing the word segmentation result of D1 and D2 obtained by the lexical analyzer. In Fig. 4, each sentence is split into individual words. The segmentation can be performed with any lexical analysis technique known in the art.

Here, the number of occurrences of w_{j} in doc_{i} is used as word_score(w_{j}, doc_{i}) in Formula 4.

Therefore, based on the word segmentation result, word_score(w_{j}), i.e. the number of occurrences of each w_{j} in the feedback documents, is calculated according to Formula 4 as follows.

word_score(U.S.) = 2;

word_score(apple) = 1;

word_score(company) = 2;

word_score(release) = 2;

word_score(iPod) = 1;

word_score(numeral) = 1;

word_score(some) = 1;

word_score(novel) = 1.

In step S307, the system loops over the first N documents to obtain a parsing result from a shallow parser. A shallow parser is an analyzer that takes the word list of a sentence as input and outputs the dependency relations (associations) between these words, each dependency relation (association) having a basic relation score.

Fig. 5 is a diagram showing the parsing results of D1 and D2 obtained by the shallow parser. As described above, Fig. 5 shows the dependency relations between words and their basic relation scores.

Any shallow parser can be used to obtain the dependency relations, and the basic relation scores can be determined manually or taken from a dictionary of dependency relations. For example, the shallow parsers disclosed in Jianfeng Gao et al., "Dependence Language Model for Information Retrieval", SIGIR-2004, Sheffield, UK, July 25-29, and in Deniz Yuret, "Discovery of Linguistic Relations Using Lexical Attraction", PhD dissertation, MIT, 1998, can be adopted. A full parser can also be adopted. Any of these shallow parsers can produce the same desired parsing result, although their performance differs.

The basic relation score between w_{j} and q_{k} in doc_{i} is used as relation_score(w_{j}, q_{k}, doc_{i}) in Formula 4. Thus, according to Formula 4 and Fig. 5, the relation_score(w_{j}) between each word w_{j} in the documents and the query words q_{k} is calculated from the parsing result as follows.

relation_score(U.S.) = 0;

relation_score(apple) = 0;

relation_score(company) = 0;

relation_score(release) = 2 + 2 = 4;

relation_score(iPod) = 3;

relation_score(numeral) = 4;

relation_score(some) = 0;

relation_score(novel) = 2.

Note that the basic relation scores shown above are only exemplary. Any dictionary containing basic relation scores between words can be used. As a special case, all basic relation scores can be set equal, in which case the relation_score of a word represents the number of relations between that word and the words in the initial query.
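Summing basic relation scores per word can be sketched as follows. The individual dependency edges are hypothetical: the text gives only the per-word totals, so the document, query word, and score attached to each edge below are placeholders chosen so that the sums match those totals. The function name and tuple layout are likewise assumptions.

```python
from collections import defaultdict

# (document, word, query word, basic relation score); illustrative edges only.
dependencies = [
    ("D1", "release", "player", 2),
    ("D2", "release", "player", 2),
    ("D1", "iPod", "player", 3),
    ("D1", "numeral", "music", 4),
    ("D2", "novel", "player", 2),
]


def relation_scores(deps, query_words):
    """Sum the basic relation scores of each non-query word with the query words."""
    totals = defaultdict(int)
    for _doc, word, query_word, base_score in deps:
        if query_word in query_words and word not in query_words:
            totals[word] += base_score
    return dict(totals)


rel = relation_scores(dependencies, {"music", "player"})
# Words with no dependency on a query word implicitly score 0.
print(rel.get("release", 0), rel.get("U.S.", 0))
```

Setting every basic relation score to 1 in this sketch would make the result a plain count of relations, which corresponds to the special case mentioned above.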

In step S309, the system obtains relevance scores based on the word segmentation result obtained in step S305 and the parsing result obtained in step S307. Specifically, the system calculates the relevance score of each word in the first N documents by using the word segmentation result and the parsing result. The relevance score of each word in the first N documents, relevance_score(w_{j}), is calculated according to Formula 4. Here, i = 1, 2, and the following relevance scores are obtained.

relevance_score(U.S.) = word_score(U.S.) + relation_score(U.S.) = 2

relevance_score(apple) = word_score(apple) + relation_score(apple) = 1

relevance_score(company) = word_score(company) + relation_score(company) = 2

relevance_score(release) = word_score(release) + relation_score(release) = 6

relevance_score(iPod) = word_score(iPod) + relation_score(iPod) = 4

relevance_score(numeral) = word_score(numeral) + relation_score(numeral) = 5

relevance_score(some) = word_score(some) + relation_score(some) = 1

relevance_score(novel) = word_score(novel) + relation_score(novel) = 3

After these words are sorted in descending order of their relevance scores, the candidate words for feedback are listed as follows:

release, numeral, iPod, novel, U.S., company, apple, some.

In step S311, the M (M=3) words with the highest relevance scores are selected as feedback information and added to the initial query, so that the new query becomes:

Q: music, player, release, numeral, iPod.
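Steps S309 and S311 of the example can be reproduced with a short sketch that adds the two per-word scores and keeps the top M words. The score values are those given in the example; the variable names and the tie-breaking rule (earlier-listed word first) are assumptions.

```python
word_score = {"U.S.": 2, "apple": 1, "company": 2, "release": 2,
              "iPod": 1, "numeral": 1, "some": 1, "novel": 1}
relation_score = {"U.S.": 0, "apple": 0, "company": 0, "release": 4,
                  "iPod": 3, "numeral": 4, "some": 0, "novel": 2}

# Relevance score of each word = word score + relation score.
relevance = {w: word_score[w] + relation_score[w] for w in word_score}

# Sort by descending relevance; sorted() is stable, so ties keep listing order.
candidates = sorted(relevance, key=lambda w: -relevance[w])

M = 3
new_query = ["music", "player"] + candidates[:M]
print(candidates)
print(new_query)
```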

In step S313, the second retrieval is performed. The second retrieval can be performed with the same method as the first retrieval in step S301. Preferably, but not necessarily, the system assigns weights to the query words, including the new query words.

In step S315, the final retrieval results D1, D2, and D3 are obtained and presented to the user.

As can be seen from the above, with the conventional method "U.S.", "company", and "release" are obtained as feedback information merely because these words occur most often in the first 2 documents obtained by the first retrieval. With the method according to the present invention, by contrast, "release", "numeral", and "iPod" are obtained as feedback information, because the calculation of the relevance score also takes into account the fact that these words have higher basic relation scores with the initial query words "music" and "player". As a result, the method of Fig. 7 yields D1 and D2 as retrieval results, whereas the method of Fig. 3 yields D1, D2, and D3. That is, the conventional method of Fig. 7 misses D3, even though D3 is also a highly relevant document that the user may need, whereas the method according to the present invention shown in Fig. 3 presents D3 to the user as a final retrieval result.

It can therefore be seen that, compared with the conventional method, the search system and method according to the present invention provide better performance in finding more of the desired documents.

For ease of understanding, the first embodiment of the present invention has been described with reference to an example. However, the specific values and formulas mentioned here are only exemplary and are not intended to limit the scope of the invention. As mentioned above, any conventional search method can be used for the first and second retrievals, any lexical analyzer can be used to obtain the word segmentation result, and any parser can be used to obtain the dependency relations between words.

(Second embodiment)

The second embodiment is described below.

In step S313 of the first embodiment, the system preferably assigns a weight to each query word. In a conventional search system using relevance feedback information, every new query word obtained from the feedback information is used equally in the second retrieval. For example, in the example above, the system may assign the following weights to the query words: W(music)=0.263, W(player)=0.263, W(release)=0.158, W(numeral)=0.158, W(iPod)=0.158.

However, the relevance scores of the words in the feedback information are not equal to one another. This means that the words in the feedback information contribute unequally and are therefore preferably weighted differently.

In one example, the weights of the query words can be adjusted according to Formula 5 below.

W'(q_{m}) = W(q_{m}) \cdot relevance_score(q_{m}) / ( (1/M) \sum_{m'=1}^{M} relevance_score(q_{m'}) ), m = 1, ..., M

(Formula 5)

where W'(q_{m}) denotes the adjusted weight of the m-th word in the feedback information, W(q_{m}) denotes the unadjusted weight of the m-th word in the feedback information, \sum_{m'=1}^{M} relevance_score(q_{m'}) denotes the sum of the relevance scores, calculated in step S309, of all words in the feedback information, and M denotes the total number of words in the feedback information.

Therefore, the adjusted weight of each new query word is as follows.

W'(release) = 0.158 * 6 / ((6+5+4)/3) = 0.1896,

W'(numeral) = 0.158 * 5 / ((6+5+4)/3) = 0.158,

W'(iPod) = 0.158 * 4 / ((6+5+4)/3) = 0.1264.

As can be seen from the above, according to this embodiment the adjusted weight of each new query word reflects its different contribution as feedback information.
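The weight adjustment of this embodiment can be sketched as scaling each unadjusted weight by the ratio of the word's relevance score to the mean relevance score of the feedback words, which is the form the worked arithmetic above takes. The function and variable names are assumptions; the input values are those of the example.

```python
def adjust_weights(weights, relevance):
    """Scale each feedback word's weight by relevance / mean relevance."""
    mean_relevance = sum(relevance.values()) / len(relevance)
    return {w: weights[w] * relevance[w] / mean_relevance for w in relevance}


# Unadjusted weights and relevance scores of the three feedback words.
unadjusted = {"release": 0.158, "numeral": 0.158, "iPod": 0.158}
relevance = {"release": 6, "numeral": 5, "iPod": 4}

adjusted = adjust_weights(unadjusted, relevance)
for word, weight in adjusted.items():
    print(word, round(weight, 4))
```

Because the scaling factors average to 1, the total weight given to the feedback words is unchanged when their unadjusted weights are equal; only its distribution among them shifts.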

(Third embodiment)

The third embodiment is described below with reference to Fig. 6.

Fig. 6 is a flowchart showing the filtering process used in the third embodiment.

In step S303 of the first embodiment, the first N documents are obtained as feedback documents through the first retrieval.

In a conventional search system using relevance feedback information, the first N documents produced by the first retrieval are used directly for feedback retrieval without further processing. The inventors found that the performance of feedback retrieval also depends on the precision of the first retrieval, and that the first N documents obtained by the first retrieval are sometimes not good enough for use in feedback retrieval.

In the third embodiment, after the first N documents have been obtained by the first retrieval, they are further filtered in step S303.

Fig. 6 is a flowchart showing the preferred filtering process performed in step S303.

In step S601, each document is analyzed to find its phrase distribution. Known phrase techniques define when two words can be counted as a phrase. For example, at the high level, two words are counted as a phrase in a document if they are adjacent in the query and adjacent in the document, with no other word between them. At the medium level, two words are counted as a phrase if they are adjacent in the query and appear in the same sentence of the document (possibly with other words between them). At the low level, two words are counted as a phrase if they both appear in the query and both appear in the same sentence of the document (possibly with other words between them).

For example, when the query is "Chinese Economy Development":

At the high level,

Document 1: China's economy. (phrase number = 1)

Document 2: The economy of China. Development in China. (phrase number = 0)

Document 3: China's economy. Economic development. (phrase number = 2)

At the medium level,

Document 1: China's economy. (phrase number = 1)

Document 2: The economy of China. Development in China. (phrase number = 1)

Document 3: China's economy. Economic development. (phrase number = 2)

At the low level,

Document 1: China's economy. (phrase number = 1)

Document 2: The economy of China. Development in China. (phrase number = 2)

Document 3: China's economy. Economic development. (phrase number = 2)

The phrase number of each document is calculated, for example, in the manner described above.
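The three phrase levels can be sketched as below. This is one possible reading of the definitions, not a prescribed implementation: documents are assumed to be pre-segmented into sentences of tokens, the English tokens are placeholders for the segmented Chinese words, query words are assumed to be distinct, and word order is enforced only at the high level.

```python
from itertools import combinations


def phrase_count(query, sentences, level):
    """Count phrases formed by query words in a document.

    query: ordered list of distinct query words.
    sentences: list of sentences, each a list of tokens.
    level: "high", "medium" or "low".
    """
    adjacent_in_query = set(zip(query, query[1:]))
    count = 0
    for sent in sentences:
        if level == "high":
            # Adjacent in the query and adjacent in the sentence.
            count += sum(1 for pair in zip(sent, sent[1:]) if pair in adjacent_in_query)
        elif level == "medium":
            # Adjacent in the query, anywhere in the same sentence.
            count += sum(1 for a, b in adjacent_in_query if a in sent and b in sent)
        else:
            # Any two query words in the same sentence.
            present = [q for q in query if q in sent]
            count += len(list(combinations(present, 2)))
    return count


query = ["China", "economy", "development"]
doc1 = [["China", "economy"]]
doc2 = [["China", "of", "economy"], ["China", "of", "development"]]
doc3 = [["China", "economy"], ["economy", "development"]]

for level in ("high", "medium", "low"):
    print(level, [phrase_count(query, d, level) for d in (doc1, doc2, doc3)])
```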

In step S603, the documents are classified into document groups according to their phrase numbers, for example a group of documents with no phrase, a group of documents with exactly one phrase, and a group of documents with exactly two phrases.

For example, at the high level, Document 1 is classified into the group {phrase number = 1}, Document 2 into the group {phrase number = 0}, and Document 3 into the group {phrase number = 2}.

In step S605, the system filters out some documents and retains only those whose phrase number satisfies a condition. For example, documents belonging to the groups {phrase number > 0} are retained. These documents are used as the final selected feedback documents and replace the first N documents in step S305 and the subsequent processing.
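Steps S603 and S605 amount to grouping documents by phrase number and keeping the groups that satisfy a condition. The sketch below uses the high-level phrase numbers from the example; the function names and the `min_phrases` parameter are assumptions standing in for whatever condition an implementation chooses.

```python
from collections import defaultdict


def group_by_phrase_number(phrase_numbers):
    """Step S603: map each phrase number to the documents that have it."""
    groups = defaultdict(list)
    for doc, n in phrase_numbers.items():
        groups[n].append(doc)
    return dict(groups)


def select_feedback_docs(phrase_numbers, min_phrases=1):
    """Step S605: keep only documents with at least min_phrases phrases."""
    return [doc for doc, n in phrase_numbers.items() if n >= min_phrases]


# High-level phrase numbers from the example above.
high_level = {"Document 1": 1, "Document 2": 0, "Document 3": 2}

groups = group_by_phrase_number(high_level)
selected = select_feedback_docs(high_level)
print(groups)
print(selected)
```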

(Fourth embodiment)

The fourth embodiment is described below.

In conventional search systems and methods using relevance feedback information, document length also influences the calculation of the feedback information: longer documents can have an unfair "advantage" when the feedback information is calculated. The relevance score is therefore preferably adjusted by length normalization.

In the fourth embodiment, in step S309, a normalization ratio is calculated for each document based on its length. For example, the normalization ratio can be calculated according to Formula 6 below.

normalization ratio = 1 / (1 + log(document length)) (Formula 6)

However, other methods of calculating the normalization ratio can also be used; for example, the normalization ratio can simply be calculated as 1 / length.
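Both normalization ratios can be sketched as below. The text does not state the base of the logarithm or the unit of document length; this sketch assumes the natural logarithm and a length measured in words, and the function name and `simple` flag are illustrative.

```python
import math


def normalization_ratio(length, simple=False):
    """Document length normalization ratio.

    length: document length (assumed here to be a word count >= 1).
    simple: if True, use 1 / length instead of 1 / (1 + log(length)).
    """
    if simple:
        return 1.0 / length
    return 1.0 / (1.0 + math.log(length))  # natural log, an assumption


for n_words in (1, 10, 100):
    print(n_words, normalization_ratio(n_words), normalization_ratio(n_words, simple=True))
```

With either variant, a longer document receives a smaller ratio; the logarithmic form shrinks much more slowly than 1 / length, so long documents are damped rather than nearly discarded.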

Therefore, Formula 7 below can be used in place of Formula 4 to calculate the relevance score of each word.

relevance_score(w_{j}) = \sum_{i} \lambda_{i} \cdot ( word_score(w_{j}, doc_{i}) + \sum_{k} relation_score(w_{j}, q_{k}, doc_{i}) )

(Formula 7)

where \lambda_{i} denotes the normalization ratio of doc_{i}.

Four embodiments have been described above. The first embodiment is a relevance feedback system that performs information retrieval using parsing results. The second embodiment modifies the process of step S313 of the first embodiment by adjusting the weight of each new query word. The third embodiment modifies the process of step S303 of the first embodiment by filtering the first N documents. The fourth embodiment modifies the process of step S309 of the first embodiment by applying length normalization to each document. However, it will be apparent to those skilled in the art that the first to fourth embodiments can be combined arbitrarily. That is, any combination of the above embodiments falls within the scope of the present invention.

The method and system of the present invention can be implemented in many ways, for example by software, hardware, firmware, or any combination thereof. The order of the method steps described above is only illustrative, and the method steps of the present invention are not limited to the order specifically described above unless otherwise expressly stated. In addition, in some embodiments, the present invention can also be implemented as a program recorded on a recording medium, comprising machine-readable instructions for implementing the method according to the present invention. The present invention thus also covers a recording medium storing a program for implementing the method according to the present invention.

Although Chinese is used as an example in the above to illustrate the principle of the present invention, the present invention can be applied to any language. That is, the method of the present invention is independent of the language and is applicable to all search systems.

Although some specific embodiments of the present invention have been shown in detail by way of example, those skilled in the art will understand that the above examples are intended only to be illustrative and not to limit the scope of the invention. Those skilled in the art will appreciate that the above embodiments can be modified without departing from the scope and spirit of the present invention. The scope of the present invention is defined by the appended claims.

Claims

1. A search method, comprising:

a first retrieval step of performing a first retrieval on a plurality of documents by using an initial query to obtain result documents, and selecting feedback documents from the result documents;

a selection step of filtering the feedback documents using the number of phrases in each feedback document as a criterion, and selecting some of the feedback documents as selected feedback documents, the phrases being formed by words in the initial query;

a feedback information acquisition step of acquiring feedback information from the selected feedback documents based on dependency relations between words in the selected feedback documents and words in the initial query;

a generation step of generating a new query by adding the feedback information to the initial query; and

a second retrieval step of performing a second retrieval on the plurality of documents by using the new query.

2. The search method according to claim 1, wherein the feedback information acquisition step comprises:

a relevance score calculation step of calculating relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query.

3. The search method according to claim 2, wherein:

the relevance score calculation step calculates the relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query, and based on the number of occurrences of each word in the selected feedback documents.

4. The search method according to claim 1, wherein:

the feedback information acquisition step acquires the feedback information based on the dependency relations between words in the selected feedback documents and words in the initial query, and based on basic relation scores between words in the selected feedback documents and words in the initial query.

5. The search method according to claim 4, wherein the feedback information acquisition step comprises:

a relevance score calculation step of calculating relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query, and based on the basic relation scores between words in the selected feedback documents and words in the initial query.

6. The search method according to claim 5, wherein:

the relevance score calculation step calculates the relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query, based on the basic relation scores between words in the selected feedback documents and words in the initial query, and based on the number of occurrences of each word in the selected feedback documents.

7. The search method according to any one of claims 2, 3, 5 and 6, wherein the second retrieval step comprises:

a weight adjustment step of adjusting the weight of each word in the feedback information by using the relevance scores, the weights being used during the second retrieval.

8. The search method according to any one of claims 2, 3, 5 and 6, wherein the feedback information acquisition step comprises:

a word selection step of selecting a predetermined number of words having the highest relevance scores as the feedback information.

9. The search method according to claim 7, wherein the feedback information acquisition step comprises:

10. The search method according to any one of claims 1 to 6 and 9, wherein the feedback information acquisition step comprises:

a document length normalization step of calculating a document length normalization ratio according to the length of each selected feedback document, and applying the document length normalization ratio to the calculation of the feedback information.

11. The search method according to claim 7, wherein the feedback information acquisition step comprises:

12. The search method according to claim 8, wherein the feedback information acquisition step comprises:

13. The search method according to any one of claims 1 to 6, wherein:

the dependency relations are obtained by using a parser.

14. The search method according to claim 13, wherein:

the parser is a shallow parser.

15. A search system, comprising:

first retrieval means for performing a first retrieval on a plurality of documents by using an initial query to obtain result documents, and selecting feedback documents from the result documents;

selection means for filtering the feedback documents using the number of phrases in each feedback document as a criterion, and selecting some of the feedback documents as selected feedback documents, the phrases being formed by words in the initial query;

feedback information acquisition means for acquiring feedback information from the selected feedback documents based on dependency relations between words in the selected feedback documents and words in the initial query;

generation means for generating a new query by adding the feedback information to the initial query; and

second retrieval means for performing a second retrieval on the plurality of documents by using the new query.

16. The search system according to claim 15, wherein the feedback information acquisition means comprises:

relevance score calculation means for calculating relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query.

17. The search system according to claim 16, wherein:

the relevance score calculation means calculates the relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query, and based on the number of occurrences of each word in the selected feedback documents.

18. The search system according to claim 15, wherein:

the feedback information acquisition means acquires the feedback information based on the dependency relations between words in the selected feedback documents and words in the initial query, and based on basic relation scores between words in the selected feedback documents and words in the initial query.

19. The search system according to claim 18, wherein the feedback information acquisition means comprises:

relevance score calculation means for calculating relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query, and based on the basic relation scores between words in the selected feedback documents and words in the initial query.

20. The search system according to claim 19, wherein:

the relevance score calculation means calculates the relevance scores based on the dependency relations between words in the selected feedback documents and words in the initial query, based on the basic relation scores between words in the selected feedback documents and words in the initial query, and based on the number of occurrences of each word in the selected feedback documents.

21. The search system according to any one of claims 16, 17, 19 and 20, wherein the second retrieval means comprises:

weight adjustment means for adjusting the weight of each word in the feedback information by using the relevance scores, the weights being used during the second retrieval.

22. The search system according to any one of claims 16, 17, 19 and 20, wherein the feedback information acquisition means comprises:

word selection means for selecting a predetermined number of words having the highest relevance scores as the feedback information.

23. The search system according to claim 21, wherein the feedback information acquisition means comprises:

24. The search system according to any one of claims 15 to 20 and 23, wherein the feedback information acquisition means comprises:

document length normalization means for calculating a document length normalization ratio according to the length of each selected feedback document, and applying the document length normalization ratio to the calculation of the feedback information.

25. The search system according to claim 21, wherein the feedback information acquisition means comprises:

26. The search system according to claim 22, wherein the feedback information acquisition means comprises:

27. The search system according to any one of claims 15 to 20, wherein:

the dependency relations are obtained by using a parser.

28. The search system according to claim 27, wherein:

the parser is a shallow parser.