CN101271462A - Method and device for correcting retrieval formula and file retrieval - Google Patents

Method and device for correcting retrieval formula and file retrieval

Publication number
CN101271462A
Authority
CN
China
Prior art keywords
mentioned
retrieval
retrieval type
weights
correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2007100891627A
Other languages
Chinese (zh)
Inventor
王海峰
朱江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corp
Priority to CNA2007100891627A
Publication of CN101271462A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method for revising a retrieval formula, a document retrieval method, a device for revising a retrieval formula, and a document retrieval device. According to one aspect, the method for revising a retrieval formula comprises: searching a document collection using the retrieval formula to obtain a plurality of relevant documents; generating a plurality of multi-document summaries from the retrieved relevant documents, wherein each multi-document summary corresponds to one document class among the relevant documents; and revising the retrieval formula using the multi-document summaries.

Description

Method and device for revising a retrieval formula and for document retrieval
Technical field
The present invention relates to information processing technology, and in particular to information retrieval technology.
Background
Information retrieval technology is widely used, and many known techniques improve the performance of information retrieval systems from different angles. One of them is pseudo relevance feedback (PRF). Without any feedback from the user, this technique automatically uses the documents retrieved in the system's previous round as feedback information to revise the retrieval formula, thereby improving retrieval performance.
However, the result of one round of retrieval usually contains a large number of candidate documents, and how these documents are used directly affects the effectiveness of pseudo relevance feedback. Existing pseudo relevance feedback techniques are limited to using each document independently as feedback. They do not consider the relationships among multiple documents, let alone reasonably synthesize the content of multiple documents.
Summary of the invention
To solve the above problems in the prior art, the present invention provides a method for revising a retrieval formula, a document retrieval method, a device for revising a retrieval formula, and a document retrieval device.
According to one aspect of the present invention, there is provided a method for revising a retrieval formula, comprising: searching a document collection using the retrieval formula to obtain a plurality of relevant documents; generating a plurality of multi-document summaries from the retrieved relevant documents, wherein each multi-document summary corresponds to one document class among the relevant documents; and revising the retrieval formula using the multi-document summaries.
According to another aspect of the present invention, there is provided a document retrieval method, comprising: constructing a retrieval formula according to a retrieval request; revising the retrieval formula using the above method for revising a retrieval formula; and retrieving documents using the revised retrieval formula.
According to yet another aspect of the present invention, there is provided a device for revising a retrieval formula, comprising: a retrieval unit that searches a document collection using the retrieval formula to obtain a plurality of relevant documents; a multi-document summary generation unit that generates a plurality of multi-document summaries from the relevant documents retrieved by the retrieval unit, wherein each multi-document summary corresponds to one document class among the relevant documents; and a retrieval formula revision unit that revises the retrieval formula using the multi-document summaries.
According to yet another aspect of the present invention, there is provided a document retrieval device, comprising: a retrieval formula construction unit that constructs a retrieval formula according to a retrieval request; the above device for revising a retrieval formula, which revises the retrieval formula; and a document retrieval unit that retrieves documents using the revised retrieval formula.
Brief description of the drawings
The above features, advantages and objects of the present invention will be better understood from the following description of specific embodiments taken in conjunction with the accompanying drawings.
Fig. 1 is a flowchart of a method for revising a retrieval formula according to an embodiment of the present invention;
Fig. 2 is a detailed flowchart of the training process for the end condition according to an embodiment of the present invention;
Fig. 3 is a detailed flowchart of revising a retrieval formula according to an embodiment of the present invention;
Fig. 4 is a flowchart of a document retrieval method according to another embodiment of the present invention;
Fig. 5 is a block diagram of a device for revising a retrieval formula according to another embodiment of the present invention; and
Fig. 6 is a block diagram of a document retrieval device according to another embodiment of the present invention.
Detailed description of embodiments
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Method for revising a retrieval formula
Fig. 1 is a flowchart of a method for revising a retrieval formula according to an embodiment of the present invention. As shown in Fig. 1, first, in step 101, a document collection is searched using the retrieval formula to obtain a plurality of relevant documents. In this embodiment, the retrieval formula may be generated from a user's retrieval request or by another system; the present invention places no restriction on how it is generated. A retrieval formula generally includes one or more terms, and each term may also have an initial weight. The document collection is the document library being searched and contains a large number of documents.
Optionally, in this step, the retrieved documents may further be sorted by their relevance to the retrieval formula, yielding a ranked retrieval result. Any sorting method known in the art or developed in the future may be used; the present invention places no restriction on this.
Next, in step 105, it is determined whether an end condition is satisfied. If so, retrieval ends; if not, the method proceeds to step 110. In this embodiment, the end condition is the number of times the retrieval formula is revised, which is obtained by training; the details are described below with reference to Fig. 2. It should be understood, however, that the end condition is not limited to the number of revisions. The relevant documents obtained in step 101 may also be evaluated directly: if they meet the requirement, the method ends; otherwise, revision of the retrieval formula continues.
Next, in step 110, a plurality of multi-document summaries are generated from the relevant documents retrieved in step 101. Specifically, the relevant documents are first clustered according to the relevance among them, yielding a plurality of document classes, each of which contains one or more documents. A summary is then generated for each document class and serves as the multi-document summary of that class.
Several methods for generating multi-document summaries are known, for example the one described in J. Goldstein, V. Mittal, J. Carbonell and M. Kantrowitz, "Multi-Document Summarization by Sentence Extraction" (In Proceedings of NAACL-ANLP 2000 Workshop: Automatic Summarization, 2000) (hereinafter "Document 1"), the entire content of which is incorporated herein by reference. The present invention places no particular restriction on how the multi-document summaries are generated, as long as a summary reflecting the content of a plurality of documents can be generated from them.
Next, in step 115, the retrieval formula is revised using the multi-document summaries generated in step 110. The details of revising the retrieval formula are described below with reference to Fig. 3.
Finally, steps 101 to 115 are repeated with the revised retrieval formula until the end condition is satisfied, at which point a retrieval result that meets the retrieval requirement is obtained.
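The loop of Fig. 1 can be sketched roughly as follows. This is only an illustrative sketch: the function name `revise_until_done`, the dictionary representation of a retrieval formula, and the three callables passed in are assumptions made for the example and do not come from the patent text. The end condition is taken to be a fixed number of revision rounds, as in this embodiment.

```python
from typing import Callable, Dict, List

# Hypothetical representation: a retrieval formula maps each term to its weight.
Formula = Dict[str, float]


def revise_until_done(
    formula: Formula,
    retrieve: Callable[[Formula], List[str]],
    summarize: Callable[[List[str]], List[str]],
    revise: Callable[[Formula, List[str]], Formula],
    max_rounds: int,
) -> List[str]:
    """Repeat retrieve / summarize / revise until the round limit is reached."""
    rounds = 0
    while True:
        documents = retrieve(formula)          # step 101: search the collection
        if rounds >= max_rounds:               # step 105: end condition
            return documents
        summaries = summarize(documents)       # step 110: multi-document summaries
        formula = revise(formula, summaries)   # step 115: revise the formula
        rounds += 1
```

With `max_rounds` set to the trained number of revisions, the function performs that many revisions and returns the result of the final retrieval.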
Next, the training process for the end condition used in step 105 is described in detail with reference to Fig. 2. Fig. 2 is a detailed flowchart of the training process for the end condition according to an embodiment of the present invention. As shown in Fig. 2, first, in step 201, a plurality of retrieval formulas and a training document collection are provided. The training document collection contains the known relevant documents corresponding to the retrieval formulas, and also contains irrelevant documents unrelated to those retrieval formulas. Such a training document collection is usually prepared manually; one example is TREC Appendices, 2003. For details, see the Text REtrieval Conference (TREC), an annual international evaluation conference organized by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA).
Then, in step 202, the loop count for retrieval result evaluation is initialized. In this embodiment, for example, the initial value of the loop count is set to 0.
Then, in step 203, the training document collection is searched using the group of retrieval formulas formed by the known retrieval formulas, and a ranked retrieval result is obtained. The sorting method is the same as that in step 101 and is not repeated here.
Then, in step 205, the retrieval result obtained in step 203 is evaluated for this group of retrieval formulas using the known relevant documents corresponding to the group, producing an evaluation value that represents the effectiveness of retrieval with this group. Specifically, the retrieval result obtained with the group is compared with the relevant documents corresponding to the group to judge whether the result meets the requirement, and an evaluation value is obtained. For example, the evaluation value lies between 0 and 1: a value of 0 indicates that the retrieval result is completely different from the relevant documents corresponding to the retrieval formulas, and a value of 1 indicates that the retrieval result is identical to them.
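The text does not name a specific evaluation measure, only that it ranges from 0 (completely different) to 1 (identical). As one possible illustration, the sketch below uses set overlap (the Jaccard coefficient) between the retrieved documents and the known relevant documents. The choice of measure and the function name `evaluation_value` are assumptions for the example, not something the patent specifies.

```python
from typing import Iterable


def evaluation_value(retrieved: Iterable[str], relevant: Iterable[str]) -> float:
    """Return a value in [0, 1]: 0 for disjoint sets, 1 for identical sets."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    union = retrieved_set | relevant_set
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(retrieved_set & relevant_set) / len(union)
```

Rank-sensitive measures such as average precision would also fit the description, since the retrieval result in step 203 is ranked.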
If it is determined in step 205 that the retrieval result is optimal, the loop count is output in step 206 and training ends in step 207. If it is determined in step 205 that the retrieval result is not optimal, training proceeds to step 208.
Then, in steps 208 and 209, the group of retrieval formulas is revised according to the retrieval result. The details of revising a retrieval formula are described below with reference to Fig. 3.
Then, steps 202 to 209 are repeated with the revised group of retrieval formulas until the retrieval result is optimal. In each loop, the loop count is incremented by 1 in step 204.
Finally, the optimal loop count is obtained, that is, the number of loops needed to obtain the optimal retrieval result for this group of retrieval formulas.
In this embodiment, the optimal loop count is trained using a training document collection in which the relevant documents corresponding to each of the known retrieval formulas are known. An ordinary document collection may equally be used for training, in which case the evaluation must be performed manually: a person assesses the retrieval result obtained and judges whether it is optimal.
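The training procedure of Fig. 2 can be sketched as a loop that counts revisions until the evaluation value stops improving. The stopping rule is an assumption: the text only says training stops when the result is "optimal" without defining how that is detected. The names `train_round_count`, `evaluate`, `revise` and the safety cap `max_rounds` are likewise hypothetical.

```python
from typing import Callable, TypeVar

Q = TypeVar("Q")  # whatever object represents the group of retrieval formulas


def train_round_count(
    formulas: Q,
    evaluate: Callable[[Q], float],
    revise: Callable[[Q], Q],
    max_rounds: int = 20,
) -> int:
    """Return the number of revision rounds that gave the best evaluation value."""
    best_rounds = 0                      # step 202: initialize the loop count
    best_score = evaluate(formulas)      # steps 203 and 205: retrieve and evaluate
    for rounds in range(1, max_rounds + 1):
        formulas = revise(formulas)      # steps 208 and 209: revise the group
        score = evaluate(formulas)
        if score <= best_score:
            # Assumption: "optimal" means the score no longer improves.
            break
        best_rounds, best_score = rounds, score
    return best_rounds                   # step 206: output the loop count
```

The returned count can then serve as the end condition checked in step 105.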
Next, the process of revising the retrieval formula using the multi-document summaries in steps 115 and 209 is described in detail with reference to Fig. 3. Fig. 3 is a detailed flowchart of revising a retrieval formula according to an embodiment of the present invention. As shown in Fig. 3, first, in step 301, the relevant documents retrieved with the original retrieval formula are sorted by their relevance to that formula. The sorting method is the same as that described for step 101, and its description is omitted here.
Then, in step 302, if too many relevant documents were retrieved, only some of them may be taken as feedback information, for example the top-ranked documents after sorting.
Then, in step 305, multi-document summaries are generated from the feedback information. Specifically, the relevant documents are first clustered according to the relevance among them, yielding a plurality of document classes, for example M document classes in this embodiment (shown at reference numeral 306), each of which contains one or more documents. A summary is then generated for each document class and serves as the multi-document summary of that class, so that M multi-document summaries are generated in this embodiment (shown at reference numeral 308).
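The clustering part of step 305 could look like the sketch below. The patent does not prescribe a clustering algorithm. The greedy single-pass scheme, the Jaccard similarity on word sets, the threshold value and the function names are all assumptions chosen to keep the example short. Summary generation itself (for example by the sentence-extraction approach of Document 1) is not shown.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-set overlap used here as a stand-in for document relevance."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_documents(docs: List[str], threshold: float = 0.3) -> List[List[str]]:
    """Greedily assign each document to the first sufficiently similar class."""
    clusters: List[List[str]] = []
    cluster_tokens: List[Set[str]] = []   # accumulated vocabulary of each class
    for doc in docs:
        tokens = set(doc.lower().split())
        for i, seen in enumerate(cluster_tokens):
            if jaccard(tokens, seen) >= threshold:
                clusters[i].append(doc)
                cluster_tokens[i] = seen | tokens
                break
        else:
            # No existing class is similar enough: start a new document class.
            clusters.append([doc])
            cluster_tokens.append(tokens)
    return clusters
```

Each returned list corresponds to one document class, and a summarizer would then be run once per class to obtain the M multi-document summaries.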
Then, in step 304, the weight of each sentence in each multi-document summary is calculated. For example, the method described in Document 1 may be used, but the present invention places no particular restriction on this.
Then, in step 307, the weight of each of the M document classes obtained at 306 is calculated. Specifically, suppose a document class $C$ contains $K$ documents and the weight of the $k$-th document $D_k$ ($k = 1, \ldots, K$) is $\mathrm{Weight}(D_k)$. The weight $\mathrm{Weight}(C)$ of the document class $C$ is then the sum of the weights of the documents in the class:
\[ \mathrm{Weight}(C) = \sum_{k=1}^{K} \mathrm{Weight}(D_k) \tag{1} \]
Here, the weight $\mathrm{Weight}(D_k)$ of each document is the value returned by the previous round of retrieval.
Then, in step 310, the initial weight of each new term appearing in each of the M multi-document summaries obtained at 308 is calculated. For example, the method described in S. E. Robertson and K. Sparck Jones, "Simple, Proven Approaches to Text Retrieval" (Technical Report Number 356, Computer Laboratory, University of Cambridge, 1994) (hereinafter "Document 2") may be used, the entire content of which is incorporated herein by reference. The present invention places no particular restriction on this.
Then, in step 311, the term frequency of each new term appearing in each of the M multi-document summaries obtained at 308 is calculated.
Incidentally, no particular order is required among steps 304, 307, 310 and 311. That is, after multi-document summarization, any one of steps 304, 307, 310 and 311 may be performed first; the present invention places no restriction on this.
Then, in step 314, the weights of the new terms are revised using the sentence weights calculated in step 304, the document class weights calculated in step 307, the initial weights of the new terms calculated in step 310, and the term frequencies of the new terms calculated in step 311.
Specifically, suppose the initial weight of a new term $W$ is $\mathrm{Weight}(W)$ and its term frequency is $\mathrm{freq}(W)$. The sentences containing $W$ are denoted $S_j$ ($j = 1, \ldots, J$, where $J$ is the number of sentences containing $W$), and the weight of $S_j$ is $\mathrm{Weight}(S_j)$. The document classes containing $W$ are denoted $C_i$ ($i = 1, \ldots, I$, where $I$ is the number of document classes containing $W$), and the weight of $C_i$ is $\mathrm{Weight}(C_i)$. The revised weight $\mathrm{Weight}'(W)$ of $W$ is then:
\[ \mathrm{Weight}'(W) = \mathrm{Weight}(W) \cdot \mathrm{freq}(W) \cdot \sum_{j=1}^{J} \mathrm{Weight}(S_j) \cdot \sum_{i=1}^{I} \mathrm{Weight}(C_i) \tag{2} \]
It should be noted that not all four of the above factors (the document class weights, the initial weight of the new term, the term frequency of the new term, and the sentence weights) need to be considered when revising the weight of a new term; any one or more of them may be used.
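Formulas (1) and (2) translate directly into code. The sketch below is a minimal illustration; the function and parameter names are chosen for the example and are not taken from the patent.

```python
from typing import Sequence


def cluster_weight(document_weights: Sequence[float]) -> float:
    """Formula (1): a document class weight is the sum of its document weights."""
    return sum(document_weights)


def revised_term_weight(
    initial_weight: float,
    frequency: float,
    sentence_weights: Sequence[float],
    cluster_weights: Sequence[float],
) -> float:
    """Formula (2): initial weight * frequency * sum of sentence weights
    * sum of document class weights, over the sentences and classes
    that contain the term."""
    return (
        initial_weight
        * frequency
        * sum(sentence_weights)
        * sum(cluster_weights)
    )
```

For example, a term with initial weight 2.0 and frequency 3 that occurs in two sentences weighted 0.5 each and in two classes weighted 0.75 and 0.25 gets a revised weight of 2.0 * 3 * 1.0 * 1.0 = 6.0. To drop one of the four factors, as the text allows, the corresponding argument can be replaced by a neutral value such as 1 or `[1.0]`.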
Then, in step 315, the N new terms with the largest weights are added to the original retrieval formula. The weight of each of these new terms may also be added to the original retrieval formula.
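Step 315 can be sketched as a top-N selection over the revised new-term weights. The dictionary representation of the retrieval formula, the function name `expand_formula`, and the alphabetical tie-break are assumptions made for the example.

```python
from typing import Dict


def expand_formula(
    formula: Dict[str, float],
    new_term_weights: Dict[str, float],
    n: int,
) -> Dict[str, float]:
    """Add the n highest-weighted new terms (with their weights) to the formula."""
    # Only terms not already in the original formula count as new terms.
    candidates = {t: w for t, w in new_term_weights.items() if t not in formula}
    # Sort by descending weight; ties are broken alphabetically (an assumption).
    top = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))[:n]
    expanded = dict(formula)
    expanded.update(top)
    return expanded
```

The original formula is left unmodified and a new dictionary is returned, which keeps each revision round easy to compare with the previous one.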
In addition, after step 302, the weights of the original terms are revised in step 303 using the feedback information. For example, the method described in Document 2 may be used.
Then, in step 309, it is determined whether each original term appears in the M multi-document summaries obtained at 308. If it does, the weight of that original term is further revised in step 312, for example by multiplying it by a coefficient greater than 1.
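Steps 309 and 312 can be sketched as follows. The coefficient value 1.2, the whitespace tokenization and the function name `boost_original_terms` are assumptions for illustration; the text only requires a coefficient greater than 1.

```python
from typing import Dict, Iterable, Set


def boost_original_terms(
    formula: Dict[str, float],
    summaries: Iterable[str],
    factor: float = 1.2,
) -> Dict[str, float]:
    """Multiply the weight of each original term found in any summary by factor."""
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    summary_tokens: Set[str] = set()
    for summary in summaries:
        summary_tokens.update(summary.lower().split())
    # Step 309: test membership; step 312: boost the weight if present.
    return {
        term: weight * factor if term.lower() in summary_tokens else weight
        for term, weight in formula.items()
    }
```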
Finally, the above steps yield a new retrieval formula that includes the original terms with their revised weights and the new terms with their revised weights. It should be understood that the weights of the original terms and of the new terms may or may not be taken into account during retrieval. Therefore, in the above steps, the weights of the new terms may be left out of the new retrieval formula, or the weights of the original terms may be left unrevised; the present invention places no restriction on this.
The method of this embodiment for revising a retrieval formula, by performing multi-document summarization, reasonably takes a plurality of relevant documents into account as feedback information and can obtain terms that are more relevant to the retrieval requirement. Revising the retrieval formula in this way can therefore improve retrieval precision.
In addition, according to another embodiment of the present invention, the retrieval formula does not include a weight for each term. Accordingly, this embodiment differs from the previous one in that the steps of adding the weights of the new terms to the new retrieval formula and of adjusting the weights of the existing terms are omitted.
Document retrieval method
Under the same inventive concept, Fig. 4 is a flowchart of a document retrieval method according to another embodiment of the present invention. This embodiment is described below with reference to the figure. Description of the parts that are the same as in the preceding embodiments is omitted where appropriate.
As shown in Fig. 4, first, in step 401, a retrieval formula is constructed according to a retrieval request. The retrieval formula includes one or more terms, and each term may also have an initial weight.
Then, in step 405, the retrieval formula constructed in step 401 is revised using the method for revising a retrieval formula described in the above embodiments with reference to Figs. 1 to 3. The revision process is the same as described above and is not repeated here.
Finally, the document collection is searched using the retrieval formula revised in step 405 to obtain the one or more documents that the user needs.
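The three steps of Fig. 4 compose as shown in the sketch below. The whitespace-based construction of the retrieval formula, the uniform initial weight, and the names `build_formula` and `retrieve_documents` are assumptions for illustration; the revision and search steps are passed in as callables because the patent leaves their implementation open.

```python
from typing import Callable, Dict, List

Formula = Dict[str, float]  # hypothetical: term -> weight


def build_formula(request: str, initial_weight: float = 1.0) -> Formula:
    """Step 401: turn a retrieval request into weighted terms."""
    return {term: initial_weight for term in request.lower().split()}


def retrieve_documents(
    request: str,
    revise: Callable[[Formula], Formula],
    search: Callable[[Formula], List[str]],
) -> List[str]:
    """Construct the formula, revise it, then search with the revised formula."""
    formula = build_formula(request)   # step 401
    revised = revise(formula)          # step 405
    return search(revised)             # final search of the document collection
```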
The document retrieval method of this embodiment, by performing multi-document summarization, reasonably takes a plurality of relevant documents into account as feedback information. The retrieval formula can thus be revised with terms that are more relevant to the retrieval requirement, which can improve retrieval precision.
Device for revising a retrieval formula
Under the same inventive concept, Fig. 5 is a block diagram of a device for revising a retrieval formula according to another embodiment of the present invention. This embodiment is described below with reference to the figure. Description of the parts that are the same as in the preceding embodiments is omitted where appropriate.
As shown in Fig. 5, the device 500 for revising a retrieval formula of this embodiment comprises: a retrieval unit 501 that searches a document collection using the retrieval formula to obtain a plurality of relevant documents; a multi-document summary generation unit 505 that generates a plurality of multi-document summaries from the relevant documents retrieved by the retrieval unit 501, wherein each multi-document summary corresponds to one document class among the relevant documents; and a retrieval formula revision unit 510 that revises the retrieval formula using the multi-document summaries. The details of the retrieval formula revision unit 510 are described below.
The structure and function of each component of the device 500 of this embodiment are described in detail below.
In this embodiment, the retrieval formula may be generated from a user's retrieval request or by another system; the present invention places no restriction on how it is generated. A retrieval formula generally includes one or more terms, and each term may also have an initial weight. The document collection is the document library being searched and contains a large number of documents.
In this embodiment, the retrieval unit 501 may also comprise a sorting unit, which sorts the retrieved documents by their relevance to the retrieval formula and yields a ranked retrieval result. The sorting unit may be any unit known in the art or developed in the future; the present invention places no restriction on this.
In addition, the device 500 of this embodiment can determine whether an end condition is satisfied. If so, retrieval ends; if not, the retrieval formula revision unit 510 continues to revise the retrieval formula. In this embodiment, the end condition is the number of times the retrieval formula is revised, which is obtained by training; the details are described below. It should be understood, however, that the end condition is not limited to the number of revisions. The relevant documents obtained by the retrieval unit 501 may also be evaluated directly: if they meet the requirement, processing ends; otherwise, revision of the retrieval formula continues.
In this embodiment, the multi-document summary generation unit 505 may comprise a document clustering unit, which clusters the retrieved relevant documents according to the relevance among them and thereby obtains a plurality of document classes, each containing one or more documents; and a summary generation unit, which generates a summary for each document class to serve as the multi-document summary of that class.
Various forms of the multi-document summary generation unit 505 are known, for example as described in Document 1, and are not repeated here. The present invention places no particular restriction on the multi-document summary generation unit 505, as long as it can generate, from a plurality of documents, a summary that reflects their content.
Next, the training process for the end condition of the device 500 of this embodiment is described in detail. First, a plurality of retrieval formulas and a training document collection are provided. The training document collection contains the known relevant documents corresponding to the retrieval formulas, and also contains irrelevant documents unrelated to those retrieval formulas. Such a training document collection is usually prepared manually; one example is TREC Appendices, 2003. For details, see the Text REtrieval Conference (TREC), an annual international evaluation conference organized by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA).
Then, the loop count for retrieval result evaluation is initialized. In this embodiment, for example, the initial value of the loop count is set to 0.
Then, the retrieval unit 501 searches the training document collection using the group of retrieval formulas formed by the plurality of retrieval formulas, and the sorting unit of the retrieval unit 501 produces a ranked retrieval result. The sorting unit is the same as the one described above and is not described again here.
Then, the retrieval result obtained is evaluated for this group of retrieval formulas using the known relevant documents corresponding to the group, producing an evaluation value that represents the effectiveness of retrieval with this group. Specifically, the retrieval result obtained with the group is compared with the relevant documents corresponding to the group to judge whether the result meets the requirement, and an evaluation value is obtained. For example, the evaluation value lies between 0 and 1: a value of 0 indicates that the retrieval result is completely different from the relevant documents corresponding to the retrieval formulas, and a value of 1 indicates that the retrieval result is identical to them.
If the retrieval result is determined to be optimal, the loop count is output and training ends. If the retrieval result is determined not to be optimal, training continues.
Then, the group of retrieval formulas is revised according to the retrieval result using the multi-document summary generation unit 505 and the retrieval formula revision unit 510. The details of the retrieval formula revision unit 510 are described below.
Then, using the revised group of retrieval formulas, retrieval by the retrieval unit 501 and revision by the multi-document summary generation unit 505 and the retrieval formula revision unit 510 are repeated until the retrieval result is optimal. In each loop, the loop count is incremented by 1.
Finally, the optimal loop count is obtained, that is, the number of loops needed to obtain the optimal retrieval result for this group of retrieval formulas.
In this embodiment, the optimal loop count is trained using a training document collection in which the relevant documents corresponding to each of the known retrieval formulas are known. An ordinary document collection may equally be used for training, in which case the evaluation must be performed manually: a person assesses the retrieval result obtained and judges whether it is optimal.
Next, describe the 26S Proteasome Structure and Function of each ingredient of above-mentioned retrieval type amending unit 510 and the process of revising retrieval type in detail.
At first, utilize a plurality of relevant document that above-mentioned sequencing unit retrieves according to former retrieval type and the degree of correlation of former retrieval type, a plurality of relevant documents are sorted.
Then,, can get several documents, for example get preceding several documents after the ordering as feedback information as feedback information if the quantity of the relevant documentation that obtains of retrieval is too many.
Then,, utilize above-mentioned many documents digest generation unit 505, generate many documents digest according to feedback information.Particularly, at first, utilize above-mentioned clustering documents unit,, thereby obtain a plurality of document class, for example obtain M document class in the present embodiment a plurality of relevant clustering documents according to the degree of correlation between a plurality of relevant documentations that retrieve; Wherein, each document class comprises one or more documents.Then, utilize above-mentioned summary generation unit to be respectively each document class and generate a summary,, for example generate M many documents digest in the present embodiment accordingly as many documents digest of the document class.
In the present embodiment, above-mentioned many documents digest generation unit 505 also comprises the sentence weight calculation unit, it is used for calculating respectively the weights of each sentence of each many documents digest, concrete computing method for example can adopt the method for describing in the above-mentioned document 1, but the present invention has no particular limits this.
In addition, the above-mentioned multi-document summary generation unit 505 further comprises a document class weight calculation unit, which calculates the weight of each of the M document classes obtained. Specifically, suppose that a document class C contains K documents and that the weight of the k-th document D_k (k = 1, ..., K) is Weight(D_k). The weight Weight(C) of document class C is then the sum of the weights of the documents in that class:
$$\mathrm{Weight}(C) = \sum_{k=1}^{K} \mathrm{Weight}(D_k) \qquad (1)$$
where the weight Weight(D_k) of each document is the one returned with the previous retrieval result.
In the present embodiment, the above-mentioned retrieval formula correction unit 510 may comprise a word weight calculation unit, which calculates the initial weight of each new word appearing in the M multi-document summaries obtained. The calculation may, for example, use the method described in the above-mentioned document 2 and is not repeated here, but the present invention places no particular limitation on this. In addition, the retrieval formula correction unit 510 may also comprise a word frequency calculation unit, which calculates the word frequency of each new word appearing in the M multi-document summaries obtained.
In addition, the above-mentioned word weight calculation unit may also correct the weight of a new word using the calculated sentence weights, the document class weights, the initial weight of the new word, and the word frequency of the new word.
Specifically, suppose that the initial weight of a new word W is Weight(W) and that the word frequency of W is freq(W). The sentences containing W are denoted S_j (j = 1, ..., J, where J is the number of sentences containing W), and the weight of S_j is Weight(S_j). The document classes containing W are denoted C_i (i = 1, ..., I, where I is the number of document classes containing W), and the weight of C_i is Weight(C_i). The corrected weight Weight'(W) of W is then:
$$\mathrm{Weight}'(W) = \mathrm{Weight}(W) \cdot \mathrm{freq}(W) \cdot \Bigl(\sum_{j=1}^{J} \mathrm{Weight}(S_j)\Bigr) \cdot \Bigl(\sum_{i=1}^{I} \mathrm{Weight}(C_i)\Bigr) \qquad (2)$$
It should be noted that, when the weight of a new word is corrected, it is not necessary to take all four of the above factors into account, namely the document class weights, the initial weight of the new word, the word frequency of the new word, and the sentence weights; any one or more of them may be considered.
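The product in formula (2), together with the remark that any subset of the four factors may be used, can be sketched as follows. This is an illustration under an assumption: the text does not say how an unused factor is handled, so an omitted factor is simply treated here as the multiplicative identity. The function and parameter names are hypothetical.

```python
from typing import Iterable, Optional


def corrected_word_weight(initial_weight: Optional[float] = None,
                          freq: Optional[float] = None,
                          sentence_weights: Optional[Iterable[float]] = None,
                          class_weights: Optional[Iterable[float]] = None) -> float:
    """Formula (2): multiply the factors that are supplied.

    A factor passed as None is skipped (assumption: it contributes a factor of 1).
    """
    factors = [
        initial_weight,
        freq,
        None if sentence_weights is None else sum(sentence_weights),
        None if class_weights is None else sum(class_weights),
    ]
    result = 1.0
    for factor in factors:
        if factor is not None:
            result *= factor
    return result


# All four factors: 0.5 * 3 * (0.25 + 0.25) * (2.0 + 1.0)
full = corrected_word_weight(0.5, 3, [0.25, 0.25], [2.0, 1.0])
# Only the initial weight and the word frequency: 0.5 * 3
partial = corrected_word_weight(initial_weight=0.5, freq=3)
```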
In addition, the above-mentioned retrieval formula correction unit 510 may also comprise an expansion unit, which expands the N new words with the largest weights into the original retrieval formula. The expansion unit may also expand the weight of each of these new words into the original retrieval formula.
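The expansion step above could look as follows in Python. The representation of the query as a word-to-weight dictionary, the function name `expand_query`, and the `keep_weights` and `default_weight` parameters are assumptions made for this sketch; the text only states that the top N new words, and optionally their weights, are added.

```python
from typing import Dict


def expand_query(query: Dict[str, float],
                 new_word_weights: Dict[str, float],
                 n: int,
                 keep_weights: bool = True,
                 default_weight: float = 1.0) -> Dict[str, float]:
    """Add the n highest-weighted new words to a copy of the query.

    If keep_weights is False, the new words are added with default_weight,
    which models the variant in which weights are not carried over.
    """
    candidates = {w: wt for w, wt in new_word_weights.items() if w not in query}
    top = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:n]
    expanded = dict(query)
    for word, weight in top:
        expanded[word] = weight if keep_weights else default_weight
    return expanded


original = {"retrieval": 1.0}
new_words = {"feedback": 0.9, "summary": 0.7, "boat": 0.1}
expanded = expand_query(original, new_words, n=2)
```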
In addition, the above-mentioned retrieval formula correction unit 510 may also comprise a weight correction unit, which uses the feedback information to correct the weights of the original search terms; details are described, for example, in the above-mentioned document 2 and are not repeated here. Furthermore, if an original search term appears in the M multi-document summaries obtained, its weight is corrected further, for example by multiplying it by a coefficient greater than 1.
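The additional boost for original search terms that occur in the summaries can be illustrated as follows. The coefficient value, the whitespace tokenization, and the function name `boost_original_terms` are assumptions; the text only requires a coefficient greater than 1.

```python
from typing import Dict, List


def boost_original_terms(query: Dict[str, float],
                         summaries: List[str],
                         coefficient: float = 1.2) -> Dict[str, float]:
    """Multiply the weight of each original term found in any summary by coefficient."""
    if coefficient <= 1:
        raise ValueError("coefficient must be greater than 1")
    summary_words = {word for summary in summaries for word in summary.lower().split()}
    return {term: weight * coefficient if term in summary_words else weight
            for term, weight in query.items()}


query = {"retrieval": 2.0, "boat": 1.0}
summaries = ["retrieval feedback improves precision"]
boosted = boost_original_terms(query, summaries, coefficient=1.5)
```

In this example only "retrieval" occurs in the summary, so only its weight is increased.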
Finally, the device 500 for correcting a retrieval formula yields a new retrieval formula, which contains the original search terms with their corrected weights and the new words with their corrected weights. It should be understood that the weights of the original search terms and of the new words may or may not be taken into account at retrieval time. Therefore, when the retrieval formula correction unit 510 corrects the retrieval formula, the weights of the new words need not be expanded into the new retrieval formula, and the weights of the original search terms need not be corrected. The present invention places no limitation on this.
In addition, according to another embodiment of the present invention, the retrieval formula does not contain a weight for each search term. Accordingly, the difference from the previous embodiment is that the above-mentioned expansion unit may omit expanding the weights of the new words into the new retrieval formula, and the above-mentioned weight correction unit, which adjusts the weights of the existing search terms, may be omitted.
The device 500 for correcting a retrieval formula of the present embodiment performs multi-document summarization and thereby reasonably and comprehensively takes into account the plurality of relevant documents serving as feedback information. It can thus obtain search terms that are more relevant to the retrieval request, and correcting the retrieval formula with this device can improve retrieval precision.
The device 500 for correcting a retrieval formula of the present embodiment and its components may be implemented with dedicated circuits or chips, or by a computer (processor) executing a corresponding program.
Document retrieval device
Based on the same inventive concept, Fig. 6 is a block diagram of a document retrieval device according to another embodiment of the present invention. The present embodiment is described below with reference to this figure. Description of the parts that are identical to those of the preceding embodiments is omitted as appropriate.
As shown in Fig. 6, the document retrieval device 600 of the present embodiment comprises: a retrieval formula construction unit 601, which constructs a retrieval formula according to a retrieval request; the above-mentioned device 500 for correcting a retrieval formula, which corrects that retrieval formula; and a document retrieval unit 605, which searches the document set with the corrected retrieval formula to obtain one or more documents that the user needs.
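The three-stage structure described above (construct, correct, retrieve) might be sketched as a simple composition of functions. Everything in this example is a hypothetical stand-in: the toy corpus, the term-overlap scoring, and the fixed expansion word are assumptions and do not reproduce the correction procedure of device 500.

```python
from typing import Callable, Dict, List

Query = Dict[str, float]


def retrieve_documents(request: str,
                       construct: Callable[[str], Query],
                       correct: Callable[[Query], Query],
                       search: Callable[[Query], List[str]]) -> List[str]:
    """Chain the three stages: build the query, correct it, search with it."""
    query = construct(request)
    corrected = correct(query)
    return search(corrected)


# Toy stand-ins for the three units (assumptions for illustration only).
corpus = {"d1": "query expansion with feedback", "d2": "sailing boats"}


def construct(request: str) -> Query:
    return {word: 1.0 for word in request.lower().split()}


def correct(query: Query) -> Query:
    expanded = dict(query)
    expanded.setdefault("feedback", 0.5)  # pretend this word came from the summaries
    return expanded


def search(query: Query) -> List[str]:
    scores = {doc_id: sum(w for term, w in query.items() if term in text.split())
              for doc_id, text in corpus.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [doc_id for doc_id, score in ranked if score > 0]


results = retrieve_documents("query expansion", construct, correct, search)
```

Passing the stages as callables keeps the sketch independent of any particular correction or search implementation.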
The document retrieval device 600 of the present embodiment performs multi-document summarization and thereby reasonably and comprehensively takes into account the plurality of relevant documents serving as feedback information. It can thus correct the retrieval formula by obtaining search terms that are more relevant to the retrieval request, so that using this device can improve retrieval precision.
The document retrieval device 600 of the present embodiment and its components may be implemented with dedicated circuits or chips, or by a computer (processor) executing a corresponding program.
Although the method for correcting a retrieval formula, the method for document retrieval, the device for correcting a retrieval formula, and the device for document retrieval of the present invention have been described in detail above through some exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art can make various changes and modifications within the spirit and scope of the present invention. Therefore, the present invention is not limited to these embodiments, and the scope of the present invention is defined only by the appended claims.

Claims (28)

1. A method for correcting a retrieval formula, comprising:
retrieving a document set with the retrieval formula to obtain a plurality of relevant documents;
generating a plurality of multi-document summaries from the plurality of retrieved relevant documents, wherein each multi-document summary corresponds to one document class among the plurality of relevant documents; and
correcting the retrieval formula using the plurality of multi-document summaries.
2. The method for correcting a retrieval formula according to claim 1, wherein the step of retrieving the document set with the retrieval formula comprises:
sorting the plurality of retrieved relevant documents according to their degree of relevance to the retrieval formula.
3. The method for correcting a retrieval formula according to claim 1 or 2, wherein the step of generating a plurality of multi-document summaries comprises:
clustering the plurality of retrieved relevant documents according to the degree of relevance between them, thereby obtaining a plurality of document classes, wherein each document class contains at least one document; and
generating one summary for each of the document classes.
4. The method for correcting a retrieval formula according to claim 3, wherein the step of generating a plurality of multi-document summaries further comprises:
calculating the weight of each sentence in the summaries.
5. The method for correcting a retrieval formula according to claim 4, wherein the step of generating a plurality of multi-document summaries further comprises:
calculating the weight of each of the document classes.
6. The method for correcting a retrieval formula according to claim 5, wherein the step of correcting the retrieval formula using the plurality of multi-document summaries comprises:
calculating the weight of each new word appearing in the plurality of multi-document summaries; and
expanding a specified number of new words with the largest weights into the retrieval formula.
7. The method for correcting a retrieval formula according to claim 6, wherein the step of calculating the weight of each new word appearing in the plurality of multi-document summaries comprises:
calculating the weight of a new word appearing in the plurality of multi-document summaries according to the initial weight of the new word, the word frequency with which the new word appears, the weights of the sentences containing the new word, the weights of the document classes containing the new word, or a combination thereof.
8. The method for correcting a retrieval formula according to any one of claims 1 to 7, further comprising:
repeating the above steps from retrieving to correcting with the corrected retrieval formula, up to a predefined number of times.
9. The method for correcting a retrieval formula according to claim 8, wherein the predefined number of times is calculated, using a plurality of retrieval formulas and a training document set that contains at least a known plurality of relevant documents corresponding to the plurality of retrieval formulas, in the following manner:
repeating the above steps from retrieving to correcting with the plurality of retrieval formulas, and evaluating the result of each retrieval against the known plurality of relevant documents; and
counting the number of repetitions at which the best retrieval result is obtained.
10. The method for correcting a retrieval formula according to claim 1, wherein the retrieval formula comprises at least one search term and a weight of the search term.
11. The method for correcting a retrieval formula according to claim 10, wherein the step of correcting the retrieval formula using the plurality of multi-document summaries comprises:
calculating the weight of each new word appearing in the plurality of multi-document summaries; and
expanding a specified number of new words with the largest weights, together with their weights, into the retrieval formula.
12. The method for correcting a retrieval formula according to claim 10, wherein the step of generating a plurality of multi-document summaries comprises:
correcting the weight of each of the at least one search term according to the plurality of retrieved relevant documents.
13. The method for correcting a retrieval formula according to claim 12, wherein the step of correcting the weight of the at least one search term comprises:
correcting the weight of each of the at least one search term according to the occurrence of that search term in the plurality of multi-document summaries.
14. A method for document retrieval, comprising:
constructing a retrieval formula according to a retrieval request;
correcting the retrieval formula using the method for correcting a retrieval formula according to any one of claims 1 to 13; and
retrieving documents using the corrected retrieval formula.
15. A device for correcting a retrieval formula, comprising:
a retrieval unit, which retrieves a document set with the retrieval formula to obtain a plurality of relevant documents;
a multi-document summary generation unit, which generates a plurality of multi-document summaries from the plurality of relevant documents retrieved by the retrieval unit, wherein each multi-document summary corresponds to one document class among the plurality of relevant documents; and
a retrieval formula correction unit, which corrects the retrieval formula using the plurality of multi-document summaries.
16. The device for correcting a retrieval formula according to claim 15, wherein the retrieval unit comprises:
a sorting unit, which sorts the plurality of retrieved relevant documents according to their degree of relevance to the retrieval formula.
17. The device for correcting a retrieval formula according to claim 15 or 16, wherein the multi-document summary generation unit comprises:
a document clustering unit, which clusters the plurality of relevant documents retrieved by the retrieval unit according to the degree of relevance between them, thereby obtaining a plurality of document classes, wherein each document class contains at least one document; and
a summary generation unit, which generates one summary for each of the document classes.
18. The device for correcting a retrieval formula according to claim 17, wherein the multi-document summary generation unit further comprises:
a sentence weight calculation unit, which calculates the weight of each sentence in the summaries.
19. The device for correcting a retrieval formula according to claim 18, wherein the multi-document summary generation unit further comprises:
a document class weight calculation unit, which calculates the weight of each of the document classes.
20. The device for correcting a retrieval formula according to claim 19, wherein the retrieval formula correction unit comprises:
a word weight calculation unit, which calculates the weight of each new word appearing in the plurality of multi-document summaries; and
an expansion unit, which expands a specified number of new words with the largest weights into the retrieval formula.
21. The device for correcting a retrieval formula according to claim 20, wherein the word weight calculation unit calculates the weight of a new word appearing in the plurality of multi-document summaries according to the initial weight of the new word, the word frequency with which the new word appears, the weights of the sentences containing the new word, the weights of the document classes containing the new word, or a combination thereof.
22. The device for correcting a retrieval formula according to any one of claims 15 to 21, wherein the device for correcting a retrieval formula cyclically corrects the retrieval formula using the plurality of multi-document summaries, up to a predefined number of times.
23. The device for correcting a retrieval formula according to claim 22, wherein the predefined number of times is calculated, using a plurality of retrieval formulas and a training document set that contains at least a known plurality of relevant documents corresponding to the plurality of retrieval formulas, in the following manner:
repeating the above steps from retrieving to correcting with the plurality of retrieval formulas, and evaluating the result of each retrieval against the known plurality of relevant documents; and
counting the number of repetitions at which the best retrieval result is obtained.
24. The device for correcting a retrieval formula according to claim 15, wherein the retrieval formula comprises at least one search term and a weight of the search term.
25. The device for correcting a retrieval formula according to claim 24, wherein the retrieval formula correction unit comprises:
a word weight calculation unit, which calculates the weight of each new word appearing in the plurality of multi-document summaries; and
an expansion unit, which expands a specified number of new words with the largest weights, together with their weights, into the retrieval formula.
26. The device for correcting a retrieval formula according to claim 24, wherein the multi-document summary generation unit comprises:
a weight correction unit, which corrects the weight of each of the at least one search term according to the plurality of retrieved relevant documents.
27. The device for correcting a retrieval formula according to claim 26, wherein the weight correction unit further corrects the weight of each of the at least one search term according to the occurrence of that search term in the plurality of multi-document summaries.
28. A device for document retrieval, comprising:
a retrieval formula construction unit, which constructs a retrieval formula according to a retrieval request;
the device for correcting a retrieval formula according to any one of claims 15 to 27, which corrects the retrieval formula; and
a document retrieval unit, which retrieves documents using the corrected retrieval formula.
CNA2007100891627A 2007-03-20 2007-03-20 Method and device for correcting retrieval formula and file retrieval Pending CN101271462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2007100891627A CN101271462A (en) 2007-03-20 2007-03-20 Method and device for correcting retrieval formula and file retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2007100891627A CN101271462A (en) 2007-03-20 2007-03-20 Method and device for correcting retrieval formula and file retrieval

Publications (1)

Publication Number Publication Date
CN101271462A true CN101271462A (en) 2008-09-24

Family

ID=40005436

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2007100891627A Pending CN101271462A (en) 2007-03-20 2007-03-20 Method and device for correcting retrieval formula and file retrieval

Country Status (1)

Country Link
CN (1) CN101271462A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678513A (en) * 2013-11-26 2014-03-26 安徽科大讯飞信息科技股份有限公司 Interactive search generation method and system
CN103678513B (en) * 2013-11-26 2016-08-31 科大讯飞股份有限公司 A kind of interactively retrieval type generates method and system
CN116089599A (en) * 2023-04-07 2023-05-09 北京澜舟科技有限公司 Information query method, system and storage medium

Similar Documents

Publication Publication Date Title
Carterette et al. Minimal test collections for retrieval evaluation
CN102760138B (en) Classification method and device for user network behaviors and search method and device for user network behaviors
JP5597255B2 (en) Ranking search results based on word weights
CN103207899B (en) Text recommends method and system
CN101271461B (en) Cross-language retrieval request conversion and cross-language information retrieval method and system
CN102033955A (en) Method for expanding user search results and server
CN102456016B (en) Method and device for sequencing search results
CN101355457B (en) Test method and test equipment
CN102314443B (en) The modification method of search engine and system
CN103425687A (en) Retrieval method and system based on queries
CN104199965A (en) Semantic information retrieval method
CN104484380A (en) Personalized search method and personalized search device
CN102955849A (en) Method for recommending documents based on tags and document recommending device
CN105426514A (en) Personalized mobile APP recommendation method
CN100511214C (en) Method and system for abstracting batch single document for document set
TW201025035A (en) Analysis algorithm of time series word summary and story plot evolution
CN104252456A (en) Method, device and system for weight estimation
CN110046298A (en) Query word recommendation method and device, terminal device and computer readable medium
CN104361115A (en) Entry weight definition method and device based on co-clicking
CN104636407A (en) Parameter choice training and search request processing method and device
Feng et al. Practical duplicate bug reports detection in a large web-based development community
CN102081601A (en) Field word identification method and device
CN104834736A (en) Method and device for establishing index database and retrieval method, device and system
CN103544307A (en) Multi-search-engine automatic comparison and evaluation method independent of document library
CN104376115A (en) Fuzzy word determining method and device based on global search

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20080924