CN103678412A - Document retrieval method and device

Info

Publication number: CN103678412A
Application number: CN201210360872.XA
Authority: CN
Inventors: 洪毅虹; 杨建武
Original assignee: Peking University; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Peking University; Beijing Founder Electronics Co Ltd
Priority date: 2012-09-21
Filing date: 2012-09-21
Publication date: 2014-03-26
Anticipated expiration: 2032-09-21
Also published as: CN103678412B

Abstract

The invention provides a document retrieval method and device and belongs to the field of information retrieval. In the method, a target document set is searched in a pre-built inverted index using target query keywords to obtain a first target document set; relevance scoring is performed on the first target document set to obtain a relevance score for each first target document, and the set is reordered according to these scores to obtain a second target document set; the current target query keywords are expanded by a pseudo-relevance feedback model to obtain new target query keywords, from which a third target document set is obtained; each target document in the third target document set is split into sentences, and the tag weight sum of each sentence is calculated; the text content of each sentence is scored for relevance against the target query keywords to obtain a final score for each sentence, from which target sentences are obtained; and the target sentences whose length falls within a preset length range are taken as retrieval result fragments. The method and device improve the retrieval performance and precision of XML document retrieval.

Description

Document retrieval method and device

Technical field

The present invention relates to the field of information retrieval, and in particular to a document retrieval method and device.

Background art

HTML (Hypertext Markup Language), the main carrier of traditional Internet information, gives users a convenient way to present information and is mainly concerned with how information is displayed in a browser. As Web applications have become increasingly widespread, the limitations of the HTML data model have become more and more apparent: HTML cannot describe data, its tag set is fixed and limited, and users cannot add meaningful tags according to their own needs. XML (Extensible Markup Language) emerged in response.

XML is self-describing, platform-independent, extensible, and simple to use; it can represent data in a readable form without being restricted by the form of presentation. XML allows data to be exchanged between incompatible systems and reduces the complexity of sharing and transmitting data. An XML document contains both content information and structural information, which makes it possible to exchange and integrate massive amounts of data over the Internet. As more and more Web applications, such as web services, e-commerce, and digital libraries, adopt XML as the carrier for storing, exchanging, and publishing massive amounts of data, how to efficiently retrieve useful information from massive XML document collections has attracted the attention of a growing number of researchers.

At present, XML documents can be retrieved in the following two modes:

First, retrieval based on XML document structure.

In this mode, the user must understand the structure of the XML document being queried in order to construct a query expression.

Second, retrieval based on keywords.

In this mode, query expressions are written in advance by an author, so the user neither needs to learn a complex query language nor needs a deep understanding of the underlying data structure of the XML document; the user only has to enter keywords related to the content of interest to complete a query. Existing methods include MLCA, SLCA, XRank, XSEarch, and XSeek.

However, the first mode has two problems: on the one hand, most XML documents on the Internet do not provide complete structural information to the user; on the other hand, the Internet contains a large number of heterogeneous XML documents. In both cases it is difficult for a user to construct a query expression over the XML structure with an existing query language. In the second mode, most XML keyword query methods are built on a tree storage model, which requires the author to know the structure of the XML document in advance when writing the query expression.

In summary, existing XML document retrieval models require the user either to browse the full content of the XML document or to know the structure of the queried XML document in advance, and they occupy a large amount of storage space during retrieval. Now that XML documents exist in massive volumes, the retrieval performance and accuracy of existing XML document retrieval models are low.

Summary of the invention

Embodiments of the invention provide a document retrieval method and device that improve the retrieval performance and accuracy of XML document retrieval.

To achieve the above object, embodiments of the invention adopt the following technical solutions:

A document retrieval method, comprising:

Searching a target document set in a pre-built inverted index using target query keywords obtained through preprocessing, to obtain a first target document set;

Performing relevance scoring on the first target document set to obtain a relevance scoring result for the first target documents, and reordering the first target document set according to the relevance scoring result to obtain a second target document set;

Expanding the current target query keywords by a pseudo-relevance feedback model to obtain new target query keywords;

When the new target query keywords meet a preset condition, retrieving the first target document set again using the new target query keywords, to obtain a third target document set;

Splitting each target document in the third target document set into sentences, and calculating the tag weight sum of each sentence obtained by the splitting;

Performing relevance scoring on the text content of each sentence according to the target query keywords to obtain a relevance scoring result for each sentence, and obtaining the final score of each sentence according to its relevance scoring result;

Obtaining target sentences according to the final score of each sentence, and obtaining, from the target sentences, the sentences whose length is within a preset length range as retrieval result fragments.

The present invention also provides a document retrieval device, comprising:

A retrieval unit, configured to search a target document set in a pre-built inverted index using target query keywords obtained through preprocessing, to obtain a first target document set;

An acquiring unit, configured to perform relevance scoring on the first target document set to obtain a relevance scoring result for the first target documents, and to reorder the first target document set according to the relevance scoring result to obtain a second target document set;

The acquiring unit is further configured to expand the current target query keywords by a pseudo-relevance feedback model to obtain new target query keywords;

The acquiring unit is further configured to, when the new target query keywords meet the preset condition, retrieve the first target document set again using the new target query keywords, to obtain a third target document set;

The acquiring unit is further configured to, when the new target query keywords do not meet the preset condition, retrieve the first target document set again using the new target query keywords, to obtain a new second target document set;

A computing unit, configured to split each target document in the third target document set into sentences, and to calculate the tag weight sum of each sentence obtained by the splitting;

The computing unit is further configured to perform relevance scoring on the text content of each sentence according to the target query keywords to obtain a relevance scoring result for each sentence, and to obtain the final score of each sentence according to its relevance scoring result;

The acquiring unit is further configured to obtain target sentences according to the final score of each sentence, and to obtain, from the target sentences, the sentences whose length is within a preset length range as retrieval result fragments.

With the document retrieval method and device provided by the embodiments of the present invention, a user can retrieve with query keywords without browsing the full content of the documents and without knowing the document structure; the method and device are suitable for retrieval over massive document collections and offer high retrieval performance and accuracy.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a document retrieval method provided by Embodiment 1 of the present invention;

Fig. 2 is a flowchart of the first stage of a document retrieval method provided by Embodiment 2 of the present invention;

Fig. 3 is a flowchart of the second stage of a document retrieval method provided by Embodiment 2 of the present invention;

Fig. 4 is a schematic flowchart of the method, provided by Embodiment 2 of the present invention, for training on the first target document set to obtain the tag weights;

Fig. 5 is a schematic flowchart of the method, provided by Embodiment 2 of the present invention, for calculating the relevance score of the SLCA subtree of each document;

Fig. 6 is a flowchart of the third stage of a document retrieval method provided by Embodiment 2 of the present invention;

Fig. 7 is a schematic flowchart of the method, provided by Embodiment 2 of the present invention, for splitting each target document in the third target document set into sentences and calculating the tag weight sum of each resulting sentence;

Fig. 8 is a schematic flowchart of the method, provided by Embodiment 2 of the present invention, for scoring the text content of each sentence according to the target query keywords;

Fig. 9 is a structural schematic diagram of a document retrieval device provided by Embodiment 3 of the present invention;

Fig. 10 is a second structural schematic diagram of a document retrieval device provided by Embodiment 3 of the present invention;

Fig. 11 is a structural schematic diagram of the computing unit in a document retrieval device provided by Embodiment 3 of the present invention;

Fig. 12 is a structural schematic diagram of the acquiring unit in a document retrieval device provided by Embodiment 3 of the present invention;

Fig. 13 is a third structural schematic diagram of a document retrieval device provided by Embodiment 3 of the present invention;

Fig. 14 is a second structural schematic diagram of the computing unit in a document retrieval device provided by Embodiment 3 of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Embodiment 1

Referring to Fig. 1, a flowchart of a document retrieval method provided by this embodiment, the method comprises:

A. Search a target document set in a pre-built inverted index using target query keywords obtained through preprocessing, to obtain a first target document set.

B. Perform relevance scoring on the first target document set to obtain a relevance scoring result for the first target documents, and reorder the first target document set according to the relevance scoring result to obtain a second target document set.

C. Expand the current target query keywords by a pseudo-relevance feedback model to obtain new target query keywords.

D. When the new target query keywords meet a preset condition, retrieve the first target document set again using the new target query keywords, to obtain a third target document set.

E. Split each target document in the third target document set into sentences, and calculate the tag weight sum of each sentence obtained by the splitting.

F. Perform relevance scoring on the text content of each sentence according to the target query keywords to obtain a relevance scoring result for each sentence, and obtain the final score of each sentence according to its relevance scoring result.

G. Obtain target sentences according to the final score of each sentence, and obtain, from the target sentences, the sentences whose length is within a preset length range as retrieval result fragments.

With the document retrieval method provided by this embodiment, a user can retrieve with query keywords without browsing the full content of the documents and without knowing the document structure; the method is suitable for retrieval over massive document collections and offers high retrieval performance and accuracy.

Embodiment 2

In this embodiment, the document retrieval method is divided into three stages: the first stage is a fuzzy retrieval stage, which narrows down the semi-structured document set; the second stage is a precise retrieval stage, which obtains the set of documents precisely related to the query; and the third stage is a fragment generation stage.

Referring to Fig. 2, a flowchart of the first stage of the document retrieval method provided by this embodiment, the first stage comprises:

First stage: fuzzy retrieval stage.

201: Preprocess the target document set.

In this embodiment, the target document set is the set of semi-structured XML documents that is to be queried.

Preprocessing the target document set can be implemented by the following sub-steps:

1) Remove the stop words from the target document set.

Stop words can be set in advance by the user. In English they can be words without concrete meaning, such as "in", "the", and "oh", as well as punctuation marks; in Chinese they can likewise be function words without concrete meaning and punctuation marks.

For example, the following two articles are part of the content of the document set and are used to illustrate stop-word removal:

The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.

The content of article 2 is: He once lived in Shanghai.

The content of each article is a character string. First, all words of article 1 and article 2 are found by splitting on spaces, each word being a keyword; then the stop words are removed from both articles. After stop-word removal, the articles are as follows:

Article 1 after stop-word removal: [Tom] [lives] [Guangzhou] [I] [live] [Guangzhou]

Article 2 after stop-word removal: [He] [lived] [Shanghai]

It should be noted that when a document contains Chinese sentences, the Chinese sentences must first be segmented into words using existing techniques, after which the stop words are removed from the document.

2) Extract the word stems of the target document set.

First, when the content of the target document set is in English, the case of all words is unified; for example, when the user searches for "He", the words "HE" and "he" are also matched.

Second, when the content of the target document set is in English, all words are reduced to their stems; for example, when the user searches for "live", the words "lives" and "lived" are also matched, so "lives" and "lived" need to be reduced to "live".

For example, taking article 1 and article 2 after stop-word removal above, stem extraction yields:

All keywords of article 1: [tom] [live] [guangzhou] [i] [live] [guangzhou]

All keywords of article 2: [he] [live] [shanghai]
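The stop-word removal and stem extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the stop-word list `STOP_WORDS` and the lookup table `STEMS` are assumptions chosen only to reproduce the two example articles, and a real system would use a proper stemmer.

```python
import re

# Assumed stop-word list: "in", "the", "oh" come from the text; "too" and
# "once" are added here only so the two example articles come out as shown.
STOP_WORDS = {"in", "the", "oh", "too", "once"}

# Toy stem lookup (hypothetical); a real system would use a stemming algorithm.
STEMS = {"lives": "live", "lived": "live"}


def preprocess(text):
    """Split into words, drop stop words, unify case, and reduce to stems."""
    words = re.findall(r"[A-Za-z]+", text)            # also discards punctuation
    kept = [w for w in words if w.lower() not in STOP_WORDS]
    return [STEMS.get(w.lower(), w.lower()) for w in kept]


article1 = preprocess("Tom lives in Guangzhou, I live in Guangzhou too.")
article2 = preprocess("He once lived in Shanghai.")
print(article1)
print(article2)
```

The same function can be applied to the query keywords, since the text states that query preprocessing uses the same two operations.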

3) Calculate the TF (term frequency) value and the IDF (inverse document frequency) value of each stemmed word in the target document set.

The TF value of each word in the target document set can be calculated with the following formula:

TF_{i, j} = \frac{n_{i, j}}{\sum_{k} n_{k, j}}

In the above formula, n_{i, j} is the number of occurrences of the word t_i in document d_j of the target document set, and the denominator is the sum of the numbers of occurrences of all words in document d_j.

The IDF value of each word in the target document set can be calculated with the following formula:

IDF_{i} = \log \frac{|D|}{|\{ j : t_{i} \in d_{j} \}|}

where |D| is the total number of documents in the target document set, and |\{ j : t_{i} \in d_{j} \}| is the number of documents that contain the word t_i (that is, the number of documents for which n_{i, j} \neq 0).
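The two formulas above can be checked on the example articles with a short sketch. The function names are illustrative, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math
from collections import Counter


def term_frequency(term, doc):
    """TF_{i,j}: occurrences of `term` in `doc` over the total word count of `doc`."""
    return Counter(doc)[term] / len(doc)


def inverse_document_frequency(term, docs):
    """IDF_i: log(|D| / number of documents containing `term`), natural log assumed."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(docs) / containing)


article1 = ["tom", "live", "guangzhou", "i", "live", "guangzhou"]
article2 = ["he", "live", "shanghai"]
docs = [article1, article2]

tf_live = term_frequency("live", article1)            # 2 occurrences among 6 words
idf_live = inverse_document_frequency("live", docs)   # occurs in both articles
idf_tom = inverse_document_frequency("tom", docs)     # occurs in one article
```

A word that occurs in every document, such as "live" here, gets an IDF of zero, which is why IDF down-weights words that do not discriminate between documents.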

202: Build an inverted index over the preprocessed target document set.

For example, taking article 1 and article 2 above, after the inverted index is built, the correspondence between each keyword and the article number, [frequency of occurrence], and keyword positions in article 1 and article 2 is:

keyword      article number[frequency]    positions
guangzhou    1[2]                         3, 6
he           2[1]                         1
i            1[1]                         4
live         1[2], 2[1]                   2, 5; 2
shanghai     2[1]                         3
tom          1[1]                         1

Once the inverted index is built, the number of times a keyword occurs in each article and its exact positions can be looked up.
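A minimal sketch of building such an index for the two example articles is shown below. The nested-dictionary layout (keyword to article number to position list) is an assumption made for illustration; the text does not prescribe a storage format.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each keyword to {article number: [1-based positions in that article]}."""
    index = defaultdict(dict)
    for doc_id, words in docs.items():
        for position, word in enumerate(words, start=1):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}
index = build_inverted_index(docs)

# Print each keyword with article number[frequency] and positions.
for word in sorted(index):
    postings = index[word]
    freq = ", ".join(f"{doc_id}[{len(pos)}]" for doc_id, pos in postings.items())
    print(word, freq, list(postings.values()))
```

The frequency of a keyword in an article is simply the length of its position list, so it does not need to be stored separately in this layout.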

203: Preprocess the query keywords to obtain the target query keywords.

Preprocessing the query keywords can be implemented by the following sub-steps:

1) Remove the stop words from the query keywords.

It should be noted that this step is implemented in the same way as the stop-word removal for the target document set in sub-step 1) of step 201 above (step 2011), and is not described again here.

2) Extract the stems of the query keywords to obtain the target query keywords.

It should be noted that this step is implemented in the same way as the stem extraction for the target document set in sub-step 2) of step 201 above (step 2012), and is not described again here.

204: According to a retrieval model, search the target document set in the inverted index using the target query keywords obtained through preprocessing, to obtain the first target document set.

It should be noted that the retrieval model is built using probabilistic statistical methods and a language model. During retrieval, Dirichlet smoothing is used, and the scope of the target document set is narrowed. Building the retrieval model and Dirichlet smoothing both belong to the prior art and are not described in detail here.
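Since the text leaves the retrieval model to the prior art, the following is only a sketch of one common form of it: query-likelihood scoring with Dirichlet-smoothed term probabilities. The function name, the smoothing parameter `mu`, and its toy value are assumptions, not values given in the text.

```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection_counts, collection_len, mu=10.0):
    """Log query likelihood of `doc`, with p(w|d) = (c(w,d) + mu*p(w|C)) / (|d| + mu)."""
    doc_counts = Counter(doc)
    score = 0.0
    for term in query:
        p_collection = collection_counts[term] / collection_len
        if p_collection == 0:
            continue  # term unseen in the whole collection: skipped in this sketch
        score += math.log((doc_counts[term] + mu * p_collection) / (len(doc) + mu))
    return score


docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}
collection_counts = Counter(w for words in docs.values() for w in words)
collection_len = sum(len(words) for words in docs.values())

query = ["guangzhou"]
scores = {d: dirichlet_score(query, w, collection_counts, collection_len)
          for d, w in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Smoothing keeps the score finite for article 2 even though it does not contain "guangzhou", while article 1, which does, still ranks first.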

Referring to Fig. 3, a flowchart of the second stage of the document retrieval method provided by this embodiment, the second stage comprises:

Second stage: precise retrieval stage.

205: Train on the first target document set to obtain the weight of each tag in the first target document set.

Referring to Fig. 4, training on the first target document set to obtain the tag weights can be implemented by the following sub-steps:

2051: Obtain all tag names in the first target document set.

2052: According to the tag names, divide the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query.

2053: Obtain the number of times a that each query keyword t_i occurs in each relevant element b_k, and the total number A of all words in the relevant element set.

It should be noted that when the query keywords are in English, each word is used as a query keyword; when the query keywords form a Chinese sentence, the sentence must be segmented into words using existing techniques, and each resulting word is used as a query keyword.

2054: Obtain the number of times b that each query keyword t_i occurs in each irrelevant element b_k, and the total number B of all words in the irrelevant element set.

2055: From the number of times each query keyword t_i occurs in each relevant element b_k and the total number of all words in the relevant element set, calculate the probability p_{ik} that each query keyword t_i occurs in each relevant element b_k.

Wherein,

p_{ik} = \frac{a}{A}

2056: From the number of times each query keyword t_i occurs in each irrelevant element b_k and the total number of all words in the irrelevant element set, calculate the probability q_{ik} that each query keyword t_i occurs in each irrelevant element b_k.

Wherein,

q_{ik} = \frac{b}{B}

2057: Calculate the weight of each tag m_j in the first target document set.

The weight of tag m_j is calculated as:

f_{tag}(m_{j}) = \sum_{t_{ik} \in m_{j},\, t_{i} \in Q} t_{ik} \times \log \left( \frac{p_{ik} (1 - q_{ik})}{q_{ik} (1 - p_{ik})} \right)

where t_{ik} is a binary value, 0 or 1, indicating whether element b_k contains the query keyword t_i, and Q denotes the query keywords.
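Sub-steps 2053 to 2057 can be sketched as below. The tuple layout `(t_ik, a, A, b, B)` is a hypothetical way of passing in the counts, and the sketch applies no smoothing, so it assumes that p_ik and q_ik lie strictly between 0 and 1 whenever t_ik is 1.

```python
import math


def term_log_odds(a, A, b, B):
    """log(p(1-q) / (q(1-p))) with p = a/A (relevant) and q = b/B (irrelevant)."""
    p = a / A
    q = b / B
    return math.log((p * (1 - q)) / (q * (1 - p)))


def tag_weight(observations):
    """f_tag(m_j): sum of t_ik * log-odds over the observations for one tag.

    Each observation is a tuple (t_ik, a, A, b, B); rows with t_ik == 0
    contribute nothing and are skipped.
    """
    return sum(t * term_log_odds(a, A, b, B)
               for t, a, A, b, B in observations if t)


# One keyword present in the element (t_ik = 1) and one absent (t_ik = 0).
weight = tag_weight([(1, 2, 4, 1, 4), (0, 0, 4, 0, 4)])
```

With p = 2/4 and q = 1/4 the ratio inside the logarithm is 3, so a keyword that is twice as frequent in relevant elements as in irrelevant ones raises the weight of the tag.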

206: Preprocess the query keywords to obtain the target query keywords.

The target query keywords comprise several query keywords q.

It should be noted that the preprocessing of the query keywords in this step is the same as in step 203 above and is not described again here.

207: Extract the SLCA subtree of each target document in the first target document set as the structural information of that target document.

208: Perform relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document.

Referring to Fig. 5, the relevance score of the SLCA subtree of each document can be calculated bottom-up, which can be implemented by the following sub-steps:

2081: Obtain the number of times tf_{n, q} that each query keyword q in the target query keywords occurs in each node n.

2082: Calculate the TF value TF_q of each query keyword q in the first target document set.

The method of calculating the TF value of each query keyword q in the first target document set in this step is the same as the method of calculating the TF value of each word in sub-step 3) of step 201 above (step 2013), and is not repeated here.

2083: From tf_{n, q} and TF_q of each query keyword q, obtain the relevance score tw(n, q) of each query keyword q with respect to the current node.

Wherein,

tw(n, q) = \frac{tf_{n, q}}{TF_{q}}

2084: When the current node n is a leaf node, calculate the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.

2085: When the current node n is a non-leaf node, calculate the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords.

2086: From the relevance score tw(n, q) of each query keyword q with respect to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords, obtain the relevance score tw_1(n, q) of each query keyword q with respect to the current node n.

Wherein,

tw_{1}(n, q) = tw(n, q) + \sum_{c \in children(n)} d_{n} \, tw(c, q)

2087: From the relevance score tw_1(n, q) of each query keyword q with respect to the current node n, calculate the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.
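The bottom-up scoring of sub-steps 2081 to 2087 can be sketched as follows under two loudly stated assumptions: d_n, which the text does not define, is treated as a single constant factor `d`, and the contribution of each child is taken recursively (its own tw_1), which is one reading of "bottom-up". The `Node` class is a hypothetical stand-in for an SLCA subtree node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical SLCA subtree node: its own words plus child nodes."""
    words: list
    children: list = field(default_factory=list)


def tw(node, q, TF):
    """tw(n, q) = tf_{n,q} / TF_q, where TF maps each keyword to its TF value."""
    return node.words.count(q) / TF[q]


def tw1(node, q, TF, d=0.5):
    """Leaf: tw(n, q). Non-leaf: tw(n, q) plus d times each child's score."""
    own = tw(node, q, TF)
    if not node.children:
        return own
    return own + sum(d * tw1(child, q, TF, d) for child in node.children)


def document_score(root, query, TF, d=0.5):
    """Relevance score of the document: sum of tw1(root, q) over all keywords q."""
    return sum(tw1(root, q, TF, d) for q in query)


leaf = Node(words=["xml", "xml"])
root = Node(words=["xml"], children=[leaf])
score = document_score(root, ["xml"], TF={"xml": 0.5})
```

If the non-recursive reading of the formula is intended, replacing the recursive call `tw1(child, ...)` with `tw(child, ...)` gives the literal form; for a tree of depth two, as in the example, both readings agree.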

209: Obtain the second target document set according to the relevance score of each target document in the first target document set.

Specifically, the documents in the first target document set can be reordered by relevance score from high to low, or by relevance score from low to high.

Optionally, after the documents in the target document set are reordered, documents whose score is less than a first preset value can be excluded and the documents whose score is greater than or equal to the first preset value retained, yielding the second target document set.
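Step 209 with the optional cut-off can be sketched in a few lines. The function name and the dictionary of scores are illustrative; the high-to-low order is one of the two orders the text allows.

```python
def second_target_set(scores, first_preset_value):
    """Reorder documents by relevance score (high to low) and drop low scorers."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, score in ranked if score >= first_preset_value]


kept = second_target_set({"a": 0.2, "b": 0.9, "c": 0.5}, first_preset_value=0.5)
```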

210: According to the target documents in the current second target document set, expand the target query keywords using a pseudo-relevance feedback model to obtain new target query keywords, and judge whether the new target query keywords meet a preset condition;

When the new target query keywords do not meet the preset condition, perform step 211;

When the new target query keywords meet the preset condition, perform step 212.

In this embodiment, specifically, the target query keywords can be expanded with the pseudo-relevance feedback model according to the preset target documents with higher scores in the second target document set, to obtain the new target query keywords, and it is judged whether the new target query keywords meet the preset condition.

It should be noted that the preset condition can be the number of keywords or the number of keyword stems, but is not limited to these.

211: Retrieve the first target document set again using the new target query keywords to obtain a new second target document set, and return to step 210.
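The loop formed by steps 210 and 211 can be sketched as below. The text does not specify how the pseudo-relevance feedback model selects expansion terms, so this sketch assumes the simplest choice, adding the most frequent new words from the top-ranked documents. `retrieve`, `min_keywords`, `top_k`, and `max_rounds` are hypothetical names, and the keyword count is used as the preset condition.

```python
from collections import Counter


def expand_query(query, top_docs, n_new=1):
    """Add the n_new most frequent words of the top documents not already in the query."""
    counts = Counter(w for doc in top_docs for w in doc if w not in query)
    return list(query) + [w for w, _ in counts.most_common(n_new)]


def feedback_loop(query, retrieve, min_keywords=3, top_k=2, max_rounds=5):
    """Alternate retrieval (step 211) and expansion (step 210) until the condition holds."""
    for _ in range(max_rounds):            # guard against never meeting the condition
        ranked_docs = retrieve(query)      # current second target document set
        query = expand_query(query, ranked_docs[:top_k])
        if len(query) >= min_keywords:     # preset condition: number of keywords
            break
    return query


corpus = [["xml", "retrieval", "retrieval", "index"], ["retrieval", "xml", "index"]]
new_query = feedback_loop(["xml"], retrieve=lambda q: corpus)
```

In a full pipeline `retrieve` would re-run the scoring and reordering of steps 204 to 209 with the expanded query; here it is stubbed with a fixed ranking so the loop itself can be followed.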

Referring to Fig. 6, a flowchart of the third stage of the document retrieval method provided by this embodiment, the third stage comprises:

Third stage: fragment generation stage.

212: Retrieve the first target document set again using the new target query keywords, to obtain the third target document set.

It should be noted that the method of obtaining the tag weights of the documents in this step is the same as the method of obtaining the tag weights in step 205 above, and is not described again here.

213: Split each target document in the third target document set into sentences, and calculate the tag weight sum of each resulting sentence.

Referring to Fig. 7, splitting each target document in the third target document set into sentences and calculating the tag weight sum of each resulting sentence can be implemented by the following sub-steps:

2131: Train on the third target document set to obtain the weight of each tag in the third target document set;

2132: Remove the tags, and split each target document in the third target document set into sentences.

It should be noted that splitting a document into sentences belongs to the prior art and is not described again here.

2133: Calculate the weights of the tags corresponding to all words contained in each sentence, to obtain the tag weight sum tagW(s) of each sentence.

The tag weight sum of a sentence is the sum of the weights of the tags corresponding to all words that the sentence contains.
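Sub-step 2133 can be sketched as follows. Representing a sentence as a list of (word, tag) pairs is an assumption about how the tag of each word is remembered after the tags are removed, and giving weight 0.0 to a tag without a trained weight is likewise an assumption.

```python
def tag_weight_sum(sentence, tag_weights):
    """tagW(s): sum of the weights of the tags of all words in the sentence.

    `sentence` is a list of (word, tag) pairs; tags without a trained
    weight are counted as 0.0 in this sketch.
    """
    return sum(tag_weights.get(tag, 0.0) for _word, tag in sentence)


sentence = [("xml", "title"), ("retrieval", "title"), ("method", "p")]
tag_w = tag_weight_sum(sentence, {"title": 1.5, "p": 0.5})
```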

214: Preprocess the query keywords to obtain the target query keywords.

The target query keywords comprise several query keywords q.

215: Score the text content of each sentence according to the target query keywords.

Referring to Fig. 8, this step is implemented as follows:

2151: Calculate the relevance score Score_{query}(s) of the target query keywords with respect to each sentence.

The relevance between a sentence s and the target query keywords depends on three factors: the number of distinct query keywords occurring in the sentence, queryC(s); the number of times each query keyword q_i occurs in the sentence, Occ(q_i, s); and the weight of each query keyword q_i, Weight(q_i).

Specifically, Score_{query}(s) can be calculated with the following formula:

Score_{query}(s) = queryC(s) \times \sum_{i = 1}^{n} Occ(q_{i}, s) \times Weight(q_{i})
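The formula for Score_query(s) can be sketched directly. The sentence is assumed to be already preprocessed into a word list, and the keyword weights are passed in as a dictionary; how Weight(q_i) is obtained is not specified at this point in the text, so the values below are purely illustrative.

```python
def score_query(sentence_words, keyword_weights):
    """Score_query(s) = queryC(s) * sum_i Occ(q_i, s) * Weight(q_i)."""
    occurrences = {q: sentence_words.count(q) for q in keyword_weights}
    query_c = sum(1 for n in occurrences.values() if n > 0)   # distinct keywords hit
    weighted = sum(n * keyword_weights[q] for q, n in occurrences.items())
    return query_c * weighted


s = ["xml", "retrieval", "xml"]
weights = {"xml": 2.0, "index": 1.0, "retrieval": 0.5}   # illustrative weights
value = score_query(s, weights)   # 2 distinct keywords, weighted sum 2*2.0 + 1*0.5
```

Multiplying by queryC(s) rewards sentences that cover several different query keywords over sentences that repeat a single one.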

2152: Calculate the score Score_{sw}(s) of the important words in each sentence.

In this step, an important word is a word whose number of occurrences in the target document is greater than a threshold.

Wherein, Score_{sw}(s) can be calculated by the following formula:

2153: Calculate the title relevance score Score_{title}(s) of each sentence.

Wherein, Score_{title}(s) can be calculated by the following formula:

2154: From Score_{query}(s), Score_{sw}(s), and Score_{title}(s), calculate the relevance score Score_{rel}(s) of the text content of each sentence;

Wherein,

Score_{rel}(s) = \alpha \, Score_{query}(s) + \beta \, Score_{sw}(s) + \gamma \, Score_{title}(s)

where \alpha, \beta, and \gamma are three preset balancing parameters.

216: Calculate the final score of each sentence.

The final score of each sentence can be calculated with the following formula:

Score(s) = (1 + \sigma \times tagW(s)) \times Score_{rel}(s)

where \sigma is a balancing parameter.
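The combination in sub-step 2154 and the final score in step 216 can be sketched as follows. The three component scores are taken as given inputs, and the default parameter values are placeholders chosen only for the example, since the text gives no values for α, β, γ, or σ.

```python
def relevance_score(score_query, score_sw, score_title,
                    alpha=0.5, beta=0.25, gamma=0.25):
    """Score_rel(s) = alpha*Score_query(s) + beta*Score_sw(s) + gamma*Score_title(s)."""
    return alpha * score_query + beta * score_sw + gamma * score_title


def final_score(tag_w, score_rel, sigma=0.5):
    """Score(s) = (1 + sigma * tagW(s)) * Score_rel(s)."""
    return (1 + sigma * tag_w) * score_rel


rel = relevance_score(4.0, 2.0, 2.0)
final = final_score(tag_w=2.0, score_rel=rel)
```

Because tagW(s) enters as a multiplier of the form 1 + σ·tagW(s), a sentence with no tag weight keeps its text relevance score unchanged, and a sentence inside heavily weighted tags is boosted proportionally.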

217: Obtain the target sentences according to the final score of each sentence; the score of a target sentence is greater than or equal to a second preset value.

218: From the target sentences, obtain the sentences whose length is within a preset length range as the retrieval result fragments.
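Steps 217 and 218 can be sketched as two filters applied in sequence. Measuring sentence length in characters is an assumption (the text does not say whether length means characters or words), and the threshold and length bounds are illustrative.

```python
def select_fragments(scored_sentences, second_preset_value, min_len, max_len):
    """Keep sentences scoring at least the threshold, then filter by length."""
    targets = [s for s, score in scored_sentences if score >= second_preset_value]
    return [s for s in targets if min_len <= len(s) <= max_len]   # length in characters


scored = [
    ("XML retrieval works.", 6.0),
    ("Short.", 9.0),
    ("Low score sentence here.", 1.0),
]
fragments = select_fragments(scored, second_preset_value=5.0, min_len=10, max_len=40)
```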

Embodiment 3

Referring to Fig. 9, this embodiment provides a document retrieval device, comprising:

A retrieval unit 301, configured to retrieve a target document set in a pre-established inverted index by using target query keywords obtained through preprocessing, to obtain a first target document set;

An acquiring unit 302, configured to perform relevance scoring on the first target document set to obtain relevance scoring results of the first target documents, and to reorder the first target document set according to the relevance scoring results to obtain a second target document set;

The acquiring unit 302 is further configured to expand the current target query keywords through a pseudo-relevance feedback model to obtain new target query keywords;

The acquiring unit 302 is further configured to, when the new target query keywords meet a preset condition, retrieve the first target document set again by using the new target query keywords, to obtain a third target document set;

A computing unit 303, configured to perform sentence segmentation on each target document in the third target document set, and to calculate the tag weight sum of each sentence obtained by the sentence segmentation;

The computing unit 303 is further configured to perform relevance scoring on the text content of each sentence according to the target query keywords to obtain a relevance scoring result of each sentence, and to obtain the final score of each sentence according to the relevance scoring result of each sentence;

The acquiring unit 302 is further configured to obtain target sentences according to the final score of each sentence, and to obtain, among the target sentences, the sentences whose length is within a preset length range as retrieval result fragments.

Further, the acquiring unit 302 is also configured to obtain the term frequency (TF) value and the inverse document frequency (IDF) value, in the target document set, of each word in the target document set;

Referring to Figure 10, the device further comprises:

An establishing unit 304, configured to establish the inverted index according to the TF value and IDF value of each word in the target document set.

A processing unit 305, configured to perform stop-word removal and stemming on the query keywords to obtain the target query keywords.

The establishing unit 304 is further configured to establish a retrieval model.
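For illustration, an inverted index carrying TF and IDF values might be built as in the following Python sketch. The function name `build_inverted_index`, the use of raw counts as TF and of log(N/df) as IDF are assumptions of this sketch; the document does not fix the exact TF and IDF definitions in this passage.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> list of tokens.

    Returns (postings, idf), where postings[term][doc_id] is the raw term
    frequency and idf[term] = log(N / document frequency).
    """
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    n_docs = len(docs)
    idf = {term: math.log(n_docs / len(p)) for term, p in postings.items()}
    return dict(postings), idf

# Hypothetical two-document collection.
docs = {"d1": ["xml", "index", "xml"], "d2": ["xml", "query"]}
postings, idf = build_inverted_index(docs)
```

A term that occurs in every document, such as "xml" here, receives an IDF of zero under this particular definition.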

Further, the retrieval unit 301 is specifically configured to, according to the retrieval model, retrieve the target document set in the pre-established inverted index by using the target query keywords obtained through preprocessing, to obtain the first target document set.

Further, the computing unit 303 is also configured to train on the first target document set to obtain the weight of each tag in the first target document set.

Further, referring to Figure 11, the computing unit 303 specifically comprises:

An obtaining subunit 3031, configured to obtain all tag names in the first target document set;

A classification subunit 3032, configured to divide, according to the tag names, the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query;

The obtaining subunit 3031 is further configured to obtain the number of times a that each query keyword t_i occurs in each relevant element b_k, and the total number A of all words in the relevant element set;

The obtaining subunit 3031 is further configured to obtain the number of times b that each query keyword t_i occurs in each irrelevant element b_k, and the total number B of all words in the irrelevant element set;

A calculation subunit 3033, configured to calculate, according to the number of times each query keyword t_i occurs in each relevant element b_k and the total number of all words in the relevant element set, the probability p_ik that each query keyword t_i occurs in each relevant element b_k;

Wherein,

p_{ik} = \frac{a}{A}

The calculation subunit 3033 is further configured to calculate, according to the number of times each query keyword t_i occurs in each irrelevant element b_k and the total number of all words in the irrelevant element set, the probability q_ik that each query keyword t_i occurs in each irrelevant element b_k;

Wherein,

q_{ik} = \frac{b}{B}

The calculation subunit 3033 is further configured to calculate the weight of each tag m_j in the first target document set;

Wherein, the weight of tag m_j is calculated by the formula:

f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\ t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)

t_ik takes the value 0 or 1 and indicates whether element b_k contains query keyword t_i; Q is the set of query keywords.
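The tag weight formula may be sketched as follows, purely as an illustration. The names `log_odds` and `tag_weight` and the tuple layout of `entries` are hypothetical, and the clamping of p_ik and q_ik away from 0 and 1 is an added safeguard against division by zero and log(0) that this document does not describe.

```python
import math

def log_odds(a, A, b, B, eps=1e-9):
    """log(p(1-q) / (q(1-p))) with p = a/A and q = b/B, clamped to (eps, 1-eps)."""
    p = min(max(a / A, eps), 1 - eps)
    q = min(max(b / B, eps), 1 - eps)
    return math.log(p * (1 - q) / (q * (1 - p)))

def tag_weight(entries, A, B):
    """f_tag(m_j) for one tag.

    entries: iterable of (t_ik, a, b), one per (query keyword, element) pair
    under tag m_j, where t_ik is 0 or 1, a is the count in the relevant
    element and b the count in the irrelevant element.
    """
    return sum(t_ik * log_odds(a, A, b, B) for t_ik, a, b in entries)

# Hypothetical counts: one keyword present (t_ik = 1), one absent (t_ik = 0).
weight = tag_weight([(1, 4, 1), (0, 0, 3)], A=8, B=8)
```

Here p = 4/8 and q = 1/8 for the first entry, so its contribution is log(7); the second entry contributes nothing because t_ik = 0.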

Further, referring to Figure 12, the acquiring unit 302 specifically comprises:

An extraction subunit 3021, configured to extract the SLCA subtree of each target document in the first target document set as the structural information of that target document;

A calculation subunit 3022, configured to perform relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document.

Further,

The calculation subunit 3022 is specifically configured to obtain the number of times tf_{n,q} that each query keyword q among the target query keywords occurs in each node n;

The calculation subunit 3022 is specifically configured to calculate the TF value TF_q of each query keyword q in the first target document set;

The calculation subunit 3022 is specifically configured to obtain, according to tf_{n,q} and TF_q of each query keyword q, the relevance score tw(n, q) of each query keyword q for the current node;

Wherein,

tw(n, q) = \frac{tf_{n,q}}{TF_q}

The calculation subunit 3022 is specifically configured to, when the current node n is a leaf node, calculate the sum of the relevance scores of the query keywords q with respect to the current node n, as the relevance score of the document.

Further,

The calculation subunit 3022 is also specifically configured to, when the current node n is a non-leaf node, calculate the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords;

The calculation subunit 3022 is also specifically configured to calculate, according to the relevance score tw(n, q) of each query keyword q with respect to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords, the relevance score tw_1(n, q) of each query keyword q with respect to the current node n;

Wherein,

tw_1(n, q) = tw(n, q) + \sum_{c \in children(n)} d_n \cdot tw(c, q)

The calculation subunit 3022 is also specifically configured to calculate, according to the relevance score tw_1(n, q) of each query keyword q with respect to the current node n, the sum of the relevance scores of the query keywords q with respect to the current node n, as the relevance score of the document.
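The leaf and non-leaf cases may be sketched together as below. This is a sketch under assumptions: the `Node` class and function names are hypothetical, d_n is treated as a per-node coefficient stored on the node, and the child scores are computed recursively by the same rule, which is one possible reading of the formula.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tf: dict                                      # keyword -> count in this node's own text
    children: list = field(default_factory=list)
    d: float = 0.5                                # d_n, assumed coefficient of node n

def tw(node, q, TF):
    """tw(n, q) = tf_{n,q} / TF_q (0 when the keyword never occurs in the set)."""
    return node.tf.get(q, 0) / TF[q] if TF.get(q) else 0.0

def tw1(node, q, TF):
    """Leaf: tw(n, q). Non-leaf: tw(n, q) + sum over children c of d_n * score(c, q)."""
    if not node.children:
        return tw(node, q, TF)
    return tw(node, q, TF) + sum(node.d * tw1(c, q, TF) for c in node.children)

def document_score(root, query, TF):
    """Sum the per-keyword scores at the root of the SLCA subtree."""
    return sum(tw1(root, q, TF) for q in query)

# Hypothetical SLCA subtree with two leaves, and collection-level TF values.
leaf1 = Node({"xml": 2})
leaf2 = Node({"xml": 1, "index": 1})
root = Node({"index": 1}, [leaf1, leaf2], d=0.5)
TF = {"xml": 4, "index": 2}
score = document_score(root, ["xml", "index"], TF)
```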

Further, referring to Figure 13, the device further comprises:

A judging unit 306, configured to judge whether the new target query keywords meet the preset condition.

The acquiring unit 302 is further configured to, when the new target query keywords do not meet the preset condition, retrieve the first target document set again by using the new target query keywords, to obtain a new second target document set;

The acquiring unit 302 is further configured to expand the current target query keywords through the pseudo-relevance feedback model, to obtain updated target query keywords;

The retrieval unit 301 is further configured to repeat the above until the updated target query keywords meet the preset condition, and then to retrieve the first target document set again by using the updated target query keywords.
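The iterative expansion described above may be sketched as a simple control loop. Everything concrete here is an assumption made for illustration: the toy `CORPUS`, the stand-in `expand` function (which appends the most frequent non-query term of the retrieved documents), the stand-in preset condition in `meets_condition`, and the `max_rounds` safety bound. The pseudo-relevance feedback model and the preset condition of this document are not reproduced.

```python
from collections import Counter

# Toy corpus standing in for the first target document set (hypothetical data).
CORPUS = {
    "d1": ["xml", "index", "index", "retrieval"],
    "d2": ["xml", "index", "retrieval", "fragment"],
    "d3": ["image", "codec"],
}

def retrieve(query):
    """Return ids of documents containing at least one query keyword."""
    return [d for d, toks in CORPUS.items() if set(query) & set(toks)]

def expand(query, docs, k=1):
    """Stand-in expansion: append the k most frequent non-query terms of docs."""
    counts = Counter(t for d in docs for t in CORPUS[d] if t not in query)
    return list(query) + [t for t, _ in counts.most_common(k)]

def meets_condition(query):
    """Stand-in preset condition: the query has grown to three keywords."""
    return len(query) >= 3

def feedback_loop(query, max_rounds=10):
    docs = retrieve(query)                 # initial result set
    for _ in range(max_rounds):            # safety bound, an added assumption
        query = expand(query, docs)        # new / updated target query keywords
        docs = retrieve(query)             # retrieve again with them
        if meets_condition(query):         # judging step
            break
    return query, docs

final_query, final_docs = feedback_loop(["xml"])
```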

Further, referring to Figure 14, the computing unit 303 specifically comprises:

A first calculation subunit 3034, configured to train on the third target document set to obtain the weight of each tag in the third target document set;

A sentence segmentation subunit 3035, configured to remove the tags and perform sentence segmentation on each target document in the third target document set;

The first calculation subunit 3034 is further configured to calculate the weights of the tags corresponding to all words contained in each sentence, to obtain the tag weight sum tagW(s) of each sentence.
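A minimal sketch of the tag weight sum tagW(s) follows, assuming each word of a sentence is stored together with the name of the tag it came from before the tags were removed. The function name `tag_w`, the pair representation and the default weight of 0.0 for unknown tags are assumptions of this sketch.

```python
def tag_w(sentence, tag_weights):
    """tagW(s): sum of the weights of the tags of all words in the sentence.

    sentence: list of (word, tag_name) pairs.
    tag_weights: dict mapping tag_name -> trained weight.
    """
    return sum(tag_weights.get(tag, 0.0) for _word, tag in sentence)

# Hypothetical sentence and trained tag weights.
sentence = [("xml", "title"), ("retrieval", "title"), ("works", "p")]
tag_weights = {"title": 1.5, "p": 0.25}
total = tag_w(sentence, tag_weights)
```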

Further,

The computing unit 303 is specifically configured to calculate the relevance score Score_query(s) of the target query keywords with respect to each sentence;

Wherein,

Score_{query}(s) = queryC(s) \cdot \sum_{i=1}^{n} Occ(q_i, s) \cdot Weight(q_i)

queryC(s) is the number of distinct query keywords occurring in each sentence; Occ(q_i, s) is the number of times each query keyword q_i occurs in the sentence; Weight(q_i) is the weight of each query keyword q_i;

The computing unit 303 is specifically configured to calculate the important-word score Score_sw(s) of each sentence; an important word is a word whose number of occurrences in the target document is greater than a threshold number;

Wherein,

The computing unit 303 is specifically configured to calculate the title relevance score Score_title(s) of each sentence;

Wherein,

The computing unit 303 is specifically configured to perform, according to Score_query(s), Score_sw(s) and Score_title(s), relevance scoring on the text content of each sentence to obtain Score_rel(s);

Wherein,

Score_{rel}(s) = \alpha \cdot Score_{query}(s) + \beta \cdot Score_{sw}(s) + \gamma \cdot Score_{title}(s)

α, β and γ are preset tuning parameters.

The computing unit 303 is also specifically configured to obtain the final score Score(s) of each sentence according to the formula Score(s) = (1 + \sigma \cdot tagW(s)) \cdot Score_{rel}(s);

Wherein, σ is a preset tuning parameter.

The document retrieval device provided by this embodiment of the present invention enables a user to retrieve with query keywords without browsing the full content of the documents and without knowing the document structure; it is suitable for retrieval over massive document collections, and offers high retrieval performance and accuracy.

Through the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, and of course also by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, hard disk or optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A document retrieval method, characterized by comprising:

2. The method according to claim 1, characterized in that, before the retrieving of the target document set in the pre-established inverted index by using the target query keywords obtained through preprocessing to obtain the first target document set, the method further comprises:

Obtaining the term frequency (TF) value and the inverse document frequency (IDF) value, in the target document set, of each word in the target document set;

Establishing the inverted index according to the TF value and IDF value of each word in the target document set;

Performing stop-word removal and stemming on query keywords to obtain the target query keywords;

Establishing a retrieval model.

3. The method according to claim 2, characterized in that the retrieving of the target document set in the pre-established inverted index by using the target query keywords obtained through preprocessing to obtain the first target document set comprises:

According to the retrieval model, retrieving the target document set in the pre-established inverted index by using the target query keywords obtained through preprocessing, to obtain the first target document set.

4. The method according to claim 1, characterized in that, before the relevance scoring is performed on the first target document set, the method further comprises:

Training on the first target document set to obtain the weight of each tag in the first target document set.

5. The method according to claim 4, characterized in that the training on the first target document set to obtain the weight of each tag in the first target document set comprises:

Obtaining all tag names in the first target document set;

Dividing, according to the tag names, the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query;

Obtaining the number of times a that each query keyword t_i occurs in each relevant element b_k, and the total number A of all words in the relevant element set;

Obtaining the number of times b that each query keyword t_i occurs in each irrelevant element b_k, and the total number B of all words in the irrelevant element set;

Calculating, according to the number of times each query keyword t_i occurs in each relevant element b_k and the total number of all words in the relevant element set, the probability p_ik that each query keyword t_i occurs in each relevant element b_k;

Wherein,

p_{ik} = \frac{a}{A}

Calculating, according to the number of times each query keyword t_i occurs in each irrelevant element b_k and the total number of all words in the irrelevant element set, the probability q_ik that each query keyword t_i occurs in each irrelevant element b_k;

Wherein,

q_{ik} = \frac{b}{B}

Calculating the weight of each tag m_j in the first target document set;

Wherein, the weight of tag m_j is calculated by the formula:

f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\ t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)

t_ik takes the value 0 or 1 and indicates whether the element b_k contains the query keyword t_i; Q is the set of query keywords.

6. The method according to claim 1, characterized in that the performing of relevance scoring on the first target document set to obtain the relevance scoring results of the first target documents comprises:

Extracting the SLCA subtree of each target document in the first target document set as the structural information of that target document;

Performing relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document.

7. The method according to claim 6, characterized in that the performing of relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document comprises:

Obtaining the number of times tf_{n,q} that each query keyword q among the target query keywords occurs in each node n;

Calculating the TF value TF_q of each query keyword q in the first target document set;

Obtaining, according to tf_{n,q} and TF_q of each query keyword q, the relevance score tw(n, q) of each query keyword q for the current node;

Wherein,

tw(n, q) = \frac{tf_{n,q}}{TF_q}

When the current node n is a leaf node, calculating the sum of the relevance scores of the query keywords q with respect to the current node n, as the relevance score of the document.

8. The method according to claim 7, characterized in that, when the current node n is a non-leaf node, the method further comprises:

Calculating the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords;

Calculating, according to the relevance score tw(n, q) of each query keyword q with respect to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords, the relevance score tw_1(n, q) of each query keyword q with respect to the current node n;

Wherein,

tw_1(n, q) = tw(n, q) + \sum_{c \in children(n)} d_n \cdot tw(c, q)

Calculating, according to the relevance score tw_1(n, q) of each query keyword q with respect to the current node n, the sum of the relevance scores of the query keywords q with respect to the current node n, as the relevance score of the document.

9. The method according to claim 1, characterized in that, before the first target document set is retrieved again by using the new target query keywords when the new target query keywords meet the preset condition, the method further comprises:

Judging whether the new target query keywords meet the preset condition;

When the new target query keywords do not meet the preset condition, retrieving the first target document set again by using the new target query keywords, to obtain a new second target document set;

Expanding the current target query keywords through the pseudo-relevance feedback model, to obtain updated target query keywords;

Repeating the above until the updated target query keywords meet the preset condition, and then retrieving the first target document set again by using the updated target query keywords.

10. The method according to claim 1, characterized in that the performing of sentence segmentation on each target document in the third target document set and the calculating of the tag weight sum of each sentence obtained by the sentence segmentation comprise:

Training on the third target document set to obtain the weight of each tag in the third target document set;

Removing the tags and performing sentence segmentation on each target document in the third target document set;

Calculating the weights of the tags corresponding to all words contained in each sentence, to obtain the tag weight sum tagW(s) of each sentence.

11. The method according to claim 10, characterized in that the performing of relevance scoring on the text content of each sentence according to the target query keywords comprises:

1) Calculating the relevance score Score_query(s) of the target query keywords with respect to each sentence;

Wherein,

Score_{query}(s) = queryC(s) \cdot \sum_{i=1}^{n} Occ(q_i, s) \cdot Weight(q_i)

queryC(s) is the number of distinct query keywords occurring in each sentence; Occ(q_i, s) is the number of times each query keyword q_i occurs in the sentence; Weight(q_i) is the weight of each query keyword q_i;

2) Calculating the important-word score Score_sw(s) of each sentence; an important word is a word whose number of occurrences in the target document is greater than a threshold number;

Wherein,

3) Calculating the title relevance score Score_title(s) of each sentence;

Wherein,

4) Performing, according to Score_query(s), Score_sw(s) and Score_title(s), relevance scoring on the text content of each sentence to obtain Score_rel(s);

Wherein, Score_{rel}(s) = \alpha \cdot Score_{query}(s) + \beta \cdot Score_{sw}(s) + \gamma \cdot Score_{title}(s)

α, β and γ are preset tuning parameters.

The obtaining of the final score of each sentence according to the relevance scoring result of each sentence comprises:

Obtaining the final score Score(s) of each sentence according to the formula Score(s) = (1 + \sigma \cdot tagW(s)) \cdot Score_{rel}(s);

Wherein, σ is a preset tuning parameter.

12. A document retrieval device, characterized by comprising:

13. The device according to claim 12, characterized in that,

The acquiring unit is further configured to obtain the term frequency (TF) value and the inverse document frequency (IDF) value, in the target document set, of each word in the target document set;

The device further comprises:

An establishing unit, configured to establish the inverted index according to the TF value and IDF value of each word in the target document set;

A processing unit, configured to perform stop-word removal and stemming on query keywords to obtain the target query keywords;

The establishing unit is further configured to establish a retrieval model.

14. The device according to claim 13, characterized in that,

The retrieval unit is specifically configured to, according to the retrieval model, retrieve the target document set in the pre-established inverted index by using the target query keywords obtained through preprocessing, to obtain the first target document set.

15. The device according to claim 12, characterized in that,

The computing unit is further configured to train on the first target document set to obtain the weight of each tag in the first target document set.

16. The device according to claim 15, characterized in that the computing unit specifically comprises:

An obtaining subunit, configured to obtain all tag names in the first target document set;

A classification subunit, configured to divide, according to the tag names, the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query;

The obtaining subunit is further configured to obtain the number of times a that each query keyword t_i occurs in each relevant element b_k, and the total number A of all words in the relevant element set;

The obtaining subunit is further configured to obtain the number of times b that each query keyword t_i occurs in each irrelevant element b_k, and the total number B of all words in the irrelevant element set;

A calculation subunit, configured to calculate, according to the number of times each query keyword t_i occurs in each relevant element b_k and the total number of all words in the relevant element set, the probability p_ik that each query keyword t_i occurs in each relevant element b_k;

Wherein,

p_{ik} = \frac{a}{A}

The calculation subunit is further configured to calculate, according to the number of times each query keyword t_i occurs in each irrelevant element b_k and the total number of all words in the irrelevant element set, the probability q_ik that each query keyword t_i occurs in each irrelevant element b_k;

Wherein,

q_{ik} = \frac{b}{B}

The calculation subunit is further configured to calculate the weight of each tag m_j in the first target document set;

Wherein, the weight of tag m_j is calculated by the formula:

f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\ t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)

17. The device according to claim 12, characterized in that the acquiring unit specifically comprises:

An extraction subunit, configured to extract the SLCA subtree of each target document in the first target document set as the structural information of that target document;

A calculation subunit, configured to perform relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document.

18. The device according to claim 17, characterized in that,

The calculation subunit is specifically configured to obtain the number of times tf_{n,q} that each query keyword q among the target query keywords occurs in each node n;

The calculation subunit is specifically configured to calculate the TF value TF_q of each query keyword q in the first target document set;

The calculation subunit is specifically configured to obtain, according to tf_{n,q} and TF_q of each query keyword q, the relevance score tw(n, q) of each query keyword q for the current node;

Wherein,

tw(n, q) = \frac{tf_{n,q}}{TF_q}

The calculation subunit is specifically configured to, when the current node n is a leaf node, calculate the sum of the relevance scores of the query keywords q with respect to the current node n, as the relevance score of the document.

19. The device according to claim 18, characterized in that,

The calculation subunit is also specifically configured to, when the current node n is a non-leaf node, calculate the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords;

The calculation subunit is also specifically configured to calculate, according to the relevance score tw(n, q) of each query keyword q with respect to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords, the relevance score tw_1(n, q) of each query keyword q with respect to the current node n;

Wherein,

tw_1(n, q) = tw(n, q) + \sum_{c \in children(n)} d_n \cdot tw(c, q)

The calculation subunit is also specifically configured to calculate, according to the relevance score tw_1(n, q) of each query keyword q with respect to the current node n, the sum of the relevance scores of the query keywords q with respect to the current node n, as the relevance score of the document.

20. The device according to claim 12, characterized in that the device further comprises:

A judging unit, configured to judge whether the new target query keywords meet the preset condition;

The acquiring unit is further configured to expand the current target query keywords through the pseudo-relevance feedback model, to obtain updated target query keywords;

The retrieval unit is further configured to repeat the above until the updated target query keywords meet the preset condition, and then to retrieve the first target document set again by using the updated target query keywords.

21. The device according to claim 12, characterized in that the computing unit specifically comprises:

A first calculation subunit, configured to train on the third target document set to obtain the weight of each tag in the third target document set;

A sentence segmentation subunit, configured to remove the tags and perform sentence segmentation on each target document in the third target document set;

The first calculation subunit is further configured to calculate the weights of the tags corresponding to all words contained in each sentence, to obtain the tag weight sum tagW(s) of each sentence.

22. The device according to claim 21, characterized in that,

The computing unit is specifically configured to calculate the relevance score Score_query(s) of the target query keywords with respect to each sentence;

Wherein,

Score_{query}(s) = queryC(s) \cdot \sum_{i=1}^{n} Occ(q_i, s) \cdot Weight(q_i)

The computing unit is specifically configured to calculate the important-word score Score_sw(s) of each sentence; an important word is a word whose number of occurrences in the target document is greater than a threshold number;

Wherein,

The computing unit is specifically configured to calculate the title relevance score Score_title(s) of each sentence;

Wherein,

The computing unit is specifically configured to perform, according to Score_query(s), Score_sw(s) and Score_title(s), relevance scoring on the text content of each sentence to obtain Score_rel(s);

Wherein, Score_{rel}(s) = \alpha \cdot Score_{query}(s) + \beta \cdot Score_{sw}(s) + \gamma \cdot Score_{title}(s)

α, β and γ are preset tuning parameters.

The computing unit is also specifically configured to obtain the final score Score(s) of each sentence according to the formula Score(s) = (1 + \sigma \cdot tagW(s)) \cdot Score_{rel}(s);

Wherein, σ is a preset tuning parameter.