CN103678412B

CN103678412B - A method and device for document retrieval

Info

Publication number: CN103678412B
Application number: CN201210360872.XA
Authority: CN
Inventors: 洪毅虹; 杨建武
Original assignee: Peking University; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Peking University; Beijing Founder Electronics Co Ltd
Priority date: 2012-09-21
Filing date: 2012-09-21
Publication date: 2016-12-21
Anticipated expiration: 2032-09-21
Also published as: CN103678412A

Abstract

The present invention provides a method and device for document retrieval, belonging to the field of information retrieval. The method includes: using target query keywords to retrieve a target document set in a pre-built inverted index to obtain a first target document set; performing relevance scoring on the first target document set to obtain relevance scores of the first target documents, and reordering to obtain a second target document set; expanding the current target query keywords by a pseudo-relevance feedback model to obtain new target query keywords, and in turn obtaining a third target document set; performing sentence segmentation on the target documents in the third target document set and calculating the tag weight sum of each sentence; performing relevance scoring on the content of each sentence according to the target query keywords to obtain the final score of each sentence, thereby obtaining target sentences; and obtaining, from the target sentences, the sentences whose length is within a preset length range as retrieval result fragments. The invention improves the retrieval performance and accuracy of XML document retrieval.

Description

A method and device for document retrieval

Technical field

The present invention relates to the field of information retrieval, and in particular to a method and device for document retrieval.

Background art

HTML (Hypertext Markup Language), the main carrier of traditional Internet information, provides users with a convenient way of presenting information and is mainly concerned with how information is displayed in a web browser. As Web applications become increasingly widespread, the limitations of the HTML data model are increasingly apparent: HTML cannot describe data, and its tag set is fixed and limited, so users cannot add meaningful tags according to their own needs. XML (Extensible Markup Language) emerged in response.

XML is self-describing, platform-independent, extensible, and easy to use; it can represent data in a readable form without being constrained by how the data is presented. XML allows data to be exchanged between incompatible systems and reduces the complexity of data sharing and transmission. An XML document carries both content information and structural information, which makes it possible to exchange and integrate large amounts of data over the Internet. As more and more Web applications, such as web services, e-commerce, and digital libraries, use XML as the carrier for storing, exchanging, and publishing massive data, how to efficiently retrieve useful information from massive XML document collections has attracted the attention of a growing number of researchers.

At present, XML document retrieval can be performed in the following two retrieval modes:

The first is retrieval based on XML document structure.

In this mode, the user must understand the structure of the queried XML document in order to construct a query expression.

The second is retrieval based on keywords.

In this mode, the query expression is written in advance by an author, so the user neither needs to learn a complex query language nor needs a deep understanding of the underlying data structure of the XML document; the user only needs to enter keywords related to the content of interest to complete a query. Existing methods include MLCA, SLCA, XRank, XSEarch, XSeek, etc.

However, in the first mode, on the one hand, most XML documents on the Internet do not provide users with complete structural information; on the other hand, there are also large numbers of heterogeneous XML documents on the Internet. In both cases it is difficult for users to construct query expressions for XML structures with existing languages. In the second mode, most XML keyword query methods are built on a tree storage model, which requires the author to know the structure of the XML document in advance when writing the query expression.

In summary, existing XML document retrieval models require the user to browse the full content of an XML document or to know the structure of the queried XML document in advance, and they occupy a large amount of storage space during retrieval. Now that XML documents come in massive volumes, the retrieval performance and accuracy of existing XML document retrieval models are relatively low.

Summary of the invention

Embodiments of the present invention provide a document retrieval method and device that improve the retrieval performance and accuracy of XML document retrieval.

To achieve the above object, embodiments of the present invention adopt the following technical solutions:

A method for document retrieval, including:

using target query keywords obtained through preprocessing to retrieve a target document set in a pre-built inverted index, to obtain a first target document set;

performing relevance scoring on the first target document set to obtain relevance scores of the first target documents, and reordering the first target document set according to the relevance scores to obtain a second target document set;

expanding the current target query keywords by a pseudo-relevance feedback model to obtain new target query keywords;

when the new target query keywords satisfy a preset condition, using the new target query keywords to retrieve the first target document set again, to obtain a third target document set;

performing sentence segmentation on each target document in the third target document set, and calculating the tag weight sum of each sentence obtained by the sentence segmentation;

performing relevance scoring on the text content of each sentence according to the target query keywords to obtain a relevance score for each sentence, and obtaining the final score of each sentence according to its relevance score;

obtaining target sentences according to the final score of each sentence, and obtaining, from the target sentences, the sentences whose length is within a preset length range as retrieval result fragments.

The present invention also provides a device for document retrieval, including:

a retrieval unit, configured to use target query keywords obtained through preprocessing to retrieve a target document set in a pre-built inverted index, to obtain a first target document set;

an acquiring unit, configured to perform relevance scoring on the first target document set to obtain relevance scores of the first target documents, and to reorder the first target document set according to the relevance scores to obtain a second target document set;

the acquiring unit being further configured to expand the current target query keywords by a pseudo-relevance feedback model to obtain new target query keywords;

the acquiring unit being further configured to, when the new target query keywords satisfy the preset condition, use the new target query keywords to retrieve the first target document set again, to obtain a third target document set;

the acquiring unit being further configured to, when the new target query keywords do not satisfy the preset condition, use the new target query keywords to retrieve the first target document set again, to obtain a new second target document set;

a computing unit, configured to perform sentence segmentation on each target document in the third target document set, and to calculate the tag weight sum of each sentence obtained by the sentence segmentation;

the computing unit being further configured to perform relevance scoring on the text content of each sentence according to the target query keywords to obtain a relevance score for each sentence, and to obtain the final score of each sentence according to its relevance score;

the acquiring unit being further configured to obtain target sentences according to the final score of each sentence, and to obtain, from the target sentences, the sentences whose length is within a preset length range as retrieval result fragments.

The document retrieval method and device provided by the embodiments of the present invention enable a user to retrieve with query keywords without browsing the full content of the documents and without knowing the document structure; they are applicable to retrieval over massive document collections and achieve high retrieval performance and accuracy.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a document retrieval method provided by Embodiment 1 of the present invention;

Fig. 2 is a flowchart of the first stage of a document retrieval method provided by Embodiment 2 of the present invention;

Fig. 3 is a flowchart of the second stage of a document retrieval method provided by Embodiment 2 of the present invention;

Fig. 4 is a schematic flowchart of a method, provided by Embodiment 2 of the present invention, for training on the first target document set to obtain tag weights;

Fig. 5 is a schematic flowchart of a method, provided by Embodiment 2 of the present invention, for computing the relevance score of the SLCA subtree of each document;

Fig. 6 is a flowchart of the third stage of a document retrieval method provided by Embodiment 2 of the present invention;

Fig. 7 is a schematic flowchart of a method, provided by Embodiment 2 of the present invention, for performing sentence segmentation on each target document in the third target document set and calculating the tag weight sum of each resulting sentence;

Fig. 8 is a schematic flowchart of a method, provided by Embodiment 2 of the present invention, for scoring the text content of each sentence according to the target query keywords;

Fig. 9 is a schematic structural diagram of a document retrieval device provided by Embodiment 3 of the present invention;

Fig. 10 is a second schematic structural diagram of a document retrieval device provided by Embodiment 3 of the present invention;

Fig. 11 is a schematic structural diagram of the computing unit in a document retrieval device provided by Embodiment 3 of the present invention;

Fig. 12 is a schematic structural diagram of the acquiring unit in a document retrieval device provided by Embodiment 3 of the present invention;

Fig. 13 is a third schematic structural diagram of a document retrieval device provided by Embodiment 3 of the present invention;

Fig. 14 is a second schematic structural diagram of the computing unit in a document retrieval device provided by Embodiment 3 of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Embodiment 1

Referring to Fig. 1, a flowchart of a document retrieval method provided by this embodiment, the method includes:

A. Use the target query keywords obtained through preprocessing to retrieve the target document set in the pre-built inverted index, to obtain a first target document set.

B. Perform relevance scoring on the first target document set to obtain relevance scores of the first target documents, and reorder the first target document set according to the relevance scores to obtain a second target document set.

C. Expand the current target query keywords by a pseudo-relevance feedback model to obtain new target query keywords.

D. When the new target query keywords satisfy a preset condition, use the new target query keywords to retrieve the first target document set again, to obtain a third target document set.

E. Perform sentence segmentation on each target document in the third target document set, and calculate the tag weight sum of each sentence obtained by the sentence segmentation.

F. Perform relevance scoring on the text content of each sentence according to the target query keywords to obtain a relevance score for each sentence, and obtain the final score of each sentence according to its relevance score.

G. Obtain target sentences according to the final score of each sentence, and obtain, from the target sentences, the sentences whose length is within a preset length range as retrieval result fragments.

The document retrieval method provided by this embodiment enables a user to retrieve with query keywords without browsing the full content of the documents and without knowing the document structure; it is applicable to retrieval over massive document collections and achieves high retrieval performance and accuracy.

Embodiment 2

In this embodiment, the document retrieval method is divided into three stages: the first stage is a fuzzy retrieval stage, which narrows down the set of semi-structured documents; the second stage is a precise retrieval stage, which obtains the set of documents accurately related to the query; and the third stage is a fragment generation stage.

Referring to Fig. 2, a flowchart of the first stage of the document retrieval method provided by this embodiment, the first stage includes:

First stage: fuzzy retrieval stage.

201: Preprocess the target document set.

In this embodiment, the target document set is the set of XML semi-structured documents to be queried.

Preprocessing the target document set can be implemented by the following sub-steps:

1) Remove the stop words in the target document set.

Stop words can be configured in advance by the user. In English, they can be words without concrete meaning such as "in", "the", and "oh", as well as punctuation marks; in Chinese, they can likewise be function words such as particles, as well as punctuation marks, that carry no concrete meaning.

For example, the following two articles are part of the content of the document set and are used to illustrate the removal of stop words from the target document set:

The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.

The content of article 2 is: He once lived in Shanghai.

The contents of article 1 and article 2 are each a character string. First, all words in article 1 and article 2 are found by splitting on spaces, each word being a keyword; then the stop words are removed from article 1 and article 2. After stop word removal, article 1 and article 2 are as follows:

Article 1 after stop word removal: [Tom] [lives] [Guangzhou] [I] [live] [Guangzhou]

Article 2 after stop word removal: [He] [lived] [Shanghai].

It should be noted that when Chinese sentences occur in a document, the Chinese sentences first need to be segmented into words using prior-art word segmentation, and then the stop words are removed from the document.

2) Extract the stems of the target document set.

First, when the content of the target document set is in English, the case of all words is unified; for example, when the user searches for "He", the words "HE" and "he" can also be found.

Second, when the content of the target document set is in English, all words are reduced to their stems; for example, when the user searches for "live", the words "lives" and "lived" should also be found, so "lives" and "lived" need to be reduced to "live".

For example, taking the above article 1 and article 2 after stop word removal, after stem extraction:

All key words of article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou]

All key words of article 2 are: [he] [live] [shanghai].
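Sub-steps 1) and 2) can be illustrated with a minimal Python sketch. The stop word list and the small stem lookup table below are assumptions made for this example only (the text specifies neither a stop word list nor a stemming algorithm); a practical system would substitute a real stemmer.

```python
import re

# Hypothetical stop word list, chosen so that the two example articles
# reduce to the keywords shown in the text.
STOP_WORDS = {"in", "the", "oh", "too", "once"}

# Hypothetical lookup table standing in for a real stemming algorithm.
STEMS = {"lives": "live", "lived": "live"}


def preprocess(text):
    """Tokenize, remove stop words, unify case, and reduce words to stems."""
    words = re.findall(r"[A-Za-z]+", text)         # drops punctuation as well
    kept = [w for w in words if w.lower() not in STOP_WORDS]
    lowered = [w.lower() for w in kept]            # unify upper and lower case
    return [STEMS.get(w, w) for w in lowered]      # reduce words to stems


article_1 = "Tom lives in Guangzhou, I live in Guangzhou too."
article_2 = "He once lived in Shanghai."
print(preprocess(article_1))
print(preprocess(article_2))
```

The same function could be applied to the query keywords in step 203, since the text states that queries are preprocessed in the same way as documents.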

3) Calculate the TF (term frequency) value and the IDF (inverse document frequency) value of each stemmed word in the target document set.

The TF value of each word in the target document set can be calculated with the following formula:

TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}

In the above formula, n_{i,j} is the number of occurrences of the word t_i in document d_j of the target document set, and the denominator is the sum of the numbers of occurrences of all words in d_j.

The IDF value of each word in the target document set can be calculated with the following formula:

IDF_{i} = \log \frac{|D|}{|\{j : t_{i} \in d_{j}\}|}

where |D| is the total number of documents in the target document set, and |\{j : t_{i} \in d_{j}\}| is the number of documents containing the word t_i (that is, the number of documents for which n_{i,j} ≠ 0).
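As an illustration of the two formulas, the following Python sketch computes TF and IDF for the two example articles after preprocessing. The natural logarithm is an assumption, since the text does not state the base of the logarithm; the function names are chosen for this example only.

```python
import math


def tf(term, doc):
    """TF_{i,j}: occurrences of term in doc divided by all word occurrences in doc."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """IDF_i: log(number of documents / number of documents containing term)."""
    containing = sum(1 for d in docs if term in d)
    if containing == 0:
        raise ValueError("term does not occur in any document")
    return math.log(len(docs) / containing)   # natural log is an assumption


docs = [
    ["tom", "live", "guangzhou", "i", "live", "guangzhou"],  # article 1
    ["he", "live", "shanghai"],                               # article 2
]
print(tf("live", docs[0]), idf("live", docs), idf("tom", docs))
```

Here "live" occurs in both articles, so its IDF is zero, while "tom" occurs in only one article and therefore receives a positive IDF.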

202: Build an inverted index over the preprocessed target document set.

For example, taking the above article 1 and article 2, after the inverted index is built, each keyword in article 1 and article 2 corresponds to an article number[occurrence frequency] and a list of keyword positions as follows:

guangzhou 1[2] 3,6

he 2[1] 1

i 1[1] 4

live 1[2], 2[1] 2,5; 2

shanghai 2[1] 3

tom 1[1] 1

After the inverted index is built, the number of times each keyword occurs in each article and its specific positions are known.
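The index layout above can be reproduced with a short Python sketch. The dictionary-of-dictionaries structure is an assumption for illustration; the text does not prescribe how the index is stored.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each keyword to {article number: [1-based keyword positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens, start=1):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}
index = build_inverted_index(docs)
for keyword in sorted(index):
    # The occurrence frequency in an article is the length of its position list.
    print(keyword, {d: (len(p), p) for d, p in index[keyword].items()})
```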

203: Preprocess the query keywords to obtain the target query keywords.

Preprocessing the query keywords can be implemented by the following sub-steps:

1) Remove the stop words in the query keywords.

It should be noted that this step is implemented in the same way as the removal of stop words from the target document set in step 2011 above (sub-step 1) of step 201), and is not described again here.

2) Extract the stems of the query keywords to obtain the target query keywords.

It should be noted that this step is implemented in the same way as the extraction of stems of the target document set in step 2012 above (sub-step 2) of step 201), and is not described again here.

204: According to a retrieval model, use the target query keywords obtained through preprocessing to retrieve the target document set in the inverted index, to obtain the first target document set.

It should be noted that the retrieval model is built using a probabilistic method and a language model, and the retrieval process uses Dirichlet smoothing to narrow down the target document set. Building the retrieval model and Dirichlet smoothing both belong to the prior art and are not described in detail here.
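Since the text treats the retrieval model as prior art, the following Python sketch shows only the commonly used query-likelihood form with Dirichlet smoothing, p(w|d) = (c(w,d) + μ·p(w|C)) / (|d| + μ). The value of μ, the function name, and the handling of terms absent from the whole collection are assumptions of this example, not details taken from the text.

```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection, mu=2000.0):
    """Log query likelihood of doc using Dirichlet-smoothed term probabilities."""
    coll_counts = Counter(t for d in collection for t in d)
    coll_len = sum(coll_counts.values())
    doc_counts = Counter(doc)
    score = 0.0
    for term in query:
        p_coll = coll_counts[term] / coll_len   # background probability p(w|C)
        if p_coll == 0.0:
            continue                            # assumption: skip unseen terms
        p = (doc_counts[term] + mu * p_coll) / (len(doc) + mu)
        score += math.log(p)
    return score


docs = [
    ["tom", "live", "guangzhou", "i", "live", "guangzhou"],  # article 1
    ["he", "live", "shanghai"],                               # article 2
]
query = ["shanghai"]
ranked = sorted(range(len(docs)),
                key=lambda i: dirichlet_score(query, docs[i], docs),
                reverse=True)
print([i + 1 for i in ranked])   # article numbers, best match first
```

Keeping only the top-ranked documents of such a ranking is one way the first target document set could be narrowed down.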

Referring to Fig. 3, a flowchart of the second stage of the document retrieval method provided by this embodiment, the second stage includes:

Second stage: precise retrieval stage.

205: Train on the first target document set to obtain the weight of each tag in the first target document set.

Referring to Fig. 4, training on the first target document set to obtain the tag weights can be implemented by the following sub-steps:

2051: Obtain all tag names in the first target document set.

2052: According to the tag names, divide the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query.

2053: Obtain the number of times a that each query keyword t_i occurs in each relevant element b_k, and the total number A of all words in the relevant element set.

It should be noted that when the query keywords are in English, each word is taken as a query keyword; when the query keywords form a Chinese sentence, the sentence needs to be segmented using prior-art word segmentation, and each word obtained is taken as a query keyword.

2054: Obtain the number of times b that each query keyword t_i occurs in each irrelevant element b_k, and the total number B of all words in the irrelevant element set.

2055: According to the number of times each query keyword t_i occurs in each relevant element b_k and the total number of all words in the relevant element set, calculate the probability p_{ik} that each query keyword t_i occurs in each relevant element b_k,

where

p_{ik} = \frac{a}{A}

2056: According to the number of times each query keyword t_i occurs in each irrelevant element b_k and the total number of all words in the irrelevant element set, calculate the probability q_{ik} that each query keyword t_i occurs in each irrelevant element b_k,

where

q_{ik} = \frac{b}{B}

2057: Calculate the weight of each tag m_j in the first target document set.

The weight of tag m_j is calculated as:

f_{tag}(m_{j}) = \sum_{t_{ik} \in m_{j},\, t_{i} \in Q} t_{ik} \times \log \left( \frac{p_{ik} (1 - q_{ik})}{q_{ik} (1 - p_{ik})} \right)

where t_{ik} is a binary value (0 or 1) indicating whether element b_k contains the query keyword t_i, and Q is the set of query keywords.
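The tag weight formula can be sketched in Python as follows. The input layout (one tuple of t_ik, a, A, b, B per keyword and element pair under the tag) and the clamping of probabilities away from 0 and 1 are assumptions of this example; the text does not say how zero counts are handled.

```python
import math


def tag_weight(observations, eps=1e-9):
    """Sum t_ik * log(p_ik (1 - q_ik) / (q_ik (1 - p_ik))) over one tag's pairs.

    Each observation is (t_ik, a, A, b, B) for one query keyword t_i and one
    element b_k carrying the tag m_j.
    """
    total = 0.0
    for t_ik, a, A, b, B in observations:
        p = min(max(a / A, eps), 1 - eps)   # p_ik = a / A, clamped (assumption)
        q = min(max(b / B, eps), 1 - eps)   # q_ik = b / B, clamped (assumption)
        total += t_ik * math.log(p * (1 - q) / (q * (1 - p)))
    return total


# Hypothetical counts for a single tag: one keyword present, one absent.
print(tag_weight([(1, 3, 10, 1, 10), (0, 0, 10, 2, 10)]))
```

A keyword that is more likely in relevant elements than in irrelevant ones (p_ik > q_ik) contributes a positive term, so tags whose elements tend to contain query keywords in relevant contexts receive larger weights.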

206: Preprocess the query keywords to obtain the target query keywords.

The target query keywords comprise several query keywords q.

It should be noted that the query keywords are preprocessed in this step in the same way as in step 203 above, which is not repeated here.

207: Extract the SLCA subtree of each target document in the first target document set as the structural information of that target document.

208: Perform relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document.

Referring to Fig. 5, the relevance score of the SLCA subtree of each document can be computed bottom-up, by the following sub-steps:

2081: Obtain the number of times tf_{n,q} that each query keyword q of the target query keywords occurs in each node n.

2082: Calculate the TF value TF_q of each query keyword q in the first target document set.

The TF value of each query keyword q in the first target document set is calculated in this step in the same way as the TF value of each stemmed word in the target document set in step 2013 above (sub-step 3) of step 201), which is not repeated here.

2083: According to tf_{n,q} and TF_q of each query keyword q, obtain the relevance score tw(n, q) of each query keyword q with respect to the current node,

where

tw(n, q) = \frac{tf_{n,q}}{TF_{q}}

2084: When the current node n is a leaf node, calculate the sum of the relevance scores of the query keywords q with respect to the current node n as the relevance score of the document.

2085: When the current node n is a non-leaf node, calculate the relevance score tw(c, q) of every child node c of the current node n with respect to the target query keywords.

2086: According to the relevance score tw(n, q) of each query keyword q with respect to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords, obtain the relevance score tw_1(n, q) of each query keyword q with respect to the current node n,

where tw_1(n, q) = tw(n, q) + \sum_{c \in children(n)} d_n \cdot tw(c, q)

2087: According to the relevance score tw_1(n, q) of each query keyword q with respect to the current node n, calculate the sum of the relevance scores of the query keywords q with respect to the current node n as the relevance score of the document.
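The bottom-up scoring of sub-steps 2081 to 2087 can be sketched in Python as below. Several points are assumptions of this example rather than statements of the text: the `Node` class is hypothetical, d_n is treated as a per-node damping coefficient (the text does not define it), and the child term is applied recursively so that scores propagate from the leaves up to the root of the SLCA subtree.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical node of an SLCA subtree: its own words plus child nodes."""
    tokens: List[str]
    children: List["Node"] = field(default_factory=list)
    decay: float = 0.5            # d_n, assumed to be a damping coefficient


def tw(node: Node, q: str, TF: Dict[str, float]) -> float:
    """tw(n, q) = tf_{n,q} / TF_q; zero if TF_q is missing or zero."""
    tf_q = TF.get(q, 0.0)
    return node.tokens.count(q) / tf_q if tf_q else 0.0


def tw1(node: Node, q: str, TF: Dict[str, float]) -> float:
    """Own score plus the damped scores of all children, computed bottom-up."""
    own = tw(node, q, TF)
    if not node.children:         # leaf node: step 2084
        return own
    return own + sum(node.decay * tw1(c, q, TF) for c in node.children)


def document_score(root: Node, query: List[str], TF: Dict[str, float]) -> float:
    """Sum of the per-keyword scores at the root of the SLCA subtree."""
    return sum(tw1(root, q, TF) for q in query)


leaf = Node(["xml", "retrieval"])
root = Node(["xml"], children=[leaf], decay=0.5)
TF = {"xml": 0.5, "retrieval": 0.25}   # hypothetical collection TF values
print(document_score(root, ["xml", "retrieval"], TF))
```

With these hypothetical values, the root contributes 2.0 for "xml" directly, and the leaf's scores (2.0 for "xml", 4.0 for "retrieval") are halved before being added.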

209: Obtain the second target document set according to the relevance score of each target document in the first target document set.

Specifically, the documents in the first target document set can be re-sorted in descending order of relevance score, or alternatively in ascending order of relevance score.

Optionally, after the documents are re-sorted, documents whose score is below a first preset value can be removed and the documents whose score is greater than or equal to the first preset value retained, to obtain the second target document set.

210: Use the pseudo-relevance feedback model to expand the target query keywords according to the target documents in the current second target document set, obtain new target query keywords, and judge whether the new target query keywords satisfy the preset condition;

when the new target query keywords do not satisfy the preset condition, perform step 211;

when the new target query keywords satisfy the preset condition, perform step 212.

In this embodiment, specifically, the pseudo-relevance feedback model can be applied to a preset number of higher-scoring target documents in the second target document set to expand the target query keywords, obtain new target query keywords, and judge whether the new target query keywords satisfy the preset condition.

It should be noted that the preset condition may concern the number of keywords or the number of keyword stems, but is not limited to these.

211: Use the new target query keywords to retrieve the first target document set again to obtain a new second target document set, and return to step 210.
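The loop formed by steps 210 and 211 can be sketched in Python as follows. The text does not specify how the feedback model selects expansion terms, so choosing the most frequent non-query terms of the top-ranked documents is a stand-in assumption, as are the function names, the `retrieve` callback, and the use of a minimum keyword count as the preset condition.

```python
from collections import Counter


def expand_query(query, top_docs, n_new=1):
    """Add the n_new most frequent terms of top_docs not already in the query."""
    counts = Counter(t for d in top_docs for t in d if t not in query)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return query + [term for term, _ in ordered[:n_new]]


def feedback_loop(query, second_set, retrieve, min_terms, top_k=2, max_rounds=10):
    """Repeat steps 210 and 211 until the assumed preset condition is satisfied."""
    for _ in range(max_rounds):
        new_query = expand_query(query, second_set[:top_k])      # step 210
        if len(new_query) >= min_terms or new_query == query:
            return new_query                                     # go to step 212
        query = new_query
        second_set = retrieve(query)                             # step 211
    return query


# Hypothetical ranked documents; retrieve() here simply returns them again.
ranked_docs = [["xml", "retrieval", "index"], ["xml", "index", "keyword"]]
result = feedback_loop(["xml"], ranked_docs, lambda q: ranked_docs, min_terms=3)
print(result)
```

The `max_rounds` bound and the check for an unchanged query are safeguards added in this sketch so that the loop always terminates.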

Referring to Fig. 6, a flowchart of the third stage of the document retrieval method provided by this embodiment, the third stage includes:

Third stage: fragment generation stage.

212: Use the new target query keywords to retrieve the first target document set again, to obtain the third target document set.

It should be noted that the tag weights of the documents are obtained in this step in the same way as the tag weights are obtained in step 205 above, which is not described again here.

213: Perform sentence segmentation on each target document in the third target document set, and calculate the tag weight sum of each sentence obtained by the sentence segmentation.

Referring to Fig. 7, performing sentence segmentation on each target document in the third target document set and calculating the tag weight sum of each resulting sentence can be implemented by the following sub-steps:

2131: Train on the third target document set to obtain the weight of each tag in the third target document set;

2132: Remove the tags and perform sentence segmentation on each target document in the third target document set.

It should be noted that sentence segmentation of documents belongs to the prior art and is not described here.

2133: Calculate the weights of the tags corresponding to all words contained in each sentence, to obtain the tag weight sum tagW(s) of each sentence.

The tag weight sum of a sentence is the sum of the weights of the tags corresponding to all the words the sentence contains.

214: Preprocess the query keywords to obtain the target query keywords.

The target query keywords include several query keywords q.

215: Score the text content of each sentence according to the target query keywords.

Referring to Fig. 8, this step is implemented as follows:

2151: Calculate the relevance score Score_query(s) of the target query keywords with respect to each sentence.

The relevance of a sentence s to the target query keywords depends on three factors: the number of distinct query keywords occurring in the sentence, queryC(s); the number of times each query keyword q_i occurs in the sentence, Occ(q_i, s); and the weight of each query keyword q_i, Weight(q_i).

Specifically, Score_query(s) can be calculated by the following formula:

Score_{query}(s) = queryC(s) \times \sum_{i = 1}^{n} Occ(q_{i}, s) \times Weight(q_{i})
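A minimal Python sketch of the Score_query(s) formula follows. It assumes that the summation covers the product Occ(q_i, s) × Weight(q_i) for each keyword and that sentences are already tokenized; the keyword weights are hypothetical values.

```python
def score_query(sentence, query, weight):
    """Score_query(s) = queryC(s) * sum_i Occ(q_i, s) * Weight(q_i)."""
    # queryC(s): number of distinct query keywords occurring in the sentence
    distinct = sum(1 for q in query if q in sentence)
    # Occ(q_i, s) * Weight(q_i), summed over all query keywords
    weighted = sum(sentence.count(q) * weight[q] for q in query)
    return distinct * weighted


sentence = ["xml", "retrieval", "xml"]
query = ["xml", "retrieval", "index"]
weight = {"xml": 1.0, "retrieval": 2.0, "index": 3.0}   # hypothetical weights
print(score_query(sentence, query, weight))
```

Multiplying by queryC(s) favors sentences that cover several different query keywords over sentences that merely repeat one keyword many times.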

2152: Calculate the score Score_sw(s) of the important words in each sentence.

In this step, an important word is a word whose number of occurrences in the target document exceeds a threshold number.

Score_sw(s) can be calculated by the following formula:

2153: Calculate the title relevance score Score_title(s) of each sentence.

Score_title(s) can be calculated by the following formula:

2154: According to Score_query(s), Score_sw(s), and Score_title(s), compute the relevance score Score_rel(s) of the text content of each sentence,

where

Score_rel(s) = α·Score_query(s) + β·Score_sw(s) + γ·Score_title(s)

α, β, and γ above are three preset weighting parameters.

216: Calculate the final score of each sentence.

The final score of each sentence can be calculated by the following formula:

Score(s) = (1 + σ·tagW(s))·Score_rel(s)

where σ in the above formula is a weighting parameter.
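The combination in steps 2154 and 216 can be sketched in Python as below. Because the formulas for Score_sw(s) and Score_title(s) are not reproduced in this text, the sketch takes all three component scores as inputs; the parameter values are hypothetical.

```python
def score_rel(s_query, s_sw, s_title, alpha, beta, gamma):
    """Score_rel(s) = alpha*Score_query(s) + beta*Score_sw(s) + gamma*Score_title(s)."""
    return alpha * s_query + beta * s_sw + gamma * s_title


def final_score(rel, tag_w, sigma):
    """Score(s) = (1 + sigma * tagW(s)) * Score_rel(s)."""
    return (1 + sigma * tag_w) * rel


# Hypothetical component scores and weighting parameters.
rel = score_rel(8.0, 1.0, 0.5, alpha=0.5, beta=0.25, gamma=0.25)
print(rel, final_score(rel, tag_w=2.0, sigma=0.5))
```

The factor (1 + σ·tagW(s)) leaves the content score unchanged when the tag weight sum is zero and boosts sentences whose words sit under highly weighted tags.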

217: Obtain the target sentences according to the final score of each sentence, the score of each target sentence being greater than or equal to a second preset value.

218: From the target sentences, obtain the sentences whose length is within the preset length range as retrieval result fragments.
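Steps 217 and 218 amount to two successive filters, sketched in Python below. Measuring sentence length in characters and the particular threshold values are assumptions of this example; the text only refers to a second preset value and a preset length range.

```python
def select_fragments(scored, min_score, min_len, max_len):
    """Keep sentences scoring at least min_score whose length is in range."""
    targets = [(s, sc) for s, sc in scored if sc >= min_score]       # step 217
    return [s for s, _ in targets if min_len <= len(s) <= max_len]   # step 218


# Hypothetical (sentence, final score) pairs.
scored = [
    ("XML retrieval uses an inverted index.", 8.75),
    ("Short.", 9.0),
    ("An unrelated sentence about something else.", 1.0),
]
print(select_fragments(scored, min_score=5.0, min_len=10, max_len=60))
```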

Embodiment 3

Referring to Fig. 9, which is a diagram of a document retrieval device provided by this embodiment, the device includes:

A retrieval unit 301, configured to use the target query keywords obtained through preprocessing to search the target document set in a pre-built inverted index, obtaining a first target document set;

An acquiring unit 302, configured to perform relevance scoring on the first target document set to obtain relevance scoring results for the first target documents, and to reorder the first target document set according to those results to obtain a second target document set;

The acquiring unit 302 is further configured to expand the current target query keywords through a pseudo-relevance feedback model, obtaining new target query keywords;

The acquiring unit 302 is further configured to, when the new target query keywords satisfy a preset condition, use the new target query keywords to search the first target document set again, obtaining a third target document set;

A computing unit 303, configured to perform sentence segmentation on each target document in the third target document set, and to calculate the tag weight sum of each sentence obtained by the segmentation;

The computing unit 303 is further configured to perform relevance scoring on the text content of each sentence according to the target query keywords, obtaining a relevance scoring result for each sentence, and to obtain the final score of each sentence from that result;

The acquiring unit 302 is further configured to obtain target sentences according to the final score of each sentence, and to take, from the target sentences, the sentences whose length falls within a preset length range as the retrieval result snippets.

Further, the acquiring unit 302 is also configured to obtain, for each word in the target document set, its term frequency (TF) value and inverse document frequency (IDF) value in the target document set;

Referring to Fig. 10, the device further includes:

An establishing unit 304, configured to build the inverted index according to the TF value and IDF value of each word in the target document set.

A processing unit 305, configured to remove stop words from the query keywords and stem them, obtaining the target query keywords.

The establishing unit 304 is further configured to establish a retrieval model.

Further, the retrieval unit 301 is specifically configured to use the target query keywords obtained through preprocessing to search the target document set in the pre-built inverted index according to the retrieval model, obtaining the first target document set.
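To make the roles of the establishing unit and the retrieval unit concrete, the following Python sketch builds an inverted index that stores TF values and computes IDF values, then ranks documents for a query. It is a sketch under stated assumptions: the text does not specify the retrieval model, so a plain sum of TF × IDF is used here, and IDF is assumed to be log(N / df). The names `build_inverted_index` and `retrieve` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """docs: dict doc_id -> list of tokens. Returns (index, idf).

    index maps term -> {doc_id: term frequency in that document}.
    idf maps term -> log(N / document frequency), an assumed IDF variant.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    n_docs = len(docs)
    idf = {term: math.log(n_docs / len(postings)) for term, postings in index.items()}
    return dict(index), idf


def retrieve(query_terms, index, idf):
    """Rank documents by summed TF * IDF over the query terms (assumed model)."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf * idf[term]
    return sorted(scores, key=lambda d: scores[d], reverse=True)


docs = {"d1": ["xml", "index", "xml"], "d2": ["xml", "query"], "d3": ["tree"]}
index, idf = build_inverted_index(docs)
first_target_set = retrieve(["xml"], index, idf)
```

Only documents that contain at least one query term appear in the result, which corresponds to the first target document set in the text.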

Further, the computing unit 303 is also configured to train on the first target document set to obtain the weight of each tag in the first target document set.

Further, referring to Fig. 11, the computing unit 303 specifically includes:

An obtaining subunit 3031, configured to obtain all tag names in the first target document set;

A classification subunit 3032, configured to divide, according to the tag names, the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query;

The obtaining subunit 3031 is further configured to obtain the number of times a that each query keyword t_i occurs in each relevant element b_k, and the total number A of all words in the relevant element set;

The obtaining subunit 3031 is further configured to obtain the number of times b that each query keyword t_i occurs in each irrelevant element b_k, and the total number B of all words in the irrelevant element set;

A computation subunit 3033, configured to calculate the probability p_ik that each query keyword t_i occurs in each relevant element b_k, according to the number of times t_i occurs in b_k and the total number of all words in the relevant element set;

Where

p_{ik} = \frac{a}{A}

The computation subunit 3033 is further configured to calculate the probability q_ik that each query keyword t_i occurs in each irrelevant element b_k, according to the number of times t_i occurs in b_k and the total number of all words in the irrelevant element set;

Where

q_{ik} = \frac{b}{B}

The computation subunit 3033 is further configured to calculate the weight of each tag m_j in the first target document set;

The weight of tag m_j is calculated as:

f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\ t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)

Here t_ik is a 0/1 value indicating whether element b_k contains query keyword t_i, and Q is the set of query keywords.
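The tag weight formula above is a sum of log odds ratios. The Python sketch below illustrates it under one loudly stated assumption: the raw estimates p_ik = a/A and q_ik = b/B make the logarithm undefined when a count is zero, so this sketch adds a smoothing constant `eps` that is not part of the source formula. The function names and the `(t_ik, a, b)` tuple layout are also hypothetical.

```python
import math


def term_weight(a, A, b, B, eps=0.5):
    """log(p(1-q) / (q(1-p))) with p ~ a/A and q ~ b/B.

    eps is an ASSUMED smoothing constant (not in the source) that keeps the
    logarithm finite when a or b is zero.
    """
    p = (a + eps) / (A + 2 * eps)
    q = (b + eps) / (B + 2 * eps)
    return math.log(p * (1 - q) / (q * (1 - p)))


def tag_weight(pairs, A, B):
    """f_tag(m_j): sum of t_ik * term_weight over (keyword, element) pairs.

    pairs: list of (t_ik, a, b) for the elements carrying tag m_j, where
    t_ik is 0 or 1, a is the count in the relevant element and b the count
    in the irrelevant element.
    """
    return sum(t_ik * term_weight(a, A, b, B) for t_ik, a, b in pairs)


w_title = tag_weight([(1, 10, 1), (0, 3, 3)], A=100, B=100)
```

A keyword that is equally likely in relevant and irrelevant elements contributes zero, while one concentrated in relevant elements contributes a positive weight.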

Further, referring to Fig. 12, the acquiring unit 302 specifically includes:

An extraction subunit 3021, configured to extract the SLCA (smallest lowest common ancestor) subtree of each target document in the first target document set as the structural information of that target document;

A computation subunit 3022, configured to perform relevance scoring on the SLCA subtree of each target document, obtaining the relevance score of each target document.

Further, the computation subunit 3022 is specifically configured to obtain the number of times tf_{n,q} that each query keyword q in the target query keywords occurs in each node n;

The computation subunit 3022 is specifically configured to calculate the TF value TF_q of each query keyword q in the first target document set;

The computation subunit 3022 is specifically configured to obtain, from tf_{n,q} and TF_q of each query keyword q, the relevance score tw(n, q) of each query keyword q with respect to the current node;

Where

tw(n, q) = \frac{tf_{n,q}}{TF_q}

The computation subunit 3022 is specifically configured to, when the current node n is a leaf node, calculate the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.

Further, the computation subunit 3022 is also specifically configured to, when the current node n is a non-leaf node, calculate the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords;

The computation subunit 3022 is also specifically configured to calculate, from the relevance score tw(n, q) of each query keyword q with respect to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n, the relevance score tw₁(n, q) of each query keyword q with respect to the current node n;

Where tw₁(n, q) = tw(n, q) + ∑_{c∈children(n)} d_n·tw(c, q)

The computation subunit 3022 is also specifically configured to calculate, from tw₁(n, q), the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.
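The node scoring rules above (tw for leaf nodes, tw₁ for non-leaf nodes) can be sketched as a recursive traversal of the SLCA subtree. The sketch makes several assumptions that the text leaves open: the factor d_n is treated as a single constant passed by the caller, a child's contribution is computed by applying the same rule recursively, and the `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # tf_{n,q}: keyword -> occurrences in this node's own text.
    term_counts: dict
    children: list = field(default_factory=list)


def node_score(node, q, TF_q, d_n):
    """tw(n, q) for a leaf; tw1(n, q) = tw(n, q) + sum_c d_n * tw(c, q) otherwise."""
    own = node.term_counts.get(q, 0) / TF_q if TF_q else 0.0
    if not node.children:
        return own
    # Assumption: each child's score is itself obtained by this same rule.
    return own + sum(d_n * node_score(c, q, TF_q, d_n) for c in node.children)


def document_score(root, collection_tf, d_n):
    """Sum the root's score over all query keywords q (collection_tf: q -> TF_q)."""
    return sum(node_score(root, q, tf, d_n) for q, tf in collection_tf.items())


leaf1 = Node({"xml": 2})
leaf2 = Node({"xml": 1, "index": 1})
root = Node({"index": 1}, [leaf1, leaf2])
doc_score = document_score(root, {"xml": 4, "index": 2}, d_n=0.5)
```

With d_n below 1, keyword occurrences deeper in the subtree contribute less than occurrences in the node itself, which is one plausible reading of the role of d_n.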

Further, referring to Fig. 13, the device also includes:

A judging unit 306, configured to judge whether the new target query keywords satisfy the preset condition.

The acquiring unit 302 is further configured to, when the new target query keywords do not satisfy the preset condition, use the new target query keywords to search the first target document set again, obtaining a new second target document set;

The acquiring unit 302 is further configured to expand the current target query keywords through the pseudo-relevance feedback model, obtaining updated target query keywords;

The retrieval unit 301 is further configured to repeat the above until the updated target query keywords satisfy the preset condition, and then to use the updated target query keywords to search the first target document set again.
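The expand, judge and re-retrieve cycle described above can be sketched as a loop. Everything specific here is an assumption: the expansion is modeled as standard pseudo-relevance feedback that adds the most frequent unseen terms from the top-ranked documents, the preset condition is supplied by the caller as `meets_condition` because the text does not define it, and `max_rounds` is a safety bound added for the sketch.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=2, n_new_terms=2):
    """Add the most frequent terms of the top_k documents that are not yet in the query."""
    counts = Counter()
    for tokens in ranked_docs[:top_k]:
        counts.update(t for t in tokens if t not in query_terms)
    # Sort by descending count, then alphabetically, so the result is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [t for t, _ in ordered[:n_new_terms]]


def iterate_expansion(query_terms, search, meets_condition, max_rounds=5):
    """Repeat retrieval and expansion until the expanded query meets the condition.

    search(terms) returns documents (token lists) ranked for those terms.
    meets_condition(new_terms, old_terms) is the caller-defined preset condition.
    """
    current = list(query_terms)
    for _ in range(max_rounds):
        expanded = expand_query(current, search(current))
        if meets_condition(expanded, current):
            return expanded
        current = expanded
    return current


ranked = [["xml", "tree", "node"], ["xml", "tree", "index"], ["other"]]
new_query = iterate_expansion(
    ["xml"],
    search=lambda terms: ranked,
    meets_condition=lambda new, old: len(new) >= 3,
)
```

The query returned by the loop is the one that would then be used to retrieve the third target document set.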

Further, referring to Fig. 14, the computing unit 303 specifically includes:

A first computation subunit 3034, configured to train on the third target document set, obtaining the weight of each tag in the third target document set;

A sentence segmentation subunit 3035, configured to remove the tags and perform sentence segmentation on each target document in the third target document set;

The first computation subunit 3034 is further configured to calculate the weights of the tags corresponding to all the words contained in each sentence, so as to obtain the tag weight sum tagW(s) of each sentence.
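A minimal sketch of tagW(s), assuming that each word keeps a record of the tag it was enclosed in before the tags were removed. The `(word, tag)` pair representation and the choice of weight 0 for tags without a trained weight are assumptions of this sketch, not details given in the text.

```python
def tag_weight_sum(sentence, tag_weights):
    """tagW(s): sum of the weights of the tags that the sentence's words came from.

    sentence: list of (word, tag_name) pairs; the pairing is an assumed
    representation of which tag enclosed each word before tag removal.
    tag_weights: dict tag_name -> trained weight; unknown tags count as 0.
    """
    return sum(tag_weights.get(tag, 0.0) for _, tag in sentence)


tag_w = tag_weight_sum(
    [("xml", "title"), ("retrieval", "title"), ("method", "p")],
    {"title": 2.0, "p": 0.5},
)
```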

Further, the computing unit 303 is specifically configured to calculate the relevance score Score_query(s) of the target query keywords with respect to each sentence;

Where

Score_{query}(s) = queryC(s) \cdot \sum_{i=1}^{n} Occ(q_i, s) \cdot Weight(q_i)

queryC(s) is the number of distinct query keywords occurring in the sentence; Occ(q_i, s) is the number of times query keyword q_i occurs in the sentence; Weight(q_i) is the weight of query keyword q_i;

The computing unit 303 is specifically configured to calculate the important-word score Score_sw(s) of each sentence; an important word is a word whose number of occurrences in the target document exceeds a threshold;

Where

The computing unit 303 is specifically configured to calculate the title relevance score Score_title(s) of each sentence;

Where

The computing unit 303 is specifically configured to compute, based on Score_query(s), Score_sw(s) and Score_title(s), the relevance score Score_rel(s) of the text content of each sentence;

Where

Score_rel(s) = α·Score_query(s) + β·Score_sw(s) + γ·Score_title(s)

α, β and γ are preset tuning parameters.

The computing unit 303 is also specifically configured to obtain the final score Score(s) of each sentence according to the formula Score(s) = (1 + σ·tagW(s))·Score_rel(s);

Here σ is a preset tuning parameter.

The document retrieval device provided by this embodiment of the present invention allows a user to retrieve with keyword queries without browsing the entire content of a document and without knowing the document structure. It is suitable for retrieval over massive document collections, and offers high retrieval performance and accuracy.

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, or of course by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, hard disk or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention.

The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of the claims.

Claims

1. A document retrieval method, characterized by comprising:

using target query keywords obtained through preprocessing to search a target document set in a pre-built inverted index, obtaining a first target document set;

performing relevance scoring on the first target document set to obtain relevance scoring results for the first target documents, and reordering the first target document set according to the relevance scoring results to obtain a second target document set;

expanding the current target query keywords through a pseudo-relevance feedback model to obtain new target query keywords;

when the new target query keywords satisfy a preset condition, using the new target query keywords to search the first target document set again, obtaining a third target document set;

performing sentence segmentation on each target document in the third target document set, and calculating the tag weight sum of each sentence obtained by the sentence segmentation;

performing relevance scoring on the text content of each sentence according to the target query keywords to obtain a relevance scoring result for each sentence, and obtaining the final score of each sentence according to the relevance scoring result of each sentence;

obtaining target sentences according to the final score of each sentence, and taking, from the target sentences, the sentences whose length falls within a preset length range as retrieval result snippets;

wherein, before the performing relevance scoring on the first target document set, the method further comprises:

training on the first target document set to obtain the weight of each tag in the first target document set;

wherein the training on the first target document set to obtain the weight of each tag in the first target document set comprises:

obtaining all tag names in the first target document set;

dividing, according to the tag names, the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query;

determining the weight of each tag in the first target document set according to the probability that each query keyword occurs in each relevant element and the probability that each query keyword occurs in each irrelevant element.

2. The method according to claim 1, characterized in that, before the using target query keywords obtained through preprocessing to search a target document set in a pre-built inverted index to obtain a first target document set, the method further comprises:

obtaining the term frequency (TF) value and inverse document frequency (IDF) value, in the target document set, of each word in the target document set;

building the inverted index according to the TF value and IDF value of each word in the target document set;

removing stop words from the query keywords and stemming them to obtain the target query keywords;

establishing a retrieval model.

3. The method according to claim 2, characterized in that the using target query keywords obtained through preprocessing to search a target document set in a pre-built inverted index to obtain a first target document set comprises:

using the target query keywords obtained through preprocessing to search the target document set in the pre-built inverted index according to the retrieval model, obtaining the first target document set.

4. The method according to claim 1, characterized in that

the determining the weight of each tag in the first target document set according to the probability that each query keyword occurs in each relevant element and the probability that each query keyword occurs in each irrelevant element comprises:

obtaining the number of times a that each query keyword t_i occurs in each relevant element b_k, and the total number A of all words in the relevant element set;

obtaining the number of times b that each query keyword t_i occurs in each irrelevant element b_k, and the total number B of all words in the irrelevant element set;

calculating, according to the number of times each query keyword t_i occurs in each relevant element b_k and the total number of all words in the relevant element set, the probability p_ik that each query keyword t_i occurs in each relevant element b_k;

wherein

p_{ik} = \frac{a}{A}

calculating, according to the number of times each query keyword t_i occurs in each irrelevant element b_k and the total number of all words in the irrelevant element set, the probability q_ik that each query keyword t_i occurs in each irrelevant element b_k;

wherein

q_{ik} = \frac{b}{B}

calculating the weight of each tag m_j in the first target document set;

wherein the weight of tag m_j is calculated as:

f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\ t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)

where t_ik takes the value 0 or 1 and indicates whether the element b_k contains the query keyword t_i, and Q is the set of query keywords.

5. The method according to claim 1, characterized in that the performing relevance scoring on the first target document set to obtain relevance scoring results for the first target documents comprises:

extracting the SLCA subtree of each target document in the first target document set as the structural information of that target document;

performing relevance scoring on the SLCA subtree of each target document, obtaining the relevance score of each target document.

6. The method according to claim 5, characterized in that the performing relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document comprises:

obtaining the number of times tf_{n,q} that each query keyword q in the target query keywords occurs in each node n;

calculating the TF value TF_q of each query keyword q in the first target document set;

obtaining, from tf_{n,q} and TF_q of each query keyword q, the relevance score tw(n, q) of each query keyword q with respect to the current node;

wherein

tw(n, q) = \frac{tf_{n,q}}{TF_q}

when the current node n is a leaf node, calculating the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.

7. The method according to claim 6, characterized in that, when the current node n is a non-leaf node, the method further comprises:

calculating the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords;

calculating, from the relevance score tw(n, q) of each query keyword q with respect to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords, the relevance score tw₁(n, q) of each query keyword q with respect to the current node n;

wherein tw₁(n, q) = tw(n, q) + ∑_{c∈children(n)} d_n·tw(c, q)

calculating, from the relevance score tw₁(n, q) of each query keyword q with respect to the current node n, the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.

8. The method according to claim 1, characterized in that, before the using the new target query keywords to search the first target document set again when the new target query keywords satisfy the preset condition, the method further comprises:

judging whether the new target query keywords satisfy the preset condition;

when the new target query keywords do not satisfy the preset condition, using the new target query keywords to search the first target document set again, obtaining a new second target document set;

expanding the current target query keywords through the pseudo-relevance feedback model to obtain updated target query keywords;

repeating the above until the updated target query keywords satisfy the preset condition, and then using the updated target query keywords to search the first target document set again.

9. The method according to claim 1, characterized in that the performing sentence segmentation on each target document in the third target document set, and calculating the tag weight sum of each sentence obtained by the sentence segmentation, comprises:

training on the third target document set to obtain the weight of each tag in the third target document set;

removing the tags and performing sentence segmentation on each target document in the third target document set;

calculating the weights of the tags corresponding to all the words contained in each sentence, so as to obtain the tag weight sum tagW(s) of each sentence.

10. The method according to claim 9, characterized in that the performing relevance scoring on the text content of each sentence according to the target query keywords comprises:

1) calculating the relevance score Score_query(s) of the target query keywords with respect to each sentence;

wherein

Score_{query}(s) = queryC(s) \cdot \sum_{i=1}^{n} Occ(q_i, s) \cdot Weight(q_i)

queryC(s) is the number of distinct query keywords occurring in the sentence; Occ(q_i, s) is the number of times query keyword q_i occurs in the sentence; Weight(q_i) is the weight of query keyword q_i;

2) calculating the important-word score Score_sw(s) of each sentence, an important word being a word whose number of occurrences in the target document exceeds a threshold;

wherein

3) calculating the title relevance score Score_title(s) of each sentence;

wherein

4) computing, based on Score_query(s), Score_sw(s) and Score_title(s), the relevance score Score_rel(s) of the text content of each sentence;

wherein Score_rel(s) = α·Score_query(s) + β·Score_sw(s) + γ·Score_title(s)

α, β and γ are preset tuning parameters;

the obtaining the final score of each sentence according to the relevance scoring result of each sentence comprises:

obtaining the final score Score(s) of each sentence according to the formula Score(s) = (1 + σ·tagW(s))·Score_rel(s);

wherein σ is a preset tuning parameter.

11. A document retrieval device, characterized by comprising:

a retrieval unit, configured to use target query keywords obtained through preprocessing to search a target document set in a pre-built inverted index, obtaining a first target document set;

an acquiring unit, configured to perform relevance scoring on the first target document set to obtain relevance scoring results for the first target documents, and to reorder the first target document set according to the relevance scoring results to obtain a second target document set;

the acquiring unit being further configured to expand the current target query keywords through a pseudo-relevance feedback model to obtain new target query keywords;

the acquiring unit being further configured to, when the new target query keywords satisfy a preset condition, use the new target query keywords to search the first target document set again, obtaining a third target document set;

a computing unit, configured to perform sentence segmentation on each target document in the third target document set, and to calculate the tag weight sum of each sentence obtained by the sentence segmentation;

the computing unit being further configured to perform relevance scoring on the text content of each sentence according to the target query keywords to obtain a relevance scoring result for each sentence, and to obtain the final score of each sentence according to the relevance scoring result of each sentence;

the acquiring unit being further configured to obtain target sentences according to the final score of each sentence, and to take, from the target sentences, the sentences whose length falls within a preset length range as retrieval result snippets;

the computing unit being further configured to train on the first target document set to obtain the weight of each tag in the first target document set;

the computing unit specifically comprising:

an obtaining subunit, configured to obtain all tag names in the first target document set;

a classification subunit, configured to divide, according to the tag names, the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query;

the obtaining subunit being further configured to determine the weight of each tag in the first target document set according to the probability that each query keyword occurs in each relevant element and the probability that each query keyword occurs in each irrelevant element.

12. The device according to claim 11, characterized in that

the acquiring unit is further configured to obtain the term frequency (TF) value and inverse document frequency (IDF) value, in the target document set, of each word in the target document set;

the device further comprises:

an establishing unit, configured to build the inverted index according to the TF value and IDF value of each word in the target document set;

a processing unit, configured to remove stop words from the query keywords and stem them, obtaining the target query keywords;

the establishing unit being further configured to establish a retrieval model.

13. The device according to claim 12, characterized in that

the retrieval unit is specifically configured to use the target query keywords obtained through preprocessing to search the target document set in the pre-built inverted index according to the retrieval model, obtaining the first target document set.

14. The device according to claim 11, characterized in that

the obtaining subunit is further configured to obtain the number of times a that each query keyword t_i occurs in each relevant element b_k, and the total number A of all words in the relevant element set;

the obtaining subunit is further configured to obtain the number of times b that each query keyword t_i occurs in each irrelevant element b_k, and the total number B of all words in the irrelevant element set;

a computation subunit is configured to calculate, according to the number of times each query keyword t_i occurs in each relevant element b_k and the total number of all words in the relevant element set, the probability p_ik that each query keyword t_i occurs in each relevant element b_k;

wherein

p_{ik} = \frac{a}{A}

the computation subunit is further configured to calculate, according to the number of times each query keyword t_i occurs in each irrelevant element b_k and the total number of all words in the irrelevant element set, the probability q_ik that each query keyword t_i occurs in each irrelevant element b_k;

wherein

q_{ik} = \frac{b}{B}

the computation subunit is further configured to calculate the weight of each tag m_j in the first target document set;

wherein the weight of tag m_j is calculated as:

f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\ t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)

15. The device according to claim 11, characterized in that the acquiring unit specifically comprises:

an extraction subunit, configured to extract the SLCA subtree of each target document in the first target document set as the structural information of that target document;

a computation subunit, configured to perform relevance scoring on the SLCA subtree of each target document, obtaining the relevance score of each target document.

16. The device according to claim 15, characterized in that

the computation subunit is specifically configured to obtain the number of times tf_{n,q} that each query keyword q in the target query keywords occurs in each node n;

the computation subunit is specifically configured to calculate the TF value TF_q of each query keyword q in the first target document set;

the computation subunit is specifically configured to obtain, from tf_{n,q} and TF_q of each query keyword q, the relevance score tw(n, q) of each query keyword q with respect to the current node;

wherein

tw(n, q) = \frac{tf_{n,q}}{TF_q}

the computation subunit is specifically configured to, when the current node n is a leaf node, calculate the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.

17. The device according to claim 16, characterized in that

the computation subunit is also specifically configured to, when the current node n is a non-leaf node, calculate the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords;

the computation subunit is also specifically configured to calculate, from the relevance score tw(n, q) of each query keyword q with respect to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords, the relevance score tw₁(n, q) of each query keyword q with respect to the current node n;

wherein tw₁(n, q) = tw(n, q) + ∑_{c∈children(n)} d_n·tw(c, q)

the computation subunit is also specifically configured to calculate, from the relevance score tw₁(n, q) of each query keyword q with respect to the current node n, the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.

18. The device according to claim 11, characterized in that the device further comprises:

a judging unit, configured to judge whether the new target query keywords satisfy the preset condition;

the acquiring unit being further configured to, when the new target query keywords do not satisfy the preset condition, use the new target query keywords to search the first target document set again, obtaining a new second target document set;

the acquiring unit being further configured to expand the current target query keywords through the pseudo-relevance feedback model to obtain updated target query keywords;

the retrieval unit being further configured to repeat the above until the updated target query keywords satisfy the preset condition, and then to use the updated target query keywords to search the first target document set again.

19. The device according to claim 11, characterized in that the computing unit specifically comprises:

a first computation subunit, configured to train on the third target document set, obtaining the weight of each tag in the third target document set;

a sentence segmentation subunit, configured to remove the tags and perform sentence segmentation on each target document in the third target document set;

the first computation subunit being further configured to calculate the weights of the tags corresponding to all the words contained in each sentence, so as to obtain the tag weight sum tagW(s) of each sentence.

20. The device according to claim 19, characterized in that

the computing unit is specifically configured to calculate the relevance score Score_query(s) of the target query keywords with respect to each sentence;

wherein

Score_{query}(s) = queryC(s) \cdot \sum_{i=1}^{n} Occ(q_i, s) \cdot Weight(q_i)

the computing unit is specifically configured to calculate the important-word score Score_sw(s) of each sentence, an important word being a word whose number of occurrences in the target document exceeds a threshold;

wherein

the computing unit is specifically configured to calculate the title relevance score Score_title(s) of each sentence;

wherein

the computing unit is specifically configured to compute, based on Score_query(s), Score_sw(s) and Score_title(s), the relevance score Score_rel(s) of the text content of each sentence;

wherein Score_rel(s) = α·Score_query(s) + β·Score_sw(s) + γ·Score_title(s)

α, β and γ are preset tuning parameters;

the computing unit is also specifically configured to obtain the final score Score(s) of each sentence according to the formula Score(s) = (1 + σ·tagW(s))·Score_rel(s);

wherein σ is a preset tuning parameter.