Method and device for document retrieval
Technical field
The present invention relates to the field of information retrieval, and in particular to a method and device for document retrieval.
Background art
HTML (Hypertext Markup Language), the main carrier of information on the traditional Internet, gives users a convenient way to present information and is mainly concerned with how information is displayed in a web browser. As Web applications become increasingly widespread, the limitations of the HTML data model are becoming more apparent: HTML cannot describe data, its tag set is fixed and limited, and users cannot add meaningful tags according to their own needs. XML (Extensible Markup Language) arose in response.
XML is self-describing, platform-independent, extensible and easy to use, and can represent data in a readable form without being restricted by the form of presentation. XML allows data to be exchanged between incompatible systems, which reduces the complexity of data sharing and transmission. An XML document carries both content information and structural information, which makes it possible to exchange and integrate large amounts of data over the Internet. As more and more Web applications, such as network services, e-commerce and digital libraries, use XML as the carrier for storing, exchanging and publishing massive data, how to efficiently retrieve useful information from massive XML document sets has attracted the attention of a growing number of researchers.
At present, XML documents can be retrieved in the following two retrieval modes:
The first is retrieval based on XML document structure.
In this retrieval mode, the user needs to understand the structure of the queried XML document before a query expression can be constructed.
The second is retrieval based on keywords.
In this retrieval mode, the query expression is written in advance by an author, so the user neither needs to learn a complex query language nor needs a deep understanding of the underlying data structure of the XML document; the user only needs to enter keywords related to the content of interest to complete the query. Existing methods include MLCA, SLCA, XRank, XSEarch, XSeek and the like.
However, in the first mode, on the one hand, most XML documents on the Internet do not provide users with complete structural information; on the other hand, a large number of heterogeneous XML documents also exist on the Internet. In both cases it is difficult for the user to construct a query expression in an existing language to query the XML structure. In the second mode, most XML keyword query methods are built on a tree storage model, which requires the author to know the structure of the XML document in advance when writing the query expression.
In summary, existing XML document retrieval models require the user either to browse the full content of the XML document or to know the structure of the queried XML document in advance, and they occupy a large amount of storage space during retrieval. Now that XML documents hold massive volumes of data, the retrieval performance and accuracy of existing XML document retrieval models are relatively low.
Summary of the invention
Embodiments of the present invention provide a method and device for document retrieval that improve the retrieval performance and accuracy of XML document retrieval.
To achieve the above object, the embodiments of the present invention adopt the following technical solutions:
A method for document retrieval, including:
using a target query keyword obtained through preprocessing to retrieve a target document set in a pre-established inverted index, to obtain a first target document set;
performing relevance scoring on the first target document set to obtain relevance scoring results for the first target documents, and reordering the first target document set according to the relevance scoring results to obtain a second target document set;
expanding the current target query keyword by means of a pseudo-relevance feedback model to obtain a new target query keyword;
when the new target query keyword satisfies a preset condition, using the new target query keyword to retrieve the first target document set again, to obtain a third target document set;
performing sentence segmentation on each target document in the third target document set, and calculating the tag weight sum of each sentence obtained by the sentence segmentation;
performing relevance scoring on the text content of each sentence according to the target query keyword to obtain a relevance scoring result for each sentence, and obtaining a final score for each sentence according to its relevance scoring result;
obtaining target sentences according to the final score of each sentence, and taking the sentences among the target sentences whose length is within a preset length range as retrieval result fragments.
The present invention also provides a device for document retrieval, including:
a retrieval unit, configured to use a target query keyword obtained through preprocessing to retrieve a target document set in a pre-established inverted index, to obtain a first target document set;
an acquiring unit, configured to perform relevance scoring on the first target document set to obtain relevance scoring results for the first target documents, and to reorder the first target document set according to the relevance scoring results to obtain a second target document set;
the acquiring unit is further configured to expand the current target query keyword by means of a pseudo-relevance feedback model to obtain a new target query keyword;
the acquiring unit is further configured to, when the new target query keyword satisfies the preset condition, use the new target query keyword to retrieve the first target document set again, to obtain a third target document set;
the acquiring unit is further configured to, when the new target query keyword does not satisfy the preset condition, use the new target query keyword to retrieve the first target document set again, to obtain a new second target document set;
a computing unit, configured to perform sentence segmentation on each target document in the third target document set, and to calculate the tag weight sum of each sentence obtained by the sentence segmentation;
the computing unit is further configured to perform relevance scoring on the text content of each sentence according to the target query keyword to obtain a relevance scoring result for each sentence, and to obtain a final score for each sentence according to its relevance scoring result;
the acquiring unit is further configured to obtain target sentences according to the final score of each sentence, and to take the sentences among the target sentences whose length is within a preset length range as retrieval result fragments.
The method and device for document retrieval provided by the embodiments of the present invention enable a user to retrieve with query keywords without browsing the full content of a document and without knowing the document structure. They are suitable for retrieval over massive document sets and achieve high retrieval performance and accuracy.
Brief description of the drawings
In order to describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required by the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of a method for document retrieval provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the first stage of a method for document retrieval provided by Embodiment 2 of the present invention;
Fig. 3 is a flowchart of the second stage of a method for document retrieval provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic flowchart of a method, provided by Embodiment 2 of the present invention, for training on the first target document set to obtain tag weights;
Fig. 5 is a schematic flowchart of a method, provided by Embodiment 2 of the present invention, for calculating the relevance score of the SLCA subtree of each document;
Fig. 6 is a flowchart of the third stage of a method for document retrieval provided by Embodiment 2 of the present invention;
Fig. 7 is a schematic flowchart of a method, provided by Embodiment 2 of the present invention, for performing sentence segmentation on each target document in the third target document set and calculating the tag weight sum of each sentence obtained;
Fig. 8 is a schematic flowchart of a method, provided by Embodiment 2 of the present invention, for scoring the text content of each sentence according to the target query keyword;
Fig. 9 is a structural diagram of a device for document retrieval provided by Embodiment 3 of the present invention;
Fig. 10 is a second structural diagram of a device for document retrieval provided by Embodiment 3 of the present invention;
Fig. 11 is a structural diagram of the computing unit in a device for document retrieval provided by Embodiment 3 of the present invention;
Fig. 12 is a structural diagram of the acquiring unit in a device for document retrieval provided by Embodiment 3 of the present invention;
Fig. 13 is a third structural diagram of a device for document retrieval provided by Embodiment 3 of the present invention;
Fig. 14 is a second structural diagram of the computing unit in a device for document retrieval provided by Embodiment 3 of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
Referring to Fig. 1, a flowchart of a method for document retrieval provided by this embodiment, the method includes:
A. Use the target query keyword obtained through preprocessing to retrieve the target document set in the pre-established inverted index, to obtain a first target document set.
B. Perform relevance scoring on the first target document set to obtain relevance scoring results for the first target documents, and reorder the first target document set according to the relevance scoring results to obtain a second target document set.
C. Expand the current target query keyword by means of a pseudo-relevance feedback model to obtain a new target query keyword.
D. When the new target query keyword satisfies a preset condition, use the new target query keyword to retrieve the first target document set again, to obtain a third target document set.
E. Perform sentence segmentation on each target document in the third target document set, and calculate the tag weight sum of each sentence obtained by the sentence segmentation.
F. Perform relevance scoring on the text content of each sentence according to the target query keyword to obtain a relevance scoring result for each sentence, and obtain a final score for each sentence according to its relevance scoring result.
G. Obtain target sentences according to the final score of each sentence, and take the sentences among the target sentences whose length is within a preset length range as retrieval result fragments.
The method for document retrieval provided by this embodiment enables a user to retrieve with query keywords without browsing the full content of a document and without knowing the document structure. It is suitable for retrieval over massive document sets and achieves high retrieval performance and accuracy.
Embodiment 2
In this embodiment, the method for document retrieval is divided into three stages. The first stage is the fuzzy retrieval stage, which narrows down the semi-structured document set; the second stage is the precise retrieval stage, which obtains an accurate set of documents relevant to the query; the third stage is the fragment generation stage.
Referring to Fig. 2, a flowchart of the first stage of the method for document retrieval provided by this embodiment, the first stage includes:
First stage: fuzzy retrieval stage.
201: Preprocess the target document set.
In this embodiment, the target document set is the set of semi-structured XML documents to be queried.
Preprocessing the target document set can specifically be realized through the following sub-steps:
1) Remove the stop words from the target document set.
The stop words can be configured in advance by the user. In English they can be words with no concrete meaning, such as "in", "the" and "oh", together with punctuation marks; in Chinese they can likewise be function words with no concrete meaning, together with punctuation marks.
For example, the following two articles are part of the content of the document set and are used to illustrate removing stop words from the target document set:
The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.
The content of article 2 is: He once lived in Shanghai.
The content of each of articles 1 and 2 is a character string. First, all words in article 1 and article 2 are found by splitting on spaces, each word being a keyword; the stop words are then removed from article 1 and article 2. After stop-word removal, articles 1 and 2 are as follows:
Article 1 after stop-word removal: [Tom] [lives] [Guangzhou] [I] [live] [Guangzhou]
Article 2 after stop-word removal: [He] [lived] [Shanghai].
It should be noted that when Chinese sentences appear in a document, the Chinese sentences first need to undergo dedicated word segmentation using the prior art, after which the stop words are removed from the document.
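The stop-word removal of sub-step 1) can be sketched in Python as follows. This is only an illustrative sketch: the function name and the stop-word list are assumptions (the text names "in", "the" and "oh"; "too" and "once" are added here so that the two example articles come out as shown), and Chinese word segmentation is not covered.

```python
import re

# Hypothetical stop-word list; only "in", "the" and "oh" are named in the text.
STOP_WORDS = {"in", "the", "oh", "too", "once"}

def remove_stop_words(text):
    """Split on whitespace, strip punctuation, and drop stop words (case-insensitive)."""
    tokens = [re.sub(r"[^\w]", "", tok) for tok in text.split()]
    return [tok for tok in tokens if tok and tok.lower() not in STOP_WORDS]

article_1 = "Tom lives in Guangzhou, I live in Guangzhou too."
article_2 = "He once lived in Shanghai."

print(remove_stop_words(article_1))
print(remove_stop_words(article_2))
```

Punctuation is stripped from each token before the stop-word test, so "Guangzhou," and "too." are handled like their bare forms.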
2) Extract the stems of the target document set.
First, when the content of the target document set is English text, the case of all words is unified; for example, when the user searches for "He", the words "HE" and "he" can also be found.
Second, when the content of the target document set is English text, all words are reduced to their base form; for example, when the user searches for "live", the words "lives" and "lived" can also be found, so "lives" and "lived" need to be reduced to "live".
For example, taking article 1 and article 2 after stop-word removal above, after stem extraction:
All keywords of article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou]
All keywords of article 2 are: [he] [live] [shanghai].
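The case unification and stem extraction of sub-step 2) might look like the sketch below. The text does not name a stemming algorithm, so a small lookup table (`STEM_TABLE`, a hypothetical name) stands in for a real stemmer here; a production system would use an established stemming algorithm instead.

```python
# Hypothetical lookup table standing in for a real stemming algorithm.
STEM_TABLE = {"lives": "live", "lived": "live"}

def extract_stem(token):
    """Lowercase the token, then map it to its base form if one is known."""
    lowered = token.lower()
    return STEM_TABLE.get(lowered, lowered)

def extract_stems(tokens):
    """Apply extract_stem to every token of a document or query."""
    return [extract_stem(token) for token in tokens]

article_1_tokens = ["Tom", "lives", "Guangzhou", "I", "live", "Guangzhou"]
article_2_tokens = ["He", "lived", "Shanghai"]

print(extract_stems(article_1_tokens))
print(extract_stems(article_2_tokens))
```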
3) Calculate the TF (term frequency) value and the IDF (inverse document frequency) value of each stemmed word in the target document set.
The TF value of each word in the target document set can be calculated with the following formula:
$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
In the above formula, n_{i,j} is the number of occurrences of word t_i in document d_j of the target document set, and the denominator is the sum of the occurrences of all words in d_j.
The IDF value of each word in the target document set can be calculated with the following formula:
$$IDF_{i} = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where |D| is the total number of documents in the target document set, and |{j : t_i ∈ d_j}| is the number of documents containing t_i (that is, the number of documents for which n_{i,j} ≠ 0).
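As an illustration of the TF and IDF definitions described above, the following sketch computes both values for the two example articles. The function names are hypothetical, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math
from collections import Counter

def term_frequency(term, doc_tokens):
    """Occurrences of term in the document divided by the document's total word count."""
    counts = Counter(doc_tokens)
    return counts[term] / len(doc_tokens)

def inverse_document_frequency(term, docs):
    """log(|D| / number of documents containing term); natural log is assumed."""
    containing = sum(1 for tokens in docs if term in tokens)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log(len(docs) / containing)

docs = [
    ["tom", "live", "guangzhou", "i", "live", "guangzhou"],  # article 1
    ["he", "live", "shanghai"],                              # article 2
]

print(term_frequency("live", docs[0]))            # 2 occurrences out of 6 words
print(inverse_document_frequency("live", docs))   # occurs in both documents
print(inverse_document_frequency("tom", docs))    # occurs in one of two documents
```

A word that appears in every document, such as "live" here, gets an IDF of zero and therefore contributes nothing to discriminating between documents.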
202: Build an inverted index over the preprocessed target document set.
For example, taking articles 1 and 2 above, after the inverted index is built, the correspondence between each keyword in articles 1 and 2 and the article number, [frequency of occurrence] and keyword positions is:
keyword      article number [frequency]    positions
guangzhou    1[2]                          3, 6
he           2[1]                          1
i            1[1]                          4
live         1[2], 2[1]                    2, 5; 2
shanghai     2[1]                          3
tom          1[1]                          1
After the inverted index is built, the number of times a keyword occurs in an article and its specific positions can be looked up.
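A minimal sketch of building such a positional inverted index is given below. The data layout (a dictionary from keyword to a per-article list of 1-based positions) and the function name are assumptions chosen for illustration; the frequency of a keyword in an article is simply the length of its position list.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps article number -> list of stems.

    Returns keyword -> {article number: [1-based positions]}.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)

docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}
index = build_inverted_index(docs)

for term in sorted(index):
    postings = ", ".join(f"{doc_id}[{len(pos)}] {pos}" for doc_id, pos in index[term].items())
    print(term, postings)
```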
203: Preprocess the query keywords to obtain the target query keyword.
Preprocessing the query keywords can specifically be realized through the following sub-steps:
1) Remove the stop words from the query keywords.
It should be noted that this step is implemented in the same way as the removal of stop words from the target document set in sub-step 1) of step 201 above, and is not described again here.
2) Extract the stems of the query keywords to obtain the target query keyword.
It should be noted that this step is implemented in the same way as the stem extraction for the target document set in sub-step 2) of step 201 above, and is not described again here.
204: According to a retrieval model, use the target query keyword obtained through preprocessing to retrieve the target document set in the inverted index, to obtain the first target document set.
It should be noted that the retrieval model is built using probabilistic methods and a language model, and that the retrieval process uses Dirichlet smoothing to narrow down the target document set. Building the retrieval model and Dirichlet smoothing both belong to the prior art and are not described again here.
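Since the text treats the retrieval model as prior art and gives no formula, the sketch below uses the standard query-likelihood language model with Dirichlet smoothing, p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu), as one plausible reading. The function names, the value of `mu` and the `top_k` cut-off are assumptions, and for brevity the sketch scans the documents directly rather than reading postings from the inverted index.

```python
import math
from collections import Counter

def dirichlet_score(query_terms, doc_tokens, collection_counts, collection_len, mu=2000.0):
    """Log query likelihood of a document under Dirichlet-smoothed unigram estimates."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query_terms:
        p_collection = collection_counts.get(term, 0) / collection_len
        if p_collection == 0.0:
            continue  # term unseen in the whole collection: skipped in this sketch
        p_term = (doc_counts[term] + mu * p_collection) / (doc_len + mu)
        score += math.log(p_term)
    return score

def retrieve(query_terms, docs, top_k=10, mu=2000.0):
    """Rank all documents for the query and keep the top_k as the narrowed-down set."""
    collection_counts = Counter(t for tokens in docs.values() for t in tokens)
    collection_len = sum(collection_counts.values())
    scored = [
        (doc_id, dirichlet_score(query_terms, tokens, collection_counts, collection_len, mu))
        for doc_id, tokens in docs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}
ranking = retrieve(["shanghai"], docs)
print(ranking)
```

Smoothing gives a document a non-zero probability for query terms it does not contain, so documents are ranked rather than filtered by exact match.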
Referring to Fig. 3, a flowchart of the second stage of the method for document retrieval provided by this embodiment, the second stage includes:
Second stage: precise retrieval stage.
205: Train on the first target document set to obtain the weight of each tag in the first target document set.
Referring to Fig. 4, training on the first target document set to obtain the tag weights can specifically be realized through the following sub-steps:
2051: Obtain all tag names in the first target document set.
2052: According to the tag names, divide the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query.
2053: Obtain the number of times a that each query keyword t_i appears in each relevant element b_k, and the total number A of all words in the relevant element set.
It should be noted that when the query keywords are English text, each word is taken as a query keyword; when the query keywords are a Chinese sentence, the Chinese sentence needs to undergo dedicated word segmentation using the prior art, and each word obtained is taken as a query keyword.
2054: Obtain the number of times b that each query keyword t_i appears in each irrelevant element b_k, and the total number B of all words in the irrelevant element set.
The note under step 2053 on English and Chinese query keywords applies equally to this step.
2055: According to the number of times each query keyword t_i appears in each relevant element b_k and the total number of all words in the relevant element set, calculate the probability p_ik that each query keyword t_i appears in each relevant element b_k.
where, with a and A as obtained in step 2053, $$p_{ik} = \frac{a}{A}$$
2056: According to the number of times each query keyword t_i appears in each irrelevant element b_k and the total number of all words in the irrelevant element set, calculate the probability q_ik that each query keyword t_i appears in each irrelevant element b_k.
where, with b and B as obtained in step 2054, $$q_{ik} = \frac{b}{B}$$
2057: Calculate the weight of each tag m_j in the first target document set.
The weight of tag m_j is calculated with the following formula:
where t_ik is a binary value, either 0 or 1, indicating whether element b_k contains query keyword t_i, and Q is the query keywords.
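The counting in steps 2053 to 2056 can be sketched as below. The ratio used (occurrences in one element divided by the total word count of the element set) is an assumption inferred from the step descriptions, and the function name is hypothetical. The tag weight formula of step 2057 is not reproduced in the text, so the sketch stops at the two probabilities.

```python
def occurrence_probability(keyword, element_tokens, element_set):
    """Occurrences of keyword in one element divided by the total words in the element set."""
    total_words = sum(len(tokens) for tokens in element_set)
    if total_words == 0:
        return 0.0
    return element_tokens.count(keyword) / total_words

# Hypothetical elements, already reduced to stems.
relevant_elements = [["xml", "retrieval", "xml"], ["keyword", "query"]]
irrelevant_elements = [["page", "footer", "xml", "page"]]

# p_ik: keyword "xml" in the first relevant element (a = 2, A = 5).
p_ik = occurrence_probability("xml", relevant_elements[0], relevant_elements)
# q_ik: keyword "xml" in the first irrelevant element (b = 1, B = 4).
q_ik = occurrence_probability("xml", irrelevant_elements[0], irrelevant_elements)

print(p_ik, q_ik)
```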
206: Preprocess the query keywords to obtain the target query keyword.
The target query keyword comprises several query keywords q.
It should be noted that the query keywords are preprocessed in this step in the same way as in step 203 above, which is not described again here.
207: Extract the SLCA (smallest lowest common ancestor) subtree of each target document in the first target document set as the structural information of that target document.
208: Perform relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document.
Referring to Fig. 5, the relevance score of the SLCA subtree of each document can be calculated bottom-up, specifically through the following sub-steps:
2081: Obtain the number of times tf_{n,q} that each query keyword q in the target query keyword occurs in each node n.
2082: Calculate the TF value TF_q of each query keyword q in the first target document set.
The TF value of each query keyword q in the first target document set is calculated in this step in the same way as the TF value of each word in the target document set is calculated in sub-step 3) of step 201 above, and is not described again here.
2083: According to tf_{n,q} and TF_q of each query keyword q, obtain the relevance score tw(n, q) of each query keyword q with respect to the current node.
Wherein,
2084: When the current node n is a leaf node, calculate the sum of the relevance scores of each query keyword q with respect to the current node n, as the relevance score of the document.
2085: When the current node n is a non-leaf node, calculate the relevance score tw(c, q) of every child node c of the current node n with respect to the target query keyword.
2086: According to the relevance score tw(n, q) of each query keyword q with respect to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keyword, obtain the relevance score tw_1(n, q) of each query keyword q with respect to the current node n,
where $$tw_1(n, q) = tw(n, q) + \sum_{c \in children(n)} d_n \cdot tw(c, q)$$
2087: According to the relevance score tw_1(n, q) of each query keyword q with respect to the current node n, calculate the sum of the relevance scores of each query keyword q with respect to the current node n, as the relevance score of the document.
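The bottom-up scoring of steps 2081 to 2087 can be sketched as a recursion over the SLCA subtree. Several points are assumptions made for illustration: the formula for tw(n, q) is not reproduced in the text, so `node_score` uses the product tf_{n,q} * TF_q as a placeholder; d_n is treated as a constant decay factor; and the children's scores are themselves propagated scores, which is one reading of "bottom-up". All names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tokens: list                                  # stems contained directly in this node
    children: list = field(default_factory=list)  # child Node objects

def node_score(node, q, tf_collection):
    """Placeholder for tw(n, q): tf_{n,q} * TF_q (an assumption, not the source formula)."""
    return node.tokens.count(q) * tf_collection.get(q, 0.0)

def propagated_score(node, q, tf_collection, decay=0.5):
    """tw_1(n, q) = tw(n, q) + sum over children c of decay * score(c, q)."""
    own = node_score(node, q, tf_collection)
    if not node.children:  # leaf node: step 2084
        return own
    return own + sum(
        decay * propagated_score(child, q, tf_collection, decay) for child in node.children
    )

def document_score(root, query, tf_collection, decay=0.5):
    """Sum the propagated scores of all query keywords at the subtree root (step 2087)."""
    return sum(propagated_score(root, q, tf_collection, decay) for q in query)

leaf_a = Node(tokens=["xml", "xml"])
leaf_b = Node(tokens=["query"])
root = Node(tokens=["xml"], children=[leaf_a, leaf_b])
tf_collection = {"xml": 0.5, "query": 0.25}

print(document_score(root, ["xml", "query"], tf_collection))
```

With a decay below 1, matches deeper in the subtree count for less than matches in the node itself, which is the effect the d_n factor in the propagation formula suggests.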
209: Obtain the second target document set according to the relevance score of each target document in the first target document set.
Specifically, the documents in the first target document set can be re-sorted in descending order of relevance score, or alternatively in ascending order of relevance score.
Optionally, after the documents have been re-sorted, documents whose score is lower than a first preset value can be excluded, and the target documents whose score is greater than or equal to the first preset value are kept, yielding the second target document set.
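Step 209 amounts to a sort followed by an optional threshold filter, as in this small sketch. The function and parameter names are hypothetical, and the scores are made-up illustrative values.

```python
def rerank(scores, first_preset_value=None, descending=True):
    """Sort document ids by relevance score and optionally drop those below the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=descending)
    if first_preset_value is not None:
        ranked = [(doc_id, s) for doc_id, s in ranked if s >= first_preset_value]
    return [doc_id for doc_id, _ in ranked]

scores = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.5}
second_target_set = rerank(scores, first_preset_value=0.4)
print(second_target_set)
```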
210: Use the pseudo-relevance feedback model to expand the target query keyword according to the target documents in the current second target document set, obtaining a new target query keyword, and judge whether the new target query keyword satisfies the preset condition;
when the new target query keyword does not satisfy the preset condition, perform step 211;
when the new target query keyword satisfies the preset condition, perform step 212.
In this embodiment, specifically, the pseudo-relevance feedback model can be applied to a preset number of the higher-scoring target documents in the second target document set to expand the target query keyword, obtaining the new target query keyword, and whether the new target query keyword satisfies the preset condition is then judged.
It should be noted that the preset condition can be the number of keywords, or the number of keyword stems, but is not limited to these.
211: Use the new target query keyword to retrieve the first target document set again, obtaining a new second target document set, and return to step 210.
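The loop formed by steps 210 and 211 can be sketched as follows. This is a simplified pseudo-relevance feedback scheme under stated assumptions: the expansion adds the most frequent unseen term from the top-ranked documents, the preset condition is that the query reaches `target_size` keywords, and `overlap_rank` is a toy ranking function standing in for the relevance scoring of step 208. None of these specifics come from the text.

```python
from collections import Counter

def expand_query(query, top_docs, terms_per_round=1):
    """Add the most frequent terms of the top-ranked documents that are not yet in the query."""
    counts = Counter(t for tokens in top_docs for t in tokens if t not in query)
    return query + [term for term, _ in counts.most_common(terms_per_round)]

def feedback_loop(query, first_set, rank, target_size=3, top_n=2, max_rounds=10):
    """Repeat expansion (step 210) and re-retrieval (step 211) until the condition holds."""
    ranking = rank(query, first_set)
    for _ in range(max_rounds):
        top_docs = [first_set[doc_id] for doc_id in ranking[:top_n]]
        expanded = expand_query(query, top_docs)
        if expanded == query:
            break                         # nothing left to add
        query = expanded
        if len(query) >= target_size:     # preset condition satisfied: go on to step 212
            break
        ranking = rank(query, first_set)  # step 211: new second target document set
    return query

def overlap_rank(query, docs):
    """Toy ranking: order document ids by how often the query terms occur in them."""
    return sorted(docs, key=lambda d: sum(docs[d].count(q) for q in query), reverse=True)

first_set = {
    1: ["xml", "index", "index", "index", "query"],
    2: ["xml", "tree", "tree"],
    3: ["cooking", "recipe"],
}
new_query = feedback_loop(["xml"], first_set, overlap_rank)
print(new_query)
```

The `max_rounds` guard is an addition for safety, so that the loop terminates even if the preset condition can never be met.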
Referring to Fig. 6, a flowchart of the third stage of the method for document retrieval provided by this embodiment, the third stage includes:
Third stage: fragment generation stage.
212: Use the new target query keyword to retrieve the first target document set again, obtaining the third target document set.
It should be noted that the tag weights of the documents are obtained in this step in the same way as the tag weights are obtained in step 205 above, which is not described again here.
213: Perform sentence segmentation on each target document in the third target document set, and calculate the tag weight sum of each sentence obtained by the sentence segmentation.
Referring to Fig. 7, performing sentence segmentation on each target document in the third target document set and calculating the tag weight sum of each sentence obtained can specifically be realized through the following sub-steps:
2131: Train on the third target document set to obtain the weight of each tag in the third target document set.
2132: Remove the tags, and perform sentence segmentation on each target document in the third target document set.
It should be noted that sentence segmentation of a document belongs to the prior art and is not described again here.
2133: Calculate the weights of the tags corresponding to all words contained in each sentence, to obtain the tag weight sum tagW(s) of each sentence.
The tag weight sum of a sentence is the sum of the weights of the tags corresponding to all the words that the sentence contains.
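Sub-steps 2132 and 2133 can be sketched with Python's standard `xml.etree.ElementTree` parser. The sketch makes simplifying assumptions: each sentence lies entirely within the text of one element, so every word in it is credited with that element's tag weight; text following a child element (the "tail") is ignored; sentences are split on ".", "!" and "?"; and the tag weights are made-up values rather than the trained weights of sub-step 2131.

```python
import re
import xml.etree.ElementTree as ET

def sentences_with_tag_weight(xml_text, tag_weights):
    """Return (sentence, tagW) pairs; each word gets the weight of its enclosing tag."""
    root = ET.fromstring(xml_text)
    results = []
    for element in root.iter():
        text = (element.text or "").strip()
        if not text:
            continue
        weight = tag_weights.get(element.tag, 0.0)
        # Split after sentence-ending punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            words = sentence.split()
            if words:
                results.append((sentence, weight * len(words)))
    return results

xml_text = (
    "<article><title>XML retrieval.</title>"
    "<body>Keywords are simple. Structure is optional.</body></article>"
)
tag_weights = {"title": 2.0, "body": 0.5}  # hypothetical weights
sentences = sentences_with_tag_weight(xml_text, tag_weights)
print(sentences)
```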
214: Preprocess the query keywords to obtain the target query keyword.
The target query keyword includes several query keywords q.
It should be noted that the query keywords are preprocessed in this step in the same way as in step 203 above, which is not described again here.
215: Score the text content of each sentence according to the target query keyword.
Referring to Fig. 8, this step is specifically implemented as follows:
2151: Calculate the relevance score Score_query(s) of the target query keyword with respect to each sentence.
The relevance of a sentence s to the target query keyword depends on three factors: the number of distinct query keywords that occur in the sentence, queryC(s); the number of times each query keyword q_i occurs in the sentence, Occ(q_i, s); and the weight of each query keyword, Weight(q_i).
Specifically, Score_query(s) can be calculated with the following formula:
2152: Calculate the score Score_sw(s) of the important words in each sentence.
In this step, an important word is a word whose number of occurrences in the target document is greater than a threshold number.
Score_sw(s) can be calculated with the following formula:
2153: Calculate the title relevance score Score_title(s) of each sentence.
Score_title(s) can be calculated with the following formula:
2154: According to Score_query(s), Score_sw(s) and Score_title(s), calculate the relevance score Score_rel(s) of the text content of each sentence,
where
$$Score_{rel}(s) = \alpha \, Score_{query}(s) + \beta \, Score_{sw}(s) + \gamma \, Score_{title}(s)$$
and α, β and γ are three preset tuning parameters.
216: Calculate the final score of each sentence.
The final score of each sentence can be calculated with the following formula:
$$Score(s) = (1 + \sigma \cdot tagW(s)) \cdot Score_{rel}(s)$$
where σ is a tuning parameter.
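The two combination formulas of steps 2154 and 216 translate directly into code. In the sketch below the component scores are passed in as numbers, because the formulas for Score_query, Score_sw and Score_title are not reproduced in the text; the default values of the tuning parameters are arbitrary assumptions.

```python
def relevance_score(score_query, score_sw, score_title, alpha=0.5, beta=0.3, gamma=0.2):
    """Score_rel(s) = alpha * Score_query(s) + beta * Score_sw(s) + gamma * Score_title(s)."""
    return alpha * score_query + beta * score_sw + gamma * score_title

def final_score(tag_weight_sum, score_rel, sigma=0.1):
    """Score(s) = (1 + sigma * tagW(s)) * Score_rel(s)."""
    return (1 + sigma * tag_weight_sum) * score_rel

# Hypothetical component scores for one sentence.
score_rel = relevance_score(score_query=1.0, score_sw=0.5, score_title=0.0)
score = final_score(tag_weight_sum=4.0, score_rel=score_rel)
print(score_rel, score)
```

Because the tag weight sum enters as a multiplier of the form 1 + σ·tagW(s), a sentence with zero tag weight keeps its text relevance score unchanged, while sentences inside heavily weighted tags are boosted.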
217: Obtain the target sentences according to the final score of each sentence, the score of a target sentence being greater than or equal to a second preset value.
218: Among the target sentences, take the sentences whose length is within the preset length range as retrieval result fragments.
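Steps 217 and 218 can be sketched as a threshold filter followed by a length filter. Measuring sentence length in characters and returning the fragments in descending score order are assumptions; the text specifies neither. The names and the example values are hypothetical.

```python
def select_fragments(scored_sentences, second_preset_value, min_len, max_len):
    """Keep sentences scoring at least the threshold whose length lies in [min_len, max_len]."""
    targets = [(s, score) for s, score in scored_sentences if score >= second_preset_value]
    targets.sort(key=lambda item: item[1], reverse=True)
    return [s for s, _ in targets if min_len <= len(s) <= max_len]

scored_sentences = [
    ("Short.", 0.9),
    ("A sentence of suitable length.", 0.8),
    ("Low score but fine length here.", 0.1),
]
fragments = select_fragments(scored_sentences, second_preset_value=0.5, min_len=10, max_len=40)
print(fragments)
```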
The method for document retrieval provided by this embodiment enables a user to retrieve with query keywords without browsing the full content of a document and without knowing the document structure. It is suitable for retrieval over massive document sets and achieves high retrieval performance and accuracy.
Embodiment 3
Referring to Fig. 9, which is a diagram of a document retrieval device provided by this embodiment, the device includes:
A retrieval unit 301, configured to use the target query keywords obtained through preprocessing to search the target document set in a pre-built inverted index, obtaining a first target document set;
An acquiring unit 302, configured to score the first target document set for relevance, obtaining a relevance score for each first target document, and to re-rank the first target document set according to the relevance scores, obtaining a second target document set;
The acquiring unit 302 is further configured to expand the current target query keywords through a pseudo-relevance feedback model, obtaining new target query keywords;
The acquiring unit 302 is further configured to, when the new target query keywords satisfy a preset condition, use the new target query keywords to search the first target document set again, obtaining a third target document set;
A computing unit 303, configured to split each target document in the third target document set into sentences, and to calculate the tag weight sum of each sentence obtained by the splitting;
The computing unit 303 is further configured to score the text content of each sentence for relevance according to the target query keywords, obtaining a relevance score for each sentence, and to obtain the final score of each sentence from its relevance score;
The acquiring unit 302 is further configured to obtain target sentences according to the final score of each sentence, and to take, from the target sentences, the sentences whose length falls within a preset length range as the retrieval result fragments.
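The last step above, choosing result fragments from scored sentences, can be sketched as follows. This is a minimal illustration only: the text does not state how target sentences are derived from the final scores, so taking the `top_n` highest-scoring sentences is an assumption, and `select_fragments`, `top_n`, `min_len` and `max_len` are hypothetical names.

```python
def select_fragments(scored_sentences, top_n=3, min_len=20, max_len=200):
    """Pick result fragments from (sentence, final_score) pairs.

    Assumption: target sentences are the top_n sentences by final score.
    Only target sentences whose character length lies in the preset
    range [min_len, max_len] are returned as fragments.
    """
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    targets = [sentence for sentence, _ in ranked[:top_n]]
    return [s for s in targets if min_len <= len(s) <= max_len]


sentences = [
    ("short", 0.9),
    ("XML retrieval returns ranked fragments.", 0.8),
    ("An unrelated sentence about weather today.", 0.1),
    ("Tag weights boost sentences in titles.", 0.7),
]
fragments = select_fragments(sentences)
```

Here the highest-scoring sentence is dropped because it is shorter than the preset minimum length, which is the filtering behaviour the unit describes.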
Further, the acquiring unit 302 is further configured to obtain the term frequency (TF) value and the inverse document frequency (IDF) value of each word in the target document set;
Referring to Fig. 10, the device further includes:
An establishing unit 304, configured to build the inverted index according to the TF value and the IDF value of each word in the target document set.
A processing unit 305, configured to remove stop words from the query keywords and extract word stems, obtaining the target query keywords.
The establishing unit 304 is further configured to build a retrieval model.
Further, the retrieval unit 301 is specifically configured to, according to the retrieval model, use the target query keywords obtained through preprocessing to search the target document set in the pre-built inverted index, obtaining the first target document set.
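To make the roles of units 304, 305 and 301 concrete, the sketch below builds a small inverted index from TF and IDF values and searches it with a preprocessed query. It is illustrative only: the stop-word list, the toy suffix-stripping `stem` function, and the TF x IDF sum used for ranking are assumptions, since the retrieval model itself is not specified in this passage.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "of", "and", "in"}  # hypothetical stop-word list


def stem(word):
    # Toy suffix stripper standing in for a real stemmer (assumption).
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # Lowercase, drop stop words, then stem each remaining word.
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]


def build_inverted_index(docs):
    """docs maps doc_id -> text. Returns (index, idf).

    index maps term -> {doc_id: term frequency in that document};
    idf maps term -> log(N / number of documents containing the term).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(preprocess(text)).items():
            index[term][doc_id] = tf
    n_docs = len(docs)
    idf = {term: math.log(n_docs / len(posting)) for term, posting in index.items()}
    return index, idf


def search(index, idf, query):
    # Rank documents by the sum of TF * IDF over the query terms (assumed model).
    scores = defaultdict(float)
    for term in preprocess(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf * idf[term]
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "d1": "retrieving the XML documents",
    "d2": "XML tags",
    "d3": "weather report",
}
index, idf = build_inverted_index(docs)
first_target_set = search(index, idf, "XML documents")
```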
Further, the computing unit 303 is further configured to train on the first target document set, obtaining the weight of each tag in the first target document set.
Further, referring to Fig. 11, the computing unit 303 specifically includes:
An obtaining subunit 3031, configured to obtain all tag names in the first target document set;
A classifying subunit 3032, configured to divide, according to tag name, the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query;
The obtaining subunit 3031 is further configured to obtain the number of times $a$ that each query keyword $t_i$ appears in each relevant element $b_k$, and the total number $A$ of all words in the relevant element set;
The obtaining subunit 3031 is further configured to obtain the number of times $b$ that each query keyword $t_i$ appears in each irrelevant element $b_k$, and the total number $B$ of all words in the irrelevant element set;
A computing subunit 3033, configured to calculate, from the number of times each query keyword $t_i$ appears in each relevant element $b_k$ and the total number of all words in the relevant element set, the probability $p_{ik}$ that each query keyword $t_i$ appears in each relevant element $b_k$;
Wherein, $p_{ik} = a / A$;
The computing subunit 3033 is further configured to calculate, from the number of times each query keyword $t_i$ appears in each irrelevant element $b_k$ and the total number of all words in the irrelevant element set, the probability $q_{ik}$ that each query keyword $t_i$ appears in each irrelevant element $b_k$;
Wherein, $q_{ik} = b / B$;
The computing subunit 3033 is further configured to calculate the weight of each tag $m_j$ in the first target document set;
Wherein, the formula for calculating the weight of tag $m_j$ is:
$t_{ik}$ takes the value 0 or 1 and indicates whether element $b_k$ contains query keyword $t_i$; $Q$ denotes the query keywords.
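The occurrence probabilities used by computing subunit 3033 can be illustrated with a short sketch. It assumes each probability is the simple ratio of a keyword's occurrence count in an element to the total word count of the element set (a / A for relevant elements, b / B for irrelevant ones); the function name and data layout are hypothetical, and the tag weight formula itself is not reproduced here.

```python
def occurrence_probabilities(keywords, elements):
    """Estimate the probability of each keyword occurring in each element.

    elements is a list of word lists, one per element b_k of one set
    (relevant or irrelevant). Assumption: the probability is the
    keyword's count in element k divided by the total number of words
    in the whole element set.
    """
    total_words = sum(len(words) for words in elements)
    probs = {}
    for t in keywords:
        for k, words in enumerate(elements):
            probs[(t, k)] = words.count(t) / total_words if total_words else 0.0
    return probs


relevant = [["xml", "index", "xml"], ["query"]]
irrelevant = [["weather", "xml"], ["sports", "news"]]
p = occurrence_probabilities(["xml"], relevant)    # p_ik over relevant elements
q = occurrence_probabilities(["xml"], irrelevant)  # q_ik over irrelevant elements
```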
Further, referring to Fig. 12, the acquiring unit 302 specifically includes:
An extracting subunit 3021, configured to extract the SLCA subtree of each target document in the first target document set as the structural information of that target document;
A computing subunit 3022, configured to score the SLCA subtree of each target document for relevance, obtaining the relevance score of each target document.
Further, the computing subunit 3022 is specifically configured to obtain the number of times $tf_{n,q}$ that each query keyword $q$ in the target query keywords appears in each node $n$;
The computing subunit 3022 is specifically configured to calculate the TF value $TF_q$ of each query keyword $q$ in the first target document set;
The computing subunit 3022 is specifically configured to obtain, from the $tf_{n,q}$ and $TF_q$ of each query keyword $q$, the relevance score $tw(n,q)$ of each query keyword $q$ for the current node;
Wherein,
The computing subunit 3022 is specifically configured to, when the current node $n$ is a leaf node, calculate the sum of the relevance scores of all query keywords $q$ relative to the current node $n$ as the relevance score of the document.
Further, the computing subunit 3022 is also specifically configured to, when the current node $n$ is a non-leaf node, calculate the relevance score $tw(c,q)$ of every child node $c$ of the current node $n$ relative to the target query keywords;
The computing subunit 3022 is also specifically configured to calculate, from the relevance score $tw(n,q)$ of each query keyword $q$ relative to the current node $n$ and the relevance scores $tw(c,q)$ of all child nodes $c$ of the current node $n$ relative to the target query keywords, the relevance score $tw_1(n,q)$ of each query keyword $q$ relative to the current node $n$;
Wherein, $tw_1(n,q) = tw(n,q) + \sum_{c \in children(n)} d_n \cdot tw(c,q)$
The computing subunit 3022 is also specifically configured to calculate, from the relevance score $tw_1(n,q)$ of each query keyword $q$ relative to the current node $n$, the sum of the relevance scores of all query keywords $q$ relative to the current node $n$ as the relevance score of the document.
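The leaf and non-leaf cases handled by computing subunit 3022 can be sketched as below. Several things here are assumptions made for illustration: the `Node` class is hypothetical, `tw` is a placeholder that returns the raw keyword count rather than the formula of the specification, and $d_n$ is treated as a constant decay factor passed in as `d_n`. The sum runs over direct children only, following the formula as written.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node of an SLCA subtree."""
    term_counts: dict                      # keyword -> occurrences in this node
    children: list = field(default_factory=list)


def tw(node, q):
    # Placeholder for tw(n, q): raw occurrence count (assumption only).
    return float(node.term_counts.get(q, 0))


def tw1(node, q, d_n=0.5):
    # Leaf node: the score is tw(n, q) itself.
    if not node.children:
        return tw(node, q)
    # Non-leaf node: tw(n, q) plus the decayed scores of the direct children.
    return tw(node, q) + sum(d_n * tw(c, q) for c in node.children)


def document_score(root, keywords, d_n=0.5):
    # Sum the per-keyword scores relative to the current (root) node.
    return sum(tw1(root, q, d_n) for q in keywords)


leaf1 = Node({"xml": 2})
leaf2 = Node({"xml": 1, "index": 1})
root = Node({"xml": 1}, [leaf1, leaf2])
score = document_score(root, ["xml", "index"])
```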
Further, referring to Fig. 13, the device further includes:
A judging unit 306, configured to judge whether the new target query keywords satisfy the preset condition.
The acquiring unit 302 is further configured to, when the new target query keywords do not satisfy the preset condition, use the new target query keywords to search the first target document set again, obtaining a new second target document set;
The acquiring unit 302 is further configured to expand the current target query keywords through the pseudo-relevance feedback model, obtaining updated target query keywords;
The retrieval unit 301 is further configured to repeat this process until the updated target query keywords satisfy the preset condition, and then to use the updated target query keywords to search the first target document set again.
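The expand, judge, and re-search cycle carried out by units 302, 306 and 301 can be sketched as a loop. This is a simplified illustration under stated assumptions: documents are ranked by raw keyword counts, expansion adds the most frequent unseen word from the top-ranked documents, and the preset condition is taken to be "the query has at least `min_terms` keywords". None of these specifics come from the text; `rank`, `expand_query` and `feedback_search` are hypothetical names.

```python
from collections import Counter


def rank(docs, query):
    # docs maps doc_id -> list of words; score = total query-term occurrences.
    scores = {d: sum(words.count(t) for t in query) for d, words in docs.items()}
    hits = [d for d in scores if scores[d] > 0]
    return sorted(hits, key=lambda d: (-scores[d], d))


def expand_query(docs, ranked, query, top_docs=2, new_terms=1):
    # Pseudo-relevance feedback: treat the top-ranked documents as relevant
    # and add their most frequent words that are not yet in the query.
    counts = Counter()
    for d in ranked[:top_docs]:
        counts.update(w for w in docs[d] if w not in query)
    best = sorted(counts, key=lambda w: (-counts[w], w))[:new_terms]
    return query + best


def feedback_search(docs, query, min_terms=3, max_rounds=5):
    ranked = rank(docs, query)
    for _ in range(max_rounds):
        query = expand_query(docs, ranked, query)
        ranked = rank(docs, query)          # search again with the new keywords
        if len(query) >= min_terms:         # hypothetical preset condition
            break
    return query, ranked


docs = {
    "d1": ["xml", "index", "index"],
    "d2": ["xml", "query"],
    "d3": ["weather"],
}
final_query, third_set = feedback_search(docs, ["xml"])
```

The `max_rounds` cap is a safety choice for the sketch so the loop ends even when expansion finds no new words.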
Further, referring to Fig. 14, the computing unit 303 specifically includes:
A first computing subunit 3034, configured to train on the third target document set, obtaining the weight of each tag in the third target document set;
A sentence splitting subunit 3035, configured to remove the tags and split each target document in the third target document set into sentences;
The first computing subunit 3034 is further configured to calculate the weights of the tags corresponding to all words contained in each sentence, so as to obtain the tag weight sum $tagW(s)$ of each sentence.
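As a small illustration of $tagW(s)$, the sketch below sums, for every word of a sentence, the weight of the tag that enclosed the word before the tags were removed. The mapping from words to tags (`word_tags`) and the weight values are hypothetical; words without a known tag are assumed to contribute zero.

```python
def tag_weight_sum(sentence_words, word_tags, tag_weights):
    """Return tagW(s): the sum of the tag weights of all words in a sentence.

    word_tags maps a word to the name of the tag that enclosed it;
    tag_weights maps a tag name to its trained weight.
    """
    return sum(tag_weights.get(word_tags.get(w), 0.0) for w in sentence_words)


tag_weights = {"title": 0.8, "p": 0.2}
word_tags = {"xml": "title", "retrieval": "title", "fast": "p"}
tag_w = tag_weight_sum(["xml", "retrieval", "fast", "unknown"], word_tags, tag_weights)
```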
Further, the computing unit 303 is specifically configured to calculate the relevance score $Score_{query}(s)$ of the target query keywords relative to each sentence;
Wherein,
$QueryC(s)$ is the number of distinct keywords occurring in each sentence; $Occ(q_i, s)$ is the number of times each query keyword $q_i$ occurs in the sentence; $Weight(q_i)$ is the weight of each query keyword $q_i$;
The computing unit 303 is specifically configured to calculate the important-word score $Score_{sw}(s)$ of each sentence; an important word is a word whose number of occurrences in the target document exceeds a threshold count;
Wherein,
The computing unit 303 is specifically configured to calculate the title relevance score $Score_{title}(s)$ of each sentence;
Wherein,
The computing unit 303 is specifically configured to score the text content of each sentence for relevance according to $Score_{query}(s)$, $Score_{sw}(s)$ and $Score_{title}(s)$, obtaining $Score_{rel}(s)$;
Wherein,
$Score_{rel}(s) = \alpha \, Score_{query}(s) + \beta \, Score_{sw}(s) + \gamma \, Score_{title}(s)$
$\alpha$, $\beta$ and $\gamma$ are preset tuning parameters.
The computing unit 303 is also specifically configured to obtain the final score $Score(s)$ of each sentence according to the formula $Score(s) = (1 + \sigma \cdot tagW(s)) \cdot Score_{rel}(s)$;
Wherein, $\sigma$ is a preset tuning parameter.
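The two combining formulas above translate directly into code. In this sketch the three component scores are taken as given inputs, and the default values chosen for $\alpha$, $\beta$, $\gamma$ and $\sigma$ are arbitrary placeholders, not values from the specification.

```python
def relevance_score(score_query, score_sw, score_title,
                    alpha=0.5, beta=0.3, gamma=0.2):
    # Score_rel(s) = alpha*Score_query(s) + beta*Score_sw(s) + gamma*Score_title(s)
    return alpha * score_query + beta * score_sw + gamma * score_title


def final_score(score_rel, tag_w, sigma=0.1):
    # Score(s) = (1 + sigma * tagW(s)) * Score_rel(s)
    return (1 + sigma * tag_w) * score_rel


rel = relevance_score(2.0, 1.0, 0.5)
final = final_score(rel, tag_w=1.8)
```

With $\sigma = 0$ the final score reduces to $Score_{rel}(s)$, so $\sigma$ controls how strongly the tag weights boost a sentence.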
The document retrieval device provided by this embodiment of the present invention lets a user retrieve with query keywords without browsing the full content of a document and without knowing the document structure; it is also applicable to retrieval over massive document collections, with high retrieval performance and accuracy.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, or of course by hardware alone, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, hard disk, or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the scope of the claims.