CN103678412B - Method and device for document retrieval - Google Patents

Method and device for document retrieval

Info

Publication number
CN103678412B
CN103678412B (application CN201210360872.XA)
Authority
CN
China
Prior art keywords
document
sentence
keyword
query keyword
target query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210360872.XA
Other languages
Chinese (zh)
Other versions
CN103678412A
Inventor
洪毅虹
杨建武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, and Beijing Founder Electronics Co Ltd
Priority to CN201210360872.XA
Publication of CN103678412A
Application granted
Publication of CN103678412B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/835 Query processing
    • G06F16/8373 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/835 Query processing
    • G06F16/8365 Query optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method and device for document retrieval, belonging to the field of information retrieval. The method includes: retrieving a target document set in a pre-built inverted index using target query keywords to obtain a first target document set; performing relevance scoring on the first target documents to obtain relevance scoring results, and re-ordering the documents accordingly to obtain a second target document set; expanding the current target query keywords through a pseudo-relevance feedback model to obtain new target query keywords, and retrieving again to obtain a third target document set; splitting each target document in the third target document set into sentences and calculating the tag-weight sum of each sentence; performing relevance scoring on the content of each sentence according to the target query keywords to obtain a final score for each sentence and thereby the target sentences; and selecting, from the target sentences, sentences whose length falls within a preset range as the retrieval result snippets. The present invention improves the retrieval performance and accuracy of XML document retrieval.

Description

Method and device for document retrieval
Technical field
The present invention relates to the field of information retrieval, and in particular to a method and a device for document retrieval.
Background art
HTML (Hypertext Markup Language), the main carrier of traditional Internet information, provides users with a convenient way to present information and is mainly concerned with how the information is displayed in a web browser. As Web applications become increasingly widespread, the limitations of the HTML data model become more and more apparent: HTML cannot describe data, the HTML tag set is fixed and limited, and users cannot add meaningful tags according to their own needs. XML (Extensible Markup Language) therefore came into being.
XML is self-describing, platform-independent, extensible and easy to use; it can express data in a readable form without being constrained by any particular presentation. XML allows data to be exchanged between incompatible systems, simplifying data sharing and transmission. An XML document carries both content information and structural information, which makes large-scale data exchange, integration and consolidation over the Internet possible. As more and more Web applications, such as Web services, e-commerce and digital libraries, use XML as the carrier for storing, exchanging and publishing massive amounts of data, how to efficiently retrieve useful information from massive XML document collections has attracted the attention of more and more researchers.
At present, XML document retrieval can be carried out in the following two ways:
First, retrieval based on XML document structure.
Under this retrieval mode, the user must understand the structure of the queried XML document in order to construct a query expression.
The second retrieval mode is retrieval based on keywords.
Under this retrieval mode, the query expression is written in advance by its author. The user neither needs to learn a complex query language nor needs a deep understanding of the underlying data structure of the XML document; the user only needs to enter keywords related to the content of interest to complete the query. Existing methods include MLCA, SLCA, XRank, XSEarch, XSeek and so on.
However, in the first approach, on the one hand most XML documents on the Internet do not provide users with complete structural information, and on the other hand there are also a large number of heterogeneous XML documents on the Internet; in both cases it is difficult for users to construct, with an existing language, a query expression that queries the XML structure. In the second approach, most methods for XML keyword querying are built on a tree-shaped storage model, which requires the author to know the structure of the XML document in advance when writing the query expression.
In summary, existing XML document retrieval models require the user to browse the entire content of the XML documents or to know the structure of the queried XML documents in advance, and they consume a large amount of storage space during retrieval. Today, when XML documents carry massive amounts of data, the retrieval performance and accuracy of the existing XML document retrieval models are relatively low.
Summary of the invention
Embodiments of the present invention provide a method and device for document retrieval that improve the retrieval performance and accuracy of XML documents.
To achieve the above purpose, the embodiments of the present invention adopt the following technical solutions:
A method of document retrieval, including:
retrieving a target document set in a pre-built inverted index using target query keywords obtained through preprocessing, to obtain a first target document set;
performing relevance scoring on the first target document set to obtain relevance scoring results of the first target documents, and re-ordering the first target document set according to the relevance scoring results to obtain a second target document set;
expanding the current target query keywords through a pseudo-relevance feedback model to obtain new target query keywords;
when the new target query keywords satisfy a preset condition, retrieving the first target document set again using the new target query keywords to obtain a third target document set;
splitting each target document in the third target document set into sentences, and calculating the tag-weight sum of each sentence obtained by the sentence splitting;
performing relevance scoring on the text content of each sentence according to the target query keywords to obtain the relevance scoring result of each sentence, and obtaining the final score of each sentence according to its relevance scoring result;
obtaining target sentences according to the final score of each sentence, and selecting, from the target sentences, sentences whose length falls within a preset range as retrieval result snippets.
The present invention also provides a device for document retrieval, including:
a retrieval unit, configured to retrieve a target document set in a pre-built inverted index using target query keywords obtained through preprocessing, to obtain a first target document set;
an acquiring unit, configured to perform relevance scoring on the first target document set, obtain the relevance scoring results of the first target documents, and re-order the first target document set according to the relevance scoring results to obtain a second target document set;
the acquiring unit is further configured to expand the current target query keywords through a pseudo-relevance feedback model to obtain new target query keywords;
the acquiring unit is further configured to, when the new target query keywords satisfy the preset condition, retrieve the first target document set again using the new target query keywords to obtain a third target document set;
the acquiring unit is further configured to, when the new target query keywords do not satisfy the preset condition, retrieve the first target document set again using the new target query keywords to obtain a new second target document set;
a computing unit, configured to split each target document in the third target document set into sentences, and calculate the tag-weight sum of each sentence obtained by the sentence splitting;
the computing unit is further configured to perform relevance scoring on the text content of each sentence according to the target query keywords, obtain the relevance scoring result of each sentence, and obtain the final score of each sentence according to its relevance scoring result;
the acquiring unit is further configured to obtain target sentences according to the final score of each sentence, and to select, from the target sentences, sentences whose length falls within a preset range as retrieval result snippets.
With the method and device for document retrieval provided by the embodiments of the present invention, a user can retrieve documents with query keywords without browsing the entire content of the documents and without knowing the document structure. The method is applicable to the retrieval of massive document collections, with high retrieval performance and accuracy.
Brief description of the drawings
In order to describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required by the embodiments of the present invention are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method of document retrieval provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the first stage of a method of document retrieval provided by Embodiment 2 of the present invention;
Fig. 3 is a flowchart of the second stage of the method of document retrieval provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic flowchart of training the first target document set to obtain the tag weights, provided by Embodiment 2 of the present invention;
Fig. 5 is a schematic flowchart of calculating the relevance score of the SLCA subtree of each document, provided by Embodiment 2 of the present invention;
Fig. 6 is a flowchart of the third stage of the method of document retrieval provided by Embodiment 2 of the present invention;
Fig. 7 is a schematic flowchart of splitting each target document in the third target document set into sentences and calculating the tag-weight sum of each sentence, provided by Embodiment 2 of the present invention;
Fig. 8 is a schematic flowchart of scoring the text content of each sentence according to the target query keywords, provided by Embodiment 2 of the present invention;
Fig. 9 is a schematic structural diagram of a device for document retrieval provided by Embodiment 3 of the present invention;
Fig. 10 is a second schematic structural diagram of the device for document retrieval provided by Embodiment 3 of the present invention;
Fig. 11 is a schematic structural diagram of the computing unit in the device for document retrieval provided by Embodiment 3 of the present invention;
Fig. 12 is a schematic structural diagram of the acquiring unit in the device for document retrieval provided by Embodiment 3 of the present invention;
Fig. 13 is a third schematic structural diagram of the device for document retrieval provided by Embodiment 3 of the present invention;
Fig. 14 is a second schematic structural diagram of the computing unit in the device for document retrieval provided by Embodiment 3 of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described below clearly and completely in conjunction with the accompanying drawings of the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
Referring to Fig. 1, the flowchart of the method of document retrieval provided by this embodiment includes:
A. Retrieve the target document set in a pre-built inverted index using target query keywords obtained through preprocessing, to obtain a first target document set.
B. Perform relevance scoring on the first target document set to obtain the relevance scoring results of the first target documents, and re-order the first target document set according to the relevance scoring results to obtain a second target document set.
C. Expand the current target query keywords through a pseudo-relevance feedback model to obtain new target query keywords.
D. When the new target query keywords satisfy a preset condition, retrieve the first target document set again using the new target query keywords to obtain a third target document set.
E. Split each target document in the third target document set into sentences, and calculate the tag-weight sum of each sentence obtained by the sentence splitting.
F. Perform relevance scoring on the text content of each sentence according to the target query keywords to obtain the relevance scoring result of each sentence, and obtain the final score of each sentence according to its relevance scoring result.
G. Obtain target sentences according to the final score of each sentence, and select, from the target sentences, the sentences whose length falls within a preset range as retrieval result snippets.
With the method of document retrieval provided by this embodiment, a user can retrieve documents with query keywords without browsing the entire content of the documents and without knowing the document structure. The method is applicable to the retrieval of massive document collections, with high retrieval performance and accuracy.
Embodiment 2
In this embodiment, the method of document retrieval is divided into three stages: the first stage is a fuzzy retrieval stage, used to narrow down the semi-structured document set; the second stage is a precise retrieval stage, used to obtain an accurate set of documents related to the query; and the third stage is a snippet generation stage.
Referring to Fig. 2, the flowchart of the first stage of the method of document retrieval provided by this embodiment includes:
First stage: fuzzy retrieval stage.
201: Preprocess the target document set.
In this embodiment, the target document set is the collection of semi-structured XML documents to be queried.
Preprocessing the target document set can be realized by the following sub-steps:
1) Remove the stop words from the target document set.
Stop words can be configured in advance by the user. In English they can be words without concrete meaning such as "in", "the", "oh" and punctuation marks; in Chinese they can be function words without concrete meaning (for example the particle "着") and punctuation marks.
For example, the following two articles are part of the content of a document set and are used to illustrate the removal of stop words from the target document set:
The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.
The content of article 2 is: He once lived in Shanghai.
The content of article 1 and of article 2 is each a character string. First, all words in article 1 and article 2 are identified according to the spaces, each word being a keyword, and then the stop words are removed from article 1 and article 2. After stop-word removal, article 1 and article 2 are as follows:
Article 1 after stop-word removal: [Tom] [lives] [Guangzhou] [I] [live] [Guangzhou]
Article 2 after stop-word removal: [He] [lives] [Shanghai].
It should be noted that when Chinese sentences appear in a document, the Chinese sentences need to be segmented into words with existing word-segmentation techniques before the stop words are removed from the document.
2) Extract the stems of the target document set.
First, when the content of the target document set is English text, all words are converted to a uniform case; for example, when a user searches for "He", the words "HE" and "he" can also be found.
Second, when the content of the target document set is English text, all words are reduced to their stems; for example, when a user searches for "live", the words "lives" and "lived" should also be found, so "lives" and "lived" need to be reduced to "live".
For example, taking article 1 and article 2 after stop-word removal above, after stem extraction:
all keywords of article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou]
all keywords of article 2 are: [he] [live] [shanghai].
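As a rough illustration of sub-steps 1) and 2), the following Python sketch reproduces the example above with stop-word removal, lowercasing and a deliberately naive suffix-stripping stemmer; the stop-word list and the suffix rules are assumptions made only for this example, not the ones prescribed by the invention.

```python
# Illustrative preprocessing: stop-word removal, case folding, naive stemming.
STOP_WORDS = {"in", "the", "oh", "too", "once"}  # assumed stop-word list

def naive_stem(word: str) -> str:
    # Extremely crude suffix handling, just enough for this example;
    # a real system would use a proper stemmer.
    if word.endswith("ed") and len(word) > 4:
        return word[:-1]          # "lived" -> "live"
    if word.endswith("s") and len(word) > 3:
        return word[:-1]          # "lives" -> "live"
    return word

def preprocess(text: str) -> list:
    tokens = text.replace(",", " ").replace(".", " ").split()
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]

print(preprocess("Tom lives in Guangzhou, I live in Guangzhou too."))
# ['tom', 'live', 'guangzhou', 'i', 'live', 'guangzhou']
print(preprocess("He once lived in Shanghai."))
# ['he', 'live', 'shanghai']
```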
3) Calculate the TF (term frequency) value and the IDF (inverse document frequency) value of each word in the stems of the target document set.
The TF value of each word in the target document set can be calculated with the following formula:
TF_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}
In the above formula, n_{i,j} is the number of occurrences of the word in document d_j of the target document set, and the denominator is the total number of occurrences of all words in document d_j.
The IDF value of each word in the target document set can be calculated with the following formula:
IDF_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}
where |D| is the total number of documents in the target document set, and |\{j : t_i \in d_j\}| is the total number of documents containing t_i (that is, the number of documents with n_{i,j} \neq 0).
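The two formulas above can be computed directly from the preprocessed token lists, for example as in the following sketch (variable names are illustrative):

```python
import math
from collections import Counter

def tf(doc_tokens):
    # TF_{i,j} = n_{i,j} / sum_k n_{k,j}
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def idf(all_docs):
    # IDF_i = log(|D| / |{j : t_i in d_j}|)
    num_docs = len(all_docs)
    doc_freq = Counter(term for doc in all_docs for term in set(doc))
    return {term: math.log(num_docs / df) for term, df in doc_freq.items()}

docs = [["tom", "live", "guangzhou", "i", "live", "guangzhou"],
        ["he", "live", "shanghai"]]
print(tf(docs[0])["live"])   # 2/6
print(idf(docs)["live"])     # log(2/2) = 0.0
```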
202: Build an inverted index over the preprocessed target document set.
For example, taking article 1 and article 2 above, after the inverted index is built the correspondence between each keyword in article 1 and article 2 and the article number, [occurrence frequency], and keyword positions is:
guangzhou   1 [2]          3, 6
he          2 [1]          1
i           1 [1]          4
live        1 [2], 2 [1]   2, 5; 2
shanghai    2 [1]          3
tom         1 [1]          1
Once the inverted index has been built, the number of occurrences and the exact positions of each keyword in each article are known.
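A minimal inverted index with per-document frequencies and 1-based positions, matching the table above, could be built along these lines (an illustrative sketch; a full index for step 204 would also carry the TF and IDF values from step 201):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # term -> {doc_id: [positions]}; positions are 1-based, as in the table above
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
        2: ["he", "live", "shanghai"]}
index = build_inverted_index(docs)
print(index["live"])   # {1: [2, 5], 2: [2]} -> frequency per document is len(positions)
```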
203: Preprocess the query keywords to obtain the target query keywords.
Preprocessing the query keywords can be realized by the following sub-steps:
1) Remove the stop words from the query keywords.
It should be noted that this step is implemented in the same way as the removal of stop words from the target document set in sub-step 1) of step 201, and is not described again here.
2) Extract the stems of the query keywords to obtain the target query keywords.
It should be noted that this step is implemented in the same way as the stem extraction of the target document set in sub-step 2) of step 201, and is not described again here.
204: Retrieve the target document set in the inverted index according to a retrieval model, using the target query keywords obtained through preprocessing, to obtain the first target document set.
It should be noted that the retrieval model is built with probabilistic methods and a language model, and the retrieval process uses Dirichlet smoothing to narrow down the target document set. Building the retrieval model and Dirichlet smoothing belong to the prior art and are not described in detail here.
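For background, a query-likelihood retrieval score with Dirichlet smoothing, in its standard textbook form, looks as follows; the patent only names the technique, so the exact variant and the smoothing parameter below are assumptions:

```python
import math

def dirichlet_score(query_terms, doc_tokens, collection_prob, mu=2000.0):
    # Query-likelihood log-score with Dirichlet smoothing:
    #   score(q, d) = sum_w log( (tf(w, d) + mu * P(w|C)) / (|d| + mu) )
    doc_len = len(doc_tokens)
    score = 0.0
    for w in query_terms:
        p_w = (doc_tokens.count(w) + mu * collection_prob.get(w, 1e-9)) / (doc_len + mu)
        score += math.log(p_w)
    return score
```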
Referring to Fig. 3, the flowchart of the second stage of the method of document retrieval provided by this embodiment includes:
Second stage: precise retrieval stage.
205: Train the first target document set to obtain the weight of each tag in the first target document set.
Referring to Fig. 4, training the first target document set to obtain the tag weights can be realized by the following sub-steps:
2051: Obtain all tag names in the first target document set.
2052: According to the tag names, divide the elements in the first target document set into a set of elements related to the query and a set of elements unrelated to the query.
2053: Obtain the number of times a that each query keyword t_i occurs in each related element b_k, and the total number A of all words in the set of related elements.
It should be noted that when the query is English text, each word is used as a query keyword; when the query is a Chinese statement, the statement is first segmented into words with existing word-segmentation techniques and each resulting word is used as a query keyword.
2054: Obtain the number of times b that each query keyword t_i occurs in each unrelated element b_k, and the total number B of all words in the set of unrelated elements.
The note under step 2053 applies here as well: a Chinese query statement is first segmented into words, each of which is used as a query keyword.
2055: According to the number of times each query keyword t_i occurs in each related element b_k and the total number of all words in the set of related elements, calculate the probability p_{ik} that each query keyword t_i occurs in each related element b_k.
where p_{ik} = \frac{a}{A}
2056: According to the number of times each query keyword t_i occurs in each unrelated element b_k and the total number of all words in the set of unrelated elements, calculate the probability q_{ik} that each query keyword t_i occurs in each unrelated element b_k.
where q_{ik} = \frac{b}{B}
2057: Calculate the weight of each tag m_j in the first target document set.
The formula for the weight of tag m_j is:
f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\, t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)
where t_{ik} is a 0-1 value (0 or 1) indicating whether element b_k contains query keyword t_i, and Q is the set of query keywords.
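A sketch of the tag-weight computation of sub-steps 2051-2057 is given below, under the simplifying assumptions that the query-related and query-unrelated elements carrying a given tag have already been separated, and that the unrelated-element probability is aggregated over the whole unrelated set; the small epsilon is added purely to keep the logarithm defined and is not part of the formula above:

```python
import math

def tag_weight(query_terms, related_elems, unrelated_elems, eps=1e-6):
    # related_elems / unrelated_elems: token lists of the elements carrying tag m_j
    # that are related / unrelated to the query (the grouping itself is assumed done).
    A = sum(len(e) for e in related_elems) or 1      # total words in the related set
    B = sum(len(e) for e in unrelated_elems) or 1    # total words in the unrelated set
    weight = 0.0
    for t in query_terms:
        b = sum(e.count(t) for e in unrelated_elems)
        q_ik = min(max(b / B, eps), 1 - eps)
        for b_k in related_elems:
            if t not in b_k:                         # t_ik = 0: element lacks the keyword
                continue
            p_ik = min(max(b_k.count(t) / A, eps), 1 - eps)
            weight += math.log(p_ik * (1 - q_ik) / (q_ik * (1 - p_ik)))
    return weight
```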
206: Preprocess the query keywords to obtain the target query keywords.
The target query keywords comprise several query keywords q.
It should be noted that the preprocessing of the query keywords in this step is the same as in step 203 and is not repeated here.
207: Extract the SLCA subtree of each target document in the first target document set as the structural information of that target document.
208: Perform relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document.
Referring to Fig. 5, calculating the relevance score of the SLCA subtree of each document can take a bottom-up approach and can be realized by the following sub-steps:
2081: Obtain the number of times tf_{n,q} that each query keyword q of the target query keywords occurs in each node n.
2082: Calculate the TF value TF_q of each query keyword q in the first target document set.
The method of calculating the TF value of each query keyword q in the first target document set in this step is the same as the method of calculating the TF value of each word in sub-step 3) of step 201, and is not repeated here.
2083: According to tf_{n,q} and TF_q of each query keyword q, obtain the relevance score tw(n, q) of each query keyword q for the current node.
where tw(n, q) = \frac{tf_{n,q}}{TF_q}
2084: When the current node n is a leaf node, the sum of the relevance scores of all query keywords q relative to the current node n is calculated as the relevance score of the document.
2085: When the current node n is a non-leaf node, calculate the relevance scores tw(c, q) of all child nodes c of the current node n relative to the target query keywords.
2086: According to the relevance score tw(n, q) of each query keyword q relative to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n relative to the target query keywords, obtain the relevance score tw_1(n, q) of each query keyword q relative to the current node n.
where tw_1(n, q) = tw(n, q) + \sum_{c \in children(n)} d_n \cdot tw(c, q)
2087: According to the relevance score tw_1(n, q) of each query keyword q relative to the current node n, the sum of the relevance scores of all query keywords q relative to the current node n is calculated as the relevance score of the document.
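The bottom-up scoring of sub-steps 2081-2087 can be sketched as the following recursion over an element tree; the node representation and the value of the coefficient d_n are illustrative assumptions, and the child contribution is computed recursively all the way down, which is one plausible reading of step 2085:

```python
class Node:
    def __init__(self, tokens, children=None):
        self.tokens = tokens            # text tokens directly under this element node
        self.children = children or []

def tw1(node, q, TF, d_n=0.5):
    # tw(n, q) = tf_{n,q} / TF_q; child contributions are damped by d_n.
    tw_nq = node.tokens.count(q) / TF[q]
    return tw_nq + d_n * sum(tw1(c, q, TF, d_n) for c in node.children)

def subtree_relevance(root, query_terms, TF, d_n=0.5):
    # Relevance score of a document's SLCA subtree: sum of tw1 over all query keywords.
    return sum(tw1(root, q, TF, d_n) for q in query_terms)
```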
209: Obtain the second target document set according to the relevance score of each target document in the first target document set.
Specifically, the documents in the first target document set can be re-ordered in descending order of relevance score, or the documents in the target document set can be re-ordered in ascending order of relevance score.
Optionally, after the documents in the target document set are re-ordered, the documents whose score is below a first preset value can also be excluded, keeping the target documents whose score is greater than or equal to the first preset value, to obtain the second target document set.
210: Expand the target query keywords with the pseudo-relevance feedback model according to the documents in the current second target document set, obtain new target query keywords, and judge whether the new target query keywords satisfy a preset condition;
when the new target query keywords do not satisfy the preset condition, perform step 211;
when the new target query keywords satisfy the preset condition, perform step 212.
In this embodiment, specifically, the target query keywords can be expanded with the pseudo-relevance feedback model according to a preset number of higher-scoring target documents in the second target document set, to obtain the new target query keywords, and it is then judged whether the new target query keywords satisfy the preset condition.
It should be noted that the preset condition can be the number of keywords or the number of keyword stems, but is not limited thereto.
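Step 210 can be sketched as classical pseudo-relevance feedback: the top-ranked documents of the current second target document set are assumed relevant and their most frequent terms are appended to the query. The number of feedback documents, the number of expansion terms and the example preset condition below are assumptions:

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, top_k=5, n_new_terms=3):
    # ranked_docs: token lists of the second target document set, best first.
    counts = Counter()
    for doc in ranked_docs[:top_k]:                  # pseudo-relevant documents
        counts.update(t for t in doc if t not in query_terms)
    new_terms = [t for t, _ in counts.most_common(n_new_terms)]
    return query_terms + new_terms

def meets_preset_condition(query_terms, max_keywords=10):
    # Example preset condition: the expanded query has reached a keyword-count limit.
    return len(query_terms) >= max_keywords
```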
211: Retrieve the first target document set again using the new target query keywords to obtain a new second target document set, and return to step 210.
Referring to Fig. 6, the flowchart of the third stage of the method of document retrieval provided by this embodiment includes:
Third stage: snippet generation stage.
212: Retrieve the first target document set again using the new target query keywords to obtain the third target document set.
It should be noted that the tag weights of the documents in this stage are obtained in the same way as in step 205, and this is not described again here.
213: Split each target document in the third target document set into sentences, and calculate the tag-weight sum of each sentence obtained by the sentence splitting.
Referring to Fig. 7, splitting each target document in the third target document set into sentences and calculating the tag-weight sum of each sentence can be realized by the following sub-steps:
2131: Train the third target document set to obtain the weight of each tag in the third target document set.
2132: Remove the tags and split each target document in the third target document set into sentences.
It should be noted that splitting a document into sentences belongs to the prior art and is not described here.
2133: Calculate the weights of the tags corresponding to all words contained in each sentence, to obtain the tag-weight sum tagW(s) of each sentence.
The tag-weight sum of a sentence is the sum of the weights of the tags corresponding to all words contained in that sentence.
214: Preprocess the query keywords to obtain the target query keywords.
The target query keywords include several query keywords q.
It should be noted that the preprocessing of the query keywords in this step is the same as in step 203 and is not repeated here.
215: Score the text content of each sentence according to the target query keywords.
Referring to Fig. 8, this step is implemented as follows:
2151: Calculate the relevance score Score_{query}(s) of the target query keywords relative to each sentence.
The relevance of a sentence s to the target query keywords depends on three factors: the number of distinct query keywords queryC(s) occurring in the sentence; the number of times Occ(q_i, s) that each query keyword q_i occurs in the sentence; and the weight Weight(q_i) of each query keyword q_i.
Specifically, Score_{query}(s) can be calculated with the following formula:
Score_{query}(s) = queryC(s) \times \sum_{i=1}^{n} Occ(q_i, s) \times Weight(q_i)
2152: Calculate the score Score_{sw}(s) of the important words in each sentence.
In this step, an important word is a word whose number of occurrences in the target document exceeds a threshold.
Score_{sw}(s) can be calculated with the following formula:
2153: Calculate the title relevance score Score_{title}(s) of each sentence.
Score_{title}(s) can be calculated with the following formula:
2154: Perform relevance scoring Score_{rel}(s) on the text content of each sentence according to Score_{query}(s), Score_{sw}(s) and Score_{title}(s);
where
Score_{rel}(s) = \alpha\, Score_{query}(s) + \beta\, Score_{sw}(s) + \gamma\, Score_{title}(s)
\alpha, \beta and \gamma above are three preset tuning parameters.
216: Calculate the final score of each sentence.
The final score of each sentence can be calculated with the following formula:
Score(s) = (1 + \sigma \times tagW(s)) \times Score_{rel}(s)
where \sigma in the above formula is a tuning parameter.
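Putting steps 213-216 together, the final score of a sentence can be computed roughly as below; Score_sw(s) and Score_title(s) are passed in as given values because their formulas are not reproduced in the text, and the parameter values are illustrative:

```python
def score_query(sentence_tokens, query_terms, weight):
    # Score_query(s) = queryC(s) * sum_i Occ(q_i, s) * Weight(q_i)
    query_kinds = len({q for q in query_terms if q in sentence_tokens})
    return query_kinds * sum(sentence_tokens.count(q) * weight[q] for q in query_terms)

def final_score(sentence_tokens, query_terms, weight, tag_w, score_sw, score_title,
                alpha=1.0, beta=0.5, gamma=0.5, sigma=0.5):
    # Score_rel(s) = alpha*Score_query(s) + beta*Score_sw(s) + gamma*Score_title(s)
    score_rel = (alpha * score_query(sentence_tokens, query_terms, weight)
                 + beta * score_sw + gamma * score_title)
    # Score(s) = (1 + sigma * tagW(s)) * Score_rel(s)
    return (1 + sigma * tag_w) * score_rel
```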
217: Obtain the target sentences according to the final score of each sentence; the score of a target sentence is greater than or equal to a second preset value.
218: Select, from the target sentences, the sentences whose length falls within a preset range as the retrieval result snippets.
With the method of document retrieval provided by this embodiment, a user can retrieve documents with query keywords without browsing the entire content of the documents and without knowing the document structure. The method is applicable to the retrieval of massive document collections, with high retrieval performance and accuracy.
Embodiment 3
Referring to Fig. 9, the device for document retrieval provided by this embodiment includes:
a retrieval unit 301, configured to retrieve the target document set in the pre-built inverted index using the target query keywords obtained through preprocessing, to obtain a first target document set;
an acquiring unit 302, configured to perform relevance scoring on the first target document set, obtain the relevance scoring results of the first target documents, and re-order the first target document set according to the relevance scoring results to obtain a second target document set;
the acquiring unit 302 is further configured to expand the current target query keywords through the pseudo-relevance feedback model to obtain new target query keywords;
the acquiring unit 302 is further configured to, when the new target query keywords satisfy the preset condition, retrieve the first target document set again using the new target query keywords to obtain a third target document set;
a computing unit 303, configured to split each target document in the third target document set into sentences and calculate the tag-weight sum of each sentence obtained by the sentence splitting;
the computing unit 303 is further configured to perform relevance scoring on the text content of each sentence according to the target query keywords, obtain the relevance scoring result of each sentence, and obtain the final score of each sentence according to its relevance scoring result;
the acquiring unit 302 is further configured to obtain the target sentences according to the final score of each sentence, and to select, from the target sentences, the sentences whose length falls within a preset range as retrieval result snippets.
Further, the acquiring unit 302 is further configured to obtain the term frequency (TF) value and the inverse document frequency (IDF) value of each word in the target document set.
Referring to Fig. 10, the device also includes:
a building unit 304, configured to build the inverted index according to the TF value and IDF value of each word in the target document set;
a processing unit 305, configured to perform stop-word removal and stem extraction on the query keywords to obtain the target query keywords.
The building unit 304 is further configured to build the retrieval model.
Further, the retrieval unit 301 is specifically configured to retrieve the target document set in the pre-built inverted index according to the retrieval model, using the target query keywords obtained through preprocessing, to obtain the first target document set.
Further, the computing unit 303 is further configured to train the first target document set to obtain the weight of each tag in the first target document set.
Further, referring to Fig. 11, the computing unit 303 specifically includes:
an obtaining subunit 3031, configured to obtain all tag names in the first target document set;
a classification subunit 3032, configured to divide, according to the tag names, the elements in the first target document set into a set of elements related to the query and a set of elements unrelated to the query;
the obtaining subunit 3031 is further configured to obtain the number of times a that each query keyword t_i occurs in each related element b_k and the total number A of all words in the set of related elements;
the obtaining subunit 3031 is further configured to obtain the number of times b that each query keyword t_i occurs in each unrelated element b_k and the total number B of all words in the set of unrelated elements;
a computation subunit 3033, configured to calculate, according to the number of times each query keyword t_i occurs in each related element b_k and the total number of all words in the set of related elements, the probability p_{ik} that each query keyword t_i occurs in each related element b_k;
where p_{ik} = \frac{a}{A}
the computation subunit 3033 is further configured to calculate, according to the number of times each query keyword t_i occurs in each unrelated element b_k and the total number of all words in the set of unrelated elements, the probability q_{ik} that each query keyword t_i occurs in each unrelated element b_k;
where q_{ik} = \frac{b}{B}
the computation subunit 3033 is further configured to calculate the weight of each tag m_j in the first target document set;
where the formula for the weight of tag m_j is:
f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\, t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)
t_{ik} is a 0-1 value indicating whether element b_k contains query keyword t_i; Q is the set of query keywords.
Further, referring to Fig. 12, the acquiring unit 302 specifically includes:
an extraction subunit 3021, configured to extract the SLCA subtree of each target document in the first target document set as the structural information of that target document;
a computation subunit 3022, configured to perform relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document.
Further,
the computation subunit 3022 is specifically configured to obtain the number of times tf_{n,q} that each query keyword q of the target query keywords occurs in each node n;
the computation subunit 3022 is specifically configured to calculate the TF value TF_q of each query keyword q in the first target document set;
the computation subunit 3022 is specifically configured to obtain, according to tf_{n,q} and TF_q of each query keyword q, the relevance score tw(n, q) of each query keyword q for the current node;
where tw(n, q) = \frac{tf_{n,q}}{TF_q}
the computation subunit 3022 is specifically configured to, when the current node n is a leaf node, calculate the sum of the relevance scores of all query keywords q relative to the current node n as the relevance score of the document.
Further,
the computation subunit 3022 is further specifically configured to, when the current node n is a non-leaf node, calculate the relevance scores tw(c, q) of all child nodes c of the current node n relative to the target query keywords;
the computation subunit 3022 is further specifically configured to obtain, according to the relevance score tw(n, q) of each query keyword q relative to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n relative to the target query keywords, the relevance score tw_1(n, q) of each query keyword q relative to the current node n;
where tw_1(n, q) = tw(n, q) + \sum_{c \in children(n)} d_n \cdot tw(c, q)
the computation subunit 3022 is further specifically configured to calculate, according to the relevance score tw_1(n, q) of each query keyword q relative to the current node n, the sum of the relevance scores of all query keywords q relative to the current node n as the relevance score of the document.
Further, referring to Fig. 13, the device also includes:
a judging unit 306, configured to judge whether the new target query keywords satisfy the preset condition.
The acquiring unit 302 is further configured to, when the new target query keywords do not satisfy the preset condition, retrieve the first target document set again using the new target query keywords to obtain a new second target document set;
the acquiring unit 302 is further configured to expand the current target query keywords through the pseudo-relevance feedback model to obtain updated target query keywords;
the retrieval unit 301 is further configured to retrieve the first target document set again using the updated target query keywords, until the updated target query keywords satisfy the preset condition.
Further, referring to Fig. 14, the computing unit 303 specifically includes:
a first computation subunit 3034, configured to train the third target document set and obtain the weight of each tag in the third target document set;
a sentence-splitting subunit 3035, configured to remove the tags and split each target document in the third target document set into sentences;
the first computation subunit 3034 is further configured to calculate the weights of the tags corresponding to all words contained in each sentence, to obtain the tag-weight sum tagW(s) of each sentence.
Further,
the computing unit 303 is specifically configured to calculate the relevance score Score_{query}(s) of the target query keywords relative to each sentence;
where
Score_{query}(s) = queryC(s) \times \sum_{i=1}^{n} Occ(q_i, s) \times Weight(q_i)
queryC(s) is the number of distinct query keywords occurring in the sentence; Occ(q_i, s) is the number of times each query keyword q_i occurs in the sentence; Weight(q_i) is the weight of each query keyword q_i;
the computing unit 303 is specifically configured to calculate the score Score_{sw}(s) of the important words in each sentence, an important word being a word whose number of occurrences in the target document exceeds a threshold;
where
the computing unit 303 is specifically configured to calculate the title relevance score Score_{title}(s) of each sentence;
where
the computing unit 303 is specifically configured to perform relevance scoring Score_{rel}(s) on the text content of each sentence according to Score_{query}(s), Score_{sw}(s) and Score_{title}(s);
where
Score_{rel}(s) = \alpha\, Score_{query}(s) + \beta\, Score_{sw}(s) + \gamma\, Score_{title}(s)
\alpha, \beta and \gamma are preset tuning parameters.
The computing unit 303 is further specifically configured to obtain the final score Score(s) of each sentence according to the formula Score(s) = (1 + \sigma \times tagW(s)) \times Score_{rel}(s);
where \sigma is a preset tuning parameter.
With the device for document retrieval provided by the embodiment of the present invention, a user can retrieve documents with query keywords without browsing the entire content of the documents and without knowing the document structure. The device is applicable to the retrieval of massive document collections, with high retrieval performance and accuracy.
Through the above description of the embodiments, a person skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, and certainly can also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the part of the technical solutions of the present invention that contributes to the prior art can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, a hard disk or an optical disc of a computer, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the technical field can readily conceive of changes or substitutions within the technical scope disclosed by the present invention, and these shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the scope of the claims.

Claims (20)

1. A method of document retrieval, characterized by comprising:
retrieving a target document set in a pre-built inverted index using target query keywords obtained through preprocessing, to obtain a first target document set;
performing relevance scoring on the first target document set to obtain relevance scoring results of the first target documents, and re-ordering the first target document set according to the relevance scoring results to obtain a second target document set;
expanding the current target query keywords through a pseudo-relevance feedback model to obtain new target query keywords;
when the new target query keywords satisfy a preset condition, retrieving the first target document set again using the new target query keywords to obtain a third target document set;
splitting each target document in the third target document set into sentences, and calculating the tag-weight sum of each sentence obtained by the sentence splitting;
performing relevance scoring on the text content of each sentence according to the target query keywords to obtain the relevance scoring result of each sentence, and obtaining the final score of each sentence according to its relevance scoring result;
obtaining target sentences according to the final score of each sentence, and selecting, from the target sentences, sentences whose length falls within a preset range as retrieval result snippets;
before performing relevance scoring on the first target document set, the method further comprising:
training the first target document set to obtain the weight of each tag in the first target document set;
wherein training the first target document set to obtain the weight of each tag in the first target document set comprises:
obtaining all tag names in the first target document set;
dividing, according to the tag names, the elements in the first target document set into a set of elements related to the query and a set of elements unrelated to the query;
determining the weight of each tag in the first target document set according to the probability that each query keyword occurs in each related element and the probability that each query keyword occurs in each unrelated element.
2. The method according to claim 1, characterized in that, before retrieving the target document set in the pre-built inverted index using the target query keywords obtained through preprocessing to obtain the first target document set, the method further comprises:
obtaining the term frequency (TF) value and the inverse document frequency (IDF) value of each word in the target document set;
building the inverted index according to the TF value and IDF value of each word in the target document set;
performing stop-word removal and stem extraction on the query keywords to obtain the target query keywords;
building a retrieval model.
3. The method according to claim 2, characterized in that retrieving the target document set in the pre-built inverted index using the target query keywords obtained through preprocessing to obtain the first target document set comprises:
retrieving the target document set in the pre-built inverted index according to the retrieval model, using the target query keywords obtained through preprocessing, to obtain the first target document set.
4. The method according to claim 1, characterized in that
determining the weight of each tag in the first target document set according to the probability that each query keyword occurs in each related element and the probability that each query keyword occurs in each unrelated element comprises:
obtaining the number of times a that each query keyword t_i occurs in each related element b_k, and the total number A of all words in the set of related elements;
obtaining the number of times b that each query keyword t_i occurs in each unrelated element b_k, and the total number B of all words in the set of unrelated elements;
calculating, according to the number of times each query keyword t_i occurs in each related element b_k and the total number of all words in the set of related elements, the probability p_{ik} that each query keyword t_i occurs in each related element b_k;
wherein p_{ik} = \frac{a}{A};
calculating, according to the number of times each query keyword t_i occurs in each unrelated element b_k and the total number of all words in the set of unrelated elements, the probability q_{ik} that each query keyword t_i occurs in each unrelated element b_k;
wherein q_{ik} = \frac{b}{B};
calculating the weight of each tag m_j in the first target document set;
wherein the formula for the weight of tag m_j is:
f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\, t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)
t_{ik} is a 0-1 value, i.e. 0 or 1, indicating whether element b_k contains query keyword t_i; Q is the set of query keywords.
5. The method according to claim 1, characterized in that performing relevance scoring on the first target document set to obtain the relevance scoring results of the first target documents comprises:
extracting the SLCA subtree of each target document in the first target document set as the structural information of that target document;
performing relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document.
6. The method according to claim 5, characterized in that performing relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document comprises:
obtaining the number of times tf_{n,q} that each query keyword q of the target query keywords occurs in each node n;
calculating the TF value TF_q of each query keyword q in the first target document set;
obtaining, according to tf_{n,q} and TF_q of each query keyword q, the relevance score tw(n, q) of each query keyword q for the current node;
wherein tw(n, q) = \frac{tf_{n,q}}{TF_q};
when the current node n is a leaf node, calculating the sum of the relevance scores of all query keywords q relative to the current node n as the relevance score of the document.
7. The method according to claim 6, characterized in that, when the current node n is a non-leaf node, the method further comprises:
calculating the relevance scores tw(c, q) of all child nodes c of the current node n relative to the target query keywords;
obtaining, according to the relevance score tw(n, q) of each query keyword q relative to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n relative to the target query keywords, the relevance score tw_1(n, q) of each query keyword q relative to the current node n;
wherein tw_1(n, q) = tw(n, q) + \sum_{c \in children(n)} d_n \cdot tw(c, q);
calculating, according to the relevance score tw_1(n, q) of each query keyword q relative to the current node n, the sum of the relevance scores of all query keywords q relative to the current node n as the relevance score of the document.
8. The method according to claim 1, characterized in that, before retrieving the first target document set again using the new target query keywords when the new target query keywords satisfy the preset condition, the method further comprises:
judging whether the new target query keywords satisfy the preset condition;
when the new target query keywords do not satisfy the preset condition, retrieving the first target document set again using the new target query keywords to obtain a new second target document set;
expanding the current target query keywords through the pseudo-relevance feedback model to obtain updated target query keywords;
retrieving the first target document set again using the updated target query keywords, until the updated target query keywords satisfy the preset condition.
Method the most according to claim 1, it is characterised in that described to each mesh in described 3rd destination document set Mark document carries out subordinate sentence process, and calculating carries out described subordinate sentence and processes the label weight summation obtaining each sentence, including:
Train described 3rd destination document set, obtain the weight of each label in described 3rd destination document set;
Remove label, each destination document in described 3rd destination document set is carried out subordinate sentence process;
Calculate the weight of the label corresponding to all words that described each sentence comprises, total to obtain the label weight of each sentence With tagW (s).
Method the most according to claim 9, it is characterised in that described according to described target query key word to described often The content of text of individual sentence carries out dependency marking, including:
1) the target query key word Relevance scores Score relative to each sentence is calculatedquery(s);
Wherein,
QueryC (s) is the kind of the key word occurred in described each sentence;Occ(qi, it is s) that each searching keyword q is at sentence The number of times occurred in son;Weight(qi) it is the weight of each searching keyword q;
2) score Score of each important words in described each sentence is calculatedsw(s);Described important words is in described target The number of times occurred in document is more than the word of threshold number;
Wherein,
3) the title Relevance scores Score of described each sentence is calculatedtitle(s);
Wherein,
4) performing relevance scoring Score_rel(s) on the text content of each sentence according to Score_query(s), Score_sw(s), and Score_title(s);
wherein Score_rel(s) = α·Score_query(s) + β·Score_sw(s) + γ·Score_title(s);
α, β, and γ are preset tuning parameters;
the obtaining the final score of each sentence according to the relevance scoring result of each sentence comprises:
obtaining the final score Score(s) of each sentence according to the formula Score(s) = (1 + σ·tagW(s)) · Score_rel(s);
wherein σ is a preset tuning parameter.
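The following sketch combines the sentence scores as stated in claim 10. Only the two combination formulas (Score_rel and Score) come from the claim; the three component scores are simple stand-ins, and the default parameter values are arbitrary, since the printed claim omits both.

```python
def sentence_score(sentence_terms, query_weights, important_words, title_terms,
                   tagW, alpha=0.5, beta=0.3, gamma=0.2, sigma=0.1):
    """Illustrative combination from claim 10; component scores are stand-ins."""
    # Score_query(s): stand-in using keyword variety, occurrences, and keyword weights
    kinds = sum(1 for q in query_weights if q in sentence_terms)
    score_query = kinds * sum(sentence_terms.count(q) * w for q, w in query_weights.items())

    # Score_sw(s): stand-in count of important words in the sentence
    score_sw = sum(1 for t in sentence_terms if t in important_words)

    # Score_title(s): stand-in overlap with the document title
    score_title = sum(1 for t in sentence_terms if t in title_terms)

    # Combination formulas as stated in the claim
    score_rel = alpha * score_query + beta * score_sw + gamma * score_title
    return (1 + sigma * tagW) * score_rel
```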
11. A document retrieval device, characterized by comprising:
a retrieval unit, configured to retrieve a target document set in a pre-built inverted index with a target query keyword obtained through preprocessing, to obtain a first target document set;
an acquiring unit, configured to perform relevance scoring on the first target document set, obtain relevance scoring results of the first target documents, and re-rank the first target document set according to the relevance scoring results to obtain a second target document set;
the acquiring unit is further configured to expand the current target query keyword through the pseudo-relevance feedback model to obtain a new target query keyword;
the acquiring unit is further configured to, when the new target query keyword satisfies the preset condition, retrieve the first target document set again with the new target query keyword to obtain a third target document set;
a computing unit, configured to perform sentence splitting on each target document in the third target document set, and calculate the tag weight sum of each sentence obtained by the sentence splitting;
the computing unit is further configured to perform relevance scoring on the text content of each sentence according to the target query keyword, obtain the relevance scoring result of each sentence, and obtain the final score of each sentence according to the relevance scoring result of each sentence;
the acquiring unit is further configured to obtain target sentences according to the final scores of the sentences, and obtain, from the target sentences, sentences whose lengths fall within a preset length range as retrieval result snippets;
the computing unit is further configured to train on the first target document set to obtain the weight of each tag in the first target document set;
the computing unit specifically comprises:
an obtaining subunit, configured to obtain all tag names in the first target document set;
a classification subunit, configured to divide, according to the tag names, the elements in the first target document set into a query-relevant element set and a query-irrelevant element set;
the obtaining subunit is further configured to determine the weight of each tag in the first target document set according to the probability of each query keyword occurring in each relevant element and the probability of each query keyword occurring in each irrelevant element.
12. The device according to claim 11, characterized in that
the acquiring unit is further configured to obtain the term frequency (TF) value and the inverse document frequency (IDF) value, in the target document set, of each word in the target document set;
the device further comprises:
a building unit, configured to build the inverted index according to the TF value and the IDF value of each word in the target document set;
a processing unit, configured to perform stop-word removal and stemming on the query keyword to obtain the target query keyword;
the building unit is further configured to build a retrieval model.
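A minimal sketch of the index-building and preprocessing roles of claim 12's building and processing units, assuming whitespace tokenization and a caller-supplied stemmer; the exact TF/IDF formulation is not specified in the printed claim, so plain counts and log(N/df) are used as placeholders.

```python
import math
from collections import defaultdict

def build_inverted_index(docs):
    """Illustrative inverted index with per-document TF and collection IDF.
    `docs` maps doc id -> token list."""
    index = defaultdict(dict)                  # term -> {doc_id: term frequency}
    for doc_id, tokens in docs.items():
        for t in tokens:
            index[t][doc_id] = index[t].get(doc_id, 0) + 1

    n_docs = len(docs)
    idf = {t: math.log(n_docs / len(postings)) for t, postings in index.items()}
    return index, idf

def preprocess(query, stopwords, stem):
    """Stop-word removal and stemming for the processing unit of claim 12;
    `stem` is any stemming function supplied by the caller."""
    return [stem(t) for t in query.lower().split() if t not in stopwords]
```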
13. The device according to claim 12, characterized in that
the retrieval unit is specifically configured to retrieve, according to the retrieval model, the target document set in the pre-built inverted index with the target query keyword obtained through preprocessing, to obtain the first target document set.
14. The device according to claim 11, characterized in that
the obtaining subunit is further configured to obtain the number of times a that each query keyword t_i occurs in each relevant element b_k and the total number A of all words in the relevant element set;
the obtaining subunit is further configured to obtain the number of times b that each query keyword t_i occurs in each irrelevant element b_k and the total number B of all words in the irrelevant element set;
a computing subunit is configured to calculate the probability p_ik of each query keyword t_i occurring in each relevant element b_k according to the number of times each query keyword t_i occurs in each relevant element b_k and the total number of all words in the relevant element set;
wherein p_ik = a / A;
the computing subunit is further configured to calculate the probability q_ik of each query keyword t_i occurring in each irrelevant element b_k according to the number of times each query keyword t_i occurs in each irrelevant element b_k and the total number of all words in the irrelevant element set;
wherein q_ik = b / B;
the computing subunit is further configured to calculate the weight of each tag m_j in the first target document set;
wherein the weight of tag m_j is computed by the formula:
f_tag(m_j) = Σ_{t_ik ∈ m_j, t_i ∈ Q} t_ik × log( p_ik (1 − q_ik) / ( q_ik (1 − p_ik) ) )
t_ik is a 0/1 value indicating whether the element b_k contains the query keyword t_i; Q is the set of query keywords.
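A hedged sketch of claim 14's tag weighting. The f_tag formula is taken from the claim; p and q, whose formulas are omitted in the printed text, are approximated here per query term over the whole relevant/irrelevant element sets (p = a/A, q = b/B), with clamping to avoid log of zero, so the indexing is a simplification of the claim's per-element probabilities.

```python
import math

def tag_weight(elements_with_tag, query_terms, relevant_words, irrelevant_words):
    """Illustrative tag weight in the spirit of claim 14's f_tag formula."""
    A = max(len(relevant_words), 1)              # total words in the relevant element set
    B = max(len(irrelevant_words), 1)            # total words in the irrelevant element set
    weight = 0.0
    for element_terms in elements_with_tag:      # every element carrying this tag
        for t in query_terms:
            a = relevant_words.count(t)
            b = irrelevant_words.count(t)
            p = min(max(a / A, 1e-9), 1 - 1e-9)  # clamp away from 0 and 1
            q = min(max(b / B, 1e-9), 1 - 1e-9)
            t_ik = 1 if t in element_terms else 0   # 0/1: does this element contain t?
            weight += t_ik * math.log(p * (1 - q) / (q * (1 - p)))
    return weight
```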
15. The device according to claim 11, characterized in that the acquiring unit specifically comprises:
an extraction subunit, configured to extract the SLCA subtree of each target document in the first target document set as the structural information of each target document;
a computing subunit, configured to perform relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document.
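An illustrative SLCA computation for claim 15's extraction subunit: a node is kept when its subtree contains every query keyword but no child subtree does. The naive bottom-up traversal below is an assumption made for clarity; practical systems compute SLCAs from inverted indexes.

```python
import xml.etree.ElementTree as ET

def slca_nodes(root, query_terms):
    """Return nodes whose subtree contains every query keyword while no child
    subtree does (the Smallest Lowest Common Ancestors). Naive sketch."""
    query_terms = {t.lower() for t in query_terms}
    results = []

    def contains_all(node):
        tokens = set(" ".join(node.itertext()).lower().split())
        return query_terms <= tokens

    def visit(node):
        """True if node's subtree contains all keywords; records SLCAs."""
        if not contains_all(node):
            return False
        child_hits = [visit(child) for child in list(node)]   # visit every child
        if not any(child_hits):
            results.append(node)          # no smaller subtree covers the whole query
        return True

    visit(root)
    return results

# Example: slca_nodes(ET.fromstring(xml_string), ["xml", "retrieval"])
```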
16. The device according to claim 15, characterized in that
the computing subunit is specifically configured to obtain the number of times tf_{n,q} that each query keyword q in the target query keyword occurs in each node n;
the computing subunit is specifically configured to calculate the TF value TF_q of each query keyword q in the first target document set;
the computing subunit is specifically configured to obtain, according to tf_{n,q} and TF_q of each query keyword q, the relevance score tw(n, q) of each query keyword q relative to the current node;
Wherein,
the computing subunit is specifically configured to, when the current node n is a leaf node, calculate the sum of the relevance scores of the query keywords q relative to the current node n, and take the sum as the relevance score of the document.
17. The device according to claim 16, characterized in that
the computing subunit is further configured to, when the current node n is a non-leaf node, calculate the relevance score tw(c, q) of each child node c of the current node n relative to the target query keyword;
the computing subunit is further configured to calculate, according to the relevance score tw(n, q) of each query keyword q relative to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n relative to the target query keyword, the relevance score tw1(n, q) of each query keyword q relative to the current node n;
wherein tw1(n, q) = tw(n, q) + Σ_{c ∈ children(n)} d_n · tw(c, q);
the computing subunit is further configured to calculate, according to the relevance score tw1(n, q) of each query keyword q relative to the current node n, the sum of the relevance scores of all the query keywords q relative to the current node n, and take the sum as the relevance score of the document.
18. The device according to claim 11, characterized in that the device further comprises:
a judging unit, configured to judge whether the new target query keyword satisfies the preset condition;
the acquiring unit is further configured to, when the new target query keyword does not satisfy the preset condition, retrieve the first target document set again with the new target query keyword to obtain a new second target document set;
the acquiring unit is further configured to expand the current target query keyword through the pseudo-relevance feedback model to obtain an updated target query keyword;
the retrieval unit is further configured to, until the updated target query keyword satisfies the preset condition, retrieve the first target document set again with the updated target query keyword.
19. The device according to claim 11, characterized in that the computing unit specifically comprises:
a first computing subunit, configured to train on the third target document set to obtain the weight of each tag in the third target document set;
a sentence-splitting subunit, configured to remove the tags and perform sentence splitting on each target document in the third target document set;
the first computing subunit is further configured to calculate the weights of the tags corresponding to all the words contained in each sentence, to obtain the tag weight sum tagW(s) of each sentence.
20. The device according to claim 19, characterized in that
the computing unit is specifically configured to calculate the relevance score Score_query(s) of the target query keyword relative to each sentence;
Wherein,
queryC(s) is the number of distinct query keywords occurring in each sentence; Occ(q_i, s) is the number of times each query keyword q_i occurs in the sentence; Weight(q_i) is the weight of each query keyword q_i;
the computing unit is specifically configured to calculate the score Score_sw(s) of each important word in each sentence, an important word being a word whose number of occurrences in the target document exceeds a threshold number;
Wherein,
the computing unit is specifically configured to calculate the title relevance score Score_title(s) of each sentence;
Wherein,
the computing unit is specifically configured to perform relevance scoring Score_rel(s) on the text content of each sentence according to Score_query(s), Score_sw(s), and Score_title(s);
wherein Score_rel(s) = α·Score_query(s) + β·Score_sw(s) + γ·Score_title(s);
α, β, and γ are preset tuning parameters;
the computing unit is further specifically configured to obtain the final score Score(s) of each sentence according to the formula Score(s) = (1 + σ·tagW(s)) · Score_rel(s);
wherein σ is a preset tuning parameter.
CN201210360872.XA 2012-09-21 2012-09-21 A kind of method and device of file retrieval Expired - Fee Related CN103678412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210360872.XA CN103678412B (en) 2012-09-21 2012-09-21 A kind of method and device of file retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210360872.XA CN103678412B (en) 2012-09-21 2012-09-21 A kind of method and device of file retrieval

Publications (2)

Publication Number Publication Date
CN103678412A CN103678412A (en) 2014-03-26
CN103678412B true CN103678412B (en) 2016-12-21

Family

ID=50315993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210360872.XA Expired - Fee Related CN103678412B (en) 2012-09-21 2012-09-21 A kind of method and device of file retrieval

Country Status (1)

Country Link
CN (1) CN103678412B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268227B (en) * 2014-09-26 2017-10-10 天津大学 High-quality correlated samples chooses method automatically in picture search based on reverse k neighbours
CN104765862A (en) * 2015-04-22 2015-07-08 百度在线网络技术(北京)有限公司 Document retrieval method and device
CN106372087B (en) * 2015-07-23 2019-12-13 北京大学 information map generation method facing information retrieval and dynamic updating method thereof
CN106294662A (en) * 2016-08-05 2017-01-04 华东师范大学 Inquiry based on context-aware theme represents and mixed index method for establishing model
CN106294784B (en) * 2016-08-12 2019-12-17 合一智能科技(深圳)有限公司 resource searching method and device
CN107247745B (en) * 2017-05-23 2018-07-03 华中师范大学 A kind of information retrieval method and system based on pseudo-linear filter model
CN108062355B (en) * 2017-11-23 2020-07-31 华南农业大学 Query term expansion method based on pseudo feedback and TF-IDF
CN108345679B (en) * 2018-02-26 2021-03-23 科大讯飞股份有限公司 Audio and video retrieval method, device and equipment and readable storage medium
CN108520033B (en) * 2018-03-28 2020-01-24 华中师范大学 Enhanced pseudo-correlation feedback model information retrieval method based on hyperspace simulation language
CN109992647B (en) * 2019-04-04 2021-11-12 鼎富智能科技有限公司 Content searching method and device
CN111949679A (en) * 2019-05-17 2020-11-17 上海戈吉网络科技有限公司 Document retrieval system and method
CN112732864B (en) * 2020-12-25 2021-11-09 中国科学院软件研究所 Document retrieval method based on dense pseudo query vector representation
CN114817639B (en) * 2022-05-18 2024-05-10 山东大学 Webpage diagram convolution document ordering method and system based on contrast learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916904A (en) * 2006-09-01 2007-02-21 北大方正集团有限公司 Method of abstracting single file based on expansion of file

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120150856A1 (en) * 2010-12-11 2012-06-14 Pratik Singh System and method of ranking web sites or web pages or documents based on search words position coordinates

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916904A (en) * 2006-09-01 2007-02-21 北大方正集团有限公司 Method of abstracting single file based on expansion of file

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"PKU at INEX 2011XML Snippet Trace";Songlin Wang 等,;《Lecture Notes in Computer Science(LNCS)》;20111231;第2节及第3.1-3.3节 *
"基于倒排索引的文本相似搜索";杨建武 等,;《计算机工程》;20050331;第31卷(第5期);第2.1-2.6节、第3.1-3.4节及第4节 *

Also Published As

Publication number Publication date
CN103678412A (en) 2014-03-26

Similar Documents

Publication Publication Date Title
CN103678412B (en) A kind of method and device of file retrieval
CN105843795A (en) Topic model based document keyword extraction method and system
CN104008171A (en) Legal database establishing method and legal retrieving service method
CN106649666A (en) Left-right recursion-based new word discovery method
CN105404677A (en) Tree structure based retrieval method
Bhardwaj et al. A novel approach for content extraction from web pages
CN107391690B (en) Method for processing document information
CN103177122B (en) Personal desktop document searching method based on synonyms
WO2016099422A2 (en) Content sensitive document ranking method by analyzing the citation contexts
Zeng et al. Linking entities in short texts based on a Chinese semantic knowledge base
CN105426490A (en) Tree structure based indexing method
Zhao et al. Expanding approach to information retrieval using semantic similarity analysis based on WordNet and Wikipedia
Rathore et al. Ontology based web page topic identification
Ren et al. Role-explicit query extraction and utilization for quantifying user intents
Phan et al. Automated data extraction from the web with conditional models
Ayyasamy et al. Mining Wikipedia knowledge to improve document indexing and classification
Abuteir et al. Automatic sarcasm detection in Arabic text: A supervised classification approach
Moghadam et al. Comparative study of various Persian stemmers in the field of information retrieval
CN112100500A (en) Example learning-driven content-associated website discovery method
Qureshi et al. Exploiting Wikipedia to Identify Domain-Specific Key Terms/Phrases from a Short-Text Collection.
Lim et al. Generalized and lightweight algorithms for automated web forum content extraction
Li et al. Adding Lexical Chain to Keyphrase Extraction
Ghorai An Information Retrieval System for FIRE 2016 Microblog Track.
Holzmann et al. Named entity evolution recognition on the Blogosphere
Baliyan et al. Related Blogs’ Summarization With Natural Language Processing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220623

Address after: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee after: Peking University

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee before: Peking University

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

TR01 Transfer of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161221

CF01 Termination of patent right due to non-payment of annual fee