CN103678412A - Document retrieval method and device - Google Patents

Info

Publication number
CN103678412A
Authority
CN
China
Prior art keywords
score
document
keyword
sentence
correlativity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210360872.XA
Other languages
Chinese (zh)
Other versions
CN103678412B (en)
Inventor
洪毅虹
杨建武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd
Priority to CN201210360872.XA
Publication of CN103678412A
Application granted
Publication of CN103678412B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/835 Query processing
    • G06F16/8373 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/835 Query processing
    • G06F16/8365 Query optimisation

Abstract

The invention provides a document retrieval method and device, belonging to the field of information retrieval. In the method, a target document set is searched through a pre-built inverted index using target query keywords to obtain a first target document set; the first target documents are scored for relevance and re-ranked according to the scores to obtain a second target document set; the current target query keywords are expanded through a pseudo-relevance feedback model to obtain new target query keywords, which are used to obtain a third target document set; each target document in the third target document set is split into sentences, and the sum of the tag weights of each sentence is calculated; the text content of each sentence is scored for relevance against the target query keywords to obtain a final score for each sentence, from which target sentences are selected; and the target sentences whose length falls within a preset range are returned as the retrieved result fragments. The method and device improve the retrieval performance and precision of XML document retrieval.

Description

Document retrieval method and device
Technical field
The present invention relates to the field of information retrieval, and in particular to a document retrieval method and device.
Background
HTML (Hypertext Markup Language), the main carrier of traditional Internet information, gives users a convenient way to present information and is mainly concerned with how information is displayed in a browser. As Web applications have become more widespread, the limitations of the HTML data model have become increasingly apparent: HTML cannot describe data, and its tag set is fixed and limited, so users cannot add meaningful markup to suit their own needs. XML (Extensible Markup Language) arose in response.
XML is self-describing, platform-independent, extensible, and simple to use. It can represent data in a readable form without being tied to a particular presentation. XML allows data to be exchanged between incompatible systems, which reduces the complexity of sharing and transmitting data. An XML document carries both content information and structural information, which makes it possible to exchange and integrate large volumes of data over the Internet. As more and more Web applications, such as web services, e-commerce, and digital libraries, adopt XML as the carrier for storing, exchanging, and publishing massive amounts of data, how to efficiently retrieve useful information from massive XML document collections has attracted the attention of a growing number of researchers.
At present, XML documents can be retrieved through the following two retrieval modes:
The first is retrieval based on XML document structure.
In this mode, the user must understand the structure of the XML documents being queried in order to construct a query expression.
The second is retrieval based on keywords.
In this mode, query expressions are written in advance by an author, so the user needs neither to learn a complex query language nor to understand the underlying data structure of the XML documents in depth; the user only has to enter keywords related to the content of interest to complete a query. Existing methods include MLCA, SLCA, XRank, XSEarch, and XSeek.
However, the first mode has two problems. On the one hand, most XML documents on the Internet do not provide users with complete structural information; on the other hand, the Internet contains a large number of heterogeneous XML documents. In both cases it is difficult for the user to construct a query expression over the XML structure with an existing query language. In the second mode, most XML keyword query methods are built on a tree-shaped storage model, which requires the author to know the structure of the XML documents in advance when writing the query expression.
In summary, existing XML retrieval models require the user either to browse the full content of the XML documents or to know in advance the structure of the XML documents being queried, and they consume a large amount of storage space during retrieval. Now that XML document collections hold massive amounts of data, the retrieval performance and accuracy of existing XML retrieval models are low.
Summary of the invention
Embodiments of the present invention provide a document retrieval method and device that improve the retrieval performance and accuracy of XML document retrieval.
To achieve the above object, embodiments of the present invention adopt the following technical solutions:
A document retrieval method, comprising:
searching a target document set through a pre-built inverted index using target query keywords obtained by preprocessing, to obtain a first target document set;
scoring the first target document set for relevance to obtain relevance scores for the first target documents, and re-ranking the first target document set according to the relevance scores to obtain a second target document set;
expanding the current target query keywords through a pseudo-relevance feedback model to obtain new target query keywords;
when the new target query keywords meet a preset condition, searching the first target document set again using the new target query keywords to obtain a third target document set;
splitting each target document in the third target document set into sentences, and calculating the sum of the tag weights of each resulting sentence;
scoring the text content of each sentence for relevance against the target query keywords to obtain a relevance score for each sentence, and obtaining a final score for each sentence from its relevance score;
obtaining target sentences according to the final score of each sentence, and taking the target sentences whose length falls within a preset length range as the retrieval result fragments.
The present invention also provides a document retrieval device, comprising:
a retrieval unit, configured to search a target document set through a pre-built inverted index using target query keywords obtained by preprocessing, to obtain a first target document set;
an acquiring unit, configured to score the first target document set for relevance to obtain relevance scores for the first target documents, and to re-rank the first target document set according to the relevance scores to obtain a second target document set;
the acquiring unit being further configured to expand the current target query keywords through a pseudo-relevance feedback model to obtain new target query keywords;
the acquiring unit being further configured to, when the new target query keywords meet the preset condition, search the first target document set again using the new target query keywords to obtain a third target document set;
the acquiring unit being further configured to, when the new target query keywords do not meet the preset condition, search the first target document set again using the new target query keywords to obtain a new second target document set;
a computing unit, configured to split each target document in the third target document set into sentences, and to calculate the sum of the tag weights of each resulting sentence;
the computing unit being further configured to score the text content of each sentence for relevance against the target query keywords to obtain a relevance score for each sentence, and to obtain a final score for each sentence from its relevance score;
the acquiring unit being further configured to obtain target sentences according to the final score of each sentence, and to take the target sentences whose length falls within a preset length range as the retrieval result fragments.
The document retrieval method and device provided by the embodiments of the present invention allow a user to retrieve with query keywords without browsing the full content of the documents and without knowing the document structure, are suitable for retrieval over massive document collections, and offer high retrieval performance and accuracy.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the document retrieval method provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the first stage of the document retrieval method provided by Embodiment 2 of the present invention;
Fig. 3 is a flowchart of the second stage of the document retrieval method provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic flowchart of the method, provided by Embodiment 2 of the present invention, for training on the first target document set to obtain tag weights;
Fig. 5 is a schematic flowchart of the method, provided by Embodiment 2 of the present invention, for calculating the relevance score of the SLCA subtree of each document;
Fig. 6 is a flowchart of the third stage of the document retrieval method provided by Embodiment 2 of the present invention;
Fig. 7 is a schematic flowchart of the method, provided by Embodiment 2 of the present invention, for splitting each target document in the third target document set into sentences and calculating the sum of the tag weights of each resulting sentence;
Fig. 8 is a schematic flowchart of the method, provided by Embodiment 2 of the present invention, for scoring the text content of each sentence according to the target query keywords;
Fig. 9 is a structural diagram of the document retrieval device provided by Embodiment 3 of the present invention;
Fig. 10 is a second structural diagram of the document retrieval device provided by Embodiment 3 of the present invention;
Fig. 11 is a structural diagram of the computing unit in the document retrieval device provided by Embodiment 3 of the present invention;
Fig. 12 is a structural diagram of the acquiring unit in the document retrieval device provided by Embodiment 3 of the present invention;
Fig. 13 is a third structural diagram of the document retrieval device provided by Embodiment 3 of the present invention;
Fig. 14 is a second structural diagram of the computing unit in the document retrieval device provided by Embodiment 3 of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Embodiment 1
Referring to Fig. 1, the document retrieval method provided by this embodiment comprises:
A. Search a target document set through a pre-built inverted index using target query keywords obtained by preprocessing, to obtain a first target document set.
B. Score the first target document set for relevance to obtain relevance scores for the first target documents, and re-rank the first target document set according to the relevance scores to obtain a second target document set.
C. Expand the current target query keywords through a pseudo-relevance feedback model to obtain new target query keywords.
D. When the new target query keywords meet a preset condition, search the first target document set again using the new target query keywords to obtain a third target document set.
E. Split each target document in the third target document set into sentences, and calculate the sum of the tag weights of each resulting sentence.
F. Score the text content of each sentence for relevance against the target query keywords to obtain a relevance score for each sentence, and obtain a final score for each sentence from its relevance score.
G. Obtain target sentences according to the final score of each sentence, and take the target sentences whose length falls within a preset length range as the retrieval result fragments.
With the document retrieval method provided by this embodiment, a user can retrieve with query keywords without browsing the full content of the documents and without knowing the document structure; the method is suitable for retrieval over massive document collections and offers high retrieval performance and accuracy.
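As a rough illustration of step G, the following Python sketch picks result fragments from scored sentences. The source does not say how target sentences are chosen from the final scores or how length is measured; taking the top_k highest-scoring sentences and measuring length in characters are assumptions, as are the function name and the default bounds.

```python
def select_fragments(scored_sentences, top_k=3, min_len=10, max_len=100):
    """Pick the top_k sentences by final score, then keep those whose
    character length lies within [min_len, max_len]."""
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    targets = ranked[:top_k]  # "target sentences" (top-k selection is an assumption)
    return [text for text, _ in targets if min_len <= len(text) <= max_len]


scored = [
    ("short", 0.9),                           # too short, filtered out
    ("a sentence of moderate length", 0.8),   # kept
    ("x" * 300, 0.7),                         # too long, filtered out
    ("another fitting sentence here", 0.1),   # not among the top 3
]
fragments = select_fragments(scored)
print(fragments)
```

Filtering by length after ranking keeps the score-based choice of target sentences separate from the length constraint, mirroring the order in which step G states them.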
Embodiment 2
In this embodiment, the document retrieval method is divided into three stages. The first stage is the fuzzy retrieval stage, which narrows down the semi-structured document set; the second stage is the precise retrieval stage, which obtains the set of documents accurately related to the query; the third stage is the fragment generation stage.
Referring to Fig. 2, the first stage of the document retrieval method provided by this embodiment comprises:
First stage: fuzzy retrieval stage.
201: Preprocess the target document set.
In this embodiment, the target document set is the set of semi-structured XML documents that is about to be queried.
Preprocessing the target document set can be implemented through the following sub-steps:
1) Remove the stop words from the target document set.
Stop words can be set in advance by the user. In English they can be words without concrete meaning, such as "in", "the", and "oh", as well as punctuation marks; in Chinese they can likewise be function words without concrete meaning, as well as punctuation marks.
For example, the following two articles are part of the document set and are used to illustrate stop word removal:
The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.
The content of article 2 is: He once lived in Shanghai.
The content of each article is a character string. First, all the words of article 1 and article 2 are found by splitting on spaces, each word being a keyword; then the stop words are removed from both articles. After stop word removal, the articles are as follows:
Article 1 after stop word removal: [Tom] [lives] [Guangzhou] [I] [live] [Guangzhou]
Article 2 after stop word removal: [He] [lived] [Shanghai]
Note that when a document contains Chinese sentences, the Chinese sentences must first be segmented into words using existing techniques, and then the stop words are removed from the document.
2) Extract the stems of the target document set.
First, when the content of the target document set is in English, the case of all words is unified; for example, when the user searches for "He", the words "HE" and "he" are also found.
Second, when the content of the target document set is in English, all words are reduced to their base form; for example, when the user searches for "live", the words "lives" and "lived" are also found, so "lives" and "lived" must be reduced to "live".
For example, taking article 1 and article 2 after stop word removal, stemming gives:
All keywords of article 1 are: [tom] [live] [guangzhou] [i] [live] [guangzhou]
All keywords of article 2 are: [he] [live] [shanghai]
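Sub-steps 1) and 2) can be sketched in Python as follows. The stop-word set and the small lookup table standing in for a stemmer are hypothetical and chosen only to reproduce the two example articles; a regular expression replaces plain splitting on spaces so that punctuation is dropped as well.

```python
import re

# Hypothetical stop-word list; the source only says it is set in advance by the user.
STOP_WORDS = {"in", "the", "oh", "too", "once"}

# Toy lookup table standing in for a real stemmer (an assumption).
STEMS = {"lives": "live", "lived": "live"}


def preprocess(text):
    """Tokenize, drop stop words and punctuation, lowercase, and stem."""
    tokens = re.findall(r"[A-Za-z]+", text)  # keeps letters only, so punctuation is removed
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    return [STEMS.get(t.lower(), t.lower()) for t in kept]


article1 = preprocess("Tom lives in Guangzhou, I live in Guangzhou too.")
article2 = preprocess("He once lived in Shanghai.")
print(article1)
print(article2)
```

In practice the lookup table would be replaced by a proper stemming algorithm, and Chinese text would need word segmentation before this step.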
3) Calculate the TF (term frequency) value and the IDF (inverse document frequency) value of each stemmed word in the target document set.
The TF value of each word in the target document set can be calculated with the following formula:
$$TF_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where $n_{i,j}$ is the number of occurrences of word $t_i$ in document $d_j$ of the target document set, and the denominator is the total number of occurrences of all words in document $d_j$.
The IDF value of each word in the target document set can be calculated with the following formula:
$$IDF_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ is the total number of documents in the target document set, and $|\{j : t_i \in d_j\}|$ is the number of documents containing word $t_i$ (that is, the number of documents for which $n_{i,j} \neq 0$).
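The two formulas can be checked on the example articles with a short Python sketch. The function names are illustrative; the natural logarithm and returning 0.0 for a word found in no document are assumptions, since the source specifies neither the log base nor that case.

```python
import math
from collections import Counter


def tf(word, doc):
    """TF_{i,j}: occurrences of word in doc divided by the total word count of doc."""
    return Counter(doc)[word] / len(doc)


def idf(word, docs):
    """IDF_i: log of (number of documents / number of documents containing word)."""
    containing = sum(1 for d in docs if word in d)
    if containing == 0:
        return 0.0  # assumption: the source does not cover words absent from every document
    return math.log(len(docs) / containing)


docs = [
    ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    ["he", "live", "shanghai"],
]
print(tf("live", docs[0]))   # 2 of the 6 words in article 1
print(idf("live", docs))     # "live" occurs in both articles
print(idf("tom", docs))      # "tom" occurs in one of two articles
```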
202: Build an inverted index over the preprocessed target document set.
For example, taking article 1 and article 2 above, after the inverted index is built, each keyword maps to the article number, the [frequency of occurrence], and the keyword positions as follows:
keyword      article[frequency]   positions
guangzhou    1[2]                 3,6
he           2[1]                 1
i            1[1]                 4
live         1[2], 2[1]           2,5 (article 1); 2 (article 2)
shanghai     2[1]                 3
tom          1[1]                 1
Once the inverted index is built, the number of times a keyword occurs in each article and the specific positions at which it occurs can be looked up.
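A minimal Python sketch of such an inverted index, assuming a nested dictionary layout (keyword to article number to list of 1-based positions) that the source does not prescribe; the frequency is recovered as the length of each position list.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each keyword to {article number: [1-based positions]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens, start=1):
            index[tok].setdefault(doc_id, []).append(pos)
    return index


docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}
index = build_inverted_index(docs)
for keyword in sorted(index):
    postings = index[keyword]
    # Print in the "article[frequency] positions" style of the example above.
    print(keyword, {doc_id: (len(p), p) for doc_id, p in postings.items()})
```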
203: Preprocess the query keywords to obtain the target query keywords.
Preprocessing the query keywords can be implemented through the following sub-steps:
1) Remove the stop words from the query keywords.
Note that this step is implemented in the same way as the removal of stop words from the target document set in sub-step 1) of step 201, and is not described again here.
2) Extract the stems of the query keywords to obtain the target query keywords.
Note that this step is implemented in the same way as the extraction of stems from the target document set in sub-step 2) of step 201, and is not described again here.
204: According to the retrieval model, search the target document set through the inverted index using the target query keywords obtained by preprocessing, to obtain the first target document set.
Note that the retrieval model is built using probabilistic statistical methods and a language model, and that Dirichlet smoothing is used during retrieval, which narrows the scope of the target document set. Building the retrieval model and Dirichlet smoothing both belong to the prior art and are not described in detail here.
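The source treats the retrieval model as prior art and gives no formula for it. As a rough sketch only, one common form of query-likelihood scoring with Dirichlet smoothing estimates p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu) and sums the logarithms over the query words. The function name, the value of mu, and the skipping of words unseen in the collection are assumptions.

```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection, mu=2000.0):
    """Log query likelihood of doc under a Dirichlet-smoothed unigram model."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 0.0
    for q in query:
        p_coll = coll_counts[q] / len(collection)
        if p_coll == 0.0:
            continue  # assumption: ignore query words unseen in the whole collection
        p = (doc_counts[q] + mu * p_coll) / (len(doc) + mu)
        score += math.log(p)
    return score


doc1 = ["tom", "live", "guangzhou", "i", "live", "guangzhou"]
doc2 = ["he", "live", "shanghai"]
collection = doc1 + doc2
s1 = dirichlet_score(["guangzhou"], doc1, collection, mu=10.0)
s2 = dirichlet_score(["guangzhou"], doc2, collection, mu=10.0)
print(s1, s2)  # article 1 contains the query word, so it scores higher
```

Smoothing gives article 2 a finite score even though it never mentions the query word, so documents can still be ranked rather than discarded outright.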
Referring to Fig. 3, the second stage of the document retrieval method provided by this embodiment comprises:
Second stage: precise retrieval stage.
205: Train on the first target document set to obtain the weight of each tag in the first target document set.
Referring to Fig. 4, training on the first target document set to obtain the tag weights can be implemented through the following sub-steps:
2051: Obtain all tag names in the first target document set.
2052: According to the tag names, divide the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query.
2053: Obtain the number of times $a$ that each query keyword $t_i$ occurs in each relevant element $b_k$, and the total number $A$ of all words in the relevant element set.
Note that when the query is in English, each word is taken as a query keyword; when the query is a Chinese sentence, it must be segmented into words using existing techniques, and each resulting word is taken as a query keyword.
2054: Obtain the number of times $b$ that each query keyword $t_i$ occurs in each irrelevant element $b_k$, and the total number $B$ of all words in the irrelevant element set.
The same handling of English and Chinese query keywords as in sub-step 2053 applies here.
2055: From the number of times each query keyword $t_i$ occurs in each relevant element $b_k$ and the total number of all words in the relevant element set, calculate the probability $p_{ik}$ that query keyword $t_i$ occurs in relevant element $b_k$:
$$p_{ik} = \frac{a}{A}$$
2056: From the number of times each query keyword $t_i$ occurs in each irrelevant element $b_k$ and the total number of all words in the irrelevant element set, calculate the probability $q_{ik}$ that query keyword $t_i$ occurs in irrelevant element $b_k$:
$$q_{ik} = \frac{b}{B}$$
2057: Calculate the weight of each tag $m_j$ in the first target document set.
The weight of tag $m_j$ is calculated as:
$$f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\ t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)$$
where $t_{ik}$ is a binary value (0 or 1) indicating whether element $b_k$ contains query keyword $t_i$, and $Q$ is the set of query keywords.
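The tag weight formula can be sketched in Python as below. The input layout (one (t_ik, p_ik, q_ik) triple per keyword and element under the tag) is a hypothetical simplification, and clamping the probabilities away from 0 and 1 is an assumption added to keep the logarithm defined; the source does not say how those cases are handled.

```python
import math


def tag_weight(observations, eps=1e-6):
    """Sum t_ik * log(p_ik (1 - q_ik) / (q_ik (1 - p_ik))) over one tag m_j.

    observations: iterable of (t_ik, p_ik, q_ik) triples, where t_ik is 0 or 1.
    """
    total = 0.0
    for t_ik, p, q in observations:
        # Clamp to (0, 1) so the log-odds ratio stays finite (an assumption).
        p = min(max(p, eps), 1 - eps)
        q = min(max(q, eps), 1 - eps)
        total += t_ik * math.log((p * (1 - q)) / (q * (1 - p)))
    return total


# A keyword twice as likely in relevant elements as in irrelevant ones
# gives the tag a positive weight.
print(tag_weight([(1, 0.2, 0.1)]))
# Equal probabilities carry no evidence either way.
print(tag_weight([(1, 0.5, 0.5)]))
```

The logarithm is the familiar log-odds ratio: a tag gains weight when query keywords are more likely to appear in its relevant elements than in its irrelevant ones.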
206: Preprocess the query keywords to obtain the target query keywords.
The target query keywords comprise several query keywords q.
Note that the method of preprocessing the query keywords in this step is the same as in step 203 above, and is not described again here.
207: Extract the SLCA (smallest lowest common ancestor) subtree of each target document in the first target document set as the structural information of that document.
208: Score the SLCA subtree of each target document for relevance to obtain the relevance score of each target document.
Referring to Fig. 5, the relevance score of the SLCA subtree of each document can be calculated bottom-up, through the following sub-steps:
2081: Obtain the number of times $tf_{n,q}$ that each query keyword q in the target query keywords occurs in each node n.
2082: Calculate the TF value $TF_q$ of each query keyword q in the first target document set.
The TF value of each query keyword q in the first target document set is calculated in the same way as the TF value of each stemmed word in the target document set in sub-step 3) of step 201, and is not repeated here.
2083: From $tf_{n,q}$ and $TF_q$ of each query keyword q, obtain the relevance score $tw(n, q)$ of each query keyword q for the current node:
$$tw(n, q) = \frac{tf_{n,q}}{TF_q}$$
2084: When the current node n is a leaf node, calculate the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.
2085: When the current node n is a non-leaf node, calculate the relevance score $tw(c, q)$ of every child node c of the current node n with respect to the target query keywords.
2086: From the relevance score $tw(n, q)$ of each query keyword q with respect to the current node n and the relevance scores $tw(c, q)$ of all child nodes c of the current node n, obtain the relevance score $tw_1(n, q)$ of each query keyword q with respect to the current node n:
$$tw_1(n, q) = tw(n, q) + \sum_{c \in children(n)} d_n \, tw(c, q)$$
2087: From the relevance score $tw_1(n, q)$ of each query keyword q with respect to the current node n, calculate the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.
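The bottom-up scoring of sub-steps 2081 to 2087 can be sketched in Python as follows. The Node class is hypothetical, the factor d_n is treated as a single constant decay because the source does not define it, and the children's scores are computed recursively, which is one reading of "bottom-up" rather than something the source states outright. TF_q is assumed to be positive for every query keyword.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    term_counts: dict                      # tf_{n,q}: keyword counts in this node's own text
    children: list = field(default_factory=list)


def node_score(node, q, TF, decay=0.5):
    """tw_1(n, q) = tw(n, q) + sum over children c of d_n * tw(c, q)."""
    own = node.term_counts.get(q, 0) / TF[q]   # tw(n, q) = tf_{n,q} / TF_q
    if not node.children:                      # leaf node: own score only
        return own
    return own + sum(decay * node_score(c, q, TF, decay) for c in node.children)


def document_score(root, query, TF, decay=0.5):
    """Sum the root's score over all query keywords."""
    return sum(node_score(root, q, TF, decay) for q in query)


leaf1 = Node({"xml": 2})
leaf2 = Node({"xml": 1, "index": 1})
root = Node({"index": 1}, [leaf1, leaf2])
TF = {"xml": 0.5, "index": 0.25}
print(document_score(root, ["xml", "index"], TF))
```

With a decay below 1, matches deeper in the subtree contribute less to the document score than matches near the root.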
209: Obtain the second target document set according to the relevance score of each target document in the first target document set.
Specifically, the documents in the first target document set can be re-ranked in descending order of relevance score, or in ascending order of relevance score.
Optionally, after the documents are re-ranked, documents whose score is less than a first preset value can be excluded, and the documents whose score is greater than or equal to the first preset value are kept as the second target document set.
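A short Python sketch of step 209, using descending order and an optional cut-off; the function name and the dictionary input are illustrative choices, and threshold plays the role of the first preset value.

```python
def rerank(scores, threshold=None):
    """Order document ids by relevance score (highest first) and optionally
    drop documents scoring below threshold."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(doc, s) for doc, s in ranked if s >= threshold]
    return [doc for doc, _ in ranked]


scores = {"a": 0.2, "b": 0.9, "c": 0.5}
print(rerank(scores))                 # all documents, best first
print(rerank(scores, threshold=0.5))  # documents at or above the preset value
```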
210: Based on the target documents in the current second target document set, expand the target query keywords using the pseudo-relevance feedback model to obtain new target query keywords, and judge whether the new target query keywords meet the preset condition;
When the new target query keywords do not meet the preset condition, perform step 211;
When the new target query keywords meet the preset condition, perform step 212.
In this embodiment, specifically, the target query keywords can be expanded using the pseudo-relevance feedback model based on a preset number of the higher-scoring target documents in the second target document set, to obtain new target query keywords, and it is then judged whether the new target query keywords meet the preset condition.
Note that the preset condition can be the number of keywords or the number of keyword stems, but is not limited to these.
211: Search the first target document set again using the new target query keywords to obtain a new second target document set, and return to step 210.
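The source does not describe how the pseudo-relevance feedback model chooses expansion terms. The Python sketch below assumes the simplest variant: take the most frequent non-query words from the top-ranked documents, and treat "number of keywords" as the preset condition. All names and default values are hypothetical.

```python
from collections import Counter


def expand_query(query, ranked_docs, top_docs=2, new_terms=2):
    """Add the most frequent non-query words from the top-ranked documents."""
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in doc if t not in query)
    # Sort by frequency (descending), then alphabetically so ties are deterministic.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:new_terms]
    return list(query) + [term for term, _ in best]


def meets_condition(query, min_keywords=3):
    """Preset condition, assumed here to be a minimum number of keywords."""
    return len(query) >= min_keywords


ranked_docs = [
    ["xml", "index", "query", "index"],
    ["xml", "index", "tree"],
    ["noise"],
]
new_query = expand_query(["xml"], ranked_docs)
print(new_query, meets_condition(new_query))
```

If the condition is not met, step 211 would search again with new_query, re-rank, and call expand_query once more on the new ranking.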
Referring to Fig. 6, the third stage of the document retrieval method provided by this embodiment comprises:
Third stage: fragment generation stage.
212: Search the first target document set again using the new target query keywords to obtain the third target document set.
Note that the method of obtaining the tag weights of the documents in this step is the same as the method of obtaining the tag weights in step 205 above, and is not described again here.
213: Split each target document in the third target document set into sentences, and calculate the sum of the tag weights of each resulting sentence.
Referring to Fig. 7, this can be implemented through the following sub-steps:
2131: Train on the third target document set to obtain the weight of each tag in the third target document set;
2132: Remove the tags and split each target document in the third target document set into sentences.
Note that splitting a document into sentences belongs to the prior art and is not described again here.
2133: Calculate the weights of the tags corresponding to all the words contained in each sentence, to obtain the tag weight sum tagW(s) of each sentence.
The tag weight sum of each sentence is the sum of the weights of the tags corresponding to all the words that the sentence contains.
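A minimal Python sketch of sub-step 2133, assuming each word of a sentence has been kept together with the name of the tag it came from; that pairing, the function name, and the weight of 0.0 for tags without a trained weight are assumptions.

```python
def sentence_tag_weight(sentence_tokens, tag_weights):
    """tagW(s): sum of the weights of the tags of all words in the sentence.

    sentence_tokens: list of (word, tag name) pairs.
    tag_weights: mapping from tag name to its trained weight.
    """
    return sum(tag_weights.get(tag, 0.0) for _, tag in sentence_tokens)


tag_weights = {"title": 2.0, "p": 0.5}
sentence = [("xml", "title"), ("retrieval", "title"), ("works", "p")]
print(sentence_tag_weight(sentence, tag_weights))
```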
214: Preprocess the query keywords to obtain the target query keywords.
Here, the target query keywords comprise several query keywords q.
It should be noted that the method for preprocessing the query keywords in this step is the same as the method used in step 203 above, and is not repeated here.
215: Score the text content of each sentence according to the target query keywords.
Referring to Fig. 8, this step is implemented as follows:
2151: Calculate the relevance score $\mathrm{Score}_{query}(s)$ of the target query keywords with respect to each sentence.
Here, the relevance of a sentence s to the target query keywords depends on three factors: the number of distinct keywords occurring in the sentence, queryC(s); the number of times each query keyword $q_i$ occurs in the sentence, $\mathrm{Occ}(q_i, s)$; and the weight of each query keyword, $\mathrm{Weight}(q_i)$.
Specifically, $\mathrm{Score}_{query}(s)$ can be calculated by the following formula:
$$\mathrm{Score}_{query}(s) = \mathrm{queryC}(s) \cdot \sum_{i=1}^{n} \mathrm{Occ}(q_i, s) \cdot \mathrm{Weight}(q_i)$$
2152: Calculate the important-word score $\mathrm{Score}_{sw}(s)$ of each sentence.
In this step, an important word is a word whose number of occurrences in the target document is greater than a threshold.
Here, $\mathrm{Score}_{sw}(s)$ can be calculated by the following formula:
Figure BDA00002177132500132
2153: Calculate the title relevance score $\mathrm{Score}_{title}(s)$ of each sentence.
Here, $\mathrm{Score}_{title}(s)$ can be calculated by the following formula:
Figure BDA00002177132500133
2154: According to $\mathrm{Score}_{query}(s)$, $\mathrm{Score}_{sw}(s)$ and $\mathrm{Score}_{title}(s)$, compute the relevance score $\mathrm{Score}_{rel}(s)$ of the text content of each sentence;
Here,
$$\mathrm{Score}_{rel}(s) = \alpha\,\mathrm{Score}_{query}(s) + \beta\,\mathrm{Score}_{sw}(s) + \gamma\,\mathrm{Score}_{title}(s)$$
where α, β and γ are three preset tuning parameters.
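Sub-steps 2151 and 2154 can be sketched as follows. This is a minimal illustration under stated assumptions: the sentence is already tokenized, keyword weights are supplied as a dictionary, and the important-word and title scores are passed in as plain numbers because their formulas are given only as figures in the text. The function names and the default values of `alpha`, `beta` and `gamma` are hypothetical.

```python
from typing import Dict, List


def score_query(sentence_tokens: List[str],
                keyword_weights: Dict[str, float]) -> float:
    """Score_query(s) = queryC(s) * sum_i Occ(q_i, s) * Weight(q_i)."""
    # Occ(q_i, s): occurrences of each query keyword in the sentence.
    occurrences = {q: sentence_tokens.count(q) for q in keyword_weights}
    # queryC(s): how many distinct query keywords occur at least once.
    query_c = sum(1 for n in occurrences.values() if n > 0)
    weighted = sum(occurrences[q] * w for q, w in keyword_weights.items())
    return query_c * weighted


def score_rel(s_query: float, s_sw: float, s_title: float,
              alpha: float = 0.5, beta: float = 0.3,
              gamma: float = 0.2) -> float:
    """Score_rel(s) as the weighted sum of the three partial scores.

    The default alpha, beta, gamma are placeholders, not values from the text.
    """
    return alpha * s_query + beta * s_sw + gamma * s_title


if __name__ == "__main__":
    tokens = ["xml", "retrieval", "xml", "index"]
    weights = {"xml": 1.0, "retrieval": 2.0, "ranking": 3.0}
    sq = score_query(tokens, weights)
    print(sq, score_rel(sq, s_sw=1.0, s_title=2.0))
```

Multiplying by queryC(s) rewards sentences that cover several different keywords over sentences that merely repeat one keyword many times.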
216: Calculate the final score of each sentence.
The final score of each sentence can be calculated by the following formula:
$$\mathrm{Score}(s) = (1 + \sigma \cdot \mathrm{tagW}(s)) \cdot \mathrm{Score}_{rel}(s)$$
where σ is a tuning parameter.
217: Obtain the target sentences according to the final score of each sentence; the score of a target sentence is greater than or equal to a second preset value.
218: Among the target sentences, take the sentences whose length falls within a preset length range as the retrieval result snippets.
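Steps 216 to 218 can be sketched as below. The sketch assumes that sentence length is measured in characters and that the preset length range is inclusive at both ends; neither detail is stated in the text. The names `final_score` and `select_snippets`, and the default value of `sigma`, are hypothetical.

```python
from typing import List, Tuple


def final_score(tag_w: float, rel: float, sigma: float = 1.0) -> float:
    """Score(s) = (1 + sigma * tagW(s)) * Score_rel(s)."""
    return (1.0 + sigma * tag_w) * rel


def select_snippets(scored: List[Tuple[str, float]], min_score: float,
                    min_len: int, max_len: int) -> List[str]:
    """Keep sentences scoring at least min_score, then filter by length."""
    # Step 217: target sentences have a final score >= the preset value.
    targets = [s for s, score in scored if score >= min_score]
    # Step 218: keep only target sentences inside the preset length range
    # (length in characters is an assumption of this sketch).
    return [s for s in targets if min_len <= len(s) <= max_len]


if __name__ == "__main__":
    scored = [("short", 9.0),
              ("a sentence of medium length", 8.0),
              ("low scoring sentence here", 1.0)]
    print(select_snippets(scored, min_score=5.0, min_len=10, max_len=40))
```

Because the tag weight enters as a multiplier on the relevance score, a sentence with zero relevance stays at zero regardless of which tags its words came from.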
With the document retrieval method provided by this embodiment, a user can retrieve with query keywords without browsing the full content of the documents and without knowing the document structure; the method is also suitable for retrieval over massive document collections, with high retrieval performance and accuracy.
Embodiment 3
Referring to Fig. 9, a diagram of the document retrieval device provided by this embodiment, the device comprises:
Retrieval unit 301, configured to use the target query keywords obtained through preprocessing to retrieve from the target document set in a pre-built inverted index, obtaining a first target document set;
Acquiring unit 302, configured to perform relevance scoring on the first target document set to obtain the relevance scoring results of the first target documents, and to re-rank the first target document set according to the relevance scoring results to obtain a second target document set;
Acquiring unit 302, further configured to expand the current target query keywords by a pseudo-relevance feedback model, obtaining new target query keywords;
Acquiring unit 302, further configured to, when the new target query keywords satisfy a preset condition, use the new target query keywords to retrieve from the first target document set again, obtaining a third target document set;
Computing unit 303, configured to split each target document in the third target document set into sentences, and calculate the tag weight sum of each resulting sentence;
Computing unit 303, further configured to perform relevance scoring on the text content of each sentence according to the target query keywords to obtain the relevance scoring result of each sentence, and to obtain the final score of each sentence according to the relevance scoring result of each sentence;
Acquiring unit 302, further configured to obtain the target sentences according to the final score of each sentence, and, among the target sentences, take the sentences whose length falls within a preset length range as the retrieval result snippets.
Further, acquiring unit 302 is also configured to obtain the term frequency (TF) value and inverse document frequency (IDF) value, in the target document set, of each word in the target document set;
Referring to Fig. 10, the device further comprises:
Building unit 304, configured to build the inverted index according to the TF value and IDF value of each word in the target document set.
Processing unit 305, configured to perform stop-word removal and stemming on the query keywords, obtaining the target query keywords.
The building unit 304 is also configured to build a retrieval model.
Further, retrieval unit 301 is specifically configured to, according to the retrieval model, use the target query keywords obtained through preprocessing to retrieve from the target document set in the pre-built inverted index, obtaining the first target document set.
Further, computing unit 303 is also configured to train on the first target document set, obtaining the weight of each tag in the first target document set.
Further, referring to Fig. 11, computing unit 303 specifically comprises:
Obtaining subunit 3031, configured to obtain all tag names in the first target document set;
Classification subunit 3032, configured to divide, according to the tag names, the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query;
Obtaining subunit 3031, further configured to obtain the number of times a that each query keyword $t_i$ occurs in each relevant element $b_k$, and the total number A of all words in the relevant element set;
Obtaining subunit 3031, further configured to obtain the number of times b that each query keyword $t_i$ occurs in each irrelevant element $b_k$, and the total number B of all words in the irrelevant element set;
Computation subunit 3033, configured to calculate the probability $p_{ik}$ that each query keyword $t_i$ occurs in each relevant element $b_k$, according to the number of times that $t_i$ occurs in $b_k$ and the total number of all words in the relevant element set;
Here, $p_{ik} = \frac{a}{A}$
Computation subunit 3033, further configured to calculate the probability $q_{ik}$ that each query keyword $t_i$ occurs in each irrelevant element $b_k$, according to the number of times that $t_i$ occurs in $b_k$ and the total number of all words in the irrelevant element set;
Here, $q_{ik} = \frac{b}{B}$
Computation subunit 3033, further configured to calculate the weight of each tag $m_j$ in the first target document set;
Here, the weight of tag $m_j$ is calculated by the formula:
$$f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\; t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)$$
where $t_{ik}$ is a 0/1 value indicating whether element $b_k$ contains query keyword $t_i$, and Q is the set of query keywords.
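The tag weight formula can be illustrated with the minimal sketch below. It takes the counts a, A, b and B as inputs and clamps the two probabilities away from 0 and 1 so that the logarithm stays defined; this clamping, the `eps` value, and the function names are assumptions of the sketch and are not part of the embodiment.

```python
import math
from typing import Iterable, Tuple


def log_odds(a: int, A: int, b: int, B: int, eps: float = 1e-6) -> float:
    """log( p(1-q) / (q(1-p)) ) with p = a/A and q = b/B."""
    # Clamp to (eps, 1 - eps) to avoid log(0) or division by zero
    # (a smoothing choice made for this sketch only).
    p = min(max(a / A, eps), 1.0 - eps)
    q = min(max(b / B, eps), 1.0 - eps)
    return math.log(p * (1.0 - q) / (q * (1.0 - p)))


def tag_weight(observations: Iterable[Tuple[int, int, int, int, int]]) -> float:
    """f_tag(m_j): sum of t_ik * log-odds over the (keyword, element) pairs.

    Each observation is (t_ik, a, A, b, B), where t_ik is 1 if the element
    contains the keyword and 0 otherwise.
    """
    return sum(t * log_odds(a, A, b, B) for t, a, A, b, B in observations)


if __name__ == "__main__":
    obs = [(1, 10, 100, 5, 100), (0, 3, 100, 3, 100)]
    print(tag_weight(obs))
```

A keyword that is more frequent in relevant elements than in irrelevant ones yields a positive term, so tags whose elements tend to contain such keywords receive larger weights.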
Further, referring to Fig. 12, acquiring unit 302 specifically comprises:
Extraction subunit 3021, configured to extract the SLCA subtree of each target document in the first target document set as the structural information of that target document;
Computation subunit 3022, configured to perform relevance scoring on the SLCA subtree of each target document, obtaining the relevance score of each target document.
Further,
Computation subunit 3022 is specifically configured to obtain the number of times $tf_{n,q}$ that each query keyword q in the target query keywords occurs in each node n;
Computation subunit 3022 is specifically configured to calculate the TF value $TF_q$ of each query keyword q in the first target document set;
Computation subunit 3022 is specifically configured to obtain, according to $tf_{n,q}$ and $TF_q$ of each query keyword q, the relevance score tw(n, q) of each query keyword q for the current node;
Here, $tw(n, q) = \frac{tf_{n,q}}{TF_q}$
Computation subunit 3022 is specifically configured to, when the current node n is a leaf node, calculate the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.
Further,
Computation subunit 3022 is also specifically configured to, when the current node n is a non-leaf node, calculate the relevance score tw(c, q) of every child node c of the current node n with respect to the target query keywords;
Computation subunit 3022 is also specifically configured to calculate the relevance score $tw_1(n, q)$ of each query keyword q with respect to the current node n, according to the relevance score tw(n, q) of each query keyword q with respect to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords;
Here, $tw_1(n, q) = tw(n, q) + \sum_{c \in children(n)} d_n \cdot tw(c, q)$
Computation subunit 3022 is also specifically configured to calculate, according to the relevance score $tw_1(n, q)$ of each query keyword q with respect to the current node n, the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.
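The node scoring described for computation subunit 3022 can be sketched as a recursive walk over a subtree. Several points are assumptions of this sketch rather than statements of the embodiment: $d_n$ is treated as a per-node decay factor stored on the parent, the score of a non-leaf child is itself computed with the same rule, and keywords whose collection TF is zero are skipped. The `Node` class and the function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    term_counts: Dict[str, int]          # tf_{n,q} for text held by this node
    decay: float = 0.5                   # d_n, assumed to be a decay factor
    children: List["Node"] = field(default_factory=list)


def node_score(node: Node, q: str, collection_tf: Dict[str, int]) -> float:
    """tw(n, q) for a leaf; tw(n, q) + sum_c d_n * score(c, q) otherwise."""
    own = node.term_counts.get(q, 0) / collection_tf[q]
    if not node.children:
        return own
    return own + sum(node.decay * node_score(c, q, collection_tf)
                     for c in node.children)


def document_score(root: Node, keywords: List[str],
                   collection_tf: Dict[str, int]) -> float:
    """Sum the root score over all query keywords with a non-zero TF_q."""
    return sum(node_score(root, q, collection_tf)
               for q in keywords if collection_tf.get(q, 0) > 0)


if __name__ == "__main__":
    leaf1 = Node({"xml": 2})
    leaf2 = Node({"xml": 1, "index": 1})
    root = Node({"xml": 1}, decay=0.5, children=[leaf1, leaf2])
    print(document_score(root, ["xml", "index"], {"xml": 4, "index": 2}))
```

With a decay factor below 1, matches deep in the subtree contribute less than matches near the root of the SLCA subtree.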
Further, referring to Fig. 13, the device further comprises:
Judging unit 306, configured to judge whether the new target query keywords satisfy the preset condition.
The acquiring unit 302 is further configured to, when the new target query keywords do not satisfy the preset condition, use the new target query keywords to retrieve from the first target document set again, obtaining a new second target document set;
The acquiring unit 302 is further configured to expand the current target query keywords by the pseudo-relevance feedback model, obtaining updated target query keywords;
The retrieval unit 301 is further configured to continue until the updated target query keywords satisfy the preset condition, and then use the updated target query keywords to retrieve from the first target document set again.
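The interplay of units 301, 302 and 306 amounts to an expand-check-retrieve loop, which can be sketched as follows. The retrieval function, the pseudo-relevance feedback expansion and the preset condition are passed in as callables because the text does not fix them here; the function name `feedback_loop` and the `max_rounds` safeguard are assumptions of this sketch.

```python
from typing import Callable, List, Sequence, Tuple


def feedback_loop(keywords: Sequence[str],
                  first_set: Sequence[str],
                  second_set: Sequence[str],
                  retrieve: Callable[[List[str], Sequence[str]], List[str]],
                  expand: Callable[[List[str], Sequence[str]], List[str]],
                  is_satisfied: Callable[[List[str]], bool],
                  max_rounds: int = 10) -> Tuple[List[str], List[str]]:
    """Expand the query until the preset condition holds, then retrieve.

    Returns the final keywords and the third target document set.
    """
    current, ranked = list(keywords), list(second_set)
    for _ in range(max_rounds):          # max_rounds is an added safeguard
        current = expand(current, ranked)
        if is_satisfied(current):
            break
        # Condition not met: retrieve again to get a new second set.
        ranked = retrieve(current, first_set)
    return current, retrieve(current, first_set)


if __name__ == "__main__":
    docs = ["xml index retrieval", "xml tree scoring", "cooking recipes"]

    def toy_retrieve(kws, ds):
        return [d for d in ds if any(k in d.split() for k in kws)]

    def toy_expand(kws, ranked):
        for d in ranked:
            for w in d.split():
                if w not in kws:
                    return kws + [w]
        return kws

    print(feedback_loop(["xml"], docs, docs[:2], toy_retrieve, toy_expand,
                        lambda kws: len(kws) >= 3))
```

The toy callables in the usage example only demonstrate the control flow; a real expansion step would pick terms from the top-ranked documents according to the feedback model.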
Further, referring to Fig. 14, computing unit 303 specifically comprises:
First computation subunit 3034, configured to train on the third target document set, obtaining the weight of each tag in the third target document set;
Sentence-splitting subunit 3035, configured to remove the tags and split each target document in the third target document set into sentences;
First computation subunit 3034, further configured to calculate the weight of the tag corresponding to each word contained in each sentence, to obtain the tag weight sum tagW(s) of each sentence.
Further,
Computing unit 303 is specifically configured to calculate the relevance score $\mathrm{Score}_{query}(s)$ of the target query keywords with respect to each sentence;
Here,
$$\mathrm{Score}_{query}(s) = \mathrm{queryC}(s) \cdot \sum_{i=1}^{n} \mathrm{Occ}(q_i, s) \cdot \mathrm{Weight}(q_i)$$
where queryC(s) is the number of distinct keywords occurring in the sentence; $\mathrm{Occ}(q_i, s)$ is the number of times query keyword $q_i$ occurs in the sentence; $\mathrm{Weight}(q_i)$ is the weight of query keyword $q_i$;
Computing unit 303 is specifically configured to calculate the important-word score $\mathrm{Score}_{sw}(s)$ of each sentence; an important word is a word whose number of occurrences in the target document is greater than a threshold;
Here,
Figure BDA00002177132500182
Computing unit 303 is specifically configured to calculate the title relevance score $\mathrm{Score}_{title}(s)$ of each sentence;
Here,
Figure BDA00002177132500183
Computing unit 303 is specifically configured to compute, according to $\mathrm{Score}_{query}(s)$, $\mathrm{Score}_{sw}(s)$ and $\mathrm{Score}_{title}(s)$, the relevance score $\mathrm{Score}_{rel}(s)$ of the text content of each sentence;
Here,
$$\mathrm{Score}_{rel}(s) = \alpha\,\mathrm{Score}_{query}(s) + \beta\,\mathrm{Score}_{sw}(s) + \gamma\,\mathrm{Score}_{title}(s)$$
where α, β and γ are preset tuning parameters.
Computing unit 303 is also specifically configured to obtain the final score Score(s) of each sentence according to the formula $\mathrm{Score}(s) = (1 + \sigma \cdot \mathrm{tagW}(s)) \cdot \mathrm{Score}_{rel}(s)$;
where σ is a preset tuning parameter.
With the document retrieval device provided by this embodiment of the present invention, a user can retrieve with query keywords without browsing the full content of the documents and without knowing the document structure; the device is also suitable for retrieval over massive document collections, with high retrieval performance and accuracy.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and of course can also be implemented by hardware alone, although in many cases the former is the preferred implementation. Based on this understanding, the part of the technical solution of the present invention that is essential, or that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, hard disk or optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (22)

1. A document retrieval method, characterized by comprising:
Using target query keywords obtained through preprocessing to retrieve from a target document set in a pre-built inverted index, obtaining a first target document set;
Performing relevance scoring on the first target document set to obtain relevance scoring results of the first target documents, and re-ranking the first target document set according to the relevance scoring results to obtain a second target document set;
Expanding the current target query keywords by a pseudo-relevance feedback model, obtaining new target query keywords;
When the new target query keywords satisfy a preset condition, using the new target query keywords to retrieve from the first target document set again, obtaining a third target document set;
Splitting each target document in the third target document set into sentences, and calculating the tag weight sum of each resulting sentence;
Performing relevance scoring on the text content of each sentence according to the target query keywords to obtain the relevance scoring result of each sentence, and obtaining the final score of each sentence according to the relevance scoring result of each sentence;
Obtaining target sentences according to the final score of each sentence, and, among the target sentences, taking the sentences whose length falls within a preset length range as retrieval result snippets.
2. The method according to claim 1, characterized in that, before using the target query keywords obtained through preprocessing to retrieve from the target document set in the pre-built inverted index to obtain the first target document set, the method further comprises:
Obtaining the term frequency (TF) value and inverse document frequency (IDF) value, in the target document set, of each word in the target document set;
Building the inverted index according to the TF value and IDF value of each word in the target document set;
Performing stop-word removal and stemming on the query keywords, obtaining the target query keywords;
Building a retrieval model.
3. The method according to claim 2, characterized in that using the target query keywords obtained through preprocessing to retrieve from the target document set in the pre-built inverted index to obtain the first target document set comprises:
According to the retrieval model, using the target query keywords obtained through preprocessing to retrieve from the target document set in the pre-built inverted index, obtaining the first target document set.
4. The method according to claim 1, characterized in that, before performing relevance scoring on the first target document set, the method further comprises:
Training on the first target document set, obtaining the weight of each tag in the first target document set.
5. The method according to claim 4, characterized in that training on the first target document set to obtain the weight of each tag in the first target document set comprises:
Obtaining all tag names in the first target document set;
Dividing, according to the tag names, the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query;
Obtaining the number of times a that each query keyword $t_i$ occurs in each relevant element $b_k$, and the total number A of all words in the relevant element set;
Obtaining the number of times b that each query keyword $t_i$ occurs in each irrelevant element $b_k$, and the total number B of all words in the irrelevant element set;
Calculating the probability $p_{ik}$ that each query keyword $t_i$ occurs in each relevant element $b_k$, according to the number of times that $t_i$ occurs in $b_k$ and the total number of all words in the relevant element set;
Wherein $p_{ik} = \frac{a}{A}$
Calculating the probability $q_{ik}$ that each query keyword $t_i$ occurs in each irrelevant element $b_k$, according to the number of times that $t_i$ occurs in $b_k$ and the total number of all words in the irrelevant element set;
Wherein $q_{ik} = \frac{b}{B}$
Calculating the weight of each tag $m_j$ in the first target document set;
Wherein the weight of tag $m_j$ is calculated by the formula:
$$f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\; t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)$$
where $t_{ik}$ is a 0/1 value indicating whether element $b_k$ contains query keyword $t_i$, and Q is the set of query keywords.
6. The method according to claim 1, characterized in that performing relevance scoring on the first target document set to obtain the relevance scoring results of the first target documents comprises:
Extracting the SLCA subtree of each target document in the first target document set as the structural information of that target document;
Performing relevance scoring on the SLCA subtree of each target document, obtaining the relevance score of each target document.
7. The method according to claim 6, characterized in that performing relevance scoring on the SLCA subtree of each target document to obtain the relevance score of each target document comprises:
Obtaining the number of times $tf_{n,q}$ that each query keyword q in the target query keywords occurs in each node n;
Calculating the TF value $TF_q$ of each query keyword q in the first target document set;
Obtaining, according to $tf_{n,q}$ and $TF_q$ of each query keyword q, the relevance score tw(n, q) of each query keyword q for the current node;
Wherein $tw(n, q) = \frac{tf_{n,q}}{TF_q}$
When the current node n is a leaf node, calculating the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.
8. The method according to claim 7, characterized in that, when the current node n is a non-leaf node, the method further comprises:
Calculating the relevance score tw(c, q) of every child node c of the current node n with respect to the target query keywords;
Calculating the relevance score $tw_1(n, q)$ of each query keyword q with respect to the current node n, according to the relevance score tw(n, q) of each query keyword q with respect to the current node n and the relevance scores tw(c, q) of all child nodes c of the current node n with respect to the target query keywords;
Wherein $tw_1(n, q) = tw(n, q) + \sum_{c \in children(n)} d_n \cdot tw(c, q)$
Calculating, according to the relevance score $tw_1(n, q)$ of each query keyword q with respect to the current node n, the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.
9. The method according to claim 1, characterized in that, before using the new target query keywords to retrieve from the first target document set again when the new target query keywords satisfy the preset condition, the method further comprises:
Judging whether the new target query keywords satisfy the preset condition;
When the new target query keywords do not satisfy the preset condition, using the new target query keywords to retrieve from the first target document set again, obtaining a new second target document set;
Expanding the current target query keywords by the pseudo-relevance feedback model, obtaining updated target query keywords;
Continuing until the updated target query keywords satisfy the preset condition, and then using the updated target query keywords to retrieve from the first target document set again.
10. The method according to claim 1, characterized in that splitting each target document in the third target document set into sentences and calculating the tag weight sum of each resulting sentence comprises:
Training on the third target document set, obtaining the weight of each tag in the third target document set;
Removing the tags, and splitting each target document in the third target document set into sentences;
Calculating the weight of the tag corresponding to each word contained in each sentence, to obtain the tag weight sum tagW(s) of each sentence.
11. The method according to claim 10, characterized in that performing relevance scoring on the text content of each sentence according to the target query keywords comprises:
1) Calculating the relevance score $\mathrm{Score}_{query}(s)$ of the target query keywords with respect to each sentence;
Wherein $\mathrm{Score}_{query}(s) = \mathrm{queryC}(s) \cdot \sum_{i=1}^{n} \mathrm{Occ}(q_i, s) \cdot \mathrm{Weight}(q_i)$
where queryC(s) is the number of distinct keywords occurring in the sentence; $\mathrm{Occ}(q_i, s)$ is the number of times query keyword $q_i$ occurs in the sentence; $\mathrm{Weight}(q_i)$ is the weight of query keyword $q_i$;
2) Calculating the important-word score $\mathrm{Score}_{sw}(s)$ of each sentence; an important word is a word whose number of occurrences in the target document is greater than a threshold;
Wherein,
Figure FDA00002177132400061
3) Calculating the title relevance score $\mathrm{Score}_{title}(s)$ of each sentence;
Wherein,
Figure FDA00002177132400062
4) Computing, according to $\mathrm{Score}_{query}(s)$, $\mathrm{Score}_{sw}(s)$ and $\mathrm{Score}_{title}(s)$, the relevance score $\mathrm{Score}_{rel}(s)$ of the text content of each sentence;
Wherein $\mathrm{Score}_{rel}(s) = \alpha\,\mathrm{Score}_{query}(s) + \beta\,\mathrm{Score}_{sw}(s) + \gamma\,\mathrm{Score}_{title}(s)$
where α, β and γ are preset tuning parameters.
Obtaining the final score of each sentence according to the relevance scoring result of each sentence comprises:
Obtaining the final score Score(s) of each sentence according to the formula $\mathrm{Score}(s) = (1 + \sigma \cdot \mathrm{tagW}(s)) \cdot \mathrm{Score}_{rel}(s)$;
where σ is a preset tuning parameter.
12. A document retrieval device, characterized by comprising:
A retrieval unit, configured to use target query keywords obtained through preprocessing to retrieve from a target document set in a pre-built inverted index, obtaining a first target document set;
An acquiring unit, configured to perform relevance scoring on the first target document set to obtain relevance scoring results of the first target documents, and to re-rank the first target document set according to the relevance scoring results to obtain a second target document set;
The acquiring unit is further configured to expand the current target query keywords by a pseudo-relevance feedback model, obtaining new target query keywords;
The acquiring unit is further configured to, when the new target query keywords satisfy a preset condition, use the new target query keywords to retrieve from the first target document set again, obtaining a third target document set;
A computing unit, configured to split each target document in the third target document set into sentences, and calculate the tag weight sum of each resulting sentence;
The computing unit is further configured to perform relevance scoring on the text content of each sentence according to the target query keywords to obtain the relevance scoring result of each sentence, and to obtain the final score of each sentence according to the relevance scoring result of each sentence;
The acquiring unit is further configured to obtain target sentences according to the final score of each sentence, and, among the target sentences, take the sentences whose length falls within a preset length range as retrieval result snippets.
13. The device according to claim 12, characterized in that,
The acquiring unit is further configured to obtain the term frequency (TF) value and inverse document frequency (IDF) value, in the target document set, of each word in the target document set;
The device further comprises:
A building unit, configured to build the inverted index according to the TF value and IDF value of each word in the target document set;
A processing unit, configured to perform stop-word removal and stemming on the query keywords, obtaining the target query keywords;
The building unit is further configured to build a retrieval model.
14. The device according to claim 13, characterized in that,
The retrieval unit is specifically configured to, according to the retrieval model, use the target query keywords obtained through preprocessing to retrieve from the target document set in the pre-built inverted index, obtaining the first target document set.
15. The device according to claim 12, characterized in that,
The computing unit is further configured to train on the first target document set, obtaining the weight of each tag in the first target document set.
16. The device according to claim 15, characterized in that the computing unit specifically comprises:
An obtaining subunit, configured to obtain all tag names in the first target document set;
A classification subunit, configured to divide, according to the tag names, the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query;
The obtaining subunit is further configured to obtain the number of times a that each query keyword $t_i$ occurs in each relevant element $b_k$, and the total number A of all words in the relevant element set;
The obtaining subunit is further configured to obtain the number of times b that each query keyword $t_i$ occurs in each irrelevant element $b_k$, and the total number B of all words in the irrelevant element set;
A computation subunit, configured to calculate the probability $p_{ik}$ that each query keyword $t_i$ occurs in each relevant element $b_k$, according to the number of times that $t_i$ occurs in $b_k$ and the total number of all words in the relevant element set;
Wherein $p_{ik} = \frac{a}{A}$
The computation subunit is further configured to calculate the probability $q_{ik}$ that each query keyword $t_i$ occurs in each irrelevant element $b_k$, according to the number of times that $t_i$ occurs in $b_k$ and the total number of all words in the irrelevant element set;
Wherein $q_{ik} = \frac{b}{B}$
The computation subunit is further configured to calculate the weight of each tag $m_j$ in the first target document set;
Wherein the weight of tag $m_j$ is calculated by the formula:
$$f_{tag}(m_j) = \sum_{t_{ik} \in m_j,\; t_i \in Q} t_{ik} \times \log\left(\frac{p_{ik}(1 - q_{ik})}{q_{ik}(1 - p_{ik})}\right)$$
where $t_{ik}$ is a 0/1 value indicating whether element $b_k$ contains query keyword $t_i$, and Q is the set of query keywords.
17. The device according to claim 12, characterized in that the acquiring unit specifically comprises:
An extraction subunit, configured to extract the SLCA subtree of each target document in the first target document set as the structural information of that target document;
A computation subunit, configured to perform relevance scoring on the SLCA subtree of each target document, obtaining the relevance score of each target document.
18. The device according to claim 17, characterized in that,
The computation subunit is specifically configured to obtain the number of times $tf_{n,q}$ that each query keyword q in the target query keywords occurs in each node n;
The computation subunit is specifically configured to calculate the TF value $TF_q$ of each query keyword q in the first target document set;
The computation subunit is specifically configured to obtain, according to $tf_{n,q}$ and $TF_q$ of each query keyword q, the relevance score tw(n, q) of each query keyword q for the current node;
Wherein $tw(n, q) = \frac{tf_{n,q}}{TF_q}$
The computation subunit is specifically configured to, when the current node n is a leaf node, calculate the sum of the relevance scores of all query keywords q with respect to the current node n, as the relevance score of the document.
19. The device according to claim 18, characterized in that:
the computation subunit is further specifically configured to, when the current node n is a non-leaf node, calculate the relevance score $tw(c, q)$ of every child node c of the current node n with respect to the target query keywords;
the computation subunit is further specifically configured to calculate, from the relevance score $tw(n, q)$ of each query keyword q with respect to the current node n and the relevance scores $tw(c, q)$ of all child nodes c of the current node n with respect to the target query keywords, the relevance score $tw_1(n, q)$ of each query keyword q with respect to the current node n;
wherein $tw_1(n, q) = tw(n, q) + \sum_{c \in children(n)} d_n \cdot tw(c, q)$;
the computation subunit is further specifically configured to calculate, from the relevance scores $tw_1(n, q)$ of the query keywords q with respect to the current node n, the sum of those relevance scores, as the relevance score of the document.
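The propagation of child scores to a non-leaf node in claim 19 can be sketched as a recursive function. This is only an illustration under stated assumptions: d_n is treated as a per-node coefficient stored on the node, a non-leaf child contributes its own propagated score, and the `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tw: dict                      # keyword -> tw(n, q) computed for this node alone
    decay: float = 1.0            # d_n, assumed here to be a per-node coefficient
    children: list = field(default_factory=list)

def propagated_weight(node, keyword):
    """tw_1(n, q) = tw(n, q) + sum over children c of d_n * tw(c, q).
    Assumption: a non-leaf child contributes its own propagated score."""
    own = node.tw.get(keyword, 0.0)
    if not node.children:         # leaf node: only its own score counts
        return own
    return own + sum(node.decay * propagated_weight(c, keyword)
                     for c in node.children)

def document_score(root, keywords):
    """Relevance score of the document: sum of tw_1(root, q) over the query keywords."""
    return sum(propagated_weight(root, q) for q in keywords)

# Illustrative subtree: a root with two leaf children.
leaf_a = Node({"xml": 0.5})
leaf_b = Node({"xml": 0.25, "ir": 1.0})
root = Node({"xml": 0.1}, decay=0.5, children=[leaf_a, leaf_b])
total = document_score(root, ["xml", "ir"])
```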
20. The device according to claim 12, characterized in that the device further comprises:
a judging unit, configured to judge whether the new target query keywords meet a preset condition;
the acquiring unit is further configured to, when the new target query keywords do not meet the preset condition, retrieve the first target document set again using the new target query keywords, to obtain a new second target document set;
the acquiring unit is further configured to expand the current target query keywords by means of a pseudo-relevance feedback model, to obtain updated target query keywords;
the retrieval unit is further configured to, once the updated target query keywords meet the preset condition, retrieve the first target document set again using the updated target query keywords.
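The iterative expansion of claim 20 amounts to a loop that alternates retrieval and query expansion until the preset condition holds. The sketch below is a toy illustration: the `retrieve` and `expand` helpers, the `max_rounds` safety limit, and the stopping condition used in the example are all assumptions, not the patent's retrieval or feedback model.

```python
from collections import Counter

def retrieve(keywords, documents):
    """Toy retrieval: keep documents that contain at least one keyword."""
    return [d for d in documents if any(k in d.split() for k in keywords)]

def expand(keywords, feedback_docs):
    """Toy pseudo-relevance feedback: add the most frequent term of the
    feedback documents that is not already a query keyword."""
    counts = Counter(w for d in feedback_docs for w in d.split()
                     if w not in keywords)
    if not counts:
        return keywords
    return keywords + [counts.most_common(1)[0][0]]

def refine_query(keywords, documents, is_satisfied, max_rounds=10):
    """Expand the query until the preset condition holds, then run the
    final retrieval with the updated keywords."""
    for _ in range(max_rounds):   # max_rounds is a safety limit added here
        if is_satisfied(keywords):
            break
        feedback_docs = retrieve(keywords, documents)
        keywords = expand(keywords, feedback_docs)
    return keywords, retrieve(keywords, documents)

docs = ["xml retrieval index", "xml retrieval", "cooking pasta"]
final_keywords, results = refine_query(["xml"], docs, lambda kws: len(kws) >= 2)
```

Passing the preset condition as a callable keeps the loop independent of whatever criterion an implementation actually uses.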
21. The device according to claim 12, characterized in that the computing unit specifically comprises:
a first computation subunit, configured to train on the third target document set to obtain the weight of each tag in the third target document set;
a sentence-splitting subunit, configured to remove the tags and split each target document in the third target document set into sentences;
the first computation subunit is further configured to sum the weights of the tags corresponding to all words contained in each sentence, to obtain the tag weight sum $tagW(s)$ of each sentence.
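The tag weight sum tagW(s) of claim 21 can be sketched as follows. The sketch assumes the trained tag weights are already available as a dictionary and that each word of a sentence remembers the tag it was enclosed in before the tags were removed; the training step itself is not shown, and the names used are hypothetical.

```python
def tag_weight_sum(sentence_words, tag_weights, default=0.0):
    """tagW(s): sum of the weights of the tags corresponding to every word
    of the sentence. `sentence_words` is a list of (word, tag) pairs."""
    return sum(tag_weights.get(tag, default) for _, tag in sentence_words)

# Illustrative weights and sentence only.
tag_weights = {"title": 0.5, "p": 0.25}
sentence = [("document", "title"), ("retrieval", "title"), ("method", "p")]
tag_w = tag_weight_sum(sentence, tag_weights)
```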
22. The device according to claim 21, characterized in that:
the computing unit is specifically configured to calculate the relevance score $Score_{query}(s)$ of the target query keywords with respect to each sentence;
wherein $Score_{query}(s) = queryC(s) \cdot \sum_{i=1}^{n} Occ(q_i, s) \cdot Weight(q_i)$,
where $queryC(s)$ is the number of distinct query keywords that occur in the sentence, $Occ(q_i, s)$ is the number of times query keyword $q_i$ occurs in sentence s, and $Weight(q_i)$ is the weight of query keyword $q_i$;
the computing unit is specifically configured to calculate the score $Score_{sw}(s)$ of the important words in each sentence, an important word being a word whose number of occurrences in the target document exceeds a threshold;
wherein $Score_{sw}(s)$ is given by the formula in Figure FDA00002177132400112;
the computing unit is specifically configured to calculate the title relevance score $Score_{title}(s)$ of each sentence;
wherein $Score_{title}(s)$ is given by the formula in Figure FDA00002177132400113;
the computing unit is specifically configured to score the relevance of the text content of each sentence, $Score_{rel}(s)$, according to $Score_{query}(s)$, $Score_{sw}(s)$ and $Score_{title}(s)$;
wherein $Score_{rel}(s) = \alpha \cdot Score_{query}(s) + \beta \cdot Score_{sw}(s) + \gamma \cdot Score_{title}(s)$, and α, β and γ are preset tuning parameters;
the computing unit is further specifically configured to obtain the final score $Score(s)$ of each sentence according to the formula $Score(s) = (1 + \sigma \cdot tagW(s)) \cdot Score_{rel}(s)$;
wherein σ is a preset tuning parameter.
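The sentence scoring of claim 22 can be sketched in Python as below. The formulas for the important-word score and the title score appear only as figures, so the sketch takes those two values as inputs rather than computing them; the default values for α, β, γ and σ are placeholders, and the function names are hypothetical.

```python
def score_query(sentence_tokens, keyword_weights):
    """Score_query(s) = queryC(s) * sum_i Occ(q_i, s) * Weight(q_i)."""
    occ = {q: sentence_tokens.count(q) for q in keyword_weights}
    # queryC(s): number of distinct query keywords present in the sentence.
    query_c = sum(1 for n in occ.values() if n > 0)
    return query_c * sum(occ[q] * w for q, w in keyword_weights.items())

def final_score(s_query, s_sw, s_title, tag_w,
                alpha=0.5, beta=0.25, gamma=0.25, sigma=1.0):
    """Score(s) = (1 + sigma * tagW(s)) * Score_rel(s), where Score_rel(s)
    is the weighted sum of the three partial scores. The parameter
    defaults are placeholders, not values from the patent."""
    s_rel = alpha * s_query + beta * s_sw + gamma * s_title
    return (1 + sigma * tag_w) * s_rel

# Illustrative data only.
tokens = ["xml", "retrieval", "xml", "index"]
weights = {"xml": 1.0, "retrieval": 0.5, "snippet": 2.0}
sq = score_query(tokens, weights)
overall = final_score(sq, s_sw=1.0, s_title=2.0, tag_w=0.5, sigma=2.0)
```

Because the tag weight enters as a multiplier (1 + σ · tagW(s)), a sentence whose words carry no weighted tags keeps its content relevance score unchanged.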
CN201210360872.XA 2012-09-21 2012-09-21 A kind of method and device of file retrieval Expired - Fee Related CN103678412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210360872.XA CN103678412B (en) 2012-09-21 2012-09-21 A kind of method and device of file retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210360872.XA CN103678412B (en) 2012-09-21 2012-09-21 A kind of method and device of file retrieval

Publications (2)

Publication Number Publication Date
CN103678412A (en) 2014-03-26
CN103678412B (en) 2016-12-21

Family

ID=50315993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210360872.XA Expired - Fee Related CN103678412B (en) 2012-09-21 2012-09-21 A kind of method and device of file retrieval

Country Status (1)

Country Link
CN (1) CN103678412B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1916904A (en) * 2006-09-01 2007-02-21 北大方正集团有限公司 Method of abstracting single file based on expansion of file
US20120150856A1 (en) * 2010-12-11 2012-06-14 Pratik Singh System and method of ranking web sites or web pages or documents based on search words position coordinates

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SONGLIN WANG et al.: "PKU at INEX 2011 XML Snippet Track", Lecture Notes in Computer Science (LNCS) *
YANG Jianwu et al.: "Text similarity search based on inverted index", Computer Engineering (《计算机工程》) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268227A (en) * 2014-09-26 2015-01-07 天津大学 Automatic high-quality related sample selection method based on reverse k adjacent image search
CN104268227B (en) * 2014-09-26 2017-10-10 天津大学 High-quality correlated samples chooses method automatically in picture search based on reverse k neighbours
CN104765862A (en) * 2015-04-22 2015-07-08 百度在线网络技术(北京)有限公司 Document retrieval method and device
CN106372087B (en) * 2015-07-23 2019-12-13 北京大学 information map generation method facing information retrieval and dynamic updating method thereof
CN106372087A (en) * 2015-07-23 2017-02-01 北京大学 Information retrieval-oriented information map generation method and dynamic updating method
CN106294662A (en) * 2016-08-05 2017-01-04 华东师范大学 Inquiry based on context-aware theme represents and mixed index method for establishing model
CN106294784B (en) * 2016-08-12 2019-12-17 合一智能科技(深圳)有限公司 resource searching method and device
CN106294784A (en) * 2016-08-12 2017-01-04 合智能科技(深圳)有限公司 Resource search method and device
CN107247745A (en) * 2017-05-23 2017-10-13 华中师范大学 A kind of information retrieval method and system based on pseudo-linear filter model
CN108062355A (en) * 2017-11-23 2018-05-22 华南农业大学 Query word extended method based on pseudo- feedback with TF-IDF
CN108062355B (en) * 2017-11-23 2020-07-31 华南农业大学 Query term expansion method based on pseudo feedback and TF-IDF
CN108345679A (en) * 2018-02-26 2018-07-31 科大讯飞股份有限公司 A kind of audio and video search method, device, equipment and readable storage medium storing program for executing
CN108520033A (en) * 2018-03-28 2018-09-11 华中师范大学 Enhancing pseudo-linear filter model information search method based on superspace simulation language
CN109992647A (en) * 2019-04-04 2019-07-09 北京神州泰岳软件股份有限公司 A kind of content search method and device
CN111949679A (en) * 2019-05-17 2020-11-17 上海戈吉网络科技有限公司 Document retrieval system and method
CN112732864A (en) * 2020-12-25 2021-04-30 中国科学院软件研究所 Document retrieval method based on dense pseudo query vector representation
CN113806491A (en) * 2021-09-28 2021-12-17 上海航空工业(集团)有限公司 Information processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN103678412B (en) 2016-12-21

Similar Documents

Publication Publication Date Title
CN103678412B (en) A kind of method and device of file retrieval
JP5143057B2 (en) Important keyword extraction apparatus, method and program
CN101231661A (en) Method and system for digging object grade knowledge
CN111104801B (en) Text word segmentation method, system, equipment and medium based on website domain name
CN104346382B (en) Use the text analysis system and method for language inquiry
CN106372232B (en) Information mining method and device based on artificial intelligence
CN105404677A (en) Tree structure based retrieval method
Bhardwaj et al. A novel approach for content extraction from web pages
CN112015907A (en) Method and device for quickly constructing discipline knowledge graph and storage medium
Dang et al. WordNet-based suffix tree clustering algorithm
Zhang et al. A tag recommendation system for folksonomy
CN105426490A (en) Tree structure based indexing method
Matsuoka et al. Examination of effective features for CRF-based bibliography extraction from reference strings
Wang et al. User intention-based document summarization on heterogeneous sentence networks
Ren et al. Role-explicit query extraction and utilization for quantifying user intents
Gupta et al. Document summarisation based on sentence ranking using vector space model
CN112100500A (en) Example learning-driven content-associated website discovery method
Baliyan et al. Related Blogs’ Summarization With Natural Language Processing
Yuan et al. Self-adaptive extracting academic entities from World Wide Web
Thanadechteemapat et al. Thai word segmentation for visualization of thai web sites
Sharma et al. Analysis and Summarization of Related Blog Entries Using Semantic Web
Wang et al. A general web page extraction method aiming at online social networks
Ojo et al. Knowledge discovery in academic electronic resources using text mining
Ševa et al. Open Directory Project based universal taxonomy for Personalization of Online (Re) sources
Sathianesan et al. Personalized semantic based blog retrieval

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220623

Address after: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee after: Peking University

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee before: Peking University

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161221
