Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.
Embodiment 1
Referring to Fig. 1, which is a flowchart of a document retrieval method provided by this embodiment, the method comprises:
A. Using the target query keywords obtained through preprocessing, retrieve the target document set in a pre-built inverted index to obtain a first target document set.
B. Score the first target document set for relevance to obtain the relevance scores of the first target documents, and re-rank the first target document set according to the relevance scores to obtain a second target document set.
C. Expand the current target query keywords by a pseudo-relevance feedback model to obtain new target query keywords.
D. When the new target query keywords satisfy a preset condition, retrieve the first target document set again using the new target query keywords to obtain a third target document set.
E. Split each target document in the third target document set into sentences, and calculate the sum of the tag weights of each resulting sentence.
F. Score the text content of each sentence for relevance according to the target query keywords to obtain the relevance score of each sentence, and obtain the final score of each sentence according to its relevance score.
G. Obtain target sentences according to the final score of each sentence, and from the target sentences take the sentences whose length is within a preset length range as the retrieval result snippets.
With the document retrieval method provided by this embodiment, a user can retrieve with query keywords without browsing the full content of the documents and without knowing the document structure. The method is suitable for retrieval over massive document collections and offers high retrieval performance and accuracy.
Embodiment 2
In this embodiment, the document retrieval method is divided into three stages. The first stage is the fuzzy retrieval stage, which narrows down the semi-structured document set; the second stage is the precise retrieval stage, which obtains the exact set of documents relevant to the query; the third stage is the snippet generation stage.
Referring to Fig. 2, which is a flowchart of the first stage of the document retrieval method provided by this embodiment, the first stage comprises:
First stage: fuzzy retrieval stage.
201: Preprocess the target document set.
In this embodiment, the target document set is the set of semi-structured XML documents to be queried.
Preprocessing the target document set can be implemented by the following sub-steps:
1) Remove the stop words from the target document set.
Stop words can be configured in advance by the user. In English they can be words such as "in", "the" and "oh", as well as punctuation marks, that carry no concrete meaning; in Chinese they can likewise be particles and punctuation marks that carry no concrete meaning.
For example, the following two articles are part of the document set and are used to illustrate the removal of stop words from the target document set:
The content of article 1 is: Tom lives in Guangzhou, I live in Guangzhou too.
The content of article 2 is: He once lived in Shanghai.
The contents of article 1 and article 2 are each a character string. First, all words of article 1 and article 2 are found by splitting on spaces, each word being a keyword; then the stop words are removed from article 1 and article 2. After stop-word removal, the two articles are as follows:
Article 1 after stop-word removal: [Tom] [lives] [Guangzhou] [I] [live] [Guangzhou]
Article 2 after stop-word removal: [He] [lived] [Shanghai]
It should be noted that when a document contains Chinese sentences, the Chinese sentences must first be segmented into words using existing techniques, after which the stop words are removed from the document.
2) Extract the stems of the target document set.
First, when the content of the target document set is English text, the case of all words is unified; for example, when a user searches for "He", the words "HE" and "he" are also found.
Second, when the content of the target document set is English text, all words are reduced to their stems; for example, when a user searches for "live", the words "lives" and "lived" are also found, so "lives" and "lived" must be reduced to "live".
For example, taking article 1 and article 2 after stop-word removal, stem extraction gives:
All keywords of article 1: [tom] [live] [guangzhou] [i] [live] [guangzhou]
All keywords of article 2: [he] [live] [shanghai]
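The two preprocessing sub-steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the stop-word list and the toy suffix-stripping `stem` function are assumptions chosen only so that the output matches the article 1 and article 2 example; a real system would use a proper stemmer and, for Chinese text, a word segmenter.

```python
import re

# Hypothetical stop-word list, chosen so the example matches the text above.
STOP_WORDS = {"in", "the", "oh", "too", "once"}


def stem(word: str) -> str:
    """Toy suffix stripper (an assumption, not a real stemming algorithm)."""
    if word.endswith("ed") and len(word) > 4:
        return word[:-1]  # e.g. "lived" -> "live"
    if word.endswith("s") and len(word) > 3:
        return word[:-1]  # e.g. "lives" -> "live"
    return word


def preprocess(text: str) -> list:
    """Tokenize, unify case, remove stop words, then reduce words to stems."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    return [stem(t) for t in tokens if t not in STOP_WORDS]


article_1 = preprocess("Tom lives in Guangzhou, I live in Guangzhou too.")
article_2 = preprocess("He once lived in Shanghai.")
print(article_1)
print(article_2)
```

The same `preprocess` function can be applied to the query keywords, so that documents and queries are normalized in the same way.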
3) Calculate the TF (term frequency) value and the IDF (inverse document frequency) value of each stemmed word in the target document set.
The TF value of each word in the target document set can be calculated by the following formula:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where $n_{i,j}$ is the number of occurrences of word $t_i$ in document $d_j$ of the target document set, and the denominator is the total number of occurrences of all words in $d_j$.
The IDF value of each word in the target document set can be calculated by the following formula:
$$\mathrm{idf}_{i} = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ is the total number of documents in the target document set, and $|\{j : t_i \in d_j\}|$ is the number of documents containing $t_i$ (that is, the number of documents for which $n_{i,j} \neq 0$).
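As a hedged illustration of the TF and IDF calculation, the sketch below assumes the usual definitions: TF as the relative frequency of a word within one document, and IDF as the natural logarithm of the number of documents divided by the number of documents containing the word. The function names are hypothetical, and the input is the already preprocessed article 1 and article 2.

```python
import math
from collections import Counter


def term_frequencies(doc):
    """TF of each word: occurrences in the document / total words in the document."""
    counts = Counter(doc)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def inverse_document_frequencies(docs):
    """IDF of each word: log(number of documents / documents containing the word)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


docs = [
    ["tom", "live", "guangzhou", "i", "live", "guangzhou"],  # article 1
    ["he", "live", "shanghai"],                              # article 2
]
tf = [term_frequencies(d) for d in docs]
idf = inverse_document_frequencies(docs)
print(tf[0]["live"], idf["live"], idf["tom"])
```

Because "live" occurs in both articles, its IDF is zero under this definition, while a word found in only one article receives a positive IDF.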
202: Build an inverted index over the preprocessed target document set.
For example, taking article 1 and article 2 above, after the inverted index is built, the correspondence between each keyword and its article number, [frequency of occurrence] and keyword positions is:
guangzhou   1[2]        3,6
he          2[1]        1
i           1[1]        4
live        1[2],2[1]   2,5,2
shanghai    2[1]        3
tom         1[1]        1
After the inverted index is built, the number of occurrences and the specific positions of each keyword in each article can be looked up.
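The inverted index in the example can be sketched as a mapping from each keyword to the articles that contain it and the 1-based positions at which it occurs. The data layout and the function name `build_inverted_index` are assumptions for illustration; the frequency of a keyword in an article is simply the length of its position list.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map keyword -> {article number: [1-based positions of the keyword]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, token in enumerate(tokens, start=1):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    1: ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    2: ["he", "live", "shanghai"],
}
index = build_inverted_index(docs)
for keyword in sorted(index):
    postings = ", ".join(
        f"{doc_id}[{len(positions)}] at {positions}"
        for doc_id, positions in index[keyword].items()
    )
    print(keyword, postings)
```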
203: Preprocess the query keywords to obtain the target query keywords.
Preprocessing the query keywords can be implemented by the following sub-steps:
1) Remove the stop words from the query keywords.
It should be noted that this step is implemented in the same way as the removal of stop words from the target document set in sub-step 1) of step 201 above, and is not described again here.
2) Extract the stems of the query keywords to obtain the target query keywords.
It should be noted that this step is implemented in the same way as the extraction of stems from the target document set in sub-step 2) of step 201 above, and is not described again here.
204: According to a retrieval model, use the target query keywords obtained through preprocessing to retrieve the target document set in the inverted index, obtaining a first target document set.
It should be noted that the retrieval model is built using probabilistic statistical methods and a language model. During retrieval, Dirichlet smoothing is used, which narrows the scope of the target document set. Building the retrieval model and Dirichlet smoothing both belong to the prior art and are not described in detail here.
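The text leaves the retrieval model to the prior art. As one commonly used instance, the sketch below scores a document by the query log-likelihood under a unigram language model with Dirichlet smoothing, where the smoothed probability of a term is (count in document + mu * collection probability) / (document length + mu). The function name, the value of `mu`, and the choice to skip terms unseen in the collection are all assumptions, not details taken from the source.

```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection_counts, collection_len, mu=2000.0):
    """Log query likelihood of doc under a Dirichlet-smoothed unigram model."""
    doc_counts = Counter(doc)
    score = 0.0
    for term in query:
        p_collection = collection_counts.get(term, 0) / collection_len
        if p_collection == 0.0:
            continue  # assumption: ignore terms that never occur in the collection
        p = (doc_counts[term] + mu * p_collection) / (len(doc) + mu)
        score += math.log(p)
    return score


docs = [
    ["tom", "live", "guangzhou", "i", "live", "guangzhou"],
    ["he", "live", "shanghai"],
]
collection_counts = Counter(term for doc in docs for term in doc)
collection_len = sum(len(doc) for doc in docs)
query = ["live", "shanghai"]
scores = [
    dirichlet_score(query, doc, collection_counts, collection_len, mu=10.0)
    for doc in docs
]
print(scores)
```

Documents whose score falls below some cut-off can then be dropped, which is one way to realize the narrowing of the target document set mentioned above.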
Referring to Fig. 3, which is a flowchart of the second stage of the document retrieval method provided by this embodiment, the second stage comprises:
Second stage: precise retrieval stage.
205: Train on the first target document set to obtain the weight of each tag in the first target document set.
Referring to Fig. 4, training on the first target document set to obtain the tag weights can be implemented by the following sub-steps:
2051: Obtain all tag names in the first target document set.
2052: According to the tag names, divide the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query.
2053: Obtain the number of times $a$ that each query keyword $t_i$ occurs in each relevant element $b_k$, and the total number $A$ of all words in the relevant element set.
It should be noted that when the query keywords are English text, each word is taken as a query keyword; when the query is a Chinese sentence, the sentence must be segmented into words using existing techniques, and each resulting word is taken as a query keyword.
2054: Obtain the number of times $b$ that each query keyword $t_i$ occurs in each irrelevant element $b_k$, and the total number $B$ of all words in the irrelevant element set.
The note on English and Chinese query keywords in step 2053 applies to this step as well.
2055: According to the number of times each query keyword $t_i$ occurs in each relevant element $b_k$ and the total number of all words in the relevant element set, calculate the probability $p_{ik}$ that $t_i$ occurs in the relevant element $b_k$,
where $p_{ik} = a / A$.
2056: According to the number of times each query keyword $t_i$ occurs in each irrelevant element $b_k$ and the total number of all words in the irrelevant element set, calculate the probability $q_{ik}$ that $t_i$ occurs in the irrelevant element $b_k$,
where $q_{ik} = b / B$.
2057: Calculate the weight of each tag $m_j$ in the first target document set.
The weight of tag $m_j$ is computed from $p_{ik}$, $q_{ik}$ and $t_{ik}$, where $t_{ik}$ is a binary value (0 or 1) indicating whether element $b_k$ contains query keyword $t_i$, and $Q$ is the set of query keywords.
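The counting in steps 2053 to 2056 can be sketched as below. The formulas for the two probabilities are not reproduced in this text, so the sketch assumes the simplest reading of the definitions: the probability for an element is the number of occurrences of the keyword in that element divided by the total number of words in the whole (relevant or irrelevant) element set. The function name and the sample elements are hypothetical, and the tag weight formula itself is not attempted here.

```python
def occurrence_probabilities(keyword, elements):
    """For each element, occurrences of keyword / total words in the element set.

    Assumed reading of p_ik (relevant set) and q_ik (irrelevant set).
    """
    total_words = sum(len(element) for element in elements)
    if total_words == 0:
        return [0.0 for _ in elements]
    return [element.count(keyword) / total_words for element in elements]


# Hypothetical elements, already split by tag name into the two sets.
relevant = [["document", "retrieval", "method"], ["retrieval", "index"]]
irrelevant = [["copyright", "notice"], ["page", "retrieval", "footer"]]

p = occurrence_probabilities("retrieval", relevant)    # p_ik for each relevant b_k
q = occurrence_probabilities("retrieval", irrelevant)  # q_ik for each irrelevant b_k
print(p, q)
```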
206: Preprocess the query keywords to obtain the target query keywords.
The target query keywords comprise several query keywords $q$.
It should be noted that the query keywords are preprocessed in this step in the same way as in step 203 above, which is not repeated here.
207: Extract the SLCA (smallest lowest common ancestor) subtree of each target document in the first target document set as the structural information of that target document.
208: Score the SLCA subtree of each target document for relevance to obtain the relevance score of each target document.
Referring to Fig. 5, the relevance score of the SLCA subtree of each document can be calculated bottom-up, by the following sub-steps:
2081: Obtain the number of times $tf_{n,q}$ that each query keyword $q$ in the target query keywords occurs in each node $n$.
2082: Calculate the TF value $TF_q$ of each query keyword $q$ in the first target document set.
The TF value of each query keyword $q$ in the first target document set is calculated in the same way as the TF value of each stemmed word in the target document set in sub-step 3) of step 201 above, which is not repeated here.
2083: From $tf_{n,q}$ and $TF_q$ of each query keyword $q$, obtain the relevance score $tw(n, q)$ of each query keyword $q$ for the current node.
2084: When the current node $n$ is a leaf node, calculate the sum of the relevance scores of all query keywords $q$ with respect to the current node $n$, as the relevance score of the document.
2085: When the current node $n$ is a non-leaf node, calculate the relevance scores $tw(c, q)$ of all child nodes $c$ of the current node $n$ with respect to the target query keywords.
2086: From the relevance score $tw(n, q)$ of each query keyword $q$ with respect to the current node $n$ and the relevance scores $tw(c, q)$ of all child nodes $c$ of the current node $n$, obtain the relevance score $tw_1(n, q)$ of each query keyword $q$ with respect to the current node $n$:
$$tw_1(n, q) = tw(n, q) + \sum_{c \in \mathrm{children}(n)} d_n \cdot tw(c, q)$$
2087: From the relevance score $tw_1(n, q)$ of each query keyword $q$ with respect to the current node $n$, calculate the sum of the relevance scores of all query keywords $q$ with respect to the current node $n$, as the relevance score of the document.
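The bottom-up scoring of a subtree can be sketched as a recursion over the tree. Several points here are assumptions rather than details from the source: the node-level score tw(n, q) is replaced by a placeholder that simply counts occurrences, because its formula is not reproduced in this text; the factor written d_n is treated as a single constant decay; and the formula is applied recursively, so that a child's contribution already includes its own descendants, which is one reading of "bottom-up".

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    terms: List[str]                       # words in the node's own text
    children: List["Node"] = field(default_factory=list)


def count_tw(node: Node, q: str) -> float:
    """Placeholder for tw(n, q): raw count of q in the node's own text."""
    return float(node.terms.count(q))


def tw1(node: Node, q: str,
        tw: Callable[[Node, str], float] = count_tw,
        decay: float = 0.5) -> float:
    """tw_1(n, q) = tw(n, q) + sum over children c of decay * score of c."""
    own = tw(node, q)
    if not node.children:                  # leaf node: only its own score
        return own
    return own + sum(decay * tw1(c, q, tw, decay) for c in node.children)


def document_score(root: Node, query: List[str],
                   tw: Callable[[Node, str], float] = count_tw,
                   decay: float = 0.5) -> float:
    """Sum of the per-keyword scores at the root of the SLCA subtree."""
    return sum(tw1(root, q, tw, decay) for q in query)


root = Node(["xml"], [Node(["xml", "retrieval"]), Node(["index"])])
score = document_score(root, ["xml", "retrieval"])
print(score)
```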
209: Obtain the second target document set according to the relevance score of each target document in the first target document set.
Specifically, the documents in the first target document set can be re-ranked in descending order of relevance score, or alternatively in ascending order of relevance score.
Optionally, after the documents are re-ranked, the documents whose score is less than a first preset value can be excluded and the documents whose score is greater than or equal to the first preset value retained, giving the second target document set.
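Re-ranking and the optional cut-off can be sketched in a few lines. The function name, the dictionary of scores and the parameter names are illustrative assumptions.

```python
def rerank(scores, first_preset_value=None, descending=True):
    """Order document ids by relevance score and optionally drop low scorers."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=descending)
    if first_preset_value is not None:
        # Keep only documents whose score is >= the first preset value.
        ranked = [(doc, s) for doc, s in ranked if s >= first_preset_value]
    return [doc for doc, _ in ranked]


scores = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.5}
second_set = rerank(scores, first_preset_value=0.5)
print(second_set)
```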
210: According to the target documents in the current second target document set, expand the target query keywords using a pseudo-relevance feedback model to obtain new target query keywords, and judge whether the new target query keywords satisfy a preset condition.
When the new target query keywords do not satisfy the preset condition, perform step 211;
When the new target query keywords satisfy the preset condition, perform step 212.
In this embodiment, specifically, the target query keywords can be expanded using the pseudo-relevance feedback model according to a preset number of the higher-scoring target documents in the second target document set, to obtain the new target query keywords, and it is then judged whether the new target query keywords satisfy the preset condition.
It should be noted that the preset condition can be the number of keywords or the number of keyword stems, but is not limited thereto.
211: Retrieve the first target document set again using the new target query keywords to obtain a new second target document set, and return to step 210.
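The loop formed by steps 210 to 212 can be sketched as follows. The pseudo-relevance feedback model is not specified in the text, so this sketch substitutes a very simple stand-in: it adds the most frequent not-yet-used term from the top-ranked documents. The preset condition is taken to be a minimum number of keywords, `retrieve` is a toy overlap ranking, and all names and sample documents are hypothetical.

```python
from collections import Counter

first_set = [
    ["xml", "retrieval", "retrieval", "index"],
    ["xml", "index", "index", "tree"],
    ["cooking", "recipe"],
]


def retrieve(query):
    """Toy retrieval: rank documents of first_set by query term occurrences."""
    scored = [(sum(doc.count(t) for t in query), doc) for doc in first_set]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked if score > 0]


def expand_query(query, ranked_docs, top_docs=2, terms_per_round=1):
    """Stand-in for pseudo-relevance feedback: add frequent terms from top documents."""
    counts = Counter(
        term for doc in ranked_docs[:top_docs] for term in doc if term not in query
    )
    return query + [term for term, _ in counts.most_common(terms_per_round)]


def feedback_loop(query, second_set, min_keywords, max_rounds=10):
    for _ in range(max_rounds):
        expanded = expand_query(query, second_set)       # step 210
        stalled = expanded == query                      # nothing left to add
        query = expanded
        if len(query) >= min_keywords or stalled:        # preset condition
            break
        second_set = retrieve(query)                     # step 211
    return query, retrieve(query)                        # step 212


new_query, third_set = feedback_loop(["xml"], retrieve(["xml"]), min_keywords=3)
print(new_query)
```

The `max_rounds` guard and the `stalled` check are additions of this sketch to guarantee termination; the text itself only describes repeating until the preset condition is satisfied.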
Referring to Fig. 6, which is a flowchart of the third stage of the document retrieval method provided by this embodiment, the third stage comprises:
Third stage: snippet generation stage.
212: Retrieve the first target document set again using the new target query keywords to obtain a third target document set.
It should be noted that the tag weights of the documents are obtained in this stage in the same way as the tag weights are obtained in step 205 above, which is not described again here.
213: Split each target document in the third target document set into sentences, and calculate the sum of the tag weights of each resulting sentence.
Referring to Fig. 7, splitting each target document in the third target document set into sentences and calculating the sum of the tag weights of each resulting sentence can be implemented by the following sub-steps:
2131: Train on the third target document set to obtain the weight of each tag in the third target document set;
2132: Remove the tags, and split each target document in the third target document set into sentences.
It should be noted that splitting a document into sentences belongs to the prior art and is not described again here.
2133: Calculate the weights of the tags corresponding to all words contained in each sentence, to obtain the tag weight sum $tagW(s)$ of each sentence.
The tag weight sum of each sentence is the sum of the weights of the tags corresponding to all words contained in that sentence.
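The tag weight sum of a sentence can be sketched as below, assuming each word of the sentence has been recorded together with the name of the tag it came from before the tags were removed. The pair representation, the sample weights and the default weight of zero for unknown tags are assumptions.

```python
def tag_weight_sum(sentence, tag_weights):
    """tagW(s): sum of the weights of the tags of all words in the sentence.

    sentence is a list of (word, tag name) pairs.
    """
    return sum(tag_weights.get(tag, 0.0) for _, tag in sentence)


tag_weights = {"title": 2.0, "p": 0.5}  # hypothetical trained tag weights
sentence = [("xml", "title"), ("retrieval", "title"), ("works", "p")]
tag_w = tag_weight_sum(sentence, tag_weights)
print(tag_w)
```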
214: Preprocess the query keywords to obtain the target query keywords.
The target query keywords comprise several query keywords $q$.
It should be noted that the query keywords are preprocessed in this step in the same way as in step 203 above, which is not repeated here.
215: Score the text content of each sentence according to the target query keywords.
Referring to Fig. 8, this step is implemented as follows:
2151: Calculate the relevance score $Score_{query}(s)$ of the target query keywords with respect to each sentence.
The relevance of sentence $s$ to the target query keywords depends on three factors: the number of distinct keywords occurring in the sentence, $queryC(s)$; the number of times each query keyword $q_i$ occurs in the sentence, $Occ(q_i, s)$; and the weight of each query keyword $q_i$, $Weight(q_i)$.
Specifically, $Score_{query}(s)$ is calculated by a preset formula from these three factors.
2152: Calculate the score $Score_{sw}(s)$ of the important words in each sentence.
In this step, an important word is a word whose number of occurrences in the target document is greater than a threshold.
$Score_{sw}(s)$ is calculated by a preset formula.
2153: Calculate the title relevance score $Score_{title}(s)$ of each sentence.
$Score_{title}(s)$ is calculated by a preset formula.
2154: From $Score_{query}(s)$, $Score_{sw}(s)$ and $Score_{title}(s)$, calculate the relevance score $Score_{rel}(s)$ of the text content of each sentence:
$$Score_{rel}(s) = \alpha\, Score_{query}(s) + \beta\, Score_{sw}(s) + \gamma\, Score_{title}(s)$$
where $\alpha$, $\beta$ and $\gamma$ are three preset tuning parameters.
216: Calculate the final score of each sentence.
The final score of each sentence can be calculated by the following formula:
$$Score(s) = (1 + \sigma \cdot tagW(s)) \cdot Score_{rel}(s)$$
where $\sigma$ is a tuning parameter.
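The two combination formulas, the weighted sum that gives the sentence relevance score and the tag-weighted final score, can be sketched directly. The three component scores are taken as given inputs, since their formulas are not reproduced in this text, and the default parameter values are arbitrary assumptions.

```python
def relevance_score(score_query, score_sw, score_title,
                    alpha=0.5, beta=0.3, gamma=0.2):
    """Score_rel(s) = alpha*Score_query(s) + beta*Score_sw(s) + gamma*Score_title(s)."""
    return alpha * score_query + beta * score_sw + gamma * score_title


def final_score(tag_w, score_rel, sigma=0.1):
    """Score(s) = (1 + sigma * tagW(s)) * Score_rel(s)."""
    return (1 + sigma * tag_w) * score_rel


rel = relevance_score(2.0, 1.0, 0.0, alpha=0.5, beta=0.25, gamma=0.25)
total = final_score(tag_w=4.0, score_rel=rel, sigma=0.5)
print(rel, total)
```

With a tag weight sum of zero, the final score reduces to the relevance score, so the tag weights act purely as a boost for sentences drawn from highly weighted tags.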
217: Obtain the target sentences according to the final score of each sentence; the score of each target sentence is greater than or equal to a second preset value.
218: From the target sentences, take the sentences whose length is within a preset length range as the retrieval result snippets.
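Steps 217 and 218 amount to two filters, which can be sketched as follows. Measuring sentence length in characters and returning the snippets in descending order of score are assumptions of this sketch; the text only requires a score threshold and a length range.

```python
def select_snippets(scored_sentences, second_preset_value, min_len, max_len):
    """Keep sentences scoring >= the threshold whose length is within range."""
    targets = [(s, score) for s, score in scored_sentences
               if score >= second_preset_value]           # step 217
    targets.sort(key=lambda pair: pair[1], reverse=True)
    return [s for s, _ in targets
            if min_len <= len(s) <= max_len]              # step 218


scored = [
    ("XML retrieval uses an inverted index.", 3.75),
    ("Yes.", 4.0),
    ("Unrelated sentence about cooking.", 0.2),
]
snippets = select_snippets(scored, second_preset_value=1.0, min_len=10, max_len=80)
print(snippets)
```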
With the document retrieval method provided by this embodiment, a user can retrieve with query keywords without browsing the full content of the documents and without knowing the document structure. The method is suitable for retrieval over massive document collections and offers high retrieval performance and accuracy.
Embodiment 3
Referring to Fig. 9, which is a diagram of a document retrieval apparatus provided by this embodiment, the apparatus comprises:
A retrieval unit 301, configured to retrieve the target document set in a pre-built inverted index using the target query keywords obtained through preprocessing, to obtain a first target document set;
An obtaining unit 302, configured to score the first target document set for relevance to obtain the relevance scores of the first target documents, and to re-rank the first target document set according to the relevance scores to obtain a second target document set;
The obtaining unit 302 is also configured to expand the current target query keywords by a pseudo-relevance feedback model to obtain new target query keywords;
The obtaining unit 302 is also configured to, when the new target query keywords satisfy a preset condition, retrieve the first target document set again using the new target query keywords to obtain a third target document set;
A computing unit 303, configured to split each target document in the third target document set into sentences, and to calculate the sum of the tag weights of each resulting sentence;
The computing unit 303 is also configured to score the text content of each sentence for relevance according to the target query keywords to obtain the relevance score of each sentence, and to obtain the final score of each sentence according to its relevance score;
The obtaining unit 302 is also configured to obtain target sentences according to the final score of each sentence, and to take, from the target sentences, the sentences whose length is within a preset length range as the retrieval result snippets.
Further, the obtaining unit 302 is also configured to obtain the term frequency (TF) value and the inverse document frequency (IDF) value of each word in the target document set;
Referring to Fig. 10, the apparatus further comprises:
An establishing unit 304, configured to build the inverted index according to the TF value and IDF value of each word in the target document set.
A processing unit 305, configured to perform stop-word removal and stem extraction on the query keywords to obtain the target query keywords.
The establishing unit 304 is also configured to build the retrieval model.
Further, the retrieval unit 301 is specifically configured to retrieve, according to the retrieval model, the target document set in the pre-built inverted index using the target query keywords obtained through preprocessing, to obtain the first target document set.
Further, the computing unit 303 is also configured to train on the first target document set to obtain the weight of each tag in the first target document set.
Further, referring to Fig. 11, the computing unit 303 specifically comprises:
An obtaining subunit 3031, configured to obtain all tag names in the first target document set;
A classification subunit 3032, configured to divide, according to the tag names, the elements in the first target document set into a set of elements relevant to the query and a set of elements irrelevant to the query;
The obtaining subunit 3031 is also configured to obtain the number of times $a$ that each query keyword $t_i$ occurs in each relevant element $b_k$, and the total number $A$ of all words in the relevant element set;
The obtaining subunit 3031 is also configured to obtain the number of times $b$ that each query keyword $t_i$ occurs in each irrelevant element $b_k$, and the total number $B$ of all words in the irrelevant element set;
A calculation subunit 3033, configured to calculate, according to the number of times each query keyword $t_i$ occurs in each relevant element $b_k$ and the total number of all words in the relevant element set, the probability $p_{ik}$ that $t_i$ occurs in the relevant element $b_k$,
where $p_{ik} = a / A$;
The calculation subunit 3033 is also configured to calculate, according to the number of times each query keyword $t_i$ occurs in each irrelevant element $b_k$ and the total number of all words in the irrelevant element set, the probability $q_{ik}$ that $t_i$ occurs in the irrelevant element $b_k$,
where $q_{ik} = b / B$;
The calculation subunit 3033 is also configured to calculate the weight of each tag $m_j$ in the first target document set;
The weight of tag $m_j$ is computed from $p_{ik}$, $q_{ik}$ and $t_{ik}$, where $t_{ik}$ is a binary value (0 or 1) indicating whether element $b_k$ contains query keyword $t_i$, and $Q$ is the set of query keywords.
Further, referring to Fig. 12, the obtaining unit 302 specifically comprises:
An extraction subunit 3021, configured to extract the SLCA subtree of each target document in the first target document set as the structural information of that target document;
A calculation subunit 3022, configured to score the SLCA subtree of each target document for relevance to obtain the relevance score of each target document.
Further,
The calculation subunit 3022 is specifically configured to obtain the number of times $tf_{n,q}$ that each query keyword $q$ in the target query keywords occurs in each node $n$;
The calculation subunit 3022 is specifically configured to calculate the TF value $TF_q$ of each query keyword $q$ in the first target document set;
The calculation subunit 3022 is specifically configured to obtain, from $tf_{n,q}$ and $TF_q$ of each query keyword $q$, the relevance score $tw(n, q)$ of each query keyword $q$ for the current node;
The calculation subunit 3022 is specifically configured to, when the current node $n$ is a leaf node, calculate the sum of the relevance scores of all query keywords $q$ with respect to the current node $n$, as the relevance score of the document.
Further,
The calculation subunit 3022 is also specifically configured to, when the current node $n$ is a non-leaf node, calculate the relevance scores $tw(c, q)$ of all child nodes $c$ of the current node $n$ with respect to the target query keywords;
The calculation subunit 3022 is also specifically configured to calculate, from the relevance score $tw(n, q)$ of each query keyword $q$ with respect to the current node $n$ and the relevance scores $tw(c, q)$ of all child nodes $c$ of the current node $n$, the relevance score $tw_1(n, q)$ of each query keyword $q$ with respect to the current node $n$:
$$tw_1(n, q) = tw(n, q) + \sum_{c \in \mathrm{children}(n)} d_n \cdot tw(c, q)$$
The calculation subunit 3022 is also specifically configured to calculate, from the relevance score $tw_1(n, q)$ of each query keyword $q$ with respect to the current node $n$, the sum of the relevance scores of all query keywords $q$ with respect to the current node $n$, as the relevance score of the document.
Further, referring to Fig. 13, the apparatus further comprises:
A judging unit 306, configured to judge whether the new target query keywords satisfy the preset condition.
The obtaining unit 302 is also configured to, when the new target query keywords do not satisfy the preset condition, retrieve the first target document set again using the new target query keywords to obtain a new second target document set;
The obtaining unit 302 is also configured to expand the current target query keywords by the pseudo-relevance feedback model to obtain updated target query keywords;
The retrieval unit 301 is also configured to retrieve the first target document set again using the updated target query keywords, until the updated target query keywords satisfy the preset condition.
Further, referring to Fig. 14, the computing unit 303 specifically comprises:
A first calculation subunit 3034, configured to train on the third target document set to obtain the weight of each tag in the third target document set;
A sentence splitting subunit 3035, configured to remove the tags and split each target document in the third target document set into sentences;
The first calculation subunit 3034 is also configured to calculate the weights of the tags corresponding to all words contained in each sentence, to obtain the tag weight sum $tagW(s)$ of each sentence.
Further,
The computing unit 303 is specifically configured to calculate the relevance score $Score_{query}(s)$ of the target query keywords with respect to each sentence;
$Score_{query}(s)$ is calculated from three factors: $queryC(s)$, the number of distinct keywords occurring in the sentence; $Occ(q_i, s)$, the number of times each query keyword $q_i$ occurs in the sentence; and $Weight(q_i)$, the weight of each query keyword $q_i$;
The computing unit 303 is specifically configured to calculate the score $Score_{sw}(s)$ of the important words in each sentence, an important word being a word whose number of occurrences in the target document is greater than a threshold;
The computing unit 303 is specifically configured to calculate the title relevance score $Score_{title}(s)$ of each sentence;
The computing unit 303 is specifically configured to calculate, from $Score_{query}(s)$, $Score_{sw}(s)$ and $Score_{title}(s)$, the relevance score $Score_{rel}(s)$ of the text content of each sentence:
$$Score_{rel}(s) = \alpha\, Score_{query}(s) + \beta\, Score_{sw}(s) + \gamma\, Score_{title}(s)$$
where $\alpha$, $\beta$ and $\gamma$ are preset tuning parameters.
The computing unit 303 is also specifically configured to obtain the final score $Score(s)$ of each sentence according to the formula
$$Score(s) = (1 + \sigma \cdot tagW(s)) \cdot Score_{rel}(s)$$
where $\sigma$ is a preset tuning parameter.
With the document retrieval apparatus provided by this embodiment of the present invention, a user can retrieve with query keywords without browsing the full content of the documents and without knowing the document structure. The apparatus is suitable for retrieval over massive document collections and offers high retrieval performance and accuracy.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present invention that in essence contributes to the prior art can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, hard disk or optical disc of a computer, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.