Embodiment
In a full-text retrieval system, a search condition submitted by the user consists of a data item (Field) and a term on that data item. Depending on the retrieval attribute information of the data item, that is, its word-segmentation characteristic, full-text retrieval supports two retrieval modes:
The first is retrieval on a data item that is indexed after word segmentation.
This mode requires that, in a hit document, the value of the indexed data item contain the term. The match between a hit document and the search condition is expressed as a relevance score, a floating-point number in [0, 1].
The second is retrieval on a data item that is indexed directly, without word segmentation.
This mode requires that, in a hit document, the value of the indexed data item either be identical to the term or fall within the range specified by the search condition. The relevance can only be 0 or 1, with no intermediate value. This type of retrieval therefore divides the documents in the index database into two disjoint sets: the documents that satisfy the search condition and the documents that do not. A filter is equivalent to this type of retrieval: after filtering, each document either satisfies the search condition or does not. Consequently, for a data item whose retrieval attribute is direct indexing without word segmentation, a search condition containing that data item can be converted into a filter based on that data item, namely the group-based filter described below.
The retrieval method provided by the embodiment of the invention combines retrieval with filtering to search the index database.
First, the documents in the index database are grouped according to a preset data item (Field). For each data item value of the preset data item, a correspondence is established with the document identifiers of the documents that contain that value. Specifically, this comprises the correspondence between data item values and group identifiers, and the correspondence between each group identifier and the document identifiers of the documents in the index database that contain the corresponding data item value.
The preset data items in the index database are obtained, a group identifier (GroupID) is assigned to each data item value, and the correspondence between each data item value and its group identifier is stored. The correspondence between data item values and group identifiers may also be known in advance. For example, the data items may include newspaper name, publication date, and so on, and each data item may have several different values; the data item "newspaper name" includes values such as People's Daily and Jurisprudence Daily.
All documents in the index database are assigned to groups according to their data item values; that is, all documents containing the same data item value are placed in one group. Documents are thus assigned to groups by establishing the correspondence between group identifiers and document identifiers. Each document belongs to at least one group, depending on the data item values it contains. For example, documents containing the value "People's Daily" form one group, identified by GroupID 1; documents containing the value "Jurisprudence Daily" form another group, identified by GroupID 2; and so on. Figure 1 is a schematic diagram of the correspondence between document identifiers and group identifiers, in which each GroupID corresponds to several document identifiers (DocID).
By combining the correspondence between data item values and group identifiers with the correspondence between group identifiers and document identifiers, the correspondence between each data item value and the document identifiers of the documents in the index database is established. Both correspondences are saved in a grouping file. The grouping file thus makes it possible to look up a group identifier quickly from a data item value, and also to look up group identifiers quickly from a document identifier. Figure 2 is a schematic diagram of the membership relationship between data item values and documents, in which the group for each data item value (Field value) contains several documents (Doc).
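The two correspondences held in the grouping file can be pictured with a minimal Python sketch. The class name GroupingFile, its method names, the in-memory dictionaries, and the sequential assignment of GroupIDs are all assumptions made for illustration; the description does not specify how the grouping file is laid out.

```python
from collections import defaultdict


class GroupingFile:
    """Hypothetical in-memory stand-in for the grouping file."""

    def __init__(self):
        self.value_to_group = {}                # (field, value) -> GroupID
        self.group_to_docs = defaultdict(set)   # GroupID -> set of DocIDs
        self.doc_to_groups = defaultdict(set)   # DocID -> set of GroupIDs
        self._next_group_id = 1                 # assumed numbering scheme

    def add_document(self, doc_id, field, value):
        """Place a document in the group of the data item value it contains."""
        key = (field, value)
        if key not in self.value_to_group:
            self.value_to_group[key] = self._next_group_id
            self._next_group_id += 1
        group_id = self.value_to_group[key]
        self.group_to_docs[group_id].add(doc_id)
        self.doc_to_groups[doc_id].add(group_id)
        return group_id

    def group_of_value(self, field, value):
        """Look up the GroupID from a data item value (None if unknown)."""
        return self.value_to_group.get((field, value))

    def groups_of_doc(self, doc_id):
        """Look up the GroupIDs from a document identifier."""
        return self.doc_to_groups.get(doc_id, set())


grouping = GroupingFile()
grouping.add_document(101, "papername", "People's Daily")
grouping.add_document(102, "papername", "Jurisprudence Daily")
grouping.add_document(103, "papername", "People's Daily")
```

Keeping both directions of the mapping is what lets the later filtering step go from a DocID to its GroupIDs without touching the document content.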
After the search condition submitted by the user is obtained, retrieval in the index database proceeds as shown in Figure 3, and its principle is shown in Figure 4. The steps are as follows:
Step S1: Separate, from the search conditions submitted by the user, a first search condition used for retrieval and a second search condition used for filtering.
Specifically, the first and second search conditions are determined according to the retrieval attribute information of the data items in the submitted search conditions. Using the retrieval attribute information preset for each data item in the index database, some or all of the search conditions whose data item is retrieved by direct indexing without word segmentation are taken as second search conditions; the remaining search conditions are first search conditions.
After the user submits search keywords, each resulting search condition consists of a data item and a term on that data item. The index database presets the retrieval attribute information of each data item, namely whether it is retrieved by indexing after word segmentation or by direct indexing without word segmentation. The retrieval attribute information of the data item in each search condition can therefore be looked up and used to classify the search conditions into first and second search conditions.
The user may submit several search keywords, so several search conditions may result; the classification above handles this case and prepares for the subsequent full-text retrieval that combines retrieval with filtering. There may be more than one first search condition and more than one second search condition. Once they are distinguished, the retrieval corresponding to each second search condition is converted into a filtering process: the group-based filters generated below replace part of the search conditions, which simplifies retrieval.
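Step S1 amounts to a lookup of each condition's data item in a table of preset retrieval attributes. The following Python sketch illustrates this under stated assumptions: the table FIELD_ATTRIBUTES, its field names, and the labels "segmented" and "unsegmented" are hypothetical, and the sketch converts every eligible condition into a second search condition, although the description allows converting only some of them.

```python
# Hypothetical preset retrieval attributes for each data item in the index.
FIELD_ATTRIBUTES = {
    "title": "segmented",        # indexed after word segmentation
    "author": "segmented",
    "papername": "unsegmented",  # indexed directly, without segmentation
    "date": "unsegmented",
}


def split_conditions(conditions):
    """Split (field, term) pairs into first (retrieval) and second (filter) conditions."""
    first, second = [], []
    for field, term in conditions:
        if FIELD_ATTRIBUTES.get(field) == "unsegmented":
            second.append((field, term))
        else:
            first.append((field, term))
    return first, second


first, second = split_conditions([
    ("title", "olympic games"),
    ("papername", "People's Daily"),
    ("author", "Zhang"),
    ("date", "20080808"),
])
```

Here first holds the conditions on title and author, and second holds the conditions on papername and date.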
Step S2: Retrieve the index database using the determined first search condition to obtain a preliminary retrieval result.
The index database is retrieved according to the first search condition determined above, yielding the preliminary retrieval result.
When there is only one first search condition, retrieving with it directly yields the result. When there are several first search conditions, each is used to retrieve separately and the results are merged to obtain the preliminary retrieval result. That is, after each condition has produced its own set of hit documents, the documents hit by all of the first search conditions make up the merged preliminary retrieval result.
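The merge described for step S2 keeps the documents hit by every first search condition, which is a set intersection. A minimal Python sketch follows; the function name merge_hits and the sample DocIDs are made up for illustration, and other Boolean combinations are not covered.

```python
def merge_hits(hit_lists):
    """Return, in sorted order, the DocIDs hit by every first search condition."""
    hit_sets = [set(hits) for hits in hit_lists]
    if not hit_sets:
        return []
    return sorted(set.intersection(*hit_sets))


# Hits of two hypothetical first search conditions.
preliminary = merge_hits([[101, 102, 103, 104], [102, 103, 104, 105]])
```

With a single first search condition the function simply returns that condition's hits.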
Step S3: Look up the data item value corresponding to the term on the data item in the second search condition, and generate a group-based filter. A group-based filter either passes only the documents of specified groups or blocks only the documents of specified groups. Specifically:
First, the data item value in the index database that corresponds to the term on the data item in the second search condition is looked up, and filtering information corresponding to the second search condition is generated. The filtering information includes the data item to be filtered, the corresponding filter type, and the format and value range of the filter value on that data item.
The format of the filter value may be one or more of the following: data item value, group identifier corresponding to a data item value, and the like.
Then, either a previously cached group-based filter is found directly from the filtering information, or the effective group identifiers are determined from the filter type and from the format and value range of the filter value, and a filter is generated that passes only, or blocks only, the documents corresponding to the effective group identifiers. Passing or blocking only the documents of the effective group identifiers is how passing or blocking only the documents of specified groups is realized.
Step S4: Filter the hit documents of the preliminary retrieval result through each generated group-based filter in turn to obtain the final retrieval result.
First, the group identifiers of the hit documents are determined from their document identifiers.
Using the document identifier of each document in the preliminary retrieval result, the correspondence between group identifiers and document identifiers stored in the grouping file is queried to determine the group identifiers of each document. A document that belongs to more than one group has more than one group identifier.
Then, the determined group identifiers are passed through each filter in turn. Each filter passes only, or blocks only, its effective group identifiers, and the documents that pass every filter form the final retrieval result.
The documents in the preliminary retrieval result are fed in turn into the generated group-based filters. For a filter that defines effective group identifiers allowed to pass, a document passes if at least one of its group identifiers is among them, and otherwise does not pass. For a filter that defines effective group identifiers not allowed to pass, a document does not pass if all of its group identifiers are among them, and otherwise passes. In other words, a group-based filter performs a lookup on the document's group identifiers and returns a found or not-found result.
When there are several filters, the output of one filter is fed into the next. The final retrieval result is obtained after all filters have been applied. That is, each document in the preliminary retrieval result passes through the filters one after another: if any filter rejects it, the rejection is returned, and the documents accepted by all filters make up the final retrieval result.
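The pass rules of step S4 can be sketched in Python as follows. The class GroupFilter, the function apply_filters, and the sample data are hypothetical. The sketch also assumes that group identifiers are numbered per data item, so each filter consults only the grouping of its own data item, and that a document with no group identifier on that data item passes an exclude filter; the description does not settle either point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupFilter:
    """Hypothetical group-based filter on one data item."""
    field: str
    group_ids: frozenset   # effective group identifiers
    include: bool = True   # True: pass only these groups; False: block only these

    def passes(self, doc_groups):
        if self.include:
            # Pass if at least one group identifier of the document is allowed.
            return not self.group_ids.isdisjoint(doc_groups)
        # Block only if every group identifier of the document is disallowed.
        return not (doc_groups and doc_groups <= self.group_ids)


def apply_filters(doc_ids, groups_by_field, filters):
    """Feed the output of each filter into the next, as in step S4."""
    result = list(doc_ids)
    for flt in filters:
        doc_to_groups = groups_by_field[flt.field]
        result = [d for d in result if flt.passes(doc_to_groups.get(d, set()))]
    return result


# DocID -> GroupIDs, one mapping per data item (sample data).
groups_by_field = {
    "papername": {101: {1}, 102: {2}, 103: {1}, 104: {3}},
    "date": {101: {1}, 102: {1}, 103: {2}, 104: {1}},
}
filters = [
    GroupFilter("papername", frozenset({2}), include=False),
    GroupFilter("date", frozenset({1})),
]
final = apply_filters([101, 102, 103, 104], groups_by_field, filters)
```

The first filter removes document 102 and the second removes document 103, so only membership tests on small sets of group identifiers are needed, not a second retrieval pass.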
In step S3 above, the process of looking up the data item value corresponding to the term on the data item in the second search condition and generating the group-based filter covers the following two cases:
Case one: the format of the filter value in the generated filtering information is data item value.
In this case, the process of looking up the data item value in the index database that corresponds to the term on the data item in the second search condition, and generating the filtering information corresponding to the second search condition, is as follows:
The term on the data item in the second search condition corresponds to a data item value on that data item in the index database; the two are consistent. The index database can therefore be searched with the term to obtain the data item value consistent with it.
Preferably, several data item values, or value ranges, belonging to the same data item are merged in advance to reduce the number of filters generated and further save system resources.
Once the data item value consistent with the term is found, filtering information can be generated directly, with data item value as the format of the filter value and the specified data item value, or range of data item values, as the value range.
The filtering information is generally generated at the front end. After it is sent to the back end, the back end either looks up a cached filter by the filtering information and calls it directly if found, or determines the effective group identifiers from the filtering information and generates a group-based filter.
In this case, the process of determining the effective group identifiers from the filtering information and generating the filter is as follows:
According to the specified data item value or range of data item values in the filtering information, the correspondence between data item values and group identifiers is searched to obtain the effective group identifiers. That is, the correspondence stored in advance in the grouping file is searched to obtain the group identifier of the specified data item value, or of each data item value within the specified range.
A group-based filter is then generated from the effective group identifiers and the filter type of the corresponding data item.
The filter types include filtering that includes a data item value and filtering that excludes a data item value. Correspondingly, there are filters that pass only the group identifiers of the specified data item values (the effective group identifiers) and filters that block only those group identifiers. Inverting a filter of one type yields a filter of the other type.
Case two: the format of the filter value in the generated filtering information is group identifier.
In this case, the process of looking up the data item value in the index database that corresponds to the term on the data item in the second search condition, and generating the filtering information corresponding to the second search condition, is as follows:
The index database is searched with the term on the data item in the second search condition to obtain the data item value consistent with the term.
Preferably, several data item values, or value ranges, belonging to the same data item are merged in advance.
Using the data item value found, the correspondence between data item values and group identifiers is searched to determine the corresponding group identifier, and filtering information is generated with group identifier as the format of the filter value and the specified group identifier, or range of group identifiers, as the value range. That is, the correspondence stored in advance in the grouping file is searched to obtain the group identifier of the specified data item value, or of each data item value within the specified range.
Likewise, the back end either looks up a cached filter by the filtering information and calls it directly if found, or determines the effective group identifiers from the filtering information and generates a group-based filter.
In this case, the process of determining the effective group identifiers from the filtering information and generating the filter is as follows:
The group identifiers, or range of group identifiers, in the filtering information are taken directly as the effective group identifiers.
A group-based filter is then generated from the effective group identifiers and the filter type of the corresponding data item.
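The difference between the two cases lies only in where the value-to-group lookup happens. The Python sketch below resolves the effective group identifiers for either format. The dictionary layout of the filtering information, the key names, and the function name effective_group_ids are assumptions for illustration, with "value" and "index" borrowed from the format labels used in the XML example of this description.

```python
def effective_group_ids(filter_info, value_to_group):
    """Resolve the effective group identifiers for one piece of filtering information."""
    fmt = filter_info["format"]
    if fmt == "value":
        # Case one: the back end looks up the GroupID of each data item value.
        field = filter_info["field"]
        found = (value_to_group.get((field, v)) for v in filter_info["values"])
        return frozenset(g for g in found if g is not None)
    if fmt == "index":
        # Case two: the front end already resolved the GroupIDs.
        return frozenset(filter_info["values"])
    raise ValueError(f"unknown filter value format: {fmt!r}")


# Hypothetical correspondence of data item values and group identifiers.
value_to_group = {
    ("papername", "People's Daily"): 1,
    ("papername", "Jurisprudence Daily"): 2,
}
case_one = {"field": "papername", "format": "value",
            "operation": "include", "values": ["People's Daily"]}
case_two = {"field": "papername", "format": "index",
            "operation": "exclude", "values": [2]}
```

The resulting identifiers, together with the operation (include or exclude), are what a group-based filter needs in order to be built.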
Preferably, the filters generated in case one and case two are stored or cached so that they can be called directly when reused, avoiding repeated creation. Cached filters can be updated and replaced using a least recently used (LRU) cache policy.
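An LRU filter cache of the kind mentioned above can be sketched in Python with collections.OrderedDict. The class name FilterCache, the capacity, and the use of plain strings as cache keys and cached filters are assumptions for illustration; in practice the key would be derived from the filtering information.

```python
from collections import OrderedDict


class FilterCache:
    """Hypothetical LRU cache for generated filters."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._items = OrderedDict()   # oldest entry first

    def get_or_create(self, key, factory):
        """Return the cached filter for key, building it with factory() on a miss."""
        if key in self._items:
            self._items.move_to_end(key)      # mark as most recently used
            return self._items[key]
        flt = factory()
        self._items[key] = flt
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used entry
        return flt

    def __contains__(self, key):
        return key in self._items


cache = FilterCache(capacity=2)
cache.get_or_create("a", lambda: "filter A")
cache.get_or_create("b", lambda: "filter B")
cache.get_or_create("a", lambda: "rebuilt A")   # hit: "a" becomes most recent
cache.get_or_create("c", lambda: "filter C")    # miss: evicts "b"
```

After these calls the cache holds the entries for "a" and "c"; the third call returns the original "filter A" because the factory runs only on a miss.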
Filtering information is generally generated at the front end of the retrieval system, while filters are generally generated from it at the back end. In case one, the front end does only simple processing before sending the information to the back end; in case two, the front end does more of the processing, which relatively reduces the load on the back end.
The generated filtering information may take only one of the forms in case one or case two, or may combine both forms. For example:
Suppose that, when searching, the user specifies several search keywords covering several data items, such as newspaper name (papername), publication date (date), article title, author, and publication region, so that several search conditions are generated. The retrieval attribute of the data items for publication date and newspaper name is direct indexing without word segmentation, so these two search conditions can serve as second search conditions, and the corresponding retrieval is converted into filters.
Then, using the terms on the two data items newspaper name and publication date, the corresponding data item values in the index database are found. An example of the generated filtering information in XML format follows. It contains the data item (field), the format and value range of the filter value (data item values and their range, group identifiers and their range, and so on), and the filter type.
<Filters>
<Filter field="papername" format="index" operation="exclude">2-5,9-20</Filter>
<Filter field="date" format="value" operation="include">20080808</Filter>
<Filter field="date" format="value" operation="include">20090808</Filter>
</Filters>
In this example, the data items are newspaper name (papername) and publication date (date). The filter value formats are data item value (value), with the values 20080808 and 20090808, and group identifier (index), with the value range 2-5,9-20. The filter types are include and exclude. The filter generated from this filtering information may be a single filter based on the two groupings newspaper name and publication date; alternatively, two filters, one based on newspaper name and one based on publication date, can be generated and applied in turn.
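Filtering information of this shape could be read on the back end as sketched below in Python, using the standard xml.etree.ElementTree module. The function names parse_filters and expand_ranges and the output dictionary layout are hypothetical, and reading "2-5,9-20" as inclusive ranges of group identifiers is an assumption, since the description does not define the range syntax.

```python
import xml.etree.ElementTree as ET

FILTER_XML = """
<Filters>
  <Filter field="papername" format="index" operation="exclude">2-5,9-20</Filter>
  <Filter field="date" format="value" operation="include">20080808</Filter>
  <Filter field="date" format="value" operation="include">20090808</Filter>
</Filters>
"""


def expand_ranges(text):
    """Expand a list such as '2-5,9-20' into a set of integers (inclusive ranges assumed)."""
    ids = set()
    for part in text.split(","):
        low, _, high = part.strip().partition("-")
        ids.update(range(int(low), int(high or low) + 1))
    return ids


def parse_filters(xml_text):
    """Turn each Filter element into a dictionary of filtering information."""
    infos = []
    for node in ET.fromstring(xml_text.strip()).findall("Filter"):
        fmt = node.get("format")
        text = (node.text or "").strip()
        infos.append({
            "field": node.get("field"),
            "format": fmt,
            "operation": node.get("operation"),
            # Group identifiers are expanded; data item values are kept as text.
            "values": expand_ranges(text) if fmt == "index" else {text},
        })
    return infos


infos = parse_filters(FILTER_XML)
```

The two entries on the date data item share the same format and operation, so they are candidates for the merging of same-item values that the description recommends.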
Preferably, if the user enters search keywords by typing them directly, the processing described in the two cases above applies. If instead the user enters search keywords by selection, that is, the system offers several keyword options and the user simply picks the ones required, then each offered option that is a data item value of a preset data item can be bound in advance to its group identifier. Once the user selects such a term (data item value), its group identifier is obtained directly, without searching the correspondence between data item values and group identifiers again.
Based on the retrieval method above, a retrieval device can be constructed. As shown in Figure 4, it comprises a grouping module 10, a separation module 20, a retrieval module 30, a generation module 40, and a filtering module 50.
The grouping module 10 is configured to group the documents in the index database according to the data item values of a preset data item.
Specifically, the grouping module 10 establishes the correspondence between data item values and group identifiers, and the correspondence between each group identifier and the document identifiers of the documents containing the corresponding data item value, thereby grouping the documents in the index database.
The separation module 20 is configured to obtain the search conditions submitted by the user and, according to the retrieval attribute information of the data items in the search conditions, determine a first search condition used for retrieval and a second search condition used for filtering.
Specifically, the separation module 20 uses the retrieval attribute information preset for each data item in the index database to determine, as second search conditions, some or all of the search conditions whose data item is retrieved by direct indexing without word segmentation, and determines the remaining search conditions to be first search conditions.
The retrieval module 30 is configured to retrieve the index database using the first search condition determined by the separation module 20 to obtain a preliminary retrieval result.
Specifically, when there is more than one first search condition, the retrieval module 30 retrieves the index database with each first search condition separately and merges the results to obtain the preliminary retrieval result.
The generation module 40 is configured to look up the data item value corresponding to the term on the data item in the second search condition determined by the separation module 20, and to generate a group-based filter, where the group-based filter passes only, or blocks only, the documents of specified groups.
Preferably, the generation module 40 comprises an information processing unit 401 and a determination and generation unit 402.
The information processing unit 401 is configured to look up the data item value in the index database that corresponds to the term on the data item in the second search condition determined by the separation module 20, and to generate filtering information corresponding to the second search condition. The filtering information includes the data item to be filtered, the corresponding filter type, and the format and value range of the filter value on that data item.
Preferably, the information processing unit 401 further comprises a lookup subunit 4011 and a processing subunit 4012.
The lookup subunit 4011 is configured to search the index database with the term on the data item in the second search condition determined by the separation module 20 to obtain the data item value consistent with the term.
The processing subunit 4012 is configured either to generate directly filtering information with data item value as the format of the filter value and the specified data item value, or range of data item values, as the value range; or to search the correspondence between data item values and group identifiers with the data item value consistent with the term, determine the corresponding group identifier, and generate filtering information with group identifier as the format of the filter value and the specified group identifier, or range of group identifiers, as the value range.
The determination and generation unit 402 is configured either to determine the effective group identifiers from the filter type of the data item to be filtered and from the format and value range of the filter value in the filtering information generated by the information processing unit 401, and to generate a group-based filter that passes only, or blocks only, the documents corresponding to the effective group identifiers; or to find a previously cached group-based filter directly from the filtering information generated by the information processing unit 401.
Preferably, the determination and generation unit 402 further comprises a determination subunit 4021 and a generation subunit 4022.
The determination subunit 4021 is configured either to search the correspondence between data item values and group identifiers according to the specified data item value, or range of data item values, in the filtering information generated by the information processing unit 401 to obtain the effective group identifiers; or to take the group identifiers, or range of group identifiers, in the filtering information directly as the effective group identifiers.
The generation subunit 4022 is configured to generate a group-based filter from the effective group identifiers and the filter type of the corresponding data item.
The filtering module 50 is configured to filter the hit documents of the preliminary retrieval result obtained by the retrieval module 30 through each group-based filter generated by the generation module 40 in turn, to obtain the final retrieval result.
Preferably, the filtering module 50 comprises an identifier determination unit 501 and a filtering unit 502.
The identifier determination unit 501 is configured to determine the group identifiers of the hit documents from the document identifiers of the hit documents in the preliminary retrieval result obtained by the retrieval module 30.
The filtering unit 502 is configured to pass the group identifiers determined by the identifier determination unit 501 through each group-based filter generated by the generation module 40 in turn. Each group-based filter passes only, or blocks only, its effective group identifiers, and the documents that pass every filter form the final retrieval result.
For example, Figure 5 is a schematic diagram contrasting the principle of the retrieval method of the present application with that of the prior art.
As Figure 5 shows, for a Boolean retrieval with four search conditions, the prior approach retrieves with each of the four search conditions (search conditions 1, 2, 3, and 4) to obtain the documents hit by each (for example, the documents listed below each search condition in the figure), and then merges those hit documents to obtain the final result. In the present application, search conditions 3 and 4 are converted into filters. After retrieval with search conditions 1 and 2, the results are merged into a preliminary retrieval result, which is then filtered by filter 1 and filter 2 in turn to obtain the final result. Using filters reduces the complexity of result merging and improves retrieval speed.
The retrieval method and device provided by the embodiment of the invention group the documents in the index database according to the data item values of a preset data item, so that group-based filters can be generated at retrieval time. A group-based filter passes only, or blocks only, the documents of specified groups. Compared with a document-based filter, it is more efficient to create, handles less data, and filters faster.
When full-text retrieval is required, the first and second search conditions are determined according to the retrieval attribute information of the data items in the search conditions submitted by the user, and the retrieval corresponding to the second search condition is converted into a filtering process. The preliminary retrieval result is filtered through the group-based filters, so that filtering is carried out conveniently on group identifiers and the final retrieval result is obtained. Reducing the number of search conditions used for retrieval lowers the complexity of retrieval and of merging retrieval results, which increases processing speed and saves system resources. This overcomes the performance shortcomings of document-based filters in full-text retrieval systems and improves the overall performance of the retrieval system.
In addition, complex filters can be generated in advance and frequently used filters cached, reducing filter generation and creation. Retrieval and filter generation can run concurrently, and terms on the same data item can be merged to generate a single filter, further optimizing system performance, reducing the number of search conditions, and improving retrieval performance.
The above is merely a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to it. Any variation, replacement, or application to other similar devices that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be determined by the protection scope of the claims.