CN102054007A - Searching method and searching device - Google Patents

Publication number
CN102054007A
Authority
CN
China
Prior art keywords
data item
search condition
group character
document
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2009102371861A
Other languages
Chinese (zh)
Other versions
CN102054007B (en
Inventor
童征宇
李晓蕊
刘志云
赵东岩
徐剑波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Founder Apabi Technology Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, Beijing Founder Apabi Technology Co Ltd filed Critical Peking University
Priority to CN2009102371861A priority Critical patent/CN102054007B/en
Publication of CN102054007A publication Critical patent/CN102054007A/en
Application granted granted Critical
Publication of CN102054007B publication Critical patent/CN102054007B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a search method and a search device. The method comprises: grouping the documents in an index database according to the data item values of preset data items, and, when search conditions submitted by a user are obtained, performing the following steps: determining a first search condition used for searching and a second search condition used for filtering according to the search attribute information of the data items in the search conditions; searching the index database with the first search condition to obtain a preliminary search result; looking up the data item values corresponding to the search terms on the data items contained in the second search condition, and generating group-based filters, each of which either only allows or only disallows documents of specified groups to pass; and filtering the hit documents in the preliminary search result through each filter in turn to obtain the final search result. By converting part of the search conditions into filter conditions, the method reduces the work of searching and merging, saves system resources and increases processing speed.

Description

A search method and search device
Technical Field
The present invention relates to the field of information processing, and in particular to a search method and search device that, based on full-text search technology, use group-based filtering to speed up retrieval.
Background
In the prior art, a full-text search system lets the user specify multiple search conditions at once. Each search condition is searched as a separate branch and yields its own set of results; the result sets of all branches are then merged to obtain the final search result that satisfies all of the search conditions. The system resources consumed by a search therefore comprise the cost of searching each branch and the cost of merging the branch results into the final result.
At present, a search for several different values of the same data item is generally split into several search conditions. In addition, a complex search condition (such as a search over a range of search terms or a search by search-term prefix) is expanded into a group of ordinary search conditions at search time. As a result, a single search may involve hundreds or even thousands of search conditions, and system resource consumption grows with the number of conditions split out, which further aggravates the performance problems of full-text search.
To address the performance problems of multi-condition search, one may improve the search performance of each branch, improve the search process for complex search expressions, and so on. However, the performance gain from such improvements is very limited.
Patent application No. 200610083172.5 discloses a data integration service system and method, in which the query condition entered by the user is converted into a numerical range, matched against the data ranges provided by pre-stored data sources, and a query request is sent to the corresponding data source to obtain the query result. That method hard-codes part of the search request as a filter function, and the search results are computed and verified one by one by the filter function. The filtering is computationally expensive and processes a large amount of data. In addition, the method cannot flexibly support dynamic handling of the user's search requests during a search.
Moreover, the existing filter-based search methods all operate on the whole document set: the filter condition is combined with every document in the index database. Such whole-document filters are slow to create, large, and consume a great deal of memory. In practice, it is often necessary to filter by some feature of the documents; when the feature has many values, the search condition tends to become too long, causing performance and transmission problems, and a document-based filter can hardly meet this filtering requirement. Serious performance problems therefore remain when whole-document filters are used in a multi-user, highly concurrent full-text search system.
Summary of the Invention
Embodiments of the present invention provide a search method and a search device to solve the system performance problems of multi-condition search in the prior art, such as high system resource overhead and slow processing.
A search method comprises: grouping the documents in an index database according to the data item values of a preset data item, and, when a search condition submitted by a user is obtained, performing the following steps:
determining, according to the search attribute information of the data items in the search condition, a first search condition used for searching and a second search condition used for filtering;
searching the index database with the first search condition to obtain a preliminary search result; and looking up the data item values corresponding to the search terms on the data items contained in the second search condition, and generating group-based filters, wherein a group-based filter either only allows or only disallows documents of specified groups to pass;
filtering the hit documents in the preliminary search result through each of the filters in turn to obtain the final search result.
A search device comprises:
a grouping module, configured to group the documents in the index database according to the data item values of a preset data item;
a separation module, configured to obtain the search condition submitted by the user and to determine, according to the search attribute information of the data items in the search condition, a first search condition used for searching and a second search condition used for filtering;
a retrieval module, configured to search the index database with the first search condition determined by the separation module to obtain a preliminary search result;
a generation module, configured to look up the data item values corresponding to the search terms on the data items contained in the second search condition determined by the separation module, and to generate group-based filters, wherein a group-based filter either only allows or only disallows documents of specified groups to pass;
a filtering module, configured to filter the hit documents of the preliminary search result through each group-based filter generated by the generation module in turn, to obtain the final search result.
The search method and search device provided by the embodiments of the present invention group the documents in the index database according to the data item values of a preset data item. When a search is needed, a first search condition used for searching and a second search condition used for filtering are determined according to the search attribute information of the data items in the search condition submitted by the user. The search corresponding to the second search condition is then converted into a filtering process: the index database is searched with the first search condition to obtain a preliminary search result; the data item values corresponding to the search terms on the data items contained in the second search condition are looked up, and group-based filters are generated that either only allow or only disallow documents of specified groups to pass; and the hit documents in the preliminary search result are filtered through each filter in turn to obtain the final search result. By converting part of the search conditions into filter conditions, the method reduces the complexity of searching and of merging search results, thereby saving system resources and increasing processing speed.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the correspondence between group identifiers and document identifiers in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the membership relationship between data item values and documents in an embodiment of the present invention;
Fig. 3 is a flowchart of the search method in an embodiment of the present invention;
Fig. 4 is a schematic diagram of the principle of the search method in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the search device in an embodiment of the present invention.
Detailed Description
In a full-text search system, a search condition submitted by the user consists of a data item (Field) and a search term on that data item. Depending on the search attribute information of the data item, that is, whether the data item is indexed with word segmentation, full-text search supports two search modes:
The first is a search on a data item that is indexed after word segmentation.
This mode requires that the value of a hit document on the indexed data item contain the search term. The match between a hit document and the search condition can be expressed as a relevance score, which is a floating-point number in [0, 1].
The second is a search on a data item that is indexed directly, without word segmentation.
This mode requires that the value of a hit document on the indexed data item be exactly equal to the search term, or fall within the range specified by the search condition. The relevance can only be 0 or 1, with no intermediate value. This kind of search therefore divides the documents in the index database into two disjoint sets: the documents that satisfy the search condition and the documents that do not. A filter is equivalent to this kind of search: filtering yields the results that satisfy, or do not satisfy, the search condition. Therefore, for a data item whose search attribute information indicates indexing without word segmentation, a search condition containing that data item can be converted into a filter based on that data item, namely the group-based filter described below.
The search method provided by the embodiments of the present invention combines searching and filtering to search the index database.
First, the documents in the index database are grouped according to a preset data item (Field), and a correspondence is established between each data item value of the preset data item and the document identifiers of the documents containing that value. Specifically, this consists of the correspondence between data item values and group identifiers, and the correspondence between each group identifier and the document identifiers of the documents in the index database that contain the data item value.
The preset data items of the index database are obtained, each data item value is assigned a group identifier (GroupID), and the correspondence between each data item value and its group identifier is stored. The correspondence between data item values and group identifiers may also be known in advance. For example, the data items may include the newspaper name, the publication date and so on, and each data item may have several different data item values; the data item "newspaper name", for instance, includes data item values such as People's Daily and Jurisprudence Daily.
All documents in the index database are divided into groups according to their data item values; that is, all documents containing the same data item value are placed in one group. Documents are thus assigned to groups by establishing the correspondence between group identifiers and document identifiers. Each document belongs to at least one group, according to the data item values it contains. For example, the documents containing the data item value "People's Daily" are placed in one group whose group identifier is GroupID 1, the documents containing the data item value "Jurisprudence Daily" are placed in another group whose group identifier is GroupID 2, and so on. Fig. 1 shows the correspondence between document identifiers and group identifiers, where each GroupID corresponds to several document identifiers (DocID).
From the correspondence between data item values and group identifiers and the correspondence between group identifiers and document identifiers established above, the correspondence between each data item value and the document identifiers of the documents in the index database can be derived. Both correspondences are saved in a grouping file. The grouping file thus makes it possible to look up the group identifier for a data item value and, at the same time, to quickly look up the group identifiers for a document identifier. Fig. 2 shows the membership relationship between data item values and documents, where the group of each data item value (Field value) contains several documents (Doc).
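The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name build_group_file, the in-memory dictionaries standing in for the grouping file, and the rule of numbering GroupIDs from 1 in order of first appearance are all hypothetical and are not specified by the source.

```python
from collections import defaultdict


def build_group_file(docs, field):
    """Group documents by their values on one preset data item.

    docs maps DocID -> {field name: list of data item values}.
    The three returned tables together play the role of the grouping file.
    """
    value_to_group = {}                 # data item value -> GroupID
    group_to_docs = defaultdict(set)    # GroupID -> set of DocIDs
    doc_to_groups = defaultdict(set)    # DocID -> set of GroupIDs

    for doc_id in sorted(docs):
        for value in docs[doc_id].get(field, []):
            # Assumption: a value gets the next free GroupID when first seen.
            group_id = value_to_group.setdefault(value, len(value_to_group) + 1)
            group_to_docs[group_id].add(doc_id)
            doc_to_groups[doc_id].add(group_id)
    return value_to_group, dict(group_to_docs), dict(doc_to_groups)


docs = {
    1: {"papername": ["People's Daily"]},
    2: {"papername": ["Jurisprudence Daily"]},
    3: {"papername": ["People's Daily"]},
}
value_to_group, group_to_docs, doc_to_groups = build_group_file(docs, "papername")
```

Keeping both the GroupID-to-DocID table and its reverse mirrors the two lookups the text attributes to the grouping file: from a data item value to its group, and from a document to its groups.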
After the search condition submitted by the user is obtained, the search of the index database proceeds as shown in the flowchart of Fig. 3, and its principle is shown in Fig. 4. The steps are as follows:
Step S1: Separate, from the search conditions submitted by the user, the first search condition used for searching and the second search condition used for filtering.
Specifically, the first and second search conditions are determined according to the search attribute information of the data items in the submitted search conditions. According to the search attribute information preset for each data item in the index database, some or all of the search conditions whose data items are searched through an index created directly without word segmentation are taken as the second search condition; the remaining search conditions are the first search condition.
After the user submits search keywords, they are converted into search conditions, each consisting of a data item and a search term on that data item. The search attribute information of each data item is preset in the index database, that is, whether the data item is searched through an index created after word segmentation or through an index created directly without word segmentation. The search attribute information of each data item in a search condition can therefore be looked up and used to classify the search conditions into first and second search conditions.
The user may submit several search keywords, so several search conditions may result; the classification above handles this case and prepares for the subsequent full-text search that combines searching and filtering. There may be more than one first search condition and more than one second search condition. Once they are distinguished, the search corresponding to each second search condition can be converted into a filtering process; that is, the group-based filters generated below replace part of the search conditions and simplify the search.
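As a rough illustration of step S1, the sketch below splits a list of (data item, search term) pairs by a per-field flag. The table FIELD_IS_TOKENIZED, the field names and the function split_conditions are hypothetical. The source allows "some or all" of the untokenized conditions to become second search conditions; this sketch assumes all of them do.

```python
# Hypothetical search attribute information: True if the data item is
# indexed after word segmentation, False if it is indexed directly.
FIELD_IS_TOKENIZED = {
    "title": True,
    "author": True,
    "papername": False,
    "date": False,
}


def split_conditions(conditions):
    """Split (field, term) pairs into first and second search conditions."""
    first, second = [], []
    for field, term in conditions:
        if FIELD_IS_TOKENIZED[field]:
            first.append((field, term))    # searched against the index
        else:
            second.append((field, term))   # later turned into a group-based filter
    return first, second


conditions = [
    ("title", "olympics"),
    ("papername", "People's Daily"),
    ("date", "20080808"),
]
first, second = split_conditions(conditions)
```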
Step S2: Search the index database with the determined first search condition to obtain a preliminary search result.
If there is only one first search condition, the search directly yields the preliminary search result. If there are several first search conditions, each is searched separately and the results are merged: the documents hit by every one of the first search conditions make up the merged preliminary search result.
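The merge in step S2 amounts to a set intersection over the per-condition hit sets. The sketch below assumes a toy index that maps a (field, term) pair straight to a set of DocIDs; that structure and the name preliminary_result are illustrative only, and relevance scoring is left out.

```python
def preliminary_result(index, first_conditions):
    """Search each first search condition separately, then keep only the
    documents hit by every condition (set intersection)."""
    hit_sets = [index.get(cond, set()) for cond in first_conditions]
    if not hit_sets:
        # The source does not cover the case of no first search condition;
        # this sketch simply returns an empty set.
        return set()
    return set.intersection(*hit_sets)


# Hypothetical toy index: (field, term) -> DocIDs whose field contains the term.
index = {
    ("title", "olympics"): {1, 2, 3, 5},
    ("author", "li"): {2, 3, 4},
}
hits = preliminary_result(index, [("title", "olympics"), ("author", "li")])
single = preliminary_result(index, [("title", "olympics")])
```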
Step S3: Look up the data item values corresponding to the search terms on the data items contained in the second search condition, and generate group-based filters. A group-based filter either only allows or only disallows documents of specified groups to pass. Specifically:
First, the data item values in the index database that correspond to the search terms on the data items contained in the second search condition are looked up, and filter information corresponding to the second search condition is generated. The filter information comprises the data item to filter on, the corresponding filtering mode, and the format and value range of the filter values on that data item.
The format of the filter values may be one or more of: data item values, the group identifiers corresponding to data item values, and the like.
Then, either a previously cached group-based filter is found directly from the filter information, or the effective group identifiers are determined from the filtering mode and the format and value range of the filter values in the filter information, and a filter is generated that either only allows or only disallows the documents corresponding to the effective group identifiers to pass. In this way, only allowing or only disallowing documents of specified groups is implemented by only allowing or only disallowing the documents of the effective group identifiers.
Step S4: Filter the hit documents in the preliminary search result through each of the generated group-based filters in turn to obtain the final search result.
First, the group identifiers of the hit documents are determined from the document identifiers of the hit documents in the preliminary search result.
For each document in the preliminary search result, the correspondence between group identifiers and document identifiers stored in the grouping file is queried with the document identifier to determine the document's group identifiers. A document that belongs to more than one group has more than one group identifier.
Then, the determined group identifiers are filtered through each filter in turn; each filter only allows or only disallows its effective group identifiers, and the documents that pass every filter form the final search result.
The documents in the preliminary search result are fed in turn into the generated group-based filters. For a filter that only allows its effective group identifiers, a document passes if at least one of its group identifiers is an effective group identifier of that filter, and otherwise does not pass. For a filter that only disallows its effective group identifiers, a document does not pass if all of its group identifiers are effective group identifiers of that filter, and otherwise passes. In other words, a group-based filter looks up the document's group identifiers and returns whether or not they are present.
When there are several filters, the output of one filter is fed into the next. After all filters have been applied, the final search result is obtained. That is, the preliminary search result is filtered by the filters one after another: a document that fails any filter is rejected, and the documents that pass all filters form the final search result.
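The pass rules of step S4 can be sketched as follows. The class GroupFilter and the function apply_filters are hypothetical names, and the sketch assumes every document has at least one group identifier, as the grouping description states. The exclude rule follows the text literally: a document is blocked only when all of its group identifiers are disallowed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupFilter:
    effective_groups: frozenset   # effective GroupIDs named by the filter
    include: bool                 # True: only allow them; False: only disallow them

    def passes(self, doc_groups):
        if self.include:
            # Pass if at least one of the document's groups is allowed.
            return bool(doc_groups & self.effective_groups)
        # Exclude filter: blocked only if all of the document's groups are disallowed.
        return not doc_groups <= self.effective_groups


def apply_filters(hit_doc_ids, doc_to_groups, filters):
    """Feed the preliminary hits through each filter in turn."""
    result = list(hit_doc_ids)
    for flt in filters:
        result = [d for d in result
                  if flt.passes(doc_to_groups.get(d, frozenset()))]
    return result


doc_to_groups = {1: {1}, 2: {2}, 3: {1, 3}, 4: {3}}
filters = [
    GroupFilter(frozenset({1, 3}), include=True),
    GroupFilter(frozenset({3}), include=False),
]
final = apply_filters([1, 2, 3, 4], doc_to_groups, filters)
```

Each filter only does set membership tests on small sets of group identifiers, which is the point of filtering by group rather than by individual document.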
In step S3 above, the process of looking up the data item values corresponding to the search terms on the data items contained in the second search condition and generating group-based filters covers the following two cases:
Case 1: The format of the filter values in the generated filter information is data item value.
In this case, the process of looking up the data item values in the index database that correspond to the search terms on the data items contained in the second search condition, and generating the filter information corresponding to the second search condition, is as follows:
The search term on a data item contained in the second search condition corresponds directly to a data item value of that data item in the index database; the two are identical. The index database can therefore be searched with the search term to obtain the data item value that matches it.
Preferably, multiple data item values, or value ranges, belonging to the same data item are merged in advance to reduce the number of filters generated and thus further save system resources.
Once the matching data item values are found, filter information can be generated directly whose filter value format is data item value and whose value range is the specified data item values or range of data item values.
Filter information is generally generated in the front end. After it is sent to the back end, the back end checks whether a cached filter exists for the filter information and, if so, uses it directly; otherwise it determines the effective group identifiers from the filter information and generates a group-based filter.
In this case, the process of determining the effective group identifiers from the filter information and generating the filter is as follows:
According to the specified data item values or range of data item values in the filter information, the correspondence between data item values and group identifiers is looked up to obtain the effective group identifiers. That is, the correspondence stored in advance in the grouping file is consulted to obtain the group identifiers of the specified data item values, or of the data item values within the specified range.
A group-based filter is then generated from the effective group identifiers and the filtering mode of the corresponding data item.
The filtering modes include filtering that includes the data item values and filtering that excludes them. Accordingly, the generated filters are of two kinds: filters that only allow the group identifiers of the specified data item values (the effective group identifiers) to pass, and filters that only disallow them. A filter of one kind can be inverted to obtain a filter of the other kind.
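For Case 1, resolving the data item values in the filter information to effective group identifiers might look like the sketch below. The function effective_groups, the sample table date_to_group, and the choice of an inclusive range compared as strings are assumptions made for illustration; the source only says that specified values or a range of values are looked up in the grouping file.

```python
def effective_groups(value_to_group, values=(), value_range=None):
    """Resolve the data item values named in the filter information
    (a list of values and/or an inclusive (low, high) range) to GroupIDs."""
    groups = {value_to_group[v] for v in values if v in value_to_group}
    if value_range is not None:
        low, high = value_range
        groups |= {g for v, g in value_to_group.items() if low <= v <= high}
    return frozenset(groups)


# Hypothetical grouping data for the data item "date" (value -> GroupID).
date_to_group = {"20080807": 1, "20080808": 2, "20090808": 3, "20100101": 4}

by_values = effective_groups(date_to_group, values=["20080808", "20090808"])
by_range = effective_groups(date_to_group, value_range=("20080808", "20091231"))
```

The resulting set, together with the filtering mode (include or exclude), is all that is needed to build the filter.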
Case 2: The format of the filter values in the generated filter information is group identifier.
In this case, the process of looking up the data item values in the index database that correspond to the search terms on the data items contained in the second search condition, and generating the filter information corresponding to the second search condition, is as follows:
According to the search term on a data item contained in the second search condition, the index database is searched to obtain the data item value that matches the search term.
Preferably, multiple data item values, or value ranges, belonging to the same data item are merged in advance.
According to the matching data item values found, the correspondence between data item values and group identifiers is looked up to determine the corresponding group identifiers, and filter information is generated whose filter value format is group identifier and whose value range is the specified group identifiers or range of group identifiers. That is, the correspondence stored in advance in the grouping file is consulted to obtain the group identifiers of the specified data item values, or of the data item values within the specified range.
Likewise, the back end can check whether a cached filter exists for the filter information and, if so, use it directly; otherwise it determines the effective group identifiers from the filter information and generates a group-based filter.
In this case, the process of determining the effective group identifiers from the filter information and generating the filter is as follows:
The group identifiers or range of group identifiers contained in the filter information are taken directly as the effective group identifiers.
A group-based filter is then generated from the effective group identifiers and the filtering mode of the corresponding data item.
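In Case 2 the filter information already carries group identifiers, so the back end only has to expand them. The sketch below assumes a comma-separated list of single identifiers and inclusive low-high ranges, modeled on the value "2-5,9-20" that appears in the document's XML example; the function name parse_group_ranges is hypothetical.

```python
def parse_group_ranges(spec):
    """Expand a spec such as "2-5,9-20" into the set of GroupIDs it covers."""
    groups = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            groups.update(range(int(low), int(high) + 1))   # inclusive range
        else:
            groups.add(int(part))
    return frozenset(groups)


effective = parse_group_ranges("2-5,9-20")
```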
Preferably, the filters generated in Case 1 and Case 2 are stored or cached so that they can be reused directly instead of being created again. Cached filters can be updated and replaced using a least recently used (LRU) cache policy.
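A minimal sketch of such an LRU filter cache is shown below, built on Python's collections.OrderedDict. The class FilterCache, its capacity parameter and the use of a tuple of filter information as the cache key are assumptions; the source only states that cached filters are replaced under an LRU policy.

```python
from collections import OrderedDict


class FilterCache:
    """A small LRU cache keyed by filter information."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get_or_create(self, key, build):
        if key in self._items:
            self._items.move_to_end(key)       # mark as most recently used
            return self._items[key]
        flt = build()                          # create the filter only on a miss
        self._items[key] = flt
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)    # evict the least recently used
        return flt

    def __contains__(self, key):
        return key in self._items


calls = []


def builder(name):
    """Return a build function that records each time a filter is created."""
    def build():
        calls.append(name)
        return frozenset({name})
    return build


cache = FilterCache(capacity=2)
key_a = ("papername", "include", "1")
key_b = ("date", "include", "20080808")
key_c = ("date", "include", "20090808")
cache.get_or_create(key_a, builder("a"))
cache.get_or_create(key_b, builder("b"))
cache.get_or_create(key_a, builder("a"))   # cache hit, nothing is rebuilt
cache.get_or_create(key_c, builder("c"))   # cache is full, key_b is evicted
```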
Filter information is generally generated in the front end of the search system, while filters are generally generated from the filter information in the back end. In Case 1 the front end does only simple processing before sending the information to the back end; in Case 2 the front end does more of the processing, which relatively reduces the load on the back end.
The generated filter information may contain filter information in only one of the formats of Case 1 and Case 2, or a combination of both. For example:
When searching, the user specifies several search keywords covering several data items, such as newspaper name (papername), publication date (date), article title, author and publication region, and several search conditions are generated. The data items for publication date and newspaper name have search attribute information indicating search on an index created directly without word segmentation, so these two search conditions can serve as second search conditions, and their searches are converted into filters.
Then, according to the search terms on the two data items newspaper name and publication date, the corresponding data item values in the index database are found. An example of the resulting filter information in XML format follows; it contains the data item (field), the format and value range of the filter values (data item values and their range, group identifiers and their range, and so on), and the filtering mode.
<Filters>
  <Filter field="papername" format="index" operation="exclude">2-5,9-20</Filter>
  <Filter field="date" format="value" operation="include">20080808</Filter>
  <Filter field="date" format="value" operation="include">20090808</Filter>
</Filters>
In this example, the data items are newspaper name (papername) and publication date (date). The filter value formats are data item value (value), with the values 20080808 and 20090808, and group identifier (index), with the range 2-5,9-20. The filtering modes are include and exclude. The filter generated from this filter information may be a single filter based on both the newspaper name and publication date groupings; alternatively, two filters, one based on newspaper name and one based on publication date, may be generated and applied in turn.
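Reading such filter information in the back end could be sketched as below, using Python's standard xml.etree.ElementTree module on a cleaned-up copy of the example. The function parse_filter_info and the merging of entries that share the same data item, value format and filtering mode are assumptions; the merging follows the document's suggestion to combine multiple values of one data item so that fewer filters are generated.

```python
import xml.etree.ElementTree as ET

FILTER_XML = """
<Filters>
  <Filter field="papername" format="index" operation="exclude">2-5,9-20</Filter>
  <Filter field="date" format="value" operation="include">20080808</Filter>
  <Filter field="date" format="value" operation="include">20090808</Filter>
</Filters>
"""


def parse_filter_info(xml_text):
    """Read filter information and merge entries that share the same
    data item, value format and filtering mode."""
    merged = {}
    for node in ET.fromstring(xml_text).findall("Filter"):
        key = (node.get("field"), node.get("format"), node.get("operation"))
        values = [v.strip() for v in (node.text or "").split(",") if v.strip()]
        merged.setdefault(key, []).extend(values)
    return merged


info = parse_filter_info(FILTER_XML)
```

Here the two date entries collapse into one include filter over two values, while the papername entry stays a separate exclude filter over group identifier ranges.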
Preferably, when the user types search keywords directly, the processing described for the two cases above applies. If instead the user enters search keywords by selection, that is, the system presents several keyword options and the user only has to pick the ones to enter, then for each option that is a value of a preset data item, the corresponding group identifier can be bound to the search term in advance. Once the user selects such a search term (data item value), its group identifier is obtained directly, and the correspondence between data item values and group identifiers does not need to be looked up again.
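A minimal sketch of this binding, assuming the options are held in a simple in-memory table: the names `SELECTABLE_OPTIONS` and `group_for_selection`, the newspaper names, and the identifier values are all hypothetical.

```python
# Hypothetical option table: each selectable search term is stored together
# with the group identifier bound to it in advance.
SELECTABLE_OPTIONS = {
    "papername": {"Daily A": 0, "Daily B": 1},
    "date": {"20080808": 0, "20090808": 1},
}


def group_for_selection(field, selected_term):
    """Return the pre-bound group identifier for a selected option.

    No lookup of the value-to-group correspondence in the index database
    is needed, because the identifier travels with the option itself.
    """
    return SELECTABLE_OPTIONS[field][selected_term]


selected_group = group_for_selection("papername", "Daily B")
```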
Based on the search method of the present invention described above, a search device can be constructed. As shown in Figure 4, it comprises: grouping module 10, separation module 20, retrieval module 30, generation module 40, and filtering module 50.
Grouping module 10 is configured to group the documents in the index database according to the data item values of a preset data item.
Specifically, grouping module 10 establishes a correspondence between data item values and group identifiers, and a correspondence between each group identifier and the document identifiers of the documents containing that data item value, thereby grouping the documents in the index database.
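The two correspondences kept by the grouping module can be sketched as follows. This assumes documents are plain dictionaries keyed by document identifier and that group identifiers are assigned as consecutive integers in order of first appearance; neither detail comes from the patent. The reverse map `doc_to_group` is added for convenience because filtering later needs to go from a document to its group.

```python
from collections import defaultdict


def build_groups(documents, field):
    """Group documents by the value they hold on one preset data item.

    Returns three mappings:
      value_to_group: data item value  -> group identifier
      group_to_docs:  group identifier -> set of document identifiers
      doc_to_group:   document identifier -> group identifier
    """
    value_to_group = {}
    group_to_docs = defaultdict(set)
    doc_to_group = {}
    for doc_id, doc in documents.items():
        value = doc[field]
        # A value seen for the first time gets the next free integer.
        group_id = value_to_group.setdefault(value, len(value_to_group))
        group_to_docs[group_id].add(doc_id)
        doc_to_group[doc_id] = group_id
    return value_to_group, dict(group_to_docs), doc_to_group


# Hypothetical sample documents.
documents = {
    1: {"papername": "Daily A", "date": "20080808"},
    2: {"papername": "Daily B", "date": "20080808"},
    3: {"papername": "Daily A", "date": "20090808"},
}
value_to_group, group_to_docs, doc_to_group = build_groups(documents, "papername")
```

Because the number of distinct values on such a data item is usually far smaller than the number of documents, a filter expressed over group identifiers can stay small.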
Separation module 20 is configured to obtain the search condition submitted by the user and, according to the search attribute information of the data items in the search condition, to determine a first search condition used for retrieval and a second search condition used for filtering.
Specifically, separation module 20, according to the search attribute information predefined for each data item in the index database, determines as the second search condition some or all of the search conditions whose data items have search attribute information indicating retrieval on an index created directly without word segmentation, and determines the remaining search conditions as the first search condition.
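A sketch of this separation step, assuming each search condition is a `(field, term)` pair and the search attribute information is a per-field label. The labels `"untokenized"` and `"tokenized"` and the name `split_conditions` are illustrative. The text allows some or all of the eligible conditions to become second search conditions; this sketch converts all of them.

```python
# Hypothetical search attribute information, predefined per data item.
FIELD_ATTRIBUTES = {
    "papername": "untokenized",  # indexed directly, without word segmentation
    "date": "untokenized",
    "title": "tokenized",
    "author": "tokenized",
}


def split_conditions(conditions, field_attributes):
    """Split (field, term) pairs into first and second search conditions."""
    first, second = [], []
    for field, term in conditions:
        if field_attributes.get(field) == "untokenized":
            second.append((field, term))   # will be turned into a filter
        else:
            first.append((field, term))    # will be used for retrieval
    return first, second


conditions = [
    ("title", "olympics"),
    ("author", "li"),
    ("papername", "Daily A"),
    ("date", "20080808"),
]
first_conditions, second_conditions = split_conditions(conditions, FIELD_ATTRIBUTES)
```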
Retrieval module 30 is configured to search the index database with the first search condition determined by separation module 20 and obtain a preliminary search result.
Specifically, when there is more than one first search condition, retrieval module 30 searches the index database with each first search condition separately and merges the results obtained for each first search condition to obtain the preliminary search result.
Generation module 40 is configured to look up the data item values corresponding to the search terms on the data items contained in the second search condition determined by separation module 20, and to generate group-based filters, where a group-based filter only allows, or only disallows, the documents of set groups to pass.
Preferably, generation module 40 comprises: information processing unit 401 and determination and generation unit 402.
Information processing unit 401 is configured to look up, in the index database, the data item values corresponding to the search terms on the data items contained in the second search condition determined by separation module 20, and to generate the filtering information corresponding to the second search condition. The filtering information contains: the data item to be filtered, the corresponding filter type, and the format and value range of the filter value on that data item.
Preferably, information processing unit 401 may further comprise: lookup subunit 4011 and processing subunit 4012.
Lookup subunit 4011 is configured to query the index database according to the search terms on the data items contained in the second search condition determined by separation module 20, and to obtain the data item values that match the search terms.
Processing subunit 4012 is configured to directly generate filtering information in which the format of the filter value is data item value and the corresponding value range is a set data item value or a range of data item values; or, according to the data item values that match the search terms, to look up the correspondence between data item values and group identifiers, determine the corresponding group identifiers, and generate filtering information in which the format of the filter value is group identifier and the corresponding value range is a set group identifier or a range of group identifiers.
Determination and generation unit 402 is configured to determine the corresponding effective group identifiers from the filter type of the filtered data item and the format and value range of the filter value in the filtering information generated by information processing unit 401, and to generate a group-based filter that only allows, or only disallows, the documents corresponding to the effective group identifiers to pass; or to directly find the corresponding pre-cached group-based filter according to the filtering information generated by information processing unit 401.
Preferably, determination and generation unit 402 may further comprise: determination subunit 4021 and generation subunit 4022.
Determination subunit 4021 is configured to look up the correspondence between data item values and group identifiers according to the set data item value or range of data item values contained in the filtering information generated by information processing unit 401, and so obtain the effective group identifiers; or to directly take the group identifier or range of group identifiers contained in the filtering information as the effective group identifiers.
Generation subunit 4022 is configured to generate the group-based filter according to the effective group identifiers and the filter type corresponding to the data item.
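The path from filtering information to a group-based filter, including reuse of a cached filter, might look like the sketch below. The class `GroupFilter`, the function `make_filter`, the cache key, and the dictionary layout of the filtering information are assumptions for illustration rather than the patented implementation.

```python
class GroupFilter:
    """Allows (include) or disallows (exclude) a set of effective group identifiers."""

    def __init__(self, group_ids, operation):
        self.group_ids = frozenset(group_ids)
        self.operation = operation  # "include" or "exclude"

    def accepts(self, group_id):
        hit = group_id in self.group_ids
        return hit if self.operation == "include" else not hit


# Hypothetical cache of filters that were already built.
_filter_cache = {}


def make_filter(spec, value_to_group):
    """Build, or fetch from the cache, the filter described by one spec.

    spec is a dict with keys 'field', 'format', 'operation', 'values'.
    """
    key = (spec["field"], spec["format"], spec["operation"],
           frozenset(spec["values"]))
    if key in _filter_cache:
        return _filter_cache[key]
    if spec["format"] == "value":
        # Translate data item values into effective group identifiers.
        effective = {value_to_group[v] for v in spec["values"]
                     if v in value_to_group}
    else:
        # format == "index": the spec already carries group identifiers.
        effective = set(spec["values"])
    group_filter = GroupFilter(effective, spec["operation"])
    _filter_cache[key] = group_filter
    return group_filter


date_value_to_group = {"20080808": 0, "20090808": 1, "20100101": 2}
date_spec = {"field": "date", "format": "value", "operation": "include",
             "values": {"20080808", "20090808"}}
date_filter = make_filter(date_spec, date_value_to_group)
```

A second call with the same filtering information returns the cached object instead of building a new one.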
Filtering module 50 is configured to filter the hit documents in the preliminary search result obtained by retrieval module 30 through each group-based filter generated by generation module 40 in turn, and to obtain the final search result.
Preferably, filtering module 50 comprises: identifier determination unit 501 and filter unit 502.
Identifier determination unit 501 is configured to determine the group identifier of each hit document from the document identifiers of the hit documents in the preliminary search result obtained by retrieval module 30.
Filter unit 502 is configured to pass the group identifiers determined by identifier determination unit 501 through each group-based filter generated by generation module 40 in turn, with each filter only allowing, or only disallowing, its effective group identifiers, and to take the documents that pass every filter as the final search result.
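The two units of the filtering module can be sketched together. The sketch assumes each filter is a `(doc_to_group, group_ids, operation)` triple, where `doc_to_group` is the document-to-group mapping of the data item that the filter is based on; this representation and the sample data are hypothetical.

```python
def passes(group_id, group_ids, operation):
    """Apply one include/exclude decision to a group identifier."""
    hit = group_id in group_ids
    return hit if operation == "include" else not hit


def filter_hits(hit_doc_ids, filters):
    """Pass the hit documents through every filter in turn.

    For each filter, the document identifier is first mapped to its group
    identifier (identifier determination), then the filter decides on the
    group identifier alone (filtering).
    """
    result = list(hit_doc_ids)
    for doc_to_group, group_ids, operation in filters:
        result = [doc_id for doc_id in result
                  if passes(doc_to_group[doc_id], group_ids, operation)]
    return result


# Hypothetical groupings of four documents on two data items.
paper_groups = {1: 0, 2: 1, 3: 0, 4: 2}
date_groups = {1: 0, 2: 0, 3: 1, 4: 1}
filters = [
    (paper_groups, {1}, "exclude"),  # drop documents in newspaper group 1
    (date_groups, {1}, "include"),   # keep only documents in date group 1
]
final_result = filter_hits([1, 2, 3, 4], filters)
```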
For example, Figure 5 is a schematic comparison of the principle of the search method of the present application with that of the prior art.
As Figure 5 shows, for a Boolean search with four search conditions, the prior approach runs a retrieval with each of the four search conditions (search conditions 1, 2, 3, and 4), obtains the hit documents of each (for example, the documents shown beneath each search condition in the figure), and then merges the hit documents of all the conditions to obtain the final result. In the present application, search conditions 3 and 4 are converted into filters: search conditions 1 and 2 are used for retrieval, their results are merged into a preliminary search result, and this result is then passed through filters 1 and 2 in turn to obtain the final result. Using filters reduces the complexity of merging results and increases search speed.
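The contrast described for Figure 5 can be reproduced on toy data. This sketch assumes that merging means the intersection of hit sets (a Boolean AND query), which the text does not state explicitly; the posting lists, document identifiers, and function names are invented for illustration.

```python
# Hypothetical posting lists: search condition -> set of hit document ids.
postings = {
    ("title", "olympics"): {1, 2, 3, 5, 8},
    ("author", "li"): {2, 3, 5, 7, 8},
    ("papername", "Daily A"): {1, 3, 5, 9},
    ("date", "20080808"): {3, 4, 5, 6},
}
all_docs = range(1, 10)

# Group 0 holds the documents matching the value, group 1 all the others.
paper_group = {d: 0 if d in postings[("papername", "Daily A")] else 1
               for d in all_docs}
date_group = {d: 0 if d in postings[("date", "20080808")] else 1
              for d in all_docs}


def prior_approach(conditions):
    """Retrieve with every condition and merge all four hit sets."""
    return set.intersection(*(postings[c] for c in conditions))


def filtered_approach(first_conditions, group_filters):
    """Retrieve with the first conditions only, then filter by group."""
    hits = set.intersection(*(postings[c] for c in first_conditions))
    for doc_to_group, allowed_groups in group_filters:
        hits = {d for d in hits if doc_to_group[d] in allowed_groups}
    return hits


baseline = prior_approach(list(postings))
optimized = filtered_approach(
    [("title", "olympics"), ("author", "li")],
    [(paper_group, {0}), (date_group, {0})],
)
```

Both approaches return the same documents, but the second merges two hit sets instead of four and decides the remaining conditions with a per-document group lookup.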
The search method and device provided by the embodiments of the present invention group the documents in the index database according to the data item values of preset data items, so that group-based filters can be generated at search time. A group-based filter only allows, or only disallows, the documents of set groups to pass. Compared with a document-based filter, it is more efficient to create, processes less data, and filters faster.
When a full-text search is required, the first and second search conditions can be determined from the search attribute information of the data items in the search condition submitted by the user, so that the retrieval corresponding to the second search condition is carried out as a filtering process. The preliminary search result is filtered by group-based filters, which filter conveniently by group identifier, to obtain the final search result. Reducing the number of search conditions used for retrieval reduces the complexity of retrieval and of merging search results, which saves system resources and increases processing speed. This overcomes the performance shortcomings of document-based filters in text retrieval systems and improves the overall performance of the retrieval system.
In addition, complex filters can be generated, and commonly used filters can be generated and cached in advance, which reduces filter generation and creation work. Retrieval and filter generation can be processed concurrently, and the search terms on the same data item can be merged to generate a single filter. This further optimizes system performance by reducing the number of search conditions and improving retrieval performance.
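Merging the search terms on one data item into a single filter can be sketched as below. The sketch assumes that terms sharing the same data item, format, and filter type are alternatives whose values can be pooled; the function name `merge_specs` and the dictionary layout are hypothetical.

```python
def merge_specs(specs):
    """Pool the values of specs that share field, format and operation."""
    merged = {}
    for spec in specs:
        key = (spec["field"], spec["format"], spec["operation"])
        merged.setdefault(key, set()).update(spec["values"])
    return [
        {"field": field, "format": fmt, "operation": operation, "values": values}
        for (field, fmt, operation), values in merged.items()
    ]


specs = [
    {"field": "papername", "format": "index", "operation": "exclude",
     "values": {2, 3, 4, 5}},
    {"field": "date", "format": "value", "operation": "include",
     "values": {"20080808"}},
    {"field": "date", "format": "value", "operation": "include",
     "values": {"20090808"}},
]
merged_specs = merge_specs(specs)
```

With this pooling, the two date entries yield one filter that lets either date pass, so each hit document is checked against the date grouping once rather than twice.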
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any variation, replacement, or application to other similar devices that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims (16)

1. A search method, characterized by comprising: grouping the documents in an index database according to the data item values of a preset data item, and, when a search condition submitted by a user is obtained, performing the following steps:
determining, according to the search attribute information of the data items in the search condition, a first search condition used for retrieval and a second search condition used for filtering;
searching the index database with the first search condition to obtain a preliminary search result; and looking up the data item values corresponding to the search terms on the data items contained in the second search condition, and generating group-based filters, wherein each group-based filter only allows, or only disallows, the documents of set groups to pass;
filtering the hit documents in the preliminary search result through each of the filters in turn to obtain a final search result.
2. The method of claim 1, characterized in that grouping the documents in the index database specifically comprises:
establishing a correspondence between the data item values and group identifiers, and establishing a correspondence between each group identifier and the document identifiers of the documents containing the data item value.
3. The method of claim 2, characterized in that looking up the data item values corresponding to the search terms on the data items contained in the second search condition and generating group-based filters specifically comprises:
looking up, in the index database, the data item values corresponding to the search terms on the data items contained in the second search condition, and generating filtering information corresponding to the second search condition, the filtering information containing: the data item to be filtered, the corresponding filter type, and the format and value range of the filter value on that data item;
determining the corresponding effective group identifiers according to the filter type of the filtered data item and the format and value range of the filter value in the filtering information, and generating the filter that only allows, or only disallows, the documents corresponding to the effective group identifiers to pass; or directly finding the corresponding pre-cached group-based filter according to the filtering information.
4. The method of claim 3, characterized in that looking up, in the index database, the data item values corresponding to the search terms on the data items contained in the second search condition and generating the filtering information corresponding to the second search condition specifically comprises:
querying the index database according to the search terms on the data items contained in the second search condition, and obtaining the data item values that match the search terms;
directly generating filtering information in which the format of the filter value is data item value and the corresponding value range is a set data item value or a range of data item values; or, according to the data item values that match the search terms, looking up the correspondence between data item values and group identifiers, determining the corresponding group identifiers, and generating filtering information in which the format of the filter value is group identifier and the corresponding value range is a set group identifier or a range of group identifiers.
5. The method of claim 4, characterized in that determining the corresponding effective group identifiers according to the filtering information and generating the filter specifically comprises:
looking up the correspondence between data item values and group identifiers according to the set data item value or range of data item values contained in the filtering information, obtaining the effective group identifiers, and generating the filter according to the effective group identifiers and the filter type; or directly taking the group identifier or range of group identifiers contained in the filtering information as the effective group identifiers, and generating the filter according to the effective group identifiers and the filter type.
6. The method of claim 1, characterized in that determining, according to the search attribute information of the data items in the search condition, the first search condition used for retrieval and the second search condition used for filtering specifically comprises:
according to the search attribute information predefined for each data item in the index database, determining as the second search condition some or all of the search conditions whose data items have search attribute information indicating retrieval on an index created directly without word segmentation; and determining the remaining search conditions as the first search condition.
7. The method of claim 1, characterized in that, when there is more than one first search condition, the index database is searched with each first search condition separately, and the results obtained for each first search condition are merged to obtain the preliminary search result.
8. The method of any one of claims 3 to 7, characterized in that filtering the hit documents in the preliminary search result through each of the filters in turn to obtain the final search result specifically comprises:
determining the group identifier of each hit document according to the document identifiers of the hit documents in the preliminary search result;
passing the determined group identifiers through each of the filters in turn, with each filter only allowing, or only disallowing, its effective group identifiers, and taking the documents that pass every filter as the final search result.
9. A search device, characterized by comprising:
a grouping module, configured to group the documents in an index database according to the data item values of a preset data item;
a separation module, configured to obtain a search condition submitted by a user and, according to the search attribute information of the data items in the search condition, to determine a first search condition used for retrieval and a second search condition used for filtering;
a retrieval module, configured to search the index database with the first search condition determined by the separation module and obtain a preliminary search result;
a generation module, configured to look up the data item values corresponding to the search terms on the data items contained in the second search condition determined by the separation module, and to generate group-based filters, wherein each group-based filter only allows, or only disallows, the documents of set groups to pass;
a filtering module, configured to filter the hit documents of the preliminary search result through each group-based filter generated by the generation module in turn, and to obtain a final search result.
10. The device of claim 9, characterized in that the grouping module is specifically configured to: establish a correspondence between the data item values and group identifiers, and establish a correspondence between each group identifier and the document identifiers of the documents containing the data item value, thereby grouping the documents in the index database.
11. The device of claim 10, characterized in that the generation module specifically comprises:
an information processing unit, configured to look up, in the index database, the data item values corresponding to the search terms on the data items contained in the second search condition determined by the separation module, and to generate filtering information corresponding to the second search condition, the filtering information containing: the data item to be filtered, the corresponding filter type, and the format and value range of the filter value on that data item;
a determination and generation unit, configured to determine the corresponding effective group identifiers according to the filter type of the filtered data item and the format and value range of the filter value in the filtering information generated by the information processing unit, and to generate the filter that only allows, or only disallows, the documents corresponding to the effective group identifiers to pass; or to directly find the corresponding pre-cached group-based filter according to the filtering information.
12. The device of claim 11, characterized in that the information processing unit specifically comprises:
a lookup subunit, configured to query the index database according to the search terms on the data items contained in the second search condition determined by the separation module, and to obtain the data item values that match the search terms;
a processing subunit, configured to directly generate filtering information in which the format of the filter value is data item value and the corresponding value range is a set data item value or a range of data item values; or, according to the data item values that match the search terms, to look up the correspondence between data item values and group identifiers, determine the corresponding group identifiers, and generate filtering information in which the format of the filter value is group identifier and the corresponding value range is a set group identifier or a range of group identifiers.
13. The device of claim 12, characterized in that the determination and generation unit specifically comprises:
a determination subunit, configured to look up the correspondence between data item values and group identifiers according to the set data item value or range of data item values contained in the filtering information generated by the information processing unit, and so obtain the effective group identifiers; or to directly take the group identifier or range of group identifiers contained in the filtering information as the effective group identifiers;
a generation subunit, configured to generate the filter according to the effective group identifiers and the filter type.
14. The device of claim 9, characterized in that the separation module is specifically configured to: according to the search attribute information predefined for each data item in the index database, determine as the second search condition some or all of the search conditions whose data items have search attribute information indicating retrieval on an index created directly without word segmentation; and determine the remaining search conditions as the first search condition.
15. The device of claim 9, characterized in that the retrieval module is specifically configured to: when there is more than one first search condition, search the index database with each first search condition separately, and merge the results obtained for each first search condition to obtain the preliminary search result.
16. The device of any one of claims 9 to 15, characterized in that the filtering module specifically comprises:
an identifier determination unit, configured to determine the group identifier of each hit document according to the document identifiers of the hit documents in the preliminary search result;
a filter unit, configured to pass the determined group identifiers through each of the filters in turn, with each filter only allowing, or only disallowing, its effective group identifiers, and to take the documents that pass every filter as the final search result.
CN2009102371861A 2009-11-10 2009-11-10 Searching method and searching device Expired - Fee Related CN102054007B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009102371861A CN102054007B (en) 2009-11-10 2009-11-10 Searching method and searching device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009102371861A CN102054007B (en) 2009-11-10 2009-11-10 Searching method and searching device

Publications (2)

Publication Number Publication Date
CN102054007A true CN102054007A (en) 2011-05-11
CN102054007B CN102054007B (en) 2012-10-31

Family

ID=43958341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009102371861A Expired - Fee Related CN102054007B (en) 2009-11-10 2009-11-10 Searching method and searching device

Country Status (1)

Country Link
CN (1) CN102054007B (en)


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6993603B2 (en) * 2002-12-09 2006-01-31 Microsoft Corporation Managed file system filter model and architecture
CN100578498C (en) * 2006-06-07 2010-01-06 华为技术有限公司 Data integral service system and method
CN101281524A (en) * 2007-09-24 2008-10-08 北大方正集团有限公司 Method and apparatus for acquiring material


Also Published As

Publication number Publication date
CN102054007B (en) 2012-10-31


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220622

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: FOUNDER APABI TECHNOLOGY Ltd.

Patentee after: Peking University

Address before: 100871, Beijing, Haidian District Cheng Fu Road 298, founder building, 9 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: FOUNDER APABI TECHNOLOGY Ltd.

Patentee before: Peking University

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20121031