CN103488687A - Searching system and searching method of big data

Info

Publication number: CN103488687A
Application number: CN201310392278.3A
Authority: CN
Inventors: 郭辉
Original assignee: Yonyou Software Co Ltd
Current assignee: Yonyou Software Co Ltd
Priority date: 2013-09-02
Filing date: 2013-09-02
Publication date: 2014-01-01

Abstract

The invention provides a search system for big data. The search system comprises a grouping unit, a shard creation unit and a search unit. The grouping unit divides the index files of the big data into one or more source groups, the index files of each source group containing resource data of the same type. The shard creation unit performs a sharding operation on each source group to obtain a plurality of shard index files and creates a corresponding index shard from each shard index file. The search unit, according to a received search instruction, executes a concurrent search operation on the shard index files corresponding to one or more specified index shards, and obtains and returns the corresponding search result. The invention also provides a search method for big data. The technical scheme of the invention realizes a distributed index-file search method, helps increase search speed, and addresses the search-efficiency bottleneck of enterprise big data.

Description

Search system and search method for big data

Technical field

The present invention relates to the technical field of data search, and in particular to a search system for big data and a search method for big data.

Background art

Enterprise big data, also called massive data, refers to data involved in enterprise processes such as production and sales whose volume is so large that current mainstream software tools cannot capture, manage, process and organize it, within a reasonable time, into information that helps enterprise management decisions. With the wide application of technologies such as the Internet of Things, cloud computing, the mobile Internet and the Internet of Vehicles in enterprise information management, a large amount of internal information resources has emerged. According to statistics, enterprise data grows at a rate of 200% every year, and 80% of it is stored in enterprise computer systems as unstructured data such as files, mail, pictures and sound. Database management systems are not suited to retrieving and processing such data, yet this large volume of relatively scattered data is like a huge underground gold mine for the enterprise, and big data search can become a means for the enterprise to dig gold out of that mine. A technical solution for big data search has therefore become an urgent problem facing enterprises.

Enterprise search is an important technical means for processing unstructured data inside an enterprise. In the big data era, however, the data volume keeps expanding and the index files grow too fast, so search performance keeps declining; this has become a new bottleneck for the availability and efficiency of enterprise search applications.

In the prior art, there are currently two main approaches to enterprise big data search:

1. Solving the storage problem of big data with the open-source Apache project Hadoop;

2. Controlling the scale of the index information: when indexes are added incrementally, some inactive indexes are deleted so that the size of the index files stays under control.

In actual application, however, both schemes have defects. In scheme 1, Hadoop has efficiency problems in real-time search of enterprise big data: Hadoop's strength is write-once, read-many storage, and the frequent modification of enterprise data seriously degrades its efficiency. Scheme 2 is clearly a stopgap that improves search efficiency at the cost of sacrificing data volume.

Therefore, how to improve the search efficiency of enterprise big data has become a technical problem in urgent need of a solution.

Summary of the invention

Based on the above problems, the present invention proposes a new search technique for big data that realizes a distributed index-file search method, helps increase search speed, and addresses the search-efficiency bottleneck of enterprise big data.

In view of this, the present invention proposes a search system for big data, comprising: a grouping unit, configured to divide the index files of the big data into one or more source groups, the index files in each source group containing resource data of the same type; a shard creation unit, configured to perform a sharding operation on each source group to obtain a plurality of shard index files, and to create a corresponding index shard from each shard index file; and a search unit, configured to execute, according to a received search instruction, a concurrent search operation on the shard index files corresponding to one or more specified index shards, so as to obtain and return the corresponding search result.

In this technical scheme, sharding the index files allows a search to run concurrently on a plurality of index shards, which effectively shortens the time needed to search all index files and improves search efficiency. Generating different source groups according to the type of resource data makes it easier for a user, when retrieving, to specify only the relevant index shards according to his or her own needs instead of retrieving all index shards, which helps improve retrieval efficiency and reduces the power and computing resources consumed by the retrieval operation. Here, an index file comprises the specific resource data and the index generated from that resource data; and the "one or more specified index shards" may be specified by the user according to his or her needs, or may be some or all of the index shards by default.

In the above technical scheme, preferably, the shard creation unit is configured to: divide the resource data in the same source group, according to the server on which it resides, into a plurality of shard index files in one-to-one correspondence with the servers, and create the corresponding index shards.

In this technical scheme, for resource data that is already stored across a plurality of servers, the resource data stored on each server can be created as a corresponding index shard; for resource data that is already stored on the same server, it can be created as one corresponding index shard, or grouped and then created as a plurality of index shards. Creating index shards on a per-server basis keeps movement of resource data to a minimum, which helps reduce the computing resources occupied and avoids risks such as the data loss that data transfer may cause.
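The per-server sharding described above can be sketched as follows. This is a minimal illustration only: the record layout and the names `ResourceRecord` and `shard_by_server` are assumptions introduced here, not part of the disclosure. Resource data in one source group is bucketed by the server that already holds it, so no data has to be moved.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    """Hypothetical resource-data record; the field names are assumptions."""
    data_id: str
    source_group: str
    server: str  # server on which the data already resides


def shard_by_server(records):
    """Build one shard per (source group, server) pair without moving data."""
    shards = defaultdict(list)
    for rec in records:
        shards[(rec.source_group, rec.server)].append(rec.data_id)
    return dict(shards)


records = [
    ResourceRecord("d1", "office_docs", "server-a"),
    ResourceRecord("d2", "office_docs", "server-b"),
    ResourceRecord("d3", "office_docs", "server-a"),
]
shards = shard_by_server(records)
```

Here the two servers yield two shards for the `office_docs` source group, one per server.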

In any of the above technical schemes, preferably, the shard creation unit is further configured to: for the resource data on the same server, divide it into a plurality of shard index files according to the closeness of the relationship between the data, and create the corresponding index shards.

In this technical scheme, closeness refers to whether items of resource data simultaneously satisfy certain preset conditions. When they satisfy one of the conditions, or several of them at once, the items can be considered closely related and treated as resource data of the same type to be stored in the same index shard. Specifically, closely related data includes, for example, data that is always called or edited together (the number of times being greater than or equal to a preset frequency threshold), or data that all relates to the same user, company, etc.
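One way to turn the "called or edited together at least a threshold number of times" condition into shard groups is sketched below. The access-log format, the function name `group_by_closeness` and the use of a union-find structure to cluster related items are all assumptions made for illustration; the text does not prescribe a particular algorithm.

```python
from collections import Counter
from itertools import combinations


def group_by_closeness(access_log, threshold):
    """Cluster data IDs whose co-access count reaches the threshold.

    access_log: list of sets, each holding the IDs called or edited together.
    """
    pair_counts = Counter()
    for accessed in access_log:
        for a, b in combinations(sorted(accessed), 2):
            pair_counts[(a, b)] += 1

    # Union-find over all IDs seen in the log.
    parent = {i: i for accessed in access_log for i in accessed}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), count in pair_counts.items():
        if count >= threshold:
            parent[find(a)] = find(b)  # closely related: same shard

    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(g) for g in groups.values())


log = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}, {"a", "b"}]
groups = group_by_closeness(log, threshold=2)
```

With this log, only `a` and `b` are accessed together often enough to share a shard; `c` and `d` each remain on their own.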

In any of the above technical schemes, preferably, the search unit is further configured to: obtain the shard search result from each index shard; select, from all the shard search results, a predetermined number of items with the highest matching degree as the final search result; and return the final search result.

In this technical scheme, based on the keyword entered by the user, a corresponding search operation is executed on every specified index shard. The shard search results obtained from all the index shards are then combined, and the predetermined number of items with the highest matching degree is selected from them, thereby merging the shard search results obtained from the plurality of index shards.
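The merge step above amounts to a top-N selection over the combined shard results. A minimal sketch follows, assuming each hit is a `(score, data_id)` tuple in which the score stands for the matching degree; that representation and the name `merge_shard_results` are illustrative assumptions.

```python
import heapq


def merge_shard_results(shard_results, top_n):
    """Return the top_n hits with the highest matching degree across shards.

    shard_results: list of per-shard hit lists; each hit is (score, data_id).
    """
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(top_n, all_hits, key=lambda hit: hit[0])


shard_results = [
    [(0.91, "doc-3"), (0.40, "doc-7")],
    [(0.87, "doc-12")],
    [(0.95, "doc-20"), (0.10, "doc-21")],
]
final = merge_shard_results(shard_results, top_n=3)
```

Using `heapq.nlargest` avoids sorting every hit when only a small predetermined number of results is returned.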

In any of the above technical schemes, preferably, the system further comprises: a relation storage unit, configured to store the correspondence between each shard index file and the resource data it contains; wherein the search unit is further configured to: upon receiving an edit instruction for specified resource data, determine the index shard containing the specified resource data according to the correspondence, search for the specified resource data in the determined index shard, and execute the edit operation.

In this technical scheme, a correspondence is established between each shard index file and the resource data it contains. When, for example, a user wishes to update resource data and the original resource data therefore has to be edited, the index shard to which that resource data belongs can be found directly from the correspondence. The resource data then only needs to be searched for and edited within that index shard, without searching the other index shards, which helps reduce the computational load and improve processing efficiency.
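The edit routing described above can be illustrated with a small sketch. The class `ShardMapping` is a hypothetical stand-in for the relation storage unit, and shards are modeled as plain dictionaries; both are assumptions made purely for illustration.

```python
class ShardMapping:
    """Hypothetical stand-in for the relation storage unit."""

    def __init__(self):
        self._shard_of = {}  # data ID -> shard name

    def record(self, data_id, shard):
        self._shard_of[data_id] = shard

    def shard_for(self, data_id):
        return self._shard_of[data_id]


def edit_resource(mapping, shards, data_id, new_content):
    """Apply an edit in the one shard that holds data_id; touch no other shard."""
    shard_name = mapping.shard_for(data_id)
    shards[shard_name][data_id] = new_content
    return shard_name


shards = {"shard-0": {"d1": "old text"}, "shard-1": {"d2": "other"}}
mapping = ShardMapping()
mapping.record("d1", "shard-0")
mapping.record("d2", "shard-1")
touched = edit_resource(mapping, shards, "d1", "new text")
```

Only `shard-0` is consulted for the edit; `shard-1` is never searched.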

According to another aspect of the present invention, a search method for big data is also proposed, comprising: step 202, dividing the index files of the big data into one or more source groups, the index files in each source group containing resource data of the same type; step 204, performing a sharding operation on each source group to obtain a plurality of shard index files, and creating a corresponding index shard from each shard index file; and step 206, executing, according to a received search instruction, a concurrent search operation on the shard index files corresponding to one or more specified index shards, so as to obtain and return the corresponding search result.

In the above technical scheme, preferably, step 204 comprises: dividing the resource data in the same source group, according to the server on which it resides, into a plurality of shard index files in one-to-one correspondence with the servers, and creating the corresponding index shards.

In any of the above technical schemes, preferably, step 204 further comprises: for the resource data on the same server, dividing it into a plurality of shard index files according to the closeness of the relationship between the data, and creating the corresponding index shards.

In any of the above technical schemes, preferably, step 206 further comprises: obtaining the shard search result from each index shard; selecting, from all the shard search results, a predetermined number of items with the highest matching degree as the final search result; and returning the final search result.

In any of the above technical schemes, preferably, the method further comprises: storing the correspondence between each index shard and the resource data it contains; upon receiving an edit instruction for specified resource data, determining the index shard containing the specified resource data according to the correspondence; and searching for the specified resource data in the determined index shard and executing the edit operation.

The above technical schemes realize a distributed index-file search method, help increase search speed, and address the search-efficiency bottleneck of enterprise big data.

Brief description of the drawings

Fig. 1 shows a schematic block diagram of a search system for big data according to an embodiment of the present invention;

Fig. 2 shows a schematic flow chart of a search method for big data according to an embodiment of the present invention;

Fig. 3 shows a schematic diagram of the framework for searching big data according to an embodiment of the present invention;

Fig. 4 shows a schematic flow chart of executing index sharding according to an embodiment of the present invention.

Detailed description of the embodiments

In order that the above objects, features and advantages of the present invention may be understood more clearly, the present invention is described in further detail below with reference to the drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another.

Many specific details are set forth in the following description to provide a full understanding of the present invention. The present invention may, however, also be implemented in ways other than those described here, and the scope of protection of the present invention is therefore not limited by the specific embodiments disclosed below.

Fig. 1 shows a schematic block diagram of a search system for big data according to an embodiment of the present invention.

As shown in Fig. 1, a search system 100 for big data according to an embodiment of the present invention comprises: a grouping unit 102, configured to divide the index files of the big data into one or more source groups, the index files in each source group containing resource data of the same type; a shard creation unit 104, configured to perform a sharding operation on each source group to obtain a plurality of shard index files, and to create a corresponding index shard from each shard index file; and a search unit 106, configured to execute, according to a received search instruction, a concurrent search operation on the shard index files corresponding to one or more specified index shards, so as to obtain and return the corresponding search result.

In the above technical scheme, preferably, the shard creation unit 104 is configured to: divide the resource data in the same source group, according to the server on which it resides, into a plurality of shard index files in one-to-one correspondence with the servers, and create the corresponding index shards.

In any of the above technical schemes, preferably, the shard creation unit 104 is further configured to: for the resource data on the same server, divide it into a plurality of shard index files according to the closeness of the relationship between the data, and create the corresponding index shards.

In any of the above technical schemes, preferably, the search unit 106 is further configured to: obtain the shard search result from each index shard; select, from all the shard search results, a predetermined number of items with the highest matching degree as the final search result; and return the final search result.

In any of the above technical schemes, preferably, the system further comprises: a relation storage unit 108, configured to store the correspondence between each shard index file and the resource data it contains; wherein the search unit 106 is further configured to: upon receiving an edit instruction for specified resource data, determine the index shard containing the specified resource data according to the correspondence, search for the specified resource data in the determined index shard, and execute the edit operation.

Corresponding to the search system 100 shown in Fig. 1, the big data search process of the present invention is described in detail below with reference to Figs. 2 to 4.

As shown in Fig. 2, a search method for big data according to an embodiment of the present invention comprises: step 202, dividing the index files of the big data into one or more source groups, the index files in each source group containing resource data of the same type; step 204, performing a sharding operation on each source group to obtain a plurality of shard index files, and creating a corresponding index shard from each shard index file; and step 206, executing, according to a received search instruction, a concurrent search operation on the shard index files corresponding to one or more specified index shards, so as to obtain and return the corresponding search result.

As shown in Fig. 3, the search sources are all of the index files described above, covering enterprise databases, web data, file systems, audio data, video data and so on. All of these files can be classified by type, for example standard basic file types, document types, Office document types, email types, etc.

These different types are called "source types". Resource data of the same source type can be placed in the same set, called a "search source". Of course, the source types of the resource data in different search sources may be the same or different. All search sources of the same type can form one "source group"; they can also form several source groups, in which case the resource data contained in those source groups may be of the same source type.
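The relationship between source types, search sources and source groups can be sketched as a simple grouping. The sketch assumes the simplest policy the text allows, one source group per source type; the search-source names and the function `build_source_groups` are hypothetical.

```python
from collections import defaultdict


def build_source_groups(search_sources):
    """Group search sources by source type, one source group per type.

    search_sources: dict mapping a search-source name to its source type.
    """
    groups = defaultdict(list)
    for name, source_type in search_sources.items():
        groups[source_type].append(name)
    return {t: sorted(names) for t, names in groups.items()}


sources = {
    "hr_mailbox": "email",
    "sales_mailbox": "email",
    "contracts_share": "office_document",
}
source_groups = build_source_groups(sources)
```

Splitting one source type across several source groups, which the text also permits, would only require a finer grouping key.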

With the "source group" as the minimum unit of physical isolation, all resource data of the big data is sharded. One source group can form a plurality of shard index files and thus a plurality of index shards. Each index shard can be called a search core. A plurality of search cores can be configured and run on each server, and the search cores corresponding to one source group can also be configured and run on a plurality of servers. For example, when a source group is divided into 3 search cores, all 3 search cores can be configured and run on one server, or one search core can run on each of 3 servers, or 1 search core can run on one server and 2 on another; the configuration can be adjusted according to actual conditions.
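The three placements mentioned in the example (3 cores on one server, 1 core on each of 3 servers, or a 2+1 split over 2 servers) can be reproduced with a simple assignment routine. Round-robin assignment is an assumption chosen for illustration; the text leaves the placement policy to the actual deployment.

```python
def place_cores(core_names, servers):
    """Assign search cores to servers round-robin (one possible policy)."""
    placement = {server: [] for server in servers}
    for i, core in enumerate(core_names):
        placement[servers[i % len(servers)]].append(core)
    return placement


cores = ["core-0", "core-1", "core-2"]
one_server = place_cores(cores, ["srv-1"])
two_servers = place_cores(cores, ["srv-1", "srv-2"])
three_servers = place_cores(cores, ["srv-1", "srv-2", "srv-3"])
```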

When index shards are created, the resource data in the search source is crawled by an index crawler plug-in, and the preset sharding rules determine exactly how the data is to be sharded and which index shards are created.

In the embodiment shown in Fig. 3, the sharding operation can be carried out by the index distribution module. Specifically, the present application proposes several ways of performing the sharding operation, for example:

In the first case, horizontal index sharding can be adopted: for the resource data corresponding to one source group, a plurality of search cores is established and configured on a plurality of different servers. Each server can be configured with one or more search cores.

If the resource data in the same source group is already located on a plurality of servers, this approach can be adopted and the corresponding search cores created directly.

In the second case, vertical index sharding can be adopted: the resource data corresponding to one source group is all stored on the same server, and a plurality of search cores is established on it.

If the resource data in the same source group is already located on the same server, this approach can be adopted and the corresponding search cores created directly.

In the third case, intelligent horizontal and vertical sharding can be adopted. Enterprise data usually follows business rules, so closely related data can be identified by computing the relationships between the data and distributed to the same index shard. In current large-enterprise databases, data is in many cases partitioned logically by group or company, and it commonly happens that the data of certain groups or companies is usually accessed together while the data of other groups and companies is accessed together separately. For this situation we propose an intelligent business-level sharding strategy: all data sharing the same business association is assigned to the same index shard. This data may be located on the same server or on a plurality of servers. In either case it can, according to actual conditions, be configured on the same server with one or more corresponding search cores established there, or configured on different servers with one or more corresponding search cores established on each server.

By presetting which sharding mode is used and the specific sharding strategy, index files can be sharded automatically.

The above processing yields the persistent data shown in Fig. 3, such as the index files, search source types, search source groupings, search source information and index sharding strategies, which the user's search operations rely on.

As shown in Fig. 4, the flow of executing index sharding according to an embodiment of the present invention comprises:

Step 402: the source group crawler plug-in crawls the index data (i.e. the resource data described above) in the search source; the result is denoted list, and a counter is initialized as int i=0.

Step 404: obtain the i-th item of index data.

Step 406: determine, according to the preset sharding strategy, whether a new index shard needs to be created; if so, go to step 408, otherwise go to step 410.

Step 408: according to the sharding strategy obtained, such as the horizontal sharding strategy or the vertical sharding strategy described above, forward the index data to the shard server, i.e. the server that manages the sharding operation.

Step 410: determine the index file that needs to be updated.

Step 412: update the index shard according to the mapping information table.

Step 414: judge whether i+1<list.size(); if so, unprocessed index data remains and the flow goes to step 416; otherwise the flow ends.

Step 416: i++, i.e. i is incremented by 1, and the flow returns to step 404.
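The flow of Fig. 4 can be condensed into a short sketch. Several simplifications are assumed here and are not taken from the text: shards are plain in-memory dictionaries rather than files on a shard server, the sharding strategy is passed in as a function `shard_of`, and "a new index shard is needed" is taken to mean that the target shard does not exist yet. The explicit counter of steps 404, 414 and 416 is replaced by an ordinary loop that visits each record once.

```python
def run_sharding(index_data, shard_of):
    """Distribute crawled index data to shards, loosely following Fig. 4.

    index_data: list of records crawled in step 402 (dicts with "id", "text").
    shard_of:   hypothetical sharding strategy, record -> shard name.
    """
    shards = {}   # shard name -> {data ID: text}
    mapping = {}  # mapping information table: data ID -> shard name
    for record in index_data:                  # steps 404, 414, 416
        target = shard_of(record)
        if target not in shards:               # step 406: new shard needed?
            shards[target] = {}                # step 408 (simplified)
        shards[target][record["id"]] = record["text"]  # steps 410, 412
        mapping[record["id"]] = target
    return shards, mapping


crawled = [
    {"id": "d1", "company": "north", "text": "invoice"},
    {"id": "d2", "company": "south", "text": "contract"},
    {"id": "d3", "company": "north", "text": "memo"},
]
shards, mapping = run_sharding(crawled, shard_of=lambda r: "shard-" + r["company"])
```

The example strategy shards by company, in the spirit of the business-level sharding described earlier.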

Step 412 above involves a "mapping information table". This table records the correspondence between each index shard and the resource data it contains. When a user needs to operate on a piece of resource data, for example to delete or update it, the system can use the mapping information table to determine the index shard in which that resource data resides and search only within that index shard. The other, unrelated index shards need not be searched, which helps reduce the operating load and improve retrieval efficiency.

Index field    Field description
sourcegroup    Search source group identifier
source         Search source identifier
ID             Data ID
Shard          Shard

Table 1

Table 1 shows one possible format of the mapping information table, which contains the search source group identifier, the search source identifier, the data ID (the unique identifier of the resource data) and the shard (the index shard to which the data belongs).

In Fig. 3, the "index file distribution framework" also contains a number of functional modules:

Cluster shard information acquisition module: the control center for all cluster shard information. The cluster shard information is equivalent to the data domain of the shard information. This module exposes the shard configuration strategies of the source groups and search sources, and also provides a service for changing the sharding strategies described above.

Index sharding and management module: according to the index data splitting algorithm configured for the source group, this module globally counts the number of indexes in each index shard and, based on those counts, provides the basis for issuing index distribution commands. It is also responsible for loading and managing the mapping information table (as shown in Table 1) for use by the crawler and by search queries. This module may adopt a caching strategy, loading the active index keys into memory and storing them in a hash table to speed up retrieval. This way of managing indexes also makes it easier to add shards dynamically.
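The caching strategy mentioned for the index sharding and management module can be sketched as a bounded in-memory hash table placed in front of the full mapping information table. The text does not say how "active" keys are chosen; least-recently-used eviction is assumed here, and the class name `ActiveKeyCache` and the key layout `(sourcegroup, source, ID)` are likewise illustrative.

```python
from collections import OrderedDict


class ActiveKeyCache:
    """Bounded in-memory hash table over the mapping table (LRU assumed)."""

    def __init__(self, backing_table, capacity):
        self._backing = backing_table  # full table; a plain dict in this sketch
        self._capacity = capacity
        self._cache = OrderedDict()
        self.misses = 0

    def shard_for(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as most recently used
            return self._cache[key]
        self.misses += 1
        shard = self._backing[key]           # fall back to the full table
        self._cache[key] = shard
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used key
        return shard


table = {
    ("grp", "src", "d1"): "shard-0",
    ("grp", "src", "d2"): "shard-1",
    ("grp", "src", "d3"): "shard-0",
}
cache = ActiveKeyCache(table, capacity=2)
lookups = [cache.shard_for(("grp", "src", k)) for k in ("d1", "d1", "d2", "d3", "d1")]
```

In this run the second lookup of `d1` is served from memory, while the last one misses again because `d1` was evicted when `d3` was loaded.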

Dynamic configuration module: mainly handles the dynamic addition of search cluster servers and the dynamic configuration management of index sharding strategies and search strategies.

Specifically, as the volume of enterprise data grows, search servers need to be added to improve performance, or search shards need to be added to certain source groups to improve search efficiency, so dynamic expansion is also an important function of cluster management. For a server in the cluster on which index shards are already configured, a mirror copy is taken: the server code stays identical to the original server code, and the service is switched dynamically to the new high-performance server. A newly added server is assigned a code by the dynamic configuration module. Server codes must be unique within the whole cluster, and once a code has been enabled it may not be changed. The source groups to be sharded on the new server and the index sharding strategies of the search sources then need to be adjusted. In addition, the index sharding and management module can use load balancing to create index shards preferentially on newly added servers.
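The server-code rules above (codes unique across the cluster, no change once enabled, mirror replacement keeping the same code) can be expressed as a small registry. The class `ServerRegistry` and its method names are hypothetical; the sketch only models the bookkeeping, not the actual copying of index data.

```python
class ServerRegistry:
    """Hypothetical registry enforcing the server-code rules described above."""

    def __init__(self):
        self._servers = {}  # server code -> {"host": str, "enabled": bool}

    def add_server(self, code, host):
        if code in self._servers:
            raise ValueError(f"server code {code!r} already exists in the cluster")
        self._servers[code] = {"host": host, "enabled": False}

    def enable(self, code):
        self._servers[code]["enabled"] = True

    def change_code(self, old_code, new_code):
        if self._servers[old_code]["enabled"]:
            raise ValueError("an enabled server code cannot be changed")
        if new_code in self._servers:
            raise ValueError(f"server code {new_code!r} already exists in the cluster")
        self._servers[new_code] = self._servers.pop(old_code)

    def replace_host(self, code, new_host):
        """Mirror-copy switch: same code, new (e.g. higher-performance) host."""
        self._servers[code]["host"] = new_host

    def host_of(self, code):
        return self._servers[code]["host"]


registry = ServerRegistry()
registry.add_server("S01", "old-box")
registry.enable("S01")
registry.replace_host("S01", "fast-box")  # switch to the mirrored server
registry.add_server("S02", "new-box")     # newly added server with a new code
```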

Search merge engine: when a user sends a search request to a server, the search merge engine searches the index shards specified by the user concurrently; once the searches have completed, it merges the search results and returns them to the user. If no index shard is specified, all index shards of the source group are searched by default, and the search results are merged and returned to the user.
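The behavior of the search merge engine, concurrent search over the specified shards with a fall-back to all shards of the source group, can be sketched with a thread pool. The in-memory shard representation, the toy scoring rule (keyword occurrence count) and the function names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard_docs, keyword):
    """Toy per-shard search: the score is the number of keyword occurrences."""
    hits = []
    for data_id, text in shard_docs.items():
        score = text.count(keyword)
        if score:
            hits.append((score, data_id))
    return hits


def merged_search(source_group, keyword, shard_names=None):
    """Search the named shards concurrently; default to every shard in the group."""
    names = list(shard_names) if shard_names else list(source_group)
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        per_shard = pool.map(lambda n: search_shard(source_group[n], keyword), names)
        merged = [hit for hits in per_shard for hit in hits]
    return sorted(merged, reverse=True)  # best matches first


group = {
    "shard-0": {"d1": "big data search", "d2": "mail"},
    "shard-1": {"d3": "data data"},
}
everything = merged_search(group, "data")
only_first = merged_search(group, "data", ["shard-0"])
```

Threads are used here only because the toy shards live in one process; in a cluster each task would instead be a request to the server hosting the corresponding search core.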

The technical scheme of the present invention has been described above with reference to the drawings. The technical scheme of the present invention can achieve the following:

1. Sharding the index files in an enterprise cluster environment greatly improves the capacity to process big data, raising the enterprise's data-processing capacity from the GB level to the TB level.

2. The multi-core parallel search mode improves the search speed for big data; for data at the scale of millions of records, the speed can improve by several orders of magnitude compared with not using this technique.

3. Support for horizontal scaling makes it easy for an enterprise to add search servers.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention; for a person skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A search system for big data, characterized by comprising:

A grouping unit, configured to divide the index files of the big data into one or more source groups, the index files in each source group containing resource data of the same type;

A shard creation unit, configured to perform a sharding operation on each source group to obtain a plurality of shard index files, and to create a corresponding index shard from each shard index file;

A search unit, configured to execute, according to a received search instruction, a concurrent search operation on the shard index files corresponding to one or more specified index shards, so as to obtain and return the corresponding search result.

2. The search system for big data according to claim 1, characterized in that the shard creation unit is configured to:

Divide the resource data in the same source group, according to the server on which it resides, into a plurality of shard index files in one-to-one correspondence with the servers, and create the corresponding index shards.

3. The search system for big data according to claim 2, characterized in that the shard creation unit is further configured to:

For the resource data on the same server, divide it into a plurality of shard index files according to the closeness of the relationship between the data, and create the corresponding index shards.

4. The search system for big data according to claim 1, characterized in that the search unit is further configured to:

Obtain the shard search result from each index shard;

Select, from all the shard search results, a predetermined number of items with the highest matching degree as the final search result, and return the final search result.

5. The search system for big data according to any one of claims 1 to 4, characterized by further comprising:

A relation storage unit, configured to store the correspondence between each shard index file and the resource data it contains;

Wherein the search unit is further configured to: upon receiving an edit instruction for specified resource data, determine the index shard containing the specified resource data according to the correspondence, search for the specified resource data in the determined index shard, and execute the edit operation.

6. A search method for big data, characterized by comprising:

Step 202, dividing the index files of the big data into one or more source groups, the index files in each source group containing resource data of the same type;

Step 204, performing a sharding operation on each source group to obtain a plurality of shard index files, and creating a corresponding index shard from each shard index file;

Step 206, executing, according to a received search instruction, a concurrent search operation on the shard index files corresponding to one or more specified index shards, so as to obtain and return the corresponding search result.

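7. The search method for big data according to claim 6, characterized in that step 204 comprises:

Dividing the resource data in the same source group, according to the server on which it resides, into a plurality of shard index files in one-to-one correspondence with the servers, and creating the corresponding index shards.

8. The search method for big data according to claim 7, characterized in that step 204 further comprises:

For the resource data on the same server, dividing it into a plurality of shard index files according to the closeness of the relationship between the data, and creating the corresponding index shards.

9. The search method for big data according to claim 6, characterized in that step 206 further comprises:

Obtaining the shard search result from each index shard;

Selecting, from all the shard search results, a predetermined number of items with the highest matching degree as the final search result, and returning the final search result.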

10. The search method for big data according to any one of claims 6 to 9, characterized by further comprising:

Storing the correspondence between each index shard and the resource data it contains;

Upon receiving an edit instruction for specified resource data, determining the index shard containing the specified resource data according to the correspondence;

Searching for the specified resource data in the determined index shard and executing the edit operation.