CN104714983A

CN104714983A - Generating method and device for distributed indexes

Info

Publication number: CN104714983A
Application number: CN201310695615.6A
Authority: CN
Inventors: 韩丙卫
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2013-12-17
Filing date: 2013-12-17
Publication date: 2015-06-17
Anticipated expiration: 2033-12-17
Also published as: WO2014180411A1; CN104714983B

Abstract

The invention discloses a generating method and device for distributed indexes. According to the method, the number of map jobs in Hadoop is determined according to the data volume of original data; data processed through the map jobs are distributed to multiple reduce jobs, and an index database corresponding to each reduce job is generated, wherein the number of the reduce jobs and the corresponding relation between each reduce job and one or more map jobs are pre-configured; the index databases corresponding to the reduce jobs are combined. According to the technical scheme, mass data are efficiently and quickly indexed.

Description

Method and device for generating a distributed index

Technical field

The present invention relates to the field of communications, and in particular to a method and device for generating a distributed index.

Background technology

With the arrival of the cloud era, big data has attracted increasing attention. The term is commonly used to describe the large volumes of unstructured and semi-structured data that a company creates, data that would take too much time and money to load into a relational database for analysis. Big data analysis is often associated with cloud computing, because analyzing large data sets in real time requires a framework such as MapReduce to distribute the work across tens, hundreds, or even thousands of computers. In the internet industry, big data usually refers to the user network behavior data that internet companies generate and accumulate in daily operation. The scale of this data is so large that it can no longer be measured in gigabytes or terabytes.

How big is big data? In a single day, the content produced on the internet could fill 168 million DVDs; more than 294 billion emails are sent; 2 million community posts are published; and 378,000 mobile phones are sold.

By 2012, data volumes had risen from the TB level (1 TB = 1024 GB) to the PB (1 PB = 1024 TB), EB (1 EB = 1024 PB), and even ZB (1 ZB = 1024 EB) levels. Research by International Data Corporation (IDC) shows that the world generated 0.49 ZB of data in 2008, 0.8 ZB in 2009, 1.2 ZB in 2010, and as much as 1.82 ZB in 2011, equivalent to more than 200 GB for every person in the world. By 2012, all printed material produced by humankind amounted to 200 PB, and all the words ever spoken by humankind amounted to approximately 5 EB. Research by IBM shows that 90% of all the data obtained by human civilization was produced in the past two years. By 2020, the scale of data produced worldwide is expected to reach 44 times that of today.

At present, in the era of big data, quickly and effectively finding the data a user cares about within big data has become an increasingly important problem. Creating an efficient index quickly is the prerequisite for user searches. The index creation schemes commonly adopted in the related art are single-threaded and hit a performance bottleneck when facing mass data: they place high demands on the system and have limited capacity for expansion, so they cannot meet the user's need to retrieve data quickly and effectively from mass data.

Summary of the invention

The present invention provides a method and device for generating a distributed index, so as to at least solve the problem in the related art that an efficient index cannot be created quickly for mass data.

According to one aspect of the present invention, a method for generating a distributed index is provided.

The method for generating a distributed index according to the present invention comprises: determining the number of map jobs in Hadoop according to the data volume of the original data; distributing the data processed by each map job to a plurality of reduce jobs, and generating an index database corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured; and merging the index databases corresponding to the reduce jobs.

Preferably, generating the index database corresponding to each reduce job comprises: obtaining the type of the currently supported file system; determining, according to the type of the file system, the generation mode of the index database corresponding to each reduce job; and generating the index database corresponding to each reduce job according to the generation mode.

Preferably, generating the index database corresponding to each reduce job according to the generation mode comprises: when the type of the file system is the Hadoop distributed file system (HDFS), generating the index database corresponding to each reduce job on local disk, and then uploading all the index databases generated on local disk to HDFS; or, when the type of the file system is another distributed file system (DFS) that supports sharing, other than HDFS, generating the index database corresponding to each reduce job directly in that shared DFS.

Preferably, merging the index databases corresponding to the reduce jobs comprises: when the type of the file system is HDFS, downloading the index database corresponding to each reduce job from HDFS to local disk; merging the index databases corresponding to the reduce jobs on local disk; and uploading the merged index database to HDFS and deleting the index database corresponding to each reduce job from local disk.

Preferably, merging the index databases corresponding to the reduce jobs comprises: when the type of the file system is another DFS that supports sharing, merging the index databases corresponding to the reduce jobs that were generated in the shared DFS; and deleting the index database corresponding to each reduce job from the shared DFS.

According to another aspect of the present invention, a device for generating a distributed index is provided.

The device for generating a distributed index according to the present invention comprises: a determination module, configured to determine the number of map jobs in Hadoop according to the data volume of the original data; a generation module, configured to distribute the data processed by each map job to a plurality of reduce jobs and to generate an index database corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured; and a merging module, configured to merge the index databases corresponding to the reduce jobs.

Preferably, the generation module comprises: an acquiring unit, configured to obtain the type of the currently supported file system; a determining unit, configured to determine, according to the type of the file system, the generation mode of the index database corresponding to each reduce job; and a generation unit, configured to generate the index database corresponding to each reduce job according to the generation mode.

Preferably, the generation unit is configured to, when the type of the file system is the Hadoop distributed file system (HDFS), generate the index database corresponding to each reduce job on local disk and then upload all the index databases generated on local disk to HDFS; or, the generation unit is configured to, when the type of the file system is another distributed file system (DFS) that supports sharing, other than HDFS, generate the index database corresponding to each reduce job directly in that shared DFS.

Preferably, the merging module comprises: a download unit, configured to, when the type of the file system is HDFS, download the index database corresponding to each reduce job from HDFS to local disk; a first merging unit, configured to merge the index databases corresponding to the reduce jobs on local disk; and a first processing unit, configured to upload the merged index database to HDFS and to delete the index database corresponding to each reduce job from local disk.

Preferably, the merging module comprises: a second merging unit, configured to, when the type of the file system is another DFS that supports sharing, merge the index databases corresponding to the reduce jobs that were generated in the shared DFS; and a second processing unit, configured to delete the index database corresponding to each reduce job from the shared DFS.

In the embodiments of the present invention, the number of map jobs in Hadoop is determined according to the data volume of the original data; the data processed by each map job is distributed to a plurality of reduce jobs, and an index database corresponding to each reduce job is generated, the number of reduce jobs and the correspondence between each reduce job and one or more map jobs being pre-configured; and the index databases corresponding to the reduce jobs are merged. In other words, the original data is processed by the map jobs and reduce jobs in Hadoop, an index database is generated for each reduce job, and those index databases are then merged. This solves the problem in the related art that an efficient index cannot be created quickly for mass data, so that mass data can be indexed efficiently and quickly.

Brief description of the drawings

The drawings described herein are provided for a further understanding of the present invention and form a part of this application. The exemplary embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of the method for generating a distributed index according to an embodiment of the present invention;

Fig. 2 is a flowchart of the method for generating a distributed index according to a preferred embodiment of the present invention;

Fig. 3 is a structural block diagram of the device for generating a distributed index according to an embodiment of the present invention;

Fig. 4 is a structural block diagram of the device for generating a distributed index according to a preferred embodiment of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments can be combined with one another.

Fig. 1 is a flowchart of the method for generating a distributed index according to an embodiment of the present invention. As shown in Fig. 1, the method may comprise the following processing steps:

Step S102: determine the number of map jobs in Hadoop according to the data volume of the original data;

Step S104: distribute the data processed by each map job to a plurality of reduce jobs, and generate an index database corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured;

Step S106: merge the index databases corresponding to the reduce jobs.

In the related art, an efficient index cannot be created quickly for mass data. With the method shown in Fig. 1, the original data is processed by the map jobs and reduce jobs in Hadoop, an index database is generated for each reduce job, and those index databases are then merged. This solves the problem in the related art that an efficient index cannot be created quickly for mass data, so that mass data can be indexed efficiently and quickly.

Preferably, in step S104, generating the index database corresponding to each reduce job may comprise the following operations:

Step S1: obtain the type of the currently supported file system;

Step S2: determine, according to the type of the file system, the generation mode of the index database corresponding to each reduce job;

Step S3: generate the index database corresponding to each reduce job according to the generation mode.

In a preferred embodiment, the data volume of the original data to be acquired is first determined, and the data is divided into M parts (M being a positive integer), each part corresponding to one map job. The data volume handled by each map job can, of course, be configured dynamically. A map data processing plug-in is set up accordingly. In addition, the sets of intermediate key-value pairs produced by each map job are periodically written to local disk, and the local disk is in turn divided into N partitions (N being a positive integer set by the user), each partition corresponding to one reduce job. The maximum number of reduce jobs is configured so as to improve the efficiency of creating the distributed index, and a reduce data processing plug-in is set up according to the number of reduce jobs configured by the user. In this preferred embodiment, index creation supports the Hadoop distributed file system (HDFS) as well as other distributed file systems (DFS) that support sharing. The generation mode of the index database corresponding to each reduce job can therefore be determined from the type of file system supported during index creation, and the index database corresponding to each reduce job is then generated according to that generation mode.
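The division of the original data into M parts can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `plan_map_jobs` and `split_ranges` and the parameter `bytes_per_map` (the dynamically configurable volume per map job) are hypothetical.

```python
def plan_map_jobs(total_bytes: int, bytes_per_map: int) -> int:
    """Return M, the number of map jobs needed to cover total_bytes.

    bytes_per_map is an assumed, configurable volume per map job.
    """
    if total_bytes < 0 or bytes_per_map <= 0:
        raise ValueError("total_bytes must be >= 0 and bytes_per_map > 0")
    # Integer ceiling division; M is kept a positive integer even for empty input.
    return max(1, (total_bytes + bytes_per_map - 1) // bytes_per_map)


def split_ranges(total_bytes: int, bytes_per_map: int) -> list:
    """Return the [start, end) byte range assigned to each of the M map jobs."""
    m = plan_map_jobs(total_bytes, bytes_per_map)
    return [
        (i * bytes_per_map, min((i + 1) * bytes_per_map, total_bytes))
        for i in range(m)
    ]
```

For example, 1000 bytes of original data with 256 bytes per map job would yield four map jobs, the last of which handles the remaining 232 bytes.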

Preferably, in step S3, generating the index database corresponding to each reduce job according to the generation mode may comprise one of the following steps:

Step S31: when the type of the file system is the Hadoop distributed file system (HDFS), generate the index database corresponding to each reduce job on local disk, and then upload all the index databases generated on local disk to HDFS;

Step S32: when the type of the file system is another distributed file system (DFS) that supports sharing, other than HDFS, generate the index database corresponding to each reduce job directly in that shared DFS.

In a preferred embodiment, if the type of the currently supported file system is HDFS, each reduce job generates a temporary index database in the local file system (i.e. on local disk); then, in the final cleanup phase of the reduce job, the temporary index database generated in the local file system is uploaded to the HDFS file system. If the type of the currently supported file system is another DFS that supports sharing, the temporary index database can be generated directly in the DFS file system.
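The choice of generation mode by file system type might look like the sketch below. The enum `FsType`, the function `index_build_plan`, and every path shown are hypothetical placeholders chosen for illustration; the source does not specify any directory layout.

```python
from enum import Enum
from typing import Optional, Tuple


class FsType(Enum):
    HDFS = "hdfs"
    SHARED_DFS = "shared_dfs"  # any other DFS that supports sharing


def index_build_plan(fs_type: FsType, reduce_id: int) -> Tuple[str, Optional[str]]:
    """Return (where the temporary index is built, where it is uploaded afterwards).

    All paths are illustrative assumptions.
    """
    if fs_type is FsType.HDFS:
        # Step S31: build on local disk, upload in the reduce job's cleanup phase.
        return (f"/tmp/index/part-{reduce_id}", f"hdfs:///index/tmp/part-{reduce_id}")
    # Step S32: build directly in the shared DFS; no upload is needed.
    return (f"/mnt/dfs/index/tmp/part-{reduce_id}", None)
```

Keeping the decision in one function means the reduce job itself does not need to know which file system is in use, only where to write and whether an upload follows.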

Preferably, in step S106, merging the index databases corresponding to the reduce jobs may comprise the following operations:

Step S4: when the type of the file system is HDFS, download the index database corresponding to each reduce job from HDFS to local disk;

Step S5: merge the index databases corresponding to the reduce jobs on local disk;

Step S6: upload the merged index database to HDFS, and delete the index database corresponding to each reduce job from local disk.

In a preferred embodiment, if the type of the currently supported file system is HDFS, the index master node (master) of Hadoop first downloads all the temporary index databases from the HDFS file system to the local file system. Second, the index master node merges all the temporary index databases in the local file system to generate the complete index database. Third, the index master node uploads the complete index database to the HDFS file system. The index master node then deletes each temporary index database from the local file system. Finally, the index slave nodes (slave) of Hadoop download the complete index database from the HDFS file system to their local file systems for use in retrieval.
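The download, merge, upload, and delete sequence on the index master node can be sketched as below. This is a simulation under stated assumptions: two ordinary directories stand in for HDFS and the local disk, `shutil.copy` stands in for the real transfer, and each temporary index database is assumed to be a text file of `term<TAB>doc_id` lines named `part-*.txt`. The function name `merge_via_local_disk` is hypothetical.

```python
import shutil
from pathlib import Path


def merge_via_local_disk(hdfs_dir: Path, local_dir: Path) -> Path:
    """Simulate steps S4 to S6 with plain directories standing in for HDFS and local disk."""
    # Step S4: download every temporary index database to local disk.
    local_parts = [
        Path(shutil.copy(part, local_dir / part.name))
        for part in sorted(hdfs_dir.glob("part-*.txt"))
    ]

    # Step S5: merge on local disk (here: the sorted union of all posting lines).
    postings = set()
    for part in local_parts:
        postings.update(line for line in part.read_text().splitlines() if line)
    complete_local = local_dir / "complete.txt"
    complete_local.write_text("".join(f"{line}\n" for line in sorted(postings)))

    # Step S6: upload the complete index, then delete the local temporary copies.
    complete_remote = Path(shutil.copy(complete_local, hdfs_dir / "complete.txt"))
    for part in local_parts:
        part.unlink()
    return complete_remote
```

The slave nodes would then fetch `complete.txt` from the directory standing in for HDFS; that final download is omitted here.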

Preferably, in step S106, merging the index databases corresponding to the reduce jobs may comprise the following steps:

Step S7: when the type of the file system is another DFS that supports sharing, merge the index databases corresponding to the reduce jobs that were generated in the shared DFS;

Step S8: delete the index database corresponding to each reduce job from the shared DFS.

In a preferred embodiment, if the type of the currently supported file system is another DFS that supports sharing, the index master node of Hadoop first merges the temporary index databases in the DFS file system into the complete index database for use in retrieval; the index master node then deletes each temporary index database from the DFS file system.

The above preferred implementation process is further described below with reference to the preferred embodiment shown in Fig. 2.

Fig. 2 is a flowchart of the method for generating a distributed index according to a preferred embodiment of the present invention. As shown in Fig. 2, the flow may comprise the following processing stages:

First stage: data acquisition, i.e. the map processing stage of Hadoop. Data acquisition is the preparatory stage before the index is built and supplies the data on which index creation relies. The Hadoop map stage is executed in a distributed manner and can process data in parallel; the number of map jobs is determined dynamically from the volume of data acquired. The map jobs of Hadoop process the collected text files or database files and generate the content of each field required to create the index (that is, a set of key-value pairs (key, value)), which greatly improves data processing performance. Because plug-in processing is supported during acquisition, different processing methods can be customized according to the data volume.
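As an illustration of what a map job might emit, the sketch below turns one collected text record into (key, value) pairs. The choice of key (a lower-cased word) and value (a document identifier), and the function name `map_record`, are assumptions made for this example; the source does not define the fields of the index.

```python
import re
from typing import Iterator, Tuple


def map_record(doc_id: str, text: str) -> Iterator[Tuple[str, str]]:
    """Emit one (term, doc_id) pair per distinct word in the record.

    Word splitting on \\w+ is an assumed, simplistic tokenization.
    """
    seen = set()
    for token in re.findall(r"\w+", text.lower()):
        if token not in seen:
            seen.add(token)
            yield token, doc_id
```

A different processing plug-in could replace this function to handle database rows or another field layout without changing the rest of the flow.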

Second stage: index creation, i.e. the reduce processing stage of Hadoop, in which the distributed index databases are created. The maximum number of reduce jobs processed in parallel, reduceNum, is determined by the configured number of reduce jobs. Each item of data generated in the data acquisition stage is assigned to the reduce job whose index is given by HashCode() % reduceNum, and each reduce job generates its own temporary index database file.
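The HashCode() % reduceNum assignment can be sketched as follows. The helper `java_string_hash` mimics Java's `String.hashCode()` for strings whose characters lie in the Basic Multilingual Plane. Masking the sign bit before taking the modulus is an added assumption (it mirrors what Hadoop's default `HashPartitioner` does) so that the partition index is never negative; the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def java_string_hash(s: str) -> int:
    """Signed 32-bit hash computed like Java's String.hashCode() (BMP characters only)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def reduce_partition(key: str, reduce_num: int) -> int:
    """Index of the reduce job that receives this key."""
    # Clearing the sign bit keeps the result in range(reduce_num).
    return (java_string_hash(key) & 0x7FFFFFFF) % reduce_num


def shuffle(pairs: Iterable[Tuple[str, str]], reduce_num: int) -> List[List[Tuple[str, str]]]:
    """Group (key, value) pairs into one bucket per reduce job."""
    buckets: List[List[Tuple[str, str]]] = [[] for _ in range(reduce_num)]
    for key, value in pairs:
        buckets[reduce_partition(key, reduce_num)].append((key, value))
    return buckets
```

Because the partition depends only on the key, every pair with the same key reaches the same reduce job, so each temporary index database holds all entries for the keys it owns.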

It should be noted that index creation supports the Hadoop distributed file system (HDFS) as well as other distributed file systems (DFS) that support sharing.

Third stage: index merging. Taking the temporary index databases that the reduce jobs generated in the index creation stage, the index master node invokes index merging to combine them into one complete index database. During index merging, the temporary index databases are read one by one and incorporated into a single master index database; finally, each temporary index database is deleted, and the master index database provides the retrieval service.
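The one-by-one merge into a master index database can be sketched in memory as below. The representation is an assumption: each temporary index database is modeled as a dictionary from term to a list of document identifiers, and clearing the input list stands in for deleting the temporary index databases. The name `merge_indexes` is hypothetical.

```python
from typing import Dict, List

Index = Dict[str, List[str]]


def merge_indexes(temp_indexes: List[Index]) -> Index:
    """Read each temporary index in turn and fold it into one master index."""
    master: Index = {}
    for temp in temp_indexes:
        for term, doc_ids in temp.items():
            postings = master.setdefault(term, [])
            for doc_id in doc_ids:
                if doc_id not in postings:  # skip duplicates across temporary indexes
                    postings.append(doc_id)
    # Stand-in for deleting the temporary index databases after the merge.
    temp_indexes.clear()
    return master
```

A real index library would normally provide its own merge operation; this sketch only shows the order of operations described in the third stage.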

Fig. 3 is a structural block diagram of the device for generating a distributed index according to an embodiment of the present invention. As shown in Fig. 3, the device may comprise: a determination module 10, configured to determine the number of map jobs in Hadoop according to the data volume of the original data; a generation module 20, configured to distribute the data processed by each map job to a plurality of reduce jobs and to generate an index database corresponding to each reduce job, wherein the number of reduce jobs and the correspondence between each reduce job and one or more map jobs are pre-configured; and a merging module 30, configured to merge the index databases corresponding to the reduce jobs.

The device shown in Fig. 3 solves the problem in the related art that an efficient index cannot be created quickly for mass data, so that mass data can be indexed efficiently and quickly.

Preferably, as shown in Fig. 4, the generation module 20 may comprise: an acquiring unit 200, configured to obtain the type of the currently supported file system; a determining unit 202, configured to determine, according to the type of the file system, the generation mode of the index database corresponding to each reduce job; and a generation unit 204, configured to generate the index database corresponding to each reduce job according to the generation mode.

Preferably, as shown in Fig. 4, the generation unit 204 is configured to, when the type of the file system is the Hadoop distributed file system (HDFS), generate the index database corresponding to each reduce job on local disk and then upload all the index databases generated on local disk to HDFS; or, the generation unit 204 is configured to, when the type of the file system is another distributed file system (DFS) that supports sharing, other than HDFS, generate the index database corresponding to each reduce job directly in that shared DFS.

Preferably, as shown in Fig. 4, the merging module 30 may comprise: a download unit 300, configured to, when the type of the file system is HDFS, download the index database corresponding to each reduce job from HDFS to local disk; a first merging unit 302, configured to merge the index databases corresponding to the reduce jobs on local disk; and a first processing unit 304, configured to upload the merged index database to HDFS and to delete the index database corresponding to each reduce job from local disk.

Preferably, as shown in Fig. 4, the merging module 30 may comprise: a second merging unit 306, configured to, when the type of the file system is another DFS that supports sharing, merge the index databases corresponding to the reduce jobs that were generated in the shared DFS; and a second processing unit 308, configured to delete the index database corresponding to each reduce job from the shared DFS.

As can be seen from the above description, the embodiments achieve the following technical effect (it should be noted that this is an effect that certain preferred embodiments can achieve): with the technical scheme provided by the embodiments of the present invention, the original data can be processed using the map-reduce programming model in Hadoop, an index database is generated for each reduce job, and those index databases are then merged into one complete index database for use in retrieval. This solves the problem in the related art that an efficient index cannot be created quickly for mass data, so that mass data can be indexed efficiently and quickly.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with a general-purpose computing device. They can be concentrated on a single computing device or distributed across a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described can be performed in an order different from that given here. Alternatively, they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not restricted to any specific combination of hardware and software.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for generating a distributed index, characterized by comprising:

determining the number of map jobs in Hadoop according to the data volume of the original data;

distributing the data processed by each map job to a plurality of reduce jobs, and generating an index database corresponding to each reduce job, wherein the number of said reduce jobs and the correspondence between said each reduce job and one or more map jobs are pre-configured;

merging the index databases corresponding to said reduce jobs.

2. The method according to claim 1, characterized in that generating the index database corresponding to said each reduce job comprises:

obtaining the type of the currently supported file system;

determining, according to the type of said file system, the generation mode of the index database corresponding to said each reduce job;

generating the index database corresponding to said each reduce job according to said generation mode.

3. The method according to claim 2, characterized in that generating the index database corresponding to said each reduce job according to said generation mode comprises:

when the type of said file system is the Hadoop distributed file system HDFS, generating the index database corresponding to said each reduce job on local disk, and then uploading all the index databases generated on said local disk to said HDFS; or,

when the type of said file system is another distributed file system DFS that supports sharing, other than said HDFS, generating the index database corresponding to said each reduce job directly in said shared DFS.

4. The method according to claim 3, characterized in that merging the index databases corresponding to said reduce jobs comprises:

when the type of said file system is said HDFS, downloading the index database corresponding to said each reduce job from said HDFS to said local disk;

merging the index databases corresponding to said reduce jobs on said local disk;

uploading the merged index database to said HDFS, and deleting the index database corresponding to said each reduce job from said local disk.

5. The method according to claim 3, characterized in that merging the index databases corresponding to said reduce jobs comprises:

when the type of said file system is said other DFS that supports sharing, merging the index databases corresponding to said reduce jobs that were generated in said shared DFS;

deleting the index database corresponding to said each reduce job from said shared DFS.

6. A device for generating a distributed index, characterized by comprising:

a determination module, configured to determine the number of map jobs in Hadoop according to the data volume of the original data;

a generation module, configured to distribute the data processed by each map job to a plurality of reduce jobs and to generate an index database corresponding to each reduce job, wherein the number of said reduce jobs and the correspondence between said each reduce job and one or more map jobs are pre-configured;

a merging module, configured to merge the index databases corresponding to said reduce jobs.

7. The device according to claim 6, characterized in that said generation module comprises:

an acquiring unit, configured to obtain the type of the currently supported file system;

a determining unit, configured to determine, according to the type of said file system, the generation mode of the index database corresponding to said each reduce job;

a generation unit, configured to generate the index database corresponding to said each reduce job according to said generation mode.

8. The device according to claim 7, characterized in that said generation unit is configured to, when the type of said file system is the Hadoop distributed file system HDFS, generate the index database corresponding to said each reduce job on local disk and then upload all the index databases generated on said local disk to said HDFS; or, said generation unit is configured to, when the type of said file system is another distributed file system DFS that supports sharing, other than said HDFS, generate the index database corresponding to said each reduce job directly in said shared DFS.

9. The device according to claim 8, characterized in that said merging module comprises:

a download unit, configured to, when the type of said file system is said HDFS, download the index database corresponding to said each reduce job from said HDFS to said local disk;

a first merging unit, configured to merge the index databases corresponding to said reduce jobs on said local disk;

a first processing unit, configured to upload the merged index database to said HDFS and to delete the index database corresponding to said each reduce job from said local disk.

10. The device according to claim 8, characterized in that said merging module comprises:

a second merging unit, configured to, when the type of said file system is said other DFS that supports sharing, merge the index databases corresponding to said reduce jobs that were generated in said shared DFS;

a second processing unit, configured to delete the index database corresponding to said each reduce job from said shared DFS.