WO2013078583A1

WO2013078583A1 - Method and apparatus for optimizing data access, method and apparatus for optimizing data storage

Info

Publication number: WO2013078583A1
Application number: PCT/CN2011/083021
Authority: WO
Inventors: 智伟; 赵智峰; 周帅锋
Original assignee: 华为技术有限公司
Priority date: 2011-11-28
Filing date: 2011-11-28
Publication date: 2013-06-06
Also published as: CN102725753B; CN102725753A

Abstract

Disclosed are a method and an apparatus for optimizing data access, and a method and an apparatus for optimizing data storage. The method for optimizing data access comprises: a main controller receiving a request of a user for accessing a data table in HBASE, the request carrying data input range information, and the data input range information comprising a plurality of data input ranges; determining input block information according to region information of the data table and the data input range information; determining the number of Map tasks according to the input block information; allocating, according to the number of Map tasks, slave processors to read data in the data table; and returning the data read by the slave processors to the user.

Description

Method and device for optimizing data access, method and device for optimizing data storage

The present invention relates to the field of data processing technologies, and in particular, to a method and apparatus for optimizing data access, and a method and apparatus for optimizing data storage.

Background art

As a large-scale parallel data processing method, MapReduce is widely used in large-scale data analysis. HBASE (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system. HBASE can serve as both a data source and a data destination for MapReduce, so that MapReduce can process data stored in HBASE or save its output data in HBASE.

When HBASE is used as the data source of MapReduce, the table to be accessed is specified by a table name, and the range of data to be accessed is specified by a range query object. The range query object defines the data query range by setting a start key value and an end key value.

When the user program calls the MapReduce function, the following operations are triggered:

(1) The MapReduce function library obtains all partition information of the specified table from HBASE according to the table name, and compares the data range defined by the range query object with all the partition information to obtain the block information and the number of blocks M.

(2) The main controller creates a Map task based on the block information and the number, and each Map task processes the data in one block.

(3) Other processes are the same as the basic MapReduce process.

When HBASE is used as the data destination of MapReduce, the table name is used to define the table to be saved by MapReduce data. The main operations are as follows:

(1) The MapReduce function library in the user program first divides the input file into M blocks. According to the table in which the data is to be saved, it accesses the HBASE metadata table to obtain the partition information corresponding to the table, and the number of Reduce tasks R is specified according to that partition information.

(2) The main controller obtains the input block information and creates a Map task for each block. It creates Reduce tasks according to the configured number of Reduce tasks. In total, M Map tasks and R Reduce tasks need to be dispatched. The main controller dispatches the tasks to the slave processors.

(3) Other processes are the same as the basic MapReduce process.

As can be seen from the above process, the prior art has at least the following problems:

When HBASE is used as the data source of MapReduce, the data range accessed by MapReduce is defined by the query range of a single range query object. To ensure that no record meeting the requirements is missed, the data range specified by the range query object can only be widened. As a result, too many partitions are covered, and the covered partitions contain a large amount of invalid data. Besides reading the required records in each partition, the MapReduce program must also read a large number of invalid records, compare them, and discard them. This causes a large number of wasted operations and seriously reduces data processing speed.

When HBASE is used as the data destination of MapReduce, if the data to be stored falls within only a limited range of partitions, multiple useless Reduce processes are generated, which wastes scheduling time and system resources and reduces data processing speed.

Summary of the invention

An embodiment of the present invention provides a method and apparatus for optimizing data access to reduce the reading of invalid data and improve data processing efficiency.

An embodiment of the present invention provides a method and an apparatus for optimizing data storage, so as to reduce execution of invalid MapReduce tasks and improve data processing efficiency.

In order to solve the above technical problem, the technical solution adopted by the embodiment of the present invention is:

A method of optimizing data access, including:

The main controller receives a request from a user to access a data table in HBASE, where the request carries data input range information, and the data input range information includes multiple data input ranges;

Determining input block information according to the partition information of the data table and the data input range information;

Determining the number of Map tasks according to the input block information;

Allocating slave processors to read data in the data table according to the number of Map tasks; and

Returning the data read by the slave processors to the user.

A method of optimizing data storage, including:

The main controller receives a request from a user to store data in a data table in HBASE, where the request carries one or more data output ranges;

Determining output block information according to the partition information of the data table and the data output range;

Determining the number of Reduce tasks according to the output block information; and

Allocating slave processors to write data to the data table according to the number of Reduce tasks.

An apparatus for optimizing data access, comprising:

a receiving unit, configured to receive a request from a user to access a data table in HBASE, where the request carries data input range information, and the data input range information includes multiple data input ranges;

an input blocking unit, configured to determine input block information according to the partition information of the data table and the data input range information;

a task determining unit, configured to determine the number of Map tasks according to the input block information;

an allocating unit, configured to allocate slave processors to read data in the data table according to the number of Map tasks; and

a sending unit, configured to return the data read by the slave processors to the user.

An apparatus for optimizing data storage, comprising:

a receiving unit, configured to receive a request for a user to store data in a data table in HBASE, where the request carries one or more data output ranges;

An output blocking unit, configured to determine output block information according to the partition information of the data table and the data output range;

a task determining unit, configured to determine the number of Reduce tasks according to the output block information; and

an allocating unit, configured to allocate slave processors to write data to the data table according to the number of Reduce tasks.

The method and apparatus for optimizing data access according to the embodiments of the present invention, when HBASE is used as the data source of MapReduce, reduce the reading of invalid data by specifying multiple data input ranges, and thereby improve data processing efficiency. Correspondingly, the method and apparatus for optimizing data storage according to the embodiments of the present invention, when HBASE is used as the data destination of MapReduce, reduce the execution of invalid MapReduce tasks by specifying multiple data output ranges, and thereby improve data processing efficiency.

Brief description of the drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of the operation flow of existing MapReduce;

FIG. 2 is a schematic diagram of the data model of a table in an existing HBASE database;

FIG. 3 is a schematic diagram of the operation flow of MapReduce in the prior art when HBASE is used as a data source;

FIG. 4 is a schematic diagram of the operation flow of MapReduce in the prior art when HBASE is used as a data destination;

FIG. 5 is a flowchart of a method for optimizing data access according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a data table in HBASE according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of querying the data table shown in FIG. 6 according to the prior art;

FIG. 8 is a schematic diagram of querying the data table shown in FIG. 6 according to the method of an embodiment of the present invention;

FIG. 9 is a flowchart of a method for optimizing data storage according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of storing the data table shown in FIG. 4 according to the prior art;

FIG. 11 is a schematic diagram of storing the data table shown in FIG. 4 according to the method of an embodiment of the present invention;

FIG. 12 is a schematic structural diagram of an apparatus for optimizing data access according to an embodiment of the present invention;

FIG. 13 is a schematic structural diagram of the input blocking unit in an apparatus for optimizing data access according to an embodiment of the present invention;

FIG. 14 is a schematic structural diagram of an apparatus for optimizing data storage according to an embodiment of the present invention;

FIG. 15 is a schematic structural diagram of the output blocking unit in an apparatus for optimizing data storage according to an embodiment of the present invention.

Detailed description

The embodiments of the present invention are further described in detail below in conjunction with the drawings and embodiments.

The following is a description of the existing MapReduce operation flow.

As shown in Figure 1, MapReduce involves three separate entities: the user program, the main controller, and the slave processors. The main controller coordinates the running of a job and assigns tasks to the slave processors; the slave processors run the Map tasks and Reduce tasks into which the job is divided. When the user program calls the MapReduce function, the following operations are triggered:

1) The MapReduce library in the user program first divides the input file into M blocks.

2) The main controller obtains the input block information and creates a Map task for each block. It creates Reduce tasks according to the configured number of Reduce tasks. In total, M Map tasks and R Reduce tasks need to be dispatched. The main controller dispatches the tasks to the slave processors.

3) A slave processor that is assigned a Map task reads and processes the associated input block.

4) The intermediate results that a Map task buffers in memory are periodically written to the local hard disk and divided into R areas by the partition function. The locations of the intermediate results on the local hard disk are sent back to the main controller, which is responsible for forwarding this location information to the slave processors running Reduce tasks.

5) When the main controller notifies a slave processor running a Reduce task of the intermediate data locations, that slave processor reads the buffered intermediate data from the local hard disks of the slave processors that ran the Map tasks.

6) The slave processor running the Reduce task processes the intermediate data and passes the set of intermediate values to the user-defined Reduce function. The output of the Reduce function is written to a final output file.

7) When all Map tasks and Reduce tasks have been completed, the main controller wakes up the user program, and the MapReduce call returns to the calling point in the user program.
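The flow above can be illustrated with a minimal single-process sketch. It is not part of the described embodiments: the function names, the word-count job, and the byte-sum partition function are assumptions chosen only to show how M input blocks feed Map tasks whose intermediate results are divided into R areas for the Reduce tasks.

```python
from collections import defaultdict


def partition(key, num_reduces):
    # Hypothetical partition function: maps an intermediate key to one of R areas.
    return sum(key.encode("utf-8")) % num_reduces


def run_mapreduce(input_blocks, map_fn, reduce_fn, num_reduces):
    """Simulate the flow in one process: M Map tasks, R Reduce tasks."""
    # One Map task per input block; its intermediate pairs are divided
    # into R areas by the partition function (steps 2 to 4).
    areas = [defaultdict(list) for _ in range(num_reduces)]
    for block in input_blocks:
        for record in block:
            for key, value in map_fn(record):
                areas[partition(key, num_reduces)][key].append(value)
    # Each Reduce task reads its own area and applies the user-defined
    # Reduce function (steps 5 and 6).
    output = {}
    for area in areas:
        for key in sorted(area):
            output[key] = reduce_fn(key, area[key])
    return output


def count_words_map(line):
    return [(word, 1) for word in line.split()]


def count_words_reduce(key, values):
    return sum(values)


# Two input blocks (M = 2) and two Reduce tasks (R = 2).
BLOCKS = [["a b", "b c"], ["c c a"]]
RESULT = run_mapreduce(BLOCKS, count_words_map, count_words_reduce, 2)
print(RESULT)
```

In a real deployment the Map and Reduce tasks run on separate slave processors and exchange intermediate data through local disks; the sketch keeps everything in memory to make the data flow visible.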

HBASE is a distributed column-oriented storage database. The data model of its tables is shown in Figure 2, which shows the data in an HBASE table and includes the following information:

Row key (RowKey): the identifier of each row of data, similar to the primary key of a table in a relational database.

Column family (Column Family): an HBASE table is a collection of column families. Like the columns of a table in a relational database, column families must be predefined. Unlike a column in a relational database table, a column family can contain multiple columns.

Column (Column): a label under a column family, which can be added arbitrarily when data is written.

Value (Value): the column value, similar to the value of a column in a relational database.

Region: the data in a table is organized into partitions of a certain size. When the data of a column family in the table reaches a threshold, the partition is split: a new partition is created, and data under the original column family is moved to the new partition in row key order.
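As an informal illustration of this data model, the following sketch represents a table as nested dictionaries and groups its sorted row keys into partitions. The table contents, the family and column names, and the use of a row-count limit in place of a size threshold are all assumptions made for the example; this is not how HBASE itself is implemented.

```python
from bisect import bisect_right

# Row key -> column family -> column -> value. All names here are made up.
TABLE = {
    "row1": {"info": {"title": "A", "year": "2011"}},
    "row2": {"info": {"title": "B"}, "stats": {"views": "7"}},
    "row3": {"info": {"title": "C"}},
}


def split_into_regions(row_keys, max_rows):
    """Group row keys, in row key order, into regions of at most max_rows rows.

    A row-count limit stands in for the size threshold described in the text.
    """
    keys = sorted(row_keys)
    return [keys[i:i + max_rows] for i in range(0, len(keys), max_rows)]


def find_region(region_start_keys, row_key):
    """Return the index of the region whose key range contains row_key.

    region_start_keys must be sorted, and the first start key is assumed
    to be the empty string so that every row key falls into some region.
    """
    return bisect_right(region_start_keys, row_key) - 1


REGIONS = split_into_regions(TABLE, 2)
START_KEYS = [""] + [region[0] for region in REGIONS[1:]]
print(REGIONS, START_KEYS)
```

Because rows are kept in row key order, each region can be described by a start key and an end key, which is what the range comparisons in the following sections rely on.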

HBASE can be used as a data source for MapReduce, that is, MapReduce can process data stored in HBASE. As shown in Figure 3, when the user program calls the MapReduce function, the following operations are triggered:

1) MapReduce obtains all Region information of the specified table from HBASE according to the table name, and compares the data range defined by the range query object with the range information of all Regions to obtain the block information and the number of blocks.

2) The main controller creates a Map task according to the above-mentioned block information and the number, and each Map task processes the data in one block.

3) The master controller dispatches a Map task to the slave processor.

4) Other processes are the same as the basic MapReduce process.

HBASE can also be used as a data destination for MapReduce. That is, MapReduce can store its output data in HBASE, and the table in which the MapReduce data is to be stored is specified by a table name.

As shown in Figure 4, when the user program calls the MapReduce function, it will cause the following operations:

1) The MapReduce function in the user program first divides the input file into multiple blocks. According to the table in which the data is to be saved, the HBASE data table is accessed to obtain the Region information corresponding to the table, and the number of Reduce tasks is specified according to the Region information.

2) The main controller obtains the input file block information and creates a Map task for each block. It creates Reduce tasks according to the configured number of Reduce tasks.

3) The main controller dispatches the Map task and the Reduce task to the slave processor.

4) Other processes are the same as the basic MapReduce process.

In view of the above problems in the prior art, the method and apparatus for optimizing data access according to the embodiments of the present invention, when HBASE is used as a data source of MapReduce, reduce the reading of invalid data by specifying multiple data input ranges, thereby improving data processing efficiency.

Correspondingly, in view of the above problems in the prior art, the method and apparatus for optimizing data storage according to the embodiments of the present invention, when HBASE is used as a data destination of MapReduce, reduce the execution of invalid MapReduce tasks by specifying multiple data output ranges, thereby improving data processing efficiency.

The method and apparatus for optimizing data access and the method and apparatus for optimizing data storage according to embodiments of the present invention are described in detail below.

FIG. 5 is a flowchart of a method for optimizing data access according to an embodiment of the present invention. The method includes the following steps:

Step 501: The main controller receives a request from a user to access a data table in HBASE, where the request carries data input range information, and the data input range information includes multiple data input ranges.

The plurality of data input ranges described above may be divided by a separator, and accordingly, the main controller may divide the input string by a separator to obtain a plurality of data input ranges.
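As a minimal sketch of how such a string might be split, the following example assumes that each range is written as a parenthesized "(start, end)" pair and that pairs are separated by commas. The function name parse_ranges and this textual format are assumptions for illustration; the embodiments do not prescribe a specific separator.

```python
import re


def parse_ranges(text):
    """Split a string such as "(a, b), (c, d)" into a list of (start, end) pairs."""
    pairs = re.findall(r"\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)", text)
    ranges = []
    for start, end in pairs:
        # A range whose start key sorts after its end key cannot select any row.
        if start > end:
            raise ValueError(f"start key {start!r} is after end key {end!r}")
        ranges.append((start, end))
    return ranges


PARSED = parse_ranges("(20010101, 20010131), (20010201, 20010228), (20010301, 20010331)")
print(PARSED)
```

The same function could be applied to the contents of a file when the ranges are supplied in file form.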

The data input range may take various forms, for example, any one of the following forms:

a) A list form, where the list includes multiple range query objects, and each range query object is a data input range, for example:

SCAN1 (20010101, 20010131),

SCAN2 (20010201, 20010228),

SCAN3 (20010301, 20010331).

b) a list form, and the list contains multiple start and end data range pairs, each starting and ending data range pair representing a data range, for example:

(20010101, 20010131),

(20010201, 20010228),

(20010301, 20010331).

c) A file form, that is, multiple data input ranges are saved in a file, and the main controller obtains the data input ranges by reading the file, for example:

(20010101, 20010131), (20010201, 20010228), (20010301, 20010331).

Step 502: Determine input block information according to the partition information of the data table and the data input range information.

In this step, the start key value and the end key value of every partition in the data table may first be obtained. Each data input range in the data input range information is then compared with the start key value and the end key value of each partition to obtain the coverage of each data input range in each partition, and the input block information is determined according to these coverage ranges.

In addition, after the coverage of each data input range in each partition is obtained, coverage ranges that belong to the same partition and are contiguous may first be merged, and the input block information is then determined according to the merged coverage ranges.

To further improve processing efficiency, the data input ranges in the data input range information may be sorted before they are compared with the start key value and the end key value of each partition, and the sorted data input ranges are then compared with those key values in order.

Alternatively, instead of sorting in advance, all the coverage ranges may be sorted after the coverage of each data input range in each partition has been obtained, and the coverage ranges that belong to the same partition and are contiguous are then merged.

The input block information may include the number of input blocks and the start key value and end key value of each input block.
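The comparison, sorting, and merging described for Step 502 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: keys are compared as strings, both ends of a range are treated as inclusive, two coverage ranges are considered contiguous when they overlap or touch, and the function name and region boundaries are hypothetical.

```python
def compute_input_splits(regions, ranges):
    """Return input blocks as (region_name, start_key, end_key) tuples.

    regions: list of (name, start_key, end_key) for every partition.
    ranges:  list of (start_key, end_key) data input ranges.
    """
    covered = []
    for r_start, r_end in sorted(ranges):  # optional pre-sort of the input ranges
        for name, reg_start, reg_end in regions:
            lo = max(r_start, reg_start)
            hi = min(r_end, reg_end)
            if lo <= hi:  # the input range covers part of this partition
                covered.append((name, lo, hi))
    covered.sort()
    merged = []
    for name, lo, hi in covered:
        # Merge coverage ranges of the same partition that overlap or touch.
        if merged and merged[-1][0] == name and lo <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (name, prev[1], max(prev[2], hi))
        else:
            merged.append((name, lo, hi))
    return merged


# Hypothetical partition boundaries, for illustration only.
EXAMPLE_REGIONS = [
    ("Region0", "20010101", "20010215"),
    ("Region1", "20010216", "20010331"),
]
EXAMPLE_RANGES = [
    ("20010301", "20010331"),
    ("20010101", "20010131"),
    ("20010201", "20010228"),
]
SPLITS = compute_input_splits(EXAMPLE_REGIONS, EXAMPLE_RANGES)
print(len(SPLITS), SPLITS)
```

With these assumed boundaries the second input range straddles two partitions, so the three input ranges yield four input blocks, and therefore four Map tasks.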

Step 503: Determine the number of Map tasks according to the input block information.

Specifically, the number of Map tasks may be determined according to the number of input blocks, and each Map task corresponds to one input block.

Step 504: Allocate slave processors to read data in the data table according to the number of Map tasks.

Specifically, the master controller may allocate a slave processor for each Map task, and transmit the start key value and the terminating key value of the input block corresponding to the Map task to the slave processor. Correspondingly, the slave processor reads the data in the data table according to the start key value and the end key value of the input block.
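A minimal sketch of this allocation is shown below, assuming an in-memory dictionary in place of the HBASE table, inclusive start and end keys, and a round-robin assignment policy. The function names and the assignment policy are assumptions; the embodiments only require that each Map task be given to a slave processor together with the key range of its input block.

```python
def read_block(table_rows, start_key, end_key):
    """Work done by a slave processor: return the rows whose key is in [start_key, end_key]."""
    return [(key, value) for key, value in sorted(table_rows.items())
            if start_key <= key <= end_key]


def dispatch_map_tasks(input_blocks, slaves, table_rows):
    """Create one Map task per input block and assign it to a slave (round-robin, assumed)."""
    results = {}
    for task_id, (start_key, end_key) in enumerate(input_blocks):
        slave = slaves[task_id % len(slaves)]
        results[(task_id, slave)] = read_block(table_rows, start_key, end_key)
    return results


ROWS = {"20010105": "a", "20010120": "b", "20010210": "c", "20010405": "d"}
INPUT_BLOCKS = [("20010101", "20010131"), ("20010201", "20010228")]
READ = dispatch_map_tasks(INPUT_BLOCKS, ["slave-0", "slave-1"], ROWS)
print(READ)
```

Each slave reads only the rows inside its own input block, so rows outside every block (the last row in this example) are never touched.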

Step 505: Return the data read by the slave processors to the user.

It should be noted that the above request from the user to access the data table in HBASE also includes the table name of the data table. In practice, after receiving the request, the main controller may first check the table name and the data input ranges carried in the request to determine whether this information is correct. That is, according to the table name in the request, it checks whether a corresponding data table exists in HBASE and whether the data input ranges fall within the data range stored in that data table. If the information is correct, step 502 above is performed.

Specifically, the HBASE system table may be queried. If the table name is not included in the system table, the input table is incorrect; if the table name is present, the table name is correct. Then, the Region information of the data table corresponding to the table name is obtained, and by comparing the data input range and the Region information, if the data input range is smaller than the start key value of the table or greater than the termination key value of the table, the data input range may be determined to be incorrect. Otherwise, you can determine that the data input range is correct.
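The check described above can be sketched as follows. The dictionary used as a stand-in for the HBASE system table, the function name validate_request, and the returned messages are assumptions for illustration only.

```python
def validate_request(system_tables, table_name, ranges):
    """Check the table name and data input ranges carried in a request.

    system_tables: table name -> list of (region_start_key, region_end_key),
                   a stand-in for the HBASE system table.
    Returns (is_valid, message).
    """
    if table_name not in system_tables:
        return False, "table name is incorrect"
    regions = system_tables[table_name]
    table_start = min(start for start, _ in regions)
    table_end = max(end for _, end in regions)
    for start, end in ranges:
        # A range is rejected if it starts before the first key of the table
        # or ends after the last key of the table.
        if start < table_start or end > table_end:
            return False, f"data input range ({start}, {end}) is incorrect"
    return True, "ok"


SYSTEM_TABLES = {"movie_access": [("20010101", "20010215"), ("20010216", "20010331")]}
print(validate_request(SYSTEM_TABLES, "movie_access", [("20010105", "20010110")]))
```

Only requests that pass this check proceed to the determination of input block information in step 502.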

In addition, the method for optimizing data access in the embodiment of the present invention may also be compatible with the prior-art method. For example, after receiving the request from the user to access the data table in HBASE, the main controller determines whether the request carries multiple data input ranges. If it does, the data table in HBASE is accessed according to steps 502 to 504; otherwise, the data table in HBASE is accessed in the manner of the prior art.

The following is an example to further illustrate the difference between the processing of MapReduce and the prior art in the method for optimizing data access in the embodiment of the present invention.

For example, a data table in HBASE shown in Figure 6 is a movie access information table indicating the number of times different movies are accessed by different types of terminals each day.

The Region information in the data table shown in Figure 6 is shown in Table 1 below.

Table 1 :

If MapReduce processing is performed on the data table shown in Figure 6 according to the prior art in order to total the number of accesses to each movie from the different terminals in the first week of August, the processing is as shown in Figure 7 and proceeds as follows:

Set the start key value of the range query object to movie 1#20110801 and the end key value to movie 3#20110807.

Comparing the start key value and the end key value of the range query object with the Region information gives the following valid input data ranges:

Region0 (movie 1#20110801, movie 1#20110808)

Region1 (movie 2#20110102, movie 2#20110808)

Region2 (movie 3#20110101, movie 3#20110807)

These valid input data ranges belong to three different Regions, and in practice each of them may include invalid data.

The valid input data ranges are divided into three blocks, with the following block information:

The first block information: movie 1#20110801 to movie 1#20110808, where movie 1#20110808 is invalid data.

The second block information: movie 2#20110102 to movie 2#20110808, where movie 2#20110101 to movie 2#20110731 and movie 2#20110808 are invalid data.

The third block information: movie 3#20110101 to movie 3#20110807, where movie 3#20110101 to movie 3#20110731 is invalid data.

Therefore, the main controller starts three Map tasks based on these three pieces of block information and dispatches them to the slave processors for processing. The three Map tasks process the data in the different blocks, which also contain invalid data. When a slave processor processes the data, it reads each record of its block from the HBASE table and determines whether the record belongs to the first week of August; if so, the record is added to the total, and if not, it is discarded. As shown in Figure 7, the data in the thick-line frames is valid data, and the rest is invalid data.

It can be seen that the Map tasks read a large amount of invalid data only to test and discard it, which causes a large number of wasted operations, reduces data processing speed, and lowers efficiency.

If the method for optimizing data access is performed according to the embodiment of the present invention, MapReduce processing is performed on the data table shown in FIG. 6, and the total number of accesses of different terminals of each movie in the first week of August is aggregated, and the processing procedure is as shown in FIG. as follows:

1. The main controller receives a request for data access by the user, and the request carries multiple data input range information. For example, in this example, there are three data input ranges, as follows:

SCAN1 (start key value: movie 1#20110801, end key value: movie 1#20110807);

SCAN2 (start key value: movie 2#20110801, end key value: movie 2#20110807);

SCAN3 (start key value: movie 3#20110801, end key value: movie 3#20110807).

2. MapReduce is called to divide each data input range into blocks, where:

Comparing the first data input range with the Region information, the valid input data range is:

Region0 (movie 1#20110801, movie 1#20110807);

Comparing the second data input range with the Region information, the valid input data range is:

Region1 (movie 2#20110801, movie 2#20110807);

Comparing the third data input range with the Region information, the valid input data range is:

Region2 (movie 3#20110801, movie 3#20110807).

The above valid input data range belongs to three different Regions, so according to the three valid input data ranges obtained above, three different input block information are obtained, which are:

SPLIT0 (movie 1#20110801, movie 1#20110807)

SPLIT1 (movie 2#20110801, movie 2#20110807)

SPLIT2 (movie 3#20110801, movie 3#20110807).

3. Based on the above input block information and the number of input blocks, the main controller starts three Map tasks and dispatches them to the slave processors for processing. The three Map tasks process the data in the different input blocks. As can be seen from Figure 6, the data in the above three input blocks is all valid data and contains no invalid data.

It can be seen that, in the method for optimizing data access in the embodiment of the present invention, multiple data input ranges (that is, range query objects) are subdivided and set, and the effective input data range is narrowed through simple comparison operations, so that the reading of a large amount of invalid data during data processing is avoided, which greatly improves processing efficiency.
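The difference between the two approaches can be made concrete by counting the rows each one reads. The sketch below uses invented data: it assumes one row per movie per day from 2011-07-25 to 2011-08-10 and row keys of the form "movie N#YYYYMMDD". These row counts are hypothetical and are not taken from the table in Figure 6.

```python
from datetime import date, timedelta


def make_row_keys(movies, first_day, num_days):
    """Build hypothetical row keys of the form 'movie N#YYYYMMDD'."""
    keys = []
    for movie in movies:
        for offset in range(num_days):
            day = first_day + timedelta(days=offset)
            keys.append(f"{movie}#{day:%Y%m%d}")
    return keys


def rows_read(row_keys, ranges):
    """Count the rows that fall inside at least one (start, end) range, ends inclusive."""
    return sum(1 for key in row_keys
               if any(start <= key <= end for start, end in ranges))


# Assumed data: 17 days per movie, 2011-07-25 through 2011-08-10.
ROW_KEYS = make_row_keys(["movie 1", "movie 2", "movie 3"], date(2011, 7, 25), 17)

# Prior art: one range query object spanning all three movies.
SINGLE_RANGE = [("movie 1#20110801", "movie 3#20110807")]
# Embodiment: one data input range per movie.
MULTI_RANGES = [(f"movie {n}#20110801", f"movie {n}#20110807") for n in (1, 2, 3)]

print(rows_read(ROW_KEYS, SINGLE_RANGE), rows_read(ROW_KEYS, MULTI_RANGES))
```

Under these assumptions the single range reads 41 rows, of which only 21 belong to the first week of August, while the three separate ranges read exactly those 21 rows.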

As shown in FIG. 9, it is a flowchart of a method for optimizing data storage according to an embodiment of the present invention, which includes the following steps:

Step 901: The main controller receives a request for the user to store data in the HBASE data table, where the request carries one or more data output ranges.

The plurality of data output ranges described above may be divided by a separator, and accordingly, the main controller divides the output string by a separator to obtain a plurality of data output ranges.

The data output range may take various forms, for example, any of the following forms:

a) A list form, where the list includes one start and end data range pair, for example:

(20010101, 20010331).

c) A list form, where the list contains multiple start and end data range pairs, and each start and end data range pair represents a data output range, for example:

(20010101, 20010131),

(20010201, 20010228),

(20010301, 20010331).

d) A file form, that is, one or more data output ranges are saved in a file, and the main controller obtains the data output ranges by reading the file, for example:

(20010101, 20010131), (20010201, 20010228), (20010301, 20010331).

Step 902: Determine output block information according to the partition information of the data table and the data output range.

In this step, the start key value and the end key value of every partition in the data table may first be obtained. The data output range is then compared with the start key value and the end key value of each partition to obtain the partitions covered by the data output range, and the output block information is determined according to those partitions.

The output block information includes the number of output blocks.

Step 903: Determine, according to the output block information, a number of Reduce tasks.

Specifically, the number of Reduce tasks may be determined according to the number of output blocks, and each Reduce task corresponds to one output block.

Step 904: Allocate slave processors to write data to the data table according to the number of Reduce tasks.

Specifically, the main controller may allocate a slave processor for each Reduce task, and transfer the storage data corresponding to the Reduce task to the slave processor. Correspondingly, the slave processor writes the storage data corresponding to the Reduce task into the partition corresponding to the output block.
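Steps 902 to 904 can be sketched as follows, assuming string keys, inclusive partition boundaries, and hypothetical partition names and boundaries. The function names are illustrative and are not taken from the embodiments.

```python
def compute_output_blocks(regions, output_ranges):
    """Return the names of the partitions covered by at least one data output range."""
    blocks = []
    for name, reg_start, reg_end in regions:
        if any(start <= reg_end and end >= reg_start for start, end in output_ranges):
            blocks.append(name)
    return blocks


def plan_reduce_tasks(regions, output_ranges, records):
    """Create one Reduce task per output block and route each record to its task."""
    blocks = compute_output_blocks(regions, output_ranges)
    bounds = {name: (start, end) for name, start, end in regions}
    tasks = {name: [] for name in blocks}
    for key, value in records:
        for name in blocks:
            start, end = bounds[name]
            if start <= key <= end:
                tasks[name].append((key, value))
                break
    return tasks


# Hypothetical partition boundaries, for illustration only.
OUTPUT_REGIONS = [
    ("Region0", "20010101", "20010131"),
    ("Region1", "20010201", "20010228"),
    ("Region2", "20010301", "20010331"),
]
TASKS = plan_reduce_tasks(
    OUTPUT_REGIONS,
    [("20010305", "20010310")],
    [("20010305", "x"), ("20010310", "y")],
)
print(len(TASKS), TASKS)
```

Here the single data output range touches only one of the three partitions, so one Reduce task is created instead of three.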

The following examples further illustrate the difference between the processing of MapReduce and the prior art in the method for optimizing data storage in the embodiment of the present invention.

Assume that the following table 2 is the latest access information of the movie 3, and the information needs to be saved in the HBASE data table. The data output table is the movie access information table shown in FIG.

Table 2:

If MapReduce processing is performed on the information shown in Table 2 according to the prior art, the processing is as follows:

The output data ranges are obtained according to the Region information of the data table shown in FIG. 6:

Region0 (movie 1#20110104, movie 1#20110808);

Region1 (movie 2#20110101, movie 2#20110808);

Region2 (movie 3#20110101, movie 3#20110808).

These output data ranges belong to three different Regions, so the main controller sets up three different Reduce tasks and dispatches them to the slave processors for processing. Each Reduce task handles one output data range, and data satisfying the condition "start key value < input data < end key value" is stored in the corresponding Region.

The actual data in Table 2 that needs to be saved to HBASE ranges from movie 3#20110809 to movie 3#20110812, which belongs to the Reduce data range corresponding to Region2, so the Reduce tasks corresponding to Region0 and Region1 do not process any data. That is, because the actual input data falls within the data range of Region2, the data in Table 2 is stored in Region2, while the other two Reduce tasks have no output data and are invalid Reduce tasks, as shown in Figure 10.

It can be seen that in the prior art, multiple invalid Reduce processes are generated for the Reduce tasks, which wastes scheduling time and resources and reduces data processing speed.

If the method of optimizing data storage according to the embodiment of the present invention performs MapReduce processing on the information shown in Table 2, the processing procedure is as shown in FIG. 11, and the details are as follows:

1. The main controller receives a request for data storage by the user, and the request carries data output range information. For example, in this example, the data output range information carried is: <movie 3#20110809, movie 3#20110812>.

2. MapReduce compares the data output range with the Region information of the data table shown in Figure 6, and obtains a valid output data range, namely Region2 (movie 3#20110101, movie 3#20110808). The valid output data range belongs to only one Region.

Therefore, according to the valid output data range obtained above, one piece of output block information is obtained:

Region2 (movie 3#20110101, movie 3#20110808)

3. The main controller sets up one Reduce task according to the above output block information and dispatches it to a slave processor for processing.

It can be seen that the data storage method of the embodiment of the present invention specifies the output data more precisely and limits the effective output data range through a simple comparison operation, which reduces the startup and destruction of a large number of invalid Reduce tasks during data processing and greatly improves the processing efficiency.
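As a rough illustration of the comparison in steps 2 and 3 above, the following Python sketch counts the Reduce tasks needed for a set of output ranges. It is not taken from the embodiment: the function name `regions_for_output` is hypothetical, and the sketch assumes that each Region owns the keys from its start key up to the start key of the next Region, with the last Region open-ended. Under that assumption the range movie 3#20110809 to movie 3#20110812 falls into Region2.

```python
from bisect import bisect_right


def regions_for_output(region_starts, output_ranges):
    """Return the indices of Regions that receive data from any output range.

    region_starts: sorted Region start keys. Region i is ASSUMED to own keys
    from region_starts[i] up to (not including) region_starts[i + 1]; the
    last Region is assumed open-ended.
    output_ranges: (first_key, last_key) pairs, both ends inclusive.
    """
    hit = set()
    for first_key, last_key in output_ranges:
        # Locate the Region holding each end of the range; keys before the
        # first start key are assigned to Region 0 in this sketch.
        lo = max(bisect_right(region_starts, first_key) - 1, 0)
        hi = max(bisect_right(region_starts, last_key) - 1, 0)
        hit.update(range(lo, hi + 1))
    return sorted(hit)


region_starts = ["movie 1#20110104", "movie 2#20110101", "movie 3#20110101"]
output_ranges = [("movie 3#20110809", "movie 3#20110812")]

reduce_regions = regions_for_output(region_starts, output_ranges)
num_reduce_tasks = len(reduce_regions)  # one Reduce task per covered Region
```

With the example data, only Region2 (index 2) is covered, so a single Reduce task is set up instead of three.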

Correspondingly, the embodiment of the present invention further provides an apparatus for optimizing data access, as shown in FIG. 12, which is a schematic structural diagram of the apparatus.

In this embodiment, the device for optimizing data access includes:

The receiving unit 121 is configured to receive a request of a user for accessing a data table in HBASE, where the request carries data input range information, and the data input range information includes multiple data input ranges;

The input blocking unit 122 is configured to determine input block information according to partition information of the data table and the data input range information;

The task determining unit 123 is configured to determine the number of Map tasks according to the input block information;

The allocating unit 124 is configured to allocate slave processors to read data in the data table according to the number of Map tasks;

The sending unit 125 is configured to return the data read by the slave processor to the user.

The input blocking unit 122 can be implemented in various manners. For example, as shown in FIG. 13, the input blocking unit 122 can include: a partition information acquiring subunit 1221, a comparing subunit 1222, a merging subunit 1223, and a blocking information determining subunit 1224, where:

The partition information acquiring subunit 1221 is configured to obtain the start key value and the termination key value of all partitions in the data table;

The comparing subunit 1222 is configured to compare each data input range in the data input range information with the start key value and the termination key value of each partition, to obtain the coverage range of the data input range in each partition;

The merging subunit 1223 is configured to merge those coverage ranges obtained by the comparing subunit 1222 that belong to the same partition and are contiguous;

The block information determining sub-unit 1224 is configured to determine the input block information according to the merged coverage.

It should be noted that the foregoing merging subunit 1223 is optional; that is, the blocking information determining subunit 1224 can directly determine the input block information according to the coverage ranges obtained by the comparing subunit 1222.

In order to further improve the processing efficiency, the above input blocking unit 122 may further include a sorting subunit 1225, configured to sort the data input ranges in the data input range information before the comparing subunit 1222 compares the data input ranges with the start key value and the termination key value of each partition.
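The cooperation of the sorting, comparing, and merging subunits described above can be sketched as follows. This is only an illustrative Python sketch under stated assumptions, not the implementation of the embodiment: the function name `input_blocks` is hypothetical, keys are compared as plain strings, and every range is treated as half-open, that is, [start key, termination key).

```python
def input_blocks(regions, input_ranges):
    """Split the requested ranges into per-partition input blocks.

    regions: (start_key, end_key) pairs of the table's partitions.
    input_ranges: (start_key, end_key) pairs requested by the user.
    All ranges are treated as half-open [start, end), which is an assumption.
    """
    blocks = []
    ordered = sorted(input_ranges)  # sorting step
    for r_start, r_end in regions:
        coverage = []
        for q_start, q_end in ordered:
            # Comparing step: clip the requested range to this partition.
            lo, hi = max(q_start, r_start), min(q_end, r_end)
            if lo >= hi:
                continue  # the range does not overlap this partition
            # Merging step: join ranges that touch or overlap within
            # the same partition.
            if coverage and lo <= coverage[-1][1]:
                coverage[-1] = (coverage[-1][0], max(coverage[-1][1], hi))
            else:
                coverage.append((lo, hi))
        blocks.extend(coverage)
    return blocks


regions = [("a", "h"), ("h", "p"), ("p", "z")]
input_ranges = [("j", "l"), ("b", "d"), ("c", "f"), ("n", "r")]

blocks = input_blocks(regions, input_ranges)
num_map_tasks = len(blocks)  # one Map task per input block
```

In this example the ranges (b, d) and (c, f) merge into one block inside the first partition, while (n, r) is split at the partition boundary p, so four input blocks result. Coverage ranges in different partitions are deliberately not merged, since each input block must lie within a single partition.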

In the embodiment of the present invention, the input block information may include the number of input blocks and the start key value and termination key value of each input block.

Correspondingly, the task determining unit 123 may determine the number of Map tasks according to the number of input blocks, and each Map task corresponds to one input block.

Correspondingly, the foregoing allocating unit 124 may allocate a slave processor for each Map task and transmit the start key value and the termination key value of the input block corresponding to the Map task to the slave processor, so that the slave processor reads data in the data table according to the start key value and the termination key value of the input block.
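A minimal sketch of this allocation is given below. The embodiment does not specify how a slave processor is chosen for each Map task, so the round-robin assignment, the function name `plan_map_tasks`, and the slave names are all assumptions made purely for illustration.

```python
from itertools import cycle


def plan_map_tasks(blocks, slaves):
    """Create one Map task per input block and pair it with a slave.

    blocks: (start_key, end_key) pairs, one per input block.
    slaves: names of available slave processors (hypothetical identifiers).
    Slaves are assigned round-robin, which is an assumption of this sketch.
    """
    if not slaves:
        raise ValueError("at least one slave processor is required")
    return [
        {"task_id": i, "slave": slave, "start_key": start, "end_key": end}
        for i, ((start, end), slave) in enumerate(zip(blocks, cycle(slaves)))
    ]


plan = plan_map_tasks(
    [("b", "f"), ("j", "l"), ("n", "p")],
    ["slave-1", "slave-2"],
)
```

Each entry carries the start key value and termination key value that would be transmitted to the chosen slave processor, which then reads only that key range from the data table.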

The device for optimizing data access in the embodiment of the present invention can be used as a main controller in MapReduce. With the device, the effective input data range can be limited by a few simple comparison operations, so that the reading of a large amount of invalid data is avoided during data processing, which greatly improves the processing efficiency. For details, refer to the description of the method for optimizing data access in the foregoing embodiments of the present invention; the details are not described herein again.

Correspondingly, the embodiment of the present invention further provides an apparatus for optimizing data storage, as shown in FIG. 14, which is a schematic structural diagram of the apparatus.

In this embodiment, the apparatus for optimizing data storage includes:

The receiving unit 131 is configured to receive a request for the user to store data in the HBASE data table, where the request carries one or more data output ranges;

The output blocking unit 132 is configured to determine output blocking information according to the partition information of the data table and the data output range;

The task determining unit 133 is configured to determine the number of Reduce tasks according to the output block information;

The allocating unit 134 is configured to allocate slave processors to write data to the data table according to the number of Reduce tasks.

The output blocking unit 132 can be implemented in various manners. For example, as shown in FIG. 15, the output blocking unit 132 can include: a partition information acquiring subunit 1321, a comparing subunit 1322, and a blocking information determining subunit 1323, where:

The partition information acquiring subunit 1321 is configured to acquire the start key value and the termination key value of all partitions in the data table;

The comparing subunit 1322 is configured to compare the data output range with a start key value and a stop key value of each partition to obtain partition information covered by the data output range;

The blocking information determining subunit 1323 is configured to determine the output block information according to the partition information.

Of course, the foregoing output blocking unit 132 may have other implementation manners, which are not limited in this embodiment of the present invention.

In the embodiment of the present invention, the output block information may include the number of output blocks.

Correspondingly, the task determining unit 133 may determine the number of Reduce tasks according to the number of output blocks, and each Reduce task corresponds to one output block;

Correspondingly, the allocating unit 134 may allocate a slave processor for each Reduce task and transmit the storage data corresponding to the Reduce task to the slave processor, so that the slave processor writes the storage data into the partition corresponding to the output block.

The device for optimizing data storage in the embodiment of the present invention can be used as a main controller in MapReduce. With the device, the effective output data range can be limited by a simple comparison operation, so that the startup and destruction of a large number of invalid Reduce tasks are reduced during data processing, which greatly improves the processing efficiency. For the specific process, refer to the description of the method for optimizing data storage in the foregoing embodiments of the present invention; the details are not described herein again.

It will be apparent to those skilled in the art from the above description of the embodiments that all or part of the steps of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention or in portions of the embodiments.

It is to be noted that the embodiments in the present specification are described in a progressive manner, that identical or similar parts of the embodiments may be referred to each other, and that each embodiment focuses on its differences from the other embodiments. In particular, since the device embodiments are basically similar to the method embodiments, they are described relatively briefly, and for the relevant parts reference may be made to the description of the method embodiments. The device embodiments described above are merely illustrative: the units illustrated as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without any creative effort.

The above description is only the preferred embodiment of the present invention and is not intended to limit the scope of the present invention. Any modifications, equivalents, improvements, etc. made within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims

1. A method for optimizing data access, characterized in that it comprises:

2. The method according to claim 1, wherein the plurality of data input ranges are in any one of the following forms:

a list form, in which the list contains multiple range query objects, each range query object being one data input range;

a list form, in which the list contains multiple start and end data range pairs, each start and end data range pair representing one data input range;

a file form.

3. The method according to claim 1 or 2, wherein the determining the input block information according to the partition information of the data table and the data input range information comprises:

Obtaining a starting key value and a terminating key value of all partitions in the data table;

Comparing the data input range in the data input range information with the start key value and the end key value of each partition, respectively, to obtain a coverage range of the data input range in each partition;

The input block information is determined according to the coverage.

4. The method according to claim 3, wherein the method further comprises: before determining the input block information according to the coverage ranges, merging those of the obtained coverage ranges that belong to the same partition and are contiguous;

The determining the input block information according to the coverage includes:

The input block information is determined based on the combined coverage.

5. The method according to claim 3 or 4, wherein the method further comprises: sorting the data input ranges in the data input range information before comparing the data input ranges in the data input range information with the start key value and the termination key value of each partition.

6. The method according to claim 3 or 4 or 5, wherein the input block information comprises: the number of input blocks and the start key value and termination key value of each input block;

Determining the number of Map tasks according to the input block information includes: determining the number of Map tasks according to the number of input blocks, where each Map task corresponds to one input block;

Allocating slave processors to read the data in the data table according to the number of Map tasks includes:

Allocating a slave processor for each Map task, and transmitting the start key value and the termination key value of the input block corresponding to the Map task to the slave processor, so that the slave processor reads the data in the data table according to the start key value and the termination key value of the input block.

7. A method of optimizing data storage, comprising:

A main controller receives a request from the user to store data in a data table in HBASE, the request carrying one or more data output ranges;

Determining output block information according to the partition information of the data table and the data output range; determining a number of Reduce tasks according to the output block information;

8. The method according to claim 7, wherein the data output range is any one of the following forms:

a list form, and the list contains one or more start and end data range pairs, each starting and ending data range pair representing a data output range;

a file form.

9. The method according to claim 7 or 8, wherein the determining the output block information according to the partition information of the data table and the data output range comprises:

Comparing the data output range with the start key value and the termination key value of each partition, respectively, to obtain partition information covered by the data output range;

The output block information is determined based on the partition information.

10. The method according to claim 9, wherein the output block information comprises: the number of output blocks;

Determining the number of Reduce tasks according to the output block information includes: determining the number of Reduce tasks according to the number of output blocks, where each Reduce task corresponds to one output block;

Allocating slave processors to write data to the data table according to the number of Reduce tasks includes:

Allocating a slave processor to each Reduce task, and transmitting the storage data corresponding to the Reduce task to the slave processor, so that the slave processor writes the storage data corresponding to the Reduce task into the partition corresponding to the output block.

11. An apparatus for optimizing data access, comprising:

12. The device according to claim 11, wherein the input blocking unit comprises: a partition information acquiring subunit, configured to acquire a start key value and a termination key value of all partitions in the data table;

a comparison subunit, configured to compare a data input range in the data input range information with a start key value and a stop key value of each partition, to obtain a coverage range of the data input range in each partition;

The blocking information determining subunit is configured to determine the input blocking information according to the coverage.

13. The apparatus according to claim 12, wherein the input blocking unit further comprises: a merging subunit, configured to merge those coverage ranges obtained by the comparing subunit that belong to the same partition and are contiguous;

The blocking information determining subunit is specifically configured to determine the input block information according to the coverage ranges merged by the merging subunit.

14. The device according to claim 12 or 13, wherein the input blocking unit further comprises:

a sorting subunit, configured to sort the data input ranges in the data input range information before the comparing subunit compares the data input ranges in the data input range information with the start key value and the termination key value of each partition.

15. The device according to claim 12 or 13 or 14, wherein the input block information comprises: the number of input blocks and the start key value and termination key value of each input block;

The task determining unit is specifically configured to determine a number of Map tasks according to the number of input blocks, and each Map task corresponds to one input block;

The allocating unit is specifically configured to allocate a slave processor for each Map task, and to transmit the start key value and the termination key value of the input block corresponding to the Map task to the slave processor, so that the slave processor reads the data in the data table according to the start key value and the termination key value of the input block.

16. An apparatus for optimizing data storage, comprising:

17. The apparatus according to claim 16, wherein the output blocking unit comprises: a partition information acquiring subunit, configured to acquire a start key value and a termination key value of all partitions in the data table;

a comparison subunit, configured to compare the data output range with a start key value and a stop key value of each partition to obtain partition information covered by the data output range;

The block information determining subunit is configured to determine output block information according to the partition information.

18. The device according to claim 17, wherein the output block information comprises: the number of output blocks;

The task determining unit is specifically configured to determine the number of Reduce tasks according to the number of output blocks, where each Reduce task corresponds to one output block;

The allocating unit is specifically configured to allocate a slave processor to each Reduce task, and to transmit the storage data corresponding to the Reduce task to the slave processor, so that the slave processor writes the storage data corresponding to the Reduce task into the partition corresponding to the output block.