CN106503058B

CN106503058B - Data loading method, terminal, and computing cluster

Info

Publication number: CN106503058B
Application number: CN201610856707.1A
Authority: CN
Inventors: 房浩; 毕杰山; 莫凯; 郭益君; 钟超强
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2016-09-27
Filing date: 2016-09-27
Publication date: 2019-01-18
Anticipated expiration: 2036-09-27
Also published as: WO2018058998A1; CN106503058A

Abstract

Embodiments of the present invention provide a data loading method, a terminal, and a computing cluster, relate to the field of communications technologies, and can reduce the read/write latency of a KeyValue database and improve its query performance. The specific solution is as follows: a computing cluster receives a data load request that carries partition information of a data table to be loaded; determines first data partitions according to the partition information, where each partition indicated by the partition information is bound to one first data partition; separately obtains the source data of each partition indicated by the partition information and executes a map task on the source data of each partition; writes the intermediate data produced by each map task into the corresponding first data partition, according to the binding relationship between the partitions indicated by the partition information and the first data partitions; and executes a reduce task on the intermediate data in each first data partition to obtain the target file of each reduce task, where the target files are used by the KeyValue database to load the data table and serve data queries. The embodiments of the present invention are used for loading data.

Description

Data loading method, terminal, and computing cluster

Technical field

Embodiments of the present invention relate to the field of communications technologies, and in particular to a data loading method, a terminal, and a computing cluster.

Background art

A distributed key-value (KeyValue) database can effectively reduce the number of disk reads and writes, offers better read/write performance, and can provide users with a better data query service. A KeyValue database often uses a MapReduce service component to load data in batches. During batch loading, MapReduce tasks are executed to generate target files whose format matches the file storage format defined by the KeyValue database. The target files are stored in a distributed file system and then loaded from the distributed file system into the KeyValue database.

For a schematic structural diagram of a cluster in which a MapReduce service component and a KeyValue database are deployed together, see Fig. 1. In the cluster shown in Fig. 1, executing a MapReduce task requires reading a large amount of data and involves heavy computation such as sorting and partitioning, so the utilization of the cluster's resources, such as the central processing unit (CPU), network input/output (I/O), and disk I/O, is very high. A KeyValue database has strict read/write latency requirements, generally at the millisecond level. However, when a MapReduce service component is used to batch-load data for the KeyValue database, the processes that execute MapReduce tasks occupy many resources, so relatively fewer resources remain for the processes that serve KeyValue database queries. This increases the read/write latency of the KeyValue database and degrades its query performance, so that users' service requirements cannot be met.

Summary of the invention

Embodiments of the present invention provide a data loading method, a terminal, and a computing cluster that can reduce the read/write latency of a KeyValue database and improve its query performance.

To achieve the foregoing objective, the embodiments of the present invention adopt the following technical solutions:

According to a first aspect, an embodiment of the present invention provides a data loading method applied to a computing cluster. The method also involves a query cluster: the computing cluster is used for data loading, the query cluster is used for data queries on a KeyValue database, and the computing cluster and the query cluster are different clusters. The method includes the following. First, the computing cluster receives a data load request that carries partition information of a data table to be loaded. Second, the computing cluster determines first data partitions according to the partition information, where each partition indicated by the partition information is bound to one first data partition. Then, the computing cluster separately obtains, from a distributed file system, the source data of each partition indicated by the partition information, and executes a map task on the source data of each partition. Next, according to the binding relationship between the partitions indicated by the partition information and the first data partitions, the computing cluster writes the intermediate data produced by each map task into the corresponding first data partition. Finally, the computing cluster executes a reduce task on the intermediate data in each first data partition to obtain the target file of each reduce task, where the target files are used by the KeyValue database to load the data table and serve data queries.

In this way, MapReduce tasks can be executed using resources in the computing cluster to generate target files that match the file storage format defined by the KeyValue database, so that the KeyValue database in the query cluster can load the data table and serve data queries. Because the MapReduce tasks are executed by the computing cluster, which is independent of the query cluster that provides the query service to users, the large amounts of CPU, I/O, and other resources consumed during MapReduce task execution are resources of the computing cluster. MapReduce task execution does not occupy the resources of the query cluster, so the load on the query cluster is lower, which reduces the read/write latency of the KeyValue database in the query cluster and improves its query performance.

In a possible implementation of the first aspect, the method further includes: the computing cluster sends the target files to the query cluster.

According to a second aspect, an embodiment of the present invention provides a data loading method applied to a terminal. The method also involves a computing cluster and a query cluster: the computing cluster is used for data loading, the query cluster is used for data queries on a KeyValue database, and the computing cluster and the query cluster are different clusters. The method includes: the terminal sends a data load request to the computing cluster, where the data load request carries partition information of a data table to be loaded. The data load request instructs the computing cluster to determine first data partitions according to the partition information. Each partition indicated by the partition information is bound to one first data partition. A first data partition stores the intermediate data obtained by executing a map task on the source data of the partition bound to that first data partition, so that a reduce task can be executed on the intermediate data in the first data partition to obtain a target file.

In a possible implementation of the second aspect, the query cluster and the computing cluster each have their own distributed file system, and the two distributed file systems are isolated from each other. In this case, the terminal or the query cluster needs to request the computing cluster to send the target file corresponding to each first data partition to the query cluster, so that the target files are used when data queries are performed on the data table to be loaded in the KeyValue database.

In a possible implementation of the second aspect, the query cluster and the computing cluster share a distributed file system, and the query cluster obtains the target files from the distributed file system.

After the computing cluster generates the target files, it can store them in the distributed file system shared with the query cluster. The query cluster can then obtain the target files directly from the distributed file system and load them, so that the target files are used when data queries are performed on the data table to be loaded in the KeyValue database.

In a possible implementation of the second aspect, before the terminal sends the data load request to the computing cluster, the method further includes: the terminal requests the partition information of the data table to be loaded from the query cluster.

In this way, the terminal can determine the first data partitions according to the partition information obtained from the query cluster.

In a possible implementation of the second aspect, the terminal stores a connection configuration set of the query cluster and a connection configuration set of the computing cluster. Before the terminal requests the partition information of the data table to be loaded from the query cluster, the method further includes: the terminal sends a first connection establishment request to the query cluster according to the connection configuration set of the query cluster. Before the terminal sends the data load request to the computing cluster, the method further includes: the terminal sends a second connection establishment request to the computing cluster according to the connection configuration set of the computing cluster.

In this way, after the terminal establishes a connection with the query cluster or the computing cluster, the terminal can exchange messages with that cluster.

In a possible implementation of the second aspect, a connection configuration set includes at least one of an IP address, a port, and security access configuration information.
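As an illustration only, a connection configuration set of this kind could be modelled as a small record in which every field is optional but at least one must be present. The class name, field names, addresses, and ports below are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConnectionConfig:
    """Hypothetical connection configuration set for one cluster."""
    ip_address: Optional[str] = None
    port: Optional[int] = None
    security: Optional[dict] = None  # assumed shape for security access settings

    def __post_init__(self):
        # The text requires at least one of the three items to be present.
        if self.ip_address is None and self.port is None and self.security is None:
            raise ValueError("connection configuration set is empty")


# The terminal keeps one set per cluster (placeholder addresses and ports).
query_cluster_cfg = ConnectionConfig(ip_address="192.0.2.10", port=16000)
computing_cluster_cfg = ConnectionConfig(ip_address="192.0.2.20", port=8032)
```

The terminal would use the first set for the first connection establishment request and the second set for the second one.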

According to a third aspect, an embodiment of the present invention provides a data loading method applied to a query cluster. The method also involves a computing cluster: the query cluster is used for data queries on a KeyValue database, the computing cluster is used for data loading, and the computing cluster and the query cluster are different clusters. The method includes: the query cluster receives the target file corresponding to each first data partition sent by the computing cluster. The query cluster then loads the target file corresponding to each first data partition into the KeyValue database, so that the target files are used when data queries are performed on the data table to be loaded in the KeyValue database.

In this way, after the computing cluster generates the target files, the terminal or the query cluster can request the computing cluster to send the target files to the query cluster. After receiving the target files sent by the computing cluster, the query cluster can load them into the KeyValue database, so that the target files are used when data queries are performed on the data table to be loaded in the KeyValue database.

With reference to any of the foregoing aspects, in a possible implementation, the Key value ranges of the partitions indicated by the partition information are different from one another. For a partition and a first data partition that have a binding relationship, the source data of the partition and the intermediate data of the first data partition have the same Key value range.

In this way, when executing MapReduce tasks, the computing cluster can obtain the source data of each partition according to its Key value range, and distribute intermediate data with the same Key value range to the first data partition corresponding to each partition.
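A minimal sketch of such a Key-range lookup follows. It assumes, purely for illustration, that the Key value ranges are sorted, non-overlapping, half-open string ranges, and that the index of a range identifies the first data partition bound to it. The variable and function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical partition information: sorted, non-overlapping [start, end) Key ranges.
# An end of None means the range is unbounded above.
partition_info = [("", "g"), ("g", "p"), ("p", None)]

# Start keys of the ranges, in ascending order.
range_starts = [start for start, _ in partition_info]


def first_data_partition_for(key: str) -> int:
    """Return the index of the range (and bound first data partition) containing key."""
    return bisect_right(range_starts, key) - 1
```

With these assumed ranges, `first_data_partition_for("apple")` selects index 0, a key equal to a boundary such as `"g"` falls into the range that starts there (index 1), and `"zebra"` selects index 2.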

With reference to any of the foregoing aspects, in a possible implementation, the query cluster has second data partitions, the Key value ranges corresponding to all the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and each second data partition is used to store the target file of the corresponding Key value range.

In this way, after the query cluster receives the target files sent by the computing cluster, or obtains the target files from the distributed file system shared with the computing cluster, it can store the target file corresponding to each first data partition in the corresponding second data partition, where the Key value range of that second data partition is the same as the Key value range of the first data partition.
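The placement rule can be sketched as a lookup keyed by Key value range. The code assumes, as a simplification, that a Key value range is represented as a `(start, end)` tuple and that a second data partition is just a list of file paths. The function name, paths, and ranges are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

KeyRange = Tuple[str, Optional[str]]  # (start, end); end None = unbounded


def place_target_files(target_files: Dict[KeyRange, str],
                       second_partitions: Dict[KeyRange, List[str]]) -> None:
    """Store each target file in the second data partition with the same Key range."""
    for key_range, path in target_files.items():
        if key_range not in second_partitions:
            raise KeyError(f"no second data partition for range {key_range}")
        second_partitions[key_range].append(path)


# Target files keyed by the Key range of the first data partition that produced them.
target_files = {("", "g"): "/out/partition_a", ("g", None): "/out/partition_b"}
second_partitions = {("", "g"): [], ("g", None): []}
place_target_files(target_files, second_partitions)
```

Raising an error on a missing range is a design choice of this sketch. It makes a mismatch between the two sets of ranges visible instead of silently dropping a file.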

With reference to any of the foregoing aspects, in a possible implementation, the partition indication information indicates the correspondence, in the KeyValue database of the query cluster, between the Key value ranges corresponding to the data table to be loaded and M target second data partitions. In addition, a target first data partition and its corresponding target second data partition correspond to the same Key value range.

According to a fourth aspect, an embodiment of the present invention provides a computing cluster, including the following modules. A receiving module is configured to receive a data load request that carries partition information of a data table to be loaded. A determining module is configured to determine first data partitions according to the partition information, where each partition indicated by the partition information is bound to one first data partition. An execution module is configured to separately obtain, from a distributed file system, the source data of each partition indicated by the partition information, and to execute a map task on the source data of each partition. A writing module is configured to write the intermediate data produced by each map task into the corresponding first data partition, according to the binding relationship between the partitions indicated by the partition information and the first data partitions. The execution module is further configured to execute a reduce task on the intermediate data in each first data partition to obtain the target file of each reduce task, where the target files are used by the KeyValue database of the query cluster to load the data table and serve data queries.

In a possible implementation of the fourth aspect, the Key value ranges of the partitions indicated by the partition information are different from one another. For a partition and a first data partition that have a binding relationship, the source data of the partition and the intermediate data of the first data partition have the same Key value range.

In a possible implementation of the fourth aspect, the computing cluster further includes a sending module, configured to send the target files to the query cluster.

In a possible implementation of the fourth aspect, the query cluster has second data partitions, the Key value ranges corresponding to all the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and each second data partition is used to store the target file of the corresponding Key value range.

According to a fifth aspect, an embodiment of the present invention provides a terminal, including the following modules. A sending module is configured to send a data load request to a computing cluster. The data load request carries partition information of a data table to be loaded and instructs the computing cluster to determine first data partitions according to the partition information. Each partition indicated by the partition information is bound to one first data partition. A first data partition stores the intermediate data obtained by executing a map task on the source data of the partition bound to that first data partition, so that a reduce task can be executed on the intermediate data in the first data partition to obtain a target file. A request module is configured to request the computing cluster to send the target file corresponding to each first data partition to the query cluster, so that the target files are used when data queries are performed on the data table to be loaded in the KeyValue database of the query cluster.

In a possible implementation of the fifth aspect, the query cluster has second data partitions, the Key value ranges corresponding to all the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and each second data partition is used to store the target file of the corresponding Key value range.

In a possible implementation of the fifth aspect, the request module is further configured to request the partition information of the data table to be loaded from the query cluster before the data load request is sent to the computing cluster.

According to a sixth aspect, an embodiment of the present invention provides a query cluster, including the following modules. A receiving module is configured to receive the target file corresponding to each first data partition sent by a computing cluster. A loading module is configured to load the target file corresponding to each first data partition into the KeyValue database, so that the target files are used when data queries are performed on the data table to be loaded in the KeyValue database.

In a possible implementation of the sixth aspect, the query cluster has second data partitions, the Key value ranges corresponding to all the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and each second data partition is used to store the target file of the corresponding Key value range.

According to another aspect, an embodiment of the present invention provides a computing cluster including multiple computing nodes. One of the computing nodes executes the data loading method provided by the first aspect or any possible implementation of the first aspect, or at least two of the computing nodes exchange data with each other to execute the data loading method provided by the first aspect or any possible implementation of the first aspect.

According to another aspect, an embodiment of the present invention provides a computer storage medium for storing computer software instructions used by the foregoing computing cluster, including a program designed to execute the data loading method provided by the first aspect or any possible implementation of the first aspect.

According to another aspect, an embodiment of the present invention provides a terminal, including at least one processor, a memory, and a communication interface. The at least one processor, the memory, and the communication interface are connected by a bus. The memory is configured to store computer-executable instructions. The at least one processor is configured to execute the computer-executable instructions stored in the memory, so that the terminal exchanges data with the computing cluster and/or the query cluster through the communication interface to execute the data loading method provided by the foregoing embodiments.

According to another aspect, an embodiment of the present invention provides a computer storage medium for storing computer software instructions used by the foregoing terminal, including a program designed to execute the data loading method provided by the second aspect or any possible implementation of the second aspect.

According to another aspect, an embodiment of the present invention provides a query cluster including multiple query nodes. One of the query nodes executes the data loading method provided by the first aspect or any possible implementation of the first aspect, or at least two of the query nodes exchange data with each other to execute the data loading method provided by the first aspect or any possible implementation of the first aspect.

According to another aspect, an embodiment of the present invention provides a computer storage medium for storing computer software instructions used by the foregoing query cluster, including a program designed to execute the data loading method provided by the third aspect or any possible implementation of the third aspect.

According to another aspect, an embodiment of the present invention provides a communication system, including the terminal, the computing cluster, and the query cluster described in the foregoing aspects.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly describes the accompanying drawings needed for describing the embodiments or the prior art. The accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of a cluster in the prior art;

Fig. 2 is a schematic diagram of a system architecture according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of another system architecture according to an embodiment of the present invention;

Fig. 4 is a flowchart of a data loading method according to an embodiment of the present invention;

Fig. 5 is a flowchart of another data loading method according to an embodiment of the present invention;

Fig. 6 is a flowchart of yet another data loading method according to an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a computing cluster according to an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a terminal according to an embodiment of the present invention;

Fig. 9 is a schematic structural diagram of a query cluster according to an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of a computing device according to an embodiment of the present invention.

Detailed description of the embodiments

The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments. The described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

The system architecture involved in using a MapReduce service component to batch-load data for a KeyValue database may be as shown in Fig. 2. The system may include a cluster and multiple terminals. A terminal is a device that can submit requests such as batch data loading tasks and query tasks to the cluster, for example a desktop computer, a laptop, an iPad, or a smartphone. The cluster may include multiple node devices, each of which may be a computing device with computing capability. A MapReduce service component, a KeyValue database, and a distributed file system are all deployed in the cluster. The MapReduce service component, with its high-performance parallel computing capability, can batch-load data for the KeyValue database. The KeyValue database can respond to read/write requests from terminals and provide a query service for terminal users. The distributed file system provides highly reliable underlying storage support for the KeyValue database. For example, the cluster may be a Hadoop cluster, the distributed file system may be HDFS (Hadoop Distributed File System), and the KeyValue database may be HBase.

When the MapReduce service component is used to batch-load data for the KeyValue database, a terminal can submit a data load request to the cluster. After a management node of the MapReduce service component receives the data load request, the MapReduce task is executed. Specifically, a MapReduce task includes Map tasks and Reduce tasks, where the Map stage may include a Shuffle stage and the Reduce stage may include a Sort stage. The cluster executes a Map task to read the source data and parse it into intermediate data in the form of <Key, Value> pairs. In the Shuffle stage, the <Key, Value> pairs produced by the Map task are written into data partitions (Partitions) according to their keys, so that the Reduce tasks can obtain their data from the Partitions. Optionally, when a Reduce task is executed, the <Key, Value> pairs in its Partition may first be sorted. Each Reduce task corresponds to one Partition, and each Partition corresponds to one data partition (Region) in the KeyValue database. Each Reduce task generates the target file of its Partition. Because the target files generated in the Reduce stage are used by the query service of the KeyValue database, they conform to the file storage format defined by the KeyValue database. The cluster then loads the target files generated by the Reduce tasks from the distributed file system into the KeyValue database for use in queries.

In the system architecture shown in Fig. 2, the processes with which the MapReduce service component executes MapReduce tasks and the query service processes of the KeyValue database run in the same cluster. While executing a batch data loading task, the MapReduce service component needs to read a large amount of data and perform heavy computation such as sorting and partitioning, so the load on the entire cluster is very high and resource utilization is very high. This greatly affects the read/write latency of the KeyValue database in the cluster and degrades its query performance. To address this problem, the embodiments of the present invention provide a data loading method, a terminal, and a computing cluster. The MapReduce service component and the KeyValue database are deployed in different clusters, which reduces the load on the cluster hosting the KeyValue database, reduces the read/write latency of the KeyValue database, and improves its query performance. At the same time, the MapReduce service component can obtain sufficient resources to execute MapReduce tasks, which improves the execution efficiency of the MapReduce tasks.

As shown in Fig. 3, the system architecture involved in the data loading method provided in the embodiments of the present invention may include two different clusters, a query cluster and a computing cluster, as well as a terminal. Each cluster may include multiple node devices, each of which may be a computing device with computing capability. The query cluster is deployed with a KeyValue database and a distributed file system and can provide a query service for users. For example, the KeyValue database may be Google Bigtable, Apache HBase, or Apache Cassandra. The computing cluster is deployed with a MapReduce service component and a distributed file system; it can store source data files and execute MapReduce tasks to batch-load data for the KeyValue database. The distributed file system in the query cluster and the distributed file system in the computing cluster may be two independent distributed file systems, or the two clusters may share the same distributed file system. This is not specifically limited here.

Based on the system architecture shown in Fig. 3, an embodiment of the present invention provides a data loading method. Referring to Fig. 4, the method may include the following steps:

101. The terminal sends a data load request to the computing cluster. The data load request carries the partition information of the data table to be loaded and instructs the computing cluster to determine first data partitions according to the partition information. Each partition indicated by the partition information is bound to one first data partition. A first data partition stores the intermediate data obtained by executing a map task on the source data of the partition bound to it, so that a reduce task can be executed on the intermediate data in the first data partition to obtain a target file.

In the system architecture shown in Fig. 3, the computing cluster is deployed with the MapReduce service component. The terminal can send a data load request to the computing cluster to request that MapReduce tasks be executed using the resources of the computing cluster, so that data loading can be performed after the MapReduce tasks are completed.

The data load request carries the partition information of the data table to be loaded, and the partition information indicates at least one partition. The data load request can instruct the computing cluster to determine, according to the partition information, first data partitions that are bound one-to-one to the partitions indicated by the partition information. A first data partition can store the intermediate data obtained after a map task is executed on the source data of the partition bound to it, so that the computing cluster can execute a reduce task on the intermediate data in the first data partition and obtain a target file. Specifically, a first data partition may be a data partition (Partition) in the computing cluster shown in Fig. 3.

102. After receiving the data load request sent by the terminal, the computing cluster determines the first data partitions according to the partition information.

After receiving the data load request that carries the partition information from the terminal, the computing cluster can determine, according to the partition information, the first data partitions that are bound one-to-one to the partitions indicated by the partition information.

For example, when the partition information indicates three partitions, the computing cluster determines three first data partitions, which may be first data partition A (Partition A), first data partition B (Partition B), and first data partition C (Partition C) in the system architecture shown in Fig. 3.
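The one-to-one binding in this example can be sketched as follows. The request layout, the table name, the partition identifiers `region-1` to `region-3`, and the function name are hypothetical; only the Partition A/B/C naming follows the example above.

```python
from string import ascii_uppercase


def determine_first_data_partitions(partition_info):
    """Bind each partition named in the partition information to one first data partition."""
    if len(partition_info) > len(ascii_uppercase):
        raise ValueError("this sketch only names up to 26 first data partitions")
    return {partition: f"Partition {ascii_uppercase[i]}"
            for i, partition in enumerate(partition_info)}


# Assumed shape of a data load request carrying partition information.
load_request = {
    "table": "example_table",
    "partition_info": ["region-1", "region-2", "region-3"],
}
binding = determine_first_data_partitions(load_request["partition_info"])
```

Here `binding` maps the three indicated partitions to Partition A, Partition B, and Partition C respectively.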

103. The computing cluster separately obtains, from the distributed file system, the source data of each partition indicated by the partition information, and executes a map task on the source data of each partition.

After receiving the data load request, the computing cluster executes the MapReduce task. Specifically, the computing cluster can first obtain, from the distributed file system, the source data of each partition indicated by the partition information, and execute a Map task on the source data of each partition to obtain intermediate data in the form of <Key, Value> pairs.

104. According to the binding relationship between the partitions indicated by the partition information and the first data partitions, the computing cluster writes the intermediate data produced by each map task into the corresponding first data partition.

The stage in which the computing cluster executes Map tasks may include a Shuffle stage. After a Map task parses the source data into <Key, Value> pairs, the computing cluster can, in the Shuffle stage, write the intermediate <Key, Value> pairs obtained from the source data of each partition into the first data partition (Partition) corresponding to that partition, according to their keys.
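The Map and Shuffle behaviour just described can be illustrated with an in-memory sketch. It assumes comma-separated `key,value` source lines and half-open string Key ranges, and it processes all source lines in one pass rather than per partition. None of these details come from the patent, and the code is not the MapReduce framework API.

```python
from collections import defaultdict

# Assumed bindings: ((start, end) Key range, first data partition); end None = unbounded.
bindings = [
    (("", "g"), "Partition A"),
    (("g", "p"), "Partition B"),
    (("p", None), "Partition C"),
]


def map_task(lines):
    """Parse 'key,value' source lines into <Key, Value> pairs."""
    for line in lines:
        key, value = line.strip().split(",", 1)
        yield key, value


def shuffle(pairs):
    """Write each pair into the first data partition whose Key range contains its key."""
    partitions = defaultdict(list)
    for key, value in pairs:
        for (start, end), name in bindings:
            if key >= start and (end is None or key < end):
                partitions[name].append((key, value))
                break
    return partitions


source = ["kiwi,3", "apple,1", "zebra,9", "grape,2"]
intermediate = shuffle(map_task(source))
```

After this runs, `intermediate["Partition B"]` holds the `kiwi` and `grape` pairs in arrival order. Sorting is left to the Reduce stage.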

105. The computing cluster executes a reduction task on the intermediate data in each first data subregion to obtain the file destination of each reduction task; the file destination is used when data queries are performed on the data table to be loaded in the KeyValue database.

After writing the intermediate data obtained by executing each mapping (Map) task into the corresponding first data subregion, the computing cluster can execute a reduction (Reduce) task on the intermediate <Key, Value> pairs in each first data subregion, thereby obtaining the file destination corresponding to the Reduce task of each first data subregion.

The format of the file destination follows the file storage format defined by the KeyValue database, so that the file destination can be used for data queries on the data table to be loaded of the KeyValue database. Illustratively, when the KeyValue database is HBase, the file destination can be in the HFile format.
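As a non-limiting illustration of step 105, the following Python sketch runs one Reduce task per first data subregion and emits one output per task. It rests on assumptions that the description does not state: the function name `run_reduce_tasks` is hypothetical, and Key-sorted text lines merely stand in for a real storage format such as HFile, which this sketch does not attempt to produce.

```python
def run_reduce_tasks(intermediate_by_subregion):
    """Run one Reduce task per first data subregion.

    intermediate_by_subregion: {first data subregion name: [(key, value), ...]}
    Returns {first data subregion name: ["key<TAB>value", ...]} with the lines
    of each output sorted by Key. The text lines are a stand-in for a real
    file destination format (assumption of this sketch).
    """
    file_destinations = {}
    for name, pairs in intermediate_by_subregion.items():
        # Each Reduce task only sees the pairs of its own first data subregion.
        file_destinations[name] = [f"{key}\t{value}" for key, value in sorted(pairs)]
    return file_destinations


intermediate = {
    "Partition A": [("00000002", "bob"), ("00000001", "alice")],
    "Partition B": [("10000000", "carol")],
}
files = run_reduce_tasks(intermediate)
for name, lines in files.items():
    print(name, lines)
```

Sorting by Key is chosen here because HFile stores its entries in Key order; other KeyValue databases may define other layouts.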

It can be seen that, in the data load method provided by the embodiment of the present invention, the MapReduce task is executed with the resources of the computing cluster to generate file destinations whose format is identical to the file storage format defined by the KeyValue database, for use in data queries on the data table to be loaded of the KeyValue database in the inquiry cluster. The computing cluster that executes the MapReduce task and the inquiry cluster that provides the query service for users are two mutually independent clusters. Therefore, even though executing the MapReduce task occupies a large amount of resources such as CPU and I/O ports, these are resources of the computing cluster, and the corresponding resources of the inquiry cluster are not occupied. The load of the inquiry cluster is thus lower, which reduces the read-write delay of the KeyValue database in the inquiry cluster and improves the query performance of the KeyValue database.

That is, in the data load method provided by the embodiment of the present invention, the MapReduce service component and the KeyValue database are deployed in different clusters, which prevents the resource-intensive MapReduce task process from affecting the query service process, thereby reducing the load of the cluster where the KeyValue database is located and improving the query performance of the KeyValue database.

In addition, in the above step 101, the data load request can also carry a source data file storage path and an output path. In step 103, the computing cluster can obtain the source data under the source data file storage path; in step 105, after generating the file destination corresponding to each first data subregion, the computing cluster stores these file destinations under the output path.
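The content of such a data load request can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the class name `DataLoadRequest`, its field names, the example paths and the one-file-per-subregion layout under the output path are all hypothetical and are not defined by the embodiment.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class DataLoadRequest:
    """Hypothetical shape of the request sent in step 101."""
    partition_info: dict   # partition information of the data table to be loaded
    source_path: str       # source data file storage path, read in step 103
    output_path: str       # output path, written in step 105

    def file_destination_path(self, first_data_subregion):
        # Assumed layout: one file destination per first data subregion,
        # stored directly under the output path.
        return posixpath.join(self.output_path, first_data_subregion)


request = DataLoadRequest(
    partition_info={"range": ("00000000", "29999999"),
                    "split_keys": ("10000000", "20000000")},
    source_path="/example/source/user_info",    # hypothetical path
    output_path="/example/output/user_info",    # hypothetical path
)
print(request.file_destination_path("partition_a"))
```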

It should be noted that, after generating the file destinations, the computing cluster can store them on a local disk or in a distributed file system. Moreover, the distributed file system of the computing cluster can be independent of the distributed file system of the inquiry cluster, or the computing cluster and the inquiry cluster can share the same distributed file system.

On the one hand, in the case where the computing cluster stores the file destinations on a local disk or in a distributed file system that is independent of the distributed file system of the inquiry cluster, referring to Fig. 5, after step 105 the method can also include:

106. The terminal requests the computing cluster to send the file destination corresponding to each first data subregion to the inquiry cluster, so that the file destinations are used when data queries are performed on the data table to be loaded of the KeyValue database.

107. The computing cluster sends the file destination corresponding to each first data subregion to the inquiry cluster.

108. The inquiry cluster receives the file destination corresponding to each first data subregion sent by the computing cluster.

109. The inquiry cluster loads the file destination corresponding to each first data subregion into the KeyValue database, so that the file destinations are used when data queries are performed on the data table to be loaded of the KeyValue database.

In this case, the terminal can request the computing cluster to send, to the inquiry cluster, the file destination corresponding to each first data subregion saved on the local disk or in the distributed file system of the computing cluster. After receiving the file destination corresponding to each first data subregion sent by the computing cluster, the inquiry cluster can save these file destinations on a local disk, or in the distributed file system used by the inquiry cluster (a distributed file system independent of that of the computing cluster). The file destinations on the local disk or in the distributed file system can then be loaded into the KeyValue database, for use when data queries are performed on the data table to be loaded of the KeyValue database.

On the other hand, in the case where the computing cluster stores the file destinations in a distributed file system that it shares with the inquiry cluster, the computing cluster does not need to send the file destinations to the inquiry cluster. After step 105, the method can also include:

110. The inquiry cluster obtains the file destinations from the distributed file system.

After step 105, the inquiry cluster can obtain the file destinations directly from the distributed file system shared with the computing cluster and load the data, so that the obtained file destinations are used when data queries are performed on the data table to be loaded of the KeyValue database. Specifically, the inquiry cluster can obtain, according to the partition information, the file destination corresponding to each first data subregion from the distributed file system shared with the computing cluster.
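The two delivery alternatives described above can be summarized in a short Python sketch. The names `Storage` and `steps_after_105` are hypothetical and serve only to show which steps follow step 105 for each storage location of the file destinations.

```python
from enum import Enum


class Storage(Enum):
    """Where the computing cluster stored the file destinations (step 105)."""
    LOCAL_DISK = "local disk of the computing cluster"
    INDEPENDENT_DFS = "distributed file system independent of the inquiry cluster"
    SHARED_DFS = "distributed file system shared with the inquiry cluster"


def steps_after_105(storage):
    """Return the step numbers that follow step 105 for the given storage."""
    if storage is Storage.SHARED_DFS:
        # The inquiry cluster reads the file destinations itself; nothing is sent.
        return [110]
    # Otherwise the terminal asks the computing cluster to send the files,
    # and the inquiry cluster receives and loads them.
    return [106, 107, 108, 109]


for storage in Storage:
    print(storage.name, steps_after_105(storage))
```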

Further, referring to Fig. 6, before the above step 101, the method can also include:

111. The terminal requests the partition information of the data table to be loaded from the inquiry cluster.

The partition information of the data table to be loaded, which the terminal requests from the inquiry cluster, indicates at least one subregion corresponding to the data table to be loaded. The partition information can be represented in many ways, and the embodiment of the present invention does not limit its specific form.

In the KeyValue database, the data in the data table to be loaded corresponds to a Key value range, and the data table to be loaded can be divided into at least one subregion according to the Key value range; the subregions indicated by the partition information have different Key value ranges. Here, Key is a keyword, which may specifically be a field, attribute or feature in the data table to be loaded.

Illustratively, the data table to be loaded in the KeyValue database is a "user information table", which specifically includes 4 fields, "identity", "name", "phone" and "address", and the range of the users' "identity" is 00000000-29999999. The specific format of the "user information table" is shown in Table 1 below:

Table 1

Identity	Name	Phone	Address
00000000	…	…	…
00000001	…	…	…
00000002	…	…	…
…	…	…	…
29999999	…	…	…

In the data table to be loaded shown in Table 1, if the Key is the "identity" field, the Key value range corresponding to the data table to be loaded is 00000000-29999999. The data table to be loaded can be divided into subregions according to separation Key values. For example, when the separation Key values are 10000000 and 20000000, the data table to be loaded can be divided into 3 subregions: subregion 1 corresponding to Key value range 00000000-09999999, subregion 2 corresponding to Key value range 10000000-19999999, and subregion 3 corresponding to Key value range 20000000-29999999.

In this example, the partition information can be the Key value range 00000000-29999999 corresponding to the data table to be loaded together with the separation Key values 10000000 and 20000000. This partition information indicates that the data table to be loaded corresponds to 3 subregions, where the Key value range of subregion 1 is 00000000-09999999, the Key value range of subregion 2 is 10000000-19999999, and the Key value range of subregion 3 is 20000000-29999999.
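The derivation of the subregion Key value ranges from the overall Key value range and the separation Key values can be sketched in Python as follows. This is a minimal illustration only: the function names `split_key_range` and `fmt` and the tuple representation are assumptions of the sketch, since the embodiment does not limit the form of the partition information.

```python
def split_key_range(first_key, last_key, split_keys):
    """Split the inclusive range [first_key, last_key] at each separation Key value.

    Each separation Key value becomes the first Key of the next subregion.
    Returns a list of inclusive (start, end) integer ranges.
    """
    ranges = []
    start = first_key
    for split in sorted(split_keys):
        if not first_key < split <= last_key:
            raise ValueError(f"separation Key value {split} is outside the table range")
        ranges.append((start, split - 1))
        start = split
    ranges.append((start, last_key))
    return ranges


def fmt(key, width=8):
    """Render a Key as a zero-padded string, as in Table 1."""
    return str(key).zfill(width)


subregions = split_key_range(0, 29999999, [10000000, 20000000])
for number, (start, end) in enumerate(subregions, start=1):
    print(f"subregion {number}: {fmt(start)}-{fmt(end)}")
```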

When the subregions indicated by the partition information have different Key value ranges, for a subregion and a first data subregion that have a binding relationship, the source data of the subregion and the intermediate data of the first data subregion have the same Key value range.

Illustratively, when the Key value range of subregion 1 is 00000000-09999999, that of subregion 2 is 10000000-19999999 and that of subregion 3 is 20000000-29999999, if the first data subregion bound to subregion 1 is Partition A, the first data subregion bound to subregion 2 is Partition B and the first data subregion bound to subregion 3 is Partition C, then: the source data of subregion 1 and the intermediate data of Partition A both correspond to Key value range 00000000-09999999; the source data of subregion 2 and the intermediate data of Partition B both correspond to Key value range 10000000-19999999; and the source data of subregion 3 and the intermediate data of Partition C both correspond to Key value range 20000000-29999999.

Thus, in step 103, the computing cluster can obtain from the distributed file system the source data corresponding to subregion 1, whose Key value range is 00000000-09999999; and in step 104, the computing cluster can write the intermediate data obtained by executing the mapping (Map) task whose Key values fall within 00000000-09999999 into Partition A.

Similarly, the computing cluster can obtain from the distributed file system the source data corresponding to subregion 2, whose Key value range is 10000000-19999999; and in step 104, the computing cluster can write the intermediate data obtained by executing the mapping (Map) task whose Key values fall within 10000000-19999999 into Partition B.

Likewise, the computing cluster can obtain from the distributed file system the source data corresponding to subregion 3, whose Key value range is 20000000-29999999; and in step 104, the computing cluster can write the intermediate data obtained by executing the mapping (Map) task whose Key values fall within 20000000-29999999 into Partition C.
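The routing of intermediate <Key, Value> pairs to Partition A, Partition B and Partition C described above can be sketched in Python with a binary search over the separation Key values. This is an illustrative sketch, not the Shuffle implementation of any particular MapReduce framework; the names `route` and `shuffle` are hypothetical, and Keys are assumed to be fixed-width, zero-padded strings so that string order matches numeric order.

```python
from bisect import bisect_right

# Separation Key values and bound first data subregions from the example above.
SPLIT_KEYS = ["10000000", "20000000"]
FIRST_DATA_SUBREGIONS = ["Partition A", "Partition B", "Partition C"]


def route(key):
    """Return the first data subregion bound to the subregion whose range holds `key`."""
    # bisect_right counts the separation Key values that are <= key,
    # which is exactly the index of the matching subregion.
    return FIRST_DATA_SUBREGIONS[bisect_right(SPLIT_KEYS, key)]


def shuffle(pairs):
    """Group intermediate <Key, Value> pairs by first data subregion."""
    buckets = {name: [] for name in FIRST_DATA_SUBREGIONS}
    for key, value in pairs:
        buckets[route(key)].append((key, value))
    return buckets


intermediate = [("20000005", "c"), ("00000001", "a"), ("10000000", "b")]
buckets = shuffle(intermediate)
for name, pairs in buckets.items():
    print(name, pairs)
```

Because each pair is routed by the same Key value ranges that delimit the subregions, every first data subregion ends up holding exactly the Key value range of the subregion it is bound to.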

Further, the inquiry cluster can also have second data subregions. The Key value ranges corresponding to all the second data subregions are identical to the Key value ranges corresponding to the first data subregions, and each second data subregion is used to store the file destination of its corresponding Key value range.

On this basis, after the inquiry cluster receives, in step 108, the file destination corresponding to each first data subregion sent by the computing cluster, the method can also include: the inquiry cluster saves the file destination corresponding to each first data subregion into the second data subregion that corresponds to the same Key value range as that first data subregion, and then loads the file destination corresponding to each second data subregion into the KeyValue database, so that the file destinations are used when data queries are performed on the data table to be loaded of the KeyValue database.

Likewise, after the inquiry cluster obtains, in step 110, the file destination corresponding to each first data subregion directly from the distributed file system shared with the computing cluster, the inquiry cluster can also save the file destination corresponding to each first data subregion into the second data subregion that corresponds to the same Key value range as that first data subregion, and then load the file destination corresponding to each second data subregion into the KeyValue database, so that the file destinations are used when data queries are performed on the data table to be loaded of the KeyValue database.

Illustratively, the second data subregions in the inquiry cluster can be the Regions shown in Fig. 3. Region1 and Partition A can correspond to the same Key value range 00000000-09999999, and Region1 can be used to store the file destination of Key value range 00000000-09999999. Region2 and Partition B can correspond to the same Key value range 10000000-19999999, and Region2 can be used to store the file destination of Key value range 10000000-19999999. Region3 and Partition C can correspond to the same Key value range 20000000-29999999, and Region3 can be used to store the file destination of Key value range 20000000-29999999.

In step 108 or 110, the inquiry cluster can store the file destination corresponding to Key value range 00000000-09999999 into the second data subregion Region1, the file destination corresponding to Key value range 10000000-19999999 into the second data subregion Region2, and the file destination corresponding to Key value range 20000000-29999999 into the second data subregion Region3, so that these file destinations are used when data queries are performed on the data table to be loaded of the KeyValue database.
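The matching of file destinations to second data subregions by identical Key value range can be sketched in Python as follows. The function name `place_in_regions` and the file names are hypothetical; the sketch only shows the lookup by Key value range and does not model how a KeyValue database actually loads a file.

```python
def place_in_regions(file_destinations, regions):
    """Map each file destination to the Region with the identical Key value range.

    file_destinations: {(start_key, end_key): file name}
    regions:           {Region name: (start_key, end_key)}
    """
    region_by_range = {key_range: name for name, key_range in regions.items()}
    placement = {}
    for key_range, file_name in file_destinations.items():
        if key_range not in region_by_range:
            raise LookupError(f"no second data subregion covers {key_range}")
        placement[region_by_range[key_range]] = file_name
    return placement


regions = {
    "Region1": ("00000000", "09999999"),
    "Region2": ("10000000", "19999999"),
    "Region3": ("20000000", "29999999"),
}
file_destinations = {
    ("00000000", "09999999"): "file_destination_a",   # produced for Partition A
    ("10000000", "19999999"): "file_destination_b",   # produced for Partition B
    ("20000000", "29999999"): "file_destination_c",   # produced for Partition C
}
placement = place_in_regions(file_destinations, regions)
print(placement)
```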

In addition, when requesting the partition information of the data table to be loaded from the inquiry cluster, the terminal can also send identification information of the data table to be loaded to the inquiry cluster, so that the inquiry cluster can determine, according to the identification information, the data table to be loaded and its partition information. The identification information indicates the data table to be loaded and can be, for example, the table name or number of the data table to be loaded, which is not specifically limited here.

In addition, the terminal can also store the connection config set of the inquiry cluster and the connection config set of the computing cluster. A connection config set stores the configuration information needed when the terminal establishes a connection with the inquiry cluster or the computing cluster, and the embodiment of the present invention does not specifically limit its content. Illustratively, a connection config set may include at least one of an Internet Protocol (IP) address, a port and secure access configuration information. The IP address in a connection config set can be the IP address of a management node in the inquiry cluster or the computing cluster, or can include the IP addresses of all nodes in the inquiry cluster or the computing cluster; the port in a connection config set can be the port that provides the related service.
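One possible shape of a connection config set is sketched below in Python. The class name `ConnectionConfigSet`, its fields, the port numbers and the security settings are all hypothetical, since the embodiment does not limit the content of a connection config set; the IP addresses are taken from the 192.0.2.0/24 documentation range, and the sketch formats IPv4 endpoints only.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class ConnectionConfigSet:
    """Hypothetical connection config set kept on the terminal for one cluster."""
    ip_addresses: list                              # management node only, or all nodes
    port: int                                       # port that provides the related service
    security: dict = field(default_factory=dict)    # secure access configuration

    def __post_init__(self):
        for address in self.ip_addresses:
            ipaddress.ip_address(address)           # raises ValueError if malformed
        if not 0 < self.port < 65536:
            raise ValueError(f"port {self.port} is out of range")

    def endpoints(self):
        """Return "ip:port" strings the terminal could try to connect to."""
        return [f"{address}:{self.port}" for address in self.ip_addresses]


inquiry_cluster = ConnectionConfigSet(["192.0.2.10"], 16000, {"auth": "example"})
computing_cluster = ConnectionConfigSet(["192.0.2.20", "192.0.2.21"], 8032)
print(inquiry_cluster.endpoints())
print(computing_cluster.endpoints())
```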

Further, referring to Fig. 6, before the above step 111, the method can also include:

112. The terminal sends a first connection establishment request to the inquiry cluster according to the connection config set of the inquiry cluster.

Before the above step 101, the method can also include:

113. The terminal sends a second connection establishment request to the computing cluster according to the connection config set of the computing cluster.

Further, before the above step 112, the method can also include:

114. The terminal starts a data loading task.

The data loading task is a task of loading data into the KeyValue database in the inquiry cluster. The terminal can start the data loading task in many ways. For example, the terminal can receive a trigger instruction input by a user and start a batch data loading task accordingly; the terminal can start a batch data loading task automatically after booting; or the terminal can start batch data loading tasks periodically. This is not specifically limited here.

The embodiment of the present invention provides a computing cluster 700. Referring to Fig. 7, the computing cluster 700 may include a receiving module 701, a determining module 702, an execution module 703, a writing module 704 and a sending module 705. Specifically, the computing cluster 700 may include multiple computing nodes, each of which can be a computing device with computing capability; at least one computing node in the computing cluster 700 is used to deploy the functions of the modules of the computing cluster 700. The receiving module 701 can be used to receive a data load request that carries the partition information of the data table to be loaded. The determining module 702 can be used to determine the first data subregions according to the partition information, where each subregion indicated by the partition information is bound to one first data subregion. The execution module 703 can be used to obtain, from the distributed file system, the source data of each subregion indicated by the partition information, and to execute a mapping task on the source data of each subregion. The writing module 704 can be used to write, according to the binding relationship between the subregions indicated by the partition information and the first data subregions, the intermediate data obtained by executing each mapping task into the corresponding first data subregion. The execution module 703 can also be used to execute a reduction task on the intermediate data in each first data subregion to obtain the file destination of each reduction task, where the file destination is used when data queries are performed on the data table to be loaded of the KeyValue database of the inquiry cluster.

In addition, the sending module 705 can be used to execute step 107 in Fig. 5. The computing cluster 700 in Fig. 7 can be used to execute any of the processes in the above method, which the embodiment of the present invention does not describe in detail here.

The embodiment of the present invention also provides a computer storage medium for storing the computer software instructions used by the computing cluster shown in Fig. 7, which include a program designed for executing the above method embodiments. Data loading can be implemented by executing the stored program.

The embodiment of the present invention provides a terminal 800. Referring to Fig. 8, the terminal 800 may include a sending module 801 and a request module 802. The sending module 801 can be used to send a data load request to the computing cluster. The data load request carries the partition information of the data table to be loaded and instructs the computing cluster to determine the first data subregions according to the partition information. Each subregion indicated by the partition information is bound to one first data subregion, and a first data subregion is used to store the intermediate data obtained by executing a mapping task on the source data of the subregion bound to that first data subregion, so that a reduction task is executed on the intermediate data in the first data subregion to obtain a file destination. The request module 802 can be used to request the computing cluster to send the file destination corresponding to each first data subregion to the inquiry cluster, so that the file destinations are used when data queries are performed on the data table to be loaded of the KeyValue database of the inquiry cluster.

In addition, the request module 802 can be used to execute step 111 in Fig. 6. The terminal 800 in Fig. 8 can be used to execute any of the processes in the above method, which the embodiment of the present invention does not describe in detail here.

The embodiment of the present invention also provides a computer storage medium for storing the computer software instructions used by the terminal shown in Fig. 8, which include a program designed for executing the above method embodiments. Data loading can be implemented by executing the stored program.

The embodiment of the present invention provides an inquiry cluster 900. Referring to Fig. 9, the inquiry cluster 900 may include a receiving module 901 and a loading module 902. Specifically, the inquiry cluster 900 may include multiple query nodes, each of which can be a computing device with computing capability; at least one query node in the inquiry cluster 900 is used to deploy the functions of the modules of the inquiry cluster 900. The receiving module 901 can be used to receive the file destination corresponding to each first data subregion sent by the computing cluster. The loading module 902 can be used to load the file destination corresponding to each first data subregion into the KeyValue database, so that the file destinations are used when data queries are performed on the data table to be loaded of the KeyValue database. In addition, the inquiry cluster 900 in Fig. 9 can be used to execute any of the processes in the above method, which the embodiment of the present invention does not describe in detail here.

The embodiment of the present invention also provides a computer storage medium for storing the computer software instructions used by the inquiry cluster shown in Fig. 9, which include a program designed for executing the above method embodiments. Data loading can be implemented by executing the stored program.

Referring to Fig. 10, the embodiment of the present invention also provides a computing device 1000, which includes at least one processor 1001, a memory 1002 and a communication interface 1003. The at least one processor 1001, the memory 1002 and the communication interface 1003 are connected by a bus 1004. The memory 1002 is used to store computer-executable instructions. The at least one processor 1001 is used to execute the computer-executable instructions stored in the memory 1002, so that the computing device 1000 performs data interaction, through the communication interface 1003, with other devices that have computing capability (such as a query node in the inquiry cluster, the terminal, or a computing node in the computing cluster), to execute the data load method provided by the above embodiments.

In an optional embodiment, the computing node included in the computing cluster provided by the embodiment of the present invention is the computing device 1000. The computing device 1000 of the computing cluster performs data interaction, through the communication interface 1003, with the other computing devices 1000 in the computing cluster, the terminal and the query nodes of the inquiry cluster, to execute the data load method provided by the above embodiments.

In an optional embodiment, the query node included in the inquiry cluster provided by the embodiment of the present invention is the computing device 1000. The computing device 1000 of the inquiry cluster performs data interaction, through the communication interface 1003, with the other computing devices 1000 in the inquiry cluster, the terminal and the computing nodes of the computing cluster, to execute the data load method provided by the above embodiments.

In an optional embodiment, the terminal provided by the embodiment of the present invention is the computing device 1000. The computing device 1000 performs data interaction, through the communication interface 1003, with the computing nodes of the computing cluster and the query nodes of the inquiry cluster, to execute the data load method provided by the above embodiments.

Optionally, the at least one processor 1001 may include processors 1001 of different types or processors 1001 of the same type. A processor 1001 can be any of the following devices with computing and processing capability: a central processing unit (CPU), an ARM processor, a field programmable gate array (FPGA), a dedicated processor, and so on. In an optional embodiment, the at least one processor 1001 can also be integrated into a many-core processor.

Optionally, the memory 1002 can be any one or any combination of the following storage media: random access memory (RAM), read-only memory (ROM), non-volatile memory (NVM), solid state drive (SSD), mechanical hard disk, magnetic disk, disk array, and so on.

Optionally, the communication interface 1003 is used for the computing device 1000 to perform data interaction with other devices that have computing capability or storage capability. The communication interface 1003 can be any one or any combination of the following devices with a network access function: a network interface (such as an Ethernet interface), a wireless network card, and so on.

The bus 1004 may include an address bus, a data bus, a control bus and so on; for ease of representation, Fig. 10 shows the bus as a single thick line. The bus 1004 can be any one or any combination of the following devices for wired data transmission: an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, and so on.

Another embodiment of the present invention provides a communication system, which may include a terminal, a computing cluster and an inquiry cluster; for a structural schematic diagram of the communication system, refer to Fig. 3. The terminal, the computing cluster and the inquiry cluster in the communication system can execute the data load method in the above method embodiments.

Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data load method, characterized in that a computing cluster is used for data loading, an inquiry cluster is used for data queries of a KeyValue database, and the computing cluster and the inquiry cluster are different clusters, the method comprising:

the computing cluster receives a data load request, the data load request carrying the partition information of a data table to be loaded;

the computing cluster determines first data subregions according to the partition information, wherein each subregion indicated by the partition information is bound to one first data subregion;

the computing cluster obtains, from a distributed file system, the source data of each subregion indicated by the partition information, and executes a mapping task on the source data of each subregion;

the computing cluster writes, according to the binding relationship between the subregions indicated by the partition information and the first data subregions, the intermediate data obtained by executing each mapping task into the corresponding first data subregion;

the computing cluster executes a reduction task on the intermediate data in each first data subregion to obtain the file destination of each reduction task, the file destination being used when data queries are performed on the data table to be loaded of the KeyValue database.

2. The method according to claim 1, characterized in that the subregions indicated by the partition information have different Key value ranges; for a subregion and a first data subregion that have the binding relationship, the source data of the subregion and the intermediate data of the first data subregion have the same Key value range.

3. The method according to claim 1 or 2, characterized in that the method further comprises:

the computing cluster sends the file destination to the inquiry cluster.

4. The method according to claim 3, characterized in that the inquiry cluster has second data subregions, the Key value ranges corresponding to all the second data subregions are identical to the Key value ranges corresponding to the first data subregions, and each second data subregion is used to store the file destination of its corresponding Key value range.

5. A data load method, characterized in that a computing cluster is used for data loading, an inquiry cluster is used for data queries of a KeyValue database, and the computing cluster and the inquiry cluster are different clusters, the method comprising:

sending a data load request to the computing cluster, wherein the data load request carries the partition information of a data table to be loaded and instructs the computing cluster to determine first data subregions according to the partition information, each subregion indicated by the partition information is bound to one first data subregion, and a first data subregion is used to store the intermediate data obtained by executing a mapping task on the source data of the subregion bound to that first data subregion, so that a reduction task is executed on the intermediate data in the first data subregion to obtain a file destination;

requesting the computing cluster to send the file destination corresponding to each first data subregion to the inquiry cluster, so that the file destinations are used when data queries are performed on the data table to be loaded of the KeyValue database.

6. The method according to claim 5, characterized in that the inquiry cluster has second data subregions, the Key value ranges corresponding to all the second data subregions are identical to the Key value ranges corresponding to the first data subregions, and each second data subregion is used to store the file destination of its corresponding Key value range.

7. The method according to claim 5 or 6, characterized in that, before sending the data load request to the computing cluster, the method further comprises:

requesting the partition information of the data table to be loaded from the inquiry cluster.

8. A computing cluster, characterized by comprising:

a receiving module, used to receive a data load request, the data load request carrying the partition information of a data table to be loaded;

a determining module, used to determine first data subregions according to the partition information, wherein each subregion indicated by the partition information is bound to one first data subregion;

an execution module, used to obtain, from a distributed file system, the source data of each subregion indicated by the partition information, and to execute a mapping task on the source data of each subregion;

a writing module, used to write, according to the binding relationship between the subregions indicated by the partition information and the first data subregions, the intermediate data obtained by executing each mapping task into the corresponding first data subregion;

the execution module being also used to execute a reduction task on the intermediate data in each first data subregion to obtain the file destination of each reduction task, the file destination being used when data queries are performed on the data table to be loaded of the KeyValue database of an inquiry cluster.

9. computing cluster according to claim 8, which is characterized in that the Key of each subregion of the partition information instruction It is different to be worth range；For the subregion and the first data subregion with the binding relationship, the source data of the subregion With the intermediate data Key value range having the same of the first data subregion.

10. The computing cluster according to claim 8 or 9, further comprising:

a sending module, configured to send the target file to the query cluster.

11. The computing cluster according to claim 10, wherein the query cluster has second data partitions, the Key value ranges corresponding to all of the second data partitions are identical to the Key value ranges corresponding to the first data partitions, and each second data partition is used to store the target file of the corresponding Key value range.
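The range correspondence in claim 11 implies that each target file has exactly one destination in the query cluster. The following sketch shows one way to pair them. It assumes that a Key value range is a `(start, end)` tuple with `None` marking an unbounded end. The function name and the partition and file names in the usage note are hypothetical.

```python
def assign_target_files(first_ranges, second_ranges, target_files):
    """Pair each target file with the second data partition of the same Key value range.

    first_ranges:  {first data partition name: (start_key, end_key)}
    second_ranges: {second data partition name: (start_key, end_key)}
    target_files:  {first data partition name: target file name}
    """
    # The claim requires both sides to cover identical Key value ranges.
    if set(first_ranges.values()) != set(second_ranges.values()):
        raise ValueError("first and second data partitions cover different Key value ranges")
    # Index the second data partitions by their range for direct lookup.
    second_by_range = {rng: name for name, rng in second_ranges.items()}
    return {
        second_by_range[rng]: target_files[first_name]
        for first_name, rng in first_ranges.items()
    }
```

For example, with first data partitions `c0`, `c1`, `c2` and second data partitions that cover the same three ranges in a different order, the function returns a mapping from each second data partition to the one target file it should store, with no splitting or merging of files.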

12. A terminal, comprising:

a sending module, configured to send a data load request to a computing cluster, the data load request carrying partition information of a data table to be loaded, wherein the data load request instructs the computing cluster to determine first data partitions according to the partition information, each partition indicated by the partition information is bound to a respective first data partition, and each first data partition is used to store the intermediate data obtained by executing a mapping task on the source data of the partition bound to that first data partition, so that a reduction task is executed on the intermediate data in the first data partition to obtain a target file;

a request module, configured to request the computing cluster to send the target file corresponding to each first data partition to a query cluster, so that the target file is used when a data query is performed on the data table to be loaded in the KeyValue database of the query cluster.

13. The terminal according to claim 12, wherein the query cluster has second data partitions, the Key value ranges corresponding to all of the second data partitions are identical to the Key value ranges corresponding to the first data partitions, and each second data partition is used to store the target file of the corresponding Key value range.

14. The terminal according to claim 12 or 13, wherein the request module is further configured to:

before the data load request is sent to the computing cluster, request the partition information of the data table to be loaded from the query cluster.
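The terminal-side ordering in claims 12 to 14 (obtain the partition information from the query cluster, then send it to the computing cluster in the data load request) can be sketched as follows. The two stub classes stand in for real clusters. Their method names, the request layout, and the table name in the usage note are assumptions made only for illustration.

```python
class QueryClusterStub:
    """Hypothetical stand-in for the query cluster's KeyValue database."""

    def __init__(self, start_keys_by_table):
        self._start_keys_by_table = start_keys_by_table

    def get_partition_info(self, table):
        # Return a copy so callers cannot mutate the stub's state.
        return list(self._start_keys_by_table[table])


class ComputeClusterStub:
    """Hypothetical stand-in for the computing cluster; it records received requests."""

    def __init__(self):
        self.requests = []

    def submit_load(self, request):
        self.requests.append(request)


def terminal_load(table, query_cluster, compute_cluster):
    """Request partition information first, then send the data load request."""
    partition_info = query_cluster.get_partition_info(table)
    request = {"table": table, "partition_info": partition_info}
    compute_cluster.submit_load(request)
    return request
```

Calling `terminal_load("orders", QueryClusterStub({"orders": ["", "h", "p"]}), ComputeClusterStub())` produces a request that already carries the partition layout of the data table to be loaded, which is what allows the computing cluster to create matching first data partitions.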