WO2018058998A1 - Data loading method, terminal, and computing cluster - Google Patents

Data loading method, terminal, and computing cluster

Info

Publication number
WO2018058998A1
WO2018058998A1 PCT/CN2017/087152 CN2017087152W WO2018058998A1 WO 2018058998 A1 WO2018058998 A1 WO 2018058998A1 CN 2017087152 W CN2017087152 W CN 2017087152W WO 2018058998 A1 WO2018058998 A1 WO 2018058998A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
partition
cluster
query
computing cluster
Prior art date
Application number
PCT/CN2017/087152
Other languages
English (en)
French (fr)
Inventor
房浩
毕杰山
莫凯
郭益君
钟超强
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2018058998A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • the embodiments of the present invention relate to the field of communications technologies, and in particular, to a data loading method, a terminal, and a computing cluster.
  • A distributed key-value (KeyValue) database can effectively reduce the number of disk reads and writes, has good read/write performance, and can provide users with a good data query service.
  • KeyValue databases often use MapReduce service components to load data in bulk. During bulk loading, a MapReduce task is executed to generate a target file that conforms to the file storage format defined by the KeyValue database; the target file is stored in the distributed file system and then loaded from the distributed file system into the KeyValue database.
  • For a schematic diagram of the structure of a cluster that has both a MapReduce service component and a KeyValue database, see FIG. 1.
  • Executing a MapReduce task requires reading a large amount of data and involves a large number of calculations such as sorting and partitioning, so the usage of resources such as the central processing unit (CPU), network input/output (I/O), and disk I/O across the entire cluster is very high.
  • The KeyValue database requires a low read/write delay, generally at the millisecond level.
  • When the MapReduce service component is used to load data in bulk for the KeyValue database, the process that executes the MapReduce task occupies more resources, so the resources available to the query service process of the KeyValue database are relatively reduced, which affects the read/write delay of the KeyValue database.
  • As a result, the data query performance of the KeyValue database is degraded and may fail to meet the user's business requirements.
  • the embodiment of the invention provides a data loading method, a terminal and a computing cluster, which can reduce the read/write delay of the KeyValue database and improve the query performance of the KeyValue database.
  • In a first aspect, an embodiment of the present invention provides a data loading method, which is applied to a computing cluster.
  • The method also involves a query cluster: the computing cluster is used for data loading, the query cluster is used for data query of the KeyValue database, and the computing cluster and the query cluster are different clusters.
  • The method includes: first, the computing cluster receives a data loading request that carries the partition information of the data table to be loaded. Second, the computing cluster determines the first data partitions according to the partition information; each partition indicated by the partition information is bound to one first data partition. Then, the computing cluster obtains the source data of each partition indicated by the partition information from the distributed file system and performs a mapping task on the source data of each partition separately.
  • Next, according to the binding relationship between the partitions indicated by the partition information and the first data partitions, the computing cluster writes the intermediate data obtained by each mapping task to the corresponding first data partition. Finally, the computing cluster performs a reduction task on the intermediate data in each first data partition and obtains one target file from each reduction task; the target file is used for data query on the data table to be loaded in the KeyValue database.
  • In this way, the MapReduce task is executed using the resources of the computing cluster, and a target file in the file storage format defined by the KeyValue database is generated for use in data query on the data table to be loaded in the KeyValue database of the query cluster.
  • The computing cluster and the query cluster that provides the query service for users are independent of each other. Therefore, even though executing the MapReduce task occupies a large amount of CPU, I/O, and other resources, these are resources of the computing cluster.
  • Executing the MapReduce task does not occupy resources of the query cluster, which keeps the load of the query cluster low, thereby reducing the read/write delay of the KeyValue database in the query cluster and improving its query performance.
  • the method further includes: the computing cluster sends the target file to the query cluster.
  • In a second aspect, an embodiment of the present invention provides a data loading method, which is applied to a terminal.
  • The method involves a computing cluster and a query cluster: the computing cluster is used for data loading, the query cluster is used for data query of the KeyValue database, and the computing cluster and the query cluster are different clusters.
  • the method includes: the terminal sends a data loading request to the computing cluster, where the data loading request carries the partition information of the data table to be loaded.
  • the data loading request instructs the computing cluster to determine the first data partition according to the partition information. All partitions indicated by the partition information are respectively bound to a first data partition.
  • the first data partition is configured to store intermediate data obtained by performing a mapping task on the source data in the partition bound by the first data partition, so as to perform a reduction task on the intermediate data in the first data partition to acquire the target file.
  • The query cluster and the computing cluster have their own distributed file systems, and the distributed file systems of the query cluster and the computing cluster are isolated from each other.
  • The terminal or the query cluster requests the computing cluster to send the target file corresponding to each first data partition to the query cluster, so that the target file is used when data query is performed on the data table to be loaded in the KeyValue database.
  • the query cluster shares the distributed file system with the computing cluster, and the query cluster obtains the target file from the distributed file system.
  • the target file can be saved in the distributed file system shared with the query cluster.
  • The query cluster can directly obtain the target file from the distributed file system and load it, so that the target file is used for data queries on the data table to be loaded in the KeyValue database.
  • Before the terminal sends the data loading request to the computing cluster, the method further includes: the terminal requests the partition information of the data table to be loaded from the query cluster.
  • the terminal can determine the first data partition based on the partition information acquired from the query cluster.
  • connection configuration set of the query cluster and the connection configuration set of the computing cluster are saved on the terminal.
  • the method further includes: the terminal sending the first connection establishment request to the query cluster according to the connection configuration set of the query cluster.
  • the method further includes: the terminal sending the second connection establishment request to the computing cluster according to the connection configuration set of the computing cluster.
  • the terminal can perform message interaction with the query cluster/computing cluster.
  • connection configuration set includes at least one of an IP address, a port, and security access configuration information.
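The connection configuration sets described above can be sketched as a small data structure. This is a minimal illustration only: the class names `ConnectionConfig` and `Terminal`, the field names, and the shape of the returned request are assumptions made for the example and are not defined by the source.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ConnectionConfig:
    """Hypothetical connection configuration set for one cluster."""
    ip_address: Optional[str] = None
    port: Optional[int] = None
    security: Dict[str, str] = field(default_factory=dict)

    def is_usable(self) -> bool:
        # The text requires at least one of the three items to be present.
        return bool(self.ip_address or self.port is not None or self.security)


@dataclass
class Terminal:
    """Hypothetical terminal that stores one configuration set per cluster."""
    query_cluster: ConnectionConfig
    computing_cluster: ConnectionConfig

    def connection_request(self, target: str) -> dict:
        # "query" yields the first connection establishment request,
        # "computing" the second one.
        configs = {"query": self.query_cluster, "computing": self.computing_cluster}
        config = configs[target]
        if not config.is_usable():
            raise ValueError(f"empty connection configuration set for {target}")
        return {
            "type": "connection_establishment",
            "target": target,
            "ip": config.ip_address,
            "port": config.port,
            "security": dict(config.security),
        }
```

Keeping the two configuration sets separate on the terminal mirrors the requirement that the query cluster and the computing cluster are different clusters.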
  • In a third aspect, an embodiment of the present invention provides a data loading method, which is applied to a query cluster. The method involves a computing cluster: the query cluster is used for data query of the KeyValue database, the computing cluster is used for data loading, and the computing cluster and the query cluster are different clusters.
  • The method includes: the query cluster receives the target file corresponding to each first data partition sent by the computing cluster. Then, the query cluster loads the target file corresponding to each first data partition into the KeyValue database, so as to use the target file when performing data query on the data table to be loaded in the KeyValue database.
  • The terminal or the query cluster may request the computing cluster to send the target file to the query cluster; after receiving the target file sent by the computing cluster, the query cluster may load the target file into the KeyValue database, so that the target file is used when data query is performed on the data table to be loaded in the KeyValue database.
  • The partitions indicated by the partition information each have a different Key value range; for a partition and a first data partition that have a binding relationship, the source data of the partition and the intermediate data of the first data partition have the same range of Key values.
  • the computing cluster can obtain the source data of each partition according to the Key value range, and distribute the intermediate data of the same Key range to the first data partition corresponding to each partition.
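The routing of intermediate data by Key value range can be sketched with a binary search over sorted split points. This is an illustrative sketch, not the method claimed in the source: the function names and the convention that `n` split points define `n + 1` half-open ranges are assumptions.

```python
from bisect import bisect_right
from typing import List, Tuple


def partition_index(key: str, split_points: List[str]) -> int:
    """Return the index of the Key value range that contains `key`.

    Assumption: `split_points` is sorted, and n split points define n + 1
    ranges: (-inf, s0), [s0, s1), ..., [s(n-1), +inf).
    """
    return bisect_right(split_points, key)


def distribute(pairs: List[Tuple[str, str]],
               split_points: List[str]) -> List[List[Tuple[str, str]]]:
    """Write each intermediate <Key, Value> pair to its first data partition."""
    partitions: List[List[Tuple[str, str]]] = [
        [] for _ in range(len(split_points) + 1)
    ]
    for key, value in pairs:
        partitions[partition_index(key, split_points)].append((key, value))
    return partitions
```

Because every pair in a first data partition falls inside a single Key value range, the file produced from that partition can be handed to exactly one data partition on the query side.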
  • The query cluster has a second data partition, the range of Key values corresponding to all the second data partitions is the same as the range of Key values corresponding to the first data partitions, and the second data partition is used to store the target file corresponding to its range of Key values.
  • The target file corresponding to each first data partition may be stored in a corresponding second data partition, where the range of Key values corresponding to the second data partition is the same as the range of Key values corresponding to the first data partition.
  • the partition indication information is used to indicate a correspondence between a Key value range corresponding to the data table to be loaded and M target second data partitions in the KeyValue database of the query cluster. And, the target first data partition and its corresponding target second data partition correspond to the same Key value range.
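The correspondence between first data partitions and second data partitions can be sketched as a lookup keyed by Key value range. The sketch below is an assumption-laden illustration: the `KeyRange` representation (a half-open `(start, end)` tuple with `None` for an unbounded side) and the function name `bind_partitions` are invented for the example.

```python
from typing import Dict, Optional, Tuple

# Assumed representation: [start, end), where None means unbounded.
KeyRange = Tuple[Optional[str], Optional[str]]


def bind_partitions(first: Dict[str, KeyRange],
                    second: Dict[str, KeyRange]) -> Dict[str, str]:
    """Pair each first data partition with the second data partition
    that covers exactly the same Key value range."""
    by_range = {key_range: name for name, key_range in second.items()}
    binding: Dict[str, str] = {}
    for name, key_range in first.items():
        if key_range not in by_range:
            raise ValueError(f"no second data partition covers {key_range}")
        binding[name] = by_range[key_range]
    return binding
```

Raising an error on a missing range reflects the requirement that a target first data partition and its target second data partition correspond to the same Key value range.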
  • an embodiment of the present invention provides a computing cluster, including: a receiving module, configured to receive a data loading request, where the data loading request carries the partition information of the data table to be loaded; and a determining module, configured to determine the first data partition according to the partition information, where the partitions indicated by the partition information are respectively bound to a first data partition.
  • an execution module, configured to separately obtain source data of each partition indicated by the partition information from the distributed file system, and perform a mapping task on the source data of each partition separately.
  • a writing module, configured to write the intermediate data obtained by performing each mapping task to the corresponding first data partition according to the binding relationship between the partition indicated by the partition information and the first data partition.
  • the execution module is further configured to perform a reduction task on the intermediate data in each of the first data partitions and obtain a target file from each reduction task, where the target file is used for data query on the data table to be loaded in the KeyValue database of the query cluster.
  • the partitioning information indicates that each of the partitions has a different Key value range.
  • the source data of the partition has the same range of Key values as the intermediate data of the first data partition.
  • the computing cluster further includes: a sending module, configured to send the target file to the query cluster.
  • the query cluster has a second data partition, and the range of Key values corresponding to all the second data partitions is the same as the range of Key values corresponding to the first data partition, and the second data partition is used to store the target file corresponding to the range of Key values.
  • an embodiment of the present invention provides a terminal, including: a sending module, configured to send a data loading request to a computing cluster.
  • the data loading request carries the partition information of the data table to be loaded.
  • the data load request instructs the computing cluster to determine the first data partition based on the partition information. All partitions indicated by the partition information are respectively bound to a first data partition.
  • the first data partition is configured to store intermediate data obtained by performing a mapping task on the source data in the partition bound by the first data partition, so as to perform a reduction task on the intermediate data in the first data partition to acquire the target file.
  • the requesting module is configured to request the computing cluster to send the target file corresponding to each first data partition to the query cluster, so as to use the target file when performing data query on the data table to be loaded in the KeyValue database of the query cluster.
  • the query cluster has a second data partition, and the range of Key values corresponding to all the second data partitions is the same as the range of Key values corresponding to the first data partition, and the second data partition is used to store the target file corresponding to the range of Key values.
  • the requesting module is further configured to: before the sending module sends the data loading request to the computing cluster, request the partition information of the data table to be loaded from the query cluster.
  • the embodiment of the present invention provides a query cluster, including: a receiving module, configured to receive the target file corresponding to each first data partition sent by the computing cluster.
  • the loading module is configured to load the target file corresponding to each first data partition into the KeyValue database, so as to use the target file when performing data query on the data table to be loaded in the KeyValue database.
  • the query cluster has a second data partition, and the range of Key values corresponding to all the second data partitions is the same as the range of Key values corresponding to the first data partition, and the second data partition is used to store the target file corresponding to the range of Key values.
  • an embodiment of the present invention provides a computing cluster, including a plurality of computing nodes, and one of the plurality of computing nodes performs the data loading method provided by the first aspect or any possible implementation manner of the first aspect. Or performing data interaction between at least two of the plurality of compute nodes to perform the data loading method provided by the first aspect or any of the possible implementations of the first aspect.
  • an embodiment of the present invention provides a computer storage medium for storing computer software instructions used by the computing cluster, including a program designed to perform the data loading method of the foregoing first aspect or any possible implementation manner of the first aspect.
  • an embodiment of the present invention provides a terminal, including at least one processor, a memory, and a communication interface, all connected by a bus; the memory is configured to store computer-executable instructions; the at least one processor is configured to execute the computer-executable instructions stored in the memory, such that the terminal performs data interaction with the computing cluster and/or the query cluster through the communication interface to perform the data loading method provided by the above embodiment.
  • the embodiment of the present invention provides a computer storage medium for storing computer software instructions used by the terminal, including a program designed to perform the data loading method of the foregoing second aspect or any possible implementation manner of the second aspect.
  • an embodiment of the present invention provides a query cluster, including multiple query nodes, and one of the plurality of query nodes performs the data loading method provided by the first aspect or any possible implementation manner of the first aspect. Or performing data interaction between at least two of the plurality of query nodes to perform the data loading method provided by the first aspect or any of the possible implementations of the first aspect.
  • an embodiment of the present invention provides a computer storage medium for storing computer software instructions used by the query cluster, including a program designed to perform the data loading method of the foregoing third aspect or any possible implementation manner of the third aspect.
  • an embodiment of the present invention provides a communication system, including the terminal, the computing cluster, and the query cluster described in the foregoing aspects.
  • FIG. 1 is a schematic structural diagram of a cluster in the prior art
  • FIG. 2 is a schematic structural diagram of a system according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of another system according to an embodiment of the present disclosure.
  • FIG. 4 is a flowchart of a data loading method according to an embodiment of the present invention.
  • FIG. 5 is a flowchart of another data loading method according to an embodiment of the present invention.
  • FIG. 6 is a flowchart of another data loading method according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a computing cluster according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic structural diagram of a terminal according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic structural diagram of a query cluster according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic structural diagram of a computing device according to an embodiment of the present invention.
  • the system architecture involved in bulk loading data for the KeyValue database by using the MapReduce service component can be as shown in FIG. 2, and the system can specifically include a cluster and multiple terminals.
  • the terminal is a device that can submit a request for bulk loading data tasks and query tasks to the cluster, such as a desktop computer, a notebook computer, an iPad, a smart phone, and the like.
  • the cluster may include a plurality of node devices, which may be computing devices with computing capabilities; the cluster is simultaneously deployed with a MapReduce service component and a KeyValue database, and a distributed file system.
  • the MapReduce service component has high-performance parallel computing capabilities that can load data in bulk for the KeyValue database.
  • the KeyValue database can provide query services for end users in response to read and write requests from the terminal.
  • the distributed file system provides high-reliability underlying storage support for the KeyValue database.
  • the cluster may be a Hadoop cluster, the distributed file system may be the Hadoop Distributed File System (HDFS), and the KeyValue database may be HBase.
  • the terminal may submit a data loading request to the cluster.
  • a management node in the MapReduce service component executes the MapReduce task.
  • the MapReduce task includes a Map task and a Reduce task, wherein the stage for executing the Map task may include a Shuffle phase, and the phase for performing the Reduce task may include a Sort phase.
  • The cluster executes the Map task to read the source data and parses the source data to obtain intermediate data in the form of <Key, Value> pairs; then, in the Shuffle phase, the <Key, Value> pairs obtained by the Map task are written into the data partitions (Partitions) according to their Keys.
  • The <Key, Value> pairs in a Partition may first be processed when the Reduce task is performed.
  • Each Reduce task corresponds to a data partition Partition, and each Partition corresponds to a data partition Region in the KeyValue database.
  • Each Reduce task generates a target file corresponding to the Partition.
  • the target file generated by the Reduce phase task is used by the query service of the KeyValue database, so the target file generated by the Reduce task satisfies the file storage format defined by the KeyValue database.
  • the cluster loads the target file generated by the Reduce task from the distributed file system into the KeyValue database for query use.
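The Map, Shuffle, Sort, and Reduce phases described above can be sketched in a single process. This is a toy illustration under stated assumptions, not Hadoop code: the `key,value` line format of the source data, the function names, and the use of sorted Region start keys to pick a Partition are all assumptions made for the example.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def map_task(source_lines: Iterable[str]) -> List[Pair]:
    """Map: parse each 'key,value' source line into a <Key, Value> pair."""
    pairs: List[Pair] = []
    for line in source_lines:
        key, value = line.strip().split(",", 1)
        pairs.append((key, value))
    return pairs


def shuffle(pairs: Iterable[Pair], region_starts: List[str]) -> Dict[int, List[Pair]]:
    """Shuffle: write each pair to the Partition whose Region covers its Key."""
    partitions: Dict[int, List[Pair]] = {
        i: [] for i in range(len(region_starts) + 1)
    }
    for key, value in pairs:
        partitions[bisect_right(region_starts, key)].append((key, value))
    return partitions


def reduce_task(partition: List[Pair]) -> List[Pair]:
    """Sort and Reduce: order the pairs by Key to form one target file's records."""
    return sorted(partition)


def run_job(source_lines: Iterable[str], region_starts: List[str]) -> Dict[int, List[Pair]]:
    """Run one Reduce task per Partition, yielding one record list per Region."""
    partitions = shuffle(map_task(source_lines), region_starts)
    return {index: reduce_task(pairs) for index, pairs in partitions.items()}
```

For example, `run_job(["m,3", "a,1", "z,9", "b,2"], ["k"])` groups the keys `a` and `b` into one Partition and `m` and `z` into the other, each in sorted order.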
  • In the process of performing the bulk data loading task, the MapReduce service component needs to read a large amount of data and perform a large number of calculations such as sorting and partitioning. This makes the load of the entire cluster heavy and the resource usage very high, which greatly affects the read/write delay of the KeyValue database in the cluster and reduces the query performance of the KeyValue database.
  • the embodiment of the present invention provides a data loading method, a terminal, and a computing system.
  • The MapReduce service component and the KeyValue database are deployed in different clusters to reduce the load of the cluster where the KeyValue database is located, thereby reducing the read/write delay of the KeyValue database and improving its query performance.
  • the MapReduce service component can obtain sufficient resources to perform MapReduce tasks and improve the execution efficiency of the MapReduce task.
  • The system architecture involved in the data loading method provided by the embodiment of the present invention may include a terminal and two different clusters, namely a query cluster and a computing cluster. Each cluster may include multiple node devices, and a node device may be a computing device with computing power.
  • The query cluster is deployed with a KeyValue database and a distributed file system and can provide query services for users.
  • the KeyValue database may specifically be Google Bigtable, Apache HBase or Apache Cassandra.
  • The computing cluster is deployed with a MapReduce service component and a distributed file system; it can store source data files and perform MapReduce tasks to load data in bulk for the KeyValue database.
  • The distributed file system in the query cluster and the distributed file system in the computing cluster may be two separate distributed file systems or the same distributed file system shared by the two clusters; this is not limited here.
  • an embodiment of the present invention provides a data loading method.
  • the method may include:
  • The terminal sends a data loading request to the computing cluster, where the data loading request carries the partition information of the data table to be loaded. The data loading request instructs the computing cluster to determine the first data partitions according to the partition information; all the partitions indicated by the partition information are respectively bound to a first data partition. The first data partition is configured to store intermediate data obtained by performing a mapping task on the source data in the partition bound to the first data partition, so that a reduction task can be performed on the intermediate data in the first data partition to obtain the target file.
  • The computing cluster is deployed with a MapReduce service component, and the terminal may send a data loading request to the computing cluster to request that a MapReduce task be performed using the resources in the computing cluster, so that data loading can be performed after the MapReduce task is executed.
  • the data loading request carries the partition information of the data table to be loaded, and the partition indication information is used to indicate at least one partition.
  • the data loading request may instruct the computing cluster to determine a first data partition bound to the partition indicated by the partition information according to the partition information therein.
  • the first data partition may be used to store intermediate data obtained after the mapping task is performed on the source data in the partition bound to the first data partition, so that the computing cluster may perform a reduction task on the intermediate data in the first data partition. Then get the target file.
  • the first data partition may be a data partition Partition in the computing cluster shown in FIG. 3.
  • After receiving the data loading request sent by the terminal, the computing cluster determines the first data partition according to the partition information.
  • the computing cluster may determine, according to the partition information, the first data partition bound to the partition indicated by the partition information.
  • the first data partition determined by the computing cluster is also three, which may be the first data partition A (Partition A) in the system architecture shown in FIG.
  • the computing cluster separately obtains source data of each partition indicated by the partition information from the distributed file system, and performs a mapping task on the source data of each partition separately.
  • After the computing cluster receives the data loading request, it performs the MapReduce task. Specifically, the computing cluster may first obtain the source data of each partition indicated by the partition information from the distributed file system and perform a Map task on the source data of each partition to obtain intermediate data in the form of <Key, Value> pairs.
  • the computing cluster writes, according to the binding relationship between the partition indicated by the partition information and the first data partition, the intermediate data obtained by performing each mapping task to the first data partition.
  • The stage in which the computing cluster performs the Map task may include a Shuffle phase. After the computing cluster performs the Map task to obtain the <Key, Value> pairs, it can, in the Shuffle phase and according to the Key, write the intermediate <Key, Value> pairs obtained by performing the Map task on the source data of each partition into the first data partition (Partition) corresponding to that partition.
  • The computing cluster performs a reduction task on the intermediate data in each of the first data partitions and obtains a target file from each reduction task; the target file is used for data query on the data table to be loaded in the KeyValue database.
  • The computing cluster may perform a Reduce task on the intermediate <Key, Value> pairs in each of the first data partitions, thereby obtaining the target file corresponding to the Reduce task of each first data partition.
  • The format of the target file follows the file storage format defined by the KeyValue database, so that the target file can be used for data query of the KeyValue database.
  • the target file can be in HFile format.
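The step of turning one first data partition into one target file can be sketched as follows. This is a simplified stand-in and not the HFile format: the sketch assumes a JSON Lines file with records sorted by Key, and the function names and the `.jsonl` file naming are hypothetical.

```python
import json
import os
import tempfile
from typing import List, Tuple

Pair = Tuple[str, str]


def write_target_file(pairs: List[Pair], output_dir: str, partition_name: str) -> str:
    """Reduce step sketch: sort one partition's pairs by Key and write one file.

    Assumption: a JSON Lines file stands in for the storage format defined
    by the KeyValue database; a real HFile has a different, binary layout.
    """
    path = os.path.join(output_dir, f"{partition_name}.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        for key, value in sorted(pairs):
            handle.write(json.dumps({"key": key, "value": value}) + "\n")
    return path


def read_target_file(path: str) -> List[Pair]:
    """Read the records back in the order in which they were stored."""
    with open(path, encoding="utf-8") as handle:
        return [(record["key"], record["value"]) for record in map(json.loads, handle)]


def demo() -> List[Pair]:
    """Write one target file into a temporary output path and read it back."""
    with tempfile.TemporaryDirectory() as output_dir:
        path = write_target_file([("b", "2"), ("a", "1")], output_dir, "partition_a")
        return read_target_file(path)
```

Writing records in Key order is the property the sketch tries to preserve, since the query side can then serve range lookups from the file without re-sorting it.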
  • The data loading method performs the MapReduce task using resources in the computing cluster and generates a target file in the same file storage format as that defined by the KeyValue database, for use in data query on the data table to be loaded in the KeyValue database of the query cluster.
  • The computing cluster and the query cluster that provides the query service for users are independent of each other. Therefore, even though executing the MapReduce task occupies a large amount of CPU, I/O, and other resources, these are resources of the computing cluster and not of the query cluster, which keeps the load of the query cluster low, reduces the read/write delay of the KeyValue database in the query cluster, and improves the query performance of the KeyValue database.
  • By deploying the MapReduce service component and the KeyValue database in different clusters, the data loading method provided by the embodiment of the present invention reduces the impact of the MapReduce task process on the query service process, thereby reducing the load of the cluster where the KeyValue database is located and improving the query performance of the KeyValue database.
  • The data loading request may also carry a source data file storage path and an output path.
  • the computing cluster may obtain the source data from the source data file storage path in step 103, and after generating the target files corresponding to each of the first data partitions in step 105, store the target files in the output path.
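The contents of a data loading request can be sketched as a plain record. The class name `DataLoadRequest`, its field names, and the `partition-<n>` naming of output files are assumptions made for illustration; the source only states that the request carries partition information and may carry a source data file storage path and an output path.

```python
import posixpath
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed representation of one partition: [start, end), None means unbounded.
KeyRange = Tuple[Optional[str], Optional[str]]


@dataclass
class DataLoadRequest:
    """Hypothetical shape of the data loading request sent by the terminal."""
    table: str
    partitions: List[KeyRange]
    source_path: Optional[str] = None
    output_path: Optional[str] = None


def planned_target_files(request: DataLoadRequest) -> List[str]:
    """One target file per first data partition, stored under the output path."""
    if request.output_path is None:
        raise ValueError("the request carries no output path")
    return [
        posixpath.join(request.output_path, f"partition-{index}")
        for index, _ in enumerate(request.partitions)
    ]
```

Since each partition in the request is bound to one first data partition, the number of planned target files equals the number of partitions carried in the request.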
  • the target file may be saved on the local disk, or the target file may be saved in the distributed file system.
  • the distributed file system of the computing cluster may be a distributed file system independent of the distributed file system of the query cluster, or may be the same distributed file system shared with the distributed file system of the query cluster.
  • the method may further include:
  • the terminal requests the computing cluster to send the target file corresponding to each first data partition to the query cluster, so as to use the target file when performing data query on the data table to be loaded in the KeyValue database.
  • the computing cluster sends the target file corresponding to each first data partition to the query cluster.
  • the query cluster receives the target file corresponding to each first data partition sent by the computing cluster.
  • the query cluster loads the target file corresponding to each first data partition into the KeyValue database, so as to use the target file when performing data query on the data table to be loaded in the KeyValue database.
  • the terminal may request the computing cluster to send, to the query cluster, the target file corresponding to each first data partition that is saved on the local disk or in the distributed file system of the computing cluster. After receiving the target file corresponding to each first data partition sent by the computing cluster, the query cluster may save it on a local disk or in the distributed file system used by the query cluster (a distributed file system independent of that of the computing cluster), so that the target files on the local disk or in the distributed file system can be loaded into the KeyValue database for use in data queries on the data table to be loaded of the KeyValue database.
  • the method may further include:
  • the query cluster obtains the target file from the distributed file system.
  • the query cluster can directly acquire the target file from the distributed file system shared by the computing cluster to perform data loading, so as to use the acquired target file when performing data query on the data table to be loaded in the KeyValue database.
  • the query cluster may obtain the target file corresponding to each first data partition from the distributed file system shared with the computing cluster according to the partition information.
  • the method may further include:
  • the terminal requests the partition information of the data table to be loaded from the query cluster.
  • the partition information of the data table to be loaded that the terminal requests from the query cluster is used to indicate at least one partition corresponding to the data table to be loaded.
  • the partition information may be represented in various forms, and the embodiment of the present invention does not limit its specific form.
  • in the KeyValue database, the data in the data table to be loaded corresponds to a range of Key values, the data table to be loaded may be divided into at least one partition according to the range of Key values, and the range of Key values of each partition indicated by the partition information is different.
  • the Key is a keyword, which may be a field, an attribute, or a feature in the data table to be loaded.
  • for example, the data table to be loaded in the KeyValue database is a "user information table" that includes four fields, "ID", "name", "telephone", and "address", and the range of the users' "ID" is 00000000-29999999.
  • the specific format of the "user information table" can be seen in Table 1 below:
  • if the Key is the "ID" field, the Key value range of the data table to be loaded is 00000000-29999999.
  • the data table to be loaded can be partitioned according to demarcation point Key values. For example, when the demarcation point Key values are 10000000 and 20000000, the data table to be loaded can be divided into three partitions: partition 1 corresponding to the Key value range 00000000-09999999, partition 2 corresponding to the Key value range 10000000-19999999, and partition 3 corresponding to the Key value range 20000000-29999999.
  • in this example, the partition indication information may be the Key range 00000000-29999999 corresponding to the data table to be loaded, together with the demarcation point Key values 10000000 and 20000000.
  • the partition indication information indicates that the data table to be loaded corresponds to three partitions, where the Key value range corresponding to partition 1 is 00000000-09999999, the Key value range corresponding to partition 2 is 10000000-19999999, and the Key value range corresponding to partition 3 is 20000000-29999999.
  • for a partition and a first data partition that have a binding relationship, the source data of the partition has the same Key value range as the intermediate data of the first data partition.
  • for example, the Key value range corresponding to partition 1 is 00000000-09999999, that of partition 2 is 10000000-19999999, and that of partition 3 is 20000000-29999999; the first data partition bound to partition 1 is Partition A, the first data partition bound to partition 2 is Partition B, and the first data partition bound to partition 3 is Partition C.
  • the source data of partition 1 and the intermediate data of Partition A correspond to the Key value range 00000000-09999999; the source data of partition 2 and the intermediate data of Partition B correspond to the Key value range 10000000-19999999; and the source data of partition 3 and the intermediate data of Partition C correspond to the Key value range 20000000-29999999.
  • the computing cluster can obtain the source data corresponding to partition 1, whose Key value range is 00000000-09999999, from the distributed file system; and, in step 104, the computing cluster can write the intermediate data obtained by executing the Map task whose Key values fall within 00000000-09999999 into Partition A.
  • the computing cluster can obtain the source data corresponding to partition 2, whose Key value range is 10000000-19999999, from the distributed file system; and, in step 104, the computing cluster can write the intermediate data obtained by executing the Map task whose Key values fall within 10000000-19999999 into Partition B.
  • the computing cluster can obtain the source data corresponding to partition 3, whose Key value range is 20000000-29999999, from the distributed file system; and, in step 104, the computing cluster can write the intermediate data obtained by executing the Map task whose Key values fall within 20000000-29999999 into Partition C.
  • the query cluster may further have second data partitions, where the Key value ranges corresponding to the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and each second data partition is used to store the target file for its Key value range.
  • the method may further include: the query cluster saves the target file corresponding to each first data partition to the second data partition that has the same Key value range as that first data partition, so that the target file in each second data partition is loaded into the KeyValue database for use in data queries on the data table to be loaded of the KeyValue database.
  • that is, the target file corresponding to each first data partition may be separately saved to the second data partition that has the same Key value range as that first data partition.
  • the second data partition in the query cluster may be a Region as shown in FIG.
  • Region1 has the same Key value range, 00000000-09999999, as Partition A, and can be used to store the target file for the Key value range 00000000-09999999.
  • Region2 has the same Key value range, 10000000-19999999, as Partition B, and can be used to store the target file for the Key value range 10000000-19999999.
  • Region3 has the same Key value range, 20000000-29999999, as Partition C, and can be used to store the target file for the Key value range 20000000-29999999.
  • the query cluster may store the target file corresponding to the Key value range 00000000-09999999 in the second data partition Region1, the target file corresponding to the Key value range 10000000-19999999 in the second data partition Region2, and the target file corresponding to the Key value range 20000000-29999999 in the second data partition Region3, so that the target files are used when data queries are performed on the data table to be loaded of the KeyValue database.
  • identification information of the data table to be loaded may also be sent to the query cluster, so that the query cluster can determine the data table to be loaded and its partition information according to the identification information.
  • the identifier information of the data table to be loaded is used to indicate the data table to be loaded, for example, the table name, the number, and the like of the data table to be loaded, and is not specifically limited herein.
  • connection configuration set of the query cluster and the connection configuration set of the computing cluster can also be saved on the terminal.
  • the connection configuration set is used to save configuration information required for the terminal to establish a connection with the query cluster/computing cluster.
  • the specific content of the connection configuration set is not specifically limited in the embodiment of the present invention.
  • the connection configuration set may include at least one of an Internet Protocol (IP) address, a port, and secure access configuration information.
  • the IP address of the connection configuration set may be the IP address of the management node in the query cluster/computing cluster, or may include the IP address of all the nodes in the query cluster/computing cluster; the port in the connection configuration set may be the port providing the related service.
  • the method may further include:
  • the terminal sends a first connection establishment request to the query cluster according to the connection configuration set of the query cluster.
  • the method may further include:
  • the terminal sends a second connection establishment request to the computing cluster according to the connection configuration set of the computing cluster.
  • the method may further include:
  • the terminal starts a data loading task.
  • the data loading task refers to a task of loading data into the KeyValue database in the query cluster.
  • the terminal may start the batch data loading task upon receiving a trigger instruction input by the user, automatically after the terminal is powered on, or periodically; this is not specifically limited here.
  • the embodiment of the present invention provides a computing cluster 700.
  • the computing cluster 700 can include a receiving module 701, a determining module 702, an executing module 703, a writing module 704, and a sending module 705.
  • the computing cluster 700 can include a plurality of computing nodes, which can be computing devices with computing capabilities; and at least one computing node in the computing cluster 700 is configured to deploy functions of the modules of the computing cluster 700.
  • the receiving module 701 can be configured to receive a data loading request, where the data loading request carries the partition information of the data table to be loaded.
  • the determining module 702 can be configured to determine the first data partitions according to the partition information, where each partition indicated by the partition information is bound to one first data partition; the executing module 703 can be configured to obtain the source data of each partition indicated by the partition information from the distributed file system and execute a mapping task on the source data of each partition; the writing module 704 can be configured to write the intermediate data obtained by each mapping task into the corresponding first data partition according to the binding relationship between the partitions indicated by the partition information and the first data partitions; the executing module 703 can also be configured to execute a reduction task on the intermediate data in each first data partition to obtain a target file for each reduction task, where the target files are used for data queries on the loaded data table of the KeyValue database of the query cluster.
  • the transmitting module 705 can be used to perform step 107 in FIG.
  • the computing cluster 700 in FIG. 7 can be used to perform any of the foregoing method flows, and the embodiments of the present invention are not described in detail herein.
  • the embodiment of the present invention further provides a computer storage medium for storing the computer software instructions used by the computing cluster shown in FIG. 7 above, which includes a program designed to execute the foregoing method embodiments. Data loading can be implemented by executing the stored program.
  • the embodiment of the present invention provides a terminal 800.
  • the terminal 800 may include a sending module 801 and a requesting module 802.
  • the sending module 801 is configured to send a data loading request to the computing cluster, where the data loading request carries the partition information of the data table to be loaded and instructs the computing cluster to determine the first data partitions according to the partition information; each partition indicated by the partition information is bound to one first data partition, and the first data partition is configured to store the intermediate data obtained by executing a mapping task on the source data in the partition bound to that first data partition, so that a reduction task can be executed on the intermediate data in the first data partition to obtain the target file.
  • the requesting module 802 can be configured to request the computing cluster to send the target file corresponding to each first data partition to the query cluster, so that the target files are used when data queries are performed on the data table to be loaded of the KeyValue database of the query cluster.
  • request module 802 can be used to perform step 111 in FIG.
  • the terminal 800 in FIG. 8 can be used to perform any of the foregoing method flows, and the embodiments of the present invention are not described in detail herein.
  • the embodiment of the present invention further provides a computer storage medium for storing the computer software instructions used by the terminal shown in FIG. 8 above, which includes a program designed to execute the foregoing method embodiment. Data loading can be achieved by executing the stored program.
  • the query cluster 900 may include a receiving module 901 and a loading module 902.
  • the query cluster 900 may include multiple query nodes, which may be computing devices with computing capabilities; at least one query node in the query cluster 900 is used to deploy the functions of the modules of the query cluster 900.
  • the receiving module 901 is configured to receive the target file corresponding to each first data partition sent by the computing cluster.
  • the loading module 902 can be configured to load the target file corresponding to each first data partition into the KeyValue database, so as to use the target file when performing data query on the data table to be loaded of the KeyValue database.
  • the query cluster 900 in FIG. 9 can be used to perform any of the foregoing method flows, and the embodiments of the present invention are not described in detail herein.
  • the embodiment of the present invention further provides a computer storage medium for storing the computer software instructions used for the query cluster shown in FIG. 9 above, which includes a program designed to execute the foregoing method embodiment. Data loading can be achieved by executing the stored program.
  • an embodiment of the present invention further provides a computing device 1000.
  • the computing device 1000 includes at least one processor 1001, a memory 1002, and a communication interface 1003.
  • the at least one processor 1001, the memory 1002, and the communication interface 1003 are connected by a bus 1004;
  • the memory 1002 is configured to store computer-executable instructions;
  • the at least one processor 1001 is configured to execute the computer-executable instructions stored in the memory 1002, so that the computing device 1000 performs data interaction, through the communication interface 1003, with other devices having computing capabilities (for example, a query node in a query cluster, a terminal, or a computing node in a computing cluster) to execute the data loading method provided in the foregoing embodiment.
  • the computing node included in the computing cluster provided by the embodiment of the present invention is a computing device 1000, and the computing device 1000 of the computing cluster performs data interaction, through the communication interface 1003, with other computing devices 1000 in the computing cluster, the terminal, and the query nodes of the query cluster to execute the data loading method provided by the foregoing embodiment.
  • the query node included in the query cluster provided by the embodiment of the present invention is a computing device 1000, and the computing device 1000 of the query cluster performs data interaction, through the communication interface 1003, with other computing devices 1000 in the query cluster, the terminal, and the computing nodes of the computing cluster to execute the data loading method provided by the foregoing embodiment.
  • the terminal provided by the embodiment of the present invention is a computing device 1000, and the computing device 1000 performs data interaction with the computing nodes of the computing cluster and the query nodes of the query cluster through the communication interface 1003 to execute the data loading method provided by the foregoing embodiment.
  • the at least one processor 1001 may include processors of different types or of the same type; each processor 1001 may be any one of the following devices with computational processing capability: a central processing unit (CPU), an ARM processor, a field programmable gate array (FPGA), a dedicated processor, and the like. In an optional implementation, the at least one processor 1001 may also be integrated into a many-core processor.
  • the memory 1002 may be any one or any combination of the following storage media: a random access memory (RAM), a read-only memory (ROM), a non-volatile memory (NVM), a solid state drive (SSD), a mechanical hard disk, a magnetic disk, a disk array, and the like.
  • the communication interface 1003 is used by the computing device 1000 to perform data interaction with other devices having computing capabilities or storage capabilities.
  • the communication interface 1003 may be any one or any combination of devices having a network access function, such as a network interface (for example, an Ethernet interface) or a wireless network card.
  • the bus 1004 can include an address bus, a data bus, a control bus, and the like; for ease of representation, FIG. 10 shows the bus with a thick line.
  • the bus 1004 may be any one or any combination of the following devices for wired data transmission: an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, and the like.
  • referring to FIG. 3, another embodiment of the present invention provides a communication system, which may include a terminal, a computing cluster, and a query cluster.
  • the terminal, the computing cluster and the query cluster in the communication system can execute the data loading method in the foregoing method embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

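A data loading method, a terminal, and a computing cluster, relating to the field of communications technologies, which can reduce the read/write latency of a KeyValue database and improve its query performance. The specific solution is as follows: a computing cluster receives a data loading request that carries partition information of a data table to be loaded; determines first data partitions according to the partition information, where each partition indicated by the partition information is bound to one first data partition; obtains the source data of each partition indicated by the partition information and executes a mapping task on the source data of each partition; writes the intermediate data obtained by each mapping task into the corresponding first data partition according to the binding relationship between the partitions indicated by the partition information and the first data partitions; and executes a reduction task on the intermediate data in each first data partition to obtain a target file for each reduction task, where the target files are used for data queries on the loaded data table of the KeyValue database.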

Description

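Data Loading Method, Terminal, and Computing Cluster
Technical Field
Embodiments of the present invention relate to the field of communications technologies, and in particular, to a data loading method, a terminal, and a computing cluster.
Background
A distributed key-value (KeyValue) database can effectively reduce the number of disk reads and writes, offers better read/write performance, and can provide users with a better data query service. KeyValue databases often use a MapReduce service component to load data in batches. During batch data loading, MapReduce tasks are executed to generate target files that conform to the file storage format defined by the KeyValue database; the target files are stored in a distributed file system and then loaded from the distributed file system into the KeyValue database.
For a schematic structural diagram of a cluster in which both the MapReduce service component and the KeyValue database are deployed, refer to FIG. 1. In the cluster shown in FIG. 1, executing MapReduce tasks requires reading a large amount of data and involves a large amount of computation such as sorting and partitioning, so the usage of resources such as the central processing units (CPUs), network input/output (I/O) ports, and disk I/O ports of the entire cluster is very high. A KeyValue database has strict read/write latency requirements, generally at the millisecond level. However, when the MapReduce service component is used to batch-load data for the KeyValue database, the processes that execute the MapReduce tasks occupy a large share of resources, which reduces the resources available to the query service processes of the KeyValue database. This increases the read/write latency of the KeyValue database and degrades its data query performance, so that users' service requirements cannot be met.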
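Summary
Embodiments of the present invention provide a data loading method, a terminal, and a computing cluster, which can reduce the read/write latency of a KeyValue database and improve the query performance of the KeyValue database.
To achieve the foregoing objective, the embodiments of the present invention adopt the following technical solutions:
According to a first aspect, an embodiment of the present invention provides a data loading method applied to a computing cluster. The method also involves a query cluster: the computing cluster is used for data loading, the query cluster is used for data queries on a KeyValue database, and the computing cluster and the query cluster are different clusters. The method includes the following. First, the computing cluster receives a data loading request that carries partition information of a data table to be loaded. Second, the computing cluster determines first data partitions according to the partition information, where each partition indicated by the partition information is bound to one first data partition. Then, the computing cluster obtains the source data of each partition indicated by the partition information from a distributed file system and executes a mapping task on the source data of each partition. Next, according to the binding relationship between the partitions indicated by the partition information and the first data partitions, the computing cluster writes the intermediate data obtained by each mapping task into the corresponding first data partition. Finally, the computing cluster executes a reduction task on the intermediate data in each first data partition to obtain a target file for each reduction task, where the target files are used for data queries on the loaded data table of the KeyValue database.
In this way, MapReduce tasks can be executed with the resources of the computing cluster to generate target files in the file storage format defined by the KeyValue database, for use in data queries on the loaded data table of the KeyValue database in the query cluster. Because the MapReduce tasks are executed by the computing cluster, which is independent of the query cluster that provides the query service to users, the large amount of CPU, I/O, and other resources occupied while executing the MapReduce tasks belongs to the computing cluster. Executing the MapReduce tasks does not occupy the corresponding resources of the query cluster, so the load on the query cluster stays low, which reduces the read/write latency of the KeyValue database in the query cluster and improves its query performance.
In a possible implementation of the first aspect, the method further includes: the computing cluster sends the target files to the query cluster.
According to a second aspect, an embodiment of the present invention provides a data loading method applied to a terminal. The method involves a computing cluster and a query cluster: the computing cluster is used for data loading, the query cluster is used for data queries on a KeyValue database, and the computing cluster and the query cluster are different clusters. The method includes: the terminal sends a data loading request to the computing cluster, where the data loading request carries partition information of a data table to be loaded. The data loading request instructs the computing cluster to determine first data partitions according to the partition information. Each partition indicated by the partition information is bound to one first data partition. A first data partition is used to store the intermediate data obtained by executing a mapping task on the source data in the partition bound to that first data partition, so that a reduction task can be executed on the intermediate data in the first data partition to obtain a target file.
In a possible implementation of the second aspect, the query cluster and the computing cluster each have their own distributed file system, and the two distributed file systems are isolated from each other. In this case, the terminal or the query cluster needs to request the computing cluster to send the target file corresponding to each first data partition to the query cluster, so that the target files are used when data queries are performed on the data table to be loaded of the KeyValue database.
In a possible implementation of the second aspect, the query cluster and the computing cluster share a distributed file system, and the query cluster obtains the target files from the distributed file system.
After generating the target files, the computing cluster may save them in the distributed file system shared with the query cluster, and the query cluster can obtain the target files directly from the distributed file system and load them, so that the target files are used when data queries are performed on the data table to be loaded of the KeyValue database.
In a possible implementation of the second aspect, before the terminal sends the data loading request to the computing cluster, the method further includes: the terminal requests the partition information of the data table to be loaded from the query cluster.
In this way, the terminal can determine the first data partitions according to the partition information obtained from the query cluster.
In a possible implementation of the second aspect, the terminal stores a connection configuration set of the query cluster and a connection configuration set of the computing cluster. Before the terminal requests the partition information of the data table to be loaded from the query cluster, the method further includes: the terminal sends a first connection establishment request to the query cluster according to the connection configuration set of the query cluster. Before the terminal sends the data loading request to the computing cluster, the method further includes: the terminal sends a second connection establishment request to the computing cluster according to the connection configuration set of the computing cluster.
In this way, after establishing a connection with the query cluster or the computing cluster, the terminal can exchange messages with that cluster.
In a possible implementation of the second aspect, the connection configuration set includes at least one of an IP address, a port, and secure access configuration information.
According to a third aspect, an embodiment of the present invention provides a data loading method applied to a query cluster. The method also involves a computing cluster: the query cluster is used for data queries on a KeyValue database, the computing cluster is used for data loading, and the computing cluster and the query cluster are different clusters. The method includes: the query cluster receives the target file corresponding to each first data partition sent by the computing cluster. The query cluster then loads the target file corresponding to each first data partition into the KeyValue database, so that the target files are used when data queries are performed on the data table to be loaded of the KeyValue database.
In this way, after the computing cluster generates the target files, the terminal or the query cluster can request the computing cluster to send the target files to the query cluster; after receiving the target files sent by the computing cluster, the query cluster can load them into the KeyValue database, so that the target files are used when data queries are performed on the data table to be loaded of the KeyValue database.
With reference to any of the foregoing aspects, in a possible implementation, the Key value range of each partition indicated by the partition information is different; for a partition and a first data partition that have a binding relationship, the source data of the partition and the intermediate data of the first data partition have the same Key value range.
In this way, when executing the map and reduce tasks, the computing cluster can obtain the source data of each partition according to its Key value range and distribute the intermediate data with the same Key range to the first data partition corresponding to each partition.
With reference to any of the foregoing aspects, in a possible implementation, the query cluster has second data partitions, the Key value ranges corresponding to the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and each second data partition is used to store the target file for its Key value range.
In this way, after the query cluster receives the target files sent by the computing cluster, or obtains the target files from the distributed file system shared with the computing cluster, it can store the target file corresponding to each first data partition in one corresponding second data partition, where the Key value range corresponding to the second data partition is the same as the Key value range corresponding to the first data partition.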
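The pairing of first and second data partitions by Key value range can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `match_second_partitions`, the dictionary layout, and the use of integer range bounds are all assumptions made for the example, and the names Partition A/B/C and Region1/2/3 follow the example used elsewhere in this document.

```python
from typing import Dict, Tuple

KeyRange = Tuple[int, int]  # inclusive (start, end) of a Key value range


def match_second_partitions(first: Dict[str, KeyRange],
                            second: Dict[str, KeyRange]) -> Dict[str, str]:
    """Map each first data partition to the second data partition
    that covers exactly the same Key value range (hypothetical helper)."""
    by_range = {key_range: name for name, key_range in second.items()}
    mapping = {}
    for name, key_range in first.items():
        if key_range not in by_range:
            raise KeyError(f"no second data partition for range {key_range}")
        mapping[name] = by_range[key_range]
    return mapping


first_partitions = {
    "Partition A": (0, 9_999_999),
    "Partition B": (10_000_000, 19_999_999),
    "Partition C": (20_000_000, 29_999_999),
}
second_partitions = {
    "Region1": (0, 9_999_999),
    "Region2": (10_000_000, 19_999_999),
    "Region3": (20_000_000, 29_999_999),
}
placement = match_second_partitions(first_partitions, second_partitions)
```

Matching on the exact range, rather than on position in a list, keeps the sketch independent of the order in which either cluster enumerates its partitions.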
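With reference to any of the foregoing aspects, in a possible implementation, the partition indication information is used to indicate, in the KeyValue database of the query cluster, the correspondence between the Key value range of the data table to be loaded and M target second data partitions. In addition, a target first data partition and its corresponding target second data partition correspond to the same Key value range.
According to a fourth aspect, an embodiment of the present invention provides a computing cluster, including: a receiving module, configured to receive a data loading request, where the data loading request carries partition information of a data table to be loaded; a determining module, configured to determine first data partitions according to the partition information, where each partition indicated by the partition information is bound to one first data partition; an executing module, configured to obtain the source data of each partition indicated by the partition information from a distributed file system and execute a mapping task on the source data of each partition; and a writing module, configured to write the intermediate data obtained by each mapping task into the corresponding first data partition according to the binding relationship between the partitions indicated by the partition information and the first data partitions. The executing module is further configured to execute a reduction task on the intermediate data in each first data partition to obtain a target file for each reduction task, where the target files are used for data queries on the loaded data table of the KeyValue database of the query cluster.
In a possible implementation of the fourth aspect, the Key value range of each partition indicated by the partition information is different. For a partition and a first data partition that have a binding relationship, the source data of the partition and the intermediate data of the first data partition have the same Key value range.
In a possible implementation of the fourth aspect, the computing cluster further includes: a sending module, configured to send the target files to the query cluster.
In a possible implementation of the fourth aspect, the query cluster has second data partitions, the Key value ranges corresponding to the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and each second data partition is used to store the target file for its Key value range.
According to a fifth aspect, an embodiment of the present invention provides a terminal, including: a sending module, configured to send a data loading request to a computing cluster. The data loading request carries partition information of a data table to be loaded. The data loading request instructs the computing cluster to determine first data partitions according to the partition information. Each partition indicated by the partition information is bound to one first data partition. A first data partition is used to store the intermediate data obtained by executing a mapping task on the source data in the partition bound to that first data partition, so that a reduction task can be executed on the intermediate data in the first data partition to obtain a target file. The terminal further includes a requesting module, configured to request the computing cluster to send the target file corresponding to each first data partition to the query cluster, so that the target files are used when data queries are performed on the data table to be loaded of the KeyValue database of the query cluster.
In a possible implementation of the fifth aspect, the query cluster has second data partitions, the Key value ranges corresponding to the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and each second data partition is used to store the target file for its Key value range.
In a possible implementation of the fifth aspect, the requesting module is further configured to: before the data loading request is sent to the computing cluster, request the partition information of the data table to be loaded from the query cluster.
According to a sixth aspect, an embodiment of the present invention provides a query cluster, including: a receiving module, configured to receive the target file corresponding to each first data partition sent by a computing cluster; and a loading module, configured to load the target file corresponding to each first data partition into the KeyValue database, so that the target files are used when data queries are performed on the data table to be loaded of the KeyValue database.
In a possible implementation of the sixth aspect, the query cluster has second data partitions, the Key value ranges corresponding to the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and each second data partition is used to store the target file for its Key value range.
According to another aspect, an embodiment of the present invention provides a computing cluster, including multiple computing nodes, where one of the computing nodes executes the data loading method provided in the first aspect or any possible implementation of the first aspect, or at least two of the computing nodes interact with each other to execute the data loading method provided in the first aspect or any possible implementation of the first aspect.
According to still another aspect, an embodiment of the present invention provides a computer storage medium, configured to store computer software instructions used by the foregoing computing cluster, including a program designed to execute the data loading method provided in the first aspect or any possible implementation of the first aspect.
According to another aspect, an embodiment of the present invention provides a terminal, including at least one processor, a memory, and a communication interface. The at least one processor, the memory, and the communication interface are connected by a bus. The memory is configured to store computer-executable instructions. The at least one processor is configured to execute the computer-executable instructions stored in the memory, so that the terminal performs data interaction with the computing cluster and/or the query cluster through the communication interface to execute the data loading method provided in the foregoing embodiments.
According to still another aspect, an embodiment of the present invention provides a computer storage medium, configured to store computer software instructions used by the foregoing terminal, including a program designed to execute the data loading method provided in the second aspect or any possible implementation of the second aspect.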
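According to another aspect, an embodiment of the present invention provides a query cluster, including multiple query nodes, where one of the query nodes executes the data loading method provided in the third aspect or any possible implementation of the third aspect, or at least two of the query nodes interact with each other to execute the data loading method provided in the third aspect or any possible implementation of the third aspect.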
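According to still another aspect, an embodiment of the present invention provides a computer storage medium, configured to store computer software instructions used by the foregoing query cluster, including a program designed to execute the data loading method provided in the third aspect or any possible implementation of the third aspect.
According to still another aspect, an embodiment of the present invention provides a communication system, including the terminal, the computing cluster, and the query cluster described in the foregoing aspects.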
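Brief Description of Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative efforts.
FIG. 1 is a schematic structural diagram of a cluster in the prior art;
FIG. 2 is a schematic diagram of a system architecture according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another system architecture according to an embodiment of the present invention;
FIG. 4 is a flowchart of a data loading method according to an embodiment of the present invention;
FIG. 5 is a flowchart of another data loading method according to an embodiment of the present invention;
FIG. 6 is a flowchart of another data loading method according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a computing cluster according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a terminal according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a query cluster according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a computing device according to an embodiment of the present invention.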
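Description of Embodiments
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are merely some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.
The system architecture involved in batch-loading data for a KeyValue database with the MapReduce service component may be as shown in FIG. 2. The system may include a cluster and multiple terminals. A terminal is a device that can submit requests such as batch data loading tasks and query tasks to the cluster, for example, a desktop computer, a laptop computer, an iPad, or a smartphone. The cluster may include multiple node devices, each of which may be a computing device with computing capability. Both the MapReduce service component and the KeyValue database are deployed in the cluster, along with a distributed file system. The MapReduce service component has high-performance parallel computing capability and can batch-load data for the KeyValue database. The KeyValue database can respond to read and write requests from terminals and provide the query service to terminal users. The distributed file system can provide highly reliable underlying storage for the KeyValue database. For example, the cluster may be a Hadoop cluster, the distributed file system may be the Hadoop Distributed File System (HDFS), and the KeyValue database may be HBase.
When the MapReduce service component is used to batch-load data for the KeyValue database, a terminal may submit a data loading request to the cluster, and after a management node of the MapReduce service component receives the data loading request, MapReduce tasks are executed. Specifically, the MapReduce tasks include Map tasks and Reduce tasks, where the stage of executing a Map task may include a Shuffle stage, and the stage of executing a Reduce task may include a Sort stage. The cluster executes Map tasks to read the source data and parse it into intermediate data in the form of <Key, Value> pairs; in the Shuffle stage, the <Key, Value> pairs parsed by the Map tasks are written into data partitions (Partitions) according to their Keys, so that the Reduce tasks can obtain data from the Partitions. Optionally, when a Reduce task is executed, the <Key, Value> pairs in the Partition may first be sorted. Each Reduce task corresponds to one data partition (Partition), and each Partition corresponds to one data partition (Region) in the KeyValue database. Each Reduce task generates the target file for its Partition. The target files generated by the Reduce tasks are used by the query service of the KeyValue database, so they conform to the file storage format defined by the KeyValue database. The cluster then loads the target files generated by the Reduce tasks from the distributed file system into the KeyValue database for use in queries.
In the system architecture shown in FIG. 2, the processes in which the MapReduce service component executes MapReduce tasks and the processes of the query service of the KeyValue database are located in the same cluster. While executing a batch data loading task, the MapReduce service component needs to read a large amount of data and perform a large amount of computation such as sorting and partitioning, so the load on the entire cluster is heavy and its resource usage is very high, which greatly increases the read/write latency of the KeyValue database in the cluster and degrades its query performance. To address this problem, embodiments of the present invention provide a data loading method, a terminal, and a computing cluster. By deploying the MapReduce service component and the KeyValue database in different clusters, the load on the cluster where the KeyValue database is located is reduced, the read/write latency of the KeyValue database is reduced, and the query performance of the KeyValue database is improved; at the same time, the MapReduce service component can obtain sufficient resources to execute the MapReduce tasks, which improves the execution efficiency of the MapReduce tasks.
As shown in FIG. 3, the system architecture involved in the data loading method provided by the embodiments of the present invention may include two different clusters, a query cluster and a computing cluster, as well as a terminal. Each cluster may include multiple node devices, each of which may be a computing device with computing capability. The query cluster is deployed with a KeyValue database and a distributed file system and can provide the query service to users. For example, the KeyValue database may be Google Bigtable, Apache HBase, Apache Cassandra, or the like. The computing cluster is deployed with the MapReduce service component and a distributed file system; it can store the source data files and execute MapReduce tasks to batch-load data for the KeyValue database. The distributed file system in the query cluster and the distributed file system in the computing cluster may be two independent distributed file systems, or may be one distributed file system shared by the two clusters, which is not specifically limited here.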
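Based on the system architecture shown in FIG. 3, an embodiment of the present invention provides a data loading method. Referring to FIG. 4, the method may include:
101. The terminal sends a data loading request to the computing cluster, where the data loading request carries partition information of a data table to be loaded and instructs the computing cluster to determine first data partitions according to the partition information; each partition indicated by the partition information is bound to one first data partition, and a first data partition is used to store the intermediate data obtained by executing a mapping task on the source data in the partition bound to that first data partition, so that a reduction task can be executed on the intermediate data in the first data partition to obtain a target file.
In the system architecture shown in FIG. 3, the computing cluster is deployed with the MapReduce service component, and the terminal may send a data loading request to the computing cluster to request that the MapReduce tasks be executed with the resources of the computing cluster, so that data loading is performed after the MapReduce tasks are completed.
The data loading request carries the partition information of the data table to be loaded, and the partition indication information is used to indicate at least one partition. The data loading request may instruct the computing cluster to determine, according to the partition information, the first data partitions that are bound one-to-one to the partitions indicated by the partition information. A first data partition may be used to store the intermediate data obtained by executing a mapping task on the source data in the partition bound to that first data partition, so that the computing cluster can execute a reduction task on the intermediate data in the first data partition and thereby obtain a target file. Specifically, a first data partition may be a data partition (Partition) in the computing cluster shown in FIG. 3.
102. After receiving the data loading request sent by the terminal, the computing cluster determines the first data partitions according to the partition information.
After receiving the data loading request that is sent by the terminal and carries the partition indication information, the computing cluster may determine, according to the partition information, the first data partitions that are bound one-to-one to the partitions indicated by the partition information.
For example, when the partition indication information indicates three partitions, the computing cluster also determines three first data partitions, which may be first data partition A (Partition A), first data partition B (Partition B), and first data partition C (Partition C) in the system architecture shown in FIG. 3.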
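The one-to-one binding in step 102 can be pictured with a very small sketch. It is only an illustration under stated assumptions: the function name `bind_first_data_partitions`, the integer partition identifiers, and the lettered naming scheme are hypothetical and are chosen to mirror the Partition A/B/C example; the source does not specify how the computing cluster names or stores the binding.

```python
from string import ascii_uppercase
from typing import Dict, List


def bind_first_data_partitions(partition_ids: List[int]) -> Dict[int, str]:
    """Bind every partition indicated by the partition information to exactly
    one first data partition (hypothetical naming: Partition A, B, C, ...)."""
    if len(set(partition_ids)) != len(partition_ids):
        raise ValueError("partition identifiers must be unique")
    if len(partition_ids) > len(ascii_uppercase):
        raise ValueError("this sketch only names up to 26 partitions")
    return {pid: f"Partition {ascii_uppercase[i]}"
            for i, pid in enumerate(partition_ids)}


binding = bind_first_data_partitions([1, 2, 3])
```

Because the binding is a plain dictionary with unique keys and unique values, each indicated partition has exactly one first data partition and no first data partition is shared.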
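103. The computing cluster obtains the source data of each partition indicated by the partition information from the distributed file system and executes a mapping task on the source data of each partition.
After receiving the data loading request, the computing cluster executes the MapReduce tasks. Specifically, the computing cluster may first obtain the source data of each partition indicated by the partition information from the distributed file system and execute a mapping (Map) task on the source data of each partition to obtain intermediate data in the form of <Key, Value> pairs.
104. According to the binding relationship between the partitions indicated by the partition information and the first data partitions, the computing cluster writes the intermediate data obtained by each mapping task into the corresponding first data partition.
The stage in which the computing cluster executes the Map tasks may include a Shuffle stage. After the Map tasks parse the source data into <Key, Value> pairs, the computing cluster may, in the Shuffle stage, write the <Key, Value> pairs obtained from the source data of each partition into the first data partition (Partition) corresponding to that partition according to their Keys.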
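Steps 103 and 104 can be sketched in memory as a parse step followed by routing by Key. The sketch below is not how a MapReduce framework implements Shuffle; it only illustrates the routing rule. The comma-separated record layout, the constant names `SPLIT_POINTS` and `FIRST_DATA_PARTITIONS`, and the two function names are assumptions made for the example; the split points are taken from the three-partition example used in this document.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed demarcation-point Key values and the bound first data partitions.
SPLIT_POINTS = [10_000_000, 20_000_000]
FIRST_DATA_PARTITIONS = ["Partition A", "Partition B", "Partition C"]


def map_record(line: str) -> Tuple[str, str]:
    """Map step: parse one source record into a <Key, Value> pair.
    Assumes a hypothetical 'key,rest-of-record' text layout."""
    key, _, value = line.partition(",")
    return key, value


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Shuffle step: write each pair into the first data partition
    whose Key value range contains the pair's Key."""
    buckets: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for key, value in pairs:
        index = bisect_right(SPLIT_POINTS, int(key))
        buckets[FIRST_DATA_PARTITIONS[index]].append((key, value))
    return dict(buckets)
```

`bisect_right` places a Key equal to a split point in the partition that starts at that split point, so Key 10000000 lands in Partition B, consistent with the ranges 00000000-09999999 and 10000000-19999999.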
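105. The computing cluster executes a reduction task on the intermediate data in each first data partition to obtain a target file for each reduction task, where the target files are used for data queries on the loaded data table of the KeyValue database.
After the intermediate data obtained by each Map task is written into the corresponding first data partition, the computing cluster may execute a reduction (Reduce) task on the intermediate <Key, Value> pairs in each first data partition, thereby obtaining the target file of the Reduce task for each first data partition.
The format of the target file follows the file storage format defined by the KeyValue database, so that the target file can be used for data queries on the loaded data table of the KeyValue database. For example, when the KeyValue database is HBase, the target file may be in the HFile format.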
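A Reduce task for one first data partition can be sketched as sorting the partition's <Key, Value> pairs and serializing them. The output here is a plain tab-separated text stand-in, chosen only for illustration; it is not the HFile format, and the function name `reduce_partition` is hypothetical. The sketch also assumes Keys are fixed-width, zero-padded strings, so that string order matches numeric order.

```python
from typing import Iterable, Tuple


def reduce_partition(pairs: Iterable[Tuple[str, str]]) -> str:
    """Reduce step for one first data partition: sort the intermediate
    <Key, Value> pairs by Key and serialize them as 'key<TAB>value' lines.
    The text layout is a stand-in, not a real KeyValue storage format."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    lines = [f"{key}\t{value}" for key, value in ordered]
    return "\n".join(lines) + ("\n" if lines else "")


target_file_content = reduce_partition([
    ("00000002", "second"),
    ("00000001", "first"),
])
```

Sorting inside the Reduce task reflects the optional Sort stage described earlier; one call per first data partition yields one target file per partition.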
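It can be seen that, in the data loading method provided by the embodiments of the present invention, MapReduce tasks are executed with the resources of the computing cluster to generate target files in the file storage format defined by the KeyValue database, for use in data queries on the loaded data table of the KeyValue database in the query cluster. Because the MapReduce tasks are executed by the computing cluster, which is independent of the query cluster that provides the query service to users, the large amount of CPU, I/O, and other resources occupied while executing the MapReduce tasks belongs to the computing cluster, and the corresponding resources of the query cluster are not occupied. This keeps the load on the query cluster low, reduces the read/write latency of the KeyValue database in the query cluster, and improves the query performance of the KeyValue database.
In other words, by deploying the MapReduce service component and the KeyValue database in different clusters, the data loading method provided by the embodiments of the present invention keeps the resource-intensive MapReduce task processes from affecting the query service processes, thereby reducing the load on the cluster where the KeyValue database is located and improving the query performance of the KeyValue database.
In addition, in step 101, the data loading request may also carry a source data file storage path and an output path. In step 103, the computing cluster may obtain the source data from the source data file storage path, and after generating the target file corresponding to each first data partition in step 105, store the target files under the output path.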
需要说明的是,计算集群在生成目标文件后,具体可以将目标文件保存在本地磁盘中,也可以将目标文件保存在分布式文件系统中。并且,计算集群的分布式文件系统可以是与查询集群的分布式文件系统相互独立的分布式文件系统,也可以是与查询集群的分布式文件系统共享的同一个分布式文件系统。
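On the one hand, when the computing cluster stores the target files on a local disk or in a distributed file system that is independent of the distributed file system of the query cluster, referring to FIG. 5, after step 105 the method may further include:
106. The terminal requests the computing cluster to send the target file corresponding to each first data partition to the query cluster, so that the target files are used when data queries are performed on the to-be-loaded data table of the KeyValue database.
107. The computing cluster sends the target file corresponding to each first data partition to the query cluster.
108. The query cluster receives the target file corresponding to each first data partition sent by the computing cluster.
109. The query cluster loads the target file corresponding to each first data partition into the KeyValue database, so that the target files are used when data queries are performed on the to-be-loaded data table of the KeyValue database.
In this case, the terminal may request the computing cluster to send, to the query cluster, the target file corresponding to each first data partition that is saved on the local disk or in the distributed file system of the computing cluster. After receiving these target files, the query cluster may save them on a local disk, or in the distributed file system used by the query cluster (a distributed file system independent of that of the computing cluster), so that the target files on the local disk or in the distributed file system can be loaded into the KeyValue database and used when data queries are performed on the to-be-loaded data table of the KeyValue database.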
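On the other hand, when the computing cluster stores the target files in a distributed file system that is shared with the query cluster, the computing cluster does not need to send the target files to the query cluster. After step 105, the method may further include:
110. The query cluster obtains the target files from the distributed file system.
After step 105, the query cluster may obtain the target files directly from the distributed file system shared with the computing cluster to perform data loading, so that the obtained target files are used when data queries are performed on the to-be-loaded data table of the KeyValue database. Specifically, the query cluster may obtain, according to the partition information, the target file corresponding to each first data partition from the distributed file system shared with the computing cluster.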
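Further, referring to FIG. 6, before step 101 the method may further include:
111. The terminal requests the partition information of the to-be-loaded data table from the query cluster.
The partition information of the to-be-loaded data table that the terminal requests from the query cluster indicates at least one partition corresponding to the to-be-loaded data table. The partition information may be represented in various forms, and the embodiments of the present invention do not limit its specific form.
In the KeyValue database, the data in the to-be-loaded data table corresponds to a Key value range. The to-be-loaded data table may be divided into at least one partition according to the Key value range, and each partition indicated by the partition information has a different Key value range. A Key is a keyword, and may be a field, an attribute, or a feature in the to-be-loaded data table.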
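For example, the to-be-loaded data table in the KeyValue database is a "user information table" that includes four fields, "identity", "name", "phone", and "address", and the range of user "identity" values is 00000000-29999999. For the format of the "user information table", see Table 1 below:
Table 1
Identity    Name    Phone    Address
00000000
00000001
00000002
...
29999999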
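In the to-be-loaded data table shown in Table 1, if the Key is the "identity" field, the Key value range corresponding to the to-be-loaded data table is 00000000-29999999. The to-be-loaded data table may be divided into partitions at split-point Key values. For example, when the split-point Key values are 10000000 and 20000000, the to-be-loaded data table can be divided into three partitions: partition 1 with Key value range 00000000-09999999, partition 2 with Key value range 10000000-19999999, and partition 3 with Key value range 20000000-29999999.
In this example, the partition information may consist of the Key range 00000000-29999999 of the to-be-loaded data table and the split-point Key values 10000000 and 20000000. This partition information indicates that the to-be-loaded data table has three partitions, where partition 1 has Key value range 00000000-09999999, partition 2 has Key value range 10000000-19999999, and partition 3 has Key value range 20000000-29999999.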
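The way this form of partition information determines the three Key value ranges can be checked with a short Python sketch. The function name `partition_ranges` is hypothetical, and the sketch assumes fixed-width, zero-padded numeric Keys as in Table 1; other Key types would need a different way to compute the end of each range.

```python
def partition_ranges(first_key, last_key, split_keys):
    """Return inclusive (start, end) Key value ranges, one per partition."""
    width = len(first_key)
    starts = [first_key] + list(split_keys)
    # Each range ends one Key before the next split point; the last ends at last_key.
    ends = [str(int(key) - 1).zfill(width) for key in split_keys] + [last_key]
    return list(zip(starts, ends))


ranges = partition_ranges("00000000", "29999999", ["10000000", "20000000"])
for number, (start, end) in enumerate(ranges, start=1):
    print(f"partition {number}: {start}-{end}")
```

Two split-point Key values always yield three ranges, so the number of partitions is the number of split points plus one.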
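When each partition indicated by the partition information has a different Key value range, for a partition and a first data partition that are bound to each other, the source data of the partition and the intermediate data of the first data partition have the same Key value range.
For example, suppose that partition 1 has Key value range 00000000-09999999, partition 2 has Key value range 10000000-19999999, and partition 3 has Key value range 20000000-29999999, and that partition 1 is bound to Partition A, partition 2 is bound to Partition B, and partition 3 is bound to Partition C. Then the source data of partition 1 and the intermediate data of Partition A correspond to the Key value range 00000000-09999999; the source data of partition 2 and the intermediate data of Partition B correspond to the Key value range 10000000-19999999; and the source data of partition 3 and the intermediate data of Partition C correspond to the Key value range 20000000-29999999.
Accordingly, in step 103, the computing cluster may obtain from the distributed file system the source data corresponding to partition 1, whose Key value range is 00000000-09999999; and in step 104, the computing cluster may write the intermediate data that is obtained from the Map task and whose Keys fall within 00000000-09999999 into Partition A.
Similarly, the computing cluster may obtain from the distributed file system the source data corresponding to partition 2, whose Key value range is 10000000-19999999; and in step 104, the computing cluster may write the intermediate data that is obtained from the Map task and whose Keys fall within 10000000-19999999 into Partition B.
Likewise, the computing cluster may obtain from the distributed file system the source data corresponding to partition 3, whose Key value range is 20000000-29999999; and in step 104, the computing cluster may write the intermediate data that is obtained from the Map task and whose Keys fall within 20000000-29999999 into Partition C.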
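The routing described for steps 103 and 104 can be sketched in Python as a range lookup on the split-point Key values. The record layout (`key,value` text lines), the sample names, and the use of a binary search are assumptions for illustration only; the embodiment does not specify how the Shuffle stage locates the first data partition for a Key.

```python
from bisect import bisect_right
from collections import defaultdict

SPLIT_KEYS = ["10000000", "20000000"]                       # split-point Key values
FIRST_DATA_PARTITIONS = ["Partition A", "Partition B", "Partition C"]


def map_record(line):
    """Map step: parse one source record into a <Key, Value> pair."""
    key, _, value = line.partition(",")
    return key, value


def shuffle(pairs):
    """Shuffle step: write each pair to the first data partition for its Key range."""
    partitions = defaultdict(list)
    for key, value in pairs:
        # Fixed-width, zero-padded Keys compare correctly as strings.
        index = bisect_right(SPLIT_KEYS, key)
        partitions[FIRST_DATA_PARTITIONS[index]].append((key, value))
    return partitions


source = ["20000005,Carol", "00000001,Alice", "10000000,Bob"]
intermediate = shuffle(map_record(line) for line in source)
```

A Key equal to a split point, such as 10000000, lands in the partition that starts at that split point, which agrees with the ranges listed above.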
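Further, the query cluster may also have second data partitions. The Key value ranges of all the second data partitions are the same as the Key value ranges of the first data partitions, and a second data partition is used to store the target file for its Key value range.
On this basis, after receiving in step 108 the target file corresponding to each first data partition from the computing cluster, the query cluster may further save each target file to the second data partition that has the same Key value range as the corresponding first data partition, and then load the target file of each second data partition into the KeyValue database, so that the target file is used when data queries are performed on the to-be-loaded data table of the KeyValue database.
Likewise, after obtaining in step 110 the target file corresponding to each first data partition directly from the distributed file system shared with the computing cluster, the query cluster may save each target file to the second data partition that has the same Key value range as the corresponding first data partition, and then load the target file of each second data partition into the KeyValue database, so that the target file is used when data queries are performed on the to-be-loaded data table of the KeyValue database.
For example, a second data partition in the query cluster may be a Region as shown in FIG. 3. Region1 may have the same Key value range 00000000-09999999 as Partition A and may be used to store the target file for that range. Region2 may have the same Key value range 10000000-19999999 as Partition B and may be used to store the target file for that range. Region3 may have the same Key value range 20000000-29999999 as Partition C and may be used to store the target file for that range.
In step 108 or 110, the query cluster may store the target file for Key value range 00000000-09999999 in the second data partition Region1, the target file for Key value range 10000000-19999999 in the second data partition Region2, and the target file for Key value range 20000000-29999999 in the second data partition Region3, so that the target files are used when data queries are performed on the to-be-loaded data table of the KeyValue database.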
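The matching of target files to second data partitions by Key value range can be illustrated with the Python sketch below. The file names and the dictionary-based lookup are hypothetical; the sketch only shows that each target file goes to the Region with an identical range, and that a file with no matching Region is rejected.

```python
# Key value range of each second data partition (Region), as in the example above.
REGION_RANGES = {
    ("00000000", "09999999"): "Region1",
    ("10000000", "19999999"): "Region2",
    ("20000000", "29999999"): "Region3",
}


def place_target_files(target_files):
    """Map each target file to the Region that has the same Key value range.

    target_files maps a (hypothetical) file name to its (first key, last key).
    """
    placement = {}
    for name, key_range in target_files.items():
        if key_range not in REGION_RANGES:
            raise ValueError(f"no second data partition for range {key_range}")
        placement[name] = REGION_RANGES[key_range]
    return placement


placement = place_target_files({
    "partition_a.out": ("00000000", "09999999"),
    "partition_b.out": ("10000000", "19999999"),
    "partition_c.out": ("20000000", "29999999"),
})
```

Because the first and second data partitions share the same ranges, no target file has to be split or merged before it is loaded.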
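In addition, when requesting the partition information of the to-be-loaded data table from the query cluster, the terminal may also send identification information of the to-be-loaded data table to the query cluster, so that the query cluster can determine the to-be-loaded data table and its partition information according to the identification information. The identification information indicates the to-be-loaded data table and may be, for example, the table name or number of the to-be-loaded data table; this is not specifically limited here.
Furthermore, the terminal may store a connection configuration set for the query cluster and a connection configuration set for the computing cluster. A connection configuration set holds the configuration information that the terminal needs in order to establish a connection with the query cluster or the computing cluster; the embodiments of the present invention do not limit its specific content. For example, the connection configuration set may include at least one of an Internet Protocol (IP) address, a port, and security access configuration information. The IP address in the connection configuration set may be the IP address of a management node in the query cluster or computing cluster, or may include the IP addresses of all nodes in that cluster; the port in the connection configuration set may be the port that provides the relevant service.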
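One possible shape for such a connection configuration set is sketched below in Python. The class name, field names, address, and port number are placeholders chosen for illustration (the address comes from a range reserved for documentation); the embodiment leaves the concrete content open.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConnectionConfigSet:
    ip_addresses: list = field(default_factory=list)  # management node or all nodes
    port: Optional[int] = None                        # port of the relevant service
    security: dict = field(default_factory=dict)      # security access configuration

    def is_usable(self):
        """True when at least one of the three kinds of information is present."""
        return bool(self.ip_addresses) or self.port is not None or bool(self.security)


# One set per cluster; the values are placeholders, not real endpoints.
query_config = ConnectionConfigSet(ip_addresses=["192.0.2.10"], port=9000)
empty_config = ConnectionConfigSet()
```

Keeping one set per cluster lets the terminal connect to the query cluster and the computing cluster independently.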
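Further, referring to FIG. 6, before step 111 the method may further include:
112. The terminal sends a first connection establishment request to the query cluster according to the connection configuration set of the query cluster.
Before step 101, the method may further include:
113. The terminal sends a second connection establishment request to the computing cluster according to the connection configuration set of the computing cluster.
Further, before step 112, the method may further include:
114. The terminal starts a data loading task.
The data loading task is a task of loading data into the KeyValue database in the query cluster. The terminal may start the data loading task in various ways. For example, the terminal may start the batch data loading task upon receiving a trigger instruction entered by a user, may start it automatically after the terminal powers on, or may start it periodically; this is not specifically limited here.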
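The ordering constraints among steps 112, 111, 113, 101, and 106 can be made concrete with a small Python sketch that runs after the task has been started (step 114). `FakeCluster` and all of its methods are hypothetical stand-ins that only record calls. The text requires step 113 to precede step 101 but does not fix its position relative to step 111; placing it after step 111 here is one possible choice.

```python
class FakeCluster:
    """Hypothetical stand-in that records the calls it receives."""

    def __init__(self, name, log):
        self.name, self.log = name, log

    def connect(self):
        self.log.append(f"connect:{self.name}")

    def partition_info(self, table):
        self.log.append(f"partition_info:{table}")
        return ["10000000", "20000000"]   # split-point Key values

    def load(self, table, split_keys):
        self.log.append(f"load:{table}:{len(split_keys) + 1} partitions")

    def send_target_files(self, destination):
        self.log.append(f"send_target_files:{destination.name}")


def run_data_loading_task(query, compute, table):
    query.connect()                            # step 112
    split_keys = query.partition_info(table)   # step 111
    compute.connect()                          # step 113
    compute.load(table, split_keys)            # step 101
    compute.send_target_files(query)           # step 106


log = []
run_data_loading_task(FakeCluster("query", log), FakeCluster("compute", log), "user_info")
```

When the two clusters share one distributed file system, the final call would be unnecessary, since the query cluster reads the target files directly (step 110).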
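An embodiment of the present invention provides a computing cluster 700. Referring to FIG. 7, the computing cluster 700 may include a receiving module 701, a determining module 702, an execution module 703, a writing module 704, and a sending module 705. Specifically, the computing cluster 700 may include multiple computing nodes, each of which may be a computing device with computing capability; at least one computing node in the computing cluster 700 is used to deploy the functions of the modules of the computing cluster 700. The receiving module 701 may be configured to receive a data loading request that carries partition information of a to-be-loaded data table. The determining module 702 may be configured to determine first data partitions according to the partition information, where each partition indicated by the partition information is bound to one first data partition. The execution module 703 may be configured to obtain, from the distributed file system, the source data of each partition indicated by the partition information, and execute a map task on the source data of each partition. The writing module 704 may be configured to write, according to the binding relationship between the partitions indicated by the partition information and the first data partitions, the intermediate data obtained from each map task into the corresponding first data partition. The execution module 703 may be further configured to execute a reduce task on the intermediate data in each first data partition and obtain a target file from each reduce task, where the target files are used for data queries on the to-be-loaded data table of the KeyValue database of the query cluster.
In addition, the sending module 705 may be configured to perform step 107 in FIG. 5. The computing cluster 700 in FIG. 7 may be used to perform any of the procedures in the foregoing method flows, which are not described in detail again here.
An embodiment of the present invention further provides a computer storage medium, configured to store computer software instructions used by the computing cluster shown in FIG. 7, including a program designed to execute the foregoing method embodiments. Data loading can be implemented by executing the stored program.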
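An embodiment of the present invention provides a terminal 800. Referring to FIG. 8, the terminal 800 may include a sending module 801 and a requesting module 802. The sending module 801 may be configured to send a data loading request to the computing cluster. The data loading request carries partition information of a to-be-loaded data table and instructs the computing cluster to determine first data partitions according to the partition information. Each partition indicated by the partition information is bound to one first data partition, and a first data partition is used to store the intermediate data obtained by executing a map task on the source data in the partition bound to that first data partition, so that a reduce task can be executed on the intermediate data in the first data partition to obtain a target file. The requesting module 802 may be configured to request the computing cluster to send the target file corresponding to each first data partition to the query cluster, so that the target files are used when data queries are performed on the to-be-loaded data table of the KeyValue database of the query cluster.
In addition, the requesting module 802 may be configured to perform step 111 in FIG. 6. The terminal 800 in FIG. 8 may be used to perform any of the procedures in the foregoing method flows, which are not described in detail again here.
An embodiment of the present invention further provides a computer storage medium, configured to store computer software instructions used by the terminal shown in FIG. 8, including a program designed to execute the foregoing method embodiments. Data loading can be implemented by executing the stored program.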
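An embodiment of the present invention provides a query cluster 900. Referring to FIG. 9, the query cluster 900 may include a receiving module 901 and a loading module 902. Specifically, the query cluster 900 may include multiple query nodes, each of which may be a computing device with computing capability; at least one query node in the query cluster 900 is used to deploy the functions of the modules of the query cluster 900. The receiving module 901 may be configured to receive the target file corresponding to each first data partition sent by the computing cluster. The loading module 902 may be configured to load the target file corresponding to each first data partition into the KeyValue database, so that the target files are used when data queries are performed on the to-be-loaded data table of the KeyValue database. In addition, the query cluster 900 in FIG. 9 may be used to perform any of the procedures in the foregoing method flows, which are not described in detail again here.
An embodiment of the present invention further provides a computer storage medium, configured to store computer software instructions used by the query cluster shown in FIG. 9, including a program designed to execute the foregoing method embodiments. Data loading can be implemented by executing the stored program.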
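Referring to FIG. 10, an embodiment of the present invention further provides a computing device 1000. The computing device 1000 includes at least one processor 1001, a memory 1002, and a communication interface 1003, all of which are connected through a bus 1004. The memory 1002 is configured to store computer-executable instructions. The at least one processor 1001 is configured to execute the computer-executable instructions stored in the memory 1002, so that the computing device 1000 exchanges data through the communication interface 1003 with other devices that have computing capability (for example, a query node in the query cluster, the terminal, or a computing node in the computing cluster) to perform the data loading method provided in the foregoing embodiments.
In an optional embodiment, the computing nodes included in the computing cluster provided in the embodiments of the present invention are computing devices 1000. A computing device 1000 of the computing cluster exchanges data through the communication interface 1003 with the other computing devices 1000 in the computing cluster, the terminal, and the query nodes of the query cluster to perform the data loading method provided in the foregoing embodiments.
In an optional embodiment, the query nodes included in the query cluster provided in the embodiments of the present invention are computing devices 1000. A computing device 1000 of the query cluster exchanges data through the communication interface 1003 with the other computing devices 1000 in the query cluster, the terminal, and the computing nodes of the computing cluster to perform the data loading method provided in the foregoing embodiments.
In an optional embodiment, the terminal provided in the embodiments of the present invention is a computing device 1000. The computing device 1000 exchanges data through the communication interface 1003 with the computing nodes of the computing cluster and the query nodes of the query cluster to perform the data loading method provided in the foregoing embodiments.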
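Optionally, the at least one processor 1001 may include processors 1001 of different types or processors 1001 of the same type. A processor 1001 may be any one of the following devices with computing and processing capability: a central processing unit (CPU), an ARM processor, a field programmable gate array (FPGA), a dedicated processor, and the like. In an optional implementation, the at least one processor 1001 may also be integrated as a many-core processor.
Optionally, the memory 1002 may be any one or any combination of the following storage media: a random access memory (RAM), a read-only memory (ROM), a non-volatile memory (NVM), a solid state drive (SSD), a mechanical hard disk, a magnetic disk, a disk array, and the like.
Optionally, the communication interface 1003 is used by the computing device 1000 to exchange data with other devices that have computing capability or storage capability. The communication interface 1003 may be any one or any combination of the following devices with a network access function: a network interface (for example, an Ethernet interface), a wireless network card, and the like.
The bus 1004 may include an address bus, a data bus, a control bus, and the like. For ease of illustration, the bus is represented by a single thick line in FIG. 10. The bus 1004 may be any one or any combination of the following components for wired data transmission: an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, and the like.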
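Another embodiment of the present invention provides a communication system, which may include a terminal, a computing cluster, and a query cluster. For a schematic structural diagram of the communication system, see FIG. 3. The terminal, the computing cluster, and the query cluster in the communication system can perform the data loading method in the foregoing method embodiments.
Finally, it should be noted that the foregoing embodiments are intended only to describe the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.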

Claims (14)

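  1. A data loading method, wherein a computing cluster is used for data loading, a query cluster is used for data query of a KeyValue database, the computing cluster and the query cluster are different clusters, and the method comprises:
    receiving, by the computing cluster, a data loading request, wherein the data loading request carries partition information of a to-be-loaded data table;
    determining, by the computing cluster, first data partitions according to the partition information, wherein each partition indicated by the partition information is bound to one first data partition;
    obtaining, by the computing cluster from a distributed file system, source data of each partition indicated by the partition information, and executing a map task on the source data of each partition;
    writing, by the computing cluster according to the binding relationship between the partitions indicated by the partition information and the first data partitions, the intermediate data obtained from each map task into the corresponding first data partition; and
    executing, by the computing cluster, a reduce task on the intermediate data in each first data partition to obtain a target file of each reduce task, wherein the target file is used for data query on the to-be-loaded data table of the KeyValue database.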
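  2. The method according to claim 1, wherein each partition indicated by the partition information has a different Key value range; and for a partition and a first data partition that have the binding relationship, the source data of the partition and the intermediate data of the first data partition have the same Key value range.
  3. The method according to claim 1 or 2, wherein the method further comprises:
    sending, by the computing cluster, the target file to the query cluster.
  4. The method according to claim 3, wherein the query cluster has second data partitions, the Key value ranges corresponding to all the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and a second data partition is used to store the target file of the corresponding Key value range.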
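  5. A data loading method, wherein a computing cluster is used for data loading, a query cluster is used for data query of a KeyValue database, the computing cluster and the query cluster are different clusters, and the method comprises:
    sending a data loading request to the computing cluster, wherein the data loading request carries partition information of a to-be-loaded data table, the data loading request instructs the computing cluster to determine first data partitions according to the partition information, each partition indicated by the partition information is bound to one first data partition, and a first data partition is used to store the intermediate data obtained by executing a map task on the source data in the partition bound to the first data partition, so that a reduce task is executed on the intermediate data in the first data partition to obtain a target file; and
    requesting the computing cluster to send the target file corresponding to each first data partition to the query cluster, so that the target file is used when data query is performed on the to-be-loaded data table of the KeyValue database.
  6. The method according to claim 5, wherein the query cluster has second data partitions, the Key value ranges corresponding to all the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and a second data partition is used to store the target file of the corresponding Key value range.
  7. The method according to claim 5 or 6, wherein before the sending a data loading request to the computing cluster, the method further comprises:
    requesting the partition information of the to-be-loaded data table from the query cluster.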
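  8. A computing cluster, comprising:
    a receiving module, configured to receive a data loading request, wherein the data loading request carries partition information of a to-be-loaded data table;
    a determining module, configured to determine first data partitions according to the partition information, wherein each partition indicated by the partition information is bound to one first data partition;
    an execution module, configured to obtain, from a distributed file system, source data of each partition indicated by the partition information, and execute a map task on the source data of each partition; and
    a writing module, configured to write, according to the binding relationship between the partitions indicated by the partition information and the first data partitions, the intermediate data obtained from each map task into the corresponding first data partition;
    wherein the execution module is further configured to execute a reduce task on the intermediate data in each first data partition to obtain a target file of each reduce task, and the target file is used for data query on the to-be-loaded data table of a KeyValue database of a query cluster.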
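  9. The computing cluster according to claim 8, wherein each partition indicated by the partition information has a different Key value range; and for a partition and a first data partition that have the binding relationship, the source data of the partition and the intermediate data of the first data partition have the same Key value range.
  10. The computing cluster according to claim 8 or 9, further comprising:
    a sending module, configured to send the target file to the query cluster.
  11. The computing cluster according to claim 10, wherein the query cluster has second data partitions, the Key value ranges corresponding to all the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and a second data partition is used to store the target file of the corresponding Key value range.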
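  12. A terminal, comprising:
    a sending module, configured to send a data loading request to a computing cluster, wherein the data loading request carries partition information of a to-be-loaded data table, the data loading request instructs the computing cluster to determine first data partitions according to the partition information, each partition indicated by the partition information is bound to one first data partition, and a first data partition is used to store the intermediate data obtained by executing a map task on the source data in the partition bound to the first data partition, so that a reduce task is executed on the intermediate data in the first data partition to obtain a target file; and
    a requesting module, configured to request the computing cluster to send the target file corresponding to each first data partition to a query cluster, so that the target file is used when data query is performed on the to-be-loaded data table of a KeyValue database of the query cluster.
  13. The terminal according to claim 12, wherein the query cluster has second data partitions, the Key value ranges corresponding to all the second data partitions are the same as the Key value ranges corresponding to the first data partitions, and a second data partition is used to store the target file of the corresponding Key value range.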
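  14. The terminal according to claim 12 or 13, wherein the requesting module is further configured to:
    before the data loading request is sent to the computing cluster, request the partition information of the to-be-loaded data table from the query cluster.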
PCT/CN2017/087152 2016-09-27 2017-06-05 Data loading method, terminal, and computing cluster WO2018058998A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610856707.1A CN106503058B (zh) 2016-09-27 2016-09-27 Data loading method, terminal, and computing cluster
CN201610856707.1 2016-09-27

Publications (1)

Publication Number Publication Date
WO2018058998A1 true WO2018058998A1 (zh) 2018-04-05

Family

ID=58290036

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/087152 WO2018058998A1 (zh) 2016-09-27 2017-06-05 Data loading method, terminal, and computing cluster

Country Status (2)

Country Link
CN (1) CN106503058B (zh)
WO (1) WO2018058998A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
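CN106503058B (zh) * 2016-09-27 2019-01-18 华为技术有限公司 Data loading method, terminal, and computing cluster
CN110019125B (zh) * 2017-11-27 2021-12-14 北京京东尚科信息技术有限公司 Database management method and apparatus
CN110083658B (zh) * 2019-03-11 2021-05-25 北京达佳互联信息技术有限公司 Data synchronization method and apparatus, electronic device, and storage medium
CN112988034B (zh) * 2019-12-02 2024-04-12 华为云计算技术有限公司 Method and apparatus for writing data in a distributed system
CN111651509B (zh) * 2020-04-30 2024-04-02 中国平安财产保险股份有限公司 HBase-database-based data import method and apparatus, electronic device, and medium
CN112799820A (zh) * 2021-02-05 2021-05-14 拉卡拉支付股份有限公司 Data processing method and apparatus, electronic device, storage medium, and program product
CN114860349B (zh) * 2022-07-06 2022-11-08 深圳华锐分布式技术股份有限公司 Data loading method, apparatus, device, and medium
CN117271562B (zh) * 2023-11-21 2024-01-19 成都凌亚科技有限公司 Data acquisition and processing method and system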

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102594852A (zh) * 2011-01-04 2012-07-18 中国移动通信集团公司 Data access method, node, and system
CN102833295A (zh) * 2011-06-17 2012-12-19 南京中兴新软件有限责任公司 Data operation method and apparatus in a distributed cache system
CN105138679A (zh) * 2015-09-14 2015-12-09 桂林电子科技大学 Distributed-cache-based data processing system and processing method
EP2977899A2 (en) * 2014-06-27 2016-01-27 General Electric Company Integrating execution of computing analytics within a mapreduce processing environment
CN106503058A (zh) * 2016-09-27 2017-03-15 华为技术有限公司 Data loading method, terminal, and computing cluster

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, XING: "Research and Design of Parallel K-prototypes Clustering Algorithm Based on Hadoop.", CMFD, INFORMATION TECHNOLOGIES DIVISION, 15 March 2015 (2015-03-15), pages 7 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
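CN111090645A (zh) * 2019-10-12 2020-05-01 平安科技(深圳)有限公司 Cloud-storage-based data transmission method and apparatus, and computer device
CN111090645B (zh) * 2019-10-12 2024-03-01 平安科技(深圳)有限公司 Cloud-storage-based data transmission method and apparatus, and computer device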

Also Published As

Publication number Publication date
CN106503058A (zh) 2017-03-15
CN106503058B (zh) 2019-01-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17854454

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17854454

Country of ref document: EP

Kind code of ref document: A1