WO2021068351A1 - Cloud-storage-based data transmission method and apparatus, and computer device


Info

Publication number
WO2021068351A1
WO2021068351A1 (PCT/CN2019/118401)
Authority
WO
WIPO (PCT)
Prior art keywords
data
partition
database
partitions
sorted
Application number
PCT/CN2019/118401
Other languages
French (fr)
Chinese (zh)
Inventor
邓煜
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
Publication of WO2021068351A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278 Data partitioning, e.g. horizontal or vertical partitioning
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed in the present application are a cloud-storage-based data transmission method and apparatus, a computer device, and a storage medium. The method comprises: receiving full data uploaded by a Hive database and storing it; obtaining the number of pre-partitions in an HBase database; partitioning the full data according to the number of pre-partitions and the row key of each piece of data in the full data to obtain corresponding partition data; sorting each piece of partition data in ascending order by column and row key to obtain corresponding sorted partition data; and sending each piece of sorted partition data to the corresponding partition server of the HBase database for storage.

Description

Cloud-storage-based data transmission method, apparatus, and computer device
This application claims priority to the Chinese patent application No. 201910969811.5, filed with the Chinese Patent Office on October 12, 2019 and entitled "Cloud-storage-based data transmission method, apparatus, and computer device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of cloud storage technology, and in particular to a cloud-storage-based data transmission method, apparatus, computer device, and storage medium.
Background
At present, when data in a Hive database (Hive is a data warehouse tool that can map structured data files to database tables) is written into HBase (HBase is a distributed, column-oriented open-source database), offline batch writing or streaming writing is generally used. However, both of these approaches write data into HBase with the put method (put is one of the data insertion methods in HBase). When data is inserted with put instructions, it is sorted while being inserted, which affects the data processing efficiency of the HBase cluster and leads to low data writing efficiency.
Summary of the Invention
Embodiments of the present application provide a cloud-storage-based data transmission method, apparatus, computer device, and storage medium, aiming to solve the problem in the prior art that data is written into HBase with the put method, where data inserted with put instructions is sorted while being inserted, which affects the data processing efficiency of the HBase cluster and results in low data writing efficiency.
In a first aspect, an embodiment of the present application provides a cloud-storage-based data transmission method, which includes:
receiving full data uploaded by a Hive database and storing it, where the Hive database is a data-warehouse-style database;
obtaining the number of pre-partitions in an HBase database, where the HBase database is a distributed open-source database, and each pre-partition in the HBase database corresponds to one partition server;
partitioning the full data according to the number of pre-partitions and the row key of each piece of data in the full data to obtain corresponding partition data, where the total number of partitions of the partition data is equal to the number of pre-partitions, and each piece of partition data uniquely corresponds to one partition server;
sorting each piece of partition data in ascending order by column and row key to obtain corresponding sorted partition data; and
sending each piece of sorted partition data to the corresponding partition server of the HBase database for storage.
In a second aspect, an embodiment of the present application provides a cloud-storage-based data transmission apparatus, which includes:
a receiving unit, configured to receive full data uploaded by a Hive database and store it, where the Hive database is a data-warehouse-style database;
a partition number obtaining unit, configured to obtain the number of pre-partitions in an HBase database, where the HBase database is a distributed open-source database, and each pre-partition in the HBase database corresponds to one partition server;
a partitioning unit, configured to partition the full data according to the number of pre-partitions and the row key of each piece of data in the full data to obtain corresponding partition data, where the total number of partitions of the partition data is equal to the number of pre-partitions, and each piece of partition data uniquely corresponds to one partition server;
a sorting unit, configured to sort each piece of partition data in ascending order by column and row key to obtain corresponding sorted partition data; and
a transmission unit, configured to send each piece of sorted partition data to the corresponding partition server of the HBase database for storage.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the cloud-storage-based data transmission method described in the first aspect.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium that stores a computer program which, when executed by a processor, causes the processor to perform the cloud-storage-based data transmission method described in the first aspect.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario of the cloud-storage-based data transmission method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of the cloud-storage-based data transmission method provided by an embodiment of the present application;
FIG. 3 is a schematic sub-flowchart of the cloud-storage-based data transmission method provided by an embodiment of the present application;
FIG. 4 is another schematic sub-flowchart of the cloud-storage-based data transmission method provided by an embodiment of the present application;
FIG. 5 is another schematic sub-flowchart of the cloud-storage-based data transmission method provided by an embodiment of the present application;
FIG. 6 is another schematic sub-flowchart of the cloud-storage-based data transmission method provided by an embodiment of the present application;
FIG. 7 is a schematic block diagram of the cloud-storage-based data transmission apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic block diagram of a subunit of the cloud-storage-based data transmission apparatus provided by an embodiment of the present application;
FIG. 9 is a schematic block diagram of another subunit of the cloud-storage-based data transmission apparatus provided by an embodiment of the present application;
FIG. 10 is a schematic block diagram of another subunit of the cloud-storage-based data transmission apparatus provided by an embodiment of the present application;
FIG. 11 is a schematic block diagram of another subunit of the cloud-storage-based data transmission apparatus provided by an embodiment of the present application;
FIG. 12 is a schematic block diagram of the computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or combinations thereof.
It should also be understood that the terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit the application. As used in the specification of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in the specification and appended claims of this application refers to and includes any and all possible combinations of one or more of the associated listed items.
Please refer to FIG. 1 and FIG. 2. FIG. 1 is a schematic diagram of an application scenario of the cloud-storage-based data transmission method provided by an embodiment of the present application, and FIG. 2 is a schematic flowchart of that method. The cloud-storage-based data transmission method is applied in a server and is executed by application software installed in the server.
As shown in FIG. 2, the method includes steps S110 to S150.
S110. Receive the full data uploaded by the Hive database and store it; the Hive database is a data-warehouse-style database.
In this embodiment, the technical solution is described from the perspective of the cloud computing platform. The cloud computing platform in this application is specifically Spark, a fast, general-purpose computing engine designed for large-scale data processing. Spark enables in-memory distributed datasets; in addition to providing interactive queries, it can also optimize iterative workloads.
After the cloud computing platform receives the full data uploaded by the Hive database, it creates a logical dataframe (a dataframe is a collection of rows of a dataset; a dataset is a new interface added in Spark 1.6+) for physical storage (physical storage combines memory and disk).
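A minimal sketch of this step in Scala for Spark is given below; the application name and the Hive table name are illustrative assumptions, not taken from the application.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Read the full data from a Hive table into a DataFrame and cache it so that
// it is kept in memory and spilled to disk when memory runs short.
val spark = SparkSession.builder()
  .appName("HiveToHBaseBulkLoad")    // illustrative application name
  .enableHiveSupport()
  .getOrCreate()

val fullData = spark.table("source_db.full_data_table")  // hypothetical Hive table
fullData.persist(StorageLevel.MEMORY_AND_DISK)           // memory-plus-disk storage, as described above
```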
S120. Obtain the number of pre-partitions in the HBase database; the HBase database is a distributed open-source database, and each pre-partition in the HBase database corresponds to one partition server.
In this embodiment, after the storage of the full data is completed on the cloud computing platform, in order to know how many partitions the full data should subsequently be divided into for storage, the number of pre-partitions first needs to be obtained from the HBase database.
The HBase database is a distributed open-source database, and each pre-partition in the HBase database corresponds to one partition server. HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system based on Hadoop; with HBase, a large-scale structured storage cluster can be built on inexpensive commodity servers.
In an embodiment, as shown in FIG. 3, step S120 includes:
S121. Send an RPC request to the HBase database, where the RPC request is a remote procedure call protocol request;
S122. Receive the meta-information sent by the HBase database according to the RPC request, and obtain the number of pre-partitions from the meta-information.
In this embodiment, after the storage of the full data is completed on the cloud computing platform, the cloud computing platform initiates an RPC request (an RPC request is a remote procedure call protocol request, a way of requesting a service from a remote computer program over a network) to access the zk meta-information of the HBase database (i.e., the ZooKeeper meta-information; ZooKeeper is a distributed, open-source coordination service for distributed applications). The partition information of the pre-created HBase tables is already stored in the zk meta-information, so the number of pre-partitions in the HBase database can be learned from it. By knowing the number of pre-partitions in the HBase database, the full data can be accurately divided into the same number of partitions.
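A minimal sketch of how the number of pre-split regions could be read through the HBase client API (which resolves the table's region layout via ZooKeeper) follows; the ZooKeeper quorum address and table name are illustrative assumptions.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

// Query the region layout of the pre-split target table; the number of start
// keys equals the number of pre-partitions (regions).
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com")  // hypothetical quorum

val connection = ConnectionFactory.createConnection(hbaseConf)
try {
  val regionLocator = connection.getRegionLocator(TableName.valueOf("target_table"))  // hypothetical table
  val numPrePartitions = regionLocator.getStartKeys.length  // one start key per region
  println(s"pre-partitions: $numPrePartitions")
} finally {
  connection.close()
}
```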
S130. Partition the full data according to the number of pre-partitions and the row key of each piece of data in the full data to obtain the corresponding partition data; the total number of partitions of the partition data is equal to the number of pre-partitions, and each piece of partition data uniquely corresponds to one partition server.
In this embodiment, the full data stored in the dataframe on the cloud computing platform is scattered, piece by piece, into the corresponding partitions according to the HexStringSplit pre-partitioning method. HexStringSplit is a pre-splitting method suitable for row keys prefixed with hexadecimal strings.
In an embodiment, as shown in FIG. 4, step S130 includes:
S131. Obtain the row key corresponding to each piece of data in the full data;
S132. Generate a corresponding hash value for the row key of each piece of data using the MD5 algorithm or the SHA-256 algorithm;
S133. Take the hash value corresponding to each row key modulo the number of pre-partitions to obtain a remainder corresponding to each row key;
S134. Store the data corresponding to each row key in the partition corresponding to that row key's remainder, so as to obtain the corresponding partition data.
In this embodiment, each piece of data in Spark corresponds to a row key (i.e., a rowkey). The row key of each piece of data is obtained first, so that after the corresponding processing the data can be assigned to the corresponding region.
When the row key of each piece of data is then processed with the MD5 or SHA-256 algorithm, the corresponding hash value is generated. MD5 is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value and is used to ensure that information is transmitted completely and consistently. SHA-256 is a secure hash algorithm that computes a fixed-length string (also called a message digest) for a digital message. Generating hash values from the row keys with MD5 or SHA-256 and scattering the data into the corresponding partitions means that data whose row keys leave the same remainder is placed in the same partition. In this way, fast and effective division of the full data is achieved.
Since each pre-partition in the HBase database corresponds to one partition server and each piece of partition data uniquely corresponds to one partition server, the correspondence between partition data and partition servers can be set in advance, for example partition 1 corresponds to partition server 1, ..., and partition N corresponds to partition server N. Once the correspondence between each piece of partition data and its partition server is known, directed storage can be achieved in subsequent data storage, which improves storage efficiency.
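One way the hash-modulo assignment described above could be expressed as a Spark partitioner is sketched below; the use of MD5 and the modulo over the number of pre-partitions follow the description, while the class and variable names are illustrative assumptions.

```scala
import java.math.BigInteger
import java.security.MessageDigest
import org.apache.spark.Partitioner

// Hash each row key with MD5 and take the remainder modulo the number of
// pre-partitions, so that rows whose hashes leave the same remainder are
// placed in the same partition.
class RowKeyHashPartitioner(numPrePartitions: Int) extends Partitioner {
  override def numPartitions: Int = numPrePartitions

  override def getPartition(key: Any): Int = {
    val rowKeyBytes = key.toString.getBytes("UTF-8")
    val digest = MessageDigest.getInstance("MD5").digest(rowKeyBytes)
    // Interpret the digest as a non-negative integer before taking the modulus.
    new BigInteger(1, digest).mod(BigInteger.valueOf(numPrePartitions)).intValue()
  }
}

// Usage on a pair RDD keyed by row key, e.g. RDD[(String, Row)]:
// val partitioned = keyedRdd.partitionBy(new RowKeyHashPartitioner(numPrePartitions))
```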
S140. Sort each piece of partition data in ascending order by column and row key to obtain the corresponding sorted partition data.
In this embodiment, after the full data is partitioned on the cloud computing platform according to the number of pre-partitions, each piece of partition data still needs to be sorted; once sorting is completed, the data is sent to the HBase database and can be stored quickly. When sorting each piece of partition data, the column values and row key values can be used as the sort keys.
In an embodiment, as shown in FIG. 5, step S140 includes:
S141. Within each piece of partition data, obtain the data having the same row key, and sort the data having the same row key in ascending order by column to obtain the first sorted partition data corresponding to each piece of partition data;
S142. Sort each piece of first sorted partition data in ascending order by row key to obtain the sorted partition data corresponding to each piece of first sorted partition data.
In this embodiment, within each piece of partition data, the data with the same row key value is first grouped together, and within each group the data is sorted in ascending order by column value, yielding the first sorted partition data. The first sorted partition data obtained after this initial sorting can then be sorted in ascending order by row key, yielding the sorted partition data corresponding to each piece of first sorted partition data. It can be seen that after sorting each piece of partition data by column and row key, the data can be stored in a more regular manner.
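The two-level ascending sort (row key, then column) can be combined with the partition assignment in a single shuffle. The sketch below assumes a pair RDD keyed by (rowKey, columnQualifier) and reuses the RowKeyHashPartitioner sketched earlier; all names are illustrative.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// kvRdd: RDD[((String, String), Array[Byte])] keyed by (rowKey, columnQualifier).
def sortForBulkLoad(kvRdd: RDD[((String, String), Array[Byte])],
                    numPrePartitions: Int): RDD[((String, String), Array[Byte])] = {
  // Partition only by the row key so that all columns of a row stay together.
  val byRowKey = new Partitioner {
    private val inner = new RowKeyHashPartitioner(numPrePartitions)
    override def numPartitions: Int = inner.numPartitions
    override def getPartition(key: Any): Int = key match {
      case (rowKey, _) => inner.getPartition(rowKey)
      case other       => inner.getPartition(other)
    }
  }
  // Tuple keys are compared lexicographically: ascending by row key, then by
  // column qualifier, which matches the ordering described above.
  kvRdd.repartitionAndSortWithinPartitions(byRowKey)
}
```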
S150. Send each piece of sorted partition data to the corresponding partition server of the HBase database for storage.
In this embodiment, after the sorting of each piece of partition data is completed and the corresponding sorted partition data is obtained, the data can be sent directly to the HBase database for storage. There is no longer any need to sort while inserting, as happens when data is inserted with put instructions, which affects the data processing efficiency of the HBase cluster. The partitioned and sorted data is stored directly in the HBase database, requiring only direct storage, which improves storage efficiency.
In an embodiment, as shown in FIG. 6, step S150 includes:
S151. Input each piece of sorted partition data into the local HDFS layer, so as to convert each piece of sorted partition data into a corresponding data file; the HDFS layer is a distributed file system layer;
S152. Send the data file to the corresponding partition server of the HBase database for storage.
In this embodiment, the bottom layer of the cloud computing platform (i.e., Spark) is the HDFS layer used for storing data. Each piece of sorted partition data is input into the HDFS layer, and the HDFS layer converts each piece of sorted partition data into a data file. The data file is specifically an HFile; an HFile contains 7 kinds of blocks, which by block type are:
a) data block, which stores key-value data (i.e., key-value pair data); the default size of a data block is generally 64 KB;
b) data index block, which stores the index of the data blocks; the index can be multi-level, and the intermediate and leaf indexes are generally distributed throughout the HFile;
c) bloom filter block, which stores the values of the bloom filter;
d) meta data block; there can be multiple meta data blocks, and they are distributed contiguously;
e) meta data index, which is the index of the meta data;
f) file-info block, which records some information about the file, such as the largest key in the HFile, the average key length, the HFile creation timestamp, the encoding used by the data blocks, and so on;
g) trailer block, which every HFile has; the trailer length may differ between HFile versions (there are three versions, V1, V2 and V3, with little difference between V2 and V3), but all HFile trailers of the same version have the same length, and the last 4 bytes of the trailer are always the version information.
It can be seen that each piece of sorted partition data is stored in the local HDFS layer, and it is stored by being converted into HFiles.
Once each piece of sorted partition data has been converted into HFiles at the HDFS layer, the HFiles corresponding to each piece of sorted partition data can be sent to the corresponding partition server of the HBase database. The partition server of the HBase database then uses the Bulkload scheme (i.e., the bulk loading scheme) to write the HFiles into the HBase database. The advantages of Bulkload are that the import process does not occupy region resources, massive amounts of data can be imported quickly, and memory is saved.
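The sketch below shows how HFile generation and the subsequent bulk load could be wired together. hfileRdd is assumed to be an already-sorted RDD[(ImmutableBytesWritable, KeyValue)]; the staging path and table name are illustrative. Note that, depending on the HBase version, LoadIncrementalHFiles lives in org.apache.hadoop.hbase.tool (2.x) or org.apache.hadoop.hbase.mapreduce (1.x).

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
import org.apache.hadoop.mapreduce.Job

val hbaseConf = HBaseConfiguration.create()
val conn = ConnectionFactory.createConnection(hbaseConf)
val tableName = TableName.valueOf("target_table")   // hypothetical table
val table = conn.getTable(tableName)
val regionLocator = conn.getRegionLocator(tableName)

// Configure the output format with the table's region boundaries so that the
// generated HFiles line up with the pre-partitions.
val job = Job.getInstance(hbaseConf)
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

// Write the sorted key-values out as HFiles on HDFS.
hfileRdd.saveAsNewAPIHadoopFile(
  "/tmp/hfile_staging",                              // hypothetical staging directory
  classOf[ImmutableBytesWritable],
  classOf[KeyValue],
  classOf[HFileOutputFormat2],
  job.getConfiguration)

// Ask the region servers to adopt the generated HFiles (bulk load).
new LoadIncrementalHFiles(hbaseConf)
  .doBulkLoad(new Path("/tmp/hfile_staging"), conn.getAdmin, table, regionLocator)
```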
In an embodiment, after step S150 the method further includes:
if it is detected that a data transmission error message sent by the HBase database has been received, locating and obtaining a data transmission interruption point in each piece of sorted partition data according to the log file corresponding to the data transmission error message;
sending the data located after the data transmission interruption point of each piece of sorted partition data to the corresponding partition server of the HBase database for storage.
In this embodiment, if a transmission interruption occurs while each piece of sorted partition data is being sent to the HBase database for storage, the data transmission error message sent by the HBase database can be received, and the data transmission interruption point can be located in each piece of sorted partition data according to the log file corresponding to the error message. After the interruption point is obtained, data transmission can resume from the data following that interruption point, ensuring that normal transmission can be restored after an exception occurs.
This method completes the sorting process in the cloud before the full data is written into the HBase database, which improves the efficiency of writing data into the HBase database.
An embodiment of the present application further provides a cloud-storage-based data transmission apparatus, which is configured to perform any embodiment of the foregoing cloud-storage-based data transmission method. Specifically, please refer to FIG. 7, which is a schematic block diagram of the cloud-storage-based data transmission apparatus provided by an embodiment of the present application. The cloud-storage-based data transmission apparatus 100 may be configured in a server.
As shown in FIG. 7, the cloud-storage-based data transmission apparatus 100 includes a receiving unit 110, a partition number obtaining unit 120, a partitioning unit 130, a sorting unit 140, and a transmission unit 150.
The receiving unit 110 is configured to receive the full data uploaded by the Hive database and store it; the Hive database is a data-warehouse-style database.
In this embodiment, the technical solution is described from the perspective of the cloud computing platform. The cloud computing platform in this application is specifically Spark, a fast, general-purpose computing engine designed for large-scale data processing. Spark enables in-memory distributed datasets; in addition to providing interactive queries, it can also optimize iterative workloads.
After the cloud computing platform receives the full data uploaded by the Hive database, it creates a logical dataframe (a dataframe is a collection of rows of a dataset; a dataset is a new interface added in Spark 1.6+) for physical storage (physical storage combines memory and disk).
The partition number obtaining unit 120 is configured to obtain the number of pre-partitions in the HBase database; the HBase database is a distributed open-source database, and each pre-partition in the HBase database corresponds to one partition server.
In this embodiment, after the storage of the full data is completed on the cloud computing platform, in order to know how many partitions the full data should subsequently be divided into for storage, the number of pre-partitions first needs to be obtained from the HBase database.
The HBase database is a distributed open-source database, and each pre-partition in the HBase database corresponds to one partition server. HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system based on Hadoop; with HBase, a large-scale structured storage cluster can be built on inexpensive commodity servers.
In an embodiment, as shown in FIG. 8, the partition number obtaining unit 120 includes:
a request sending unit 121, configured to send an RPC request to the HBase database, where the RPC request is a remote procedure call protocol request;
a meta-information parsing unit 122, configured to receive the meta-information sent by the HBase database according to the RPC request and obtain the number of pre-partitions from the meta-information.
In this embodiment, after the storage of the full data is completed on the cloud computing platform, the cloud computing platform initiates an RPC request (an RPC request is a remote procedure call protocol request, a way of requesting a service from a remote computer program over a network) to access the zk meta-information of the HBase database (i.e., the ZooKeeper meta-information; ZooKeeper is a distributed, open-source coordination service for distributed applications). The partition information of the pre-created HBase tables is already stored in the zk meta-information, so the number of pre-partitions in the HBase database can be learned from it. By knowing the number of pre-partitions in the HBase database, the full data can be accurately divided into the same number of partitions.
The partitioning unit 130 is configured to partition the full data according to the number of pre-partitions and the row key of each piece of data in the full data to obtain the corresponding partition data; the total number of partitions of the partition data is equal to the number of pre-partitions, and each piece of partition data uniquely corresponds to one partition server.
In this embodiment, the full data stored in the dataframe on the cloud computing platform is scattered, piece by piece, into the corresponding partitions according to the HexStringSplit pre-partitioning method. HexStringSplit is a pre-splitting method suitable for row keys prefixed with hexadecimal strings.
In an embodiment, as shown in FIG. 9, the partitioning unit 130 includes:
a row key obtaining unit 131, configured to obtain the row key corresponding to each piece of data in the full data;
a hashing unit 132, configured to generate a corresponding hash value for the row key of each piece of data using the MD5 algorithm or the SHA-256 algorithm;
a modulo operation unit 133, configured to take the hash value corresponding to each row key modulo the number of pre-partitions to obtain a remainder corresponding to each row key;
a data partitioning unit 134, configured to store the data corresponding to each row key in the partition corresponding to that row key's remainder, so as to obtain the corresponding partition data.
In this embodiment, each piece of data in Spark corresponds to a row key (i.e., a rowkey). The row key of each piece of data is obtained first, so that after the corresponding processing the data can be assigned to the corresponding region.
When the row key of each piece of data is then processed with the MD5 or SHA-256 algorithm, the corresponding hash value is generated. MD5 is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value and is used to ensure that information is transmitted completely and consistently. SHA-256 is a secure hash algorithm that computes a fixed-length string (also called a message digest) for a digital message. Generating hash values from the row keys with MD5 or SHA-256 and scattering the data into the corresponding partitions means that data whose row keys leave the same remainder is placed in the same partition. In this way, fast and effective division of the full data is achieved.
Since each pre-partition in the HBase database corresponds to one partition server and each piece of partition data uniquely corresponds to one partition server, the correspondence between partition data and partition servers can be set in advance, for example partition 1 corresponds to partition server 1, ..., and partition N corresponds to partition server N. Once the correspondence between each piece of partition data and its partition server is known, directed storage can be achieved in subsequent data storage, which improves storage efficiency.
The sorting unit 140 is configured to sort each piece of partition data in ascending order by column and row key to obtain the corresponding sorted partition data.
In this embodiment, after the full data is partitioned on the cloud computing platform according to the number of pre-partitions, each piece of partition data still needs to be sorted; once sorting is completed, the data is sent to the HBase database and can be stored quickly. When sorting each piece of partition data, the column values and row key values can be used as the sort keys.
In an embodiment, as shown in FIG. 10, the sorting unit 140 includes:
a first sorting unit 141, configured to, within each piece of partition data, obtain the data having the same row key and sort the data having the same row key in ascending order by column, so as to obtain the first sorted partition data corresponding to each piece of partition data;
a second sorting unit 142, configured to sort each piece of first sorted partition data in ascending order by row key to obtain the sorted partition data corresponding to each piece of first sorted partition data.
In this embodiment, within each piece of partition data, the data with the same row key value is first grouped together, and within each group the data is sorted in ascending order by column value, yielding the first sorted partition data. The first sorted partition data obtained after this initial sorting can then be sorted in ascending order by row key, yielding the sorted partition data corresponding to each piece of first sorted partition data. It can be seen that after sorting each piece of partition data by column and row key, the data can be stored in a more regular manner.
The transmission unit 150 is configured to send each piece of sorted partition data to the corresponding partition server of the HBase database for storage.
In this embodiment, after the sorting of each piece of partition data is completed and the corresponding sorted partition data is obtained, the data can be sent directly to the HBase database for storage. There is no longer any need to sort while inserting, as happens when data is inserted with put instructions, which affects the data processing efficiency of the HBase cluster. The partitioned and sorted data is stored directly in the HBase database, requiring only direct storage, which improves storage efficiency.
In an embodiment, as shown in FIG. 11, the transmission unit 150 includes:
an underlying storage unit 151, configured to input each piece of sorted partition data into the local HDFS layer, so as to convert each piece of sorted partition data into a corresponding data file, where the HDFS layer is a distributed file system layer;
a data sending unit 152, configured to send the data file to the corresponding partition server of the HBase database for storage.
In this embodiment, the bottom layer of the cloud computing platform (i.e., Spark) is the HDFS layer used for storing data. Each piece of sorted partition data is input into the HDFS layer, and the HDFS layer converts each piece of sorted partition data into a data file. It can be seen that each piece of sorted partition data is stored in the local HDFS layer, and it is stored by being converted into HFiles.
Once each piece of sorted partition data has been converted into HFiles at the HDFS layer, the HFiles corresponding to each piece of sorted partition data can be sent to the corresponding partition server of the HBase database. The partition server of the HBase database then uses the Bulkload scheme (i.e., the bulk loading scheme) to write the HFiles into the HBase database. The advantages of Bulkload are that the import process does not occupy region resources, massive amounts of data can be imported quickly, and memory is saved.
In an embodiment, the cloud-storage-based data transmission apparatus 100 further includes:
an interruption point obtaining unit, configured to, if it is detected that a data transmission error message sent by the HBase database has been received, locate and obtain a data transmission interruption point in each piece of sorted partition data according to the log file corresponding to the data transmission error message;
a data transmission recovery unit, configured to send the data located after the data transmission interruption point of each piece of sorted partition data to the corresponding partition server of the HBase database for storage.
In this embodiment, if a transmission interruption occurs while each piece of sorted partition data is being sent to the HBase database for storage, the data transmission error message sent by the HBase database can be received, and the data transmission interruption point can be located in each piece of sorted partition data according to the log file corresponding to the error message. After the interruption point is obtained, data transmission can resume from the data following that interruption point, ensuring that normal transmission can be restored after an exception occurs.
This apparatus completes the sorting process in the cloud before the full data is written into the HBase database, which improves the efficiency of writing data into the HBase database.
The above cloud-storage-based data transmission apparatus may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 12.
Please refer to FIG. 12, which is a schematic block diagram of the computer device provided by an embodiment of the present application. The computer device 500 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
Referring to FIG. 12, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected via a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When the computer program 5032 is executed, it can cause the processor 502 to perform the cloud-storage-based data transmission method.
The processor 502 is configured to provide computing and control capabilities and support the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, it can cause the processor 502 to perform the cloud-storage-based data transmission method.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art can understand that the structure shown in FIG. 12 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device 500 to which the solution is applied; the specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory, so as to implement the cloud-storage-based data transmission method disclosed in the embodiments of the present application.
Those skilled in the art can understand that the embodiment of the computer device shown in FIG. 12 does not constitute a limitation on the specific configuration of the computer device; in other embodiments, the computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components. For example, in some embodiments, the computer device may include only a memory and a processor. In such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 12 and are not repeated here.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Another embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the cloud-storage-based data transmission method disclosed in the embodiments of the present application.
The storage medium is a physical, non-transitory storage medium, for example a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disc, or another physical storage medium capable of storing program code.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person familiar with this technical field can easily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and such modifications or replacements shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A cloud-storage-based data transmission method, comprising:
    receiving full data uploaded by a Hive database and storing it, wherein the Hive database is a data-warehouse-style database;
    obtaining the number of pre-partitions in an HBase database, wherein the HBase database is a distributed open-source database, and each pre-partition in the HBase database corresponds to one partition server;
    partitioning the full data according to the number of pre-partitions and the row key of each piece of data in the full data to obtain corresponding partition data, wherein the total number of partitions of the partition data is equal to the number of pre-partitions, and each piece of partition data uniquely corresponds to one partition server;
    sorting each piece of partition data in ascending order by column and row key to obtain corresponding sorted partition data; and
    sending each piece of sorted partition data to the corresponding partition server of the HBase database for storage.
  2. The cloud-storage-based data transmission method according to claim 1, wherein the obtaining the number of pre-partitions in the HBase database comprises:
    sending an RPC request to the HBase database, wherein the RPC request is a remote procedure call protocol request;
    receiving meta-information sent by the HBase database according to the RPC request, and obtaining the number of pre-partitions from the meta-information.
  3. The cloud-storage-based data transmission method according to claim 1, wherein the partitioning the full data according to the number of pre-partitions and the row key of each piece of data in the full data to obtain corresponding partition data comprises:
    obtaining the row key corresponding to each piece of data in the full data;
    generating a corresponding hash value for the row key of each piece of data using the MD5 algorithm or the SHA-256 algorithm;
    taking the hash value corresponding to each row key modulo the number of pre-partitions to obtain a remainder corresponding to each row key;
    storing the data corresponding to each row key in the partition corresponding to that row key's remainder, so as to obtain the corresponding partition data.
  4. The cloud-storage-based data transmission method according to claim 1, wherein the sorting each piece of partitioned data in ascending order by column and then by row key to obtain the corresponding sorted partitioned data comprises:
    obtaining, within each piece of partitioned data, the data having the same row key, and sorting the data having the same row key in ascending order by column, to obtain first sorted partitioned data corresponding to each piece of partitioned data;
    sorting each piece of first sorted partitioned data in ascending order by row key, to obtain sorted partitioned data corresponding to each piece of first sorted partitioned data.
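
The two-pass ordering in claim 4 can be sketched as below; it relies on Python's sort being stable, so the later sort by row key preserves the column order already established within each row key.

    def sort_partition(rows):
        """rows: list of (row_key, column, value) belonging to one partition."""
        first_sorted = sorted(rows, key=lambda r: r[1])   # first sorted partitioned data: by column
        return sorted(first_sorted, key=lambda r: r[0])   # then by row key; stable sort keeps column order
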
  5. The cloud-storage-based data transmission method according to claim 1, wherein the sending each piece of sorted partitioned data to the corresponding partition server of the HBase database for storage comprises:
    inputting each piece of sorted partitioned data into a local HDFS layer, so as to convert each piece of sorted partitioned data into a corresponding data file, wherein the HDFS layer is a distributed file system layer;
    sending the data files to the corresponding partition servers of the HBase database for storage.
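
A stand-in sketch for claim 5 that writes each sorted partition to a plain local file; an actual pipeline would emit HBase store files through the HDFS layer before handing them to the partition servers, so the file format and naming here are assumptions for illustration only.

    import os

    def write_partition_files(sorted_partitions, out_dir):
        """Converts each sorted partition into its own data file and returns the file paths."""
        os.makedirs(out_dir, exist_ok=True)
        paths = {}
        for idx, rows in sorted_partitions.items():
            path = os.path.join(out_dir, f"partition-{idx:05d}.dat")
            with open(path, "w", encoding="utf-8") as f:
                for row_key, column, value in rows:
                    f.write(f"{row_key}\t{column}\t{value}\n")
            paths[idx] = path
        return paths
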
  6. The cloud-storage-based data transmission method according to claim 1, wherein after the sending each piece of sorted partitioned data to the corresponding partition server of the HBase database for storage, the method further comprises:
    if it is detected that a data transmission error message sent by the HBase database has been received, locating a data transmission interruption point in each piece of sorted partitioned data according to the log file corresponding to the data transmission error message;
    sending the data located after the data transmission interruption point of each piece of sorted partitioned data to the corresponding partition server of the HBase database for storage.
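
Resuming after a failed transfer, as in claim 6, reduces to re-sending only the tail of each sorted partition. The sketch assumes the error log has already been parsed into a per-partition count of rows stored successfully (the parsing depends on the log format and is not shown), and reuses the hypothetical send_to_partition_server helper from the claim 1 sketch.

    def resend_from_interruption(sorted_partitions, interruption_points, hbase_client):
        """interruption_points: {partition index: rows already stored}, derived from the error log."""
        for idx, rows in sorted_partitions.items():
            offset = interruption_points.get(idx, 0)
            remaining = rows[offset:]          # only the data after the interruption point
            if remaining:
                send_to_partition_server(hbase_client, idx, remaining)
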
  7. The cloud-storage-based data transmission method according to claim 1, wherein the receiving the full data uploaded by the Hive database and storing the full data comprises:
    generating a dataframe to physically store the full data, wherein the dataframe is a matrix-style data table.
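
One way to realise the dataframe of claim 7 is a Spark DataFrame read from Hive and kept materialised; the application name and table name below are assumptions for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-to-hbase").enableHiveSupport().getOrCreate()
    full_df = spark.sql("SELECT row_key, col, value FROM ods.full_table")  # full data from Hive
    full_df.persist()  # cache the dataframe so later partitioning and sorting reuse it
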
  8. The cloud-storage-based data transmission method according to claim 7, wherein the partitioning the full data according to the number of pre-partitions and the row key of each piece of data in the full data to obtain the corresponding partitioned data comprises:
    scattering each piece of the full data stored in the dataframe into its corresponding partition according to the HexStringSplit pre-partitioning scheme, wherein the HexStringSplit pre-partitioning scheme is a pre-splitting scheme used for data whose row keys take hexadecimal strings as prefixes.
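
A simplified stand-in for the HexStringSplit scheme of claim 8: split the hexadecimal key space into equal ranges and route each row by its hex row-key prefix. In HBase the split points are produced when the table is pre-split with HexStringSplit; the 8-character boundary width and lowercase fixed-width prefixes are assumptions made here.

    def hexstringsplit_boundaries(num_pre_partitions, width=8):
        """Evenly spaced split points over the hexadecimal key space."""
        step = 16 ** width // num_pre_partitions
        return [format(i * step, f"0{width}x") for i in range(1, num_pre_partitions)]

    def assign_partition(row_key_hex_prefix, boundaries):
        """Partition index whose key range contains the lowercase, fixed-width hex prefix."""
        for i, boundary in enumerate(boundaries):
            if row_key_hex_prefix < boundary:
                return i
        return len(boundaries)  # falls into the last partition
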
  9. A cloud-storage-based data transmission apparatus, comprising:
    a receiving unit, configured to receive full data uploaded by a Hive database and store the full data, wherein the Hive database is a data-warehouse-style database;
    a partition number acquiring unit, configured to acquire the number of pre-partitions in an HBase database, wherein the HBase database is a distributed open-source database, and each pre-partition in the HBase database corresponds to one partition server;
    a partitioning unit, configured to partition the full data according to the number of pre-partitions and the row key of each piece of data in the full data, so as to obtain corresponding partitioned data, wherein the total number of partitions of the partitioned data is equal to the number of pre-partitions, and each piece of partitioned data uniquely corresponds to one partition server;
    a sorting unit, configured to sort each piece of partitioned data in ascending order, by column and then by row key, to obtain corresponding sorted partitioned data; and
    a transmission unit, configured to send each piece of sorted partitioned data to the corresponding partition server of the HBase database for storage.
  10. The cloud-storage-based data transmission apparatus according to claim 9, wherein the partition number acquiring unit comprises:
    a request sending unit, configured to send an RPC request to the HBase database, wherein the RPC request is a remote procedure call protocol request;
    a meta-information parsing unit, configured to receive meta-information sent by the HBase database in response to the RPC request and obtain the number of pre-partitions from the meta-information.
  11. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:
    receiving full data uploaded by a Hive database and storing the full data, wherein the Hive database is a data-warehouse-style database;
    acquiring the number of pre-partitions in an HBase database, wherein the HBase database is a distributed open-source database, and each pre-partition in the HBase database corresponds to one partition server;
    partitioning the full data according to the number of pre-partitions and the row key of each piece of data in the full data, so as to obtain corresponding partitioned data, wherein the total number of partitions of the partitioned data is equal to the number of pre-partitions, and each piece of partitioned data uniquely corresponds to one partition server;
    sorting each piece of partitioned data in ascending order, by column and then by row key, to obtain corresponding sorted partitioned data; and
    sending each piece of sorted partitioned data to the corresponding partition server of the HBase database for storage.
  12. The computer device according to claim 11, wherein the acquiring the number of pre-partitions in the HBase database comprises:
    sending an RPC request to the HBase database, wherein the RPC request is a remote procedure call protocol request;
    receiving meta-information sent by the HBase database in response to the RPC request, and obtaining the number of pre-partitions from the meta-information.
  13. The computer device according to claim 11, wherein the partitioning the full data according to the number of pre-partitions and the row key of each piece of data in the full data to obtain the corresponding partitioned data comprises:
    obtaining the row key corresponding to each piece of data in the full data;
    generating a hash value corresponding to the row key of each piece of data by means of the MD5 algorithm or the SHA-256 algorithm;
    taking the hash value corresponding to each row key modulo the number of pre-partitions to obtain a remainder corresponding to each row key;
    storing the data corresponding to each row key into the partition corresponding to the remainder of that row key, so as to obtain the corresponding partitioned data.
  14. The computer device according to claim 11, wherein the sorting each piece of partitioned data in ascending order by column and then by row key to obtain the corresponding sorted partitioned data comprises:
    obtaining, within each piece of partitioned data, the data having the same row key, and sorting the data having the same row key in ascending order by column, to obtain first sorted partitioned data corresponding to each piece of partitioned data;
    sorting each piece of first sorted partitioned data in ascending order by row key, to obtain sorted partitioned data corresponding to each piece of first sorted partitioned data.
  15. The computer device according to claim 11, wherein the sending each piece of sorted partitioned data to the corresponding partition server of the HBase database for storage comprises:
    inputting each piece of sorted partitioned data into a local HDFS layer, so as to convert each piece of sorted partitioned data into a corresponding data file, wherein the HDFS layer is a distributed file system layer;
    sending the data files to the corresponding partition servers of the HBase database for storage.
  16. The computer device according to claim 11, wherein after the sending each piece of sorted partitioned data to the corresponding partition server of the HBase database for storage, the steps further comprise:
    if it is detected that a data transmission error message sent by the HBase database has been received, locating a data transmission interruption point in each piece of sorted partitioned data according to the log file corresponding to the data transmission error message;
    sending the data located after the data transmission interruption point of each piece of sorted partitioned data to the corresponding partition server of the HBase database for storage.
  17. The computer device according to claim 11, wherein the receiving the full data uploaded by the Hive database and storing the full data comprises:
    generating a dataframe to physically store the full data, wherein the dataframe is a matrix-style data table.
  18. The computer device according to claim 17, wherein the partitioning the full data according to the number of pre-partitions and the row key of each piece of data in the full data to obtain the corresponding partitioned data comprises:
    scattering each piece of the full data stored in the dataframe into its corresponding partition according to the HexStringSplit pre-partitioning scheme, wherein the HexStringSplit pre-partitioning scheme is a pre-splitting scheme used for data whose row keys take hexadecimal strings as prefixes.
  19. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following operations:
    receiving full data uploaded by a Hive database and storing the full data, wherein the Hive database is a data-warehouse-style database;
    acquiring the number of pre-partitions in an HBase database, wherein the HBase database is a distributed open-source database, and each pre-partition in the HBase database corresponds to one partition server;
    partitioning the full data according to the number of pre-partitions and the row key of each piece of data in the full data, so as to obtain corresponding partitioned data, wherein the total number of partitions of the partitioned data is equal to the number of pre-partitions, and each piece of partitioned data uniquely corresponds to one partition server;
    sorting each piece of partitioned data in ascending order, by column and then by row key, to obtain corresponding sorted partitioned data; and
    sending each piece of sorted partitioned data to the corresponding partition server of the HBase database for storage.
  20. The computer-readable storage medium according to claim 19, wherein the acquiring the number of pre-partitions in the HBase database comprises:
    sending an RPC request to the HBase database, wherein the RPC request is a remote procedure call protocol request;
    receiving meta-information sent by the HBase database in response to the RPC request, and obtaining the number of pre-partitions from the meta-information.
PCT/CN2019/118401 2019-10-12 2019-11-14 Cloud-storage-based data transmission method and apparatus, and computer device WO2021068351A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910969811.5 2019-10-12
CN201910969811.5A CN111090645B (en) 2019-10-12 2019-10-12 Cloud storage-based data transmission method and device and computer equipment

Publications (1)

Publication Number Publication Date
WO2021068351A1 true WO2021068351A1 (en) 2021-04-15

Family

ID=70392992

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118401 WO2021068351A1 (en) 2019-10-12 2019-11-14 Cloud-storage-based data transmission method and apparatus, and computer device

Country Status (2)

Country Link
CN (1) CN111090645B (en)
WO (1) WO2021068351A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177090A (en) * 2021-04-30 2021-07-27 中国邮政储蓄银行股份有限公司 Data processing method and device
CN113535856A (en) * 2021-07-29 2021-10-22 上海哔哩哔哩科技有限公司 Data synchronization method and system
CN113568966A (en) * 2021-07-29 2021-10-29 上海哔哩哔哩科技有限公司 Data processing method and system used between ODS layer and DW layer
CN114925123A (en) * 2022-04-24 2022-08-19 杭州悦数科技有限公司 Data transmission method between distributed graph database and graph computing system
CN115801787A (en) * 2023-01-29 2023-03-14 智道网联科技(北京)有限公司 Method and device for transmitting road end data, electronic equipment and storage medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113312414B (en) * 2020-07-30 2023-12-26 阿里巴巴集团控股有限公司 Data processing method, device, equipment and storage medium
CN112395345A (en) * 2020-12-04 2021-02-23 江苏苏宁云计算有限公司 HBase full data import method and device, computer equipment and storage medium
CN112905854A (en) * 2021-03-05 2021-06-04 北京中经惠众科技有限公司 Data processing method and device, computing equipment and storage medium
CN113096284B (en) * 2021-03-19 2022-08-30 福建新大陆通信科技股份有限公司 CTID access control authorization information verification method
CN116049197B (en) * 2023-03-07 2023-06-30 中船奥蓝托无锡软件技术有限公司 HBase-based data equilibrium storage method
CN116719822B (en) * 2023-08-10 2023-12-22 深圳市连用科技有限公司 Method and system for storing massive structured data

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10055458B2 (en) * 2015-07-30 2018-08-21 Futurewei Technologies, Inc. Data placement control for distributed computing environment
US9923960B2 (en) * 2015-08-18 2018-03-20 Salesforce.Com, Inc. Partition balancing in an on-demand services environment
US10216740B2 (en) * 2016-03-31 2019-02-26 Acronis International Gmbh System and method for fast parallel data processing in distributed storage systems
CN106503058B (en) * 2016-09-27 2019-01-18 华为技术有限公司 A kind of data load method, terminal and computing cluster

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150310082A1 (en) * 2014-04-24 2015-10-29 Luke Qing Han Hadoop olap engine
CN106970929A (en) * 2016-09-08 2017-07-21 阿里巴巴集团控股有限公司 Data lead-in method and device
WO2019161679A1 (en) * 2018-02-26 2019-08-29 众安信息技术服务有限公司 Data processing method and device for use in online analytical processing
US20190303494A1 (en) * 2018-03-30 2019-10-03 American Express Travel Related Services Company, Inc. Node linkage in entity graphs

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GAO JINBIAO ,: "Hive and Hbase Based on Research on Hadoop Distributed File System", INDUSTRIAL CONTROL COMPUTER, no. 12, 1 January 2015 (2015-01-01), pages 44 - 47, XP055800529 *
LIANG CHEN, JING QIU, ZHU YOU-CHAN, GU XUE-PING: "Cloud computing architecture oriented household Internet of things", APPLICATION RESEARCH OF COMPUTERS, vol. 30, no. 12, 1 December 2013 (2013-12-01), pages 3686 - 3689, XP055800531 *
LIU BOWEI, HUANG RUIZHANG: "HBase-Based Storage System for Financial Time Series Data", CHINA SCIENCE PAPER, vol. 11, no. 20, 1 October 2016 (2016-10-01), pages 2387 - 2392, XP055800522 *
LIU XING-PING, LUO XIANG-YUN, YANG HAI: "Querying Research on Efficient Traffic Data Cloud-Indexing Technology Based on HBase", CONTROL ENGINEERING OF CHINA, vol. 23, no. 4, 1 April 2016 (2016-04-01), pages 560 - 564, XP055800517, ISSN: 1671-7848, DOI: 10.14107/j.cnki.kzgc.150049 *
TAN JIE-QING, MAO XI-JUN: "The Structures of Hadoop Cloud Computing Infrastructure and the Integrated Application of HBase and Hive", GUIZHOU SCIENCE, vol. 31, no. 5, 1 January 2013 (2013-01-01), pages 32 - 35, XP055800519 *
TANG CHANG-CHENG, FENG YANG, DONG DAI, MING-MING SUN, XUE-HAI ZHOU: "Research of Data Durable and Available Base on HBase", COMPUTER SYSTEMS & APPLICATIONS, vol. 22, no. 10, 1 January 2013 (2013-01-01), pages 175 - 180, XP055800534 *

Also Published As

Publication number Publication date
CN111090645B (en) 2024-03-01
CN111090645A (en) 2020-05-01

Similar Documents

Publication Publication Date Title
WO2021068351A1 (en) Cloud-storage-based data transmission method and apparatus, and computer device
US9792306B1 (en) Data transfer between dissimilar deduplication systems
US20140304525A1 (en) Key/value storage device and method
US10733061B2 (en) Hybrid data storage system with private storage cloud and public storage cloud
US9426219B1 (en) Efficient multi-part upload for a data warehouse
US10938961B1 (en) Systems and methods for data deduplication by generating similarity metrics using sketch computation
US20140195575A1 (en) Data file handling in a network environment and independent file server
US10762051B1 (en) Reducing hash collisions in large scale data deduplication
US10289310B2 (en) Hybrid data storage system with private storage cloud and public storage cloud
US11422891B2 (en) Global storage solution with logical cylinders and capsules
EP3716580B1 (en) Cloud file transfers using cloud file descriptors
US20200065306A1 (en) Bloom filter partitioning
AU2014353667A1 (en) A method of generating a reference index data structure and method for finding a position of a data pattern in a reference data structure
US20220043778A1 (en) System and method for data compaction and security with extended functionality
US11831343B2 (en) System and method for data compression with encryption
US20220156233A1 (en) Systems and methods for sketch computation
WO2016029441A1 (en) File scanning method and apparatus
US20230283292A1 (en) System and method for data compaction and security with extended functionality
US20210191640A1 (en) Systems and methods for data segment processing
US10152269B2 (en) Method and system for preserving branch cache file data segment identifiers upon volume replication
US20240106457A1 (en) System and method for data compression and encryption using asymmetric codebooks
WO2014165451A2 (en) Key/value storage device and method
CN114254045A (en) Data storage method, device and equipment based on block chain and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19948487

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19948487

Country of ref document: EP

Kind code of ref document: A1