CN111427887A - Method, device and system for rapidly scanning HBase partition table

Info

Publication number: CN111427887A
Application number: CN202010188346.4A
Authority: CN
Inventors: 刘智鑫; 蔡苗; 陈震宇; 刘国华
Original assignee: Postal Savings Bank of China Ltd
Current assignee: Postal Savings Bank of China Ltd
Priority date: 2020-03-17
Filing date: 2020-03-17
Publication date: 2020-07-17

Abstract

The embodiments of the application disclose a method, a device and a system for rapidly scanning an HBase partition table. The method comprises the following steps: pre-partitioning the HBase data table to obtain a plurality of physical partitions; partitioning the Spark RDD according to the number of physical partitions to obtain the same number of logical partitions, and establishing a mapping relation between the logical partitions and the physical partitions so that each logical partition maps to its corresponding physical partition; and, when Spark runs, allocating a SCAN object to each physical partition to scan the HBase data table in parallel. Because the HBase data table is pre-partitioned and a SCAN object is established for each pre-partition, the data of each partition can be read in parallel and the HBase partition table can be scanned rapidly.

Description

Method, device and system for rapidly scanning HBase partition table

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a method, a device and a system for rapidly scanning an HBase partition table.

Background

In practical applications, besides querying a single record, a user may need to scan an entire HBase data table, that is, to perform a table scan operation. Indexes, however, do not accelerate table scan operations.

In general, a table scan operation must apply its filtering query to the full table from beginning to end. It may involve specific aggregation operations on metrics, such as Count and Sum, and may also involve reading the full table data row by row. HBase currently supports two main operations for obtaining data from a data table, GET and SCAN: a GET object obtains a single record, and a SCAN object scans the data in a specified range.
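The difference between the two operations can be illustrated with a minimal sketch. It does not call the HBase client API; it assumes a toy in-memory table sorted by row key, and the names `ROWS`, `get` and `scan` are hypothetical. The scan treats the start key as inclusive and the stop key as exclusive, which matches the default behaviour of an HBase scan.

```python
from bisect import bisect_left

# Toy stand-in for an HBase table: rows kept sorted by row key.
# (Hypothetical data; a real client would call the HBase API instead.)
ROWS = [("row01", 10), ("row02", 20), ("row03", 30), ("row04", 40)]
KEYS = [key for key, _ in ROWS]


def get(row_key):
    """GET: return the single record stored under row_key, or None."""
    i = bisect_left(KEYS, row_key)
    if i < len(KEYS) and KEYS[i] == row_key:
        return ROWS[i]
    return None


def scan(start_key, stop_key):
    """SCAN: return every record with start_key <= key < stop_key."""
    lo = bisect_left(KEYS, start_key)
    hi = bisect_left(KEYS, stop_key)
    return ROWS[lo:hi]


print(get("row02"))            # ('row02', 20)
print(scan("row02", "row04"))  # [('row02', 20), ('row03', 30)]
```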

At present, when reading data from an HBase data table, a Spark client mainly obtains the data by scanning and usually generates only one SCAN object. That SCAN object must scan all Region partitions of the HBase data table sequentially, one after another, which results in a slow scanning speed. The current method therefore neither makes good use of the distributed processing capability of Spark nor fully exploits the Region partitioning of data tables in HBase.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a method, a device and a system for rapidly scanning an HBase partition table.

The embodiment of the invention provides the following specific technical scheme:

In a first aspect, the present invention provides a method for rapidly scanning an HBase partition table, where the method includes:

pre-partitioning the HBase data table to obtain a plurality of physical partitions;

partitioning the RDD of the Spark according to the number of the physical partitions to obtain the logical partitions with the same number as the physical partitions, and establishing a mapping relation between the logical partitions and the physical partitions so as to map each logical partition to a corresponding physical partition;

and when Spark runs, allocating a SCAN object to each physical partition to realize parallel scanning of the HBase data table.

Preferably, the method further comprises:

creating a task for each of said logical partitions;

and running the task to process the scanning result of the corresponding SCAN scanning object so as to realize the parallel processing of the HBase data table.

Preferably, the pre-partitioning the HBase data table to obtain a plurality of physical partitions specifically includes:

calculating the data volume to be processed;

and equally dividing the HBase data table according to the data volume to obtain a plurality of continuous physical partitions.

Preferably, equally dividing the HBase data table according to the data size to obtain a plurality of continuous physical partitions specifically includes:

equally dividing the HBase data table according to the range identified by the row key of the HBase data table to obtain a plurality of continuous physical partitions.

Preferably, the pre-partitioning the HBase data table to obtain a plurality of physical partitions specifically includes:

and dividing the HBase data table according to the historical data change trend to obtain a plurality of continuous physical partitions.

Preferably, the allocating one SCAN object to each physical partition specifically includes:

acquiring a starting primary key and an ending primary key of each physical partition;

generating a SCAN object having the same start primary key and end primary key according to the start primary key and end primary key of each physical partition.

In a second aspect, the present invention provides an apparatus for rapidly scanning an HBase partition table, where the apparatus includes:

the first partitioning module is used for pre-partitioning the HBase data table to obtain a plurality of physical partitions;

the second partitioning module is used for partitioning the RDD of the Spark according to the number of the physical partitions to obtain the logical partitions with the same number as the physical partitions;

the mapping module is used for establishing a mapping relation between the logical partitions and the physical partitions so that each logical partition is mapped to the corresponding physical partition;

and the first allocation module is used for allocating a SCAN scanning object to each physical partition when Spark runs so as to realize parallel scanning of the HBase data table.

Preferably, the apparatus further comprises:

a second allocation module for creating a task for each of said logical partitions;

and the operation module is used for operating the task to process the scanning result of the corresponding SCAN scanning object so as to realize the parallel processing of the HBase data table.

Preferably, the first partitioning module specifically includes:

the calculation module is used for calculating the data volume to be processed;

and the dividing module is used for equally dividing the HBase data table according to the data volume to obtain a plurality of continuous physical partitions.

Preferably, the dividing module is specifically configured to equally divide the HBase data table according to a range identified by a row key of the HBase data table to obtain a plurality of continuous physical partitions.

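Preferably, the first partitioning module is specifically configured to divide the HBase data table according to the historical data change trend to obtain a plurality of continuous physical partitions.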

Preferably, the first allocation module specifically includes:

the acquisition module is used for acquiring a starting primary key and an ending primary key of each physical partition;

and the generation module is used for generating a SCAN object with the same starting primary key and ending primary key according to the starting primary key and ending primary key of each physical partition.

In a third aspect, the present invention provides a computer system comprising:

one or more processors; and

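a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:

pre-partitioning the HBase data table to obtain a plurality of physical partitions;

partitioning the RDD of the Spark according to the number of the physical partitions to obtain the logical partitions with the same number as the physical partitions, and establishing a mapping relation between the logical partitions and the physical partitions so as to map each logical partition to a corresponding physical partition;

and when Spark runs, allocating a SCAN object to each physical partition to realize parallel scanning of the HBase data table.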

The embodiment of the invention has the following beneficial effects:

The HBase data table is pre-partitioned, and a SCAN object is established for each pre-partition of the HBase data table, so that the data of each partition can be read in parallel and the HBase partition table can be scanned quickly.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a schematic diagram illustrating a correspondence relationship between Partition in Spark and Region in HBase in the prior art according to an embodiment of the present application;

Fig. 2 is a flowchart of a method for fast scanning an HBase partition table according to an embodiment of the present application;

Fig. 3 is a schematic diagram of a correspondence relationship between a Partition in Spark and a Region in HBase according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of an apparatus for rapidly scanning an HBase partition table according to a second embodiment of the present application;

Fig. 5 is a schematic structural diagram of a computer system according to a third embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

As shown in Fig. 1, when reading data from an HBase data table, a Spark client mainly obtains the data by scanning and usually generates only one SCAN object. That SCAN object must scan all Region partitions of the HBase data table sequentially, one after another, which results in a slow scanning speed. The current method therefore neither makes good use of the distributed processing capability of Spark nor fully exploits the Region partitioning of data tables in HBase. Based on this, the application provides a method for rapidly scanning an HBase partition table. As shown in Fig. 2, the method allows a Spark client to read the HBase partition table in parallel, thereby achieving rapid scanning of an HBase data table.

The method for rapidly scanning the HBase partition table specifically comprises the following steps:

S1, pre-partitioning the HBase data table to obtain a plurality of physical partitions.

Step S1 specifically comprises the following steps:

S11, calculating the data volume to be processed;

S12, equally dividing the HBase data table according to the data volume to obtain a plurality of continuous physical partitions.

Specifically, the step S12 may be:

equally dividing the HBase data table according to the range identified by the row key of the HBase data table to obtain a plurality of continuous physical partitions.

The physical partition refers to a Region of the HBase data table.

For example, if the range identified by the row key (Rowkey) is 0 to 19, the HBase data table can be divided into two physical partitions, 0 to 9 and 10 to 19. The number and the range of the physical partitions can be defined according to actual requirements.
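The equal division described above can be sketched as follows. This is only an illustration under simplifying assumptions: row keys are treated as integers in an inclusive range, and the function name `split_equally` is hypothetical. When the range does not divide evenly, the sketch gives the leading partitions one extra key each; the source does not say how a remainder should be handled.

```python
def split_equally(first_key, last_key, num_partitions):
    """Split the inclusive integer key range [first_key, last_key] into
    num_partitions contiguous, near-equal inclusive sub-ranges."""
    total = last_key - first_key + 1
    if num_partitions < 1 or num_partitions > total:
        raise ValueError("num_partitions must be between 1 and the key count")
    base, extra = divmod(total, num_partitions)
    ranges, start = [], first_key
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges


print(split_equally(0, 19, 2))  # [(0, 9), (10, 19)]
```

With the range 0 to 19 and two partitions, the result reproduces the two partitions of the example.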

Alternatively, the above step S1 can be implemented as follows:

dividing the HBase data table according to the historical data change trend to obtain a plurality of continuous physical partitions.

For example, for a certain bank, the traffic from 9 am to 11 am is small and the traffic from 1 pm to 3 pm is large. When partitioning, the historical data change trend can therefore be taken into account: the range identified by the Rowkey for 1 pm to 3 pm can be set to twice that for 9 am to 11 am, thereby increasing the processing speed.
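One way to express such an uneven division is to size each partition in proportion to a weight. The sketch below is an assumption-laden illustration, not the procedure of the application: the source does not state how weights are derived from historical data, integer row keys are assumed, and the name `split_by_weights` is hypothetical.

```python
def split_by_weights(first_key, last_key, weights):
    """Split the inclusive integer key range [first_key, last_key] into
    contiguous sub-ranges whose sizes are proportional to weights.

    Assumes positive integer weights and a key range that is large
    compared with the number of weights, so no sub-range is empty.
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive numbers")
    total = last_key - first_key + 1
    weight_sum = sum(weights)
    ranges, start, accumulated = [], first_key, 0
    for w in weights:
        accumulated += w
        # Cumulative boundary, so the last range always ends at last_key.
        end = first_key + (total * accumulated) // weight_sum - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Hypothetical weights 1 and 2: the second range is twice the first.
print(split_by_weights(0, 29, [1, 2]))  # [(0, 9), (10, 29)]
```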

S2, partitioning the Spark RDD according to the number of the physical partitions to obtain the same number of logical partitions as physical partitions, and establishing a mapping relation between the logical partitions and the physical partitions so that each logical partition is mapped to the corresponding physical partition.

An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel. An RDD has the characteristics of a data flow model: automatic fault tolerance, location-aware scheduling, and scalability. An RDD allows a user to explicitly cache a working set in memory when executing a plurality of queries, so that subsequent queries can reuse the working set, which greatly improves the query speed of data.

The RDD data structure in Spark supports the concept of a logical Partition. The RDD is therefore partitioned based on the pre-partitioned HBase data table to obtain the same number of logical partitions (Partitions) as the HBase data table has physical partitions (Regions).
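The one-to-one mapping between logical and physical partitions can be pictured with a small data structure. The sketch below is plain Python rather than Spark code; `Region`, `LogicalPartition` and `map_partitions` are hypothetical names chosen for illustration, and the Region boundaries are made-up sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A physical partition of the HBase table, bounded by row keys."""
    start_key: str
    end_key: str


@dataclass(frozen=True)
class LogicalPartition:
    """A logical partition that knows which Region it maps to."""
    index: int
    region: Region


def map_partitions(regions):
    """Create exactly one logical partition per physical Region."""
    return [LogicalPartition(i, region) for i, region in enumerate(regions)]


regions = [Region("row00", "row10"), Region("row10", "row20")]
for partition in map_partitions(regions):
    print(partition.index, "->", partition.region)
```

Keeping the Region inside the logical partition means that whichever task later handles partition `i` can find its row-key bounds without a separate lookup.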

S3, when Spark runs, allocating a SCAN object to each physical partition to realize parallel scanning of the HBase data table.

In this embodiment, the correspondence between Partition in Spark and Region in HBase is shown in Fig. 3.

Allocating a SCAN object to each physical partition specifically includes:

1. acquiring a starting primary key and an ending primary key of each physical partition;

2. generating a SCAN object having the same start primary key and end primary key according to the start primary key and end primary key of each physical partition.
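The two steps above can be sketched as follows. The sketch does not use the HBase client library: `ScanSpec` is a hypothetical stand-in for a SCAN object, and `scans_for_regions` is an illustrative name. It assumes that each Region's end key equals the next Region's start key, which is what "continuous" partitions imply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSpec:
    """Hypothetical stand-in for a SCAN object bounded by two row keys."""
    start_row: str  # first row key covered by the scan
    stop_row: str   # row key at which the scan ends


def scans_for_regions(region_bounds):
    """Build one ScanSpec per region from its (start_key, end_key) pair."""
    specs = []
    for i, (start_key, end_key) in enumerate(region_bounds):
        # Continuous partitions: each region starts where the previous ended.
        if i > 0 and region_bounds[i - 1][1] != start_key:
            raise ValueError("regions are not contiguous")
        specs.append(ScanSpec(start_row=start_key, stop_row=end_key))
    return specs


bounds = [("row00", "row10"), ("row10", "row20")]
for spec in scans_for_regions(bounds):
    print(spec)
```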

In this manner, each physical partition (Region) of the HBase data table can be scanned in parallel. When machine resources are abundant, a larger number of partitions naturally gives a faster parallel scan; if the size of each partition is set reasonably, the time consumed by scanning can be kept within a constant range.

S4, creating a task for each logical partition.

S5, running the task to process the scanning result of the corresponding SCAN object so as to realize the parallel processing of the HBase data table.
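Steps S3 to S5 can be illustrated end to end with a minimal sketch. It replaces Spark and HBase with stand-ins: a thread pool plays the role of the Spark executors, a dictionary plays the role of the table, and `TABLE`, `SCAN_RANGES`, `scan`, `task` and `run_parallel` are all hypothetical names. Each task scans only its own range and returns a partial Count and Sum, which are then combined.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy table: integer row key -> value (hypothetical sample data).
TABLE = {key: key * 2 for key in range(20)}
# One [start, stop) scan range per physical partition.
SCAN_RANGES = [(0, 10), (10, 20)]


def scan(start, stop):
    """Return the rows whose key lies in [start, stop)."""
    return [(k, TABLE[k]) for k in range(start, stop) if k in TABLE]


def task(scan_range):
    """Process one partition: partial Count and Sum of its scan result."""
    rows = scan(*scan_range)
    return len(rows), sum(value for _, value in rows)


def run_parallel(scan_ranges):
    """Run one task per partition concurrently and merge the partials."""
    with ThreadPoolExecutor(max_workers=len(scan_ranges)) as pool:
        partials = list(pool.map(task, scan_ranges))
    return sum(c for c, _ in partials), sum(s for _, s in partials)


print(run_parallel(SCAN_RANGES))  # (20, 380)
```

Because the scan ranges do not overlap, the partial results can be merged by simple addition, without any coordination between tasks.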

The scheme can be applied to the training of a calculation model to shorten the training time of the calculation model.

The method specifically comprises the following steps:

1. storing the newly added features of a calculation model into a data table of HBase, wherein a column in the data table represents a feature of the calculation model;

2. scanning the data table of the HBase according to the method for rapidly scanning the HBase partition table to obtain all features of the calculation model;

3. retraining the calculation model according to all the acquired features.

Thus, the training capability of the calculation model can be improved.

Example two

Corresponding to the above embodiment, as shown in Fig. 4, the present application further provides an apparatus for rapidly scanning an HBase partition table, including:

a first partitioning module 21, configured to pre-partition the HBase data table to obtain a plurality of physical partitions;

the second partitioning module 22 is configured to partition the RDDs of the Spark according to the number of the physical partitions, so as to obtain logical partitions whose number is the same as that of the physical partitions;

a mapping module 23, configured to establish a mapping relationship between the logical partitions and the physical partitions so that each logical partition is mapped to a corresponding physical partition;

and the first allocation module 24 is configured to allocate one SCAN object to each physical partition when Spark is running, so as to implement parallel scanning on the HBase data table.

Preferably, the above apparatus further comprises:

a second allocating module 25, configured to create a task for each logical partition;

and the running module 26 is configured to run a task to process a scanning result of the corresponding SCAN object so as to implement parallel processing on the HBase data table.

Preferably, the first partitioning module 21 specifically includes:

a calculating module 211, configured to calculate a data amount to be processed;

and the dividing module 212 is configured to equally divide the HBase data table according to the data amount to obtain a plurality of continuous physical partitions.

Preferably, the dividing module 212 is specifically configured to equally divide the HBase data table according to a range identified by a row key of the HBase data table, so as to obtain a plurality of continuous physical partitions.

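Preferably, the first partitioning module 21 is specifically configured to divide the HBase data table according to the historical data change trend to obtain a plurality of continuous physical partitions.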

Preferably, the first allocation module 24 specifically includes:

an obtaining module 241, configured to obtain a start primary key and an end primary key of each physical partition;

a generating module 242, configured to generate a SCAN object having the same start primary key and end primary key according to the start primary key and end primary key of each physical partition.

Example three

The present application further provides a computer system comprising:

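one or more processors; and

a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:

pre-partitioning the HBase data table to obtain a plurality of physical partitions;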

partitioning the RDD of the Spark according to the number of the physical partitions to obtain the logical partitions with the same number as the physical partitions, and establishing a mapping relation between the logical partitions and the physical partitions so as to map each logical partition to the corresponding physical partition;

and when Spark runs, allocating a SCAN scanning object to each physical partition to realize parallel scanning of the HBase data table.

FIG. 5 illustrates an architecture of a computer system that may include, in particular, a processor 32, a video display adapter 34, a disk drive 36, an input/output interface 38, a network interface 310, and a memory 312. The processor 32, video display adapter 34, disk drive 36, input/output interface 38, network interface 310, and memory 312 may be communicatively coupled via a communication bus 314.

The processor 32 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided in the present application.

The memory 312 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 312 may store an operating system 316 for controlling the operation of the computer system 30 and a Basic Input Output System (BIOS) 318 for controlling low-level operations of the computer system. In addition, a web browser 320, a data storage management system 322, and the like may also be stored. In short, when the technical solution provided by the present application is implemented by software or firmware, the relevant program code is stored in the memory 312 and invoked by the processor 32 for execution.

The input/output interface 38 is used for connecting an input/output module to input and output information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.

The network interface 310 is used for connecting a communication module (not shown in the figure) to realize communication interaction between the device and other devices. The communication module can communicate in a wired mode (such as USB or network cable) or in a wireless mode (such as mobile network, Wi-Fi or Bluetooth).

Communication bus 314 includes a path to transfer information between the various components of the device, such as processor 32, video display adapter 34, disk drive 36, input/output interface 38, network interface 310, and memory 312.

In addition, the computer system can also obtain the information of specific receiving conditions from the virtual resource object receiving condition information database for condition judgment and the like.

It should be noted that although the above-described device only shows the processor 32, the video display adapter 34, the disk drive 36, the input/output interface 38, the network interface 310, the memory 312, the communication bus 314, etc., in a specific implementation, the device may also include other components necessary for normal operation.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a cloud server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention. In addition, the computer system, the apparatus for rapidly scanning the HBase partition table, and the method for rapidly scanning the HBase partition table provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the method embodiments and are not described herein again.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for rapidly scanning an HBase partition table is characterized by comprising the following steps:

2. The method of claim 1, further comprising:

creating a task for each of said logical partitions;

3. The method according to claim 1, wherein pre-partitioning the HBase data table to obtain a plurality of physical partitions specifically comprises:

calculating the data volume to be processed;

4. The method according to claim 3, wherein equally dividing the HBase data table according to the data volume to obtain a plurality of continuous physical partitions specifically comprises:

5. The method according to claim 1, wherein pre-partitioning the HBase data table to obtain a plurality of physical partitions specifically comprises:

6. The method according to any one of claims 1 to 5, wherein the allocating one SCAN object to each physical partition specifically comprises:

7. An apparatus for rapidly scanning HBase partition table, comprising:

8. The apparatus of claim 7, further comprising:

9. The apparatus of claim 7, wherein the first partition module specifically comprises:

the calculation module is used for calculating the data volume to be processed;

10. A computer system, comprising:

one or more processors; and