CN111159235A - Data pre-partition method and device, electronic equipment and readable storage medium


Info

Publication number
CN111159235A
CN111159235A (application CN201911321136.1A)
Authority
CN
China
Prior art keywords
data
row key
target
partition
target row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911321136.1A
Other languages
Chinese (zh)
Inventor
李威
覃鹏
刘增文
叶长全
吴仰波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN201911321136.1A priority Critical patent/CN111159235A/en
Publication of CN111159235A publication Critical patent/CN111159235A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2455: Query execution
    • G06F 16/24553: Query execution of query operations
    • G06F 16/24554: Unary operations; Data partitioning operations
    • G06F 16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data pre-partitioning method and apparatus, an electronic device, and a readable storage medium, applied in the field of big data technology. The method acquires target row key data from a target Hive data set through a Spark-SQL engine, enabling large-scale collection of row keys; obtains partition result information from the collected row keys; and pre-partitions an HBase database according to that information, so that large-scale data can be stored uniformly and data skew is avoided. In addition, the Spark-SQL engine is a memory-based big-data processing framework, so data processing is faster than with disk-based MapReduce, and large-scale data can be computed rapidly. Moreover, unlike the hashed-rowkey approach, no row key conversion (escaping) is required when reading data (the hash approach requires it), which effectively improves the speed of data queries.

Description

Data pre-partition method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of big data technologies, and in particular, to a data pre-partitioning method and apparatus, an electronic device, and a readable storage medium.
Background
The financial industry, especially the traditional large enterprises represented by banking, has highly complex business types, many customer index dimensions, a very large customer base, and heavy transaction flow: a 360-degree customer view or a continuous transaction-flow query can span hundreds of dimensions, with volumes reaching hundreds of millions of customers and trillions of records.
The mainstream distributed, column-oriented HBase is commonly used to solve this query problem, but because data is unevenly distributed, hot spots arise under highly concurrent access and transaction performance suffers. A common existing remedy is to build a hashed rowkey so that data is distributed uniformly; however, while a hashed rowkey does optimize data distribution, keys must be converted (escaped) back at query time and the hash itself occupies space, so efficiency is suboptimal and resources are wasted. Reasonable storage of large volumes of data therefore remains a problem.
Disclosure of Invention
The application provides a data pre-partitioning method and apparatus, an electronic device, and a readable storage medium, which realize uniform storage of large-scale data and avoid data skew. The technical scheme adopted by the application is as follows:
in a first aspect, there is provided a data pre-partitioning method, the method comprising,
acquiring target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, wherein the pre-partition information comprises row key table names, row key column names and preset partition numbers;
performing partition iterative computation through a Spark-sql engine based on the target row key data and the preset partition number to obtain partition result information, wherein the partition result information comprises a start row key and an end row key of each partition;
and performing data pre-partitioning on the Hbase database based on the partitioning result information.
Optionally, obtaining target row key data from the target Hive data set by a Spark-sql engine based on the received pre-partition information, including:
determining a target row key table from the target Hive data set based on the row key table name;
and acquiring target row key data from the target row key table based on the row key column names.
Optionally, based on the target row key data and the predetermined partition number, performing partition iterative computation by using a Spark-sql engine to obtain partition result information, including:
performing RDD processing on the target row key data to obtain RDD processed target row key data;
sorting the RDD-processed target row key data to obtain sorted target row key data;
and determining partition result information based on the sorted target row key data and the preset partition number.
Optionally, the manner of acquiring data in the target Hive data set includes at least one of:
extracting target row key data from the Hive tables to obtain a target Hive data set;
and extracting target row key data from at least one service file, storing the obtained target row key data to the HDFS system, and mapping the target row key data stored to the HDFS system into a Hive table for access.
Optionally, the method further comprises:
receiving a storage request of target data to be stored, wherein the target data to be stored comprises a second target row key;
and determining a target partition corresponding to the target data to be stored based on the second target row key, and storing the target data to be stored based on the determined target partition.
In a second aspect, there is provided a data pre-partitioning apparatus, the apparatus comprising,
the acquisition module is used for acquiring target row key data from the target Hive data set through a Spark-sql engine based on the received pre-partition information, and the pre-partition information comprises row key table names, row key column names and preset partition numbers;
the computing module is used for carrying out partition iterative computation through a Spark-sql engine based on the target row key data and the preset partition number to obtain partition result information, and the partition result information comprises a start row key and an end row key of each partition;
and the partitioning module is used for carrying out data pre-partitioning on the Hbase database based on the partitioning result information.
Optionally, the obtaining module includes:
a determining unit, configured to determine a target row key table from the target Hive data set based on the row key table name;
and the acquisition unit is used for acquiring target row key data from the target row key table based on the row key column names.
Optionally, the calculation module comprises:
the RDD processing unit is used for carrying out RDD processing on the target row key data to obtain the target row key data after the RDD processing;
the sorting unit is used for sorting the RDD-processed target row key data to obtain the sorted target row key data;
and the determining unit is used for determining the partition result information based on the sorted target row key data and the preset partition number.
Optionally, the manner of acquiring data in the target Hive data set includes at least one of:
extracting target row key data from the Hive tables to obtain a target Hive data set;
and extracting target row key data from at least one service file, storing the obtained target row key data to the HDFS system, and mapping the target row key data stored to the HDFS system into a Hive table for access.
Optionally, the apparatus further comprises:
the receiving module is used for receiving a storage request of target data to be stored, and the target data to be stored comprises a second target row key;
and the storage module is used for determining a target partition corresponding to the target data to be stored based on the second target row key and storing the target data to be stored based on the determined target partition.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the data pre-partitioning method of the first aspect.
In a fourth aspect, there is provided a computer-readable storage medium for storing computer instructions which, when run on a computer, cause the computer to perform the data pre-partitioning method of the first aspect.
Compared with the prior art in which data distribution is optimized through a hashed rowkey, the data pre-partitioning method and apparatus, electronic device, and readable storage medium of the present application obtain target row key data from a target Hive data set through a Spark-SQL engine based on received pre-partition information (which includes a row key table name, a row key column name, and a predetermined partition number), perform partition iterative computation through the Spark-SQL engine based on the target row key data and the predetermined partition number to obtain partition result information (which includes a start row key and an end row key for each partition), and then pre-partition the HBase database based on that information. Acquiring the target row keys through the Spark-SQL engine enables large-scale collection; deriving the partition result from the collected keys and pre-partitioning HBase accordingly realizes uniform storage of large-scale data and avoids data skew. In addition, the Spark-SQL engine is a memory-based big-data processing framework, faster than disk-based MapReduce, enabling rapid computation over large-scale data. Moreover, unlike the hashed-rowkey approach, no row key conversion (escaping) is needed when reading data, which effectively improves the speed of data queries.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a data pre-partitioning method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a data pre-partitioning apparatus according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another data pre-partition apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Spark: apache Spark is a fast, general-purpose computing engine designed specifically for large-scale data processing. Spark is a similar open source clustered computing environment as Hadoop, but there are some differences between the two that make Spark superior in terms of some workloads, in other words Spark enables memory distributed datasets that, in addition to being able to provide interactive queries, can also optimize iterative workloads.
HBase: HBase is a distributed, column-oriented open-source database; the technology is derived from the Google paper "Bigtable: A Distributed Storage System for Structured Data".
Pre-partitioning: HBase provides a pre-partition function, namely, a user can partition a table according to a certain rule when creating the table.
The embodiment of the application provides a data pre-partition method, which is applied to a cloud server, and as shown in fig. 1, the method can include the following steps:
step S101, acquiring target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, wherein the pre-partition information comprises row key table names, row key column names and preset partition numbers;
specifically, 4 parameters may be introduced, and Spark is called to perform partitioning iterative computation. Wherein the incoming parameters may be: parameter 1 external table name e _ tbl _ rowkey (i.e., row key table name), parameter 2rowkey column name (row key column name), parameter 3 desired uniform partition number (i.e., predetermined partition number), and parameter 4 partition result output path. And the partitioning result output path is used for storing the partitioning result information obtained by iterative computation to the partitioning result output path.
Step S102, performing partition iterative computation through a Spark-sql engine based on target row key data and a preset partition number to obtain partition result information, wherein the partition result information comprises a start row key and an end row key of each partition;
specifically, partition iterative calculation is performed through a Spark-sql engine based on target row key data and a preset partition number to obtain partition result information, wherein the partition result information comprises a start row key and an end row key of each partition. The target row key data can be an identification card number, a bank card number and a transaction account number, and fixed fields such as a customer identification card number and a transaction account number are used as unique marks of customers or assets in industries represented by banking industries and the like, so that the target row key data is ordered and stable in the whole. In the prior art (Hash rowkey), it is assumed that data itself is completely unordered, so pre-partitioning needs to be performed by the Hash rowkey, and in actual production, client information such as a client identity number and a transaction account number is ordered and stable.
Illustratively, the target row key data may be the values 1-100 in an arbitrary order and the predetermined partition number may be 10, in which case the resulting partitions are 1-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, and 91-100.
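The equal-split idea in this example can be sketched in plain Python (a simplification of the patent's Spark computation; the function name and the integer keys are illustrative):

```python
def compute_partition_bounds(row_keys, num_partitions):
    """Sort the collected row keys, then cut the sorted list into
    num_partitions equal slices, returning (start_key, end_key) per partition."""
    keys = sorted(row_keys)
    n = len(keys)
    bounds = []
    for i in range(num_partitions):
        start = keys[i * n // num_partitions]            # first key of slice i
        end = keys[(i + 1) * n // num_partitions - 1]    # last key of slice i
        bounds.append((start, end))
    return bounds

shuffled = list(range(100, 0, -1))   # the values 1..100 in an out-of-order arrangement
bounds = compute_partition_bounds(shuffled, 10)
# first partition covers 1..10, last covers 91..100
```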
And step S103, carrying out data pre-partitioning on the Hbase database based on the partitioning result information.
HBase is a highly reliable, high-performance, column-oriented, scalable distributed database. Unlike a typical relational database, it is suited to unstructured data storage; with HBase, a large unstructured storage cluster can be built on low-cost PC servers, effectively reducing storage costs in a big-data context. Pre-partitioning here is the strategy of collecting all rowkeys globally and then dividing them evenly into N (the predetermined partition number) equal parts, automatically generating the table-creation statement based on HBase's characteristics.
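Rendering such a table-creation statement can be sketched as follows; HBase shell `create` accepts a `SPLITS` clause of boundary keys, and the table and column-family names here are examples, not from the patent:

```python
def build_create_statement(table, family, split_keys):
    """Render an HBase shell `create` statement whose SPLITS clause
    carries the computed boundary row keys."""
    splits = ", ".join("'%s'" % k for k in split_keys)
    return "create '%s', '%s', SPLITS => [%s]" % (table, family, splits)

stmt = build_create_statement("customer_tbl", "cf", ["10", "20", "30"])
```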
Compared with the prior art in which data distribution is optimized through a hashed rowkey, the data pre-partitioning method of this embodiment collects target row keys at scale from the target Hive data set through the Spark-SQL engine, computes partition result information from them iteratively, and pre-partitions the HBase database accordingly, realizing uniform storage of large-scale data and avoiding data skew; because the Spark-SQL engine is a memory-based big-data framework, it computes faster than disk-based MapReduce; and unlike the hashed-rowkey approach, no row key conversion (escaping) is needed when reading data, effectively improving the speed of data queries.
The embodiment of the present application provides a possible implementation manner, and step S101 includes:
step S1011 (not shown in the figure), determining a target row key table from the target Hive data set based on the row key table name;
in step S1012 (not shown in the figure), target row key data is acquired from the target row key table based on the row key column name.
Specifically, a plurality of data tables may be collected in the target Hive data set; the target row key table is determined from the set based on the row key table name, and the target row key data is then obtained from that table based on the row key column name.
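The two lookups (table by name, then column by name) can be sketched with a toy in-memory stand-in for the Hive data set; the table name, column name, and data values below are made up for illustration:

```python
# Toy stand-in for the target Hive data set: table name -> column name -> values.
hive_dataset = {
    "e_tbl_rowkey": {"rowkey": ["6222000000000001", "6222000000000002", "6222000000000003"]},
    "other_tbl": {"id": [1, 2]},
}

def fetch_rowkeys(dataset, table_name, column_name):
    """Mirror steps S1011/S1012: find the target row key table by name,
    then read the row key column from it."""
    table = dataset[table_name]       # S1011: determine the target row key table
    return list(table[column_name])   # S1012: acquire the target row key data

keys = fetch_rowkeys(hive_dataset, "e_tbl_rowkey", "rowkey")
```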
For the embodiment of the application, this solves the problem of obtaining the target row key data.
The embodiment of the present application provides a possible implementation manner, and further, step S102 includes:
step S1021 (not shown in the figure), performing RDD processing on the target row key data to obtain target row key data after RDD processing;
step S1022 (not shown in the figure), perform sorting processing based on the RDD processed target row key data, to obtain sorted target row key data;
in step S1023 (not shown), partition result information is determined based on the sorted target row key data and the predetermined number of partitions.
An RDD (Resilient Distributed Dataset) is divided into many partitions distributed across the nodes of the cluster; the number of partitions determines the granularity of parallel computation on the RDD. A partition is a concept: the old and new partitions before and after a transformation may physically be the same block of memory or storage, an optimization that prevents unbounded growth of memory requirements due to function immutability. A user may call the partitions method to obtain the number of partitions of an RDD, or may set the number explicitly; if none is specified, the default is the number of CPU cores assigned to the program or, when the RDD is created from an HDFS file, the number of data blocks of the file. RDD computation in Spark proceeds partition by partition, with the computation functions composed onto an iterator so that intermediate results need not be materialized at each step. Partition-wise computation is typically done with operations such as mapPartitions, whose input function is applied to each partition, i.e., the contents of each partition are treated as a whole.
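The mapPartitions idea, where the input function receives each partition's contents as one iterator, can be sketched in plain Python (a toy model, not Spark's actual implementation):

```python
def map_partitions(partitions, func):
    """Apply func to each partition's iterator as a whole, the way Spark's
    mapPartitions treats a partition's contents as one unit."""
    for part in partitions:
        yield list(func(iter(part)))

def running_sum(it):
    """Per-partition function: emit the running total within the partition."""
    total = 0
    for x in it:
        total += x
        yield total

parts = [[1, 2, 3], [10, 20]]        # two toy partitions
result = list(map_partitions(parts, running_sum))
# → [[1, 3, 6], [10, 30]]
```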
For the embodiment of the application, the problem of determining the partition result information is solved.
The embodiment of the application provides a possible implementation manner, wherein the acquisition manner of the data in the target Hive data set comprises at least one of the following:
step S104 (not shown in the figure), extracting target row key data from the Hive tables to obtain a target Hive data set;
step S105 (not shown in the figure), extracting target row key data from at least one service file, storing the obtained target row key data in the HDFS system, and mapping the target row key data stored in the HDFS system to Hive table access.
Specifically, the single row key column may be extracted from an existing Hive data table and inserted into the corresponding Hive table.
Such as: insert overhead table e _ tbl _ rowkey selection row column from row business table
Specifically, the target row key data can also be extracted from one or more service files, then uploaded to the HDFS system, and then mapped to Hive table access.
Exemplarily, a rowkey aggregate file is provided manually and uploaded to HDFS at /tmp/rowkey:
hadoop fs -rm -f /tmp/rowkey/*
hadoop fs -put <local file> /tmp/rowkey/
For the embodiment of the application, this solves the problem of acquiring data for the target Hive data set.
The embodiment of the present application provides a possible implementation manner, and the method further includes:
step S106 (not shown in the figure), receiving a storage request of target data to be stored, where the target data to be stored includes a second target row key;
step S107 (not shown in the figure), determining a target partition corresponding to the target data to be stored based on the second target row key, and storing the target data to be stored based on the determined target partition.
For the embodiment of the application, this solves how storage of the target data to be stored is realized.
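Routing an incoming row key to its partition follows from the computed boundaries; a minimal sketch using a binary search over sorted split points (names and values illustrative, and the convention that each split point begins a new partition mirrors HBase region splits):

```python
import bisect

def find_target_partition(split_points, row_key):
    """Return the partition index for row_key, given sorted split points
    where each split point is the first key of a new partition."""
    # bisect_right counts how many split points are <= row_key
    return bisect.bisect_right(split_points, row_key)

splits = [10, 20, 30]                       # 4 partitions: <10, [10,20), [20,30), >=30
partition = find_target_partition(splits, 25)   # keys in [20, 30) go to partition 2
```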
Fig. 2 is a data pre-partitioning apparatus according to an embodiment of the present application, where the apparatus 20 includes: an acquisition module 201, a calculation module 202, and a partitioning module 203, wherein,
an obtaining module 201, configured to obtain target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, where the pre-partition information includes a row key table name, a row key column name, and a predetermined partition number;
a calculating module 202, configured to perform partition iterative calculation through a Spark-sql engine based on the target row key data and the predetermined partition number to obtain partition result information, where the partition result information includes a start row key and an end row key of each partition;
and the partitioning module 203 is used for performing data pre-partitioning on the Hbase database based on the partitioning result information.
Compared with the prior art in which data distribution is optimized through a hashed rowkey, the data pre-partitioning apparatus achieves the same advantages as the method above: target row keys are collected at scale from the target Hive data set through the Spark-SQL engine, partition result information is computed from them iteratively, and the HBase database is pre-partitioned accordingly, realizing uniform storage of large-scale data and avoiding data skew; the memory-based Spark-SQL engine computes faster than disk-based MapReduce; and no row key conversion (escaping) is needed when reading data, effectively improving the speed of data queries.
The data pre-partitioning apparatus of this embodiment may execute the data pre-partitioning method provided in the above embodiments of this application, and the implementation principles thereof are similar, and are not described herein again.
As shown in fig. 3, an embodiment of the present application provides another data pre-partitioning apparatus, where the apparatus 30 includes: an acquisition module 301, a calculation module 302, and a partitioning module 303, wherein,
an obtaining module 301, configured to obtain target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, where the pre-partition information includes a row key table name, a row key column name, and a predetermined partition number;
the acquiring module 301 in fig. 3 has the same or similar function as the acquiring module 201 in fig. 2.
A calculating module 302, configured to perform partition iterative calculation through a Spark-sql engine based on the target row key data and the predetermined partition number to obtain partition result information, where the partition result information includes a start row key and an end row key of each partition;
wherein the computing module 302 in fig. 3 has the same or similar function as the computing module 202 in fig. 2.
And a partitioning module 303, configured to perform data pre-partitioning on the Hbase database based on the partitioning result information.
Wherein the partition module 303 of fig. 3 has the same or similar function as the partition module 203 of fig. 2.
The embodiment of the present application provides a possible implementation manner, and specifically, the obtaining module 301 includes:
a determining unit 3011, configured to determine a target row key table from the target Hive data set based on the row key table name;
an obtaining unit 3012, configured to obtain target row key data from the target row key table based on the row key column name.
For the embodiment of the present application, this solves the problem of obtaining the target row key data.
The embodiment of the present application provides a possible implementation manner, and specifically, the calculating module 302 includes:
the RDD processing unit 3021 performs RDD processing on the target row key data to obtain RDD processed target row key data;
a sorting unit 3022, configured to perform sorting processing based on the RDD processed target row key data to obtain sorted target row key data;
a determination unit 3023 configured to determine partitioning result information based on the sorting-processed target row key data and a predetermined number of partitions.
For the embodiment of the application, the problem of determining the partition result information is solved.
The embodiment of the application provides a possible implementation manner, and the acquisition manner of the data in the target Hive data set comprises at least one of the following:
extracting target row key data from a plurality of Hive tables to obtain a target Hive data set;
and extracting target row key data from at least one service file, storing the obtained target row key data to the HDFS system, and mapping the target row key data stored to the HDFS system into a Hive table for access.
For the embodiment of the application, the problem of acquiring data in the target Hive data set is solved.
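For the second acquisition manner, the row key data already stored on the HDFS system are typically mapped into Hive through an external table. A hedged sketch (the table name, column name, delimiter, and HDFS path are all assumptions for illustration) of the DDL such a mapping would use:

```python
def external_table_ddl(table: str, column: str, hdfs_path: str) -> str:
    """Hive DDL mapping row key files already on HDFS into a table
    that the Spark-sql engine can then read. Names are illustrative."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({column} STRING) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
        f"LOCATION '{hdfs_path}'"
    )

print(external_table_ddl("rowkey_stage", "rowkey", "/data/rowkeys"))
```

Once the external table exists, the same query path used for the first acquisition manner applies unchanged.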
The embodiment of the present application provides a possible implementation manner, and the apparatus 30 further includes:
a receiving module 304, configured to receive a storage request of target data to be stored, where the target data to be stored includes a second target row key;
and the storage module 305 is configured to determine a target partition corresponding to the target data to be stored based on the second target row key, and store the target data to be stored based on the determined target partition.
For the embodiment of the application, the problem of storing the target data to be stored is solved.
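The lookup performed by the storage module 305 can be sketched as a binary search over the start row keys in the partition result information (a simplified illustration; HBase's region location mechanism performs this internally): the target partition for the second target row key is the last partition whose start row key does not exceed it.

```python
import bisect

def locate_partition(start_keys, row_key):
    """start_keys: sorted start row keys of the partitions (from the
    partition result information). Return the index of the partition
    whose range covers row_key."""
    # bisect_right finds the first start key strictly greater than
    # row_key; the row belongs to the partition just before it.
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return max(idx, 0)

starts = ["a", "g", "n", "t"]
print(locate_partition(starts, "house"))  # -> 1 (range "g".."n")
```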
Compared with the prior art, in which data distribution is optimized through hashed rowkeys (Hash rowkey), the data pre-partitioning apparatus obtains target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, where the pre-partition information includes a row key table name, a row key column name and a predetermined partition number; performs partition iterative calculation through the Spark-sql engine based on the target row key data and the predetermined partition number to obtain partition result information, where the partition result information includes a start row key and an end row key of each partition; and then performs data pre-partitioning on the Hbase database based on the partition result information. Acquiring the target row key data from the target Hive data set through the Spark-sql engine enables large-scale collection of the target row key data; deriving the partition result information from the collected keys and pre-partitioning the Hbase database accordingly achieves uniform storage of large-scale data and avoids data skew. In addition, the Spark-sql engine is a memory-based big-data processing framework, so data processing is faster than with disk-based MapReduce, enabling rapid calculation over large-scale data. Moreover, compared with the hashed-rowkey approach, the method and the apparatus do not need to reverse the hash transformation of the row key when reading data (the hashing approach does), which can effectively improve the speed of data queries.
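The contrast with the hashed-rowkey approach can be shown concretely (a small illustrative demo, not code from the application): salting each key with a hash bucket spreads writes, but the stored keys no longer follow the natural key order, so a range scan over the original keys must strip or recompute the prefix; range-based pre-partitioning keeps the plain keys and their order.

```python
import hashlib

def salted(row_key: str, buckets: int = 4) -> str:
    """Hash-salting: prefix each key with its hash bucket. This is the
    prior-art scheme the application contrasts itself with."""
    bucket = int(hashlib.md5(row_key.encode()).hexdigest(), 16) % buckets
    return f"{bucket}_{row_key}"

plain = sorted(["user001", "user002", "user003"])
hashed_store = sorted(salted(k) for k in plain)

# Plain keys keep their natural order, so one contiguous range scan
# suffices; salted keys are ordered by bucket first, so reading a key
# range requires undoing the prefix on every stored key.
print(plain)
print([h.split("_", 1)[1] for h in hashed_store])
```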
The embodiment of the present application provides a data pre-partitioning apparatus, which is suitable for the method shown in the above embodiment, and is not described herein again.
An embodiment of the present application provides an electronic device. As shown in fig. 4, the electronic device 40 includes: a processor 401 and a memory 403, where the processor 401 is coupled to the memory 403, for example via a bus 402. Further, the electronic device 40 may also include a transceiver 404. It should be noted that, in practical applications, the transceiver 404 is not limited to one, and the structure of the electronic device 40 does not constitute a limitation on the embodiment of the present application. In this embodiment, the processor 401 implements the functions of the obtaining module, the calculating module and the partitioning module shown in fig. 2 or fig. 3, as well as the functions of the receiving module and the storing module shown in fig. 3. The transceiver 404 includes a receiver and a transmitter.
The processor 401 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 401 may also be a combination of computing functions, e.g., comprising one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 402 may include a path that transfers information between the above components. The bus 402 may be a PCI bus or an EISA bus, etc. The bus 402 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
The memory 403 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage (including compact disk, laser disk, digital versatile disk, Blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 403 is used for storing application program code for executing the scheme of the present application, execution of which is controlled by the processor 401. The processor 401 is configured to execute the application program code stored in the memory 403 to implement the functions of the data pre-partitioning apparatus provided by the embodiments shown in fig. 2 or fig. 3.
Compared with the prior art, in which data distribution is optimized through hashed rowkeys (Hash rowkey), the electronic device obtains target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, where the pre-partition information includes a row key table name, a row key column name and a predetermined partition number; performs partition iterative calculation through the Spark-sql engine based on the target row key data and the predetermined partition number to obtain partition result information, where the partition result information includes a start row key and an end row key of each partition; and then performs data pre-partitioning on the Hbase database based on the partition result information. Acquiring the target row key data from the target Hive data set through the Spark-sql engine enables large-scale collection of the target row key data; deriving the partition result information from the collected keys and pre-partitioning the Hbase database accordingly achieves uniform storage of large-scale data and avoids data skew. In addition, the Spark-sql engine is a memory-based big-data processing framework, so data processing is faster than with disk-based MapReduce, enabling rapid calculation over large-scale data. Moreover, compared with the hashed-rowkey approach, the method and the device do not need to reverse the hash transformation of the row key when reading data (the hashing approach does), which can effectively improve the speed of data queries.
The embodiment of the application provides an electronic device suitable for the method embodiment above, and details are not repeated herein.
The present application provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method shown in the above embodiments is implemented.
Compared with the prior art, in which data distribution is optimized through hashed rowkeys (Hash rowkey), the embodiment of the application obtains target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, where the pre-partition information includes a row key table name, a row key column name and a predetermined partition number; then performs partition iterative calculation through the Spark-sql engine based on the target row key data and the predetermined partition number to obtain partition result information, where the partition result information includes a start row key and an end row key of each partition; and then performs data pre-partitioning on the Hbase database based on the partition result information. Acquiring the target row key data from the target Hive data set through the Spark-sql engine enables large-scale collection of the target row key data; deriving the partition result information from the collected keys and pre-partitioning the Hbase database accordingly achieves uniform storage of large-scale data and avoids data skew. In addition, the Spark-sql engine is a memory-based big-data processing framework, so data processing is faster than with disk-based MapReduce, enabling rapid calculation over large-scale data. Moreover, compared with the hashed-rowkey approach, the embodiment does not need to reverse the hash transformation of the row key when reading data (the hashing approach does), which can effectively improve the speed of data queries.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and not necessarily in sequence; they may be performed in turns or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present application, and these modifications and improvements shall also fall within the protection scope of the present application.

Claims (10)

1. A method for pre-partitioning data, comprising:
acquiring target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, wherein the pre-partition information comprises row key table names, row key column names and preset partition numbers;
performing partition iterative computation through the Spark-sql engine based on the target row key data and the preset partition number to obtain partition result information, wherein the partition result information comprises a start row key and an end row key of each partition;
and performing data pre-partitioning on the Hbase database based on the partitioning result information.
2. The method of claim 1, wherein obtaining target row key data from a target Hive dataset by a Spark-sql engine based on the received pre-partition information comprises:
determining a target row key table from a target Hive data set based on the row key table name;
and acquiring target row key data from the target row key table based on the row key column name.
3. The method according to claim 1 or 2, wherein performing partition iterative computation by the Spark-sql engine based on the target row key data and a predetermined partition number to obtain partition result information comprises:
performing RDD processing on the target row key data to obtain RDD processed target row key data;
sorting the RDD-processed target row key data to obtain sorted target row key data;
and determining partition result information based on the sorted target row key data and the preset partition number.
4. The method of claim 1, wherein the data in the target Hive dataset is obtained in a manner that comprises at least one of:
extracting target row key data from a plurality of Hive tables to obtain a target Hive data set;
and extracting target row key data from at least one service file, storing the obtained target row key data to the HDFS system, and mapping the target row key data stored to the HDFS system into a Hive table for access.
5. The method according to any one of claims 1-4, characterized in that the method further comprises:
receiving a storage request of target data to be stored, wherein the target data to be stored comprises a second target row key;
and determining a target partition corresponding to the target data to be stored based on the second target row key, and storing the target data to be stored based on the determined target partition.
6. A data pre-partitioning apparatus, comprising:
the acquisition module is used for acquiring target row key data from the target Hive data set through a Spark-sql engine based on received pre-partition information, wherein the pre-partition information comprises row key table names, row key column names and preset partition numbers;
the calculation module is used for performing partition iterative calculation through the Spark-sql engine based on the target row key data and the preset partition number to obtain partition result information, wherein the partition result information comprises a start row key and an end row key of each partition;
and the partitioning module is used for carrying out data pre-partitioning on the Hbase database based on the partitioning result information.
7. The apparatus of claim 6, wherein the obtaining module comprises:
a determining unit, configured to determine a target row key table from the target Hive data set based on the row key table name;
and the acquisition unit is used for acquiring target row key data from the target row key table based on the row key column names.
8. The apparatus of claim 6, wherein the computing module comprises:
the RDD processing unit is used for carrying out RDD processing on the target row key data to obtain the target row key data after the RDD processing;
the sorting unit is used for carrying out sorting processing on the basis of the RDD processed target row key data to obtain sorted target row key data;
and the determining unit is used for determining the partition result information based on the sorted target row key data and the preset partition number.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: performing the data pre-partitioning method of any one of claims 1 to 5.
10. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the data pre-partitioning method of any one of claims 1 to 5.
CN201911321136.1A 2019-12-20 2019-12-20 Data pre-partition method and device, electronic equipment and readable storage medium Pending CN111159235A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911321136.1A CN111159235A (en) 2019-12-20 2019-12-20 Data pre-partition method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911321136.1A CN111159235A (en) 2019-12-20 2019-12-20 Data pre-partition method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111159235A true CN111159235A (en) 2020-05-15

Family

ID=70557559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911321136.1A Pending CN111159235A (en) 2019-12-20 2019-12-20 Data pre-partition method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111159235A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737347A (en) * 2020-06-15 2020-10-02 中国工商银行股份有限公司 Method and device for sequentially segmenting data on Spark platform
CN112233727A (en) * 2020-10-29 2021-01-15 北京诺禾致源科技股份有限公司 Data partition storage method and device
CN112905628A (en) * 2021-03-26 2021-06-04 第四范式(北京)技术有限公司 Data processing method and device
CN112905854A (en) * 2021-03-05 2021-06-04 北京中经惠众科技有限公司 Data processing method and device, computing equipment and storage medium
CN113177090A (en) * 2021-04-30 2021-07-27 中国邮政储蓄银行股份有限公司 Data processing method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166589A1 (en) * 2011-12-23 2013-06-27 Daniel Baeumges Split processing paths for a database calculation engine
CN105550293A (en) * 2015-12-11 2016-05-04 深圳市华讯方舟软件技术有限公司 Background refreshing method based on Spark-SQL big data processing platform
CN107885779A (en) * 2017-10-12 2018-04-06 北京人大金仓信息技术股份有限公司 A kind of method of Spark concurrent accesses MPP databases
CN109902101A (en) * 2019-02-18 2019-06-18 国家计算机网络与信息安全管理中心 Transparent partition method and device based on SparkSQL
CN110046176A (en) * 2019-04-28 2019-07-23 南京大学 A kind of querying method of the large-scale distributed DataFrame based on Spark
CN110109923A (en) * 2019-04-04 2019-08-09 北京市天元网络技术股份有限公司 Storage method, analysis method and the device of time series data
CN110175175A (en) * 2019-05-29 2019-08-27 大连大学 Secondary index and range query algorithm between a kind of distributed space based on SPARK
CN110502471A (en) * 2019-07-31 2019-11-26 联想(北京)有限公司 A kind of data processing method and electronic equipment

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166589A1 (en) * 2011-12-23 2013-06-27 Daniel Baeumges Split processing paths for a database calculation engine
CN105550293A (en) * 2015-12-11 2016-05-04 深圳市华讯方舟软件技术有限公司 Background refreshing method based on Spark-SQL big data processing platform
CN107885779A (en) * 2017-10-12 2018-04-06 北京人大金仓信息技术股份有限公司 A kind of method of Spark concurrent accesses MPP databases
CN109902101A (en) * 2019-02-18 2019-06-18 国家计算机网络与信息安全管理中心 Transparent partition method and device based on SparkSQL
CN110109923A (en) * 2019-04-04 2019-08-09 北京市天元网络技术股份有限公司 Storage method, analysis method and the device of time series data
CN110046176A (en) * 2019-04-28 2019-07-23 南京大学 A kind of querying method of the large-scale distributed DataFrame based on Spark
CN110175175A (en) * 2019-05-29 2019-08-27 大连大学 Secondary index and range query algorithm between a kind of distributed space based on SPARK
CN110502471A (en) * 2019-07-31 2019-11-26 联想(北京)有限公司 A kind of data processing method and electronic equipment

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737347A (en) * 2020-06-15 2020-10-02 中国工商银行股份有限公司 Method and device for sequentially segmenting data on Spark platform
CN111737347B (en) * 2020-06-15 2024-02-13 中国工商银行股份有限公司 Method and device for sequentially segmenting data on Spark platform
CN112233727A (en) * 2020-10-29 2021-01-15 北京诺禾致源科技股份有限公司 Data partition storage method and device
CN112233727B (en) * 2020-10-29 2024-01-26 北京诺禾致源科技股份有限公司 Data partition storage method and device
CN112905854A (en) * 2021-03-05 2021-06-04 北京中经惠众科技有限公司 Data processing method and device, computing equipment and storage medium
CN112905628A (en) * 2021-03-26 2021-06-04 第四范式(北京)技术有限公司 Data processing method and device
CN112905628B (en) * 2021-03-26 2024-01-02 第四范式(北京)技术有限公司 Data processing method and device
CN113177090A (en) * 2021-04-30 2021-07-27 中国邮政储蓄银行股份有限公司 Data processing method and device

Similar Documents

Publication Publication Date Title
CN111159235A (en) Data pre-partition method and device, electronic equipment and readable storage medium
Park et al. Parallel computation of skyline and reverse skyline queries using mapreduce
Kolb et al. Load balancing for mapreduce-based entity resolution
KR101700340B1 (en) System and method for analyzing cluster result of mass data
US8868576B1 (en) Storing files in a parallel computing system based on user-specified parser function
CN106407207B (en) Real-time newly-added data updating method and device
CN112287182A (en) Graph data storage and processing method and device and computer storage medium
US9986018B2 (en) Method and system for a scheduled map executor
Ferraro Petrillo et al. Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics
CN104111936A (en) Method and system for querying data
CN103440246A (en) Intermediate result data sequencing method and system for MapReduce
US11221890B2 (en) Systems and methods for dynamic partitioning in distributed environments
Tao et al. Clustering massive small data for IOT
US10326824B2 (en) Method and system for iterative pipeline
US10599614B1 (en) Intersection-based dynamic blocking
US20170371892A1 (en) Systems and methods for dynamic partitioning in distributed environments
US10048991B2 (en) System and method for parallel processing data blocks containing sequential label ranges of series data
Slagter et al. SmartJoin: a network-aware multiway join for MapReduce
CN109726219A (en) The method and terminal device of data query
US20190196783A1 (en) Data shuffling with hierarchical tuple spaces
Packiaraj et al. Hypar-fca: a distributed framework based on hybrid partitioning for fca
CN108319604A (en) The associated optimization method of size table in a kind of hive
US11061736B2 (en) Multiple parallel reducer types in a single map-reduce job
Rasel et al. Summarized bit batch-based triangle listing in massive graphs
US10891274B2 (en) Data shuffling with hierarchical tuple spaces

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20220926

Address after: 25 Financial Street, Xicheng District, Beijing 100033

Applicant after: CHINA CONSTRUCTION BANK Corp.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.