CN111159235A - Data pre-partition method and device, electronic equipment and readable storage medium


Info

Publication number
CN111159235A
CN111159235A (application CN201911321136.1A)
Authority
CN
China
Prior art keywords
data
row key
target
partition
target row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911321136.1A
Other languages
Chinese (zh)
Inventor
李威
覃鹏
刘增文
叶长全
吴仰波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN201911321136.1A priority Critical patent/CN111159235A/en
Publication of CN111159235A publication Critical patent/CN111159235A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2455: Query execution
    • G06F 16/24553: Query execution of query operations
    • G06F 16/24554: Unary operations; Data partitioning operations
    • G06F 16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data pre-partitioning method and apparatus, an electronic device, and a readable storage medium, applied in the field of big data technology. The method acquires target row key data from a target Hive data set through a Spark-SQL engine, enabling large-scale collection of row keys; obtains partition result information from the collected row keys; and pre-partitions an HBase database according to that information, so that large-scale data can be stored uniformly and data skew is avoided. In addition, the Spark-SQL engine is a memory-based big-data processing framework, so data processing is faster than with disk-based MapReduce, and large-scale data can be computed rapidly. Moreover, unlike the hashed-rowkey approach, no row key conversion (escaping) is required when reading data (the hash approach requires it), which effectively improves the speed of data queries.

Description

Data pre-partition method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of big data technologies, and in particular, to a data pre-partitioning method and apparatus, an electronic device, and a readable storage medium.
Background
The financial industry, especially the traditional large enterprises represented by banking, has highly complex business types, many customer index dimensions, a very large customer base, and heavy transaction flow: a 360-degree customer view or a continuous transaction-flow query can span hundreds of dimensions, with volumes reaching hundreds of millions of customers and trillions of records.
The mainstream distributed, column-oriented HBase is commonly used to solve this query problem, but because data is unevenly distributed, hot spots arise under highly concurrent access and transaction performance suffers. A common existing remedy is to build a hashed rowkey so that data is distributed uniformly; however, while a hashed rowkey does optimize data distribution, keys must be converted (escaped) back at query time and the hash itself occupies space, so efficiency is suboptimal and resources are wasted. Reasonable storage of large volumes of data therefore remains a problem.
Disclosure of Invention
The application provides a data pre-partitioning method and apparatus, an electronic device, and a readable storage medium, which realize uniform storage of large-scale data and avoid data skew. The technical scheme adopted by the application is as follows:
in a first aspect, there is provided a data pre-partitioning method, the method comprising,
acquiring target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, wherein the pre-partition information comprises row key table names, row key column names and preset partition numbers;
performing partition iterative computation through a Spark-sql engine based on the target row key data and the preset partition number to obtain partition result information, wherein the partition result information comprises a start row key and an end row key of each partition;
and performing data pre-partitioning on the Hbase database based on the partitioning result information.
Optionally, obtaining target row key data from the target Hive data set by a Spark-sql engine based on the received pre-partition information, including:
determining a target row key table from the target Hive data set based on the row key table name;
and acquiring target row key data from the target row key table based on the row key column names.
Optionally, based on the target row key data and the predetermined partition number, performing partition iterative computation by using a Spark-sql engine to obtain partition result information, including:
performing RDD processing on the target row key data to obtain RDD processed target row key data;
sorting the RDD-processed target row key data to obtain sorted target row key data;
and determining partition result information based on the sorted target row key data and the preset partition number.
Optionally, the manner of acquiring data in the target Hive data set includes at least one of:
extracting target row key data from the Hive tables to obtain a target Hive data set;
and extracting target row key data from at least one service file, storing the obtained target row key data to the HDFS system, and mapping the target row key data stored to the HDFS system into a Hive table for access.
Optionally, the method further comprises:
receiving a storage request of target data to be stored, wherein the target data to be stored comprises a second target row key;
and determining a target partition corresponding to the target data to be stored based on the second target row key, and storing the target data to be stored based on the determined target partition.
In a second aspect, there is provided a data pre-partitioning apparatus, the apparatus comprising,
the acquisition module is used for acquiring target row key data from the target Hive data set through a Spark-sql engine based on the received pre-partition information, and the pre-partition information comprises row key table names, row key column names and preset partition numbers;
the computing module is used for carrying out partition iterative computation through a Spark-sql engine based on the target row key data and the preset partition number to obtain partition result information, and the partition result information comprises a start row key and an end row key of each partition;
and the partitioning module is used for carrying out data pre-partitioning on the Hbase database based on the partitioning result information.
Optionally, the obtaining module includes:
a determining unit, configured to determine a target row key table from the target Hive data set based on the row key table name;
and the acquisition unit is used for acquiring target row key data from the target row key table based on the row key column names.
Optionally, the calculation module comprises:
the RDD processing unit is used for carrying out RDD processing on the target row key data to obtain the target row key data after the RDD processing;
the sorting unit is used for sorting the RDD-processed target row key data to obtain the sorted target row key data;
and the determining unit is used for determining the partition result information based on the sorted target row key data and the preset partition number.
Optionally, the manner of acquiring data in the target Hive data set includes at least one of:
extracting target row key data from the Hive tables to obtain a target Hive data set;
and extracting target row key data from at least one service file, storing the obtained target row key data to the HDFS system, and mapping the target row key data stored to the HDFS system into a Hive table for access.
Optionally, the apparatus further comprises:
the receiving module is used for receiving a storage request of target data to be stored, and the target data to be stored comprises a second target row key;
and the storage module is used for determining a target partition corresponding to the target data to be stored based on the second target row key and storing the target data to be stored based on the determined target partition.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the data pre-partitioning method of the first aspect.
In a fourth aspect, there is provided a computer-readable storage medium for storing computer instructions which, when run on a computer, cause the computer to perform the data pre-partitioning method of the first aspect.
Compared with the prior art in which data distribution is optimized through a hashed rowkey, the data pre-partitioning method and apparatus, electronic device, and readable storage medium of the present application obtain target row key data from a target Hive data set through a Spark-SQL engine based on received pre-partition information (which includes a row key table name, a row key column name, and a predetermined partition number), perform partition iterative computation through the Spark-SQL engine based on the target row key data and the predetermined partition number to obtain partition result information (which includes a start row key and an end row key for each partition), and then pre-partition the HBase database based on that information. Acquiring the target row keys through the Spark-SQL engine enables large-scale collection; deriving the partition result from the collected keys and pre-partitioning HBase accordingly realizes uniform storage of large-scale data and avoids data skew. In addition, the Spark-SQL engine is a memory-based big-data processing framework, faster than disk-based MapReduce, enabling rapid computation over large-scale data. Moreover, unlike the hashed-rowkey approach, no row key conversion (escaping) is needed when reading data, which effectively improves the speed of data queries.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a data pre-partitioning method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a data pre-partitioning apparatus according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another data pre-partition apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Spark: apache Spark is a fast, general-purpose computing engine designed specifically for large-scale data processing. Spark is a similar open source clustered computing environment as Hadoop, but there are some differences between the two that make Spark superior in terms of some workloads, in other words Spark enables memory distributed datasets that, in addition to being able to provide interactive queries, can also optimize iterative workloads.
HBase: HBase is a distributed, column-oriented open-source database; the technology is derived from the Google paper "Bigtable: A Distributed Storage System for Structured Data".
Pre-partitioning: HBase provides a pre-partition function, namely, a user can partition a table according to a certain rule when creating the table.
The embodiment of the application provides a data pre-partition method, which is applied to a cloud server, and as shown in fig. 1, the method can include the following steps:
step S101, acquiring target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, wherein the pre-partition information comprises row key table names, row key column names and preset partition numbers;
specifically, 4 parameters may be introduced, and Spark is called to perform partitioning iterative computation. Wherein the incoming parameters may be: parameter 1 external table name e _ tbl _ rowkey (i.e., row key table name), parameter 2rowkey column name (row key column name), parameter 3 desired uniform partition number (i.e., predetermined partition number), and parameter 4 partition result output path. And the partitioning result output path is used for storing the partitioning result information obtained by iterative computation to the partitioning result output path.
Step S102, performing partition iterative computation through a Spark-sql engine based on target row key data and a preset partition number to obtain partition result information, wherein the partition result information comprises a start row key and an end row key of each partition;
specifically, partition iterative calculation is performed through a Spark-sql engine based on target row key data and a preset partition number to obtain partition result information, wherein the partition result information comprises a start row key and an end row key of each partition. The target row key data can be an identification card number, a bank card number and a transaction account number, and fixed fields such as a customer identification card number and a transaction account number are used as unique marks of customers or assets in industries represented by banking industries and the like, so that the target row key data is ordered and stable in the whole. In the prior art (Hash rowkey), it is assumed that data itself is completely unordered, so pre-partitioning needs to be performed by the Hash rowkey, and in actual production, client information such as a client identity number and a transaction account number is ordered and stable.
Illustratively, the target row key data may be the values 1-100 in an arbitrary order and the predetermined partition number may be 10, in which case the resulting partitions are 1-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, and 91-100.
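The equal-split idea in this example can be sketched in plain Python (a simplification of the patent's Spark computation; the function name and the integer keys are illustrative):

```python
def compute_partition_bounds(row_keys, num_partitions):
    """Sort the collected row keys, then cut the sorted list into
    num_partitions equal slices, returning (start_key, end_key) per partition."""
    keys = sorted(row_keys)
    n = len(keys)
    bounds = []
    for i in range(num_partitions):
        start = keys[i * n // num_partitions]            # first key of slice i
        end = keys[(i + 1) * n // num_partitions - 1]    # last key of slice i
        bounds.append((start, end))
    return bounds

shuffled = list(range(100, 0, -1))   # the values 1..100 in an out-of-order arrangement
bounds = compute_partition_bounds(shuffled, 10)
# first partition covers 1..10, last covers 91..100
```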
And step S103, carrying out data pre-partitioning on the Hbase database based on the partitioning result information.
HBase is a highly reliable, high-performance, column-oriented, scalable distributed database. Unlike a typical relational database, it is suited to unstructured data storage; with HBase, a large unstructured storage cluster can be built on low-cost PC servers, effectively reducing storage costs in a big-data context. Pre-partitioning here is the strategy of collecting all rowkeys globally and then dividing them evenly into N (the predetermined partition number) equal parts, automatically generating the table-creation statement based on HBase's characteristics.
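Rendering such a table-creation statement can be sketched as follows; HBase shell `create` accepts a `SPLITS` clause of boundary keys, and the table and column-family names here are examples, not from the patent:

```python
def build_create_statement(table, family, split_keys):
    """Render an HBase shell `create` statement whose SPLITS clause
    carries the computed boundary row keys."""
    splits = ", ".join("'%s'" % k for k in split_keys)
    return "create '%s', '%s', SPLITS => [%s]" % (table, family, splits)

stmt = build_create_statement("customer_tbl", "cf", ["10", "20", "30"])
```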
Compared with the prior art in which data distribution is optimized through a hashed rowkey, the data pre-partitioning method of this embodiment collects target row keys at scale from the target Hive data set through the Spark-SQL engine, computes partition result information from them iteratively, and pre-partitions the HBase database accordingly, realizing uniform storage of large-scale data and avoiding data skew; because the Spark-SQL engine is a memory-based big-data framework, it computes faster than disk-based MapReduce; and unlike the hashed-rowkey approach, no row key conversion (escaping) is needed when reading data, effectively improving the speed of data queries.
The embodiment of the present application provides a possible implementation manner, and step S101 includes:
step S1011 (not shown in the figure), determining a target row key table from the target Hive data set based on the row key table name;
in step S1012 (not shown in the figure), target row key data is acquired from the target row key table based on the row key column name.
Specifically, a plurality of data tables may be collected in the target Hive data set; the target row key table is determined from the set based on the row key table name, and the target row key data is then obtained from that table based on the row key column name.
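The two lookups (table by name, then column by name) can be sketched with a toy in-memory stand-in for the Hive data set; the table name, column name, and data values below are made up for illustration:

```python
# Toy stand-in for the target Hive data set: table name -> column name -> values.
hive_dataset = {
    "e_tbl_rowkey": {"rowkey": ["6222000000000001", "6222000000000002", "6222000000000003"]},
    "other_tbl": {"id": [1, 2]},
}

def fetch_rowkeys(dataset, table_name, column_name):
    """Mirror steps S1011/S1012: find the target row key table by name,
    then read the row key column from it."""
    table = dataset[table_name]       # S1011: determine the target row key table
    return list(table[column_name])   # S1012: acquire the target row key data

keys = fetch_rowkeys(hive_dataset, "e_tbl_rowkey", "rowkey")
```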
For the embodiment of the application, this solves the problem of obtaining the target row key data.
The embodiment of the present application provides a possible implementation manner, and further, step S102 includes:
step S1021 (not shown in the figure), performing RDD processing on the target row key data to obtain target row key data after RDD processing;
step S1022 (not shown in the figure), perform sorting processing based on the RDD processed target row key data, to obtain sorted target row key data;
in step S1023 (not shown), partition result information is determined based on the sorted target row key data and the predetermined number of partitions.
An RDD (Resilient Distributed Dataset) is divided into many partitions distributed across the nodes of the cluster; the number of partitions determines the granularity of parallel computation on the RDD. A partition is a concept: the old and new partitions before and after a transformation may physically be the same block of memory or storage, an optimization that prevents unbounded growth of memory requirements due to function immutability. A user may call the partitions method to obtain the number of partitions of an RDD, or may set the number explicitly; if none is specified, the default is the number of CPU cores assigned to the program or, when the RDD is created from an HDFS file, the number of data blocks of the file. RDD computation in Spark proceeds partition by partition, with the computation functions composed onto an iterator so that intermediate results need not be materialized at each step. Partition-wise computation is typically done with operations such as mapPartitions, whose input function is applied to each partition, i.e., the contents of each partition are treated as a whole.
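The mapPartitions idea, where the input function receives each partition's contents as one iterator, can be sketched in plain Python (a toy model, not Spark's actual implementation):

```python
def map_partitions(partitions, func):
    """Apply func to each partition's iterator as a whole, the way Spark's
    mapPartitions treats a partition's contents as one unit."""
    for part in partitions:
        yield list(func(iter(part)))

def running_sum(it):
    """Per-partition function: emit the running total within the partition."""
    total = 0
    for x in it:
        total += x
        yield total

parts = [[1, 2, 3], [10, 20]]        # two toy partitions
result = list(map_partitions(parts, running_sum))
# → [[1, 3, 6], [10, 30]]
```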
For the embodiment of the application, the problem of determining the partition result information is solved.
The embodiment of the application provides a possible implementation manner, wherein the acquisition manner of the data in the target Hive data set comprises at least one of the following:
step S104 (not shown in the figure), extracting target row key data from the Hive tables to obtain a target Hive data set;
step S105 (not shown in the figure), extracting target row key data from at least one service file, storing the obtained target row key data in the HDFS system, and mapping the target row key data stored in the HDFS system to Hive table access.
Specifically, the single row key column may be extracted from an existing Hive data table and inserted into the corresponding Hive table.
Such as: insert overhead table e _ tbl _ rowkey selection row column from row business table
Specifically, the target row key data can also be extracted from one or more service files, then uploaded to the HDFS system, and then mapped to Hive table access.
Exemplarily, a rowkey aggregate file is provided manually and uploaded to HDFS at /tmp/rowkey:
hadoop fs -rm -f /tmp/rowkey/*
hadoop fs -put <local file> /tmp/rowkey/
For the embodiment of the application, this solves the problem of acquiring data for the target Hive data set.
The embodiment of the present application provides a possible implementation manner, and the method further includes:
step S106 (not shown in the figure), receiving a storage request of target data to be stored, where the target data to be stored includes a second target row key;
step S107 (not shown in the figure), determining a target partition corresponding to the target data to be stored based on the second target row key, and storing the target data to be stored based on the determined target partition.
For the embodiment of the application, this solves how storage of the target data to be stored is realized.
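Routing an incoming row key to its partition follows from the computed boundaries; a minimal sketch using a binary search over sorted split points (names and values illustrative, and the convention that each split point begins a new partition mirrors HBase region splits):

```python
import bisect

def find_target_partition(split_points, row_key):
    """Return the partition index for row_key, given sorted split points
    where each split point is the first key of a new partition."""
    # bisect_right counts how many split points are <= row_key
    return bisect.bisect_right(split_points, row_key)

splits = [10, 20, 30]                       # 4 partitions: <10, [10,20), [20,30), >=30
partition = find_target_partition(splits, 25)   # keys in [20, 30) go to partition 2
```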
Fig. 2 is a data pre-partitioning apparatus according to an embodiment of the present application, where the apparatus 20 includes: an acquisition module 201, a calculation module 202, and a partitioning module 203, wherein,
an obtaining module 201, configured to obtain target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, where the pre-partition information includes a row key table name, a row key column name, and a predetermined partition number;
a calculating module 202, configured to perform partition iterative calculation through a Spark-sql engine based on the target row key data and the predetermined partition number to obtain partition result information, where the partition result information includes a start row key and an end row key of each partition;
and the partitioning module 203 is used for performing data pre-partitioning on the Hbase database based on the partitioning result information.
Compared with the prior art in which data distribution is optimized through a hashed rowkey, the data pre-partitioning apparatus achieves the same advantages as the method above: target row keys are collected at scale from the target Hive data set through the Spark-SQL engine, partition result information is computed from them iteratively, and the HBase database is pre-partitioned accordingly, realizing uniform storage of large-scale data and avoiding data skew; the memory-based Spark-SQL engine computes faster than disk-based MapReduce; and no row key conversion (escaping) is needed when reading data, effectively improving the speed of data queries.
The data pre-partitioning apparatus of this embodiment may execute the data pre-partitioning method provided in the above embodiments of this application, and the implementation principles thereof are similar, and are not described herein again.
As shown in fig. 3, an embodiment of the present application provides another data pre-partitioning apparatus, where the apparatus 30 includes: an acquisition module 301, a calculation module 302, and a partitioning module 303, wherein,
an obtaining module 301, configured to obtain target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, where the pre-partition information includes a row key table name, a row key column name, and a predetermined partition number;
the acquiring module 301 in fig. 3 has the same or similar function as the acquiring module 201 in fig. 2.
A calculating module 302, configured to perform partition iterative calculation through a Spark-sql engine based on the target row key data and the predetermined partition number to obtain partition result information, where the partition result information includes a start row key and an end row key of each partition;
wherein the computing module 302 in fig. 3 has the same or similar function as the computing module 202 in fig. 2.
And a partitioning module 303, configured to perform data pre-partitioning on the Hbase database based on the partitioning result information.
Wherein the partition module 303 of fig. 3 has the same or similar function as the partition module 203 of fig. 2.
The embodiment of the present application provides a possible implementation manner, and specifically, the obtaining module 301 includes:
a determining unit 3011, configured to determine a target row key table from the target Hive data set based on the row key table name;
an obtaining unit 3012, configured to obtain target row key data from the target row key table based on the row key column name.
For the embodiment of the present application, this solves the problem of obtaining the target row key data.
The embodiment of the present application provides a possible implementation manner, and specifically, the calculating module 302 includes:
the RDD processing unit 3021 performs RDD processing on the target row key data to obtain RDD processed target row key data;
a sorting unit 3022, configured to perform sorting processing based on the RDD processed target row key data to obtain sorted target row key data;
a determination unit 3023 configured to determine partitioning result information based on the sorting-processed target row key data and a predetermined number of partitions.
For the embodiment of the application, the problem of determining the partition result information is solved.
The embodiment of the application provides a possible implementation manner, and the acquisition manner of the data in the target Hive data set comprises at least one of the following:
extracting target row key data from a plurality of Hive tables to obtain a target Hive data set;
and extracting target row key data from at least one service file, storing the obtained target row key data to the HDFS system, and mapping the target row key data stored to the HDFS system into a Hive table for access.
For the embodiment of the application, the problem of acquiring data in the target Hive data set is solved.
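For the second acquisition manner, the row key data already stored on the HDFS system are typically mapped into Hive through an external table. A hedged sketch (the table name, column name, delimiter, and HDFS path are all assumptions for illustration) of the DDL such a mapping would use:

```python
def external_table_ddl(table: str, column: str, hdfs_path: str) -> str:
    """Hive DDL mapping row key files already on HDFS into a table
    that the Spark-sql engine can then read. Names are illustrative."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({column} STRING) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
        f"LOCATION '{hdfs_path}'"
    )

print(external_table_ddl("rowkey_stage", "rowkey", "/data/rowkeys"))
```

Once the external table exists, the same query path used for the first acquisition manner applies unchanged.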
The embodiment of the present application provides a possible implementation manner, and the apparatus 30 further includes:
a receiving module 304, configured to receive a storage request of target data to be stored, where the target data to be stored includes a second target row key;
and the storage module 305 is configured to determine a target partition corresponding to the target data to be stored based on the second target row key, and store the target data to be stored based on the determined target partition.
For the embodiment of the application, the problem of storing the target data to be stored is solved.
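The lookup performed by the storage module 305 can be sketched as a binary search over the start row keys in the partition result information (a simplified illustration; HBase's region location mechanism performs this internally): the target partition for the second target row key is the last partition whose start row key does not exceed it.

```python
import bisect

def locate_partition(start_keys, row_key):
    """start_keys: sorted start row keys of the partitions (from the
    partition result information). Return the index of the partition
    whose range covers row_key."""
    # bisect_right finds the first start key strictly greater than
    # row_key; the row belongs to the partition just before it.
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return max(idx, 0)

starts = ["a", "g", "n", "t"]
print(locate_partition(starts, "house"))  # -> 1 (range "g".."n")
```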
Compared with the prior art, in which data distribution is optimized through hashed rowkeys (Hash rowkey), the data pre-partitioning apparatus obtains target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, where the pre-partition information includes a row key table name, a row key column name and a predetermined partition number; performs partition iterative calculation through the Spark-sql engine based on the target row key data and the predetermined partition number to obtain partition result information, where the partition result information includes a start row key and an end row key of each partition; and then performs data pre-partitioning on the Hbase database based on the partition result information. Acquiring the target row key data from the target Hive data set through the Spark-sql engine enables large-scale collection of the target row key data; deriving the partition result information from the collected keys and pre-partitioning the Hbase database accordingly achieves uniform storage of large-scale data and avoids data skew. In addition, the Spark-sql engine is a memory-based big-data processing framework, so data processing is faster than with disk-based MapReduce, enabling rapid calculation over large-scale data. Moreover, compared with the hashed-rowkey approach, the method and the apparatus do not need to reverse the hash transformation of the row key when reading data (the hashing approach does), which can effectively improve the speed of data queries.
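The contrast with the hashed-rowkey approach can be shown concretely (a small illustrative demo, not code from the application): salting each key with a hash bucket spreads writes, but the stored keys no longer follow the natural key order, so a range scan over the original keys must strip or recompute the prefix; range-based pre-partitioning keeps the plain keys and their order.

```python
import hashlib

def salted(row_key: str, buckets: int = 4) -> str:
    """Hash-salting: prefix each key with its hash bucket. This is the
    prior-art scheme the application contrasts itself with."""
    bucket = int(hashlib.md5(row_key.encode()).hexdigest(), 16) % buckets
    return f"{bucket}_{row_key}"

plain = sorted(["user001", "user002", "user003"])
hashed_store = sorted(salted(k) for k in plain)

# Plain keys keep their natural order, so one contiguous range scan
# suffices; salted keys are ordered by bucket first, so reading a key
# range requires undoing the prefix on every stored key.
print(plain)
print([h.split("_", 1)[1] for h in hashed_store])
```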
The embodiment of the present application provides a data pre-partitioning apparatus, which is suitable for the method shown in the above embodiment, and is not described herein again.
An embodiment of the present application provides an electronic device. As shown in fig. 4, the electronic device 40 includes: a processor 401 and a memory 403, where the processor 401 is coupled to the memory 403, for example via a bus 402. Further, the electronic device 40 may also include a transceiver 404. It should be noted that, in practical applications, the transceiver 404 is not limited to one, and the structure of the electronic device 40 does not constitute a limitation on the embodiment of the present application. In this embodiment, the processor 401 implements the functions of the obtaining module, the calculating module and the partitioning module shown in fig. 2 or fig. 3, as well as the functions of the receiving module and the storing module shown in fig. 3. The transceiver 404 includes a receiver and a transmitter.
The processor 401 may be a CPU, general purpose processor, DSP, ASIC, FPGA or other programmable logic device, transistor logic device, hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 401 may also be a combination of computing functions, e.g., comprising one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 402 may include a path that transfers information between the above components. The bus 402 may be a PCI bus or an EISA bus, etc. The bus 402 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
The memory 403 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disk storage (including compact disk, laser disk, digital versatile disk, Blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 403 is used for storing application program code for executing the scheme of the present application, execution of which is controlled by the processor 401. The processor 401 is configured to execute the application program code stored in the memory 403 to implement the functions of the data pre-partitioning apparatus provided by the embodiments shown in fig. 2 or fig. 3.
Compared with the prior art, in which data distribution is optimized through hashed rowkeys (Hash rowkey), the electronic device obtains target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, where the pre-partition information includes a row key table name, a row key column name and a predetermined partition number; performs partition iterative calculation through the Spark-sql engine based on the target row key data and the predetermined partition number to obtain partition result information, where the partition result information includes a start row key and an end row key of each partition; and then performs data pre-partitioning on the Hbase database based on the partition result information. Acquiring the target row key data from the target Hive data set through the Spark-sql engine enables large-scale collection of the target row key data; deriving the partition result information from the collected keys and pre-partitioning the Hbase database accordingly achieves uniform storage of large-scale data and avoids data skew. In addition, the Spark-sql engine is a memory-based big-data processing framework, so data processing is faster than with disk-based MapReduce, enabling rapid calculation over large-scale data. Moreover, compared with the hashed-rowkey approach, the method and the device do not need to reverse the hash transformation of the row key when reading data (the hashing approach does), which can effectively improve the speed of data queries.
The embodiment of the application provides an electronic device suitable for the method embodiment above, and details are not repeated herein.
The present application provides a computer-readable storage medium, on which a computer program is stored, and when the program is executed by a processor, the method shown in the above embodiments is implemented.
Compared with the prior art, in which data distribution is optimized through hashed rowkeys (Hash rowkey), the embodiment of the application obtains target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, where the pre-partition information includes a row key table name, a row key column name and a predetermined partition number; then performs partition iterative calculation through the Spark-sql engine based on the target row key data and the predetermined partition number to obtain partition result information, where the partition result information includes a start row key and an end row key of each partition; and then performs data pre-partitioning on the Hbase database based on the partition result information. Acquiring the target row key data from the target Hive data set through the Spark-sql engine enables large-scale collection of the target row key data; deriving the partition result information from the collected keys and pre-partitioning the Hbase database accordingly achieves uniform storage of large-scale data and avoids data skew. In addition, the Spark-sql engine is a memory-based big-data processing framework, so data processing is faster than with disk-based MapReduce, enabling rapid calculation over large-scale data. Moreover, compared with the hashed-rowkey approach, the embodiment does not need to reverse the hash transformation of the row key when reading data (the hashing approach does), which can effectively improve the speed of data queries.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and not necessarily in sequence; they may be performed in turns or alternately with other steps, or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present application, and these modifications and improvements shall also fall within the protection scope of the present application.

Claims (10)

1. A method for pre-partitioning data, comprising:
acquiring target row key data from a target Hive data set through a Spark-sql engine based on received pre-partition information, wherein the pre-partition information comprises row key table names, row key column names and preset partition numbers;
performing partition iterative computation through the Spark-sql engine based on the target row key data and the preset partition number to obtain partition result information, wherein the partition result information comprises a start row key and an end row key of each partition;
and performing data pre-partitioning on the Hbase database based on the partitioning result information.
2. The method of claim 1, wherein obtaining target row key data from a target Hive dataset by a Spark-sql engine based on the received pre-partition information comprises:
determining a target row key table from a target Hive data set based on the row key table name;
and acquiring target row key data from the target row key table based on the row key column name.
3. The method according to claim 1 or 2, wherein performing partition iterative computation by the Spark-sql engine based on the target row key data and a predetermined partition number to obtain partition result information comprises:
performing RDD processing on the target row key data to obtain RDD processed target row key data;
sorting the RDD-processed target row key data to obtain sorted target row key data;
and determining partition result information based on the sorted target row key data and the preset partition number.
4. The method of claim 1, wherein the data in the target Hive dataset is obtained in a manner that comprises at least one of:
extracting target row key data from a plurality of Hive tables to obtain a target Hive data set;
and extracting target row key data from at least one service file, storing the obtained target row key data to the HDFS system, and mapping the target row key data stored to the HDFS system into a Hive table for access.
5. The method according to any one of claims 1-4, characterized in that the method further comprises:
receiving a storage request of target data to be stored, wherein the target data to be stored comprises a second target row key;
and determining a target partition corresponding to the target data to be stored based on the second target row key, and storing the target data to be stored based on the determined target partition.
6. A data pre-partitioning apparatus, comprising:
the acquisition module is used for acquiring target row key data from the target Hive data set through a Spark-sql engine based on received pre-partition information, wherein the pre-partition information comprises row key table names, row key column names and preset partition numbers;
the calculation module is used for performing partition iterative calculation through the Spark-sql engine based on the target row key data and the preset partition number to obtain partition result information, wherein the partition result information comprises a start row key and an end row key of each partition;
and the partitioning module is used for carrying out data pre-partitioning on the Hbase database based on the partitioning result information.
7. The apparatus of claim 6, wherein the obtaining module comprises:
a determining unit, configured to determine a target row key table from the target Hive data set based on the row key table name;
and the acquisition unit is used for acquiring target row key data from the target row key table based on the row key column names.
8. The apparatus of claim 6, wherein the computing module comprises:
the RDD processing unit is used for carrying out RDD processing on the target row key data to obtain the target row key data after the RDD processing;
the sorting unit is used for carrying out sorting processing on the basis of the RDD processed target row key data to obtain sorted target row key data;
and the determining unit is used for determining the partition result information based on the sorted target row key data and the preset partition number.
9. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to: performing the data pre-partitioning method of any one of claims 1 to 5.
10. A computer-readable storage medium for storing computer instructions which, when executed on a computer, cause the computer to perform the data pre-partitioning method of any one of claims 1 to 5.
CN201911321136.1A 2019-12-20 2019-12-20 Data pre-partition method and device, electronic equipment and readable storage medium Pending CN111159235A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911321136.1A CN111159235A (en) 2019-12-20 2019-12-20 Data pre-partition method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911321136.1A CN111159235A (en) 2019-12-20 2019-12-20 Data pre-partition method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111159235A true CN111159235A (en) 2020-05-15

Family

ID=70557559

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911321136.1A Pending CN111159235A (en) 2019-12-20 2019-12-20 Data pre-partition method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111159235A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737347A (en) * 2020-06-15 2020-10-02 中国工商银行股份有限公司 Method and device for sequentially segmenting data on Spark platform
CN112233727A (en) * 2020-10-29 2021-01-15 北京诺禾致源科技股份有限公司 Data partition storage method and device
CN112905628A (en) * 2021-03-26 2021-06-04 第四范式(北京)技术有限公司 Data processing method and device
CN112905854A (en) * 2021-03-05 2021-06-04 北京中经惠众科技有限公司 Data processing method and device, computing equipment and storage medium
CN113177090A (en) * 2021-04-30 2021-07-27 中国邮政储蓄银行股份有限公司 Data processing method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166589A1 (en) * 2011-12-23 2013-06-27 Daniel Baeumges Split processing paths for a database calculation engine
CN105550293A (en) * 2015-12-11 2016-05-04 深圳市华讯方舟软件技术有限公司 Background refreshing method based on Spark-SQL big data processing platform
CN107885779A (en) * 2017-10-12 2018-04-06 北京人大金仓信息技术股份有限公司 A kind of method of Spark concurrent accesses MPP databases
CN109902101A (en) * 2019-02-18 2019-06-18 国家计算机网络与信息安全管理中心 Transparent partition method and device based on SparkSQL
CN110046176A (en) * 2019-04-28 2019-07-23 南京大学 A kind of querying method of the large-scale distributed DataFrame based on Spark
CN110109923A (en) * 2019-04-04 2019-08-09 北京市天元网络技术股份有限公司 Storage method, analysis method and the device of time series data
CN110175175A (en) * 2019-05-29 2019-08-27 大连大学 Secondary index and range query algorithm between a kind of distributed space based on SPARK
CN110502471A (en) * 2019-07-31 2019-11-26 联想(北京)有限公司 A kind of data processing method and electronic equipment

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166589A1 (en) * 2011-12-23 2013-06-27 Daniel Baeumges Split processing paths for a database calculation engine
CN105550293A (en) * 2015-12-11 2016-05-04 深圳市华讯方舟软件技术有限公司 Background refreshing method based on Spark-SQL big data processing platform
CN107885779A (en) * 2017-10-12 2018-04-06 北京人大金仓信息技术股份有限公司 A kind of method of Spark concurrent accesses MPP databases
CN109902101A (en) * 2019-02-18 2019-06-18 国家计算机网络与信息安全管理中心 Transparent partition method and device based on SparkSQL
CN110109923A (en) * 2019-04-04 2019-08-09 北京市天元网络技术股份有限公司 Storage method, analysis method and the device of time series data
CN110046176A (en) * 2019-04-28 2019-07-23 南京大学 A kind of querying method of the large-scale distributed DataFrame based on Spark
CN110175175A (en) * 2019-05-29 2019-08-27 大连大学 Secondary index and range query algorithm between a kind of distributed space based on SPARK
CN110502471A (en) * 2019-07-31 2019-11-26 联想(北京)有限公司 A kind of data processing method and electronic equipment

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737347A (en) * 2020-06-15 2020-10-02 中国工商银行股份有限公司 Method and device for sequentially segmenting data on Spark platform
CN111737347B (en) * 2020-06-15 2024-02-13 中国工商银行股份有限公司 Method and device for sequentially segmenting data on Spark platform
CN112233727A (en) * 2020-10-29 2021-01-15 北京诺禾致源科技股份有限公司 Data partition storage method and device
CN112233727B (en) * 2020-10-29 2024-01-26 北京诺禾致源科技股份有限公司 Data partition storage method and device
CN112905854A (en) * 2021-03-05 2021-06-04 北京中经惠众科技有限公司 Data processing method and device, computing equipment and storage medium
CN112905628A (en) * 2021-03-26 2021-06-04 第四范式(北京)技术有限公司 Data processing method and device
CN112905628B (en) * 2021-03-26 2024-01-02 第四范式(北京)技术有限公司 Data processing method and device
CN113177090A (en) * 2021-04-30 2021-07-27 中国邮政储蓄银行股份有限公司 Data processing method and device

Similar Documents

Publication Publication Date Title
CN111159235A (en) Data pre-partition method and device, electronic equipment and readable storage medium
Park et al. Parallel computation of skyline and reverse skyline queries using mapreduce
Kolb et al. Load balancing for mapreduce-based entity resolution
KR101700340B1 (en) System and method for analyzing cluster result of mass data
US8868576B1 (en) Storing files in a parallel computing system based on user-specified parser function
CN106407207B (en) Real-time newly-added data updating method and device
CN112287182A (en) Graph data storage and processing method and device and computer storage medium
US9986018B2 (en) Method and system for a scheduled map executor
Ferraro Petrillo et al. Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics
CN104111936A (en) Method and system for querying data
CN103440246A (en) Intermediate result data sequencing method and system for MapReduce
US11221890B2 (en) Systems and methods for dynamic partitioning in distributed environments
Tao et al. Clustering massive small data for IOT
US10326824B2 (en) Method and system for iterative pipeline
US10599614B1 (en) Intersection-based dynamic blocking
US20170371892A1 (en) Systems and methods for dynamic partitioning in distributed environments
US10048991B2 (en) System and method for parallel processing data blocks containing sequential label ranges of series data
Slagter et al. SmartJoin: a network-aware multiway join for MapReduce
CN109726219A (en) The method and terminal device of data query
US20190196783A1 (en) Data shuffling with hierarchical tuple spaces
Packiaraj et al. Hypar-fca: a distributed framework based on hybrid partitioning for fca
CN108319604A (en) The associated optimization method of size table in a kind of hive
US11061736B2 (en) Multiple parallel reducer types in a single map-reduce job
Rasel et al. Summarized bit batch-based triangle listing in massive graphs
US10891274B2 (en) Data shuffling with hierarchical tuple spaces

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20220926

Address after: 25 Financial Street, Xicheng District, Beijing 100033

Applicant after: CHINA CONSTRUCTION BANK Corp.

Address before: 25 Financial Street, Xicheng District, Beijing 100033

Applicant before: CHINA CONSTRUCTION BANK Corp.

Applicant before: Jianxin Financial Science and Technology Co.,Ltd.