CN113434608A - Data processing method and device for Hive data warehouse

Info

Publication number: CN113434608A
Application number: CN202110762070.0A
Authority: CN
Inventors: 朱阿龙; 田林; 张亚泽; 何聪聪; 豆敏娟; 刘琦; 张靖羚; 石慧彪; 刘宇琦
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2021-07-06
Filing date: 2021-07-06
Publication date: 2021-09-24

Abstract

The invention provides a data processing method and device for a Hive data warehouse, applied in the field of big data. The method comprises: establishing, according to the information of a Hive formal table, a temporary intermediate table that has the same structure as the Hive formal table but a different name; importing the data to be stored into the temporary intermediate table, and reading the data from the temporary intermediate table into the Hadoop distributed file system (HDFS); and sequentially writing the file in each partition, the minimum unit of a resilient distributed dataset (RDD), in HDFS into the Hive formal table. Because the data is imported into HDFS from the constructed intermediate table and split, partition by partition, into small files that can be imported into the Hive formal table, the method is simple; compared with the prior-art approach of reading data content line by line, it stores the data more quickly and is less prone to data loss and duplication.

Description

Data processing method and device for Hive data warehouse

Technical Field

The invention relates to the technical field of big data, in particular to a data processing method and device of a Hive data warehouse.

Background

In a big data environment, performance testing of an application often requires data volumes of millions, tens of millions, or even hundreds of millions of records. Developing applications for a big data environment also requires attention to basic performance concerns such as parallelism and execution efficiency. However, upstream systems or historical stock data sometimes contain large data files of several gigabytes or even tens of gigabytes. Under a big data framework, such large files reduce the parallelism of program execution, lower execution efficiency, and prevent the characteristics of the framework from being fully exploited. In addition, an excessively large single file consumes excessive resources and, if resources are not sufficient, may cause program execution to fail.

For good application performance in the Spark and Hadoop big data ecosystem, the data files stored in Hive tables and in Hadoop are therefore preferably about 128 MB in size. A large file thus cannot be stored directly: a program must read its content line by line, in sequence, and write it out as small files that can be stored. This operation is complex and wastes considerable time in development and testing, and reading file content line by line is prone to data loss and data duplication.

Disclosure of Invention

An embodiment of the invention provides a data processing method for a Hive data warehouse, used to store large files simply and quickly and to reduce data loss and data duplication during storage. The method comprises the following steps:

determining the information of the Hive formal table into which the data to be stored is to be stored in the Hive data warehouse;

establishing, according to the information of the Hive formal table, a temporary intermediate table that has the same structure as the Hive formal table but a different name;

importing the data to be stored into the temporary intermediate table, and reading the data from the temporary intermediate table into the Hadoop distributed file system;

and sequentially writing the file in each partition, the minimum unit of a resilient distributed dataset (RDD), in the Hadoop distributed file system into the Hive formal table.

An embodiment of the invention also provides a data processing device for a Hive data warehouse, used to store large files simply and quickly and to reduce data loss and data duplication during storage. The device comprises:

a formal table information determining module, used to determine the information of the Hive formal table into which the data to be stored is to be stored in the Hive data warehouse;

an intermediate table building module, used to build, according to the information of the Hive formal table, a temporary intermediate table that has the same structure as the Hive formal table but a different name;

a data temporary storage module, used to import the data to be stored into the temporary intermediate table and to read the data from the temporary intermediate table into the Hadoop distributed file system;

and a file splitting module, used to sequentially write the file in each partition, the minimum unit of a resilient distributed dataset (RDD), in the Hadoop distributed file system into the Hive formal table.

An embodiment of the invention also provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above data processing method of the Hive data warehouse when executing the computer program.

An embodiment of the invention also provides a computer-readable storage medium storing a computer program for executing the above data processing method of the Hive data warehouse.

In the embodiment of the invention, the information of the Hive formal table into which the data to be stored is to be stored in the Hive data warehouse is determined; a temporary intermediate table that has the same structure as the Hive formal table but a different name is established according to that information; the data to be stored is imported into the temporary intermediate table and read from it into the Hadoop distributed file system; and the file in each partition, the minimum unit of a resilient distributed dataset (RDD), in the Hadoop distributed file system is written sequentially into the Hive formal table. By constructing the intermediate table, importing the data from it into the Hadoop distributed file system, and splitting the file, partition by partition, into small files that can be imported into the Hive formal table, the method stays simple and, compared with the prior-art approach of reading data content line by line, stores data more quickly with less risk of data loss and duplication.

Drawings

To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of a data processing method of a Hive data warehouse in an embodiment of the present invention.

Fig. 2 is a schematic diagram of a data processing method of a Hive data warehouse according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of a data processing apparatus of a Hive data warehouse in an embodiment of the present invention.

Fig. 4 is a schematic diagram of a data processing apparatus of a Hive data warehouse in an embodiment of the invention.

Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

For a better understanding of the embodiments of the present invention, the terms of art to which the embodiments of the present invention relate will first be explained:

Spark: Spark is an open-source cluster computing system based on in-memory computing and one of the most active projects in the Apache community. Compared with Hadoop, Spark's computing speed can be up to nearly 100 times higher. Spark includes a set of powerful high-level libraries: Spark SQL, Spark Streaming, MLlib, and GraphX. It provides a large number of operators and a rich data operation interface to facilitate data processing.

Hive: Hive is a Hadoop-based data warehouse tool used for data extraction, transformation, and loading (ETL); it provides a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Its main storage formats are TextFile, RCFile, ORCFile, and Parquet.

Spark SQL: Spark SQL is a component of the Spark big data framework. It supports reading and writing data with standard SQL queries and HiveQL, is used for structured data processing, can run SQL-like queries on Spark data, and helps developers create and run Spark programs more quickly.

Partition: In Spark, a partition is the smallest unit of a resilient distributed dataset (RDD); an RDD consists of partitions distributed across the nodes. A partition is the smallest unit of data produced in the computing space during a Spark computation. The partitions of the same dataset (RDD) can differ in size and vary in number, depending on the operators used in the application and on the number of data blocks initially read; this is one of the reasons the RDD is described as an elastic (resilient) distributed dataset.

Hadoop: A distributed system infrastructure developed by the Apache Software Foundation.

HDFS: The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose (commodity) hardware. A block is the smallest unit of distributed storage in HDFS. Blocks are like boxes that hold files: a file may occupy several boxes, but the contents of one box can come from only one file.
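The box analogy for blocks can be made concrete with a little arithmetic. The following sketch counts how many blocks a set of files occupies, assuming a block size of 128 MB (the default `dfs.blocksize` in recent Hadoop releases). The function name `blocks_occupied` is hypothetical and is used only for illustration.

```python
BLOCK_SIZE_BYTES = 128 * 1024 * 1024  # assumed block size (128 MB)


def blocks_occupied(file_sizes, block_size=BLOCK_SIZE_BYTES):
    """Count blocks when every file gets its own blocks and none are shared."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    total = 0
    for size in file_sizes:
        if size < 0:
            raise ValueError("file sizes must be non-negative")
        # Ceiling division: a partly filled last block still counts as a block.
        total += -(-size // block_size)
    return total


mb = 1024 * 1024
print(blocks_occupied([300 * mb]))          # one 300 MB file spans several blocks
print(blocks_occupied([10 * mb, 10 * mb]))  # two small files cannot share a block
```

The count only reflects how many blocks are tracked, not how much disk space they consume.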

An embodiment of the invention provides a data processing method for a Hive data warehouse, used to store large files simply and quickly and to reduce data loss and data duplication during storage. As shown in Fig. 1, the method comprises the following steps:

Step 101: determining the information of the Hive formal table into which the data to be stored is to be stored in the Hive data warehouse;

Step 102: establishing, according to the information of the Hive formal table, a temporary intermediate table that has the same structure as the Hive formal table but a different name;

Step 103: importing the data to be stored into the temporary intermediate table, and reading the data from the temporary intermediate table into the Hadoop distributed file system;

Step 104: sequentially writing the file in each partition, the minimum unit of a resilient distributed dataset (RDD), in the Hadoop distributed file system into the Hive formal table.

As the flow in Fig. 1 shows, in the embodiment of the invention the information of the Hive formal table is determined, a temporary intermediate table with the same structure but a different name is established from that information, the data to be stored is imported into the temporary intermediate table and read from it into the Hadoop distributed file system, and the file in each RDD partition in the Hadoop distributed file system is written sequentially into the Hive formal table. By constructing the intermediate table, importing the data from it into the Hadoop distributed file system, and splitting the file, partition by partition, into small files that can be imported into the Hive formal table, the large file is stored without being read line by line.

In specific implementation, the information of the Hive formal table that is determined for storing the data to be stored in the Hive data warehouse includes: table name, field type, data size, partition information, storage structure, and file storage format.

After the information of the Hive formal table is determined, a temporary intermediate table with the same structure but a different name is established according to that information; that is, the temporary intermediate table has a different table name but the same field names, field types, data size, partition information, storage structure, and file storage format as the Hive formal table.
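One way to picture this step is Hive's `CREATE TABLE ... LIKE ...` statement, which copies a table definition without copying its data. The sketch below only builds the statement text; the patent does not state which statement is used, and the helper name `build_intermediate_table_ddl`, the `_tmp` suffix, and the example table name are illustrative assumptions.

```python
import re

# Hypothetical helper: the name, the "_tmp" suffix and the validation rule
# are assumptions made for illustration only.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_intermediate_table_ddl(formal_table: str, suffix: str = "_tmp") -> str:
    """Return HiveQL that creates an empty table with the formal table's structure."""
    if not _IDENT.match(formal_table):
        raise ValueError(f"unexpected table name: {formal_table!r}")
    intermediate = formal_table + suffix  # same structure, different name
    return f"CREATE TABLE IF NOT EXISTS {intermediate} LIKE {formal_table}"


print(build_intermediate_table_ddl("dw.trade_detail"))
```

Validating the identifier before interpolating it keeps the generated statement from being altered by an unexpected table name.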

After the temporary intermediate table is established, the data to be stored is imported into it, and the data is then read from the temporary intermediate table into the Hadoop distributed file system. In specific implementation, Spark SQL is configured to use the maximum resources available in the distributed cluster when reading data from the temporary intermediate table into the Hadoop distributed file system.

After the data has been read from the temporary intermediate table into the Hadoop distributed file system, the file in each partition (the minimum unit of a resilient distributed dataset, RDD) in the Hadoop distributed file system is written sequentially into the Hive formal table.
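The idea of writing one file per partition can be illustrated without a cluster. The following plain-Python sketch is a stand-in for Spark's partitioning, not the patent's implementation: it groups whole records into consecutive chunks of bounded size and checks that no record is lost or repeated. The function name `split_into_partitions` and the sample records are hypothetical.

```python
from typing import Iterable, List


def split_into_partitions(records: Iterable[bytes], max_bytes: int) -> List[List[bytes]]:
    """Group whole records into consecutive chunks of at most max_bytes each.

    A single record larger than max_bytes is placed in a chunk of its own.
    """
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    partitions: List[List[bytes]] = []
    current: List[bytes] = []
    current_size = 0
    for record in records:
        # Close the current chunk before it would exceed the size limit.
        if current and current_size + len(record) > max_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(record)
        current_size += len(record)
    if current:
        partitions.append(current)
    return partitions


rows = [f"row-{i}\n".encode() for i in range(10)]
parts = split_into_partitions(rows, max_bytes=20)
# Every record appears exactly once and in the original order.
assert [r for p in parts for r in p] == rows
print(len(parts))
```

Because records are never cut in the middle and each one is appended to exactly one chunk, concatenating the chunks reproduces the input.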

In a specific embodiment, data can be queried from the temporary intermediate table with Hive SQL (hsql) and inserted into the Hive formal table. In a specific embodiment, the data is imported into the temporary intermediate table with a load data command and read into HDFS using the maximum resources of the distributed cluster, and the file in each partition in HDFS is then written into the file directory of the formal table in the Hive database.
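The load and insert operations described here might look as follows in HiveQL. The sketch only assembles the statement text and assumes an unpartitioned table; the patent does not give the exact statements, and the function name, table names, and HDFS path are hypothetical.

```python
from typing import List


def build_load_and_insert(source_path: str, formal_table: str,
                          intermediate_table: str) -> List[str]:
    """Return the two HiveQL statements for the load step and the insert step."""
    if "'" in source_path:
        raise ValueError("source_path must not contain a single quote")
    return [
        # Import the data to be stored into the temporary intermediate table.
        f"LOAD DATA INPATH '{source_path}' INTO TABLE {intermediate_table}",
        # Query the intermediate table and insert the rows into the formal table.
        f"INSERT INTO TABLE {formal_table} SELECT * FROM {intermediate_table}",
    ]


for statement in build_load_and_insert(
        "/data/in/big_file.txt", "dw.trade_detail", "dw.trade_detail_tmp"):
    print(statement)
```

A partitioned formal table would additionally need a `PARTITION` clause or dynamic partitioning, which this sketch leaves out.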

Because partitioning divides the large file into multiple small files of close to 128 MB each, and each small file corresponds to one partition and to one task, the parallelism and execution efficiency of Spark SQL and Hive SQL (hsql) tasks can be improved.
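As a rough worked example of this sizing, the number of partitions (and therefore small files and tasks) can be estimated by dividing the file size by the 128 MB target and rounding up. The patent does not say how the partition count is chosen, so the function below, `estimate_partition_count`, is a hypothetical sketch under that assumption.

```python
TARGET_FILE_BYTES = 128 * 1024 * 1024  # 128 MB, the file size cited in the text


def estimate_partition_count(total_bytes: int,
                             target_bytes: int = TARGET_FILE_BYTES) -> int:
    """Estimate how many partitions yield files of at most target_bytes each."""
    if total_bytes < 0 or target_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, (total_bytes + target_bytes - 1) // target_bytes)


ten_gib = 10 * 1024 ** 3
print(estimate_partition_count(ten_gib))  # 10 GiB at 128 MB per file
```

With these assumptions, a 10 GiB file maps to 80 partitions, so up to 80 tasks could process it in parallel.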

Because the foregoing processing method is needed only for large files whose size exceeds 128 MB, the data processing method of the Hive data warehouse provided in a specific embodiment further includes, on the basis of Fig. 1 and as shown in Fig. 2:

Step 201: determining the file size of the data to be stored, and judging whether that file size exceeds the preset file specification.

The preset file specification is generally set to 128 MB.

Accordingly, step 101 is modified to: when the file size of the data to be stored exceeds the preset file specification, determining the information of the Hive formal table into which the data to be stored is to be stored in the Hive data warehouse.
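The check in step 201 amounts to a single comparison. The sketch below shows it under the 128 MB setting mentioned above; the names `needs_intermediate_table` and `PRESET_FILE_SPEC_BYTES` are hypothetical, and treating a file of exactly 128 MB as not exceeding the specification is an assumption.

```python
PRESET_FILE_SPEC_BYTES = 128 * 1024 * 1024  # preset file specification (128 MB)


def needs_intermediate_table(file_size_bytes: int,
                             spec_bytes: int = PRESET_FILE_SPEC_BYTES) -> bool:
    """Return True when the file exceeds the preset file specification."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    return file_size_bytes > spec_bytes


for size in (64 * 1024 * 1024, 5 * 1024 ** 3):
    print(size, needs_intermediate_table(size))
```

Only files for which the function returns `True` would go through steps 101 to 104.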

Based on the same inventive concept, an embodiment of the invention further provides a data processing apparatus for a Hive data warehouse. Because the apparatus solves the problem on a principle similar to that of the data processing method of the Hive data warehouse, its implementation may refer to the implementation of the method, and repeated parts are not described again. Its specific structure, as shown in Fig. 3, comprises:

a formal table information determining module 301, configured to determine the information of the Hive formal table into which the data to be stored is to be stored in the Hive data warehouse;

an intermediate table building module 302, configured to build, according to the information of the Hive formal table, a temporary intermediate table that has the same structure as the Hive formal table but a different name;

a data temporary storage module 303, configured to import the data to be stored into the temporary intermediate table, and to read the data from the temporary intermediate table into the Hadoop distributed file system;

a file splitting module 304, configured to sequentially write the file in each partition, the minimum unit of a resilient distributed dataset (RDD), in the Hadoop distributed file system into the Hive formal table.

In a specific embodiment, the information of the Hive formal table includes:

table name, field type, data size, partition information, storage structure, and file storage format.

In specific implementation, the data temporary storage module 303 is specifically configured to:

configure Spark SQL to use the maximum resources available in the distributed cluster, and read data from the temporary intermediate table into the Hadoop distributed file system.

In an embodiment, as shown in Fig. 4, the data processing apparatus of the Hive data warehouse further includes, on the basis of Fig. 3:

a file size determining module 401 configured to:

determining the file size of the data to be stored, and judging whether the file size of the data to be stored exceeds a preset file specification or not;

accordingly, the formal table information determining module 301 is specifically configured to:

and when the file size of the data to be stored exceeds the preset file specification, determining the information of the Hive formal table when the data to be stored is stored in the Hive data warehouse.

An embodiment of the present invention further provides a computer device. Fig. 5 is a schematic diagram of the computer device in an embodiment of the present invention; the computer device can implement all steps of the data processing method of the Hive data warehouse in the foregoing embodiments, and specifically includes:

a processor 501, a memory 502, a communications interface 503, and a communication bus 504;

the processor 501, the memory 502, and the communications interface 503 communicate with one another through the communication bus 504; the communications interface 503 is used for information transmission between related devices;

the processor 501 is used to call the computer program in the memory 502, and when the processor executes the computer program, it implements the data processing method of the Hive data warehouse in the above embodiments.

An embodiment of the present invention further provides a computer-readable storage medium, where a computer program for executing the data processing method of the Hive data warehouse is stored in the computer-readable storage medium.

In summary, the data processing method and apparatus for the Hive data warehouse provided by the embodiment of the invention have the following advantages:

The information of the Hive formal table into which the data to be stored is to be stored in the Hive data warehouse is determined; a temporary intermediate table that has the same structure as the Hive formal table but a different name is established according to that information; the data to be stored is imported into the temporary intermediate table and read from it into the Hadoop distributed file system; and the file in each partition, the minimum unit of a resilient distributed dataset (RDD), in the Hadoop distributed file system is written sequentially into the Hive formal table. By constructing the intermediate table, importing the data from it into the Hadoop distributed file system, and splitting the file, partition by partition, into small files that can be imported into the Hive formal table, large files are stored simply and quickly, with less data loss and data duplication than when file content is read line by line.

Although the present invention provides method steps as described in the examples or flowcharts, more or fewer steps may be included based on routine or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an actual apparatus or client product executes, it may execute sequentially or in parallel (e.g., in the context of parallel processors or multi-threaded processing) according to the embodiments or methods shown in the figures.

As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, apparatus (system) or computer program product. Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment. In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "upper", "lower", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the referred devices or elements must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. Unless expressly stated or limited otherwise, the terms "mounted," "connected," and "connected" are intended to be inclusive and mean, for example, that they may be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. 
The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention is not limited to any single aspect, nor is it limited to any single embodiment, nor is it limited to any combination and/or permutation of these aspects and/or embodiments. Moreover, each aspect and/or embodiment of the present invention may be utilized alone or in combination with one or more other aspects and/or embodiments thereof.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims

1. A data processing method of a Hive data warehouse is characterized by comprising the following steps:

2. The data processing method of the Hive data warehouse of claim 1, wherein the information of the Hive formal table comprises:

3. The data processing method of the Hive data warehouse of claim 1, wherein reading data from the temporary intermediate table into the Hadoop distributed file system comprises:

4. The data processing method of the Hive data warehouse of claim 1, further comprising:

the information of the Hive formal table when the data to be stored is stored in the Hive data warehouse is determined, and the information comprises the following information:

5. A data processing apparatus of a Hive data warehouse, comprising:

6. The data processing apparatus of the Hive data warehouse of claim 5, wherein the information of the Hive formal table comprises:

7. The data processing apparatus of the Hive data warehouse of claim 5, wherein the data temporary storage module is specifically configured to:

8. The data processing apparatus of the Hive data warehouse of claim 5, further comprising:

a file size determination module to:

the formal table information determining module is specifically configured to:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 4.