CN111008235A - Spark-based small file merging method and system - Google Patents

Spark-based small file merging method and system

Info

Publication number
CN111008235A
Authority
CN
China
Prior art keywords
data
merging
file
spark
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911216907.0A
Other languages
Chinese (zh)
Inventor
查文宇
张艳清
王纯斌
赵神州
费滔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Sefon Software Co Ltd
Original Assignee
Chengdu Sefon Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Sefon Software Co Ltd filed Critical Chengdu Sefon Software Co Ltd
Priority to CN201911216907.0A
Publication of CN111008235A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Abstract

The invention discloses a Spark-based small file merging method and system. A scheduled small-file merging task merges the many files in each partition into 1 file according to the configured task rules, which reduces the number of scattered small files and, when data in the Hive database is queried, reduces disk-read load, network transmission overhead and the work of merging data from many files, thereby improving data query efficiency. This solves the problem in existing schemes that, when data in a source database is extracted into the Hive database, multiple Spark tasks simultaneously read the source data and write it into different partitions, so that disk reads multiply and data query performance is degraded.

Description

Spark-based small file merging method and system
Technical Field
The invention relates to the field of business intelligence analysis platforms, and in particular to a Spark-based small file merging method and system.
Background
Business Intelligence (BI) refers to realizing business value from data by using modern data warehouse technology, online analytical processing, data mining and data presentation technology.
Business intelligence is generally understood as a set of tools that turn the data existing in an enterprise into knowledge and help the enterprise make informed business decisions. The data referred to here includes data from the enterprise's business systems, such as orders, inventory, transaction accounts, customers and suppliers, data about the enterprise's industry and competitors, and various other data from the enterprise's external environment. The business decisions that business intelligence can support range from the operational level to the tactical and strategic levels. Converting data into knowledge requires techniques such as data warehousing, online analytical processing (OLAP) tools and data mining. From a technical standpoint, therefore, business intelligence is not a new technology but a combined application of technologies such as data warehousing, OLAP and data mining.
Business intelligence can be regarded as the process of collecting, managing and analyzing business information, the aim being that decision makers at all levels of an enterprise gain knowledge or insight and can make decisions that are more beneficial to the enterprise. Business intelligence generally consists of data warehousing, online analytical processing, data mining, data backup and recovery, and similar components. It is realized through software, hardware, consulting services and applications, and its basic architecture comprises three parts: the data warehouse, online analytical processing and data mining.
It is therefore appropriate to regard business intelligence as a solution. Its key steps are to extract useful data from the many different enterprise operating systems, clean the data to ensure its correctness, and then merge it into an enterprise-level data warehouse through Extraction, Transformation and Loading (the ETL process), thereby obtaining a global view of the enterprise's data. On this basis the data is analyzed and processed with suitable query and analysis tools, data mining tools, OLAP tools and the like (at this point the information becomes knowledge that supports decision making), and finally the knowledge is presented to managers to support their decision-making process.
In the existing scheme, data in a source database is extracted into the Hive database: Spark runs multiple tasks simultaneously to read the source data and writes it into different partitions, so multiple files are generated in each partition when the data lands in the hadoop file system. The number of files multiplies each time the user performs a further incremental extraction, and once the number of files has grown, querying the data in the file system requires many more disk reads and data query performance is degraded.
Disclosure of Invention
The invention aims to provide a Spark-based small file merging method and system that solve the problem in the existing scheme that, when data in a source database is extracted into the Hive database, multiple Spark tasks simultaneously read the source data and write it into different partitions, so that disk reads multiply and data query performance is degraded.
The technical scheme adopted by the invention is as follows:
A small file merging method based on Spark is based on a source database, a business intelligence analysis platform with a Spark engine, and a Hive database loaded with the hadoop file system, and further comprises the following steps:
S1, the user operates on the source database and configures the data extraction function through the business intelligence analysis platform;
S2, the business intelligence analysis platform reads the data in the source database according to the N extraction partitions configured by the user, and writes the extracted data into M partitions in the Hive database, where the number of files in each partition is N, and M and N are positive integers;
S3, the hadoop file system merges the files in the M partitions according to the time period and the task rules pre-entered by the user.
In the existing scheme, data in the source database is extracted into the Hive database: Spark runs N tasks simultaneously to read the source data and writes it into M partitions, so N files per partition are generated when the data lands in the hadoop file system. Each time the user performs an incremental extraction, another M × N files are added, and once the number of files has multiplied, querying the data in the file system requires many more disk reads and data query performance is degraded.
The invention is mainly realized by the following technical scheme:
firstly, a user operates a data source and configures a data extraction function in a data set through a data set processing node through a platform. Then the system extracts the data source data of the partitions according to N configured by a user and writes the data source data into M partitions in the Hive library, wherein the number of data files in each partition is N; and finally, the system combines the small file tasks at regular time, and combines the M files in the N partitions into 1 file according to the task rule, so that the scattering quantity of the small files is reduced, and the data query efficiency is improved in the processes of disk reading load, network transmission consumption, data combination and the like when data in the Hive library is queried.
Further, the business intelligence analysis platform includes a data set matching the source database, and in step S1 the user operates on the source database and configures the data extraction function through the data set processing node of that data set in the business intelligence analysis platform.
Further, the method for the business intelligence analysis platform to read the data in the source database according to the N extraction partitions configured by the user in step S2 includes: the Spark engine executes N tasks simultaneously to read the source database data and writes the data into M partitions.
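By way of illustration only (this is not code from the patent), the behaviour described in step S2 can be sketched with the public Spark Java API. The JDBC URL, table names, partition columns and the value of N below are hypothetical:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SourceExtractJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("source-db-extract")
                .enableHiveSupport()
                .getOrCreate();

        int n = 8; // the N extraction partitions configured by the user (hypothetical value)

        // Read the source table with N parallel JDBC tasks, split on a numeric key column.
        Dataset<Row> source = spark.read().format("jdbc")
                .option("url", "jdbc:mysql://source-db:3306/ods")  // hypothetical source database
                .option("dbtable", "orders")                       // hypothetical source table
                .option("user", "etl")
                .option("password", "***")
                .option("partitionColumn", "id")                   // hypothetical numeric key
                .option("lowerBound", "1")
                .option("upperBound", "10000000")
                .option("numPartitions", String.valueOf(n))
                .load();

        // Write into a Hive table partitioned by date (the M partitions). Each of the N read
        // tasks writes its own output file into every Hive partition it touches, so each
        // partition ends up with up to N small files, exactly the situation step S3 cleans up.
        source.write()
                .mode(SaveMode.Append)
                .partitionBy("dt")               // hypothetical partition column
                .saveAsTable("bi.orders_ods");   // hypothetical target Hive table

        spark.stop();
    }
}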
Further, the method for merging the files in the M partitions by the hadoop file system according to the time period and the task rule pre-entered by the user in the step S3 includes the following steps:
S301, the user configures the hadoop file system, sets the period at which the hadoop file system performs file merging, and configures the task rules for file merging;
S302, timing starts when the hadoop file system starts; once the elapsed time reaches the period preset in step S301, the hadoop file system merges the files in the M partitions according to the task rules configured in step S301;
S303, after the hadoop file system completes the file merge, the timer is reset and the process returns to step S302.
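A minimal sketch of the periodic trigger in steps S301 to S303 follows; it is an assumption about how such a timer could be built on a standard Java scheduled executor rather than the patent's own implementation, and the period value and method name are hypothetical:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MergeScheduler {
    public static void main(String[] args) {
        // Merge period configured by the user in step S301 (hypothetical value: every 6 hours).
        long periodHours = 6;

        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

        // Timing starts when the service starts (S302); after each run the schedule
        // waits for the next full period, which plays the role of resetting the timer (S303).
        timer.scheduleAtFixedRate(() -> {
            try {
                mergeSmallFiles(); // apply the configured task rule to every partition
            } catch (Exception e) {
                e.printStackTrace(); // keep the schedule alive even if one run fails
            }
        }, periodHours, periodHours, TimeUnit.HOURS);
    }

    private static void mergeSmallFiles() {
        // Placeholder for the per-partition merge logic (see Example 5 and the sketch below).
    }
}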
Further, the task rules in step S302 include: merging after sorting by file name, merging after sorting by file creation time, merging after sorting by file modification time, and merging after sorting by file size.
Further, the merged file produced in step S303 has the following structure: the file header contains the names of all the files before merging, and the file content contains the data of all the files before merging.
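The merged-file layout described above can be sketched with the Hadoop FileSystem API as follows. This is an illustrative assumption rather than the patent's code: it presumes plain-text data files inside an HDFS partition directory, approximates the creation-time rule with modification time (the FileStatus API exposes modification rather than creation time), and the directory path, rule names and merged-file naming are hypothetical:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class PartitionMerger {

    // Merge all files in one partition directory into a single file whose header
    // lists the original file names, after sorting them by the chosen task rule.
    public static void mergePartition(String partitionDir, String rule) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] files = fs.listStatus(new Path(partitionDir));

        // Task rule: choose the sort order before merging (name / time / size).
        Comparator<FileStatus> order;
        switch (rule) {
            case "mtime": order = Comparator.comparingLong(FileStatus::getModificationTime); break;
            case "size":  order = Comparator.comparingLong(FileStatus::getLen); break;
            default:      order = Comparator.comparing(f -> f.getPath().getName()); break;
        }
        Arrays.sort(files, order);

        Path merged = new Path(partitionDir, "merged_" + System.currentTimeMillis());
        try (FSDataOutputStream out = fs.create(merged)) {
            // File header: names of all files before merging.
            StringBuilder header = new StringBuilder("# merged from:");
            for (FileStatus f : files) {
                header.append(' ').append(f.getPath().getName());
            }
            header.append('\n');
            out.write(header.toString().getBytes(StandardCharsets.UTF_8));

            // File content: data of all files before merging, in sorted order.
            for (FileStatus f : files) {
                try (FSDataInputStream in = fs.open(f.getPath())) {
                    IOUtils.copyBytes(in, out, 4096, false);
                }
            }
        }
        // Remove the original small files once the merged file is written.
        for (FileStatus f : files) {
            fs.delete(f.getPath(), false);
        }
    }
}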
A small file merging system based on Spark comprises a source database, a business intelligence analysis platform with a Spark engine, and a Hive database based on the hadoop file system;
the Hive database comprises:
a memory for storing executable instructions and files;
and the processor is used for executing the executable instructions stored in the memory to realize the above small file merging method based on Spark.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. The Spark-based small file merging method and system disclosed by the invention solve the problem in the prior art that, when data in a source database is extracted into the Hive database, multiple Spark tasks simultaneously read the source data and write it into different partitions, so that disk reads multiply and data query performance is degraded;
2. The Spark-based small file merging method and system reduce the number of scattered files on disk, the disk read I/O load during data queries, network transmission overhead, and the memory consumed by merging data from many files at query time, thereby improving query efficiency and the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts, wherein:
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a diagram illustrating the number of files before small files are merged;
FIG. 3 is a diagram illustrating the number of files after small files are merged according to the present invention;
FIG. 4 is a screenshot of the number of files before merging of small files in accordance with the present invention;
FIG. 5 is a screenshot of the number of files after merging of small files according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to fig. 1 to 5, the described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Before describing the embodiments of the present invention in further detail, the terms and expressions used in the embodiments are explained; they are to be understood as follows.
Spark: Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing; Spark provides in-memory distributed datasets and, in addition to supporting interactive queries, can optimize iterative workloads;
Hive: Hive is a data warehouse tool built on Hadoop that can map structured data files onto database tables and provides SQL-like query functions;
Data source: a general term for sources of data such as files and databases;
Data processing node: a subdivided node of the business intelligence analysis platform's data processing function, covering functions such as table association, field derivation, data filtering, field calculation, grouping statistics and data types; it mainly performs cleaning, filtering, splitting and other processing on a data source;
Data set: the collective name for a data source after data processing nodes have been configured; a data set may include one or more data sources and one or more data processing nodes, and the generated data set can be regarded as a virtual data source;
Analysis: the module that takes the dimensions and measures produced after a data set is processed, performs configured queries and associates the data with chart components; a dimension or measure generated by one data set can be bound to several different display components in this module;
Report: the module used to assemble analysis components; after layout configuration, the report is made available for users to view;
Presto: an open-source distributed SQL query engine used to run interactive analytical queries against data sources of all sizes;
Container component for submitting Spark tasks: developed as part of the business intelligence analysis platform, it submits Spark tasks dynamically.
example 1
A small file merging method based on Spark is based on a source database, a business intelligence analysis platform with a Spark engine, and a Hive database loaded with the hadoop file system, and further comprises the following steps:
S1, the user operates on the source database and configures the data extraction function through the business intelligence analysis platform;
S2, the business intelligence analysis platform reads the data in the source database according to the N extraction partitions configured by the user, and writes the extracted data into M partitions in the Hive database, where the number of files in each partition is N, and M and N are positive integers;
S3, the hadoop file system merges the files in the M partitions according to the time period and the task rules pre-entered by the user.
Example 2
In this embodiment, based on embodiment 1, the business intelligence analysis platform includes a data set matching the source database, and in step S1 the user operates on the source database and configures the data extraction function through the data set processing node of that data set in the business intelligence analysis platform.
Further, the method for the business intelligence analysis platform to read the data in the source database according to the N extraction partitions configured by the user in step S2 includes: the Spark engine executes N tasks simultaneously to read the source database data and writes the data into M partitions.
Example 3
In this embodiment, based on embodiment 1, the method for merging files in M partitions by the hadoop file system according to the time period and the task rule pre-entered by the user in step S3 includes the following steps:
S301, the user configures the hadoop file system, sets the period at which the hadoop file system performs file merging, and configures the task rules for file merging;
S302, timing starts when the hadoop file system starts; once the elapsed time reaches the period preset in step S301, the hadoop file system merges the files in the M partitions according to the task rules configured in step S301;
S303, after the hadoop file system completes the file merge, the timer is reset and the process returns to step S302.
Further, the task rules in step S302 include: merging after sorting by file name, merging after sorting by file creation time, merging after sorting by file modification time, and merging after sorting by file size.
Further, the merged file produced in step S303 has the following structure: the file header contains the names of all the files before merging, and the file content contains the data of all the files before merging.
Example 4
A small file merging system based on Spark comprises a source database, a business intelligence analysis platform with a Spark engine, and a Hive database based on the hadoop file system;
the Hive database comprises:
a memory for storing executable instructions and files;
and the processor is used for executing the executable instructions stored in the memory to realize the above small file merging method based on Spark.
Example 5
This embodiment shows part of the function code of the scheme; the fragment below runs inside a loop over the Hive tables to be merged, where stat is a JDBC Statement and table is the current table name:
ResultSet resultSet = stat.executeQuery("show partitions " + table);
if (resultSet == null) continue;
while (resultSet.next()) {
    // Merge the small files of this partition with Hive's concatenate statement
    stat.execute("alter table " + table + " partition(" + resultSet.getString(1) + ") concatenate");
}
resultSet.close();
// Update the merged-partition field status of ec_dataset_info
datasetManageMapper.updateMergeFileStatus(datasetId, 0);
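For context: the alter table ... partition(...) concatenate statement issued above is Hive's own DDL command for merging a partition's small files into fewer, larger files (it applies to tables stored as RCFile or ORC), which is why the fragment walks through the output of show partitions and issues one statement per partition. The call to datasetManageMapper.updateMergeFileStatus appears to be the platform's own persistence-layer call that records the merge as completed for the data set; it is not part of Hive or Hadoop.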
Example 6
As shown in fig. 4, this embodiment uses the file list before merging: each partition contains 8 files of about 410KB each, and the partition size is 256MB. When the system queries the data in the file system it must read 8 small files; since the hard disk must seek before reading each file, the 8 small files require 8 seeks at about 8ms each, plus a data transfer time of about 5ms per file, for a total of about 104ms.
Example 7
As shown in fig. 5, this embodiment uses the file list after merging: each partition contains 1 large file of about 3.2MB, and the partition size is 256MB. When the system queries the data in the file system it only needs to read 1 large file: 1 seek of about 8ms plus a data transfer time of about 40ms for the large file, for a total of about 48ms. This solves the problem in the existing scheme that, when data in the source database is extracted into the Hive database, multiple Spark tasks simultaneously read the source data and write it into different partitions, so that disk reads multiply and data query performance is degraded.
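Putting examples 6 and 7 side by side with the figures stated above: before merging, reading a partition costs 8 seeks × 8ms + 8 transfers × 5ms = 64ms + 40ms ≈ 104ms, while after merging it costs 1 seek × 8ms + 1 transfer × 40ms ≈ 48ms, roughly halving the read latency for the same volume of data.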
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A small file merging method based on Spark, based on a source database, a business intelligence analysis platform with a Spark engine and a Hive database loaded with the hadoop file system, characterized in that it further comprises the following steps:
S1, the user operates on the source database and configures the data extraction function through the business intelligence analysis platform;
S2, the business intelligence analysis platform reads the data in the source database according to the N extraction partitions configured by the user, and writes the extracted data into M partitions in the Hive database, where the number of files in each partition is N, and M and N are positive integers;
S3, the hadoop file system merges the files in the M partitions according to the time period and the task rules pre-entered by the user.
2. The method for merging small files based on Spark according to claim 1, wherein: the business intelligence analysis platform includes a data set matching the source database, and in step S1, the user operates the source database and configures a data extraction function through the data set processing node of the data set in the business intelligence analysis platform.
3. The method for merging small files based on Spark according to claim 1, wherein: the method for the business intelligence analysis platform to read the data in the source database according to the N extraction partitions configured by the user in step S2 includes: the Spark engine executes N tasks simultaneously to read the source database data and writes the data into M partitions.
4. The method for merging small files based on Spark according to claim 1, wherein: the method for merging the files in the M partitions by the hadoop file system in the step S3 according to the time period and the task rule pre-entered by the user comprises the following steps:
S301, the user configures the hadoop file system, sets the period at which the hadoop file system performs file merging, and configures the task rules for file merging;
S302, timing starts when the hadoop file system starts; once the elapsed time reaches the period preset in step S301, the hadoop file system merges the files in the M partitions according to the task rules configured in step S301;
S303, after the hadoop file system completes the file merge, the timer is reset and the process returns to step S302.
5. The Spark-based small file merging method according to claim 4, characterized in that the task rules in step S302 include: merging after sorting by file name, merging after sorting by file creation time, merging after sorting by file modification time, and merging after sorting by file size.
6. The Spark-based small file merging method according to claim 4, characterized in that the merged file produced in step S303 has the following structure: the file header contains the names of all the files before merging, and the file content contains the data of all the files before merging.
7. A small file merging system based on Spark, characterized in that it comprises a source database, a business intelligence analysis platform with a Spark engine, and a Hive database based on the hadoop file system;
the Hive database comprises:
a memory for storing executable instructions and files;
a processor, configured to execute the executable instructions stored in the memory to implement the Spark-based small file merging method as claimed in claim 1.
CN201911216907.0A 2019-12-03 2019-12-03 Spark-based small file merging method and system Pending CN111008235A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911216907.0A CN111008235A (en) 2019-12-03 2019-12-03 Spark-based small file merging method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911216907.0A CN111008235A (en) 2019-12-03 2019-12-03 Spark-based small file merging method and system

Publications (1)

Publication Number Publication Date
CN111008235A true CN111008235A (en) 2020-04-14

Family

ID=70112653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911216907.0A Pending CN111008235A (en) 2019-12-03 2019-12-03 Spark-based small file merging method and system

Country Status (1)

Country Link
CN (1) CN111008235A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101868029B1 (en) * 2017-03-10 2018-06-18 현대카드 주식회사 Method and system for sharing file based on blockchain
CN109857803A (en) * 2018-12-13 2019-06-07 杭州数梦工场科技有限公司 Method of data synchronization, device, equipment, system and computer readable storage medium
CN109726177A (en) * 2018-12-29 2019-05-07 北京赛思信安技术股份有限公司 A kind of mass file subregion indexing means based on HBase

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897772A (en) * 2020-08-05 2020-11-06 光大兴陇信托有限责任公司 Big file data importing method
CN111897772B (en) * 2020-08-05 2024-02-20 光大兴陇信托有限责任公司 Large file data importing method
CN112799820A (en) * 2021-02-05 2021-05-14 拉卡拉支付股份有限公司 Data processing method, data processing apparatus, electronic device, storage medium, and program product
CN116069738A (en) * 2023-03-06 2023-05-05 鹏城实验室 Root zone file generation method, terminal equipment and computer readable storage medium
CN116069738B (en) * 2023-03-06 2023-08-25 鹏城实验室 Root zone file generation method, terminal equipment and computer readable storage medium


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200414)