CN114116224A

CN114116224A - File merging method, processor and storage medium

Info

Publication number: CN114116224A
Application number: CN202111442216.XA
Authority: CN
Inventors: 赵振洪; 陈钟浩; 管瑞峰; 刘运春
Original assignee: Shanghai Zhijing Information Technology Co ltd
Current assignee: Shanghai Zhijing Information Technology Co ltd
Priority date: 2021-11-30
Filing date: 2021-11-30
Publication date: 2022-03-01

Abstract

The embodiment of the invention provides a file merging method, a file merging device, a processor and a storage medium. The method comprises the following steps: after the Spark engine writes the Hive file, determining the file size and the number of files of the files to be merged; determining the current merging task according to the file size and the number of files; and submitting the merging task to the Spark engine, and starting a merging task thread through the Spark engine to merge the files to be merged to obtain the merged file. When small files are merged by this method, all existing offline tasks in the current cluster can be optimized at one time without modifying the offline computing tasks; the computing performance of downstream tasks is improved, the resources consumed by the cluster are reduced, and the cluster runs more stably.

Description

File merging method, processor and storage medium

Technical Field

The invention relates to the technical field of computers, in particular to a file merging method, a processor and a storage medium.

Background

Existing big data platform offline data warehouses are mostly built on Hive, with the data stored on HDFS. After processing the data, an offline data analysis task, written in SQL or as a program, writes its output to HDFS through the Spark engine. By default, Spark generates a corresponding number of Tasks to write files according to the shuffle configuration.

In the prior art, the shuttle adopts an Internet of Things + SaaS technical scheme: a single device reports collected data to a cloud server in real time, and the cloud server, based on a streaming computing platform, stores the data in Hive for offline task calculation. As services grow, the links of offline tasks keep lengthening. An upstream task is often not optimized and outputs too many small files, which increases the overhead of downstream tasks; this is fatal when there are a large number of offline tasks. Offline tasks are characterized by a large number of mutually dependent tasks, and developers usually pay attention only to the implementation logic of their own task. If a certain task generates a large number of small files in its output, the computing performance of downstream tasks is greatly affected and cluster resources are wasted needlessly. If the shuffle setting is too large, too many files are output in the end, and when a downstream task uses the data, the same number of Tasks must be started to read it, which causes unnecessary resource waste; it is difficult to find a balanced value that solves the problem perfectly.

Disclosure of Invention

The embodiment of the invention aims to provide a file merging method, a processor and a storage medium.

In order to achieve the above object, a first aspect of the present invention provides a file merging method, including:

after the Spark engine writes the Hive file, determining the file size and the number of files of the files to be merged;

determining the current merging task according to the file size and the file number;

and submitting the merging task to the Spark engine, and starting a merging task thread through the Spark engine to merge the files to be merged to obtain the merged file.

Optionally, after the Spark engine writes the Hive file, determining the file size and the number of files of the file to be merged includes: after the Spark engine writes in the Hive file, determining the state value of the function configuration item of the merged file; and under the condition that the state value of the function configuration item of the merged file indicates that merging is started, determining the file size and the number of files of the file to be merged according to the written Hive file.

Optionally, the method further comprises: and under the condition that the state value of the function configuration item of the merged file indicates that merging is started, scanning an Hdfs directory to acquire a file of the task so as to determine the size and the number of the files to be merged.

Optionally, determining the current merging task according to the file size and the file number includes: determining parameters of file merging quantity configuration items; determining a lower limit value of the file merging quantity according to the parameters of the file merging quantity configuration items; under the condition that the number of the files to be merged is smaller than the lower limit value, the merging task of the files to be merged is not started; and starting a merging task of the files to be merged according to the size of the files and the number of the files under the condition that the number of the files to be merged is greater than or equal to the lower limit value.

Optionally, when the number of the files to be merged is greater than or equal to the lower limit, starting a merging task for the files to be merged according to the size of the files and the number of the files includes: determining parameters of file merging size configuration items; determining a target value of the file size after the file is merged according to the parameters of the file merging size configuration items; and under the condition that the number of the files to be merged is greater than or equal to the lower limit value, merging the files in the files to be merged according to the size of each file to be merged, so that the file size obtained after merging at least 2 files is the target value of the file size.

Optionally, the method further comprises: and when the file size of the file existing in the files to be merged is larger than or equal to the target value of the file size, not merging the files larger than or equal to the target value of the file size.

Optionally, the method further comprises: determining the process number of the merging tasks according to the target value of the file size and the lower limit value of the file merging number under the condition that the file number of the files to be merged is greater than or equal to the lower limit value; creating temporary file directories with the same number as the processes; determining the files corresponding to each temporary file directory according to the target file size value and the lower limit value of the file merging number; and respectively merging the files in each temporary file directory to obtain a merged file corresponding to each temporary file directory.

Optionally, the lower limit of the number of merged files is 10, and the target value of the file size is 128M.

A second aspect of the present invention provides a processor configured to execute the file merging method described above.

A third aspect of the present invention provides a file merging apparatus, including the processor described above.

A fourth aspect of the invention provides a machine-readable storage medium having stored thereon instructions which, when executed by a processor, cause the processor to be configured to perform the file merging method described above.

According to the file merging method, after Hive files are written in by the Spark engine, the files written in the task are obtained, the file size and the file number of the files to be merged are determined, the merging task is determined according to the file size and the file number, then the merging task is submitted to the Spark engine, a merging task thread is started through the Spark engine to merge the files to be merged, and the merged files are obtained. When the small files are combined by the method, all the existing offline tasks in the current cluster can be optimized at one time without modifying the offline computing tasks, the computing performance of downstream tasks is improved, the resource consumed by the cluster is reduced, and the cluster runs more stably.

Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:

FIG. 1 is a flow chart diagram schematically illustrating a file merging method according to an embodiment of the present invention;

FIG. 2 schematically shows a flowchart of a file merging method according to another embodiment of the present invention;

FIG. 3 schematically shows an internal structure diagram of a computer apparatus according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating embodiments of the invention, are given by way of illustration and explanation only, not limitation.

Fig. 1 schematically shows a flowchart of a file merging method according to an embodiment of the present invention. As shown in fig. 1, in an embodiment of the present invention, a file merging method is provided, which includes the following steps:

Step 101: after the Spark engine writes the Hive file, determining the file size and the number of files of the files to be merged.

Step 102: determining the current merging task according to the file size and the number of files.

Step 103: submitting the merging task to the Spark engine, and starting a merging task thread through the Spark engine to merge the files to be merged to obtain the merged file.

In a traditional big data offline task development process, a developer generally pays attention only to their own business logic, which is then implemented through SQL or program tasks. By default, the Spark engine itself controls the number of downstream Tasks, and hence the number of written files, through the configured default number of shuffle partitions. However, this configuration is a fixed value that is difficult to make compatible with all service scenarios; some jobs do not output many small files themselves and do not need their output files merged. If the shuffle partition number is configured too large, a large number of small files may result; if it is configured too small, the output files may be too large, which also affects the performance of downstream tasks. It is difficult to find a configuration that is fully compatible with all tasks, and requiring users to configure it manually is inconvenient to use and maintain.

In this embodiment, for the characteristics of the Spark engine and the scenes of common offline synchronization tasks, a file merging operation is added after the Spark engine writes the Hive file. First, after the Spark engine writes the Hive file, the file size and the number of files of the file to be merged may be determined. And then determining the merging task according to the size and the number of the files. And then submitting a merging task to a Spark engine, and starting a merging task thread through the Spark engine to merge files to be merged to obtain a merged file.

Further, in an embodiment, after the Spark engine writes the Hive file, determining the file size and the number of files of the file to be merged includes: after the Spark engine writes in the Hive file, determining the state value of the function configuration item of the merged file; and under the condition that the state value of the function configuration item of the merged file indicates that merging is started, determining the file size and the number of files of the file to be merged according to the written Hive file.

After the Spark engine writes the Hive file, a newly added configuration item is consulted: spark.sql.mergefiles.enabled. This configuration item determines whether the file merging function is enabled. Its optional values are true and false, which turn the function on and off respectively, and the default is true. Therefore, in the case that the state value of the merge file function configuration item indicates that merging is started, the file size and the number of files of the files to be merged can be determined according to the written Hive file.
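A minimal sketch of how such a switch might be read, assuming the job configuration is available as a plain mapping of strings. The helper name merge_enabled and the dictionary-based configuration are illustrative assumptions, not part of Spark or of the described method; only the key name and its default come from the text.

```python
# Illustrative only: the job configuration is modelled as a plain dict of strings.
MERGE_ENABLED_KEY = "spark.sql.mergefiles.enabled"  # key name as given in the text


def merge_enabled(conf: dict) -> bool:
    """Return True when the switch is unset or set to 'true' (case-insensitive)."""
    raw = conf.get(MERGE_ENABLED_KEY, "true")  # the description states a default of true
    return raw.strip().lower() == "true"
```

With this sketch, an empty configuration enables merging, and only an explicit non-true value turns it off.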

Further, in the case that the state value of the merge file function configuration item indicates that merging is started, the Hdfs directory can be scanned to obtain the files written by the current task, so as to determine the file size and the number of the files to be merged. Obtaining the size and the number of the files written by the task serves mainly to record the number and size of the files output by the current task, providing data support for deciding how to merge the files subsequently.
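The scan can be illustrated with a short sketch. It uses a local directory as a stand-in for the Hdfs directory; a real implementation would presumably go through the Hadoop file system API instead. The function names and the rule of skipping marker files whose names start with "_" or "." are assumptions made for the example.

```python
import os
from typing import List, Tuple


def scan_output_files(directory: str) -> List[Tuple[str, int]]:
    """List (path, size in bytes) for regular files directly under directory."""
    found = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Assumed convention: names starting with "_" or "." are markers, not data.
            if entry.is_file() and not entry.name.startswith(("_", ".")):
                found.append((entry.path, entry.stat().st_size))
    return sorted(found)


def summarize(files: List[Tuple[str, int]]) -> Tuple[int, int]:
    """Return (number of files, total size in bytes)."""
    return len(files), sum(size for _, size in files)
```

The two numbers returned by summarize are the inputs that the later steps use to decide whether and how to merge.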

In one embodiment, determining the current merging task according to the file size and the file number includes: determining parameters of file merging quantity configuration items; determining a lower limit value of the file merging quantity according to the parameters of the file merging quantity configuration items; under the condition that the number of the files to be merged is smaller than the lower limit value, the merging task of the files to be merged is not started; and starting the merging task of the files to be merged according to the size and the number of the files under the condition that the number of the files to be merged is greater than or equal to the lower limit value.

In an actual production environment, some tasks output small files, but the number of output files is not large. Because merging files itself has a certain performance cost, whether to merge can be decided from the configuration when the number of files is small. Therefore, in the present embodiment, a configuration item is newly added: spark.sql.mergefiles.minfilecount = 10. This configuration item indicates the number of output files below which no merge is performed. Therefore, when the number of written files is less than the value of this configuration item, the files are not merged. Here, the Hdfs directory refers to a directory of the distributed file system.

Specifically, the lower limit of the merging number may be set to 10 by default, that is, it means that file merging is not performed if the task itself outputs less than 10 files. That is, the merging task of the files to be merged is not started when the number of the files to be merged is less than the lower limit value. And only under the condition that the number of the files to be merged is greater than or equal to the lower limit value, starting the merging task of the files to be merged according to the size and the number of the files.
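The lower-limit rule amounts to a single comparison. The following sketch assumes the file count has already been obtained; the function name should_start_merge is hypothetical, and the default of 10 is the value stated in the text.

```python
DEFAULT_MIN_FILE_COUNT = 10  # default lower limit stated in the text


def should_start_merge(file_count: int,
                       min_file_count: int = DEFAULT_MIN_FILE_COUNT) -> bool:
    """Start the merging task only when file_count reaches the lower limit."""
    return file_count >= min_file_count
```

A task that wrote 9 files is left alone, while a task that wrote 10 or more becomes a candidate for merging.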

Further, when the number of the files to be merged is greater than or equal to the lower limit value, starting the merging task of the files to be merged according to the size and the number of the files comprises: determining parameters of file merging size configuration items; determining a target value of the file size after the file is merged according to the parameters of the file merging size configuration items; and under the condition that the number of the files to be merged is greater than or equal to the lower limit value, merging the files in the files to be merged according to the size of each file to be merged, so that the size of the file obtained after merging at least 2 files is a target value of the size of the file.

When the number of the files to be merged is greater than or equal to the lower limit value, the merging task of the files to be merged is started according to the file size and the number of files. Specifically, the parameter of the file merge size configuration item may be determined, and the target file size value after merging may then be determined from it. For example, the file merge size configuration item spark.sql.mergefiles.maxfilesize = 134217728 indicates that, when files among the files to be merged are merged, the size of a file obtained by merging at least 2 files does not exceed the target file size value set by the configuration item, here 134217728 bytes, that is, 128M.
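The relation between the two figures is simple arithmetic: 128 x 1024 x 1024 = 134217728. A small sketch, assuming the configuration is a plain mapping of strings and that the value is expressed in bytes; the helper names are illustrative.

```python
MIB = 1024 * 1024  # bytes in one mebibyte
MAX_FILE_SIZE_KEY = "spark.sql.mergefiles.maxfilesize"  # key name as given in the text


def target_size_bytes(conf: dict) -> int:
    """Read the target file size in bytes, falling back to 128M."""
    return int(conf.get(MAX_FILE_SIZE_KEY, str(128 * MIB)))


def to_mib(size_bytes: int) -> float:
    """Convert a byte count to mebibytes for display."""
    return size_bytes / MIB
```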

In one embodiment, the method further comprises: and when the file size of the file existing in the files to be merged is larger than or equal to the target value of the file size, not merging the files larger than or equal to the target value of the file size.

In the case that a single file among the files to be merged has a size greater than or equal to the target file size value, that file is not merged. For example, assume that there are three files of 128M, 72M and 14M. The 128M file is not merged, but the 72M and 14M files can be merged, since the merged size does not exceed 128M. Further, assume that there are four files of 128M, 72M, 100M and 14M. Again, the 128M file is not merged. The 100M file is not merged with the 72M file, because the merged size would exceed 128M. Instead, the 100M file can be merged with the 14M file without exceeding 128M.
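The two examples can be reproduced with a simple grouping routine. The text does not name a packing strategy, so the sketch below substitutes a first-fit decreasing heuristic as an assumption; the function name plan_groups is hypothetical. Files at or above the target are set aside, and a group is kept only if it contains at least 2 files.

```python
MIB = 1024 * 1024


def plan_groups(sizes, target=128 * MIB):
    """Return (groups to merge, sizes left unmerged) using first-fit decreasing."""
    untouched = [s for s in sizes if s >= target]  # already large enough
    candidates = sorted((s for s in sizes if s < target), reverse=True)
    bins = []  # each bin holds sizes whose sum stays within the target
    for size in candidates:
        for b in bins:
            if sum(b) + size <= target:
                b.append(size)
                break
        else:
            bins.append([size])  # no existing bin has room
    groups = [b for b in bins if len(b) >= 2]
    singles = [b[0] for b in bins if len(b) == 1]  # nothing to merge with
    return groups, untouched + singles
```

For sizes of 128M, 72M, 100M and 14M this pairs 100M with 14M and leaves the 128M and 72M files as they are, which matches the second example above.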

In one embodiment, the method further comprises: determining the process number of the merging tasks according to the target value of the file size and the lower limit value of the file merging number under the condition that the file number of the files to be merged is greater than or equal to the lower limit value; creating temporary file directories with the same number as the processes; determining the file corresponding to each temporary file directory according to the target value of the file size and the lower limit value of the file merging number; and respectively merging the files in each temporary file directory to obtain a merged file corresponding to each temporary file directory.

As shown in FIG. 2, after the size and number of files are checked, the small files to be merged can be moved to merge directories according to a certain policy. These directories are the temporary file directories. The merging task is then submitted to the Spark engine; the number of Tasks of the merging task equals the number of directories to be merged, and each Task merges all the files under its specified directory into one file. For example, suppose there are currently 100 small files to be merged. The grouping can be calculated from the single-file size allowed by the above configuration; assuming that the 100 files will become five 128M files after merging, 5 Tasks can be generated for file merging. At the same time, temporary file directories equal in number to the Tasks, that is, 5 temporary file directories, are created. Each Task then runs the merge, merging the files under its directory to obtain the merged file corresponding to that temporary file directory. After all Tasks have finished executing, the merging task ends.
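The 100-file example can be worked through with a rough estimate. The text does not give a formula for the number of Tasks, so the sketch below assumes the total size divided by the target size, rounded up; this is a lower bound, and an actual grouping may need more Tasks. The directory naming scheme is likewise hypothetical.

```python
MIB = 1024 * 1024


def estimate_task_count(sizes, target=128 * MIB):
    """Lower bound on merge Tasks: total bytes over target size, rounded up."""
    total = sum(sizes)
    return (total + target - 1) // target  # integer ceiling division


def temp_dir_names(base, task_count):
    """One temporary directory name per Task (naming scheme is an assumption)."""
    return [f"{base}/.merge-tmp/task-{i:05d}" for i in range(task_count)]
```

For instance, 100 files of 6M each total 600M, which gives an estimate of 5 Tasks and therefore 5 temporary file directories.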

In the file merging method, after the Spark engine writes the Hive file, the Hdfs directory is scanned to obtain the files written by the task, the files to be merged are put into different temporary directories according to the configuration, a file merging task is submitted through the Spark engine, and the files under each temporary directory are merged into one file. When small files are merged by this method, all existing offline tasks in the current cluster can be optimized at one time without modifying the offline computing tasks; the computing performance of downstream tasks is improved, the resources consumed by the cluster are reduced, and the cluster runs more stably.
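The move-then-merge flow can be sketched on a local file system. Several assumptions are made here: a thread pool stands in for the Spark Tasks, files are merged by byte concatenation (reasonable only for line-oriented text data; columnar formats such as ORC or Parquet would have to be read and rewritten), and the temporary directories are deleted afterwards. The function and directory names are hypothetical.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor
from typing import List


def merge_directory(tmp_dir: str, output_path: str) -> str:
    """Concatenate every file under tmp_dir into output_path, in name order."""
    with open(output_path, "wb") as out:
        for name in sorted(os.listdir(tmp_dir)):
            with open(os.path.join(tmp_dir, name), "rb") as src:
                shutil.copyfileobj(src, out)
    return output_path


def run_merge(groups: List[List[str]], table_dir: str) -> List[str]:
    """Move each group into its own temporary directory, then merge each directory."""
    jobs = []
    for i, group in enumerate(groups):
        tmp_dir = os.path.join(table_dir, f".merge-tmp-{i}")
        os.makedirs(tmp_dir)
        for path in group:
            shutil.move(path, os.path.join(tmp_dir, os.path.basename(path)))
        jobs.append((tmp_dir, os.path.join(table_dir, f"merged-{i}")))
    # One worker per directory stands in for one Spark Task per directory.
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        merged = list(pool.map(lambda job: merge_directory(*job), jobs))
    for tmp_dir, _ in jobs:
        shutil.rmtree(tmp_dir)  # assumed clean-up step, not stated in the text
    return merged
```

Each inner list in groups becomes one temporary directory and one merged output file, so the number of outputs equals the number of groups.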

The embodiment of the invention provides a processor, which is used for running a program, wherein the file merging method is executed when the program runs.

The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.

An embodiment of the present invention provides a storage medium, on which a program is stored, which, when executed by a processor, implements the above-described file merging method.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in FIG. 3. The computer device includes a processor A01, a network interface A02, a memory (not shown), and a database (not shown) connected by a system bus. The processor A01 of the computer device provides computing and control capabilities. The memory of the computer device comprises an internal memory A03 and a non-volatile storage medium A04. The non-volatile storage medium A04 stores an operating system B01, a computer program B02, and a database (not shown in the figure). The internal memory A03 provides an environment for the operation of the operating system B01 and the computer program B02 in the non-volatile storage medium A04. The network interface A02 of the computer device is used for communication with an external terminal through a network connection. The computer program B02 is executed by the processor A01 to implement a file merging method.

Those skilled in the art will appreciate that the architecture shown in fig. 3 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

The embodiment of the invention provides a device, which comprises a processor, a memory and a program that is stored on the memory and can run on the processor, wherein the processor, when executing the program, implements the following steps: after the Spark engine writes the Hive file, determining the file size and the number of files of the files to be merged; determining the current merging task according to the file size and the number of files; and submitting the merging task to the Spark engine, and starting a merging task thread through the Spark engine to merge the files to be merged to obtain the merged file.

In one embodiment, after the Spark engine writes the Hive file, determining the file size and the number of files of the file to be merged includes: after the Spark engine writes in the Hive file, determining the state value of the function configuration item of the merged file; and under the condition that the state value of the function configuration item of the merged file indicates that merging is started, determining the file size and the number of files of the file to be merged according to the written Hive file.

In one embodiment, the method further comprises: and under the condition that the state value of the function configuration item of the merged file indicates that merging is started, scanning the Hdfs directory to acquire the file of the task so as to determine the file size and the number of the files to be merged.

In one embodiment, when the number of files to be merged is greater than or equal to the lower limit, starting the merging task of the files to be merged according to the size of the files and the number of the files includes: determining parameters of file merging size configuration items; determining a target value of the file size after the file is merged according to the parameters of the file merging size configuration items; and under the condition that the number of the files to be merged is greater than or equal to the lower limit value, merging the files in the files to be merged according to the size of each file to be merged, so that the size of the file obtained after merging at least 2 files is a target value of the size of the file.

In one embodiment, the method further comprises: and when the file size of the file in the file to be merged is larger than or equal to the target value of the file size, not merging the files larger than or equal to the target value of the file size.

In one embodiment, the lower limit of the number of file merges is 10 and the target file size is 128M.

The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the steps of the file merging method.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

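1. A method for merging files, the method comprising:

after the Spark engine writes the Hive file, determining the file size and the number of files of the files to be merged;

determining the current merging task according to the file size and the file number;

and submitting the merging task to the Spark engine, and starting a merging task thread through the Spark engine to merge the files to be merged to obtain the merged file.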

2. The method of claim 1, wherein after the Spark engine writes the Hive file, determining the file size and the number of files of the files to be merged comprises:

after the Spark engine writes in the Hive file, determining the state value of the function configuration item of the merged file;

and under the condition that the state value of the function configuration item of the merged file indicates that merging is started, determining the file size and the number of files of the file to be merged according to the written Hive file.

3. The method of claim 2, further comprising:

and under the condition that the state value of the function configuration item of the merged file indicates that merging is started, scanning an Hdfs directory to acquire a file of the task so as to determine the size and the number of the files to be merged.

4. The method according to claim 1, wherein the determining the current merging task according to the file size and the file number comprises:

determining parameters of file merging quantity configuration items;

determining a lower limit value of the file merging quantity according to the parameters of the file merging quantity configuration items;

under the condition that the number of the files to be merged is smaller than the lower limit value, the merging task of the files to be merged is not started;

and starting a merging task of the files to be merged according to the size of the files and the number of the files under the condition that the number of the files to be merged is greater than or equal to the lower limit value.

5. The method according to claim 4, wherein, in the case that the number of files of the file to be merged is greater than or equal to the lower limit value, starting the merging task for the file to be merged according to the file size and the number of files comprises:

determining parameters of file merging size configuration items;

determining a target value of the file size after the file is merged according to the parameters of the file merging size configuration items;

and under the condition that the number of the files to be merged is greater than or equal to the lower limit value, merging the files in the files to be merged according to the size of each file to be merged, so that the file size obtained after merging at least 2 files is the target value of the file size.

6. The method of claim 5, further comprising:

and when the file size of the file existing in the files to be merged is larger than or equal to the target value of the file size, not merging the files larger than or equal to the target value of the file size.

7. The method of claim 5, further comprising:

determining the process number of the merging tasks according to the target value of the file size and the lower limit value of the file merging number under the condition that the file number of the files to be merged is greater than or equal to the lower limit value;

creating temporary file directories with the same number as the processes;

determining the files corresponding to each temporary file directory according to the target file size value and the lower limit value of the file merging number;

and respectively merging the files in each temporary file directory to obtain a merged file corresponding to each temporary file directory.

8. The method according to any one of claims 4 to 7, wherein the lower limit value of the file merging number is 10, and the target file size value is 128M.

9. A processor configured to perform the file merging method according to any one of claims 1 to 8.

10. A machine-readable storage medium having instructions stored thereon, which when executed by a processor causes the processor to be configured to perform the file merging method according to any one of claims 1 to 8.