CN112948330A - Data merging method, device, electronic equipment, storage medium and program product - Google Patents


Info

Publication number
CN112948330A
CN112948330A (application CN202110221020.1A)
Authority
CN
China
Prior art keywords
file
merging
distributed file
data
file system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110221020.1A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lakala Payment Co ltd
Original Assignee
Lakala Payment Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lakala Payment Co ltd filed Critical Lakala Payment Co ltd
Priority to CN202110221020.1A
Publication of CN112948330A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/134 Distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure disclose a data merging method, apparatus, electronic device, storage medium and program product, wherein the method includes: in response to a data-write-success event from a distributed file system, reading the file information under the distributed file directory involved in the current data write operation; determining, from the file information, target files whose file size is smaller than a first preset threshold; and, when there are multiple target files, merging the multiple target files. With the technical scheme of the embodiments of the present disclosure, Spark can be prevented from generating too many small files while writing, which in turn improves the file management efficiency and data query performance of the distributed file system.

Description

Data merging method, device, electronic equipment, storage medium and program product
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a data merging method, a data merging device, electronic equipment, a storage medium and a program product.
Background
In the big data era, with the rapid rise and spread of Internet technology, the volume of data collected in different fields has grown to unprecedented levels. At the same time, the ways in which data are generated, stored, and processed have changed fundamentally: much of people's work and life can now be represented digitally, and data are used and queried very frequently.
Spark is a fast, general-purpose computing engine designed for large-scale data processing, and it has grown into a rapidly developing, widely applied ecosystem. Spark supports a variety of workloads, including SQL queries, text processing, and machine learning, and provides a number of libraries such as Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX. However, when writing to Hive through Spark SQL or Spark Streaming, or writing to HDFS directly, too many small files place enormous pressure on the NameNode's memory management, which can affect the stable operation of the whole cluster. How to keep Spark from generating too many small files when writing to Hive or directly to HDFS has therefore become one of the main problems to be solved by those skilled in the art.
Disclosure of Invention
The embodiment of the disclosure provides a data merging method, a data merging device, electronic equipment, a storage medium and a program product.
In a first aspect, an embodiment of the present disclosure provides a data merging method, including:
responding to a data writing success event of the distributed file system, and reading file information under a distributed file directory related to the current data writing operation;
determining a target file with the file size smaller than a first preset threshold according to the file information;
and when the target files are multiple, merging the multiple target files.
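The method of the first aspect can be sketched in plain Python (a minimal illustration only, not Spark or HDFS API code; the directory listing, file names, and the 64 MB threshold are hypothetical):

```python
def files_to_merge(file_infos, threshold_mb):
    """Pick target files smaller than the first preset threshold;
    a merge is only needed when there is more than one of them."""
    targets = [f for f in file_infos if f["size_mb"] < threshold_mb]
    return targets if len(targets) > 1 else []

# Hypothetical file information read after a data-write-success event
listing = [
    {"name": "part-00000", "size_mb": 70},
    {"name": "part-00001", "size_mb": 8},
    {"name": "part-00002", "size_mb": 12},
]

targets = files_to_merge(listing, threshold_mb=64)
```

Here only the two files below 64 MB become merge candidates; the 70 MB file is left untouched.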
Further, the data write operation includes an operation of a big data processing analysis engine writing a data processing result into the distributed file system.
Further, the method further comprises:
responding to a request of outputting data to the distributed file system by a current task, and sending a data writing request to the distributed file system so as to write the data to be output into the distributed file directory of the distributed file system;
and receiving the data writing success event returned by the distributed file system.
Further, the first preset threshold is predetermined based on the size of the disk block segmented when the data is stored in the distributed file system.
Further, when there are a plurality of target files, merging the plurality of target files, including:
grouping the target files according to their file sizes, so that the sum of the file sizes of the target files in each group is greater than or equal to the first preset threshold and less than or equal to a second preset threshold;
and merging the target files in each group.
Further, merging the target files in each group, including:
and calling a file merging interface in the distributed file system, and merging the target files in each group.
Further, grouping according to the file sizes of the target files, so that the sum of the file sizes of the plurality of target files included in each group is greater than or equal to the first preset threshold and is less than or equal to a second preset threshold, includes:
sorting the target files according to the file sizes;
dividing one target file from the larger end of the sorted result and one or more target files from the smaller end into a group, such that the sum of the file sizes of the target files in the group is greater than or equal to the first preset threshold and less than or equal to the second preset threshold;
and removing the grouped target files from the sorted result, retaining the ungrouped target files, and repeating the previous step until all the target files are grouped.
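A minimal Python sketch of this grouping step, under the stated thresholds (sizes in MB; the thresholds and file sizes are hypothetical, and a real implementation would carry file paths alongside the sizes):

```python
def group_targets(sizes, first_threshold, second_threshold):
    """Greedy grouping: repeatedly pair the largest ungrouped file with
    one or more of the smallest ones until the group total reaches
    first_threshold, never letting it exceed second_threshold."""
    order = sorted(sizes)              # ascending by file size
    groups = []
    lo, hi = 0, len(order) - 1
    while lo <= hi:
        group = [order[hi]]            # one file from the larger side
        total = order[hi]
        hi -= 1
        # add files from the smaller side while the group is still "small"
        while lo <= hi and total < first_threshold and \
                total + order[lo] <= second_threshold:
            group.append(order[lo])
            total += order[lo]
            lo += 1
        groups.append(group)
    return groups

groups = group_targets([10, 20, 30, 60, 50],
                       first_threshold=64, second_threshold=128)
```

Note that when the files run out, a final leftover group may still fall below the first threshold; it is then stored as-is rather than merged further.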
In a second aspect, an embodiment of the present disclosure provides a data merging method, including:
responding to the file information reading request, and returning the file information under the distributed file directory currently performing data writing operation;
receiving a merging request for merging the target files in the distributed file directory;
and carrying out merging operation on the target file according to the merging request.
Further, before the file information in the distributed file directory where the data write operation is currently performed is returned in response to the file information read request, the method further includes:
receiving a data writing request of a big data processing analysis engine;
writing the data specified in the data writing request into the distributed file directory according to the data writing request;
and returning a data writing success event to the big data processing analysis engine.
Further, performing a merge operation on the target file according to the merge request includes:
acquiring the grouping information of the target file in the merging request;
and merging a plurality of target files in the same group according to the grouping information.
In a third aspect, an embodiment of the present disclosure provides a data merging method, including:
responding to a data writing success event of the distributed file system, and sending a file information reading request to the distributed file system by the big data processing analysis engine;
the distributed file system responds to a file information reading request and returns file information under a distributed file directory related to data writing operation corresponding to the data writing success event;
the big data processing and analyzing engine determines a target file with the file size smaller than a first preset threshold according to the file information;
when a plurality of target files are available, the big data processing analysis engine sends a merging request for merging the target files to the distributed file system;
and the distributed file system receives a merging request for merging the target files in the distributed file directory and merges the target files according to the merging request.
Further, the data write operation includes an operation of a big data processing analysis engine writing a data processing result into the distributed file system.
Further, the method further comprises:
responding to a request of outputting data to the distributed file system by a current task, and sending a data writing request to the distributed file system so as to write the data to be output into the distributed file directory of the distributed file system;
and the distributed file system receives the data writing request of the big data processing and analyzing engine, writes the data to be output specified in the data writing request into the distributed file directory according to the data writing request, and returns a data writing success event to the big data processing and analyzing engine.
Further, the first preset threshold is predetermined based on the size of the disk block segmented when the data is stored in the distributed file system.
Further, when there are a plurality of target files, the big data processing analysis engine sends a merge request for performing a merge operation on the plurality of target files to the distributed file system, including:
the big data processing and analyzing engine groups the target files according to their file sizes, so that the sum of the file sizes of the target files in each group is greater than or equal to the first preset threshold and less than or equal to the second preset threshold;
the big data processing analysis engine sends the merging request to the distributed file system to merge the target files in each group.
Further, the sending, by the big data processing analysis engine, the merge request to the distributed file system to merge the target files in each group includes:
and the big data processing analysis engine merges the target files in each group by calling a file merging interface in the distributed file system.
Further, the big data processing and analyzing engine grouping the target files according to their file sizes, so that the sum of the file sizes of the target files included in each group is greater than or equal to the first preset threshold and less than or equal to a second preset threshold, includes:
the big data processing and analyzing engine sorts the target files according to the file sizes;
the big data processing and analyzing engine divides one target file from the larger end of the sorted result and one or more target files from the smaller end into a group, such that the sum of the file sizes of the target files in the group is greater than or equal to the first preset threshold and less than or equal to the second preset threshold;
and the big data processing analysis engine removes the grouped target files from the sorted result, retains the ungrouped target files, and repeats the previous step until all the target files are grouped.
Further, the receiving, by the distributed file system, a merge request for performing a merge operation on the target file in the distributed file directory, and performing a merge operation on the target file according to the merge request includes:
and the distributed file system acquires the grouping information of the target files in the merging request and merges a plurality of target files in the same group according to the grouping information.
In a fourth aspect, an embodiment of the present disclosure provides a data merging apparatus, including:
the first response module is configured to respond to a data writing success event of the distributed file system and read file information under a distributed file directory related to the current data writing operation;
the determining module is configured to determine a target file with a file size smaller than a first preset threshold according to the file information;
the first merging module is configured to merge the target files when there are a plurality of them.
In a fifth aspect, an embodiment of the present disclosure provides a data merging apparatus, including:
the second response module is configured to respond to the file information reading request and return the file information under the distributed file directory currently carrying out data writing operation;
the receiving module is configured to receive a merging request for merging the target files in the distributed file directory;
and the second merging module is configured to perform merging operation on the target file according to the merging request.
The functions can be realized by hardware, and the functions can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above-described functions.
In one possible design, the apparatus includes a memory configured to store one or more computer instructions that enable the apparatus to perform the corresponding method, and a processor configured to execute the computer instructions stored in the memory. The apparatus may also include a communication interface for the apparatus to communicate with other devices or a communication network.
In a sixth aspect, an embodiment of the present disclosure provides a data merging system, including: a big data processing analysis engine and a distributed file system;
the big data processing and analyzing engine responds to a data writing success event of a distributed file system and sends a file information reading request to the distributed file system;
the distributed file system responds to a file information reading request and returns file information under a distributed file directory related to data writing operation corresponding to the data writing success event;
the big data processing and analyzing engine determines a target file with the file size smaller than a first preset threshold according to the file information;
when a plurality of target files are available, the big data processing analysis engine sends a merging request for merging the target files to the distributed file system;
and the distributed file system receives a merging request for merging the target files in the distributed file directory and merges the target files according to the merging request.
In a seventh aspect, an embodiment of the present disclosure provides an electronic device, including a memory and a processor, where the memory is used to store one or more computer instructions that support any of the above apparatuses to perform the corresponding methods described above, and the processor is configured to execute the computer instructions stored in the memory. Any of the above may also include a communication interface for communicating with other devices or a communication network.
In an eighth aspect, the present disclosure provides a computer-readable storage medium for storing computer instructions for use by any one of the above apparatuses, which includes computer instructions for performing any one of the above methods.
In a ninth aspect, the disclosed embodiments provide a computer program product comprising computer instructions for implementing the steps of the method of any one of the above aspects when executed by a processor.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the method provided by the embodiment of the disclosure, in the process of performing big data operation by a big data processing analysis engine Spark, after data is written into a distributed file system such as a HIVE or HDFS, after a write-in success event is received, file information under a written file directory can be read, and when a plurality of small files with file sizes smaller than a first preset threshold exist under the file directory, the technical problem that file management of the distributed file system is over stressed due to too many small files generated by Spark when the files are written can be solved in a mode of merging the plurality of small files. By the method, too many small files can be prevented from being generated in the file writing process of the Spark, and the file management efficiency, the data query performance and the like of the distributed file system can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of embodiments of the disclosure.
Drawings
Other features, objects, and advantages of embodiments of the disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. In the drawings:
FIG. 1 shows a flow diagram of a data merging method according to an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of a data merging method according to another embodiment of the present disclosure;
FIG. 3 illustrates a flow diagram of a data merging method according to yet another embodiment of the present disclosure;
FIG. 4 illustrates an overall flow diagram of a data merging method according to an embodiment of the present disclosure;
FIG. 5 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 6 shows a block diagram of a data processing apparatus according to another embodiment of the present disclosure;
FIG. 7 shows a block diagram of a data processing system, according to an embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of a computer system suitable for use in implementing a data merging method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments of the disclosed embodiments will be described in detail with reference to the accompanying drawings so that they can be easily implemented by those skilled in the art. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.
In the disclosed embodiments, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.
It should be further noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
According to the technical scheme provided by the embodiments of the present disclosure, while a big data processing analysis engine (for example, Spark) performs a big data job and writes data into a distributed file system such as Hive or HDFS, the file information under the written file directory can be read once a write-success event is received. When the directory contains multiple small files whose sizes are smaller than a first preset threshold, merging those small files resolves the technical problem that the many small files generated by Spark during writing overload the file management of the distributed file system. In this way, Spark is prevented from generating too many small files while writing, and the file management efficiency and data query performance of the distributed file system are improved.
Fig. 1 shows a flowchart of a data merging method according to an embodiment of the present disclosure, as shown in fig. 1, the data merging method includes the following steps S101-S103:
in step S101, in response to a data write success event to the distributed file system, reading file information under a distributed file directory related to a current data write operation;
in step S102, determining a target file with a file size smaller than a first preset threshold according to the file information;
in step S103, when there are a plurality of target files, the plurality of target files are merged.
As mentioned above, Spark is a fast, general-purpose computing engine designed for large-scale data processing that has grown into a rapidly developing, widely applied ecosystem. Spark supports a variety of workloads, including SQL queries, text processing, and machine learning, and provides a number of libraries such as Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX. However, when writing to Hive through Spark SQL or Spark Streaming, or writing to HDFS directly, too many small files place enormous pressure on the NameNode's memory management, which can affect the stable operation of the whole cluster. How to keep Spark from generating too many small files when writing to Hive or directly to HDFS has therefore become one of the main problems to be solved by those skilled in the art.
In view of the above problems, this embodiment provides a data merging method. After the big data processing analysis engine Spark writes data into a distributed file system such as Hive or HDFS during a big data job, and a write-success event is received, the file information in the written file directory may be read; when the directory contains multiple small files whose sizes are smaller than a first preset threshold, merging those small files resolves the technical problem that the many small files generated by Spark during writing overload the file management of the distributed file system. In this way, Spark is prevented from generating too many small files while writing, and the file management efficiency and data query performance of the distributed file system are improved.
In an embodiment of the present disclosure, the data merging method may be adapted to be executed on a client of a big data processing analysis engine.
In an embodiment of the present disclosure, after a task started by the big data processing analysis engine, for example Spark, completes, its data may be written to disk files, which may be disk files in a distributed file system. In the big data processing analysis engine, one job is distributed across a plurality of different tasks executed in parallel, and each task generates multiple files during execution; thus a single job produces many disk files, and with a high degree of parallelism, many small disk files may be generated. Moreover, when these disk files are read in the next stage, each disk file requires one disk-addressing operation, so a large number of disk files increases the number of addressing operations and degrades the data reading efficiency of the distributed file system.
Therefore, in the embodiment of the present disclosure, after data is written into the distributed file and a data writing success event returned by the distributed file system is received, the file information written under the distributed file directory in which the data is written by the current data writing operation is read.
In an embodiment of the present disclosure, the file information in the distributed file directory may include, but is not limited to, information of a disk file generated by a current data write operation, such as a file name, a file storage location, a file size, and the like.
In an embodiment of the present disclosure, the first preset threshold may be set no larger than the size of a disk block in the distributed file system. HDFS introduces the concept of a block to facilitate the management and backup of files: the block is the minimum unit of storage in HDFS, and HDFS defines a default block size of 64 MB. When a disk file is uploaded to HDFS, if the file is larger than the configured block size it is split and stored across multiple blocks; these blocks may reside on different DataNodes, and HDFS guarantees that each block is stored entirely on one DataNode. Note that a file smaller than 64 MB does not occupy an entire block's space. The NameNode records which DataNode holds each block of a file; this information is commonly referred to as meta information (MetaInfo).

When a job is submitted to a big data processing analysis engine such as Spark, a great deal of data may be transmitted during execution, and after operations such as partitioning, a Spark task may produce many files smaller than a disk block. A file stored on a disk block then cannot fill the whole block, which wastes space, and the excess meta information occupies the NameNode's memory. Therefore, in the embodiment of the present disclosure, the first preset threshold may be set based on the disk block size of the distributed file system, for example to the block size itself: files written into the distributed file directory that are smaller than a disk block, i.e. small files, are determined to be target files and are then merged.
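To illustrate the NameNode-side cost described above, the following hypothetical calculation counts block metadata entries before and after merging, assuming the 64 MB default block size mentioned in the text (real HDFS metadata accounting is more involved; this is only a back-of-the-envelope sketch):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size cited above

def block_entries(file_sizes):
    # Each file occupies at least one block, and the NameNode keeps
    # meta information (MetaInfo) for every block of every file.
    return sum(max(1, math.ceil(s / BLOCK_SIZE)) for s in file_sizes)

small_files = [4 * 1024 * 1024] * 32       # 32 files of 4 MB each
before = block_entries(small_files)        # one entry per small file
after = block_entries([sum(small_files)])  # one merged 128 MB file
```

Merging shrinks the metadata from one entry per small file to one entry per 64 MB block of the merged file, which is the pressure relief on the NameNode that the embodiment targets.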
In an embodiment of the present disclosure, if there are a plurality of target files in the distributed file directory, that is, two or more target files, the problem caused by too many small files can be solved by merging the plurality of target files.
In an embodiment of the present disclosure, the data write operation includes an operation of a big data processing analysis engine writing a data processing result to the distributed file system. In this optional implementation, the embodiment of the present disclosure is directed to a data write operation of a distributed file system when a big data processing analysis engine executes a big data job.
In an embodiment of the present disclosure, the method further comprises the following steps:
responding to a request of outputting data to the distributed file system by a current task, and sending a data writing request to the distributed file system so as to write the data to be output into the distributed file directory of the distributed file system;
and receiving the data writing success event returned by the distributed file system.
In this alternative implementation, executing Spark SQL or Spark Streaming involves writing data to the distributed file system. Spark SQL or Spark Streaming performs data processing and analysis by starting a number of tasks executed in parallel, which may be Map tasks or Reduce tasks; Map tasks execute on Map nodes and Reduce tasks execute on Reduce nodes.
After a task completes, its data is persisted to disk, that is, written to disk files of the distributed file system. When this persist operation is executed, the execution result of the current task, i.e. the data to be written, may be sent to the distributed file system with a request to write it into a distributed file directory of the distributed file system. Note that the result of one task may comprise multiple files. While writing the files generated by the current task into the distributed file system, in response to the current task's request to output data, a data write request is sent to the distributed file system so as to write the data to be output into a distributed file directory; after the data-write-success event returned by the distributed file system is received, the small-file merging operation is performed.
In an embodiment of the present disclosure, the first preset threshold is predetermined based on a size of a disk block that is split when data is stored in the distributed file system.
In an embodiment of the present disclosure, the step S103, namely merging the plurality of target files when the plurality of target files are provided, further includes the following steps:
grouping according to the file sizes of the target files, so that the sum of the file sizes of the target files in each group is larger than or equal to the first preset threshold and smaller than or equal to the second preset threshold;
and merging the target files in each group.
In this alternative implementation, as described above, in the big data processing analysis engine, one job is allocated to a plurality of different tasks to be executed in parallel, so that for one job, a plurality of tasks to be executed in parallel generate a plurality of disk files, and in the case of many parallel tasks, a large number of smaller disk files may be generated. In addition, when the disk files are read in the next stage, each disk file needs to be subjected to disk addressing once, and when a large number of disk files exist, the addressing times are increased, so that the data reading efficiency of the distributed file system is influenced.
In addition, when the distributed file system stores a file larger than a disk block's storage space, the file is split into block-sized pieces and distributed across different disk blocks; when a stored file is smaller than a disk block's storage space, disk-block space is wasted. It will be understood that when a disk file is too large, the distributed file system must split it and store it across different disk blocks, and those disk blocks may be distributed across different storage nodes. An oversized disk file therefore also degrades the reading performance of the distributed file system.
Therefore, when merging target files in a distributed file directory, the embodiments of the present disclosure may group the target files, where each group may include a plurality of target files, and the sum of the sizes of the target files divided into one group is greater than or equal to the first preset threshold and less than or equal to the second preset threshold. As described above, the first preset threshold may be set according to the disk block size, and the second preset threshold may be set according to experience or the performance of the distributed file system, so that the merged disk file is not too large.
After the target files are divided into different groups, the target files in each group can be merged into one disk file and then stored in the distributed file system.
In an embodiment of the present disclosure, the merging the target files in each group further includes:
and calling a file merging interface in the distributed file system, and merging the target files in each group.
In this alternative implementation, the distributed file system may provide a file merging interface, and the big data processing analysis engine may merge the target files divided into a group by calling the file merging interface supported by the distributed file system itself.
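As a concrete illustration of such a file merging interface, the sketch below stands in for it with plain local files and byte concatenation (the Hadoop HDFS Java API, for instance, exposes a FileSystem.concat call for this purpose); the function name merge_files and the delete_sources flag are illustrative assumptions, not part of this disclosure:

```python
import os

# Hypothetical stand-in for the file merging interface: concatenate the
# grouped small files into one target file and remove the originals.
def merge_files(target_path, source_paths, delete_sources=True):
    with open(target_path, "wb") as out:
        for src in source_paths:
            with open(src, "rb") as f:
                out.write(f.read())
            if delete_sources:
                os.remove(src)  # the merged small file is no longer needed
    return os.path.getsize(target_path)
```

A caller would pass one group of target files at a time, producing one merged disk file per group.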
In an embodiment of the present disclosure, the step of grouping according to the file sizes of the target files, so that a sum of file sizes of a plurality of target files included in each group is greater than or equal to the first preset threshold and less than or equal to a second preset threshold further includes the steps of:
sorting the target files according to the file sizes;
dividing one target file from the larger end of the sorted result and one or more target files from the smaller end into a group, wherein the sum of the file sizes of the target files in the group is greater than or equal to the first preset threshold and less than or equal to the second preset threshold;
and removing the grouped target files from the sorted result, keeping the ungrouped target files, and repeating the previous step until all the target files are grouped.
In this optional implementation manner, in the process of merging a plurality of target files in the distributed file directory, merging may be performed based on the following principles:
First, the plurality of target files are sorted by file size, and the largest target file at one end and the smallest target file at the other end are added to a grouping queue. If merging the target files in the grouping queue would produce a file whose size is greater than or equal to the first preset threshold, the target files in the grouping queue are divided into one group, and the remaining target files continue to be grouped in the same manner. If the merged size would be smaller than the first preset threshold, the smallest ungrouped target file at the other end is also added to the grouping queue, and the check of whether the merged size reaches the first preset threshold is repeated; this continues until the number of remaining ungrouped target files is 1 or 0.
In this manner, the target files in the distributed file directory can be merged into larger files each no smaller than the first preset threshold, thereby resolving the series of problems caused by too many small files.
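The grouping rule above can be sketched as follows; the function name, threshold parameters, and tuple-based file representation are illustrative assumptions, not part of this disclosure:

```python
# Greedy grouping sketch: sort by size, then pair the largest remaining file
# with the smallest remaining files until the group total reaches the first
# preset threshold, without exceeding the second preset threshold.
def group_small_files(files, low, high):
    """files: list of (name, size) pairs; low/high: first/second thresholds.
    Returns groups whose total sizes aim for the range [low, high]."""
    pending = sorted(files, key=lambda f: f[1])  # ascending by size
    groups = []
    while len(pending) > 1:
        group = [pending.pop()]        # largest remaining file
        total = group[0][1]
        while total < low and pending and total + pending[0][1] <= high:
            smallest = pending.pop(0)  # smallest remaining file
            group.append(smallest)
            total += smallest[1]
        groups.append(group)
    if pending:                        # 1 or 0 ungrouped files remain
        groups.append([pending.pop()])
    return groups
```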
Fig. 2 shows a flowchart of a data merging method according to another embodiment of the present disclosure, as shown in fig. 2, the data merging method includes the following steps S201 to S203:
in step S201, in response to the file information reading request, returning file information in the distributed file directory where the data writing operation is currently performed;
in step S202, a merge request for performing a merge operation on a target file in the distributed file directory is received;
in step S203, a merge operation is performed on the target file according to the merge request.
As mentioned above, Spark is a fast, general-purpose computing engine designed for large-scale data processing, and its rapidly developing ecosystem now has a wide range of applications. Spark can perform a variety of operations, including SQL queries, text processing, and machine learning, and provides a number of libraries, including Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX. However, when writing to Hive through Spark SQL or Spark Streaming, or writing directly to HDFS, too many small files place enormous pressure on, for example, the memory management of the NameNode, which may affect the stable operation of the whole cluster. Therefore, preventing Spark from generating too many small files when writing to Hive or directly to HDFS has become one of the main problems to be solved by those skilled in the art.
In view of the above problems, this embodiment provides a data merging method. In a big data job executed by the big data processing and analyzing engine Spark, after data is written into a distributed file system such as Hive or HDFS and the write success event returned by that system is received, the file information under the written file directory is read; when a plurality of small files whose file sizes are smaller than a first preset threshold exist in that directory, the plurality of small files are merged, which solves the technical problem that too many small files generated by Spark when writing overburden the file management of the distributed file system. In this way, too many small files can be prevented from being generated while Spark writes files, and the file management efficiency, data query performance, and so on of the distributed file system can be improved.
In an embodiment of the present disclosure, the data merging method may be adapted to be performed on a distributed file system.
In an embodiment of the present disclosure, after the big data processing and analyzing engine sends data to the distributed file system for storage and the data is successfully written, the distributed file system returns a data write success event that carries storage information of the written data, for example the distributed file directory into which it was written. The big data processing engine may also send a file information read request to the distributed file system to obtain the file information in that distributed file directory, which may include the file name, file size, file storage location, and the like of the data written by the data write operation. The distributed file system returns the file information to the big data processing analysis engine based on the file information read request.
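A minimal local-filesystem sketch of the file information read request might look like this, with os.scandir standing in for the distributed file system's directory listing (the helper name read_file_info is an assumption):

```python
import os

# Return the file name, file size, and storage location of every file under
# the directory that the data write operation just populated.
def read_file_info(directory):
    return [{"name": e.name, "size": e.stat().st_size, "path": e.path}
            for e in os.scandir(directory) if e.is_file()]
```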
In an embodiment of the present disclosure, after receiving the file information in the distributed file directory, when the big data processing analysis engine determines that the current data write operation has written a plurality of small files, that is, target files, into the distributed file directory, it sends the distributed file system a request to merge those small files; the request may include the file information of the small files so that the distributed file system can perform the merge operation on them.
In an embodiment of the present disclosure, before the step S201 of returning, in response to the file information read request, the file information in the distributed file directory where the data write operation is currently performed, the method further includes the following steps:
receiving a data writing request of a big data processing analysis engine;
writing the data specified in the data writing request into the distributed file directory according to the data writing request;
and returning a data writing success event to the big data processing analysis engine.
In this optional implementation, while executing a big data job, the big data processing analysis engine starts a plurality of parallel tasks for data processing, and the processed results are written into disk files stored in the distributed file system. The distributed file system therefore receives the data write request from the big data processing analysis engine, establishes a distributed file directory for the request, and writes the processing result specified in the request into that directory. After the data is successfully written into the distributed file directory, the distributed file system returns a data write success event to the big data processing and analyzing engine, from which the engine determines that the data has been successfully written into the distributed file system.
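The write request and success event exchange described in this step can be mocked in a few lines; MockDFS and the event dictionary's fields are purely illustrative, not an actual HDFS or Hive interface:

```python
# In-memory mock of the exchange: the engine sends a data write request, the
# file system establishes the directory, stores the files, and returns a
# write success event carrying the written directory's storage information.
class MockDFS:
    def __init__(self):
        self.dirs = {}  # directory path -> {file name: file size}

    def write(self, directory, files):
        self.dirs.setdefault(directory, {}).update(files)
        return {"event": "write_success", "directory": directory}

dfs = MockDFS()
event = dfs.write("/warehouse/job_1", {"part-0": 12, "part-1": 30})
```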
In an embodiment of the present disclosure, step S203, which is to perform a merge operation on the target file according to the merge request, further includes the following steps:
acquiring the grouping information of the target file in the merging request;
and merging a plurality of target files in the same group according to the grouping information.
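Assuming the grouping information maps group identifiers to the file identifiers in each group, the merge step of these two operations might be sketched as follows (the names and the bytes-based file contents are illustrative):

```python
# Merge the files of each group into one file, driven by grouping information
# of the form {group id: [file id, ...]}.
def merge_by_groups(grouping_info, contents):
    """contents maps each file id to its bytes; returns one merged byte
    string per group id."""
    return {gid: b"".join(contents[fid] for fid in fids)
            for gid, fids in grouping_info.items()}
```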
In this alternative implementation, as described above, the big data processing analysis engine allocates one job to a plurality of different tasks executed in parallel, so a single job causes those parallel tasks to generate a plurality of disk files, and when there are many parallel tasks, a large number of small disk files may be generated. In addition, reading these disk files in the next stage requires one disk seek per file, so a large number of disk files increases the number of seeks and reduces the data reading efficiency of the distributed file system.
In addition, when the distributed file system stores a file larger than a disk block's storage space, the file is split into block-sized pieces and distributed across different disk blocks; when a stored file is smaller than a disk block's storage space, disk-block space is wasted. It will be understood that when a disk file is too large, the distributed file system must split it and store it across different disk blocks, and those disk blocks may be distributed across different storage nodes. An oversized disk file therefore also degrades the reading performance of the distributed file system.
Therefore, when merging target files under a distributed file directory, the big data processing analysis engine may group the target files, where each group may include a plurality of target files, and the sum of the sizes of the target files divided into one group is greater than or equal to the first preset threshold and less than or equal to the second preset threshold. As described above, the first preset threshold may be set according to the disk block size, and the second preset threshold may be set according to experience or the performance of the distributed file system, so that the merged disk file is not too large.
After dividing the target files into different groups, the big data processing analysis engine can merge the target files in each group into one disk file by calling a file merging interface provided by the distributed file system, which then stores the merged disk file. After receiving the file merge request, the distributed file system may merge the target files divided into the same group into one file according to the grouping information provided by the big data processing analysis engine. The grouping information may include at least the identifiers of the different groups and the file identifiers of the target files included in each group.
The technical terms and technical features involved in fig. 2 and its related embodiments are the same as or similar to those in fig. 1 and its related embodiments; for their explanation and description, reference may be made to the description of fig. 1 and its related embodiments above, which is not repeated here.
Fig. 3 shows a flowchart of a data merging method according to another embodiment of the present disclosure, as shown in fig. 3, the data merging method includes the following steps S301-S305:
in step S301, in response to a data write success event to the distributed file system, the big data processing analysis engine sends a file information read request to the distributed file system;
in step S302, the distributed file system returns, in response to the file information read request, file information in a distributed file directory related to the data write operation corresponding to the data write success event;
in step S303, the big data processing and analyzing engine determines, according to the file information, a target file with a file size smaller than a first preset threshold;
in step S304, when there are a plurality of target files, the big data processing analysis engine sends a merge request for performing a merge operation on the plurality of target files to the distributed file system;
in step S305, the distributed file system receives a merge request for performing a merge operation on a target file in the distributed file directory, and performs a merge operation on the target file according to the merge request.
As mentioned above, Spark is a fast, general-purpose computing engine designed for large-scale data processing, and its rapidly developing ecosystem now has a wide range of applications. Spark can perform a variety of operations, including SQL queries, text processing, and machine learning, and provides a number of libraries, including Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX. However, when writing to Hive through Spark SQL or Spark Streaming, or writing directly to HDFS, too many small files place enormous pressure on, for example, the memory management of the NameNode, which may affect the stable operation of the whole cluster. Therefore, preventing Spark from generating too many small files when writing to Hive or directly to HDFS has become one of the main problems to be solved by those skilled in the art.
In view of the above problems, this embodiment provides a data merging method. In a big data job executed by the big data processing and analyzing engine Spark, after data is written into a distributed file system such as Hive or HDFS and the write success event returned by that system is received, the file information under the written file directory is read; when a plurality of small files whose file sizes are smaller than a first preset threshold exist in that directory, the plurality of small files are merged, which solves the technical problem that too many small files generated by Spark when writing overburden the file management of the distributed file system. In this way, too many small files can be prevented from being generated while Spark writes files, and the file management efficiency, data query performance, and so on of the distributed file system can be improved.
In an embodiment of the present disclosure, the data merging method may be applied to a process in which a big data processing analysis engine writes data to a distributed file system.
In an embodiment of the present disclosure, after a task started by the big data processing analysis engine, for example Spark, completes, its data may be written into a disk file, which may be a disk file in a distributed file system. In the big data processing analysis engine, one job is allocated to a plurality of different tasks executed in parallel, so a single job causes those parallel tasks to generate a plurality of disk files, and when there are many parallel tasks, many small disk files may be generated. In addition, reading these disk files in the next stage requires one disk seek per file, so a large number of disk files increases the number of seeks and reduces the data reading efficiency of the distributed file system.
Therefore, in the embodiment of the present disclosure, after the data is written into the distributed file system and the data write success event returned by that system is received, the file information under the distributed file directory written by the current data write operation is read.
In an embodiment of the present disclosure, the file information in the distributed file directory may include, but is not limited to, information of a disk file generated by a current data write operation, such as a file name, a file storage location, a file size, and the like.
In an embodiment of the present disclosure, the first preset threshold may be set no larger than the size of a disk block in the distributed file system. The HDFS introduces the concept of a block to facilitate file management and backup: a block is the minimum storage unit in HDFS, with a default size of 64MB. When a disk file is uploaded to HDFS, if the file is larger than the configured block size, it is split and stored as a plurality of blocks, which may be placed on different DataNodes; HDFS guarantees throughout that each block is stored on a single DataNode. Notably, a file smaller than 64MB does not occupy an entire block's space. The NameNode records which DataNode holds each block of a file; this information is commonly called meta information (MetaInfo). When a job is submitted to a big data processing analysis engine such as Spark, much data may be transmitted during execution, and data generated by Spark tasks may, after operations such as partitioning, produce many files smaller than a disk block. Such a file cannot fill the disk block that stores it, which wastes space, and the excess meta information occupies NameNode memory. Therefore, in the embodiment of the present disclosure, the first preset threshold may be set based on the disk block size of the distributed file system, for example to the disk block size itself; a file smaller than a disk block, that is, a small file written under the distributed file directory, is determined to be a target file and then merged.
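The target-file test implied here is simply a size comparison against the block size; the 64MB constant follows the default cited above, and the helper name is an assumption:

```python
# A file smaller than one disk block (the first preset threshold here) is a
# small file, i.e., a candidate target file for merging.
BLOCK_SIZE = 64 * 1024 * 1024

def find_target_files(file_info):
    """file_info: list of {'name': ..., 'size': ...} dicts, as returned by
    the file information read request."""
    return [f for f in file_info if f["size"] < BLOCK_SIZE]
```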
In an embodiment of the present disclosure, if there are a plurality of target files in the distributed file directory, that is, two or more target files, the problem caused by too many small files can be solved by merging the plurality of target files.
In an embodiment of the present disclosure, after the big data processing and analyzing engine sends data to the distributed file system for storage and the data is successfully written, the distributed file system returns a data write success event that carries storage information of the written data, for example the distributed file directory into which it was written. The big data processing engine may also send a file information read request to the distributed file system to obtain the file information in that distributed file directory, which may include the file name, file size, file storage location, and the like of the data written by the data write operation. The distributed file system returns the file information to the big data processing analysis engine based on the file information read request.
In an embodiment of the present disclosure, after receiving the file information in the distributed file directory, when the big data processing analysis engine determines that the current data write operation has written a plurality of small files, that is, target files, into the distributed file directory, it sends the distributed file system a request to merge those small files; the request may include the file information of the small files so that the distributed file system can perform the merge operation on them.
In an embodiment of the present disclosure, the data write operation includes an operation of a big data processing analysis engine writing a data processing result to the distributed file system. In this optional implementation, the embodiment of the present disclosure is directed to a data write operation of a distributed file system when a big data processing analysis engine executes a big data job.
In an embodiment of the present disclosure, the method further comprises the following steps:
responding to a request of outputting data to the distributed file system by a current task, and sending a data writing request to the distributed file system so as to write the data to be output into the distributed file directory of the distributed file system;
and the distributed file system receives the data writing request of the big data processing and analyzing engine, writes the data to be output specified in the data writing request into the distributed file directory according to the data writing request, and returns a data writing success event to the big data processing and analyzing engine.
In this alternative implementation, executing Spark SQL or Spark Streaming involves data write operations to the distributed file system. In Spark SQL or Spark Streaming, data processing and analysis are performed by starting a plurality of parallel tasks, which may be Map tasks or Reduce tasks; Map tasks are executed on Map nodes, and Reduce tasks are executed on Reduce nodes.
After a task completes, a flush-to-disk operation is performed on its data, that is, the data is written into a disk file of the distributed file system. When the flush is executed, the execution result of the current task, namely the data to be written, may be sent to the distributed file system with a request to write it into a distributed file directory of that system. It should be noted that the result of executing one task may comprise a plurality of files. While writing the files generated by the current task into the distributed file system, in response to the current task needing to output data to the distributed file system, a data write request is sent to the distributed file system so that the data to be output is written into a distributed file directory of the distributed file system. After the data write success event returned by the distributed file system is received, the small-file merge operation is performed.
The distributed file system receives a data writing request from the big data processing analysis engine, establishes a distributed file directory for the data writing request according to the data writing request, and writes data specified in the data writing request into the distributed file directory. After the data are successfully written into the distributed file directory by the distributed file system, a data writing success event is returned to the big data processing and analyzing engine, and the big data processing and analyzing engine determines that the data are successfully written into the distributed file system based on the data writing success event.
In an embodiment of the present disclosure, the first preset threshold is predetermined based on a size of a disk block that is split when data is stored in the distributed file system.
In an embodiment of the present disclosure, in step S304, that is, when there are a plurality of target files, the step of sending, by the big data processing analysis engine, a merge request for performing a merge operation on the plurality of target files to the distributed file system further includes the following steps:
the big data processing analysis engine groups the target files according to their file sizes, so that the sum of the file sizes of the target files in each group is greater than or equal to the first preset threshold and less than or equal to the second preset threshold;
the big data processing analysis engine sends the merging request to the distributed file system to merge the target files in each group.
In this alternative implementation, as described above, the big data processing analysis engine allocates one job to a plurality of different tasks executed in parallel, so a single job causes those parallel tasks to generate a plurality of disk files, and when there are many parallel tasks, a large number of small disk files may be generated. In addition, reading these disk files in the next stage requires one disk seek per file, so a large number of disk files increases the number of seeks and reduces the data reading efficiency of the distributed file system.
In addition, when the distributed file system stores a file larger than a disk block's storage space, the file is split into block-sized pieces and distributed across different disk blocks; when a stored file is smaller than a disk block's storage space, disk-block space is wasted. It will be understood that when a disk file is too large, the distributed file system must split it and store it across different disk blocks, and those disk blocks may be distributed across different storage nodes. An oversized disk file therefore also degrades the reading performance of the distributed file system.
Therefore, when merging target files in a distributed file directory, the embodiments of the present disclosure may group the target files, where each group may include a plurality of target files, and the sum of the sizes of the target files divided into one group is greater than or equal to the first preset threshold and less than or equal to the second preset threshold. As described above, the first preset threshold may be set according to the disk block size, and the second preset threshold may be set according to experience or the performance of the distributed file system, so that the merged disk file is not too large.
After dividing the target files into different groups, the big data processing analysis engine can merge the target files in each group into one disk file by calling a file merging interface provided by the distributed file system, which then stores the merged disk file. After receiving the file merge request, the distributed file system may merge the target files divided into the same group into one file according to the grouping information provided by the big data processing analysis engine. The grouping information may include at least the identifiers of the different groups and the file identifiers of the target files included in each group.
In an embodiment of the present disclosure, the step of sending, by the big data processing analysis engine, the merge request to the distributed file system to merge the target files in each group further includes the following steps:
and the big data processing analysis engine merges the target files in each group by calling a file merging interface in the distributed file system.
In this alternative implementation, the distributed file system may provide a file merging interface, and the big data processing analysis engine may merge the target files divided into a group by calling the file merging interface supported by the distributed file system itself.
In an embodiment of the present disclosure, the step of the big data processing analysis engine grouping the target files according to their file sizes, so that the sum of the file sizes of the plurality of target files included in each group is greater than or equal to the first preset threshold and less than or equal to the second preset threshold, further includes the following steps:
sorting the target files according to the file sizes;
dividing one target file from the larger end of the sorted result and one or more target files from the smaller end into a group, wherein the sum of the file sizes of the target files in the group is greater than or equal to the first preset threshold and less than or equal to the second preset threshold;
and removing the grouped target files from the sorted result, keeping the ungrouped target files, and repeating the previous step until all the target files are grouped.
In this optional implementation manner, in the process of merging a plurality of target files in the distributed file directory, merging may be performed based on the following principles:
First, the plurality of target files are sorted by file size, and the largest target file at one end and the smallest target file at the other end are added to a grouping queue. If merging the target files in the grouping queue would produce a file whose size is greater than or equal to the first preset threshold, the target files in the grouping queue are divided into one group, and the remaining target files continue to be grouped in the same manner. If the merged size would be smaller than the first preset threshold, the smallest ungrouped target file at the other end is also added to the grouping queue, and the check of whether the merged size reaches the first preset threshold is repeated; this continues until the number of remaining ungrouped target files is 1 or 0.
In this manner, the target files in the distributed file directory can be merged into larger files each no smaller than the first preset threshold, thereby resolving the series of problems caused by too many small files.
In an embodiment of the present disclosure, the step of receiving, by the distributed file system, a merge request for performing a merge operation on a target file in the distributed file directory, and performing a merge operation on the target file according to the merge request further includes the following steps:
the distributed file system acquires the grouping information carried in the merge request and, according to that grouping information, merges the target files that belong to the same group.
In this embodiment, after the big data processing analysis engine divides the target files into groups, it may merge the target files of each group into one disk file by calling a file merging interface provided by the distributed file system, with the result stored in the distributed file system. After receiving the file merge request, the distributed file system merges the target files assigned to the same group into one file according to the grouping information provided by the big data processing analysis engine. The grouping information may include at least the identifiers of the groups and the file identifiers of the target files contained in each group.
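As a local-filesystem stand-in for such a per-group merge, the following sketch concatenates the files of one group into a single file. The function name `merge_group` and its arguments are illustrative; the patent does not specify the actual interface of the distributed file system:

```python
import shutil

def merge_group(group_paths, merged_path):
    """Concatenate the files of one group, in order, into a single
    merged file on disk (a stand-in for the file merging interface
    said to be provided by the distributed file system)."""
    with open(merged_path, "wb") as out:
        for path in group_paths:
            with open(path, "rb") as part:
                shutil.copyfileobj(part, out)  # stream each part into the merged file
    return merged_path
```

A real distributed file system would typically offer a server-side operation for this, so that the data need not pass through the client.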
The technical terms and technical features involved in the embodiments described above are the same as or similar to those involved in figs. 1-2 and their related embodiments; for their explanation, reference may be made to the description of figs. 1-2 and the related embodiments, and details are not repeated here.
Fig. 4 illustrates an overall flowchart of a data merging method according to an embodiment of the present disclosure. As shown in fig. 4, when the big data processing analysis engine Spark persists data to disk after a map task or a reduce task has finished, it writes the task output into the distributed file system HDFS. After the data is written successfully, HDFS returns a data-write-success event to Spark. On receiving this event, Spark sends a file information reading request to HDFS, and HDFS returns the file information under the file directory to which Spark previously wrote the data. By analyzing the file information, Spark determines whether the directory contains target files whose file size is smaller than the first preset threshold; when such files exist and their number is greater than or equal to 2, Spark sends HDFS a merge request for merging the plurality of target files. After receiving the merge request, HDFS merges the plurality of target files into one or more files.
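The engine-side decision in this flow can be sketched as follows. This is a sketch under assumptions: `fs` is a hypothetical client object whose `list_status` and `merge` methods stand in for the file-information and merge requests sent to HDFS, and are not actual HDFS API names:

```python
def on_write_success(fs, directory, first_threshold):
    """React to a data-write-success event: read the file information
    under the directory just written, find the files smaller than the
    first preset threshold, and request a merge when at least two exist."""
    infos = fs.list_status(directory)  # the file information reading request
    targets = [f["name"] for f in infos if f["size"] < first_threshold]
    if len(targets) >= 2:  # merging is only meaningful for two or more files
        fs.merge(directory, targets)  # the merge request to the file system
        return True
    return False
```

Running the check on the engine side keeps the file system passive: it only answers the listing request and executes merges it is explicitly asked for.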
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods.
Fig. 5 shows a block diagram of a data merging apparatus according to an embodiment of the present disclosure, which may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 5, the data merging apparatus includes:
a first response module 501, configured to, in response to a data write success event to the distributed file system, read file information under a distributed file directory involved in a current data write operation;
a determining module 502 configured to determine, according to the file information, a target file with a file size smaller than a first preset threshold;
a first merging module 503, configured to merge the plurality of target files when there is more than one target file.
In an embodiment of the present disclosure, the data merging apparatus may be adapted to run on a client of a big data processing analysis engine.
Fig. 6 shows a block diagram of a data merging apparatus according to another embodiment of the present disclosure, which may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 6, the data merging apparatus includes:
the second response module 601 is configured to respond to the file information reading request, and return file information in the distributed file directory where data writing operation is currently performed;
a receiving module 602, configured to receive a merge request for performing a merge operation on a target file in the distributed file directory;
a second merge module 603 configured to perform a merge operation on the target file according to the merge request.
In an embodiment of the present disclosure, the data merging apparatus may be adapted to execute on a distributed file system.
Fig. 7 shows a block diagram of a data merging system according to an embodiment of the present disclosure, which may be implemented as part or all of an electronic device by software, hardware, or a combination of both. As shown in fig. 7, the data merging system includes: a big data processing analysis engine 701 and a distributed file system 702;
the big data processing analysis engine 701 sends a file information reading request to the distributed file system 702 in response to a data writing success event to the distributed file system 702;
the distributed file system 702 responds to the file information reading request, and returns the file information under the distributed file directory related to the data writing operation corresponding to the data writing success event;
the big data processing and analyzing engine 701 determines a target file with a file size smaller than a first preset threshold according to the file information;
when there is more than one target file, the big data processing analysis engine 701 sends the distributed file system 702 a merge request for performing a merge operation on the plurality of target files;
the distributed file system 702 receives a merge request for performing a merge operation on the target files in the distributed file directory, and performs a merge operation on the target files according to the merge request.
In one embodiment of the present disclosure, the data merging system may be adapted for use in a process in which a big data processing analytics engine writes data to a distributed file system.
The technical features involved in the above apparatus embodiments, and their explanations and descriptions, are the same as, correspond to, or are similar to those of the above method embodiments; for details, reference may be made to the description of the method embodiments, which is not repeated here.
The embodiments of the present disclosure also disclose an electronic device, which includes a memory and a processor, wherein:
the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to perform any of the method steps described above.
FIG. 8 is a schematic block diagram of a computer system suitable for use in implementing a data merging method according to an embodiment of the present disclosure.
As shown in fig. 8, the computer system 800 includes a processing unit 801 which can execute various processes in the above-described embodiments according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The RAM 803 also stores various programs and data necessary for the operation of the computer system 800. The processing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read therefrom can be installed into the storage section 808 as needed. The processing unit 801 may be implemented as a CPU, GPU, TPU, FPGA, NPU, or other processing unit.
In particular, according to embodiments of the present disclosure, the methods described above may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a computer-readable medium, the computer program containing program code for performing the data merging method. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809 and/or installed from the removable medium 811.
A computer program product is also disclosed in embodiments of the present disclosure, the computer program product comprising computer programs/instructions which, when executed by a processor, implement any of the above method steps.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.
As another aspect, the disclosed embodiment also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus in the foregoing embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the embodiments of the present disclosure.
The foregoing description covers only the preferred embodiments of the present disclosure and illustrates the principles of the technology employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above features, and also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, solutions in which the above features are replaced with (but not limited to) features with similar functions disclosed in the embodiments of the present disclosure.

Claims (10)

1. A method of data merging, comprising:
responding to a data writing success event of the distributed file system, and reading file information under a distributed file directory related to the current data writing operation;
determining a target file with the file size smaller than a first preset threshold according to the file information;
and when there are a plurality of target files, merging the plurality of target files.
2. The method of claim 1, wherein the data write operation comprises an operation of a big data processing analysis engine writing data processing results to the distributed file system.
3. A method of data merging, comprising:
responding to the file information reading request, and returning the file information under the distributed file directory currently performing data writing operation;
receiving a merging request for merging the target files in the distributed file directory;
and carrying out merging operation on the target file according to the merging request.
4. A method of data merging, comprising:
responding to a data writing success event of the distributed file system, and sending a file information reading request to the distributed file system by the big data processing analysis engine;
the distributed file system responds to a file information reading request and returns file information under a distributed file directory related to data writing operation corresponding to the data writing success event;
the big data processing and analyzing engine determines a target file with the file size smaller than a first preset threshold according to the file information;
when there are a plurality of target files, the big data processing analysis engine sends, to the distributed file system, a merging request for merging the plurality of target files;
and the distributed file system receives a merging request for merging the target files in the distributed file directory and merges the target files according to the merging request.
5. A data merging apparatus, comprising:
the first response module is configured to respond to a data writing success event of the distributed file system and read file information under a distributed file directory related to the current data writing operation;
the determining module is configured to determine a target file with a file size smaller than a first preset threshold according to the file information;
the first merging module is configured to merge the plurality of target files when there are a plurality of target files.
6. A data merging apparatus, comprising:
the second response module is configured to respond to the file information reading request and return the file information under the distributed file directory currently carrying out data writing operation;
the receiving module is configured to receive a merging request for merging the target files in the distributed file directory;
and the second merging module is configured to perform merging operation on the target file according to the merging request.
7. A data merging system, comprising: a big data processing analysis engine and a distributed file system;
the big data processing and analyzing engine responds to a data writing success event of a distributed file system and sends a file information reading request to the distributed file system;
the distributed file system responds to a file information reading request and returns file information under a distributed file directory related to data writing operation corresponding to the data writing success event;
the big data processing and analyzing engine determines a target file with the file size smaller than a first preset threshold according to the file information;
when there are a plurality of target files, the big data processing analysis engine sends, to the distributed file system, a merging request for merging the plurality of target files;
and the distributed file system receives a merging request for merging the target files in the distributed file directory and merges the target files according to the merging request.
8. An electronic device comprising a memory and a processor; wherein:
the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the steps of the method of any one of claims 1-4.
9. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the steps of the method of any one of claims 1-4.
10. A computer program product comprising computer programs/instructions which, when executed by a processor, carry out the steps of the method of any one of claims 1 to 4.
CN202110221020.1A 2021-02-26 2021-02-26 Data merging method, device, electronic equipment, storage medium and program product Pending CN112948330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110221020.1A CN112948330A (en) 2021-02-26 2021-02-26 Data merging method, device, electronic equipment, storage medium and program product


Publications (1)

Publication Number Publication Date
CN112948330A true CN112948330A (en) 2021-06-11

Family

ID=76246680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110221020.1A Pending CN112948330A (en) 2021-02-26 2021-02-26 Data merging method, device, electronic equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN112948330A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448938A (en) * 2021-07-20 2021-09-28 恒安嘉新(北京)科技股份公司 Data processing method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778270A (en) * 2015-04-24 2015-07-15 成都汇智远景科技有限公司 Storage method for multiple files
CN106855861A (en) * 2015-12-09 2017-06-16 北京金山安全软件有限公司 File merging method and device and electronic equipment
WO2017133216A1 (en) * 2016-02-06 2017-08-10 华为技术有限公司 Distributed storage method and device
CN108595567A (en) * 2018-04-13 2018-09-28 郑州云海信息技术有限公司 A kind of merging method of small documents, device, equipment and readable storage medium storing program for executing
CN109446165A (en) * 2018-10-11 2019-03-08 中盈优创资讯科技有限公司 The file mergences method and device of big data platform
CN110321329A (en) * 2019-06-18 2019-10-11 中盈优创资讯科技有限公司 Data processing method and device based on big data
CN111159130A (en) * 2018-11-07 2020-05-15 中移(苏州)软件技术有限公司 Small file merging method and electronic equipment
CN112241396A (en) * 2020-10-27 2021-01-19 浪潮云信息技术股份公司 Spark-based method and Spark-based system for merging small files of Delta



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination