CN111159130A - Small file merging method and electronic equipment

Info

Publication number: CN111159130A
Application number: CN201811317734.7A
Authority: CN
Inventors: 秦华婵; 廖光贤; 范云博; 陶捷; 沈国栋; 王宝晗
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Priority date: 2018-11-07
Filing date: 2018-11-07
Publication date: 2020-05-15

Abstract

The invention discloses a small file merging method and electronic equipment, which are used for improving the access efficiency of system files. The small file merging method comprises the following steps: searching files in a distributed file system (HDFS) to obtain a plurality of small files; determining at least two small files to be merged from the plurality of small files; and merging the at least two small files to be merged according to a merging strategy by using a file merging tool based on Spark, wherein the merging strategy is used for indicating the size of the merged small files.

Description

Small file merging method and electronic equipment

Technical Field

The invention relates to the technical field of big data, in particular to a small file merging method and electronic equipment.

Background

A distributed file system (HDFS) is an important component of a cluster and consists of a management node and a plurality of data nodes. The management node keeps the metadata of the file system in memory. Although each small file itself takes up little space, each small file still occupies one memory block on the management node, and each memory block takes about 150 bytes. If ten million files are stored, the management node needs about 3 GB of memory to store and manage information such as the file system directory; that is, the number of files that can be stored and the cluster size are severely limited by the memory size of the management node.
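The 3 GB figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes, as a common rule of thumb that the text above does not state explicitly, that each small file costs the management node two in-memory objects (one file entry and one block entry) of about 150 bytes each. The function and constant names are illustrative only.

```python
BYTES_PER_OBJECT = 150       # approximate size of one metadata object (from the text)
OBJECTS_PER_SMALL_FILE = 2   # assumption: one file entry plus one block entry


def namenode_memory_bytes(num_files: int) -> int:
    """Estimate the management-node memory consumed by `num_files` small files."""
    return num_files * OBJECTS_PER_SMALL_FILE * BYTES_PER_OBJECT


estimate = namenode_memory_bytes(10_000_000)
print(f"{estimate / 1e9:.1f} GB")  # ten million files -> 3.0 GB
```

Under this assumption the cost grows linearly with the file count, which is why reducing the number of files, rather than the amount of data, is what relieves the management node.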

Moreover, the HDFS is based on streaming access, i.e. a write-once, read-many access mode, so accessing small files in the HDFS requires continually jumping from one small file to another. As the number of small files stored in the HDFS increases, read-write performance drops and access time grows; that is, the access efficiency of system files is low.

Disclosure of Invention

The embodiment of the invention provides a small file merging method and electronic equipment, which are used for improving the access efficiency of system files.

In a first aspect, a method for merging small files is provided, where the method includes:

searching files in a distributed file system (HDFS) to obtain a plurality of small files;

determining at least two small files to be merged from the plurality of small files;

and merging the at least two small files to be merged according to a merging strategy by using a file merging tool based on Spark, wherein the merging strategy is used for indicating the size of the merged small files.

In the embodiment of the invention, if a plurality of small files exist in the HDFS, the small files can be merged, and the small files are merged into a plurality of files with the size specified by the merging strategy through the Spark file merging tool according to the merging strategy, so that the number of the files in the HDFS is small, the reading and writing performance is improved, and the access efficiency of the system files is improved. And the directory for managing the HDFS file can be reduced, and the memory of the HDFS management node is saved.

Optionally, retrieving the file in the distributed file system HDFS to obtain a plurality of small files, including:

searching in the HDFS according to an input file directory and a small file threshold value to obtain a plurality of small files, wherein the size of each small file is smaller than or equal to the small file threshold value;

or,

and searching in the HDFS according to the input Hive table name to obtain the small files, wherein the Hive table is used for indicating a metadata storage directory of a file system, and the Hive table is used for indicating files of the same type in the HDFS.

In the embodiment of the invention, two retrieval modes are provided. In the first, the user specifies the retrieval directory, that is, the user inputs the file directory to be searched, which better meets the user's actual requirements. In the second, the user inputs the name of a Hive table stored in the system. Because a Hive table covers files of the same type, the small files found through the Hive table can be merged directly, which improves merging efficiency.

Optionally, after retrieving the file in the distributed file system HDFS, the method further includes:

judging whether all the small files obtained after retrieval are files of the same type;

and outputting a plurality of small files of the same type.

In the embodiment of the invention, after the small files are retrieved, it is necessary to judge whether the plurality of small files obtained are of the same type, that is, whether they can be merged. Small files that cannot be merged are not output, and only the small files that can be merged are output, so that fewer small files are output and it is less complex for the user to decide which small files to merge.

Optionally, after retrieving the file in the distributed file system HDFS, the method further includes: outputting attribute information of all small files or part of small files obtained after retrieval, wherein the attribute information comprises at least one of the size, the type, the storage format and the memory utilization rate of the small files;

determining at least two small files to be merged from the plurality of small files, including:

receiving a selection operation of a user on the plurality of small files based on the attribute information;

and determining at least two small files to be combined from the plurality of small files according to the selection operation.

In the embodiment of the present invention, after retrieving a plurality of small files, the attribute information of each small file, for example, the size, the type, and the like of the small file, may be output, so that the user may select the small files to be merged according to the attribute information, thereby performing a targeted optimization operation to optimize the HDFS.

Optionally, the merging the at least two small files to be merged by the file merging tool based on Spark according to a merging policy includes:

grouping at least two small files according to the merging strategy to obtain at least two groups, wherein the difference between the sizes of any two groups in the at least two groups is smaller than or equal to a first preset threshold value;

merging the at least two small files by taking a group as a unit by a file merging tool based on Spark, wherein each group corresponds to a new merged file;

and calling an output interface, and outputting at least two new files formed by correspondingly combining the at least two groups.

In the embodiment of the invention, a Spark file merging tool is used for grouping a plurality of small files according to the set size of the merged file, and the merged file corresponding to each group is output. The size of the merged file can be set in advance, so the actual requirements of users can be met.
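The grouping step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the number of groups is the total size divided by the target size (rounded up) and uses a greedy largest-first assignment to keep the groups balanced. The names `group_files`, `sizes`, and `target_size` are hypothetical.

```python
import heapq
import math


def group_files(sizes, target_size):
    """Split files into size-balanced groups of roughly `target_size` bytes.

    sizes: dict mapping file path -> size in bytes.
    Returns a list of (group_total_bytes, [paths]) tuples.
    """
    total = sum(sizes.values())
    num_groups = max(1, math.ceil(total / target_size))
    # Heap of (current_total, group_index); the smallest group is always on top.
    heap = [(0, i) for i in range(num_groups)]
    members = [[] for _ in range(num_groups)]
    # Place the largest files first so that smaller files even out the groups.
    for path, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        current, idx = heapq.heappop(heap)
        members[idx].append(path)
        heapq.heappush(heap, (current + size, idx))
    totals = {idx: group_total for group_total, idx in heap}
    return [(totals[i], members[i]) for i in range(num_groups)]


MB = 1024 * 1024
sizes = {f"part-{i}": s * MB for i, s in enumerate([30, 25, 20, 20, 15, 10, 5, 3])}
groups = group_files(sizes, target_size=64 * MB)
spread = max(t for t, _ in groups) - min(t for t, _ in groups)
print(len(groups), spread // MB)  # number of groups, size difference in MB
```

The greedy assignment treats the target size as approximate, so a group may end slightly above it. A caller would compare `spread` against the first preset threshold and, if it is exceeded, regroup with a different strategy.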

In a second aspect, an electronic device for merging small files is provided, the electronic device comprising:

a retrieval unit, configured to retrieve files in a distributed file system (HDFS) to obtain a plurality of small files;

a determining unit, configured to determine at least two small files to be merged from the plurality of small files;

and a merging unit, configured to merge the at least two small files to be merged according to a merging policy by using a file merging tool based on Spark, where the merging policy is used to indicate a size of the merged small file.

Optionally, the retrieving unit is specifically configured to:

or,

Optionally, the electronic device further includes a judging unit, configured to:

and outputting a plurality of small files of the same type.

Optionally, the retrieving unit is further configured to: outputting attribute information of all small files or part of small files obtained after retrieval, wherein the attribute information comprises at least one of the size, the type, the storage format and the memory utilization rate of the small files;

the determining unit is specifically configured to:

Optionally, the merging unit is specifically configured to:

In a third aspect, an electronic device is provided, which includes:

at least one processor, and

a memory coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the at least one processor implementing the method of any one of the first aspect by executing the instructions stored by the memory.

In a fourth aspect, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the first aspects.

In the embodiment of the invention, if a plurality of small files exist in the HDFS, the small files can be merged: the Spark file merging tool merges them into a plurality of files of the size specified by the merging strategy, so that the number of files in the HDFS is reduced, read-write performance is improved, and the access efficiency of system files is improved. In addition, the directory used for managing HDFS files can be reduced, which saves memory on the HDFS management node.

Drawings

Fig. 1 is a schematic flow chart of a small file merging method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more clearly and completely understood, the technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention.

At present, the number of files that the HDFS can store and its cluster size are severely limited by the memory size of the management node. Moreover, the HDFS is based on streaming access, and if the number of small files stored in the HDFS is large, read-write performance drops, which reduces the access efficiency of system files.

In view of this, in the embodiment of the present invention, the Spark file merging tool merges the plurality of small files existing in the HDFS into a plurality of files of the size specified by the merging strategy, so that the number of files in the HDFS is reduced, thereby improving read-write performance and the access efficiency of system files. In addition, the directory used for managing HDFS files can be reduced, which saves memory on the HDFS management node.

The technical scheme provided by the embodiment of the invention is described in the following with the accompanying drawings of the specification.

Referring to fig. 1, an embodiment of the present invention provides a method for merging small files, where the method for merging small files may be executed by an electronic device, such as a server or a PC, and a flow of the method for merging small files is described as follows.

S101, searching the files in the HDFS to obtain a plurality of small files.

The HDFS stores a large number of files, some files occupy a large space, some files occupy a small space, and the files occupying the small space are generally called small files. If the number of small files is large, the management node of the HDFS occupies a large space, and the number of stored files and the cluster size are limited. Therefore, in the embodiment of the invention, the small files of the HDFS are combined to reduce the number of the stored files. Because the data volume of the file in the HDFS is large, the embodiment of the invention firstly searches the file in the HDFS to retrieve the small file in the HDFS.

Specifically, the retrieval of the small files in the HDFS includes, but is not limited to, the following two ways, so as to meet the requirements of different scenes of the user.

The first mode is as follows:

The embodiment of the invention can receive the file directory and the small file threshold input by the user, and search according to them to obtain a directory list of a plurality of small files, thereby retrieving the plurality of small files. The file directory input by the user is the directory to be searched, and the small file threshold can be understood as the limit set by the user for deciding whether a file is a small file; for example, if the space occupied by a file is smaller than or equal to the small file threshold, the file is a small file. The small file threshold may be a default value set by the implementation, or a value entered later, such as 30M, 20M, or another value. In a possible implementation manner, the user may input the file directory through a command line tool provided by the electronic device in the embodiment of the present invention, through an interface on an operation interface provided by the electronic device, or in another possible manner. Because the electronic device provides an interface for the user to input the file directory, the operation difficulty is reduced.
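The first retrieval mode amounts to filtering a directory listing by size. The sketch below illustrates this on an in-memory listing; in a real deployment the listing would come from an HDFS client, which is not shown here. `FileEntry` and `find_small_files` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str
    size: int  # size in bytes


def find_small_files(listing, directory, threshold):
    """Return entries under `directory` whose size is <= `threshold` bytes."""
    prefix = directory.rstrip("/") + "/"
    return [f for f in listing if f.path.startswith(prefix) and f.size <= threshold]


MB = 1024 * 1024
listing = [
    FileEntry("/data/logs/a.txt", 2 * MB),
    FileEntry("/data/logs/b.txt", 30 * MB),   # exactly at the threshold: still small
    FileEntry("/data/logs/c.txt", 200 * MB),  # too large
    FileEntry("/data/other/d.txt", 1 * MB),   # outside the search directory
]
small = find_small_files(listing, "/data/logs", threshold=30 * MB)
print([f.path for f in small])
```

The comparison is inclusive, matching the "smaller than or equal to" wording of the small file threshold.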

In this mode, considering that the retrieved small files may be of different types and that small files of different types cannot be merged, the embodiment of the present invention determines whether all the small files obtained by retrieval are of the same type and outputs a plurality of small files of the same type, so that only mergeable small files are displayed to the user. This reduces the number of small files output and makes it easier for the user to determine the small files to be merged. Specifically, the embodiment of the present invention may submit the directory list of all the retrieved small files to a thread pool, where each thread in the thread pool takes a directory from the directory list and determines the file type of the corresponding small files by traversing the file header information of the files in that directory; the file type may be, for example, seq, orc, parquet, or text. After determining the small files of the same type, the embodiment of the invention can output them either as a directory list of the small files or as the small files themselves.
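One way to picture the header-based type check is shown below. It is a sketch under stated assumptions rather than the patented implementation: it relies on the leading magic bytes of the SequenceFile (`SEQ`), ORC (`ORC`) and Parquet (`PAR1`) formats, treats everything else as text, and works on headers already held in memory instead of reading from HDFS. The function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Leading magic bytes of each format; anything else is treated as text.
MAGIC = [(b"SEQ", "seq"), (b"ORC", "orc"), (b"PAR1", "parquet")]


def detect_type(header: bytes) -> str:
    """Classify a file by the first bytes of its content."""
    for magic, name in MAGIC:
        if header.startswith(magic):
            return name
    return "text"


def group_by_type(headers: dict) -> dict:
    """headers: path -> first bytes of the file. Returns type -> sorted paths."""
    paths = list(headers)
    # Each worker thread classifies one file header at a time.
    with ThreadPoolExecutor(max_workers=4) as pool:
        types = list(pool.map(lambda p: detect_type(headers[p]), paths))
    groups = defaultdict(list)
    for path, file_type in zip(paths, types):
        groups[file_type].append(path)
    return {t: sorted(ps) for t, ps in groups.items()}


headers = {
    "/data/a.orc": b"ORC\x08\x03",
    "/data/b.orc": b"ORC\x08\x01",
    "/data/c.parquet": b"PAR1\x15\x04",
    "/data/d.seq": b"SEQ\x06",
    "/data/e.txt": b"hello world\n",
}
by_type = group_by_type(headers)
print(by_type["orc"])
```

A plain text file that happens to begin with one of these byte sequences would be misclassified, so a production check would likely validate more of the header.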

The second mode is as follows:

The embodiment of the invention can receive a Hive table name input by the user and search in the HDFS according to the Hive table name, thereby retrieving a plurality of small files. Because the files in one Hive table are of a consistent type, the small files obtained by searching directly through the Hive table name can be merged directly, which improves the merging efficiency of the small files. The user may input the Hive table name through an interface on the operation interface provided by the electronic device, or in another possible manner, so that files are merged on demand after confirmation by the user.

In a possible implementation manner, after the plurality of small files are retrieved, the embodiment of the present invention may further output attribute information of all or part of the small files obtained after the retrieval, where the attribute information includes at least one of the size, type, storage format, and memory utilization rate of the small files, so as to help a user to better understand the storage condition of each small file, to perform targeted merging, and to optimize the HDFS as much as possible.

S102, determining at least two small files to be combined from the plurality of small files.

The embodiment of the invention receives the selection operation of a user aiming at a plurality of small files based on the attribute information, and determines at least two small files to be combined from the plurality of small files. The user can select a file directory or a Hive table to be merged according to merging requirements based on the attribute information, and the small files corresponding to the file directory can be determined as the small files to be merged or the small files corresponding to the Hive table can be determined as the small files to be merged.

S103, merging the at least two small files to be merged according to a merging strategy by the file merging tool based on Spark. And the merging strategy is used for indicating the size of the merged small file.

The Spark-based file merging tool may, for example, be built on RDDs. Specifically, in the embodiment of the present invention, the size and type of the merged file, the path of the merged file, and the like may be set through the Spark file merging tool. In a possible embodiment, the merging interface may be defined as compactSmallFiles, with its parameters set as follows:

{
  "path": "/abc", // directory after file merge
  "files": ["1.txt", "2.txt", "3.txt"], // files to be merged
  "type": "txt", // type of the merged file
  "targetSize": 67108864 // size of the merged file, in bytes
}

According to the method, the size of the merged file can be specified, and the actual requirements of the user are met.
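To make the parameter set concrete, the following sketch parses and validates such a request. It assumes the parameters arrive as a JSON object with the keys `path`, `files`, `type` and `targetSize`; the `CompactRequest` class and `parse_compact_request` function are hypothetical helpers, not part of Spark or of the described tool.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CompactRequest:
    path: str          # directory for the merged output
    files: List[str]   # small files to be merged
    type: str          # file type after merging
    target_size: int   # desired size of each merged file, in bytes


def parse_compact_request(raw: str) -> CompactRequest:
    """Parse a JSON parameter string and check the basic constraints."""
    data = json.loads(raw)
    files = list(data["files"])
    if len(files) < 2:
        raise ValueError("at least two small files are required for merging")
    target_size = int(data["targetSize"])
    if target_size <= 0:
        raise ValueError("targetSize must be positive")
    return CompactRequest(data["path"], files, data["type"], target_size)


raw = (
    '{"path": "/abc", "files": ["1.txt", "2.txt", "3.txt"], '
    '"type": "txt", "targetSize": 67108864}'
)
request = parse_compact_request(raw)
print(request.target_size // (1024 * 1024), "MiB")  # 67108864 bytes is 64 MiB
```

The "at least two files" check mirrors the method's requirement that at least two small files be selected for merging.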

According to the merging strategy, namely the size of the merged files, the at least two small files are divided into at least two groups, wherein the difference between the sizes of any two of the groups is smaller than or equal to a first preset threshold value; that is, the groups are close to equal in size. The Spark-based file merging tool then merges the small files group by group, and an output interface is called to output the at least two new files formed by merging the respective groups, thereby merging the plurality of small files. During merging, the directory of the merged files can be set as a temporary directory, and after the small files are merged, the temporary directory and the original small files are automatically deleted, so that the total size of the files in the HDFS is reduced. If a new file needs to be stored or merged after merging, it can be added to the HDFS or to a merged file.
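The temporary-directory handling can be illustrated on a local file system standing in for the HDFS. This is only a sketch under that assumption: it concatenates the files of one group into a temporary directory, publishes the result, and deletes the temporary directory and the originals. The function name `merge_group` is hypothetical, and the Spark job that would perform the merge in practice is not shown.

```python
import os
import shutil
import tempfile


def merge_group(paths, final_path):
    """Concatenate `paths` into `final_path`, then delete the originals."""
    out_dir = os.path.dirname(final_path)
    tmp_dir = tempfile.mkdtemp(prefix=".merge-tmp-", dir=out_dir)
    tmp_file = os.path.join(tmp_dir, os.path.basename(final_path))
    try:
        with open(tmp_file, "wb") as out:
            for path in paths:
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        os.replace(tmp_file, final_path)  # publish the merged file
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)  # remove the temporary directory
    for path in paths:
        os.remove(path)  # originals are deleted only after a successful merge


# Usage on a throwaway local directory.
work_dir = tempfile.mkdtemp()
parts = []
for name, text in [("1.txt", "a\n"), ("2.txt", "b\n"), ("3.txt", "c\n")]:
    part = os.path.join(work_dir, name)
    with open(part, "w") as fh:
        fh.write(text)
    parts.append(part)
merge_group(parts, os.path.join(work_dir, "merged.txt"))
print(sorted(os.listdir(work_dir)))
```

Byte-level concatenation is only valid for plain text. Formats such as ORC or Parquet would have to be rewritten through a reader and writer for that format, which is the role the Spark-based tool plays in the method.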

In existing merging approaches, although the number of files is reduced, the total size of the files is not; the merged files and the original files use different data blocks, and when the number of merged files is large, actual reading efficiency is low. In the embodiment of the invention, the Spark-based file merging tool merges the at least two small files according to the specified size of the merged file, so that after merging the total size of the files in the HDFS is reduced, memory space is saved, the number of merged files is small, and reading efficiency is high.

Although the current SequenceFile merging method reduces the number of files, no index is established for the files, so the merged files are inconvenient to view and reading efficiency remains low. In the embodiment of the invention, merging is performed group by group and the directory where each group is located serves as an index, which makes viewing convenient and reading efficient.

The current CombineFileInputFormat merging method needs extra memory to store the metadata of the small files, including each file's starting offset, length, data block location and similar information. In the embodiment of the invention, the original files are deleted after the files are merged, so no additional memory is needed and memory space is saved.

To sum up, the embodiment of the present invention uses the Spark file merging tool to merge a plurality of small files existing in the HDFS into a plurality of files of the size specified by the merging strategy, so that the number of files in the HDFS is reduced, thereby improving read-write performance and the access efficiency of system files. In addition, the directory used for managing HDFS files can be reduced, which saves memory on the HDFS management node.

The device provided by the embodiment of the invention is described in the following with the attached drawings of the specification.

Referring to fig. 2, based on the same inventive concept, an embodiment of the present invention provides an electronic device for merging small files, which includes a retrieving unit 201, a determining unit 202, and a merging unit 203. The retrieving unit 201 is configured to retrieve a file in the distributed file system HDFS, and obtain a plurality of small files. The determining unit 202 is configured to determine at least two small files to be merged from the plurality of small files. The merging unit 203 is configured to merge at least two small files to be merged according to a merging policy by using a file merging tool based on Spark, where the merging policy is used to indicate a size of the merged small files.

Optionally, the retrieving unit 201 is specifically configured to:

searching in an HDFS according to an input file directory and a small file threshold value to obtain a plurality of small files, wherein the size of each small file is smaller than or equal to the small file threshold value;

or,

and searching in the HDFS according to the input Hive table name to obtain a plurality of small files, wherein the Hive table is used for indicating a metadata storage directory of a file system, and the Hive table is used for indicating files of the same type in the HDFS.

Optionally, the system further includes a judgment unit, configured to:

and outputting a plurality of small files of the same type.

Optionally, the retrieving unit 201 is further configured to: outputting attribute information of all small files or part of small files obtained after retrieval, wherein the attribute information comprises at least one of the size, the type, the storage format and the memory utilization rate of the small files;

the determining unit 202 is specifically configured to:

receiving selection operation of a user aiming at the plurality of small files based on the attribute information;

at least two small files to be combined are determined from the plurality of small files according to the selection operation.

Optionally, the merging unit 203 is specifically configured to:

grouping at least two small files according to a merging strategy to obtain at least two groups, wherein the difference between the sizes of any two groups in the at least two groups is smaller than or equal to a first preset threshold value;

merging at least two small files by taking a group as a unit by a file merging tool based on Spark, wherein each group corresponds to a new merged file;

and calling an output interface, and outputting at least two new files formed by correspondingly combining at least two groups.

The device may be configured to execute the method provided in the embodiment shown in fig. 1, and therefore, for functions and the like that can be realized by each functional module of the device, reference may be made to the description of the method portion, which is not repeated here.

Referring to fig. 3, based on the same inventive concept, an embodiment of the present invention provides an electronic device, which may include: at least one processor 301, where the processor 301 is configured to execute the computer program stored in the memory to implement the steps of the small file merging method shown in fig. 1 according to the embodiment of the present invention: searching files in a distributed file system (HDFS) to obtain a plurality of small files; determining at least two small files to be merged from the plurality of small files; and merging the at least two small files to be merged according to a merging strategy by using a file merging tool based on Spark, wherein the merging strategy is used for indicating the size of the merged small files.

Optionally, the processor 301 may be a central processing unit, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits for controlling program execution.

Optionally, the electronic device further includes a memory 302 connected to the at least one processor, where the memory 302 may include a Read-Only Memory (ROM), a Random Access Memory (RAM), and a disk memory. The memory 302 is used for storing data required by the processor 301 during operation, that is, storing instructions executable by the at least one processor 301, and the at least one processor 301 executes the method shown in fig. 1 by executing the instructions stored in the memory 302. The number of memories 302 is one or more. The memory 302 is also shown in fig. 3, but it should be understood that the memory 302 is not a mandatory functional module, and is therefore shown in fig. 3 by a dotted line.

Optionally, the processor 301 is specifically configured to:

or,

Optionally, the processor 301 is specifically configured to:

and outputting a plurality of small files of the same type.

Optionally, the processor 301 is further configured to:

outputting attribute information of all small files or part of small files obtained after retrieval, wherein the attribute information comprises at least one of the size, the type, the storage format and the memory utilization rate of the small files;

Optionally, the processor 301 is specifically configured to:

The physical device corresponding to the retrieving unit 201, the determining unit 202 and the merging unit 203 may be the processor 301. The electronic device may be configured to perform the method provided by the embodiment shown in fig. 1. Therefore, regarding the functions that can be realized by each functional module in the device, reference may be made to the corresponding description in the embodiment shown in fig. 1, which is not repeated herein.

Embodiments of the present invention also provide a computer storage medium, where the computer storage medium stores computer instructions, and when the computer instructions are executed on a computer, the computer is caused to execute the method as described in fig. 1.

It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a Universal Serial Bus (USB) flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, and an optical disk.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for merging small files is characterized by comprising the following steps:

2. The method of claim 1, wherein retrieving the file in the distributed file system HDFS to obtain a plurality of small files comprises:

or,

3. The method of claim 2, wherein after retrieving the file in the distributed file system HDFS, further comprising:

and outputting a plurality of small files of the same type.

4. The method of claim 1, after retrieving the file in the distributed file system HDFS, further comprising: outputting attribute information of all small files or part of small files obtained after retrieval, wherein the attribute information comprises at least one of the size, the type, the storage format and the memory utilization rate of the small files;

5. The method according to any one of claims 1 to 4, wherein merging the at least two small files to be merged according to a merging policy by a Spark-based file merging tool comprises:

6. An electronic device for merging small files, comprising:

7. The electronic device of claim 6, wherein the retrieval unit is specifically configured to:

or,

8. The electronic device of claim 2, further comprising a determination unit to:

and outputting a plurality of small files of the same type.

9. An electronic device, comprising:

at least one processor, and

a memory coupled to the at least one processor;

wherein the memory stores instructions executable by the at least one processor, the at least one processor implementing the method of any one of claims 1-5 by executing the instructions stored by the memory.

10. A computer storage medium on which a computer program is stored, which computer program, when being executed by a processor, carries out the method according to any one of claims 1-5.