CN113111036A

CN113111036A - Small file processing method, device, medium and electronic equipment based on HDFS

Info

Publication number: CN113111036A
Application number: CN202110417936.4A
Authority: CN
Inventors: 魏鹏飞; 万月亮; 火一莽
Original assignee: Beijing Ruian Technology Co Ltd
Current assignee: Beijing Ruian Technology Co Ltd
Priority date: 2021-04-19
Filing date: 2021-04-19
Publication date: 2021-07-13
Also published as: WO2022222303A1

Abstract

The embodiment of the application discloses a small file processing method, a small file processing device, a small file processing medium and electronic equipment based on an HDFS. The method comprises the following steps: screening files to be processed to obtain target files; if the target file meets the file volume constraint condition, storing the target file to a target cluster according to a preset writing rule; and merging the target files in the target cluster according to file types to obtain merged files, and transmitting the merged files to the HDFS. According to the technical scheme, the small files can be stored in the cluster, and all the small files in the cluster are combined into the large file to be stored in the HDFS, so that the time for processing the small files is saved, and the processing efficiency is improved.

Description

Small file processing method, device, medium and electronic equipment based on HDFS

Technical Field

The embodiment of the application relates to the technical field of big data, in particular to a small file processing method, device, medium and electronic equipment based on an HDFS.

Background

With the development of internet technology, the amount of network data grows exponentially. In an actual production environment, the data scale reaches the billions or the petabyte (PB) level, and at present, HDFS (Hadoop Distributed File System) is widely used to process various files.

The data input to the HDFS is composed of many small files, where a small file is a file smaller than the minimum storage and processing unit of the HDFS.

Processing small files is much slower than processing large files of the same total size. Each small file occupies one resource unit, and starting and releasing tasks consumes a large amount of time, even most of the time. At present there is no good solution for processing large numbers of small files.

Disclosure of Invention

The embodiment of the application provides a small file processing method, a small file processing device, a medium and electronic equipment based on an HDFS (Hadoop Distributed File System), which can store small files in a cluster, combine the small files in the cluster into a large file and store the large file in the HDFS, save the time for processing the small files and improve the processing efficiency.

In a first aspect, an embodiment of the present application provides a small file processing method based on an HDFS, where the method includes:

screening files to be processed to obtain target files;

if the target file meets the file volume constraint condition, storing the target file to a target cluster according to a preset writing rule;

and merging the target files in the target cluster according to file types to obtain merged files, and transmitting the merged files to the HDFS.

In a second aspect, an embodiment of the present application provides an HDFS-based small file processing apparatus, including:

the target file acquisition module is used for screening the files to be processed to obtain target files;

the target file saving module is used for storing the target file to a target cluster according to a preset writing rule if the target file meets a file volume constraint condition;

and the merged file transmission module is used for merging all the target files in the target cluster according to file types to obtain merged files and transmitting the merged files to the HDFS.

In a third aspect, an embodiment of the present application provides a computer-readable medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements the HDFS-based small file processing method according to the embodiment of the present application.

In a fourth aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the HDFS-based small file processing method according to the embodiment of the present application.

According to the technical scheme provided by the embodiment of the application, the files to be processed are screened to obtain target files; if the target file meets the file volume constraint condition, storing the target file to a target cluster according to a preset writing rule; and merging all target files in the target cluster according to the file types to obtain merged files, and transmitting the merged files to the HDFS. According to the technical scheme, the small files can be stored in the cluster, and all the small files in the cluster are combined into the large file to be stored in the HDFS, so that the time for processing the small files is saved, and the processing efficiency is improved.

Drawings

FIG. 1 is a flowchart of a method for processing a small file based on an HDFS according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a small file processing process based on HDFS provided in the second embodiment of the present application;

FIG. 3 is a schematic structural diagram of a small file processing apparatus based on HDFS according to a third embodiment of the present application;

FIG. 4 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures.

Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.

Example one

Fig. 1 is a flowchart of a small file processing method based on HDFS according to an embodiment of the present application, where the present embodiment is applicable to a case where a large number of small files are processed, and the method can be executed by a small file processing apparatus based on HDFS according to an embodiment of the present application, where the apparatus can be implemented by software and/or hardware, and can be integrated in an intelligent terminal or other device for file processing.

As shown in fig. 1, the method for processing a small file based on HDFS includes:

S110, screening files to be processed to obtain target files;

In this embodiment, the files to be processed may be screened by using a preset function. Preferably, the preset function may be the globStatus function, through which the paths of the target files meeting the condition are acquired. The globStatus function matches paths against a pattern by wildcard matching.
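The screening step can be illustrated with a minimal sketch. It is written in Python and uses the standard-library `fnmatch` module as a stand-in for Hadoop's `globStatus`, so it filters a plain list of path strings rather than querying a file system. The function name `screen_files`, the sample paths, and the pattern are hypothetical.

```python
from fnmatch import fnmatchcase


def screen_files(paths, pattern):
    """Return the paths that match a wildcard pattern, preserving order.

    Stand-in for the screening step: a real deployment would ask the file
    system for matching paths; here a list of strings is filtered instead.
    Note that fnmatch's "*" also matches "/" characters.
    """
    return [p for p in paths if fnmatchcase(p, pattern)]


# Hypothetical candidate paths and pattern.
candidates = [
    "/data/in/a.log",
    "/data/in/b.log",
    "/data/in/c.tmp",
    "/data/archive/d.log",
]
targets = screen_files(candidates, "/data/in/*.log")
```

Only the two `.log` files under `/data/in/` are selected as target files in this example.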

S120, if the target file meets a file volume constraint condition, storing the target file to a target cluster according to a preset writing rule;

The file volume constraint condition can be set according to business requirements. For example, a target file smaller than 128 MB may be determined to satisfy the file volume constraint condition, and a target file larger than 128 MB may be determined not to satisfy it.

A cluster provides the same service across an increased number of servers, so that the service reaches a stable and efficient state.

In this scheme, the target files meeting the file volume constraint condition can be stored in the target cluster according to a specific rule. For example, a target file may be saved to the target cluster according to a specific naming rule. When the target files are stored in the target cluster, they are stored separately by service group. A target file that does not meet the file volume constraint condition is saved directly to the HDFS. The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware, and it is highly fault tolerant.
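The routing decision described above can be sketched as a single size check. This is a minimal Python illustration, not the applicant's implementation: the names `SMALL_FILE_LIMIT` and `choose_destination` are hypothetical, and treating a file of exactly 128 MB as "not small" is an assumption, since the text only covers the smaller-than and larger-than cases.

```python
# 128 MB, the example threshold given in the text.
SMALL_FILE_LIMIT = 128 * 1024 * 1024


def choose_destination(size_bytes, limit=SMALL_FILE_LIMIT):
    """Return "cluster" for files below the limit, otherwise "hdfs".

    Assumption: a file whose size equals the limit goes straight to HDFS.
    """
    return "cluster" if size_bytes < limit else "hdfs"
```

For example, `choose_destination(4096)` selects the cluster, while a 200 MB file is sent directly to the HDFS.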

In this technical solution, optionally, the target cluster includes a Redis cluster;

correspondingly, the storing the target file to a target cluster according to a preset writing rule includes:

and grouping the target files according to a preset writing rule and storing the target files to a Redis cluster.

Redis (Remote Dictionary Server) is an open-source, log-type key-value database written in ANSI C that supports network access, can run in memory with persistence, and provides APIs in multiple languages. A Redis cluster enhances the read-write capability of Redis. In a Redis cluster, each Redis instance is referred to as a node. There are two types of nodes: master nodes and slave nodes.

The target file is stored in the Redis cluster, so that CPU resources can be fully utilized, and the processing performance of the small file is improved.
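Grouping target files under a writing rule can be sketched as follows. A Python dict stands in for the Redis cluster so that the sketch runs without a server. The class name `InMemoryCluster`, the key format `<service group>:<file type>:<file name>`, and the sample names are all hypothetical, because the text does not specify the writing rule.

```python
import os


class InMemoryCluster:
    """Dictionary-backed stand-in for the Redis cluster (an assumption)."""

    def __init__(self):
        self.store = {}

    def save(self, service_group, file_name, content):
        """Store content under a key built from the hypothetical write rule."""
        file_type = os.path.splitext(file_name)[1].lstrip(".") or "unknown"
        key = f"{service_group}:{file_type}:{file_name}"
        self.store[key] = content
        return key

    def keys_for_group(self, service_group):
        """Return the sorted keys that belong to one service group."""
        prefix = service_group + ":"
        return sorted(k for k in self.store if k.startswith(prefix))
```

Encoding the service group and file type in the key keeps files of one group together and makes the later merge by file type a simple prefix scan.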

And S130, merging the target files in the target cluster according to file types to obtain a merged file, and transmitting the merged file to the HDFS.

In this scheme, the file types can be looked up according to the protocol, and the target files are combined into a large file according to the preset combination rule. Preferably, the target files can be merged into several large files by the copyBytes function and uploaded to the HDFS.
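Merging by file type can be sketched as below. This Python illustration concatenates byte strings in memory instead of streaming them with `copyBytes`. Deriving the file type from the file extension and recording an (offset, length) index for each original file are assumptions added for illustration; the function name `merge_by_type` and the sample data are hypothetical.

```python
import os
from collections import defaultdict


def merge_by_type(files):
    """Merge a {name: bytes} mapping into one blob per file type.

    Returns {file_type: (merged_bytes, index)}, where index maps each
    original name to its (offset, length) inside the merged blob.
    """
    grouped = defaultdict(list)
    for name in sorted(files):
        file_type = os.path.splitext(name)[1].lstrip(".") or "unknown"
        grouped[file_type].append(name)

    merged = {}
    for file_type, names in grouped.items():
        blob = bytearray()
        index = {}
        for name in names:
            index[name] = (len(blob), len(files[name]))
            blob += files[name]
        merged[file_type] = (bytes(blob), index)
    return merged


# Hypothetical small files held in the cluster.
small_files = {"a.log": b"one", "b.log": b"two!", "c.csv": b"x,y"}
result = merge_by_type(small_files)
```

Here the two `.log` files end up in one merged blob and the `.csv` file in another, and the index allows an individual small file to be sliced back out of its merged file.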

In this technical solution, optionally, merging the target files in the target cluster according to file types to obtain a merged file, and transmitting the merged file to the HDFS includes:

if the storage capacity of the target cluster meets a preset storage capacity condition, merging all target files in the target cluster according to file types to obtain a merged file, and transmitting the merged file to the HDFS;

and if the storage time of each target file in the target cluster meets a preset time condition, merging each target file in the target cluster according to the file type to obtain a merged file, and transmitting the merged file to the HDFS.

It can be understood that the target files are stored in the target cluster, and when the data amount stored in the target cluster reaches a certain amount or the storage time of the target files meets a certain time requirement, the target files in the target cluster are merged according to the file types to obtain merged files, and the merged files are transmitted to the HDFS.

By combining all the small files in the target cluster into a large file to be stored in the HDFS, the time for processing the small files is saved, and the processing efficiency is improved.
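The two merge triggers, stored amount and storage time, can be expressed as one predicate. This is a minimal Python sketch; the function name `should_flush` and the default thresholds (64 MB and 300 seconds) are hypothetical, since the text only says the conditions are preset.

```python
def should_flush(stored_bytes, oldest_age_seconds,
                 max_bytes=64 * 1024 * 1024, max_age_seconds=300):
    """Return True when either the capacity or the time condition is met.

    The default thresholds are illustrative assumptions only.
    """
    return stored_bytes >= max_bytes or oldest_age_seconds >= max_age_seconds
```

Either condition alone is sufficient, which matches the "reaches a certain amount or meets a certain time requirement" wording above.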

Optionally, if a cache abnormality is detected, the target files in the target cluster are merged according to file types by a preset timing task to obtain a merged file, and the merged file is transmitted to the HDFS.

Specifically, if the client crashes abnormally, the write counter in the target cluster is left in an abnormal state. In that case, or when the client makes no request for a long time so that the cache cannot be flushed to the HDFS, a background timing task is started according to the configuration to flush the target cluster cache that has been held too long to the HDFS.

Because the state of the target cluster is monitored in this way, the target cluster cache is prevented from overflowing because of abnormal conditions, and the small file processing efficiency is improved.
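The background timing task can be sketched as a periodic sweep that selects cache entries that have been idle for too long. This Python sketch assumes that each cached file records the time of its last write; the function name `find_stale_entries`, the timestamp representation (seconds), and the idle limit are hypothetical.

```python
def find_stale_entries(last_write_times, now, max_idle_seconds):
    """Return the names of cached files idle for longer than the limit.

    last_write_times maps a cached file name to its last write time in
    seconds; a scheduled task would flush the returned entries to HDFS.
    """
    return sorted(
        name for name, last_write in last_write_times.items()
        if now - last_write > max_idle_seconds
    )
```

Because the sweep depends only on elapsed time, it still flushes entries whose write counter was left non-zero by a client that never reported completion.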

In this technical solution, optionally, the method further includes:

responding to the reading operation of the client, and detecting whether target data exists in the target cluster;

if yes, sending the target data to a client;

and if the target data does not exist, acquiring the target data in the HDFS, and sending the target data to the client.

It can be understood that when reading target data, the client first reads from the target cluster; if the data is not in the target cluster, the target data is read from the HDFS. If the data reading time is longer than the time of uploading each target file in the target cluster to the HDFS, the target data is read directly from the HDFS.

The small files are stored in the cluster, combined into a large file, and stored in the HDFS. Reading data from the cluster first, and from the HDFS otherwise, improves the reading efficiency.
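The read path, cluster first and HDFS as the fallback, can be sketched as follows. Both stores are modelled as plain Python dicts for illustration; the function name `read_target` and the returned source label are hypothetical.

```python
def read_target(key, cluster, hdfs):
    """Return (data, source), checking the cluster before the HDFS.

    Both stores are plain dicts in this sketch; a KeyError is raised if
    the key is present in neither of them.
    """
    if key in cluster:
        return cluster[key], "cluster"
    return hdfs[key], "hdfs"
```

A file that has already been merged and flushed is no longer in the cluster, so the same call transparently falls through to the HDFS.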

According to the technical scheme provided by the embodiment of the application, the files to be processed are screened to obtain target files; if the target file meets the file volume constraint condition, storing the target file to a target cluster according to a preset writing rule; and merging all target files in the target cluster according to the file types to obtain merged files, and transmitting the merged files to the HDFS. By executing the technical scheme, the small files can be stored in the cluster, and all the small files in the cluster are combined into a large file to be stored in the HDFS, so that the time for processing the small files is saved, and the processing efficiency is improved.

Example two

Fig. 2 is a schematic diagram of a small file processing process based on the HDFS provided in the second embodiment of the present application, and the second embodiment is further optimized based on the first embodiment. The concrete optimization is as follows: saving the target file to a target cluster according to a preset writing rule, wherein the method comprises the following steps: responding to a client calling request, and sending a target parameter to a client so that the client can construct a file to be stored meeting a cluster constraint condition according to the target parameter; wherein the target parameters comprise an absolute path parameter and an attachment name parameter; and responding to the write operation of the client, and storing the file to be stored to a target cluster. The details which are not described in detail in this embodiment are shown in the first embodiment. As shown in fig. 2, the method comprises the steps of:

S210, screening files to be processed to obtain target files;

S220, responding to a client calling request, and sending a target parameter to a client so that the client can construct a file to be stored meeting a cluster constraint condition according to the target parameter; wherein the target parameters comprise an absolute path parameter and an attachment name parameter;

In this scheme, the MapFile name service is implemented with the Spring Boot framework, and the service interface is a REST interface. By sending a call request, the client calls the MapFile name service to obtain the absolute path parameter of the MapFile in which the attachment is stored and the attachment name parameter of the attachment within the MapFile. The client processes the target file according to the absolute path parameter and the attachment name parameter to construct a file to be stored that meets the cluster constraint condition. The file to be stored may be a MapFile.

S230, responding to the writing operation of the client, and storing the file to be stored to a target cluster;

In this scheme, the client processes the target file according to the absolute path parameter and the attachment name parameter and, after constructing the file to be stored that meets the cluster constraint condition, writes the file to be stored into the target cluster.

In this technical solution, optionally, before responding to a client call request and sending a target parameter to a client, the method further includes:

determining information to be edited according to the namespace information and the data set information sent by the client; the information to be edited comprises a file length and a counter;

correspondingly, after the file to be stored is saved to the target cluster in response to the client write operation, the method further includes:

and if the file to be stored is detected to be stored in the target cluster, operating the counter to monitor the write-in operation of the client.

Specifically, the client calls the MapFile name service to obtain the full path of the MapFile in which an attachment is to be stored and the name of the attachment within that MapFile. The parameters passed in the call comprise a namespace, a data set, an attachment length, and an attachment name. The MapFile name service calculates the storage period of the data according to the namespace and the data set passed by the writing client, and looks up the corresponding MapFileInfo in an internal hash table. If no MapFileInfo exists, a new one is created; if one exists, the file length of the current MapFileInfo is increased, the counter is incremented by one, and the absolute path parameter of the MapFile to be written and the attachment name parameter of the attachment within the MapFile are returned to the client. The client writes the MapFile to the target cluster according to the absolute path parameter and the attachment name parameter. The client then calls the MapFile name service to report that the cache write is complete, and the MapFile name service decrements the reference count of the cluster cache by one, so that the write operation of the client is monitored. At this point, if the MapFile name service finds that the write counter of the current MapFile is 0 and that the target file storage has exceeded the configured time or size, it informs the client to flush the MapFile in the cluster to the HDFS and destroys the current MapFile structure.
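The bookkeeping described above can be sketched as a small in-process service. This Python sketch is not the Spring Boot service itself. It assumes that a newly created MapFileInfo is also updated for the first write, checks only the size threshold (the time threshold is omitted), and uses a hypothetical path layout and entry naming scheme. The method names `allocate` and `complete` are likewise hypothetical.

```python
class MapFileInfo:
    """Per-MapFile record: path, accumulated length, and write counter."""

    def __init__(self, path):
        self.path = path
        self.length = 0    # total attachment bytes allocated so far
        self.writers = 0   # writes started but not yet reported complete
        self.next_id = 0   # used to build unique entry names


class MapFileNameService:
    def __init__(self, max_length):
        self.max_length = max_length
        self.table = {}  # (namespace, dataset) -> MapFileInfo

    def allocate(self, namespace, dataset, attachment_length, attachment_name):
        """Return (absolute path, entry name) for one attachment write."""
        key = (namespace, dataset)
        info = self.table.get(key)
        if info is None:
            # Hypothetical path layout for the cached MapFile.
            info = MapFileInfo(f"/cache/{namespace}/{dataset}.map")
            self.table[key] = info
        info.length += attachment_length
        info.writers += 1
        entry_name = f"{info.next_id}_{attachment_name}"
        info.next_id += 1
        return info.path, entry_name

    def complete(self, namespace, dataset):
        """Record a finished write; return True if the caller should flush."""
        key = (namespace, dataset)
        info = self.table[key]
        info.writers -= 1
        if info.writers == 0 and info.length >= self.max_length:
            del self.table[key]  # destroy the current MapFile structure
            return True
        return False
```

Flushing only when the counter has returned to zero ensures that a MapFile is not moved to the HDFS while another client is still writing into it.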

Storing the target file in the Redis cluster makes full use of CPU resources and improves performance. Meanwhile, monitoring the cache operations written to the cluster improves the small file processing efficiency.

S240, merging the target files in the target cluster according to file types to obtain merged files, and transmitting the merged files to the HDFS.

According to the technical scheme provided by the embodiment of the application, the files to be processed are screened to obtain target files; and responding to the client call request, sending the target parameters to the client, and responding to the client write-in operation, and storing the file to be stored to the target cluster. And merging all target files in the target cluster according to the file types to obtain merged files, and transmitting the merged files to the HDFS. By executing the technical scheme, the small files can be stored in the cluster, and all the small files in the cluster are combined into a large file to be stored in the HDFS, so that the time for processing the small files is saved, and the processing efficiency is improved.

Example three

Fig. 3 is a schematic structural diagram of a small file processing device based on an HDFS according to a third embodiment of the present application, and as shown in fig. 3, the small file processing device based on the HDFS includes:

the target file obtaining module 310 is configured to screen a file to be processed to obtain a target file;

a target file saving module 320, configured to save the target file to a target cluster according to a preset writing rule if the target file meets a file volume constraint condition;

and the merged file transmission module 330 is configured to merge the target files in the target cluster according to file types to obtain a merged file, and transmit the merged file to the HDFS.

In this technical solution, optionally, the target file saving module 320 includes:

the calling request responding unit is used for responding to a calling request of the client and sending the target parameters to the client so that the client can construct a file to be stored meeting the cluster constraint condition according to the target parameters; wherein the target parameters comprise an absolute path parameter and an attachment name parameter;

and the write-in operation response unit is used for responding to the write-in operation of the client and storing the file to be stored to the target cluster.

In this technical solution, optionally, the apparatus further includes:

the information to be edited determining module is used for determining the information to be edited according to the namespace information and the data set information sent by the client; the information to be edited comprises a file length and a counter;

correspondingly, the device further comprises:

and the counter operation module is used for operating the counter if the file to be stored is detected to be stored in the target cluster, so as to monitor the write-in operation of the client.

In this technical solution, optionally, the merged file transmission module 330 includes:

the storage capacity judging unit is used for merging all target files in the target cluster according to file types to obtain merged files and transmitting the merged files to the HDFS if the storage capacity of the target cluster meets a preset storage capacity condition;

and the storage time judging unit is used for merging the target files in the target cluster according to file types to obtain a merged file and transmitting the merged file to the HDFS if the storage time of each target file in the target cluster meets a preset time condition.

and the cache exception processing unit is used for merging all target files in the target cluster according to file types according to a preset timing task to obtain a merged file and transmitting the merged file to the HDFS if the cache exception is detected.

correspondingly, the target file saving module 320 is specifically configured to:

In this technical solution, optionally, the apparatus further includes:

the client reading operation response module is used for responding to the client reading operation and detecting whether target data exists in the target cluster;

the target data existence module is used for sending the target data to the client if the target data exists;

and the target data non-existence module is used for acquiring the target data in the HDFS and sending the target data to the client if the target data does not exist.

The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method.

Example four

Embodiments of the present application also provide a medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a method for HDFS-based small file processing, the method including:

screening files to be processed to obtain target files;

Medium: any of various types of memory devices or storage devices. The term "medium" is intended to include: installation media such as CD-ROM, floppy disk, or tape devices; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, etc.; non-volatile memory such as flash memory, magnetic media (e.g., hard disk or optical storage); registers or other similar types of memory elements, etc. The medium may also include other types of memory or combinations thereof. In addition, the medium may be located in the computer system in which the program is executed, or may be located in a different second computer system, which is connected to the computer system through a network (such as the internet). The second computer system may provide the program instructions to the computer for execution. The term "medium" may include two or more media that may reside in different locations, such as in different computer systems that are connected by a network. The medium may store program instructions (e.g., embodied as computer programs) that are executable by one or more processors.

Of course, the medium provided in the embodiments of the present application includes computer-executable instructions, and the computer-executable instructions are not limited to the above-described HDFS-based small file processing operation, and may also perform related operations in the HDFS-based small file processing method provided in any embodiment of the present application.

Example five

The embodiment of the application provides electronic equipment, and the small file processing device based on the HDFS provided by the embodiment of the application can be integrated in the electronic equipment. Fig. 4 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present application. As shown in fig. 4, the present embodiment provides an electronic device 400, which includes: one or more processors 420; the storage device 410 is configured to store one or more programs, and when the one or more programs are executed by the one or more processors 420, the one or more processors 420 implement the HDFS-based small file processing method provided in an embodiment of the present application, where the method includes:

screening files to be processed to obtain target files;

Of course, those skilled in the art can understand that the processor 420 also implements the technical solution of the HDFS-based small file processing method provided in any embodiment of the present application.

The electronic device 400 shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 4, the electronic device 400 includes a processor 420, a storage device 410, an input device 430, and an output device 440; the number of the processors 420 in the electronic device may be one or more, and one processor 420 is taken as an example in fig. 4; the processor 420, the storage device 410, the input device 430, and the output device 440 in the electronic apparatus may be connected by a bus or other means, and are exemplified by a bus 450 in fig. 4.

The storage device 410 is a computer-readable medium, and can be used to store software programs, computer-executable programs, and module units, such as program instructions corresponding to the HDFS-based small file processing method in the embodiment of the present application.

The storage device 410 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the storage 410 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 410 may further include memory located remotely from processor 420, which may be connected via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 430 may be used to receive input numeric, character, or voice information, and to generate key signal inputs related to user settings and function control of the electronic device. The output device 440 may include a display screen, speakers, or other electronic equipment.

The electronic equipment provided by the embodiment of the application can achieve the purposes of saving the small files in the cluster, combining the small files in the cluster into the large file and storing the large file in the HDFS, saving the time for processing the small files and improving the processing efficiency.

The HDFS-based small file processing apparatus, medium, and electronic device provided in the above embodiments may execute the HDFS-based small file processing method provided in any embodiment of the present application, and have functional modules and beneficial effects corresponding to the execution of the method. Technical details that are not described in detail in the above embodiments can be found in the HDFS-based small file processing method provided in any embodiment of the present application.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present application, and the scope of the present application is determined by the scope of the appended claims.

Claims

1. A small file processing method based on HDFS is characterized by comprising the following steps:

screening files to be processed to obtain target files;

2. The method of claim 1, wherein the target file is saved to a target cluster according to a preset writing rule, and the method comprises:

responding to a client calling request, and sending a target parameter to a client so that the client can construct a file to be stored meeting a cluster constraint condition according to the target parameter; wherein the target parameters comprise an absolute path parameter and an attachment name parameter;

and responding to the write operation of the client, and storing the file to be stored to a target cluster.

3. The method of claim 2, wherein prior to sending the target parameter to the client in response to the client invocation request, the method further comprises:

4. The method according to claim 1, wherein merging the target files in the target cluster by file type to obtain a merged file, and transmitting the merged file to the HDFS, comprises:

5. The method according to claim 1, wherein merging the target files in the target cluster by file type to obtain a merged file, and transmitting the merged file to the HDFS, comprises:

6. The method of claim 1, wherein the target cluster comprises a Redis cluster;

7. The method of claim 1, further comprising:

if yes, sending the target data to a client;

8. An HDFS-based small file processing apparatus, comprising:

9. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the HDFS-based small file processing method according to any one of claims 1 to 7.

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the HDFS-based small file processing method according to any one of claims 1 to 7 when executing the computer program.