CN112650711A

CN112650711A - Massive small file storage method based on Redis and HDFS

Info

Publication number: CN112650711A
Application number: CN202011528833.7A
Authority: CN
Inventors: 成军; 祖佳征; 杨勤
Original assignee: Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Current assignee: Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority date: 2020-12-22
Filing date: 2020-12-22
Publication date: 2021-04-13

Abstract

The invention relates to a mass small file storage method based on Redis and HDFS. When small files are uploaded, they are stored in the cache database Redis. A timer is configured to check at regular intervals whether the cached small files have reached a set merging threshold; when they have, the cached files are merged through an interface provided by HDFS, and the merged file is uploaded to HDFS. Because the small files are combined into Sequence Files for storage on HDFS, the storage efficiency of small files in HDFS is improved, while caching improves the read performance of small files. The method is suitable for popularization and application.

Description

Massive small file storage method based on Redis and HDFS

Technical Field

The invention relates to the technical field of big data caching, in particular to a mass small file storage method based on Redis and HDFS.

Background

The Hadoop Distributed File System (HDFS) is a distributed file system that runs on general-purpose hardware. As a high-fault-tolerance, high-throughput solution for mass data storage, HDFS has been widely used in large-scale online services and large-scale storage systems. It has become a de facto standard for mass storage at online service companies such as large websites, and provides reliable and efficient service to their customers.

With the rapid development of information systems, massive amounts of information must be stored reliably while remaining quickly accessible to a large number of users. Architecturally, traditional storage schemes have found it increasingly difficult to keep pace with the rapid growth of information system services in recent years, and have become a bottleneck and an obstacle to service development.

HDFS spreads data access and storage across a large number of servers through an efficient distribution algorithm. It distributes access among the servers in a cluster while reliably keeping multiple backup copies, and represents a disruptive departure from conventional storage architectures.

Redis (Remote Dictionary Server) is an open-source, log-type key-value database written in ANSI C. It supports network access, is memory-based with optional persistence, and provides APIs for multiple languages.

With the development of science and technology, small files such as pictures and documents are used ever more widely, and users demand ever higher read and write speeds for them. To address the low storage efficiency of small files at present, the invention provides a mass small file storage method based on Redis and HDFS.

Disclosure of Invention

In order to make up for the defects of the prior art, the invention provides a simple and efficient mass small file storage method based on Redis and HDFS.

The invention is realized by the following technical scheme:

A mass small file storage method based on Redis and HDFS, characterized by comprising the following steps: when small files are uploaded, they are stored in the cache database Redis; a timer is configured to check at regular intervals whether the small files have reached a set merging threshold; and when the merging threshold is reached, the cached files are merged through an interface provided by HDFS and the merged file is uploaded to HDFS.

When small files are uploaded, each received file is first stored in a cache file (SF) in the cache database Redis so that users can read it at high speed. The total length SFL of the files stored in the cache file (SF) is updated in real time, and a periodic check determines whether SFL has reached the file merging size. If so, a message that the files can be merged is sent to a merging processing module, which packs the cached files and uploads them to HDFS, and the information about the merged files is stored in the metadata information.
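The upload and threshold check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `CacheFile` class is an in-memory stand-in for the cache file (SF) in Redis, and `MERGE_THRESHOLD`, `timed_check`, and the `notify` callback are hypothetical names introduced here.

```python
from dataclasses import dataclass, field

MERGE_THRESHOLD = 64 * 1024 * 1024  # hypothetical merge size in bytes


@dataclass
class CacheFile:
    """In-memory stand-in for the cache file (SF) kept in Redis."""
    files: dict = field(default_factory=dict)  # file name -> bytes
    sfl: int = 0  # SFL: total length of the files currently stored

    def upload(self, name: str, data: bytes) -> None:
        # Replace an existing entry without double-counting its length.
        old = self.files.get(name)
        if old is not None:
            self.sfl -= len(old)
        self.files[name] = data
        self.sfl += len(data)

    def ready_to_merge(self, threshold: int = MERGE_THRESHOLD) -> bool:
        # The comparison a timer would run periodically.
        return self.sfl >= threshold


def timed_check(cache: CacheFile, notify, threshold: int = MERGE_THRESHOLD) -> bool:
    """Run one timer tick: notify the merge module if the threshold is reached."""
    if cache.ready_to_merge(threshold):
        notify("files can be merged")
        return True
    return False
```

In a running system, `timed_check` would be invoked by a scheduler at a fixed interval, and `notify` would hand the message to the merging processing module.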

The merged cache file is stored in a Sequence File (Hadoop SequenceFile) on HDFS, and the metadata information of the cache file is stored in the cache database Redis.

After the cache files are packed and uploaded, the small files stored in the cache file are deleted to release memory, and the metadata records are updated, which guarantees high fault tolerance and efficient file access.
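The pack, upload, and release sequence might look like the sketch below. It is an assumption-laden illustration: plain dictionaries stand in for the Redis cache and metadata, `write_to_hdfs` is a hypothetical callable, and the files are concatenated into a raw byte blob rather than written as a real Hadoop Sequence File.

```python
def merge_and_release(cache, metadata, seq_name, write_to_hdfs):
    """Pack cached small files into one blob, store it, then free the cache.

    cache:         file name -> bytes (stand-in for the Redis cache file)
    metadata:      file name -> record dict (stand-in for Redis metadata)
    seq_name:      name chosen for the merged file
    write_to_hdfs: hypothetical callable(name, blob); a real system would
                   write a Hadoop Sequence File here instead of a raw blob
    Returns the total number of bytes packed.
    """
    blob = bytearray()
    pending = {}
    for name, data in cache.items():
        pending[name] = {
            "name": name,
            "length": len(data),
            "filename": seq_name,
            "filepos": len(blob),  # offset of this file inside the blob
        }
        blob += data
    write_to_hdfs(seq_name, bytes(blob))
    # Only after the upload returns: record metadata and release memory.
    metadata.update(pending)
    cache.clear()
    return len(blob)
```

Updating the metadata and clearing the cache only after the write returns means a failed upload leaves the cached files readable, which is one simple way to keep the deletion step safe.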

A small file is a file that does not exceed 64 MB.

The content of the cache file comprises the cache file data, the total length of the files stored in the cache file data, the file information stored in the metadata, the folder structure, and the folder ID.

The metadata information comprises the file name, the file length, the file name of the Sequence File, and the file storage position in the Sequence File.

In the storage structure of HDFS, each merged Sequence File is stored in the base directory and named by a timestamp.

The Sequence File uses a sequential storage structure, so the file content can be located quickly through the file position recorded under the file name in the metadata.

In addition, to reduce disk usage and increase transmission speed, the Sequence File uses Block compression.
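As a small illustration of timestamp naming, the sketch below builds an HDFS path for a merged Sequence File. The base directory `/smallfiles`, the `.seq` extension, and the timestamp format are all assumptions; the source only states that the file sits in the base directory and is named by a timestamp.

```python
import posixpath
from datetime import datetime, timezone

BASE_DIR = "/smallfiles"  # hypothetical base directory on HDFS


def sequence_file_path(when: datetime, base_dir: str = BASE_DIR) -> str:
    """Build the HDFS path of a merged Sequence File from its creation time."""
    # UTC with microseconds keeps names sortable and unlikely to collide.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S%f")
    return posixpath.join(base_dir, f"{stamp}.seq")
```

For example, a merge performed at 08:30:00.123456 UTC on 2020-12-22 would be stored as `/smallfiles/20201222083000123456.seq`.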

When a user reads a small file, or when the timer starts its periodic check, the method first determines whether the small file exists in the cache file of the cache database Redis. If it does, the small file is returned directly. Otherwise, the name of the Sequence File that holds the small file, the file length, and the file position are looked up by file name in the metadata held in Redis, the file content is located quickly from the file position, and the result is returned to the user.
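The read path above can be sketched as a cache lookup followed by a metadata lookup. Dictionaries stand in for the Redis cache and metadata, and `read_range` is a hypothetical callable that reads from the merged file on HDFS. Treating the file position as a plain byte offset is a simplification; with Block compression the position would identify a Block that must then be scanned.

```python
def read_small_file(name, cache, metadata, read_range):
    """Return the bytes of a small file, preferring the cache.

    cache:      file name -> bytes (stand-in for the Redis cache file)
    metadata:   file name -> dict with "filename", "filepos", "length"
    read_range: hypothetical callable(seq_file, offset, length) -> bytes
                that reads from the merged Sequence File on HDFS
    """
    data = cache.get(name)
    if data is not None:
        return data  # cache hit: return directly
    meta = metadata.get(name)
    if meta is None:
        raise FileNotFoundError(name)
    # Cache miss: locate the content through the recorded position.
    return read_range(meta["filename"], meta["filepos"], meta["length"])
```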

During file reading, the Block is traversed from its beginning until the file to be acquired is reached. Under highly concurrent access, because a Sequence File is read sequentially, several of the files to be acquired may be hit in the same Block, so reading directly would read the same Block multiple times. A caching mechanism is therefore used to hold Blocks and improve reading efficiency.
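One way to hold Blocks in a cache is a small least-recently-used (LRU) cache, sketched below. The source does not say which eviction policy or cache location is used, so LRU, the `BlockCache` name, the capacity, and the `load_block` callable are all assumptions made for illustration.

```python
from collections import OrderedDict
import threading


class BlockCache:
    """Small thread-safe LRU cache keyed by (sequence file, block offset)."""

    def __init__(self, load_block, capacity=16):
        self._load = load_block  # hypothetical callable(seq_file, offset) -> bytes
        self._capacity = capacity
        self._blocks = OrderedDict()
        self._lock = threading.Lock()
        self.loads = 0  # number of times a Block was actually fetched

    def get(self, seq_file, offset):
        key = (seq_file, offset)
        with self._lock:
            if key in self._blocks:
                self._blocks.move_to_end(key)  # mark as most recently used
                return self._blocks[key]
            block = self._load(seq_file, offset)
            self.loads += 1
            self._blocks[key] = block
            if len(self._blocks) > self._capacity:
                self._blocks.popitem(last=False)  # evict least recently used
            return block
```

Loading while holding the lock serializes fetches, which keeps the sketch simple and guarantees that concurrent readers of the same Block trigger only one fetch; a production design would likely use finer-grained locking.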

The beneficial effects of the invention are as follows: the method combines small files into Sequence Files for storage on HDFS, which improves the storage efficiency of small files in HDFS, and it uses caching to improve the read performance of small files. The method is suitable for popularization and application.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a mass small file storage method based on Redis and HDFS.

FIG. 2 is a schematic diagram of a storage process of a large number of small files based on Redis and HDFS.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the mass small file storage method based on Redis and HDFS, when small files are uploaded, they are stored in the cache database Redis. A timer is configured to check at regular intervals whether the small files have reached a set merging threshold; when the merging threshold is reached, the cached files are merged through an interface provided by HDFS, and the merged file is uploaded to HDFS.

A small file is a file that does not exceed 64 MB.

Redis serves as the file caching server and stores the cached file content and the file metadata, as shown in Tables 1 and 2.

TABLE 1 Cached file content

TABLE 2 Metadata information

Name	Type	Description
name	String	File name
length	Long	File length
filename	String	File name of the Sequence File
filepos	Long	File storage position in the Sequence File
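The metadata record in Table 2 maps naturally onto a small typed structure. The sketch below is illustrative only: the class name `FileMeta` and the helper methods are hypothetical, the byte unit for `length` is an assumption, and the string-valued dictionary mimics how a record could be kept in a Redis hash.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FileMeta:
    name: str      # file name of the small file
    length: int    # file length (Long in Table 2; bytes assumed here)
    filename: str  # file name of the Sequence File holding the small file
    filepos: int   # file storage position in the Sequence File (Long)

    def to_hash(self) -> dict:
        # Flatten to string values, as a Redis hash would store them.
        return {k: str(v) for k, v in asdict(self).items()}

    @classmethod
    def from_hash(cls, h: dict) -> "FileMeta":
        # Restore the numeric fields from their string form.
        return cls(h["name"], int(h["length"]), h["filename"], int(h["filepos"]))
```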

During file reading, because the Sequence File uses Block compression, the synchronization points lie at the two ends of each Block, so the file to be acquired must be reached by traversing from the beginning of its Block. The cost of reading the file during this traversal is low. In addition, under highly concurrent access, because a Sequence File is read sequentially, several of the files to be acquired may be hit in the same Block, and reading directly may read the same Block multiple times. A caching mechanism is therefore used to hold Blocks and improve reading efficiency.
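The traversal inside a compressed Block can be illustrated with a deliberately simplified layout. This is not Hadoop's actual on-disk Sequence File format: here a Block is assumed to be a zlib-compressed run of length-prefixed (key, value) records, and `pack_block` and `find_in_block` are hypothetical helpers.

```python
import struct
import zlib


def pack_block(records):
    """Compress (key, value) byte pairs into one block (simplified layout)."""
    raw = b"".join(
        struct.pack(">I", len(k)) + k + struct.pack(">I", len(v)) + v
        for k, v in records
    )
    return zlib.compress(raw)


def find_in_block(block, wanted):
    """Decompress a block and scan from its start for the wanted key."""
    raw = zlib.decompress(block)
    pos = 0
    while pos < len(raw):
        (klen,) = struct.unpack_from(">I", raw, pos)
        pos += 4
        key = raw[pos:pos + klen]
        pos += klen
        (vlen,) = struct.unpack_from(">I", raw, pos)
        pos += 4
        if key == wanted:
            return raw[pos:pos + vlen]
        pos += vlen  # skip this value and move to the next record
    return None
```

Because the whole Block must be decompressed before any record can be read, a single decompressed Block can serve every request that hits it, which is why caching Blocks pays off under concurrent access.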

The embodiment described above is only one specific embodiment of the present invention; ordinary changes and substitutions made by those skilled in the art within the technical scope of the present invention fall within the protection scope of the present invention.

Claims

1. A mass small file storage method based on Redis and HDFS, characterized by comprising the following steps: when small files are uploaded, the small files are stored in a cache database Redis; a timer is configured to check at regular intervals whether the small files have reached a set merging threshold; and when the small files reach the merging threshold, the cache files are merged through an interface provided by HDFS and the merged cache files are uploaded to HDFS.

2. The mass small file storage method based on Redis and HDFS of claim 1, wherein: when a small file is uploaded, the received file is first stored in a cache file of the cache database Redis to facilitate high-speed reading by a user, the length SFL of the files stored in the cache file is updated in real time, and whether the length SFL of the stored files has reached the file merging size is judged at regular intervals; if so, a message that the files can be merged is sent to a merging processing module, the merging processing module packs the cache files and uploads them to HDFS, and the information of the merged files is stored in metadata information.

3. The mass small file storage method based on Redis and HDFS of claim 2, wherein: the merged cache file is stored in a Sequence File of HDFS, and the metadata information of the cache file is stored in the cache database Redis.

4. The mass small file storage method based on Redis and HDFS of claim 1, 2 or 3, wherein: the small file is a file that does not exceed 64 MB.

5. The mass small file storage method based on Redis and HDFS of claim 3, wherein: the cache file content comprises cache file data, the total length of the files stored in the cache file data, file information stored in the metadata, a folder structure, and a folder ID.

6. The mass small file storage method based on Redis and HDFS of claim 3, wherein: in the storage structure of HDFS, the merged Sequence File is stored in the base directory and named by a timestamp.

7. The mass small file storage method based on Redis and HDFS of claim 6, wherein: when a user reads a small file or the timer starts its periodic check, whether the small file exists in the cache file of the cache database Redis is judged first; if it does, the small file is returned directly; otherwise, the name of the Sequence File where the small file is located, the file length, and the file position are queried by file name in the metadata of the cache database Redis, the file content is quickly located according to the file position, and the result is returned to the user.

8. The mass small file storage method based on Redis and HDFS of claim 6, wherein: during file reading, the Block is traversed from its beginning to the file to be acquired; and under highly concurrent access, a caching mechanism is used to hold Blocks so as to improve reading efficiency.