CN112650711A - Massive small file storage method based on Redis and HDFS - Google Patents

Info

Publication number
CN112650711A
Authority
CN
China
Prior art keywords
file
hdfs
cache
files
small
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011528833.7A
Other languages
Chinese (zh)
Inventor
成军
祖佳征
杨勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Original Assignee
Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chaozhou Zhuoshu Big Data Industry Development Co Ltd
Priority to CN202011528833.7A
Publication of CN112650711A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a massive small file storage method based on Redis and HDFS. When small files are uploaded, they are first stored in the Redis cache database. A timer is configured to check periodically whether the cached small files have reached a configured merge threshold; once the threshold is reached, the cached files are merged through an interface provided by HDFS and the merged file is uploaded to HDFS. Because the small files are merged into Sequence Files for storage on HDFS, the storage efficiency of small files in HDFS improves, and caching improves small-file read performance. The method is suitable for wide application.

Description

Massive small file storage method based on Redis and HDFS
Technical Field
The invention relates to the technical field of big data caching, and in particular to a massive small file storage method based on Redis and HDFS.
Background
The Hadoop Distributed File System (HDFS) is a distributed file system that runs on commodity hardware. As a highly fault-tolerant, high-throughput solution for mass data storage, HDFS is widely used in large-scale online services and large-scale storage systems. It has become a de facto standard for mass storage at online service companies such as large websites, and provides reliable and efficient service to their customers.
With the rapid development of information systems, massive amounts of information must be stored reliably while remaining quickly accessible to large numbers of users. The architecture of traditional storage schemes has found it increasingly difficult to keep pace with the rapid growth of information system services in recent years, and has become a bottleneck and obstacle to service development.
HDFS uses an efficient distribution algorithm to spread data access and storage across a large number of servers. It reliably keeps multiple replicas of the data while distributing access across every server in the cluster, which is a disruptive departure from conventional storage architectures.
Redis (Remote Dictionary Server) is an open-source key-value database written in ANSI C. It is accessed over the network, keeps its data in memory with optional log-based persistence, and provides client APIs for many languages.
As technology develops, small files such as pictures and documents are used ever more widely, and users demand ever higher read and write speeds for them. To address the low efficiency of current small-file storage, the invention provides a massive small file storage method based on Redis and HDFS.
Disclosure of Invention
In order to make up for the defects of the prior art, the invention provides a simple and efficient mass small file storage method based on Redis and HDFS.
The invention is realized by the following technical scheme:
A massive small file storage method based on Redis and HDFS, characterized by the following steps: when small files are uploaded, they are stored in the Redis cache database; a timer is configured to check periodically whether the cached small files have reached a configured merge threshold; and when the threshold is reached, the cached files are merged through an interface provided by HDFS and the merged file is uploaded to HDFS.
When a small file is uploaded, the received file is first stored in a cache file (SF) in the Redis cache database so that users can read it at high speed. The total length of the files stored in the cache file, SFL, is updated in real time, and a periodic check determines whether SFL has reached the file merge size. If it has, a message indicating that the files can be merged is sent to the merge processing module, which packs the cache file, uploads it to HDFS, and records the information about the merged files in the metadata.
The merged cache file is stored in HDFS as a Sequence File (Hadoop SequenceFile), and the metadata of the cached files is stored in the Redis cache database.
After the cache file has been packed and uploaded, the small files stored in it are deleted to release memory, and the metadata records are updated; this preserves the high fault tolerance and efficiency of file access.
A small file is a file no larger than 64 MB.
The cache file content comprises the cache file data, the total length of the files stored in the cache file data, the file information held in the metadata, the folder structure, and the folder ID.
The metadata comprises the file name, the file length, the name of the Sequence File, and the file's storage position in the Sequence File.
In the HDFS storage structure, the merged Sequence File is stored in the base directory and named with a timestamp;
the Sequence File uses a sequential storage structure, so the content of a file can be located quickly from the file position recorded for its file name in the metadata;
in addition, to reduce disk usage and speed up transmission, the Sequence File uses Block compression.
When a user reads a small file, or when the timer starts its periodic check, the method first determines whether the small file exists in the cache file of the Redis cache database. If it does, the small file is returned directly. Otherwise, the name of the Sequence File containing the small file, the file length, and the file position are looked up by file name in the metadata held in Redis; the file content is located quickly from the file position, and the result is returned to the user.
During a read, the Block is traversed from its beginning until the requested file is reached. Under highly concurrent access, because a Sequence File is read sequentially, a single pass that starts at the beginning of a Block may cover several requested files, whereas reading directly for each request would read the same Block multiple times. A caching mechanism is therefore used to keep the Block and improve read efficiency.
Beneficial effects of the invention: the small files are merged into Sequence Files for storage on HDFS, which improves the storage efficiency of small files in HDFS, and caching improves small-file read performance. The method is suitable for wide application.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below show only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a mass small file storage method based on Redis and HDFS.
FIG. 2 is a schematic diagram of a storage process of a large number of small files based on Redis and HDFS.
Detailed Description
So that those skilled in the art can better understand the technical solution of the invention, it is described clearly and completely below with reference to an embodiment. The described embodiment is merely an example of the invention and does not limit its full scope. All other embodiments that a person skilled in the art can derive from it without creative effort fall within the protection scope of the invention.
In the massive small file storage method based on Redis and HDFS, when small files are uploaded, they are stored in the Redis cache database. A timer is configured to check periodically whether the cached small files have reached a configured merge threshold; when the threshold is reached, the cached files are merged through an interface provided by HDFS and the merged file is uploaded to HDFS.
A small file is a file no larger than 64 MB.
When a small file is uploaded, the received file is first stored in a cache file (SF) in the Redis cache database so that users can read it at high speed. The total length of the files stored in the cache file, SFL, is updated in real time, and a periodic check determines whether SFL has reached the file merge size. If it has, a message indicating that the files can be merged is sent to the merge processing module, which packs the cache file, uploads it to HDFS, and records the information about the merged files in the metadata.
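The upload and timed-check steps can be sketched as follows. This is a minimal illustration, not the patented implementation: a plain dictionary stands in for the Redis cache file (SF), and the names `SmallFileCache`, `ready_to_merge`, `start_timer`, and `on_ready` are hypothetical. The threshold is assumed to be expressed in bytes.

```python
import threading


class SmallFileCache:
    """Dict-backed stand-in for the Redis cache file (SF) and its length SFL."""

    def __init__(self, merge_threshold):
        self.merge_threshold = merge_threshold  # merge size, assumed to be in bytes
        self.files = {}                         # file name -> content (the cache file SF)
        self.sfl = 0                            # total bytes currently cached (SFL)
        self._lock = threading.Lock()

    def upload(self, name, data):
        """Store a small file in the cache and update SFL immediately."""
        with self._lock:
            previous = self.files.get(name)
            if previous is not None:
                self.sfl -= len(previous)  # an overwrite replaces the old bytes
            self.files[name] = data
            self.sfl += len(data)

    def ready_to_merge(self):
        """The check the timer performs: has SFL reached the merge size?"""
        with self._lock:
            return self.sfl >= self.merge_threshold


def start_timer(cache, on_ready, interval_s):
    """Call on_ready(cache) every interval_s seconds while the threshold is met."""
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            if cache.ready_to_merge():
                on_ready(cache)

    threading.Thread(target=loop, daemon=True).start()
    return stop  # call stop.set() to end the periodic check
```

Here `on_ready` plays the role of the message sent to the merge processing module; in a real deployment it would be whatever notification mechanism the system uses.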
The merged cache file is stored in HDFS as a Sequence File (Hadoop SequenceFile), and the metadata of the cached files is stored in the Redis cache database.
After the cache file has been packed and uploaded, the small files stored in it are deleted to release memory, and the metadata records are updated; this preserves the high fault tolerance and efficiency of file access.
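A minimal sketch of the pack, upload, and release sequence is shown below, under stated assumptions: dictionaries stand in for the Redis cache file and metadata store, `upload` is a hypothetical callable that writes to HDFS, and the merged file is a plain concatenation rather than a real Hadoop SequenceFile. The `.seq` suffix and the millisecond timestamp are illustrative choices.

```python
import time


def merge_and_evict(cache_files, metadata, upload, now=time.time):
    """Pack all cached files into one blob, upload it, then free the cache.

    cache_files: dict name -> bytes (stand-in for the Redis cache file)
    metadata:    dict name -> record (stand-in for the Redis metadata store)
    upload:      callable(path, blob) that writes to HDFS (hypothetical)
    """
    if not cache_files:
        return None
    merged_name = f"{int(now() * 1000)}.seq"  # timestamp-based name (format assumed)
    blob = bytearray()
    pending = {}
    for name, data in cache_files.items():
        pending[name] = {
            "name": name,
            "length": len(data),
            "filename": merged_name,
            "filepos": len(blob),  # byte offset of this file inside the blob
        }
        blob += data
    upload(merged_name, bytes(blob))  # upload first,
    metadata.update(pending)          # then record where each file now lives,
    cache_files.clear()               # and only then release the cached copies
    return merged_name
```

Deleting the cached copies only after the upload and metadata update have completed means a failed upload leaves the files readable from the cache. That ordering is a design choice of this sketch; the text states only that deletion follows packing and uploading.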
Redis serves as the file cache server, storing the cached file content and the file metadata, as shown in Tables 1 and 2.
The cache file content comprises the cache file data, the total length of the files stored in the cache file data, the file information held in the metadata, the folder structure, and the folder ID.
TABLE 1. Cache file content
[Table 1 appears as an image in the original publication and is not reproduced in this text.]
The metadata comprises the file name, the file length, the name of the Sequence File, and the file's storage position in the Sequence File.
TABLE 2. Metadata information
Name      Type    Description
name      String  File name
length    Long    File length
filename  String  File name of the Sequence File
filepos   Long    File storage position in the Sequence File
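The record in Table 2 maps naturally onto a small data class. The sketch below is illustrative only: the class name `FileMeta` and the helpers `to_hash` and `from_hash` are hypothetical, and storing the record as a flat mapping of string fields (the shape a Redis hash would hold) is an assumption, since the text does not say which Redis data type is used.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FileMeta:
    name: str      # name of the small file
    length: int    # length of the small file (Long in Table 2; bytes assumed)
    filename: str  # name of the Sequence File that holds it
    filepos: int   # storage position inside that Sequence File (Long in Table 2)

    def to_hash(self):
        """Flatten to string fields, the shape a Redis hash would hold."""
        return {key: str(value) for key, value in asdict(self).items()}

    @classmethod
    def from_hash(cls, fields):
        """Rebuild the record, converting the two Long fields back to int."""
        return cls(
            fields["name"],
            int(fields["length"]),
            fields["filename"],
            int(fields["filepos"]),
        )
```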
In the HDFS storage structure, the merged Sequence File is stored in the base directory and named with a timestamp;
the Sequence File uses a sequential storage structure, so the content of a file can be located quickly from the file position recorded for its file name in the metadata;
in addition, to reduce disk usage and speed up transmission, the Sequence File uses Block compression.
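To make the idea of block compression with position-based lookup concrete, the sketch below groups records into blocks, compresses each block with zlib, and records the byte offset of the block holding each file. It is a simplified stand-in and NOT the Hadoop SequenceFile wire format; the function names, the `records_per_block` parameter, and the assumption that the stored position points at the start of a block are all illustrative.

```python
import struct
import zlib


def write_blocks(records, records_per_block=2):
    """Pack (name, data) records into zlib-compressed blocks.

    Returns (blob, index), where index maps each name to the byte offset of
    the block that contains it.
    """
    blob = bytearray()
    index = {}
    for start in range(0, len(records), records_per_block):
        block_pos = len(blob)
        raw = bytearray()
        for name, data in records[start:start + records_per_block]:
            key = name.encode("utf-8")
            # Each record: key length, data length, key bytes, data bytes.
            raw += struct.pack(">II", len(key), len(data)) + key + data
            index[name] = block_pos
        packed = zlib.compress(bytes(raw))
        blob += struct.pack(">I", len(packed)) + packed  # length-prefixed block
    return bytes(blob), index


def read_record(blob, block_pos, wanted):
    """Decompress one block and scan it from the start for the wanted name."""
    (size,) = struct.unpack_from(">I", blob, block_pos)
    raw = zlib.decompress(blob[block_pos + 4:block_pos + 4 + size])
    pos = 0
    while pos < len(raw):
        key_len, data_len = struct.unpack_from(">II", raw, pos)
        pos += 8
        name = raw[pos:pos + key_len].decode("utf-8")
        pos += key_len
        if name == wanted:
            return raw[pos:pos + data_len]
        pos += data_len
    raise KeyError(wanted)
```

Because a whole block is compressed as one unit, a single record cannot be read without decompressing its block and scanning from the block's beginning.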
When a user reads a small file, or when the timer starts its periodic check, the method first determines whether the small file exists in the cache file of the Redis cache database. If it does, the small file is returned directly. Otherwise, the name of the Sequence File containing the small file, the file length, and the file position are looked up by file name in the metadata held in Redis; the file content is located quickly from the file position, and the result is returned to the user.
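The read path described above can be sketched as a two-step lookup. The function below is an assumption-laden illustration: dictionaries stand in for the Redis cache file and metadata store, and `read_at` is a hypothetical callable that fetches bytes from a Sequence File on HDFS given its name, a position, and a length.

```python
def read_small_file(name, cache_files, metadata, read_at):
    """Return the bytes of a small file, preferring the cache.

    cache_files: dict name -> bytes (stand-in for the Redis cache file)
    metadata:    dict name -> {"length", "filename", "filepos"} (Table 2 fields)
    read_at:     callable(filename, filepos, length) -> bytes (hypothetical HDFS reader)
    """
    data = cache_files.get(name)
    if data is not None:
        return data                    # cache hit: return the file directly
    meta = metadata.get(name)
    if meta is None:
        raise FileNotFoundError(name)  # neither cached nor merged
    # Cache miss: locate the content through the recorded Sequence File position.
    return read_at(meta["filename"], meta["filepos"], meta["length"])
```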
During a read, because the Sequence File uses Block compression, the synchronization points lie at the two ends of each Block, so the requested file must be reached by traversing from the beginning of the Block. The cost of reading files during this traversal is low. In addition, under highly concurrent access, because a Sequence File is read sequentially, a single pass that starts at the beginning of a Block may cover several requested files, whereas reading directly for each request would read the same Block multiple times. A caching mechanism is therefore used to keep the Block and improve read efficiency.
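The text does not specify how Blocks are cached. One possible arrangement, shown here purely as a sketch, is a small least-recently-used cache keyed by Sequence File name and Block position. The class name `BlockCache`, the `load_block` callable, and the `capacity` parameter are hypothetical.

```python
import threading
from collections import OrderedDict


class BlockCache:
    """Small LRU cache keyed by (sequence file name, block position)."""

    def __init__(self, load_block, capacity=8):
        self._load_block = load_block  # callable(filename, block_pos) -> bytes
        self._capacity = capacity
        self._blocks = OrderedDict()
        self._lock = threading.Lock()
        self.loads = 0                 # number of real block reads, for illustration

    def get(self, filename, block_pos):
        key = (filename, block_pos)
        with self._lock:
            if key in self._blocks:
                self._blocks.move_to_end(key)  # mark as most recently used
                return self._blocks[key]
            # Loading under the lock keeps the sketch simple: concurrent
            # requests for the same block wait instead of reading it twice.
            block = self._load_block(filename, block_pos)
            self.loads += 1
            self._blocks[key] = block
            if len(self._blocks) > self._capacity:
                self._blocks.popitem(last=False)  # evict the least recently used
            return block
```

Holding one lock during the load serializes all misses, which is acceptable for an illustration; a production cache would more likely use per-key locking so that unrelated Blocks can load in parallel.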
The embodiment described above is only one specific embodiment of the invention; ordinary changes and substitutions made by those skilled in the art within the technical scope of the invention fall within its protection scope.

Claims (8)

1. A massive small file storage method based on Redis and HDFS, characterized by comprising the following steps: when small files are uploaded, storing them in the Redis cache database; configuring a timer that checks periodically whether the cached small files have reached a configured merge threshold; and, when the threshold is reached, merging the cached files through an interface provided by HDFS and uploading the merged file to HDFS.
2. The massive small file storage method based on Redis and HDFS of claim 1, wherein: when a small file is uploaded, the received file is first stored in a cache file in the Redis cache database so that users can read it at high speed; the total length SFL of the files stored in the cache file is updated in real time, and a periodic check determines whether SFL has reached the file merge size; if it has, a message indicating that the files can be merged is sent to a merge processing module, which packs the cache file, uploads it to HDFS, and records the information about the merged files in the metadata.
3. The massive small file storage method based on Redis and HDFS of claim 2, wherein: the merged cache file is stored in HDFS as a Sequence File, and the metadata of the cached files is stored in the Redis cache database;
after the cache file has been packed and uploaded, the small files stored in it are deleted to release memory, and the metadata records are updated, which preserves the high fault tolerance and efficiency of file access.
4. The massive small file storage method based on Redis and HDFS of claim 1, 2 or 3, wherein: a small file is a file no larger than 64 MB.
5. The massive small file storage method based on Redis and HDFS of claim 3, wherein: the cache file content comprises the cache file data, the total length of the files stored in the cache file data, the file information held in the metadata, the folder structure, and the folder ID;
the metadata comprises the file name, the file length, the name of the Sequence File, and the file's storage position in the Sequence File.
6. The massive small file storage method based on Redis and HDFS of claim 3, wherein: in the HDFS storage structure, the merged Sequence File is stored in the base directory and named with a timestamp;
the Sequence File uses a sequential storage structure, so the content of a file can be located quickly from the file position recorded for its file name in the metadata;
in addition, to reduce disk usage and speed up transmission, the Sequence File uses Block compression.
7. The massive small file storage method based on Redis and HDFS of claim 6, wherein: when a user reads a small file, or when the timer starts its periodic check, it is first determined whether the small file exists in the cache file of the Redis cache database; if it does, the small file is returned directly; otherwise, the name of the Sequence File containing the small file, the file length, and the file position are looked up by file name in the metadata held in Redis, the file content is located quickly from the file position, and the result is returned to the user.
8. The massive small file storage method based on Redis and HDFS of claim 6, wherein: during a read, the Block is traversed from its beginning until the requested file is reached; and under highly concurrent access, a caching mechanism is used to keep the Block so as to improve read efficiency.
CN202011528833.7A 2020-12-22 2020-12-22 Massive small file storage method based on Redis and HDFS Withdrawn CN112650711A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011528833.7A CN112650711A (en) 2020-12-22 2020-12-22 Massive small file storage method based on Redis and HDFS

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011528833.7A CN112650711A (en) 2020-12-22 2020-12-22 Massive small file storage method based on Redis and HDFS

Publications (1)

Publication Number Publication Date
CN112650711A true CN112650711A (en) 2021-04-13

Family

ID=75358988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011528833.7A Withdrawn CN112650711A (en) 2020-12-22 2020-12-22 Massive small file storage method based on Redis and HDFS

Country Status (1)

Country Link
CN (1) CN112650711A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609094A (en) * 2021-07-09 2021-11-05 济南浪潮数据技术有限公司 Method, system, equipment and medium for controlling data downloading and brushing
CN114374687A (en) * 2022-01-11 2022-04-19 同方有云(北京)科技有限公司 File transmission method and device between thermomagnetic storage and blue light storage
CN114374687B (en) * 2022-01-11 2024-04-16 同方有云(北京)科技有限公司 File transmission method and device between thermomagnetic storage and blue light storage


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210413
