CN112650711A - Massive small file storage method based on Redis and HDFS - Google Patents
Massive small file storage method based on Redis and HDFS
- Publication number
- CN112650711A
- Authority
- CN
- China
- Prior art keywords
- file
- hdfs
- cache
- files
- small
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/14—Details of searching files based on file metadata
- G06F16/148—File search processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
- G06F16/162—Delete operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
- G06F16/1744—Redundancy elimination performed by the file system using compression, e.g. sparse files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method for storing massive numbers of small files based on Redis and HDFS. When a small file is uploaded, it is first stored in the Redis cache database. A timer is configured to check periodically whether the cached small files have reached a set merge threshold; when they have, the cached files are merged through an interface provided by HDFS, and the merged cache file is uploaded to HDFS. By combining small files into Sequence Files stored on HDFS, the method improves the storage efficiency of small files in HDFS, while the cache improves small-file read performance. The method is suitable for wide adoption.
Description
Technical Field
The invention relates to the technical field of big data caching, and in particular to a method for storing massive numbers of small files based on Redis and HDFS.
Background
The Hadoop Distributed File System (HDFS) is a distributed file system that runs on commodity hardware. As a highly fault-tolerant, high-throughput solution for mass data storage, HDFS is widely used in large-scale online services and large-scale storage systems. It has become a de facto standard for mass storage at online service companies such as large websites, and provides reliable and efficient service to their customers.
With the rapid development of information systems, massive amounts of information must be stored reliably while remaining quickly accessible to large numbers of users. At the architectural level, traditional storage schemes have found it increasingly difficult to keep pace with the rapid growth of information system services in recent years, and have become a bottleneck and an obstacle to service development.
HDFS uses an efficient distribution algorithm to spread data access and storage across a large number of servers. It keeps multiple replicas for reliability while distributing access across the servers in the cluster, and represents a disruptive departure from conventional storage architectures.
Redis (Remote Dictionary Server) is an open-source, log-type key-value database written in ANSI C. It supports network access, is memory-based with optional persistence, and provides APIs for multiple languages.
As technology develops, small files such as pictures and documents are used more and more widely, and users demand ever higher read and write speeds for them. To address the low storage efficiency of small files in current systems, the invention provides a method for storing massive numbers of small files based on Redis and HDFS.
Disclosure of Invention
To remedy the shortcomings of the prior art, the invention provides a simple and efficient method for storing massive numbers of small files based on Redis and HDFS.
The invention is realized by the following technical scheme:
A method for storing massive numbers of small files based on Redis and HDFS, characterized by comprising the following steps: when small files are uploaded, they are stored in the Redis cache database; a timer is configured to check periodically whether the small files have reached a set merge threshold; and when the merge threshold is reached, the cached files are merged through an interface provided by HDFS and the merged cache file is uploaded to HDFS.
When a small file is uploaded, the received file is first stored in a cache file (SF) in the Redis cache database so that users can read it at high speed. The total length SFL of the files stored in the cache file (SF) is updated in real time, and the method checks periodically whether SFL has reached the file merge size. If it has, a message indicating that the files can be merged is sent to a merge processing module, which packs the cache file, uploads it to HDFS, and stores information about the merged files in the metadata.
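The upload and timed-check steps can be sketched as follows. This is a minimal illustration, not the patented implementation: a plain in-memory dictionary stands in for the Redis cache file (SF), and the names `CacheFile`, `upload`, `check`, and `merge_threshold` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class CacheFile:
    """In-memory stand-in for the Redis cache file (SF); hypothetical names."""
    merge_threshold: int                      # merge size in bytes (assumed configurable)
    files: Dict[str, bytes] = field(default_factory=dict)
    sfl: int = 0                              # SFL: total bytes currently buffered

    def upload(self, name: str, data: bytes) -> None:
        # Replace an existing entry without double-counting its length.
        old = self.files.get(name)
        if old is not None:
            self.sfl -= len(old)
        self.files[name] = data
        self.sfl += len(data)

    def check(self, notify_merge: Callable[["CacheFile"], None]) -> bool:
        """Timer callback: signal the merge module once SFL reaches the threshold."""
        if self.sfl >= self.merge_threshold:
            notify_merge(self)
            return True
        return False


signals = []
sf = CacheFile(merge_threshold=10)
sf.upload("a.txt", b"hello")
first = sf.check(signals.append)     # 5 bytes buffered, below the threshold
sf.upload("b.txt", b"world!")
second = sf.check(signals.append)    # 11 bytes buffered, threshold reached
```

In a real deployment the `check` call would presumably be driven by a scheduler at a fixed interval; the interval itself is not specified in the text.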
The merged cache file is stored as a Sequence File (Hadoop SequenceFile) on HDFS, and the metadata of the cache file is stored in the Redis cache database.
After the cache file has been packed and uploaded, the small files stored in it are deleted to release memory, which preserves the fault tolerance and efficiency of file access, and the metadata records are updated.
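One possible ordering of the pack, record, and release steps is sketched below. It is an assumption-laden illustration: dictionaries stand in for Redis and HDFS, the function name `merge_and_release` is hypothetical, and the sketch is single-threaded, so it ignores uploads that arrive during the merge.

```python
from typing import Dict, Tuple


def merge_and_release(cache: Dict[str, bytes],
                      metadata: Dict[str, Tuple[str, int, int]],
                      hdfs: Dict[str, bytes],
                      seq_name: str) -> None:
    """Pack all cached files into one merged file, then free the cache."""
    packed = bytearray()
    pending = {}
    for name, data in cache.items():
        # Record (merged file name, position, length) for each small file.
        pending[name] = (seq_name, len(packed), len(data))
        packed += data
    hdfs[seq_name] = bytes(packed)      # 1. upload the merged file
    metadata.update(pending)            # 2. publish where each small file now lives
    cache.clear()                       # 3. only then release the cached copies


cache = {"a.txt": b"abc", "b.txt": b"de"}
metadata: Dict[str, Tuple[str, int, int]] = {}
hdfs: Dict[str, bytes] = {}
merge_and_release(cache, metadata, hdfs, "1608600000000")
```

Deleting the cached copies last means a reader can always find a file either in the cache or through the metadata, which is one way to read the fault-tolerance remark above.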
A small file is a file no larger than 64 MB.
The content of the cache file comprises cache file data, the total length of files stored in the cache file data, file information stored in metadata, a folder structure and a folder ID.
The metadata comprises the file name, the file length, the name of the Sequence File that contains the file, and the file's storage position within that Sequence File.
In the HDFS storage structure, the merged Sequence File is stored in the base directory and named with a timestamp.
The Sequence File uses a sequential storage structure, so the content of a file can be located quickly from the file position recorded under its file name in the metadata.
In addition, to reduce disk usage and increase transmission speed, the Sequence File uses Block compression.
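The idea of grouping records into compressed blocks and remembering where each file's block starts can be sketched as below. This is a toy container, not the actual Hadoop SequenceFile format: the record layout, the use of `zlib`, and the names `pack_blocks` and `block_size` are assumptions made for illustration.

```python
import struct
import time
import zlib
from typing import Dict, List, Tuple


def pack_blocks(files: Dict[str, bytes], block_size: int) -> Tuple[bytes, Dict[str, int]]:
    """Return (container bytes, {file name: offset of the block holding it})."""
    out = bytearray()
    positions: Dict[str, int] = {}
    batch: List[Tuple[str, bytes]] = []
    batch_len = 0

    def flush() -> None:
        nonlocal batch, batch_len
        if not batch:
            return
        raw = bytearray()
        for name, data in batch:
            key = name.encode("utf-8")
            # Each record: key length, value length, key bytes, value bytes.
            raw += struct.pack(">II", len(key), len(data)) + key + data
        compressed = zlib.compress(bytes(raw))
        for name, _ in batch:
            positions[name] = len(out)          # block start, recorded as the file position
        out.extend(struct.pack(">I", len(compressed)) + compressed)
        batch, batch_len = [], 0

    for name, data in files.items():
        batch.append((name, data))
        batch_len += len(data)
        if batch_len >= block_size:             # block is full: compress and emit it
            flush()
    flush()                                     # emit the last, partially filled block
    return bytes(out), positions


files = {"a.txt": b"x" * 40, "b.txt": b"y" * 40, "c.txt": b"z" * 10}
container, positions = pack_blocks(files, block_size=64)
seq_name = str(int(time.time() * 1000))         # timestamp-based name, as in the text
```

Here `a.txt` and `b.txt` share the first block and `c.txt` lands in a second one, so a reader seeking `c.txt` only has to decompress the block that starts at its recorded position.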
When a user reads a small file, or when the timer starts a scheduled check, the method first determines whether the small file exists in the cache file in the Redis cache database. If it does, the file is returned directly. Otherwise, the name of the Sequence File containing the small file, the file length, and the file position are looked up by file name in the metadata stored in Redis; the file content is then located quickly from the file position, and the result is returned to the user.
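The cache-first read path can be sketched as follows. Dictionaries again stand in for Redis and HDFS, the names `Meta` and `read_small_file` are hypothetical, and for simplicity the file position is treated as a byte offset into uncompressed merged data, which is an assumption rather than something the text states.

```python
from typing import Dict, NamedTuple, Optional


class Meta(NamedTuple):
    filename: str   # name of the merged Sequence File
    filepos: int    # position of the small file inside it
    length: int     # length of the small file in bytes


def read_small_file(name: str,
                    cache: Dict[str, bytes],
                    metadata: Dict[str, Meta],
                    hdfs: Dict[str, bytes]) -> Optional[bytes]:
    # 1. Serve directly from the cache when the file has not been merged yet.
    data = cache.get(name)
    if data is not None:
        return data
    # 2. Otherwise look up where the file lives inside a merged file.
    meta = metadata.get(name)
    if meta is None:
        return None                      # unknown file
    merged = hdfs[meta.filename]
    return merged[meta.filepos: meta.filepos + meta.length]


cache = {"new.txt": b"fresh"}
metadata = {"old.txt": Meta("1608600000000", 3, 2)}
hdfs = {"1608600000000": b"abcde"}
hit = read_small_file("new.txt", cache, metadata, hdfs)
merged_hit = read_small_file("old.txt", cache, metadata, hdfs)
miss = read_small_file("nope.txt", cache, metadata, hdfs)
```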
When a file is read, the Block is traversed from its beginning until the requested file is reached. Under highly concurrent access, because a Sequence File is read sequentially, several requested files may fall within the same Block, so reading directly each time would cause the same Block to be read repeatedly. A caching mechanism is therefore used to keep the Block and improve read efficiency.
The beneficial effects of the invention are as follows: small files are combined into Sequence Files stored on HDFS, which improves the storage efficiency of small files in HDFS, while the cache improves small-file read performance. The method is suitable for wide adoption.
Drawings
To illustrate the embodiments of the invention and the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described here show only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a mass small file storage method based on Redis and HDFS.
FIG. 2 is a schematic diagram of a storage process of a large number of small files based on Redis and HDFS.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the embodiment of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the method for storing massive numbers of small files based on Redis and HDFS, small files are stored in the Redis cache database when they are uploaded. A timer is configured to check periodically whether the small files have reached a set merge threshold; when the threshold is reached, the cached files are merged through an interface provided by HDFS and the merged cache file is uploaded to HDFS.
A small file is a file no larger than 64 MB.
When a small file is uploaded, the received file is first stored in a cache file (SF) in the Redis cache database so that users can read it at high speed. The total length SFL of the files stored in the cache file (SF) is updated in real time, and the method checks periodically whether SFL has reached the file merge size. If it has, a message indicating that the files can be merged is sent to a merge processing module, which packs the cache file, uploads it to HDFS, and stores information about the merged files in the metadata.
The merged cache file is stored as a Sequence File (Hadoop SequenceFile) on HDFS, and the metadata of the cache file is stored in the Redis cache database.
After the cache file has been packed and uploaded, the small files stored in it are deleted to release memory, which preserves the fault tolerance and efficiency of file access, and the metadata records are updated.
Redis serves as the file cache server, storing the cached file content and the file metadata, as shown in Tables 1 and 2.
The content of the cache file comprises cache file data, the total length of files stored in the cache file data, file information stored in metadata, a folder structure and a folder ID.
Table 1. Cache file content
The metadata comprises the file name, the file length, the name of the Sequence File that contains the file, and the file's storage position within that Sequence File.
Table 2. Metadata information
Name | Type | Description |
---|---|---|
name | String | File name |
length | Long | File length |
filename | String | Name of the Sequence File |
filepos | Long | File storage position within the Sequence File |
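A metadata record with the four fields above could be modeled as shown below. This is a sketch under assumptions: the class name `FileMeta`, the `meta:` key prefix, and the choice of a flat string-to-string mapping (the shape a Redis hash would hold) are all hypothetical and not taken from the patent.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class FileMeta:
    name: str       # file name of the small file
    length: int     # file length in bytes
    filename: str   # name of the Sequence File that contains it
    filepos: int    # storage position within that Sequence File

    def redis_key(self) -> str:
        # Hypothetical key layout: one hash per small file.
        return f"meta:{self.name}"

    def to_hash(self) -> Dict[str, str]:
        # Hash fields are stored as strings, so numbers are converted here.
        return {k: str(v) for k, v in asdict(self).items()}

    @classmethod
    def from_hash(cls, h: Dict[str, str]) -> "FileMeta":
        return cls(h["name"], int(h["length"]), h["filename"], int(h["filepos"]))


meta = FileMeta(name="a.txt", length=5, filename="1608600000000", filepos=0)
stored = meta.to_hash()
restored = FileMeta.from_hash(stored)
```

With a Redis client, `stored` is the kind of mapping that would be written under `meta.redis_key()`; the round trip through `from_hash` restores the numeric fields.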
In the HDFS storage structure, the merged Sequence File is stored in the base directory and named with a timestamp.
The Sequence File uses a sequential storage structure, so the content of a file can be located quickly from the file position recorded under its file name in the metadata.
In addition, to reduce disk usage and increase transmission speed, the Sequence File uses Block compression.
When a user reads a small file, or when the timer starts a scheduled check, the method first determines whether the small file exists in the cache file in the Redis cache database. If it does, the file is returned directly. Otherwise, the name of the Sequence File containing the small file, the file length, and the file position are looked up by file name in the metadata stored in Redis; the file content is then located quickly from the file position, and the result is returned to the user.
When a file is read, because the Sequence File uses Block compression, the synchronization points lie at the two ends of each Block, so the Block must be traversed from its beginning to reach the requested file. The cost of reading the file during this traversal is low. In addition, under highly concurrent access, because a Sequence File is read sequentially, several requested files may fall within the same Block, so reading directly each time may cause the same Block to be read repeatedly. A caching mechanism is therefore used to keep the Block and improve read efficiency.
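A Block cache of the kind described can be sketched as follows. The text does not name an eviction policy, so the least-recently-used policy here is an assumption, as are the names `BlockCache` and `load_block`; the loader is a placeholder for whatever reads and decompresses a Block from HDFS.

```python
import threading
from collections import OrderedDict
from typing import Callable, Tuple


class BlockCache:
    """Keeps recently read Blocks so concurrent readers do not reload them."""

    def __init__(self, load_block: Callable[[str, int], bytes], capacity: int) -> None:
        self._load = load_block
        self._capacity = capacity
        self._blocks: "OrderedDict[Tuple[str, int], bytes]" = OrderedDict()
        self._lock = threading.Lock()
        self.loads = 0                            # how many times the loader ran

    def get(self, seq_file: str, block_pos: int) -> bytes:
        key = (seq_file, block_pos)
        with self._lock:
            if key in self._blocks:
                self._blocks.move_to_end(key)     # mark as most recently used
                return self._blocks[key]
            block = self._load(seq_file, block_pos)
            self.loads += 1
            self._blocks[key] = block
            if len(self._blocks) > self._capacity:
                self._blocks.popitem(last=False)  # evict the least recently used Block
            return block


# Placeholder loader: a real one would read and decompress the Block from HDFS.
blocks = BlockCache(lambda f, p: f"{f}@{p}".encode(), capacity=2)
first = blocks.get("1608600000000", 0)
again = blocks.get("1608600000000", 0)            # served from the cache
```

Holding the lock while loading keeps the sketch simple and guarantees each Block is loaded once, at the cost of serializing loads; a finer-grained scheme would be a reasonable refinement.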
The above-described embodiment is only one specific embodiment of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.
Claims (8)
1. A method for storing massive numbers of small files based on Redis and HDFS, characterized by comprising the following steps: when small files are uploaded, the small files are stored in the Redis cache database; a timer is configured to check periodically whether the small files have reached a set merge threshold; and when the merge threshold is reached, the cached files are merged through an interface provided by HDFS and the merged cache file is uploaded to HDFS.
2. The method for storing massive numbers of small files based on Redis and HDFS according to claim 1, characterized in that: when a small file is uploaded, the received file is first stored in a cache file in the Redis cache database so that users can read it at high speed; the total length SFL of the files stored in the cache file is updated in real time, and whether SFL has reached the file merge size is checked periodically; if it has, a message indicating that the files can be merged is sent to a merge processing module, which packs the cache file, uploads it to HDFS, and stores information about the merged files in the metadata.
3. The method for storing massive numbers of small files based on Redis and HDFS according to claim 2, characterized in that: the merged cache file is stored as a Sequence File on HDFS, and the metadata of the cache file is stored in the Redis cache database;
after the cache file has been packed and uploaded, the small files stored in it are deleted to release memory, which preserves the fault tolerance and efficiency of file access, and the metadata records are updated.
4. The method for storing massive numbers of small files based on Redis and HDFS according to claim 1, 2 or 3, characterized in that: a small file is a file no larger than 64 MB.
5. The method for storing massive numbers of small files based on Redis and HDFS according to claim 3, characterized in that: the cache file content comprises the cache file data, the total length of the files stored in the cache file data, the file information stored in the metadata, a folder structure and a folder ID;
the metadata comprises the file name, the file length, the name of the Sequence File that contains the file, and the file's storage position within that Sequence File.
6. The method for storing massive numbers of small files based on Redis and HDFS according to claim 3, characterized in that: in the HDFS storage structure, the merged Sequence File is stored in the base directory and named with a timestamp;
the Sequence File uses a sequential storage structure, so the content of a file can be located quickly from the file position recorded under its file name in the metadata;
in addition, to reduce disk usage and increase transmission speed, the Sequence File uses Block compression.
7. The method for storing massive numbers of small files based on Redis and HDFS according to claim 6, characterized in that: when a user reads a small file, or when the timer starts a scheduled check, whether the small file exists in the cache file in the Redis cache database is determined first; if it does, the file is returned directly; otherwise, the name of the Sequence File containing the small file, the file length, and the file position are looked up by file name in the metadata stored in Redis, the file content is located quickly from the file position, and the result is returned to the user.
8. The method for storing massive numbers of small files based on Redis and HDFS according to claim 6, characterized in that: when a file is read, the Block is traversed from its beginning until the requested file is reached; and under highly concurrent access, a caching mechanism is used to keep the Block so as to improve read efficiency.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011528833.7A CN112650711A (en) | 2020-12-22 | 2020-12-22 | Massive small file storage method based on Redis and HDFS |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011528833.7A CN112650711A (en) | 2020-12-22 | 2020-12-22 | Massive small file storage method based on Redis and HDFS |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112650711A true CN112650711A (en) | 2021-04-13 |
Family
ID=75358988
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011528833.7A Withdrawn CN112650711A (en) | 2020-12-22 | 2020-12-22 | Massive small file storage method based on Redis and HDFS |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112650711A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113609094A (en) * | 2021-07-09 | 2021-11-05 | 济南浪潮数据技术有限公司 | Method, system, equipment and medium for controlling data downloading and brushing |
CN114374687A (en) * | 2022-01-11 | 2022-04-19 | 同方有云(北京)科技有限公司 | File transmission method and device between thermomagnetic storage and blue light storage |
CN114374687B (en) * | 2022-01-11 | 2024-04-16 | 同方有云(北京)科技有限公司 | File transmission method and device between thermomagnetic storage and blue light storage |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20210413 |