CN114546962A - Hadoop-based distributed storage system for marine bureau ship inspection big data - Google Patents
- Publication number
- CN114546962A CN114546962A CN202210147679.1A CN202210147679A CN114546962A CN 114546962 A CN114546962 A CN 114546962A CN 202210147679 A CN202210147679 A CN 202210147679A CN 114546962 A CN114546962 A CN 114546962A
- Authority
- CN
- China
- Prior art keywords
- file
- files
- small
- namenode
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
- G06F16/164—File meta data generation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Hadoop-based distributed storage method for marine bureau ship inspection big data. The method first analyzes the data characteristics of the marine bureau ship inspection platform and adds a department node between the client and the NameNode, mitigating the poor performance of HDFS when processing massive numbers of small files. Second, a small-file preprocessing module and a prefetching module are set up for merging the small files within the node. Meanwhile, the metadata of the large files produced by merging hot-spot data is cached, which streamlines the small-file access flow and relieves pressure on the NameNode. The invention addresses the low performance typically exhibited by a traditional HDFS when storing massive numbers of small files, and makes file storage more convenient given the maritime bureau's characteristic of massive numbers of small data files.
Description
Technical Field
The invention relates to the field of distributed storage systems, and in particular to a Hadoop-based distributed storage method for marine bureau ship inspection big data.
Background
With the rapid development of internet technology, the scale of network information has grown exponentially. As enterprise informatization continues to improve, traditional data storage cannot meet the demands of the large volumes of data generated daily, so more and more enterprises have begun to use distributed file systems. HDFS is the bottom layer of a Hadoop cluster and has the capacity to store massive data in a distributed manner; thanks to its open source, scalability, and high fault tolerance, it is widely used for distributed storage of large numbers of files.
However, the way HDFS randomly selects storage nodes is prone to two problems: data distribution easily becomes unbalanced, and the default placement strategy does not account for hardware differences between data nodes. Most of the files stored by the maritime bureau are massive numbers of small files, yet the HDFS distributed system stores data on the premise of a homogeneous cluster and was not designed to compensate for hardware differences among the data nodes in the cluster. When a client initiates a file request, it accesses the NameNode to obtain the metadata of the corresponding file, then uses that metadata to locate the DataNode holding the stored data. Hadoop was designed to store larger files: for the same total volume of stored data, large files consume less NameNode memory, so when there are massive numbers of small-file storage and access requests, the NameNode exhibits low performance.
Disclosure of Invention
A Hadoop-based big data distributed storage method for marine bureau ship inspection aims at solving the problems in the background technology.
In order to realize the above aim, the invention provides the following technical scheme: a Hadoop-based big data distributed storage method for marine bureau ship inspection, comprising the following steps:
s1: analyzing the storage performance of the mass small files and determining the sizes of the small files;
s2: optimally designing small file storage;
s3: designing a file preprocessing module;
s4: searching and optimally designing small files;
s5: designing a cache prefetching module;
the step S1 specifically includes the following steps:
s1.1, collecting a file size data set to be analyzed by a maritime affairs bureau, and carrying out quantitative analysis;
s1.2, determining the file size of the demarcation point by adopting a linear fitting method;
the optimized target in step S2 includes node memory consumption, and time frequency of node access, and is specifically defined as follows:
One NameNode node in the Hadoop distributed file system is in the active state; it does not store file content and is responsible only for the metadata of the files stored in the system. Assuming that the memory consumption of the HDFS itself is a when no file is stored, the memory consumption of the BlockMap in the NameNode is b, and the default HDFS block size is 64MB, then when files of sizes M1, M2, M3, …, Mn are stored in sequence, the memory consumption of the NameNode is defined as:
W = a + Σ_{i=1}^{n} ⌈Mi/64⌉ × (368 + b)
where ⌈Mi/64⌉ denotes the number of blocks into which file i is split on storage, and (368 + b) denotes the NameNode memory occupied by the list information generated for each block. Therefore, in order to reduce the memory consumption of the NameNode, the number of file blocks needs to be reduced.
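The memory model above can be sketched as a short calculation. The function name and the treatment of a and b as abstract memory units are assumptions for illustration, not the patent's implementation:

```python
from math import ceil

def namenode_memory(file_sizes_mb, a, b, block_mb=64):
    # Baseline consumption a, plus (368 + b) units of metadata for each
    # block a file is split into: W = a + sum(ceil(Mi/block)) * (368 + b).
    blocks = sum(ceil(m / block_mb) for m in file_sizes_mb)
    return a + blocks * (368 + b)
```

For example, 1000 one-megabyte files occupy 1000 blocks, while the same data merged into a single 1000 MB file occupies only ceil(1000/64) = 16 blocks, which is why merging small files relieves NameNode memory pressure.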
N files of sizes M1, M2, M3, …, Mn are read in sequence; the time consumed to read the N files is derived from the delays defined below. Further, the load state is characterized by the disk usage rate.
the NameNode receives the time consumption of the Client request and records the time consumption as Tcn, and the time consumed by the Client for receiving the metadata sent back by the NameNode is recorded as Tnc. The time consumed by the Client to send a request to the DataNode after receiving the metadata is marked as Tcd. The time consumed by the DataNode to query the file after receiving the request is recorded as Td. The DataNode records the time consumed by returning the query file to the Client as Tdc.
When HDFS uses C blocks to store the files, each file whose size does not exceed the default block size is stored independently, so C is equal to N. The total time consumed is therefore calculated as:
T = N × (Tcn + Tnc + Tcd + Td + Tdc)
therefore, as long as the number of interaction with the NameNode is reduced, the time for acquiring metadata information in the NameNode is reduced, and the time for reading files in the DataNode is reduced, the reading efficiency of the HDFS can be improved.
The concrete steps of S3 are as follows:
S3.1 Add a department node. The department node comprises a file preprocessing module, a file prefetching module, and a caching module.
S3.2, judging the size of the file uploaded by the client, and storing the related file in a continuous disk space as much as possible, so that the hit rate of file prefetching is improved, and the file reading delay is reduced.
S3.3, department detection, namely judging the department ID number in the small file name at the beginning of file combination, and adding files with the same department ID number into the same file combination waiting queue.
S3.4 At this time, the small files are merged using the MapFile technique. Since the file system splits files that exceed the default block size into blocks, the merge queue threshold is set to 64MB to prevent the merged file from being stored in split form.
When a small file is accessed, the department node is first searched by the small-file name for cached metadata of the corresponding merged large file. If no cache entry exists, the offset of the small file within the large file and the locations of the large file's data blocks on the DataNodes are loaded into memory according to the small-file name, and the storage location of the small file is determined from this information.
The step of S5 is as follows:
S5.1 The cache region on the DataNode is divided into a blocks, each of size M.
S5.2 The size of the cache region is determined by the values of a and M, which can be adjusted according to the node resources; each region of size M caches one small file. A cache prefetching module designed around the data characteristics greatly reduces the time needed to read hot-spot files, further optimizing the small-file storage system.
S5.3 A cache replacement strategy, namely the least-frequently-used (LFU) strategy, is adopted to improve cache efficiency.
Drawings
FIG. 1 is a diagram of a data storage architecture according to the present invention.
FIG. 2 is a flow chart of a data storage architecture according to the present invention.
Detailed Description
Example (b):
The invention provides the following technical scheme: a Hadoop-based big data distributed storage method for marine bureau ship inspection, comprising the following steps:
s1: analyzing the storage performance of the mass small files and determining the sizes of the small files;
s2: optimally designing small file storage;
s3: designing a file preprocessing module;
s4: searching and optimally designing small files;
s5: designing a cache prefetching module;
The Hadoop-based distributed storage method for marine bureau ship inspection big data according to the claims, wherein step S1 specifically comprises the following steps:
s1.1, collecting a file size data set to be analyzed by a maritime affairs bureau, and carrying out quantitative analysis;
S1.2 A quantitative analysis is carried out on 23 files of different sizes, all between 0.5MB and 64MB. By analyzing the influence of files of different sizes on memory consumption and access performance during processing, the demarcation-point file size is determined by linear fitting to be 4.35MB;
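The linear-fitting step can be sketched as follows: fit one straight line to the measured cost of each storage path and solve for the crossing point, which serves as the small-file cutoff. The two cost series and all helper names are hypothetical; the patent reports only the resulting 4.35 MB value.

```python
def linear_fit(xs, ys):
    # Ordinary least-squares fit of y = k*x + b; returns (k, b).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return k, my - k * mx

def cutoff_size(sizes_mb, merged_cost, direct_cost):
    # The demarcation point is where the two fitted cost lines intersect.
    k1, b1 = linear_fit(sizes_mb, merged_cost)
    k2, b2 = linear_fit(sizes_mb, direct_cost)
    return (b2 - b1) / (k1 - k2)
```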
the distributed big data storage method for marine bureau ship inspection based on Hadoop as claimed in claim, wherein: the optimized target in step S2 includes node memory consumption, and time frequency of node access, and is specifically defined as follows:
One NameNode node in the Hadoop distributed file system is in the active state; it does not store file content and is responsible only for the metadata of the files stored in the system. Assuming that the memory consumption of the HDFS itself is a when no file is stored, the memory consumption of the BlockMap in the NameNode is b, and the default HDFS block size is 64MB, then when files of sizes M1, M2, M3, …, Mn are stored in sequence, the memory consumption of the NameNode is defined as:
W = a + Σ_{i=1}^{n} ⌈Mi/64⌉ × (368 + b)
where ⌈Mi/64⌉ denotes the number of blocks into which file i is split on storage, and (368 + b) denotes the NameNode memory occupied by the list information generated for each block. Therefore, in order to reduce the memory consumption of the NameNode, the number of file blocks needs to be reduced.
The NameNode receives the time consumption of the Client request and records the time consumption as Tcn, and the time consumed by the Client for receiving the metadata sent back by the NameNode is recorded as Tnc. The time consumed by the Client to send a request to the DataNode after receiving the metadata is marked as Tcd. The time consumed by the DataNode to query the file after receiving the request is recorded as Td. The DataNode records the time consumed by returning the query file to the Client as Tdc.
When HDFS uses C blocks to store the files, each file whose size does not exceed the default block size is stored independently, so C is equal to N. The total time consumed is therefore calculated as:
T = N × (Tcn + Tnc + Tcd + Td + Tdc)
therefore, as long as the number of interaction with the NameNode is reduced, the time for acquiring metadata information in the NameNode is reduced, and the time for reading files in the DataNode is reduced, the reading efficiency of the HDFS can be improved.
The distributed big data storage method for marine bureau ship inspection based on Hadoop as claimed in claim, wherein the concrete steps of S3 are as follows:
S3.1 Add a department node, as shown in the data storage architecture diagram of FIG. 1. The department node comprises a file preprocessing module, a file prefetching module, and a caching module. The file preprocessing module merges small files having the same department number to increase the relevance of the small files, establishes a mapping keyed by the small-file name, and stores the mapping in HBase, relieving memory pressure on the NameNode.
S3.2, judging the size of the file uploaded by the client, and storing the related file in a continuous disk space as much as possible, so that the hit rate of file prefetching is improved, and the file reading delay is reduced.
S3.3 As shown in the file processing flow chart of FIG. 2, at the start of file merging the department ID number in the small-file name is examined first, and files with the same department ID number are added to the same merge waiting queue. For a different department ID number, a new waiting queue is created.
S3.4 At this time, the small files are merged using the MapFile technique. Since the file system splits files that exceed the default block size into blocks, the merge queue threshold is set to 64MB to prevent the merged file from being stored in split form. If a file is larger than the merge queue's remaining space, or the queue's latency exceeds wait_time, the file is added to the waiting queue, which is converted into a new merge queue. Otherwise the small file is added to the queue, and the queue is merged when its remaining space reaches the set threshold.
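The queue logic of S3.3/S3.4 can be sketched as below. The class, the flush return value, and the 60-second wait_time are assumptions for illustration; the actual serialization of a flushed batch into a MapFile is out of scope here.

```python
import time

QUEUE_CAPACITY_MB = 64.0  # one HDFS block, so the merged file is not split
WAIT_TIME = 60.0          # assumed flush deadline in seconds

class MergeQueues:
    def __init__(self):
        # dept_id -> (creation time, [(file name, size in MB), ...])
        self.queues = {}

    def add(self, dept_id, name, size_mb, now=None):
        now = time.time() if now is None else now
        created, files = self.queues.setdefault(dept_id, (now, []))
        used = sum(s for _, s in files)
        # Flush when the new file no longer fits or the queue has waited
        # too long; the flushed batch would go to the MapFile merger.
        if size_mb > QUEUE_CAPACITY_MB - used or now - created > WAIT_TIME:
            self.queues[dept_id] = (now, [(name, size_mb)])
            return files
        files.append((name, size_mb))
        return None
```

Grouping by department ID keeps related files in one merged large file, which is what makes the later prefetching of neighboring small files pay off.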
The distributed big data storage method for marine bureau ship inspection based on Hadoop as claimed in claim, wherein: the step of S4 is as follows:
when accessing a small file, firstly, whether metadata cache information of a large file corresponding to the small file to be accessed exists in a newly established department node is searched according to the name of the small file, if not, the offset of the small file in the large file and the position information of a data block of the large file on a DataNode are obtained through memory loading according to the name of the small file, and the storage position of the small file is found according to the information.
The distributed big data storage method for marine bureau ship inspection based on Hadoop as claimed in claim, wherein: the step of S5 is as follows:
S5.1 The cache region on the DataNode is divided into a blocks, each of size M.
S5.2 The size of the cache region is determined by the values of a and M, which can be adjusted according to the node resources; each region of size M caches one small file. A cache prefetching module designed around the data characteristics greatly reduces the time needed to read hot-spot files, further optimizing the small-file storage system.
S5.3 A cache replacement strategy, namely the least-frequently-used (LFU) strategy, is adopted. In the file system, the cached metadata occupies relatively little memory, and the caches are distributed across different nodes; caching small-file metadata avoids unnecessary disk interaction and communication, reducing time consumption and improving cache efficiency.
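The least-frequently-used policy of S5.3 can be sketched with a frequency counter per slot. This is a minimal illustration, not the patent's implementation; a real cache would hold fixed-size regions of size M rather than arbitrary values:

```python
class LFUCache:
    def __init__(self, slots):
        self.slots = slots           # the a regions of size M
        self.data, self.freq = {}, {}

    def get(self, key):
        if key not in self.data:
            return None
        self.freq[key] += 1
        return self.data[key]

    def put(self, key, value):
        # Evict the least frequently used entry once all slots are full.
        if key not in self.data and len(self.data) >= self.slots:
            victim = min(self.freq, key=self.freq.get)
            del self.data[victim]
            del self.freq[victim]
        self.data[key] = value
        self.freq[key] = self.freq.get(key, 0) + 1
```

LFU suits hot-spot workloads like the one described here: a frequently inspected department's files accumulate high counts and stay resident, while one-off reads are evicted first.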
Claims (7)
1. A big data distributed storage method for ship inspection of a marine office based on Hadoop, characterized by comprising the following steps: analyzing the data characteristics of the marine bureau ship inspection platform and determining the demarcation-point file size; adding a department node between the client and the NameNode to mitigate the poor performance of HDFS when processing massive numbers of small files; merging files with the same department id using a department detection algorithm, while a MapFile-based method merges small files into large files through key-value-pair serialization and establishes a mapping from small files to large files; and optimizing small-file retrieval by caching the metadata of the merged large files, streamlining the small-file access flow and improving access efficiency.
2. A big data distributed storage method for marine bureau ship inspection based on Hadoop is characterized by comprising the following steps:
s1: analyzing the storage performance of the mass small files and determining the sizes of the small files;
s2: optimally designing small file storage;
s3: designing a file preprocessing module;
s4: searching and optimally designing small files;
s5: and designing a cache prefetching module.
3. The Hadoop-based big data distributed storage method for the ship inspection of the maritime office according to claim 2, wherein the step S1 specifically comprises the following steps:
s1.1, collecting a file size data set to be analyzed by a maritime affairs bureau, and carrying out quantitative analysis;
s1.2, determining the file size of the demarcation point by adopting a linear fitting method.
4. The Hadoop-based big data distributed storage method for the ship inspection of the maritime affairs bureau according to claim 2, wherein: the optimization targets in step S2 include node memory consumption and the time cost of node access, specifically defined as follows:
one NameNode node in the Hadoop distributed file system is in the active state; it does not store file content and is responsible only for the metadata of the files stored in the system; assuming that the memory consumption of the HDFS itself is a when no file is stored, the memory consumption of the BlockMap in the NameNode is b, and the default HDFS block size is 64MB, then when files of sizes M1, M2, M3, …, Mn are stored in sequence, the memory consumption of the NameNode is defined as:
W = a + Σ_{i=1}^{n} ⌈Mi/64⌉ × (368 + b)
where ⌈Mi/64⌉ denotes the number of blocks into which a file is split on storage, and (368 + b) denotes the NameNode memory occupied by the list information generated for each block; therefore, in order to reduce the memory consumption of the NameNode node, the number of file blocks needs to be reduced;
reading N files of sizes M1, M2, M3, …, Mn in sequence, wherein the time consumed by reading the N files is derived from the delays defined below, and the load state is characterized by the disk usage rate;
the NameNode receives the time consumption of the Client request, and the time consumption is marked as Tcn, and the time consumed when the Client receives the metadata sent back by the NameNode is marked as Tnc; the time consumed by the Client for sending a request to the DataNode after receiving the metadata is recorded as Tcd; the data node receives the request and inquires the time consumed by the file, and the time is recorded as Td; the DataNode records the time consumed by returning the inquired file to the Client as Tdc;
when HDFS uses C blocks to store the files, each file whose size does not exceed the default block size is stored independently, so that C is equal to N; the total time consumed is therefore calculated as:
T = N × (Tcn + Tnc + Tcd + Td + Tdc)
therefore, as long as the number of interaction times with the NameNode is reduced, the time for acquiring metadata information in the NameNode is reduced, and the time for reading files in the DataNode is reduced, the reading efficiency of the HDFS can be improved.
5. The Hadoop-based big data distributed storage method for the ship inspection of the maritime affairs bureau according to claim 2, wherein the concrete steps of S3 are as follows:
s3.1, adding a department node; the department nodes comprise a file preprocessing module, a file prefetching module and a caching module;
s3.2, judging the size of the file uploaded by the client and storing the related file in a continuous disk space as much as possible, so that the hit rate of file prefetching is improved and the file reading delay is reduced;
s3.3, department detection, namely judging the department ID number in the small file name at the beginning of file combination, and adding files with the same department ID number into the same file combination waiting queue;
s3.4, at the moment, the small files are merged using the MapFile technique; since the file system splits files that exceed the default block size into blocks, the merge queue threshold is set to 64MB to prevent the merged file from being stored in split form.
6. The Hadoop-based big data distributed storage method for the ship inspection of the maritime affairs bureau according to claim 2, wherein: the step of S4 is as follows:
when a small file is accessed, the department node is first searched by the small-file name for cached metadata of the corresponding merged large file; if no cache entry exists, the offset of the small file within the large file and the locations of the large file's data blocks on the DataNodes are loaded into memory according to the small-file name, and the storage location of the small file is determined from this information.
7. The Hadoop-based big data distributed storage method for the ship inspection of the maritime affairs bureau according to claim 2, wherein: the step of S5 is as follows:
s5.1, dividing the cache region on the DataNode into a blocks, each of size M;
s5.2, determining the size of a cache region according to the values of a and M, wherein the values of a and M can be adjusted according to node resources, and each region with the size of M caches a small file; a proper cache prefetching module is designed according to the data characteristics, so that the time consumption for reading the hot files can be greatly reduced, and the optimization of a small file storage system is further realized;
s5.3, a cache replacement strategy, namely a least frequently used strategy, is adopted, so that the cache efficiency is improved.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210147679.1A CN114546962A (en) | 2022-02-17 | 2022-02-17 | Hadoop-based distributed storage system for marine bureau ship inspection big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210147679.1A CN114546962A (en) | 2022-02-17 | 2022-02-17 | Hadoop-based distributed storage system for marine bureau ship inspection big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114546962A true CN114546962A (en) | 2022-05-27 |
Family
ID=81675793
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210147679.1A Pending CN114546962A (en) | 2022-02-17 | 2022-02-17 | Hadoop-based distributed storage system for marine bureau ship inspection big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114546962A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117519608A (en) * | 2023-12-27 | 2024-02-06 | 泰安北航科技园信息科技有限公司 | Big data server with Hadoop as core |
CN117519608B (en) * | 2023-12-27 | 2024-03-22 | 泰安北航科技园信息科技有限公司 | Big data server with Hadoop as core |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108710639B (en) | Ceph-based access optimization method for mass small files | |
Baeza-Yates et al. | The impact of caching on search engines | |
KR102564170B1 (en) | Method and device for storing data object, and computer readable storage medium having a computer program using the same | |
CN102467572B (en) | Data block inquiring method for supporting data de-duplication program | |
US8463846B2 (en) | File bundling for cache servers of content delivery networks | |
Cambazoglu et al. | Scalability challenges in web search engines | |
CN113377868B (en) | Offline storage system based on distributed KV database | |
CN102521406A (en) | Distributed query method and system for complex task of querying massive structured data | |
CN102521405A (en) | Massive structured data storage and query methods and systems supporting high-speed loading | |
US9262511B2 (en) | System and method for indexing streams containing unstructured text data | |
CN109766318B (en) | File reading method and device | |
Dong et al. | Correlation based file prefetching approach for hadoop | |
CN104111898A (en) | Hybrid storage system based on multidimensional data similarity and data management method | |
CN106155934A (en) | Based on the caching method repeating data under a kind of cloud environment | |
CN111159176A (en) | Method and system for storing and reading mass stream data | |
CN114546962A (en) | Hadoop-based distributed storage system for marine bureau ship inspection big data | |
CN112559459B (en) | Cloud computing-based self-adaptive storage layering system and method | |
CN111857582B (en) | Key value storage system | |
CN101459599B (en) | Method and system for implementing concurrent execution of cache data access and loading | |
CN117076466A (en) | Rapid data indexing method for large archive database | |
US20200019539A1 (en) | Efficient and light-weight indexing for massive blob/objects | |
CN109213760B (en) | High-load service storage and retrieval method for non-relational data storage | |
CN115562592A (en) | Memory and disk hybrid caching method based on cloud object storage | |
CN109582233A (en) | A kind of caching method and device of data | |
CN113722274A (en) | Efficient R-tree index remote sensing data storage model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||