CN114546962A - Hadoop-based distributed storage system for marine bureau ship inspection big data


Info

Publication number: CN114546962A
Authority: CN (China)
Prior art keywords: file, files, small, namenode, node
Legal status: Pending
Application number: CN202210147679.1A
Other languages: Chinese (zh)
Inventors: 邓酩, 谢刚, 侯立宪, 刘超, 柳庆龙
Current assignee: Guilin University of Technology
Original assignee: Guilin University of Technology
Filing date: 2022-02-17
Publication date: 2022-05-27
Application filed by Guilin University of Technology
Priority to CN202210147679.1A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/164 File meta data generation
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Hadoop-based big data distributed storage method for marine bureau ship inspection. First, the data characteristics of the marine bureau ship inspection platform are analyzed, and a department node is added between the client and the NameNode, mitigating the poor performance of HDFS when it processes massive numbers of small files. Second, a small-file preprocessing module and a prefetching module are provided in this node to merge the small files. Meanwhile, the metadata of the merged large files that contain hot-spot data is cached, which streamlines the small-file access flow and relieves the pressure on the NameNode. The invention addresses the typically low performance of conventional HDFS when storing massive small files and suits the maritime bureau's characteristic workload of numerous small data files.

Description

Hadoop-based distributed storage system for marine bureau ship inspection big data
Technical Field
The invention relates to the field of distributed storage systems, and in particular to a Hadoop-based distributed storage method for marine bureau ship inspection big data.
Background
With the rapid development of Internet technology, the scale of network information has grown exponentially. As enterprise informatization advances, traditional data storage can no longer handle the volume of data generated in daily operation, so more and more enterprises have adopted distributed file systems. HDFS, the storage layer of a Hadoop cluster, can store massive data in a distributed manner and, being open source, scalable, and highly fault tolerant, is widely used for the distributed storage of large numbers of files.
However, the way HDFS randomly selects storage nodes raises two problems: data tends to be distributed unevenly, and the default placement strategy ignores hardware differences among DataNodes. Most of the files stored by the maritime bureau are massive small files, yet HDFS assumes a homogeneous cluster and was not designed to account for hardware differences among the DataNodes. When a client requests a file, it first queries the NameNode for the file's metadata and then uses that metadata to locate the DataNode holding the stored data. Hadoop was designed to store large files: for the same total volume stored, large files consume less NameNode memory than small ones, so a flood of small-file storage and access requests degrades NameNode performance.
Disclosure of Invention
A Hadoop-based big data distributed storage method for marine bureau ship inspection, aimed at solving the problems described in the background art.
To achieve the above object, the invention provides the following technical scheme: a Hadoop-based big data distributed storage method for marine bureau ship inspection, comprising the following steps:
s1: analyzing the storage performance of massive small files and determining the small-file size threshold;
s2: optimizing the design of small-file storage;
s3: designing a file preprocessing module;
s4: optimizing small-file retrieval;
s5: designing a cache prefetching module;
the step S1 specifically includes the following steps:
s1.1, collecting a file size data set to be analyzed by a maritime affairs bureau, and carrying out quantitative analysis;
s1.2, determining the file size of the demarcation point by adopting a linear fitting method;
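The linear-fitting step of s1.2 can be sketched as follows. This is a minimal illustration with hypothetical data, not the patent's measurements: a cost line is fitted for each storage regime, and their intersection is taken as the demarcation-point file size.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = k*x + c; returns (k, c)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return k, my - k * mx

def demarcation_point(sizes_mb, cost_a, cost_b):
    """Fit one cost line per storage regime (e.g. store directly vs merge)
    and return the file size where the two lines intersect."""
    k1, c1 = fit_line(sizes_mb, cost_a)
    k2, c2 = fit_line(sizes_mb, cost_b)
    return (c2 - c1) / (k1 - k2)
```

With the patent's own measured costs in place of the hypothetical ones, this procedure is what yields the 4.35 MB demarcation point reported later in the description.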
the optimization targets in step S2 are node memory consumption and node access time, specifically defined as follows:
one NameNode node in the Hadoop distributed file system is active; it stores no file content and is responsible only for the metadata of the files stored in the system. Assume that the memory consumed by HDFS itself when no files are stored is a, that the memory consumed by a BlockMap entry in the NameNode is b, and that the default HDFS block size is 64MB. When files of sizes M1, M2, ..., MN are stored in sequence, the NameNode memory consumption is:
E_mem = a + Σ_{i=1}^{N} ⌈M_i/64⌉ × (368 + b)
where ⌈M_i/64⌉ is the number of blocks into which file i is split, and (368 + b) is the NameNode memory occupied by the list information generated for each block. Therefore, to reduce NameNode memory consumption, the number of file blocks must be reduced.
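As a minimal sketch, the memory model above can be computed directly; the values of a, b, and the file sizes below are illustrative, not measurements from the patent:

```python
import math

def namenode_memory(sizes_mb, a, b, block_mb=64):
    """NameNode memory model: base cost a plus (368 + b) units of
    block-list information for every block each file occupies."""
    return a + sum(math.ceil(m / block_mb) * (368 + b) for m in sizes_mb)
```

Merging many small files into few large ones reduces the block count, and hence how often the (368 + b) per-block cost is paid: a thousand 1 MB files cost 1000 block entries, while the same data merged into 64 MB files costs only 16.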
Reading N files of sizes M1, M2, ..., MN in sequence, the total time consumed to read the N files is:
T_total = Σ_{i=1}^{N} (Tcn + Tnc + Tcd + Td_i + Tdc)
the NameNode receives the time consumption of the Client request and records the time consumption as Tcn, and the time consumed by the Client for receiving the metadata sent back by the NameNode is recorded as Tnc. The time consumed by the Client to send a request to the DataNode after receiving the metadata is marked as Tcd. The time consumed by the DataNode to query the file after receiving the request is recorded as Td. The DataNode records the time consumed by returning the query file to the Client as Tdc.
Let C be the number of blocks HDFS uses to store the files. Each file whose size does not exceed the default block size is stored as its own block, so C = N, and the total time consumed evaluates to:
T_total = N × (Tcn + Tnc + Tcd + Tdc) + Σ_{i=1}^{N} Td_i
therefore, reducing the number of interactions with the NameNode, the time spent retrieving metadata from the NameNode, and the time spent reading files from the DataNodes improves the read efficiency of HDFS.
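The time model can likewise be sketched; the helper below restates the C = N formula above, with illustrative timings:

```python
def total_read_time(t_cn, t_nc, t_cd, t_dc, t_d_list):
    """Total time to read N small files when C = N: every file pays
    the full metadata round trip (Tcn + Tnc + Tcd + Tdc) plus its
    own DataNode lookup time Td."""
    n = len(t_d_list)
    return n * (t_cn + t_nc + t_cd + t_dc) + sum(t_d_list)
```

The N-fold round-trip term is exactly what the department node's metadata cache is meant to shrink.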
The S3 concrete steps are as follows:
s3.1 add a department node. The department node comprises a file preprocessing module, a file prefetching module and a caching module.
S3.2, judging the size of the file uploaded by the client, and storing related files in contiguous disk space as far as possible, which improves the file prefetch hit rate and reduces file read latency.
S3.3, department detection: at the start of file merging, the department ID number in the small file name is checked, and files with the same department ID number are added to the same file-merging waiting queue.
S3.4 at this time, the small files are merged using the MapFile technique. Since the file system splits files that exceed the default block size into multiple blocks, the merge queue threshold is set to 64MB so that a merged file is not stored split across blocks.
The step of S4 is as follows:
When accessing a small file, the newly added department node is first searched, according to the small file's name, for cached metadata of the large file that contains it. If no cache entry exists, the small file's offset within the large file and the location of the large file's data blocks on the DataNodes are obtained by loading the mapping into memory according to the small file's name, and the small file's storage location is determined from this information.
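A minimal sketch of this retrieval flow, using plain dictionaries in place of the department node's metadata cache and the persistent small-file-to-merged-file index (the names and tuple layout are illustrative):

```python
def locate_small_file(name, cache, index):
    """Return (merged_file, offset, length) for a small file.
    cache: department-node metadata cache, keyed by small-file name.
    index: persistent small-file -> merged-file mapping."""
    meta = cache.get(name)
    if meta is None:
        meta = index[name]   # offset in merged file + block location
        cache[name] = meta   # warm the cache for later accesses
    return meta
```

On a cache hit the NameNode is never consulted, which is where the reduction in NameNode interactions comes from.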
The step of S5 is as follows:
s5.1, the cache region on the DataNode is divided into a cache blocks, each of size M.
S5.2, the size of the cache region is determined by the values of a and M, which can be adjusted according to node resources; each region of size M caches one small file. A cache prefetching module designed to match the data characteristics greatly reduces the time spent reading hot-spot files, further optimizing the small-file storage system.
S5.3, a least-frequently-used (LFU) cache replacement strategy is adopted to improve cache efficiency.
Drawings
FIG. 1 is a diagram of a data storage architecture according to the present invention.
FIG. 2 is a flow chart of a data storage architecture according to the present invention.
Detailed Description
Example:
The invention provides the following technical scheme: a Hadoop-based big data distributed storage method for marine bureau ship inspection, comprising the following steps:
s1: analyzing the storage performance of massive small files and determining the small-file size threshold;
s2: optimizing the design of small-file storage;
s3: designing a file preprocessing module;
s4: optimizing small-file retrieval;
s5: designing a cache prefetching module;
in the Hadoop-based distributed storage method for marine bureau ship inspection big data, step S1 specifically comprises the following steps:
s1.1, collecting a file size data set to be analyzed by a maritime affairs bureau, and carrying out quantitative analysis;
s1.2, carrying out quantitative analysis on 23 files of different sizes, all between 0.5MB and 64MB; by analyzing how files of different sizes affect memory consumption and access performance during processing, the demarcation-point file size is determined by linear fitting to be 4.35MB;
in the Hadoop-based distributed storage method for marine bureau ship inspection big data, the optimization targets in step S2 are node memory consumption and node access time, specifically defined as follows:
one NameNode node in the Hadoop distributed file system is active; it stores no file content and is responsible only for the metadata of the files stored in the system. Assume that the memory consumed by HDFS itself when no files are stored is a, that the memory consumed by a BlockMap entry in the NameNode is b, and that the default HDFS block size is 64MB. When files of sizes M1, M2, ..., MN are stored in sequence, the NameNode memory consumption is:
E_mem = a + Σ_{i=1}^{N} ⌈M_i/64⌉ × (368 + b)
where ⌈M_i/64⌉ is the number of blocks into which file i is split, and (368 + b) is the NameNode memory occupied by the list information generated for each block. Therefore, to reduce NameNode memory consumption, the number of file blocks must be reduced.
Reading N files of sizes M1, M2, ..., MN in sequence, the total time consumed to read the N files is:
T_total = Σ_{i=1}^{N} (Tcn + Tnc + Tcd + Td_i + Tdc)
The time for the NameNode to receive a Client request is denoted Tcn; the time for the Client to receive the metadata returned by the NameNode is Tnc; the time for the Client to send a request to the DataNode after receiving the metadata is Tcd; the time for the DataNode to look up the file after receiving the request is Td; and the time for the DataNode to return the queried file to the Client is Tdc.
Let C be the number of blocks HDFS uses to store the files. Each file whose size does not exceed the default block size is stored as its own block, so C = N, and the total time consumed evaluates to:
T_total = N × (Tcn + Tnc + Tcd + Tdc) + Σ_{i=1}^{N} Td_i
therefore, reducing the number of interactions with the NameNode, the time spent retrieving metadata from the NameNode, and the time spent reading files from the DataNodes improves the read efficiency of HDFS.
In the Hadoop-based distributed storage method for marine bureau ship inspection big data, the concrete steps of S3 are as follows:
s3.1 add a department node, as shown in the data storage architecture diagram of fig. 1. The department node comprises a file preprocessing module, a file prefetching module and a caching module. The file preprocessing module merges small files with the same department number to increase their correlation, establishes a mapping with the small file name as the primary key, and stores it in HBase, relieving the NameNode's memory pressure.
S3.2, judging the size of the file uploaded by the client, and storing related files in contiguous disk space as far as possible, which improves the file prefetch hit rate and reduces file read latency.
S3.3 As shown in the file processing flow chart of FIG. 2, at the start of file merging the department ID number in the small file name is checked first, and files with the same department ID number are added to the same file-merging waiting queue; for a different department ID number, a new waiting queue is created.
S3.4 at this time, the small files are merged using the MapFile technique. Since the file system splits files that exceed the default block size, the merge queue threshold is set to 64MB so that a merged file is not stored split across blocks. If a file is larger than the merge queue's remaining space, or the queue's waiting time exceeds wait_time, the current queue is submitted for merging and a new merge queue is started; otherwise the small file is added to the queue, and the queue is merged when its remaining space reaches the set threshold.
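The queueing behaviour of S3.3 and S3.4 can be sketched as follows. The class below groups files by department ID and flushes a queue whenever an incoming file would push it past the 64MB threshold; the wait_time timeout is omitted for brevity, and all class and field names are illustrative:

```python
from collections import defaultdict

MERGE_LIMIT_MB = 64  # merge-queue threshold: stay at or below the default
                     # HDFS block size so a merged file is not split

class MergeQueues:
    """Per-department merge queues for small files (sketch)."""

    def __init__(self, limit_mb=MERGE_LIMIT_MB):
        self.limit_mb = limit_mb
        self.queues = defaultdict(list)    # dept id -> [(name, size_mb)]
        self.used = defaultdict(float)     # dept id -> queued MB
        self.merged = []                   # completed merge batches

    def add(self, dept_id, name, size_mb):
        # If the file does not fit the queue's remaining space,
        # flush the current queue and start a new one.
        if self.used[dept_id] + size_mb > self.limit_mb:
            self.flush(dept_id)
        self.queues[dept_id].append((name, size_mb))
        self.used[dept_id] += size_mb

    def flush(self, dept_id):
        if self.queues[dept_id]:
            self.merged.append(self.queues[dept_id])
            self.queues[dept_id] = []
            self.used[dept_id] = 0.0
```

In the patented design each flushed batch would be written out as one MapFile keyed by small-file name; here the batch is simply recorded.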
In the Hadoop-based distributed storage method for marine bureau ship inspection big data, the step of S4 is as follows:
when accessing a small file, firstly, whether metadata cache information of a large file corresponding to the small file to be accessed exists in a newly established department node is searched according to the name of the small file, if not, the offset of the small file in the large file and the position information of a data block of the large file on a DataNode are obtained through memory loading according to the name of the small file, and the storage position of the small file is found according to the information.
In the Hadoop-based distributed storage method for marine bureau ship inspection big data, the step of S5 is as follows:
s5.1, the cache region on the DataNode is divided into a cache blocks, each of size M.
S5.2, the size of the cache region is determined by the values of a and M, which can be adjusted according to node resources; each region of size M caches one small file. A cache prefetching module designed to match the data characteristics greatly reduces the time spent reading hot-spot files, further optimizing the small-file storage system.
S5.3, a least-frequently-used (LFU) cache replacement strategy is adopted. In the file system, what is cached is the metadata of the data, which occupies relatively little memory, and the caches are distributed across different nodes; caching small-file metadata avoids unnecessary disk access and communication, reducing time consumption and improving cache efficiency.
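The least-frequently-used replacement of S5.3 can be sketched as a small metadata cache; the capacity parameter plays the role of the a regions of size M from S5.1, and this is a minimal illustration rather than the patented implementation:

```python
class LFUCache:
    """Least-frequently-used cache for merged-file metadata (sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}   # key -> cached metadata
        self.freq = {}   # key -> access count

    def get(self, key):
        if key not in self.data:
            return None
        self.freq[key] += 1
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            # evict the least frequently used entry
            victim = min(self.freq, key=self.freq.get)
            del self.data[victim]
            del self.freq[victim]
        self.data[key] = value
        self.freq[key] = self.freq.setdefault(key, 0) + 1
```

Entries for hot-spot files accumulate high counts and survive eviction, matching the intent of caching hot-spot metadata near the DataNodes.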

Claims (7)

1. A Hadoop-based big data distributed storage method for marine bureau ship inspection, characterized by comprising the following steps: analyzing the data characteristics of the marine bureau ship inspection platform and determining the demarcation-point file size; adding a department node between the client and the NameNode to remedy the low performance of HDFS when processing massive small files; merging files that share a department id by a department detection algorithm and, by a MapFile-based method, merging small files into large files through key-value-pair serialization while establishing a mapping from small files to large files; and optimizing small-file retrieval by caching the metadata of the merged large files, thereby optimizing the small-file access flow and improving access efficiency.
2. A big data distributed storage method for marine bureau ship inspection based on Hadoop is characterized by comprising the following steps:
s1: analyzing the storage performance of massive small files and determining the small-file size threshold;
s2: optimizing the design of small-file storage;
s3: designing a file preprocessing module;
s4: optimizing small-file retrieval;
s5: and designing a cache prefetching module.
3. The Hadoop-based big data distributed storage method for the ship inspection of the maritime office according to claim 2, wherein the step S1 specifically comprises the following steps:
s1.1, collecting a file size data set to be analyzed by a maritime affairs bureau, and carrying out quantitative analysis;
s1.2, determining the file size of the demarcation point by adopting a linear fitting method.
4. The Hadoop-based big data distributed storage method for the ship inspection of the maritime affairs bureau according to claim 2, wherein the optimization targets in step S2 are node memory consumption and node access time, specifically defined as follows:
one NameNode node in the Hadoop distributed file system is active; it stores no file content and is responsible only for the metadata of the files stored in the system; assuming that the memory consumed by HDFS itself when no files are stored is a, that the memory consumed by a BlockMap entry in the NameNode is b, and that the default HDFS block size is 64MB, when files of sizes M1, M2, ..., MN are stored in sequence, the NameNode memory consumption is:
E_mem = a + Σ_{i=1}^{N} ⌈M_i/64⌉ × (368 + b)
where ⌈M_i/64⌉ is the number of blocks into which file i is split and (368 + b) is the NameNode memory occupied by the list information generated for each block; therefore, to reduce NameNode memory consumption, the number of file blocks needs to be reduced;
reading N files of sizes M1, M2, ..., MN in sequence, the total time consumed to read the N files is:
T_total = Σ_{i=1}^{N} (Tcn + Tnc + Tcd + Td_i + Tdc)
the time for the NameNode to receive a Client request is denoted Tcn; the time for the Client to receive the metadata returned by the NameNode is Tnc; the time for the Client to send a request to the DataNode after receiving the metadata is Tcd; the time for the DataNode to look up the file after receiving the request is Td; and the time for the DataNode to return the queried file to the Client is Tdc;
when C is the number of blocks HDFS uses to store the files, each file whose size does not exceed the default block size is stored as its own block, so C = N, and the total time consumed evaluates to:
T_total = N × (Tcn + Tnc + Tcd + Tdc) + Σ_{i=1}^{N} Td_i
therefore, reducing the number of interactions with the NameNode, the time spent retrieving metadata from the NameNode, and the time spent reading files from the DataNodes improves the read efficiency of HDFS.
5. The Hadoop-based big data distributed storage method for the ship inspection of the maritime affairs bureau according to claim 2, wherein: the S3 concrete steps are as follows:
s3.1, adding a department node; the department nodes comprise a file preprocessing module, a file prefetching module and a caching module;
s3.2, judging the size of the file uploaded by the client and storing the related file in a continuous disk space as much as possible, so that the hit rate of file prefetching is improved and the file reading delay is reduced;
s3.3, department detection, namely judging the department ID number in the small file name at the beginning of file combination, and adding files with the same department ID number into the same file combination waiting queue;
s3.4, at this moment, the MapFile technique is adopted to merge the small files; since the file system splits files that exceed the default block size, the merge queue threshold is set to 64MB so that a merged file is not stored split across blocks.
6. The Hadoop-based big data distributed storage method for the ship inspection of the maritime affairs bureau according to claim 2, wherein: the step of S4 is as follows:
when accessing a small file, the newly added department node is first searched, according to the small file's name, for cached metadata of the large file that contains it; if no cache entry exists, the small file's offset within the large file and the location of the large file's data blocks on the DataNodes are obtained by loading the mapping into memory according to the small file's name, and the small file's storage location is determined from this information.
7. The Hadoop-based big data distributed storage method for the ship inspection of the maritime affairs bureau according to claim 2, wherein: the step of S5 is as follows:
s5.1, dividing the cache region on the DataNode into a cache blocks, each of size M;
s5.2, determining the size of a cache region according to the values of a and M, wherein the values of a and M can be adjusted according to node resources, and each region with the size of M caches a small file; a proper cache prefetching module is designed according to the data characteristics, so that the time consumption for reading the hot files can be greatly reduced, and the optimization of a small file storage system is further realized;
s5.3, a cache replacement strategy, namely a least frequently used strategy, is adopted, so that the cache efficiency is improved.
CN202210147679.1A 2022-02-17 2022-02-17 Hadoop-based distributed storage system for marine bureau ship inspection big data Pending CN114546962A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210147679.1A CN114546962A (en) 2022-02-17 2022-02-17 Hadoop-based distributed storage system for marine bureau ship inspection big data


Publications (1)

Publication Number Publication Date
CN114546962A true CN114546962A (en) 2022-05-27

Family

ID=81675793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210147679.1A Pending CN114546962A (en) 2022-02-17 2022-02-17 Hadoop-based distributed storage system for marine bureau ship inspection big data

Country Status (1)

Country Link
CN (1) CN114546962A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117519608A (en) * 2023-12-27 2024-02-06 泰安北航科技园信息科技有限公司 Big data server with Hadoop as core
CN117519608B (en) * 2023-12-27 2024-03-22 泰安北航科技园信息科技有限公司 Big data server with Hadoop as core


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination