CN114546962A - Hadoop-based distributed storage system for marine bureau ship inspection big data - Google Patents
- Publication number
- CN114546962A CN114546962A CN202210147679.1A CN202210147679A CN114546962A CN 114546962 A CN114546962 A CN 114546962A CN 202210147679 A CN202210147679 A CN 202210147679A CN 114546962 A CN114546962 A CN 114546962A
- Authority
- CN
- China
- Prior art keywords
- file
- files
- small
- namenode
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
- G06F16/164—File meta data generation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a Hadoop-based distributed storage method for marine bureau ship inspection big data. The method first analyzes the data characteristics of the marine bureau ship inspection platform and adds a department node between the client and the NameNode, mitigating the poor performance of HDFS when processing massive numbers of small files. Second, a small-file preprocessing module and a prefetching module are set up for merging the small files within the node. Meanwhile, the metadata of the large files produced by merging hot-spot data is cached, which streamlines the small-file access flow and relieves pressure on the NameNode. The invention addresses the low performance typically exhibited by a traditional HDFS when storing massive numbers of small files, and makes file storage more convenient given the maritime bureau's characteristic of massive numbers of small data files.
Description
Technical Field
The invention relates to the field of distributed storage systems, and in particular to a Hadoop-based distributed storage method for marine bureau ship inspection big data.
Background
With the rapid development of internet technology, the scale of network information has grown exponentially. As enterprise informatization continues to improve, traditional data storage cannot meet the demands of the large volumes of data generated daily, so more and more enterprises have begun to use distributed file systems. HDFS is the bottom layer of a Hadoop cluster and has the capacity to store massive data in a distributed manner; thanks to its open source, scalability, and high fault tolerance, it is widely used for distributed storage of large numbers of files.
However, the way HDFS randomly selects storage nodes is prone to two problems: data distribution easily becomes unbalanced, and the default placement strategy does not account for hardware differences between data nodes. Most of the files stored by the maritime bureau are massive numbers of small files, yet the HDFS distributed system stores data on the premise of a homogeneous cluster and was not designed to compensate for hardware differences among the data nodes in the cluster. When a client initiates a file request, it accesses the NameNode to obtain the metadata of the corresponding file, then uses that metadata to locate the DataNode holding the stored data. Hadoop was designed to store larger files: for the same total volume of stored data, large files consume less NameNode memory, so when there are massive numbers of small-file storage and access requests, the NameNode exhibits low performance.
Disclosure of Invention
A Hadoop-based big data distributed storage method for marine bureau ship inspection aims at solving the problems in the background technology.
In order to realize the above aim, the invention provides the following technical scheme: a Hadoop-based big data distributed storage method for marine bureau ship inspection, comprising the following steps:
s1: analyzing the storage performance of the mass small files and determining the sizes of the small files;
s2: optimally designing small file storage;
s3: designing a file preprocessing module;
s4: searching and optimally designing small files;
s5: designing a cache prefetching module;
the step S1 specifically includes the following steps:
s1.1, collecting a file size data set to be analyzed by a maritime affairs bureau, and carrying out quantitative analysis;
s1.2, determining the file size of the demarcation point by adopting a linear fitting method;
the optimized target in step S2 includes node memory consumption, and time frequency of node access, and is specifically defined as follows:
One NameNode node in the Hadoop distributed file system is in the active state; it does not store file content and is responsible only for the metadata of the files stored in the system. Assuming that the memory consumption of the HDFS itself is a when no file is stored, the memory consumption of the BlockMap in the NameNode is b, and the default HDFS block size is 64MB, then when files of sizes M1, M2, M3, …, Mn are stored in sequence, the memory consumption of the NameNode is defined as:
W = a + Σ_{i=1}^{n} ⌈Mi/64⌉ × (368 + b)
where ⌈Mi/64⌉ denotes the number of blocks into which file i is split on storage, and (368 + b) denotes the NameNode memory occupied by the list information generated for each block. Therefore, in order to reduce the memory consumption of the NameNode, the number of file blocks needs to be reduced.
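The memory model above can be sketched as a short calculation. The function name and the treatment of a and b as abstract memory units are assumptions for illustration, not the patent's implementation:

```python
from math import ceil

def namenode_memory(file_sizes_mb, a, b, block_mb=64):
    # Baseline consumption a, plus (368 + b) units of metadata for each
    # block a file is split into: W = a + sum(ceil(Mi/block)) * (368 + b).
    blocks = sum(ceil(m / block_mb) for m in file_sizes_mb)
    return a + blocks * (368 + b)
```

For example, 1000 one-megabyte files occupy 1000 blocks, while the same data merged into a single 1000 MB file occupies only ceil(1000/64) = 16 blocks, which is why merging small files relieves NameNode memory pressure.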
N files of sizes M1, M2, M3, …, Mn are read in sequence; the time consumed to read the N files is derived from the delays defined below. Further, the load state is characterized by the disk usage rate.
the NameNode receives the time consumption of the Client request and records the time consumption as Tcn, and the time consumed by the Client for receiving the metadata sent back by the NameNode is recorded as Tnc. The time consumed by the Client to send a request to the DataNode after receiving the metadata is marked as Tcd. The time consumed by the DataNode to query the file after receiving the request is recorded as Td. The DataNode records the time consumed by returning the query file to the Client as Tdc.
When HDFS uses C blocks to store the files, each file whose size does not exceed the default block size is stored independently, so C is equal to N. The total time consumed is therefore calculated as:
T = N × (Tcn + Tnc + Tcd + Td + Tdc)
therefore, as long as the number of interaction with the NameNode is reduced, the time for acquiring metadata information in the NameNode is reduced, and the time for reading files in the DataNode is reduced, the reading efficiency of the HDFS can be improved.
The concrete steps of S3 are as follows:
S3.1 Add a department node. The department node comprises a file preprocessing module, a file prefetching module, and a caching module.
S3.2, judging the size of the file uploaded by the client, and storing the related file in a continuous disk space as much as possible, so that the hit rate of file prefetching is improved, and the file reading delay is reduced.
S3.3, department detection, namely judging the department ID number in the small file name at the beginning of file combination, and adding files with the same department ID number into the same file combination waiting queue.
S3.4 At this time, the small files are merged using the MapFile technique. Since the file system splits files that exceed the default block size into blocks, the merge queue threshold is set to 64MB to prevent the merged file from being stored in split form.
When a small file is accessed, the department node is first searched by the small-file name for cached metadata of the corresponding merged large file. If no cache entry exists, the offset of the small file within the large file and the locations of the large file's data blocks on the DataNodes are loaded into memory according to the small-file name, and the storage location of the small file is determined from this information.
The step of S5 is as follows:
S5.1 The cache region on the DataNode is divided into a blocks, each of size M.
S5.2 The size of the cache region is determined by the values of a and M, which can be adjusted according to the node resources; each region of size M caches one small file. A cache prefetching module designed around the data characteristics greatly reduces the time needed to read hot-spot files, further optimizing the small-file storage system.
S5.3 A cache replacement strategy, namely the least-frequently-used (LFU) strategy, is adopted to improve cache efficiency.
Drawings
FIG. 1 is a diagram of a data storage architecture according to the present invention.
FIG. 2 is a flow chart of a data storage architecture according to the present invention.
Detailed Description
Example (b):
The invention provides the following technical scheme: a Hadoop-based big data distributed storage method for marine bureau ship inspection, comprising the following steps:
s1: analyzing the storage performance of the mass small files and determining the sizes of the small files;
s2: optimally designing small file storage;
s3: designing a file preprocessing module;
s4: searching and optimally designing small files;
s5: designing a cache prefetching module;
The Hadoop-based distributed storage method for marine bureau ship inspection big data according to the claims, wherein step S1 specifically comprises the following steps:
s1.1, collecting a file size data set to be analyzed by a maritime affairs bureau, and carrying out quantitative analysis;
S1.2 A quantitative analysis is carried out on 23 files of different sizes, all between 0.5MB and 64MB. By analyzing the influence of files of different sizes on memory consumption and access performance during processing, the demarcation-point file size is determined by linear fitting to be 4.35MB;
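The linear-fitting step can be sketched as follows: fit one straight line to the measured cost of each storage path and solve for the crossing point, which serves as the small-file cutoff. The two cost series and all helper names are hypothetical; the patent reports only the resulting 4.35 MB value.

```python
def linear_fit(xs, ys):
    # Ordinary least-squares fit of y = k*x + b; returns (k, b).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return k, my - k * mx

def cutoff_size(sizes_mb, merged_cost, direct_cost):
    # The demarcation point is where the two fitted cost lines intersect.
    k1, b1 = linear_fit(sizes_mb, merged_cost)
    k2, b2 = linear_fit(sizes_mb, direct_cost)
    return (b2 - b1) / (k1 - k2)
```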
the distributed big data storage method for marine bureau ship inspection based on Hadoop as claimed in claim, wherein: the optimized target in step S2 includes node memory consumption, and time frequency of node access, and is specifically defined as follows:
One NameNode node in the Hadoop distributed file system is in the active state; it does not store file content and is responsible only for the metadata of the files stored in the system. Assuming that the memory consumption of the HDFS itself is a when no file is stored, the memory consumption of the BlockMap in the NameNode is b, and the default HDFS block size is 64MB, then when files of sizes M1, M2, M3, …, Mn are stored in sequence, the memory consumption of the NameNode is defined as:
W = a + Σ_{i=1}^{n} ⌈Mi/64⌉ × (368 + b)
where ⌈Mi/64⌉ denotes the number of blocks into which file i is split on storage, and (368 + b) denotes the NameNode memory occupied by the list information generated for each block. Therefore, in order to reduce the memory consumption of the NameNode, the number of file blocks needs to be reduced.
The NameNode receives the time consumption of the Client request and records the time consumption as Tcn, and the time consumed by the Client for receiving the metadata sent back by the NameNode is recorded as Tnc. The time consumed by the Client to send a request to the DataNode after receiving the metadata is marked as Tcd. The time consumed by the DataNode to query the file after receiving the request is recorded as Td. The DataNode records the time consumed by returning the query file to the Client as Tdc.
When HDFS uses C blocks to store the files, each file whose size does not exceed the default block size is stored independently, so C is equal to N. The total time consumed is therefore calculated as:
T = N × (Tcn + Tnc + Tcd + Td + Tdc)
therefore, as long as the number of interaction with the NameNode is reduced, the time for acquiring metadata information in the NameNode is reduced, and the time for reading files in the DataNode is reduced, the reading efficiency of the HDFS can be improved.
The distributed big data storage method for marine bureau ship inspection based on Hadoop as claimed in claim, wherein the concrete steps of S3 are as follows:
S3.1 Add a department node, as shown in the data storage architecture diagram of FIG. 1. The department node comprises a file preprocessing module, a file prefetching module, and a caching module. The file preprocessing module merges small files having the same department number to increase the relevance of the small files, establishes a mapping keyed by the small-file name, and stores the mapping in HBase, relieving memory pressure on the NameNode.
S3.2, judging the size of the file uploaded by the client, and storing the related file in a continuous disk space as much as possible, so that the hit rate of file prefetching is improved, and the file reading delay is reduced.
S3.3 As shown in the file processing flow chart of FIG. 2, at the start of file merging the department ID number in the small-file name is examined first, and files with the same department ID number are added to the same merge waiting queue. For a different department ID number, a new waiting queue is created.
S3.4 At this time, the small files are merged using the MapFile technique. Since the file system splits files that exceed the default block size into blocks, the merge queue threshold is set to 64MB to prevent the merged file from being stored in split form. If a file is larger than the merge queue's remaining space, or the queue's latency exceeds wait_time, the file is added to the waiting queue, which is converted into a new merge queue. Otherwise the small file is added to the queue, and the queue is merged when its remaining space reaches the set threshold.
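The queue logic of S3.3/S3.4 can be sketched as below. The class, the flush return value, and the 60-second wait_time are assumptions for illustration; the actual serialization of a flushed batch into a MapFile is out of scope here.

```python
import time

QUEUE_CAPACITY_MB = 64.0  # one HDFS block, so the merged file is not split
WAIT_TIME = 60.0          # assumed flush deadline in seconds

class MergeQueues:
    def __init__(self):
        # dept_id -> (creation time, [(file name, size in MB), ...])
        self.queues = {}

    def add(self, dept_id, name, size_mb, now=None):
        now = time.time() if now is None else now
        created, files = self.queues.setdefault(dept_id, (now, []))
        used = sum(s for _, s in files)
        # Flush when the new file no longer fits or the queue has waited
        # too long; the flushed batch would go to the MapFile merger.
        if size_mb > QUEUE_CAPACITY_MB - used or now - created > WAIT_TIME:
            self.queues[dept_id] = (now, [(name, size_mb)])
            return files
        files.append((name, size_mb))
        return None
```

Grouping by department ID keeps related files in one merged large file, which is what makes the later prefetching of neighboring small files pay off.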
The distributed big data storage method for marine bureau ship inspection based on Hadoop as claimed in claim, wherein: the step of S4 is as follows:
when accessing a small file, firstly, whether metadata cache information of a large file corresponding to the small file to be accessed exists in a newly established department node is searched according to the name of the small file, if not, the offset of the small file in the large file and the position information of a data block of the large file on a DataNode are obtained through memory loading according to the name of the small file, and the storage position of the small file is found according to the information.
The distributed big data storage method for marine bureau ship inspection based on Hadoop as claimed in claim, wherein: the step of S5 is as follows:
S5.1 The cache region on the DataNode is divided into a blocks, each of size M.
S5.2 The size of the cache region is determined by the values of a and M, which can be adjusted according to the node resources; each region of size M caches one small file. A cache prefetching module designed around the data characteristics greatly reduces the time needed to read hot-spot files, further optimizing the small-file storage system.
S5.3 A cache replacement strategy, namely the least-frequently-used (LFU) strategy, is adopted. In the file system, the cached metadata occupies relatively little memory, and the caches are distributed across different nodes; caching small-file metadata avoids unnecessary disk interaction and communication, reducing time consumption and improving cache efficiency.
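The least-frequently-used policy of S5.3 can be sketched with a frequency counter per slot. This is a minimal illustration, not the patent's implementation; a real cache would hold fixed-size regions of size M rather than arbitrary values:

```python
class LFUCache:
    def __init__(self, slots):
        self.slots = slots           # the a regions of size M
        self.data, self.freq = {}, {}

    def get(self, key):
        if key not in self.data:
            return None
        self.freq[key] += 1
        return self.data[key]

    def put(self, key, value):
        # Evict the least frequently used entry once all slots are full.
        if key not in self.data and len(self.data) >= self.slots:
            victim = min(self.freq, key=self.freq.get)
            del self.data[victim]
            del self.freq[victim]
        self.data[key] = value
        self.freq[key] = self.freq.get(key, 0) + 1
```

LFU suits hot-spot workloads like the one described here: a frequently inspected department's files accumulate high counts and stay resident, while one-off reads are evicted first.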
Claims (7)
1. A big data distributed storage method for ship inspection of a marine office based on Hadoop, characterized by comprising the following steps: analyzing the data characteristics of the marine bureau ship inspection platform and determining the demarcation-point file size; adding a department node between the client and the NameNode to mitigate the poor performance of HDFS when processing massive numbers of small files; merging files with the same department id using a department detection algorithm, while a MapFile-based method merges small files into large files through key-value-pair serialization and establishes a mapping from small files to large files; and optimizing small-file retrieval by caching the metadata of the merged large files, streamlining the small-file access flow and improving access efficiency.
2. A big data distributed storage method for marine bureau ship inspection based on Hadoop is characterized by comprising the following steps:
s1: analyzing the storage performance of the mass small files and determining the sizes of the small files;
s2: optimally designing small file storage;
s3: designing a file preprocessing module;
s4: searching and optimally designing small files;
s5: and designing a cache prefetching module.
3. The Hadoop-based big data distributed storage method for the ship inspection of the maritime office according to claim 2, wherein the step S1 specifically comprises the following steps:
s1.1, collecting a file size data set to be analyzed by a maritime affairs bureau, and carrying out quantitative analysis;
s1.2, determining the file size of the demarcation point by adopting a linear fitting method.
4. The Hadoop-based big data distributed storage method for the ship inspection of the maritime affairs bureau according to claim 2, wherein: the optimization targets in step S2 include node memory consumption and the time cost of node access, specifically defined as follows:
one NameNode node in the Hadoop distributed file system is in the active state; it does not store file content and is responsible only for the metadata of the files stored in the system; assuming that the memory consumption of the HDFS itself is a when no file is stored, the memory consumption of the BlockMap in the NameNode is b, and the default HDFS block size is 64MB, then when files of sizes M1, M2, M3, …, Mn are stored in sequence, the memory consumption of the NameNode is defined as:
W = a + Σ_{i=1}^{n} ⌈Mi/64⌉ × (368 + b)
where ⌈Mi/64⌉ denotes the number of blocks into which a file is split on storage, and (368 + b) denotes the NameNode memory occupied by the list information generated for each block; therefore, in order to reduce the memory consumption of the NameNode node, the number of file blocks needs to be reduced;
reading N files of sizes M1, M2, M3, …, Mn in sequence, wherein the time consumed by reading the N files is derived from the delays defined below, and the load state is characterized by the disk usage rate;
the NameNode receives the time consumption of the Client request, and the time consumption is marked as Tcn, and the time consumed when the Client receives the metadata sent back by the NameNode is marked as Tnc; the time consumed by the Client for sending a request to the DataNode after receiving the metadata is recorded as Tcd; the data node receives the request and inquires the time consumed by the file, and the time is recorded as Td; the DataNode records the time consumed by returning the inquired file to the Client as Tdc;
when HDFS uses C blocks to store the files, each file whose size does not exceed the default block size is stored independently, so that C is equal to N; the total time consumed is therefore calculated as:
T = N × (Tcn + Tnc + Tcd + Td + Tdc)
therefore, as long as the number of interaction times with the NameNode is reduced, the time for acquiring metadata information in the NameNode is reduced, and the time for reading files in the DataNode is reduced, the reading efficiency of the HDFS can be improved.
5. The Hadoop-based big data distributed storage method for the ship inspection of the maritime affairs bureau according to claim 2, wherein the concrete steps of S3 are as follows:
s3.1, adding a department node; the department nodes comprise a file preprocessing module, a file prefetching module and a caching module;
s3.2, judging the size of the file uploaded by the client and storing the related file in a continuous disk space as much as possible, so that the hit rate of file prefetching is improved and the file reading delay is reduced;
s3.3, department detection, namely judging the department ID number in the small file name at the beginning of file combination, and adding files with the same department ID number into the same file combination waiting queue;
s3.4, at the moment, the small files are merged using the MapFile technique; since the file system splits files that exceed the default block size into blocks, the merge queue threshold is set to 64MB to prevent the merged file from being stored in split form.
6. The Hadoop-based big data distributed storage method for the ship inspection of the maritime affairs bureau according to claim 2, wherein: the step of S4 is as follows:
when a small file is accessed, the department node is first searched by the small-file name for cached metadata of the corresponding merged large file; if no cache entry exists, the offset of the small file within the large file and the locations of the large file's data blocks on the DataNodes are loaded into memory according to the small-file name, and the storage location of the small file is determined from this information.
7. The Hadoop-based big data distributed storage method for the ship inspection of the maritime affairs bureau according to claim 2, wherein: the step of S5 is as follows:
s5.1, dividing the cache region on the DataNode into a blocks, each of size M;
s5.2, determining the size of a cache region according to the values of a and M, wherein the values of a and M can be adjusted according to node resources, and each region with the size of M caches a small file; a proper cache prefetching module is designed according to the data characteristics, so that the time consumption for reading the hot files can be greatly reduced, and the optimization of a small file storage system is further realized;
s5.3, a cache replacement strategy, namely a least frequently used strategy, is adopted, so that the cache efficiency is improved.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210147679.1A CN114546962A (en) | 2022-02-17 | 2022-02-17 | Hadoop-based distributed storage system for marine bureau ship inspection big data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210147679.1A CN114546962A (en) | 2022-02-17 | 2022-02-17 | Hadoop-based distributed storage system for marine bureau ship inspection big data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114546962A true CN114546962A (en) | 2022-05-27 |
Family
ID=81675793
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210147679.1A Pending CN114546962A (en) | 2022-02-17 | 2022-02-17 | Hadoop-based distributed storage system for marine bureau ship inspection big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114546962A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117519608A (en) * | 2023-12-27 | 2024-02-06 | 泰安北航科技园信息科技有限公司 | Big data server with Hadoop as core |
CN117519608B (en) * | 2023-12-27 | 2024-03-22 | 泰安北航科技园信息科技有限公司 | Big data server with Hadoop as core |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108710639B (en) | Ceph-based access optimization method for mass small files | |
Baeza-Yates et al. | The impact of caching on search engines | |
KR102564170B1 (en) | Method and device for storing data object, and computer readable storage medium having a computer program using the same | |
CN102467572B (en) | Data block inquiring method for supporting data de-duplication program | |
US8463846B2 (en) | File bundling for cache servers of content delivery networks | |
Cambazoglu et al. | Scalability challenges in web search engines | |
CN113377868B (en) | Offline storage system based on distributed KV database | |
CN102521406A (en) | Distributed query method and system for complex task of querying massive structured data | |
CN102521405A (en) | Massive structured data storage and query methods and systems supporting high-speed loading | |
US9262511B2 (en) | System and method for indexing streams containing unstructured text data | |
CN109766318B (en) | File reading method and device | |
Dong et al. | Correlation based file prefetching approach for hadoop | |
CN104111898A (en) | Hybrid storage system based on multidimensional data similarity and data management method | |
CN106155934A (en) | Based on the caching method repeating data under a kind of cloud environment | |
CN111159176A (en) | Method and system for storing and reading mass stream data | |
CN114546962A (en) | Hadoop-based distributed storage system for marine bureau ship inspection big data | |
CN112559459B (en) | Cloud computing-based self-adaptive storage layering system and method | |
CN111857582B (en) | Key value storage system | |
CN101459599B (en) | Method and system for implementing concurrent execution of cache data access and loading | |
CN117076466A (en) | Rapid data indexing method for large archive database | |
US20200019539A1 (en) | Efficient and light-weight indexing for massive blob/objects | |
CN109213760B (en) | High-load service storage and retrieval method for non-relational data storage | |
CN115562592A (en) | Memory and disk hybrid caching method based on cloud object storage | |
CN109582233A (en) | A kind of caching method and device of data | |
CN113722274A (en) | Efficient R-tree index remote sensing data storage model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||