CN106909651A - Method for writing and reading small files based on HDFS - Google Patents
- Publication number
- CN106909651A (application number CN201710100365.5A)
- Authority
- CN
- China
- Prior art keywords
- file
- small files
- size
- hdfs
- write
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/172—Caching, prefetching or hoarding of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a method for writing and reading small files based on HDFS, characterized in that a separate user file is created for each user, small files are merged into that user file as a stream by file append operations, and a relational database is used to build an index for the small files that records their offset information. Merging small files in this way removes the restrictions on file format. A user's small files can be appended to the user file at any time, which avoids the file security problem that arises when files are first cached on a server and uploaded only after reaching a certain size. Building index information for the small files improves file retrieval efficiency. In addition, when a file smaller than a block is accessed, the request is forwarded to a data node for processing, which relieves pressure on the master server and improves file transfer efficiency.
Description
Technical field
The invention belongs to the technical field of file storage, and in particular relates to a method for writing and reading small files based on HDFS.
Background art
With the rapid development of the Internet, the volume of data on it has expanded sharply. To provide better services to users, Internet companies need to store and mine this data. The concept of cloud computing arose from this need. Cloud computing is a current research hotspot that addresses the computation and storage of big data, and cloud storage, as a derivative of cloud computing, has also become a focus of research at home and abroad. Among the many studies of cloud storage, HDFS, the distributed file system of Hadoop and an open-source implementation of the Google File System, has become a standard reference model in industry for studying cloud computing and cloud storage and for building cloud applications and cloud services. HDFS can be used for large-scale distributed storage and can form a cloud storage platform that is easy to scale, highly fault-tolerant, and high-performing. It also provides a set of reliable, stable interfaces that developers can build on and extend according to their actual needs. Hadoop has won the favor of many large companies and is widely used for mass data storage and processing.
The HDFS file system keeps data consistent effectively and suits write-once, read-many workloads. The framework can be deployed on arbitrary computers while remaining scalable. The backup and recovery mechanism of HDFS, together with its monitoring of assigned tasks, ensures the reliability of distributed storage, and its streaming file reads make HDFS well suited to reading massive data sets. HDFS is not perfect, however: it has limitations when storing and accessing massive numbers of small files. Several solutions to the HDFS small-file storage problem already exist:
(1) Store small files in HBase, improving storage efficiency by merging and splitting files. The drawback is that as the number of files grows, HBase performs a large number of merge and split operations that consume substantial system resources and seriously degrade system performance. In addition, HBase supports only simple character types; other types such as images are poorly supported and must be handled separately by the user.
(2) Archive and pack small files with Hadoop Archives (HAR files), the archiving tool provided by Hadoop. This effectively reduces the NameNode memory consumed by large numbers of small files, but a user accessing a file must go through a two-level index lookup to find it, so retrieval efficiency is low. An administrator must also maintain and run commands to perform the archiving, which makes the approach unsuitable for building an Internet-based cloud storage platform.
(3) Merge files using SequenceFile. A SequenceFile stores binary key-value pairs; when small files are stored in a SequenceFile, the file name is usually stored as the key and the file content as the value. The biggest drawback is that the keys are unsorted, so random reads are inefficient: the whole file must be traversed to read one entry. This approach also does not support file append operations, so small files must be cached on a server before merging, and file security cannot be guaranteed.
Summary of the invention
The object of the present invention is to address the defects of the prior art described above by providing a method for writing and reading small files based on HDFS, so as to solve the above technical problems.
To achieve this object, the technical solution of the present invention is:
A method for writing and reading small files based on HDFS, characterized in that it comprises the following two parts:
(1) Writing a file
Sw1: Send a file upload request;
Sw2: Obtain the size of the file to be uploaded and compare it with the set threshold. If the size is less than the threshold, the file is treated as a small file; then determine whether a user file exists. If it does not exist, perform step Sw3; if it exists, jump to step Sw4;
Sw3: Create a new user file in the HDFS cluster, with user file names in one-to-one correspondence with user IDs, then merge the small file into the user file by append writing, and perform step Sw5;
Sw4: Determine whether the size of the small file is greater than the remaining space of the user file. If so, jump to step Sw3; if not, append the file to the user file by append writing;
Sw5: Store the metadata of the small file in the relational database and build a retrieval index.
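The write steps Sw2 to Sw5 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: local files stand in for HDFS user files, an in-memory dict stands in for the relational database, and the names `SmallFileWriter`, `THRESHOLD`, and `USER_FILE_CAPACITY` are hypothetical. The text does not say how a file exactly equal to the threshold is handled; the sketch treats it as large.

```python
import os
import tempfile

THRESHOLD = 1 * 1024 * 1024            # small-file threshold; 1 MB is the stated default
USER_FILE_CAPACITY = 64 * 1024 * 1024  # one file block per user file; 64 MB is the stated default


class SmallFileWriter:
    """Toy model of steps Sw2-Sw5; local files stand in for HDFS user files."""

    def __init__(self, root=None, threshold=THRESHOLD, capacity=USER_FILE_CAPACITY):
        self.root = root or tempfile.mkdtemp()
        self.threshold = threshold
        self.capacity = capacity
        self.current = {}  # user_id -> sequence number of the active user file
        self.index = {}    # (user_id, name) -> (user_file_path, offset, length)

    def _user_file(self, user_id):
        # The user-file name is derived from the user ID (Sw3).
        return os.path.join(self.root, f"{user_id}_{self.current[user_id]}.dat")

    def write(self, user_id, name, data):
        # Sw2: compare the upload size with the threshold.
        if len(data) >= self.threshold:
            return "large"  # would be handed to the general file module
        if user_id not in self.current:
            # Sw3: no user file exists yet, so start the first one.
            self.current[user_id] = 0
        else:
            path = self._user_file(user_id)
            used = os.path.getsize(path) if os.path.exists(path) else 0
            # Sw4: start a new user file when the remaining space is too small.
            if len(data) > self.capacity - used:
                self.current[user_id] += 1
        path = self._user_file(user_id)
        offset = os.path.getsize(path) if os.path.exists(path) else 0
        with open(path, "ab") as f:  # append-mode write
            f.write(data)
        # Sw5: record where the small file lives inside the user file.
        self.index[(user_id, name)] = (path, offset, len(data))
        return "small"
```

With a tiny threshold and capacity, for example `SmallFileWriter(threshold=10, capacity=16)`, three writes of 5, 6, and 7 bytes for the same user place the first two files at offsets 0 and 5 of one user file and roll the third over into a new user file at offset 0.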
(2) Reading a file
Sr1: Send a file read request;
Sr2: Obtain the size of the file and compare it with the set threshold. If the size is less than the threshold, jump to Sr4; if the file is larger than the threshold, perform step Sr3;
Sr3: If the file is larger than the set threshold and smaller than the size of one file block, forward the request to a data node, obtain the information about the file block through a function, and transfer the data to the client directly from the data node;
Sr4: Read the offset and size of the file, locate the position of the small file within the user file through a function, and read the small file as a stream according to the metadata.
Further, in step Sw2, if the size of the file to be uploaded is greater than the set threshold, the file is treated as a large file and is uploaded directly through the general file module.
In step Sw5, the metadata of small files that are related to each other is stored under the same directory.
In step Sr3, when the file is larger than the file block size, the file is read directly.
The metadata includes the file name, size, position, attributes, creation time, modification time, offset, and length.
The information about the file block includes the name and port number of the data node and the offset of the file within the block.
The user file is a file of one file block in size.
The file block size defaults to 64 MB.
The threshold defaults to 1 MB.
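The three read paths described in steps Sr2 to Sr4, together with the stated defaults, can be summarized in a small routing function. This is a sketch only: the function name `route_read` and the returned labels are hypothetical, and the handling of sizes exactly equal to the threshold or the block size is an assumption, since the text does not specify it.

```python
THRESHOLD = 1 * 1024 * 1024     # 1 MB default threshold from the text
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default file block size from the text


def route_read(size, threshold=THRESHOLD, block_size=BLOCK_SIZE):
    """Return a label for the read path taken by a file of `size` bytes."""
    if size < threshold:
        # Sr4: look up offset and length, then seek inside the user file.
        return "small_file_seek"
    if size < block_size:
        # Sr3: forward the request and stream from the data node.
        return "datanode_direct"
    # Larger than one block: read the file directly from the cluster.
    return "hdfs_direct"
```

For instance, a 200 KB file takes the `small_file_seek` path, a 10 MB file takes `datanode_direct`, and a 200 MB file takes `hdfs_direct`.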
The beneficial effects of the present invention are as follows:
The invention proposes a small-file merging algorithm based on a relational database. A separate user file is created for each user, and small files are merged into it as a stream by file append operations. A relational database is also used to build an index that records the offset information of the small files, which improves small-file read efficiency. When a file smaller than a block is accessed, the request is forwarded to a data node for processing, which relieves pressure on the master server and improves file transfer efficiency. Because small files are merged into a single stored file, NameNode memory usage does not grow with the number of small files, which reduces NameNode memory consumption and improves the performance of the whole system. Merging small files removes the restrictions on file format, and a user's small files can be appended to the user file at any time, which avoids the file security problem caused by first caching files on a server and uploading them only after they reach a certain size. Finally, building index information for the small files effectively improves file retrieval efficiency.
In addition, the design principle of the present invention is reliable and its structure is simple, so it has very broad application prospects.
It can be seen that, compared with the prior art, the present invention has prominent substantive features and represents notable progress, and the beneficial effects of its implementation are evident.
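The reduction in NameNode entries can be illustrated with simple arithmetic. The sketch below is not from the patent; it assumes a single user, uniformly sized small files, and that a small file is never split across two user files. The function name `user_files_needed` is hypothetical.

```python
def user_files_needed(num_small_files, avg_size, block_size=64 * 1024 * 1024):
    """Estimate how many user files hold `num_small_files` files of `avg_size` bytes."""
    if not 0 < avg_size <= block_size:
        raise ValueError("avg_size must be between 1 byte and one block")
    per_user_file = block_size // avg_size        # whole small files per user file
    return -(-num_small_files // per_user_file)   # ceiling division
```

Under these assumptions, one million 100 KB files fit into `user_files_needed(1_000_000, 100 * 1024)` = 1527 user files of 64 MB each, so the NameNode tracks 1527 files rather than one million.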
Brief description of the drawings
Fig. 1 is a flowchart of the HDFS-based small-file writing method.
Fig. 2 is a flowchart of the HDFS-based small-file reading method.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and a specific embodiment. The following embodiment explains the invention, and the invention is not limited to it.
This embodiment provides a method for writing and reading small files based on HDFS. As shown in Fig. 1, when a user writes a file, an upload request is first sent to the NameNode and the file size is then checked. A large file is uploaded directly to the HDFS cluster through the general file module; a small file is handed to the small-file processing module. If the file size is less than the set threshold of 1 MB, the file is appended, by append writing, to that user's user file in the HDFS cluster. After the write completes, the size and offset of the file are recorded in the relational database and an index is built; the metadata of related small files is stored under the same directory. If the small file is larger than the remaining space of the user file, the HDFS cluster creates a new user file, with user file names in one-to-one correspondence with user IDs, and the small file is merged into the new user file by append writing. By default the HDFS file system does not support append writes, so it must be configured: edit the hdfs-site.xml file on the NameNode server and set the value of dfs.support.append to true.
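The metadata record and index described in this embodiment might look like the following sketch. It uses SQLite purely as a stand-in for the unspecified relational database; the table name `small_file`, the column names, and the helper functions are all hypothetical, and only a subset of the metadata fields listed in the text is modeled.

```python
import sqlite3


def open_index(path=":memory:"):
    """Create the metadata table and a secondary index if they do not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS small_file (
               user_id     TEXT NOT NULL,
               name        TEXT NOT NULL,
               user_file   TEXT NOT NULL,     -- which user file holds the data
               byte_offset INTEGER NOT NULL,  -- start position inside the user file
               byte_length INTEGER NOT NULL,  -- size of the small file
               created_at  TEXT,
               modified_at TEXT,
               PRIMARY KEY (user_id, name)
           )"""
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_small_file_user_file "
        "ON small_file (user_file)"
    )
    return conn


def record(conn, user_id, name, user_file, offset, length, now):
    """Store the metadata of one small file after it has been appended."""
    conn.execute(
        "INSERT OR REPLACE INTO small_file VALUES (?, ?, ?, ?, ?, ?, ?)",
        (user_id, name, user_file, offset, length, now, now),
    )
    conn.commit()


def locate(conn, user_id, name):
    """Return (user_file, offset, length) for a small file, or None if unknown."""
    return conn.execute(
        "SELECT user_file, byte_offset, byte_length FROM small_file "
        "WHERE user_id = ? AND name = ?",
        (user_id, name),
    ).fetchone()
```

The primary key on `(user_id, name)` gives a direct lookup of the offset and length, which is what lets the read path avoid scanning the merged user file.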
As shown in Fig. 2, when a user sends a file read request to the NameNode, the file size is checked first. When the file is smaller than the set threshold of 1 MB, the small-file processing module reads the file's offset and size from the relational database, locates the position of the small file within the user file with the seek() function, and finally reads the small file from the HDFS cluster as a stream according to its length. When the file is larger than the set threshold of 1 MB, a further check is made: if the file is larger than the file block size (64 MB by default), it is read directly from the HDFS cluster. If the file is smaller than one block, the small-file processing module obtains the information about the file block through the getFileBlockLocation() function, including the name and port number of the data node and the offset of the file within the block, and the data is transferred to the client directly from the data node, which improves file transfer efficiency and relieves pressure on the NameNode.
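The small-file read path (seek to the recorded offset, then stream `length` bytes) can be sketched as below. Any seekable binary stream stands in for the HDFS input stream here; the function name `read_small_file` and the `chunk_size` parameter are hypothetical and not part of the described method.

```python
import io


def read_small_file(user_file, offset, length, chunk_size=64 * 1024):
    """Yield the bytes stored at [offset, offset + length) in chunks."""
    user_file.seek(offset)  # position at the start of the small file
    remaining = length
    while remaining > 0:
        chunk = user_file.read(min(chunk_size, remaining))
        if not chunk:       # user file ended earlier than the metadata claims
            break
        remaining -= len(chunk)
        yield chunk


# Usage: two small files, b"hello" and b"world!", merged into one user file.
merged = io.BytesIO(b"helloworld!")
second = b"".join(read_small_file(merged, offset=5, length=6))
```

Streaming in bounded chunks keeps memory use independent of the small file's size, and stopping after `length` bytes prevents the reader from running into the next file merged into the same user file.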
The above discloses only the preferred embodiment of the present invention, but the invention is not limited to it. Any non-inventive variation that a person skilled in the art can conceive of, and any improvements and refinements made without departing from the principles of the invention, shall fall within the scope of protection of the present invention.
Claims (9)
1. A method for writing and reading small files based on HDFS, characterized in that it comprises the following two parts:
(1) Writing a file
Sw1: Send a file upload request;
Sw2: Obtain the size of the file to be uploaded and compare it with the set threshold. If the size is less than the threshold, the file is treated as a small file; then determine whether a user file exists. If it does not exist, perform step Sw3; if it exists, jump to step Sw4;
Sw3: Create a new user file in the HDFS cluster, with user file names in one-to-one correspondence with user IDs, then append the small file to the user file by append writing, and jump to step Sw5;
Sw4: Determine whether the size of the small file is greater than the remaining space of the user file. If so, jump to step Sw3; if not, append the small file to the user file by append writing;
Sw5: Store the metadata of the small file in the relational database and build a retrieval index;
(2) Reading a file
Sr1: Send a file read request;
Sr2: Obtain the size of the file and compare it with the set threshold. If the size is less than the threshold, jump to Sr4; if the file is larger than the threshold, perform step Sr3;
Sr3: If the file is larger than the set threshold and smaller than the size of one file block, forward the request to a data node, obtain the information about the file block through a function, and transfer the data to the client directly from the data node;
Sr4: Read the offset and size of the file, locate the position of the small file within the user file through a function, and read the small file as a stream according to the metadata.
2. The method for writing and reading small files based on HDFS according to claim 1, characterized in that, in step Sw2, if the size of the file to be uploaded is greater than the set threshold, the file is treated as a large file and is uploaded directly through the general file module.
3. The method for writing and reading small files based on HDFS according to claim 1, characterized in that, in step Sw5, the metadata of small files that are related to each other is stored under the same directory.
4. The method for writing and reading small files based on HDFS according to claim 1, characterized in that, in step Sr3, when the file is larger than the file block size, the file is read directly.
5. The method for writing and reading small files based on HDFS according to claim 1, characterized in that the metadata includes the file name, size, position, attributes, creation time, modification time, offset, and length.
6. The method for writing and reading small files based on HDFS according to claim 1, characterized in that the information about the file block includes the name and port number of the data node and the offset of the file within the block.
7. The method for writing and reading small files based on HDFS according to claim 1, characterized in that the user file is a file of one file block in size.
8. The method for writing and reading small files based on HDFS according to claim 7, characterized in that the file block size defaults to 64 MB.
9. The method for writing and reading small files based on HDFS according to claim 1, characterized in that the threshold defaults to 1 MB.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710100365.5A CN106909651A (en) | 2017-02-23 | 2017-02-23 | Method for writing and reading small files based on HDFS |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710100365.5A CN106909651A (en) | 2017-02-23 | 2017-02-23 | Method for writing and reading small files based on HDFS |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106909651A true CN106909651A (en) | 2017-06-30 |
Family
ID=59209163
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710100365.5A (Pending), CN106909651A (en) | Method for writing and reading small files based on HDFS | 2017-02-23 | 2017-02-23 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106909651A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107506447A (en) * | 2017-08-25 | 2017-12-22 | 郑州云海信息技术有限公司 | Small file read/write method and system based on a local file system |
CN107729432A (en) * | 2017-09-29 | 2018-02-23 | 浪潮软件股份有限公司 | Distributed small file storage and reading method, device, and access system |
CN108234594A (en) * | 2017-11-28 | 2018-06-29 | 北京市商汤科技开发有限公司 | File storage method and device, electronic equipment, program and medium |
CN108595567A (en) * | 2018-04-13 | 2018-09-28 | 郑州云海信息技术有限公司 | Small file merging method, device, equipment and readable storage medium |
CN108717457A (en) * | 2018-05-23 | 2018-10-30 | 苏州易康萌思网络科技有限公司 | E-commerce platform big data processing method and system |
CN108932287A (en) * | 2018-05-22 | 2018-12-04 | 广东技术师范学院 | Massive small file writing method based on Hadoop |
CN110069451A (en) * | 2019-03-28 | 2019-07-30 | 浪潮卓数大数据产业发展有限公司 | Method and device for storing small files in HDFS |
CN110413588A (en) * | 2019-07-30 | 2019-11-05 | 中国工商银行股份有限公司 | Distributed object storage method, device, computer equipment and storage medium |
CN110515920A (en) * | 2019-08-30 | 2019-11-29 | 北京浪潮数据技术有限公司 | Massive small file access method and system based on Hadoop |
CN113905037A (en) * | 2021-06-18 | 2022-01-07 | 武汉理工数字传播工程有限公司 | File transmission management method, device, equipment and storage medium |
CN114968939A (en) * | 2022-05-31 | 2022-08-30 | 济南浪潮数据技术有限公司 | File merging method and device and computer readable storage medium |
CN116756093A (en) * | 2023-08-17 | 2023-09-15 | 天津神舟通用数据技术有限公司 | Large object storage and query method, device, equipment and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102332027A (en) * | 2011-10-15 | 2012-01-25 | 西安交通大学 | Mass non-independent small file associated storage method based on Hadoop |
CN103577123A (en) * | 2013-11-12 | 2014-02-12 | 河海大学 | Small file optimization storage method based on HDFS |
CN103856567A (en) * | 2014-03-26 | 2014-06-11 | 西安电子科技大学 | Small file storage method based on Hadoop distributed file system |
US20150379024A1 (en) * | 2014-06-27 | 2015-12-31 | International Business Machines Corporation | File storage processing in hdfs |
CN105631010A (en) * | 2015-12-29 | 2016-06-01 | 成都康赛信息技术有限公司 | Optimization method based on HDFS small file storage |
Non-Patent Citations (2)
Title |
---|
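Zhang Hai et al.: "Optimization strategy for small file storage and reading based on HDFS" (基于HDFS的小文件存储与读取优化策略), Computer Systems & Applications (《计算机系统应用》) * |
Zhang Hai: "Research and optimization of distributed storage technology based on HDFS" (基于HDFS分布式存储技术研究与优化), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) * |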
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107506447A (en) * | 2017-08-25 | 2017-12-22 | 郑州云海信息技术有限公司 | Small file read/write method and system based on a local file system |
CN107729432A (en) * | 2017-09-29 | 2018-02-23 | 浪潮软件股份有限公司 | Distributed small file storage and reading method, device, and access system |
CN108234594A (en) * | 2017-11-28 | 2018-06-29 | 北京市商汤科技开发有限公司 | File storage method and device, electronic equipment, program and medium |
CN108595567A (en) * | 2018-04-13 | 2018-09-28 | 郑州云海信息技术有限公司 | Small file merging method, device, equipment and readable storage medium |
CN108932287B (en) * | 2018-05-22 | 2019-11-29 | 广东技术师范大学 | Massive small file writing method based on Hadoop |
CN108932287A (en) * | 2018-05-22 | 2018-12-04 | 广东技术师范学院 | Massive small file writing method based on Hadoop |
CN108717457A (en) * | 2018-05-23 | 2018-10-30 | 苏州易康萌思网络科技有限公司 | E-commerce platform big data processing method and system |
CN110069451A (en) * | 2019-03-28 | 2019-07-30 | 浪潮卓数大数据产业发展有限公司 | Method and device for storing small files in HDFS |
CN110413588A (en) * | 2019-07-30 | 2019-11-05 | 中国工商银行股份有限公司 | Distributed object storage method, device, computer equipment and storage medium |
CN110515920A (en) * | 2019-08-30 | 2019-11-29 | 北京浪潮数据技术有限公司 | Massive small file access method and system based on Hadoop |
CN113905037A (en) * | 2021-06-18 | 2022-01-07 | 武汉理工数字传播工程有限公司 | File transmission management method, device, equipment and storage medium |
CN114968939A (en) * | 2022-05-31 | 2022-08-30 | 济南浪潮数据技术有限公司 | File merging method and device and computer readable storage medium |
CN116756093A (en) * | 2023-08-17 | 2023-09-15 | 天津神舟通用数据技术有限公司 | Large object storage and query method, device, equipment and medium |
CN116756093B (en) * | 2023-08-17 | 2023-11-03 | 天津神舟通用数据技术有限公司 | Large object storage and query method, device, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106909651A (en) | A kind of method for being write based on HDFS small documents and being read | |
US12099467B2 (en) | Snapshot metadata arrangement for efficient cloud integrated data management | |
US11010300B2 (en) | Optimized record lookups | |
US9916176B2 (en) | Method and apparatus of accessing data of virtual machine | |
US10346354B2 (en) | Reducing stable data eviction with synthetic baseline snapshot and eviction state refresh | |
CN104731921B (en) | Storage and processing method of the Hadoop distributed file systems for log type small documents | |
US20170262186A1 (en) | Reconstructing In-Memory Indices in a Distributed Data Storage System | |
AU2013403132B2 (en) | Data storage method, data storage apparatus, and storage device | |
US11093387B1 (en) | Garbage collection based on transmission object models | |
US9904480B1 (en) | Multiplexing streams without changing the number of streams of a deduplicating storage system | |
US9405643B2 (en) | Multi-level lookup architecture to facilitate failure recovery | |
JP2010079886A (en) | Scalable secondary storage system and method | |
US9619322B2 (en) | Erasure-coding extents in an append-only storage system | |
KR20090063733A (en) | Method recovering data server at the applying multiple reproduce dispersion file system and metadata storage and save method thereof | |
CN104965835B (en) | A kind of file read/write method and device of distributed file system | |
US20200034451A1 (en) | Data deduplication for elastic cloud storage devices | |
CN105516313A (en) | Distributed storage system used for big data | |
CN110851407A (en) | Data distributed storage system and method | |
Cheng et al. | Optimizing small file storage process of the HDFS which based on the indexing mechanism | |
US10789002B1 (en) | Hybrid data deduplication for elastic cloud storage devices | |
Zhang et al. | Hierarchical data deduplication technology based on bloom filter array | |
Long et al. | A fast deduplication scheme for stored data in distributed storage systems | |
US20240070135A1 (en) | Hash engine for conducting point queries | |
Liu et al. | Storage design of marine ship inspection data based on cloud platform | |
US20160112293A1 (en) | Using an rpc framework to facilitate out-of-band data transfers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170630 |