CN106909651A - A method for writing and reading small files based on HDFS - Google Patents

A method for writing and reading small files based on HDFS

Info

Publication number
CN106909651A
Authority
CN
China
Prior art keywords
file
small files
size
hdfs
write
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710100365.5A
Other languages
Chinese (zh)
Inventor
辛永欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201710100365.5A
Publication of CN106909651A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method for writing and reading small files based on HDFS, characterized in that a user file is created for each user, each user file being distinct; small files are merged into the user's file as a stream by file append operations; and a relational database is used to build an index for the small files that records their offset information. Merging small files in this way removes the file-format limitation, and because a user's small files can be appended to the user file at any time, it avoids the file security problem that arises when files are first cached on a server until they reach a certain size and only then uploaded. Building index information for the small files effectively improves file retrieval efficiency. In addition, when a file smaller than one block is accessed, the request is forwarded to a data node for processing, which relieves pressure on the master server and improves file transfer efficiency.

Description

A method for writing and reading small files based on HDFS
Technical field
The invention belongs to the technical field of file storage, and in particular relates to a method for writing and reading small files based on HDFS.
Background
With the rapid development of the internet, the volume of data on it has grown sharply, and internet companies need to store and mine this data in order to serve users better. This gave rise to the concept of cloud computing, which is currently a hot research topic and addresses the computation and storage of big data well. Cloud storage, a derivative of cloud computing, has likewise become a focus of research in China and abroad. Among the many lines of cloud storage research, HDFS, the distributed file system of Hadoop and an open-source implementation of the Google File System, has become a standard reference model for industry work on cloud computing and cloud storage and for building cloud applications and cloud services. HDFS can be used for large-scale distributed storage and for building a cloud storage platform that is easy to scale, highly fault-tolerant, and high-performing. It also provides a set of reliable, stable interfaces that let developers build and extend according to their actual needs. Hadoop has won the favor of many large companies, and it performs well and is widely used in mass data storage and processing.
HDFS keeps data consistent effectively and suits write-once, read-many workloads. The framework can be deployed on arbitrary computers, which ensures scalability. Its backup and recovery mechanism and its monitoring of assigned tasks safeguard the reliability of distributed storage, and its streaming file reads are well suited to reading massive data. HDFS is not perfect, however: it has limitations when accessing massive numbers of small files. Several solutions to the HDFS small-file storage problem already exist:
(1) Storing small files in HBase, where merging and splitting files improves storage efficiency. The drawback of this scheme is that, as the number of files grows, HBase performs a large number of merge and split operations that consume substantial system resources and seriously affect system performance. In addition, HBase only supports simple character types; its support for other types such as pictures is poor, and users have to handle them separately.
(2) Archiving and packing small files with Hadoop Archives (HAR files), the archiving tool provided by Hadoop. Although this effectively reduces the NameNode memory consumed by large numbers of small files, a user who needs to access a file has to search a two-level index to find it, so file retrieval efficiency is low. An administrator also has to run maintenance commands to perform the archiving, which makes the approach unsuitable for building an internet-based cloud storage platform.
(3) Merging files with SequenceFile, a file format that stores binary key-value pairs. When small files are stored in a SequenceFile, the file name is usually stored as the key and the file content as the value. The biggest drawback of this approach is that the keys are unsorted, so random reads are inefficient: the whole file has to be traversed before a given file can be read. This approach also does not support file append operations, so small files must be cached on a server before merging, and their security cannot be guaranteed.
Summary of the invention
The object of the present invention is to address the above defects of the prior art by providing a method for writing and reading small files based on HDFS that solves the above technical problems.
To achieve this object, the technical solution of the invention is:
A method for writing and reading small files based on HDFS, characterized by comprising the following two parts:
(1) Writing a file
Sw1: Send a file upload request;
Sw2: Obtain the size of the file to be uploaded and compare it with a set threshold; if the size is less than the threshold, the file is considered a small file, and it is determined whether a user file exists; if not, perform step Sw3; if so, jump to step Sw4;
Sw3: Create a new user file in the HDFS cluster, with user file names in one-to-one correspondence with user IDs, then merge the small file into the user file by append writing, and perform step Sw5;
Sw4: Determine whether the size of the small file is greater than the remaining space of the user file; if so, jump to step Sw3; if not, append the file to the user file by append writing;
Sw5: Store the metadata of the small file in a relational database and build a retrieval index.
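The write steps Sw2 to Sw5 can be sketched in Python as follows. This is a minimal in-memory illustration, not the patented implementation: the `SmallFileWriter` and `UserFile` names are hypothetical, a `bytearray` stands in for a user file in the HDFS cluster, a dictionary stands in for the relational database, a file whose size equals the threshold is treated as large (the source does not specify this boundary), and a sequence suffix on the user file name is an assumed way to name rolled-over user files.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 64 * 1024 * 1024      # user file capacity: one file block (64M default in the source)
SMALL_THRESHOLD = 1 * 1024 * 1024  # small-file threshold (1M default in the source)


@dataclass
class UserFile:
    """Hypothetical stand-in for one merged user file stored in HDFS."""
    name: str
    data: bytearray = field(default_factory=bytearray)

    def remaining(self, capacity: int) -> int:
        return capacity - len(self.data)


class SmallFileWriter:
    """In-memory stand-in for the HDFS cluster plus the relational index."""

    def __init__(self, capacity: int = BLOCK_SIZE, threshold: int = SMALL_THRESHOLD):
        self.capacity = capacity
        self.threshold = threshold
        self.user_files = {}  # user_id -> list of UserFile; the last one is active
        self.index = {}       # (user_id, file_name) -> (user_file_name, offset, length)

    def _new_user_file(self, user_id: str) -> UserFile:
        files = self.user_files.setdefault(user_id, [])
        # Assumption: a sequence suffix keeps rolled-over user files distinct.
        user_file = UserFile(name=f"{user_id}_{len(files)}")
        files.append(user_file)
        return user_file

    def write(self, user_id: str, file_name: str, payload: bytes) -> str:
        # Sw2: compare the size with the threshold.
        if len(payload) >= self.threshold:
            return "large"  # would go through the general file module instead
        files = self.user_files.get(user_id)
        if not files:
            target = self._new_user_file(user_id)      # Sw3: no user file yet
        else:
            target = files[-1]
            if len(payload) > target.remaining(self.capacity):
                target = self._new_user_file(user_id)  # Sw4 -> Sw3: roll over
        offset = len(target.data)
        target.data.extend(payload)                    # append write
        # Sw5: record where the small file lives.
        self.index[(user_id, file_name)] = (target.name, offset, len(payload))
        return "small"
```

Recording the offset before the append is what lets a later read jump straight to the file without scanning the user file.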
(2) Reading a file
Sr1: Send a file read request;
Sr2: Obtain the size of the file and compare it with the set threshold; if the size is less than the threshold, jump to step Sr4; if the file is larger than the threshold, perform step Sr3;
Sr3: If the file is larger than the set threshold and smaller than the size of one file block, forward the request to a data node, obtain the relevant information of the file block through a function, and transfer the data to the client directly from the data node;
Sr4: Read the offset and size of the file, locate the position of the small file within the user file through a function, and read the small file as a stream according to the metadata.
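The size-based routing in steps Sr2 to Sr4 amounts to a three-way decision, sketched below in Python. The `ReadPath` enum and `choose_read_path` function are hypothetical names introduced for illustration, and the handling of sizes exactly equal to the threshold or the block size is an assumption, since the source only describes the strictly smaller and strictly larger cases.

```python
from enum import Enum

SMALL_THRESHOLD = 1 * 1024 * 1024  # 1M default in the source
BLOCK_SIZE = 64 * 1024 * 1024      # 64M default in the source


class ReadPath(Enum):
    SMALL_FILE_INDEX = "Sr4: look up offset and length, then seek into the user file"
    DATANODE_DIRECT = "Sr3: forward the request to the data node holding the block"
    NORMAL_READ = "file spans blocks: read it directly from HDFS"


def choose_read_path(size: int,
                     threshold: int = SMALL_THRESHOLD,
                     block_size: int = BLOCK_SIZE) -> ReadPath:
    """Pick the read path for a file of `size` bytes."""
    if size < threshold:
        return ReadPath.SMALL_FILE_INDEX
    if size < block_size:
        return ReadPath.DATANODE_DIRECT
    # Assumption: sizes equal to or above one block take the ordinary read path.
    return ReadPath.NORMAL_READ
```

For example, `choose_read_path(10 * 1024 * 1024)` selects the data-node path, because a 10M file is above the 1M threshold but fits inside one 64M block.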
Further, in step Sw2, if the size of the file to be uploaded is greater than the set threshold, the file is considered a large file and is uploaded directly through the general file module.
In step Sw5, the metadata of small files that are related to one another is stored under the same directory.
In step Sr3, when the file is larger than the file block size, the file is read directly.
The metadata includes the file name, size, location, attributes, creation time, modification time, offset, and length.
The relevant information of the file block includes the name and port number of the data node and the offset of the file within the block.
The user file is a file of one file block in size.
The file block size defaults to 64M.
The threshold defaults to 1M.
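One possible shape for the relational metadata index is sketched below with Python's built-in `sqlite3` module. The source names only the metadata fields; the table name, column names, choice of SQLite, and the composite primary key used as the retrieval index are all assumptions made for illustration. The sketch also stores the same value in `size` and `byte_length`, since the source lists both without distinguishing them.

```python
import sqlite3


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) metadata table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS small_file_meta (
            user_id      TEXT    NOT NULL,
            file_name    TEXT    NOT NULL,
            size         INTEGER NOT NULL,
            location     TEXT    NOT NULL,  -- path of the user file in HDFS
            attributes   TEXT,
            created_at   TEXT,
            modified_at  TEXT,
            byte_offset  INTEGER NOT NULL,  -- start of the small file in the user file
            byte_length  INTEGER NOT NULL,
            PRIMARY KEY (user_id, file_name)  -- doubles as the retrieval index
        )
        """
    )
    return conn


def record_small_file(conn, user_id, file_name, location,
                      byte_offset, byte_length, timestamp, attributes=""):
    """Store one small file's metadata after it has been appended."""
    conn.execute(
        "INSERT INTO small_file_meta VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (user_id, file_name, byte_length, location, attributes,
         timestamp, timestamp, byte_offset, byte_length),
    )
    conn.commit()


def locate_small_file(conn, user_id, file_name):
    """Return (location, byte_offset, byte_length), or None if unknown."""
    return conn.execute(
        "SELECT location, byte_offset, byte_length FROM small_file_meta "
        "WHERE user_id = ? AND file_name = ?",
        (user_id, file_name),
    ).fetchone()
```

A read then needs a single indexed lookup to learn which user file to open and where to seek.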
The beneficial effects of the invention are:
The invention proposes a small-file merging algorithm based on a relational database. A user file is created for each user, each user file being distinct, and small files are merged into it as a stream by file append operations. A relational database is also used to build an index that records the offsets of the small files, which improves the efficiency of reading them. When a file smaller than one block is accessed, the request is forwarded to a data node for processing, which relieves pressure on the master server and improves file transfer efficiency. Because small files are merged into a single stored file, NameNode memory usage does not grow with the number of files, which reduces NameNode memory consumption and improves the performance of the whole system. Merging small files removes the file-format limitation, and a user's small files can be appended to the user file at any time, which avoids the file security problem caused by first caching files on a server until they reach a certain size and only then uploading them. Finally, building index information for the small files effectively improves file retrieval efficiency.
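The NameNode memory argument can be made concrete with a rough estimate, sketched below in Python. The figure of about 150 bytes of NameNode heap per namespace object is a commonly quoted rule of thumb, not a number from the source, and the model is deliberately simplified: a single user, equally sized small files, one file entry plus one block entry per stored file, and function names that are hypothetical.

```python
import math

BYTES_PER_OBJECT = 150  # assumed rule of thumb, not taken from the source


def objects_unmerged(n_files: int) -> int:
    """Each small file stored on its own costs one file entry and one block entry."""
    return 2 * n_files


def objects_merged(n_files: int, file_size: int, user_file_capacity: int) -> int:
    """Small files packed into user files of one block each (no file is split)."""
    files_per_user_file = user_file_capacity // file_size
    n_user_files = math.ceil(n_files / files_per_user_file)
    return 2 * n_user_files


def estimated_heap_bytes(n_objects: int, bytes_per_object: int = BYTES_PER_OBJECT) -> int:
    """Rough NameNode heap cost for a given number of namespace objects."""
    return n_objects * bytes_per_object
```

Under this model the namespace cost grows with the number of user files rather than the number of small files, which is the effect the paragraph above describes.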
In addition, the design principle of the invention is reliable and its structure is simple, so it has very broad application prospects.
It can be seen that, compared with the prior art, the invention has prominent substantive features and represents notable progress, and the beneficial effects of implementing it are evident.
Brief description of the drawings
Fig. 1 is a flowchart of the HDFS-based small-file writing method.
Fig. 2 is a flowchart of the HDFS-based small-file reading method.
Detailed description
The invention is described in detail below with reference to the drawings and a specific embodiment. The following embodiment explains the invention, and the invention is not limited to it.
This embodiment provides a method for writing and reading small files based on HDFS. As shown in Fig. 1, when a user writes a file, a file upload request is first sent to the NameNode and the file size is then checked. A large file is uploaded directly to the HDFS cluster through the general file module; a small file is handed to the small-file processing module. If the file size is less than the set threshold of 1M, the file is appended, by append writing, to that user's user file in the HDFS cluster. After the write completes, the size and offset of the file are recorded in the relational database and an index is built; the metadata of related small files is stored under the same directory. If the small file is larger than the remaining space of the user file, the HDFS cluster creates a new user file, with user file names in one-to-one correspondence with user IDs, and the small file is merged into the newly created user file by append writing. The HDFS file system does not support append writes by default, so it has to be configured: modify the hdfs-site.xml file on the NameNode server and set the value of dfs.support.append to true.
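As an illustration of the append setting, the Python sketch below embeds a minimal hdfs-site.xml fragment and checks that dfs.support.append is enabled. The property name and value come from the text above, and the `<configuration>`/`<property>` layout is the standard Hadoop configuration format; the `append_enabled` helper and the idea of checking the file from Python are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Minimal fragment showing the property described in the embodiment.
HDFS_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
</configuration>
"""


def append_enabled(xml_text: str) -> bool:
    """Return True if dfs.support.append is set to true in the given XML."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == "dfs.support.append":
            return (prop.findtext("value") or "").strip().lower() == "true"
    return False  # property absent: treat append as not enabled
```

A deployment script could run such a check before starting the small-file processing module, so that append writes do not fail at runtime.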
As shown in Fig. 2, when a user sends a file read request to the NameNode, the file size is checked first. When the file is smaller than the set threshold of 1M, the small-file processing module reads the file's offset and size from the relational database, locates the position of the small file within the user file with the seek() function, and finally reads the small file from the HDFS cluster as a stream according to its length. When the file is larger than the 1M threshold, a further check is made: if the file is larger than the file block size (64M by default), it is read directly from the HDFS cluster. If the file is smaller than one block, the small-file processing module obtains the relevant information of the file block through the getFileBlockLocation() function, including the name and port number of the data node and the offset of the file within the block, and the data is transferred to the client directly from the data node, which improves file transfer efficiency and relieves pressure on the NameNode.
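The small-file read (seek to the recorded offset, then stream exactly the recorded length) can be sketched in Python as follows. Any seekable binary file object stands in for the user file opened from HDFS; the `read_small_file` name and the chunk size are assumptions for illustration.

```python
import io
from typing import BinaryIO, Iterator


def read_small_file(user_file: BinaryIO, offset: int, length: int,
                    chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield the bytes of one small file stored inside a merged user file."""
    user_file.seek(offset)  # jump straight to the small file's first byte
    remaining = length
    while remaining > 0:
        chunk = user_file.read(min(chunk_size, remaining))
        if not chunk:
            # The index promised more bytes than the user file contains.
            raise EOFError("user file ended before the recorded length was read")
        remaining -= len(chunk)
        yield chunk
```

For instance, `b"".join(read_small_file(io.BytesIO(b"aaaabbbbbbcc"), 4, 6))` returns only the six bytes that belong to the second stored file, without touching the rest of the user file.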
The above discloses only preferred embodiments of the invention, but the invention is not limited to them. Any non-inventive variation that a person skilled in the art can conceive, and any improvement or refinement made without departing from the principle of the invention, shall fall within the scope of protection of the invention.

Claims (9)

1. A method for writing and reading small files based on HDFS, characterized by comprising the following two parts:
(1) Writing a file
Sw1: Send a file upload request;
Sw2: Obtain the size of the file to be uploaded and compare it with a set threshold; if the size is less than the threshold, the file is considered a small file, and it is determined whether a user file exists; if not, perform step Sw3; if so, jump to step Sw4;
Sw3: Create a new user file in the HDFS cluster, with user file names in one-to-one correspondence with user IDs, then append the small file to the user file by append writing, and jump to step Sw5;
Sw4: Determine whether the size of the small file is greater than the remaining space of the user file; if so, jump to step Sw3; if not, append the small file to the user file by append writing;
Sw5: Store the metadata of the small file in a relational database and build a retrieval index;
(2) Reading a file
Sr1: Send a file read request;
Sr2: Obtain the size of the file and compare it with the set threshold; if the size is less than the threshold, jump to step Sr4; if the file is larger than the threshold, perform step Sr3;
Sr3: If the file is larger than the set threshold and smaller than the size of one file block, forward the request to a data node, obtain the relevant information of the file block through a function, and transfer the data to the client directly from the data node;
Sr4: Read the offset and size of the file, locate the position of the small file within the user file through a function, and read the small file as a stream according to the metadata.
2. The method for writing and reading small files based on HDFS according to claim 1, characterized in that, in step Sw2, if the size of the file to be uploaded is greater than the set threshold, the file is considered a large file and is uploaded directly through the general file module.
3. The method for writing and reading small files based on HDFS according to claim 1, characterized in that, in step Sw5, the metadata of small files that are related to one another is stored under the same directory.
4. The method for writing and reading small files based on HDFS according to claim 1, characterized in that, in step Sr3, when the file is larger than the file block size, the file is read directly.
5. The method for writing and reading small files based on HDFS according to claim 1, characterized in that the metadata includes the file name, size, location, attributes, creation time, modification time, offset, and length.
6. The method for writing and reading small files based on HDFS according to claim 1, characterized in that the relevant information of the file block includes the name and port number of the data node and the offset of the file within the block.
7. The method for writing and reading small files based on HDFS according to claim 1, characterized in that the user file is a file of one file block in size.
8. The method for writing and reading small files based on HDFS according to claim 7, characterized in that the file block size defaults to 64M.
9. The method for writing and reading small files based on HDFS according to claim 1, characterized in that the threshold defaults to 1M.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710100365.5A CN106909651A (en) 2017-02-23 2017-02-23 A method for writing and reading small files based on HDFS

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710100365.5A CN106909651A (en) 2017-02-23 2017-02-23 A method for writing and reading small files based on HDFS

Publications (1)

Publication Number Publication Date
CN106909651A true CN106909651A (en) 2017-06-30

Family

ID=59209163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710100365.5A Pending CN106909651A (en) 2017-02-23 2017-02-23 A method for writing and reading small files based on HDFS

Country Status (1)

Country Link
CN (1) CN106909651A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332027A (en) * 2011-10-15 2012-01-25 西安交通大学 Mass non-independent small file associated storage method based on Hadoop
CN103577123A (en) * 2013-11-12 2014-02-12 河海大学 Small file optimization storage method based on HDFS
CN103856567A (en) * 2014-03-26 2014-06-11 西安电子科技大学 Small file storage method based on Hadoop distributed file system
US20150379024A1 (en) * 2014-06-27 2015-12-31 International Business Machines Corporation File storage processing in hdfs
CN105631010A (en) * 2015-12-29 2016-06-01 成都康赛信息技术有限公司 Optimization method based on HDFS small file storage

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
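Zhang Hai et al.: "Optimization strategy for small file storage and reading based on HDFS", Computer Systems & Applications *
Zhang Hai: "Research and optimization of distributed storage technology based on HDFS", China Master's Theses Full-text Database, Information Science and Technology *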

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506447A (en) * 2017-08-25 2017-12-22 郑州云海信息技术有限公司 A kind of small documents reading/writing method and system based on local file system
CN107729432A (en) * 2017-09-29 2018-02-23 浪潮软件股份有限公司 A kind of storage of distributed small documents, read method, device and access system
CN108234594A (en) * 2017-11-28 2018-06-29 北京市商汤科技开发有限公司 File memory method and device, electronic equipment, program and medium
CN108595567A (en) * 2018-04-13 2018-09-28 郑州云海信息技术有限公司 A kind of merging method of small documents, device, equipment and readable storage medium storing program for executing
CN108932287B (en) * 2018-05-22 2019-11-29 广东技术师范大学 A kind of mass small documents wiring method based on Hadoop
CN108932287A (en) * 2018-05-22 2018-12-04 广东技术师范学院 A kind of mass small documents wiring method based on Hadoop
CN108717457A (en) * 2018-05-23 2018-10-30 苏州易康萌思网络科技有限公司 A kind of e-commerce platform big data processing method and system
CN110069451A (en) * 2019-03-28 2019-07-30 浪潮卓数大数据产业发展有限公司 A kind of method and device of HDFS storage small documents
CN110413588A (en) * 2019-07-30 2019-11-05 中国工商银行股份有限公司 Distributed objects storage method, device, computer equipment and storage medium
CN110515920A (en) * 2019-08-30 2019-11-29 北京浪潮数据技术有限公司 A kind of mass small documents access method and system based on Hadoop
CN113905037A (en) * 2021-06-18 2022-01-07 武汉理工数字传播工程有限公司 File transmission management method, device, equipment and storage medium
CN114968939A (en) * 2022-05-31 2022-08-30 济南浪潮数据技术有限公司 File merging method and device and computer readable storage medium
CN116756093A (en) * 2023-08-17 2023-09-15 天津神舟通用数据技术有限公司 Large object storage and query method, device, equipment and medium
CN116756093B (en) * 2023-08-17 2023-11-03 天津神舟通用数据技术有限公司 Large object storage and query method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170630