CN106909651A

CN106909651A - Method for writing and reading small files based on HDFS

Info

Publication number: CN106909651A
Application number: CN201710100365.5A
Authority: CN
Inventors: 辛永欣
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2017-02-23
Filing date: 2017-02-23
Publication date: 2017-06-30

Abstract

The present invention relates to a method for writing and reading small files based on HDFS, characterized in that a user file is created for each user, with each user's file distinct from the others. Small files are merged into this user file as a stream by means of file append operations, and a relational database is used to build an index for the small files that records their offset information. Merging small files effectively removes the restriction on file format. A user's small files can be appended to the user file at any time, which avoids the file-security problem that arises when files are first cached on a server and uploaded only after reaching a certain size. Building index information for the small files effectively improves file retrieval efficiency. In addition, when a file smaller than one block is accessed, the request is forwarded to a data node for processing, which relieves pressure on the master server and improves file transfer efficiency.

Description

Method for writing and reading small files based on HDFS

Technical field

The invention belongs to the field of file storage technology, and in particular relates to a method for writing and reading small files based on HDFS.

Background art

With the rapid development of the Internet, the volume of data on it has grown sharply. To serve users better, Internet companies need to store and mine this data. This gave rise to the concept of cloud computing, currently a popular research topic, which addresses the computation and storage of big data well. Cloud storage, as a derivative of cloud computing, has likewise become a focus of research at home and abroad. Among the many lines of cloud storage research, Hadoop's distributed file system HDFS, an open-source implementation of the Google File System, has become a standard reference model in industry for studying cloud computing and cloud storage and for building cloud applications and cloud services. HDFS can be used for large-scale distributed storage and for building a cloud storage platform that is easy to scale, highly fault-tolerant, and high-performing. It also provides a reliable and stable set of interfaces that developers can build on and extend according to their own needs. Hadoop has won the favor of many large companies and is widely used for mass data storage and processing, where it performs well.

The HDFS file system effectively maintains data consistency and suits write-once, read-many workloads. The framework can be deployed on arbitrary computers and provides scalability. HDFS's backup and recovery mechanisms, together with its monitoring of assigned tasks, safeguard the reliability of distributed storage, and its streaming file reads are well suited to reading massive amounts of data. HDFS is not perfect, however: it has limitations when storing and accessing massive numbers of small files. Several solutions to the HDFS small-file storage problem already exist:

(1) Storing small files in HBase, where file merging and splitting improve storage efficiency. The drawback of this scheme is that as the number of files grows, HBase performs a large number of merge and split operations that consume substantial system resources and seriously degrade system performance. In addition, HBase only supports simple character types; support for other types such as images is poor and requires separate handling by the user.

(2) Archiving small files with Hadoop Archives (HAR files), the archiving tool provided by Hadoop. Although this effectively reduces the NameNode memory consumed by large numbers of small files, locating a file requires searching a two-level index, so file retrieval efficiency is low. It also requires an administrator to run maintenance commands to perform the archiving, which makes it unsuitable for building an Internet-based cloud storage platform.

(3) Merging files with SequenceFile, a file format that stores binary key-value pairs. When SequenceFile is used to store small files, the file name is usually stored in the key and the file content in the value. The biggest drawback is that the keys are unsorted, so random reads are inefficient: the whole file must be traversed before a given file can be read. This approach also does not support file append operations, so small files must be cached on a server before merging, and their security cannot be guaranteed.

Summary of the invention

The object of the present invention is to address the defects of the prior art described above by providing a method for writing and reading small files based on HDFS that solves the above technical problems.

To achieve this object, the technical solution of the present invention is as follows:

A method for writing and reading small files based on HDFS, characterized by comprising the following two parts:

(1) Writing a file

Sw1: Send a file upload request;

Sw2: Obtain the size of the file to be uploaded and compare it with the set threshold. If the size is less than the threshold, the file is treated as a small file, and it is determined whether a user file exists: if not, perform step Sw3; if so, jump to step Sw4;

Sw3: Create a new user file in the HDFS cluster, with user file names in one-to-one correspondence with user IDs, then merge the small file into the user file by append writing, and perform step Sw5;

Sw4: Determine whether the size of the small file exceeds the remaining space of the user file. If so, jump to step Sw3; if not, append the file to the user file by append writing;

Sw5: Store the metadata of the small file in a relational database and build a retrieval index.
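The write steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the patented implementation: Python bytearrays stand in for HDFS user files, an in-memory SQLite database stands in for the relational database, and the class name `SmallFileWriter`, the table and column names, and the `<user ID>_<sequence>` naming of user files are all hypothetical. Treating a file whose size equals the threshold as a large file is also an assumption, since the text only covers the "less than" and "greater than" cases.

```python
import sqlite3

THRESHOLD = 1 * 1024 * 1024    # small-file threshold (1 MB default in the text)
BLOCK_SIZE = 64 * 1024 * 1024  # user-file capacity: one 64 MB block (default in the text)


class SmallFileWriter:
    """In-memory stand-in for the small-file write path (Sw2 to Sw5)."""

    def __init__(self, threshold=THRESHOLD, block_size=BLOCK_SIZE):
        self.threshold = threshold
        self.block_size = block_size
        self.user_files = {}  # user-file name -> bytearray (stand-in for HDFS)
        self.current = {}     # user ID -> user file currently being appended to
        self.created = {}     # user ID -> number of user files created so far
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE small_file (user_id TEXT, name TEXT, user_file TEXT, "
            "file_offset INTEGER, file_length INTEGER)"
        )
        # Sw5: index for looking up a small file by owner and name.
        self.db.execute("CREATE INDEX idx_owner_name ON small_file (user_id, name)")

    def _new_user_file(self, user_id):
        # Sw3: create a user file whose name is derived from the user ID.
        # The sequence suffix is an assumption for handling full user files.
        seq = self.created.get(user_id, 0)
        self.created[user_id] = seq + 1
        name = f"{user_id}_{seq}"
        self.user_files[name] = bytearray()
        self.current[user_id] = name
        return name

    def write(self, user_id, name, data):
        # Sw2: anything not below the threshold goes to the general upload
        # path, which is not modeled here.
        if len(data) >= self.threshold:
            return None
        target = self.current.get(user_id)
        # Sw2/Sw4: create a user file if none exists or the remaining space is too small.
        if target is None or len(data) > self.block_size - len(self.user_files[target]):
            target = self._new_user_file(user_id)
        offset = len(self.user_files[target])
        self.user_files[target] += data  # append write
        # Sw5: record where the small file lives inside the user file.
        self.db.execute(
            "INSERT INTO small_file VALUES (?, ?, ?, ?, ?)",
            (user_id, name, target, offset, len(data)),
        )
        return target, offset
```

Recording the offset and length at append time is what later lets a reader jump straight to the small file instead of scanning the merged file.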

(2) Reading a file

Sr1: Send a file read request;

Sr2: Obtain the size of the file and compare it with the set threshold. If the size is less than the threshold, jump to Sr4; if the file is larger than the threshold, perform step Sr3;

Sr3: If the file is larger than the set threshold and smaller than one file block, forward the request to a data node, obtain the relevant information about the file block through a function, and transmit the data to the client directly from the data node;

Sr4: Read the offset and size of the file, locate the position of the small file within the user file through a function, and read the small file as a stream according to the metadata.
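The size-based branching in the read steps can be summarized in a short sketch. The enum `ReadRoute` and the function `choose_read_route` are hypothetical names used only for illustration; the default sizes are the 1 MB threshold and 64 MB block stated in the text. How sizes exactly equal to the threshold or the block size are handled is not specified in the text, so the boundary behavior below is an assumption.

```python
from enum import Enum

THRESHOLD = 1 * 1024 * 1024    # small-file threshold (1 MB default in the text)
BLOCK_SIZE = 64 * 1024 * 1024  # file block size (64 MB default in the text)


class ReadRoute(Enum):
    INDEXED_SMALL_FILE = "Sr4: look up offset and length, read from the user file"
    DATANODE_DIRECT = "Sr3: forward to a data node, stream from it directly"
    DIRECT_READ = "larger than one block: read the file directly"


def choose_read_route(size, threshold=THRESHOLD, block_size=BLOCK_SIZE):
    """Pick the read path for a file of the given size in bytes."""
    if size < 0:
        raise ValueError("size must be non-negative")
    if size < threshold:
        return ReadRoute.INDEXED_SMALL_FILE
    # Assumption: a size equal to the threshold is not treated as a small file.
    if size < block_size:
        return ReadRoute.DATANODE_DIRECT
    # Assumption: a size equal to the block size takes the direct-read path.
    return ReadRoute.DIRECT_READ
```

For example, under the default sizes a 512 KB file takes the indexed small-file path, a 10 MB file is served directly from a data node, and a 128 MB file is read directly.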

Further, in step Sw2, if the size of the file to be uploaded is greater than the set threshold, the file is treated as a large file and uploaded directly through the general file module.

In step Sw5, the metadata of related small files is stored under the same directory.

In step Sr3, when the file is larger than the file block size, the file is read directly.

The metadata includes the file name, size, location, attributes, creation time, modification time, offset, and length.

The relevant information about the file block includes the name and port number of the data node and the offset of the file within the block.

The user file is a file of one file-block size.

The file block size defaults to 64 MB.

The threshold defaults to 1 MB.

The beneficial effects of the present invention are as follows:

The invention proposes a small-file merging algorithm based on a relational database. A user file is created for each user, with each user's file distinct from the others. Small files are merged into this user file as a stream by file append operations, and a relational database is used to build an index for the small files that records their offset information, which improves small-file read efficiency. When a file smaller than one block is accessed, the request is forwarded to a data node for processing, which relieves pressure on the master server and improves file transfer efficiency. Because small files are merged into a single file for storage, NameNode memory usage does not grow with the number of small files, which reduces NameNode memory consumption and improves the performance of the whole system. Merging small files effectively removes the restriction on file format. A user's small files can also be appended to the user file at any time, which avoids the file-security problem that arises when files are first cached on a server and uploaded only after reaching a certain size. Finally, building index information for the small files effectively improves file retrieval efficiency.
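A rough count illustrates the NameNode saving. The sketch below is only an estimate under stated assumptions, none of which come from the text: each HDFS file is counted as one file entry plus one block entry in the NameNode, all small files have the same size and belong to a single user, a small file is never split across two user files, and the rows kept in the relational database are not counted because they live outside the NameNode. The function names are hypothetical.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # user-file capacity: one 64 MB block (default in the text)


def namespace_objects_separate(num_files):
    """Objects tracked when every small file is its own HDFS file.

    Assumption: one file entry plus one block entry per small file.
    """
    return 2 * num_files


def namespace_objects_merged(num_files, file_size, block_size=BLOCK_SIZE):
    """Objects tracked when small files are packed into one-block user files."""
    if not 0 < file_size <= block_size:
        raise ValueError("file_size must be between 1 byte and one block")
    files_per_user_file = block_size // file_size  # whole files only, no splitting
    user_files = math.ceil(num_files / files_per_user_file)
    return 2 * user_files  # one file entry plus one block entry per user file


separate = namespace_objects_separate(1_000_000)
merged = namespace_objects_merged(1_000_000, 100 * 1024)
print(separate, merged)  # 2000000 3054
```

Under these assumptions, one million 100 KB files shrink from two million tracked objects to about three thousand, because 655 such files fit in each 64 MB user file.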

In addition, the design principle of the present invention is reliable and its structure is simple, giving it very broad application prospects.

It can thus be seen that, compared with the prior art, the present invention has prominent substantive features and represents notable progress, and the beneficial effects of its implementation are evident.

Brief description of the drawings

Fig. 1 is a flow chart of the HDFS-based small-file writing method.

Fig. 2 is a flow chart of the HDFS-based small-file reading method.

Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment. The following embodiment explains the invention, and the invention is not limited to it.

This embodiment provides a method for writing and reading small files based on HDFS. As shown in Fig. 1, when a user writes a file, a file upload request is first sent to the NameNode, and the file size is then checked. A large file is uploaded directly to the HDFS cluster through the general file module; a small file is handed to the small-file processing module. If the file size is less than the set threshold of 1 MB, the file is appended, by append writing, to that user's user file in the HDFS cluster. After the write completes, the size and offset of the file are recorded in the relational database and an index is built; the metadata of related small files is stored under the same directory. If the size of the small file exceeds the remaining space of the user file, the HDFS cluster creates a new user file, with user file names in one-to-one correspondence with user IDs, and the small file is then merged into the newly created user file by append writing. The HDFS file system does not support append writes by default, so this must be configured by editing the hdfs-site.xml file on the NameNode server and setting the value of dfs.support.append to true.
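The append setting described above would look roughly like the following fragment inside the `<configuration>` element of hdfs-site.xml. Only the property name and value come from the text; whether the setting is needed at all depends on the Hadoop release in use, so the defaults shipped with that release should be checked first.

```xml
<!-- hdfs-site.xml on the NameNode server: enable append writes -->
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
```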

As shown in Fig. 2, when a user sends a file read request to the NameNode, the file size is checked first. When the file is smaller than the set threshold of 1 MB, the small-file processing module reads the offset and size of the file from the relational database, locates the position of the small file within the user file with the seek() function, and then reads the small file from the HDFS cluster as a stream according to its length. When the file is larger than the set threshold of 1 MB, a further check is made: if the file is larger than the file block size (64 MB by default), it is read directly from the HDFS cluster; if the file is smaller than one block, the small-file processing module obtains the relevant information about the file block through the getFileBlockLocation() function, including the name and port number of the data node and the offset of the file within the block, and transmits the data to the client directly from the data node, which improves file transfer efficiency and relieves pressure on the NameNode.
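The seek-then-stream read of a small file can be illustrated with a minimal sketch. It uses Python's `io.BytesIO` as a stand-in for a seekable HDFS input stream, and the function name `read_small_file` and the `chunk_size` parameter are hypothetical. The offset and length are assumed to have been fetched from the relational database beforehand.

```python
import io


def read_small_file(user_file, offset, length, chunk_size=64 * 1024):
    """Yield the bytes of one small file from a seekable merged user file."""
    user_file.seek(offset)  # jump to where the small file starts
    remaining = length
    while remaining > 0:
        chunk = user_file.read(min(chunk_size, remaining))
        if not chunk:
            raise EOFError("user file ended before the recorded length was read")
        remaining -= len(chunk)
        yield chunk


# Three small files merged back to back: offsets 0, 5 and 11.
merged = io.BytesIO(b"hello" + b"world!" + b"123456")
data = b"".join(read_small_file(merged, offset=5, length=6, chunk_size=4))
print(data)  # b'world!'
```

Reading in bounded chunks keeps memory use flat regardless of file size, and stopping after `length` bytes prevents the read from spilling into the next small file stored in the same user file.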

The above discloses only the preferred embodiment of the present invention, and the invention is not limited to it. Any non-inventive variation that a person skilled in the art can conceive of, and any improvements and refinements made without departing from the principles of the present invention, shall fall within the scope of protection of the present invention.

Claims

1. A method for writing and reading small files based on HDFS, characterized by comprising the following two parts:

(1) Writing a file

Sw1: Send a file upload request;

Sw2: Obtain the size of the file to be uploaded and compare it with the set threshold. If the size is less than the threshold, the file is treated as a small file, and it is determined whether a user file exists: if not, perform step Sw3; if so, jump to step Sw4;

Sw3: Create a new user file in the HDFS cluster, with user file names in one-to-one correspondence with user IDs, then append the small file to the user file by append writing, and jump to step Sw5;

Sw4: Determine whether the size of the small file exceeds the remaining space of the user file. If so, jump to step Sw3; if not, append the small file to the user file by append writing;

Sw5: Store the metadata of the small file in a relational database and build a retrieval index;

(2) Reading a file

Sr1: Send a file read request;

Sr2: Obtain the size of the file and compare it with the set threshold. If the size is less than the threshold, jump to Sr4; if the file is larger than the threshold, perform step Sr3;

Sr4: Read the offset and size of the file, locate the position of the small file within the user file through a function, and read the small file as a stream according to the metadata.

2. The method for writing and reading small files based on HDFS according to claim 1, characterized in that, in step Sw2, if the size of the file to be uploaded is greater than the set threshold, the file is treated as a large file and uploaded directly through the general file module.

3. The method for writing and reading small files based on HDFS according to claim 1, characterized in that, in step Sw5, the metadata of related small files is stored under the same directory.

4. The method for writing and reading small files based on HDFS according to claim 1, characterized in that, in step Sr3, when the file is larger than the file block size, the file is read directly.

5. The method for writing and reading small files based on HDFS according to claim 1, characterized in that the metadata includes the file name, size, location, attributes, creation time, modification time, offset, and length.

6. The method for writing and reading small files based on HDFS according to claim 1, characterized in that the relevant information about the file block includes the name and port number of the data node and the offset of the file within the block.

7. The method for writing and reading small files based on HDFS according to claim 1, characterized in that the user file is a file of one file-block size.

8. The method for writing and reading small files based on HDFS according to claim 7, characterized in that the file block size defaults to 64 MB.

9. The method for writing and reading small files based on HDFS according to claim 1, characterized in that the threshold defaults to 1 MB.