CN104978336A

CN104978336A - Unstructured data storage system based on Hadoop distributed computing platform

Info

Publication number: CN104978336A
Application number: CN201410137127.8A
Authority: CN
Inventors: 罗学礼; 杨晴; 杨莉; 杜韶辉; 吴清华; 马瑞; 臧戎
Original assignee: Yunnan Electric Power Experimental Research Institute Group Co Ltd of Electric Power Research Institute; Kunming Enersun Technology Co Ltd
Current assignee: Yunnan Electric Power Experimental Research Institute Group Co Ltd of Electric Power Research Institute; Kunming Enersun Technology Co Ltd
Priority date: 2014-04-08
Filing date: 2014-04-08
Publication date: 2015-10-14

Abstract

The invention relates to the field of information technology processing, and in particular to an unstructured data storage system based on the Hadoop distributed computing platform. The system operates in the following steps: S1) a client calls the create() function on an HDFS (Hadoop Distributed File System) DistributedFileSystem object to create a new file in the file system namespace, the new file not yet having any data blocks associated with it; S2) the namenode checks that the file being created does not already exist and that the client has permission to create it; if the checks pass, a new file record is created, and if they fail, file creation fails and an exception is thrown; and S3) when the client writes unstructured data to the newly created file, the data is split into packets that are written to an internal queue; the HDFS DataStreamer processes this data queue and, based on the list of datanodes, asks the namenode to allocate suitable new blocks to store the data replicas. The cost of the expensive storage equipment otherwise required for data storage is greatly reduced, and HDFS provides a sound disaster-tolerance mechanism during data storage.

Description

Unstructured data storage system based on the Hadoop distributed computing platform

Technical field

The present invention relates to the field of information technology processing, and in particular to an unstructured data storage system based on the Hadoop distributed computing platform.

Background technology

In unstructured data storage, the main concern is the storage of large volumes of data. Although existing commercial platforms can also store unstructured data, their problems lie mainly in system scalability and construction cost. The I/O bottleneck that arises when storing huge volumes of unstructured data, together with the high price of servers, makes it necessary to look for another approach.

Summary of the invention

The object of the present invention is to solve the above problems by providing an unstructured data storage system based on the Hadoop distributed computing platform. The system can use ordinary PCs as data nodes, which greatly reduces the cost of the expensive storage equipment otherwise required for data storage, and HDFS provides a sound disaster-tolerance mechanism during data storage.

To achieve the above object, the invention provides an unstructured data storage system based on the Hadoop distributed computing platform, comprising the following steps:

S1: the client calls the create() function on an HDFS DistributedFileSystem object to create a new file in the file system namespace; this new file does not yet have any data blocks associated with it;

S2: the namenode checks that the file being created does not already exist and that the client has permission to create it; if the checks pass, a new file record is created; if they fail, file creation fails and an exception is thrown;

S3: when the client writes unstructured data to the newly created file, the data is split into packets that are written to an internal queue; the HDFS DataStreamer processes this data queue and, based on the list of datanodes, asks the namenode to allocate suitable new blocks to store the data replicas.
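Steps S1 to S3 can be illustrated with a small in-memory simulation. This is only a sketch and does not use the Hadoop API: ToyNameNode, split_into_packets, stream, the packet size and the packets-per-block limit are all hypothetical names and values chosen for illustration.

```python
from collections import deque


class FileAlreadyExistsError(Exception):
    """Raised when the requested path is already in the namespace."""


class ToyNameNode:
    """Hypothetical in-memory stand-in for the namenode checks in S1/S2."""

    def __init__(self, writers):
        self.namespace = {}          # path -> list of block ids
        self.writers = set(writers)  # users allowed to create files
        self.next_block_id = 0

    def create(self, path, user):
        # S2: the file must not exist and the client must have permission.
        if path in self.namespace:
            raise FileAlreadyExistsError(path)
        if user not in self.writers:
            raise PermissionError(user)
        # S1: the new file record starts with no data blocks.
        self.namespace[path] = []

    def allocate_block(self, path):
        # Hand out a fresh block id and record it against the file.
        block_id = self.next_block_id
        self.next_block_id += 1
        self.namespace[path].append(block_id)
        return block_id


def split_into_packets(data, packet_size):
    """S3: cut the payload into packets held in an internal queue."""
    return deque(data[i:i + packet_size]
                 for i in range(0, len(data), packet_size))


def stream(namenode, path, queue, packets_per_block):
    """Drain the queue, requesting a new block whenever one fills up."""
    blocks = {}
    current = None
    while queue:
        if current is None or len(blocks[current]) == packets_per_block:
            current = namenode.allocate_block(path)
            blocks[current] = []
        blocks[current].append(queue.popleft())
    return blocks


namenode = ToyNameNode(writers={"alice"})
namenode.create("/data/report.bin", "alice")
packets = split_into_packets(b"x" * 10, packet_size=4)
written = stream(namenode, "/data/report.bin", packets, packets_per_block=2)
```

Calling namenode.create() a second time with the same path raises FileAlreadyExistsError, which mirrors the failure branch of S2.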

Further, the file created in step S3 is stored as a sequence of blocks; within a given file, all blocks except the last are the same size.
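The block layout described above can be sketched in a few lines. The function name block_sizes and the 128 MiB block size are assumptions used only for this example; the source states only that the block size is configurable.

```python
def block_sizes(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes occupies.

    Every block is block_size bytes except possibly the last one,
    which holds whatever remains.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


MIB = 1024 * 1024
sizes = block_sizes(300 * MIB, 128 * MIB)  # two full blocks plus a 44 MiB tail
```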

Further, the blocks of the file are replicated for fault tolerance. Both the block size and the replication factor of the file are configurable, and a MapReduce program can specify the number of replicas of a file. The replication factor can be specified when the file is created and can also be changed afterwards.

Further, the namenode makes all decisions regarding block replication, and it periodically receives heartbeats and block reports from the datanodes in the cluster.
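As a rough sketch of how heartbeats and block reports could feed replication decisions, the following code treats a datanode as dead once its heartbeat is too old and then counts how many live replicas each block still has. The function names, the timeout value and the data layout are assumptions for illustration, not part of the source.

```python
def live_nodes(last_heartbeat, now, timeout):
    """Datanodes whose most recent heartbeat is no older than timeout."""
    return {node for node, seen in last_heartbeat.items()
            if now - seen <= timeout}


def under_replicated(block_reports, live, target):
    """Map each block with fewer than target live replicas to its shortfall.

    block_reports: datanode -> set of block ids it reported holding.
    """
    counts = {}
    for node, blocks in block_reports.items():
        for block in blocks:
            counts.setdefault(block, 0)
            if node in live:
                counts[block] += 1
    return {block: target - count
            for block, count in counts.items() if count < target}


last_heartbeat = {"dn1": 100, "dn2": 95, "dn3": 40}   # timestamps in seconds
block_reports = {
    "dn1": {"blk_1", "blk_2"},
    "dn2": {"blk_1"},
    "dn3": {"blk_2", "blk_3"},
}
live = live_nodes(last_heartbeat, now=100, timeout=30)
missing = under_replicated(block_reports, live, target=2)
```

Here dn3 has missed its heartbeat window, so the replicas it reported no longer count and the blocks it held need additional copies.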

Further, the namenode places the first replica on the node running the client; the second replica is placed on a node in a different, randomly selected rack; the third replica is placed on the same rack as the second, but on another randomly selected node; any further replicas are placed on randomly selected nodes in the cluster.
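The placement rule above can be sketched as follows. The function place_replicas and the rack/node names are hypothetical, and the sketch assumes that the cluster has at least two racks, that the rack chosen for the second replica holds at least two nodes, and that enough nodes exist for the requested replication factor.

```python
import random


def place_replicas(topology, client_node, replication, rng):
    """Choose datanodes for one block.

    topology: rack name -> list of node names (assumed layout).
    """
    rack_of = {node: rack
               for rack, nodes in topology.items() for node in nodes}
    chosen = [client_node]                    # 1st: node running the client
    if replication >= 2:
        other_racks = [r for r in topology if r != rack_of[client_node]]
        second_rack = rng.choice(other_racks)
        second = rng.choice(topology[second_rack])
        chosen.append(second)                 # 2nd: random node, other rack
    if replication >= 3:
        same_rack = [n for n in topology[second_rack] if n != second]
        chosen.append(rng.choice(same_rack))  # 3rd: same rack as the 2nd
    while len(chosen) < replication:          # rest: random unused nodes
        remaining = [n for n in rack_of if n not in chosen]
        chosen.append(rng.choice(remaining))
    return chosen[:replication]


topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
    "rack3": ["n5", "n6"],
}
replicas = place_replicas(topology, "n1", 4, random.Random(0))
```

Keeping the second and third replicas on one remote rack limits cross-rack traffic while still surviving the loss of the client's rack.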

Further, a checksum is computed when unstructured data first enters the system and computed again whenever the data is transmitted over an unreliable channel, so that corruption can be detected. If the newly computed checksum does not match the original checksum, the unstructured data is considered corrupt.
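A minimal sketch of this check, using CRC-32 from Python's standard zlib module. The source does not name the checksum algorithm or the chunk size, so both CRC-32 and the 512-byte chunk are assumptions made for illustration, as are the function names.

```python
import zlib


def chunk_checksums(data, bytes_per_checksum=512):
    """Compute one CRC-32 value per fixed-size chunk of data."""
    return [zlib.crc32(data[i:i + bytes_per_checksum])
            for i in range(0, len(data), bytes_per_checksum)]


def first_corrupt_chunk(data, expected, bytes_per_checksum=512):
    """Return the index of the first chunk whose checksum differs, or None."""
    actual = chunk_checksums(data, bytes_per_checksum)
    for index, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            return index
    return None


original = bytes(range(256)) * 8      # 2048 bytes, i.e. four 512-byte chunks
stored = chunk_checksums(original)    # computed when data enters the system

damaged = bytearray(original)
damaged[1000] ^= 0xFF                 # simulate corruption in transit
bad_chunk = first_corrupt_chunk(bytes(damaged), stored)
```

Checksumming per chunk rather than per file narrows a mismatch down to a small region, so only that region needs to be re-fetched or repaired.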

Further, the client also verifies checksums when reading data from a datanode, comparing them with the checksums stored on that datanode. Each datanode keeps a persistent log of checksum verifications, so it knows when each of its data blocks was last verified. After the client has verified the data, it notifies the datanode, which then updates its log.

Further, if the client detects an error while reading a data block, it reports the corrupt block and the datanode it was trying to read from to the namenode. The namenode marks that block as corrupt and has the damaged copy replaced by a new replica on other nodes, while the data is read from the other replicas in the meantime.
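The read path described above can be sketched as a loop over replicas that reports each bad copy and falls back to the next one. CorruptionTracker, read_block and the datanode names are hypothetical, and CRC-32 is again an assumed checksum; the repair step simply overwrites the bad copy in place to keep the example short.

```python
import zlib


class CorruptBlockError(Exception):
    """Raised when no replica of a block passes verification."""


class CorruptionTracker:
    """Hypothetical stand-in for the namenode's record of bad replicas."""

    def __init__(self):
        self.corrupt = set()

    def report(self, block_id, datanode):
        self.corrupt.add((block_id, datanode))

    def repair(self, block_id, replicas, healthy_payload):
        # Replace every replica marked corrupt with a healthy copy.
        for reported_block, datanode in sorted(self.corrupt):
            if reported_block == block_id:
                replicas[datanode] = healthy_payload
        self.corrupt = {entry for entry in self.corrupt
                        if entry[0] != block_id}


def read_block(block_id, replicas, expected_crc, report_bad):
    """Return the first replica that verifies, reporting any bad ones.

    replicas: datanode name -> payload bytes.
    """
    for datanode, payload in replicas.items():
        if zlib.crc32(payload) == expected_crc:
            return payload
        report_bad(block_id, datanode)
    raise CorruptBlockError(block_id)


good = b"sensor log"
crc = zlib.crc32(good)
replicas = {"dn1": b"sensor l0g", "dn2": good, "dn3": good}
tracker = CorruptionTracker()

data = read_block("blk_1", replicas, crc, tracker.report)
reported = set(tracker.corrupt)
tracker.repair("blk_1", replicas, data)
```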

Further, the HDFS namespace is stored on the namenode, which uses a transaction log called the edit log (EditLog) to persistently record every change to the file system metadata.
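The idea of an edit log can be sketched as an append-only list of metadata operations that is replayed to rebuild the namespace. The EditLog class here, the JSON record format and the operation names are assumptions for illustration and do not reflect the actual HDFS on-disk format.

```python
import json


class EditLog:
    """Hypothetical append-only log of namespace changes."""

    def __init__(self):
        self.records = []

    def append(self, op, **fields):
        # Each metadata change becomes one serialized record.
        self.records.append(json.dumps({"op": op, **fields}))


def replay(records):
    """Rebuild the namespace by applying the logged operations in order."""
    namespace = {}
    for line in records:
        entry = json.loads(line)
        if entry["op"] == "create":
            namespace[entry["path"]] = {"replication": entry["replication"]}
        elif entry["op"] == "set_replication":
            namespace[entry["path"]]["replication"] = entry["replication"]
        elif entry["op"] == "delete":
            namespace.pop(entry["path"], None)
    return namespace


log = EditLog()
log.append("create", path="/a", replication=3)
log.append("set_replication", path="/a", replication=2)
log.append("create", path="/b", replication=3)
log.append("delete", path="/b")

namespace = replay(log.records)
```

The set_replication record also illustrates changing a file's replication factor after creation: the change is logged like any other metadata update.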

The present invention has the following beneficial effects: HDFS, the distributed file system of Hadoop, addresses both the I/O bottleneck of commercial platforms and the high cost of their servers. The advantages of Hadoop are reflected in the following aspects:

1) Hadoop runs on low-end servers or even ordinary computers. Compared with the very high cost of commercial platforms, its cost is much lower, so that almost anyone can use it, including small and micro enterprises with limited IT budgets;

2) HDFS is tightly integrated with Map/Reduce and is the storage foundation of Hadoop distributed computing. It has its own clear design goals: to support very large data files, up to the terabyte scale, that are read mainly sequentially, with high throughput for file writes and reads as the target. Storing unstructured documents in the HDFS distributed file system improves the file storage speed of the system;

3) The data recovery capability of HDFS also ensures the safety and reliability of the system. HDFS assumes that computing elements and storage will fail, so it maintains multiple working copies of the data and ensures that processing can be redistributed away from failed nodes;

4) Storage nodes can be hot-swapped, and unstructured documents can be stored on ordinary PCs. This not only increases the flexibility with which the system can be expanded, but also greatly reduces an enterprise's investment in hardware.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of the steps of the present invention.

Embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Referring to Fig. 1, the invention provides an unstructured data storage system based on the Hadoop distributed computing platform, comprising the following steps:

5) Hadoop runs on low-end servers or even ordinary computers. Compared with the very high cost of commercial platforms, its cost is much lower, so that almost anyone can use it, including small and micro enterprises with limited IT budgets;

6) HDFS is tightly integrated with Map/Reduce and is the storage foundation of Hadoop distributed computing. It has its own clear design goals: to support very large data files, up to the terabyte scale, that are read mainly sequentially, with high throughput for file writes and reads as the target. Storing unstructured documents in the HDFS distributed file system improves the file storage speed of the system;

7) The data recovery capability of HDFS also ensures the safety and reliability of the system. HDFS assumes that computing elements and storage will fail, so it maintains multiple working copies of the data and ensures that processing can be redistributed away from failed nodes;

8) Storage nodes can be hot-swapped, and unstructured documents can be stored on ordinary PCs. This not only increases the flexibility with which the system can be expanded, but also greatly reduces an enterprise's investment in hardware.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

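1. An unstructured data storage system based on the Hadoop distributed computing platform, characterized by comprising the following steps:

S1: the client calls the create() function on an HDFS DistributedFileSystem object to create a new file in the file system namespace; this new file does not yet have any data blocks associated with it;

S2: the namenode checks that the file being created does not already exist and that the client has permission to create it; if the checks pass, a new file record is created; if they fail, file creation fails and an exception is thrown;

S3: when the client writes unstructured data to the newly created file, the data is split into packets that are written to an internal queue; the HDFS DataStreamer processes this data queue and, based on the list of datanodes, asks the namenode to allocate suitable new blocks to store the data replicas.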

2. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 1, characterized in that: the file created in step S3 is stored as a sequence of blocks; within a given file, all blocks except the last are the same size.

3. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 2, characterized in that: the blocks of the file are replicated for fault tolerance; both the block size and the replication factor of the file are configurable; a MapReduce program can specify the number of replicas of a file; and the replication factor can be specified when the file is created and can also be changed afterwards.

4. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 3, characterized in that: the namenode makes all decisions regarding block replication, and it periodically receives heartbeats and block reports from the datanodes in the cluster.

5. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 1, characterized in that: the namenode places the first replica on the node running the client; the second replica is placed on a node in a different, randomly selected rack; the third replica is placed on the same rack as the second, but on another randomly selected node; any further replicas are placed on randomly selected nodes in the cluster.

6. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 1, characterized in that: a checksum is computed when unstructured data first enters the system and computed again whenever the data is transmitted over an unreliable channel, so that corruption can be detected; if the newly computed checksum does not match the original checksum, the unstructured data is considered corrupt.

7. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 6, characterized in that: the client also verifies checksums when reading data from a datanode, comparing them with the checksums stored on that datanode; each datanode keeps a persistent log of checksum verifications, so it knows when each of its data blocks was last verified; after the client has verified the data, it notifies the datanode, which then updates its log.

8. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 7, characterized in that: if the client detects an error while reading a data block, it reports the corrupt block and the datanode it was trying to read from to the namenode; the namenode marks that block as corrupt and has the damaged copy replaced by a new replica on other nodes, while the data is read from the other replicas in the meantime.

9. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 1, characterized in that: the HDFS namespace is stored on the namenode, which uses a transaction log called the edit log (EditLog) to persistently record every change to the file system metadata.