CN104978336A - Unstructured data storage system based on Hadoop distributed computing platform - Google Patents

Info

Publication number
CN104978336A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410137127.8A
Other languages
Chinese (zh)
Inventor
罗学礼
杨晴
杨莉
杜韶辉
吴清华
马瑞
臧戎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan Electric Power Experimental Research Institute Group Co Ltd of Electric Power Research Institute
Kunming Enersun Technology Co Ltd
Original Assignee
Yunnan Electric Power Experimental Research Institute Group Co Ltd of Electric Power Research Institute
Kunming Enersun Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-04-08
Filing date
2014-04-08
Publication date
2015-10-14
Application filed by Yunnan Electric Power Experimental Research Institute Group Co Ltd of Electric Power Research Institute and Kunming Enersun Technology Co Ltd
Priority to CN201410137127.8A
Publication of CN104978336A

Abstract

The invention relates to the field of information technology processing, and in particular to an unstructured data storage system based on the Hadoop distributed computing platform. The system comprises the following steps: S1) a client calls the create() function on an HDFS (Hadoop Distributed File System) DistributedFileSystem object to create a new file in the file system namespace; at this point the new file has no data blocks associated with it; S2) the namenode performs checks to ensure that the file does not already exist and that the client has permission to create it; if the checks pass, a new file record is created; otherwise, file creation fails and an exception is thrown; and S3) when the client writes unstructured data to the newly created file, the data is split into packets, which are written to an internal queue; the HDFS DataStreamer processes this data queue and, based on a list of datanodes, asks the namenode to allocate suitable new blocks to store the data replicas. The system greatly reduces the cost of the expensive storage equipment otherwise required for data storage, and HDFS provides a good disaster-tolerance mechanism during data storage.

Description

Unstructured data storage system based on the Hadoop distributed computing platform
Technical field
The present invention relates to the field of information technology processing, and specifically to an unstructured data storage system based on the Hadoop distributed computing platform.
Background
When storing unstructured data, the main concern is the storage of large volumes of data. Existing commercial platforms can store unstructured data, but they fall short in system scalability and construction cost. The I/O bottleneck that arises when storing huge amounts of unstructured data, together with the high price of servers, makes it necessary to look for another approach.
Summary of the invention
The object of the present invention is to solve the above problems by providing an unstructured data storage system based on the Hadoop distributed computing platform. The system can use ordinary PCs as data nodes, which greatly reduces the cost of the expensive storage equipment otherwise needed to store the data, and HDFS provides a good disaster-tolerance mechanism during data storage.
To achieve the above object, the invention provides an unstructured data storage system based on the Hadoop distributed computing platform, comprising the following steps:
S1: the client creates a new file in the file system namespace by calling the create() function on an HDFS DistributedFileSystem object; at this point the new file has no data blocks associated with it;
S2: the namenode performs checks to ensure that the file does not already exist and that the client has permission to create it; if the checks pass, a new file record is created; otherwise, file creation fails and an exception is thrown;
S3: when the client writes unstructured data to the newly created file, the data is split into packets, which are written to an internal queue; the HDFS DataStreamer processes this data queue and, based on a list of datanodes, asks the namenode to allocate suitable new blocks to store the data replicas.
Further, the file created in step S3 is stored as a sequence of blocks; within a given file, all blocks except the last are the same size.
Further, the blocks of the file are replicated for fault tolerance. Both the block size and the replication factor of the file are configurable. A MapReduce program can specify the number of replicas of a file; the replication factor can be set when the file is created and can also be changed afterwards.
Further, the namenode makes all decisions regarding block replication; it periodically receives heartbeats and block reports from the datanodes in the cluster.
Further, the namenode places the first replica on the node where the client is running; the second replica on a randomly chosen node in a different rack from the first; the third replica on the same rack as the second, but on a different, randomly chosen node; and any further replicas on randomly chosen nodes in the cluster.
Further, a checksum is computed when unstructured data first enters the system and is computed again whenever the data is transmitted over an unreliable channel, so that corruption can be detected: if the newly computed checksum does not match the original checksum, the unstructured data is considered corrupted.
Further, the client also verifies checksums when it reads data from a datanode, comparing them with the checksums stored on the datanode. Each datanode keeps a persistent log of checksum verifications, so it knows when each of its blocks was last verified. After the client successfully verifies a block, it tells the datanode, which then updates its log.
Further, if the client detects an error while reading a block, it reports the corrupted block and the datanode it was trying to read from to the namenode. The namenode marks that block as corrupt and arranges for the damaged replica to be re-created elsewhere, while the data is read from the other replicas.
Further, the HDFS namespace is stored on the namenode, which uses a transaction log called the "edit log" (EditLog) to persistently record every change to the file system metadata.
The present invention has the following beneficial effects: HDFS, the distributed file system of Hadoop, addresses precisely the I/O bottleneck and the high server cost of commercial platforms. The advantages of Hadoop are reflected in the following aspects:
1) Hadoop runs on low-end servers or even ordinary computers. Compared with the very high cost of commercial platforms, its cost is much lower; almost anyone can use it, even small and micro enterprises with a limited IT budget;
2) HDFS is tightly integrated with Map/Reduce and is the storage foundation of Hadoop distributed computing. It has clear design goals of its own: to support large data files, up to the terabyte level, that are mainly read sequentially, and to achieve high throughput for file writes and reads. Storing unstructured documents in the HDFS distributed file system will therefore improve the file storage speed of the system;
3) The data recovery capability of HDFS also ensures the safety and reliability of the system. Its reliability rests on the assumption that compute elements and storage will fail; it therefore maintains multiple working copies of the data so that processing can be redistributed away from failed nodes;
4) Storage nodes can be hot-plugged, and unstructured documents can be stored on ordinary PCs. This not only increases the flexibility with which the system can be expanded but also greatly reduces the enterprise's hardware investment.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the steps of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the scope of protection of the present invention.
Referring to Fig. 1, the invention provides an unstructured data storage system based on the Hadoop distributed computing platform, comprising the following steps:
S1: the client creates a new file in the file system namespace by calling the create() function on an HDFS DistributedFileSystem object; at this point the new file has no data blocks associated with it;
S2: the namenode performs checks to ensure that the file does not already exist and that the client has permission to create it; if the checks pass, a new file record is created; otherwise, file creation fails and an exception is thrown;
S3: when the client writes unstructured data to the newly created file, the data is split into packets, which are written to an internal queue; the HDFS DataStreamer processes this data queue and, based on a list of datanodes, asks the namenode to allocate suitable new blocks to store the data replicas.
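Steps S1 to S3 can be illustrated with a small in-memory model. This is only a sketch and not HDFS code: `ToyNameNode`, `write_packets`, the user name, and the 4-byte packet size are hypothetical names and values chosen for illustration, and permissions are reduced to a set of allowed users.

```python
from collections import deque


class ToyNameNode:
    """Hypothetical stand-in for the namenode's namespace (not the HDFS API)."""

    def __init__(self, writers):
        self.writers = set(writers)   # users allowed to create files
        self.files = {}               # path -> list of block ids

    def create(self, path, user):
        # S2: the file must not exist and the client must have permission.
        if path in self.files:
            raise FileExistsError(path)
        if user not in self.writers:
            raise PermissionError(user)
        # S1: the new file record starts with no data blocks.
        self.files[path] = []


def write_packets(data, packet_size):
    """S3: split the data into packets and place them on an internal queue."""
    queue = deque()
    for start in range(0, len(data), packet_size):
        queue.append(data[start:start + packet_size])
    return queue


namenode = ToyNameNode(writers={"alice"})
namenode.create("/docs/report.pdf", "alice")
packets = write_packets(b"unstructured-bytes", packet_size=4)
```

In this model a second `create` call for the same path raises `FileExistsError`, and a call by a user outside `writers` raises `PermissionError`, which mirrors the failure branch of step S2.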
Further, the file created in step S3 is stored as a sequence of blocks; within a given file, all blocks except the last are the same size.
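The block layout described above amounts to simple integer arithmetic. The sketch below assumes a 64 MiB block size purely for illustration; the function name `block_sizes` is hypothetical.

```python
def block_sizes(file_size, block_size):
    """Return the sizes of the blocks that a file of file_size bytes occupies."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)       # only the last block may be smaller
    return sizes


MIB = 1024 * 1024
sizes = block_sizes(200 * MIB, 64 * MIB)   # three full blocks plus one 8 MiB block
```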
Further, the blocks of the file are replicated for fault tolerance. Both the block size and the replication factor of the file are configurable. A MapReduce program can specify the number of replicas of a file; the replication factor can be set when the file is created and can also be changed afterwards.
Further, the namenode makes all decisions regarding block replication; it periodically receives heartbeats and block reports from the datanodes in the cluster.
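One way to picture how heartbeats and block reports feed a replication decision is the following sketch. It is a simplified model under stated assumptions, not the namenode's actual logic: the function name, the timeout value, and the data layout are all hypothetical. A datanode counts as live only if its last heartbeat is recent, and only replicas on live datanodes are counted.

```python
def under_replicated(block_reports, last_heartbeat, now, timeout, replication):
    """Return {block_id: missing_replica_count}, counting live datanodes only.

    block_reports:  {datanode: set of block ids it reported}
    last_heartbeat: {datanode: time of its most recent heartbeat}
    """
    live = {dn for dn, seen in last_heartbeat.items() if now - seen <= timeout}
    counts = {}
    for dn, blocks in block_reports.items():
        for block in blocks:
            counts.setdefault(block, 0)
            if dn in live:
                counts[block] += 1
    return {b: replication - n for b, n in counts.items() if n < replication}


reports = {"dn1": {"b1", "b2"}, "dn2": {"b1"}, "dn3": {"b2"}}
heartbeats = {"dn1": 100, "dn2": 100, "dn3": 40}   # dn3 has gone silent
todo = under_replicated(reports, heartbeats, now=105, timeout=30, replication=2)
```

Here block `b2` has only one replica on a live datanode, so the model flags it as needing one more copy, while `b1` is fully replicated.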
Further, the namenode places the first replica on the node where the client is running; the second replica on a randomly chosen node in a different rack from the first; the third replica on the same rack as the second, but on a different, randomly chosen node; and any further replicas on randomly chosen nodes in the cluster.
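The placement rule above can be sketched as follows. The function `place_replicas`, the rack and node names, and the topology layout are hypothetical; the sketch assumes the cluster has at least two racks, at least two nodes on the rack chosen for the second replica, and enough nodes overall for the requested number of replicas.

```python
import random


def place_replicas(topology, client_node, replication, rng=random):
    """Pick nodes for the replicas of one block.

    topology: {rack name: list of node names}
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    chosen = [client_node]                        # 1st replica: the client's node
    if replication >= 2:
        other_racks = [r for r in topology if r != rack_of[client_node]]
        second = rng.choice(topology[rng.choice(other_racks)])
        chosen.append(second)                     # 2nd replica: a different rack
    if replication >= 3:
        peers = [n for n in topology[rack_of[second]] if n != second]
        chosen.append(rng.choice(peers))          # 3rd replica: same rack as the 2nd
    while len(chosen) < replication:
        rest = [n for n in rack_of if n not in chosen]
        chosen.append(rng.choice(rest))           # further replicas: any unused node
    return chosen[:replication]


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
replicas = place_replicas(topology, "n1", 3, rng=random.Random(0))
```

Placing the second and third replicas together on a rack other than the client's keeps the data available if a whole rack fails, while only one copy has to cross between racks twice.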
Further, a checksum is computed when unstructured data first enters the system and is computed again whenever the data is transmitted over an unreliable channel, so that corruption can be detected: if the newly computed checksum does not match the original checksum, the unstructured data is considered corrupted.
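A minimal sketch of this check is shown below. The text does not name a checksum algorithm, so the use of CRC-32 (via Python's `zlib.crc32`) is an assumption made for illustration, as are the function names and sample payloads.

```python
import zlib


def checksum(data):
    """CRC-32 of the data; the choice of CRC-32 is an assumption."""
    return zlib.crc32(data)


def is_corrupted(received, original_checksum):
    """Recompute the checksum after transfer and compare it with the original."""
    return checksum(received) != original_checksum


payload = b"scanned contract, page 1"
stored = checksum(payload)            # computed when the data first enters the system

corrupted_after_clean_transfer = is_corrupted(payload, stored)
corrupted_after_bad_transfer = is_corrupted(b"scanned contract, page 7", stored)
```

A checksum of this kind detects corruption but cannot repair it; repair relies on the other replicas of the block.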
Further, the client also verifies checksums when it reads data from a datanode, comparing them with the checksums stored on the datanode. Each datanode keeps a persistent log of checksum verifications, so it knows when each of its blocks was last verified. After the client successfully verifies a block, it tells the datanode, which then updates its log.
Further, if the client detects an error while reading a block, it reports the corrupted block and the datanode it was trying to read from to the namenode. The namenode marks that block as corrupt and arranges for the damaged replica to be re-created elsewhere, while the data is read from the other replicas.
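The read path with fallback to other replicas can be modelled as below. This is a sketch under assumptions: `read_block` and its parameters are hypothetical, CRC-32 stands in for the unspecified checksum, and a plain list stands in for the report sent to the namenode.

```python
import zlib


def read_block(block_id, replicas, expected_crc, corrupt_reports):
    """Return the first replica whose checksum matches.

    replicas:        {datanode: bytes stored for block_id on that datanode}
    corrupt_reports: list collecting (block_id, datanode) pairs, standing in
                     for the report sent to the namenode
    """
    for datanode, data in replicas.items():
        if zlib.crc32(data) == expected_crc:
            return data
        # Report the damaged replica, then try the next one.
        corrupt_reports.append((block_id, datanode))
    raise IOError(f"all replicas of {block_id} are corrupt")


good = b"block contents"
replicas = {"dn1": b"block c0ntents", "dn2": good, "dn3": good}
reports = []
data = read_block("b1", replicas, zlib.crc32(good), reports)
```

After this call `reports` holds the single pair for the damaged copy on `dn1`, and the caller still receives intact data from `dn2`.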
Further, the HDFS namespace is stored on the namenode, which uses a transaction log called the "edit log" (EditLog) to persistently record every change to the file system metadata.
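The idea of a transaction log for metadata can be sketched as an append-only journal that is replayed to rebuild the namespace. The class `EditLog`, the `replay` function, the JSON-lines record format, and the two operation names are assumptions for illustration and do not reflect the on-disk format used by HDFS.

```python
import json


class EditLog:
    """Toy append-only journal of namespace changes."""

    def __init__(self):
        self.lines = []                      # stands in for a file on disk

    def record(self, op, path):
        self.lines.append(json.dumps({"op": op, "path": path}))


def replay(lines):
    """Rebuild the set of paths in the namespace from the journal."""
    namespace = set()
    for line in lines:
        entry = json.loads(line)
        if entry["op"] == "create":
            namespace.add(entry["path"])
        elif entry["op"] == "delete":
            namespace.discard(entry["path"])
    return namespace


log = EditLog()
log.record("create", "/docs/a.pdf")
log.record("create", "/docs/b.jpg")
log.record("delete", "/docs/a.pdf")
namespace = replay(log.lines)
```

Because every change is appended before it takes effect, replaying the journal from the start reproduces the current metadata after a restart.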
The present invention has the following beneficial effects: HDFS, the distributed file system of Hadoop, addresses precisely the I/O bottleneck and the high server cost of commercial platforms. The advantages of Hadoop are reflected in the following aspects:
1) Hadoop runs on low-end servers or even ordinary computers. Compared with the very high cost of commercial platforms, its cost is much lower; almost anyone can use it, even small and micro enterprises with a limited IT budget;
2) HDFS is tightly integrated with Map/Reduce and is the storage foundation of Hadoop distributed computing. It has clear design goals of its own: to support large data files, up to the terabyte level, that are mainly read sequentially, and to achieve high throughput for file writes and reads. Storing unstructured documents in the HDFS distributed file system will therefore improve the file storage speed of the system;
3) The data recovery capability of HDFS also ensures the safety and reliability of the system. Its reliability rests on the assumption that compute elements and storage will fail; it therefore maintains multiple working copies of the data so that processing can be redistributed away from failed nodes;
4) Storage nodes can be hot-plugged, and unstructured documents can be stored on ordinary PCs. This not only increases the flexibility with which the system can be expanded but also greatly reduces the enterprise's hardware investment.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (9)

1. An unstructured data storage system based on the Hadoop distributed computing platform, characterized in that it comprises the following steps:
S1: the client creates a new file in the file system namespace by calling the create() function on an HDFS DistributedFileSystem object; at this point the new file has no data blocks associated with it;
S2: the namenode performs checks to ensure that the file does not already exist and that the client has permission to create it; if the checks pass, a new file record is created; otherwise, file creation fails and an exception is thrown;
S3: when the client writes unstructured data to the newly created file, the data is split into packets, which are written to an internal queue; the HDFS DataStreamer processes this data queue and, based on a list of datanodes, asks the namenode to allocate suitable new blocks to store the data replicas.
2. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 1, characterized in that the file created in step S3 is stored as a sequence of blocks, and within a given file all blocks except the last are the same size.
3. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 2, characterized in that the blocks of the file are replicated for fault tolerance; both the block size and the replication factor of the file are configurable; a MapReduce program can specify the number of replicas of a file; and the replication factor can be set when the file is created and can also be changed afterwards.
4. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 3, characterized in that the namenode makes all decisions regarding block replication, and periodically receives heartbeats and block reports from the datanodes in the cluster.
5. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 1, characterized in that the namenode places the first replica on the node where the client is running; the second replica on a randomly chosen node in a different rack from the first; the third replica on the same rack as the second, but on a different, randomly chosen node; and any further replicas on randomly chosen nodes in the cluster.
6. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 1, characterized in that a checksum is computed when unstructured data first enters the system and is computed again whenever the data is transmitted over an unreliable channel, so that corruption can be detected; if the newly computed checksum does not match the original checksum, the unstructured data is considered corrupted.
7. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 6, characterized in that the client also verifies checksums when it reads data from a datanode, comparing them with the checksums stored on the datanode; each datanode keeps a persistent log of checksum verifications, so it knows when each of its blocks was last verified; and after the client successfully verifies a block, it tells the datanode, which then updates its log.
8. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 7, characterized in that, if the client detects an error while reading a block, it reports the corrupted block and the datanode it was trying to read from to the namenode; the namenode marks that block as corrupt and arranges for the damaged replica to be re-created elsewhere, while the data is read from the other replicas.
9. The unstructured data storage system based on the Hadoop distributed computing platform according to claim 1, characterized in that the HDFS namespace is stored on the namenode, which uses a transaction log called the "edit log" (EditLog) to persistently record every change to the file system metadata.
CN201410137127.8A 2014-04-08 2014-04-08 Unstructured data storage system based on Hadoop distributed computing platform Pending CN104978336A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410137127.8A CN104978336A (en) 2014-04-08 2014-04-08 Unstructured data storage system based on Hadoop distributed computing platform

Publications (1)

Publication Number Publication Date
CN104978336A true CN104978336A (en) 2015-10-14

Family

ID=54274851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410137127.8A Pending CN104978336A (en) 2014-04-08 2014-04-08 Unstructured data storage system based on Hadoop distributed computing platform

Country Status (1)

Country Link
CN (1) CN104978336A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130254246A1 (en) * 2012-03-21 2013-09-26 Todd Lipcon Data processing performance enhancement in a distributed file system
US20130325813A1 (en) * 2012-05-30 2013-12-05 Spectra Logic Corporation System and method for archive in a distributed file system
CN103116643A (en) * 2013-02-25 2013-05-22 江苏物联网研究发展中心 Hadoop-based intelligent medical data management method
CN103530387A (en) * 2013-10-22 2014-01-22 浪潮电子信息产业股份有限公司 Improved method aimed at small files of HDFS
CN103617290A (en) * 2013-12-13 2014-03-05 江苏名通信息科技有限公司 Chinese machine-reading system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
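周轶男 et al.: "Hadoop文件系统性能分析" [Performance analysis of the Hadoop file system], 《电子技术》 [Electronic Technology] *
朱颂: "分布式文件系统HDFS的分析" [Analysis of the distributed file system HDFS], 《福建电脑》 [Fujian Computer] *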

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107402841A (en) * 2016-03-30 2017-11-28 阿里巴巴集团控股有限公司 Large-scale distributed file system data recovery method and equipment
CN107402841B (en) * 2016-03-30 2021-01-29 阿里巴巴集团控股有限公司 Data restoration method and device for large-scale distributed file system
CN107784047A (en) * 2016-11-14 2018-03-09 平安科技(深圳)有限公司 The implementation method and device of storing process
CN107784047B (en) * 2016-11-14 2020-09-11 平安科技(深圳)有限公司 Method and device for realizing storage process
CN106682227A (en) * 2017-01-06 2017-05-17 郑州云海信息技术有限公司 Log data storage system based on distributed file system and reading-writing method
CN107169099A (en) * 2017-05-16 2017-09-15 成都四象联创科技有限公司 Data processing method based on HADOOP
CN108573007A (en) * 2017-06-08 2018-09-25 北京金山云网络技术有限公司 Method, apparatus, electronic equipment and the storage medium of data consistency detection
CN107807792A (en) * 2017-10-27 2018-03-16 郑州云海信息技术有限公司 A kind of data processing method and relevant apparatus based on copy storage system
CN111176885A (en) * 2019-12-31 2020-05-19 浪潮电子信息产业股份有限公司 Data verification method and related device for distributed storage system
CN111274210A (en) * 2020-02-10 2020-06-12 北京松果电子有限公司 Metadata processing method and device and electronic equipment
CN111274210B (en) * 2020-02-10 2023-05-30 北京小米松果电子有限公司 Metadata processing method and device and electronic equipment
CN115454958A (en) * 2022-09-15 2022-12-09 北京百度网讯科技有限公司 Data processing method, device, equipment, system and medium based on artificial intelligence
CN115454958B (en) * 2022-09-15 2024-03-05 北京百度网讯科技有限公司 Data processing method, device, equipment, system and medium based on artificial intelligence

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151014

RJ01 Rejection of invention patent application after publication