CN103440244A - Large-data storage and optimization method - Google Patents

Large-data storage and optimization method

Info

Publication number
CN103440244A
CN103440244A
Authority
CN
China
Prior art keywords
data
datanode
namenode
optimization
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310293482XA
Other languages
Chinese (zh)
Inventor
An Hongwei (安宏伟)
Ji Tongkai (季统凯)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201310293482XA priority Critical patent/CN103440244A/en
Publication of CN103440244A publication Critical patent/CN103440244A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data processing, and in particular to a big data storage optimization method oriented toward sea-cloud collaboration. The method comprises the steps of data preprocessing, computation optimization and mass data optimization. The data preprocessing step comprises data collection, multi-source data organization and aggregation, data redundancy processing, and compressed data storage; the computation optimization step comprises HDFS (Hadoop Distributed File System) file transfer optimization and Map/Reduce parallel computation optimization; and the mass data optimization step comprises data disaster-recovery backup, data encryption, CCIndex indexing and CCT backup. The big data storage optimization method disclosed by the invention can be applied to big data storage on a cloud platform.

Description

A big data storage optimization method
Technical field
The present invention relates to the technical field of data processing, and in particular to a big data storage optimization method oriented toward sea-cloud collaboration.
Background art
With the rapid development of information technology, traditional persistent storage architectures have found it increasingly difficult to keep pace with the growth of information services. The Hadoop distributed system uses distributed algorithms to spread data access and storage across a large number of servers, and can further distribute each server's access load across many reliable backup copies within the cluster; it is a disruptive departure from conventional storage architectures. Opportunity, however, coexists with challenge: the open-source distributed architecture still appears particularly heavyweight for distributed applications, and its response performance is insufficient for big data storage and for frequent file write and read operations.
Summary of the invention
The technical problem solved by the present invention is to provide a big data storage optimization method oriented toward sea-cloud collaboration that effectively optimizes big data storage.
The technical scheme by which the present invention solves the above technical problem is as follows:
The method comprises data preprocessing, computation optimization and mass data optimization. Data preprocessing comprises data collection, multi-source data organization and aggregation, data redundancy processing, and compressed data storage; computation optimization comprises HDFS file transfer optimization and Map/Reduce parallel computation optimization; and mass data optimization comprises data disaster-recovery backup, data encryption, CCIndex indexing and CCT backup. After the data submitted by the client is gathered by data collection, it is standardized through multi-source data organization and aggregation and data redundancy processing, and is then compressed and stored using RCFile: the data is split horizontally, a block-then-shard mechanism is introduced (blocks are formed first and then sharded), and row-oriented storage is used within a block and column-oriented storage within a shard. In computation optimization, CCIndex is adopted to convert random traversal of the data into traversal by row index, and CCT is adopted to perform record-level column replication for incremental data backup. In mass data optimization, the parallel computation component completes the configuration-level optimization of the HDFS file system and the Map/Reduce computation model and integrates seamlessly with the G-cloud cloud platform, so that the infrastructure and basic services provided by the G-cloud cloud platform can be used flexibly.
The storage optimization system is configured as follows (see the configuration sketch after this list):
Step 1: optimize the Linux file system mount parameters by adding the noatime option;
Step 2: optimize the NameNode parameter configuration; for massive data file processing, dfs.block.size is set to 64M*N (N = 1, 2, 3, 4), and dfs.namenode.handler.count is raised from its default value to 64;
Step 3: optimize the DataNode; dfs.datanode.handler.count, the number of service threads opened for remote calls on a DataNode, is set to 8;
Step 4: optimize the job tracker (job.tracker) configuration; mapred.job.tracker.handler.count, the number of service threads opened on the job tracker to handle RPCs from the task trackers, is set to 64; mapred.map.tasks, the number of map tasks per job, is set to a value close to the number of hosts in the cluster; mapred.reduce.tasks, the number of reduce tasks per job, is likewise set to a value close to the number of hosts in the cluster;
Step 5: optimize the task tracker (task.tracker) configuration;
mapred.tasktracker.map.tasks.maximum, the maximum number of map tasks that can run concurrently on a task tracker, is set to the number of server CPU cores, or to that number minus 1;
mapred.tasktracker.reduce.tasks.maximum, which controls the number of reduce tasks that can run concurrently on a task tracker, is set to 2; tasktracker.http.threads, the number of threads of the HTTP server running on each TaskTracker and used to serve map task output, can be set to 40-50;
Step 6: optimize the map-side configuration; io.sort.mb can be set to 200 MB; the io.sort.factor property (int type), which sets the maximum number of streams merged at once when sorting files on both the Map side and the Reduce side, is set to 100; the io.file.buffer.size property, which sets the size in bytes of the buffer used for I/O operations in a MapReduce job, is adjusted to 64 KB or 128 KB; the tasktracker.http.threads property (int type), the number of worker threads on each tasktracker in the cluster used to serve map output to the reducers, is increased to between 40 and 50;
Step 7: optimize the reduce-side configuration; mapred.reduce.parallel.copies, which increases the parallelism of the reduce-side copy phase, is adjusted to 20; the mapred.child.java.opts property is adjusted to 2 MB;
the mapred.job.shuffle.input.buffer.percent property is raised appropriately so that Map output does not spill to disk; the mapred.job.shuffle.merge.percent property is raised appropriately to reduce the number of disk spills; the mapred.inmem.merge.threshold property can be set to 0 when the Reduce function requires little memory, so that spilling is controlled solely by the mapred.job.shuffle.merge.percent property; the mapred.job.reduce.input.buffer.percent property is set to 1.0.
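The following is a minimal, non-authoritative sketch of how the parameter values listed above could be applied through the standard Hadoop Configuration API; the patent itself only names the properties and target values, and in practice they are normally placed in hdfs-site.xml and mapred-site.xml. The block-size multiplier N = 2 is an illustrative assumption.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch only: applies the classic (1.x-era) property names used in this
// document programmatically instead of via the XML configuration files.
public class StorageTuningSketch {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // NameNode: 64 MB * N block size (here N = 2) and more handler threads
        conf.setLong("dfs.block.size", 2L * 64 * 1024 * 1024);
        conf.setInt("dfs.namenode.handler.count", 64);
        // DataNode: more service threads for remote calls
        conf.setInt("dfs.datanode.handler.count", 8);
        // Job tracker / task tracker settings from steps 4 and 5
        conf.setInt("mapred.job.tracker.handler.count", 64);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        conf.setInt("tasktracker.http.threads", 48);          // within 40-50
        // Map- and reduce-side settings from steps 6 and 7
        conf.setInt("io.sort.mb", 200);
        conf.setInt("io.sort.factor", 100);
        conf.setInt("io.file.buffer.size", 128 * 1024);       // 128 KB
        conf.setInt("mapred.reduce.parallel.copies", 20);
        conf.setFloat("mapred.job.reduce.input.buffer.percent", 1.0f);
        return conf;
    }
}
```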
The HDFS distributed file storage workflow is as follows (a client-side API sketch follows this list):
Step 1: the client passes identity authentication, establishes a TCP/IP connection, connects to the NameNode through a configurable port and initiates an RPC remote request;
Step 2: the NameNode checks whether the file to be created already exists and whether the creator has permission to operate; on success it creates a record for the file, otherwise it throws an exception back to the client;
Step 3: the client writes the file; the file is cut into multiple packets, which are managed internally as a data queue, while new blocks are requested from the NameNode to obtain a list of suitable DataNodes for storing the replicas; the size of the list is determined by the replication setting in the NameNode;
Step 4: the packets are written to all replicas in pipeline fashion; each packet is streamed to the first DataNode, which stores it and then forwards it to the next DataNode in the pipeline, and so on until the last DataNode;
Step 5: if a DataNode fails during transmission, the current pipeline is closed, the failed DataNode is removed from the pipeline, the remaining blocks continue to be transmitted in pipeline fashion through the remaining DataNodes, and the NameNode allocates a new DataNode to keep the configured number of replicas; the write operation then completes;
Step 6: the NameNode maps the stored block addresses to the communication addresses of the corresponding DataNode blocks and returns some or all of the block list of the file;
Step 7: the NameNode selects the nearest DataNode, the block list is read, and reading of the file begins.
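As an illustration only, the write path described above is driven from the client through the standard Hadoop FileSystem API; the packet splitting, DataNode pipeline and replica recovery of steps 3 to 5 are handled inside the client library. The NameNode URI and file path below are hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of an HDFS client write: create() triggers the NameNode checks and
// block allocation, and the returned stream feeds the DataNode pipeline.
public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        Path file = new Path("/data/example.txt");           // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("payload split into packets and pipelined to DataNodes\n");
        }
        fs.close();
    }
}
```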
The detailed data processing procedure is as follows (a digest-based deduplication sketch follows this list):
Step 1: analyze the availability of the multi-source mass information from multiple perspectives such as the information source, the information body and the user request;
Step 2: after multi-source data is organized and aggregated, multiple identical copies may be produced; when a newly added file is aggregated for storage, the system detects the event, computes the digest value of the new file, and requests the new file from the system; the system checks whether the digest value already exists; if it does not, a message is returned allowing the client to aggregate and store the data and the file is newly created; if the digest value exists, the system newly creates the file together with its permission and attribute information, but the file data directly references the existing content and does not need to be aggregated and stored again;
Step 3: use RCFile to compress the data; the relational data is split horizontally and stored column by column within each shard, turning the record-oriented storage structure of the distributed data processing system into a column-oriented one;
Step 4: the storage of unstructured file data is handled by the data cluster; block partitioning and block replication mechanisms are introduced for storage, and data indexing and tree-node optimization are added;
Step 5: adopt transmission channel encryption and data storage encryption, combining symmetric encryption with asymmetric encryption;
Step 6: use a disk array to back up production data in real time; CCIndex is introduced into mass data processing optimization to convert random traversal of the data into efficient traversal by row index, and CCT is introduced to perform record-level column replication for incremental data backup;
Step 7: access the G-cloud cloud platform synchronously and use its computing, virtualization and management resources for mass data processing, deduplication filtering and mining analysis, while introducing operations such as mass data search indexing and tree-node optimization.
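A minimal sketch, not taken from the patent, of the digest-based duplicate check in step 2: the digest of a newly aggregated file is computed and looked up in an in-memory index standing in for the system's digest store; class and method names, and the choice of SHA-256, are illustrative assumptions.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch: store a file only if its content digest has not been seen before,
// otherwise reference the already-stored content.
public class DigestDedupSketch {
    private final Map<String, Path> digestIndex = new HashMap<>();

    public boolean storeIfNew(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        String digest = hex.toString();
        if (digestIndex.containsKey(digest)) {
            return false;                 // duplicate: reference existing data
        }
        digestIndex.put(digest, file);    // new content: aggregate and store it
        return true;
    }
}
```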
The detailed HDFS distributed file read procedure is as follows (a client-side API sketch follows this list):
Step 1: the client connects to the NameNode through a configurable port; the connection is established over TCP/IP;
Step 2: the client interacts with the NameNode through the ClientProtocol;
Step 3: the DataNodes interact with the NameNode through the DatanodeProtocol and establish connections with the NameNode;
Step 4: each DataNode maintains its communication connection with the NameNode by periodically sending heartbeats and block reports to the NameNode;
Step 5: the block information includes the block attributes, the file the block belongs to, the block address ID, the modification time, and so on;
Step 6: the NameNode responds to RPC requests from the client and the DataNodes and receives heartbeat signals and block status reports from all DataNodes;
Step 7: the block status report is returned to the client; the report contains the complete block list of a given DataNode;
Step 8: based on the address information returned in the block report, the client chooses a DataNode and reads the data;
Step 9: the DataNode connection is closed and the read is finished.
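For illustration, the read path above corresponds to standard client-side use of the Hadoop FileSystem API: open() obtains the block locations from the NameNode, and the returned stream pulls the blocks from suitable DataNodes. The NameNode URI and path are hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of an HDFS client read; block location lookup, DataNode selection
// and connection teardown are handled by the client library.
public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
        try (FSDataInputStream in = fs.open(new Path("/data/example.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            for (String line; (line = reader.readLine()) != null; ) {
                System.out.println(line);   // data streamed block by block from DataNodes
            }
        }
        fs.close();
    }
}
```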
The present invention realizes HDFS file transfer optimization, Map/Reduce parallel computation optimization and mass data query optimization, and achieves the following performance indicators: a stable and efficient big data storage optimization method with optimized mass data query processing and good scalability, supporting a storage capacity of no less than the 100 PB level and expandable to the EB level; good reliability and security, with multi-copy redundancy protection for critical data and a number of copies of no less than 3; an off-site data disaster recovery and backup system; elastic resource allocation based on the G-cloud platform; good response speed; and support for mass data analysis and mining services.
Description of the drawings
The present invention is further described below in conjunction with the accompanying drawings.
Fig. 1 is a schematic diagram of the system architecture of the present invention;
Fig. 2 is a schematic diagram of the unstructured data storage system;
Fig. 3 is a schematic diagram of the sea-cloud collaborative platform HDFS distributed file system;
Fig. 4 is a schematic diagram of the network topology of the present invention.
Embodiment
The present invention proposes a big data storage optimization method based on the G-cloud cloud platform. The JobClient submits data to the data acquisition system; the mass data submitted by the JobClient is standardized using data preprocessing technology; the data compression technology adopts the efficient storage structure RCFile, splitting the data horizontally and introducing a block-then-shard mechanism (blocks are formed first and then sharded), with row-oriented storage within a block and column-oriented storage within a shard. CCIndex is introduced into mass data processing optimization to convert random traversal of the data into efficient traversal by row index, and CCT is introduced to perform record-level column replication for incremental data backup. The parallel computation component completes the configuration-level optimization of the HDFS file system and the Map/Reduce computation model, provides a fault-tolerant, high-throughput mass data storage scheme, significantly improves file processing and computing performance, integrates seamlessly with the G-cloud cloud platform, flexibly uses the infrastructure and basic services provided by the G-cloud cloud platform, and supports virtualization of large-scale computing, storage and network resources as well as data analysis management.
As shown in Fig. 1, the detailed procedure for implementing the storage optimization method of the present invention is (a shuffle tuning sketch follows this list):
Step 1: optimize the Linux file system mount parameters by adding the noatime option; Linux provides the noatime option to disable recording of the last access timestamp when the file system is mounted, which significantly improves disk I/O efficiency; after the setting is modified the file system only needs to be remounted, and the change takes effect without a restart;
Step 2: optimize the NameNode parameter configuration; for massive data file processing, dfs.block.size is set to 64M*N (N = 1, 2, 3, 4); dfs.namenode.handler.count defaults to 10 and is set to 64 for a massive data file cluster;
Step 3: optimize the DataNode; dfs.datanode.handler.count, the number of service threads opened for remote calls on a DataNode, defaults to 3 and is set to 8 in the present invention;
Step 4: optimize the job tracker (job.tracker) configuration; mapred.job.tracker.handler.count, the number of service threads opened on the job tracker to handle RPCs from the task trackers, is generally set to about 4% of the number of task tracker nodes, and is set to 64 in the present invention. mapred.map.tasks, the number of map tasks per job, is usually set to a value close to the number of hosts in the cluster. mapred.reduce.tasks, the number of reduce tasks per job, is likewise usually set to a value close to the number of hosts in the cluster;
Step 5: optimize the task tracker (task.tracker) configuration;
mapred.tasktracker.map.tasks.maximum, the maximum number of map tasks that can run concurrently on a task tracker, gives the highest operational efficiency when set to the number of server CPU cores or to that number minus 1. mapred.tasktracker.reduce.tasks.maximum controls the number of reduce tasks that can run concurrently on a task tracker and is set to 2 in the present invention. tasktracker.http.threads is the number of threads of the HTTP server running on each TaskTracker and used to serve map task output; for a large cluster it can be set to 40-50;
Step 6: optimize the map-side configuration; io.sort.mb can be set to 200 MB for a large cluster. The io.sort.factor property (int type), which sets the maximum number of streams merged at once when sorting files on both the Map side and the Reduce side, has a default value of 10 and is increased to 100. The io.file.buffer.size property sets the size in bytes of the buffer used for I/O operations in a MapReduce job; its default is 4 KB and it is adjusted to 64 KB or 128 KB. The tasktracker.http.threads property (int type), the number of worker threads on each tasktracker in the cluster used to serve map output to the reducers, defaults to 40 and can be increased to between 40 and 50; increasing the thread count improves cluster performance;
Step 7: optimize the reduce-side configuration; mapred.reduce.parallel.copies increases the parallelism of the reduce-side copy phase; its default value is 5 and it is adjusted to 20 in the present invention. The mapred.child.java.opts property is adjusted to 2 MB to improve MapReduce job performance. The mapred.job.shuffle.input.buffer.percent property defaults to 0.70 and is raised appropriately so that Map output does not spill to disk;
the mapred.job.shuffle.merge.percent property is raised appropriately to reduce the number of disk spills. The mapred.inmem.merge.threshold property defaults to 1000; when the Reduce function requires little memory, it can be set to 0 so that there is no threshold limit and spilling is controlled solely by the mapred.job.shuffle.merge.percent property. The mapred.job.reduce.input.buffer.percent property is set to 1.0;
Step 8: CCIndex is introduced into mass data processing optimization to convert random traversal of the data into efficient traversal by row index, and CCT is introduced to perform record-level column replication for incremental data backup.
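A minimal sketch, under the assumption of the classic (1.x-style) MapReduce property names used in the text, of how the shuffle-related values from steps 6 and 7 might be applied to a JobConf; the patent specifies only the property names and target values, so the exact buffer percentage chosen below is illustrative.

```java
import org.apache.hadoop.mapred.JobConf;

// Sketch: map- and reduce-side shuffle tuning from steps 6 and 7 applied to
// a classic JobConf instead of mapred-site.xml.
public class ShuffleTuningSketch {
    public static JobConf tunedJob() {
        JobConf job = new JobConf();
        job.setInt("io.sort.mb", 200);                       // sort buffer, MB
        job.setInt("io.sort.factor", 100);                   // streams merged per pass
        job.setInt("io.file.buffer.size", 64 * 1024);        // 64 KB I/O buffer
        job.setInt("tasktracker.http.threads", 50);          // map-output server threads
        job.setInt("mapred.reduce.parallel.copies", 20);     // parallel copy threads
        job.setFloat("mapred.job.shuffle.input.buffer.percent", 0.80f); // raised from 0.70
        job.setInt("mapred.inmem.merge.threshold", 0);       // spill driven by merge.percent only
        job.setFloat("mapred.job.reduce.input.buffer.percent", 1.0f);
        return job;
    }
}
```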
As shown in Fig. 2, the detailed unstructured data storage procedure is:
Step 1: mass multi-source data often contains unclean and non-standard formats that pose potential risks to the use and statistical analysis of the application system; the data must be converted through data preprocessing standards into standardized data of the system platform;
Step 2: analyze the availability of the multi-source mass information from multiple perspectives such as the information source, the information body and the user request, and establish an availability evaluation and inference model suited to the development and application of the information;
Step 3: after multi-source data is organized and aggregated, multiple identical copies of the same file may be produced. When a newly added file is aggregated for storage, the system detects the event, computes the digest value of the new file, and requests the new file from the system. The system checks whether the digest value already exists; if it does not, a message is returned allowing the client to aggregate and store the data and the file is newly created. If the digest value exists, the system newly creates the file together with its permission and attribute information, but the file data directly references the existing content and does not need to be aggregated and stored again;
Step 4: the present invention adopts an efficient data storage structure, RCFile (Record Columnar File), to compress the data. The RCFile storage structure is based on the Hadoop system; it combines the advantages of row storage and column storage and follows the design concept of "first horizontal division, then vertical division";
Step 5: the storage of unstructured file data is handled by the data cluster; block partitioning and block replication mechanisms are introduced for storage, and data indexing and tree-node optimization are added to speed up data retrieval;
Step 6: to increase data security, adopt transmission channel encryption and data storage encryption, combining a symmetric encryption algorithm with an asymmetric encryption algorithm (see the sketch after this list);
Step 7: use a disk array to back up production data in real time.
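The combination of symmetric and asymmetric encryption in step 6 can be illustrated, purely as an assumption about one possible realization, by encrypting the data with an AES key and protecting that key with an RSA key pair:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;
import java.security.KeyPair;
import java.security.KeyPairGenerator;

// Sketch of hybrid encryption: bulk data is encrypted symmetrically, and the
// symmetric key is wrapped with an asymmetric key pair.
public class HybridEncryptionSketch {
    public static void main(String[] args) throws Exception {
        // Symmetric key for the bulk data
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey dataKey = keyGen.generateKey();

        Cipher aes = Cipher.getInstance("AES");
        aes.init(Cipher.ENCRYPT_MODE, dataKey);
        byte[] cipherData = aes.doFinal("stored block contents".getBytes("UTF-8"));

        // Asymmetric key pair protects the symmetric key
        KeyPairGenerator rsaGen = KeyPairGenerator.getInstance("RSA");
        rsaGen.initialize(2048);
        KeyPair rsaPair = rsaGen.generateKeyPair();

        Cipher rsa = Cipher.getInstance("RSA");
        rsa.init(Cipher.ENCRYPT_MODE, rsaPair.getPublic());
        byte[] wrappedKey = rsa.doFinal(dataKey.getEncoded());

        // Recipient unwraps the AES key with the RSA private key, then decrypts the data
        rsa.init(Cipher.DECRYPT_MODE, rsaPair.getPrivate());
        SecretKey recovered = new SecretKeySpec(rsa.doFinal(wrappedKey), "AES");
        aes.init(Cipher.DECRYPT_MODE, recovered);
        System.out.println(new String(aes.doFinal(cipherData), "UTF-8"));
    }
}
```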
As shown in Fig. 3, the detailed procedure of sea-cloud collaborative distributed file storage is:
Step 1: the client connects to the NameNode through a configurable port; the connection is established over TCP/IP;
Step 2: the client interacts with the NameNode through the ClientProtocol;
Step 3: the DataNodes interact with the NameNode through the DatanodeProtocol and establish connections with the NameNode;
Step 4: each DataNode maintains its communication connection with the NameNode by periodically sending heartbeats and block reports to the NameNode;
Step 5: the block information includes the block attributes, the file the block belongs to, the block address ID, the modification time, and so on;
Step 6: the NameNode responds to RPC requests from the client and the DataNodes and receives heartbeat signals and block status reports from all DataNodes;
Step 7: the block status report is returned to the client; the report contains the complete block list of a given DataNode;
Step 8: based on the address information returned in the block report, the client chooses a DataNode and reads the data;
Step 9: the DataNode connection is closed and the read is finished.
As shown in Fig. 4, the present invention is composed of three parts: mass data storage management, the distributed data platform and the G-cloud cloud operating system. The client passes identity authentication, establishes a TCP/IP connection, connects to the NameNode through a configurable port to initiate an RPC request, interacts with the distributed file processing domain to store data, accesses the cloud platform at the bottom layer, and flexibly uses the cloud infrastructure and basic services for data mining and analysis, thereby completing the sea-cloud collaborative computing service.

Claims (7)

1. A big data storage optimization method, characterized in that: the method comprises data preprocessing, computation optimization and mass data optimization; data preprocessing comprises data collection, multi-source data organization and aggregation, data redundancy processing, and compressed data storage; computation optimization comprises HDFS file transfer optimization and Map/Reduce parallel computation optimization; and mass data optimization comprises data disaster-recovery backup, data encryption, CCIndex indexing and CCT backup; after the data submitted by the client is gathered by data collection, it is standardized through multi-source data organization and aggregation and data redundancy processing, and is then compressed and stored using RCFile: the data is split horizontally, a block-then-shard mechanism is introduced (blocks are formed first and then sharded), and row-oriented storage is used within a block and column-oriented storage within a shard; then, in computation optimization, CCIndex is adopted to convert random traversal of the data into traversal by row index, and CCT is adopted to perform record-level column replication for incremental data backup; in mass data optimization, the parallel computation component completes the configuration-level optimization of the HDFS file system and the Map/Reduce computation model and integrates seamlessly with the G-cloud cloud platform, so that the infrastructure and basic services provided by the G-cloud cloud platform can be used flexibly.
2. The big data storage optimization method according to claim 1, characterized in that the storage optimization system is configured as follows:
Step 1: optimize the Linux file system mount parameters by adding the noatime option;
Step 2: optimize the NameNode parameter configuration; for massive data file processing, dfs.block.size is set to 64M*N (N = 1, 2, 3, 4), and dfs.namenode.handler.count is raised from its default value to 64;
Step 3: optimize the DataNode; dfs.datanode.handler.count, the number of service threads opened for remote calls on a DataNode, is set to 8;
Step 4: optimize the job tracker (job.tracker) configuration; mapred.job.tracker.handler.count, the number of service threads opened on the job tracker to handle RPCs from the task trackers, is set to 64; mapred.map.tasks, the number of map tasks per job, is set to a value close to the number of hosts in the cluster; mapred.reduce.tasks, the number of reduce tasks per job, is likewise set to a value close to the number of hosts in the cluster;
Step 5: optimize the task tracker (task.tracker) configuration;
mapred.tasktracker.map.tasks.maximum, the maximum number of map tasks that can run concurrently on a task tracker, is set to the number of server CPU cores, or to that number minus 1;
mapred.tasktracker.reduce.tasks.maximum, which controls the number of reduce tasks that can run concurrently on a task tracker, is set to 2; tasktracker.http.threads, the number of threads of the HTTP server running on each TaskTracker and used to serve map task output, can be set to 40-50;
Step 6: optimize the map-side configuration; io.sort.mb can be set to 200 MB; the io.sort.factor property (int type), which sets the maximum number of streams merged at once when sorting files on both the Map side and the Reduce side, is set to 100; the io.file.buffer.size property, which sets the size in bytes of the buffer used for I/O operations in a MapReduce job, is adjusted to 64 KB or 128 KB; the tasktracker.http.threads property (int type), the number of worker threads on each tasktracker in the cluster used to serve map output to the reducers, is increased to between 40 and 50;
Step 7: optimize the reduce-side configuration; mapred.reduce.parallel.copies, which increases the parallelism of the reduce-side copy phase, is adjusted to 20; the mapred.child.java.opts property is adjusted to 2 MB;
the mapred.job.shuffle.input.buffer.percent property is raised appropriately so that Map output does not spill to disk; the mapred.job.shuffle.merge.percent property is raised appropriately to reduce the number of disk spills; the mapred.inmem.merge.threshold property can be set to 0 when the Reduce function requires little memory, so that spilling is controlled solely by the mapred.job.shuffle.merge.percent property; the mapred.job.reduce.input.buffer.percent property is set to 1.0.
3. The big data storage optimization method according to claim 1, characterized in that:
the HDFS distributed file storage workflow is as follows:
Step 1: the client passes identity authentication, establishes a TCP/IP connection, connects to the NameNode through a configurable port and initiates an RPC remote request;
Step 2: the NameNode checks whether the file to be created already exists and whether the creator has permission to operate; on success it creates a record for the file, otherwise it throws an exception back to the client;
Step 3: the client writes the file; the file is cut into multiple packets, which are managed internally as a data queue, while new blocks are requested from the NameNode to obtain a list of suitable DataNodes for storing the replicas; the size of the list is determined by the replication setting in the NameNode;
Step 4: the packets are written to all replicas in pipeline fashion; each packet is streamed to the first DataNode, which stores it and then forwards it to the next DataNode in the pipeline, and so on until the last DataNode;
Step 5: if a DataNode fails during transmission, the current pipeline is closed, the failed DataNode is removed from the pipeline, the remaining blocks continue to be transmitted in pipeline fashion through the remaining DataNodes, and the NameNode allocates a new DataNode to keep the configured number of replicas; the write operation then completes;
Step 6: the NameNode maps the stored block addresses to the communication addresses of the corresponding DataNode blocks and returns some or all of the block list of the file;
Step 7: the NameNode selects the nearest DataNode, the block list is read, and reading of the file begins.
4. The big data storage optimization method according to claim 2, characterized in that:
the HDFS distributed file storage workflow is as follows:
Step 1: the client passes identity authentication, establishes a TCP/IP connection, connects to the NameNode through a configurable port and initiates an RPC remote request;
Step 2: the NameNode checks whether the file to be created already exists and whether the creator has permission to operate; on success it creates a record for the file, otherwise it throws an exception back to the client;
Step 3: the client writes the file; the file is cut into multiple packets, which are managed internally as a data queue, while new blocks are requested from the NameNode to obtain a list of suitable DataNodes for storing the replicas; the size of the list is determined by the replication setting in the NameNode;
Step 4: the packets are written to all replicas in pipeline fashion; each packet is streamed to the first DataNode, which stores it and then forwards it to the next DataNode in the pipeline, and so on until the last DataNode;
Step 5: if a DataNode fails during transmission, the current pipeline is closed, the failed DataNode is removed from the pipeline, the remaining blocks continue to be transmitted in pipeline fashion through the remaining DataNodes, and the NameNode allocates a new DataNode to keep the configured number of replicas; the write operation then completes;
Step 6: the NameNode maps the stored block addresses to the communication addresses of the corresponding DataNode blocks and returns some or all of the block list of the file;
Step 7: the NameNode selects the nearest DataNode, the block list is read, and reading of the file begins.
5. The big data storage optimization method according to any one of claims 1 to 4, characterized in that:
the detailed data processing procedure is:
Step 1: analyze the availability of the multi-source mass information from multiple perspectives such as the information source, the information body and the user request;
Step 2: after multi-source data is organized and aggregated, multiple identical copies may be produced; when a newly added file is aggregated for storage, the system detects the event, computes the digest value of the new file, and requests the new file from the system; the system checks whether the digest value already exists; if it does not, a message is returned allowing the client to aggregate and store the data and the file is newly created; if the digest value exists, the system newly creates the file together with its permission and attribute information, but the file data directly references the existing content and does not need to be aggregated and stored again;
Step 3: use RCFile to compress the data; the relational data is split horizontally and stored column by column within each shard, turning the record-oriented storage structure of the distributed data processing system into a column-oriented one;
Step 4: the storage of unstructured file data is handled by the data cluster; block partitioning and block replication mechanisms are introduced for storage, and data indexing and tree-node optimization are added;
Step 5: adopt transmission channel encryption and data storage encryption, combining symmetric encryption with asymmetric encryption;
Step 6: use a disk array to back up production data in real time; CCIndex is introduced into mass data processing optimization to convert random traversal of the data into efficient traversal by row index, and CCT is introduced to perform record-level column replication for incremental data backup;
Step 7: access the G-cloud cloud platform synchronously and use its computing, virtualization and management resources for mass data processing, deduplication filtering and mining analysis, while introducing operations such as mass data search indexing and tree-node optimization.
6. The big data storage optimization method according to any one of claims 1 to 4, characterized in that:
the detailed HDFS distributed file read procedure is:
Step 1: the client connects to the NameNode through a configurable port; the connection is established over TCP/IP;
Step 2: the client interacts with the NameNode through the ClientProtocol;
Step 3: the DataNodes interact with the NameNode through the DatanodeProtocol and establish connections with the NameNode;
Step 4: each DataNode maintains its communication connection with the NameNode by periodically sending heartbeats and block reports to the NameNode;
Step 5: the block information includes the block attributes, the file the block belongs to, the block address ID, the modification time, and so on;
Step 6: the NameNode responds to RPC requests from the client and the DataNodes and receives heartbeat signals and block status reports from all DataNodes;
Step 7: the block status report is returned to the client; the report contains the complete block list of a given DataNode;
Step 8: based on the address information returned in the block report, the client chooses a DataNode and reads the data;
Step 9: the DataNode connection is closed and the read is finished.
7. The big data storage optimization method according to claim 5, characterized in that:
the detailed HDFS distributed file read procedure is:
Step 1: the client connects to the NameNode through a configurable port; the connection is established over TCP/IP;
Step 2: the client interacts with the NameNode through the ClientProtocol;
Step 3: the DataNodes interact with the NameNode through the DatanodeProtocol and establish connections with the NameNode;
Step 4: each DataNode maintains its communication connection with the NameNode by periodically sending heartbeats and block reports to the NameNode;
Step 5: the block information includes the block attributes, the file the block belongs to, the block address ID, the modification time, and so on;
Step 6: the NameNode responds to RPC requests from the client and the DataNodes and receives heartbeat signals and block status reports from all DataNodes;
Step 7: the block status report is returned to the client; the report contains the complete block list of a given DataNode;
Step 8: based on the address information returned in the block report, the client chooses a DataNode and reads the data;
Step 9: the DataNode connection is closed and the read is finished.
CN201310293482XA 2013-07-12 2013-07-12 Large-data storage and optimization method Pending CN103440244A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310293482XA CN103440244A (en) 2013-07-12 2013-07-12 Large-data storage and optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310293482XA CN103440244A (en) 2013-07-12 2013-07-12 Large-data storage and optimization method

Publications (1)

Publication Number Publication Date
CN103440244A true CN103440244A (en) 2013-12-11

Family

ID=49693935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310293482XA Pending CN103440244A (en) 2013-07-12 2013-07-12 Large-data storage and optimization method

Country Status (1)

Country Link
CN (1) CN103440244A (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077355A (en) * 2014-05-29 2014-10-01 中国银行股份有限公司 Methods, devices and system for storing and inquiring unstructured data
WO2014180411A1 (en) * 2013-12-17 2014-11-13 中兴通讯股份有限公司 Distributed index generation method and device
CN104317738A (en) * 2014-10-24 2015-01-28 中国科学技术大学 Incremental computation method on basis of MapReduce
CN104598562A (en) * 2015-01-08 2015-05-06 浪潮软件股份有限公司 XML file processing method and device based on MapReduce parallel computing model
CN104933042A (en) * 2013-09-29 2015-09-23 国家电网公司 Large-data-volume based database table acquisition optimizing technique
CN105138615A (en) * 2015-08-10 2015-12-09 北京思特奇信息技术股份有限公司 Method and system for building big data distributed log
CN105426493A (en) * 2015-11-24 2016-03-23 北京中电普华信息技术有限公司 Data processing system and method applied to distributed storage system
CN105608160A (en) * 2015-12-21 2016-05-25 浪潮软件股份有限公司 Distributed big data analysis method
CN105677710A (en) * 2015-12-28 2016-06-15 曙光信息产业(北京)有限公司 Processing method and system of big data
CN105787597A (en) * 2016-01-20 2016-07-20 北京优弈数据科技有限公司 Data optimizing processing system
CN105893435A (en) * 2015-12-11 2016-08-24 乐视网信息技术(北京)股份有限公司 Data loading and storing equipment, method and system
CN106021268A (en) * 2015-03-26 2016-10-12 国际商业机器公司 File system block-level tiering and co-allocation
CN106250784A (en) * 2016-07-20 2016-12-21 乐视控股(北京)有限公司 Full disk encryption method and device
CN106339473A (en) * 2016-08-29 2017-01-18 北京百度网讯科技有限公司 Method and device for copying file
CN106570425A (en) * 2015-10-10 2017-04-19 北京奇虎科技有限公司 Hard disk data encryption method and system
CN106845276A (en) * 2017-02-13 2017-06-13 湖南财政经济学院 A kind of big data based on network security implements system
CN107329982A (en) * 2017-06-01 2017-11-07 华南理工大学 A kind of big data parallel calculating method stored based on distributed column and system
CN107342914A (en) * 2017-06-07 2017-11-10 同济大学 A kind of high availability for cloud platform verifies system
CN107480283A (en) * 2017-08-23 2017-12-15 九次方大数据信息集团有限公司 Realize the method, apparatus and storage system of big data quick storage
CN107506394A (en) * 2017-07-31 2017-12-22 武汉工程大学 Optimization method for eliminating big data standard relation connection redundancy
CN108681487A (en) * 2018-05-21 2018-10-19 千寻位置网络有限公司 The distributed system and tuning method of sensing algorithm arameter optimization
WO2019006640A1 (en) * 2017-07-04 2019-01-10 深圳齐心集团股份有限公司 Big data management system
CN109753306A (en) * 2018-12-28 2019-05-14 北京东方国信科技股份有限公司 A kind of big data processing method of because precompiled function caching engine
CN109981674A (en) * 2019-04-04 2019-07-05 北京信而泰科技股份有限公司 A kind of remote procedure calling (PRC) method, device, equipment and medium
CN110268397A (en) * 2016-12-30 2019-09-20 日彩电子科技(深圳)有限公司 Effectively optimizing data layout method applied to data warehouse
CN111539029A (en) * 2020-04-25 2020-08-14 章稳建 Industrial internet-based big data storage rate optimization method and cloud computing center
CN111930731A (en) * 2020-07-28 2020-11-13 苏州亿歌网络科技有限公司 Data dump method, device, equipment and storage medium
CN112084158A (en) * 2020-09-25 2020-12-15 北京百家科技集团有限公司 Data set file compression method and device
CN112134914A (en) * 2020-02-10 2020-12-25 北京天德科技有限公司 Distributed secure storage strategy based on cryptography
CN112597348A (en) * 2020-12-15 2021-04-02 电子科技大学中山学院 Method and device for optimizing big data storage
TWI760403B (en) * 2017-03-23 2022-04-11 韓商愛思開海力士有限公司 Data storage device and operating method thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222092A (en) * 2011-06-03 2011-10-19 复旦大学 Massive high-dimension data clustering method for MapReduce platform
US20120182891A1 (en) * 2011-01-19 2012-07-19 Youngseok Lee Packet analysis system and method using hadoop based parallel computation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120182891A1 (en) * 2011-01-19 2012-07-19 Youngseok Lee Packet analysis system and method using hadoop based parallel computation
CN102222092A (en) * 2011-06-03 2011-10-19 复旦大学 Massive high-dimension data clustering method for MapReduce platform

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
Liu Wenjuan, "Design and Implementation of a Hadoop-Based File Synchronization Storage System", China Master's Theses Full-text Database, Information Science and Technology *
Liu Fei, "Research and Application of a Cloud-Computing-Based Distributed Storage System", China Master's Theses Full-text Database, Information Science and Technology *
Zhang Shu, "Research on Computer-Based Intelligent Monitoring and Control Methods for Machining Centers", China Master's Theses Full-text Database, Engineering Science and Technology I *
Lin Guoqing, "Research on Key Technologies in Network Information Security Systems", China Doctoral Dissertations Full-text Database, Information Science and Technology *
Zha Li, "Big Data Computing Technology Based on Hadoop", E-Science Technology & Application *
Luo Junzhou, "Cloud Computing: Architecture and Key Technologies", Journal on Communications *
Xin Daxin, "Research on Hadoop Cluster Performance Optimization Techniques", Computer Knowledge and Technology *
Gao Jichao, "Research and Optimization of Storage Strategies on the Hadoop Platform", China Master's Theses Full-text Database, Information Science and Technology *
Gong Gaosheng, "Research and Improvement of a General-Purpose Distributed File System", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933042A (en) * 2013-09-29 2015-09-23 国家电网公司 Large-data-volume based database table acquisition optimizing technique
CN104933042B (en) * 2013-09-29 2018-04-13 国家电网公司 Database table optimization of collection technology based on big data quantity
WO2014180411A1 (en) * 2013-12-17 2014-11-13 中兴通讯股份有限公司 Distributed index generation method and device
CN104077355A (en) * 2014-05-29 2014-10-01 中国银行股份有限公司 Methods, devices and system for storing and inquiring unstructured data
CN104317738B (en) * 2014-10-24 2017-07-25 中国科学技术大学 A kind of incremental calculation method based on MapReduce
CN104317738A (en) * 2014-10-24 2015-01-28 中国科学技术大学 Incremental computation method on basis of MapReduce
CN104598562A (en) * 2015-01-08 2015-05-06 浪潮软件股份有限公司 XML file processing method and device based on MapReduce parallel computing model
US10558399B2 (en) 2015-03-26 2020-02-11 International Business Machines Corporation File system block-level tiering and co-allocation
CN106021268B (en) * 2015-03-26 2020-04-10 国际商业机器公司 File system block level layering and co-allocation
US11593037B2 (en) 2015-03-26 2023-02-28 International Business Machines Corporation File system block-level tiering and co-allocation
CN106021268A (en) * 2015-03-26 2016-10-12 国际商业机器公司 File system block-level tiering and co-allocation
CN105138615B (en) * 2015-08-10 2019-02-26 北京思特奇信息技术股份有限公司 A kind of method and system constructing big data distributed information log
CN105138615A (en) * 2015-08-10 2015-12-09 北京思特奇信息技术股份有限公司 Method and system for building big data distributed log
CN106570425A (en) * 2015-10-10 2017-04-19 北京奇虎科技有限公司 Hard disk data encryption method and system
CN105426493A (en) * 2015-11-24 2016-03-23 北京中电普华信息技术有限公司 Data processing system and method applied to distributed storage system
CN105893435A (en) * 2015-12-11 2016-08-24 乐视网信息技术(北京)股份有限公司 Data loading and storing equipment, method and system
CN105608160A (en) * 2015-12-21 2016-05-25 浪潮软件股份有限公司 Distributed big data analysis method
CN105677710A (en) * 2015-12-28 2016-06-15 曙光信息产业(北京)有限公司 Processing method and system of big data
CN105787597A (en) * 2016-01-20 2016-07-20 北京优弈数据科技有限公司 Data optimizing processing system
CN105787597B (en) * 2016-01-20 2019-12-06 大连优弈数据科技有限公司 Data optimization processing system
CN106250784A (en) * 2016-07-20 2016-12-21 乐视控股(北京)有限公司 Full disk encryption method and device
CN106339473A (en) * 2016-08-29 2017-01-18 北京百度网讯科技有限公司 Method and device for copying file
CN110268397A (en) * 2016-12-30 2019-09-20 日彩电子科技(深圳)有限公司 Effectively optimizing data layout method applied to data warehouse
CN106845276A (en) * 2017-02-13 2017-06-13 湖南财政经济学院 A kind of big data based on network security implements system
TWI760403B (en) * 2017-03-23 2022-04-11 韓商愛思開海力士有限公司 Data storage device and operating method thereof
CN107329982A (en) * 2017-06-01 2017-11-07 华南理工大学 A kind of big data parallel calculating method stored based on distributed column and system
CN107342914A (en) * 2017-06-07 2017-11-10 同济大学 A kind of high availability for cloud platform verifies system
WO2019006640A1 (en) * 2017-07-04 2019-01-10 深圳齐心集团股份有限公司 Big data management system
CN107506394A (en) * 2017-07-31 2017-12-22 武汉工程大学 Optimization method for eliminating big data standard relation connection redundancy
CN107480283A (en) * 2017-08-23 2017-12-15 九次方大数据信息集团有限公司 Realize the method, apparatus and storage system of big data quick storage
CN108681487A (en) * 2018-05-21 2018-10-19 千寻位置网络有限公司 The distributed system and tuning method of sensing algorithm arameter optimization
CN108681487B (en) * 2018-05-21 2021-08-24 千寻位置网络有限公司 Distributed system and method for adjusting and optimizing sensor algorithm parameters
CN109753306A (en) * 2018-12-28 2019-05-14 北京东方国信科技股份有限公司 A kind of big data processing method of because precompiled function caching engine
CN109981674B (en) * 2019-04-04 2021-08-17 北京信而泰科技股份有限公司 Remote procedure calling method, device, equipment and medium
CN109981674A (en) * 2019-04-04 2019-07-05 北京信而泰科技股份有限公司 A kind of remote procedure calling (PRC) method, device, equipment and medium
CN112134914A (en) * 2020-02-10 2020-12-25 北京天德科技有限公司 Distributed secure storage strategy based on cryptography
CN112134914B (en) * 2020-02-10 2021-08-06 北京天德科技有限公司 Distributed secure storage strategy based on cryptography
CN111539029A (en) * 2020-04-25 2020-08-14 章稳建 Industrial internet-based big data storage rate optimization method and cloud computing center
CN111930731A (en) * 2020-07-28 2020-11-13 苏州亿歌网络科技有限公司 Data dump method, device, equipment and storage medium
CN112084158A (en) * 2020-09-25 2020-12-15 北京百家科技集团有限公司 Data set file compression method and device
CN112597348A (en) * 2020-12-15 2021-04-02 电子科技大学中山学院 Method and device for optimizing big data storage

Similar Documents

Publication Publication Date Title
CN103440244A (en) Large-data storage and optimization method
US11423015B2 (en) Log-structured storage systems
US11093455B2 (en) Log-structured storage systems
TWI737395B (en) Log-structured storage systems and method
TWI733514B (en) A storage system, a network node of a blockchain network, and a blockchain-based log-structured storage system
CN102411637B (en) Metadata management method of distributed file system
US11422728B2 (en) Log-structured storage systems
WO2019228572A2 (en) Log-structured storage systems
US11294881B2 (en) Log-structured storage systems
EP3695303B1 (en) Log-structured storage systems
US10903981B1 (en) Log-structured storage systems
CN102413172B (en) Parallel data sharing method based on cluster technology and apparatus thereof
US10942852B1 (en) Log-structured storage systems
CN102833580A (en) High-definition video application system and method based on infiniband
CN102480489A (en) Logging method and device used in distributed environment
CN202872848U (en) Cloud storage terminal equipment based on cloud information and cloud computing services

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20131211