CN103440244A - Large-data storage and optimization method - Google Patents
Large-data storage and optimization method
- Publication number
- CN103440244A (application CN201310293482A)
- Authority
- CN
- China
- Prior art keywords
- data
- datanode
- namenode
- optimization
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to the technical field of data processing, in particular to a large-data storage and optimization method oriented to sea-cloud coordination. The method comprises the steps of data preprocessing, calculation optimization, and mass-data optimization. The data preprocessing step comprises data collection, multi-source data organization and aggregation, data redundancy processing, and compressed data storage; the calculation optimization comprises HDFS (Hadoop distributed file system) file transmission optimization and Map/Reduce parallel calculation optimization; and the mass-data optimization step comprises data backup for disaster recovery, data encryption, the CCIndex index, and CCT backup. The large-data storage and optimization method disclosed by the invention can be applied to large-data storage on a cloud platform.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a large-data storage optimization method oriented to sea-cloud coordination.
Background technology
With the rapid development of information technology, traditional persistent storage schemes have found it increasingly difficult to keep up with the architectural demands of information services. The Hadoop distributed system uses distributed algorithms to spread data access and storage across a large number of servers, and can also distribute multiple reliable backup copies of each server's data across the cluster; it is a disruptive departure from the conventional storage architecture. But opportunity coexists with challenge: the open-source distributed architecture still appears cumbersome for distributed applications, and in particular its response performance is insufficient for large-data storage and for frequent file write and read operations.
Summary of the invention
The technical problem solved by the present invention is to provide a large-data storage optimization method oriented to sea-cloud coordination that effectively optimizes large-data storage.
The technical scheme by which the present invention solves the above technical problem is:
The method comprises data preprocessing, calculation optimization, and mass-data optimization. Data preprocessing comprises data acquisition, multi-source data organization and aggregation, data redundancy processing, and compressed data storage; calculation optimization comprises HDFS file transfer optimization and Map/Reduce parallel calculation optimization; and mass-data optimization comprises data backup for disaster recovery, data encryption, the CCIndex index, and CCT backup. Data submitted by a client is gathered by data acquisition, normalized through multi-source data organization and aggregation and through data redundancy processing, and stored compressed with RCFile: the data is split horizontally, a block-then-shard mechanism is introduced (first blocking, then sharding), and storage is by row within a block and by column within a shard. Then, in calculation optimization, CCIndex converts random traversal of the data into traversal by row index, and CCT performs record-level column replication for incremental data backup. In mass-data optimization, the parallel calculation component completes configuration-level optimization of the HDFS file system and the Map/Reduce computation model, integrates seamlessly with the G-cloud cloud platform, and flexibly uses the infrastructure and basic services the G-cloud cloud platform provides.
The storage optimization system is configured as follows:
In the first step, the Linux file system mount parameters are optimized by adding the noatime option;
In the second step, the NameNode parameter configuration is optimized: for massive-data file processing, dfs.block.size is set to 64M*N (N=1, 2, 3, 4), and dfs.namenode.handler.count is set to 64;
In the third step, the DataNode is optimized: dfs.datanode.handler.count, the number of service threads opened for remote calls to the DataNode node, is set to 8;
In the fourth step, the job.tracker monitor node configuration is optimized: mapred.job.tracker.handler.count, the number of service threads on the job tracker that handle RPCs passed in from the task trackers, is set to 64; mapred.map.tasks, the number of map tasks per job, is set to a value very close to the number of hosts in the cluster; mapred.reduce.tasks, the number of reduce tasks per job, is likewise set to a value very close to the number of hosts in the cluster;
In the fifth step, the task.tracker monitor node configuration is optimized:
mapred.tasktracker.map.tasks.maximum, the maximum number of map tasks that can run simultaneously on a task tracker, is set to the number of server CPU cores, or to that number minus 1;
mapred.tasktracker.reduce.tasks.maximum, which limits the number of reduce tasks running simultaneously on a task tracker, is set to 2; tasktracker.http.threads, the thread count of the HTTP server that runs on each TaskTracker to serve map task output, may be set to 40-50;
In the sixth step, the map configuration is optimized: io.sort.mb may be set to 200 MB; the io.sort.factor attribute (int type), which sets the maximum number of streams merged at once when the map and reduce sides sort files, is set to 100; the io.file.buffer.size attribute, the size in bytes of the buffer used for I/O operations in MapReduce jobs, is adjusted to 64 KB or 128 KB; the tasktracker.http.threads attribute (int type), the number of worker threads on each tasktracker in the cluster that pass map output to the reducers, is raised into the 40-50 range;
In the seventh step, the reduce configuration is optimized: mapred.reduce.parallel.copies, the number of parallel copies in the reduce-side copy phase, is adjusted to 20; the mapred.child.java.opts attribute is adjusted to 2 MB;
the mapred.job.shuffle.input.buffer.percent attribute is scaled up appropriately so that map output does not spill to disk; the mapred.job.shuffle.merge.percent attribute is increased appropriately to reduce the number of disk spill writes; the mapred.inmem.merge.threshold attribute may be set to 0 when the reduce function's memory requirement is small, so that the spill process is controlled solely by the mapred.job.shuffle.merge.percent attribute; and the mapred.job.reduce.input.buffer.percent attribute is set to 1.0.
The HDFS distributed file storage workflow is as follows:
In the first step, the client passes authentication, establishes a TCP/IP connection, connects to the NameNode through a configurable port, and initiates an RPC remote request;
In the second step, the NameNode checks whether the file to be created already exists and whether the creator has permission to operate on it; on success it makes a record for the file creation, otherwise it throws an exception back to the client;
In the third step, the client writes the file: the file is cut into multiple packets, which are managed internally in the form of a "data queue", while new blocks are requested from the NameNode to obtain a list of DataNodes suitable for storing the replicas; the size of the list is determined by the replication setting on the NameNode;
In the fourth step, the packets are written to all replicas in the form of a pipeline: each packet is streamed to the first DataNode, which stores it and then passes it on to the next DataNode in the pipeline, and so on until the last DataNode;
In the fifth step, if some DataNode fails during transmission, the current pipeline is closed, the failed DataNode is removed from it, the remaining blocks continue to be transmitted in pipeline form among the remaining DataNodes, and the NameNode allocates a new DataNode at the same time to keep the configured number of replicas; the write operation then completes;
In the sixth step, the NameNode maps the stored data block addresses to the communication addresses of the corresponding DataNode blocks and returns part or all of the file's block list;
In the seventh step, the nearest DataNode node is selected, the block list is read, and reading of the file begins.
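The pipelined replica write and failure handling of the fourth and fifth steps can be sketched in a few lines. The `DataNode`, `NameNode`, and `pipeline_write` names here are toy stand-ins invented for illustration, not the real Hadoop classes; a failed node is dropped and a replacement is allocated to keep the replica count.

```python
# Minimal simulation of the pipelined replica write: packets flow through a
# chain of DataNodes; a failed node is removed from the pipeline and the
# NameNode allocates a spare so the replica count is maintained.

class DataNode:
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.stored = name, healthy, []

class NameNode:
    def __init__(self, spares):
        self.spares = spares           # pool of replacement DataNodes

    def allocate(self):
        return self.spares.pop(0)

def pipeline_write(packets, pipeline, namenode, replicas=3):
    for packet in packets:
        # drop failed nodes, then top the pipeline back up to `replicas`
        pipeline[:] = [dn for dn in pipeline if dn.healthy]
        while len(pipeline) < replicas and namenode.spares:
            pipeline.append(namenode.allocate())
        for dn in pipeline:            # packet passed node-to-node in order
            dn.stored.append(packet)
    return pipeline
```

For example, with a three-node pipeline in which one node has failed and one spare available, both packets end up on three healthy nodes.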
The detailed data processing flow is:
In the first step, the availability of multi-source massive information is analyzed from multiple perspectives such as the information source, the information body, and the user request;
In the second step, after multi-source data is organized and aggregated, multiple identical copies may be produced. When a newly added file is aggregated for storage, the system detects the event and calculates the digest value of the new file. The system checks whether the digest value is already present in the system: if not, it returns a message allowing the client to aggregate and store the data and creates the new file; if the digest value exists, the system creates the file and its corresponding permission and attribute information, but the file data directly references the existing content and need not be aggregated and stored again;
In the third step, RCFile is adopted to compress the data: the relational data is split horizontally, the data within each shard is stored in column order, and the record-oriented storage structure is turned into a column-oriented storage structure in the distributed data processing system;
In the fourth step, unstructured file data is stored by the data cluster, a data blocking and block replication mechanism is introduced, and a data index and tree-node optimization are added;
In the fifth step, transmission channel encryption and data storage encryption are adopted, combining symmetric encryption with asymmetric encryption;
In the sixth step, a disk array backs up production data in real time; CCIndex is introduced into mass-data processing optimization to convert random data traversal into efficient traversal by row index, and CCT is introduced to perform record-level column replication for incremental data backup;
In the seventh step, the G-cloud cloud platform is accessed synchronously, and its computing, virtualization, and management resources are used for massive data processing, deduplication filtering, and mining analysis, while operations such as mass-data search indexing and tree-node optimization are introduced.
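The digest-based deduplication of the second step can be sketched as follows. The class name, the in-memory dictionaries, and SHA-256 as the digest are illustrative assumptions, since the patent does not name a concrete digest algorithm; the point is that identical content is stored once and later files only reference it.

```python
# Sketch of digest-based deduplicated ingest: a new file's digest is checked
# against the store; on a hit, only metadata is created and the payload is a
# reference to the existing content.

import hashlib

class DedupStore:
    def __init__(self):
        self.blobs = {}      # digest -> file content (stored once)
        self.catalog = {}    # filename -> (digest, attributes)

    def ingest(self, name, data, attrs=None):
        digest = hashlib.sha256(data).hexdigest()
        stored_new = digest not in self.blobs
        if stored_new:                      # no existing copy: store payload
            self.blobs[digest] = data
        # metadata (name, permissions, attributes) is always created anew;
        # the data itself is only referenced via the digest
        self.catalog[name] = (digest, attrs or {})
        return stored_new
    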
The detailed HDFS distributed file read flow is:
In the first step, the client connects to the NameNode through a configurable port; the connection is established over the TCP/IP protocol;
In the second step, the client interacts with the NameNode through the ClientProtocol;
In the third step, the DataNodes interact with the NameNode using the DatanodeProtocol and establish connections with the NameNode;
In the fourth step, each DataNode maintains its communication link with the NameNode by periodically sending heartbeats and data block reports to it;
In the fifth step, the data block information comprises the attributes of the block: which file the block belongs to, the block address ID, the modification time, and so on;
In the sixth step, the NameNode responds to RPC requests from the client and the DataNodes, and receives heartbeat signals and block status reports from all DataNodes;
In the seventh step, a block status report is returned to the client; the status report contains the complete data block list of a given DataNode;
In the eighth step, the client, according to the address information returned in the block report, chooses a DataNode node to read data from;
In the ninth step, the DataNode connection is closed and the read ends.
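A toy version of the read path in the sixth through ninth steps: the NameNode answers with block locations, and the client reads each block from the nearest replica. `ToyNameNode`, `read_file`, and the plain numeric distances are invented for this sketch; the real system derives distance from network topology.

```python
# Toy read path: NameNode returns block -> {datanode: distance}; the client
# picks the closest replica per block and concatenates the block data.

class ToyNameNode:
    def __init__(self, block_map):
        self.block_map = block_map   # block id -> {datanode name: distance}

    def get_block_locations(self, block_ids):
        return {b: self.block_map[b] for b in block_ids}

def read_file(namenode, block_ids, datanode_data):
    locations = namenode.get_block_locations(block_ids)
    out = b""
    for b in block_ids:
        nearest = min(locations[b], key=locations[b].get)  # closest replica
        out += datanode_data[nearest][b]                   # read the block
    return out
```

With two blocks whose nearest replicas live on different nodes, the client assembles the file from both.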
The present invention realizes HDFS file transfer optimization, Map/Reduce parallel calculation optimization, and mass-data query optimization, achieving the following performance indices: a stable and efficient large-data storage optimization method with optimized mass-data query processing; good scalability, supporting a storage capacity of no less than the 100 PB level and expandable to EB-level storage; good reliability and security, with a multi-copy redundancy protection mechanism for critical data in which the number of copies is no less than 3; and an off-site data disaster recovery and backup system. Based on the G-cloud platform, resources are taken up elastically, the system has good response speed, and mass-data analysis and mining services are supported.
Brief description of the drawings
The present invention is further described below in conjunction with the accompanying drawings.
Fig. 1 is a schematic diagram of the system architecture of the present invention;
Fig. 2 is a schematic diagram of the unstructured data storage system;
Fig. 3 is a schematic diagram of the HDFS distributed file system of the sea-cloud collaborative platform;
Fig. 4 is a schematic diagram of the network topology of the present invention.
Embodiment
The present invention proposes a large-data storage optimization method based on the G-cloud cloud platform. The JobClient submits data to the data acquisition system (DAS); data preprocessing normalizes the data the JobClient submits; the data compression technique adopts the efficient storage structure RCFile, splitting the data horizontally and introducing a block-then-shard mechanism (first blocking, then sharding) that stores by row within a block and by column within a shard. CCIndex is introduced into mass-data processing optimization to convert random data traversal into efficient traversal by row index, and CCT is introduced to perform record-level column replication for incremental data backup. The parallel calculation component completes configuration-level optimization of the HDFS file system and the Map/Reduce computation model, provides a fault-tolerant, high-throughput mass-data storage scheme, significantly improves file processing and calculation performance, integrates seamlessly with the G-cloud cloud platform, flexibly uses the infrastructure and basic services the G-cloud cloud platform provides, and supports virtualization of large-scale computing, storage, and network resources as well as data analysis management.
As shown in Fig. 1, the detailed flow of the storage optimization method of the present invention is:
In the first step, the Linux file system mount parameters are optimized by adding the noatime option. Linux provides the noatime option to stop recording last-access timestamps; mounting the file system with it markedly improves disk I/O efficiency, and after the setting is changed only a remount of the file system is needed, not a reboot, for it to take effect;
In the second step, the NameNode parameter configuration is optimized: for massive-data file processing, dfs.block.size is set to 64M*N (N=1, 2, 3, 4); dfs.namenode.handler.count defaults to 10 and is set to 64 for a massive-data cluster;
In the third step, the DataNode is optimized: dfs.datanode.handler.count, the number of service threads opened for remote calls to the DataNode node, defaults to 3 and is set to 8 in the present invention;
In the fourth step, the job.tracker monitor node configuration is optimized: mapred.job.tracker.handler.count, the number of service threads on the job tracker that handle RPCs passed in from the task trackers, is generally set to 4% of the number of task tracker nodes and is set to 64 in the present invention. mapred.map.tasks, the number of map tasks per job, is usually set to a value very close to the number of hosts in the cluster, as is mapred.reduce.tasks, the number of reduce tasks per job;
In the fifth step, the task.tracker monitor node configuration is optimized:
mapred.tasktracker.map.tasks.maximum, the maximum number of map tasks that can run simultaneously on a task tracker, yields the highest operating efficiency when set to the number of server CPU cores or to that number minus 1. mapred.tasktracker.reduce.tasks.maximum, which limits the number of reduce tasks running simultaneously on a task tracker, is set to 2 in the present invention. tasktracker.http.threads is the thread count of the HTTP server that runs on each TaskTracker to serve map task output; a large data cluster may set it to 40-50;
In the sixth step, the map configuration is optimized: io.sort.mb (default 100 MB) may be set to 200 MB for a large cluster; the io.sort.factor attribute (int type), the maximum number of streams merged at once when the map and reduce sides sort files, defaults to 10 and is increased to 100. The io.file.buffer.size attribute, the size in bytes of the buffer used for I/O operations in MapReduce jobs, defaults to 4 KB and is adjusted to 64 KB or 128 KB. The tasktracker.http.threads attribute (int type), the number of worker threads on each tasktracker in the cluster that pass map output to the reducers, defaults to 40 and may be raised into the 40-50 range; increasing the thread count improves cluster performance;
In the seventh step, the reduce configuration is optimized: mapred.reduce.parallel.copies, the number of parallel copies in the reduce-side copy phase, defaults to 5 and is adjusted to 20 in the present invention. The mapred.child.java.opts attribute is adjusted to 2 MB to improve MapReduce job performance. The mapred.job.shuffle.input.buffer.percent attribute defaults to 0.70 and is scaled up appropriately so that map output does not spill to disk;
The mapred.job.shuffle.merge.percent attribute is increased appropriately to reduce the number of disk spill writes. The mapred.inmem.merge.threshold attribute defaults to 1000; when the reduce function's memory requirement is small it may be set to 0, removing the threshold limit so that the spill process is controlled solely by the mapred.job.shuffle.merge.percent attribute. The mapred.job.reduce.input.buffer.percent attribute is set to 1.0;
In the eighth step, CCIndex is introduced into mass-data processing optimization to convert random data traversal into efficient traversal by row index, and CCT is introduced to perform record-level column replication for incremental data backup.
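The defaults and tuned values quoted in the steps above can be collected into one reference table. The pairs below follow the text; the tuned value for mapred.job.shuffle.input.buffer.percent, which the text only says to scale up appropriately, is shown as an assumed 0.80.

```python
# (default, tuned) pairs for the classic Hadoop 1.x property keys used above.
DEFAULT_VS_TUNED = {
    "dfs.namenode.handler.count":              (10,   64),
    "dfs.datanode.handler.count":              (3,    8),
    "mapred.reduce.parallel.copies":           (5,    20),
    "io.sort.mb":                              (100,  200),
    "io.sort.factor":                          (10,   100),
    "io.file.buffer.size":                     (4096, 65536),
    "mapred.job.shuffle.input.buffer.percent": (0.70, 0.80),  # assumed tuned value
    "mapred.inmem.merge.threshold":            (1000, 0),     # 0 removes the threshold
    "mapred.job.reduce.input.buffer.percent":  (0.0,  1.0),
}

def raised(table):
    """Keys whose tuned value is higher than the stock default."""
    return sorted(k for k, (default, tuned) in table.items() if tuned > default)
```

Most properties are raised to add concurrency or buffering; mapred.inmem.merge.threshold is the one lowered, since setting it to 0 hands spill control to the merge-percent attribute.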
As shown in Fig. 2, the detailed flow of unstructured data storage is:
In the first step, massive multi-source data often contains unclean and non-standard forms, which pose a potential risk to the use and statistical analysis of the application system; the data must be converted into standardized data of the system platform through the data preprocessing standard;
In the second step, the availability of the multi-source massive information is analyzed from multiple perspectives such as the information source, the information body, and the user request, and an availability assessment inference model that fits how the information develops and is applied is established;
In the third step, after multi-source data is organized and aggregated, multiple identical copies of the same file may be produced. When a newly added file is aggregated for storage, the system detects the event and calculates the digest value of the new file. The system checks whether the digest value is already present in the system: if not, it returns a message allowing the client to aggregate and store the data and creates the new file. If the digest value exists, the system creates the file and its corresponding permission and attribute information, but the file data directly references the existing content and need not be aggregated and stored again;
In the fourth step, the present invention adopts an efficient data storage structure, RCFile (Record Columnar File), to compress the data. The RCFile storage structure is based on the Hadoop system; it combines the advantages of row storage and column storage and follows the design concept of "horizontal partition first, vertical partition second";
In the fifth step, unstructured file data is stored by the data cluster, a data blocking and block replication mechanism is introduced, and, to speed up data retrieval, a data index and tree-node optimization are added;
In the sixth step, to increase data security, transmission channel encryption and data storage encryption are adopted, combining a symmetric encryption algorithm with an asymmetric encryption algorithm;
In the seventh step, a disk array backs up production data in real time.
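The "horizontal partition first, vertical partition second" layout of the fourth step can be illustrated in a few lines: rows are first cut into row groups, and within each group the values are laid out column by column. The `rcfile_layout` function and the group size of 2 are illustrative only; real RCFile row groups are far larger and compressed per column.

```python
# Illustration of the RCFile layout: horizontal partition into row groups,
# then vertical partition (transpose) so each column is contiguous per group.

def rcfile_layout(rows, group_size):
    """Split `rows` (equal-length tuples) into row groups stored column-wise.

    Returns a list of groups, each group a list of columns."""
    groups = []
    for i in range(0, len(rows), group_size):
        group = rows[i:i + group_size]
        # vertical partition: transpose the group so columns are contiguous
        groups.append([list(col) for col in zip(*group)])
    return groups
```

Storing columns contiguously within each row group is what lets column-oriented compression and column-local scans work while rows of one record still stay on one node.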
As shown in Fig. 3, the detailed flow of sea-cloud collaborative distributed file storage is:
In the first step, the client connects to the NameNode through a configurable port; the connection is established over the TCP/IP protocol;
In the second step, the client interacts with the NameNode through the ClientProtocol;
In the third step, the DataNodes interact with the NameNode using the DatanodeProtocol and establish connections with the NameNode;
In the fourth step, each DataNode maintains its communication link with the NameNode by periodically sending heartbeats and data block reports to it;
In the fifth step, the data block information comprises the attributes of the block: which file the block belongs to, the block address ID, the modification time, and so on;
In the sixth step, the NameNode responds to RPC requests from the client and the DataNodes, and receives heartbeat signals and block status reports from all DataNodes;
In the seventh step, a block status report is returned to the client; the status report contains the complete data block list of a given DataNode;
In the eighth step, the client, according to the address information returned in the block report, chooses a DataNode node to read data from;
In the ninth step, the DataNode connection is closed and the read ends.
As shown in Fig. 4, the present invention consists of three parts: mass-data storage management, the distributed data platform, and the G-cloud cloud operating system. The client passes authentication, establishes a TCP/IP connection, connects to the NameNode through a configurable port to initiate RPC requests, and interacts with the distributed file processing domain to store data; at the bottom layer it accesses the cloud platform and flexibly uses the cloud infrastructure and basic services for data mining and analysis, completing the sea-cloud collaborative calculation service.
Claims (7)
1. A large-data storage and optimization method, characterized in that: the method comprises data preprocessing, calculation optimization, and mass-data optimization; the data preprocessing comprises data acquisition, multi-source data organization and aggregation, data redundancy processing, and compressed data storage; the calculation optimization comprises HDFS file transfer optimization and Map/Reduce parallel calculation optimization; and the mass-data optimization comprises data backup for disaster recovery, data encryption, the CCIndex index, and CCT backup; data submitted by a client is gathered by data acquisition, normalized through multi-source data organization and aggregation and through data redundancy processing, and stored compressed with RCFile, the data being split horizontally under a block-then-shard mechanism (first blocking, then sharding) that stores by row within a block and by column within a shard; then, in the calculation optimization, CCIndex converts random data traversal into traversal by row index, and CCT performs record-level column replication for incremental data backup; and in the mass-data optimization, the parallel calculation component completes configuration-level optimization of the HDFS file system and the Map/Reduce computation model, integrates seamlessly with the G-cloud cloud platform, and flexibly uses the infrastructure and basic services the G-cloud cloud platform provides.
2. The large-data storage and optimization method according to claim 1, characterized in that the storage optimization system is configured as follows:
In the first step, the Linux file system mount parameters are optimized by adding the noatime option;
In the second step, the NameNode parameter configuration is optimized: for massive-data file processing, dfs.block.size is set to 64M*N (N=1, 2, 3, 4), and dfs.namenode.handler.count is set to 64;
In the third step, the DataNode is optimized: dfs.datanode.handler.count, the number of service threads opened for remote calls to the DataNode node, is set to 8;
In the fourth step, the job.tracker monitor node configuration is optimized: mapred.job.tracker.handler.count, the number of service threads on the job tracker that handle RPCs passed in from the task trackers, is set to 64; mapred.map.tasks, the number of map tasks per job, is set to a value very close to the number of hosts in the cluster; mapred.reduce.tasks, the number of reduce tasks per job, is likewise set to a value very close to the number of hosts in the cluster;
In the fifth step, the task.tracker monitor node configuration is optimized:
mapred.tasktracker.map.tasks.maximum, the maximum number of map tasks that can run simultaneously on a task tracker, is set to the number of server CPU cores, or to that number minus 1;
mapred.tasktracker.reduce.tasks.maximum, which limits the number of reduce tasks running simultaneously on a task tracker, is set to 2; tasktracker.http.threads, the thread count of the HTTP server that runs on each TaskTracker to serve map task output, may be set to 40-50;
In the sixth step, the map configuration is optimized: io.sort.mb may be set to 200 MB; the io.sort.factor attribute (int type), which sets the maximum number of streams merged at once when the map and reduce sides sort files, is set to 100; the io.file.buffer.size attribute, the size in bytes of the buffer used for I/O operations in MapReduce jobs, is adjusted to 64 KB or 128 KB; the tasktracker.http.threads attribute (int type), the number of worker threads on each tasktracker in the cluster that pass map output to the reducers, is raised into the 40-50 range;
In the seventh step, the reduce configuration is optimized: mapred.reduce.parallel.copies, the number of parallel copies in the reduce-side copy phase, is adjusted to 20; the mapred.child.java.opts attribute is adjusted to 2 MB;
the mapred.job.shuffle.input.buffer.percent attribute is scaled up appropriately so that map output does not spill to disk; the mapred.job.shuffle.merge.percent attribute is increased appropriately to reduce the number of disk spill writes; the mapred.inmem.merge.threshold attribute may be set to 0 when the reduce function's memory requirement is small, so that the spill process is controlled solely by the mapred.job.shuffle.merge.percent attribute; and the mapred.job.reduce.input.buffer.percent attribute is set to 1.0.
3. The large-data storage and optimization method according to claim 1, characterized in that:
the HDFS distributed file storage workflow is as follows:
The first step, the client passes authentication, establishes a TCP/IP connection, connects to the NameNode through a configurable port, and initiates an RPC remote request;
The second step, the NameNode checks whether the file to be created already exists and whether the creator has permission to operate on it; on success it creates a record for the file, otherwise it throws an exception to the client;
The third step, the client writes the file: the file is cut into multiple packets, which are managed internally in the form of a "data queue"; at the same time, new blocks are requested from the NameNode to obtain a list of DataNodes suitable for storing the replicas, the size of the list being determined by the replication setting in the NameNode;
The fourth step, the packets are written to all replicas in pipeline form: a packet is written to the first DataNode in a streaming manner; after this DataNode has stored the packet, it passes it on to the next DataNode in the pipeline, and so on to the last DataNode;
The fifth step, if a DataNode fails during transmission, the current pipeline is closed, the failed DataNode is removed from the current pipeline, and the remaining blocks continue to be transmitted in pipeline form among the remaining DataNodes; at the same time the NameNode allocates a new DataNode to keep the set number of replicas, and the write operation completes;
The sixth step, the NameNode maps the stored data block addresses to the communication addresses of the corresponding DataNode blocks and returns some or all of the file's block list;
The seventh step, the NameNode selects the nearest DataNode node, reads the block list, and begins reading the file.
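The pipelined write and the failure handling in the fourth and fifth steps above can be sketched as follows; `pipeline_write`, the node names, and the spare-node list are illustrative stand-ins, not HDFS internals:

```python
def pipeline_write(packets, pipeline, spare_nodes, fails_at=None):
    """Stream each packet through the DataNode pipeline (step 4); on a
    node failure, drop the bad node and let the NameNode supply a
    replacement so the replica count is preserved (step 5)."""
    stored = {node: [] for node in pipeline}
    for packet in packets:
        if fails_at is not None and fails_at in pipeline:
            pipeline.remove(fails_at)          # close the pipeline around the bad node
            replacement = spare_nodes.pop(0)   # NameNode allocates a new DataNode
            stored[replacement] = list(stored[pipeline[0]])  # copy packets written so far
            pipeline.append(replacement)
            fails_at = None
        for node in pipeline:                  # the packet flows node -> node
            stored[node].append(packet)
    return pipeline, stored

# dn2 fails mid-write; dn4 takes its place and the replica count stays at 3.
final_pipeline, stored = pipeline_write(
    ["p1", "p2", "p3"], ["dn1", "dn2", "dn3"], spare_nodes=["dn4"], fails_at="dn2")
```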
4. The large-data storage and optimization method according to claim 2, characterized in that:
the HDFS distributed file storage workflow is as follows:
The first step, the client passes authentication, establishes a TCP/IP connection, connects to the NameNode through a configurable port, and initiates an RPC remote request;
The second step, the NameNode checks whether the file to be created already exists and whether the creator has permission to operate on it; on success it creates a record for the file, otherwise it throws an exception to the client;
The third step, the client writes the file: the file is cut into multiple packets, which are managed internally in the form of a "data queue"; at the same time, new blocks are requested from the NameNode to obtain a list of DataNodes suitable for storing the replicas, the size of the list being determined by the replication setting in the NameNode;
The fourth step, the packets are written to all replicas in pipeline form: a packet is written to the first DataNode in a streaming manner; after this DataNode has stored the packet, it passes it on to the next DataNode in the pipeline, and so on to the last DataNode;
The fifth step, if a DataNode fails during transmission, the current pipeline is closed, the failed DataNode is removed from the current pipeline, and the remaining blocks continue to be transmitted in pipeline form among the remaining DataNodes; at the same time the NameNode allocates a new DataNode to keep the set number of replicas, and the write operation completes;
The sixth step, the NameNode maps the stored data block addresses to the communication addresses of the corresponding DataNode blocks and returns some or all of the file's block list;
The seventh step, the NameNode selects the nearest DataNode node, reads the block list, and begins reading the file.
5. The large-data storage and optimization method according to any one of claims 1 to 4, characterized in that:
the detailed data processing process is as follows:
The first step, the availability of multi-source massive information is analyzed from multiple perspectives such as the information source, the information body, and user requests;
The second step, multi-source data may produce multiple identical copies after organization and convergence; when a newly added file is converged for storage, the system monitors the event, calculates the digest value of the new file, and requests the system for the new file; the system checks whether the digest value is already present in the system; if not, a message is returned allowing the client to converge and store the data, and the file is newly created; if the digest value already exists, the system creates the file and its corresponding permission and attribute information, but the file data directly references the existing data content and does not need to be converged and stored again;
The third step, RCFile is adopted to compress the data: the relational data is split horizontally, stored in column order within each slice, and the record-oriented storage structure is converted into a column-oriented storage structure in the distributed data processing system;
The fourth step, the storage of unstructured file data is handled by the data cluster; data partitioning and a block replica mechanism are introduced for storage, and data indexes and tree node optimization are added;
The fifth step, transmission channel encryption and data storage encryption are adopted, combining symmetric encryption with asymmetric encryption;
The sixth step, a disk array is used to back up production data in real time; CCIndex is introduced into mass data processing optimization to convert random traversal of the data into efficient traversal by row index, and CCT is introduced to perform record-level column replication for incremental data backup;
The seventh step, the G-cloud cloud platform is accessed synchronously, and computing resources, virtual resources, management resources, etc. are used to perform massive data processing, deduplication filtering, and mining analysis, while operations such as mass data search indexing and tree node optimization are introduced.
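The digest-based deduplication in the second step above can be sketched with an in-memory store; the `DedupStore` class and its layout are illustrative assumptions, not the patent's actual system. Identical content is stored once, and a later file with the same digest only adds metadata (permissions, attributes) that references the existing content:

```python
import hashlib

class DedupStore:
    def __init__(self):
        self.blobs = {}   # digest -> content, each distinct content stored once
        self.files = {}   # filename -> (digest, attributes)

    def put(self, name, content, attrs=None):
        """Store a file; return True if its data was newly stored,
        False if only metadata referencing existing data was created."""
        digest = hashlib.sha256(content).hexdigest()
        newly_stored = digest not in self.blobs
        if newly_stored:                 # unknown digest: store the data itself
            self.blobs[digest] = content
        # known digest: create only the file record referencing existing data
        self.files[name] = (digest, attrs or {})
        return newly_stored

    def get(self, name):
        digest, _ = self.files[name]
        return self.blobs[digest]

store = DedupStore()
first = store.put("a.txt", b"same bytes")    # new digest: data is stored
second = store.put("b.txt", b"same bytes")   # known digest: metadata only
```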
6. The large-data storage and optimization method according to any one of claims 1 to 4, characterized in that:
the detailed HDFS distributed file reading process is as follows:
The first step, the client connects to the NameNode through a configurable port; this connection is established over the TCP/IP protocol;
The second step, the client interacts with the NameNode through ClientProtocol;
The third step, the DataNodes interact with the NameNode using DatanodeProtocol and establish connections with the NameNode;
The fourth step, each DataNode maintains its communication connection with the NameNode by periodically sending heartbeats and data block reports to the NameNode;
The fifth step, the data block information includes the attributes of the data block, which file the data block belongs to, the data block address ID, the modification time, etc.;
The sixth step, the NameNode responds to RPC requests from the client and the DataNodes, and receives heartbeat signals and block state reports from all DataNodes;
The seventh step, a block state report is returned to the client; the status report contains the complete data block list of a given DataNode;
The eighth step, the client selects a DataNode node to read data from according to the address information returned in the block report;
The ninth step, the DataNode connection is closed, and the read is complete.
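The fourth through eighth steps above amount to a small bookkeeping protocol, sketched below; the class, the method names, and the block ids are illustrative assumptions, not the actual DatanodeProtocol API. DataNodes report heartbeats and blocks, the NameNode aggregates them, and a client picks a DataNode from the addresses returned for a block:

```python
class NameNode:
    def __init__(self):
        self.heartbeats = {}   # datanode -> heartbeat count (step 4)
        self.block_map = {}    # block id -> set of DataNodes reporting it

    def receive_heartbeat(self, datanode):
        self.heartbeats[datanode] = self.heartbeats.get(datanode, 0) + 1

    def receive_block_report(self, datanode, blocks):
        for block in blocks:                      # steps 5-6: block reports
            self.block_map.setdefault(block, set()).add(datanode)

    def locate(self, block):
        """Return the DataNodes known to hold this block (step 7's report)."""
        return sorted(self.block_map.get(block, set()))

nn = NameNode()
nn.receive_heartbeat("dn1"); nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_heartbeat("dn2"); nn.receive_block_report("dn2", ["blk_2"])
# step 8: the client chooses a DataNode from the returned addresses
chosen = nn.locate("blk_2")[0]
```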
7. The large-data storage and optimization method according to claim 5, characterized in that:
the detailed HDFS distributed file reading process is as follows:
The first step, the client connects to the NameNode through a configurable port; this connection is established over the TCP/IP protocol;
The second step, the client interacts with the NameNode through ClientProtocol;
The third step, the DataNodes interact with the NameNode using DatanodeProtocol and establish connections with the NameNode;
The fourth step, each DataNode maintains its communication connection with the NameNode by periodically sending heartbeats and data block reports to the NameNode;
The fifth step, the data block information includes the attributes of the data block, which file the data block belongs to, the data block address ID, the modification time, etc.;
The sixth step, the NameNode responds to RPC requests from the client and the DataNodes, and receives heartbeat signals and block state reports from all DataNodes;
The seventh step, a block state report is returned to the client; the status report contains the complete data block list of a given DataNode;
The eighth step, the client selects a DataNode node to read data from according to the address information returned in the block report;
The ninth step, the DataNode connection is closed, and the read is complete.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310293482XA CN103440244A (en) | 2013-07-12 | 2013-07-12 | Large-data storage and optimization method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310293482XA CN103440244A (en) | 2013-07-12 | 2013-07-12 | Large-data storage and optimization method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103440244A true CN103440244A (en) | 2013-12-11 |
Family
ID=49693935
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310293482XA Pending CN103440244A (en) | 2013-07-12 | 2013-07-12 | Large-data storage and optimization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103440244A (en) |
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104077355A (en) * | 2014-05-29 | 2014-10-01 | 中国银行股份有限公司 | Methods, devices and system for storing and inquiring unstructured data |
WO2014180411A1 (en) * | 2013-12-17 | 2014-11-13 | 中兴通讯股份有限公司 | Distributed index generation method and device |
CN104317738A (en) * | 2014-10-24 | 2015-01-28 | 中国科学技术大学 | Incremental computation method on basis of MapReduce |
CN104598562A (en) * | 2015-01-08 | 2015-05-06 | 浪潮软件股份有限公司 | XML file processing method and device based on MapReduce parallel computing model |
CN104933042A (en) * | 2013-09-29 | 2015-09-23 | 国家电网公司 | Large-data-volume based database table acquisition optimizing technique |
CN105138615A (en) * | 2015-08-10 | 2015-12-09 | 北京思特奇信息技术股份有限公司 | Method and system for building big data distributed log |
CN105426493A (en) * | 2015-11-24 | 2016-03-23 | 北京中电普华信息技术有限公司 | Data processing system and method applied to distributed storage system |
CN105608160A (en) * | 2015-12-21 | 2016-05-25 | 浪潮软件股份有限公司 | Distributed big data analysis method |
CN105677710A (en) * | 2015-12-28 | 2016-06-15 | 曙光信息产业(北京)有限公司 | Processing method and system of big data |
CN105787597A (en) * | 2016-01-20 | 2016-07-20 | 北京优弈数据科技有限公司 | Data optimizing processing system |
CN105893435A (en) * | 2015-12-11 | 2016-08-24 | 乐视网信息技术(北京)股份有限公司 | Data loading and storing equipment, method and system |
CN106021268A (en) * | 2015-03-26 | 2016-10-12 | 国际商业机器公司 | File system block-level tiering and co-allocation |
CN106250784A (en) * | 2016-07-20 | 2016-12-21 | 乐视控股(北京)有限公司 | Full disk encryption method and device |
CN106339473A (en) * | 2016-08-29 | 2017-01-18 | 北京百度网讯科技有限公司 | Method and device for copying file |
CN106570425A (en) * | 2015-10-10 | 2017-04-19 | 北京奇虎科技有限公司 | Hard disk data encryption method and system |
CN106845276A (en) * | 2017-02-13 | 2017-06-13 | 湖南财政经济学院 | A kind of big data based on network security implements system |
CN107329982A (en) * | 2017-06-01 | 2017-11-07 | 华南理工大学 | A kind of big data parallel calculating method stored based on distributed column and system |
CN107342914A (en) * | 2017-06-07 | 2017-11-10 | 同济大学 | A kind of high availability for cloud platform verifies system |
CN107480283A (en) * | 2017-08-23 | 2017-12-15 | 九次方大数据信息集团有限公司 | Realize the method, apparatus and storage system of big data quick storage |
CN107506394A (en) * | 2017-07-31 | 2017-12-22 | 武汉工程大学 | Optimization method for eliminating big data standard relation connection redundancy |
CN108681487A (en) * | 2018-05-21 | 2018-10-19 | 千寻位置网络有限公司 | The distributed system and tuning method of sensing algorithm arameter optimization |
WO2019006640A1 (en) * | 2017-07-04 | 2019-01-10 | 深圳齐心集团股份有限公司 | Big data management system |
CN109753306A (en) * | 2018-12-28 | 2019-05-14 | 北京东方国信科技股份有限公司 | A kind of big data processing method of because precompiled function caching engine |
CN109981674A (en) * | 2019-04-04 | 2019-07-05 | 北京信而泰科技股份有限公司 | A kind of remote procedure calling (PRC) method, device, equipment and medium |
CN110268397A (en) * | 2016-12-30 | 2019-09-20 | 日彩电子科技(深圳)有限公司 | Effectively optimizing data layout method applied to data warehouse |
CN111539029A (en) * | 2020-04-25 | 2020-08-14 | 章稳建 | Industrial internet-based big data storage rate optimization method and cloud computing center |
CN111930731A (en) * | 2020-07-28 | 2020-11-13 | 苏州亿歌网络科技有限公司 | Data dump method, device, equipment and storage medium |
CN112084158A (en) * | 2020-09-25 | 2020-12-15 | 北京百家科技集团有限公司 | Data set file compression method and device |
CN112134914A (en) * | 2020-02-10 | 2020-12-25 | 北京天德科技有限公司 | Distributed secure storage strategy based on cryptography |
CN112597348A (en) * | 2020-12-15 | 2021-04-02 | 电子科技大学中山学院 | Method and device for optimizing big data storage |
TWI760403B (en) * | 2017-03-23 | 2022-04-11 | 韓商愛思開海力士有限公司 | Data storage device and operating method thereof |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102222092A (en) * | 2011-06-03 | 2011-10-19 | 复旦大学 | Massive high-dimension data clustering method for MapReduce platform |
US20120182891A1 (en) * | 2011-01-19 | 2012-07-19 | Youngseok Lee | Packet analysis system and method using hadoop based parallel computation |
- 2013-07-12: application CN201310293482XA filed in CN; patent CN103440244A (en), status Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120182891A1 (en) * | 2011-01-19 | 2012-07-19 | Youngseok Lee | Packet analysis system and method using hadoop based parallel computation |
CN102222092A (en) * | 2011-06-03 | 2011-10-19 | 复旦大学 | Massive high-dimension data clustering method for MapReduce platform |
Non-Patent Citations (9)
Title |
---|
LIU Wenjuan: "Design and Implementation of a Hadoop-Based File Synchronization Storage System", China Master's Theses Full-text Database, Information Science and Technology *
LIU Fei: "Research and Application of a Cloud-Computing-Based Distributed Storage System", China Master's Theses Full-text Database, Information Science and Technology *
ZHANG Shu: "Research on Computer Intelligent Monitoring and Control Methods for Machining Centers", China Master's Theses Full-text Database, Engineering Science and Technology I *
LIN Guoqing: "Research on Key Technologies in Network Information Security Systems", China Doctoral Dissertations Full-text Database, Information Science and Technology *
ZHA Li: "Hadoop-Based Big Data Computing Technology", e-Science Technology & Application *
LUO Junzhou: "Cloud Computing: Architecture and Key Technologies", Journal on Communications *
XIN Daxin: "Research on Hadoop Cluster Performance Optimization", Computer Knowledge and Technology *
GAO Jichao: "Research and Optimization of Storage Strategies on the Hadoop Platform", China Master's Theses Full-text Database, Information Science and Technology *
GONG Gaosheng: "Research and Improvement of a General-Purpose Distributed File System", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104933042A (en) * | 2013-09-29 | 2015-09-23 | 国家电网公司 | Large-data-volume based database table acquisition optimizing technique |
CN104933042B (en) * | 2013-09-29 | 2018-04-13 | 国家电网公司 | Database table optimization of collection technology based on big data quantity |
WO2014180411A1 (en) * | 2013-12-17 | 2014-11-13 | 中兴通讯股份有限公司 | Distributed index generation method and device |
CN104077355A (en) * | 2014-05-29 | 2014-10-01 | 中国银行股份有限公司 | Methods, devices and system for storing and inquiring unstructured data |
CN104317738B (en) * | 2014-10-24 | 2017-07-25 | 中国科学技术大学 | A kind of incremental calculation method based on MapReduce |
CN104317738A (en) * | 2014-10-24 | 2015-01-28 | 中国科学技术大学 | Incremental computation method on basis of MapReduce |
CN104598562A (en) * | 2015-01-08 | 2015-05-06 | 浪潮软件股份有限公司 | XML file processing method and device based on MapReduce parallel computing model |
US10558399B2 (en) | 2015-03-26 | 2020-02-11 | International Business Machines Corporation | File system block-level tiering and co-allocation |
CN106021268B (en) * | 2015-03-26 | 2020-04-10 | 国际商业机器公司 | File system block level layering and co-allocation |
US11593037B2 (en) | 2015-03-26 | 2023-02-28 | International Business Machines Corporation | File system block-level tiering and co-allocation |
CN106021268A (en) * | 2015-03-26 | 2016-10-12 | 国际商业机器公司 | File system block-level tiering and co-allocation |
CN105138615B (en) * | 2015-08-10 | 2019-02-26 | 北京思特奇信息技术股份有限公司 | A kind of method and system constructing big data distributed information log |
CN105138615A (en) * | 2015-08-10 | 2015-12-09 | 北京思特奇信息技术股份有限公司 | Method and system for building big data distributed log |
CN106570425A (en) * | 2015-10-10 | 2017-04-19 | 北京奇虎科技有限公司 | Hard disk data encryption method and system |
CN105426493A (en) * | 2015-11-24 | 2016-03-23 | 北京中电普华信息技术有限公司 | Data processing system and method applied to distributed storage system |
CN105893435A (en) * | 2015-12-11 | 2016-08-24 | 乐视网信息技术(北京)股份有限公司 | Data loading and storing equipment, method and system |
CN105608160A (en) * | 2015-12-21 | 2016-05-25 | 浪潮软件股份有限公司 | Distributed big data analysis method |
CN105677710A (en) * | 2015-12-28 | 2016-06-15 | 曙光信息产业(北京)有限公司 | Processing method and system of big data |
CN105787597A (en) * | 2016-01-20 | 2016-07-20 | 北京优弈数据科技有限公司 | Data optimizing processing system |
CN105787597B (en) * | 2016-01-20 | 2019-12-06 | 大连优弈数据科技有限公司 | Data optimization processing system |
CN106250784A (en) * | 2016-07-20 | 2016-12-21 | 乐视控股(北京)有限公司 | Full disk encryption method and device |
CN106339473A (en) * | 2016-08-29 | 2017-01-18 | 北京百度网讯科技有限公司 | Method and device for copying file |
CN110268397A (en) * | 2016-12-30 | 2019-09-20 | 日彩电子科技(深圳)有限公司 | Effectively optimizing data layout method applied to data warehouse |
CN106845276A (en) * | 2017-02-13 | 2017-06-13 | 湖南财政经济学院 | A kind of big data based on network security implements system |
TWI760403B (en) * | 2017-03-23 | 2022-04-11 | 韓商愛思開海力士有限公司 | Data storage device and operating method thereof |
CN107329982A (en) * | 2017-06-01 | 2017-11-07 | 华南理工大学 | A kind of big data parallel calculating method stored based on distributed column and system |
CN107342914A (en) * | 2017-06-07 | 2017-11-10 | 同济大学 | A kind of high availability for cloud platform verifies system |
WO2019006640A1 (en) * | 2017-07-04 | 2019-01-10 | 深圳齐心集团股份有限公司 | Big data management system |
CN107506394A (en) * | 2017-07-31 | 2017-12-22 | 武汉工程大学 | Optimization method for eliminating big data standard relation connection redundancy |
CN107480283A (en) * | 2017-08-23 | 2017-12-15 | 九次方大数据信息集团有限公司 | Realize the method, apparatus and storage system of big data quick storage |
CN108681487A (en) * | 2018-05-21 | 2018-10-19 | 千寻位置网络有限公司 | The distributed system and tuning method of sensing algorithm arameter optimization |
CN108681487B (en) * | 2018-05-21 | 2021-08-24 | 千寻位置网络有限公司 | Distributed system and method for adjusting and optimizing sensor algorithm parameters |
CN109753306A (en) * | 2018-12-28 | 2019-05-14 | 北京东方国信科技股份有限公司 | A kind of big data processing method of because precompiled function caching engine |
CN109981674B (en) * | 2019-04-04 | 2021-08-17 | 北京信而泰科技股份有限公司 | Remote procedure calling method, device, equipment and medium |
CN109981674A (en) * | 2019-04-04 | 2019-07-05 | 北京信而泰科技股份有限公司 | A kind of remote procedure calling (PRC) method, device, equipment and medium |
CN112134914A (en) * | 2020-02-10 | 2020-12-25 | 北京天德科技有限公司 | Distributed secure storage strategy based on cryptography |
CN112134914B (en) * | 2020-02-10 | 2021-08-06 | 北京天德科技有限公司 | Distributed secure storage strategy based on cryptography |
CN111539029A (en) * | 2020-04-25 | 2020-08-14 | 章稳建 | Industrial internet-based big data storage rate optimization method and cloud computing center |
CN111930731A (en) * | 2020-07-28 | 2020-11-13 | 苏州亿歌网络科技有限公司 | Data dump method, device, equipment and storage medium |
CN112084158A (en) * | 2020-09-25 | 2020-12-15 | 北京百家科技集团有限公司 | Data set file compression method and device |
CN112597348A (en) * | 2020-12-15 | 2021-04-02 | 电子科技大学中山学院 | Method and device for optimizing big data storage |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103440244A (en) | Large-data storage and optimization method | |
US11423015B2 (en) | Log-structured storage systems | |
US11093455B2 (en) | Log-structured storage systems | |
TWI737395B (en) | Log-structured storage systems and method | |
TWI733514B (en) | A storage system, a network node of a blockchain network, and a blockchain-based log-structured storage system | |
CN102411637B (en) | Metadata management method of distributed file system | |
US11422728B2 (en) | Log-structured storage systems | |
WO2019228572A2 (en) | Log-structured storage systems | |
US11294881B2 (en) | Log-structured storage systems | |
EP3695303B1 (en) | Log-structured storage systems | |
US10903981B1 (en) | Log-structured storage systems | |
CN102413172B (en) | Parallel data sharing method based on cluster technology and apparatus thereof | |
US10942852B1 (en) | Log-structured storage systems | |
CN102833580A (en) | High-definition video application system and method based on infiniband | |
CN102480489A (en) | Logging method and device used in distributed environment | |
CN202872848U (en) | Cloud storage terminal equipment based on cloud information and cloud computing services |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20131211 |