CN103198097A - Massive geoscientific data parallel processing method based on distributed file system - Google Patents

Info

Publication number
CN103198097A
Authority
CN
China
Prior art keywords
data
computing node
distributed file
scheduling system
job scheduling
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100768952A
Other languages
Chinese (zh)
Other versions
CN103198097B (en)
Inventor
黎建辉
沈庚
周园春
王学志
韦远科
张洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Computer Network Information Center of CAS
Original Assignee
Computer Network Information Center of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2013-03-11
Filing date
2013-03-11
Publication date
2013-07-10
Application filed by Computer Network Information Center of CAS
Priority to CN201310076895.2A
Publication of CN103198097A
Application granted
Publication of CN103198097B
Legal status: Active

Abstract

The invention discloses a method for parallel processing of massive geoscientific data based on a distributed file system. The method comprises the following steps: 1) a distributed file system with a unified namespace is used as the storage system for geoscientific data and is deployed on a computing cluster; 2) the task scheduling system of the computing cluster stores received computing tasks in a waiting queue; 3) the scheduling system selects one computing task from the waiting queue and moves it into a running queue; 4) according to the information of the computing task, the scheduling system looks up, in the metadata of the distributed file system, the computing nodes that hold the data files required to run the task; and 5) the scheduling system selects the computing node that holds the most data required by the task; that node remotely acquires the data files that the task requires but the node does not hold, executes the task, and returns the execution result. The method achieves computation localization to the greatest extent.

Description

A method for parallel processing of massive geoscientific data based on a distributed file system
Technical field
The invention belongs to the technical field of ecology and geographic information science and relates to the storage and parallel processing of massive remote sensing geoscientific data. In particular, it relates to a method for parallel processing of massive geoscientific data based on a distributed file system, intended mainly for processing large data volumes in fields such as remote sensing ecological monitoring, species distribution prediction, and inversion of remote sensing geoscientific data.
Background art
The file system is an important component of a computer system. With the development of Internet technology, file systems on standalone platforms have tended to move onto high-speed local area networks, gradually forming a supporting technology for distributed computing environments: the distributed file system. The key technologies of a distributed file system mainly include the virtual file system, caching, and the required communication technology (Ying Chaoyang and Gao Hongkui, Computer Engineering and Science, No. 3, 1995). In a distributed file system, the physical storage resources managed by the file system are not necessarily attached directly to the local node but are connected to the node through a computer network. The design of a distributed file system is based on the client/server model.
A job scheduling system, also called a task scheduling system, distributes large batches of computing tasks to multiple computing units so that the units can process them in parallel; the most common example is the process scheduler of an operating system. In a distributed computing system, the main function of the job scheduler is to collect and manage computing tasks and to distribute them sensibly across the nodes of the network so that batches of tasks run in parallel efficiently. It must also provide auxiliary functions related to job execution, such as tracking the progress of a task and collecting the results of a job. Job scheduling systems are used mostly in high-performance computing and grid computing. Scheduling not only shortens the processing time of large batches of computing tasks but also lets the computing power of the cluster be used efficiently.
Remote sensing images are the main component of geographic information data. Any film or photograph that records the magnitude of the electromagnetic waves of ground objects is called a remote sensing image; here the term refers mainly to aerial and satellite photographs. Extracting useful geoscientific information from large volumes of remote sensing data requires complex computer systems. A common software tool is GDAL (Geospatial Data Abstraction Library, http://www.gdal.org/), an open-source raster geospatial data translation library released under the X/MIT license. It uses an abstract data model to represent the file formats it supports and includes a set of command-line tools for data translation and processing. Over the past 20 years, earth observation has acquired massive volumes of image data. In the next ten years, the Earth Observing System (EOS) and other earth observation platforms will produce image data at a rate exceeding 115 TB per day. Given this mountain of image data, how to efficiently retrieve and display the data that users are interested in has become a current research focus (Ruixin Yang. Value range queries on earth science data via histogram clustering [M]. Lecture Notes in Computer Science, 1999).
The large data volume and high computational complexity of geoscientific image data make the processing of geoscientific data a great challenge, particularly computations over massive data and online data computing services with strict response-time requirements. A method is therefore needed that solves the storage and fast processing of massive geoscientific data and provides a high-quality data computing service. Some traditional big data processing technologies have been applied to data processing in many fields. For geoscientific computations based on remote sensing image files, however, they are constrained by the special format of the data files and by the single mode of use of the processing tools, so they can hardly meet the needs of geoscientific computation within a short time.
Summary of the invention
As the above analysis shows, the volume of geoscientific data is very large, and an efficient, usable technical solution for storing and processing remote sensing data is urgently needed. Processing the data on a single server is limited by the machine's memory and storage space and cannot meet the demands of massive data. Existing general-purpose big data cluster technologies, such as MapReduce and MPI, cannot be applied easily and quickly to geoscientific data because of the particular nature of geoscientific computation. In view of these technical problems in the prior art, the object of the present invention is to provide a method for parallel processing of massive geoscientific data based on a distributed file system. Using a distributed file system and job scheduling technology, the invention extends a geoscientific data processing application that runs on a single server into one that executes efficiently in parallel on a cluster.
The technical solution of the present invention is as follows:
A method for parallel processing of massive geoscientific data based on a distributed file system, comprising the steps of:
1) adopting a distributed file system as the storage system for geoscientific data and deploying the distributed file system on a computing cluster, wherein the distributed file system has a unified namespace;
2) the job scheduling system of the computing cluster saves each received computing job in a waiting queue;
3) the job scheduling system selects a computing job from the waiting queue and moves it into a running queue;
4) according to the information of the computing job that entered the running queue, the job scheduling system looks up, in the metadata of the distributed file system, the computing nodes on which the data files needed to run the job reside;
5) from the computing nodes found in step 4), the job scheduling system selects the one that holds the most data needed to run the job; that computing node remotely fetches the data files that the job needs but the node does not hold, then executes the job and returns the execution result to the job scheduling system;
6) the job scheduling system deletes the computing job from the running queue.
Further, each geoscientific data file has at least one replica in the computing cluster; all data belonging to one file replica is stored on a single computing node, and its storage location is recorded in the metadata of the distributed file system.
Further, each computing node has multiple disks; a file replica is divided into multiple file blocks, and the blocks belonging to the same replica are distributed randomly and evenly across the disks.
Further, the job scheduling system backs up the computing job requests in the waiting queue to a disk file in real time.
Further, the job scheduling system backs up the computing job requests in the waiting queue to a relational database in real time.
Further, the job scheduling system selects a job from the waiting queue to enter the running queue according to a first-in, first-out policy.
Further, the job scheduling system is deployed on the scheduler of the computing cluster.
The present invention uses a high-performance distributed file system as the storage system for geoscientific data, and a job scheduling system manages the computing tasks submitted by users. The distributed file system is deployed on multiple servers, each equipped with several hard disks. A file (either an original or a replica) is split into multiple file blocks that are spread over multiple hard disks. When an I/O request for a file arrives, the file system can search several disks at once for the blocks of that file, and the disks serve the file's I/O simultaneously. The distributed file system has a unified namespace. Every node in the computing cluster (a node is one server in the distributed architecture or computing cluster) can mount the file system and use this unified namespace, so every computing node can access all geoscientific data files in the file system. The file system supports file replicas: each file has one or more replicas in the cluster. On each computing node, the blocks belonging to the same replica are distributed randomly and evenly across the node's disks, but blocks of the same replica cannot be stored across nodes; that is, all data belonging to one replica is kept on a single computing node. The physical location of a replica is fixed, and its location information is recorded in the metadata of the distributed file system.
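The storage layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the block size, the function names, the metadata record layout, and the "shuffle the disks, then cycle through them" placement rule are all hypothetical, since the patent names neither a concrete file system nor a placement algorithm.

```python
import random
from collections import Counter, defaultdict

BLOCK_SIZE = 64 * 1024 * 1024  # hypothetical block size (64 MiB); not specified in the patent


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks needed to hold file_size bytes."""
    return max(1, -(-file_size // block_size))  # ceiling division


def place_replica(path, file_size, node, disks_per_node, metadata, rng):
    """Store one whole replica on `node`, scattering its blocks over that node's disks.

    The disk order is shuffled and then cycled through, so the assignment is
    random while per-disk block counts differ by at most one.
    """
    n_blocks = split_into_blocks(file_size)
    disks = list(range(disks_per_node))
    rng.shuffle(disks)
    block_to_disk = [disks[i % disks_per_node] for i in range(n_blocks)]
    # A replica never spans nodes, so the metadata only needs the node name
    # (plus the size, which the scheduler can use later).
    metadata[path].append({"node": node, "size": file_size, "blocks": block_to_disk})
    return block_to_disk


metadata = defaultdict(list)  # unified namespace: path -> list of replica records
rng = random.Random(0)
layout = place_replica("/geo/modis/tile_001.tif", 10 * BLOCK_SIZE, "node-a", 4, metadata, rng)
place_replica("/geo/modis/tile_001.tif", 10 * BLOCK_SIZE, "node-b", 4, metadata, rng)

per_disk = Counter(layout)  # blocks per disk for the replica on node-a
nodes = [record["node"] for record in metadata["/geo/modis/tile_001.tif"]]
```

Keeping every block of a replica on one node is what later lets a scheduler treat "node holds the file" as a yes-or-no question, while the spread over disks still gives parallel I/O inside the node.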
Compared with the prior art, the positive effects of the present invention are:
To protect job information, the queues in the scheduling system are backed up in real time on the local storage system. The backup is kept in a disk file or a relational database and stays consistent with the queues in memory. When an event that the system cannot prevent occurs, such as a power failure, the complete job information is not lost: after the job scheduling system restarts, the in-memory queues are restored to the state last saved before the failure.
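One way to picture this backup is a waiting queue mirrored to a relational table on every change. The sketch below uses SQLite from Python's standard library purely as a stand-in; the patent does not name a database, and the class name, table name, and JSON job encoding are assumptions made for illustration.

```python
import json
import sqlite3
from collections import deque


class PersistentWaitingQueue:
    """FIFO waiting queue mirrored to a SQLite table on every change."""

    def __init__(self, db_path):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS waiting ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, job TEXT NOT NULL)"
        )
        self.db.commit()
        # On (re)start, rebuild the in-memory queue from the last committed state.
        rows = self.db.execute("SELECT seq, job FROM waiting ORDER BY seq").fetchall()
        self.queue = deque((seq, json.loads(job)) for seq, job in rows)

    def submit(self, job):
        """Append a job; the row is committed before the in-memory queue changes."""
        with self.db:  # commits on success, rolls back on error
            cur = self.db.execute(
                "INSERT INTO waiting (job) VALUES (?)", (json.dumps(job),)
            )
        self.queue.append((cur.lastrowid, job))

    def pop_next(self):
        """Remove and return the oldest job, deleting its backup row as well."""
        seq, job = self.queue.popleft()
        with self.db:
            self.db.execute("DELETE FROM waiting WHERE seq = ?", (seq,))
        return job

    def close(self):
        self.db.close()
```

Opening a second `PersistentWaitingQueue` on the same database file after a crash or restart yields a queue with the jobs that were still waiting, in submission order. A running queue could be mirrored in the same way with a second table.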
A computing job may need more than one input file. With the support of the distributed file system, the scheduling system can obtain the location and size of every input file and, by a simple calculation, select the computing node that holds the largest volume of input data. To achieve computation localization to the greatest extent, the scheduling system dispatches the job to that node.
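The "simple calculation" is not spelled out in the patent; a plausible reading is to sum, for each node, the sizes of the input files it holds and take the maximum. The following sketch assumes a hypothetical metadata shape (path mapped to a size and a set of nodes holding a full replica) and an arbitrary tie-break by node name.

```python
from collections import defaultdict


def pick_node(input_files, metadata):
    """Return (node, local_bytes) for the node holding the most input data.

    input_files: paths in the file system's namespace (hypothetical job field).
    metadata: path -> (size_in_bytes, set of nodes holding a full replica).
    """
    local_bytes = defaultdict(int)
    for path in input_files:
        size, nodes = metadata[path]
        for node in nodes:
            local_bytes[node] += size
    if not local_bytes:
        raise ValueError("no replica found for any input file")
    # Iterating over sorted names makes ties resolve to the first name alphabetically.
    best = max(sorted(local_bytes), key=local_bytes.__getitem__)
    return best, local_bytes[best]


# Sizes are arbitrary illustrative numbers, not measurements.
metadata = {
    "/geo/ndvi_2012.tif": (800, {"node-a", "node-b"}),
    "/geo/landcover.tif": (300, {"node-b"}),
    "/geo/dem.tif": (500, {"node-c"}),
}
node, held = pick_node(list(metadata), metadata)
```

Here node-b holds two of the three inputs (1100 of 1600 units), so only the remaining file would have to be fetched remotely.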
Description of drawings
The accompanying drawing is a workflow diagram of the geoscientific computing job scheduling system.
Embodiment
In many distributed computing frameworks, the biggest problem in processing massive data is that transferring large volumes of data over the network is hard to avoid. In Hadoop, the commonly used open-source MapReduce system, for example, many algorithms generate heavy network traffic in the Reduce phase. Because CPU processing speed and network transfer speed differ enormously, this is often the main efficiency bottleneck of the whole computation. Geoscientific data is large in volume and high in computational complexity, so conventional distributed computing can hardly meet users' requirements on computing time. To solve this problem, the present invention designs a job scheduling system and combines it with a distributed file system that has a specific capability: it lets the user specify the physical location of a file replica so that all data blocks of a replica are stored on one node, which guarantees that a computing job reading that replica obtains the required data without network transfer. According to the location of the geoscientific data files a computation needs, the complex computing job is scheduled on the server where the files reside. This reduces the transfer of large data volumes over the network as far as possible; the job's reads of its input files are essentially confined to reads from local disk, which achieves computation localization to the greatest extent.
The core of the invention is the job scheduling system, which consists mainly of two queues: the waiting queue holds all jobs that users have submitted but that have not yet been assigned, and the running queue holds all jobs that have been assigned to a server and are running. The job scheduling system accepts computing jobs submitted by a large number of front-end systems within a short time, works out the distribution of the geoscientific data files needed by the computation from the data information contained in each submitted job, finds the computing node that holds the most of the required data, and assigns the job to that node.
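The locality argument can be stated compactly. The notation is assumed here for illustration and does not appear in the patent: F is the set of input files of a job, s(f) is the size of file f, and R(f) is the set of nodes that hold a full replica of f.

```latex
L(n) = \sum_{\substack{f \in F \\ n \in R(f)}} s(f), \qquad
n^{*} = \arg\max_{n} L(n), \qquad
T(n^{*}) = \sum_{f \in F} s(f) - L(n^{*})
```

L(n) is the input volume already stored on node n, n* is the node the scheduler picks, and T(n*) is the volume that still has to cross the network. Because the total input size is fixed for a given job, maximizing L(n) is the same as minimizing T(n).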
As shown in the drawing, the workflow of the job scheduling system of the present invention is as follows:
1. A new job is received and enters the waiting queue.
2. The backup of the waiting queue is updated.
3. A job is selected according to the first-in, first-out policy and enters the running queue.
4. Using the file information about the input data contained in the job that entered the running queue (the paths under which the files are stored in the distributed file system), the scheduler looks up in the metadata of the distributed file system which computing nodes hold the data files the job needs.
5. The computing node that holds the most of the required data is selected.
6. The job is executed on that computing node, and all data the job needs is read from the distributed file system. Because most of the data is stored locally, the job fetches only a small amount of remote data, and network traffic is low.
7. The computing node stays in communication with the scheduler.
8. The computing node notifies the scheduler that the job has finished.
9. The scheduler retrieves the job result and deletes the job from the running queue.
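The nine steps can be tied together in a minimal, single-process sketch. It is a simplification under stated assumptions: the class and field names are hypothetical, the queue backup of step 2 is left out, and the job runs synchronously through a callback, whereas a real scheduler would dispatch jobs to remote nodes and track them asynchronously.

```python
from collections import deque


class Scheduler:
    """Two-queue FIFO scheduler that sends each job to its most data-local node."""

    def __init__(self, metadata, run_on_node):
        self.metadata = metadata        # path -> (size, set of nodes with a full replica)
        self.run_on_node = run_on_node  # hypothetical callback: (node, job) -> result
        self.waiting = deque()          # submitted, not yet assigned
        self.running = {}               # job id -> (job, node)

    def submit(self, job):
        self.waiting.append(job)        # step 1 (step 2, the backup, is omitted)

    def dispatch_next(self):
        job = self.waiting.popleft()    # step 3: first in, first out
        held = {}
        for path in job["inputs"]:      # step 4: look up replica locations
            size, nodes = self.metadata[path]
            for node in nodes:
                held[node] = held.get(node, 0) + size
        node = max(sorted(held), key=held.get)  # step 5: most local data wins
        self.running[job["id"]] = (job, node)
        result = self.run_on_node(node, job)    # steps 6-8, run synchronously here
        del self.running[job["id"]]             # step 9: leave the running queue
        return node, result


# Illustrative metadata; sizes and node names are made up.
metadata = {
    "/geo/a.tif": (700, {"n1"}),
    "/geo/b.tif": (200, {"n2"}),
    "/geo/c.tif": (600, {"n2"}),
}
sched = Scheduler(metadata, lambda node, job: f"{job['id']} ran on {node}")
sched.submit({"id": "job-1", "inputs": ["/geo/a.tif", "/geo/b.tif"]})
sched.submit({"id": "job-2", "inputs": ["/geo/b.tif", "/geo/c.tif"]})
first = sched.dispatch_next()
second = sched.dispatch_next()
```

The first job goes to n1, which holds the larger of its two inputs, and the second goes to n2, which holds both of its inputs.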

Claims (7)

1. A method for parallel processing of massive geoscientific data based on a distributed file system, comprising the steps of:
1) adopting a distributed file system as the storage system for geoscientific data and deploying the distributed file system on a computing cluster, wherein the distributed file system has a unified namespace;
2) the job scheduling system of the computing cluster saves each received computing job in a waiting queue;
3) the job scheduling system selects a computing job from the waiting queue and moves it into a running queue;
4) according to the information of the computing job that entered the running queue, the job scheduling system looks up, in the metadata of the distributed file system, the computing nodes on which the data files needed to run the job reside;
5) from the computing nodes found in step 4), the job scheduling system selects the one that holds the most data needed to run the job; that computing node remotely fetches the data files that the job needs but the node does not hold, then executes the job and returns the execution result to the job scheduling system;
6) the job scheduling system deletes the computing job from the running queue.
2. The method of claim 1, characterized in that each geoscientific data file has at least one replica in the computing cluster; all data belonging to one file replica is stored on a single computing node, and its storage location is recorded in the metadata of the distributed file system.
3. The method of claim 2, characterized in that each computing node has multiple disks, a file replica is divided into multiple file blocks, and the blocks belonging to the same replica are distributed randomly and evenly across the disks.
4. The method of claim 1, 2 or 3, characterized in that the job scheduling system backs up the computing job requests in the waiting queue to a disk file in real time.
5. The method of claim 1, 2 or 3, characterized in that the job scheduling system backs up the computing job requests in the waiting queue to a relational database in real time.
6. The method of claim 1, characterized in that the job scheduling system selects a job from the waiting queue to enter the running queue according to a first-in, first-out policy.
7. The method of claim 1, characterized in that the job scheduling system is deployed on the scheduler of the computing cluster.
CN201310076895.2A 2013-03-11 2013-03-11 Massive geoscientific data parallel processing method based on distributed file system Active CN103198097B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310076895.2A CN103198097B (en) 2013-03-11 2013-03-11 Massive geoscientific data parallel processing method based on distributed file system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310076895.2A CN103198097B (en) 2013-03-11 2013-03-11 Massive geoscientific data parallel processing method based on distributed file system

Publications (2)

Publication Number Publication Date
CN103198097A (en) 2013-07-10
CN103198097B (en) 2016-02-10

Family

ID=48720655

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310076895.2A Active CN103198097B (en) 2013-03-11 2013-03-11 Massive geoscientific data parallel processing method based on distributed file system

Country Status (1)

Country Link
CN (1) CN103198097B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530182A (en) * 2013-10-22 2014-01-22 海南大学 Working scheduling method and device
CN103631657A (en) * 2013-11-19 2014-03-12 浪潮电子信息产业股份有限公司 Task scheduling algorithm based on MapReduce
CN105205183A (en) * 2015-10-29 2015-12-30 哈尔滨工业大学 Automatic establishing method of DDS (data distribution service) distributive system based on XML
CN105426235A (en) * 2015-11-06 2016-03-23 东莞理工学院 Depicting method for terrestrial atmospheric aerosol retrieval distributed workflow dependency
CN106227397A (en) * 2016-08-05 2016-12-14 北京市计算中心 Computing cluster job management system based on application virtualization technology and method
CN106250473A (en) * 2016-07-29 2016-12-21 江苏物联网研究发展中心 remote sensing image cloud storage method
CN106371931A (en) * 2016-09-30 2017-02-01 电子科技大学 Web framework-based high-performance geocomputation service system
CN107729435A (en) * 2017-09-29 2018-02-23 郑州云海信息技术有限公司 Method, apparatus, equipment and the storage medium that distributed file system task is assigned
CN108763299A (en) * 2018-04-19 2018-11-06 贵州师范大学 A kind of large-scale data processing calculating acceleration system
CN111897792A (en) * 2020-08-11 2020-11-06 北京无线电测量研究所 Distributed file access method, system, medium and device
CN113176910A (en) * 2021-04-29 2021-07-27 南方电网科学研究院有限责任公司 Distributed file system algorithm parallel execution method
CN114661637A (en) * 2022-02-28 2022-06-24 中国科学院上海天文台 Data processing system and method for radio astronomical data intensive scientific operation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050055328A1 (en) * 2003-09-10 2005-03-10 Hitachi, Ltd. Method and apparatus for data integration
CN102033889A (en) * 2009-09-29 2011-04-27 熊凡凡 Distributed database parallel processing system
CN102880832A (en) * 2012-08-28 2013-01-16 曙光信息产业(北京)有限公司 Method for implementing mass data management system under colony

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530182A (en) * 2013-10-22 2014-01-22 海南大学 Working scheduling method and device
CN103631657A (en) * 2013-11-19 2014-03-12 浪潮电子信息产业股份有限公司 Task scheduling algorithm based on MapReduce
CN103631657B (en) * 2013-11-19 2017-08-25 浪潮电子信息产业股份有限公司 A kind of method for scheduling task based on MapReduce
CN105205183A (en) * 2015-10-29 2015-12-30 哈尔滨工业大学 Automatic establishing method of DDS (data distribution service) distributive system based on XML
CN105205183B (en) * 2015-10-29 2018-06-22 哈尔滨工业大学 A kind of DDS distributed system method for auto constructing based on XML
CN105426235B (en) * 2015-11-06 2018-09-25 东莞理工学院 A kind of land atmospheric aerosol inverting distributed work flow dependence depicting method
CN105426235A (en) * 2015-11-06 2016-03-23 东莞理工学院 Depicting method for terrestrial atmospheric aerosol retrieval distributed workflow dependency
CN106250473A (en) * 2016-07-29 2016-12-21 江苏物联网研究发展中心 remote sensing image cloud storage method
CN106250473B (en) * 2016-07-29 2019-11-12 江苏物联网研究发展中心 Remote sensing image cloud storage method
CN106227397A (en) * 2016-08-05 2016-12-14 北京市计算中心 Computing cluster job management system based on application virtualization technology and method
CN106371931B (en) * 2016-09-30 2019-11-05 电子科技大学 A kind of high-performance geoscience computing service system based on Web frame
CN106371931A (en) * 2016-09-30 2017-02-01 电子科技大学 Web framework-based high-performance geocomputation service system
CN107729435A (en) * 2017-09-29 2018-02-23 郑州云海信息技术有限公司 Method, apparatus, equipment and the storage medium that distributed file system task is assigned
CN108763299A (en) * 2018-04-19 2018-11-06 贵州师范大学 A kind of large-scale data processing calculating acceleration system
CN111897792A (en) * 2020-08-11 2020-11-06 北京无线电测量研究所 Distributed file access method, system, medium and device
CN113176910A (en) * 2021-04-29 2021-07-27 南方电网科学研究院有限责任公司 Distributed file system algorithm parallel execution method
CN114661637A (en) * 2022-02-28 2022-06-24 中国科学院上海天文台 Data processing system and method for radio astronomical data intensive scientific operation
CN114661637B (en) * 2022-02-28 2023-03-24 中国科学院上海天文台 Data processing system and method for radio astronomical data intensive scientific operation

Also Published As

Publication number Publication date
CN103198097B (en) 2016-02-10

Similar Documents

Publication Publication Date Title
CN103198097B (en) Massive geoscientific data parallel processing method based on distributed file system
Padhy Big data processing with Hadoop-MapReduce in cloud systems
WO2009103221A1 (en) Effective relating theme model data processing method and system thereof
Hongchao et al. Distributed data organization and parallel data retrieval methods for huge laser scanner point clouds
US9774676B2 (en) Storing and moving data in a distributed storage system
CN108885641A (en) High Performance Data Query processing and data analysis
CN106570145B (en) Distributed database result caching method based on hierarchical mapping
CN103491155A (en) Cloud computing method and system for achieving mobile computing and obtaining mobile data
US11818012B2 (en) Online restore to different topologies with custom data distribution
CN111258978A (en) Data storage method
Su et al. Sdquery dsi: integrating data management support with a wide area data transfer protocol
EP3646187B1 (en) Cost-based garbage collection scheduling in a distributed storage environment
Zhang et al. Oceanrt: Real-time analytics over large temporal data
Merceedi et al. A comprehensive survey for hadoop distributed file system
Davoudian et al. A workload-adaptive streaming partitioner for distributed graph stores
US11223528B2 (en) Management of cloud-based shared content using predictive cost modeling
Azari et al. A data replication algorithm for groups of files in data grids
CN112597369A (en) Webpage spider theme type search system based on improved cloud platform
Pan et al. A remote sensing image cloud processing system based on Hadoop
Jin et al. Optimization of task assignment strategy for map-reduce
Jolfaei et al. Improvement of job scheduling and tow level data replication strategies in data grid
Dang et al. Improvement of data grid's performance by combining job scheduling with dynamic replication strategy
Alikhan et al. Dingo optimization based network bandwidth selection to reduce processing time during data upload and access from cloud by user
Deshpande et al. A comparative analysis of data replication strategies and consistency maintenance in distributed file systems
CN113535695B (en) Archive updating method based on process scheduling

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant