CN103198097A

CN103198097A - Massive geoscientific data parallel processing method based on distributed file system

Info

Publication number: CN103198097A
Application number: CN2013100768952A
Authority: CN
Inventors: 黎建辉; 沈庚; 周园春; 王学志; 韦远科; 张洋
Original assignee: Computer Network Information Center of CAS
Current assignee: Computer Network Information Center of CAS
Priority date: 2013-03-11
Filing date: 2013-03-11
Publication date: 2013-07-10
Anticipated expiration: 2033-03-11
Also published as: CN103198097B

Abstract

The invention discloses a method for parallel processing of massive geoscientific data based on a distributed file system. The method comprises the following steps: 1) a distributed file system with a unified namespace is used as the storage system for geoscientific data and is deployed on a computing cluster; 2) the job scheduling system of the computing cluster stores received computing jobs in a waiting queue; 3) the scheduling system selects one computing job from the waiting queue and moves it into a running queue; 4) based on the information in the computing job, the scheduling system looks up, in the metadata of the distributed file system, the computing nodes that hold the data files the job needs; and 5) the scheduling system selects the computing node that holds the most of the data the job needs, that node remotely fetches the required data files it does not hold, executes the job, and returns the execution result. The method achieves computation localization to the greatest possible extent.

Description

A massive geoscientific data parallel processing method based on a distributed file system

Technical field

The invention belongs to the field of ecological and geographic information technology. It relates to the storage and parallel processing of massive remote sensing geoscientific data, and in particular to a massive geoscientific data parallel processing method based on a distributed file system. It is mainly used for processing massive data in related fields such as remote sensing ecological monitoring, species distribution prediction, and remote sensing geoscientific data inversion.

Background technology

A file system is an important component of a computer system. With the development of Internet technology, file systems on standalone platforms have tended to extend onto high-speed local area networks, gradually forming a supporting technology for distributed computing environments: the distributed file system (Distributed File System). The key technologies of a distributed file system mainly include the virtual file system, caching, and the required communication mechanisms (Ying Zhaohui, Gao Hongkui. Computer Engineering and Science, 1995, No. 3). In a distributed file system, the physical storage resources managed by the file system are not necessarily attached directly to the local node; they may instead be connected to the node through a computer network. The design of a distributed file system is based on the client/server model.

A job scheduling system, also called a task scheduling system, distributes large batches of computing tasks to multiple computing units, which process them in parallel; the most common example is the process scheduler of an operating system. In a distributed computing system, the main function of the job scheduler is to collect and manage computing tasks and to distribute them sensibly to the nodes on the network so that batches of tasks execute efficiently in parallel. It must also provide auxiliary functions related to job execution, such as tracking the progress of tasks and collecting job results. Job scheduling systems are used mostly in high-performance computing and computing grids. Scheduling not only shortens the processing time of large batches of computing tasks but also lets the computing cluster deliver its performance efficiently.

Remote sensing images are a major component of geographic information data. Any film or photograph that records the magnitude of electromagnetic radiation from ground objects is called a remote sensing image; here the term mainly refers to aerial and satellite photographs. To extract useful geoscientific information from large volumes of remote sensing data, the data must be processed by complex computer systems. A common software tool is GDAL (Geospatial Data Abstraction Library, http://www.gdal.org/), an open-source raster spatial data translation library released under the X/MIT license. It uses an abstract data model to represent the file formats it supports and includes a set of command-line tools for data translation and processing. Over the past 20 years, Earth observation has acquired massive volumes of image data. In the coming decade, the Earth Observing System (EOS) and other Earth observation platforms will produce image data at a rate exceeding 115 TB per day. Faced with this mountain of image data, how to efficiently retrieve and display the data of interest to users has become a current research focus. (Ruixin Yang. Value range queries on earth science data via histogram clustering [M]. Lecture Notes in Computer Science, 1999.)

The large data volume and high computational complexity of geoscientific image data make processing geoscientific data a great challenge, particularly for computations involving massive data and for online data computing services with demanding response times. A method is therefore needed that solves the storage and fast processing of massive geoscientific data and provides high-quality data computing services. Although some traditional big data processing technologies have been applied to data processing in many fields, for geoscientific information computation based on remote sensing image files they are constrained by the special formats of the data files and by the single usage mode of the processing tools, and can hardly meet the demands of geoscientific information computation within a short time.

Summary of the invention

The above analysis shows that geoscientific data volumes are very large and that an efficient, practical technical solution for storing and processing remote sensing data is urgently needed. Processing data on a single server is limited by that machine's memory and storage space and cannot meet the demands of massive data. Some existing general-purpose big data cluster technologies, such as MapReduce and MPI, cannot be applied easily and quickly to geoscientific data because of the particular nature of geoscientific computation. In view of these problems in the prior art, the object of the invention is to provide a massive geoscientific data parallel processing method based on a distributed file system. The invention uses a distributed file system together with job scheduling to extend a geoscientific data processing application that runs on a single server into one that executes efficiently in parallel on a cluster.

The technical solution of the invention is as follows:

A massive geoscientific data parallel processing method based on a distributed file system, comprising the steps of:

1) using a distributed file system as the storage system for geoscientific data and deploying the distributed file system on a computing cluster, wherein the distributed file system has a unified namespace;

2) the job scheduling system of the computing cluster saves each received computing job in a waiting queue;

3) the job scheduling system selects one computing job from the waiting queue and moves it into a running queue;

4) based on the information of the computing job that entered the running queue, the job scheduling system looks up, in the metadata of the distributed file system, the computing nodes that hold the data files the job needs in order to run;

5) from the computing nodes found in step 4), the job scheduling system selects the one that holds the most of the data the job needs; this computing node remotely fetches the data files that the job needs but the node does not hold, then executes the job and returns the execution result to the job scheduling system;

6) the job scheduling system deletes the computing job from the running queue.

Further, each geoscientific data file has at least one replica in the computing cluster; all data belonging to one file replica are stored on a single computing node, and the storage location is recorded in the metadata of the distributed file system.

Further, each computing node has multiple disks, each file replica is divided into multiple file blocks, and the file blocks belonging to the same file replica are distributed randomly and evenly over the disks.
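
The random, even placement of file blocks described above can be illustrated with a minimal sketch. The patent does not specify the placement algorithm; the shuffled round-robin used here, the function name `place_blocks`, and the disk names are all assumptions made for illustration.

```python
import random


def place_blocks(num_blocks, disks, seed=None):
    """Assign the blocks of one file replica to the disks of a single node.

    Hypothetical helper: blocks are dealt out in rounds, and each round
    visits every disk once in random order, so disk loads never differ by
    more than one block.
    """
    rng = random.Random(seed)
    placement = {}
    order = []
    for block in range(num_blocks):
        if not order:
            # Start a new round over all disks in a fresh random order.
            order = list(disks)
            rng.shuffle(order)
        placement[block] = order.pop()
    return placement


# Ten blocks of one replica spread over the four disks of one node.
placement = place_blocks(10, ["disk0", "disk1", "disk2", "disk3"], seed=1)
```

Because all blocks of a replica stay on one node, a read of the whole file can be served by several local disks at once without any network transfer.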

Further, the job scheduling system backs up the computing job requests in the waiting queue to a disk file in real time.

Further, the job scheduling system backs up the computing job requests in the waiting queue to a relational database in real time.
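
As a rough sketch of the relational-database variant, the waiting queue can be mirrored into a table on every change. The patent names no particular database or schema; SQLite, the table name `waiting`, and the class `BackedUpWaitingQueue` are assumptions chosen only to keep the example self-contained.

```python
import json
import sqlite3
from collections import deque


class BackedUpWaitingQueue:
    """In-memory FIFO queue mirrored into a SQLite table (illustrative only)."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        with self.db:
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS waiting ("
                "seq INTEGER PRIMARY KEY AUTOINCREMENT, job TEXT NOT NULL)"
            )
        # Rebuild the in-memory queue from whatever the table already holds.
        rows = self.db.execute(
            "SELECT seq, job FROM waiting ORDER BY seq"
        ).fetchall()
        self.queue = deque((seq, json.loads(job)) for seq, job in rows)

    def put(self, job):
        # The insert is committed before the job is visible in memory.
        with self.db:
            cur = self.db.execute(
                "INSERT INTO waiting (job) VALUES (?)", (json.dumps(job),)
            )
        self.queue.append((cur.lastrowid, job))

    def get(self):
        # Oldest job first; the backup row is removed in the same call.
        if not self.queue:
            return None
        seq, job = self.queue.popleft()
        with self.db:
            self.db.execute("DELETE FROM waiting WHERE seq = ?", (seq,))
        return job
```

Passing a file path instead of `":memory:"` would let a restarted scheduler rebuild its queue from the table in the constructor.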

Further, the job scheduling system selects a job from the waiting queue to enter the running queue according to a first-in, first-out policy.

Further, the job scheduling system is deployed on the scheduler of the computing cluster.

The invention uses a high-performance distributed file system as the storage system for geoscientific data, while a job scheduling system manages the computing tasks submitted by users. The distributed file system is deployed on multiple servers, each equipped with multiple hard disks. A file (an original or a replica) is split into multiple file blocks that are spread over several hard disks. When an I/O request for a file arrives, the file system can look for the blocks of that file on multiple disks at once, and the file I/O is served by multiple disks simultaneously. The distributed file system in the invention has a unified namespace. All nodes in the computing cluster (a node is one server in the distributed architecture or computing cluster) can mount this file system, and each computing node can use the unified namespace, so every node can access all geoscientific data files in the file system. The file system supports file replicas, that is, each file has one or more replicas in the cluster. On each computing node, the file blocks belonging to the same file replica are distributed randomly and evenly over multiple disks, but blocks belonging to the same replica are not stored across nodes; that is, all data belonging to one file replica are kept on a single computing node. The physical location of a file replica is fixed, and its location information is kept in the metadata of the distributed file system.

Compared with the prior art, the positive effects of the invention are as follows:

To keep job information safe, the queues in the scheduling system are backed up in real time on the local storage system. The backup is kept in a disk file or a relational database and stays consistent with the in-memory queues. When a special situation that the system cannot prevent occurs, such as a power failure, the complete job information is not lost; after the job scheduling system restarts, the in-memory queues are restored to the last state saved before the power failure.
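
The disk-file variant of this backup and restore cycle might look like the following sketch. The JSON snapshot format and the helper names `save_queues` and `load_queues` are assumptions; the patent only states that the backup is kept consistent with the in-memory queues.

```python
import json
import os
import tempfile


def save_queues(path, waiting, running):
    """Write both queues to `path` via a temporary file and a rename.

    Readers see either the old snapshot or the new one, never a partially
    written file (the rename is atomic on POSIX file systems).
    """
    state = {"waiting": list(waiting), "running": list(running)}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before renaming
    os.replace(tmp, path)


def load_queues(path):
    """Return (waiting, running) from the last snapshot, or empty queues."""
    if not os.path.exists(path):
        return [], []
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return state["waiting"], state["running"]
```

A scheduler built this way would call `save_queues` after every queue change and `load_queues` once at startup.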

A computing job may need more than one input file. With the support of the distributed file system, the scheduling system can obtain the location and size of every input file and, by a simple calculation, select the computing node that holds the largest volume of input data. To localize computation as far as possible, the scheduling system assigns the job to that node.
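
The "simple calculation" can be sketched as summing, per node, the sizes of the input files that node holds and taking the maximum. The function `pick_node`, the dictionary layout of the metadata, and the file paths are hypothetical; the real metadata interface of the distributed file system is not described in the patent.

```python
def pick_node(input_files, locations, sizes):
    """Return (best_node, held_bytes_per_node) for one job.

    locations: file path -> set of nodes holding a full replica
    sizes:     file path -> file size in bytes
    """
    held = {}
    for path in input_files:
        for node in locations.get(path, ()):
            held[node] = held.get(node, 0) + sizes[path]
    if not held:
        return None, held
    # Iterate in sorted order so ties resolve to the alphabetically first node.
    best = max(sorted(held), key=lambda n: held[n])
    return best, held


locations = {
    "/geo/a.tif": {"node1", "node2"},
    "/geo/b.tif": {"node2"},
    "/geo/c.tif": {"node3"},
}
sizes = {"/geo/a.tif": 400, "/geo/b.tif": 300, "/geo/c.tif": 500}
best, held = pick_node(["/geo/a.tif", "/geo/b.tif", "/geo/c.tif"], locations, sizes)
# node2 holds a.tif and b.tif (700 bytes), more than node1 (400) or node3 (500).
```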

Description of drawings

The accompanying drawing is a workflow diagram of the geoscientific computing job scheduling system.

Embodiment

In many distributed computing frameworks, the biggest problem in processing massive data is that large data transfers over the network are hard to avoid. For example, in Hadoop, a commonly used open-source MapReduce system, many algorithms generate heavy network traffic in the Reduce phase. Because CPU processing speed and network transfer speed differ enormously, this is often the main efficiency bottleneck of the whole computation. Geoscientific data are large in volume and computationally complex, so conventional distributed computing can hardly meet users' requirements on computing time. To solve this problem, the invention designs a job scheduling system and combines it with a distributed file system that has a specific capability: it lets the user specify the physical location of a file replica so that all data blocks of that replica are kept on one node, and a computing job that reads the replica can obtain the data it needs without network transfer. According to the locations of the geoscientific data files a computation needs, complex computing jobs are scheduled to run on the servers where the files reside. This reduces large data transfers over the network as far as possible; a job's reads of its input files are essentially confined to the local disks, which localizes computation to the greatest possible extent.

The core of the invention is the job scheduling system, which consists mainly of two queues. The waiting queue holds all jobs that users have submitted but that have not yet been assigned; the running queue holds all jobs that have been assigned to a server and are running. The job scheduling system accepts the large number of computing jobs submitted by front-end systems within a short time, works out from the geoscientific data information contained in each submitted job how the data files the computation needs are distributed, finds the computing node that holds the most of the required data, and assigns the job to that node.
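
To make the effect on network traffic concrete, the following sketch computes which input files a chosen node would still have to fetch remotely and what fraction of the input is read locally. The function name `remote_fetch_plan`, the metadata layout, and the sample figures are assumptions for illustration, not values from the patent.

```python
def remote_fetch_plan(node, input_files, locations, sizes):
    """Return (missing_files, remote_bytes, local_fraction) for a job on `node`.

    locations: file path -> set of nodes holding a full replica
    sizes:     file path -> file size in bytes
    """
    missing = [p for p in input_files if node not in locations.get(p, ())]
    remote_bytes = sum(sizes[p] for p in missing)
    total = sum(sizes[p] for p in input_files)
    local_fraction = (total - remote_bytes) / total if total else 1.0
    return missing, remote_bytes, local_fraction


locations = {
    "/geo/a.tif": {"node1", "node2"},
    "/geo/b.tif": {"node2"},
    "/geo/c.tif": {"node3"},
}
sizes = {"/geo/a.tif": 400, "/geo/b.tif": 300, "/geo/c.tif": 500}
inputs = ["/geo/a.tif", "/geo/b.tif", "/geo/c.tif"]

# Running on node2 leaves only c.tif (500 of 1200 bytes) to fetch remotely.
missing, remote_bytes, local_fraction = remote_fetch_plan("node2", inputs, locations, sizes)
```

In this made-up example, running the same job on node1 would require 800 bytes of remote reads instead of 500, which is why the node with the most local data is preferred.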

As shown in the drawing, the workflow of the job scheduling system of the invention is as follows:

1. A new job is received and enters the waiting queue.

2. The backup of the waiting queue is updated.

3. One job is selected according to the first-in, first-out policy and enters the running queue.

4. Using the input file information contained in the job that entered the running queue (the paths under which the files are stored in the distributed file system), the scheduler looks up in the metadata of the distributed file system which computing nodes hold the data files the job needs.

5. The computing node holding the most of the required data is selected.

6. The job is executed on this computing node, and all data the job needs are read from the distributed file system. Because most of the data are stored locally, the job only needs to fetch a small amount of remote data, so network traffic is low.

7. The computing node stays in communication with the scheduler.

8. The computing node notifies the scheduler that the job has finished.

9. The scheduler retrieves the job result and deletes the job from the running queue.
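
The nine steps can be condensed into a small in-memory sketch. It models only the queue handling and node selection; queue backup, remote execution, and node-to-scheduler communication are left out, and the class `Scheduler` with its method names is a hypothetical construction rather than the patented implementation.

```python
from collections import deque


class Scheduler:
    """Toy job scheduler with a waiting queue and a running queue."""

    def __init__(self, locations, sizes):
        self.locations = locations  # file path -> set of nodes holding a replica
        self.sizes = sizes          # file path -> size in bytes
        self.waiting = deque()      # submitted but not yet assigned
        self.running = {}           # job id -> node executing it
        self.results = {}

    def submit(self, job_id, input_files):
        # Steps 1-2: enqueue the job (the queue backup is omitted here).
        self.waiting.append((job_id, list(input_files)))

    def dispatch(self):
        # Step 3: first-in, first-out selection.
        if not self.waiting:
            return None
        job_id, files = self.waiting.popleft()
        # Step 4: total the input bytes each node already holds.
        held = {}
        for path in files:
            for node in self.locations.get(path, ()):
                held[node] = held.get(node, 0) + self.sizes[path]
        # Step 5: node with the most local data; None if no node holds any.
        node = max(sorted(held), key=held.get) if held else None
        # Step 6: the job would now be launched on that node.
        self.running[job_id] = node
        return job_id, node

    def complete(self, job_id, result):
        # Steps 8-9: record the result and drop the job from the running queue.
        self.results[job_id] = result
        del self.running[job_id]
```

A caller would invoke `submit` for each incoming job, `dispatch` whenever capacity is available, and `complete` when a node reports that a job has finished.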

Claims

1. A massive geoscientific data parallel processing method based on a distributed file system, comprising the steps of:

2. The method of claim 1, characterized in that each geoscientific data file has at least one replica in the computing cluster; all data belonging to one file replica are stored on a single computing node, and the storage location is recorded in the metadata of the distributed file system.

3. The method of claim 2, characterized in that each computing node has multiple disks, each file replica is divided into multiple file blocks, and the file blocks belonging to the same file replica are distributed randomly and evenly over the disks.

4. The method of claim 1, 2 or 3, characterized in that the job scheduling system backs up the computing job requests in the waiting queue to a disk file in real time.

5. The method of claim 1, 2 or 3, characterized in that the job scheduling system backs up the computing job requests in the waiting queue to a relational database in real time.

6. The method of claim 1, characterized in that the job scheduling system selects a job from the waiting queue to enter the running queue according to a first-in, first-out policy.

7. The method of claim 1, characterized in that the job scheduling system is deployed on the scheduler of the computing cluster.