CN105468457B - Local migration fault-tolerance method for parallel systems based on difference identification - Google Patents

Local migration fault-tolerance method for parallel systems based on difference identification

Info

Publication number
CN105468457B
Authority
CN
China
Prior art keywords
job
migration
parallel
key information
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510830319.1A
Other languages
Chinese (zh)
Other versions
CN105468457A (en)
Inventor
宋长明
刘沙
李伟东
张宏宇
王礼生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN201510830319.1A
Publication of CN105468457A
Application granted
Publication of CN105468457B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/203 Failover techniques using migration

Abstract

A local migration fault-tolerance method for parallel systems based on difference identification, comprising: the system starts fault-tolerant migration of a parallel job and requests new resources for the job migration; job management makes pre-migration preparations; the parallel file system drains in-flight data and preserves its state; the parallel language library drains in-flight messages and synchronizes tasks; the parallel language library extracts the key information that needs to be migrated, notifies the system kernel of it, and notifies job management that the job task is ready for migration; job management calls the system kernel interface to migrate the job task, and the kernel transfers only the kernel state and the job task's process information to the destination node and restores the job task process containing the key information; on the destination node, the parallel file system reopens the corresponding files according to the descriptors recorded before migration and restores the file environment, and the parallel language restores the job's running environment from the key information restored by the kernel; job management rebuilds the job and resumes its execution.

Description

Local migration fault-tolerance method for parallel systems based on difference identification
Technical field
The present invention relates to the technical field of processors, and in particular to a local migration fault-tolerance method for parallel systems based on difference identification.
Background
In parallel systems, fault-tolerant handling of large-scale jobs has always been a key problem affecting system availability, usability, and resource utilization.
Specifically, because a parallel system has a very large number of nodes, frequent node failures regularly interrupt the continuous running of jobs; frequent fault-tolerance handling not only disrupts the continuous running of jobs and lowers the utilization of system resources, but also greatly degrades the user experience.
Job migration that implements proactive fault tolerance based on fault warnings is an effective way to solve this problem. In this approach, when some nodes are found to raise a fault warning or show reduced health, the job tasks on those nodes are generally migrated to other healthy resources, so that node failure does not affect the continuous running of the job.
However, prior-art schemes of this kind mainly either migrate the whole machine environment of the warned node or migrate the job's user processes on the warned node as a whole. In both cases the environment to be migrated is large, the fault-tolerance time is long, and the overhead is high.
Summary of the invention
The technical problem to be solved by the present invention is to address the above drawbacks of the prior art, chiefly the high overhead of conventional job migration, by providing a local migration fault-tolerance method for parallel systems based on difference identification that effectively reduces the overhead of migrating a node's job tasks, shortens the fault-tolerant migration time, lowers the fault-tolerance risk, and improves resource utilization.
According to the present invention, a local migration fault-tolerance method for parallel systems based on difference identification is provided, comprising:
First step: according to the working state of the nodes, the system starts fault-tolerant migration of the parallel job and requests new resources for the job migration;
Second step: job management makes pre-migration preparations;
Third step: the parallel file system drains in-flight data and preserves its state, and the parallel language library drains in-flight messages and synchronizes tasks;
Fourth step: the parallel language library extracts, from the memory used by the user, the key information that needs to be migrated, notifies the system kernel of the key information, and notifies job management that the job task is ready for migration;
Fifth step: job management calls the system kernel interface to migrate the job task, wherein the kernel transfers only the kernel state and the job task's process information to the destination node and restores the job task process containing the key information;
Sixth step: on the destination node, the parallel file system reopens the corresponding files according to the descriptors recorded before migration and restores the file environment;
Seventh step: on the destination node, the parallel language restores the job's running environment from the key information restored by the kernel;
Eighth step: job management rebuilds the job according to the new job running environment and resumes the job's execution.
Preferably, in the first step, the system starts fault-tolerant migration of the parallel job and requests new resources for the job migration when it determines that a node has raised a fault warning or that its health has declined.
Preferably, in the second step, job management notifies the parallel file system and the parallel language of the migration source.
Preferably, in the fourth step, the parallel language library extracts the key information that needs to be migrated from the memory used by the user according to the style of the user's application.
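The eight steps above can be pictured as a fixed sequence that is only entered when the trigger condition of the first step holds. The following is a minimal sketch under stated assumptions: the `Node` class, the numeric `health` score, the `HEALTH_THRESHOLD` value, and the function names are all hypothetical, since the source does not say how node health is measured or how the components are invoked. Each step is represented only by a log line, not by a real implementation.

```python
from dataclasses import dataclass
from typing import List

# Assumed threshold; the source does not say how node health is scored.
HEALTH_THRESHOLD = 0.5


@dataclass
class Node:
    name: str
    health: float = 1.0        # assumed scale: 1.0 means fully healthy
    fault_warning: bool = False


def needs_migration(node: Node) -> bool:
    """Step 1 trigger: a fault warning was raised or health has declined."""
    return node.fault_warning or node.health < HEALTH_THRESHOLD


def migrate(source: Node, destination: Node) -> List[str]:
    """Return the ordered list of steps performed (empty if none is needed)."""
    if not needs_migration(source):
        return []
    return [
        f"S1 start migration; request new resource {destination.name}",
        f"S2 job management announces migration source {source.name}",
        "S3 file system drains in-flight data; library drains messages",
        "S4 library reports key information to the kernel",
        f"S5 kernel transfers kernel state and process info to {destination.name}",
        "S6 file system reopens recorded files on the destination",
        "S7 parallel language restores the running environment",
        "S8 job management rebuilds the job and resumes execution",
    ]


steps = migrate(Node("node2", health=0.3), Node("spare7"))
for line in steps:
    print(line)
```

A healthy source node yields an empty list, which reflects that migration is proactive and only starts on a warning or a drop in health.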
Brief description of the drawings
A more complete understanding of the present invention, and of its attendant advantages and features, will be more readily obtained by reference to the following detailed description taken in conjunction with the accompanying drawing, in which:
Fig. 1 schematically shows a flowchart of the local migration fault-tolerance method for parallel systems based on difference identification according to a preferred embodiment of the present invention.
It should be noted that the drawing is intended to illustrate the present invention, not to limit it. Note that drawings showing structure may not be drawn to scale. In the drawings, identical or similar elements are given identical or similar reference numerals.
Detailed description
To make the content of the present invention clearer and easier to understand, it is described in detail below with reference to specific embodiments and the accompanying drawing.
In the local migration fault-tolerance method for parallel systems based on difference identification according to the present invention, the running state of the parallel job is taken into account to distinguish effectively between the valid information that must be migrated during proactive fault tolerance and the other, invalid information; only the valid information is migrated and restored, achieving low-overhead fault tolerance.
A preferred embodiment of the present invention is described below with reference to the drawing.
Fig. 1 schematically shows a flowchart of the local migration fault-tolerance method for parallel systems based on difference identification according to a preferred embodiment of the present invention.
As shown in Fig. 1, the local migration fault-tolerance method for parallel systems based on difference identification according to the preferred embodiment comprises the following steps, executed in order:
First step S1: according to the working state of the nodes (when the system determines that a node has raised a fault warning or that its health has declined), the system starts fault-tolerant migration of the parallel job and requests new resources for the job migration;
Second step S2: job management makes pre-migration preparations, for example by notifying the parallel file system and the parallel language of the migration source; the migration source is the node that raised the fault warning or showed reduced health in the first step, that is, the node whose job tasks need to be moved out;
Third step S3: the parallel file system drains in-flight data and preserves its state, so as to reach a stable data state and reduce the content to be transferred; and the parallel language library drains in-flight messages and synchronizes tasks, ensuring that a stable message state is reached;
Fourth step S4: the parallel language library distinguishes (for example, according to the style of the user's application), within the memory used by the user, the key information that needs to be migrated from the other information that does not (that is, it extracts the key information that needs to be migrated from the memory used by the user), notifies the system kernel of the key information, and notifies job management that the job task is ready for migration;
Fifth step S5: job management calls the system kernel interface to migrate the job task, wherein the kernel transfers only the kernel state and the job task's process information to the destination node and restores the job task process containing the key information;
Regarding the job task process containing the key information: suppose, for example, that a user process uses 100 MB of memory. In a conventional implementation, the kernel must migrate and restore the entire 100 MB of user memory when the process is migrated. In the present invention, the parallel language library tells the kernel before migration which key information in memory needs to be migrated (say 10 MB); when the kernel then migrates and restores the user process, it only needs to migrate and restore this 10 MB of key information, and the remaining 90 MB of unneeded information need not be migrated or restored.
Specifically, for example, when job management receives the notification that the job task is ready for migration (that is, once job management determines that the job task is ready for migration), job management calls the system kernel interface to migrate the job task; according to the key information to be retained, as notified by the parallel language library in the fourth step S4 (in which the parallel language library tells the kernel which key information needs to be migrated and restored), the kernel only needs to transfer the necessary kernel state and job task process information to the destination node and restore the job task process;
Sixth step S6: on the destination node (that is, the newly requested resource), the parallel file system reopens the corresponding files according to the descriptors recorded before migration and restores the file environment;
Seventh step S7: on the destination node, the parallel language restores the job's running environment from the key information restored by the kernel;
Eighth step S8: job management rebuilds the whole job according to the new job running environment (that is, the environment obtained by replacing the source node with the destination node in the job's original complete running environment) and resumes the job's execution.
Specifically, because rebuilding the job running environment requires the participation of all nodes used by the job (both the destination node newly added to replace the source node and the nodes that have not changed), the new job running environment is the complete environment after migration, not just the destination node. In other words, the new job running environment comprises the destination node that replaces the source node together with the unchanged nodes of the original job running environment.
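The composition of the new job running environment can be illustrated with a small sketch. It assumes, purely for illustration, that a job's environment is represented as an ordered list of node names; the function name `rebuild_job_nodes` and the node names are hypothetical and do not come from the source.

```python
from typing import List


def rebuild_job_nodes(job_nodes: List[str], source: str, destination: str) -> List[str]:
    """Return a new node list in which `source` is replaced by `destination`.

    Unchanged nodes keep their positions; the input list is not modified.
    """
    if source not in job_nodes:
        raise ValueError(f"{source} is not part of the job")
    if destination in job_nodes:
        raise ValueError(f"{destination} already belongs to the job")
    return [destination if node == source else node for node in job_nodes]


old_env = ["node0", "node1", "node2", "node3"]
new_env = rebuild_job_nodes(old_env, source="node2", destination="spare7")
print(new_env)  # ['node0', 'node1', 'spare7', 'node3']
```

The result contains every node of the job, not only the destination, which matches the point that all nodes take part when the job is rebuilt.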
The present invention thus provides a local migration fault-tolerance method for parallel systems based on difference identification that effectively reduces the overhead of migrating a node's job tasks, shortens the fault-tolerant migration time, lowers the fault-tolerance risk, and improves resource utilization.
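The overhead reduction can be made concrete with the description's own example of a process that uses 100M of memory, of which about 10M is key information. The sketch below is only an illustration: the `Region` type, the `is_key` flag, and the region names and sizes other than the 100/10/90 split are assumptions, and the source does not describe how the parallel language library actually marks memory.

```python
from dataclasses import dataclass
from typing import List, Tuple

MIB = 1024 * 1024


@dataclass(frozen=True)
class Region:
    name: str
    size: int        # bytes
    is_key: bool     # hypothetical flag set by the parallel language library


def key_regions(regions: List[Region]) -> List[Region]:
    """Keep only the regions that must be migrated and restored."""
    return [r for r in regions if r.is_key]


def transfer_sizes(regions: List[Region]) -> Tuple[int, int]:
    """Return (bytes moved by whole-process migration, bytes moved for key info only)."""
    full = sum(r.size for r in regions)
    partial = sum(r.size for r in key_regions(regions))
    return full, partial


# Illustrative layout matching the 100M / 10M example in the text.
regions = [
    Region("solver_state", 10 * MIB, True),
    Region("message_buffers", 30 * MIB, False),
    Region("scratch_space", 60 * MIB, False),
]
full, partial = transfer_sizes(regions)
print(full // MIB, partial // MIB, (full - partial) // MIB)  # prints: 100 10 90
```

Under these assumed sizes, identifying the difference between key and non-key memory cuts the transferred volume to one tenth of a whole-process migration.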
To help those skilled in the art better understand the present invention, the terms used in this specification are explained below:
Parallel job: generally a set of task processes written in a parallel language such as MPI and running on the computing resources of a parallel computer; it is started and controlled by the job management system, and its processes cooperate to solve a single problem.
Process migration: moving a process from its current location to a given processor on a specified node, where it continues to access all of its resources and continues to run. The main work is to extract the process state and then regenerate the process on the destination node from that state. In parallel computing, process migration is a very effective means of maintaining load balance and high fault tolerance.
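The "extract state, then regenerate" idea in the definition of process migration can be sketched in user space. This is a simplified illustration, not the kernel-level mechanism of the invention: it assumes the state consists of a small dictionary of key data plus a record of each open file's path, mode, and offset, serialized as JSON. The function names `capture`, `restore`, and `reopen_mode` are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, List, Tuple


def reopen_mode(mode: str) -> str:
    """Avoid truncating on reopen: a file first opened with 'w' is reopened 'r+'."""
    if "w" in mode:
        return mode.replace("w", "r") if "+" in mode else mode.replace("w", "r+")
    return mode


def capture(key_data: Dict[str, int], open_files: list) -> str:
    """Source side: serialize key data and each file's path, mode, and offset."""
    records = [{"path": f.name, "mode": f.mode, "offset": f.tell()} for f in open_files]
    return json.dumps({"key_data": key_data, "files": records})


def restore(blob: str) -> Tuple[Dict[str, int], List]:
    """Destination side: rebuild key data and reopen files at the saved offsets."""
    state = json.loads(blob)
    handles = []
    for rec in state["files"]:
        handle = open(rec["path"], reopen_mode(rec["mode"]))
        handle.seek(rec["offset"])
        handles.append(handle)
    return state["key_data"], handles


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.dat")
    with open(path, "wb") as out:
        out.write(b"0123456789")

    src = open(path, "rb")
    src.read(4)                                  # the process has consumed 4 bytes
    blob = capture({"iteration": 42}, [src])
    src.close()                                  # the source process goes away

    key_data, handles = restore(blob)            # "destination" side
    remaining = handles[0].read()
    handles[0].close()

print(key_data, remaining)  # {'iteration': 42} b'456789'
```

The sketch assumes the file is reachable under the same path on the destination, as it would be on a shared parallel file system.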
It should be understood that although the present invention has been disclosed above by way of preferred embodiments, these embodiments are not intended to limit the invention. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make many possible changes and modifications to the technical solution, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change, or adaptation made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.

Claims (1)

1. A local migration fault-tolerance method for parallel systems based on difference identification, characterized by comprising:
First step: according to the working state of the nodes, when a node raises a fault warning or its health declines, the system starts fault-tolerant migration of the parallel job and requests new resources for the job migration;
Second step: job management makes pre-migration preparations, wherein job management notifies the parallel file system and the parallel language of the migration source, and wherein the migration source is the node that raised the fault warning or showed reduced health in the first step and whose job tasks need to be moved out;
Third step: the parallel file system drains in-flight data and preserves its state, and the parallel language library drains in-flight messages and synchronizes tasks;
Fourth step: the parallel language library distinguishes, within the memory used by the user, the key information that needs to be migrated from the other information that does not, notifies the system kernel of the key information, and notifies job management that the job task is ready for migration; wherein the parallel language library extracts the key information that needs to be migrated from the memory used by the user according to the style of the user's application;
Fifth step: job management calls the system kernel interface to migrate the job task, wherein the kernel transfers only the kernel state and the job task's process information to the destination node and restores the job task process containing the key information;
Sixth step: on the destination node, the parallel file system reopens the corresponding files according to the descriptors recorded before migration and restores the file environment;
Seventh step: on the destination node, the parallel language restores the job's running environment from the key information restored by the kernel;
Eighth step: job management rebuilds the job according to the new job running environment and resumes the job's execution, wherein the new job running environment comprises the destination node that replaces the source node together with the unchanged nodes of the original job running environment.
CN201510830319.1A 2015-11-24 2015-11-24 Local migration fault-tolerance method for parallel systems based on difference identification Active CN105468457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510830319.1A CN105468457B (en) 2015-11-24 2015-11-24 Local migration fault-tolerance method for parallel systems based on difference identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510830319.1A CN105468457B (en) 2015-11-24 2015-11-24 Local migration fault-tolerance method for parallel systems based on difference identification

Publications (2)

Publication Number Publication Date
CN105468457A CN105468457A (en) 2016-04-06
CN105468457B true CN105468457B (en) 2019-04-09

Family

ID=55606192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510830319.1A Active CN105468457B (en) 2015-11-24 2015-11-24 Local migration fault-tolerance method for parallel systems based on difference identification

Country Status (1)

Country Link
CN (1) CN105468457B (en)

Also Published As

Publication number Publication date
CN105468457A (en) 2016-04-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant