CN113742081A - Distributed task migration method and distributed system based on container technology - Google Patents

Distributed task migration method and distributed system based on container technology

Info

Publication number
CN113742081A
CN113742081A
Authority
CN
China
Prior art keywords
task
node
container
key data
program
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111066003.1A
Other languages
Chinese (zh)
Inventor
王中华
王一凡
杨子怡
唐丽园
何旺宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Aeronautics Computing Technique Research Institute of AVIC
Original Assignee
Xian Aeronautics Computing Technique Research Institute of AVIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Aeronautics Computing Technique Research Institute of AVIC filed Critical Xian Aeronautics Computing Technique Research Institute of AVIC
Priority to CN202111066003.1A priority Critical patent/CN113742081A/en
Publication of CN113742081A publication Critical patent/CN113742081A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G06F 9/5088 Techniques for rebalancing the load in a distributed system involving task migration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 2009/4557 Distribution of virtual machine instances; Migration and load balancing

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a distributed task migration method and a distributed system based on container technology. The task migration method comprises the following steps: creating a container in a node through a configuration file, and uploading the configuration file corresponding to the container to a program site server once the container is created and runs successfully; during task execution on the node, storing the serialized key data of the node in the program site server through container directory mounting; monitoring, by the task control server, the task running state of each node in the cluster; and, if a node failure or container failure is detected, starting an identical container environment on a selected target node according to the corresponding configuration file and key data, and deserializing the key data, thereby restoring the task running environment and running state.

Description

Distributed task migration method and distributed system based on container technology
Technical Field
The invention relates to the field of computer technology, in particular to a distributed task migration method and a distributed system based on container technology.
Background
With the continuous development of virtualization technology, container technology has been widely used in the civilian sector. Containers are a lightweight virtualization technology that simplifies configuration, reduces the coupling between the application environment and system hardware, and provides a consistent environment from code development to online deployment, thereby improving development efficiency. To meet challenges such as resource shortages and long application development cycles, the aviation field is also turning to container technology. Deploying airborne applications in containers accelerates the development of new warplane functions and software iteration, and rapidly improves software delivery capability.
Because the airborne environment is severe, warplane node failures, inter-aircraft communication failures, and task crashes caused by various factors can prevent multi-aircraft cooperative tasks from completing. To improve the availability of the warplane cluster and ensure the effective execution of combat missions, a task migration mechanism is needed so that tasks in a failed node's container, or tasks that cannot run normally, can be re-run.
Disclosure of Invention
The invention provides a distributed task migration method and a distributed system based on container technology, which achieve site persistence of a program by serializing key data during the running of a task program in a container and storing the key data in a file system. When tasks are dynamically migrated, the serialized program key data are deserialized in a new node's container to ensure the effective execution of tasks in the cluster.
In a first aspect, the present application provides a distributed task migration method based on container technology, where the method is applied to a distributed system, and the tasks executed in all task nodes of the distributed system are deployed in containers; the task migration method comprises the following steps:
creating a container in a node through a configuration file, and uploading the configuration file corresponding to the container to a program site server once the container is created and runs successfully;
during task execution on the node, storing the serialized key data of the node in the program site server through container directory mounting;
monitoring, by the task control server, the task running state of each node in the cluster;
if a node failure or container failure is detected, starting an identical container environment on a selected target node according to the corresponding configuration file and key data, and deserializing the key data, thereby restoring the task running environment and running state.
Preferably, serializing the key data of the tasks on the nodes and storing them in the program site server comprises any of the following:
serializing the key data at a preset time and storing them in the program site server;
serializing the key data and storing them in the program site server when the task control server sends an instruction to the task in the node;
re-serializing the key data on the node and storing them in the program site server whenever the key data change;
serializing the key data on the node according to a preset period and storing them in the program site server;
serializing the key data on the node according to a preset period, but only when the key data have changed within that period, and storing them in the program site server.
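The trigger strategies listed above (preset time, explicit instruction, on change, periodic, periodic-if-changed) can be sketched as a small checkpoint policy. This is an illustrative sketch only; the `Checkpointer` class, its method names, and the upload callback standing in for the program site server are assumptions, not the patent's implementation.

```python
import hashlib
import pickle
import time


class Checkpointer:
    """Hypothetical sketch of the on-change and periodic-if-changed triggers.

    Serializes key data with pickle and hands the bytes to an upload
    callback (standing in for the program site server).
    """

    def __init__(self, upload, period_s=30.0):
        self.upload = upload          # callable(bytes) -> None
        self.period_s = period_s
        self._last_digest = None
        self._last_time = 0.0

    def _digest(self, key_data):
        return hashlib.sha256(pickle.dumps(key_data)).hexdigest()

    def on_change(self, key_data):
        """Re-serialize and store whenever the key data change."""
        d = self._digest(key_data)
        if d != self._last_digest:
            self.upload(pickle.dumps(key_data))
            self._last_digest = d
            return True
        return False

    def periodic_if_changed(self, key_data, now=None):
        """Serialize per preset period, but only if the data changed."""
        now = time.monotonic() if now is None else now
        if now - self._last_time >= self.period_s:
            self._last_time = now
            return self.on_change(key_data)
        return False
```

The fixed-time and on-instruction strategies reduce to calling `on_change` from a timer or from the task control server's command handler.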
preferably, the selecting a target node for task migration specifically includes:
the task control server selects a node with the minimum load as a target node for migration by traversing the load of the cluster nodes;
or the task control server migrates by selecting the node with the best cluster performance.
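A minimal sketch of the minimum-load selection policy, assuming the task control server holds one load figure per candidate node; the data shapes and function name are illustrative, not from the patent.

```python
def select_target_node(node_loads, failed=()):
    """Traverse cluster node loads and pick the least-loaded healthy node.

    node_loads: mapping of node id -> current load (e.g. CPU utilisation).
    failed: node ids to exclude (the faulty node itself, unreachable nodes).
    Returns the chosen node id, or None if no candidate remains.
    """
    candidates = {n: load for n, load in node_loads.items() if n not in failed}
    if not candidates:
        return None
    # min over the dict keys, ordered by their load values
    return min(candidates, key=candidates.get)
```

The best-performance variant is the same traversal with a performance score maximized instead of a load minimized.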
Preferably, the configuration file includes a yaml file (i.e., a yml file), an xml file, a properties file, and a json file.
Preferably, a node failure means that a node fault is monitored while a task on the node is not yet completed;
a container failure means that the running time of a task on the node is monitored to exceed the set running time.
Preferably, the key data on the node include the program's variables, data structures, and execution phases.
In a second aspect, the present application provides a distributed system based on container technology, the system comprising a task control server, a program site server, and N task nodes, each task node comprising a processor and a computer-readable storage medium, wherein:
the task control server is used for cluster monitoring, distributing tasks to the N task nodes, selecting target nodes for task migration, and triggering and executing the task migration process;
the program site server is used for storing the container configuration files and key data of the tasks currently and historically executed in the N task nodes;
the computer readable storage medium of each task node stores a computer program for implementing distributed task migration.
Preferably, the computer program is loaded by the processor to perform the following steps:
step a1: if the task control server detects no task abnormality, the node does not need task migration, and the node executes steps b1 to b3;
step a2: if the task control server detects a task abnormality and the node needs task migration, the task control server determines the target node of the task migration, and the target node executes steps c1 to c4;
step b1: the node's task program serializes the key data, converting the objects holding the key data into a byte sequence;
step b2: the node's task program persists the key data, storing them through container directory mounting in the host machine, i.e., the node's file system;
step b3: the node's task program uploads the container configuration file and the persisted key data to the program site server;
step c1: the target node of the task migration downloads the container configuration file to be recovered and the persisted key data from the program site server;
step c2: the container environment is restored at the task migration target node through the container configuration file, and the task program is loaded;
step c3: the persisted key data are deserialized;
step c4: the program site is restored at the target node of the task migration.
Compared with the prior art, the invention has the following advantages:
The method saves the program site by serializing and persisting the key data generated during program operation, and uses those key data as program input to restore the program site during task migration. Compared with migration via a VM snapshot or container image, this task migration approach is lighter and faster, and offers a clear space advantage when stored on disk.
The invention uses two servers outside the distributed nodes as the task controller and the program site store, so that the nodes in the cluster each play their own role: the distributed nodes only need to focus on executing their task programs, without considering task migration or program site storage; the task control server only needs to monitor the cluster and analyze, distribute, and migrate tasks; and the program site server only needs large-capacity storage to hold all program sites of the cluster.
Drawings
Fig. 1 shows the distributed cluster architecture and migration flow of an embodiment.
FIG. 2 is a task migration flow diagram of different failure granularities, according to an embodiment.
Detailed Description
The invention is further described in detail below with reference to the figures and examples.
Example one
The technical scheme of the invention is as follows:
according to the container-based distributed task migration method, the core mechanism in the cluster is the restart of the same node or a cross-node offline container and the recovery of tasks. In order to achieve the purpose, the cluster firstly needs to have task error sensing capability, constantly monitors and judges whether the container normally operates; secondly, the task program executed in the container needs to have the capability of saving the site; finally, the cluster needs to have the capability of cross-node on-site recovery of the program.
To this end, the container-based distributed task migration method comprises the following steps:
a container in a node is created through a configuration file, and once the container is created and runs successfully, the configuration file is uploaded to the program site server;
during program operation, the key data are serialized and stored in the program site server through container directory mounting;
during program operation, whenever the key data change, they are re-serialized and stored in the program site server;
the task control server monitors the running state of the tasks distributed across the cluster; if a node fails while its assigned tasks are unfinished, or a task's running time exceeds its set limit, an identical container environment is started on another node according to the container's configuration file and the serialized key data saved by the task program in the container, and the key data are deserialized, thereby restoring the task's running environment and running state.
Based on the above scheme, the invention further optimizes as follows:
alternatively, in the running process of the program, the time for saving the site can be set by the program itself, for example, after calculating the important data result or saving the data at regular time.
Alternatively, the task control server can inform the running program to save the site at any time.
Optionally, in the process of running the program, if the task regularly saves the program site, the program site is only required to be uploaded to the program site server when the key data changes.
It should be noted that the critical data includes various variables, data structures, and operation phases of the program.
Optionally, the task control server may select a target node for task migration through a certain policy, and may select a node with the smallest load for migration by traversing the cluster node load; and the node with the best cluster performance can be selected for migration.
Optionally, the program site server has a cache function, and stores historical program sites of the nodes, and the program can be rolled back to the previous program state according to different storage time points of the program sites.
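The rollback capability described above can be sketched by keying each saved site by its storage timestamp and restoring the newest snapshot not later than a requested point in time. The `SiteHistory` class and its in-memory layout are hypothetical, for illustration only.

```python
import bisect


class SiteHistory:
    """Hypothetical cache of one node's historical program sites.

    Each saved site is keyed by the timestamp at which it was stored,
    so a task can be rolled back to any earlier saved state.
    """

    def __init__(self):
        self._times = []      # sorted timestamps
        self._sites = {}      # timestamp -> serialized site bytes

    def save(self, timestamp, site_bytes):
        bisect.insort(self._times, timestamp)
        self._sites[timestamp] = site_bytes

    def rollback(self, not_after):
        """Return the newest site saved at or before `not_after`, or None."""
        i = bisect.bisect_right(self._times, not_after)
        if i == 0:
            return None
        return self._sites[self._times[i - 1]]
```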
Example two
This embodiment mainly covers site saving and cross-node site recovery of container-based distributed tasks, as well as task migration strategies for different fault granularities. The distributed cluster architecture and migration flow of this embodiment are shown in fig. 1.
The invention targets a distributed environment similar to an airborne cloud; to meet high-availability (HA) requirements, the cluster needs a task migration mechanism. The cluster is composed of distributed nodes capable of executing tasks, and the tasks in the nodes run in containers. The task control server is responsible for task management, distribution, monitoring, and migration for each node in the cluster, and the program site server is responsible for storing the container configuration file of each node's tasks and the programs' running sites.
When a new task sequence arrives, the task control server analyzes it, distributes subtasks to available nodes in the cluster, and runs them in containers through written configuration files. The task control server monitors the running state of the distributed tasks at all times; once a task on a node is found to be faulty, the system enters the task migration process.
The task migration process of the distributed system is further described below in conjunction with fig. 1.
Step 1: the task program converts the objects holding the key data into a byte sequence, realizing key data serialization;
Step 2: the task program stores the byte sequence in the host machine through container directory mounting, realizing key data persistence;
in practice, the key data are stored in the node's file system through container directory mounting;
Step 3: the task program uploads the container configuration file of each running container on each node, together with each container's key data, to the program site server;
Step 4: when the task control server detects a node fault or container fault, it selects the target node for task migration, and the target node downloads the container configuration file to be recovered and the corresponding key data from the program site server;
Step 5: the container environment is restored on the target node through the container configuration file, and the task program is loaded;
Step 6: the persisted key data are deserialized;
Step 7: the program site is restored on the target node of the task migration.
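The data-handling side of steps 1 to 7 can be sketched end-to-end with Python's pickle module standing in for the serialization mechanism, and a directory on disk standing in for the mounted host path and the program site server. All function and file names here are illustrative assumptions, not the patent's implementation.

```python
import json
import pickle
from pathlib import Path


def save_site(site_dir, key_data, container_config):
    """Steps 1-3: serialize the key data, persist them under the mounted
    directory, and write the container configuration next to them."""
    site_dir = Path(site_dir)
    site_dir.mkdir(parents=True, exist_ok=True)
    (site_dir / "key_data.bin").write_bytes(pickle.dumps(key_data))
    (site_dir / "container.json").write_text(json.dumps(container_config))


def restore_site(site_dir):
    """Steps 4-7 (data side): fetch the config and key data, deserialize
    them, and return both so the task can resume from its saved state."""
    site_dir = Path(site_dir)
    container_config = json.loads((site_dir / "container.json").read_text())
    key_data = pickle.loads((site_dir / "key_data.bin").read_bytes())
    return container_config, key_data
```

In a real deployment, `site_dir` would be a bind-mounted host directory whose contents are synchronized to the program site server, and restarting the container from `container.json` is the part this data-only sketch omits.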
While the tasks in the nodes execute normally, the programs in the containers run and repeat steps 1 to 3 to save the program site; the timing and frequency of steps 1 to 3 are determined by the program. When the task control server detects that a node's task execution is abnormal and task migration is required, steps 4 to 7 are performed to restore the program site with the support of the program site server.
EXAMPLE III
The present embodiment performs task migration according to different failure granularities, as shown in fig. 2.
When the task control server detects that a task is running abnormally, task migration is required. The fault type is judged first; faults in a distributed cluster can be roughly divided into three types: node faults, container faults, and program faults. Node faults may be caused by external environmental factors or node hardware failures, and are more likely in a harsh environment such as an airborne cloud; container faults may be caused by an unresponsive container, abnormal container exit, and the like; program faults may be caused by memory violations, function stack overflows, and the like during program operation.
Once the fault type is determined, the corresponding migration policy can be applied. Specifically, for a node fault, since multiple tasks may run on the node, all interrupted tasks on the failed node must be migrated, and their sites are restored in turn on the target nodes selected for migration; for a container fault, the node running the task is still normal, and since the container executes only one task, only that task needs to be migrated; for a program fault, both the node and the container are normal, and developers must debug and fix the program and restart the container.
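The three fault-granularity strategies can be sketched as a dispatch on fault type. Fault names, arguments, and return values are illustrative assumptions, not taken from the patent.

```python
def migration_plan(fault_type, node_tasks, failed_task=None):
    """Return the list of tasks to migrate for a given fault granularity.

    fault_type: "node", "container", or "program".
    node_tasks: all tasks currently assigned to the affected node.
    failed_task: the single task involved for container/program faults.
    """
    if fault_type == "node":
        # Every interrupted task on the failed node must move.
        return list(node_tasks)
    if fault_type == "container":
        # The node is healthy; only the one task in that container moves.
        return [failed_task]
    if fault_type == "program":
        # Node and container are healthy: no migration; the program must
        # be debugged and fixed, then the container restarted.
        return []
    raise ValueError(f"unknown fault type: {fault_type!r}")
```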
In summary, the present application relates to a distributed task migration method based on container technology, aiming to quickly and effectively recover failed tasks on nodes in a distributed cluster and to ensure the cluster's availability. The method comprises: distributing tasks to the cluster and monitoring the task execution state through the task control server, judging the fault type and granularity of a task, and selecting an available node in the cluster as the target node for task migration; saving the program site and container configuration file of each node in the cluster through the program site server; saving a program site as serialized, persisted program key data together with the corresponding container's configuration file; restoring the container environment in which the task executes through the container configuration file; and then using the deserialized key data as the program input of the migrated task, thereby achieving container-based program site saving and cross-node site recovery.

Claims (8)

1. A distributed task migration method based on container technology, characterized in that the method is applied to a distributed system, and the tasks executed in all task nodes of the distributed system are deployed in containers; the task migration method comprises the following steps:
creating a container in a node through a configuration file, and uploading the configuration file corresponding to the container to a program site server once the container is created and runs successfully;
during task execution on the node, storing the serialized key data of the node in the program site server through container directory mounting;
monitoring, by the task control server, the task running state of each node in the cluster;
if a node failure or container failure is detected, starting an identical container environment on a selected target node according to the corresponding configuration file and key data, and deserializing the key data, thereby restoring the task running environment and running state.
2. The distributed task migration method based on container technology according to claim 1, wherein serializing the key data of the tasks on the nodes and storing them in the program site server comprises:
serializing the key data at a preset time and storing them in the program site server;
serializing the key data and storing them in the program site server when the task control server sends an instruction to the task in the node;
serializing the key data on the node according to a preset period and storing them in the program site server;
and serializing the key data on the node according to a preset period, but only when the key data have changed within that period, and storing them in the program site server.
3. The distributed task migration method based on container technology according to claim 1, wherein selecting a target node for task migration specifically includes:
the task control server traverses the load of the cluster nodes and selects the node with the minimum load as the migration target node;
or the task control server selects the node with the best performance in the cluster as the migration target.
4. The container technology based distributed task migration method of claim 1, wherein the configuration files comprise a yaml file, an xml file, a properties file, and a json file.
5. The container technology based distributed task migration method according to claim 1, wherein a node failure means that a node fault is monitored while a task on the node is not yet completed;
a container failure means that the running time of a task on the node is monitored to exceed the set running time.
6. The container technology based distributed task migration method of claim 1, wherein the key data on the node comprise the program's variables, data structures, and execution phases.
7. A distributed system based on container technology, comprising a task control server, a program site server, and N task nodes, each task node comprising a processor and a computer-readable storage medium, wherein:
the task control server is used for cluster monitoring, distributing tasks to the N task nodes, selecting target nodes for task migration, and triggering and executing the task migration process;
the program site server is used for storing the container configuration files and key data of the tasks currently and historically executed in the N task nodes;
the computer readable storage medium of each task node stores a computer program for implementing distributed task migration.
8. The distributed system of claim 7, wherein the computer program is loaded by the processor to perform the following steps:
step a1: if the task control server detects no task abnormality, the node does not need task migration, and the node executes steps b1 to b3;
step a2: if the task control server detects a task abnormality and the node needs task migration, the task control server determines the target node of the task migration, and the target node executes steps c1 to c4;
step b1: the node's task program serializes the key data, converting the objects holding the key data into a byte sequence;
step b2: the node's task program persists the key data, storing them through container directory mounting in the host machine, i.e., the node's file system;
step b3: the node's task program uploads the container configuration file and the persisted key data to the program site server;
step c1: the target node of the task migration downloads the container configuration file to be recovered and the persisted key data from the program site server;
step c2: the container environment is restored at the task migration target node through the container configuration file, and the task program is loaded;
step c3: the persisted key data are deserialized;
step c4: the program site is restored at the target node of the task migration.
CN202111066003.1A 2021-09-10 2021-09-10 Distributed task migration method and distributed system based on container technology Pending CN113742081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111066003.1A CN113742081A (en) 2021-09-10 2021-09-10 Distributed task migration method and distributed system based on container technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111066003.1A CN113742081A (en) 2021-09-10 2021-09-10 Distributed task migration method and distributed system based on container technology

Publications (1)

Publication Number Publication Date
CN113742081A (en) 2021-12-03

Family

ID=78738191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111066003.1A Pending CN113742081A (en) 2021-09-10 2021-09-10 Distributed task migration method and distributed system based on container technology

Country Status (1)

Country Link
CN (1) CN113742081A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114598585A (en) * 2022-03-07 2022-06-07 浪潮云信息技术股份公司 Method and system for monitoring hardware through snmptrapd
CN114697191A (en) * 2022-03-29 2022-07-01 浪潮云信息技术股份公司 Resource migration method, device, equipment and storage medium
CN115766405A (en) * 2023-01-09 2023-03-07 苏州浪潮智能科技有限公司 Fault processing method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106933508A (en) * 2017-02-14 2017-07-07 深信服科技股份有限公司 The moving method and device of application container
CN107590033A (en) * 2017-09-07 2018-01-16 网宿科技股份有限公司 A kind of methods, devices and systems of establishment DOCKER containers
CN111190688A (en) * 2019-12-19 2020-05-22 西安交通大学 Cloud data center-oriented Docker migration method and system
CN111506386A (en) * 2020-02-27 2020-08-07 平安科技(深圳)有限公司 Virtual machine online migration method, device, equipment and computer readable storage medium
CN112532763A (en) * 2020-11-26 2021-03-19 新华三大数据技术有限公司 Container operation data synchronization method and device


Similar Documents

Publication Publication Date Title
CN113742081A (en) Distributed task migration method and distributed system based on container technology
US11556438B2 (en) Proactive cluster compute node migration at next checkpoint of cluster upon predicted node failure
US8375363B2 (en) Mechanism to change firmware in a high availability single processor system
CN111290834B (en) Method, device and equipment for realizing high service availability based on cloud management platform
US20200110675A1 (en) Data backup and disaster recovery between environments
US7194652B2 (en) High availability synchronization architecture
US7188237B2 (en) Reboot manager usable to change firmware in a high availability single processor system
EP2802990B1 (en) Fault tolerance for complex distributed computing operations
US20040083402A1 (en) Use of unique XID range among multiple control processors
US7065673B2 (en) Staged startup after failover or reboot
US8479038B1 (en) Method and apparatus for achieving high availability for applications and optimizing power consumption within a datacenter
CN113569987A (en) Model training method and device
EP3671461A1 (en) Systems and methods of monitoring software application processes
Bilal et al. Fault tolerance in the cloud
EP3617887B1 (en) Method and system for providing service redundancy between a master server and a slave server
CN111935244B (en) Service request processing system and super-integration all-in-one machine
CN114138732A (en) Data processing method and device
WO2024041363A1 (en) Serverless-architecture-based distributed fault-tolerant system, method and apparatus, and device and medium
US20210173698A1 (en) Hosting virtual machines on a secondary storage system
CN112269693B (en) Node self-coordination method, device and computer readable storage medium
Ooi et al. Dynamic service placement and redundancy to ensure service availability during resource failures
CN111580792B (en) High-reliability satellite-borne software architecture design method based on operating system
CN113626147A (en) Ocean platform computer control method and system based on virtualization technology
El-Desoky et al. Improving fault tolerance in desktop grids based on incremental checkpointing
US20230092343A1 (en) Lockstep processor recovery for vehicle applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination