CN113742081A - Distributed task migration method and distributed system based on container technology - Google Patents

Distributed task migration method and distributed system based on container technology

Info

Publication number
CN113742081A
CN113742081A
Authority
CN
China
Prior art keywords
task
node
container
key data
program
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111066003.1A
Other languages
Chinese (zh)
Inventor
王中华
王一凡
杨子怡
唐丽园
何旺宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Aeronautics Computing Technique Research Institute of AVIC
Original Assignee
Xian Aeronautics Computing Technique Research Institute of AVIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Aeronautics Computing Technique Research Institute of AVIC filed Critical Xian Aeronautics Computing Technique Research Institute of AVIC
Priority to CN202111066003.1A priority Critical patent/CN113742081A/en
Publication of CN113742081A publication Critical patent/CN113742081A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G06F 9/5088 Techniques for rebalancing the load in a distributed system involving task migration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 2009/4557 Distribution of virtual machine instances; Migration and load balancing

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a distributed task migration method and a distributed system based on container technology. The task migration method comprises the following steps: creating a container in a node through a configuration file, and uploading the configuration file corresponding to the container to a program site server once the container is created and runs successfully; during task execution on the node, storing the serialized key data of the node in the program site server through container directory mounting; monitoring, by the task control server, the task running state of each node in the cluster; and, if a node failure or container failure is detected, starting an identical container environment on a selected target node according to the corresponding configuration file and key data, and deserializing the key data, thereby restoring the task running environment and running state.

Description

Distributed task migration method and distributed system based on container technology
Technical Field
The invention relates to the field of computer technology, in particular to a distributed task migration method and a distributed system based on container technology.
Background
With the continuous development of virtualization technology, container technology has been widely used in the civilian sector. Containers are a lightweight virtualization technology that simplifies configuration, reduces the coupling between the application environment and system hardware, and provides a consistent environment from code development to online deployment, thereby improving development efficiency. To meet challenges such as resource shortages and long application development cycles, the aviation field is also turning to container technology. Deploying airborne applications in containers accelerates the development of new warplane functions and software iteration, and rapidly improves software delivery capability.
Because the airborne environment is severe, warplane node failures, inter-aircraft communication failures, and task crashes caused by various factors can prevent multi-aircraft cooperative tasks from completing. To improve the availability of the warplane cluster and ensure the effective execution of combat missions, a task migration mechanism is needed so that tasks in a failed node's container, or tasks that cannot run normally, can be re-run.
Disclosure of Invention
The invention provides a distributed task migration method and a distributed system based on container technology, which achieve site persistence of a program by serializing key data during the running of a task program in a container and storing the key data in a file system. When tasks are dynamically migrated, the serialized program key data are deserialized in a new node's container to ensure the effective execution of tasks in the cluster.
In a first aspect, the present application provides a distributed task migration method based on container technology, where the method is applied to a distributed system, and the tasks executed in all task nodes of the distributed system are deployed in containers; the task migration method comprises the following steps:
creating a container in a node through a configuration file, and uploading the configuration file corresponding to the container to a program site server once the container is created and runs successfully;
during task execution on the node, storing the serialized key data of the node in the program site server through container directory mounting;
monitoring, by the task control server, the task running state of each node in the cluster;
if a node failure or container failure is detected, starting an identical container environment on a selected target node according to the corresponding configuration file and key data, and deserializing the key data, thereby restoring the task running environment and running state.
Preferably, serializing the key data of the tasks on the nodes and storing them in the program site server comprises any of the following:
serializing the key data at a preset time and storing them in the program site server;
serializing the key data and storing them in the program site server when the task control server sends an instruction to the task in the node;
re-serializing the key data on the node and storing them in the program site server whenever the key data change;
serializing the key data on the node according to a preset period and storing them in the program site server;
serializing the key data on the node according to a preset period, but only when the key data have changed within that period, and storing them in the program site server.
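The trigger strategies listed above (preset time, explicit instruction, on change, periodic, periodic-if-changed) can be sketched as a small checkpoint policy. This is an illustrative sketch only; the `Checkpointer` class, its method names, and the upload callback standing in for the program site server are assumptions, not the patent's implementation.

```python
import hashlib
import pickle
import time


class Checkpointer:
    """Hypothetical sketch of the on-change and periodic-if-changed triggers.

    Serializes key data with pickle and hands the bytes to an upload
    callback (standing in for the program site server).
    """

    def __init__(self, upload, period_s=30.0):
        self.upload = upload          # callable(bytes) -> None
        self.period_s = period_s
        self._last_digest = None
        self._last_time = 0.0

    def _digest(self, key_data):
        return hashlib.sha256(pickle.dumps(key_data)).hexdigest()

    def on_change(self, key_data):
        """Re-serialize and store whenever the key data change."""
        d = self._digest(key_data)
        if d != self._last_digest:
            self.upload(pickle.dumps(key_data))
            self._last_digest = d
            return True
        return False

    def periodic_if_changed(self, key_data, now=None):
        """Serialize per preset period, but only if the data changed."""
        now = time.monotonic() if now is None else now
        if now - self._last_time >= self.period_s:
            self._last_time = now
            return self.on_change(key_data)
        return False
```

The fixed-time and on-instruction strategies reduce to calling `on_change` from a timer or from the task control server's command handler.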
preferably, the selecting a target node for task migration specifically includes:
the task control server selects a node with the minimum load as a target node for migration by traversing the load of the cluster nodes;
or the task control server migrates by selecting the node with the best cluster performance.
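A minimal sketch of the minimum-load selection policy, assuming the task control server holds one load figure per candidate node; the data shapes and function name are illustrative, not from the patent.

```python
def select_target_node(node_loads, failed=()):
    """Traverse cluster node loads and pick the least-loaded healthy node.

    node_loads: mapping of node id -> current load (e.g. CPU utilisation).
    failed: node ids to exclude (the faulty node itself, unreachable nodes).
    Returns the chosen node id, or None if no candidate remains.
    """
    candidates = {n: load for n, load in node_loads.items() if n not in failed}
    if not candidates:
        return None
    # min over the dict keys, ordered by their load values
    return min(candidates, key=candidates.get)
```

The best-performance variant is the same traversal with a performance score maximized instead of a load minimized.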
Preferably, the configuration file includes a yaml file (i.e., a yml file), an xml file, a properties file, and a json file.
Preferably, a node failure means that a node fault is monitored while a task on the node is not yet completed;
a container failure means that the running time of a task on the node is monitored to exceed the set running time.
Preferably, the key data on the node include the program's variables, data structures, and execution phases.
In a second aspect, the present application provides a distributed system based on container technology, the system comprising a task control server, a program site server, and N task nodes, each task node comprising a processor and a computer-readable storage medium, wherein:
the task control server is used for cluster monitoring, distributing tasks to the N task nodes, selecting target nodes for task migration, and triggering and executing the task migration process;
the program site server is used for storing the container configuration files and key data of the tasks currently and historically executed in the N task nodes;
the computer readable storage medium of each task node stores a computer program for implementing distributed task migration.
Preferably, the computer program is loaded by the processor to perform the following steps:
step a1: if the task control server detects no task abnormality, the node does not need task migration, and the node executes steps b1 to b3;
step a2: if the task control server detects a task abnormality and the node needs task migration, the task control server determines the target node of the task migration, and the target node executes steps c1 to c4;
step b1: the node's task program serializes the key data, converting the objects holding the key data into a byte sequence;
step b2: the node's task program persists the key data, storing them through container directory mounting in the host machine, i.e., the node's file system;
step b3: the node's task program uploads the container configuration file and the persisted key data to the program site server;
step c1: the target node of the task migration downloads the container configuration file to be recovered and the persisted key data from the program site server;
step c2: the container environment is restored at the task migration target node through the container configuration file, and the task program is loaded;
step c3: the persisted key data are deserialized;
step c4: the program site is restored at the target node of the task migration.
Compared with the prior art, the invention has the following advantages:
The method saves the program site by serializing and persisting the key data generated during program operation, and uses those key data as program input to restore the program site during task migration. Compared with migration via a VM snapshot or container image, this task migration approach is lighter and faster, and offers a clear space advantage when stored on disk.
The invention uses two servers outside the distributed nodes as the task controller and the program site store, so that the nodes in the cluster each play their own role: the distributed nodes only need to focus on executing their task programs, without considering task migration or program site storage; the task control server only needs to monitor the cluster and analyze, distribute, and migrate tasks; and the program site server only needs large-capacity storage to hold all program sites of the cluster.
Drawings
Fig. 1 shows the distributed cluster architecture and migration flow of an embodiment.
FIG. 2 is a task migration flow diagram of different failure granularities, according to an embodiment.
Detailed Description
The invention is further described in detail below with reference to the figures and examples.
Example one
The technical scheme of the invention is as follows:
according to the container-based distributed task migration method, the core mechanism in the cluster is the restart of the same node or a cross-node offline container and the recovery of tasks. In order to achieve the purpose, the cluster firstly needs to have task error sensing capability, constantly monitors and judges whether the container normally operates; secondly, the task program executed in the container needs to have the capability of saving the site; finally, the cluster needs to have the capability of cross-node on-site recovery of the program.
To this end, the container-based distributed task migration method comprises the following steps:
a container in a node is created through a configuration file, and once the container is created and runs successfully, the configuration file is uploaded to the program site server;
during program operation, the key data are serialized and stored in the program site server through container directory mounting;
during program operation, whenever the key data change, they are re-serialized and stored in the program site server;
the task control server monitors the running state of the tasks distributed across the cluster; if a node fails while its assigned tasks are unfinished, or a task's running time exceeds its set limit, an identical container environment is started on another node according to the container's configuration file and the serialized key data saved by the task program in the container, and the key data are deserialized, thereby restoring the task's running environment and running state.
Based on the above scheme, the invention further optimizes as follows:
alternatively, in the running process of the program, the time for saving the site can be set by the program itself, for example, after calculating the important data result or saving the data at regular time.
Alternatively, the task control server can inform the running program to save the site at any time.
Optionally, in the process of running the program, if the task regularly saves the program site, the program site is only required to be uploaded to the program site server when the key data changes.
It should be noted that the critical data includes various variables, data structures, and operation phases of the program.
Optionally, the task control server may select a target node for task migration through a certain policy, and may select a node with the smallest load for migration by traversing the cluster node load; and the node with the best cluster performance can be selected for migration.
Optionally, the program site server has a cache function, and stores historical program sites of the nodes, and the program can be rolled back to the previous program state according to different storage time points of the program sites.
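The rollback capability described above can be sketched by keying each saved site by its storage timestamp and restoring the newest snapshot not later than a requested point in time. The `SiteHistory` class and its in-memory layout are hypothetical, for illustration only.

```python
import bisect


class SiteHistory:
    """Hypothetical cache of one node's historical program sites.

    Each saved site is keyed by the timestamp at which it was stored,
    so a task can be rolled back to any earlier saved state.
    """

    def __init__(self):
        self._times = []      # sorted timestamps
        self._sites = {}      # timestamp -> serialized site bytes

    def save(self, timestamp, site_bytes):
        bisect.insort(self._times, timestamp)
        self._sites[timestamp] = site_bytes

    def rollback(self, not_after):
        """Return the newest site saved at or before `not_after`, or None."""
        i = bisect.bisect_right(self._times, not_after)
        if i == 0:
            return None
        return self._sites[self._times[i - 1]]
```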
Example two
This embodiment mainly covers site saving and cross-node site recovery of container-based distributed tasks, as well as task migration strategies for different fault granularities. The distributed cluster architecture and migration flow of this embodiment are shown in fig. 1.
The invention targets a distributed environment similar to an airborne cloud; to meet high-availability (HA) requirements, the cluster needs a task migration mechanism. The cluster is composed of distributed nodes capable of executing tasks, and the tasks in the nodes run in containers. The task control server is responsible for task management, distribution, monitoring, and migration for each node in the cluster, and the program site server is responsible for storing the container configuration file of each node's tasks and the programs' running sites.
When a new task sequence arrives, the task control server analyzes it, distributes subtasks to available nodes in the cluster, and runs them in containers through written configuration files. The task control server monitors the running state of the distributed tasks at all times; once a task on a node is found to be faulty, the system enters the task migration process.
The task migration process of the distributed system is further described below in conjunction with fig. 1.
Step 1: the task program converts the objects holding the key data into a byte sequence, realizing key data serialization;
Step 2: the task program stores the byte sequence in the host machine through container directory mounting, realizing key data persistence;
in practice, the key data are stored in the node's file system through container directory mounting;
Step 3: the task program uploads the container configuration file of each running container on each node, together with each container's key data, to the program site server;
Step 4: when the task control server detects a node fault or container fault, it selects the target node for task migration, and the target node downloads the container configuration file to be recovered and the corresponding key data from the program site server;
Step 5: the container environment is restored on the target node through the container configuration file, and the task program is loaded;
Step 6: the persisted key data are deserialized;
Step 7: the program site is restored on the target node of the task migration.
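The data-handling side of steps 1 to 7 can be sketched end-to-end with Python's pickle module standing in for the serialization mechanism, and a directory on disk standing in for the mounted host path and the program site server. All function and file names here are illustrative assumptions, not the patent's implementation.

```python
import json
import pickle
from pathlib import Path


def save_site(site_dir, key_data, container_config):
    """Steps 1-3: serialize the key data, persist them under the mounted
    directory, and write the container configuration next to them."""
    site_dir = Path(site_dir)
    site_dir.mkdir(parents=True, exist_ok=True)
    (site_dir / "key_data.bin").write_bytes(pickle.dumps(key_data))
    (site_dir / "container.json").write_text(json.dumps(container_config))


def restore_site(site_dir):
    """Steps 4-7 (data side): fetch the config and key data, deserialize
    them, and return both so the task can resume from its saved state."""
    site_dir = Path(site_dir)
    container_config = json.loads((site_dir / "container.json").read_text())
    key_data = pickle.loads((site_dir / "key_data.bin").read_bytes())
    return container_config, key_data
```

In a real deployment, `site_dir` would be a bind-mounted host directory whose contents are synchronized to the program site server, and restarting the container from `container.json` is the part this data-only sketch omits.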
While the tasks in the nodes execute normally, the programs in the containers run and repeat steps 1 to 3 to save the program site; the timing and frequency of steps 1 to 3 are determined by the program. When the task control server detects that a node's task execution is abnormal and task migration is required, steps 4 to 7 are performed to restore the program site with the support of the program site server.
EXAMPLE III
The present embodiment performs task migration according to different failure granularities, as shown in fig. 2.
When the task control server detects that a task is running abnormally, task migration is required. The fault type is judged first; faults in a distributed cluster can be roughly divided into three types: node faults, container faults, and program faults. Node faults may be caused by external environmental factors or node hardware failures, and are more likely in a harsh environment such as an airborne cloud; container faults may be caused by an unresponsive container, abnormal container exit, and the like; program faults may be caused by memory violations, function stack overflows, and the like during program operation.
Once the fault type is determined, the corresponding migration policy can be applied. Specifically, for a node fault, since multiple tasks may run on the node, all interrupted tasks on the failed node must be migrated, and their sites are restored in turn on the target nodes selected for migration; for a container fault, the node running the task is still normal, and since the container executes only one task, only that task needs to be migrated; for a program fault, both the node and the container are normal, and developers must debug and fix the program and restart the container.
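The three fault-granularity strategies can be sketched as a dispatch on fault type. Fault names, arguments, and return values are illustrative assumptions, not taken from the patent.

```python
def migration_plan(fault_type, node_tasks, failed_task=None):
    """Return the list of tasks to migrate for a given fault granularity.

    fault_type: "node", "container", or "program".
    node_tasks: all tasks currently assigned to the affected node.
    failed_task: the single task involved for container/program faults.
    """
    if fault_type == "node":
        # Every interrupted task on the failed node must move.
        return list(node_tasks)
    if fault_type == "container":
        # The node is healthy; only the one task in that container moves.
        return [failed_task]
    if fault_type == "program":
        # Node and container are healthy: no migration; the program must
        # be debugged and fixed, then the container restarted.
        return []
    raise ValueError(f"unknown fault type: {fault_type!r}")
```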
In summary, the present application relates to a distributed task migration method based on container technology, aiming to quickly and effectively recover failed tasks on nodes in a distributed cluster and to ensure the cluster's availability. The method comprises: distributing tasks to the cluster and monitoring the task execution state through the task control server, judging the fault type and granularity of a task, and selecting an available node in the cluster as the target node for task migration; saving the program site and container configuration file of each node in the cluster through the program site server; saving a program site as serialized, persisted program key data together with the corresponding container's configuration file; restoring the container environment in which the task executes through the container configuration file; and then using the deserialized key data as the program input of the migrated task, thereby achieving container-based program site saving and cross-node site recovery.

Claims (8)

1. A distributed task migration method based on container technology, characterized in that the method is applied to a distributed system, and the tasks executed in all task nodes of the distributed system are deployed in containers; the task migration method comprises the following steps:
creating a container in a node through a configuration file, and uploading the configuration file corresponding to the container to a program site server once the container is created and runs successfully;
during task execution on the node, storing the serialized key data of the node in the program site server through container directory mounting;
monitoring, by the task control server, the task running state of each node in the cluster;
if a node failure or container failure is detected, starting an identical container environment on a selected target node according to the corresponding configuration file and key data, and deserializing the key data, thereby restoring the task running environment and running state.
2. The distributed task migration method based on container technology according to claim 1, wherein serializing the key data of the tasks on the nodes and storing them in the program site server comprises:
serializing the key data at a preset time and storing them in the program site server;
serializing the key data and storing them in the program site server when the task control server sends an instruction to the task in the node;
serializing the key data on the node according to a preset period and storing them in the program site server;
and serializing the key data on the node according to a preset period, but only when the key data have changed within that period, and storing them in the program site server.
3. The distributed task migration method based on container technology according to claim 1, wherein selecting a target node for task migration specifically includes:
the task control server traverses the load of the cluster nodes and selects the node with the minimum load as the migration target node;
or the task control server selects the node with the best performance in the cluster as the migration target.
4. The container technology based distributed task migration method of claim 1, wherein the configuration files comprise a yaml file, an xml file, a properties file, and a json file.
5. The container technology based distributed task migration method according to claim 1, wherein a node failure means that a node fault is monitored while a task on the node is not yet completed;
a container failure means that the running time of a task on the node is monitored to exceed the set running time.
6. The container technology based distributed task migration method of claim 1, wherein the key data on the node comprise the program's variables, data structures, and execution phases.
7. A distributed system based on container technology, comprising a task control server, a program site server, and N task nodes, each task node comprising a processor and a computer-readable storage medium, wherein:
the task control server is used for cluster monitoring, distributing tasks to the N task nodes, selecting target nodes for task migration, and triggering and executing the task migration process;
the program site server is used for storing the container configuration files and key data of the tasks currently and historically executed in the N task nodes;
the computer readable storage medium of each task node stores a computer program for implementing distributed task migration.
8. The distributed system of claim 7, wherein the computer program is loaded by the processor to perform the following steps:
step a1: if the task control server detects no task abnormality, the node does not need task migration, and the node executes steps b1 to b3;
step a2: if the task control server detects a task abnormality and the node needs task migration, the task control server determines the target node of the task migration, and the target node executes steps c1 to c4;
step b1: the node's task program serializes the key data, converting the objects holding the key data into a byte sequence;
step b2: the node's task program persists the key data, storing them through container directory mounting in the host machine, i.e., the node's file system;
step b3: the node's task program uploads the container configuration file and the persisted key data to the program site server;
step c1: the target node of the task migration downloads the container configuration file to be recovered and the persisted key data from the program site server;
step c2: the container environment is restored at the task migration target node through the container configuration file, and the task program is loaded;
step c3: the persisted key data are deserialized;
step c4: the program site is restored at the target node of the task migration.
CN202111066003.1A 2021-09-10 2021-09-10 Distributed task migration method and distributed system based on container technology Pending CN113742081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111066003.1A CN113742081A (en) 2021-09-10 2021-09-10 Distributed task migration method and distributed system based on container technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111066003.1A CN113742081A (en) 2021-09-10 2021-09-10 Distributed task migration method and distributed system based on container technology

Publications (1)

Publication Number Publication Date
CN113742081A (en) 2021-12-03

Family

ID=78738191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111066003.1A Pending CN113742081A (en) 2021-09-10 2021-09-10 Distributed task migration method and distributed system based on container technology

Country Status (1)

Country Link
CN (1) CN113742081A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114598585A (en) * 2022-03-07 2022-06-07 浪潮云信息技术股份公司 Method and system for monitoring hardware through snmptrapd
CN114697191A (en) * 2022-03-29 2022-07-01 浪潮云信息技术股份公司 Resource migration method, device, equipment and storage medium
CN115766405A (en) * 2023-01-09 2023-03-07 苏州浪潮智能科技有限公司 Fault processing method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106933508A (en) * 2017-02-14 2017-07-07 深信服科技股份有限公司 The moving method and device of application container
CN107590033A (en) * 2017-09-07 2018-01-16 网宿科技股份有限公司 A kind of methods, devices and systems of establishment DOCKER containers
CN111190688A (en) * 2019-12-19 2020-05-22 西安交通大学 Cloud data center-oriented Docker migration method and system
CN111506386A (en) * 2020-02-27 2020-08-07 平安科技(深圳)有限公司 Virtual machine online migration method, device, equipment and computer readable storage medium
CN112532763A (en) * 2020-11-26 2021-03-19 新华三大数据技术有限公司 Container operation data synchronization method and device


Similar Documents

Publication Publication Date Title
CN113742081A (en) Distributed task migration method and distributed system based on container technology
US11556438B2 (en) Proactive cluster compute node migration at next checkpoint of cluster upon predicted node failure
US8375363B2 (en) Mechanism to change firmware in a high availability single processor system
CN111290834B (en) Method, device and equipment for realizing high service availability based on cloud management platform
US20200110675A1 (en) Data backup and disaster recovery between environments
US7194652B2 (en) High availability synchronization architecture
US7188237B2 (en) Reboot manager usable to change firmware in a high availability single processor system
EP2802990B1 (en) Fault tolerance for complex distributed computing operations
US20040083402A1 (en) Use of unique XID range among multiple control processors
US7065673B2 (en) Staged startup after failover or reboot
US8479038B1 (en) Method and apparatus for achieving high availability for applications and optimizing power consumption within a datacenter
CN113569987A (en) Model training method and device
EP3671461A1 (en) Systems and methods of monitoring software application processes
Bilal et al. Fault tolerance in the cloud
EP3617887B1 (en) Method and system for providing service redundancy between a master server and a slave server
CN111935244B (en) Service request processing system and super-integration all-in-one machine
CN114138732A (en) Data processing method and device
WO2024041363A1 (en) Serverless-architecture-based distributed fault-tolerant system, method and apparatus, and device and medium
US20210173698A1 (en) Hosting virtual machines on a secondary storage system
CN112269693B (en) Node self-coordination method, device and computer readable storage medium
Ooi et al. Dynamic service placement and redundancy to ensure service availability during resource failures
CN111580792B (en) High-reliability satellite-borne software architecture design method based on operating system
CN113626147A (en) Ocean platform computer control method and system based on virtualization technology
El-Desoky et al. Improving fault tolerance in desktop grids based on incremental checkpointing
US20230092343A1 (en) Lockstep processor recovery for vehicle applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination