CN115396495B

CN115396495B - A fault response method for factory microservice system in SDN-FOG environment

Info

Publication number: CN115396495B
Application number: CN202211005627.7A
Authority: CN
Inventors: 范书豪; 杨博; 刘宇翔; 袁亚洲; 郑忠斌; 陈彩莲; 关新平
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2022-08-22
Filing date: 2022-08-22
Publication date: 2023-12-12
Anticipated expiration: 2042-08-22
Also published as: CN115396495A

Abstract

The invention discloses a fault response method for a factory microservice system in an SDN-FOG environment and relates to the technical field of the industrial Internet of Things. The invention develops a fault response scheme for the microservice system of a cyber-physical smart factory in an SDN-FOG environment. The method comprises: establishing a system model that matches actual factory operation scenarios; building a fault handling framework for a microservice system based on the SDN-FOG environment; constructing an integer programming problem that simultaneously considers network resource constraints, workflow response time, load, energy consumption, and fault response; designing the logic of the application-layer fault response program on that basis; and solving the two resulting adaptation-related optimization problems with the Gurobi solver to obtain the optimal response strategy. For scenarios with strict requirements on fault handling response time, a suboptimal heuristic algorithm is designed to simplify the procedure and address the scalability of the proposed optimization problem. The resulting solution is adaptive and suitable for networks in actual factory scenarios.

Description

A fault response method for factory microservice system in SDN-FOG environment

Technical field

The present invention relates to the technical field of the industrial Internet of Things, and in particular to a fault response method for a factory microservice system in an SDN-FOG environment.

Background

In recent years, Industry 4.0 has been regarded as an important driving force of the new industrial revolution, and the Industrial Internet of Things (IIoT) has attracted widespread attention. It can serve a variety of real factory application scenarios, such as remote adaptation and configuration, intelligent operation and maintenance, and collaborative equipment control. IIoT connects physical entities with computing capabilities based on Internet technology standards and uses information and data technology to drive production through industrial data modeling, management, and analysis. IIoT generally has strict quality of service (QoS) requirements, such as very short response times, load balancing, energy consumption, and reliability. Because cloud servers are usually located far from field devices, the resulting latency makes sufficiently short response times hard to achieve; the fog computing paradigm is emerging to support applications with stringent response time requirements. Fog computing deploys computing resources closer to end users. Fog computing at the edge can quickly compute and analyze data locally and transmit the relevant on-demand processed data streams from the geographic location of an event to the core platform, improving overall network efficiency. To manage edge fog computing devices more efficiently, software-defined networking (SDN) can be introduced. SDN is a paradigm that allows the network to be programmed through an SDN controller and enables programmable routing optimization; this latency optimization remains scalable and flexible when new devices are added to the infrastructure. In addition, the SDN architecture allows network controllers to monitor network and computing devices and collect performance information from the infrastructure. Using multiple coordinated SDN controllers in the fog infrastructure of the IIoT can improve the QoS of the overall infrastructure. Once an SDN-based fog infrastructure is deployed in a factory, the factory can be converted into a cyber-physical smart factory by leveraging IIoT. Digitizing the factory environment helps realize new automation paradigms and achieve more flexible and safer operation.

In an SDN-FOG environment, factory operations can be implemented with a microservice architecture. This promising architecture decomposes a single piece of software into a set of loosely coupled, containerized microservices, supports distributed deployment, and links the microservices into multiple microservice chains that serve requests. This can significantly shorten the time needed to change the system and roll the changes out to production, reduce development and maintenance costs, and increase flexibility. At present, optimization problems related to the development, deployment, scaling, and maintenance of microservices in the SDN-FOG environment fall into three main categories. The first is the decentralized computation distribution problem, which studies which nodes microservices should use in a fog computing environment and which part of the application each node should host. This type of problem generally takes average response time as the optimization objective and assumes that the network is a static entity providing a specific delay; SDN devices collect information from the infrastructure and then minimize network delay through techniques such as routing optimization. However, this type of problem is generally optimized at architecture design time, and microservice redeployment under failure is rarely considered. The second is the optimal controller placement problem. The delay between an SDN switch and the controller affects the delay between any two devices in the SDN architecture. This type of problem generally assumes that traffic does not change with the delay achieved by the network, studies how controller placement relates to the network topology and to the way traffic is steered in the network, and then optimizes control delay and data transmission delay. However, this type of problem is also generally optimized at architecture design time, and real-time control flows under failure are rarely considered. The third is the path optimization problem, which generally treats the response time of the system, the combination of execution time and transmission delay, as the QoS indicator to be optimized. This type of problem generally formulates a multi-objective optimization problem that jointly considers service execution time and transmission delay in order to arrange the most suitable path for each workflow. However, these scenarios all consider routing optimization during normal operation and do not consider rerouting under failure.

The application of microservices in industrial scenarios has attracted the interest of the research community. However, as noted above, most existing research on applying microservices to smart factories focuses on general architectural principles and optimized deployment schemes. Only a few works consider fault response schemes under a microservice architecture, and their scope is generally narrow, providing only limited forms of adaptation (such as self-healing and runtime placement adaptation). These studies combine microservice architecture with adaptive systems; the resulting adaptive systems can monitor their own behavior at runtime and change their configuration to maintain and enhance the quality attributes of the microservice system under uncertain operating conditions (such as workload changes and failure risks). When studying the application of microservices in real factories, it is therefore necessary to consider dynamic network topology changes, workflow changes, fault prevention, and fault handling in addition to key factors such as execution time and delay.

Designing an adaptive system capable of automatic fault handling involves making design decisions about the environment and the system itself while observing the network environment, and then selecting suitable adaptation mechanisms. In a microservice architecture based on the SDN-FOG environment, the design space for adaptation decisions is more complex because runtime components are numerous, independent, and highly dynamic. The primary challenge is how to develop monitoring and adaptation mechanisms that cope with the diversity of quality attributes of the microservice architecture. The main optimization points that can be considered include: 1) response time; 2) node reliability; 3) network overhead; 4) load balancing (running microservices on only one or a few nodes degrades service execution performance, so fog computing scenarios generally need to account for the performance degradation caused by increased node load).

The characteristics of microservice systems, namely independent and frequent deployment, the need for a high degree of automation, and a complex runtime architecture, can promote further research and development of adaptive systems. Adaptive systems in turn provide a control-oriented perspective, and studying fault response schemes is an important part of designing them. Current work on designing adaptation schemes for microservice architectures follows two main approaches. The first draws on a large body of theoretical and practical results to design dynamic programming algorithms that select the best adaptation strategy for each microservice. While the system is running, information is continuously collected from the network to determine the current actual working state of the controlled object. According to the application's settings, the adaptive control law is triggered at the moment a set threshold is reached, adjusting the system structure or parameters in real time so that the system always operates automatically in an optimal or suboptimal state. The second uses reinforcement learning to learn new adaptation strategies from past adaptation results and improve the quality attributes of the overall microservice system, enabling policy reasoning and multi-objective optimization under uncertainty in order to meet multiple, possibly conflicting QoS requirements. From a modeling perspective, reinforcement learning is applied to the design of adaptation schemes for microservice systems because of its significant advantages in solving strategic decision problems.

From the above background and analysis, it can be seen that, for a cyber-physical smart factory in an SDN-FOG environment, designing a fault response scheme for the microservice system faces three main difficulties: 1) The proposed scheme must not only handle faults but also satisfy basic QoS indicators; for example, when studying routing optimization for fault recovery, quality indicators such as response time, energy consumption, and load should still be optimized. 2) Existing techniques for microservice systems in industrial scenarios rarely consider the design of fault response schemes, and fault response involves more than one kind of problem, so as many related issues as possible, such as fault prevention and rerouting cost, should be considered. 3) The number of workflows, fog nodes, and required microservices in real industrial scenarios affects the proposed algorithm. The designed solution therefore needs to be elastic in real scenarios, and the proposed algorithm must respond quickly to workflows that change dynamically over time.

Research on fog computing, SDN, and microservices has thus been relatively independent. Work that considers the three together, addressing the development, deployment, management, and scaling of microservices in the SDN-FOG environment, is limited, and research related to fault response in this environment is scarcer still. Existing techniques for microservice systems in industrial scenarios have thoroughly studied optimized deployment schemes but rarely consider adaptive design. Only a few works address the specific challenges of developing adaptation schemes for microservice systems, and research on fault response, a key part of adaptive design, is likewise scarce. Existing fault response schemes for microservice systems generally have a narrow scope, focusing mostly on fault handling and orchestrating new workflows while giving insufficient consideration to fault prevention and rerouting cost. The quality attributes of a microservice system are diverse, yet fault response schemes lack comprehensive consideration of response time, node reliability, network overhead, and load.

Therefore, those skilled in the art are committed to developing a fault response method for a factory microservice system in an SDN-FOG environment: building a fault handling framework for the microservice system based on the SDN-FOG environment, constructing an integer programming problem that simultaneously considers network resource constraints, workflow response time, load, energy consumption, and fault response, and using the Gurobi solver to obtain the optimal response strategy. For scenarios with strict requirements on fault handling response time, a suboptimal heuristic algorithm is designed to simplify the procedure and address the scalability of the optimization problem, yielding an adaptive solution suitable for networks in actual factory scenarios.

Summary of the invention

In view of the above defects of the prior art, the technical problem to be solved by the present invention is to develop a fault response scheme for the microservice system of a cyber-physical smart factory in an SDN-FOG environment. A system model matching actual factory operation scenarios is established, a fault handling framework for the microservice system based on the SDN-FOG environment is built, and an integer programming problem is constructed that simultaneously considers network resource constraints, workflow response time, load, energy consumption, and fault response. The logic of the application-layer fault response program is designed on this basis, and the two proposed adaptation-related optimization problems can be solved with the Gurobi solver to obtain the optimal response strategy. For scenarios with strict requirements on fault handling response time, a suboptimal heuristic algorithm is additionally designed to simplify the procedure and address the scalability of the proposed optimization problem. The resulting solution is adaptive and suitable for networks in actual factory scenarios.

To achieve the above purpose, the present invention provides a fault response method for a factory microservice system in an SDN-FOG environment, comprising the following steps:

Step 1. Establish a microservice system fault handling architecture based on the SDN-FOG environment;

Step 2. Model the system, describing the relationships among the infrastructure, the transmission network, and the fog computing devices;

Step 3. Construct the integer programming problems and the fault response program;

Step 4. Use a solver to obtain the optimal response strategy;

Step 5. For fault handling scenarios with particularly strict requirements on fault response time, use a heuristic algorithm.

Further, the microservice system fault handling architecture based on the SDN-FOG environment in step 1 includes an application layer, a control layer, and a base layer.

Further, the application layer includes the fault handling program, a timer, and functional programs.

Further, the control layer includes the various components required by the fault response scheme.

Further, the base layer deploys switches that support SDN control, and the switches have different failure probabilities in time slots. A central controller is connected to the switches to aggregate network topology and traffic information and uses the southbound interface to dynamically program and configure the switches. Each switch is connected to one fog node, and the pair is treated as a single node. Each fog node corresponds to a group of IIoT devices and fog servers.

Further, step 1 includes the following steps:

Step 1.1. Sensors connected to factory field devices continuously collect information. After preprocessing, the sensor data is sent to the connected switch that supports SDN control. The network monitoring component collects node information (covering field devices, fog nodes, and switches) and link information from the switches and sends it to the application layer to continuously monitor network traffic;

Step 1.2. The fault detection component determines, based on the information collected by the network monitoring component, whether a fault has occurred in the transmission network. If a fault is detected at a switch, fog node, or link, the network monitoring component sends the new network topology and the set of microservices supported by each fog node to the fault response program;

Step 1.3. The fault response program computes in real time and reallocates resources to the workflows that pass through the failed node or link. The node configuration component allocates resources to workflows by scheduling the forwarding tables, applying the fault response decisions made by the fault response program in the application layer to the switches and fog nodes;

Step 1.4. After configuration is complete, the timer is refreshed. If some microservices in a workflow cannot be completed because of node or link failures, the fault type is reported and the case is handed over for manual handling;

Step 1.5. After one full timer period, the fault prevention component is activated and proactively performs a periodic reconfiguration of the network, optimizing the paths selected for workflows, improving overall QoS, and reducing the failure probability.
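Steps 1.2 to 1.4 can be pictured as a small control routine. The following Python sketch is illustrative only: the names `Workflow`, `Network`, `affected_workflows`, `handle_fault`, and the `replan` callback are hypothetical and do not appear in the patent, and the real fault response program solves an optimization problem where this sketch simply calls `replan`.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    path: list[str]  # ordered node names the workflow currently traverses


@dataclass
class Network:
    nodes: set[str]
    links: set[frozenset[str]]  # undirected links stored as 2-element frozensets


def affected_workflows(workflows, failed_nodes, failed_links):
    """Return the workflows whose current path crosses a failed node or link."""
    hit = []
    for wf in workflows:
        hops = {frozenset(pair) for pair in zip(wf.path, wf.path[1:])}
        if (set(wf.path) & failed_nodes) or (hops & failed_links):
            hit.append(wf)
    return hit


def handle_fault(network, workflows, failed_nodes, failed_links, replan):
    """One pass over steps 1.2 to 1.4.

    `replan(network, workflow)` is a hypothetical stand-in for the fault
    response program; it returns a new path or None if none exists.
    """
    # Step 1.2: derive the new topology without the failed elements.
    alive = Network(
        nodes=network.nodes - failed_nodes,
        links={
            link for link in network.links
            if link not in failed_links and not (link & failed_nodes)
        },
    )
    unrecoverable = []
    for wf in affected_workflows(workflows, failed_nodes, failed_links):
        new_path = replan(alive, wf)        # Step 1.3: reallocate resources
        if new_path is None:
            unrecoverable.append(wf.name)   # Step 1.4: hand over to an operator
        else:
            wf.path = new_path
    return alive, unrecoverable
```

Only the workflows that actually cross a failed element are replanned, which mirrors the statement in step 1.3 that resources are reallocated to the workflows passing through the failed node or link.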

Further, step 2 includes the following steps:

Step 2.1. Model the transmission network;

Step 2.2. Model the fog nodes and the microservice sets they support;

Step 2.3. Model the workflows;

Step 2.4. Model the QoS indicators.
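As a rough illustration of what such a model might contain, the sketch below represents fog nodes, links, and a workflow as plain Python objects and computes one QoS indicator, the response time, as execution time plus transmission delay. The class and field names (`FogNode`, `SystemModel`, `exec_time_ms`, `link_delay_ms`) are assumptions made for this example, and the sketch simplifies routing by requiring consecutive microservices to sit on the same node or on directly linked nodes.

```python
from dataclasses import dataclass


@dataclass
class FogNode:
    name: str
    exec_time_ms: dict[str, float]  # supported microservice -> execution time


@dataclass
class SystemModel:
    nodes: dict[str, FogNode]
    link_delay_ms: dict[frozenset, float]  # undirected link -> transmission delay

    def response_time(self, chain, placement):
        """Execution time plus transmission delay for one workflow.

        chain:     ordered microservice names of the workflow
        placement: node name chosen for each microservice, in the same order
        Raises ValueError if a node does not support its microservice or if
        two consecutive nodes are not directly linked (a simplification).
        """
        total = 0.0
        for service, node_name in zip(chain, placement):
            node = self.nodes[node_name]
            if service not in node.exec_time_ms:
                raise ValueError(f"{node_name} does not support {service}")
            total += node.exec_time_ms[service]
        for a, b in zip(placement, placement[1:]):
            if a == b:
                continue  # consecutive microservices share a node, no transfer
            link = frozenset((a, b))
            if link not in self.link_delay_ms:
                raise ValueError(f"no link between {a} and {b}")
            total += self.link_delay_ms[link]
        return total
```

The keys of `exec_time_ms` double as the set of microservices a fog node supports, so step 2.2 and the execution-time part of step 2.4 share one structure in this sketch.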

Further, step 3 includes the following steps:

Step 3.1. Define the decision variables;

Step 3.2. Construct the resource constraints;

Step 3.3. Construct the optimization objectives related to the decision variables;

Step 3.4. Construct the constraints related to the decision variables;

Step 3.5. Construct the objective functions.

Further, the objective functions constructed in step 3.5 cover the network fault recovery problem and the periodic network reconfiguration problem.

Further, the solver in step 4 includes the commercial solver Gurobi.
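To make the shape of such an integer program concrete, the following sketch solves a deliberately tiny placement instance by exhaustive enumeration. It is not the patent's formulation and does not call Gurobi; the function name `solve_placement`, its parameters, and the weights `w_delay` and `w_risk` are assumptions chosen for illustration. In practice a solver replaces the enumeration, which grows exponentially with the length of the workflow.

```python
from itertools import product


def solve_placement(chain, supports, delay, fail_prob, capacity,
                    w_delay=1.0, w_risk=100.0):
    """Exhaustively solve a tiny microservice placement problem.

    chain:     ordered microservice names of one workflow
    supports:  node -> set of microservices it can run
    delay:     (node_a, node_b) -> transmission delay (0 when a == b)
    fail_prob: node -> failure probability in the current time slot
    capacity:  node -> maximum number of microservices it may host
    Returns (best_cost, best_assignment), or (None, None) if infeasible.
    """
    nodes = sorted(supports)
    best_cost, best = None, None
    for assign in product(nodes, repeat=len(chain)):
        # Constraint: each node must support the microservice placed on it.
        if any(s not in supports[n] for s, n in zip(chain, assign)):
            continue
        # Constraint: node capacity must not be exceeded.
        if any(assign.count(n) > capacity[n] for n in set(assign)):
            continue
        total_delay = sum(delay[(a, b)] for a, b in zip(assign, assign[1:]))
        # Probability that all used nodes survive (independence assumed).
        survive = 1.0
        for n in set(assign):
            survive *= 1.0 - fail_prob[n]
        # Weighted objective: transmission delay plus failure risk.
        cost = w_delay * total_delay + w_risk * (1.0 - survive)
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, assign
    return best_cost, best
```

Each tuple produced by `product` plays the role of one assignment of the decision variables, the two `continue` checks are the constraints, and `cost` is a weighted objective whose weights can be adjusted to reflect which quality attribute matters most.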

In a preferred embodiment of the present invention, a microservice system fault handling architecture suitable for cyber-physical smart factories is proposed. The architecture uses SDN and fog computing technology, and problem formulations for fault recovery and fault prevention are given. The corresponding optimization problems are integer programs, and the optimal response strategy can be obtained with the commercial solver Gurobi.

The proposed method considers the maximum utilization and failure probability of links and nodes and optimizes both the load on network devices and the probability of network failure. Because traffic demand and the failure probability of nodes and links change over time, the method, for the purpose of fault prevention, dynamically reallocates resources within each predefined time period at a small rerouting cost, keeping the network in its best state.
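The quantities named in this paragraph can be sketched in a few lines of Python. The definitions below are simple assumptions for illustration, not the patent's formulas: failures are treated as independent, the rerouting cost is counted as the number of workflows whose path changes, and the acceptance rule in `should_reconfigure` is a hypothetical way of trading improvement against that cost.

```python
import math


def path_failure_probability(element_fail_probs):
    """Probability that at least one node or link on a path fails,
    assuming independent failures (an assumption of this sketch)."""
    return 1.0 - math.prod(1.0 - p for p in element_fail_probs)


def max_utilization(load, capacity):
    """Highest load/capacity ratio over all nodes and links."""
    return max(load[e] / capacity[e] for e in load)


def rerouting_cost(old_paths, new_paths):
    """Number of workflows whose path changes (one simple cost measure)."""
    return sum(1 for wf in old_paths if old_paths[wf] != new_paths[wf])


def should_reconfigure(old_score, new_score, old_paths, new_paths, cost_per_change):
    """Accept a periodic reconfiguration only if the score improvement
    outweighs the rerouting cost. Lower scores are better."""
    gain = old_score - new_score
    return gain > cost_per_change * rerouting_cost(old_paths, new_paths)
```

For example, a path over two elements with failure probabilities 0.1 and 0.2 fails with probability 1 - 0.9 × 0.8 = 0.28 under the independence assumption.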

The proposed method can optimize network load and energy consumption to a certain extent, guarantee the required QoS level, and reconfigure the network in real time when a node or link fails. It also accounts for the diversity of quality attributes of the microservice system, whose weights can be set according to the actual situation.

The proposed suboptimal heuristic responds quickly to workflows that change dynamically over time. The solution is adaptive and suitable for networks in actual factory scenarios.
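The heuristic itself is shown in Figure 4 and is not reproduced in this passage. Purely to illustrate why a heuristic can be fast yet suboptimal, the sketch below uses a generic greedy rule, which is an assumption of this example and not the patent's algorithm: each microservice in the chain is placed on the node with the lowest local cost, without looking ahead. The name `greedy_placement` and the weights are hypothetical.

```python
def greedy_placement(chain, supports, delay, fail_prob,
                     w_delay=1.0, w_risk=100.0):
    """Place microservices one at a time on the locally cheapest node.

    chain:     ordered microservice names of one workflow
    supports:  node -> set of microservices it can run
    delay:     (node_a, node_b) -> transmission delay (0 when a == b)
    fail_prob: node -> failure probability in the current time slot
    Returns a list of node names, or None if some microservice has no host.
    """
    placement = []
    for service in chain:
        candidates = [n for n in sorted(supports) if service in supports[n]]
        if not candidates:
            return None  # no node can run this microservice
        prev = placement[-1] if placement else None

        def step_cost(n):
            # Local cost: hop delay from the previous node plus node risk.
            hop = delay[(prev, n)] if prev is not None else 0.0
            return w_delay * hop + w_risk * fail_prob[n]

        placement.append(min(candidates, key=step_cost))
    return placement
```

The greedy rule evaluates each microservice once per candidate node, so its running time grows linearly with the chain length, but because it scores each step in isolation it can miss placements that are cheaper overall.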

In actual deployment, microservices can be logically grouped according to business and quality requirements, which corresponds to the set of microservices supported by each fog node; the groups can then be managed collectively by independent, application-aware servers in a hierarchical or distributed structure.

Compared with the prior art, the present invention has the following evident substantive features and significant advantages:

1. SDN, fog computing, and microservice technologies are integrated. While guaranteeing common QoS indicators, a microservice system fault handling architecture based on the SDN-FOG environment is designed, actual factory operation scenarios are modeled, and problem formulations for fault handling and fault prevention are given and used to design the fault response program.

2. The diversity of quality attributes of the microservice system is considered comprehensively. The design of the fault response scheme takes response time, node reliability, energy consumption, and load into account together, and their weights can be set according to the actual situation.

3. The proposed scheme considers the maximum utilization of links and nodes in a fault prevention design that guards against overload, considers the failure probability of links and nodes in a fault prevention design that limits failure probability, and accounts for the cost of changing network traffic when rerouting.

4. In actual operation, microservices can be logically grouped according to business and quality requirements, which corresponds to the set of microservices supported by each fog node; the groups can then be managed collectively by independent, application-aware servers in a hierarchical or distributed structure.

The concept, specific structure, and technical effects of the present invention are further described below with reference to the accompanying drawings so that its purpose, features, and effects can be fully understood.

Description of the drawings

Figure 1 shows the microservice system fault handling architecture based on the SDN-FOG environment according to a preferred embodiment of the present invention;

Figure 2 shows the detailed structure of the base layer according to a preferred embodiment of the present invention;

Figure 3 shows the microservice system fault response flow based on the SDN-FOG environment according to a preferred embodiment of the present invention;

Figure 4 is a flow chart of the heuristic algorithm according to a preferred embodiment of the present invention.

Detailed description

Several preferred embodiments of the present invention are described below with reference to the accompanying drawings to make the technical content clearer and easier to understand. The present invention can be embodied in many different forms, and its scope of protection is not limited to the embodiments mentioned herein.

In the drawings, components with the same structure are denoted by the same reference numerals, and components with similar structures or functions are denoted by similar reference numerals. The size and thickness of each component shown in the drawings are arbitrary, and the present invention does not limit the size or thickness of any component. For clarity of illustration, the thickness of components is exaggerated in some places in the drawings.

The present invention proposes a fault response method for a smart factory microservice system in an SDN-FOG environment, comprising the following steps:

Step 1: For the cyber-physical smart factory, a microservice system fault handling architecture based on the SDN-FOG environment is proposed, as shown in Figure 1. It mainly includes the following steps:

S1，工厂中部署了雾基础设施和工业物联网相关设备，连接到工厂现场设备的传感器不断地采集信息，传感器数据经预处理之后发送给其连接的支持SDN控制的交换机；网络监控组件从交换机收集包括现场设备、雾节点和交换机在内的节点信息和相关链路信息，并将信息发送到应用层来持续监视网络流量。S1, fog infrastructure and industrial Internet of Things related equipment are deployed in the factory. Sensors connected to the factory field equipment continuously collect information. The sensor data is pre-processed and sent to its connected switches that support SDN control; the network monitoring component starts from the switch. Collect node information and related link information including field devices, fog nodes and switches, and send the information to the application layer to continuously monitor network traffic.

S2，故障检测组件根据网络监控组件收集的信息，判断传输网络是否发生故障。如果检测出交换机或雾节点或链路处发生故障，网络监控组件将当前新的网络拓扑和每个雾节点支持的微服务集发送给故障应对程序。S2, the fault detection component determines whether a fault occurs in the transmission network based on the information collected by the network monitoring component. If a fault is detected at a switch or fog node or link, the network monitoring component sends the current new network topology and the set of microservices supported by each fog node to the fault response program.

S3，故障应对程序实时计算，给通过故障节点或链路的工作流重新分配资源。节点配置组件根据调度转发表来将资源分配给工作流，将应用层中故障应对程序做出的故障应对决策应用于交换机及雾节点。S3, the fault response program calculates in real time and reallocates resources to the workflow passing through the failed node or link. The node configuration component allocates resources to workflows based on the scheduling forwarding table, and applies fault response decisions made by the fault response program in the application layer to switches and fog nodes.

S4, after the configuration is completed, the timer is refreshed. If node or link failures leave some microservices in a workflow unable to complete, the fault type is reported and the case is handed over for manual handling.

S5, after each complete timer period, the fault prevention component is activated and proactively reconfigures the network on a periodic basis. At a small rerouting cost, it optimizes the paths selected for the workflows, improves overall quality of service, and reduces the failure probability.

Specifically, as shown in Figure 1, the SDN-FOG-based microservice system fault handling architecture has three layers: 1) the application layer, which contains the fault handling program, the timer and other functional programs; 2) the control layer, which contains the components required by the fault response scheme; 3) the infrastructure layer (also called the base layer), in which multiple SDN-capable switches are deployed. These switch nodes have different failure probabilities in each time slot. The factory's central master controller is connected to the switches to aggregate network topology and traffic information, and configures the switches dynamically through the southbound interface. As shown in Figure 2, the architecture assumes that each switch is connected to one fog node, and the pair is treated as a single node; each fog node corresponds to a set of industrial IoT devices and fog servers. In this factory scenario, functional applications are deployed that collect the status of factory equipment through the relevant sensors and process it on the fog servers, thereby carrying out the corresponding commands and continuously monitoring and managing the smart factory. These applications are designed with a microservice architecture and therefore consist of independent services. Each microservice performs a specific type of processing, and its function can be requested individually or combined with others through a workflow.

As shown in the right half of Figure 1, the base layer can be further divided into three parts: the physical part, which contains the factory's physical equipment; the network part, which contains the SDN-capable switches and the master controller; and the computing part, which contains the industrial IoT devices and fog servers.

Step 2: Model the base layer in detail to describe the relationships among the infrastructure, the transmission network and the fog computing devices. This mainly includes the following steps:

S1, transmission network modeling: The network infrastructure is represented as G={N,L}, where N is the set of nodes and L is the set of links connecting different switches. A node i∈N is a tuple i=<p_i,r_i>, where p_i is the computing capacity of the fog node (units/s), one unit being one micro-cycle, and r_i is the node's RAM (Mb). A link l_ij∈L is a tuple l_ij=<d_ij,c_ij>, where d_ij is the link delay (ms), c_ij is the maximum link capacity (Mb/s), node i is the source of the link and node j is its destination. To avoid overload, μ₁ denotes the maximum link utilization and μ₂ the maximum node utilization. For fault prevention, P_i(t) denotes the failure probability of node i and P_ij(t) the failure probability of link l_ij. The network topology can then be represented by a matrix C_N*N and the link propagation delays by a matrix D_N*N.
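As an illustration, the two matrices can be built from a list of links. The sketch below is not taken from the patent: the three-node topology, all numeric values, the choice to store the link capacity c_ij in C, and the helper name usable_capacity are assumptions made only for the example.

```python
# Hypothetical three-node topology; every number below is an illustrative assumption.
N = 3
links = {  # (source i, destination j) -> (delay d_ij in ms, capacity c_ij in Mb/s)
    (0, 1): (2.0, 100.0),
    (1, 0): (2.0, 100.0),
    (1, 2): (5.0, 50.0),
    (2, 1): (5.0, 50.0),
}

INF = float("inf")
# Assumption: C[i][j] stores the capacity of link l_ij, with 0 meaning "no link".
C = [[0.0] * N for _ in range(N)]
# D[i][j] stores the propagation delay of link l_ij, with inf meaning "no link".
D = [[INF] * N for _ in range(N)]
for (i, j), (d_ij, c_ij) in links.items():
    C[i][j] = c_ij
    D[i][j] = d_ij


def usable_capacity(i, j, mu1=0.8):
    """Capacity of l_ij that may be used under the maximum link utilization mu1."""
    return mu1 * C[i][j]
```

Keeping the utilization bound μ₁ outside the matrix means the same topology data can be reused with different overload margins.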

S2, modeling of fog nodes and supported microservice sets: Consider a set X of different microservices. A microservice x∈X is represented by a tuple x=<pc_x,r_x>, where pc_x is the processing capacity required by microservice x per unit of traffic (measured in micro-cycles) and r_x is the amount of RAM (Mb) required by microservice x. The matrix NF_N*X represents the set of microservices supported by each node; NF_(i,x)=1 indicates that node i supports microservice x. In an actual deployment, microservices can be grouped logically according to business and quality requirements, and each group corresponds to the microservice set supported by a fog node.
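A minimal sketch of the support matrix NF and a lookup over it is given here for illustration. The microservice names, their pc_x and r_x values, and the helper nodes_supporting are hypothetical and do not appear in the patent.

```python
# Hypothetical catalogue: name -> (pc_x in micro-cycles per unit of traffic, r_x in Mb).
microservices = {"filter": (20, 64), "aggregate": (35, 128), "deliver": (5, 16)}
service_index = {name: k for k, name in enumerate(microservices)}

# NF[i][x] == 1 means node i supports microservice x (3 nodes, 3 microservices).
NF = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
]


def nodes_supporting(name):
    """Return the indices of all nodes whose NF row marks the microservice as supported."""
    x = service_index[name]
    return [i for i, row in enumerate(NF) if row[x] == 1]
```

When a node fails, its row can be zeroed out, which is one simple way to reflect the changed microservice set that is sent to the fault response program.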

S3, workflow modeling: This architecture considers only loop-free routing, i.e., no node or link is used twice in the route of a workflow. F is the set of workflows. Each workflow carries out a specific function and chains between 1 and |Y| microservices, depending on how many the function requires, where |Y|<|X|. A workflow f∈F is an ordered tuple f={m₁,m₂,…,m_|f-1|,m_|f|}, in which each element m_a has exactly the same format and value as some microservice x in X, i.e., m_a∈X, and |f|≤|Y|. C^f(t) denotes the traffic of f in time slot t. To execute a workflow, data must flow from the node that starts the workflow to the node that executes m₁, from there to the node that executes m₂, and so on. After the last functional microservice m_|f-1| completes, m_|f| indicates that the data must be returned to the factory equipment or delivered to the control layer. The microservices requested by each workflow are represented by the matrix R_F*X, whose entries indicate whether workflow f∈F requests microservice x∈X in time slot t. The start node of workflow f is denoted s_f. After workflow f has been executed, its data is transmitted to node d_f, where the information is aggregated and then either fed back to the factory equipment or uploaded to the control layer.
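The workflow tuple and the request matrix R can be sketched as follows. The Workflow class, the example workflows and the helper names are assumptions introduced for illustration; only the symbols s_f, d_f, C^f(t) and R come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    name: str
    services: tuple   # ordered microservices m_1 ... m_|f|
    start: int        # start node s_f
    dest: int         # delivery node d_f
    traffic: float    # C^f(t), traffic in the current time slot


# Hypothetical catalogue and workflows.
catalogue = ("filter", "aggregate", "deliver")
workflows = [
    Workflow("f1", ("filter", "aggregate", "deliver"), start=0, dest=2, traffic=4.0),
    Workflow("f2", ("filter", "deliver"), start=1, dest=1, traffic=1.5),
]


def request_matrix(flows, services):
    """R[f][x] = 1 when workflow f requests microservice x, else 0."""
    return [[1 if s in f.services else 0 for s in services] for f in flows]


def stages(flow):
    """Consecutive microservice pairs between which data must be forwarded."""
    return list(zip(flow.services, flow.services[1:]))


R = request_matrix(workflows, catalogue)
```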

S4, quality-of-service indicator modeling: 1) Each fog node has two modes, working and idle. A fog node with no active microservice is in idle mode. e_i denotes the energy consumption of node i in working mode; in idle mode the consumption is e₀+e_idle, a small fraction of the working consumption (mainly the operating energy e₀ of the switch plus the static energy e_idle of the fog node). The current state of the fog nodes is given by S_1*N(t)∈{0,1}, and E(t) denotes the energy consumption of the network in time slot t. 2) Each workflow f has a maximum tolerable processing delay and a maximum allowable propagation delay, and T(t) denotes the response time of all workflows in the network in time slot t. 3) The link failure probability and the node failure probability along the route of workflow f should be less than the predefined thresholds M_l and M_n respectively. 4) The cost of rerouting the network in time slot t is measured by the number of routes that must change under the new network configuration relative to the original one.
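Two of these indicators lend themselves to a short worked example. The sketch below assumes that E(t) is the sum of e_i over working nodes plus e₀+e_idle over idle nodes, and that the rerouting cost counts workflows whose route differs between configurations. Both readings are assumptions drawn from the prose, since the exact expressions are not reproduced here, and the function names are hypothetical.

```python
def network_energy(S, e_work, e0, e_idle):
    """Sum e_i for working nodes (S[i] == 1) and e0 + e_idle for idle nodes."""
    return sum(e_work[i] if state == 1 else e0 + e_idle for i, state in enumerate(S))


def rerouting_cost(old_routes, new_routes):
    """Count workflows whose route changes between the old and new configuration."""
    return sum(1 for f, path in new_routes.items() if old_routes.get(f) != path)


# Illustrative values only.
S = [1, 0, 1]                      # node states in the current time slot
e_work = [10.0, 12.0, 9.0]         # working-mode energy per node
energy = network_energy(S, e_work, e0=1.0, e_idle=0.5)

old = {"f1": [0, 1, 2], "f2": [1, 2]}
new = {"f1": [0, 2], "f2": [1, 2]}
cost = rerouting_cost(old, new)
```

In this example only f1 changes its route, so the cost is 1, and the idle node contributes 1.5 to the energy total instead of its working value of 12.0.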

Step 3: Construct an integer programming problem that describes the fault response program. This mainly includes the following steps:

S1, define the 0-1 decision variables: 1) A node allocation matrix assigns nodes and microservices to the workflows in time slot t; an entry equal to 1 indicates that microservice m_a of workflow f is executed on node i in time slot t. 2) A link allocation matrix describes the network link resources assigned to the workflows: one entry indicates that microservice m_a of workflow f is routed through link l_ij in time slot t, and another indicates that workflow f passes through link l_ij in time slot t. 3) The current state of the fog nodes is given by S_1*N(t)∈{0,1}.

S2, build the resource constraints: Constraint (1) is the link capacity constraint between nodes.

Constraint (2) bounds the transmission delay of each workflow f over the links.

Constraint (3) is the failure probability constraint on the links serving workflow f.

Constraint (4) is the RAM capacity constraint on the nodes that provide microservices.

Constraint (5) bounds the processing delay of each workflow in the nodes.

Constraint (6) is the failure probability constraint on the nodes serving workflow f.
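The intent of these constraints can be sketched as simple feasibility checks on one candidate route. The exact algebraic forms are not reproduced here, so everything below is an assumption: in particular, the failure probability of a route is computed as the probability that at least one element fails under independent failures, and all function names and values are hypothetical.

```python
import math


def link_capacity_ok(flows_on_link, capacity, mu1):
    """Capacity check: total traffic on a link must not exceed mu1 * c_ij."""
    return sum(flows_on_link) <= mu1 * capacity


def path_delay(path, delay):
    """Sum the per-link delays d_ij along a sequence of nodes."""
    return sum(delay[(i, j)] for i, j in zip(path, path[1:]))


def failure_probability(probabilities):
    """Probability that at least one element fails, assuming independent failures."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


def ram_ok(required_ram, node_ram):
    """RAM check: microservices placed on a node must fit into its RAM r_i."""
    return sum(required_ram) <= node_ram


# Illustrative values only.
delay = {(0, 1): 2.0, (1, 2): 5.0}
path = [0, 1, 2]
feasible = (
    link_capacity_ok([30.0, 40.0], capacity=100.0, mu1=0.8)
    and path_delay(path, delay) <= 10.0
    and failure_probability([0.01, 0.02]) < 0.05
    and ram_ok([64, 128], node_ram=256)
)
```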

S3, construct the optimization objectives related to the decision variables: Equation (7) computes the energy consumption E(t) of the network in time slot t.

Equation (8) computes the overall response time T(t) of the workflows in the network in time slot t.

Equation (9) computes the rerouting cost Cost(t) of the network in time slot t.

S4, construct the constraints related to the decision variables: Constraint (10) states that, apart from the node that starts the workflow and the node to which it is delivered on completion, every intermediate node has both incoming and outgoing traffic.

Constraint (11) enforces that only loop-free routes are considered in this architecture.

Constraint (12) states that when a workflow reaches a node, the microservice executed there is the one that was requested.

Constraint (13) ensures that the node on which a workflow executes a microservice supports the requested microservice.

Constraint (14) ensures that a workflow does not request the same microservice more than once.

Constraint (15) ensures that microservices are executed only on nodes through which the workflow passes.

Constraint (16) states that a node is in the working state whenever it provides a microservice to a workflow, where S_i(t)∈{0,1}.
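Several of these rules can be checked directly on a candidate route and placement. The following sketch covers the loop-free rule, node support, single request per microservice, on-route execution and the working-state rule; the data layout and the function names are assumptions made for the example.

```python
def is_loop_free(path):
    """No node may appear twice in the route of a workflow."""
    return len(set(path)) == len(path)


def placement_violations(path, placement, supports):
    """List rule violations for one workflow.

    path: sequence of visited nodes; placement: ordered (microservice, node) pairs;
    supports: node -> set of microservices the node supports.
    """
    problems = []
    if not is_loop_free(path):
        problems.append("route revisits a node")
    names = [name for name, _ in placement]
    if len(set(names)) != len(names):
        problems.append("microservice requested more than once")
    for name, node in placement:
        if node not in path:
            problems.append(f"{name} executed off-route on node {node}")
        elif name not in supports.get(node, set()):
            problems.append(f"node {node} does not support {name}")
    return problems


def working_state(num_nodes, placement):
    """S_i(t) = 1 for every node that executes at least one microservice."""
    busy = {node for _, node in placement}
    return [1 if i in busy else 0 for i in range(num_nodes)]


# Illustrative data only.
supports = {0: {"filter"}, 1: {"aggregate"}, 2: {"deliver"}}
plan = [("filter", 0), ("aggregate", 1), ("deliver", 2)]
issues = placement_violations([0, 1, 2], plan, supports)
```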

S5, construct the objective functions: Under this architecture, two adaptation-related problems are considered, both concerning fault response.

1) Network fault recovery problem: Objective function 1 first minimizes the cost of rerouting network traffic and then the overall response time of the workflows in the network, with the aim of responding quickly; the weights satisfy α₁>α₂.

min α₁Cost(t)+α₂T(t)

2) Periodic network reconfiguration problem: After each complete timer period, the whole network is optimized. Objective function 2 mainly minimizes service execution time and node energy consumption so that the network reaches its best state, while keeping the rerouting cost from growing too large; the weights satisfy β₁>β₂>β₃.

min β₁T(t)+β₂E(t)+β₃Cost(t)
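To make the weighting concrete, the two objectives can be evaluated for candidate strategies as below. The weight values, the plan names and their (Cost, T) figures are hypothetical; the text only fixes the orderings α₁>α₂ and β₁>β₂>β₃.

```python
def recovery_objective(cost, response_time, alpha=(0.7, 0.3)):
    """Objective 1: alpha1 * Cost(t) + alpha2 * T(t), with alpha1 > alpha2."""
    a1, a2 = alpha
    if not a1 > a2:
        raise ValueError("fault recovery expects alpha1 > alpha2")
    return a1 * cost + a2 * response_time


def reconfiguration_objective(response_time, energy, cost, beta=(0.5, 0.3, 0.2)):
    """Objective 2: beta1 * T(t) + beta2 * E(t) + beta3 * Cost(t), with beta1 > beta2 > beta3."""
    b1, b2, b3 = beta
    if not b1 > b2 > b3:
        raise ValueError("periodic reconfiguration expects beta1 > beta2 > beta3")
    return b1 * response_time + b2 * energy + b3 * cost


# Two hypothetical recovery plans given as (Cost(t), T(t)) pairs.
plans = {"reroute_one_flow": (1, 12.0), "reroute_three_flows": (3, 9.0)}
best_plan = min(plans, key=lambda name: recovery_objective(*plans[name]))
```

With these assumed weights the plan that reroutes a single workflow wins even though its response time is higher, which matches the stated priority of rerouting cost in the recovery problem.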

Step 4: Use a solver to obtain the optimal response strategy. Step 3 yields the following two optimization problems.

Optimization problem 1 is expressed as:

min α₁Cost(t)+α₂T(t)

s.t. (1)-(16)

Optimization problem 2 is expressed as:

min β₁T(t)+β₂E(t)+β₃Cost(t)

s.t. (1)-(16)

Both problems are integer programming problems and can be passed directly to the commercial solver Gurobi to obtain numerical solutions; the corresponding complete fault response flow is shown in Figure 3. Optimization problem 2 can be solved directly with Gurobi. For optimization problem 1, if the requirement on fault response time is especially strict, the suboptimal heuristic algorithm described below can be used instead.

Step 5: Because the optimization problems under this architecture have high computational complexity, a suboptimal heuristic method is proposed, as shown in Figure 4. It is intended for fault handling scenarios with stricter requirements on fault response time and can respond quickly to network traffic that varies over time. The method is adaptive and suited to the networks of real factory scenarios. The specific steps are as follows:

S1, the fault detection component activates the fault response program. When solving optimization problem 1, the inputs of the algorithm are the attributes of the set F_effected of workflows that pass through the failed node or link, and the current network topology G={N,L} (when a node or link fails, the network topology and the microservice sets supported by the nodes may change). Resources are then allocated to each affected workflow f_i∈F_effected in turn.

S2, according to the link capacity constraint (1) and the transmission delay constraint (2), starting from the node CN that currently executes m_a∈f_i, delete every link whose available capacity or transmission delay cannot meet the requirements of the workflow. This yields a list K of paths from the current node CN to the other nodes that support the next required microservice m_(a+1).

S3, according to the node capacity constraint (4) and the processing delay constraint (5), delete from list K the nodes whose available capacity or processing delay cannot meet the requirements of the workflow. Then select from K the new node NN that has the shortest execution time and satisfies the failure probability constraints.

S4, append the path to the new node NN to the current path, and delete the current node from the network topology to prevent the workflow from looping. Repeat S1-S4 until all microservices in workflow f_i have been placed, then go to S5.

S5, output the path of workflow f_i, along which the workflow will be completed in the optimal execution time. Continue by allocating resources to the next affected workflow f_(i+1), and repeat S1-S5 until all affected workflows have been processed and fault handling is complete.
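Steps S1-S5 can be sketched as a greedy routine. This is a simplified illustration under stated assumptions, not the patented implementation: paths are found with Dijkstra's algorithm on link delay; processing time is taken as traffic × pc_x / p_i and added to the delay as if both were in the same unit; every node already visited is retired (slightly stricter than deleting only the current node); resources are not decremented between workflows; and final delivery to d_f is omitted. All names and values are hypothetical.

```python
import heapq

INF = float("inf")


def shortest_paths(adj, source):
    """Dijkstra over adj[i] = {j: delay}; returns (distance, predecessor) maps."""
    dist, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue
        for v, w in adj.get(u, {}).items():
            if d + w < dist.get(v, INF):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, prev


def route_workflow(links, nodes, wf, max_link_delay, max_proc_delay, max_node_fail):
    """Greedily place the microservices of one workflow; return (path, placement) or None."""
    current, path, placement, removed = wf["start"], [wf["start"]], [], set()
    for name, pc, ram in wf["services"]:
        # S2: drop links that lack capacity, exceed the delay bound, or touch retired nodes.
        adj = {}
        for (i, j), (delay, free_cap) in links.items():
            if i in removed or j in removed:
                continue
            if free_cap >= wf["traffic"] and delay <= max_link_delay:
                adj.setdefault(i, {})[j] = delay
        dist, prev = shortest_paths(adj, current)
        # S3: keep reachable nodes that meet the RAM, processing-delay and failure bounds.
        best = None
        for n, info in nodes.items():
            if n not in dist or name not in info["supports"]:
                continue
            proc = wf["traffic"] * pc / info["speed"]
            if info["free_ram"] < ram or proc > max_proc_delay or info["p_fail"] > max_node_fail:
                continue
            if best is None or dist[n] + proc < best[0]:
                best = (dist[n] + proc, n)
        if best is None:
            return None  # no feasible node: report for manual handling
        new_node = best[1]
        # S4: append the hop to the new node, then retire visited nodes to prevent loops.
        hop, v = [], new_node
        while v != current:
            hop.append(v)
            v = prev[v]
        path.extend(reversed(hop))
        placement.append((name, new_node))
        removed.update(path[:-1])
        current = new_node
    return path, placement


def handle_fault(links, nodes, affected, **limits):
    """S1/S5: process each affected workflow in turn."""
    return {wf["name"]: route_workflow(links, nodes, wf, **limits) for wf in affected}


# Illustrative four-node network; links map (i, j) -> (delay, free capacity).
links = {(0, 1): (2.0, 100.0), (0, 2): (6.0, 100.0), (1, 2): (3.0, 100.0),
         (2, 3): (1.0, 100.0), (1, 3): (2.0, 5.0)}
nodes = {
    0: {"supports": {"filter"}, "free_ram": 256, "speed": 1000.0, "p_fail": 0.01},
    1: {"supports": {"aggregate"}, "free_ram": 256, "speed": 1000.0, "p_fail": 0.02},
    2: {"supports": {"aggregate", "deliver"}, "free_ram": 256, "speed": 2000.0, "p_fail": 0.01},
    3: {"supports": {"deliver"}, "free_ram": 256, "speed": 1000.0, "p_fail": 0.01},
}
wf = {"name": "f1", "start": 0, "traffic": 10.0,
      "services": [("filter", 20, 64), ("aggregate", 40, 128), ("deliver", 5, 16)]}
limits = {"max_link_delay": 10.0, "max_proc_delay": 5.0, "max_node_fail": 0.05}
result = handle_fault(links, nodes, [wf], **limits)
```

Calling handle_fault again with link (0, 1) removed from links models a link failure: the workflow is then steered over link (0, 2) and both remaining microservices are placed on node 2.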

The preferred embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain through logical analysis, reasoning or limited experimentation, based on the concept of the present invention and on the prior art, shall fall within the scope of protection defined by the claims.

Claims

1. A fault response method for a factory microservice system in an SDN-FOG environment, characterized in that it includes the following steps:

Step 1. Establish a microservice system fault handling architecture based on the SDN-FOG environment;

Step 2. System modeling, describing the relationships among the infrastructure, the transmission network and the fog computing devices;

Step 3. Construct an integer programming problem with two optimization objectives: "fault handling" for fault recovery and "fault prevention" for periodic reconfiguration;

Step 4. Use the Gurobi solver to find the optimal solution to the "fault prevention" problem;

Step 5. Use a heuristic algorithm for the "fault handling" problem;

The SDN-FOG-based microservice system fault handling architecture of step 1 includes an application layer, a control layer and a base layer;

The application layer includes fault handling programs, timers and functional programs;

The control layer includes various components required for fault response solutions, including network monitoring components, fault detection components, node configuration components, and fault prevention components;

In the base layer, SDN-capable switches are deployed, and the switches have different failure probabilities in each time slot; the central master controller is connected to the switches to aggregate network topology and traffic information, and uses the southbound interface to configure the switches dynamically; each switch is connected to one fog node and the pair is treated as a single node; each fog node corresponds to a set of industrial IoT devices and fog servers;

Step 1 includes the following steps:

Step 1.1. Sensors connected to the factory field devices continuously collect information, and the sensor data is preprocessed and sent to the connected SDN-capable switch; the network monitoring component collects, from the switches, node information (covering field devices, fog nodes and switches) and link information, and sends the information to the application layer to continuously monitor network traffic;

Step 1.2. The fault detection component determines, based on the information collected by the network monitoring component, whether a fault has occurred in the transmission network; if a fault is detected at a switch, fog node or link, the network monitoring component sends the updated network topology and the set of microservices supported by each fog node to the fault response program;

Step 1.3. The fault response program computes in real time and reallocates resources to the workflows passing through the failed node or link; the node configuration component allocates resources to the workflows according to the scheduling and forwarding table, and applies the fault response decisions made by the fault response program in the application layer to the switches and fog nodes;

Step 1.4. After the configuration is completed, the timer is refreshed; if node or link failures leave some microservices in a workflow unable to complete, the fault type is reported and the case is handed over for manual handling;

Step 1.5. After each complete timer period, the fault prevention component is activated and proactively reconfigures the network periodically, optimizing the paths selected for the workflows, improving overall quality of service and reducing the failure probability;

Step 2 includes the following steps:

Step 2.1. Transmission network modeling: establish the network topology matrix and the link propagation delay matrix;

Step 2.2. Modeling of fog nodes and supported microservice sets: use a matrix to represent the microservice set supported by each node, and group the microservices logically according to business and quality requirements, each group corresponding to the microservice set supported by a fog node;

Step 2.3. Workflow modeling: F is the set of workflows, and each workflow carries out a specific function; a workflow f∈F is an ordered tuple f={m₁,m₂,…,m_|f-1|,m_|f|}; to execute a workflow, data must flow from the node that starts the workflow to the node that executes microservice m₁, from there to the node that executes m₂, and so on; after the last functional microservice m_|f-1| completes, m_|f| indicates that the data must be returned to the factory equipment or delivered to the control layer;

Step 2.4. Quality-of-service indicator modeling: network energy consumption, workflow response time, the predefined threshold for link failure probability, the predefined threshold for node failure probability, and the rerouting cost;

Step 3 includes the following steps:

Step 3.1. Define the decision variables: including the workflow allocation matrix, the microservices executed by each workflow, and the network link resource allocation matrix;

Step 3.2. Build the resource constraints: including the link capacity constraint between nodes, the transmission delay constraint of each workflow over the links, the failure probability constraint on the links serving a workflow, the RAM capacity constraint on the nodes that provide microservices, the processing delay constraint of each workflow in the nodes, and the failure probability constraint on the nodes serving a workflow;

Step 3.3. Construct the optimization objectives related to the decision variables: including energy consumption, overall response time and rerouting cost;

Step 3.4. Construct the constraints related to the decision variables: including that when a workflow reaches a node, the requested microservice is executed; that a workflow does not request the same microservice more than once; that microservices are executed only on nodes through which the workflow passes; and that a node is in the working state when it provides a microservice to a workflow;

Step 3.5. Construct the objective functions: including network fault recovery and periodic network reconfiguration;

The steps of the heuristic algorithm are as follows:

S1, the fault detection component activates the fault response program to solve the fault recovery problem; the inputs of the algorithm are the attributes of the set of workflows passing through the failed node or link and the current network topology, and resources are then allocated to each affected workflow in turn;

S2, according to the link capacity constraint and the transmission delay constraint, starting from the node currently executing a microservice, delete all links whose available capacity or transmission delay cannot meet the requirements of the workflow, and then obtain a list of paths from the current node to the other nodes that support the next required microservice;

S3, according to the node capacity constraint and the processing delay constraint, delete from the list the nodes whose available capacity or processing delay cannot meet the requirements of the workflow, and then select the new node that has the shortest execution time and satisfies the failure probability constraints;

S4, append the path to the new node to the current path, and delete the current node from the network topology to prevent the workflow from looping; repeat S1-S4 until all microservices in the workflow are completed, then go to S5;

S5, output the path of the workflow, along which the workflow will be completed in the optimal execution time; continue by allocating resources to the next affected workflow, and repeat S1-S5 until all affected workflows have been processed and fault handling is complete.