CN114900420A - Distributed software service guarantee method based on group intelligence - Google Patents


Info

Publication number
CN114900420A
Authority
CN
China
Prior art keywords
service
decision
software
micro
guarantee
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210311620.1A
Other languages
Chinese (zh)
Inventor
刘潇健
于学军
张旸旸
顾问
边洪梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202210311620.1A priority Critical patent/CN114900420A/en
Publication of CN114900420A publication Critical patent/CN114900420A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04 Network management architectures or arrangements
    • H04L41/042 Network management architectures or arrangements comprising distributed management centres cooperatively managing the network
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a distributed software service guarantee method based on group intelligence. The method clearly separates 3 main components, software service, service adaptation, and reinforcement learning decision, completely decoupling service guarantee decisions from the business logic of the distributed software services. In the distributed software service system, each software service keeps its original functional and guarantee modules and adds adaptation modules such as state monitoring and dynamic configuration, each providing an interface to interact with the reinforcement learning decision component, thereby realizing situation awareness and dynamic control of safeguard behavior. Through continuous interactive learning between the agents and environmental factors such as the system's operation and maintenance effects and the satisfaction of user expectations, the reinforcement learning decision component integrates the core reinforcement-learning idea of improving performance through experience into the service guarantee activities under a distributed software service architecture, realizing a group intelligent decision mechanism in which services decide independently yet cooperate with each other.

Description

Distributed software service guarantee method based on group intelligence
Technical Field
The invention belongs to the field of distributed software services.
Background
A distributed software service splits an originally monolithic software system into many small services that run independently and cooperate through lightweight communication mechanisms to deliver business value to users. Compared with a traditional monolithic system, the distributed software service architecture has advantages such as independent deployment, easy scalability, and decentralization. The independent yet cooperative relationships among distributed software services are not only the architecture's strength but also a clear risk to service credibility: on the one hand, independence means that each service operates, maintains, and decides on its own, so the pursuit of local benefits easily overrides global benefits, and distributed software services may even make mutually contradictory service guarantee efforts; on the other hand, the complex business interactions among distributed software services often amplify a local fault into cascading failures with an avalanche effect, making it impossible to separate root causes from symptoms and to decide on the next operation. The core of solving the service guarantee problem is to establish an effective group decision mechanism for the distributed software service system, giving each distributed software service the ability to perceive the global situation and to decide with the whole system in view.
Among the many artificial intelligence methods, reinforcement learning is an important approach to sequential decision problems: it can learn an optimal strategy through interaction with the environment, without prior knowledge. Moreover, a distributed software service system combines independent local decisions with distributed cooperative group decisions, which is naturally compatible with multi-agent reinforcement learning and suggests a new way to build a group intelligent decision mechanism under a distributed software service architecture. The invention provides a distributed software service guarantee mechanism based on swarm intelligence, together with a mechanism and method for solving guarantee problems of the distributed software service system under interference, such as service continuation, service maintenance, and a smooth transition through the interference.
Most existing research on distributed software service guarantees focuses on the local situation of each software service and aims to keep that service running normally; it does not comprehensively consider guaranteeing service expectations from the user's perspective. How to establish a globally deciding service credibility guarantee system without breaking the original distributed, independent architecture of the services is one of the problems the distributed software service architecture urgently needs to solve.
Disclosure of Invention
The distributed software service architecture is a novel distributed and autonomous software architecture in which each distributed software service can configure and deploy its own independent safeguard measures. From the perspective of service construction, a distributed and autonomous architecture has advantages, but when interference occurs and credibility must be guaranteed, separately applied safeguard measures and mutually independent decision systems easily sacrifice global benefits to local ones; they cannot make globally optimal decisions based on the global situation, and the distributed software services may even work against each other. How to reach globally optimal decisions under the distributed software service architecture while preserving each service's distributed, independent decision making is the primary problem this work solves. The invention models the guarantee decision system of each distributed software service as a reinforcement learning agent, preserving the distributed and autonomous architecture and allowing heterogeneous safeguard measures, so that independent decision capability is maintained. Relative to each agent, the safeguard measures, situation evaluations, and user rewards of the other distributed software services are treated as the overall situation environment; through interactive experience learning with this environment, each agent gains the ability to perceive the overall situation and the other agents, establishing a group intelligent decision mechanism that is independent yet cooperative. In this way, the core reinforcement-learning idea of improving performance through experience is integrated into the service credibility guarantee activities of the distributed software service system, solving problems such as service continuation, service maintenance, and smooth transition under interference, providing a new group-intelligence approach to distributed software service guarantees, and laying a foundation for intelligent guarantees.
The distributed software service architecture is a novel distributed and autonomous software architecture in which each distributed software service can configure and deploy independent safeguard measures. When interference threatens credibility, separately applied safeguard measures and mutually independent decision systems easily sacrifice global benefits to local ones, cannot make globally optimal decisions based on the global situation, and may even lead the distributed software services to contradictory efforts. Therefore the invention, based on multi-agent reinforcement learning and while preserving the distributed and autonomous architecture and allowing heterogeneous safeguard measures, models the guarantee decision system of each distributed software service as a reinforcement learning agent and establishes an independent yet cooperative intelligent decision method through interactive experience learning between the agents and the overall situation environment.
The general framework of the method is shown in FIG. 1.
The framework clearly separates 3 main components, software service, service adaptation, and reinforcement learning decision, completely decoupling service guarantee decisions from the business logic of the distributed software services. In the distributed software service system, each software service keeps its original functional and guarantee modules and adds adaptation modules such as state monitoring and dynamic configuration, each providing an interface to interact with the reinforcement learning decision component, thereby realizing situation awareness and dynamic control of safeguard behavior (for clarity of presentation, the mutual service-call relationships among the software services are simplified in FIG. 1). Through continuous interactive learning between the agents and environmental factors such as the system's operation and maintenance effects and the satisfaction of user expectations, the reinforcement learning decision component integrates the core reinforcement-learning idea of improving performance through experience into the service guarantee activities under the distributed software service architecture, realizing a group intelligent decision mechanism that is independent yet cooperative.
Service adaptation component
The core function of the service adaptation component is to give the reinforcement learning decision component an interface for perceiving the running state of the software service system and for promptly controlling the configuration and execution of all kinds of safeguard measures. Its main function modules are a state monitoring module and a dynamic configuration module. Wherein:
(1) State monitoring module. What is monitored depends on the actual situation: general indicators such as request volume, accuracy, and response time, or specific business parameters, error codes, and so on. For example, the popular Spring Cloud micro-service framework provides the /metrics endpoint, /health endpoint, /trace endpoint, and similar interfaces for conventional micro-service status monitoring; customized status monitoring modules and interfaces are equally applicable to the method of this patent.
(2) Dynamic configuration module. The method is oriented to dynamic guarantees at run time: the reinforcement learning decision component must select the optimal guarantee strategy and configuration according to the real-time service state and dynamically configure and execute safeguard measures without restarting the service. The method establishes a configuration center outside the functional services to centrally manage the configuration files of each software service; according to its decision results, the reinforcement learning decision component controls the content of each software service's configuration file and the configuration update activities in real time, thereby completely decoupling software service guarantee decisions from the software service logic.
The interaction logic between the software service adaptation component and the reinforcement learning decision component is shown in FIG. 2. The reinforcement learning decision component obtains the service state set of each software service through the state monitoring interface and makes a guarantee strategy decision based on the reinforcement learning model. According to the decision result, the reinforcement learning decision component dynamically updates the configuration file of the corresponding software service and then sends each software service a request to update its configuration. A software service receiving the configuration update request asks the configuration center for the latest configuration file and updates the configuration of the service and of the safeguard measures.
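The loop below is a minimal Python sketch of this interaction, assuming Spring Boot Actuator style /health, /metrics, and /refresh endpoints as in the implementation section; the service URLs and the decide() and update_config_center() callables are illustrative placeholders, not part of the patent:

    import requests

    SERVICES = {
        "core_client": "http://localhost:8081",      # placeholder URLs
        "non_core_client": "http://localhost:8082",
    }

    def observe(base_url):
        """State monitoring interface: pull health and metrics from one service."""
        health = requests.get(base_url + "/health", timeout=3).json()
        metrics = requests.get(base_url + "/metrics", timeout=3).json()
        return {"health": health, "metrics": metrics}

    def control_step(decide, update_config_center):
        # 1. Collect the service state set of every software service.
        states = {name: observe(url) for name, url in SERVICES.items()}
        # 2. Make a guarantee strategy decision with the RL model (placeholder).
        actions = decide(states)
        # 3. Write each new configuration to the configuration center, then ask
        #    the service to reload it, without restarting the service.
        for name, action in actions.items():
            update_config_center(name, action)
            requests.post(SERVICES[name] + "/refresh", timeout=3)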
Reinforcement learning decision component
The reinforcement learning decision component models each software service with decision-making capability as an independent agent and adopts multi-agent reinforcement learning with centralized training and decentralized execution: during training the global state is used and the reinforcement learning of each agent takes the action strategies of the other agents into account, while during execution each agent makes its safeguard decisions only from its own state perception. An experience replay pool is set up, and the experience replay mechanism addresses the correlation among training samples and their non-stationary probability distribution. Each state transition record holds a state-action pair together with the corresponding reward and next state, namely:
(s_1, s_2, …, s_n; a_1, a_2, …, a_n; R; s_1′, s_2′, …, s_n′)
where s_i is the current state of each software service, i.e., the service state set shown in FIG. 2; a_i is the safeguard action selected by each software service; R is the reward value, for example the degree to which the various user expectations are satisfied after the safeguard actions are executed; and s_i′ is the next state of each software service. The training framework and process for 2 software services is shown in FIG. 3.
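A minimal replay pool with this record layout might look as follows; the ring-buffer storage and uniform random sampling are standard choices, and the default capacity 200 and batch size 32 are taken from the implementation section below:

    import random
    from collections import deque

    class ReplayPool:
        """Stores (s_1..s_n; a_1..a_n; R; s_1'..s_n') transition records."""

        def __init__(self, capacity=200):
            self.records = deque(maxlen=capacity)   # oldest records are evicted

        def add(self, states, actions, reward, next_states):
            self.records.append((states, actions, reward, next_states))

        def sample(self, n=32):
            # Uniform random sampling breaks the correlation between
            # consecutive training samples.
            return random.sample(list(self.records), n)

        def __len__(self):
            return len(self.records)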
2 identical decision strategies are designed for the "behavior decision" module of each software service: a target strategy μ_i′ and an evaluation strategy μ_i, which make safeguard behavior decisions based on the software service's own state, wherein:
The target strategy μ_i′ takes the next state s_i′ of its own software service as input and outputs the safeguard action a_i′ corresponding to s_i′:
a_i′ = μ_i′(s_i′)
The target strategy μ_i′ is not actively trained; its parameters are periodically updated with those of the continuously learned evaluation strategy μ_i, which increases the stability of the learning process.
The evaluation strategy μ_i takes the current state s_i of its own software service as input and outputs the safeguard action a_i corresponding to s_i:
a_i = μ_i(s_i)
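A sketch of one agent's "behavior decision" module under these definitions: two structurally identical networks, where the target policy μ_i′ is never trained directly but periodically copies the weights of the evaluation policy μ_i; the layer sizes and dimensions are illustrative, only the 2-layer shape follows the implementation section:

    import tensorflow as tf

    def make_policy(state_dim, action_dim):
        # Small 2-layer network mapping one service's own state to its action.
        inputs = tf.keras.Input(shape=(state_dim,))
        x = tf.keras.layers.Dense(32, activation="relu")(inputs)
        outputs = tf.keras.layers.Dense(action_dim, activation="tanh")(x)
        return tf.keras.Model(inputs, outputs)

    mu_eval = make_policy(state_dim=6, action_dim=1)    # evaluation policy μ_i
    mu_target = make_policy(state_dim=6, action_dim=1)  # target policy μ_i'

    def sync_policy():
        # μ_i' is not trained; it periodically copies μ_i's parameters,
        # which stabilizes learning.
        mu_target.set_weights(mu_eval.get_weights())

    sync_policy()  # initial synchronization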
The evaluation strategy μ_i is continuously trained according to the action value (i.e., Q value) feedback of the "value decision" module:
designing a comprehensive criticic module (namely a value decision module) for all software services, and outputting a Q value corresponding to each software service according to a comprehensive reward function. The 'value decision' module designs 2 neural networks with the same structure: the value decision target network Net _ target _ criticc and the value decision evaluation network Net _ evaluation _ criticc are used for outputting Q values of software service guarantee behaviors based on the global state of the software service system, wherein:
net _ target _ critical in the software service System Next State(s) 1 ′,s 2 ′,…,s′ n ) And correspond to
(a 1 ′,a′ 2 ,…,a′ n ) Outputting a Q value corresponding to the next state of each software service for input:
Q i ′(s i ′,a i ′,θ target )
where θ_target is the parameter of Net_target_critic. Net_target_critic is not actively trained; its parameters are periodically updated with those of the continuously learned Net_evaluation_critic, which increases the stability of the learning process.
Net_evaluation_critic takes the current state of the software service system (s_1, s_2, …, s_n) and the corresponding actions (a_1, a_2, …, a_n) as input, and outputs the Q value corresponding to the current state of each software service:
Q_i(s_i, a_i, θ_eval)
where θ_eval is the parameter of Net_evaluation_critic. Net_evaluation_critic periodically selects several state transition records (say N) at random from the experience replay pool for training and learning. Training and learning is a process of continually reducing the difference between the estimated Q value and the actual Q value. The loss function is defined as:
Loss(θ_eval) = (1/N) Σ_j [ R_i^j + γ · Q_i′(s_i′^j, a_i′^j, θ_target) − Q_i(s_i^j, a_i^j, θ_eval) ]²
where j indexes the N sampled records, R_i is the real-time reward value, and γ ∈ [0, 1] is the discount factor; the larger γ is, the more the learning process emphasizes long-term rewards.
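Under the loss reconstructed above, a training step for Net_evaluation_critic could be sketched as below; the networks are assumed to be tf.keras models over the concatenated global state and all actions, and for brevity every target policy here reads the full next state rather than only its own s_i′:

    import tensorflow as tf

    GAMMA = 0.9  # discount factor γ, per the implementation section

    def critic_loss(net_eval, net_target, target_policies, batch):
        """Mean squared TD error over a batch of N sampled transition records."""
        s, a, r, s_next = batch  # tensors shaped (N, ...), r shaped (N, 1)
        # a' = μ'(s') from the target policies of all agents.
        a_next = tf.concat([mu(s_next) for mu in target_policies], axis=-1)
        q_next = net_target(tf.concat([s_next, a_next], axis=-1))  # Q'(s', a', θ_target)
        y = r + GAMMA * q_next                                     # "actual" Q value
        q = net_eval(tf.concat([s, a], axis=-1))                   # Q(s, a, θ_eval)
        return tf.reduce_mean(tf.square(y - q))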
The evaluation policy μ_i of each software service updates its parameters by gradient descent:
∇_{θ_μi} J ≈ (1/N) Σ_j ∇_{a_i} Q_i(s^j, a_1^j, …, a_n^j, θ_eval) |_{a_i = μ_i(s_i^j)} · ∇_{θ_μi} μ_i(s_i^j)
where ∇ denotes the gradient with respect to the corresponding parameter. Each parameter is updated by taking its gradient and moving in the direction in which the gradient decreases most rapidly.
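One gradient step of this policy update might be sketched as follows: descending on −Q is equivalent to ascending on Q, with net_eval and mu_eval as in the sketches above and other_actions standing in for the remaining agents' actions:

    import tensorflow as tf

    actor_opt = tf.keras.optimizers.RMSprop(learning_rate=1e-3)  # illustrative rate

    def update_policy(mu_eval, net_eval, s, other_actions):
        """One gradient step on μ_i to increase Q_i(s, a_1..a_n, θ_eval)."""
        with tf.GradientTape() as tape:
            a_i = mu_eval(s)                                  # a_i = μ_i(s_i)
            q = net_eval(tf.concat([s, a_i, other_actions], axis=-1))
            loss = -tf.reduce_mean(q)     # descend on -Q, i.e. ascend on Q
        grads = tape.gradient(loss, mu_eval.trainable_variables)
        actor_opt.apply_gradients(zip(grads, mu_eval.trainable_variables))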
Verification shows that the method can effectively select safeguard behaviors and can choose degradation targets according to the source of service risk, realizing intelligent and flexible guarantees. In summary, the innovations and effects of the method are mainly reflected in:
(1) The core reinforcement-learning idea of improving performance through experience is integrated into distributed software service guarantee activities. The invention maps the premise of independent local decisions and the demand for global cooperative decisions in distributed software services onto the multi-agent reinforcement learning method, proposing a novel group intelligent guarantee mechanism that brings experience-driven performance improvement into distributed software service guarantee activities.
(2) Group decision making is realized without breaking the original distributed independent decision and guarantee system. The distributed and autonomous software architecture has clear advantages for building business logic, but for service guarantees, separately applied safeguard measures and mutually independent decision systems easily sacrifice global benefits to local ones. The group decision system proposed by the invention uses centralized learning with decentralized decision making to give each software service global situation awareness and independent intelligent decisions, realizing an independent yet cooperative group intelligent decision mechanism while preserving the distributed and autonomous characteristics and allowing heterogeneous safeguard measures.
Drawings
FIG. 1 population intelligent distributed software service assurance framework
FIG. 2 logic diagram of adaptation component and decision component interaction
FIG. 3 a block diagram of a reinforcement learning decision component
FIG. 4 per-request average reward of each service guarantee mechanism under different service risks
Detailed Description
Taking the Spring Cloud micro-service framework as an example to implement the invention, the micro-service system comprises:
(1) 2 request-processing micro-services, which receive user-information query requests, call the background Provider_user micro-service, and return results to the requesting user. The Core_client micro-service is the core service that needs key guarantees; the Non_Core_client micro-service is the non-core service, whose performance can if necessary be sacrificed to keep the Core_client micro-service running normally.
(2) 1 Provider_user micro-service responsible for background business processing, which receives the user-information query requests of the 2 request-processing micro-services and returns query results.
The method comprises the following specific implementation steps:
(1) Because the Spring Cloud micro-service framework provides a ready-made configuration center module, the configuration center can be built directly with spring-cloud-config-server; each micro-service is given a configuration client, with its configuration file path set to the corresponding location in the configuration center.
(2) Add the spring-boot-starter-actuator dependency to each micro-service; activate the /refresh endpoint for refreshing the configuration, and activate the /metrics endpoint, /health endpoint, and /trace endpoint for software service situation monitoring.
(3) Develop an Agent for each micro-service in Python. The Agent obtains the real-time situation of the software service through GET requests to the /metrics endpoint, /health endpoint, and /trace endpoint; sends a heartbeat monitoring request to the corresponding micro-service at a random moment within every 3s, monitoring and recording the response time and response content; and dynamically updates the software service configuration by modifying the corresponding configuration file in the configuration center and sending a POST request to the /refresh endpoint. The frequency of heartbeat monitoring requests depends on the decision efficiency requirements.
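The heartbeat part of such an Agent might look as below; the 3s window and the response-time recording follow this step, while the monitored URL, the log structure, and the request budget are illustrative:

    import random
    import time
    import requests

    def heartbeat_monitor(service_url, log, max_requests=1000):
        """Send one heartbeat request at a random moment within every 3s window
        and record its response time and content."""
        for _ in range(max_requests):
            time.sleep(random.uniform(0, 3))    # random moment within the window
            start = time.monotonic()
            try:
                resp = requests.get(service_url, timeout=3)
                log.append((time.monotonic() - start, resp.status_code, resp.text))
            except requests.RequestException as exc:
                log.append((time.monotonic() - start, None, repr(exc)))  # failure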
(4) Design of a reinforcement learning decision component:
Environment state: the request response time of the request-processing micro-services, the request return content, and the real-time situation of the software services are taken as the environment state for reinforcement learning. The degree to which user service demands are met is divided into 3 cases: 1) normal service, i.e., a correct request result is returned within the specified time (based on an analysis of micro-service performance and the needs of the comparison experiments, the response time threshold is set to 3s); 2) degraded service, i.e., a micro-service is degraded so that its concurrent requests are fused, preserving service continuation while meeting part of the service demand, implemented by fusing all requests and returning a default value to the user; 3) service failure, i.e., the request response time exceeds 3s or the request returns an error.
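A sketch classifying one heartbeat response into these 3 cases; the 3s threshold follows this step, while DEGRADED_BODY is a hypothetical marker for the fused default value:

    RESPONSE_TIME_LIMIT = 3.0   # seconds, per the threshold above
    DEGRADED_BODY = "default"   # hypothetical sentinel returned by a fused service

    def classify(response_time, body, ok):
        """Map one heartbeat response to normal / degraded / failed."""
        if not ok or response_time > RESPONSE_TIME_LIMIT:
            return "failed"     # error or timeout: service failure
        if body == DEGRADED_BODY:
            return "degraded"   # fused request answered with the default value
        return "normal"         # correct result within the time limit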
Reward function: the response situation of the heartbeat monitoring requests randomly sent by each micro-service Agent is used as the basis of the reinforcement learning reward. The reward function is as follows:
R = (ΣR_CC + ΣR_NC) / (Core_requests + Non_Core_requests)
where the per-request rewards are, for example:
R_CC = 5 for normal service, 1 for degraded service, -4 for service failure
R_NC = 5 for normal service, 1 for degraded service, -4 for service failure
Core_requests and Non_Core_requests are the total numbers of micro-service state heartbeat monitoring requests randomly sent in the corresponding period, and ΣR_CC and ΣR_NC are the sums of the heartbeat monitoring request rewards for the Core_client micro-service and the Non_Core_client micro-service, respectively. The reward values in the formula can be adjusted to different guarantee decision requirements: services needing key guarantees are generally given higher values, and faults that must above all be avoided are given lower values.
Reinforcement learning method design: the states s and s′ are designed as the request response times of the 2 request-processing micro-services, the request return content, and the real-time situation of the software services (including request volume, I/O load, and CPU and memory usage); the safeguard behavior is designed as whether each of the 2 request-processing micro-services performs degradation to fuse its concurrent requests; the reward value r is designed as the average reward of all heartbeat monitoring requests within 15s after the safeguard action is executed; the experience replay pool capacity is 200, and every 5 steps 32 state transition records are randomly selected from the pool as training samples; because the example scenario is simple, a 2-layer target network Net_target and a 2-layer evaluation network Net_evaluation are built with TensorFlow, and the parameters of the evaluation network Net_evaluation are copied to the target network Net_target once every 200 learning steps; the neural networks are optimized with the currently popular RMSprop optimizer; the discount factor γ is set to 0.9 (the larger γ is, the more the learning process emphasizes future rewards; it can be adjusted to the actual situation).
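A skeleton tying these hyperparameters together; env, policy, learn_step, and sync_targets stand for the environment interface and the update routines sketched earlier and are assumptions of this sketch, not artifacts of the patent:

    POOL_CAPACITY = 200   # experience replay pool capacity
    BATCH_SIZE = 32       # records sampled per learning step
    LEARN_EVERY = 5       # environment steps between learning steps
    SYNC_EVERY = 200      # learning steps between evaluation → target copies

    def train(env, policy, pool, learn_step, sync_targets, episodes=100):
        learned = 0
        for _ in range(episodes):
            states, done, step = env.reset(), False, 0
            while not done:
                actions = policy(states)          # a_i = μ_i(s_i) + exploration
                next_states, reward, done = env.step(actions)
                pool.add(states, actions, reward, next_states)
                step += 1
                if step % LEARN_EVERY == 0 and len(pool) >= BATCH_SIZE:
                    learn_step(pool.sample(BATCH_SIZE))  # critic + actor updates
                    learned += 1
                    if learned % SYNC_EVERY == 0:
                        sync_targets()            # copy evaluation nets to targets
                states = next_states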
To verify the actual effect of the method, comparison experiments under different service risk scenarios are designed based on the case in 2.2.
For the comparison objects: Hystrix, developed by the Netflix API team, is a widely used open-source rate-limiting and circuit-breaking library, and the currently popular Spring Cloud micro-service architecture adopts Hystrix as supporting infrastructure, which has further promoted its wide use. The Hystrix mechanism is therefore taken as the main comparison object, and the service guarantee model trained by this method is compared with the Hystrix service fusing mechanism and with a Normal mechanism that takes no safeguard measures at all.
For the service risk scenarios: 5 scenarios (the number of concurrent users in each scenario is based on performance tests of each micro-service) are designed according to whether there is joint high concurrency and whether each request-processing micro-service is independently high-concurrency, as shown in Table 1. A scenario name consists of 3 fields: a joint concurrency field (high concurrency HJC or low concurrency LJC), a Core_client independent concurrency field (high concurrency HCC or low concurrency LCC), and a Non_Core_client independent concurrency field (high concurrency HNC or low concurrency LNC).
TABLE 1 service risk scenarios and corresponding number of concurrent users
In each scenario, the average reward values of the heartbeat monitoring requests under the 3 methods (with reward functions uniformly designed according to the "reinforcement learning decision component" design in 2.2) are shown in FIG. 4.
In scenario 3 (LJC-LCC-LNC), neither the Hystrix fusing mechanism nor the Normal guarantee mode limits the requests sent to the Provider_user micro-service, and all requests are served normally, so the average reward value is 5. Because the accuracy of the reinforcement learning model cannot guarantee 100% optimal decisions, the average reward value of this method is slightly lower than 5 (4.9). In the other joint high-concurrency scenarios: (1) the Normal guarantee mode, which takes no measures, lets all request response times exceed the timeout, and by the reward function designed above the average reward value is -4; (2) with the ordinary Hystrix fusing mechanism, regardless of independent high concurrency, both request-processing micro-services fuse their concurrent requests to the Provider_user micro-service, and by the reward function designed above the average reward value is 1; since Hystrix retries the Provider_user micro-service every 15s, normal service is obtained at that moment, but once fusing is cancelled and the request pressure returns, fusing is triggered again, so the average reward value of the ordinary Hystrix fusing mechanism fluctuates within 1 ± 0.2; (3) the decision model trained by this method can intelligently and selectively degrade the Core_client or Non_Core_client micro-service and fuse its service according to whether each is independently high-concurrency, thereby raising the average reward value. In scenario 5 (HJC-HCC-HNC), the average reward value is slightly lower than that of the ordinary Hystrix fusing mechanism. This is the cost of this mechanism not retrying requests under high concurrency: compared with the ordinary Hystrix mechanism, it gives up the normal-service rewards of reconnection requests and a share of normal service requests in the independent high-concurrency case. If, however, the Provider_user micro-service faces external high-concurrency requests, the service reward of reconnection requests turns negative and the share of normal service requests also drops, lowering the average reward value of the ordinary Hystrix fusing mechanism.
The architecture of the method is divided into 3 core components: (1) a state monitoring module added on top of each distributed software service's original business logic, supporting the state perception capability the reinforcement learning method requires; (2) a dynamic configuration module established for each distributed software service, supporting the behavior execution capability the reinforcement learning method requires; (3) the decision capability of each distributed software service modeled as an independent agent, with centralized learning and decentralized decision making realized through distributed "behavior decision" modules and a single comprehensive "value decision" module, achieving cooperative group intelligent decisions while preserving distributed autonomy.

Claims (2)

1. The distributed software service guarantee method based on group intelligence, characterized by comprising:
the framework of the method is divided into a distributed software service system, service adaptation, and reinforcement learning decision; wherein: (1) the distributed software service system is a distributed software system providing business services, with service guarantee mechanisms deployed in some or all of its distributed modules to guarantee service reliability; (2) the service adaptation comprises state monitoring and dynamic configuration, extending the distributed software service system with state monitoring and dynamic configuration functions, each providing an interface to interact with the reinforcement learning decision component, realizing situation awareness and dynamic safeguard behavior control of the distributed software services; (3) the reinforcement learning decision obtains the service state set of each software service through the state monitoring interface of the service adaptation and makes adaptive optimization decisions based on the reinforcement learning model; according to the decision results, the reinforcement learning decision dynamically updates the configuration of the services and safeguard measures through the dynamic configuration interface of the service adaptation; the reinforcement learning decision models each software service with decision-making capability as an independent agent and adopts multi-agent reinforcement learning with centralized training and decentralized execution, i.e., the global state is used during training and the reinforcement learning of each agent takes the action strategies of the other agents into account, while during execution each agent makes safeguard decisions only from its own state perception; an experience replay pool is set up, and the experience replay mechanism addresses the correlation among training samples and their non-stationary probability distribution; each state transition record holds a state-action pair together with the corresponding reward and next state, namely:
(s_1, s_2, …, s_n; a_1, a_2, …, a_n; R; s_1′, s_2′, …, s_n′)
where s_i is the current state of each software service, a_i is the safeguard action selected by each software service, R is the reward value, for example the degree to which the various user expectations are satisfied after the safeguard actions are executed, and s_i′ is the next state of each software service;
2 identical decision strategies are designed for the "behavior decision" module of each software service: the target policy μ_i′ and the evaluation policy μ_i, which make safeguard behavior decisions based on the software service's own state, wherein:
the target policy μ_i′ takes the next state s_i′ of its own software service as input and outputs the safeguard action a_i′ corresponding to s_i′:
a_i′ = μ_i′(s_i′)
the target policy μ_i′ is not actively trained; its parameters are periodically updated with those of the continuously learned evaluation policy μ_i, which improves the stability of the learning process;
the evaluation policy μ_i takes the current state s_i of its own software service as input and outputs the safeguard action a_i corresponding to s_i:
a_i = μ_i(s_i)
the evaluation policy μ_i is continuously trained according to the action value, i.e., Q value, feedback of the "value decision" module:
a single comprehensive critic module, i.e., the "value decision" module, is designed for all software services, outputting the Q values corresponding to the software services according to a comprehensive reward function; the "value decision" module contains 2 neural networks with identical structure: the value decision target network Net_target_critic and the value decision evaluation network Net_evaluation_critic, which output the Q values of the software services' safeguard behaviors based on the global state of the software service system, wherein:
Net_target_critic takes the next state of the software service system (s_1′, s_2′, …, s_n′) and the corresponding actions (a_1′, a_2′, …, a_n′) as input, and outputs the Q value corresponding to the next state of each software service:
Q_i′(s_i′, a_i′, θ_target)
where θ_target is the parameter of Net_target_critic; Net_target_critic is not actively trained, and its parameters are periodically updated with those of the continuously learned Net_evaluation_critic, which increases the stability of the learning process;
Net_evaluation_critic takes the current state of the software service system (s_1, s_2, …, s_n) and the corresponding actions (a_1, a_2, …, a_n) as input, and outputs the Q value corresponding to the current state of each software service:
Q_i(s_i, a_i, θ_eval)
where θ_eval is the parameter of Net_evaluation_critic; Net_evaluation_critic periodically selects N state transition records at random from the experience replay pool for training and learning; training and learning is a process of continually reducing the difference between the estimated Q value and the actual Q value; the loss function is defined as:
Loss(θ_eval) = (1/N) Σ_j [ R_i^j + γ · Q_i′(s_i′^j, a_i′^j, θ_target) − Q_i(s_i^j, a_i^j, θ_eval) ]²
where R_i is the real-time reward value and γ ∈ [0, 1] is the discount factor;
the evaluation policy μ_i of each software service updates its parameters by gradient descent:
∇_{θ_μi} J ≈ (1/N) Σ_j ∇_{a_i} Q_i(s^j, a_1^j, …, a_n^j, θ_eval) |_{a_i = μ_i(s_i^j)} · ∇_{θ_μi} μ_i(s_i^j)
where ∇ denotes the gradient with respect to the corresponding parameter; each parameter is updated by taking its gradient and moving in the direction in which the gradient decreases most rapidly.
2. The method of claim 1, characterized in that the method comprises:
(1) 2 request-processing micro-services, which receive user-information query requests, call the background Provider_user micro-service, and return results to the requesting user, wherein the Core_client micro-service is the core service needing key guarantees and the Non_Core_client micro-service is the non-core service;
(2) 1 Provider_user micro-service responsible for background business processing, which receives the user-information query requests of the 2 request-processing micro-services and returns query results;
the specific implementation steps are as follows:
(1) because the Spring Cloud micro-service framework provides a ready-made configuration center module, the configuration center is built with spring-cloud-config-server; each micro-service is given a configuration client, with its configuration file path set to the corresponding location in the configuration center;
(2) the spring-boot-starter-actuator dependency is added to each micro-service; the /refresh endpoint is activated for refreshing the configuration, and the /metrics endpoint, /health endpoint, and /trace endpoint are activated for software service situation monitoring;
(3) an Agent is developed for each micro-service in Python; the Agent obtains the real-time situation of the software service through GET requests to the /metrics endpoint, /health endpoint, and /trace endpoint; sends a heartbeat monitoring request to the corresponding micro-service at a random moment within every 3s, monitoring and recording the response time and response content; and dynamically updates the software service configuration by modifying the corresponding configuration file in the configuration center and sending a POST request to the /refresh endpoint; the frequency of heartbeat monitoring requests is determined by the decision efficiency requirements;
(4) design of a reinforcement learning decision component:
environment state: the request response time of the request-processing micro-services, the request return content, and the real-time situation of the software services are taken as the environment state for reinforcement learning, and the degree to which user service demands are met is divided into 3 cases: 1) normal service, i.e., a correct request result is returned within the specified time, with the response time threshold set to 3s; 2) degraded service, i.e., a micro-service is degraded so that its concurrent requests are fused, preserving service continuation while meeting part of the service demand, implemented by fusing all requests and returning a default value to the user; 3) service failure, i.e., the request response time exceeds 3s or the request returns an error;
reward function: the response situation of the heartbeat monitoring requests randomly sent by each micro-service Agent is taken as the basis of the reinforcement learning reward; the reward function is as follows:
R = (ΣR_CC + ΣR_NC) / (Core_requests + Non_Core_requests)
where the per-request rewards are, for example:
R_CC = 5 for normal service, 1 for degraded service, -4 for service failure
R_NC = 5 for normal service, 1 for degraded service, -4 for service failure
Core_requests and Non_Core_requests are the total numbers of micro-service state heartbeat monitoring requests randomly sent in the corresponding period, and ΣR_CC and ΣR_NC are the sums of the heartbeat monitoring request rewards for the Core_client micro-service and the Non_Core_client micro-service, respectively; the reward values in the formula can be adjusted to different guarantee decision requirements, with higher values generally set for services needing key guarantees and lower values for faults that must above all be avoided;
reinforcement learning method design: the states s and s′ are designed as the request response times of the 2 request-processing micro-services, the request return content, and the real-time situation of the software services, including request volume, I/O load, and CPU and memory usage; the safeguard behavior is designed as whether each of the 2 request-processing micro-services performs degradation to fuse its concurrent requests; the reward value r is designed as the average reward of all heartbeat monitoring requests within 15s after the safeguard action is executed; the experience replay pool capacity is 200, and every 5 steps 32 state transition records are randomly selected from the pool as training samples; a 2-layer target network Net_target and a 2-layer evaluation network Net_evaluation are built with TensorFlow, and the parameters of the evaluation network Net_evaluation are copied to the target network Net_target once every 200 learning steps; the neural networks are optimized with the RMSprop optimizer; the discount factor γ is set to 0.9.
CN202210311620.1A 2022-03-28 2022-03-28 Distributed software service guarantee method based on group intelligence Pending CN114900420A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210311620.1A CN114900420A (en) 2022-03-28 2022-03-28 Distributed software service guarantee method based on group intelligence


Publications (1)

Publication Number Publication Date
CN114900420A true CN114900420A (en) 2022-08-12

Family

ID=82714914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210311620.1A Pending CN114900420A (en) 2022-03-28 2022-03-28 Distributed software service guarantee method based on group intelligence

Country Status (1)

Country Link
CN (1) CN114900420A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102857492A (en) * 2011-06-27 2013-01-02 通用电气公司 Method and system of location-aware certificate based authentication
CN107589992A (en) * 2017-08-02 2018-01-16 北京大学(天津滨海)新代信息技术研究院 A kind of containerization application load dispatching method based on swarm intelligence
CN112465151A (en) * 2020-12-17 2021-03-09 电子科技大学长三角研究院(衢州) Multi-agent federal cooperation method based on deep reinforcement learning


Similar Documents

Publication Publication Date Title
Huebscher et al. A survey of autonomic computing—degrees, models, and applications
Menasce et al. Sassy: A framework for self-architecting service-oriented systems
Surface et al. Toward adaptive and reflective middleware for network-centric combat systems
Filho et al. Defining emergent software using continuous self-assembly, perception, and learning
Hu et al. Intelligent decision making framework for 6G network
Guessoum et al. Towards reliable multi-agent systems: An adaptive replication mechanism
CN114900420A (en) Distributed software service guarantee method based on group intelligence
Corkill et al. Organizationally adept agents
Tapia et al. Distributing functionalities in a SOA-based multi-agent architecture
Van Wambeke et al. Atp: A microprotocol approach to autonomic communication
Sarne et al. Estimating information value in collaborative multi-agent planning systems
Pfannemüller et al. REACT-ION: A model-based runtime environment for situation-aware adaptations
Malakar et al. Inst: an integrated steering framework for critical weather applications
Křikava et al. Integrating adaptation mechanisms using control theory centric architecture models: A case study
Elhabbash et al. Towards self-aware service composition
Yang et al. Modeling uncertainty and evolving self-adaptive software: a fuzzy theory based requirements engineering approach
Rubinstein et al. The role of metareasoning in achieving effective multiagent coordination
Laws et al. From wetware to software: A cybernetic perspective of self-adaptive software
Badr An investigation into autonomic middleware control services to support distributed self-adaptive software
Rodriguez et al. A model-based multi-level architectural reconfiguration applied to adaptability management in context-aware cooperative communication support systems
Yang et al. A model-based fuzzy control approach to achieving adaptation with contextual uncertainties
Geihs Self-adaptivity from different application perspectives: requirements, realizations, reflections
Kakas et al. ABA: Argumentation based agents
CN113326134B (en) Virtual resource scheduling system and method based on deep learning
Ni et al. Fault Tolerant Control for a Class of Evolutionary Matrix Games

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination