CN108476084B - Method and device for adjusting state space boundary in Q learning - Google Patents
- Publication number
- CN108476084B CN108476084B CN201680056875.0A CN201680056875A CN108476084B CN 108476084 B CN108476084 B CN 108476084B CN 201680056875 A CN201680056875 A CN 201680056875A CN 108476084 B CN108476084 B CN 108476084B
- Authority
- CN
- China
- Prior art keywords
- value
- state
- action
- segment
- boundary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L1/00—Arrangements for detecting or preventing errors in the information received
Abstract
A method for adjusting state space boundaries in Q learning can improve the performance of the Q learning algorithm. The method comprises the following steps: according to a first state of the system in a first time period, determining the segment where the first state is located, and determining a first action with the maximum Q value among a plurality of actions corresponding to the segment, wherein the Q value of each action is used to represent an expected profit value that the system can obtain after the action is executed (210); executing the first action and, in a second time period after the first action is executed, calculating the actual profit value obtained by the system (220); and judging whether a second action whose Q value is greater than the actual profit value exists among the plurality of actions, and if such a second action exists, adjusting the boundary of the segment (230).
Description
Technical Field
The embodiment of the application relates to the field of information technology, in particular to a method and a device for adjusting state space boundaries in Q learning.
Background
Reinforcement learning (also known as evaluation learning) is an important machine learning method, with many applications in fields such as intelligent robot control, analysis, and prediction. Reinforcement learning is the learning, by an intelligent system, of a mapping from environment to behavior so as to maximize the value of a reward function. The reward value provided by the environment evaluates how good or bad an action is, rather than telling the reinforcement learning system how to produce the correct action. Since the information provided by the external environment is very limited, reinforcement learning must rely on its own experience. In this way, reinforcement learning gains knowledge through action and evaluation in the environment, improving its action scheme to suit that environment. The Q-learning method is one of the classic algorithms in reinforcement learning and is a model-free learning algorithm.
A data center cluster can adaptively schedule the resources used by an application based on the Q learning algorithm, which can improve the resource utilization rate of the data center. In existing Q-learning based algorithms, the data center generally schedules the resources used by an application according to the load change of the application (that is, the state of the application). The state of an application is mostly characterized by a single parameter: the average resource utilization of all machines used by the application in the machine cluster. Moreover, this average resource utilization is a continuous, rather than discrete, value. In the prior art, in order to accurately describe the candidate actions that can be taken in each state of the application, the originally continuous state space is divided into discrete segments.
However, the discrete division of the continuous state space may cause information loss and result in an inaccurate description of the state, which makes the result of resource scheduling less than ideal. In addition, fine-grained state space partitioning makes the state space too large, resulting in too slow a convergence rate of the Q table.
Disclosure of Invention
The application provides a method and a device for adjusting state space boundaries in Q learning, which can accelerate the convergence rate of a Q learning algorithm and improve the performance of the Q learning algorithm.
In a first aspect, the present application provides a method for adjusting a state space boundary in Q learning, which is applied to a service operation system, and the method includes: determining a segment where the first state is located according to the first state of the system in the first time period, and determining a first action with the maximum Q value in a plurality of actions corresponding to the segment, wherein the segment is one segment in a continuous value range of the state value of the system state, and the Q value of each action is used for representing an expected profit value which can be obtained by the system after each action is executed; executing the first action, and calculating an actual profit value obtained by the system after the first action is executed in a second time interval after the first action is executed; and judging whether a second action with the Q value larger than the actual profit value exists in the actions, and if the second action with the Q value larger than the actual profit value exists in the actions, adjusting the boundary of the segment.
It is to be understood that the second period of time is subsequent to the first period of time. More specifically, the first period is the period of time before the first action is performed (or, taken). The second period is a period after the first action is performed.
All states of the system are arranged in order of magnitude of their state values (from large to small or from small to large), and a continuous section taken out of this range is a segment.
In the embodiment of the application, the boundary of the section where the state of the system is located is adjusted, so that the number of the states of the system is reduced, the convergence rate of the Q learning algorithm is increased, and the performance of the algorithm can be improved.
In one possible implementation, if there is a second action in the plurality of actions that has a Q value greater than the actual profit value, adjusting the boundary of the segment includes: the boundary of the segment is adjusted to the state value of the first state.
In one possible implementation, the attributes of each state are characterized using at least one of the following parameters of the system: memory utilization, CPU utilization, network utilization, and the number of machines used.
In the embodiment of the invention, a plurality of parameters are adopted to represent the attribute of the state (also called as the state space), so that the representation of the state space in Q learning is multidimensional, the description of the state space can be more accurate and detailed, and the performance of the algorithm can be further optimized.
In one possible implementation, before performing the first action, the method further includes: determining whether the state value of the first state belongs to a preset region of the segment, wherein the difference value between each state value of the state in the preset region and the boundary value of the segment is less than or equal to a preset threshold value; when it is determined that the state value of the first state belongs to the preset region, the first action is performed with a probability of (1-epsilon).
Specifically, in the embodiment of the present invention, when the state value of the first state of the system in the first period is the boundary value between the segment in which the first state is located and the segment in which a second state is located, or lies near that boundary value, the first action is executed with a probability of (1-epsilon), and any action other than the first action among the plurality of actions corresponding to the segment in which the first state is located is executed with a probability of epsilon. Here, the second state is different from the first state, and the segment in which the second state is located is adjacent to the segment in which the first state is located.
It can be understood that, in the existing Q learning algorithm, an epsilon greedy strategy is adopted each time the optimal action of the application in one state is selected, so as to balance the exploration capacity (exploration) and the exploitation capacity (exploitation) of the algorithm and to enhance its exploration capacity, i.e., to try those actions that have not yet been performed and see whether better results are obtained. However, excessive exploration attempts may affect the performance of the algorithm.
In the embodiment of the application, an epsilon greedy strategy is adopted for the states near the boundary values of the two segments, so that the invalid attempt times can be reduced, and the algorithm performance is improved.
In one possible implementation, adjusting the boundary of the segment includes: and adjusting the boundary of the segment by adopting any one of the following algorithms: divide and conquer method, clustering method and classification method.
It should be noted that, when the boundary of the segment is adjusted, an algorithm in the prior art, such as a divide-and-conquer method, a clustering method, a classification method, and the like, may be used. The specific calculation process of each algorithm can refer to the prior art, and the embodiment of the invention is not described in detail.
Optionally, in this embodiment of the present application, when the attribute of the state space is characterized by one parameter (that is, the state space is one-dimensional), the number of states of the application may be reduced to be the same as the number of actions by using the method for adjusting the state space boundary provided in this embodiment of the present application.
In a second aspect, the present application provides an apparatus for adjusting state space boundaries in Q learning, for performing the method of the first aspect or any possible implementation manner of the first aspect. In particular, the apparatus comprises means for performing the method of the first aspect or any possible implementation manner of the first aspect.
In a third aspect, the present application provides an apparatus for adjusting state space boundaries in Q learning. Specifically, the apparatus includes: a memory and a processor. Wherein the memory is configured to store instructions and the processor is configured to execute the instructions stored by the memory, and when the instructions are executed, the processor performs the method of the first aspect or any possible implementation manner of the first aspect.
In a fourth aspect, the present application provides a computer-readable medium for storing a computer program comprising instructions for performing the method of the first aspect or any possible implementation manner of the first aspect.
In the embodiment of the application, the number of the system states is reduced by adjusting the boundary of the segment where the system states are located (namely, the boundary between the states), so that the convergence rate of the Q learning algorithm is increased, and the performance of the algorithm can be improved.
Drawings
Fig. 1 is a flow chart of a method 100 for resource scheduling using a Q-learning algorithm in the prior art.
Fig. 2 is a flowchart of a method 200 for adjusting a state space boundary according to an embodiment of the present application.
Fig. 3 is an example of adjusting a segment boundary according to an embodiment of the present application.
Fig. 4 is another example of adjusting segment boundaries according to an embodiment of the present application.
Fig. 5 is a schematic diagram of an apparatus 500 for adjusting a boundary of a state space according to an embodiment of the present application.
Fig. 6 is a schematic diagram of an apparatus 600 for adjusting a boundary of a state space according to an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present invention will be described below with reference to the accompanying drawings.
It should be understood that the technical solution of the embodiment of the present application may be applied to various fields, for example, the field of resource adaptive scheduling of a data center. The data center may include a computer cluster, and the data center may adjust the number of machines (e.g., virtual machines, containers, etc.) allocated to the application in real time according to information such as load change of the application. For example, the number of machines is increased or decreased, or the number of machines is kept unchanged, etc., so as to improve the overall resource utilization rate of the data center on the premise of effectively meeting the application requirements.
First, a brief description will be given of the basic concept involved in the embodiments of the present application.
The state of the application: describing the current running condition of the application, which may be denoted as S (M, U), where M denotes the number of machines used by the current application, and U denotes the average resource occupancy rate of all machines in the cluster of machines used by the current application. The Machine herein may include a Physical Machine (PM), a Virtual Machine (VM), a container (Docker), and/or the like.
The actions: the types of actions (e.g., their number and magnitude) that the Q-learning algorithm may take in a data center cluster can be set according to the load condition of the application. For example, when scheduling resources in a data center cluster based on Q learning, actions may be used to adjust the number of resources or machines allocated to an application, e.g., to reduce, keep unchanged, or increase the number of machines. The specific adjustment amount of each action to the resources allocated to the application may be set according to actual needs and is not limited in the embodiment of the present invention.
The reward function: after the Q-learning algorithm performs action A in application state S, the system gives a reward value for the state-action combination (S, A) that can be used to evaluate how well action A performed in application state S. For example, if the reward function value is positive, it indicates that the Service Level Objective (SLO) of the application can be satisfied in time after performing action A. If the value of the reward function is negative, it indicates that the SLO of the application cannot be satisfied after taking action A. By way of example, the reward value function may be represented by the following equation:
where U may represent the average resource occupancy of all machines currently in use by the application, p is a configuration parameter set to 2 by default, respTime represents the 99th-percentile response time of the data center, and SLO may represent the service level target on the 99th-percentile response time, to ensure that 99% of the application's requests are responded to in time.
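By way of illustration only, a reward of this shape could be sketched in Python as follows; the positive branch scaled by U to the power p and the negative branch proportional to the SLO violation are assumptions, not the patent's exact equation:

```python
def reward(U, resp_time, slo, p=2):
    """Hypothetical reward sketch: positive when the 99th-percentile response time
    meets the SLO (scaled by utilization U**p), negative when the SLO is violated."""
    if resp_time <= slo:
        return U ** p                     # SLO met: higher utilization -> higher reward
    return -(resp_time - slo) / slo       # SLO violated: negative reward
```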
Q value: a function learned through state-action pairs, used to measure the cumulative return of an action for a state. The specific calculation can be represented by the standard Q-learning update formula:
Q(s_t, a_t) ← (1 - c) · Q(s_t, a_t) + c · (r + γ · max_a Q(s_{t+1}, a))
where c and γ are adjustable parameters, r denotes the reward function, Q(s_t, a_t) denotes the Q value of action a_t for state s_t at time t, and max_a Q(s_{t+1}, a) denotes the Q value of the action a having the maximum Q value in state s_{t+1} at time t+1.
Q table: for recording the Q values of various possible state-action combinations made up of all possible states and all optional actions of the application. The algorithm decides which action to take at each state and selects it according to the following principle: the action with the largest Q value in all actions of the state is selected.
Table 1 below is an example of a Q table in Q learning. Column 1 of the Q table represents the state of the application. Columns 2 through M+1 of the Q table represent M selectable actions, respectively. Q_ij represents the Q value corresponding to the state-action combination composed of the application state in the i-th row and the action in the j-th column.
TABLE 1
Fig. 1 is a flow chart of a method 100 for resource scheduling using a Q-learning algorithm in the prior art. As shown in fig. 1, the method 100 mainly includes the following steps 101-106.
101. The state S at which the application is at time t is determined.
102. From the Q table, the action A taken by the application at time t in state S is determined.
103. The application performs action a.
It should be appreciated that the application performing action A means scheduling the resources of the application (e.g., increasing the number of resources, keeping the number of resources unchanged, or decreasing the number of resources, etc.).
104. Obtain the average resource utilization rate of the application at time t + T.
After the application has performed action a, the system recalculates the resource utilization of the application.
105. A value of a reward function for the state-action combination (S, a) is calculated.
Specifically, the system calculates the reward function value of action A according to factors such as the resource utilization rate of the application, the response time, and the SLO of the application, so as to evaluate how good it was to take action A when the application was in state S.
106. Updating the Q value corresponding to the state-action combination (S, A) in the Q table using the reward function value of the state-action combination (S, A) and the Q value of the state-action combination (S, A) before the action A is taken.
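A minimal sketch of steps 104-106 with the Q table held as a nested dictionary; the segment names, Q values, learning rate c, and discount factor gamma below are illustrative placeholders, not values from the patent:

```python
def update_q(q_table, state, action, reward_value, next_state, c=0.5, gamma=0.8):
    """Move Q(state, action) toward reward + discounted best Q value of the next state."""
    best_next = max(q_table[next_state].values())
    q_table[state][action] = (1 - c) * q_table[state][action] \
                             + c * (reward_value + gamma * best_next)

# Example: two segments and three actions (-1: remove a machine, 0: keep, +1: add one).
q_table = {"10%-30%": {-1: 0.2, 0: 0.1, +1: 0.0},
           "30%-70%": {-1: 0.0, 0: 0.3, +1: 0.1}}
update_q(q_table, "30%-70%", 0, reward_value=0.5, next_state="30%-70%")
```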
As described in the above flow, in resource scheduling based on the Q learning algorithm, resources used by an application are adjusted in real time according to the average resource utilization rate of all machines used by the application in a machine cluster. While the parameter average resource utilization is a continuous, rather than discrete, value. In the prior art, the state space of an application is generally subjected to discrete division depending on manual experience, and a series of discrete states (such as a Q table shown in table 1) of the application are obtained. In order to improve the performance of the algorithm, a conventional scheme proposes to combine states having similar Q values in the Q table to reduce the number of state spaces.
It is understood that, in the Q learning algorithm, on the one hand, the Q value does not completely reflect the correspondence between state and action. The relative values of the Q values corresponding to different actions in the same state are meaningful, while the absolute values of the Q values corresponding to actions in different states are not. Therefore, combining Q values may cause inaccuracy, and the performance of the algorithm cannot be guaranteed after such merging. On the other hand, in the prior art, the originally continuous state space is usually discretized by empirical values, and the granularity of the partitioning greatly influences the performance of the algorithm: if the granularity is too coarse, the accuracy of the algorithm is difficult to guarantee; if it is too fine, the convergence speed of the algorithm is too low and efficiency decreases.
Therefore, the embodiment of the application provides a method for adjusting the state space boundary in Q learning, which can improve the convergence rate of a Q learning algorithm and can improve the performance of the algorithm.
The following describes in detail a method for adjusting a state space boundary in Q learning according to an embodiment of the present application with reference to fig. 2 to 4.
Without loss of generality, a processor is taken as an example to be an execution subject of the method for adjusting the state space boundary in Q learning provided by the embodiment of the application.
Fig. 2 is a schematic flow chart of a method 200 for adjusting a state space boundary according to an embodiment of the present application, where the method 200 is applied to a service operating system. As shown in FIG. 2, the method 200 generally includes steps 210 to 230.
210. A processor (for example, a processor of the service operation system) determines a segment where the first state is located according to the first state of the system in the first time period, and determines a first action with the maximum Q value in a plurality of actions corresponding to the segment, wherein the Q value of each action is used for representing an expected profit value which can be obtained by the system after each action is executed.
In the embodiment of the present application, the segmentation refers to a value range of a segment of state values obtained by dividing the state values of the system state according to a certain division granularity. That is, all states of the system are arranged in order of magnitude of the state values (from large to small or from small to large), and a continuous segment is taken out from the state values, namely, a segment.
For example, taking the average resource utilization rate as the parameter representing the state of the system, the average resource utilization rate of the system is divided into 10 grades with a granularity of 10%: 0-10%, 10%-20%, 20%-30%, ..., 80%-90%, and 90%-100%. Each of these grades is a segment.
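As a small sketch, a one-dimensional state value such as the average resource utilization can be mapped to its segment by a lookup over the current boundary list (the boundaries below are the initial 10% grades, before any adjustment):

```python
def segment_index(utilization, boundaries=None):
    """Return the index of the segment that contains a utilization value in [0, 1]."""
    if boundaries is None:
        boundaries = [i / 10 for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9
    for i, upper in enumerate(boundaries):
        if utilization < upper:
            return i
    return len(boundaries)                            # last segment, 90%-100%

print(segment_index(0.36))   # -> 3, i.e. the 30%-40% segment
```

Adjusting a segment boundary then amounts to changing the corresponding entry of the boundary list.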
220. The processor performs the first action and calculates an actual revenue value obtained by the system after performing the first action for a second period of time after performing the first action.
Wherein the second period of time is after the first period of time.
Taking Q learning as an example, step 220 specifically includes the following two processes: (1) during the first time period, the processor performs the first action and calculates the reward value obtained by the system; (2) the Q value of the first action is updated according to the system reward value.
It should be noted that, here, the process (2), i.e. the process of updating the Q value of the first action according to the system reward value, may refer to the prior art, and will not be described in detail here.
230. And the system processor judges whether a second action with the Q value larger than the actual profit value exists in the actions, and if the second action with the Q value larger than the actual profit value exists in the actions, the boundary of the segment is adjusted.
In the embodiment of the present application, the processor first determines the state of the system during the first period (hereinafter referred to as state S_1), determines the segment corresponding to state S_1 (hereinafter referred to as segment #1), and determines the action having the largest Q value among the plurality of actions corresponding to segment #1 (hereinafter referred to as action A_1). Thereafter, the processor performs action A_1 and, in the second period after performing action A_1, calculates the actual revenue value obtained by the system. Finally, the processor determines whether there is an action whose Q value is larger than the actual revenue value among the actions corresponding to segment #1 (hereinafter referred to as action A_2); if such an action A_2 exists, the boundary of segment #1 is adjusted.
As mentioned above, a segment is a continuous section of the system's state values. Therefore, adjusting the boundary of a segment may also be referred to as adjusting the boundary between states.
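Putting steps 210-230 together, one iteration of the method could be sketched as follows, reusing the segment_index helper above; the helper names and the revenue callback are assumptions, and the boundary adjustment shown is the first strategy described in the next subsection (moving the boundary to the observed state value):

```python
def q_learning_step(q_table, boundaries, state_value, measure_actual_revenue):
    """One iteration of method 200: pick the max-Q action for the current segment,
    execute it, then adjust the segment boundary if a better action emerges."""
    seg = segment_index(state_value, boundaries)            # step 210: locate the segment
    actions = q_table[seg]
    first_action = max(actions, key=actions.get)            # step 210: action with max Q

    actual_revenue = measure_actual_revenue(first_action)   # step 220: execute and measure

    # Step 230: if some other action now has a larger Q value than the actual
    # revenue of the executed action, adjust the boundary of this segment.
    if any(q > actual_revenue for a, q in actions.items() if a != first_action):
        if seg < len(boundaries):
            boundaries[seg] = state_value                   # move boundary to the state value
    return first_action, actual_revenue
```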
Specifically, in the embodiments of the present application, there are various ways to adjust the boundary of the segment, which are described in detail below.
1. And adjusting the boundary of the segment according to the actual income value obtained by the system after the action is executed.
First, assume that the state of the system in the first period is state S_1, that the segment in which state S_1 is located during the first period is segment #A, and that, among the plurality of actions corresponding to segment #A, action A_1 has the largest Q value.
Specifically, during the first period, when the system is in state S_1, the processor executes the optimal action A_1 among the plurality of actions corresponding to segment #A. If, in the second period after the processor performs action A_1, the optimal action among the plurality of actions corresponding to segment #A changes (e.g., the optimal action changes from A_1 to action A_2), then the boundary of segment #A needs to be adjusted.
It should be understood that adjusting the boundary of segment #A refers to adjusting the boundary value between segment #A and its neighboring segments.
Fig. 3 is an example of adjusting a segment boundary according to an embodiment of the present application. As shown in fig. 3, assume that the original boundary between segment # a and segment # B is that the resource utilization of the system is 0.7.
Before the method is executed, the state of the system is a resource utilization rate of 0.62. Among the plurality of actions corresponding to the segment that contains the resource utilization rate 0.62 in the Q table, the action having the largest Q value (i.e., the optimal action corresponding to the segment) is action 0. After the processor performs action 0, the processor calculates the reward value of (0.62, action 0) obtained by the system.
The Q value of action 0 is then updated according to the reward value of (0.62, action 0).
After the Q value is updated, if the action having the largest Q value among the plurality of actions corresponding to segment #A is no longer action 0 but has changed to another action (assume it has changed to action +1), the boundary of segment #A is adjusted. Here, adjusting the boundary of segment #A means adjusting the boundary value between segment #A and segment #B.
Specifically, according to the embodiment of the present application, the boundary value between segment #A and segment #B should be adjusted from the original value of 0.7 to 0.62.
2. The division and treatment method.
The basic idea of the divide and conquer algorithm is to decompose a problem of size N into K sub-problems of smaller size, which are independent of each other and have the same properties as the original problem. After the solutions of the sub-problems are solved, the solutions of the sub-problems are combined layer by layer, and then the solution of the original problem can be obtained.
The divide and conquer method applied in the embodiment of the present application can be used to adjust the boundary of the segment.
Continuing with the example shown in fig. 3, after the processor executes action 0, if the optimal action corresponding to segment #A is found to have changed from action 0 to action +1, the boundary of segment #A should be adjusted to:
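Purely as an assumed illustration (the splitting rule here is not taken from the patent), a divide-and-conquer adjustment could, for example, bisect between the old boundary and the state value at which the optimal action changed:

```python
def divide_and_conquer_boundary(old_boundary, state_value):
    """Assumed splitting rule: place the new boundary halfway between the state value
    at which the optimal action changed and the old segment boundary."""
    return (old_boundary + state_value) / 2.0

print(divide_and_conquer_boundary(0.70, 0.62))   # -> 0.66 under this assumed rule
```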
3. and (4) clustering.
The clustering method is a statistical analysis method for studying classification problems. Cluster analysis is based on similarity: there is more similarity between elements in the same category than between elements in different categories.
In the embodiment of the present application, the process of applying the clustering method to adjust the segment boundary mainly includes the following steps 301-304.
301. Cluster the state data of the system over a past period of time T.
Assume that the number of actions that the processor can take is preset in the algorithm to be 5, action-2, action-1, action 0, action +1, and action +2, respectively.
According to the preset action quantity and category, the state data of the system can be divided into the above 5 categories through clustering operation.
It should be noted that, in the embodiment of the present application, a specific algorithm used for the clustering operation is not limited. For example, a classical clustering algorithm, K-MEANS, a modified K-MEDOIDS algorithm, a Clara algorithm, etc., may be used.
302. Determine the optimal action for the state data (denoted action A_1) according to the maximum Q value corresponding to the state data of the system at the current moment, and add the state data to the category (referred to as category #P) to which the optimal action (i.e., action A_1) belongs.
303. The cluster center position of category #P is updated.
304. The boundary values of category #P are recalculated.
Specifically, assume that the re-determined cluster center is u_i with radius r_i, and that the adjacent cluster center is u_j with radius r_j; then the boundary value of the new category #P should be:
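A sketch of steps 301-304 for a one-dimensional state, assuming the center is the mean of a category, the radius is the largest distance from that center, and the boundary between adjacent categories is placed between the centers in proportion to the radii; this boundary rule and the sample values are assumptions, not necessarily the patent's exact expression:

```python
import numpy as np

def cluster_stats(values):
    """Center (mean) and radius (largest distance from the center) of one category."""
    values = np.asarray(values, dtype=float)
    center = float(values.mean())
    radius = float(np.abs(values - center).max()) or 1e-6
    return center, radius

def boundary_between(u_i, r_i, u_j, r_j):
    """Assumed rule: split the gap between adjacent centers in proportion to the radii."""
    return u_i + (u_j - u_i) * r_i / (r_i + r_j)

# Steps 302-303: add the new state value 0.36 to the category of its optimal action
# and recompute that category's center and radius.
category_p = [0.12, 0.21, 0.28, 0.36]
u_i, r_i = cluster_stats(category_p)

# Step 304: recompute the boundary toward the neighbouring category.
neighbour = [0.45, 0.55, 0.66]
u_j, r_j = cluster_stats(neighbour)
print(boundary_between(u_i, r_i, u_j, r_j))
```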
4. and (4) classification.
The state S of the system is used as input, the optimal action A to be taken by the processor when the system is in state S is used as output, and a classification method such as a Support Vector Machine (SVM) or a decision tree is adopted to determine the boundary value between two adjacent segments.
Specifically, the boundary value between two adjacent segments is determined by applying a support vector machine to the state data of the application over a past period of time T. When new data is added, the support vector machine is run again to determine new boundary values.
Alternatively, as an embodiment, a logistic regression method in the classification method may be used to determine the boundary value of two adjacent segments.
Specifically, when the logistic regression method is applied to the embodiment of the present application, the main idea is to determine the state space boundary value between two adjacent segments by applying logistic regression to the state data of the system (i.e., historical data of the system state) over the past period T. When new data is added, the logistic regression method is run again to determine new boundary values between the segments.
It should be noted that the specific implementation processes of the divide and conquer method, the clustering method and the classification method (e.g., the logistic regression method) described above can refer to the prior art, and will not be described in detail herein.
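As a sketch of the logistic-regression variant for a one-dimensional state, the segment boundary is the point where the predicted class probability crosses 0.5; scikit-learn and the sample data below are used only for illustration and are not prescribed by the patent:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical one-dimensional state data over the past period T, labelled with the
# optimal-action class observed for each state (labels here are illustrative).
states = np.array([[0.12], [0.21], [0.28], [0.36], [0.45], [0.55], [0.66]])
labels = np.array([0, 0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(states, labels)

# For a 1-D logistic model, p = 0.5 where w*x + b = 0, so the boundary is -b/w.
boundary = -clf.intercept_[0] / clf.coef_[0][0]
print(boundary)   # boundary value between the two adjacent segments
```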
In the above embodiments, the state attribute of the system is only an example of the average resource utilization of the system.
Optionally, as an embodiment, the attribute of each state is characterized using at least one of the following parameters of the system: memory utilization, CPU utilization, network utilization, and the number of machines used by the system.
Preferably, when the attribute of the state is characterized by using one parameter, the number of states of the system can be reduced to be equal to the number of executable actions by the method for adjusting the state space boundary provided by the embodiment of the application.
It will be appreciated that when a plurality of parameters are used to characterize the properties of a system state, the boundary between two adjacent segments determined using any of the algorithms described above will be a multi-dimensional boundary. The attribute of the state space is represented by a plurality of parameters, so that the representation of the state in Q learning is multidimensional, the description of the state can be more accurate and detailed, and the performance of the algorithm can be further optimized.
Fig. 4 shows another example of adjusting segment boundaries provided by the embodiment of the present application. As shown in fig. 4, the boundary values of the neighboring segments are determined by using a logistic regression method among classification methods.
The "stars" and "dots" shown in FIG. 4 are the state data of the system over a period of time T in the past. These state data are characterized by two attributes, e.g., average resource utilization and number of machines. In fig. 4, the horizontal axis (x-axis) represents the average resource utilization, and the vertical axis (y-axis) represents the number of machines. Here, the segment corresponding to the data represented by the average resource utilization rate is referred to as segment # a, and the segment corresponding to the data represented by the number of machines is referred to as segment # B.
Specifically, the processor first normalizes the average resource utilization rate so that its value range is the same as that of the number of machines. Then, a logistic regression method is used to determine the boundary value between the two adjacent segments.
As shown in fig. 4, the line y = x is the dividing line between the two classes of data, that is, the line where the number of machines equals 100 × the average resource utilization, which is the actual segment boundary. When the number of machines is less than (100 × average resource utilization), the state belongs to segment #A; otherwise, it belongs to segment #B. When new state data is added, the logistic regression method is run again to re-determine the boundary of the segments.
Optionally, as an embodiment, before performing the first action, the method further includes:
determining whether the state value of the first state belongs to a preset region of the segment, wherein the difference value between the state value of each state in the preset region and the boundary value of the segment is less than or equal to a preset threshold value;
when it is determined that the state value of the first state belongs to the preset region, the first action is performed with a probability of (1-epsilon).
It will be appreciated that the preset region is actually the portion of the segment near the segment's boundary values. That is, the states included in the preset region are located near the boundary values of the segment (including the boundary values themselves).
The preset region can be set and adjusted according to the resource scheduling situation. For example, when the state space is large, the preset region may be set larger to increase the convergence speed of the algorithm. When the state space has been adjusted to be smaller, the preset region may be set smaller to refine the division of the state space boundary and make the division of states more accurate.
In the embodiment of the present application, the specific value of the preset threshold is not limited. In fact, the preset threshold changes with the preset region: when the preset region is large, the preset threshold is correspondingly large, and when the preset region is small, the preset threshold is correspondingly small. The preset threshold is the absolute value of the difference between the first and last state values of the preset region.
For example, referring to fig. 3, it is assumed that the preset area is set to have an average resource utilization rate of 1%. If the average resource utilization rate of the system in a certain period is 69.8%, the average resource rate is 69.8% and falls into the preset area of the segment (30% -70%). According to an embodiment of the application, at this time, the processor may choose to perform the optimal action corresponding to the segmentation (30% -70%) with a probability of (1- ε), i.e., perform action 0. Any other of a number of actions (not shown in fig. 3) corresponding to the segmentation (30% -70%) may also be performed with a probability of epsilon. And if the average resource utilization of the system is 65% and does not belong to the preset region of the segment (30% -70%), the processor determines to perform action 0.
As can be seen from fig. 3, there are two boundary values for each segment of the system state. Whether the state of the system at a certain moment falls into the upper boundary or the lower boundary of a segment, the optimal action corresponding to the segment is executed with a probability of (1-epsilon).
Therefore, an average resource utilization of 69.8% can be considered to fall into the preset region at the upper boundary of the segment (30%-70%). Further, assuming that the average resource utilization of the system is 30.5%, it falls within the preset region at the lower boundary of the segment (30%-70%), and the processor should likewise execute action 0 with a probability of (1-epsilon).
In the prior art, the reinforcement learning algorithm employs an epsilon greedy strategy each time the optimal action in one state (denoted as state #A) is selected. Under the epsilon greedy policy, when selecting and executing actions, the system processor selects the action having the largest Q value in the segment in which state #A is located with a probability of (1-epsilon), and selects any action other than that optimal action among the plurality of actions corresponding to the segment with a probability of epsilon.
It will be appreciated that the original goal of the epsilon greedy strategy is to balance the exploration and exploitation capabilities of the algorithm, enhancing its exploration capability by trying those actions that have not been performed to see whether better results are obtained. However, excessive exploration attempts may affect the performance of the algorithm.
In the present embodiment, the states near the boundary between two segments are considered the ones for which an epsilon greedy strategy is most worthwhile, because a state near the segment boundary lies exactly between the two adjacent actions that may be taken, so fluctuations in action selection are more likely to occur there. Therefore, in the embodiment of the present application, when the state value of the first state of the system in the first period is the boundary value between the first segment and the second segment, or lies near that boundary value, the processor selects the first action with the largest Q value among the plurality of actions corresponding to the first segment with a probability of (1-epsilon), and executes any one action other than the first action among the plurality of actions corresponding to the first segment with a probability of epsilon.
By adopting an epsilon greedy strategy for states near segment boundaries, the number of invalid attempts can be reduced, thereby improving algorithm performance.
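A minimal sketch of this boundary-aware selection rule; the threshold, epsilon value, and the way a non-optimal action is chosen are illustrative assumptions:

```python
import random

def select_action(actions_q, state_value, boundary_values, threshold=0.01, epsilon=0.1):
    """Pick the max-Q action; apply the epsilon greedy rule only when the state value
    lies within `threshold` of one of the segment's boundary values."""
    best = max(actions_q, key=actions_q.get)
    near_boundary = any(abs(state_value - b) <= threshold for b in boundary_values)
    if near_boundary and random.random() < epsilon:
        others = [a for a in actions_q if a != best]
        return random.choice(others)        # explore with probability epsilon
    return best                             # exploit with probability (1 - epsilon)

# Example from Fig. 3: state 0.698 lies near the 0.7 boundary of the 30%-70% segment.
print(select_action({-1: 0.2, 0: 0.5, +1: 0.1}, 0.698, boundary_values=(0.30, 0.70)))
```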
The method for adjusting the state boundary in Q learning according to the embodiment of the present application will be described below with reference to table 2 and table 3.
For ease of understanding and explanation, the following assumptions are made first: (1) there are 5 classes of actions that Q learning can take, respectively: reduce 2 machines, reduce 1 machine, keep the machine count unchanged, increase 1 machine and increase 2 machines. (2) And dividing the average resource utilization rate of the system into 10 grades by taking 10% as the division granularity. (3) The total number of the system machines is 100. The number of machines currently used by the system is 1.
The process of adjusting the boundaries of adjacent segments will be described here only by taking the clustering method described above as an example.
Table 2 shows the results obtained by clustering the Q table by the clustering method before the segment boundary adjustment. According to the method of the embodiment of the application, the specific flow is as follows:
401. Assume that the current state of the system is an average resource utilization rate of 0.36. The Q table is queried, and the segment containing the average resource utilization rate 0.36 is the segment whose average resource utilization rate is in the range (30%-70%) (denoted as segment #A); the action with the largest Q value among the plurality of actions corresponding to segment #A is action 0, that is, the number of machines used by the system is kept unchanged. The reward value of (0.36, action 0) is calculated according to the reward function and is positive. After the Q value of action 0 is updated according to the reward value, the optimal action corresponding to the segment in which the average resource utilization rate 0.36 is located changes to action -1, which indicates that, at the current average resource utilization rate (i.e., 0.36), the number of machines used by the system is wasteful and action -1 is better. Since the optimal action among the plurality of actions corresponding to segment #A has changed, the boundary of segment #A needs to be updated.
In step 401, when the system is in a certain state, after the system processor executes the optimal action corresponding to the segment where the state is located, the specific process of the processor calculating the reward value of the state-action combination and updating the Q value of the optimal action may refer to the prior art, and will not be described in detail herein.
TABLE 2
402. Add 0.36 to the category [0.1, 0.3], and recalculate the center u_i and radius r_i of the class.
403. The new boundary values are re-determined.
Assume that the new cluster center is u_i with radius r_i, and that the adjacent cluster center is u_j with radius r_j; then the new boundary value should be updated as:
the updated Q table for the boundary of segment # a is shown in table 3.
TABLE 3
It can be seen that the boundary between the segment (10% -30%) and the segment (30% -70%) is updated from 30% to 33%.
The method for adjusting the state boundary in the embodiment of the present application can also be applied to scenarios in which a Q learning algorithm is used for motion prediction, for example, dynamic channel allocation in mobile networks or robot motion prediction. In the robot motion prediction scenario, the motion of the robot can be defined as moving 1 step to the left, 2 steps to the left, staying in place, moving 1 step to the right, moving 2 steps to the right, etc., and the state space can be defined as the current distance of the robot from the destination (e.g., a longitude distance, a latitude distance, etc.). By reducing the number of states, the robot can be quickly guided to obtain more accurate action prediction.
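For the robot scenario, the same machinery applies once an action set and state segmentation are chosen; the action names, distances, and boundaries below are purely hypothetical:

```python
# Hypothetical setup for the robot motion-prediction example: actions move the robot
# left or right by 1 or 2 steps or keep it in place, and the state is the current
# distance to the destination, discretised into segments whose boundaries can be
# adjusted with the same method as above.
ACTIONS = {"left_2": -2, "left_1": -1, "stay": 0, "right_1": +1, "right_2": +2}

def distance_segment(distance_m, boundaries=(1.0, 5.0, 20.0, 100.0)):
    """Map the robot's distance to the destination (in metres) onto a segment index."""
    for i, upper in enumerate(boundaries):
        if distance_m < upper:
            return i
    return len(boundaries)

print(distance_segment(12.0))   # -> 2 under the hypothetical boundaries above
```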
In the embodiment of the application, the number of states of the system is reduced by adjusting the boundaries among a plurality of segments obtained by dividing the state of the system, so that the convergence rate of the Q learning algorithm is increased, and the performance of the algorithm is improved.
The method for adjusting the state boundary according to the embodiment of the present application is described in detail above with reference to fig. 1 to 4, and the apparatus and device for adjusting the state boundary according to the embodiment of the present application are described below with reference to fig. 5 and 6.
Fig. 5 is a schematic block diagram of an apparatus 500 for adjusting a state boundary according to an embodiment of the present application. The apparatus 500 is configured in a service operation system. As shown in fig. 5, the apparatus 500 includes:
the processing unit 510 is configured to determine, according to a first state of the system in a first time period, a segment corresponding to the first state, and determine a first action with a largest Q value in a plurality of actions corresponding to the segment, where the Q value of each action is used to represent an expected revenue value that can be obtained by the system after each action is performed;
the processing unit 510 is further configured to execute the first action, and calculate an actual profit value obtained by the system after the first action is executed in a second time interval after the first action is executed;
the processing unit 510 is further configured to determine whether there is a second action with a Q value greater than the actual profit value in the plurality of actions, and adjust the boundary of the segment if there is a second action with a Q value greater than the actual profit value in the plurality of actions.
The units and other operations or functions in the apparatus 500 for adjusting state boundaries according to the embodiment of the present application are respectively for implementing corresponding flows in the method 200 for adjusting state boundaries. For brevity, no further description is provided herein.
It is to be understood that the processing unit herein may be a processor. The apparatus 500 should also include a memory unit. The storage unit may be a memory. The memory is for storing computer instructions. The processor is configured to execute computer instructions stored in the memory. When executed, the computer instructions cause the processor to perform the steps of the method 200 for adjusting state boundaries provided by embodiments of the present application.
Fig. 6 is a schematic block diagram of an apparatus 600 for adjusting a state boundary according to an embodiment of the present application. As shown in fig. 6, the apparatus 600 includes: memory 610, processor 620, and communication interface 630. Wherein the memory 610, the processor 620, and the communication interface 630 are connected to each other through a communication bus 640.
The memory 610 is used to store applications, code or instructions that implement aspects of the present invention. The processor 620 is configured to execute an application, code or instructions stored in the memory 610 to perform the method 200 of adjusting state boundaries in Q learning and corresponding processes and/or operations in various embodiments. For brevity, no further description is provided herein.
It should be understood that the apparatus 500 for adjusting state boundaries provided in fig. 5 can be implemented by the apparatus 600 for adjusting state boundaries shown in fig. 6. For example, the processing unit in fig. 5 may be implemented by the processor 620 in fig. 6, and the storage unit may be implemented by the memory 610.
The processor 620 shown in fig. 6 may be a Central Processing Unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the present invention.
The Memory 610 shown in fig. 6 may be a Read-Only Memory (ROM) or other types of static storage devices that can store static information and instructions, a Random Access Memory (RAM) or other types of dynamic storage devices that can store information and instructions, an Electrically Erasable Programmable Read-Only Memory (EEPROM), a compact disc Read-Only Memory (CD-ROM) or other optical disc storage, optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory may be self-contained and coupled to the processor through a communication bus (e.g., communication bus 640 in fig. 6). The memory may also be integral to the processor.
The communication bus 640 may include a power bus, a control bus, a status signal bus, and the like, in addition to a data bus. For clarity of illustration, the various buses are labeled as communication buses in the figures.
The communication Interface 630 may be a wired Interface, such as a Fiber Distributed Data Interface (FDDI), a Gigabit Ethernet (GE) Interface, or a wireless Interface.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one network unit, or may be distributed on a plurality of network units. Some or all of the elements may be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or each unit may be integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present application, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present disclosure, and all the changes or substitutions should be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (8)
1. A method for adjusting state space boundary in Q learning is applied to a service operation system, and is characterized by comprising the following steps:
determining a segment where the first state is located according to the first state of the system in a first time period, and determining a first action with the maximum Q value in a plurality of actions corresponding to the segment, wherein the segment is one segment in a continuous value range of the state value of the system state, and the Q value of each action is used for representing an expected profit value which can be obtained by the system after each action is executed;
executing the first action, and calculating an actual profit value obtained by the system after the first action is executed in a second time interval after the first action is executed;
judging whether a second action with the Q value larger than the actual profit value exists in the actions, and if the second action with the Q value larger than the actual profit value exists in the actions, adjusting the boundary of the segment;
wherein if there is a second action in the plurality of actions whose Q value is greater than the actual profit value, adjusting the boundary of the segment, including:
and adjusting the boundary of the segment to be the state value of the first state.
2. The method according to claim 1, characterized in that the properties of each state are characterized using at least one of the following parameters of the system:
memory utilization, CPU utilization, network utilization, and the number of machines used.
3. The method of claim 1 or 2, wherein prior to said performing said first action, the method further comprises:
determining whether the state value of the first state belongs to a preset region of the segment, wherein the difference value between the state value of each state in the preset region and the boundary value of the segment is smaller than or equal to a preset threshold value;
and when the state value of the first state is determined to belong to the preset area, executing the first action by adopting the probability of (1-epsilon).
4. The method of claim 1 or 2, wherein the adjusting the boundaries of the segments comprises:
adjusting the boundary of the segment by adopting any one of the following algorithms:
divide and conquer method, clustering method and classification method.
5. An apparatus for adjusting state space boundary in Q learning, configured in a business operation system, comprising:
the processing unit is used for determining a segment corresponding to a first state according to the first state of the system in the first time period, and determining a first action with the maximum Q value in a plurality of actions corresponding to the segment, wherein the segment is one of a continuous value range of the state value of the system state, and the Q value of each action is used for representing an expected profit value which can be obtained by the system after each action is executed;
the processing unit is further configured to execute the first action, and calculate an actual profit value obtained by the system after the first action is executed in a second time period after the first action is executed;
the processing unit is further configured to determine whether a second action with a Q value greater than the actual profit value exists among the plurality of actions, and adjust the boundary of the segment if a second action with a Q value greater than the actual profit value exists among the plurality of actions;
wherein the processing unit is specifically configured to adjust the boundary of the segment to the state value of the first state.
6. The apparatus of claim 5, wherein the attributes of each state are characterized using at least one of the following parameters of the system:
memory utilization, CPU utilization, network utilization, and the number of machines used.
7. The apparatus according to claim 5 or 6, wherein the processing unit is specifically configured to:
determine whether the state value of the first state belongs to a preset region of the segment, wherein a difference between the state value of each state in the preset region and a boundary value of the segment is less than or equal to a preset threshold; and
when it is determined that the state value of the first state belongs to the preset region, execute the first action with a probability of (1 - epsilon).
8. The apparatus according to claim 5 or 6, wherein the processing unit is specifically configured to adjust the space boundary of the segment by using any one of the following algorithms:
a divide-and-conquer method, a clustering method, or a classification method.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2016/108312 (published as WO2018098797A1) | 2016-12-02 | 2016-12-02 | Method and device for adjusting state space boundary in Q-learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108476084A (en) | 2018-08-31 |
CN108476084B (en) | 2020-05-08 |
Family
ID=62241176
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201680056875.0A (granted as CN108476084B, active) | Method and device for adjusting state space boundary in Q learning | 2016-12-02 | 2016-12-02 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108476084B (en) |
WO (1) | WO2018098797A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115875091B (en) * | 2021-09-26 | 2024-01-09 | 国能智深控制技术有限公司 | Method and device for monitoring flow characteristics of steam turbine valve and readable storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101466111B (en) * | 2009-01-13 | 2010-11-17 | 中国人民解放军理工大学通信工程学院 | Dynamic spectrum access method based on policy planning constrain Q study |
US9679258B2 (en) * | 2013-10-08 | 2017-06-13 | Google Inc. | Methods and apparatus for reinforcement learning |
CN104635772B (en) * | 2014-12-08 | 2017-02-08 | 南京信息工程大学 | Method for adaptively and dynamically scheduling manufacturing systems |
US10460254B2 (en) * | 2015-03-17 | 2019-10-29 | Vmware, Inc. | System and method for reducing state space in reinforced learning by using decision tree classification |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102571570A (en) * | 2011-12-27 | 2012-07-11 | 广东电网公司电力科学研究院 | Network flow load balancing control method based on reinforcement learning |
CN102868972A (en) * | 2012-09-05 | 2013-01-09 | 河海大学常州校区 | Internet of things (IoT) error sensor node location method based on improved Q learning algorithm |
CN104168087A (en) * | 2014-08-08 | 2014-11-26 | 浙江大学 | Active self-adaptive transmission frame length adjustment method based on Q-learning in rateless code transmission system |
CN104200077A (en) * | 2014-08-22 | 2014-12-10 | 广西师范大学 | Embedded type attribute selection method based on subspace learning and application of embedded type attribute selection method based on subspace learning |
CN105260230A (en) * | 2015-10-30 | 2016-01-20 | 广东石油化工学院 | Resource scheduling method for data center virtual machine based on segmented service level agreement |
CN105930214A (en) * | 2016-04-22 | 2016-09-07 | 广东石油化工学院 | Q-learning-based hybrid cloud job scheduling method |
CN106157650A (en) * | 2016-07-11 | 2016-11-23 | 东南大学 | A kind of through street traffic efficiency ameliorative way controlled based on intensified learning variable speed-limit |
Non-Patent Citations (2)
Title |
---|
Energy Consumption Cost Modeling and Optimization for Cloud Data Centers; Zhang Shuben; China Excellent Doctoral Dissertations Full-text Database (Information Science and Technology); 2015-09-30; p. I137-2 *
Behavior Learning of Autonomous Robots in Unknown Environments; Chen Feng; Pattern Recognition and Artificial Intelligence; 2002-12-31; Vol. 15, No. 4; pp. 498-501 *
Also Published As
Publication number | Publication date |
---|---|
WO2018098797A1 (en) | 2018-06-07 |
CN108476084A (en) | 2018-08-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113574327B (en) | Method and system for controlling an environment by selecting a control setting | |
EP3516600B1 (en) | Method and apparatus for automated decision making | |
US10761897B2 (en) | Predictive model-based intelligent system for automatically scaling and managing provisioned computing resources | |
CN110138612A | Cloud software service resource allocation method based on QoS model self-correction | |
CN111325416A | Method and device for predicting user churn on a ride-hailing platform | |
CN112052071B (en) | Cloud software service resource allocation method combining reinforcement learning and machine learning | |
CN109165081B (en) | Web application self-adaptive resource allocation method based on machine learning | |
CN110033081A | Method and apparatus for determining a learning rate | |
CN113989561A (en) | Parameter aggregation updating method, equipment and system based on asynchronous federal learning | |
CN113778691B (en) | Task migration decision method, device and system | |
CN112256739A | Method for screening data items in dynamic streaming big data based on a multi-armed bandit | |
CN116185584A (en) | Multi-tenant database resource planning and scheduling method based on deep reinforcement learning | |
CN108139930B (en) | Resource scheduling method and device based on Q learning | |
CN108476084B (en) | Method and device for adjusting state space boundary in Q learning | |
CN104698838B | Fuzzy scheduling rule mining method based on dynamic domain division and learning | |
CN113574474A (en) | Polishing semiconductor wafers using causal models | |
US12019712B2 (en) | Enhanced reinforcement learning algorithms using future state prediction scaled reward values | |
CN116090618A (en) | Operation situation sensing method and device for power communication network | |
CN112001570B (en) | Data processing method and device, electronic equipment and readable storage medium | |
JP7384999B2 (en) | Machine learning model determination system and machine learning model determination method | |
CN113597305A (en) | Manufacture of biopharmaceuticals using causal models | |
CN113574554B (en) | Operating a supply chain using a causal model | |
US20200104736A1 (en) | Determining an optimal region in a target value optimization problem and utilizing the optimal region to perform an action | |
CN112306641B (en) | Training method for virtual machine migration model | |
Alfares et al. | Online VM Service Selection with Spot Cores for Dynamic Workloads |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||