WO2018098797A1 - Method and apparatus for adjusting state space boundary in Q learning - Google Patents

Method and apparatus for adjusting state space boundary in Q learning

Info

Publication number
WO2018098797A1
Authority
WO
WIPO (PCT)
Prior art keywords
value
state
action
segment
boundary
Prior art date
Application number
PCT/CN2016/108312
Other languages
English (en)
French (fr)
Inventor
霍罗威茨夏伊
阿里安亚伊
郑淼
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN201680056875.0A priority Critical patent/CN108476084B/zh
Priority to PCT/CN2016/108312 priority patent/WO2018098797A1/zh
Publication of WO2018098797A1 publication Critical patent/WO2018098797A1/zh

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00 - Arrangements for detecting or preventing errors in the information received

Definitions

  • Embodiments of the present application relate to the field of information technology, and more particularly, to a method and apparatus for adjusting a state space boundary in Q learning.
  • Reinforcement learning, also known as reward-based learning or evaluative learning, is an important machine learning method with many applications in fields such as intelligent robot control and analysis and prediction.
  • Reinforcement learning is the process by which an intelligent system learns a mapping from the environment to behavior so as to maximize the value of a reward function.
  • In reinforcement learning, the value of the reward function provided by the environment evaluates how good or bad an action is; it does not tell the reinforcement learning system how to generate the correct action. Because the external environment provides very little information, reinforcement learning must learn from its own experience. In this way, reinforcement learning gains knowledge in an action-evaluation environment and improves its action plan to adapt to the environment.
  • The Q-learning method is one of the classical algorithms in reinforcement learning and is a model-free learning algorithm.
  • the data center cluster adaptively schedules resources used by the application based on the Q learning algorithm to improve resource utilization of the data center.
  • the data center usually schedules the resources used by the application according to the load change of the application (or the state of the application).
  • The state of the application is mostly characterized by a single parameter: the average resource utilization of all machines the application uses in the machine cluster. Moreover, this parameter is continuous rather than discrete. In the prior art, in order to accurately describe the candidate actions that an application can take in each state, the originally continuous state space is divided into discrete segments.
  • the present application provides a method and apparatus for adjusting a state space boundary in Q learning, which can improve the performance of the Q learning algorithm while accelerating the convergence speed of the Q learning algorithm.
  • In a first aspect, the present application provides a method for adjusting a state space boundary in Q learning, applied to a service running system. The method includes: determining, according to a first state of the system in a first time period, the segment in which the first state is located, and determining the first action with the largest Q value among the multiple actions corresponding to that segment, where the segment is one section of the continuous value range of the state value of the system state, and the Q value of each action represents the expected benefit value the system can obtain after that action is performed; performing the first action and, in a second time period after the first action is performed, calculating the actual benefit value obtained by the system; and determining whether there is, among the multiple actions, a second action whose Q value is greater than the actual benefit value, and, if such a second action exists, adjusting the boundary of the segment.
  • the second time period is after the first time period. More specifically, the first time period is the time period before the first action is performed (or taken). The second time period is a time period after the first action is performed.
  • All the states of the system are arranged in order of size (from large to small or from small to large), and a continuous segment is taken as a segment.
  • In the embodiments of the present application, by adjusting the boundary of the segment in which the state of the system is located, the number of system states is reduced, the convergence of the Q-learning algorithm is accelerated, and the performance of the algorithm can be improved.
  • In a possible implementation, if there is, among the multiple actions, a second action whose Q value is greater than the actual benefit value, adjusting the boundary of the segment includes: adjusting the boundary of the segment to the state value of the first state.
  • In a possible implementation, the attributes of each state are characterized using at least one of the following parameters of the system: memory utilization, central processing unit (CPU) utilization, network utilization, and the number of machines used.
  • In the embodiments of the present application, multiple parameters are used to represent the attributes of a state (also referred to as the state space), so that the representation of the state space in Q learning becomes multi-dimensional. This makes the description of the state space more accurate and detailed and allows the performance of the algorithm to be further optimized.
  • In a possible implementation, before the first action is performed, the method further includes: determining whether the state value of the first state belongs to a preset region of the segment, where the difference between each state value in the preset region and the boundary value of the segment is less than or equal to a preset threshold; and, when it is determined that the state value of the first state belongs to the preset region, performing the first action with a probability of (1-ε).
  • Specifically, in the embodiments of the present application, when the state value of the first state that the system is in during the first time period is the boundary value between the segment in which the first state is located and the segment in which a second state is located, or lies near that boundary value, the first action is performed with a probability of (1-ε), and any action other than the first action among the multiple actions corresponding to the segment in which the first state is located is performed with a probability of ε. Here, the second state is different from the first state, and the segment in which the second state is located is adjacent to the segment in which the first state is located.
  • the ⁇ greedy strategy is adopted every time the optimal action in a state is selected, and the purpose is to balance the exploration and exploitation of the algorithm to strengthen the algorithm.
  • the ability to explore algorithms Try out those actions that have not been performed to see if you can get better results.
  • excessive exploration attempts can affect the performance of the algorithm.
  • adopting the ⁇ greedy strategy for those states near the two segment boundary values can reduce the number of invalid attempts and improve the performance of the algorithm.
  • adjusting the boundary of the segment includes: adjusting a boundary of the segment by using any one of the following algorithms: divide and conquer, clustering, and classification.
  • Optionally, in the embodiments of the present application, when the attributes of the state space are characterized by a single parameter (that is, the state space is one-dimensional), the method for adjusting the state space boundary provided by the embodiments of the present application can reduce the number of states to the same number as the number of actions.
  • the present application provides an apparatus for adjusting a state space boundary in Q learning for performing the method of the first aspect or any possible implementation of the first aspect.
  • the apparatus comprises means for performing the method of the first aspect or any of the possible implementations of the first aspect.
  • the present application provides an apparatus for adjusting a state space boundary in Q learning.
  • the device includes: a memory and a processor.
  • the memory is for storing instructions
  • the processor is for executing instructions stored in the memory, and when the instructions are executed, the processor performs the method of the first aspect or any possible implementation of the first aspect.
  • the present application provides a computer readable medium for storing a computer program, the computer program comprising instructions for performing the method of the first aspect or any of the possible implementations of the first aspect.
  • In the embodiments of the present application, by adjusting the boundary of the segment in which the system state is located (that is, the boundary between states), the number of system states is reduced, the convergence of the Q-learning algorithm is accelerated, and the performance of the algorithm can be improved.
  • FIG. 1 is a flow chart of a method 100 for resource scheduling using a Q learning algorithm in the prior art.
  • FIG. 2 is a flowchart of a method 200 for adjusting a state space boundary according to an embodiment of the present application.
  • FIG. 3 is an example of adjusting a segment boundary provided by an embodiment of the present application.
  • FIG. 4 is another example of adjusting a segment boundary provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an apparatus 500 for adjusting a state space boundary according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an apparatus 600 for adjusting a state space boundary according to an embodiment of the present disclosure.
  • the data center may include a computer cluster, and the data center may adjust the number of machines (for example, virtual machines, containers, and the like) allocated to the application in real time according to information such as load changes of the application. For example, increase or decrease the number of machines, or keep the number of machines unchanged, etc., to improve the overall resource utilization of the data center while effectively meeting the application requirements.
  • State of the application: describes the current running status of the application, which can be expressed as S(M, U), where M represents the number of machines used by the application and U represents the average resource occupancy of all machines in the machine cluster currently used by the application.
  • the machines herein may include a physical machine (PM), a virtual machine (VM), and/or a container (Docker).
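  • As a concrete illustration of the S(M, U) notation above, the following minimal Python sketch (the class and field names are ours, not from the original text) represents an application state by the number of machines in use and their average resource occupancy:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AppState:
    """Application state S(M, U)."""
    machines: int            # M: number of machines (PMs, VMs or containers) in use
    avg_utilization: float   # U: average resource occupancy of those machines, in [0, 1]

# Example: 12 machines at 62% average utilization
s = AppState(machines=12, avg_utilization=0.62)
```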
  • Action: the various kinds of actions that the Q-learning algorithm can take in a data center cluster (for example, the number of actions and the magnitude of each action), which can be set according to the load of the application. For example, when resource scheduling in a data center cluster is based on Q learning, actions can be used to adjust the amount of resources or the number of machines allocated to an application: reduce the number of machines, keep the number of machines unchanged, or increase the number of machines.
  • the specific adjustment quantity of the resource allocated to the application may be set according to an actual requirement, which is not limited in the embodiment of the present invention.
  • Reward function: after the Q-learning algorithm performs action A while the application is in state S, the system gives a reward value for the state-action combination (S, A), which can be used to evaluate how good or bad it is to perform action A in state S. For example, a positive reward value indicates that the Service Level Objective (SLO) of the application can be satisfied in time after action A is performed; a negative reward value indicates that the SLO of the application cannot be satisfied after action A is taken.
  • As an example, the reward value function can be expressed in terms of the following quantities (the exact formula appears only as an image in the original publication):
  • U can represent the average resource occupancy rate of all the machines currently used by the application, and p is a configuration parameter, which is set to 2 by default.
  • respTime represents the percentage of 99% response time in the data center.
  • the SLO can represent a service level target of 99% response time percentage, which is used to ensure that 99% of applications can respond in a timely manner.
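  • The exact reward formula is not reproduced in this text, so the sketch below is merely one plausible shape consistent with the description above: positive while the 99% response time meets the SLO, negative otherwise, and shaped by U and the configuration parameter p (default 2). The function name and the exact expression are assumptions.

```python
def reward(avg_utilization: float, resp_time: float, slo: float, p: float = 2.0) -> float:
    """Hypothetical reward: favors high utilization while the SLO is met and
    penalizes the state-action combination once the 99% response time misses
    the SLO. This is NOT the patent's exact formula, which appears only as an image."""
    if resp_time <= slo:
        return avg_utilization ** p       # positive: the SLO is satisfied in time
    return -((resp_time / slo) ** p)      # negative: the SLO is violated
```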
  • Q value: a function learned from state-action pairs that measures the cumulative return of taking a given action in a given state.
  • The specific calculation can be expressed by the standard Q-learning update (with c as the learning rate and γ as the discount factor):
  • Q(s_t, a_t) ← (1 - c)·Q(s_t, a_t) + c·[r + γ·max_a Q(s_{t+1}, a)]
  • where c and γ are adjustable parameters, r represents the reward function, Q(s_t, a_t) denotes the Q value of action a_t for state s_t at time t, and max_a Q(s_{t+1}, a) denotes the Q value of the action a with the largest Q value in state s_{t+1} at time t+1.
  • Q table: records the Q values of all possible state-action combinations formed by all possible states of the application and all optional actions. Each time the algorithm decides which action to take in a state, it selects according to the following principle: choose the action with the largest Q value among all actions of that state.
  • Table 1 below is an example of a Q table in Q learning.
  • the first column of the Q table represents the state of the application.
  • Columns 2 through M+1 of the Q table represent M optional actions, respectively.
  • Q ij represents the Q value corresponding to the state-action combination composed of the application state of the i-th row and the action of the j-th column.
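  • A minimal sketch of the Q table and of the update rule given above, with one row per state segment and one column per optional action as in Table 1; the learning rate c, the discount γ, the segment labels, and the action set are illustrative values only.

```python
ACTIONS = [-2, -1, 0, +1, +2]   # change in machine count, as in the later examples
SEGMENTS = ["0-10%", "10%-30%", "30%-70%", "70%-90%", "90%-100%"]

# Q table: one row per state segment, one entry per optional action
Q = {seg: {a: 0.0 for a in ACTIONS} for seg in SEGMENTS}

def best_action(segment: str) -> int:
    """Select the action with the largest Q value among all actions of this segment."""
    return max(Q[segment], key=Q[segment].get)

def q_update(segment: str, action: int, r: float, next_segment: str,
             c: float = 0.5, gamma: float = 0.8) -> None:
    """Standard Q-learning update with learning rate c and discount factor gamma."""
    target = r + gamma * max(Q[next_segment].values())
    Q[segment][action] = (1 - c) * Q[segment][action] + c * target
```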
  • FIG. 1 is a flow chart of a method 100 for resource scheduling using a Q learning algorithm in the prior art. As shown in FIG. 1, the method 100 mainly includes the following steps 101-106.
  • the application performs action A.
  • Performing action A means scheduling the resources of the application (for example, increasing the amount of resources, keeping it unchanged, or reducing it).
  • the system recalculates the resource utilization of the application.
  • Specifically, the system calculates the reward function value of action A according to factors such as the resource utilization of the application, the response time, and the SLO of the application, to evaluate how good or bad it is for the application to take action A when it is in state S.
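  • Putting the steps of method 100 together, the prior-art scheduling loop of FIG. 1 might look like the sketch below. It reuses best_action and q_update from the Q-table sketch above; current_segment, apply_action and measure_reward are hypothetical stand-ins for the data center environment rather than functions named in the original text.

```python
import time

def current_segment() -> str:
    # Hypothetical stand-in: map the measured average utilization of the
    # application's machines to the segment it falls into.
    return "30%-70%"

def apply_action(a: int) -> None:
    # Hypothetical stand-in: add or remove |a| machines (a = 0 keeps the count).
    pass

def measure_reward() -> float:
    # Hypothetical stand-in: compute the reward from the resource utilization,
    # the response time and the SLO of the application.
    return 0.0

def scheduling_loop(period_s: float = 60.0) -> None:
    """Steps 101-106 of method 100: observe the state, act greedily from the
    Q table, wait one period, then score the state-action pair and update Q."""
    while True:
        seg = current_segment()                  # 101: state S at time t
        a = best_action(seg)                     # 102: action A chosen from the Q table
        apply_action(a)                          # 103: the application performs action A
        time.sleep(period_s)                     # 104: wait until t+T; utilization is re-measured
        r = measure_reward()                     # 105: reward of the combination (S, A)
        q_update(seg, a, r, current_segment())   # 106: update Q(S, A) in the Q table
```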
  • the resources used by the application are adjusted in real time according to the average resource utilization of all the machines used in the machine cluster.
  • the average resource utilization parameter is continuous, not discrete.
  • In the prior art, the state space of the application is generally divided into discrete segments based on human experience, yielding a series of discrete states of the application (such as the Q table shown in Table 1).
  • an existing scheme proposes to merge states having similar Q values in the Q table to reduce the number of state spaces.
  • the Q value does not fully reflect the correspondence between the state and the action.
  • Only the relative values of the Q values of different actions in the same state are meaningful; the absolute values of the Q values of actions in different states have no practical significance. Therefore, merging Q values leads to inaccurate information and leaves the performance of the algorithm without guarantee.
  • On the other hand, in the prior art the originally continuous state space is usually discretized based on empirical values, and the partition granularity greatly affects the performance of the algorithm: if the granularity is too coarse, the accuracy of the algorithm is hard to guarantee, while if it is too fine, the algorithm converges too slowly and efficiency drops.
  • the embodiment of the present application provides a method for adjusting a state space boundary in Q learning, which can improve the convergence speed of the Q learning algorithm and improve the performance of the algorithm.
  • the processor is used as an example of the execution subject of the method for adjusting the state space boundary in the Q learning provided by the embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a method 200 for adjusting a state space boundary according to an embodiment of the present application, where the method 200 is applied to a service running system. As shown in FIG. 2, method 200 primarily includes steps 210-230.
  • 210. The processor determines, according to the first state of the system in the first time period, the segment in which the first state is located, and determines the first action with the largest Q value among the multiple actions corresponding to that segment, where the Q value of each action represents the expected benefit value the system can obtain after that action is performed.
  • the segmentation refers to a range of values of a state value obtained by dividing a state value of a system state according to a certain division granularity. That is, all the states of the system are arranged in order of size (from large to small or from small to large), and a continuous segment is taken as a segment.
  • For example, the average resource utilization of the system is divided into 10 bins at a granularity of 10%, namely 0-10%, 10%-20%, 20%-30%, ..., 80%-90%, and 90%-100%. Each bin is a segment.
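  • A small sketch of this 10%-granularity binning, mapping a continuous utilization value to the index of its segment; the same bisect lookup also works for the unequal boundaries produced after adjustment.

```python
import bisect

# Upper boundaries of the ten initial segments: 0-10%, 10%-20%, ..., 90%-100%
boundaries = [i / 10 for i in range(1, 11)]

def segment_index(avg_utilization: float) -> int:
    """Return the index of the segment containing the given utilization value."""
    return min(bisect.bisect_left(boundaries, avg_utilization), len(boundaries) - 1)

assert segment_index(0.05) == 0   # falls in 0-10%
assert segment_index(0.36) == 3   # falls in 30%-40% before any boundary adjustment
```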
  • 220. The processor performs the first action and, in a second time period after the first action is performed, calculates the actual benefit value obtained by the system.
  • the second time period is located after the first time period.
  • Taking Q learning as an example, step 220 specifically includes the following two processes: (1) in the first time period, the processor performs the first action and calculates the reward value the system obtains as a result; (2) the Q value of the first action is updated according to that system reward value.
  • process (2) herein that is, the process of updating the Q value of the first action according to the system reward value, may refer to the prior art and will not be described in detail herein.
  • 230. The processor determines whether there is, among the multiple actions, a second action whose Q value is greater than the actual benefit value; if such a second action exists, the boundary of the segment is adjusted.
  • In the embodiments of the present application, the processor first determines, according to the state the system is in during the first time period (denoted state S1), the segment corresponding to state S1 (denoted segment #1), and determines the action with the largest Q value among the multiple actions corresponding to segment #1 (denoted action A1).
  • The processor then performs action A1 and, in the second time period after action A1 is performed, calculates the actual benefit value obtained by the system.
  • Finally, the processor determines whether, among the multiple actions corresponding to segment #1, there is an action whose Q value is larger than the actual benefit value (denoted action A2); if action A2 exists, the boundary of segment #1 is adjusted.
  • As described above, a segment is a continuous range of values of the system state. Adjusting the boundary of a segment can therefore also be described as adjusting the boundary between states.
  • First, assume that the state of the system in the first time period is state S1, that the segment in which state S1 is located is segment #A, and that action A1 has the largest Q value among the multiple actions corresponding to segment #A.
  • If, after action A1 is performed, the optimal action among the multiple actions corresponding to segment #A changes (for example, the optimal action changes from A1 to action A2), the boundary of segment #A needs to be adjusted.
  • It should be understood that adjusting the boundary of segment #A means adjusting the boundary value between segment #A and the adjacent segment.
  • FIG. 3 is an example of adjusting a segment boundary provided by an embodiment of the present application. As shown in FIG. 3, assume that the original boundary between segment #A and segment #B is a system resource utilization of 0.7.
  • Before the method is performed, the state of the system is a resource utilization of 0.62, and among the multiple actions that the Q table associates with the segment containing 0.62, the action with the largest Q value (that is, the optimal action for that segment) is action 0. The processor performs action 0 and then calculates the reward value the system obtains for (0.62, action 0).
  • The Q value of action 0 is updated according to the reward value of (0.62, action 0).
  • If, after the update, the action with the largest Q value among the multiple actions corresponding to segment #A is no longer action 0 but another action (say, action +1), the boundary of segment #A is adjusted.
  • Here, adjusting the boundary of segment #A means adjusting the boundary value between segment #A and segment #B.
  • Specifically, according to the embodiment of the present application, the boundary value between segment #A and segment #B should be adjusted from the original 0.7 to 0.62.
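  • A sketch of the adjustment rule illustrated by FIG. 3: after the Q value of the executed action has been updated, if some other action of the segment now has a Q value greater than the actual benefit obtained, the boundary shared with the neighbouring segment is moved to the state value that triggered the change (0.7 to 0.62 in the example). The data structures below are simplified assumptions.

```python
def maybe_adjust_boundary(q_row: dict, executed_action: int, actual_benefit: float,
                          boundaries: list, boundary_index: int, state_value: float) -> bool:
    """If any other action of this segment now has a Q value greater than the
    actual benefit of the executed action, move the segment boundary to the
    current state value (e.g. 0.7 -> 0.62 in FIG. 3)."""
    better_exists = any(a != executed_action and q > actual_benefit
                        for a, q in q_row.items())
    if better_exists:
        boundaries[boundary_index] = state_value
    return better_exists

# Example mirroring FIG. 3: the boundary between segment #A and #B starts at 0.7
bounds = [0.3, 0.7, 1.0]
q_row = {-1: 0.1, 0: 0.4, +1: 0.9}   # after the update, action +1 looks better than action 0
maybe_adjust_boundary(q_row, executed_action=0, actual_benefit=0.4,
                      boundaries=bounds, boundary_index=1, state_value=0.62)
print(bounds)                         # [0.3, 0.62, 1.0]
```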
  • the basic idea of the divide-and-conquer algorithm is to decompose a problem of size N into K smaller sub-problems that are independent of each other and of the same nature as the original problem. After solving the solutions of each sub-problem, the solutions of the sub-problems are merged layer by layer, and the solution to the original problem can be obtained.
  • The divide-and-conquer method can be applied in the embodiments of the present application to adjust the boundary of a segment.
  • The clustering method is a statistical analysis method for studying classification problems. Cluster analysis is based on similarity: elements in the same category have more similarity with each other than elements in different categories.
  • the process of applying the clustering method to adjust the segment boundary mainly includes the following steps 301-304.
  • Assuming that the algorithm presets five actions that the processor may take (action -2, action -1, action 0, action +1, and action +2), the clustering operation can divide the state data of the system into these five categories.
  • the specific algorithm used for the clustering operation in the embodiment of the present application is not limited.
  • For example, the classical K-means clustering algorithm, the improved K-medoids algorithm, the CLARA algorithm, and the like can be used.
  • The boundary value of the new category #P is then recomputed from the new cluster center and radius of category #P and those of the adjacent category (the exact formula appears only as an image in the original publication).
  • Classification: the state S of the system is taken as input, the optimal action A that the processor should take when the system is in state S is taken as output, and a classification method such as a support vector machine (SVM) or a decision tree is used to determine the boundary value between two adjacent segments.
  • Specifically, using the state data of the application over a past period of time T, the support vector machine determines the boundary value between two adjacent segments. When new data is added, the support vector machine is re-run to determine the new boundary value.
  • Optionally, as an embodiment, the boundary value between two adjacent segments may instead be determined by the logistic regression method among the classification methods.
  • Specifically, when logistic regression is applied in the embodiments of the present application, the main idea is to use the state data of the system over a past period of time T (that is, historical data of the system state) and determine, by logistic regression, the state space boundary value between two adjacent segments. When new data is added, the logistic regression method is re-run to determine the new boundary values between the segments.
  • In the embodiments described above, the average resource utilization of the system was used only as an example of the state attribute of the system.
  • Optionally, as an embodiment, the attributes of each state are characterized using at least one of the following parameters of the system: memory utilization, central processing unit (CPU) utilization, network utilization, and the number of machines used by the system.
  • the method for adjusting the state space boundary provided by the embodiment of the present application can reduce the number of states of the system to be equal to the number of executable actions.
  • the boundary between two adjacent segments determined using any of the algorithms described above will be a multi-dimensional boundary.
  • the use of multiple parameters to characterize the state space makes the representation of the state in Q learning multi-dimensional, which can make the description of the state more accurate and detailed, and the performance of the algorithm can be further optimized.
  • FIG. 4 shows still another example of adjusting the segmentation boundary provided by the embodiment of the present application.
  • the boundary value of adjacent segments is determined by logistic regression in the classification method.
  • the "star” and "point” shown in Fig. 4 are state data of the system for a period of time T in the past. These status data are characterized by two attributes, such as average resource utilization and number of machines.
  • the horizontal axis (x-axis) represents the average resource utilization
  • the vertical axis (y-axis) represents the number of machines.
  • Here, the segment corresponding to the data characterized by the average resource utilization is denoted segment #A, and the segment corresponding to the data characterized by the number of machines is denoted segment #B.
  • Specifically, the processor first normalizes the average resource utilization so that its value range is the same as that of the number of machines, and then uses logistic regression to determine the boundary value between the two adjacent segments.
  • As shown in FIG. 4, y = x is the dividing line between the two kinds of data; that is, the line (number of machines) = 100 × (average resource utilization) is the actual segment boundary. When the number of machines is less than (100 × average resource utilization), the data belongs to segment #A; otherwise it belongs to segment #B.
  • When new state data is added, the logistic regression method is run again to re-determine the boundary of the segments.
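  • A sketch of this logistic-regression variant using scikit-learn: the two state attributes are normalized to a common range, a classifier is fitted on historical (state, segment) data, and the fitted coefficients give the linear boundary between the two adjacent segments (approximately machines = 100 × utilization, i.e. y = x, in FIG. 4). The data here is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic history: (average utilization in [0, 1], machine count in [0, 100]),
# labelled by which of the two adjacent segments each observation belongs to.
util = rng.uniform(0.0, 1.0, 200)
machines = rng.uniform(0.0, 100.0, 200)
label = (machines >= 100.0 * util).astype(int)   # 1 -> segment #B, 0 -> segment #A

# Normalize utilization so both attributes share the same value range
X = np.column_stack([util * 100.0, machines])
clf = LogisticRegression().fit(X, label)

# Decision boundary w0*x + w1*y + b = 0, which approximates y = x here
w0, w1 = clf.coef_[0]
b = clf.intercept_[0]
print(f"boundary: machines = {-w0 / w1:.2f} * (100 * util) + {-b / w1:.2f}")
```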
  • Optionally, as an embodiment, before the first action is performed, the method further includes: determining whether the state value of the first state belongs to a preset region of the segment, where the difference between the state value of each state in the preset region and the boundary value of the segment is less than or equal to a preset threshold; and, when it is determined that the state value of the first state belongs to the preset region, performing the first action with a probability of (1-ε).
  • The preset region is actually the portion of the segment that is close to the segment boundary value; that is, the states included in the preset region lie near the boundary value of the segment (including the boundary value itself).
  • The preset region can be set and adjusted according to the resource scheduling situation. For example, when the state space is large, the preset region can be set larger to increase the convergence speed of the algorithm; when the state space has already been reduced, the preset region can be set smaller to refine the boundaries of the state space and make the division of states more accurate.
  • the specific value of the preset threshold is not limited.
  • In fact, the preset threshold varies with the preset region: when the preset region is large, the preset threshold is correspondingly large, and when the preset region is small, the preset threshold is correspondingly small.
  • the preset threshold should be the absolute value of the difference between the first and last state values of the preset area.
  • For example, referring to FIG. 3, assume the preset region is set to an average resource utilization of 1%. If the average resource utilization of the system in a certain time period is 69.8%, it falls within the preset region of the segment (30%-70%). According to the embodiment of the present application, the processor may then perform the optimal action corresponding to the segment (30%-70%), that is, action 0, with a probability of (1-ε), and may perform any other of the multiple actions corresponding to the segment (30%-70%) (not shown in FIG. 3) with a probability of ε. If, instead, the average resource utilization of the system is 65%, which does not fall within the preset region of the segment (30%-70%), the processor determines to perform action 0.
  • each segment of the system state has two boundary values. Regardless of whether the state of the system falls into the upper or lower boundary of a segment at a certain moment, the optimal action corresponding to the segment is performed with a probability of (1- ⁇ ).
  • Therefore, an average resource utilization of 69.8% can be considered to fall within the preset region of the upper boundary of the segment (30%-70%). As another example, assume the average resource utilization of the system is 30.5%; it then falls within the preset region of the lower boundary of the segment (30%-70%), and the processor should likewise perform action 0 with a probability of (1-ε).
  • the reinforcement learning algorithm uses the ⁇ greedy strategy each time an optimal action under one state (recorded as state #A) is selected.
  • the ⁇ greedy strategy refers to the action of the system processor selecting the maximum value of Q in the segment where state #A is located with the probability of (1- ⁇ ) when performing motion selection and execution, and selecting the corresponding segment of the segment with the probability of ⁇ . Any of a number of actions other than the optimal action.
  • the processor selects to take the first action with the largest Q value among the plurality of actions corresponding to the first segment with the probability of (1- ⁇ ), and performs the first segment corresponding to the probability of ⁇ with the probability of ⁇ Any of a number of actions other than the first action.
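  • A sketch of this selective ε-greedy rule: exploration is applied only when the state value lies within the preset region around a segment boundary; otherwise the greedy action is always taken. The threshold and ε values are illustrative.

```python
import random

def near_boundary(state_value: float, lower: float, upper: float,
                  threshold: float = 0.01) -> bool:
    """True if the state value lies within `threshold` of either segment boundary."""
    return min(state_value - lower, upper - state_value) <= threshold

def select_action(q_row: dict, state_value: float, lower: float, upper: float,
                  eps: float = 0.1) -> int:
    greedy = max(q_row, key=q_row.get)
    if near_boundary(state_value, lower, upper) and random.random() < eps:
        # Explore: any action of this segment other than the greedy one
        return random.choice([a for a in q_row if a != greedy])
    return greedy

q_row = {-1: 0.2, 0: 0.8, +1: 0.1}
print(select_action(q_row, state_value=0.698, lower=0.30, upper=0.70))  # may explore
print(select_action(q_row, state_value=0.650, lower=0.30, upper=0.70))  # always action 0
```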
  • Table 2 shows the results obtained by clustering the Q table before the segment boundary adjustment. According to the method of the embodiment of the present application, the specific process is as follows:
  • 401. Assume the current state of the system is an average resource utilization of 0.36. Querying the Q table shows that the segment containing 0.36 is the segment whose average resource utilization range is (30%-70%) (denoted segment #A), and the action with the largest Q value among the multiple actions corresponding to segment #A is action 0, that is, keeping the number of machines used by the system unchanged. The reward value of (0.36, action 0) is calculated according to the reward function and is found to be positive. After the Q value of action 0 is updated according to this reward value, the optimal action corresponding to the segment containing 0.36 becomes action -1, indicating that under the current average resource utilization (0.36) the number of machines used by the system is partly wasted and action -1 is better. Since the optimal action among the multiple actions corresponding to segment #A has changed, the boundary of segment #A needs to be updated.
  • In step 401, after the system, in a certain state, performs the optimal action corresponding to the segment in which that state is located, the specific process by which the processor calculates the reward value of the state-action combination and updates the Q value of that action may refer to the prior art and is not detailed here.
  • Assume the new cluster center is u_i with radius r_i and the adjacent cluster center is u_j with radius r_j; the new boundary value is then updated as a function of u_i, r_i, u_j, and r_j (the exact formula appears only as an image in the original publication).
  • As a result, the boundary between the segment (10%-30%) and the segment (30%-70%) is updated from 30% to 33%.
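  • A sketch of the clustering-based update in the worked example above: the state 0.36 is added to its optimal action's cluster, the cluster centre and radius are recomputed, and the shared boundary is re-derived from the two neighbouring clusters. The patent gives the boundary formula only as an image, so the rule used below (placing the boundary between the two centres in proportion to their radii) and the numbers (chosen so the boundary lands near 33%) are our assumptions.

```python
def update_cluster(values: list, new_value: float) -> tuple:
    """Re-compute a one-dimensional cluster's centre and radius after adding a state value."""
    values.append(new_value)
    centre = sum(values) / len(values)
    radius = max(abs(v - centre) for v in values)
    return centre, radius

def boundary_between(u_i: float, r_i: float, u_j: float, r_j: float) -> float:
    """Assumed rule: place the boundary between centres u_i < u_j, weighted by the radii."""
    return u_i + (u_j - u_i) * r_i / (r_i + r_j)

# The cluster of the lower segment absorbs the state 0.36
cluster_low = [0.12, 0.20, 0.28]
u_i, r_i = update_cluster(cluster_low, 0.36)   # new centre 0.24, radius 0.12
u_j, r_j = 0.50, 0.23                          # neighbouring cluster (illustrative)
print(round(boundary_between(u_i, r_i, u_j, r_j), 2))   # -> 0.33
```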
  • the method for adjusting the state boundary in the embodiment of the present application can also be applied to a scenario in which motion prediction is performed by using a Q learning algorithm.
  • For example, dynamic channel allocation in mobile networks, motion prediction for robots, and the like.
  • In the scenario of robot motion prediction, the motion of the robot can be defined, for example, as moving 1 step to the left, moving 2 steps to the left, staying in place, moving 1 step to the right, or moving 2 steps to the right, and the state space can be defined as the current distance from the robot to the destination (for example, the longitude distance, the latitude distance, and so on). In this way, by reducing the size of the state space, the robot can quickly be guided to a more accurate motion prediction.
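  • Transferring the same idea to the robot example: the action set is the five movement options and the state is the discretized distance to the destination, whose segment boundaries can be adjusted with the same rule used for resource utilization above. All values are illustrative.

```python
ROBOT_ACTIONS = {-2: "2 steps left", -1: "1 step left", 0: "stay in place",
                 +1: "1 step right", +2: "2 steps right"}

# Distance-to-destination segments (in metres); these boundaries are adjustable
distance_boundaries = [1.0, 5.0, 20.0, 100.0]

def distance_segment(distance_m: float) -> int:
    """Map the robot's current distance to the destination to a state segment."""
    for i, b in enumerate(distance_boundaries):
        if distance_m <= b:
            return i
    return len(distance_boundaries)

print(ROBOT_ACTIONS[+1], "| segment:", distance_segment(12.0))   # 1 step right | segment: 2
```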
  • the boundary between the multiple segments obtained by dividing the system state is adjusted, so that the number of states of the system is reduced, the convergence speed of the Q learning algorithm is accelerated, and the performance of the algorithm is improved.
  • the method for adjusting the state boundary of the embodiment of the present application is described in detail with reference to FIG. 1 to FIG. 4 .
  • the apparatus and device for adjusting the state boundary in the embodiment of the present application are described below with reference to FIG. 5 and FIG. 6 .
  • FIG. 5 is a schematic block diagram of an apparatus 500 for adjusting a state boundary according to an embodiment of the present application.
  • the device 500 is configured in a business operating system.
  • the apparatus 500 includes:
  • the processing unit 510 is configured to determine, according to the first state that the system is in the first time period, the segment corresponding to the first state, and determine the first action that has the largest Q value among the multiple actions corresponding to the segment, where each The Q value of the action is used to indicate the expected benefit value available to the system after each action is performed;
  • the processing unit 510 is further configured to perform a first action, and calculate a real income value obtained by the system after performing the first action in a second time period after the first action is performed;
  • The processing unit 510 is further configured to determine whether there is, among the multiple actions, a second action whose Q value is greater than the actual benefit value and, if such a second action exists, to adjust the boundary of the segment.
  • the units in the apparatus 500 for adjusting the state boundary of the embodiment of the present application and the other operations or functions described above are respectively used to implement the corresponding flow in the method 200 of adjusting the state boundary. For the sake of brevity, it will not be repeated here.
  • the processing unit herein may be a processor.
  • Device 500 should also include a storage unit.
  • the storage unit can be a memory.
  • the memory is used to store computer instructions.
  • the processor is configured to execute computer instructions stored in the memory. When the computer instructions are executed, the processor executes the corresponding steps of the method 200 of adjusting the state boundary provided by the embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an apparatus 600 for adjusting a state boundary according to an embodiment of the present disclosure.
  • device 600 includes a memory 610, a processor 620, and a communication interface 630.
  • the memory 610, the processor 620, and the communication interface 630 are connected to each other through a communication bus 640.
  • Memory 610 is used to store applications, code or instructions that perform the inventive arrangements.
  • the processor 620 is configured to execute an application, code or instruction stored in the memory 610 to perform the method 200 of adjusting state boundaries in Q learning and corresponding flows and/or operations in various embodiments. For the sake of brevity, it will not be repeated here.
  • the apparatus 500 for adjusting state boundaries provided in FIG. 5 can be implemented by the apparatus 600 for adjusting state boundaries as shown in FIG.
  • the processing unit of FIG. 5 can be implemented by processor 620 of FIG. 6, and the memory unit can be implemented by memory 610.
  • The processor 620 shown in FIG. 6 may be a central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the solutions of the present invention.
  • The memory 610 shown in FIG. 6 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • The memory may exist independently and be connected to the processor through a communication bus (for example, communication bus 640 in FIG. 6).
  • The memory may also be integrated with the processor.
  • the communication bus 640 may include a power bus, a control bus, a status signal bus, and the like in addition to the data bus.
  • various buses are labeled as communication buses in the figures.
  • The communication interface 630 may be a wired interface, such as a Fiber Distributed Data Interface (FDDI) or a Gigabit Ethernet (GE) interface, or may be a wireless interface.
  • It should be understood that, in the various embodiments of the present application, the sequence numbers of the foregoing processes do not imply an order of execution; the order of execution of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • The division of units is only a logical function division; in actual implementation there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one network unit or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or each unit may be integrated into one unit.
  • the functions may be stored in a computer readable storage medium if implemented in the form of a software functional unit and sold or used as a standalone product.
  • Based on such an understanding, the technical solution of the present application, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present application.
  • The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for adjusting a state boundary in Q learning, capable of improving the performance of the Q-learning algorithm. The method includes: determining, according to a first state of a system in a first time period, the segment in which the first state is located, and determining the first action with the largest Q value among the multiple actions corresponding to the segment, where the Q value of each action represents the expected benefit value the system can obtain after that action is performed (210); performing the first action and, in a second time period after the first action is performed, calculating the actual benefit value obtained by the system (220); and determining whether there is, among the multiple actions, a second action whose Q value is greater than the actual benefit value, and, if such a second action exists, adjusting the spatial boundary of the segment (230).

Description

Q学习中调整状态空间边界的方法和装置 技术领域
本申请实施例涉及信息技术领域,并且更具体地,涉及Q学习中调整状态空间边界的方法和装置。
背景技术
强化学习(英文全称可以为reinforcement learning)又称再励学习或评价学习,是一种重要的机器学习方法。在智能控制机器人及分析预测等领域有许多应用。所谓强化学习就是智能系统从环境到行为映射的学习,以使奖励值函数的值最大,强化学习中由环境提供的奖励值函数的值是对动作的好坏进行评价,而不是告诉强化学习系统如何去产生正确的动作。由于外部环境提供的信息很少,强化学习必须靠自身的经历进行学习。通过这种方式,强化学习在行动-评价的环境中获得知识,改进行动方案以适应环境。而Q学习(Q-learning)方法是强化学习中的经典算法之一,是一种与模型无关的学习算法。
数据中心集群基于上述Q学习算法对应用(Application)使用的资源进行自适应调度,可以提升数据中心的资源利用率。在现有的基于Q学习的算法中,数据中心通常是根据应用的负载变化情况(或者说,应用的状态)对应用所使用的资源进行调度。而应用的状态大多是通过应用在机器集群中所使用的所有机器的平均资源利用率这一参数进行表征。并且,平均资源利用率这一参数是连续的,而非离散值。现有技术中,为了准确地描述一个应用在各个状态时可采取的候选动作,将原本连续的状态空间进行了离散划分。
但是,将连续的状态空间离散划分,可能造成信息的损失,并导致状态的描述不够准确。从而使得资源调度的结果不甚理想。另外,细粒度的状态空间划分也会使得状态空间过大,导致Q表的收敛速度过慢。
发明内容
本申请提供一种Q学习中调整状态空间边界的方法和装置,能够在加快Q学习算法收敛速度的同时,提升Q学习算法的性能。
第一方面,本申请提供一种Q学习中调整状态空间边界的方法,应用于 业务运行系统,该方法包括:根据系统在第一时段的第一状态,确定第一状态所在的分段,并确定该分段对应的多个动作中Q值最大的第一动作,其中,该分段是系统状态的状态值连续取值范围中的一段,每个动作的Q值用于表示执行每个动作后系统可获得的预期收益值;执行第一动作,并在执行第一动作后的第二时段,计算执行第一动作后,系统获得的实际收益值;判断多个动作中是否存在Q值大于实际收益值的第二动作,若该多个动作中存在Q值大于实际收益值的第二动作,则调整该分段的边界。
应理解,第二时段位于第一时段之后。更具体地,第一时段为执行(或者说,采取)第一动作之前所处的时段。第二时段为执行第一动作之后的时段。
将系统的所有状态按照状态值的大小顺序排列(从大到小或从小到大),从中取出连续的一段即为一个分段。
在本申请实施例中,通过对系统的状态所在分段的边界进行调整,使得系统的状态数量减少,加快了Q学习算法的收敛速度,能够提升算法的性能。
在一种可能的实现方式中,若该多个动作中存在Q值大于实际收益值的第二动作,则调整该分段的边界,包括:将该分段的边界调整为第一状态的状态值。
在一种可能的实现方式中,每个状态的属性使用系统的下列参数中的至少一项进行表征:内存利用率、中央处理器CPU的利用率、网络利用率和所使用的机器数量。
在本发明实施例中,采用多个参数表征状态(也可称为状态空间)的属性,使得Q学习中状态空间的表征多维度化,能够使状态空间的描述更加准确和细化,可以使算法的性能得到进一步优化。
在一种可能的实现方式中,执行第一动作之前,该方法还包括:确定第一状态的状态值是否属于该分段的预设区域,该预设区域内的状态每个状态值与该分段的边界值之间的差值小于或等于预设阈值;当确定第一状态的状态值属于该预设区域时,采用(1-ε)的概率执行第一动作。
具体地,在本发明实施例中,当系统在第一时段所处于的第一状态的状态值为第一状态所在分段和第二状态所在分段的边界值或位于该边界值附近时,选择以(1-ε)的概率执行第一动作,以ε的概率执行第一状态所在分段对应的多个动作中除第一动作之外的其他任一动作。这里,第二状态不同 于第一状态,且第二状态与第一状态所在分段相邻。
可以理解的是,现有的Q学习算法中,每次选择应用在一个状态下的最优动作时采用ε贪婪策略,目的在于平衡算法的探索能力(exploration)以及开采能力(exploitation),以加强算法的探索能力。对那些没有执行过的动作进行尝试,看是否能获得更好的效果。然而过多的进行探索尝试会影响算法的性能。
在本申请实施例中,对处于两个分段边界值附近的那些状态采用ε贪婪策略,可以减少无效的尝试次数,提升算法性能。
在一种可能的实现方式中,调整该分段的边界,包括:采用以下任意一种算法调整该分段的边界:分治法、聚类法和分类法。
需要说明的是,调整分段的边界时,可以采用现有技术中的算法,例如,分治法、聚类法和分类法等。每种算法的具体计算过程可以参考现有技术,本发明实施例对此不作详述。
可选地,在本申请实施例中,当状态空间的属性使用一个参数(即,状态空间为一维)进行表征时,通过本申请实施例提供的调整状态空间边界的方法,可以将应用的状态数量减少到与动作数量相同。
第二方面,本申请提供一种Q学习中调节状态空间边界的装置,用于执行第一方面或第一方面的任意可能的实现方式中的方法。具体地,该装置包括用于执行第一方面或第一方面的任意可能的实现方式中的方法的单元。
第三方面,本申请提供了一种Q学习中调节状态空间边界的设备。具体地,该设备包括:存储器和处理器。其中,存储器用于存储指令,处理器用于执行存储器存储的指令,当指令被执行时,处理器执行第一方面或第一方面的任意可能的实现方式中的方法。
第四方面,本申请提供一种计算机可读介质,用于存储计算机程序,该计算机程序包括用于执行第一方面或第一方面的任意可能的实现方式中的方法的指令。
在本申请实施例中,通过对系统状态所在分段的边界(也即,状态之间的边界)进行调整,使得系统状态的数量减少,加快了Q学习算法的收敛速度,能够提升算法的性能。
附图说明
图1为现有技术中利用Q学习算法进行资源调度的方法100的流程图。
图2为本申请实施例提供的调整状态空间边界的方法200的流程图。
图3为本申请实施例提供的调整分段边界的一个示例。
图4为本申请实施例提供的调整分段边界的另一个示例。
图5为本申请实施例提供的调整状态空间边界的装置500的示意图。
图6为本申请实施例提供的调整状态空间边界的设备600的示意图。
具体实施方式
下面将结合附图,对本发明实施例中的技术方案进行说明。
应理解,本申请实施例的技术方案可以应用于各种领域,例如,数据中心的资源自适应调度领域。其中,数据中心可以包括计算机集群,数据中心可以根据应用的负载变化情况等信息,实时调整分配给该应用的机器(例如,虚拟机、容器等)数量。例如增加或减少机器数量,或保持机器数量不变,等,以在有效满足应用需求的前提下提升数据中心的总体资源利用率。
首先,对本申请实施例中涉及到的基本概念作简单介绍。
应用的状态:描述应用的当前运行情况,可以表示为S(M,U),其中,M表示当前应用所使用的机器数量,U表示应用当前所使用的机器集群中所有机器的平均资源占用率。这里的机器可以包括物理机(Physical Machine,PM)、虚拟机(Virtual Machine,VM)和/或容器(Docker)等。
动作:Q学习算法在数据中心集群中可以采取的各种动作种类(例如,动作个数、动作幅度等)。具体可以根据应用的负载情况设定。例如,当基于Q学习进行数据中心集群中的资源调度时,动作可以用于对分配给应用的资源数量或机器数量进行调整。例如,减少机器数量、保持机器数量不变或增加机器数量。其中,动作对分配给应用的资源的具体调整数量可以根据实际需要设定,本发明的实施例中不作限定。
奖励函数:用来确定Q学习算法在应用状态S时执行动作A后,系统给出对于状态-动作组合(S,A)的系统奖励值,可用来评价在应用状态S时执行动作A的好坏情况。例如,如果奖励函数值为正,说明执行动作A后应用的服务水平目标(Service Level Objective,SLO)能够得到及时满足。如果奖励函数值为负,说明采取动作A后应用的SLO不能得到满足。奖励函数的计算公式可以如下:
作为举例,奖励值函数可以由下式表示:
Figure PCTCN2016108312-appb-000001
其中,U可以表示应用当前使用的所有机器的平均资源占用率,p为配置参数,其默认设置为2。respTime表示数据中心的99%响应时间百分比。SLO可以表示99%响应时间百分比的服务水平目标,用于保障99%的应用都能得到及时响应。
Q值:通过状态-动作对学习到的指的一个函数,用于衡量某个动作对于某个状态的累积回报。具体计算公式可以由下式表示:
Figure PCTCN2016108312-appb-000002
其中,c和γ为可调节参数。r表示奖励函数。Q(st,at)表示应用在时刻t,动作at对于状态st的一个Q值。
Figure PCTCN2016108312-appb-000003
表示应用在状态t+1时,在状态st+1具有最大Q值的动作a所对应的Q值。
Q表:用于记录应用的所有可能状态和所有可选动作所组成的各种可能状态-动作组合的Q值。算法每次在一个状态决定采取哪个动作会根据如下原则进行选择:选取该状态的所有动作中Q值最大的那个动作。
下面的表1为Q学习中Q表的示例。Q表的第1列代表应用的状态。Q表的第2列到第M+1列分别代表M个可选动作。Qij表示由第i行的应用状态和第j列的动作组成的状态-动作组合对应的Q值。
表1
Figure PCTCN2016108312-appb-000004
图1为现有技术中利用Q学习算法进行资源调度的方法100的流程图。如图1所示,方法100主要包括如下步骤101-106。
101、确定应用在时刻t所处于的状态S。
102、根据Q表,确定应用在时刻t,在状态S下采取的动作A。
103、应用执行动作A。
应理解,应用执行动作A,即就是对应用的资源进行调度(例如,增加资源数量、保持资源数量不变或减少资源数量等)。
104、获取应用在时刻t+T的平均资源利用率。
应用执行完动作A之后,系统重新计算该应用的资源利用率。
105、计算状态-动作组合(S,A)的奖励函数值。
具体地,系统根据应用的资源利用率、响应时间、应用的SLO等因素计算采取动作A的奖励函数值,以判断应用在处于状态S时采取动作A的好坏情况。
106、使用状态-动作组合(S,A)的奖励函数值和采取动作A之前状态-动作组合(S,A)的Q值,更新Q表中状态-动作组合(S,A)所对应的Q值。
如上述流程所述,在基于Q学习算法的资源调度中,根据应用在机器集群中使用的所有机器的平均资源利用率对应用使用的资源实时地进行调整。而平均资源利用率这个参数是连续的,而非离散值。现有技术中一般是将应用的状态空间依赖人工经验进行离散划分,得到该应用的一系列离散状态(如表1所示的Q表)。并且,为了提升算法的性能,现有的一种方案中提出将Q表中具有相近Q值的状态进行合并,以减少状态空间的数量。
可以理解的是,在Q学习算法中,一方面,Q值并不能完全反映状态和动作的对应关系。同一状态对应不同动作的Q值的相对值才是有意义的,而不同状态对应动作的Q值的绝对值并无实际意义。因此,将Q值进行合并,会造成信息不准确,并且Q值合并会使得算法的性能无法保证。另一方面,现有技术中通常是依赖经验值将原本连续的状态空间离散化,而划分的粒度会极大影响算法的性能。例如,划分粒度过大,算法的准确度难以保证。而划分粒度过小,算法的收敛速度过慢,效率降低。
为此,本申请实施例提供一种Q学习中调整状态空间边界的方法,能够提高Q学习算法的收敛速度,同时能够提升算法的性能。
下文结合图2至图4,对本申请实施例提供的Q学习中调整状态空间边界的方法进行详细说明。
不失一般性,以处理器作为本申请实施例提供的Q学习中调整状态空间边界的方法的执行主体为例,进行说明。
图2为本申请实施例的调整状态空间边界的方法200的示意性流程图,该方法200应用于业务运行系统。如图2所示,方法200主要包括步骤210-230。
210、处理器(例如,该业务运行系统的处理器)根据系统在第一时段的第一状态,确定第一状态所在的分段,并确定该分段对应的多个动作中Q值最大的第一动作,其中,每个动作的Q值用于表示执行每个动作后系统可获得的预期收益值。
在本申请实施例中,分段是指将系统状态的状态值按照一定的划分粒度进行划分后得到的一段状态值的取值范围。即,将系统的所有状态按照状态值的大小顺序排列(从大到小或从小到大),从中取出连续的一段即为一个分段。
例如,以平均资源利用率作为表征系统状态的参数,将系统的平均资源利用率以10%的粒度划分为10个分档,依次为0-10%、10%-20%、20%-30%、……,80%-90%和90%-100%。其中,每个分档为一个分段。
220、处理器执行第一动作,并在执行第一动作后的第二时段,计算执行第一动作后,系统获得的实际收益值。
其中,第二时段位于第一时段之后。
以Q学习为例,具体地,在步骤220包括以下两个过程:(1)在第一时段,处理器执行第一动作,并计算系统由此获得的奖励值;(2)根据系统奖励值更新第一动作的Q值。
需要说明的是,这里的过程(2),即根据系统奖励值更新第一动作的Q值的过程,可以参考现有技术,这里不作详述。
230、系统处理器判断该多个动作中是否存在Q值大于实际收益值的第二动作,若该多个动作中存在Q值大于实际收益值的第二动作,则调整该分段的边界。
在本申请实施例中,处理器首先根据系统在第一时段所处的状态(以下记作状态S1),确定状态S1对应的分段(以下记作分段#1),并确定分段#1对应的多个动作中Q值最大的动作(以下记作动作A1)。其后,处理器执行动作A1,并在执行动作A1后的第二时段,计算执行动作A1后系统获得的实际收益值。最后,处理器判断分段#1对应的多个动作中,是否存在Q值大于实际收益值的动作(以下记作动作A2),如果存在动作A2,则调整分段#1 的边界。
根据前文所述,分段是系统状态的一段连续的取值。因此,调整分段的边界,也可以说是调整状态之间的边界。
具体地,在本申请实施例中,调整分段的边界可以有多种方式,以下分别作详细说明。
1、根据执行动作后系统获得的实际收益值调整分段的边界。
首先,假设系统在第一时段的状态为状态S1,并且在第一时段,状态S1所在的分段为分段#A,分段#A对应的多个动作中动作A1的Q值最大。
具体地,在第一时段,如果系统在状态S1时,处理器执行分段#A对应的多个动作中的最优动作A1,使得在处理器执行完动作A1后的第二时段,分段#A对应的多个动作中的最优动作发生了变更(例如,最优动作由A1变更为动作A2),则需要对分段#A的边界进行调整。
应理解,调整分段#A的边界,是指调整分段#A与相邻分段的边界值。
图3为本申请实施例提供的调整分段边界的示例。如图3所示,假设分段#A与分段#B之间的原始边界为系统的资源利用率为0.7。
方法执行之前,系统的状态为系统的资源利用率为0.62。并且,Q表中资源利用率为0.62所在的分段对应的多个动作中,Q值最大的动作(即分段对应的最优)动作为动作0。处理器执行动作0之后,处理器计算系统获得的(0.62,动作0)的奖励值。
根据(0.62,动作0)的奖励值,对动作0的Q值进行更新。
更新Q值以后,如果分段#A对应的多个动作中,Q值最大的动作不再为动作0,而是变更为不同于动作0的另外一个动作(假设,变更为动作+1),此时,则对分段#A的边界进行调整。这里,对分段#A的边界进行调整是指对分段#A和分段#B的之间的边界值进行调整。
具体地,根据本申请实施例,应将分段#A和分段#B之间的边界值由原始的0.7调整为0.62。
2、分治法。
分治算法的基本思想是将一个规模为N的问题分解为K个规模较小的子问题,这些子问题相互独立且与原问题性质相同。求出各个子问题的解后,将子问题的解逐层合并,就可得到原问题的解。
将分治法应用于本申请实施例中,可以用来调整分段的边界。
继续以图3所示为例。处理器执行动作0之后,如果发现分段#A对应的最优动作由动作0变为动作+1。这时应将分段#A的边界调整为:
Figure PCTCN2016108312-appb-000005
3、聚类法。
聚类(英文全称可以为:Cluster)法是研究分类问题的一种统计分析方法。聚类分析以相似性为基础,处于同一个类别的元素之间比处于不同的类别之间的元素之间具有更多的相似性。
在本申请实施例中,将聚类法应用于调整分段边界时的流程主要包括如下步骤301-304。
301、对过去一段时间内T内系统的状态数据进行聚类操作。
假定在算法中预设处理器可采取的动作数量为5个,分别为动作-2、动作-1、动作0、动作+1和动作+2。
根据预设的动作数量和种类,经过聚类操作可以将系统的状态数据划分为上述5个类别。
需要说明的是,本申请实施例中对聚类操作的所采用的具体算法不作限定。例如,可以使用经典的聚类算法K-MEANS、改进的K-MEDOIDS算法、Clara算法等。
302、根据系统在当前时刻的状态数据对应的最大Q值,确定该状态数据的最优动作(记作动作A1),并将该状态数据加入到该最优动作(即,动作A1)所属于的类别(记作类别#P)中。
303、更新类别#P的聚类中心位置。
304、重新计算类别#P的边界值。
具体地,假设重新确定的聚类中心为ui,半径为ri。相邻的聚类中心为uj,半径为rj,则新的类别#P的边界值应为:
Figure PCTCN2016108312-appb-000006
4、分类法。
将系统的状态S作为输入,将处理器在系统处于状态S时应采取的最优动作A作为输出,采用支持向量机(Support Vector Machine,SVM)、决策树等分类方法确定两个相邻分段的边界值。
具体地,利用过去一段时间T内应用的状态数据,采用支持向量机的方 式确定两个相邻分段之间的边界值。当有新的数据加入时,重新运行支持向量机的方法确定新的边界值。
可选地,作为一个实施例,可以采用分类法的中的逻辑回归法确定两个相邻分段的边界值。
具体地,将逻辑回归法应用于本申请实施例中时,主要思想是利用过去一段时间T内系统的状态数据(或者说,系统状态的历史数据),采用逻辑回归的方法来确定两个相邻分段之间的状态空间边界值。当有新的数据加入时,重新运行逻辑回归方法确定分段之间新的边界值。
需要说明的是,前文所述的分治法、聚类法和分类法(例如,逻辑回归法)的具体实现过程可以参考现有技术,此处不作详细描述。
在前文所述的实施例中,系统的状态属性仅以系统的平均资源利用率为例。
可选地,作为一个实施例,每个状态的属性使用系统的下列参数中的至少一项进行表征:内存利用率、中央处理器CPU的利用率、网络利用率和该系统所使用的机器数量。
优选地,当状态的属性使用一个参数进行表征时,通过本申请实施例提供的调整状态空间边界的方法,可以将系统的状态数量减少到与可执行的动作的数量相等。
可以理解的是,当使用多个参数表征一个系统状态的属性时,使用前文所述的任意一种算法确定出的两个相邻分段之间的边界将是一个多维的边界。采用多个参数表征状态空间的属性,使得Q学习中状态的表征多维度化,能够使状态的描述更加准确和细化,可以使算法的性能得到进一步优化。
图4示出了本申请实施例提供的调整分段边界的又一个示例。如图4所示,采用分类法中的逻辑回归法确定相邻分段的边界值。
图4中所示的“星”和“点”为系统在过去一段时间T内的状态数据。这些状态数据是用两个属性来表征的,例如,平均资源利用率和机器数量。图4中横轴(x轴)表示平均资源利用率,纵轴(y轴)表示机器数量。这里,将用平均资源利用率表征的数据对应的分段记作分段#A,将用机器数量表征的数据对应的分段记作分段#B。
具体地,处理器首先对平均资源利用率进行归一化,使其数值取值范围与机器数量数值取值范围相同。然后,采用逻辑回归的方法来确定两个相邻 分段之间的边界值。
如图4中所示,y=x是这两类数据的分割线,即就是说,机器数量=100×平均资源利用率,为实际的分段边界。当机器数量小于(100×平均资源利用率)时,属于分段#A,反之,属于分段#B。当有新的状态数据加入时,运行逻辑回归方法重新确定分段的边界。
可选地,作为一个实施例,执行第一动作之前,该方法还包括:
确定第一状态的状态值是否属于该分段的预设区域,该预设区域内每个状态的状态值与该分段的边界值之间的差值小于或等于预设阈值;
当确定第一状态的状态值属于该预设区域时,采用(1-ε)的概率执行第一动作。
应理解,预设区域实际上为该分段中靠近分段边界值的一部分区域。即,预设区域中所包括的状态位于该分段边界值附近(包括分段的边界值)。
预设区域可以根据资源调度情况进行设置和调整。例如,当状态空间较大时,可以选择将预设区域设置的大一些,以加算法的收敛速度。而当状态空间已经调整的较小时,此时,可以将预设区域设置的小一些,以细化状态空间边界的划分,以使得状态的划分更加准确。
在本申请实施例中,对于预设阈值的具体取值不作限定。实际上,预设阈值与预设区域是对应变化的。当预设区域较大时,预设阈值相应地较大。当预设区域较小时,预设阈值也相应地较小。预设阈值应为预设区域的首尾两个状态值之差的绝对值。
例如,参见图3,假设预设区域设定为平均资源利用率为1%。如果系统在某一时段的平均资源利用率为69.8%,此时,平均资源率为69.8%落在分段(30%-70%)的预设区域。根据本申请实施例,此时,处理器可以选择以(1-ε)的概率执行分段(30%-70%)对应的最优动作,即执行动作0。也可以以ε的概率执行分段(30%-70%)对应的多个动作(图3中未示出)中的其它任一动作。而如果系统的平均资源利用率为65%,不属于分段(30%-70%)的预设区域,则处理器确定执行动作0。
从图3中可以看出,系统状态的每个分段各有两个边界值。不管系统在某个时刻的状态落入一个分段的上边界或下边界,均以(1-ε)的概率执行该分段对应的最优动作。
因此,平均资源率为69.8%可以认为是落入分段(30%-70%)上边界的 预设区域。再以假定系统的平均资源利用率为30.5%为例,此时,平均资源利用率落入分段(30%-70%)下边界的预设区域内,此时,处理器也应以(1-ε)的概率执行动作0。
在现有技术中,强化学习算法每次在选择一个状态(记作状态#A)下的最优动作时,都采用ε贪婪策略。ε贪婪策略是指系统处理器在进行动作选择和执行时,将以(1-ε)的概率选择状态#A所在分段中Q值最大的动作,而以ε的概率选择该分段对应的多个动作中除最优动作以外的其它任一动作。
可以理解的是,采用ε贪婪策略的初衷是在于平衡算法的探索能力(exploration)以及开采能力(exploitation),以加强算法的探索能力。对那些没有执行过的动作进行尝试,看是否能获得更好的效果。然而过多的进行探索尝试会影响算法的性能。
在本申请实施例中,我们认为处于两个分段边界附近的那些状态更值得采用ε贪婪策略。因为分段边界附近的状态正好位于可能采取的相邻两种动作之间,会出现动作选择波动的可能性将更大。因此,在本申请实施例中,当系统在第一时段的第一状态的状态值为第一分段和第二分段的边界值,或者第一状态的状态值处于第一分段和第二分段的边界值附近时,处理器选择以(1-ε)的概率采取第一分段对应的多个动作中Q值最大的第一动作,以ε的概率执行第一分段对应的多个动作中除第一动作之外的其他任一动作。
通过对处于分段边界附近的状态采取ε贪婪策略,可以减少无效的尝试次数,从而能够提升算法性能。
下面结合表2和表3,对本申请实施例的Q学习中调整状态边界的方法进行举例说明。
为了便于理解和说明,首先作如下假设:(1)Q学习可采取的动作有5类,分别为:减少2台机器、减少1台机器、保持机器数量不变、增加1台机器和增加2台机器。(2)以10%为划分粒度,将系统的平均资源利用率划分为10档。(3)系统机器数量总量为100台。系统当前使用的机器数量为1台。
这里仅以采用前文所述的聚类法作为示例,对调整相邻分段边界的过程进行说明。
表2为分段边界调整前Q表通过聚类法进行聚类后得到的结果。根据本申请实施例的方法,具体的流程如下:
401、假设系统当前的状态为平均资源利用率为0.36,查询Q表得到,平均资源利用率为0.36所在的分段应为平均资源利用率为(30%-70%)范围的分段(记作分段#A),分段#A对应的多个动作中Q值最大的动作为动作0,即保持系统使用的机器数量不变。根据奖励函数计算(0.36,动作0)的奖励值,并得到奖励值为正。根据奖励值更新动作0的Q值后,系统的平均资源利用率为0.36所在的分段对应的最优动作变为动作-1,表明系统在当前的平均资源利用率(即,0.36)下,系统使用的机器数量存在浪费现象,动作-1更好。由于分段#A对应的多个动作中的最优动作发生了变化,所以需要对分段#A的边界进行更新。
在步骤401中,系统在某个状态时,系统处理器执行该状态所在分段对应的最优动作后,处理器计算状态-动作组合的奖励值以及更新该最优动作的Q值的具体过程可以参考现有技术,这里不作详述。
表2
Figure PCTCN2016108312-appb-000007
402、将0.36加入到类别[0.1,0.3]中,重新计算该类别的中心ui和半径ri
403、重新确定新的边界值。
假设新的聚类中心为ui,半径为ri。相邻聚类中心为uj,半径为rj,则新的边界值应更新为:
Figure PCTCN2016108312-appb-000008
分段#A的边界更新后的Q表如表3所示。
表3
Figure PCTCN2016108312-appb-000009
可见,分段(10%-30%)与分段(30%-70%)的边界由30%更新为33%。
本申请实施例的调整状态边界的方法,还可以适用在利用Q学习算法进行动作预测的场景中。例如,移动网络的动态信道分配、机器人的动作预测等。在机器人动作预测的场景中,我们可以定义机器人的动作为向左移动1步、向左移动2步、原地不动、向右移动1步、向右移动2步等,而状态空间可以定义为机器人当前距离目的地的位置距离(例如,可以为经度距离、纬度距离等)。这样,通过减少状态空间的数量,可以快速指导机器人得到一个更加准确的动作预测。
在本申请实施例中,通过对系统状态划分得到的多个分段之间的边界进行调整,使得系统的状态数量减少,加快了Q学习算法的收敛速度,提升了算法的性能。
以上结合图1至图4,对本申请实施例的调整状态边界的方法作了详细说明,以下结合图5和图6,对本申请实施例的调整状态边界的装置和设备进行说明。
图5是本申请实施例的调整状态边界的装置500的示意性框图。该装置500配置在业务运行系统中。如图5所示,装置500包括:
处理单元510,用于根据系统在第一时段所处于的第一状态,确定第一状态对应的分段,并确定分段对应的多个动作中Q值最大的第一动作,其中,每个动作的Q值用于表示执行每个动作后系统可获得的预期收益值;
处理单元510,还用于执行第一动作,并在执行第一动作后的第二时段,计算执行第一动作后,系统获得的实际收益值;
处理单元510,还用于判断该多个动作中是否存在Q值大于实际收益值 的第二动作,若该多个动作中存在Q值大于实际收益值的第二动作,则调整该分段的边界。
本申请实施例的调整状态边界的装置500中的各单元和上述其它操作或功能分别为了实现上述调整状态边界的方法200中的相应流程。为了简洁,此处不再赘述。
应理解,这里的处理单元可以为处理器。装置500还应包括存储单元。存储单元可以为存储器。存储器用于存储计算机指令。处理器用于执行存储器中存储的计算机指令。当计算机指令被执行时,处理器执行本申请实施例提供的调整状态边界的方法200的相应步骤。
图6为本申请实施例提供的调整状态边界的设备600的示意性结构图。如图6所示,设备600包括:存储器610、处理器620和通信接口630。其中,存储器610、处理器620和通信接口630通过通信总线640相互连接。
存储器610用于存储执行本发明方案的应用程序、代码或指令。处理器620用于执行存储器610中存储的应用程序、代码或指令,以完成Q学习中调整状态边界的方法200以及各实施例中的相应流程和/或操作。为了简洁,此处不再赘述。
应理解,图5中提供的调整状态边界的装置500,可以通过图6中所示的调整状态边界的设备600来实现。例如,图5中的处理单元可以由图6中的处理器620实现,存储单元可以由存储器610来实现。
图6中所示的处理器620,可以为中央处理器(CPU)、微处理器、特定应用集成电路(application-specific integrated circuit,ASIC),或一个或多个用于控制本发明方案程序执行的集成电路。
图6中所示的存储器610,可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(Electrically Erasable Programmable Read-Only Memory,EEPROM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器可以是独立存 在,通过通信总线(例如,图6中的通信总线640)与处理器相连接。存储器也可以和处理器集成在一起。
通信总线640除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。为了清楚说明起见,在图中将各种总线都标为通信总线。
通信接口630可以是有线接口,例如光纤分布式数据接口(Fiber Distributed Data Interface,简称FDDI)、千兆以太网(Gigabit Ethernet,简称GE)接口等,也可以是无线接口。
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的各实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个网络单元,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本申请实施例的方案的目的。
另外,在本申请各实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以各单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请具体实施方式,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (10)

  1. 一种Q学习中调整状态空间边界的方法,应用于业务运行系统,其特征在于,包括:
    根据所述系统在第一时段的第一状态,确定所述第一状态所在的分段,并确定所述分段对应的多个动作中Q值最大的第一动作,其中,所述分段是所述系统状态的状态值连续取值范围中的一段,每个动作的Q值用于表示执行所述每个动作后所述系统可获得的预期收益值;
    执行所述第一动作,并在执行所述第一动作后的第二时段,计算执行所述第一动作后,所述系统获得的实际收益值;
    判断所述多个动作中是否存在Q值大于所述实际收益值的第二动作,若所述多个动作中存在Q值大于所述实际收益值的第二动作,则调整所述分段的边界。
  2. 根据权利要求1所述的方法,其特征在于,若所述多个动作中存在Q值大于所述实际收益值的第二动作,则调整所述分段的边界,包括:
    将所述分段的边界调整为所述第一状态的状态值。
  3. 根据权利要求1或2所述的方法,其特征在于,每个状态的属性使用所述系统的下列参数中的至少一项进行表征:
    内存利用率、中央处理器CPU的利用率、网络利用率和所使用的机器数量。
  4. 根据权利要求1至3中任一项所述的方法,其特征在于,所述执行所述第一动作之前,所述方法还包括:
    确定所述第一状态的状态值是否属于所述分段的预设区域,所述预设区域内每个状态的状态值与所述分段的边界值之间的差值小于或等于预设阈值;
    当确定所述第一状态的状态值属于所述预设区域时,采用(1-ε)的概率执行所述第一动作。
  5. 根据权利要求1至4中任一项所述的方法,其特征在于,所述调整所述分段的边界,包括:
    采用以下任意一种算法调整所述分段的边界:
    分治法、聚类法和分类法。
  6. 一种Q学习中调整状态空间边界的装置,配置在业务运行系统中, 其特征在于,包括:
    处理单元,用于根据所述系统在第一时段所处于的第一状态,确定所述第一状态对应的分段,并确定所述分段对应的多个动作中Q值最大的第一动作,其中,所述分段是所述系统状态的状态值连续取值范围中的一段,每个动作的Q值用于表示执行所述每个动作后所述系统可获得的预期收益值;
    所述处理单元,还用于执行所述第一动作,并在执行所述第一动作后的第二时段,计算执行所述第一动作后,所述系统获得的实际收益值;
    所述处理单元,还用于判断所述多个动作中是否存在Q值大于所述实际收益值的第二动作,若所述多个动作中存在Q值大于所述实际收益值的第二动作,则调整所述分段的空间边界。
  7. 根据权利要求6所述的装置,其特征在于,所述处理单元具体用于将所述分段的空间边界调整为所述第一状态的状态值。
  8. 根据权利要求6或7所述的装置,其特征在于,所述每个状态的属性使用所述系统的下列参数中的至少一项进行表征:
    内存利用率、中央处理器CPU的利用率、网络利用率和所使用的机器数量。
  9. 根据权利要求6至8中任一项所述的装置,其特征在于,所述处理单元具体用于:
    确定所述第一状态的状态值是否属于所述分段的预设区域,所述预设区域内每个状态的状态值与所述分段的边界值之间的差值小于或等于预设阈值;
    当确定所述第一状态的状态值属于所述预设区域时,采用(1-ε)的概率执行所述第一动作。
  10. 根据权利要求6至9中任一项所述的装置,其特征在于,所述处理单元具体用于采用以下任意一种算法调整所述分段的边界:
    分治法、聚类法和分类法。
PCT/CN2016/108312 2016-12-02 2016-12-02 Q学习中调整状态空间边界的方法和装置 WO2018098797A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201680056875.0A CN108476084B (zh) 2016-12-02 2016-12-02 Q学习中调整状态空间边界的方法和装置
PCT/CN2016/108312 WO2018098797A1 (zh) 2016-12-02 2016-12-02 Q学习中调整状态空间边界的方法和装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/108312 WO2018098797A1 (zh) 2016-12-02 2016-12-02 Q学习中调整状态空间边界的方法和装置

Publications (1)

Publication Number Publication Date
WO2018098797A1 true WO2018098797A1 (zh) 2018-06-07

Family

ID=62241176

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/108312 WO2018098797A1 (zh) 2016-12-02 2016-12-02 Q学习中调整状态空间边界的方法和装置

Country Status (2)

Country Link
CN (1) CN108476084B (zh)
WO (1) WO2018098797A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115875091A (zh) * 2021-09-26 2023-03-31 国能智深控制技术有限公司 汽轮机阀门流量特性的监测方法、装置和可读存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101466111A (zh) * 2009-01-13 2009-06-24 中国人民解放军理工大学通信工程学院 基于政策规划约束q学习的动态频谱接入方法
CN104168087A (zh) * 2014-08-08 2014-11-26 浙江大学 无速率编码传输系统中基于q学习的传输帧长主动自适应调整方法
WO2015054264A1 (en) * 2013-10-08 2015-04-16 Google Inc. Methods and apparatus for reinforcement learning
CN104635772A (zh) * 2014-12-08 2015-05-20 南京信息工程大学 一种制造系统自适应动态调度方法

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571570A (zh) * 2011-12-27 2012-07-11 广东电网公司电力科学研究院 一种基于强化学习的网络流量负载均衡控制方法
CN102868972B (zh) * 2012-09-05 2016-04-27 河海大学常州校区 基于改进q学习算法的物联网错误传感器节点定位方法
CN104200077A (zh) * 2014-08-22 2014-12-10 广西师范大学 基于子空间学习的嵌入式属性选择方法及其应用
US10460254B2 (en) * 2015-03-17 2019-10-29 Vmware, Inc. System and method for reducing state space in reinforced learning by using decision tree classification
CN105260230B (zh) * 2015-10-30 2018-06-26 广东石油化工学院 基于分段服务等级协议的数据中心虚拟机资源调度方法
CN105930214B (zh) * 2016-04-22 2019-04-26 广东石油化工学院 一种基于q学习的混合云作业调度方法
CN106157650A (zh) * 2016-07-11 2016-11-23 东南大学 一种基于强化学习可变限速控制的快速道路通行效率改善方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101466111A (zh) * 2009-01-13 2009-06-24 中国人民解放军理工大学通信工程学院 基于政策规划约束q学习的动态频谱接入方法
WO2015054264A1 (en) * 2013-10-08 2015-04-16 Google Inc. Methods and apparatus for reinforcement learning
CN104168087A (zh) * 2014-08-08 2014-11-26 浙江大学 无速率编码传输系统中基于q学习的传输帧长主动自适应调整方法
CN104635772A (zh) * 2014-12-08 2015-05-20 南京信息工程大学 一种制造系统自适应动态调度方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115875091A (zh) * 2021-09-26 2023-03-31 国能智深控制技术有限公司 汽轮机阀门流量特性的监测方法、装置和可读存储介质
CN115875091B (zh) * 2021-09-26 2024-01-09 国能智深控制技术有限公司 汽轮机阀门流量特性的监测方法、装置和可读存储介质

Also Published As

Publication number Publication date
CN108476084B (zh) 2020-05-08
CN108476084A (zh) 2018-08-31

Similar Documents

Publication Publication Date Title
Chen et al. Self-adaptive resource allocation for cloud-based software services based on iterative QoS prediction model
US11115421B2 (en) Security monitoring platform for managing access rights associated with cloud applications
US20230325721A1 (en) Systems and methods for implementing an intelligent machine learning optimization platform for multiple tuning criteria
CN111124689B (zh) 一种集群中容器资源动态分配方法
US8996504B2 (en) Plan caching using density-based clustering
US20220351019A1 (en) Adaptive Search Method and Apparatus for Neural Network
US11614978B2 (en) Deep reinforcement learning for workflow optimization using provenance-based simulation
Russo et al. Reinforcement learning based policies for elastic stream processing on heterogeneous resources
Zhao et al. Load shedding for complex event processing: Input-based and state-based techniques
US11544290B2 (en) Intelligent data distribution and replication using observed data access patterns
US11409453B2 (en) Storage capacity forecasting for storage systems in an active tier of a storage environment
CN106202092A (zh) 数据处理的方法及系统
Siddesha et al. A novel deep reinforcement learning scheme for task scheduling in cloud computing
CN111966495A (zh) 数据处理方法和装置
Kafle et al. Intelligent and agile control of edge resources for latency-sensitive IoT services
WO2023089350A1 (en) An architecture for a self-adaptive computation management in edge cloud
CN112000460A (zh) 一种基于改进贝叶斯算法的服务扩缩容的方法及相关设备
WO2018098797A1 (zh) Q学习中调整状态空间边界的方法和装置
US10860236B2 (en) Method and system for proactive data migration across tiered storage
Naik et al. Developing a cloud computing data center virtual machine consolidation based on multi-objective hybrid fruit-fly cuckoo search algorithm
Liu et al. An improved affinity propagation clustering algorithm for large-scale data sets
Tong et al. Energy and performance-efficient dynamic consolidate VMs using deep-Q neural network
CN114138416A (zh) 面向负载-时间窗口的基于dqn云软件资源自适应分配方法
CN110765345B (zh) 搜索方法、装置以及设备
Joseph et al. An incremental off-policy search in a model-free Markov decision process using a single sample path

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16922864

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16922864

Country of ref document: EP

Kind code of ref document: A1