CN113946428A - Processor dynamic control method, electronic equipment and storage medium - Google Patents

Processor dynamic control method, electronic equipment and storage medium

Info

Publication number
CN113946428A
CN113946428A (application CN202111288651.1A; granted as CN113946428B)
Authority
CN
China
Prior art keywords
processor
duration
value table
state
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111288651.1A
Other languages
Chinese (zh)
Other versions
CN113946428B (en)
Inventor
彭嘉乔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202111288651.1A priority Critical patent/CN113946428B/en
Priority claimed from CN202111288651.1A external-priority patent/CN113946428B/en
Publication of CN113946428A publication Critical patent/CN113946428A/en
Application granted granted Critical
Publication of CN113946428B publication Critical patent/CN113946428B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Power Sources (AREA)

Abstract

Embodiments of the disclosure provide a processor dynamic control method, an electronic device, and a storage medium. The method comprises the following steps: looking up a Q value table according to current state information of the processor to obtain a control action; and adjusting the processor according to the processor operating parameters corresponding to the control action. The Q value table is pre-established according to a Q-learning algorithm; the state set corresponding to the Q value table is a set of states divided according to the processor state information, and the action set corresponding to the Q value table is a set of adjustment actions that adjust the operating parameters of the processor to different target operating parameters. The processor dynamic control scheme provided by the embodiments of the disclosure incorporates reinforcement learning and dynamically determines the optimal control target according to the processor's current system state and previous learning results, so that the optimal control scheme can be selected adaptively and the control target achieved effectively.

Description

Processor dynamic control method, electronic equipment and storage medium
Technical Field
The present invention relates to, but is not limited to, the field of processor control, and in particular to a method for dynamically controlling a processor, an electronic device, and a storage medium.
Background
The dynamic control of processor operating modes or parameters is a direction of continuing exploration and optimization in the field of processor control. It aims to save energy and/or optimize chip performance and involves many different techniques, such as the common DVFS energy-saving scheme for processor chips and the CPU idle scheme.
Taking the energy-saving scheme of the processor chip as an example, the power consumption of a CMOS circuit divides into dynamic power consumption and static power consumption, which can be expressed by the formula:

P = α·C_L·V²·f + α·V·I_short + V·I_leak

The first term is the dynamic power consumption, where α is the activity factor (i.e., the switching rate) of the CMOS circuit, C_L is the load capacitance, V is the supply voltage, and f is the clock frequency. The other two terms are static power consumption, caused mainly by leakage current. When power consumption must be reduced, usually only V and f can be adjusted, starting from the dynamic term.
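As a purely illustrative sketch (not part of the claimed embodiments), the power formula can be evaluated numerically; the function name and parameter values below are assumptions chosen for the example:

```python
def cmos_power(alpha, c_load, v, f, i_short=0.0, i_leak=0.0):
    """Total CMOS power: P = a*C_L*V^2*f + a*V*I_short + V*I_leak."""
    dynamic = alpha * c_load * v * v * f          # switching term
    short_circuit = alpha * v * i_short           # short-circuit term
    leakage = v * i_leak                          # leakage term
    return dynamic + short_circuit + leakage

# Halving both V and f cuts the dynamic term by 8x, since it scales as V^2 * f.
p_full = cmos_power(alpha=0.1, c_load=1e-9, v=1.0, f=1e9)    # 0.1 W
p_low = cmos_power(alpha=0.1, c_load=1e-9, v=0.5, f=0.5e9)   # 0.0125 W
```

The cubic sensitivity of the dynamic term to (V, f) is exactly why DVFS adjusts voltage and frequency together.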
DVFS (Dynamic Voltage and Frequency Scaling), i.e., dynamic voltage and frequency adjustment, is a common chip energy-saving technology. At run time it dynamically lowers the voltage and frequency to reduce power consumption when performance is redundant, and raises them to meet working requirements when high performance is needed, striking a reasonable balance between power consumption and performance.
DVFS places comprehensive requirements on both software and hardware. On the hardware side, Intel SpeedStep technology lets a CPU switch between a high and a low frequency: it drops to the low frequency on battery power and restores the high frequency on AC power. From this, EIST (Enhanced Intel SpeedStep Technology) was derived, which dynamically reduces the CPU frequency when CPU utilization is low and reverts to the original frequency once high utilization is detected. The IEM (Intelligent Energy Manager) solution proposed by ARM is software-centric; it works with IEC (Intelligent Energy Controller) hardware, interfaces with the operating system, measures the current system performance level using counters, timers, and parameters obtained from the operating system, and predicts future performance. ARM also proposed AVS (Adaptive Voltage Scaling), a closed-loop solution that evaluates factors such as process variation between chips, temperature fluctuation during chip operation, and load variation, and determines the optimal voltage-frequency relationship under those conditions.
On the pure software side, Linux provides the CPU frequency-scaling framework cpufreq, which supports different policies (governors). The commonly used governors include performance (always the highest frequency), powersave (always the lowest frequency), userspace (frequency set from user space), ondemand (jumps to a high frequency under load and steps down when idle), conservative (adjusts the frequency gradually with load), and schedutil (driven by scheduler utilization data).
It can be seen that how to control power consumption accurately and effectively while meeting the usage requirements of the processor remains a direction of continuing exploration and improvement in the field of processor control.
Disclosure of Invention
The embodiments of the disclosure provide a processor dynamic control method, an electronic device, and a storage medium, which incorporate reinforcement learning and dynamically determine an optimal control target according to the current system state of the processor and previous learning results, realizing control over the processor operating parameters and adaptively selecting the optimal control scheme to effectively achieve the control target.
The embodiment of the disclosure provides a processor dynamic control method, which includes:
searching a Q value table according to the current state information of the processor to acquire a control action;
adjusting the processor according to the processor operation parameters corresponding to the control action;
the Q value table is pre-established according to a Q-learning algorithm, the state set corresponding to the Q value table is a set of various states divided according to the state information of the processor, and the action set corresponding to the Q value table is a set of various adjusting actions for adjusting the operating parameters of the processor to different target operating parameters.
An embodiment of the present disclosure further provides an electronic device, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the processor dynamic control method according to any embodiment of the disclosure.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the computer program implements a processor dynamic control method according to any embodiment of the present disclosure.
Other aspects will be apparent upon reading and understanding the attached drawings and detailed description.
Drawings
In order to illustrate more clearly the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow chart of a method for dynamic control of a processor according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a Q-value table updating method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating another Q-value table updating method according to an embodiment of the disclosure.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that all directional indicators in the embodiments of the present invention (such as up, down, left, right, front, and rear) are used only to explain the relative positional relationship, movement, and the like of the components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indicator changes accordingly.
In addition, the descriptions related to "first", "second", etc. in the present invention are only for descriptive purposes and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "connected," "secured," and the like are to be construed broadly, and for example, "secured" may be a fixed connection, a removable connection, or an integral part; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In addition, the technical solutions in the embodiments of the present invention may be combined with each other, but it must be based on the realization of those skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination of technical solutions should not be considered to exist, and is not within the protection scope of the present invention.
Existing solutions for processor chip power saving, as well as system-level voltage management solutions, must account for uncertainty and variability in the environment, the application, and the hardware. In the related art, such schemes still respond passively to load changes, making decisions from prior settings or from statistics gathered over the previous observation period; because of this hysteresis, they struggle to respond in time to scenarios that suddenly demand high performance. For a periodically changing load (common in baseband chip working scenarios), they cannot record past decisions to save decision-making overhead, nor adjust in advance to jointly optimize performance and power consumption. A more powerful processor control and power management technique needs to learn from past events in order to make optimal decisions adaptively.
The processor dynamic control scheme provided by the embodiments of the disclosure is based on reinforcement learning. Q-learning is a value-based reinforcement-learning algorithm. It models the learning process as a sequence of states and actions, assuming the states satisfy the Markov property of a Markov decision process (MDP): the next state depends only on the current state, not on the further past. This can be expressed as:

P(s_n | s_{n−1}) = P(s_n | s_{n−1}, …, s_0)

An MDP describes a fully observable environment; that is, the observed state completely determines the characteristics needed for the decision. If this assumption holds, state changes can be described by a probabilistic model P(s′ | s, a), i.e., performing action a in state s transitions to state s′ with the corresponding probability.
The main idea of the algorithm is to construct a Q value table over the states and actions, update it through learning, and then select the action that obtains the maximum benefit in the current state according to the Q values. Once the Q value table has been learned, the overhead of maintaining it and looking it up is very low, meeting the strict performance requirements of embedded systems.
Defining π (a | s) as a policy for taking action a in state s, R (s '| s, a) represents reward rewarded for taking action a in state s and transitioning to state s'. The goal is to find the strategy that maximizes the cumulative prize, i.e.:
Figure BDA0003334147720000061
where γ is the decay value, with closer γ to 1 representing more emphasis on the subsequent state and closer γ to 0 representing more emphasis on the current benefit.
The optimal decision sequence of the MDP is solved with the Bellman equation. A state-value function V_π(s) evaluates the current state; the value function is determined by the current state and the states that follow:

V_π(s) = E_π[ r_{t+1} + γ(r_{t+2} + γ(…)) | S_t = s ]
V_π(s) = E_π[ r_{t+1} + γ·V_π(s′) | S_t = s ]

Let V*(s) denote the optimal cumulative expectation; then:

V*(s) = max_π V_π(s)
V*(s) = max_a Σ_{s′} P(s′ | s, a) (R(s, a, s′) + γ·V*(s′))
Accordingly, for the Q-value function Q(s, a) in the Q-learning algorithm:

q_π(s, a) = E_π[ r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … | A_t = a, S_t = s ]
q_π(s, a) = E_π[ r_{t+1} + γ·q_π(S_{t+1}, A_{t+1}) | A_t = a, S_t = s ]

The optimal Q value Q*(s, a) = max_π q_π(s, a) satisfies:

Q*(s, a) = Σ_{s′} P(s′ | s, a) (R(s, a, s′) + γ·max_{a′} Q*(s′, a′))
Sampling this expectation with an observed transition (s, a, r, s′) gives the Q-value target:

Q_target = r + γ·max_{a′} Q(s′, a′)

From the above derivation, learning proceeds by iteratively updating the Q value:

Q(s, a) ← Q(s, a) + α[ r + γ·max_{a′} Q(s′, a′) − Q(s, a) ]

where α is the learning rate, which controls the update speed.
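The iterative update above fits in a few lines of code. The following sketch is illustrative only; the table layout (a dict of per-state action dicts) is an assumption for the example:

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One Q-learning step: Q(s,a) += alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[s_next].values())           # max_a' Q(s', a')
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# Two states, two actions; observe reward 1.0 moving from state 0 to state 1.
Q = {0: {"up": 0.0, "down": 0.0}, 1: {"up": 1.0, "down": 0.0}}
q_update(Q, s=0, a="up", r=1.0, s_next=1)
# Q[0]["up"] moves halfway (alpha=0.5) toward the target r + gamma*1.0 = 1.9
```

Note that the target uses max over the next state's actions regardless of which action will actually be taken, which is what makes Q-learning off-policy.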
Q-learning is an improvement on the temporal-difference method (which combines Monte Carlo sampling with dynamic programming) and supports off-policy learning. Q-learning does not follow the interaction sequence exactly: at the next step it selects the action that maximizes the value. In algorithmic terms, Q-learning combines the optimal values of sub-problems, intending each update to use the best result accumulated iteratively. This also gives the Q-learning algorithm the problem of overestimating action values: because it always uses the highest-valued action and neglects the actions of non-sampled states, a large error can arise between the two.
The scheme provided by the embodiment of the disclosure combines Q-learning with processor operating parameter control to realize dynamic control of the processor.
An embodiment of the present disclosure provides a processor dynamic control method, as shown in fig. 1, including:
step 110, searching a Q value table according to the current state information of the processor to acquire a control action;
step 120, adjusting the processor according to the processor operation parameters corresponding to the control action;
the Q value table is pre-established according to a Q-learning algorithm, the state set corresponding to the Q value table is a set of various states divided according to the state information of the processor, and the action set corresponding to the Q value table is a set of various adjustment actions for adjusting the operating parameters of the processor to different target operating parameters
In some exemplary embodiments, the processor comprises: a central processing unit CPU, a micro control unit MCU, a graphic processing unit GPU or an embedded neural network processor NPU and the like.
In some exemplary embodiments, the operating parameters of the processor include: voltage and/or frequency; accordingly, the processor dynamic control scheme is also referred to as a dynamic voltage frequency scaling, DVFS, scheme of the processor.
In some exemplary embodiments, the operating parameters of the processor include: a timestamp of the processor entering and/or exiting the idle state; accordingly, the processor dynamic control scheme is also referred to as an IDLE scheme of the processor.
In some exemplary embodiments, the status information includes: the number of active processor clock cycles within a first duration;
the states in the state set corresponding to the Q value table respectively correspond to active processor clock cycle number intervals within a plurality of first durations.
That is, each state in the state set corresponds to an interval of the number of processor clock cycles that is active for the first duration.
In some exemplary embodiments, the status information includes: the number of active processor clock cycles within the first duration and the slack of system tasks within the first duration;
the plurality of states in the state set corresponding to the Q value table respectively correspond to a plurality of combinations of an active processor clock cycle number interval within the first duration and a system task slack interval within the first duration.
That is, each state in the state set corresponds to a combination of an interval of the number of processor clock cycles active for the first duration and a slack interval of system tasks for the first duration.
It should be noted that, taking a CPU as an example, the number of active processor clock cycles in the first duration is the number of active CPU clock cycles in the first duration. In some exemplary embodiments, the hardware circuit feeds the CPU's POWER_OFF / WFI / WFE (power-off / wait-for-interrupt / wait-for-event) signals through an OR gate, feeds the OR gate's output together with the CPU clock signal into a NAND gate, and finally connects a counter to the NAND gate's output to obtain the number of active CPU clock cycles within a set duration.
In some exemplary embodiments, the number of active processor clock cycles in the first duration is determined based on a power-off (POWER_OFF) signal of the processor, a wait-for-interrupt (WFI) signal, a wait-for-event (WFE) signal, a clock signal of the processor, and a counter within the first duration.
In some exemplary embodiments, the number of active processor clock cycles within the first length of time is determined according to the following:
During the first duration, the power-off, wait-for-interrupt, and wait-for-event signals of the processor are input to an OR gate; the OR gate's output and the clock signal of the processor are fed into a NAND gate; and the NAND gate's output is connected to a counter to obtain the number of active processor clock cycles within the first duration.
It should be noted that the active state of the processor refers to its normal operating state, as opposed to other low-power standby states. Taking an embedded system as an example, the ARM architecture has two instructions, WFI/WFE, which let the ARM core immediately enter a low-power standby mode and wait for a related wake event. x86 systems have a similar low-power standby mechanism. These low-power standby states are denoted Inactive.
Accordingly, the number of active processor clock cycles in the first duration, i.e., the number of clock cycles in the first duration that the processor is in a normal operating state.
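The gating described above can be modelled in software for clarity. This is a sketch of the counting logic only, not the hardware circuit; the function name and the trace format are assumptions:

```python
def count_active_cycles(samples):
    """Count clock cycles in which the core is active, i.e. none of the
    POWER_OFF / WFI / WFE signals is asserted. Each sample is a
    (power_off, wfi, wfe) tuple observed on one clock edge; the OR of the
    three signals, gated against the clock, drives the hardware counter."""
    return sum(1 for power_off, wfi, wfe in samples if not (power_off or wfi or wfe))

# Six clock edges; the core is active on edges 0, 2 and 5.
trace = [(0, 0, 0), (0, 1, 0), (0, 0, 0), (1, 0, 0), (0, 0, 1), (0, 0, 0)]
c = count_active_cycles(trace)
```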
In some exemplary embodiments, the interval of the number of active processor clock cycles within the first duration is divided by the boundaries c_b + k·Δc for k = −d, …, d, giving 2d + 2 intervals:

c < c_b − d·Δc;  c_b + (k−1)·Δc ≤ c < c_b + k·Δc for k = −d+1, …, d;  c ≥ c_b + d·Δc

where c is the number of active processor clock cycles in the first duration, c_b is the number of active processor clock cycles in the first duration when the processor utilization is the first percentage, Δc is the change in the number of active processor clock cycles in the first duration when the processor utilization changes by the first percentage change amplitude, and d is an integer greater than or equal to 1.
In some exemplary embodiments, where d is 4, the interval of the number of active processor clock cycles within the first duration is divided as:

c < c_b − 4Δc;  [c_b − 4Δc, c_b − 3Δc);  [c_b − 3Δc, c_b − 2Δc);  …;  [c_b + 3Δc, c_b + 4Δc);  c ≥ c_b + 4Δc

yielding ten intervals in total.
in some exemplary embodiments, the first percentage is 50%; other values may also be set by those skilled in the art.
In some exemplary embodiments, the first percentage change amplitude is 10%, meaning that the processor utilization change amplitude is 10%; other values may also be set by those skilled in the art.
In some exemplary embodiments, the state set S corresponding to the Q value table determined according to the above interval division is denoted S{C}. Those skilled in the art may also adopt other interval division criteria; the number of intervals and the interval boundaries may be adjusted according to the characteristics of the processor to be controlled, and are not limited to the aspects illustrated in the embodiments of the present disclosure.
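An illustrative sketch of mapping a cycle count to a state index follows. It assumes interval boundaries at c_b + k·Δc for k = −d, …, d (2d + 2 intervals); both the function name and that boundary layout are assumptions made for the example:

```python
def cycle_state(c, c_b, delta_c, d=4):
    """Map an active-cycle count c to a state index in 0 .. 2d+1,
    using the 2d+1 assumed boundaries c_b + k*delta_c, k = -d..d."""
    for idx, k in enumerate(range(-d, d + 1)):
        if c < c_b + k * delta_c:
            return idx
    return 2 * d + 1  # at or above the highest boundary

# With c_b = 50 (cycles at 50% utilization) and delta_c = 10 (a 10% step):
assert cycle_state(5, 50, 10) == 0    # far below c_b - 4*delta_c
assert cycle_state(95, 50, 10) == 9   # at/above c_b + 4*delta_c
```

With d = 4 this yields ten states, which is consistent with the Q value table sizes given later.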
In some exemplary embodiments, the system task slack interval within the first duration is divided by the boundaries b%, a%, (100 − a)%, and (100 − b)%:

L < b%;  b% ≤ L < a%;  a% ≤ L < (100 − a)%;  (100 − a)% ≤ L < (100 − b)%;  L ≥ (100 − b)%

where L is the slack of the system tasks within the first duration, a and b are positive integers, and a is greater than b.

In some exemplary embodiments, with a = 30 and b = 5, the system task slack interval within the first duration is divided as:

L < 5%;  5% ≤ L < 30%;  30% ≤ L < 70%;  70% ≤ L < 95%;  L ≥ 95%

where L is the slack of the system tasks within the first duration.
Those skilled in the art may also perform interval division on the system task slack L by using other interval division criteria, where the number of intervals and the interval boundaries may be correspondingly adjusted according to the characteristics of the processor to be controlled, and are not limited to the aspects illustrated in the embodiments of the present disclosure.
It should be noted that task slack time in related schemes is measured in units of time: for example, a task must be completed in 200 ms and itself requires 100 ms of running time, so the scheduler must schedule the task to execute within 100 ms, and the task's slack time is 100 ms. In the embodiments of the present disclosure, the task slack (laxity) is instead the slack time as a percentage of the total deadline: for the same task, the slack time is 100 ms and the task slack is 100/200 = 50%. As another example, a task that must be completed in 400 ms and itself needs 150 ms has a slack time of 250 ms and a slack of 250/400 = 62.5%.
The system task slack within the first duration refers to the average slack, or weighted average slack, of all or part of the tasks within the first duration. The partial tasks may be tasks of preset types in the whole system, or a preset proportion of the system tasks in the whole system.
If DVFS adjustment slows the processor so that a task that originally needed 100 ms now needs 150 ms, the scheduler must schedule the task to execute within 50 ms; the task's slack time becomes 50 ms and its slack is 50/200 = 25%. Thus the system task slack within the first duration can characterize the state of the processor.
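The slack-percentage arithmetic in the examples above can be sketched as follows (an illustrative helper, not part of the claimed method):

```python
def task_slack(deadline_ms, runtime_ms):
    """Slack as a percentage of the deadline: (deadline - runtime) / deadline."""
    return (deadline_ms - runtime_ms) / deadline_ms * 100.0

assert task_slack(200, 100) == 50.0   # 100 ms of slack on a 200 ms deadline
assert task_slack(400, 150) == 62.5   # 250 ms of slack on a 400 ms deadline
assert task_slack(200, 150) == 25.0   # after a DVFS slowdown stretches the runtime
```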
It should be noted that the first duration is also referred to as an observation period, that is, the number of active processor clock cycles and/or the slack of system tasks are counted by taking the first duration as a period.
In some exemplary embodiments, the state set S corresponding to the Q value table determined according to the above interval division is denoted S{C, L}.
In some exemplary embodiments, the plurality of actions in the action set corresponding to the Q-value table correspond to a plurality of target operating parameters, respectively. Executing an action is adjusting the operating parameters of the processor to the target operating parameters corresponding to the action.
In some exemplary embodiments, the operating parameters of the CPU include voltage and frequency; the action set A is then the set of selectable (configurable) voltage-frequency combinations A{v, f} of the CPU.
In some exemplary embodiments, when four voltage steps and four frequency steps are selectable, the size of the Q value table with state set S{C} is |S{C}| × |A{v, f}| = 160; with state set S{C, L}, the size of the Q value table is |S{C, L}| × |A{v, f}| = 800.
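The table-size arithmetic can be checked directly. The per-dimension state counts below (10 cycle-count states and 5 slack states) are inferred from the stated products, not stated explicitly in the text:

```python
# 4 selectable voltages x 4 selectable frequencies = 16 actions.
n_actions = 4 * 4
size_c = 10 * n_actions         # |S{C}|    x |A{v,f}| = 160 entries
size_cl = (10 * 5) * n_actions  # |S{C,L}|  x |A{v,f}| = 800 entries
```

Tables this small keep both the memory footprint and the lookup cost negligible on an embedded target.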
In some exemplary embodiments, the step 110 of searching the Q-value table according to the current state information of the processor to obtain the control action includes:
determining the state corresponding to the state set of the Q value table according to the current state information of the processor;
and searching the Q value table according to the determined state to obtain the action corresponding to the maximum Q value as the control action.
Here, the operation corresponding to the maximum Q value is also referred to as an optimal operation corresponding to the state.
In some exemplary embodiments, taking the DVFS scheme as an example, the Q value table is looked up according to the determined state to obtain the action corresponding to the maximum Q value; this is the preferred action in the current state and corresponds to a voltage-frequency combination A{v, f}, also referred to as the target voltage and frequency. Executing this action adjusts the voltage and frequency of the processor to v and f, respectively.
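The lookup of step 110 reduces to an argmax over one row of the table. A minimal sketch, with an assumed dict-based table layout and example values:

```python
def best_action(q_table, state):
    """Return the action (here a (voltage, frequency) pair) with the
    largest Q value for the given state -- the lookup in step 110."""
    return max(q_table[state], key=q_table[state].get)

# One state with two candidate voltage-frequency combinations:
q = {"s0": {(1.0, 1.0e9): 0.2, (0.8, 0.5e9): 0.7}}
v, f = best_action(q, "s0")   # the lower-power setting wins here
```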
It can be seen that, according to the processor dynamic control scheme provided by the embodiments of the present disclosure, the control action can be determined using the Q value table established through reinforcement learning, and the processor operating parameters adjusted accordingly to achieve the control target. During control, the Q value table involves a small amount of data, the lookup cost is low, and no complex computation is needed, so the control scheme has low resource requirements and short control latency and can be used in embedded systems with limited computing resources or strict performance requirements.
In some exemplary embodiments, the method further comprises:
and step 130, executing a Q value updating algorithm to update the Q value table under the condition that the updating condition is met.
In some exemplary embodiments, the satisfying of the update condition includes at least one of:
when a set updating period comes during the running of the processor, the updating condition is met;
when a set trigger event occurs during the running of the processor, the update condition is met.
in some exemplary embodiments, the triggering event includes at least one of:
the variation amplitude of the operating parameters between two successive control actions is larger than a set threshold;
the occupancy of the processor is below a set threshold.
According to the above examples, those skilled in the art may also set other update conditions and/or trigger events, which are not limited to the aspects of the examples of the embodiments of the present disclosure.
It should be noted that, per step 130, a control test may be run in a test environment on an initial Q value table to complete the preliminary learning of the Q-learning algorithm and establish a Q value table usable in step 110. Step 130 may also be executed whenever the update condition is met during normal (non-test) operation and control of the processor, learning continuously and further updating the Q value table so that it better reflects the current operating environment of the processor, making the control scheme executed on the updated Q value table more reasonable. The process of updating the Q value table in step 130 is also referred to as the Q-learning process or Q-learning stage.
In some exemplary embodiments, as shown in fig. 2, the Q-value updating algorithm is executed to update the Q-value table in step 130 according to the following method:
step 1301, determining a statistical value of the number of active processor clock cycles within the second duration;
step 1302, determining the state corresponding to the state set in the Q value table according to the statistical value;
step 1303, determining the corresponding exploitation or exploration action by using the ε-greedy algorithm according to the state and the Q value table;
step 1304, executing the corresponding exploitation or exploration action, adjusting the operating parameters of the processor, and obtaining an environmental reward;
step 1305, updating the Q-value table according to the environmental reward.
It should be noted that the second duration is also referred to as the statistical duration, that is, the duration of one statistical period; the data obtained in each observation period (each first duration) within the second duration are aggregated to obtain the corresponding statistical value.
In some exemplary embodiments, the determining the statistics of the number of active processor clock cycles for the second duration comprises:
determining the statistical value according to the following manner:
c̄_t = (1 − β)·c̄_{t−1} + β·c_t
where c̄_t is the average number of active processor clock cycles in the t-th statistical period, c_t is the number of processor clock cycles active in the last first duration in the t-th statistical period, c̄_{t−1} is the average number of active processor clock cycles in the (t−1)-th statistical period, and β is a weighting coefficient; the duration of one statistical period is the second duration, and the second duration is greater than or equal to the first duration.
In some exemplary embodiments, the second duration is n times the first duration, where n is an integer greater than 1.
It should be noted that the weighting coefficient β is a tunable hyperparameter; the statistic is essentially an exponentially weighted moving average, so that more recent data is weighted more heavily.
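The statistic can be sketched as a one-step exponentially weighted moving average. This is a minimal sketch; the placement of β on the most recent observation is an assumption chosen so that β = 1 gives the latest cycle count full weight, as the load-mutation handling below requires:

```python
def ewma_cycles(prev_avg: float, c_t: float, beta: float) -> float:
    """One EWMA step for the active-cycle statistic:
    avg_t = (1 - beta) * avg_{t-1} + beta * c_t.
    With beta = 1 the most recent observation c_t gets full weight."""
    return (1.0 - beta) * prev_avg + beta * c_t
```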
In some exemplary embodiments, the statistical value may be the average of the numbers of active processor clock cycles over a plurality of first durations, or an exponentially weighted average of those numbers. It can be seen that, in the learning stage, the state in the Q value table is determined based on the statistical value of the number of processor clock cycles; the control action is then determined based on the state and executed, the environmental reward is acquired, and the Q value table is updated. The Q values in the table can thus be learned from many past events, and the learning result (the Q value table) can be optimized using periodic load changes and the corresponding control events. While the control scheme runs, learning proceeds adaptively and the learning effect keeps improving.
In some exemplary embodiments, when MSE_t > (1 + λ)·MSE_ref-new, the weighting coefficient β is set to 1;
where MSE_t is the mean square error of the active-clock-cycle statistic in the t-th statistical period, MSE_ref-new is a reference error value updated from MSE_ref-old in each period, 1 > λ > 0 is a constant defining the error range, and the initial value of MSE_ref-old is 0.
It should be noted that an important point in the learning process is handling sudden load changes so as to obtain a faster response. In the above statistical value calculation, β is therefore set as a dynamically changing parameter and is modified based on the mean square error (MSE): if MSE_t > (1 + λ)·MSE_ref-new, then β ← 1.
It can be seen that, when calculating the statistic of the number of processor clock cycles, if MSE_t > (1 + λ)·MSE_ref-new, a sudden load change is indicated; the weighting coefficient is then reset to β = 1, so that the number of clock cycles c_t in the most recent first duration carries the heaviest weight.
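The β-reset rule can be sketched as follows. Only the reset condition MSE_t > (1 + λ)·MSE_ref-new → β ← 1 comes from the text; the exponential smoothing used here for MSE_t and MSE_ref-new, and the constants `gamma` and `lam`, are illustrative assumptions:

```python
def update_beta(c_t, avg_prev, mse_prev, mse_ref_old, beta, lam=0.1, gamma=0.5):
    """Detect a sudden load change and reset beta to 1 when it occurs.

    mse_t tracks the squared prediction error of the statistic; mse_ref_new
    is a slowly updated reference. The smoothing forms are assumptions; the
    reset rule `mse_t > (1 + lam) * mse_ref_new -> beta = 1` follows the text.
    """
    mse_t = (1.0 - gamma) * mse_prev + gamma * (c_t - avg_prev) ** 2
    mse_ref_new = (1.0 - lam) * mse_ref_old + lam * mse_t
    if mse_t > (1.0 + lam) * mse_ref_new:
        beta = 1.0  # heaviest weight on the most recent cycle count
    return beta, mse_t, mse_ref_new
```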
In some exemplary embodiments, determining the state in the state set corresponding to the Q-value table according to the statistical value in step 1302 includes:
in the case that the plurality of states in the state set corresponding to the Q value table respectively correspond to a plurality of intervals of the number of active processor clock cycles within the first duration, the processor clock cycle number interval is determined according to the statistical value, and the corresponding state is determined according to that interval.
In some exemplary embodiments, determining the state in the state set corresponding to the Q-value table according to the statistical value in step 1302 includes:
in the case that the plurality of states in the state set corresponding to the Q value table respectively correspond to a plurality of combinations of an interval of the number of active processor clock cycles within the first duration and a system task slack interval within the first duration, the processor clock cycle number interval is determined according to the statistical value, the task slack interval is determined according to the system task slack in the most recent first duration, and the corresponding state is determined according to the two intervals.
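Mapping a statistic to a state is interval discretization, which can be sketched as below. The interval boundaries are hypothetical; the patent only says that states correspond to cycle-count intervals (optionally combined with slack intervals), not to these particular values:

```python
import bisect

# Hypothetical interval boundaries; not values from the patent.
CYCLE_BOUNDS = [1000, 5000, 20000]   # -> 4 cycle-count intervals
SLACK_BOUNDS = [0.0, 10.0]           # -> 3 slack intervals

def cycle_state(stat: float) -> int:
    """Map the active-cycle statistic to its interval index."""
    return bisect.bisect_right(CYCLE_BOUNDS, stat)

def combined_state(stat: float, slack: float) -> int:
    """Map a (cycle interval, slack interval) combination to one state index."""
    n_slack = len(SLACK_BOUNDS) + 1
    return cycle_state(stat) * n_slack + bisect.bisect_right(SLACK_BOUNDS, slack)
```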
In some exemplary embodiments, the environmental reward r_t is calculated according to the following formula:
r_t = k·(|L_{t+1}| − |L_t|)
where k is a constant, r_t is the environmental reward of the t-th statistical period, L_t represents the slack of the system task in the t-th statistical period, and L_{t+1} represents the slack of the system task in the (t+1)-th statistical period; the duration of one statistical period is the second duration, and the second duration is greater than or equal to the first duration.
In some exemplary embodiments, k is a non-negative number. In some exemplary embodiments, k is a multiple of 2.
The calculation of L_t is consistent with the calculation of the system task slack L within the first duration.
In some exemplary embodiments, the environmental reward r_t is calculated from the change in the load percentage of the processor. For example, when the state set does not introduce slack, the environmental reward may be determined in this manner.
In some exemplary embodiments, the environmental reward r_t is calculated according to the following formula:
r_t = m·(Load_{t+1} − Load_t)
where m is a constant, r_t is the environmental reward of the t-th statistical period, Load_t represents the system load percentage in the t-th statistical period, and Load_{t+1} represents the system load percentage in the (t+1)-th statistical period; the duration of one statistical period is the second duration, and the second duration is greater than or equal to the first duration.
In some exemplary embodiments, m is a non-negative number. In some exemplary embodiments, m is a multiple of 2.
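The two reward formulas above translate directly into code. The constants k and m below are hypothetical choices; only the formulas themselves come from the text:

```python
def slack_reward(L_t: float, L_t1: float, k: float = 2.0) -> float:
    """Slack-based reward: r_t = k * (|L_{t+1}| - |L_t|)."""
    return k * (abs(L_t1) - abs(L_t))

def load_reward(load_t: float, load_t1: float, m: float = 2.0) -> float:
    """Load-based reward: r_t = m * (Load_{t+1} - Load_t)."""
    return m * (load_t1 - load_t)
```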
It should be noted that calculating the new Q value from the environmental reward by using the Q-value function, and updating the Q value table accordingly, follows the standard Q-learning algorithm; those skilled in the art may implement it according to the relevant aspects of Q-learning, and the details are not within the scope defined or protected by the present disclosure.
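For reference, the standard Q-learning update the text alludes to can be sketched as follows; the learning rate α and discount factor γ are hypothetical choices not specified in the text:

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard Q-learning update:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Returns the updated Q(s, a)."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    return Q[s][a]
```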
In some exemplary embodiments, the strategy in step 1303 needs to balance exploitation (exploit) and exploration (explore) during learning. Exploration seeks more information about the environment; exploitation uses the known information to obtain the greatest reward. The goal is to maximize the expected cumulative reward, but the process sometimes falls into a local optimum, so it is desirable to jump out of the local optimal solution in time to find a new balance.
In some exemplary embodiments, exploitation means choosing the action that maximizes the current expected return, while exploration makes a random choice. Here, the ε-greedy algorithm is employed:
A* ← argmax_a Q(s, a)
π(a|s) = 1 − ε + ε/|A(s)|, if a = A*; π(a|s) = ε/|A(s)|, otherwise;
and the exploitation or exploration action is determined according to the ε-greedy algorithm.
If step 1303 determines exploitation of the corresponding optimal action according to the ε-greedy algorithm, step 1304 executes that optimal action, adjusts the operating parameters of the processor, and calculates the environmental reward; if step 1303 determines exploration of a corresponding random action, step 1304 executes that random action, adjusts the operating parameters of the processor, and calculates the environmental reward. It can be seen that exploration makes the reinforcement learning process more comprehensive, so that the continuously updated Q value table achieves a better learning effect.
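The ε-greedy choice between exploitation and exploration can be sketched in a few lines; ε = 0.1 here is an arbitrary illustrative value:

```python
import random

def epsilon_greedy(q_row, epsilon=0.1, rng=random):
    """With probability epsilon explore (random action);
    otherwise exploit (action with the maximum Q value)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                        # exploration
    return max(range(len(q_row)), key=lambda a: q_row[a])       # exploitation
```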
In some exemplary embodiments, the exploration may also select the action in other manners, and is not limited to the random selection manner in the above examples.
In some exemplary embodiments, the step 130 of executing the Q value updating algorithm to update the Q value table, as shown in fig. 3, includes:
step 130-1, determining whether a sudden load change has occurred; if so, performing step 130-2, otherwise performing step 130-3;
step 130-2, resetting the weighting coefficient β to 1;
step 130-3, calculating the statistical value of the number of active processor clock cycles in the second duration;
step 130-4, determining the state corresponding to the Q value table according to the statistical value;
step 130-5, determining exploitation or exploration of the corresponding action by using the ε-greedy algorithm;
step 130-6, looking up the table to obtain the optimal action in the case of exploitation;
step 130-7, executing the optimal action, adjusting the operating parameters of the processor, and obtaining the environmental reward;
step 130-8, acquiring a random action in the case of exploration;
step 130-9, executing the random action, adjusting the operating parameters of the processor, and obtaining the environmental reward;
step 130-10, updating the Q value table according to the acquired environmental reward.
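The flow of Fig. 3 can be tied together in one learning step, sketched below. The reward, the state transition, the table sizes, and all hyperparameters (α, γ, ε) are illustrative assumptions; measuring the real reward and applying the frequency change are platform-specific:

```python
import random

def learning_step(Q, state, c_t, avg_prev, beta, mutation, epsilon=0.1,
                  alpha=0.1, gamma=0.9, rng=random):
    if mutation:                                 # steps 130-1 / 130-2
        beta = 1.0
    avg = (1 - beta) * avg_prev + beta * c_t     # step 130-3: statistic
    # Step 130-4 (state from statistic) is elided: `state` is passed in.
    if rng.random() < epsilon:                   # steps 130-5 / 130-8: explore
        action = rng.randrange(len(Q[state]))
    else:                                        # step 130-6: exploit via lookup
        action = max(range(len(Q[state])), key=lambda a: Q[state][a])
    # Steps 130-7 / 130-9: applying the action and measuring the reward are
    # platform-specific; a fixed reward and unchanged state stand in here.
    reward, next_state = 1.0, state
    # Step 130-10: standard Q-learning update.
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    return action, avg, beta
```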
In some exemplary embodiments, the method may be used for CPU idle decisions: a corresponding state set and action set are established according to the control parameters of the cpuidle scheme and the corresponding control-effect evaluation, and the environmental reward calculation is revised accordingly, thereby implementing a CPU idle control scheme.
Further, combining CPU idle with DVFS allows the system to complete high-power tasks in a short time by raising the frequency and then enter a long idle state, so that the total power consumption is minimized. The system can adaptively find the balance point between idle and DVFS to achieve the optimal energy-saving effect.
The embodiment of the present disclosure further provides an electronic device, which includes:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are enabled to implement the method for dynamically controlling the processors described in any of the above embodiments.
The embodiment of the present disclosure further provides a storage medium, where a computer program is stored in the storage medium, and the computer program, when run, executes the processor dynamic control method described in any of the above embodiments.
It can be seen that the embodiments of the present disclosure provide a DVFS algorithm based on reinforcement learning that is more intelligent than conventional power management strategies. An embedded system can adaptively adjust its frequency-scaling strategy at run time, respond to regular changes in advance with a certain predictability, and better adapt to load changes. Past frequency-scaling information is reused, which reduces repeated calculation; the cost and delay of making a frequency-scaling decision are low, improving the real-time performance of the system. Compared with traditional machine learning strategies, the response to load changes is quicker, and on average about 20% of energy consumption can be saved.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for dynamic control of a processor, comprising:
searching a Q value table according to the current state information of the processor to acquire a control action;
adjusting the processor according to the processor operation parameters corresponding to the control action;
the Q value table is pre-established according to a Q-learning algorithm, the state set corresponding to the Q value table is a set of various states divided according to the state information of the processor, and the action set corresponding to the Q value table is a set of various adjusting actions for adjusting the operating parameters of the processor to different target operating parameters.
2. The method of claim 1,
the state information includes: the number of active processor clock cycles within a first duration;
the states in the state set corresponding to the Q value table respectively correspond to active processor clock cycle number intervals within a plurality of first durations.
3. The method of claim 1,
the state information includes: the number of active processor clock cycles within the first duration and the slack of system tasks within the first duration;
the plurality of states in the state set corresponding to the Q value table respectively correspond to a plurality of combinations of an active processor clock cycle number interval within the first duration and a system task slack interval within the first duration.
4. The method according to any one of claims 1 to 3,
the searching the Q value table according to the current state information of the processor to obtain the control action comprises the following steps:
determining the state corresponding to the state set of the Q value table according to the current state information;
and searching the Q value table according to the determined state to obtain the action corresponding to the maximum Q value as the control action.
5. The method according to any one of claims 1 to 3,
the method further comprises the following steps:
in the case where the update condition is satisfied, executing a Q-value update algorithm to update the Q-value table according to the following method:
determining a statistical value of the number of active processor clock cycles within the second duration;
determining the state corresponding to the state set of the Q value table according to the statistical value;
determining, by using an ε-greedy algorithm according to the state and the Q value table, the corresponding exploitation or exploration action;
executing the corresponding exploitation or exploration action, adjusting the operating parameters of the processor, and obtaining an environmental reward;
and updating the Q value table according to the environment reward.
6. The method of claim 5,
the determining a statistical value of the number of active processor clock cycles in the second duration comprises:
determining the statistical value according to the following manner:
c̄_t = (1 − β)·c̄_{t−1} + β·c_t
where c̄_t is the average number of active processor clock cycles in the t-th statistical period, c_t is the number of processor clock cycles active in the last first duration in the t-th statistical period, c̄_{t−1} is the average number of active processor clock cycles in the (t−1)-th statistical period, and β is a weighting coefficient; the duration of one statistical period is the second duration, and the second duration is greater than or equal to the first duration.
7. The method of claim 5,
the environmental reward r_t is calculated according to the following formula:
r_t = k·(|L_{t+1}| − |L_t|)
where k is a constant, r_t is the environmental reward of the t-th statistical period, L_t represents the slack of the system task in the t-th statistical period, and L_{t+1} represents the slack of the system task in the (t+1)-th statistical period; the duration of one statistical period is the second duration, and the second duration is greater than or equal to the first duration;
alternatively,
the environmental reward r_t is calculated according to the following formula:
r_t = m·(Load_{t+1} − Load_t)
where m is a constant, r_t is the environmental reward of the t-th statistical period, Load_t represents the system load percentage in the t-th statistical period, and Load_{t+1} represents the system load percentage in the (t+1)-th statistical period.
8. The method according to any one of claims 1 to 3,
the operating parameters of the processor include: voltage and/or frequency.
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the processor dynamic control method of any of claims 1-8.
10. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the processor dynamic control method of any one of claims 1-8.
CN202111288651.1A 2021-11-02 Processor dynamic control method, electronic equipment and storage medium Active CN113946428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111288651.1A CN113946428B (en) 2021-11-02 Processor dynamic control method, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113946428A true CN113946428A (en) 2022-01-18
CN113946428B CN113946428B (en) 2024-06-07


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115766453A (en) * 2022-12-07 2023-03-07 中国工商银行股份有限公司 Alarm system configuration method, device, equipment, medium and product

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01279314A (en) * 1988-05-02 1989-11-09 Nec Corp Operation system for communication preprocessor
JP2011030404A (en) * 2009-06-22 2011-02-10 Felica Networks Inc Information processing apparatus, program, and information processing system
CN107367929A (en) * 2017-07-19 2017-11-21 北京上格云技术有限公司 Update method, storage medium and the terminal device of Q value matrixs
CN108924944A (en) * 2018-07-19 2018-11-30 重庆邮电大学 The dynamic optimization method of contention window value coexists in LTE and WiFi based on Q-learning algorithm
CN109974737A (en) * 2019-04-11 2019-07-05 山东师范大学 Route planning method and system based on combination of safety evacuation signs and reinforcement learning
CN110025959A (en) * 2019-01-25 2019-07-19 清华大学 Method and apparatus for controlling intelligent body
CN110968458A (en) * 2019-11-26 2020-04-07 山东大学 Backup system and method based on reinforcement learning and oriented to nonvolatile processor
CN112905315A (en) * 2021-01-29 2021-06-04 北京邮电大学 Task processing method, device and equipment in Mobile Edge Computing (MEC) network


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHIYUAN YANG 等: "Phase-driven learning-based dynamic reliability management for multi-core processors", 《2017 54TH ACM/EDAC/IEEE DESIGN AUTOMATION CONFERENCE (DAC)》, 9 October 2017 (2017-10-09), pages 1 - 6 *
余涛 等: "基于Q学习的互联电网动态最优CPS控制", 《中国电机工程学报》, 7 July 2010 (2010-07-07), pages 13 - 19 *


Similar Documents

Publication Publication Date Title
CN109960395B (en) Resource scheduling method and computer equipment
Mannor et al. Online Learning with Sample Path Constraints.
CN109324875B (en) Data center server power consumption management and optimization method based on reinforcement learning
US9285855B2 (en) Predicting future power level states for processor cores
US7904287B2 (en) Method and system for real-time prediction of power usage for a change to another performance state
US11422607B2 (en) Terminal and method for managing the same
US20130290758A1 (en) Sleep mode latency scaling and dynamic run time adjustment
CN101211215A (en) Performance of a processor in a managing data processing equipment
CN109117255A (en) Heterogeneous polynuclear embedded system energy optimization dispatching method based on intensified learning
CN103890694A (en) System and method for managing clock speed based on task urgency
KR20130114742A (en) Method and apparatus of smart power management for mobile communication terminals
Dey et al. User interaction aware reinforcement learning for power and thermal efficiency of CPU-GPU mobile MPSoCs
WO2017184347A1 (en) Adaptive doze to hibernate
Huang et al. Double-Q learning-based DVFS for multi-core real-time systems
Cho et al. A battery lifetime guarantee scheme for selective applications in smart mobile devices
US20160091949A1 (en) Performance management for a multiple-cpu platform
US11640195B2 (en) Service-level feedback-driven power management framework
Dang et al. A unified stochastic model for energy management in solar-powered embedded systems
Triki et al. Reinforcement learning-based dynamic power management of a battery-powered system supplying multiple active modes
CN113946428B (en) Processor dynamic control method, electronic equipment and storage medium
CN113946428A (en) Processor dynamic control method, electronic equipment and storage medium
CN115705275A (en) Parameter acquisition method and device and electronic equipment
CN109643151A (en) For reducing the method and apparatus for calculating equipment power dissipation
EP2235987B1 (en) Systems and methods for managing power consumption in a flow-based user experience
Zhou et al. CPU frequency scheduling of real-time applications on embedded devices with temporal encoding-based deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant