CN108549232B

CN108549232B - Adaptive indoor air control method based on approximate model planning

Info

Publication number: CN108549232B
Application number: CN201810430729.0A
Authority: CN
Inventors: 钟珊; 龚声蓉; 伏玉琛; 王朝晖; 董瑞志; 姚宇峰
Original assignee: Changshu Institute of Technology
Current assignee: Changshu Institute of Technology
Priority date: 2018-05-08
Filing date: 2018-05-08
Publication date: 2019-11-08
Anticipated expiration: 2038-05-08
Also published as: CN108549232A

Abstract

The invention discloses an adaptive indoor air control method based on approximate model planning. The method initializes the current state, the model, the hyperparameters, the environment, and the exploration policy; selects and executes an action according to the exploration policy to obtain a reward and the next state; and forms the current state, action, reward, and next state into a sample that is used to update the model, the value function, and the policy. After each episode, the current sample trajectory and the reconstructed sample trajectories are added to a trajectory pool, and all trajectories in the pool are then used to update the model. Simulated samples generated by the updated model are used for planning. When the algorithm reaches the maximum number of episodes and converges, the optimal policy for adaptive indoor air control is obtained. By learning an approximate environment model and planning with the learned model, the invention improves learning efficiency.

Description

Adaptive indoor air control method based on approximate model planning

Technical field

The present invention relates to an adaptive indoor air control method, and more particularly to an adaptive indoor air control method based on approximate model planning.

Background art

With economic development and rising living standards, people pay increasing attention to their environment. The indoor environment is where people spend most of their time and is closely tied to their health. Keeping indoor air safe, fresh, and comfortable is therefore a key issue in improving quality of life.

Most indoor environments today are equipped only with devices such as air conditioners and air purifiers, and these devices operate in isolation: each must be started separately to regulate the air temperature or to purify the air. For some hazardous gases, such as formaldehyde and sulfur dioxide in a confined space, an air purifier alone cannot quickly bring the indoor concentration below the safety threshold, and a window must be opened immediately to create convection. A window control device is therefore needed. Moreover, air conditioners, air purifiers, and similar devices require manual control and adjustment and have no intelligence. A control method is therefore needed that lets these devices start and run automatically by sensing the environment, so that the indoor environment is controlled automatically in real time.

Summary of the invention

In view of the above defects of the prior art, the object of the invention is to provide an adaptive indoor air control method based on approximate model planning that automatically controls the terminal devices to meet indoor air quality requirements while maximizing, as far as possible, the comfort and satisfaction of the occupants.

The technical solution of the invention is as follows: an adaptive indoor air control method based on approximate model planning, comprising the following steps:

Step 1): initialize the Markov decision process model and set the state space X and the action space U of the environment;

Step 2): initialize the parameter vectors, which include the value function parameter ξ, the policy parameter ζ, the state transition function parameter θ, the reward function parameter υ, and the eligibility trace parameter e;

Step 3): initialize the hyperparameters of the algorithm, which include the discount rate γ, the decay factor λ, the number of episodes E, the exploration standard deviation ε of the Gaussian function, the maximum number of time steps S per episode, the value function learning rate α₁, the policy learning rate α₂, the model learning rate α, and the number of planning iterations K;

Step 4): initialize the current episode s = 1;

Step 5): initialize the current state $x_t = x$ and the current time step t = 1;

Step 6): select an action: choose the action $u_t = u$ to execute in the current state according to the exploration policy;

Step 7): generate a sample: in the current state $x_t$, execute action u to obtain the next state $x_{t+1}$ and the immediate reward $r_{t+1}$; the generated sample is $(x_t, u_t, x_{t+1}, r_{t+1})$;

Step 8): update the model: use the sample $(x_t, u_t, x_{t+1}, r_{t+1})$ to update the state transition function parameter vector θ and the reward function parameter vector υ;

Step 9): compute the temporal difference (TD) error;

Step 10): update the eligibility trace: update the eligibility trace parameter vector;

Step 11): update the value function: update the parameter vector of the value function;

Step 12): update the policy: update the parameter vector of the policy;

Step 13): update the current state: $x_t = x_{t+1}$;

Step 14): update the current time step t = t + 1 and check whether the maximum time step has been reached: if so, go to step 15); otherwise, go to step 5) and continue;

Step 15): plan using the approximate model;

Step 16): update the current episode s = s + 1 and check whether the maximum number of episodes has been reached: if so, go to step 17); otherwise, go to step 5) and continue;

Step 17): obtain the optimal policy for adaptive indoor air control from the learned policy.

As a preferred technical solution, the value function in step (2) is approximated as $V(x) = \phi^T(x)\xi$, where $\phi(x)$ is a Gaussian function that maps the state x to a feature vector and is defined by its center point and by σ₁, the standard deviation of the state dimension; ξ is the parameter vector, whose dimension matches that of the feature vector. The policy is approximated as $h(x) = \phi^T(x)\zeta$, where the feature vector $\phi(x)$ is the same as in the value function and ζ is the policy parameter vector. The model consists of the state transition function and the reward function: the transition function is approximated as $x_{t+1} = \phi^T(x_t, u_t)\theta_t$ and the reward function as $r_{t+1} = \phi^T(x_t, u_t)\upsilon_t$, where $\phi(x_t, u_t)$ is the state-action feature, defined additionally by the center point of the action and by σ₂, the standard deviation of the action dimension; θ is the parameter vector of the state transition function and υ is the parameter vector of the reward function.
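The linear approximations described above can be illustrated with a short Python sketch. The exact Gaussian form, the helper names `state_features` and `linear`, the example centers, and the scalar output are assumptions made for illustration; the description only states that the features are Gaussian functions of the state with standard deviation σ₁.

```python
import math

def state_features(x, centers, sigma1):
    """Gaussian (radial basis) features of state x, one feature per center (assumed form)."""
    feats = []
    for c in centers:
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        feats.append(math.exp(-sq_dist / (2.0 * sigma1 ** 2)))
    return feats

def linear(features, params):
    """Inner product phi^T * params, usable for both V(x) and a scalar h(x)."""
    return sum(f * p for f, p in zip(features, params))

# Hypothetical centers in a 2-D normalized state space.
centers = [[0.0, 0.0], [1.0, 1.0]]
phi = state_features([0.0, 0.0], centers, sigma1=0.5)
value = linear(phi, [1.0, 2.0])   # V(x) with an illustrative parameter vector xi = [1, 2]
```

In the embodiment the action has three components, so a full implementation would keep one policy parameter vector per action component; the scalar case is shown only to keep the sketch short.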

As a preferred technical solution, the exploration policy in step (6) is generated with a Gaussian function: the action taken in any state is drawn from a Gaussian centered on $h(x)$ with standard deviation ε, where $h(x) = u^*$ denotes the optimal action obtained in state x under the optimal policy and ε is the exploration factor.
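A minimal sketch of Gaussian exploration follows. It assumes the exploratory action is the greedy action h(x) plus zero-mean Gaussian noise with standard deviation ε, applied independently to each action component. The function name `explore` and the sample values are hypothetical.

```python
import random

def explore(greedy_action, epsilon, rng=random):
    """Perturb each component of the greedy action u* = h(x) with Gaussian noise of std epsilon."""
    return [rng.gauss(u_star, epsilon) for u_star in greedy_action]

rng = random.Random(0)                      # fixed seed so the sketch is repeatable
u = explore([0.2, 0.5, 0.1], epsilon=0.6, rng=rng)
```

Setting `epsilon` to 0 returns the greedy action unchanged, which is a convenient way to evaluate the learned policy without exploration.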

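As a preferred technical solution, the model update in step (8) uses the single-step prediction error as the gradient signal. The single-step state prediction error, i.e. the difference between the observed next state $x_{t+1}$ and the model prediction $\phi^T(x_t, u_t)\theta_t$, is taken as the gradient to obtain the update formula for the transition function parameter vector θ. Likewise, the single-step reward prediction error, i.e. the difference between the observed reward $r_{t+1}$ and the model prediction $\phi^T(x_t, u_t)\upsilon_t$, is taken as the gradient to obtain the update formula for the reward function parameter vector υ.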
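One way to realize a single-step gradient update of a linear model is the least-mean-squares rule sketched below. The rule itself, the function name `update_linear_model`, and the example numbers are assumptions. The same function would serve for θ (one state component at a time) and for υ.

```python
def update_linear_model(params, features, target, alpha):
    """One LMS step: move the prediction phi^T params toward target along the prediction error."""
    prediction = sum(f * p for f, p in zip(features, params))
    error = target - prediction                       # single-step prediction error
    new_params = [p + alpha * error * f for p, f in zip(params, features)]
    return new_params, error

phi_xu = [1.0, 0.5]     # hypothetical state-action features phi(x_t, u_t)
theta = [0.0, 0.0]      # transition parameters for one state component
theta, err = update_linear_model(theta, phi_xu, target=1.0, alpha=0.5)
# theta is now [0.5, 0.25] and err is 1.0
```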

As a preferred technical solution, the TD error in step (9) is computed as $\omega = r + \gamma V(x_{t+1}) - V(x_t)$.

As a preferred technical solution, the eligibility trace in step (10) is updated using $\phi(x_t)$, the feature vector of the current state.

As a preferred technical solution, the value function update formula in step (11) is $\xi_{t+1} = \xi_t + \alpha_1 \omega e(x_t)$.

As a preferred technical solution, the policy update formula in step (12) is $\zeta_{t+1} = \zeta_t + \alpha_2 \omega (u - u^*)^T \phi(x_t)$.
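Steps (9) to (12) can be combined into one update, sketched below for a scalar action. The accumulating form of the eligibility trace (decay by γλ, then add the current features) is an assumption, as are the function name `actor_critic_step` and the example inputs; the default hyperparameters are the values used in the embodiment.

```python
def actor_critic_step(xi, zeta, trace, phi_x, phi_next, r, u, u_star,
                      gamma=0.95, lam=0.85, alpha1=0.7, alpha2=0.6):
    """One TD error computation, trace update, value update and policy update."""
    v = sum(f * p for f, p in zip(phi_x, xi))
    v_next = sum(f * p for f, p in zip(phi_next, xi))
    omega = r + gamma * v_next - v                          # TD error
    # Assumed accumulating trace: decay by gamma * lambda, then add current features.
    trace = [gamma * lam * e + f for e, f in zip(trace, phi_x)]
    xi = [p + alpha1 * omega * e for p, e in zip(xi, trace)]
    zeta = [p + alpha2 * omega * (u - u_star) * f for p, f in zip(zeta, phi_x)]
    return xi, zeta, trace, omega

xi, zeta, trace, omega = actor_critic_step(
    xi=[0.0, 0.0], zeta=[0.0, 0.0], trace=[0.0, 0.0],
    phi_x=[1.0, 0.0], phi_next=[0.0, 1.0], r=1.0, u=0.5, u_star=0.0)
```

With a positive TD error, the policy parameters move toward the exploratory action u that produced it; with a negative TD error they move away from it.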

As a preferred technical solution, the model planning in step (15) iterates for a set number of planning iterations: the model $x_{t+1} = \phi^T(x_t, u_t)\theta_t$ and $r_{t+1} = \phi^T(x_t, u_t)\upsilon_t$ generates the next state and the reward, which are then used to update the value function parameter vector $\xi_{t+1} = \xi_t + \alpha_1 \omega e(x_t)$ and the policy parameter vector $\zeta_{t+1} = \zeta_t + \alpha_2 \omega (u - u^*)^T \phi(x_t)$.

Compared with the prior art, the advantage of the invention is that the reinforcement learning algorithm based on approximate model planning learns an approximate environment model and uses the learned model for local planning, which improves learning efficiency. The optimal policy is learned by continuously collecting sensing data from the sensors built into the terminal devices (air purifier and air conditioner). The learned policy then drives the control devices to start the corresponding terminal devices (window, air purifier, and air conditioner), so that the indoor environment is controlled automatically in real time.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of the adaptive indoor air control system based on approximate model planning according to the invention;

Fig. 2 is a schematic diagram of the server architecture in the adaptive indoor air control system of the invention;

Fig. 3 is a schematic diagram of the overall control flow of the adaptive indoor air control system based on approximate model planning according to the invention;

Fig. 4 is a flow chart of the adaptive indoor air control method based on approximate model planning according to the invention.

Detailed description of the embodiments

The invention is further described below with reference to an embodiment, which does not limit the invention.

The adaptive indoor air control method based on approximate model planning of this embodiment is applied to the adaptive indoor air regulation system shown in Fig. 1. The main modules of the system are: window 1, air purifier 2, air conditioner 3, window control device 4, purifier control device 5, air conditioner control device 6, display and management device 7, server 8, and a mobile phone app. The modules can be connected by a wireless network; the embodiment uses Wi-Fi, but GPRS, 3G, 4G, or Zigbee can also be chosen as the wireless communication network. As shown in Fig. 2, the main components of the server are a central controller 8a, a storage unit 8b, a sensor unit 8c, and an interface circuit 8d for the sensors and the control devices. The sensor unit 8c includes a temperature sensor, a humidity sensor, a formaldehyde sensor, a sulfur dioxide sensor, a PM2.5 sensor, and other sensors (this part provides an expansion interface so that new sensor types can be added easily). As shown in Fig. 3, the sensors periodically send data to the server. On receiving the data, the server compares the current values with the safety thresholds to determine the reward value for the reinforcement learning algorithm based on approximate model planning, and feeds the data to that algorithm as sample data to learn the optimal control policy. In addition, when a collected value exceeds its safety threshold, the server issues a control command to the control device, alerts the occupants through the display and management device, and sends the information over the Internet to the mobile phone apps of the homeowner and tenants.

The adaptive indoor air control method based on approximate model planning mainly covers two aspects:

First, the data must be checked and formatted:

1) Temperature sensor: the normal temperature range is set to 18 °C to 28 °C; a temperature within this range is a normal value;

2) Humidity sensor: the normal humidity range is set to 40% to 60%; a humidity within this range is a normal value;

3) Formaldehyde sensor: the normal formaldehyde range is set to 0 to 0.08 mg/m³; a formaldehyde concentration within this range is a normal value;

4) Sulfur dioxide sensor: the normal sulfur dioxide range is set to 0 to 0.50 mg/m³; a sulfur dioxide concentration within this range is a normal value;

5) PM2.5 sensor: the normal PM2.5 range is set to 0 to 75 μg/m³; a PM2.5 concentration within this range is a normal value.

When all the values sent by the above sensors are normal, the control devices need not perform any operation. When any value is abnormal, a command must be sent to the corresponding control device, which starts the relevant device to perform the corresponding operation. For example, when the sulfur dioxide concentration exceeds 0.5 mg/m³, the indoor concentration is above the safe value, and a command is sent to the window control device to trigger opening of the window.

To eliminate calculation errors caused by the different scales of the data, each value is min-max normalized: the minimum value $x_{\min}$ is subtracted from the current value and the result is divided by $x_{\max} - x_{\min}$, where $x_{\max}$ denotes the maximum value. The normalized value x lies in the range [0, 1].
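Min-max scaling of this kind can be sketched as follows. The function name `normalize`, the clipping of out-of-range readings, and the use of the normal temperature range as the bounds in the example are assumptions made for illustration.

```python
def normalize(value, lo, hi):
    """Min-max scale a raw reading to [0, 1]; readings outside [lo, hi] are clipped (assumption)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    scaled = (value - lo) / (hi - lo)
    return min(1.0, max(0.0, scaled))

print(normalize(23.0, 18.0, 28.0))   # 0.5
```

Without clipping, a reading above the chosen maximum would map to a value greater than 1, so a deployment would need either clipping or bounds wide enough to cover every reading the sensor can produce.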

Second, the optimal policy is solved with the reinforcement learning algorithm based on approximate model planning. To achieve automatic real-time control of indoor air, the indoor control problem is first modeled as a Markov decision process (MDP), and the corresponding algorithm is then invoked to solve it. After modeling, the MDP of the adaptive indoor air control system can be expressed as follows:

(1) State space: the state has dimension 5; its components are the readings of the temperature sensor, humidity sensor, formaldehyde sensor, sulfur dioxide sensor, and PM2.5 sensor. Any state in the state space is x = {temperature, humidity, formaldehyde, sulfur dioxide, PM2.5};

(2) Action space: the action has dimension 3 and can be expressed as u = {action of the air conditioner control device, action of the air purifier control device, action of the window control device}.

The actions of the air conditioner control device are: 1 cooling with low fan, 2 cooling with high fan, 3 heating with low fan, 4 heating with high fan, 5 humidifying, 6 dehumidifying.

The actions of the air purifier control device are: 1 purify, 2 off.

The actions of the window control device are: 1 open to the maximum angle (90°), 2 open to a large angle (greater than 60° and less than 90°), 3 open to a medium angle (greater than 30° and less than 60°), 4 open to a small angle (greater than 0° and less than 30°), 5 closed.

(3) Reward function: the reward function is set manually, and its values can be fine-tuned according to the degree of risk, the subjective preferences of the occupants, and the observed performance of the algorithm. For a hazardous gas such as sulfur dioxide, a large negative reward is usually set so that the controller learns the optimal policy for that state as quickly as possible. Similarly, if the occupants cannot tolerate high temperatures, a large negative reward can be given when the temperature exceeds the normal range.

In this embodiment, the reward function is designed as follows:

1) Temperature: the reward is +1 when the measured temperature is within the normal range, and -5 otherwise;

2) Humidity: the reward is +1 when the measured humidity is within the normal range, and -1 otherwise;

3) Formaldehyde: the reward is +1 when the measured formaldehyde concentration is within the normal range, and -10 otherwise;

4) Sulfur dioxide: the reward is +1 when the measured sulfur dioxide concentration is within the normal range, and -15 otherwise;

5) PM2.5: the reward is +1 when the measured PM2.5 concentration is within the normal range, and -8 otherwise;

(4) Transition function: the transition function gives the next state that the system or environment moves to after the selected action is executed in the current state. Because the state in this embodiment is obtained by reading sensor data, the next state is likewise obtained by reading the sensors.
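The reward design in item (3) can be sketched as a lookup over the normal ranges listed earlier. Summing the five per-sensor rewards into one scalar is an assumption, since the description does not say how they are combined; the dictionary keys, the function name `reward`, and the inclusive range bounds are also hypothetical.

```python
NORMAL_RANGES = {
    "temperature": (18.0, 28.0),     # degrees Celsius
    "humidity": (40.0, 60.0),        # percent
    "formaldehyde": (0.0, 0.08),     # mg/m^3
    "sulfur_dioxide": (0.0, 0.50),   # mg/m^3
    "pm25": (0.0, 75.0),             # ug/m^3
}
PENALTIES = {"temperature": -5, "humidity": -1, "formaldehyde": -10,
             "sulfur_dioxide": -15, "pm25": -8}

def reward(readings):
    """Sum of per-sensor rewards: +1 inside the normal range, the sensor's penalty outside."""
    total = 0
    for name, value in readings.items():
        lo, hi = NORMAL_RANGES[name]
        total += 1 if lo <= value <= hi else PENALTIES[name]
    return total

readings = {"temperature": 22.0, "humidity": 50.0, "formaldehyde": 0.05,
            "sulfur_dioxide": 0.6, "pm25": 30.0}
print(reward(readings))   # -11: four normal readings (+4) and abnormal sulfur dioxide (-15)
```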

Fig. 4 shows the flow of the reinforcement learning algorithm based on approximate model planning that runs in the control center. It comprises the following steps:

Step 1): initialize the Markov decision process model, i.e. model the indoor air control problem as the MDP described above and initialize the state space, action space, reward function, and transition function;

Step 2): initialize the parameter vectors, which mainly include the value function parameter ξ, the policy parameter ζ, the state transition function parameter θ, the reward function parameter υ, and the eligibility trace parameter e;

Step 3): initialize the hyperparameters of the algorithm: discount rate γ = 0.95, decay factor λ = 0.85, number of episodes E = 500, exploration standard deviation of the Gaussian function ε = 0.6, maximum number of time steps per episode S = 400, value function learning rate α₁ = 0.7, policy learning rate α₂ = 0.6, model learning rate α = 0.5, and number of planning iterations K = 100;
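For reference, the hyperparameter values of step 3) can be collected in one immutable configuration object. The class and field names below are hypothetical; only the numeric values come from the embodiment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HyperParams:
    gamma: float = 0.95         # discount rate
    lam: float = 0.85           # decay factor lambda
    episodes: int = 500         # number of episodes E
    explore_std: float = 0.6    # exploration standard deviation epsilon
    max_steps: int = 400        # maximum time steps per episode S
    alpha_value: float = 0.7    # value function learning rate alpha_1
    alpha_policy: float = 0.6   # policy learning rate alpha_2
    alpha_model: float = 0.5    # model learning rate alpha
    planning_iters: int = 100   # number of planning iterations K

hp = HyperParams()
real_steps = hp.episodes * hp.max_steps          # 200000 interactions with the real environment
planned_steps = hp.episodes * hp.planning_iters  # 50000 simulated updates, one planning phase per episode
```

The step counts assume every episode runs to its maximum length and that planning runs once per episode, as in steps 14) and 15) of the method.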

Step 4): initialize the current episode s = 1;

Step 5): initialize the current state $x_t$ to the initial readings of the sensors, and the current time step t = 1;

Step 6): select an action: choose the action to execute in the current state according to the exploration policy;

Step 7): generate a sample: in the current state $x_t$, execute action u (either take no action or start the control devices to respond), then read the sensors to obtain the next state $x_{t+1}$ and the immediate reward $r_{t+1}$; the generated sample is $(x_t, u_t, x_{t+1}, r_{t+1})$;

Step 8): learn the model: use the sample $(x_t, u_t, x_{t+1}, r_{t+1})$ to update the parameter vectors θ and υ of the model's state transition function and reward function;

Step 9): compute the TD error: $\omega = r + \gamma V(x_{t+1}) - V(x_t)$;

Step 10): update the eligibility trace: update the eligibility trace parameter vector;

Step 11): update the value function: update the parameter vector of the value function, $\xi_{t+1} = \xi_t + \alpha_1 \omega e(x_t)$;

Step 12): update the policy: update the parameter vector of the policy, $\zeta_{t+1} = \zeta_t + \alpha_2 \omega (u - u^*)^T \phi(x_t)$;

Step 13): update the current state: save the current sensor readings, $x_t = x_{t+1}$;

Step 15): plan using the approximate model: initialize the current state to the current environment state and initialize the eligibility trace, then repeat K times: select an action according to the exploration policy; compute the predicted next state $x_{t+1} = \phi^T(x_t, u_t)\theta_t$; compute the predicted reward $r_{t+1} = \phi^T(x_t, u_t)\upsilon_t$; build the simulated sample $(x_t, u_t, x_{t+1}, r_{t+1})$; compute the TD error $\omega = r + \gamma V(x_{t+1}) - V(x_t)$; update the eligibility trace parameter; update the value function parameter $\xi_{t+1} = \xi_t + \alpha_1 \omega e(x_t)$; update the policy parameter $\zeta_{t+1} = \zeta_t + \alpha_2 \omega (u - u^*)^T \phi(x_t)$;
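The planning phase of step 15) can be sketched for a scalar state and a scalar action as below. This is an illustration under several assumptions, not the patented implementation: the Gaussian feature forms, the state-action feature built as state features scaled by a Gaussian in u, the zero initialization and accumulating update of the eligibility trace, and all function names are hypothetical.

```python
import math
import random

def rbf(z, centers, sigma):
    """Gaussian features of a scalar z (assumed form)."""
    return [math.exp(-(z - c) * (z - c) / (2 * sigma * sigma)) for c in centers]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def plan(x, xi, zeta, theta, upsilon, centers, k=100, gamma=0.95, lam=0.85,
         alpha1=0.7, alpha2=0.6, eps=0.6, sigma=0.5, rng=random):
    """Run k simulated value and policy updates from state x using the learned model."""
    trace = [0.0] * len(xi)                      # assumed: trace reset to zero before planning
    for _ in range(k):
        phi = rbf(x, centers, sigma)
        u_star = dot(phi, zeta)                  # greedy action h(x)
        u = rng.gauss(u_star, eps)               # exploratory action
        # Hypothetical state-action features: state features scaled by a Gaussian in u.
        phi_xu = [f * math.exp(-u * u / (2 * sigma * sigma)) for f in phi]
        x_next = dot(phi_xu, theta)              # next state predicted by the model
        r = dot(phi_xu, upsilon)                 # reward predicted by the model
        omega = r + gamma * dot(rbf(x_next, centers, sigma), xi) - dot(phi, xi)
        trace = [gamma * lam * e + f for e, f in zip(trace, phi)]
        xi = [p + alpha1 * omega * e for p, e in zip(xi, trace)]
        zeta = [p + alpha2 * omega * (u - u_star) * f for p, f in zip(zeta, phi)]
        x = x_next
    return xi, zeta

new_xi, new_zeta = plan(0.2, [0.0] * 3, [0.0] * 3, theta=[0.5] * 3, upsilon=[1.0] * 3,
                        centers=[0.0, 0.5, 1.0], k=5, rng=random.Random(0))
```

Because the samples come from the learned model rather than from the sensors, the planning loop adds value and policy updates without waiting for the real environment to respond, which is the source of the efficiency gain the description claims.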

Step 17): obtain the optimal policy for adaptive indoor environment control from the learned policy.

Claims

1. An adaptive indoor air control method based on approximate model planning, characterized by comprising the following steps:

Step 2): initialize the parameter vectors, which include the value function parameter ξ, the policy parameter ζ, the state transition function parameter θ, the reward function parameter υ, and the eligibility trace parameter e;

Step 3): initialize the hyperparameters of the algorithm, which include the discount rate γ, the decay factor λ, the number of episodes E, the exploration standard deviation ε of the Gaussian function, the maximum number of time steps S per episode, the value function learning rate α₁, the policy learning rate α₂, the model learning rate α, and the number of planning iterations K;

Step 4): initialize the current episode s = 1;

Step 5): initialize the current state $x_t = x$ and the current time step t = 1;

Step 6): select an action: choose the action $u_t = u$ to execute in the current state according to the exploration policy; the exploration policy is generated with a Gaussian function, the action taken in any state being drawn from a Gaussian centered on $h(x)$ with the exploration standard deviation ε, where $h(x) = u^*$ denotes the optimal action obtained in state x under the optimal policy;

Step 7): generate a sample: in the current state $x_t$, execute action u to obtain the next state $x_{t+1}$ and the immediate reward $r_{t+1}$; the generated sample is $(x_t, u_t, x_{t+1}, r_{t+1})$;

Step 8): use the sample $(x_t, u_t, x_{t+1}, r_{t+1})$ to update the state transition function parameter vector θ and the reward function parameter vector υ of the model; the model update uses the single-step prediction error as the gradient signal: the single-step state prediction error, i.e. the difference between the observed next state $x_{t+1}$ and the model prediction $\phi^T(x_t, u_t)\theta_t$, is taken as the gradient to obtain the update formula for the transition function parameter vector; the single-step reward prediction error, i.e. the difference between the observed reward $r_{t+1}$ and the model prediction $\phi^T(x_t, u_t)\upsilon_t$, is taken as the gradient to obtain the update formula for the reward function parameter vector; $\phi(x_t, u_t)$ is the state-action feature;

Step 9): compute the TD error;

Step 10): update the eligibility trace: update the eligibility trace parameter vector, the eligibility trace update formula being expressed in terms of $\phi(x)$, the feature vector corresponding to state x;

Step 11): update the value function: update the parameter vector of the value function, the value function update formula being $\xi_{t+1} = \xi_t + \alpha_1 \omega e(x_t)$;

Step 12): update the policy: update the parameter vector of the policy, the policy update formula being $\zeta_{t+1} = \zeta_t + \alpha_2 \omega (u - u^*)^T \phi(x_t)$, where $\phi(x_t)$ is the state feature;

Step 13): update the current state: $x_t = x_{t+1}$;

Step 14): update the current time step t = t + 1 and check whether the maximum time step has been reached: if so, go to step 15); otherwise, go to step 5) and continue;

Step 15): plan using the approximate model; the approximate model planning iterates for a set number of planning iterations, using the model $x_{t+1} = \phi^T(x_t, u_t)\theta_t$ and $r_{t+1} = \phi^T(x_t, u_t)\upsilon_t$ to generate the next state and the reward, which are then used to update the value function parameter vector $\xi_{t+1} = \xi_t + \alpha_1 \omega e(x_t)$ and the policy parameter vector $\zeta_{t+1} = \zeta_t + \alpha_2 \omega (u - u^*)^T \phi(x_t)$;

Step 16): update the current episode s = s + 1 and check whether the maximum number of episodes has been reached: if so, go to step 17); otherwise, go to step 5) and continue;

2. The adaptive indoor air control method based on approximate model planning according to claim 1, characterized in that the value function in step (2) is approximated as $V(x) = \phi^T(x)\xi$, where the Gaussian function $\phi(x)$ maps the state x to a feature vector and is defined by its center point and by σ, the standard deviation of the Gaussian function, and the dimension of ξ matches that of the feature vector; the policy is approximated as $h(x) = \phi^T(x)\zeta$, where the feature vector $\phi(x)$ is the same as in the value function; the model comprises the state transition function and the reward function, the transition function being approximated as $x_{t+1} = \phi^T(x_t, u_t)\theta_t$ and the reward function as $r_{t+1} = \phi^T(x_t, u_t)\upsilon_t$, where $\phi(x_t, u_t)$ is the state-action feature, defined additionally by the center point of the action, σ₁ is the standard deviation of the state dimension, σ₂ is the standard deviation of the action dimension, θ is the parameter vector of the state transition function, and υ is the parameter vector of the reward function.

3. The adaptive indoor air control method based on approximate model planning according to claim 1, characterized in that the TD error in step (9) is computed as $\omega = r + \gamma V(x_{t+1}) - V(x_t)$.