CN107315572A - Control method for a building electromechanical system, storage medium, and terminal device - Google Patents

Control method for a building electromechanical system, storage medium, and terminal device

Info

Publication number
CN107315572A
CN107315572A (application CN201710592114.3A)
Authority
CN
China
Prior art keywords
policy
update
state
value function
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710592114.3A
Other languages
Chinese (zh)
Other versions
CN107315572B (en)
Inventor
孙凫
孙一凫
吴若飒
张豪
王宗祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Geyun Technology Co Ltd
Original Assignee
Beijing Geyun Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Geyun Technology Co Ltd filed Critical Beijing Geyun Technology Co Ltd
Priority to CN201710592114.3A priority Critical patent/CN107315572B/en
Publication of CN107315572A publication Critical patent/CN107315572A/en
Application granted granted Critical
Publication of CN107315572B publication Critical patent/CN107315572B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 19/00 Programme-control systems
    • G05B 19/02 Programme-control systems electric
    • G05B 19/04 Programme control other than numerical control, i.e. in sequence controllers or logic controllers
    • G05B 19/042 Programme control other than numerical control using digital processors

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

This application provides a control method for a building electromechanical system, a storage medium, and a terminal device. The method includes: obtaining sensor data and determining a current state according to preset target data; predicting, according to a policy-based value function, the value of performing a corresponding action under the current state according to a current policy, and iteratively updating the value function and its policy according to a preset algorithm, until the updated policy is identical to the policy before the update; and determining, according to the updated policy, the action corresponding to the current state and performing it. Control efficiency is improved, reliance on human experience is removed, and low energy consumption can be achieved.

Description

Control method for a building electromechanical system, storage medium, and terminal device
Technical field
The present application relates to the field of control technology for building electromechanical systems, and in particular to a control method for a building electromechanical system, a storage medium, and a terminal device.
Background technology
Building electromechanical equipment is an indispensable component of buildings, including industrial, civil, and public buildings, and covers water supply and drainage, electrical systems, heating, ventilation, fire protection, communications, automatic control, and the like.
Modern building electromechanical equipment generally uses traditional algorithms such as proportional-integral-derivative (PID) control or fuzzy control. These scale poorly: a large number of parameters must be adjusted manually for a specific building or room, or set to empirical values based on experience. The control effect finally achieved is also coarse, and energy consumption is high.
Summary of the invention
In view of this, embodiments of the present application provide a control method for a building electromechanical system, a storage medium, and a terminal device, to solve the technical problems in the prior art that the automatic control of building electromechanical systems is coarse, its precision is too low, and it relies heavily on human experience.
According to one aspect of the embodiments of the present application, a control method for a building electromechanical system is provided, the method including: obtaining sensor data and determining a current state according to preset target data; predicting, according to a policy-based value function, the value of performing a corresponding action under the current state according to a current policy, and iteratively updating the value function and its policy according to a preset algorithm, until the updated policy is identical to the policy before the update; and determining, according to the updated policy, the action corresponding to the current state and performing it.
According to another aspect of the embodiments of the present application, a terminal device is provided, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to: obtain sensor data and determine a current state according to preset target data; predict, according to a policy-based value function, the value of performing a corresponding action under the current state according to a current policy, and iteratively update the value function and its policy according to a preset algorithm, until the updated policy is identical to the policy before the update; and determine, according to the updated policy, the action corresponding to the current state and perform it.
According to yet another aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which computer instructions are stored; when executed by a processor, the instructions implement the steps of the above control method for a building electromechanical system.
Beneficial effects of the embodiments of the present application include: the control policy is optimized in real time using measured data, which improves control efficiency, removes the reliance on human experience, and reduces energy consumption; policy-based control helps find the globally optimal solution of the building electromechanical system, so that optimal control of the system over multiple devices and multiple targets can be achieved.
Brief description of the drawings
The above and other objects, features, and advantages of the present application will become apparent from the following description of its embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a flow diagram of the control method for a building electromechanical system provided by an embodiment of the present application;
Fig. 2 is a flow diagram of iteratively updating the value function and its policy in an embodiment of the present application;
Fig. 3 is a flow diagram of the control method for a building electromechanical system provided by an embodiment of the present application.
Detailed description of embodiments
The application is described below on the basis of embodiments, but is not restricted to them. Some specific details are set forth in the detailed description below; a person skilled in the art can fully understand the application without these details. To avoid obscuring the essence of the application, well-known methods, processes, flows, elements, and circuits are not described in detail.
In addition, a person skilled in the art should understand that the accompanying drawings are provided for the purpose of illustration and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, words such as "comprise" and "include" throughout the specification and the claims should be construed in an inclusive rather than an exclusive or exhaustive sense; that is, in the sense of "including, but not limited to".
In the description of the present application, it should be understood that terms such as "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance. In addition, unless otherwise indicated, "multiple" means two or more.
Embodiments of the present application use a reinforcement learning method to update a policy-based value function: the building electromechanical system continuously learns from measured environmental data, optimizes the control policy, controls the equipment with the learned optimal policy, and finally reaches the set target. Through reinforcement learning, the system iterates the policy-based value function and updates the policy until the policy converges, thereby finding the optimal policy and, for each state, the action with the greatest value under that policy. This not only improves control efficiency and reduces energy consumption, but also removes the reliance on human experience, saving substantial labor; the approach is also highly extensible and reproducible, and can be applied to other building electromechanical systems.
Fig. 1 shows the control method for a building electromechanical system provided by an embodiment of the present application. It is applicable to a terminal device, which can be a computer, a console, a server, or the like; the method comprises the following steps.
S10: obtain sensor data and determine the current state according to preset target data.
The data collected by the sensors can include the ambient conditions inside the building, the power supply and water supply conditions, the operation of pipeline equipment, and the like. A target value to be reached can be preset for each item of data, so that the state inside the building can be managed and controlled. The current state can be determined according to the difference between the sensor data and the preset target values.
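For illustration, the step above can be sketched in Python as deriving a discrete state from the difference between sensor readings and preset targets. The function names, the bucketing step, and the example readings are assumptions for illustration, not details from the patent.

```python
# Illustrative sketch (not from the patent): derive a discrete state from
# the difference between sensor data and preset target values.

def bucket(diff, step=1.0):
    """Map a continuous deviation from its target into an integer bucket."""
    return int(round(diff / step))

def current_state(sensor_data, targets, step=1.0):
    """State = tuple of bucketed (measured - target) deviations, key-sorted."""
    return tuple(bucket(sensor_data[k] - targets[k], step) for k in sorted(targets))

state = current_state({"temp_c": 23.4, "humidity": 52.0},
                      {"temp_c": 22.0, "humidity": 50.0})
```

Bucketing keeps the state space finite, which matches the enumerated state space described below.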
S11: predict, according to the policy-based value function, the value of performing the corresponding action under the current state according to the current policy, and iteratively update the value function and its policy according to a preset algorithm, until the updated policy is identical to the policy before the update.
A policy is a set of actions, one for each state: it contains the correspondence between every state that may occur and the action to be performed next. If a state contains multiple data variables, all states that may occur can be determined by enumerating all combinations of those variables; likewise, each corresponding action may contain multiple controlled variables.
The value function reflects the correspondence between states, actions, and values, and defines the state space and the action space. If a state contains multiple data variables, the whole state space is defined by enumerating all combinations of those variables; if an action contains multiple controlled quantities, the whole action space is defined by enumerating all combinations of those controlled quantities. Value refers to the benefit of performing each action in each state: the greater the value, the better the effect of performing that action in that state, which helps approach the preset control target faster. The value function can be a Q-value matrix or an approximating function.
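As a concrete, hypothetical illustration of the Q-value-matrix form, the value function can be held as a table over the enumerated state and action spaces; the grids and initial values below are assumptions, not values from the patent.

```python
# Illustrative sketch: a Q-value matrix over enumerated state and action
# spaces, randomly initialized, plus a fixed initial action per state.
import random

random.seed(0)
states = [(s,) for s in range(-2, 3)]        # e.g. bucketed temperature deviation
actions = [(u,) for u in (0.0, 0.5, 1.0)]    # e.g. one valve-opening controlled quantity

Q = {(x, u): random.uniform(-1.0, 1.0) for x in states for u in actions}
policy = {x: actions[0] for x in states}     # initial policy: one action per state
```

With multiple data variables or controlled quantities, the tuples simply grow, which is the enumeration of combinations described above.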
When the policy is initialized, an action to be performed can be configured for every state. When the value function is initialized, the value of performing each action in each state can be assigned a random value. In addition, a reward function needs to be initialized: according to the preset target values of the building's indicator variables (for example, environmental indicators, power supply indicators, water supply indicators), the distance between the current value of each indicator and its target value is computed and negated as the reward for the corresponding state:
r(y) = -(y1 - y10)² - (y2 - y20)² - (y3 - y30)² - ...; where r(y) is the reward, y1, y2, y3, ... are the current values of the indicator variables, and y10, y20, y30, ... are their target values.
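The reward above is the negated squared distance between current indicator values and their targets; a direct Python transcription follows (the indicator names are illustrative assumptions):

```python
def reward(current, target):
    """r(y) = -(y1 - y10)^2 - (y2 - y20)^2 - ...: negated squared distance
    between each indicator's current value and its preset target value."""
    return -sum((current[k] - target[k]) ** 2 for k in target)

r = reward({"temp_c": 23.0, "co2_ppm": 600.0},
           {"temp_c": 22.0, "co2_ppm": 550.0})   # -(1.0)**2 - (50.0)**2
```

The reward is 0 exactly at the targets and grows more negative the further any indicator drifts, so maximizing value drives all indicators toward their targets at once.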
After the current state is determined, the identical or closest state is matched in the policy, so as to determine the action corresponding to the current state, and the value of performing that action under the current state is then determined according to the value function. The value function is then updated according to the preset algorithm, the action with the greatest value under the current state is determined according to the updated value function, and the action with the greatest value is updated into the current policy, bound to the current state.
As shown in Fig. 2, S11 can further comprise the following steps:
S110: update the value function according to the preset algorithm.
S111: determine, according to the updated value function, the action with the greatest value under the current state, and update the action with the greatest value into the current policy of the value function.
S112: judge whether the updated policy is identical to the policy before the update. If they are identical, perform S113; if they differ, return to S110.
S113: stop the iteration and take the updated policy as the current optimal policy of the value function.
If the action with the greatest value is identical to the action originally corresponding to the current state in the policy, the policy is not substantially changed by the update; if it differs from the action originally corresponding to the current state, the policy has changed.
If the policy has changed, the value function continues to be updated according to the preset algorithm, the action with the greatest value under the current state is re-determined according to the updated value function, and the policy is updated again, until the updated policy is identical to the policy before the update, i.e. the action corresponding to the current state in the policy no longer changes; at this point the optimal policy for the current state is considered to have been found.
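The S110-S113 loop is a form of policy iteration; a minimal sketch follows, with a generic `update_q` callable standing in for the preset algorithm of S110 (all names and the iteration cap are illustrative assumptions):

```python
def policy_iteration(states, actions, update_q, max_iter=100):
    """Alternate value-function update (S110) and greedy policy improvement
    (S111) until the policy stops changing (S112/S113)."""
    policy = {x: actions[0] for x in states}            # initial policy
    Q = {(x, u): 0.0 for x in states for u in actions}  # initial value function
    for _ in range(max_iter):
        Q = update_q(Q, policy)                         # S110: update values
        new_policy = {x: max(actions, key=lambda u: Q[(x, u)])
                      for x in states}                  # S111: greatest-value action
        if new_policy == policy:                        # S112: policy unchanged?
            break                                       # S113: stop iterating
        policy = new_policy
    return policy, Q
```

Convergence is detected purely by comparing the policy before and after the update, exactly as in S112.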
In one embodiment, the policy-based value function Q_{h_l} can be updated based on the Bellman equation:
Q_{h_l}(x, u) = r(x, u) + γ·Q_{h_l}(f(x, u), h_l(f(x, u))); where l is the iteration number, Q_{h_l}(x, u) is the Q value obtained by performing action u in state x according to policy h_l, r(x, u) is the reward obtained by performing action u in state x, γ is the discount factor, and f(x, u) is the transition equation giving the next state reached by performing action u in state x.
The action corresponding to the maximum Q value is then found according to the updated value function and updated into the policy, i.e. h_{l+1}(x) ∈ argmax_u Q_{h_l}(x, u). When the action corresponding to the current state no longer changes under the policy, i.e. h_{l+1} = h_l, the iteration stops; otherwise the value function Q_{h_l} and its policy continue to be iterated until h_{l+1} = h_l.
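The Bellman update and the greedy improvement step can be sketched as follows for a deterministic transition model f; this is an illustrative reading of the equations above (repeated sweeps to a fixed point), not code from the patent:

```python
def evaluate_policy(Q, policy, r, f, gamma=0.9, sweeps=200):
    """Iterate Q(x,u) = r(x,u) + gamma * Q(f(x,u), h(f(x,u))) to a fixed point."""
    for _ in range(sweeps):
        Q = {(x, u): r(x, u) + gamma * Q[(f(x, u), policy[f(x, u)])]
             for (x, u) in Q}
    return Q

def improve_policy(Q, states, actions):
    """h_{l+1}(x) in argmax_u Q_{h_l}(x, u)."""
    return {x: max(actions, key=lambda u: Q[(x, u)]) for x in states}
```

Each sweep contracts the error by the discount factor γ, so enough sweeps bring Q arbitrarily close to the fixed point of the Bellman equation for the current policy.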
S12: determine, according to the updated policy, the action corresponding to the current state and perform it.
In this embodiment, the control policy is optimized in real time using measured data, which improves control efficiency, removes the reliance on human experience, and reduces energy consumption. Policy-based control helps find the globally optimal solution of the building electromechanical system, so that optimal control of a complex system over multiple devices and multiple targets can be achieved.
In one embodiment, besides being pre-configured, the initial policy can also be obtained by training a neural network with accumulated historical data. The states accumulated over a preset duration and the actions performed in them can be used as training data; alternatively, training of the pre-built neural network starts once the accumulated data reaches a predetermined amount, and continues until the error between the action predicted by the network and the action actually performed in the accumulated training data falls below a preset threshold. The neural network is divided into an input layer, a hidden layer, and an output layer; its input is the state and its output is the predicted action. The hidden layer is configured with 10 hidden nodes, and this embodiment preferably uses the rectified linear unit (ReLU) activation function, whose expression is f(x) = max(0, x). The ReLU activation function has two advantages. First, its gradient does not saturate: the gradient is 1{x > 0}, which alleviates the vanishing-gradient problem during back-propagation. Second, it is fast to compute: during forward propagation, the sigmoid and tanh activation functions require computing an exponential when evaluating the activation, whereas the ReLU function only needs a threshold (if x < 0 then f(x) = 0; if x > 0 then f(x) = x), which speeds up forward propagation.
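A small NumPy sketch of the described network shape (state in, predicted action out, 10 hidden ReLU nodes) follows; the input/output dimensions and the random weights are illustrative assumptions, since the patent does not fix them:

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x); its gradient is 1{x > 0}, so it does not saturate."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Subgradient of ReLU: 1 where x > 0, 0 elsewhere."""
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(10, 3)), np.zeros(10)   # 10 hidden nodes, 3 state variables
W2, b2 = rng.normal(size=(2, 10)), np.zeros(2)    # 2 controlled quantities out

def predict_action(state):
    """Forward pass: state -> hidden ReLU layer -> predicted action."""
    return W2 @ relu(W1 @ state + b1) + b2
```

Only the thresholding in `relu` is needed per activation, which is the forward-propagation speed advantage the description notes over sigmoid and tanh.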
In addition, when the initial policy is obtained by training a neural network, if the process of iteratively updating the value function and its policy described in the above embodiment still fails to reach the preset target values within a preset duration (for example, 30 minutes), the neural network can be further trained with the states and their action data accumulated during that duration. As shown in Fig. 3, the method further comprises:
S13: after the preset duration has elapsed, judge, according to the data obtained from the sensors, whether the preset target state has been reached. When the preset target state has not been reached, perform step S14.
S14: continue training the neural network with the states and their action data accumulated during the preset duration. After training yields an updated and more timely initial control policy, return to step S10 and continue controlling the building electromechanical system according to the new initial policy, so as to reach the preset target state as soon as possible.
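Putting S10, S12, S13, and S14 together, the outer loop might look like the following sketch; the helper callables and the step-count stand-in for the preset duration are assumptions for illustration:

```python
def control_loop(get_state, at_target, policy, execute, retrain, max_steps=100):
    """S10: read state; S12: act per policy; S13: check the target; S14:
    retrain on the accumulated (state, action) data if the target is missed."""
    history = []
    for _ in range(max_steps):
        state = get_state()                 # S10: current state from sensors
        if at_target(state):                # S13: preset target state reached
            return history
        action = policy[state]              # S12: action under the current policy
        execute(action)
        history.append((state, action))
    retrain(history)                        # S14: continue training the network
    return history
```

The accumulated `history` is exactly the state/action data S14 says should feed further training when the target is not reached in time.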
In addition, in embodiments of the present application, the terminal device can implement each of the above functional steps by a hardware processor. The terminal device includes: a processor, and a memory for storing processor-executable instructions; wherein the processor is configured to: obtain sensor data and determine the current state according to preset target data; predict, according to the policy-based value function, the value of performing the corresponding action under the current state according to the current policy, and iteratively update the value function and its policy according to a preset algorithm, until the updated policy is identical to the policy before the update; and determine, according to the updated policy, the action corresponding to the current state and perform it.
In one embodiment, iteratively updating the value function and its policy according to the preset algorithm until the updated policy is identical to the policy before the update includes:
updating the value function according to the preset algorithm; determining, according to the updated value function, the action with the greatest value under the current state, and updating the action with the greatest value into the current policy of the value function;
judging whether the updated policy is identical to the policy before the update; when they differ, returning to the above step of iteratively updating the value function and its policy; when they are identical, stopping the iteration and taking the updated policy as the current optimal policy of the value function.
In one embodiment, updating the value function according to the preset algorithm includes:
updating the policy-based value function Q_{h_l} based on the Bellman equation Q_{h_l}(x, u) = r(x, u) + γ·Q_{h_l}(f(x, u), h_l(f(x, u))), where l is the iteration number, Q_{h_l}(x, u) is the Q value obtained by performing action u in state x according to policy h_l, r(x, u) is the reward obtained by performing action u in state x, γ is the discount factor, and f(x, u) is the transition equation giving the next state reached by performing action u in state x.
In one embodiment, the processor is further configured to: train a neural network with accumulated historical states and their action data to obtain the policy, the input of the neural network being a state and the output being an action.
In one embodiment, the activation function of the neural network is the ReLU function.
In one embodiment, the processor is further configured to: if the data obtained from the sensors after a preset duration has not reached the preset target data, continue training the neural network with the states and their action data accumulated during the preset duration.
A person skilled in the art will understand that embodiments of the present application may be provided as a method, an apparatus (device), or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The application is described with reference to flowcharts and/or block diagrams of the method, apparatus (device), and computer program product according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are only preferred embodiments of the present application and are not intended to limit it; for a person skilled in the art, the application may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the application shall be included within its scope of protection.

Claims (8)

1. A control method for a building electromechanical system, characterized in that the method comprises:
obtaining sensor data and determining a current state according to preset target data;
predicting, according to a policy-based value function, the value of performing a corresponding action under the current state according to a current policy, and iteratively updating the value function and its policy according to a preset algorithm, until the updated policy is identical to the policy before the update;
determining, according to the updated policy, the action corresponding to the current state and performing it.
2. The method according to claim 1, characterized in that iteratively updating the value function and its policy according to the preset algorithm until the updated policy is identical to the policy before the update comprises:
updating the value function according to the preset algorithm;
determining, according to the updated value function, the action with the greatest value under the current state, and updating the action with the greatest value into the current policy of the value function;
judging whether the updated policy is identical to the policy before the update;
when the updated policy differs from the policy before the update, returning to the above step of iteratively updating the value function and its policy;
when the updated policy is identical to the policy before the update, stopping the iteration and taking the updated policy as the current optimal policy of the value function.
3. The method according to claim 2, characterized in that updating the value function according to the preset algorithm comprises:
updating the policy-based value function Q_{h_l} based on the Bellman equation Q_{h_l}(x, u) = r(x, u) + γ·Q_{h_l}(f(x, u), h_l(f(x, u))), where l is the iteration number, Q_{h_l}(x, u) is the Q value obtained by performing action u in state x according to policy h_l, r(x, u) is the reward obtained by performing action u in state x, γ is the discount factor, and f(x, u) is the transition equation giving the next state reached by performing action u in state x.
4. The method according to claim 1, characterized in that the method further comprises:
training a neural network with accumulated historical states and their action data to obtain the policy, the input of the neural network being a state and the output being an action.
5. The method according to claim 4, characterized in that the activation function of the neural network is a ReLU function.
6. The method according to claim 4, characterized in that the method further comprises:
if the data obtained from the sensors after a preset duration has not reached the preset target data, continuing to train the neural network with the states and their action data accumulated during the preset duration.
7. A terminal device, characterized by comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the control method for a building electromechanical system according to any one of claims 1 to 6.
8. A computer-readable storage medium having computer instructions stored thereon, characterized in that the instructions, when executed by a processor, implement the steps of the control method for a building electromechanical system according to any one of claims 1 to 6.
CN201710592114.3A 2017-07-19 2017-07-19 Control method of building electromechanical system, storage medium and terminal equipment Active CN107315572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710592114.3A CN107315572B (en) 2017-07-19 2017-07-19 Control method of building electromechanical system, storage medium and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710592114.3A CN107315572B (en) 2017-07-19 2017-07-19 Control method of building electromechanical system, storage medium and terminal equipment

Publications (2)

Publication Number Publication Date
CN107315572A true CN107315572A (en) 2017-11-03
CN107315572B CN107315572B (en) 2020-08-11

Family

ID=60178838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710592114.3A Active CN107315572B (en) 2017-07-19 2017-07-19 Control method of building electromechanical system, storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN107315572B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111505944A (en) * 2019-01-30 2020-08-07 珠海格力电器股份有限公司 Energy-saving control strategy learning method, and method and device for realizing air conditioning energy control
CN117970819A (en) * 2024-04-01 2024-05-03 北京邮电大学 Optimal control method and system for nonlinear electromechanical system under state constraint

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982344A (en) * 2012-11-12 2013-03-20 浙江大学 Support vector machine sorting method based on simultaneously blending multi-view features and multi-label information
CN105652754A (en) * 2016-03-18 2016-06-08 江苏联宏自动化系统工程有限公司 Comprehensive electricity consumption measurement and control management terminal
CN105959353A (en) * 2016-04-22 2016-09-21 广东石油化工学院 Cloud operation access control method based on average reinforcement learning and Gaussian process regression
CN106125595A (en) * 2016-06-22 2016-11-16 北京小米移动软件有限公司 Control the method and device of terminal applies
CN106228314A (en) * 2016-08-11 2016-12-14 电子科技大学 The workflow schedule method of study is strengthened based on the degree of depth
CN106842925A (en) * 2017-01-20 2017-06-13 清华大学 A kind of locomotive smart steering method and system based on deeply study

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LEEMON BAIRD: "Residual Algorithms: Reinforcement Learning with Function Approximation", 《PROCEEDINGS OF THE TWELFTH INTERNATIONAL CONFERENCE ON MACHINE LEARNING》 *
SUI XIANCHAO: "Research on voltage and reactive power control methods for power systems", 《China Master's Theses Full-text Database, Engineering Science and Technology II》 *

Also Published As

Publication number Publication date
CN107315572B (en) 2020-08-11

Similar Documents

Publication Publication Date Title
US20150206050A1 (en) Configuring neural network for low spiking rate
CN105637540A (en) Methods and apparatus for reinforcement learning
WO2017091629A1 (en) Reinforcement learning using confidence scores
KR102596158B1 Reinforcement learning through dual actor-critic algorithm
CN106709565A (en) Optimization method and device for neural network
TW201602807A (en) COLD neuron spike timing back propagation
US20050273296A1 (en) Neural network model for electric submersible pump system
CN110781969B (en) Air conditioner air volume control method, device and medium based on deep reinforcement learning
CN108133085B (en) Method and system for predicting equipment temperature in electronic equipment cabin
KR20160062052A (en) Automated method for modifying neural dynamics
KR20160145636A (en) Modulating plasticity by global scalar values in a spiking neural network
TWI550530B (en) Method, apparatus, computer readable medium, and computer program product for generating compact representations of spike timing-dependent plasticity curves
TW201602923A (en) Probabilistic representation of large sequences using spiking neural network
CN105335375B Topic crawling method and apparatus
CN107315572A Control method of building electromechanical system, storage medium and terminal equipment
CN116627027A Optimal robust control method based on improved PID
JP6902487B2 (en) Machine learning system
Mousavi et al. Applying Q(λ)-learning in deep reinforcement learning to play Atari games
CN116050505A (en) Partner network-based intelligent agent deep reinforcement learning method
CN107367929A Update method of Q-value matrix, storage medium and terminal equipment
WO2020121494A1 (en) Arithmetic device, action determination method, and non-transitory computer-readable medium storing control program
CN115906673B (en) Combat entity behavior model integrated modeling method and system
GB2595833A (en) System and method for applying artificial intelligence techniques to reservoir fluid geodynamics
CN107315573A Control method of building electromechanical system, storage medium and terminal equipment
US9342782B2 (en) Stochastic delay plasticity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant