CN108021028B

CN108021028B - Multi-dimensional cooperative control method based on relevant redundancy transformation and reinforcement learning

Info

Publication number: CN108021028B
Application number: CN201711407168.4A
Authority: CN
Inventors: 李鹏华; 王欢; 李嫄源; 朱智勤; 张家昌
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2017-12-22
Filing date: 2017-12-22
Publication date: 2019-04-09
Anticipated expiration: 2037-12-22
Also published as: CN108021028A

Abstract

The present invention relates to a multi-dimensional cooperative control method for smart RVs based on relevant redundancy transformation and reinforcement learning, and belongs to the Internet of Things field. The method addresses the need of smart commercial touring RVs to unify device connection protocols, share device interfaces, and raise the level of system integration for autonomous cooperative control. It uses an autonomous control guidance policy based on a POMDP model and deep reinforcement learning. The control states obtained by multi-dimensional intelligent fusion serve as the input to the computer control system. A POMDP model is established to perceive, adapt to, and track changes in the device control states, and a policy optimization method based on deep reinforcement learning selects the optimal behavior policy, achieving autonomous cooperative control of the commercial touring RV. The invention improves the validity and real-time performance of the final decision, increases the accuracy of interaction feedback and the degree of policy optimization achieved through learning, and improves the user experience.

Description

Multi-dimensional cooperative control method based on relevant redundancy transformation and reinforcement learning

Technical field

The invention belongs to the field of the Internet of Things and relates to a multi-dimensional cooperative control method based on relevant redundancy transformation and reinforcement learning.

Background art

As a product of the deep integration of intelligent connected vehicles and smart homes, the smart RV uses multi-sensor data acquisition and in-vehicle gateway communication to control on-board devices intelligently, meeting people's expectations for living space and intelligent living in an RV. Intelligent control is one of the core technologies of the smart RV, and the real-time performance and accuracy with which its control policies are executed directly determine the quality of the RV. Smart RVs currently on the market, however, suffer from a single control mode, insufficiently intelligent control-policy generation, and excessive execution cost. To address this, this patent unifies and fuses multi-source heterogeneous information features to integrate data from multiple sensors, laying the foundation for control in multi-modal, complex environments. It combines a control-state guidance policy method under a POMDP model with control-state guidance-policy optimization by deep reinforcement learning, making control decisions more accurate and intelligent. It also uses bus-based low-level control, which greatly reduces the cost of connecting sensors, improves the reliability of the whole sensing platform, and saves substantial computing resources.

Summary of the invention

In view of this, the object of the present invention is to provide a multi-dimensional cooperative control method based on relevant redundancy transformation and reinforcement learning.

In order to achieve the above objectives, the invention provides the following technical scheme:

A multi-dimensional cooperative control method based on relevant redundancy transformation and reinforcement learning comprises the following steps:

S1: unify and fuse multi-source heterogeneous information features;

S2: guide the control-state policy using a POMDP model;

S3: optimize the control-state guidance policy using deep reinforcement learning;

S4: apply distributed, bus-based low-level control.

Further, step S1 is specifically as follows: in a multi-sensor network environment, the heterogeneous information from multiple sensors is analyzed with the canonical correlation analysis (CCA) algorithm and the Isomorphic Relevant Redundant Transformation (IRRT) algorithm, which map the heterogeneous information into a unified, computable space of common dimension. After the feature information has been given a unified representation, the information is fused.
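The patent does not disclose how IRRT works, so the following Python sketch illustrates only the CCA part of step S1: two sensor modalities are projected onto their top k canonical directions, which places them in a common space of equal dimension, and the projections are then fused by concatenation. The function name `cca_fuse`, the small ridge term `reg`, and concatenation as the fusion rule are illustrative assumptions, not the patent's method.

```python
import numpy as np

def _inv_sqrt(c: np.ndarray) -> np.ndarray:
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(c)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def cca_fuse(x: np.ndarray, y: np.ndarray, k: int, reg: float = 1e-6):
    """Project two modalities onto k canonical directions and fuse them.

    x: (n, p) features from sensor modality 1
    y: (n, q) features from sensor modality 2
    Returns (fused, corrs): fused is (n, 2k); corrs are the k canonical correlations.
    """
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    n = x.shape[0]
    # Covariance blocks; the ridge term keeps the inverse square roots stable.
    cxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    cyy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    cxy = xc.T @ yc / (n - 1)
    kx, ky = _inv_sqrt(cxx), _inv_sqrt(cyy)
    # Singular vectors of the whitened cross-covariance give the canonical directions.
    u, s, vt = np.linalg.svd(kx @ cxy @ ky)
    wx = kx @ u[:, :k]
    wy = ky @ vt.T[:, :k]
    zx, zy = xc @ wx, yc @ wy
    fused = np.hstack([zx, zy])  # feature-level fusion by concatenation
    return fused, s[:k]
```

With two modalities driven by a shared latent signal, the first canonical correlation should be close to 1, and the correlation between the first projected columns of each modality equals it.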

Further, step S2 is specifically as follows: a POMDP model is established from the control states of the various devices of the commercial touring RV, obtained through multi-source heterogeneous fusion, in order to perceive, adapt to, and track changes in the device control states. The internal actuator of the POMDP model applies actions to the device control states, causing them to change and yielding a certain return. The cumulative return obtained is used to evaluate the series of policies executed, so the device control problem of the commercial touring RV is converted into a policy selection problem. Specifically, the POMDP model is described as the tuple ⟨S, A, T, O, Q, β⟩. The belief state of the integrated environment under the POMDP probability distribution is expressed as B = {b_t}, where the probability distribution at time t is b_t = {b_t(s₁), ..., b_t(s_m)} and b_t(s_i) is the probability that the environment is in state s_i at time t. From the observation of the environment and the choice of control action at the current time, the POMDP model infers the belief over the control state at the next time step. Suppose the initial belief state is b₀; executing action a and receiving observation o yields the next belief state b₁. In control state S₁ the model receives observation O₁ and its internal state is i₁. The corresponding action a₁ is then computed according to the control-state guidance policy, which moves the environment from S₁ to S₂; the model receives return r₁ and observation O₂, and its internal state moves from i₁(b₁) to i₂(b₂). The model continues to run in this way.
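The text states that the model infers the next belief b₁ from b₀, the action a and the observation o, but does not give the update rule. A minimal sketch, assuming the textbook Bayesian belief update for a discrete POMDP with a transition model `T` and an observation model `Z` (a hypothetical name; the patent's tuple ⟨S, A, T, O, Q, β⟩ does not say which symbol holds the observation probabilities):

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Bayesian belief update for a discrete POMDP (assumed standard form).

    b: (m,) current belief b_t over m control states
    a: action index; o: observation index
    T: (|A|, m, m) with T[a, s, s2] = P(s2 | s, a)
    Z: (|A|, m, |O|) with Z[a, s2, o] = P(o | s2, a)
    Returns b_{t+1}(s2), proportional to Z[a, s2, o] * sum_s T[a, s, s2] * b(s).
    """
    predicted = b @ T[a]             # prediction: sum over previous states s
    unnorm = Z[a, :, o] * predicted  # correction: weight by observation likelihood
    total = unnorm.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnorm / total
```

For example, with a single "hold" action (identity transitions) and a device sensor that reports the true on/off state with probability 0.8, a uniform belief becomes [0.2, 0.8] after one "on" reading and about [0.06, 0.94] after a second one.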

Specifically, a guidance-policy estimation function is constructed to perform dialogue state tracking. The function is $V_n(b) = \sum_{s \in S} b(s)\,\alpha_n(s)$, where $\alpha_n(s)$ is the value of the action vector of node n in state s. Through evolution of the control-state policy, the control-state guidance policy function for the next time step is obtained [equation not reproduced in the source text], where $V^*$ denotes the optimal policy and $V^*_t$ denotes the policy function of the previous time step.
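As a rough illustration, assuming the estimation function takes the α-vector form $V_n(b) = \sum_s b(s)\,\alpha_n(s)$ over a finite set of policy-graph nodes n, the value of a belief and the best node can be computed as below. Taking the maximum over nodes is the usual way α-vector values are combined; it is an assumption here, since the patent's next-step equation is not reproduced. `node_values` and `best_node` are hypothetical names.

```python
import numpy as np

def node_values(b: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """V_n(b) = sum_s b(s) * alpha_n(s) for every node n (one row of alphas per node)."""
    return alphas @ b

def best_node(b: np.ndarray, alphas: np.ndarray):
    """Return (n*, V(b)) where V(b) = max_n V_n(b) and n* attains the maximum."""
    values = node_values(b, alphas)
    n = int(np.argmax(values))
    return n, float(values[n])
```

With α-vectors [1, 0], [0, 1] and [0.6, 0.6], an uncertain belief [0.5, 0.5] selects the "hedging" node 2 (value 0.6), while a confident belief [0.9, 0.1] selects node 0.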

Further, step S3 is specifically as follows: the guidance policy for the device control states of the commercial touring RV is obtained from the POMDP model, and the optimal behavior policy is selected by a policy optimization method based on the deep reinforcement learning algorithm DQN. Specifically, a Q-network Q(s, a; θ) defines the behavior policy, a target Q-network Q(s, a; θ^-) generates the target Q values for the DQN loss term, and replay memory stores state samples from the POMDP model, which are drawn at random to train the Q-network. Reinforcement learning defines the expected total return of the POMDP model as $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where each reward r_t is discounted by a factor γ ∈ [0, 1] per time step and T is the terminal time step. The action-value function Q^π(s, a) gives the expected return of the observed state s_t, and a neural network approximates it as Q(s, a) ≈ Q(s, a; θ). For a guidance policy π and action a, the action-value function is $Q^\pi(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, and the optimal action-value function is achieved by the policy $Q^*(s,a) = \max_\pi Q^\pi(s,a)$. The Bellman equation for the action value is constructed as $Q^*(s,a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q^*(s',a') \mid s, a\big]$ and is solved by iteratively adjusting the Q-network toward the Bellman targets.
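The expected total return $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ can be computed for every time step of a finished episode in one backward pass, using the recursion R_t = r_t + γ R_{t+1}. A minimal sketch (the function name is illustrative):

```python
def discounted_returns(rewards, gamma):
    """R_t = sum_{t'=t}^{T} gamma**(t'-t) * r_t' for every t, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # R_t = r_t + gamma * R_{t+1}
        returns[t] = running
    return returns
```

For rewards [1, 0, 2] and γ = 0.5 this gives [1.5, 1.0, 2.0].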

First, DQN uses experience replay: at each time step t of the POMDP model, the experience tuple e_t = (s_t, a_t, r_t, s_{t+1}) is stored in the replay memory D_t = {e₁, ..., e_t}.
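A minimal replay-memory sketch in Python that stores the tuples e_t = (s_t, a_t, r_t, s_{t+1}) and draws minibatches uniformly at random, i.e. (s, a, r, s') ~ U(D). The bounded capacity is an assumption (the text writes D_t as holding every tuple so far); the class and method names are hypothetical.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["s", "a", "r", "s_next"])

class ReplayMemory:
    """Replay memory D holding experience tuples e_t = (s_t, a_t, r_t, s_{t+1})."""

    def __init__(self, capacity: int):
        # Assumed bounded capacity: once full, the oldest tuples are discarded first.
        self._buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next) -> None:
        self._buffer.append(Transition(s, a, r, s_next))

    def sample(self, batch_size: int):
        """Draw batch_size distinct tuples uniformly at random from D."""
        return random.sample(list(self._buffer), batch_size)

    def __len__(self) -> int:
        return len(self._buffer)
```

Sampling at random rather than in time order breaks the correlation between consecutive control states, which is the usual motivation for replay in DQN.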

Second, DQN maintains two separate Q-networks, Q(s, a; θ) and Q(s, a; θ^-). The current parameters θ are updated repeatedly within each time step and are copied to the old parameters θ^- after every N iterations. At update iteration i, the current parameters θ are updated to minimize the mean-squared Bellman error with respect to the old parameters θ^- by optimizing the loss function $L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\big[\big(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta_i)\big)^2\big]$. For each update i, an experience tuple (s, a, r, s') ~ U(D) is sampled individually from the replay memory D, and for each sample the current parameters θ are updated by stochastic gradient descent. The descent gradient g_i is the gradient of the sample loss with respect to θ, with θ^- held fixed: $g_i = -2\big(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta_i)\big)\nabla_{\theta_i} Q(s,a;\theta_i)$.
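The update with separate current and old parameters can be sketched with a linear Q-function in place of a deep network, so that the gradient is explicit: Q(s, a; θ) = θ[a] · φ(s). The linear model, the feature vectors φ, the absence of terminal-state handling, and all function names are simplifying assumptions for illustration.

```python
import numpy as np

def q_values(theta, phi):
    """Linear Q-network: Q(s, a; theta) = theta[a] . phi(s) for all actions a."""
    return theta @ phi

def dqn_sgd_step(theta, theta_minus, sample, gamma, lr):
    """One SGD step on the sample loss (y - Q(s, a; theta))**2 with the
    Bellman target y = r + gamma * max_a' Q(s', a'; theta_minus)."""
    phi, a, r, phi_next = sample
    y = r + gamma * np.max(q_values(theta_minus, phi_next))  # target uses old params
    td_error = y - q_values(theta, phi)[a]
    grad = np.zeros_like(theta)
    grad[a] = -2.0 * td_error * phi  # dL/dtheta; only row a influences Q(s, a)
    return theta - lr * grad, td_error ** 2

def train(samples, n_actions, n_features, gamma=0.9, lr=0.05, copy_every=10):
    """Run SGD over samples, copying theta into theta_minus every copy_every steps."""
    theta = np.zeros((n_actions, n_features))
    theta_minus = theta.copy()
    for i, sample in enumerate(samples, start=1):
        theta, _ = dqn_sgd_step(theta, theta_minus, sample, gamma, lr)
        if i % copy_every == 0:
            theta_minus = theta.copy()  # refresh the old parameters theta^-
    return theta
```

In a toy single-state problem where action 0 earns reward 1, action 1 earns 0 and the state never changes, the Bellman fixed point with γ = 0.9 is Q*(s, 0) = 10 and Q*(s, 1) = 9, which this loop converges to when the two actions are sampled alternately.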

Finally, at each time step t, the preferred action with respect to the current Q-network Q(s, a; θ) is selected. A central parameter server maintains a distributed representation of the Q-network Q(s, a; θ^-). The parameter server receives the gradient information produced by reinforcement learning and, driven by an asynchronous stochastic gradient descent algorithm, uses it to modify the parameter vector θ^-.
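The parameter-server pattern can be illustrated with Python threads: each learner pulls a copy of the parameters, computes a gradient, and pushes it back without waiting for the other learners, so some gradients are computed from slightly stale parameters. The class, the quadratic toy loss in `learner`, and the learning rate are hypothetical; because of the GIL this shows only the asynchronous update pattern, not a real distributed speed-up.

```python
import threading
import numpy as np

class ParameterServer:
    """Central store for a parameter vector, modified by asynchronous SGD."""

    def __init__(self, dim: int, lr: float):
        self._params = np.zeros(dim)
        self._lr = lr
        self._lock = threading.Lock()

    def pull(self) -> np.ndarray:
        with self._lock:
            return self._params.copy()

    def push_gradient(self, grad: np.ndarray) -> None:
        # The gradient may come from stale parameters; it is applied as received.
        with self._lock:
            self._params -= self._lr * grad

def learner(server: ParameterServer, target: np.ndarray, steps: int) -> None:
    """Hypothetical learner: gradient of the toy loss 0.5 * ||params - target||**2."""
    for _ in range(steps):
        params = server.pull()
        server.push_gradient(params - target)

server = ParameterServer(dim=3, lr=0.1)
target = np.array([1.0, -2.0, 0.5])
workers = [threading.Thread(target=learner, args=(server, target, 200)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

After all learners finish, the server's parameters should sit at the minimizer of the toy loss.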

Further, step S4 is specifically as follows: an addressing scheme for the data channels based on memory mapping is designed. Taking trigger mode, timing, and load capacity into account, multiplexers and sample-and-hold circuits are used together so that the data interface channels can be shared. A self-control system with a redundant structure is designed to intelligently parse the control instructions produced by the fused decision and, accounting for interference factors such as power-supply output fluctuation, electromagnetic radiation, and distributed capacitance and inductance, to carry out autonomous control of the on-board devices.
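The patent gives no memory map, so every constant below (base address, per-channel stride, eight channels behind one 8:1 multiplexer) is a hypothetical value chosen only to illustrate memory-mapped channel addressing: a bus address is decoded into a channel index, which directly drives the multiplexer select lines, plus a register offset within that channel's block.

```python
from dataclasses import dataclass

# Hypothetical memory map: all values are illustrative assumptions.
CHANNEL_BASE = 0x4000_0000  # first byte of the channel register window
CHANNEL_STRIDE = 0x10       # bytes reserved per channel
NUM_CHANNELS = 8            # channels sharing one 8:1 multiplexer
MUX_SELECT_BITS = 3         # log2(NUM_CHANNELS)

@dataclass(frozen=True)
class ChannelAccess:
    channel: int       # channel index, also the multiplexer select value
    register: int      # byte offset inside the channel's register block
    mux_select: tuple  # select lines S2, S1, S0 driven to the multiplexer

def channel_address(channel: int, register: int = 0) -> int:
    """Memory-mapped address of a register belonging to a data channel."""
    if not 0 <= channel < NUM_CHANNELS or not 0 <= register < CHANNEL_STRIDE:
        raise ValueError("channel or register out of range")
    return CHANNEL_BASE + channel * CHANNEL_STRIDE + register

def decode(address: int) -> ChannelAccess:
    """Map a bus address back to the channel (and mux setting) it selects."""
    offset = address - CHANNEL_BASE
    if not 0 <= offset < NUM_CHANNELS * CHANNEL_STRIDE:
        raise ValueError("address outside the channel window")
    channel, register = divmod(offset, CHANNEL_STRIDE)
    bits = tuple((channel >> i) & 1 for i in reversed(range(MUX_SELECT_BITS)))
    return ChannelAccess(channel, register, bits)
```

For instance, register 4 of channel 5 maps to 0x40000054, and decoding that address sets the select lines to (1, 0, 1).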

The beneficial effects of the present invention are as follows. The invention is a multi-dimensional cooperative control method for smart RVs based on relevant redundancy transformation and reinforcement learning, covering device working-state monitoring, human-like perception of the driving environment, human feature recognition, user intention inference, resource information interaction, and autonomous control of vehicle devices. Compared with other methods, this patent addresses the need of smart commercial touring RVs to unify device connection protocols, share device interfaces, and raise the level of system integration for autonomous cooperative control. It uses an autonomous control guidance policy based on a POMDP model and deep reinforcement learning: the control states obtained by multi-dimensional intelligent fusion serve as the input to the computer control system, a POMDP model is established to perceive, adapt to, and track changes in the device control states, and a policy optimization method based on deep reinforcement learning (DQN) selects the optimal behavior policy, achieving autonomous cooperative control of the commercial touring RV. Combining the POMDP model with deep reinforcement learning yields optimal control decisions for the smart RV in multi-modal operation and complex environments. This improves the validity and real-time performance of the final decision, increases the accuracy of interaction feedback and the degree of policy optimization achieved through learning, and improves the user experience.

Description of the drawings

To make the object, technical solution, and beneficial effects of the present invention clearer, the following drawings are provided for illustration:

Fig. 1 shows the unified representation and fusion of multi-source heterogeneous features;

Fig. 2 shows the control-state guidance policy under the POMDP model;

Fig. 3 shows the control-state policy optimization model based on deep reinforcement learning.

Specific embodiment

A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

As shown in Figs. 1, 2, and 3, the implementation details of each part of the invention are as follows:

1. Unification and fusion of multi-source heterogeneous information features. The process comprises the following 5 steps:

(1) multi-modal data acquisition;

(2) multi-modal feature extraction;

(3) feature association;

(4) unified representation of multi-modal features;

(5) fusion of multi-source heterogeneous information features.

2. Control-state guidance policy design based on the POMDP model. The process comprises the following 4 steps:

(1) establish a POMDP model to perceive, adapt to, and track changes in the device control states;

(2) the actuator applies actions to the device control states and obtains a certain return;

(3) evaluate the series of executed policies according to the cumulative return obtained;

(4) select the resulting policy.

3. Control-state guidance-policy optimization based on deep reinforcement learning. The process comprises the following 3 steps:

(1) define the behavior policy with a Q-network;

(2) generate the target Q values for the DQN loss term;

(3) store state samples from the POMDP model in replay memory and draw them at random to train the Q-network.

4. Distributed bus-based low-level control. A centralized control scheme for the smart RV gateway based on the CAN bus is adopted, and smart RV interface unit modules are introduced to isolate the diversity of the controlled objects and reduce system complexity. Keyboard control and remote control are separated from the smart RV gateway, which improves the gateway's reliability. The CAN bus is chosen for the smart RV's internal network, which lowers system cost and keeps the system scalable, and the multi-master capability of the CAN bus is used to make the controlled objects plug-and-play.
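The multi-master property relied on above comes from CAN's non-destructive bitwise arbitration: nodes that start a frame at the same time send their identifiers MSB first, the bus acts as a wired-AND in which a dominant 0 overrides a recessive 1, and a node that sends 1 but reads 0 backs off. The following simplified Python simulation (11-bit standard identifiers only; RTR/IDE bits, extended identifiers, and bit stuffing are ignored) shows why the numerically lowest identifier always wins.

```python
def arbitrate(ids, id_bits=11):
    """Simulate CAN bitwise arbitration among nodes that start transmitting together.

    Each identifier is sent MSB first. The bus level for a bit is the wired-AND of
    all transmitted bits, so any dominant 0 pulls it low; a node whose bit differs
    from the bus level stops transmitting. Returns the winning identifier.
    """
    contenders = list(ids)
    if not contenders:
        raise ValueError("no node is transmitting")
    if len(set(contenders)) != len(contenders):
        raise ValueError("simultaneous transmitters must use distinct identifiers")
    for bit in reversed(range(id_bits)):
        sent = {node_id: (node_id >> bit) & 1 for node_id in contenders}
        bus_level = min(sent.values())  # wired-AND: a single 0 makes the bus 0
        contenders = [i for i in contenders if sent[i] == bus_level]
    return contenders[0]
```

Because the winner's frame is never corrupted, losing nodes simply retry later, and identifier values double as message priorities.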

Finally, it should be noted that the above preferred embodiment is intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail through the above preferred embodiment, those skilled in the art should understand that various changes in form and detail may be made without departing from the scope defined by the claims of the present invention.

Claims

1. A multi-dimensional cooperative control method based on relevant redundancy transformation and reinforcement learning, characterized in that the method comprises the following steps:

S2: guide the control-state policy using a POMDP model;

S4: apply distributed, bus-based low-level control;

Step S2 is specifically as follows: a POMDP model is established from the control states of the various devices of the commercial touring RV, obtained through multi-source heterogeneous fusion, in order to perceive, adapt to, and track changes in the device control states. The internal actuator of the POMDP model applies actions to the device control states, causing them to change and yielding a certain return. The cumulative return obtained is used to evaluate the series of policies executed, so the device control problem of the commercial touring RV is converted into a policy selection problem. Specifically, the POMDP model is described as {S, A, T, O, Q, β}. The belief state of the integrated environment under the POMDP probability distribution is expressed as B = {b_t}, where the probability distribution at time t is b_t = {b_t(s₁), ..., b_t(s_m)} and b_t(s_i) is the probability that the environment is in state s_i at time t. From the observation of the environment and the choice of control action at the current time, the POMDP model infers the belief over the control state at the next time step. Suppose the initial belief state is b₀; executing action a and receiving observation o yields the next belief state b₁. In control state S₁ the model receives observation O₁ and its internal state is i₁. The corresponding action a₁ is then computed according to the control-state guidance policy, which moves the environment from S₁ to S₂; the model receives return r₁ and observation O₂, and its internal state moves from i₁(b₁) to i₂(b₂). The model continues to run in this way;

Specifically, a guidance-policy estimation function is constructed to perform dialogue state tracking. The function is $V_n(b) = \sum_{s \in S} b(s)\,\alpha_n(s)$, where $\alpha_n(s)$ is the value of the action vector of node n in state s. Through evolution of the control-state policy, the control-state guidance policy function for the next time step is obtained [equation not reproduced in the source text], where $V^*$ denotes the optimal policy and $V^*_t$ denotes the policy function of the previous time step;

Step S3 is specifically as follows: the guidance policy for the device control states of the commercial touring RV is obtained from the POMDP model, and the optimal behavior policy is selected by a policy optimization method based on the deep reinforcement learning algorithm DQN. Specifically, a Q-network Q(s, a; θ) defines the behavior policy, a target Q-network Q(s, a; θ^-) generates the target Q values for the DQN loss term, and replay memory stores state samples from the POMDP model, which are drawn at random to train the Q-network. Reinforcement learning defines the expected total return of the POMDP model as $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, where each reward r_t is discounted by a factor γ ∈ [0, 1] per time step and T is the terminal time step. The action-value function Q^π(s, a) gives the expected return of the observed state s_t, and a neural network approximates it as Q(s, a) ≈ Q(s, a; θ^-). For a guidance policy π and action a, the action-value function is $Q^\pi(s,a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, and the optimal action-value function is achieved by the policy $Q^*(s,a) = \max_\pi Q^\pi(s,a)$. The Bellman equation for the action value is constructed as $Q^*(s,a) = \mathbb{E}_{s'}\big[r + \gamma \max_{a'} Q^*(s',a') \mid s, a\big]$ and is solved by iteratively adjusting the Q-network toward the Bellman targets;

First, DQN uses experience replay: at each time step t of the POMDP model, the experience tuple e_t = (s_t, a_t, r_t, s_{t+1}) is stored in the replay memory D_t = {e₁, ..., e_t};

Second, DQN maintains two separate Q-networks, Q(s, a; θ) and Q(s, a; θ^-). The current parameters θ are updated repeatedly within each time step and are copied to the old parameters θ^- after every N iterations. At update iteration i, the current parameters θ are updated to minimize the mean-squared Bellman error with respect to the old parameters θ^- by optimizing the loss function $L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\big[\big(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta_i)\big)^2\big]$. For each update i, an experience tuple (s, a, r, s') ~ U(D) is sampled individually from the replay memory D, and for each sample the current parameters θ are updated by stochastic gradient descent. The descent gradient g_i is the gradient of the sample loss with respect to θ, with θ^- held fixed: $g_i = -2\big(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta_i)\big)\nabla_{\theta_i} Q(s,a;\theta_i)$;

Finally, at each time step t, the preferred action with respect to the current Q-network Q(s, a; θ) is selected. A central parameter server maintains a distributed representation of the Q-network Q(s, a; θ^-). The parameter server receives the gradient information produced by reinforcement learning and, driven by an asynchronous stochastic gradient descent algorithm, uses it to modify the parameter vector θ^-.

2. The multi-dimensional cooperative control method based on relevant redundancy transformation and reinforcement learning according to claim 1, characterized in that step S1 is specifically as follows: in a multi-sensor network environment, the heterogeneous information from multiple sensors is analyzed with the canonical correlation analysis (CCA) algorithm and the Isomorphic Relevant Redundant Transformation (IRRT) algorithm, which map the heterogeneous information into a unified, computable space of common dimension; after the feature information has been given a unified representation, the information is fused.

3. The multi-dimensional cooperative control method based on relevant redundancy transformation and reinforcement learning according to claim 1, characterized in that step S4 is specifically as follows: an addressing scheme for the data channels based on memory mapping is designed; taking trigger mode, timing, and load capacity into account, multiplexers and sample-and-hold circuits are used together so that the data interface channels can be shared; a self-control system with a redundant structure is designed to intelligently parse the control instructions produced by the fused decision and, accounting for interference factors such as power-supply output fluctuation, electromagnetic radiation, and distributed capacitance and inductance, to carry out autonomous control of the on-board devices.