CN108288097A - Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks - Google Patents
- Publication number
- CN108288097A (Application CN201810071032.9A)
- Authority
- CN
- China
- Prior art keywords
- dictionary
- value
- action
- dimension
- environment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Medical Informatics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks. A continuous action space is first converted into a discrete action space by a quantization operation. An autoencoder implemented with a deep neural network then performs dimensionality-reducing encoding of the dictionary values of the discrete action space and counts them. After a certain number of policy updates, the number of occurrences of the codeword corresponding to each dictionary value is tallied, and rarely occurring dictionary values are probabilistically removed from the action dictionary. Redundancy in the action dictionary is thereby continually removed, which in turn improves the agent's exploration efficiency during policy updates.
Description
Technical field
The present invention relates to the technical fields of artificial intelligence and machine learning, and more particularly to a discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks.
Background technology
Reinforcement learning is an important machine learning method with many applications in fields such as intelligent robot control, human-machine games, and clinical medical analysis and prediction. Standing apart from conventional supervised and unsupervised learning, reinforcement learning acquires experience through the interaction between an intelligent agent and its environment, enabling the agent to learn a policy mapping from environment states to actions. In reinforcement learning, the agent receives state information from the environment and, based on the learned policy, generates an action that acts on the environment; after the environment receives the action, its state changes and a return value (reward or punishment) is produced, and the new state is sent back to the agent together with this feedback signal. The agent then updates its policy according to the received information and selects the next decision (action) according to the policy. The learning objective of a reinforcement learning system is to dynamically adjust the agent's own parameters during its interaction with the environment so as to update the learned policy and maximize the positive feedback signal from the environment.
The convergence process of a reinforcement learning algorithm can be regarded as a search over the action space to find the optimal policy. For learning tasks with high-dimensional, continuous action spaces, the large action space makes policy exploration difficult for the agent and learning efficiency low. In view of this, further study is warranted: for reinforcement learning tasks with high-dimensional continuous action spaces, the action space can be discretized using a neural network so as to improve the agent's learning efficiency in reinforcement learning.
Invention content
The object of the present invention is to provide a discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks that improves the learning efficiency and stability of reinforcement learning algorithms.
The object of the present invention is achieved through the following technical solution:
A discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task, comprising:

Step S1: quantize each dimension of the action space, and use the quantization results to form an overcomplete action dictionary;

Step S2: using the dictionary values in the action dictionary, train an autoencoder implemented by a deep neural network to perform dimensionality reduction and encoding of the dictionary values in the action dictionary;

Step S3: initialize a count table with all values set to 0; when the agent interacts with the environment, determine the corresponding dictionary value in the action dictionary as the action according to the policy, and update the corresponding count value in the count table, thereby completing a policy update;

Step S4: after the agent and the environment have completed K policy updates, for each dictionary value in the action dictionary, compute from its count results in the count table the probability of removing it from the action dictionary, and remove the corresponding dictionary value from the action dictionary with the computed probability; traverse every dictionary value in the action dictionary and take the retained dictionary values as the new action dictionary;

Step S5: return to step S3 and continue policy updates until convergence.
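Steps S1-S5 can be sketched as a single training loop. The following Python sketch uses illustrative stand-ins: a uniform-random policy stub, a hash function in place of the trained autoencoder, and an assumed removal-probability form `alpha * (1 - n / n_max)` (the patent's actual formula images are not reproduced in this text), so it shows the control flow rather than the patented implementation.

```python
import itertools
import random

def build_dictionary(dims, interval):
    """Step S1: uniform quantization of each dimension of [0, 1]^dims
    into an overcomplete action dictionary."""
    levels = [round(k * interval, 10) for k in range(int(round(1 / interval)) + 1)]
    return [list(a) for a in itertools.product(levels, repeat=dims)]

def code_of(action, n_bits=4):
    """Stand-in for the trained autoencoder of step S2: any deterministic
    many-to-one mapping from an action to one of 2**n_bits codes."""
    return hash(tuple(round(x, 10) for x in action)) % (2 ** n_bits)

def explore(dictionary, updates, n_bits=4, alpha=0.5, k=50, rng=None):
    """Steps S3-S5: count code occurrences during policy updates and
    probabilistically prune rarely used dictionary values every k updates.
    The removal probability alpha * (1 - n / n_max) is an assumed form."""
    rng = rng or random.Random(0)
    counts = [0] * (2 ** n_bits)               # step S3: zeroed count table
    for t in range(1, updates + 1):
        action = rng.choice(dictionary)        # policy stub: uniform choice
        counts[code_of(action, n_bits)] += 1
        if t % k == 0:                         # step S4: prune every k updates
            n_max = max(counts)
            kept = [a for a in dictionary
                    if rng.random() >= alpha * (1 - counts[code_of(a, n_bits)] / n_max)]
            if kept:
                dictionary = kept
            counts = [0] * (2 ** n_bits)       # step S5: reset and continue
    return dictionary
```

Running `explore(build_dictionary(2, 0.5), 200)` prunes a 9-entry toy dictionary down to the values whose codes were visited often, mirroring how redundancy is removed from the action dictionary over successive rounds.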
As seen from the above technical solution provided by the invention, a continuous action space can be converted into a discrete action space by a quantization operation; an autoencoder implemented by a deep neural network then performs dimensionality-reducing encoding of the action values of the discrete action space and counts them; the number of occurrences of the codeword corresponding to each dictionary value over a certain number of policy updates is then tallied, and rarely occurring dictionary values are probabilistically removed from the action dictionary. Redundancy in the action dictionary is thus continually removed, which in turn improves the agent's exploration efficiency during policy updates.
Description of the drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings required in the following description of the embodiments are briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task provided by an embodiment of the present invention;

Fig. 2 is a structural schematic diagram of the autoencoder network provided by an embodiment of the present invention.
Specific implementation mode
The technical solution in the embodiments of the present invention is described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task. As shown in Fig. 1, it mainly comprises:

Step S1: quantize each dimension of the action space, and use the quantization results to form an overcomplete action dictionary.
In this step, a sufficiently small quantization interval Δ is taken for each dimension of the D-dimensional action space to perform uniform quantization, and the uniform quantization results form an overcomplete action dictionary Dict. Since the quantization interval Δ is fixed, the number of dictionary values in the action dictionary is finite, denoted M.

Illustratively, for an 18-dimensional continuous action space (each dimension a continuous value in the interval [0, 1]), the embodiment of the present invention may quantize the environment's action space with quantization interval Δ = 0.1 and take the discretized result as the overcomplete action dictionary Dict; the number of dictionary values in the action dictionary is then M = 10^18.

It will be understood by those skilled in the art that this quantization step requires the quantization interval to be chosen according to the environment's sensitivity to action values, so as to ensure that the resulting discretized action dictionary is overcomplete.
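The quantization step above can be illustrated in Python. The helper names (`quantize`, `dictionary_size`, `small_dict`) are illustrative, not from the patent; the sketch assumes half-open quantization cells, which is consistent with the stated count of 10 levels per dimension and M = 10^18.

```python
import itertools

def quantize(value, interval=0.1):
    """Uniform quantization of one dimension: map a value in [0, 1)
    to the left edge of its quantization cell."""
    return round(int(value / interval) * interval, 10)

def dictionary_size(dims, interval):
    """Number of dictionary values M = (levels per dimension) ** dims."""
    return int(round(1 / interval)) ** dims

# The 18-dimensional example from the description: Delta = 0.1 gives
# 10 levels per dimension, hence M = 10**18 dictionary values.
assert dictionary_size(18, 0.1) == 10 ** 18

# Materializing the full dictionary is only feasible for small cases:
small_dict = list(itertools.product(
    [round(k * 0.1, 10) for k in range(10)], repeat=2))
```

Because M grows exponentially in the number of dimensions, the dictionary is never enumerated wholesale in practice; it is indexed implicitly, which is exactly why the subsequent encoding and pruning steps are needed.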
Step S2: using the dictionary values in the action dictionary, train an autoencoder implemented by a deep neural network to perform dimensionality reduction and encoding of the dictionary values in the action dictionary.

In the embodiment of the present invention, hash coding is selected as the encoding scheme, and the autoencoder structure implemented by the deep neural network shown in Fig. 2 may be used.

For any D-dimensional dictionary value a_i in the action dictionary Dict, the result of its binarization is b(a_i). Obviously, b(a_i), like a_i, is a D-dimensional vector; its j-th component is denoted b_j(a_i). b(a_i) is quantized, and the quantized result is mapped to a lower-dimensional space by some down-sampling mapping to obtain the code g(b(a_i)); since g(b(a_i)) is a function of a_i, it can be denoted as such. It should be noted that many kinds of down-sampling mappings exist; in this embodiment, the SimHash algorithm, for example, may be used.
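As one example of such a down-sampling mapping, a minimal SimHash sketch is shown below. This is a generic illustration of the SimHash technique named in the text, not the patent's Fig. 2 network; the function name and seed are assumptions.

```python
import random

def simhash_code(vector, n_bits=8, seed=7):
    """SimHash down-sampling: project the input vector onto n_bits fixed
    random hyperplanes and keep only the signs. Nearby vectors agree on
    most bits, so similar actions share (or nearly share) a code."""
    rng = random.Random(seed)                  # fixed seed: same planes every call
    planes = [[rng.gauss(0.0, 1.0) for _ in vector] for _ in range(n_bits)]
    return tuple(1 if sum(p * v for p, v in zip(plane, vector)) >= 0 else 0
                 for plane in planes)

a = [0.1] * 18
scaled = [0.9] * 18   # same direction as a: every projection keeps its sign
```

A useful property visible here is that positively scaling an action leaves its code unchanged, since only the signs of the projections matter; this is what makes the mapping a many-to-one down-sampling from the 18-dimensional action to a short binary code.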
In the embodiment of the present invention, the loss function of the autoencoder is defined as:

wherein M is the number of dictionary values in the action dictionary Dict with Δ as the quantization interval, p(a_i) denotes the probability that a certain dictionary value a_i appears, and a weighting coefficient is used to balance the two terms of the loss function that have no physical significance.
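The formula image itself is not reproduced in this text. Given the variables described (M dictionary values, appearance probabilities p(a_i), binary code components b_j(a_i), and a weighting coefficient, written here as λ by assumption), a plausible reconstruction, following the learned-hash-code loss used in count-based exploration (Tang et al., 2017), would be:

```latex
\mathcal{L} \;=\; -\frac{1}{M}\sum_{i=1}^{M}\left[\log p(a_i)
  \;-\; \frac{\lambda}{D}\sum_{j=1}^{D}
  \min\!\left\{\bigl(1-b_j(a_i)\bigr)^{2},\; b_j(a_i)^{2}\right\}\right]
```

The first term rewards faithful reconstruction of each dictionary value; the second, weighted by λ, pushes each code component b_j(a_i) toward 0 or 1 so the code is usable as a discrete hash. This matches the description of two loss terms balanced by a weighting coefficient, but it is a sketch consistent with the surrounding text, not the patent's actual formula.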
It will be understood by those skilled in the art that the autoencoder structure for hash-coding action values given in Fig. 2 is only one comparatively effective network design. For other specific interactive environments, similar neural network structures and loss functions can be designed according to the dimensionality of the action space and the statistical features of the data, and such designed results remain within the protection scope of the present invention.
Step S3: initialize a count table with all values set to 0; when the agent interacts with the environment, determine the corresponding dictionary value in the action dictionary as the action according to the policy, and update the corresponding count value in the count table, thereby completing a policy update.

In the embodiment of the present invention, when the agent interacts with the environment, the agent receives the state value and return value from the environment, determines a dictionary value from the action dictionary according to the current policy as the current decision result, and returns it to the environment. The environment takes the received decision result as an action and computes the state value and return value of the next time step; the action is also passed to the trained autoencoder for encoding, and according to the autoencoder's output the corresponding count value in the count table is incremented by one. The length of the count table equals the number of distinct results obtained after dimensionality reduction and encoding of the dictionary values in the action dictionary.

Illustratively, through the dimensionality-reduction operation we process the 18-dimensional action vector into an 8-dimensional action vector, and hash coding binarizes the action value of each dimension to 0 or 1, so the length of the count table is 2^8.
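The count-table bookkeeping in this example can be sketched directly: an 8-bit binary code indexes a table of length 2^8 = 256, and each interaction increments one cell. The helper names (`code_to_index`, `record`) are illustrative.

```python
def code_to_index(bits):
    """Interpret an 8-bit binary code as an integer index into the count table."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index

N_BITS = 8
count_table = [0] * (2 ** N_BITS)    # length 2^8 = 256, all zeros (step S3)

def record(bits):
    """One interaction step: increment the count of the code the
    autoencoder produced for the chosen action."""
    count_table[code_to_index(bits)] += 1

record([1, 0, 0, 0, 0, 0, 0, 1])     # code 0b10000001 -> index 129
record([1, 0, 0, 0, 0, 0, 0, 1])
```

Because many 18-dimensional dictionary values map to the same 8-bit code, each cell counts an equivalence class of similar actions rather than a single action, which is what keeps the table small.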
Step S4: after the agent and the environment have completed K policy updates, for each dictionary value in the action dictionary, compute from its count results in the count table the probability of removing it from the action dictionary, and remove the corresponding dictionary value from the action dictionary with the computed probability; traverse every dictionary value in the action dictionary and take the retained dictionary values as the new action dictionary.

In the embodiment of the present invention, the probability formula for removing a dictionary value from the action dictionary is:

wherein α and β denote weight coefficients, generally set to constants; the count term is the updated result in the count table corresponding to dictionary value a_i after the agent and the environment have completed K policy updates; and the remaining term denotes the maximum count value among the results in the count table.

Illustratively, α and β may both be set to 0.5, and K may be set to 40000.
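Since the probability formula image is not reproduced in this text, the following is a hypothetical form consistent with the variables described: it depends on α, β, the value's count, and the maximum count, and it removes never-used values with probability α while never removing the most-used value.

```python
def removal_probability(count, max_count, alpha=0.5, beta=0.5):
    """Hypothetical removal probability (the patent's formula is not
    reproduced here): equals alpha for a never-used value (count = 0),
    falls to 0 for the most-used value (count = max_count), and beta
    shapes how fast it decays in between."""
    return alpha * (1.0 - count / max_count) ** beta

# Monotonicity check with alpha = beta = 0.5 and K = 40000-scale counts:
p_rare = removal_probability(10, 40000)      # rarely used -> near alpha
p_common = removal_probability(39000, 40000) # heavily used -> near zero
```

Any formula with this monotone shape serves the stated purpose: rarely occurring dictionary values are pruned with high probability, frequently chosen ones are retained.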
Step S5: after the update of the action dictionary is completed, the count table is reset to zero, and the agent continues policy updates until the learned policy meets the task requirements or the algorithm converges stably.

It should be noted that the agent's decision when interacting with the environment is to select a certain dictionary value from the current action dictionary as the decision result.
In the above scheme of the embodiment of the present invention, the quantization operation converts a continuous action space into a discrete action space; the autoencoder implemented by a deep neural network then performs dimensionality reduction and encoding of the dictionary values of the discrete action space and counts them; the number of occurrences of the codeword corresponding to each dictionary value over a certain number of policy updates is then tallied, and rarely occurring dictionary values are probabilistically removed from the action dictionary. Redundancy in the action dictionary is thus continually removed, which in turn improves the agent's exploration efficiency during policy updates.
This patent has certain detectability; specific detection schemes are as follows:

1. Whether the technical solution of this patent has been used can be detected from the occupancy of computer memory by the relevant program process at runtime. From the start of the relevant program process, dynamically monitor the process's memory occupancy. If the process's memory occupancy exhibits an intermittently decreasing pattern — decreasing once every certain time interval and then remaining in a relatively stable state — the program has very likely used the technical solution of this patent. It should be noted that, because special circumstances such as memory leaks may occur during program execution, whether the technical solution of this patent has been used must be further determined in combination with other detection methods.

2. Detect whether modules such as an "autoencoder", a "count table", and an "action dictionary" exist in the relevant program; if identical or similar modules are detected, the technical solution of this patent has likely been used.
Through the above description of the embodiments, those skilled in the art can clearly understand that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solution of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in each embodiment of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task, characterized by comprising:

Step S1: quantizing each dimension of the action space, and using the quantization results to form an overcomplete action dictionary;

Step S2: using the dictionary values in the action dictionary, training an autoencoder implemented by a deep neural network to perform dimensionality reduction and encoding of the dictionary values in the action dictionary;

Step S3: initializing a count table with all values set to 0; when the agent interacts with the environment, determining the corresponding dictionary value in the action dictionary as the action according to the policy, and updating the corresponding count value in the count table, thereby completing one policy update;

Step S4: after the agent and the environment have completed K policy updates, for each dictionary value in the action dictionary, computing from its count results in the count table the probability of removing it from the action dictionary, removing the corresponding dictionary value from the action dictionary with the computed probability, traversing every dictionary value in the action dictionary, and taking the retained dictionary values as the new action dictionary;

Step S5: returning to step S3 and continuing policy updates until convergence.
2. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1, characterized in that a sufficiently small quantization interval Δ is taken for each dimension of the D-dimensional action space to perform uniform quantization, and the uniform quantization results form an overcomplete action dictionary Dict.
3. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1, characterized in that said using the dictionary values in the action dictionary and training an autoencoder implemented by a deep neural network to perform dimensionality reduction and encoding of the dictionary values in the action dictionary comprises:

for any D-dimensional dictionary value a_i in the action dictionary Dict, performing down-sampling dimensionality reduction on the feature map output by the autoencoder to obtain the encoded result.
4. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1 or 3, characterized in that the loss function of the autoencoder is defined as:

wherein b_j(a_i) is the j-th component of the binarization result b(a_i) of the D-dimensional dictionary value a_i; M is the number of dictionary values in the action dictionary Dict with Δ as the quantization interval; p(a_i) denotes the probability that dictionary value a_i appears; and a weighting coefficient is used to balance the two terms of the loss function that have no physical significance.
5. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1, characterized in that said determining, when the agent interacts with the environment, the corresponding dictionary value in the action dictionary as the action according to the policy and updating the corresponding count value in the count table comprises:

when the agent interacts with the environment, the agent receives the state value and return value from the environment, determines a dictionary value from the action dictionary according to the current policy as the current decision result, and returns it to the environment; the environment takes the received decision result as an action to compute the state value and return value of the next time step; the action is passed to the trained autoencoder for encoding, and the corresponding count value in the count table is updated according to the autoencoder's output.
6. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1, characterized in that the probability formula for removing a dictionary value from the action dictionary is:

wherein α and β denote weight coefficients; the count term is the updated result in the corresponding count table for dictionary value a_i after the agent and the environment have completed K policy updates; and the remaining term denotes the maximum count value among the results in the count table.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810071032.9A CN108288097A (en) | 2018-01-24 | 2018-01-24 | Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810071032.9A CN108288097A (en) | 2018-01-24 | 2018-01-24 | Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108288097A (en) | 2018-07-17 |
Family
ID=62835828
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810071032.9A Pending CN108288097A (en) | 2018-01-24 | 2018-01-24 | Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108288097A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112292699A (en) * | 2019-05-15 | 2021-01-29 | 创新先进技术有限公司 | Determining action selection guidelines for an execution device |
CN112257872A (en) * | 2020-10-30 | 2021-01-22 | 周世海 | Target planning method for reinforcement learning |
CN112257872B (en) * | 2020-10-30 | 2022-09-13 | 周世海 | Target planning method for reinforcement learning |
-
2018
- 2018-01-24 CN CN201810071032.9A patent/CN108288097A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Qiang et al. | Reinforcement learning model, algorithms and its application | |
CN106411896B (en) | Network security situation prediction method based on APDE-RBF neural network | |
Ergezer et al. | Oppositional biogeography-based optimization | |
CN109711544A (en) | Method, apparatus, electronic equipment and the computer storage medium of model compression | |
Esmin et al. | Hybrid evolutionary algorithm based on PSO and GA mutation | |
CN107958286A (en) | A kind of depth migration learning method of field Adaptive Networking | |
CN113435606A (en) | Method and device for optimizing reinforcement learning model, storage medium and electronic equipment | |
Donate et al. | Evolutionary optimization of sparsely connected and time-lagged neural networks for time series forecasting | |
CN108288097A (en) | Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks | |
Bergmeir et al. | Time series modeling and forecasting using memetic algorithms for regime-switching models | |
Ollington et al. | Incorporating expert advice into reinforcement learning using constructive neural networks | |
Lee et al. | A genetic algorithm based robust learning credit assignment cerebellar model articulation controller | |
Li et al. | A new relational tri-training system with adaptive data editing for inductive logic programming | |
Yang et al. | A prediction model based on Big Data analysis using hybrid FCM clustering | |
Wu et al. | Fault diagnosis of TE process based on incremental learning | |
CN111314119B (en) | Method and device for quickly reconstructing unmanned platform information sensing network in uncertain environment | |
Chun et al. | Impact of momentum bias on forecasting through knowledge discovery techniques in the foreign exchange market | |
Houssein et al. | Salp swarm algorithm: modification and application | |
Bharti et al. | QL-SSA: An Adaptive Q-Learning based Squirrel Search Algorithm for Feature Selection | |
Dale et al. | Supervised clustering using decision trees and decision graphs: An ecological comparison | |
O’Neill et al. | Forecasting market indices using evolutionary automatic programming: A case study | |
Liu et al. | Evolutionary participatory learning in fuzzy systems modeling | |
Song et al. | Reinforcement learning with chromatic networks for compact architecture search | |
Wang et al. | Research on Portfolio Optimization Based on Deep Reinforcement Learning | |
Sharma et al. | Bottom-up Pittsburgh approach for discovery of classification rules |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180717 |