CN108288097A - Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks - Google Patents
- Publication number
- CN108288097A (Application CN201810071032.9A)
- Authority
- CN
- China
- Prior art keywords
- dictionary
- value
- action
- dimension
- environment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Medical Informatics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks. A continuous action space is first converted into a discrete action space by a quantization operation. An autoencoder implemented with a deep neural network then performs dimensionality-reducing encoding of the dictionary values of the discrete action space and counts them. After a certain number of policy updates, the number of occurrences of the codeword corresponding to each dictionary value is tallied, and rarely occurring dictionary values are probabilistically removed from the action dictionary. Redundancy in the action dictionary is thereby continually removed, which in turn improves the agent's exploration efficiency during policy updates.
Description
Technical field
The present invention relates to the technical fields of artificial intelligence and machine learning, and more particularly to a discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks.
Background technology
Reinforcement learning is an important machine learning method with many applications in fields such as intelligent robot control, human-machine games, and clinical medical analysis and prediction. Standing apart from conventional supervised and unsupervised learning, reinforcement learning acquires experience through the interaction between an intelligent agent and its environment, enabling the agent to learn a policy mapping from environment states to actions. In reinforcement learning, the agent receives state information from the environment and, based on the learned policy, generates an action that acts on the environment; after the environment receives the action, its state changes and a return value (reward or punishment) is produced, and the new state is sent back to the agent together with this feedback signal. The agent then updates its policy according to the received information and selects the next decision (action) according to the policy. The learning objective of a reinforcement learning system is to dynamically adjust the agent's own parameters during its interaction with the environment so as to update the learned policy and maximize the positive feedback signal from the environment.
The convergence process of a reinforcement learning algorithm can be regarded as a search over the action space to find the optimal policy. For learning tasks with high-dimensional, continuous action spaces, the large action space makes policy exploration difficult for the agent and learning efficiency low. In view of this, further study is warranted: for reinforcement learning tasks with high-dimensional continuous action spaces, the action space can be discretized using a neural network so as to improve the agent's learning efficiency in reinforcement learning.
Invention content
The object of the present invention is to provide a discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks that improves the learning efficiency and stability of reinforcement learning algorithms.
The object of the present invention is achieved through the following technical solution:
A discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task, comprising:

Step S1: quantize each dimension of the action space, and use the quantization results to form an overcomplete action dictionary;

Step S2: using the dictionary values in the action dictionary, train an autoencoder implemented by a deep neural network to perform dimensionality reduction and encoding of the dictionary values in the action dictionary;

Step S3: initialize a count table with all values set to 0; when the agent interacts with the environment, determine the corresponding dictionary value in the action dictionary as the action according to the policy, and update the corresponding count value in the count table, thereby completing a policy update;

Step S4: after the agent and the environment have completed K policy updates, for each dictionary value in the action dictionary, compute from its count results in the count table the probability of removing it from the action dictionary, and remove the corresponding dictionary value from the action dictionary with the computed probability; traverse every dictionary value in the action dictionary and take the retained dictionary values as the new action dictionary;

Step S5: return to step S3 and continue policy updates until convergence.
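Steps S1-S5 can be sketched as a single training loop. The following Python sketch uses illustrative stand-ins: a uniform-random policy stub, a hash function in place of the trained autoencoder, and an assumed removal-probability form `alpha * (1 - n / n_max)` (the patent's actual formula images are not reproduced in this text), so it shows the control flow rather than the patented implementation.

```python
import itertools
import random

def build_dictionary(dims, interval):
    """Step S1: uniform quantization of each dimension of [0, 1]^dims
    into an overcomplete action dictionary."""
    levels = [round(k * interval, 10) for k in range(int(round(1 / interval)) + 1)]
    return [list(a) for a in itertools.product(levels, repeat=dims)]

def code_of(action, n_bits=4):
    """Stand-in for the trained autoencoder of step S2: any deterministic
    many-to-one mapping from an action to one of 2**n_bits codes."""
    return hash(tuple(round(x, 10) for x in action)) % (2 ** n_bits)

def explore(dictionary, updates, n_bits=4, alpha=0.5, k=50, rng=None):
    """Steps S3-S5: count code occurrences during policy updates and
    probabilistically prune rarely used dictionary values every k updates.
    The removal probability alpha * (1 - n / n_max) is an assumed form."""
    rng = rng or random.Random(0)
    counts = [0] * (2 ** n_bits)               # step S3: zeroed count table
    for t in range(1, updates + 1):
        action = rng.choice(dictionary)        # policy stub: uniform choice
        counts[code_of(action, n_bits)] += 1
        if t % k == 0:                         # step S4: prune every k updates
            n_max = max(counts)
            kept = [a for a in dictionary
                    if rng.random() >= alpha * (1 - counts[code_of(a, n_bits)] / n_max)]
            if kept:
                dictionary = kept
            counts = [0] * (2 ** n_bits)       # step S5: reset and continue
    return dictionary
```

Running `explore(build_dictionary(2, 0.5), 200)` prunes a 9-entry toy dictionary down to the values whose codes were visited often, mirroring how redundancy is removed from the action dictionary over successive rounds.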
As seen from the above technical solution provided by the invention, a continuous action space can be converted into a discrete action space by a quantization operation; an autoencoder implemented by a deep neural network then performs dimensionality-reducing encoding of the action values of the discrete action space and counts them; the number of occurrences of the codeword corresponding to each dictionary value over a certain number of policy updates is then tallied, and rarely occurring dictionary values are probabilistically removed from the action dictionary. Redundancy in the action dictionary is thus continually removed, which in turn improves the agent's exploration efficiency during policy updates.
Description of the drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings required in the following description of the embodiments are briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task provided by an embodiment of the present invention;

Fig. 2 is a structural schematic diagram of the autoencoder network provided by an embodiment of the present invention.
Specific implementation mode
The technical solution in the embodiments of the present invention is described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task. As shown in Fig. 1, it mainly comprises:

Step S1: quantize each dimension of the action space, and use the quantization results to form an overcomplete action dictionary.
In this step, a sufficiently small quantization interval Δ is taken for each dimension of the D-dimensional action space to perform uniform quantization, and the uniform quantization results form an overcomplete action dictionary Dict. Since the quantization interval Δ is fixed, the number of dictionary values in the action dictionary is finite, denoted M.

Illustratively, for an 18-dimensional continuous action space (each dimension a continuous value in the interval [0, 1]), the embodiment of the present invention may quantize the environment's action space with quantization interval Δ = 0.1 and take the discretized result as the overcomplete action dictionary Dict; the number of dictionary values in the action dictionary is then M = 10^18.

It will be understood by those skilled in the art that this quantization step requires the quantization interval to be chosen according to the environment's sensitivity to action values, so as to ensure that the resulting discretized action dictionary is overcomplete.
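The quantization step above can be illustrated in Python. The helper names (`quantize`, `dictionary_size`, `small_dict`) are illustrative, not from the patent; the sketch assumes half-open quantization cells, which is consistent with the stated count of 10 levels per dimension and M = 10^18.

```python
import itertools

def quantize(value, interval=0.1):
    """Uniform quantization of one dimension: map a value in [0, 1)
    to the left edge of its quantization cell."""
    return round(int(value / interval) * interval, 10)

def dictionary_size(dims, interval):
    """Number of dictionary values M = (levels per dimension) ** dims."""
    return int(round(1 / interval)) ** dims

# The 18-dimensional example from the description: Delta = 0.1 gives
# 10 levels per dimension, hence M = 10**18 dictionary values.
assert dictionary_size(18, 0.1) == 10 ** 18

# Materializing the full dictionary is only feasible for small cases:
small_dict = list(itertools.product(
    [round(k * 0.1, 10) for k in range(10)], repeat=2))
```

Because M grows exponentially in the number of dimensions, the dictionary is never enumerated wholesale in practice; it is indexed implicitly, which is exactly why the subsequent encoding and pruning steps are needed.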
Step S2: using the dictionary values in the action dictionary, train an autoencoder implemented by a deep neural network to perform dimensionality reduction and encoding of the dictionary values in the action dictionary.

In the embodiment of the present invention, hash coding is selected as the encoding scheme, and the autoencoder structure implemented by the deep neural network shown in Fig. 2 may be used.

For any D-dimensional dictionary value a_i in the action dictionary Dict, the result of its binarization is b(a_i). Obviously, b(a_i), like a_i, is a D-dimensional vector; its j-th component is denoted b_j(a_i). b(a_i) is quantized, and the quantized result is mapped to a lower-dimensional space by some down-sampling mapping to obtain the code g(b(a_i)); since g(b(a_i)) is a function of a_i, it can be denoted as such. It should be noted that many kinds of down-sampling mappings exist; in this embodiment, the SimHash algorithm, for example, may be used.
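As one example of such a down-sampling mapping, a minimal SimHash sketch is shown below. This is a generic illustration of the SimHash technique named in the text, not the patent's Fig. 2 network; the function name and seed are assumptions.

```python
import random

def simhash_code(vector, n_bits=8, seed=7):
    """SimHash down-sampling: project the input vector onto n_bits fixed
    random hyperplanes and keep only the signs. Nearby vectors agree on
    most bits, so similar actions share (or nearly share) a code."""
    rng = random.Random(seed)                  # fixed seed: same planes every call
    planes = [[rng.gauss(0.0, 1.0) for _ in vector] for _ in range(n_bits)]
    return tuple(1 if sum(p * v for p, v in zip(plane, vector)) >= 0 else 0
                 for plane in planes)

a = [0.1] * 18
scaled = [0.9] * 18   # same direction as a: every projection keeps its sign
```

A useful property visible here is that positively scaling an action leaves its code unchanged, since only the signs of the projections matter; this is what makes the mapping a many-to-one down-sampling from the 18-dimensional action to a short binary code.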
In the embodiment of the present invention, the loss function of the autoencoder is defined as:

wherein M is the number of dictionary values in the action dictionary Dict with Δ as the quantization interval, p(a_i) denotes the probability that a certain dictionary value a_i appears, and a weighting coefficient is used to balance the two terms of the loss function that have no physical significance.
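The formula image itself is not reproduced in this text. Given the variables described (M dictionary values, appearance probabilities p(a_i), binary code components b_j(a_i), and a weighting coefficient, written here as λ by assumption), a plausible reconstruction, following the learned-hash-code loss used in count-based exploration (Tang et al., 2017), would be:

```latex
\mathcal{L} \;=\; -\frac{1}{M}\sum_{i=1}^{M}\left[\log p(a_i)
  \;-\; \frac{\lambda}{D}\sum_{j=1}^{D}
  \min\!\left\{\bigl(1-b_j(a_i)\bigr)^{2},\; b_j(a_i)^{2}\right\}\right]
```

The first term rewards faithful reconstruction of each dictionary value; the second, weighted by λ, pushes each code component b_j(a_i) toward 0 or 1 so the code is usable as a discrete hash. This matches the description of two loss terms balanced by a weighting coefficient, but it is a sketch consistent with the surrounding text, not the patent's actual formula.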
It will be understood by those skilled in the art that the autoencoder structure for hash-coding action values given in Fig. 2 is only one comparatively effective network design. For other specific interactive environments, similar neural network structures and loss functions can be designed according to the dimensionality of the action space and the statistical features of the data, and such designed results remain within the protection scope of the present invention.
Step S3: initialize a count table with all values set to 0; when the agent interacts with the environment, determine the corresponding dictionary value in the action dictionary as the action according to the policy, and update the corresponding count value in the count table, thereby completing a policy update.

In the embodiment of the present invention, when the agent interacts with the environment, the agent receives the state value and return value from the environment, determines a dictionary value from the action dictionary according to the current policy as the current decision result, and returns it to the environment. The environment takes the received decision result as an action and computes the state value and return value of the next time step; the action is also passed to the trained autoencoder for encoding, and according to the autoencoder's output the corresponding count value in the count table is incremented by one. The length of the count table equals the number of distinct results obtained after dimensionality reduction and encoding of the dictionary values in the action dictionary.

Illustratively, through the dimensionality-reduction operation we process the 18-dimensional action vector into an 8-dimensional action vector, and hash coding binarizes the action value of each dimension to 0 or 1, so the length of the count table is 2^8.
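The count-table bookkeeping in this example can be sketched directly: an 8-bit binary code indexes a table of length 2^8 = 256, and each interaction increments one cell. The helper names (`code_to_index`, `record`) are illustrative.

```python
def code_to_index(bits):
    """Interpret an 8-bit binary code as an integer index into the count table."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index

N_BITS = 8
count_table = [0] * (2 ** N_BITS)    # length 2^8 = 256, all zeros (step S3)

def record(bits):
    """One interaction step: increment the count of the code the
    autoencoder produced for the chosen action."""
    count_table[code_to_index(bits)] += 1

record([1, 0, 0, 0, 0, 0, 0, 1])     # code 0b10000001 -> index 129
record([1, 0, 0, 0, 0, 0, 0, 1])
```

Because many 18-dimensional dictionary values map to the same 8-bit code, each cell counts an equivalence class of similar actions rather than a single action, which is what keeps the table small.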
Step S4: after the agent and the environment have completed K policy updates, for each dictionary value in the action dictionary, compute from its count results in the count table the probability of removing it from the action dictionary, and remove the corresponding dictionary value from the action dictionary with the computed probability; traverse every dictionary value in the action dictionary and take the retained dictionary values as the new action dictionary.

In the embodiment of the present invention, the probability formula for removing a dictionary value from the action dictionary is:

wherein α and β denote weight coefficients, generally set to constants; the count term is the updated result in the count table corresponding to dictionary value a_i after the agent and the environment have completed K policy updates; and the remaining term denotes the maximum count value among the results in the count table.

Illustratively, α and β may both be set to 0.5, and K may be set to 40000.
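Since the probability formula image is not reproduced in this text, the following is a hypothetical form consistent with the variables described: it depends on α, β, the value's count, and the maximum count, and it removes never-used values with probability α while never removing the most-used value.

```python
def removal_probability(count, max_count, alpha=0.5, beta=0.5):
    """Hypothetical removal probability (the patent's formula is not
    reproduced here): equals alpha for a never-used value (count = 0),
    falls to 0 for the most-used value (count = max_count), and beta
    shapes how fast it decays in between."""
    return alpha * (1.0 - count / max_count) ** beta

# Monotonicity check with alpha = beta = 0.5 and K = 40000-scale counts:
p_rare = removal_probability(10, 40000)      # rarely used -> near alpha
p_common = removal_probability(39000, 40000) # heavily used -> near zero
```

Any formula with this monotone shape serves the stated purpose: rarely occurring dictionary values are pruned with high probability, frequently chosen ones are retained.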
Step S5: after the update of the action dictionary is completed, the count table is reset to zero, and the agent continues policy updates until the learned policy meets the task requirements or the algorithm converges stably.

It should be noted that the agent's decision when interacting with the environment is to select a certain dictionary value from the current action dictionary as the decision result.
In the above scheme of the embodiment of the present invention, the quantization operation converts a continuous action space into a discrete action space; the autoencoder implemented by a deep neural network then performs dimensionality reduction and encoding of the dictionary values of the discrete action space and counts them; the number of occurrences of the codeword corresponding to each dictionary value over a certain number of policy updates is then tallied, and rarely occurring dictionary values are probabilistically removed from the action dictionary. Redundancy in the action dictionary is thus continually removed, which in turn improves the agent's exploration efficiency during policy updates.
This patent has certain detectability; specific detection schemes are as follows:

1. Whether the technical solution of this patent has been used can be detected from the occupancy of computer memory by the relevant program process at runtime. From the start of the relevant program process, dynamically monitor the process's memory occupancy. If the process's memory occupancy exhibits an intermittently decreasing pattern — decreasing once every certain time interval and then remaining in a relatively stable state — the program has very likely used the technical solution of this patent. It should be noted that, because special circumstances such as memory leaks may occur during program execution, whether the technical solution of this patent has been used must be further determined in combination with other detection methods.

2. Detect whether modules such as an "autoencoder", a "count table", and an "action dictionary" exist in the relevant program; if identical or similar modules are detected, the technical solution of this patent has likely been used.
Through the above description of the embodiments, those skilled in the art can clearly understand that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solution of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in each embodiment of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task, characterized by comprising:

Step S1: quantizing each dimension of the action space, and using the quantization results to form an overcomplete action dictionary;

Step S2: using the dictionary values in the action dictionary, training an autoencoder implemented by a deep neural network to perform dimensionality reduction and encoding of the dictionary values in the action dictionary;

Step S3: initializing a count table with all values set to 0; when the agent interacts with the environment, determining the corresponding dictionary value in the action dictionary as the action according to the policy, and updating the corresponding count value in the count table, thereby completing one policy update;

Step S4: after the agent and the environment have completed K policy updates, for each dictionary value in the action dictionary, computing from its count results in the count table the probability of removing it from the action dictionary, removing the corresponding dictionary value from the action dictionary with the computed probability, traversing every dictionary value in the action dictionary, and taking the retained dictionary values as the new action dictionary;

Step S5: returning to step S3 and continuing policy updates until convergence.
2. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1, characterized in that a sufficiently small quantization interval Δ is taken for each dimension of the D-dimensional action space to perform uniform quantization, and the uniform quantization results form an overcomplete action dictionary Dict.
3. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1, characterized in that said using the dictionary values in the action dictionary and training an autoencoder implemented by a deep neural network to perform dimensionality reduction and encoding of the dictionary values in the action dictionary comprises:

for any D-dimensional dictionary value a_i in the action dictionary Dict, performing down-sampling dimensionality reduction on the feature map output by the autoencoder to obtain the encoded result.
4. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1 or 3, characterized in that the loss function of the autoencoder is defined as:

wherein b_j(a_i) is the j-th component of the binarization result b(a_i) of the D-dimensional dictionary value a_i; M is the number of dictionary values in the action dictionary Dict with Δ as the quantization interval; p(a_i) denotes the probability that dictionary value a_i appears; and a weighting coefficient is used to balance the two terms of the loss function that have no physical significance.
5. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1, characterized in that said determining, when the agent interacts with the environment, the corresponding dictionary value in the action dictionary as the action according to the policy and updating the corresponding count value in the count table comprises:

when the agent interacts with the environment, the agent receives the state value and return value from the environment, determines a dictionary value from the action dictionary according to the current policy as the current decision result, and returns it to the environment; the environment takes the received decision result as an action to compute the state value and return value of the next time step; the action is passed to the trained autoencoder for encoding, and the corresponding count value in the count table is updated according to the autoencoder's output.
6. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1, characterized in that the probability formula for removing a dictionary value from the action dictionary is:

wherein α and β denote weight coefficients; the count term is the updated result in the corresponding count table for dictionary value a_i after the agent and the environment have completed K policy updates; and the remaining term denotes the maximum count value among the results in the count table.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810071032.9A CN108288097A (en) | 2018-01-24 | 2018-01-24 | Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810071032.9A CN108288097A (en) | 2018-01-24 | 2018-01-24 | Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108288097A (en) | 2018-07-17 |
Family
ID=62835828
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810071032.9A Pending CN108288097A (en) | 2018-01-24 | 2018-01-24 | Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108288097A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112292699A (en) * | 2019-05-15 | 2021-01-29 | 创新先进技术有限公司 | Determining action selection guidelines for an execution device |
CN112257872A (en) * | 2020-10-30 | 2021-01-22 | 周世海 | Target planning method for reinforcement learning |
CN112257872B (en) * | 2020-10-30 | 2022-09-13 | 周世海 | Target planning method for reinforcement learning |
-
2018
- 2018-01-24 CN CN201810071032.9A patent/CN108288097A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Qiang et al. | Reinforcement learning model, algorithms and its application | |
CN106411896B (en) | Network security situation prediction method based on APDE-RBF neural network | |
Ergezer et al. | Oppositional biogeography-based optimization | |
CN109711544A (en) | Method, apparatus, electronic equipment and the computer storage medium of model compression | |
Esmin et al. | Hybrid evolutionary algorithm based on PSO and GA mutation | |
CN107958286A (en) | A kind of depth migration learning method of field Adaptive Networking | |
CN113435606A (en) | Method and device for optimizing reinforcement learning model, storage medium and electronic equipment | |
Donate et al. | Evolutionary optimization of sparsely connected and time-lagged neural networks for time series forecasting | |
CN108288097A (en) | Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks | |
Bergmeir et al. | Time series modeling and forecasting using memetic algorithms for regime-switching models | |
Ollington et al. | Incorporating expert advice into reinforcement learning using constructive neural networks | |
Lee et al. | A genetic algorithm based robust learning credit assignment cerebellar model articulation controller | |
Li et al. | A new relational tri-training system with adaptive data editing for inductive logic programming | |
Yang et al. | A prediction model based on Big Data analysis using hybrid FCM clustering | |
Wu et al. | Fault diagnosis of TE process based on incremental learning | |
CN111314119B (en) | Method and device for quickly reconstructing unmanned platform information sensing network in uncertain environment | |
Chun et al. | Impact of momentum bias on forecasting through knowledge discovery techniques in the foreign exchange market | |
Houssein et al. | Salp swarm algorithm: modification and application | |
Bharti et al. | QL-SSA: An Adaptive Q-Learning based Squirrel Search Algorithm for Feature Selection | |
Dale et al. | Supervised clustering using decision trees and decision graphs: An ecological comparison | |
O’Neill et al. | Forecasting market indices using evolutionary automatic programming: A case study | |
Liu et al. | Evolutionary participatory learning in fuzzy systems modeling | |
Song et al. | Reinforcement learning with chromatic networks for compact architecture search | |
Wang et al. | Research on Portfolio Optimization Based on Deep Reinforcement Learning | |
Sharma et al. | Bottom-up Pittsburgh approach for discovery of classification rules |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180717 |