CN108288097A - Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks - Google Patents

Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks

Info

Publication number
CN108288097A
Authority
CN
China
Prior art keywords
dictionary
value
action
dimension
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810071032.9A
Other languages
Chinese (zh)
Inventor
陈志波 (Chen Zhibo)
张直政 (Zhang Zhizheng)
陈嘉乐 (Chen Jiale)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority to CN201810071032.9A priority Critical patent/CN108288097A/en
Publication of CN108288097A publication Critical patent/CN108288097A/en
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00: Computing arrangements using knowledge-based models
    • G06N 5/02: Knowledge representation; Symbolic representation
    • G06N 5/022: Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks. A continuous action space is converted into a discrete action space by a quantization operation; an autoencoder implemented with a deep neural network then performs dimensionality-reducing encoding on the dictionary values of the discrete action space and counts their codes; after a certain number of policy updates, the number of occurrences of the codeword corresponding to each dictionary value is tallied, and rarely occurring dictionary values are removed from the action dictionary probabilistically. Redundancy in the action dictionary is thereby continually eliminated, which in turn improves the agent's exploration efficiency during policy updates.

Description

Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks
Technical field
The present invention relates to the field of artificial intelligence and machine learning technology, and more particularly to a discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks.
Background art
Reinforcement learning is an important machine learning method with many applications in fields such as intelligent robot control, human-machine gaming, and clinical medical analysis and prediction. Unlike conventional supervised and unsupervised learning, reinforcement learning acquires experience from the interaction between an intelligent agent and its environment, so that the agent learns a policy mapping environment states to actions. In reinforcement learning, the intelligent agent receives state information from the environment and, based on the learned policy, generates an action that acts on the environment; upon receiving the action, the environment's state changes and a return value (reward or punishment) is produced; the changed state and this feedback signal are sent back to the agent, which updates its policy according to the information received and selects the next decision (action) according to the policy. The learning objective of a reinforcement learning system is to dynamically adjust the agent's own parameters during its interaction with the environment so as to update the learned policy and maximize the positive feedback signal from the environment.
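For concreteness, this interaction loop can be sketched in a few lines of Python; the env and agent objects and their methods here are hypothetical placeholders used only for illustration, not part of the patented method:

    def run_episode(env, agent, max_steps=1000):
        # One interaction episode: the agent observes the environment state,
        # acts according to its current policy, and receives the next state
        # plus a return value (reward or punishment) as the feedback signal.
        state = env.reset()
        total_return = 0.0
        for _ in range(max_steps):
            action = agent.act(state)                    # policy: state -> action
            next_state, reward, done = env.step(action)  # environment reacts
            agent.update(state, action, reward, next_state)  # adjust policy parameters
            state = next_state
            total_return += reward
            if done:
                break
        return total_return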
The convergence process of a reinforcement learning algorithm can be viewed as a search over the action space that seeks the optimal policy. For learning tasks whose action space is high-dimensional and continuous, the size of the action space makes the agent's exploration of policies difficult and its learning inefficient. In view of this, further study is warranted: for reinforcement learning tasks with high-dimensional continuous action spaces, the action space is discretized using a neural network so as to improve the agent's learning efficiency in reinforcement learning.
Summary of the invention
The object of the present invention is to provide a discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks, which can improve the learning efficiency and stability of reinforcement learning algorithms.
The object of the present invention is achieved through the following technical solution:
A discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task, comprising:
Step S1: quantize each dimension of the action space, and form an over-complete action dictionary from the quantization results;
Step S2: using the dictionary values in the action dictionary, train an autoencoder implemented with a deep neural network to perform dimensionality reduction and encoding of the dictionary values in the action dictionary;
Step S3: initialize a count table whose values are all 0; when the intelligent agent interacts with the environment, determine the corresponding dictionary value in the action dictionary as the action according to the policy, and update the corresponding count value in the count table, thereby completing a policy update;
Step S4: after the intelligent agent and the environment have completed K policy updates, for each dictionary value in the action dictionary, compute the probability of removing it from the action dictionary based on its count result in the count table, and remove the corresponding dictionary value from the action dictionary with the computed probability; traverse every dictionary value in the action dictionary and take the retained dictionary values as the new action dictionary;
Step S5: return to step S3 and continue policy updates until convergence.
As can be seen from the technical solution provided above, a continuous action space is converted into a discrete action space by a quantization operation; an autoencoder implemented with a deep neural network then performs dimensionality-reducing encoding on the dictionary values of the discrete action space and counts their codes; the number of occurrences of the codeword corresponding to each dictionary value over a certain number of policy updates is tallied, and rarely occurring dictionary values are removed from the action dictionary probabilistically. Redundancy in the action dictionary is thereby continually eliminated, which improves the agent's exploration efficiency during policy updates.
Description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the structure of the autoencoder network provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task which, as shown in Fig. 1, mainly comprises:
Step S1: quantize each dimension of the action space, and form an over-complete action dictionary from the quantization results.
In this step, each of the D dimensions of the action space is uniformly quantized with a sufficiently small quantization interval Δ, and the uniform quantization results form an over-complete action dictionary Dict. Once the quantization interval Δ is fixed, the number of dictionary values in the action dictionary is finite, denoted M.
Illustratively, for an 18-dimensional continuous action space (each dimension taking continuous values in the interval [0, 1]), the environment's action space can be quantized with quantization interval Δ = 0.1, and the discretized result after quantization serves as the over-complete action dictionary Dict; the number of dictionary values in the action dictionary is then M = 10^18.
Those skilled in the art will understand that this step requires the quantization interval to be chosen according to the environment's sensitivity to action values, so as to ensure that the resulting discretized action dictionary is over-complete.
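For illustration, a minimal Python sketch of this quantization step, assuming (as in the example above) actions in [0, 1] per dimension and 10 quantization levels per dimension; the helper names and the exact placement of the levels are illustrative assumptions:

    import itertools
    import numpy as np

    DELTA = 0.1  # quantization interval from the example above

    def snap_to_dictionary(action, delta=DELTA):
        # Map a continuous D-dimensional action in [0, 1]^D onto the nearest
        # dictionary value; the top cell is clamped so 1.0 maps to the last level.
        idx = np.minimum(np.floor(np.asarray(action) / delta), 1.0 / delta - 1)
        return idx * delta

    def enumerate_dictionary(dims, delta=DELTA):
        # Explicit enumeration is feasible only for small D: with 10 levels per
        # dimension and D = 18 the dictionary holds M = 10**18 entries, so in
        # practice it stays implicit via snap_to_dictionary().
        levels = np.arange(0.0, 1.0, delta)  # 10 levels: 0.0, 0.1, ..., 0.9
        return itertools.product(levels, repeat=dims)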
Step S2: using the dictionary values in the action dictionary, train an autoencoder implemented with a deep neural network to perform dimensionality reduction and encoding of the dictionary values in the action dictionary.
In the embodiment of the present invention, hash coding is selected as the coding scheme, and the autoencoder structure implemented with a deep neural network shown in Fig. 2 may be used.
For any D-dimensional dictionary value a_i in the action dictionary Dict, the result of binarizing it is denoted b(a_i). Obviously, b(a_i), like a_i, is a D-dimensional vector; the value of the j-th component of b(a_i) is denoted b_j(a_i). b(a_i) is quantized to obtain ⌊b(a_i)⌉, and ⌊b(a_i)⌉ is then mapped to a lower-dimensional space by some down-sampling mapping, giving g(⌊b(a_i)⌉). Clearly g(⌊b(a_i)⌉) is a function of a_i and can therefore be denoted φ(a_i). It should be noted that many such down-sampling mappings exist; for example, the SimHash algorithm may be used to realize it in this embodiment.
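A minimal sketch of the SimHash variant of such a down-sampling mapping follows; the fixed random projection stands in for the learned autoencoder of Fig. 2, and the code width k = 8 matches the later example (all names here are illustrative):

    import numpy as np

    class SimHashEncoder:
        # phi(a): down-sampling hash from a D-dimensional action to a k-bit code.
        def __init__(self, dim, k=8, seed=0):
            rng = np.random.default_rng(seed)
            self.A = rng.standard_normal((k, dim))  # fixed random projection

        def encode(self, a):
            # Project down to k dimensions, then binarize each component.
            return ((self.A @ np.asarray(a)) > 0).astype(int)

        def index(self, a):
            # Integer address of the k-bit code, usable as a count-table index.
            return int("".join(map(str, self.encode(a))), 2)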
In the embodiment of the present invention, the loss function of the autoencoder is defined as:
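The formula itself is reproduced only as an image in the source document. A plausible reconstruction, inferred from the variable definitions below and from the SimHash-autoencoder loss used in the count-based exploration literature that this embodiment closely resembles (the exact form is therefore an assumption), is:

L(\{a_i\}_{i=1}^{M}) = -\frac{1}{M} \sum_{i=1}^{M} \Big[ \log p(a_i) - \frac{\lambda}{D} \sum_{j=1}^{D} \min\big\{ (1 - b_j(a_i))^2,\ b_j(a_i)^2 \big\} \Big]

The first term rewards faithful reconstruction of the dictionary values; the second pushes each component of b(a_i) toward 0 or 1 so that binarization loses little information; λ balances the two.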
where M is the number of dictionary values in the action dictionary Dict with Δ as the quantization interval; p(a_i) denotes the probability of occurrence of a dictionary value a_i; and λ is the weighting coefficient adopted to constrain the two loss terms of the loss function that have no direct physical meaning.
Those skilled in the art will understand that the autoencoder structure for hash-coding action values given in Fig. 2 is only one comparatively effective network design. For other specific interactive environments, similar neural network structures and loss functions can be designed for the dimensionality of the action space and the statistical characteristics of the data; such designs fall within the protection scope of the present invention.
Step S3: initialize a count table whose values are all 0; when the intelligent agent interacts with the environment, determine the corresponding dictionary value in the action dictionary as the action according to the policy, and update the corresponding count value in the count table, thereby completing a policy update.
In the embodiment of the present invention, when the intelligent agent interacts with the environment, it receives the state value and return value from the environment and, according to the current policy, determines a dictionary value from the action dictionary as the current decision result and returns it to the environment; the environment takes the received decision result as an action and computes the state value and return value of the next time step; the action is also passed to the trained autoencoder for encoding, and the count value n(φ(a_i)) in the count table corresponding to the autoencoder's output is updated, i.e., n(φ(a_i)) ← n(φ(a_i)) + 1. The length of the count table equals the number of distinct results obtained after dimensionality reduction and encoding of all dictionary values in the action dictionary.
Illustratively, the dimensionality-reduction operation turns the 18-dimensional action vector into an 8-dimensional vector, and hash coding binarizes the value of each dimension to 0 or 1; the length of the count table is then 2^8.
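A sketch of the count table and its update rule, reusing the SimHashEncoder sketch above (with k = 8 the table has 2^8 = 256 entries):

    import numpy as np

    K_BITS = 8
    counts = np.zeros(2 ** K_BITS, dtype=np.int64)  # count table, initialized to all zeros
    encoder = SimHashEncoder(dim=18, k=K_BITS)

    def record(action):
        # n(phi(a)) <- n(phi(a)) + 1 after each interaction step.
        counts[encoder.index(action)] += 1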
Step S4: after the intelligent agent and the environment have completed K policy updates, for each dictionary value in the action dictionary, compute the probability of removing it from the action dictionary based on its count result in the count table, and remove the corresponding dictionary value from the action dictionary with the computed probability; traverse every dictionary value in the action dictionary and take the retained dictionary values as the new action dictionary.
In the embodiment of the present invention, the probability that a dictionary value is removed from the action dictionary is computed as:
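The formula itself is reproduced only as an image in the source document. One plausible form consistent with the variables described below (this exact expression is an assumption, not a verbatim copy of the original) is:

P(a_i) = \alpha - \beta \cdot \frac{n(\varphi(a_i))}{\max_{j} n(\varphi(a_j))}

Under this form, with α = β = 0.5, a dictionary value whose code was never visited is removed with probability 0.5, while the most frequently visited value is never removed.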
where α and β denote weight coefficients, usually set to constants; n(φ(a_i)) is the updated count in the count table corresponding to dictionary value a_i after the intelligent agent and the environment have completed K policy updates; and max_j n(φ(a_j)) denotes the maximum count value among the count results in the count table.
Illustratively, α and β can be set to 0.5, and K can be set to 40000.
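A sketch of this pruning step in Python, using the illustrative removal probability above (again an assumed form) and the SimHashEncoder sketch from step S2; it assumes the dictionary is small enough to enumerate explicitly:

    import random

    def prune_dictionary(dictionary, counts, encoder, alpha=0.5, beta=0.5):
        # Remove each dictionary value with a probability that shrinks with how
        # often its code was visited during the last K policy updates.
        n_max = max(max(int(counts[encoder.index(a)]) for a in dictionary), 1)
        kept = []
        for a in dictionary:
            p_remove = alpha - beta * counts[encoder.index(a)] / n_max  # assumed form
            if random.random() >= max(p_remove, 0.0):
                kept.append(a)
        return kept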
Step S5: after completing the update of the action dictionary, reset the count table to zero; the intelligent agent continues policy updates until the learned policy meets the requirements of the use case or the algorithm converges stably.
It should be noted that, when making decisions while interacting with the environment, the intelligent agent selects a dictionary value from the current action dictionary as the decision result.
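Putting steps S3 through S5 together, the outer loop can be sketched as follows; the env and agent interfaces and the convergence test are hypothetical placeholders:

    import numpy as np

    def train(env, agent, dictionary, encoder, K=40000, k_bits=8):
        counts = np.zeros(2 ** k_bits, dtype=np.int64)
        state = env.reset()
        while not agent.converged():
            for _ in range(K):                         # K policy updates per round
                action = agent.act(state, dictionary)  # pick a dictionary value by policy
                state, reward, done = env.step(action)
                agent.update(action, reward)           # one policy update
                counts[encoder.index(action)] += 1     # count the action's code
                if done:
                    state = env.reset()
            dictionary = prune_dictionary(dictionary, counts, encoder)  # step S4
            counts[:] = 0                              # step S5: reset the count table
        return dictionary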
In the above scheme of the embodiment of the present invention, the quantization operation converts a continuous action space into a discrete action space; the autoencoder implemented with a deep neural network then performs dimensionality reduction and encoding on the dictionary values of the discrete action space and counts their codes; the number of occurrences of the codeword corresponding to each dictionary value over a certain number of policy updates is tallied, and rarely occurring dictionary values are removed from the action dictionary probabilistically. Redundancy in the action dictionary is thereby continually eliminated, which improves the agent's exploration efficiency during policy updates.
The patented method has a certain degree of detectability; specific detection schemes are as follows:
First, whether the technical solution of this patent has been used can be detected from a program process's occupancy of the computer's memory at run time. From the moment the relevant program process starts running, dynamically monitor its memory occupancy: if the process's occupancy of memory exhibits a characteristic pattern of intermittent decreases, dropping once at regular intervals and then maintaining a relatively stable state, then the program is very likely to have used the technical solution of this patent.
It should be noted that, since special situations such as memory leaks may occur while a program runs, whether the technical solution of this patent has been used also needs to be further confirmed in combination with other detection methods.
Second, detect whether modules such as an "autoencoder", a "count table", and an "action dictionary" are present in the relevant program; if identical or similar modules are detected, the technical solution of this patent is likely to have been used.
Through the above description of the embodiments, those skilled in the art can clearly understand that the above embodiments can be implemented by software, or by software plus the necessary general hardware platform. Based on this understanding, the technical solutions of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard disk) and includes instructions that cause a computer device (a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task, characterized in that it comprises:
Step S1: quantizing each dimension of the action space, and forming an over-complete action dictionary from the quantization results;
Step S2: using the dictionary values in the action dictionary, training an autoencoder implemented with a deep neural network to perform dimensionality reduction and encoding of the dictionary values in the action dictionary;
Step S3: initializing a count table whose values are all 0; when the intelligent agent interacts with the environment, determining the corresponding dictionary value in the action dictionary as the action according to the policy, and updating the corresponding count value in the count table, thereby completing one policy update;
Step S4: after the intelligent agent and the environment have completed K policy updates, for each dictionary value in the action dictionary, computing the probability of removing it from the action dictionary based on its count result in the count table, and removing the corresponding dictionary value from the action dictionary with the computed probability; traversing every dictionary value in the action dictionary and taking the retained dictionary values as the new action dictionary;
Step S5: returning to step S3 and continuing policy updates until convergence.
2. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1, characterized in that each of the D dimensions of the action space is uniformly quantized with a sufficiently small quantization interval Δ, and an over-complete action dictionary Dict is formed from the uniform quantization results.
3. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1, characterized in that using the dictionary values in the action dictionary and training an autoencoder implemented with a deep neural network to perform dimensionality reduction and encoding of the dictionary values in the action dictionary comprises:
for any D-dimensional dictionary value a_i in the action dictionary Dict, performing down-sampling dimensionality reduction and encoding on the feature map output by the autoencoder, the result being denoted φ(a_i).
4. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1 or 3, characterized in that the loss function of the autoencoder is defined as:
where b_j(a_i) is the j-th component of the binarization result b(a_i) of the D-dimensional dictionary value a_i; M is the number of dictionary values in the action dictionary Dict with Δ as the quantization interval; p(a_i) denotes the probability of occurrence of dictionary value a_i; and λ is the weighting coefficient adopted to constrain the two loss terms of the loss function that have no direct physical meaning.
5. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1, characterized in that, when the intelligent agent interacts with the environment, determining the corresponding dictionary value in the action dictionary as the action according to the policy and updating the corresponding count value in the count table comprises:
when the intelligent agent interacts with the environment, the intelligent agent receives the state value and return value from the environment and, according to the current policy, determines a dictionary value from the action dictionary as the current decision result and returns it to the environment; the environment takes the received decision result as an action to compute the state value and return value of the next time step; the action is also passed to the trained autoencoder for encoding, and the count value in the count table corresponding to the autoencoder's output is updated.
6. The discretized exploration method for high-dimensional continuous action spaces in a reinforcement learning task according to claim 1, characterized in that the probability that a dictionary value is removed from the action dictionary is computed as:
where α and β denote weight coefficients; n(φ(a_i)) is the updated count in the count table corresponding to dictionary value a_i after the intelligent agent and the environment have completed K policy updates; and max_j n(φ(a_j)) denotes the maximum count value among the count results in the count table.
CN201810071032.9A 2018-01-24 2018-01-24 Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks Pending CN108288097A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810071032.9A 2018-01-24 2018-01-24 Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810071032.9A 2018-01-24 2018-01-24 Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks

Publications (1)

Publication Number Publication Date
CN108288097A true CN108288097A (en) 2018-07-17

Family

ID=62835828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810071032.9A Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks 2018-01-24 2018-01-24

Country Status (1)

Country Link
CN (1) CN108288097A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112292699A (en) * 2019-05-15 2021-01-29 创新先进技术有限公司 Determining action selection guidelines for an execution device
CN112257872A (en) * 2020-10-30 2021-01-22 周世海 Target planning method for reinforcement learning
CN112257872B (en) * 2020-10-30 2022-09-13 周世海 Target planning method for reinforcement learning

Similar Documents

Publication Publication Date Title
Qiang et al. Reinforcement learning model, algorithms and its application
CN106411896B (en) Network security situation prediction method based on APDE-RBF neural network
Ergezer et al. Oppositional biogeography-based optimization
CN109711544A (en) Method, apparatus, electronic equipment and the computer storage medium of model compression
Esmin et al. Hybrid evolutionary algorithm based on PSO and GA mutation
CN107958286A (en) A kind of depth migration learning method of field Adaptive Networking
CN113435606A (en) Method and device for optimizing reinforcement learning model, storage medium and electronic equipment
Donate et al. Evolutionary optimization of sparsely connected and time-lagged neural networks for time series forecasting
CN108288097A (en) Discretized exploration method for high-dimensional continuous action spaces in reinforcement learning tasks
Bergmeir et al. Time series modeling and forecasting using memetic algorithms for regime-switching models
Ollington et al. Incorporating expert advice into reinforcement learning using constructive neural networks
Lee et al. A genetic algorithm based robust learning credit assignment cerebellar model articulation controller
Li et al. A new relational tri-training system with adaptive data editing for inductive logic programming
Yang et al. A prediction model based on Big Data analysis using hybrid FCM clustering
Wu et al. Fault diagnosis of TE process based on incremental learning
CN111314119B (en) Method and device for quickly reconstructing unmanned platform information sensing network in uncertain environment
Chun et al. Impact of momentum bias on forecasting through knowledge discovery techniques in the foreign exchange market
Houssein et al. Salp swarm algorithm: modification and application
Bharti et al. QL-SSA: An Adaptive Q-Learning based Squirrel Search Algorithm for Feature Selection
Dale et al. Supervised clustering using decision trees and decision graphs: An ecological comparison
O’Neill et al. Forecasting market indices using evolutionary automatic programming: A case study
Liu et al. Evolutionary participatory learning in fuzzy systems modeling
Song et al. Reinforcement learning with chromatic networks for compact architecture search
Wang et al. Research on Portfolio Optimization Based on Deep Reinforcement Learning
Sharma et al. Bottom-up Pittsburgh approach for discovery of classification rules

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180717