WO2022042093A1 - Intelligent robot and learning method thereof - Google Patents

Intelligent robot and learning method thereof

Info

Publication number
WO2022042093A1
WO2022042093A1 (PCT/CN2021/105935, CN2021105935W)
Authority
WO
WIPO (PCT)
Prior art keywords
model
brain
message
state
agent
Prior art date
Application number
PCT/CN2021/105935
Other languages
English (en)
French (fr)
Inventor
朱宝
Original Assignee
朱宝
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 朱宝 filed Critical 朱宝
Publication of WO2022042093A1 publication Critical patent/WO2022042093A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/12 Computing arrangements based on biological models using genetic models
    • G06N 3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Definitions

  • the present disclosure relates to the field of artificial intelligence, and in particular, to an intelligent robot and a learning method thereof.
  • the present disclosure aims to solve the problems of current artificial intelligence such as language, emotion, and insufficient ability to deal with complex tasks, so as to obtain artificial intelligence robots that can understand and use language, have emotions, and can deal with some complex field problems.
  • an embodiment of the present disclosure provides an intelligent robot. The intelligent robot includes an agent, and the agent includes: a first processing module including a brain model, where the brain model is used to obtain a first interaction message and the state fed back by the external environment and agents, and to output an action and/or a second interaction message according to the state and/or the first interaction message, so as to obtain a new first interaction message and/or make the external environment and agents adjust the state according to the action; a second processing module including a heart model, where the heart model is used to obtain at least one of the state and the first interaction message and to output a reward according to at least one of them; and an update module, configured to update the brain model according to the reward to realize learning of the agent. The heart model is also used for evolutionary learning, and the evolved heart model is used to update the brain model, so as to obtain a heart model and a brain model suited to the external environment.
  • the intelligent robot of the embodiment of the present disclosure drives the brain model to learn based on the mental model of the agent, and allows the mental model of the agent to be updated in the evolution of the group, so that the robot can solve complex domain problems.
  • an embodiment of the present disclosure proposes a learning method for the intelligent robot of the above embodiments, including the following steps: obtaining, through the brain model, a first interaction message and the state fed back by the external environment and agents, and outputting an action and/or a second interaction message according to the state and/or the first interaction message, so as to obtain a new first interaction message and/or make the external environment and agents adjust the state according to the action;
  • obtaining, through the heart model, at least one of the state and the first interaction message, and outputting a reward according to at least one of them; updating the brain model according to the reward to realize learning of the agent; and performing evolutionary learning on the heart model and updating the brain model with the evolved heart model, so as to obtain a heart model and a brain model suited to the external environment.
  • the learning method for an intelligent robot proposed by the embodiments of the present disclosure drives the brain model to learn based on the agent's own heart model, and allows the agent's heart model to be updated through the evolution of the population, so as to obtain an intelligent robot that can solve problems in complex domains.
  • FIG. 1 is a schematic structural diagram of an intelligent robot according to an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a learning method for an intelligent robot according to an embodiment of the present disclosure
  • Figure 3(a) is a schematic diagram of a brain model according to an embodiment of the present disclosure.
  • Figure 3(b) is a schematic diagram of a heart model according to an embodiment of the present disclosure.
  • Figures 4(a)-4(c) are schematic diagrams of assisted reinforcement learning training of multiple examples of the present disclosure.
  • FIG. 5 is a flowchart of evolutionary learning according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of a decision tree model of an example of the present disclosure.
  • FIG. 7 is a schematic diagram of the external environment of an example of the present disclosure.
  • the entire reinforcement learning system generally includes five parts: agent, state, reward/punishment, action and environment.
  • the agent is the core of the entire reinforcement learning system. Using the reward provided by the external environment as feedback, it learns a mapping from environment states to actions; the principle of action selection is to maximize the probability of accumulating reward in the future. A selected action affects not only the reward at the current moment but also the reward at the next moment and even further into the future. Therefore, the basic rule the agent follows during learning is: if an action brings a positive return, i.e., a reward, from the external environment, that action is strengthened; otherwise it is gradually weakened.
  • the state refers to the information of the environment where the agent is located, and contains all the information that the agent uses for action selection.
  • Reward/punishment refers to a quantifiable scalar feedback signal provided to the agent by the external environment, used to evaluate the quality of the action taken by the agent at a certain time step; a positive number generally represents a reward and a negative number a penalty.
  • Action refers to the action taken by the agent during the interaction process.
  • the external environment (Environment) receives the series of actions performed by the agent, evaluates how good this series of actions is, and converts the evaluation into a quantifiable reward fed back to the agent, without telling the agent how it should learn to act. The agent can only learn from its own past experience. At the same time, the external environment also provides the agent with the state it is in.
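  • For reference, the learning principle described above can be written in the standard textbook form below; the discount factor γ, the learning rate α, and the action-value function Q are conventional reinforcement-learning notation and are not defined in this publication itself.

```latex
% Expected discounted return that the agent tries to maximize
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right], \qquad 0 \le \gamma < 1
% Q-learning update for the state-action pair (s_t, a_t) after receiving reward r_{t+1}
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right)
```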
  • compared with the traditional reinforcement learning described above, in the intelligent robot and its learning method proposed by the present disclosure, the reward/punishment is produced by the agent itself; in addition to receiving the state fed back by the external environment and other agents and feeding actions back to the external environment, the agent can also receive interaction messages and output interaction messages.
  • the intelligent robot includes an intelligent body 100 .
  • the agent 100 includes: a first processing module, a second processing module and an update module.
  • the first processing module includes a brain model (Brain). The brain model is used to obtain a first interaction message message1 and the state fed back by the external environment Environment and agents 100 (an agent here may be the agent itself or another agent), and to output an action and/or a second interaction message message2 according to the state and/or message1, so as to obtain a new first interaction message message1 and/or make the external environment and the agents 100 adjust the state according to the action;
  • the second processing module includes a heart model (Heart). The heart model is used to obtain at least one of the state and the first interaction message message1, and to output a reward according to at least one of them;
  • the update module is used to update the brain model (Brain) according to the reward, so as to realize learning of the agent 100; the heart model (Heart) can also be used for evolutionary learning, and the evolved heart model is used to update the brain model, so as to obtain a heart model and a brain model suited to the external environment.
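  • A minimal sketch of one interaction step of this architecture is given below; BrainModel, HeartModel, and the environment methods (observe, apply) are placeholders for whatever concrete models and simulator an implementation would use, and are not defined in this publication.

```python
# Sketch of a single interaction step of the agent: the brain model maps
# (state, incoming message) to (action, outgoing message), the heart model
# maps the same inputs to a reward, and the environment applies the action.
def agent_step(brain, heart, env, incoming_message):
    state = env.observe()                        # state fed back by the environment/agents
    action, outgoing_message = brain(state, incoming_message)
    reward = heart(state, incoming_message)      # the reward comes from the agent's own heart model
    env.apply(action)                            # the environment adjusts its state according to the action
    return action, outgoing_message, reward
```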
  • the first interaction message and the second interaction message may be messages in text form or in other forms, for example changes in the amplitude, frequency, or duration of an optical signal, changes in the amplitude, frequency, or duration of an electrical signal, or other signal forms capable of carrying information.
  • a message may also be a combination of messages in multiple forms, or a multi-dimensional message. By restricting the form of the message, many different kinds of message expression systems can be formed: for example, if the message may only use the pitch of a sound, music that expresses information becomes possible; if the message is a combination of body movements, a dance that expresses information becomes possible.
  • the intelligent body 100 can obtain the first interactive message through the message collection device of the intelligent robot, such as obtaining motion images through a camera, and obtaining sound messages through a microphone; the current intelligent body can also have a communication module, which can realize The communication connection between the current agent and other agents can further obtain the first interaction message transmitted by other agents through the communication module.
  • the intelligent robot may include multiple agents 100 (two are shown in FIG. 1 ).
  • the external environment and the agents 100 can feed back states to each agent 100 , each agent 100 can output actions to the external environment, and messages message1 and message2 can be exchanged between the agents 100 .
  • the reward output by the heart model (Heart) can be affected by the external environment: for example, if the external environment is harsh, the agent 100 perceives it, the heart model receives the state fed back by the environment and can output a penalty reward, prompting the agent 100 to move away from the harsh environment. It can also be affected by the agent 100 itself: for example, if the agent feels hungry, the hunger information passes through the heart model and produces a penalty reward, prompting the agent to look for food. It can also be affected by other agents: for example, if the current agent lags behind other agents in material wealth, the comparison information can make the heart model output a penalty reward, prompting the agent to acquire more wealth.
  • the brain model (Brain) can include a convolutional neural network (cnn) that accepts the state, a feedforward embedding network (embedding) that accepts the message, a recurrent neural network (rnn) with an attention mechanism, a feedforward network (fcn+softmax) that outputs the action, and a feedforward network (fcn+softmax) that outputs the message.
  • compared with the agent in traditional reinforcement learning, the input of the agent's brain model (Brain) in the embodiments of the present disclosure includes, in addition to the state of the external environment, the agent's own state and the messages of other agents; the input of the heart model (Heart) can be the same as the input of the brain model (Brain).
  • the reward that drives the learning of the brain model (Brain) comes from the output of the heart model (Heart).
  • as shown in Figure 3(a), the brain model includes an input layer, a core layer, and an output layer connected in sequence.
  • the input layer includes a convolutional neural network layer (cnn) and an embedding layer (embedding): the convolutional layer receives the state, and the embedding layer receives the first interaction message (message).
  • the core layer includes a recurrent neural network (rnn) based on an attention mechanism; the output layer includes a fully connected layer (fcn) and a softmax layer, and it outputs the action and/or the second interaction message.
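  • A possible PyTorch sketch of this brain model is shown below. The layer sizes, the vocabulary size, and the single-token message head are illustrative assumptions; a full implementation would decode a message token sequence and tune the architecture to the task.

```python
import torch
import torch.nn as nn

class BrainModel(nn.Module):
    """Sketch of the brain model: a CNN for the state image, an embedding layer
    for message tokens, a GRU core with dot-product attention over the message,
    and softmax heads for the action and the reply message (one token only)."""
    def __init__(self, vocab_size=64, n_actions=9, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, hidden))
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, n_actions)
        self.message_head = nn.Linear(hidden, vocab_size)

    def forward(self, state_img, message_tokens):
        # state_img: (B, 1, H, W); message_tokens: (B, T) integer token ids
        s = self.cnn(state_img)                                   # (B, hidden)
        m = self.embed(message_tokens)                            # (B, T, hidden)
        out, h = self.rnn(m, s.unsqueeze(0))                      # condition the GRU on the state
        attn = torch.softmax(out @ h[-1].unsqueeze(-1), dim=1)    # attention weights over the message
        ctx = (attn * out).sum(dim=1)                             # attention-weighted summary
        return (torch.softmax(self.action_head(ctx), dim=-1),
                torch.softmax(self.message_head(ctx), dim=-1))
```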
  • the heart model can be used to drive the agent 100 to fulfill: 1) "physiological" needs, such as breathing, drinking water, ingesting food, seeking a suitable temperature, sleep, and sex; 2) "emotional" needs, such as a sense of superiority, corresponding to emotions such as jealousy, showing off, pride, and frustration, which can be made concrete as, e.g., the current agent's wealth ranking; a sense of belonging, corresponding to emotions such as longing, attachment, missing someone, and loneliness, which can be made concrete as, e.g., the number of closely related agents within a limited range; and a sense of gain, corresponding to emotions such as satisfaction, happiness, joy, and sadness, which can be made concrete as the achievement or possession of any small goal.
  • the heart model (Heart) can be obtained through evolution, for example, the heart model can include a decision tree model, as shown in Figure 3(b), DT in Figure 3(b) is the abbreviation of Decision Tree, that is, decision tree.
  • the updating module updates the brain model, specifically, performing reinforcement learning on the brain model, that is, using the above-mentioned heart model (Heart) to perform reinforcement learning on the brain model (Brain) of the agent.
  • the training algorithm can use reinforcement learning algorithms such as Q-Learning, A3C (Actor-Critic), PPO (Proximal Policy Optimization), and DDPG (Deep Deterministic Policy Gradient). The specific algorithm can be selected according to task requirements, and some algorithms themselves require additional network models.
  • when the above algorithms are used for training, the present disclosure does not limit the task objective; rather, it limits the model structure and training method of the agent, specifically the training of the agent's complete model (brain model + heart model) based on message cooperation.
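  • As one example of such an update module, a REINFORCE-style policy-gradient update driven by heart-model rewards could look like the sketch below; the trajectory format and the choice of REINFORCE (rather than Q-Learning, A3C, PPO, or DDPG, which the publication equally allows) are assumptions.

```python
import torch

def reinforce_update(brain_optimizer, trajectory, gamma=0.99):
    """One REINFORCE-style policy-gradient update.  `trajectory` is a list of
    (log_prob, reward) pairs collected while the brain model interacted with
    the environment; the rewards come from the heart model, not the environment."""
    returns, g = [], 0.0
    for _, r in reversed(trajectory):            # discounted return-to-go
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    loss = -sum(lp * g for (lp, _), g in zip(trajectory, returns))
    brain_optimizer.zero_grad()
    loss.backward()
    brain_optimizer.step()
```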
  • the intelligent robot of the embodiment of the present disclosure drives the brain model to learn based on the mental model of the agent itself, and allows the mental model of the agent to be updated in the evolution of the group, so that the intelligent robot can solve complex domain problems.
  • the heart model (Heart) of the agent 100 is derived from evolution (rather than given and fitting), thereby solving the problem that weak artificial intelligence robots can only be applied to specific and not very complex fields .
  • the basic need of an agent is to survive. Since the external environment is the cause of the death of the agent, the agent needs to make its own needs consistent with the needs of the external environment for the agent.
  • to achieve this goal, evolutionary learning can also be performed on the heart model (Heart), so that the agent has feelings and emotions.
  • primitive agents adapt to the environment through mutation (e.g., viruses), while higher agents adapt through reproduction, which includes heredity and mutation; the present disclosure therefore simulates the reproduction process to obtain the need function. Specifically, a model is built to fit the heart model (Heart).
  • the parameters of the model are similar to the genes of the agent.
  • the input is state and message, and the output is reward.
  • the genetic mutation algorithm is used to update the parameters of the model.
  • Step 1 Random initialization based on the heart model to obtain n gene sequences (model parameter encoding).
  • Fig. 6 shows a binary decision tree; the depth of the tree equals the number of distinct feature expressions. Except for the leaf nodes, every node represents an expression that tests whether a feature satisfies a condition; Fig. 6 shows two feature expressions, f1 and f2. If the current node's condition is satisfied, the expression on the right-hand path is evaluated next, otherwise the one on the left; a leaf node represents the final output.
  • the parameters of the decision tree model in Figure 6 are [f1, f2, -1, -0.5, f2, 0.5, 1]; the parameters are binarized to make it easy to apply a genetic mutation algorithm.
  • the gene in the first layer has two choices and the genes in the second layer also have two choices (repeatedly testing the same expression is allowed here, but the model depth is always the number of distinct feature expressions); the genes in the third layer are discretized, and assuming there are only four choices, corresponding to the different values in the third layer, the above genes become [0, 1, 00, 01, 1, 10, 11] after binarization, giving a gene sequence (see the sketches following Step 6 below).
  • further, n-1 vectors of the same dimension as the above gene sequence are obtained by random initialization, giving n gene sequences in total.
  • Step 2 Duplicate the brain model to obtain n brain models
  • Step 3 Restore each gene sequence to the corresponding heart model (Heart), and generate corresponding rewards according to the feedback state of the external environment through each heart model (Heart), for each brain model (Brain), based on the rewards Brain model for reinforcement learning training;
  • Step 4 Eliminate m gene sequences with low scores, where m < n;
  • Step 5 Based on the remaining n-m gene sequences, perform genetic mutation operation to obtain new m gene sequences;
  • Step 6 Using the remaining n-m gene sequences and the new m gene sequences, return to Step 3 until a heart model with a score greater than a preset value is obtained, so as to update the brain model based on the heart model.
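  • The encoding sketch referenced in Step 1 above: it reproduces the binarization of the Fig. 6 parameters [f1, f2, -1, -0.5, f2, 0.5, 1] into [0, 1, 00, 01, 1, 10, 11]; the two lookup dictionaries are assumptions that simply follow the mapping described in the text.

```python
# Internal nodes choose one of two feature expressions (f1 -> '0', f2 -> '1');
# leaves take one of four discretized reward values, encoded with two bits.
FEATURE_CODE = {"f1": "0", "f2": "1"}
LEAF_CODE = {-1.0: "00", -0.5: "01", 0.5: "10", 1.0: "11"}

def encode_heart(params):
    """E.g. ['f1', 'f2', -1.0, -0.5, 'f2', 0.5, 1.0] -> gene sequence."""
    return [FEATURE_CODE[p] if isinstance(p, str) else LEAF_CODE[p] for p in params]

print(encode_heart(["f1", "f2", -1.0, -0.5, "f2", 0.5, 1.0]))
# -> ['0', '1', '00', '01', '1', '10', '11'], matching the example in the text
```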
  • evolutionary learning can also be performed using evolutionary algorithms such as ant colony, tabu, and simulated annealing.
  • the needs of the agent itself can be consistent with the needs of the external environment for the agent, so that the intelligent robot can adapt to the external environment.
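  • A compact sketch of the six-step evolutionary loop above (Steps 1-6) is given below; all callables (make_gene, decode, train_brain, score, mutate) are placeholders for the concrete heart-model encoding, reinforcement-learning routine, and scoring rule an implementation would supply.

```python
import random

def evolve_hearts(n, m, make_gene, decode, train_brain, score, target, mutate):
    """Evolve heart models: keep the best-scoring gene sequences, mutate them,
    and stop once a heart model scores above the preset target value."""
    genes = [make_gene() for _ in range(n)]                  # step 1 (step 2: one brain copy per gene, inside train_brain)
    while True:
        hearts = [decode(g) for g in genes]                  # step 3: restore heart models from genes
        brains = [train_brain(h) for h in hearts]            #         RL training of a brain copy driven by each heart
        scores = [score(b, h) for b, h in zip(brains, hearts)]
        best = max(scores)
        if best > target:                                    # step 6: stop when a heart model is good enough
            return hearts[scores.index(best)]
        ranked = sorted(zip(scores, genes), key=lambda x: x[0], reverse=True)
        survivors = [g for _, g in ranked[: n - m]]          # step 4: drop the m lowest-scoring genes
        offspring = [mutate(random.choice(survivors)) for _ in range(m)]   # step 5: genetic mutation
        genes = survivors + offspring                        # back to step 3
```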
  • the update module is further configured to: acquire time-series data, where the time-series data includes one or more of the time series of messages themselves, of states themselves, of actions themselves, message-to-message time series, state-to-message time series, state-to-action time series, message-to-action time series, and time series among messages, states, and actions; and train the brain model according to the time-series data. Based on training with these various time-series regressions, the types of information the agent can receive and output can be enriched, the training efficiency of the brain model can be improved, and subsequent updates of the brain model are made easier.
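  • A minimal sketch of such time-series pre-training, assuming episodes are stored as lists of (state, message, action) steps and that the brain model, loss function, and optimizer are supplied by the caller:

```python
# Recorded (state, message, action) steps are replayed and the brain model is
# fit to reproduce the logged action at each step; a parallel loss on the
# message output could be added in exactly the same way.
def pretrain_on_episodes(brain, episodes, loss_fn, optimizer):
    for episode in episodes:                       # each episode: list of (state, message, action)
        for state, message, action in episode:
            pred_action, pred_message = brain(state, message)
            loss = loss_fn(pred_action, action)    # supervised regression/classification target
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```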
  • the update module can also be used to: obtain the sample state fed back by the external environment; input the sample state to the observer, so that the observer outputs a sample message according to the sample state; output the sample message to the initial brain model , and output the sample action through the initial brain model; obtain the sample return based on the sample action feedback from the external environment; update the initial brain model according to the sample return to obtain the brain model.
  • the reinforcement learning training can be assisted by an observer, as shown in Figure 4(a). Compared with training in ordinary reinforcement learning, an observer is added in this embodiment; the observer converts the state into a message, and the agent obtains the state through that message.
  • the agent can also output the message to the observer according to the message, and the observer can also consider the message output by the agent when outputting the message.
  • in this way, state-to-message and message-to-action training is improved, the agent learns to understand and use messages, and the agent can better meet human-oriented requirements.
  • the updating process of the brain model may further include: acquiring a sample state fed back by the external environment; inputting the sample state into the initial brain model, and outputting a sample message through the initial brain model; outputting the sample message to The executor, so that the executor outputs sample actions according to the sample messages; obtains the sample returns fed back by the external environment according to the sample actions; and updates the initial brain model according to the sample returns to obtain the brain model.
  • the reinforcement learning training can also be assisted by an executor, as shown in FIG. 4(b). Compared with training in ordinary reinforcement learning, an executor is added in this embodiment: the executor outputs an action from the message that the agent produced from the state, and the agent is trained with the reward that the external environment feeds back for that action.
  • the executor can also output the message to the agent according to the message, and the agent can also consider the message output by the executor when outputting the message.
  • in this way, state-to-message and message-to-action training is improved, the agent learns to understand and use messages, and the agent can better meet human-oriented requirements.
  • the observer and the executor may be pre-built models or human operators, which is not limited here.
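  • The observer-assisted setup of Figure 4(a) could be driven by a loop like the sketch below; observer.describe, env.step, and the decision to drop the raw state from the brain input are assumptions used only to illustrate the data flow.

```python
# Observer-assisted step: the observer sees the state and describes it as a
# message; the agent only sees that message and must act on it.  The observer
# can be a pre-built model or a human operator.
def observer_assisted_step(observer, brain, env):
    state = env.observe()
    message = observer.describe(state)         # state -> message
    action, reply = brain(None, message)       # the agent acts from the message alone
    reward = env.step(action)                  # return fed back for the chosen action
    return message, action, reply, reward
```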
  • reinforcement learning for multi-agent message cooperation can also be performed. Different from multi-agent reinforcement learning based on game theory, this example emphasizes the completion of training tasks through message cooperation.
  • the learning goals of the intelligent robot include: letting the agent learn to play whack-a-mole, letting the agent understand the language used when playing whack-a-mole, letting the agent use language to complete the game, and letting the agent come to like whack-a-mole.
  • the learning process is as follows:
  • Step 1 Build the external environment
  • as shown in FIG. 7, a nine-square grid is designed, with a hammer placed in the middle; a mole pops up at random from one of the surrounding cells. The agent can pick up the hammer and hit the mole; if the mole is hit, it randomly pops up from another surrounding cell. Because the agent holding the hammer has a limited view (it can only observe one of the directions up, down, left, and right), the external environment allows other agents to cooperate with the current agent, for example one agent responsible for observation and one responsible for execution. Each time an agent hits a mole it is rewarded one point, and the cooperating agent also gets one point.
  • Step 2 Model building: build the brain model (Brain) shown in Figure 3(a) and the heart model (Heart) shown in Figure 3(b). The state in Figure 3(b) is the state observable by the agent, including the scores of other agents and the observable environment; the reward is the feeling that the agent can experience.
  • Step 3 Input and output restrictions
  • the state consists of the state of the nine-square grid and whether the agent's score ranks near the top.
  • the state input can be represented by a 9-bit binary vector: the first 8 bits indicate whether there is a mole in the corresponding cell, and the last bit indicates whether the score ranks near the top. For example, [100000001] means that there is a mole in the first cell of the grid and the current agent's score ranks near the top.
  • the message is limited to human natural language, such as "hit the front”, “hit the first”, “on your left” and so on.
  • Common words and terminators can be one-hot encoded, input and output in time series.
  • the action is limited to hitting one of the 8 surrounding cells or taking no action, 9 cases in total; the action can also be represented by a 9-bit binary vector.
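  • A small sketch of these encodings is shown below; the vocabulary list is a made-up assumption, while the 9-bit state layout follows the text.

```python
# 9-bit state vector (8 grid cells + "score ranks near the top"), a 9-way
# action, and one-hot message tokens drawn from a small example vocabulary.
VOCAB = ["hit", "the", "first", "front", "left", "<eos>"]

def encode_state(mole_cells, score_leading):
    return [1 if i in mole_cells else 0 for i in range(8)] + [int(score_leading)]

def one_hot(word):
    v = [0] * len(VOCAB)
    v[VOCAB.index(word)] = 1
    return v

print(encode_state({0}, True))   # [1, 0, 0, 0, 0, 0, 0, 0, 1], as in the text
print(one_hot("first"))
```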
  • Step 4 Model training. (1) Time-series regression training: some time series can be pre-edited and the agent made to learn these sequences. For example, one can manually play a few whack-a-mole games, record the corresponding state, message, and action at each time step, and then use these data to perform time-series regression training on the agent.
  • after this training, the agent can learn the following: some instructions, e.g., if told "hit the first one", the agent hits the first cell; some descriptions, e.g., when the agent sees a mole in the first cell, it outputs the message "hit the first one"; and some exchanges, e.g., when asked which one to hit, the agent replies "hit the first one".
  • (2) Reinforcement learning training: if the agent hits the mole present in the state, it is rewarded one point; the state is fed to the agent, and the agent learns to take the action that obtains the maximum reward, using the reinforcement learning mode. After this training, the agent learns to hit the mole correctly whenever it sees one.
  • (3) Message-cooperative reinforcement learning training: the message-cooperative training shown in Figures 4(a)-4(c) can be used. After this training, the agent learns to understand the messages sent by other agents and to use messages to help other agents hit moles.
  • (4) Evolutionary learning:
  • the heart model adopts a decision tree model. The input is the 9-bit binary state vector, such as [100000001], so a 9-layer binary tree can be designed: except for the leaf nodes, each tree node stores which dimension it tests (for example, a node testing the ninth dimension has the value 9). Leaf nodes are discretized values between 0 and 1, representing the reward. Going left in the binary tree is the decision path when that dimension's value is 0, and going right is the path for 1. A random binary tree is then generated and saved by pre-order traversal, forming a multi-dimensional vector such as [5, 8, 9, 3, 4, ..., 0.75]; each dimension's value is then binary-coded to obtain [011, 100, ...], which is the agent's gene sequence.
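  • A minimal sketch of evaluating such a heart-model decision tree on a 9-bit state is given below; the example tree literal is hypothetical and much shallower than the 9-layer tree described.

```python
# Each internal node stores which state dimension (1-9) it tests; the left
# branch is taken when that bit is 0 and the right branch when it is 1, and
# leaves hold the discretized reward value in [0, 1].
def heart_reward(node, state_bits):
    if not isinstance(node, tuple):          # leaf: reward value
        return node
    dim, left, right = node                  # dim is 1-indexed, as in the text
    return heart_reward(right if state_bits[dim - 1] else left, state_bits)

example_tree = (1, (9, 0.0, 0.25), (9, 0.5, 1.0))                # hypothetical shallow tree
print(heart_reward(example_tree, [1, 0, 0, 0, 0, 0, 0, 0, 1]))   # -> 1.0
```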
  • evolutionary learning is carried out following the steps shown in FIG. 5. After learning, the agent obtains the best-evolved heart model (Heart), which has the following functions: when the agent hits a mole correctly, it receives a reward score; and when the agent's score ranks near the bottom, the reward may be higher (equivalent to increasing the learning rate), and conversely lower.
  • further, combining the above brain model (Brain) and heart model (Heart) yields an agent that likes playing whack-a-mole; even when the external environment or the environment rules change slightly, the agent still likes hitting moles and can gradually adapt to the new environment and rules.
  • even without relevant data and knowledge, the agent can still learn to play whack-a-mole; it just takes longer.
  • Each brain model (Brain) and heart model (Heart) are initially randomly parameterized models. Through continuous reinforcement learning and evolutionary learning, an intelligent body that adapts to the environment and environmental rules and likes whack-a-mole is finally generated.
  • the intelligent robot of the embodiment of the present disclosure can adapt to a field without relevant knowledge, and obtain an intelligent body that can adapt to the field.
  • Agents can learn to understand and use language, and have inherent emotions. In the process of training such models, humans can understand music, language, the generation of emotions, and the reasons for social relationships and thinking patterns. It has an immeasurable role in the progress of humanities and society.
  • FIG. 2 is a flowchart of a learning method for an intelligent robot according to an embodiment of the present disclosure.
  • the learning method of the intelligent robot includes the following steps:
  • S1 obtain the first interaction message and the state fed back by the external environment and the agent through the brain model, and output action and/or second interaction message according to the state and/or the first interaction message to obtain a new first interaction message , and/or, make the external environment and agent adjust state according to the action.
  • S2 Obtain, through the heart model, at least one of the state and the first interaction message, and output a reward according to at least one of them.
  • S3 Update, through the update module, the brain model according to the reward, so as to realize learning of the agent.
  • S4 Perform evolutionary learning on the heart model, and update the brain model with the evolved heart model, so as to obtain a heart model and a brain model suited to the external environment.
  • evolutionary learning is performed on the mental model by the following steps:
  • Step 1 Random initialization based on the heart model to obtain n gene sequences (model codes);
  • Step 2 Duplicate the brain model to obtain n brain models
  • Step 3 restore each gene sequence to the corresponding heart model, generate corresponding rewards according to the feedback state of the external environment through each heart model, and perform reinforcement learning training on the corresponding brain model for each brain model based on the rewards;
  • Step 4 Eliminate m gene sequences with low scores, where m < n;
  • Step 5 Based on the remaining n-m gene sequences, perform genetic mutation operation to obtain new m gene sequences;
  • Step 6 Using the remaining n-m gene sequences and the new m gene sequences, return to Step 3 until a heart model with a score greater than a preset value is obtained, so as to update the brain model based on the heart model.
  • the learning method of the intelligent robot in the embodiment of the present disclosure can make the intelligent robot adapt to a field without relevant knowledge, and obtain an intelligent body that can adapt to the field.
  • This learning method enables the agent to learn to understand and use language, and has inherent emotions.
  • humans can understand music, language, the generation of emotions, and the reasons for understanding social relations and thinking patterns. It plays an immeasurable role in achieving progress in the field of industrial science and progress in the humanities and society.
  • a "computer-readable medium” can be any device that can contain, store, communicate, propagate, or transport the program for use by or in connection with an instruction execution system, apparatus, or apparatus.
  • more specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM).
  • the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it if necessary, and then stored in a computer memory.
  • portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof.
  • various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system.
  • for example, if implemented in hardware, as in another embodiment, any one of the following techniques known in the art, or a combination of them, can be used: a discrete logic circuit with logic gates for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
  • first and second are only used for descriptive purposes, and should not be construed as indicating or implying relative importance or implying the number of indicated technical features. Thus, a feature delimited with “first”, “second” may expressly or implicitly include at least one of that feature.
  • plurality means at least two, such as two, three, etc., unless expressly and specifically defined otherwise.
  • a first feature "on” or “under” a second feature may be in direct contact with the first and second features, or indirectly through an intermediary between the first and second features touch.
  • the first feature being “above”, “over” and “above” the second feature may mean that the first feature is directly above or obliquely above the second feature, or simply means that the first feature is level higher than the second feature.
  • the first feature being “below”, “below” and “below” the second feature may mean that the first feature is directly below or obliquely below the second feature, or simply means that the first feature has a lower level than the second feature.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Genetics & Genomics (AREA)
  • Physiology (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

An intelligent robot and a learning method thereof, relating to the field of artificial intelligence. The intelligent robot includes an agent (100), and the agent (100) includes: a brain model, used to obtain a first interaction message and the state fed back by the external environment and the agent (100), and, according to the state and/or the first interaction message, output an action and/or a second interaction message, so as to obtain a new first interaction message and/or make the external environment adjust the state according to the action; a heart model, used to output a reward according to at least one of the state and the first interaction message; and an update module, used to update the brain model according to the reward to realize learning of the agent (100). The heart model is also used for evolutionary learning, and the evolved heart model is used to update the brain model, so as to obtain a heart model and a brain model suited to the environment.

Description

Intelligent robot and learning method thereof
CROSS-REFERENCE TO RELATED APPLICATIONS
The present disclosure claims priority to Chinese patent application No. 202010875710.4, entitled "Intelligent robot and learning method thereof" and filed on August 27, 2020, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present disclosure relates to the field of artificial intelligence, and in particular to an intelligent robot and a learning method thereof.
BACKGROUND
In recent years, artificial intelligence has developed rapidly. Intelligent robots such as the chatbots Apple Siri and Microsoft XiaoIce have entered public view, and AlphaGo has decisively beaten human players at Go. However, existing artificial-intelligence technology still has great difficulty with tasks that involve human natural language, emotion, and complex domains that are hard to define, so existing artificial intelligence is collectively referred to in the industry as weak artificial intelligence.
SUMMARY
The present disclosure aims to solve problems of current artificial intelligence such as its insufficient ability to handle language, emotion, and complex tasks, so as to obtain artificial-intelligence robots that can understand and use language, have emotions, and can handle some problems in complex domains.
To achieve the above objective, in a first aspect, an embodiment of the present disclosure provides an intelligent robot. The intelligent robot includes an agent, and the agent includes: a first processing module including a brain model, where the brain model is used to obtain a first interaction message and the state fed back by the external environment and agents, and to output an action and/or a second interaction message according to the state and/or the first interaction message, so as to obtain a new first interaction message and/or make the external environment and agents adjust the state according to the action; a second processing module including a heart model, where the heart model is used to obtain at least one of the state and the first interaction message and to output a reward according to at least one of them; and an update module, used to update the brain model according to the reward to realize learning of the agent; where the heart model is also used for evolutionary learning, and the evolved heart model is used to update the brain model, so as to obtain a heart model and a brain model suited to the external environment.
The intelligent robot of the embodiments of the present disclosure drives the brain model to learn based on the agent's own heart model, and allows the agent's heart model to be updated through the evolution of the population, so that the robot can solve problems in complex domains.
In a second aspect, an embodiment of the present disclosure proposes a learning method for the intelligent robot of the above embodiment, including the following steps: obtaining, through the brain model, a first interaction message and the state fed back by the external environment and agents, and outputting an action and/or a second interaction message according to the state and/or the first interaction message, so as to obtain a new first interaction message and/or make the external environment and agents adjust the state according to the action; obtaining, through the heart model, at least one of the state and the first interaction message, and outputting a reward according to at least one of them; updating, through the update module, the brain model according to the reward to realize learning of the agent; and performing evolutionary learning on the heart model and updating the brain model with the evolved heart model, so as to obtain a heart model and a brain model suited to the external environment.
The learning method for an intelligent robot proposed by the embodiments of the present disclosure drives the brain model to learn based on the agent's own heart model, and allows the agent's heart model to be updated through the evolution of the population, so as to obtain an intelligent robot that can solve problems in complex domains.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and/or additional aspects and advantages of the present disclosure will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic structural diagram of an intelligent robot according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a learning method for an intelligent robot according to an embodiment of the present disclosure;
FIG. 3(a) is a schematic diagram of a brain model according to an embodiment of the present disclosure;
FIG. 3(b) is a schematic diagram of a heart model according to an embodiment of the present disclosure;
FIGS. 4(a)-4(c) are schematic diagrams of assisted reinforcement-learning training in several examples of the present disclosure;
FIG. 5 is a flowchart of evolutionary learning according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a decision tree model in an example of the present disclosure;
FIG. 7 is a schematic diagram of the external environment in an example of the present disclosure.
DETAILED DESCRIPTION
In traditional reinforcement learning, the entire reinforcement-learning system generally includes five parts: the agent, the state, the reward/punishment, the action, and the external environment (Environment).
Specifically, the agent is the core of the entire reinforcement-learning system. Using the reward provided by the external environment as feedback, it learns a mapping from environment states to actions; the principle of action selection is to maximize the probability of accumulating reward in the future. A selected action affects not only the reward at the current moment but also the reward at the next moment and even further into the future. Therefore, the basic rule the agent follows during learning is: if an action brings a positive return, i.e., a reward, from the external environment, that action is strengthened; otherwise it is gradually weakened.
The state refers to the information of the environment where the agent is located, and contains all the information the agent uses for action selection.
The reward/punishment (reward) refers to a quantifiable scalar feedback signal provided to the agent by the external environment, used to evaluate the quality of the action taken by the agent at a certain time step; a positive number generally represents a reward and a negative number a penalty.
The action refers to the operation taken by the agent during the interaction.
The external environment (Environment) receives the series of actions performed by the agent, evaluates how good this series of actions is, and converts the evaluation into a quantifiable reward fed back to the agent, without telling the agent how it should learn to act. The agent can only learn from its own past experience. At the same time, the external environment also provides the agent with the state it is in.
Compared with the traditional reinforcement learning described above, in the intelligent robot and its learning method proposed by the present disclosure, the reward/punishment is produced by the agent itself; in addition to receiving the state fed back by the external environment and agents and feeding actions back to the external environment, the agent can also receive interaction messages and output interaction messages.
The embodiments of the present disclosure are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary and are intended to explain the present disclosure; they should not be construed as limiting it.
The intelligent robot and its learning method according to the embodiments of the present disclosure are described below with reference to FIGS. 1-7.
In an embodiment of the present disclosure, as shown in FIG. 1, the intelligent robot includes an agent 100. The agent 100 includes a first processing module, a second processing module, and an update module. Referring to FIG. 1, the first processing module includes a brain model (Brain), which is used to obtain a first interaction message message1 and the state fed back by the external environment Environment and agents 100 (an agent here may be the agent itself or another agent), and to output an action and/or a second interaction message message2 according to the state and/or message1, so as to obtain a new first interaction message message1 and/or make the external environment Environment and the agents 100 adjust the state according to the action. The second processing module includes a heart model (Heart), which is used to obtain at least one of the state and the first interaction message message1 and to output a reward according to at least one of them. The update module is used to update the brain model (Brain) according to the reward, so as to realize learning of the agent 100. The heart model can also be used for evolutionary learning, and the evolved heart model (Heart) is used to update the brain model (Brain), so as to obtain a heart model (Heart) and a brain model (Brain) suited to the external environment.
The first interaction message and the second interaction message may be messages in text form or in other forms, for example changes in the amplitude, frequency, or duration of an optical signal, changes in the amplitude, frequency, or duration of an electrical signal, or other signal forms capable of carrying information. Optionally, a message may also be a combination of messages in multiple forms, or a multi-dimensional message. By restricting the form of the message, many different kinds of message expression systems can be formed: for example, if the message may only use the pitch of a sound, music that expresses information becomes possible; if the message is a combination of body movements, a dance that expresses information becomes possible.
It should be noted that the agent 100 can obtain the first interaction message through the message-collection devices of the intelligent robot, for example obtaining motion images through a camera and obtaining sound messages through a microphone; the current agent may also have a communication module that establishes a communication connection between the current agent and other agents, through which the first interaction messages transmitted by other agents can be obtained.
In this embodiment, the intelligent robot may include multiple agents 100 (two are shown in FIG. 1). The external environment Environment and the agents 100 can feed back states to each agent 100, each agent 100 can output actions to the external environment, and messages message1 and message2 can be exchanged between agents 100. The reward output by the heart model (Heart) can be affected by the external environment: for example, if the external environment is harsh, the agent 100 perceives it, the heart model receives the state fed back by the environment and can output a penalty reward, prompting the agent 100 to move away from the harsh environment. It can also be affected by the agent 100 itself: for example, if the agent feels hungry, the hunger information passes through the heart model and produces a penalty reward, prompting the agent 100 to look for food. It can also be affected by other agents: for example, if the current agent lags behind other agents in material wealth, the comparison information can make the heart model output a penalty reward, prompting the agent to acquire more material wealth. The brain model (Brain) can include a convolutional neural network (cnn) that accepts the state, a feedforward embedding network (embedding) that accepts the message, a recurrent neural network (rnn) with an attention mechanism, a feedforward network (fcn+softmax) that outputs the action, and a feedforward network (fcn+softmax) that outputs the message.
Specifically, referring to FIG. 1, compared with the agent in traditional reinforcement learning, the input of the agent's brain model (Brain) in the embodiments of the present disclosure includes, in addition to the state of the external environment, the agent's own state and the messages of other agents; the input of the heart model (Heart) can be the same as the input of the brain model (Brain). The reward that drives the learning of the brain model (Brain) comes from the output of the heart model (Heart).
As shown in FIG. 3(a), the brain model includes an input layer, a core layer, and an output layer connected in sequence. The input layer includes a convolutional neural network layer (cnn) and an embedding layer (embedding); the convolutional layer receives the state, and the embedding layer receives the first interaction message (message). The core layer includes a recurrent neural network (rnn) based on an attention mechanism. The output layer includes a fully connected layer (fcn) and a softmax layer, and it outputs the action and/or the second interaction message (message).
The heart model (Heart) can be used to drive the agent 100 to fulfill: 1) "physiological" needs, such as breathing, drinking water, ingesting food, seeking a suitable temperature, sleep, and sex; 2) "emotional" needs, such as a sense of superiority, corresponding to emotions such as jealousy, showing off, pride, and frustration, which can be made concrete as, e.g., the current agent's wealth ranking; a sense of belonging, corresponding to emotions such as longing, attachment, missing someone, and loneliness, which can be made concrete as, e.g., the number of closely related agents within a limited range; and a sense of gain, corresponding to emotions such as satisfaction, happiness, joy, and sadness, which can be made concrete as the achievement or possession of any small goal. The heart model (Heart) can be obtained through evolution; for example, the heart model can include a decision tree model, as shown in FIG. 3(b), where DT is the abbreviation of Decision Tree.
The update module updates the brain model, specifically by performing reinforcement learning on the brain model, i.e., by using the above heart model (Heart) to drive reinforcement learning of the agent's brain model (Brain). The training algorithm can use reinforcement-learning algorithms such as Q-Learning, A3C (Actor-Critic), PPO (Proximal Policy Optimization), and DDPG (Deep Deterministic Policy Gradient); the specific algorithm can be selected according to task requirements, and some algorithms themselves require additional network models. It should be noted that, when the above algorithms are used for training, the present disclosure does not limit the task objective; rather, it limits the model structure and training method of the agent, specifically the training of the agent's complete model (brain model + heart model) based on message cooperation.
Thus, the intelligent robot of the embodiments of the present disclosure drives the brain model to learn based on the agent's own heart model, and allows the agent's heart model to be updated through the evolution of the population, so that the intelligent robot can solve problems in complex domains.
In the embodiments of the present disclosure, the heart model (Heart) of the agent 100 is derived from evolution (rather than being given and fitted), which solves the problem that weak artificial-intelligence robots can only be applied to specific, not very complex domains. The basic need of an agent is to survive; since the external environment is what causes the agent's death, the agent needs to make its own needs consistent with what the external environment demands of it. To achieve this goal, evolutionary learning can also be performed on the heart model (Heart), so that the agent has feelings and emotions. Primitive agents adapt to the environment through mutation (e.g., viruses), while higher agents adapt through reproduction, which includes heredity and mutation. The present disclosure therefore simulates the reproduction process to obtain the need function. Specifically, a model is built to fit the heart model (Heart); the parameters of the model are analogous to the agent's genes, the input is the state and message, the output is the reward, and a genetic mutation algorithm is used to update the parameters of the model.
As shown in FIG. 5, the evolutionary learning of the heart model through the genetic mutation algorithm proceeds as follows:
Step 1: randomly initialize based on the heart model to obtain n gene sequences (encodings of the model parameters).
As an example, FIG. 6 shows a binary decision tree; the depth of the tree equals the number of distinct feature expressions. Except for the leaf nodes, every node represents an expression that tests whether a feature satisfies a condition; FIG. 6 shows two feature expressions, f1 and f2. If the current node's condition is satisfied, the expression on the right-hand path is evaluated next, otherwise the one on the left; a leaf node represents the final output. The parameters of the decision tree model in FIG. 6 are [f1, f2, -1, -0.5, f2, 0.5, 1]; the parameters are binarized to make it easy to apply a genetic mutation algorithm. Because the gene in the first layer has two choices and the genes in the second layer also have two choices (repeatedly testing the same expression is allowed, but the model depth is always the number of distinct feature expressions), and the genes in the third layer are discretized, assuming there are only four choices corresponding to the different values in the third layer, the above genes become [0, 1, 00, 01, 1, 10, 11] after binarization, giving a gene sequence.
Further, n-1 vectors of the same dimension as the above gene sequence are obtained by random initialization, giving n gene sequences.
Step 2: duplicate the brain model to obtain n brain models;
Step 3: restore each gene sequence to the corresponding heart model (Heart), generate the corresponding rewards through each heart model (Heart) according to the state fed back by the external environment, and perform reinforcement-learning training on each brain model (Brain) based on those rewards;
Step 4: eliminate the m gene sequences with the lowest scores, where m < n;
Step 5: based on the remaining n-m gene sequences, perform genetic mutation operations to obtain m new gene sequences;
Step 6: using the remaining n-m gene sequences and the m new gene sequences, return to Step 3, until a heart model whose score is greater than a preset value is obtained, so as to update the brain model based on that heart model.
As an example, evolutionary learning can also be performed using algorithms such as ant colony optimization, tabu search, and simulated annealing.
In this way, through evolutionary learning of the heart model, the agent's own needs can be made consistent with what the external environment demands of the agent, making it easier for the intelligent robot to adapt to the external environment.
In an embodiment of the present disclosure, the update module can also be used to: acquire time-series data, where the time-series data includes one or more of the time series of messages themselves, of states themselves, of actions themselves, message-to-message time series, state-to-message time series, state-to-action time series, message-to-action time series, and time series among messages, states, and actions; and train the brain model according to the time-series data. Based on training with these various time-series regressions, the types of information the agent can receive and output can be enriched, the training efficiency of the brain model can be improved, and subsequent updates of the brain model are made easier.
In an embodiment of the present disclosure, the update module can also be used to: obtain the sample state fed back by the external environment; input the sample state to an observer, so that the observer outputs a sample message according to the sample state; output the sample message to the initial brain model, and output a sample action through the initial brain model; obtain the sample return fed back by the external environment according to the sample action; and update the initial brain model according to the sample return to obtain the brain model.
Specifically, the reinforcement-learning training can be assisted by an observer, as shown in FIG. 4(a). Compared with training in ordinary reinforcement learning, an observer is added in this embodiment; the observer converts the state into a message, and the agent obtains the state through that message. Referring to FIG. 4(a), the agent can also output a message to the observer based on the message it received, and the observer can take the agent's message into account when producing its own message. In this way, state-to-message and message-to-action training is improved, the agent learns to understand and use messages, and the agent can better meet human-oriented requirements.
In an embodiment of the present disclosure, the process of updating the brain model may also include: obtaining the sample state fed back by the external environment; inputting the sample state to the initial brain model, and outputting a sample message through the initial brain model; outputting the sample message to an executor, so that the executor outputs a sample action according to the sample message; obtaining the sample return fed back by the external environment according to the sample action; and updating the initial brain model according to the sample return to obtain the brain model.
Specifically, the reinforcement-learning training can be assisted by an executor, as shown in FIG. 4(b). Compared with training in ordinary reinforcement learning, an executor is added in this embodiment: the executor outputs an action from the message that the agent produced from the state, and the agent is trained with the reward that the external environment feeds back for that action. Referring to FIG. 4(b), the executor can also output a message back to the agent based on the message it received, and the agent can take the executor's message into account when producing its own message. In this way, state-to-message and message-to-action training is improved, the agent learns to understand and use messages, and the agent can better meet human-oriented requirements.
It should be noted that the above observer and executor may be pre-built models or human operators, which is not limited here.
As an example, as shown in FIG. 4(c), reinforcement learning with multi-agent message cooperation can also be performed. Unlike multi-agent reinforcement learning based on game theory, this example emphasizes completing the training task through message cooperation.
To facilitate understanding of the intelligent robot of the present disclosure, a specific example is described below:
In this example, the learning goals of the intelligent robot include: letting the agent learn to play whack-a-mole, letting the agent understand the language used when playing whack-a-mole, letting the agent use language to complete the game, and letting the agent come to like whack-a-mole.
The learning process is as follows:
Step 1: build the external environment.
As shown in FIG. 7, a nine-square grid is designed, with a hammer placed in the middle; a mole pops up at random from one of the surrounding cells. The agent can pick up the hammer and hit the mole; if the mole is hit, it randomly pops up from another surrounding cell.
Because the agent holding the hammer has a limited view (it can only observe one of the directions up, down, left, and right), the external environment allows other agents to cooperate with the current agent, for example one agent responsible for observation and one responsible for execution.
Each time an agent hits a mole it is rewarded one point, and the cooperating agent also gets one point.
Step 2: build the models.
Build the brain model (Brain) shown in FIG. 3(a) and the heart model (Heart) shown in FIG. 3(b). The state in FIG. 3(b) is the state observable by the agent, including the scores of other agents and the observable environment; the reward is the feeling the agent can experience.
Step 3: restrict the inputs and outputs.
The state consists of the state of the nine-square grid and whether the agent's score ranks near the top. The state input can be represented by a 9-bit binary vector: the first 8 bits indicate whether there is a mole in the corresponding cell, and the last bit indicates whether the score ranks near the top. For example, [100000001] means that there is a mole in the first cell of the grid and the current agent's score ranks near the top.
The message is limited to human natural language, such as "hit the front", "hit the first one", "on your left", and so on. Common words and the terminator can be one-hot encoded, and input and output as time series.
The action is limited to hitting one of the 8 cells or taking no action, 9 cases in total; the action can also be represented by a 9-bit binary vector.
Step 4: train the models.
(1) Time-series regression training
Some time series can be pre-edited and the agent made to learn these sequences. One can manually play a few whack-a-mole games, record the corresponding state, message, and action at each time step, and then use these data to perform time-series regression training on the agent.
After this learning, the agent can learn the following:
Some instructions: e.g., if the agent is told "hit the first one", it will hit the first cell;
Some descriptions: e.g., when the agent sees a mole in the first cell, it outputs the message "hit the first one";
Some exchanges: e.g., when asked which one to hit, the agent replies "hit the first one".
(2) Reinforcement-learning training
If the agent can hit the mole present in the state, it is rewarded one point; the state is fed to the agent and the agent learns to take the action that obtains the maximum reward, using the reinforcement-learning mode.
After this learning, the agent can learn the following:
Correctly hitting the mole whenever it sees one.
(3) Message-cooperative reinforcement-learning training
Specifically, the message-cooperative reinforcement-learning training shown in FIGS. 4(a)-4(c) can be used.
After this learning, the agent can learn the following:
Understanding the messages sent by other agents;
Using messages to help other agents hit moles.
(4) Evolutionary learning
The heart model adopts a decision tree model. The input is the 9-bit binary state vector, such as [100000001], so a 9-layer binary tree can be designed: except for the leaf nodes, each tree node stores which dimension it tests (for example, a node testing the ninth dimension has the value 9). Leaf nodes are discretized values between 0 and 1, representing the reward. Going left in the binary tree is the decision path when that dimension's value is 0, and going right is the path for 1. A random binary tree is then generated and saved by pre-order traversal, forming a multi-dimensional vector such as [5, 8, 9, 3, 4, ..., 0.75]; each dimension's value is then binary-coded to obtain [011, 100, ...], which is the agent's gene sequence.
Evolutionary learning is carried out following the steps shown in FIG. 5. After learning, the agent obtains the best-evolved heart model (Heart), which has the following functions:
When the agent hits a mole correctly, it receives a reward score;
When the agent's score ranks near the bottom, the reward may be higher (equivalent to increasing the learning rate), and conversely lower.
Further, combining the above brain model (Brain) and heart model (Heart) yields an agent that likes playing whack-a-mole; even when the external environment or the environment rules change slightly, the agent still likes hitting moles and can gradually adapt to the new environment and rules.
Thus, even without relevant data and knowledge, the agent can still learn to play whack-a-mole; it just takes longer. Each brain model (Brain) and heart model (Heart) starts as a randomly parameterized model; through continued reinforcement learning and evolutionary learning, an agent that adapts to the environment and its rules and likes whack-a-mole is finally produced.
In summary, the intelligent robot of the embodiments of the present disclosure can adapt to a domain for which it has no relevant knowledge, and obtain an agent that can adapt to that domain. The agent can learn to understand and use language and has inherent emotions. In the process of training such models, humans can come to understand how music, language, and emotions arise and why social relationships and ways of thinking form, which is of immeasurable value for progress in industrial science and in the humanities and society.
FIG. 2 is a flowchart of a learning method for an intelligent robot according to an embodiment of the present disclosure.
Based on the above intelligent robot, as shown in FIG. 2, the learning method for the intelligent robot includes the following steps:
S1: obtain, through the brain model, the first interaction message and the state fed back by the external environment and agents, and output an action and/or a second interaction message according to the state and/or the first interaction message, so as to obtain a new first interaction message and/or make the external environment and agents adjust the state according to the action.
S2: obtain, through the heart model, at least one of the state and the first interaction message, and output a reward according to at least one of them.
S3: update, through the update module, the brain model according to the reward, so as to realize learning of the agent.
S4: perform evolutionary learning on the heart model, and update the brain model with the evolved heart model, so as to obtain a heart model and a brain model suited to the external environment.
In an embodiment of the present disclosure, evolutionary learning is performed on the heart model through the following steps:
Step 1: randomly initialize based on the heart model to obtain n gene sequences (model encodings);
Step 2: duplicate the brain model to obtain n brain models;
Step 3: restore each gene sequence to the corresponding heart model, generate the corresponding rewards through each heart model according to the state fed back by the external environment, and perform reinforcement-learning training on each brain model based on those rewards;
Step 4: eliminate the m gene sequences with the lowest scores, where m < n;
Step 5: based on the remaining n-m gene sequences, perform genetic mutation operations to obtain m new gene sequences;
Step 6: using the remaining n-m gene sequences and the m new gene sequences, return to Step 3, until a heart model whose score is greater than a preset value is obtained, so as to update the brain model based on that heart model.
The learning method for an intelligent robot of the embodiments of the present disclosure enables the intelligent robot to adapt to a domain for which it has no relevant knowledge and to obtain an agent that can adapt to that domain. With this learning method, the agent can learn to understand and use language and has inherent emotions; in training such models, humans can come to understand how music, language, and emotions arise and why social relationships and ways of thinking form, which is of immeasurable value for progress in industrial science and in the humanities and society.
It should be noted that the logic and/or steps shown in the flowcharts or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it if necessary, and then stored in a computer memory.
It should be understood that parts of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one of the following techniques known in the art, or a combination of them, can be used: a discrete logic circuit with logic gates for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
In the description of this specification, references to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples", and the like mean that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic references to the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature qualified by "first" or "second" may expressly or implicitly include at least one such feature. In the description of the present disclosure, "multiple" means at least two, for example two or three, unless otherwise expressly and specifically defined.
In the present disclosure, unless otherwise expressly specified and defined, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediary. Moreover, the first feature being "on", "above", or "over" the second feature may mean that the first feature is directly above or obliquely above the second feature, or may simply mean that the first feature is at a higher level than the second feature. The first feature being "under", "below", or "beneath" the second feature may mean that the first feature is directly below or obliquely below the second feature, or may simply mean that the first feature is at a lower level than the second feature.
Although the embodiments of the present disclosure have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present disclosure, and that those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present disclosure.

Claims (10)

  1. An intelligent robot, characterized in that the intelligent robot comprises an agent, and the agent comprises:
    a first processing module, the first processing module comprising a brain model, the brain model being used to obtain a first interaction message and the state fed back by the external environment and agents, and, according to the state and/or the first interaction message, output an action and/or a second interaction message, so as to obtain a new first interaction message, and/or make the external environment and agents adjust the state according to the action;
    a second processing module, the second processing module comprising a heart model, the heart model being used to obtain at least one of the state and the first interaction message, and output a reward according to at least one of the state and the first interaction message;
    an update module, used to update the brain model according to the reward so as to realize learning of the agent;
    wherein the heart model is also used for evolutionary learning, and the evolved heart model is used to update the brain model, so as to obtain a heart model and a brain model suited to the external environment.
  2. The intelligent robot according to claim 1, characterized in that the heart model is also used to perform evolutionary learning through the following steps:
    Step 1: randomly initialize based on the heart model to obtain n gene sequences (encodings of the model parameters);
    Step 2: duplicate the brain model to obtain n brain models;
    Step 3: restore each gene sequence to the corresponding heart model, generate the corresponding rewards through each heart model according to the state fed back by the external environment, and, for each brain model, perform reinforcement-learning training on the corresponding brain model based on the rewards;
    Step 4: eliminate the m gene sequences with the lowest scores, where m < n;
    Step 5: based on the remaining n-m gene sequences, perform genetic mutation operations to obtain m new gene sequences;
    Step 6: using the remaining n-m gene sequences and the m new gene sequences, return to Step 3, until a heart model whose score is greater than a preset value is obtained, so as to update the brain model based on the heart model.
  3. The intelligent robot according to claim 1, characterized in that the brain model comprises an input layer, a core layer, and an output layer connected in sequence, wherein
    the input layer comprises a convolutional neural network layer and an embedding layer, the convolutional neural network layer being used to receive the state and the embedding layer being used to receive the first interaction message;
    the core layer comprises a recurrent neural network based on an attention mechanism;
    the output layer comprises a fully connected layer and a softmax layer, and the output layer is used to output the action and/or the second interaction message.
  4. The intelligent robot according to claim 1, characterized in that the heart model comprises a decision tree model.
  5. The intelligent robot according to claim 1, characterized in that the intelligent robot comprises multiple agents, and the first interaction message comes from other agents or from a user.
  6. The intelligent robot according to claim 1, characterized in that the update module is further used to:
    acquire time-series data, wherein the time-series data includes one or more of the time series of messages themselves, the time series of states themselves, the time series of actions themselves, message-to-message time series, state-to-message time series, state-to-action time series, message-to-action time series, and time series among messages, states, and actions;
    train the brain model according to the time-series data.
  7. The intelligent robot according to claim 1, characterized in that the update module is further used to:
    obtain the sample state fed back by the external environment;
    input the sample state to an observer, so that the observer outputs a sample message according to the sample state;
    output the sample message to an initial brain model, and output a sample action through the initial brain model;
    obtain the sample return fed back by the external environment according to the sample action;
    update the initial brain model according to the sample return to obtain the brain model.
  8. The intelligent robot according to claim 1, characterized in that the update module is further used to:
    obtain the sample state fed back by the external environment;
    input the sample state to an initial brain model, and output a sample message through the initial brain model;
    output the sample message to an executor, so that the executor outputs a sample action according to the sample message;
    obtain the sample return fed back by the external environment according to the sample action;
    update the initial brain model according to the sample return to obtain the brain model.
  9. A learning method for the intelligent robot according to any one of claims 1-8, characterized by comprising the following steps:
    obtaining, through the brain model, a first interaction message and the state fed back by the external environment and agents, and, according to the state and/or the first interaction message, outputting an action and/or a second interaction message, so as to obtain a new first interaction message, and/or make the external environment adjust the state according to the action;
    obtaining, through the heart model, at least one of the state and the first interaction message, and outputting a reward according to at least one of the state and the first interaction message;
    updating, through the update module, the brain model according to the reward so as to realize learning of the agent;
    performing evolutionary learning on the heart model, and updating the brain model with the evolved heart model, so as to obtain a heart model and a brain model suited to the external environment.
  10. The learning method for an intelligent robot according to claim 1, characterized in that evolutionary learning is performed on the heart model through the following steps:
    Step 1: randomly initialize based on the heart model to obtain n gene sequences (encodings of the model parameters);
    Step 2: duplicate the brain model to obtain n brain models;
    Step 3: restore each gene sequence to the corresponding heart model, generate the corresponding rewards through each heart model according to the state fed back by the external environment, and, for each brain model, perform reinforcement-learning training on the corresponding brain model based on the rewards;
    Step 4: eliminate the m gene sequences with the lowest scores, where m < n;
    Step 5: based on the remaining n-m gene sequences, perform genetic mutation operations to obtain m new gene sequences;
    Step 6: using the remaining n-m gene sequences and the m new gene sequences, return to Step 3, until a heart model whose score is greater than a preset value is obtained, so as to update the brain model based on the heart model.
PCT/CN2021/105935 2020-08-27 2021-07-13 Intelligent robot and learning method thereof WO2022042093A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010875710.4A CN114118434A (zh) 2020-08-27 2020-08-27 Intelligent robot and learning method thereof
CN202010875710.4 2020-08-27

Publications (1)

Publication Number Publication Date
WO2022042093A1 true WO2022042093A1 (zh) 2022-03-03

Family

ID=80354501

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/105935 WO2022042093A1 (zh) 2020-08-27 2021-07-13 Intelligent robot and learning method thereof

Country Status (2)

Country Link
CN (1) CN114118434A (zh)
WO (1) WO2022042093A1 (zh)


Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999010758A1 (en) * 1997-08-22 1999-03-04 Hynomics Corporation Multiple-agent hybrid control architecture for intelligent real-time control of distributed nonlinear processes
WO2013122140A1 (ja) * 2012-02-14 2013-08-22 独立行政法人国立がん研究センター 抗がん剤の作用を増強する医薬組成物、がん治療用キット、診断薬、及びスクリーニング方法
CN108170736A (zh) * 2017-12-15 2018-06-15 南瑞集团有限公司 一种基于循环注意力机制的文档快速扫描定性方法
CN109657802A (zh) * 2019-01-28 2019-04-19 清华大学深圳研究生院 一种混合专家强化学习方法及系统
CN109919319A (zh) * 2018-12-31 2019-06-21 中国科学院软件研究所 基于多个历史最佳q网络的深度强化学习方法及设备
US20190303776A1 (en) * 2018-04-03 2019-10-03 Cogitai, Inc. Method and system for an intelligent artificial agent
CN110389591A (zh) * 2019-08-29 2019-10-29 哈尔滨工程大学 一种基于dbq算法的路径规划方法
CN110502033A (zh) * 2019-09-04 2019-11-26 中国人民解放军国防科技大学 一种基于强化学习的固定翼无人机群集控制方法
CN110666793A (zh) * 2019-09-11 2020-01-10 大连理工大学 基于深度强化学习实现机器人方形零件装配的方法
CN110826725A (zh) * 2019-11-07 2020-02-21 深圳大学 基于认知的智能体强化学习方法、装置、系统、计算机设备及存储介质
CN110826723A (zh) * 2019-10-12 2020-02-21 中国海洋大学 一种结合tamer框架和面部表情反馈的交互强化学习方法
CN111144793A (zh) * 2020-01-03 2020-05-12 南京邮电大学 基于多智能体深度强化学习的商业建筑hvac控制方法
CN111282279A (zh) * 2020-02-05 2020-06-16 腾讯科技(深圳)有限公司 模型训练的方法、基于交互式应用的对象控制方法及装置

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999010758A1 (en) * 1997-08-22 1999-03-04 Hynomics Corporation Multiple-agent hybrid control architecture for intelligent real-time control of distributed nonlinear processes
WO2013122140A1 (ja) * 2012-02-14 2013-08-22 独立行政法人国立がん研究センター 抗がん剤の作用を増強する医薬組成物、がん治療用キット、診断薬、及びスクリーニング方法
CN108170736A (zh) * 2017-12-15 2018-06-15 南瑞集团有限公司 一种基于循环注意力机制的文档快速扫描定性方法
US20190303776A1 (en) * 2018-04-03 2019-10-03 Cogitai, Inc. Method and system for an intelligent artificial agent
CN109919319A (zh) * 2018-12-31 2019-06-21 中国科学院软件研究所 基于多个历史最佳q网络的深度强化学习方法及设备
CN109657802A (zh) * 2019-01-28 2019-04-19 清华大学深圳研究生院 一种混合专家强化学习方法及系统
CN110389591A (zh) * 2019-08-29 2019-10-29 哈尔滨工程大学 一种基于dbq算法的路径规划方法
CN110502033A (zh) * 2019-09-04 2019-11-26 中国人民解放军国防科技大学 一种基于强化学习的固定翼无人机群集控制方法
CN110666793A (zh) * 2019-09-11 2020-01-10 大连理工大学 基于深度强化学习实现机器人方形零件装配的方法
CN110826723A (zh) * 2019-10-12 2020-02-21 中国海洋大学 一种结合tamer框架和面部表情反馈的交互强化学习方法
CN110826725A (zh) * 2019-11-07 2020-02-21 深圳大学 基于认知的智能体强化学习方法、装置、系统、计算机设备及存储介质
CN111144793A (zh) * 2020-01-03 2020-05-12 南京邮电大学 基于多智能体深度强化学习的商业建筑hvac控制方法
CN111282279A (zh) * 2020-02-05 2020-06-16 腾讯科技(深圳)有限公司 模型训练的方法、基于交互式应用的对象控制方法及装置

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A dissertation Submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Control Science and Engineering", 30 April 2018, GRADUATE SCHOOL OF NATIONAL UNIVERSITY OF DEFENSE TECHNOLOGY CHANGSHA,HUNAN., CN, article LI SIDING: "Reinforcement Learning-based Dynamic Path Planning for Mobile Robots", pages: 1 - 88, XP055906046 *
DONG-HYUN LEE ; IN-WON PARK ; JONG-HWAN KIM: "Q-learning using fuzzified states and weighted actions and its application to omni-direnctional mobile robot control", COMPUTATIONAL INTELLIGENCE IN ROBOTICS AND AUTOMATION (CIRA), 2009 IEEE INTERNATIONAL SYMPOSIUM ON, IEEE, PISCATAWAY, NJ, USA, 15 December 2009 (2009-12-15), Piscataway, NJ, USA , pages 102 - 107, XP031643855, ISBN: 978-1-4244-4808-1 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117898683A (zh) * 2024-03-19 2024-04-19 中国人民解放军西部战区总医院 儿童睡眠质量检测方法及装置
CN117898683B (zh) * 2024-03-19 2024-06-07 中国人民解放军西部战区总医院 儿童睡眠质量检测方法及装置

Also Published As

Publication number Publication date
CN114118434A (zh) 2022-03-01

Similar Documents

Publication Publication Date Title
Christiansen et al. Creating language: Integrating evolution, acquisition, and processing
Yu et al. Emotional multiagent reinforcement learning in spatial social dilemmas
Du et al. Vision-language models as success detectors
Palanisamy Hands-On Intelligent Agents with OpenAI Gym: Your guide to developing AI agents using deep reinforcement learning
Norris Beginning artificial intelligence with the Raspberry Pi
Acharya et al. Neurosymbolic reinforcement learning and planning: A survey
WO2022042093A1 (zh) 智能机器人及其学习方法
Karpov et al. Human-assisted neuroevolution through shaping, advice and examples
Jacob Sharing and ascribing goals
Lim et al. Intelligent npcs for educational role play game
Millhouse et al. Embodied, Situated, and Grounded Intelligence: Implications for AI
Moy et al. Evolution strategies for sparse reward gridworld environments
Mahoor et al. Morphology dictates a robot's ability to ground crowd-proposed language
Zhu et al. Design and implementation of NPC AI based on genetic algorithm and BP neural network
Kanervisto Advances in deep learning for playing video games
Zhang et al. Language-Guided World Models: A Model-Based Approach to AI Control
Bignold et al. Rule-based interactive assisted reinforcement learning
Menon et al. An Efficient Application of Neuroevolution for Competitive Multiagent Learning
Simon Ethics and artificial general intelligence: technological prediction as a groundwork for guidelines
Stanton Simultaneous incremental neuroevolution of motor control, navigation and object manipulation in 3D virtual creatures.
Kovačević et al. Artificial intelligence in computer games
Chitizadeh et al. General language evolution in general game playing
KR102358179B1 (ko) 인공지능 원리 학습을 위한 게임 컨텐츠 제공 방법, 장치 및 컴퓨터-판독가능 기록매체
Maresso Emergent behavior in neuroevolved agents
Kim et al. Inference of Other’s Minds with Limited Information in Evolutionary Robotics

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21859933

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21859933

Country of ref document: EP

Kind code of ref document: A1