CN114118434A - Intelligent robot and learning method thereof - Google Patents

Intelligent robot and learning method thereof

Info

Publication number
CN114118434A
Authority
CN
China
Prior art keywords
model
brain
message
state
agent
Prior art date
Legal status
Pending
Application number
CN202010875710.4A
Other languages
Chinese (zh)
Inventor
朱宝
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202010875710.4A priority Critical patent/CN114118434A/en
Priority to PCT/CN2021/105935 priority patent/WO2022042093A1/en
Publication of CN114118434A publication Critical patent/CN114118434A/en
Pending legal-status Critical Current

Classifications

    • Section G (Physics); Class G06 (Computing; Calculating or Counting); Subclass G06N (Computing arrangements based on specific computational models)
    • G06N 20/00 Machine learning
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/12 Computing arrangements based on biological models using genetic models
    • G06N 3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Genetics & Genomics (AREA)
  • Physiology (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses an intelligent robot and a learning method thereof. The intelligent robot comprises an agent, and the agent comprises: a brain model for acquiring a first interactive message and the state fed back by the external environment and by agents, and outputting an action and/or a second interactive message according to the state and/or the first interactive message, so as to acquire a new first interactive message and/or make the external environment adjust the state according to the action; a heart model for outputting a reward according to at least one of the state and the first interactive message; and an update module for updating the brain model according to the reward to enable learning of the agent. The heart model is also used for evolutionary learning, and the evolved heart model is used to update the brain model so as to obtain a heart model and a brain model adapted to the environment. The intelligent robot drives the brain model to learn based on the agent's heart model and updates the agent's heart model through the evolution of the population, so that the intelligent robot can solve problems in complex domains.

Description

Intelligent robot and learning method thereof
Technical Field
The invention relates to the field of artificial intelligence, in particular to an intelligent robot and a learning method thereof.
Background
In recent years, artificial intelligence has developed rapidly, and intelligent assistants such as Apple's Siri and Microsoft's XiaoIce have entered the public eye. AlphaGo has likewise surpassed human players at the game of Go. However, existing artificial intelligence still struggles with tasks involving human natural language, emotion, and complex domains that are hard to define precisely, which is why it is commonly referred to in the industry as weak artificial intelligence.
Disclosure of Invention
The invention aims to address the shortcomings of current artificial intelligence in language, emotion and the handling of complex tasks, so as to obtain an artificial intelligence robot that can understand and use language, possesses emotion, and can cope with problems in certain complex domains.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides an intelligent robot. The intelligent robot includes an agent, and the agent includes: a first processing module comprising a brain model, the brain model being used for acquiring a first interactive message and the state fed back by an external environment and by agents, and outputting an action and/or a second interactive message according to the state and/or the first interactive message, so as to acquire a new first interactive message and/or make the external environment and the agents adjust the state according to the action; a second processing module comprising a heart model, configured to obtain at least one of the state and the first interactive message and output a reward according to the at least one of the state and the first interactive message; and an update module for updating the brain model according to the reward to enable learning of the agent. The heart model is also used for evolutionary learning, and the evolved heart model is used to update the brain model so as to obtain a heart model and a brain model adapted to the external environment.
The intelligent robot provided by the embodiment of the invention drives the brain model to learn based on the agent's heart model and updates the agent's heart model through population evolution, so that the robot can solve problems in complex domains.
In order to achieve the above object, a second aspect of the present invention provides a learning method for the intelligent robot of the above embodiment, including the following steps: acquiring, through the brain model, a first interactive message and the states fed back by the external environment and by agents, and outputting an action and/or a second interactive message according to the states and/or the first interactive message, so as to acquire a new first interactive message and/or make the external environment and the agents adjust the states according to the action; obtaining, through the heart model, at least one of the state and the first interactive message, and outputting a reward according to the at least one of the state and the first interactive message; updating, by the update module, the brain model according to the reward to enable learning of the agent; and performing evolutionary learning on the heart model and updating the brain model with the evolved heart model to obtain a heart model and a brain model adapted to the external environment.
According to the learning method of the intelligent robot, the brain model is driven to learn based on the agent's heart model, and the agent's heart model is updated through population evolution, so that an intelligent robot capable of solving problems in complex domains can be obtained.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic diagram of a configuration of an intelligent robot according to one embodiment of the present invention;
FIG. 2 is a flow chart of a learning method of an intelligent robot of one embodiment of the present invention;
FIG. 3(a) is a schematic diagram of a brain model according to an embodiment of the present invention;
FIG. 3(b) is a schematic illustration of a heart model according to one embodiment of the invention;
FIGS. 4(a)-4(c) are schematic diagrams of assisted reinforcement learning training according to various examples of the present invention;
FIG. 5 is a flow diagram of evolutionary learning of one embodiment of the present invention;
FIG. 6 is a schematic diagram of a decision tree model of one example of the invention;
FIG. 7 is a schematic illustration of an external environment of an example of the present invention.
Detailed Description
In conventional reinforcement learning, the whole reinforcement learning system generally includes five parts of an agent, a state, a reward/penalty, an action and an external Environment.
Specifically, the agent is the core of the whole reinforcement learning system. It learns a mapping from environment states (state) to actions (action) according to the reward provided as feedback by the external environment, and the principle of action selection is to maximize the expected cumulative future reward. The selected action affects not only the reward at the current moment but also the reward at the next moment and beyond, so the basic rule followed by the agent during learning is: if an action brings a positive reward from the external environment, the tendency to take that action is strengthened; otherwise it is gradually weakened.
The state (state) indicates environment information where the agent is located, and includes all information used by the agent to select an action.
Reward/penalty (reward) is a quantifiable scalar feedback signal provided to the agent by the external environment to evaluate the action performed by the agent at a given time step; a positive number generally represents a reward and a negative number a penalty.
Action (action) refers to the action taken by the agent during the interaction.
External Environment (Environment): the environment receives the series of actions executed by the agent, evaluates their quality and converts them into a quantifiable reward fed back to the agent, without telling the agent how it should act. The agent can only learn from its own history. At the same time, the external environment also provides the agent with the state it is in.
Compared with traditional reinforcement learning, in the intelligent robot and learning method provided by the invention the reward/penalty is produced by the agent itself, and besides receiving the state fed back by the external environment and by agents and outputting actions to the external environment, the agent can also receive and output interactive messages.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
An intelligent robot and a learning method thereof according to an embodiment of the present invention will be described with reference to fig. 1 to 7.
In an embodiment of the present invention, as shown in fig. 1, the intelligent robot includes an agent 100. The agent 100 includes a first processing module, a second processing module and an updating module. Referring to fig. 1, the first processing module includes a brain model (Brain) for obtaining a first interaction message (Message1) and a state (state) fed back by the external environment (Environment) and by an agent 100 (which may be the agent itself or another agent), and outputting an action and/or a second interaction message (Message2) according to the state and/or the first interaction message, so as to obtain a new first interaction message and/or make the external environment and the agent 100 adjust the state according to the action. The second processing module includes a heart model (Heart) for acquiring at least one of the state and the first interaction message and outputting a reward according to it. The updating module is used for updating the brain model (Brain) according to the reward to realize the learning of the agent 100. The heart model (Heart) can also be used for evolutionary learning, and the evolved heart model is used to update the brain model (Brain) so as to obtain a heart model and a brain model suited to the external environment.
The first interactive message and the second interactive message may be messages in text form, or messages in other forms, for example variations in the amplitude, frequency or duration of an optical signal, variations in the amplitude, frequency or duration of an electrical signal, or any other signal form capable of transmitting information. Optionally, a message may also be a combination of messages in multiple forms, or a multidimensional message. Thus, by constraining the form of the message, many different message expression systems can be formed; for example, by specifying that the message may only use the pitch of a sound, music for expressing information can be realized, and by defining the message as a combination of limb movements, dance for expressing information can be realized.
It should be noted that the agent 100 may obtain the first interactive message through a message collecting device of the intelligent robot, for example, obtain a motion image through a camera, obtain a voice message through a microphone, and the like; the current agent can also have a communication module, which can realize the communication connection between the current agent and other agents, and further can obtain the first interactive message transmitted by other agents through the communication module.
In this embodiment, the intelligent robot may include a plurality of agents 100 (two are shown in FIG. 1). The external Environment and the agents 100 can feed back a state (state) to each agent 100, each agent 100 can output an action to the external Environment, and the agents 100 can exchange messages Message1 and Message2 with each other. The reward output by the heart model (Heart) can be influenced by the Environment: for example, if the environment is very harsh, the agent 100 senses it, and the heart model, receiving the state fed back by the environment, can output a punishing reward that prompts the agent 100 to move away from the harmful environment. It can also be influenced by the agent 100 itself: for example, if the agent 100 senses hunger, the hunger information produces a punishing reward through the heart model (Heart), prompting the agent 100 to look for food. It can further be influenced by other agents: for example, if the current agent lags behind other agents in material wealth, the comparison with the other agents' wealth produces a punishing reward through the heart model (Heart), prompting the agent to acquire more material wealth. The brain model (Brain) may include a convolutional neural network (cnn) accepting the state (state), a feedforward neural network (embedding) accepting the message (message), a recurrent neural network (rnn) with an attention mechanism, a feedforward neural network (fcn + softmax) outputting an action (action), and a feedforward neural network (fcn + softmax) outputting a message (message).
Specifically, referring to fig. 1, compared with the agent in traditional reinforcement learning, the input of the agent's brain model (Brain) in the embodiment of the present invention adds the state (state) of the agent itself and the messages (message) of other agents to the state (state) of the external environment; the input of the heart model (Heart) may be identical to the input of the brain model (Brain). The reward that drives the learning of the brain model (Brain) comes from the output of the heart model (Heart).
As shown in fig. 3(a), the brain model includes an input layer, a core layer and an output layer connected in sequence. The input layer includes a convolutional neural network layer (cnn) and an embedding layer (embedding); the convolutional neural network layer receives the state (state) and the embedding layer receives the first interaction message (message). The core layer includes a recurrent neural network (rnn) based on an attention mechanism. The output layer includes a fully-connected layer (fcn) and a softmax layer and is used for outputting the action (action) and/or the second interaction message (message).
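As an illustration only, the layered structure just described can be sketched roughly as follows. This is a minimal sketch assuming a PyTorch implementation; the layer sizes, vocabulary size, state shape and class name are assumptions made for illustration and are not specified by the invention.

```python
import torch
import torch.nn as nn

class BrainModel(nn.Module):
    """Sketch of the brain model: cnn + embedding -> attention rnn -> fcn + softmax."""
    def __init__(self, vocab_size=32, state_channels=1, hidden=64, n_actions=9):
        super().__init__()
        # Input layer: a CNN for the state and an embedding layer for the message.
        self.cnn = nn.Sequential(
            nn.Conv2d(state_channels, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, hidden))
        self.embedding = nn.Embedding(vocab_size, hidden)
        # Core layer: a recurrent network with attention over the message tokens.
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # Output layer: fully connected + softmax heads for the action and the message.
        self.action_head = nn.Linear(hidden, n_actions)
        self.message_head = nn.Linear(hidden, vocab_size)

    def forward(self, state, message_tokens):
        s = self.cnn(state)                          # (B, hidden) summary of the state
        m = self.embedding(message_tokens)           # (B, T, hidden) message embeddings
        rnn_out, _ = self.rnn(m)                     # (B, T, hidden)
        query = s.unsqueeze(1)                       # attend from the state to the message
        ctx, _ = self.attn(query, rnn_out, rnn_out)  # (B, 1, hidden)
        h = ctx.squeeze(1) + s
        return self.action_head(h).softmax(-1), self.message_head(h).softmax(-1)
```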
The heart model (Heart) may be used to drive the agent 100 to satisfy: 1) "physiological" needs, such as breathing, drinking water, ingesting food, seeking a suitable temperature, sleep and sex; 2) "emotional" needs, such as a sense of superiority (corresponding to emotions such as jealousy, showing off, pride and frustration), which can be embodied as the current agent's wealth ranking; a sense of belonging (corresponding to emotions such as longing, attachment and loneliness), which can be embodied as the number of closely related agents within a limited range; and a sense of achievement (corresponding to emotions such as satisfaction, happiness and disappointment), which can be embodied as the attainment or possession of any small goal. Here, the heart model (Heart) may be obtained through evolution; for example, the heart model may include a Decision Tree model, as shown in fig. 3(b), where DT in fig. 3(b) is an abbreviation of Decision Tree.
The updating module updates the brain model, specifically by reinforcement learning, that is, the heart model (Heart) provides the reward used for reinforcement learning of the agent's brain model (Brain). The training algorithm may use reinforcement learning algorithms such as Q-Learning, A3C (Asynchronous Advantage Actor-Critic), PPO (Proximal Policy Optimization) or DDPG (Deep Deterministic Policy Gradient); a specific algorithm may be selected according to task requirements, and some of these algorithms require additional network models of their own. When training and learning with the above algorithms, the present invention does not limit the task objective, but it does constrain the model structure and the training mode of the agent, specifically, training of the complete agent model (brain model + heart model) based on message cooperation.
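To make the reward flow concrete, the sketch below shows one update step in which the reward comes from the heart model rather than from the environment. A plain REINFORCE-style policy gradient is used here purely as a stand-in for the algorithms named above (Q-Learning, A3C, PPO, DDPG); the `brain`, `heart` and `env` interfaces are assumptions.

```python
import torch

def train_step(brain, heart, env, optimizer, gamma=0.99):
    # One episode of interaction; the reward is produced by the agent's own heart model.
    state, message = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        action_probs, _ = brain(state, message)
        dist = torch.distributions.Categorical(action_probs.squeeze(0))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, message, done = env.step(action.item())
        rewards.append(float(heart(state, message)))   # heart-model output drives learning
    # Discounted returns, then a policy-gradient update of the brain model.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    loss = -(torch.stack(log_probs) * torch.tensor(returns)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```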
Therefore, the intelligent robot provided by the embodiment of the invention drives the brain model to learn based on the agent's heart model and updates the agent's heart model through population evolution, so that the intelligent robot can solve problems in complex domains.
In an embodiment of the invention, the heart model (Heart) of the agent 100 is obtained through evolution (rather than being specified and fitted by hand), which addresses the problem that a weak artificial intelligence robot can only be applied to specific, not very complex domains. The basic need of the agent is survival, and since the external environment can cause the death of the agent, the agent needs to keep its own needs consistent with what the external environment demands of it. To achieve this goal, evolutionary learning of the heart model (Heart) also makes it possible for the agent to have feelings and emotions. Primitive agents adapt to the environment through variation, viruses for example, while advanced agents adapt to the environment through reproduction, which includes both inheritance and variation. To this end, the present invention simulates the process of reproduction to obtain the demand (reward) function. Specifically, a model is constructed to serve as the heart model (Heart); the parameters of the model are analogous to the genes of the agent, the input is the state and the message, the output is the reward, and the parameters of the model are updated through a genetic variation algorithm.
As shown in fig. 5, the heart model performs an evolutionary learning process through a genetic variation algorithm as follows:
the method comprises the following steps: based on the random initialization of the heart model, n gene sequences (model parameter codes) are obtained.
As an example, FIG. 6 shows a binary decision tree whose depth equals the number of distinct feature expressions. Each node other than a leaf node represents an expression that judges whether a feature satisfies a condition; two feature expressions, f1 and f2, are shown in fig. 6. If the condition at the current node is satisfied, the feature expression on the right path is judged next; otherwise the feature expression on the left path is judged. The leaf nodes represent the final output results. The decision tree model parameters in fig. 6 are [f1, f2, -1, -0.5, f2, 0.5, 1], which are binarized so that a genetic variation algorithm can operate on them. Since there are two choices for the gene in the first layer and two choices for each gene in the second layer (repeated judgment of the same expression is permitted, but the model depth always equals the number of distinct feature expressions), and the genes in the third layer are discretized (assuming only four choices, corresponding to the different values in the third layer), the above parameters can be binarized to [0, 1, 00, 01, 1, 10, 11], yielding the gene sequence.
Further, n-1 vectors with the same dimensions as this gene sequence are obtained by random initialization, giving n gene sequences in total.
Step two: copying the brain models to obtain n brain models;
step three: restoring each gene sequence into a corresponding Heart model (Heart), generating corresponding return according to the state fed back by the external environment through each Heart model (Heart), and performing reinforcement learning training on each Brain model (Brain) based on the return;
step four: eliminating m gene sequences with lower scores, wherein m is less than n;
step five: performing genetic variation operation based on the remaining n-m gene sequences to obtain new m gene sequences;
step six: and returning to the step three by using the remaining n-m gene sequences and the new m gene sequences until a heart model with the score larger than a preset value is obtained, and updating the brain model based on the heart model.
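A compact sketch of the six steps above is given below, assuming that each heart model is encoded as a flat binary gene vector; `decode_heart`, `train_brain` and `evaluate` are hypothetical helpers supplied by the caller and stand in for steps three and four.

```python
import copy
import random

def evolve_heart(base_brain, gene_length, decode_heart, train_brain, evaluate,
                 n=20, m=10, generations=50, target=0.9):
    # Step one: n randomly initialized gene sequences (binary model-parameter codes).
    genes = [[random.randint(0, 1) for _ in range(gene_length)] for _ in range(n)]
    for _ in range(generations):
        brains = [copy.deepcopy(base_brain) for _ in genes]          # step two: copy brains
        hearts = [decode_heart(g) for g in genes]                    # step three: restore hearts
        scores = [evaluate(train_brain(b, h)) for b, h in zip(brains, hearts)]
        ranked = sorted(zip(scores, genes), key=lambda x: x[0], reverse=True)
        if ranked[0][0] > target:                                    # step six: stop criterion
            break
        survivors = [g for _, g in ranked[:n - m]]                   # step four: drop m worst
        children = []
        while len(children) < m:                                     # step five: crossover + mutation
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, gene_length)
            child = [bit ^ (random.random() < 0.01) for bit in a[:cut] + b[cut:]]
            children.append(child)
        genes = survivors + children                                 # back to step three
    return decode_heart(ranked[0][1])                                # best heart model found
```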
As an example, evolutionary learning may also be performed using ant colony, tabu search or simulated annealing algorithms.
Therefore, through evolutionary learning of the heart model, the agent's own needs can be kept consistent with what the external environment demands of the agent, making it easier for the intelligent robot to adapt to the external environment.
In one embodiment of the invention, the update module is further operable to: acquire time-series data, where the time-series data includes one or more of a time series of messages, a time series of states, a time series of actions, a time series from states to actions, a time series from messages to actions, and a time series among messages, states and actions; and train the brain model from the time-series data. Therefore, based on training over multiple time-series regressions, the types of information that the agent can receive and output are enriched, the training efficiency of the brain model can be improved, and subsequent updating of the brain model is facilitated.
In one embodiment of the invention, the update module is further operable to: obtain a sample state fed back by the external environment; input the sample state to an observer, so that the observer outputs a sample message according to the sample state; output the sample message to an initial brain model, and output a sample action through the initial brain model; obtain the sample reward fed back by the external environment according to the sample action; and update the initial brain model according to the sample reward to obtain the brain model.
Specifically, the reinforcement learning training can be assisted by an observer, as shown in fig. 4(a). Compared with training in ordinary reinforcement learning, this embodiment adds an observer: the observer converts the state into a message, and the agent acquires the state through that message. Referring to fig. 4(a), the agent may also output a message back to the observer, and the observer may take the agent's message into account when producing its own message. In this way the training from state to message and from message to action is completed, the agent can understand and use messages, and the agent can meet the requirements of human operation.
In an embodiment of the present invention, the update process of the brain model may further include: obtaining a sample state fed back by the external environment; inputting the sample state into an initial brain model, and outputting a sample message through the initial brain model; outputting the sample message to an executor, so that the executor outputs a sample action according to the sample message; obtaining the sample reward fed back by the external environment according to the sample action; and updating the initial brain model according to the sample reward to obtain the brain model.
Specifically, the reinforcement learning training can be assisted by an executor, as shown in fig. 4(b). Compared with training in ordinary reinforcement learning, this embodiment adds an executor: the executor converts the message that the agent outputs according to the state into an action, and the agent is trained on the reward fed back by the external environment for that action. Referring to fig. 4(b), the executor may also output a message to the agent, and the agent may take the executor's message into account when outputting its own message. In this way the training from state to message and from message to action is completed, the agent can understand and use messages, and the agent can meet the requirements of human operation.
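The two assisted set-ups can be summarised in code as below. This is a hedged sketch: `env`, `observer`, `executor` and `agent_brain` are assumed interfaces, and either the observer or the executor could equally be a human, as noted next.

```python
def observer_assisted_step(env, observer, agent_brain):
    state = env.get_state()                  # state fed back by the external environment
    message = observer.describe(state)       # observer converts the state into a message
    action = agent_brain.act(message)        # agent chooses an action from the message alone
    reward = env.step(action)                # reward fed back for that action
    return message, action, reward

def executor_assisted_step(env, executor, agent_brain):
    state = env.get_state()
    message = agent_brain.describe(state)    # agent converts the state into a message
    action = executor.act(message)           # executor acts according to the agent's message
    reward = env.step(action)
    return message, action, reward
```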
It should be noted that the above-mentioned observer and executor may each be a pre-built model or a human, which is not limited here.
As an example, as shown in FIG. 4(c), in contrast to multi-agent reinforcement learning based on game theory, reinforcement learning with multi-agent message collaboration can also be performed, which emphasizes completing the training task through message collaboration.
To facilitate understanding of the intelligent robot of the present invention, the following description is made by way of a specific example:
In this example, the learning objectives of the intelligent robot include: teaching the agent to play whack-a-mole, enabling the agent to understand the language used while playing whack-a-mole, enabling the agent to use language to complete the whack-a-mole task, and making the agent enjoy playing whack-a-mole.
The learning process is as follows:
the first step is as follows: building an external environment
As shown in fig. 7, a nine-square (3×3) grid is designed, a hammer is placed in the middle grid, and a mole randomly pops out of one of the surrounding grids. The agent can pick up the hammer to hit the mole; if the mole is hit by the agent, it randomly selects another surrounding grid to pop out of.
Because the agent holding the hammer has a limited view (it can observe only one of the directions up, down, left and right), the external environment allows other agents to cooperate with the current agent in whacking the mole, for example one agent responsible for observation and one responsible for execution.
Each time the agent whacks a mole it is rewarded one point, and the cooperating agent also obtains one point.
The second step: Model building
A brain model (Brain) as shown in fig. 3(a) and a heart model (Heart) as shown in fig. 3(b) are established, where state in fig. 3(b) is the state observable by the agent, including the scores of other agents and the observable environment, and reward is the return the agent can perceive.
The third step: input-output limiting
State consists of the state of the nine-square grid and whether the agent's score is leading. The state input can be represented by a 9-bit binary vector: the first 8 bits indicate whether a mole is present at the corresponding grid, and the last bit indicates whether the score is leading. For example, [100000001] means that a mole is present in the first grid of the nine-square grid and that the current agent's score is leading.
Message is defined as human natural language, such as "get ahead", "hit the first", "to your left", and so on. The common words and an end symbol can be one-hot encoded and input and output as a time series.
Action is limited to whacking the mole in one of the 8 surrounding grids or taking no action, 9 cases in total; action can likewise be represented by a 9-bit binary vector.
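For illustration, the three encodings above might look like this in code; the word list is an assumption, and only the 9-bit layouts follow the text.

```python
# State: first 8 bits = mole present in the corresponding surrounding grid,
# last bit = whether this agent's score is leading.
state = [1, 0, 0, 0, 0, 0, 0, 0, 1]      # mole in the first grid, score leading

# Message: common words plus an end symbol, one-hot encoded and sent token by token.
vocab = ["hit", "the", "first", "to", "your", "left", "<end>"]   # assumed vocabulary
def one_hot(word):
    v = [0] * len(vocab)
    v[vocab.index(word)] = 1
    return v
message = [one_hot(w) for w in ["hit", "the", "first", "<end>"]]

# Action: whack one of the 8 surrounding grids, or do nothing (the 9th case).
action = [0, 0, 0, 0, 0, 0, 0, 0, 1]      # here: take no action
```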
The fourth step: model training
(1) Time series regression training
Some time sequences may be prepared in advance for the agent to learn. Several rounds of whack-a-mole can be played manually, with the corresponding state, message and action recorded at each time step, and the agent is then trained on these data by time-series regression.
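A minimal sketch of such data collection is shown below; the tuple layout and the `env` / `human_player` interfaces are assumptions, and the recorded trajectories would then be fit by ordinary sequence regression (for example, teacher forcing the brain model to reproduce the recorded messages and actions).

```python
def record_episode(env, human_player):
    # Record (state, message, action, out_message) at every time step of one manual round.
    trajectory = []
    state, message = env.reset()
    done = False
    while not done:
        action, out_message = human_player(state, message)
        trajectory.append({"state": state, "message": message,
                           "action": action, "out_message": out_message})
        state, message, done = env.step(action, out_message)
    return trajectory
```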
After learning, the agent can learn the following:
some instructions: if the agent is told "hit the first", it can hit the first grid;
some descriptions: if the agent sees a mole in the first grid, it will output the message "hit the first";
some exchanges: if the agent is asked which one to hit, it will reply "hit the first".
(2) Reinforcement learning training
If the agent whacks the mole in the current state, it is awarded one point. A state is input to the agent, and the agent learns the action that obtains the maximum reward; the learning uses reinforcement learning.
After learning, the agent can learn the following:
the squirrel can be seen to hit the squirrel correctly.
(3) message collaborative reinforcement learning training
Specifically, the message collaborative reinforcement learning training shown in fig. 4(a) -4 (c) may be adopted.
After learning, the agent can learn the following:
understanding messages sent by other agents;
and using messages to help other agents whack the mole.
(4) Evolutionary learning
The heart model adopts a decision tree. The input is the 9-bit binary state vector, e.g. [100000001], so a 9-layer binary tree can be designed in which each non-leaf node stores which dimension of the input is tested; for example, a node value of 9 means that the ninth dimension is tested. Leaf nodes hold discretized values between 0 and 1 representing reward values. The left branch of the binary tree is the decision path followed when the tested dimension equals 0, and the right branch when it equals 1. A binary tree is then randomly generated and stored by preorder traversal, forming a multi-dimensional vector such as [5, 8, 9, 3, 4, ……, 0.75]; the value of each dimension is further binary coded to obtain [011, 100, ……], i.e. the gene sequence of the agent.
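A hedged sketch of this decision-tree heart model and its gene encoding follows; the tree depth used in the example call, the leaf quantisation and the helper names are assumptions.

```python
import random

def random_tree(depth):
    # Internal nodes store which bit (1-9) of the state vector is tested;
    # leaves store a discretized reward value between 0 and 1.
    if depth == 0:
        return round(random.random(), 2)
    return {"dim": random.randint(1, 9),
            "left": random_tree(depth - 1),    # followed when the tested bit is 0
            "right": random_tree(depth - 1)}   # followed when the tested bit is 1

def reward(tree, state):
    if not isinstance(tree, dict):
        return tree                            # reached a leaf: its value is the reward
    branch = "right" if state[tree["dim"] - 1] == 1 else "left"
    return reward(tree[branch], state)

def preorder(tree, out):
    # Preorder traversal flattens the tree into the vector that is later binary coded.
    if not isinstance(tree, dict):
        out.append(tree)
        return out
    out.append(tree["dim"])
    preorder(tree["left"], out)
    preorder(tree["right"], out)
    return out

state = [1, 0, 0, 0, 0, 0, 0, 0, 1]
tree = random_tree(3)                            # a shallow tree, for illustration only
print(reward(tree, state), preorder(tree, []))   # e.g. 0.75 and [5, 8, ..., 0.62, ...]
```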
Evolutionary learning is carried out using the steps shown in fig. 5. After learning is completed, the agent obtains an evolved, optimal heart model (Heart) that behaves as follows:
when the agent correctly whacks the mole, it obtains a reward score;
when the agent's score ranks behind, the reward is larger (equivalent to increasing the learning rate); otherwise the reward is smaller.
Further, by combining the brain model (Brain) and the heart model (Heart), an agent that likes whacking moles can be obtained; when the external environment or the environmental rules change slightly, the agent can adapt to the new environment and rules while still liking to whack moles.
Thus, even without relevant data and knowledge, the agent can learn to whack the mole, although this takes longer. Each brain model (Brain) and heart model (Heart) is initially a randomly parameterized model. Through the continued application of reinforcement learning and evolutionary learning, an agent that adapts to the environment and the environmental rules and likes to whack moles is finally produced.
In summary, the intelligent robot according to the embodiments of the present invention can adapt to a domain without prior knowledge of it and obtain an agent adapted to that domain. The agent can learn to understand and use language and has internal emotion, and in the process of training the model humans can gain insight into how music, language and emotion arise and into the formation of social relationships and ways of thinking, which is of immeasurable value for progress in industrial science and in human society.
Fig. 2 is a flowchart of a learning method of an intelligent robot according to an embodiment of the present invention.
Based on the intelligent robot, as shown in fig. 2, the learning method of the intelligent robot includes the following steps:
S1, acquiring, through the brain model, the first interactive message and the state fed back by the external environment and by the agent, and outputting the action and/or the second interactive message according to the state and/or the first interactive message, so as to acquire a new first interactive message and/or make the external environment and the agent adjust the state according to the action.
S2, obtaining at least one of the state and the first interactive message through the heart model, and outputting the return according to at least one of the state and the first interactive message.
S3, updating, by the update module, the brain model according to the reward to realize the learning of the agent.
S4, performing evolutionary learning on the heart model, and updating the brain model with the evolved heart model to obtain a heart model and a brain model adapted to the external environment.
In one embodiment of the invention, the heart model is evolutionarily learned by:
step one: randomly initializing based on the heart model to obtain n gene sequences (model codes);
step two: copying the brain models to obtain n brain models;
step three: restoring each gene sequence into a corresponding heart model, generating corresponding returns according to the state fed back by the external environment through each heart model, and performing reinforcement learning training on the corresponding brain model based on the returns for each brain model;
step four: eliminating m gene sequences with lower scores, wherein m is less than n;
step five: performing genetic variation operation based on the remaining n-m gene sequences to obtain new m gene sequences;
step six: and returning to the step three by using the remaining n-m gene sequences and the new m gene sequences until a heart model with the score larger than a preset value is obtained, and updating the brain model based on the heart model.
The learning method of the intelligent robot according to the embodiment of the invention enables the intelligent robot to adapt to a domain without relevant knowledge and to obtain an agent capable of adapting to that domain. The learning method enables an agent to learn to understand and use language and to have internal emotion, and in the process of training the model humans can gain insight into how music, language and emotion arise and into the formation of social relations and ways of thinking; the learning method is of immeasurable value for progress in industrial science and in human society.
It should be noted that the logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature may be directly under or obliquely under the first feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. An intelligent robot, comprising an intelligent agent, wherein the intelligent agent comprises:
the first processing module comprises a brain model, and the brain model is used for acquiring a first interactive message and the state fed back by an external environment and an agent, and outputting an action and/or a second interactive message according to the state and/or the first interactive message to acquire a new first interactive message and/or enabling the external environment and the agent to adjust the state according to the action;
a second processing module, including a heart model, configured to obtain at least one of the status and the first interaction message, and output a reward according to the at least one of the status and the first interaction message;
an update module to update the brain model according to the reward to enable learning of the agent;
the heart model is also used for evolutionary learning, and the brain model is updated by using the evolved heart model so as to obtain a heart model and a brain model which are adapted to the external environment.
2. The intelligent robot of claim 1, wherein the heart model is further configured to perform evolutionary learning by:
step one: randomly initializing based on the heart model to obtain n gene sequences (model parameter codes);
step two: copying the brain models to obtain n brain models;
step three: restoring each gene sequence into a corresponding heart model, generating corresponding returns according to the state fed back by the external environment through each heart model, and performing reinforcement learning training on each brain model on the basis of the returns;
step four: eliminating m gene sequences with lower scores, wherein m is less than n;
step five: performing genetic variation operation based on the remaining n-m gene sequences to obtain new m gene sequences;
step six: and returning to the third step by using the residual n-m gene sequences and the new m gene sequences until a heart model with a score larger than a preset value is obtained, so as to update the brain model based on the heart model.
3. The intelligent robot of claim 1, wherein the brain model comprises an input layer, a core layer, and an output layer connected in sequence, wherein,
the input layer comprises a convolutional neural network layer and an embedded layer, the convolutional neural network layer is used for receiving the state, and the embedded layer is used for receiving the first interactive message;
the core layer comprises a recurrent neural network based on an attention mechanism;
the output layer comprises a full connection layer and a softmax layer, and the output layer is used for outputting the action and/or the second interactive message.
4. The intelligent robot of claim 1, wherein the heart model comprises a decision tree model.
5. The intelligent robot of claim 1, wherein the intelligent robot comprises a plurality of agents, and wherein the first interactive message originates from other agents or users.
6. The intelligent robot of claim 1, wherein the update module is further to:
acquiring time-series data, wherein the time-series data comprises one or more of a time series of messages, a time series of states, a time series of actions, a time series from states to actions, a time series from messages to actions, and a time series among messages, states and actions;
and training according to the time sequence data to obtain the brain model.
7. The intelligent robot of claim 1, wherein the update module is further to:
obtaining a sample state of the external environment feedback;
inputting the sample state to an observer to cause the observer to output a sample message according to the sample state;
outputting the sample message to a brain initial model, and outputting a sample action through the brain initial model;
obtaining sample return fed back by the external environment according to the sample action;
and updating the brain initial model according to the sample return to obtain the brain model.
8. The intelligent robot of claim 1, wherein the update module is further to:
obtaining a sample state of the external environment feedback;
inputting the sample state into a brain initial model, and outputting a sample message through the brain initial model;
outputting the sample message to an actuator to cause the actuator to output a sample action according to the sample message;
obtaining sample return fed back by the external environment according to the sample action;
and updating the brain initial model according to the sample return to obtain the brain model.
9. A learning method of an intelligent robot according to any one of claims 1-8, comprising the steps of:
acquiring a first interactive message and states fed back by an external environment and an agent through the brain model, and outputting an action and/or a second interactive message according to the states and/or the first interactive message to acquire a new first interactive message, and/or enabling the external environment to adjust the states according to the action;
obtaining at least one of the state and the first interactive message through the heart model, and outputting a return according to the at least one of the state and the first interactive message;
updating, by the update module, the brain model in accordance with the reward to enable learning of the agent;
and carrying out evolutionary learning on the heart model, and updating the brain model by using the evolved heart model to obtain a heart model and a brain model which are adapted to the external environment.
10. The learning method of an intelligent robot according to claim 9, wherein the heart model is evolutionarily learned by:
step one: randomly initializing based on the heart model to obtain n gene sequences (model parameter codes);
step two: copying the brain models to obtain n brain models;
step three: restoring each gene sequence into a corresponding heart model, generating corresponding returns according to the state fed back by the external environment through each heart model, and performing reinforcement learning training on each brain model on the basis of the returns;
step four: eliminating m gene sequences with lower scores, wherein m is less than n;
step five: performing genetic variation operation based on the remaining n-m gene sequences to obtain new m gene sequences;
step six: and returning to the third step by using the residual n-m gene sequences and the new m gene sequences until a heart model with a score larger than a preset value is obtained, so as to update the brain model based on the heart model.
CN202010875710.4A 2020-08-27 2020-08-27 Intelligent robot and learning method thereof Pending CN114118434A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010875710.4A CN114118434A (en) 2020-08-27 2020-08-27 Intelligent robot and learning method thereof
PCT/CN2021/105935 WO2022042093A1 (en) 2020-08-27 2021-07-13 Intelligent robot and learning method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010875710.4A CN114118434A (en) 2020-08-27 2020-08-27 Intelligent robot and learning method thereof

Publications (1)

Publication Number Publication Date
CN114118434A true CN114118434A (en) 2022-03-01

Family

ID=80354501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010875710.4A Pending CN114118434A (en) 2020-08-27 2020-08-27 Intelligent robot and learning method thereof

Country Status (2)

Country Link
CN (1) CN114118434A (en)
WO (1) WO2022042093A1 (en)


Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5963447A (en) * 1997-08-22 1999-10-05 Hynomics Corporation Multiple-agent hybrid control architecture for intelligent real-time control of distributed nonlinear processes
JP2015107918A (en) * 2012-02-14 2015-06-11 国立研究開発法人国立がん研究センター Pharmaceutical composition that enhances action of anticancer agent, cancer therapeutic kit, diagnostic agent, and screening method
CN108170736B (en) * 2017-12-15 2020-05-05 南瑞集团有限公司 Document rapid scanning qualitative method based on cyclic attention mechanism
US20190303776A1 (en) * 2018-04-03 2019-10-03 Cogitai, Inc. Method and system for an intelligent artificial agent
CN109919319A (en) * 2018-12-31 2019-06-21 中国科学院软件研究所 Deeply learning method and equipment based on multiple history best Q networks
CN109657802B (en) * 2019-01-28 2020-12-29 清华大学深圳研究生院 Hybrid expert reinforcement learning method and system
CN110389591A (en) * 2019-08-29 2019-10-29 哈尔滨工程大学 A kind of paths planning method based on DBQ algorithm
CN110502033B (en) * 2019-09-04 2022-08-09 中国人民解放军国防科技大学 Fixed-wing unmanned aerial vehicle cluster control method based on reinforcement learning
CN110666793B (en) * 2019-09-11 2020-11-03 大连理工大学 Method for realizing robot square part assembly based on deep reinforcement learning
CN110826723A (en) * 2019-10-12 2020-02-21 中国海洋大学 Interactive reinforcement learning method combining TAMER framework and facial expression feedback
CN110826725B (en) * 2019-11-07 2022-10-04 深圳大学 Intelligent agent reinforcement learning method, device and system based on cognition
CN111144793B (en) * 2020-01-03 2022-06-14 南京邮电大学 Commercial building HVAC control method based on multi-agent deep reinforcement learning
CN111282279B (en) * 2020-02-05 2021-05-07 腾讯科技(深圳)有限公司 Model training method, and object control method and device based on interactive application

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117898683A (en) * 2024-03-19 2024-04-19 中国人民解放军西部战区总医院 Child sleep quality detection method and device
CN117898683B (en) * 2024-03-19 2024-06-07 中国人民解放军西部战区总医院 Child sleep quality detection method and device

Also Published As

Publication number Publication date
WO2022042093A1 (en) 2022-03-03

Similar Documents

Publication Publication Date Title
Franken et al. Particle swarm optimization approaches to coevolve strategies for the iterated prisoner's dilemma
Jorge et al. Learning to play guess who? and inventing a grounded language as a consequence
Lakshminarayanan et al. Dynamic action repetition for deep reinforcement learning
Knegt et al. Opponent modelling in the game of Tron using reinforcement learning
Zeng et al. Structured memetic automation for online human-like social behavior learning
Andersen et al. The dreaming variational autoencoder for reinforcement learning environments
Yang et al. Adaptive inner-reward shaping in sparse reward games
CN114118434A (en) Intelligent robot and learning method thereof
Zhou et al. Dialogue shaping: Empowering agents through npc interaction
Hou et al. Advances in memetic automaton: Toward human-like autonomous agents in complex multi-agent learning problems
Chen et al. Interpretable utility-based models applied to the fightingice platform
JP3854061B2 (en) Pseudo-biological device, pseudo-biological behavior formation method in pseudo-biological device, and computer-readable storage medium describing program for causing pseudo-biological device to perform behavior formation
Ishikawa et al. Playing mega man ii with neuroevolution
Mozgovoy et al. Building a believable and effective agent for a 3D boxing simulation game
Mozgovoy et al. Building a believable agent for a 3D boxing simulation game
Lim et al. Intelligent npcs for educational role play game
Yannakakis et al. Game AI panorama
Nimoto et al. Improvement of Agent Learning for a Card Game Based on Multi-channel ART Networks.
Yılmaz et al. Q-learning with naïve bayes approach towards more engaging game agents
Pugh et al. Real-time hebbian learning from autoencoder features for control tasks
Menon et al. An Efficient Application of Neuroevolution for Competitive Multiagent Learning
Bauckhage et al. Exploiting the fascination: Video games in machine learning research and education
Smith et al. Continuous and Reinforcement Learning Methods for First-Person Shooter Games
Spyropoulos et al. Simulation and Comparison of Reinforcement Learning Algorithms
Gemici et al. Combining evolutionary algorithms and case-based reasoning for learning high-quality shooting strategies in AI birds

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination