CN115204387A - Learning method and device under layered target condition and electronic equipment - Google Patents

Learning method and device under layered target condition and electronic equipment

Info

Publication number
CN115204387A
Authority
CN
China
Prior art keywords
state
action
time point
model
time points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210863041.8A
Other languages
Chinese (zh)
Other versions
CN115204387B (en)
Inventor
王岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Faoyiwei Suzhou Robot System Co ltd
Original Assignee
Faoyiwei Suzhou Robot System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Faoyiwei Suzhou Robot System Co ltd filed Critical Faoyiwei Suzhou Robot System Co ltd
Priority to CN202210863041.8A
Publication of CN115204387A
Application granted
Publication of CN115204387B
Active legal status (current)
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The application provides a learning method and device under a layered target condition, and an electronic device. The states of any two non-adjacent time points in a training data set and the state of any intermediate time point between those two time points form a state data pair, and the state of the previous time point of the two non-adjacent time points, its corresponding action, and the state of the intermediate time point form an action data pair. A state model is trained with the state data pairs, an action model is trained with the action data pairs, and a learning model is obtained based on the state model and the action model. In this scheme, the training data is divided into state data pairs having intermediate states, and action data pairs including actions are constructed from them, so that the entire process can be divided into a plurality of stages. In this way, the obtained learning model can periodically check itself during action reproduction based on the information of each stage and correct action errors of the controlled object in time, thereby effectively suppressing accumulated error.

Description

Learning method and device under layered target condition and electronic equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a learning method and device under a layered target condition and electronic equipment.
Background
In robot reinforcement learning, when an agent learns certain tasks it faces problems such as a huge task search space, sparse rewards, and the difficulty of designing a reward function. To cope with these problems, imitation learning methods have developed rapidly in recent years. Imitation learning is essentially supervised learning: the agent learns by observing and imitating an expert's behavior policy, which greatly reduces the search space and removes the need to design a reward function, making up for these shortcomings of reinforcement learning. Common imitation learning methods include behavioral cloning (BC) and inverse reinforcement learning (IRL).
Behavioral cloning means that the agent directly clones the expert policy: "state-action pairs" collected from human expert demonstrations are generally used as training data, a discrete distribution is learned from the discrete data, and a cloned policy is finally obtained. Inverse reinforcement learning assumes that the expert policy is optimal; the agent explains the expert's behavior by learning a reward function (i.e. an initial reward function is optimized so that the expert policy scores highest), and then a reinforcement learning algorithm is used to obtain the optimal policy under the optimal reward function, which should finally be consistent with the expert policy.
Inverse reinforcement learning places lower requirements on the training data set than behavioral cloning, but it is computationally expensive because a reinforcement learning algorithm must be run repeatedly. The reproduction accuracy of behavioral cloning, in turn, is easily affected by accumulated (compounding) errors during action execution: if accumulated error or a change of the task environment drives the robot's state into a blind area of the training data, the cloned policy may cause the robot to make unpredictable actions, so that the task fails or even serious consequences occur. Behavioral cloning therefore requires the training data set to cover all possible states as far as possible, which is generally difficult to achieve.
Disclosure of Invention
The object of the present invention includes, for example, providing a learning method, apparatus and electronic device under a hierarchical target condition, which can correct the action error of a controlled object in time so as to effectively suppress accumulated error.
Embodiments of the invention may be implemented as follows:
in a first aspect, the present invention provides a learning method under a hierarchical target condition, the method comprising:
acquiring a training data set, wherein the training data set comprises states of an object at each time point in a period of continuous time and corresponding actions in each state;
forming a state data pair by the states of any two nonadjacent time points and the state of any middle time point between any two nonadjacent time points;
forming an action data pair by the state of the previous time point in any two non-adjacent time points, the corresponding action and the state of the middle time point;
and obtaining a state model by using the state data pair training, obtaining an action model by using the action data pair training, and obtaining a learning model based on the state model and the action model.
In an optional embodiment, the step of obtaining a state model by using the state data pair training includes:
taking states of any two nonadjacent time points in each state data pair as an initial state and a target state respectively, and inputting the initial state and the target state into a state model;
and taking the state of any intermediate time point between any two non-adjacent time points as an intermediate state, taking the intermediate state as a state sample label, and realizing the training of the state model based on the state sample label.
In an optional embodiment, the step of obtaining a motion model by using the motion data pair training includes:
respectively taking the state of the previous time point and the state of the middle time point in any two non-adjacent time points as an initial state and a middle state, and inputting the initial state and the middle state into an action model;
and taking the action corresponding to the previous time point in any two non-adjacent time points as an action sample label, and realizing the training of the action model based on the action sample label.
In an alternative embodiment, the method further comprises:
acquiring the current state and the preset target state of an object to be controlled;
inputting the current state and a preset target state into a state model of the learning model to obtain a preset intermediate state;
inputting the current state and a preset intermediate state into an action model of the learning model, and outputting a target action;
and controlling the object to be controlled by utilizing the target action.
In an optional embodiment, the step of controlling the object to be controlled by using the target action includes:
controlling the object to be controlled to operate by utilizing the target action, so as to update the current state of the object to be controlled;
and obtaining an updated target action based on the updated current state and the preset target state, controlling the object to be controlled by the updated target action, and stopping controlling the object to be controlled until the difference between the current state and the preset target state is less than or equal to a preset threshold value.
In an alternative embodiment, the method further comprises:
setting a plurality of state sliding window widths of the state data pairs;
the step of forming a state data pair from states of any two non-adjacent time points and states of any intermediate time point between any two non-adjacent time points includes:
starting from the previous time point of the two non-adjacent time points, dividing according to each state sliding window width to determine a plurality of intermediate time points;
and acquiring the state of each time point in any two non-adjacent time points and the state of each intermediate time point between any two non-adjacent time points to form a plurality of state data pairs.
In an alternative embodiment, the method further comprises:
setting a plurality of action sliding window widths of the action data pairs;
the step of forming an action data pair by the state of the previous time point, the corresponding action and the state of the intermediate time point in any two non-adjacent time points includes:
determining corresponding intermediate time points according to a previous time point in any two non-adjacent time points and the width of each action sliding window;
and acquiring the state and the corresponding action of the previous time point and the state of each intermediate time point to form a plurality of action data pairs.
In an alternative embodiment, the step of acquiring a training data set includes:
acquiring an original data set, wherein the original data set comprises states of the object at each time point in a period of continuous time and corresponding actions in each state;
controlling a test object based on the original data set, and obtaining the actual state of the test object under each action;
and adding the actual state and the corresponding action of each time point into the original data set so as to expand to obtain a training data set.
In a second aspect, the present invention provides an apparatus for learning under hierarchical target conditions, the apparatus comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a training data set, and the training data set comprises states of an object at each time point in a period of continuous time and corresponding actions in each state;
the first building module is used for forming the states of any two nonadjacent time points and the states of any middle time point between any two nonadjacent time points into a state data pair;
the second construction module is used for forming an action data pair by the state of the previous time point in any two non-adjacent time points, the corresponding action and the state of the middle time point;
and the training module is used for obtaining a state model by utilizing the state data pair training, obtaining an action model by utilizing the action data pair training and obtaining a learning model based on the state model and the action model.
In a third aspect, the present invention provides an electronic device comprising one or more storage media and one or more processors in communication with the storage media, the one or more storage media storing machine-executable instructions which, when the electronic device runs, are executed by the processors to perform the method steps of any one of the preceding embodiments.
The beneficial effects of the embodiment of the invention include, for example:
after a training data set is obtained, the states of any two non-adjacent time points in the training data set and the state of any intermediate time point between those two time points form a state data pair, and the state of the previous time point of the two non-adjacent time points, its corresponding action, and the state of the intermediate time point form an action data pair. A state model is then trained with the state data pairs, an action model is trained with the action data pairs, and a learning model is obtained based on the state model and the action model. In this scheme, the training data is divided into state data pairs having intermediate states, and action data pairs including actions are constructed from those data pairs, so that the entire process can be divided into a plurality of stages. In this way, the obtained learning model can periodically check itself during action reproduction based on the information of each stage and correct action errors of the controlled object in time, thereby effectively suppressing accumulated error.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a flowchart of a learning method under a layered target condition according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a model training phase and a task execution phase provided in an embodiment of the present application;
FIG. 3 is a flowchart of sub-steps included in step S101 of FIG. 1;
FIG. 4 is a flowchart of sub-steps involved in step S102 of FIG. 1;
FIG. 5 is a flowchart illustrating sub-steps involved in step S103 of FIG. 1;
FIG. 6 is a flowchart illustrating sub-steps included in step S104 of FIG. 1;
FIG. 7 is another flowchart of the substeps involved in step S104 of FIG. 1;
fig. 8 is a flowchart of a control method in a learning method under a layered target condition according to an embodiment of the present application;
FIG. 9 is a flowchart of sub-steps involved in step S204 of FIG. 8;
fig. 10 is a block diagram of an electronic device according to an embodiment of the present application;
fig. 11 is a functional block diagram of a learning apparatus under a layered target condition according to an embodiment of the present application.
Reference numerals: 110-storage medium; 120-processor; 130-learning apparatus under hierarchical target conditions; 131-acquisition module; 132-first construction module; 133-second construction module; 134-training module; 140-communication interface.
Detailed Description
In the prior art, various improvements have been proposed to improve the performance of behavioral cloning methods. A first method uses a learning machine model trained together with stability constraints to obtain a dynamic prediction model, and performs imitation learning with that model, which guarantees the stability, reproduction accuracy and training speed of the robot's imitation learning and makes the robot's motion more human-like. However, this method uses the original expert data set and lacks a good ability to cope with states outside that data set.
A second scheme divides the taught action into multiple segments according to its steps, generates teaching trajectory data and an error threshold for each segment, acquires the learner's imitated action and divides it into the same segments, and then compares the imitated trajectory data with the teaching trajectory data of each segment to judge whether the imitated action exceeds the error threshold, i.e. whether it is qualified. This makes the action-learning process more intelligent and provides better teaching ability. However, the method requires the action to be split manually, which increases the workload.
Based on the above findings, the present application provides a learning scheme under a hierarchical target condition. By dividing training data into state data pairs having intermediate states and constructing action data pairs including actions on top of those pairs, the whole process can be divided into a plurality of stages. The trained learning model can periodically check itself during action reproduction based on the information of each stage and correct the action error of the controlled object in time, thereby effectively suppressing accumulated error. In addition, the scheme expands the training data so that control in more states can be handled, and no manual splitting process is needed, avoiding additional workload.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.
Referring to fig. 1, a flowchart of a learning method under a layered target condition according to an embodiment of the present application is shown, where method steps defined in a flow related to the learning method under the layered target condition can be implemented by an electronic device, for example, a personal computer, a notebook computer, a server, and other devices. The specific process shown in FIG. 1 will be described in detail below.
S101, a training data set is obtained, wherein the training data set comprises states of an object at each time point in a period of continuous time and corresponding actions of the object in each state.
S102, states of any two nonadjacent time points and states of any middle time point between any two nonadjacent time points form a state data pair.
And S103, forming an action data pair from the state of the previous time point of the two non-adjacent time points, its corresponding action, and the state of the intermediate time point.
And S104, obtaining a state model by using the state data pair training, obtaining an action model by using the action data pair training, and obtaining a learning model based on the state model and the action model.
In this embodiment, the acquired training data set may be a data set obtained based on expert teaching of an object, where the object may be, for example, a robot, for example, a four-axis robot, a six-axis robot, or the like.
The training data set may be generated by having the robot, under the operation of an expert, travel a certain distance, complete a grasping task, or complete a polishing task. The training data set may include a plurality of state data items, each recorded according to a sampling period; the sampling period may be, for example, 1 s or 2 s. Accordingly, the respective time points may be time points spaced at intervals of 1 s, 2 s, and so on.
And each state data can correspond to an action, and when the robot is in a certain state, the robot is controlled according to the corresponding action, so that the robot can be switched to the next state.
Wherein the state may be, for example, a robot joint angle and the action may be, for example, a robot joint velocity.
In this embodiment, the state and the action in the collected training data set are presented by a data pair in a state-action form, for example, the training data set may be recorded as:
{τ_1, τ_2, …, τ_N}, with τ_i = {(s_t^i, a_t^i)},
wherein τ_i denotes the i-th group of data, and s_t^i and a_t^i respectively denote the state and the action at time t in the i-th group of data.
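As a purely illustrative sketch (the concrete data layout, dimensions and random values below are assumptions of this example, not details fixed by the disclosure), such a data set can be held as a list of trajectories, each trajectory being a list of (state, action) pairs:

```python
import numpy as np

STATE_DIM = 6    # assumed: joint angles of a six-axis robot
ACTION_DIM = 6   # assumed: the corresponding joint velocities

def make_demo_trajectory(length=100):
    """One placeholder trajectory tau_i = [(s_0, a_0), ..., (s_T, a_T)]."""
    states = np.random.uniform(-np.pi, np.pi, size=(length, STATE_DIM))
    actions = np.random.uniform(-1.0, 1.0, size=(length, ACTION_DIM))
    return list(zip(states, actions))

# The data set {tau_1, ..., tau_N}: N expert demonstrations of the same task.
dataset = [make_demo_trajectory() for _ in range(20)]
```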
In the prior art, the model is trained directly on the training data set in the form described above, and the robot is then controlled based on the obtained model. For example, a corresponding action is generated from the current state of the robot in combination with the trained model, and the robot is controlled according to the generated action so as to reach the state at the next moment. That is, according to the current state s_t of the robot, a corresponding action a_t is generated, and the robot is commanded to execute a_t so as to reach the state s_{t+1} at the next moment.
In the approach adopted in the prior art, on one hand, errors exist in the learning process, and the accumulated error affects the accuracy of action reproduction. On the other hand, because the amount of data in the training data set is limited and accumulated error exists, once the robot state enters the prediction blind area of the model due to accumulated error or a change of the task environment, the task may fail and the robot may even be damaged.
Based on this, in the present embodiment, after the training data set is acquired it is reprocessed, and the model is trained with the processed data set. The overall idea of the processing is to re-label the original state-action training data into a triplet data structure of state-action-target state. The state and the target state may correspond to adjacent time points, or the target state may correspond to a non-adjacent time point. That is, for a given time point, a triplet may be constructed with the state of an adjacent time point, or triplets may be constructed with the states of a plurality of different non-adjacent time points.
This re-labeling finds rich target states for training in the expert teaching data, which is equivalent to expanding the original amount of training data and thereby improves the generalization ability of the policy.
Specifically, in the present embodiment, please refer to fig. 2 in combination, states at any two non-adjacent time points and states at any intermediate time point between the any two non-adjacent time points may form a state data pair.
The state data pair may be represented in the form "state-intermediate state-target state" (s-s_l-s_h), where the state of the previous time point of the two non-adjacent time points is the "state", the state of the intermediate time point is the "intermediate state", and the state of the next time point of the two non-adjacent time points is the "target state". The interval between the two non-adjacent time points may be one time interval, two time intervals, fifty time intervals, and so on, without limitation. The intermediate time point may be adjacent to the previous time point, or separated from the previous time point by two time points.
In this way, an entire process from the "state" to the "target state" is divided into a plurality of different stages by the intermediate time points.
On this basis, the state of the previous time point of the two non-adjacent time points, its corresponding action, and the state of the intermediate time point may be formed into an action data pair.
The action data pair may be represented in the form "state-action-intermediate state" (s-a-s_l), where the state corresponds to the previous of the two non-adjacent time points, the action corresponds to that same time point, and the intermediate state is the state of the intermediate time point.
Similarly, since the intermediate time point can be set as required, for example in the first half, the middle or the second half of the time period between the two non-adjacent time points, the time interval between the "state" and the "intermediate state" in the action data pair varies accordingly.
On the basis of the above, a state model can be trained from the state data pairs, denoted π_s(s_l | s, s_h), and an action model can be trained from the action data pairs, denoted π_M(a | s, s_l). Combining the state model and the action model yields the learning model. The learning model can output actions that control the robot's operation, and an instruction carrying the specific action is then sent to the robot's controller to realize the control.
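As one possible realization (the network architecture below is an assumption of this sketch, not a detail fixed by the disclosure), both π_s and π_M can be implemented as small fully connected networks: π_s maps the concatenation of the current state s and the target state s_h to a predicted intermediate state, and π_M maps the concatenation of s and the intermediate state s_l to a predicted action.

```python
import torch
import torch.nn as nn

STATE_DIM = 6    # assumed joint-angle dimension
ACTION_DIM = 6   # assumed joint-velocity dimension

def mlp(in_dim, out_dim, hidden=256):
    """A small fully connected network; the architecture is an assumption."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

# State model pi_s(s_l | s, s_h): predicts an intermediate state from
# the current state s and the target state s_h.
state_model = mlp(2 * STATE_DIM, STATE_DIM)

# Action model pi_M(a | s, s_l): predicts the action to execute in state s
# in order to reach the intermediate state s_l.
action_model = mlp(2 * STATE_DIM, ACTION_DIM)

def learning_model(s, s_h):
    """Combined learning model: first pick a sub-goal, then the action."""
    s_l = state_model(torch.cat([s, s_h], dim=-1))
    a = action_model(torch.cat([s, s_l], dim=-1))
    return a, s_l
```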
In the learning method under the hierarchical target condition provided by this embodiment, the training data is divided into state data pairs having intermediate states, and action data pairs including actions are constructed from those data pairs, so that the whole process can be divided into a plurality of stages. In this way, the obtained learning model can periodically check itself during action reproduction based on the information of each stage and correct the action error of the controlled object in time, thereby effectively suppressing accumulated error.
In this embodiment, in order to further expand the sample set, the training data set may be obtained by processing the data set generated from expert teaching. Optionally, referring to fig. 3, the training data set may be obtained in the following manner:
S1011, acquiring an original data set, wherein the original data set comprises states of the object at each time point in a period of continuous time and corresponding actions in each state.
And S1012, controlling the test object based on the original data set, and obtaining the actual state of the test object under each action.
And S1013, adding the actual state and the corresponding action of each time point into the original data set to expand to obtain a training data set.
In this embodiment, the original data set may be a data set generated in the expert teaching process. A test object, which may be a test robot, is controlled based on the raw data set. In controlling the test robot according to the original data set, the test robot will generate new data, e.g. new states, based on the original data set during the control due to the actual control error. Thus, a state-action data pair format may be constructed based on the new actual state and corresponding action. And adding the newly generated data pairs into the original data set to obtain an expanded training data set.
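A minimal sketch of this expansion step is given below. The `TestRobot` class is a hypothetical stand-in for the real test robot (the actual robot interface is not specified in the disclosure), and the noise model is only meant to illustrate that actual execution deviates slightly from the commanded motion:

```python
import numpy as np

class TestRobot:
    """Hypothetical test-robot interface; step() returns the actual state
    reached after executing an action (including control error)."""
    def __init__(self, state_dim=6):
        self.state = np.zeros(state_dim)

    def reset(self, state):
        self.state = np.array(state, dtype=float)
        return self.state

    def step(self, action, dt=1.0, noise=1e-2):
        # Actual execution deviates slightly from the commanded motion.
        self.state = (self.state + dt * np.asarray(action)
                      + np.random.normal(0.0, noise, self.state.shape))
        return self.state.copy()

def expand_dataset(original_trajectories):
    """Replay each expert trajectory on the test robot and append the
    actually reached (state, action) pairs to the data set."""
    robot = TestRobot()
    expanded = list(original_trajectories)
    for traj in original_trajectories:
        states, actions = zip(*traj)
        actual_state = robot.reset(states[0])
        new_traj = []
        for action in actions:
            new_traj.append((actual_state, action))
            actual_state = robot.step(action)
        expanded.append(new_traj)
    return expanded
```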
On the basis of the above, in this embodiment, when constructing the state data pairs, the state sliding window width of each state data pair may be set as required. The sliding window spans the time period between the time point corresponding to the "state" and the time point corresponding to the "intermediate state" in the state data pair, so the window width determines the position of the intermediate time point. A plurality of state sliding window widths may be set for the state data pairs as needed.
Referring to fig. 4, in the case of determining the state sliding window width of the state data pair, the state data pair can be constructed by:
and S1021, starting from the previous time point of any two non-adjacent time points, dividing the time points according to the width of the state sliding window respectively to determine a plurality of intermediate time points.
S1022, obtaining the state of each time point of the two non-adjacent time points and the state of each intermediate time point between the two non-adjacent time points, so as to form a plurality of state data pairs.
Generally, the previous time point and the next time point of the two non-adjacent time points are known; the construction starts from the previous time point and divides sub-stages until the next time point is reached. Therefore, starting from the previous time point, the intermediate time points can be determined according to the set state sliding window widths, and the time point corresponding to each state in the state data pair is thus determined.
And forming a state data pair based on the state of the previous time point, the state of the middle time point and the state of the next time point in the obtained training data set.
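The following sketch shows one way this construction could look in code; it assumes the trajectory is stored as a list of (state, action) tuples and that the sliding window widths are counted in sampling steps (both assumptions of this example):

```python
def build_state_pairs(trajectory, window_widths=(5, 10, 20)):
    """Build (s, s_l, s_h) state data pairs from one trajectory.

    For every pair of non-adjacent time points (t0, t1) and every state
    sliding window width d, the state at t0 + d is taken as the
    intermediate state, provided it lies strictly between t0 and t1.
    """
    states = [s for s, _ in trajectory]
    pairs = []
    T = len(states)
    for t0 in range(T):
        for t1 in range(t0 + 2, T):          # t1 is non-adjacent to t0
            for d in window_widths:
                t_mid = t0 + d
                if t_mid < t1:               # intermediate point between t0 and t1
                    pairs.append((states[t0], states[t_mid], states[t1]))
    return pairs
```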
In addition, in this embodiment, a plurality of motion sliding window widths of the motion data pair may also be set based on the requirement. The action sliding window width is a time span between a time point corresponding to the "state" and a time point corresponding to the "intermediate state" in the action data pair.
Referring to fig. 5, in the case of determining the width of the motion sliding window, the motion data pair may be constructed by:
and S1031, determining intermediate time points according to the previous time point in any two non-adjacent time points and the width of each action sliding window, and determining corresponding intermediate time points.
S1032, obtaining the state and the corresponding action at the previous time point, and the states at the intermediate time points, so as to form a plurality of action data pairs.
In this embodiment, the action sliding window width is related to the state sliding window width, because the intermediate time point used in the action data pair is determined based on the state sliding window width; it is only that, in different action data pairs, the pairs may be determined based on different action sliding window widths.
For each action data pair, starting from the time point corresponding to its "state", i.e. the previous time point of the two non-adjacent time points, a time range is determined according to the action sliding window width, and the end point of that range is taken as the intermediate time point. An action data pair is then constructed from the state of the previous time point in the training data set, the action corresponding to the previous time point, and the state of the intermediate time point.
In corresponding state data pairs and action data pairs, the time points corresponding to the states are consistent. For example, the two non-adjacent time points may be t = 1 and t = 100, and one state sliding window width may be set to d = 10.
When the state data pair and the action data pair correspond to each other, t = 1 is the time point corresponding to the "state" in the state data pair "state-intermediate state-target state", t = 11 is the time point corresponding to the "intermediate state", and t = 100 is the time point corresponding to the "target state". In the action data pair "state-action-intermediate state", the action sliding window width is accordingly d' = 10: the time point corresponding to the "state" is t = 1, the "action" is the action corresponding to the state at t = 1 (executing it switches the state at t = 1 to the state at t = 2), and the time point corresponding to the "intermediate state" is t = 11.
This action data pair expresses that, in order to go from the state at t = 1 to the state at t = 11, the action must be executed at t = 1 so as to reach the state at t = 2, after which the intermediate state at t = 11 can be accurately reached through the subsequent transitions.
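Under the same assumptions as the state-pair sketch above (a trajectory stored as a list of (state, action) tuples, window widths counted in sampling steps), the action data pairs can be built as follows; with a window width of 10, the pair anchored at t = 1 combines the state and action recorded at t = 1 with the state at t = 11, matching the example just given.

```python
def build_action_pairs(trajectory, window_widths=(5, 10, 20)):
    """Build (s, a, s_l) action data pairs from one trajectory.

    For every earlier time point t0 and every action sliding window width d,
    the state at t0, the action recorded at t0 and the state at t0 + d form
    one action data pair.
    """
    pairs = []
    T = len(trajectory)
    for t0 in range(T):
        state, action = trajectory[t0]
        for d in window_widths:
            t_mid = t0 + d
            if t_mid < T:
                intermediate_state = trajectory[t_mid][0]
                pairs.append((state, action, intermediate_state))
    return pairs
```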
After the state data pairs and the action data pairs are constructed in the above manner, the state model can be obtained by training with the state data pairs, and the action model can be obtained by training with the action data pairs.
Referring to fig. 6, in the present embodiment, the state model can be trained in the following manner:
s1041, taking states of any two nonadjacent time points in each state data pair as an initial state and a target state respectively, and inputting the initial state and the target state into a state model.
S1042, taking the state of any intermediate time point between any two non-adjacent time points as an intermediate state, taking the intermediate state as a state sample label, and realizing the training of a state model based on the state sample label.
In the present embodiment, for each data pair "state-intermediate state-target state", where "state" and "target state" are input as state models, the state models can output predicted intermediate states by learning information of "state" and "target state".
The "intermediate state" in the state data pair is used as the state sample label, and the predicted intermediate state output by the model continuously learns from the state sample label, so that the predicted intermediate state finally output by the model coincides with the state sample label or differs from it only minutely, thereby completing the training of the state model. That is, the finally trained state model can predict and output the state of an intermediate time point from the states of two non-adjacent time points.
In addition, referring to fig. 7, in this embodiment, the training of the motion model can be implemented by:
and S1043, respectively taking the state of the previous time point and the state of the middle time point of the any two non-adjacent time points as an initial state and a middle state, and inputting the initial state and the middle state into an action model.
And S1044, taking the action corresponding to the previous time point in any two non-adjacent time points as an action sample label, and realizing the training of the action model based on the action sample label.
In the present embodiment, for each pair of motion data, "state-motion-intermediate state", where "state" and "intermediate state" are input as motion models, the motion models output predicted motions to be executed in "state" by learning information of "state" and "intermediate state".
The "action" in the action data pair is used as the action sample label, so that the predicted action output by the action model continuously learns from and imitates the action sample label, and the predicted action finally output by the action model coincides with the action sample label or differs from it only minutely, thereby completing the training of the action model.
The state model and the action model obtained by the training in the above manner constitute the learning model. The learning model can then be used to control the robot in the actual control stage.
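A minimal supervised-training sketch for both models is given below; it reuses the network definitions from the earlier sketch, and the choice of MSE losses, the Adam optimizer and the hyperparameters are assumptions of this example rather than details fixed by the disclosure:

```python
import numpy as np
import torch
import torch.nn as nn

def train_models(state_model, action_model, state_pairs, action_pairs,
                 epochs=50, lr=1e-3):
    """Train pi_s on (s, s_l, s_h) triples and pi_M on (s, a, s_l) triples."""
    mse = nn.MSELoss()
    opt_s = torch.optim.Adam(state_model.parameters(), lr=lr)
    opt_a = torch.optim.Adam(action_model.parameters(), lr=lr)

    # Stack the data pairs into batched tensors.
    s, s_l, s_h = (torch.tensor(np.array(x), dtype=torch.float32)
                   for x in zip(*state_pairs))
    s2, a, s_l2 = (torch.tensor(np.array(x), dtype=torch.float32)
                   for x in zip(*action_pairs))

    for _ in range(epochs):
        # State model: (initial state, target state) -> intermediate-state label.
        opt_s.zero_grad()
        loss_s = mse(state_model(torch.cat([s, s_h], dim=-1)), s_l)
        loss_s.backward()
        opt_s.step()

        # Action model: (initial state, intermediate state) -> action label.
        opt_a.zero_grad()
        loss_a = mse(action_model(torch.cat([s2, s_l2], dim=-1)), a)
        loss_a.backward()
        opt_a.step()

    return state_model, action_model
```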
Therefore, referring to fig. 8, on the basis of the above, the learning method under the layered target condition provided in this embodiment may further include the following steps:
s201, acquiring the current state and the preset target state of the object to be controlled.
S202, inputting the current state and the preset target state into a state model of the learning model to obtain a preset intermediate state.
And S203, inputting the current state and the preset intermediate state into the action model of the learning model, and outputting a target action.
And S204, controlling the object to be controlled by using the target action.
In this embodiment, the object to be controlled may be a robot that actually needs to be controlled, and the preset target state may be a state that the robot needed in the current control task finally reaches. The current state of the robot can be acquired by devices such as joint encoders, sensors, cameras and the like.
Referring to fig. 2, the current state and the preset target state of the robot may be input into the state model as input information. The state model may output a preset intermediate state between the current state and the preset target state by processing the current state and the preset target state. The preset intermediate state may be understood as a state corresponding to a certain time point between the start time point and the end time point in the process of controlling the robot.
On this basis, the current state and the preset intermediate state are input into the action model as input information, and by processing the current state and the preset intermediate state the action model can output the action corresponding to the current state. This action can be understood as the action to be executed in the current state in order to reach the preset intermediate state.
Therefore, the robot can be controlled based on the obtained motion, and the control task can be completed.
In the above control process, going from the current state to the preset target state requires multiple rounds of control. That is, step S204 may be performed in the following manner, with reference to fig. 9:
S2041, controlling the object to be controlled to operate by utilizing the target action, so as to update the current state of the object to be controlled.
And S2042, obtaining an updated target action based on the updated current state and the preset target state, controlling the object to be controlled by the updated target action, and stopping controlling the object to be controlled until the difference between the current state and the preset target state is less than or equal to a preset threshold value.
In this embodiment, when the action corresponding to the current state is obtained, the robot is controlled to execute that action in the current state, and the robot state changes to the state at the next moment. The state at the next moment is taken as the updated current state, the above procedure is continued to obtain the action for the updated current state, and the robot is again controlled to operate according to that action. The control task ends when, after control, the state of the robot coincides with the preset target state or the difference between the state of the robot and the preset target state is less than or equal to the preset threshold.
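A closed-loop control sketch corresponding to steps S201-S204 is shown below; it reuses the hypothetical robot interface and the network sketches above, and the stopping threshold, distance measure and iteration cap are illustrative assumptions:

```python
import torch

@torch.no_grad()
def run_task(robot, state_model, action_model, target_state,
             threshold=1e-2, max_steps=500):
    """Repeatedly predict a sub-goal and an action until the robot's state
    is within `threshold` of the preset target state."""
    s_h = torch.tensor(target_state, dtype=torch.float32)
    current = torch.tensor(robot.state, dtype=torch.float32)
    for _ in range(max_steps):
        if torch.norm(current - s_h) <= threshold:
            break  # close enough to the preset target state: task finished
        # Step 1: the state model picks the preset intermediate state (sub-goal).
        s_l = state_model(torch.cat([current, s_h]))
        # Step 2: the action model outputs the target action for the current state.
        action = action_model(torch.cat([current, s_l]))
        # Step 3: execute the action and read back the updated current state.
        current = torch.tensor(robot.step(action.numpy()), dtype=torch.float32)
    return current
```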
According to the learning method under the layered target condition provided above, the training data set is effectively expanded by reconstructing it into state data pairs and action data pairs, so that the trained model has better generalization ability, copes better with task-environment changes not covered by the training data, and alleviates the blind-area problem of the cloning policy caused by insufficient data.
In addition, the state model and the action model are obtained through the layered-policy training, so that the robot can spontaneously search for and update the sub-goals in a trajectory while executing the trajectory action and correct its action errors in time. This effectively solves the problem of poor reproduction accuracy caused by accumulated errors; the process is completed automatically by the model without manual intervention, reducing labor cost.
Referring to fig. 10, a schematic diagram of exemplary components of an electronic device according to an embodiment of the present disclosure is provided, where the electronic device may be a personal computer, a notebook computer, a server, or the like. The electronic device may include a storage medium 110, a processor 120, a learning apparatus 130 under hierarchical target conditions, and a communication interface 140. In this embodiment, the storage medium 110 and the processor 120 are both located in the electronic device and are separately disposed. However, it should be understood that the storage medium 110 may be separate from the electronic device and may be accessed by the processor 120 through a bus interface. Alternatively, the storage medium 110 may be integrated into the processor 120, for example, may be a cache and/or general purpose registers.
The learning apparatus 130 under the hierarchical target condition may be understood as the electronic device above, or as the processor 120 of the electronic device, or as a software functional module that is independent of the electronic device or the processor 120 and implements, under the control of the electronic device, the learning method under the hierarchical target condition.
As shown in fig. 11, the learning apparatus 130 under the above-mentioned hierarchical target condition may include an obtaining module 131, a first constructing module 132, a second constructing module 133, and a training module 134. The functions of the functional blocks of the learning device 130 under the hierarchical target condition will be described in detail below.
An obtaining module 131, configured to obtain a training data set, where the training data set includes states of an object at each time point and corresponding actions in each state in a period of continuous time;
it is understood that the obtaining module 131 may be configured to perform the step S101, and for detailed implementation of the obtaining module 131, reference may be made to what is described above with respect to the step S101.
A first constructing module 132, configured to form a state data pair from states at any two non-adjacent time points and states at any intermediate time point between the any two non-adjacent time points;
it is understood that the first building block 132 can be used to execute the step S102, and for the detailed implementation of the first building block 132, reference can be made to the above description of the step S102.
A second constructing module 133, configured to form an action data pair from a state of a previous time point of the any two non-adjacent time points, the corresponding action, and the state of the intermediate time point;
it is understood that the second building module 133 can be used to execute the above step S103, and for the detailed implementation of the second building module 133, reference can be made to the above contents related to step S103.
And the training module 134 is configured to obtain a state model through the state data pair training, obtain an action model through the action data pair training, and obtain a learning model based on the state model and the action model.
It is understood that the training module 134 can be used to perform the step S104, and for the detailed implementation of the training module 134, reference can be made to the above description related to the step S104.
In one possible implementation, the training module 134 may be configured to:
respectively taking states of any two nonadjacent time points in each state data pair as an initial state and a target state, and inputting the initial state and the target state into a state model;
and taking the state of any middle time point between any two non-adjacent time points as a middle state, taking the middle state as a state sample label, and realizing the training of the state model based on the state sample label.
In one possible implementation, the training module 134 may be configured to:
respectively taking the state of the previous time point and the state of the middle time point in any two non-adjacent time points as an initial state and a middle state, and inputting the initial state and the middle state into an action model;
and taking the action corresponding to the previous time point in any two non-adjacent time points as an action sample label, and realizing the training of the action model based on the action sample label.
In a possible implementation, the learning apparatus 130 under the hierarchical target condition further includes a control module, which may be configured to:
acquiring the current state and the preset target state of an object to be controlled;
inputting the current state and a preset target state into a state model of the learning model to obtain a preset intermediate state;
inputting the current state and a preset intermediate state into an action model of the learning model, and outputting a target action;
and controlling the object to be controlled by utilizing the target action.
In a possible implementation, the control module may be configured to:
controlling the object to be controlled to operate by utilizing the target action, so as to update the current state of the object to be controlled;
and obtaining an updated target action based on the updated current state and the preset target state, controlling the object to be controlled by the updated target action, and stopping controlling the object to be controlled until the difference between the current state and the preset target state is less than or equal to a preset threshold value.
In a possible implementation, the learning apparatus 130 under the hierarchical target condition further includes a setting module, which may be configured to:
setting a plurality of state sliding window widths of the state data pairs;
the first building block 132 described above may be configured to:
starting from the previous time point of the two non-adjacent time points, dividing according to each state sliding window width to determine a plurality of intermediate time points;
and acquiring the state of each time point of the two non-adjacent time points and the state of each intermediate time point between the two non-adjacent time points to form a plurality of state data pairs.
In a possible implementation, the setting module may be further configured to:
setting a plurality of action sliding window widths of the action data pairs;
the second building block 133 described above may be configured to:
determining corresponding intermediate time points according to the previous time point of the any two non-adjacent time points and the width of each action sliding window;
and acquiring the state and the corresponding action of the previous time point and the state of each intermediate time point to form a plurality of action data pairs.
In a possible implementation, the obtaining module 131 may be configured to:
acquiring an original data set, wherein the original data set comprises states of the object at each time point in a period of continuous time and corresponding actions in each state;
controlling a test object based on the original data set, and obtaining the actual state of the test object under each action;
and adding the actual state and the corresponding action of each time point into the original data set so as to expand to obtain a training data set.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Further, an embodiment of the present application also provides a computer-readable storage medium, where a machine-executable instruction is stored, and when the machine-executable instruction is executed, the learning method under the hierarchical target condition provided by the foregoing embodiment is implemented.
Specifically, the computer readable storage medium can be a general storage medium, such as a removable disk, a hard disk, or the like, and when executed, the computer program on the computer readable storage medium can execute the learning method under the above hierarchical target condition. With regard to the processes involved when the executable instructions in the computer-readable storage medium are executed, reference may be made to the relevant description of the above method embodiments, which are not described in detail herein.
To sum up, after the training data set is obtained, the states of any two non-adjacent time points in the training data set and the state of any intermediate time point between those two time points form a state data pair, and the state of the previous time point of the two non-adjacent time points, its corresponding action, and the state of the intermediate time point form an action data pair. A state model is then trained with the state data pairs, an action model is trained with the action data pairs, and a learning model is obtained based on the state model and the action model. In this scheme, the training data is divided into state data pairs having intermediate states, and action data pairs including actions are constructed from those data pairs, so that the entire process can be divided into a plurality of stages. In this way, the obtained learning model can periodically check itself during action reproduction based on the information of each stage and correct action errors of the controlled object in time, thereby effectively suppressing accumulated error.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A method of learning under hierarchical target conditions, the method comprising:
acquiring a training data set, wherein the training data set comprises states of an object at each time point in a period of continuous time and corresponding actions in each state;
forming a state data pair by the states of any two nonadjacent time points and the state of any middle time point between any two nonadjacent time points;
forming an action data pair by the state of the previous time point in any two non-adjacent time points, the corresponding action and the state of the middle time point;
and obtaining a state model by using the state data pair training, obtaining an action model by using the action data pair training, and obtaining a learning model based on the state model and the action model.
2. The learning method under the condition of the layered object according to claim 1, wherein the step of obtaining the state model by using the state data pair training comprises:
taking states of any two nonadjacent time points in each state data pair as an initial state and a target state respectively, and inputting the initial state and the target state into a state model;
and taking the state of any intermediate time point between any two non-adjacent time points as an intermediate state, taking the intermediate state as a state sample label, and realizing the training of the state model based on the state sample label.
3. The learning method under the condition of layered objects according to claim 1, wherein the step of obtaining the motion model by using the motion data pair training comprises:
respectively taking the state of the previous time point and the state of the middle time point in any two non-adjacent time points as an initial state and a middle state, and inputting the initial state and the middle state into an action model;
and taking the action corresponding to the previous time point in any two non-adjacent time points as an action sample label, and realizing the training of an action model based on the action sample label.
4. The learning method under hierarchical objective conditions as recited in claim 1, further comprising:
acquiring the current state and the preset target state of an object to be controlled;
inputting the current state and a preset target state into a state model of the learning model to obtain a preset intermediate state;
inputting the current state and a preset intermediate state into an action model of the learning model, and outputting a target action;
and controlling the object to be controlled by utilizing the target action.
5. The learning method under the hierarchical object condition according to claim 4, wherein the step of controlling the object to be controlled by the target action includes:
controlling the operation of the object to be controlled by utilizing the target action, so as to update the current state of the object to be controlled;
and obtaining an updated target action based on the updated current state and the preset target state, controlling the object to be controlled by the updated target action, and stopping controlling the object to be controlled until the difference between the current state and the preset target state is less than or equal to a preset threshold value.
6. The learning method under hierarchical objective conditions as recited in claim 1, further comprising:
setting a plurality of state sliding window widths of the state data pairs;
the step of forming a state data pair from states of any two non-adjacent time points and states of any intermediate time point between any two non-adjacent time points includes:
starting from the previous time point of the two non-adjacent time points, dividing according to each state sliding window width to determine a plurality of intermediate time points;
and acquiring the state of each time point in any two non-adjacent time points and the state of each intermediate time point between any two non-adjacent time points to form a plurality of state data pairs.
7. The learning method under a layered target condition according to claim 6, further comprising:
setting a plurality of action sliding window widths for the action data pairs;
wherein the step of forming an action data pair from the state at the earlier time point, the corresponding action, and the state at the intermediate time point comprises:
determining corresponding intermediate time points according to the earlier of the two non-adjacent time points and each action sliding window width;
and acquiring the state and corresponding action at the earlier time point and the state at each intermediate time point to form a plurality of action data pairs.
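The corresponding hedged sketch for claim 7: the action sliding window widths select the intermediate states that get paired with the state and action at the earlier time point. Widths are again illustrative.

```python
def build_action_pairs_multi(states, actions, gap=8, action_window_widths=(2, 4, 6)):
    pairs = []
    for t in range(len(states) - gap):
        for w in action_window_widths:
            if 0 < w < gap:
                # earlier state, intermediate state w steps later, action taken at the earlier point
                pairs.append((states[t], states[t + w], actions[t]))
    return pairs
```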
8. The learning method under a layered target condition according to claim 1, wherein the step of acquiring the training data set comprises:
acquiring an original data set, wherein the original data set comprises the states of an object at each time point over a period of continuous time and the corresponding action in each state;
controlling a test object based on the original data set and obtaining the actual state of the test object under each action;
and adding the actual state and corresponding action at each time point to the original data set to expand it into the training data set.
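A hedged sketch of the data-set expansion in claim 8: the recorded actions are replayed on a test object, the states actually reached are logged, and both are appended to the original data so the training set also reflects execution error. `test_object_step` is a placeholder for the real test-object interface.

```python
def expand_dataset(original_states, original_actions, test_object_step):
    new_states, new_actions = [], []
    state = original_states[0]
    for action in original_actions:
        # replay the recorded action and record the state actually reached
        state = test_object_step(state, action)
        new_states.append(state)
        new_actions.append(action)
    # expanded training data = original data plus the replayed (state, action) samples
    return original_states + new_states, original_actions + new_actions
```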
9. A learning apparatus under a layered target condition, the apparatus comprising:
an acquisition module configured to acquire a training data set, wherein the training data set comprises the states of an object at each time point over a period of continuous time and the corresponding action in each state;
a first construction module configured to form a state data pair from the states at any two non-adjacent time points and the state at any intermediate time point between the two non-adjacent time points;
a second construction module configured to form an action data pair from the state at the earlier of the two non-adjacent time points, the corresponding action, and the state at the intermediate time point;
and a training module configured to train a state model with the state data pairs, train an action model with the action data pairs, and obtain a learning model based on the state model and the action model.
10. An electronic device comprising one or more storage media and one or more processors in communication with the storage media, the one or more storage media storing machine-executable instructions which, when the electronic device operates, are executed by the one or more processors to perform the method steps of any one of claims 1-8.
CN202210863041.8A 2022-07-21 2022-07-21 Learning method and device under layered target condition and electronic equipment Active CN115204387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210863041.8A CN115204387B (en) 2022-07-21 2022-07-21 Learning method and device under layered target condition and electronic equipment

Publications (2)

Publication Number Publication Date
CN115204387A 2022-10-18
CN115204387B (en) 2023-10-03

Family

ID=83584426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210863041.8A Active CN115204387B (en) 2022-07-21 2022-07-21 Learning method and device under layered target condition and electronic equipment

Country Status (1)

Country Link
CN (1) CN115204387B (en)

Patent Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017097693A (en) * 2015-11-26 2017-06-01 Kddi株式会社 Data prediction device, information terminal, program, and method performing learning with data of different periodic layer
EP3401847A1 (en) * 2017-05-09 2018-11-14 Omron Corporation Task execution system, task execution method, training apparatus, and training method
US20210073640A1 (en) * 2018-05-21 2021-03-11 Movidius Ltd. Methods, systems, articles of manufacture and apparatus to reconstruct scenes using convolutional neural networks
US20190385061A1 (en) * 2018-06-19 2019-12-19 International Business Machines Corporation Closed loop model-based action learning with model-free inverse reinforcement learning
US20200034665A1 (en) * 2018-07-30 2020-01-30 DataRobot, Inc. Determining validity of machine learning algorithms for datasets
US20210027485A1 (en) * 2019-07-24 2021-01-28 Squadle, Inc. Status monitoring using machine learning and machine vision
WO2021050488A1 (en) * 2019-09-15 2021-03-18 Google Llc Determining environment-conditioned action sequences for robotic tasks
CN110738717A (en) * 2019-10-16 2020-01-31 网易(杭州)网络有限公司 Method and device for correcting motion data and electronic equipment
US20210166308A1 (en) * 2019-12-03 2021-06-03 Upstart Network, Inc. Augmenting machine learning models to incorporate incomplete datasets
CN111144580A (en) * 2019-12-31 2020-05-12 中国电子科技集团公司信息科学研究院 Hierarchical reinforcement learning training method and device based on simulation learning
US20210271956A1 (en) * 2020-02-28 2021-09-02 International Business Machines Corporation Personalized Automated Machine Learning
US20210374612A1 (en) * 2020-05-26 2021-12-02 Nec Laboratories America, Inc. Interpretable imitation learning via prototypical option discovery
CN111985640A (en) * 2020-07-10 2020-11-24 清华大学 Model training method based on reinforcement learning and related device
WO2022012265A1 (en) * 2020-07-13 2022-01-20 Guangzhou Institute Of Advanced Technology, Chinese Academy Of Sciences Robot learning from demonstration via meta-imitation learning
US20220111860A1 (en) * 2020-10-14 2022-04-14 Volkswagen Aktiengesellschaft Detecting objects and determining behaviors of objects
CN112396180A (en) * 2020-11-25 2021-02-23 中国科学院自动化研究所 Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
CN114493013A (en) * 2022-01-28 2022-05-13 浙江同善人工智能技术有限公司 Smart agent path planning method based on reinforcement learning, electronic device and medium
CN114723065A (en) * 2022-03-22 2022-07-08 中国人民解放军国防科技大学 Optimal strategy obtaining method and device based on double-layer deep reinforcement learning model
CN114474075A (en) * 2022-03-28 2022-05-13 法奥意威(苏州)机器人系统有限公司 Robot spiral track control method and device, storage medium and electronic equipment
CN114683288A (en) * 2022-05-07 2022-07-01 法奥意威(苏州)机器人系统有限公司 Robot display and control method and device and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A. VENKATRAMAN et al.: "Improving Multi-step Prediction of Learned Time Series Models", National Conference on Artificial Intelligence *
WITTX: "Robot Motion Control Algorithm Based on Deep Reinforcement Learning", Retrieved from the Internet <URL: http://www.wittx.cn/get_news_message.do?new_id=916> *
CAO Jie et al.: "Application of Action Prediction in Multi-Robot Reinforcement Learning Cooperation", Computer Engineering and Applications *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114021635A (en) * 2021-10-29 2022-02-08 北京京东振世信息技术有限公司 Method, apparatus, device and storage medium for training a model

Also Published As

Publication number Publication date
CN115204387B (en) 2023-10-03

Similar Documents

Publication Publication Date Title
JP6854921B2 (en) Multitasking neural network system with task-specific and shared policies
US10482379B2 (en) Systems and methods to perform machine learning with feedback consistency
Ding et al. Challenges of reinforcement learning
US20190370671A1 (en) System and method for cognitive engineering technology for automation and control of systems
US10860927B2 (en) Stacked convolutional long short-term memory for model-free reinforcement learning
Shaheen et al. Continual learning for real-world autonomous systems: Algorithms, challenges and frameworks
US20210201156A1 (en) Sample-efficient reinforcement learning
Badgwell et al. Reinforcement learning–overview of recent progress and implications for process control
EP3596661A1 (en) Data efficient imitation of diverse behaviors
CN111316295A (en) Reinforcement learning using distributed prioritized playback
CN104985599A (en) Intelligent robot control method and system based on artificial intelligence and intelligent robot
US10885432B1 (en) Selecting actions from large discrete action sets using reinforcement learning
CN115280322A (en) Hidden state planning actor control using learning
KR102577188B1 (en) Create a control system for the target system
JP7181415B2 (en) Control agents for exploring the environment using the likelihood of observations
CN115204387A (en) Learning method and device under layered target condition and electronic equipment
CN114521262A (en) Controlling an agent using a causal correct environment model
KR20220134619A (en) Representation of a learning environment for agent control using bootstrapped latent predictions
CN114341895A (en) Exploration using hyper-models
CN113330458A (en) Controlling agents using a potential plan
Hafez et al. Efficient intrinsically motivated robotic grasping with learning-adaptive imagination in latent space
US20190311302A1 (en) Electronic apparatus and control method thereof
EP3788554B1 (en) Imitation learning using a generative predecessor neural network
Floyd et al. Building learning by observation agents using jloaf
CN111091710A (en) Traffic signal control method, system and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant