CN115204387B - Learning method and device under layered target condition and electronic equipment - Google Patents

Learning method and device under layered target condition and electronic equipment

Info

Publication number
CN115204387B
CN115204387B (application number CN202210863041.8A)
Authority
CN
China
Prior art keywords
state
action
time points
model
time point
Prior art date
Legal status
Active
Application number
CN202210863041.8A
Other languages
Chinese (zh)
Other versions
CN115204387A (en)
Inventor
王岩
Current Assignee
Faoyiwei Suzhou Robot System Co ltd
Original Assignee
Faoyiwei Suzhou Robot System Co ltd
Priority date
Filing date
Publication date
Application filed by Faoyiwei Suzhou Robot System Co ltd filed Critical Faoyiwei Suzhou Robot System Co ltd
Priority to CN202210863041.8A
Publication of CN115204387A
Application granted
Publication of CN115204387B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/02 - Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The application provides a learning method and device under a layered target condition, and an electronic device. The states of any two non-adjacent time points in a training data set and the state of an intermediate time point between those two time points form a state data pair, and the state of the earlier of the two non-adjacent time points, its corresponding action, and the state of the intermediate time point form an action data pair. A state model is trained with the state data pairs, an action model is trained with the action data pairs, and a learning model is obtained based on the state model and the action model. In this scheme, the training data are divided into state data pairs containing intermediate states, and action data pairs containing actions are constructed from them, so that the entire process can be divided into a plurality of stages. The resulting learning model can therefore check itself periodically during action reproduction based on the information of each stage and correct the action errors of the controlled object in time, thereby effectively suppressing accumulated errors.

Description

Learning method and device under layered target condition and electronic equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a learning method and device under a layered target condition and electronic equipment.
Background
In robot reinforcement learning, an agent can encounter problems such as a huge task search space, sparse rewards, and difficulty in designing the reward function when learning certain tasks. To cope with these problems, imitation learning methods have developed rapidly in recent years. Imitation learning is essentially supervised learning: the agent learns by observing and imitating the behavior strategy of an expert, which greatly reduces the search space and removes the need to design a reward function, compensating for these shortcomings of reinforcement learning. Common imitation learning methods include behavioral cloning (Behavioral Cloning, BC) and inverse reinforcement learning (Inverse Reinforcement Learning, IRL).
Behavioral cloning means that the agent directly clones the expert strategy: it usually uses "state-action pairs" acquired from human expert teaching as training data and learns a discrete distribution from the discrete data to finally obtain a cloned strategy. Inverse reinforcement learning assumes that the expert strategy is perfect: the agent interprets the expert's behavior by learning a reward function (i.e., optimizing an initial reward function so that the expert strategy scores highest), then obtains the optimal strategy under that optimal reward function through a reinforcement learning algorithm, and the resulting strategy should finally be consistent with the expert strategy.
Inverse reinforcement learning requires a smaller training data set than behavioral cloning, but it is computationally expensive because it repeatedly runs reinforcement learning algorithms. The accuracy of behavioral cloning, meanwhile, is susceptible to compounding (accumulated) errors during action execution: if accumulated errors or changes in the task environment drive the robot's state into a blind zone of the training data, the cloned strategy may cause the robot to take hard-to-predict actions, leading to task failure or even serious consequences. Behavioral cloning therefore requires the training data set to cover all possible states as far as possible, which is generally difficult to achieve.
Disclosure of Invention
The invention aims to provide a learning method and device under a layered target condition, and an electronic device, which can correct the action errors of a controlled object in time so as to effectively suppress accumulated errors.
Embodiments of the invention may be implemented as follows:
in a first aspect, the present invention provides a learning method under a hierarchical target condition, the method comprising:
acquiring a training data set, wherein the training data set comprises states of an object at each time point and corresponding actions under each state in a continuous time;
The states of any two non-adjacent time points and the states of any intermediate time points between the any two non-adjacent time points form a state data pair;
forming an action data pair by the state of the previous time point in any two non-adjacent time points, the corresponding action and the state of the middle time point;
and training the state data pair to obtain a state model, training the action data pair to obtain an action model, and obtaining a learning model based on the state model and the action model.
In an alternative embodiment, the step of training to obtain a state model using the pair of state data includes:
the states of any two non-adjacent time points in each state data pair are respectively used as an initial state and a target state, and the initial state and the target state are input into a state model;
taking the state of any intermediate time point between any two non-adjacent time points as an intermediate state, taking the intermediate state as a state sample label, and realizing training of a state model based on the state sample label.
In an alternative embodiment, the step of training to obtain the motion model by using the motion data pair includes:
The state of the previous time point in the arbitrary two non-adjacent time points and the state of the middle time point are respectively used as an initial state and a middle state, and the initial state and the middle state are input into an action model;
and taking the action corresponding to the previous time point in any two non-adjacent time points as an action sample label, and realizing training of an action model based on the action sample label.
In an alternative embodiment, the method further comprises:
acquiring the current state of an object to be controlled and a preset target state;
inputting the current state and a preset target state into a state model of the learning model to obtain a preset intermediate state;
inputting the current state and the preset intermediate state into an action model of the learning model, and outputting a target action;
and controlling the object to be controlled by utilizing the target action.
In an alternative embodiment, the step of controlling the object to be controlled using the target action includes:
controlling the operation of the object to be controlled by utilizing the target action so as to update the current state of the object to be controlled;
and obtaining an updated target action based on the updated current state and the preset target state, and controlling the object to be controlled by the updated target action until the difference between the current state and the preset target state is smaller than or equal to a preset threshold value, and stopping controlling the object to be controlled.
In an alternative embodiment, the method further comprises:
setting a plurality of state sliding window widths of the state data pairs;
the step of composing the states of any two non-adjacent time points and the states of any intermediate time points between the any two non-adjacent time points into a state data pair includes:
starting from the previous time point in any two non-adjacent time points, dividing the time points according to the width of each state sliding window to determine a plurality of middle time points;
the state of each time point in any two non-adjacent time points and the state of each intermediate time point between any two non-adjacent time points are acquired to form a plurality of state data pairs.
In an alternative embodiment, the method further comprises:
setting a plurality of action sliding window widths of action data pairs;
the step of forming the state of the previous time point in the arbitrary two non-adjacent time points and the state of the corresponding action and the middle time point into an action data pair comprises the following steps:
determining corresponding intermediate time points according to the previous time point in the two non-adjacent time points and the width of each action sliding window;
And acquiring the state of the previous time point and the corresponding action, and the states of the intermediate time points to form a plurality of action data pairs.
In an alternative embodiment, the step of acquiring the training data set includes:
acquiring an original data set, wherein the original data set comprises states of the object at each time point within a continuous time period and corresponding actions in each state;
controlling a test object based on the original data set, and obtaining the actual state of the test object under each action;
and adding the actual state and the corresponding action of each time point into the original data set to expand and obtain a training data set.
In a second aspect, the present invention provides a learning apparatus under hierarchical target conditions, the apparatus comprising:
the acquisition module is used for acquiring a training data set, wherein the training data set comprises states of the object at each time point and corresponding actions under each state in a continuous time;
a first construction module, configured to construct a state data pair from states of any two non-adjacent time points and states of any intermediate time points between the any two non-adjacent time points;
The second construction module is used for forming an action data pair by the state of the previous time point in the any two non-adjacent time points, the corresponding action and the state of the middle time point;
and the training module is used for training the state data pair to obtain a state model, training the action data pair to obtain an action model, and obtaining a learning model based on the state model and the action model.
In a third aspect, the present application provides an electronic device comprising one or more storage media and one or more processors in communication with the storage media, the one or more storage media storing machine-executable instructions that are executable by the processor to perform the method steps recited in any one of the preceding embodiments when the electronic device is operated.
The beneficial effects of the embodiment of the application include, for example:
The application provides a learning method and device under a layered target condition, and an electronic device. After a training data set is acquired, the states of any two non-adjacent time points in the training data set and the state of an intermediate time point between those two time points form a state data pair, and the state of the earlier of the two non-adjacent time points, its corresponding action, and the state of the intermediate time point form an action data pair. A state model is then obtained by training on the state data pairs, an action model is obtained by training on the action data pairs, and a learning model is obtained based on the state model and the action model. In this scheme, the training data are divided into state data pairs having intermediate states, and action data pairs containing actions are constructed from the data pairs having intermediate states, so that the entire process can be divided into a plurality of stages. The resulting learning model can therefore check itself periodically during action reproduction based on the information of each stage and correct the action errors of the controlled object in time, thereby effectively suppressing accumulated errors.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a learning method under a hierarchical target condition provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a model training stage and a task execution stage according to an embodiment of the present application;
FIG. 3 is a flow chart of sub-steps included in step S101 of FIG. 1;
FIG. 4 is a flow chart of sub-steps included in step S102 of FIG. 1;
FIG. 5 is a flow chart of sub-steps included in step S103 of FIG. 1;
FIG. 6 is a flow chart of sub-steps included in step S104 of FIG. 1;
FIG. 7 is another flow chart of sub-steps included in step S104 of FIG. 1;
FIG. 8 is a flowchart of a control method in the learning method under the layered target condition provided in the embodiment of the present application;
FIG. 9 is a flowchart of sub-steps included in step S204 of FIG. 8;
FIG. 10 is a block diagram of an electronic device according to an embodiment of the present application;
FIG. 11 is a functional block diagram of a learning device under a hierarchical target condition according to an embodiment of the present application.
Reference numerals: 110 - storage medium; 120 - processor; 130 - learning device under hierarchical target conditions; 131 - acquisition module; 132 - first construction module; 133 - second construction module; 134 - training module; 140 - communication interface.
Detailed Description
In the prior art, various improvements have been proposed to improve the performance of the behavioral cloning method. For example, a first scheme trains a learning-machine model under stability constraint conditions to obtain a dynamic prediction model and uses that model for imitation learning, so that the stability, reproduction accuracy and training speed of robot imitation learning are guaranteed and the human-likeness of the robot motion is effectively improved. However, this method uses only the original expert data set and copes poorly with states outside that data set.
In a second scheme, the taught action is split into multiple sections according to its steps, and teaching trajectory data and an error threshold are generated for each section; the learner's imitated action is collected and split into the same sections, the imitated trajectory data of each section are compared with the corresponding teaching data, and whether the error threshold is exceeded determines whether the imitation is qualified. The action learning process is thus more intelligent and the method has better teaching ability. However, it requires manually splitting the action, which increases the workload.
Based on the above findings, the present application provides a learning scheme under a hierarchical target condition: by dividing the training data into state data pairs having intermediate states and constructing action data pairs containing actions from those data pairs, the entire process can be divided into a plurality of stages. The learning model obtained through training can check itself periodically during action reproduction based on the information of each stage and correct the action errors of the controlled object in time, so that accumulated errors are effectively suppressed. In addition, the scheme expands the training data and can cope with control in more states, and no manual splitting process is needed, avoiding additional workload.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.
Referring to fig. 1, a flowchart of a learning method under a hierarchical target condition according to an embodiment of the present application is shown. The method steps defined by this flow may be implemented by an electronic device, for example a personal computer, a notebook computer, or a server. The specific flow shown in fig. 1 is described in detail below.
S101, acquiring a training data set, wherein the training data set comprises states of an object at each time point and corresponding actions under each state in a continuous time.
S102, forming a state data pair by the states of any two non-adjacent time points and the states of any intermediate time points between the any two non-adjacent time points.
S103, forming an action data pair by the state of the previous time point in any two non-adjacent time points, the corresponding action and the state of the middle time point.
S104, training to obtain a state model by using the state data pair, training to obtain an action model by using the action data pair, and obtaining a learning model based on the state model and the action model.
In this embodiment, the acquired training data set may be a data set obtained based on expert teaching of an object, where the object may be, for example, a robot, for example, a four-axis robot, a six-axis robot, or the like, without limitation.
The training data set may be acquired while an expert operates the robot, for example to travel a certain distance, complete a grasping task, or complete a polishing task. The training data set may include a plurality of state data, each recorded according to a sampling period, for example 1 s or 2 s. The respective time points are thus spaced at intervals of 1 s, 2 s, or the like.
Each state datum corresponds to an action: when the robot is in a certain state and is controlled according to the corresponding action, it transitions to the next state.
Wherein the state may be, for example, a robot joint angle and the action may be, for example, a robot joint speed.
In this embodiment, the states and actions in the collected training data set are presented as data pairs in state-action form. For example, the training data set may be written as D = {τ_1, τ_2, …, τ_N}, with τ_i = {(s_t^i, a_t^i)}, where τ_i represents the i-th group of data, and s_t^i and a_t^i respectively represent the state and the action at time t in the i-th group of data.
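For illustration only, a minimal sketch of such a state-action trajectory data set is given below; the array shapes, dimensions, and function name are assumptions made for the example and are not prescribed by the patent.

```python
# Illustrative sketch (not from the patent): each trajectory tau_i is a list of
# (state, action) pairs recorded at a fixed sampling period.
import numpy as np

def make_demo_dataset(num_trajectories=5, horizon=100, state_dim=6, action_dim=6):
    """Build a toy expert data set D = {tau_1, ..., tau_N} of state-action pairs."""
    dataset = []
    for _ in range(num_trajectories):
        states = np.random.randn(horizon, state_dim)    # e.g. joint angles at each time point
        actions = np.random.randn(horizon, action_dim)  # e.g. joint velocities at each time point
        dataset.append(list(zip(states, actions)))      # [(s_t, a_t) for t = 0 .. horizon-1]
    return dataset
```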
In the prior art, a model is directly trained on a training data set of this form, and the robot is then controlled based on the resulting model. For example, a corresponding action is generated based on the current state of the robot in combination with the trained model, and the robot is controlled to reach the state at the next moment according to the generated action. That is, according to the current state s_t of the robot, a corresponding action a_t is generated, and the robot is instructed to perform a_t so as to reach the state s_{t+1} at the next moment.
In the prior art, on one hand, errors exist in the learning process, and the accumulated errors can influence the accuracy of action reproduction. On the other hand, due to the insufficient data volume of the training data set and the existence of accumulated errors, once the accumulated errors or the task environment changes cause the state of the robot to enter the prediction blind area of the model, the task may fail and even damage the robot.
Based on this, in the present embodiment, after the above training data set is acquired it is reprocessed, and model training is performed on the processed data set. The overall idea is to re-label the original state-action training data into a state-action-target state triplet structure. The target state may correspond to an adjacent time point or to a non-adjacent time point. That is, for a certain time point, a triplet may be constructed with the state of an adjacent time point, or triplets may be constructed with the states of several different non-adjacent time points.
This re-labeling finds rich target states in the expert teaching data for training, which is equivalent to expanding the original training data volume and thereby improves the generalization capability of the strategy.
Specifically, in this embodiment, referring to fig. 2 in combination, the states of any two non-adjacent time points and the states of any intermediate time points between the any two non-adjacent time points may be formed into a state data pair.
The state data pair may be expressed in the form "state - intermediate state - target state" (s - s_l - s_h), where the state of the earlier of the two non-adjacent time points is the "state", the state of the intermediate time point is the "intermediate state", and the state of the later of the two non-adjacent time points is the "target state". The interval between the two non-adjacent time points may be one time interval, two time intervals, fifty time intervals, and so on, without limitation. The intermediate time point is likewise not limited: it may, for example, be adjacent to the previous time point or separated from it by two time points.
Thus, the entire process from the "state" to the "target state" is divided into a plurality of different stages by the intermediate time points.
On this basis, the state of the previous time point of the two non-adjacent time points, its corresponding action, and the state of the intermediate time point can be formed into an action data pair.
The action data pair may be represented in the form "state - action - intermediate state" (s - a - s_l). The "state" is the state of the previous time point of the two non-adjacent time points, the "action" is the action corresponding to that previous time point, and the "intermediate state" is the state of the intermediate time point.
Similarly, since the intermediate time point may be set as required, for example in the first half, the middle, or the second half of the period between the two non-adjacent time points, the time interval between the "state" and the "intermediate state" in the action data pair is correspondingly variable.
Based on the above, a state model can be obtained by training on the state data pairs, denoted π_s(s_l | s, s_h), and an action model can be obtained by training on the action data pairs, denoted π_M(a | s, s_l). Combining the state model and the action model yields the learning model. The learning model can output actions for controlling the operation of the robot; an instruction carrying the specific action is then sent to the robot controller to realize the control.
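As one possible realization only (the patent does not fix any network architecture), the state model π_s(s_l | s, s_h) and the action model π_M(a | s, s_l) could each be a small feed-forward network; the layer sizes and the choice of PyTorch below are assumptions.

```python
# Hedged sketch: two small MLPs standing in for the state model pi_s(s_l | s, s_h)
# and the action model pi_M(a | s, s_l). Architecture and sizes are assumptions.
import torch
import torch.nn as nn

class StateModel(nn.Module):
    """Predicts an intermediate state s_l from the current state s and target state s_h."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, s_h):
        return self.net(torch.cat([s, s_h], dim=-1))

class ActionModel(nn.Module):
    """Predicts the action a to take in state s so as to reach the intermediate state s_l."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s, s_l):
        return self.net(torch.cat([s, s_l], dim=-1))
```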
According to the learning method under the layered target condition provided by this embodiment, the training data are divided into state data pairs having intermediate states, and action data pairs containing actions are constructed from the data pairs having intermediate states, so that the entire process can be divided into a plurality of stages. The obtained learning model can therefore check itself periodically during action reproduction based on the information of each stage and correct the action errors of the controlled object in time, thereby effectively suppressing accumulated errors.
In this embodiment, in order to further expand the sample set, the acquired training data set may be processed based on the acquired data set taught by the expert. Alternatively, referring to fig. 3, the training data set may be acquired by:
S1011, acquiring an original data set, wherein the original data set comprises states of the object at all time points within a continuous time period and corresponding actions in all states.
And S1012, controlling the test object based on the original data set, and obtaining the actual state of the test object under each action.
S1013, adding the actual state and the corresponding action of each time point to the original data set to expand and obtain a training data set.
In this embodiment, the original data set may be the data set generated in the expert teaching process described above. A test object, which may be a test robot, is controlled based on the original data set. In the course of controlling the test robot according to the original data set, the test robot generates new data, for example new states, because of actual control errors. State-action data pairs can therefore be constructed from the new actual states and the corresponding actions, and the newly generated data pairs are added to the original data set to obtain the expanded training data set.
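A hedged sketch of this expansion step follows; `execute_action` and `reset_to` are hypothetical robot or simulator interfaces introduced only for illustration, not interfaces defined by the patent.

```python
# Sketch of the data-set expansion: replay the expert actions on a test object and
# record the states it actually reaches. The robot interfaces are hypothetical.
def expand_dataset(original_dataset, execute_action, reset_to):
    expanded = [list(traj) for traj in original_dataset]
    for traj in original_dataset:
        new_traj = []
        s_actual = traj[0][0]
        reset_to(s_actual)                  # start the test object in the first expert state
        for _, a in traj:
            s_next = execute_action(a)      # actual state reached; differs from the expert
            new_traj.append((s_actual, a))  # state because of real control errors
            s_actual = s_next
        expanded.append(new_traj)
    return expanded
```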
On this basis, when constructing the state data pairs, the state sliding window width of each state data pair may be set as required; this width is the period between the time point corresponding to the "state" and the time point corresponding to the "intermediate state" in the state data pair, in other words, the sliding window width determines the position of the intermediate time point. A plurality of state sliding window widths may be set for the state data pairs as desired.
Referring to fig. 4, in the case of determining a state sliding window width of a state data pair, when constructing the state data pair, it may be implemented by:
S1021, starting from the previous time point of the two non-adjacent time points, dividing according to each state sliding window width to determine a plurality of intermediate time points.
S1022, acquiring the state of each of the two non-adjacent time points and the state of each intermediate time point between them, so as to form a plurality of state data pairs.
In general, the earlier and later of the two non-adjacent time points are known; construction starts from the earlier time point, and sub-stages are divided until the later time point is reached. Therefore, starting from the previous time point, the intermediate time point can be determined according to the set state sliding window width, and the time point corresponding to each state in the state data pair is thereby determined.
The obtained state data pair can be formed based on the state of the previous time point, the state of the middle time point and the state of the later time point in the obtained training data set.
In addition, in the present embodiment, a plurality of action sliding window widths of the action data pair may also be set based on the requirement. The action sliding window width refers to a time span between a point in time corresponding to the "state" and a point in time corresponding to the "intermediate state" in the action data pair.
Referring to fig. 5, in the case of determining the motion sliding window width, the motion data pair may be constructed by:
S1031, determining the corresponding intermediate time points according to the previous time point of the two non-adjacent time points and each action sliding window width.
S1032, acquiring the state of the previous time point and the corresponding action, and the states of the intermediate time points, so as to form a plurality of action data pairs.
In this embodiment, the action sliding window width is related to the state sliding window width, because the intermediate time point used in an action data pair is determined based on the state sliding window width. Different action data pairs are simply determined by different action sliding window widths.
For each action data pair, starting from the time point corresponding to its "state", namely the previous time point of the two non-adjacent time points, a time span is determined according to the action sliding window width, and the end point of that span is taken as the intermediate time point. The action data pair is then constructed from the state of the previous time point in the training data set, its corresponding action, and the state of the intermediate time point.
In the corresponding state data pair and action data pair, the time points corresponding to the 'states' are consistent. For example, two non-adjacent time points may be t=1 and t=100, and one state sliding window width is set to d=10.
For the corresponding state data pair and action data pair, in the "state - intermediate state - target state" of the state data pair the time point of the "state" is t=1, the time point of the "intermediate state" is t=11, and the time point of the "target state" is t=100. Correspondingly, in the "state - action - intermediate state" of the action data pair, the action sliding window width is d'=10; the time point of the "state" is t=1, the "action" is the action corresponding to the state at t=1 (executing it converts the state at t=1 into the state at t=2), and the time point of the "intermediate state" is t=11.
The action data pair thus expresses that, to go from the state at t=1 to the state at t=11, the action must be executed at t=1 so as to reach the state at t=2, after which the subsequent transitions can accurately arrive at the intermediate state at t=11.
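A minimal sketch of this relabeling step is given below; the zero-based indexing and the function name are assumptions chosen to mirror the t=1 / t=100 / d=10 example above.

```python
# Sketch of constructing "state - intermediate state - target state" pairs and
# "state - action - intermediate state" pairs from one trajectory, given two
# non-adjacent time points and one or more sliding-window widths.
def build_pairs(trajectory, t_start, t_goal, window_widths=(10,)):
    """trajectory: list of (state, action); t_start < t_goal are non-adjacent time points."""
    state_pairs, action_pairs = [], []
    s_start, a_start = trajectory[t_start]
    s_goal, _ = trajectory[t_goal]
    for d in window_widths:
        t_mid = t_start + d
        if t_mid >= t_goal:
            continue                                    # intermediate point must lie between the two
        s_mid, _ = trajectory[t_mid]
        state_pairs.append((s_start, s_mid, s_goal))    # (s, s_l, s_h)
        action_pairs.append((s_start, a_start, s_mid))  # (s, a, s_l)
    return state_pairs, action_pairs
```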
After the state data pair and the action data pair are constructed in the mode, the state data pair can be used for training to obtain a state model, and the action data pair can be used for training to obtain an action model.
Referring to fig. 6, in this embodiment, training of the state model may be performed by:
s1041, taking states of any two non-adjacent time points in each state data pair as an initial state and a target state respectively, and inputting the initial state and the target state into a state model.
S1042, taking the state of any intermediate time point between any two non-adjacent time points as an intermediate state, taking the intermediate state as a state sample label, and realizing training of a state model based on the state sample label.
In the present embodiment, for each data pair "state-intermediate state-target state", where "state" and "target state" are input to a state model, the state model can output a predicted intermediate state by learning information of "state" and "target state".
The "intermediate state" in the state data pair is used as the state sample label; by continuously learning from this label, the predicted intermediate state output by the model becomes consistent with, or differs only slightly from, the state sample label, which realizes the training of the state model. That is, the trained state model can predict and output the state of the intermediate time point based on the states of two non-adjacent time points.
In addition, referring to fig. 7, in this embodiment, training of the motion model may be achieved by:
s1043, taking the state of the previous time point and the state of the middle time point of the arbitrary two non-adjacent time points as an initial state and a middle state, and inputting the initial state and the middle state into the action model.
S1044, taking the action corresponding to the previous time point in any two non-adjacent time points as an action sample label, and realizing training of an action model based on the action sample label.
In this embodiment, for each action data pair "state-action-intermediate state", the "state" and the "intermediate state" are input to the action model, and the action model outputs a predicted action to be executed in the "state" by learning from the information of the "state" and the "intermediate state".
The "action" in the action data pair is used as the action sample label; the predicted action output by the action model continuously learns to imitate this label, and finally the predicted action is consistent with, or differs only very slightly from, the action sample label, which realizes the training of the action model.
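The supervised training of both models can be sketched as follows; the mean-squared-error loss, the Adam optimizer, and the full-batch updates are assumptions rather than a recipe prescribed by the patent.

```python
# Sketch of training the state model and the action model on the relabeled pairs.
import torch
import torch.nn.functional as F

def train_models(state_model, action_model, state_pairs, action_pairs,
                 epochs=100, lr=1e-3):
    def to_batch(pairs, idx):
        return torch.stack([torch.as_tensor(p[idx], dtype=torch.float32) for p in pairs])

    s_b, sl_b, sh_b = (to_batch(state_pairs, i) for i in range(3))   # (s, s_l, s_h)
    s_a, a_a, sl_a = (to_batch(action_pairs, i) for i in range(3))   # (s, a, s_l)

    opt_s = torch.optim.Adam(state_model.parameters(), lr=lr)
    opt_a = torch.optim.Adam(action_model.parameters(), lr=lr)
    for _ in range(epochs):
        # State model: (state, target state) -> intermediate state, labelled by the recorded s_l.
        loss_s = F.mse_loss(state_model(s_b, sh_b), sl_b)
        opt_s.zero_grad(); loss_s.backward(); opt_s.step()

        # Action model: (state, intermediate state) -> action, labelled by the expert action.
        loss_a = F.mse_loss(action_model(s_a, sl_a), a_a)
        opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    return state_model, action_model
```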
The state model and the action model trained in the above manner may constitute a learning model. The control of the robot can be realized in the actual control stage by utilizing the learning model.
Therefore, referring to fig. 8, on the basis of the above, the learning method under the hierarchical target condition provided in the present embodiment may further include the following steps:
s201, acquiring the current state of an object to be controlled and a preset target state.
S202, inputting the current state and the preset target state into a state model of the learning model to obtain a preset intermediate state.
S203, inputting the current state and the preset intermediate state into an action model of the learning model, and outputting a target action.
S204, controlling the object to be controlled by utilizing the target action.
In this embodiment, the object to be controlled may be a robot that needs to be controlled actually, and the preset target state may be a state that is required in the present control task and that is finally reached by the robot. Wherein the current state of the robot may be collected by devices such as joint encoders, sensors, cameras, etc.
Referring to fig. 2 in combination, the current state and the preset target state of the robot may be input as input information to the state model. The state model may output a preset intermediate state between the current state and the preset target state by processing the current state and the preset target state. The preset intermediate state may be understood as a state corresponding to a certain time point between the start time point and the end time point in the process of controlling the robot.
On this basis, the current state and the preset intermediate state are input to the action model, which outputs the action corresponding to the current state by processing them. This action can be understood as the action that must be performed in the current state in order for the control to accurately reach the preset intermediate state.
Thus, the robot can be controlled based on the obtained motion, thereby completing the control task.
In the above control process, going from the current state to the preset target state requires multiple rounds of control. Step S204 may therefore be implemented as follows; please refer to fig. 9:
S2041, controlling the operation of the object to be controlled by utilizing the target action so as to update the current state of the object to be controlled.
S2042, obtaining updated target actions based on the updated current state and the preset target state, and controlling the object to be controlled by the updated target actions until the difference between the current state and the preset target state is smaller than or equal to a preset threshold value, and stopping controlling the object to be controlled.
In this embodiment, once the action corresponding to the current state is obtained, the robot is controlled to execute it in the current state, and the robot's state changes to the state at the next moment. The state at the next moment is taken as the updated current state, the action for that updated current state is obtained in the same way, and the robot is then controlled to operate according to that action. This continues until the state of the robot after control is consistent with the preset target state, or the difference between them is smaller than or equal to the preset threshold value, at which point the control task ends.
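A minimal sketch of this closed-loop execution is given below; `read_state` and `execute` are hypothetical robot-interface placeholders, and the stopping threshold is illustrative.

```python
# Sketch of the task-execution loop: the state model proposes the preset intermediate
# state (sub-goal), the action model proposes the target action, and control stops
# once the current state is close enough to the preset target state.
import torch

def run_task(state_model, action_model, read_state, execute, s_target,
             threshold=1e-2, max_steps=1000):
    s_target = torch.as_tensor(s_target, dtype=torch.float32)
    for _ in range(max_steps):
        s = torch.as_tensor(read_state(), dtype=torch.float32)
        if torch.norm(s - s_target) <= threshold:
            break                               # difference within the preset threshold
        with torch.no_grad():
            s_mid = state_model(s, s_target)    # preset intermediate state
            a = action_model(s, s_mid)          # target action for the current state
        execute(a.numpy())                      # run the action; the robot's state then updates
```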
According to the learning method under the layered target condition, the training data set is effectively expanded by reconstructing the state data pair and the action data pair, so that the model obtained by training has better generalization capability, task environment changes which cannot be contained in the training data are better dealt with, and the problem of cloning strategy blind areas caused by insufficient data quantity is solved.
In addition, the state model and the action model are obtained through layering strategy training, sub-targets in the track can be automatically searched and updated when the robot executes the track action, and the action error of the robot is corrected in time, so that the problem of poor reproduction accuracy caused by accumulated errors is effectively reduced, the process is automatically completed by the model, manual intervention is not needed, and the labor cost is reduced.
Referring to fig. 10, an exemplary component diagram of an electronic device according to an embodiment of the present application is shown, where the electronic device may be a personal computer, a notebook computer, a server, etc. The electronic device may include a storage medium 110, a processor 120, a learning device 130 under layered target conditions, and a communication interface 140. In this embodiment, the storage medium 110 and the processor 120 are both located in the electronic device and are separately disposed. However, it should be understood that the storage medium 110 may also be separate from the electronic device and accessible to the processor 120 through a bus interface. Alternatively, the storage medium 110 may be integrated into the processor 120, for example, as a cache and/or general purpose registers.
The learning means 130 under the hierarchical target condition may be understood as the above-mentioned electronic device, or the processor 120 of the electronic device, or may be understood as a software functional module that implements the learning method under the hierarchical target condition under the control of the electronic device, independently of the above-mentioned electronic device or the processor 120.
As shown in fig. 11, the learning device 130 under the above hierarchical target condition may include an acquisition module 131, a first construction module 132, a second construction module 133, and a training module 134. The functions of the respective functional blocks of the learning device 130 under the hierarchical target condition are described in detail below.
The obtaining module 131 is configured to obtain a training data set, where the training data set includes states of the object at each time point and corresponding actions under each state in a continuous time period;
it will be appreciated that the acquisition module 131 may be used to perform step S101 described above, and reference may be made to the details of the implementation of the acquisition module 131 as described above with respect to step S101.
A first construction module 132, configured to construct a state data pair from states of any two non-adjacent time points and states of any intermediate time points between the any two non-adjacent time points;
it will be appreciated that the first building block 132 may be configured to perform step S102 described above, and reference may be made to the details of implementation of the first building block 132 regarding step S102 described above.
A second construction module 133, configured to construct an action data pair from the state of the previous time point and the corresponding action of the two non-adjacent time points, and the state of the middle time point;
It will be appreciated that the second building block 133 may be used to perform step S103 described above, and reference may be made to the details of step S103 regarding the implementation of the second building block 133.
The training module 134 is configured to train to obtain a state model by using the pair of state data, train to obtain an action model by using the pair of action data, and obtain a learning model based on the state model and the action model.
It will be appreciated that the training module 134 may be used to perform step S104 described above, and reference may be made to the details of step S104 regarding the implementation of the training module 134.
In one possible implementation, the training module 134 may be configured to:
the states of any two non-adjacent time points in each state data pair are respectively used as an initial state and a target state, and the initial state and the target state are input into a state model;
taking the state of any intermediate time point between any two non-adjacent time points as an intermediate state, taking the intermediate state as a state sample label, and realizing training of a state model based on the state sample label.
In one possible implementation, the training module 134 may be configured to:
The state of the previous time point in the arbitrary two non-adjacent time points and the state of the middle time point are respectively used as an initial state and a middle state, and the initial state and the middle state are input into an action model;
and taking the action corresponding to the previous time point in any two non-adjacent time points as an action sample label, and realizing training of an action model based on the action sample label.
In one possible implementation, the learning device 130 under the hierarchical target condition further includes a control module that may be configured to:
acquiring the current state of an object to be controlled and a preset target state;
inputting the current state and a preset target state into a state model of the learning model to obtain a preset intermediate state;
inputting the current state and the preset intermediate state into an action model of the learning model, and outputting a target action;
and controlling the object to be controlled by utilizing the target action.
In one possible implementation, the control module may be configured to:
controlling the operation of the object to be controlled by utilizing the target action so as to update the current state of the object to be controlled;
And obtaining an updated target action based on the updated current state and the preset target state, and controlling the object to be controlled by the updated target action until the difference between the current state and the preset target state is smaller than or equal to a preset threshold value, and stopping controlling the object to be controlled.
In one possible implementation, the learning device 130 under the hierarchical target condition further includes a setting module that may be used to:
setting a plurality of state sliding window widths of the state data pairs;
the first building block 132 may be configured to:
starting from the previous time point in any two non-adjacent time points, dividing the time points according to the width of each state sliding window to determine a plurality of middle time points;
the state of each time point in any two non-adjacent time points and the state of each intermediate time point between any two non-adjacent time points are acquired to form a plurality of state data pairs.
In one possible implementation, the setting module may be further configured to:
setting a plurality of action sliding window widths of action data pairs;
the second building block 133 may be configured to:
determining corresponding intermediate time points according to the previous time point in the two non-adjacent time points and the width of each action sliding window;
And acquiring the state of the previous time point and the corresponding action, and the states of the intermediate time points to form a plurality of action data pairs.
In one possible implementation, the acquiring module 131 may be configured to:
acquiring an original data set, wherein the original data set comprises states of the object at each time point within a continuous time period and corresponding actions in each state;
controlling a test object based on the original data set, and obtaining the actual state of the test object under each action;
and adding the actual state and the corresponding action of each time point into the original data set to expand and obtain a training data set.
The process flow of each module in the apparatus and the interaction flow between the modules may be described with reference to the related descriptions in the above method embodiments, which are not described in detail herein.
Further, the embodiment of the present application also provides a computer readable storage medium, where machine executable instructions are stored, where the machine executable instructions when executed implement the learning method under the hierarchical target condition provided in the above embodiment.
Specifically, the computer-readable storage medium can be a general-purpose storage medium, such as a removable disk, a hard disk, or the like, and when the computer program on the computer-readable storage medium is executed, the learning method under the above-described hierarchical target condition can be executed. With respect to the processes involved in the computer readable storage medium and when executed as executable instructions thereof, reference is made to the relevant descriptions of the method embodiments described above and will not be described in detail herein.
In summary, in the learning method and device under a hierarchical target condition and the electronic device according to the embodiments of the present application, after the training data set is acquired, the states of any two non-adjacent time points in the training data set and the state of an intermediate time point between those two time points form a state data pair, and the state of the earlier of the two non-adjacent time points, its corresponding action, and the state of the intermediate time point form an action data pair. A state model is obtained by training on the state data pairs, an action model is obtained by training on the action data pairs, and a learning model is obtained based on the state model and the action model. In this scheme, the training data are divided into state data pairs having intermediate states, and action data pairs containing actions are constructed from the data pairs having intermediate states, so that the entire process can be divided into a plurality of stages. The resulting learning model can therefore check itself periodically during action reproduction based on the information of each stage and correct the action errors of the controlled object in time, thereby effectively suppressing accumulated errors.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present application should be included in the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A method of learning under a hierarchical target condition, the method comprising:
acquiring a training data set, wherein the training data set comprises states of an object at each time point and corresponding actions under each state in a continuous time;
the states of any two non-adjacent time points and the states of any intermediate time points between the any two non-adjacent time points form a state data pair;
forming an action data pair by the state of the previous time point in any two non-adjacent time points, the corresponding action and the state of the middle time point;
training to obtain a state model by using the state data pair, training to obtain an action model by using the action data pair, and obtaining a learning model based on the state model and the action model;
the step of training to obtain a state model by using the state data pair comprises the following steps:
the states of any two non-adjacent time points in each state data pair are respectively used as an initial state and a target state, and the initial state and the target state are input into a state model; taking the state of any intermediate time point between any two non-adjacent time points as an intermediate state, taking the intermediate state as a state sample label, and realizing training of a state model based on the state sample label;
The step of training to obtain the action model by utilizing the action data pair comprises the following steps:
the state of the previous time point in the arbitrary two non-adjacent time points and the state of the middle time point are respectively used as an initial state and a middle state, and the initial state and the middle state are input into an action model; and taking the action corresponding to the previous time point in any two non-adjacent time points as an action sample label, and realizing training of an action model based on the action sample label.
2. The method of learning under hierarchical target conditions of claim 1, further comprising:
acquiring the current state of an object to be controlled and a preset target state;
inputting the current state and a preset target state into a state model of the learning model to obtain a preset intermediate state;
inputting the current state and the preset intermediate state into an action model of the learning model, and outputting a target action;
and controlling the object to be controlled by utilizing the target action.
3. The learning method under the hierarchical target condition according to claim 2, characterized in that the step of controlling the object to be controlled using the target action includes:
Controlling the operation of the object to be controlled by utilizing the target action so as to update the current state of the object to be controlled;
and obtaining an updated target action based on the updated current state and the preset target state, and controlling the object to be controlled by the updated target action until the difference between the current state and the preset target state is smaller than or equal to a preset threshold value, and stopping controlling the object to be controlled.
4. The method of learning under hierarchical target conditions of claim 1, further comprising:
setting a plurality of state sliding window widths of the state data pairs;
the step of composing the states of any two non-adjacent time points and the states of any intermediate time points between the any two non-adjacent time points into a state data pair includes:
starting from the previous time point in any two non-adjacent time points, dividing the time points according to the width of each state sliding window to determine a plurality of middle time points;
the state of each time point in any two non-adjacent time points and the state of each intermediate time point between any two non-adjacent time points are acquired to form a plurality of state data pairs.
5. The method of learning under hierarchical target conditions of claim 4, further comprising:
setting a plurality of action sliding window widths for the action data pairs;
the step of composing the state of the earlier of the two non-adjacent time points, the corresponding action, and the state of the intermediate time point into an action data pair comprises the following steps:
determining corresponding intermediate time points according to the earlier of the two non-adjacent time points and each action sliding window width;
and acquiring the state of the earlier time point and its corresponding action, and the states of the intermediate time points, to form a plurality of action data pairs.
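One possible reading of the sliding-window construction in claims 4 and 5 is sketched below; `pair_span` and the window widths are assumed example parameters, not values given by the patent.

```python
def build_data_pairs(states, actions, pair_span=8,
                     state_widths=(2, 4), action_widths=(1, 2)):
    """states[t] / actions[t]: one trajectory ordered by time.
    Each (t, t + pair_span) is treated as a pair of non-adjacent time points."""
    state_pairs, action_pairs = [], []
    for t in range(len(states) - pair_span):
        k = t + pair_span
        # state data pairs: intermediate points stepped by each state window width
        for w in state_widths:
            for m in range(t + w, k, w):
                state_pairs.append(((states[t], states[k]), states[m]))
        # action data pairs: one intermediate point per action window width
        for w in action_widths:
            m = t + w
            if m < k:
                action_pairs.append(((states[t], states[m]), actions[t]))
    return state_pairs, action_pairs
```

Each state data pair keeps the two non-adjacent states as input and one intermediate state as label; each action data pair keeps the earlier state and an intermediate state as input and the earlier action as label, matching the training sketch after claim 1.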
6. The method of learning under hierarchical target conditions of claim 1, wherein the step of acquiring a training dataset comprises:
acquiring an original data set, wherein the original data set comprises states of the object at each time point over a continuous period of time and the corresponding action in each state;
controlling a test object based on the original data set, and obtaining the actual state of the test object under each action;
and adding the actual state and the corresponding action at each time point to the original data set to obtain an expanded training data set.
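A hedged sketch of the data-set expansion in claim 6: the recorded actions are replayed on a test object, the states it actually reaches are logged, and those state/action pairs are appended to the original data. `apply_action` and `read_state` are again hypothetical interfaces to the test object.

```python
def expand_dataset(original_pairs, apply_action, read_state):
    """original_pairs: list of (state, action) tuples from the original data set."""
    expanded = list(original_pairs)                # keep the original data
    for _, action in original_pairs:
        apply_action(action)                       # control the test object with the recorded action
        actual_state = read_state()                # actual state reached under that action
        expanded.append((actual_state, action))    # add the actual state and its action
    return expanded
```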
7. A learning device under a hierarchical target condition, the device comprising:
an acquisition module, configured to acquire a training data set, wherein the training data set comprises states of the object at each time point over a continuous period of time and the corresponding action in each state;
a first construction module, configured to construct a state data pair from the states of any two non-adjacent time points and the states of the intermediate time points between the two non-adjacent time points;
a second construction module, configured to form an action data pair from the state of the earlier of the two non-adjacent time points, the corresponding action, and the state of the intermediate time point;
a training module, configured to train a state model using the state data pairs, train an action model using the action data pairs, and obtain a learning model based on the state model and the action model;
wherein the training module is configured to take the states of the two non-adjacent time points in each state data pair as an initial state and a target state respectively, and input the initial state and the target state into the state model; take the state of the intermediate time point between the two non-adjacent time points as an intermediate state, use the intermediate state as a state sample label, and train the state model based on the state sample label;
and the training module is configured to take the state of the earlier of the two non-adjacent time points and the state of the intermediate time point as an initial state and an intermediate state respectively, and input the initial state and the intermediate state into the action model; and take the action corresponding to the earlier of the two non-adjacent time points as an action sample label, and train the action model based on the action sample label.
8. An electronic device comprising one or more storage media and one or more processors in communication with the storage media, the one or more storage media storing machine-executable instructions that, when the electronic device runs, are executed by the one or more processors to perform the method steps recited in any one of claims 1-6.
CN202210863041.8A 2022-07-21 2022-07-21 Learning method and device under layered target condition and electronic equipment Active CN115204387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210863041.8A CN115204387B (en) 2022-07-21 2022-07-21 Learning method and device under layered target condition and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210863041.8A CN115204387B (en) 2022-07-21 2022-07-21 Learning method and device under layered target condition and electronic equipment

Publications (2)

Publication Number Publication Date
CN115204387A CN115204387A (en) 2022-10-18
CN115204387B true CN115204387B (en) 2023-10-03

Family

ID=83584426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210863041.8A Active CN115204387B (en) 2022-07-21 2022-07-21 Learning method and device under layered target condition and electronic equipment

Country Status (1)

Country Link
CN (1) CN115204387B (en)


Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE112019002622T5 (en) * 2018-05-21 2021-04-01 Movidius, LTD. METHODS, SYSTEMS, ARTICLES FOR MANUFACTURING AND DEVICES FOR THE RECONSTRUCTION OF SCENES USING CONVOLUTIONAL NEURAL NETWORKS
US11468334B2 (en) * 2018-06-19 2022-10-11 International Business Machines Corporation Closed loop model-based action learning with model-free inverse reinforcement learning
US20200034665A1 (en) * 2018-07-30 2020-01-30 DataRobot, Inc. Determining validity of machine learning algorithms for datasets
US11562500B2 (en) * 2019-07-24 2023-01-24 Squadle, Inc. Status monitoring using machine learning and machine vision
US11568312B2 (en) * 2019-12-03 2023-01-31 Upstart Network, Inc. Augmenting machine learning models to incorporate incomplete datasets
US11379710B2 (en) * 2020-02-28 2022-07-05 International Business Machines Corporation Personalized automated machine learning
US20210374612A1 (en) * 2020-05-26 2021-12-02 Nec Laboratories America, Inc. Interpretable imitation learning via prototypical option discovery
US20220111860A1 (en) * 2020-10-14 2022-04-14 Volkswagen Aktiengesellschaft Detecting objects and determining behaviors of objects

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017097693A (en) * 2015-11-26 2017-06-01 Kddi株式会社 Data prediction device, information terminal, program, and method performing learning with data of different periodic layer
EP3401847A1 (en) * 2017-05-09 2018-11-14 Omron Corporation Task execution system, task execution method, training apparatus, and training method
WO2021050488A1 (en) * 2019-09-15 2021-03-18 Google Llc Determining environment-conditioned action sequences for robotic tasks
CN110738717A (en) * 2019-10-16 2020-01-31 网易(杭州)网络有限公司 Method and device for correcting motion data and electronic equipment
CN111144580A (en) * 2019-12-31 2020-05-12 中国电子科技集团公司信息科学研究院 Hierarchical reinforcement learning training method and device based on simulation learning
CN111985640A (en) * 2020-07-10 2020-11-24 清华大学 Model training method based on reinforcement learning and related device
WO2022012265A1 (en) * 2020-07-13 2022-01-20 Guangzhou Institute Of Advanced Technology, Chinese Academy Of Sciences Robot learning from demonstration via meta-imitation learning
CN112396180A (en) * 2020-11-25 2021-02-23 中国科学院自动化研究所 Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
CN114493013A (en) * 2022-01-28 2022-05-13 浙江同善人工智能技术有限公司 Smart agent path planning method based on reinforcement learning, electronic device and medium
CN114723065A (en) * 2022-03-22 2022-07-08 中国人民解放军国防科技大学 Optimal strategy obtaining method and device based on double-layer deep reinforcement learning model
CN114474075A (en) * 2022-03-28 2022-05-13 法奥意威(苏州)机器人系统有限公司 Robot spiral track control method and device, storage medium and electronic equipment
CN114683288A (en) * 2022-05-07 2022-07-01 法奥意威(苏州)机器人系统有限公司 Robot display and control method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Improving Multi-step Prediction of Learned Time Series Models; A. Venkatraman et al.; National Conference on Artificial Intelligence; Full text *
Application of Action Prediction in Multi-Robot Reinforcement Learning Cooperation; Cao Jie et al.; Computer Engineering and Applications; Full text *

Also Published As

Publication number Publication date
CN115204387A (en) 2022-10-18

Similar Documents

Publication Publication Date Title
JP6817431B2 (en) Neural architecture search
JP6828121B2 (en) Training neural networks with prioritized empirical memory
CN112119409B (en) Neural network with relational memory
WO2019131527A1 (en) Method for generating universal learned model
CN104985599A (en) Intelligent robot control method and system based on artificial intelligence and intelligent robot
CN112119404A (en) Sample efficient reinforcement learning
US20210158162A1 (en) Training reinforcement learning agents to learn farsighted behaviors by predicting in latent space
WO2018153807A1 (en) Action selection for reinforcement learning using neural networks
US20190262990A1 (en) Robot skill management
Cai et al. Finite-time robust synchronization for discontinuous neural networks with mixed-delays and uncertain external perturbations
WO2018143019A1 (en) Information processing device, information processing method, and program recording medium
JP2010179454A5 (en)
CA3166388A1 (en) Planning for agent control using learned hidden states
CN114139637A (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
JP2022523484A (en) Controlling agents to explore the environment using the likelihood of observations
CN115204387B (en) Learning method and device under layered target condition and electronic equipment
EP3788554B1 (en) Imitation learning using a generative predecessor neural network
Korbicz et al. Confidence estimation of GMDH neural networks and its application in fault detection systems
CN114529010A (en) Robot autonomous learning method, device, equipment and storage medium
Floyd et al. Building learning by observation agents using jloaf
Wang An I-POMDP based multi-agent architecture for dialogue tutoring
JP6713099B2 (en) Learned model integration method, device, program, IC chip, and system
CN103440157B (en) A kind of method and apparatus of the template for obtaining virtual machine
GB2611731A (en) Continual learning using cross connections
CN113408782A (en) Robot path navigation method and system based on improved DDPG algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant