CN115439510A - Active target tracking method and system based on expert strategy guidance

Info

Publication number
CN115439510A
Authority
CN
China
Prior art keywords
expert
tracker
student
target object
scene
Prior art date
Legal status
Granted
Application number
CN202211388347.9A
Other languages
Chinese (zh)
Other versions
CN115439510B (en)
Inventor
宋然
栾迎新
张钰荻
张伟
李晓磊
张倩
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202211388347.9A priority Critical patent/CN115439510B/en
Publication of CN115439510A publication Critical patent/CN115439510A/en
Application granted granted Critical
Publication of CN115439510B publication Critical patent/CN115439510B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/90 Determination of colour characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image


Abstract

The invention discloses an active target tracking method and system based on expert strategy guidance, belonging to the technical field of active target tracking and comprising the following steps: acquiring a scene observation image, a scene map, and agent poses; obtaining, from the scene map and the agent poses, a local map of each agent and the motion trajectories of all agents in each local map as first training data; inputting the first training data into an expert tracker and an expert target object respectively, the expert target object and the expert tracker performing adversarial reinforcement learning, with the expert tracker outputting a suggested action; inputting the scene observation image into a student tracker and training the student tracker with the suggested action as the label of the scene observation image to obtain a trained student tracker; and recognizing acquired real-time scene images with the trained student tracker to obtain the agent's decision action. Accurate tracking of the target is thereby achieved.

Description

Active target tracking method and system based on expert strategy guidance
Technical Field
The invention relates to the technical field of active target tracking, in particular to an active target tracking method and system based on expert strategy guidance.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Active target tracking means that, in a dynamic three-dimensional scene, an agent equipped with a camera autonomously adjusts its actions so that the target object always stays at the center of its field of view with a relatively stable size and posture. The current state-of-the-art active target tracking methods are fully end-to-end optimization methods relying on deep reinforcement learning. The whole end-to-end optimization process is data-driven: the neural network needs enough good samples to optimize its parameters, and reinforcement learning needs to explore many states and actions. However, conventional active target tracking methods adopt a direct adversarial learning strategy, and the trained target object does not exploit obstacles, so it cannot pose sufficient challenges to the tracker, such as moving around an obstacle or disappearing from the tracker's field of view. As a result, a tracker that can handle complex scenes cannot be trained, and existing methods cannot guarantee accurate target tracking in complex environments.
Disclosure of Invention
To solve the above problems, the invention provides an active target tracking method and system based on expert strategy guidance, which can achieve active target tracking in complex scenes.
To achieve this purpose, the invention adopts the following technical scheme:
In a first aspect, an active target tracking method based on expert strategy guidance is disclosed, comprising:
acquiring a scene observation image, a scene map, and agent poses;
obtaining, from the scene map and the agent poses, a local map of each agent and the motion trajectories of all agents in each local map as first training data;
inputting the first training data into an expert tracker and an expert target object respectively, the expert target object and the expert tracker performing adversarial reinforcement learning, with the expert tracker outputting a suggested action;
inputting the scene observation image into a student tracker and training the student tracker with the suggested action as the label of the scene observation image to obtain a trained student tracker;
and recognizing acquired real-time scene images with the trained student tracker to obtain the agent's decision action.
In a second aspect, an active target tracking system based on expert strategy guidance is disclosed, comprising:
a training data acquisition module for acquiring a scene observation image, a scene map, and agent poses;
a first-stage training module for obtaining, from the scene map and the agent poses, a local map of each agent and the motion trajectories of all agents in each local map as first training data, inputting the first training data into an expert tracker and an expert target object respectively, the expert target object and the expert tracker performing adversarial reinforcement learning, with the expert tracker outputting a suggested action;
a student tracker training module for inputting the scene observation image into a student tracker and training the student tracker with the suggested action as the label of the scene observation image to obtain a trained student tracker;
and an instance tracking module for recognizing acquired real-time scene images with the trained student tracker to obtain the agent's decision action.
In a third aspect, an electronic device is provided, comprising a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are executed by the processor, the steps of the active target tracking method based on expert strategy guidance are completed.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions which, when executed by a processor, complete the steps of the active target tracking method based on expert strategy guidance.
Compared with the prior art, the invention has the beneficial effects that:
1. In the method, an expert model is trained from agent-centered local maps of the scene and the agents' motion trajectories in those maps: the expert tracker outputs a suggested action and the expert target object outputs an escape strategy. The suggested action output by the expert tracker is then used as the label of the scene observation image, the scene observation image is input into a student tracker, and the student tracker is trained to obtain a trained student tracker, so that the strong scene understanding and decision-making abilities of the expert tracker are transferred into the student tracker. The student tracker thereby acquires an obstacle avoidance capability and improved performance, while the extra overhead of online map building during inference is avoided, which raises the computation rate and guarantees real-time target tracking.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a block diagram of the overall structure of the method disclosed in Embodiment 1;
FIG. 2 is the global map constructed for a training scenario in Embodiment 1;
FIG. 3 is a visualization of the maps and agent trajectories used when training the expert agents disclosed in Embodiment 1;
FIG. 4 is a schematic comparison of the reward mechanism disclosed in Embodiment 1, wherein (a) is the obstacle distribution and (b) is the reward mechanism used in expert tracker training;
FIG. 5 shows planned target object trajectories used when verifying the tracking effect of the tracker disclosed in Embodiment 1;
FIG. 6 is a simulation demonstration of the tracker disclosed in Embodiment 1, wherein (a) is the expert tracker demonstration and (b) is the student tracker demonstration.
Detailed Description
The invention is further described with reference to the following figures and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Embodiment 1
To improve the accuracy and real-time performance of active target tracking, this embodiment discloses an active target tracking method based on expert strategy guidance, as shown in FIG. 1, comprising:
S1: acquiring a scene observation image, a scene map, and agent poses.
The acquired scene observation image is an RGB or RGB-D observation of the scene from the tracker's viewpoint at each moment.
The position and scale of each obstacle in the scene are determined, and a global scene map is constructed from the obstacle scale and position information, denoted \(\mathcal{M}\), where grid cells occupied by obstacles are assigned non-zero values between 0 and 1 (shown as light cells in FIG. 2) and unoccupied cells are set to 0 (shown as dark cells in FIG. 2).
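As a concrete illustration of this occupancy encoding, the following is a minimal sketch in Python; the axis-aligned box obstacle format, the map size, the resolution, and the height normalization are illustrative assumptions, not values specified by this embodiment.

```python
import numpy as np

def build_global_map(obstacles, size=(400, 400), resolution=0.1, max_height=2.0):
    """Build a global occupancy grid M from obstacle positions and scales.

    obstacles: list of (x, y, width, depth, height) tuples in metres
    (an assumed obstacle format for illustration).
    """
    M = np.zeros(size, dtype=np.float32)              # unoccupied cells are 0
    for x, y, w, d, h in obstacles:
        r0, r1 = int(y / resolution), int((y + d) / resolution)
        c0, c1 = int(x / resolution), int((x + w) / resolution)
        # occupied cells take a non-zero value in (0, 1], here scaled by height
        M[r0:r1, c0:c1] = min(h / max_height, 1.0)
    return M
```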
S2: obtaining, from the scene map and the agent poses, a local map of each agent and the motion trajectories of all agents in each local map as first training data.
This embodiment represents the environmental structure information with a grid map centered on an agent. To obtain the structure of the environment around the agents, at each moment \(t\) the poses of the agents in the scene are obtained as \(p_t = (p_{t,1}, p_{t,2})\), where \(p_{t,1}\) is the pose of the tracker under the global map and \(p_{t,2}\) is the pose of the target object under the global map. The global map \(\mathcal{M}\) is rotated and translated according to these poses, i.e. the transformation from the global map coordinate system of the scene map to the agent-centered coordinate system is computed, yielding the agent-centered local maps \(m_t = (m_{t,1}, m_{t,2})\), where subscripts 1 and 2 denote the tracker and the target object respectively: \(m_{t,1}\) is the local map centered on the tracker and \(m_{t,2}\) is the local map centered on the target object. The process can be expressed as:

\[ m_{t,i} = \mathcal{W}\big(\mathcal{M},\, p_{t,i}\big) \qquad (1) \]

where \(m_{t,i}\) is the local map centered on agent \(i\), and \(\mathcal{W}(\mathcal{M}, p_{t,i})\) denotes transforming the global map \(\mathcal{M}\) into the coordinate system centered on the agent pose \(p_{t,i}\).
Coordinate transformation is commonly used to establish a one-to-one correspondence between two different coordinate systems. Suppose coordinate system \(A\) is rotated counterclockwise around its own \(Z\) axis by an angle \(\theta\) and then translated by \((x_0, y_0)\) so that it coincides with coordinate system \(B\). Then a point \((x_A, y_A)\) in coordinate system \(A\) and the corresponding point \((x_B, y_B)\) in coordinate system \(B\) satisfy the one-to-one correspondence:

\[ \begin{pmatrix} x_B \\ y_B \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_A - x_0 \\ y_A - y_0 \end{pmatrix} \qquad (2) \]

Equation (2) is used to transform the global map \(\mathcal{M}\) into the coordinate system centered on the agent pose \(p_{t,i}\).
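A minimal sketch of equation (2) in Python follows; the function name and the row-vector convention are illustrative assumptions.

```python
import numpy as np

def to_agent_frame(points_xy, agent_pose):
    """Transform global 2D points into the agent-centered frame, as in eq. (2).

    points_xy: (N, 2) array of coordinates in the global frame.
    agent_pose: (x0, y0, theta), the agent's position and heading in the global
    frame (the agent frame is the global frame rotated by theta and then
    translated by (x0, y0)).
    """
    x0, y0, theta = agent_pose
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, s],
                  [-s, c]])                 # rotation by -theta, as in eq. (2)
    return (np.asarray(points_xy) - np.array([x0, y0])) @ R.T

# Example: a point 1 m ahead of an agent heading along the global +Y axis
# maps to (1, 0) in the agent-centered frame.
p = to_agent_frame([[0.0, 1.0]], (0.0, 0.0, np.pi / 2))
```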
To let every agent know its own motion pattern and those of the other agents, the motion trajectories of all agents are represented on each agent's local map. An agent's trajectory is constructed by collecting its poses over historical frames and transforming the historical poses into the coordinate system centered on the current agent using the coordinate transformation above. In addition, to encode timing information in the trajectories, the trajectories of all agents are represented as an arithmetic sequence over pose time.

At time \(t\), the motion trajectories \(T = (\tau_1, \tau_2)\) of all agents relative to the global map are collected for agent \(j\), together with its pose \(p_{t,j}\), where trajectory \(\tau_i\) is constructed from the \(N\) historical poses of agent \(i\). Each historical pose of every agent is transformed and assigned a time-dependent value. At time \(t-k\), the historical pose of agent \(i\) under the coordinate system of the agent at time \(t\), denoted \(\hat{p}_{t-k,i}\), can be expressed as:

\[ \hat{p}_{t-k,i} = \mathcal{W}\big(p_{t-k,i},\, p_{t,j}\big) \qquad (3) \]

\[ v_{t-k} = 1 - k/N \qquad (4) \]

where \(\mathcal{W}(p_{t-k,i}, p_{t,j})\) denotes transforming the absolute pose \(p_{t-k,i}\) of an agent into the coordinate system centered on \(p_{t,j}\), and \(v_{t-k}\) is the value representing the time distance. Thus, the motion trajectory of each agent in the local map of agent \(j\) can be expressed as:

\[ \tau_i = \big\{ (\hat{p}_{t-k,i},\, v_{t-k}) \big\}_{k=1}^{N} \qquad (5) \]
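The sketch below assembles such a trajectory in Python, reusing to_agent_frame from the sketch above; the linear ramp \(v = 1 - k/N\) is an assumed concrete form of the arithmetic sequence described in the text.

```python
import numpy as np

def build_trajectory(history_poses, center_pose, N):
    """Build the trajectory of one agent in the frame of the center agent.

    history_poses: the agent's last N global poses (x, y, theta), most recent
    first; center_pose: pose of the agent the local map is centered on.
    """
    traj = []
    for k, pose in enumerate(history_poses[:N], start=1):
        rel = to_agent_frame([pose[:2]], center_pose)[0]   # eq. (3)
        v = 1.0 - k / N                                    # eq. (4), assumed form
        traj.append((rel[0], rel[1], v))                   # one entry of eq. (5)
    return traj
```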
the visual results of the constructed local maps and the movement tracks of the agents in the maps are shown in fig. 3, wherein the black part in the map is a passable area, the white line is the track of the tracker and the target object, the rest white or gray parts are non-passable parts, and the lighter the color is, the higher the barrier height is.
S3: inputting the first training data into an expert tracker and an expert target object respectively, the expert target object and the expert tracker performing adversarial reinforcement learning, with the expert tracker outputting a suggested action and the expert target object outputting the escape strategy corresponding to the target.

The expert tracker comprises a convolutional neural network and a sequence model: the agent's local map and motion trajectories are encoded by the convolutional neural network to obtain encoded information, and the encoded information is processed by the sequence model to obtain the decision action.

Each expert agent needs a model with sufficient expressive power to map its input to a simple action. The expert tracker first encodes the environment structure information and the agent motion information with a convolutional neural network to obtain encoded information, where the environment structure information is the agent's local map and the agent motion information is the agents' motion trajectories in that map; a sequence model then models the dynamics between sequential observations, estimates the environment state, and outputs the corresponding action distribution. In addition, a value function of the current state is estimated simultaneously for iterative evaluation and improvement of the policy.

The structure of the expert tracker is therefore shown in Table 1, where C5x5-32S1P2 denotes a convolutional layer with 32 convolution kernels of size 5x5, each with stride 1 and padding 2; LSTM256 denotes that the sequence model is a long short-term memory (LSTM) network with input and output dimensionality 256; and FC6 denotes a fully connected layer with output dimension 6.
Each expert tracker takes as input its own local map \(m_{t,i}\) and the motion trajectories \((\tau_1, \tau_2)\) of all agents in the local map, and outputs the predicted action \(a_{t,i}\). The calculation of the predicted action can be expressed as equation (6):

\[ a_{t,i} = \pi_i\big(m_{t,i},\, \tau_1,\, \tau_2\big) \qquad (6) \]

TABLE 1 Model structure of the expert model
(Table 1 is provided as an image; per the description above it consists of C5x5-32S1P2 convolutional layers, an LSTM256 sequence model, and an FC6 output layer.)
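A sketch of such a network in Python (PyTorch) is given below; the input channel count, the pooling stage, and the use of a single convolutional layer are assumptions, since Table 1 itself is not recoverable, while the C5x5-32S1P2, LSTM256, and FC6 blocks follow the description above.

```python
import torch
import torch.nn as nn

class ExpertTracker(nn.Module):
    """Sketch of the expert model: conv encoder -> LSTM -> policy/value heads."""

    def __init__(self, in_channels=3, num_actions=6):
        super().__init__()
        self.encoder = nn.Sequential(
            # C5x5-32S1P2: 32 kernels of size 5x5, stride 1, padding 2
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 256),
        )
        self.lstm = nn.LSTM(input_size=256, hidden_size=256, batch_first=True)  # LSTM256
        self.policy = nn.Linear(256, num_actions)   # FC6: logits over 6 actions
        self.value = nn.Linear(256, 1)              # state value for RL updates

    def forward(self, maps, hidden=None):
        """maps: (batch, time, channels, H, W), local map plus trajectory layers."""
        b, t = maps.shape[:2]
        feat = self.encoder(maps.flatten(0, 1)).view(b, t, 256)
        out, hidden = self.lstm(feat, hidden)
        return self.policy(out), self.value(out), hidden
```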
The expert tracker and the expert target object adopt an occlusion-aware reward mechanism: when the expert tracker is not occluded, the reward value of the expert tracker is restricted to the range 0 to 1; when it is occluded, the reward value of the expert tracker is set to -1.

Whether occlusion occurs can be judged from the map and the relative orientations and positions of the agents: occlusion occurs when any point on the line segment connecting the agents is marked as occupied on the map. The reward of the expert tracker can be expressed as:

\[ r_1 = \begin{cases} \max\Big(0,\; 1 - \dfrac{|d - d^{*}|}{d_{\max}} - \dfrac{|\alpha - \alpha^{*}|}{\alpha_{\max}}\Big), & \text{no occlusion} \\ -1, & \text{occlusion} \end{cases} \qquad (7) \]

where \(r_1\) is the reward of the expert tracker, \(d\) and \(d^{*}\) are the actual and expected distances between the expert tracker and the target object, \(\alpha\) and \(\alpha^{*}\) are the actual and expected angles of the target object relative to the expert tracker, and \(d_{\max}\) and \(\alpha_{\max}\) are the maximum distance and angle the expert tracker can see; the subscript indicating time is omitted.

Besides its own observation, the expert target object also has access to the expert tracker's observation and to the reward value the tracker is predicted to obtain. The reward of the expert target object is the negative of the reward of the expert tracker, maintaining a zero-sum competitive relationship between the two agents. Thus, when occlusion occurs, the expert tracker is penalized for the unfavorable tracking state, while the expert target object is rewarded for being in a state that helps it escape the tracker's line of sight. FIG. 4 shows the relationship between the position of the expert target object and the reward obtained by the expert tracker when the expert tracker is fixed at (0, 0); in FIG. 4, (a) is the obstacle distribution and (b) is the reward mechanism used for training the expert tracker.

It can be seen that the reward mechanism proposed in this embodiment gives timely feedback to the expert tracker or the expert target object when occlusion occurs.
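The following Python sketch implements the occlusion test and equation (7); sampling points along the segment is an assumed discretization of the line-of-sight check, and the clipping to [0, 1] follows the stated reward range.

```python
import numpy as np

def occluded(M, p_tracker, p_target, resolution=0.1, samples=100):
    """True if any map cell on the segment between the two agents is occupied."""
    for s in np.linspace(0.0, 1.0, samples):
        x, y = (1 - s) * np.asarray(p_tracker) + s * np.asarray(p_target)
        if M[int(y / resolution), int(x / resolution)] > 0:
            return True
    return False

def tracker_reward(M, p1, p2, d_star, a_star, d_max, a_max):
    """Occlusion-aware reward of the expert tracker, as in eq. (7).

    p1, p2: (x, y, theta) poses of tracker and target in the global frame.
    """
    if occluded(M, p1[:2], p2[:2]):
        return -1.0
    d = np.linalg.norm(np.asarray(p2[:2]) - np.asarray(p1[:2]))
    a = np.arctan2(p2[1] - p1[1], p2[0] - p1[0]) - p1[2]   # bearing of target
    a = (a + np.pi) % (2 * np.pi) - np.pi                  # wrap to [-pi, pi]
    r = 1.0 - abs(d - d_star) / d_max - abs(a - a_star) / a_max
    return float(np.clip(r, 0.0, 1.0))
```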
The specific process of obtaining the suggested action output by the expert tracker is as follows:

the first training data are input into the expert tracker and the expert target object respectively; the expert tracker is pre-trained through adversarial learning against the expert target object, during which the expert tracker outputs decision actions, the expert target object outputs the escape strategy corresponding to the target, and an expert strategy pool is built from the strategies of the expert target object model;

a fine-tuning expert target object model is selected from the expert strategy pool;

the fine-tuning expert target object model and the pre-trained expert tracker perform adversarial learning so as to fine-tune the pre-trained expert tracker, and the fine-tuned expert tracker outputs the suggested action.

In specific implementation, the training process of the expert tracker is divided into adversarial expert strategy learning and fine-tuning of the expert tracker.

First, the first training data are input into the expert tracker and the expert target object model respectively, and the two are optimized through adversarial reinforcement learning to generate diversified strategies; this is the pre-training process of the expert tracking model. As optimization proceeds, the expert target object model generates different strategies to escape the tracking of the expert tracker, and the expert tracker learns various strategies to cope with the escape strategies of the expert target object. In this process, not only is a relatively strong expert tracker model learned, but the strategies of the expert target object model are also saved after 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.5, 6.5, 7.0, 8.0 and 9.5 million interactions to build the expert strategy pool.

Second, the tracker expert model is fine-tuned. As reinforcement learning progresses, the expert tracker gradually forgets past ways of dealing with earlier escape strategies, so further adjustment of the expert tracker is required. In this process, the pre-trained expert tracker performs adversarial training against the expert target object models in the expert strategy pool, trying to learn a stronger strategy that copes with all the strategy models in the expert target object strategy pool; evaluated 100 times in the training environment, the episode length of the expert tracker stabilizes above 495.
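A minimal sketch of the expert strategy pool used in this fine-tuning stage follows; the checkpoint file names and the uniform sampling are illustrative assumptions.

```python
import random
import torch

# snapshots of the expert target object policy saved during pre-training
CHECKPOINTS = [f"expert_target_{m}M.pt"
               for m in (2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.5, 6.5, 7.0, 8.0, 9.5)]

def sample_opponent(target_model):
    """Load a randomly sampled target-object strategy from the pool."""
    state = torch.load(random.choice(CHECKPOINTS))
    target_model.load_state_dict(state)
    return target_model
```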
S4: inputting the scene observation image into a student tracker and training the student tracker with the suggested action as the label of the scene observation image to obtain a trained student tracker.

This embodiment trains a simple, lightweight student tracker under the guidance of the expert strategy. In this stage, the input of the student tracker is set to the scene observation image from the tracker's viewpoint at each moment. The optimization of the student tracker is a supervised learning process: the student tracker is trained with double constraints on the feature space and the output space, so that the strong scene understanding and decision-making abilities of the expert tracker are transferred to the student tracker. When training the student tracker, it is guided by the student target object; to generate diversified target object strategies, the model parameters of the student target object are randomly sampled during training from the expert target object strategy pool built in the first stage.

TABLE 2 Model structure of the student tracker
(Table 2 is provided as an image; per the description below it consists of C5x5-32S1P2 convolutional layers, an LSTM256 sequence model, and an FC6 output layer.)

The model structure of the student tracker, shown in Table 2, comprises a convolutional neural network and a sequence model: the convolutional neural network encodes the input observation image to obtain encoded information, and the sequence model processes the encoded information to obtain the decision action. C5x5-32S1P2 denotes a convolutional layer with 32 convolution kernels of size 5x5, each with stride 1 and padding 2; LSTM256 denotes that the sequence model uses long short-term memory units with input and output dimensionality 256; and FC6 denotes a fully connected layer with output dimension 6.
The supervision signal used in training the student tracker has two parts, a feature space constraint and an action space constraint, because the student tracker is required to migrate both the scene understanding and the decision-making ability of the expert tracker. The loss function \(\mathcal{L}\) of the student tracker is therefore defined in two parts:

\[ \mathcal{L} = \mathcal{L}_{\mathrm{act}} + \lambda\, \mathcal{L}_{\mathrm{feat}} \qquad (8) \]

where \(\mathcal{L}_{\mathrm{feat}}\) and \(\mathcal{L}_{\mathrm{act}}\) are the loss functions in the feature space and the action space respectively, and \(\lambda\) is a hyperparameter set to 0.1.

The suggested action output by the expert tracker is used as a dense supervision signal for training the student tracker, and the KL divergence is used to force the output of the student tracker to approach the output of the expert tracker. At each time step, the expert tracker observes the current privileged information and gives a suggested action as the data label for training the student tracker model. In training, the KL divergence forces the output of the student tracker toward the output of the expert tracker, and this part of the loss can be expressed as:

\[ \mathcal{L}_{\mathrm{act}} = \mathrm{KL}\big( \pi^{E}_{t} \,\big\|\, \pi^{S}_{t} \big) \qquad (9) \]

where \(\pi^{S}_{t}\) is the output of the student tracker at time \(t\) and \(\pi^{E}_{t}\) is the output of the expert tracker at time \(t\).

For the student tracker to have a stronger scene understanding ability, it is forced to learn features similar to the expert tracker's. Therefore, this embodiment takes as the feature space constraint a loss measuring the similarity between the convolutional neural network outputs of the expert and student trackers, which can be expressed as:

\[ \mathcal{L}_{\mathrm{feat}} = \mathrm{MSE}\big( f^{S}_{t},\, f^{E}_{t} \big) \qquad (10) \]

where MSE denotes the mean squared error loss, and \(f^{S}_{t}\) and \(f^{E}_{t}\) are the outputs of the last convolutional layers of the student tracker and the expert tracker respectively.
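A Python (PyTorch) sketch of equations (8)-(10) follows; treating the FC6 outputs as logits and attaching the weight \(\lambda\) to the feature term are assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, expert_logits,
                      student_feat, expert_feat, lam=0.1):
    """Double-constrained student loss, as in eqs. (8)-(10)."""
    # eq. (9): KL divergence pushing the student's action distribution toward
    # the expert's; F.kl_div expects log-probabilities as its first argument
    l_act = F.kl_div(F.log_softmax(student_logits, dim=-1),
                     F.softmax(expert_logits, dim=-1),
                     reduction="batchmean")
    # eq. (10): MSE between the last convolutional features of the two models
    l_feat = F.mse_loss(student_feat, expert_feat.detach())
    return l_act + lam * l_feat                    # eq. (8)
```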
Furthermore, to help mine difficult samples for training the student tracker, during training the student target object model selects actions using a strategy randomly sampled from the target object expert strategy pool built in the first stage.
S5: recognizing acquired real-time scene images with the trained student tracker to obtain the agent's decision action.

The active target tracking model (EG-AOT) built in this embodiment is shown in FIG. 1 and comprises an expert model and a student model: the expert model comprises an expert tracker and an expert target object that learn against each other, and the student model comprises a student tracker and a student target object, the student target object guiding the student tracker.

This embodiment verifies the performance of the active target tracking model (EG-AOT) using a target object based on point-to-point navigation (Nav) and a target object based on trajectory planning (PathPlanning).
The trajectory-planning target object can access the scene map directly and plans its trajectory in two steps. First, at the beginning of each episode, the target object randomly selects from the map \(\mathcal{M}\) two points on either side of each obstacle; these \(2n\) points (for \(n\) obstacles) serve as path-level sub-goal points and are connected into a closed-loop path, and the A* algorithm then computes a final path that avoids the obstacles. Second, secondary sub-goal points, more numerous than the path-level sub-goal points, are screened out of the path again so that the target object can avoid obstacles; at each moment the expected travel speed and rotation angle of the target object are determined from its current heading and from the distance and angle to the secondary sub-goal position, and the actual travel speed is obtained by adding some noise to the expected speed. Because PathPlanning plans a path in advance using the environment map, the target object has the ability to avoid obstacles and more opportunities to challenge the tracker, for example when the target is occluded by an obstacle. A schematic diagram of the planned paths of some target objects is shown in FIG. 5.
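The two-stage procedure can be sketched as follows in Python; the A* routine is elided and passed in as a callable, and the helper names and the noise scale are illustrative assumptions.

```python
import numpy as np

def plan_episode_path(obstacle_centers, astar):
    """Stage 1: pick two sub-goals per obstacle, close the loop, refine with A*.

    obstacle_centers: list of (x, y) arrays; astar(p, q) -> list of waypoints.
    """
    subgoals = []
    for c in obstacle_centers:
        side = np.random.randn(2)
        side /= np.linalg.norm(side)
        subgoals += [np.asarray(c) + side, np.asarray(c) - side]  # both sides
    path = []
    for p, q in zip(subgoals, subgoals[1:] + subgoals[:1]):       # closed loop
        path += astar(p, q)                       # obstacle-avoiding segments
    return path

def step_command(pose, subgoal, v_expect=150.0, noise_std=10.0):
    """Stage 2: expected speed and rotation toward the next secondary sub-goal,
    with noise added to the expected speed to get the actual speed."""
    dx, dy = subgoal[0] - pose[0], subgoal[1] - pose[1]
    angle = np.arctan2(dy, dx) - pose[2]          # heading error to sub-goal
    v_actual = v_expect + np.random.randn() * noise_std
    return v_actual, angle
```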
In the first-stage training of the active target tracking model, the size of the local map is set to 80x80, where the side length of each grid cell corresponds to a distance of 10 cm in the simulation environment and the center of the map is the position of the agent. The model is trained on a computer, with 6 threads used for model optimization. The total number of interactions between the agents and the environment is 10 million each for the adversarial learning and the fine-tuning of the expert tracker. In the second stage, the observation data of the student model are resized to 80x80 before being input into the model, 4 threads are used for model optimization, and the number of updates is 20 million. The other hyperparameters used for training and evaluation are shown in Table 3, and the action spaces of the tracker and the target object are shown in Tables 4 and 5.

TABLE 3 Hyperparameters of EG-AOT in training and evaluation
(Table 3 is provided as an image.)
TABLE 4 Action space of the active tracker

Action | Speed (cm/s) | Angle (deg)
Move forward | 200 | 0
Move backward | -200 | 0
Forward-right | 150 | 45
Forward-left | 150 | -45
Turn right | 0 | 45
Turn left | 0 | -45
Stop | 0 | 0

TABLE 5 Action space of the learnable target object

Action | Speed (cm/s) | Angle (deg)
Move forward | 150 | 0
Move backward | -150 | 0
Forward-right | 100 | 45
Forward-left | 100 | -45
Turn right | 0 | 45
Turn left | 0 | -45
Stop | 0 | 0
The performance of the model is evaluated with the expected position difference, the episode length, the success rate, and the occlusion rate. The indices are described as follows:

Expected position difference: the accumulated value of the per-step expected position difference; the larger the value, the better.

Episode length: the visible area is defined as the sector in front of the tracker with a radius of 750 cm and a span of 90 degrees. The current episode stops as soon as the target stays outside this area for 5 seconds or the episode length reaches 500.

Success rate: when the episode length reaches 500, the episode is recorded as a successful tracking, and the success rate is the ratio of the number of successful trackings over all trials.
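The bookkeeping for these indices can be sketched as below in Python; the run_episode interface is an assumption.

```python
def evaluate(run_episode, trials=100, max_len=500):
    """Aggregate episode length, success rate, and occlusion rate.

    run_episode(max_len) is assumed to return (steps, occluded_steps).
    """
    lengths, successes, occlusion = [], [], []
    for _ in range(trials):
        steps, occluded_steps = run_episode(max_len)
        lengths.append(steps)
        successes.append(steps >= max_len)       # length 500 counts as success
        occlusion.append(occluded_steps / steps)
    n = float(trials)
    return sum(lengths) / n, sum(successes) / n, sum(occlusion) / n
```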
The student tracker disclosed in this embodiment is compared with benchmark methods, including the recent AD-VAT and AD-VAT+ algorithms. For fairness, the student tracker in this embodiment uses the same RGB image input and the same network model structure as the benchmark methods; RGB-D variants of the AD-VAT and AD-VAT+ algorithms are also constructed and compared with the student tracker using RGB-D images as input.

TABLE 6 Results of the comparative experiments against the benchmark methods with RGB input
(Table 6 is provided as an image.)

Comparing the experimental results of the models with RGB input (shown in Table 6, with the target object using the Nav strategy): compared with the benchmark methods, the student tracker proposed in this embodiment obtains longer episode lengths and better success rates in most scenes, and improves on the average result. This is because, although the same model structure and observation input are adopted, the student tracker disclosed in this embodiment migrates the scene understanding and decision-making abilities of the expert tracker and has a certain ability to handle obstacles, and can therefore achieve a performance improvement.

Comparing the experimental results of the models with RGB-D input (shown in Table 7, with the target object using the Nav strategy): overall, the conclusions for RGB-D input are similar to those for RGB input. Although the student tracker proposed in this embodiment is inferior to the benchmark methods on the expected position difference index, it obtains better results on average episode length and success rate. Moreover, the improvement of the student tracker over the corresponding benchmark method is larger for RGB input, because spatial cues are more lacking in RGB data, which makes scene understanding harder for the model to learn.

TABLE 7 Results of the comparative experiments against the benchmark methods with RGB-D input
(Table 7 is provided as an image.)

Note: the results are the mean and variance of 100 repeated tests, expressed as "mean ± variance"; the best results are shown in bold; the last column is the average result over all scenes.

Runtime comparison: the runtime of the proposed model is consistent with the benchmark methods, at 0.002260 s per frame for RGB input and 0.002943 s per frame for RGB-D input.
To verify the reasonableness and superiority of the expert strategy disclosed in this embodiment, two other expert strategies, Depth and MaskDepth, are additionally constructed and compared experimentally.

Depth: the tracker takes the ground-truth first-person depth image as model input, and the learnable target person takes as input its own ground-truth first-person depth image, the tracker's first-person depth image, and the action taken by the tracker.

MaskDepth: the tracker takes as input its first-person semantic segmentation map and ground-truth depth image concatenated along the channel dimension, and the learnable target person takes as input its own first-person semantic segmentation map and ground-truth depth image, the tracker's first-person semantic segmentation map and depth image, and the action taken by the tracker. The tracker model structure is shown in Table 8.

The experimental results are shown in Table 9. Comparing the performance of the same tracker strategy on the various evaluation indices, in particular the occlusion rate, the ability to use obstacles to make difficulties for the tracker ranks: Nav < PathPlanning < the expert target object proposed in this embodiment. In fact, Nav can hardly exploit obstacles at all; PathPlanning gains a certain obstacle-exploiting ability by manually selecting path sub-goal points close to obstacles using obstacle position information and planning an obstacle-avoiding path with the A* algorithm; and the adversarial reinforcement learning between the expert tracker and the expert target object proposed in this embodiment gives access to more complete obstacle position information and target motion information, so actions can be selected considering both the structure of the environment around the tracker and the tracker's motion, yielding a stronger ability than PathPlanning to create difficult tracking scenes with obstacles.
TABLE 8 Tracker model structure
(Table 8 is provided as an image.)

TABLE 9 Performance comparison of the expert strategies
(Table 9 is provided as an image.)

Note: the results are the mean and variance of 100 repeated tests, expressed as "mean ± variance"; the best results are shown in bold; the last column is the average result over all scenes.

Furthermore, the expert tracker proposed in this embodiment shows the best performance on all evaluation indices as the target strategy changes. More specifically, as the obstacle-exploiting ability of the target strategy increases, the tracking performance of both the Depth tracker and the MaskDepth tracker drops markedly: the success rate of the Depth tracker drops from 0.86 to 0.41, and that of the MaskDepth tracker from 0.77 to 0.33. In contrast, the expert tracker proposed in this embodiment achieves robust tracking throughout: over 100 tests the average episode length stays at 495 and the success rate at 0.9. The proposed expert tracker also always has a lower occlusion rate, i.e. it handles occlusion better than the other expert trackers.
To show the performance of the proposed method more intuitively, the expert tracker and the student tracker proposed in this embodiment are each run and demonstrated in a virtual environment. The results are shown in FIG. 6, where the characters are virtual: (a) is the expert tracker demonstration and (b) is the student tracker demonstration, both rendered from the tracker's first-person view. The number in the upper-left corner of each frame is the current frame number. The leftmost column is a schematic diagram of the relative positions of the tracker, the target object, and the obstacles, where the two darker circles mark the positions where the target starts and ends its movement, the two lighter circles mark the positions where the tracker starts and ends its movement, the dotted lines and arrows indicate the movement trajectory and direction respectively, and the middle rectangle or ellipse is an obstacle.

In the method disclosed by this embodiment, an expert model is trained from agent-centered local maps of the scene and the agents' motion trajectories in those maps, with the expert tracker outputting a suggested action and the expert target object outputting an escape strategy; the suggested action output by the expert tracker is then used as the label of the scene observation image, the scene observation image is input into a student tracker, and the student tracker is trained to obtain a trained student tracker, transferring the strong scene understanding and decision-making abilities of the expert tracker into the student tracker.
Embodiment 2
This embodiment provides an active target tracking system based on expert strategy guidance, comprising:
a training data acquisition module for acquiring a scene observation image, a scene map, and agent poses;
a first-stage training module for obtaining, from the scene map and the agent poses, a local map of each agent and the motion trajectories of all agents in each local map as first training data, inputting the first training data into an expert tracker and an expert target object respectively, the expert target object and the expert tracker performing adversarial reinforcement learning, with the expert tracker outputting a suggested action and the expert target object outputting the escape strategy corresponding to the target;
a student tracker training module for inputting the scene observation image into a student tracker and training the student tracker with the suggested action as the label of the scene observation image to obtain a trained student tracker;
and an instance tracking module for recognizing acquired real-time scene images with the trained student tracker to obtain the agent's decision action.
Embodiment 3
This embodiment discloses an electronic device comprising a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are executed by the processor, the steps of the active target tracking method based on expert strategy guidance disclosed in Embodiment 1 are completed.
Embodiment 4
This embodiment discloses a computer-readable storage medium for storing computer instructions which, when executed by a processor, complete the steps of the active target tracking method based on expert strategy guidance disclosed in Embodiment 1.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the invention and not to limit them. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (10)

1. An active target tracking method based on expert strategy guidance, characterized by comprising the following steps:
acquiring a scene observation image, a scene map, and agent poses;
obtaining, from the scene map and the agent poses, a local map of each agent and the motion trajectories of all agents in each local map as first training data;
inputting the first training data into an expert tracker and an expert target object respectively, the expert target object and the expert tracker performing adversarial reinforcement learning, with the expert tracker outputting a suggested action;
inputting the scene observation image into a student tracker and training the student tracker with the suggested action as the label of the scene observation image to obtain a trained student tracker;
and recognizing acquired real-time scene images with the trained student tracker to obtain the agent's decision action.
2. The active target tracking method based on expert strategy guidance according to claim 1, wherein the expert tracker and the expert target object adopt an occlusion-aware reward mechanism: when the expert tracker is not occluded, the reward value of the expert tracker is restricted to the range 0 to 1, and when the expert tracker is occluded, the reward value of the expert tracker is set to -1.
3. The active target tracking method based on expert strategy guidance according to claim 1, wherein, when the expert target object and the expert tracker perform adversarial reinforcement learning, the expert target object outputs the escape strategy corresponding to the target, and an expert strategy pool is built from the strategies of the expert target object model.
4. The active target tracking method based on expert strategy guidance according to claim 3, wherein the specific process of obtaining the suggested action output by the expert tracker is:
inputting the first training data into the expert tracker and the expert target object respectively, pre-training the expert tracker through adversarial learning against the expert target object, the expert tracker outputting decision actions and the expert target object outputting the escape strategy corresponding to the target during pre-training, and building an expert strategy pool from the strategies of the expert target object model;
selecting a fine-tuning expert target object model from the expert strategy pool;
performing adversarial learning between the fine-tuning expert target object model and the pre-trained expert tracker to fine-tune the pre-trained expert tracker, the fine-tuned expert tracker outputting the suggested action.
5. The active target tracking method based on expert strategy guidance according to claim 3, wherein, in training the student tracker, the student tracker is guided by a student target object, the student target object model being an expert target object model from the expert strategy pool.
6. The active target tracking method based on expert strategy guidance according to claim 1, wherein the expert tracker and the student tracker each comprise a convolutional neural network and a sequence model; the convolutional neural network in the expert tracker encodes the local map and the relative motion trajectories of the agents to obtain encoded information, and the encoded information is processed by the sequence model to obtain the suggested action; the convolutional neural network in the student tracker encodes the scene observation image to obtain encoded information, and the encoded information is processed by the sequence model to obtain the decision action.
7. The active target tracking method based on expert strategy guidance according to claim 6, wherein the loss function of the student tracker comprises a loss in the feature space and a loss in the action space; the loss in the action space is calculated by the KL divergence, and the loss in the feature space is obtained by calculating the similarity between the output of the convolutional neural network in the expert tracker and that in the student tracker.
8. An active target tracking system based on expert strategy guidance, characterized by comprising:
a training data acquisition module for acquiring a scene observation image, a scene map, and agent poses;
a first-stage training module for obtaining, from the scene map and the agent poses, a local map of each agent and the motion trajectories of all agents in each local map as first training data, inputting the first training data into an expert tracker and an expert target object respectively, the expert target object and the expert tracker performing adversarial reinforcement learning, with the expert tracker outputting a suggested action;
a student tracker training module for inputting the scene observation image into a student tracker and training the student tracker with the suggested action as the label of the scene observation image to obtain a trained student tracker;
and an instance tracking module for recognizing acquired real-time scene images with the trained student tracker to obtain the agent's decision action.
9. An electronic device comprising a memory, a processor, and computer instructions stored in the memory and run on the processor, wherein, when the computer instructions are executed by the processor, the steps of the active target tracking method based on expert strategy guidance according to any one of claims 1-7 are completed.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, complete the steps of the active target tracking method based on expert strategy guidance according to any one of claims 1-7.
CN202211388347.9A 2022-11-08 2022-11-08 Active target tracking method and system based on expert strategy guidance Active CN115439510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211388347.9A CN115439510B (en) 2022-11-08 2022-11-08 Active target tracking method and system based on expert strategy guidance


Publications (2)

Publication Number Publication Date
CN115439510A 2022-12-06
CN115439510B 2023-02-28

Family

ID=84252026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211388347.9A Active CN115439510B (en) 2022-11-08 2022-11-08 Active target tracking method and system based on expert strategy guidance

Country Status (1)

Country Link
CN (1) CN115439510B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102901500A (en) * 2012-09-17 2013-01-30 西安电子科技大学 Aircraft optimal path determination method based on mixed probability A star and agent
US20200033868A1 (en) * 2018-07-27 2020-01-30 GM Global Technology Operations LLC Systems, methods and controllers for an autonomous vehicle that implement autonomous driver agents and driving policy learners for generating and improving policies based on collective driving experiences of the autonomous driver agents
CN112215364A (en) * 2020-09-17 2021-01-12 天津(滨海)人工智能军民融合创新中心 Enemy-friend depth certainty strategy method and system based on reinforcement learning
CN112325884A (en) * 2020-10-29 2021-02-05 广西科技大学 ROS robot local path planning method based on DWA
WO2021073528A1 (en) * 2019-10-18 2021-04-22 华中光电技术研究所(中国船舶重工集团有限公司第七一七研究所) Intelligent decision-making method and system for unmanned surface vehicle
CN112908042A (en) * 2015-03-31 2021-06-04 深圳市大疆创新科技有限公司 System and remote control for operating an unmanned aerial vehicle
WO2021208771A1 (en) * 2020-04-18 2021-10-21 华为技术有限公司 Reinforced learning method and device
CN114154913A (en) * 2021-12-14 2022-03-08 上海启泷通教育科技有限公司 Method for acquiring big exercise data of infants and finely analyzing health data
CN115100238A (en) * 2022-05-24 2022-09-23 北京理工大学 Knowledge distillation-based light single-target tracker training method
CN115164890A (en) * 2022-06-09 2022-10-11 复旦大学 Swarm unmanned aerial vehicle autonomous motion planning method based on simulation learning
WO2022222490A1 (en) * 2021-04-21 2022-10-27 中国科学院深圳先进技术研究院 Robot control method and robot


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHAO LI et al., "Comparative Research of Dynamic Target Detection Algorithms Based on Static Background", 2021 Photonics & Electromagnetics Research Symposium (PIERS) *
ZHANG Wenxu et al., "Cooperative coverage with ground-air heterogeneous multi-agents based on reinforcement learning", CAAI Transactions on Intelligent Systems *
CHEN Jianhang, "End-to-end active tracking system based on deep reinforcement learning", Wanfang Data *

Also Published As

Publication number Publication date
CN115439510B (en) 2023-02-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant