CN117893873B - Active tracking method based on multi-mode information fusion - Google Patents


Info

Publication number
CN117893873B
CN117893873B
Authority
CN
China
Prior art keywords: training, information, network, fusion, tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410304634.XA
Other languages
Chinese (zh)
Other versions
CN117893873A (en)
Inventor
周云
吴巧云
谭春雨
伍煜强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202410304634.XA
Publication of CN117893873A
Application granted
Publication of CN117893873B
Legal status: Active


Abstract

The invention relates to an active tracking method based on multi-modal information fusion, which comprises the following steps: acquiring three kinds of data information, namely a color image I_rgb, a depth image I_depth and a normal map I_normal; inputting the three kinds of data information into a multi-modal information preprocessing module to obtain their initial features; the multi-modal information fusion module adopts a two-stage training mode of pre-training and formal training to perform feature fusion on the initial features, and the formal training feature output is fed into a reinforcement learning AC framework network RACNet with an information-fusion regularization constraint, which outputs the corresponding predicted execution action. The invention uses the multi-modal information acquired by the agent to describe the current state more accurately, and adds a constraint on the fused features to improve the training efficiency of the reinforcement learning algorithm, thereby achieving ideal results in both training efficiency and tracking precision.

Description

Active tracking method based on multi-mode information fusion
Technical Field
The invention relates to the technical field of target tracking, in particular to an active tracking method based on multi-mode information fusion.
Background
In the field of computer vision research, target tracking is a very challenging task. Target tracking in the general sense means that the target to be tracked is given in an initial frame and its position is continuously output in subsequent frames. In this setting it is usually assumed that the camera is fixed and a moving target in its field of view is tracked. Under this assumption the target easily moves out of the camera's field of view or is occluded by other objects, making it difficult for the tracker to track and locate the target accurately. Vision-based active target tracking instead adjusts the position and focal length of the camera in real time according to the target position in the visual observation, controlling the camera to move with the target so that the target always stays within the field of view; it therefore has important theoretical research significance and practical application value.
Existing active tracking work mainly feeds the visual color images perceived by the agent into a convolutional network to obtain the state representation at the current moment, which is then passed to a subsequent reinforcement learning network. However, extracting state representations with a plain convolutional network in this way is time-consuming and inefficient. On the one hand, the visual information perceived by the agent carries, besides the color image, information of multiple modalities such as the depth image and the normal map; effectively fusing this multi-modal information provides a more informative state representation for the active tracking algorithm and describes the current state more accurately, thereby accelerating reinforcement learning training and improving its effect. On the other hand, the form of the feature after multi-modal information fusion is also very important for the subsequent reinforcement learning network training, and a regularization constraint on the fused feature can further improve the training efficiency and effect of reinforcement learning.
In the fields of computer vision and reinforcement learning research, active tracking is a challenging emerging task. Most previous active tracking algorithms neither make full use of the multi-modal data information acquired by the agent nor further constrain the fused features, and therefore struggle to achieve a satisfactory tracking effect. To solve the above problems, a more efficient active tracking algorithm is needed to improve training efficiency and tracking performance.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an active tracking method based on multi-modal information fusion, which addresses the problems that traditional active tracking methods neither fully utilize the multi-modal data information acquired by the agent nor further constrain the fused features, leading to low training efficiency and poor tracking performance. The method fuses the multi-modal information acquired by the agent to describe the current state more accurately, and adds a constraint on the fused features to improve the training efficiency of the reinforcement learning algorithm, thereby achieving ideal results in both training efficiency and tracking precision.
In order to solve the above technical problems, the invention provides the following technical scheme: an active tracking method based on multi-modal information fusion, comprising the following steps:
S1, acquiring data information of multiple modalities under the viewing angle of the tracking agent in an active tracking virtual environment based on the UE framework, namely three kinds of data information: a color image I_rgb, a depth image I_depth and a normal map I_normal;
S2, constructing a feature extraction fusion network FEFNet with a multi-modal information fusion mechanism, the network comprising a multi-modal information preprocessing module and a multi-modal information fusion module;
S3, inputting the three kinds of data information from S1 into the multi-modal information preprocessing module to obtain their initial features F_rgb, F_depth, F_normal;
S4, the multi-modal information fusion module adopts a two-stage training mode of pre-training and formal training to perform feature fusion on the initial features F_rgb, F_depth, F_normal, obtaining the pre-training feature output F_out^1 and the formal training feature output F_out^2;
S5, constructing a reinforcement learning AC framework network RACNet with an information-fusion regularization constraint, inputting the formal training feature output F_out^2 into the network, and outputting the corresponding predicted execution action.
Further, in step S1, the specific process includes the following steps:
S11, setting up two agents with the ability to move in the UE-framework-based active tracking virtual environment, namely a tracking agent and a target agent, both controlled by a preset program to move in the environment;
S12, acquiring the data information of multiple modalities under the viewing angle of the tracking agent in real time through interaction code in the virtual environment, namely the three data forms of the color image I_rgb, the depth image I_depth and the normal map I_normal.
Further, in step S3, the multi-modal information preprocessing module is obtained by stacking n CONV-MP-ReLU layers, so that the preprocessing module can perform initial feature extraction on the input data information of each modality and adjust the three kinds of input data information, the color image I_rgb, the depth image I_depth and the normal map I_normal, to a uniform size t×t. The initial features of the preprocessed data information of each modality are respectively:
F_rgb = CMR^n(I_rgb), F_depth = CMR^n(I_depth), F_normal = CMR^n(I_normal)
wherein CMR denotes the serial combination of a CONV convolution layer, a MaxPooling max-pooling layer and a ReLU activation layer, and CMR^n denotes the superposition of n CONV-MP-ReLU layers.
Further, in step S4, the multi-modal information fusion module comprises the multi-modal information fusion module MMF_1 used in the pre-training process and the multi-modal information fusion module MMF_2 used in the formal training process. MMF_1 consists of a direct weighted fusion network, i.e. the pre-training feature output F_out^1 is:
F_out^1 = W_rgb · F_rgb + W_depth · F_depth + W_normal · F_normal
wherein W_rgb, W_depth and W_normal are the weights corresponding to each modality;
the multi-modal information fusion module MMF_2 used in the formal training process consists of a mapping-coding structure, i.e. the formal training feature output F_out^2 is:
F_out^2 = Proj{ Norm( Softmax( Linear(F_dn)^T ⊗ Linear(F_rgb) ) ) }
wherein F_dn denotes the feature obtained by fusing the depth image information and the normal map information, [·]^T denotes the matrix transpose operation, ⊗ denotes the matrix multiplication operation, Linear(·) denotes a Linear layer, Softmax(·) denotes a Softmax layer, Norm(·) denotes a normalization operation, and Proj{·} denotes a linear mapping operation on the result.
Further, in step S4, the multi-modal information fusion module adopts a two-stage training mode of pre-training and formal training, specifically: in training stage 1, the training network structure combines the multi-modal information preprocessing module, the multi-modal information fusion module MMF_1 and the reinforcement learning network RACNet, and a suitable multi-modal information preprocessing module is trained first; in training stage 2, the suitable multi-modal information preprocessing module parameters obtained in stage 1 are preloaded as the multi-modal information preprocessing module of stage 2, and the multi-modal information fusion module in the training network structure is replaced by the MMF_2 form for stage-2 training.
Further, in step S5, a reinforcement learning AC framework network RACNet with an information-fusion regularization constraint is constructed, specifically: a bi-regularization constraint is applied to the formal training feature output F_out^2 of the multi-modal information fusion module, i.e. the singular matrix Σ of F_out^2 and its singular values are constrained,
wherein σ_max and σ_min are respectively the largest and smallest singular values in the singular matrix Σ, and W_1 and W_2 are respectively the weights corresponding to the bi-regularization constraint; by constraining the singular matrix and its singular values, the extracted feature F_out^2 can better represent the current state, thereby improving the performance of the model;
on the basis of the reinforcement learning AC framework, the reinforcement learning algorithm loss function after adding the bi-regularization constraint is expressed as:
Loss = W_actor · Loss_Actor + W_critic · Loss_Critic + W_norm · Loss_Norm
wherein Loss_Actor and Loss_Critic are the loss functions corresponding to the Actor network and the Critic network under the reinforcement learning AC framework, Loss_Norm is the loss function corresponding to the bi-regularization constraint term Norm_out on F_out^2, and W_actor, W_critic and W_norm are the weights corresponding to each loss.
Further, in step S5, the corresponding predicted execution action is output, specifically including the following steps:
S51, sequentially inputting the data information acquired in real time into the feature extraction fusion network FEFNet and the reinforcement learning AC framework network RACNet with the information-fusion regularization constraint, and outputting the corresponding predicted execution action;
S52, adjusting the motion direction of the tracking agent in real time according to the obtained action instruction, so that the tracking agent adjusts its motion according to the position of the target agent under the current viewing angle and performs accurate active target tracking;
S53, repeating steps S51-S52 until the target agent is lost from the field of view of the tracking agent or has been tracked for a preset maximum number of frames.
By means of the above technical scheme, the invention provides an active tracking method based on multi-modal information fusion with at least the following beneficial effects:
1. The invention performs active tracking by fusing multi-modal information. The feature extraction fusion network based on the multi-modal information fusion mechanism can exploit the multi-modal information acquired by the agent more comprehensively than the previous plain convolutional networks, achieving an ideal tracking performance. Compared with a traditional simple deep learning network model, the feature extraction fusion network under the multi-modal information fusion mechanism processes and utilizes the input data information of multiple modalities more efficiently;
2. A multi-modal information extraction and fusion mechanism is introduced for feature extraction, and the designed feature extraction fusion network can effectively fuse the feature information of the color image, depth image, normal map and other modalities. The multi-modal information fusion module adopts a two-stage training mode of pre-training and formal training for feature fusion, which further improves the ability of the extracted features to describe and characterize the current state, improves active tracking performance, and achieves a more robust tracking effect;
3. In the reinforcement learning framework network, a further regularization constraint on the form of the input characterization features improves the training efficiency and accuracy of the subsequent reinforcement learning network. Compared with unconstrained characterization features, training convergence and training effect are both improved, and faster active target tracking can be realized.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of an active tracking method based on multi-modal information fusion in the present invention;
FIG. 2 is a schematic view of a random environment selected during training in accordance with the present invention;
FIG. 3 is a schematic view of a simulation environment selected during testing in accordance with the present invention;
FIG. 4 is a schematic diagram of training length curves and rewards curves during a one-stage training process according to the present invention;
FIG. 5 is a schematic diagram of training length curves and rewards curves during a two-stage training process of the present invention;
Fig. 6 is a graph of partial frame tracking results for city streetscapes and indoor scenes in accordance with the present invention.
Detailed Description
In order that the above objects, features and advantages of the present application may be more readily understood, the application is described in more detail below with reference to the accompanying drawings and the detailed description, so that how the technical means are applied to solve the technical problems and achieve the technical effects can be fully understood and implemented.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in a method of implementing an embodiment described above may be implemented by a program to instruct related hardware, and thus, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Referring to fig. 1-6, a specific implementation of the present embodiment is shown, and the method constructs a feature extraction fusion network with a multi-modal information fusion mechanism and a reinforcement learning framework network with information fusion regularization constraints for training and testing. The method utilizes the multi-mode information acquired by the intelligent agent to carry out fusion to describe the current state more accurately, and increases the constraint on the characteristics after fusion to improve the training efficiency of the reinforcement learning algorithm, thereby achieving ideal effects on the training efficiency and tracking precision.
Referring to fig. 1, the embodiment provides an active tracking method based on multi-mode information fusion, which includes the following steps:
S1, acquiring data information of multiple modalities under the viewing angle of the tracking agent in an active tracking virtual environment based on the UE framework, namely three kinds of data information: a color image I_rgb, a depth image I_depth and a normal map I_normal;
as a preferred embodiment of step S1, the specific procedure comprises the steps of:
S11, setting up two agents with the ability to move in the UE-framework-based active tracking virtual environment, namely a tracking agent and a target agent, both controlled by a preset program to move in the environment;
S12, acquiring the data information of multiple modalities under the viewing angle of the tracking agent in real time through interaction code in the virtual environment, namely the three data forms of the color image I_rgb, the depth image I_depth and the normal map I_normal.
S2, constructing a feature extraction fusion network FEFNet with a multi-modal information fusion mechanism, the network comprising a multi-modal information preprocessing module and a multi-modal information fusion module;
S3, inputting the three kinds of data information from S1 into the multi-modal information preprocessing module to obtain their initial features F_rgb, F_depth, F_normal;
As a preferred implementation of step S3, the multi-modal information preprocessing module is obtained by stacking n CONV-MP-ReLU layers, so that the preprocessing module can perform initial feature extraction on the input data information of each modality and adjust the three kinds of input data information, the color image I_rgb, the depth image I_depth and the normal map I_normal, to a uniform size t×t. The initial features of the preprocessed data information of each modality are respectively:
F_rgb = CMR^n(I_rgb), F_depth = CMR^n(I_depth), F_normal = CMR^n(I_normal)
wherein CMR denotes the serial combination of a CONV convolution layer, a MaxPooling max-pooling layer and a ReLU activation layer, and CMR^n denotes the superposition of n CONV-MP-ReLU layers.
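For illustration, a minimal PyTorch sketch of such a CMR stack follows. The channel widths, kernel size, number n of layers and the input size t are assumptions, since the text fixes only the CONV, MaxPooling, ReLU ordering and the uniform t×t input.

```python
import torch
import torch.nn as nn

class CMRPreprocess(nn.Module):
    """A stack of n CONV-MP-ReLU (CMR) blocks; one instance per modality."""
    def __init__(self, in_channels: int = 3, widths=(32, 64, 128)):
        super().__init__()
        layers, c = [], in_channels
        for w in widths:  # n = len(widths) CMR blocks
            layers += [
                nn.Conv2d(c, w, kernel_size=3, padding=1),  # CONV
                nn.MaxPool2d(kernel_size=2),                # MP
                nn.ReLU(inplace=True),                      # ReLU
            ]
            c = w
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# One preprocessing branch per modality; inputs already resized to t x t.
pre_rgb, pre_depth, pre_normal = CMRPreprocess(), CMRPreprocess(), CMRPreprocess()
t = 96
F_rgb = pre_rgb(torch.randn(1, 3, t, t))  # initial feature F_rgb
```

Using one independent branch per modality keeps the three initial features size-aligned for the fusion module that follows.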
S4, the multi-modal information fusion module adopts a two-stage training mode of pre-training and formal training to perform feature fusion on the initial features F_rgb, F_depth, F_normal of the three kinds of data information, obtaining the pre-training feature output F_out^1 and the formal training feature output F_out^2;
As a preferred embodiment of step S4, the multi-modal information fusion module comprises the multi-modal information fusion module MMF_1 used in the pre-training process and the multi-modal information fusion module MMF_2 used in the formal training process. MMF_1 consists of a direct weighted fusion network, i.e. the pre-training feature output F_out^1 is:
F_out^1 = W_rgb · F_rgb + W_depth · F_depth + W_normal · F_normal
wherein W_rgb, W_depth and W_normal are the weights corresponding to each modality;
the multi-modal information fusion module MMF_2 used in the formal training process consists of a mapping-coding structure, i.e. the formal training feature output F_out^2 is:
F_out^2 = Proj{ Norm( Softmax( Linear(F_dn)^T ⊗ Linear(F_rgb) ) ) }
wherein F_dn denotes the feature obtained by fusing the depth image information and the normal map information, [·]^T denotes the matrix transpose operation, ⊗ denotes the matrix multiplication operation, Linear(·) denotes a Linear layer, Softmax(·) denotes a Softmax layer, Norm(·) denotes a normalization operation, and Proj{·} denotes a linear mapping operation on the result.
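A minimal sketch of the two fusion modules follows. The learnable scalar weights in MMF_1 and the exact wiring of the Linear/Softmax/Norm/Proj operators in MMF_2 are assumptions: the text names the operators, but the extracted formula does not fix their composition.

```python
import torch
import torch.nn as nn

class MMF1(nn.Module):
    """Pre-training fusion MMF_1: direct weighted sum of the three initial
    features. Learnable scalar weights are one plausible reading of
    W_rgb/W_depth/W_normal; fixed weights would also fit the description."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.full((3,), 1.0 / 3.0))

    def forward(self, f_rgb, f_depth, f_normal):
        return self.w[0] * f_rgb + self.w[1] * f_depth + self.w[2] * f_normal

class MMF2(nn.Module):
    """Formal-training fusion MMF_2: assumed mapping-coding structure.
    Features are treated as token sequences of shape (B, N, dim)."""
    def __init__(self, dim: int):
        super().__init__()
        self.lin_dn = nn.Linear(dim, dim)   # Linear(.) applied to F_dn
        self.lin_rgb = nn.Linear(dim, dim)  # Linear(.) applied to F_rgb
        self.norm = nn.LayerNorm(dim)       # Norm(.)
        self.proj = nn.Linear(dim, dim)     # Proj{.}

    def forward(self, f_rgb, f_dn):
        # attention map from the mapped F_dn against the transposed mapped F_rgb
        attn = torch.softmax(
            self.lin_dn(f_dn) @ self.lin_rgb(f_rgb).transpose(-2, -1), dim=-1
        )
        return self.proj(self.norm(attn @ f_rgb))  # F_out^2
```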
As another preferred embodiment of step S4, the multi-modal information fusion module adopts a two-stage training mode of pre-training and formal training, specifically: in training stage 1, the training network structure combines the multi-modal information preprocessing module, the multi-modal information fusion module MMF_1 and the reinforcement learning network RACNet, and a suitable multi-modal information preprocessing module is trained first; in training stage 2, the suitable multi-modal information preprocessing module parameters obtained in stage 1 are preloaded as the multi-modal information preprocessing module of stage 2, and the multi-modal information fusion module in the training network structure is replaced by the MMF_2 form for stage-2 training, as sketched below.
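A minimal sketch of the stage-2 preloading step, assuming the preprocessing parameters are stored under a "preprocess." prefix in a stage-1 checkpoint; the path and parameter naming are illustrative.

```python
import torch

def load_stage1_preprocess(model_stage2, ckpt_path: str):
    """Preload the stage-1 (pre-training) preprocessing parameters into the
    stage-2 model whose fusion module has been swapped from MMF_1 to MMF_2.
    The 'preprocess.' prefix is an assumption about module naming."""
    state = torch.load(ckpt_path, map_location="cpu")
    pre_only = {k: v for k, v in state.items() if k.startswith("preprocess.")}
    # strict=False: MMF_2 and RACNet weights are trained fresh in stage 2
    model_stage2.load_state_dict(pre_only, strict=False)
    return model_stage2
```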
S5, constructing a reinforcement learning AC framework network RACNet with an information-fusion regularization constraint, inputting the formal training feature output F_out^2 into the network, and outputting the corresponding predicted execution action.
As a preferred embodiment of step S5, a reinforcement learning AC framework network RACNet with an information-fusion regularization constraint is constructed, specifically: a bi-regularization constraint is applied to the formal training feature output F_out^2 of the multi-modal information fusion module, i.e. the singular matrix Σ of F_out^2 and its singular values are constrained, wherein σ_max and σ_min are respectively the largest and smallest singular values in the singular matrix Σ, and W_1 and W_2 are respectively the weights corresponding to the bi-regularization constraint; by constraining the singular matrix and its singular values, the extracted feature F_out^2 can better represent the current state, thereby improving the performance of the model;
on the basis of the reinforcement learning AC framework, the reinforcement learning algorithm loss function after adding the bi-regularization constraint is expressed as:
Loss = W_actor · Loss_Actor + W_critic · Loss_Critic + W_norm · Loss_Norm
wherein Loss_Actor and Loss_Critic are the loss functions corresponding to the Actor network and the Critic network under the reinforcement learning AC framework, Loss_Norm is the loss function corresponding to the bi-regularization constraint term Norm_out on F_out^2, and W_actor, W_critic and W_norm are the weights corresponding to each loss.
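A minimal sketch of an assumed concrete form of the Norm_out term follows. The patent states only that the singular matrix of F_out^2 and its extreme singular values σ_max, σ_min are constrained with weights W_1 and W_2, so the exact penalty below is an assumption.

```python
import torch

def bi_regularization(f_out2: torch.Tensor, w1: float = 1e-3, w2: float = 1e-3):
    """Assumed instance of Norm_out: penalize the largest singular value of
    F_out^2 (flattened to a matrix) and the spread between the largest and
    smallest singular values, weighted by W_1 and W_2."""
    sigma = torch.linalg.svdvals(f_out2.flatten(1))  # singular values, descending
    return w1 * sigma[0] + w2 * (sigma[0] - sigma[-1])

# total loss under the AC framework (weights are hyperparameters):
# loss = w_actor * loss_actor + w_critic * loss_critic + w_norm * bi_regularization(f_out2)
```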
As another preferred embodiment of step S5, the corresponding predicted execution action is output, specifically including the following steps:
S51, sequentially inputting the data information acquired in real time into the feature extraction fusion network FEFNet and the reinforcement learning AC framework network RACNet with the information-fusion regularization constraint, and outputting the corresponding predicted execution action;
S52, adjusting the motion direction of the tracking agent in real time according to the obtained action instruction, so that the tracking agent adjusts its motion according to the position of the target agent under the current viewing angle and performs accurate active target tracking;
S53, repeating steps S51-S52 until the target agent is lost from the field of view of the tracking agent or has been tracked for a preset maximum number of frames.
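A minimal sketch of the resulting inference loop (steps S51-S53) follows; the environment interface names (reset, step, the target-lost flag) are hypothetical stand-ins for the UE interaction code.

```python
import torch

@torch.no_grad()
def track_episode(env, fefnet, racnet, max_frames: int = 500):
    """Active-tracking loop for S51-S53: fuse the three modality images,
    predict an action, and move the tracker until the target is lost or
    the preset maximum frame number is reached."""
    obs = env.reset()                        # (I_rgb, I_depth, I_normal)
    for _ in range(max_frames):              # preset maximum frame number
        f_out2 = fefnet(*obs)                # fused state feature F_out^2
        action = racnet.act(f_out2)          # predicted execution action
        obs, target_lost = env.step(action)  # tracker adjusts its motion
        if target_lost:                      # target left the field of view
            break
```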
In the actual implementation of the invention, the action selection of the tracking agent is shown in fig. 1; it can execute actions such as moving forward, moving backward and steering, while the target agent may act randomly or walk along a preset path. The training and testing environments are shown in fig. 2 and fig. 3. Fig. 2 shows the three data forms selected by the invention in the training environment, namely the color image modality, the depth image modality and the normal map modality from left to right. In addition, scene illumination, target texture, environment background and the like can be varied under script control in the training environment, which augments the environment and improves the transferability and robustness of the algorithm. For testing, the invention selects 2 simulation environments, an outdoor scene and an indoor scene, as shown in fig. 3. Many different obstacles in these scenes block the view, and lighting changes are present, which add considerable difficulty to the agent's active tracking.
In the actual training process, the reward in the environment is calculated according to a preset reward function. The final reward of the invention depends on the distance and the angle between the target agent and the tracking agent: the closer the distance is to the preset optimal distance and the closer the target is to the center of the tracker's view, the larger the reward obtained by the algorithm. The invention adopts the two-stage training mode of pre-training and formal training. In training stage 1, the training network structure combines the multi-modal information preprocessing module, the multi-modal information fusion module MMF_1 and the reinforcement learning network RACNet, and a suitable multi-modal information preprocessing module is trained first. In training stage 2, the multi-modal information fusion module in the training network structure is replaced by MMF_2, the multi-modal information preprocessing module parameters obtained in stage 1 are preloaded, and training continues. This training arrangement helps the network converge to a better result faster in the initial stage and remain more stable in the subsequent stage, thereby speeding up the whole training process. A sketch of the assumed reward shaping follows.
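The sketch below matches the stated trend only; the linear penalty form, the weights and the function name are assumptions, as the text specifies just the monotone dependence on distance and angle.

```python
def reward(distance: float, angle: float, d_best: float,
           w_d: float = 1.0, w_a: float = 1.0) -> float:
    """Reward is maximal when the tracker sits at the preset optimal
    distance d_best with the target centered in view (angle = 0), and
    decreases as either deviation grows."""
    return 1.0 - w_d * abs(distance - d_best) - w_a * abs(angle)
```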
Fig. 4 and fig. 5 are the length curves and reward curves of the invention during training, where the abscissa is the number of interactions and the ordinate is the best tracking length the tracking agent can achieve at the current number of interactions and the corresponding reward value. Fig. 4 shows the training length curve and reward curve during stage-1 training, and fig. 5 shows those during stage-2 training. As can be seen from the figures, the stage-2 training converges quickly on the basis of stage 1, and reaches a higher tracking length and tracking reward than the stage-1 results.
Fig. 6 shows partial frame tracking results of the invention in indoor and outdoor scenes. The results show that the method can track the target agent continuously and stably with high accuracy under complex conditions such as snowflake interference in an outdoor snow scene and pillar occlusion in an indoor garage.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
The invention has been described above in detail; specific examples have been used herein to explain its principles and embodiments, and the above description of the embodiments is merely intended to help understand the method of the invention and its core ideas. Meanwhile, those of ordinary skill in the art may make changes to the specific embodiments and the scope of application in accordance with the ideas of the invention. In summary, the contents of this description should not be construed as limiting the invention.

Claims (5)

1. An active tracking method based on multi-mode information fusion is characterized by comprising the following steps:
S1, acquiring data information of multiple modalities under the viewing angle of the tracking agent in an active tracking virtual environment based on the UE framework, namely three kinds of data information: a color image I_rgb, a depth image I_depth and a normal map I_normal;
S2, constructing a feature extraction fusion network FEFNet with a multi-modal information fusion mechanism, the network comprising a multi-modal information preprocessing module and a multi-modal information fusion module;
S3, inputting the three kinds of data information from S1 into the multi-modal information preprocessing module to obtain their initial features F_rgb, F_depth, F_normal;
S4, the multi-modal information fusion module adopts a two-stage training mode of pre-training and formal training to perform feature fusion on the initial features F_rgb, F_depth, F_normal of the three kinds of data information, obtaining the pre-training feature output F_out^1 and the formal training feature output F_out^2;
in step S4, the multi-modal information fusion module comprises the multi-modal information fusion module MMF_1 in the pre-training process and the multi-modal information fusion module MMF_2 in the formal training process, wherein MMF_1 in the pre-training process consists of a direct weighted fusion network, i.e. the pre-training feature output F_out^1 is:
F_out^1 = W_rgb · F_rgb + W_depth · F_depth + W_normal · F_normal
wherein W_rgb, W_depth and W_normal are the weights corresponding to each modality;
the multi-modal information fusion module MMF_2 in the formal training process consists of a mapping-coding structure, i.e. the formal training feature output F_out^2 is:
F_out^2 = Proj{ Norm( Softmax( Linear(F_dn)^T ⊗ Linear(F_rgb) ) ) }
wherein F_dn denotes the feature obtained by fusing the depth image information and the normal map information, [·]^T denotes the matrix transpose operation, ⊗ denotes the matrix multiplication operation, Linear(·) denotes a Linear layer, Softmax(·) denotes a Softmax layer, Norm(·) denotes a normalization operation, and Proj{·} denotes a linear mapping operation on the result;
S5, constructing a reinforcement learning AC framework network RACNet with an information-fusion regularization constraint, inputting the formal training feature output F_out^2 into the network, and outputting the corresponding predicted execution action;
in step S5, the corresponding predicted execution action is output, specifically including the steps of:
S51, sequentially inputting data information acquired in real time into a feature extraction fusion network FEFNet and a reinforcement learning AC framework network RACNet with information fusion regularization constraint, and outputting a corresponding predicted execution action;
S52, adjusting the motion direction of the tracking agent in real time according to the obtained action instruction, so that the tracking agent adjusts its motion according to the position of the target agent under the current viewing angle and performs accurate active target tracking;
S53, repeating the steps S51-S52 until the target intelligent agent is lost in the field of view of the tracking intelligent agent or the target intelligent agent is tracked to a preset maximum frame number.
2. The method of claim 1, wherein in step S1, the specific process includes the following steps:
S11, setting two intelligent agents with a moving function in an active tracking virtual environment based on a UE framework, namely a tracking intelligent agent and a target intelligent agent, wherein the two intelligent agents are controlled to move in the environment by a set program;
S12, acquiring the data information of multiple modalities under the viewing angle of the tracking agent in real time through interaction code in the virtual environment, namely the three data forms of the color image I_rgb, the depth image I_depth and the normal map I_normal.
3. The active tracking method based on multi-mode information fusion according to claim 1, wherein in step S3, the multi-modal information preprocessing module is obtained by stacking n CONV-MP-ReLU layers, so that the preprocessing module can perform initial feature extraction on the input data information of each modality and adjust the three kinds of input data information, the color image I_rgb, the depth image I_depth and the normal map I_normal, to a uniform size t×t, wherein the initial features of the preprocessed data information of each modality are respectively:
F_rgb = CMR^n(I_rgb), F_depth = CMR^n(I_depth), F_normal = CMR^n(I_normal)
wherein CMR denotes the serial combination of a CONV convolution layer, a MaxPooling max-pooling layer and a ReLU activation layer, and CMR^n denotes the superposition of n CONV-MP-ReLU layers.
4. The method of claim 1, wherein in step S4, the multi-modal information fusion module adopts a two-stage training mode of pre-training and formal training, specifically comprising: in the pre-training, the training network structure combines the multi-modal information preprocessing module, the multi-modal information fusion module MMF_1 and the reinforcement learning network RACNet, and a suitable multi-modal information preprocessing module is trained first; in the formal training, the suitable multi-modal information preprocessing module parameters obtained in the pre-training are first preloaded as the multi-modal information preprocessing module of the formal training, and the multi-modal information fusion module MMF_1 in the training network structure is replaced by the MMF_2 form to perform the formal training.
5. The method of claim 1, wherein in step S5, a reinforcement learning AC framework network RACNet with an information-fusion regularization constraint is constructed, specifically comprising: applying a bi-regularization constraint to the formal training feature output F_out^2 of the multi-modal information fusion module, i.e. constraining the singular matrix Σ of F_out^2 and its singular values,
wherein σ_max and σ_min are respectively the largest and smallest singular values in the singular matrix Σ, and W_1 and W_2 are respectively the weights corresponding to the bi-regularization constraint; by constraining the singular matrix and its singular values, the extracted feature F_out^2 can better represent the current state, thereby improving the performance of the model;
on the basis of the reinforcement learning AC framework, the reinforcement learning algorithm loss function after adding the bi-regularization constraint is expressed as:
Loss = W_actor · Loss_Actor + W_critic · Loss_Critic + W_norm · Loss_Norm
wherein Loss_Actor and Loss_Critic are the loss functions corresponding to the Actor network and the Critic network under the reinforcement learning AC framework, Loss_Norm is the loss function corresponding to the bi-regularization constraint term Norm_out on F_out^2, and W_actor, W_critic and W_norm are the weights corresponding to each loss.
CN202410304634.XA 2024-03-18 Active tracking method based on multi-mode information fusion Active CN117893873B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410304634.XA CN117893873B (en) 2024-03-18 Active tracking method based on multi-mode information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410304634.XA CN117893873B (en) 2024-03-18 Active tracking method based on multi-mode information fusion

Publications (2)

Publication Number Publication Date
CN117893873A (en) 2024-04-16
CN117893873B (en) 2024-06-07



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968238A (en) * 2020-08-22 2020-11-20 晋江市博感电子科技有限公司 Human body color three-dimensional reconstruction method based on dynamic fusion algorithm
CN112862860A (en) * 2021-02-07 2021-05-28 天津大学 Object perception image fusion method for multi-modal target tracking
CN113158584A (en) * 2021-05-24 2021-07-23 北京邮电大学 Upper-bound substitution method for multi-modal feature embedding pre-training network collocation effect evaluation
CN114494354A (en) * 2022-02-15 2022-05-13 中国矿业大学 Unsupervised RGB-T target tracking method based on attention multimodal feature fusion
CN115100235A (en) * 2022-08-18 2022-09-23 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Target tracking method, system and storage medium
CN115423847A (en) * 2022-11-04 2022-12-02 华东交通大学 Twin multi-modal target tracking method based on Transformer
CN116740480A (en) * 2023-07-11 2023-09-12 中国科学院长春光学精密机械与物理研究所 Multi-mode image fusion target tracking method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant