CN117934551A - Mixed reality tracking interaction system - Google Patents

Mixed reality tracking interaction system

Info

Publication number
CN117934551A
Authority
CN
China
Prior art keywords
image
tracking
training
channel
feature
Prior art date
Legal status
Pending
Application number
CN202410113877.5A
Other languages
Chinese (zh)
Inventor
王晓燕
王璇
刘松
武世杰
朱飞
Current Assignee
Beijing Tiangong Color Television Technology Co ltd
Original Assignee
Beijing Tiangong Color Television Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Tiangong Color Television Technology Co ltd filed Critical Beijing Tiangong Color Television Technology Co ltd
Priority to CN202410113877.5A
Publication of CN117934551A
Legal status: Pending (current)


Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a mixed reality tracking interaction system, which relates to the field of intelligent interaction. The system acquires adjacent first and second frame monitoring images with a camera device, and introduces an image processing and analysis algorithm at the back end to analyze the two adjacent monitoring images, so as to acquire the motion and gesture information of the user and convert it into tracking data. The tracking data may include information such as the position, gesture and motion trajectory of the user, and is used for subsequent processing and interaction. In this way, the motion of the user can be tracked by analyzing and comparing consecutive image frames, so that re-identification of the object is realized. This improves the accuracy and stability of tracking the user object, thereby providing more accurate tracking capability and a better user experience and interaction effect.

Description

Mixed reality tracking interaction system
Technical Field
The application relates to the field of intelligent interaction, and more particularly, to a mixed reality tracking interaction system.
Background
Mixed Reality (MR) is a technology that merges the real world and the virtual world, allowing a user to interact with virtual objects in a real environment while the virtual objects also reflect changes in the real environment. To achieve the effect of mixed reality, accurate tracking of the motion and gesture of the user and realistic rendering and control of virtual scenes and objects are required.
However, conventional interaction systems often fail to provide adequate realism and immersion when rendering virtual scenes and objects; the appearance, motion and interactive behavior of the virtual objects may differ significantly from those of the real world, which reduces the user's sense of experience and participation. In addition, when a traditional interaction system tracks the motion and gesture of the user, a certain error and delay often exist, so that the position and gesture of the virtual object do not fully match the actual actions of the user, which affects the accuracy and real-time performance of the interaction.
Accordingly, an optimized mixed reality tracking interactive system is desired.
Disclosure of Invention
The present application has been made to solve the above-mentioned technical problems. The embodiment of the application provides a mixed reality tracking interaction system, which acquires adjacent first and second frame monitoring images with a camera device and introduces an image processing and analysis algorithm at the back end to analyze the two adjacent monitoring images, so as to acquire the motion and gesture information of the user and convert it into tracking data. The tracking data may include information such as the position, gesture and motion trajectory of the user, and is used for subsequent processing and interaction. In this way, the motion of the user can be tracked by analyzing and comparing consecutive image frames, so that re-identification of the object is realized. This improves the accuracy and stability of tracking the user object, thereby providing more accurate tracking capability and a better user experience and interaction effect.
According to one aspect of the present application, there is provided a mixed reality tracking interaction system comprising:
The tracking module is used for tracking the motion and the gesture of the user and generating tracking data;
The processing module is used for receiving the tracking data and the user input and generating a virtual scene and an object according to the tracking data and the user input;
The display module is used for displaying the virtual scene and the object;
the interaction module is used for recognizing the gesture and the sound of the user and controlling the behaviors of the virtual scene and the object according to the gesture and the sound of the user;
And the sound synthesizer is used for generating the sound of the virtual scene.
Compared with the prior art, the mixed reality tracking interaction system provided by the application acquires adjacent first and second frame monitoring images with a camera device and introduces an image processing and analysis algorithm at the back end to analyze the two adjacent monitoring images, so as to acquire the motion and gesture information of the user and convert it into tracking data. The tracking data may include information such as the position, gesture and motion trajectory of the user, and is used for subsequent processing and interaction. In this way, the motion of the user can be tracked by analyzing and comparing consecutive image frames, so that re-identification of the object is realized. This improves the accuracy and stability of tracking the user object, thereby providing more accurate tracking capability and a better user experience and interaction effect.
Drawings
The above and other objects, features and advantages of the present application will become more apparent from the following detailed description of embodiments of the present application with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification; they illustrate the application and, together with the description of the embodiments, serve to explain the application, and do not constitute a limitation of the application. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 is a block diagram of a mixed reality tracking interactive system according to an embodiment of the application;
FIG. 2 is a system architecture diagram of a mixed reality tracking interactive system according to an embodiment of the application;
FIG. 3 is a block diagram of a training phase of a mixed reality tracking interactive system according to an embodiment of the application;
Fig. 4 is a block diagram of a tracking module in a mixed reality tracking interactive system according to an embodiment of the application.
Detailed Description
Hereinafter, exemplary embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
As used in the specification and in the claims, the terms "a," "an," and/or "the" are not limited to the singular and may include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not constitute an exclusive list, and a method or apparatus may also include other steps or elements.
Although the present application makes various references to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.
A flowchart is used in the present application to describe the operations performed by a system according to embodiments of the present application. It should be understood that these operations are not necessarily performed precisely in the order shown; rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.
Conventional interaction systems often fail to provide adequate realism and immersion when rendering virtual scenes and objects; the appearance, motion and interactive behavior of the virtual objects may differ significantly from those of the real world, which reduces the user's sense of experience and participation. In addition, when a traditional interaction system tracks the motion and gesture of the user, a certain error and delay often exist, so that the position and gesture of the virtual object do not fully match the actual actions of the user, which affects the accuracy and real-time performance of the interaction. Accordingly, an optimized mixed reality tracking interactive system is desired.
In the technical scheme of the application, a mixed reality tracking interaction system is provided. Fig. 1 is a block diagram of a mixed reality tracking interactive system according to an embodiment of the application. As shown in fig. 1, a mixed reality tracking interactive system 300 according to an embodiment of the application includes: a tracking module 310 for tracking the motion and gesture of the user and generating tracking data; a processing module 320, configured to receive the tracking data and the user input, and generate a virtual scene and an object according to the tracking data and the user input; a display module 330, configured to display the virtual scene and the object; an interaction module 340, configured to recognize a gesture and a sound of the user, and control a behavior of the virtual scene and the object according to the gesture and the sound of the user; a sound synthesizer 350 for generating sound of the virtual scene.
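For orientation, the following minimal sketch shows how the five modules described above could be wired together in code; all class names, method names, and data fields are illustrative assumptions and are not part of the disclosure.

```python
# Illustrative skeleton only; names and fields are assumptions, not from the patent.
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class TrackingData:
    position: List[float]          # user position
    gesture: Dict[str, Any]        # user gesture/pose parameters
    trajectory: List[List[float]]  # motion trail over time


class TrackingModule:
    def track(self, first_frame, second_frame) -> TrackingData:
        """Compare adjacent monitoring frames and produce tracking data (units 311-316)."""
        raise NotImplementedError


class ProcessingModule:
    def generate(self, tracking: TrackingData, user_input: Dict[str, Any]) -> Any:
        """Build the virtual scene and objects from tracking data and user input."""
        raise NotImplementedError


class DisplayModule:
    def render(self, scene: Any) -> None:
        """Display the virtual scene and objects."""


class InteractionModule:
    def interpret(self, gesture: Any, sound: Any) -> Dict[str, Any]:
        """Recognize gestures/voice and map them to scene and object control commands."""
        return {}


class SoundSynthesizer:
    def synthesize(self, scene: Any) -> bytes:
        """Generate the sound of the virtual scene."""
        return b""
```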
In particular, the tracking module 310 is configured to track the motion and gesture of the user and generate tracking data. In one specific example of the present application, as shown in fig. 2 and 4, the tracking module 310 includes: an image acquisition unit 311, configured to acquire adjacent first frame monitoring images and second frame monitoring images acquired by the camera; an image feature extraction unit 312, configured to pass the first frame monitoring image and the second frame monitoring image through a twin tracker including a first image encoder and a second image encoder to obtain a first object image feature map and a second object image feature map; an image feature channel saliency analysis unit 313, configured to perform channel saliency processing on the first object image feature map and the second object image feature map to obtain a channel saliency first object image feature map and a channel saliency second object image feature map; an image local feature semantic similarity association analysis unit 314, configured to calculate correlations between the feature matrices of each group of corresponding channel dimensions in the channel saliency first object image feature map and the channel saliency second object image feature map, so as to obtain a tracking semantic feature vector composed of a plurality of correlations as a tracking semantic feature; a tracking result generating unit 315, configured to determine whether a first object in the first frame monitoring image and a second object in the second frame monitoring image are the same object based on the tracking semantic feature, and generate a tracking result; and a tracking data generating unit 316, configured to generate the tracking data based on the tracking result.
Specifically, the image acquisition unit 311 is configured to acquire a first frame monitoring image and a second frame monitoring image that are adjacent to each other and are acquired by the camera. It should be appreciated that by acquiring adjacent first and second frame monitoring images, it may be used for object tracking.
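As a minimal sketch of this acquisition step, assuming an OpenCV-accessible camera; the device index and the use of two consecutive read() calls are illustrative assumptions.

```python
import cv2  # assumes OpenCV is installed and a camera is available

cap = cv2.VideoCapture(0)        # illustrative device index
ok1, first_frame = cap.read()    # first frame monitoring image
ok2, second_frame = cap.read()   # adjacent second frame monitoring image
cap.release()

if ok1 and ok2:
    # The two adjacent frames are handed to the twin tracker described next.
    print(first_frame.shape, second_frame.shape)
```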
Specifically, the image feature extraction unit 312 is configured to pass the first frame monitoring image and the second frame monitoring image through a twin tracker including a first image encoder and a second image encoder to obtain a first object image feature map and a second object image feature map. It should be appreciated that in a mixed reality tracking interaction system, the tracking module needs to track the motion and gesture of the user and generate corresponding tracking data. To achieve accurate tracking, semantic feature information associated with the user needs to be extracted from successive image frames. Thus, the first frame monitoring image and the second frame monitoring image can be analyzed with a convolutional neural network model, which has excellent expressive power for extracting implicit image features. In particular, in order to further improve the sufficiency and accuracy of the analysis and comparison of the semantic features of the motion and gesture of the user in the first frame monitoring image and the second frame monitoring image, in the technical solution of the present application, the first frame monitoring image and the second frame monitoring image are passed through a twin tracker comprising a first image encoder and a second image encoder to obtain a first object image feature map and a second object image feature map. Here, the twin tracker encodes the adjacent frame images so that feature information about the motion and gesture semantics of the user object that is not obvious in the source image domain can be extracted from the adjacent frames, thereby improving the tracking precision of the user object.
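A minimal PyTorch-style sketch of the twin-tracker encoding step is given below; the backbone architecture, layer sizes, and the choice not to share weights between the two encoders are illustrative assumptions rather than details taken from the disclosure.

```python
import torch
import torch.nn as nn


class ImageEncoder(nn.Module):
    """A small convolutional encoder standing in for the first/second image encoder."""

    def __init__(self, in_channels: int = 3, out_channels: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)  # (B, C, H, W) object image feature map


class TwinTracker(nn.Module):
    """Twin tracker holding a first and a second image encoder."""

    def __init__(self):
        super().__init__()
        self.first_encoder = ImageEncoder()
        self.second_encoder = ImageEncoder()

    def forward(self, frame1: torch.Tensor, frame2: torch.Tensor):
        f1 = self.first_encoder(frame1)   # first object image feature map
        f2 = self.second_encoder(frame2)  # second object image feature map
        return f1, f2


# Usage: two adjacent monitoring frames as (B, 3, H, W) tensors.
tracker = TwinTracker()
f1, f2 = tracker(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))
```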
Specifically, the image feature channel saliency analysis unit 313 is configured to perform channel saliency processing on the first object image feature map and the second object image feature map to obtain a channel saliency first object image feature map and a channel saliency second object image feature map. In a mixed reality tracking interaction system, the tracking module needs to track the motion and gesture of the user and generate corresponding tracking data. In order to achieve accurate tracking, the key semantic feature information related to the user needs to be extracted from the image features, while other irrelevant, interfering features are filtered out. Therefore, in the technical scheme of the application, the first object image feature map and the second object image feature map are further passed through a channel attention module to obtain the channel saliency first object image feature map and the channel saliency second object image feature map, respectively. It should be understood that the channel attention module can weight the features of different channels according to their importance, so as to highlight important user semantic features and suppress secondary features, thereby reducing the interference of irrelevant information and improving tracking accuracy. That is, features related to the user's motion and gesture in the image can be better captured through the channel saliency feature maps, thereby providing more reliable input for the subsequent tracking task. More specifically, passing the first object image feature map and the second object image feature map through the channel attention module to obtain the channel saliency first object image feature map and the channel saliency second object image feature map, respectively, includes: performing global average pooling on each feature matrix of the first object image feature map and the second object image feature map along the channel dimension to obtain a channel feature vector; inputting the channel feature vector into a Softmax activation function to obtain a channel attention weight vector; and weighting each feature matrix of the first object image feature map and the second object image feature map along the channel dimension by taking the feature value of each position in the channel attention weight vector as a weight, so as to obtain the channel saliency first object image feature map and the channel saliency second object image feature map.
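The channel attention steps described above (global average pooling of each channel's feature matrix, a Softmax over the pooled channel vector, and per-channel re-weighting) might be implemented as in the following sketch; it is an illustrative rendering of those steps, not the patented module itself.

```python
import torch
import torch.nn.functional as F


def channel_salience(feature_map: torch.Tensor) -> torch.Tensor:
    """Re-weight each channel of a (B, C, H, W) feature map by a Softmax over
    its globally average-pooled channel descriptor."""
    channel_vector = feature_map.mean(dim=(2, 3))   # global average pooling -> (B, C)
    weights = F.softmax(channel_vector, dim=1)      # channel attention weight vector
    return feature_map * weights[:, :, None, None]  # weight each channel's feature matrix


# Stand-in object image feature maps (e.g., outputs of the twin tracker)
f1 = torch.randn(1, 64, 32, 32)
f2 = torch.randn(1, 64, 32, 32)
cs_f1 = channel_salience(f1)  # channel saliency first object image feature map
cs_f2 = channel_salience(f2)  # channel saliency second object image feature map
```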
Specifically, the image local feature semantic similarity association analysis unit 314 is configured to calculate the correlations between the feature matrices of each group of corresponding channel dimensions in the channel saliency first object image feature map and the channel saliency second object image feature map, so as to obtain a tracking semantic feature vector composed of a plurality of correlations as the tracking semantic feature. The feature matrices along the channel dimension in the channel saliency first object image feature map and the channel saliency second object image feature map represent feature information about the different motions and gestures of the user object in the two images, respectively. Therefore, in order to accurately track the user object, the corresponding image features of the adjacent frames need to be analyzed and compared so as to capture the similar features and the differential features between the user's motion and gesture features at different time points, thereby determining the position and gesture change of the tracking target. Specifically, in the technical scheme of the application, the correlation between the feature matrices of each group of corresponding channel dimensions in the channel saliency first object image feature map and the channel saliency second object image feature map is calculated respectively to obtain the tracking semantic feature vector composed of a plurality of correlations. In particular, the tracking semantic feature vector is constructed by combining a plurality of correlations, each of which corresponds to one feature dimension. Thus, this tracking semantic feature vector can be used to represent feature information of the tracked object, such as position, pose and shape, to support subsequent object recognition and tracking tasks. More specifically, calculating the correlations between the feature matrices of each group of corresponding channel dimensions in the channel saliency first object image feature map and the channel saliency second object image feature map to obtain the tracking semantic feature vector composed of a plurality of correlations includes: calculating the correlation between the feature matrices of each group of corresponding channel dimensions in the channel saliency first object image feature map and the channel saliency second object image feature map according to a correlation formula, wherein m1(i,j) and m2(i,j) are respectively the feature values at each position (i,j) of the feature matrices of the corresponding channel dimensions in the channel saliency first object image feature map and the channel saliency second object image feature map, W is the width and H is the height of the feature matrices of each group of corresponding channel dimensions in the two feature maps, v is the feature value of each position in the tracking semantic feature vector, and the logarithm is taken with base 2.
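Because the correlation formula itself appears only as an image in the original text, the sketch below uses a generic per-channel correlation (cosine similarity between the flattened channel matrices) purely as a stand-in; the actual correlation measure of the patent is not reproduced here.

```python
import torch
import torch.nn.functional as F


def tracking_semantic_vector(cs_f1: torch.Tensor, cs_f2: torch.Tensor) -> torch.Tensor:
    """Compute one correlation value per pair of corresponding channel matrices of two
    (C, H, W) channel saliency feature maps, yielding a length-C tracking semantic
    feature vector. Cosine similarity is an illustrative stand-in for the patent's formula."""
    c = cs_f1.shape[0]
    m1 = cs_f1.reshape(c, -1)                  # each channel's W x H matrix, flattened
    m2 = cs_f2.reshape(c, -1)
    return F.cosine_similarity(m1, m2, dim=1)  # one correlation per channel dimension


vec = tracking_semantic_vector(torch.randn(64, 32, 32), torch.randn(64, 32, 32))
print(vec.shape)  # torch.Size([64])
```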
Specifically, the tracking result generating unit 315 is configured to determine, based on the tracking semantic feature, whether the first object in the first frame monitoring image and the second object in the second frame monitoring image are the same object, and to generate a tracking result. That is, in the technical solution of the present application, the tracking semantic feature vector is passed through a classifier to obtain the tracking result, where the tracking result is used to indicate whether the first object in the first frame monitoring image and the second object in the second frame monitoring image are the same object. In other words, classification is performed using the tracking semantic feature information about the motion and gesture of the user target object in the adjacent frames, thereby judging whether the user target object in the adjacent frames is the same object. In this way, tracking data can be generated based on the tracking result, which may include information such as the position, gesture and motion trajectory of the user for subsequent processing and interaction. More specifically, the tracking semantic feature vector is fully-connected encoded by the plurality of fully connected layers of the classifier to obtain an encoded classification feature vector, and the encoded classification feature vector is passed through the Softmax classification function of the classifier to obtain the tracking result.
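A minimal sketch of such a classifier head, with fully connected encoding followed by a Softmax over two classes ("same object" vs. "different object"); the layer widths and the label convention are assumptions.

```python
import torch
import torch.nn as nn


class TrackingClassifier(nn.Module):
    """Fully connected layers followed by Softmax, as described for the classifier."""

    def __init__(self, in_dim: int = 64, hidden: int = 32, num_classes: int = 2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, tracking_vec: torch.Tensor) -> torch.Tensor:
        logits = self.fc(tracking_vec)        # encoded classification feature vector
        return torch.softmax(logits, dim=-1)  # class probabilities


clf = TrackingClassifier()
probs = clf(torch.randn(1, 64))
is_same_object = bool(probs.argmax(dim=-1).item() == 1)  # index 1 = "same object" (assumed)
```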
A classifier refers to a machine learning model or algorithm that is used to classify input data into different categories or labels. Classification is a supervised learning task: the classifier learns a mapping from input data to output categories.
A fully connected layer is a type of layer commonly found in neural networks. In a fully connected layer, each neuron is connected to all neurons of the previous layer, and each connection has a weight. This means that each neuron in the fully connected layer receives inputs from all neurons in the previous layer, computes a weighted sum of these inputs, and passes the result to the next layer.
The Softmax classification function is a commonly used activation function for multi-classification problems. It converts each element of the input vector into a probability value between 0 and 1, and the sum of these probability values equals 1. The Softmax function is commonly used at the output layer of a neural network, and is particularly suited for multi-classification problems, because it can map the network output into probability distributions for individual classes. During the training process, the output of the Softmax function may be used to calculate the loss function and update the network parameters through a back propagation algorithm. Notably, the output of the Softmax function does not change the relative magnitude relationship between elements, but rather normalizes them. Thus, the Softmax function does not change the characteristics of the input vector, but simply converts it into a probability distribution form.
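For concreteness, a small worked example of the normalization performed by the Softmax function described above:

```python
import math

logits = [2.0, 1.0, 0.1]                      # example network outputs
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]
print(probs)       # approx. [0.659, 0.242, 0.099]; the relative order of elements is preserved
print(sum(probs))  # 1.0 up to floating-point rounding
```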
Specifically, the tracking data generating unit 316 is configured to generate the tracking data based on the tracking result. In this way, the tracking module tracks the motion and gesture of the user and generates the corresponding tracking data.
In particular, the processing module 320 is configured to receive the tracking data and the user input, and generate a virtual scene and an object according to the tracking data and the user input. The processing module is used for receiving tracking data and user input and generating corresponding virtual scenes and objects according to the data.
In particular, the display module 330 is configured to display the virtual scene and the object. The display module is used for displaying scenes and objects in the virtual environment.
In particular, the interaction module 340 is configured to recognize the gesture and sound of the user, and to control the behavior of the virtual scene and objects according to the gesture and sound of the user. The interaction module may comprise a camera, a microphone and a speech recognizer: the camera is used to capture the gestures and position of the user, the microphone is used to capture the user's voice, and the speech recognizer is used to recognize the user's commands.
In particular, the sound synthesizer 350 is configured to generate sound of the virtual scene. The sound synthesizer is used for generating sound effects in the virtual scene so as to enhance the experience of a user. By using the system, a user can perform various activities such as games, education, and entertainment in a virtual environment. In addition, the system can be applied to the fields of training, simulation, virtual exercise and the like.
It should be appreciated that the twin tracker including the first and second image encoders, the channel attention module, and the classifier need to be trained prior to the inference using the neural network model described above. That is, the mixed reality tracking interaction system 300 according to the application further comprises a training stage 400 for training the twin tracker comprising the first image encoder and the second image encoder, the channel attention module and the classifier.
Fig. 3 is a block diagram of the training phase of the mixed reality tracking interactive system according to an embodiment of the application. As shown in fig. 3, the mixed reality tracking interactive system 300 according to an embodiment of the application includes a training phase 400, comprising: a training data acquisition subunit 410, configured to acquire training data, where the training data includes adjacent training first frame monitoring images and training second frame monitoring images acquired by the camera; a training image feature extraction subunit 420, configured to pass the training first frame monitoring image and the training second frame monitoring image through the twin tracker including the first image encoder and the second image encoder to obtain a training first object image feature map and a training second object image feature map; a training feature channel saliency analysis unit 430, configured to pass the training first object image feature map and the training second object image feature map through the channel attention module to obtain a training channel saliency first object image feature map and a training channel saliency second object image feature map, respectively; a training image local feature semantic similarity association analysis subunit 440, configured to calculate correlations between the feature matrices of each group of corresponding channel dimensions in the training channel saliency first object image feature map and the training channel saliency second object image feature map, so as to obtain a training tracking semantic feature vector composed of a plurality of correlations; a loss function calculation subunit 450, configured to calculate a loss function value between the training channel saliency first object image feature map and the training channel saliency second object image feature map; a classification loss subunit 460, configured to pass the training tracking semantic feature vector through the classifier to obtain a classification loss function value; a weighting subunit 470, configured to calculate a weighted sum of the loss function value and the classification loss function value to obtain a final loss function value; and a training subunit 480, configured to train the twin tracker including the first image encoder and the second image encoder, the channel attention module, and the classifier based on the final loss function value.
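A minimal sketch of one training step following the subunits above: encode the two training frames, apply channel attention, compute the tracking semantic feature vector, and back-propagate a weighted sum of a feature-map loss and a classification (cross-entropy) loss. The stand-in encoders, the simple squared-difference feature-map loss, the cosine correlation, the loss weights, and the optimizer are all assumptions; the patent's specific feature-map loss is discussed below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in components; in practice these are the twin tracker, channel attention
# module, and classifier sketched in the description above.
encoder1 = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU())
encoder2 = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU())
classifier = nn.Linear(8, 2)

params = list(encoder1.parameters()) + list(encoder2.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

frame1 = torch.randn(4, 3, 64, 64)   # training first frame monitoring images
frame2 = torch.randn(4, 3, 64, 64)   # training second frame monitoring images
labels = torch.randint(0, 2, (4,))   # same-object ground truth (illustrative)

f1, f2 = encoder1(frame1), encoder2(frame2)                 # object image feature maps
w1 = torch.softmax(f1.mean(dim=(2, 3)), dim=1)              # channel attention weights
w2 = torch.softmax(f2.mean(dim=(2, 3)), dim=1)
cs1, cs2 = f1 * w1[:, :, None, None], f2 * w2[:, :, None, None]

vec = F.cosine_similarity(cs1.flatten(2), cs2.flatten(2), dim=2)  # (B, C) tracking semantic vectors

feature_loss = (cs1.flatten(1) - cs2.flatten(1)).pow(2).mean()    # stand-in feature-map loss
class_loss = F.cross_entropy(classifier(vec), labels)             # classification loss function value
final_loss = 0.5 * feature_loss + 0.5 * class_loss                # weighted sum (weights assumed)

optimizer.zero_grad()
final_loss.backward()
optimizer.step()
```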
In particular, in the technical solution of the present application, the channel saliency first object image feature map and the channel saliency second object image feature map respectively express the image semantic features of the first frame monitoring image and the second frame monitoring image after the overall spatial distribution of some of those image semantic features in the channel dimension has been strengthened by the channel attention mechanism. Therefore, when the correlations between the feature matrices of each group of corresponding channel dimensions in the two feature maps are calculated, the channel distribution differences introduced during image semantic feature extraction by the semantic differences of the source images need to be considered; after the channel attention mechanism amplifies the overall spatial distribution differences of the image semantic features between the feature matrices, improving the sharing of key image semantic features between the channel saliency first object image feature map and the channel saliency second object image feature map improves the expression effect of the tracking semantic feature vector composed of the plurality of correlations. That is, from the perspective of image semantic feature sharing, the channel saliency first object image feature map and the channel saliency second object image feature map share the expression of key image semantic features across the channel dimension. Therefore, in order to suppress the sparsification of the shared distribution of the key image semantic features of the first frame monitoring image and the second frame monitoring image during image semantic feature extraction and channel attention strengthening, the applicant of the present application introduces, in the training process of the model, a specific loss function for the channel saliency first object image feature map and the channel saliency second object image feature map, in which V1 is the channel saliency first object image feature vector obtained by unfolding the channel saliency first object image feature map, V2 is the channel saliency second object image feature vector obtained by unfolding the channel saliency second object image feature map, ||·||1 and ||·||2 are respectively the 1-norm and 2-norm of a feature vector, τ is a boundary threshold hyperparameter, the feature vectors are all in the form of row vectors, ⊖ denotes position-wise subtraction, ⊗ denotes vector multiplication, and L is the loss function value. Specifically, the strengthening of the shared key image semantic features between the channel saliency first object image feature map and the channel saliency second object image feature map can be regarded as a compression of the distribution information of the global feature set; on the basis of reconstructing the relative shape relationship of the original feature manifolds from the structural representation between the two feature maps, the distribution of the key features is controlled towards sparsification, so that the shared key image semantic features between the channel saliency first object image feature map and the channel saliency second object image feature map are enhanced. This improves the accuracy of the correlations calculated between the feature matrices of the corresponding channel dimensions in the two feature maps, which improves the expression effect of the tracking semantic feature vector composed of the plurality of correlations and, in turn, the accuracy of the tracking result obtained from the tracking semantic feature vector by the classifier. In this way, the motion of the user can be tracked by analyzing and comparing consecutive image frames, so that re-identification of the object is realized; this improves the accuracy and stability of tracking the user object, thereby providing more accurate tracking capability and a better user experience and interaction effect.
As described above, the mixed reality tracking interaction system 300 according to an embodiment of the present application may be implemented in various wireless terminals, for example, a server or the like having a mixed reality tracking interaction algorithm. In one possible implementation, the mixed reality tracking interaction system 300 according to an embodiment of the application may be integrated into a wireless terminal as a software module and/or hardware module. For example, the mixed reality tracking interaction system 300 may be a software module in the operating system of the wireless terminal or may be an application developed for the wireless terminal; of course, the mixed reality tracking interaction system 300 may equally be one of many hardware modules of the wireless terminal.
Alternatively, in another example, the mixed reality tracking interaction system 300 and the wireless terminal may be separate devices, and the mixed reality tracking interaction system 300 may be connected to the wireless terminal through a wired and/or wireless network and communicate interaction information in accordance with an agreed data format.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (8)

1. A mixed reality tracking interactive system, comprising:
a tracking module for tracking motion and pose of a user and generating tracking data, comprising:
the image acquisition unit is used for acquiring adjacent first frame monitoring images and second frame monitoring images acquired by the camera;
An image feature extraction unit for passing the first frame monitoring image and the second frame monitoring image through a twin tracker including a first image encoder and a second image encoder to obtain a first object image feature map and a second object image feature map;
The image feature channel saliency analysis unit is used for respectively carrying out channel saliency processing on the first object image feature map and the second object image feature map to obtain a channel saliency first object image feature map and a channel saliency second object image feature map;
the image local feature semantic similarity association analysis unit is used for respectively calculating the correlation degree between feature matrixes of each group of corresponding channel dimensions in the channel saliency first object image feature map and the channel saliency second object image feature map so as to obtain a tracking semantic feature vector consisting of a plurality of correlation degrees as a tracking semantic feature;
the tracking result generation unit is used for determining whether a first object in the first frame monitoring image and a second object in the second frame monitoring image are the same object or not based on the tracking semantic features, and generating a tracking result;
a tracking data generating unit configured to generate the tracking data based on the tracking result;
And the processing module is used for receiving the tracking data and the user input and generating a virtual scene and an object according to the tracking data and the user input.
2. The mixed reality tracking interaction system of claim 1, further comprising:
The display module is used for displaying the virtual scene and the object;
the interaction module is used for recognizing the gesture and the sound of the user and controlling the behaviors of the virtual scene and the object according to the gesture and the sound of the user;
And the sound synthesizer is used for generating the sound of the virtual scene.
3. The mixed reality tracking interaction system of claim 2, wherein the image feature channel saliency analysis unit is configured to: and respectively passing the first object image feature map and the second object image feature map through a channel attention module to obtain the channel saliency first object image feature map and the channel saliency second object image feature map.
4. A mixed reality tracking interaction system according to claim 3, characterized in that the image local feature semantic similarity correlation analysis unit is configured to: respectively calculating the correlation between the feature matrixes of each group of corresponding channel dimensions in the channel saliency first object image feature map and the channel saliency second object image feature map according to the following correlation formula to obtain the tracking semantic feature vector consisting of a plurality of correlations;
wherein, in the correlation formula, m1(i,j) and m2(i,j) are respectively the feature values at each position (i,j) of the feature matrixes of the corresponding channel dimensions in the channel saliency first object image feature map and the channel saliency second object image feature map, W is the width and H is the height of the feature matrixes of each group of corresponding channel dimensions in the channel saliency first object image feature map and the channel saliency second object image feature map, v is the feature value of each position in the tracking semantic feature vector, and the logarithm is taken with base 2.
5. The mixed reality tracking interaction system of claim 4, wherein the tracking result generating unit is configured to: and the tracking semantic feature vector passes through a classifier to obtain a tracking result, wherein the tracking result is used for indicating whether a first object in a first frame of monitoring image and a second object in a second frame of monitoring image are the same object.
6. The mixed reality tracking interaction system of claim 5, further comprising a training unit for training the twin tracker including the first image encoder and the second image encoder, the channel attention module, and the classifier.
7. The mixed reality tracking interaction system of claim 6, wherein the training unit comprises:
The training data acquisition subunit is used for acquiring training data, wherein the training data comprises adjacent training first frame monitoring images and training second frame monitoring images which are acquired by the camera;
a training image feature extraction subunit, configured to pass the training first frame monitoring image and the training second frame monitoring image through a twin tracker including a first image encoder and a second image encoder to obtain a training first object image feature map and a training second object image feature map;
the training feature channel saliency analysis unit is used for enabling the training first object image feature image and the training second object image feature image to pass through the channel attention module respectively to obtain a training channel saliency first object image feature image and a training channel saliency second object image feature image;
The training image local feature semantic similarity association analysis subunit is used for respectively calculating the correlation degree between feature matrixes of each group of corresponding channel dimensions in the training channel saliency first object image feature image and the training channel saliency second object image feature image so as to obtain a training tracking semantic feature vector consisting of a plurality of correlation degrees;
A loss function calculation subunit, configured to calculate a loss function value between the training channel saliency first object image feature map and the training channel saliency second object image feature map;
The classification loss subunit is used for passing the training tracking semantic feature vector through a classifier to obtain a classification loss function value;
a weighting subunit for calculating a weighted sum of the loss function value and the classification loss function value to obtain a final loss function value;
A training subunit for training the twin tracker including the first image encoder and the second image encoder, the channel attention module, and the classifier based on the final loss function value.
8. The mixed reality tracking interaction system of claim 7, wherein the classification loss subunit is configured to:
processing the training tracking semantic feature vector by using the classifier to obtain a training classification result:
And calculating a cross entropy loss function value between the training classification result and a true value of whether a first object in the first frame monitoring image and a second object in the second frame monitoring image are the same object as the classification loss function value.
CN202410113877.5A 2024-01-27 2024-01-27 Mixed reality tracking interaction system Pending CN117934551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410113877.5A CN117934551A (en) 2024-01-27 2024-01-27 Mixed reality tracking interaction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410113877.5A CN117934551A (en) 2024-01-27 2024-01-27 Mixed reality tracking interaction system

Publications (1)

Publication Number Publication Date
CN117934551A 2024-04-26

Family

ID=90755494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410113877.5A Pending CN117934551A (en) 2024-01-27 2024-01-27 Mixed reality tracking interaction system

Country Status (1)

Country Link
CN (1) CN117934551A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109035297A (en) * 2018-07-19 2018-12-18 深圳市唯特视科技有限公司 A kind of real-time tracing method based on dual Siam's network
CN114004865A (en) * 2021-11-08 2022-02-01 兰州交通大学 Twin network augmented reality target tracking and registering method combined with DSST scale estimation
CN114049381A (en) * 2021-12-21 2022-02-15 重庆大学 Twin cross target tracking method fusing multilayer semantic information
US20230400327A1 (en) * 2022-06-09 2023-12-14 AeroCine Ventures, Inc. Localization processing service and observed scene reconstruction service
CN116612157A (en) * 2023-07-21 2023-08-18 云南大学 Video single-target tracking method and device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANFENG HE ET AL.: "A Twofold Siamese Network for Real-Time Object Tracking", 《ARXIV.ORG》, 24 February 2018 (2018-02-24), pages 1 - 10 *
LEI PU ET AL.: "SiamDA: Dual attention Siamese network for real-time visual tracking", 《SIGNAL PROCESSING: IMAGE COMMUNICATION》, 18 April 2021 (2021-04-18), pages 1 - 8 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination