WO2022171036A1 - Video target tracking method, video target tracking apparatus, storage medium, and electronic device - Google Patents

Video target tracking method, video target tracking apparatus, storage medium, and electronic device

Info

Publication number
WO2022171036A1
WO2022171036A1, PCT/CN2022/075086, CN2022075086W
Authority
WO
WIPO (PCT)
Prior art keywords
target
image
tracked
feature
video
Prior art date
Application number
PCT/CN2022/075086
Other languages
French (fr)
Chinese (zh)
Inventor
江毅
孙培泽
袁泽寰
王长虎
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2022171036A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Definitions

  • the present application is based on the Chinese application with the application number of 202110179157.5 and the filing date of February 9, 2021, and claims its priority.
  • the disclosure of the Chinese application is hereby incorporated into the present application as a whole.
  • the present disclosure relates to the technical field of image processing, and in particular, to a video target tracking method, a video target tracking device, a storage medium, and an electronic device.
  • Video target tracking is the basis of many video application fields such as human behavior analysis and sports video commentary, and requires high real-time performance.
  • However, the video target tracking in the related art is usually based on a process of first performing target detection and then performing target tracking. Specifically, target detection is performed on two adjacent frames of the video, and the detected targets are then matched into pairs, so as to achieve target tracking.
  • The inventor believes that, in the related art, target detection needs to be performed first and target tracking performed afterwards, so the delay is relatively high; especially in scenarios where there are many targets to be tracked, the delay problem is particularly obvious.
  • the present disclosure provides a video target tracking method, tracking device, storage medium and electronic equipment, so as to realize end-to-end video target tracking and reduce the time delay of video target tracking.
  • In a first aspect, the present disclosure provides a video target tracking method, the method comprising: acquiring the video to be tracked; and inputting the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, the target tracking model being used to perform the following processing: for each frame of image of the video to be tracked, determining the feature vector corresponding to the target to be tracked in the target detection image corresponding to the image, the target detection image including the target to be tracked; performing a first similarity calculation between each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determining the target feature vector from all the feature vectors of the feature map according to the first similarity calculation result; and determining the target to be tracked in the image according to the target feature vector.
  • the present disclosure provides a video target tracking device, the device comprising:
  • the acquisition module is used to acquire the video to be tracked
  • a tracking module configured to input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, the tracking module includes:
  • a first determination submodule configured to determine, for each frame of the video to be tracked, a feature vector corresponding to the target to be tracked in the target detection image corresponding to the image, and the target detection image includes the target to be tracked;
  • the second determination submodule is configured to perform a first similarity calculation between each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determine, according to the first similarity calculation result, the target feature vector among all the feature vectors of the feature map;
  • the third determination sub-module is configured to determine the target to be tracked in the image according to the target feature vector.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing apparatus, implements any of the video target tracking methods provided by the embodiments of the present disclosure.
  • In a fourth aspect, the present disclosure provides an electronic device, comprising: a storage device on which a computer program is stored; and a processing device configured to execute the computer program in the storage device to implement any video target tracking method provided by the embodiments of the present disclosure.
  • the present disclosure provides a computer program, comprising: instructions, when executed by a processor, the instructions cause the processor to execute any of the video object tracking methods provided by the embodiments of the present disclosure.
  • the present disclosure provides a computer program product comprising instructions, which when executed by a processor, cause the processor to execute any of the video object tracking methods provided by the embodiments of the present disclosure.
  • Through the above technical solution, the target tracking model can perform the first similarity calculation between the feature vectors corresponding to each frame of image of the video to be tracked and the feature vector corresponding to the target to be tracked in the target detection image, so as to determine the target to be tracked in each frame of image according to the first similarity calculation result. Therefore, the target to be tracked in each frame of image output by the target tracking model can correspond one-to-one to the target to be tracked in the target detection image; that is, target detection and target association can be completed at the same time, thereby reducing the delay in the target tracking process.
  • FIG. 1 is a flowchart of a method for tracking a video target according to an exemplary embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a target tracking process in a video target tracking method according to an exemplary embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a target tracking process in another video target tracking method according to an exemplary embodiment of the present disclosure
  • FIG. 4 is a block diagram of a video target tracking apparatus according to an exemplary embodiment of the present disclosure.
  • FIG. 5 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
  • The term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • The term "based on" means "based at least in part on".
  • The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
  • Video target tracking in the related art is usually based on a process of first performing target detection and then performing target tracking.
  • Specifically, a detection module performs target detection on two adjacent frames of the video, and an association module then matches the detected targets into pairs, so as to achieve target tracking.
  • the model components of this process are relatively complex, and the delay is relatively high, especially in scenarios where there are many targets to be tracked, the delay problem is particularly obvious.
  • the present disclosure proposes a video target tracking method, a video target tracking device, a storage medium and an electronic device, so as to realize end-to-end video target tracking and reduce the time delay of video target tracking.
  • FIG. 1 is a flowchart of a video target tracking method according to an exemplary embodiment of the present disclosure. Referring to FIG. 1, the video target tracking method includes:
  • Step 101: acquire the video to be tracked.
  • Acquiring the video to be tracked may be, for example, acquiring a video input by the user in response to the user's video input operation, or automatically acquiring the video captured by an image capture device after a target tracking instruction is received, etc., which is not limited in this embodiment of the present disclosure.
  • Step 102: input the video to be tracked into the target tracking model to obtain the target tracking result corresponding to the video to be tracked.
  • the target tracking model is used to perform the following processing: for each frame of the video to be tracked, determine the feature vector corresponding to the target to be tracked in the target detection image corresponding to the image, and the target detection image includes the target to be tracked; The first similarity calculation is performed between each feature vector in the feature map and the feature vector corresponding to the target to be tracked in the target detection image, and the target feature vector is determined in all the feature vectors in the feature map according to the first similarity calculation result; According to the target feature vector, the target to be tracked in the image is determined.
  • In this way, the target tracking model can perform the first similarity calculation between the feature vectors corresponding to each frame of image of the video to be tracked and the feature vector corresponding to the target to be tracked in the target detection image, so as to determine the target to be tracked in each frame of image according to the first similarity calculation result.
  • the video to be tracked may be input into the target tracking model.
  • each frame of images in the video to be tracked has a time sequence, so a video image sequence composed of multiple frames of images arranged in time sequence can be obtained according to the video to be tracked. Therefore, inputting the video to be tracked into the target tracking model may also be inputting the video image sequence corresponding to the video to be tracked into the target tracking model.
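  • As an illustration of building such a frame sequence, below is a minimal Python sketch using OpenCV; the file name, the helper function name, and the commented-out model call are illustrative assumptions rather than part of the disclosed method.

```python
import cv2  # OpenCV, used here only to illustrate building the frame sequence

# Hedged sketch: turn the video to be tracked into a time-ordered sequence of
# frames before feeding it to the target tracking model.
def read_video_image_sequence(path: str):
    capture = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:            # end of video
            break
        frames.append(frame)  # frames are appended in time order
    capture.release()
    return frames

video_image_sequence = read_video_image_sequence("video_to_track.mp4")
# tracking_results = target_tracking_model(video_image_sequence)  # assumed model call
```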
  • the training of the target tracking model may be performed according to the sample images and sample target information corresponding to the sample images.
  • The sample image can be input into the target tracking model to obtain the predicted target information output by the target tracking model for the sample image; then the loss function is calculated according to the predicted target information and the sample target information; and finally, the parameters of the target tracking model are adjusted according to the calculation result of the loss function, so that the target tracking model outputs more accurate target information.
  • the target detection function and the target tracking function of the target tracking model can be trained synchronously.
  • In contrast, the model training method in the related art is to train the detection module and the association module step by step. In scenarios with many targets to be tracked, this training method takes a lot of time, and it is difficult to achieve the optimal training effect.
  • The method of synchronously training the target detection function and the target tracking function of the target tracking model in the embodiment of the present disclosure not only simplifies the components of the target tracking model, but also simplifies the training process of the target tracking model, and can better satisfy the requirements of scenarios with multiple targets to be tracked.
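  • For intuition only, the following is a minimal PyTorch-style sketch of such joint training under strong assumptions: the toy model, the MSE loss, and the dummy tensors stand in for the real target tracking model, loss function, and labeled sample data, none of which are specified here.

```python
import torch
import torch.nn as nn

# Toy stand-in for the target tracking model: maps an image tensor to predicted
# target information (here, a fixed number of box parameters). The real model
# is more complex; this sketch only shows how prediction, loss, and parameter
# adjustment fit together in a single, jointly trained model.
class ToyTrackingModel(nn.Module):
    def __init__(self, num_targets=3):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.head = nn.Linear(8, num_targets * 4)  # e.g. (cx, cy, w, h) per target

    def forward(self, image):
        features = self.backbone(image).mean(dim=(2, 3))  # global pooling
        return self.head(features)

model = ToyTrackingModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()  # placeholder for the real loss on target information

sample_image = torch.randn(1, 3, 64, 64)    # dummy sample image
sample_target_info = torch.randn(1, 3 * 4)  # dummy pre-marked sample target info

predicted_target_info = model(sample_image)
loss = loss_fn(predicted_target_info, sample_target_info)
optimizer.zero_grad()
loss.backward()     # propagate the loss result back through the model
optimizer.step()    # adjust the parameters of the target tracking model
```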
  • Specifically, for each frame of image in the video to be tracked, the target tracking model can determine the feature vector corresponding to the target to be tracked in the target detection image corresponding to the image, then perform the first similarity calculation between each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determine the target feature vector from all the feature vectors in the feature map according to the first similarity calculation result. Finally, the target to be tracked in the image is determined according to the target feature vector.
  • the target detection image may be a previous frame image of the image including the target to be tracked, or the target detection image may be a preset input image including the target to be tracked.
  • the video target tracking method provided by the embodiments of the present disclosure can be applied to two application scenarios where the tracking target is given and the tracking target is unknown.
  • In a scenario where the tracking target is given, the target detection image may be a preset input image including the target to be tracked. For example, if the target to be tracked is person A, the preset input image may be a full-body photo of person A captured by an image acquisition device.
  • the target detection image may be the previous frame of the image including the target to be tracked.
  • A feature vector corresponding to the target to be tracked in the target detection image may be determined. The feature vector may be the result of vectorizing the image features of the center pixel of the target to be tracked, or the result of vectorizing the image features of a pixel that distinguishes the target to be tracked from other targets, etc., which is not limited in this embodiment of the present disclosure.
  • the manner of determining the feature vector is similar to that in the related art, and details are not repeated here.
  • the feature map corresponding to the image may be an image obtained by quantization according to the image feature vector of each pixel in the image.
  • The feature vector corresponding to the target to be tracked in the target detection image is a pixel-level feature vector, so the first similarity calculation can be performed between each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, so as to achieve target tracking according to the first similarity calculation result.
  • the first similarity calculation may be to perform vector dot product calculation, Euclidean distance calculation, etc. between the feature vector corresponding to each pixel in the image and the feature vector corresponding to the target to be tracked in the target detection image.
  • the method of calculating the first similarity is not limited.
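  • A minimal sketch of this first similarity calculation is shown below, assuming a PyTorch tensor layout in which the feature map has one C-dimensional vector per pixel; both the dot-product and Euclidean-distance variants mentioned above are illustrated, and all shapes are illustrative assumptions.

```python
import torch

# Compare every feature vector in the frame's feature map with the feature
# vector of one target to be tracked in the target detection image.
H, W, C = 16, 16, 64
feature_map = torch.randn(H, W, C)   # feature map of the current frame
target_vector = torch.randn(C)       # feature vector of one target to be tracked

flat = feature_map.reshape(-1, C)    # (H*W, C): one vector per pixel

# Option 1: vector dot product (larger value = more similar)
dot_similarity = flat @ target_vector            # (H*W,)

# Option 2: Euclidean distance (smaller value = more similar)
euclidean_distance = torch.cdist(flat, target_vector.unsqueeze(0)).squeeze(1)

best_idx = dot_similarity.argmax()
best_y, best_x = divmod(best_idx.item(), W)      # pixel location of the best match
print(f"most similar pixel location: ({best_y}, {best_x})")
```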
  • The target tracking model may include an attention mechanism module, and the attention mechanism module may perform the above first similarity calculation process to determine the feature vectors corresponding to the targets existing in both the image and the target detection image, that is, to obtain the target feature vectors.
  • In a scenario where the tracking target is given, the target detection image is a preset input image including the target to be tracked. The attention mechanism module can perform the first similarity calculation between each feature vector in the feature map and the feature vector corresponding to the target to be tracked in the preset input image, and output the target feature vector according to the first similarity calculation result, so that the target tracking model can determine the target to be tracked in the frame image according to the target feature vector.
  • Determining the target feature vector among all the feature vectors of the feature map may be: when the target detection image includes N targets to be tracked, selecting, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation result as the target feature vectors, where N is a positive integer.
  • The target detection image is a preset input image including the targets to be tracked, and it can be determined by a target detection method in the related art that the target detection image includes N targets to be tracked, that is, the number of feature vectors corresponding to the targets to be tracked is N. In this scenario, for each frame of image in the video to be tracked, the N feature vectors with the largest first similarity calculation result may be selected from all the feature vectors included in the feature map corresponding to the image as the target feature vectors, to determine the target to be tracked in this image.
  • For example, all the first similarity calculation results can be sorted from large to small, and then, from all the feature vectors included in the feature map corresponding to the image, the N feature vectors corresponding to the top-ranked first similarity calculation results are selected as the target feature vectors.
  • The manner of determining the target feature vector is not limited in this embodiment of the present disclosure.
  • The target to be tracked determined according to the selected N feature vectors is the target existing in both each frame of the video to be tracked and the target detection image; that is, the target to be tracked in each frame of image output by the target tracking model can be in one-to-one correspondence with the target to be tracked in the target detection image, so target detection and target association can be completed at the same time, thereby reducing the delay in the target tracking process.
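  • The selection described above can be sketched as follows; the reduction over the N target vectors before taking the top N results is one possible reading of the text and is an assumption, as are the tensor shapes.

```python
import torch

# Given N targets in the target detection image, keep the N feature vectors of
# the frame's feature map with the largest first-similarity results.
H, W, C, N = 16, 16, 64, 3
feature_map = torch.randn(H * W, C)    # flattened feature map of the frame
target_vectors = torch.randn(N, C)     # N targets from the target detection image

similarity = feature_map @ target_vectors.T       # (H*W, N) first-similarity results
per_pixel_best = similarity.max(dim=1).values     # best score per feature-map vector
topn = per_pixel_best.topk(N)                     # N largest first-similarity results
target_feature_vectors = feature_map[topn.indices]   # (N, C) selected target feature vectors
```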
  • the target tracking model can also be used to determine the feature vectors corresponding to all targets in the image according to the pre-trained position vector parameters.
  • In this case, determining the target feature vector among all the feature vectors of the feature map may be: when the target detection image includes N targets to be tracked, selecting, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation result as similar feature vectors, where N is a positive integer; then, performing deduplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors.
  • In a scenario where the tracking target is unknown, the target detection image is the previous frame image, relative to the current image, that includes the target to be tracked. The attention mechanism module can perform the first similarity calculation between each feature vector in the feature map and the feature vector corresponding to the target to be tracked in the previous frame image, and determine the similar feature vectors according to the first similarity calculation result.
  • the attention mechanism module can also determine the feature vectors of all objects in this frame of images according to the pre-trained position vector parameters.
  • the target tracking model can perform feature vector fusion based on the similar feature vectors and the feature vectors of all targets in the frame image to obtain the target feature vector. Finally, the target tracking model can determine the target to be tracked in the current frame image according to the target feature vector.
  • the position vector parameter may include a plurality of unit position vectors.
  • For example, assuming the image size is H × W, the position vector parameter may include H × W or more unit position vectors so as to cover every pixel location in the image. It should be understood that the number of unit position vectors included in the position vector parameter can be set larger to adapt to image sizes in different scenarios.
  • the position vector parameter may be obtained by training in the following way: determining the predicted feature vector corresponding to the target in the sample image according to the initial position vector parameter, so as to obtain the predicted target information corresponding to the sample image, wherein the sample image is pre-marked with Corresponding sample target information, and then calculate the loss function according to the predicted target information and the sample target information, and adjust the initial position vector parameter according to the calculation result of the loss function.
  • the initial position vector parameter may be a random value, that is, after setting the number of unit position vectors in the position vector, the value of each unit position vector is a random value. Then, according to the result of the loss function of the target tracking model in the training process, the position vector parameter can be adjusted through the back-propagation algorithm, so that the position vector parameter can more accurately predict the position of the target in the image.
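  • A minimal sketch of such a learnable position vector parameter is given below, assuming a PyTorch-style implementation; the feature dimension, optimizer, and stand-in loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Learnable position vector parameter: at least H*W unit position vectors,
# initialized with random values and adjusted by back-propagation.
H, W, C = 16, 16, 64
position_vectors = nn.Parameter(torch.randn(H * W, C))  # random initial values

# During training, the loss computed on the predicted target information is
# back-propagated into `position_vectors`; a stand-in loss is used here.
optimizer = torch.optim.SGD([position_vectors], lr=1e-3)
dummy_loss = position_vectors.sum()   # placeholder for the real tracking loss
dummy_loss.backward()                 # gradients flow into the position vectors
optimizer.step()                      # adjust the initial position vector parameter
```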
  • For the first frame image of the video to be tracked, the feature vectors determined according to the pre-trained position vector parameters may be used as the target feature vectors. It should be understood that, in a scenario where the tracking target is unknown, the target detection image is the previous frame image, so the first similarity calculation cannot be performed for the first frame image.
  • the feature vectors corresponding to all targets in the image determined according to the pre-trained position vector parameters may be used as target feature vectors to determine the target to be tracked in the image.
  • For each frame of image other than the first frame, there is a corresponding previous frame image that includes the target to be tracked, so the target feature vector can be determined according to the corresponding feature vectors and the first similarity calculation result.
  • N feature vectors with the largest first similarity calculation result may be selected as similar feature vectors, where N is a positive integer. Then, the feature vectors corresponding to all targets in the image and N similar feature vectors can be deduplicated to obtain target feature vectors.
  • The N similar feature vectors represent the feature vectors corresponding to the targets existing in both each frame of the video to be tracked and the target detection image, while the feature vectors determined according to the position vector parameters correspond to all targets in each frame of the video to be tracked. For example, the N similar feature vectors may be the feature vectors corresponding to targets B1 and B2, and the feature vectors determined according to the position vector parameters may be those corresponding to targets B1, B2, and B3; the two sets share feature vectors corresponding to the same targets (B1 and B2).
  • In this case, feature vector fusion can be performed. For example, the feature vectors corresponding to all the targets in the image and the N similar feature vectors can be deduplicated to obtain the target feature vectors used to determine the target to be tracked.
  • Deduplicating the feature vectors corresponding to all the targets in the image and the N similar feature vectors to obtain the target feature vectors may be: for the feature vector corresponding to each target in the image, performing the second similarity calculation between that feature vector and the N similar feature vectors; when a second similarity calculation result is greater than or equal to the preset similarity, deleting the feature vector corresponding to that second similarity calculation result from the feature vectors corresponding to all targets in the image or from the N similar feature vectors. Then, the feature vectors corresponding to all targets in the image after deletion and the remaining feature vectors among the N similar feature vectors are taken as the target feature vectors.
  • The second similarity calculation may be, for the feature vector corresponding to each target in the image, performing a vector dot product calculation, a Euclidean distance calculation, etc. between that feature vector and the N similar feature vectors.
  • the method of similarity calculation is not limited.
  • the preset similarity can be customized according to the actual situation, which is not limited in the present disclosure.
  • When the second similarity calculation result is greater than or equal to the preset similarity, the feature vector corresponding to a certain target in the image and the corresponding feature vector among the N similar feature vectors can be regarded as the same feature vector, so only one of them needs to be kept. Which of the two is deleted is not limited; the deletion operation can be performed on the feature vectors corresponding to all the targets in the image or on the N similar feature vectors.
  • For example, if the N similar feature vectors are the feature vectors corresponding to targets B1 and B2, and the feature vectors determined according to the position vector parameters are those corresponding to targets B1, B2, and B3, the deletion operation may be performed on the N similar feature vectors. After deletion, the N similar feature vectors are empty, and the feature vectors corresponding to all targets in the image are still those corresponding to targets B1, B2, and B3. The feature vectors corresponding to all targets in the image after deletion, together with the remaining feature vectors among the N similar feature vectors, are therefore the feature vectors corresponding to targets B1, B2, and B3; that is, the target feature vectors are the feature vectors corresponding to targets B1, B2, and B3.
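  • The deduplication described above might be sketched as follows; cosine similarity as the second similarity, the threshold value, and the choice to delete from the similar-vector set are assumptions made for illustration.

```python
import torch

# Compare each feature vector of the targets detected in the current frame
# (via position vectors) with the N similar feature vectors; when a second
# similarity reaches the preset similarity, drop the duplicate similar vector.
def deduplicate(frame_vectors: torch.Tensor,
                similar_vectors: torch.Tensor,
                preset_similarity: float = 0.9) -> torch.Tensor:
    kept_similar = []
    for sim_vec in similar_vectors:
        # second similarity between this similar vector and every frame vector
        scores = torch.nn.functional.cosine_similarity(
            frame_vectors, sim_vec.unsqueeze(0), dim=1)
        if scores.max() < preset_similarity:
            kept_similar.append(sim_vec)   # no duplicate found, keep it
        # otherwise the duplicate is deleted from the similar-vector set
    if kept_similar:
        return torch.cat([frame_vectors, torch.stack(kept_similar)], dim=0)
    return frame_vectors

# Toy example mirroring the B1/B2/B3 description above: the similar vectors
# (B1, B2) are already covered by the frame vectors (B1, B2, B3), so the
# resulting target feature vectors are just those of B1, B2 and B3.
frame_vectors = torch.randn(3, 64)            # B1, B2, B3
similar_vectors = frame_vectors[:2].clone()   # B1, B2
target_feature_vectors = deduplicate(frame_vectors, similar_vectors)
print(target_feature_vectors.shape)           # torch.Size([3, 64])
```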
  • The above deduplication processing is only one possible way of fusing the feature vectors provided by this embodiment of the present disclosure. In specific implementations, other methods may also be used to fuse the feature vectors corresponding to all targets in the image with the N similar feature vectors, which is not limited in this embodiment of the present disclosure.
  • In this way, the feature vectors corresponding to all the targets in the image can be determined through the position vector parameters, the similar feature vectors corresponding to the targets existing in both the image and the target detection image can be determined according to the similarity calculation results, and the similar feature vectors can then be fused with the feature vectors corresponding to all the targets in the image to remove redundant feature vectors. This improves computational efficiency and yields a more accurate target to be tracked.
  • The target feature vector can be subjected to a linear feature transformation to obtain tracking frame information corresponding to the target to be tracked in the image, where the tracking frame information includes position information and size information of the tracking frame corresponding to the target to be tracked, so that the target to be tracked can be indicated in the image according to the tracking frame information.
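  • A minimal sketch of such a linear feature transformation is shown below; the (center, size) output convention and the feature dimension are assumptions, since the text only states that position and size information are produced.

```python
import torch
import torch.nn as nn

# Turn one target feature vector into tracking frame information (position and
# size of the tracking frame) via a linear feature transformation.
C = 64
box_head = nn.Linear(C, 4)                   # linear feature transformation

target_feature_vector = torch.randn(1, C)    # one target feature vector
cx, cy, w, h = box_head(target_feature_vector).squeeze(0).tolist()
print(f"tracking frame: center=({cx:.2f}, {cy:.2f}), size=({w:.2f}, {h:.2f})")
```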
  • An embodiment of the present disclosure also provides a video target tracking apparatus, which can become part or all of an electronic device through software, hardware, or a combination of the two. Referring to FIG. 4, the video target tracking apparatus 400 includes:
  • an acquisition module 401 configured to acquire the video to be tracked
  • a tracking module 402 configured to input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, the tracking module includes:
  • the first determination sub-module 4021 is configured to, for each frame of the video to be tracked, determine the feature vector corresponding to the target to be tracked in the target detection image corresponding to the image, and the target detection image includes the target to be tracked ;
  • the second determination sub-module 4022 is configured to perform a first similarity calculation between each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determine, according to the first similarity calculation result, the target feature vector among all the feature vectors of the feature map;
  • the third determination sub-module 4023 is configured to determine the target to be tracked in the image according to the target feature vector.
  • the target detection image is an image of the previous frame of the image that includes the target to be tracked; or, the target detection image is a preset input image that includes the target to be tracked.
  • the second determination submodule 4022 is used for:
  • when the target detection image includes N targets to be tracked, select, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation result as the target feature vectors, where N is a positive integer.
  • the target tracking model is also used to determine the feature vectors corresponding to all targets in the image according to the pre-trained position vector parameters, and the second determination submodule is used for:
  • when the target detection image includes N targets to be tracked, selecting, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation result as similar feature vectors, where N is a positive integer;
  • Deduplication processing is performed on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain target feature vectors.
  • the image is a first frame image of the video to be tracked, and the device further includes:
  • the fourth determination sub-module is configured to use the feature vector determined according to the pre-trained position vector parameter as the target feature vector.
  • the second determination submodule 4022 is used for:
  • the feature vectors corresponding to all targets in the image after deletion processing and the remaining feature vectors in the N similar feature vectors are used as target feature vectors.
  • the apparatus 400 further includes the following modules for obtaining the position vector parameters through training:
  • the first training module is used to determine the predicted feature vector corresponding to the target in the sample image according to the initial position vector parameter, so as to obtain the predicted target information corresponding to the sample image, wherein the sample image is pre-marked with the corresponding sample target information;
  • the second training module is configured to calculate a loss function according to the predicted target information and the sample target information, and adjust the initial position vector parameter according to the calculation result of the loss function.
  • modules may be implemented as software components executing on one or more general-purpose processors, or as hardware, such as programmable logic devices and/or application specific integrated circuits, that perform certain functions or combinations thereof.
  • The modules may be embodied in the form of a software product, which may be stored in a non-volatile storage medium and includes instructions for causing a computer device (e.g., a personal computer, a server, a network device, a mobile terminal, etc.) to implement the method described in the embodiments of the present disclosure.
  • the above-mentioned modules may also be implemented on a single device, or may be distributed on multiple devices. The functions of these modules can be combined with each other or further split into multiple sub-modules.
  • an embodiment of the present disclosure further provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, implements the steps of any of the above video target tracking methods.
  • An embodiment of the present disclosure further provides an electronic device, including:
  • a storage device on which a computer program is stored; and
  • a processing device configured to execute the computer program in the storage device, so as to realize the steps of any of the above video target tracking methods.
  • Terminal devices in the embodiments of the present disclosure may include, but are not limited to, such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), vehicle-mounted terminals (eg, mobile terminals such as in-vehicle navigation terminals), etc., and stationary terminals such as digital TVs, desktop computers, and the like.
  • the electronic device shown in FIG. 5 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • As shown in FIG. 5, an electronic device 500 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 501, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic device 500 are also stored.
  • the processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504.
  • An input/output (I/O) interface 505 is also connected to bus 504 .
  • The following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 507 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; and storage devices 508 including, for example, a magnetic tape, a hard disk, etc.
  • Communication means 509 may allow electronic device 500 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 5 shows electronic device 500 having various means, it should be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • The computer program may be downloaded and installed from a network via the communication device 509, or installed from the storage device 508, or installed from the ROM 502. When the computer program is executed by the processing device 501, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • an embodiment of the present disclosure provides a computer program, including: instructions, when executed by a processor, the instructions cause the processor to execute any of the above video object tracking methods.
  • an embodiment of the present disclosure provides a computer program product, including instructions, when executed by a processor, the instructions cause the processor to execute any of the above video object tracking methods.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal in baseband or propagated as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • Communication may be performed using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device: acquires the video to be tracked; inputs the video to be tracked into the target tracking model to obtain The target tracking result corresponding to the video to be tracked, and the target tracking model is used to perform the following processing: for each frame of the video to be tracked, determine the feature corresponding to the target to be tracked in the target detection image corresponding to the image vector, the target detection image includes the target to be tracked; first similarity is performed between each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image degree calculation, and according to the first similarity calculation result, determine the target feature vector from all the feature vectors in the feature map; according to the target feature vector, determine the target to be tracked in the image.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • The remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • Each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented in dedicated hardware-based systems that perform the specified functions or operations , or can be implemented in a combination of dedicated hardware and computer instructions.
  • The modules involved in the embodiments of the present disclosure may be implemented in software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself.
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • Example 1 provides a video target tracking method, the method comprising: acquiring a video to be tracked; and inputting the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, the target tracking model being used to perform the following processing: for each frame of image of the video to be tracked, determining a feature vector corresponding to the target to be tracked in the target detection image corresponding to the image, the target detection image including the target to be tracked; performing a first similarity calculation between each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determining a target feature vector from all the feature vectors of the feature map according to the first similarity calculation result; and determining the target to be tracked in the image according to the target feature vector.
  • Example 2 provides the method of Example 1, wherein the target detection image is a previous frame image of the image including the target to be tracked; or
  • the target detection image is a preset input image including the target to be tracked.
  • Example 3 provides the method of Example 1, wherein determining the target feature vector among all the feature vectors of the feature map according to the first similarity calculation result includes:
  • when the target detection image includes N targets to be tracked, selecting, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation result as the target feature vectors, where N is a positive integer.
  • Example 4 provides the method of Example 1, wherein the target tracking model is further configured to determine feature vectors corresponding to all targets in the image according to pre-trained position vector parameters, and determining the target feature vector among all the feature vectors of the feature map according to the first similarity calculation result includes:
  • when the target detection image includes N targets to be tracked, selecting, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation result as similar feature vectors, where N is a positive integer;
  • Deduplication processing is performed on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain target feature vectors.
  • Example 5 provides the method of Example 4, where the image is a first frame image of the video to be tracked, and the method further includes:
  • the feature vector determined according to the pre-trained position vector parameter is used as the target feature vector.
  • Example 6 provides the method of Example 4, wherein performing deduplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors includes:
  • the feature vectors corresponding to all targets in the image after deletion processing and the remaining feature vectors in the N similar feature vectors are used as target feature vectors.
  • Example 7 provides the method of any one of Examples 4-6, and the position vector parameter is obtained by training in the following manner:
  • a loss function is calculated according to the predicted target information and the sample target information, and the initial position vector parameter is adjusted according to the calculation result of the loss function.
  • Example 8 provides a video target tracking apparatus, the apparatus comprising:
  • the acquisition module is used to acquire the video to be tracked
  • a tracking module configured to input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, the tracking module includes:
  • a first determination submodule configured to determine, for each frame of the video to be tracked, a feature vector corresponding to the target to be tracked in the target detection image corresponding to the image, and the target detection image includes the target to be tracked;
  • the second determination submodule is configured to perform a first similarity calculation between each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determine, according to the first similarity calculation result, the target feature vector among all the feature vectors of the feature map;
  • the third determination sub-module is configured to determine the target to be tracked in the image according to the target feature vector.
  • Example 9 provides the apparatus of Example 8, and the target detection image is a previous frame image of the image including the target to be tracked; or, the target detection image is a preset input image including the target to be tracked.
  • Example 10 provides the apparatus of Example 8, wherein the second determination submodule is configured to:
  • when the target detection image includes N targets to be tracked, select, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation result as the target feature vectors, where N is a positive integer.
  • Example 11 provides the apparatus of Example 8, wherein the target tracking model is further configured to determine feature vectors corresponding to all targets in the image according to pre-trained position vector parameters, and the second determination submodule is configured to:
  • when the target detection image includes N targets to be tracked, select, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation result as similar feature vectors, where N is a positive integer;
  • Deduplication processing is performed on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain target feature vectors.
  • Example 12 provides the apparatus of Example 11, where the image is a first frame image of the video to be tracked, and the apparatus further includes:
  • the fourth determination sub-module is configured to use the feature vector determined according to the pre-trained position vector parameter as the target feature vector.
  • Example 13 provides the apparatus of Example 11, wherein the second determination submodule is configured to:
  • the feature vectors corresponding to all targets in the image after deletion processing and the remaining feature vectors in the N similar feature vectors are used as target feature vectors.
  • Example 14 provides the apparatus of any one of Examples 11-13, the apparatus further comprising the following module for training to obtain the position vector parameter:
  • the first training module is used to determine the predicted feature vector corresponding to the target in the sample image according to the initial position vector parameter, so as to obtain the predicted target information corresponding to the sample image, wherein the sample image is pre-marked with the corresponding sample target information;
  • the second training module is configured to calculate a loss function according to the predicted target information and the sample target information, and adjust the initial position vector parameter according to the calculation result of the loss function.
  • Example 15 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, implements the steps of the method of any one of Examples 1-7.
  • Example 16 provides an electronic device comprising:
  • a processing device configured to execute the computer program in the storage device, to implement the steps of the method in any one of Examples 1-7.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a video target tracking method, a video target tracking apparatus, a storage medium, and an electronic device, for realizing end-to-end video target tracking and reducing the time delay of video target tracking. The video target tracking method comprises: acquiring a video to be tracked; and inputting said video into a target tracking model to obtain a target tracking result corresponding to said video, the target tracking model being used for performing the following processing: for each image frame of said video, determining a feature vector corresponding to a target to be tracked in a target detection image corresponding to the image, the target detection image comprising said target; performing a first similarity calculation on each feature vector in a feature map corresponding to the image and the feature vector corresponding to said target in the target detection image, and determining a target feature vector from among all the feature vectors in the feature map according to the first similarity calculation result; and determining said target in the image according to the target feature vector.

Description

视频目标追踪方法、视频目标追踪装置、存储介质及电子设备Video target tracking method, video target tracking device, storage medium and electronic device
相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS
本申请是以申请号为202110179157.5,申请日为2021年2月9日的中国申请为基础,并主张其优先权,该中国申请的公开内容在此作为整体引入本申请中。The present application is based on the Chinese application with the application number of 202110179157.5 and the filing date of February 9, 2021, and claims its priority. The disclosure of the Chinese application is hereby incorporated into the present application as a whole.
技术领域technical field
本公开涉及图像处理技术领域,具体地,涉及一种视频目标追踪方法、视频目标追踪装置、存储介质及电子设备。The present disclosure relates to the technical field of image processing, and in particular, to a video target tracking method, a video target tracking device, a storage medium, and an electronic device.
背景技术Background technique
视频目标追踪是人类行为分析、体育视频解说等众多视频应用领域的基础,对实时性要求较高。但是,相关技术中的视频目标追踪通常是基于先目标检测再目标追踪的流程。具体地,先对视频中的前后两帧做目标检测,然后将检测到的目标匹配成对,从而实现目标的追踪。Video target tracking is the basis of many video application fields such as human behavior analysis and sports video commentary, and requires high real-time performance. However, the video target tracking in the related art is usually based on the process of first target detection and then target tracking. Specifically, the target detection is performed on the two frames before and after the video, and then the detected targets are matched into pairs, so as to achieve target tracking.
发明内容SUMMARY OF THE INVENTION
提供该发明内容部分以便以简要的形式介绍构思,这些构思将在后面的具体实施方式部分被详细描述。该发明内容部分并不旨在标识要求保护的技术方案的关键特征或必要特征,也不旨在用于限制所要求的保护的技术方案的范围。This Summary is provided to introduce concepts in a simplified form that are described in detail in the Detailed Description section that follows. This summary section is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
发明人认为,相关技术中,需要先执行目标检测,再执行目标追踪,因此时延较高,特别是在待追踪目标较多的场景下,时延问题尤其明显。The inventor believes that, in the related art, target detection needs to be performed first, and then target tracking is performed, so the delay is relatively high, especially in a scenario where there are many targets to be tracked, the delay problem is particularly obvious.
为了解决上述技术问题,本公开提供了一种视频目标的追踪方法、追踪装置、存储介质及电子设备,以实现端到端的视频目标追踪,减小视频目标追踪 的时延。In order to solve the above technical problems, the present disclosure provides a video target tracking method, tracking device, storage medium and electronic equipment, so as to realize end-to-end video target tracking and reduce the time delay of video target tracking.
第一方面,本公开提供一种视频目标追踪方法,所述方法包括:In a first aspect, the present disclosure provides a video target tracking method, the method comprising:
获取待追踪视频;Get the video to be tracked;
将所述待追踪视频输入目标追踪模型,以得到所述待追踪视频对应的目标追踪结果,所述目标追踪模型用于执行如下处理:Input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, and the target tracking model is used to perform the following processing:
针对所述待追踪视频的每一帧图像,确定所述图像对应的目标检测图像中待追踪目标对应的特征向量,所述目标检测图像包括所述待追踪目标;For each frame of the video to be tracked, determine a feature vector corresponding to the target to be tracked in the target detection image corresponding to the image, and the target detection image includes the target to be tracked;
将所述图像对应的特征图中的每一特征向量与所述目标检测图像中所述待追踪目标对应的所述特征向量进行第一相似度计算,并根据第一相似度计算结果,在所述特征图的所有特征向量中确定目标特征向量;Perform a first similarity calculation on each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and according to the first similarity calculation result, in the Determine the target feature vector from all the feature vectors of the feature map;
根据所述目标特征向量,确定所述图像中的待追踪目标。According to the target feature vector, the target to be tracked in the image is determined.
In a second aspect, the present disclosure provides a video target tracking apparatus, which includes:
an acquisition module, configured to acquire a video to be tracked;
a tracking module, configured to input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, where the tracking module includes:
a first determination submodule, configured to determine, for each frame image of the video to be tracked, a feature vector corresponding to a target to be tracked in a target detection image corresponding to the image, where the target detection image includes the target to be tracked;
a second determination submodule, configured to perform a first similarity calculation between each feature vector in a feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and to determine a target feature vector from all the feature vectors of the feature map according to a result of the first similarity calculation;
a third determination submodule, configured to determine the target to be tracked in the image according to the target feature vector.
In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processing device, implements any video target tracking method provided by the embodiments of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device, including:
a storage device on which a computer program is stored;
a processing device, configured to execute the computer program in the storage device to implement any video target tracking method provided by the embodiments of the present disclosure.
In a fifth aspect, the present disclosure provides a computer program, including instructions which, when executed by a processor, cause the processor to execute any video target tracking method provided by the embodiments of the present disclosure.
In a sixth aspect, the present disclosure provides a computer program product, including instructions which, when executed by a processor, cause the processor to execute any video target tracking method provided by the embodiments of the present disclosure.
Through the above technical solution, the target tracking model can perform the first similarity calculation between the feature vectors corresponding to each frame image of the video to be tracked and the feature vector corresponding to the target to be tracked in the target detection image, and determine the target to be tracked in each frame image according to the result of the first similarity calculation. As a result, the target to be tracked in each frame image output by the target tracking model corresponds one-to-one to the target to be tracked in the target detection image; that is, target detection and target association are completed at the same time, which reduces the latency of the target tracking process.
Other features and advantages of the present disclosure will be described in detail in the Detailed Description that follows.
DESCRIPTION OF THE DRAWINGS
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart of a video target tracking method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a target tracking process in a video target tracking method according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a target tracking process in another video target tracking method according to an exemplary embodiment of the present disclosure;
FIG. 4 is a block diagram of a video target tracking apparatus according to an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
DETAILED DESCRIPTION
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, the method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.
As used herein, the term "including" and its variations are open-ended inclusions, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order of the functions performed by these devices, modules, or units, or their interdependence. It should also be noted that the modifiers "a", "an", and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of the messages or information exchanged between multiple devices in the embodiments of the present disclosure are used for illustrative purposes only and are not intended to limit the scope of these messages or information.
Video target tracking in the related art is usually based on a process of performing target detection first and target tracking afterwards. Specifically, a detection module first performs target detection on two adjacent frames of the video, and an association module then matches the detected targets into pairs, thereby achieving target tracking. The model components of such a process are relatively complex and the latency is relatively high; in scenarios with many targets to be tracked, the latency problem is especially pronounced.
In view of this, the present disclosure proposes a video target tracking method, a video target tracking apparatus, a storage medium, and an electronic device, so as to achieve end-to-end video target tracking and reduce the latency of video target tracking.
FIG. 1 is a flowchart of a video target tracking method according to an exemplary embodiment of the present disclosure. Referring to FIG. 1, the video target tracking method includes:
Step 101: acquiring a video to be tracked.
For example, acquiring the video to be tracked may be acquiring, in response to a video input operation of a user, the video input by the user, or may be automatically acquiring, after a target tracking instruction is received, a video captured by an image acquisition device from that device, and so on, which is not limited in the embodiments of the present disclosure.
Step 102: inputting the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked. The target tracking model is configured to perform the following processing: for each frame image of the video to be tracked, determining a feature vector corresponding to a target to be tracked in a target detection image corresponding to the image, where the target detection image includes the target to be tracked; performing a first similarity calculation between each feature vector in a feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determining a target feature vector from all the feature vectors of the feature map according to a result of the first similarity calculation; and determining the target to be tracked in the image according to the target feature vector.
In the above manner, the target tracking model can perform the first similarity calculation between the feature vectors corresponding to each frame image of the video to be tracked and the feature vector corresponding to the target to be tracked in the target detection image, and thus determine the target to be tracked in each frame image according to the result of the first similarity calculation. As a result, the target to be tracked in each frame image output by the target tracking model corresponds one-to-one to the target to be tracked in the target detection image; that is, target detection and target association are completed at the same time, which reduces the latency of the target tracking process.
In order to make those skilled in the art better understand the video target tracking method provided by the present disclosure, the above steps are described in detail below by way of example.
For example, after the video to be tracked is acquired, the video to be tracked may be input into the target tracking model. It should be understood that the frame images in the video to be tracked have a temporal order, so a video image sequence composed of multiple frame images arranged in chronological order can be obtained from the video to be tracked. Therefore, inputting the video to be tracked into the target tracking model may also be inputting the video image sequence corresponding to the video to be tracked into the target tracking model.
For example, the target tracking model may be trained according to sample images and sample target information corresponding to the sample images. Specifically, a sample image may be input into the target tracking model to obtain predicted target information output by the target tracking model for the sample image; a loss function is then calculated according to the predicted target information and the sample target information; and finally, the parameters of the target tracking model are adjusted according to the calculation result of the loss function, so that the target tracking model outputs more accurate target information. In this way, the target detection function and the target tracking function of the target tracking model can be trained synchronously.
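The joint training described above can be pictured with a minimal sketch. The following Python/PyTorch fragment is an illustrative assumption rather than the disclosed implementation; `tracking_model`, `tracking_loss`, `sample_image`, and `sample_targets` are hypothetical placeholders for any module, loss, and annotated data that fit the description above.

```python
import torch

def train_step(tracking_model: torch.nn.Module,
               tracking_loss,            # any loss comparing predictions with annotations
               optimizer: torch.optim.Optimizer,
               sample_image: torch.Tensor,
               sample_targets):
    """One synchronous training step: detection and association share the same
    model and the same loss, so both functions are updated together."""
    optimizer.zero_grad()
    predicted_targets = tracking_model(sample_image)         # predicted target information
    loss = tracking_loss(predicted_targets, sample_targets)  # compare with sample target information
    loss.backward()                                          # back-propagate the loss
    optimizer.step()                                         # adjust the model parameters
    return loss.item()
```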
It should be understood that, because the related art requires target detection first and target association afterwards, the model training approach in the related art is to train the detection module and the association module step by step. In scenarios with many targets to be tracked, such training consumes a large amount of time and makes it difficult to reach an optimal training effect. In contrast, the approach in the embodiments of the present disclosure of training the target detection function and the target tracking function of the target tracking model synchronously not only simplifies the components of the target tracking model but also simplifies its training process, and can better meet the requirements of multi-target tracking scenarios.
In the application stage, for each frame image of the video to be tracked, the target tracking model may determine the feature vector corresponding to the target to be tracked in the target detection image corresponding to the image, then perform the first similarity calculation between each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determine the target feature vector from all the feature vectors of the feature map according to the result of the first similarity calculation. Finally, the target to be tracked in the image is determined according to the target feature vector.
For example, the target detection image may be the previous frame image of the image that includes the target to be tracked, or the target detection image may be a preset input image that includes the target to be tracked. It should be understood that the video target tracking method provided by the embodiments of the present disclosure can be applied to two application scenarios: one where the tracking target is given and one where the tracking target is unknown. In the scenario where the tracking target is given, the target detection image may be a preset input image that includes the target to be tracked; for example, if the target to be tracked is person A, the preset input image may be a full-body photograph of person A captured by an image acquisition device. In the scenario where the tracking target is unknown, all targets in each frame image need to be tracked, and the target detection image may then be the previous frame image of the image that includes the target to be tracked.
After the target detection image is determined, the feature vector corresponding to the target to be tracked in the target detection image may be determined. The feature vector may be the result obtained by vectorizing the image feature of the center pixel of the target to be tracked, or may be the result obtained by vectorizing the image feature of a pixel that distinguishes the target to be tracked from other targets, and so on, which is not limited in the embodiments of the present disclosure. The manner of determining the feature vector is similar to that in the related art and is not repeated here.
For example, the feature map corresponding to the image may be an image obtained by vectorizing the image feature of each pixel in the image. Moreover, the feature vector corresponding to the target to be tracked in the target detection image is a pixel-level feature vector, so the first similarity calculation can be performed between each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, thereby achieving target tracking according to the result of the first similarity calculation. For example, the first similarity calculation may be a vector dot product calculation, a Euclidean distance calculation, or the like between the feature vector corresponding to each pixel in the image and the feature vector corresponding to the target to be tracked in the target detection image; the manner of the first similarity calculation is not limited in the embodiments of the present disclosure.
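As an illustration only (not the disclosed implementation), the sketch below computes the two similarity measures named above, a vector dot product and a Euclidean distance, between every pixel-level feature vector of an H×W×C feature map and a single target feature vector; the names `first_similarity`, `feature_map`, and `target_vec` are assumptions introduced for the example.

```python
import numpy as np

def first_similarity(feature_map: np.ndarray, target_vec: np.ndarray,
                     metric: str = "dot") -> np.ndarray:
    """Compare every pixel-level feature vector of an (H, W, C) feature map
    with one (C,) target feature vector; return an (H, W) score map."""
    if metric == "dot":
        # Vector dot product: a larger score means a higher similarity.
        return np.einsum("hwc,c->hw", feature_map, target_vec)
    if metric == "euclidean":
        # Negative Euclidean distance, so that larger still means more similar.
        return -np.linalg.norm(feature_map - target_vec[None, None, :], axis=-1)
    raise ValueError(f"unsupported metric: {metric}")

# Hypothetical usage: a 32x32 feature map with 64-dimensional features.
scores = first_similarity(np.random.rand(32, 32, 64), np.random.rand(64))
```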
It should be understood that the target tracking model may include an attention mechanism module, and the attention mechanism module may perform the first similarity calculation process to determine the feature vector corresponding to a target that exists in both the image and the target detection image, that is, to obtain the target feature vector.
For example, referring to FIG. 2, in the scenario where the tracking target is given, the target detection image is a preset input image that includes the target to be tracked. Each frame image of the video to be tracked is taken in turn as the current frame image, and the feature map of the current frame image is determined. The attention mechanism module may then perform the first similarity calculation between each feature vector in the feature map and the feature vector corresponding to the target to be tracked in the preset input image, and output the target feature vector according to the result of the first similarity calculation, so that the target tracking model can determine the target to be tracked in the current frame image according to the target feature vector.
In a possible implementation, determining the target feature vector from all the feature vectors of the feature map according to the result of the first similarity calculation may be: when the target detection image includes N targets to be tracked, selecting, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as the target feature vectors, where N is a positive integer.
For example, in the scenario where the tracking target is given, the target detection image is a preset input image that includes the target to be tracked. It can be determined by a target detection method in the related art that the target detection image includes N targets to be tracked, that is, the number of feature vectors corresponding to the targets to be tracked is N. In this scenario, for each frame image of the video to be tracked, the N feature vectors with the largest first similarity calculation results may be selected, from all the feature vectors included in the feature map corresponding to the image, as the target feature vectors used to determine the targets to be tracked in the image. For instance, all the first similarity calculation results may be sorted from large to small, and the N feature vectors corresponding to the top-ranked first similarity calculation results are selected from all the feature vectors included in the feature map corresponding to the image as the target feature vectors. Alternatively, all the first similarity calculation results may be sorted from small to large, and the N feature vectors corresponding to the last-ranked first similarity calculation results are selected as the target feature vectors, which is not limited in the embodiments of the present disclosure.
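A minimal sketch of the top-N selection described above, assuming the (H, W) score map produced by a first similarity calculation such as the one sketched earlier; the function name `select_top_n` is hypothetical.

```python
import numpy as np

def select_top_n(feature_map: np.ndarray, scores: np.ndarray, n: int):
    """Return the N pixel-level feature vectors with the largest first
    similarity scores, together with their (row, col) positions."""
    h, w, _ = feature_map.shape
    # Sort the flattened scores from large to small and keep the first N indices.
    top_idx = np.argsort(scores.reshape(-1))[::-1][:n]
    rows, cols = np.unravel_index(top_idx, (h, w))
    return feature_map[rows, cols], list(zip(rows.tolist(), cols.tolist()))
```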
In the above manner, the targets to be tracked determined according to the selected N feature vectors are targets that exist in both each frame image of the video to be tracked and the target detection image; that is, the target to be tracked in each frame image output by the target tracking model corresponds one-to-one to the target to be tracked in the target detection image, so target detection and target association can be completed at the same time, thereby reducing the latency of the target tracking process.
In a possible implementation, in the scenario where the tracking target is unknown, since all targets in the image need to be tracked, the target tracking model may further be configured to determine the feature vectors corresponding to all targets in the image according to a pre-trained position vector parameter. Correspondingly, determining the target feature vector from all the feature vectors of the feature map according to the result of the first similarity calculation may be: when the target detection image includes N targets to be tracked, selecting, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as similar feature vectors, where N is a positive integer; and then performing de-duplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors.
For example, referring to FIG. 3, in the scenario where the tracking target is unknown, the target detection image is the previous frame image of the image that includes the target to be tracked. Each frame image of the video to be tracked is taken in turn as the current frame image. The feature map of the current frame image is first determined; the attention mechanism module may then perform the first similarity calculation between each feature vector in the feature map and the feature vector corresponding to the target to be tracked in the previous frame image, and determine the similar feature vectors according to the result of the first similarity calculation. At the same time, the attention mechanism module may also determine the feature vectors of all targets in the current frame image according to the pre-trained position vector parameter. The target tracking model may then perform feature vector fusion on the similar feature vectors and the feature vectors of all targets in the current frame image to obtain the target feature vectors. Finally, the target tracking model can determine the targets to be tracked in the current frame image according to the target feature vectors.
For example, the position vector parameter may include a plurality of unit position vectors. For an H×W image, the position vector parameter may include H×W or more unit position vectors, so as to cover every pixel position in the image. It should be understood that the number of unit position vectors included in the position vector parameter may be set relatively large to accommodate image sizes in different scenarios.
In a possible implementation, the position vector parameter may be obtained through training as follows: determining predicted feature vectors corresponding to targets in a sample image according to an initial position vector parameter, so as to obtain predicted target information corresponding to the sample image, where the sample image is pre-annotated with corresponding sample target information; then calculating a loss function according to the predicted target information and the sample target information, and adjusting the initial position vector parameter according to the calculation result of the loss function.
For example, the initial position vector parameter may take random values; that is, after the number of unit position vectors in the position vector parameter is set, each unit position vector is initialized with a random value. Then, the position vector parameter may be adjusted through a back-propagation algorithm according to the loss function results of the target tracking model during training, so that the position vector parameter can predict the positions of targets in an image more accurately.
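Purely as an assumption about how such a parameter could be held and updated (the disclosure does not fix a concrete implementation), the sketch below keeps the position vector parameter as a learnable table of unit position vectors initialized with random values and adjusted by back-propagation; the sizes and the dummy loss are placeholders.

```python
import torch

num_positions, dim = 1024, 256   # hypothetical sizes, set large enough to cover an image
position_vectors = torch.nn.Parameter(torch.randn(num_positions, dim))  # random initial values
optimizer = torch.optim.SGD([position_vectors], lr=1e-3)

# Dummy forward pass standing in for the target tracking model: in practice the
# position vectors would be used to predict target information, and the loss would
# compare that prediction with the annotated sample target information.
loss = (position_vectors.mean() - 0.0) ** 2

optimizer.zero_grad()
loss.backward()    # back-propagation adjusts the position vector parameter
optimizer.step()
```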
In a possible implementation, if the image is the first frame image of the video to be tracked, the feature vectors determined according to the pre-trained position vector parameter may be used as the target feature vectors. It should be understood that, in the scenario where the tracking target is unknown, the target detection image may be the previous frame image of the image, so the similarity calculation cannot be performed for the first frame image. In the embodiments of the present disclosure, the feature vectors corresponding to all targets in the first frame image, determined according to the pre-trained position vector parameter, may be used as the target feature vectors to determine the targets to be tracked in that image.
After the targets to be tracked in the first frame image are determined according to the position vector parameter, each subsequent frame image of the video to be tracked has a corresponding previous frame image that includes the targets to be tracked, so the target feature vectors can be determined by combining the feature vectors corresponding to all targets in each frame image with the results of the first similarity calculation.
For example, when the target detection image includes N targets to be tracked, the N feature vectors with the largest first similarity calculation results may first be selected from all the feature vectors of the feature map as similar feature vectors, where N is a positive integer. De-duplication processing may then be performed on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors. For the process of determining the N similar feature vectors, reference may be made to the above description of selecting the N feature vectors with the largest first similarity calculation results as the target feature vectors, which is not repeated here.
For example, the N similar feature vectors represent the feature vectors corresponding to targets that exist in both a frame image of the video to be tracked and the target detection image, while the feature vectors determined according to the position vector parameter are the feature vectors corresponding to all targets in that frame image; the two sets therefore contain feature vectors corresponding to the same targets. For instance, if a certain frame image of the video to be tracked includes targets B1, B2, and B3, and the target detection image includes targets B1 and B2, then the N similar feature vectors are the feature vectors corresponding to targets B1 and B2, and the feature vectors determined according to the position vector parameter are the feature vectors corresponding to targets B1, B2, and B3; both sets contain the feature vectors corresponding to the same targets (B1 and B2). In this case, in order to avoid vector redundancy, feature vector fusion may be performed; for example, de-duplication processing may be performed on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors used to determine the targets to be tracked.
In a possible implementation, performing de-duplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors may be: for the feature vector corresponding to each target in the image, performing a second similarity calculation between that feature vector and the N similar feature vectors; when a second similarity calculation result is greater than or equal to a preset similarity, deleting the feature vector corresponding to that second similarity calculation result either from the feature vectors corresponding to all targets in the image or from the N similar feature vectors; and then using the feature vectors corresponding to all targets in the image and the remaining feature vectors among the N similar feature vectors after the deletion processing as the target feature vectors.
For example, the second similarity calculation may be a vector dot product calculation, a Euclidean distance calculation, or the like between the feature vector corresponding to each target in the image and the N similar feature vectors; the manner of the second similarity calculation is not limited in the embodiments of the present disclosure. The preset similarity may be customized according to the actual situation, which is also not limited in the present disclosure. When a second similarity calculation result is greater than or equal to the preset similarity, the feature vector corresponding to a certain target in the image and a certain feature vector among the N similar feature vectors may be regarded as the same feature vector, so a deletion operation can be performed on that feature vector. Considering that the feature vector may exist both in the feature vectors corresponding to all targets in the image and in the N similar feature vectors, the deletion operation may be performed either in the feature vectors corresponding to all targets in the image or in the N similar feature vectors.
For example, in the above example, the N similar feature vectors are the feature vectors corresponding to targets B1 and B2, and the feature vectors determined according to the position vector parameter are the feature vectors corresponding to targets B1, B2, and B3. After the second similarity calculation, the feature vectors corresponding to the second similarity calculation results that are greater than or equal to the preset similarity can be determined to be the feature vectors corresponding to targets B1 and B2. One option is to delete the feature vectors corresponding to B1 and B2 from the N similar feature vectors, in which case the N similar feature vectors are empty after the deletion. Another option is to delete the feature vectors corresponding to B1 and B2 from the feature vectors determined according to the position vector parameter (that is, from the feature vectors corresponding to all targets in the image), in which case the remaining feature vector after the deletion is the feature vector corresponding to target B3. It should be understood that, after the deletion processing, the feature vectors corresponding to targets that exist in both the image and the target detection image can be preferentially retained, thereby ensuring target association and thus achieving target tracking.
After the deletion processing, the feature vectors corresponding to all targets in the image and the remaining feature vectors among the N similar feature vectors may be used as the target feature vectors. For example, in the above example, after the feature vectors corresponding to B1 and B2 are deleted from the N similar feature vectors, the N similar feature vectors are empty, and the feature vectors corresponding to all targets in the image are the feature vectors corresponding to targets B1, B2, and B3; the feature vectors corresponding to all targets in the image together with the remaining feature vectors among the N similar feature vectors are therefore the feature vectors corresponding to targets B1, B2, and B3, that is, the target feature vectors are the feature vectors corresponding to targets B1, B2, and B3.
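The de-duplication described above can be sketched as follows. This is an illustrative assumption only, with cosine similarity chosen as one possible second similarity measure and 0.9 as an arbitrary preset similarity; none of the names or values are taken from the disclosure.

```python
import numpy as np

def deduplicate(all_target_vectors: np.ndarray, similar_vectors: np.ndarray,
                preset_similarity: float = 0.9) -> np.ndarray:
    """Drop, from the vectors of all targets in the image, every vector whose
    second similarity with any of the N similar feature vectors reaches the
    preset similarity, and keep the union of what remains as the target
    feature vectors."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    kept = [v for v in all_target_vectors
            if all(cosine(v, s) < preset_similarity for s in similar_vectors)]
    if not kept:
        return similar_vectors.copy()
    return np.concatenate([similar_vectors, np.stack(kept)], axis=0)
```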
It should be understood that the above de-duplication processing is only one possible way, provided by the embodiments of the present disclosure, of performing feature vector fusion on the feature vectors; in specific implementations of the present disclosure, feature vector fusion may also be performed on the feature vectors corresponding to all targets in the image and the N similar feature vectors in other ways, which is not limited in the embodiments of the present disclosure.
In the above manner, the feature vectors corresponding to all targets in the image can be determined through the position vector parameter, the similar feature vectors corresponding to targets that exist in both the image and the target detection image can be determined according to the second similarity calculation results, and feature vector fusion can then be performed on the similar feature vectors and the feature vectors corresponding to all targets in the image to remove redundant feature vectors, which improves computational efficiency while yielding more accurate targets to be tracked.
After the target feature vector is obtained in any of the above ways, a linear feature transformation may be performed on the target feature vector to obtain tracking box information corresponding to the target to be tracked in the image, where the tracking box information includes position information and size information of the tracking box corresponding to the target to be tracked, so that the target to be tracked can be indicated in the image according to the tracking box information. This process is similar to the related art and is not repeated here.
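As one way to picture the linear feature transformation mentioned above (a sketch under the assumption that a single fully connected layer maps a target feature vector to a box; the dimension and layer are hypothetical), the target feature vector can be projected to the position and size of the tracking box.

```python
import torch

dim = 256                               # hypothetical feature dimension
box_head = torch.nn.Linear(dim, 4)      # linear feature transformation

target_feature = torch.randn(1, dim)    # one target feature vector
cx, cy, w, h = box_head(target_feature)[0]  # tracking box position (cx, cy) and size (w, h)
```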
Based on the same inventive concept, an embodiment of the present disclosure further provides a video target tracking apparatus, which may become part or all of an electronic device through software, hardware, or a combination of the two. Referring to FIG. 4, the video target tracking apparatus includes:
an acquisition module 401, configured to acquire a video to be tracked;
a tracking module 402, configured to input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, where the tracking module includes:
a first determination submodule 4021, configured to determine, for each frame image of the video to be tracked, a feature vector corresponding to a target to be tracked in a target detection image corresponding to the image, where the target detection image includes the target to be tracked;
a second determination submodule 4022, configured to perform a first similarity calculation between each feature vector in a feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and to determine a target feature vector from all the feature vectors of the feature map according to a result of the first similarity calculation;
a third determination submodule 4023, configured to determine the target to be tracked in the image according to the target feature vector.
Optionally, the target detection image is the previous frame image of the image that includes the target to be tracked; or, the target detection image is a preset input image that includes the target to be tracked.
Optionally, the second determination submodule 4022 is configured to:
when the target detection image includes N targets to be tracked, select, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as the target feature vectors, where N is a positive integer.
Optionally, the target tracking model is further configured to determine the feature vectors corresponding to all targets in the image according to a pre-trained position vector parameter, and the second determination submodule is configured to:
when the target detection image includes N targets to be tracked, select, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as similar feature vectors, where N is a positive integer;
perform de-duplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors.
Optionally, the image is the first frame image of the video to be tracked, and the apparatus further includes:
a fourth determination submodule, configured to use the feature vectors determined according to the pre-trained position vector parameter as the target feature vectors.
Optionally, the second determination submodule 4022 is configured to:
for the feature vector corresponding to each target in the image, perform a second similarity calculation between the feature vector and the N similar feature vectors;
when a second similarity calculation result is greater than or equal to a preset similarity, delete the feature vector corresponding to the second similarity calculation result from the feature vectors corresponding to all targets in the image or from the N similar feature vectors;
use the feature vectors corresponding to all targets in the image and the remaining feature vectors among the N similar feature vectors after the deletion processing as the target feature vectors.
Optionally, the apparatus 400 further includes the following modules for obtaining the position vector parameter through training:
a first training module, configured to determine predicted feature vectors corresponding to targets in a sample image according to an initial position vector parameter, so as to obtain predicted target information corresponding to the sample image, where the sample image is pre-annotated with corresponding sample target information;
a second training module, configured to calculate a loss function according to the predicted target information and the sample target information, and to adjust the initial position vector parameter according to the calculation result of the loss function.
The above modules may be implemented as software components executed on one or more general-purpose processors, or as hardware that performs certain functions or combinations thereof, such as programmable logic devices and/or application-specific integrated circuits. In some embodiments, these modules may be embodied in the form of a software product, which may be stored in a non-volatile storage medium that enables a computer device (for example, a personal computer, a server, a network device, or a mobile terminal) to implement the methods described in the embodiments of the present invention. In other embodiments, the above modules may be implemented on a single device or distributed across multiple devices. The functions of these modules may be merged with one another or further split into multiple submodules.
With regard to the apparatus in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments of the method, and is not elaborated here.
Based on the same inventive concept, an embodiment of the present disclosure further provides a computer-readable medium on which a computer program is stored, where the program, when executed by a processing device, implements the steps of any of the above video target tracking methods.
Based on the same inventive concept, an embodiment of the present disclosure further provides an electronic device, including:
a storage device on which a computer program is stored;
a processing device, configured to execute the computer program in the storage device to implement the steps of any of the above video target tracking methods.
Referring now to FIG. 5, it shows a schematic structural diagram of an electronic device 500 suitable for implementing an embodiment of the present disclosure. Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 5 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 5, the electronic device 500 may include a processing device (for example, a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. Various programs and data required for the operation of the electronic device 500 are also stored in the RAM 503. The processing device 501, the ROM 502, and the RAM 503 are connected to one another through a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Generally, the following devices may be connected to the I/O interface 505: an input device 506 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; an output device 507 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage device 508 including, for example, a magnetic tape and a hard disk; and a communication device 509. The communication device 509 may allow the electronic device 500 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 5 shows the electronic device 500 having various devices, it should be understood that it is not required to implement or provide all of the illustrated devices; more or fewer devices may alternatively be implemented or provided.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, where the computer program contains program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 509, or installed from the storage device 508, or installed from the ROM 502. When the computer program is executed by the processing device 501, the above functions defined in the methods of the embodiments of the present disclosure are executed. Based on the same inventive concept, an embodiment of the present disclosure provides a computer program, including instructions which, when executed by a processor, cause the processor to execute any of the above video target tracking methods.
Based on the same inventive concept, an embodiment of the present disclosure provides a computer program product, including instructions which, when executed by a processor, cause the processor to execute any of the above video target tracking methods.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program, where the program may be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, the computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, and the computer-readable signal medium may send, propagate, or transmit a program used by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to: an electrical wire, an optical cable, RF (radio frequency), or any suitable combination of the above.
In some embodiments, communication may be performed using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and interconnection may be achieved with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The above computer-readable medium may be included in the above electronic device, or may exist independently without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a video to be tracked; and input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, where the target tracking model is configured to perform the following processing: for each frame image of the video to be tracked, determining a feature vector corresponding to a target to be tracked in a target detection image corresponding to the image, where the target detection image includes the target to be tracked; performing a first similarity calculation between each feature vector in a feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determining a target feature vector from all the feature vectors of the feature map according to a result of the first similarity calculation; and determining the target to be tracked in the image according to the target feature vector.
The computer program code for executing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet by using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a part of code, and the module, the program segment, or the part of code contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that executes the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
描述于本公开实施例中所涉及到的模块可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,模块的名称在某种情况下并不构成对该模块本身的限定。The modules involved in the embodiments of the present disclosure may be implemented in software or hardware. Among them, the name of the module does not constitute a limitation of the module itself under certain circumstances.
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、片上系统(SOC)、复杂可编程逻辑设备(CPLD)等等。The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或 半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, Example 1 provides a video target tracking method, the method comprising:
acquiring a video to be tracked;
inputting the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, where the target tracking model is configured to perform the following processing:
for each frame image of the video to be tracked, determining a feature vector corresponding to a target to be tracked in a target detection image corresponding to the image, the target detection image including the target to be tracked;
performing a first similarity calculation between each feature vector in a feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determining a target feature vector among all the feature vectors of the feature map according to the first similarity calculation result;
determining the target to be tracked in the image according to the target feature vector.
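To make the per-frame processing of Example 1 concrete, the following is a minimal, non-authoritative sketch in Python/NumPy. The function name, the H x W x C feature-map layout, one C-dimensional vector per tracked target, and the use of a dot product as the first similarity are assumptions made for illustration, not details fixed by the disclosure.

```python
import numpy as np

def track_frame(feature_map: np.ndarray, target_vectors: np.ndarray):
    """Sketch of Example 1 for one frame (assumed shapes).

    feature_map:    H x W x C feature map of the current frame image.
    target_vectors: N x C feature vectors of the N targets to be tracked,
                    taken from the target detection image.
    Returns the (row, col) location chosen for each tracked target.
    """
    h, w, c = feature_map.shape
    flat = feature_map.reshape(-1, c)        # every pixel becomes one feature vector

    # First similarity calculation: dot product between each feature vector
    # of the frame and each tracked-target vector (an assumed choice).
    similarity = flat @ target_vectors.T     # (H*W) x N

    # For each tracked target, keep the most similar feature vector of the
    # frame as its target feature vector and read off its position.
    best = similarity.argmax(axis=0)         # N indices into H*W
    return [(idx // w, idx % w) for idx in best]
```

A frame-level loop would call `track_frame` once per frame, feeding it the target vectors determined from the previous frame or from a preset input image, as Example 2 below allows.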
According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein the target detection image is a previous frame image, of the image, that includes the target to be tracked; or
the target detection image is a preset input image that includes the target to be tracked.
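As a small, hedged illustration of the two options in Example 2, the helper below merely selects which image supplies the tracked-target feature vectors; the function name and the idea of passing a template image are illustrative assumptions.

```python
def choose_detection_image(prev_frame, preset_image=None):
    """Pick the target detection image (Example 2, illustrative only):
    a preset input image containing the target if one is supplied,
    otherwise the previous frame image that includes the target."""
    return preset_image if preset_image is not None else prev_frame
```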
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 1, wherein determining the target feature vector among all the feature vectors of the feature map according to the first similarity calculation result includes:
when the target detection image includes N targets to be tracked, selecting, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as the target feature vectors, where N is a positive integer.
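A minimal sketch of the top-N selection in Example 3 follows; treating the per-vector similarity scores as a flat 1-D array is an assumption made only for illustration.

```python
import numpy as np

def select_top_n(similarity_scores: np.ndarray, n: int) -> np.ndarray:
    """Return the indices of the n feature vectors with the largest first
    similarity calculation results (Example 3). `similarity_scores` holds
    one score per feature vector of the feature map."""
    # argpartition finds the n largest scores without a full sort.
    top = np.argpartition(similarity_scores, -n)[-n:]
    # Order the winners by descending score for readability.
    return top[np.argsort(similarity_scores[top])[::-1]]
```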
According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 1, wherein the target tracking model is further configured to determine feature vectors corresponding to all targets in the image according to pre-trained position vector parameters, and determining the target feature vector among all the feature vectors of the feature map according to the first similarity calculation result includes:
when the target detection image includes N targets to be tracked, selecting, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as similar feature vectors, where N is a positive integer;
performing deduplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors.
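Example 4 relies on pre-trained position vector parameters to propose feature vectors for all targets in the image, including newly appearing ones. The disclosure does not spell out the mechanism, so the attention-style pooling below is only one plausible, clearly hypothetical reading: each learned position vector is used as a query that pools the frame's feature map into one candidate target vector.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def detect_with_position_vectors(feature_map, position_vectors):
    """Hypothetical detection branch: `position_vectors` (M x C) stand in
    for the pre-trained position vector parameters; `feature_map` is
    H x W x C. Each position vector attends over the flattened feature map
    and pools it into one candidate feature vector per potential target."""
    h, w, c = feature_map.shape
    flat = feature_map.reshape(-1, c)                       # (H*W) x C
    attn = softmax(position_vectors @ flat.T / np.sqrt(c))  # M x (H*W)
    return attn @ flat                                      # M x C candidate vectors
```

These candidate vectors and the N similar feature vectors would then pass through the deduplication step elaborated in Example 6 below.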
According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 4, wherein the image is the first frame image of the video to be tracked, and the method further includes:
using the feature vectors determined according to the pre-trained position vector parameters as the target feature vectors.
According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 4, wherein performing deduplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors includes:
for the feature vector corresponding to each target in the image, performing a second similarity calculation between the feature vector and the N similar feature vectors;
when a second similarity calculation result is greater than or equal to a preset similarity, deleting the feature vector corresponding to the second similarity calculation result from the feature vectors corresponding to all targets in the image or from the N similar feature vectors;
using the feature vectors corresponding to all targets in the image after the deletion processing and the remaining feature vectors among the N similar feature vectors as the target feature vectors.
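A minimal sketch of the deduplication in Example 6 follows. Using cosine similarity as the second similarity calculation and always deleting the duplicate from the detection-branch side are assumptions made for illustration; the disclosure allows either side to be pruned.

```python
import numpy as np

def deduplicate(detected_vectors, similar_vectors, preset_similarity=0.9):
    """Example 6, sketched: drop a detection-branch vector whenever its
    second similarity with any of the N similar feature vectors reaches
    the preset similarity, then keep everything that survives."""
    kept = []
    for vec in detected_vectors:
        duplicate = False
        for sim_vec in similar_vectors:
            cos = float(vec @ sim_vec) / (
                np.linalg.norm(vec) * np.linalg.norm(sim_vec) + 1e-8)
            if cos >= preset_similarity:   # second similarity >= preset similarity
                duplicate = True
                break
        if not duplicate:
            kept.append(vec)
    # Target feature vectors: surviving detection vectors plus all similar vectors.
    return kept + list(similar_vectors)
```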
According to one or more embodiments of the present disclosure, Example 7 provides the method of any one of Examples 4 to 6, wherein the position vector parameters are obtained through training in the following manner:
determining predicted feature vectors corresponding to targets in a sample image according to initial position vector parameters, so as to obtain predicted target information corresponding to the sample image, where the sample image is pre-annotated with corresponding sample target information;
calculating a loss function according to the predicted target information and the sample target information, and adjusting the initial position vector parameters according to the calculation result of the loss function.
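The training procedure of Example 7 maps naturally onto a standard gradient-based loop. The PyTorch-style sketch below is a hedged illustration only: the `model(sample_image, position_vectors)` interface, the use of boxes as the target information, the L1 loss, and the optimizer settings are assumptions rather than details taken from the disclosure. In practice a matching step between predictions and annotations (and a classification term in the loss) would typically also be needed; the sketch omits these for brevity.

```python
import torch
import torch.nn.functional as F

def train_position_vectors(model, data_loader, num_targets=100, dim=256, epochs=10):
    """Sketch of Example 7: learn the position vector parameters from sample
    images annotated with sample target information (here, boxes)."""
    position_vectors = torch.nn.Parameter(torch.randn(num_targets, dim) * 0.02)
    optimizer = torch.optim.Adam([position_vectors] + list(model.parameters()), lr=1e-4)

    for _ in range(epochs):
        for sample_image, sample_boxes in data_loader:
            # Predicted feature vectors -> predicted target information (boxes).
            predicted_boxes = model(sample_image, position_vectors)
            loss = F.l1_loss(predicted_boxes, sample_boxes)  # compare with annotations

            optimizer.zero_grad()
            loss.backward()        # adjust the initial position vector parameters
            optimizer.step()
    return position_vectors
```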
According to one or more embodiments of the present disclosure, Example 8 provides a video target tracking apparatus, the apparatus comprising:
an acquisition module configured to acquire a video to be tracked;
a tracking module configured to input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, the tracking module comprising:
a first determination submodule configured to, for each frame image of the video to be tracked, determine a feature vector corresponding to a target to be tracked in a target detection image corresponding to the image, the target detection image including the target to be tracked;
a second determination submodule configured to perform a first similarity calculation between each feature vector in a feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determine a target feature vector among all the feature vectors of the feature map according to the first similarity calculation result;
a third determination submodule configured to determine the target to be tracked in the image according to the target feature vector.
According to one or more embodiments of the present disclosure, Example 9 provides the apparatus of Example 8, wherein the target detection image is a previous frame image, of the image, that includes the target to be tracked; or the target detection image is a preset input image that includes the target to be tracked.
According to one or more embodiments of the present disclosure, Example 10 provides the apparatus of Example 8, wherein the second determination submodule is configured to:
when the target detection image includes N targets to be tracked, select, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as the target feature vectors, where N is a positive integer.
According to one or more embodiments of the present disclosure, Example 11 provides the apparatus of Example 8, wherein the target tracking model is further configured to determine feature vectors corresponding to all targets in the image according to pre-trained position vector parameters, and the second determination submodule is configured to:
when the target detection image includes N targets to be tracked, select, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as similar feature vectors, where N is a positive integer;
perform deduplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors.
According to one or more embodiments of the present disclosure, Example 12 provides the apparatus of Example 11, wherein the image is the first frame image of the video to be tracked, and the apparatus further comprises:
a fourth determination submodule configured to use the feature vectors determined according to the pre-trained position vector parameters as the target feature vectors.
According to one or more embodiments of the present disclosure, Example 13 provides the apparatus of Example 11, wherein the second determination submodule is configured to:
for the feature vector corresponding to each target in the image, perform a second similarity calculation between the feature vector and the N similar feature vectors;
when a second similarity calculation result is greater than or equal to a preset similarity, delete the feature vector corresponding to the second similarity calculation result from the feature vectors corresponding to all targets in the image or from the N similar feature vectors;
use the feature vectors corresponding to all targets in the image after the deletion processing and the remaining feature vectors among the N similar feature vectors as the target feature vectors.
According to one or more embodiments of the present disclosure, Example 14 provides the apparatus of any one of Examples 11 to 13, the apparatus further comprising the following modules for training to obtain the position vector parameters:
a first training module configured to determine predicted feature vectors corresponding to targets in a sample image according to initial position vector parameters, so as to obtain predicted target information corresponding to the sample image, where the sample image is pre-annotated with corresponding sample target information;
a second training module configured to calculate a loss function according to the predicted target information and the sample target information, and adjust the initial position vector parameters according to the calculation result of the loss function.
According to one or more embodiments of the present disclosure, Example 15 provides a computer-readable medium having a computer program stored thereon, where the program, when executed by a processing apparatus, implements the steps of the method of any one of Examples 1 to 7.
According to one or more embodiments of the present disclosure, Example 16 provides an electronic device, comprising:
a storage apparatus having a computer program stored thereon;
a processing apparatus configured to execute the computer program in the storage apparatus to implement the steps of the method of any one of Examples 1 to 7.
The above description is merely a preferred embodiment of the present disclosure and an illustration of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved herein is not limited to technical solutions formed by specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Furthermore, although the operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. With regard to the apparatus in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments of the method and will not be elaborated here.

Claims (17)

  1. A video target tracking method, comprising:
    acquiring a video to be tracked;
    inputting the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, wherein the target tracking model is configured to perform the following processing:
    for each frame image of the video to be tracked, determining a feature vector corresponding to a target to be tracked in a target detection image corresponding to the image, the target detection image including the target to be tracked;
    performing a first similarity calculation between each feature vector in a feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determining a target feature vector among all feature vectors of the feature map according to a first similarity calculation result; and
    determining the target to be tracked in the image according to the target feature vector.
  2. The video target tracking method according to claim 1, wherein the target detection image is a previous frame image, of the image, that includes the target to be tracked; or
    the target detection image is a preset input image that includes the target to be tracked.
  3. The video target tracking method according to claim 1 or 2, wherein determining the target feature vector among all the feature vectors of the feature map according to the first similarity calculation result comprises:
    when the target detection image includes N targets to be tracked, selecting, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as the target feature vectors, wherein N is a positive integer.
  4. The video target tracking method according to any one of claims 1 to 3, wherein the target tracking model is further configured to determine feature vectors corresponding to all targets in the image according to pre-trained position vector parameters, and determining the target feature vector among all the feature vectors of the feature map according to the first similarity calculation result comprises:
    when the target detection image includes N targets to be tracked, selecting, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as similar feature vectors, wherein N is a positive integer; and
    performing deduplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors.
  5. The video target tracking method according to any one of claims 1 to 3, wherein the target tracking model is further configured to determine feature vectors corresponding to all targets in the image according to pre-trained position vector parameters, and determining the target feature vector among all the feature vectors of the feature map according to the first similarity calculation result comprises:
    when the target detection image includes N targets to be tracked, selecting, from all the feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as similar feature vectors, wherein N is a positive integer; and
    performing fusion processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors.
  6. The video target tracking method according to claim 4 or 5, wherein the image is a first frame image of the video to be tracked, and the video target tracking method further comprises:
    using the feature vectors determined according to the pre-trained position vector parameters as the target feature vectors.
  7. The video target tracking method according to claim 4, wherein performing deduplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors comprises:
    for the feature vector corresponding to each target in the image, performing a second similarity calculation between the feature vector and the N similar feature vectors;
    when a second similarity calculation result is greater than or equal to a preset similarity, deleting the feature vector corresponding to the second similarity calculation result from the feature vectors corresponding to all targets in the image or from the N similar feature vectors; and
    using the feature vectors corresponding to all targets in the image after the deletion processing and the remaining feature vectors among the N similar feature vectors as the target feature vectors.
  8. The video target tracking method according to any one of claims 4 to 7, wherein the position vector parameters are obtained through training in the following manner:
    determining predicted feature vectors corresponding to targets in a sample image according to initial position vector parameters, so as to obtain predicted target information corresponding to the sample image, wherein the sample image is pre-annotated with corresponding sample target information; and
    calculating a loss function according to the predicted target information and the sample target information, and adjusting the initial position vector parameters according to a calculation result of the loss function.
  9. The video target tracking method according to any one of claims 1 to 8, wherein, for each frame image of the video to be tracked, determining the feature vector corresponding to the target to be tracked in the target detection image corresponding to the image comprises:
    obtaining the feature vector corresponding to the target to be tracked in the target detection image corresponding to the image by vectorizing an image feature of a pixel at the center of the target to be tracked, or by vectorizing an image feature of a pixel at which the target to be tracked can be distinguished from other targets.
  10. The video target tracking method according to any one of claims 1 to 9, wherein the feature map corresponding to the image is obtained by vectorizing an image feature of each pixel in the image.
  11. The video target tracking method according to any one of claims 1 to 10, wherein performing the first similarity calculation between each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image comprises:
    performing a vector dot product calculation or a Euclidean distance calculation between each feature vector in the image and the feature vector corresponding to the target to be tracked in the target detection image.
  12. The video target tracking method according to any one of claims 1 to 11, wherein determining the target to be tracked in the image according to the target feature vector comprises:
    determining, in the image according to the target feature vector, the target to be tracked that exists in both the target detection image and the image.
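To illustrate claims 9 to 11 together, the following hedged Python/NumPy sketch builds a per-pixel feature map, reads off the vector at a target's center pixel, and scores the rest of the map with either of the two first-similarity options named in claim 11 (vector dot product or Euclidean distance). The 1x1 linear projection used to vectorize pixel features is an assumption; any backbone producing a C-dimensional vector per pixel would serve the same role.

```python
import numpy as np

def pixelwise_feature_map(image: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Claim 10, sketched: vectorize the image feature of every pixel.
    `image` is H x W x C_in; `projection` (C_in x C) stands in for whatever
    backbone actually produces one C-dimensional vector per pixel."""
    return image @ projection                      # H x W x C

def center_pixel_vector(feature_map: np.ndarray, center_rc) -> np.ndarray:
    """Claim 9, sketched: the tracked target's feature vector is the vector
    at the pixel at its center (or at any pixel that distinguishes it)."""
    r, c = center_rc
    return feature_map[r, c]

def first_similarity(feature_map: np.ndarray, target_vec: np.ndarray, mode="dot") -> np.ndarray:
    """Claim 11: score every feature vector of the map against the tracked
    target's vector by dot product, or by negated Euclidean distance so that
    larger always means more similar."""
    if mode == "dot":
        return feature_map @ target_vec            # H x W scores
    diff = feature_map - target_vec
    return -np.linalg.norm(diff, axis=-1)          # H x W scores
```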
  13. A video target tracking apparatus, comprising:
    an acquisition module configured to acquire a video to be tracked;
    a tracking module configured to input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, the tracking module comprising:
    a first determination submodule configured to, for each frame image of the video to be tracked, determine a feature vector corresponding to a target to be tracked in a target detection image corresponding to the image, the target detection image including the target to be tracked;
    a second determination submodule configured to perform a first similarity calculation between each feature vector in a feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determine a target feature vector among all feature vectors of the feature map according to a first similarity calculation result; and
    a third determination submodule configured to determine the target to be tracked in the image according to the target feature vector.
  14. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processing apparatus, implements the steps of the video target tracking method according to any one of claims 1 to 12.
  15. An electronic device, comprising:
    a storage apparatus having a computer program stored thereon; and
    a processing apparatus configured to execute the computer program in the storage apparatus to implement the steps of the video target tracking method according to any one of claims 1 to 12.
  16. A computer program, comprising:
    instructions which, when executed by a processor, cause the processor to perform the video target tracking method according to any one of claims 1 to 12.
  17. A computer program product comprising instructions which, when executed by a processor, cause the processor to perform the video target tracking method according to any one of claims 1 to 12.
PCT/CN2022/075086 2021-02-09 2022-01-29 Video target tracking method, video target tracking apparatus, storage medium, and electronic device WO2022171036A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110179157.5A CN112907628A (en) 2021-02-09 2021-02-09 Video target tracking method and device, storage medium and electronic equipment
CN202110179157.5 2021-02-09

Publications (1)

Publication Number Publication Date
WO2022171036A1 true WO2022171036A1 (en) 2022-08-18

Family

ID=76123159

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/075086 WO2022171036A1 (en) 2021-02-09 2022-01-29 Video target tracking method, video target tracking apparatus, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN112907628A (en)
WO (1) WO2022171036A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112907628A (en) * 2021-02-09 2021-06-04 北京有竹居网络技术有限公司 Video target tracking method and device, storage medium and electronic equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829397B (en) * 2019-01-16 2021-04-02 创新奇智(北京)科技有限公司 Video annotation method and system based on image clustering and electronic equipment
CN110717414B (en) * 2019-09-24 2023-01-03 青岛海信网络科技股份有限公司 Target detection tracking method, device and equipment
CN111898416A (en) * 2020-06-17 2020-11-06 绍兴埃瓦科技有限公司 Video stream processing method and device, computer equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214238A (en) * 2017-06-30 2019-01-15 百度在线网络技术(北京)有限公司 Multi-object tracking method, device, equipment and storage medium
US20200065617A1 (en) * 2018-08-24 2020-02-27 Nec Laboratories America, Inc. Unsupervised domain adaptation for video classification
CN111242973A (en) * 2020-01-06 2020-06-05 上海商汤临港智能科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN111311635A (en) * 2020-02-08 2020-06-19 腾讯科技(深圳)有限公司 Target positioning method, device and system
CN112907628A (en) * 2021-02-09 2021-06-04 北京有竹居网络技术有限公司 Video target tracking method and device, storage medium and electronic equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385497A (en) * 2023-05-29 2023-07-04 成都与睿创新科技有限公司 Custom target tracking method and system for body cavity
CN116385497B (en) * 2023-05-29 2023-08-22 成都与睿创新科技有限公司 Custom target tracking method and system for body cavity
CN117975198A (en) * 2024-02-02 2024-05-03 北京视觉世界科技有限公司 Automatic construction method of target detection class data set and related equipment thereof

Also Published As

Publication number Publication date
CN112907628A (en) 2021-06-04

Similar Documents

Publication Publication Date Title
WO2022171036A1 (en) Video target tracking method, video target tracking apparatus, storage medium, and electronic device
CN110298413B (en) Image feature extraction method and device, storage medium and electronic equipment
JP2023547917A (en) Image segmentation method, device, equipment and storage medium
WO2022252881A1 (en) Image processing method and apparatus, and readable medium and electronic device
WO2022105779A1 (en) Image processing method, model training method, and apparatus, medium, and device
WO2023030370A1 (en) Endoscope image detection method and apparatus, storage medium, and electronic device
CN111784712B (en) Image processing method, device, equipment and computer readable medium
WO2022028254A1 (en) Positioning model optimization method, positioning method and positioning device
CN113033580B (en) Image processing method, device, storage medium and electronic equipment
CN110347875B (en) Video scene classification method and device, mobile terminal and storage medium
WO2023179310A1 (en) Image restoration method and apparatus, device, medium, and product
WO2022233223A1 (en) Image splicing method and apparatus, and device and medium
WO2023030427A1 (en) Training method for generative model, polyp identification method and apparatus, medium, and device
CN113449070A (en) Multimodal data retrieval method, device, medium and electronic equipment
CN112330788A (en) Image processing method, image processing device, readable medium and electronic equipment
CN113610034B (en) Method and device for identifying character entities in video, storage medium and electronic equipment
CN108257081B (en) Method and device for generating pictures
CN111862351B (en) Positioning model optimization method, positioning method and positioning equipment
CN111311609B (en) Image segmentation method and device, electronic equipment and storage medium
CN110765304A (en) Image processing method, image processing device, electronic equipment and computer readable medium
WO2023016290A1 (en) Video classification method and apparatus, readable medium and electronic device
WO2022194145A1 (en) Photographing position determination method and apparatus, device, and medium
WO2022052889A1 (en) Image recognition method and apparatus, electronic device, and computer-readable medium
CN113435528B (en) Method, device, readable medium and electronic equipment for classifying objects
CN114004229A (en) Text recognition method and device, readable medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22752197

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22752197

Country of ref document: EP

Kind code of ref document: A1