CN112907628A - Video target tracking method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN112907628A
CN112907628A (application CN202110179157.5A)
Authority
CN
China
Prior art keywords: target, image, tracked, feature vectors, feature
Prior art date
Legal status (assumption, not a legal conclusion): Pending
Application number
CN202110179157.5A
Other languages
Chinese (zh)
Inventor
江毅
孙培泽
袁泽寰
王长虎
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110179157.5A
Publication of CN112907628A
Priority to PCT/CN2022/075086 (WO2022171036A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a video target tracking method and device, a storage medium, and electronic equipment, which realize end-to-end video target tracking and reduce the time delay of video target tracking. The method comprises the following steps: acquiring a video to be tracked; and inputting the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, wherein the target tracking model performs the following processing: for each frame of the video to be tracked, determining a feature vector corresponding to a target to be tracked in a target detection image corresponding to the frame, wherein the target detection image comprises the target to be tracked; performing a first similarity calculation between each feature vector in the feature map corresponding to the frame and the feature vector corresponding to the target to be tracked in the target detection image, and determining target feature vectors among all feature vectors of the feature map according to the first similarity calculation result; and determining the target to be tracked in the frame according to the target feature vectors.

Description

Video target tracking method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method and an apparatus for tracking a video target, a storage medium, and an electronic device.
Background
Video target tracking underpins many video application fields such as security monitoring, human behavior analysis, and sports video commentary, and imposes high real-time requirements. However, video target tracking in the related art usually follows a detect-then-track process: target detection is performed on consecutive frames of the video, and the detected targets are then matched pairwise between frames to realize tracking. Because detection must finish before association can begin, the time delay is high, and the problem is especially pronounced in scenarios with many targets to be tracked.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a video target tracking method, including:
acquiring a video to be tracked;
inputting the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, wherein the target tracking model is used for executing the following processing:
determining a feature vector corresponding to a target to be tracked in a target detection image corresponding to each frame of image of the video to be tracked, wherein the target detection image comprises the target to be tracked;
performing first similarity calculation on each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determining a target feature vector in all feature vectors of the feature map according to a first similarity calculation result;
and determining a target to be tracked in the image according to the target feature vector.
In a second aspect, the present disclosure provides a video target tracking device, the device comprising:
the acquisition module is used for acquiring a video to be tracked;
a tracking module, configured to input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, where the tracking module includes:
the first determining submodule is used for determining a feature vector corresponding to a target to be tracked in a target detection image corresponding to each frame of image of the video to be tracked, wherein the target detection image comprises the target to be tracked;
the second determining submodule is used for performing first similarity calculation on each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determining a target feature vector in all feature vectors of the feature map according to a first similarity calculation result;
and the third determining submodule is used for determining the target to be tracked in the image according to the target feature vector.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to carry out the steps of the method of the first aspect.
Through the above technical solution, the target tracking model performs the first similarity calculation between the feature vectors corresponding to each frame of the video to be tracked and the feature vector corresponding to the target to be tracked in the target detection image, and determines the target to be tracked in each frame according to the first similarity calculation result. The target to be tracked in each frame output by the target tracking model is therefore in one-to-one correspondence with the target to be tracked in the target detection image; that is, target detection and target association are completed simultaneously, which reduces the time delay of the target tracking process.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart illustrating a video target tracking method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a target tracking process in a video target tracking method according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating a target tracking process in another video target tracking method according to an exemplary embodiment of the present disclosure;
FIG. 4 is a block diagram illustrating a video target tracking device according to an exemplary embodiment of the present disclosure;
fig. 5 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used to distinguish different devices, modules, or units, and do not limit the order of, or interdependence between, the functions they perform. It should further be noted that references to "a", "an", and "the" in the present disclosure are illustrative rather than limiting; those skilled in the art will understand them to mean "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
As noted in the background, video target tracking in the related art is usually based on a process of first target detection and then target tracking. Specifically, a detection module performs target detection on consecutive frames of a video, and an association module then matches the detected targets pairwise, thereby realizing tracking of the targets. The model components of this process are complex and its time delay is high, and the time delay problem is particularly obvious in scenes with many targets to be tracked.
In view of this, the present disclosure provides a video target tracking method, an apparatus, a storage medium, and an electronic device, so as to achieve end-to-end video target tracking and reduce a time delay of video target tracking.
Fig. 1 is a flowchart illustrating a video target tracking method according to an exemplary embodiment of the present disclosure. Referring to fig. 1, the video target tracking method includes:
step 101, obtaining a video to be tracked.
For example, the video to be tracked may be acquired in response to a user's video input operation, or may be obtained automatically from an image capturing device after a target tracking instruction is received, and so on; the present disclosure does not limit this.
Step 102, inputting the video to be tracked into the target tracking model to obtain a target tracking result corresponding to the video to be tracked. The target tracking model performs the following processing: for each frame of the video to be tracked, determining the feature vector corresponding to the target to be tracked in the target detection image corresponding to that frame, wherein the target detection image comprises the target to be tracked; performing a first similarity calculation between each feature vector in the feature map corresponding to the frame and the feature vector corresponding to the target to be tracked in the target detection image, and determining target feature vectors among all feature vectors of the feature map according to the first similarity calculation result; and determining the target to be tracked in the frame according to the target feature vectors.
Through the above method, the target tracking model performs the first similarity calculation between the feature vectors corresponding to each frame of the video to be tracked and the feature vector corresponding to the target to be tracked in the target detection image, and determines the target to be tracked in each frame according to the first similarity calculation result. The target to be tracked in each frame output by the target tracking model is therefore in one-to-one correspondence with the target to be tracked in the target detection image; that is, target detection and target association are completed simultaneously, which reduces the time delay of the target tracking process.
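To make the claimed flow concrete, the per-frame processing can be sketched as follows. This is a minimal illustration rather than the patented model: the function name `track_frame`, the toy two-dimensional feature vectors, and the choice of a dot product as the first similarity calculation are all assumptions made for demonstration.

```python
def track_frame(feature_map, detection_vectors):
    """For one frame, score every feature-map vector against each feature
    vector of a target to be tracked (taken from the target detection image)
    and keep the best match, so detection and association happen in one pass."""
    matches = []
    for tvec in detection_vectors:
        # First similarity calculation: here a plain vector dot product.
        scores = [sum(a * b for a, b in zip(fvec, tvec)) for fvec in feature_map]
        matches.append(max(range(len(scores)), key=scores.__getitem__))
    return matches  # index into the feature map for each tracked target

# Toy 2-D feature vectors: three feature-map positions, two tracked targets.
feature_map = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
detection_vectors = [[1.0, 0.0], [0.0, 1.0]]
print(track_frame(feature_map, detection_vectors))  # [0, 1]
```

Each tracked target is associated directly with a position in the current frame's feature map, without a separate detection stage followed by pairwise matching.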
In order to make the video object tracking method provided by the present disclosure more understandable to those skilled in the art, the above steps are exemplified in detail below.
For example, after the video to be tracked is acquired, the video to be tracked may be input into the target tracking model. It should be understood that each frame of image in the video to be tracked has a chronological order, so that a video image sequence composed of a plurality of frames of images arranged in chronological order can be obtained according to the video to be tracked. Therefore, the input of the video to be tracked into the target tracking model may also be the input of the video image sequence corresponding to the video to be tracked into the target tracking model.
For example, the training of the target tracking model may be performed according to the sample image and the sample target information corresponding to the sample image. Specifically, the sample image may be input into the target tracking model to obtain predicted target information output by the target tracking model for the sample image, then a loss function is calculated according to the predicted target information and the sample target information, and finally a parameter of the target tracking model is adjusted according to a calculation result of the loss function, so that the target tracking model outputs more accurate target information. Therefore, the synchronous training of the target detection function and the target tracking function of the target tracking model can be realized.
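The described training loop can be sketched as follows. The model here is a toy linear predictor and the squared-error loss is a placeholder (the disclosure does not fix a particular loss function); only the predict, compute-loss, adjust-parameters cycle mirrors the text.

```python
def squared_error(pred, truth):
    # Placeholder loss; the disclosure does not specify the loss function.
    return sum((p - t) ** 2 for p, t in zip(pred, truth))

def train_step(params, features, target, lr=0.1):
    # One iteration: predict target info for the sample, compute the loss
    # against the annotated sample target info, and adjust the parameters
    # along the analytic gradient of the loss.
    pred = [w * x for w, x in zip(params, features)]  # toy linear model
    grads = [2 * (p - t) * x for p, t, x in zip(pred, target, features)]
    new_params = [w - lr * g for w, g in zip(params, grads)]
    return new_params, squared_error(pred, target)

params = [0.0, 0.0]
for _ in range(200):
    params, loss = train_step(params, features=[1.0, 2.0], target=[2.0, 6.0])
# params approaches [2.0, 3.0] and the loss approaches zero
```

Repeating the step drives the predicted target information toward the annotated sample target information, which is the effect the synchronous training described above aims for.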
It should be understood that, because the related art needs to perform target detection first and then target association, its training mode trains the detection module and the association module stage by stage; in a scene with many targets to be tracked, this training mode consumes considerable time and makes an optimal training effect difficult to achieve. In the embodiment of the disclosure, synchronously training the target detection function and the target tracking function of the target tracking model not only simplifies the components of the target tracking model but also simplifies its training process, and can better meet the scene requirements of multi-target tracking.
In the application stage, the target tracking model may determine, for each frame of image of the video to be tracked, a feature vector corresponding to the target to be tracked in the target detection image corresponding to the image, then perform a first similarity calculation on each feature vector in the feature map corresponding to the image and a feature vector corresponding to the target to be tracked in the target detection image, and determine the target feature vector in all feature vectors of the feature map according to a first similarity calculation result. And finally, determining the target to be tracked in the image according to the target feature vector.
For example, the target detection image may be the previous frame of the image that includes the target to be tracked, or may be a preset input image that includes the target to be tracked. It should be understood that the video target tracking method provided by the embodiment of the present disclosure can be applied to two application scenarios: the tracking target is given, or the tracking target is unknown. In the scenario where the tracking target is given, the target detection image may be a preset input image including the target to be tracked; for example, if the target to be tracked is person A, the preset input image may be a full-body photograph of person A taken by the image capturing device. In the scenario where the tracking target is unknown, all targets in each frame need to be tracked, and the target detection image may be the previous frame of the image that includes the target to be tracked.
After the target detection image is determined, the feature vector corresponding to the target to be tracked in the target detection image may be determined. The feature vector may be the result of vectorizing the image feature of the central pixel of the target to be tracked, or of vectorizing the image feature of some pixel of the target to be tracked that distinguishes it from other targets, and so on; the embodiment of the present disclosure does not limit this. The feature vector is determined in a similar way to the related art, which is not described again here.
For example, the feature map corresponding to the image may be obtained by vectorizing the image feature of each pixel in the image. Moreover, the feature vector corresponding to the target to be tracked in the target detection image is a pixel-level feature vector, so the first similarity calculation can be performed between each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and target tracking is then achieved according to the first similarity calculation result. For example, the first similarity calculation may be a vector dot product calculation, a Euclidean distance calculation, or the like, performed between the feature vector corresponding to each pixel in the image and the feature vector corresponding to the target to be tracked in the target detection image; the embodiment of the present disclosure does not limit the manner of the first similarity calculation.
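The two similarity options just mentioned can be illustrated in plain Python over toy vectors; in the actual model these would be computed over pixel-level feature vectors inside the network, and the example values are assumptions.

```python
import math

def dot_similarity(v1, v2):
    # Vector dot product: a larger value means more similar.
    return sum(a * b for a, b in zip(v1, v2))

def euclidean_similarity(v1, v2):
    # Negated Euclidean distance, so a larger value also means more similar.
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

target_vec = [1.0, 0.0, 0.0]  # vector of the target in the detection image
pixel_vec = [0.9, 0.1, 0.0]   # one per-pixel vector from the feature map
print(dot_similarity(pixel_vec, target_vec))  # 0.9
```

Either score can serve as the "first similarity calculation result" that the target feature vectors are selected by.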
It should be understood that the target tracking model may include an attention mechanism module, and the attention mechanism module may perform a first similarity calculation process to determine feature vectors corresponding to targets present in both the image and the target detection image, i.e., to obtain target feature vectors.
For example, referring to fig. 2, in a given scene of a tracking target, the target detection image is a preset input image including the target to be tracked. And then, the attention mechanism module can perform first similarity calculation on each feature vector in the feature map and a feature vector corresponding to a target to be tracked in a preset input image, and output a target feature vector according to the first similarity calculation result, so that the target tracking model can determine the target to be tracked in the frame image according to the target feature vector.
In a possible manner, according to the first similarity calculation result, determining the target feature vector among all the feature vectors of the feature map may be: when the target detection image comprises N targets to be tracked, selecting N feature vectors with the maximum first similarity calculation result from all feature vectors of the feature map as target feature vectors, wherein N is a positive integer.
Illustratively, in the scenario where the tracking target is given, the target detection image is a preset input image including the target to be tracked. That the target detection image includes N targets to be tracked can be determined through a target detection method in the related art; that is, there are N feature vectors corresponding to targets to be tracked. In this scenario, for each frame of the video to be tracked, the N feature vectors with the largest first similarity calculation results may be selected as the target feature vectors from all feature vectors of the feature map corresponding to the frame, in order to determine the target to be tracked in the frame. For example, all first similarity calculation results may be ranked from large to small and the feature vectors corresponding to the top N results selected; equivalently, the results may be ranked from small to large and the feature vectors corresponding to the last N results selected. The embodiment of the present disclosure does not limit this.
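The top-N selection described above can be sketched as follows; the helper name and the toy scores are illustrative, not code from the disclosure.

```python
def select_top_n(feature_map, scores, n):
    # Rank all feature-map vectors by their first-similarity score,
    # largest first, and keep the top n as the target feature vectors.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [feature_map[i] for i in ranked[:n]]

feature_map = [[0.1, 0.2], [0.9, 0.8], [0.5, 0.4], [0.95, 0.9]]
scores = [0.12, 0.86, 0.41, 0.93]  # one first-similarity result per vector
print(select_top_n(feature_map, scores, n=2))  # [[0.95, 0.9], [0.9, 0.8]]
```

Sorting ascending and taking the last N entries would give the same result, matching the equivalence noted in the text.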
In this way, the targets to be tracked determined from the selected N feature vectors are targets that exist both in each frame of the video to be tracked and in the target detection image; that is, the target to be tracked in each frame output by the target tracking model is in one-to-one correspondence with the target to be tracked in the target detection image, so target detection and target association are completed simultaneously, reducing the time delay of the target tracking process.
In a possible manner, in a scene where the tracking target is unknown, since all targets in the image need to be tracked, the target tracking model may also be used to determine feature vectors corresponding to all targets in the image according to the pre-trained position vector parameters. Accordingly, determining the target feature vector among all the feature vectors of the feature map according to the first similarity calculation result may be: when the target detection image comprises N targets to be tracked, selecting N feature vectors with the maximum first similarity calculation result from all feature vectors of the feature map as similar feature vectors, wherein N is a positive integer. And then, carrying out de-duplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors.
For example, referring to fig. 3, in a scene in which a tracking target is unknown, the target detection image is a previous frame image of an image including the target to be tracked. Each frame of image of the video to be tracked is respectively used as the image of the current frame, the feature map of the image of the current frame is firstly determined, then the attention mechanism module can carry out first similarity calculation on each feature vector in the feature map and the feature vector corresponding to the target to be tracked in the image of the previous frame, and the similar feature vector is determined according to the first similarity calculation result. Meanwhile, the attention mechanism module can also determine the feature vectors of all targets in the image of the frame according to the pre-trained position vector parameters. Then, the target tracking model can perform feature vector fusion on the similar feature vectors and the feature vectors of all targets in the image of the current frame to obtain the target feature vectors. Finally, the target tracking model can determine the target to be tracked in the image of the frame according to the target feature vector.
For example, the position vector parameter may include a number of unit position vectors. For an H×W image, the position vector parameter may include H×W or more unit position vectors, so as to cover every pixel location in the image. It should be understood that the number of unit position vectors in the position vector parameter can be set large enough to accommodate the image sizes of different scenes.
In a possible approach, the position vector parameters may be trained as follows: and determining a prediction characteristic vector corresponding to the target in the sample image according to the initial position vector parameter to obtain prediction target information corresponding to the sample image, wherein the sample image is marked with corresponding sample target information in advance, then calculating a loss function according to the prediction target information and the sample target information, and adjusting the initial position vector parameter according to a calculation result of the loss function.
For example, the initial position vector parameter may be a random value, that is, after the number of unit position vectors in the position vector is set, the value of each unit position vector is a random value. Then, the position vector parameter can be adjusted through a back propagation algorithm according to the loss function result of the target tracking model in the training process, so that the position vector parameter can more accurately predict the position of the target in the image.
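The random initialization of the position vector parameter can be sketched as follows; the count of H×W unit vectors, their dimension, and the value range are illustrative choices, and the back-propagation adjustment step is omitted.

```python
import random

def init_position_vectors(height, width, dim, seed=0):
    # Randomly initialised unit position vectors, at least one per pixel
    # location; training then refines them via backpropagation. The count
    # and dimension here are assumptions, not fixed by the disclosure.
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(height * width)]

pos = init_position_vectors(4, 4, dim=8)
# 16 position vectors of dimension 8, one per pixel of a 4x4 image
```

After training, these vectors play the role of queries that pick out the feature vectors of all targets present in a frame.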
In a possible manner, if the image is the first frame image of the video to be tracked, the feature vector determined according to the pre-trained position vector parameter may be used as the target feature vector. It should be understood that in a scene where the tracking target is unknown, the target detection image may be the previous frame image of the image, and therefore the first frame image cannot be subjected to the similarity calculation. In the embodiment of the present disclosure, the feature vectors corresponding to all targets in the image determined according to the pre-trained position vector parameters may be used as the target feature vectors to determine the target to be tracked in the image.
After determining the target to be tracked in the first frame of image according to the position vector parameter, each frame of image of the video to be tracked may correspond to the previous frame of image including the target to be tracked, and thus the feature vectors corresponding to all the targets in each frame of image may be combined with the first similarity calculation result to determine the target feature vector.
For example, when the target detection image includes N targets to be tracked, N feature vectors with the largest first similarity calculation result may be selected as similar feature vectors in all feature vectors of the feature map, where N is a positive integer. Then, the feature vectors corresponding to all the targets in the image and the N similar feature vectors may be subjected to de-duplication processing to obtain target feature vectors. The process of determining the N similar feature vectors may refer to the content of selecting the N feature vectors with the largest first similarity calculation result as the target feature vector, and is not described herein again.
For example, the N similar feature vectors are the feature vectors of targets that exist both in a frame of the video to be tracked and in the target detection image, while the feature vectors determined from the position vector parameter correspond to all targets in that frame, so the two sets contain feature vectors for the same targets. For example, if a frame of the video to be tracked includes targets B1, B2, and B3 and the target detection image includes targets B1 and B2, then the N similar feature vectors are those of B1 and B2, the feature vectors determined from the position vector parameter are those of B1, B2, and B3, and both sets contain feature vectors for B1 and B2. In this case, to avoid vector redundancy, feature vector fusion may be performed: the feature vectors corresponding to all targets in the frame and the N similar feature vectors are de-duplicated to obtain the target feature vectors used to determine the target to be tracked.
In a possible manner, the feature vectors corresponding to all targets in the image and the N similar feature vectors are subjected to de-duplication processing to obtain the target feature vectors, which may be: and performing second similarity calculation on the feature vector and the N similar feature vectors aiming at the feature vector corresponding to each target in the image, and deleting the feature vector corresponding to the second similarity calculation result from the feature vectors corresponding to all the targets or the N similar feature vectors in the image when the second similarity calculation result is greater than or equal to the preset similarity. And then, taking the feature vectors corresponding to all the targets in the deleted image and the residual feature vectors in the N similar feature vectors as target feature vectors.
For example, the second similarity calculation may be a vector dot product calculation, a Euclidean distance calculation, or the like between the feature vector corresponding to each target in the image and the N similar feature vectors; the embodiment of the present disclosure does not limit the manner of the second similarity calculation. The preset similarity can be customized according to actual conditions and is likewise not limited by the present disclosure. When a second similarity calculation result is greater than or equal to the preset similarity, the feature vector corresponding to a certain target in the image and a certain one of the N similar feature vectors can be regarded as the same feature vector, so the deletion operation can be performed on that feature vector. Considering that such a feature vector exists simultaneously among the feature vectors corresponding to all targets in the image and among the N similar feature vectors, the deletion operation may be performed on either of the two sets.
For example, in the above example, the N similar feature vectors are the feature vectors corresponding to targets B1 and B2, and the feature vectors determined according to the position vector parameters are the feature vectors corresponding to targets B1, B2, and B3. After the second similarity calculation, the feature vectors whose second similarity calculation results are greater than or equal to the preset similarity are determined to be those corresponding to targets B1 and B2. The feature vectors corresponding to targets B1 and B2 may be deleted from the N similar feature vectors, after which the N similar feature vectors are empty. Alternatively, the feature vectors corresponding to targets B1 and B2 may be deleted from the feature vectors determined according to the position vector parameters (i.e., from the feature vectors corresponding to all targets in the image), after which the remaining feature vector is the feature vector corresponding to target B3. It should be understood that after the deletion processing, the feature vectors corresponding to targets existing in both the image and the target detection image may be preferentially retained, so as to ensure the target association and thereby implement target tracking.
After the deletion processing, the feature vectors corresponding to all targets in the image and the remaining feature vectors among the N similar feature vectors may be used as the target feature vectors. For example, in the above example, after the feature vectors corresponding to targets B1 and B2 are deleted from the N similar feature vectors, the N similar feature vectors are empty, while the feature vectors corresponding to all targets in the image are those corresponding to targets B1, B2, and B3. The combination of the two sets is therefore the feature vectors corresponding to targets B1, B2, and B3; that is, the target feature vectors are the feature vectors corresponding to targets B1, B2, and B3.
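The deduplication procedure described above can be sketched as follows. This is a minimal illustration only: the patent does not specify the similarity measure or threshold, so the cosine similarity, the threshold value, and the function name are all assumptions made here for clarity.

```python
import numpy as np

def deduplicate_features(image_vecs, similar_vecs, sim_threshold=0.9):
    """Fuse the feature vectors of all targets in the image with the N
    similar feature vectors: any similar vector whose (second) similarity
    to an image feature vector meets the preset threshold is treated as a
    duplicate and deleted, and the remainder form the target feature vectors."""
    kept_similar = list(similar_vecs)
    for img_vec in image_vecs:
        remaining = []
        for sim_vec in kept_similar:
            # Second similarity calculation: cosine similarity (one possible choice).
            cos = np.dot(img_vec, sim_vec) / (
                np.linalg.norm(img_vec) * np.linalg.norm(sim_vec) + 1e-8)
            if cos < sim_threshold:
                remaining.append(sim_vec)  # below threshold: not a duplicate
        kept_similar = remaining
    # Image feature vectors are kept in full; duplicated similar vectors drop out.
    return image_vecs + kept_similar
```

In the B1/B2/B3 example above, the two similar vectors (B1, B2) duplicate two of the three image vectors, so after deletion the fused set is exactly the three feature vectors corresponding to B1, B2, and B3.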
It should be understood that the above-mentioned deduplication processing is only one possible manner of feature vector fusion provided by the embodiment of the present disclosure. In specific implementations, the feature vectors corresponding to all targets in the image and the N similar feature vectors may also be fused in other manners, which is not limited by the embodiment of the present disclosure.
By this method, the feature vectors corresponding to all targets in the image can be determined through the position vector parameters, the similar feature vectors corresponding to targets existing in both the image and the target detection image can be determined according to the similarity calculation results, and feature vector fusion of the similar feature vectors with the feature vectors corresponding to all targets in the image then removes the redundant feature vectors. This improves calculation efficiency while obtaining a more accurate target to be tracked.
After the target feature vector is obtained by any of the above methods, linear feature conversion may be performed on the target feature vector to obtain tracking frame information corresponding to the target to be tracked in the image, where the tracking frame information includes position information and size information of a tracking frame corresponding to the target to be tracked, so that the target to be tracked may be indicated in the image according to the tracking frame information.
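The linear feature conversion mentioned above can be sketched as a single linear head mapping a target feature vector to tracking-box position and size. The weight initialization, feature dimension, and sigmoid normalization below are illustrative assumptions — the patent does not specify the form of the conversion.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim = 256  # assumed feature dimension

# Hypothetical linear head (weights would normally be learned): maps a
# target feature vector to normalized (cx, cy, w, h) tracking-box values.
W = rng.standard_normal((4, feat_dim)) * 0.01
b = np.zeros(4)

def feature_to_box(target_vec):
    logits = W @ target_vec + b
    # Sigmoid keeps the normalized position and size values in (0, 1).
    return 1.0 / (1.0 + np.exp(-logits))

box = feature_to_box(rng.standard_normal(feat_dim))
```

The resulting four values give the tracking frame's position (center) and size relative to the image, which suffices to indicate the target to be tracked in the image.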
Based on the same inventive concept, the disclosed embodiments also provide a video target tracking apparatus, which may be implemented as part or all of an electronic device through software, hardware, or a combination of both. Referring to fig. 4, the video object tracking apparatus includes:
an obtaining module 401, configured to obtain a video to be tracked;
a tracking module 402, configured to input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, where the tracking module includes:
the first determining submodule 4021 is configured to determine, for each frame of image of the video to be tracked, a feature vector corresponding to a target to be tracked in a target detection image corresponding to the image, where the target detection image includes the target to be tracked;
the second determining submodule 4022 is configured to perform first similarity calculation on each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determine a target feature vector in all feature vectors of the feature map according to a first similarity calculation result;
the third determining submodule 4023 is configured to determine the target to be tracked in the image according to the target feature vector.
Optionally, the target detection image is a previous frame image of the image including the target to be tracked; or, the target detection image is a preset input image including the target to be tracked.
Optionally, the second determining sub-module 4022 is configured to:
when the target detection image comprises N targets to be tracked, selecting N feature vectors with the maximum first similarity calculation result from all feature vectors of the feature map as target feature vectors, wherein N is a positive integer.
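The top-N selection performed by the second determining submodule can be sketched as below. The dot product as the first similarity calculation and the best-match scoring are assumptions for illustration; the patent leaves the similarity measure open.

```python
import numpy as np

def select_top_n(feature_map_vecs, tracked_vecs):
    """First similarity calculation (a sketch): score every feature vector
    of the feature map against the feature vectors of the N targets to be
    tracked, and keep the N map vectors with the largest results."""
    n = len(tracked_vecs)
    # Each map vector's score is its best dot product over the N targets.
    scores = np.max(feature_map_vecs @ np.asarray(tracked_vecs).T, axis=1)
    top_idx = np.argsort(scores)[-n:]
    return feature_map_vecs[top_idx]
```

With N = 2 tracked targets, the two feature-map vectors most similar to the tracked targets are returned as the target (or similar) feature vectors.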
Optionally, the target tracking model is further configured to determine feature vectors corresponding to all targets in the image according to the pre-trained position vector parameters, and the second determining submodule is configured to:
when the target detection image comprises N targets to be tracked, selecting N feature vectors with the maximum first similarity calculation result from all feature vectors of the feature map as similar feature vectors, wherein N is a positive integer;
and carrying out duplicate removal processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain target feature vectors.
Optionally, the image is a first frame image of the video to be tracked, and the apparatus further includes:
and the fourth determining submodule is used for taking the feature vector determined according to the position vector parameters which are pre-trained as the target feature vector.
Optionally, the second determining sub-module 4022 is configured to:
performing second similarity calculation on the feature vectors and the N similar feature vectors aiming at the feature vector corresponding to each target in the image;
when the second similarity calculation result is greater than or equal to the preset similarity, deleting the feature vectors corresponding to the second similarity calculation result from the feature vectors corresponding to all the targets in the image or from the N similar feature vectors;
and taking the feature vectors corresponding to all the targets in the image after the deletion processing and the residual feature vectors in the N similar feature vectors as target feature vectors.
Optionally, the apparatus 400 further includes the following modules for training to obtain the position vector parameters:
the first training module is used for determining a prediction characteristic vector corresponding to a target in a sample image according to the initial position vector parameter so as to obtain prediction target information corresponding to the sample image, wherein the sample image is marked with corresponding sample target information in advance;
and the second training module is used for calculating a loss function according to the predicted target information and the sample target information and adjusting the initial position vector parameters according to the calculation result of the loss function.
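The two training modules above amount to a predict–compare–update loop over the position vector parameters. The sketch below uses a toy linear readout, an L2 loss, and plain gradient descent; the dimensions, labels, learning rate, and loss choice are all assumptions, since the patent does not fix them.

```python
import numpy as np

rng = np.random.default_rng(1)
feat_dim, n_slots = 8, 3

# Initial position vector parameters: one learnable vector per target slot.
pos_params = rng.standard_normal((n_slots, feat_dim)) * 0.1

# Toy sample: one image feature and its pre-annotated sample target info.
sample_feat = rng.standard_normal(feat_dim)
sample_feat /= np.linalg.norm(sample_feat)
sample_targets = np.array([1.0, -0.5, 0.2])

lr = 0.2
for _ in range(300):
    pred = pos_params @ sample_feat               # predicted target info
    loss = np.mean((pred - sample_targets) ** 2)  # loss function (L2)
    # Analytic gradient of the loss w.r.t. the position vector parameters,
    # used to adjust the initial parameters (second training module).
    grad = (2.0 / n_slots) * np.outer(pred - sample_targets, sample_feat)
    pos_params -= lr * grad
```

After training, the position vector parameters reproduce the annotated sample target information, mirroring how the adjusted parameters later locate all targets in an image.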
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Based on the same inventive concept, the disclosed embodiments also provide a computer readable medium, on which a computer program is stored, which when executed by a processing device, implements the steps of any of the above-mentioned video object tracking methods.
Based on the same inventive concept, an embodiment of the present disclosure further provides an electronic device, including:
a storage device having a computer program stored thereon;
and the processing device is used for executing the computer program in the storage device so as to realize the steps of any video target tracking method.
Reference is now made to fig. 5, which illustrates a schematic diagram of an electronic device 500 suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, electronic device 500 may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic device 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 501.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the communication may be performed using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a video to be tracked; inputting the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, wherein the target tracking model is used for executing the following processing: determining a feature vector corresponding to a target to be tracked in a target detection image corresponding to each frame of image of the video to be tracked, wherein the target detection image comprises the target to be tracked; performing first similarity calculation on each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determining a target feature vector in all feature vectors of the feature map according to a first similarity calculation result; and determining a target to be tracked in the image according to the target feature vector.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. Wherein the name of a module in some cases does not constitute a limitation on the module itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a video target tracking method, according to one or more embodiments of the present disclosure, the method comprising:
acquiring a video to be tracked;
inputting the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, wherein the target tracking model is used for executing the following processing:
determining a feature vector corresponding to a target to be tracked in a target detection image corresponding to each frame of image of the video to be tracked, wherein the target detection image comprises the target to be tracked;
performing first similarity calculation on each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determining a target feature vector in all feature vectors of the feature map according to a first similarity calculation result;
and determining a target to be tracked in the image according to the target feature vector.
Example 2 provides the method of example 1, the target detection image being a previous frame image of the image including the target to be tracked; or
The target detection image is a preset input image including the target to be tracked.
Example 3 provides the method of example 1, the determining a target feature vector among all feature vectors of the feature map according to the first similarity calculation result, including:
when the target detection image comprises N targets to be tracked, selecting N feature vectors with the maximum first similarity calculation result from all feature vectors of the feature map as target feature vectors, wherein N is a positive integer.
Example 4 provides the method of example 1, wherein the target tracking model is further configured to determine feature vectors corresponding to all targets in the image according to pre-trained location vector parameters, and the determining target feature vectors in all feature vectors of the feature map according to the first similarity calculation result includes:
when the target detection image comprises N targets to be tracked, selecting N feature vectors with the maximum first similarity calculation result from all feature vectors of the feature map as similar feature vectors, wherein N is a positive integer;
and carrying out duplicate removal processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain target feature vectors.
Example 5 provides the method of example 4, the image being a first frame image of the video to be tracked, according to one or more embodiments of the present disclosure, the method further comprising:
and taking the feature vector determined according to the pre-trained position vector parameters as the target feature vector.
Example 6 provides the method of example 4, where the performing deduplication processing on the feature vectors corresponding to all the targets in the image and the N similar feature vectors to obtain a target feature vector includes:
performing second similarity calculation on the feature vectors and the N similar feature vectors aiming at the feature vector corresponding to each target in the image;
when the second similarity calculation result is greater than or equal to the preset similarity, deleting the feature vectors corresponding to the second similarity calculation result from the feature vectors corresponding to all the targets in the image or from the N similar feature vectors;
and taking the feature vectors corresponding to all the targets in the image after the deletion processing and the residual feature vectors in the N similar feature vectors as target feature vectors.
Example 7 provides the method of any one of examples 4-6, the position vector parameters being trained in the following manner, in accordance with one or more embodiments of the present disclosure:
determining a prediction characteristic vector corresponding to a target in a sample image according to the initial position vector parameter to obtain prediction target information corresponding to the sample image, wherein the sample image is marked with corresponding sample target information in advance;
and calculating a loss function according to the predicted target information and the sample target information, and adjusting the initial position vector parameters according to the calculation result of the loss function.
Example 8 provides, in accordance with one or more embodiments of the present disclosure, a video target tracking apparatus, the apparatus comprising:
the acquisition module is used for acquiring a video to be tracked;
a tracking module, configured to input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, where the tracking module includes:
the first determining submodule is used for determining a feature vector corresponding to a target to be tracked in a target detection image corresponding to each frame of image of the video to be tracked, wherein the target detection image comprises the target to be tracked;
the second determining submodule is used for performing first similarity calculation on each feature vector in the feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determining a target feature vector in all feature vectors of the feature map according to a first similarity calculation result;
and the third determining submodule is used for determining the target to be tracked in the image according to the target feature vector.
Example 9 provides the apparatus of example 8, the target detection image being a previous frame image of the image including the target to be tracked, in accordance with one or more embodiments of the present disclosure; or, the target detection image is a preset input image including the target to be tracked.
Example 10 provides the apparatus of example 8, the second determination submodule to:
when the target detection image comprises N targets to be tracked, selecting N feature vectors with the maximum first similarity calculation result from all feature vectors of the feature map as target feature vectors, wherein N is a positive integer.
Example 11 provides the apparatus of example 8, the target tracking model is further configured to determine feature vectors corresponding to all targets in the image according to pre-trained position vector parameters, and the second determination submodule is configured to:
when the target detection image comprises N targets to be tracked, selecting N feature vectors with the maximum first similarity calculation result from all feature vectors of the feature map as similar feature vectors, wherein N is a positive integer;
and carrying out duplicate removal processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain target feature vectors.
Example 12 provides the apparatus of example 11, the image being a first frame image of the video to be tracked, the apparatus further comprising:
and the fourth determining submodule is used for taking the feature vector determined according to the position vector parameters which are pre-trained as the target feature vector.
Example 13 provides the apparatus of example 11, the second determination submodule to:
performing second similarity calculation on the feature vectors and the N similar feature vectors aiming at the feature vector corresponding to each target in the image;
when the second similarity calculation result is greater than or equal to the preset similarity, deleting the feature vectors corresponding to the second similarity calculation result from the feature vectors corresponding to all the targets in the image or from the N similar feature vectors;
and taking the feature vectors corresponding to all the targets in the image after the deletion processing and the residual feature vectors in the N similar feature vectors as target feature vectors.
Example 14 provides the apparatus of any one of examples 11-13, further comprising means for training the position vector parameters to:
the first training module is used for determining a prediction characteristic vector corresponding to a target in a sample image according to the initial position vector parameter so as to obtain prediction target information corresponding to the sample image, wherein the sample image is marked with corresponding sample target information in advance;
and the second training module is used for calculating a loss function according to the predicted target information and the sample target information and adjusting the initial position vector parameters according to the calculation result of the loss function.
Example 15 provides a computer readable medium having stored thereon a computer program that, when executed by a processing apparatus, performs the steps of the method of any of examples 1-7, in accordance with one or more embodiments of the present disclosure.
Example 16 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method of any of examples 1-7.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of the features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features disclosed in this disclosure having similar functions.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (10)

1. A video target tracking method, the method comprising:
acquiring a video to be tracked; and
inputting the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, wherein the target tracking model is configured to perform the following processing:
for each frame of image of the video to be tracked, determining a feature vector corresponding to a target to be tracked in a target detection image corresponding to the image, wherein the target detection image comprises the target to be tracked;
performing a first similarity calculation between each feature vector in a feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determining a target feature vector among all feature vectors of the feature map according to a first similarity calculation result; and
determining the target to be tracked in the image according to the target feature vector.
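The per-frame matching that claim 1 describes can be sketched as follows. The cosine metric is an illustrative assumption; the claim does not fix the form of the first similarity calculation, and `match_target` is a hypothetical name:

```python
import numpy as np

def match_target(feature_map, target_vec):
    """Return the index of the feature vector in the frame's feature map
    that best matches the tracked target's feature vector.

    feature_map: (num_positions, dim) array; target_vec: (dim,) array.
    Cosine similarity stands in for the first similarity calculation."""
    fm = feature_map / np.linalg.norm(feature_map, axis=1, keepdims=True)
    tv = target_vec / np.linalg.norm(target_vec)
    scores = fm @ tv                  # first similarity calculation results
    return int(np.argmax(scores))     # index of the target feature vector
```

The position of the returned feature vector in the map then localizes the target to be tracked in the current image.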
2. The method according to claim 1, wherein the target detection image is the previous frame of image, relative to the image, that includes the target to be tracked; or
the target detection image is a preset input image that includes the target to be tracked.
3. The method according to claim 1, wherein determining a target feature vector among all feature vectors of the feature map according to the first similarity calculation result comprises:
when the target detection image comprises N targets to be tracked, selecting, from all feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as target feature vectors, wherein N is a positive integer.
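The N-target selection of claim 3 might look like the sketch below, which picks the highest-scoring feature vector per target. Cosine similarity and one vector per target are assumptions; the claim only requires the N largest first-similarity results:

```python
import numpy as np

def select_target_vectors(feature_map, target_vecs):
    """For a detection image containing N targets, pick one feature vector
    from the frame's feature map for each target: the one with the largest
    first-similarity score against that target's feature vector."""
    fm = feature_map / np.linalg.norm(feature_map, axis=1, keepdims=True)
    tv = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    scores = tv @ fm.T                 # (N, num_positions) similarity matrix
    best = scores.argmax(axis=1)       # N indices with the largest results
    return feature_map[best]           # the N target feature vectors
```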
4. The method according to claim 1, wherein the target tracking model is further configured to determine feature vectors corresponding to all targets in the image according to pre-trained position vector parameters, and determining a target feature vector among all feature vectors of the feature map according to the first similarity calculation result comprises:
when the target detection image comprises N targets to be tracked, selecting, from all feature vectors of the feature map, the N feature vectors with the largest first similarity calculation results as similar feature vectors, wherein N is a positive integer; and
performing de-duplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors.
5. The method according to claim 4, wherein the image is the first frame of image of the video to be tracked, and the method further comprises:
taking the feature vectors determined according to the pre-trained position vector parameters as the target feature vectors.
6. The method according to claim 4, wherein performing de-duplication processing on the feature vectors corresponding to all targets in the image and the N similar feature vectors to obtain the target feature vectors comprises:
performing, for the feature vector corresponding to each target in the image, a second similarity calculation between that feature vector and the N similar feature vectors;
when a second similarity calculation result is greater than or equal to a preset similarity, deleting the feature vector corresponding to that second similarity calculation result from the feature vectors corresponding to all targets in the image or from the N similar feature vectors; and
taking the feature vectors that remain, after the deletion processing, among the feature vectors corresponding to all targets in the image and the N similar feature vectors as the target feature vectors.
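A minimal sketch of the de-duplication in claim 6, under two stated assumptions: cosine similarity as the second similarity calculation, and always deleting the similar-vector copy of a duplicate (the claim leaves the choice of which side to delete open):

```python
import numpy as np

def deduplicate(image_vecs, similar_vecs, preset_similarity=0.9):
    """Whenever a feature vector of a target detected in the image and one
    of the N similar feature vectors score at or above the preset
    similarity, drop the similar-vector copy; the vectors remaining on
    both sides become the target feature vectors."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    remaining = list(similar_vecs)
    for img_vec in image_vecs:            # second similarity calculation
        remaining = [s for s in remaining
                     if cos(img_vec, s) < preset_similarity]
    return list(image_vecs) + remaining   # target feature vectors
```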
7. The method according to any one of claims 4 to 6, wherein the position vector parameters are trained by:
determining predicted feature vectors corresponding to targets in a sample image according to initial position vector parameters to obtain predicted target information corresponding to the sample image, wherein the sample image is pre-labelled with corresponding sample target information; and
calculating a loss function according to the predicted target information and the sample target information, and adjusting the initial position vector parameters according to a calculation result of the loss function.
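A toy sketch of the training procedure in claim 7. The linear prediction head, the squared-error loss, and the plain gradient step are illustrative assumptions; the claim fixes neither the loss function nor the update rule:

```python
import numpy as np

def train_position_vectors(init_pos_vecs, sample_feats, sample_info,
                           lr=0.1, steps=500):
    """Predict target information from the position vector parameters,
    compute a loss against the pre-labelled sample target information,
    and adjust the parameters according to the loss result."""
    pos_vecs = init_pos_vecs.copy()
    for _ in range(steps):
        predicted = sample_feats @ pos_vecs      # predicted target info
        residual = predicted - sample_info       # drives the loss
        grad = sample_feats.T @ residual / len(sample_feats)
        pos_vecs -= lr * grad                    # adjust the parameters
    return pos_vecs
```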
8. A video target tracking apparatus, the apparatus comprising:
an acquisition module configured to acquire a video to be tracked; and
a tracking module configured to input the video to be tracked into a target tracking model to obtain a target tracking result corresponding to the video to be tracked, wherein the tracking module comprises:
a first determining submodule configured to determine, for each frame of image of the video to be tracked, a feature vector corresponding to a target to be tracked in a target detection image corresponding to the image, wherein the target detection image comprises the target to be tracked;
a second determining submodule configured to perform a first similarity calculation between each feature vector in a feature map corresponding to the image and the feature vector corresponding to the target to be tracked in the target detection image, and determine a target feature vector among all feature vectors of the feature map according to a first similarity calculation result; and
a third determining submodule configured to determine the target to be tracked in the image according to the target feature vector.
9. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processing device, implements the steps of the method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a storage device having a computer program stored thereon; and
a processing device configured to execute the computer program in the storage device to implement the steps of the method according to any one of claims 1 to 7.
CN202110179157.5A 2021-02-09 2021-02-09 Video target tracking method and device, storage medium and electronic equipment Pending CN112907628A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110179157.5A CN112907628A (en) 2021-02-09 2021-02-09 Video target tracking method and device, storage medium and electronic equipment
PCT/CN2022/075086 WO2022171036A1 (en) 2021-02-09 2022-01-29 Video target tracking method, video target tracking apparatus, storage medium, and electronic device

Publications (1)

Publication Number Publication Date
CN112907628A true CN112907628A (en) 2021-06-04

Country Status (2)

Country Link
CN (1) CN112907628A (en)
WO (1) WO2022171036A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022171036A1 (en) * 2021-02-09 2022-08-18 北京有竹居网络技术有限公司 Video target tracking method, video target tracking apparatus, storage medium, and electronic device

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385497B (en) * 2023-05-29 2023-08-22 成都与睿创新科技有限公司 Custom target tracking method and system for body cavity
CN117710701A (en) * 2023-06-13 2024-03-15 荣耀终端有限公司 Method and device for tracking object and electronic equipment
CN117975198A (en) * 2024-02-02 2024-05-03 北京视觉世界科技有限公司 Automatic construction method of target detection class data set and related equipment thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214238A (en) * 2017-06-30 2019-01-15 百度在线网络技术(北京)有限公司 Multi-object tracking method, device, equipment and storage medium
CN109829397A (en) * 2019-01-16 2019-05-31 创新奇智(北京)科技有限公司 A kind of video labeling method based on image clustering, system and electronic equipment
CN110717414A (en) * 2019-09-24 2020-01-21 青岛海信网络科技股份有限公司 Target detection tracking method, device and equipment
CN111242973A (en) * 2020-01-06 2020-06-05 上海商汤临港智能科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN111311635A (en) * 2020-02-08 2020-06-19 腾讯科技(深圳)有限公司 Target positioning method, device and system
CN111898416A (en) * 2020-06-17 2020-11-06 绍兴埃瓦科技有限公司 Video stream processing method and device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11301716B2 (en) * 2018-08-24 2022-04-12 Nec Corporation Unsupervised domain adaptation for video classification
CN112907628A (en) * 2021-02-09 2021-06-04 北京有竹居网络技术有限公司 Video target tracking method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
WO2022171036A1 (en) 2022-08-18

Similar Documents

Publication Publication Date Title
CN110298413B (en) Image feature extraction method and device, storage medium and electronic equipment
CN112907628A (en) Video target tracking method and device, storage medium and electronic equipment
CN110059623B (en) Method and apparatus for generating information
CN111784712B (en) Image processing method, device, equipment and computer readable medium
CN113313064A (en) Character recognition method and device, readable medium and electronic equipment
CN112330788A (en) Image processing method, image processing device, readable medium and electronic equipment
CN112800276A (en) Video cover determination method, device, medium and equipment
CN112418249A (en) Mask image generation method and device, electronic equipment and computer readable medium
CN115205305A (en) Instance segmentation model training method, instance segmentation method and device
CN111126159A (en) Method, apparatus, electronic device, and medium for tracking pedestrian in real time
CN111783632B (en) Face detection method and device for video stream, electronic equipment and storage medium
CN111586295B (en) Image generation method and device and electronic equipment
CN111797266B (en) Image processing method and apparatus, storage medium, and electronic device
CN112258622A (en) Image processing method, image processing device, readable medium and electronic equipment
CN110765304A (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN111027495A (en) Method and device for detecting key points of human body
CN115086541B (en) Shooting position determining method, device, equipment and medium
CN113033552B (en) Text recognition method and device and electronic equipment
CN110084835B (en) Method and apparatus for processing video
CN113705386A (en) Video classification method and device, readable medium and electronic equipment
CN110209851B (en) Model training method and device, electronic equipment and storage medium
CN114004229A (en) Text recognition method and device, readable medium and electronic equipment
CN113222050A (en) Image classification method and device, readable medium and electronic equipment
CN111680754A (en) Image classification method and device, electronic equipment and computer-readable storage medium
CN111353470A (en) Image processing method and device, readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination