CN114202559A - Target tracking method and device, electronic equipment and storage medium

Info

Publication number: CN114202559A
Application number: CN202010986849.6A
Authority: CN (China)
Prior art keywords: sampling, images, image, current frame, target
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: 李虎 (Li Hu)
Current assignee: Beijing Kingsoft Cloud Network Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Beijing Kingsoft Cloud Network Technology Co Ltd
Application filed by: Beijing Kingsoft Cloud Network Technology Co Ltd
Priority: CN202010986849.6A (the priority date is an assumption and is not a legal conclusion)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

When target tracking is performed on a current frame image in an image sequence, the image sequence is first sampled between frames such that the sampling density at a given position is inversely correlated with the time interval between that position and the current frame image. This extracts inter-frame correlation information with a high degree of correlation to the current frame, which can guide or assist more accurate and precise positioning of the target in the current frame image. Each inter-frame image obtained by such sampling is then further sampled at multiple scales within the frame, so that multi-scale distribution information of the target within the frame can be obtained. This improves the understanding of each single image frame and overcomes situations in which the tracked target is easily lost, for example because the receptive field of a convolution kernel in a convolutional neural network is fixed.

Description

Target tracking method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer vision, and in particular, to a target tracking method and apparatus, an electronic device, and a storage medium.
Background
Target tracking is an active research topic in computer vision. By processing an image sequence, a moving target is tracked within a camera's field of view, providing semantic or non-semantic information to support the study of target movement patterns, behavior understanding, event detection, and the like. Target tracking includes single-target tracking, multi-target tracking, cross-camera multi-target tracking, and so on.
Currently, target tracking is generally achieved either by conventional methods such as optical flow analysis or by convolutional neural network methods. Conventional methods such as optical flow analysis are computationally simple, but the feature information they extract (such as grayscale features) is limited, which limits tracking accuracy. Convolutional neural network methods can extract complex feature information and to some extent overcome this shortcoming of the conventional methods, but the accuracy or precision of the tracking task remains low, and the tracked target is still easily lost.
Disclosure of Invention
In view of this, the present application provides a target tracking method, an apparatus, an electronic device, and a storage medium, which address the shortcomings of target tracking by convolutional neural network methods, improve the accuracy and precision of the target tracking task, and avoid situations such as loss of the tracked target.
The specific technical scheme is as follows:
a target tracking method, comprising:
determining a current frame image to be processed in an image sequence; the image sequence comprises a plurality of frames of images which are arranged in an imaging time sequence, and the image contents of at least partial images in the plurality of frames of images have correlation;
performing first sampling processing on the image sequence to obtain a group of first sampling images corresponding to the current frame image; the sampling density of different positions in the image sequence is in an inverse correlation relation with the time intervals from the different positions to the current frame image;
respectively carrying out second sampling processing on each first sampling image to obtain a group of second sampling images which respectively correspond to each first sampling image and comprise second sampling images with multiple scales;
in a scale space formed by the group of first sampling images and the group of second sampling images corresponding to the first sampling images, determining a positioning result of the current frame image on the target to be tracked under the same scale based on a single group of inter-frame sampling images with the same scale so as to obtain each positioning result of the target to be tracked under different scales corresponding to the current frame image;
and positioning the target to be tracked in the current frame image based on each positioning result of the target to be tracked under different scales corresponding to the current frame image.
Optionally, the performing a first sampling process on the image sequence to obtain a group of first sampled images corresponding to the current frame image includes:
performing interframe sampling on the image sequence according to a Gaussian distribution mode;
in the interframe sampling of the Gaussian distribution mode, the current frame image corresponds to the center of Gaussian distribution, and the sampling density at different positions in the image sequence presents a Gaussian distribution state taking the current frame image as the center.
Optionally, the performing the second sampling processing on each first sampling image respectively includes:
performing down-sampling operation on each first sampling image for multiple times to obtain a plurality of second sampling images with different scales corresponding to each first sampling image;
and the second sampling images with different scales respectively correspond to different resolutions.
Optionally, in a scale space formed by the group of first sampling images and the groups of second sampling images corresponding to the first sampling images, based on a single group of inter-frame sampling images with the same scale, determining a positioning result of the current frame image on the target to be tracked in the same scale to obtain each positioning result of the target to be tracked in different scales corresponding to the current frame image, where the method includes:
inputting the group of first sampling images and each group of second sampling images corresponding to each first sampling image into a pre-constructed convolutional neural network model comprising a plurality of convolutional layers; different convolutional layers correspond to different scales in the scale space one by one; each convolution layer at least processes a group of inter-frame sampling images under the corresponding scale, so that the group of inter-frame sampling images are used for guiding the current frame image to position the target to be tracked under the corresponding scale;
acquiring output information of the convolutional neural network model;
the output information comprises positioning results of the target to be tracked under different scales corresponding to the current frame image; the positioning result of the target to be tracked under a scale corresponding to the current frame image comprises the following steps: and the at least one target to be tracked corresponds to the positioning position information of the scale in the current frame image and/or the respective class label of the at least one target to be tracked.
Optionally, in the method:
the input of the first convolutional layer of the plurality of convolutional layers is the group of first sampled images;
the input of each of the other convolutional layers of the plurality of convolutional layers is: a stack of the group of sampled images of the corresponding scale and the feature images output by the previous convolutional layer.
Optionally, the positioning the target to be tracked in the current frame image based on the positioning results of the target to be tracked under different scales corresponding to the current frame image includes:
and carrying out non-maximum suppression processing on positioning results of the target to be tracked under different scales corresponding to the current frame image so as to position the target to be tracked in the current frame image.
An object tracking device, comprising:
the first determining unit is used for determining a current frame image to be processed in the image sequence; the image sequence comprises a plurality of frames of images which are arranged in an imaging time sequence, and the image contents of at least partial images in the plurality of frames of images have correlation;
the first sampling unit is used for carrying out first sampling processing on the image sequence to obtain a group of first sampling images corresponding to the current frame image; the sampling density of different positions in the image sequence is in an inverse correlation relation with the time intervals from the different positions to the current frame image;
the second sampling unit is used for respectively carrying out second sampling processing on each first sampling image to obtain a group of second sampling images which respectively correspond to each first sampling image and comprise second sampling images with multiple scales;
a second determining unit, configured to determine, based on a single group of inter-frame sampling images with a same scale, a positioning result of the current frame image on the target to be tracked in the same scale in a scale space formed by the group of first sampling images and the groups of second sampling images corresponding to the first sampling images, so as to obtain positioning results of the target to be tracked in different scales corresponding to the current frame image;
and the positioning unit is used for positioning the target to be tracked in the current frame image based on each positioning result of the target to be tracked under different scales corresponding to the current frame image.
Optionally, the first sampling unit is specifically configured to:
performing interframe sampling on the image sequence according to a Gaussian distribution mode;
in the interframe sampling of the Gaussian distribution mode, the current frame image corresponds to the center of Gaussian distribution, and the sampling density at different positions in the image sequence presents a Gaussian distribution state taking the current frame image as the center.
Optionally, the second sampling unit is specifically configured to:
performing down-sampling operation on each first sampling image for multiple times to obtain a plurality of second sampling images with different scales corresponding to each first sampling image;
and the second sampling images with different scales respectively correspond to different resolutions.
Optionally, the second determining unit is specifically configured to:
inputting the group of first sampling images and each group of second sampling images corresponding to each first sampling image into a pre-constructed convolutional neural network model comprising a plurality of convolutional layers; different convolutional layers correspond to different scales in the scale space one by one; each convolution layer at least processes a group of inter-frame sampling images under the corresponding scale, so that the group of inter-frame sampling images are used for guiding the current frame image to position the target to be tracked under the corresponding scale;
acquiring output information of the convolutional neural network model;
the output information comprises positioning results of the target to be tracked under different scales corresponding to the current frame image; the positioning result of the target to be tracked under a scale corresponding to the current frame image comprises the following steps: and the at least one target to be tracked corresponds to the positioning position information of the scale in the current frame image and/or the respective class label of the at least one target to be tracked.
Optionally, in the apparatus:
the input of the first convolutional layer of the convolutional neural network model is the group of first sampled images;
the inputs of the convolutional layers other than the first convolutional layer of the convolutional neural network model are: a stack of the group of sampled images of the corresponding scale and the feature images output by the previous convolutional layer.
An electronic device, comprising:
a memory for storing a set of computer instructions;
a processor for implementing the target tracking method as described in any of the above by executing the set of computer instructions stored in the memory.
A computer readable storage medium having stored therein a set of computer instructions which, when executed by a processor, implement a target tracking method as in any above.
According to the target tracking method, apparatus, electronic device, and storage medium provided herein, when target tracking is performed on a current frame image in an image sequence, the image sequence is sampled between frames such that the sampling density at different positions is inversely correlated with the time interval from those positions to the current frame image. This extracts inter-frame correlation information of high relevance to the current frame image, which can guide or assist more accurate and precise positioning of the target in that image. Each inter-frame image obtained in this way is then further sampled at multiple scales within the frame, yielding multi-scale distribution information of the target and improving the understanding of each single image frame, thereby overcoming situations in which the tracked target is easily lost, for example because the receptive field of a convolution kernel in a convolutional neural network is fixed. By combining the highly correlated inter-frame information with the intra-frame multi-scale distribution information, the accuracy and precision of the target tracking task can at least be improved, and loss of the tracked target can be avoided.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a target tracking method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating distribution of sampling densities at different positions in an image sequence during Gaussian interframe sampling according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating comparison of effects of a multi-scale second sampling image obtained by performing multiple downsampling on a first sampling image according to an embodiment of the present application and the first sampling image;
FIG. 4 is a schematic diagram of a single set of inter-frame sampled images of the same scale provided by an embodiment of the present application;
FIG. 5 is a schematic flow chart of another target tracking method provided in the embodiments of the present application;
FIG. 6 is a schematic diagram of input information of different convolutional layers in a convolutional neural network model provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a target tracking device provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art from these embodiments without creative effort fall within the protection scope of the present application.
Target tracking by convolutional neural network methods can extract complex feature information (deep features); commonly used approaches include SiamFC and FlowTrack. However, through research the inventor found that the existing methods for target tracking by convolutional neural networks still have the following defects:
in the prior art, original frame images in a video are input into a convolutional neural network to track a target of a video frame image, and when a convolutional neural network is used for feature extraction, a receptive field of a convolutional kernel is fixed, however, in target tracking of an image sequence such as a video, the size of a region where a target is located in each frame image is often different, and the variation intensity of the target in different frame images and the scale of the variation region are also often different, for example, when a player plays a football in the video, because a shooting camera is fixed, the distance between the player and a lens is changed along with time, which will cause the sizes of the players in different frame images of the video to be different, the variation intensity of the player in different frame images and the scale of the variation region are also different, and when a convolutional neural network is used for feature extraction, the receptive field of the convolution kernel is fixed and invariable, and the fixed and invariable receptive field of the convolution kernel is difficult to adapt to the accurate extraction of the characteristics of the targets with constantly changing sizes and different changing intensities and changing areas in the video in each frame image, which easily causes the situations of target tracking loss in the tracking process and target positioning omission in some images;
meanwhile, in the existing method for tracking the target through the convolutional neural network, interframe image sampling is carried out on the video by adopting an equally-spaced extraction method, so that the difference of interframe information correlation cannot be embodied, and the problem of low accuracy or precision of a target tracking task still exists correspondingly.
In order to at least solve the above-mentioned defects of the existing method for tracking the target through the convolutional neural network, embodiments of the present application provide a target tracking method, apparatus, electronic device, and storage medium.
Referring to fig. 1, a schematic flow chart of the target tracking method provided in the embodiment of the present application is shown. The target tracking method in the embodiment of the present application may be applied to, but is not limited to, terminal devices with a computer vision processing function, such as mobile phones, tablet computers, and personal computers (e.g., notebooks, all-in-one machines, and desktop computers), or to a physical machine or server (a computer vision processing server) of a private/public cloud platform with a computer vision processing function, and the like.
As shown in fig. 1, in this embodiment, the target tracking method includes the following processing steps:
step 101, determining a current frame image to be processed in an image sequence; the image sequence comprises a plurality of frames of images which are arranged in imaging time sequence, and the image content of at least part of the images in the plurality of frames of images has correlation.
The image sequence on which target tracking is to be performed may be, but is not limited to: a complete video, or a video clip taken from a complete video, downloaded or shot with the device's camera; a set of images extracted from a complete video or video clip; or a set of images shot in time sequence with the device's camera.
It is easy to understand that the image content of at least some of the frames in the image sequence to be processed is correlated. The image sequence essentially comprises multiple frames corresponding to different points in a time sequence; the image content changes over time while retaining a certain correlation, as in the example of the video of a player kicking a football.
When performing target tracking on an image sequence such as a video, generally, a plurality of frames of images in the image sequence are processed frame by frame, for example, the images are processed frame by frame based on a serial mode or a parallel mode in time sequence, and target tracking on the image sequence is realized by positioning a target to be tracked on each frame of image.
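The frame-by-frame processing just described, in either serial or parallel mode, can be sketched as follows. This is an illustrative sketch only: locate_target stands in for the per-frame positioning of steps 102-105, and all names here are assumptions, not the patent's code.

```python
from concurrent.futures import ThreadPoolExecutor

def track_serial(frames, locate_target):
    """Serial mode: position the target frame by frame, in time order."""
    return [locate_target(frames, t) for t in range(len(frames))]

def track_parallel(frames, locate_target, workers=4):
    """Parallel mode: each frame's positioning only reads the fixed sequence
    around it, so frames can be processed independently and in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(lambda t: locate_target(frames, t), range(len(frames))))
```

Executor.map preserves input order, so both modes yield one positioning result per frame in time order.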
The object to be tracked may be any object or a certain component of an object in the camera field of view, such as a person, an animal, an object, or a certain component of these objects (e.g., a human face).
And 102, performing first sampling processing on the image sequence to obtain a group of first sampling images corresponding to the current frame image.
In a group of image sequences (e.g., a video sequence), a target usually has different states (which may include its position in the image and its shape or posture) in different time periods. The states of the target in frames that are close in time are similar, whereas in frames separated by a large time interval they differ more, and the target may even disappear or be absent. In other words, images with a small time interval between them have highly correlated content, while images with a large time interval have weakly correlated content.
In view of this, when performing target tracking, for a current frame image to be processed (target positioning to be performed), the present application proposes to perform the following first sampling processing on an image sequence:
the sampling density of the positions closer to the current frame image in the image sequence is more dense, and the sampling of the positions farther away from the current frame image is more sparse, so that the sampling density of different positions in the image sequence is in an inverse correlation relation with the time intervals from the different positions to the current frame image.
That is, it is considered that the correlation between the image of the image sequence having a smaller time interval from the current frame image and the current frame image is larger, and the correlation between the image having a larger time interval from the current frame image and the current frame image is smaller.
In implementation, the image sequence may optionally be inter-frame sampled in a gaussian distribution.
As shown in fig. 2, in inter-frame sampling in the Gaussian distribution manner, the current frame image corresponds to the center of the Gaussian distribution, and the sampling density at different positions in the image sequence follows a Gaussian distribution centered on the current frame image: the closer to the current frame, the smaller the sampling interval and the denser the sampling; the farther from the current frame, the larger the sampling interval and the sparser the sampling.
As another embodiment, the image sequence may be divided in advance into a plurality of time segments, with a relatively small sampling interval set for segments close to the current frame image and a relatively large sampling interval for segments far from it, the interval increasing (or decreasing) as the segment's time distance from the current frame grows (or shrinks). This likewise extracts, as far as possible, the inter-frame correlation information in the image sequence most relevant to the current frame image.
Based on the sampling processing, a group of inter-frame first sampling images corresponding to the current frame image can be obtained, and the group of first sampling images comprises the current frame image to be subjected to target positioning.
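As a minimal illustrative sketch (not the patent's implementation), the Gaussian inter-frame sampling of step 102 can be realized by mapping evenly spaced quantiles of a Gaussian centered on the current frame index back to frame indices, so that sampled indices cluster densely around the current frame and thin out with distance. The function name, the quantile scheme, and the sigma parameter are all assumptions:

```python
from statistics import NormalDist

def gaussian_frame_sampling(current_idx, seq_len, num_samples, sigma):
    """Pick frame indices whose density follows a Gaussian centered on current_idx."""
    dist = NormalDist(mu=current_idx, sigma=sigma)
    # Evenly spaced quantiles of the Gaussian map to indices that cluster near
    # the mean: dense sampling near the current frame, sparse far from it.
    qs = [(i + 0.5) / num_samples for i in range(num_samples)]
    idxs = {min(seq_len - 1, max(0, round(dist.inv_cdf(q)))) for q in qs}
    idxs.add(current_idx)  # the group always contains the current frame itself
    return sorted(idxs)
```

For example, with a 100-frame sequence, current frame 50, nine samples, and sigma 10, the returned indices bunch tightly around frame 50 and spread out toward the ends of the window.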
And 103, respectively carrying out second sampling processing on each first sampling image to obtain a group of second sampling images which respectively correspond to each first sampling image and comprise second sampling images with multiple scales.
To eliminate the adverse effect on target tracking of the fixed receptive field of convolution kernels in a convolutional neural network, after the first, inter-frame sampling of the image sequence (e.g., in the Gaussian manner), the embodiment of the present application continues with intra-frame multi-scale sampling, i.e., the second sampling processing, on each image in the group of first sampled images.
By performing intra-frame multi-scale sampling on each first sampled image, multi-scale distribution information of the target within each first sampled image can be obtained. This intra-frame multi-scale information improves the convolution kernel's adaptability, during feature extraction with a fixed receptive field, to a target whose size changes across frames.
Optionally, in this embodiment, an image pyramid manner is adopted to perform intra-frame multi-scale sampling processing (second sampling processing) on each first sampling image.
An image pyramid is a multi-scale representation of an image: an effective but conceptually simple structure for interpreting an image at multiple resolutions. The pyramid of an image is a series of images derived from the same original, with progressively lower resolutions, arranged in a pyramid shape. It is obtained by downsampling repeatedly, stopping when a termination condition is reached (for example, a preset number of sampling steps, or a preset minimum scale). Comparing the successive layers obtained by sampling to a pyramid, the higher the level, the smaller the image scale and the lower the resolution.
Thus, the second sampling process for the first sampled image may include:
performing down-sampling operation on each first sampling image for multiple times to obtain a plurality of second sampling images with different scales corresponding to each first sampling image; and the second sampling images with different scales respectively correspond to different resolutions.
In the downsampling operation, the pixels of the first sampled image may be downsampled multiple times at different pixel intervals/scale factors to obtain sampled images of different scales; alternatively, the original image may be downsampled once and the resulting sampled image then downsampled iteratively to obtain multiple sampled images of different scales. Here, "scale" can be understood as a scale factor controlling scaling along the X (horizontal) and Y (vertical) directions of the image; sampled images of different scales correspond to different resolutions. The larger the pixel interval (the smaller the scale factor) used when downsampling, the smaller the scale and the lower the resolution of the resulting sampled image.
The downsampling operation yields, for each first sampled image, a pyramid-shaped group of second sampled images. Fig. 3 compares a first sampled image with the sampled images obtained by downsampling it multiple times: image (a) is the original first sampled image, and images (b), (c), and (d) are second sampled images of different scales obtained by downsampling it, with scale and resolution decreasing in turn.
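The repeated downsampling described above can be sketched as follows. This sketch uses simple 2x2 block averaging as a stand-in for whatever smoothing-and-subsampling kernel an implementation actually applies, so the function name and the averaging choice are assumptions:

```python
import numpy as np

def build_pyramid(img, num_levels=3):
    """Return [original, level1, ...]; each level halves both sides by 2x2
    block averaging (a simple stand-in for Gaussian-blur-then-subsample)."""
    levels = [img]
    for _ in range(num_levels):
        top = levels[-1]
        h, w = top.shape[:2]
        top = top[: h - h % 2, : w - w % 2]  # crop to even size before pooling
        top = top.reshape(h // 2, 2, w // 2, 2, *top.shape[2:]).mean(axis=(1, 3))
        levels.append(top)
    return levels
```

Each successive level has half the width and height of the previous one, matching the progressively smaller scales and lower resolutions of images (b), (c), and (d) in fig. 3.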
And 104, determining a positioning result of the current frame image to the target to be tracked in the same scale based on a single group of inter-frame sampling images with the same scale in a scale space formed by the group of first sampling images and the group of second sampling images corresponding to the first sampling images so as to obtain each positioning result of the target to be tracked in different scales corresponding to the current frame image.
As described above, in a group of image sequences (e.g., a video), the states of an object in images that are closer in time are more similar, while in images separated by a larger time interval the object's state differs more, and the object may even have disappeared or not yet appeared. That is, images separated by a smaller time interval in the sequence have more strongly correlated content, and images separated by a larger interval have less correlated content. Exploiting this correlation between images at different positions in the sequence, the inter-frame images, especially the frames immediately preceding and following the current frame image (the several images before and after the current frame, which are closest to it in time), can guide the positioning of the target in the current frame image.
On this basis, in this embodiment, in the scale space formed by the group of first sampled images and the groups of second sampled images corresponding to the first sampled images, each single group of inter-frame sampled images having the same scale guides the current frame image in positioning the target at that scale, as shown specifically in fig. 4; for each scale in the scale space, a corresponding positioning result of the current frame image for the target at that scale is obtained.
And 105, positioning the target to be tracked in the current frame image based on the positioning results of the target to be tracked under different scales corresponding to the current frame image.
It is easy to understand that, for the current frame image, the positioning accuracy/precision of the target at one or several scales may be poor, or the target may not be locatable at a certain scale at all because target tracking is lost, for example due to the fixed-receptive-field convolution kernel described above.
However, since each first sampled image obtained by inter-frame sampling is further sampled at multiple scales, the multi-scale distribution information of the target within the frame is covered; therefore, among the positioning results of the current frame image for the target at the various scales, the positioning accuracy/precision at one or more scales will be high. By integrating the positioning results of the current frame image at each scale, the target to be tracked can finally be located in the current frame image with high accuracy/precision.
Specifically, the target to be tracked in the current frame image can be determined by performing non-maximum suppression processing on each positioning result of the target to be tracked under different scales corresponding to the current frame image.
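The non-maximum suppression step can be sketched as follows. This is a non-authoritative illustration assuming the per-scale positioning results have been pooled into `(x1, y1, x2, y2)` windows with confidence scores; the thresholds and box format are assumptions, not taken from the embodiment.

```python
def iou(a, b):
    # intersection-over-union of two (x1, y1, x2, y2) windows
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # keep the highest-scoring window, suppress windows overlapping it, repeat
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep

# two overlapping detections of the same target from different scales,
# plus one detection of a different target
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]
```

The duplicate detection at index 1 is suppressed, leaving one window per target, which is the effect the non-maximum suppression processing is used for here.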
In this embodiment, when target tracking is performed on the current frame image in an image sequence, inter-frame sampling is performed on the image sequence such that the sampling density at different positions is inversely related to the time interval from those positions to the current frame image, so that highly correlated inter-frame information is extracted for the current frame image; this highly correlated inter-frame information can guide or assist the current frame image in positioning the target more accurately/precisely. Each inter-frame image obtained by this sampling is further sampled at multiple scales within the frame, so that the multi-scale distribution information of the target within the frame can be obtained; this improves the understanding of the single image frame and overcomes situations in which the tracked target is easily lost because, for example, the receptive field of the convolutional neural network is fixed. By combining the highly correlated inter-frame information with the multi-scale intra-frame distribution information, the accuracy/precision of the target tracking task can at least be improved, and loss of the tracked target can be avoided.
An alternative embodiment of step 104 of the target tracking method shown in fig. 1 is provided below, in which a convolutional neural network uses each single group of inter-frame sampled images having the same scale to guide the positioning of the target at that scale in the current frame image.
Referring to another flow chart of the target tracking method shown in fig. 5, in the embodiment based on the convolutional neural network, step 104 can be implemented as the following processing procedures:
step 501, inputting the group of first sampling images and each group of second sampling images corresponding to each first sampling image into a pre-constructed convolutional neural network model comprising a plurality of convolutional layers;
and 502, acquiring output information of the convolutional neural network model.
In implementation, a convolutional neural network model may be constructed in advance based on a training process.
When constructing the model, the image sequence used as the training data set, such as a series of videos, is processed as in the previous embodiment: inter-frame first sampling is performed for the current video frame to be processed in the sequence, and intra-frame second sampling is further performed on each first sampled image obtained by the inter-frame sampling. On this basis, the group of first sampled images corresponding to the current frame image and the groups of second sampled images corresponding to each first sampled image, obtained by the two kinds of sampling, are input into the model, and the model outputs the positioning result of the target at each scale corresponding to the current frame image. The output of the model is matched against a pre-calibrated target positioning result, feedback is provided to the model based on the matching, and the dimension weights of the convolution kernels in each convolutional layer of the model are adjusted, thereby realizing one training pass of the model. This process is performed cyclically on each frame of image in the training data set, so that training of the model is iteratively repeated until a set target is reached (e.g., the degree of matching between the model's output and the pre-calibrated target positioning result meets expectations), at which point the model construction is complete.
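The training cycle described above (forward pass, comparison with the pre-calibrated result, feedback, weight adjustment, repeat until a set target is reached) can be sketched with a deliberately tiny stand-in model. A one-weight linear model replaces the convolutional network here purely to make the loop structure concrete; nothing about the toy model itself comes from the embodiment.

```python
def train(samples, targets, lr=0.05, tol=1e-4, max_epochs=1000):
    """Skeleton of the training cycle: forward pass, compare the output
    with the pre-calibrated result, feed the mismatch back, adjust the
    weight, and repeat until the match reaches the set target."""
    w = 0.0
    for _ in range(max_epochs):
        error = 0.0
        for x, y in zip(samples, targets):
            pred = w * x                 # forward pass (model output)
            grad = 2 * (pred - y) * x    # feedback based on the mismatch
            w -= lr * grad               # adjust the weight
            error += (pred - y) ** 2
        if error / len(samples) < tol:   # set target reached
            break
    return w

samples = [1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0]  # pre-calibrated results: y = 2x
w = train(samples, targets)
print(round(w, 2))
```

The loop terminates once the mean squared mismatch drops below the tolerance, mirroring the "train until the matching degree meets expectations" criterion above.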
For the implementation in which the first sampling is inter-frame Gaussian sampling and the second sampling is pyramid multi-scale sampling, a single input to the model is specifically the image set obtained after inter-frame Gaussian sampling and intra-frame pyramid sampling of the current frame image.
Based on the constructed convolutional neural network model, when target tracking needs to be performed on an image sequence, the group of first sampled images (inter-frame Gaussian sampled images) corresponding to the current frame image to be tracked and the groups of second sampled images (intra-frame pyramid multi-scale sampled images) corresponding to each first sampled image are input into the model, and the positioning results of the target at the different scales corresponding to the current frame image, as output by the model, can then be obtained.
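The inter-frame Gaussian sampling that selects which frames feed the model can be sketched as follows. The frame-selection rule here (weighted sampling without replacement) is an assumption for illustration only; the embodiment requires only that sampling density fall off with temporal distance from the current frame in a Gaussian shape.

```python
import math
import random

def gaussian_sample_frames(num_frames, current, count, sigma=5.0, seed=0):
    """Pick `count` frame indices around `current`; frames closer in time
    get a higher Gaussian weight, so sampling density is inversely related
    to the time interval from the current frame."""
    rng = random.Random(seed)
    indices = [i for i in range(num_frames) if i != current]
    weights = [math.exp(-((i - current) ** 2) / (2 * sigma ** 2))
               for i in indices]
    chosen = set()
    while len(chosen) < count:
        pick = rng.choices(indices, weights=weights, k=1)[0]
        chosen.add(pick)
    # the current frame itself is always part of the sampled set
    return sorted(chosen | {current})

frames = gaussian_sample_frames(num_frames=100, current=50, count=8)
print(frames)  # indices clustered around frame 50
```

With the Gaussian centered on the current frame, the selected indices cluster around frame 50, giving denser sampling near the current frame and sparser sampling farther away.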
The constructed convolutional neural network model comprises a plurality of convolutional layers, where different convolutional layers correspond one-to-one with different scales in the scale space (the scale space formed by the group of first sampled images and the groups of second sampled images corresponding to the first sampled images); each convolutional layer processes at least a group of inter-frame sampled images at its corresponding scale, so that this group of inter-frame sampled images guides the current frame image in positioning the target to be tracked at that scale.
Further, referring to fig. 6, in a model processing procedure for the current frame image:
the input of the convolution neural network model first layer convolution layer is:
the group of first sampling images corresponding to the current frame image;
the inputs of other convolutional layers except the convolutional layer of the first layer of the convolutional neural network model are as follows:
and (c) stacking (concat) a group of sampling images of the corresponding scale and the characteristic images output by the convolution layer at the previous layer.
In the model's output information, the positioning result of the target at a given scale corresponding to the current frame image may include: positioning position information of at least one target to be tracked at that scale in the current frame image, and/or respective class labels of the at least one target to be tracked.
In implementation, optionally, the model may specifically output positioning windows of the target at each scale in the current frame image and the category labels respectively associated with those windows. The category labels may be "0", "1", etc., each indicating the corresponding category of the target, e.g., label "0" indicating "person" and label "1" indicating "football".
In the embodiment of the application, for a current frame image to be subjected to target tracking in an image sequence, inter-frame first sampling (e.g., inter-frame Gaussian sampling) is performed on the image sequence, and intra-frame second sampling (e.g., intra-frame pyramid multi-scale sampling) is performed on each first sampled image obtained by that sampling. In this way, the highly correlated inter-frame information of the current frame image in the video sequence can be extracted, which serves as better guidance for target positioning in the current frame image and improves positioning accuracy/precision. At the same time, sampling and extraction of weakly correlated information is avoided, reducing the amount of computation when analyzing the whole image sequence during target tracking and reducing the use of computing resources. In addition, the multi-scale distribution information of the target within the frame can also be obtained, so that situations in which the tracked target is easily lost because the receptive field of the convolution kernel in the convolutional neural network is fixed can be avoided.
Corresponding to the target tracking method described above, an embodiment of the present application further discloses a target tracking apparatus; referring to the schematic structural diagram of the target tracking apparatus shown in fig. 7, the apparatus may include:
a first determining unit 701, configured to determine a current frame image to be processed in an image sequence; the image sequence comprises a plurality of frames of images which are arranged in an imaging time sequence, and the image contents of at least partial images in the plurality of frames of images have correlation;
a first sampling unit 702, configured to perform a first sampling process on the image sequence to obtain a group of first sampled images corresponding to the current frame image; the sampling density of different positions in the image sequence is in an inverse correlation relation with the time intervals from the different positions to the current frame image;
the second sampling unit 703 is configured to perform second sampling processing on each first sampling image, so as to obtain a group of second sampling images, which respectively correspond to each first sampling image and include second sampling images of multiple scales;
a second determining unit 704, configured to determine, based on a single group of inter-frame sampling images with a same scale, a positioning result of the current frame image on the target to be tracked in the same scale in a scale space formed by the group of first sampling images and the groups of second sampling images corresponding to the first sampling images, so as to obtain positioning results of the target to be tracked in different scales corresponding to the current frame image;
the positioning unit 705 is configured to position the target to be tracked in the current frame image based on each positioning result of the target to be tracked in different scales corresponding to the current frame image.
In an optional implementation manner of the embodiment of the present application, the first sampling unit 702 is specifically configured to:
performing interframe sampling on the image sequence according to a Gaussian distribution mode;
in the interframe sampling of the Gaussian distribution mode, the current frame image corresponds to the center of Gaussian distribution, and the sampling density at different positions in the image sequence presents a Gaussian distribution state taking the current frame image as the center.
In an optional implementation manner of the embodiment of the present application, the second sampling unit 703 is specifically configured to:
performing down-sampling operation on each first sampling image for multiple times to obtain a plurality of second sampling images with different scales corresponding to each first sampling image;
and the second sampling images with different scales respectively correspond to different resolutions.
In an optional implementation manner of the embodiment of the present application, the second determining unit 704 is specifically configured to:
inputting the group of first sampling images and each group of second sampling images corresponding to each first sampling image into a pre-constructed convolutional neural network model comprising a plurality of convolutional layers; different convolutional layers correspond to different scales in the scale space one by one; each convolution layer at least processes a group of inter-frame sampling images under the corresponding scale, so that the group of inter-frame sampling images are used for guiding the current frame image to position the target to be tracked under the corresponding scale;
acquiring output information of the convolutional neural network model;
the output information comprises positioning results of the target to be tracked under different scales corresponding to the current frame image; the positioning result of the target to be tracked under a scale corresponding to the current frame image comprises the following steps: and the at least one target to be tracked corresponds to the positioning position information of the scale in the current frame image and/or the respective class label of the at least one target to be tracked.
In an alternative implementation of the embodiments of the present application:
inputting a first layer of convolutional layers in the convolutional neural network model into the group of first sampling images;
the inputs of other convolutional layers except the first convolutional layer in the convolutional neural network model are: and stacking a group of sampling images of the corresponding scale and the characteristic images output by the convolution layer on the previous layer.
In an optional implementation manner of the embodiment of the present application, the positioning unit 705 is specifically configured to:
and carrying out non-maximum suppression processing on positioning results of the target to be tracked under different scales corresponding to the current frame image so as to position the target to be tracked in the current frame image.
Since the target tracking device disclosed in the embodiment of the present application corresponds to the target tracking method disclosed in the corresponding method embodiment above, its description is relatively simple; for relevant similarities, refer to the description of the target tracking method in the method embodiment above, which is not repeated here.
The embodiment of the application also discloses an electronic device, which may be, but is not limited to, a terminal device with a computer vision processing function such as a mobile phone, a tablet computer, or a personal computer (e.g., a notebook, an all-in-one machine, or a desktop), or a physical machine corresponding to a private/public cloud platform, a server (an image processing server), or the like with a computer vision processing function.
The composition structure of the electronic device is shown in fig. 8, and at least includes:
a memory 801 for storing a set of computer instructions;
the set of computer instructions may be embodied in the form of a computer program.
The memory 801 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
A processor 802 for implementing the target tracking method as disclosed in the above method embodiments by executing the set of instructions stored in the memory.
The processor 802 may be a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic device, etc.
Besides, the electronic device may further include a communication interface, a communication bus, and the like. The memory, the processor and the communication interface communicate with each other via a communication bus.
The communication interface is used for communication between the electronic device and other devices. The communication bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like, and may be divided into an address bus, a data bus, a control bus, and the like.
In this embodiment, when the processor in the electronic device executes the computer instruction set stored in the memory to perform target tracking on the current frame image in an image sequence, inter-frame sampling in which the sampling density at different positions is inversely related to the time interval from those positions to the current frame image is performed on the image sequence, extracting highly correlated inter-frame information for the current frame image; this highly correlated inter-frame information can guide or assist the current frame image in positioning the target more accurately/precisely. Each inter-frame image obtained by this sampling is further sampled at multiple scales within the frame, so that the multi-scale distribution information of the target within the frame can be obtained; this improves the understanding of the single image frame and overcomes situations in which the tracked target is easily lost because, for example, the receptive field of the convolutional neural network is fixed. By combining the highly correlated inter-frame information with the multi-scale intra-frame distribution information, the accuracy/precision of the target tracking task can at least be improved, and loss of the tracked target can be avoided.
In addition, the present application discloses a computer-readable storage medium, in which a set of computer instructions is stored, and when the set of computer instructions is executed by a processor, the target tracking method disclosed in the above method embodiment is implemented.
When the instructions stored in the computer-readable storage medium run, for a current frame image in an image sequence on which a target is to be tracked and positioned, inter-frame sampling in which the sampling density at different positions is inversely related to the time interval from those positions to the current frame image is performed on the image sequence, extracting highly correlated inter-frame information for the current frame image; this highly correlated inter-frame information can guide or assist the current frame image in positioning the target more accurately/precisely. Each inter-frame image obtained by this sampling is further sampled at multiple scales within the frame, so that the multi-scale distribution information of the target within the frame can be obtained; this improves the understanding of the single image frame and overcomes situations in which the tracked target is easily lost because, for example, the receptive field of the convolutional neural network is fixed. By combining the highly correlated inter-frame information with the multi-scale intra-frame distribution information, the accuracy/precision of the target tracking task can at least be improved, and loss of the tracked target can be avoided.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
For convenience of description, the above system or apparatus is described as being divided into various modules or units by function, respectively. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
Finally, it is further noted that, herein, relational terms such as first, second, third, fourth, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (13)

1. A target tracking method, comprising:
determining a current frame image to be processed in an image sequence; the image sequence comprises a plurality of frames of images which are arranged in an imaging time sequence, and the image contents of at least partial images in the plurality of frames of images have correlation;
performing first sampling processing on the image sequence to obtain a group of first sampling images corresponding to the current frame image; the sampling density of different positions in the image sequence is in an inverse correlation relation with the time intervals from the different positions to the current frame image;
respectively carrying out second sampling processing on each first sampling image to obtain a group of second sampling images which respectively correspond to each first sampling image and comprise second sampling images with multiple scales;
in a scale space formed by the group of first sampling images and the group of second sampling images corresponding to the first sampling images, determining a positioning result of the current frame image on the target to be tracked under the same scale based on a single group of inter-frame sampling images with the same scale so as to obtain each positioning result of the target to be tracked under different scales corresponding to the current frame image;
and positioning the target to be tracked in the current frame image based on each positioning result of the target to be tracked under different scales corresponding to the current frame image.
2. The method of claim 1, wherein the performing the first sampling process on the image sequence to obtain a set of first sampled images corresponding to the current frame image comprises:
performing interframe sampling on the image sequence according to a Gaussian distribution mode;
in the interframe sampling of the Gaussian distribution mode, the current frame image corresponds to the center of Gaussian distribution, and the sampling density at different positions in the image sequence presents a Gaussian distribution state taking the current frame image as the center.
3. The method according to claim 1, wherein the performing the second sampling processing on each of the first sampled images respectively comprises:
performing down-sampling operation on each first sampling image for multiple times to obtain a plurality of second sampling images with different scales corresponding to each first sampling image;
and the second sampling images with different scales respectively correspond to different resolutions.
4. The method according to claim 1, wherein the determining, in a scale space formed by the set of first sampling images and the sets of second sampling images corresponding to the respective first sampling images, a positioning result of the current frame image to be tracked in the same scale based on a single set of inter-frame sampling images having the same scale to obtain respective positioning results of the target to be tracked in different scales corresponding to the current frame image comprises:
inputting the group of first sampling images and each group of second sampling images corresponding to each first sampling image into a pre-constructed convolutional neural network model comprising a plurality of convolutional layers; different convolutional layers correspond to different scales in the scale space one by one; each convolution layer at least processes a group of inter-frame sampling images under the corresponding scale, so that the group of inter-frame sampling images are used for guiding the current frame image to position the target to be tracked under the corresponding scale;
acquiring output information of the convolutional neural network model;
the output information comprises positioning results of the target to be tracked under different scales corresponding to the current frame image; the positioning result of the target to be tracked under a scale corresponding to the current frame image comprises the following steps: and the at least one target to be tracked corresponds to the positioning position information of the scale in the current frame image and/or the respective class label of the at least one target to be tracked.
5. The method of claim 4, wherein:
the input of a first layer of convolutional layers of the plurality of convolutional layers is the set of first sampled images;
the inputs for the other convolutional layers of the plurality of convolutional layers except the first convolutional layer are: and stacking a group of sampling images of the corresponding scale and the characteristic images output by the convolution layer on the previous layer.
6. The method according to claim 1, wherein the locating the target to be tracked in the current frame image based on the respective locating results of the target to be tracked under different scales corresponding to the current frame image comprises:
and carrying out non-maximum suppression processing on positioning results of the target to be tracked under different scales corresponding to the current frame image so as to position the target to be tracked in the current frame image.
7. An object tracking device, comprising:
the first determining unit is used for determining a current frame image to be processed in the image sequence; the image sequence comprises a plurality of frames of images which are arranged in an imaging time sequence, and the image contents of at least partial images in the plurality of frames of images have correlation;
the first sampling unit is used for carrying out first sampling processing on the image sequence to obtain a group of first sampling images corresponding to the current frame image; the sampling density of different positions in the image sequence is in an inverse correlation relation with the time intervals from the different positions to the current frame image;
the second sampling unit is used for respectively carrying out second sampling processing on each first sampling image to obtain a group of second sampling images which respectively correspond to each first sampling image and comprise second sampling images with multiple scales;
a second determining unit, configured to determine, based on a single group of inter-frame sampling images with a same scale, a positioning result of the current frame image on the target to be tracked in the same scale in a scale space formed by the group of first sampling images and the groups of second sampling images corresponding to the first sampling images, so as to obtain positioning results of the target to be tracked in different scales corresponding to the current frame image;
and the positioning unit is used for positioning the target to be tracked in the current frame image based on each positioning result of the target to be tracked under different scales corresponding to the current frame image.
8. The apparatus according to claim 7, wherein the first sampling unit is specifically configured to:
performing interframe sampling on the image sequence according to a Gaussian distribution mode;
in the interframe sampling of the Gaussian distribution mode, the current frame image corresponds to the center of Gaussian distribution, and the sampling density at different positions in the image sequence presents a Gaussian distribution state taking the current frame image as the center.
9. The apparatus according to claim 7, wherein the second sampling unit is specifically configured to:
performing down-sampling operation on each first sampling image for multiple times to obtain a plurality of second sampling images with different scales corresponding to each first sampling image;
and the second sampling images with different scales respectively correspond to different resolutions.
10. The apparatus according to claim 7, wherein the second determining unit is specifically configured to:
inputting the group of first sampled images and each group of second sampled images corresponding to each first sampled image into a pre-constructed convolutional neural network model comprising a plurality of convolutional layers; wherein different convolutional layers correspond one-to-one to different scales in the scale space, and each convolutional layer processes at least one group of inter-frame sampled images at its corresponding scale, so that the group of inter-frame sampled images guides positioning of the target to be tracked in the current frame image at that scale;
acquiring output information of the convolutional neural network model;
wherein the output information comprises the positioning results of the target to be tracked at different scales corresponding to the current frame image, and the positioning result of the target to be tracked at one scale corresponding to the current frame image comprises: positioning position information, at that scale, of at least one target to be tracked in the current frame image and/or a respective class label of the at least one target to be tracked.
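The per-scale output information recited above could be modeled with a simple data structure. This is a hypothetical sketch of the structure only, not the network itself; the class names and fields are assumptions mirroring the claim's wording (position information plus an optional class label per scale).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ScalePositioning:
    """Positioning result of a target to be tracked at one scale for the
    current frame image (hypothetical structure; names are illustrative)."""
    scale: int                          # index of the scale in the scale space
    box: Tuple[int, int, int, int]      # (x, y, w, h) position in the current frame
    class_label: Optional[str] = None   # class label is optional per the claim's "and/or"

@dataclass
class ModelOutput:
    """Output information of the model: one positioning result per scale."""
    results: List[ScalePositioning] = field(default_factory=list)
```

A downstream positioning unit would then fuse the entries of `results` across scales to locate the target in the current frame.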
11. The apparatus of claim 10, wherein:
the input of a first convolutional layer of the plurality of convolutional layers is the group of first sampled images;
the input of each convolutional layer of the plurality of convolutional layers other than the first convolutional layer is: a stack of a group of sampled images of the corresponding scale and the feature map output by the previous convolutional layer.
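The stacked input for a non-first convolutional layer might look like the following sketch. The channel-first layout and the concatenation axis are assumptions; the claims say only that the scale's sampled images and the previous layer's feature map are stacked.

```python
import numpy as np

def layer_input(scale_images, prev_features):
    """Build the input of a non-first conv layer: the group of sampled images
    at the layer's scale stacked channel-wise with the previous layer's
    feature map (illustrative; stacking axis is an assumption)."""
    # Both arrays are assumed channel-first: (channels, height, width),
    # with matching spatial dimensions at this scale.
    return np.concatenate([scale_images, prev_features], axis=0)

x = layer_input(np.zeros((3, 16, 16)), np.ones((8, 16, 16)))
```

The result simply has the channel counts added, so the next convolutional layer sees both the raw multi-frame evidence at its scale and the features propagated from the coarser layer.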
12. An electronic device, comprising:
a memory for storing a set of computer instructions;
a processor for implementing the target tracking method of any one of claims 1 to 6 by executing the set of computer instructions stored in the memory.
13. A computer-readable storage medium having stored therein a set of computer instructions which, when executed by a processor, implement the target tracking method of any one of claims 1-6.
CN202010986849.6A 2020-09-18 2020-09-18 Target tracking method and device, electronic equipment and storage medium Pending CN114202559A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010986849.6A CN114202559A (en) 2020-09-18 2020-09-18 Target tracking method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010986849.6A CN114202559A (en) 2020-09-18 2020-09-18 Target tracking method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114202559A true CN114202559A (en) 2022-03-18

Family

ID=80645019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010986849.6A Pending CN114202559A (en) 2020-09-18 2020-09-18 Target tracking method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114202559A (en)

Similar Documents

Publication Publication Date Title
WO2020238560A1 (en) Video target tracking method and apparatus, computer device and storage medium
CN110189285B (en) Multi-frame image fusion method and device
CN109272509B (en) Target detection method, device and equipment for continuous images and storage medium
CN111161311A (en) Visual multi-target tracking method and device based on deep learning
EP1975879A2 (en) Computer implemented method for tracking object in sequence of frames of video
CN108986152B (en) Foreign matter detection method and device based on difference image
CN111695421B (en) Image recognition method and device and electronic equipment
CN110176024B (en) Method, device, equipment and storage medium for detecting target in video
CN107221005B (en) Object detection method and device
CN110335313B (en) Audio acquisition equipment positioning method and device and speaker identification method and system
CN111383252B (en) Multi-camera target tracking method, system, device and storage medium
CN107564020B (en) Image area determination method and device
CN110210480B (en) Character recognition method and device, electronic equipment and computer readable storage medium
CN111582032A (en) Pedestrian detection method and device, terminal equipment and storage medium
CN111723634A (en) Image detection method and device, electronic equipment and storage medium
CN110610123A (en) Multi-target vehicle detection method and device, electronic equipment and storage medium
CN112037287A (en) Camera calibration method, electronic device and storage medium
CN111914908A (en) Image recognition model training method, image recognition method and related equipment
CN111144425B (en) Method and device for detecting shot screen picture, electronic equipment and storage medium
CN107578053B (en) Contour extraction method and device, computer device and readable storage medium
CN108960247B (en) Image significance detection method and device and electronic equipment
CN111476812A (en) Map segmentation method and device, pose estimation method and equipment terminal
CN108764343B (en) Method for positioning tracking target frame in tracking algorithm
CN111753775B (en) Fish growth assessment method, device, equipment and storage medium
CN106846366B (en) TLD video moving object tracking method using GPU hardware

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination