CN111223128A - Target tracking method, device, equipment and storage medium

Target tracking method, device, equipment and storage medium

Info

Publication number
CN111223128A
Authority
CN
China
Prior art keywords
target
model
layer
image
prediction
Prior art date
Legal status
Pending
Application number
CN202010051488.6A
Other languages
Chinese (zh)
Inventor
曹文明
陈学军
何志权
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202010051488.6A priority Critical patent/CN111223128A/en
Publication of CN111223128A publication Critical patent/CN111223128A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present application is applicable to the technical field of target tracking and provides a target tracking method, apparatus, device and storage medium.

Description

Target tracking method, device, equipment and storage medium
Technical Field
The present application belongs to the field of target tracking technologies, and in particular, to a target tracking method, apparatus, device, and storage medium.
Background
Currently, for consecutive video frame images, the position of the same target is tracked continuously across different video frame images. Because the target, other objects and the image background in the video frame images are similar and difficult to distinguish, several candidate target positions belonging to similar objects are found in the next frame image according to the target features of the target in the previous frame image, and the specific position of the target in the next frame image ultimately cannot be accurately determined among these candidate positions.
In summary, there is a problem that the accuracy is low when the same target object is tracked in consecutive video frame images.
Disclosure of Invention
The embodiments of the application provide a target tracking method, apparatus, device and storage medium, which can solve the current problem of low accuracy when the same target object is tracked in consecutive video frame images.
In a first aspect, an embodiment of the present application provides a target tracking method, including:
acquiring the initial position of a target in the previous frame of input image;
inputting the initial position into a first model to obtain each predicted position of the target in the current input image;
obtaining a prediction image corresponding to each prediction position in the current input image, and inputting each prediction image into a second model to obtain a classification result of each prediction image and the target;
and determining the target position of the target in the current input image according to the classification result.
In an embodiment, the inputting the initial position into the first model to obtain each predicted position of the target in the current input image includes:
acquiring the length and the width of the target in the previous frame of input image;
generating a target matrix according to the length and the width;
and acquiring each predicted position of the target in the current input image through Gaussian distribution according to the target matrix.
In an embodiment, the second model comprises a multi-layer network structure between an input layer and an output layer, wherein the first layer comprises a first convolution layer, a first activation layer, a first local normalization layer, a first max-pooling layer; the second layer comprises a second convolution layer, a second activation layer, a second batch normalization layer and a second maximum pooling layer; the third layer comprises a third convolution layer, a third activation layer and a channel attention layer; the fourth layer comprises a first full connection layer and a fourth activation layer; the fifth layer comprises a random inactivation layer, a second full-connection layer and a fifth activation layer; the sixth layer is a classification layer.
In one embodiment, the processing of the channel attention layer includes:
acquiring a second input characteristic output by the second activation layer;
executing first processing on the second input features to obtain first output features, and executing second processing on the second input features to obtain second output features;
combining the first output characteristic with the second output characteristic to generate a third output characteristic;
and combining the third output characteristic with the second input characteristic to generate a target characteristic.
In one embodiment, the training of the second model comprises:
acquiring first training data, wherein the first training data comprises input images of known classes and first input features of the input images;
inputting the first input feature into an initial second model for propagation to obtain the prediction category of the input image;
generating a first prediction loss according to the prediction category and the known category, and iteratively updating the model parameters of the initial second model according to the first prediction loss;
if the first prediction loss is not converged in the iterative updating process, adjusting model parameters of the initial second model, returning to execute the step of inputting the first training data into the initial second model for training to obtain the prediction loss generated by the prediction type and the known type and the subsequent steps;
and if the first prediction loss is converged in the iterative updating process, finishing training the initial second model, and taking the current initial second model as the trained second model.
In an embodiment, after the current initial second model is taken as the trained second model, the method further includes:
acquiring a network structure of the second model and model parameters thereof;
acquiring second training data, wherein the second training data comprises the initial position of a target in the previous frame of input image and the target feature of the target in the current input image;
inputting the initial position into a first model to obtain each predicted position of a target in the current input image;
obtaining a prediction image corresponding to each prediction position in the current input image and the image characteristics of each prediction image;
inputting each image characteristic into a network structure of a second model to obtain a prediction result of each predicted image and a target;
and obtaining a predicted image corresponding to the optimal prediction result and the target to generate a second prediction loss, iteratively updating the model parameters of the second model again according to the second prediction loss, and finishing training the second model if the second prediction loss is converged in the iterative updating process.
In an embodiment, the second model comprises at least one fully connected layer;
iteratively updating the model parameters of the second model again according to the second predicted loss comprises:
and updating the model parameters of all the fully-connected layers in the second model, and keeping the model parameters of the rest layers unchanged.
In a second aspect, an embodiment of the present application provides a target tracking apparatus, including:
the first acquisition module is used for acquiring the initial position of the target in the previous frame of input image;
the first input module is used for inputting the initial position into a first model to obtain a plurality of predicted positions of the target in the current input image;
the first processing module is used for acquiring a predicted image corresponding to each predicted position in the current input image, and inputting the predicted image into a second model to obtain a classification result of each predicted image and the target;
and the determining module is used for determining the target position of the target in the current input image according to the classification result.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor, when executing the computer program, implements the object tracking method according to any one of the above first aspects.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program is executed by a processor to perform the target tracking method according to any one of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to execute the target tracking method according to any one of the above first aspects.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Compared with the prior art, the embodiments of the application have the following advantages: according to the position of the target in the previous frame of input image, the first model estimates a plurality of positions where the target may appear in the current frame of input image, and the remaining positions whose image background is merely similar to the target are discarded; the predicted images in the input image corresponding to these positions are then input into the second model, the trained second model outputs a classification result for each predicted image with respect to the target so that the optimal predicted image can be determined, and the current optimal predicted image is taken as the target position of the target in the current frame of input image, thereby improving the accuracy of tracking the same target object in consecutive video frame images.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic flowchart of an implementation of a target tracking method provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of another implementation of the target tracking method provided in the embodiment of the present application;
fig. 3 is a schematic flow chart illustrating an implementation flow of a channel attention layer in a target tracking method provided in an embodiment of the present application;
FIG. 4 is a schematic flowchart of another implementation of a target tracking method provided in an embodiment of the present application;
FIG. 5 is a schematic flowchart of another implementation of the target tracking method according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a target tracking device according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to" determining "or" in response to detecting ". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
The target tracking method provided by the embodiment of the present application may be applied to terminal devices such as a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, and a Personal Digital Assistant (PDA), and the specific type of the terminal device is not limited in the embodiment of the present application.
Referring to fig. 1, fig. 1 is a schematic flow chart of a target tracking method according to an embodiment of the present disclosure; the details are as follows:
s101, acquiring the initial position of the target in the input image of the previous frame.
In application, the terminal device may store continuous video frame images in advance, or receive the continuous video frame images sent by the server through the network. The target may be a target object to be tracked in a video frame image, and the obtaining manner includes, but is not limited to, inputting an initial position of the target in the terminal device by a user, or obtaining the initial position of the target by the terminal device according to a preset rule. The preset rule may be that the terminal device stores in advance a target feature input by a user and an image region where a target may appear, and then obtains features of each region in the input image and compares the features with the target feature one by one, and obtains an initial position of the target without limitation.
In application, the initial position of the target may be specifically expressed by two-dimensional coordinates, for example, a two-dimensional space coordinate system is constructed by using a central point in the input image as a coordinate origin, and the initial position of the target is expressed by the two-dimensional space coordinate system. If the target occupies a region in the input image, the central point of the region image can be used as the initial position of the target.
And S102, inputting the initial position into a first model to obtain each predicted position of the target in the current input image.
In application, the first model is a probability model composed of a Gaussian distribution function. The Gaussian distribution function model introduces a latent variable (the initial position) through a Gaussian process to predict the distribution of the target in the next frame of image. When training the Gaussian distribution function, the hyper-parameters of the Gaussian distribution function may be trained by maximum likelihood estimation, which is not limited here. The Gaussian distribution function has a mean function and a covariance function. In order to reduce the data variables in the Gaussian distribution function and avoid a dimension explosion during training, in this embodiment the mean corresponding to the mean function is defined as the initial position of the target in the previous frame, and the covariance function is a diagonal matrix determined by the initial position of the previous frame, specifically by the length and width of the area occupied by the target in the previous frame of input image, for example diag(0.09a², 0.09b², 0.25), where a is the length and b is the width of the area occupied by the target in the previous frame of input image. Each predicted position obtained from the initial position is a coordinate point, and the area of the predicted position can be centered on that coordinate point; the length and width of this area may be equal to or different from the length and width of the area occupied by the target in the previous frame, which is not limited here.
S103, obtaining a prediction image corresponding to each prediction position in the current input image, and inputting each prediction image into a second model to obtain a classification result of each prediction image and the target.
In application, the predicted image is an image of a predicted position region. After the target prediction position of the current frame input image is obtained according to the initial position of the target of the previous frame, the image in the area of the current position is obtained, whether the predicted image is consistent with the target or not is sequentially judged through the second model, and a classification result is determined, wherein the classification result can be the similarity between each predicted image and the target. The target can be a target in the first frame of input image, or a template in the previous frame of input image, and a classification result is obtained by comparing the target feature of the target with the target feature of the input image.
Illustratively, the target is a target of a previous frame of input image, the terminal device obtains a target feature of the target of the previous frame, where the target feature may be represented by a 64-bit hash sequence code, the terminal device processes each predicted image according to the trained second model calculation to obtain an image feature of each predicted image, or may also be a 64-bit hash sequence code, and performs matching according to the image feature of the predicted image and the target feature, and obtains a predicted image with the highest matching degree as a classification result according to the hash sequence code, for example, a ratio of the same number of the hash sequence codes at the same position to the total number of the hash sequence codes is determined as the matching degree.
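As a rough illustration of the hash-based matching described above, the following Python sketch computes the matching degree of two 64-bit hash sequence codes as the ratio of positions holding the same value to the total number of positions; the function and variable names are hypothetical and not part of the patent.

```python
def matching_degree(target_hash: str, predicted_hash: str) -> float:
    """Ratio of identical bits at the same position to the total number of bits."""
    assert len(target_hash) == len(predicted_hash) == 64
    same = sum(1 for a, b in zip(target_hash, predicted_hash) if a == b)
    return same / len(target_hash)

# Example: pick the predicted image whose hash best matches the previous-frame target's hash.
target_code = "0110" * 16                       # 64-bit hash code of the previous-frame target
candidates = {"pred_1": "0110" * 16, "pred_2": "1001" * 16}
best = max(candidates, key=lambda k: matching_degree(target_code, candidates[k]))
```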
And S104, determining the target position of the target in the current input image according to the classification result.
In application, the classification result is the similarity between each predicted image and the target, so that a predicted image corresponding to the optimal similarity can be obtained according to the similarities between the predicted images and the target, and the predicted image is taken as the target position of the target in the current input image.
In this embodiment, a plurality of positions where a target may exist in a current frame input image are estimated according to a first model after the position of the target in a previous frame input image is passed, the remaining positions of the target in the current frame input image, which are similar to the image background of the target, are removed, predicted images in the input images corresponding to the plurality of positions are input into a second model, classification results of the predicted images and the target are obtained according to the trained second model to determine an optimal predicted image, and then the current optimal predicted image is determined to be the target position of the target in the current frame input image, so that the accuracy when the same target object is tracked in continuous video frame images is improved.
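A minimal Python sketch of the overall S101-S104 flow is given below. The helper callables for the first model (candidate sampling), image cropping and the trained second model are hypothetical placeholders; this is an illustrative reading of the method, not the patented implementation itself.

```python
def track(frames, initial_position, sample_positions, crop, classify):
    """frames: list of video frame images; all helper callables are hypothetical."""
    position = initial_position                           # S101: target position in the previous frame
    trajectory = [position]
    for frame in frames[1:]:
        candidates = sample_positions(position, frame)    # S102: predicted positions (first model)
        patches = [crop(frame, c) for c in candidates]    # S103: predicted images
        scores = [classify(p) for p in patches]           #        classification results (second model)
        position = candidates[scores.index(max(scores))]  # S104: best-classified position wins
        trajectory.append(position)
    return trajectory
```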
Referring to fig. 2, in an embodiment, S102 includes:
s201, acquiring the length and the width of the target in the previous frame of input image.
In application, the terminal device acquires the initial position of the target in the previous frame of input image; this initial position should include the length and width of the area occupied by the target in the previous frame of input image, as well as an initial position point that can represent the area occupied by the target. A diagonal matrix can be constructed from the length and width of the area occupied by the target in the previous frame, so that the length and width of the predicted area occupied by the target position in the current input image change correspondingly during the prediction performed by the first model (the Gaussian distribution function).
And S202, generating a target matrix according to the length and the width.
In application, the target matrix is a diagonal matrix constructed from the length and width of the area occupied by the target in the previous frame of input image, namely diag(0.09a², 0.09b², 0.25); when the length and width of the area occupied by the target in each frame of the input image change, the diagonal matrix changes accordingly. For example, after the target position and the length and width of the occupied area in the previous frame of input image are input into the first model, it can predict N = 256 predicted positions in the current input image, and the length and width of each predicted position in the current input image can be changed by multiplying by 1.05^Si, where Si represents the length and width of the area occupied by the target in the i-th frame of input image, which is not limited here.
S203, obtaining each predicted position of the target in the current input image through Gaussian distribution according to the target matrix.
In application, the target matrix is a diagonal matrix, and the gaussian distribution is a trained first model. The step of acquiring the respective predicted positions in the current input image is identical to S102, and will not be described in detail.
In this embodiment, the predicted position of the target in the current input image is predicted by the target position, the length of the occupied area and the width of the occupied area of the previous frame of input image and combining the gaussian distribution function, so that the terminal device can remove the rest positions of the target in the current frame of input image, which are similar to the image background of the target, retain the predicted position of the target in the current input image, which is most likely to exist, improve the accuracy of target tracking, and reduce the matching time of the target in the current input image.
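A minimal sketch of S201-S203 follows, under stated assumptions: the mean of the Gaussian is the previous target centre, the covariance is the diagonal target matrix diag(0.09a², 0.09b², 0.25), N = 256 candidates are drawn, and the third sampled dimension is read as a scale variable applied as 1.05^s to the previous target size; the patent does not fully specify that last point, so this parameterisation is illustrative.

```python
import numpy as np

def sample_candidates(cx, cy, a, b, n=256, seed=0):
    """cx, cy: previous target centre; a, b: previous target length and width."""
    rng = np.random.default_rng(seed)
    mean = np.array([cx, cy, 0.0])
    cov = np.diag([0.09 * a ** 2, 0.09 * b ** 2, 0.25])   # target matrix built from a and b
    samples = rng.multivariate_normal(mean, cov, size=n)  # Gaussian-distributed predicted positions
    candidates = []
    for x, y, s in samples:
        scale = 1.05 ** s                                  # assumed per-frame scale change
        candidates.append((x, y, a * scale, b * scale))    # centre plus candidate length and width
    return candidates

# Example: 256 candidate positions around a target last seen at (120, 80) with size 40 x 60.
positions = sample_candidates(120, 80, 40, 60)
```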
In an embodiment, the second model comprises a multi-layer network structure between an input layer and an output layer, wherein the first layer comprises a first convolution layer, a first activation layer, a first local normalization layer, a first max-pooling layer; the second layer comprises a second convolution layer, a second activation layer, a second batch normalization layer and a second maximum pooling layer; the third layer comprises a third convolution layer, a third activation layer and a channel attention layer; the fourth layer comprises a first full connection layer and a fourth activation layer; the fifth layer comprises a random inactivation layer, a second full-connection layer and a fifth activation layer; the sixth layer is a classification layer.
In application, the second model is specifically a channel attention deformable tracker model, and the purpose is to train the channel attention deformable tracker model so that it can determine which known class the current input image belongs to according to the image features of the current input image. The target feature of the target in the previous frame of input image is compared and classified against the image features of the predicted positions in the current input image, so as to judge which predicted position's image feature is most similar to the target feature of the previous frame.
In a specific application, the terminal device inputs an image with a pixel size of 107 × 107 × 3 at the input layer, where 107 × 107 is the input image size and 3 is the number of channels of the input image, the three channels being Red, Green and Blue (RGB). In the convolution layer of the first layer, a convolution operation is performed on the 107 × 107 × 3 image; the convolution kernel may be 7 × 7 with a convolution stride of 2, generating a 96-dimensional feature map. The 96-dimensional feature map is then input into the activation layer, whose activation function is the ReLU function; it maps the feature map to a high-dimensional nonlinear interval so that the main features in the feature map are retained, the number of parameters is reduced, and overfitting during training of the second model is prevented. The first local normalization layer performs local response normalization: during back propagation of the neural network, the gradient of each layer is obtained by multiplying the data (feature map) of that layer with the gradient passed down from the upper layer, so normalizing the data of the current layer to zero mean prevents a gradient explosion when it is subsequently multiplied with the gradient passed down from the upper layer. The first max-pooling layer selects the feature with the largest value in each local feature map to form a new feature map, so as to reduce the error of the current feature extraction. Each feature map is 107 × 107 and there are 96 dimensions; the feature with the largest value can be selected within each 3 × 3 local feature map with a selection stride of 2, forming a new 51 × 51 × 96 feature map as the output. Correspondingly, in the second layer, the 51 × 51 × 96 feature map output by the first layer is taken as the input and a convolution operation is performed on it; the convolution kernel may be 5 × 5 with a convolution stride of 2, generating a 256-dimensional feature map. The 256-dimensional feature map is input into the activation layer, whose activation function is also the ReLU function. Second batch normalization, namely group normalization, is then performed on the feature maps: the 256-dimensional feature maps are divided into groups, each group containing a certain number of dimensions, and each group is processed while the total number of dimensions remains unchanged, so that the error introduced by the current feature extraction is reduced evenly across the groups. The feature map is then input into the second max-pooling layer for max pooling, and an 11 × 11 × 256 feature map is output, which will not be described in detail.
In the third layer, similarly, the 11 × 11 × 256 feature map output by the second layer is taken as the input and a convolution operation is performed on it; the convolution kernel may be 3 × 3 with a convolution stride of 1, generating a 512-dimensional feature map. The 512-dimensional feature map is then input into the activation layer, whose activation function is also the ReLU function, to obtain a 512 × 3 × 3 feature map. The obtained feature map is then input into the channel attention layer, which enhances the image corresponding to the current feature map in terms of color difference, sharpening, brightness and the like, so that the processed feature map expresses the features of the input image more easily, and a 4608 × 1-dimensional feature map is output. The fourth layer comprises the first fully connected layer and the fourth activation layer, and compresses the 4608 × 1-dimensional feature map into a 512 × 1-dimensional feature map. The fifth layer comprises the random deactivation (dropout) layer, which randomly zeroes the weights of part of the hidden units to prevent overfitting; the feature map then passes through the second fully connected layer and the fifth activation layer to obtain the processed 512 × 1-dimensional feature map. The last layer is the classification layer, which internally comprises a fully connected layer that integrates the 512 × 1-dimensional features into an output value, namely reduces the 512 × 1-dimensional feature map to a 2 × 1-dimensional feature map, i.e. a 2 × 1 matrix; the values calculated from this matrix are the probabilities used to classify the current image. For example, part of the features are summed to output a final 2 × 1 matrix, each number representing the probability (or score) of the respective class, and the class with the highest probability is taken as the classification result of the second model.
In this embodiment, a plurality of normalization layers and a maximum pooling layer are arranged in a multilayer network structure in the second model, and normalization processing and maximum pooling processing are performed on the image features of the input image, so that errors in classification judgment according to the extracted image features are reduced for the trained second model.
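The following PyTorch sketch of the six-layer structure is given for illustration only, under these assumptions: the second max-pooling layer also uses a 3 × 3 window with stride 2 (not stated explicitly in the text), and the channel attention layer is left as a placeholder here (a sketch of it follows the channel attention section below). With these settings a 107 × 107 × 3 input reduces to 512 × 3 × 3 = 4608 features before the fully connected layers, consistent with the description.

```python
import torch
import torch.nn as nn

class SecondModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(                      # conv 7x7/2 -> ReLU -> LRN -> max pool
            nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(),
            nn.LocalResponseNorm(5), nn.MaxPool2d(kernel_size=3, stride=2))
        self.layer2 = nn.Sequential(                      # conv 5x5/2 -> ReLU -> group norm -> max pool
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
            nn.GroupNorm(32, 256), nn.MaxPool2d(kernel_size=3, stride=2))
        self.layer3 = nn.Sequential(                      # conv 3x3/1 -> ReLU -> channel attention
            nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(),
            nn.Identity())                                # placeholder for the channel attention layer
        self.layer4 = nn.Sequential(nn.Linear(512 * 3 * 3, 512), nn.ReLU())
        self.layer5 = nn.Sequential(nn.Dropout(0.5), nn.Linear(512, 512), nn.ReLU())
        self.layer6 = nn.Linear(512, 2)                   # classification layer (two classes)

    def forward(self, x):                                 # x: (batch, 3, 107, 107)
        x = self.layer3(self.layer2(self.layer1(x)))
        x = x.flatten(1)                                  # 512 x 3 x 3 -> 4608-dimensional feature
        return self.layer6(self.layer5(self.layer4(x)))

scores = SecondModel()(torch.randn(1, 3, 107, 107))       # -> tensor of shape (1, 2)
```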
Referring to fig. 3, in an embodiment, the processing of the channel attention layer includes:
s301, acquiring a second input characteristic output by the second activation layer.
In application, the second input feature is obtained by processing the current input image through the first layer, the second layer, and the third convolution layer and third activation layer of the third layer in the second model, i.e. the 512 × 3 × 3-dimensional feature map.
S302, executing first processing on the second input characteristic to obtain a first output characteristic, and executing second processing on the second input characteristic to obtain a second output characteristic.
In the application, the first processing is pooling processing for performing pooling operation on the 512 × 3 × 3 dimensional feature map and outputting the 512 × 1 feature map. Specifically, for 512 3 × 3 feature maps, the numerical values of each 3 × 3 feature map are subjected to superposition averaging to obtain one-dimensional feature values of the 3 × 3 feature map, and all the feature maps are subjected to pooling processing to obtain a feature map composed of 512 one-dimensional feature values, that is, the first output feature.
In the application, the second processing is a maximum pooling operation performed on the 512 × 3 × 3-dimensional feature maps, outputting a 512 × 1 feature map. Specifically, for the 512 feature maps of size 3 × 3, the maximum value among the values of each 3 × 3 feature map is selected as the one-dimensional feature value of that feature map, and the feature map composed of the 512 one-dimensional feature values obtained by max pooling all the feature maps is the second output feature. In application, the first processing and the second processing are processing operations performed on the second input feature in parallel, that is, the first output feature and the second output feature are generated simultaneously.
And S303, combining the first output characteristic and the second output characteristic to generate a third output characteristic.
In application, for 512 one-dimensional feature values in the first output feature and 512 one-dimensional feature values in the second output feature, corresponding one-dimensional feature values may be subjected to superposition processing, and new 512 one-dimensional feature values are generated by combination to form a new third output feature.
And S304, combining the third output characteristic with the second input characteristic to generate a target characteristic.
In application, the target feature is the feature obtained after the processing of the channel attention layer; this processing enhances the image features of the input image through corresponding characteristics such as color difference, sharpening and brightness, so that the generated target feature better represents the image features of the input image. Combining the third output feature with the second input feature may be, but is not limited to, performing a matrix dot product operation on the third output feature and the second input feature to capture the autocorrelation within the features. Illustratively, if the second input feature is F ∈ R^(C×H×W), the channel attention layer sequentially calculates the one-dimensional first output feature and second output feature, both in R^(C×1×1), and the target feature is

F* = F ⊙ M_C(F),

where ⊙ denotes element-by-element dot multiplication, F* is the calculated output, i.e. the target feature, and M_C(F) is the third output feature generated by combining the first output feature with the second output feature. The pooling and max-pooling processes can aggregate the spatial information within each feature map, thereby generating target features that are more representative of the input image.
In application, the channel attention layer further comprises a third full connection layer, a sixth activation layer, a fourth full connection layer and a seventh activation layer, wherein the third full connection layer is used for compressing the generated third output characteristic to obtain a compressed characteristic diagram. For example, the third output feature is a 512 × 1 feature map, the feature map is output as a 32 × 1 feature map after compression processing, then the feature map is processed by a sixth activation layer to prevent overfitting, the activation function is a linear rectification function, then reconstruction processing is performed by a fourth full connection layer, that is, the 32 × 1 feature map is reconstructed back to the 512 × 1 feature map, and then the feature map is input into a seventh activation layer to be processed to retain main feature parameters.
In this embodiment, a first output feature is obtained by performing a first process on a feature map of an input attention layer, a second output feature is obtained by performing a second process at the same time, the first output feature and the second output feature are combined to generate a third output feature, and a target feature is generated by combining the second input feature, so that an image corresponding to a current feature map is subjected to enhancement processes such as color difference, sharpening, brightness, and the like, so that the feature map obtained after the process is easier to express features of the input image.
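A sketch of the channel attention processing of S301-S304 is shown below, assuming that the first processing is per-channel average pooling, the second processing is per-channel max pooling, the combined 512 × 1 feature is passed through the 512→32→512 fully connected layers, and the seventh activation is a sigmoid (the text does not name its type); these assumptions are marked in the comments.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels=512, reduced=32):
        super().__init__()
        self.fc3 = nn.Linear(channels, reduced)   # third fully connected layer (compression 512 -> 32)
        self.act6 = nn.ReLU()                     # sixth activation layer
        self.fc4 = nn.Linear(reduced, channels)   # fourth fully connected layer (reconstruction 32 -> 512)
        self.act7 = nn.Sigmoid()                  # seventh activation layer (type assumed)

    def forward(self, f):                         # f: second input feature, shape (N, C, H, W)
        avg = f.mean(dim=(2, 3))                  # S302 first processing  -> first output feature
        mx = f.amax(dim=(2, 3))                   # S302 second processing -> second output feature
        third = avg + mx                          # S303: third output feature
        mc = self.act7(self.fc4(self.act6(self.fc3(third))))   # M_C(F), shape (N, C)
        return f * mc[:, :, None, None]           # S304: F* = F (element-wise) M_C(F)

target_feature = ChannelAttention()(torch.randn(1, 512, 3, 3))  # -> shape (1, 512, 3, 3)
```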
Referring to fig. 4, in an embodiment, the training step of the second model includes:
s401, obtaining first training data, wherein the first training data comprises input images of known types and first input features of the input images.
In an application, the first training data comprises an input image of a known class and a first input feature of the input image. The first training data comprise positive sample training data and negative sample training data in the actual training process, and the category of the input image and the first input feature of the input image are both used as known first training data and input into the second model for training.
S402, inputting the first input feature into an initial second model for propagation to obtain the prediction category of the input image.
In application, one propagation consists of processing the first input feature sequentially through the input layer, the intermediate multi-layer network structure and the output layer of the initial second model; after the propagation is finished, the prediction category of a given input image can be obtained. The method comprises initializing the initial second model by setting initial values of the model parameters in the initial second model, giving random values to the learning parameters and bias parameters between each layer of the network structure in the initial second model, inputting the first training data into the input layer, and calculating, according to the learning parameters corresponding to each layer of the network structure, the prediction category of the current input image output by the output layer.
And S403, generating a first prediction loss according to the prediction category and the known category, and iteratively updating the model parameters of the initial second model according to the first prediction loss.
In the application, the first prediction loss of the output layer is obtained from the obtained prediction category and the known category, and back propagation is performed based on the first prediction loss to update the initial values of the model parameters in the initial second model. The first prediction loss formula may be Loss_cls = (1/m) · Σ_{i=1}^{m} (y'_i − y_i)², where m is the number of input images with known classification results, y'_i is the prediction category of the i-th input image, and y_i is the known category of the i-th input image.
S404, if the first prediction loss is not converged in the iterative updating process, adjusting model parameters of the initial second model, returning to execute the step of inputting the first training data into the initial second model for training to obtain the prediction type and the step of generating the prediction loss by the known type, and the subsequent steps.
S405, if the first prediction loss is converged in the iterative updating process, finishing training the initial second model, and taking the current initial second model as the trained second model.
In application, the convergence of the initial second model can be determined from the first prediction loss obtained during the iterative updating process. Specifically, when the first prediction loss is smaller than a preset value, or when the obtained values remain unchanged after a certain number of iterations, the initial second model is determined to have converged. Otherwise, the new model parameters obtained by training are used to update the original model parameters in the initial second model, the first training data are input again for training, and the training steps S402-S405 are repeated. Updating the original model parameters in the initial second model during the propagation training of each iteration is the iterative updating.
In this embodiment, the second model is initialized, the input image of the known category and the first input feature of the input image are trained to obtain the first prediction loss, and the second model is updated according to the first prediction loss, so that the accuracy of image classification in the trained second model is improved.
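A sketch of the S401-S405 loop is given below, assuming the squared-error first prediction loss quoted above and treating convergence as the loss falling below a preset value; the data loader, the optimiser choice and the label encoding are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

def train_initial_second_model(model, loader, lr=1e-4, threshold=1e-3, max_steps=10_000):
    """loader yields (first_input_features, known_classes); labels are float targets (e.g. one-hot)."""
    optimiser = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.MSELoss()                          # Loss_cls = (1/m) * sum_i (y'_i - y_i)^2
    for step, (features, known_classes) in enumerate(loader):
        predictions = model(features)                 # S402: forward propagation
        loss = criterion(predictions, known_classes)  # S403: first prediction loss
        optimiser.zero_grad()
        loss.backward()                               # iteratively update the model parameters
        optimiser.step()
        if loss.item() < threshold or step >= max_steps:   # S404/S405: convergence check
            break
    return model
```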
Referring to fig. 5, in an embodiment, after taking the current initial second model as the trained second model, the method further includes:
s501, obtaining the network structure of the second model and the model parameters thereof.
In application, the obtained second model is trained on whole input images and their image features, and is applicable to target tracking of consecutive video frame images. Specifically, after each predicted position of the target in the current input image has been predicted from the initial position of the target in the previous frame of input image, the trained second model can be used to classify and compare the predicted images corresponding to the predicted positions, obtaining the classification result of the predicted images with respect to the target and thereby realizing target tracking across consecutive video frame images. However, because the second model was trained on the first training data of whole input images, the model parameters in the current second model need to be fine-tuned and the model needs to be trained again. The network structure of the second model is the six-layer network structure described above, and the model parameters include, but are not limited to, the learning parameters (weights) and bias parameters in each layer of the network structure.
S502, second training data are obtained, wherein the second training data comprise the initial position of the target in the previous frame of input image and the target feature of the target in the current input image.
S503, inputting the initial position into a first model to obtain each predicted position of the target in the current input image.
S504, obtaining a prediction image corresponding to each prediction position in the current input image and the image characteristics of each prediction image.
In application, the second training data includes an initial position of the target in the previous frame of input image and a target feature of the target in the current frame of input image. The initial position of the target in the previous frame of input image is used for inputting into the first model, so that the first model predicts each predicted position in the current input image, which is the same as the contents of S201-S203, and this will not be described in detail.
And S505, inputting the image characteristics into a network structure of a second model to obtain the prediction results of the predicted images and the target.
S506, obtaining a predicted image corresponding to the optimal prediction result and the target to generate a second prediction loss, iteratively updating the model parameters of the second model again according to the second prediction loss, and finishing training the second model if the second prediction loss is converged in the iterative updating process.
In application, the target feature of the target in the current input image is used for comparing with the image feature corresponding to the predicted image to obtain a second prediction result, a second prediction loss is generated according to the second prediction result and the real result corresponding to the target, and the model parameter of the second model is updated iteratively according to the second prediction loss.
In a specific application, the second training data may use ImageNet, a large visual database developed for visual object recognition software research; the learning rate of the ordinary convolution layers in the second model is set to 0.0001, the learning rate of the fully connected parameters of the channel attention layer may be set to 0.004, the learning rates of the remaining fully connected layers may be set to 0.001, and 40 iterations are performed. Each mini-batch in the second training data comprises M+ (= 32) positive samples and a number of negative samples M− for training. When the score (prediction result) of the model for a predicted image is larger than a preset threshold, updating is stopped and the current model and its model parameters are stored. Illustratively, if the prediction categories include "cat", "dog" and "tiger", then after one of the image features is input into the second model, the probability that the prediction category is "cat" (a1 = 0.77), the probability that it is "dog" (a2 = 0.13) and the probability that it is "tiger" (a3 = 0.1) are calculated. The second model then selects the category corresponding to the largest of the three values a as the final prediction category, namely the category "cat" corresponding to a1, and the probability of a1 corresponding to "cat" (a1 = 0.77) is taken as the prediction result of the current predicted image with respect to the target. In this way the prediction result of each predicted image with respect to the target can be obtained, the optimal prediction result is selected, the squared difference between the optimal prediction result and the preset true result y = 1 of the current target is calculated to obtain the second prediction loss, and the model parameters of the second model are iteratively updated again according to the second prediction loss. The process of stopping the iterative updating of the model parameters of the second model may or may not be consistent with step S405, which will not be described in detail. In application, if the second prediction loss has not converged, steps S502-S506 are repeated, which will not be described in detail.
In this embodiment, by combining the first model and the second model, the first model is used to predict each predicted position of the target in the current input image, the second model is used to compare the predicted image at each predicted position with the target to obtain the optimal predicted image, the prediction result of the optimal predicted image and the true result of the target are then used by the second model to generate the second prediction loss and adjust the model parameters of the second model again, and the accuracy of target tracking in consecutive video frame images is thereby further improved by the first model and the second model.
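A sketch of one online-update step from S501-S506 follows, assuming a two-class output from the second model and the squared difference to the preset true result y = 1 as the second prediction loss; the helper and variable names are hypothetical.

```python
import torch

def online_update_step(second_model, optimiser, image_features):
    """image_features: (num_candidates, C, H, W) tensor of features for the predicted images."""
    scores = torch.softmax(second_model(image_features), dim=1)[:, 1]  # per-candidate target probability
    best_score = scores.max()                          # optimal prediction result (S506)
    second_loss = (best_score - 1.0) ** 2              # squared difference to the preset true result y = 1
    optimiser.zero_grad()
    second_loss.backward()
    optimiser.step()                                   # updates only the parameters held by the optimiser
    return second_loss.item()
```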
In an embodiment, the second model comprises at least one fully connected layer; iteratively updating the model parameters of the second model again according to the second predicted loss comprises:
and updating the model parameters of all the fully-connected layers in the second model, and keeping the model parameters of the rest layers unchanged.
In application, the second model is a six-layer network structure comprising an input layer, an output layer and the layers between them. In this six-layer network structure, each layer has corresponding model parameters. For example, the model parameters of the j-th layer of the network structure are denoted Wj; W1 to W5 are the trained model parameters loaded through S401-S405, and W6 consists of learning parameters obtained by random initialization. During the training process, the model parameters in W1 to W3 are fixed and unchanged, while W4 to W6, the model parameters of all the fully connected layers, need to be updated according to the second prediction loss. This reduces the amount of computation on the model parameters W1 to W3 in the second model, makes the updating of the second model more efficient, and avoids the overfitting problem that the training of S401-S405 could otherwise cause.
In this embodiment, model parameters of the full connection layer in the second model are updated, so that the calculation in the second model is more efficient, the overfitting problem caused by the first training is avoided, and the recognition accuracy of the trained second model on the image is higher.
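A sketch of this fully-connected-only update is shown below, assuming the layer attribute names (layer1..layer6) from the architecture sketch earlier: the convolution-layer parameters W1-W3 are frozen and only the fully connected layers are passed to the optimiser. If the channel attention layer's fully connected parameters are present inside layer3, they would be handled as a separate parameter group with the 0.004 learning rate quoted in the embodiment.

```python
import torch

def build_finetune_optimiser(model, fc_lr=0.001):
    for layer in (model.layer1, model.layer2, model.layer3):
        for p in layer.parameters():
            p.requires_grad = False                    # keep the conv-layer parameters W1-W3 unchanged
    fc_params = [p for layer in (model.layer4, model.layer5, model.layer6)
                 for p in layer.parameters()]
    return torch.optim.SGD(fc_params, lr=fc_lr)        # only the fully connected layers are updated
```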
As shown in fig. 6, the present embodiment further provides a target tracking apparatus 100, including:
the first obtaining module 10 is configured to obtain an initial position of a target in an input image of a previous frame.
A first input module 20, configured to input the initial position into the first model, so as to obtain multiple predicted positions of the target in the current input image.
The first processing module 30 is configured to obtain a predicted image corresponding to each predicted position in the current input image, and input the predicted image into the second model to obtain a classification result between each predicted image and the target.
And the determining module 40 is configured to determine a target position of the target in the current input image according to the classification result.
In one embodiment, the first input module 20 is further configured to:
acquiring the length and the width of the target in the previous frame of input image;
generating a target matrix according to the length and the width;
and acquiring each predicted position of the target in the current input image through Gaussian distribution according to the target matrix.
In an embodiment, the second model comprises a multi-layer network structure between an input layer and an output layer, wherein the first layer comprises a first convolution layer, a first activation layer, a first local normalization layer, a first max-pooling layer; the second layer comprises a second convolution layer, a second activation layer, a second batch normalization layer and a second maximum pooling layer; the third layer comprises a third convolution layer, a third activation layer and a channel attention layer; the fourth layer comprises a first full connection layer and a fourth activation layer; the fifth layer comprises a random inactivation layer, a second full-connection layer and a fifth activation layer; the sixth layer is a classification layer.
In one embodiment, the target tracking device 100 may also be used in a channel attention layer process, including:
and the second acquisition module is used for acquiring the second input characteristic output by the second activation layer.
And the second processing module is used for executing first processing on the second input characteristics to obtain first output characteristics and executing second processing on the second input characteristics to obtain second output characteristics.
And the first generation module is used for combining the first output characteristic and the second output characteristic to generate a third output characteristic.
And the second generation module is used for combining the third output characteristic with the second input characteristic to generate a target characteristic.
In an embodiment, the target tracking device 100 may also be used to perform training of a second model, including:
a third obtaining module, configured to obtain first training data, where the first training data includes an input image of a known category and a first input feature of the input image.
And the second input module is used for inputting the first input features into an initial second model for propagation to obtain the prediction category of the input image.
And the first updating module is used for generating a first prediction loss according to the prediction category and the known category and iteratively updating the model parameters of the initial second model according to the first prediction loss.
And the first convergence module is used for adjusting the model parameters of the initial second model if the first prediction loss is not converged in the iterative updating process, and returning to execute the step of inputting the first training data into the initial second model for training to obtain the prediction type and the prediction loss generated by the known type and the subsequent steps.
And the second convergence module is used for finishing training the initial second model and taking the current initial second model as the trained second model if the first prediction loss is converged in the iterative updating process.
In one embodiment, the target tracking device 100 further comprises:
and the fourth acquisition module is used for acquiring the network structure of the second model and the model parameters thereof.
And the fifth acquisition module is used for acquiring second training data, wherein the second training data comprises the initial position of the target in the previous frame of input image and the target feature of the target in the current input image.
And the third input module is used for inputting the initial position into the first model to obtain each predicted position of the target in the current input image.
And the sixth acquisition module is used for acquiring the predicted images corresponding to the predicted positions in the current input image and the image characteristics of the predicted images.
And the fourth input module is used for inputting each image characteristic into the network structure of the second model to obtain the prediction result of each predicted image and the target.
And the third processing module is used for acquiring a predicted image corresponding to the optimal prediction result and the target to generate a second prediction loss, iterating and updating the model parameters of the second model according to the second prediction loss again, and finishing training the second model if the second prediction loss is converged in the iterative updating process.
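A minimal sketch of this online updating step is given below. It assumes the image features of the predicted images for the current frame have already been extracted; the choice of loss, optimizer, step count and the use of the optimal candidate as the positive example are assumptions.

    import torch
    import torch.nn as nn

    def finetune_online(second_model, candidate_features, steps=10, lr=1e-5, tol=1e-5):
        # candidate_features: one row of image features per predicted position.
        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(second_model.parameters(), lr=lr)
        prev_loss = float("inf")
        for _ in range(steps):
            scores = second_model(candidate_features)          # prediction result for each predicted image
            best = scores[:, 1].argmax()                       # optimal prediction result
            label = torch.tensor([1], device=scores.device)    # the optimal candidate is treated as the target
            loss = criterion(scores[best:best + 1], label)     # second prediction loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                   # update the model parameters again
            if abs(prev_loss - loss.item()) < tol:             # stop once the second prediction loss converges
                break
            prev_loss = loss.item()
        return second_model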
In an embodiment, the second model comprises at least one fully connected layer; the target tracking device 100 further includes:
and the second updating module is used for updating the model parameters of all the fully-connected layers in the second model and keeping the model parameters of the rest layers unchanged.
An embodiment of the present application further provides a terminal device, where the terminal device includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, and the processor implements the steps of any of the method embodiments described above when executing the computer program.
The embodiments of the present application further provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps in the above method embodiments.
The embodiments of the present application further provide a computer program product which, when run on a terminal device, causes the terminal device to implement the steps in the above method embodiments.
Fig. 7 is a schematic diagram of a terminal device 80 according to an embodiment of the present application. As shown in fig. 7, the terminal device 80 of this embodiment includes: a processor 803, a memory 801 and a computer program 802 stored in the memory 801 and executable on the processor 803. The processor 803 implements the steps in the various method embodiments described above, such as the steps S101 to S104 shown in fig. 1, when executing the computer program 802. Alternatively, the processor 803 realizes the functions of the modules/units in the above-described device embodiments when executing the computer program 802.
Illustratively, the computer program 802 may be partitioned into one or more modules/units that are stored in the memory 801 and executed by the processor 803 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 802 in the terminal device 80. For example, the computer program 802 may be divided into a first obtaining module, a first inputting module, a first processing module and a determining module, and the specific functions of each module are as follows:
and the first acquisition module is used for acquiring the initial position of the target in the previous frame of input image.
And the first input module is used for inputting the initial position into the first model to obtain a plurality of predicted positions of the target in the current input image.
And the first processing module is used for acquiring a predicted image corresponding to each predicted position in the current input image, and inputting the predicted image into the second model to obtain a classification result of each predicted image and the target.
And the determining module is used for determining the target position of the target in the current input image according to the classification result.
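Taken together, these four modules carry out one tracking step per frame. The sketch below illustrates such a step, drawing the predicted positions from a Gaussian distribution built from the target's previous position and size as in claim 2; the candidate count, the covariance scaling, the 107×107 crop size and the nearest-neighbour resizing are assumptions, and current_image is assumed to be an H×W×3 array.

    import numpy as np
    import torch
    import torch.nn.functional as F

    def track_one_step(second_model, current_image, prev_box, n_candidates=256, crop=107):
        # prev_box = (cx, cy, w, h): initial position of the target in the previous frame.
        cx, cy, w, h = prev_box
        cov = np.diag([(0.1 * w) ** 2, (0.1 * h) ** 2])         # target matrix built from length and width
        centres = np.random.multivariate_normal(
            [cx, cy], cov, size=n_candidates)                   # predicted positions (Gaussian sampling)
        H, W = current_image.shape[:2]
        crops = []
        for px, py in centres:                                  # predicted image for each predicted position
            x0 = int(np.clip(px - w / 2, 0, W - 1))
            y0 = int(np.clip(py - h / 2, 0, H - 1))
            patch = current_image[y0:y0 + max(int(h), 1), x0:x0 + max(int(w), 1)]
            patch = torch.from_numpy(patch).permute(2, 0, 1).float().unsqueeze(0)
            crops.append(F.interpolate(patch, size=(crop, crop)))  # resize to the network input size
        batch = torch.cat(crops)
        scores = second_model(batch)                            # classification result per predicted image
        best = scores[:, 1].argmax().item()                     # optimal classification result
        return centres[best]                                    # target position in the current input image

In practice the determining step would also refine the scale of the bounding box; the sketch only returns the best candidate centre.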
The terminal device 80 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server or another computing device. The terminal device may include, but is not limited to, the processor 803 and the memory 801. Those skilled in the art will appreciate that fig. 7 is merely an example of the terminal device 80 and does not constitute a limitation of the terminal device 80, which may include more or fewer components than shown, combine some components, or have different components; for example, the terminal device may also include input-output devices, network access devices, buses, and the like.
The processor 803 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 801 may be an internal storage unit of the terminal device 80, such as a hard disk or a memory of the terminal device 80. The memory 801 may also be an external storage device of the terminal device 80, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash memory card (Flash Card) provided on the terminal device 80. In one embodiment, the memory 801 may include both an internal storage unit and an external storage device of the terminal device 80. The memory 801 is used to store the computer program and other programs and data required by the terminal device, and may also be used to temporarily store data that has been output or is to be output.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, realizes the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A target tracking method, comprising:
acquiring the initial position of a target in the previous frame of input image;
inputting the initial position into a first model to obtain each predicted position of the target in the current input image;
obtaining a prediction image corresponding to each prediction position in the current input image, and inputting each prediction image into a second model to obtain a classification result of each prediction image and the target;
and determining the target position of the target in the current input image according to the classification result.
2. The target tracking method of claim 1, wherein the inputting the initial position into the first model to obtain each predicted position of the target in the current input image comprises:
acquiring the length and the width of the target in the previous frame of input image;
generating a target matrix according to the length and the width;
and acquiring each predicted position of the target in the current input image through Gaussian distribution according to the target matrix.
3. The target tracking method of claim 1, wherein the second model comprises a multi-layer network structure between an input layer and an output layer, wherein the first layer comprises a first convolution layer, a first activation layer, a first local normalization layer and a first max-pooling layer; the second layer comprises a second convolution layer, a second activation layer, a second batch normalization layer and a second max-pooling layer; the third layer comprises a third convolution layer, a third activation layer and a channel attention layer; the fourth layer comprises a first fully connected layer and a fourth activation layer; the fifth layer comprises a random inactivation (dropout) layer, a second fully connected layer and a fifth activation layer; and the sixth layer is a classification layer.
4. The target tracking method of claim 3, wherein the processing of the channel attention layer comprises:
acquiring a second input feature output by the second activation layer;
performing first processing on the second input feature to obtain a first output feature, and performing second processing on the second input feature to obtain a second output feature;
combining the first output feature with the second output feature to generate a third output feature;
and combining the third output feature with the second input feature to generate a target feature.
5. The target tracking method of claim 1, wherein the training step of the second model comprises:
acquiring first training data, wherein the first training data comprises input images of known classes and first input features of the input images;
inputting the first input feature into an initial second model for propagation to obtain the prediction category of the input image;
generating a first prediction loss according to the prediction category and the known category, and iteratively updating the model parameters of the initial second model according to the first prediction loss;
if the first prediction loss does not converge during the iterative updating, adjusting the model parameters of the initial second model, and returning to the step of inputting the first training data into the initial second model to obtain the prediction category and generate the prediction loss from the known category, together with the subsequent steps;
and if the first prediction loss is converged in the iterative updating process, finishing training the initial second model, and taking the current initial second model as the trained second model.
6. The method of target tracking of claim 5, further comprising, after taking the current initial second model as the trained second model:
acquiring a network structure of the second model and model parameters thereof;
acquiring second training data, wherein the second training data comprises the initial position of a target in the previous frame of input image and the target feature of the target in the current input image;
inputting the initial position into a first model to obtain each predicted position of a target in the current input image;
obtaining a prediction image corresponding to each prediction position in the current input image and the image characteristics of each prediction image;
inputting each image characteristic into a network structure of a second model to obtain a prediction result of each predicted image and a target;
and acquiring the predicted image corresponding to the optimal prediction result, generating a second prediction loss from it and the target, iteratively updating the model parameters of the second model again according to the second prediction loss, and ending the training of the second model if the second prediction loss converges during the iterative updating.
7. The target tracking method of claim 6, wherein the second model comprises at least one fully connected layer;
iteratively updating the model parameters of the second model again according to the second predicted loss comprises:
and updating the model parameters of all the fully-connected layers in the second model, and keeping the model parameters of the rest layers unchanged.
8. A target tracking device, comprising:
the first acquisition module is used for acquiring the initial position of the target in the previous frame of input image;
the first input module is used for inputting the initial position into a first model to obtain a plurality of predicted positions of the target in the current input image;
the first processing module is used for acquiring a predicted image corresponding to each predicted position in the current input image, and inputting the predicted image into a second model to obtain a classification result of each predicted image and the target;
and the determining module is used for determining the target position of the target in the current input image according to the classification result.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202010051488.6A 2020-01-17 2020-01-17 Target tracking method, device, equipment and storage medium Pending CN111223128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010051488.6A CN111223128A (en) 2020-01-17 2020-01-17 Target tracking method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010051488.6A CN111223128A (en) 2020-01-17 2020-01-17 Target tracking method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111223128A true CN111223128A (en) 2020-06-02

Family

ID=70831170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010051488.6A Pending CN111223128A (en) 2020-01-17 2020-01-17 Target tracking method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111223128A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180096259A1 (en) * 2016-09-30 2018-04-05 Disney Enterprises, Inc. Deep-learning motion priors for full-body performance capture in real-time
CN108010060A (en) * 2017-12-06 2018-05-08 北京小米移动软件有限公司 Object detection method and device
CN109492530A (en) * 2018-10-10 2019-03-19 重庆大学 Robustness vision object tracking algorithm based on the multiple dimensioned space-time characteristic of depth
CN109711427A (en) * 2018-11-19 2019-05-03 深圳市华尊科技股份有限公司 Object detection method and Related product
CN109801268A (en) * 2018-12-28 2019-05-24 东南大学 A kind of CT contrastographic picture arteria renalis dividing method based on Three dimensional convolution neural network
CN109934121A (en) * 2019-02-21 2019-06-25 江苏大学 A kind of orchard pedestrian detection method based on YOLOv3 algorithm
CN110084173A (en) * 2019-04-23 2019-08-02 精伦电子股份有限公司 Number of people detection method and device
CN110136162A (en) * 2019-05-20 2019-08-16 北方工业大学 Unmanned aerial vehicle visual angle remote sensing target tracking method and device
CN110334760A (en) * 2019-07-01 2019-10-15 成都数之联科技有限公司 A kind of optical component damage detecting method and system based on resUnet
CN110533691A (en) * 2019-08-15 2019-12-03 合肥工业大学 Method for tracking target, equipment and storage medium based on multi-categorizer
CN110648337A (en) * 2019-09-23 2020-01-03 武汉联影医疗科技有限公司 Hip joint segmentation method, hip joint segmentation device, electronic apparatus, and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
HYEONSEOB NAM 等: "Learning Multi-Domain Convolutional Neural Networks for Visual Tracking", 《2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
KEN CHATFIELD 等: "Return of the Devil in the Details:Delving Deep into Convolutional Nets", 《ARXIV:1405.3531V4》 *
SANGHYUN WOO 等: "CBAM: Convolutional Block Attention Module", 《ARXIV:1807.06521V2》 *
YINGSEN ZENG 等: "Efficient Dual Attention Module for Real-Time Visual Tracking", 《2019 IEEE VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP)》 *
YINGSEN ZENG 等: "Learning Spatial-Channel Attention for Visual Tracking", 《 2019 IEEE/CIC INTERNATIONAL CONFERENCE ON COMMUNICATIONS IN CHINA (ICCC)》 *
YUXIN WU 等: "Group Normalization", 《ECCV2018》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832549A (en) * 2020-06-29 2020-10-27 深圳市优必选科技股份有限公司 Data labeling method and device
CN111832549B (en) * 2020-06-29 2024-04-23 深圳市优必选科技股份有限公司 Data labeling method and device
CN112149640A (en) * 2020-10-23 2020-12-29 北京嘀嘀无限科技发展有限公司 Method, device, computer equipment and medium for determining position of target object
CN113807517A (en) * 2021-09-18 2021-12-17 成都数联云算科技有限公司 Pruning parameter searching method, pruning method, device, equipment and medium
CN113807517B (en) * 2021-09-18 2024-02-02 成都数联云算科技有限公司 Pruning parameter searching method, pruning device, pruning equipment and pruning medium
CN115880298A (en) * 2023-03-02 2023-03-31 湖南大学 Glass surface defect detection method and system based on unsupervised pre-training
CN116030078A (en) * 2023-03-29 2023-04-28 之江实验室 Attention-combined lung lobe segmentation method and system under multitask learning framework
CN116030078B (en) * 2023-03-29 2023-06-30 之江实验室 Attention-combined lung lobe segmentation method and system under multitask learning framework

Similar Documents

Publication Publication Date Title
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
CN111223128A (en) Target tracking method, device, equipment and storage medium
Chan et al. Bayesian poisson regression for crowd counting
CN110532417B (en) Image retrieval method and device based on depth hash and terminal equipment
EP3853764A1 (en) Training neural networks for vehicle re-identification
CN109949255A (en) Image rebuilding method and equipment
US20230153615A1 (en) Neural network distillation method and apparatus
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
US11568212B2 (en) Techniques for understanding how trained neural networks operate
CN111814804B (en) Human body three-dimensional size information prediction method and device based on GA-BP-MC neural network
CN113516019B (en) Hyperspectral image unmixing method and device and electronic equipment
CN113095333A (en) Unsupervised feature point detection method and unsupervised feature point detection device
CN114693923A (en) Three-dimensional point cloud semantic segmentation method based on context and attention
CN113920382A (en) Cross-domain image classification method based on class consistency structured learning and related device
WO2024078112A1 (en) Method for intelligent recognition of ship outfitting items, and computer device
Hameed et al. Content based image retrieval based on feature fusion and support vector machine
CN115795355B (en) Classification model training method, device and equipment
CN116109907A (en) Target detection method, target detection device, electronic equipment and storage medium
Sufikarimi et al. Speed up biological inspired object recognition, HMAX
CN110929731A (en) Medical image processing method and device based on pathfinder intelligent search algorithm
WO2021142741A1 (en) Target tracking method and apparatus, and terminal device
CN109146058B (en) Convolutional neural network with transform invariant capability and consistent expression
CN111815658A (en) Image identification method and device
CN113033334B (en) Image processing method, image processing device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200602)