CN111291745A - Target position estimation method and device, storage medium and terminal - Google Patents

Target position estimation method and device, storage medium and terminal Download PDF

Info

Publication number
CN111291745A
CN111291745A CN201910038152.3A CN201910038152A
Authority
CN
China
Prior art keywords
map
response
frame image
feature
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910038152.3A
Other languages
Chinese (zh)
Other versions
CN111291745B (en)
Inventor
潘博阳
罗小伟
王森
刘阳
林福辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Communications Shanghai Co Ltd
Original Assignee
Spreadtrum Communications Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Communications Shanghai Co Ltd filed Critical Spreadtrum Communications Shanghai Co Ltd
Priority to CN201910038152.3A priority Critical patent/CN111291745B/en
Publication of CN111291745A publication Critical patent/CN111291745A/en
Application granted granted Critical
Publication of CN111291745B publication Critical patent/CN111291745B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • G06V10/245Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20048Transform domain processing
    • G06T2207/20056Discrete and fast Fourier transform, [DFT, FFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20132Image cropping

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A target position estimation method and device, a storage medium and a terminal are provided. The method comprises the following steps: acquiring N feature maps obtained by calculating a previous frame image through N convolutional layers in a convolutional neural network, and acquiring N feature maps obtained by calculating the current frame image through the same N convolutional layers; sequentially obtaining a first response mapping map, a first candidate response mapping map, a second fused response mapping map, an nth candidate response mapping map, an (n+1)th response mapping map and an (n+1)th fused response mapping map; determining the coordinates of the response mapping with the maximum response value in the (n+1)th fused response mapping map as the (n+1)th target center; and when n+1 equals N, determining the (n+1)th target center as the target position in the current frame image. The scheme of the invention helps to optimize the accuracy of target tracking and improve the accuracy of the estimation result.

Description

Target position estimation method and device, storage medium and terminal
Technical Field
The present invention relates to the field of target tracking technologies, and in particular, to a target position estimation method and apparatus, a storage medium, and a terminal.
Background
Computer vision is a development direction of future mobile phone multimedia applications, and object tracking (Object Tracking) is an important research topic in computer vision. At present, such algorithms have been widely used in video surveillance, robot vision, Virtual Reality (VR), and Augmented Reality (AR).
Existing methods for estimating the position of a target generally use an algorithm to automatically locate the target in a subsequent video sequence according to an initial target position. A common target tracking algorithm mainly comprises two steps: Training and Detection. Training refers to extracting samples according to the target position of the previous frame, performing Feature Extraction, and then training a model through a Classifier. Detection means that the current frame is predicted according to the model of the previous frame, the sample with the highest confidence is selected as the target position of the current frame, and the model parameters are then updated to predict the target position of the next frame.
However, in the prior art, feature extraction mainly relies on hand-crafted features, such as Haar features, Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), etc., and the classification operation is performed on a single extracted feature map, which results in low accuracy.
Disclosure of Invention
The invention aims to provide a target position estimation method and device, a storage medium and a terminal, which are beneficial to optimizing the accuracy of target tracking and improving the accuracy of an estimation result.
To solve the above technical problem, an embodiment of the present invention provides a target position estimation method, including the following steps: acquiring N feature maps obtained by calculating a previous frame of image through N convolutional layers in a convolutional neural network, and acquiring N feature maps obtained by calculating a current frame of image through the N convolutional layers in the convolutional neural network, wherein the feature maps have a consistent size and each comprises a plurality of response maps, each response map comprises a response value and its coordinates, N is a positive integer less than or equal to the number of convolutional layers in the convolutional neural network, and the N feature maps of each frame of image are arranged in reverse order of the layer numbers of the convolutional layers that generate them; in the first feature map of the previous frame image and the first feature map of the current frame image, respectively cutting feature maps of a preset size centered on the coordinates of the target center point of the previous frame image, and obtaining a first response mapping map by a classification operation, wherein the first response mapping map comprises a plurality of response mappings, and the first feature map is the feature map generated by the last convolutional layer; determining the coordinates of the response mapping with the maximum response value in the first response mapping map as the first target center; in the first feature map of the previous frame image and the first feature map of the current frame image, respectively cutting feature maps of a preset size centered on the first target center, and obtaining a first candidate response mapping map by a classification algorithm, wherein the first candidate response mapping map comprises a plurality of response mappings; in the second feature map of the previous frame image and the second feature map of the current frame image, respectively cutting feature maps of a preset size centered on the first target center, and obtaining a second response mapping map by a classification algorithm, wherein the second response mapping map comprises a plurality of response mappings; weighting and summing the response mappings of the first candidate response mapping map and the second response mapping map with a first preset weight value to obtain a second fusion response mapping map; determining the coordinates of the response mapping with the maximum response value in the second fusion response mapping map as the second target center; for the second to Nth feature maps of the previous frame image, respectively calculating the candidate response mapping map corresponding to each feature map, wherein for the nth feature map of the previous frame image and the nth feature map of the current frame image, feature maps of a preset size are respectively cut out with the nth target center as the center, and an nth candidate response mapping map is obtained by a classification algorithm, the nth candidate response mapping map comprising a plurality of response mappings, where n is a positive integer and 1 < n < N; for the second to Nth feature maps of the previous frame image, respectively calculating the response mapping map corresponding to each feature map, wherein for the (n+1)th feature map of the previous frame image and the (n+1)th feature map of the current frame image, feature maps of a preset size are respectively cut out with the nth target center as the center, and an (n+1)th response mapping map is obtained by a classification algorithm, the (n+1)th response mapping map comprising a plurality of response mappings; for the second to Nth feature maps of the previous frame image, respectively calculating the fusion response mapping map corresponding to each feature map, wherein the response mappings of the nth candidate response mapping map and the (n+1)th response mapping map are weighted and summed with an nth preset weight value to obtain an (n+1)th fusion response mapping map; determining the coordinates of the response mapping with the maximum response value in the (n+1)th fusion response mapping map as the (n+1)th target center; and when n+1 equals N, determining the (n+1)th target center as the target position in the current frame image.
Optionally, the obtaining N feature maps obtained by calculating the previous frame of image by using the N convolutional layers in the convolutional neural network, and the obtaining N feature maps obtained by calculating the current frame of image by using the N convolutional layers in the convolutional neural network includes: respectively obtaining a characteristic diagram obtained after a previous frame of image passes through N convolutional layers in a convolutional neural network; respectively obtaining a characteristic diagram obtained after the current frame image passes through N convolutional layers in a convolutional neural network; and respectively scaling the N characteristic graphs of the previous frame image and the N characteristic graphs of the current frame image to preset characteristic graph sizes.
Optionally, the N feature maps of the previous frame image and the N feature maps of the current frame image are respectively scaled to a preset feature map size by using a bilinear interpolation method or a trilinear interpolation method.
Optionally, in the first feature map of the previous frame image and the first feature map of the current frame image, respectively cutting feature maps of preset sizes with a coordinate of a target center point of the previous frame image as a center, and obtaining the first response map by using a classification operation includes: in the first feature map of the previous frame image, a first target window with a preset size is adopted, and a first target feature map with a preset size is cut by taking a coordinate where a target center point in the previous frame image is located as a center; in the first feature map of the current frame image, cutting a first search feature map with a preset size by adopting a first search window with a preset size and taking a coordinate where a target center point in a previous frame image is located as a center; and respectively inputting the first target feature map and the first search feature map into a classifier for classification operation to obtain a first response mapping map.
Optionally, the algorithm of the classification operation includes: KCF algorithm, ADABOOST algorithm, and SVM algorithm.
Optionally, the convolutional neural network includes: AlexNet, VGGNet, and GoogleNet.
To solve the above technical problem, an embodiment of the present invention provides a target position estimation device, including: an acquisition module, adapted to acquire N feature maps obtained by calculating a previous frame of image through N convolutional layers in a convolutional neural network, and acquire N feature maps obtained by calculating a current frame of image through the N convolutional layers in the convolutional neural network, wherein the feature maps have a consistent size and each comprises a plurality of response maps, each response map comprises a response value and its coordinates, N is a positive integer less than or equal to the number of convolutional layers in the convolutional neural network, and the N feature maps of each frame of image are arranged in reverse order of the layer numbers of the convolutional layers that generate them; a first map determining module, adapted to respectively cut, in the first feature map of the previous frame image and the first feature map of the current frame image, feature maps of a preset size centered on the coordinates of the target center point of the previous frame image, and obtain a first response mapping map by a classification operation, wherein the first response mapping map comprises a plurality of response mappings, and the first feature map is the feature map generated by the last convolutional layer; a first center determining module, adapted to determine the coordinates of the response mapping with the largest response value in the first response mapping map as the first target center; a first candidate map determining module, adapted to respectively cut, in the first feature map of the previous frame image and the first feature map of the current frame image, feature maps of a preset size centered on the first target center, and obtain a first candidate response mapping map by a classification algorithm, wherein the first candidate response mapping map comprises a plurality of response mappings; a second map determining module, adapted to respectively cut, in the second feature map of the previous frame image and the second feature map of the current frame image, feature maps of a preset size centered on the first target center, and obtain a second response mapping map by a classification algorithm, wherein the second response mapping map comprises a plurality of response mappings; a first fusion map determining module, adapted to perform weighted summation on the response mappings of the first candidate response mapping map and the second response mapping map with a first preset weight value to obtain a second fusion response mapping map; a second center determining module, adapted to determine the coordinates of the response mapping with the largest response value in the second fusion response mapping map as the second target center; an nth candidate map determining module, adapted to calculate, for the second to Nth feature maps of the previous frame image, the candidate response mapping map corresponding to each feature map, wherein for the nth feature map of the previous frame image and the nth feature map of the current frame image, feature maps of a preset size are respectively cut out with the nth target center as the center, and an nth candidate response mapping map is obtained by a classification algorithm, the nth candidate response mapping map comprising a plurality of response mappings, where n is a positive integer and 1 < n < N; an (n+1)th response map determining module, adapted to calculate, for the second to Nth feature maps of the previous frame image, the response mapping map corresponding to each feature map, wherein for the (n+1)th feature map of the previous frame image and the (n+1)th feature map of the current frame image, feature maps of a preset size are respectively cut out with the nth target center as the center, and an (n+1)th response mapping map is obtained by a classification algorithm, the (n+1)th response mapping map comprising a plurality of response mappings; an nth fusion map determining module, adapted to calculate, for the second to Nth feature maps of the previous frame image, the fusion response mapping map corresponding to each feature map, wherein the response mappings of the nth candidate response mapping map and the (n+1)th response mapping map are weighted and summed with an nth preset weight value to obtain an (n+1)th fusion response mapping map; an (n+1)th center determining module, adapted to determine the coordinates of the response mapping with the largest response value in the (n+1)th fusion response mapping map as the (n+1)th target center; and a target position determining module, adapted to determine the (n+1)th target center as the target position in the current frame image when n+1 equals N.
Optionally, the first map determining module includes: the first target image cutting sub-module is suitable for cutting a first target feature image with a preset size by adopting a first target window with a preset size in the first feature image of the previous frame image and taking the coordinate where the target center point in the previous frame image is located as the center; the first search map cutting sub-module is suitable for cutting a first search feature map with a preset size by adopting a first search window with a preset size in the first feature map of the current frame image and taking the coordinate where the target center point in the previous frame image is located as the center; and the classification operation submodule is suitable for respectively inputting the first target characteristic diagram and the first search characteristic diagram into a classifier to carry out classification operation so as to obtain a first response mapping diagram.
To solve the above technical problem, an embodiment of the present invention provides a storage medium having stored thereon computer instructions, which when executed, perform the steps of the above target position estimation method.
In order to solve the above technical problem, an embodiment of the present invention provides a terminal, including a memory and a processor, where the memory stores computer instructions capable of being executed on the processor, and the processor executes the steps of the target position estimation method when executing the computer instructions.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
in the embodiment of the invention, N characteristic maps of the previous frame image and the current frame image are respectively obtained, the classification operation is carried out on every two characteristic maps in sequence, and the obtained response maps are subjected to weighted summation to obtain the fused response map, so that each characteristic map can be subjected to multiple iterative operations and the result is applied to the estimation of the target position, the target tracking accuracy is favorably optimized, and the estimation result accuracy is improved.
Further, in the embodiment of the present invention, by setting the first target window with a preset size and the first search window with a preset size, a proper feature map may be obtained by clipping to perform a classification operation, which is helpful for obtaining a response map.
Drawings
FIG. 1 is a flow chart of a method for estimating a target location in an embodiment of the present invention;
FIG. 2 is a flowchart of one embodiment of step S101 of FIG. 1;
FIG. 3 is a schematic diagram of an application scenario of a feature map extraction method according to an embodiment of the present invention;
FIG. 4 is a flowchart of one embodiment of step S104 of FIG. 1;
FIG. 5 is a schematic diagram of an application scenario of a target location estimation method according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a target position estimation apparatus according to an embodiment of the present invention.
Detailed Description
In the prior art, Convolutional Neural Networks (CNNs) model the real world using a plurality of filters and nonlinear activation functions. The conventional algorithm based on the convolutional neural network is to regard the convolutional neural network as a black box, and output results of the black box are sent to a classifier as features for classification.
Conventional target tracking algorithms use manual features and classifiers to track the target. However, in the real world, manual features have inherent limitations and cannot model targets accurately.
The inventor of the present invention has found through research that, in the prior art, feature extraction mainly depends on manual features such as Haar, HOG, and SIFT. Specifically, on the one hand, manual features have relatively high computational complexity; on the other hand, manual features cannot accurately model objects in the real world, for example when the appearance or shape of the object changes. Furthermore, performing the classification operation on a single extracted feature map easily causes the problems of insufficient complexity and low accuracy.
In the embodiment of the invention, N characteristic maps of the previous frame image and the current frame image are respectively obtained, the classification operation is carried out on every two characteristic maps in sequence, and the obtained response maps are subjected to weighted summation to obtain the fused response map, so that each characteristic map can be subjected to multiple iterative operations and the result is applied to the estimation of the target position, the target tracking accuracy is favorably optimized, and the estimation result accuracy is improved.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Referring to fig. 1, fig. 1 is a flowchart of a target position estimation method according to an embodiment of the present invention. The method may include steps S101 to S112:
step S101: acquiring N feature maps obtained by calculating a previous frame of image through N convolutional layers in a convolutional neural network, acquiring N feature maps obtained by calculating a current frame of image through the N convolutional layers in the convolutional neural network, wherein the size of each feature map is consistent and comprises a plurality of response maps, each response map comprises a response value and a coordinate thereof, N is less than or equal to the number of the convolutional layers in the convolutional neural network and is a positive integer, and the N feature maps of each frame of image are arranged in a reverse order according to the layer number of the convolutional layer for generating each feature map;
step S102: respectively cutting feature maps with preset sizes by taking coordinates where a target center point of a previous frame image is located as a center in the first feature map of the previous frame image and the first feature map of the current frame image, and obtaining a first response mapping map by adopting classification operation, wherein the first response mapping map comprises a plurality of response mappings, and the first feature map is a feature map generated by calculating the last convolution layer;
step S103: determining coordinates of a response map with a maximum response value in the first response map as a first target center;
step S104: respectively cutting feature maps with preset sizes in the first feature map of the previous frame image and the first feature map of the current frame image by taking the first target center as the center, and obtaining a first candidate response map by adopting a classification algorithm, wherein the first candidate response map comprises a plurality of response maps;
step S105: respectively cutting feature maps with preset sizes in the second feature map of the previous frame image and the second feature map of the current frame image by taking the first target center as the center, and obtaining a second response map by adopting a classification algorithm, wherein the second response map comprises a plurality of response maps;
step S106: weighting and summing the response maps of the first candidate response map and the second response map by adopting a first preset weight value to obtain a second fusion response map;
step S107: determining the coordinate of the response mapping with the maximum response value in the second fusion response mapping map as a second target center;
step S108: for the second to Nth feature maps of the previous frame image, respectively calculating the candidate response mapping map corresponding to each feature map, wherein for the nth feature map of the previous frame image and the nth feature map of the current frame image, feature maps of a preset size are respectively cut out with the nth target center as the center, and an nth candidate response mapping map is obtained by a classification algorithm, the nth candidate response mapping map comprising a plurality of response mappings, where n is a positive integer and 1 < n < N;
step S109: for the second to Nth feature maps of the previous frame image, respectively calculating the response mapping map corresponding to each feature map, wherein for the (n+1)th feature map of the previous frame image and the (n+1)th feature map of the current frame image, feature maps of a preset size are respectively cut out with the nth target center as the center, and an (n+1)th response mapping map is obtained by a classification algorithm, the (n+1)th response mapping map comprising a plurality of response mappings;
step S110: for the second to Nth feature maps of the previous frame image, respectively calculating the fusion response mapping map corresponding to each feature map, wherein the response mappings of the nth candidate response mapping map and the (n+1)th response mapping map are weighted and summed with an nth preset weight value to obtain an (n+1)th fusion response mapping map;
step S111: determining the coordinates of the response mapping with the maximum response value in the (n+1)th fusion response mapping map as the (n+1)th target center;
step S112: when n+1 equals N, determining the (n+1)th target center as the target position in the current frame image.
In the specific implementation of step S101, the feature map is obtained through calculation, so that the feature map is operated in the subsequent step.
Referring to fig. 2, fig. 2 is a flowchart of an embodiment of step S101 in fig. 1. The step of acquiring N feature maps obtained by calculating the previous frame image by using the N convolutional layers in the convolutional neural network, and the step of acquiring N feature maps obtained by calculating the current frame image by using the N convolutional layers in the convolutional neural network may include steps S21 to S23, which are described below.
In step S21, feature maps obtained by passing the previous frame of image through N convolutional layers in the convolutional neural network are obtained.
In step S22, feature maps obtained by passing the current frame image through N convolutional layers in the convolutional neural network are obtained.
In step S23, the N feature maps of the previous frame image and the N feature maps of the current frame image are scaled to a preset feature map size, respectively.
Further, Bi-linear Interpolation or tri-linear interpolation may be used to respectively scale the N feature maps of the previous frame image and the N feature maps of the current frame image to a preset feature map size, which helps to implement the subsequent estimation of the target position more effectively.
Further, the convolutional neural network may include: the Alex network (AlexNet), the Visual Geometry Group network of the University of Oxford (VGG Net), and the Google network (GoogleNet).
It should be noted that, in the embodiment of the present invention, no limitation is made to a specific convolutional neural network.
In a specific implementation, taking VGG Net as an example, the CNN model used for the feature extraction process is introduced. The image is fed into VGG Net for forward propagation, and a Feature Map is generated at each convolutional layer. Shallow feature maps have a higher resolution and deep feature maps have a lower resolution.
Referring to fig. 3, fig. 3 is a schematic view of an application scenario of a feature map extraction method in the embodiment of the present invention.
As depicted in fig. 3, the image 101 is transferred to the CNN model 102, wherein the CNN model 102 includes a plurality of convolutional layers.
The image 101 may be a previous frame image or a current frame image.
Specifically, the plurality of convolutional layers may include a first convolutional layer 103, a second convolutional layer 105, ..., an (N-1)th convolutional layer 107, and an Nth convolutional layer 109.
The N feature maps obtained after calculation are arranged in reverse order of the layer numbers of the convolutional layers that generate them. Specifically, the image 101 passes through the first convolutional layer 103 to obtain the Nth feature map 104, through the second convolutional layer 105 to obtain the (N-1)th feature map 106, ..., through the (N-1)th convolutional layer 107 to obtain the second feature map 108, and through the Nth convolutional layer 109 to obtain the first feature map 110.
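As a non-authoritative illustration of this feature extraction step, the sketch below assumes a PyTorch/torchvision VGG-16 backbone; the chosen layer indices, the preset feature map size, and the function name extract_feature_maps are illustrative assumptions, not taken from the patent. It collects the outputs of several convolutional stages, scales each one to the same preset size with bilinear interpolation, and reverses the order so that the map from the deepest convolutional layer becomes the first feature map.

```python
# Minimal sketch (assumptions: VGG-16 backbone, layer indices, preset size
# and function name are illustrative; pretrained weights would be loaded
# in practice before using the features for tracking).
import torch
import torch.nn.functional as F
import torchvision

vgg_features = torchvision.models.vgg16().features.eval()
conv_layer_indices = [15, 22, 29]   # outputs of conv3_3, conv4_3, conv5_3 (N = 3)
preset_size = (56, 56)              # hypothetical preset feature map size

def extract_feature_maps(image: torch.Tensor):
    """image: (1, 3, H, W) tensor; returns N feature maps, deepest layer first."""
    feature_maps = []
    x = image
    with torch.no_grad():
        for i, layer in enumerate(vgg_features):
            x = layer(x)
            if i in conv_layer_indices:
                # scale every feature map to the same preset size (bilinear)
                fm = F.interpolate(x, size=preset_size,
                                   mode='bilinear', align_corners=False)
                feature_maps.append(fm)
    # reverse order: the map of the last (deepest) convolutional layer
    # becomes the "first feature map" used by the method
    return feature_maps[::-1]
```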
With reference to fig. 1, in the specific implementation of step S102, a first response map is obtained by using a classification operation in the first feature map of the previous frame image and the first feature map of the current frame image.
Referring to fig. 4, fig. 4 is a flowchart of an embodiment of step S104 in fig. 1. The step of respectively cutting out feature maps of a preset size with the coordinate of the target center point of the previous frame image as the center in the first feature map of the previous frame image and the first feature map of the current frame image, and obtaining the first response map by using a classification operation may include steps S41 to S43, and each step is described below.
In step S41, a first target window with a preset size is used in the first feature map of the previous frame image, and the first target feature map with a preset size is cut out with the coordinates of the center point of the target in the previous frame image as the center.
In step S42, in the first feature map of the current frame image, a first search window with a preset size is adopted, and the first search feature map with a preset size is cut out with the coordinate of the target center point in the previous frame image as the center.
In step S43, the first target feature map and the first search feature map are respectively input to a classifier for classification operation, so as to obtain a first response map.
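A minimal sketch of the cropping operation used in steps S41 and S42 is given below; the (C, H, W) layout and the clamping of the window to the feature map boundary are illustrative assumptions, not details specified by the patent.

```python
import numpy as np

def crop_centered(feature_map, center, size):
    """Cut a window of a preset size from a (C, H, W) feature map, centered
    on the (row, col) coordinate. The window is clamped so that it stays
    inside the feature map (an assumed boundary policy); the window size is
    assumed not to exceed the feature map size."""
    _, H, W = feature_map.shape
    h, w = size
    row, col = center
    top = min(max(row - h // 2, 0), H - h)
    left = min(max(col - w // 2, 0), W - w)
    return feature_map[:, top:top + h, left:left + w]
```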
Further, the algorithm of the classification operation may include: a Kernelized Correlation Filter (KCF) algorithm, an Adaptive Boosting (ADABOOST) algorithm, and a Support Vector Machine (SVM) algorithm.
It should be noted that, in the embodiment of the present invention, the specific algorithm of the classification operation is not limited.
In one non-limiting example of an embodiment of the present invention, a KCF classifier is used to perform a KCF classification operation, thereby obtaining a first response map.
In a specific implementation, the coordinates of the target center point in the image are first taken as the center, and an image sample x of size W × H is acquired in the vicinity of that center to train the classifier. Using the properties of the Cyclic Shift Matrix and appropriate padding of the image, all cyclically shifted samples x_{w,h}, (w,h) ∈ {0,1,…,W−1} × {0,1,…,H−1}, are then used as training samples of the classifier. Meanwhile, the Regression Target y follows a Gaussian distribution: its value is 1 at the target center point, decays with increasing distance from the center point, and decays to 0 at the target edge, where y(w,h) denotes the Label of x_{w,h}.
The purpose of the training is to find the following function:
f(z) = \mathbf{w}^{T} z
such that the mean square error between the samples x_{w,h} and their regression targets y(w,h) is minimized, i.e.
\min_{\mathbf{w}} \sum_{w,h} \left| \left\langle \phi(x_{w,h}), \mathbf{w} \right\rangle - y(w,h) \right|^{2} + \lambda \left\| \mathbf{w} \right\|^{2}
where φ denotes the mapping of samples into a Hilbert space induced by a kernel function κ, the Inner Product of x and x' is expressed as
\left\langle \phi(x), \phi(x') \right\rangle = \kappa(x, x')
and λ represents the regularization term coefficient.
After mapping the input into the nonlinear feature space φ(x), the solution w of the linear problem is expressed as
\mathbf{w} = \sum_{w,h} \alpha(w,h)\, \phi(x_{w,h})
and the solution of the vector α is
\boldsymbol{\alpha} = F^{-1}\!\left( \frac{F(\mathbf{y})}{F(\mathbf{k}^{x}) + \lambda} \right)
where F and F^{-1} denote the forward and inverse Fourier transforms respectively, k^{x}(w,h) = κ(x_{w,h}, x), and the vector α contains all of the α(w,h) coefficients. The Appearance Model needs to be updated to process each frame, for example by linear interpolation with a learning rate η:
\hat{\mathbf{x}} \leftarrow (1 - \eta)\, \hat{\mathbf{x}} + \eta\, \mathbf{x}
The KCF tracking algorithm model thus comprises the learned Target Appearance Model \hat{\mathbf{x}} and the classifier coefficients F(α).
Further, the response map values and their coordinates may be calculated in the current frame:
\hat{\mathbf{y}} = F^{-1}\!\left( F(\mathbf{k}^{\hat{x}z}) \odot F(\boldsymbol{\alpha}) \right)
where ⊙ represents point-by-point multiplication (Element-wise Product), z is the sample cut out of the current frame, k^{\hat{x}z}(w,h) = κ(z_{w,h}, \hat{x}), and \hat{\mathbf{x}} represents the learned target appearance model.
According to the above steps, a first response map can be obtained. It should be noted that, in the embodiment of the present invention, the step of inputting the feature map into the classifier to perform the classification operation to obtain the response map may be implemented by using the above steps.
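To make the classification operation concrete, the following is a minimal single-channel sketch of the KCF training and detection equations above, using NumPy; the Gaussian kernel bandwidth, the regularization value and the helper names are illustrative assumptions (a practical tracker would additionally apply a cosine window and operate on multi-channel CNN features).

```python
import numpy as np

def gaussian_labels(H, W, sigma=2.0):
    # Regression target y: 1 at the target centre, decaying with distance,
    # rolled so the peak sits at index (0, 0) as the cyclic-shift model expects.
    ys, xs = np.meshgrid(np.arange(H) - H // 2, np.arange(W) - W // 2, indexing='ij')
    y = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))
    return np.roll(y, (-(H // 2), -(W // 2)), axis=(0, 1))

def gaussian_kernel_correlation(x, z, sigma=0.5):
    # k^{xz}(w, h) = kappa(z_{w,h}, x) for a Gaussian kernel, evaluated for
    # all cyclic shifts at once in the Fourier domain.
    c = np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)).real
    d = (np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * c) / x.size
    return np.exp(-np.clip(d, 0.0, None) / (sigma ** 2))

def kcf_train(x, y, lam=1e-4):
    # F(alpha) = F(y) / (F(k^x) + lambda)
    k = gaussian_kernel_correlation(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def kcf_detect(alphaf, x_model, z):
    # Response map: y_hat = F^{-1}( F(k^{xz}) * F(alpha) ), element-wise product
    k = gaussian_kernel_correlation(x_model, z)
    return np.fft.ifft2(np.fft.fft2(k) * alphaf).real
```

In this sketch, the first target feature map cut from the previous frame would play the role of x (with y produced by gaussian_labels), the first search feature map cut from the current frame would play the role of z, and the coordinates of the maximum of the returned response map would then give the first target center.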
In the embodiment of the invention, by setting the first target window with the preset size and the first search window with the preset size, a proper feature map can be obtained through cutting to perform classification operation, which is beneficial to realizing the acquisition of the response map.
With continuing reference to fig. 1, the specific implementation of steps S103 through S112 can be described in detail with reference to fig. 5.
Referring to fig. 5, fig. 5 is a schematic view of an application scenario of a target position estimation method in the embodiment of the present invention. In the application scenario diagram, the number N of convolutional layers is 3.
In the first feature map of the previous frame image, a first target window 111 with a preset size is adopted, and a first target feature map 112 with a preset size is cut out by taking the coordinate where the target center point of the previous frame image is located as the center; in the first feature map of the current frame image, a first search window 113 with a preset size is adopted, and a first search feature map 114 with a preset size is cut by taking the coordinate where the target center point of the previous frame image is located as the center; the first target feature map 112 and the first search feature map 114 are respectively input to a KCF classifier 115 for classification operation, so as to obtain a first response map 116.
The coordinates of the response map having the largest response value in the first response map 116 are determined as the first target center.
In the first feature map of the previous frame image, a first candidate target window 211 with a preset size is adopted, and a first candidate target feature map 212 with a preset size is cut by taking the first target center as the center; in the first feature map of the current frame image, a first candidate search window 213 with a preset size is adopted, and a first candidate search feature map 214 with a preset size is cut by taking the first target center as a center; the first candidate target feature map 212 and the first candidate search feature map 214 are respectively input to a KCF classifier 215 for classification operation, so as to obtain a first candidate response map 216.
The preset size may be the same as the preset size in the above step, and the KCF classifier 215 may be the same as the KCF classifier 115; this will not be repeated in the following description.
In the second feature map of the previous frame image, a second target feature map 312 with a preset size is cut by adopting a second target window 311 with a preset size and taking the first target center as the center; in the second feature map of the current frame image, a second search window 313 with a preset size is adopted, and a second search feature map 314 with a preset size is cut out with the first target center as the center; the second target feature map 312 and the second search feature map 314 are respectively input to a KCF classifier 315 for classification operation, so as to obtain a second response map 316.
The response maps of the first candidate response map 216 and the second response map 316 are weighted and summed with a first preset weight value to obtain a second fused response map 400.
The coordinates of the response map having the largest response value in the second fused response map 400 are determined as the second target center.
In the second feature map of the previous frame image, a second candidate target window 411 with a preset size is adopted, and a second candidate target feature map 412 with a preset size is cut out by taking the center of the second target as the center; in the second feature map of the current frame image, a second candidate search window 413 with a preset size is adopted, and a second candidate search feature map 414 with a preset size is cut out with the second target center as the center; the second candidate target feature map 412 and the second candidate search feature map 414 are respectively input to a KCF classifier 415 for classification operation, so as to obtain a second candidate response map 416.
In the second feature map of the previous frame image, a third target feature map 512 with a preset size is cut by adopting a third target window 511 with a preset size and taking the second target center as the center; in the second feature map of the current frame image, a third search window 513 with a preset size is adopted, and a third search feature map 514 with a preset size is cut out with the second target center as the center; and inputting the third target feature map 512 and the third search feature map 514 into a KCF classifier 515 respectively for classification operation to obtain a third response map 516.
And performing weighted summation on the response maps of the second candidate response map 416 and the third response map 516 by using a second preset weight value to obtain a third fused response map 517.
The coordinates of the response map having the largest response value in the third fused response map 517 are determined as the third target center.
Further, the third target center may be determined as the target position in the current frame image, and the final target window 518 may also be determined with the third target center as the center, for example, by performing clipping with a preset size.
It should be noted that fig. 5 is described with 3 feature maps; when N > 3, the weighted fusion may be continued iteratively in the same way to determine the Nth target center.
Specifically, for the second to Nth feature maps of the previous frame image, the candidate response map corresponding to each feature map may be calculated respectively, where for the nth feature map of the previous frame image and the nth feature map of the current frame image, feature maps of a preset size are respectively clipped with the nth target center as the center, and an nth candidate response map is obtained by using a classification algorithm, the nth candidate response map including a plurality of response maps, where n is a positive integer and 1 < n < N.
For the second to Nth feature maps of the previous frame image, the response mapping map corresponding to each feature map is calculated respectively, wherein for the (n+1)th feature map of the previous frame image and the (n+1)th feature map of the current frame image, feature maps of a preset size are respectively cut out with the nth target center as the center, and an (n+1)th response mapping map is obtained by adopting a classification algorithm, the (n+1)th response mapping map comprising a plurality of response mappings.
For the second to Nth feature maps of the previous frame image, the fusion response mapping map corresponding to each feature map is calculated respectively, wherein the response mappings of the nth candidate response mapping map and the (n+1)th response mapping map are weighted and summed with an nth preset weight value to obtain an (n+1)th fusion response mapping map.
The coordinates of the response mapping with the maximum response value in the (n+1)th fusion response mapping map are determined as the (n+1)th target center.
When n+1 equals N, the (n+1)th target center is determined as the target position in the current frame image, and a final target window may be determined with the (n+1)th target center as the center.
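Putting the above steps together, the following sketch outlines the coarse-to-fine iteration. It is only illustrative: kcf_response is a hypothetical helper that combines the cropping and KCF classification sketched earlier, and interpreting each "preset weight value" as a convex combination w*candidate + (1-w)*next is an assumption, as the patent does not spell out the exact weighting.

```python
import numpy as np

def argmax_coord(response_map):
    # Coordinates of the response mapping with the maximum response value.
    return np.unravel_index(np.argmax(response_map), response_map.shape)

def estimate_target_position(prev_maps, curr_maps, prev_center, weights, crop_size):
    """prev_maps / curr_maps: the N feature maps of the previous and current
    frames, deepest convolutional layer first; prev_center: target centre in
    the previous frame; weights: the N-1 preset weight values.
    kcf_response(prev_fm, curr_fm, center, crop_size) is a hypothetical helper
    that crops both feature maps around `center` and runs the KCF classifier,
    returning a response mapping map."""
    # steps S102-S103: first response mapping map and first target centre
    resp = kcf_response(prev_maps[0], curr_maps[0], prev_center, crop_size)
    center = argmax_coord(resp)
    for n in range(1, len(prev_maps)):          # n = 1 .. N-1
        # nth candidate response mapping map (same feature map, new centre)
        cand = kcf_response(prev_maps[n - 1], curr_maps[n - 1], center, crop_size)
        # (n+1)th response mapping map (next, shallower feature map)
        nxt = kcf_response(prev_maps[n], curr_maps[n], center, crop_size)
        # (n+1)th fusion response mapping map via the nth preset weight value
        fused = weights[n - 1] * cand + (1.0 - weights[n - 1]) * nxt
        # (n+1)th target centre
        center = argmax_coord(fused)
    # when n+1 == N, the last centre is the target position in the current frame
    return center
```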
In the embodiment of the invention, N characteristic maps of the previous frame image and the current frame image are respectively obtained, the classification operation is carried out on every two characteristic maps in sequence, and the obtained response maps are subjected to weighted summation to obtain the fused response map, so that each characteristic map can be subjected to multiple iterative operations and the result is applied to the estimation of the target position, the target tracking accuracy is favorably optimized, and the estimation result accuracy is improved.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a target position estimation apparatus according to an embodiment of the present invention. The target position estimating apparatus may include:
an obtaining module 601, adapted to obtain N feature maps obtained by calculating a previous frame of image by N convolutional layers in a convolutional neural network, obtain N feature maps obtained by calculating a current frame of image by N convolutional layers in the convolutional neural network, where the size of each feature map is consistent and includes a plurality of response maps, each response map includes a response value and its coordinates, where N is less than or equal to the number of convolutional layers in the convolutional neural network and is a positive integer, and the N feature maps of each frame of image are arranged in a reverse order according to the layer numbers of the convolutional layers generating the feature maps;
a first map determining module 602, adapted to respectively crop feature maps of a preset size with a coordinate of a target center point of a previous frame image as a center in the first feature map of the previous frame image and the first feature map of the current frame image, and obtain a first response map by using a classification operation, where the first response map includes multiple response maps, and the first feature map is a feature map generated by calculating a last convolution layer;
a first center determining module 603 adapted to determine coordinates of a response map having a largest response value in the first response map as a first target center;
a first candidate map determining module 604, adapted to respectively cut feature maps of a preset size from the first feature map of the previous frame image and the first feature map of the current frame image with the first target center as a center, and obtain a first candidate response map by using a classification algorithm, where the first candidate response map includes multiple response maps;
a second map determining module 605, adapted to respectively cut feature maps of a preset size from the second feature map of the previous frame image and the second feature map of the current frame image with the first target center as a center, and obtain a second response map by using a classification algorithm, where the second response map includes multiple response maps;
a first fused map determining module 606, adapted to perform weighted summation on the response maps of the first candidate response map and the second response map by using a first preset weight value to obtain a second fused response map;
a second center determining module 607 adapted to determine coordinates of a response map having a largest response value in the second fused response map as a second target center;
an nth candidate map determining module 608, adapted to calculate, for the second to Nth feature maps of the previous frame image, the candidate response map corresponding to each feature map, wherein for the nth feature map of the previous frame image and the nth feature map of the current frame image, feature maps of a preset size are respectively clipped with the nth target center as the center, and an nth candidate response map is obtained by using a classification algorithm, the nth candidate response map including a plurality of response maps, where n is a positive integer and 1 < n < N;
an (n+1)th response map determining module 609, adapted to calculate, for the second to Nth feature maps of the previous frame image, the response map corresponding to each feature map, wherein for the (n+1)th feature map of the previous frame image and the (n+1)th feature map of the current frame image, feature maps of a preset size are respectively clipped with the nth target center as the center, and an (n+1)th response map is obtained by using a classification algorithm, the (n+1)th response map including a plurality of response maps;
an nth fusion map determining module 610, adapted to calculate, for the second to Nth feature maps of the previous frame image, the fusion response map corresponding to each feature map, wherein an nth preset weight value is adopted to perform weighted summation on the response maps of the nth candidate response map and the (n+1)th response map to obtain an (n+1)th fusion response map;
an (n+1)th center determining module 611, adapted to determine the coordinates of the response map with the largest response value in the (n+1)th fusion response map as the (n+1)th target center;
a target position determining module 612, adapted to determine the (n+1)th target center as the target position in the current frame image when n+1 equals N.
Further, the first map determining module 602 may include: a first target image clipping sub-module (not shown) adapted to clip a first target feature image of a preset size using a first target window of a preset size in the first feature image of the previous frame image, with the coordinate of the target center point in the previous frame image as the center; a first search map clipping sub-module (not shown) adapted to clip a first search feature map of a preset size using a first search window of a preset size in the first feature map of the current frame image and centering on a coordinate where a target center point in a previous frame image is located; and a classification operation sub-module (not shown) adapted to input the first target feature map and the first search feature map into a classifier respectively for performing the classification operation, so as to obtain a first response map.
For the principle, specific implementation and beneficial effects of the target position estimation apparatus, please refer to the related description of the target position estimation method shown in fig. 1 to 5 and the foregoing, which is not repeated herein.
Embodiments of the present invention also provide a storage medium having stored thereon computer instructions which, when executed, perform the steps of the target position estimation method shown in fig. 1 to 5. The storage medium may be a computer-readable storage medium, and may include, for example, a non-volatile or non-transitory memory, and may further include an optical disc, a mechanical hard disk, a solid-state drive, and the like.
An embodiment of the present invention further provides a terminal, including a memory and a processor, where the memory stores computer instructions capable of being executed on the processor, and the processor executes the computer instructions to perform the steps of the target position estimation method shown in fig. 1 to 5. The terminal includes, but is not limited to, a mobile phone, a computer, a tablet computer and other terminal devices.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A method of estimating a target position, comprising the steps of:
acquiring N feature maps obtained by calculating a previous frame of image through N convolutional layers in a convolutional neural network, acquiring N feature maps obtained by calculating a current frame of image through the N convolutional layers in the convolutional neural network, wherein the size of each feature map is consistent and comprises a plurality of response maps, each response map comprises a response value and a coordinate thereof, N is less than or equal to the number of the convolutional layers in the convolutional neural network and is a positive integer, and the N feature maps of each frame of image are arranged in a reverse order according to the layer number of the convolutional layer for generating each feature map;
respectively cutting feature maps with preset sizes by taking coordinates where a target center point of a previous frame image is located as a center in the first feature map of the previous frame image and the first feature map of the current frame image, and obtaining a first response mapping map by adopting classification operation, wherein the first response mapping map comprises a plurality of response mappings, and the first feature map is a feature map generated by calculating the last convolution layer;
determining coordinates of a response map with a maximum response value in the first response map as a first target center;
respectively cutting feature maps with preset sizes in the first feature map of the previous frame image and the first feature map of the current frame image by taking the first target center as the center, and obtaining a first candidate response map by adopting a classification algorithm, wherein the first candidate response map comprises a plurality of response maps;
respectively cutting feature maps with preset sizes in the second feature map of the previous frame image and the second feature map of the current frame image by taking the first target center as the center, and obtaining a second response map by adopting a classification algorithm, wherein the second response map comprises a plurality of response maps;
weighting and summing the response maps of the first candidate response map and the second response map by adopting a first preset weight value to obtain a second fusion response map;
determining the coordinate of the response mapping with the maximum response value in the second fusion response mapping map as a second target center;
for the second to the N-th feature maps of the previous frame image, respectively calculating the candidate response map corresponding to each feature map, wherein feature maps of a preset size are cropped from the n-th feature map of the previous frame image and the n-th feature map of the current frame image, each centered on the n-th target center, and an n-th candidate response map is obtained by a classification algorithm, the n-th candidate response map comprising a plurality of responses, where n is a positive integer and 1 < n < N;
for the second to the N-th feature maps of the previous frame image, respectively calculating the response map corresponding to each feature map, wherein feature maps of a preset size are cropped from the (n+1)-th feature map of the previous frame image and the (n+1)-th feature map of the current frame image, each centered on the n-th target center, and an (n+1)-th response map is obtained by a classification algorithm, the (n+1)-th response map comprising a plurality of responses;
for the second to the N-th feature maps of the previous frame image, respectively calculating the fused response map corresponding to each feature map, wherein the responses of the n-th candidate response map and the (n+1)-th response map are weighted and summed with an n-th preset weight to obtain an (n+1)-th fused response map;
determining the coordinates of the response with the maximum response value in the (n+1)-th fused response map as the (n+1)-th target center;
and when n+1 is equal to N, determining the (n+1)-th target center as the target position in the current frame image.
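For readers who want to see the control flow of claim 1 end to end, below is a minimal NumPy sketch of one possible reading of those steps. The FFT cross-correlation used as the classification operation, the equal preset weights, and the helper names (crop_patch, response_map, peak_to_center, estimate_target_position) are assumptions made for illustration and are not taken from the patent.

```python
import numpy as np

def crop_patch(feat, center, size):
    """Crop a size x size window from a 2-D feature map around `center`,
    padding with edge values so windows near the border stay valid."""
    cy = int(np.clip(center[0], 0, feat.shape[0] - 1))
    cx = int(np.clip(center[1], 0, feat.shape[1] - 1))
    half = size // 2
    padded = np.pad(feat, half, mode="edge")
    return padded[cy:cy + size, cx:cx + size]

def response_map(target_patch, search_patch):
    """Stand-in for the classification operation: circular cross-correlation
    of the search patch with the target patch, computed via the FFT."""
    corr = np.real(np.fft.ifft2(np.conj(np.fft.fft2(target_patch)) *
                                np.fft.fft2(search_patch)))
    return np.fft.fftshift(corr)          # zero displacement at the map center

def peak_to_center(resp, ref_center, size):
    """Translate the peak of a response map back into feature-map coordinates."""
    peak = np.unravel_index(np.argmax(resp), resp.shape)
    return (ref_center[0] + peak[0] - size // 2,
            ref_center[1] + peak[1] - size // 2)

def estimate_target_position(prev_feats, curr_feats, prev_center,
                             patch_size=32, weights=None):
    """prev_feats / curr_feats: lists of N equally sized 2-D feature maps,
    ordered from the last convolutional layer to the earlier ones.
    prev_center: (row, col) of the target center in the previous frame."""
    N = len(prev_feats)
    weights = weights if weights is not None else [0.5] * (N - 1)

    # First response map: deepest-layer features cropped around the previous
    # frame's target center; its peak gives the first target center.
    r1 = response_map(crop_patch(prev_feats[0], prev_center, patch_size),
                      crop_patch(curr_feats[0], prev_center, patch_size))
    center = peak_to_center(r1, prev_center, patch_size)

    # Levels 2..N: fuse the current level's candidate map with the next
    # level's map, then move the center to the peak of the fused map.
    for n in range(1, N):
        cand = response_map(crop_patch(prev_feats[n - 1], center, patch_size),
                            crop_patch(curr_feats[n - 1], center, patch_size))
        nxt = response_map(crop_patch(prev_feats[n], center, patch_size),
                           crop_patch(curr_feats[n], center, patch_size))
        fused = weights[n - 1] * cand + (1.0 - weights[n - 1]) * nxt
        center = peak_to_center(fused, center, patch_size)

    return center    # estimated target position in the current frame
```

The point the sketch tries to mirror is the coarse-to-fine hand-off: the peak of each fused response map becomes the crop center for the next, shallower pair of feature maps, so the deepest layer localizes the target roughly and earlier layers refine the estimate.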
2. The target position estimation method according to claim 1, wherein acquiring the N feature maps obtained by computing the previous frame image through the N convolutional layers of the convolutional neural network, and acquiring the N feature maps obtained by computing the current frame image through the N convolutional layers of the convolutional neural network, comprises:
respectively obtaining the feature maps produced after the previous frame image passes through the N convolutional layers of the convolutional neural network;
respectively obtaining the feature maps produced after the current frame image passes through the N convolutional layers of the convolutional neural network;
and scaling the N feature maps of the previous frame image and the N feature maps of the current frame image, respectively, to a preset feature map size.
3. The target position estimation method according to claim 2, wherein the N feature maps of the previous frame image and the N feature maps of the current frame image are each scaled to the preset feature map size using bilinear interpolation or trilinear interpolation.
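To make the scaling step of claims 2 and 3 concrete, the following is a small pure-NumPy bilinear resize; the function name and the align-corners sampling convention are choices made for this sketch rather than details specified by the patent.

```python
import numpy as np

def resize_bilinear(feat, out_h, out_w):
    """Scale a 2-D feature map to (out_h, out_w) with bilinear interpolation
    (align-corners sampling, chosen for simplicity)."""
    in_h, in_w = feat.shape
    ys = np.linspace(0, in_h - 1, out_h)       # sample rows in the source map
    xs = np.linspace(0, in_w - 1, out_w)       # sample columns in the source map
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                    # vertical interpolation weights
    wx = (xs - x0)[None, :]                    # horizontal interpolation weights
    top = feat[np.ix_(y0, x0)] * (1 - wx) + feat[np.ix_(y0, x1)] * wx
    bottom = feat[np.ix_(y1, x0)] * (1 - wx) + feat[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bottom * wy
```

Each of the N feature maps of the previous frame and of the current frame would be run through such a resize so that all maps share the preset feature map size before any cropping or classification.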
4. The target position estimation method according to claim 1, wherein cropping feature maps of a preset size from the first feature map of the previous frame image and the first feature map of the current frame image, each centered on the coordinates of the target center point in the previous frame image, and obtaining the first response map by a classification operation, comprises:
in the first feature map of the previous frame image, cropping a first target feature map of a preset size with a first target window of a preset size, centered on the coordinates of the target center point in the previous frame image;
in the first feature map of the current frame image, cropping a first search feature map of a preset size with a first search window of a preset size, centered on the coordinates of the target center point in the previous frame image;
and inputting the first target feature map and the first search feature map into a classifier for classification to obtain the first response map.
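One way to picture claim 4 is with a small target window and a larger search window, both centered on the previous frame's target center, and a plain zero-mean correlation standing in for the classifier. The window sizes (16 and 48) and function names below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def crop(feat, center, size):
    """Crop a size x size window around `center`, padding at the borders."""
    half = size // 2
    padded = np.pad(feat, half, mode="edge")
    cy, cx = center
    return padded[cy:cy + size, cx:cx + size]

def first_response_map(prev_feat1, curr_feat1, prev_center,
                       target_size=16, search_size=48):
    """First target feature map (previous frame, small window) correlated
    against the first search feature map (current frame, larger window),
    both centered on the previous frame's target center."""
    t = crop(prev_feat1, prev_center, target_size)
    s = crop(curr_feat1, prev_center, search_size)
    t = t - t.mean()                      # zero-mean template
    out_dim = search_size - target_size + 1
    out = np.zeros((out_dim, out_dim))
    for i in range(out_dim):              # slide the target window over the
        for j in range(out_dim):          # search window ("valid" correlation)
            out[i, j] = np.sum(t * s[i:i + target_size, j:j + target_size])
    return out                            # one response value per coordinate
```

Each entry of the returned map is a response value, and its coordinates identify the candidate target position with which it is associated.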
5. The target position estimation method according to claim 1, wherein the algorithm used for the classification operation comprises: the KCF algorithm, the AdaBoost algorithm, or the SVM algorithm.
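Since claim 5 names KCF as one admissible classifier, here is a single-channel sketch of kernelized correlation filter training and detection in the style of Henriques et al. The Gaussian-kernel variant, the regularization constant and the label width are assumptions, and practical details such as cosine windowing, multi-channel features and online model updates are omitted.

```python
import numpy as np

def gaussian_kernel_corr(x, z, sigma=0.5):
    """Gaussian kernel correlation between two equally sized patches,
    evaluated for all cyclic shifts at once via the FFT."""
    c = np.real(np.fft.ifft2(np.conj(np.fft.fft2(x)) * np.fft.fft2(z)))
    d = (np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * c) / x.size
    return np.exp(-np.maximum(d, 0) / (sigma ** 2))

def kcf_train(x, sigma=0.5, lam=1e-4, label_sigma=2.0):
    """Learn the dual coefficients (in the Fourier domain) for a target
    patch x against a Gaussian regression label peaked at zero shift."""
    h, w = x.shape
    ys, xs = np.meshgrid(np.arange(h) - h // 2, np.arange(w) - w // 2,
                         indexing="ij")
    y = np.exp(-(ys ** 2 + xs ** 2) / (2 * label_sigma ** 2))
    y = np.roll(y, (-(h // 2), -(w // 2)), axis=(0, 1))   # peak at (0, 0)
    k = gaussian_kernel_corr(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)        # alpha_f

def kcf_detect(alpha_f, x, z, sigma=0.5):
    """Response map for a new patch z given the trained template x."""
    k = gaussian_kernel_corr(x, z, sigma)
    return np.real(np.fft.ifft2(alpha_f * np.fft.fft2(k)))
```

The displacement of the detection peak from the patch origin estimates how far the target has moved between the template frame and the new frame.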
6. The target position estimation method according to claim 1, wherein the convolutional neural network comprises: AlexNet, VGGNet, or GoogLeNet.
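To show how the N feature maps of claim 1 might be pulled from one of the networks listed in claim 6, below is a hedged PyTorch/torchvision sketch that collects three convolutional stages of VGG16 and scales them to one preset size in reverse layer order. The chosen layer indices (15, 22 and 29, i.e. the ReLUs after conv3_3, conv4_3 and conv5_3), the output size and the function name are illustrative assumptions, and in practice pretrained weights would be loaded instead of the random initialization used here.

```python
import torch
import torch.nn.functional as F
import torchvision

def extract_feature_maps(image, layer_indices=(15, 22, 29), size=(56, 56)):
    """image: float tensor of shape (1, 3, H, W).
    Returns the selected conv-layer outputs, each bilinearly scaled to `size`
    and ordered from the deepest selected layer to the shallowest."""
    # Randomly initialized here; in practice pretrained weights (for example
    # torchvision's ImageNet weights for VGG16) would be loaded.
    backbone = torchvision.models.vgg16().features.eval()
    feats = {}
    x = image
    with torch.no_grad():
        for idx, layer in enumerate(backbone):
            x = layer(x)
            if idx in layer_indices:
                feats[idx] = F.interpolate(x, size=size, mode="bilinear",
                                           align_corners=False)
    return [feats[idx] for idx in sorted(layer_indices, reverse=True)]
```

Calling the function once on the previous frame and once on the current frame yields the two lists of feature maps consumed by the position-estimation loop sketched after claim 1.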
7. A target position estimation device, characterized by comprising:
an acquisition module, adapted to acquire N feature maps obtained by computing a previous frame image through N convolutional layers of a convolutional neural network, and to acquire N feature maps obtained by computing a current frame image through the same N convolutional layers, wherein all feature maps are of a consistent size, each feature map comprises a plurality of responses, each response comprises a response value and its coordinates, N is a positive integer no greater than the number of convolutional layers in the convolutional neural network, and the N feature maps of each frame image are arranged in reverse order of the layer index of the convolutional layer that generated them;
a first map determining module, adapted to crop feature maps of a preset size from the first feature map of the previous frame image and the first feature map of the current frame image, each centered on the coordinates of the target center point in the previous frame image, and to obtain a first response map by a classification operation, wherein the first response map comprises a plurality of responses, and the first feature map is the feature map generated by the last convolutional layer;
a first center determining module, adapted to determine the coordinates of the response with the maximum response value in the first response map as a first target center;
a first candidate map determining module, adapted to crop feature maps of a preset size from the first feature map of the previous frame image and the first feature map of the current frame image, each centered on the first target center, and to obtain a first candidate response map by a classification algorithm, wherein the first candidate response map comprises a plurality of responses;
a second map determining module, adapted to crop feature maps of a preset size from the second feature map of the previous frame image and the second feature map of the current frame image, each centered on the first target center, and to obtain a second response map by a classification algorithm, wherein the second response map comprises a plurality of responses;
a first fused map determining module, adapted to perform a weighted summation of the responses of the first candidate response map and the second response map with a first preset weight to obtain a second fused response map;
a second center determining module, adapted to determine the coordinates of the response with the maximum response value in the second fused response map as a second target center;
an n-th candidate map determining module, adapted to calculate, for the second to the N-th feature maps of the previous frame image, the candidate response map corresponding to each feature map, wherein feature maps of a preset size are cropped from the n-th feature map of the previous frame image and the n-th feature map of the current frame image, each centered on the n-th target center, and an n-th candidate response map is obtained by a classification algorithm, the n-th candidate response map comprising a plurality of responses, where n is a positive integer and 1 < n < N;
an (n+1)-th response map determining module, adapted to calculate, for the second to the N-th feature maps of the previous frame image, the response map corresponding to each feature map, wherein feature maps of a preset size are cropped from the (n+1)-th feature map of the previous frame image and the (n+1)-th feature map of the current frame image, each centered on the n-th target center, and an (n+1)-th response map is obtained by a classification algorithm, the (n+1)-th response map comprising a plurality of responses;
an n-th fused map determining module, adapted to calculate, for the second to the N-th feature maps of the previous frame image, the fused response map corresponding to each feature map, wherein the responses of the n-th candidate response map and the (n+1)-th response map are weighted and summed with an n-th preset weight to obtain an (n+1)-th fused response map;
an (n+1)-th center determining module, adapted to determine the coordinates of the response with the maximum response value in the (n+1)-th fused response map as the (n+1)-th target center;
and a target position determining module, adapted to determine the (n+1)-th target center as the target position in the current frame image when n+1 is equal to N.
8. The target position estimation device according to claim 7, wherein the first map determining module comprises:
a first target map cropping sub-module, adapted to crop, in the first feature map of the previous frame image, a first target feature map of a preset size with a first target window of a preset size, centered on the coordinates of the target center point in the previous frame image;
a first search map cropping sub-module, adapted to crop, in the first feature map of the current frame image, a first search feature map of a preset size with a first search window of a preset size, centered on the coordinates of the target center point in the previous frame image;
and a classification operation sub-module, adapted to input the first target feature map and the first search feature map into a classifier for classification to obtain the first response map.
9. A storage medium having computer instructions stored thereon, wherein the computer instructions, when executed, perform the steps of the target position estimation method according to any one of claims 1 to 6.
10. A terminal comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the target position estimation method of any one of claims 1 to 6.
CN201910038152.3A 2019-01-15 2019-01-15 Target position estimation method and device, storage medium and terminal Active CN111291745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910038152.3A CN111291745B (en) 2019-01-15 2019-01-15 Target position estimation method and device, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910038152.3A CN111291745B (en) 2019-01-15 2019-01-15 Target position estimation method and device, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN111291745A true CN111291745A (en) 2020-06-16
CN111291745B CN111291745B (en) 2022-06-14

Family

ID=71024080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910038152.3A Active CN111291745B (en) 2019-01-15 2019-01-15 Target position estimation method and device, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN111291745B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989368A (en) * 2015-02-13 2016-10-05 展讯通信(天津)有限公司 Target detection method and apparatus, and mobile terminal
CN107016689A (en) * 2017-02-04 2017-08-04 中国人民解放军理工大学 A kind of correlation filtering of dimension self-adaption liquidates method for tracking target
CN107203765A (en) * 2017-03-30 2017-09-26 腾讯科技(上海)有限公司 Sensitive Image Detection Method and device
CN108550126A (en) * 2018-04-18 2018-09-18 长沙理工大学 A kind of adaptive correlation filter method for tracking target and system
CN108665481A (en) * 2018-03-27 2018-10-16 西安电子科技大学 Multilayer depth characteristic fusion it is adaptive resist block infrared object tracking method

Also Published As

Publication number Publication date
CN111291745B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN112069896B (en) Video target tracking method based on twin network fusion multi-template features
CN109086811B (en) Multi-label image classification method and device and electronic equipment
Bousetouane et al. Improved mean shift integrating texture and color features for robust real time object tracking
CN111260688A (en) Twin double-path target tracking method
Huang et al. Joint blur kernel estimation and CNN for blind image restoration
KR20180105876A (en) Method for tracking image in real time considering both color and shape at the same time and apparatus therefor
Singh et al. A novel approach to combine features for salient object detection using constrained particle swarm optimization
CN111709295A (en) SSD-MobileNet-based real-time gesture detection and recognition method and system
Li et al. Robust scale adaptive kernel correlation filter tracker with hierarchical convolutional features
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
CN109166139B (en) Scale self-adaptive target tracking method combined with rapid background suppression
JP6756406B2 (en) Image processing equipment, image processing method and image processing program
CN111429481B (en) Target tracking method, device and terminal based on adaptive expression
CN114399808A (en) Face age estimation method and system, electronic equipment and storage medium
Sui et al. Exploiting the anisotropy of correlation filter learning for visual tracking
WO2020194792A1 (en) Search device, learning device, search method, learning method, and program
CN115063526A (en) Three-dimensional reconstruction method and system of two-dimensional image, terminal device and storage medium
CN116563285B (en) Focus characteristic identifying and dividing method and system based on full neural network
CN111105436A (en) Target tracking method, computer device, and storage medium
CN111291745B (en) Target position estimation method and device, storage medium and terminal
CN114842506A (en) Human body posture estimation method and system
CN113807354B (en) Image semantic segmentation method, device, equipment and storage medium
CN113688655B (en) Method, device, computer equipment and storage medium for identifying interference signals
CN113963236A (en) Target detection method and device
CN110490235B (en) Vehicle object viewpoint prediction and three-dimensional model recovery method and device facing 2D image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant