CN113129332A - Method and apparatus for performing target object tracking

Info

Publication number: CN113129332A
Authority: CN (China)
Prior art keywords: feature, target object, local, stage, bounding box
Legal status: Pending
Application number: CN202010044865.3A
Other languages: Chinese (zh)
Inventors: 陈一伟, 徐静涛, 俞佳茜, 俞炳仁, 韩在濬, 李贤廷, 崔昌圭, 王强, 谭航凯
Current Assignee: Beijing Samsung Telecom R&D Center; Beijing Samsung Telecommunications Technology Research Co Ltd; Samsung Electronics Co Ltd
Original Assignee: Beijing Samsung Telecommunications Technology Research Co Ltd; Samsung Electronics Co Ltd
Application filed by Beijing Samsung Telecommunications Technology Research Co Ltd and Samsung Electronics Co Ltd
Priority applications: CN202010044865.3A (CN113129332A); KR1020200179773A (KR20210092672A); US17/146,809 (US11915432B2)

Classifications

    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F18/24: Classification techniques
    • G06T7/223: Analysis of motion using block-matching
    • G06T7/50: Depth or shape recovery
    • G06T2207/10016: Video; Image sequence
    • G06T2207/20081: Training; Learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/20221: Image fusion; Image merging
    • G06T2210/12: Bounding box

Abstract

Disclosed are a method and an apparatus for performing target object tracking. The method includes: acquiring a first depth feature of a target object region image and a second depth feature of a search region image; obtaining a global response map according to the first depth feature and the second depth feature, and obtaining a first-stage target object bounding box prediction result according to the global response map; updating the second depth feature according to the first-stage target object bounding box prediction result; and obtaining a plurality of local feature blocks based on the first depth feature, obtaining a local response map according to the plurality of local feature blocks and the updated second depth feature, and obtaining a second-stage target object bounding box prediction result according to the local response map.

Description

Method and apparatus for performing target object tracking
Technical Field
The present invention relates generally to the field of computer vision, and more particularly, to a method and apparatus for performing target object tracking.
Background
Visual object tracking is an important direction in computer vision. Its task is, given the first frame image of a video sequence and the bounding box of a target object in that frame, to continuously predict the bounding box of the target object in subsequent frame images. The target object may be an arbitrary object or a part of an object. Vision-based target tracking is very challenging because little prior information is given and scenes can be complex. Complex scenes mainly include cases where the target object is partially or completely occluded, the target object deforms, the target object moves rapidly, frame images are blurred, scene illumination changes, or the viewpoint of the video changes greatly. Because target tracking establishes associations of a target object across different moments, it is widely used in computer vision, particularly in video applications such as camera focus tracking, action recognition, live event broadcasting, security monitoring, and human-computer interaction.
However, existing target tracking methods use only a one-stage network and perform only a single correlation operation between the target object features and the search region features; only the global feature correlation of the target object is considered, which reduces tracking accuracy. In view of this, a target tracking method and apparatus capable of improving tracking accuracy are needed.
Disclosure of Invention
To address the problem of low target tracking accuracy, the present invention provides a target object tracking method and system that combine block correlation with global correlation under a two-stage framework.
According to an aspect of the present invention, there is provided a method for tracking a target object using a cascade network, the method comprising: acquiring a first depth feature of a target object region image and a second depth feature of a search region image; obtaining a global response map according to the first depth feature and the second depth feature, and obtaining a first-stage target object bounding box prediction result according to the global response map; updating the second depth feature according to the first-stage target object bounding box prediction result; and obtaining a plurality of local feature blocks based on the first depth feature, obtaining a local response map according to the plurality of local feature blocks and the updated second depth feature, and obtaining a second-stage target object bounding box prediction result according to the local response map.
According to an exemplary embodiment, obtaining the plurality of local feature blocks based on the first depth feature may comprise: partitioning the first depth feature, or a third depth feature obtained by further feature extraction on the first depth feature, to obtain the plurality of local feature blocks. Obtaining the local response map according to the plurality of local feature blocks and the updated second depth feature may comprise: performing block correlation between the plurality of local feature blocks and the updated second depth feature, or a fourth depth feature obtained by further feature extraction on the updated second depth feature, to obtain the local response map.
According to an exemplary embodiment, performing block correlation between the plurality of local feature blocks and the updated second depth feature, or the fourth depth feature obtained by further feature extraction on the updated second depth feature, to obtain the local response map may include: performing block correlation between each of the plurality of local feature blocks and the updated second depth feature or fourth depth feature to obtain a plurality of local sub-response maps, and fusing the plurality of local sub-response maps to obtain the local response map.
According to an exemplary embodiment, obtaining the second-stage target object bounding box prediction result according to the local response map may include: predicting the position offset and size offset of the second-stage target object bounding box according to the local response map, and obtaining the second-stage target object bounding box prediction result according to the predicted position offset and size offset.
According to an exemplary embodiment, fusing the plurality of local sub-response maps to obtain the local response map comprises: classifying each of the plurality of local feature blocks as a target object feature block or a background feature block; and fusing the local sub-response maps corresponding to the target object feature blocks and the sub-response maps corresponding to the background feature blocks to obtain the local response map.
According to an exemplary embodiment, classifying each of the plurality of local feature blocks as a target object feature block or a background feature block comprises: taking an initial target object bounding box calibrated on the target object region image as the classification basis, and classifying each local feature block as a target object feature block or a background feature block according to the ratio of the overlapping area between the local feature block and the initial target object bounding box to the area of the local feature block.
According to an exemplary embodiment, the first-stage target object bounding box prediction result and the second-stage target object bounding box prediction result each include position information and size information of the target object bounding box; the position offset may be the coordinate offset between the center position coordinates of the second-stage target object bounding box and those of the first-stage target object bounding box, and the size offset may be the size offset between the second-stage target object bounding box and a pre-specified target object bounding box. Obtaining the second-stage target object bounding box prediction result according to the predicted position offset and size offset may include: when the sum of the absolute values of the coordinate offsets is greater than a preset threshold, taking the first-stage target object bounding box prediction result as the second-stage target object bounding box prediction result; when the sum of the absolute values of the coordinate offsets is less than or equal to the preset threshold, obtaining the second-stage target object bounding box prediction result by adding the predicted position offset to the center position of the first-stage target object bounding box and adding the predicted size offset to the size of the pre-specified target object bounding box.
According to an exemplary embodiment, obtaining the first-stage target object bounding box prediction result according to the global response map may include: taking the position of the maximum value in the global response map as the position information included in the first-stage target object bounding box prediction result, and taking the size of a target object bounding box predicted on an image preceding the current frame image as the size information included in the first-stage target object bounding box prediction result.
According to an exemplary embodiment, partitioning the first depth feature, or the third depth feature obtained by further feature extraction on the first depth feature, to obtain the plurality of local feature blocks may comprise partitioning the first or third depth feature according to one of the following three division modes: a mode in which local feature blocks do not overlap one another; a mode in which adjacent local feature blocks overlap one another; and a mode based on a predetermined block distribution.
According to another aspect of the present invention, there is provided an apparatus for performing target object tracking, the apparatus comprising: a first-stage tracker configured to: acquire a first depth feature of a target object region image and a second depth feature of a search region image; obtain a global response map according to the first depth feature and the second depth feature, and obtain a first-stage target object bounding box prediction result according to the global response map; and update the second depth feature according to the first-stage target object bounding box prediction result; and a second-stage tracker configured to: obtain a plurality of local feature blocks based on the first depth feature, obtain a local response map according to the plurality of local feature blocks and the updated second depth feature, and obtain a second-stage target object bounding box prediction result according to the local response map.
According to another aspect of the present invention, there is provided an electronic device comprising a processor and a memory storing program instructions, wherein the program instructions, when executed by the processor, cause the processor to perform the target tracking method described above.
According to another aspect of the present invention, there is provided a computer-readable recording medium having program instructions recorded thereon, wherein the program instructions, when executed by a processor, cause the processor to perform the object tracking method as described above.
With the target tracking method and system of the present invention, two-stage tracking that combines global correlation and block correlation via a cascade network effectively improves target tracking accuracy. The method is also lightweight and computationally inexpensive, enabling high-precision, stable, real-time tracking of the target object.
Drawings
These and/or other aspects and advantages of the present application will become more apparent and more readily appreciated from the following detailed description of the embodiments of the present application, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart of a conventional target tracking device for tracking a target object;
FIG. 2 is a schematic diagram of a prior art target tracking method;
FIG. 3 shows a schematic view of the concept of the object tracking method according to the invention;
FIG. 4 shows a detailed schematic of the concept of the object tracking method according to the invention;
FIG. 5 is a flow chart illustrating a target tracking method according to the present invention;
FIG. 6 is a schematic diagram of a global correlation operation according to the present invention;
FIG. 7 is a schematic diagram of the first stage operation of the target tracking method according to the present invention;
FIG. 8 is a schematic diagram of a block partitioning scheme according to the present invention;
FIG. 9 is a schematic diagram of block correlation operations according to the present invention;
FIG. 10 is a schematic illustration of interference suppression and response map fusion according to the present invention;
FIG. 11 is a schematic illustration of adaptive prediction according to the present invention;
FIG. 12 is a schematic diagram of a second stage operation of the target tracking method according to the present invention;
FIG. 13 is a schematic diagram of network training in accordance with the present invention;
FIG. 14 shows the difference in effect between the block correlation method combined with interference suppression proposed by the present invention and the global correlation method and the plain block correlation method;
FIG. 15 is a block diagram of an object tracking device according to the present invention.
Detailed Description
Before describing the inventive concept and exemplary embodiments of the present invention, in order to facilitate a better understanding of the present invention, a brief description will now be made of an object tracking system and an object tracking method in the related art.
Fig. 1 is a schematic flow chart of a target tracking device for tracking a target object. As shown in fig. 1, generally, a monocular camera may be used to capture a first frame of image of a video sequence, and a bounding box of a target object is obtained through manual annotation or Visual object detection (Visual object detection), so as to initialize a target tracking device. In the subsequent frame image of the video sequence, a search area is selected according to the prediction result of the previous frame image, and the target object is tracked by utilizing the image characteristics.
Feature representation of target objects is a key factor affecting the performance of target tracking systems. Existing target tracking methods mainly adopt hand-crafted features or depth features. Common hand-crafted features include Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT) features, and grayscale features. Depth features are obtained by training a parametric model on a large number of samples and are more discriminative and robust than hand-crafted features. In recent years, target tracking methods based on depth features, especially depth features extracted by a Convolutional Neural Network (CNN), have achieved breakthroughs and now surpass traditional hand-crafted-feature methods in both robustness (Robustness) and accuracy (Accuracy).
Some existing target tracking methods based on depth features adopt a twin network (Siamese network) as the basic framework of the tracking network. The twin network extracts the target object image features of the first frame image and the search region image features of the current frame image with the same network parameters, ensuring that the extracted features lie in the same feature space, and then obtains the response map of the target object in the search region image through a correlation of the two features. After training on large-scale datasets, such methods achieve good results in both precision and robustness. Existing twin-network-based target tracking methods can be divided into methods without bounding box regression and methods with bounding box regression.
Methods without bounding box regression extract features of the first-frame target object image and the current-frame search region image in the same feature space through a twin network, and use a correlation operation to obtain a response map representing the degree of matching between the target object image and the search region image. The position corresponding to the maximum value in the response map is taken as the center position of the target object in the current frame image. To handle target size changes, multi-scale testing is used, and the target object bounding box at the scale with the maximum response is taken as the bounding box of the target object in the current frame image.
The bounding box regression method improves on the method without bounding box regression. FIG. 2 is a schematic diagram of a prior art target object tracking method using bounding box regression. As shown in fig. 2, this method further exploits the position response information obtained by the correlation operation. For example, a Region Proposal Network (RPN) is used to simultaneously obtain the classification (target object versus non-target object) of multiple candidate boxes and the coordinate regression results of their bounding boxes, and the candidate box with the highest classification probability is selected as the target object in the current frame image. Because bounding box coordinate regression is learned, target tracking accuracy is high.
A target tracking scheme without bounding box regression yields a lighter model with fewer parameters, since no extra network is needed for bounding box coordinate regression learning, but its precision is lower; moreover, multi-scale testing (one pass is required at each scale) erodes the lightweight advantage. A scheme with bounding box regression, for example one combining a region proposal network, greatly improves tracker performance, but its large parameter count reduces the real-time performance of the system.
In addition, both target tracking methods adopt only a one-stage network, performing a single correlation operation between the target object features and the search region features, and considering only the global feature correlation of the target object. Such processing has the following technical drawbacks. On one hand, because positive and negative samples are imbalanced during network training, a one-stage network tends to suppress the influence of negative samples, so the information of positive samples is under-utilized and tracking accuracy is reduced. Moreover, since the target object is tracked across video image frames, the tracker is susceptible to accumulated errors. When the background is complicated or the illumination or object shape changes excessively, tracking performance degrades if the information around the object is not mined. On the other hand, tracking the target object as a whole using only its global features ignores the local information of the target object, so tracking is not accurate enough when, for example, the target deforms.
Therefore, the invention provides a new target tracking method to improve the accuracy of target tracking.
Hereinafter, concepts and exemplary embodiments of the present invention for performing target object tracking will be described in detail with reference to fig. 3 to 14.
Fig. 3 shows a schematic view of the concept of the target tracking method according to the invention. The target tracking method of the invention comprises two stages. As shown in fig. 3, once the target object region image and the search region image are determined, first-stage tracking is performed to obtain a rough preliminary target tracking result. Second-stage tracking is then performed to obtain an accurate final target tracking result.
Fig. 4 shows a detailed schematic of the concept of the target tracking method according to the invention. As shown in fig. 4, the method includes two stages. The first stage performs rough tracking: once the target object region image and the search region image are determined, global features of each are extracted, a global correlation calculation is performed on the extracted global features to obtain a global response map, and rough matching yields a rough prediction result. The second stage performs fine tracking: a local response map is obtained from the local image features of the target object and the updated search region image features (for example, by performing block correlation between them), and the final target tracking result is obtained from the local response map.
Next, the target tracking method according to the present invention will be described in detail with reference to figs. 5 to 13. Fig. 5 is a flowchart illustrating the target tracking method according to the present invention. Referring to fig. 5, in step S510, a first depth feature of a target object region image and a second depth feature of a search region image are acquired. For example, a video sequence may first be acquired; then, using a first neural network, the first depth feature of the target object region image in the first frame image of the video sequence and the second depth feature of the search region image in the current frame image are extracted. Here, the target object region image may be obtained by cropping the first frame image according to an initial target object bounding box that is manually calibrated, or according to an initial target object bounding box determined by target object detection; the present invention is not limited in this respect. Further, the first depth feature is a global feature of the target object region image, and the second depth feature is a global feature of the search region image. As an example, the first neural network may be a twin convolutional network, but is not limited thereto.
In step S520, a global response map is obtained according to the first depth feature and the second depth feature, and a first-stage target object bounding box prediction result is obtained according to the global response map. Specifically, the global response map is obtained by performing a global correlation calculation on the first depth feature and the second depth feature. For ease of understanding, the correlation operation will be briefly described. In an image task, applying the correlation operation yields a response map Y representing the degree of similarity between two images: the larger a value in Y, the higher the similarity between the corresponding position in the search region image X and the target object region image Z. The correlation calculation is shown below:
Y = corr(Z, X)

Y(i, j) = Σu=1..h Σv=1..w Z(u, v) · X(i+u−1, j+v−1)
where h and w denote the size of the image Z, and i, j, u, v are coordinates in the images. FIG. 6 is a schematic diagram of the global correlation operation according to the present invention. As shown in fig. 6, global correlation performs the correlation operation between the entire image feature of the target object region image (denoted "target object image feature F_T" in fig. 6) and the entire image feature of the search region image (denoted "search region image feature F_St" in fig. 6). Through the global correlation operation, a global response map is obtained. In the present invention, the first-stage target object bounding box prediction result may include position information and size information of the target object bounding box. Specifically, in step S520, obtaining the first-stage target object bounding box prediction result according to the global response map may include: taking the position of the maximum value in the global response map as the position information included in the prediction result, and taking the size of the target object bounding box predicted on the image preceding the current frame image as the size information included in the prediction result.

In step S530, the second depth feature is updated according to the first-stage target object bounding box prediction result to obtain an updated second depth feature. Specifically, the search region image may be cropped according to the first-stage target object bounding box prediction result to obtain a narrowed search region image, and the second depth feature of the narrowed search region image may be extracted using the first neural network and used as the updated second depth feature.
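For illustration, the correlation defined above can be computed with PyTorch, whose conv2d is exactly this sliding inner product; the feature shapes below are assumptions of this sketch, not values from the patent.

    import torch
    import torch.nn.functional as F

    def corr(z_feat, x_feat):
        # z_feat: (1, C, h, w) target (template) feature used as the kernel;
        # x_feat: (1, C, H, W) search feature. F.conv2d slides z_feat over
        # x_feat and returns the (1, 1, H-h+1, W-w+1) response map Y; larger
        # values indicate higher similarity at that position.
        return F.conv2d(x_feat, z_feat)

    z_feat = torch.randn(1, 128, 6, 6)    # illustrative shapes
    x_feat = torch.randn(1, 128, 22, 22)
    y = corr(z_feat, x_feat)              # global response map: (1, 1, 17, 17)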
The above steps S510 to S530 are operations performed in the first-stage tracking according to the target tracking method of the present invention.
In order to understand the first stage of the target tracking method more intuitively, its operation will be briefly described again with reference to fig. 7. As shown in fig. 7, first-stage tracking mainly includes three parts: feature extraction (corresponding to step S510 above), global correlation (corresponding to step S520 above), and feature map cropping (corresponding to step S530 above). According to an exemplary embodiment, in the feature extraction part, feature extraction may be performed on the target object region image and the search region image, respectively, using, for example, a convolutional neural network. A lightweight convolutional neural network φ1 may be adopted for this. For the input target object region image Z and search region image X, the depth feature φ1(Z) of the target object region image and the depth feature φ1(X) of the search region image are obtained using φ1. In the twin convolutional network employed, the parameters of the two branches are shared to ensure that the images are mapped to the same feature space.
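As a minimal sketch of this weight-sharing arrangement (the layer sizes and input resolutions are assumptions of the illustration, not the patent's actual φ1):

    import torch
    import torch.nn as nn

    class TwinExtractor(nn.Module):
        """Sketch of a lightweight twin (Siamese) extractor phi_1."""
        def __init__(self):
            super().__init__()
            # Illustrative layer stack; the real network may differ.
            self.phi1 = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=1, padding=1),
            )

        def forward(self, z, x):
            # The same weights process both branches (weight sharing), so
            # Z and X are mapped into the same feature space.
            return self.phi1(z), self.phi1(x)

    extractor = TwinExtractor()
    z = torch.randn(1, 3, 127, 127)   # target object region image Z (assumed size)
    x = torch.randn(1, 3, 255, 255)   # search region image X (assumed size)
    feat_z, feat_x = extractor(z, x)  # first and second depth features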
Next, in the global correlation part, a global response map is obtained by performing the global correlation operation on the extracted features, and the response map is processed to obtain the first-stage prediction result (i.e., the first-stage target object bounding box prediction result). The global correlation operation obtains a global position (similarity) response map f over the entire target object region image and search region image, which may be expressed as follows:
f = corr(φ1(Z), φ1(X))
After the global response map is obtained, the position of the maximum response value in the global response map may be taken as the first-stage predicted position of the target object bounding box, and the bounding box size may be taken from the target object bounding box predicted for the previous frame image. The first-stage target object bounding box prediction result is thus P1 = (x1, y1, w1, h1), where x1 and y1 are the abscissa and ordinate of the center position of the first-stage target object bounding box, and w1 and h1 are its width and height, respectively.
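A short sketch of this read-out step, assuming the response map is a NumPy array and leaving out the mapping from response-map coordinates back to image coordinates:

    import numpy as np

    def first_stage_box(response, prev_size):
        # response: (H, W) global response map; prev_size: (w, h) of the box
        # predicted on the previous frame. Returns P1 = (x1, y1, w1, h1) in
        # response-map coordinates.
        y1, x1 = np.unravel_index(np.argmax(response), response.shape)
        w1, h1 = prev_size
        return (int(x1), int(y1), w1, h1)

    response = np.random.rand(17, 17)
    print(first_stage_box(response, prev_size=(64, 48)))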
Finally, in the feature map cropping part, the search region image X is cropped according to the center position and size of the target bounding box predicted in the first stage to obtain a search region image X′ covering a smaller area, and the second depth feature of this search region image is extracted to obtain the updated second depth feature φ1(X′) for use in the second stage.
Referring back to fig. 5, the operation of the second stage of the target tracking method according to the present invention will now be described. In step S540, a plurality of local feature blocks are obtained based on the first depth feature, a local response map is obtained according to the plurality of local feature blocks and the updated second depth feature, and a second-stage target object bounding box prediction result is obtained according to the local response map. For example, the first depth feature, or a third depth feature obtained by further feature extraction on the first depth feature, may be partitioned to obtain a plurality of local feature blocks. As an example, further feature extraction may be performed on the first depth feature and the updated second depth feature using a second neural network to obtain a third depth feature of the target object image and a fourth depth feature of the search region image. Here, the third depth feature is obtained by performing a further convolution operation on the first depth feature, and the fourth depth feature is obtained by performing a further convolution operation on the updated second depth feature.
As an example, partitioning the first depth feature, or the third depth feature obtained by further feature extraction on the first depth feature, to obtain the plurality of local feature blocks may follow one of three division modes: a mode in which local feature blocks do not overlap one another; a mode in which adjacent local feature blocks overlap one another; and a mode based on a predetermined block distribution. Here, the predetermined block distribution may be an artificially specified block distribution or a learned block distribution. An artificially specified block distribution partitions blocks according to a specified distribution; for example, when the specified distribution is Gaussian, the blocks concentrate near the center. A learned block distribution may be obtained by treating the parameters of a specific distribution (such as the mean and variance of a Gaussian) as an optimization target and continually adjusting them during training until the most suitable parameters are found. FIG. 8 is a schematic diagram of block partitioning according to the present invention, showing the three division modes.
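The first two division modes can be sketched with torch.Tensor.unfold as below; the block size and stride are illustrative, and the third, distribution-based mode (not shown) would instead place blocks by sampling positions from the specified or learned distribution.

    import torch

    def partition(feat, block, stride):
        # feat: (C, H, W) depth feature -> (N, C, block, block) local blocks.
        # stride == block gives non-overlapping blocks; stride < block gives
        # overlapping adjacent blocks.
        c = feat.shape[0]
        patches = feat.unfold(1, block, stride).unfold(2, block, stride)
        return patches.permute(1, 2, 0, 3, 4).reshape(-1, c, block, block)

    feat = torch.randn(128, 6, 6)                      # illustrative feature
    non_overlap = partition(feat, block=2, stride=2)   # 9 blocks
    overlap = partition(feat, block=2, stride=1)       # 25 blocks
    print(non_overlap.shape, overlap.shape)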
As described above, after the plurality of local feature blocks are obtained, a local response map may be obtained according to the plurality of local feature blocks and the updated second depth feature, and a second-stage target object bounding box prediction result may be obtained according to the local response map. According to an exemplary embodiment, block correlation may be performed between the plurality of local feature blocks and the updated second depth feature, or the fourth depth feature obtained by further feature extraction on the updated second depth feature, to obtain the local response map. Specifically, block correlation is first performed between each of the plurality of local feature blocks and the updated second depth feature or fourth depth feature to obtain a plurality of local sub-response maps; the local sub-response maps are then fused to obtain the local response map. Fig. 9 is a schematic diagram of the block correlation operation according to the present invention. As shown in fig. 9, in block correlation, after the target object feature (the first or third depth feature mentioned above) is partitioned, a correlation operation is performed between each of the resulting local feature blocks and the search region image feature (the updated second depth feature or the fourth depth feature mentioned above), yielding a plurality of local sub-response maps; response map fusion may then be performed to obtain a fused local response map. According to an exemplary embodiment of the present invention, fusing the plurality of local sub-response maps to obtain the local response map may include: classifying each of the plurality of local feature blocks as a target object feature block or a background feature block, and fusing the local sub-response maps corresponding to the target object feature blocks and the sub-response maps corresponding to the background feature blocks to obtain the local response map. This fusion approach further improves the stability and precision of the target tracking method. The reason is that, in addition to the target object, part of the target object region image is background, and the features of the background region can affect stability and accuracy; classifying the local feature blocks into target object feature blocks and background feature blocks before fusion effectively reduces background interference.
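Under the same assumed shapes as before, the block correlation itself can be sketched as follows: each local feature block serves as its own correlation kernel over the search feature, producing one local sub-response map per block.

    import torch
    import torch.nn.functional as F

    def block_correlation(blocks, search_feat):
        # blocks: (N, C, k, k) local feature blocks; search_feat: (1, C, H, W).
        # Treating the N blocks as N conv kernels yields, in one call, the
        # (N, H-k+1, W-k+1) stack of local sub-response maps.
        return F.conv2d(search_feat, blocks).squeeze(0)

    blocks = torch.randn(9, 128, 2, 2)           # e.g. from partition() above
    search_feat = torch.randn(1, 128, 22, 22)    # updated search feature
    sub_maps = block_correlation(blocks, search_feat)
    print(sub_maps.shape)                        # torch.Size([9, 21, 21])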
Fig. 10 is a schematic illustration of interference suppression and response map fusion according to the present invention. As shown in fig. 10, as an example, when classifying local feature blocks, the initial target object bounding box calibrated on the target object region image may be used as the classification basis, and each local feature block may be classified as a target object feature block or a background feature block according to the ratio of the overlapping area between the local feature block and the initial target object bounding box to the area of the local feature block. For example, with the initial target object bounding box as the basis, a local feature block is classified as a target object feature block when more than p% of its area lies inside the bounding box, and as a background feature block when its overlap with the bounding box is less than p%, where p may be a preset threshold. After classification, the local sub-response maps corresponding to the target object feature blocks and the sub-response maps corresponding to the background feature blocks may be fused to obtain the local response map, for example using the following equations:
S = S_o − S_b

S_o = (1/n_o) Σi=1..n_o s_o^i,  S_b = (1/n_b) Σj=1..n_b s_b^j

wherein S is the local response map, s_o^i is the local sub-response map corresponding to the i-th target object feature block, s_b^j is the sub-response map corresponding to the j-th background feature block, n_o is the number of target object feature blocks, and n_b is the number of background feature blocks.
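Assuming the mean-difference fusion above and a precomputed overlap ratio for each block, the classification and interference-suppressing fusion can be sketched as:

    import torch

    def fuse(sub_maps, overlap_ratios, p=0.5):
        # sub_maps: (N, H, W) local sub-response maps; overlap_ratios: (N,)
        # fraction of each block lying inside the initial bounding box.
        # Assumes both classes are non-empty; p is the preset threshold.
        is_obj = overlap_ratios > p
        s_o = sub_maps[is_obj].mean(dim=0)    # mean target object response
        s_b = sub_maps[~is_obj].mean(dim=0)   # mean background response
        return s_o - s_b                      # interference-suppressed map

    sub_maps = torch.randn(9, 21, 21)
    ratios = torch.tensor([0.9, 0.8, 0.1, 1.0, 0.0, 0.7, 0.2, 0.95, 0.05])
    local_map = fuse(sub_maps, ratios)        # local response map S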
After the local response map is obtained, the second-stage target object bounding box prediction result is obtained from it. Specifically, the position offset and size offset of the second-stage target object bounding box may be predicted from the local response map, and the second-stage prediction result obtained from the predicted offsets. For example, the local response map may be processed using a third neural network, different from the first and second neural networks mentioned above, to predict the position offset and size offset of the second-stage target object bounding box. Here, the second-stage target object bounding box prediction result may include position information and size information of the target object bounding box. This process is referred to below as adaptive prediction. FIG. 11 is a schematic diagram of adaptive prediction according to the present invention. As shown in fig. 11, in adaptive prediction, the local response map S may first be processed by a convolutional network to predict the second-stage bounding box offset D = (dx, dy, dw, dh), which comprises a position offset and a size offset. According to an exemplary embodiment, the position offset may be the coordinate offset between the center position coordinates of the second-stage target object bounding box and those of the first-stage target object bounding box, and the size offset may be the size offset between the second-stage target object bounding box and a pre-specified target object bounding box. After the offsets are obtained, the second-stage prediction result is determined as follows: when the sum of the absolute values of the coordinate offsets is greater than a preset threshold, the first-stage target object bounding box prediction result is taken as the second-stage prediction result; when the sum is less than or equal to the preset threshold, the second-stage prediction result is obtained by adding the predicted position offset to the center position of the first-stage target object bounding box and adding the predicted size offset to the size of the pre-specified target object bounding box. For example, if the first-stage target object bounding box prediction result is P1 = (x1, y1, w1, h1) and the pre-specified target object bounding box size is (w0, h0) (i.e., width w0 and height h0), then the second-stage target object bounding box prediction result may be P2 = (x1+dx, y1+dy, w0+dw, h0+dh).
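The adaptive rule itself reduces to a few lines; the threshold and the example numbers below are illustrative assumptions:

    def second_stage_box(p1, offsets, pre_size, thresh):
        # p1 = (x1, y1, w1, h1) first-stage result; offsets = (dx, dy, dw, dh);
        # pre_size = (w0, h0) pre-specified target object bounding box size.
        x1, y1, w1, h1 = p1
        dx, dy, dw, dh = offsets
        w0, h0 = pre_size
        if abs(dx) + abs(dy) > thresh:
            return p1                          # offsets deemed unreliable
        return (x1 + dx, y1 + dy, w0 + dw, h0 + dh)

    print(second_stage_box((50, 40, 64, 48), (2.0, -1.5, 3.0, 1.0),
                           pre_size=(64, 48), thresh=8.0))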
This completes the second stage of the target tracking method according to the present invention. To understand the second-stage operation more intuitively, it will be briefly described again with reference to fig. 12. As shown in fig. 12, after the depth feature φ1(Z) of the target object region image and the updated depth feature φ1(X′) are obtained through the first-stage operation, they may be input into a convolutional network to further extract features, and block correlation operations may be performed on the extracted features. Interference suppression and sub-response map fusion are then carried out to obtain the local response map, and finally the second-stage target object bounding box prediction result P2 is obtained through adaptive prediction.
According to the present invention, target object tracking may be performed by the method shown in fig. 5 using a cascade network (comprising the first, second, and third neural networks mentioned above), and the cascade network may be trained with multiple supervision signals: the global response map, the local response map, and the target bounding box. Training of the cascade network is briefly described below. It will be clear to those skilled in the art that the training process performs the same operations as the prediction process, except that the inputs during training are the target object region image (also called the "template image"), the search region image, and the known bounding box on the search region image, while the outputs are the target bounding box predicted on the search region image, the global response map, and the local response map. These three serve as supervision signals, and the network parameters are learned by iteratively optimizing a loss function until convergence.

Specifically, when training with multiple supervision signals, first-stage tracking is performed to obtain the global response map; in its ground truth, positions within a certain distance threshold of the center position are set to +1, and positions farther than the threshold are set to −1. Second-stage tracking is then performed: first, a segmentation result of the target on the search region image is obtained (by a segmentation algorithm or manual labeling); next, a distance transform is applied to the segmentation result, and the distance map is numerically normalized to obtain the supervision signal for the local response map. Finally, the fine prediction result is obtained through adaptive position prediction.

For example, the training process may proceed as follows. First, image pairs (each comprising a template image and a search region image) extracted from the same video sequence are input into the one-stage tracking network to obtain the first-stage outputs (the global response map prediction and the rough prediction box). The loss Loss0 between the global response map prediction and its ground truth is computed with a Binary Cross Entropy loss function (the loss measures the difference between prediction and ground truth; zero loss means no difference). Then, based on the rough prediction box, the first-stage features (called "shared features" in fig. 13) are cropped and used as input to the second-stage tracking network, and the ground truth of the local response map is generated. The second-stage outputs (the local response map prediction and the fine prediction box) are then obtained. KL Divergence (Kullback-Leibler Divergence) is used to measure the loss Loss1 between the local response map prediction and its ground truth, and the L1 distance is used to measure the loss Loss2 between the fine prediction box and the ground-truth box. Finally, the total loss Loss = Loss0 + a1·Loss1 + a2·Loss2 is optimized to convergence to learn the parameters of the network, where a1 and a2 are the weights of the respective losses.
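A sketch of the combined objective, with random tensors standing in for the network outputs and supervision signals; the normalizations and reductions chosen here are assumptions of the illustration:

    import torch
    import torch.nn.functional as F

    def total_loss(global_pred, global_gt, local_pred, local_gt,
                   box_pred, box_gt, a1=1.0, a2=1.0):
        # Loss0: binary cross entropy on the global response map.
        loss0 = F.binary_cross_entropy(torch.sigmoid(global_pred), global_gt)
        # Loss1: KL divergence on the (normalized) local response map.
        loss1 = F.kl_div(torch.log_softmax(local_pred.flatten(), dim=0),
                         torch.softmax(local_gt.flatten(), dim=0),
                         reduction="sum")
        # Loss2: L1 distance between the fine prediction box and the true box.
        loss2 = F.l1_loss(box_pred, box_gt)
        return loss0 + a1 * loss1 + a2 * loss2

    loss = total_loss(torch.randn(17, 17, requires_grad=True), torch.rand(17, 17),
                      torch.randn(21, 21, requires_grad=True), torch.rand(21, 21),
                      torch.randn(4, requires_grad=True), torch.randn(4))
    loss.backward()   # iterative optimization drives this loss toward convergence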
After training is finished, the cascade network can be tested. Specifically, given a first frame image and a target object bounding box, they are input into the cascade network for initialization to obtain the target image features; subsequent frame images are then input continuously, and the tracking result of the target object is obtained in real time.
The target tracking method of the present invention has been described above. The present invention provides a target object tracking method that combines block correlation and global correlation under a two-stage framework, and is the first to use block correlation to extract local information for refining the target object bounding box. The method can track a target object on a mobile device in real time with high precision and stability.
In addition, as described above, the target tracking method of the present invention employs block correlation combined with interference suppression. Fig. 14 shows the difference in effect between this approach and the plain global correlation and block correlation methods. As shown in fig. 14, comparing the global correlation response map (i.e., the global response map above), the block correlation response map (i.e., the local response map above), and the result of block correlation combined with interference suppression during tracking shows that the approach adopted in the present invention effectively extracts detailed information about the tracked target, further improving tracking accuracy.
Fig. 15 is a block diagram of an apparatus for performing target object tracking (hereinafter, the "target tracking apparatus") according to the present invention. Referring to fig. 15, the target tracking apparatus 1500 may include a first-stage tracker 1510 and a second-stage tracker 1520. Specifically, the first-stage tracker 1510 may be configured to: acquire a first depth feature of a target object region image and a second depth feature of a search region image; obtain a global response map according to the first depth feature and the second depth feature, and obtain a first-stage target object bounding box prediction result according to the global response map; and update the second depth feature according to the first-stage target object bounding box prediction result. The second-stage tracker 1520 may be configured to: obtain a plurality of local feature blocks based on the first depth feature, obtain a local response map according to the plurality of local feature blocks and the updated second depth feature, and obtain a second-stage target object bounding box prediction result according to the local response map. Since these operations have been described in detail in the context of the target tracking method, they are not repeated here for brevity.
The target tracking method and the target tracking apparatus according to the embodiment of the present invention have been described above with reference to fig. 1 to 15. However, it should be understood that: the various units in the apparatus shown in fig. 15 (e.g., the first stage tracker 1510 and the second stage tracker 1520) may be respectively configured as software, hardware, firmware, or any combination thereof to perform specific functions. For example, these units may correspond to dedicated integrated circuits, to pure software code, or to modules combining software and hardware. By way of example, and not limitation, the device described with reference to fig. 15 may be a PC computer, tablet device, personal digital assistant, smart phone, web application, or other device capable of executing program instructions.
It should be noted that although the target tracking apparatus 1500 is described above as being divided into units for respectively performing corresponding processes, it is clear to those skilled in the art that the processes performed by the units may also be performed without any specific unit division by the target tracking apparatus or without explicit demarcation between the units. Further, the apparatus described above with reference to fig. 15 is not limited to include the above-described units, but some other units (e.g., a storage unit, a data processing unit, etc.) may be added as needed, or the above units may be combined.
Further, the object tracking method according to the present invention may be recorded in a computer-readable recording medium. Specifically, according to the present invention, there may be provided a computer-readable recording medium having recorded thereon program instructions, which, when executed by a processor, may cause the processor to execute the target tracking method as described above. Examples of the computer readable recording medium may include magnetic media (e.g., hard disks, floppy disks, and magnetic tapes); optical media (e.g., CD-ROM and DVD); magneto-optical media (e.g., optical disks); and hardware devices (e.g., Read Only Memory (ROM), Random Access Memory (RAM), flash memory, etc.) that are specially configured to store and execute program instructions. Further, according to the present invention, there may also be provided an electronic device comprising a processor and a memory, the memory having stored therein program instructions, wherein the program instructions, when executed by the processor, cause the processor to perform the object tracking method as described above. Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
In addition, some operations in the target tracking method according to the exemplary embodiment of the present application may be implemented by software, some operations may be implemented by hardware, and further, the operations may be implemented by a combination of hardware and software.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present invention as defined by the following claims.

Claims (11)

1. A method of performing target object tracking, comprising:
acquiring a first depth feature of a target object region image and a second depth feature of a search region image;
obtaining a global response map according to the first depth feature and the second depth feature, and obtaining a first-stage target object bounding box prediction result according to the global response map;
updating the second depth feature according to the first-stage target object bounding box prediction result; and
obtaining a plurality of local feature blocks based on the first depth feature, obtaining a local response map according to the plurality of local feature blocks and the updated second depth feature, and obtaining a second-stage target object bounding box prediction result according to the local response map.
2. The method of claim 1, wherein obtaining the plurality of local feature blocks based on the first depth feature comprises: partitioning the first depth feature, or a third depth feature obtained by further feature extraction on the first depth feature, to obtain the plurality of local feature blocks,
and wherein obtaining the local response map according to the plurality of local feature blocks and the updated second depth feature comprises: performing block correlation between the plurality of local feature blocks and the updated second depth feature, or a fourth depth feature obtained by further feature extraction on the updated second depth feature, to obtain the local response map.
3. The method of claim 2, wherein performing block correlation between the plurality of local feature blocks and the updated second depth feature or the fourth depth feature to obtain the local response map comprises: performing block correlation between each of the plurality of local feature blocks and the updated second depth feature or fourth depth feature to obtain a plurality of local sub-response maps, and fusing the plurality of local sub-response maps to obtain the local response map;
and wherein obtaining the second-stage target object bounding box prediction result according to the local response map comprises: predicting a position offset and a size offset of a second-stage target object bounding box according to the local response map, and obtaining the second-stage target object bounding box prediction result according to the predicted position offset and size offset.
4. The method of claim 3, wherein fusing the plurality of local sub-response maps to obtain the local response map comprises:
classifying each of the plurality of local feature blocks as a target object feature block or a background feature block; and
fusing the local sub-response maps corresponding to the target object feature blocks and the sub-response maps corresponding to the background feature blocks to obtain the local response map.
5. The method of claim 4, wherein classifying each of the plurality of local feature blocks as a target object feature block or a background feature block comprises:
taking an initial target object bounding box calibrated on the target object region image as the classification basis, and classifying each local feature block as a target object feature block or a background feature block according to the ratio of the overlapping area between the local feature block and the initial target object bounding box to the area of the local feature block.
6. The method of claim 3, wherein the first-stage target object bounding box prediction result and the second-stage target object bounding box prediction result each include position information and size information of a target object bounding box, the position offset is a coordinate offset between the center position coordinates of the second-stage target object bounding box and the center position coordinates of the first-stage target object bounding box, and the size offset is a size offset between the second-stage target object bounding box and a pre-specified target object bounding box; and
wherein obtaining the second-stage target object bounding box prediction result according to the predicted position offset and size offset comprises:
when the sum of the absolute values of the coordinate offsets is greater than a preset threshold, taking the first-stage target object bounding box prediction result as the second-stage target object bounding box prediction result; and
when the sum of the absolute values of the coordinate offsets is less than or equal to the preset threshold, obtaining the second-stage target object bounding box prediction result by adding the predicted position offset to the center position of the first-stage target object bounding box and adding the predicted size offset to the size of the pre-specified target object bounding box.
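
For illustration only (not part of the claims): a sketch of the offset-gated refinement of claim 6. The L1 norm of the predicted coordinate offset acts as a confidence gate, falling back to the first-stage box when the second stage disagrees too strongly; all names are hypothetical.

def second_stage_box(stage1_box, pre_box, d_pos, d_size, threshold):
    # stage1_box: first-stage (cx, cy, w, h); pre_box: pre-specified (cx, cy, w, h).
    # d_pos = (dx, dy) predicted position offset; d_size = (dw, dh) predicted size offset.
    if abs(d_pos[0]) + abs(d_pos[1]) > threshold:
        return stage1_box  # offset deemed unreliable: keep the first-stage result
    cx, cy = stage1_box[0] + d_pos[0], stage1_box[1] + d_pos[1]
    w, h = pre_box[2] + d_size[0], pre_box[3] + d_size[1]
    return (cx, cy, w, h)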
7. The method of claim 6, wherein obtaining the first-stage target object bounding box prediction result according to the global response map comprises:
taking the position of the maximum value in the global response map as the position information included in the first-stage target object bounding box prediction result, and taking the size of the target object bounding box predicted for a frame image preceding the current frame image as the size information included in the first-stage target object bounding box prediction result.
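
For illustration only (not part of the claims): a sketch of the first-stage readout of claim 7; the response-map peak supplies the position, and the size is carried over from the box predicted on the preceding frame.

import torch

def stage1_prediction(resp, prev_size):
    # resp: (1, 1, Hr, Wr) global response map;
    # prev_size: (w, h) predicted on the frame preceding the current one.
    flat = torch.argmax(resp[0, 0]).item()
    row, col = divmod(flat, resp.shape[-1])
    return (col, row), prev_size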
8. The method of claim 2, wherein partitioning the first depth feature or the third depth feature to obtain the plurality of local feature blocks comprises partitioning the first depth feature or the third depth feature according to one of the following three partitioning modes:
a partitioning mode in which the local feature blocks do not overlap one another;
a partitioning mode in which adjacent local feature blocks overlap one another; and
a partitioning mode based on a predetermined block distribution.
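
For illustration only (not part of the claims): the first two partitioning modes of claim 8 differ only in the stride at which the feature map is cut; a sketch under the same assumptions as above (the third mode would instead slice at hand-specified offsets).

import torch

def partition(feat, block, stride):
    # feat: (1, C, h, w). stride == block gives non-overlapping blocks;
    # stride < block makes adjacent blocks overlap.
    C = feat.shape[1]
    blocks = feat.unfold(2, block, stride).unfold(3, block, stride)
    return blocks.permute(0, 2, 3, 1, 4, 5).reshape(-1, C, block, block)

feat = torch.randn(1, 256, 6, 6)
mode1 = partition(feat, 3, 3)  # non-overlapping: 4 blocks of shape (256, 3, 3)
mode2 = partition(feat, 3, 1)  # overlapping: 16 blocks of shape (256, 3, 3)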
9. An apparatus for performing target object tracking, comprising:
a first-stage tracker configured to: acquire a first depth feature of a target object area image and a second depth feature of a search area image in a current frame image; obtain a global response map according to the first depth feature and the second depth feature; obtain a first-stage target object bounding box prediction result according to the global response map; and update the second depth feature according to the first-stage target object bounding box prediction result; and
a second-stage tracker configured to: obtain a plurality of local feature blocks based on the first depth feature, obtain a local response map according to the plurality of local feature blocks and the updated second depth feature, and obtain a second-stage target object bounding box prediction result according to the local response map.
10. An electronic device comprising a processor and a memory, wherein the memory stores program instructions that, when executed by the processor, cause the processor to perform the method of any one of claims 1-8.
11. A computer-readable recording medium having program instructions recorded thereon, wherein the program instructions, when executed by a processor, cause the processor to perform the method of any one of claims 1-8.
CN202010044865.3A 2020-01-16 2020-01-16 Method and apparatus for performing target object tracking Pending CN113129332A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202010044865.3A CN113129332A (en) 2020-01-16 2020-01-16 Method and apparatus for performing target object tracking
KR1020200179773A KR20210092672A (en) 2020-01-16 2020-12-21 Method and apparatus for tracking target
US17/146,809 US11915432B2 (en) 2020-01-16 2021-01-12 Method and apparatus for tracking target

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010044865.3A CN113129332A (en) 2020-01-16 2020-01-16 Method and apparatus for performing target object tracking

Publications (1)

Publication Number Publication Date
CN113129332A (en)

Family

ID=76771588

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010044865.3A Pending CN113129332A (en) 2020-01-16 2020-01-16 Method and apparatus for performing target object tracking

Country Status (2)

Country Link
KR (1) KR20210092672A (en)
CN (1) CN113129332A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116528062B (en) * 2023-07-05 2023-09-15 合肥中科类脑智能技术有限公司 Multi-target tracking method

Also Published As

Publication number Publication date
KR20210092672A (en) 2021-07-26

Similar Documents

Publication Publication Date Title
US11302315B2 (en) Digital video fingerprinting using motion segmentation
CN107563313B (en) Multi-target pedestrian detection and tracking method based on deep learning
CN107633226B (en) Human body motion tracking feature processing method
CN111476302A (en) fast-RCNN target object detection method based on deep reinforcement learning
CN111523447B (en) Vehicle tracking method, device, electronic equipment and storage medium
JP6597914B2 (en) Image processing apparatus, image processing method, and program
JP6756406B2 (en) Image processing equipment, image processing method and image processing program
CN109063549B (en) High-resolution aerial video moving target detection method based on deep neural network
CN111709317B (en) Pedestrian re-identification method based on multi-scale features under saliency model
CN111402303A (en) Target tracking architecture based on KFSTRCF
CN110991278A (en) Human body action recognition method and device in video of computer vision system
Lee et al. Reinforced adaboost learning for object detection with local pattern representations
CN112287906B (en) Template matching tracking method and system based on depth feature fusion
CN113129332A (en) Method and apparatus for performing target object tracking
Liu et al. Mean shift fusion color histogram algorithm for nonrigid complex target tracking in sports video
CN113139540B (en) Backboard detection method and equipment
CN112614158B (en) Sampling frame self-adaptive multi-feature fusion online target tracking method
CN106934818B (en) Hand motion tracking method and system
CN114332716A (en) Method and device for clustering scenes in video, electronic equipment and storage medium
CN113920170A (en) Pedestrian trajectory prediction method and system combining scene context and pedestrian social relationship and storage medium
CN112634331A (en) Optical flow prediction method and device
Yoon et al. Fluctuation-Based Fade Detection for Local Scene Changes
Ji et al. Saliency detection using Multi-layer graph ranking and combined neural networks
US20220180531A1 (en) Method and apparatus with object tracking using dynamic field of view
CN113012202A (en) Target tracking method, device, equipment, medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination