CN112241967B - Target tracking method, device, medium and equipment

Info

Publication number
CN112241967B
CN112241967B
Authority
CN
China
Prior art keywords
image
image features
feature
features
frame
Prior art date
Legal status: Active
Application number
CN201910640796.XA
Other languages
Chinese (zh)
Other versions
CN112241967A (en)
Inventor
胡涛
申晗
黄李超
Current Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201910640796.XA
Publication of CN112241967A
Application granted
Publication of CN112241967B
Status: Active


Classifications

    • G06T7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06T7/223 Analysis of motion using block-matching
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A target tracking method, apparatus, medium and device are disclosed. The method comprises the following steps: determining a plurality of image areas in a current frame according to position information of a target in a history frame before the current frame, to obtain a plurality of first image blocks; extracting multi-level image features of a second image block in at least one history frame before the current frame and of each of the plurality of first image blocks, to obtain multi-level second image features corresponding to each second image block and multi-level first image features corresponding to each first image block; performing feature aggregation processing according to the second image features and the first image features to obtain an aggregated image feature corresponding to each first image block, wherein each aggregated image feature has multiple levels; and determining position information of the target in the current frame according to the correlation between an image feature of a reference image block in a reference frame and each aggregated image feature. The method and the device help improve the accuracy of target tracking.

Description

Target tracking method, device, medium and equipment
Technical Field
The present disclosure relates to computer vision technology, and more particularly, to a target tracking method, a target tracking apparatus, a storage medium, and an electronic device.
Background
Target tracking technology has been applied in various fields such as unmanned driving, navigation, security and protection. The tasks of target tracking techniques generally include: given an object and its position in an initial video frame, the given object is identified from each video frame of the video sequence and its position is located.
In practical application scenarios, target tracking technology faces challenges such as changes in target appearance, target deformation, target occlusion, image blurring caused by target motion, low image resolution and illumination changes, all of which affect the accuracy of target tracking. How to achieve target tracking quickly and accurately is therefore a technical problem that deserves attention.
Disclosure of Invention
The present disclosure has been made in order to solve the above technical problems. Embodiments of the present disclosure provide a target tracking method, a target tracking apparatus, a storage medium, and an electronic device.
According to an aspect of the embodiments of the present disclosure, there is provided a target tracking method, including: determining a plurality of image areas in a current frame according to position information of a target in a history frame before the current frame, to obtain a plurality of first image blocks; extracting multi-level image features of a second image block in at least one history frame before the current frame and of each of the plurality of first image blocks, to obtain multi-level second image features corresponding to each second image block and multi-level first image features corresponding to each first image block; performing feature aggregation processing according to the second image features and the first image features to obtain an aggregated image feature corresponding to each first image block, wherein each aggregated image feature has multiple levels; and determining position information of the target in the current frame according to the correlation between an image feature of a reference image block in a reference frame and each aggregated image feature.
According to another aspect of the embodiments of the present disclosure, there is provided a target tracking apparatus, including: an image block processing module, configured to determine a plurality of image areas in the current frame according to the position information of a target in a history frame before the current frame, to obtain a plurality of first image blocks; a trunk feature extraction module, configured to extract multi-level image features of a second image block in at least one history frame before the current frame and of each first image block obtained by the image block processing module, to obtain multi-level second image features corresponding to each second image block and multi-level first image features corresponding to each first image block; a feature processing module, configured to perform feature aggregation processing according to the second image features and the first image features obtained by the trunk feature extraction module, to obtain an aggregated image feature corresponding to each first image block, wherein each aggregated image feature has multiple levels; and a target positioning module, configured to determine the position information of the target in the current frame according to the correlation between an image feature of a reference image block in a reference frame and each aggregated image feature obtained by the feature processing module.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-described object tracking method.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic device including: a processor; a memory for storing the processor-executable instructions; the processor is configured to read the executable instructions from the memory and execute the instructions to implement the target tracking method described above.
According to the target tracking method and the target tracking apparatus provided by the embodiments of the present disclosure, feature aggregation processing is performed on the basis of the multi-level second image features of the second image blocks in the history frames and the multi-level first image features of the first image blocks in the current frame. Because the second image features and the first image features reflect how the image features change over time, and the multiple levels of image features reflect their characteristics at different scales, the resulting aggregated image features can be regarded as image features with both spatial perception and temporal perception, and therefore have better expressive capacity. Using the correlation between the aggregated image features and the image features of the reference image block facilitates accurate determination of the position information of the target in the current frame. The technical solution provided by the present disclosure therefore helps improve the accuracy of target tracking.
The technical scheme of the present disclosure is described in further detail below through the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The disclosure may be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic diagram of one scenario to which the present disclosure is applicable;
FIG. 2 is a flow diagram of one embodiment of a target tracking method of the present disclosure;
FIG. 3 is a flow diagram of one embodiment of the present disclosure for obtaining a plurality of first image blocks;
FIG. 4 is a flow chart of one embodiment of the present disclosure for obtaining aggregate image features corresponding to each of the first image blocks;
FIG. 5 is a flow diagram of one embodiment of the aggregation process using weight values of the present disclosure;
FIG. 6 is a flow chart of one embodiment of the present disclosure for obtaining a first weight value for any first pyramid image feature and a second weight value for any deformed second pyramid image feature;
FIG. 7 is a flow chart of one embodiment of determining location information of a target in a current frame according to correlation of image features of reference image blocks in a reference frame with respective aggregated image features;
FIG. 8 is a schematic diagram of one embodiment of the present disclosure for implementing target tracking using a neural network;
FIG. 9 is a schematic diagram of an embodiment of a target tracking apparatus of the present disclosure;
fig. 10 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise.
It will be appreciated by those of skill in the art that the terms "first," "second," etc. in embodiments of the present disclosure are used merely to distinguish between different steps, devices or modules, etc., and do not represent any particular technical meaning nor necessarily logical order between them.
It should also be understood that in embodiments of the present disclosure, "plurality" may refer to two or more, and "at least one" may refer to one, two or more.
It should also be appreciated that any component, data, or structure referred to in the presently disclosed embodiments may be generally understood as one or more without explicit limitation or the contrary in the context.
In addition, the term "and/or" in this disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A alone, both A and B, or B alone. In addition, the character "/" in the present disclosure generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and that the same or similar features may be referred to each other, and for brevity, will not be described in detail.
Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Embodiments of the present disclosure are applicable to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with the terminal device, computer system, or server, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc., that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks may be performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computing system storage media including memory storage devices.
Summary of the disclosure
In carrying out the present disclosure, the inventors found that: object tracking techniques typically require the use of image features extracted from the current frame to locate the position of an object in the video frame. The shallow layer features extracted from the current frame can effectively retain the edge, texture, position information and the like of the target, while the deep layer features extracted from the current frame can better abstract and model the target, and have the characteristic of consistency of intra-class transformation.
In an actual application scenario, phenomena such as changes in the target's appearance, target deformation, target occlusion, image blurring caused by target motion, low image resolution and illumination changes may occur. Although these phenomena affect the edges, texture and position information of the object in the shallow features, if the shallow and deep features of the current frame, together with inter-frame information (such as temporal information and motion information) between the current frame and its history frames, can be used effectively during target tracking, the influence of these phenomena on target tracking can be avoided to a certain extent, thereby improving the accuracy of target tracking.
Exemplary overview
An application scenario of the object tracking technology provided in the present disclosure is described below with reference to fig. 1.
Fig. 1 shows a video frame from a video of a motorcycle race; the motorcycle and its rider in the frame are the targets that need to be tracked. The illumination on the field in fig. 1 is not ideal and the motorcycle is moving very fast. The non-ideal illumination affects the clarity of the whole frame, and the high speed of the motorcycle blurs part of the frame content.
The position information of the target in the video frame determined using the target tracking technology provided by the present disclosure is shown as target detection box 100 in fig. 1, while the position information of the target determined using other target tracking techniques is shown as target detection boxes 101 and 102 in fig. 1. Comparing target detection boxes 100, 101 and 102 shows that the target tracking technology provided by the present disclosure achieves better tracking accuracy.
Exemplary method
FIG. 2 is a flow chart of one embodiment of a target tracking method of the present disclosure. As shown in fig. 2, the method of this embodiment includes: s200, S201, S202, and S203. The steps are described separately below.
S200, determining a plurality of image areas in the current frame according to the position information of the target in the history frame before the current frame, and obtaining a plurality of first image blocks.
The current frame in the present disclosure refers to a video frame of a current target to be searched in a video, and the current frame may also be referred to as a current video frame, a video frame to be processed, or a search frame. Historical frames in this disclosure generally refer to frames of video that have a time earlier than the time of the current frame in the video. The location information of the object in the present disclosure in the history frame before the current frame may be generally: the position information of the target in the frame preceding the current frame. Of course, the location information of the object in the present disclosure in the history frame before the current frame may also be: position information of the target in a second frame or a third frame preceding the current frame.
The object in the present disclosure may refer to an object such as a person, an animal, or a vehicle that needs position tracking.
The location information of the object in the history frame in the present disclosure is known information. For example, the positional information of the object in the history frame in the present disclosure may be positional information formed based on the initialization setting. For another example, the location information of the object in the present disclosure in the history frame may be the location information obtained using the object tracking method of the present disclosure.
The plurality of image areas in the present disclosure are each determined based on the positional information of the object in the history frame, that is, the plurality of image areas are each generally within a certain area range around the position of the object in the history frame. The first image block in this disclosure may be referred to as a search image block.
S201, respectively extracting the second image blocks and the multi-level image features of the plurality of first image blocks in at least one historical frame before the current frame, and obtaining the multi-level second image features corresponding to each second image block and the multi-level first image features corresponding to each first image block.
The second image block in the present disclosure may refer to: image blocks that are cut out from the history frame based on the position information of the object in the history frame. The second image block in this disclosure may be referred to as a history image block.
The various levels of image features in this disclosure refer to: in the case where image blocks are provided to a neural network (e.g., a trunk feature extraction module) for extracting image features, a plurality of layers of image features are formed from the image features each output by a different convolution layer in the trunk feature extraction module. That is, the image features of the multiple layers corresponding to any image block include: and the image features of the layers correspond to one convolution layer in the trunk feature extraction module, and the image features of the different layers correspond to different convolution layers. In addition, the spatial resolutions of different layers of image features in the multiple layers of image features corresponding to any image block are generally different, and the channel numbers of different layers of image features in the multiple layers of image features corresponding to any image block are also generally different. The image features of the various levels corresponding to any image block in the present disclosure may be referred to as pyramid image features of the image block. The number of layers of image features included in the pyramid image features may be set according to actual conditions.
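By way of illustration only, extracting such multi-level (pyramid) image features might look like the following sketch (PyTorch-style Python; the backbone structure, strides and channel counts are assumptions and do not correspond to any specific trunk feature extraction module of the present disclosure):

    import torch
    import torch.nn as nn

    class TrunkFeatureExtractor(nn.Module):
        # Toy backbone: each stage is one convolution block, and the output of
        # every stage is kept, giving one image feature per level (a "pyramid").
        def __init__(self):
            super().__init__()
            self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
            self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
            self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())

        def forward(self, x):
            f1 = self.stage1(x)   # shallow level: higher resolution, fewer channels
            f2 = self.stage2(f1)  # middle level
            f3 = self.stage3(f2)  # deep level: lower resolution, more channels
            return [f1, f2, f3]   # pyramid image features of the input image block

    # an image block resized to a predetermined size, e.g. 125 x 125 (batch of 1, RGB)
    block = torch.randn(1, 3, 125, 125)
    pyramid = TrunkFeatureExtractor()(block)   # one image feature per level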
The number of history frames in S201 may be different from the number of history frames in S200. The number of history frames in S200 is typically 1, and the number of history frames in S201 may be one or more. In the case where the number of history frames in S201 is 1, the present disclosure may obtain a plurality of levels of second image features (i.e., pyramid image features) corresponding to one history image block in the history frame. In the case that the number of history frames in S201 is greater than 1, each history frame corresponds to one history image block, the present disclosure may obtain the second image features of the multiple levels corresponding to each history image block. The second image features of the plurality of levels corresponding to the historical image blocks may be referred to as second pyramid image features. The first image features of the multiple levels corresponding to the search image block may be referred to as first pyramid image features.
S202, performing feature aggregation processing according to the second image features and the first image features to obtain aggregation image features corresponding to the first image blocks.
The feature aggregation processing in the present disclosure means: the image features of different image blocks are aggregated together to form a new image feature. The present disclosure generally performs feature aggregation processing for each layer in the pyramid image features when performing feature aggregation processing for the second pyramid image features and the first pyramid image features. For example, feature aggregation processing is performed on the first layer image features in the second pyramid image features and the first layer image features in the first pyramid image features to obtain first layer aggregated image features, feature aggregation processing is performed on the second layer image features in the second pyramid image features and the second layer image features in the first pyramid image features to obtain second layer aggregated image features, and so on until the last layer aggregated image features are obtained. From this, the aggregate image features in this disclosure are also pyramid image features.
S203, determining the position information of the target in the current frame according to the correlation between the image features of the reference image blocks in the reference frame and the aggregate image features.
The reference frames in this disclosure typically belong to the same video as the current and historical frames. The reference frame may be the first frame in the video. The reference frame may also be referred to as a template frame. Reference image blocks in a reference frame generally refer to image blocks that contain a target, e.g., reference image blocks may be image blocks that are cut out from the reference frame based on a manually labeled target bounding box. The image features of the reference image block in the present disclosure may not be pyramid image features. The correlation between image features in the present disclosure may be regarded as a consistency or correlation between image features, or the like.
The present disclosure performs feature aggregation processing on the basis of the multi-level second image features of the second image blocks in the history frames and the multi-level first image features of the first image blocks in the current frame. Because the second image features and the first image features reflect how the image features change over time, and the multiple levels of image features reflect their characteristics at different scales, the resulting aggregated image features can be regarded as image features with both spatial perception and temporal perception, and therefore have better expressive capacity. Using the correlation between the aggregated image features and the image features of the reference image block facilitates accurate determination of the position information of the target in the current frame. The technical solution provided by the present disclosure therefore helps improve the accuracy of target tracking.
In an alternative example, one implementation of the present disclosure to obtain a plurality of first image blocks is shown in fig. 3.
In fig. 3, S300, in the current frame, the target detection frame is enlarged with the center point of the target detection frame in the history frame as the center, and an enlarged area is obtained.
Optionally, the present disclosure determines a specific position of a target detection frame in a history frame in a current frame, and then amplifies the target detection frame n times with a center point of the target detection frame in the current frame as a center, where an amplified region of the target detection frame is an amplified region. Where n is typically greater than 1, e.g., n is 3.
In the case where the enlarged region exceeds the size of the history frame, a padding process may be performed on the pixel values in the exceeded region, for example, the pixel values in the exceeded region may be padded with 0, respectively.
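A minimal sketch of S300 (Python with NumPy; the center/width/height box convention and the zero-padding policy are assumptions used only for illustration):

    import numpy as np

    def crop_enlarged_region(frame, box, n=3):
        # frame: H x W x C image of the current frame; box: (cx, cy, w, h) of the
        # target detection frame taken from the history frame; n: enlargement factor.
        cx, cy, w, h = box
        W, H = int(round(w * n)), int(round(h * n))            # enlarged size
        x0, y0 = int(round(cx - W / 2)), int(round(cy - H / 2))
        out = np.zeros((H, W, frame.shape[2]), dtype=frame.dtype)  # pad exceeded area with 0
        # intersection of the enlarged region with the frame
        fx0, fy0 = max(x0, 0), max(y0, 0)
        fx1 = min(x0 + W, frame.shape[1])
        fy1 = min(y0 + H, frame.shape[0])
        if fx1 > fx0 and fy1 > fy0:
            out[fy0 - y0:fy1 - y0, fx0 - x0:fx1 - x0] = frame[fy0:fy1, fx0:fx1]
        return out   # the enlarged region, zero-padded where it exceeds the frame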
S301, determining a plurality of subareas in the amplifying region according to a preset scale factor.
The preset scale factor in the present disclosure may be expressed in the form of the following formula (1):

    a_S = a^S    (1)

In the above formula (1), a_S is the preset scale factor, the exponent S (uppercase) takes its values from an arithmetic sequence, and a and s (lowercase) may be hyperparameters of the neural network, that is, a and s are preset known values; for example, a may take the value 1.03, and s, which determines the number of values in the sequence S, may be a positive integer greater than 2.
Optionally, since S is an arithmetic sequence, a plurality of different values of a_S can be obtained using the above formula (1). The present disclosure may multiply the length and the width of the target detection frame in the history frame by each of the different values of a_S to obtain a plurality of scaled target detection frames. Taking the center point of the enlarged region as the center point of each scaled target detection frame, the position of each scaled target detection frame in the enlarged region can be determined, so that each scaled target detection frame corresponds to one sub-region, thereby obtaining a plurality of sub-regions.
S302, respectively adjusting the sizes of the plurality of sub-areas to be preset sizes to obtain a plurality of first image blocks.
Optionally, the predetermined dimension in the present disclosure is a preset known value. The predetermined size may be set according to actual requirements, for example, the predetermined size may be, for example: 125 x 125.
The present disclosure enlarges the object detection frame in the history frame so that the object in the current frame is in the enlarged area. By determining a plurality of sub-regions from within the magnified region using a preset scale factor, a plurality of possible target detection boxes are predicted for targets in the current frame. By adjusting the size of each sub-region to a predetermined size, the subsequent processing of each sub-region is facilitated.
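By way of non-limiting illustration, S301 and S302 could be sketched as follows (Python; the exact form of the scale factors, the hyperparameter values and the cv2-based resizing are assumptions):

    import numpy as np
    import cv2  # used only for resizing; any resize routine would do

    def scaled_sub_regions(enlarged, box_wh, a=1.03, s=5, out_size=125):
        # enlarged: the enlarged region (H x W x C), whose center point is used as
        # the center of every scaled target detection frame; box_wh: (w, h) of the
        # target detection frame in the history frame; a, s: hyperparameters
        # (assumed form: scale factors a**S, with S an arithmetic sequence of s
        # values centered at 0).
        H, W = enlarged.shape[:2]
        cx, cy = W / 2.0, H / 2.0
        S = np.arange(s) - (s - 1) / 2.0      # e.g. s = 5 -> [-2, -1, 0, 1, 2]
        blocks = []
        for factor in a ** S:
            w, h = box_wh[0] * factor, box_wh[1] * factor
            x0, y0 = int(round(cx - w / 2)), int(round(cy - h / 2))
            x1, y1 = int(round(cx + w / 2)), int(round(cy + h / 2))
            sub = enlarged[y0:y1, x0:x1]      # sub-region, assumed to lie inside the enlarged region
            blocks.append(cv2.resize(sub, (out_size, out_size)))  # S302: resize to the predetermined size
        return blocks   # the plurality of first image blocks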
In an alternative example, the manner of obtaining the second image features of the multiple levels corresponding to each of the second image blocks in the present disclosure may be: according to the target detection frame in at least one history frame of the current frame (i.e. the position information of the target in the history frame), the second image blocks are segmented from the corresponding history frame, each segmented second image block is provided to a neural network (a trunk feature extraction module described in the following embodiment) for extracting image features, the image feature extraction operation is performed on each second image block through the trunk feature extraction module, and for any second image block, the image features output by each of a plurality of layers (such as a plurality of convolution layers) in the trunk feature extraction module can be obtained. The number of layers of image features included in the second pyramid image feature may be 3, 4, or more.
In an alternative example, the manner of obtaining the first image features of the multiple levels corresponding to each of the first image blocks in the present disclosure may be: the first image blocks are respectively provided to a neural network (a trunk feature extraction module described in the following embodiments) for extracting image features, and image feature extraction operations are respectively performed on the first image blocks via the trunk feature extraction module. Likewise, the number of layers of image features included in the first pyramid image feature may be 3, 4, or more.
Optionally, the number of levels of the first pyramid image feature and the second pyramid image feature is the same, i.e. the number of layers of image features included in the first pyramid image feature is generally the same as the number of layers of image features included in the second pyramid image feature. In addition, the number of channels and the spatial resolution of image features of the same layer in the first pyramid image feature and the second pyramid image feature are typically the same. That is, the number of channels and the spatial resolution of the uppermost layer image feature in the first pyramid image feature are the same as those of the uppermost layer image feature in the second pyramid image feature, the number of channels and the spatial resolution of the second-uppermost layer image feature in the first pyramid image feature are the same as those of the second-uppermost layer image feature in the second pyramid image feature, and so on, down to the lowermost layer image feature, whose number of channels and spatial resolution are likewise the same in both pyramid image features.
The number of layers of the image features contained in the first pyramid image features is the same as the number of layers of the image features contained in the second pyramid image features, and the channel number and the spatial resolution of the image features of the same layer in the first pyramid image features and the second pyramid image features are the same, so that the subsequent feature aggregation processing of the first pyramid image features and the second pyramid image features is facilitated.
In an alternative example, one embodiment of the present disclosure for obtaining aggregate image features for each of the first image blocks is shown in fig. 4.
In fig. 4, S400, according to each first image feature (i.e., a first pyramid image feature), each second image feature (i.e., a second pyramid image feature) is deformed to obtain a plurality of deformed second image features (i.e., deformed second pyramid image features).
Alternatively, the deforming process in the present disclosure may refer to: a process of mapping the second image feature into the first image feature. The morphing process may also be considered as feature-aligning the first image feature and the second image feature. The morphing process in the present disclosure may also be referred to as an alignment process, a mapping process, or the like.
Optionally, for any second pyramid image feature, the disclosure may deform each layer of image feature in the second pyramid image feature separately. Since the deformed second image features in the present disclosure are still pyramid image features, the deformed second image features may be referred to as deformed second pyramid image features.
Alternatively, the present disclosure may implement a morphing process of image features using motion information. That is, the present disclosure may first obtain motion information between each first image feature and a plurality of second image features, and then, according to the obtained motion information, respectively perform deformation processing on the plurality of second image features, thereby obtaining a plurality of deformed second image features. And the second image features are deformed by utilizing the motion information, so that the deformed second image features can be accurately and conveniently obtained.
Alternatively, the motion information in the present disclosure may refer to inter-frame motion information between two video frames in a video. The present disclosure may obtain inter-frame motion information in a variety of manners. For example, for any first pyramid image feature and any second pyramid image feature, feature stitching is first performed on the first pyramid image feature and the second pyramid image feature based on channel concatenation to obtain a stitched image feature, i.e. a stitched pyramid image feature, and convolution processing is then performed on the stitched pyramid image feature to obtain the inter-frame motion information between the first pyramid image feature and the second pyramid image feature.
Specifically, for any first pyramid image feature and any second pyramid image feature, the spatial resolution and the number of channels of any layer image feature in the first pyramid image feature are the same as those of the same-layer image feature in the second pyramid image feature. For any layer (such as the n-th layer), feature stitching is first performed, based on channel concatenation, on the n-th layer image feature in the first pyramid image feature and the n-th layer image feature in the second pyramid image feature to obtain an n-th layer stitched image feature; convolution processing is then performed on the n-th layer stitched image feature, for example by providing it to a convolution processing unit including at least one convolution layer, and the inter-frame motion information between the n-th layer image feature in the first pyramid image feature and the n-th layer image feature in the second pyramid image feature can be obtained from the output of the convolution processing unit. In this manner, the inter-frame motion information between each layer of image features in the first pyramid image feature and the corresponding layer of image features in the second pyramid image feature can be obtained. Acquiring the inter-frame motion information through feature stitching and convolution processing allows the inter-frame motion information to be obtained quickly, which helps improve the efficiency of the deformation processing.
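As a rough sketch (PyTorch-style Python; the channel counts and the two-layer convolution unit are assumptions), predicting the inter-frame motion information for one pyramid level could look like this:

    import torch
    import torch.nn as nn

    class MotionEstimator(nn.Module):
        # Per-level motion prediction: concatenate the n-th layer first image
        # feature and the n-th layer second image feature along channels, then
        # apply a small convolution unit. The number of output channels
        # (2 * 3 * 3, matching a 3x3 deformable-convolution offset map) is an
        # assumption made for illustration.
        def __init__(self, channels, k=3):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, 2 * k * k, 3, padding=1),
            )

        def forward(self, first_feat, second_feat):
            stitched = torch.cat([first_feat, second_feat], dim=1)  # channel concatenation
            return self.conv(stitched)                              # inter-frame motion information

    # usage for one level: both features share spatial size and channel count
    f1 = torch.randn(1, 64, 32, 32)   # n-th layer of a first pyramid image feature
    f2 = torch.randn(1, 64, 32, 32)   # n-th layer of a second pyramid image feature
    motion = MotionEstimator(64)(f1, f2)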
Alternatively, the present disclosure may also obtain inter-frame motion information in other manners, e.g., based on optical flow; for another example, inter-frame motion information is obtained based on inter-pixel correlation. The present disclosure is not limited in the manner in which the inter-frame motion information is obtained.
S401, respectively performing feature aggregation processing on each first pyramid image feature and a plurality of deformed second pyramid image features to obtain aggregation image features corresponding to each first image block.
According to the method and the device, the second pyramid image features and the first pyramid image features are aligned through deformation processing of the second pyramid image features, so that when the first pyramid image features and the deformed second pyramid image features are utilized for aggregation processing, the feature expression of the aggregation image features is more accurate, and the accuracy of target tracking is improved.
Optionally, the present disclosure may perform aggregation processing according to the weight values corresponding to the first image feature and the second image feature. An example of aggregation processing using weight values is shown in fig. 5.
In fig. 5, S500, a first weight value of each first pyramid image feature and a second weight value of each deformed second pyramid image feature are obtained according to each first image feature (i.e., a first pyramid image feature) and a plurality of deformed second image features (i.e., deformed second pyramid image features).
Optionally, the first weight value in the disclosure includes a weight value layer corresponding to each layer of image features of the first pyramid image feature. That is, the first weight value in the present disclosure includes multiple weight value layers, the number of layers of the weight value layers included in the first weight value is the same as the number of layers of the image features included in the first pyramid image feature, and the size of each weight layer in the first weight value is generally related to the spatial resolution of the image features of the corresponding layer in the first pyramid image feature corresponding to the weight layer, so that the multiple weight layers included in the first weight value are pyramid-shaped.
Likewise, the second weight value in the present disclosure includes a weight value layer corresponding to each layer of image features of the deformed second pyramid image feature. That is, the second weight value includes multiple weight value layers, the number of weight value layers included in the second weight value is the same as the number of layers of image features included in the deformed second pyramid image feature, and the size of each weight layer in the second weight value is generally related to the spatial resolution of the corresponding layer of image features in the deformed second pyramid image feature, so that the multiple weight layers included in the second weight value are also pyramid-shaped.
Alternatively, the present disclosure obtains a first weight value for any first pyramid image feature and a second weight value for any deformed second pyramid image feature as shown in fig. 6 below.
S501, determining the aggregation image features corresponding to the first image blocks according to the first pyramid image features, the first weight values, the deformed second pyramid image features and the second weight values.
Optionally, the present disclosure may perform feature aggregation processing on the n-th layer image feature in the first pyramid image feature and the n-th layer image feature in each deformed second pyramid image feature according to the n-th weight value layer in the first weight value and the n-th weight value layer in each second weight value, so as to obtain an aggregated n-th layer image feature. The spatial resolution and the number of channels of the aggregated n-th layer image feature are the same as those of the n-th layer image feature of the first pyramid image feature. After feature aggregation processing has been performed in this manner on all layer image features in the first pyramid image feature and in the deformed second pyramid image features, aggregated multi-layer image features can be obtained. The aggregated multi-layer image features likewise form a pyramid and may therefore be referred to as an aggregated pyramid image feature.
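For illustration, the per-level aggregation could be sketched as follows (PyTorch-style Python; it assumes the aggregation is a weighted sum and that the weight layers broadcast over the channel dimension, neither of which is specified beyond the description above):

    import torch

    def aggregate_level(first_feat, first_weight, warped_second_feats, second_weights):
        # first_feat: n-th layer of the first pyramid image feature, shape (N, C, H, W)
        # first_weight: n-th weight-value layer for the first feature, shape (N, 1, H, W)
        # warped_second_feats / second_weights: n-th layer of each deformed second
        # pyramid image feature and its n-th weight-value layer.
        agg = first_weight * first_feat
        for feat, w in zip(warped_second_feats, second_weights):
            agg = agg + w * feat
        return agg   # aggregated n-th layer image feature, same shape as first_feat

    def aggregate_pyramid(first_pyr, first_w, second_pyrs, second_ws):
        # Apply the per-level aggregation to every layer of the pyramids.
        return [
            aggregate_level(first_pyr[n], first_w[n],
                            [p[n] for p in second_pyrs], [w[n] for w in second_ws])
            for n in range(len(first_pyr))
        ]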
The method and the device are beneficial to enabling the feature expression of the feature of the aggregated image to be more accurate by utilizing the weight value to execute the aggregation processing operation, so that the accuracy of target tracking is improved.
In addition, the present disclosure may also employ other means for feature aggregation processing. For example, the present disclosure may employ RNN (Recurrent Neural Networks, recurrent neural network) to perform feature aggregation processing on the first pyramid image features and the deformed second pyramid image features. For another example, the present disclosure may employ a 3D convolution approach to feature aggregation processing of the first pyramid image features and the deformed second pyramid image features. The present disclosure is not limited in this regard.
In an alternative example, one example of the present disclosure obtaining a first weight value for a first image feature (i.e., a first pyramid image feature) and a second weight value for a warped second image feature (i.e., a warped second pyramid image feature) is shown in fig. 6.
In fig. 6, S600, a first embedded feature of each first image feature (i.e., a first pyramid image feature) and a second embedded feature of each deformed second image feature (i.e., a deformed second pyramid image feature) are acquired.
Alternatively, the embedded feature in the present disclosure may refer to a feature obtained by performing feature embedding (Feature Embedding) processing on the input feature. The number of channels of the feature before the feature embedding process is generally greatly different from the number of channels of the feature after the feature embedding process, and the feature embedding process may be considered to be a dimension reduction process for the feature.
Alternatively, the present disclosure may utilize an embedded neural network element to obtain the first embedded feature and the second embedded feature. Specifically, the first pyramid image features are input into an embedded neural network unit, and the first embedded features are obtained according to the output of the embedded neural network unit. And inputting the second pyramid image features into the embedded neural network unit, and obtaining second embedded features according to the output of the embedded neural network unit.
Optionally, the network structure used for embedding the neural network unit includes, but is not limited to: a Bottleneck structure. As an example, the embedded neural network unit may include three convolution layers, and the convolution kernels of the three convolution layers may be: 1×1×32, 3×2×32, and 1×1×64.
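For illustration only, such an embedding unit might be sketched as follows (PyTorch-style Python; the input channel count is an assumption, and the middle convolution is taken here as 3x3):

    import torch.nn as nn

    class EmbeddingUnit(nn.Module):
        # Bottleneck-style embedding: 1x1 reduce -> 3x3 (assumed) -> 1x1 expand.
        def __init__(self, in_channels):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=1), nn.ReLU(),
                nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=1),
            )

        def forward(self, x):
            return self.net(x)   # embedded feature (64 channels)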
S601, calculating the similarity between each first embedded feature and each second embedded feature respectively to obtain a plurality of similarities.
Alternatively, the present disclosure may calculate a cosine distance between the first embedded feature and the second embedded feature, and represent a similarity between the first embedded feature and the second embedded feature using the cosine distance.
S602, determining a first weight value of each first pyramid image feature and a second weight value of each deformed second pyramid image feature according to the multiple similarities.
Optionally, the disclosure may normalize the obtained similarities, thereby obtaining the first weight value and the second weight values. For example, the first weight value and the second weight values may be obtained by calculation using the following formula (2):

    w_{t-τ→t} = exp( cos(e_{t-τ→t}, e_t) ) / Σ_{τ'=0..T} exp( cos(e_{t-τ'→t}, e_t) )    (2)

In the above formula (2), if τ is 0, then w_{t-τ→t} is w_{t→t}, and w_{t→t} represents the first weight value; if τ is not 0, w_{t-τ→t} represents a second weight value; exp(x) represents the exponential of x; cos(·,·) represents the similarity between two embedded features obtained in S601; if τ is 0, e_{t-τ→t} is e_t, and e_t represents the first embedded feature; when τ is not 0, e_{t-τ→t} represents a second embedded feature; the value range of τ is 0 to T; and T represents the number of second image blocks.
The method and the device can conveniently obtain the first weight value of the first pyramid image feature and the second weight value of the deformed second pyramid image feature by utilizing the similarity among the embedded features, so that the efficiency of feature aggregation processing is improved.
In an alternative example, the present disclosure determines an example of the location information of the target in the current frame according to the correlation of the image features of the reference image block in the reference frame with the respective aggregate image features, respectively, as shown in fig. 7.
In fig. 7, S700, the image features of the reference image block are extracted, and the reference image features are obtained.
Alternatively, an example of the acquisition manner of the reference image feature in the present disclosure may be:
First, image features of multiple layers of reference image blocks are extracted, and first reference image features of multiple layers are obtained. For example, the reference image block is provided to a neural network (such as a trunk feature extraction module) for extracting image features, and the image feature extraction operation is performed on the reference image block by the trunk feature extraction module. The number of layers of image features included in the first reference image feature may be 3, 4 or more. Typically, the number of layers of image features included in the first reference image feature is the same as the number of layers of image features included in the first pyramid image feature and the second pyramid image feature. In addition, the number of channels and spatial resolution of the first reference image feature and image features of the same layer in the first pyramid image feature and the second pyramid image feature may be the same.
And secondly, carrying out feature fusion processing on the first reference image features of multiple layers, and determining the fused image features, wherein the reference image features are generated by the fused image features. For example, the image features after the fusion processing may be directly used as the reference image features.
Optionally, the feature fusion processing for the first reference image feature in the present disclosure may specifically be as follows. First, a 1x1 convolution is applied to every layer of image features in the first reference image feature so that all layers have the same number of channels. Second, the uppermost layer image feature after the convolution processing is up-sampled so that it has the same spatial resolution and number of channels as the second-uppermost layer image feature after the convolution processing, and the up-sampled image feature is superposed on that second-uppermost layer image feature. Third, the image feature obtained from the previous superposition is up-sampled so that it has the same spatial resolution and number of channels as the next lower layer image feature after the convolution processing, and the two are superposed. This continues layer by layer until the superposition reaches the lowermost layer image feature, finally yielding the reference image feature.
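A compact sketch of this top-down fusion (PyTorch-style Python; the common channel count and bilinear up-sampling are assumptions):

    import torch.nn as nn
    import torch.nn.functional as F

    class TopDownFusion(nn.Module):
        # 1x1 convolutions bring every level to the same channel count; the
        # uppermost (deepest) level is then repeatedly up-sampled and superposed
        # (added) onto the next lower level until the lowest level is reached.
        def __init__(self, in_channels_per_level, out_channels=64):
            super().__init__()
            self.lateral = nn.ModuleList(
                nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels_per_level
            )

        def forward(self, pyramid):
            # pyramid[0] is the lowest (highest-resolution) level, pyramid[-1] the uppermost
            feats = [conv(f) for conv, f in zip(self.lateral, pyramid)]
            fused = feats[-1]
            for f in reversed(feats[:-1]):
                fused = F.interpolate(fused, size=f.shape[-2:], mode='bilinear',
                                      align_corners=False) + f
            return fused   # fused feature, e.g. the reference image feature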
The method and the device are beneficial to enabling the feature expression of the reference image features to be more accurate by carrying out feature fusion processing on the image features of multiple layers of the reference image block, so that the position information of the target determined later in the current frame is beneficial to be more accurate.
S701, respectively calculating Gaussian responses between the reference image features and the aggregate image features.
Optionally, the present disclosure may perform feature fusion processing on the aggregated image feature, and calculate the gaussian response using the obtained reference image feature and the aggregated image feature after the feature fusion processing.
Optionally, the feature fusion processing on the aggregated image feature in the present disclosure may proceed as follows. A 1x1 convolution is first applied to every layer of image features in the aggregated image feature so that all layers have the same number of channels. The uppermost layer image feature after the convolution processing is then up-sampled so that it has the same spatial resolution and number of channels as the second-uppermost layer image feature after the convolution processing, and the two are superposed. The image feature obtained from the previous superposition is then up-sampled and superposed on the next lower layer image feature after the convolution processing, and so on, until the superposition reaches the lowermost layer image feature, finally yielding the aggregated image feature after feature fusion processing.
Alternatively, the present disclosure may employ a correlation filtering approach to obtain a gaussian response between the reference image features and the aggregated image features after the feature fusion process. The operation performed by the correlation filtering method can be expressed by the following formula (3):
in the above formula (3), g represents a gaussian response; f (F) -1 The inverse fourier transform of the x is shown;representation ofComplex conjugate of (2); />Parameters used for correlation filtering, +.>Can be obtained by calculation using the following formula (4); the addition of Hadamard (Hadamard); />Representation->Fourier transform of (a); />And the characteristic value of the channel d in the aggregated image characteristic after the characteristic fusion processing of the current frame is represented.
In the above formula (4), λ represents a penalty coefficient, which is a known value, for example, λ may be 0.0001;representation->Fourier transform of (a); />Representing reference image features; y is * Represents the complex conjugate of y; />Representation of pair y * Performing Fourier transform; y represents based on referenceA standard gaussian distribution formed by reference image blocks in a frame; the ". Iy represents Hadamard multiplication.
The present disclosure can obtain a gaussian response between the reference image feature and the aggregated image feature after each feature fusion process using the above-described formula (3) and formula (4).
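For illustration, formulas (3) and (4) as written above could be computed as follows (NumPy sketch in Python; the array layouts are assumptions):

    import numpy as np

    def learn_filter(x, y, lam=1e-4):
        # x: reference image feature after fusion, shape (D, H, W)
        # y: standard Gaussian label map (H, W) built from the reference image block
        X = np.fft.fft2(x, axes=(-2, -1))
        Y = np.fft.fft2(np.conj(y))                      # Fourier transform of y*
        denom = np.sum(np.conj(X) * X, axis=0) + lam     # shared across channels, formula (4)
        return (Y[None] * X) / denom[None]               # w_hat, one filter per channel d

    def gaussian_response(w_hat, z):
        # z: fused aggregated image feature of the current frame, shape (D, H, W)
        Z = np.fft.fft2(z, axes=(-2, -1))
        g = np.fft.ifft2(np.sum(np.conj(w_hat) * Z, axis=0))   # formula (3)
        return np.real(g)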
S702, determining the position information of the target in the current frame according to the first image block corresponding to the maximum Gaussian response value.
The method and the device have the advantages that the position information of the target in the current frame is determined by utilizing the Gaussian response, so that one first image block can be selected from a plurality of first image blocks quickly; because the Gaussian response corresponding to the first image block is maximum, the position information of the first image block is most likely to be the position information of the target in the current frame, so that the position information of the target in the current frame can be accurately determined, and the accuracy of target tracking can be improved.
In one alternative example, the present disclosure may utilize a neural network to implement the target tracking method described above. The neural network of the present disclosure may be referred to as a spatially-aware temporal aggregation neural network (SATAN).
An example of achieving target tracking using SATAN is shown in FIG. 8.
In fig. 8, it is assumed that the t-th frame in the video (i.e., search frame t in fig. 8) is the current frame 800. Assume that the number of history frames used to provide second image blocks is k (k is an integer greater than 0), for example the t-k-th history frame 801 (i.e., frame t-k in fig. 8), the t-k+1-th history frame (not shown in fig. 8), …, and the t-1-th history frame (not shown in fig. 8). Reference frame 802 is the lowest image on the left of fig. 8 (i.e., the template frame in fig. 8). The reference frame 802 is provided with a target detection frame. The reference frame 802, the current frame 800, and the history frames represent temporal information.
The present disclosure may cut out reference image blocks from the reference frame 802 according to the target detection frame in the reference frame 802 and adjust the reference image blocks to a predetermined size (e.g., 125 x 125). The reference image block may be provided to a backbone feature extraction module (not shown in fig. 8) in the SATAN, and image features of multiple levels of the input reference image block are extracted via the backbone feature extraction module, so as to obtain first reference image features of multiple levels. All layer image features in the first reference image features of the multiple layers have the same channel number after being respectively subjected to convolution processing of 1×1. After the convolution processing, the four image features with the same number of channels are formed as shown in the left four image features in the lowermost block 803 in fig. 8 (i.e., the left four small blocks in the block 803). After the up-sampling process and the superposition process, the four image features are sequentially processed, and finally the image feature on the rightmost side in the lowermost block 803 in fig. 8 is formed, which is the reference image feature 804.
The present disclosure may enlarge the target detection frame of a history frame adjacent to the current frame 800 (e.g., the (t−1)-th history frame), for example by a factor of 3, and determine the area occupied by the enlarged target detection frame in the current frame 800, i.e., an enlarged region. The enlarged region is centered, in the current frame 800, on the center point of the target detection frame in that history frame. If part of the enlarged target detection frame extends beyond the boundary of the current frame 800, that part may be handled in an existing manner, e.g., by zero padding. The present disclosure may then determine a plurality of sub-regions in the enlarged region according to preset scale factors, and adjust the sizes of the plurality of sub-regions to a predetermined size (e.g., 125×125), respectively, so as to obtain a plurality of first image blocks. Each first image block is then provided to the trunk feature extraction module in the SATAN, which extracts image features of multiple levels of the first image block to obtain first image features of multiple levels, such as the four levels of first image features 805 shown on the left side of fig. 8.
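A minimal sketch of this multi-scale cropping is shown below, assuming a (cx, cy, w, h) box representation and example scale factors; these choices, the OpenCV resize, and the zero-padding handling of out-of-frame regions are illustrative assumptions rather than the exact procedure of the disclosure.

```python
import numpy as np
import cv2  # used only for resizing the crops

def crop_first_blocks(frame, box, scale_factors=(0.96, 1.0, 1.04),
                      enlarge=3.0, out_size=125):
    """Crop the first image blocks around the target box of the adjacent history frame.

    frame: H x W x 3 current frame; box: (cx, cy, w, h) target detection frame
    taken from the adjacent history frame.
    """
    cx, cy, w, h = box
    blocks = []
    for s in scale_factors:
        bw, bh = w * enlarge * s, h * enlarge * s            # one sub-region size
        x0, y0 = int(round(cx - bw / 2)), int(round(cy - bh / 2))
        x1, y1 = int(round(cx + bw / 2)), int(round(cy + bh / 2))
        pad_l, pad_t = max(0, -x0), max(0, -y0)               # zero padding if the
        pad_r = max(0, x1 - frame.shape[1])                   # region leaves the frame
        pad_b = max(0, y1 - frame.shape[0])
        padded = np.pad(frame, ((pad_t, pad_b), (pad_l, pad_r), (0, 0)))
        patch = padded[y0 + pad_t:y1 + pad_t, x0 + pad_l:x1 + pad_l]
        blocks.append(cv2.resize(patch, (out_size, out_size)))  # adjust to 125 x 125
    return blocks
```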
The present disclosure may cut out the second image block from the corresponding history frame according to the target detection frame in that history frame and adjust the second image block to a predetermined size (e.g., 125×125). The second image block may be provided to the trunk feature extraction module in the SATAN, which extracts image features of multiple levels of the second image block to obtain second image features of multiple levels. The present disclosure performs the above processing on all of the k history frames, so that k sets of second image features can be obtained. One example of the multiple levels of second image features corresponding to the (t−k)-th history frame 801 is the four levels of second image features 806 shown on the left side of fig. 8.
For each second image feature, the present disclosure may use the alignment unit 807 in the SATAN to align it with the same-level image features in a first image feature, so that a deformed second image feature may be obtained via the alignment unit 807; one of the deformed second image features is shown as the rightmost small block 809 in the uppermost block 808 of fig. 8. Specifically, for any first image feature and any second image feature, the alignment unit 807 may perform feature stitching on the same-level image features of the first image feature and the second image feature based on channel connection to obtain one level of stitched image feature, and perform convolution processing on that stitched image feature (e.g., using a convolution unit included in the alignment unit 807) to obtain inter-frame motion information (which may also be referred to as an offset) between that level of the first image feature and the second image feature. The alignment unit 807 may then deform that level of the second image feature using the inter-frame motion information; for example, the alignment unit 807 may perform deformable convolution processing on the corresponding level of the second image feature based on the inter-frame motion information, thereby obtaining one level of the deformed second image feature. With the above method, the present disclosure can obtain every level of image features of all the deformed second image features.
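Under the assumption that the deformable convolution provided by torchvision is an acceptable stand-in for the deformable convolution processing described above, a sketch of one level of such an alignment step might look as follows; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AlignUnit(nn.Module):
    """Align one level of a second (historical) feature to the first (current) feature."""

    def __init__(self, channels=64, k=3):
        super().__init__()
        # Predict the inter-frame motion information (offsets) from the
        # channel-wise concatenation of the two same-level features.
        self.offset_conv = nn.Conv2d(2 * channels, 2 * k * k, kernel_size=3, padding=1)
        # Deformable convolution that warps the historical feature by those offsets.
        self.deform = DeformConv2d(channels, channels, kernel_size=k, padding=k // 2)

    def forward(self, first_feat, second_feat):
        stitched = torch.cat([first_feat, second_feat], dim=1)   # feature stitching
        offsets = self.offset_conv(stitched)                     # motion information
        return self.deform(second_feat, offsets)                 # deformed second feature
```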
For each level of image features in a first image feature, the present disclosure may use the aggregation unit 810 in the SATAN to perform feature aggregation processing on that level of the first image feature and the same level of each deformed second image feature, obtaining the aggregate image feature of the first image block corresponding to that first image feature. For example, the aggregation unit 810 may include an embedded neural network unit, which may comprise at least one embedding layer (for example, three layers with convolution kernels of 1×1×32, 3×3×32, and 1×1×64, respectively). The aggregation unit 810 may use the embedded neural network unit to acquire the first embedded feature of each first image feature and the second embedded feature of each deformed second image feature; the aggregation unit 810 then calculates the cosine similarity between each first embedded feature and each second embedded feature, thereby obtaining a plurality of cosine similarities; finally, the aggregation unit 810 obtains the aggregate image feature of the first image block corresponding to the first image feature according to the plurality of similarities and the corresponding first image feature and deformed second image features. With the above method, the present disclosure may obtain an aggregate image feature for each first image block. An example of the aggregate image feature of one first image block is the four blocks 811 in fig. 8, i.e., an aggregate image feature comprising four levels of image features. The present disclosure may use weights to implement the feature aggregation processing; reference may be made to the description above with respect to fig. 5. Since the multiple levels of the second image features and the first image features may reflect the characteristics of the image features at different scales, they embody the spatial information of the image features, shown along the x-axis direction of fig. 8.
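A sketch of one level of such an aggregation unit is given below; the softmax normalization of the similarities into weights, the embedding layer sizes, and all names are assumptions made for illustration (the disclosure only requires weights derived from the similarities).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AggregateUnit(nn.Module):
    """Weight same-level features by embedded cosine similarity and sum them."""

    def __init__(self, channels=64):
        super().__init__()
        # Small embedding network (three layers, roughly 1x1 / 3x3 / 1x1 kernels).
        self.embed = nn.Sequential(
            nn.Conv2d(channels, 32, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 1))

    def forward(self, first_feat, deformed_feats):
        feats = [first_feat] + list(deformed_feats)       # candidates to aggregate
        ref_emb = F.normalize(self.embed(first_feat), dim=1)
        sims = []
        for f in feats:
            emb = F.normalize(self.embed(f), dim=1)
            sims.append((ref_emb * emb).sum(dim=1, keepdim=True))   # cosine similarity map
        weights = torch.softmax(torch.stack(sims, dim=0), dim=0)    # per-location weights
        return (weights * torch.stack(feats, dim=0)).sum(dim=0)     # aggregate feature
```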
Feature fusion processing can be performed on the aggregate image features of each first image block by means of the up-sampling unit, so as to obtain the fused aggregate image features of each first image block. The present disclosure may likewise use the up-sampling unit to perform feature fusion processing on the first reference image features of multiple levels, and take the fused image features as the reference image feature of the reference image block.
The present disclosure may use a correlation filtering unit to obtain the Gaussian response between the reference image feature and each aggregate image feature after the feature fusion processing. The correlation filtering unit may include a correlation filtering layer. According to the maximum Gaussian response output by the correlation filtering unit, one first image block is selected from the plurality of first image blocks; the selected first image block corresponds to the maximum Gaussian response, and the coordinate information of the selected first image block can represent the position of the target in the current frame.
In the SATAN of the present disclosure, the trunk feature extraction module is trained before the other modules; that is, the present disclosure may use image samples to train the trunk feature extraction module alone, and the network parameters of the successfully trained trunk feature extraction module are generally not changed during the subsequent training of the other modules and units.
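As a small hedged sketch of this two-stage training schedule (the module names, optimizer, and learning rate are assumptions, not from the disclosure):

```python
import torch

def freeze_backbone_and_build_optimizer(satan: torch.nn.Module,
                                        backbone: torch.nn.Module,
                                        lr: float = 1e-3):
    """Freeze the already-trained trunk feature extraction module, then
    optimize only the remaining SATAN modules and units."""
    for p in backbone.parameters():
        p.requires_grad = False          # trunk parameters stay fixed from here on
    backbone.eval()
    trainable = [p for p in satan.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr, momentum=0.9)
```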
The present disclosure may train the modules and units in the SATAN other than the trunk feature extraction module together, and the loss function used for training may be as shown in the following formula (5):
loss = ||g − y||₂   formula (5)
In the above formula (5), g may be calculated using the above formula (3) and formula (4); y represents a standard Gaussian distribution formed based on the labeling information of the image sample, that is, the position information of the target detection frame in the image sample; ||·||₂ denotes the Euclidean distance.
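A one-line hedged sketch of this loss, assuming batched 2-D response maps, is given below; averaging over the batch is an illustrative choice.

```python
import torch

def tracking_loss(g: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Formula (5): Euclidean (L2) distance between the predicted Gaussian
    response g and the label distribution y built from the annotated target
    detection frame, averaged over the batch."""
    return ((g - y) ** 2).sum(dim=(-2, -1)).sqrt().mean()
```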
Exemplary apparatus
Fig. 9 is a schematic structural view of an embodiment of the object tracking device of the present disclosure. The apparatus of this embodiment may be used to implement the method embodiments of the present disclosure described above. As shown in fig. 9, the apparatus of this embodiment includes: an image block processing module 900, a backbone feature extraction module 901, a feature processing module 902, and a target localization module 903.
The image block processing module 900 is configured to determine a plurality of image areas in a current frame according to position information of a target in a history frame before the current frame, and obtain a plurality of first image blocks.
Optionally, the image block processing module 900 in the present disclosure may amplify the target detection frame in the current frame with the center point of the target detection frame in the history frame as the center, to obtain an amplified region; then, the image block processing module 900 determines a plurality of sub-regions in the enlarged region according to the preset scale factor; the image block processing module 900 adjusts the sizes of the plurality of sub-areas to predetermined sizes, respectively, to obtain a plurality of first image blocks.
The trunk feature extraction module 901 is configured to extract, respectively, a second image block in at least one historical frame before the current frame and multiple levels of image features of the multiple first image blocks obtained by the image block processing module 900, and obtain multiple levels of second image features corresponding to each second image block and multiple levels of first image features corresponding to each first image block.
Optionally, the trunk feature extraction module 901 may extract the multi-level image features of the second image block in each history frame to obtain the second pyramid image features corresponding to each second image block, and extract the multi-level image features of each first image block to obtain the first pyramid image features corresponding to each first image block. The first pyramid image features and the second pyramid image features have the same number of levels, and the image features at the same level of the first pyramid image features and the second pyramid image features have the same number of channels and the same spatial resolution.
Optionally, the trunk feature extraction module 901 in the present disclosure is further configured to extract image features of multiple levels of the reference image block, and obtain first reference image features of multiple levels.
The feature processing module 902 is configured to perform feature aggregation processing according to the second image features and the first image features obtained by the trunk feature extracting module 901, so as to obtain aggregate image features corresponding to each first image block. Wherein the aggregate image features have multiple levels.
Optionally, the feature processing module 902 may include an alignment unit 9021 and an aggregation unit 9022. The alignment unit 9021 is configured to perform deformation processing on each second image feature according to each first image feature obtained by the trunk feature extraction module 901, so as to obtain a plurality of deformed second image features. For example, the alignment unit 9021 may first acquire motion information between each first image feature and the plurality of second image features; the alignment unit 9021 then performs deformation processing on the plurality of second image features according to the motion information, thereby obtaining the plurality of deformed second image features. The alignment unit 9021 may acquire the motion information between each first image feature and the plurality of second image features in the following manner: the alignment unit 9021 performs feature stitching on each first image feature and each second image feature based on channel connection, so as to obtain a plurality of stitched image features; the alignment unit 9021 then performs convolution processing on the plurality of stitched image features, respectively, to obtain the motion information between each first image feature and the plurality of second image features.

The aggregation unit 9022 is configured to perform feature aggregation processing on each first image feature and the plurality of deformed second image features obtained by the alignment unit 9021, so as to obtain the aggregate image feature corresponding to each first image block. For example, the aggregation unit 9022 obtains a first weight value of each first image feature and a second weight value of each deformed second image feature according to each first image feature and the plurality of deformed second image features; the first weight value includes a weight value layer corresponding to each level of the first image feature, and the second weight value includes a weight value layer corresponding to each level of the second image feature. The aggregation unit 9022 then determines the aggregate image feature corresponding to each first image block according to each first image feature, the first weight values, each deformed second image feature, and the second weight values. One manner in which the aggregation unit 9022 may obtain the first weight value of each first image feature and the second weight value of each deformed second image feature is as follows: the aggregation unit 9022 first acquires (for example, by using an embedded neural network unit included in the aggregation unit) a first embedded feature of each first image feature and a second embedded feature of each deformed second image feature; the aggregation unit 9022 then calculates the similarity (such as the cosine similarity) between each first embedded feature and each second embedded feature, so as to obtain a plurality of similarities; finally, the aggregation unit 9022 determines the first weight value of each first image feature and the second weight value of each deformed second image feature according to the plurality of similarities.
The object positioning module 903 is configured to determine location information of an object in the current frame according to correlation between image features of reference image blocks in the reference frame and each aggregate image feature obtained by the feature processing module 902.
Optionally, the target positioning module 903 may include a correlation filtering unit 9031 and a position determining unit 9032. The correlation filtering unit 9031 is configured to calculate the Gaussian responses between the reference image feature and each aggregate image feature, respectively; for example, the correlation filtering unit 9031 may be specifically configured to calculate the Gaussian responses between the reference image feature and the fused aggregate image features. The correlation filtering unit 9031 may include a correlation filtering layer. The operations specifically performed by the correlation filtering unit 9031 can be found in the description of S701 in the above method embodiment. The position determining unit 9032 is configured to determine the position information of the target in the current frame according to the first image block corresponding to the maximum Gaussian response value.
Optionally, the apparatus in the present disclosure further includes: the up-sampling unit 904 is configured to perform feature fusion processing on the first reference image features of multiple layers, and determine the image features after the fusion processing; the reference image features are generated from the fused image features. For example, the image features after the fusion processing may be directly used as the reference image features. In addition, the upsampling unit 904 may be further configured to perform feature fusion processing on each of the aggregated image features, and determine the aggregated image features after the fusion processing. The operations specifically performed by the upsampling unit 904 may be referred to the description of S700 and S701 in the above-described method embodiments.
Exemplary electronic device
An electronic device according to an embodiment of the present disclosure is described below with reference to fig. 10. Fig. 10 shows a block diagram of an electronic device according to an embodiment of the disclosure. As shown in fig. 10, the electronic device 101 includes one or more processors 1011 and memory 1012.
The processor 1011 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities and may control other components in the electronic device 101 to perform desired functions.
Memory 1012 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example: random Access Memory (RAM) and/or cache, etc. The nonvolatile memory may include, for example: read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer readable storage medium that can be executed by the processor 1011 to implement the object tracking methods and/or other desired functions of the various embodiments of the present disclosure described above. Various contents such as an input signal, a signal component, a noise component, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device 101 may further include an input device 1013, an output device 1014, and the like, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown). The input device 1013 may include, for example, a keyboard, a mouse, and the like. The output device 1014 can output various information to the outside and may include, for example, a display, speakers, a printer, as well as a communication network and remote output devices connected thereto.
Of course, for simplicity, only some of the components of the electronic device 101 that are relevant to the present disclosure are shown in fig. 10; components such as buses and input/output interfaces are omitted. In addition, the electronic device 101 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods and apparatus described above, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in an object tracking method according to various embodiments of the present disclosure described in the "exemplary methods" section of the present description.
The computer program product may write program code for performing the operations of embodiments of the present disclosure in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform steps in a target tracking method according to various embodiments of the present disclosure described in the above "exemplary method" section of the present disclosure.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium may include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present disclosure have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present disclosure are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present disclosure. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, since the disclosure is not necessarily limited to practice with the specific details described.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different manner from other embodiments, so that the same or similar parts between the embodiments are mutually referred to. For system embodiments, the description is relatively simple as it essentially corresponds to method embodiments, and reference should be made to the description of method embodiments for relevant points.
The block diagrams of the devices, apparatuses, equipment, and systems referred to in this disclosure are merely illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, such devices, apparatuses, equipment, and systems may be connected, arranged, or configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended terms meaning "including but not limited to," and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or," unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the apparatus, devices and methods of the present disclosure, components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered equivalent to the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects, and the like, will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.

Claims (15)

1. A target tracking method, comprising:
determining a plurality of image areas in a current frame according to position information of a target in a history frame before the current frame, and obtaining a plurality of first image blocks;
respectively extracting second image blocks and multi-level image features of the plurality of first image blocks in at least one historical frame before the current frame to obtain multi-level second image features corresponding to the second image blocks and multi-level first image features corresponding to the first image blocks;
performing feature aggregation processing according to the second image features and the first image features to obtain aggregation image features corresponding to each first image block, wherein the aggregation image features have multiple layers;
and determining the position information of the target in the current frame according to the correlation between the image features of the reference image block in the reference frame and each aggregate image feature.
2. The method of claim 1, wherein the determining a plurality of image areas in the current frame according to the position information of the object in the history frame before the current frame to obtain a plurality of first image blocks comprises:
in the current frame, amplifying a target detection frame by taking the center point of the target detection frame in a history frame as the center to obtain an amplified region;
determining a plurality of subareas in the amplifying region according to a preset scale factor;
and respectively adjusting the sizes of the plurality of sub-areas to be preset sizes to obtain a plurality of first image blocks.
3. The method according to claim 1, wherein the extracting the second image blocks in the at least one historical frame before the current frame and the image features of the multiple levels of the first image blocks respectively obtain the second image features of the multiple levels corresponding to each second image block and the first image features of the multiple levels corresponding to each first image block respectively, includes:
extracting multiple layers of image features of the second image blocks in each history frame to obtain second pyramid image features corresponding to each second image block;
extracting multiple layers of image features of each first image block to obtain first pyramid image features corresponding to each first image block;
The first pyramid image features and the second pyramid image features have the same hierarchical number, and the channels of the same layer in the first pyramid image features and the second pyramid image features have the same number and spatial resolution.
4. A method according to any one of claims 1 to 3, wherein the performing feature aggregation processing according to the second image features and the first image features to obtain aggregate image features corresponding to each first image block includes:
respectively carrying out deformation processing on each second image feature according to each first image feature to obtain a plurality of deformed second image features;
and respectively carrying out feature aggregation processing on each first image feature and a plurality of deformed second image features to obtain aggregation image features corresponding to each first image block.
5. The method of claim 4, wherein the deforming each of the second image features according to each of the first image features to obtain a plurality of deformed second image features includes:
acquiring motion information between each first image feature and a plurality of second image features;
and respectively carrying out deformation processing on the plurality of second image features according to the motion information to obtain a plurality of deformed second image features.
6. The method of claim 5, wherein the acquiring motion information between each of the first image features and the plurality of second image features, respectively, comprises:
based on the channel connection, respectively performing feature stitching on each first image feature and each second image feature to obtain a plurality of stitched image features;
and respectively carrying out convolution processing on the plurality of spliced image features to obtain motion information between each first image feature and the plurality of second image features.
7. The method of claim 4, wherein the feature aggregation processing is performed on each first image feature and the plurality of deformed second image features to obtain an aggregate image feature corresponding to each first image block, respectively, and the method comprises:
according to each first image feature and a plurality of deformed second image features, a first weight value of each first image feature and a second weight value of each deformed second image feature are obtained; the first weight value comprises weight value layers corresponding to the image features of each layer of the first image feature, and the second weight value comprises weight value layers corresponding to the image features of each layer of the second image feature;
and determining the aggregation image characteristics corresponding to each first image block according to each first image characteristic, the first weight value, each deformed second image characteristic and the second weight value.
8. The method of claim 7, wherein the obtaining a first weight value for each first image feature and a second weight value for each deformed second image feature from each first image feature and the plurality of deformed second image features comprises:
acquiring first embedded features of each first image feature and second embedded features of each deformed second image feature;
respectively calculating the similarity between each first embedded feature and each second embedded feature to obtain a plurality of similarities;
and determining a first weight value of each first image feature and a second weight value of each deformed second image feature according to the multiple similarities.
9. A method according to any one of claims 1 to 3, wherein said determining the location information of the object in the current frame based on the correlation of the image features of the reference image block in the reference frame with the respective aggregate image features, respectively, comprises:
extracting image features of the reference image block to obtain reference image features;
respectively calculating Gaussian responses between the reference image features and each aggregate image feature;
and determining the position information of the target in the current frame according to the first image block corresponding to the maximum Gaussian response value.
10. The method of claim 9, wherein the extracting image features of the reference image block to obtain reference image features comprises:
extracting image features of multiple layers of reference image blocks to obtain first reference image features of multiple layers;
performing feature fusion processing on the first reference image features of the multiple layers, and determining the image features after the fusion processing; the reference image features are generated by the fused image features;
the computing the gaussian response between the reference image feature and each of the aggregated image features, respectively, comprises:
performing feature fusion processing on each aggregate image feature, and determining the aggregate image features after the fusion processing;
and respectively calculating Gaussian responses between the reference image features and the fused aggregate image features.
11. An object tracking device comprising:
the image block processing module is used for determining a plurality of image areas in the current frame according to the position information of a target in a history frame before the current frame to obtain a plurality of first image blocks;
the trunk feature extraction module is used for respectively extracting the second image blocks in at least one historical frame before the current frame and the multi-level image features of the first image blocks obtained by the image block processing module, and obtaining the multi-level second image features corresponding to the second image blocks and the multi-level first image features corresponding to the first image blocks;
The feature processing module is used for carrying out feature aggregation processing according to the second image features and the first image features obtained by the trunk feature extraction module to obtain aggregation image features corresponding to the first image blocks, wherein the aggregation image features have multiple layers;
and the target positioning module is used for determining the position information of the target in the current frame according to the correlation between the image characteristics of the reference image block in the reference frame and the aggregate image characteristics obtained by the characteristic processing module.
12. The apparatus of claim 11, wherein the feature processing module comprises:
the alignment unit is used for respectively carrying out deformation processing on each second image characteristic according to each first image characteristic obtained by the trunk characteristic extraction module to obtain a plurality of deformed second image characteristics;
and the aggregation unit is used for respectively carrying out feature aggregation processing on each first image feature and the plurality of deformed second image features obtained by the alignment unit to obtain aggregation image features corresponding to each first image block.
13. The apparatus of claim 11 or 12, wherein the stem feature extraction module is further to:
extracting image features of the reference image block to obtain reference image features;
The target positioning module comprises:
a correlation filtering unit, configured to calculate Gaussian responses between the reference image features and the respective aggregate image features;
and the position determining unit is used for determining the position information of the target in the current frame according to the first image block corresponding to the maximum Gaussian response value.
14. A computer readable storage medium storing a computer program for performing the method of any one of the preceding claims 1-10.
15. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor being configured to read the executable instructions from the memory and execute the instructions to implement the method of any of the preceding claims 1-10.
CN201910640796.XA 2019-07-16 2019-07-16 Target tracking method, device, medium and equipment Active CN112241967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910640796.XA CN112241967B (en) 2019-07-16 2019-07-16 Target tracking method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910640796.XA CN112241967B (en) 2019-07-16 2019-07-16 Target tracking method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN112241967A CN112241967A (en) 2021-01-19
CN112241967B true CN112241967B (en) 2023-08-25

Family

ID=74166753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910640796.XA Active CN112241967B (en) 2019-07-16 2019-07-16 Target tracking method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN112241967B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689465A (en) * 2021-07-16 2021-11-23 地平线(上海)人工智能技术有限公司 Method and device for predicting target object, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018137132A1 (en) * 2017-01-24 2018-08-02 SZ DJI Technology Co., Ltd. Method and system for feature tracking using image pyramids
CN108416799A (en) * 2018-03-06 2018-08-17 北京市商汤科技开发有限公司 Method for tracking target and device, electronic equipment, program, storage medium
CN109242769A (en) * 2018-12-13 2019-01-18 腾讯科技(深圳)有限公司 A kind of image processing method and device
CN109961057A (en) * 2019-04-03 2019-07-02 罗克佳华科技集团股份有限公司 A kind of vehicle location preparation method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018137132A1 (en) * 2017-01-24 2018-08-02 SZ DJI Technology Co., Ltd. Method and system for feature tracking using image pyramids
CN108416799A (en) * 2018-03-06 2018-08-17 北京市商汤科技开发有限公司 Method for tracking target and device, electronic equipment, program, storage medium
CN109242769A (en) * 2018-12-13 2019-01-18 腾讯科技(深圳)有限公司 A kind of image processing method and device
CN109961057A (en) * 2019-04-03 2019-07-02 罗克佳华科技集团股份有限公司 A kind of vehicle location preparation method and device

Also Published As

Publication number Publication date
CN112241967A (en) 2021-01-19

Similar Documents

Publication Publication Date Title
CN108961327B (en) Monocular depth estimation method and device, equipment and storage medium thereof
US9305240B2 (en) Motion aligned distance calculations for image comparisons
US11049270B2 (en) Method and apparatus for calculating depth map based on reliability
US20190220992A1 (en) Object pose estimating method and apparatus
TW202008308A (en) Method, device and apparatus for monocular image depth estimation, program and storage medium thereof
CN110473137A (en) Image processing method and device
EP3767524A1 (en) Method and apparatus for detecting object
US11568631B2 (en) Method, system, and non-transitory computer readable record medium for extracting and providing text color and background color in image
US20220148284A1 (en) Segmentation method and segmentation apparatus
CN112819875B (en) Monocular depth estimation method and device and electronic equipment
CN112435223B (en) Target detection method, device and storage medium
US11367206B2 (en) Edge-guided ranking loss for monocular depth prediction
CN113592706B (en) Method and device for adjusting homography matrix parameters
CN112241967B (en) Target tracking method, device, medium and equipment
EP4187483A1 (en) Apparatus and method with image processing
EP4174774A1 (en) Planar surface detection apparatus and method
CN107967691B (en) Visual mileage calculation method and device
Kim et al. View invariant action recognition using generalized 4D features
Lei et al. Convolutional restricted Boltzmann machines learning for robust visual tracking
CN111899284B (en) Planar target tracking method based on parameterized ESM network
Yang et al. A robust scheme for copy detection of 3D object point clouds
Li et al. Deblurring traffic sign images based on exemplars
US20240135632A1 (en) Method and appratus with neural rendering based on view augmentation
CN112509047B (en) Pose determining method and device based on image, storage medium and electronic equipment
CN113592940B (en) Method and device for determining target object position based on image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant