WO2018086607A1 - Target tracking method, electronic device, and storage medium - Google Patents


Info

Publication number
WO2018086607A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
target
feature vector
candidate
region
Prior art date
Application number
PCT/CN2017/110577
Other languages
French (fr)
Chinese (zh)
Inventor
唐矗 (Tang Chu)
Original Assignee
纳恩博(北京)科技有限公司 (Ninebot (Beijing) Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 纳恩博(北京)科技有限公司
Publication of WO2018086607A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Definitions

  • the present invention relates to the field of electronic technologies, and in particular, to a target tracking method, an electronic device, and a storage medium.
  • visual tracking based on online learning has risen in recent years and become a research hot spot in visual tracking.
  • such a method extracts feature templates from the tracking target specified in the initial frame image, without any prior offline learning; the trained model is used to track the target in subsequent video frames, and during tracking the model is updated according to the tracking status to accommodate changes in the target's posture.
  • This type of method does not require any offline training, and can track any object specified by the user, which has high versatility.
  • the embodiments of the present invention provide a target tracking method, an electronic device, and a storage medium to solve the technical problems of prior-art online-learning visual tracking methods, namely that it is impossible to determine whether the tracking target is lost and that it is difficult to retrieve the tracking target once it is lost.
  • the present invention provides the following technical solutions through an embodiment of the present invention:
  • a target tracking method is applied to an electronic device, wherein the electronic device has an image acquisition unit, and the image acquisition unit is configured to acquire image data, and the method includes:
  • a candidate target having the highest similarity with the tracking target among the plurality of candidate targets is determined as the tracking target.
  • the determining a tracking target in the initial frame image of the image data comprises:
  • the extracting a plurality of candidate targets in the subsequent frame image of the image data comprises:
  • the plurality of candidate targets are determined within the ith image block.
  • the calculating the similarity between each candidate target and the tracking target comprises:
  • the calculating the first color feature vector of the first candidate target and calculating the second color feature vector of the tracking target comprises:
  • the calculating a color feature vector of each region in the first mask image; and calculating a color feature vector of each region in the second mask image comprises:
  • W is a positive integer
  • the projection weight of the first pixel on each of the W main colors is calculated based on the following equation:
  • the first pixel is any one pixel of the first region or the second region; the nth main color is any one of the W main colors; w_n is the projection weight of the first pixel on the nth main color; I_r, I_g, and I_b are the RGB values of the first pixel; and R_n, G_n, and B_n are the RGB values of the nth main color.
  • the calculating the similarity between each candidate target and the tracking target comprises:
  • the determining the plurality of candidate targets in the ith image block comprises:
  • the calculating the similarity between each candidate target and the tracking target comprises:
  • the present invention provides the following technical solutions through an embodiment of the present invention:
  • a first determining unit configured to determine a tracking target in an initial frame image of the image data
  • An extracting unit configured to extract a plurality of candidate targets in a subsequent frame image of the image data,
  • the subsequent frame image is any frame image subsequent to the initial frame image;
  • a calculating unit configured to calculate a similarity between each candidate target and the tracking target
  • the second determining unit is configured to determine, as the tracking target, a candidate target that has the highest similarity with the tracking target among the plurality of candidate targets.
  • the first determining unit includes:
  • a first determining subunit configured to acquire a user's selection operation after outputting the initial frame image through the display screen; determining the tracking target in the initial frame image based on a user's selection operation;
  • a second determining subunit configured to acquire feature information for describing the tracking target; and determining the tracking target in the initial frame image based on the feature information.
  • the extracting unit comprises:
  • a first determining subunit configured to determine the (i-1)th bounding frame of the tracking target in the (i-1)th frame image, wherein the (i-1)th frame image belongs to the image data, i is an integer greater than or equal to 2, and the (i-1)th frame image is the initial frame image when i is equal to 2;
  • a second determining subunit configured to determine an i-th image block in the i-th frame image based on the (i-1)th bounding frame, wherein the i-th frame image is the subsequent frame image, the center of the i-th image block is the same as the center position of the (i-1)th bounding frame, and the area of the i-th image block is larger than the area of the (i-1)th bounding frame;
  • a third determining subunit configured to determine the plurality of candidate targets within the ith image block.
  • the calculating unit comprises:
  • a first selection sub-unit configured to select a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
  • a first calculation subunit configured to calculate a first color feature vector of the first candidate target, and calculate a second color feature vector of the tracking target;
  • a second calculation subunit configured to calculate the distance between the first color feature vector and the second color feature vector, wherein the distance is the similarity between the first candidate target and the tracking target.
  • the first calculating subunit is further configured to:
  • the first calculating subunit is further configured to:
  • determine W main colors, W being a positive integer; calculate a projection weight of each pixel in the first region of the first mask image on each of the main colors, the first region being any one of the M regions in the first mask image; calculate a projection weight of each pixel in the second region of the second mask image on each of the main colors, the second region being any one of the M regions in the second mask image; obtain a W-dimensional color feature vector corresponding to each pixel in the first region based on the projection weight of each pixel in the first region on each of the main colors; obtain a W-dimensional color feature vector corresponding to each pixel in the second region based on the projection weight of each pixel in the second region on each of the main colors; normalize the W-dimensional color feature vector corresponding to each pixel in the first region to obtain the color feature vector of each pixel in the first region; and normalize the W-dimensional color feature vector corresponding to each pixel in the second region to obtain the color feature vector of each pixel in the second region.
  • the first calculating subunit is further configured to calculate the projection weight of the first pixel on each of the W main colors based on the following equation:
  • the first pixel is any one pixel of the first region or the second region; the nth main color is any one of the W main colors; w_n is the projection weight of the first pixel on the nth main color; I_r, I_g, and I_b are the RGB values of the first pixel; and R_n, G_n, and B_n are the RGB values of the nth main color.
  • the calculating unit comprises:
  • a second selection subunit configured to select a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
  • a normalization subunit configured to normalize the image of the first candidate target and the image of the tracking target to the same size;
  • a first input subunit configured to input an image of the tracking target into a first convolution network of a first deep neural network for feature calculation, to obtain a feature vector of the tracking target, wherein the first deep neural network is based on the Siamese structure;
  • a second input subunit configured to input an image of the first candidate target into a second convolution network of the first deep neural network to perform feature calculation, to obtain a feature vector of the first candidate target, wherein the second convolution network and the first convolution network share convolution layer parameters;
  • a third input subunit configured to input a feature vector of the tracking target and a feature vector of the first candidate target into a first fully connected network of the first deep neural network to perform a similarity calculation, to obtain the similarity between the first candidate target and the tracking target.
  • the third determining subunit is further configured to:
  • the calculating unit comprises:
  • an extraction subunit configured to extract a feature vector of the first candidate target from the feature vectors of the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
  • a fourth input subunit configured to input an image of the tracking target into a fourth convolution network of the second deep neural network to perform feature calculation, to obtain a feature vector of the tracking target, wherein the fourth convolution network and the third convolution network share convolution layer parameters;
  • a fifth input subunit configured to input a feature vector of the tracking target and a feature vector of the first candidate target into a second fully connected network of the second deep neural network to perform a similarity calculation, to obtain the similarity between the first candidate target and the tracking target.
  • the present invention provides the following technical solutions through an embodiment of the present invention:
  • An electronic device comprising: a processor and a memory for storing a computer program executable on the processor, wherein the processor is operative to perform the steps of the method described above when the computer program is run.
  • the present invention provides the following technical solutions through an embodiment of the present invention:
  • a computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the method described above.
  • a tracking target is determined in an initial frame image of the image data; a plurality of candidate targets are extracted in a subsequent frame image of the image data; the similarity between each candidate target and the tracking target is calculated; and the candidate target with the highest similarity is determined as the tracking target.
  • the candidate target of each frame image is compared with the tracking target in the initial frame image, and the candidate target with the highest similarity among the candidate targets is determined as the tracking target, thereby implementing tracking of the tracking target.
  • the tracking method in the embodiment of the present invention can reliably determine whether the tracking target is lost, and does not need to maintain a tracking template, thereby avoiding the continuous amplification of errors caused by continuously updating the template; this is beneficial to recovering a lost tracking target and improves the robustness of the tracking system.
  • FIG. 1 is a flowchart of a target tracking method according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of an initial frame image in an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of an initial tracking target in an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of an image of a second frame in an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of candidate objects determined in a second frame image according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a first deep neural network according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a second deep neural network according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
  • the embodiments of the present invention provide a target tracking method and device to solve the technical problems of prior-art online-learning visual tracking methods, namely that it is impossible to determine whether the tracking target is lost and that it is difficult to retrieve the tracking target after it is lost.
  • a target tracking method is applied to an electronic device, wherein the electronic device has an image acquisition unit, and the image acquisition unit is configured to acquire image data, the method comprising: determining a tracking target in the initial frame image of the image data; extracting a plurality of candidate targets in a subsequent frame image, the subsequent frame image being any frame image after the initial frame image; calculating the similarity between each candidate target and the tracking target; and determining the candidate target with the highest similarity to the tracking target as the tracking target.
  • the embodiment provides a target tracking method, which is applied to an electronic device, and the electronic device may be: a ground robot (for example, a balance vehicle), or a drone (for example, a multi-rotor drone or a fixed-wing drone); the specific form of the electronic device is not limited in this embodiment.
  • the electronic device has an image acquisition unit (for example, a camera), and the image acquisition unit is configured to collect image data.
  • the target tracking method includes:
  • Step S101: Determine a tracking target in the initial frame image of the image data.
  • step S101 includes:
  • an image acquired by the image acquisition unit may be obtained and output (for example, the initial frame image 300) through a display screen provided on the electronic device; a selection operation performed by the user is then acquired (for example, when the display is a touch screen, the selection operation is acquired through the touch screen), and the tracking target (ie, the initial tracking target 000) is determined in the initial frame image 300 based on the selection operation.
  • the feature information for describing the tracking target is acquired, and the tracking target (ie, the initial tracking target 000) is determined in the initial frame image 300 in conjunction with a saliency detection or an object detection algorithm.
  • the image 311 of the initial tracking target 000 can be extracted and saved for backup, and the image 311 is the image in the first bounding frame 310.
  • Step S102: Extract a plurality of candidate targets in the subsequent frame image of the image data, where the subsequent frame image is any frame image after the initial frame image.
  • step S102 includes:
  • the (i-1)th bounding frame of the tracking target is determined in the (i-1)th frame image (the (i-1)th frame image belongs to the image data, i is an integer greater than or equal to 2, and when i is equal to 2, the (i-1)th frame image is the initial frame image); based on the (i-1)th bounding box, the i-th image block is determined in the i-th frame image, wherein the i-th frame image is the subsequent frame image, the center of the i-th image block is the same as the center of the (i-1)th bounding box, and the area of the i-th image block is larger than the area of the (i-1)th bounding box; and a plurality of candidate targets are determined within the i-th image block.
  • FIG. 2 is an initial frame image including a plurality of person targets, and the tracking target to be tracked is a person in the first bounding box 310.
  • FIG. 4 is a second frame image in which the position or posture of each character object is changed.
  • the bounding frame (ie, the first bounding box 310) of the tracking target (ie, the initial tracking target 000) is determined in the initial frame image 300; the bounding box is generally rectangular and just surrounds the tracking target (ie, the initial tracking target 000).
  • based on the position of the first bounding frame 310 (the position of the first bounding frame 310 in the initial frame image 300 is taken as the same position in the second frame image 400), a second image block 420 is determined in the second frame image 400; the center of the second image block 420 is the same as that of the first bounding frame 310, but the area of the second image block 420 is larger than that of the first bounding frame 310.
  • there may be a plurality of targets in the second image block 420, among which is the tracking target determined in the initial frame image 300 (ie, the initial tracking target 000); methods such as saliency analysis or target detection can be used to determine the plurality of targets in the second image block 420 and take them as candidate targets (ie, candidate target 401, candidate target 402, candidate target 403, and candidate target 404). Further, based on steps S103 to S104, the tracking target is determined from among the candidate targets, that is, the initial tracking target 000 is identified in the second frame image.
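The construction of an image block that shares the previous bounding frame's center but has a larger area can be sketched as follows; the expansion factor and the clamping to the image borders are assumptions for illustration, since the text does not fix them:

```python
def search_block(bbox, img_w, img_h, scale=2.0):
    """Given the previous frame's bounding box (cx, cy, w, h), return an
    image block (x0, y0, x1, y1) with the same center but `scale`-times
    larger sides, clamped to the image borders. `scale` is an assumed
    parameter, not a value taken from the patent."""
    cx, cy, w, h = bbox
    bw, bh = w * scale, h * scale
    x0 = max(0.0, cx - bw / 2)
    y0 = max(0.0, cy - bh / 2)
    x1 = min(float(img_w), cx + bw / 2)
    y1 = min(float(img_h), cy + bh / 2)
    return (x0, y0, x1, y1)
```

Candidate targets would then be extracted from within this block, e.g. via saliency analysis or target detection, as the text describes.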
  • specific embodiments of steps S103 to S104 will be described in detail later.
  • similarly, for the third frame image, the bounding frame of the tracking target in the second frame image 400 (ie, the second bounding frame) is determined; based on the second bounding frame, an image block (ie, a third image block) is determined in the third frame image, wherein the center of the third image block is the same as that of the second bounding frame, but the area of the third image block is larger than that of the second bounding frame.
  • there may be multiple targets in the third image block, among which is the tracking target determined in the initial frame image; saliency analysis or target detection may be used to determine a plurality of targets in the third image block and take them as candidate targets. Further, based on steps S103 to S104, the tracking target is determined from among the candidate targets, that is, the initial tracking target 000 is identified in the third frame image.
  • similarly, the fourth image block is determined in the fourth frame image, a plurality of candidate targets are determined within the fourth image block, and the tracking target (ie, the initial tracking target 000) is determined from among the candidate targets based on steps S103 to S104.
  • in this way, a plurality of candidate targets are determined in each frame image, and the tracking target (ie, the initial tracking target 000) is determined from among the candidate targets based on steps S103 to S104, thereby achieving identification and tracking of the tracking target.
  • images of each candidate target are extracted and saved for backup.
  • the image 421 of the candidate target 401, the image 422 of the candidate target 402, the image 423 of the candidate target 403, and the image of the candidate target 404 are extracted and saved.
  • Step S103: Calculate the similarity between each candidate target and the tracking target.
  • the similarity of each candidate target to the tracking target is calculated.
  • the tracking target is an initial tracking target 000 (shown in FIG. 3) determined in the initial frame image 300
  • the candidate target is from the i-th image block in the i-th frame image, and the i-th frame image is a subsequent frame image (ie, any frame image after the initial frame image).
  • the candidate target includes the candidate target 401, the candidate target 402, the candidate target 403, and the candidate target 404 determined in the second frame image 400.
  • the target re-identification algorithm can be used to calculate the similarity between each candidate target and the tracking target.
  • the following three embodiments are available for step S103.
  • Manner 1: Calculate the similarity between each candidate target and the tracking target by using a color-feature-based target re-identification algorithm.
  • step S103 includes:
  • first, the color feature vector of the initial tracking target 000 is calculated, wherein the initial tracking target 000 is the tracking target determined in the initial frame image 300 (as shown in FIG. 5); then, the color feature vector of the candidate target 401 is calculated; finally, the distance between the color feature vector of the initial tracking target 000 and the color feature vector of the candidate target 401 is calculated, which represents the similarity between the candidate target 401 and the initial tracking target 000.
  • the similarity between the candidate target 402, the candidate target 403, the candidate target 404, and the initial tracking target 000 is calculated separately.
  • the distance between the first color feature vector and the second color feature vector may be calculated based on the Euclidean distance formula.
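The Euclidean-distance similarity and the selection of the best candidate in steps S103 to S104 can be sketched as:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two color feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def best_candidate(target_vec, candidate_vecs):
    """Return the index of the candidate whose color feature vector is
    closest to the tracking target's, i.e. the highest-similarity one."""
    return min(range(len(candidate_vecs)),
               key=lambda i: euclidean(target_vec, candidate_vecs[i]))
```

Smaller distance means higher similarity, so the candidate at minimum distance is determined as the tracking target.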
  • the calculating the first color feature vector of the first candidate target and calculating the second color feature vector of the tracking target comprises:
  • principal component segmentation is performed on the image of the first candidate target to obtain a first mask image, and the image of the tracking target is subjected to saliency segmentation to obtain a second mask image; the first mask image and the second mask image are scaled to the same size; the first mask image is equally divided into M regions, and the second mask image is equally divided into M regions, M being a positive integer; the color feature vector of each region in the first mask image is calculated, and the color feature vector of each region in the second mask image is calculated; the color feature vectors of the regions in the first mask image are sequentially connected to obtain the first color feature vector; and the color feature vectors of the regions in the second mask image are sequentially connected to obtain the second color feature vector.
  • specifically, the image 311 of the initial tracking target 000 may first be subjected to principal component segmentation to obtain a second mask image; in the mask image, only the principal component area keeps pixel values consistent with the original image, and all other regions have a pixel value of 0.
  • the image 311 of the initial tracking target 000 is a rectangle that just surrounds the initial tracking target 000.
  • the second mask image is scaled to a preset size, and then the second mask image is equally divided into four regions (upper and lower halved, left and right halved), and then the color eigenvectors of each of the four regions are respectively calculated.
  • the color feature vectors of the four regions are sequentially connected (if the color feature vector of each region is a 10-dimensional vector, the sequential connection yields a 40-dimensional vector), and the color feature vector of the tracking target (ie, the initial tracking target 000), that is, the second color feature vector, is obtained after normalization.
  • similarly, the image 421 of the candidate target 401 may be subjected to principal component segmentation to obtain a first mask image, wherein the image 421 is rectangular and surrounds the candidate target 401; the first mask image is then also scaled to the preset size, the same size as the second mask image, and equally divided into four regions (halved top and bottom, halved left and right); the color feature vector of each of the four regions is calculated separately; finally, the color feature vectors of the four regions are sequentially connected (if the color feature vector of each region is a 10-dimensional vector, the sequential connection yields a 40-dimensional vector), and the color feature vector of the candidate target 401 (ie, the first color feature vector) is obtained.
  • the color feature vector of the candidate target 402, the color feature vector of the candidate target 403, and the color feature vector of the candidate target 404 are respectively calculated.
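The region-splitting and concatenation steps above can be sketched in pure Python as follows; the per-region color descriptor is passed in as a function, since the main-color projection it would compute is detailed separately in this document:

```python
def quarter_regions(img):
    """Split an image (a list of rows of pixels) into four equal regions:
    top-left, top-right, bottom-left, bottom-right."""
    h, w = len(img), len(img[0])
    hh, hw = h // 2, w // 2
    crop = lambda r0, r1, c0, c1: [row[c0:c1] for row in img[r0:r1]]
    return [crop(0, hh, 0, hw), crop(0, hh, hw, w),
            crop(hh, h, 0, hw), crop(hh, h, hw, w)]

def image_color_vector(img, region_descriptor):
    """Sequentially connect the color feature vectors of the four regions,
    e.g. four 10-dimensional vectors become one 40-dimensional vector."""
    out = []
    for region in quarter_regions(img):
        out.extend(region_descriptor(region))
    return out
```

A final normalization of the concatenated vector, as the text describes, would follow this step.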
  • the calculating a color feature vector of each region in the first mask image; and calculating a color feature vector of each region in the second mask image includes:
  • determining W main colors, W being a positive integer; calculating a projection weight of each pixel in the first region of the first mask image on each of the main colors, the first region being any one of the M regions in the first mask image; calculating a projection weight of each pixel in the second region of the second mask image on each of the main colors, the second region being any one of the M regions in the second mask image; obtaining a W-dimensional color feature vector corresponding to each pixel in the first region based on the projection weight of each pixel in the first region on each of the main colors; obtaining a W-dimensional color feature vector corresponding to each pixel in the second region based on the projection weight of each pixel in the second region on each of the main colors; normalizing the W-dimensional color feature vector corresponding to each pixel in the first region to obtain the color feature vector of each pixel in the first region; normalizing the W-dimensional color feature vector corresponding to each pixel in the second region to obtain the color feature vector of each pixel in the second region; and adding the color feature vectors of all pixels in a region to obtain the color feature vector of that region.
  • similarly, the first mask image is equally divided into four regions (halved top and bottom, halved left and right), any one of the four regions (ie, the first region) is selected, and its color feature vector is calculated in the same manner.
  • the second mask image is equally divided into four regions (halved top and bottom, halved left and right).
  • to calculate the color feature vector of each region in the second mask image, first, any one of the four regions (ie, the second region) is selected, and the projection weight of each pixel in the second region on each of the main colors is calculated.
  • each pixel thus yields a 10-dimensional color feature vector, which is then normalized and used as the color feature vector of that pixel; after the color feature vectors of all the pixels in the second region are obtained, they are added together to obtain the color feature vector of the second region. Based on this method, the color feature vector of each of the four regions in the second mask image can be calculated.
  • the projection weight of the first pixel on each of the W main colors can be calculated based on the following equation:
  • the first pixel is any one pixel of the first region or the second region; the nth main color is any one of the W main colors; w_n is the projection weight of the first pixel on the nth main color; I_r, I_g, and I_b are the RGB values of the first pixel; and R_n, G_n, and B_n are the RGB values of the nth main color.
  • for example, suppose n indexes the above 10 main colors and the 2nd main color is yellow; then w_2 is the projection weight of the pixel on yellow, R_2, G_2, and B_2 are the RGB values of yellow, and I_r, I_g, and I_b are the RGB values of the pixel.
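The per-pixel projection onto W main colors and the per-region summation described above can be sketched as follows. Note that the patent's actual projection equation did not survive into this text; the inverse-RGB-distance weighting below is only an assumed stand-in, normalized per pixel as the text requires:

```python
import math

def projection_weights(pixel, main_colors):
    """W-dimensional color feature vector of one pixel: one weight per main
    color. ASSUMPTION: the patent's exact equation is unavailable here, so
    we use inverse RGB distance and normalize the weights to sum to 1."""
    raw = []
    for (Rn, Gn, Bn) in main_colors:
        d = math.sqrt((pixel[0] - Rn) ** 2
                      + (pixel[1] - Gn) ** 2
                      + (pixel[2] - Bn) ** 2)
        raw.append(1.0 / (1.0 + d))
    s = sum(raw)
    return [w / s for w in raw]

def region_color_vector(region_pixels, main_colors):
    """Add the normalized per-pixel vectors to get the region's vector."""
    total = [0.0] * len(main_colors)
    for px in region_pixels:
        for n, w in enumerate(projection_weights(px, main_colors)):
            total[n] += w
    return total
```

Whatever the true equation, the structure is the same: each pixel gets a W-dimensional weight vector, the vector is normalized, and the region's vector is the sum over its pixels.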
  • Manner 2: Calculate the similarity between each candidate target and the tracking target by using a deep-neural-network-based target re-identification algorithm. In this case, step S103 includes:
  • a first candidate target is selected from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets; the image of the first candidate target and the image of the tracking target are normalized to the same size; the image of the tracking target is input through the first input terminal 611 into the first convolution network 601 of the first deep neural network for feature calculation to obtain the feature vector of the tracking target, wherein the first deep neural network is based on the Siamese structure; the image of the first candidate target is input through the second input terminal 612 into the second convolution network 602 of the first deep neural network for feature calculation to obtain the feature vector of the first candidate target, wherein the second convolution network 602 and the first convolution network 601 share convolution layer parameters (that is, the convolution layer parameters are the same); and the feature vector of the tracking target and the feature vector of the first candidate target are input into the first fully connected network 603 of the first deep neural network to obtain the similarity between the first candidate target and the tracking target, wherein the outputs of the first convolution network 601 and the second convolution network 602 automatically serve as inputs to the first fully connected network 603.
  • before use, the first deep neural network needs to be trained offline (as shown in FIG. 6). The first deep neural network includes a first convolution network 601, a second convolution network 602, a first fully connected network 603, a first input terminal 611, a second input terminal 612, and a first output terminal 621, wherein the first convolution network 601 and the second convolution network 602 form a two-branch deep neural network adopting the Siamese structure, and each branch adopts the network structure of AlexNet before FC6. Both the first convolution network 601 and the second convolution network 602 contain a plurality of convolution layers, and the convolution layers of the two networks are mutually shared, with the same parameters.
  • the images input into the first convolutional network 601 and the second convolutional network 602 need to be normalized to the same size. The normalized image of the tracking target is input into the first convolutional network 601, and the feature vector of the tracking target can be obtained; the normalized image of the first candidate target is input into the second convolutional network 602, and the feature vector of the first candidate target can be obtained.
  • the first convolutional network 601 and the second convolutional network 602 are both connected to the first fully connected network 603.
  • the first fully connected network 603 includes a plurality of fully connected layers and is configured to calculate the distance between the feature vectors input from the two branches, i.e., the similarity between the first candidate target and the tracking target.
  • the parameters of the first deep neural network are obtained through offline learning, and the method of training the first deep neural network is consistent with the training method of a general convolutional neural network. After offline training is finished, the first deep neural network can be used in the tracking system.
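The two-branch computation described above can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation: the AlexNet-before-FC6 backbone of networks 601/602 is reduced to a single shared linear feature extractor with random placeholder weights, the fully connected network 603 is a two-layer head, and all names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared backbone: both branches use the SAME weights, mirroring the
# shared convolutional-layer parameters of networks 601 and 602.
W_feat = rng.standard_normal((64, 32 * 32 * 3)) * 0.01

def extract_features(image):
    """Stand-in for the shared backbone (601/602): flatten, project, ReLU."""
    return np.maximum(W_feat @ image.reshape(-1), 0.0)

# Fully connected similarity head (stand-in for network 603).
W_fc1 = rng.standard_normal((16, 128)) * 0.01
W_fc2 = rng.standard_normal((1, 16)) * 0.01

def similarity(img_a, img_b):
    """Concatenate both feature vectors and squash the score to (0, 1)."""
    fa, fb = extract_features(img_a), extract_features(img_b)
    h = np.maximum(W_fc1 @ np.concatenate([fa, fb]), 0.0)
    return float(1.0 / (1.0 + np.exp(-(W_fc2 @ h)[0])))

# Both inputs must first be normalized to the same size (here 32x32x3).
target = rng.random((32, 32, 3))
candidate = rng.random((32, 32, 3))
s = similarity(target, candidate)
assert 0.0 < s < 1.0
```

Because the backbone weights are shared, identical inputs always map to identical feature vectors, which is the property the Siamese structure relies on.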
  • the image 421 of the candidate target 401 and the image 311 of the initial tracking target 000 may be first normalized to the same size;
  • the image 311 of the initial tracking target 000 is input into the first convolutional network 601 to obtain the feature vector of the initial tracking target 000, and the image 421 of the candidate target 401 is input into the second convolutional network 602 to obtain the feature vector of the candidate target 401; the feature vector of the initial tracking target 000 and the feature vector of the candidate target 401 are input into the first fully connected network 603 to obtain the similarity between the candidate target 401 and the initial tracking target 000.
  • similarly, the image 422 of the candidate target 402 is normalized to the same size as the image 311 corresponding to the initial tracking target 000; the image 311 of the initial tracking target 000 is input into the first convolutional network 601, and the image 422 of the candidate target 402 is input into the second convolutional network 602, to obtain the similarity between the candidate target 402 and the initial tracking target 000.
  • in the same way, the similarity between the candidate target 403 and the initial tracking target 000, and the similarity between the candidate target 404 and the initial tracking target 000, can be obtained.
  • Manner 3: using a deep neural network to simultaneously generate the candidate targets and calculate the similarity between each candidate target and the tracking target.
  • in Manner 3, a second deep neural network, as shown in FIG. 7, may be utilized.
  • the second deep neural network may be trained offline; the second deep neural network is based on the Siamese structure, and includes a third convolutional network 604, a fourth convolutional network 605, an RPN (Region Proposal Network) 607, a second fully connected network 606, a third input terminal 613, a fourth input terminal 614, and a second output terminal 622.
  • the output of the third convolutional network 604 is input to the RPN network 607, and the fourth convolutional network 605 and the RPN network 607 are simultaneously connected to the second fully connected network 606.
  • the third convolutional network 604 includes a plurality of convolutional layers for performing feature calculation on the i-th image block; the third convolutional network 604 is used to obtain the feature map of the i-th image block, and the RPN network 607 is configured to extract, based on the feature map of the i-th image block, a plurality of candidate targets from the i-th image block and calculate the feature vector of each candidate target.
  • the second deep neural network shown in FIG. 7 differs from the first deep neural network shown in FIG. 6 mainly in its lower half: the third convolutional network 604 in FIG. 7 takes the i-th image block as its input, and an RPN network 607 is additionally added, which extracts the candidate targets on the feature map obtained after the i-th image block is processed by the third convolutional network 604.
  • the RPN network 607 performs its calculation directly on the feature map produced by the third convolutional network 604; after the calculation, it finds the position corresponding to each candidate target on the feature map and acquires the feature vector of each candidate target directly from the feature map; the feature vector of each candidate target, together with the feature vector corresponding to the initial tracking target 000, is then input into the second fully connected network 606 to calculate the similarity.
  • the i-th image block may be input into the third convolutional network 604 of the second deep neural network through the fourth input terminal 614 for feature calculation, to obtain the feature map of the i-th image block; the feature map of the i-th image block is then input into the RPN network 607 of the second deep neural network for feature calculation, a plurality of candidate targets are extracted, and the feature vector of each candidate target is calculated.
  • for example, the second image block 420 can be input into the third convolutional network 604 of the second deep neural network to obtain the feature map of the second image block 420, and the feature map of the second image block 420 is input into the RPN network 607, from which a plurality of candidate targets (i.e., candidate target 401, candidate target 402, candidate target 403, and candidate target 404) are extracted; the feature vector of each candidate target can also be obtained.
  • in Manner 3, step S103 includes: the convolutional layers in the fourth convolutional network 605 and the third convolutional network 604 share the convolutional layer parameters, i.e., their convolutional layer parameters are the same.
  • the second deep neural network includes, in addition to the third convolutional network 604 and the RPN network 607, a fourth convolutional network 605 and a second fully connected network 606.
  • the RPN network 607 is configured to extract a plurality of candidate targets based on the feature map output by the third convolutional network 604, calculate the feature vector of each candidate target, and input the feature vector of each candidate target into the second fully connected network 606 in sequence; the fourth convolutional network 605 is configured to calculate the feature vector of the tracking target and output it to the second fully connected network 606; and the second fully connected network 606 is configured to calculate the similarity between the first candidate target and the tracking target based on the feature vector of the first candidate target and the feature vector of the tracking target.
  • for example, the feature vector of each candidate target can be obtained through the calculation of the third convolutional network 604 and the RPN network 607; the image 311 corresponding to the initial tracking target 000 is input into the fourth convolutional network 605 of the second deep neural network, and the similarity between the candidate target 401 and the initial tracking target 000 can then be calculated by the second fully connected network 606.
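The efficiency point above — one backbone pass over the i-th image block yields the feature vectors of all candidate targets — can be illustrated with a toy sketch. The feature map values, the proposal boxes, and the average-pooling readout below are all placeholder assumptions; the patent's RPN 607 itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(1)

# Feature map of the i-th image block, as a stand-in for the output of
# network 604: (channels, height, width); values are placeholders.
feature_map = rng.random((8, 16, 16))

def roi_feature(fmap, box):
    """Read a candidate's feature vector directly off the shared feature
    map (as RPN 607 does), here by average-pooling the box region."""
    x0, y0, x1, y1 = box
    return fmap[:, y0:y1, x0:x1].mean(axis=(1, 2))

# Hypothetical proposals, given in feature-map coordinates.
proposals = [(0, 0, 8, 8), (4, 4, 12, 12), (8, 8, 16, 16)]
candidate_vectors = [roi_feature(feature_map, b) for b in proposals]

# One backbone pass produced all candidate vectors: no per-candidate
# re-computation of the convolutional features is needed.
assert len(candidate_vectors) == 3
assert all(v.shape == (8,) for v in candidate_vectors)
```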
  • Step S104: determine the candidate target having the highest similarity with the tracking target among the plurality of candidate targets as the tracking target.
  • the candidate target with the highest similarity can be used as the tracking target; for example, if the candidate target 402 has the highest similarity, the candidate target 402 continues to be tracked as the tracking target.
  • the above mainly takes the second frame image 400 as an example: for each candidate target in the second image block 420 of the second frame image 400, the similarity between that candidate target and the initial tracking target 000 is calculated separately, and the candidate target with the highest similarity is used as the tracking target in the second frame image. The subsequent frame images (for example, the third frame image, the fourth frame image, the fifth frame image, and so on) are handled in the same way: the similarity between each candidate target in each frame image and the initial tracking target 000 is calculated, and the candidate target with the highest similarity is used as the tracking target in that frame image.
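The per-frame selection of steps S103 and S104 reduces to an argmax over candidate similarities. A minimal pure-Python sketch follows; the candidate labels and similarity scores are hypothetical.

```python
def select_tracking_target(candidates, similarity_to_target):
    """Step S104: among a frame's candidates, keep the one whose
    similarity to the initial tracking target is highest."""
    return max(candidates, key=similarity_to_target)

# Hypothetical similarities for candidates 401-404 in one frame.
scores = {401: 0.31, 402: 0.87, 403: 0.12, 404: 0.55}
best = select_tracking_target(scores.keys(), scores.get)
assert best == 402  # candidate 402 continues to be tracked
```

Because every frame is compared against the initial tracking target rather than against a continuously updated template, a low maximum score can also be read as a signal that the target may be lost in that frame.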
  • in the target tracking method of the embodiment of the present invention, the processing of each frame can be regarded as determining whether the target is lost, which is reliable; it does not need to maintain a tracking template, which avoids the continuous amplification of errors caused by continuously updating a tracking template, and is beneficial to recovering the tracking target, thereby improving the robustness of the tracking system.
  • the embodiment provides an electronic device, which has an image acquisition unit, and the image acquisition unit is configured to collect image data.
  • the electronic device includes:
  • the first determining unit 801 is configured to determine a tracking target in the initial frame image of the image data
  • the extracting unit 802 is configured to extract a plurality of candidate targets in the subsequent frame image of the image data, where the subsequent frame image is any frame image subsequent to the initial frame image;
  • the calculating unit 803 is configured to calculate a similarity between the candidate target and the tracking target;
  • the second determining unit 804 is configured to determine a candidate target that has the highest similarity with the tracking target among the plurality of candidate targets as the tracking target.
  • the first determining unit 801 includes:
  • a first determining subunit configured to acquire a user's selection operation after outputting the initial frame image through the display screen; determining a tracking target in the initial frame image based on the user's selection operation;
  • a second determining subunit configured to acquire feature information for describing the tracking target; and determining a tracking target in the initial frame image based on the feature information.
  • the extracting unit 802 includes:
  • a first determining subunit configured to determine an (i-1)-th bounding frame of the tracking target in the (i-1)-th frame image, wherein the (i-1)-th frame image belongs to the image data, and i is an integer greater than or equal to 2; when i is equal to 2, the (i-1)-th frame image is the initial frame image;
  • a second determining subunit configured to determine an i-th image block in the i-th frame image based on the (i-1)-th bounding frame, wherein the i-th frame image is the subsequent frame image, the center of the i-th image block is the same as the center of the (i-1)-th bounding frame, and the area of the i-th image block is larger than the area of the (i-1)-th bounding frame;
  • a third determining subunit configured to determine a plurality of candidate targets within the ith image block.
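The construction of the i-th image block from the (i-1)-th bounding frame (same center, larger area) can be sketched as follows. The enlargement factor and the corner-based box format are assumptions for illustration, since the patent only requires the block's area to exceed the frame's area.

```python
def search_region(prev_box, img_w, img_h, scale=2.0):
    """Build the i-th image block from the (i-1)-th bounding frame:
    same center, side lengths enlarged by `scale` (an assumed factor),
    clipped to the frame boundaries."""
    x, y, w, h = prev_box  # top-left corner plus width and height
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * scale, h * scale
    x0 = max(0, int(round(cx - new_w / 2.0)))
    y0 = max(0, int(round(cy - new_h / 2.0)))
    x1 = min(img_w, int(round(cx + new_w / 2.0)))
    y1 = min(img_h, int(round(cy + new_h / 2.0)))
    return x0, y0, x1, y1

# A 40x60 bounding frame at (100, 80) in a 640x480 frame.
block = search_region((100, 80, 40, 60), img_w=640, img_h=480)
assert block == (80, 50, 160, 170)
# The block's area exceeds the previous bounding frame's area.
x0, y0, x1, y1 = block
assert (x1 - x0) * (y1 - y0) > 40 * 60
```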
  • the calculating unit 803 includes:
  • a first selection sub-unit configured to select a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
  • a first calculation subunit configured to calculate a first color feature vector of the first candidate target and calculate a second color feature vector of the tracking target
  • a second calculating subunit configured to calculate a distance between the first color feature vector and the second color feature vector, wherein the distance is the similarity between the first candidate target and the tracking target.
  • the first calculating subunit is further configured to: perform principal component segmentation on the image of the first candidate target to obtain a first mask image, and perform principal component segmentation on the image of the tracking target to obtain a second mask image; scale the first mask image and the second mask image to the same size; divide the first mask image evenly into M regions, and divide the second mask image evenly into M regions, M being a positive integer; calculate the color feature vector of each region in the first mask image, and calculate the color feature vector of each region in the second mask image; and connect the color feature vectors of the regions in the first mask image in sequence to obtain the first color feature vector, and connect the color feature vectors of the regions in the second mask image in sequence to obtain the second color feature vector.
  • the first calculating subunit is further configured to: determine W kinds of main colors, W being a positive integer; calculate the projection weight of each pixel in a first region of the first mask image on each main color, the first region being any one of the M regions in the first mask image, and calculate the projection weight of each pixel in a second region of the second mask image on each main color, the second region being any one of the M regions in the second mask image; based on the projection weight of each pixel in the first region on each main color, obtain the W-dimensional color feature vector corresponding to each pixel in the first region, and, based on the projection weight of each pixel in the second region on each main color, obtain the W-dimensional color feature vector corresponding to each pixel in the second region; normalize the W-dimensional color feature vector corresponding to each pixel in the first region to obtain the color feature vector of each pixel in the first region, and normalize the W-dimensional color feature vector corresponding to each pixel in the second region to obtain the color feature vector of each pixel in the second region; and add the color feature vectors of the pixels in the first region to obtain the color feature vector of the first region, and add the color feature vectors of the pixels in the second region to obtain the color feature vector of the second region.
  • the first calculating subunit is further configured to calculate the projection weight of a first pixel on the n-th main color based on the following equation (rendered as an image in the original publication), wherein the first pixel is any pixel in the first region or the second region; the n-th main color is any one of the W kinds of main colors; I_r, I_g and I_b are the RGB values of the first pixel; and R_n, G_n and B_n are the RGB values of the n-th main color.
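A sketch of the projection-weight and region-vector computation described above. The patent's exact projection equation is not reproduced in this text, so a Gaussian weight on RGB distance is assumed here purely for illustration; the main colors, the sigma parameter, and the pixel values are likewise hypothetical.

```python
import math

def projection_weights(pixel_rgb, main_colors, sigma=50.0):
    """Project one pixel (I_r, I_g, I_b) onto the W main colors
    (R_n, G_n, B_n). The weight formula is an assumed Gaussian of the
    RGB distance, not the patent's (unreproduced) equation."""
    Ir, Ig, Ib = pixel_rgb
    w = [math.exp(-((Ir - Rn) ** 2 + (Ig - Gn) ** 2 + (Ib - Bn) ** 2)
                  / (2.0 * sigma ** 2))
         for (Rn, Gn, Bn) in main_colors]
    # Normalize to obtain the pixel's W-dimensional color feature vector.
    total = sum(w) or 1.0
    return [wi / total for wi in w]

main_colors = [(255, 0, 0), (0, 255, 0), (0, 0, 255)]  # W = 3
vec = projection_weights((250, 10, 5), main_colors)
assert abs(sum(vec) - 1.0) < 1e-9
assert vec.index(max(vec)) == 0  # a near-red pixel projects mostly onto red

# A region's color feature vector is the sum of its pixels' vectors.
region_pixels = [(250, 10, 5), (5, 240, 10)]
region_vec = [sum(p) for p in zip(*[projection_weights(px, main_colors)
                                    for px in region_pixels])]
assert len(region_vec) == 3
```

Concatenating the M region vectors of a mask image then yields the first (or second) color feature vector described in the text.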
  • the calculating unit 803 includes:
  • a second selection subunit configured to select a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
  • a normalizing subunit configured to normalize the image of the first candidate target and the image of the tracking target to the same size;
  • a first input subunit configured to input the image of the tracking target into the first convolutional network of the first deep neural network for feature calculation, to obtain the feature vector of the tracking target, wherein the first deep neural network is based on the Siamese structure;
  • a second input subunit configured to input the image of the first candidate target into the second convolutional network of the first deep neural network for feature calculation, to obtain the feature vector of the first candidate target;
  • a third input subunit configured to input the feature vector of the tracking target and the feature vector of the first candidate target into the first fully connected network of the first deep neural network for similarity calculation, to obtain the similarity between the first candidate target and the tracking target.
  • the third determining subunit is further configured to:
  • the i-th image block is input into the third convolutional network of the second deep neural network for feature calculation to obtain the feature map of the i-th image block, wherein the second deep neural network is based on the Siamese structure; the feature map of the i-th image block is input into the RPN network of the second deep neural network, a plurality of candidate targets are extracted, and the feature vectors of the plurality of candidate targets are obtained.
  • the calculating unit 803 includes:
  • an extracting subunit configured to extract the feature vector of the first candidate target from the feature vectors of the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
  • a fourth input subunit configured to input the image of the tracking target into the fourth convolutional network of the second deep neural network for feature calculation, to obtain the feature vector of the tracking target;
  • a fifth input subunit configured to input the feature vector of the tracking target and the feature vector of the first candidate target into the second fully connected network of the second deep neural network for similarity calculation, to obtain the similarity between the first candidate target and the tracking target.
  • the electronic device introduced in this embodiment is an electronic device used for implementing the target tracking method in the embodiments of the present invention. Therefore, based on the target tracking method introduced in the embodiments of the present invention, those skilled in the art can understand the specific implementation of the electronic device of this embodiment and its various variations, so how the electronic device implements the method in the embodiments of the present invention is not described in detail here. Any electronic device used by those skilled in the art to implement the target tracking method in the embodiments of the present invention falls within the scope of the present invention.
  • the first determining unit 801, the extracting unit 802, the calculating unit 803, and the second determining unit 804 may all run on the electronic device, and may be implemented by a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP), or a field programmable gate array (FPGA) located on the electronic device.
  • the candidate targets of each subsequent frame image are compared with the tracking target in the initial frame image, and the candidate target with the highest similarity among the candidate targets is determined as the tracking target, thereby implementing tracking of the tracking target.
  • the processing of each frame after the initial frame can be regarded as determining whether the target is lost, which has the advantage of reliably judging whether the tracking target is lost; and no tracking template needs to be maintained, which avoids the continuous amplification of errors caused by continuously updating a tracking template, is beneficial to recovering the tracking target, and thereby improves the robustness of the tracking system.
  • the electronic device includes: a processor and a memory for storing a computer program executable on the processor, wherein the processor is configured to perform the steps of the foregoing method when executing the computer program.
  • the memory may be implemented by any type of volatile or non-volatile storage device, or a combination thereof.
  • the non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a disk storage or a tape storage.
  • the volatile memory can be a random access memory (RAM) that acts as an external cache.
  • RAM: Random Access Memory
  • SRAM: Static Random Access Memory
  • SSRAM: Synchronous Static Random Access Memory
  • DRAM: Dynamic Random Access Memory
  • SDRAM: Synchronous Dynamic Random Access Memory
  • DDRSDRAM: Double Data Rate Synchronous Dynamic Random Access Memory
  • ESDRAM: Enhanced Synchronous Dynamic Random Access Memory
  • SLDRAM: SyncLink Dynamic Random Access Memory
  • DRRAM: Direct Rambus Random Access Memory
  • the processor may be an integrated circuit chip with signal processing capabilities.
  • each step of the above method may be completed by an integrated logic circuit of hardware in a processor or an instruction in a form of software.
  • the above processor may be a general purpose processor, a digital signal processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware component, or the like.
  • the processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention.
  • a general purpose processor can be a microprocessor or any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present invention may be directly embodied as being performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor.
  • the software module can be located in a storage medium, the storage medium being located in the memory, the processor reading the information in the memory, and completing the steps of the foregoing methods in combination with the hardware thereof.
  • Embodiments of the present invention also provide a computer readable storage medium, for example, a memory storing a computer program, where the computer program may be executed by a processor of the above electronic device to perform the steps of the foregoing methods.
  • the computer readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface memory, optical disc, or CD-ROM; or may include one or any combination of the above memories.
  • Various equipment may be used to store data into a computer readable storage medium.
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • the computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising an instruction apparatus, the instruction apparatus implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • the embodiments of the present invention have the advantages of reliably determining whether the tracking target is lost, and of not needing to maintain a tracking template, which avoids the continuous amplification of errors caused by continuously updating a tracking template, is beneficial to recovering a lost tracking target, and thereby improves the robustness of the tracking system.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A target tracking method, an electronic device, and a storage medium. The electronic device comprises an image collection unit. The image collection unit is configured to collect image data. The method is applied to the electronic device, comprising: determining, in an initial frame of image of the image data, a target to be tracked (S101); extracting a plurality of candidate targets from a subsequent frame of image of the image data, wherein the subsequent frame of image is any frame of image following the initial frame of image (S102); calculating the similarities between the candidate targets and the target to be tracked (S103); and determining a candidate target in the plurality of candidate targets that has a highest similarity with the target to be tracked as the target to be tracked (S104). The method resolves the technical problems in the prior art that it cannot be determined whether a target to be tracked is lost and it is difficult to find a target to be tracked after the target to be tracked is lost in a visual tracking method of online learning.

Description

一种目标跟踪方法及电子设备、存储介质Target tracking method, electronic device and storage medium
相关申请的交叉引用Cross-reference to related applications
本申请基于申请号为201611041675.6、申请日为2016年11月11日的中国专利申请提出，并要求该中国专利申请的优先权，该中国专利申请的全部内容在此引入本申请作为参考。This application is based on, and claims priority to, Chinese Patent Application No. 201611041675.6, filed on November 11, 2016, the entire contents of which are incorporated herein by reference.
技术领域Technical field
本发明涉及电子技术领域,尤其涉及一种目标跟踪方法及电子设备、存储介质。The present invention relates to the field of electronic technologies, and in particular, to a target tracking method, an electronic device, and a storage medium.
背景技术Background technique
基于在线学习的视觉跟踪技术在近年来兴起之后,成为视觉跟踪的一个热点。此类方法在没有任何离线学习的先验经验的前提下,根据初始帧画面中指定的跟踪目标提取特征模板,训练模型用于后续视频中对于该目标的跟踪,在跟踪过程中,根据跟踪状态更新模型,以适应目标的姿态变化。该类方法不需要任何的离线训练,可以对用户指定的任何物体进行跟踪,具有较高的通用性。The visual tracking technology based on online learning has become a hot spot of visual tracking after its rise in recent years. Such a method extracts feature templates according to the specified tracking targets in the initial frame picture without any prior experience of offline learning. The training model is used for tracking the target in subsequent videos. In the tracking process, according to the tracking status Update the model to accommodate changes in the target's posture. This type of method does not require any offline training, and can track any object specified by the user, which has high versatility.
但是,由于跟踪目标的特征及模板单一,在目标的跟踪过程中,很难判断目标是否跟丢;并且在目标跟丢之后,跟踪模板的持续更新会使误差被持续放大,导致目标难以找回,难以形成稳定的跟踪系统。However, due to the single feature of the tracking target and the single template, it is difficult to judge whether the target is lost or not during the tracking process of the target; and after the target is lost, the continuous update of the tracking template will continue to enlarge the error, making the target difficult to retrieve. It is difficult to form a stable tracking system.
发明内容Summary of the invention
本发明实施例通过提供一种目标跟踪方法及电子设备、存储介质，解决了现有技术中的在线学习的视觉跟踪方法，存在无法判断跟踪目标是否 跟丢，以及跟丢后难以找回跟踪目标的技术问题。By providing a target tracking method, an electronic device, and a storage medium, the embodiments of the present invention solve the technical problems of the prior-art online-learning visual tracking methods: it cannot be determined whether the tracking target is lost, and the tracking target is difficult to recover after being lost.
一方面,本发明通过本发明的一实施例提供如下技术方案:In one aspect, the present invention provides the following technical solutions through an embodiment of the present invention:
一种目标跟踪方法,应用于电子设备中,所述电子设备具有图像采集单元,所述图像采集单元用于采集图像数据,所述方法包括:A target tracking method is applied to an electronic device, wherein the electronic device has an image capturing unit, and the image collecting unit is configured to collect image data, and the method includes:
在所述图像数据的初始帧图像中确定一跟踪目标;Determining a tracking target in an initial frame image of the image data;
在所述图像数据的后续帧图像中提取多个候选目标,所述后续帧图像是所述初始帧图像之后的任一帧图像;Extracting a plurality of candidate targets in a subsequent frame image of the image data, the subsequent frame images being any frame image subsequent to the initial frame image;
计算出每个候选目标与所述跟踪目标的相似度;Calculating the similarity between each candidate target and the tracking target;
将所述多个候选目标中的与所述跟踪目标的相似度最高的候选目标确定为所述跟踪目标。A candidate target having the highest similarity with the tracking target among the plurality of candidate targets is determined as the tracking target.
优选地,所述在图像数据的初始帧图像中确定一跟踪目标,包括:Preferably, the determining a tracking target in the initial frame image of the image data comprises:
在通过显示屏输出所述初始帧图像后,获取用户的选择操作;基于用户的选择操作,在所述初始帧图像中确定所述跟踪目标;或者After outputting the initial frame image through the display screen, acquiring a user's selection operation; determining the tracking target in the initial frame image based on a user's selection operation; or
获取用于描述所述跟踪目标的特征信息;基于所述特征信息,在所述初始帧图像中确定所述跟踪目标。Obtaining feature information for describing the tracking target; determining the tracking target in the initial frame image based on the feature information.
优选地,所述在图像数据的后续帧图像中提取多个候选目标,包括:Preferably, the extracting a plurality of candidate targets in the subsequent frame image of the image data comprises:
确定所述跟踪目标在第i-1帧图像中的第i-1包围框,其中,所述第i-1帧图像属于所述图像数据,i为大于等于2的整数;在i等于2时,所述第i-1帧图像即为所述初始帧图像;Determining an i-th bounding frame of the tracking target in the i-1th frame image, wherein the i-1th frame image belongs to the image data, and i is an integer greater than or equal to 2; when i is equal to 2 The image of the i-1th frame is the initial frame image;
基于所述第i-1包围框,在第i帧图像中确定第i图像块,其中,所述第i帧图像即为所述后续帧图像,所述第i图像块的中心与所述第i-1包围框的中心位置相同,所述第i图像块的面积大于所述第i-1包围框的面积;Determining, in the ith frame image, an ith image block, wherein the ith frame image is the subsequent frame image, a center of the ith image block, and the first The center position of the i-1 enclosing frame is the same, and the area of the i-th image block is larger than the area of the i-th enclosing frame;
在所述第i图像块内确定所述多个候选目标。The plurality of candidate targets are determined within the ith image block.
优选地,所述计算出每个候选目标与所述跟踪目标的相似度,包括:Preferably, the calculating the similarity between each candidate target and the tracking target comprises:
从所述多个候选目标中选出第一候选目标,其中,所述第一候选目标 是所述多个候选目标中的任一候选目标;Selecting a first candidate target from the plurality of candidate targets, wherein the first candidate target Is any one of the plurality of candidate targets;
计算所述第一候选目标的第一颜色特征向量,以及计算所述跟踪目标的第二颜色特征向量;Calculating a first color feature vector of the first candidate target, and calculating a second color feature vector of the tracking target;
计算所述第一颜色特征向量和所述第二颜色特征向量的距离,其中,所述距离即为所述第一候选目标与所述跟踪目标的相似度。Calculating a distance between the first color feature vector and the second color feature vector, wherein the distance is a similarity between the first candidate target and the tracking target.
优选地,所述计算所述第一候选目标的第一颜色特征向量,以及计算所述跟踪目标的第二颜色特征向量,包括:Preferably, the calculating the first color feature vector of the first candidate target and calculating the second color feature vector of the tracking target comprises:
将所述第一候选目标的图像进行主成分分割,获得第一mask图像;以及,将所述跟踪目标的图像进行主成分分割,获得第二mask图像;Performing main component segmentation on the image of the first candidate target to obtain a first mask image; and performing principal component segmentation on the image of the tracking target to obtain a second mask image;
将所述第一mask图像和所述第二mask图像缩放至相同大小;Scaling the first mask image and the second mask image to the same size;
将所述第一mask图像平均分成M个区域;以及,将所述第二mask图像平均分成M个区域,M为正整数;And dividing the first mask image into M regions; and dividing the second mask image into M regions, where M is a positive integer;
计算所述第一mask图像中每个区域的颜色特征向量;以及,计算所述第二mask图像中每个区域的颜色特征向量;Calculating a color feature vector of each region in the first mask image; and calculating a color feature vector of each region in the second mask image;
将所述第一mask图像中每个区域的颜色特征向量顺序连接,获得所述第一颜色特征向量;以及,将所述第二mask图像中每个区域的颜色特征向量顺序连接,获得所述第二颜色特征向量。And sequentially connecting color feature vectors of each region in the first mask image to obtain the first color feature vector; and sequentially connecting color feature vectors of each region in the second mask image to obtain the The second color feature vector.
Preferably, the calculating the color feature vector of each region in the first mask image and the calculating the color feature vector of each region in the second mask image comprise:
determining W main colors, where W is a positive integer;
calculating a projection weight of each pixel in a first region of the first mask image on each main color, the first region being any one of the M regions in the first mask image, and calculating a projection weight of each pixel in a second region of the second mask image on each main color, the second region being any one of the M regions in the second mask image;
obtaining, based on the projection weight of each pixel in the first region on each main color, a W-dimensional color feature vector corresponding to each pixel in the first region, and obtaining, based on the projection weight of each pixel in the second region on each main color, a W-dimensional color feature vector corresponding to each pixel in the second region;
normalizing the W-dimensional color feature vector corresponding to each pixel in the first region to obtain a color feature vector of each pixel in the first region, and normalizing the W-dimensional color feature vector corresponding to each pixel in the second region to obtain a color feature vector of each pixel in the second region; and
summing the color feature vectors of the pixels in the first region to obtain the color feature vector of the first region, and summing the color feature vectors of the pixels in the second region to obtain the color feature vector of the second region.
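The per-region computation above (project each pixel onto the W main colors, normalize each pixel's W-dimensional vector, then sum over the region) can be sketched as follows. The actual projection-weight equation appears only as an image in the filing, so an inverse RGB-distance weight is substituted here purely as an assumption:

```python
import numpy as np

def region_feature(pixels, main_colors):
    """pixels: N x 3 RGB array for one region; main_colors: W x 3 array.

    Returns the W-dimensional color feature vector of the region.
    The projection weight used here (inverse of RGB distance) is an
    illustrative stand-in for the equation in the original filing.
    """
    # N x W matrix: distance from each pixel to each main color.
    d = np.linalg.norm(pixels[:, None, :] - main_colors[None, :, :], axis=2)
    w = 1.0 / (1.0 + d)                   # assumed projection weight
    w = w / w.sum(axis=1, keepdims=True)  # normalize each pixel's W-dim vector
    return w.sum(axis=0)                  # sum over all pixels in the region

main = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=float)
region = np.array([[250, 5, 5], [5, 250, 5]], dtype=float)  # near-red, near-green
feat = region_feature(region, main)
```

Because each pixel contributes a normalized W-dimensional vector, the region feature sums to the pixel count, and the dominant colors of the region dominate the vector.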
Preferably, the projection weight of a first pixel on the n-th main color is calculated based on the following equation:
(Equation reproduced in the original filing as image PCTCN2017110577-appb-000001)
where the first pixel is any pixel in the first region or the second region, the n-th main color is any one of the W main colors, wn is the projection weight of the first pixel on the n-th main color, Ir, Ig, and Ib are the RGB values of the first pixel, and Rn, Gn, and Bn are the RGB values of the n-th main color.
Preferably, the calculating the similarity between each candidate target and the tracking target comprises:
selecting a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
normalizing the image of the first candidate target and the image of the tracking target to the same size;
inputting the image of the tracking target into a first convolutional network of a first deep neural network for feature calculation to obtain a feature vector of the tracking target, wherein the first deep neural network is based on a Siamese structure;
inputting the image of the first candidate target into a second convolutional network of the first deep neural network for feature calculation to obtain a feature vector of the first candidate target; and
inputting the feature vector of the tracking target and the feature vector of the first candidate target into a first fully connected network of the first deep neural network for similarity calculation to obtain the similarity between the first candidate target and the tracking target.
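A structural sketch of the Siamese comparison described above, illustrating only the weight-sharing arrangement: two size-normalized images pass through branches with shared parameters, and the two feature vectors feed a similarity head. The convolutional branches are replaced by one shared linear projection and the fully connected head by a fixed cosine-style score, both purely illustrative assumptions about the actual (trained) network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared "feature extractor": a random linear projection standing in for
# the convolutional branch of the Siamese network (illustrative only).
W_feat = rng.standard_normal((16, 64))

def extract(image_vec):
    """Map a flattened, size-normalized image to a feature vector.
    Both branches call this same function, so the branch parameters are
    shared, as in a Siamese structure."""
    return np.tanh(W_feat @ image_vec)

def similarity(feat_a, feat_b):
    """Stand-in for the fully connected similarity head: cosine
    similarity mapped to [0, 1]."""
    cos = feat_a @ feat_b / (np.linalg.norm(feat_a) * np.linalg.norm(feat_b))
    return 0.5 * (cos + 1.0)

target = rng.standard_normal(64)        # flattened tracking-target image
candidate_same = target.copy()          # candidate identical to the target
candidate_other = rng.standard_normal(64)

s_same = similarity(extract(target), extract(candidate_same))
s_other = similarity(extract(target), extract(candidate_other))
```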
Preferably, the determining the plurality of candidate targets in the i-th image block comprises:
inputting the i-th image block into a third convolutional network of a second deep neural network for feature calculation to obtain a feature map of the i-th image block, wherein the second deep neural network is based on a Siamese structure; and
inputting the feature map of the i-th image block into an RPN (Region Proposal Network) of the second deep neural network to obtain the plurality of candidate targets and feature vectors of the plurality of candidate targets.
Preferably, the calculating the similarity between each candidate target and the tracking target comprises:
extracting a feature vector of a first candidate target from the feature vectors of the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
inputting the image of the tracking target into a fourth convolutional network of the second deep neural network for feature calculation to obtain a feature vector of the tracking target, wherein the fourth convolutional network and the third convolutional network share convolutional layer parameters; and
inputting the feature vector of the tracking target and the feature vector of the first candidate target into a second fully connected network of the second deep neural network for similarity calculation to obtain the similarity between the first candidate target and the tracking target.
In another aspect, an embodiment of the present invention provides the following technical solution:
An electronic device having an image acquisition unit configured to acquire image data, the electronic device comprising:
a first determining unit, configured to determine a tracking target in an initial frame image of the image data;
an extracting unit, configured to extract a plurality of candidate targets from a subsequent frame image of the image data, the subsequent frame image being any frame image after the initial frame image;
a calculating unit, configured to calculate a similarity between each candidate target and the tracking target; and
a second determining unit, configured to determine, as the tracking target, a candidate target having the highest similarity with the tracking target among the plurality of candidate targets.
Preferably, the first determining unit comprises:
a first determining subunit, configured to acquire a selection operation of a user after the initial frame image is output through a display screen, and determine the tracking target in the initial frame image based on the selection operation of the user; or
a second determining subunit, configured to acquire feature information for describing the tracking target, and determine the tracking target in the initial frame image based on the feature information.
Preferably, the extracting unit comprises:
a first determining subunit, configured to determine an (i-1)-th bounding box of the tracking target in an (i-1)-th frame image, wherein the (i-1)-th frame image belongs to the image data, i is an integer greater than or equal to 2, and when i is equal to 2, the (i-1)-th frame image is the initial frame image;
a second determining subunit, configured to determine an i-th image block in an i-th frame image based on the (i-1)-th bounding box, wherein the i-th frame image is the subsequent frame image, the center of the i-th image block coincides with the center of the (i-1)-th bounding box, and the area of the i-th image block is larger than the area of the (i-1)-th bounding box; and
a third determining subunit, configured to determine the plurality of candidate targets within the i-th image block.
Preferably, the calculating unit comprises:
a first selecting subunit, configured to select a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
a first calculating subunit, configured to calculate a first color feature vector of the first candidate target and calculate a second color feature vector of the tracking target; and
a second calculating subunit, configured to calculate a distance between the first color feature vector and the second color feature vector, wherein the distance is the similarity between the first candidate target and the tracking target.
Preferably, the first calculating subunit is further configured to:
perform principal component segmentation on the image of the first candidate target to obtain a first mask image, and perform principal component segmentation on the image of the tracking target to obtain a second mask image; scale the first mask image and the second mask image to the same size; divide the first mask image evenly into M regions, and divide the second mask image evenly into M regions, where M is a positive integer; calculate a color feature vector of each region in the first mask image, and calculate a color feature vector of each region in the second mask image; and concatenate the color feature vectors of the regions in the first mask image in order to obtain the first color feature vector, and concatenate the color feature vectors of the regions in the second mask image in order to obtain the second color feature vector.
Preferably, the first calculating subunit is further configured to:
determine W main colors, where W is a positive integer; calculate a projection weight of each pixel in a first region of the first mask image on each main color, the first region being any one of the M regions in the first mask image, and calculate a projection weight of each pixel in a second region of the second mask image on each main color, the second region being any one of the M regions in the second mask image; obtain, based on the projection weight of each pixel in the first region on each main color, a W-dimensional color feature vector corresponding to each pixel in the first region, and obtain, based on the projection weight of each pixel in the second region on each main color, a W-dimensional color feature vector corresponding to each pixel in the second region; normalize the W-dimensional color feature vector corresponding to each pixel in the first region to obtain a color feature vector of each pixel in the first region, and normalize the W-dimensional color feature vector corresponding to each pixel in the second region to obtain a color feature vector of each pixel in the second region; and sum the color feature vectors of the pixels in the first region to obtain the color feature vector of the first region, and sum the color feature vectors of the pixels in the second region to obtain the color feature vector of the second region.
Preferably, the first calculating subunit is further configured to calculate the projection weight of a first pixel on the n-th main color based on the following equation:
(Equation reproduced in the original filing as image PCTCN2017110577-appb-000002)
where the first pixel is any pixel in the first region or the second region, the n-th main color is any one of the W main colors, wn is the projection weight of the first pixel on the n-th main color, Ir, Ig, and Ib are the RGB values of the first pixel, and Rn, Gn, and Bn are the RGB values of the n-th main color.
Preferably, the calculating unit comprises:
a second selecting subunit, configured to select a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
a normalizing subunit, configured to normalize the image of the first candidate target and the image of the tracking target to the same size;
a first input subunit, configured to input the image of the tracking target into a first convolutional network of a first deep neural network for feature calculation to obtain a feature vector of the tracking target, wherein the first deep neural network is based on a Siamese structure;
a second input subunit, configured to input the image of the first candidate target into a second convolutional network of the first deep neural network for feature calculation to obtain a feature vector of the first candidate target, wherein the second convolutional network and the first convolutional network share convolutional layer parameters; and
a third input subunit, configured to input the feature vector of the tracking target and the feature vector of the first candidate target into a first fully connected network of the first deep neural network for similarity calculation to obtain the similarity between the first candidate target and the tracking target.
Preferably, the third determining subunit is further configured to:
input the i-th image block into a third convolutional network of a second deep neural network for feature calculation to obtain a feature map of the i-th image block, wherein the second deep neural network is based on a Siamese structure; and input the feature map of the i-th image block into an RPN of the second deep neural network to obtain the plurality of candidate targets and feature vectors of the plurality of candidate targets.
Preferably, the calculating unit comprises:
an extracting subunit, configured to extract a feature vector of a first candidate target from the feature vectors of the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
a fourth input subunit, configured to input the image of the tracking target into a fourth convolutional network of the second deep neural network for feature calculation to obtain a feature vector of the tracking target, wherein the fourth convolutional network and the third convolutional network share convolutional layer parameters; and
a fifth input subunit, configured to input the feature vector of the tracking target and the feature vector of the first candidate target into a second fully connected network of the second deep neural network for similarity calculation to obtain the similarity between the first candidate target and the tracking target.
In still another aspect, an embodiment of the present invention provides the following technical solution:
An electronic device, comprising a processor and a memory for storing a computer program executable on the processor, wherein the processor is configured to perform the steps of the method described above when running the computer program.
In still another aspect, an embodiment of the present invention provides the following technical solution:
A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the method described above.
One or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:
In the embodiments of the present invention, a tracking target is determined in an initial frame image of the image data; a plurality of candidate targets are extracted from a subsequent frame image of the image data; the similarity between each candidate target and the tracking target is calculated; and the candidate target with the highest similarity is determined as the tracking target. Since the candidate targets in each subsequent frame image are compared with the tracking target in the initial frame image, and the candidate target with the highest similarity is determined as the tracking target, tracking of the tracking target is achieved. Compared with prior-art visual tracking methods based on online learning, the processing of each frame after the initial frame in the tracking method of the embodiments of the present invention can be regarded as judging whether the target has been lost, so the method can reliably judge whether the tracking target is lost. Moreover, no tracking template needs to be maintained, which avoids the continuous amplification of errors caused by continual template updates and facilitates recovering a lost tracking target, thereby improving the robustness of the tracking system.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Apparently, the drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of a target tracking method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an initial frame image according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an initial tracking target according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a second frame image according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of candidate targets determined in the second frame image according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a first deep neural network according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a second deep neural network according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
DETAILED DESCRIPTION
By providing a target tracking method and apparatus, the embodiments of the present invention solve the technical problems of prior-art visual tracking methods based on online learning, namely that it is impossible to judge whether the tracking target has been lost and that it is difficult to recover the tracking target after it is lost.
To solve the above technical problems, the general idea of the technical solutions of the embodiments of the present invention is as follows:
A target tracking method is applied to an electronic device having an image acquisition unit configured to acquire image data. The method comprises: determining a tracking target in an initial frame image of the image data; extracting a plurality of candidate targets from a subsequent frame image of the image data, the subsequent frame image being any frame image after the initial frame image; calculating the similarity between each candidate target and the tracking target; and determining, as the tracking target, the candidate target having the highest similarity with the tracking target among the plurality of candidate targets.
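The general idea above reduces to a simple per-frame loop: compare every candidate in the current frame against the tracking target fixed in the initial frame and keep the best match, with no template updated between frames. A minimal sketch with pluggable candidate-extraction and similarity functions (the function names and the toy similarity below are assumptions for illustration):

```python
def track(initial_target, frames, extract_candidates, similarity):
    """For each subsequent frame, pick the candidate most similar to the
    tracking target from the initial frame.

    Note that the reference is always the initial target: no tracking
    template is maintained or updated between frames.
    """
    results = []
    for frame in frames:
        candidates = extract_candidates(frame)
        best = max(candidates, key=lambda c: similarity(c, initial_target))
        results.append(best)
    return results

# Toy example: "images" are numbers; similarity is negative absolute difference.
frames = [[9, 4.8, 1], [5.2, 7, 0]]
chosen = track(5, frames, lambda f: f, lambda c, t: -abs(c - t))
```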
For a better understanding of the above technical solutions, the technical solutions are described in detail below with reference to the accompanying drawings and specific embodiments.
Embodiment 1
This embodiment provides a target tracking method applied to an electronic device. The electronic device may be a ground robot (for example, a self-balancing vehicle), an unmanned aerial vehicle (for example, a multi-rotor drone or a fixed-wing drone), an electric vehicle, or the like; this embodiment does not specifically limit what the electronic device is. The electronic device has an image acquisition unit (for example, a camera) configured to acquire image data.
As shown in FIG. 1, the target tracking method comprises:
Step S101: determining a tracking target in an initial frame image of the image data.
As an optional embodiment, step S101 comprises:
after outputting the initial frame image through a display screen, acquiring a selection operation of a user, and determining the tracking target in the initial frame image based on the selection operation of the user; or
acquiring feature information for describing the tracking target, and determining the tracking target in the initial frame image based on the feature information.
In a specific implementation, as shown in FIG. 2, the image acquired by the image acquisition unit may be obtained and output through a display screen provided on the electronic device (for example, as the initial frame image 300), and a selection operation performed by the user may be acquired (for example, when the display screen is a touch screen, the selection operation of the user is acquired through the touch screen); the tracking target (namely, the initial tracking target 000) is then determined from the initial frame image 300 based on the selection operation. Alternatively, feature information for describing the tracking target is acquired, and the tracking target (namely, the initial tracking target 000) is determined in the initial frame image 300 in combination with a saliency detection or object detection algorithm. Here, as shown in FIG. 3, the image 311 of the initial tracking target 000 may be extracted and saved for later use; the image 311 is the image within the first bounding box 310.
Step S102: extracting a plurality of candidate targets from a subsequent frame image of the image data, the subsequent frame image being any frame image after the initial frame image.
As an optional embodiment, step S102 comprises:
determining an (i-1)-th bounding box of the tracking target in an (i-1)-th frame image (wherein the (i-1)-th frame image belongs to the image data, i is an integer greater than or equal to 2, and when i is equal to 2, the (i-1)-th frame image is the initial frame image); determining, based on the (i-1)-th bounding box, an i-th image block in an i-th frame image, wherein the i-th frame image is the subsequent frame image, the center of the i-th image block coincides with the center of the (i-1)-th bounding box, and the area of the i-th image block is larger than the area of the (i-1)-th bounding box; and determining the plurality of candidate targets within the i-th image block.
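The search-region step of this optional embodiment (an image block with the same center as the previous bounding box but a larger area) can be sketched as follows; the enlargement factor and the clipping to the image boundary are assumptions not fixed by the claim:

```python
def search_block(prev_box, img_w, img_h, scale=2.0):
    """prev_box: (cx, cy, w, h) of the (i-1)-th bounding box.

    Returns the i-th image block as (x0, y0, x1, y1): same center,
    `scale` times the width and height (hence a larger area), clipped
    to the image boundary. The factor 2.0 is an illustrative assumption.
    """
    cx, cy, w, h = prev_box
    bw, bh = w * scale, h * scale
    x0 = max(0, int(round(cx - bw / 2)))
    y0 = max(0, int(round(cy - bh / 2)))
    x1 = min(img_w, int(round(cx + bw / 2)))
    y1 = min(img_h, int(round(cy + bh / 2)))
    return x0, y0, x1, y1

block = search_block((100, 80, 40, 60), img_w=640, img_h=480)
```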
For example, FIG. 2 shows the initial frame image, which contains a plurality of person targets; the tracking target to be tracked is the person within the first bounding box 310. FIG. 4 shows the second frame image, in which the position or posture of each person target has changed.
When i is equal to 2, as shown in FIG. 3, the bounding box of the tracking target (namely, the initial tracking target 000) in the initial frame image 300 is determined (namely, the first bounding box 310). The bounding box is generally rectangular and just encloses the tracking target (namely, the initial tracking target 000). As shown in FIG. 4, based on the position of the first bounding box 310 (the position of the first bounding box 310 in the initial frame image 300 is the same as its position in the second frame image 400), an image block (namely, the second image block 420) is determined in the second frame image 400. The second image block 420 has the same center as the first bounding box 310, but its area is larger than that of the first bounding box 310, and there may be multiple targets within it; the tracking target determined in the initial frame image 300 (namely, the initial tracking target 000) lies within the second image block 420. Here, the plurality of targets may be determined in the second image block 420 by methods such as saliency analysis or object detection, and these targets are determined as the candidate targets (namely, candidate target 401, candidate target 402, candidate target 403, and candidate target 404). Further, based on steps S103 to S104, the tracking target is determined from these candidate targets; that is, the initial tracking target 000 is identified from the second frame image.
The specific implementations of steps S103 and S104 are described in detail below.
Similarly, when i is equal to 3, after the tracking target has been identified from the second frame image 400, the bounding box of the tracking target in the second frame image 400 (namely, the second bounding box) is determined. Based on the second bounding box, an image block (namely, the third image block) is determined in the third frame image; the third image block has the same center as the second bounding box, but its area is larger than that of the second bounding box, and there may be multiple targets within it, among which is the tracking target determined in the initial frame image. Here, the plurality of targets may be determined in the third image block by methods such as saliency analysis or object detection, and the plurality of targets are determined as candidate targets. Further, based on steps S103 to S104, the tracking target is determined from these candidate targets; that is, the initial tracking target 000 is identified from the third frame image.
Similarly, when i is equal to 4, the fourth image block is determined in the fourth frame image, a plurality of candidate targets are determined within the fourth image block, and, based on steps S103 to S104, the tracking target (namely, the initial tracking target 000) is determined from these candidate targets. By analogy, when i is equal to 5, 6, 7, 8, and so on, a plurality of candidate targets are determined in each frame image, and the tracking target (namely, the initial tracking target 000) is determined from these candidate targets based on steps S103 to S104, thereby achieving identification and tracking of the tracking target.
In a specific implementation, after the plurality of candidate targets are determined within the i-th image block, the image of each candidate target is extracted and saved for later use. As shown in FIG. 5, the image 421 of candidate target 401, the image 422 of candidate target 402, the image 423 of candidate target 403, and the image 424 of candidate target 404 are extracted and saved.
Step S103: calculating the similarity between each candidate target and the tracking target.
For example, in a specific example, the similarity between each candidate target and the tracking target is calculated.
In a specific implementation, the similarity between each candidate target and the tracking target first needs to be calculated. Here, the tracking target is the initial tracking target 000 determined in the initial frame image 300 (shown in FIG. 3), and the candidate targets come from the i-th image block in the i-th frame image, where the i-th frame image is a subsequent frame image (i.e., any frame image after the initial frame image). For example, as shown in FIG. 4, the candidate targets include candidate target 401, candidate target 402, candidate target 403, and candidate target 404 determined in the second frame image 400.
In a specific implementation, a target re-identification algorithm may be used to calculate the similarity between each candidate target and the tracking target. Here, step S103 may be implemented in the following three manners.
Manner 1: calculating the similarity between each candidate target and the tracking target by using a target re-identification algorithm based on color features.
As an optional embodiment, step S103 includes:
selecting a first candidate target from the multiple candidate targets, where the first candidate target is any one of the multiple candidate targets; calculating a first color feature vector of the first candidate target and a second color feature vector of the tracking target; and calculating the distance between the first color feature vector and the second color feature vector, where the distance is the similarity between the first candidate target and the tracking target.
For example, as shown in FIG. 3, the color feature vector of the initial tracking target 000 is calculated, where the initial tracking target 000 is the tracking target determined in the initial frame image 300. As shown in FIG. 5, the color feature vector of candidate target 401 is then calculated. Finally, the distance between the color feature vector of the initial tracking target 000 and the color feature vector of candidate target 401 is calculated, and this distance value represents the similarity between candidate target 401 and the initial tracking target 000. Similarly, the similarities between candidate target 402, candidate target 403, candidate target 404 and the initial tracking target 000 are calculated respectively.
In a specific implementation, the distance between the first color feature vector and the second color feature vector may be calculated based on the Euclidean distance formula.
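As a minimal illustration of this step, the sketch below computes the Euclidean distance between two color feature vectors and maps it to a similarity score. The vectors and the distance-to-similarity mapping are illustrative assumptions; the embodiment only states that the distance itself represents the similarity.

```python
import math

def euclidean_distance(v1, v2):
    """Euclidean distance between two equal-length feature vectors."""
    assert len(v1) == len(v2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def similarity(v1, v2):
    """Illustrative mapping of distance to a score in (0, 1]:
    identical vectors give 1.0, larger distances give smaller scores."""
    return 1.0 / (1.0 + euclidean_distance(v1, v2))

# hypothetical second (target) and first (candidate) color feature vectors
target_vec = [0.5, 0.3, 0.2, 0.0]
cand_vec = [0.4, 0.4, 0.2, 0.0]
print(similarity(target_vec, cand_vec))
```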
As an optional embodiment, in more detail, the calculating a first color feature vector of the first candidate target and a second color feature vector of the tracking target includes:
performing principal component segmentation (saliency segmentation) on the image of the first candidate target to obtain a first mask image, and performing principal component segmentation on the image of the tracking target to obtain a second mask image; scaling the first mask image and the second mask image to the same size; dividing the first mask image evenly into M regions, and dividing the second mask image evenly into M regions, where M is a positive integer; calculating the color feature vector of each region in the first mask image, and calculating the color feature vector of each region in the second mask image; and concatenating the color feature vectors of the regions in the first mask image in order to obtain the first color feature vector, and concatenating the color feature vectors of the regions in the second mask image in order to obtain the second color feature vector.
For example, when calculating the color feature vector of the tracking target (i.e., the initial tracking target 000), that is, the second color feature vector, principal component segmentation may first be performed on the image 311 of the initial tracking target 000 to obtain the second mask image (in a mask image, only the principal-component region keeps the pixel values of the original image; the pixel values of all other regions are 0), where the image 311 of the initial tracking target 000 is rectangular and exactly encloses the initial tracking target 000. The second mask image is then scaled to a preset size and divided evenly into four regions (halved vertically and halved horizontally), and the color feature vector of each of the four regions is calculated. Finally, the color feature vectors of the four regions are concatenated in order (if the color feature vector of each region is a 10-dimensional vector, concatenation yields a 40-dimensional vector), and after normalization the color feature vector of the tracking target (i.e., the second color feature vector) is obtained.
Similarly, when calculating the color feature vector of candidate target 401, principal component segmentation may first be performed on the image 421 of candidate target 401 to obtain the first mask image, where the image 421 of candidate target 401 is rectangular and exactly encloses candidate target 401. The first mask image is then also scaled to a preset size, the same size as the second mask image, and divided evenly into four regions (halved vertically and halved horizontally), and the color feature vector of each of the four regions is calculated. Finally, the color feature vectors of the four regions are concatenated in order (again, if the color feature vector of each region is a 10-dimensional vector, concatenation yields a 40-dimensional vector), and after normalization the color feature vector of candidate target 401 is obtained. In the same way, the color feature vectors of candidate target 402, candidate target 403, and candidate target 404 are calculated respectively.
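The region-splitting and concatenation steps above can be sketched as follows. The 2×2 split and the L1 normalization after concatenation are assumptions consistent with the description (halved vertically and horizontally, normalized after concatenation), and the helper names are hypothetical.

```python
def split_into_quadrants(img, h, w):
    """Split an h*w image (row-major list of pixels) into 4 equal regions:
    top-left, top-right, bottom-left, bottom-right."""
    regions = [[], [], [], []]
    for y in range(h):
        for x in range(w):
            idx = (y >= h // 2) * 2 + (x >= w // 2)
            regions[idx].append(img[y * w + x])
    return regions

def concat_and_normalize(region_vectors):
    """Concatenate per-region color vectors (e.g. 4 x 10-dim -> 40-dim)
    and L1-normalize the result (the normalization scheme is an assumption)."""
    v = [c for vec in region_vectors for c in vec]
    s = sum(v) or 1.0
    return [c / s for c in v]
```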
As an optional embodiment, in more detail, the calculating the color feature vector of each region in the first mask image, and calculating the color feature vector of each region in the second mask image, includes:
determining W main colors, where W is a positive integer; calculating the projection weight of each pixel in a first region of the first mask image on each main color, where the first region is any one of the M regions in the first mask image, and calculating the projection weight of each pixel in a second region of the second mask image on each main color, where the second region is any one of the M regions in the second mask image; obtaining, based on the projection weights of each pixel in the first region on each main color, a W-dimensional color feature vector for each pixel in the first region, and obtaining, based on the projection weights of each pixel in the second region on each main color, a W-dimensional color feature vector for each pixel in the second region; normalizing the W-dimensional color feature vector of each pixel in the first region to obtain the color feature vector of each pixel in the first region, and normalizing the W-dimensional color feature vector of each pixel in the second region to obtain the color feature vector of each pixel in the second region; and adding up the color feature vectors of the pixels in the first region to obtain the color feature vector of the first region, and adding up the color feature vectors of the pixels in the second region to obtain the color feature vector of the second region.
For example, 10 main colors may be defined, namely red, yellow, blue, green, cyan, purple, orange, white, black, and gray, numbered from 1 to 10 (i.e., red is No. 1, yellow is No. 2, blue is No. 3, ..., gray is No. 10). The corresponding RGB values of each color are then recorded, denoted Rn, Gn, Bn, where n is the number of one of the 10 main colors (for example, R1 is the R value of red, G2 is the G value of yellow, and B10 is the B value of gray).
After the first mask image is divided evenly into four regions (halved vertically and halved horizontally), when calculating the color feature vector of each region in the first mask image, one of the four regions (i.e., the first region) is selected first, and the projection weight of each pixel in the first region on each main color is calculated, yielding the projection weights of each pixel in the first region on the 10 main colors, so that each pixel obtains a 10-dimensional color feature vector. This 10-dimensional color feature vector is then normalized and taken as the color feature vector of that pixel. After the color feature vectors of all pixels in the first region are obtained, the color feature vectors of all the pixels are added up to finally obtain the color feature vector of the first region. Based on this method, the color feature vector of each of the four regions in the first mask image can be calculated.
Similarly, after the second mask image is divided evenly into four regions (halved vertically and halved horizontally), when calculating the color feature vector of each region in the second mask image, one of the four regions (i.e., the second region) is selected first, and the projection weight of each pixel in the second region on each main color is calculated, yielding the projection weights of each pixel in the second region on the 10 main colors, so that each pixel obtains a 10-dimensional color feature vector. This 10-dimensional color feature vector is then normalized and taken as the color feature vector of that pixel. After the color feature vectors of all pixels in the second region are obtained, the color feature vectors of all the pixels are added up to finally obtain the color feature vector of the second region. Based on this method, the color feature vector of each of the four regions in the second mask image can be calculated.
As an optional embodiment, in more detail, the projection weight of a first pixel on the n-th main color may be calculated based on the following equation:
Figure PCTCN2017110577-appb-000003
where the first pixel is any pixel in the first region or the second region, the n-th main color is any one of the W main colors, wn is the projection weight of the first pixel on the n-th main color, Ir, Ig, Ib are the RGB values of the first pixel, and Rn, Gn, Bn are the RGB values of the n-th main color.
For example, n is the number of one of the 10 main colors described above. When calculating the projection weight of a pixel in the first region or the second region on yellow (numbered 2), the calculation may be based on the following equation:
Figure PCTCN2017110577-appb-000004
where w2 is the projection weight of the pixel on yellow, R2, G2, B2 are the RGB values of yellow, and Ir, Ig, Ib are the RGB values of the pixel.
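The projection-weight equations appear only as images in this publication, so the exact formula is not recoverable from the text. The sketch below therefore assumes a weight inversely proportional to the RGB-space distance between the pixel and each main color, combined with the per-pixel normalization and per-region summation that the text does describe; the specific main-color RGB values are likewise assumptions.

```python
import math

# RGB values for the 10 main colors; the exact values are an assumption,
# the embodiment only says they are "recorded".
MAIN_COLORS = {
    1: (255, 0, 0),      # red
    2: (255, 255, 0),    # yellow
    3: (0, 0, 255),      # blue
    4: (0, 255, 0),      # green
    5: (0, 255, 255),    # cyan
    6: (128, 0, 128),    # purple
    7: (255, 165, 0),    # orange
    8: (255, 255, 255),  # white
    9: (0, 0, 0),        # black
    10: (128, 128, 128), # gray
}

def projection_weights(pixel):
    """Normalized 10-dimensional color feature vector for one pixel.

    Assumed weight form: inversely proportional to the Euclidean RGB
    distance between the pixel (Ir, Ig, Ib) and each main color
    (Rn, Gn, Bn), then normalized as the text describes."""
    ir, ig, ib = pixel
    raw = []
    for n in sorted(MAIN_COLORS):
        rn, gn, bn = MAIN_COLORS[n]
        d = math.sqrt((ir - rn) ** 2 + (ig - gn) ** 2 + (ib - bn) ** 2)
        raw.append(1.0 / (1.0 + d))
    s = sum(raw)
    return [w / s for w in raw]

def region_color_vector(pixels):
    """Sum the normalized per-pixel vectors over a region (per the text)."""
    acc = [0.0] * len(MAIN_COLORS)
    for p in pixels:
        for i, w in enumerate(projection_weights(p)):
            acc[i] += w
    return acc
```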
Manner 2: calculating the similarity between each candidate target and the tracking target by using a target re-identification algorithm based on a deep neural network.
As an optional embodiment, step S103 includes:
As shown in FIG. 6, a first candidate target is selected from the multiple candidate targets, where the first candidate target is any one of the multiple candidate targets; the image of the first candidate target and the image of the tracking target are normalized to the same size; the image of the tracking target is input through a first input 611 into a first convolutional network 601 of a first deep neural network for feature calculation to obtain the feature vector of the tracking target, where the first deep neural network is based on a Siamese structure; the image of the first candidate target is input through a second input 612 into a second convolutional network 602 of the first deep neural network for feature calculation to obtain the feature vector of the first candidate target, where the second convolutional network 602 and the first convolutional network 601 share convolutional-layer parameters, i.e., their convolutional-layer parameters are identical; the feature vector of the tracking target and the feature vector of the first candidate target are input into a first fully connected network 603 of the first deep neural network for similarity calculation, and the similarity between the first candidate target and the tracking target is finally obtained at a first output 621, where the outputs of the first convolutional network 601 and the second convolutional network 602 automatically serve as the inputs of the first fully connected network 603.
In a specific implementation, the first deep neural network (shown in FIG. 6) needs to be trained offline. The first deep neural network includes the first convolutional network 601, the second convolutional network 602, the first fully connected network 603, the first input 611, the second input 612, and the first output 621. The first convolutional network 601 and the second convolutional network 602 form a two-branch deep neural network with a Siamese structure, where each branch adopts the network structure of AlexNet before FC6. Both the first convolutional network 601 and the second convolutional network 602 contain multiple convolutional layers, and the convolutional layers of the two networks are shared with each other, i.e., their parameters are identical. The images input to the first convolutional network 601 and the second convolutional network 602 need to be normalized to the same size. Here, the normalized image of the tracking target is input into the first convolutional network 601 to obtain the feature vector of the tracking target, and the normalized image of the first candidate target is input into the second convolutional network 602 to obtain the feature vector of the first candidate target. The first convolutional network 601 and the second convolutional network 602 are both connected to the first fully connected network 603, which contains multiple fully connected layers and is used to calculate the distance between the two input feature vectors, thereby obtaining the similarity between the first candidate target and the tracking target. The parameters of the first deep neural network are obtained through offline learning; the first deep neural network is trained in the same way as a general convolutional neural network, and after offline training is completed, the first deep neural network can be applied in the tracking system.
For example, when the first deep neural network is used to calculate the similarity between candidate target 401 and the initial tracking target 000, the image 421 of candidate target 401 and the image 311 of the initial tracking target 000 may first be normalized to the same size; then the image 311 of the initial tracking target 000 is input into the first convolutional network 601 to obtain the feature vector of the initial tracking target 000, and the image 421 of candidate target 401 is input into the second convolutional network 602 to obtain the feature vector of candidate target 401; finally, the feature vector of the initial tracking target 000 and the feature vector of candidate target 401 are input into the first fully connected network 603, thereby obtaining the similarity between candidate target 401 and the initial tracking target 000.

Similarly, after the image 422 of candidate target 402 and the image 311 corresponding to the initial tracking target 000 are normalized, the image 311 of the initial tracking target 000 is input into the first convolutional network 601 while the image 422 of candidate target 402 is input into the second convolutional network 602, so that the similarity between candidate target 402 and the initial tracking target 000 is obtained. By analogy, the similarity between candidate target 403 and the initial tracking target 000, and the similarity between candidate target 404 and the initial tracking target 000, are obtained.
Manner 3: using a deep neural network to simultaneously generate the candidate targets and calculate the similarity between each candidate target and the tracking target.
As an optional embodiment, when performing the determining multiple candidate targets within the i-th image block, in addition to methods such as saliency analysis or object detection, a second deep neural network as shown in FIG. 7 may be used.
Specifically, as shown in FIG. 7, a second deep neural network may be trained offline. The second deep neural network is based on a Siamese structure and includes a third convolutional network 604, a fourth convolutional network 605, an RPN (Region Proposal Network) 607, a second fully connected network 606, a third input 613, a fourth input 614, and a second output 622. The output of the third convolutional network 604 serves as the input of the RPN 607, and the fourth convolutional network 605 and the RPN 607 are both connected to the second fully connected network 606. The third convolutional network 604 contains multiple convolutional layers for performing feature calculation on the i-th image block, so that a feature map of the i-th image block can be obtained via the third convolutional network 604; the RPN 607 is used to extract multiple candidate targets from the i-th image block according to the feature map of the i-th image block and to calculate the feature vector of each candidate target.
The main difference between the second deep neural network shown in FIG. 7 and the first deep neural network shown in FIG. 6 lies in the lower half of FIG. 7. The third convolutional network 604 in FIG. 7 takes the i-th image block as input, and an RPN 607 is additionally added. The RPN 607 extracts the candidate targets on the feature map obtained after the i-th image block is processed by the third convolutional network 604: it performs its calculation directly on the feature map produced by the third convolutional network 604, directly locates the position corresponding to each candidate target on the feature map, and directly obtains the feature vector of each candidate target from the feature map. These feature vectors are then input, pair by pair with the feature vector corresponding to the initial tracking target 000, into the second fully connected network 606 to calculate the similarities.
In a specific implementation, the i-th image block may be input through the fourth input 614 into the third convolutional network 604 of the second deep neural network for feature calculation to obtain the feature map of the i-th image block; the feature map of the i-th image block is input into the RPN 607 of the second deep neural network for feature calculation, multiple candidate targets are extracted, and the feature vector of each candidate target is calculated.
For example, the second image block 420 may be input into the third convolutional network 604 of the second deep neural network to obtain the feature map of the second image block 420; the feature map of the second image block 420 is input into the RPN 607 of the second deep neural network, multiple candidate targets (i.e., candidate target 401, candidate target 402, candidate target 403, and candidate target 404) are extracted, and the feature vector of each candidate target can also be obtained.
As an optional embodiment, step S103 includes:
extracting the feature vector of a first candidate target from the feature vectors of the multiple candidate targets, where the first candidate target is any one of the multiple candidate targets; inputting the image of the tracking target through the third input 613 into the fourth convolutional network 605 of the second deep neural network for feature calculation to obtain the feature vector of the tracking target, where both the fourth convolutional network 605 and the third convolutional network 604 contain multiple convolutional layers, and the convolutional layers of the fourth convolutional network 605 share parameters with the third convolutional network 604, i.e., their convolutional-layer parameters are identical; and inputting the feature vector of the tracking target and the feature vector of the first candidate target into the second fully connected network 606 of the second deep neural network for similarity calculation, finally obtaining the similarity between the first candidate target and the tracking target at the second output 622.
In a specific implementation, as shown in FIG. 7, in addition to the third convolutional network 604 and the RPN 607, the second deep neural network further includes the fourth convolutional network 605 and the second fully connected network 606. The RPN 607 is used to extract the multiple candidate targets based on the feature map output by the third convolutional network 604 and to calculate the feature vector of each candidate target, and the feature vector of each candidate target is input in turn into the second fully connected network 606; the fourth convolutional network 605 is used to calculate the feature vector of the tracking target and output it to the second fully connected network 606; and the second fully connected network 606 is used to calculate the similarity between the first candidate target and the tracking target based on the feature vector of the first candidate target and the feature vector of the tracking target.
For example, as described above, after the second image block 420 is input into the third convolutional network 604 of the second deep neural network, the feature vectors of candidate target 401, candidate target 402, candidate target 403, and candidate target 404 can be obtained through the calculation of the third convolutional network 604 and the RPN 607. At the same time, the image 311 corresponding to the initial tracking target 000 is input into the fourth convolutional network 605 of the second deep neural network, so that the similarity between candidate target 401 and the initial tracking target 000, the similarity between candidate target 402 and the initial tracking target 000, the similarity between candidate target 403 and the initial tracking target 000, and the similarity between candidate target 404 and the initial tracking target 000 can be calculated by the second fully connected network 606.
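The design point of Manner 3 is that the image block passes through the convolutional network once, and every candidate's feature vector is then read directly out of that shared feature map rather than requiring a separate forward pass per candidate. This deliberately simplified sketch illustrates the idea, with a one-value-per-position "feature map" and mean pooling standing in for the real convolutional features and RoI-style readout; all names and shapes are hypothetical.

```python
def feature_map(image, h, w):
    """Toy stand-in for the third convolutional network: one 'feature'
    per pixel (here just the mean of the RGB channels).  Crucially, this
    is computed ONCE per image block."""
    return [[sum(image[y][x]) / 3.0 for x in range(w)] for y in range(h)]

def candidate_vector(fmap, box):
    """RPN-style readout: the candidate's feature vector is taken directly
    from its region (x0, y0, x1, y1) of the shared feature map, here
    mean-pooled to a 1-dimensional vector as a simplification."""
    x0, y0, x1, y1 = box
    vals = [fmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return [sum(vals) / len(vals)]

# one shared feature map, many candidate readouts
image = [[(3, 3, 3), (6, 6, 6)], [(9, 9, 9), (12, 12, 12)]]
fmap = feature_map(image, 2, 2)
boxes = [(0, 0, 1, 1), (0, 0, 2, 2)]  # hypothetical candidate boxes
vectors = [candidate_vector(fmap, b) for b in boxes]
```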
Step S104: determining, as the tracking target, the candidate target with the highest similarity to the tracking target among the multiple candidate targets.
In a specific implementation, after the similarity between each candidate target and the tracking target is calculated, the candidate target with the highest similarity is taken as the tracking target.
For example, if candidate target 402 has the highest similarity to the initial tracking target 000, candidate target 402 is taken as the tracking target and tracking continues.
The above mainly takes the second frame image 400 as an example: for each candidate target in the second image block 420 of the second frame image 400, the similarity between that candidate target and the initial tracking target 000 is calculated, and the candidate target with the highest similarity is taken as the tracking target in the second frame image. The same applies to each subsequent frame image (for example, the third frame image, the fourth frame image, the fifth frame image, ...): the similarity between each candidate target in that frame image and the initial tracking target 000 is calculated, and the candidate target with the highest similarity is taken as the tracking target in that frame image.
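Step S104 thus reduces to an argmax over the per-frame similarities. A minimal sketch with hypothetical candidate IDs, toy feature vectors, and an illustrative inverse-distance similarity function:

```python
def similarity_fn(v1, v2):
    """Illustrative inverse-Euclidean-distance similarity."""
    d = sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5
    return 1.0 / (1.0 + d)

def select_tracking_target(candidates, target_vec):
    """Step S104: the candidate with the highest similarity to the
    initial-frame tracking target becomes the tracking target."""
    return max(candidates, key=lambda c: similarity_fn(c["vec"], target_vec))

# hypothetical per-frame candidates 401..404 with toy feature vectors
frame_candidates = [
    {"id": 401, "vec": [0.9, 0.1]},
    {"id": 402, "vec": [0.5, 0.5]},
    {"id": 403, "vec": [0.2, 0.8]},
    {"id": 404, "vec": [0.0, 1.0]},
]
target_vec = [0.55, 0.45]  # toy feature vector of the initial tracking target
print(select_tracking_target(frame_candidates, target_vec)["id"])  # → 402
```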
The technical solutions in the above embodiments of the present invention have at least the following technical effects or advantages:
Since the candidate targets of each subsequent frame image are compared with the tracking target in the initial frame image, and the candidate target with the highest similarity among them is determined as the tracking target, tracking of the tracking target is achieved. Compared with the prior-art visual tracking methods based on online learning, the target tracking method in the embodiments of the present invention treats the processing of each frame after the initial frame as a judgment of whether the target has been lost, and therefore has the advantage of reliably judging whether the tracking target is lost. Moreover, it does not need to maintain a tracking template, which avoids the continuous amplification of errors caused by continuous updating of a tracking template and facilitates recovering a lost tracking target, thereby improving the robustness of the tracking system.
实施例二Embodiment 2
本实施例提供了一种电子设备,该电子设备具有图像采集单元,图像采集单元用于采集图像数据,如图8所示,该电子设备,包括:The embodiment provides an electronic device, which has an image acquisition unit, and the image acquisition unit is configured to collect image data. As shown in FIG. 8 , the electronic device includes:
第一确定单元801,配置为在图像数据的初始帧图像中确定一跟踪目标;The first determining unit 801 is configured to determine a tracking target in the initial frame image of the image data;
提取单元802,配置为在图像数据的后续帧图像中提取多个候选目标,后续帧图像是初始帧图像之后的任一帧图像;The extracting unit 802 is configured to extract a plurality of candidate targets in the subsequent frame image of the image data, where the subsequent frame image is any frame image subsequent to the initial frame image;
计算单元803,配置为计算出候选目标与跟踪目标的相似度;The calculating unit 803 is configured to calculate a similarity between the candidate target and the tracking target;
第二确定单元804,配置为将多个候选目标中的与跟踪目标的相似度最高的候选目标确定为跟踪目标。The second determining unit 804 is configured to determine a candidate target that has the highest similarity with the tracking target among the plurality of candidate targets as the tracking target.
作为一种可选的实施例,第一确定单元801,包括:As an optional embodiment, the first determining unit 801 includes:
第一确定子单元,配置为在通过显示屏输出初始帧图像后,获取用户的选择操作;基于用户的选择操作,在初始帧图像中确定跟踪目标;或者a first determining subunit configured to acquire a user's selection operation after outputting the initial frame image through the display screen; determining a tracking target in the initial frame image based on the user's selection operation; or
第二确定子单元,配置为获取用于描述跟踪目标的特征信息;基于特征信息,在初始帧图像中确定跟踪目标。 a second determining subunit configured to acquire feature information for describing the tracking target; and determining a tracking target in the initial frame image based on the feature information.
作为一种可选的实施例,提取单元802,包括:As an optional embodiment, the extracting unit 802 includes:
第一确定子单元，配置为确定跟踪目标在第i-1帧图像中的第i-1包围框，其中，第i-1帧图像属于图像数据，i为大于等于2的整数；在i等于2时，第i-1帧图像即为初始帧图像；a first determining subunit configured to determine the (i-1)-th bounding box of the tracking target in the (i-1)-th frame image, where the (i-1)-th frame image belongs to the image data and i is an integer greater than or equal to 2; when i equals 2, the (i-1)-th frame image is the initial frame image;
第二确定子单元，配置为基于第i-1包围框，在第i帧图像中确定第i图像块，其中，第i帧图像即为后续帧图像，第i图像块的中心与第i-1包围框的中心位置相同，第i图像块的面积大于第i-1包围框的面积；a second determining subunit configured to determine the i-th image block in the i-th frame image based on the (i-1)-th bounding box, where the i-th frame image is the subsequent frame image, the center of the i-th image block coincides with the center of the (i-1)-th bounding box, and the area of the i-th image block is larger than the area of the (i-1)-th bounding box;
第三确定子单元,配置为在第i图像块内确定多个候选目标。A third determining subunit configured to determine a plurality of candidate targets within the ith image block.
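The search-region construction performed by the first and second determining subunits — same center as the previous bounding box, larger area — might be sketched like this. The enlargement factor `scale` is an assumed parameter; the text only requires the block to be larger than the previous box:

```python
def search_region(prev_box, scale=2.0):
    """Return the i-th image block: same center as the (i-1)-th bounding box,
    each side enlarged by `scale` (an assumption; the patent only requires the
    block to be larger than the previous box).

    prev_box: (x, y, w, h) with (x, y) the top-left corner.
    """
    x, y, w, h = prev_box
    cx, cy = x + w / 2.0, y + h / 2.0   # center of the previous bounding box
    nw, nh = w * scale, h * scale       # enlarged width and height
    # keep the same center, so the tracked object is likely inside the block
    return (cx - nw / 2.0, cy - nh / 2.0, nw, nh)
```

In practice the returned rectangle would also be clipped to the image boundaries before cropping.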
作为一种可选的实施例,计算单元803,包括:As an optional embodiment, the calculating unit 803 includes:
第一选择子单元,配置为从多个候选目标中选出第一候选目标,其中,第一候选目标是多个候选目标中的任一候选目标;a first selection sub-unit configured to select a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
第一计算子单元,配置为计算第一候选目标的第一颜色特征向量,以及计算跟踪目标的第二颜色特征向量;a first calculation subunit configured to calculate a first color feature vector of the first candidate target and calculate a second color feature vector of the tracking target;
第二计算子单元,配置为计算第一颜色特征向量和第二颜色特征向量的距离,其中,距离即为第一候选目标与跟踪目标的相似度。And a second calculating subunit configured to calculate a distance between the first color feature vector and the second color feature vector, wherein the distance is the similarity between the first candidate target and the tracking target.
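The distance-as-similarity step of the second calculating subunit can be illustrated with a minimal sketch. The patent does not fix the metric, so Euclidean distance (negated, so that a higher value means more similar) is an assumption here:

```python
import math

def color_similarity(f1, f2):
    """Similarity of two color feature vectors as negative Euclidean distance:
    the smaller the distance, the higher the similarity. Euclidean distance is
    an assumption of this sketch; the patent only speaks of 'the distance'."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
```

Identical vectors score 0 (the maximum), and the score decreases as the vectors move apart.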
作为一种可选的实施例,第一计算子单元,还配置为:As an optional embodiment, the first computing subunit is further configured to:
将第一候选目标图像进行主成分分割，获得第一mask图像；以及，将跟踪目标的图像进行主成分分割，获得第二mask图像；将第一mask图像和第二mask图像缩放至相同大小；将第一mask图像平均分成M个区域；以及，将第二mask图像平均分成M个区域，M为正整数；计算第一mask图像中每个区域的颜色特征向量；以及，计算第二mask图像中每个区域的颜色特征向量；将第一mask图像中每个区域的颜色特征向量顺序连接，获得第一颜色特征向量；以及，将第二mask图像中每个区域的颜色特征向量顺序连接，获得第二颜色特征向量。performing principal component segmentation on the image of the first candidate target to obtain a first mask image, and performing principal component segmentation on the image of the tracking target to obtain a second mask image; scaling the first mask image and the second mask image to the same size; evenly dividing the first mask image into M regions, and evenly dividing the second mask image into M regions, M being a positive integer; calculating a color feature vector of each region in the first mask image, and calculating a color feature vector of each region in the second mask image; sequentially concatenating the color feature vectors of the regions in the first mask image to obtain the first color feature vector, and sequentially concatenating the color feature vectors of the regions in the second mask image to obtain the second color feature vector.
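A rough sketch of the per-region scheme above: split the scaled mask image into M equal regions, compute a color feature per region, and concatenate them in order. The layout of the M regions (horizontal strips here) and the per-region feature callable are assumptions for illustration; the text does not specify them:

```python
import numpy as np

def region_color_feature(mask_img, M, region_feature):
    """Concatenated per-region color feature of a (scaled) mask image.

    mask_img:       H x W x 3 array.
    M:              number of equal regions (horizontal strips is an assumption).
    region_feature: callable mapping a region (h x W x 3) to a 1-D feature.
    """
    h = mask_img.shape[0]
    # M horizontal strips of (roughly) equal height
    strips = [mask_img[i * h // M:(i + 1) * h // M] for i in range(M)]
    # concatenate the strip features in order -> one long feature vector
    return np.concatenate([region_feature(s) for s in strips])
```

Because the two mask images are first scaled to the same size and split the same way, the two concatenated vectors have equal length and can be compared directly by a distance.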
作为一种可选的实施例,第一计算子单元,还配置为: As an optional embodiment, the first computing subunit is further configured to:
确定W种主颜色，W为正整数；计算第一mask图像中第一区域中每个像素在每种主颜色上的投影权重，第一区域是第一mask图像中的M个区域中的任一区域；以及，计算第二mask图像中第二区域中每个像素在每种主颜色上的投影权重，第二区域是第二mask图像中的M个区域中的任一区域；基于第一区域中每个像素在每种主颜色上的投影权重，获得第一区域中每个像素对应的W维颜色特征向量；以及，基于第二区域中每个像素在每种主颜色上的投影权重，获得第二区域中每个像素对应的W维颜色特征向量；对第一区域中每个像素对应的W维颜色特征向量进行归一化，获得第一区域中每个像素的颜色特征向量；以及，对第二区域中每个像素对应的W维颜色特征向量进行归一化，获得第二区域中每个像素的颜色特征向量；将第一区域中每个像素的颜色特征向量相加，获得第一区域的颜色特征向量；以及，将第二区域中每个像素的颜色特征向量相加，获得第二区域的颜色特征向量。determining W main colors, W being a positive integer; calculating the projection weight of each pixel in a first region of the first mask image on each main color, the first region being any one of the M regions in the first mask image, and calculating the projection weight of each pixel in a second region of the second mask image on each main color, the second region being any one of the M regions in the second mask image; obtaining, based on the projection weights of each pixel in the first region on the main colors, a W-dimensional color feature vector for each pixel in the first region, and obtaining, based on the projection weights of each pixel in the second region on the main colors, a W-dimensional color feature vector for each pixel in the second region; normalizing the W-dimensional color feature vector of each pixel in the first region to obtain the color feature vector of each pixel in the first region, and normalizing the W-dimensional color feature vector of each pixel in the second region to obtain the color feature vector of each pixel in the second region; summing the color feature vectors of the pixels in the first region to obtain the color feature vector of the first region, and summing the color feature vectors of the pixels in the second region to obtain the color feature vector of the second region.
作为一种可选的实施例，第一计算子单元，还配置为基于如下等式，计算第一像素在第n种主颜色上的投影权重：As an optional embodiment, the first calculating subunit is further configured to calculate the projection weight of a first pixel on the n-th main color based on the following equation:
Figure PCTCN2017110577-appb-000005
其中，第一像素为第一区域或第二区域中的任一像素，第n种主颜色是W种主颜色中的任意一种主颜色，wn为第一像素在第n种主颜色上的投影权重，Ir、Ig、Ib为所述第一像素的RGB值；Rn、Gn、Bn为所述第n种主颜色的RGB值。Here the first pixel is any pixel in the first region or the second region, the n-th main color is any one of the W main colors, wn is the projection weight of the first pixel on the n-th main color, Ir, Ig, Ib are the RGB values of the first pixel, and Rn, Gn, Bn are the RGB values of the n-th main color.
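The projection-weight equation itself is reproduced above only as an image reference, so the sketch below substitutes an assumed weight — inverse squared RGB distance to each main color — merely to illustrate the "W-dimensional feature per pixel, then normalize" pipeline. It is not the patent's actual formula:

```python
import numpy as np

def projection_weights(pixel_rgb, main_colors, eps=1e-6):
    """W-dimensional color feature of one pixel: one projection weight per
    main color. The patent's exact weight formula is not reproduced in the
    text, so this sketch uses inverse squared RGB distance (closer main
    color -> larger weight), then applies the normalization step from the
    text so the W weights sum to 1.
    """
    p = np.asarray(pixel_rgb, dtype=float)
    d2 = np.sum((np.asarray(main_colors, dtype=float) - p) ** 2, axis=1)
    w = 1.0 / (d2 + eps)   # assumed stand-in for the patent's weight wn
    return w / w.sum()     # normalization to a unit-sum W-dimensional vector
```

Summing these per-pixel vectors over a region then yields the region's color feature vector, as the preceding paragraph describes.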
作为一种可选的实施例,计算单元803,包括:As an optional embodiment, the calculating unit 803 includes:
第二选择子单元,配置为从多个候选目标中选出第一候选目标,其中,第一候选目标是多个候选目标中的任一候选目标;a second selection subunit, configured to select a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
归一化子单元，配置为将第一候选目标的图像与跟踪目标的图像归一化至相同大小；a normalization subunit configured to normalize the image of the first candidate target and the image of the tracking target to the same size;
第一输入子单元,配置为将跟踪目标的图像输入至第一深度神经网络的第一卷积网络中进行特征计算,获得跟踪目标的特征向量,其中,第一深度神经网络基于Siamese结构;a first input subunit, configured to input an image of the tracking target into a first convolution network of the first depth neural network for feature calculation, to obtain a feature vector of the tracking target, wherein the first depth neural network is based on the Siamese structure;
第二输入子单元,配置为将第一候选目标的图像输入至第一深度神经网络的第二卷积网络中进行特征计算,获得第一候选目标的特征向量;a second input subunit, configured to input an image of the first candidate target into a second convolution network of the first depth neural network to perform feature calculation, to obtain a feature vector of the first candidate target;
第三输入子单元，配置为将跟踪目标的特征向量和第一候选目标的特征向量输入至第一深度神经网络的第一全连接网络中进行相似度计算，获得第一候选目标与跟踪目标的相似度。a third input subunit configured to input the feature vector of the tracking target and the feature vector of the first candidate target into a first fully connected network of the first deep neural network for similarity calculation, obtaining the similarity between the first candidate target and the tracking target.
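A toy numpy stand-in for the Siamese arrangement just described: both branches apply the same shared weights, and the two branch features are concatenated and scored by a fully connected head. Real embodiments would use trained convolutional layers; the random linear maps here only illustrate the weight-sharing structure:

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.standard_normal((8, 12))  # shared branch weights (stand-in for conv layers)
W_fc = rng.standard_normal(16)           # fully connected similarity head

def features(img_vec):
    """Branch feature extractor; BOTH branches call this same function,
    i.e. they share parameters (the Siamese property)."""
    return np.maximum(W_shared @ img_vec, 0.0)  # linear map + ReLU

def siamese_similarity(target_vec, candidate_vec):
    """Concatenate the two branch features and score them with a (here,
    linear) fully connected head, squashed to (0, 1) by a sigmoid."""
    f = np.concatenate([features(target_vec), features(candidate_vec)])
    return 1.0 / (1.0 + np.exp(-float(W_fc @ f)))
```

The key point the structure conveys: because the two branches share weights, swapping in a new candidate never changes how the tracking target itself is embedded.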
作为一种可选的实施例,第三确定子单元,还配置为:As an optional embodiment, the third determining subunit is further configured to:
将第i图像块输入至第二深度神经网络的第三卷积网络中进行特征计算，获得第i图像块的特征图，其中，第二深度神经网络基于Siamese结构；将第i图像块的特征图输入至第二深度神经网络的RPN网络中，提取出多个候选目标，并获得多个候选目标的特征向量。inputting the i-th image block into a third convolutional network of a second deep neural network for feature calculation to obtain a feature map of the i-th image block, where the second deep neural network is based on the Siamese structure; and inputting the feature map of the i-th image block into the RPN network of the second deep neural network to extract the plurality of candidate targets and obtain the feature vectors of the plurality of candidate targets.
作为一种可选的实施例,计算单元803,包括:As an optional embodiment, the calculating unit 803 includes:
提取子单元,配置为从多个候选目标的特征向量中提取第一候选目标的特征向量,其中,第一候选目标为多个候选目标中的任一候选目标;Extracting a subunit, configured to extract a feature vector of the first candidate target from the feature vectors of the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
第四输入子单元,配置为将跟踪目标的图像输入至第二深度神经网络的第四卷积网络中进行特征计算,获得跟踪目标的特征向量;a fourth input subunit configured to input an image of the tracking target into a fourth convolution network of the second depth neural network for feature calculation, to obtain a feature vector of the tracking target;
第五输入子单元，配置为将跟踪目标的特征向量和第一候选目标的特征向量输入至第二深度神经网络的第二全连接网络中进行相似度计算，获得第一候选目标与跟踪目标的相似度。a fifth input subunit configured to input the feature vector of the tracking target and the feature vector of the first candidate target into a second fully connected network of the second deep neural network for similarity calculation, obtaining the similarity between the first candidate target and the tracking target.
由于本实施例所介绍的电子设备为实施本发明实施例中目标跟踪方法的方法所采用的电子设备，故而基于本发明实施例中所介绍的目标跟踪方法的方法，本领域所属技术人员能够了解本实施例的电子设备的具体实施方式以及其各种变化形式，所以在此对于该电子设备如何实现本发明实施例中的方法不再详细介绍。只要本领域所属技术人员实施本发明实施例中目标跟踪方法的方法所采用的电子设备，都属于本发明所欲保护的范围。Since the electronic device introduced in this embodiment is the electronic device used to implement the target tracking method in the embodiments of the present invention, a person skilled in the art can, based on the target tracking method introduced in the embodiments of the present invention, understand the specific implementations of the electronic device of this embodiment and its various variations; therefore, how the electronic device implements the method in the embodiments of the present invention is not described in detail here. Any electronic device used by a person skilled in the art to implement the target tracking method in the embodiments of the present invention falls within the intended scope of protection of the present invention.
实际应用中，所述第一确定单元801、提取单元802、计算单元803以及第二确定单元804均可以运行于电子设备上，可由位于电子设备上的中央处理器（CPU）、或微处理器（MPU）、或数字信号处理器（DSP）、或可编程门阵列（FPGA）实现。In practical applications, the first determining unit 801, the extracting unit 802, the calculating unit 803, and the second determining unit 804 may all run on an electronic device, and may be implemented by a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP), or a field-programmable gate array (FPGA) located on the electronic device.
上述本发明实施例中的技术方案,至少具有如下的技术效果或优点:The technical solutions in the foregoing embodiments of the present invention have at least the following technical effects or advantages:
由于将后续每一帧图像的候选目标与初始帧图像中的跟踪目标进行比较，将候选目标中相似度最高的候选目标确定为跟踪目标，从而实现了对跟踪目标的跟踪。本发明实施例中的电子设备与现有技术中的利用在线学习的视觉跟踪方法的电子设备相比，对于初始帧之后的每一帧的处理，都可以看作是在判断目标是否跟丢，具有可靠地判断跟踪目标是否跟丢的优点；并且不需要维持跟踪模板，避免了跟踪模板的持续更新导致误差被持续放大，有利于找回跟丢的跟踪目标，从而提高了跟踪系统的鲁棒性。Since the candidate targets of each subsequent frame image are compared with the tracking target in the initial frame image, and the candidate target with the highest similarity among them is determined as the tracking target, tracking of the target is achieved. Compared with prior-art electronic devices that use visual tracking methods based on online learning, the electronic device in the embodiments of the present invention treats the processing of each frame after the initial frame as a check of whether the target has been lost, and can therefore reliably determine whether the tracking target is lost. Moreover, no tracking template needs to be maintained, which avoids the continuous amplification of errors caused by continual template updates and makes it easier to recover a lost tracking target, thereby improving the robustness of the tracking system.
在一具体实施例中，所述电子设备，包括：处理器和用于存储能够在处理器上运行的计算机程序的存储器，其中，所述处理器用于运行所述计算机程序时，执行以上所述方法的步骤。In a specific embodiment, the electronic device includes a processor and a memory for storing a computer program executable on the processor, where the processor is configured, when running the computer program, to perform the steps of the method described above.
这里，实际应用中，存储器可以由任何类型的易失性或非易失性存储设备、或者它们的组合来实现。其中，非易失性存储器可以是只读存储器（ROM，Read Only Memory）、可编程只读存储器（PROM，Programmable Read-Only Memory）、可擦除可编程只读存储器（EPROM，Erasable Programmable Read-Only Memory）、电可擦除可编程只读存储器（EEPROM，Electrically Erasable Programmable Read-Only Memory）、磁性随机存取存储器（FRAM，Ferromagnetic Random Access Memory）、快闪存储器（Flash Memory）、磁表面存储器、光盘、或只读光盘（CD-ROM，Compact Disc Read-Only Memory）；磁表面存储器可以是磁盘存储器或磁带存储器。易失性存储器可以是随机存取存储器（RAM，Random Access Memory），其用作外部高速缓存。通过示例性但不是限制性说明，许多形式的RAM可用，例如静态随机存取存储器（SRAM，Static Random Access Memory）、同步静态随机存取存储器（SSRAM，Synchronous Static Random Access Memory）、动态随机存取存储器（DRAM，Dynamic Random Access Memory）、同步动态随机存取存储器（SDRAM，Synchronous Dynamic Random Access Memory）、双倍数据速率同步动态随机存取存储器（DDRSDRAM，Double Data Rate Synchronous Dynamic Random Access Memory）、增强型同步动态随机存取存储器（ESDRAM，Enhanced Synchronous Dynamic Random Access Memory）、同步连接动态随机存取存储器（SLDRAM，SyncLink Dynamic Random Access Memory）、直接内存总线随机存取存储器（DRRAM，Direct Rambus Random Access Memory）。本发明实施例描述的存储器旨在包括但不限于这些和任意其它适合类型的存储器。Here, in practical applications, the memory may be implemented by any type of volatile or non-volatile storage device, or a combination thereof. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memories described in the embodiments of the present invention are intended to include, but are not limited to, these and any other suitable types of memory.
所述处理器可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器可以是通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。处理器可以实现或者执行本发明实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者任何常规的处理器等。结合本发明实施例所公开的方法的步骤,可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于存储介质中,该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成前述方法的步骤。The processor may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in a processor or an instruction in a form of software. The above processor may be a general purpose processor, a digital signal processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware component, or the like. The processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor can be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiment of the present invention may be directly implemented as a hardware decoding processor, or may be performed by a combination of hardware and software modules in the decoding processor. The software module can be located in a storage medium, the storage medium being located in the memory, the processor reading the information in the memory, and completing the steps of the foregoing methods in combination with the hardware thereof.
本发明的实施例还提供了一种计算机可读存储介质，例如包括计算机程序的存储器，上述计算机程序可由以上所述电子设备的处理器执行，以完成前述方法所述步骤。计算机可读存储介质可以是FRAM、ROM、可编程只读存储器PROM、EPROM、EEPROM、Flash Memory、磁表面存储器、光盘、或CD-ROM等存储器；也可以是包括上述存储器之一或任意组合的各种设备。Embodiments of the present invention also provide a computer-readable storage medium, for example a memory including a computer program, where the computer program can be executed by the processor of the electronic device described above to perform the steps of the foregoing method. The computer-readable storage medium may be a memory such as an FRAM, a ROM, a PROM, an EPROM, an EEPROM, a Flash Memory, a magnetic surface memory, an optical disc, or a CD-ROM; it may also be any of various devices including one or any combination of the above memories.
本领域内的技术人员应明白,本发明的实施例可提供为方法、系统、或计算机程序产品。因此,本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art will appreciate that embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
本发明是参照根据本发明实施例的方法、设备（系统）、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。The computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising the instruction device. The apparatus implements the functions specified in one or more blocks of a flow or a flow and/or block diagram of the flowchart.
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
尽管已描述了本发明的优选实施例，但本领域内的技术人员一旦得知了基本创造性概念，则可对这些实施例作出另外的变更和修改。所以，所附权利要求意欲解释为包括优选实施例以及落入本发明范围的所有变更和修改。Although preferred embodiments of the present invention have been described, those skilled in the art, once aware of the basic inventive concept, can make further changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.
显然，本领域的技术人员可以对本发明进行各种改动和变型而不脱离本发明的精神和范围。这样，倘若本发明的这些修改和变型属于本发明权利要求及其等同技术的范围之内，则本发明也意图包含这些改动和变型在内。Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to encompass such changes and variations.
工业实用性Industrial applicability
本发明实施例具有可靠地判断跟踪目标是否跟丢的优点，并且不需要维持跟踪模板，避免了跟踪模板的持续更新导致误差被持续放大，有利于找回跟丢的跟踪目标，从而提高了跟踪系统的鲁棒性。The embodiments of the present invention have the advantage of reliably determining whether the tracking target has been lost, and do not need to maintain a tracking template, which avoids the continuous amplification of errors caused by continual template updates and makes it easier to recover a lost tracking target, thereby improving the robustness of the tracking system.

Claims (22)

  1. 一种目标跟踪方法，应用于电子设备中，所述电子设备具有图像采集单元，所述图像采集单元用于采集图像数据，所述方法包括：A target tracking method, applied to an electronic device, wherein the electronic device has an image acquisition unit configured to collect image data, the method comprising:
    在所述图像数据的初始帧图像中确定一跟踪目标;Determining a tracking target in an initial frame image of the image data;
    在所述图像数据的后续帧图像中提取多个候选目标,所述后续帧图像是所述初始帧图像之后的任一帧图像;Extracting a plurality of candidate targets in a subsequent frame image of the image data, the subsequent frame images being any frame image subsequent to the initial frame image;
    计算出候选目标与所述跟踪目标的相似度;Calculating the similarity between the candidate target and the tracking target;
    将所述多个候选目标中的与所述跟踪目标的相似度最高的候选目标确定为所述跟踪目标。A candidate target having the highest similarity with the tracking target among the plurality of candidate targets is determined as the tracking target.
  2. 如权利要求1所述的目标跟踪方法,其中,所述在图像数据的初始帧图像中确定一跟踪目标,包括:The target tracking method according to claim 1, wherein said determining a tracking target in the initial frame image of the image data comprises:
    在通过显示屏输出所述初始帧图像后,获取用户的选择操作;基于用户的选择操作,在所述初始帧图像中确定所述跟踪目标;或者,After outputting the initial frame image through the display screen, acquiring a user's selection operation; determining the tracking target in the initial frame image based on a user's selection operation; or
    获取用于描述所述跟踪目标的特征信息;基于所述特征信息,在所述初始帧图像中确定所述跟踪目标。Obtaining feature information for describing the tracking target; determining the tracking target in the initial frame image based on the feature information.
  3. 如权利要求1所述的目标跟踪方法,其中,所述在图像数据的后续帧图像中提取多个候选目标,包括:The target tracking method according to claim 1, wherein the extracting a plurality of candidate targets in a subsequent frame image of the image data comprises:
    确定所述跟踪目标在第i-1帧图像中的第i-1包围框，其中，所述第i-1帧图像属于所述图像数据，i为大于等于2的整数；在i等于2时，所述第i-1帧图像即为所述初始帧图像；determining the (i-1)-th bounding box of the tracking target in the (i-1)-th frame image, wherein the (i-1)-th frame image belongs to the image data and i is an integer greater than or equal to 2; when i equals 2, the (i-1)-th frame image is the initial frame image;
    基于所述第i-1包围框，在第i帧图像中确定第i图像块，其中，所述第i帧图像即为所述后续帧图像，所述第i图像块的中心与所述第i-1包围框的中心位置相同，所述第i图像块的面积大于所述第i-1包围框的面积；determining, based on the (i-1)-th bounding box, an i-th image block in the i-th frame image, wherein the i-th frame image is the subsequent frame image, the center of the i-th image block coincides with the center of the (i-1)-th bounding box, and the area of the i-th image block is larger than the area of the (i-1)-th bounding box;
    在所述第i图像块内确定所述多个候选目标。The plurality of candidate targets are determined within the ith image block.
  4. 如权利要求1所述的目标跟踪方法，其中，所述计算出候选目标与所述跟踪目标的相似度，包括：The target tracking method according to claim 1, wherein calculating the similarity between a candidate target and the tracking target comprises:
    从所述多个候选目标中选出第一候选目标,其中,所述第一候选目标是所述多个候选目标中的任一候选目标;Selecting a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
    计算所述第一候选目标的第一颜色特征向量,以及计算所述跟踪目标的第二颜色特征向量;Calculating a first color feature vector of the first candidate target, and calculating a second color feature vector of the tracking target;
    计算所述第一颜色特征向量和所述第二颜色特征向量的距离,其中,所述距离即为所述第一候选目标与所述跟踪目标的相似度。Calculating a distance between the first color feature vector and the second color feature vector, wherein the distance is a similarity between the first candidate target and the tracking target.
  5. 如权利要求4所述的目标跟踪方法,其中,所述计算所述第一候选目标的第一颜色特征向量,以及计算所述跟踪目标的第二颜色特征向量,包括:The target tracking method according to claim 4, wherein the calculating the first color feature vector of the first candidate target and calculating the second color feature vector of the tracking target comprises:
    将所述第一候选目标的图像进行主成分分割，获得第一mask图像；以及，将所述跟踪目标的图像进行主成分分割，获得第二mask图像；performing principal component segmentation on the image of the first candidate target to obtain a first mask image; and performing principal component segmentation on the image of the tracking target to obtain a second mask image;
    将所述第一mask图像和所述第二mask图像缩放至相同大小;Scaling the first mask image and the second mask image to the same size;
    将所述第一mask图像平均分成M个区域；以及，将所述第二mask图像平均分成M个区域，M为正整数；evenly dividing the first mask image into M regions; and evenly dividing the second mask image into M regions, M being a positive integer;
    计算所述第一mask图像中每个区域的颜色特征向量;以及,计算所述第二mask图像中每个区域的颜色特征向量;Calculating a color feature vector of each region in the first mask image; and calculating a color feature vector of each region in the second mask image;
    将所述第一mask图像中每个区域的颜色特征向量顺序连接，获得所述第一颜色特征向量；以及，将所述第二mask图像中每个区域的颜色特征向量顺序连接，获得所述第二颜色特征向量。sequentially concatenating the color feature vectors of the regions in the first mask image to obtain the first color feature vector; and sequentially concatenating the color feature vectors of the regions in the second mask image to obtain the second color feature vector.
  6. 如权利要求5所述的目标跟踪方法，其中，所述计算所述第一mask图像中每个区域的颜色特征向量；以及，计算所述第二mask图像中每个区域的颜色特征向量，包括：The target tracking method according to claim 5, wherein calculating the color feature vector of each region in the first mask image and calculating the color feature vector of each region in the second mask image comprises:
    确定W种主颜色,W为正整数;Determine the W main color, W is a positive integer;
    计算所述第一mask图像中第一区域中每个像素在每种主颜色上的投影权重，所述第一区域是所述第一mask图像中的M个区域中的任一区域；以及，计算所述第二mask图像中第二区域中每个像素在每种主颜色上的投影权重，所述第二区域是所述第二mask图像中的M个区域中的任一区域；calculating the projection weight, on each main color, of each pixel in a first region of the first mask image, the first region being any one of the M regions in the first mask image; and calculating the projection weight, on each main color, of each pixel in a second region of the second mask image, the second region being any one of the M regions in the second mask image;
    对所述第一区域中每个像素对应的W维颜色特征向量进行归一化，获得所述第一区域中每个像素的颜色特征向量；以及，对所述第二区域中每个像素对应的W维颜色特征向量进行归一化，获得所述第二区域中每个像素的颜色特征向量；normalizing the W-dimensional color feature vector corresponding to each pixel in the first region to obtain the color feature vector of each pixel in the first region; and normalizing the W-dimensional color feature vector corresponding to each pixel in the second region to obtain the color feature vector of each pixel in the second region;
    将所述第一区域中每个像素的颜色特征向量相加，获得所述第一区域的颜色特征向量；以及，将所述第二区域中每个像素的颜色特征向量相加，获得所述第二区域的颜色特征向量。adding the color feature vectors of the pixels in the first region to obtain the color feature vector of the first region; and adding the color feature vectors of the pixels in the second region to obtain the color feature vector of the second region.
  7. 如权利要求6所述的目标跟踪方法，其中，基于如下等式，计算第一像素在第n种主颜色上的投影权重：The target tracking method according to claim 6, wherein the projection weight of a first pixel on the n-th main color is calculated based on the following equation:
    Figure PCTCN2017110577-appb-100001
    其中，所述第一像素为所述第一区域或所述第二区域中的任一像素，所述第n种主颜色是所述W种主颜色中的任意一种主颜色，wn为所述第一像素在所述第n种主颜色上的投影权重，Ir、Ig、Ib为所述第一像素的RGB值；Rn、Gn、Bn为所述第n种主颜色的RGB值。wherein the first pixel is any pixel in the first region or the second region, the n-th main color is any one of the W main colors, wn is the projection weight of the first pixel on the n-th main color, Ir, Ig, Ib are the RGB values of the first pixel, and Rn, Gn, Bn are the RGB values of the n-th main color.
  8. 如权利要求1所述的目标跟踪方法,其中,所述计算出候选目标与所述跟踪目标的相似度,包括:The target tracking method according to claim 1, wherein the calculating the similarity between the candidate target and the tracking target comprises:
    从所述多个候选目标中选出第一候选目标,其中,所述第一候选目标是所述多个候选目标中的任一候选目标; Selecting a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
    将所述第一候选目标的图像与所述跟踪目标的图像归一化至相同大小；normalizing the image of the first candidate target and the image of the tracking target to the same size;
    将所述跟踪目标的图像输入至第一深度神经网络的第一卷积网络中进行特征计算,获得所述跟踪目标的特征向量,其中,所述第一深度神经网络基于Siamese结构;Entering an image of the tracking target into a first convolutional network of a first depth neural network for feature calculation to obtain a feature vector of the tracking target, wherein the first depth neural network is based on a Siamese structure;
    将所述第一候选目标的图像输入至所述第一深度神经网络的第二卷积网络中进行特征计算，获得所述第一候选目标的特征向量，所述第二卷积网络和所述第一卷积网络共享卷积层参数；inputting the image of the first candidate target into a second convolutional network of the first deep neural network for feature calculation to obtain the feature vector of the first candidate target, the second convolutional network sharing convolutional layer parameters with the first convolutional network;
    将所述跟踪目标的特征向量和所述第一候选目标的特征向量输入至所述第一深度神经网络的第一全连接网络中进行相似度计算，获得所述第一候选目标与所述跟踪目标的相似度。inputting the feature vector of the tracking target and the feature vector of the first candidate target into a first fully connected network of the first deep neural network for similarity calculation, obtaining the similarity between the first candidate target and the tracking target.
  9. The target tracking method according to claim 3, wherein determining the plurality of candidate targets within the ith image block comprises:
    inputting the ith image block into a third convolutional network of a second deep neural network for feature computation, to obtain a feature map of the ith image block, wherein the second deep neural network is based on a Siamese structure; and
    inputting the feature map of the ith image block into a region proposal network (RPN) of the second deep neural network, to obtain the plurality of candidate targets and feature vectors of the plurality of candidate targets.
  10. The target tracking method according to claim 9, wherein calculating the similarity between each candidate target and the tracking target comprises:
    extracting a feature vector of a first candidate target from the feature vectors of the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
    inputting the image of the tracking target into a fourth convolutional network of the second deep neural network for feature computation, to obtain a feature vector of the tracking target, the fourth convolutional network and the third convolutional network sharing convolutional-layer parameters; and
    inputting the feature vector of the tracking target and the feature vector of the first candidate target into a second fully connected network of the second deep neural network for similarity computation, to obtain the similarity between the first candidate target and the tracking target.
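The RPN stage of claims 9 and 10 can be caricatured as scoring locations on the feature map and keeping the best ones as candidates, each carrying its own feature vector. The sketch below is a deliberately simplified stand-in — a single linear objectness head over a random feature map, with no anchor shapes or box regression, both of which a real RPN includes:

```python
import numpy as np

rng = np.random.default_rng(1)

feature_map = rng.standard_normal((6, 6, 8))   # H x W x C features of the ith image block
w_obj = rng.standard_normal(8) * 0.1           # objectness scoring weights (illustrative)

def propose(fmap, w, k=3):
    """Score every feature-map location and return the top-k candidates
    together with their per-candidate feature vectors."""
    h, w_dim, _ = fmap.shape
    scores = fmap @ w                            # objectness score per location
    top = np.argsort(scores.ravel())[::-1][:k]   # indices of the k best locations
    coords = [(i // w_dim, i % w_dim) for i in top]
    feats = [fmap[r, c] for r, c in coords]      # feature vector of each candidate
    return coords, feats

candidate_coords, candidate_feats = propose(feature_map, w_obj)
```

The returned `candidate_feats` play the role of the per-candidate feature vectors that claim 10 then feeds, alongside the tracking target's feature vector, into the fully connected similarity head.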
  11. An electronic device, the electronic device having an image acquisition unit configured to acquire image data, the electronic device comprising:
    a first determining unit, configured to determine a tracking target in an initial frame image of the image data;
    an extracting unit, configured to extract a plurality of candidate targets from a subsequent frame image of the image data, the subsequent frame image being any frame image after the initial frame image;
    a calculating unit, configured to calculate the similarity between a candidate target and the tracking target; and
    a second determining unit, configured to determine, among the plurality of candidate targets, the candidate target having the highest similarity with the tracking target as the tracking target.
  12. The electronic device according to claim 11, wherein the first determining unit comprises:
    a first determining subunit, configured to acquire a selection operation of a user after the initial frame image is output through a display screen, and determine the tracking target in the initial frame image based on the selection operation of the user; or
    a second determining subunit, configured to acquire feature information describing the tracking target, and determine the tracking target in the initial frame image based on the feature information.
  13. The electronic device according to claim 11, wherein the extracting unit comprises:
    a first determining subunit, configured to determine an (i-1)th bounding box of the tracking target in an (i-1)th frame image, wherein the (i-1)th frame image belongs to the image data, i is an integer greater than or equal to 2, and when i equals 2, the (i-1)th frame image is the initial frame image;
    a second determining subunit, configured to determine an ith image block in an ith frame image based on the (i-1)th bounding box, wherein the ith frame image is the subsequent frame image, the center of the ith image block coincides with the center of the (i-1)th bounding box, and the area of the ith image block is larger than the area of the (i-1)th bounding box; and
    a third determining subunit, configured to determine the plurality of candidate targets within the ith image block.
  14. The electronic device according to claim 11, wherein the calculating unit comprises:
    a first selecting subunit, configured to select a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
    a first calculating subunit, configured to calculate a first color feature vector of the first candidate target and a second color feature vector of the tracking target; and
    a second calculating subunit, configured to calculate the distance between the first color feature vector and the second color feature vector, the distance being the similarity between the first candidate target and the tracking target.
  15. The electronic device according to claim 14, wherein the first calculating subunit is further configured to:
    perform principal component segmentation on the image of the first candidate target to obtain a first mask image, and perform principal component segmentation on the image of the tracking target to obtain a second mask image; scale the first mask image and the second mask image to the same size; divide the first mask image evenly into M regions, and divide the second mask image evenly into M regions, M being a positive integer; calculate a color feature vector of each region in the first mask image, and calculate a color feature vector of each region in the second mask image; and concatenate the color feature vectors of the regions in the first mask image in order to obtain the first color feature vector, and concatenate the color feature vectors of the regions in the second mask image in order to obtain the second color feature vector.
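The scale-split-concatenate pipeline of claim 15 can be sketched as below. The per-region feature here is a placeholder mean-RGB vector and the horizontal-strip split is an assumed layout; the patent's actual per-region feature is the main-color projection vector of claim 16:

```python
import numpy as np

def compute_region_feature(region):
    # Placeholder per-region color feature: mean RGB over the region.
    return region.reshape(-1, 3).mean(axis=0)

def image_color_vector(mask_img, M=4):
    """Split a mask image into M equal regions and concatenate the
    per-region color features in order, as in claim 15."""
    strips = np.array_split(mask_img, M, axis=0)   # M equal regions (assumed: horizontal strips)
    return np.concatenate([compute_region_feature(s) for s in strips])

# Both mask images would be scaled to the same size first, so that the two
# concatenated vectors are directly comparable by a distance measure.
vec = image_color_vector(np.zeros((32, 16, 3)), M=4)
```

With both mask images processed this way, the distance between the two concatenated vectors serves as the similarity of claim 14.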
  16. The electronic device according to claim 15, wherein the first calculating subunit is further configured to:
    determine W main colors, W being a positive integer; calculate the projection weight of each pixel in a first region of the first mask image on each main color, the first region being any one of the M regions in the first mask image, and calculate the projection weight of each pixel in a second region of the second mask image on each main color, the second region being any one of the M regions in the second mask image; obtain, based on the projection weights of each pixel in the first region on the main colors, a W-dimensional color feature vector for each pixel in the first region, and obtain, based on the projection weights of each pixel in the second region on the main colors, a W-dimensional color feature vector for each pixel in the second region; normalize the W-dimensional color feature vector of each pixel in the first region to obtain a color feature vector of each pixel in the first region, and normalize the W-dimensional color feature vector of each pixel in the second region to obtain a color feature vector of each pixel in the second region; and sum the color feature vectors of the pixels in the first region to obtain the color feature vector of the first region, and sum the color feature vectors of the pixels in the second region to obtain the color feature vector of the second region.
  17. The electronic device according to claim 16, wherein the first calculating subunit is further configured to calculate the projection weight of a first pixel on the nth main color based on the following equation:
    [Equation shown as image PCTCN2017110577-appb-100002 in the original filing]
    wherein the first pixel is any pixel in the first region or the second region, the nth main color is any one of the W main colors, wn is the projection weight of the first pixel on the nth main color, Ir, Ig, and Ib are the RGB values of the first pixel, and Rn, Gn, and Bn are the RGB values of the nth main color.
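Since the weight equation itself is only available as an image in this filing, the sketch below substitutes an assumed inverse-RGB-distance weighting for it; the palette is likewise an illustrative assumption. The surrounding steps — per-pixel W-dimensional vector, normalization, summation over a region — follow claim 16:

```python
import numpy as np

MAIN_COLORS = np.array(
    [[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 255]], dtype=float
)  # W = 4 main colors; this palette is an illustrative assumption

def pixel_weights(rgb, palette=MAIN_COLORS):
    """Normalized W-dimensional color feature vector for one pixel."""
    d = np.linalg.norm(palette - np.asarray(rgb, dtype=float), axis=1)
    w = 1.0 / (1.0 + d)   # assumed inverse-distance weight, NOT the patented equation
    return w / w.sum()    # normalization step of claim 16

def region_color_vector(pixels):
    """Sum the per-pixel vectors over a region, as recited in claim 16."""
    return np.sum([pixel_weights(p) for p in pixels], axis=0)

vec = region_color_vector([(250, 5, 5), (10, 240, 10)])
```

Any weighting that is largest for the nearest main color would slot into `pixel_weights` the same way, which is all the downstream region vector and distance comparison require.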
  18. The electronic device according to claim 11, wherein the calculating unit comprises:
    a second selecting subunit, configured to select a first candidate target from the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
    a normalizing subunit, configured to normalize the image of the first candidate target and the image of the tracking target to the same size;
    a first input subunit, configured to input the image of the tracking target into a first convolutional network of a first deep neural network for feature computation, to obtain a feature vector of the tracking target, wherein the first deep neural network is based on a Siamese structure;
    a second input subunit, configured to input the image of the first candidate target into a second convolutional network of the first deep neural network for feature computation, to obtain a feature vector of the first candidate target; and
    a third input subunit, configured to input the feature vector of the tracking target and the feature vector of the first candidate target into a first fully connected network of the first deep neural network for similarity computation, to obtain the similarity between the first candidate target and the tracking target.
  19. The electronic device according to claim 13, wherein the third determining subunit is further configured to:
    input the ith image block into a third convolutional network of a second deep neural network for feature computation, to obtain a feature map of the ith image block, wherein the second deep neural network is based on a Siamese structure; and input the feature map of the ith image block into a region proposal network (RPN) of the second deep neural network, to obtain the plurality of candidate targets and feature vectors of the plurality of candidate targets.
  20. The electronic device according to claim 19, wherein the calculating unit comprises:
    an extracting subunit, configured to extract a feature vector of a first candidate target from the feature vectors of the plurality of candidate targets, wherein the first candidate target is any one of the plurality of candidate targets;
    a fourth input subunit, configured to input the image of the tracking target into a fourth convolutional network of the second deep neural network for feature computation, to obtain a feature vector of the tracking target, the fourth convolutional network and the third convolutional network sharing convolutional-layer parameters; and
    a fifth input subunit, configured to input the feature vector of the tracking target and the feature vector of the first candidate target into a second fully connected network of the second deep neural network for similarity computation, to obtain the similarity between the first candidate target and the tracking target.
  21. An electronic device, comprising a processor and a memory for storing a computer program executable on the processor, wherein the processor is configured to perform the steps of the method of any one of claims 1 to 10 when running the computer program.
  22. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 10.
PCT/CN2017/110577 2016-11-11 2017-11-10 Target tracking method, electronic device, and storage medium WO2018086607A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611041675.6A CN106650630B (en) 2016-11-11 2016-11-11 A kind of method for tracking target and electronic equipment
CN201611041675.6 2016-11-11

Publications (1)

Publication Number Publication Date
WO2018086607A1 true WO2018086607A1 (en) 2018-05-17

Family

ID=58811573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/110577 WO2018086607A1 (en) 2016-11-11 2017-11-10 Target tracking method, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN106650630B (en)
WO (1) WO2018086607A1 (en)


Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650630B (en) * 2016-11-11 2019-08-23 纳恩博(北京)科技有限公司 A kind of method for tracking target and electronic equipment
CN107346413A (en) * 2017-05-16 2017-11-14 北京建筑大学 Traffic sign recognition method and system in a kind of streetscape image
CN109214238B (en) * 2017-06-30 2022-06-28 阿波罗智能技术(北京)有限公司 Multi-target tracking method, device, equipment and storage medium
CN107168343B (en) * 2017-07-14 2020-09-15 灵动科技(北京)有限公司 Control method of luggage case and luggage case
CN107292284B (en) * 2017-07-14 2020-02-28 成都通甲优博科技有限责任公司 Target re-detection method and device and unmanned aerial vehicle
US10592786B2 (en) 2017-08-14 2020-03-17 Huawei Technologies Co., Ltd. Generating labeled data for deep object tracking
CN107481265B (en) * 2017-08-17 2020-05-19 成都通甲优博科技有限责任公司 Target relocation method and device
CN108230359B (en) * 2017-11-12 2021-01-26 北京市商汤科技开发有限公司 Object detection method and apparatus, training method, electronic device, program, and medium
CN108229456B (en) * 2017-11-22 2021-05-18 深圳市商汤科技有限公司 Target tracking method and device, electronic equipment and computer storage medium
CN108171112B (en) * 2017-12-01 2021-06-01 西安电子科技大学 Vehicle identification and tracking method based on convolutional neural network
CN108133197B (en) * 2018-01-05 2021-02-05 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN110163029B (en) * 2018-02-11 2021-03-30 中兴飞流信息科技有限公司 Image recognition method, electronic equipment and computer readable storage medium
CN108416780B (en) * 2018-03-27 2021-08-31 福州大学 Object detection and matching method based on twin-region-of-interest pooling model
CN108491816A (en) * 2018-03-30 2018-09-04 百度在线网络技术(北京)有限公司 The method and apparatus for carrying out target following in video
CN108665485B (en) * 2018-04-16 2021-07-02 华中科技大学 Target tracking method based on relevant filtering and twin convolution network fusion
CN108596957B (en) * 2018-04-26 2022-07-22 北京小米移动软件有限公司 Object tracking method and device
CN108898620B (en) * 2018-06-14 2021-06-18 厦门大学 Target tracking method based on multiple twin neural networks and regional neural network
CN109118519A (en) * 2018-07-26 2019-01-01 北京纵目安驰智能科技有限公司 Target Re-ID method, system, terminal and the storage medium of Case-based Reasoning segmentation
CN109614907B (en) * 2018-11-28 2022-04-19 安徽大学 Pedestrian re-identification method and device based on feature-enhanced guided convolutional neural network
CN109685805B (en) * 2019-01-09 2021-01-26 银河水滴科技(北京)有限公司 Image segmentation method and device
CN111428535A (en) * 2019-01-09 2020-07-17 佳能株式会社 Image processing apparatus and method, and image processing system
CN111524159A (en) * 2019-02-01 2020-08-11 北京京东尚科信息技术有限公司 Image processing method and apparatus, storage medium, and processor
CN110147768B (en) * 2019-05-22 2021-05-28 云南大学 Target tracking method and device
CN112347817B (en) * 2019-08-08 2022-05-17 魔门塔(苏州)科技有限公司 Video target detection and tracking method and device
CN112800811B (en) * 2019-11-13 2023-10-13 深圳市优必选科技股份有限公司 Color block tracking method and device and terminal equipment
CN111178284A (en) * 2019-12-31 2020-05-19 珠海大横琴科技发展有限公司 Pedestrian re-identification method and system based on spatio-temporal union model of map data
CN111524162B (en) * 2020-04-15 2022-04-01 上海摩象网络科技有限公司 Method and device for retrieving tracking target and handheld camera
WO2022061615A1 (en) * 2020-09-23 2022-03-31 深圳市大疆创新科技有限公司 Method and apparatus for determining target to be followed, system, device, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090019149A1 (en) * 2005-08-02 2009-01-15 Mobixell Networks Content distribution and tracking
AU2011265494A1 (en) * 2011-12-22 2013-07-11 Canon Kabushiki Kaisha Kernalized contextual feature
CN103218798A (en) * 2012-01-19 2013-07-24 索尼公司 Device and method of image processing
CN103339655A (en) * 2011-02-03 2013-10-02 株式会社理光 Image capturing apparatus, image capturing method, and computer program product
CN103679743A (en) * 2012-09-06 2014-03-26 索尼公司 Target tracking device and method as well as camera
CN105184778A (en) * 2015-08-25 2015-12-23 广州视源电子科技股份有限公司 Detection method and apparatus
CN106650630A (en) * 2016-11-11 2017-05-10 纳恩博(北京)科技有限公司 Target tracking method and electronic equipment


Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105436A (en) * 2018-10-26 2020-05-05 曜科智能科技(上海)有限公司 Target tracking method, computer device, and storage medium
CN111105436B (en) * 2018-10-26 2023-05-09 曜科智能科技(上海)有限公司 Target tracking method, computer device and storage medium
CN111428539A (en) * 2019-01-09 2020-07-17 成都通甲优博科技有限责任公司 Target tracking method and device
US20210271892A1 (en) * 2019-04-26 2021-09-02 Tencent Technology (Shenzhen) Company Limited Action recognition method and apparatus, and human-machine interaction method and apparatus
US11710351B2 (en) * 2019-04-26 2023-07-25 Tencent Technology (Shenzhen) Company Limited Action recognition method and apparatus, and human-machine interaction method and apparatus
CN110335289A (en) * 2019-06-13 2019-10-15 河海大学 A kind of method for tracking target based on on-line study
CN110335289B (en) * 2019-06-13 2022-08-05 河海大学 Target tracking method based on online learning
CN110544268A (en) * 2019-07-29 2019-12-06 燕山大学 Multi-target tracking method based on structured light and SiamMask network
CN110544268B (en) * 2019-07-29 2023-03-24 燕山大学 Multi-target tracking method based on structured light and SiamMask network
CN110570460A (en) * 2019-09-06 2019-12-13 腾讯云计算(北京)有限责任公司 Target tracking method and device, computer equipment and computer readable storage medium
CN110570460B (en) * 2019-09-06 2024-02-13 腾讯云计算(北京)有限责任公司 Target tracking method, device, computer equipment and computer readable storage medium
CN110766720A (en) * 2019-09-23 2020-02-07 盐城吉大智能终端产业研究院有限公司 Multi-camera vehicle tracking system based on deep learning
CN110889718A (en) * 2019-11-15 2020-03-17 腾讯科技(深圳)有限公司 Method and apparatus for screening program, medium, and electronic device
CN110889718B (en) * 2019-11-15 2024-05-14 腾讯科技(深圳)有限公司 Scheme screening method, scheme screening device, medium and electronic equipment
CN113538507B (en) * 2020-04-15 2023-11-17 南京大学 Single-target tracking method based on full convolution network online training
CN113538507A (en) * 2020-04-15 2021-10-22 南京大学 Single-target tracking method based on full convolution network online training
CN111598928B (en) * 2020-05-22 2023-03-10 郑州轻工业大学 Abrupt motion target tracking method based on semantic evaluation and region suggestion
CN111598928A (en) * 2020-05-22 2020-08-28 郑州轻工业大学 Abrupt change moving target tracking method based on semantic evaluation and region suggestion
CN111914890B (en) * 2020-06-23 2024-05-14 北京迈格威科技有限公司 Image block matching method between images, image registration method and product
CN111914890A (en) * 2020-06-23 2020-11-10 北京迈格威科技有限公司 Image block matching method between images, image registration method and product
CN111783878A (en) * 2020-06-29 2020-10-16 北京百度网讯科技有限公司 Target detection method and device, electronic equipment and readable storage medium
CN111783878B (en) * 2020-06-29 2023-08-04 北京百度网讯科技有限公司 Target detection method, target detection device, electronic equipment and readable storage medium
CN111814905A (en) * 2020-07-23 2020-10-23 上海眼控科技股份有限公司 Target detection method, target detection device, computer equipment and storage medium
CN112037256A (en) * 2020-08-17 2020-12-04 中电科新型智慧城市研究院有限公司 Target tracking method and device, terminal equipment and computer readable storage medium
CN114491131B (en) * 2022-01-24 2023-04-18 北京至简墨奇科技有限公司 Method and device for reordering candidate images and electronic equipment
CN114491131A (en) * 2022-01-24 2022-05-13 北京至简墨奇科技有限公司 Method and device for reordering candidate images and electronic equipment

Also Published As

Publication number Publication date
CN106650630B (en) 2019-08-23
CN106650630A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
WO2018086607A1 (en) Target tracking method, electronic device, and storage medium
Wang et al. Detect globally, refine locally: A novel approach to saliency detection
Liu et al. Joint face alignment and 3d face reconstruction
US10776936B2 (en) Point cloud matching method
WO2022134337A1 (en) Face occlusion detection method and system, device, and storage medium
CN110909651B (en) Method, device and equipment for identifying video main body characters and readable storage medium
US11189020B2 (en) Systems and methods for keypoint detection
CN106203242B (en) Similar image identification method and equipment
US8805018B2 (en) Method of detecting facial attributes
EP4099221A1 (en) Face recognition method and apparatus
EP3147827A1 (en) Face recognition method and apparatus
CN109960742B (en) Local information searching method and device
US20160275339A1 (en) System and Method for Detecting and Tracking Facial Features In Images
Ishikura et al. Saliency detection based on multiscale extrema of local perceptual color differences
CN107316029B (en) A kind of living body verification method and equipment
US10489636B2 (en) Lip movement capturing method and device, and storage medium
CN109271930B (en) Micro-expression recognition method, device and storage medium
US9129152B2 (en) Exemplar-based feature weighting
JP2005327076A (en) Parameter estimation method, parameter estimation device and collation method
WO2021137946A1 (en) Forgery detection of face image
CN114155365B (en) Model training method, image processing method and related device
CN111091075A (en) Face recognition method and device, electronic equipment and storage medium
JP2021503139A (en) Image processing equipment, image processing method and image processing program
CN107862680A (en) A kind of target following optimization method based on correlation filter
Ibragimov et al. Accurate landmark-based segmentation by incorporating landmark misdetections

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17870266

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17870266

Country of ref document: EP

Kind code of ref document: A1