WO2021057309A1

WO2021057309A1 - Tracked target determination method and related device

Info

Publication number: WO2021057309A1
Application number: PCT/CN2020/108990
Authority: WO
Inventors: 丁旭; 胡文泽
Original assignee: 深圳云天励飞技术股份有限公司
Priority date: 2019-09-27
Filing date: 2020-08-13
Publication date: 2021-04-01
Also published as: CN110826403A; CN110826403B

Abstract

A tracked target determination method and a related device, which are applied to an electronic device. The method comprises: acquiring a first image and a second image from the same target video file, and acquiring N first tracking frames of the first image (101), wherein the first image is a previous preset frame image of the second image, and the first image and the second image both comprise N tracked targets; inputting the second image into an hourglass network model to perform feature extraction, and outputting a target feature map (102); inputting the target feature map into a prediction network so as to output a thermodynamic chart, a width and height value set, and a feature vector set (103); determining N second tracking frames on the basis of the thermodynamic chart and the width and height value set (104); determining N first correlations on the basis of the N first tracking frames, the N second tracking frames, and the feature vector set (105); and determining, on the basis of the N first correlations, tracked targets selected by means of the N second tracking frames (106). The method can improve the precision of determining tracked targets.

Description

Tracking target determination method and related equipment

Technical field

This application relates to the field of electronic technology, and in particular to a tracking target determination method and related equipment.

Background technique

Target tracking is one of the key technologies in the field of image processing and video processing. Target tracking is used to identify tracking targets in videos or images, and is widely used in related fields such as smart transportation, human-computer interaction, and national defense investigation. The determination of the tracking target is one of the essential key steps to achieve target tracking. At present, the determination of the tracking target is mainly based on the deep sort (deep sort) algorithm, and the use of deep sort only uses the predicted position information for matching, and the prediction accuracy low.

Summary of the invention

The embodiment of the present application provides a tracking target determination method and related equipment, which are used to improve the accuracy of determining the tracking target.

In the first aspect, an embodiment of the present application provides a tracking target determination method, which is applied to an electronic device, and the method includes:

Acquiring a first image and a second image in the same target video file, and acquiring N first tracking frames of the first image, wherein the first image is the previous preset frame image of the second image, The first image and the second image both include N tracking targets, and the N first tracking frames are used to frame and select the N tracking targets in the first image, and the N is greater than 1. Integer

Input the second image into the hourglass network model for feature extraction, and output a target feature map;

Input the target feature map to the prediction network to output a heat map, a set of width and height values, and a set of feature vectors;

Determine N second tracking frames based on the heat map and the width and height value set;

Based on the N first tracking frames, the N second tracking frames, and the feature vector set, N first correspondences are determined, and the N first correspondences are used to characterize the N first correspondences. A one-to-one correspondence between the tracking frame and the N second tracking frames;

The tracking target selected by the N second tracking frames is determined based on the N first correspondences.

In a second aspect, an embodiment of the present application provides a tracking target determination device, which is applied to an electronic device, and the device includes:

The information acquiring unit is configured to acquire a first image and a second image in the same target video file, and acquire N first tracking frames of the first image, wherein the first image is the image of the second image The first preset frame image, the first image and the second image both include N tracking targets, and the N first tracking frames are used to frame and select the N tracking targets in the first image, The N is an integer greater than 1;

The feature extraction unit is configured to input the second image into the hourglass network model for feature extraction, and output a target feature map;

The data determining unit is used to input the target feature map to the prediction network to output the heat map, the width and height value set, and the feature vector set;

A tracking frame determining unit, configured to determine N second tracking frames based on the heat map and the width and height value set;

The correspondence relationship determining unit is configured to determine N first correspondence relationships based on the N first tracking frames, the N second tracking frames, and the feature vector set, where the N first correspondence relationships are used for Characterize the one-to-one correspondence between the N first tracking frames and the N second tracking frames;

The tracking target determining unit is configured to determine the tracking target selected by the N second tracking frames based on the N first correspondences.

In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and are configured to be processed by the above The above program includes instructions for executing the steps in the method described in the first aspect of the embodiments of the present application.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the above-mentioned computer-readable storage medium stores a computer program for electronic data exchange, wherein the above-mentioned computer program enables a computer to execute Part or all of the steps described in the method described in one aspect.

In a fifth aspect, the embodiments of the present application provide a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute as implemented in this application. Examples include part or all of the steps described in the method described in the first aspect. The computer program product may be a software installation package.

It can be seen that in this embodiment of the application, the first image and the second image are first obtained from the same target video file, the first image is the previous preset frame image of the second image, and then the second image is input into the hourglass network model Obtain the target feature map, and then input the target feature map into the prediction network to obtain the heat map, the width and height value set and the feature vector set, and then determine the second tracking frame according to the heat map and the width and height value set, and the second tracking frame is used for Frame the N tracking targets in the second image, and finally determine the tracking target according to the first tracking frame, the second tracking frame, and the feature vector set. The first tracking frame is used to frame all the tracking targets in the first image. The N tracking targets are described. It can be seen that this application jointly determines the tracking target based on a certain image, the previous preset frame image of the certain image, and the tracking frame associated with the previous preset frame image, and realizes the tracking that changes with the position of the tracking target, and then Improve the accuracy of determining the tracking target.

These and other aspects of the present application will be more concise and understandable in the description of the following embodiments.

Description of the drawings

In order to more clearly describe the technical solutions in the embodiments of the present application or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only These are some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.

FIG. 1A is a schematic flowchart of a method for determining a tracking target provided by an embodiment of the present application;

FIG. 1B is a schematic structural diagram of an hourglass network model provided by an embodiment of the present application;

Fig. 1C is a schematic diagram of a thermal map provided by an embodiment of the present application;

2A is a schematic flowchart of another tracking target determination method provided by an embodiment of the present application;

2B is a schematic diagram of another tracking target determination method provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an apparatus for determining a tracking target according to an embodiment of the present application.

detailed description

In order to enable those skilled in the art to better understand the solutions of the application, the technical solutions in the embodiments of the application will be clearly and completely described below in conjunction with the drawings in the embodiments of the application. Obviously, the described embodiments are only It is a part of the embodiments of this application, not all the embodiments.

The terms "first", "second", "third" and "fourth" in the specification and claims of the application and the drawings are used to distinguish different objects, rather than describing a specific order .

Electronic devices can include various handheld devices with wireless communication functions, vehicle-mounted devices, wearable devices, computing devices or other processing devices connected to wireless modems, as well as various forms of user equipment (UE), mobile stations ( Mobile Station, MS), terminal device (terminal device) and so on.

As shown in FIG. 1A, FIG. 1A shows a tracking target determination method provided by an embodiment of the present application, which is applied to the above-mentioned electronic device and specifically includes the following steps:

Step 101: The electronic device obtains a first image and a second image in the same target video file, and obtains N first tracking frames of the first image, where the first image is the front of the second image A preset frame image, the first image and the second image each include N tracking targets, the N first tracking frames are used to frame the N tracking targets in the first image, so Said N is an integer greater than 1.

Wherein, said obtaining the N first tracking frames of the first image includes: obtaining the second widths of the N first tracking frames, the second heights of the N first tracking frames, and the N Feature vectors of the second positions of the N first tracking frames and the second center points of the N first tracking frames.

Wherein, the size of the first image and the second image, that is, the width and the height are the same. The first image and the second image are both images including N tracking targets, that is, both the first image and the second image display N tracking targets.

For example, if 4 tracking targets are displayed in the first image, and the 4 tracking targets are 1, 2, 3, and 4, the tracking targets 1, 2, 3, and 4 are also displayed in the second image.

Wherein, the previous preset frame image is, for example, the previous frame image, the previous two frame images, the previous 4 frame images, the previous 5 frame images, and so on.

Wherein, the target video file is a video file followed by the tracking target. The target video file is stored in an electronic device, or the target video file is stored in the cloud, etc.

Step 102: The electronic device inputs the second image into the hourglass network model for feature extraction, and outputs a target feature map.

The target feature map includes M feature points of N tracking targets, and the M is a positive integer. The number of feature points of each tracking target can be the same or different. The feature points of each tracking target can be 8, 10, 13, 18 and other values. The feature points are used to mark the tracking target. different positions. For example, assuming that the tracking target is a person, the feature points may be the joint points of the person.

Step 103: The electronic device inputs the target feature map to the prediction network to output the heat map, the width and height value set, and the feature vector set.

Step 104: The electronic device determines N second tracking frames based on the heat map and the width and height value set.

Step 105: The electronic device determines N first correspondences based on the N first tracking frames, the N second tracking frames, and the feature vector set, and the N first correspondences are used to characterize all The one-to-one correspondence between the N first tracking frames and the N second tracking frames.

Wherein, the shape of the first tracking frame and the second tracking frame are the same, and the shapes of the first tracking frame and the second tracking frame may be rectangles, squares, diamonds, circles and other shapes.

Wherein, the width of the first image is greater than the width of the N first tracking frames, the height of the first image is greater than the height of the N first tracking frames; the width of the second image is greater than the width of the The width of the N second tracking frames, and the height of the second image is greater than the height of the N second tracking frames.

Wherein, two adjacent first tracking frames in the N first tracking frames may have overlapping parts, and two adjacent second tracking frames in the N second tracking frames may have overlapping parts.

Wherein, the one-to-one correspondence means that there is a second tracking frame in the N second tracking frames that is the same as the tracking target selected by a first tracking frame in the N first tracking frames.

For example, suppose there are three second tracking frames, such as second tracking frame 1, second tracking frame 2, and second tracking frame 3, and there are three tracking targets, such as A, B, and C. If the second tracking box 1 selects A, the second tracking box 2 selects B, and the second tracking box 3 selects C, there are 3 first tracking boxes, such as the first tracking box 1, the first tracking box 2, and the first tracking box. One tracking frame 3, if the first tracking frame 1 corresponds to the second tracking frame 1, the first tracking frame 2 corresponds to the second tracking frame 2, and the first tracking frame 3 corresponds to the second tracking frame 3 one to one. , Then the first tracking box 1 selects A, the first tracking box 2 selects B, and the first tracking box 3 selects C.

Wherein, the heights of the first tracking frame and the second tracking frame that have a corresponding relationship may be the same or different, which is not limited here. The width of the first tracking frame and the second tracking frame that have a corresponding relationship may be the same or different, which is not limited here.

Step 106: The electronic device determines the tracking target selected by the N second tracking frames based on the N first correspondences.

In an implementation manner of the present application, after step 206, the method further includes: the electronic device displays the N second tracking frames on the second image.

In an implementation of the present application, the hourglass network model is composed of i hourglass networks arranged in sequence, and the input image of the i-th hourglass network is an image obtained by synthesizing the input image and the output image of the i-1th hourglass network, The i is an integer greater than or equal to 2;

Each time the hourglass network passes through, the first processing is performed. In the first processing, the input image is down-sampled through multiple first convolution blocks of the hourglass network, and the first feature map is output; The feature map is up-sampled through a plurality of second convolution blocks of the hourglass network to output a second feature map; the second feature map is superimposed with the input image to output a third feature map.

Wherein, the first convolutional block is a first convolutional neural network, the second convolutional block is a second convolutional neural network, and the difference between the first convolutional neural network and the second convolutional neural network The role is different.

Among them, the hourglass network model can be composed of 2, 4, 5, 7 or other numbers of hourglass networks arranged in sequence. The structure diagram of the hourglass network model is shown in Figure 1B. In the case that the hourglass network model is composed of two hourglass networks, on the one hand, the accuracy of the calculation can be ensured, and on the other hand, the calculation speed can be improved.

Wherein, the input image of the first hourglass network in the hourglass network model is the target image, and the feature map output by the last hourglass network in the hourglass network model is the target feature map.

Among them, as shown in Figure 1B, each hourglass network is a symmetric network, and each hourglass network can perform down-sampling and up-sampling. The number of sampling is the same, such as 4 times, 6 times, 7 times and other values. The technique used for downsampling is nearest neighbor interpolation, which is used to reduce image resolution. The technique used for upsampling is maximum pooling or average pooling, which is used to improve the image resolution.

In the embodiment of this application, the hourglass network a is not the first hourglass network arranged in the hourglass network model. The input image of the hourglass network a that is downsampled for the first time is image 1 (image 1 is the input image of the hourglass network b and The output image of the hourglass network b is synthesized. In the hourglass network model, the hourglass network a is adjacent to the hourglass network b and is behind the hourglass network b). The input image of the hourglass network a that is downsampled next time is the previous downsample The output image of the hourglass network a will be downsampled the next time the resolution of the output image is reduced by twice the resolution of the input image that will be downsampled next time, and the input image of the hourglass network a that will be upsampled for the first time It is the output image of the hourglass network a that was down-sampled for the last time. The input image of the hourglass network a that is up-sampled next time is the superposition and merging of the output image of the previous up-sampling and the output image of the symmetric down-sampling. The hourglass network a is the next time The resolution of the output image to be upsampled is doubled on the basis of the resolution of the input image to be upsampled next time.

The input image of the first hourglass network in the hourglass network model for downsampling for the first time is the target image, and the specific implementation method of the first hourglass network in the hourglass network model for upsampling and downsampling is the same as that of the hourglass network a. For details, please refer to the above content, which will not be described here.

For example, suppose that the number of upsampling and downsampling of the hourglass network a is 4 times, image 1 is 6*128*128, where 6 is the number of channels, 128*128 is the resolution of image 1, and adjacent interpolation is used After performing the first down-sampling, it outputs image 2 with a resolution of 6*64*64. After performing the second down-sampling on image 2, it outputs image 3 with a resolution of 6*32*32. Perform the third on image 3. After the second downsampling, the output resolution is 6*16*16 image 4, after the fourth downsampling is performed on image 4, the output resolution is 6*8*8 image 5, when the 4 downsampling is completed, the image 5Using average pooling for up-sampling, after performing the first up-sampling, the output resolution is 6*16*16 image 6. Combine image 6 and image 4 output from the third down-sampling as the second up-sampling Input, perform the second up-sampling output resolution of 6*32*32 image 7, merge image 7 and image 3 as the input of the third up-sampling, perform the third up-sampling output resolution of 6*64* Image 8 of 64, and finally image 8 and image 2 are combined as the input of the fourth up-sampling, and the fourth up-sampling is performed to output image 9 with a resolution of 6*128*128.

It can be seen that in the embodiment of the present application, multiple down-sampling and multiple up-sampling are performed through each hourglass network, so that the features of different regions in the target image can be extracted, and the feature points in the target image can be preserved. The spatial relationship can improve the probability of determining the tracking target image.

In an implementation manner of the present application, the prediction network includes a heat map branch, a wide-height branch, and a feature vector branch; the electronic device inputs the target feature map to the prediction network to output a heat map and a wide-height value set. And feature vector set, including:

The electronic device inputs the target feature map to the heat map branch to output a heat map, and inputs the target feature map to the width and height branch to output a width and height value set;

The electronic device inputs the heat map into the feature vector branch to output a feature vector set.

Wherein, the inputting the target feature map to the width and height branch to output the width and height value set includes: adding the target feature map, the second width of the N first tracking frames, and the N The second height of the first tracking frame is input to the width and height branch to output the width and height value set.

Wherein, the inputting the heat map into the feature vector branch to output a feature vector set includes: inputting the heat map and the feature vectors of the second center points of the N first tracking frames into the feature Vector branches to output feature vector sets.

Wherein, the electronic device inputs the target feature map into the heat map branch, and inputs the target feature map into the wide-height branch is executed in parallel.

Among them, the heat map branch is obtained by the electronic device using the first formula to train the third convolution block.

The first formula is:

Wherein, the H is the height of the target feature map; the W is the width of the target feature map; the P _ij is the probability that the feature point located at (i, j) is the target feature point; the y _ij is the target feature point; The label value of the feature point at the position (i, j) in the first image, when calculating the probability that the feature point at the position (i, j) is the target feature point, the label value is used to indicate its corresponding feature point The possibility of calculation error occurs. The larger the mark value, the greater the possibility of calculation error. The smaller the mark value, the lower the possibility of calculation error. The mark value means that the electronic device is in the third volume. The α and β are fixed values that are set when the product block is trained. Under different circumstances, the values of α and β may be different.

The heat map is shown in FIG. 1C, the point in FIG. 1C represents the center point, the ordinate on the left in FIG. 1C represents the probability, and the abscissa and the ordinate on the right in FIG. 1C jointly represent the location of the center point.

Among them, the width and height branches are obtained by the electronic device using the second formula to train the fourth convolution block.

The second formula is: L ₂ =|f(x)-Y| ²

The f(x) and Y are both width or height, and L ₂ is the square of the width difference or the square of the height difference.

Among them, the width and height value set includes the corresponding relationship between the width and the square of the width difference and the corresponding relationship between the height and the square of the height difference, as shown in Table 1.

Table 1

高度(mm)Height (mm)	高度差的平方(mm ²) Square of height difference (mm ² )	宽度(mm)Width(mm)	宽度差的平方(mm ²) Square of width difference (mm ² )
h1h1	H1H1	k1k1	K1K1
h2h2	H2H2	k2k2	k2k2

...

Wherein, the third convolution block is a third convolutional neural network, and the fourth convolution block is a fourth convolutional neural network. The functions of the third convolutional neural network and the fourth convolutional neural network are different from each other.

Wherein, the feature vector branch includes a first branch, a second branch, and a third branch. The first branch is obtained by the electronic device using the third formula to train the fifth convolution block, and the second branch is the electronic device using the fourth formula It is obtained by training the sixth convolution block, and the third branch is obtained by training the seventh convolution block by the electronic device using the fifth formula.

Wherein, the fifth convolutional block is a fifth convolutional neural network, the sixth convolutional block is a sixth convolutional neural network, and the seventh convolutional block is a seventh convolutional neural network. The functions of the fifth convolutional neural network, the sixth convolutional neural network, and the seventh convolutional neural network are different from each other.

The third formula:

Among them, the

Is the feature vector of the second center point of any first tracking frame, the

Is the feature vector of the first center point of the second tracking frame corresponding to the any one of the first tracking frames, and the e _k is the feature vector of the second center point of the any one of the first tracking frames and its corresponding second The mean value of the feature vector of the first center point of the tracking frame.

The fourth formula:

Wherein, the e _k is the feature vector of the second center point of one of the N first tracking frames, and the first center of the second tracking frame corresponding to the one of the first tracking frames The mean value of the feature vector of the point; the e _j is the second tracking frame corresponding to the second center point of the other first tracking frame in the N first tracking frames The mean value of the eigenvectors of the first center point. The Δ=1.

The fifth formula:

d ₁₂ =||x ₁ -x ₂ ||

The x ₁ is the feature vector of the first center point, and the x ₂ is the feature vector of the second center point.

Among them, the feature vector set includes the feature vectors of the first center points of the N second tracking frames, as shown in Table 2.

Table 2

第一中心点First center point	特征向量Feature vector
(a1,b1)(a1,b1)	c1c1
(a2,b2)(a2,b2)	3c23c2
(a3,b3)(a3,b3)	1.5c31.5c3
……...	……...

Among them, the eigenvector corresponding to the second center point (a1, b1) is c1, the eigenvector corresponding to the second center point (a2, b2) is 3c2, and the eigenvector corresponding to the second center point (a3, b3) is 1.5c3 , C1, c2 and c3 are all basic solution systems, which can be the same or different.

It can be seen that in the embodiment of the present application, since the input of the target feature map into the two branches is executed in parallel, the time required for the convolution operation is reduced, and the calculation efficiency is improved.

In an implementation manner of the present application, the N second tracking frames are determined based on the heat map and the width and height value set.

The electronic device determines the first position of the first center point of the N second tracking frames based on the heat map;

The electronic device determines the first height of the N second tracking frames and the first width of the N second tracking frames based on the width and height value set.

Wherein, the first heights of any two first tracking frames of the N first tracking frames may be equal or unequal, and the first widths of any two first tracking frames of the N first tracking frames It may be equal or unequal, and the positions of the first center points of any two first tracking frames of the N first tracking frames are different.

Specifically, through the heat map, the probability that each of the M feature points is the first center point can be obtained, and then the first N feature points with the highest probability among the M feature points are taken as the first center point. A center point, and then the first positions of N first center points can be obtained. For example, as shown in Figure 1C, feature point 1, feature point 2, and feature point 3 are the three corresponding feature points with higher probability among all the feature points shown in Figure 1C. In the case of the heat map shown in Figure 1C, The first center points are feature point 1, feature point 2, and feature point 3.

Specifically, the first height is known, and the square of the height difference corresponding to the first height can be obtained from Table 1, and then the second height can be calculated based on the second formula. For example, suppose the first height is C, and the square of the height difference corresponding to the first height is c, then

The first width is known, and the square of the width difference corresponding to the first width can be obtained from Table 1, and then the second width can be calculated based on the second formula. For example, suppose the first width is D, and the square of the width difference corresponding to the first width is d, then

In an implementation manner of the present application, the electronic device determines N first correspondences based on the N first tracking frames, the N second tracking frames, and the feature vector set, including:

Determining the feature vectors of the N first center points according to the first positions passing through the feature vector set and the N first center points;

Determine N offset sets according to the feature vectors of the N first center points and the feature vectors of the second center points of the N first tracking frames, and determine the N offset sets according to the N first tracking frames and the The N second tracking frames determine N matching degree sets, the N offset sets are in one-to-one correspondence with the feature vectors of the N first center points, and each offset set includes N Offsets, the N offsets are the corresponding feature vectors of the first center point, and the position offsets in the feature vector set relative to the feature vector of any of the second center points, The N matching degree sets have a one-to-one correspondence with the N first tracking frames, each of the matching degree sets includes N matching degrees, and the N matching degrees are the corresponding first tracking frames. The degree of matching with any of the second tracking frames;

According to the N offset sets and the N matching degree sets, N second correspondences are determined, and the N second correspondences are used to characterize the N second center points and the N th One-to-one correspondence of a central point;

Determine N first correspondences according to the N second correspondences.

Among them, the sixth formula is used to calculate N offset sets. The sixth formula:

Wherein d _a represents the feature vector of the first and second center point of a tracking frame, D _b represents the center point of the first feature vector b of a second tracking frame, the

Represents the covariance matrix of the first tracking frame a, the d ⁽¹⁾ (a, b) represents the feature vector of the first center point of the first tracking frame a relative to the second center point of the second tracking frame b The vector is offset in the position of the feature vector set.

Among them, the seventh formula is used to calculate N matching degree sets.

The seventh formula:

Wherein, the L _k is=100, and the

Is the feature vector of the second center point of the first tracking frame a in the k-th frame, the

Represents a transpose vector of the second feature of the second center point b of the tracking frame, said R _a is set in the most recent feature vector 100 in a second center point of a first track frame, the d ⁽²⁾ (a, b) represents the degree of match between the first tracking frame a and the second tracking frame b in appearance.

Finally, the eighth formula is used to perform a weighted calculation on any offset set in the N offset sets and any one in the N matching degree sets.

The eighth formula:

C _a,b =λd ⁽¹⁾ (a,b)+(1-λ)d ⁽²⁾ (a,b)

Wherein, the λ is a fixed value, and under different circumstances, the value of the λ may be different; the C _{a and b} are weighted sums.

If it is determined by the N offset sets that the distance between the second center point of the first tracking frame o and the first center point of the second tracking frame p is the shortest, and the first tracking frame o and the second tracking frame p If the weighted sum is greater than the first value, the first tracking frame o and the second tracking frame p have a corresponding relationship. The first tracking frame o is one of the N first tracking frames, and the second tracking frame p is the Nth tracking frame. One of the two tracking boxes.

Optionally, through the feature vector of the first center point and the feature vector of the second center point, the one-to-one correspondence between the N first center points and the N second center points can be determined. For example, A1 and A2 are the first center points, B1 and B2 are the second center points. Since the relationship between A1 and B1 and B2 cannot be judged, and the relationship between A2 and B1 and B2, there may be two situations: A1 and B1 Corresponding, A2 corresponds to B2; A1 corresponds to B2, and A2 corresponds to B1. Assuming that A1 corresponds to B1 and A2 corresponds to B2, first use the third formula to narrow the distance between A1 and B1, then use the fourth formula to widen the distance between A1 and B2, and finally the fifth formula to calculate the distance The distance between A1 and B1 is A1B1. Assuming that A1 corresponds to B2, and A2 corresponds to B1, first use the third formula to narrow the distance between A1 and B2, then use the fourth formula to widen the distance between A1 and B1, and finally the fifth formula to calculate the distance The distance between A1 and B2 is A1B2. Finally, compare the distance between A1B1 and A1B2, if A1B1>A1B2, then A1 and B1 correspond; if A1B1<A1B2, then A1 and B2 correspond.

It can be seen that, in the embodiment of the present application, in the process of determining whether there is a corresponding relationship between a certain first tracking frame and a certain second tracking frame, when the first center point of the certain first tracking frame is The distance between the second center points of a certain second tracking frame is the closest, and the feature vector of the second center point of the certain first tracking frame is relative to the feature vector of the first center point of the certain second tracking frame. The position offset in the feature vector branch is the smallest, and the weighted sum of the matching degree of the certain first tracking frame and the certain second tracking frame is greater than the first value, only then can the certain first tracking frame and the certain one be determined The second tracking frame has a corresponding relationship, which improves the accuracy of determining the corresponding relationship of the tracking frame, thereby improving the accuracy of determining the tracking target.

It should be noted that FIG. 1B and FIG. 1C provided in the embodiments of the present application are only used as examples, and do not constitute a limitation to the embodiments of the present application.

Consistent with the embodiment shown in FIG. 1A, please refer to FIG. 2A. FIG. 2A is a schematic flowchart of another tracking target determination method provided by an embodiment of the present application, which is applied to the above-mentioned electronic device and specifically includes the following steps:

Step 201: The electronic device obtains a first image and a second image in the same target video file, and obtains N first tracking frames of the first image, where the first image is the front of the second image A preset frame image, the first image and the second image each include N tracking targets, the N first tracking frames are used to frame the N tracking targets in the first image, so Said N is an integer greater than 1.

Step 202: The electronic device inputs the second image into the hourglass network model for feature extraction, and outputs a target feature map.

Step 203: The electronic device inputs the target feature map to the heat map branch to output a heat map, and inputs the target feature map to the width and height branch to output a width and height value set.

Step 204: The electronic device inputs the heat map into the feature vector branch to output a feature vector set.

Step 205: The electronic device determines the first positions of the first center points of the N second tracking frames based on the heat map.

Step 206: The electronic device determines the first height of the N second tracking frames and the first width of the N second tracking frames based on the width and height value set.

Step 207: The electronic device determines the feature vectors of the N first center points according to the first positions passing through the feature vector set and the N first center points.

Step 208: The electronic device determines N offset sets according to the feature vectors of the N first center points and the feature vectors of the second center points of the N first tracking frames, and according to the Nth A tracking frame and the N second tracking frames determine N matching degree sets, the N offset sets correspond to the feature vectors of the N first center points one-to-one, and each offset The quantity sets each include N offsets, and the N offsets are the feature vector of the first center point corresponding to the feature vector of any one of the second center points in the feature vector set. , The N matching degree sets correspond to the N first tracking frames in a one-to-one correspondence, each of the matching degree sets includes N matching degrees, and the N matching degrees are all corresponding The degree of matching between the first tracking frame and any of the second tracking frames.

Step 209: The electronic device determines N second correspondences according to the N offset sets and the N matching degree sets, and the N second correspondences are used to characterize the relationship between the N second center points and the N second center points. The one-to-one correspondence of the N first center points. Step 210: The electronic device determines N first correspondences according to the N second correspondences.

Step 211: The electronic device determines the tracking target selected by the N second tracking frames based on the N first correspondences.

For example, as shown in FIG. 2B, the first image including the tracking target S and the tracking target D is input into the hourglass network model, and the target feature map is input through the hourglass network model, and then the target feature maps are input into the heat of the prediction module. The graph branch and the wide-height branch. After passing through these two branches, the heat map and the wide-height value set are respectively output, and then the heat map is input into the feature vector branch of the prediction module. After passing through the branch, the feature vector set is output, and then combined N first tracking frames, heat maps, width and height value sets determine the one-to-one correspondence between N second tracking frames and N second tracking frames and N first tracking frames, and finally based on N second tracking frames and According to the one-to-one correspondence of the N first tracking frames, it is possible to know which tracking target is selected by the N second tracking paragraphs, so as to achieve the purpose of determining the tracking target.

It should be noted that the specific implementation process of this embodiment can refer to the specific implementation process described in the foregoing method embodiment, which will not be described here.

Consistent with the embodiment shown in FIG. 1A and FIG. 2A, please refer to FIG. 3. FIG. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. As shown in the figure, the electronic device includes a processor and a memory. , A communication interface, and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by the processor, and the programs include instructions for executing the following steps:

Determining N second tracking frames based on the heat map and the width and height value set, where the N second tracking frames are used to frame select N tracking targets in the second image;

In one implementation of the present application, the hourglass network model is constructed by sequentially arranging i hourglass networks, and the input image of the i-th hourglass network is an image obtained by synthesizing the input image and the output image of the i-1th hourglass network , The i is an integer greater than or equal to 2;

In an implementation of the present application, the prediction network includes a heat map branch, a wide and high branch, and a feature vector branch; the target feature map is input to the prediction network to output the heat map, the wide and high value set, and the feature vector. In terms of set, the above program includes instructions for executing the following steps:

Inputting the target feature map to the heat map branch to output a heat map, and inputting the target feature map to the width and height branch to output a width and height value set;

The heat map is input into the feature vector branch to output a feature vector set.

In an implementation manner of the present application, in terms of determining N second tracking frames based on the heat map and the width and height value set, the foregoing program includes instructions for executing the following steps:

Determining the first position of the first center point of the N second tracking frames based on the heat map;

The first height of the N second tracking frames and the first width of the N second tracking frames are determined based on the width and height value set.

In an implementation manner of the present application, in terms of determining N first correspondences based on the N first tracking frames, the N second tracking frames, and the feature vector set, the above-mentioned program includes methods for executing The following steps instructions:

Determine N first correspondences according to the N second correspondences.

Please refer to FIG. 4. FIG. 4 is a tracking target determination device provided by an embodiment of the present application, which is applied to the above electronic equipment, and the device includes:

The information obtaining unit 401 is configured to obtain a first image and a second image in the same target video file, and obtain N first tracking frames of the first image, where the first image is the second image The first image and the second image each include N tracking targets, and the N first tracking frames are used to frame and select the N tracking targets in the first image , The N is an integer greater than 1;

The feature extraction unit 402 is configured to input the second image into the hourglass network model for feature extraction, and output a target feature map;

The data determining unit 403 is configured to input the target feature map to the prediction network to output the heat map, the width and height value set, and the feature vector set;

A tracking frame determination unit 404, configured to determine N second tracking frames based on the heat map and the width and height value set;

The correspondence relationship determining unit 405 is configured to determine N first correspondence relationships based on the N first tracking frames, the N second tracking frames, and the feature vector set, and the N first correspondence relationships are used To characterize the one-to-one correspondence between the N first tracking frames and the N second tracking frames;

The tracking target determining unit 406 is configured to determine the tracking target selected by the N second tracking frames based on the N first correspondences.

In an implementation manner of this application,

The hourglass network model is composed of i hourglass networks arranged in sequence, the input image of the i-th hourglass network is an image obtained by synthesizing the input image and the output image of the i-1th hourglass network, and the i is greater than or equal to 2 Integer

In an implementation of the present application, the prediction network includes a heat map branch, a wide-height branch, and a feature vector branch; the target feature map is input to the prediction network to output the heat map, the wide-height value set, and the feature vector In terms of collection, the data determining unit 403 is specifically configured to:

In an implementation manner of the present application, the N second tracking frames are determined based on the heat map and the width and height value set, and the tracking frame determining unit 404 is further configured to:

In an implementation manner of the present application, the N first correspondences are determined based on the N first tracking frames, the N second tracking frames, and the feature vector set, and the correspondence determining unit 405 is also used for:

Determine N first correspondences according to the N second correspondences.

It should be noted that the information acquisition unit 401, the feature extraction unit 402, the data determination unit 403, the tracking frame determination unit 404, the correspondence determination unit 405, and the tracking target determination unit 406 may be implemented by a processor.

The embodiment of the present application also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, wherein the computer program causes the computer to execute the electronic Part or all of the steps described by the device.

The embodiments of the present application also provide a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute the electronic Part or all of the steps described by the device. The computer program product may be a software installation package.

The steps of the method or algorithm described in the embodiments of the present application may be implemented in a hardware manner, or may be implemented in a manner in which a processor executes software instructions. Software instructions can be composed of corresponding software modules, which can be stored in random access memory (Random Access Memory, RAM), flash memory, read-only memory (Read Only Memory, ROM), and erasable programmable read-only memory ( Erasable Programmable ROM (EPROM), Electrically Erasable Programmable Read-Only Memory (Electrically EPROM, EEPROM), registers, hard disk, mobile hard disk, CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from the storage medium and write information to the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may be located in the ASIC. In addition, the ASIC may be located in an access network device, a target network device, or a core network device. Of course, the processor and the storage medium may also exist as discrete components in the access network device, the target network device, or the core network device.

Those skilled in the art should be aware that, in one or more of the foregoing examples, the functions described in the embodiments of the present application may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, it can be implemented in the form of a computer program product in whole or in part. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center. Transmission to another website, computer, server or data center via wired (such as coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (such as infrared, wireless, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a Digital Video Disc (DVD)), or a semiconductor medium (for example, a Solid State Disk (SSD)) )Wait.

The specific implementations described above further describe the purpose, technical solutions, and beneficial effects of the embodiments of the application in detail. It should be understood that the foregoing descriptions are only specific implementations of the embodiments of the application, and are not used for To limit the protection scope of the embodiments of the application, any modification, equivalent replacement, improvement, etc. made on the basis of the technical solutions of the embodiments of the application shall be included in the protection scope of the embodiments of the application.

Claims

A tracking target determination method, characterized in that it is applied to an electronic device, and the method includes:

Acquiring a first image and a second image in the same target video file, and acquiring N first tracking frames of the first image, wherein the first image is the previous preset frame image of the second image, The first image and the second image both include N tracking targets, and the N first tracking frames are used to frame and select the N tracking targets in the first image, and the N is greater than 1. Integer

Input the second image into the hourglass network model for feature extraction, and output a target feature map;

Input the target feature map to the prediction network to output a heat map, a set of width and height values, and a set of feature vectors;

Determining N second tracking frames based on the heat map and the width and height value set, where the N second tracking frames are used to frame select N tracking targets in the second image;

Based on the N first tracking frames, the N second tracking frames, and the feature vector set, N first correspondences are determined, and the N first correspondences are used to characterize the N first correspondences. A one-to-one correspondence between the tracking frame and the N second tracking frames;

The tracking target selected by the N second tracking frames is determined based on the N first correspondences.
The method according to claim 1, wherein the hourglass network model is composed of i hourglass networks arranged in sequence, and the input image of the i-th hourglass network is a composite of the input image and output image of the i-1th hourglass network In the obtained image, the i is an integer greater than or equal to 2;

Each time the hourglass network passes through, the first processing is performed. In the first processing, the input image is down-sampled through multiple first convolution blocks of the hourglass network, and the first feature map is output; The feature map is up-sampled through a plurality of second convolution blocks of the hourglass network to output a second feature map; the second feature map is superimposed with the input image to output a third feature map.
The method according to claim 1 or 2, wherein the prediction network includes a heat map branch, a wide and high branch, and a feature vector branch; the target feature map is input to the prediction network to output the heat map, The width and height value set and feature vector set, including:

Inputting the target feature map to the heat map branch to output a heat map, and inputting the target feature map to the width and height branch to output a width and height value set;

The heat map is input into the feature vector branch to output a feature vector set.
The method according to claim 3, wherein the determining N second tracking frames based on the heat map and the width and height value set comprises:

Determining the first position of the first center point of the N second tracking frames based on the heat map;

The first height of the N second tracking frames and the first width of the N second tracking frames are determined based on the width and height value set.
The method according to claim 4, wherein the determining N first correspondences based on the N first tracking frames, the N second tracking frames, and the feature vector set comprises:

Determining the feature vectors of the N first center points according to the first positions passing through the feature vector set and the N first center points;

Determine N offset sets according to the feature vectors of the N first center points and the feature vectors of the second center points of the N first tracking frames, and determine the N offset sets according to the N first tracking frames and the The N second tracking frames determine N matching degree sets, the N offset sets are in one-to-one correspondence with the feature vectors of the N first center points, and each offset set includes N Offsets, the N offsets are the corresponding feature vectors of the first center point, and the position offsets in the feature vector set relative to the feature vector of any of the second center points, The N matching degree sets have a one-to-one correspondence with the N first tracking frames, each of the matching degree sets includes N matching degrees, and the N matching degrees are the corresponding first tracking frames. The degree of matching with any of the second tracking frames;

According to the N offset sets and the N matching degree sets, N second correspondences are determined, and the N second correspondences are used to characterize the N second center points and the N th One-to-one correspondence of a central point;

Determine N first correspondences according to the N second correspondences.
A tracking target determination device, characterized in that it is applied to electronic equipment, and the device includes:

The information acquiring unit is configured to acquire a first image and a second image in the same target video file, and acquire N first tracking frames of the first image, wherein the first image is the image of the second image The first preset frame image, the first image and the second image both include N tracking targets, and the N first tracking frames are used to frame and select the N tracking targets in the first image, The N is an integer greater than 1;

The feature extraction unit is configured to input the second image into the hourglass network model for feature extraction, and output a target feature map;

The data determining unit is used to input the target feature map to the prediction network to output the heat map, the width and height value set, and the feature vector set;

A tracking frame determining unit, configured to determine N second tracking frames based on the heat map and the width and height value set;

The correspondence relationship determining unit is configured to determine N first correspondence relationships based on the N first tracking frames, the N second tracking frames, and the feature vector set, where the N first correspondence relationships are used for Characterize the one-to-one correspondence between the N first tracking frames and the N second tracking frames;

The tracking target determining unit is configured to determine the tracking target selected by the N second tracking frames based on the N first correspondences.
The device according to claim 6, wherein the hourglass network model is composed of i hourglass networks arranged in sequence, and the input image of the i-th hourglass network is a composite of the input image and output image of the i-1th hourglass network In the obtained image, the i is an integer greater than or equal to 2;

Each time the hourglass network passes through, the first processing is performed. In the first processing, the input image is down-sampled through multiple first convolution blocks of the hourglass network, and the first feature map is output; The feature map is up-sampled through a plurality of second convolution blocks of the hourglass network to output a second feature map; the second feature map is superimposed with the input image to output a third feature map.
The device according to claim 6 or 7, wherein the prediction network includes a heat map branch, a wide-height branch, and a feature vector branch; the target feature map is input to the prediction network to output the heat map, wide-high branch, and feature vector branch; In terms of high value sets and feature vector sets, the data determining unit is specifically used for:

Inputting the target feature map to the heat map branch to output a heat map, and inputting the target feature map to the width and height branch to output a width and height value set;

The heat map is input to the feature vector branch to output a feature vector set.
An electronic device, characterized in that the electronic device includes a processor, a memory, a communication interface, and one or more programs, and the one or more programs are stored in the memory and configured by the Executed by the processor, the program includes instructions for executing the steps in the method according to any one of claims 1-5.
A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, wherein the computer program is processed to execute the method according to any one of claims 1-5.