CN116958202A - Spatial target motion tracking method and system combining local information and global information - Google Patents

Spatial target motion tracking method and system combining local information and global information

Info

Publication number
CN116958202A
Authority
CN
China
Prior art keywords
characteristic
region
video frame
interval
binary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310945393.2A
Other languages
Chinese (zh)
Inventor
张泽旭
黄烨飞
袁帅
苏宇
袁萌萌
王艺诗
徐田来
郭鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202310945393.2A priority Critical patent/CN116958202A/en
Publication of CN116958202A publication Critical patent/CN116958202A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N3/048 - Activation functions
    • G06N3/08 - Learning methods
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10016 - Video; Image sequence
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G06T2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a space target motion tracking method and system combining local information and global information, relates to the technical field of target tracking, and aims to solve the problem in the prior art that tracking fails due to the repetitive textures of space targets. The technical key points of the invention include: encoding the image feature positions of the space target into binary description vectors; calculating the Euclidean distances among the binary description vectors in all video frames, and determining, for each feature position in the previous video frame, the several feature positions with the minimum Euclidean distance in the next video frame as preliminary matching relations; verifying the preliminary matching relations by calculating Euclidean distances with a neural network, and retaining only the optimal matching relation with the minimum distance; and solving the motion of the space target according to the optimal matching relations of the feature positions, thereby realizing tracking of the space target. The invention improves the efficiency of computing the Euclidean distance between feature positions across images, reduces the computation of the matching process, and has low storage consumption.

Description

Spatial target motion tracking method and system combining local information and global information
Technical Field
The invention relates to the technical field of target tracking, in particular to a spatial target motion tracking method and system combining local information and global information.
Background
Target motion tracking based on optical information can recover the motion of a tracked target, provides information support for tasks such as navigation and approach or target perception and recognition, and is one of the key problems in the field of computer vision. In target motion tracking, it is generally necessary to extract feature positions on the target and to solve for the motion, according to the projection process, from the change of these feature positions between images.
However, the conventional feature position description process uses only the local area information around feature points; it is difficult for it to cope with the large number of repeated textures on a space target, and it carries a potential failure risk for space target motion tracking. By introducing global information through a neural network, different description vectors can be constructed for similar textures at different positions, so that the influence of repetitive textures in space target tracking is effectively suppressed; but compared with binary description vectors, this process has much higher demands on computation and storage space.
Disclosure of Invention
In view of the above problems, the present invention provides a spatial target motion tracking method and system combining local information and global information, so as to solve the problem that the existing method is difficult to cope with spatial target repetitive textures, and improve the computing efficiency while dealing with similar structures.
According to an aspect of the present invention, there is provided a spatial target motion tracking method combining local information and global information, the method comprising the steps of:
step one, acquiring an image video stream of space target motion, and sampling the image video stream into a continuous video frame sequence;
step two, for each video frame, acquiring the characteristic position of a space target in the video frame, and encoding the characteristic position into a binary description vector;
step three, calculating Euclidean distances among the binary description vectors in all video frames, and determining, for each feature position in the previous video frame, the several feature positions with the minimum Euclidean distance in the next video frame as a preliminary matching relationship;
step four, inputting the video frame sequence into a pre-trained neural network, and obtaining a multidimensional vector corresponding to each feature position;
step five, calculating Euclidean distances between the multidimensional vector corresponding to each characteristic position in the previous video frame and the multidimensional vectors corresponding to the characteristic positions in the preliminary matching relation in the next video frame, and selecting one characteristic position with the minimum Euclidean distance in the next video frame as the optimal matching relation of the characteristic positions in the previous video frame;
and step six, solving the motion of the space target according to the optimal matching relation of the characteristic positions so as to track the space target.
Further, the specific steps of step two include:
step 2.1, calculating the Harris feature points of the image for each video frame;
step 2.2, setting a non-maximum suppression area, screening the Harris feature points according to the non-maximum suppression area, and taking the screened Harris feature points as spatial target characteristic positions;
step 2.3, calculating the region main direction at each characteristic position of the spatial target, and dividing the characteristic position into a plurality of sub-regions according to the region main direction;
step 2.4, respectively encoding each sub-region of the characteristic position, and splicing the plurality of encoded description vectors to obtain a binary description vector.
Further, the specific process of screening Harris feature points according to the non-maximum suppression area in step 2.2 includes: sequentially taking maximum positions according to the Harris response amplitude, setting the response values of all pixel positions within the non-maximum suppression area around each maximum position to 0, and repeating this process to realize the screening of Harris feature points.
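As an illustration, this screening can be sketched as follows (a minimal sketch assuming OpenCV's cornerHarris for the response map; the point count, suppression radius, and Harris parameters are illustrative):

```python
import cv2
import numpy as np

def harris_with_nms(gray, num_points=200, suppress_radius=9):
    """Greedy non-maximum suppression over the Harris response map:
    repeatedly take the global maximum and zero out the suppression
    window around it, as described in the screening step above."""
    response = cv2.cornerHarris(np.float32(gray), blockSize=2, ksize=3, k=0.04)
    points = []
    for _ in range(num_points):
        y, x = np.unravel_index(np.argmax(response), response.shape)
        if response[y, x] <= 0:
            break  # no positive corner response left
        points.append((x, y))
        response[max(0, y - suppress_radius):y + suppress_radius + 1,
                 max(0, x - suppress_radius):x + suppress_radius + 1] = 0
    return points
```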
Further, the specific steps of step 2.3 comprise:
extracting a circular region $M_O$ of radius $r$ centered on the characteristic position $O$, and calculating the pixel-value center of gravity $C$ of $M_O$:
$$C = \left( \frac{\sum_{(x,y)\in M_O} x\,I(x,y)}{\sum_{(x,y)\in M_O} I(x,y)},\ \frac{\sum_{(x,y)\in M_O} y\,I(x,y)}{\sum_{(x,y)\in M_O} I(x,y)} \right)$$
wherein $I(x,y)$ is the pixel value corresponding to the pixel $(x,y)$ in the region;
connecting the characteristic position $O$ with the center of gravity $C$ to construct the region main direction $\overrightarrow{OC}$; for any position point $P$ in the circular region $M_O$, calculating the included angle $\angle COP$ between $\overrightarrow{OC}$ and $\overrightarrow{OP}$;
according to $\angle COP$, dividing the circular region $M_O$ into $m$ quadrants, and in each quadrant subdividing the pixels into $n$ intervals according to their distance to the characteristic position $O$, namely dividing the characteristic position into a plurality of sub-regions.
Further, the specific steps of step 2.4 comprise:
performing histogram equalization processing on all pixels in the circular region $M_O$, and calculating the average pixel value of the equalized circular region $M_O$;
evenly dividing the proportion interval $[0, 100\%]$ into $p_1$ intervals, each interval being provided with a corresponding binary code sequence B1, wherein the code sequences of adjacent intervals differ in only one bit, and the coding length $k_1$ of B1 is the minimum integer satisfying $2^{k_1} \ge p_1$;
calculating the proportion $\eta$ of pixels in each sub-region of the circular region $M_O$ that are higher than the average pixel value, and determining the value of the binary code sequence B1 corresponding to the sub-region according to the proportion interval to which $\eta$ belongs;
calculating the sub-center of gravity $C_i$ of each sub-region;
obtaining the included angle $\alpha_i$ between $\overrightarrow{OC_i}$ and its nearest quadrant axis in the clockwise direction; determining the upper limit of the included angle $\alpha_{\max} = 360°/m$ according to the number $m$ of quadrants divided in step 2.3; evenly dividing the angle interval $[0°, \alpha_{\max}]$ into $p_2$ intervals, and coding each included-angle interval as a binary code sequence B2 with the coding rule that the code sequences of adjacent intervals differ in only one bit and the codes of the first and last intervals also differ in only one bit; the coding length $k_2$ of B2 is the minimum integer satisfying $2^{k_2} \ge p_2$;
determining the value of the binary code sequence B2 corresponding to each sub-region according to the included-angle interval to which $\alpha_i$ of its sub-center of gravity $C_i$ belongs;
splicing B1 and B2 to obtain the code of each sub-region of the circular region $M_O$;
splicing the codes of all sub-regions in sequence to obtain a $(k_1 + k_2)\cdot mn$-bit binary code, namely the binary description vector.
Further, the structural design of the neural network in step four is as follows:
the neural network has 8 layers in total; the first 7 layers each take the form of a two-dimensional convolution followed by a ReLU activation function, and the last layer uses only a two-dimensional convolution; the output dimensions of the layers are 8, 16, 32, 32, 64, 64, 128 and 128 respectively; the third and sixth layers adopt convolution kernels of size 5 with stride 2 and edge padding 1, and the kernel size, stride and edge padding of the remaining layers are 3, 1 and 1.
According to another aspect of the present invention, there is provided a spatial target motion tracking system combining local information and global information, the system comprising:
an image sequence acquisition module configured to acquire an image video stream of spatial object motion, sampling it as a sequence of consecutive video frames;
a local information acquisition module configured to acquire, for each video frame, a feature position of a spatial target therein, and encode the feature position as a binary description vector; the Euclidean distance among a plurality of binary description vectors in all video frames is calculated, and for each characteristic position in the previous video frame, a plurality of characteristic positions with the minimum Euclidean distance in the next video frame are determined to be used as primary matching relations;
the global information acquisition module is configured to input the video frame sequence into a pre-trained neural network to acquire multidimensional vectors corresponding to each characteristic position; calculating Euclidean distances between the multidimensional vector corresponding to each characteristic position in the previous video frame and the multidimensional vectors corresponding to the characteristic positions in the preliminary matching relation in the next video frame, and selecting one characteristic position with the minimum Euclidean distance in the next video frame as the optimal matching relation of the characteristic positions in the previous video frame;
and the target tracking module is configured to solve the space target motion according to the optimal matching relation of the characteristic positions so as to track the space target.
Further, the specific steps of the local information acquisition module acquiring, for each video frame, the characteristic position of the spatial target therein and encoding the characteristic position into a binary description vector include:
step 2.1, calculating the Harris feature points of the image for each video frame;
step 2.2, setting a non-maximum suppression area, screening the Harris feature points according to the non-maximum suppression area, and taking the screened Harris feature points as spatial target characteristic positions;
step 2.3, calculating the region main direction at each characteristic position of the spatial target, and dividing the characteristic position into a plurality of sub-regions according to the region main direction;
step 2.4, respectively encoding each sub-region of the characteristic position, and splicing the plurality of encoded description vectors to obtain the binary description vector.
Further, the specific steps of the local information acquisition module calculating the region main direction at each characteristic position of the spatial target and dividing the characteristic position into a plurality of sub-regions according to the region main direction include:
extracting a circular region $M_O$ of radius $r$ centered on the characteristic position $O$, and calculating the pixel-value center of gravity $C$ of $M_O$:
$$C = \left( \frac{\sum_{(x,y)\in M_O} x\,I(x,y)}{\sum_{(x,y)\in M_O} I(x,y)},\ \frac{\sum_{(x,y)\in M_O} y\,I(x,y)}{\sum_{(x,y)\in M_O} I(x,y)} \right)$$
wherein $I(x,y)$ is the pixel value corresponding to the pixel $(x,y)$ in the region;
connecting the characteristic position $O$ with the center of gravity $C$ to construct the region main direction $\overrightarrow{OC}$; for any position point $P$ in the circular region $M_O$, calculating the included angle $\angle COP$ between $\overrightarrow{OC}$ and $\overrightarrow{OP}$;
according to $\angle COP$, dividing the circular region $M_O$ into $m$ quadrants, and in each quadrant dividing the pixels into $n$ intervals according to their distance to the characteristic position $O$, namely dividing the characteristic position into a plurality of sub-regions.
Further, the specific steps of the local information acquisition module encoding each sub-region of the characteristic position and splicing the plurality of encoded description vectors to obtain the binary description vector include:
performing histogram equalization processing on all pixels in the circular region $M_O$, and calculating the average pixel value of the equalized circular region $M_O$;
evenly dividing the proportion interval $[0, 100\%]$ into $p_1$ intervals, each interval being provided with a corresponding binary code sequence B1, wherein the code sequences of adjacent intervals differ in only one bit, and the coding length $k_1$ of B1 is the minimum integer satisfying $2^{k_1} \ge p_1$;
calculating the proportion $\eta$ of pixels in each sub-region of the circular region $M_O$ that are higher than the average pixel value, and determining the value of the binary code sequence B1 corresponding to the sub-region according to the proportion interval to which $\eta$ belongs;
calculating the sub-center of gravity $C_i$ of each sub-region;
obtaining the included angle $\alpha_i$ between $\overrightarrow{OC_i}$ and its nearest quadrant axis in the clockwise direction; determining the upper limit of the included angle $\alpha_{\max} = 360°/m$ according to the number $m$ of quadrants divided in step 2.3; evenly dividing the angle interval $[0°, \alpha_{\max}]$ into $p_2$ intervals, and coding each included-angle interval as a binary code sequence B2 with the coding rule that the code sequences of adjacent intervals differ in only one bit and the codes of the first and last intervals also differ in only one bit; the coding length $k_2$ of B2 is the minimum integer satisfying $2^{k_2} \ge p_2$;
determining the value of the binary code sequence B2 corresponding to each sub-region according to the included-angle interval to which $\alpha_i$ of its sub-center of gravity $C_i$ belongs;
splicing B1 and B2 to obtain the code of each sub-region of the circular region $M_O$;
splicing the codes of all sub-regions in sequence to obtain a $(k_1 + k_2)\cdot mn$-bit binary code, namely the binary description vector.
Further, the structural design of the neural network in the global information acquisition module is as follows:
the neural network has 8 layers in total; the first 7 layers each take the form of a two-dimensional convolution followed by a ReLU activation function, and the last layer uses only a two-dimensional convolution; the output dimensions of the layers are 8, 16, 32, 32, 64, 64, 128 and 128 respectively; the third and sixth layers adopt convolution kernels of size 5 with stride 2 and edge padding 1, and the kernel size, stride and edge padding of the remaining layers are 3, 1 and 1.
The beneficial technical effects of the invention are as follows:
the invention provides a space target motion tracking method and a system combining local information and global information, wherein for each video frame, the characteristic position of a space target is obtained, and the characteristic position is encoded into a binary description vector; the Euclidean distance among a plurality of binary description vectors in all video frames is calculated, and for each characteristic position in the previous video frame, a plurality of characteristic positions with the minimum Euclidean distance in the next video frame are determined to be used as primary matching relations; inputting the video frame sequence into a pre-trained neural network, and obtaining a multidimensional vector corresponding to each characteristic position; calculating Euclidean distances between the multidimensional vector corresponding to each characteristic position in the previous video frame and the multidimensional vectors corresponding to the characteristic positions in the preliminary matching relation in the next video frame, and selecting one characteristic position with the minimum Euclidean distance in the next video frame as the optimal matching relation of the characteristic positions in the previous video frame; and solving the motion of the space target according to the optimal matching relation of the characteristic positions so as to track the space target. The feature position is encoded into a binary description vector, so that the calculation efficiency of calculating the Euclidean distance of each feature position between images is improved; after a plurality of preliminary matching results are constructed for each feature position by utilizing the binary description vector, the Euclidean distance of the preliminary matching results is calculated by utilizing the neural network for verification, and only the matching relation with the minimum distance is reserved, so that the calculated amount of the matching process is reduced.
Drawings
The invention may be better understood by reference to the following description taken in conjunction with the accompanying drawings, which are included to provide a further illustration of the preferred embodiments of the invention and to explain the principles and advantages of the invention, together with the detailed description below.
FIG. 1 is a flow chart of a method of spatial target motion tracking combining local information and global information in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of generating sub-centers of gravity of each region of a feature point in an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example of an encoding sequence of included angle information in partial information encoding according to an embodiment of the present invention;
FIG. 4 is a diagram of global information generation in accordance with an embodiment of the present invention;
FIG. 5 is an exemplary graph of a matching result of feature positions between successive frames in an embodiment of the present invention;
FIG. 6 is a graph showing the relationship between the feature matching accuracy and the target motion angle between consecutive frames according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, exemplary embodiments or examples of the present invention will be described below with reference to the accompanying drawings. It is apparent that the described embodiments or examples are only implementations or examples of a part of the invention, not all. All other embodiments or examples, which may be made by one of ordinary skill in the art without undue burden, are intended to be within the scope of the present invention based on the embodiments or examples herein.
The embodiment of the invention provides a space target motion tracking method combining local information and global information, as shown in fig. 1, comprising the following steps:
step one, acquiring an image video stream of space target motion, and sampling the image video stream into a continuous video frame sequence;
step two, for each video frame, acquiring the characteristic position of a space target in the video frame, and encoding the characteristic position into a binary description vector;
step three, calculating Euclidean distances among the binary description vectors in all video frames, and determining, for each feature position in the previous video frame, the several feature positions with the minimum Euclidean distance in the next video frame as a preliminary matching relationship;
step four, inputting the video frame sequence into a pre-trained neural network, and obtaining a multidimensional vector corresponding to each feature position;
step five, calculating Euclidean distances between the multidimensional vector corresponding to each characteristic position in the previous video frame and the multidimensional vectors corresponding to the characteristic positions in the preliminary matching relation in the next video frame, and selecting one characteristic position with the minimum Euclidean distance in the next video frame as the optimal matching relation of the characteristic positions in the previous video frame;
and step six, solving the motion of the space target according to the optimal matching relation of the characteristic positions so as to track the space target.
As a preferred embodiment of the present invention, the specific steps of step two are as follows:
step 2.1, for each video frame, converting the image into a gray image and calculating the Harris feature points of the image;
step 2.2, setting a non-maximum suppression area, screening the Harris feature points according to the non-maximum suppression area, and taking the screened feature points as spatial target characteristic positions; as an example, the process of screening Harris feature points according to the non-maximum suppression area is: sequentially taking maximum positions according to the Harris response amplitude, setting the responses of all pixel positions within the non-maximum suppression area around each maximum position to 0, and repeating this process to realize the screening of Harris feature points and obtain the characteristic positions of the spatial target;
step 2.3, calculating the region main direction at each characteristic position of the spatial target, and dividing the characteristic position into a plurality of sub-regions according to the region main direction;
according to the embodiment of the invention, each image characteristic position is divided into a plurality of subareas based on the main direction, so that the follow-up binary description vector is ensured to keep stable to image rotation. The specific dividing process is as follows:
Extract a circular region $M_O$ of radius $r$ (for example, 20 pixels) around the characteristic position $O$, and perform histogram equalization on all pixels in $M_O$ to enhance the contrast of the texture; the gray-scale range after equalization is $[0, 255]$, expressed as:
$$g_k = \frac{255}{N} \sum_{j=0}^{k} n_j$$
wherein $n_j$ is the number of pixels with gray value $j$ in the region, $N$ is the total number of pixels in the region, and $g_k$ is the equalized gray value of a pixel with original gray value $k$.
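This is the standard cumulative-histogram mapping; a minimal sketch (the array name is illustrative):

```python
import numpy as np

def equalize_region(pixels):
    """Histogram equalization of the gray values inside M_O.
    pixels: 1-D uint8 array of the region's gray values."""
    hist = np.bincount(pixels, minlength=256)      # n_j
    cdf = np.cumsum(hist) / pixels.size            # cumulative proportion
    lut = np.round(255.0 * cdf).astype(np.uint8)   # g_k = 255 * sum_{j<=k} n_j / N
    return lut[pixels]
```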
For the circular region $M_O$, calculate the pixel-value center of gravity $C$ of the region:
$$C = \left( \frac{\sum x\,I(x,y)}{\sum I(x,y)},\ \frac{\sum y\,I(x,y)}{\sum I(x,y)} \right)$$
where the sums run over all pixel coordinates $x, y$ in the region, and $I(x,y)$ is the pixel value corresponding to pixel $(x,y)$ in the region.
Connect the characteristic position $O$ with the center of gravity $C$ to construct the region main direction $\overrightarrow{OC}$; for any position point $P$ in $M_O$, calculate the angle between $\overrightarrow{OC}$ and $\overrightarrow{OP}$:
$$\theta = \arccos \frac{\overrightarrow{OC} \cdot \overrightarrow{OP}}{|\overrightarrow{OC}|\,|\overrightarrow{OP}|}$$
Further, since the range of the arccos function is $[0, \pi]$, to determine the quadrant to which $P$ belongs, $\overrightarrow{OC}$ and $\overrightarrow{OP}$ are expanded into three dimensions by appending a zero third component.
Calculate the cross product $\overrightarrow{OC} \times \overrightarrow{OP}$ and record its last-dimension value as $t$. Then the included angle $\angle COP$ between $\overrightarrow{OC}$ and $\overrightarrow{OP}$ is expressed as:
$$\angle COP = \begin{cases} \theta, & t \ge 0 \\ 2\pi - \theta, & t < 0 \end{cases}$$
according to the COP, M O Dividing into m quadrants, dividing each quadrant into n sections according to the distance from a pixel to a characteristic position O, ensuring that the areas of all the subareas are equal, and recording all the subareas as S i (i=1, …, mn). As shown in fig. 2, into 4 quadrants, each of which is divided intoThe total of 3 sections divides the area around the feature position O into 12 sub-areas.
And step two, respectively encoding each subarea of the characteristic position, and splicing the plurality of encoded description vectors to obtain a binary description vector.
According to an embodiment of the present invention, each feature location is encoded as a $(k_1 + k_2)\cdot mn$-bit binary description vector, where $k_1$, $k_2$ are the coding lengths of B1 and B2. The specific process is as follows:
For the feature position $O$ there are $mn$ sub-regions in total, each corresponding to a B1 code and a B2 code. First, the B1 code corresponding to each sub-region is calculated: calculate the average pixel value of the equalized region $M_O$; evenly divide the proportion interval $[0, 100\%]$ into $p_1$ intervals, each provided with a corresponding binary code sequence B1 such that the code sequences of adjacent intervals differ in only one bit, the coding length $k_1$ being the minimum integer satisfying $2^{k_1} \ge p_1$; count the proportion $\eta$ of pixels in the respective sub-region that are higher than the average pixel value, and determine the value of B1 for the sub-region according to the proportion interval to which $\eta$ belongs.
Taking $p_1 = 8$ as an example, $k_1$ is at least 3, and the eight intervals can be encoded with a 3-bit Gray code sequence, for example: 000, 001, 011, 010, 110, 111, 101, 100.
then, the B2 code corresponding to each sub-region is calculated: calculating the sub-center of gravity of each sub-region:
according toObtain->An angle with its nearest quadrant axis in a clockwise direction; determining the upper limit of the included angle according to the number m of the divided quadrants in the second and third steps>As shown in FIG. 3, the interval +.>Evenly divided into p 2 Each interval is coded into a binary code sequence B2 for each included angle interval, and the coding rule is as follows: in meeting the adjacent areaOn the basis that the code sequence between the two codes is only one bit different, the code length k of B2 also needs to further meet the requirement that the codes of the first and the last intervals are also only one bit different 2 To meet the requirements ofIs a minimum integer solution of (2);
according to the sub-centre of gravity of each sub-regionCorresponding->The included angle interval to which the binary code sequence B2 belongs determines the value of the binary code sequence B2 corresponding to the subarea.
With quadrant number m=4, interval number p 2 =8 for example: because the number of the quadrants is 4, the upper limit of the included angleThe interval [0 DEG, 90 DEG ]]Dividing into 8 included angle sections, and correspondingly encoding each section as follows:
the coding of all sub-regions is spliced in order, which can be represented as (k 1 +k 2 ) mn bits binary coding. Taking the foregoing example as an example, the method is divided into 4 quadrants, each quadrant is divided into 3 sub-regions, and the B1 and B2 codes of each region are 3 bits, so that the binary description vector of the feature position totals (3+3) ×4×3=72 bits.
As a preferred embodiment of the present invention, the specific steps of step three are as follows: for two consecutive images in the video frame sequence, extract the characteristic positions and corresponding binary description vectors according to the above steps, and construct preliminary matching relations by calculating the Euclidean distances between all description vectors in the two images. That is, for each characteristic position in the previous frame, the Euclidean distance between the description vector of that position and all description vectors of the following frame is calculated. For binary vectors, the squared Euclidean distance equals the Hamming distance and is computed by a bitwise exclusive-OR followed by summation. For each feature position, several (e.g., three) feature positions with the smallest Euclidean distance in the following frame are retained as preliminary matching relations.
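A minimal sketch of this preliminary matching for binary descriptors packed into uint8 arrays (e.g. with np.packbits); names are illustrative:

```python
import numpy as np

def top3_candidates(desc_prev, desc_next):
    """For bit-packed binary descriptors, the squared Euclidean distance
    equals the Hamming distance: bitwise XOR followed by a popcount."""
    # (N1, 1, B) xor (1, N2, B) -> per-pair differing bits
    xor = desc_prev[:, None, :] ^ desc_next[None, :, :]
    dist = np.unpackbits(xor, axis=2).sum(axis=2)   # Hamming distance matrix
    return np.argsort(dist, axis=1)[:, :3]          # 3 nearest in the next frame
```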
As a preferred embodiment of the present invention, the specific steps of step four are as follows: scale the two frame images to 640 × 640, input them respectively into the trained neural network, and obtain the 128-dimensional vector corresponding to each characteristic position through interpolation. For an image of original size (3, h, w), it is first scaled to (3, 640, 640) and input to the neural network, which outputs a three-dimensional matrix of size (128, 160, 160). The matrix is up-sampled by interpolation to (128, h, w), and the 128-dimensional information of the up-sampled matrix is used as the description vector of each pixel position in the image, as shown in fig. 4.
The neural network is designed as follows: it has 8 layers in total; the first 7 layers each take the form of a two-dimensional convolution followed by a ReLU activation function, and the last layer uses only a two-dimensional convolution. The output dimensions of the layers are 8/16/32/32/64/64/128/128; the third and sixth layers adopt convolution kernels of size 5 with stride 2 and edge padding 1, and the kernel size, stride and edge padding of the remaining layers are 3, 1 and 1.
After the three-dimensional matrix of size (128, h, w) is obtained, the following normalization operation is performed: denote the 128-dimensional description information of the three-dimensional matrix at $h = i$, $w = j$ as $n_{ij}$. Calculate its module length $\|n_{ij}\|_2$, i.e. the square root of the sum of squares of all elements of $n_{ij}$. Then use $n_{ij}/\|n_{ij}\|_2$ in place of the original $n_{ij}$ as the global description information of the image pixel coordinate $(i, j)$.
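A sketch of this global-description pipeline as stated above (the 8-layer network, bilinear up-sampling back to the original resolution, and the L2 normalization), written in PyTorch as an assumed implementation framework:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalDescriptorNet(nn.Module):
    """8 layers: 2-D convolution + ReLU for the first 7, convolution only
    for the last; layers 3 and 6 use kernel 5, stride 2, padding 1, the
    rest kernel 3, stride 1, padding 1 (as described in the text)."""
    def __init__(self):
        super().__init__()
        dims = [3, 8, 16, 32, 32, 64, 64, 128, 128]
        layers = []
        for i in range(8):
            if i in (2, 5):  # third and sixth layers
                layers.append(nn.Conv2d(dims[i], dims[i + 1], 5, stride=2, padding=1))
            else:
                layers.append(nn.Conv2d(dims[i], dims[i + 1], 3, stride=1, padding=1))
            if i < 7:
                layers.append(nn.ReLU(inplace=True))
        self.net = nn.Sequential(*layers)

    def forward(self, image, out_hw):
        """image: (1, 3, 640, 640) scaled input; out_hw: original (h, w)."""
        feat = self.net(image)                              # (1, 128, ~160, ~160)
        feat = F.interpolate(feat, size=out_hw,
                             mode='bilinear', align_corners=False)
        return F.normalize(feat, p=2, dim=1)                # n_ij / ||n_ij||_2
```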
As a preferred embodiment of the present invention, the specific steps of step five include: for an arbitrary feature position coordinate $(u, v)$ of the previous frame, denote its several (e.g., three) preliminary matching relations obtained in step three as $(u_1, v_1)$, $(u_2, v_2)$, $(u_3, v_3)$; through step four, acquire the description vector $n_{uv}$ of the previous frame image $I_1$ at position $(u, v)$ and the description vectors of the following frame image $I_2$ at the corresponding matching feature positions, denoted $n_{u_1 v_1}$, $n_{u_2 v_2}$, $n_{u_3 v_3}$; calculate the Euclidean distances between $n_{uv}$ and each of these three vectors respectively, and retain only the feature position corresponding to the minimum Euclidean distance as the optimal matching relation.
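A sketch of this verification step, assuming the descriptor maps from step four are available as (128, h, w) arrays; names are illustrative:

```python
import numpy as np

def verify_matches(global_prev, global_next, feats_prev, candidates):
    """Keep, for each feature (u, v) of the previous frame, the one of its
    (up to three) candidates whose 128-D global descriptor is closest.

    global_*: (128, h, w) L2-normalized descriptor maps;
    candidates[i]: candidate (u, v) coordinates in the next frame."""
    best = []
    for (u, v), cands in zip(feats_prev, candidates):
        n_uv = global_prev[:, v, u]
        dists = [np.linalg.norm(n_uv - global_next[:, vc, uc]) for uc, vc in cands]
        best.append(cands[int(np.argmin(dists))])
    return best
```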
As a preferred embodiment of the present invention, the specific steps of step six include: recovering the essential matrix through the correspondence of the characteristic positions, and decomposing the essential matrix to obtain the relative motion information, specifically as follows: assume that $N$ groups of matched characteristic positions are screened from the two frames, recorded in normalized image coordinates as $\{x_i\}$ and $\{x_i'\}$ respectively. Each correspondence satisfies the epipolar constraint $x_i'^{\,T} E\, x_i = 0$; stacking the $N$ constraints gives an over-determined homogeneous linear system in the nine entries of the essential matrix $E$, whose least-squares solution is taken.
Performing SVD decomposition on the matrix gives $E = U \Sigma V^T$, wherein $U$, $V$ are orthogonal matrices and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, 0)$ (for an ideal essential matrix, $\sigma_1 = \sigma_2$).
Recording the relative motion of the target as a rotation matrix $R$ and a translation vector $t = [t_1, t_2, t_3]^T$, the SVD decomposition result yields $R = U W V^T$ or $R = U W^T V^T$, and $\hat{t} = U Z U^T$, i.e. $t = \pm u_3$ (the last column of $U$), a total of 4 solutions, wherein
$$W = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad Z = \begin{pmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$
The unique solution is obtained by checking the sign of the projection depth of the spatial coordinate points.
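A sketch of the decomposition into the four candidate solutions (NumPy; the positive-depth check that selects the unique solution is omitted):

```python
import numpy as np

def decompose_essential(E):
    """Split an essential matrix into the four candidate (R, t) pairs;
    the unique solution is then selected by checking that triangulated
    points have positive depth in both views."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0: U = -U      # keep proper rotations
    if np.linalg.det(Vt) < 0: Vt = -Vt
    W = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 1.]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                           # translation up to scale and sign
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```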
The technical effect of the invention is further verified by experiments.
The experiments perform image simulation and analysis on two target models (a box and a satellite model); a feature position matching result is shown in fig. 5. The analysis was performed on 100 consecutive simulation images of size 640 × 480. Under 10 groups of different initial attitudes, the pitch/yaw/roll angles of the model were increased simultaneously through simulation software with a step size of 1°, and the average matching accuracy of the feature points was calculated, as shown in fig. 6. It can be seen from fig. 5 and fig. 6 that the invention can effectively construct correct matching relations for similar texture features in the images, and maintains a high matching accuracy as the parallax caused by target rotation increases.
Another embodiment of the present invention proposes a spatial target motion tracking system combining local information and global information, the system comprising:
an image sequence acquisition module configured to acquire an image video stream of spatial object motion, sampling it as a sequence of consecutive video frames;
a local information acquisition module configured to acquire, for each video frame, a feature position of a spatial target therein, and encode the feature position as a binary description vector; the Euclidean distance among a plurality of binary description vectors in all video frames is calculated, and for each characteristic position in the previous video frame, a plurality of characteristic positions with the minimum Euclidean distance in the next video frame are determined to be used as primary matching relations;
the global information acquisition module is configured to input the video frame sequence into a pre-trained neural network to acquire multidimensional vectors corresponding to each characteristic position; calculating Euclidean distances between the multidimensional vector corresponding to each characteristic position in the previous video frame and the multidimensional vectors corresponding to the characteristic positions in the preliminary matching relation in the next video frame, and selecting one characteristic position with the minimum Euclidean distance in the next video frame as the optimal matching relation of the characteristic positions in the previous video frame;
and the target tracking module is configured to solve the space target motion according to the optimal matching relation of the characteristic positions so as to track the space target.
As a preferred embodiment of the present invention, the specific steps of the local information acquisition module acquiring, for each video frame, the characteristic position of the spatial target therein and encoding the characteristic position into a binary description vector include:
step 2.1, calculating the Harris feature points of the image for each video frame;
step 2.2, setting a non-maximum suppression area, screening the Harris feature points according to the non-maximum suppression area, and taking the screened Harris feature points as spatial target characteristic positions;
step 2.3, calculating the region main direction at each characteristic position of the spatial target, and dividing the characteristic position into a plurality of sub-regions according to the region main direction;
step 2.4, respectively encoding each sub-region of the characteristic position, and splicing the plurality of encoded description vectors to obtain the binary description vector.
As a preferred embodiment of the present invention, the specific steps of the local information acquisition module calculating the region main direction at each characteristic position of the spatial target and dividing the characteristic position into a plurality of sub-regions according to the region main direction include:
extracting a circular region $M_O$ of radius $r$ centered on the characteristic position $O$, and calculating the pixel-value center of gravity $C$ of $M_O$:
$$C = \left( \frac{\sum_{(x,y)\in M_O} x\,I(x,y)}{\sum_{(x,y)\in M_O} I(x,y)},\ \frac{\sum_{(x,y)\in M_O} y\,I(x,y)}{\sum_{(x,y)\in M_O} I(x,y)} \right)$$
wherein $I(x,y)$ is the pixel value corresponding to the pixel $(x,y)$ in the region;
connecting the characteristic position $O$ with the center of gravity $C$ to construct the region main direction $\overrightarrow{OC}$; for any position point $P$ in the circular region $M_O$, calculating the included angle $\angle COP$ between $\overrightarrow{OC}$ and $\overrightarrow{OP}$;
according to $\angle COP$, dividing the circular region $M_O$ into $m$ quadrants, and in each quadrant dividing the pixels into $n$ intervals according to their distance to the characteristic position $O$, namely dividing the characteristic position into a plurality of sub-regions.
As a preferred embodiment of the present invention, the specific steps of the local information acquisition module encoding each sub-region of the characteristic position and splicing the plurality of encoded description vectors to obtain the binary description vector include:
performing histogram equalization processing on all pixels in the circular region $M_O$, and calculating the average pixel value of the equalized circular region $M_O$;
evenly dividing the proportion interval $[0, 100\%]$ into $p_1$ intervals, each interval being provided with a corresponding binary code sequence B1, wherein the code sequences of adjacent intervals differ in only one bit, and the coding length $k_1$ of B1 is the minimum integer satisfying $2^{k_1} \ge p_1$;
calculating the proportion $\eta$ of pixels in each sub-region of the circular region $M_O$ that are higher than the average pixel value, and determining the value of the binary code sequence B1 corresponding to the sub-region according to the proportion interval to which $\eta$ belongs;
calculating the sub-center of gravity $C_i$ of each sub-region;
obtaining the included angle $\alpha_i$ between $\overrightarrow{OC_i}$ and its nearest quadrant axis in the clockwise direction; determining the upper limit of the included angle $\alpha_{\max} = 360°/m$ according to the number $m$ of quadrants divided in step 2.3; evenly dividing the angle interval $[0°, \alpha_{\max}]$ into $p_2$ intervals, and coding each included-angle interval as a binary code sequence B2 with the coding rule that the code sequences of adjacent intervals differ in only one bit and the codes of the first and last intervals also differ in only one bit; the coding length $k_2$ of B2 is the minimum integer satisfying $2^{k_2} \ge p_2$;
determining the value of the binary code sequence B2 corresponding to each sub-region according to the included-angle interval to which $\alpha_i$ of its sub-center of gravity $C_i$ belongs;
splicing B1 and B2 to obtain the code of each sub-region of the circular region $M_O$;
splicing the codes of all sub-regions in sequence to obtain a $(k_1 + k_2)\cdot mn$-bit binary code, namely the binary description vector.
As a preferred embodiment of the present invention, the neural network in the global information acquisition module is designed as follows: it has 8 layers in total; the first 7 layers each take the form of a two-dimensional convolution followed by a ReLU activation function, and the last layer uses only a two-dimensional convolution; the output dimensions of the layers are 8, 16, 32, 32, 64, 64, 128 and 128 respectively; the third and sixth layers adopt convolution kernels of size 5 with stride 2 and edge padding 1, and the kernel size, stride and edge padding of the remaining layers are 3, 1 and 1.
The function of the spatial target motion tracking system combining local information and global information according to the embodiment of the present invention is fully described by the aforementioned spatial target motion tracking method combining local information and global information; the system embodiment is therefore not detailed here, and reference may be made to the above method embodiment.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of the above description, will appreciate that other embodiments are contemplated within the scope of the invention as described herein. The disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is defined by the appended claims.

Claims (10)

1. The spatial target motion tracking method combining the local information and the global information is characterized by comprising the following steps of:
step one, acquiring an image video stream of space target motion, and sampling the image video stream into a continuous video frame sequence;
step two, for each video frame, acquiring the characteristic position of a space target in the video frame, and encoding the characteristic position into a binary description vector;
step three, calculating Euclidean distances among the binary description vectors in all video frames, and determining, for each feature position in the previous video frame, the several feature positions with the minimum Euclidean distance in the next video frame as a preliminary matching relationship;
step four, inputting the video frame sequence into a pre-trained neural network, and obtaining a multidimensional vector corresponding to each feature position;
step five, calculating Euclidean distances between the multidimensional vector corresponding to each characteristic position in the previous video frame and the multidimensional vectors corresponding to the characteristic positions in the preliminary matching relation in the next video frame, and selecting one characteristic position with the minimum Euclidean distance in the next video frame as the optimal matching relation of the characteristic positions in the previous video frame;
and step six, solving the motion of the space target according to the optimal matching relation of the characteristic positions so as to track the space target.
2. The method for tracking the motion of a spatial target by combining local information and global information according to claim 1, wherein the specific steps of step two include:
step 2.1, calculating the Harris feature points of the image for each video frame;
step 2.2, setting a non-maximum suppression area, screening the Harris feature points according to the non-maximum suppression area, and taking the screened Harris feature points as spatial target characteristic positions;
step 2.3, calculating the region main direction at each characteristic position of the spatial target, and dividing the characteristic position into a plurality of sub-regions according to the region main direction;
step 2.4, respectively encoding each sub-region of the characteristic position, and splicing the plurality of encoded description vectors to obtain a binary description vector.
3. The method for tracking the motion of the spatial target by combining local information and global information according to claim 2, wherein the specific process of screening Harris feature points according to the non-maximum suppression area in step 2.2 comprises: sequentially taking maximum positions according to the Harris response amplitude, setting the response values of all pixel positions within the non-maximum suppression area around each maximum position to 0, and repeating this process to realize the screening of Harris feature points.
4. The method for tracking the motion of a spatial target by combining local information and global information according to claim 2, wherein the specific steps of step 2.3 include:
extracting a circular region $M_O$ of radius $r$ centered on the characteristic position $O$, and calculating the pixel-value center of gravity $C$ of $M_O$:
$$C = \left( \frac{\sum_{(x,y)\in M_O} x\,I(x,y)}{\sum_{(x,y)\in M_O} I(x,y)},\ \frac{\sum_{(x,y)\in M_O} y\,I(x,y)}{\sum_{(x,y)\in M_O} I(x,y)} \right)$$
wherein $I(x,y)$ is the pixel value corresponding to the pixel $(x,y)$ in the region;
connecting the characteristic position $O$ with the center of gravity $C$ to construct the region main direction $\overrightarrow{OC}$; for any position point $P$ in the circular region $M_O$, calculating the included angle $\angle COP$ between $\overrightarrow{OC}$ and $\overrightarrow{OP}$;
according to $\angle COP$, dividing the circular region $M_O$ into $m$ quadrants, and in each quadrant dividing the pixels into $n$ intervals according to their distance to the characteristic position $O$, namely dividing the characteristic position into a plurality of sub-regions.
5. The method for spatial object motion tracking combining local information and global information according to claim 4, wherein the specific steps of step 2.4 include:
performing histogram equalization processing on all pixels in the circular region $M_O$, and calculating the average pixel value of the equalized circular region $M_O$;
evenly dividing the proportion interval $[0, 100\%]$ into $p_1$ intervals, each interval being provided with a corresponding binary code sequence B1, wherein the code sequences of adjacent intervals differ in only one bit, and the coding length $k_1$ of B1 is the minimum integer satisfying $2^{k_1} \ge p_1$;
calculating the proportion $\eta$ of pixels in each sub-region of the circular region $M_O$ that are higher than the average pixel value, and determining the value of the binary code sequence B1 corresponding to the sub-region according to the proportion interval to which $\eta$ belongs;
calculating the sub-center of gravity $C_i$ of each sub-region;
obtaining the included angle $\alpha_i$ between $\overrightarrow{OC_i}$ and its nearest quadrant axis in the clockwise direction; determining the upper limit of the included angle $\alpha_{\max} = 360°/m$ according to the number $m$ of quadrants divided in step 2.3; evenly dividing the angle interval $[0°, \alpha_{\max}]$ into $p_2$ intervals, and coding each included-angle interval as a binary code sequence B2 with the coding rule that the code sequences of adjacent intervals differ in only one bit and the codes of the first and last intervals also differ in only one bit; the coding length $k_2$ of B2 is the minimum integer satisfying $2^{k_2} \ge p_2$;
determining the value of the binary code sequence B2 corresponding to each sub-region according to the included-angle interval to which $\alpha_i$ of its sub-center of gravity $C_i$ belongs;
splicing B1 and B2 to obtain the code of each sub-region of the circular region $M_O$;
splicing the codes of all sub-regions in sequence to obtain a $(k_1 + k_2)\cdot mn$-bit binary code, namely the binary description vector.
6. The method for tracking the motion of a spatial target by combining local information and global information according to claim 1, wherein the neural network in step four is structurally designed as follows:
the neural network has 8 layers in total; the first 7 layers each take the form of a two-dimensional convolution followed by a ReLU activation function, and the last layer uses only a two-dimensional convolution; the output dimensions of the layers are 8, 16, 32, 32, 64, 64, 128 and 128 respectively; the third and sixth layers adopt convolution kernels of size 5 with stride 2 and edge padding 1, and the kernel size, stride and edge padding of the remaining layers are 3, 1 and 1.
7. A spatial target motion tracking system combining local information and global information, comprising:
an image sequence acquisition module configured to acquire an image video stream of spatial object motion, sampling it as a sequence of consecutive video frames;
a local information acquisition module configured to acquire, for each video frame, a feature position of a spatial target therein, and encode the feature position as a binary description vector; the Euclidean distance among a plurality of binary description vectors in all video frames is calculated, and for each characteristic position in the previous video frame, a plurality of characteristic positions with the minimum Euclidean distance in the next video frame are determined to be used as primary matching relations;
the global information acquisition module is configured to input the video frame sequence into a pre-trained neural network to acquire multidimensional vectors corresponding to each characteristic position; calculating Euclidean distances between the multidimensional vector corresponding to each characteristic position in the previous video frame and the multidimensional vectors corresponding to the characteristic positions in the preliminary matching relation in the next video frame, and selecting one characteristic position with the minimum Euclidean distance in the next video frame as the optimal matching relation of the characteristic positions in the previous video frame;
and the target tracking module is configured to solve the space target motion according to the optimal matching relation of the characteristic positions so as to track the space target.
8. The spatial object motion tracking system combining local information and global information according to claim 7, wherein the specific steps of the local information acquisition module acquiring, for each video frame, the characteristic position of the spatial object therein and encoding the characteristic position into a binary description vector comprise:
step 2.1, calculating the Harris feature points of the image for each video frame;
step 2.2, setting a non-maximum suppression area, screening the Harris feature points according to the non-maximum suppression area, and taking the screened Harris feature points as spatial target characteristic positions;
step 2.3, calculating the region main direction at each characteristic position of the spatial target, and dividing the characteristic position into a plurality of sub-regions according to the region main direction;
step 2.4, respectively encoding each sub-region of the characteristic position, and splicing the plurality of encoded description vectors to obtain the binary description vector.
9. The spatial target motion tracking system combining local information and global information according to claim 8, wherein the specific steps of the local information acquisition module calculating the region main direction at each characteristic position of the spatial target and dividing the characteristic position into a plurality of sub-regions according to the region main direction comprise:
extracting a circular region $M_O$ of radius $r$ centered on the characteristic position $O$, and calculating the pixel-value center of gravity $C$ of $M_O$:
$$C = \left( \frac{\sum_{(x,y)\in M_O} x\,I(x,y)}{\sum_{(x,y)\in M_O} I(x,y)},\ \frac{\sum_{(x,y)\in M_O} y\,I(x,y)}{\sum_{(x,y)\in M_O} I(x,y)} \right)$$
wherein $I(x,y)$ is the pixel value corresponding to the pixel $(x,y)$ in the region;
connecting the characteristic position $O$ with the center of gravity $C$ to construct the region main direction $\overrightarrow{OC}$; for any position point $P$ in the circular region $M_O$, calculating the included angle $\angle COP$ between $\overrightarrow{OC}$ and $\overrightarrow{OP}$;
according to $\angle COP$, dividing the circular region $M_O$ into $m$ quadrants, and in each quadrant dividing the pixels into $n$ intervals according to their distance to the characteristic position $O$, namely dividing the characteristic position into a plurality of sub-regions.
10. The spatial target motion tracking system combining local information and global information according to claim 9, wherein the specific steps of the local information acquisition module encoding each sub-region of the characteristic position and splicing the plurality of encoded description vectors to obtain the binary description vector comprise:
performing histogram equalization processing on all pixels in the circular region $M_O$, and calculating the average pixel value of the equalized circular region $M_O$;
evenly dividing the proportion interval $[0, 100\%]$ into $p_1$ intervals, each interval being provided with a corresponding binary code sequence B1, wherein the code sequences of adjacent intervals differ in only one bit, and the coding length $k_1$ of B1 is the minimum integer satisfying $2^{k_1} \ge p_1$;
calculating the proportion $\eta$ of pixels in each sub-region of the circular region $M_O$ that are higher than the average pixel value, and determining the value of the binary code sequence B1 corresponding to the sub-region according to the proportion interval to which $\eta$ belongs;
calculating the sub-center of gravity $C_i$ of each sub-region;
obtaining the included angle $\alpha_i$ between $\overrightarrow{OC_i}$ and its nearest quadrant axis in the clockwise direction; determining the upper limit of the included angle $\alpha_{\max} = 360°/m$ according to the number $m$ of quadrants divided in step 2.3; evenly dividing the angle interval $[0°, \alpha_{\max}]$ into $p_2$ intervals, and coding each included-angle interval as a binary code sequence B2 with the coding rule that the code sequences of adjacent intervals differ in only one bit and the codes of the first and last intervals also differ in only one bit; the coding length $k_2$ of B2 is the minimum integer satisfying $2^{k_2} \ge p_2$;
determining the value of the binary code sequence B2 corresponding to each sub-region according to the included-angle interval to which $\alpha_i$ of its sub-center of gravity $C_i$ belongs;
splicing B1 and B2 to obtain the code of each sub-region of the circular region $M_O$;
splicing the codes of all sub-regions in sequence to obtain a $(k_1 + k_2)\cdot mn$-bit binary code, namely the binary description vector.
CN202310945393.2A 2023-07-28 2023-07-28 Spatial target motion tracking method and system combining local information and global information Pending CN116958202A (en)

Priority Applications (1)

Application CN202310945393.2A | Priority date: 2023-07-28 | Filing date: 2023-07-28 | Title: Spatial target motion tracking method and system combining local information and global information

Applications Claiming Priority (1)

Application CN202310945393.2A | Priority date: 2023-07-28 | Filing date: 2023-07-28 | Title: Spatial target motion tracking method and system combining local information and global information

Publications (1)

Publication number: CN116958202A (en) | Publication date: 2023-10-27

Family

ID=88452546

Family Applications (1)

Application CN202310945393.2A (status: Pending) | Priority date: 2023-07-28 | Filing date: 2023-07-28 | Title: Spatial target motion tracking method and system combining local information and global information

Country Status (1)

Country Link
CN (1) CN116958202A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination