CN109767454B - Unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency saliency


Info

Publication number
CN109767454B
Authority
CN
China
Prior art keywords
image
significance
saliency
frequency
region
Prior art date
Legal status
Active
Application number
CN201811552410.1A
Other languages
Chinese (zh)
Other versions
CN109767454A (en)
Inventor
李映
汪亦文
李静玉
白宗文
聂金苗
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Application filed by Northwestern Polytechnical University
Priority to CN201811552410.1A
Publication of CN109767454A
Application granted
Publication of CN109767454B

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to an unmanned aerial vehicle (UAV) aerial video moving target detection method based on time-space-frequency saliency. The temporal saliency of the video is extracted with the Lucas-Kanade optical flow method; the spatial saliency of the image is extracted from its color distribution; the image is converted from the spatial domain to the frequency domain and its frequency-domain saliency is extracted with the spectral residual method. The temporal, spatial and frequency-domain saliencies are fused by linear weighting into a saliency confidence map, which is binarized with a set threshold to extract the moving targets from the aerial video. Because the time, space and frequency domains are fused, the saliency of any two domains compensates for the weaknesses of the third, improving detection accuracy and robustness; the algorithm is simple and executes efficiently.

Description

Unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency saliency
Technical Field
The invention relates to a method for detecting a moving target from an unmanned aerial vehicle aerial video, and belongs to the field of computer vision.
Background
Moving target detection in unmanned aerial vehicle (UAV) aerial video is an important branch of intelligent aerial-video analysis, with critical military and civilian applications. Researchers at home and abroad have already carried out work on aerial-video moving target detection. Early methods were based on inter-frame difference: adjacent frames are registered using feature points or regions, the registered frames are differenced, and the position of the moving target is determined from the difference image. This approach, however, is sensitive to the accuracy of the registration algorithm: if registration is imprecise, the difference result is inaccurate, which strongly affects the subsequent localization of the moving target. In addition, because targets in aerial video are relatively small, some techniques detect moving targets by background-model estimation. This approach depends on the quality of the estimated background model: if the model contains targets, subsequent detection cannot achieve a good result. The moving target detection method based on time-space-frequency saliency extracts saliency from the time, space and frequency domains separately and then fuses the three saliencies to detect moving targets. It mainly exploits the characteristics of the human visual system to obtain candidate target regions in the image and combines them with the motion information in the video to detect moving targets.
Disclosure of Invention
Technical problem to be solved
To avoid the defects of the prior art, the invention provides an unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency saliency fusion, which addresses, in particular, the low detection accuracy of existing approaches.
Technical scheme
An unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency saliency, characterized by comprising the following steps:
Step 1: extracting the temporal saliency of the video with the Lucas-Kanade optical flow method;
Step 2: extracting the spatial saliency of the image from its color distribution;
Step 3: converting the image from the spatial domain to the frequency domain and extracting its frequency-domain saliency with the spectral residual method;
Step 4: fusing the temporal, spatial and frequency-domain saliencies by linear weighting into a saliency confidence map, binarizing the confidence map with a set threshold, and extracting the moving targets from the aerial video; the specific steps are as follows:
1) fuse the temporal, spatial and frequency-domain saliencies by linear weighting to obtain the saliency confidence map S(x,y):
S(x,y) = μ1St(x,y) + μ2Ss(x,y) + μ3Sf(x,y)
where St(x,y) is the temporal saliency, Ss(x,y) the spatial saliency, Sf(x,y) the frequency-domain saliency, and μi the weight of each term;
2) binarize S(x,y) with a set threshold to obtain a binary image B, and find all 8-connected regions region_c1 in B;
3) set the circumscribed rectangles of the qualifying regions in region_c1 to 1 and find the 8-connected regions region_c2 of the result; meanwhile, extract the edge map of the original input grayscale image with the Prewitt operator; if, at the position of a connected region of region_c2, more than 5 rows of the edge map each have a gray-value sum exceeding 5, keep that region as region_c3;
4) initialize a zero matrix of the same size as the original input image, set the positions corresponding to region_c3 to 1 and the rest to 0, obtaining the binary image Y1;
5) apply a disk-shaped morphological closing with radius 7 to the binary image B to fill holes, obtaining Y2; the element-wise AND of Y1 and Y2 gives the final binary image Y;
6) find all qualifying 8-connected regions region_cfinal in Y; the positions of region_cfinal are the moving targets extracted from the aerial video.
The specific steps of step 1 are as follows:
Step 11: normalize the optical flow direction map:
θ'(x,y) = (θi − min θ) / (max θ − min θ) × 255
where θi is the angle of the optical flow at point (x,y); a disk-shaped morphological closing with radius 3 is then applied to the normalized direction map to obtain a gray map C;
Step 12: count how often each gray value from 0 to 255 occurs in C, compute the frequency of each gray value, and take its negative logarithm to obtain the directional saliency of the point:
Sd(x,y) = −log(Ni / N)
where Ni is the number of pixels with the same gray value as point (x,y), and N is the total number of pixels in C;
a temporal saliency map Sa based on the optical flow amplitude is obtained in the same way, with the amplitude normalized as follows and the remaining steps identical to the directional saliency:
a'(x,y) = (ai − min a) / (max a − min a) × 255
the final temporal saliency map St(x,y) is defined as the linear weighted sum of the amplitude-based and direction-based temporal saliency maps:
St(x,y) = w1Sa(x,y) + w2Sd(x,y).
the specific steps of step 2 are as follows:
step 21: traversing 4 neighborhoods of the gray image from the pixel coordinates (0,0) of the gray image, if the gray value difference is smaller than a threshold value, judging the same connected region, otherwise, setting the connected region as a new connected region starting point, and repeating the operation until the image is completely traversed;
step 22: calculating the gray average value of each connected region, and uniformly assigning values to all pixel points in the region to obtain an image M;
step 23: counting the number of pixels in each connected domain in the image M, calculating the occurrence frequency of the pixels in each connected domain, and taking the negative logarithm of the frequency to obtain the spatial significance:
Figure BDA0001910945500000034
wherein N isconnect(i) Representing the number of all the pixels in the same connected domain with the (x, y) point, NconnectRepresenting the number of all the pixel points in M.
The specific steps of step 3 are as follows:
Step 31: given a gray image H(x,y), transform it from the spatial domain to the frequency domain with the two-dimensional discrete Fourier transform F, obtaining the frequency-domain representation F[H(x,y)];
Step 32: obtain the amplitude A(f) and phase P(f) of F[H(x,y)]:
A(f) = |F[H(x,y)]|
P(f) = φ(F[H(x,y)])
where |·| denotes the amplitude operation and φ(·) the phase operation;
Step 33: take the logarithm of the amplitude A(f) of F[H(x,y)] to obtain the log spectrum L(f):
L(f) = log(A(f))
Step 34: smooth the log spectrum with a local averaging filter hn(f):
M(f) = L(f) * hn(f)
where hn(f) is an n×n matrix whose entries are all equal, defined as:
hn(f) = (1/n²)·J, where J is the n×n all-ones matrix
Step 35: the difference between the log spectrum and its mean-filtered version is the spectral residual:
R(f) = L(f) − M(f)
Step 36: apply the two-dimensional inverse discrete Fourier transform to the spectral residual R(f) and the phase P(f) to convert them from the frequency domain back to the spatial domain:
T(x,y) = |F⁻¹[exp{R(f) + iP(f)}]|²
Step 37: reconstruct an image by Gaussian-filtering the spatial-domain result, representing the saliency of each pixel of the original image; this is the saliency map:
Sf(x,y) = T(x,y) * Gaussian.
advantageous effects
The UAV aerial video moving target detection method based on time-space-frequency saliency fuses the temporal, spatial and frequency-domain saliencies, so that the saliency of any two domains compensates for the weaknesses of the third. This improves detection accuracy and robustness, while the algorithm remains simple and executes efficiently.
Drawings
FIG. 1 is a flow chart of aerial video moving object detection based on time-space-frequency saliency
Detailed Description
The invention will now be further described with reference to the following examples and drawings.
The scheme adopts an aerial-video moving target detection method based on time-space-frequency saliency, with the following specific steps:
Step 1: extract the temporal saliency of the video with the Lucas-Kanade optical flow method.
Step 2: extract the spatial saliency of the image from its color distribution.
Step 3: convert the image from the spatial domain to the frequency domain and extract its frequency-domain saliency with the spectral residual method.
Step 4: fuse the temporal, spatial and frequency-domain saliencies by linear weighting into a saliency confidence map, binarize it with a suitable threshold, and extract the moving targets from the aerial video.
A preferred embodiment of the invention comprises the following steps:
Step 1: extract the temporal saliency of the video with the Lucas-Kanade optical flow method.
Let I(x,y,t) be the gray value of pixel (x,y) in the image at time t. At time t+dt, the pixel originally at (x,y) has moved by dx and dy in the x and y directions. Since the gray value of a pixel remains unchanged over a short time:
I(x,y,t) = I(x+dx, y+dy, t+dt)   (1)
Expanding the right-hand side with Taylor's formula and omitting the higher-order infinitesimals (the motion is small enough) gives:
Ix·Vx + Iy·Vy + It = 0   (2)
where Ix, Iy and It are the gradients of the image at (x,y,t) in the x, y and t directions, and Vx and Vy are the motion velocities of the pixel in the x and y directions.
Assuming that the pixel motion within a local area is consistent (this embodiment uses a 5×5 neighborhood, filling the missing gray values with 0 when the pixel lies at an image edge), equation (2) can be set up for every pixel in the neighborhood:
Ix(pixel1)·Vx + Iy(pixel1)·Vy = −It(pixel1)
Ix(pixel2)·Vx + Iy(pixel2)·Vy = −It(pixel2)
…
Ix(pixeln)·Vx + Iy(pixeln)·Vy = −It(pixeln)   (3)
where pixel1, pixel2, …, pixeln are the pixels in the 5×5 neighborhood of pixel (x,y) in image I. This series of equations can be written uniformly as Qv = b, where:
Q = [Ix(pixel1) Iy(pixel1); Ix(pixel2) Iy(pixel2); …; Ix(pixeln) Iy(pixeln)],  v = [Vx Vy]ᵀ,  b = −[It(pixel1) It(pixel2) … It(pixeln)]ᵀ   (4)
This system contains only two unknowns, Vx and Vy. Lucas and Kanade take its least-squares solution as the optical flow of pixel (x,y):
v = (QᵀQ)⁻¹Qᵀb   (5)
The amplitude and direction of the optical flow at the point are then computed:
a = sqrt(Vx² + Vy²)   (6)
θ = arctan(Vy / Vx)   (7)
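For illustration, the per-pixel least-squares solve above can be sketched in Python with NumPy. This is a minimal sketch, not the patented implementation: the function name lucas_kanade_dense, the use of np.gradient for the spatial derivatives, and the determinant test that skips degenerate windows are our assumptions; the 5×5 window and zero-filled borders follow the text.

import numpy as np

def lucas_kanade_dense(I1, I2, win=5):
    """Dense Lucas-Kanade flow: solve Qv = b over a win x win window per pixel."""
    Iy, Ix = np.gradient(I1.astype(np.float64))          # spatial gradients
    It = I2.astype(np.float64) - I1.astype(np.float64)   # temporal gradient
    r = win // 2
    pad = lambda a: np.pad(a, r, mode="constant")        # zero-fill at image edges
    Ixp, Iyp, Itp = pad(Ix), pad(Iy), pad(It)
    H, W = I1.shape
    Vx, Vy = np.zeros((H, W)), np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            ix = Ixp[y:y+win, x:x+win].ravel()
            iy = Iyp[y:y+win, x:x+win].ravel()
            it = Itp[y:y+win, x:x+win].ravel()
            Q = np.stack([ix, iy], axis=1)               # n² x 2 coefficient matrix
            b = -it
            QtQ = Q.T @ Q
            if np.linalg.det(QtQ) > 1e-6:                # skip degenerate windows
                Vx[y, x], Vy[y, x] = np.linalg.solve(QtQ, Q.T @ b)  # v = (QᵀQ)⁻¹Qᵀb
    mag = np.sqrt(Vx**2 + Vy**2)                         # amplitude, equation (6)
    ang = np.arctan2(Vy, Vx)                             # direction, equation (7)
    return Vx, Vy, mag, ang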
temporal saliency is found based on magnitude and direction of optical flow, respectively.
The directions are taken as examples:
1) Normalize the optical flow direction map:
θ'(x,y) = (θi − min θ) / (max θ − min θ) × 255   (8)
where θi is the angle of the optical flow at point (x,y). A disk-shaped morphological closing with radius 3 is then applied to the normalized direction map to obtain a gray map C;
2) Count how often each gray value from 0 to 255 occurs in C, compute the frequency of each value, and take its negative logarithm to obtain the directional saliency of the point:
Sd(x,y) = −log(Ni / N)   (9)
where Ni is the number of pixels with the same gray value as point (x,y), and N is the total number of pixels in C.
A temporal saliency map Sa based on the optical flow amplitude is obtained in the same way, with the amplitude normalized as follows and the remaining steps identical to the directional saliency:
a'(x,y) = (ai − min a) / (max a − min a) × 255   (10)
The final temporal saliency map St(x,y) is defined as the linear weighted sum of the amplitude-based and direction-based temporal saliency maps:
St(x,y) = w1Sa(x,y) + w2Sd(x,y)   (11)
In this embodiment, w1 and w2 are set to 0.7 and 0.3, respectively.
Step 2: the spatial saliency of the image is extracted by using the color distribution.
Mean-shift segmentation is applied to the image; then, borrowing the computation used for motion-information saliency, the negative logarithm of the distribution frequency gives the spatial saliency of the image. The specific steps are as follows:
1) Traverse the 4-neighborhoods of the gray image starting from pixel (0,0); if the gray-value difference is below a threshold (5 in this embodiment), assign the pixel to the same connected region, otherwise start a new connected region; repeat until the whole image has been traversed.
2) Compute the mean gray value of each connected region and assign it uniformly to all pixels of the region, obtaining the image M.
3) Count the number of pixels in each connected region of M, compute each region's frequency of occurrence, and take its negative logarithm to obtain the spatial saliency:
Ss(x,y) = −log(Nconnect(i) / Nconnect)   (12)
where Nconnect(i) is the number of pixels in the same connected region as point (x,y), and Nconnect is the total number of pixels in M.
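A sketch of this spatial-saliency step. The text's traversal is realized here as a breadth-first region growing in which each new pixel is compared with the neighbor that reached it; comparing against the region seed instead would be an equally plausible reading:

import numpy as np
from collections import deque

def spatial_saliency(gray, thr=5):
    H, W = gray.shape
    labels = -np.ones((H, W), dtype=np.int32)
    sizes = []                        # pixel count per connected region
    for sy in range(H):
        for sx in range(W):
            if labels[sy, sx] >= 0:
                continue
            lab = len(sizes)          # start a new connected region
            labels[sy, sx] = lab
            count = 1
            q = deque([(sy, sx)])
            while q:                  # grow over the 4-neighborhood
                y, x = q.popleft()
                for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                    if 0 <= ny < H and 0 <= nx < W and labels[ny, nx] < 0 \
                       and abs(int(gray[ny, nx]) - int(gray[y, x])) < thr:
                        labels[ny, nx] = lab
                        count += 1
                        q.append((ny, nx))
            sizes.append(count)
    freq = np.asarray(sizes, dtype=np.float64)[labels] / gray.size
    return -np.log(freq)              # equation (12): rare regions are salient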
Step 3: convert the image from the spatial domain to the frequency domain and extract its frequency-domain saliency with the spectral residual method.
1) Given a gray image H(x,y), transform it from the spatial domain to the frequency domain with the two-dimensional discrete Fourier transform F, obtaining the frequency-domain representation F[H(x,y)].
2) Obtain the amplitude A(f) and phase P(f) of F[H(x,y)]:
A(f) = |F[H(x,y)]|   (13)
P(f) = φ(F[H(x,y)])   (14)
where |·| denotes the amplitude operation and φ(·) the phase operation.
3) Take the logarithm of the amplitude A(f) of F[H(x,y)] to obtain the log spectrum L(f):
L(f) = log(A(f))   (15)
4) Smooth the log spectrum with a local averaging filter hn(f) to obtain its general shape:
M(f) = L(f) * hn(f)   (16)
where hn(f) is an n×n matrix (3×3 in this embodiment) whose entries are all equal, defined as:
hn(f) = (1/n²)·J, where J is the n×n all-ones matrix   (17)
5) The difference between the log spectrum and its mean-filtered version is the spectral residual:
R(f) = L(f) − M(f)   (18)
6) The spectral residual captures the anomalous frequency components of the image and can therefore be used for salient object detection. Applying the two-dimensional inverse discrete Fourier transform to the spectral residual R(f) and the phase P(f) converts them from the frequency domain back to the spatial domain:
T(x,y) = |F⁻¹[exp{R(f) + iP(f)}]|²   (19)
7) Reconstruct an image by Gaussian-filtering the spatial-domain result (this scheme uses a 3×3 Gaussian low-pass filter with standard deviation 1); it represents the saliency of each pixel of the original image and is the saliency map:
Sf(x,y) = T(x,y) * Gaussian   (20)
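A sketch of the spectral-residual computation with the parameters stated above (3×3 averaging filter, 3×3 Gaussian with standard deviation 1); the small epsilon guarding log(0) is our addition:

import numpy as np
import cv2

def spectral_residual_saliency(gray):
    f = np.fft.fft2(gray.astype(np.float64))
    A = np.abs(f)                                  # amplitude A(f), equation (13)
    P = np.angle(f)                                # phase P(f), equation (14)
    L = np.log(A + 1e-8)                           # log spectrum L(f), equation (15)
    M = cv2.blur(L, (3, 3))                        # local average M(f), equation (16)
    R = L - M                                      # spectral residual R(f), equation (18)
    T = np.abs(np.fft.ifft2(np.exp(R + 1j * P))) ** 2   # equation (19)
    return cv2.GaussianBlur(T, (3, 3), sigmaX=1)   # saliency map Sf, equation (20)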
Step 4: fuse the temporal, spatial and frequency-domain saliencies by linear weighting into a saliency confidence map, binarize it with a suitable threshold, and extract the moving targets from the aerial video.
1) Fuse the temporal, spatial and frequency-domain saliencies by linear weighting to obtain the saliency confidence map S(x,y):
S(x,y)=μ1St(x,y)+μ2Ss(x,y)+μ3Sf(x,y) (21)
where μi is the weight of each term; in this embodiment, μ1, μ2 and μ3 are 0.52, 0.2 and 0.28, respectively.
2) Binarize S(x,y) with a suitable threshold (0.2 in this embodiment) to obtain the binary image B, and find all 8-connected regions region_c1 in B; in this scheme, each connected region's area must lie between 20×20 and 200×200 pixels, and both its height-to-width and width-to-height ratios must be at most 5.
3) Set the circumscribed rectangles of the qualifying regions in region_c1 to 1 and find the 8-connected regions region_c2 of the result. Meanwhile, extract the edge map of the original input grayscale image with the Prewitt operator; if, at the position of a connected region of region_c2, more than 5 rows of the edge map each have a gray-value sum greater than 5, keep that region as region_c3.
4) Initialize a zero matrix of the same size as the original input image, set the positions corresponding to region_c3 to 1 and the rest to 0, obtaining the binary image Y1.
5) Apply a disk-shaped morphological closing with radius 7 to the binary image B to fill holes, obtaining Y2. The element-wise AND of Y1 and Y2 gives the final binary image Y.
6) Find all 8-connected regions region_cfinal in Y that meet the criterion; in this embodiment, the number of pixels of each connected region must be at least 0.6 times the area of its circumscribed rectangle. The positions of region_cfinal are the moving targets extracted from the aerial video.
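A condensed sketch of this fusion-and-filtering pipeline using the thresholds quoted above. Min-max normalizing the three maps before fusion, the |Gx|+|Gy| form of the Prewitt edge map, and the omission of the final 0.6 fill-ratio test on region_cfinal are our simplifications:

import numpy as np
import cv2

def detect_moving_targets(St, Ss, Sf, gray, mu=(0.52, 0.2, 0.28), thr=0.2):
    norm = lambda m: cv2.normalize(m.astype(np.float64), None, 0, 1, cv2.NORM_MINMAX)
    S = mu[0]*norm(St) + mu[1]*norm(Ss) + mu[2]*norm(Sf)   # equation (21)
    B = (S > thr).astype(np.uint8)                         # binary image B

    # region_c1: 8-connected components passing the size and aspect tests
    n, lab, stats, _ = cv2.connectedComponentsWithStats(B, connectivity=8)
    keep = np.zeros_like(B)
    for i in range(1, n):
        x, y, w, h, area = stats[i]
        if 400 <= area <= 40000 and max(w / h, h / w) <= 5:
            keep[y:y+h, x:x+w] = 1     # fill the circumscribed rectangle

    # region_c2 -> region_c3: edge-support test with the Prewitt operator
    kx = np.array([[-1, 0, 1]] * 3, np.float32)
    g = gray.astype(np.float32)
    edges = np.abs(cv2.filter2D(g, -1, kx)) + np.abs(cv2.filter2D(g, -1, kx.T))
    n2, lab2 = cv2.connectedComponents(keep, connectivity=8)
    Y1 = np.zeros_like(B)
    for i in range(1, n2):
        mask = lab2 == i
        row_sums = (edges * mask).sum(axis=1)
        if (row_sums > 5).sum() > 5:   # more than 5 rows with sum exceeding 5
            Y1[mask] = 1

    disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))  # radius-7 disk
    Y2 = cv2.morphologyEx(B, cv2.MORPH_CLOSE, disk)                # fill holes
    return cv2.bitwise_and(Y1, Y2)     # final binary image Y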

Claims (4)

1. An unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency saliency, characterized by comprising the following steps:
Step 1: extracting the temporal saliency of the video with the Lucas-Kanade optical flow method;
Step 2: extracting the spatial saliency of the image from its color distribution;
Step 3: converting the image from the spatial domain to the frequency domain and extracting its frequency-domain saliency with the spectral residual method;
Step 4: fusing the temporal, spatial and frequency-domain saliencies by linear weighting into a saliency confidence map, binarizing the confidence map with a set threshold, and extracting the moving targets from the aerial video; the specific steps are as follows:
1) fuse the temporal, spatial and frequency-domain saliencies by linear weighting to obtain the saliency confidence map S(x,y):
S(x,y) = μ1St(x,y) + μ2Ss(x,y) + μ3Sf(x,y)
where St(x,y) is the temporal saliency, Ss(x,y) the spatial saliency, Sf(x,y) the frequency-domain saliency, and μi the weight of each term;
2) binarize S(x,y) with a set threshold to obtain a binary image B, and find all 8-connected regions region_c1 in B;
3) set the circumscribed rectangles of the qualifying regions in region_c1 to 1 and find the 8-connected regions region_c2 of the result; meanwhile, extract the edge map of the original input grayscale image with the Prewitt operator; if, at the position of a connected region of region_c2, more than 5 rows of the edge map each have a gray-value sum exceeding 5, keep that region as region_c3;
4) initialize a zero matrix of the same size as the original input image, set the positions corresponding to region_c3 to 1 and the rest to 0, obtaining the binary image Y1;
5) apply a disk-shaped morphological closing with radius 7 to the binary image B to fill holes, obtaining Y2; the element-wise AND of Y1 and Y2 gives the final binary image Y;
6) find all qualifying 8-connected regions region_cfinal in Y; the positions of region_cfinal are the moving targets extracted from the aerial video.
2. The unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency saliency according to claim 1, characterized in that the specific steps of step 1 are as follows:
Step 11: normalize the optical flow direction map:
θ'(x,y) = (θi − min θ) / (max θ − min θ) × 255
where θi is the angle of the optical flow at point (x,y); a disk-shaped morphological closing with radius 3 is applied to the normalized direction map to obtain a gray map C;
Step 12: count how often each gray value from 0 to 255 occurs in C, compute the frequency of each gray value, and take its negative logarithm to obtain the directional saliency of the point:
Sd(x,y) = −log(Ni / N)
where Ni is the number of pixels with the same gray value as point (x,y), and N is the total number of pixels in C;
a temporal saliency map Sa based on the optical flow amplitude is obtained in the same way, with the amplitude normalized as follows and the remaining steps identical to the directional saliency:
a'(x,y) = (ai − min a) / (max a − min a) × 255
the final temporal saliency map St(x,y) is defined as the linear weighted sum of the amplitude-based and direction-based temporal saliency maps:
St(x,y) = w1Sa(x,y) + w2Sd(x,y).
3. The unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency saliency according to claim 1, characterized in that the specific steps of step 2 are as follows:
Step 21: traverse the 4-neighborhoods of the gray image starting from pixel (0,0); if the gray-value difference is below a threshold, assign the pixel to the same connected region, otherwise start a new connected region; repeat until the whole image has been traversed;
Step 22: compute the mean gray value of each connected region and assign it uniformly to all pixels of the region, obtaining the image M;
Step 23: count the number of pixels in each connected region of M, compute each region's frequency of occurrence, and take its negative logarithm to obtain the spatial saliency:
Ss(x,y) = −log(Nconnect(i) / Nconnect)
where Nconnect(i) is the number of pixels in the same connected region as point (x,y), and Nconnect is the total number of pixels in M.
4. The unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency saliency according to claim 1, characterized in that the specific steps of step 3 are as follows:
Step 31: given a gray image H(x,y), transform it from the spatial domain to the frequency domain with the two-dimensional discrete Fourier transform F, obtaining the frequency-domain representation F[H(x,y)];
Step 32: obtain the amplitude A(f) and phase P(f) of F[H(x,y)]:
A(f) = |F[H(x,y)]|
P(f) = φ(F[H(x,y)])
where |·| denotes the amplitude operation and φ(·) the phase operation;
Step 33: take the logarithm of the amplitude A(f) of F[H(x,y)] to obtain the log spectrum L(f):
L(f) = log(A(f))
Step 34: smooth the log spectrum with a local averaging filter hn(f):
M(f) = L(f) * hn(f)
where hn(f) is an n×n matrix whose entries are all equal, defined as:
hn(f) = (1/n²)·J, where J is the n×n all-ones matrix
Step 35: the difference between the log spectrum and its mean-filtered version is the spectral residual:
R(f) = L(f) − M(f)
Step 36: apply the two-dimensional inverse discrete Fourier transform to the spectral residual R(f) and the phase P(f) to convert them from the frequency domain back to the spatial domain:
T(x,y) = |F⁻¹[exp{R(f) + iP(f)}]|²
Step 37: reconstruct an image by Gaussian-filtering the spatial-domain result, representing the saliency of each pixel of the original image; this is the saliency map:
Sf(x,y) = T(x,y) * Gaussian.
CN201811552410.1A 2018-12-18 2018-12-18 Unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency significance Active CN109767454B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811552410.1A CN109767454B (en) 2018-12-18 2018-12-18 Unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency significance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811552410.1A CN109767454B (en) 2018-12-18 2018-12-18 Unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency significance

Publications (2)

Publication Number Publication Date
CN109767454A CN109767454A (en) 2019-05-17
CN109767454B true CN109767454B (en) 2022-05-10

Family

ID=66450293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811552410.1A Active CN109767454B (en) 2018-12-18 2018-12-18 Unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency significance

Country Status (1)

Country Link
CN (1) CN109767454B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110148149B (en) * 2019-05-20 2024-01-30 哈尔滨工业大学(威海) Hot wake segmentation method of underwater vehicle based on local contrast accumulation
CN111950549B (en) * 2020-08-12 2022-07-22 上海大学 Sea surface obstacle detection method based on fusion of sea antennas and visual saliency
CN112001991B (en) * 2020-10-27 2021-01-26 中国空气动力研究与发展中心高速空气动力研究所 High-speed wind tunnel dynamic oil flow map image processing method
CN113449658A (en) * 2021-07-05 2021-09-28 四川师范大学 Night video sequence significance detection method based on spatial domain, frequency domain and time domain
CN113591708B (en) * 2021-07-30 2023-06-23 金陵科技学院 Meteorological disaster monitoring method based on satellite-borne hyperspectral image
CN114511851B (en) * 2022-01-30 2023-04-04 中国南水北调集团中线有限公司 Hairspring algae cell statistical method based on microscope image
CN115861365B (en) * 2022-10-11 2023-08-15 海南大学 Moving object detection method, system, computer device and storage medium
CN116449332B (en) * 2023-06-14 2023-08-25 西安晟昕科技股份有限公司 Airspace target detection method based on MIMO radar

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101303727A (en) * 2008-07-08 2008-11-12 北京中星微电子有限公司 Intelligent management method based on video human number Stat. and system thereof
CN101634706A (en) * 2009-08-19 2010-01-27 西安电子科技大学 Method for automatically detecting bridge target in high-resolution SAR images
CN103077533A (en) * 2012-12-26 2013-05-01 中国科学技术大学 Method for positioning moving target based on frogeye visual characteristics
CN103075998A (en) * 2012-12-31 2013-05-01 华中科技大学 Monocular space target distance-measuring and angle-measuring method
CN103679196A (en) * 2013-12-05 2014-03-26 河海大学 Method for automatically classifying people and vehicles in video surveillance
CN104050477A (en) * 2014-06-27 2014-09-17 西北工业大学 Infrared image vehicle detection method based on auxiliary road information and significance detection
CN104777453A (en) * 2015-04-23 2015-07-15 西北工业大学 Wave beam domain time-frequency analysis method for warship line spectrum noise source positioning
CN105303571A (en) * 2015-10-23 2016-02-03 苏州大学 Time-space saliency detection method for video processing
CN108229487A (en) * 2016-12-12 2018-06-29 南京理工大学 A kind of conspicuousness detection method of combination spatial domain and frequency domain
CN107122715A (en) * 2017-03-29 2017-09-01 哈尔滨工程大学 It is a kind of based on frequency when conspicuousness combine moving target detecting method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Bing Xu et al.; Accurate Object Segmentation for Video Sequences via Temporal-Spatial-Frequency Saliency Model; IEEE Intelligent Systems; 2017-10-11; pp. 18-28 *
Yang ChangHui et al.; Overlapped fruit recognition for citrus harvesting robot in natural scenes; 2017 2nd International Conference on Robotics and Automation Engineering; 2018-02-15; pp. 398-402 *
Song Yao et al.; Vehicle abnormal behavior detection in traffic surveillance video; Video Application and Engineering; Dec. 2015; vol. 39, no. 14; pp. 107-111 *
Xiao Lijun et al.; Video moving object segmentation based on a symmetric difference algorithm; Journal of Jilin University (Science Edition); Jul. 2008; vol. 46, no. 4; pp. 691-696 *
Cai Jiali; Research on saliency-based moving object detection; China Master's Theses Full-text Database, Information Science and Technology; 2016-08-15; vol. 2016, no. 8; I138-991 *
Li Peng et al.; A moving target tracking framework based on visual saliency and enhanced feature point matching; Journal of Hunan University of Science and Technology (Natural Science Edition); Dec. 2017; vol. 32, no. 4; pp. 61-68 *

Also Published As

Publication number Publication date
CN109767454A (en) 2019-05-17

Similar Documents

Publication Publication Date Title
CN109767454B (en) Unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency significance
CN111862126B (en) Non-cooperative target relative pose estimation method combining deep learning and geometric algorithm
CN109272489B (en) Infrared weak and small target detection method based on background suppression and multi-scale local entropy
WO2018024030A1 (en) Saliency-based method for extracting road target from night vision infrared image
CN103325112B (en) Moving target method for quick in dynamic scene
WO2019042232A1 (en) Fast and robust multimodal remote sensing image matching method and system
CN109598794B (en) Construction method of three-dimensional GIS dynamic model
CN108446634B (en) Aircraft continuous tracking method based on combination of video analysis and positioning information
CN106683119B (en) Moving vehicle detection method based on aerial video image
CA2780595A1 (en) Method and multi-scale attention system for spatiotemporal change determination and object detection
CN104063711B (en) A kind of corridor end point fast algorithm of detecting based on K means methods
JP2014504410A (en) Detection and tracking of moving objects
Li et al. Road lane detection with gabor filters
CN105405138B (en) Waterborne target tracking based on conspicuousness detection
CN112446436A (en) Anti-fuzzy unmanned vehicle multi-target tracking method based on generation countermeasure network
CN110245600B (en) Unmanned aerial vehicle road detection method for self-adaptive initial quick stroke width
CN114187665A (en) Multi-person gait recognition method based on human body skeleton heat map
CN113608663B (en) Fingertip tracking method based on deep learning and K-curvature method
CN106780560A (en) A kind of feature based merges the bionic machine fish visual tracking method of particle filter
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
Zhang et al. Multiple Saliency Features Based Automatic Road Extraction from High‐Resolution Multispectral Satellite Images
CN113379789B (en) Moving target tracking method in complex environment
Zhang et al. Multi-FEAT: Multi-feature edge alignment for targetless camera-LiDAR calibration
CN111899278A (en) Unmanned aerial vehicle image rapid target tracking method based on mobile terminal
Zhang et al. An IR and visible image sequence automatic registration method based on optical flow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant