CN118229731A - Moving object detection method and device for jittering video - Google Patents

Moving object detection method and device for jittering video

Info

Publication number
CN118229731A
Authority
CN
China
Prior art keywords
image
distribution
point
feature
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410280416.7A
Other languages
Chinese (zh)
Inventor
贾振红
王创创
宋森森
周刚
石飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinjiang University
Original Assignee
Xinjiang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinjiang University filed Critical Xinjiang University
Priority to CN202410280416.7A
Publication of CN118229731A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a moving object detection method and device for jittered video, comprising the following steps: performing wavelet transformation on the image to extract high-frequency regions; traversing the high-frequency regions with a corner extraction algorithm to extract high-quality feature points; describing the extracted feature points; dividing the image into n image blocks based on resolution, performing feature matching on the corresponding image blocks, and fitting a motion matrix common to all feature points by the least squares method; performing inverse motion compensation on the image to obtain a stabilized image sequence; and performing Gaussian background modeling to extract the moving object. The device comprises: a processor and a memory. The method achieves real-time extraction of moving objects while stabilizing the video: frequency-domain analysis is introduced to extract salient regions, the ORB descriptor is enhanced with a feature description method based on deep learning to improve accuracy, a neighborhood query method is adopted to match feature points quickly, and image background modeling is performed with an adaptive number of Gaussian models, enabling fast extraction of moving objects.

Description

Moving object detection method and device for jittering video
Technical Field
The invention relates to the field of monitoring security, in particular to a moving object detection method and device for a jittering video.
Background
Moving object detection in intelligent video monitoring is a research focus. Many algorithms detect moving objects quickly and accurately in static scenes, but jitter interference is sometimes unavoidable, for example camera shake caused by wind or by large vehicles passing nearby. Because of the resulting severe noise interference, actual moving objects cannot be detected accurately in jittered surveillance video. It is therefore necessary to select an efficient and stable video stabilization algorithm to mitigate the impact of shaking on the accurate detection of moving objects.
Common video stabilization algorithms for static scenes mainly include the gray projection algorithm, the block matching algorithm and the feature point matching algorithm. The gray projection algorithm projects the gray image in the horizontal and vertical directions to obtain one-dimensional vectors in the two directions, then computes the correlation between the projection vectors of the jittered frame and the reference frame in the horizontal and vertical directions, and thus obtains the motion vectors of the image in the vertical and horizontal directions. With this method, a moving object in the image affects the calculation result: even when the scene itself is still, a moving object changes the projection curves, so the frame is wrongly judged as jittered; the algorithm is therefore unsuitable for application scenes with complex backgrounds. The block matching algorithm first divides the image into several sub-blocks, then searches the reference frame for the sub-block matching each sub-block of the current frame, and obtains the motion vector of the whole image from the motion vectors of all matched image blocks. To obtain more accurate motion parameters, the image blocks usually need to be designed as small as possible, which means more image blocks per image; since the block matching method requires searching and calculation for every block, the computational complexity can be quite high on large-scale or high-resolution images. The feature point matching algorithm first extracts points with distinct features on the reference frame and the current frame, describes them with descriptors, matches the feature points to obtain matched point pairs, screens out high-quality matched pairs, and obtains the motion matrix from these pairs; its computational efficiency is relatively high.
In jittered video, translation, rotation and scaling exist between images in the sequence; however, the gray projection algorithm and the block matching algorithm can only calculate motion vectors of the image in the vertical and horizontal directions, i.e. they can only acquire translation information and cannot acquire rotation and scaling information. Feature point matching algorithms, mainly SIFT, SURF and ORB, obtain the translation, rotation and scaling transformation of the image by extracting and matching points with distinctive features on the reference frame and the current frame and computing the motion vectors of the feature points. Although the SIFT and SURF algorithms perform well in matching accuracy, their long computation time makes real-time performance difficult to achieve; in comparison, the ORB algorithm, which combines the FAST and rBRIEF algorithms, can balance accuracy and speed in general scenes.
Disclosure of Invention
The invention provides a moving object detection method and device for jittered video. The method combines a video stabilization algorithm with a moving object detection algorithm and ensures real-time extraction of moving objects while stabilizing the video: frequency-domain analysis is introduced to extract salient regions; the ORB descriptor is enhanced with a feature description method based on deep learning, improving accuracy; a neighborhood query method is adopted to match feature points quickly; and image background modeling is performed with an adaptive number of Gaussian models, enabling fast extraction of moving objects. The method is described in detail below:
a moving object detection method for a jittered video comprises the following steps:
performing wavelet transformation on the image to extract a high-frequency region; traversing the high-frequency region by adopting a corner extraction algorithm, and extracting high-quality characteristic points; carrying out feature description on the extracted feature points;
Dividing an image into n image blocks based on resolution, performing feature matching on the corresponding image blocks, and fitting out a motion matrix common to all feature points by using a least square method;
Performing inverse motion compensation on the image to obtain a stable image sequence; and carrying out Gaussian background modeling, and extracting a moving target.
The image is subjected to wavelet transformation, and the high-frequency region is extracted as follows:
Dividing an image into non-overlapping 2×2 blocks, obtaining the high-frequency information of each image block in the horizontal, vertical and diagonal directions respectively by calculating differences of the pixel values within each block, and fusing all the high-frequency information to obtain the final feature extraction region;
HL=(P(n)-Q(n)+R(n)-S(n))/4
LH=(P(n)+Q(n)-R(n)-S(n))/4
HH=(P(n)-Q(n)-R(n)+S(n))/4
Where HL is a high-frequency component in the horizontal direction, LH is a high-frequency component in the vertical direction, HH is a high-frequency component in the diagonal direction, and P (n), Q (n), R (n), S (n) are gray values of pixel points in upper left, upper right, lower left, and lower right corners of the 2×2 image block, respectively.
The high-frequency region is traversed with a corner extraction algorithm, and high-quality feature points are extracted as follows: a 16×16 image patch centered on the feature point is created, and the gray centroid of the circular area around the feature point is calculated:
m00 = Σ(x,y) I(x,y), m10 = Σ(x,y) x·I(x,y), m01 = Σ(x,y) y·I(x,y)
Wherein: I(x,y) is the gray value at coordinates (x,y) and m00, m10, m01 are the moments; the centroid coordinates are:
C = (m10/m00, m01/m00)
By connecting the feature point and the centroid, the feature direction is obtained:
θ = arctan(m01/m10)
the image is divided into n image blocks based on resolution, and feature matching is performed on the corresponding image blocks as follows:
Searching for the point B most similar to point A using a brute-force matcher (BFMatcher); performing the reverse operation to find the point C most similar to point B, and if point C is point A, considering points A and B a correct matching pair; dividing the image into several image blocks of the same size according to the resolution, and, when matching feature points, comparing a feature point only with the feature points on the corresponding block of the target image.
The Gaussian background modeling is performed and the moving object is extracted as follows: background modeling is carried out on each pixel with a mixture of M Gaussian distributions:
p(x(t) | X_T, BG+FG) = Σ_{m=1..M} π_m · N(x(t); μ_m, σ_m²·I)
Wherein: x(t) is the pixel value at the current time, t is the time within the period, X_T is the image sequence within one period of the video, BG is the background, FG is the foreground, π_m is the Gaussian distribution weight, μ_m is the estimated Gaussian mean, and σ_m² is the estimated Gaussian variance. The distribution weights are updated recursively:
π_m ← π_m + α·(o_m(t) − π_m) − α·c_T
Wherein: α ≈ 1/T, o_m(t) is 1 for the distribution the new sample belongs to and 0 for the others, and at least C = 0.01·T sample values are required to support a distribution, so c_T = C/T = 0.01. When the Mahalanobis distance between a sample and a distribution is less than three standard deviations, the sample is considered to belong to that distribution; if no distribution close to the sample is found among all the distributions, a new distribution is created;
When the number of distributions in the model exceeds the maximum, the distribution with the smallest weight π_m is deleted. The Dirichlet prior with negative weights reduces the weights of distributions that cannot be matched to the pixel values, and a distribution is discarded when its weight becomes negative.
In a second aspect, a moving object detection apparatus for jittered video comprises: a processor and a memory, the memory having program instructions stored therein, and the processor invoking the program instructions stored in the memory to cause the apparatus to perform the method of any of the first aspect.
In a third aspect, a computer readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of any of the first aspects.
The technical scheme provided by the invention has the beneficial effects that:
1. according to the method, the high-frequency area required by extracting the characteristic points is calculated by wavelet transformation before extracting the characteristics, so that the calculated amount required by traversing the whole image is greatly reduced;
2. In order to solve the low-precision problem of the rBRIEF (rotation-aware BRIEF) descriptor, the invention adopts an enhanced efficient local image feature descriptor to improve the feature description stage of ORB;
3. In the feature matching stage, considering that the motion of the feature points caused by video jitter is reciprocating, i.e. the trajectory moves back and forth within a certain range, the invention provides a neighborhood query matching method, which greatly reduces the time traditionally wasted on full-image search during feature matching and has certain advantages in both accuracy and speed over existing methods.
Drawings
FIG. 1 is a flow chart of extracting salient regions of an image using the Haar wavelet transform;
FIG. 2 is a feature extraction schematic;
FIG. 3 is a schematic diagram of a feature description;
FIG. 4 is an original video frame;
Fig. 5 is a schematic diagram of a result of directly extracting a moving object in a jittered video;
Fig. 6 is a schematic diagram of the result of a moving object detection method for jittered video.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
Example 1
A moving object detection method for a jittered video, see fig. 1, the method comprising the steps of:
101: firstly, carrying out wavelet transformation on an image, and extracting a high-frequency region;
102: traversing the high-frequency region by adopting a corner extraction algorithm, and extracting high-quality characteristic points;
103: carrying out feature description on the extracted feature points;
104: dividing an image into n image blocks based on resolution, performing feature matching on the corresponding image blocks, and fitting out a motion matrix common to all feature points by using a least square method;
105: performing inverse motion compensation on the image to obtain a stable image sequence;
106: and carrying out Gaussian background modeling, and extracting a moving target.
In summary, through steps 101-106, the embodiment of the present invention matches feature points quickly with the neighborhood query method and performs image background modeling with an adaptive number of Gaussian models, thereby achieving fast extraction of moving objects.
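As an illustrative aid only (not the patented implementation itself), the flow of steps 101-106 can be sketched in Python with OpenCV and NumPy. The sketch substitutes OpenCV's stock ORB detector, brute-force matcher, RANSAC affine estimator and MOG2 background subtractor for the enhanced components detailed in Embodiment 2; the function name stabilize_and_detect is a hypothetical placeholder.

```python
# Minimal sketch of steps 101-106 under the assumptions stated above.
import cv2
import numpy as np

def stabilize_and_detect(frames):
    """frames: list of grayscale images; yields (stabilized_frame, foreground_mask)."""
    ref = frames[0]                                   # reference frame
    orb = cv2.ORB_create()                            # corner extraction + description (steps 102-103)
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    mog2 = cv2.createBackgroundSubtractorMOG2()       # adaptive GMM background model (step 106)

    kp_ref, des_ref = orb.detectAndCompute(ref, None)
    for cur in frames:
        kp_cur, des_cur = orb.detectAndCompute(cur, None)     # steps 101-103 (simplified)
        matches = bf.match(des_ref, des_cur)                   # step 104 (full-image matching here)
        src = np.float32([kp_cur[m.trainIdx].pt for m in matches])   # points in current frame
        dst = np.float32([kp_ref[m.queryIdx].pt for m in matches])   # points in reference frame
        M, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)     # motion matrix (step 104)
        if M is None:
            M = np.eye(2, 3, dtype=np.float32)                 # fall back to identity if fit fails
        stab = cv2.warpAffine(cur, M, (cur.shape[1], cur.shape[0]))  # step 105: inverse compensation
        fg = mog2.apply(stab)                                  # step 106: moving-object mask
        yield stab, fg
```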
Example 2
The scheme of embodiment 1 is further described below in conjunction with fig. 1-4, and is described in detail below:
Step 201: performing wavelet transformation on the image by using a Haar wavelet function to realize high-frequency region separation;
Specifically, in the embodiment of the invention, firstly, an image is divided into non-overlapped blocks with the size of 2×2, high-frequency information of the image blocks in horizontal, vertical and diagonal directions can be respectively obtained by calculating the difference of pixel values in each block, and all the high-frequency information is fused to obtain a final feature extraction area, and the detailed formulas (1) - (3) are shown as follows:
HL=(P(n)-Q(n)+R(n)-S(n))/4 (1)
LH=(P(n)+Q(n)-R(n)-S(n))/4 (2)
HH=(P(n)-Q(n)-R(n)+S(n))/4 (3)
Where HL is a high-frequency component in the horizontal direction, LH is a high-frequency component in the vertical direction, and HH is a high-frequency component in the diagonal direction. P (n), Q (n), R (n), S (n) are gray values of pixels at upper left, upper right, lower left, and lower right corners of the 2×2 image block, respectively.
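A minimal sketch of formulas (1)-(3) follows, assuming a grayscale image with even height and width; the fusion rule (summing the absolute high-frequency responses and thresholding) is an assumption, since the text does not specify how the components are fused.

```python
import numpy as np

def haar_high_freq_region(img, thresh=10.0):
    """Compute HL, LH, HH per 2x2 block (formulas (1)-(3)) and fuse them into a mask.
    img: 2D array with even height and width."""
    img = img.astype(np.float32)
    P = img[0::2, 0::2]   # upper-left pixel of each 2x2 block
    Q = img[0::2, 1::2]   # upper-right
    R = img[1::2, 0::2]   # lower-left
    S = img[1::2, 1::2]   # lower-right
    HL = (P - Q + R - S) / 4.0   # horizontal high-frequency component, formula (1)
    LH = (P + Q - R - S) / 4.0   # vertical high-frequency component, formula (2)
    HH = (P - Q - R + S) / 4.0   # diagonal high-frequency component, formula (3)
    # Assumed fusion rule: sum of absolute high-frequency energies, then threshold.
    energy = np.abs(HL) + np.abs(LH) + np.abs(HH)
    return energy > thresh       # True where feature extraction should be performed
```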
Step 202: feature extraction is performed using FAST (accelerated segment test feature point algorithm) in that one pixel in an image is taken as a center point, and the intensity value thereof is compared with the intensity values of 16 pixels around it, as shown in fig. 2. Calculating the number of all points whose difference is greater than a certain threshold, and if this number is large enough, the embodiment of the present invention considers this point to be a corner point, see formula (4):
wherein: i (x) is a candidate feature point pixel, namely a pixel of a central point selected by us, I (y) is a pixel difference threshold value set by 16 points around the pixel of the central point, N is the number that the difference value between the pixel values of the central point and the pixels around the central point is larger than the threshold value, and epsilon d is the set threshold value.
In jittered video, tilt transformations inevitably occur between images. In order to ensure that the feature points on the tilted image still successfully match the feature points on the reference frame, a feature direction is added for each feature point. First, a 16 x 16 image frame centered on a feature point is created and the gray centroid of the circular area around the feature point is calculated.
m00 = Σ_{(x,y)∈B} I(x,y) (5)
m10 = Σ_{(x,y)∈B} x·I(x,y) (6)
m01 = Σ_{(x,y)∈B} y·I(x,y) (7)
Wherein: I(x,y) is the gray value at coordinate (x,y), B is the image patch around the feature point, and m00, m10, m01 are the moments; the centroid coordinate is shown in formula (8):
C = (m10/m00, m01/m00) (8)
By connecting the feature point and the centroid, the feature direction is obtained, see formula (9):
θ = arctan(m01/m10) (9)
When generating the descriptor, it is first necessary to rotate the direction of the feature point to the horizontal direction. In this way, even if the image is rotated, the descriptors of the feature points can be generated in a uniform direction, thereby ensuring that the descriptors have directional consistency in the subsequent feature matching stage.
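A minimal sketch of the intensity-centroid orientation of formulas (5)-(9); the circular mask inside the 16×16 window and the use of arctan2 for a full-range angle are assumptions consistent with the description above.

```python
import numpy as np

def keypoint_orientation(img, cx, cy, half=8):
    """Gray-centroid orientation (formulas (5)-(9)) for a keypoint at (cx, cy).
    Uses a 16x16 window with a circular mask of radius `half` (assumed)."""
    patch = img[cy - half:cy + half, cx - half:cx + half].astype(np.float32)
    ys, xs = np.mgrid[-half:half, -half:half]          # coordinates relative to the keypoint
    mask = (xs ** 2 + ys ** 2) <= half ** 2            # circular area around the feature point
    m00 = np.sum(patch * mask)                         # formula (5)
    m10 = np.sum(xs * patch * mask)                    # formula (6)
    m01 = np.sum(ys * patch * mask)                    # formula (7)
    centroid = (m10 / m00, m01 / m00)                  # formula (8)
    theta = np.arctan2(m01, m10)                       # formula (9): feature direction in radians
    return centroid, theta
```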
Step 203: the descriptor is generated by calculating the average gray-scale difference between image blocks near the feature points, see formula (10) for details.
Where I (q), I (R) are the gray values at the pixels q, R, respectively, R (p, s) are cubes of size s centered on p, f (x: p1, p2, s) calculates the difference between the average gray values of the pixels in R (p 1, s) and R (p 2, s), which is determined by a threshold to obtain h (x), and finally binarizes the difference to obtain an m-bit binary descriptor, in a manner shown in fig. 3.
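A sketch of the binarization scheme of formula (10); the sampling pairs passed in are placeholders chosen by the caller, whereas in the enhanced descriptor described here they would typically be learned rather than fixed by hand.

```python
import numpy as np

def box_mean(img, x, y, s):
    """Average gray value in the s x s square region R((x, y), s)."""
    h = s // 2
    return float(np.mean(img[y - h:y + h + 1, x - h:x + h + 1]))

def binary_descriptor(img, kx, ky, pairs, s=5, thresh=0.0):
    """m-bit descriptor per formula (10): f = mean(R(p1, s)) - mean(R(p2, s)),
    binarized against a threshold. `pairs` is a list of ((dx1, dy1), (dx2, dy2))
    offsets relative to the keypoint (kx, ky)."""
    bits = []
    for (dx1, dy1), (dx2, dy2) in pairs:
        f = box_mean(img, kx + dx1, ky + dy1, s) - box_mean(img, kx + dx2, ky + dy2, s)
        bits.append(1 if f > thresh else 0)            # h(x): threshold the average gray difference
    return np.array(bits, dtype=np.uint8)
```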
Step 204: performing feature matching by using a cross matching strategy;
First, the point B most similar to point A is found using the BFMatcher (brute-force matcher) algorithm; then the reverse operation is performed to find the point C most similar to point B. If point C is point A, points A and B are considered a correct matching pair. This cross-matching method effectively reduces the mismatching rate. To improve matching efficiency, the image is divided into several image blocks of the same size according to the resolution; when matching feature points, a feature point only needs to be compared with the feature points on the corresponding block of the target image rather than with all feature points extracted from the whole image.
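A sketch of block-wise cross matching under these assumptions: the grid size is illustrative, and OpenCV's BFMatcher with crossCheck=True realizes the A→B / B→A consistency test.

```python
import cv2
import numpy as np

def block_cross_match(kp1, des1, kp2, des2, img_shape, grid=(4, 4)):
    """Match features block by block with cross-checking (mutual best match)."""
    h, w = img_shape[:2]
    bh, bw = h / grid[0], w / grid[1]
    block_id = lambda kp: (min(int(kp.pt[1] // bh), grid[0] - 1),
                           min(int(kp.pt[0] // bw), grid[1] - 1))
    # Group keypoint indices by the block they fall into.
    buckets1, buckets2 = {}, {}
    for i, kp in enumerate(kp1):
        buckets1.setdefault(block_id(kp), []).append(i)
    for j, kp in enumerate(kp2):
        buckets2.setdefault(block_id(kp), []).append(j)

    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)  # cross-check = A->B and B->A agree
    matches = []
    for blk, idx1 in buckets1.items():
        idx2 = buckets2.get(blk)
        if not idx2:
            continue
        d1, d2 = des1[idx1], des2[idx2]                     # descriptors in this block only
        for m in bf.match(d1, d2):
            # Map local indices back to the original keypoint lists.
            matches.append((idx1[m.queryIdx], idx2[m.trainIdx]))
    return matches
```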
Step 205: describing motion vectors between images using affine transformation matrices;
first, assume two vector spaces k and j:
k=(x,y) (11)
j=(x′,y′) (12)
if it is desired to change the vector space from k to j, the transformation can be performed by the following formula:
j=k*w+b (13)
wherein: w is a rotation transformation matrix, b is a translation matrix, and the above formula is split to obtain:
x′=w00*x+w01*y+b0 (14)
y′=w10*x+w11*y+b1 (15)
Then the above is converted into a matrix multiplication, see formula (16):
[x']   [w00  w01  b0] [x]
[y'] = [w10  w11  b1] [y]   (16)
[1 ]   [ 0    0    1] [1]
Wherein: w00, w01, w10, w11 describe the rotation transformation, b0, b1 describe the displacement transformation, and M is the affine transformation matrix formed by them.
The matrix contains six unknowns, so only three matching point pairs are needed to solve it; however, in view of noise and mismatches, more matching point pairs are needed to find a sufficiently accurate motion matrix. The embodiment of the invention uses the PROSAC (progressive sample consensus) algorithm to refine the matched points. When solving the motion matrix, the PROSAC algorithm first sorts the data points so that the most likely inliers come first, then initializes the parameters and samples from the first m data points. A minimal sample set is selected from these data points and a motion matrix is computed from it; the number of inliers satisfying the error threshold is then counted, the model is updated if more inliers are found, the algorithm returns to the sampling step if more iterations are needed, and finally the optimal motion matrix and inlier set are returned.
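A sketch of step 205 using OpenCV's robust affine estimation; cv2.estimateAffine2D with RANSAC is used here as a readily available stand-in for the PROSAC refinement described above (recent OpenCV builds also expose cv2.USAC_PROSAC as a method flag, but availability depends on the version). The reprojection threshold shown is an illustrative value, not one prescribed by the text.

```python
import cv2
import numpy as np

def estimate_motion_matrix(matched_pts_cur, matched_pts_ref):
    """Fit the 2x3 affine motion matrix of formula (16) (without the last row)
    from matched point pairs, rejecting outliers robustly."""
    src = np.asarray(matched_pts_cur, dtype=np.float32)   # points in the current (jittered) frame
    dst = np.asarray(matched_pts_ref, dtype=np.float32)   # corresponding points in the reference frame
    M, inliers = cv2.estimateAffine2D(src, dst,
                                      method=cv2.RANSAC,          # robust sampling, PROSAC-like in spirit
                                      ransacReprojThreshold=3.0)  # illustrative error threshold (pixels)
    return M, inliers

def compensate(frame, M):
    """Step 205 output used for step 105: warp the current frame onto the reference frame."""
    h, w = frame.shape[:2]
    return cv2.warpAffine(frame, M, (w, h))
```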
Step 206: and carrying out Gaussian background modeling, and extracting a moving target.
In Gaussian background modeling of an image, if a fixed number of Gaussian components is used, regions with complex variation may not be described adequately, while computing resources are wasted on stable regions. The embodiment of the invention therefore adopts an adaptive Gaussian mixture model (AGMM) that adjusts the number of Gaussian components to the image characteristics: the appropriate number of Gaussian models is determined by iterative updating together with the Bayesian Information Criterion (BIC); by computing the posterior probability of each sample under each Gaussian model and evaluating model complexity and goodness of fit with the BIC, the number of Gaussian components can be increased automatically or stopped. This adaptive approach achieves a balance between accuracy and computational efficiency.
First, background modeling is carried out on each pixel with a mixture of M Gaussian distributions, see formula (17):
p(x(t) | X_T, BG+FG) = Σ_{m=1..M} π_m · N(x(t); μ_m, σ_m²·I)   (17)
Wherein: x(t) is the pixel value at the current time, t is the time within the period, X_T is the image sequence within one period of the video, BG is the background, FG is the foreground, π_m is the Gaussian distribution weight, μ_m is the estimated Gaussian mean, and σ_m² is the estimated Gaussian variance; the covariance matrices are assumed diagonal, and the identity matrix I has the appropriate dimension. The Gaussian weights are updated recursively as shown in formula (18):
π_m ← π_m + α·(o_m(t) − π_m) − α·c_T   (18)
Wherein: T is the size of the training sample set at time t. The constant α describes an exponentially decaying envelope that limits the influence of old data and can be approximately understood as α ≈ 1/T. At least C = 0.01·T sample values are required to support a distribution, so c_T = C/T = 0.01. For a new sample, o_m(t) is set to 1 for the distribution it conforms to and 0 for the rest. When a sample lies within three standard deviations of a distribution, the sample is considered to belong to that distribution; if no distribution close to it is found among all the distributions, a new distribution is created. When the number of distributions in the model exceeds the maximum, the distribution with the smallest weight π_m is deleted. The Dirichlet prior with negative weights reduces the weights of distributions that cannot be matched to the pixel values, and a distribution is discarded when its weight becomes negative, thus realizing adaptivity of the number of Gaussian distributions.
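OpenCV's MOG2 background subtractor implements Zivkovic's adaptive Gaussian mixture model, whose weight update and negative-prior pruning correspond to formula (18) (it does not use the BIC criterion described above, so it is only a close stand-in). The sketch below applies it to the stabilized sequence; the history length, variance threshold and morphological clean-up are illustrative choices, not values prescribed here.

```python
import cv2

def extract_moving_objects(stabilized_frames, history=500, var_threshold=16.0):
    """Per-pixel adaptive-GMM background modeling on the stabilized sequence (step 206)."""
    mog2 = cv2.createBackgroundSubtractorMOG2(history=history,
                                              varThreshold=var_threshold,
                                              detectShadows=False)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    masks = []
    for frame in stabilized_frames:
        fg = mog2.apply(frame)                               # 255 = foreground (moving object)
        fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, kernel)    # suppress isolated noise pixels
        masks.append(fg)
    return masks
```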
Example 3
To verify the effectiveness and robustness of the method for video stabilization, it was tested on two sets of real jittered video data and compared with the currently popular SIFT, AKAZE, ORB, Qtree_ORB and SIRB algorithms. To demonstrate the stabilizing effect of the different algorithms intuitively, the average PSNR and SSIM of the stabilization results and the processing time per frame of each algorithm were compared on the two videos. Table 1 shows that, for the first video, the PSNR and SSIM of the method increased by 5.8% and 3.2% respectively, and the time decreased by 52.8%, compared with ORB. Moreover, both accuracy and speed far exceed the ORB-derived algorithms Qtree_ORB and SIRB. Although the accuracy is comparable to the high-accuracy SIFT and AKAZE algorithms, the required computation time is significantly lower. For the second video, the PSNR and SSIM values of the method are the highest of all the compared algorithms, and the time taken is much lower than that of all the compared algorithms; the PSNR and SSIM values improved by 9.2% and 5.4% respectively compared with ORB, and the time decreased by 48%. Considering accuracy and speed together, the method is superior to the five compared algorithms.
Example 4
A moving object detection apparatus for a jittered video, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor invoking the program instructions stored in the memory to cause the apparatus to perform the following method steps in embodiment 1:
performing wavelet transformation on the image to extract a high-frequency region; traversing the high-frequency region by adopting a corner extraction algorithm, and extracting high-quality characteristic points; carrying out feature description on the extracted feature points;
Dividing an image into n image blocks based on resolution, performing feature matching on the corresponding image blocks, and fitting out a motion matrix common to all feature points by using a least square method;
Performing inverse motion compensation on the image to obtain a stable image sequence; and carrying out Gaussian background modeling, and extracting a moving target.
The image is subjected to wavelet transformation, and the high-frequency region is extracted as follows:
Dividing the image into non-overlapping 2×2 blocks, obtaining the high-frequency information of each image block in the horizontal, vertical and diagonal directions respectively by calculating differences of the pixel values within each block, and fusing all the high-frequency information to obtain the final feature extraction region,
HL=(P(n)-Q(n)+R(n)-S(n))/4
LH=(P(n)+Q(n)-R(n)-S(n))/4
HH=(P(n)-Q(n)-R(n)+S(n))/4
Where HL is a high-frequency component in the horizontal direction, LH is a high-frequency component in the vertical direction, HH is a high-frequency component in the diagonal direction, and P (n), Q (n), R (n), S (n) are gray values of pixel points in upper left, upper right, lower left, and lower right corners of the 2×2 image block, respectively.
The high-frequency region is traversed with a corner extraction algorithm, and high-quality feature points are extracted as follows: a 16×16 image patch centered on the feature point is created, and the gray centroid of the circular area around the feature point is calculated:
m00 = Σ(x,y) I(x,y), m10 = Σ(x,y) x·I(x,y), m01 = Σ(x,y) y·I(x,y)
Wherein: I(x,y) is the gray value at coordinates (x,y) and m00, m10, m01 are the moments; the centroid coordinates are:
C = (m10/m00, m01/m00)
By connecting the feature point and the centroid, the feature direction is obtained:
θ = arctan(m01/m10)
The dividing of the image into n image blocks based on resolution and the feature matching on the corresponding image blocks are as follows:
Searching for the point B most similar to point A using a brute-force matcher (BFMatcher); performing the reverse operation to find the point C most similar to point B, and if point C is point A, considering points A and B a correct matching pair; dividing the image into several image blocks of the same size according to the resolution, and, when matching feature points, comparing a feature point only with the feature points on the corresponding block of the target image.
The Gaussian background modeling is performed and the moving object is extracted as follows: background modeling is carried out on each pixel with a mixture of M Gaussian distributions:
p(x(t) | X_T, BG+FG) = Σ_{m=1..M} π_m · N(x(t); μ_m, σ_m²·I)
Wherein: x(t) is the pixel value at the current time, t is the time within the period, X_T is the image sequence within one period of the video, BG is the background, FG is the foreground, π_m is the Gaussian distribution weight, μ_m is the estimated Gaussian mean, and σ_m² is the estimated Gaussian variance. The distribution weights are updated recursively:
π_m ← π_m + α·(o_m(t) − π_m) − α·c_T
Wherein: α ≈ 1/T, o_m(t) is 1 for the distribution the new sample belongs to and 0 for the others, and at least C = 0.01·T sample values are required to support a distribution, so c_T = C/T = 0.01. When the Mahalanobis distance between a sample and a distribution is less than three standard deviations, the sample is considered to belong to that distribution; if no distribution close to the sample is found among all the distributions, a new distribution is created;
When the number of distributions in the model exceeds the maximum, the distribution with the smallest weight π_m is deleted. The Dirichlet prior with negative weights reduces the weights of distributions that cannot be matched to the pixel values, and a distribution is discarded when its weight becomes negative.
It should be noted that, the device descriptions in the above embodiments correspond to the method descriptions in the embodiments, and the embodiments of the present invention are not described herein in detail.
The processor and the memory may be carried by any device with computing capability, such as a computer, a single-chip microcomputer or a microcontroller; the embodiment of the invention does not limit the execution body, which is selected according to the needs of the practical application. Data signals are transmitted between the memory and the processor through a bus, which is not described in detail in the embodiments of the present invention.
Based on the same inventive concept, the embodiment of the present invention also provides a computer readable storage medium, where the storage medium includes a stored program, and when the program runs, the device where the storage medium is controlled to execute the method steps in the above embodiment. The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.
It should be noted that the readable storage medium descriptions in the above embodiments correspond to the method descriptions in the embodiments, and the embodiments of the present invention are not described herein.
In the above embodiments, the method may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the invention are produced in whole or in part.
The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted across a computer-readable storage medium. Computer readable storage media can be any available media that can be accessed by a computer or data storage devices, such as servers, data centers, etc., that contain an integration of one or more available media. The usable medium may be a magnetic medium or a semiconductor medium, or the like. The embodiment of the invention does not limit the types of other devices except the types of the devices, so long as the devices can complete the functions.
Those skilled in the art will appreciate that the drawings are schematic representations of only one preferred embodiment, and that the above-described embodiment numbers are merely for illustration purposes and do not represent advantages or disadvantages of the embodiments.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (7)

1. A moving object detection method for a jittered video, the method comprising the steps of:
performing wavelet transformation on the image to extract a high-frequency region; traversing the high-frequency region by adopting a corner extraction algorithm, and extracting high-quality characteristic points; carrying out feature description on the extracted feature points;
Dividing an image into n image blocks based on resolution, performing feature matching on the corresponding image blocks, and fitting out a motion matrix common to all feature points by using a least square method;
Performing inverse motion compensation on the image to obtain a stable image sequence; and carrying out Gaussian background modeling, and extracting a moving target.
2. The moving object detection method for a jittered video according to claim 1, wherein the wavelet transforming the image to extract a high frequency region is:
Dividing an image into non-overlapping 2×2 blocks, obtaining the high-frequency information of each image block in the horizontal, vertical and diagonal directions respectively by calculating differences of the pixel values within each block, and fusing all the high-frequency information to obtain the final feature extraction region;
HL=(P(n)-Q(n)+R(n)-S(n))/4
LH=(P(n)+Q(n)-R(n)-S(n))/4
HH=(P(n)-Q(n)-R(n)+S(n))/4
Where HL is a high-frequency component in the horizontal direction, LH is a high-frequency component in the vertical direction, HH is a high-frequency component in the diagonal direction, and P (n), Q (n), R (n), S (n) are gray values of pixel points in upper left, upper right, lower left, and lower right corners of the 2×2 image block, respectively.
3. The method for detecting a moving object for a jittered video according to claim 1, wherein the traversing of the high-frequency region with a corner extraction algorithm and the extraction of high-quality feature points are as follows: creating a 16×16 image patch centered on the feature point, and calculating the gray centroid of the circular area around the feature point:
m00 = Σ(x,y) I(x,y), m10 = Σ(x,y) x·I(x,y), m01 = Σ(x,y) y·I(x,y)
Wherein: I(x,y) is the gray value at coordinates (x,y) and m00, m10, m01 are the moments; the centroid coordinates are:
C = (m10/m00, m01/m00)
By connecting the feature point and the centroid, the feature direction is obtained:
θ = arctan(m01/m10)
4. The method for detecting a moving object for a jittered video according to claim 1, wherein the dividing the image into n image blocks based on resolution, and performing feature matching on the corresponding image blocks comprises:
Searching for the point B most similar to point A using a brute-force matcher (BFMatcher); performing the reverse operation to find the point C most similar to point B, and if point C is point A, considering points A and B a correct matching pair; dividing the image into several image blocks of the same size according to the resolution, and, when matching feature points, comparing a feature point only with the feature points on the corresponding block of the target image.
5. The method for detecting a moving object for a jittered video according to claim 1, wherein the Gaussian background modeling and the extraction of the moving object are as follows:
background modeling is carried out on each pixel with a mixture of M Gaussian distributions:
p(x(t) | X_T, BG+FG) = Σ_{m=1..M} π_m · N(x(t); μ_m, σ_m²·I)
Wherein: x(t) is the pixel value at the current time, t is the time within the period, X_T is the image sequence within one period of the video, BG is the background, FG is the foreground, π_m is the Gaussian distribution weight, μ_m is the estimated Gaussian mean, and σ_m² is the estimated Gaussian variance. The distribution weights are updated recursively:
π_m ← π_m + α·(o_m(t) − π_m) − α·c_T
Wherein: α ≈ 1/T, o_m(t) is 1 for the distribution the new sample belongs to and 0 for the others, and at least C = 0.01·T sample values are required to support a distribution, so c_T = C/T = 0.01. When the Mahalanobis distance between a sample and a distribution is less than three standard deviations, the sample is considered to belong to that distribution; if no distribution close to the sample is found among all the distributions, a new distribution is created;
When the number of distributions in the model exceeds the maximum, the distribution with the smallest weight π_m is deleted. The Dirichlet prior with negative weights reduces the weights of distributions that cannot be matched to the pixel values, and a distribution is discarded when its weight becomes negative.
6. A moving object detection apparatus for a jittered video, the apparatus comprising: a processor and a memory, the memory having program instructions stored therein, and the processor invoking the program instructions stored in the memory to cause the apparatus to perform the method of any one of claims 1-5.
7. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-5.
CN202410280416.7A 2024-03-12 2024-03-12 Moving object detection method and device for jittering video Pending CN118229731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410280416.7A CN118229731A (en) 2024-03-12 2024-03-12 Moving object detection method and device for jittering video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410280416.7A CN118229731A (en) 2024-03-12 2024-03-12 Moving object detection method and device for jittering video

Publications (1)

Publication Number Publication Date
CN118229731A true CN118229731A (en) 2024-06-21

Family

ID=91508221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410280416.7A Pending CN118229731A (en) 2024-03-12 2024-03-12 Moving object detection method and device for jittering video

Country Status (1)

Country Link
CN (1) CN118229731A (en)

Similar Documents

Publication Publication Date Title
Zhang et al. How to fully exploit the abilities of aerial image detectors
KR101117837B1 (en) Multi-image feature matching using multi-scale oriented patches
US9396539B2 (en) Methods and apparatuses for face detection
CN112184759A (en) Moving target detection and tracking method and system based on video
CN105427333B (en) Real-time Registration, system and the camera terminal of video sequence image
CN111709435A (en) Countermeasure sample generation method based on discrete wavelet transform
CN112200257B (en) Method and device for generating confrontation sample
Wang Adapted center and scale prediction: more stable and more accurate
WO2017168462A1 (en) An image processing device, an image processing method, and computer-readable recording medium
CN111739064A (en) Method for tracking target in video, storage device and control device
Choi et al. Robust video stabilization to outlier motion using adaptive RANSAC
CN113744307A (en) Image feature point tracking method and system based on threshold dynamic adjustment
CN106651918B (en) Foreground extraction method under shaking background
CN110322476B (en) Target tracking method for improving STC and SURF feature joint optimization
CN118229731A (en) Moving object detection method and device for jittering video
TWI639135B (en) Restoration method for blurred image
Guo et al. Effects of blur and deblurring to visual object tracking
CN112419407B (en) Cloud cluster displacement vector calculation method and device based on cloud cluster edge recognition
De Ath et al. Part-based tracking by sampling
CN114820987A (en) Three-dimensional reconstruction method and system based on multi-view image sequence
CN113283469A (en) Graph embedding unsupervised feature learning method for three-dimensional model retrieval based on view
Hu et al. Digital video stabilization based on multilayer gray projection
CN113489896B (en) Video image stabilizing method capable of robustly predicting global motion estimation
WO2021180295A1 (en) Methods and apparatuses for detecting a change in an area of interest
Magdeev et al. Improving the efficiency of the method of stochastic gradient identification of objects in binary and rayscale images due to their pre-processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination