CN112907621A - Moving object extraction method based on difference and semantic information fusion - Google Patents

Moving object extraction method based on difference and semantic information fusion Download PDF

Info

Publication number
CN112907621A
Authority
CN
China
Prior art keywords
difference
target
semantic information
frame
moving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011439962.9A
Other languages
Chinese (zh)
Other versions
CN112907621B (en)
Inventor
谢巍
卢永辉
周延
许练濠
吴伟林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202011439962.9A priority Critical patent/CN112907621B/en
Publication of CN112907621A publication Critical patent/CN112907621A/en
Application granted granted Critical
Publication of CN112907621B publication Critical patent/CN112907621B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/215Motion-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/12Edge-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20036Morphological image processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a moving target extraction method based on difference and semantic information fusion, which mainly comprises the following steps: (1) acquiring an N-frame image sequence from monitoring equipment; (2) calculating difference information between image frames by using an inter-frame difference method on two image frames that are N frames apart; (3) extracting semantic information in the image by using a trained instance segmentation model based on a convolutional neural network, the semantic information comprising the target class and a pixel mask; (4) combining the difference information and the semantic information through a fusion algorithm to extract the moving targets in the image. The method introduces strong semantic information through the convolutional-neural-network-based instance segmentation model and, combined with the difference information obtained by the inter-frame difference method, can extract moving targets in the image well. The method is simple to implement and has good robustness.

Description

Moving object extraction method based on difference and semantic information fusion
Technical Field
The invention relates to the field of digital image processing and computer vision, in particular to a moving object extraction method based on difference and semantic information fusion.
Background
Analysis of moving objects has long been one of the important research topics in computer vision and is widely applied in production and daily life. The most common application scenario is the analysis of surveillance video, which usually pays little attention to static targets but must focus on moving targets, because moving targets are likely to have an important influence on production and living activities in the current monitored scene. At present, moving targets are mainly analyzed with the background difference (background subtraction) method and the inter-frame difference method (Jodoin, Pierre-Marc, et al. Comparative study of background subtraction algorithms [J]. Journal of Electronic Imaging, 2010, 19(3): 033003). The background difference method needs to model the background, but the same modeling approach is difficult to apply across many scenarios, and building the background model and updating it in subsequent processing require a large amount of computation, which makes the method cumbersome to use.
Thanks to the rapid development of deep learning and the exponential growth of computing hardware, convolutional neural networks have achieved remarkable success in all application fields of computer vision. A convolutional neural network has millions of parameters, which enables it to extract feature information at various levels in an image, including low-level texture information and high-level semantic information. The low-level texture information is the basis on which the network correctly predicts and extracts the high-level semantic information, while the high-level semantic information gives the network's predictions continuity and integrity. However, most convolutional neural networks currently make predictions on a single picture, so it is difficult for them to acquire motion information in the image (Braham M, Piérard S, Van Droogenbroeck M. Semantic Background Subtraction. IEEE International Conference on Image Processing, 2018).
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a moving object extraction method based on difference and semantic information fusion. The method comprises the steps of obtaining semantic information in an image by using a convolutional neural network, obtaining difference information in the image by using an interframe difference method, combining the semantic information and the difference information by using a fusion algorithm, and finally extracting a moving target in the image. The method combines the semantic information extracted by the convolutional neural network and the motion information obtained by the interframe difference method, can well extract the motion target in the scene, and has simple implementation method and good robustness.
A moving object extraction method based on difference and semantic information fusion comprises the following steps:
s1, acquiring an N-frame image sequence from the monitoring equipment;
S2, calculating difference information between image frames by using an inter-frame difference method on two image frames that are N frames apart;
S3, extracting semantic information in the image by using a trained instance segmentation model based on a convolutional neural network, wherein the semantic information comprises a target class and a pixel mask;
and S4, combining the difference information and the semantic information through a fusion algorithm, and extracting the moving object in the image.
Preferably, the image frames acquired in step S1 are all 24-bit RGB images of 3 channels, the length of the image sequence needs to be kept at N, and all the image frames need to be filtered by a gaussian kernel, which is expressed by the following expression:
G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))
Preferably, in the step S1, when the image sequence is initialized, all image frames in the sequence adopt the 1st frame; as the number of acquired frames increases, replacement begins and proceeds on a first-in-first-out (FIFO) basis, i.e., the T-th frame replaces the (T-N)-th frame, the (T+1)-th frame replaces the (T-N+1)-th frame, and so on, with T > N.
The length N of the image sequence in step S1 is adjusted according to a specific application scenario.
Preferably, the input of the inter-frame difference method applied in step S2 is the T-N frame and the T-th frame, and the inter-frame difference method specifically includes the following steps:
s21, converting the two frame images from the RGB images into gray-scale images respectively, wherein the conversion formula is as follows:
gray=0.299*R+0.587*G+0.114*B
wherein R, G, B represent the three color channels of a color image, respectively;
s22, subtracting the gray values of the corresponding positions of the two frames of images, and then taking the absolute value to obtain a difference result, wherein the formula is as follows:
dif(x,y)=|src(x,y)-dst(x,y)|
where src (x, y) represents a pixel value at coordinate (x, y) on the grayscale map of the reference frame, and dst (x, y) represents a pixel value at coordinate (x, y) on the grayscale map of the current frame.
Preferably, the input size of the convolutional neural network used in step S3 is 416 × 416 × 3;
The convolutional neural network comprises a backbone network, a feature pyramid network and 3 detection heads, wherein the backbone network consists of: a basic convolutional layer, a down-sampling layer, a residual module, a down-sampling layer, 2 residual modules connected in series, a down-sampling layer, 4 residual modules connected in series, a down-sampling layer and 2 residual modules connected in series;
the structure of the feature pyramid network (FPN, Feature Pyramid Network) from bottom to top is: a cascade (concatenation) structure, a basic convolutional layer, an upsampling layer, a cascade structure, a basic convolutional layer, an upsampling layer and a cascade structure; the backbone network and the feature pyramid network are combined through 3 lateral connections, each lateral connection consisting of a basic convolutional layer, and the feature map sizes must match when connecting, i.e. feature maps of the same size from the backbone network and the feature pyramid network are joined by one lateral connection;
the 3 detection heads have the same structure and parameters, and each detection head comprises two prediction branches; one prediction branch corresponds to target class prediction, predicting on each grid of the feature map the target class that the current grid is responsible for, with output dimension S × C, where S denotes the feature map size of the current detection head and C denotes the total number of classes to be predicted, this branch consisting of basic convolutional layers; the other prediction branch corresponds to target pixel mask prediction, predicting on each grid of the feature map the position mask of the target that the current grid is responsible for, with output dimension H × W × S², where S denotes the feature map size of the current detection head and H and W denote the height and width of the input picture, respectively, this branch consisting of an upsampling layer and a basic convolutional layer.
Preferably, the upsampling layer in the FPN structure is a resize function by using a nearest neighbor difference method;
the upsampling layer in the branch is masked at the position of the detection header by a Transposed convolution (Transposed convolution).
Preferably, the fusion algorithm in step S4 includes the following steps:
s41, performing morphological filtering on the difference result, and then performing binarization to obtain a moving pixel mask;
s42, performing channel separation on the target mask in the segmentation result, and performing binarization to obtain a target pixel mask;
s43, respectively calculating the proportion P of the moving pixels in each target pixel mask;
S44, comparing the moving-pixel ratio P with a preset ratio threshold T; if P > T, the current target is judged to be a moving target;
in the step S41, the morphological filtering is an opening (OPEN) operation, i.e., an erosion (ERODE) operation followed by a dilation (DILATE) operation;
binarization is performed after the morphological filtering, with a value thresh close to 0 selected as the binarization threshold; the motion pixel mask obtained after binarization contains only two values, 0 and 255 (8-bit unsigned integers), and the binarization formula is as follows:
bin(x, y) = 255, if dif(x, y) > thresh
bin(x, y) = 0, otherwise
where dif (x, y) denotes the difference result obtained in step S22.
Preferably, the channel separation operation in step S42 is to divide all target pixel masks predicted by the model one by one in the channel dimension to obtain a single-channel grayscale target pixel mask;
binarization is performed after channel separation with a threshold of 1, and the target mask obtained after binarization contains only two values, 0 and 255 (8-bit unsigned integers).
Preferably, the step of calculating the ratio P of the moving pixels in the target pixel mask in step S43 is as follows:
Firstly, calculate the number n₁ of pixels whose value is 255 in the target pixel mask;
Secondly, calculate the number n₂ of pixels whose value is 255 in both the target pixel mask and the motion pixel mask;
Finally, calculate the moving-pixel ratio P in the target pixel mask by the following formula:
P = n₂ / n₁
Preferably, in the step S44, the tolerance of the algorithm to environmental noise can be adjusted via the moving-pixel ratio threshold T in the target pixel mask: the larger T is, the less sensitive the algorithm is to environmental noise. If P > T, the current target is determined to be a moving target; otherwise, it is a non-moving target.
Compared with the prior art, the invention has the following advantages and effects:
(1) the invention relates to a moving target extraction method based on difference and semantic information fusion, which comprises the steps of firstly obtaining an N-frame image sequence from monitoring equipment; calculating motion pixels in the image by using an inter-frame difference method according to the two frames of images with the interval of N to obtain a difference result; segmenting target object pixels in the image by using the trained convolutional neural network model to obtain segmentation results, wherein the segmentation results comprise a target category and a pixel mask; and combining the difference result and the segmentation result through a fusion algorithm, wherein the result is the extracted moving target. Therefore, the method and the device can acquire accurate motion information and semantic information from the image at the same time, can well extract the motion target in the scene, and provide effective technical support for guaranteeing the safety of production and living activities in the scene.
(2) The method combines the artificial intelligence technology, utilizes the convolutional neural network to carry out target segmentation on the input RGB image, can obtain the pixel-level accurate position mask of the target, and has stronger semantic information, continuity and integrity.
(3) In the method, the inter-frame difference method compensates for the inability of the convolutional neural network to acquire motion information in the image; it is simple to implement, computationally light, and conveniently and accurately acquires pixel-level motion information in the image.
(4) The method combines the motion information acquired by the interframe difference method and the semantic information acquired by the convolutional neural network through a fusion algorithm, so that the method has the advantages of interframe difference and convolutional neural network, not only ensures the effect of extracting the moving target, but also is simple to implement and has good robustness.
Drawings
FIG. 1 is a flow chart of a moving object extraction method based on a convolutional neural network and interframe difference according to the present embodiment;
FIG. 2 is a flow chart of the fusion algorithm of the present embodiment;
FIG. 3 is a schematic diagram illustrating the effect of the interframe difference method in this embodiment;
FIG. 4 is a diagram of a convolutional neural network structure employed for segmentation in the present embodiment;
FIG. 5 is a diagram of a detection branch structure of a convolutional neural network employed for segmentation in the present embodiment;
fig. 6 is a schematic diagram of the moving object extraction result in the present embodiment.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
The embodiment discloses a moving object extraction method based on difference and semantic information fusion, as shown in fig. 1, comprising the following steps:
s1, acquiring an N-frame image sequence from the zoom dome camera;
The image frames obtained from the zoom dome camera are all 3-channel, 24-bit RGB images, and the length of the image sequence needs to be kept at N. In order to reduce the effect of noise on the subsequent steps, all acquired image frames are first Gaussian-filtered. The Gaussian kernel expression used is as follows:
G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))
When the image sequence is initialized, all image frames in the sequence take the 1st frame. As more frames are acquired, replacement begins and follows a first-in-first-out principle: the t-th frame (t > N) replaces the (t-N)-th frame, the (t+1)-th frame replaces the (t-N+1)-th frame, and so on.
The length N of the image sequence may be adjusted according to the specific application scenario. When the moving speed of the moving target to be extracted is large (≥ 50 pixels per frame), N is set between 1 and 5; when the moving speed is small (< 50 pixels per frame), N is set between 5 and 15. A common value is N = 5.
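By way of illustration only, the following minimal Python/OpenCV sketch shows one possible way to maintain the N-frame sequence with Gaussian pre-filtering and first-in-first-out replacement; the deque container, kernel size and σ are assumptions rather than part of the described embodiment.

```python
from collections import deque
import cv2

N = 5                       # sequence length; a common value per the description
KSIZE, SIGMA = (5, 5), 1.0  # assumed Gaussian kernel size and sigma

frame_buffer = deque(maxlen=N)  # FIFO: appending when full drops the oldest frame

def push_frame(frame_bgr):
    """Gaussian-filter an incoming 24-bit colour frame and insert it into the sequence."""
    smoothed = cv2.GaussianBlur(frame_bgr, KSIZE, SIGMA)
    if not frame_buffer:
        # at initialization, fill the whole sequence with the 1st frame
        frame_buffer.extend([smoothed] * N)
    else:
        frame_buffer.append(smoothed)       # frame t replaces frame t-N (first in, first out)
    # return the (t-N)-th and t-th frames for the inter-frame difference step
    return frame_buffer[0], frame_buffer[-1]
```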
S2, calculating difference information between image frames by using an inter-frame difference method according to two image frames with an interval of N;
The inputs of the inter-frame difference method are the (t-N)-th frame and the t-th frame. The inter-frame difference method specifically comprises the following steps:
s21, converting the two frame images from the RGB images into gray-scale images respectively, wherein the conversion formula is as follows:
gray=0.299*R+0.587*G+0.114*B
wherein R, G, B represent the three color channels of a color image, respectively.
And S22, subtracting the gray values of the corresponding positions of the two frames of images, and then taking the absolute value to obtain a difference result. The implementation formula is as follows:
dif(x,y)=|src(x,y)-dst(x,y)|
where src (x, y) represents a pixel value at coordinate (x, y) on the grayscale map of the reference frame, and dst (x, y) represents a pixel value at coordinate (x, y) on the grayscale map of the current frame.
When N is 1, the effect of the inter-frame difference method is schematically shown in fig. 3.
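As a non-authoritative sketch of steps S21–S22, the grayscale conversion and absolute difference can be expressed with OpenCV as follows; cv2.cvtColor applies essentially the 0.299/0.587/0.114 weighting given above, and the frame arguments are placeholders.

```python
import cv2

def frame_difference(ref_bgr, cur_bgr):
    """Inter-frame difference between the reference ((t-N)-th) frame and the current (t-th) frame."""
    # S21: convert to grayscale, gray = 0.299*R + 0.587*G + 0.114*B
    src = cv2.cvtColor(ref_bgr, cv2.COLOR_BGR2GRAY)
    dst = cv2.cvtColor(cur_bgr, cv2.COLOR_BGR2GRAY)
    # S22: dif(x, y) = |src(x, y) - dst(x, y)|
    dif = cv2.absdiff(src, dst)
    return dif
```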
S3, extracting semantic information in the image by using the trained example segmentation model based on the convolutional neural network, wherein the semantic information comprises a target class and a pixel mask;
The convolutional neural network shown in fig. 4 uses an input size of 416 × 416 × 3. Its basic building blocks are the basic convolutional layer (Basic conv), the residual module (Res), the down-sampling layer (Down sample), the up-sampling layer (Up sample) and the cascade (concatenation) structure.
The convolutional neural network comprises a backbone network, a feature pyramid network and 3 detection heads (Head), wherein the backbone network consists of: a basic convolutional layer, a down-sampling layer, a residual module, a down-sampling layer, 2 residual modules connected in series, a down-sampling layer, 4 residual modules connected in series, a down-sampling layer and 2 residual modules connected in series;
The structure of the feature pyramid network (FPN, Feature Pyramid Network) from bottom to top (from the smaller feature map to the larger feature map) is: a cascade (concatenation) structure, a basic convolutional layer, an upsampling layer, a cascade structure, a basic convolutional layer, an upsampling layer and a cascade structure. The backbone network and the feature pyramid network are combined through 3 lateral connections, each consisting of a basic convolutional layer; the feature map sizes must match when connecting, i.e. feature maps of the same size from the backbone network and the feature pyramid network are joined by one lateral connection.
The 3 detection heads have the same structure and parameters. Each detection head comprises two prediction branches; one of the prediction branches corresponds to target category prediction, the target category of which the current grid is responsible for prediction is predicted on each grid of the characteristic diagram, the output dimension (Class) is S multiplied by C, wherein S represents the size of the characteristic diagram of the current detection head, C represents the total category number needing prediction, and the prediction branch is composed of a basic convolutional layer; the other prediction branch corresponds to target pixel Mask prediction, a position Mask of a current grid responsible for predicting a target is predicted on each grid of the characteristic diagram, and the output dimension (Mask) is H multiplied by W multiplied by S2Where S denotes a feature map size of a current detection head, H and W denote a height and a width of an input picture, respectively, and the prediction branch is composed of an upsampled layer and a base convolutional layer.
The upsampling layers are implemented in two ways: the upsampling layers in the FPN structure are implemented by a resize function (nearest-neighbor interpolation), while the upsampling layer in the mask branch of the detection head is implemented by transposed convolution (Transposed Convolution).
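The complete network of fig. 4 and fig. 5 is not reproduced here; the following PyTorch sketch only illustrates the shape of a single detection head as described above — a class branch built from basic convolutions that predicts the class scores on each grid cell, and a mask branch that upsamples by transposed convolution to the input resolution and outputs one H × W mask per grid cell. The channel counts, layer counts and the composition of the basic convolutional layer are assumptions.

```python
import torch
import torch.nn as nn

class BasicConv(nn.Module):
    """Basic convolutional layer: conv + batch norm + LeakyReLU (an assumed composition)."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class DetectionHead(nn.Module):
    """One of the 3 heads: a class branch and a position-mask branch."""
    def __init__(self, in_ch, S, C, input_size=416):
        super().__init__()
        scale = input_size // S                        # upsampling factor from S x S to H x W
        self.cls_branch = nn.Sequential(               # predicts C class scores per grid cell
            BasicConv(in_ch, in_ch),
            nn.Conv2d(in_ch, C, kernel_size=1),
        )
        self.mask_branch = nn.Sequential(              # transposed-conv upsampling + basic conv
            nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=scale, stride=scale),
            BasicConv(in_ch // 2, in_ch // 2),
            nn.Conv2d(in_ch // 2, S * S, kernel_size=1),   # one H x W mask per grid cell
        )
    def forward(self, feat):                           # feat: (B, in_ch, S, S)
        cls = self.cls_branch(feat)                    # (B, C, S, S)
        mask = self.mask_branch(feat)                  # (B, S*S, H, W)
        return cls, mask

# example: the 13 x 13 head of a 416 x 416 input (channel count assumed)
head = DetectionHead(in_ch=64, S=13, C=80, input_size=416)
cls, mask = head(torch.randn(1, 64, 13, 13))
print(cls.shape, mask.shape)   # (1, 80, 13, 13) and (1, 169, 416, 416)
```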
And S4, combining the difference information and the semantic information through a fusion algorithm, and extracting the moving object in the image.
As shown in fig. 2, the fusion algorithm specifically includes the following steps:
s41, performing morphological filtering on the difference result, and then performing binarization to obtain a moving pixel mask;
s42, performing channel separation on the target mask in the segmentation result, and performing binarization to obtain a target pixel mask;
s43, respectively calculating the proportion P of the moving pixels in each target pixel mask;
S44, comparing P with a given ratio threshold T; if P > T, the current target is judged to be moving.
The fusion algorithm screens out non-moving targets using the motion information obtained by the inter-frame difference, so that the result contains only the position mask of each moving target in the current image.
The morphological filtering in step S41 is an opening (OPEN) operation, which reduces the effect of environmental noise on the result. The threshold for the binarization operation is generally selected to be a value close to 0, since slight pixel value variations caused by environmental noise need to be ignored; a common value is 5. The motion pixel mask obtained after binarization contains only two values: 0 and 255. The binarization formula for thresh = 5 is as follows:
bin(x, y) = 255, if dif(x, y) > 5
bin(x, y) = 0, otherwise
where dif (x, y) denotes the difference result obtained in step S22.
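A minimal sketch of step S41 with the values given above (opening followed by binarization with thresh = 5); the 3 × 3 structuring element is an assumed choice.

```python
import cv2

def motion_pixel_mask(dif, thresh=5, ksize=3):
    """Step S41: morphological opening (erosion then dilation) of the difference result,
    followed by binarization into a 0/255, 8-bit motion pixel mask."""
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (ksize, ksize))   # assumed 3x3 kernel
    opened = cv2.morphologyEx(dif, cv2.MORPH_OPEN, kernel)               # erode, then dilate
    _, motion_mask = cv2.threshold(opened, thresh, 255, cv2.THRESH_BINARY)
    return motion_mask   # uint8 mask containing only 0 and 255
```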
The channel separation operation in S42 separates all the target masks predicted by the model one by one to obtain single-channel grayscale target masks. The threshold of the binarization operation is 1, and the target pixel mask obtained after binarization contains only two values: 0 and 255.
The calculation method of the moving pixel proportion P in the target pixel mask in S43 is as follows:
Firstly, calculate the number n₁ of pixels whose value is 255 in the target pixel mask;
Secondly, calculate the number n₂ of pixels whose value is 255 in both the target pixel mask and the motion pixel mask;
Finally, calculate the moving-pixel ratio P in the target pixel mask by the following formula:
P = n₂ / n₁
In S44, the tolerance of the algorithm to environmental noise can be adjusted via the moving-pixel ratio threshold T in the target pixel mask: the larger T is, the less sensitive the algorithm is to environmental noise. A common value is T = 0.1. If P > T, the current target is judged to be a moving target; otherwise, it is a non-moving target. Combined with the class branch of the network, the class information of the current moving target is obtained as well. When N = 1, a schematic diagram of the moving object extraction result of the method is shown in fig. 6.
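Steps S42–S44 can be sketched as follows, assuming the network's mask output has already been split into an array of single-channel grayscale masks of shape (number of targets, H, W) in the 0–255 range; each mask is binarized with threshold 1, the moving-pixel ratio P = n₂/n₁ is computed against the motion mask from step S41, and targets with P > T (T = 0.1) are kept as moving targets.

```python
import numpy as np

def extract_moving_targets(target_masks, motion_mask, T=0.1):
    """Steps S42-S44: screen the segmented targets with the motion pixel mask.
    target_masks: (num_targets, H, W) single-channel grayscale masks (assumed 0-255 range),
    motion_mask:  (H, W) binary mask (0 / 255) from step S41.
    Returns the indices of the targets judged to be moving."""
    moving = []
    for idx, gray_mask in enumerate(target_masks):       # channel separation: one mask at a time
        target_bin = np.where(gray_mask > 1, 255, 0).astype(np.uint8)   # binarize, threshold 1
        n1 = np.count_nonzero(target_bin == 255)         # pixels belonging to the target
        if n1 == 0:
            continue
        n2 = np.count_nonzero((target_bin == 255) & (motion_mask == 255))  # target pixels that moved
        P = n2 / n1                                      # moving-pixel ratio inside the target mask
        if P > T:                                        # P > T  ->  moving target
            moving.append(idx)
    return moving
```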
The method is simple to implement yet highly robust; it can extract moving targets in the scene well and provides accurate and stable results for subsequent computer vision tasks.
The invention may be implemented in hardware, firmware, software, or a combination thereof. For a hardware implementation, the invention may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Programmable Logic Devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, micro-controllers, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.
For a firmware and/or software implementation, the techniques may be implemented with modules (e.g., procedures, steps, flows, and so on) that perform the functions described herein. The firmware and/or software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The invention has been described above in connection with the accompanying drawings, but it is not limited in its application to the details of construction and the arrangements set forth in the foregoing description; insubstantial modifications of the inventive concept and arrangement, or direct application of the inventive concept and arrangement to other applications without modification, are all intended to fall within the scope of protection of the invention.

Claims (10)

1. A moving object extraction method based on difference and semantic information fusion is characterized by comprising the following steps:
s1, acquiring an N-frame image sequence from the monitoring equipment;
S2, calculating difference information between image frames by using an inter-frame difference method on two image frames that are N frames apart;
S3, extracting semantic information in the image by using a trained instance segmentation model based on a convolutional neural network, wherein the semantic information comprises a target class and a pixel mask;
and S4, combining the difference information and the semantic information through a fusion algorithm, and extracting the moving object in the image.
2. The method for extracting moving object based on difference and semantic information fusion as claimed in claim 1, wherein the image frames obtained in step S1 are all 24-bit RGB images of 3 channels, the length of the image sequence needs to be kept as N, all the image frames need to be filtered by gaussian kernel, and the gaussian kernel expression is as follows:
G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))
3. The method for extracting moving objects based on difference and semantic information fusion as claimed in claim 2, wherein in step S1, when the image sequence is initialized, all image frames in the sequence adopt the 1st frame; as the number of acquired frames increases, replacement begins and proceeds on a first-in-first-out (FIFO) basis, i.e., the T-th frame replaces the (T-N)-th frame, the (T+1)-th frame replaces the (T-N+1)-th frame, and so on, with T > N.
The length N of the image sequence in step S1 is adjusted according to a specific application scenario.
4. The method for extracting a moving object based on difference and semantic information fusion as claimed in claim 3, wherein the inputs of the inter-frame difference method applied in step S2 are the (T-N)-th frame and the T-th frame, and the inter-frame difference method specifically comprises the following steps:
s21, converting the two frame images from the RGB images into gray-scale images respectively, wherein the conversion formula is as follows:
gray=0.299*R+0.587*G+0.114*B
wherein R, G, B represent the three color channels of a color image, respectively;
s22, subtracting the gray values of the corresponding positions of the two frames of images, and then taking the absolute value to obtain a difference result, wherein the formula is as follows:
dif(x,y)=|src(x,y)-dst(x,y)|
where src (x, y) represents a pixel value at coordinate (x, y) on the grayscale map of the reference frame, and dst (x, y) represents a pixel value at coordinate (x, y) on the grayscale map of the current frame.
5. The method for extracting a moving object based on difference and semantic information fusion as claimed in claim 4, wherein the input size of the convolutional neural network used in step S3 is 416 × 416 × 3;
the convolutional neural network comprises a backbone network, a feature pyramid network and 3 detection heads, wherein the backbone network consists of: a basic convolutional layer, a down-sampling layer, a residual module, a down-sampling layer, 2 residual modules connected in series, a down-sampling layer, 4 residual modules connected in series, a down-sampling layer and 2 residual modules connected in series;
the structure of the feature pyramid network (FPN, Feature Pyramid Network) from bottom to top is: a cascade (concatenation) structure, a basic convolutional layer, an upsampling layer, a cascade structure, a basic convolutional layer, an upsampling layer and a cascade structure; the backbone network and the feature pyramid network are combined through 3 lateral connections, each lateral connection consisting of a basic convolutional layer, and the feature map sizes must match when connecting, i.e. feature maps of the same size from the backbone network and the feature pyramid network are joined by one lateral connection;
the 3 detection heads have the same structure and parameters, and each detection head comprises two prediction branches; one prediction branch corresponds to target class prediction, predicting on each grid of the feature map the target class that the current grid is responsible for, with output dimension S × C, where S denotes the feature map size of the current detection head and C denotes the total number of classes to be predicted, this branch consisting of basic convolutional layers; the other prediction branch corresponds to target pixel mask prediction, predicting on each grid of the feature map the position mask of the target that the current grid is responsible for, with output dimension H × W × S², where S denotes the feature map size of the current detection head and H and W denote the height and width of the input picture, respectively, this branch consisting of an upsampling layer and a basic convolutional layer.
6. The method for extracting a moving object based on difference and semantic information fusion as claimed in claim 5, wherein the upsampling layer in the FPN structure is implemented by a resize function using nearest-neighbor interpolation;
the upsampling layer in the position-mask branch of the detection head is implemented by transposed convolution (Transposed Convolution).
7. The method for extracting moving object based on difference and semantic information fusion as claimed in claim 6, wherein the fusion algorithm in step S4 includes the following steps:
s41, performing morphological filtering on the difference result, and then performing binarization to obtain a moving pixel mask;
s42, performing channel separation on the target mask in the segmentation result, and performing binarization to obtain a target pixel mask;
s43, respectively calculating the proportion P of the moving pixels in each target pixel mask;
S44, comparing the moving-pixel ratio P with a preset ratio threshold T; if P > T, the current target is judged to be a moving target;
in the step S41, the morphological filtering is an opening (OPEN) operation, i.e., an erosion (ERODE) operation followed by a dilation (DILATE) operation;
binarization is performed after the morphological filtering, with a value thresh close to 0 selected as the binarization threshold; the motion pixel mask obtained after binarization contains only two values, 0 and 255 (8-bit unsigned integers), and the binarization formula is as follows:
bin(x, y) = 255, if dif(x, y) > thresh
bin(x, y) = 0, otherwise
where dif (x, y) denotes the difference result obtained in step S22.
8. The method for extracting a moving object based on difference and semantic information fusion as claimed in claim 7, wherein the channel separation operation in step S42 separates all target pixel masks predicted by the model one by one along the channel dimension to obtain single-channel grayscale target pixel masks;
binarization is performed after channel separation with a threshold of 1, and the target mask obtained after binarization contains only two values, 0 and 255 (8-bit unsigned integers).
9. The method for extracting a moving object based on difference and semantic information fusion according to claim 8, wherein the step of calculating the ratio P of moving pixels in the mask of target pixels in step S43 is as follows:
Firstly, calculate the number n₁ of pixels whose value is 255 in the target pixel mask;
Secondly, calculate the number n₂ of pixels whose value is 255 in both the target pixel mask and the motion pixel mask;
Finally, calculate the moving-pixel ratio P in the target pixel mask by the following formula:
P = n₂ / n₁
10. The method for extracting a moving object based on difference and semantic information fusion according to claim 9, wherein in step S44, the tolerance of the algorithm to environmental noise can be adjusted via the moving-pixel ratio threshold T in the target pixel mask: the larger T is, the less sensitive the algorithm is to environmental noise; if P > T, the current target is determined to be a moving target; otherwise, it is a non-moving target.
CN202011439962.9A 2021-02-24 2021-02-24 Moving object extraction method based on difference and semantic information fusion Active CN112907621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011439962.9A CN112907621B (en) 2021-02-24 2021-02-24 Moving object extraction method based on difference and semantic information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011439962.9A CN112907621B (en) 2021-02-24 2021-02-24 Moving object extraction method based on difference and semantic information fusion

Publications (2)

Publication Number Publication Date
CN112907621A true CN112907621A (en) 2021-06-04
CN112907621B CN112907621B (en) 2023-02-14

Family

ID=76111423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011439962.9A Active CN112907621B (en) 2021-02-24 2021-02-24 Moving object extraction method based on difference and semantic information fusion

Country Status (1)

Country Link
CN (1) CN112907621B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421231A (en) * 2021-06-08 2021-09-21 杭州海康威视数字技术股份有限公司 Bleeding point detection method, device and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184552A (en) * 2011-05-11 2011-09-14 上海理工大学 Moving target detecting method based on differential fusion and image edge information
CN110782477A (en) * 2019-10-10 2020-02-11 重庆第二师范学院 Moving target rapid detection method based on sequence image and computer vision system
US20200051250A1 (en) * 2018-08-08 2020-02-13 Beihang University Target tracking method and device oriented to airborne-based monitoring scenarios
CN111626090A (en) * 2020-03-03 2020-09-04 湖南理工学院 Moving target detection method based on depth frame difference convolutional neural network
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN111862143A (en) * 2020-07-13 2020-10-30 郑州信大先进技术研究院 Automatic river bank collapse monitoring method
CN112381835A (en) * 2020-10-29 2021-02-19 中国农业大学 Crop leaf segmentation method and device based on convolutional neural network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184552A (en) * 2011-05-11 2011-09-14 上海理工大学 Moving target detecting method based on differential fusion and image edge information
US20200051250A1 (en) * 2018-08-08 2020-02-13 Beihang University Target tracking method and device oriented to airborne-based monitoring scenarios
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN110782477A (en) * 2019-10-10 2020-02-11 重庆第二师范学院 Moving target rapid detection method based on sequence image and computer vision system
CN111626090A (en) * 2020-03-03 2020-09-04 湖南理工学院 Moving target detection method based on depth frame difference convolutional neural network
CN111862143A (en) * 2020-07-13 2020-10-30 郑州信大先进技术研究院 Automatic river bank collapse monitoring method
CN112381835A (en) * 2020-10-29 2021-02-19 中国农业大学 Crop leaf segmentation method and device based on convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴骞: "Research on action recognition algorithms fusing spatio-temporal difference information" (融合时空差分信息的动作识别算法研究), China Excellent Doctoral and Master's Dissertations Full-text Database (Master's) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421231A (en) * 2021-06-08 2021-09-21 杭州海康威视数字技术股份有限公司 Bleeding point detection method, device and system
CN113421231B (en) * 2021-06-08 2023-02-28 杭州海康威视数字技术股份有限公司 Bleeding point detection method, device and system

Also Published As

Publication number Publication date
CN112907621B (en) 2023-02-14

Similar Documents

Publication Publication Date Title
CN111340844B (en) Multi-scale characteristic optical flow learning calculation method based on self-attention mechanism
CN108510451B (en) Method for reconstructing license plate based on double-layer convolutional neural network
CN112464807A (en) Video motion recognition method and device, electronic equipment and storage medium
CN111402146A (en) Image processing method and image processing apparatus
CN110807384A (en) Small target detection method and system under low visibility
CN112562255B (en) Intelligent image detection method for cable channel smoke and fire conditions in low-light-level environment
CN112614136A (en) Infrared small target real-time instance segmentation method and device
Hiraiwa et al. An FPGA based embedded vision system for real-time motion segmentation
CN116152591B (en) Model training method, infrared small target detection method and device and electronic equipment
CN113409355A (en) Moving target identification system and method based on FPGA
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN112907621B (en) Moving object extraction method based on difference and semantic information fusion
CN115937794A (en) Small target object detection method and device, electronic equipment and storage medium
CN114266952A (en) Real-time semantic segmentation method based on deep supervision
CN112766028A (en) Face fuzzy processing method and device, electronic equipment and storage medium
CN113936034A (en) Apparent motion combined weak and small moving object detection method combined with interframe light stream
Yeswanth et al. Sovereign critique network (SCN) based super-resolution for chest X-rays images
EP4248657A1 (en) Methods and systems for low light media enhancement
CN116110095A (en) Training method of face filtering model, face recognition method and device
US20230394632A1 (en) Method and image processing device for improving signal-to-noise ratio of image frame sequences
CN111797761B (en) Three-stage smoke detection system, method and readable medium
CN113065650B (en) Multichannel neural network instance separation method based on long-term memory learning
Philip Background subtraction algorithm for moving object detection using denoising architecture in FPGA
CN115100409A (en) Video portrait segmentation algorithm based on twin network
CN114565764A (en) Port panorama sensing system based on ship instance segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant