CN115830515B - Video target behavior recognition method based on space grid - Google Patents

Video target behavior recognition method based on space grid

Info

Publication number
CN115830515B
CN115830515B · Application CN202310047339.6A
Authority
CN
China
Prior art keywords
target
gaussian
frame
pixel
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310047339.6A
Other languages
Chinese (zh)
Other versions
CN115830515A (en)
Inventor
施晓东
徐俊瑜
刘佳
韩东
谢诏光
孙镱诚
陆中祥
丁阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202310047339.6A
Publication of CN115830515A
Application granted
Publication of CN115830515B
Legal status: Active

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a video target behavior recognition method based on a spatial grid, which comprises: establishing a data set that contains the types and states of targets; identifying the type and state of a target in a video frame through a target recognition algorithm, and detecting moving targets in the video frame through a moving target detection algorithm; and, based on spatial grid positioning and in combination with the conditions around the grid, analyzing the behavior and action of the target in the video frame through target detection and motion detection. The invention is aimed at video target behavior recognition scenes based on a spatial grid: the action of the target is recognized through target detection and recognition together with moving target detection, and the behavior of the target is recognized through spatial grid positioning combined with the conditions around the grid.

Description

Video target behavior recognition method based on space grid
Technical Field
The invention relates to the field of geographic space rasterization processing and situation awareness, in particular to a video target behavior recognition method based on a space grid.
Background
Video target behavior recognition based on a spatial grid is a useful method for studying battlefield target behavior. By analyzing the behavior of battlefield targets over a long period, the data obtained are more scientific, more objective and of greater reference value. Although much research and innovation has been carried out on behavior analysis and recognition methods, most current methods are still based on traditional evaluation approaches and do not meet practical requirements in terms of accuracy, timeliness and practicability.
For example, CN111222487A discloses a video target behavior recognition method and an electronic device. The method includes: acquiring a video to be recognized, the video comprising image frames to be recognized; acquiring one or more local target images through a target detection model; matching the acquired local target images through a target tracking model to obtain one or more target image sequences; scoring the quality of the target behaviors in each target image sequence through a target behavior quality scoring model to obtain high-quality target image subsequences; and performing behavior recognition on the obtained high-quality target image subsequences through a behavior recognition model to obtain behavior recognition results. Because behavior recognition is carried out only on the high-quality subsequences of the video target image sequence, the influence of low-quality target behavior recognition results on the overall video result is eliminated on the one hand, and recognition efficiency is improved on the other hand, since only high-quality target behaviors are recognized. However, spatial information is not processed, so when the method is applied to the analysis of battlefield target behavior, the results deviate and cannot meet the requirements.
There is therefore a need to solve the above problems.
Disclosure of Invention
The invention aims to: the first object of the invention is to provide a video target behavior recognition method based on a space grid, which recognizes the action of a target through target detection and recognition together with moving target detection, recognizes the behavior of the target through spatial grid positioning combined with the conditions around the grid, and analyzes the behavior of targets on the battlefield, so that the results are more accurate and obtained faster, better suiting the requirements of the modern battlefield.
The technical scheme is as follows: in order to achieve the above purpose, the invention discloses a video target behavior recognition method based on a space grid, which comprises the following steps:
(1) Establishing a data set, wherein the data set contains the type and the state of the target;
(2) The type and state of the object in the video frame are identified by the object identification algorithm,
(3) Detecting a moving target of the video frame through a moving target detection algorithm;
(4) Based on space grid positioning, and combining with grid surrounding conditions, analyzing the behavior and action of the target in the video frame through target detection and motion detection.
Wherein, in the step (1), the data set contains the types and states of the targets; in the process of making the data set, the data set contains the types of the targets to be identified, and if a target is in a combat attitude, its label notes that it is in the combat state.
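As a non-authoritative illustration (the patent does not prescribe any annotation format, so the field names and values below are assumptions), one labelled target in a video frame could be recorded as a simple dictionary carrying both the category and the combat-state flag:

```python
# Hypothetical annotation record for one target in one video frame.
# Field names and values are illustrative; the patent only requires that
# each target carry a type and a state (e.g. "combat" when the target is
# in a combat attitude).
label = {
    "frame_id": 1024,
    "bbox": [120, 85, 240, 160],  # [x_min, y_min, x_max, y_max] in pixels
    "type": "tank",               # category of the target to be identified
    "state": "combat",            # "normal" or "combat"
}
```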
Preferably, the step (2) specifically includes the following steps:
(2.1) detecting by a multi-scale feature map,
(2.2) classifying and regressing the features extracted by the convolution calculation of the feature graphs with different sizes directly through a convolution network;
(2.3) network training employs prior frames.
Furthermore, the specific steps of the detection through the multi-scale feature maps in the step (2.1) are as follows: the neural network structure used for calculation is divided into six layers of feature maps for classifying and regressing the picture; the feature maps of each layer differ in size, with the feature map at the front end of the network being larger and becoming smaller after each pooling layer; the larger-scale feature maps are used to process smaller targets, and the smaller-scale feature maps are used to process larger targets.
Further, the specific steps of adopting a priori frame for network training in the step (2.3) are as follows:
setting boxes with different sizes and length-width ratios by taking pixels of the feature map as centers, wherein each pixel is provided with a plurality of prior boxes with different sizes and length-width ratios for detecting targets with different sizes and length-width ratios; the detection target in the picture can use the prior frame most suitable for the detection target to train the network model; the size of the prior frame linearly increases, satisfying the following formula:
$$s_{k}=s_{\min}+\frac{s_{\max}-s_{\min}}{m-1}(k-1),\quad k\in[1,m]$$
wherein m is the number of feature maps, and m has a value of 5; $s_{k}$ represents the proportion of the size of the k-th prior frame to the picture size; $s_{\min}$ and $s_{\max}$ respectively represent the minimum and maximum values of $s_{k}$;
the generated prior frames are matched with the real detection targets according to two criteria: the first criterion is to find, for each real detection target in the picture, the prior frame in the feature maps with the largest degree of overlap with it, measured by the intersection-over-union (IOU), and to match the prior frame with the largest IOU value to that real target; the second criterion, which avoids an excessive imbalance between the numbers of positive and negative samples, is that any remaining prior frame whose IOU with a real target exceeds a set threshold is also considered matched to that real target. The network finally outputs the category confidence and position coordinate information of the predicted targets, so the loss function is a weighted sum of the category confidence error and the predicted position error:
$$L(x,c,l,g)=\frac{1}{N}\bigl(L_{conf}(x,c)+\alpha L_{loc}(x,l,g)\bigr)$$
wherein
$$L_{conf}(x,c)=-\sum_{i\in Pos}x_{ij}^{p}\log\hat{c}_{i}^{p}-\sum_{i\in Neg}\log\hat{c}_{i}^{0}$$
$$L_{loc}(x,l,g)=\sum_{i\in Pos}\sum_{e\in\{cx,cy,w,h\}}x_{ij}^{p}\,\mathrm{smooth}_{L1}\bigl(l_{i}^{e}-\hat{g}_{j}^{e}\bigr)$$
N: the number of positive samples among the prior frames;
$x_{ij}^{p}$: takes only the value 0 or 1; a value of 1 indicates that the j-th real target in the picture is matched with the i-th prior frame and that the real target is of category p;
c: the confidence of the target categories;
l: the predicted values for the real targets;
g: the position information of the real targets;
$L_{conf}$: the category confidence error;
$L_{loc}$: the predicted position error;
L: the loss function;
$\alpha$: the weight balancing the two error terms;
Pos: the positive sample set;
Neg: the negative sample set;
$\hat{c}_{i}^{p}$: the confidence that the i-th prior frame belongs to category p, obtained by softmax normalisation:
$$\hat{c}_{i}^{p}=\frac{\exp(c_{i}^{p})}{\sum_{p}\exp(c_{i}^{p})}$$
(cx, cy, w, h): the centre coordinates, width and height of a prediction frame;
$l_{i}^{e}$: the predicted value of coordinate e for the detection target predicted by the i-th prior frame in the image;
$\hat{g}_{j}^{e}$: the encoded position of the j-th real detection target relative to the i-th prior frame (whose centre coordinates, width and height are $d_{i}^{cx}$, $d_{i}^{cy}$, $d_{i}^{w}$, $d_{i}^{h}$), calculated as:
$$\hat{g}_{j}^{cx}=\frac{g_{j}^{cx}-d_{i}^{cx}}{d_{i}^{w}},\quad \hat{g}_{j}^{cy}=\frac{g_{j}^{cy}-d_{i}^{cy}}{d_{i}^{h}},\quad \hat{g}_{j}^{w}=\log\frac{g_{j}^{w}}{d_{i}^{w}},\quad \hat{g}_{j}^{h}=\log\frac{g_{j}^{h}}{d_{i}^{h}}$$
in the moving object detection in the step (3), the gray value of each pixel point is represented by a plurality of gaussian distributions, and each gaussian distribution function has different weights; if a pixel in the current video frame meets the established Gaussian model, the pixel is considered to be background, otherwise the pixel is considered to be foreground; and then updating parameters of the Gaussian model, sorting different Gaussian distributions according to priority, and selecting the Gaussian distribution which accords with the priority through a set threshold value to serve as a background model.
In the step (3), moving object detection is performed on the video frame through a moving object detection algorithm: K Gaussian distribution functions are selected to represent the gray value of each pixel point in the image, and M models that describe the background are selected from the K Gaussian distribution functions. Different Gaussian distributions are given different weights $\omega_{a,t}$, where a denotes the a-th Gaussian distribution, so $a\le K$. Suitable weights and a threshold are selected; the pixels matching the Gaussian distributions whose weights satisfy the threshold are considered background, and the remaining pixels are considered foreground. Let the gray value of the pixel at a certain time t be $X_{t}$; its probability density function is represented by a combination of the K Gaussian distribution functions:
$$P(X_{t})=\sum_{a=1}^{K}\omega_{a,t}\,\eta\bigl(X_{t},\mu_{a,t},\Sigma_{a,t}\bigr)$$
wherein:
$\omega_{a,t}$ represents the weight of the a-th Gaussian model at time t, and the sum of all the weights is 1;
$\mu_{a,t}$ represents the mean of the pixel gray values of the a-th Gaussian model at time t;
$\Sigma_{a,t}$ represents the covariance matrix of the a-th Gaussian model at time t;
$X_{t}$ represents the pixel gray value at time t;
$\eta$ denotes the Gaussian probability density function.
The K Gaussian distribution functions are arranged in descending order of priority, and the first M Gaussian distributions serving as the background are selected according to a preset threshold. When a new image is processed, each pixel point of the image is compared and matched against the established Gaussian mixture model; if a pixel point and the a-th Gaussian distribution in the mixture model satisfy
$$\bigl|X_{t}-\mu_{a,t-1}\bigr|\le 2.5\,\sigma_{a,t-1}$$
wherein $X_{t}$ is the gray value of the pixel at time t and $\sigma_{a,t-1}$ is the standard deviation of the pixel gray values of the a-th Gaussian model at time t−1, then the point is considered to match the a-th Gaussian distribution, and the successfully matched function is updated with the following parameters:
$$\omega_{a,t}=(1-\alpha)\,\omega_{a,t-1}+\alpha$$
$$\mu_{a,t}=(1-\alpha)\,\mu_{a,t-1}+\alpha X_{t}$$
$$\sigma_{a,t}^{2}=(1-\alpha)\,\sigma_{a,t-1}^{2}+\alpha\bigl(X_{t}-\mu_{a,t}\bigr)^{2}$$
wherein $\alpha$ represents the learning rate, $0<\alpha<1$; the larger its value, the more frequently the background in the video is updated; $\sigma_{a,t}^{2}$ represents the variance of the pixel gray values of the a-th Gaussian model at time t.
If a Gaussian distribution function is not matched by the pixel, its distribution parameters are left unchanged and only the corresponding weight is updated, according to
$$\omega_{a,t}=(1-\alpha)\,\omega_{a,t-1}$$
If the pixel does not match any of its corresponding Gaussian distribution functions, the pixel point is judged to be foreground, and the Gaussian model with the smallest weight in the established model is replaced; the mean of the replacing new Gaussian function is the gray value of the current pixel. The weights of the updated background model are normalised, and the Gaussian distribution functions are arranged in descending order of
$$\omega_{a,t}/\sigma_{a,t}$$
The foreground is then screened according to the set threshold T: the first M Gaussian distributions satisfying the condition are set as the background, and the rest are set as the foreground.
Preferably, in the step (4), the combat zone corresponds to the range of a single map sheet at the coarser scale and is composed of R×C map sheets at the finer scale; the coarse-scale sheet is marked as the combat zone and each fine-scale sheet as a combat basic zone, i.e. one combat zone comprises R×C combat basic zones, and the boundary lines of the combat basic zones are connected to form attack-defence lines and coordination lines. The behavior of the target is judged according to the positions of the combat zone and combat basic zone to which the target belongs, in combination with analysis of the three-dimensional sea, land and air environment around the grid. If a target is detected at the same position in the previous frame and the current frame and no moving target is detected at that position, the detected target is in a stationary state; if a target is detected in the previous frame, a target is detected near the same position in the current frame, and a moving target is detected in that region at the same time, the detected target is in a moving state; if the target detected in the previous frame is in a normal state, the target detected at the same position in the current frame is in a combat posture, and a moving target is detected at that position, the target is in a combat state. If the target is stationary or moving within Zone I, its behavior is characterized as intrusion; if the target is in a combat state within Zone I, its behavior is characterized as an attack.
The beneficial effects are that: compared with the prior art, the invention has the following remarkable advantages: by combining the spatial grid with target recognition, target detection and motion detection algorithms, the invention analyzes battlefield target behavior, so that the results are more accurate and obtained faster, better suiting the requirements of the modern battlefield.
Drawings
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a schematic diagram of a multi-scale feature map detection in accordance with the present invention;
FIG. 3 is a schematic diagram of motion recognition according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1, the method for identifying video target behavior based on spatial grid of the present invention comprises the following steps:
(1) Establishing a data set, wherein the data set comprises the type and the state of the target, and in the process of manufacturing the data set, the data set comprises the type of the target to be identified, and if the target is in a combat attitude, the target is noted to be in the combat state in the tag;
(2) The method for identifying the type and the state of the target in the video frame through the target identification algorithm specifically comprises the following steps:
(2.1) detecting through multi-scale feature maps: as shown in fig. 2, the neural network structure used for calculation is divided into six layers of feature maps for classifying and regressing the picture; the feature maps of each layer differ in size, with the feature map at the front end of the network being larger and becoming smaller towards the back of the network; the larger-scale feature maps are used to process smaller targets, and the smaller-scale feature maps are used to process larger targets;
(2.2) classifying and regressing, directly through a convolutional network, the features extracted by convolution from the feature maps of different sizes; in general, when a neural network is used for target detection, a convolutional network first extracts the features of the picture, and the extracted features are then fed into a fully connected network for classification or regression;
(2.3) adopting prior frames for network training, the specific steps being as follows:
setting boxes with different sizes and length-width ratios by taking pixels of the feature map as centers, wherein each pixel is provided with a plurality of prior boxes with different sizes and length-width ratios for detecting targets with different sizes and length-width ratios; the detection target in the picture can use the prior frame most suitable for the detection target to train the network model; the size of the prior frame linearly increases, satisfying the following formula:
$$s_{k}=s_{\min}+\frac{s_{\max}-s_{\min}}{m-1}(k-1),\quad k\in[1,m]$$
wherein m is the number of feature maps, and m has a value of 5; $s_{k}$ represents the proportion of the size of the k-th prior frame to the picture size; $s_{\min}$ and $s_{\max}$ respectively represent the minimum and maximum values of $s_{k}$;
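As a minimal sketch of this linear scale schedule (the values s_min = 0.2 and s_max = 0.9 are assumptions made only for illustration; the patent fixes only m = 5), the prior-frame scales could be computed as:

```python
def prior_frame_scales(m=5, s_min=0.2, s_max=0.9):
    """Linearly increasing prior-frame scales s_k (fraction of the picture size).
    s_min and s_max are illustrative assumptions, not values from the patent."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

print(prior_frame_scales())  # [0.2, 0.375, 0.55, 0.725, 0.9]
```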
the generated prior frames are matched with the real detection targets according to two criteria: the first criterion is to find, for each real detection target in the picture, the prior frame in the feature maps with the largest degree of overlap with it, measured by the intersection-over-union (IOU), and to match the prior frame with the largest IOU value to that real target; the second criterion, which avoids an excessive imbalance between the numbers of positive and negative samples, is that any remaining prior frame whose IOU with a real target exceeds a set threshold is also considered matched to that real target. The network finally outputs the category confidence and position coordinate information of the predicted targets, so the loss function is a weighted sum of the category confidence error and the predicted position error:
$$L(x,c,l,g)=\frac{1}{N}\bigl(L_{conf}(x,c)+\alpha L_{loc}(x,l,g)\bigr)$$
wherein
$$L_{conf}(x,c)=-\sum_{i\in Pos}x_{ij}^{p}\log\hat{c}_{i}^{p}-\sum_{i\in Neg}\log\hat{c}_{i}^{0}$$
$$L_{loc}(x,l,g)=\sum_{i\in Pos}\sum_{e\in\{cx,cy,w,h\}}x_{ij}^{p}\,\mathrm{smooth}_{L1}\bigl(l_{i}^{e}-\hat{g}_{j}^{e}\bigr)$$
N: the number of positive samples among the prior frames;
$x_{ij}^{p}$: takes only the value 0 or 1; a value of 1 indicates that the j-th real target in the picture is matched with the i-th prior frame and that the real target is of category p;
c: the confidence of the target categories;
l: the predicted values for the real targets;
g: the position information of the real targets;
$L_{conf}$: the category confidence error;
$L_{loc}$: the predicted position error;
L: the loss function;
$\alpha$: the weight balancing the two error terms;
Pos: the positive sample set;
Neg: the negative sample set;
$\hat{c}_{i}^{p}$: the confidence that the i-th prior frame belongs to category p, obtained by softmax normalisation:
$$\hat{c}_{i}^{p}=\frac{\exp(c_{i}^{p})}{\sum_{p}\exp(c_{i}^{p})}$$
(cx, cy, w, h): the centre coordinates, width and height of a prediction frame;
$l_{i}^{e}$: the predicted value of coordinate e for the detection target predicted by the i-th prior frame in the image;
$\hat{g}_{j}^{e}$: the encoded position of the j-th real detection target relative to the i-th prior frame (whose centre coordinates, width and height are $d_{i}^{cx}$, $d_{i}^{cy}$, $d_{i}^{w}$, $d_{i}^{h}$), calculated as:
$$\hat{g}_{j}^{cx}=\frac{g_{j}^{cx}-d_{i}^{cx}}{d_{i}^{w}},\quad \hat{g}_{j}^{cy}=\frac{g_{j}^{cy}-d_{i}^{cy}}{d_{i}^{h}},\quad \hat{g}_{j}^{w}=\log\frac{g_{j}^{w}}{d_{i}^{w}},\quad \hat{g}_{j}^{h}=\log\frac{g_{j}^{h}}{d_{i}^{h}}$$
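The two matching criteria described above can be sketched as follows (a simplified, non-authoritative illustration; the 0.5 IOU threshold is an assumed value, since the patent only speaks of "a set threshold"):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x_min, y_min, x_max, y_max]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_prior_frames(priors, gt_boxes, iou_threshold=0.5):
    """Return a dict {prior_index: gt_index} of positive matches.
    Criterion 1: each real target claims the prior frame with the largest IOU.
    Criterion 2: any remaining prior frame whose IOU with a real target exceeds
    the threshold is also treated as a positive match."""
    ious = np.array([[iou(p, g) for g in gt_boxes] for p in priors])
    matches = {int(ious[:, j].argmax()): j for j in range(len(gt_boxes))}
    for i in range(len(priors)):
        if i in matches:
            continue
        j = int(ious[i].argmax())
        if ious[i, j] > iou_threshold:
            matches[i] = j
    return matches
```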
(3) Detecting a moving target of the video frame through a moving target detection algorithm, wherein in the moving target detection, the gray value of each pixel point is represented by a plurality of Gaussian distribution functions, and each Gaussian distribution function has different weights; if a pixel in the current video frame meets the established Gaussian model, the pixel is considered to be background, otherwise the pixel is considered to be foreground; then, updating parameters of the Gaussian model, sorting different Gaussian distributions according to priority, and selecting the Gaussian distribution which accords with the priority through a set threshold value to serve as a background model;
in the step (3), moving object detection is carried out on the video frame through a moving object detection algorithm: K Gaussian distribution functions are selected to represent the gray value of each pixel point in the image, and M models that describe the background are selected from the K Gaussian distribution functions. Different Gaussian distributions are given different weights $\omega_{a,t}$, where a denotes the a-th Gaussian distribution, so $a\le K$. Suitable weights and a threshold are selected; the pixels matching the Gaussian distributions whose weights satisfy the threshold are considered background, and the remaining pixels are considered foreground. Let the gray value of the pixel at a certain time t be $X_{t}$; its probability density function is represented by a combination of the K Gaussian distribution functions:
$$P(X_{t})=\sum_{a=1}^{K}\omega_{a,t}\,\eta\bigl(X_{t},\mu_{a,t},\Sigma_{a,t}\bigr)$$
wherein:
$\omega_{a,t}$ represents the weight of the a-th Gaussian model at time t, and the sum of all the weights is 1;
$\mu_{a,t}$ represents the mean of the pixel gray values of the a-th Gaussian model at time t;
$\Sigma_{a,t}$ represents the covariance matrix of the a-th Gaussian model at time t;
$X_{t}$ represents the pixel gray value at time t;
$\eta$ denotes the Gaussian probability density function.
The K Gaussian distribution functions are arranged in descending order of priority, and the first M Gaussian distributions serving as the background are selected according to a preset threshold. When a new image is processed, each pixel point of the image is compared and matched against the established Gaussian mixture model; if a pixel point and the a-th Gaussian distribution in the mixture model satisfy
$$\bigl|X_{t}-\mu_{a,t-1}\bigr|\le 2.5\,\sigma_{a,t-1}$$
wherein $X_{t}$ is the gray value of the pixel at time t and $\sigma_{a,t-1}$ is the standard deviation of the pixel gray values of the a-th Gaussian model at time t−1, then the point is considered to match the a-th Gaussian distribution, and the successfully matched function is updated with the following parameters:
$$\omega_{a,t}=(1-\alpha)\,\omega_{a,t-1}+\alpha$$
$$\mu_{a,t}=(1-\alpha)\,\mu_{a,t-1}+\alpha X_{t}$$
$$\sigma_{a,t}^{2}=(1-\alpha)\,\sigma_{a,t-1}^{2}+\alpha\bigl(X_{t}-\mu_{a,t}\bigr)^{2}$$
wherein $\alpha$ represents the learning rate, $0<\alpha<1$; the larger its value, the more frequently the background in the video is updated; $\sigma_{a,t}^{2}$ represents the variance of the pixel gray values of the a-th Gaussian model at time t.
If a Gaussian distribution function is not matched by the pixel, its distribution parameters are left unchanged and only the corresponding weight is updated, according to
$$\omega_{a,t}=(1-\alpha)\,\omega_{a,t-1}$$
If the pixel does not match any of its corresponding Gaussian distribution functions, the pixel point is judged to be foreground, and the Gaussian model with the smallest weight in the established model is replaced; the mean of the replacing new Gaussian function is the gray value of the current pixel. The weights of the updated background model are normalised, and the Gaussian distribution functions are arranged in descending order of
$$\omega_{a,t}/\sigma_{a,t}$$
The foreground is then screened according to the set threshold T: the first M Gaussian distributions satisfying the condition are set as the background, and the rest are set as the foreground.
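A minimal per-pixel sketch of the mixture-of-Gaussians update described above is given below; the values of K, the learning rate, the background threshold T and the initial variance are illustrative assumptions rather than values from the patent (in practice an off-the-shelf implementation such as OpenCV's cv2.createBackgroundSubtractorMOG2 provides equivalent behaviour):

```python
import numpy as np

class PixelMOG:
    """Mixture-of-Gaussians background model for a single pixel: a simplified
    sketch of the update rules above. K, alpha, T and init_var are
    illustrative assumptions, not values taken from the patent."""

    def __init__(self, K=3, alpha=0.01, T=0.7, init_var=36.0):
        self.alpha, self.T, self.init_var = alpha, T, init_var
        self.w = np.full(K, 1.0 / K)      # weights of the K Gaussians (sum to 1)
        self.mu = np.zeros(K)             # means of the pixel gray value
        self.var = np.full(K, init_var)   # variances

    def update(self, x):
        """Feed one gray value x; return True if the pixel is foreground."""
        matched = np.abs(x - self.mu) <= 2.5 * np.sqrt(self.var)   # match test
        if matched.any():
            a = int(np.argmax(matched))            # first matching Gaussian
            self.w = (1 - self.alpha) * self.w     # all weights decay ...
            self.w[a] += self.alpha                # ... matched weight reinforced
            self.mu[a] += self.alpha * (x - self.mu[a])
            self.var[a] = (1 - self.alpha) * self.var[a] + self.alpha * (x - self.mu[a]) ** 2
        else:
            a = int(np.argmin(self.w))             # replace the weakest Gaussian
            self.mu[a], self.var[a] = float(x), self.init_var
        self.w /= self.w.sum()                     # renormalise weights
        # Rank Gaussians by w / sigma and keep the first M whose cumulative
        # weight exceeds T as the background model.
        order = np.argsort(-(self.w / np.sqrt(self.var)))
        background, cum = set(), 0.0
        for idx in order:
            background.add(int(idx))
            cum += self.w[idx]
            if cum > self.T:
                break
        return (not matched.any()) or (a not in background)
```

Applied independently to every pixel of a frame, the foreground mask produced this way would serve as the moving-target detection result consumed by the subsequent behavior analysis.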
(4) Based on space grid positioning, and combining with grid surrounding conditions, analyzing the behavior and action of a target in a video frame through target detection and motion detection;
the combat zone corresponds to the range of one 1:1,000,000 map sheet and consists of 144 sheets of 1:100,000 maps; the 1:1,000,000 sheet is marked as the combat zone and each 1:100,000 sheet as a combat basic zone, i.e. one combat zone comprises 144 combat basic zones, and the boundary lines of the combat basic zones are connected to form attack-defence lines and coordination lines; the behavior of the target is judged according to the positions of the combat zone and combat basic zone to which the target belongs, in combination with analysis of the three-dimensional sea, land and air environment around the grid; the motion recognition schematic diagram is shown in fig. 3: if a target is detected at the same position in the previous frame and the current frame and no moving target is detected at that position, the detected target is in a stationary state; if a target is detected in the previous frame, a target is detected near the same position in the current frame, and a moving target is detected in that region at the same time, the detected target is in a moving state; if the target detected in the previous frame is in a normal state, the target detected at the same position in the current frame is in a combat posture, and a moving target is detected at that position, the target is in a combat state; if the target is stationary or moving within Zone I, its behavior is characterized as intrusion; if the target is in a combat state within Zone I, its behavior is characterized as an attack.
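As a non-authoritative sketch of the decision rules just described (the state names, the zone flag and the returned labels are illustrative assumptions), the per-grid-position judgement could look like:

```python
def classify_behavior(prev_state, curr_state, motion_detected, in_zone_i):
    """Rule-based behavior judgement for one grid position, following the
    frame-to-frame rules above.  prev_state / curr_state are "normal",
    "combat" or None (no target detected); motion_detected says whether the
    moving-target detector fired at this position; in_zone_i marks Zone I."""
    if prev_state is None or curr_state is None:
        return None                      # no target present in both frames
    if not motion_detected:
        action = "stationary"            # target present, no motion detected
    elif curr_state == "combat":
        action = "combat"                # combat posture plus detected motion
    else:
        action = "moving"
    if in_zone_i:                        # behavior depends on the grid zone
        return "attack" if action == "combat" else "intrusion"
    return action

# Example: a target that was "normal" in the previous frame, shows a combat
# posture now, is moving, and sits inside Zone I is judged to be attacking.
print(classify_behavior("normal", "combat", True, True))  # -> "attack"
```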

Claims (7)

1. The method for identifying the video target behavior based on the space grid is characterized by comprising the following steps of:
(1) Establishing a data set, wherein the data set contains the type and the state of the target;
(2) The type and state of the object in the video frame are identified by the object identification algorithm,
(3) Detecting a moving target of the video frame through a moving target detection algorithm;
(4) Based on space grid positioning, and combining with grid surrounding conditions, analyzing the behavior and action of a target in a video frame through target detection and motion detection;
in the step (4), the combat zone corresponds to the range of a single map sheet at the coarser scale and is composed of R×C map sheets at the finer scale, marked respectively as the combat zone and the combat basic zones, i.e. one combat zone comprises R×C combat basic zones, and the boundary lines of the combat basic zones are connected to form attack-defence lines and coordination lines; the behavior of the target is judged according to the positions of the combat zone and combat basic zone to which the target belongs, in combination with analysis of the three-dimensional sea, land and air environment around the grid; if a target is detected at the same position in the previous frame and the current frame and no moving target is detected at that position, the detected target is in a stationary state; if a target is detected in the previous frame, a target is detected near the same position in the current frame, and a moving target is detected in that region at the same time, the detected target is in a moving state; if the target detected in the previous frame is in a normal state, the target detected at the same position in the current frame is in a combat posture, and a moving target is detected at that position, the target is in a combat state; if the target is stationary or moving within Zone I, its behavior is characterized as intrusion; if the target is in a combat state within Zone I, its behavior is characterized as an attack.
2. The method for identifying video object behaviors based on spatial grid according to claim 1, wherein: in the step (1), the data set includes the type and state of the target, and in the process of making the data set, the data set includes the type of the target to be identified, and if the target is in a combat attitude, the target is noted in the tag to be in the combat state.
3. The method for identifying video object behaviors based on spatial grid according to claim 2, wherein: the step (2) specifically comprises the following steps:
(2.1) detecting by a multi-scale feature map,
(2.2) classifying and regressing the features extracted by the convolution calculation of the feature graphs with different sizes directly through a convolution network;
(2.3) network training employs prior frames.
4. A method for identifying video object behaviors based on a spatial grid according to claim 3, wherein: the specific steps of the detection through the multi-scale feature maps in the step (2.1) are as follows: the neural network structure used for calculation is divided into six layers of feature maps for classifying and regressing the picture; the feature maps of each layer differ in size, with the feature map at the front end of the network being larger and becoming smaller after each pooling layer; the larger-scale feature maps are used to process smaller targets, and the smaller-scale feature maps are used to process larger targets.
5. The method for identifying video object behaviors based on spatial grid as recited in claim 4, wherein: the specific steps of adopting a priori frame for network training in the step (2.3) are as follows:
setting boxes with different sizes and length-width ratios by taking pixels of the feature map as centers, wherein each pixel is provided with a plurality of prior boxes with different sizes and length-width ratios for detecting targets with different sizes and length-width ratios; the detection target in the picture can use the prior frame most suitable for the detection target to train the network model; the size of the prior frame linearly increases, satisfying the following formula:
$$s_{k}=s_{\min}+\frac{s_{\max}-s_{\min}}{m-1}(k-1),\quad k\in[1,m]$$
wherein m is the number of feature maps, and m has a value of 5; $s_{k}$ represents the proportion of the size of the k-th prior frame to the picture size; $s_{\min}$ and $s_{\max}$ respectively represent the minimum and maximum values of $s_{k}$;
the generated prior frames are matched with the real detection targets according to two criteria: the first criterion is to find, for each real detection target in the picture, the prior frame in the feature maps with the largest degree of overlap with it, measured by the intersection-over-union (IOU), and to match the prior frame with the largest IOU value to that real target; the second criterion, which avoids an excessive imbalance between the numbers of positive and negative samples, is that any remaining prior frame whose IOU with a real target exceeds a set threshold is also considered matched to that real target; the network finally outputs the category confidence and position coordinate information of the predicted targets, so the loss function is a weighted sum of the category confidence error and the predicted position error:
$$L(x,c,l,g)=\frac{1}{N}\bigl(L_{conf}(x,c)+\alpha L_{loc}(x,l,g)\bigr)$$
wherein
$$L_{conf}(x,c)=-\sum_{i\in Pos}x_{ij}^{p}\log\hat{c}_{i}^{p}-\sum_{i\in Neg}\log\hat{c}_{i}^{0}$$
$$L_{loc}(x,l,g)=\sum_{i\in Pos}\sum_{e\in\{cx,cy,w,h\}}x_{ij}^{p}\,\mathrm{smooth}_{L1}\bigl(l_{i}^{e}-\hat{g}_{j}^{e}\bigr)$$
N: the number of positive samples among the prior frames;
$x_{ij}^{p}$: takes only the value 0 or 1; a value of 1 indicates that the j-th real target in the picture is matched with the i-th prior frame and that the real target is of category p;
c: the confidence of the target categories;
l: the predicted values for the real targets;
g: the position information of the real targets;
$L_{conf}$: the category confidence error;
$L_{loc}$: the predicted position error;
L: the loss function;
$\alpha$: the weight balancing the two error terms;
Pos: the positive sample set;
Neg: the negative sample set;
$\hat{c}_{i}^{p}$: the confidence that the i-th prior frame belongs to category p, obtained by softmax normalisation:
$$\hat{c}_{i}^{p}=\frac{\exp(c_{i}^{p})}{\sum_{p}\exp(c_{i}^{p})}$$
(cx, cy, w, h): the centre coordinates, width and height of a prediction frame;
$l_{i}^{e}$: the predicted value of coordinate e for the detection target predicted by the i-th prior frame in the image;
$\hat{g}_{j}^{e}$: the encoded position of the j-th real detection target relative to the i-th prior frame (whose centre coordinates, width and height are $d_{i}^{cx}$, $d_{i}^{cy}$, $d_{i}^{w}$, $d_{i}^{h}$), calculated as:
$$\hat{g}_{j}^{cx}=\frac{g_{j}^{cx}-d_{i}^{cx}}{d_{i}^{w}},\quad \hat{g}_{j}^{cy}=\frac{g_{j}^{cy}-d_{i}^{cy}}{d_{i}^{h}},\quad \hat{g}_{j}^{w}=\log\frac{g_{j}^{w}}{d_{i}^{w}},\quad \hat{g}_{j}^{h}=\log\frac{g_{j}^{h}}{d_{i}^{h}}$$
6. The method for identifying video object behaviors based on spatial grid according to claim 5, wherein: in the moving object detection in the step (3), the gray value of each pixel point is represented by a plurality of Gaussian distribution functions, each with a different weight; if a pixel in the current video frame matches the established Gaussian model, the pixel is considered background, otherwise it is considered foreground; the parameters of the Gaussian model are then updated, the different Gaussian distributions are sorted by priority, and the Gaussian distributions meeting a set threshold are selected as the background model.
7. The method for identifying video object behaviors based on spatial grid as claimed in claim 6, wherein: in the step (3), moving object detection is carried out on the video frame through a moving object detection algorithm: K Gaussian distribution functions are selected to represent the gray value of each pixel point in the image, and M models that describe the background are selected from the K Gaussian distribution functions; different Gaussian distributions are given different weights $\omega_{a,t}$, where a denotes the a-th Gaussian distribution, so $a\le K$; suitable weights and a threshold are selected, the pixels matching the Gaussian distributions whose weights satisfy the threshold are considered background, and the remaining pixels are considered foreground; let the gray value of the pixel at a certain time t be $X_{t}$; its probability density function is represented by a combination of the K Gaussian distribution functions:
$$P(X_{t})=\sum_{a=1}^{K}\omega_{a,t}\,\eta\bigl(X_{t},\mu_{a,t},\Sigma_{a,t}\bigr)$$
wherein:
$\omega_{a,t}$ represents the weight of the a-th Gaussian model at time t, and the sum of all the weights is 1;
$\mu_{a,t}$ represents the mean of the pixel gray values of the a-th Gaussian model at time t;
$\Sigma_{a,t}$ represents the covariance matrix of the a-th Gaussian model at time t;
$X_{t}$ represents the pixel gray value at time t;
$\eta$ denotes the Gaussian probability density function;
the K Gaussian distribution functions are arranged in descending order of priority, and the first M Gaussian distributions serving as the background are selected according to a preset threshold; when a new image is processed, each pixel point of the image is compared and matched against the established Gaussian mixture model; if a pixel point and the a-th Gaussian distribution in the mixture model satisfy
$$\bigl|X_{t}-\mu_{a,t-1}\bigr|\le 2.5\,\sigma_{a,t-1}$$
wherein $X_{t}$ is the gray value of the pixel at time t and $\sigma_{a,t-1}$ is the standard deviation of the pixel gray values of the a-th Gaussian model at time t−1, then the point is considered to match the a-th Gaussian distribution, and the successfully matched function is updated with the following parameters:
$$\omega_{a,t}=(1-\alpha)\,\omega_{a,t-1}+\alpha$$
$$\mu_{a,t}=(1-\alpha)\,\mu_{a,t-1}+\alpha X_{t}$$
$$\sigma_{a,t}^{2}=(1-\alpha)\,\sigma_{a,t-1}^{2}+\alpha\bigl(X_{t}-\mu_{a,t}\bigr)^{2}$$
wherein $\alpha$ represents the learning rate, $0<\alpha<1$; the larger its value, the more frequently the background in the video is updated; $\sigma_{a,t}^{2}$ represents the variance of the pixel gray values of the a-th Gaussian model at time t;
if a Gaussian distribution function is not matched by the pixel, its distribution parameters are left unchanged and only the corresponding weight is updated, according to
$$\omega_{a,t}=(1-\alpha)\,\omega_{a,t-1}$$
if the pixel does not match any of its corresponding Gaussian distribution functions, the pixel point is judged to be foreground, and the Gaussian model with the smallest weight in the established model is replaced; the mean of the replacing new Gaussian function is the gray value of the current pixel; the weights of the updated background model are normalised, and the Gaussian distribution functions are arranged in descending order of
$$\omega_{a,t}/\sigma_{a,t}$$
the foreground is then screened according to the set threshold T: the first M Gaussian distributions satisfying the condition are set as the background, and the rest are set as the foreground.
CN202310047339.6A 2023-01-31 2023-01-31 Video target behavior recognition method based on space grid Active CN115830515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310047339.6A CN115830515B (en) 2023-01-31 2023-01-31 Video target behavior recognition method based on space grid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310047339.6A CN115830515B (en) 2023-01-31 2023-01-31 Video target behavior recognition method based on space grid

Publications (2)

Publication Number Publication Date
CN115830515A CN115830515A (en) 2023-03-21
CN115830515B true CN115830515B (en) 2023-05-02

Family

ID=85520637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310047339.6A Active CN115830515B (en) 2023-01-31 2023-01-31 Video target behavior recognition method based on space grid

Country Status (1)

Country Link
CN (1) CN115830515B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258332B (en) * 2013-05-24 2016-06-29 浙江工商大学 Moving target detection method robust to illumination variation
CN111477034B (en) * 2020-03-16 2021-01-29 中国电子科技集团公司第二十八研究所 Large-scale airspace use plan conflict detection and release method based on grid model
CN112070035A (en) * 2020-09-11 2020-12-11 联通物联网有限责任公司 Target tracking method and device based on video stream and storage medium
CN115098993A (en) * 2022-05-16 2022-09-23 南京航空航天大学 Unmanned aerial vehicle conflict detection method and device for airspace digital grid and storage medium
CN115493591A (en) * 2022-06-13 2022-12-20 中国人民解放军海军航空大学 Multi-route planning method
CN115578668A (en) * 2022-09-15 2023-01-06 浙江大华技术股份有限公司 Target behavior recognition method, electronic device, and storage medium

Also Published As

Publication number Publication date
CN115830515A (en) 2023-03-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant