CN115830515A - Video target behavior identification method based on spatial grid - Google Patents

Video target behavior identification method based on spatial grid Download PDF

Info

Publication number
CN115830515A
Authority
CN
China
Prior art keywords
target
gaussian
pixel
frame
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310047339.6A
Other languages
Chinese (zh)
Other versions
CN115830515B (en)
Inventor
施晓东
徐俊瑜
刘佳
韩东
谢诏光
孙镱诚
陆中祥
丁阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202310047339.6A priority Critical patent/CN115830515B/en
Publication of CN115830515A publication Critical patent/CN115830515A/en
Application granted granted Critical
Publication of CN115830515B publication Critical patent/CN115830515B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a video target behavior identification method based on a spatial grid, which comprises: establishing a data set containing the type and the state of each target; identifying the type and the state of a target in a video frame through a target recognition algorithm, and detecting moving targets in the video frame through a moving target detection algorithm; and, based on spatial grid positioning and in combination with the situation around the grid, analyzing the behavior and the action of the target in the video frame through target detection and motion detection. The invention is aimed at video target behavior identification scenarios based on spatial grids: the action of a target is identified through target detection and recognition together with moving target detection, and its behavior is identified through spatial grid positioning combined with the situation around the grid.

Description

Video target behavior identification method based on spatial grid
Technical Field
The invention relates to the fields of geographic space rasterization processing and situation perception, in particular to a video target behavior identification method based on a spatial grid.
Background
Video target behavior identification based on spatial grids is a very useful method for studying battlefield target behavior. By analyzing the behavior of battlefield targets over a long period, the data obtained are more scientific, more objective and of greater reference value. Although a great deal of research and innovation has been carried out on behavior analysis and identification, most existing methods are based on traditional evaluation approaches and do not meet actual use requirements in terms of accuracy, timeliness and practicability.
For example, prior-art document CN111222487A discloses a video target behavior recognition method and an electronic device, the method comprising: acquiring a video to be identified, the video comprising image frames to be identified; acquiring one or more local target images through a target detection model; matching the obtained local target images through a target tracking model to obtain one or more target image sequences; scoring the quality of the target behavior in each target image sequence through a target behavior quality scoring model to obtain high-quality target image subsequences; and performing behavior recognition on the obtained high-quality target image subsequences through a behavior recognition model to obtain a behavior recognition result. That method performs behavior recognition only on the high-quality target image subsequences of the video target image sequences: on the one hand, the influence of low-quality target behavior on the overall recognition result is eliminated; on the other hand, recognition efficiency is improved because only high-quality target behavior is recognized. However, spatial information is not processed, so when the method is applied to analyzing battlefield target behavior the results deviate and the requirements cannot be met.
Therefore, there is a need to solve the above problems.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a video target behavior recognition method based on a spatial grid.
The technical scheme is as follows: in order to achieve the above purpose, the invention discloses a video target behavior recognition method based on a spatial grid, which comprises the following steps:
(1) Establishing a data set, wherein the data set comprises the type and the state of a target;
(2) The type and the state of the target in the video frame are identified through a target identification algorithm;
(3) Detecting a moving object of the video frame by a moving object detection algorithm;
(4) Based on spatial grid positioning, the behavior and the action of the target in the video frame are analyzed through target detection and motion detection by combining the peripheral conditions of the grid.
The data set in the step (1) comprises the type and the state of each target: in the process of producing the data set, the type of each target to be identified is annotated, and if a target is in a fighting posture, its label also notes that it is in a fighting state.
Preferably, the step (2) specifically comprises the following steps:
(2.1) detecting through a multi-scale feature map;
(2.2) extracting features by performing convolution calculations directly on feature maps of different sizes with a convolutional network, and classifying and regressing these features;
(2.3) training the network using prior boxes.
Furthermore, the specific steps of detecting through the multi-scale feature map in the step (2.1) are as follows: the neural network structure used for calculation is divided into six layers of feature maps on which image classification and regression are carried out; the feature maps of each layer differ in size, the feature maps at the front end of the network being larger and becoming smaller toward the back as pooling layers are added; larger-scale feature maps are used to process smaller targets, and smaller-scale feature maps are used to process larger targets.
Further, the specific steps of adopting prior boxes for network training in the step (2.3) are as follows:
setting boxes with different sizes and aspect ratios by taking pixels of the feature map as centers, wherein each pixel is provided with a plurality of prior boxes with different sizes and aspect ratios for detecting targets with different sizes and aspect ratios; training a network model by using a prior frame which is most suitable for the detection target in the picture; the size of the prior frame is linearly increased, and the following formula is satisfied:
s_k = s_min + (s_max − s_min) / (m − 1) × (k − 1),  k ∈ [1, m]
wherein m is the number of feature maps and takes the value 5, s_k represents the ratio of the size of the k-th prior box to the picture size, and s_min and s_max respectively represent the minimum and maximum values of s_k;
Matching the generated prior boxes with the real detection targets follows two criteria: the first criterion is to find, among the feature maps, the prior box with the greatest degree of overlap with a real detection target in the picture, the overlap being expressed by the IOU (Intersection over Union), and the prior box with the largest IOU value is then matched with that real detection target; the second matching criterion is intended to avoid an excessively large difference between the numbers of positive and negative samples: for the remaining prior boxes whose IOU value is not the maximum, if the IOU value between a prior box and a real target exceeds the set threshold, that prior box is also considered to match the real target. The final output of the network is the class confidence and position coordinate information of the predicted targets, so the loss function is the weighted sum of the class confidence error and the predicted position error of the predicted targets:
L(x, c, l, g) = (1/N) · [ L_conf(x, c) + α · L_loc(x, l, g) ]
wherein,
L_conf(x, c) = − Σ_{i∈Pos} x_ij^p · log(ĉ_i^p) − Σ_{i∈Neg} log(ĉ_i^0)
L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_ij^p · smooth_L1(l_i^m − ĝ_j^m)
N: represents the number of positive samples, i.e. the number of prior boxes matched to real targets;
α: the weight balancing the predicted position error against the category confidence error;
x_ij^p: takes only the value 0 or 1; if it equals 1, the j-th real target in the picture is matched with the i-th prior box and the type of the real target is p;
c: the confidence of the target category;
l: the predicted values for the real target;
g: the position information of the real target;
L_conf: the category confidence error;
L_loc: the predicted position error;
L: the loss function;
Pos: the positive sample set;
Neg: the negative sample set;
ĉ_i^p: the confidence that the i-th prior box belongs to category p, where:
ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p);
(cx, cy, w, h): the center position coordinates, the width and the height of a box;
l_i^m: the predicted value of the m-th coordinate output by the network for the i-th prior box;
ĝ_j^m: the position of the j-th real target encoded relative to the matched prior box d_i, expressed as:
ĝ_j^cx = (g_j^cx − d_i^cx) / d_i^w,  ĝ_j^cy = (g_j^cy − d_i^cy) / d_i^h,
ĝ_j^w = log(g_j^w / d_i^w),  ĝ_j^h = log(g_j^h / d_i^h).
moreover, in the moving object detection in the step (3), the gray value of each pixel point is represented by a plurality of gaussian distributions, and each gaussian distribution function has different weights; if the pixel in the current video frame accords with the established Gaussian model, the pixel is considered as the background, otherwise, the pixel is considered as the foreground; and then updating parameters of the Gaussian model, sequencing different Gaussian distributions according to the priority, and selecting the consistent Gaussian distribution as a background model through a set threshold.
Further, in the step (3) moving object detection is performed on the video frame through a moving object detection algorithm: K Gaussian distribution functions are selected to represent the gray value of each pixel point in the image, and M of the K Gaussian distributions are selected as models describing the background; different Gaussian distributions are given different weights ω_{i,t}, where i indexes the Gaussian distributions, so i ≤ K; appropriate weights and a threshold are selected, and when the weights satisfy the threshold, the pixels conforming to those Gaussian distributions are regarded as background and the rest as foreground. Let the gray value of the pixel at a time t be X_t and express its probability density function as a combination of K Gaussian distribution functions:
P(X_t) = Σ_{i=1}^{K} ω_{i,t} · η(X_t; μ_{i,t}, σ²_{i,t})
wherein:
ω_{i,t}: represents the weight of the i-th Gaussian of the mixture model at time t, and the sum of all the weights is 1;
μ_{i,t}: represents the mean of the pixel gray value of the i-th model of the Gaussian mixture at time t;
σ²_{i,t}: represents the variance of the pixel gray value of the i-th model of the Gaussian mixture at time t;
X_t: represents the gray value of the pixel at time t;
η(·): the Gaussian probability density function.
The K Gaussian distribution functions are arranged in descending order, and the first M Gaussian distributions are then selected as background according to a preset threshold. When a new image is processed, the pixel points of the image are compared and matched against the established Gaussian mixture model; if a pixel point satisfies, for the i-th Gaussian distribution in the mixture model,
| X_t − μ_{i,t−1} | ≤ 2.5 · σ_{i,t−1}
wherein X_t represents the gray value of the pixel at time t, μ_{i,t−1} represents the mean of the pixel gray value of the i-th model of the Gaussian mixture at the previous time step, and σ_{i,t−1} represents its standard deviation, then the point is considered to match the i-th Gaussian distribution, and the parameters of the successfully matched function are updated as follows:
ω_{i,t} = (1 − α) · ω_{i,t−1} + α
μ_{i,t} = (1 − ρ) · μ_{i,t−1} + ρ · X_t
σ²_{i,t} = (1 − ρ) · σ²_{i,t−1} + ρ · (X_t − μ_{i,t})²,  with ρ = α · η(X_t; μ_{i,t}, σ²_{i,t})
wherein α (0 ≤ α ≤ 1) represents the learning rate; the larger the value, the more frequently the background in the video is updated; ω_{i,t} represents the weight of the i-th Gaussian of the mixture model at time t, and the sum of all the weights is 1; μ_{i,t} represents the mean of the pixel gray value of the i-th model at time t; σ²_{i,t} represents the variance of the pixel gray value of the i-th model at time t.
If a Gaussian distribution function is not matched by the pixel, its parameters do not need to be changed and only its weight is updated, according to:
ω_{i,t} = (1 − α) · ω_{i,t−1}
If the pixel does not match any of its Gaussian distribution functions, the pixel is judged as foreground and the Gaussian model with the smallest weight among the established models is replaced; the mean of the replacing new Gaussian function is set to the gray value of the current pixel. The weights of the updated background model are normalized, and the Gaussian distribution functions are sorted in descending order of ω/σ; the foreground is then screened according to the set threshold T: the first M Gaussian distributions satisfying the condition are set as background, and the remaining Gaussian distributions as foreground.
Preferably, the combat zone corresponding to the high-scale single map range in the step (4) is composed of R × C low-scale single maps, and the combat zone and the combat basic zones are divided accordingly, that is, one combat zone includes R × C combat basic zones, and the boundary lines of the combat basic zones are connected to form attack-and-defense lines and cooperative lines; according to the positions of the combat zone and the combat basic zone to which the target belongs, the behavior of the target is judged by combining the sea, land and air environment and an analysis of the three-dimensional environment around the grid; if the target is detected at the same position in the previous frame and the current frame and no moving target is detected at that position, the detected target is in a static state; if the target is detected in the previous frame and detected near the same position in the current frame, and a moving target is detected in that area at the same time, the detected target is in a moving state; if the target is detected in a normal state in the previous frame, detected in a fighting state at the same position in the current frame, and a moving target is detected at that position, the detected target is in a fighting state; if the target is static or moving in our area, it is identified as an intrusion behavior; if the target is fighting in our area, it is identified as an attack.
Beneficial effects: compared with the prior art, the invention has the following remarkable advantages: the invention combines the spatial grid with target recognition, target detection and motion detection algorithms to analyze battlefield target behavior, so that the results are more accurate, the speed is higher, and the method is better suited to the requirements of the current battlefield.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of multi-scale feature map detection according to the present invention;
FIG. 3 is a schematic diagram of action recognition in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
As shown in fig. 1, the invention relates to a video target behavior recognition method based on spatial grid, which comprises the following steps:
(1) Establishing a data set, wherein the data set comprises the type and the state of each target: in the process of producing the data set, the type of each target to be identified is annotated, and if a target is in a fighting posture, its label also notes that it is in a fighting state;
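For illustration, one annotation entry in such a data set might look like the minimal sketch below; the field names and the JSON-style layout are assumptions made only for this example and are not prescribed by the method.

```python
# Hypothetical annotation entry for one target in one frame; the concrete label
# format (field names, file layout, class names) is an assumption for illustration only.
annotation = {
    "frame_id": 120,                 # index of the video frame
    "bbox": [342, 118, 80, 46],      # x, y, width, height of the target in pixels
    "type": "tank",                  # type of the target to be identified
    "state": "fighting",             # state noted in the label when the target is in a fighting posture
}
```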
(2) Identifying the type and the state of the target in the video frame through a target recognition algorithm, which specifically comprises the following steps:
(2.1) Detecting through a multi-scale feature map: as shown in fig. 2, the neural network structure used for calculation is divided into six layers of feature maps on which image classification and regression are carried out; the feature maps of each layer differ in size, the feature maps at the front end of the network being larger and becoming smaller toward the back as pooling layers are added; larger-scale feature maps are used to process smaller targets, and smaller-scale feature maps are used to process larger targets;
(2.2) Extracting features by performing convolution calculations directly on feature maps of different sizes with a convolutional network, and classifying and regressing these features. When a general neural network is used to detect a target, a convolutional network usually extracts the features of a picture and the extracted features are then fed into a fully connected network for classification or regression; the invention instead applies the convolutional network directly to feature maps of different sizes and uses the extracted features for classification and regression, as shown in fig. 2;
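As an illustration of this design choice, the following minimal PyTorch-style sketch applies a separate 3×3 convolutional head to each of the six feature maps to produce class scores and box offsets; the channel counts, number of classes and number of prior boxes per pixel are assumptions chosen only for illustration.

```python
import torch.nn as nn

num_classes = 5          # assumed number of target types/states
priors_per_pixel = 4     # assumed number of prior boxes per feature-map pixel
feature_channels = [512, 1024, 512, 256, 256, 256]   # assumed channels of the six feature maps

# One classification head and one regression head per feature-map scale.
cls_heads = nn.ModuleList(
    nn.Conv2d(c, priors_per_pixel * num_classes, kernel_size=3, padding=1)
    for c in feature_channels
)
reg_heads = nn.ModuleList(
    nn.Conv2d(c, priors_per_pixel * 4, kernel_size=3, padding=1)
    for c in feature_channels
)

def predict(feature_maps):
    """Apply the convolutional heads directly to each of the six feature maps."""
    cls_out = [head(f) for head, f in zip(cls_heads, feature_maps)]
    reg_out = [head(f) for head, f in zip(reg_heads, feature_maps)]
    return cls_out, reg_out
```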
(2.3) the network training adopts a prior frame, the pixels of the characteristic diagram are taken as the center, the boxes with different sizes and length-width ratios are arranged, and each pixel is provided with a plurality of prior frames with different sizes and length-width ratios for detecting the targets with different sizes and length-width ratios; training a network model by using a prior frame which is most suitable for a detection target in the picture; the size of the prior frame is linearly increased, and the following formula is satisfied:
s_k = s_min + (s_max − s_min) / (m − 1) × (k − 1),  k ∈ [1, m]
wherein m is the number of feature maps and takes the value 5, s_k represents the ratio of the size of the k-th prior box to the picture size, and s_min and s_max respectively represent the minimum and maximum values of s_k;
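A minimal sketch of this linear scale rule follows; the concrete values s_min = 0.2 and s_max = 0.9 are illustrative assumptions only.

```python
# Compute the prior-box scales s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1) for k = 1..m.
def prior_box_scales(m=5, s_min=0.2, s_max=0.9):
    """Return the ratio of each prior-box size to the picture size."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

print(prior_box_scales())   # [0.2, 0.375, 0.55, 0.725, 0.9]
```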
Matching the generated prior boxes with the real detection targets follows two criteria: the first criterion is to find, among the feature maps, the prior box with the greatest degree of overlap with a real detection target in the picture, the overlap being expressed by the IOU (Intersection over Union), and the prior box with the largest IOU value is then matched with that real detection target; the second matching criterion is intended to avoid an excessively large difference between the numbers of positive and negative samples: for the remaining prior boxes whose IOU value is not the maximum, if the IOU value between a prior box and a real target exceeds the set threshold, that prior box is also considered to match the real target. The final output of the network is the class confidence and position coordinate information of the predicted targets, so the loss function is the weighted sum of the class confidence error and the predicted position error of the predicted targets:
L(x, c, l, g) = (1/N) · [ L_conf(x, c) + α · L_loc(x, l, g) ]
wherein,
L_conf(x, c) = − Σ_{i∈Pos} x_ij^p · log(ĉ_i^p) − Σ_{i∈Neg} log(ĉ_i^0)
L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_ij^p · smooth_L1(l_i^m − ĝ_j^m)
N: represents the number of positive samples, i.e. the number of prior boxes matched to real targets;
α: the weight balancing the predicted position error against the category confidence error;
x_ij^p: takes only the value 0 or 1; if it equals 1, the j-th real target in the picture is matched with the i-th prior box and the type of the real target is p;
c: the confidence of the target category;
l: the predicted values for the real target;
g: the position information of the real target;
L_conf: the category confidence error;
L_loc: the predicted position error;
L: the loss function;
Pos: the positive sample set;
Neg: the negative sample set;
ĉ_i^p: the confidence that the i-th prior box belongs to category p, where:
ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p);
(cx, cy, w, h): the center position coordinates, the width and the height of a box;
l_i^m: the predicted value of the m-th coordinate output by the network for the i-th prior box;
ĝ_j^m: the position of the j-th real target encoded relative to the matched prior box d_i, expressed as:
ĝ_j^cx = (g_j^cx − d_i^cx) / d_i^w,  ĝ_j^cy = (g_j^cy − d_i^cy) / d_i^h,
ĝ_j^w = log(g_j^w / d_i^w),  ĝ_j^h = log(g_j^h / d_i^h).
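The two matching criteria described above can be sketched as follows; the IOU threshold of 0.5 and the (x1, y1, x2, y2) box representation are illustrative assumptions rather than values prescribed by the method.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_priors(priors, gt_boxes, iou_threshold=0.5):
    """Match prior boxes to real targets using the two criteria described above:
    (1) each real target is matched to the prior box with the largest IOU;
    (2) any remaining prior box whose IOU with some real target exceeds the threshold
        is also treated as a positive match, limiting the positive/negative imbalance."""
    ious = np.array([[iou(p, g) for g in gt_boxes] for p in priors])
    matches = {}                                  # prior index -> real-target index
    for j in range(len(gt_boxes)):                # criterion 1
        matches[int(ious[:, j].argmax())] = j
    for i in range(len(priors)):                  # criterion 2
        j = int(ious[i].argmax())
        if i not in matches and ious[i, j] > iou_threshold:
            matches[i] = j
    return matches
```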
(3) Performing moving object detection on the video frame through a moving object detection algorithm, wherein in the moving object detection, the gray value of each pixel point is represented by a plurality of Gaussian distributions, and each Gaussian distribution function has different weights; if the pixel in the current video frame accords with the established Gaussian model, the pixel is considered as the background, otherwise, the pixel is considered as the foreground; then updating parameters of the Gaussian model, sequencing different Gaussian distributions according to priorities, and selecting the consistent Gaussian distribution through a set threshold value to serve as a background model;
Moving object detection is performed on the video frame through the moving object detection algorithm as follows: K Gaussian distribution functions are selected to represent the gray value of each pixel point in the image, and M of the K Gaussian distributions are selected as models describing the background; different Gaussian distributions are given different weights ω_{i,t}, where i indexes the Gaussian distributions, so i ≤ K; appropriate weights and a threshold are selected, and when the weights satisfy the threshold, the pixels conforming to those Gaussian distributions are regarded as background and the rest as foreground. Let the gray value of the pixel at a time t be X_t and express its probability density function as a combination of K Gaussian distribution functions:
P(X_t) = Σ_{i=1}^{K} ω_{i,t} · η(X_t; μ_{i,t}, σ²_{i,t})
wherein:
ω_{i,t}: represents the weight of the i-th Gaussian of the mixture model at time t, and the sum of all the weights is 1;
μ_{i,t}: represents the mean of the pixel gray value of the i-th model of the Gaussian mixture at time t;
σ²_{i,t}: represents the variance of the pixel gray value of the i-th model of the Gaussian mixture at time t;
X_t: represents the gray value of the pixel at time t;
η(·): the Gaussian probability density function.
The K Gaussian distribution functions are arranged in descending order, and the first M Gaussian distributions are then selected as background according to a preset threshold. When a new image is processed, the pixel points of the image are compared and matched against the established Gaussian mixture model; if a pixel point satisfies, for the i-th Gaussian distribution in the mixture model,
| X_t − μ_{i,t−1} | ≤ 2.5 · σ_{i,t−1}
wherein X_t represents the gray value of the pixel at time t, μ_{i,t−1} represents the mean of the pixel gray value of the i-th model of the Gaussian mixture at the previous time step, and σ_{i,t−1} represents its standard deviation, then the point is considered to match the i-th Gaussian distribution, and the parameters of the successfully matched function are updated as follows:
ω_{i,t} = (1 − α) · ω_{i,t−1} + α
μ_{i,t} = (1 − ρ) · μ_{i,t−1} + ρ · X_t
σ²_{i,t} = (1 − ρ) · σ²_{i,t−1} + ρ · (X_t − μ_{i,t})²,  with ρ = α · η(X_t; μ_{i,t}, σ²_{i,t})
wherein α (0 ≤ α ≤ 1) represents the learning rate; the larger the value, the more frequently the background in the video is updated; ω_{i,t} represents the weight of the i-th Gaussian of the mixture model at time t, and the sum of all the weights is 1; μ_{i,t} represents the mean of the pixel gray value of the i-th model at time t; σ²_{i,t} represents the variance of the pixel gray value of the i-th model at time t.
If a Gaussian distribution function is not matched by the pixel, its parameters do not need to be changed and only its weight is updated, according to:
ω_{i,t} = (1 − α) · ω_{i,t−1}
If the pixel does not match any of its Gaussian distribution functions, the pixel is judged as foreground and the Gaussian model with the smallest weight among the established models is replaced; the mean of the replacing new Gaussian function is set to the gray value of the current pixel. The weights of the updated background model are normalized, and the Gaussian distribution functions are sorted in descending order of ω/σ; the foreground is then screened according to the set threshold T: the first M Gaussian distributions satisfying the condition are set as background, and the remaining Gaussian distributions as foreground.
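The per-pixel mixture-of-Gaussians procedure described above can be sketched as follows; the number of Gaussians K, the number of background components M, the learning rate, the initial variance and the choice of the first matching Gaussian are illustrative assumptions, and the learning rate ρ is simplified to a constant rather than α·η(X_t; μ, σ²).

```python
import numpy as np

K, ALPHA, MATCH_SIGMA = 3, 0.01, 2.5   # assumed number of Gaussians, learning rate, matching band

class PixelMixture:
    """Mixture-of-Gaussians background model for a single gray-value pixel."""
    def __init__(self, first_gray):
        self.w = np.full(K, 1.0 / K)              # weights, sum to 1
        self.mu = np.full(K, float(first_gray))   # means of the K Gaussians
        self.var = np.full(K, 225.0)              # variances (initial value assumed)

    def update(self, gray):
        """Update the model with the new gray value; return True for background, False for foreground."""
        matched = np.where(np.abs(gray - self.mu) <= MATCH_SIGMA * np.sqrt(self.var))[0]
        if matched.size:
            i = int(matched[0])                   # take the first matching Gaussian (simplification)
            rho = ALPHA                           # simplified learning rate for the matched Gaussian
            self.w = (1 - ALPHA) * self.w         # unmatched weights decay by (1 - alpha)
            self.w[i] += ALPHA                    # matched weight gets (1 - alpha)*w + alpha
            self.mu[i] = (1 - rho) * self.mu[i] + rho * gray
            self.var[i] = (1 - rho) * self.var[i] + rho * (gray - self.mu[i]) ** 2
        else:
            # no Gaussian matches: replace the least-weighted one, mean set to the current gray value
            i = int(self.w.argmin())
            self.mu[i], self.var[i] = float(gray), 225.0
        self.w /= self.w.sum()                    # renormalize the weights
        order = np.argsort(-(self.w / np.sqrt(self.var)))   # sort by w / sigma, descending
        background = order[: K - 1]               # first M distributions kept as background (M = K - 1 assumed)
        return matched.size > 0 and matched[0] in background
```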
(4) Based on spatial grid positioning, combining the peripheral conditions of grids, and analyzing the behavior and the action of a target in a video frame through target detection and motion detection;
the method comprises the following steps of 1, forming a combat region corresponding to the range of 100 ten thousand single maps by 144-frame 1; according to the positions of the combat zone and the combat basic zone to which the target belongs, the behavior of the target is judged by combining the sea, land and air environment and the surrounding three-dimensional environment analysis of the grid; the schematic diagram of motion recognition is shown in fig. 3, if an object is detected at the same position in the previous frame and the current frame and no moving object is detected at the same position, the detected object is in a static state; if the target is detected in the previous frame and the target is detected near the same position in the current frame, and the moving target is detected in the area at the same time, the detected target is in a moving state; if the target is detected to be in a normal state in the previous frame, the target is detected to be in a fighting state in the same position of the current frame, and the moving target is detected to be in the same position, the detected target is in the fighting state; if the target is static or moving in the area, the target is identified as an intrusion behavior; if the target fights in the area, the target is identified as an attack.

Claims (8)

1. A video target behavior identification method based on spatial grids is characterized by comprising the following steps:
(1) Establishing a data set, wherein the data set comprises the type and the state of a target;
(2) The type and the state of the target in the video frame are identified through a target identification algorithm;
(3) Detecting a moving object of the video frame by a moving object detection algorithm;
(4) Based on spatial grid positioning, the behavior and the action of the target in the video frame are analyzed through target detection and motion detection in combination with the peripheral situation of the grid.
2. The method for identifying the behavior of the video target based on the spatial grid as claimed in claim 1, wherein: the data set in the step (1) comprises the type and the state of each target: in the process of producing the data set, the type of each target to be identified is annotated, and if a target is in a fighting posture, its label also notes that it is in a fighting state.
3. The method of claim 2, wherein the video object behavior recognition based on spatial grid is characterized in that: the step (2) specifically comprises the following steps:
(2.1) detecting through a multi-scale feature map;
(2.2) extracting features by performing convolution calculations directly on feature maps of different sizes with a convolutional network, and classifying and regressing these features;
(2.3) training the network using prior boxes.
4. The method according to claim 3, wherein the video target behavior recognition method based on the spatial grid is characterized in that: the specific steps of detecting through the multi-scale feature map in the step (2.1) are as follows: the neural network structure used for calculation is divided into six layers of feature maps on which image classification and regression are carried out; the feature maps of each layer differ in size, the feature maps at the front end of the network being larger and becoming smaller toward the back as pooling layers are added; larger-scale feature maps are used to process smaller targets, and smaller-scale feature maps are used to process larger targets.
5. The method according to claim 4, wherein the video target behavior recognition method based on the spatial grid is characterized in that: the specific steps of adopting prior boxes for network training in the step (2.3) are as follows:
setting boxes with different sizes and aspect ratios by taking pixels of the feature map as centers, wherein each pixel is provided with a plurality of prior boxes with different sizes and aspect ratios for detecting targets with different sizes and aspect ratios; training a network model by using a prior frame which is most suitable for a detection target in the picture; the size of the prior frame is linearly increased, and the following formula is satisfied:
s_k = s_min + (s_max − s_min) / (m − 1) × (k − 1),  k ∈ [1, m]
wherein m is the number of feature maps and takes the value 5, s_k represents the ratio of the size of the k-th prior box to the picture size, and s_min and s_max respectively represent the minimum and maximum values of s_k;
Matching the generated prior boxes with the real detection targets follows two criteria: the first criterion is to find, among the feature maps, the prior box with the greatest degree of overlap with a real detection target in the picture, the overlap being expressed by the IOU (Intersection over Union), and the prior box with the largest IOU value is then matched with that real detection target; the second matching criterion is intended to avoid an excessively large difference between the numbers of positive and negative samples: for the remaining prior boxes whose IOU value is not the maximum, if the IOU value between a prior box and a real target exceeds the set threshold, that prior box is also considered to match the real target. The final output of the network is the class confidence and position coordinate information of the predicted targets, so the loss function is the weighted sum of the class confidence error and the predicted position error of the predicted targets:
L(x, c, l, g) = (1/N) · [ L_conf(x, c) + α · L_loc(x, l, g) ]
wherein,
L_conf(x, c) = − Σ_{i∈Pos} x_ij^p · log(ĉ_i^p) − Σ_{i∈Neg} log(ĉ_i^0)
L_loc(x, l, g) = Σ_{i∈Pos} Σ_{m∈{cx, cy, w, h}} x_ij^p · smooth_L1(l_i^m − ĝ_j^m)
N: represents the number of positive samples, i.e. the number of prior boxes matched to real targets;
α: the weight balancing the predicted position error against the category confidence error;
x_ij^p: takes only the value 0 or 1; if it equals 1, the j-th real target in the picture is matched with the i-th prior box and the type of the real target is p;
c: the confidence of the target category;
l: the predicted values for the real target;
g: the position information of the real target;
L_conf: the category confidence error;
L_loc: the predicted position error;
L: the loss function;
Pos: the positive sample set;
Neg: the negative sample set;
ĉ_i^p: the confidence that the i-th prior box belongs to category p, where:
ĉ_i^p = exp(c_i^p) / Σ_p exp(c_i^p);
(cx, cy, w, h): the center position coordinates, the width and the height of a box;
l_i^m: the predicted value of the m-th coordinate output by the network for the i-th prior box;
ĝ_j^m: the position of the j-th real target encoded relative to the matched prior box d_i, expressed as:
ĝ_j^cx = (g_j^cx − d_i^cx) / d_i^w,  ĝ_j^cy = (g_j^cy − d_i^cy) / d_i^h,
ĝ_j^w = log(g_j^w / d_i^w),  ĝ_j^h = log(g_j^h / d_i^h).
6. The method according to claim 5, wherein the video target behavior recognition method based on the spatial grid is characterized in that: in the moving object detection in the step (3), the gray value of each pixel point is represented by a plurality of Gaussian distributions, and each Gaussian distribution function has a different weight; if a pixel in the current video frame conforms to the established Gaussian model, the pixel is considered as background, otherwise it is considered as foreground; the parameters of the Gaussian model are then updated, the different Gaussian distributions are sorted according to priority, and the conforming Gaussian distributions are selected through a set threshold as the background model.
7. The method according to claim 6, wherein the video target behavior recognition method based on the spatial grid is characterized in that: in the step (3), moving object detection is performed on the video frame through a moving object detection algorithm; K Gaussian distribution functions are selected to represent the gray value of each pixel point in the image, and M of the K Gaussian distributions are selected as models describing the background; different Gaussian distributions are given different weights ω_{i,t}, where i indexes the Gaussian distributions, so i ≤ K; appropriate weights and a threshold are selected, and when the weights satisfy the threshold, the pixels conforming to those Gaussian distributions are regarded as background and the rest as foreground. Let the gray value of the pixel at a time t be X_t and express its probability density function as a combination of K Gaussian distribution functions:
P(X_t) = Σ_{i=1}^{K} ω_{i,t} · η(X_t; μ_{i,t}, σ²_{i,t})
wherein:
ω_{i,t}: represents the weight of the i-th Gaussian of the mixture model at time t, and the sum of all the weights is 1;
μ_{i,t}: represents the mean of the pixel gray value of the i-th model of the Gaussian mixture at time t;
σ²_{i,t}: represents the variance of the pixel gray value of the i-th model of the Gaussian mixture at time t;
X_t: represents the gray value of the pixel at time t;
η(·): the Gaussian probability density function.
The K Gaussian distribution functions are arranged in descending order, and the first M Gaussian distributions are then selected as background according to a preset threshold. When a new image is processed, the pixel points of the image are compared and matched against the established Gaussian mixture model; if a pixel point satisfies, for the i-th Gaussian distribution in the mixture model,
| X_t − μ_{i,t−1} | ≤ 2.5 · σ_{i,t−1}
wherein X_t represents the gray value of the pixel at time t, μ_{i,t−1} represents the mean of the pixel gray value of the i-th model of the Gaussian mixture at the previous time step, and σ_{i,t−1} represents its standard deviation, then the point is considered to match the i-th Gaussian distribution, and the parameters of the successfully matched function are updated as follows:
ω_{i,t} = (1 − α) · ω_{i,t−1} + α
μ_{i,t} = (1 − ρ) · μ_{i,t−1} + ρ · X_t
σ²_{i,t} = (1 − ρ) · σ²_{i,t−1} + ρ · (X_t − μ_{i,t})²,  with ρ = α · η(X_t; μ_{i,t}, σ²_{i,t})
wherein α (0 ≤ α ≤ 1) represents the learning rate; the larger the value, the more frequently the background in the video is updated; ω_{i,t} represents the weight of the i-th Gaussian of the mixture model at time t, and the sum of all the weights is 1; μ_{i,t} represents the mean of the pixel gray value of the i-th model at time t; σ²_{i,t} represents the variance of the pixel gray value of the i-th model at time t.
If a Gaussian distribution function is not matched by the pixel, its parameters do not need to be changed and only its weight is updated, according to:
ω_{i,t} = (1 − α) · ω_{i,t−1}
If the pixel does not match any of its Gaussian distribution functions, the pixel is judged as foreground and the Gaussian model with the smallest weight among the established models is replaced; the mean of the replacing new Gaussian function is set to the gray value of the current pixel. The weights of the updated background model are normalized, and the Gaussian distribution functions are sorted in descending order of ω/σ; the foreground is then screened according to the set threshold T: the first M Gaussian distributions satisfying the condition are set as background, and the remaining Gaussian distributions as foreground.
8. The method according to claim 7, wherein the video target behavior recognition method based on the spatial grid is characterized in that: the combat zone corresponding to the high-scale single map range in the step (4) consists of R × C low-scale single maps, and the combat zone and the combat basic zones are divided accordingly, namely one combat zone comprises R × C combat basic zones, and the boundary lines of the combat basic zones are connected to form attack-and-defense lines and cooperative lines; according to the positions of the combat zone and the combat basic zone to which the target belongs, the behavior of the target is judged by combining the sea, land and air environment and an analysis of the three-dimensional environment around the grid; if the target is detected at the same position in the previous frame and the current frame and no moving target is detected at that position, the detected target is in a static state; if the target is detected in the previous frame and detected near the same position in the current frame, and a moving target is detected in that area at the same time, the detected target is in a moving state; if the target is detected in a normal state in the previous frame, detected in a fighting state at the same position in the current frame, and a moving target is detected at that position, the detected target is in a fighting state; if the target is static or moving in our area, it is identified as an intrusion behavior; if the target is fighting in our area, it is identified as an attack.
CN202310047339.6A 2023-01-31 2023-01-31 Video target behavior recognition method based on space grid Active CN115830515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310047339.6A CN115830515B (en) 2023-01-31 2023-01-31 Video target behavior recognition method based on space grid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310047339.6A CN115830515B (en) 2023-01-31 2023-01-31 Video target behavior recognition method based on space grid

Publications (2)

Publication Number Publication Date
CN115830515A true CN115830515A (en) 2023-03-21
CN115830515B CN115830515B (en) 2023-05-02

Family

ID=85520637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310047339.6A Active CN115830515B (en) 2023-01-31 2023-01-31 Video target behavior recognition method based on space grid

Country Status (1)

Country Link
CN (1) CN115830515B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258332A (en) * 2013-05-24 2013-08-21 浙江工商大学 Moving object detection method resisting illumination variation
CN111477034A (en) * 2020-03-16 2020-07-31 中国电子科技集团公司第二十八研究所 Large-scale airspace use plan conflict detection and release method based on grid model
CN112070035A (en) * 2020-09-11 2020-12-11 联通物联网有限责任公司 Target tracking method and device based on video stream and storage medium
CN115098993A (en) * 2022-05-16 2022-09-23 南京航空航天大学 Unmanned aerial vehicle conflict detection method and device for airspace digital grid and storage medium
CN115493591A (en) * 2022-06-13 2022-12-20 中国人民解放军海军航空大学 Multi-route planning method
CN115578668A (en) * 2022-09-15 2023-01-06 浙江大华技术股份有限公司 Target behavior recognition method, electronic device, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258332A (en) * 2013-05-24 2013-08-21 浙江工商大学 Moving object detection method resisting illumination variation
CN111477034A (en) * 2020-03-16 2020-07-31 中国电子科技集团公司第二十八研究所 Large-scale airspace use plan conflict detection and release method based on grid model
CN112070035A (en) * 2020-09-11 2020-12-11 联通物联网有限责任公司 Target tracking method and device based on video stream and storage medium
CN115098993A (en) * 2022-05-16 2022-09-23 南京航空航天大学 Unmanned aerial vehicle conflict detection method and device for airspace digital grid and storage medium
CN115493591A (en) * 2022-06-13 2022-12-20 中国人民解放军海军航空大学 Multi-route planning method
CN115578668A (en) * 2022-09-15 2023-01-06 浙江大华技术股份有限公司 Target behavior recognition method, electronic device, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
机器学习算法那些事: "Object Detection | SSD Principle and Implementation" *
杨超宇: "Research on Object Detection, Tracking and Feature Classification Based on Computer Vision" *

Also Published As

Publication number Publication date
CN115830515B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN109766830B (en) Ship target identification system and method based on artificial intelligence image processing
CN111914664A (en) Vehicle multi-target detection and track tracking method based on re-identification
CN111259930A (en) General target detection method of self-adaptive attention guidance mechanism
CN111767882A (en) Multi-mode pedestrian detection method based on improved YOLO model
CN107633226B (en) Human body motion tracking feature processing method
CN113034548A (en) Multi-target tracking method and system suitable for embedded terminal
CN113221787B (en) Pedestrian multi-target tracking method based on multi-element difference fusion
CN106933816A (en) Across camera lens object retrieval system and method based on global characteristics and local feature
CN112836639A (en) Pedestrian multi-target tracking video identification method based on improved YOLOv3 model
CN112949572A (en) Slim-YOLOv 3-based mask wearing condition detection method
CN112818905B (en) Finite pixel vehicle target detection method based on attention and spatio-temporal information
CN110334584A (en) A kind of gesture identification method based on the full convolutional network in region
CN111274964B (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN110633727A (en) Deep neural network ship target fine-grained identification method based on selective search
CN117333948A (en) End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism
CN116740652A (en) Method and system for monitoring rust area expansion based on neural network model
CN113095332B (en) Saliency region detection method based on feature learning
CN116309270A (en) Binocular image-based transmission line typical defect identification method
CN115272778A (en) Recyclable garbage classification method and system based on RPA and computer vision
CN115439926A (en) Small sample abnormal behavior identification method based on key region and scene depth
CN115830515B (en) Video target behavior recognition method based on space grid
CN114943873A (en) Method and device for classifying abnormal behaviors of construction site personnel
CN114170625A (en) Context-aware and noise-robust pedestrian searching method
CN114581769A (en) Method for identifying houses under construction based on unsupervised clustering
CN111046861B (en) Method for identifying infrared image, method for constructing identification model and application

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant