CN114627397A - Behavior recognition model construction method and behavior recognition method - Google Patents


Info

Publication number
CN114627397A
Authority
CN
China
Prior art keywords
video frame
behavior
frame sequence
network
behavior recognition
Prior art date
Legal status
Pending
Application number
CN202011432324.4A
Other languages
Chinese (zh)
Inventor
朱朝
Current Assignee
SF Technology Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by SF Technology Co Ltd filed Critical SF Technology Co Ltd
Priority to CN202011432324.4A
Publication of CN114627397A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a behavior recognition model construction method and a behavior recognition method. The method comprises the following steps: acquiring a video frame sequence set carrying behavior labels and an initial behavior recognition model, wherein the model comprises a spatio-temporal feature extraction network, a spatial feature extraction network, a feature fusion network and a convolution operation network; inputting each video frame sequence in the set into the spatio-temporal feature extraction network to obtain spatio-temporal features corresponding to the video frame sequence, and inputting the video frame sequence into the spatial feature extraction network to obtain the corresponding spatial features; inputting the spatio-temporal features and the spatial features into the feature fusion network to obtain fusion features corresponding to the video frame sequence; inputting the fusion features into the convolution operation network to obtain a behavior prediction result corresponding to the video frame sequence; and adjusting the model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior labels to obtain a trained behavior recognition model. With this method, accurate behavior recognition can be achieved.

Description

Behavior recognition model construction method and behavior recognition method
Technical Field
The present application relates to the field of computer technology, and in particular to a behavior recognition model construction method, a behavior recognition method, corresponding apparatuses, a computer device, and a storage medium.
Background
With the development of computer technology, behavior recognition has emerged as an important research direction in the field of computer vision, with broad application prospects in everyday scenarios such as surveillance and autonomous driving. For example, in the field of logistics, behavior recognition can accurately and quickly screen for violent sorting behaviors, enabling more accurate and timely reminders and guidance.
In the conventional technique, when behavior recognition is performed on video data, a common approach is to detect the persons in each video frame, link their bounding boxes into an action tube, detect the behavior to be recognized, and then classify that behavior.
However, the conventional technique suffers from inaccurate behavior recognition.
Disclosure of Invention
In view of the above, to solve the technical problem described above, it is necessary to provide a behavior recognition model construction method and a behavior recognition method that support accurate behavior recognition.
A method of behavior recognition model construction, the method comprising:
acquiring a video frame sequence set carrying behavior labels and an initial behavior recognition model, wherein the initial behavior recognition model comprises a space-time feature extraction network, a space feature extraction network, a feature fusion network and a convolution operation network;
inputting each video frame sequence in the video frame sequence set into a space-time characteristic extraction network to obtain space-time characteristics corresponding to the video frame sequence, and inputting the video frame sequence into a space characteristic extraction network to obtain space characteristics corresponding to the video frame sequence;
inputting the space-time characteristics and the space characteristics into a characteristic fusion network to obtain fusion characteristics corresponding to the video frame sequence;
inputting the fusion characteristics into a convolution operation network to obtain a behavior prediction result corresponding to the video frame sequence;
and adjusting the model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior label to obtain the trained behavior recognition model.
In one embodiment, inputting the video frame sequence into a spatial feature extraction network, and obtaining spatial features corresponding to the video frame sequence comprises:
carrying out image fusion on the video frame sequence to obtain a track picture corresponding to the video frame sequence;
and inputting the track picture into a spatial feature extraction network to obtain spatial features corresponding to the video frame sequence.
In one embodiment, the image fusion of the video frame sequence to obtain the track picture corresponding to the video frame sequence includes:
converting each video frame in the video frame sequence into a gray image, and dividing the converted video frame sequence according to the number of preset image channels to obtain a plurality of groups of gray images to be fused;
and calculating the pixel average value of each group of gray level images to be fused, and obtaining the track picture corresponding to the video frame sequence according to the pixel average value.
In one embodiment, inputting the spatio-temporal features and the spatial features into a feature fusion network to obtain fusion features corresponding to the sequence of video frames comprises:
inputting the space-time characteristics and the space characteristics into a characteristic fusion network, and determining characteristic weight distribution through the characteristic fusion network;
and obtaining fusion characteristics corresponding to the video frame sequence according to the characteristic weight distribution, the space-time characteristics and the spatial characteristics.
In one embodiment, inputting the fusion features into a convolution operation network to obtain a behavior prediction result corresponding to the sequence of video frames comprises:
inputting the fusion features into the convolution operation network to obtain candidate detection frames;
and performing threshold-based screening on the candidate detection frames according to the behavior labels to obtain a behavior prediction result corresponding to the video frame sequence.
In one embodiment, the adjusting the model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior label to obtain the trained behavior recognition model includes:
calculating a frame loss function according to a prediction detection frame and a behavior label in a behavior prediction result, and calculating a category loss function according to a prediction category and the behavior label in the behavior prediction result;
obtaining a model loss function according to the frame loss function and the category loss function;
and adjusting the model parameters of the initial behavior recognition model according to the model loss function to obtain the trained behavior recognition model.
A method of behavior recognition, the method comprising:
acquiring a video frame sequence to be identified;
inputting a video frame sequence to be recognized into a trained behavior recognition model to obtain a predicted behavior recognition result corresponding to the video frame sequence to be recognized, wherein the trained behavior recognition model is constructed according to the above behavior recognition model construction method;
and comparing the confidence of the detection frame in the predicted behavior recognition result with a preset detection frame confidence threshold to obtain a behavior recognition result.
A behavior recognition model building apparatus, the apparatus comprising:
the system comprises a sample acquisition module, a behavior annotation processing module and a behavior annotation processing module, wherein the sample acquisition module is used for acquiring a video frame sequence set carrying behavior annotation and an initial behavior identification model, and the initial behavior identification model comprises a space-time feature extraction network, a spatial feature extraction network, a feature fusion network and a convolution operation network;
the characteristic extraction module is used for inputting each video frame sequence in the video frame sequence set into a space-time characteristic extraction network to obtain space-time characteristics corresponding to the video frame sequences, and inputting the video frame sequences into a space characteristic extraction network to obtain space characteristics corresponding to the video frame sequences;
the characteristic fusion module is used for inputting the space-time characteristic and the spatial characteristic into a characteristic fusion network to obtain fusion characteristics corresponding to the video frame sequence;
the prediction module is used for inputting the fusion characteristics into a convolution operation network to obtain a behavior prediction result corresponding to the video frame sequence;
and the adjusting module is used for adjusting the model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior label to obtain the trained behavior recognition model.
A behavior recognition device, the device comprising:
the data acquisition module is used for acquiring a video frame sequence to be identified;
the recognition module is used for inputting the video frame sequence to be recognized into the trained behavior recognition model to obtain a predicted behavior recognition result corresponding to the video frame sequence to be recognized, and the trained behavior recognition model is constructed according to the behavior recognition model construction method;
and the comparison module is used for comparing the confidence of the detection frame in the predicted behavior recognition result with a preset detection frame confidence threshold value to obtain a behavior recognition result.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a video frame sequence set carrying behavior labels and an initial behavior recognition model, wherein the initial behavior recognition model comprises a space-time characteristic extraction network, a space characteristic extraction network, a characteristic fusion network and a convolution operation network;
inputting each video frame sequence in the video frame sequence set into a space-time characteristic extraction network to obtain space-time characteristics corresponding to the video frame sequence, and inputting the video frame sequence into a space characteristic extraction network to obtain space characteristics corresponding to the video frame sequence;
inputting the space-time characteristics and the space characteristics into a characteristic fusion network to obtain fusion characteristics corresponding to the video frame sequence;
inputting the fusion characteristics into a convolution operation network to obtain a behavior prediction result corresponding to the video frame sequence;
and adjusting the model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior label to obtain the trained behavior recognition model.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a video frame sequence to be identified;
inputting a video frame sequence to be recognized into a trained behavior recognition model to obtain a predicted behavior recognition result corresponding to the video frame sequence to be recognized, wherein the trained behavior recognition model is constructed according to the above behavior recognition model construction method;
and comparing the confidence of the detection frame in the predicted behavior recognition result with a preset detection frame confidence threshold to obtain a behavior recognition result.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a video frame sequence set carrying behavior labels and an initial behavior recognition model, wherein the initial behavior recognition model comprises a space-time feature extraction network, a space feature extraction network, a feature fusion network and a convolution operation network;
inputting each video frame sequence in the video frame sequence set into a space-time characteristic extraction network to obtain space-time characteristics corresponding to the video frame sequence, and inputting the video frame sequence into a space characteristic extraction network to obtain space characteristics corresponding to the video frame sequence;
inputting the space-time characteristics and the space characteristics into a characteristic fusion network to obtain fusion characteristics corresponding to the video frame sequence;
inputting the fusion characteristics into a convolution operation network to obtain a behavior prediction result corresponding to the video frame sequence;
and adjusting the model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior label to obtain the trained behavior recognition model.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring a video frame sequence to be identified;
inputting a video frame sequence to be recognized into a trained behavior recognition model to obtain a predicted behavior recognition result corresponding to the video frame sequence to be recognized, wherein the trained behavior recognition model is constructed according to the above behavior recognition model construction method;
and comparing the confidence of the detection frame in the predicted behavior recognition result with a preset detection frame confidence threshold to obtain a behavior recognition result.
According to the behavior recognition model construction method, after the video frame sequence set carrying behavior labels is obtained, spatio-temporal feature extraction and spatial feature extraction are performed on each video frame sequence to obtain the corresponding spatio-temporal features and spatial features. The spatio-temporal features and spatial features are then fused by the feature fusion network, yielding fusion features with richer feature information. The fusion features are input into the convolution operation network to obtain the behavior prediction result corresponding to the video frame sequence, and the model parameters of the initial behavior recognition model are adjusted according to the behavior prediction result and the behavior labels. In this way the initial behavior recognition model is trained into a behavior recognition model capable of accurate behavior recognition, so that using this model improves the accuracy of behavior recognition.
According to the behavior recognition method, after the video frame sequence to be recognized is obtained, it is input into the trained behavior recognition model, so that an accurate predicted behavior recognition result is obtained. The confidence of each detection frame in the predicted behavior recognition result is then compared with a preset detection frame confidence threshold to obtain the behavior recognition result, thereby achieving accurate behavior recognition.
Drawings
FIG. 1 is a schematic flow chart diagram of a behavior recognition model construction method according to an embodiment;
FIG. 2 is a flow diagram that illustrates a method for behavior recognition, according to one embodiment;
FIG. 3 is a flowchart illustrating a method for constructing a behavior recognition model according to an embodiment;
FIG. 4 is a flow diagram illustrating a behavior recognition model building method and a behavior recognition method according to an embodiment;
FIG. 5 is a block diagram showing a configuration of a behavior recognition model building apparatus according to an embodiment;
FIG. 6 is a block diagram showing the structure of a behavior recognizing apparatus according to an embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In an embodiment, as shown in fig. 1, a behavior recognition model construction method is provided, and this embodiment is illustrated by applying the method to a server, it is to be understood that the method may also be applied to a terminal, and may also be applied to a system including a terminal and a server, and is implemented by interaction between the terminal and the server. In this embodiment, the method includes the steps of:
step 102, acquiring a video frame sequence set carrying behavior labels and an initial behavior recognition model, wherein the initial behavior recognition model comprises a space-time feature extraction network, a spatial feature extraction network, a feature fusion network and a convolution operation network.
The video frame sequence refers to a sequence of video frames cut from an original surveillance video of one of several scenes in chronological order; the number of frames cut into a sequence can be set as needed. The behavior labels represent the preset behavior categories corresponding to the video frame sequences. For example, in the field of logistics, the behavior label may be violent sorting, non-violent sorting, and the like. The initial behavior recognition model refers to a behavior recognition model whose parameters have not yet been adjusted; it comprises a spatio-temporal feature extraction network, a spatial feature extraction network, a feature fusion network and a convolution operation network. The spatio-temporal feature extraction network is used for extracting spatio-temporal features, i.e. temporal-spatial features, from each video frame in the video frame sequence. For example, the spatio-temporal feature extraction network may be a 3D-CNN (three-dimensional Convolutional Neural Network) whose backbone is a 3D extension of a residual network (ResNet). The spatial feature extraction network is used for extracting spatial features from the video frames in the video frame sequence. For example, the spatial feature extraction network may be a 2D-CNN (two-dimensional Convolutional Neural Network) whose backbone is Darknet.
Specifically, after an original surveillance video is obtained, video frame sequences can be cut from it in chronological order according to a preset number of frames per sequence, behavior labels are set for each video frame sequence, and the labeled video frame sequences are stored in a preset database. When a behavior recognition model needs to be constructed, the server obtains the video frame sequence set carrying behavior labels and the initial behavior recognition model from the preset database. Setting the behavior label means marking the region containing the behavior with a rectangular box and annotating the category of the behavior.
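For illustration only, a single labeled sample might be represented as follows (the application does not prescribe any storage format; all field names and values here are hypothetical):

```python
# Hypothetical annotation record for one video frame sequence; field names are
# illustrative only -- the application does not prescribe a storage format.
sample_annotation = {
    "video_id": "warehouse_cam03_000517",              # source surveillance video (made-up ID)
    "frame_indices": [120, 121, 122, 123, 124, 125],   # 6 consecutive frames cut in time order
    "behavior_label": {
        "category": "violent_sorting",                 # preset behavior category
        "bbox": [342, 118, 506, 287],                  # rectangular box: x1, y1, x2, y2 in pixels
    },
}
```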
And 104, inputting each video frame sequence in the video frame sequence set into a space-time characteristic extraction network to obtain space-time characteristics corresponding to the video frame sequences, and inputting the video frame sequences into a space characteristic extraction network to obtain space characteristics corresponding to the video frame sequences.
Specifically, the server inputs each video frame sequence in the video frame sequence set into the spatio-temporal feature extraction network to obtain the spatio-temporal features corresponding to the video frame sequence, and inputs the video frame sequence into the spatial feature extraction network to obtain the corresponding spatial features, so that the spatio-temporal features and spatial features of the video frame sequence are obtained in parallel. Before each video frame sequence is input into the spatio-temporal feature extraction network and the spatial feature extraction network, the server performs data enhancement on each video frame in the sequence, including cropping, translation, rotation, brightness change, noise addition and the like; cropping refers to cutting the video frame around the rectangular box in the behavior label so that the labeled region stands out.
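As an illustration of the cropping enhancement described above, a minimal sketch follows; the margin value and the function name are assumptions.

```python
import numpy as np

# Illustrative cropping step: cut the frame around the labeled rectangular box
# (plus a margin) so that the labeled region stands out. Other enhancements
# mentioned above (translation, rotation, brightness change, added noise) would
# be applied in a similar per-frame fashion.
def crop_to_label(frame: np.ndarray, bbox, margin: int = 32) -> np.ndarray:
    """frame: (H, W, 3) image; bbox: (x1, y1, x2, y2) from the behavior label."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = bbox
    x1, y1 = max(0, x1 - margin), max(0, y1 - margin)
    x2, y2 = min(w, x2 + margin), min(h, y2 + margin)
    return frame[y1:y2, x1:x2]

frame = np.zeros((416, 608, 3), dtype=np.uint8)
print(crop_to_label(frame, (342, 118, 506, 287)).shape)   # (233, 228, 3)
```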
For example, when the spatio-temporal feature extraction network is a 3D-CNN, inputting bn sequences of 6 pictures each (a tensor of shape (bn, 6, w, h, c), where bn denotes the batch size, w and h the picture size, and c the number of image channels; the tensor may for example be (16, 6, 608, 416, 3)) yields a feature of size (bn, c2, 1, w, h). When the spatial feature extraction network is a 2D-CNN, inputting bn pictures (a tensor of shape (bn, w, h, c), e.g. (16, 608, 416, 3) corresponding to the above) yields a feature of size (bn, c1, w, h).
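A shape-only sketch of the two branches is given below, assuming PyTorch (the application does not name a framework) and using stand-in convolution layers and small toy sizes in place of the real backbones:

```python
import torch
import torch.nn as nn

# Shape-only sketch of the two feature branches. The real backbones (a residual
# 3D-CNN and a Darknet-style 2D-CNN) are replaced by single conv layers so the
# stated tensor shapes can be checked; small sizes replace the (16, 6, 608, 416, 3)
# example from the text.
bn, t, h, w, c = 2, 6, 48, 64, 3        # batch, frames, height, width, channels
c1, c2 = 256, 512                        # hypothetical output channel counts

# Spatio-temporal branch: (bn, c, t, h, w) -> (bn, c2, 1, h, w)
spatiotemporal = nn.Conv3d(c, c2, kernel_size=(t, 3, 3), padding=(0, 1, 1))
clip = torch.randn(bn, c, t, h, w)       # one 6-frame sequence per batch element
print(spatiotemporal(clip).shape)        # torch.Size([2, 512, 1, 48, 64])

# Spatial branch: (bn, c, h, w) -> (bn, c1, h, w), fed with the fused track picture
spatial = nn.Conv2d(c, c1, kernel_size=3, padding=1)
track_picture = torch.randn(bn, c, h, w)
print(spatial(track_picture).shape)      # torch.Size([2, 256, 48, 64])
```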
And 106, inputting the space-time characteristics and the space characteristics into the characteristic fusion network to obtain fusion characteristics corresponding to the video frame sequence.
The feature fusion network is a network for fusing the spatio-temporal features and the spatial features. For example, the feature fusion network may specifically be an attention mechanism module combining space and channels, and may learn feature weight distribution from the space-time features and the space features, and then fuse the space-time features and the space features according to the feature weight distribution.
Specifically, the server inputs the space-time characteristics and the space characteristics into the characteristic fusion network, learns the space-time characteristics and the space characteristics through the characteristic fusion network to obtain corresponding characteristic weight distribution, and fuses the space-time characteristics and the space characteristics by utilizing the characteristic weight distribution.
And step 108, inputting the fusion characteristics into a convolution operation network to obtain a behavior prediction result corresponding to the video frame sequence.
The convolution operation network refers to a network that performs behavior prediction based on the fusion features. For example, the convolution operation network may consist of consecutive convolution layers and several shortcut connections, and predicts detection frames, detection frame confidences and categories. The behavior prediction result comprises a predicted detection frame, a detection frame confidence and a predicted category, wherein the predicted detection frame can be represented by four frame coordinates, the frame coordinates being pixel coordinates on the video frame.
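The application does not spell out the internal layout of the convolution operation network; the sketch below assumes a YOLO-style head in which every spatial cell emits a fixed number of anchors, each with four frame coordinates, one confidence and the class scores:

```python
import torch
import torch.nn as nn

# Sketch of a possible convolution operation network output layout (an assumption,
# not taken from the application): each spatial cell emits A anchors x
# (4 frame coordinates + 1 confidence + K behavior classes).
A, K, c_fused = 3, 2, 768                # anchors, classes, fused channels: all illustrative

conv_net = nn.Sequential(
    nn.Conv2d(c_fused, 256, 3, padding=1), nn.LeakyReLU(0.1),
    nn.Conv2d(256, A * (4 + 1 + K), 1),  # final 1x1 conv produces the predictions
)

fused = torch.randn(2, c_fused, 13, 13)  # toy fused feature map
print(conv_net(fused).shape)             # torch.Size([2, 21, 13, 13])
```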
Specifically, the server inputs the fusion features into a convolution operation network, and obtains behavior prediction results corresponding to the video frame sequence through analysis of the fusion features by the convolution operation network.
And step 110, adjusting model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior labels to obtain a trained behavior recognition model.
Specifically, the server calculates a model loss function according to a prediction detection frame, a detection frame confidence coefficient, a prediction category and a behavior label in the behavior prediction result, and performs back propagation on the initial behavior recognition model according to the model loss function to update model parameters, so as to obtain a trained behavior recognition model. The model loss function includes a frame loss function and a category loss function, and needs to be calculated respectively.
According to the behavior recognition model construction method, after the video frame sequence set carrying the behavior labels is obtained, the spatio-temporal feature extraction and the spatial feature extraction are carried out on each video frame sequence, the spatio-temporal feature and the spatial feature corresponding to the video frame sequence are obtained, the spatio-temporal feature and the spatial feature are fused by using the feature fusion network, the fusion feature with richer feature information can be obtained, the fusion feature is input into the convolution operation network, the behavior prediction result corresponding to the video frame sequence is obtained, the model parameters of the initial behavior recognition model are adjusted according to the behavior prediction result and the behavior labels, the training of the initial behavior recognition model can be realized, the behavior recognition model capable of realizing accurate behavior recognition is obtained, and therefore the behavior recognition accuracy can be improved by using the behavior recognition model.
In one embodiment, inputting the video frame sequence into a spatial feature extraction network, and obtaining spatial features corresponding to the video frame sequence comprises:
carrying out image fusion on the video frame sequence to obtain a track picture corresponding to the video frame sequence;
and inputting the track picture into a spatial feature extraction network to obtain spatial features corresponding to the video frame sequence.
The track picture refers to a picture obtained after image fusion is performed on each video frame in a video frame sequence, and the track picture can represent a behavior track corresponding to each video frame. For example, the track picture may be an image optical flow track picture.
Specifically, the server performs image fusion on each video frame in the video frame sequence, fuses each video frame into one track picture, and performs feature extraction on the track picture by using a spatial feature extraction network to obtain spatial features corresponding to the video frame sequence.
In this embodiment, the image fusion is performed on the video frame sequence to obtain the track picture corresponding to the video frame sequence, and the track picture is input to the spatial feature extraction network to obtain the spatial feature corresponding to the video frame sequence, so that the spatial feature corresponding to the video frame sequence can be obtained.
In one embodiment, the image fusion of the video frame sequence to obtain the track picture corresponding to the video frame sequence includes:
converting each video frame in the video frame sequence into a gray image, and dividing the converted video frame sequence according to the number of preset image channels to obtain a plurality of groups of gray images to be fused;
and calculating the pixel average value of each group of gray level images to be fused, and obtaining the track picture corresponding to the video frame sequence according to the pixel average value.
The preset number of image channels refers to the number of image channels of the track picture; for example, when the track picture is an RGB (red, green, blue) picture, the preset number of image channels is 3.
Specifically, the server converts each video frame in the video frame sequence into a grayscale image, randomly divides the converted video frame sequence according to a preset image channel number, obtains multiple groups of grayscale images to be fused on the premise that the number of the images of each group of grayscale images to be fused is as average as possible, respectively calculates the pixel average value of each group of grayscale images to be fused, uses the pixel average value of each group of grayscale images to be fused as the image channel pixel of the track image corresponding to the video frame sequence, and obtains the track image according to the image channel pixel.
Specifically, taking as an example that the track picture is an RGB picture and the video frame sequence contains 6 video frames: the server converts the 6 consecutive video frames into grayscale images and then divides the converted sequence into groups according to the preset number of image channels (i.e. 3), obtaining 3 groups of grayscale images to be fused; in chronological order, the first and second grayscale images form one group, the third and fourth form another, and the fifth and sixth form the last. The server calculates the pixel average of each group by dividing the pixel-wise sum of the first and second images by 2, the sum of the third and fourth by 2, and the sum of the fifth and sixth by 2, and uses the three resulting pixel averages as the R, G and B channels of the fused track picture.
In this embodiment, each video frame in the video frame sequence is converted into a grayscale image, the converted video frame sequence is divided according to the number of preset image channels to obtain a plurality of groups of grayscale images to be fused, a pixel average value of each group of grayscale images to be fused is calculated, a track picture corresponding to the video frame sequence is obtained according to the pixel average value, and the track picture can be obtained.
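A minimal sketch of this 6-frame fusion, assuming standard luminance weights for the grayscale conversion (the application does not specify the conversion formula):

```python
import numpy as np

# Minimal sketch of fusing a video frame sequence into one track picture, as
# described above. Grayscale conversion uses standard luminance weights (assumed).
def fuse_to_track_picture(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W, 3) uint8 RGB frames, T divisible by the preset channel number 3."""
    t = frames.shape[0]
    assert t % 3 == 0, "frame count must split evenly into 3 channel groups"
    # 1. convert every frame to a grayscale image
    gray = frames.astype(np.float32) @ np.array([0.299, 0.587, 0.114], np.float32)
    # 2. split into 3 consecutive groups (frames 1-2, 3-4, 5-6 when T == 6)
    groups = gray.reshape(3, t // 3, *gray.shape[1:])
    # 3. the pixel average of each group becomes one channel of the track picture
    channels = groups.mean(axis=1)                     # (3, H, W)
    return np.clip(channels.transpose(1, 2, 0), 0, 255).astype(np.uint8)

frames = np.random.randint(0, 256, (6, 416, 608, 3), dtype=np.uint8)
print(fuse_to_track_picture(frames).shape)             # (416, 608, 3)
```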
In one embodiment, inputting the spatio-temporal features and the spatial features into a feature fusion network to obtain fusion features corresponding to the sequence of video frames comprises:
inputting the space-time characteristics and the space characteristics into a characteristic fusion network, and determining characteristic weight distribution through the characteristic fusion network;
and obtaining fusion characteristics corresponding to the video frame sequence according to the characteristic weight distribution, the space-time characteristics and the spatial characteristics.
The feature fusion network may specifically be a convolutional attention module, a simple and effective attention module for feed-forward convolutional neural networks. Given an intermediate feature map, the module infers attention maps in turn along two independent dimensions (channel and spatial) and then multiplies the attention maps with the input feature map for adaptive feature refinement; it can be seamlessly integrated into any CNN architecture and trained end to end together with the underlying CNN.
Specifically, the server inputs the spatio-temporal features and the spatial features into the feature fusion network, learns them through a series of convolution and pooling operations to determine the feature weight distribution of the spatio-temporal and spatial features, determines the feature weights corresponding to each from this distribution, and obtains the fusion features corresponding to the video frame sequence according to the feature weights, the spatio-temporal features and the spatial features. For example, when the spatial feature is (bn, c1, w, h) and the spatio-temporal feature is (bn, c2, 1, w, h), feature fusion through the feature fusion network yields a fused feature of (bn, c1 + c2, w, h).
In this embodiment, the spatio-temporal features and the spatial features are input into the feature fusion network, the feature weight distribution is determined by the feature fusion network, and the fusion features corresponding to the video frame sequence are obtained according to the feature weight distribution, the spatio-temporal features, and the spatial features, so that the fusion features can be obtained.
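A minimal sketch of the fusion step, assuming PyTorch; for brevity the attention weighting is reduced to a simple channel gate, whereas the application describes a module combining channel and spatial attention:

```python
import torch
import torch.nn as nn

# Illustrative fusion of the spatial feature (bn, c1, h, w) and the spatio-temporal
# feature (bn, c2, 1, h, w) into (bn, c1 + c2, h, w). The attention step is reduced
# to a simple channel gate for brevity; the described module also includes spatial attention.
class FeatureFusion(nn.Module):
    def __init__(self, c1: int, c2: int, reduction: int = 8):
        super().__init__()
        c = c1 + c2
        self.channel_gate = nn.Sequential(          # learns the feature weight distribution
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid(),
        )

    def forward(self, spatial: torch.Tensor, spatiotemporal: torch.Tensor) -> torch.Tensor:
        st = spatiotemporal.squeeze(2)              # (bn, c2, 1, h, w) -> (bn, c2, h, w)
        fused = torch.cat([spatial, st], dim=1)     # (bn, c1 + c2, h, w)
        return fused * self.channel_gate(fused)     # re-weight channels by learned weights

fusion = FeatureFusion(c1=256, c2=512)
out = fusion(torch.randn(2, 256, 13, 13), torch.randn(2, 512, 1, 13, 13))
print(out.shape)                                    # torch.Size([2, 768, 13, 13])
```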
In one embodiment, inputting the fusion features into a convolution operation network to obtain a behavior prediction result corresponding to the sequence of video frames comprises:
inputting the fusion features into the convolution operation network to obtain candidate detection frames;
and performing threshold-based screening on the candidate detection frames according to the behavior labels to obtain a behavior prediction result corresponding to the video frame sequence.
Specifically, the server inputs the fusion features into the convolution operation network, which performs behavior prediction based on the fusion features and outputs candidate detection frames, the confidences of the candidate detection frames, and behavior categories. After the candidate detection frames and their confidences are obtained, the server performs threshold-based screening on the candidate detection frames according to the behavior labels to obtain the behavior prediction result corresponding to the video frame sequence.
Specifically, the threshold-based screening uses a preset detection frame threshold, an IOU (Intersection over Union) threshold and the detection frame confidences; with this screening, detection frames whose confidence meets the requirement can be selected, enabling accurate prediction. The preset detection frame threshold can be set as needed. Further, the screening may proceed as follows: the server first performs a primary screening of the candidate detection frames according to the preset detection frame threshold and the candidate detection frame confidences to obtain detection frames for secondary screening, then performs IOU-threshold screening on these detection frames according to their confidences to obtain the detection frames meeting the requirement, and obtains the behavior prediction result from the detection frames meeting the requirement together with their confidences and behavior categories.
Further, the IOU-threshold screening may be performed as follows: sort the detection frames awaiting secondary screening by confidence, select the one with the highest score, and traverse the remaining detection frames; if the overlap area between a remaining detection frame and the highest-scoring one exceeds a preset proportion threshold, delete that remaining detection frame. Then select the highest-scoring detection frame among those not yet deleted and repeat the traversal, until all detection frames awaiting secondary screening have been processed, yielding the detection frames meeting the requirement. The preset proportion threshold can be set as needed; in this way, redundant detection frames among the candidates are filtered out.
In this embodiment, the candidate detection frames are obtained by inputting the fusion features into the convolution operation network, and threshold-based screening is performed on the candidate detection frames according to the behavior labels, so that the behavior prediction result corresponding to the video frame sequence can be obtained.
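The two-stage screening described above corresponds to confidence filtering followed by standard IoU-based non-maximum suppression; a minimal NumPy sketch with illustrative threshold values:

```python
import numpy as np

def filter_detections(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    """Confidence screening followed by IoU suppression (standard NMS).

    boxes: (N, 4) arrays of [x1, y1, x2, y2]; scores: (N,) confidences.
    Threshold values are illustrative only -- the application leaves them configurable.
    """
    keep = scores >= conf_thresh                      # primary screening
    boxes, scores = boxes[keep], scores[keep]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]                  # highest confidence first
    kept = []
    while order.size:
        best, rest = order[0], order[1:]
        kept.append(best)
        # IoU of the highest-scoring box with the remaining boxes
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]               # drop heavily overlapping boxes
    return boxes[kept], scores[kept]
```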
In one embodiment, the adjusting the model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior label to obtain the trained behavior recognition model includes:
calculating a frame loss function according to the predicted detection frame and the behavior label in the behavior prediction result, and calculating a category loss function according to the predicted category and the behavior label in the behavior prediction result;
obtaining a model loss function according to the frame loss function and the category loss function;
and adjusting the model parameters of the initial behavior recognition model according to the model loss function to obtain the trained behavior recognition model.
The frame loss function comprises a frame coordinate loss function and a frame confidence coefficient loss function, the frame coordinate loss function can be obtained by comparing the coordinates of the prediction detection frame in the behavior prediction result with the coordinates of the real frame in the behavior label, and the frame confidence coefficient loss function can be obtained by the confidence coefficient of the prediction detection frame in the behavior prediction result. The category loss function is used for characterizing the loss between the prediction category in the behavior prediction result and the behavior category in the behavior annotation.
Specifically, the server calculates a frame coordinate loss function from the coordinates of the predicted detection frame in the behavior prediction result and the coordinates of the real frame in the behavior label, calculates a frame confidence loss function as the mean square loss of the confidence of the predicted detection frame, and obtains the frame loss function from the frame coordinate loss function and the frame confidence loss function. It computes a category loss function by applying Focal Loss to each predicted category against the behavior category in the behavior label, obtains the model loss function from the frame loss function, the category loss function and preset loss function weights, and then performs back propagation according to the model loss function, adjusting the model parameters of the initial behavior recognition model until the model loss function converges or falls below a preset loss function threshold, yielding the trained behavior recognition model. The preset loss function weights and the preset loss function threshold can be set as needed.
For example, the frame coordinate loss function can be obtained from the formula l1_loss(xi, yi) = |xi - yi|, where xi and yi respectively denote the coordinates of the predicted detection frame and the coordinates of the real frame in the behavior label. The frame confidence loss function can be obtained from the formula lMSE_loss(xi, yi) = (xi - yi)^2, where xi and yi respectively denote the confidence of the predicted detection frame and the confidence of the real frame in the behavior label. The category loss function can be obtained with Focal Loss, for example lFL_loss(p) = -α(1 - p)^γ log(p), where p denotes the predicted probability of the behavior category given in the behavior label and α, γ are the focal loss parameters.
Further, back propagation is performed according to the model loss function: when the model parameters of the initial behavior recognition model are adjusted, the model gradient is calculated from the model loss function and propagated backwards. During back propagation, at the feature fusion network the gradient is split into two parts, and parameters are updated through the spatio-temporal feature extraction network and the spatial feature extraction network respectively.
In this embodiment, the frame loss function is calculated according to the prediction detection frame and the behavior label in the behavior prediction result, the category loss function is calculated according to the prediction category and the behavior label in the behavior prediction result, the model loss function is obtained according to the frame loss function and the category loss function, the model parameters of the initial behavior recognition model are adjusted according to the model loss function, the trained behavior recognition model is obtained, and the training of the behavior recognition model can be realized.
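A minimal sketch of the combined loss, assuming PyTorch; the loss weights and the focal parameters are illustrative and not taken from the application:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the combined loss described above: L1 loss on the frame
# coordinates, mean square loss on the frame confidence, and Focal Loss on the
# category. Loss weights and focal parameters (gamma, alpha) are illustrative.
def model_loss(pred_boxes, true_boxes, pred_conf, true_conf, pred_logits, true_class,
               w_frame=1.0, w_class=1.0, gamma=2.0, alpha=0.25):
    box_loss = F.l1_loss(pred_boxes, true_boxes)                # |x_i - y_i|
    conf_loss = F.mse_loss(pred_conf, true_conf)                # (x_i - y_i)^2
    ce = F.cross_entropy(pred_logits, true_class, reduction="none")
    p_t = torch.exp(-ce)                                        # probability of the true class
    class_loss = (alpha * (1.0 - p_t) ** gamma * ce).mean()     # focal loss
    return w_frame * (box_loss + conf_loss) + w_class * class_loss

loss = model_loss(torch.rand(8, 4), torch.rand(8, 4),           # predicted / real frame coords
                  torch.rand(8), torch.rand(8),                 # predicted / target confidence
                  torch.randn(8, 2), torch.randint(0, 2, (8,))) # class logits / class labels
print(float(loss))
```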
In an embodiment, as shown in fig. 2, a behavior recognition method is provided, and this embodiment is illustrated by applying the method to a server, and it is to be understood that the method may also be applied to a terminal, and may also be applied to a system including the terminal and the server, and is implemented by interaction between the terminal and the server. In this embodiment, the method includes the steps of:
step 202, obtaining a video frame sequence to be identified.
And 204, inputting the video frame sequence to be recognized into the trained behavior recognition model to obtain a predicted behavior recognition result corresponding to the video frame sequence to be recognized, and constructing the trained behavior recognition model according to the behavior recognition model construction method.
And step 206, comparing the confidence of the detection frame in the predicted behavior recognition result with a preset detection frame confidence threshold to obtain a behavior recognition result.
Specifically, when behavior recognition is needed, the server acquires the video frame sequence to be recognized and inputs it into the trained behavior recognition model to obtain the predicted behavior recognition result corresponding to the video frame sequence to be recognized; the predicted behavior recognition result includes detection frames, detection frame confidences and behavior categories. By comparing the confidence of each detection frame in the predicted behavior recognition result with the preset detection frame confidence threshold, the server screens out the detection frames meeting the requirement and obtains the behavior recognition result from those detection frames and their corresponding behavior categories.
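A minimal sketch of this inference step; the model call and the detection format are assumptions made only for illustration:

```python
# Hypothetical inference step: keep only detections whose confidence exceeds the
# preset detection frame confidence threshold. The model call and the detection
# format (box, confidence, category) are assumptions made for illustration.
CONF_THRESHOLD = 0.6   # preset detection frame confidence threshold (illustrative)

def recognize_behaviors(trained_model, frame_sequence):
    predictions = trained_model(frame_sequence)      # [(box, confidence, category), ...]
    return [(box, category)
            for box, confidence, category in predictions
            if confidence >= CONF_THRESHOLD]
```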
According to the behavior recognition method, after the video frame sequence to be recognized is obtained, the video frame sequence to be recognized is input into the trained behavior recognition model, so that an accurate predicted behavior recognition result can be obtained, the confidence coefficient of the detection frame in the predicted behavior recognition result can be compared with the preset detection frame confidence coefficient threshold value, the behavior recognition result can be obtained, and accurate behavior recognition is achieved.
In one embodiment, as shown in fig. 3, the behavior recognition model construction method of the present application is described by an embodiment, and the method specifically includes the following steps:
1. Data processing stage: the server obtains the video frame sequence set carrying behavior labels, performs data enhancement on each video frame sequence in the set, and performs image fusion on the enhanced video frame sequences to obtain a fused picture (i.e. a track picture);
2. Feature extraction stage: the server inputs the data-enhanced video frame sequence into the 3D-CNN network (i.e. the spatio-temporal feature extraction network) in the initial behavior recognition model to obtain the spatio-temporal features corresponding to the video frame sequence, and inputs the fused picture into the 2D-CNN network (i.e. the spatial feature extraction network) in the initial behavior recognition model to obtain the spatial features corresponding to the video frame sequence;
3. Feature fusion stage: the spatio-temporal features and the spatial features are input into the feature fusion network to obtain fusion features corresponding to the video frame sequence;
4. Processing and output stage: the fusion features are input into the convolution operation network to obtain a behavior prediction result corresponding to the video frame sequence, and the model parameters of the initial behavior recognition model are adjusted according to the behavior prediction result and the behavior labels to obtain a trained behavior recognition model.
In one embodiment, as shown in fig. 4, a behavior recognition model construction method and a behavior recognition method according to the present application are described by one embodiment, where the method specifically includes the following steps:
step 402, acquiring a video frame sequence set carrying behavior labels and an initial behavior recognition model, wherein the initial behavior recognition model comprises a space-time feature extraction network, a spatial feature extraction network, a feature fusion network and a convolution operation network;
step 404, inputting each video frame sequence in the video frame sequence set into a spatio-temporal feature extraction network to obtain spatio-temporal features corresponding to the video frame sequences, and inputting the video frame sequences into a spatial feature extraction network to obtain spatial features corresponding to the video frame sequences;
step 406, inputting the space-time characteristics and the space characteristics into a characteristic fusion network to obtain fusion characteristics corresponding to the video frame sequence;
step 408, inputting the fusion features into a convolution operation network to obtain behavior prediction results corresponding to the video frame sequence;
and step 410, adjusting model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior label to obtain a trained behavior recognition model.
Step 412, acquiring a video frame sequence to be identified;
step 414, inputting the video frame sequence to be recognized into the trained behavior recognition model to obtain a predicted behavior recognition result corresponding to the video frame sequence to be recognized;
and step 416, comparing the confidence of the detection frame in the predicted behavior recognition result with a preset detection frame confidence threshold value to obtain a behavior recognition result.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may comprise multiple sub-steps or stages, which are not necessarily executed at the same moment but may be executed at different moments, and their execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a behavior recognition model building apparatus including: a sample acquisition module 502, a feature extraction module 504, a feature fusion module 506, a prediction module 508, and an adjustment module 510, wherein:
a sample obtaining module 502, configured to obtain a video frame sequence set carrying behavior labels and an initial behavior recognition model, where the initial behavior recognition model includes a spatio-temporal feature extraction network, a spatial feature extraction network, a feature fusion network, and a convolution operation network;
a feature extraction module 504, configured to input each video frame sequence in the video frame sequence set into a spatio-temporal feature extraction network to obtain spatio-temporal features corresponding to the video frame sequence, and input the video frame sequence into a spatial feature extraction network to obtain spatial features corresponding to the video frame sequence;
a feature fusion module 506, configured to input the spatio-temporal features and the spatial features into a feature fusion network to obtain fusion features corresponding to the video frame sequence;
a prediction module 508, configured to input the fusion feature into a convolution operation network to obtain a behavior prediction result corresponding to the video frame sequence;
and an adjusting module 510, configured to adjust a model parameter of the initial behavior recognition model according to the behavior prediction result and the behavior label, to obtain a trained behavior recognition model.
According to the behavior recognition model construction device, after the video frame sequence set carrying the behavior labels is obtained, the spatio-temporal feature extraction and the spatial feature extraction are carried out on each video frame sequence, the spatio-temporal feature and the spatial feature corresponding to the video frame sequence are obtained, the spatio-temporal feature and the spatial feature are fused by using the feature fusion network, the fusion feature with richer feature information can be obtained, the fusion feature is input into the convolution operation network, the behavior prediction result corresponding to the video frame sequence is obtained, the model parameters of the initial behavior recognition model are adjusted according to the behavior prediction result and the behavior labels, the training of the initial behavior recognition model can be realized, the behavior recognition model capable of realizing accurate behavior recognition is obtained, and therefore the behavior recognition accuracy can be improved by using the behavior recognition model.
In one embodiment, the feature extraction module is further configured to perform image fusion on the video frame sequence to obtain a track picture corresponding to the video frame sequence, and input the track picture into the spatial feature extraction network to obtain spatial features corresponding to the video frame sequence.
In one embodiment, the feature extraction module is further configured to convert each video frame in the video frame sequence into a grayscale image, divide the converted video frame sequence according to a preset number of image channels to obtain a plurality of groups of grayscale images to be fused, calculate a pixel average value of each group of grayscale images to be fused, and obtain a track picture corresponding to the video frame sequence according to the pixel average values.
In one embodiment, the feature fusion module is further configured to input the spatio-temporal features and the spatial features into a feature fusion network, determine a feature weight distribution through the feature fusion network, and obtain fusion features corresponding to the video frame sequence according to the feature weight distribution, the spatio-temporal features, and the spatial features.
In one embodiment, the prediction module is further configured to input the fusion features into the convolution operation network to obtain candidate detection frames, and perform threshold-based screening on the candidate detection frames according to the behavior labels to obtain the behavior prediction result corresponding to the video frame sequence.
In one embodiment, the adjusting module is further configured to calculate a frame loss function according to the prediction detection frame and the behavior label in the behavior prediction result, calculate a category loss function according to the prediction category and the behavior label in the behavior prediction result, obtain a model loss function according to the frame loss function and the category loss function, and adjust the model parameters of the initial behavior recognition model according to the model loss function to obtain the trained behavior recognition model.
In one embodiment, as shown in fig. 6, there is provided a behavior recognition apparatus including: a data acquisition module 602, an identification module 604, and a comparison module 606, wherein:
a data obtaining module 602, configured to obtain a sequence of video frames to be identified;
the recognition module 604 is configured to input the video frame sequence to be recognized into the trained behavior recognition model to obtain a predicted behavior recognition result corresponding to the video frame sequence to be recognized, where the trained behavior recognition model is constructed according to the behavior recognition model construction method;
the comparison module 606 is configured to compare the confidence of the detection frame in the predicted behavior recognition result with a preset detection frame confidence threshold, so as to obtain a behavior recognition result.
According to the behavior recognition device, after the video frame sequence to be recognized is obtained, it is input into the trained behavior recognition model, so that an accurate predicted behavior recognition result can be obtained. The confidence of the detection frame in the predicted behavior recognition result is then compared with the preset detection frame confidence threshold to obtain the behavior recognition result, so that accurate behavior recognition can be achieved.
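For the inference-time comparison, a minimal sketch is shown below; the prediction format and the default threshold value are assumptions made for illustration.

```python
def recognize(predictions, confidence_threshold=0.5):
    # predictions: list of dicts such as
    # {"box": (x1, y1, x2, y2), "label": "throwing", "confidence": 0.87}
    # keep only detection frames whose confidence reaches the preset threshold
    return [p for p in predictions if p["confidence"] >= confidence_threshold]
```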
For the specific limitations of the behavior recognition model construction device and the behavior recognition device, reference may be made to the above limitations of the behavior recognition model construction method and the behavior recognition method, which are not repeated here. Each module in the behavior recognition model construction device and the behavior recognition device may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor in the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can invoke them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing a sequence of video frames carrying behavior annotations, and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a behavior recognition model construction method and a behavior recognition method.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution is applied; a particular computer device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a video frame sequence set carrying behavior labels and an initial behavior recognition model, wherein the initial behavior recognition model comprises a space-time feature extraction network, a space feature extraction network, a feature fusion network and a convolution operation network;
inputting each video frame sequence in the video frame sequence set into a space-time characteristic extraction network to obtain space-time characteristics corresponding to the video frame sequence, and inputting the video frame sequence into a space characteristic extraction network to obtain space characteristics corresponding to the video frame sequence;
inputting the space-time characteristics and the space characteristics into a characteristic fusion network to obtain fusion characteristics corresponding to the video frame sequence;
inputting the fusion characteristics into a convolution operation network to obtain a behavior prediction result corresponding to the video frame sequence;
and adjusting the model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior label to obtain the trained behavior recognition model.
In one embodiment, the processor, when executing the computer program, further performs the steps of: and performing image fusion on the video frame sequence to obtain a track picture corresponding to the video frame sequence, and inputting the track picture into a spatial feature extraction network to obtain spatial features corresponding to the video frame sequence.
In one embodiment, the processor, when executing the computer program, further performs the steps of: converting each video frame in the video frame sequence into a gray image, dividing the converted video frame sequence according to the number of preset image channels to obtain a plurality of groups of gray images to be fused, calculating the pixel average value of each group of gray images to be fused, and obtaining a track picture corresponding to the video frame sequence according to the pixel average value.
In one embodiment, the processor when executing the computer program further performs the steps of: inputting the space-time characteristics and the space characteristics into a characteristic fusion network, determining characteristic weight distribution through the characteristic fusion network, and obtaining fusion characteristics corresponding to the video frame sequence according to the characteristic weight distribution, the space-time characteristics and the space characteristics.
In one embodiment, the processor, when executing the computer program, further performs the steps of: and inputting the fusion characteristics into a convolution operation network to obtain candidate detection frames, and performing threshold screening on the candidate detection frames according to the behavior labels to obtain a behavior prediction result corresponding to the video frame sequence.
In one embodiment, the processor, when executing the computer program, further performs the steps of: calculating a frame loss function according to the prediction detection frame and the behavior label in the behavior prediction result, calculating a category loss function according to the prediction category and the behavior label in the behavior prediction result, obtaining a model loss function according to the frame loss function and the category loss function, and adjusting the model parameters of the initial behavior recognition model according to the model loss function to obtain a trained behavior recognition model.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring a video frame sequence to be identified;
inputting a video frame sequence to be recognized into a trained behavior recognition model to obtain a predicted behavior recognition result corresponding to the video frame sequence to be recognized, wherein the trained behavior recognition model is constructed according to the behavior recognition model construction method described above;
and comparing the confidence of the detection frame in the predicted behavior recognition result with a preset detection frame confidence threshold to obtain a behavior recognition result.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a video frame sequence set carrying behavior labels and an initial behavior recognition model, wherein the initial behavior recognition model comprises a space-time feature extraction network, a space feature extraction network, a feature fusion network and a convolution operation network;
inputting each video frame sequence in the video frame sequence set into a space-time characteristic extraction network to obtain space-time characteristics corresponding to the video frame sequence, and inputting the video frame sequence into a space characteristic extraction network to obtain space characteristics corresponding to the video frame sequence;
inputting the space-time characteristics and the space characteristics into a characteristic fusion network to obtain fusion characteristics corresponding to the video frame sequence;
inputting the fusion characteristics into a convolution operation network to obtain a behavior prediction result corresponding to the video frame sequence;
and adjusting the model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior label to obtain the trained behavior recognition model.
In one embodiment, the computer program when executed by the processor further performs the steps of: and performing image fusion on the video frame sequence to obtain a track picture corresponding to the video frame sequence, and inputting the track picture into a spatial feature extraction network to obtain spatial features corresponding to the video frame sequence.
In one embodiment, the computer program when executed by the processor further performs the steps of: converting each video frame in the video frame sequence into a gray image, dividing the converted video frame sequence according to the number of preset image channels to obtain a plurality of groups of gray images to be fused, calculating the pixel average value of each group of gray images to be fused, and obtaining a track picture corresponding to the video frame sequence according to the pixel average value.
In one embodiment, the computer program when executed by the processor further performs the steps of: inputting the space-time characteristics and the space characteristics into a characteristic fusion network, determining characteristic weight distribution through the characteristic fusion network, and obtaining fusion characteristics corresponding to the video frame sequence according to the characteristic weight distribution, the space-time characteristics and the space characteristics.
In one embodiment, the computer program when executed by the processor further performs the steps of: and inputting the fusion features into a convolution operation network to obtain candidate detection frames, and performing threshold screening on the candidate detection frames according to the behavior labels to obtain a behavior prediction result corresponding to the video frame sequence.
In one embodiment, the computer program when executed by the processor further performs the steps of: calculating a frame loss function according to the prediction detection frame in the behavior prediction result and the behavior labels, calculating a category loss function according to the prediction category in the behavior prediction result and the behavior labels, obtaining a model loss function according to the frame loss function and the category loss function, and adjusting the model parameters of the initial behavior recognition model according to the model loss function to obtain a trained behavior recognition model.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, performs the steps of:
acquiring a video frame sequence to be identified;
inputting a video frame sequence to be recognized into a trained behavior recognition model to obtain a predicted behavior recognition result corresponding to the video frame sequence to be recognized, wherein the trained behavior recognition model is constructed according to the behavior recognition model construction method described above;
and comparing the confidence of the detection frame in the predicted behavior recognition result with a preset detection frame confidence threshold to obtain a behavior recognition result.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, and the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features have been described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for constructing a behavior recognition model, the method comprising:
acquiring a video frame sequence set carrying behavior labels and an initial behavior recognition model, wherein the initial behavior recognition model comprises a spatio-temporal feature extraction network, a spatial feature extraction network, a feature fusion network and a convolution operation network;
inputting each video frame sequence in the video frame sequence set into the spatio-temporal feature extraction network to obtain spatio-temporal features corresponding to the video frame sequences, and inputting the video frame sequences into the spatial feature extraction network to obtain spatial features corresponding to the video frame sequences;
inputting the space-time characteristics and the space characteristics into the characteristic fusion network to obtain fusion characteristics corresponding to the video frame sequence;
inputting the fusion features into the convolution operation network to obtain behavior prediction results corresponding to the video frame sequence;
and adjusting the model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior labels to obtain a trained behavior recognition model.
2. The method of claim 1, wherein inputting the sequence of video frames into the spatial feature extraction network to obtain spatial features corresponding to the sequence of video frames comprises:
carrying out image fusion on the video frame sequence to obtain a track picture corresponding to the video frame sequence;
and inputting the track picture into the spatial feature extraction network to obtain spatial features corresponding to the video frame sequence.
3. The method of claim 2, wherein the image fusing the sequence of video frames to obtain a track picture corresponding to the sequence of video frames comprises:
converting each video frame in the video frame sequence into a gray image, and dividing the converted video frame sequence according to the number of preset image channels to obtain a plurality of groups of gray images to be fused;
and calculating the pixel average value of each group of gray level images to be fused, and obtaining the track picture corresponding to the video frame sequence according to the pixel average value.
4. The method of claim 1, wherein the inputting the spatio-temporal features and the spatial features into the feature fusion network to obtain fused features corresponding to the sequence of video frames comprises:
inputting the space-time characteristics and the space characteristics into the characteristic fusion network, and determining characteristic weight distribution through the characteristic fusion network;
and obtaining fusion characteristics corresponding to the video frame sequence according to the characteristic weight distribution, the space-time characteristics and the spatial characteristics.
5. The method of claim 1, wherein the inputting the fusion features into the convolution operation network to obtain a behavior prediction result corresponding to the video frame sequence comprises:
inputting the fusion features into the convolution operation network to obtain candidate detection frames;
and performing threshold screening on the candidate detection frames according to the behavior labels to obtain the behavior prediction result corresponding to the video frame sequence.
6. The method of claim 1, wherein the adjusting model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior label to obtain a trained behavior recognition model comprises:
calculating a frame loss function according to a prediction detection frame in the behavior prediction result and the behavior labels, and calculating a category loss function according to a prediction category in the behavior prediction result and the behavior labels;
obtaining a model loss function according to the frame loss function and the category loss function;
and adjusting the model parameters of the initial behavior recognition model according to the model loss function to obtain a trained behavior recognition model.
7. A method of behavior recognition, the method comprising:
acquiring a video frame sequence to be identified;
inputting the video frame sequence to be recognized into a trained behavior recognition model to obtain a predicted behavior recognition result corresponding to the video frame sequence to be recognized, wherein the trained behavior recognition model is constructed according to the method of any one of claims 1 to 6;
and comparing the confidence of the detection frame in the predicted behavior recognition result with a preset detection frame confidence threshold value to obtain a behavior recognition result.
8. A behavior recognition model construction apparatus, characterized in that the apparatus comprises:
the system comprises a sample acquisition module, a behavior annotation processing module and a behavior annotation processing module, wherein the sample acquisition module is used for acquiring a video frame sequence set carrying behavior annotations and an initial behavior identification model, and the initial behavior identification model comprises a space-time feature extraction network, a space feature extraction network, a feature fusion network and a convolution operation network;
the characteristic extraction module is used for inputting each video frame sequence in the video frame sequence set into a space-time characteristic extraction network to obtain space-time characteristics corresponding to the video frame sequences, and inputting the video frame sequences into the space characteristic extraction network to obtain space characteristics corresponding to the video frame sequences;
the feature fusion module is used for inputting the space-time feature and the spatial feature into the feature fusion network to obtain fusion features corresponding to the video frame sequence;
the prediction module is used for inputting the fusion characteristics into the convolution operation network to obtain a behavior prediction result corresponding to the video frame sequence;
and the adjusting module is used for adjusting the model parameters of the initial behavior recognition model according to the behavior prediction result and the behavior label to obtain a trained behavior recognition model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011432324.4A 2020-12-10 2020-12-10 Behavior recognition model construction method and behavior recognition method Pending CN114627397A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011432324.4A CN114627397A (en) 2020-12-10 2020-12-10 Behavior recognition model construction method and behavior recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011432324.4A CN114627397A (en) 2020-12-10 2020-12-10 Behavior recognition model construction method and behavior recognition method

Publications (1)

Publication Number Publication Date
CN114627397A true CN114627397A (en) 2022-06-14

Family

ID=81896124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011432324.4A Pending CN114627397A (en) 2020-12-10 2020-12-10 Behavior recognition model construction method and behavior recognition method

Country Status (1)

Country Link
CN (1) CN114627397A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115205763A (en) * 2022-09-09 2022-10-18 阿里巴巴(中国)有限公司 Video processing method and device
CN115205763B (en) * 2022-09-09 2023-02-17 阿里巴巴(中国)有限公司 Video processing method and device
CN116631050B (en) * 2023-04-20 2024-02-13 北京电信易通信息技术股份有限公司 Intelligent video conference-oriented user behavior recognition method and system

Similar Documents

Publication Publication Date Title
JP7236545B2 (en) Video target tracking method and apparatus, computer apparatus, program
WO2022213879A1 (en) Target object detection method and apparatus, and computer device and storage medium
RU2770752C1 (en) Method and device for training a face recognition model and a device for determining the key point of the face
TWI742382B (en) Neural network system for vehicle parts recognition executed by computer, method for vehicle part recognition through neural network system, device and computing equipment for vehicle part recognition
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN112232293B (en) Image processing model training method, image processing method and related equipment
CN111695622B (en) Identification model training method, identification method and identification device for substation operation scene
US11042990B2 (en) Automatic object replacement in an image
CN110264444B (en) Damage detection method and device based on weak segmentation
CN109583483A (en) A kind of object detection method and system based on convolutional neural networks
US10783643B1 (en) Segmentation-based damage detection
JP7439153B2 (en) Lifted semantic graph embedding for omnidirectional location recognition
CN111709471B (en) Object detection model training method and object detection method and device
CN112861917B (en) Weak supervision target detection method based on image attribute learning
CN111768415A (en) Image instance segmentation method without quantization pooling
CN110737785B (en) Picture labeling method and device
CN114677323A (en) Semantic vision SLAM positioning method based on target detection in indoor dynamic scene
CN114627397A (en) Behavior recognition model construction method and behavior recognition method
CN112242002B (en) Object identification and panoramic roaming method based on deep learning
CN111368634B (en) Human head detection method, system and storage medium based on neural network
CN112287730A (en) Gesture recognition method, device, system, storage medium and equipment
CN116385660A (en) Indoor single view scene semantic reconstruction method and system
CN113887649B (en) Target detection method based on fusion of deep layer features and shallow layer features
CN114677670B (en) Method for automatically identifying and positioning identity card tampering
CN116310328A (en) Semantic segmentation knowledge distillation method and system based on cross-image similarity relationship

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination