CN113239869A

CN113239869A - Two-stage behavior identification method and system based on key frame sequence and behavior information

Info

Publication number: CN113239869A
Application number: CN202110605394.3A
Authority: CN
Inventors: 刘芳; 李玲玲; 唐瑜; 焦李成; 陈璞华; 郭雨薇; 刘旭; 古晶
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-05-31
Filing date: 2021-05-31
Publication date: 2021-08-10
Anticipated expiration: 2041-05-31
Also published as: CN113239869B

Abstract

The invention discloses a two-stage behavior recognition method and a two-stage behavior recognition system based on a key frame sequence and behavior information, wherein the similar adjacent frames are screened out by calculating the similarity between sparse representation results of video frames to obtain the key frame sequence; calculating the similarity between the categories by using the behavior information of the behavior category labels, and dividing all the behavior categories into a plurality of large categories; constructing a two-stage behavior recognition method model and training, wherein the training of the first stage enables the network to have the capability of coarse classification, and the training of the second stage enables the network to have the capability of fine classification; and finally, identifying the video by using the trained model. The method acquires the key frame sequence of the video as input data, so that the key frame sequence contains more information, and divides the network training process and the identification process into two stages by utilizing the information of the behavior class labels, so that the network learning process is easier finally, and the identification accuracy is improved.

Description

Two-stage behavior identification method and system based on key frame sequence and behavior information

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a two-stage behavior identification method and a two-stage behavior identification system based on a key frame sequence and behavior information.

Background

With the increasing of computer computing power and the development of streaming media, video data is more and more, and people have not satisfied with the processing capability of computers on image data. It is desirable that a computer can process video data as well as image data and analyze information contained in the video data, and therefore video analysis is an important problem to be solved urgently in the field of artificial intelligence. Behavior recognition, which is one of the contents of video analysis, is also called motion recognition, and aims to analyze the behavior of a person from a video containing complete motion and recognize the motion category made by the person in the video. Unlike object recognition in static images, the research object of behavior recognition is not static but dynamic video data, so that the research object needs to pay attention to the change of the space-time motion of people or objects in the video for effective recognition, and the data is also transformed from the two-dimensional space of the static images to the three-dimensional space-time space of the dynamic video, so the complexity of behavior recognition is much higher than that of image recognition.

The existing behavior recognition methods based on deep learning are different. The behavior identification method based on the 3D convolutional network has good identification effect because the time information and the space information of the video can be simultaneously modeled. The behavior recognition method based on the 3D convolutional network is characterized in that a deep network is used as a feature extractor to extract features of an input video frame sequence, and then a classifier is used to classify the features to obtain behavior categories. However, because of the 3D convolution used in this type of method, the amount of computation and the number of parameters are both large, and the amount of computation is reduced, and only the length of the input video frame sequence can be reduced. In addition, the existing behavior recognition method based on the 3D convolutional network trains the network model end to end, and the network is directly made to learn the corresponding relation between the data and the real label, which may increase the difficulty of network learning.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a two-stage behavior recognition method and system based on a key frame sequence and behavior information, which aims at the deficiencies in the prior art, and divide the network learning process and the prediction into two stages by introducing the behavior information of a behavior category label, and select the key frame by using the sparse representation of the image, so that the input video frame sequence contains more information, and the final recognition accuracy is improved.

The invention adopts the following technical scheme:

a two-stage behavior identification method based on a key frame sequence and behavior information comprises the following steps:

s1, for all videos V in the data set^allSelecting key frames one by one to obtain a key frame sequence F corresponding to all videos^allThen divide all videos into training videos V^trainAnd a test video V^testThe corresponding key frame sequence is divided into F^trainAnd F^test；

S2, calculating the similarity among all behavior categories in the video to obtain a category similarity matrix S;

s3, dividing N categories in the behavior category C into K large categories according to the similarity matrix S obtained in the step S2;

s4, constructing a two-stage behavior recognition network model based on the key frame, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G₁Class classifier Class₁Two stage feature extractor G₂And two-stage classifier Class₂Step S1 is to train the corresponding key frame sequence F of the video^trainAnd a corresponding label Y^trainFed into two-stage line in batchesIn order to identify the network model, training is carried out by using K large classes divided in the step S3, wherein the size of each batch is B;

s5, testing the key frame sequence F corresponding to the video in the step S1^testAnd S4, sending the two-stage behavior recognition network model trained in the step S4 to obtain the behavior category of the test video.

Specifically, in step S1, all videos V in the data set^allSelecting the key frames one by one to obtain a key frame sequence which specifically comprises the following steps:

s101, converting a video v into a video frame sequence [ x ]₁,x₂,…,x_i,…,x_T]Where T represents the length of the video frame sequence, x_iRepresents the ith video frame in the video v;

s102, processing each original video frame to obtain a processed video frame sequence [ x'₁,x′₂,…,x′_i,…,x′_T]；

S103, obtaining a video frame sequence [ x 'by utilizing a K-SVD algorithm'₁,x′₂,…,x′_i,…,x′_T]Is sparse representation [ alpha ]₁,α₂,…,α_T]；

And S104, acquiring a key frame sequence.

Further, step S104 specifically includes:

s1041, calculating the similarity matrix S' between frames belongs to R^T×T；

S1042, selecting a first frame x₁Put in the sequence f of key frames_selectAnd x is₁As the current frame x_nowThen set the maximum interval length tau, at which time f_select＝[x₁]，x_now＝x₁；

S1043, traversing x_nowSelecting x from the later tau frames according to the similarity matrix S_nowAdding a frame with lowest similarity into the key frame sequence f_selectAnd as new x_now；

S1044 and repeating the step S1043 until all frames are traversed to obtain a final key frame sequence f_select。

Specifically, in step S2, the obtaining of the category similarity matrix S specifically includes:

s201, obtaining a sentence vector Vec ═ { Vec ═ Vec of each behavior category label by using a BERT model₁,vec₂,…,vec_i,…,vec_N}，vec_iIndicates the ith category c_iA corresponding sentence vector;

s202, calculating the similarity sim (i, j) ═ cos (vec) between two different categories_i,vec_j) Cos () represents the cosine distance;

and S203, constructing a category similarity matrix S according to the similarity between different categories.

Further, in step S203, the similarity S_i,jThe method specifically comprises the following steps:

wherein r is a threshold value, and C is the number of categories.

Specifically, step S4 specifically includes:

s401, constructing a two-stage behavior recognition network model based on key frames, and a one-stage feature extractor G₁A feature extractor in 3D-ResNet 34; one-stage classifier Class₁The system comprises an input layer, a global 3D pooling layer and a full-connection layer which are connected in sequence; two-stage feature extractor G₂The system comprises K lightweight feature extractors, wherein each lightweight feature extractor comprises an input layer, a first 3D convolutional layer and a second 3D convolutional layer which are sequentially connected; two-stage classifier Class₂Then K classifiers are included, and each classifier comprises an input layer, a global 3D pooling layer and a full connection layer which are sequentially connected;

s402, training a stage feature extractor G₁And a stage classifier Class₁；

S403, training two-stage feature extractor G₂And two-stage classifier Class₂。

Further, in steps S402 and S403, a one-stage feature extractor G is trained₁And a stage classifier Class₁And two-stage feature extractor G₂And two-stage classifier Class₂The method specifically comprises the following steps:

setting the training batch size B as 32 and the iteration number epoch as 100; from a set F of training key frame sequences^trainAnd a corresponding label Y^trainRandomly sampling a batch of B samples

B sampled samples are sent to a one-stage feature extractor G₁To obtain G₁The output characteristics are sent into a Class classifier of a first stage₁Obtaining a primary classification result, and then sending the output features into a two-stage feature extractor G according to the primary classification result₂Corresponding feature extractor and two-stage classifier Class₂The classifier in (1) to obtain a final classification result; using cross entropy loss function L of similarity_simCalculating the loss value of the current batch, and updating G by using a batch random gradient descent method₁And Class₁Network parameters of (1) using a cross entropy loss function L_simCalculating the loss value of the current batch, and fixing G₁And Class₁Using a batch stochastic gradient descent method to update G₂And Class₂The network parameter of (2); repeating the steps until the iteration times epoch are reached; output a stage feature extractor G₁Weight of (2)

And a stage classifier Class₁Weight of (2)

Output two-stage feature extractor G₂Weight of (2)

And two-stage classifier Class₂Weight of (2)

Further, the similarity cross entropy loss function used in a stage of training is:

the cross entropy loss function used in the two-stage training process is:

wherein B represents the number of samples in each batch, C represents the number of categories, y_iRepresenting the real class, p, of the ith sample_icThe representation model predicts the probability that the ith sample is of class c.

Specifically, step S5 specifically includes:

s501, sending the key frame sequence to be predicted to the stage feature extractor G in the step S4₁Then G is added₁The extracted features are fed to a one-stage classifier Class in step S4₁Obtaining a preliminary classification result;

s502, determining to use the two-stage feature extractor G according to the primary classification result obtained in the step S501₂Which feature extractor and two-stage classifier Class in₂Which classifier of (1);

judging which big class the sent key frame sequence belongs to according to the preliminary classification result obtained in the step S501, and if the key frame sequence belongs to the kth big class, using G₂Kth feature extractor and Class₂The kth classifier in (1).

S503, converting G in the step S501₁The output of step S502 is sent to determine good G₂The feature extractor in (1) further extracts features, and then sends the further extracted features to step S502 to determine the Class₂The classifier in (1) obtains the final classification result.

Another technical solution of the present invention is a face recognition system based on spatiotemporal feature fusion and difficult sample mining, comprising:

a selection module for all videos V in the data set^allSelecting key frames one by one to obtain a key frame sequence F corresponding to all videos^allThen divide all videos into training videos V^trainAnd a test video V^testThe corresponding key frame sequence is divided into F^trainAnd F^test；

The computing module is used for computing the similarity among all behavior categories in the video to obtain a category similarity matrix S;

the dividing module is used for dividing N categories in the behavior category C into K major categories according to the similarity matrix S obtained by the calculating module;

a network module for constructing a two-stage behavior recognition network model based on the key frame, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G₁Class classifier Class₁Two stage feature extractor G₂And two-stage classifier Class₂Selecting a key frame sequence F corresponding to the module training video^trainAnd a corresponding label Y^trainSending the data into a two-stage behavior recognition network model in batches, and training by using K major classes divided by a dividing module, wherein the size of each batch is B;

an identification module for selecting the key frame sequence F corresponding to the module test video^testAnd sending the two-stage behavior recognition network model which is trained by the network module to obtain the behavior category of the test video.

Compared with the prior art, the invention has at least the following beneficial effects:

compared with the existing method of intercepting a video frame sequence randomly in sequence as input data and carrying out end-to-end training, the two-stage behavior identification method based on the key frame sequence and the behavior information utilizes the sparse representation result of each video frame to calculate the similarity between the frames and screen out the similar frames to obtain the key frame sequence, and the key frame sequence is used as the input data to avoid information redundancy in the input data. Through the similarity between the categories, a plurality of specific categories are aggregated into a large category, and then in the network model training process, the training is carried out in two stages: the first stage aims at learning the difference between different large categories, so that the model has the capability of rough classification; the second stage aims at learning the difference between similar categories in the same large category, so that the model has the capability of fine classification. Through the two-stage training mode, the learning process of the network is more reasonable, and the learning difficulty of the network is also reduced.

Furthermore, the obtained key frame sequence can screen out repeated frames contained in the original video frame sequence, so that the redundancy degree of information is reduced to the minimum, a network receives more information, and the accuracy of final identification is improved.

Furthermore, the final time span of the key frame sequence can be maximized and more information can be contained by screening out the current frame and the frame which is least similar to the frame tau in the following frames to gradually construct the key frame sequence.

Further, the calculation of the similarity of each category through the label of each behavior category can greatly reduce the calculation amount, compared with the calculation of the visual similarity between each video to obtain the similarity between each category.

Furthermore, the constructed similarity matrix S can be used when a loss value is calculated, so that the information of the behavior category label is indirectly used in the learning process of the network, the information quantity accepted by the network is increased, and the final identification accuracy can be improved.

Further, a first-stage feature extractor is used for extracting basic features, a second-stage feature extractor is used for extracting different features in each similar category from the basic features, a first-stage classifier is used for rough classification, and a second-stage classifier is used for fine classification. Each module has different functions and separately learns and updates parameters, so that the learning difficulty of the whole network is reduced.

Further, use of G₁Extracting basic features, classifying the basic features by the classifier in the process of only needing to classify K classes (namely, selecting one from the K classes), and then selecting G according to a primary classification result₂And Class₂The proper feature extractor and classifier are adopted to obtain the final classificationAs a result, the process classifier also only needs to select one category from a few categories as the predicted final category. Compared with other one-stage identification methods, the two-stage method only needs to select one class from a small number of classes as a prediction result each time classification is carried out, and the training process of the network is easy.

Further, because the similarity matrix is added into the similarity cross entropy loss function, the common features among similar categories are noticed by the one-stage feature extractor in the training process, so that the one-stage classifier has the capability of rough classification, and the cross entropy loss function is used in the two stages, so that the two-stage feature extractor can notice different features among similar categories and the two-stage classifier has the capability of fine classification.

Further, firstly pass through Class₁And G₁Obtaining a preliminary classification result, performing rough classification on the video, and then performing Class classification₂And G₂And obtaining a final classification result. Compared with the method for directly identifying the final result, the identification method is simpler and easier, each stage is only responsible for one relatively easy task, and the overall fault tolerance rate is high, so that the final identification accuracy rate of the method is improved.

In summary, the invention selects the key frame sequence containing more information through the similarity between the video frames, calculates the similarity between the categories by using the information of the behavior category labels, divides all the behavior categories into a plurality of large categories, further divides the behavior identification process into two stages, firstly carries out coarse classification and then carries out fine classification, and improves the classification accuracy.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

FIG. 1 is a network block diagram of a two-stage behavior recognition method based on a sequence of key frames and behavior information according to the present invention;

FIG. 2 is a diagram of a sequence of frames in contrast to a sequence of key frames extracted in the present invention;

FIG. 3 is a comparison of an original video frame and a reconstructed video frame;

FIG. 4 is a schematic diagram of a two-stage behavior recognition network architecture based on a key frame sequence and behavior information according to the present invention; wherein, (a) is a network architecture of a stage feature extractor G _ 1; (b) a stage classifier Class (the network architecture of the stage classifier) (1); (c) a network architecture of a two-stage feature extractor G _2 and a two-stage classifier (Class) ("Class") -2;

FIG. 5 is a diagram of the recognition result of a two-stage behavior recognition method for a partial video frame sequence based on a key frame sequence and behavior information;

FIG. 6 is a category activation graph for a two-stage behavior recognition network based on a sequence of key frames and behavior information when recognizing two similar categories.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers and their relative sizes and positional relationships shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, according to actual needs.

The invention provides a two-stage behavior identification method based on a key frame sequence and behavior information, which is used for selecting key frames of videos to obtain the key frame sequence corresponding to each video; calculating the similarity between the categories by using the information of the behavior category labels, and constructing a category similarity matrix; dividing all behavior categories into a plurality of large categories according to the similarity among the behavior categories, wherein the specific behavior categories contained in each large category are similar; constructing and training a two-stage behavior recognition network model based on key frames, wherein the network model comprises a one-stage feature extractor, a one-stage classifier, a two-stage feature extractor and a two-stage classifier, the one-stage feature extractor and the one-stage classifier are trained in the first stage, and the two-stage feature extractor and the two-stage classifier are trained in the second stage; inputting a key frame sequence corresponding to a test video to obtain a behavior category corresponding to the video; the invention obtains the similarity between adjacent frames by using the sparse representation result of each video frame, then selects the key frame sequence according to the interframe similarity and uses the key frame sequence as the input data of the network model, so that the network model can obtain more input information for identification.

Referring to fig. 1, the present invention provides a two-stage behavior recognition method based on a sequence of key frames and behavior information, including the following steps:

s1, for all videos V in the data set^allSelecting key frames one by one to obtain a key frame sequence F corresponding to all videos^allThen divide all videos into training videos V^trainAnd a test video V^testThe corresponding key frame sequence is divided into F^trainAnd F^testWherein, in the step (A),

referring to fig. 2, the sequence of the video frames and the sequence of the key frames selected by the key frames are sequentially cut, and the sequence of the key frames can better show the whole action and contain more information under the condition of equal length. In the video frame sequence intercepted in sequence, the difference between adjacent frames is small, and finally, the contained information amount is small, so in the invention, similar frames are screened out by utilizing the similarity between the adjacent frames, and finally, the key frame sequence is obtained.

The key frame selection for each video to obtain a key frame sequence specifically includes:

s101, converting a video v into a video frame sequence [ x ]₁,x₂,…,x_i,…,x_T]Where T represents the length of the sequence of video frames, x_iRepresents the ith video frame in the video v;

Specifically, the method comprises the following steps: converting each original frame into a gray-scale image, cutting the gray-scale image into N image blocks with the same size, drawing each small image block into a one-dimensional vector, and processing to change the size of each video frame from W × H × 3 to L × N, wherein L is W × H/N.

And S104, acquiring a key frame sequence.

S1041, calculating the similarity matrix S' between frames belongs to R^T×T；

First, the sparse representation result of each frame is pulled into a one-dimensional vector to obtain [ alpha'₁,α′₂,α′₃,…,α′_T]Then, utilizing S'_i,j＝1-cos(α′_i,α′_j) The similarity between two frames is calculated in turn, where cos () represents the cosine distance, S'_i,jRepresenting the similarity of the ith frame and the jth frame;

s1042, selecting a first frame x₁Put in the sequence f of key frames_selectAnd x is₁As the current frame x_now. Then set the maximum interval length τ, at which time f_select＝[x₁]，x_now＝x₁；

S1043, traversing x_nowSelecting x from the later tau frames according to the similarity matrix S_nowAdding a frame with lowest similarity into the key frame sequence f_selectAnd treat it as a new x_now；

S2, calculating the similarity among the categories of all the behaviors in the video to obtain a category similarity matrix S;

the behavior class is C ═ { C ═ C₁,c₂,…,c_i,…,c_NIs the similarity matrix S e R^N×NR represents a real number field, and N represents the number of behavior categories; s_i,jRepresenting the similarity of the ith and jth categories.

The category similarity matrix S is calculated as follows:

s201, obtaining each behavior category label by using BERT modelSentence vector Vec ═ { Vec ═ Vec₁,vec₂,…,vec_i,…,vec_NIn which vec_iIndicates the ith category c_iA corresponding sentence vector;

s202, calculating the similarity sim (i, j) ═ cos (vec) between two different categories_i,vec_j) Where cos () represents the cosine distance;

s203, constructing a category similarity matrix S e R according to the similarity between different categories^C×CThe specific calculation formula is as follows:

where r is a threshold value and the range is [0,1 ].

k major classes are C ═ C'₁,C′₂,…,C′_i,…,C′_K}，C′_iRepresents the ith major class;

all behavior classes C ═ { C ═ C₁,c₂,…,c_i,…,c_NAre divided into K general classes C '═ C'₁,C′₂,…,C′_i,…,C′_KIn the training process of step S4, there are two classifiers, Class₁For coarse classification, i.e. the predicted result is { C'₁,C′₂,…,C′_i,…,C′_KOne of them, and Class₂For predicting the final class, i.e. the predicted result is c₁,c₂,…,c_i,…,c_NOne of them.

S4, constructing a two-stage behavior recognition network model based on the key frame, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G₁Class classifier Class₁Two stage feature extractor G₂And two-stage classifier Class₂Step S1 is to train the corresponding key frame sequence F of the video^trainAnd a corresponding label Y^trainSending the data into a two-stage behavior recognition network model in batches, and dividing the K large classes by using the step S3Training, wherein the size of each batch is B;

wherein, a stage feature extractor G₁Taking a key frame sequence as input, and outputting the key frame sequence as an extracted feature graph; one-stage classifier Class₁With G₁Is used as an input, and the output is the probability distribution of the preliminary classification.

Two-stage feature extractor G₂With G₁The output of (2) is used as input, and the output is a feature map after further feature extraction; two-stage classifier Class₂With G₂Is used as input and the output is the probability distribution of the final classification.

S402, training a stage feature extractor G₁And a stage classifier Class₁；

The training data is F^trainAll key frame sequences in (1) and corresponding labels Y^trainFirst, to G₁Inputting a batch of key frame sequences, and then G₁Is input to Class₁Obtaining the probability distribution of the primary classification, then calculating the loss by using the probability distribution of the primary classification and the real label, and updating G through a random gradient descent algorithm₁And Class₁The process is repeated until the loss no longer drops and the model converges.

The similar cross entropy loss function used during the first stage training is:

wherein B represents the number of samples in each batch, C represents the number of categories, y_iRepresenting the true class of the ith sample, S_yi,cIs the value in the category similarity matrix S, representing the y-th_iSimilarity of class and class c, p_icThe representation model predicts the probability that the ith sample is of class c.

Cross entropy loss function L by similarity_simApplying a batch random gradient descent method to a stage feature extractor G₁And a stage classifier Class₁Training is carried out to obtain the trained network parameters, and the training method comprises the following steps:

step 1, setting the size B of a training batch to be 32 and the iteration number epoch to be 100;

step 2, from the training key frame sequence set F^trainAnd a corresponding label Y^trainRandomly sampling a batch of B samples

Step 3, sending the sampled B samples into a one-stage feature extractor G₁To obtain G₁The output characteristics are sent into a Class classifier of a first stage₁Obtaining a preliminary classification result;

step 4, utilizing a similar cross entropy loss function L_simCalculating the loss value of the current batch, and updating G by using a batch random gradient descent method₁And Class₁The network parameter of (2);

step 5, repeating the steps 2 to 4 until the iteration time epoch is reached;

step 6, outputting a stage feature extractor G₁Weight of (2)

And a stage classifier Class₁Weight of (2)

During this training, G₁And Class₁Is kept unchanged, only updates G are learned₂And Class₂The training data is still F^trainAll key frame sequences in (1) and corresponding labels Y^trainFirstly, inputting a batch of key frame sequence into G₁And Class₁Obtaining a preliminary classification result, judging which big class each key frame sequence in the batch belongs to according to the preliminary classification result, and further determining that each key frame sequence is in G₂And Class₂The feature extractor and classifier in (1), then G₁The output characteristic diagram is sent to the determined G₂Mid-feature extractor and Class₂The classifier in (1) obtains the final classification result, finally calculates the loss, and uses gradient descent algorithm to update G₂And Class₂Until the model converges.

The loss function used during the two-stage training process is:

wherein S is_yi,cIs the value in the class similarity matrix S, B represents the number of samples in each batch, C represents the number of classes, y_iRepresenting the real class, p, of the ith sample_icThe representation model predicts the probability that the ith sample is of class c.

By cross entropy loss function L_CCEThe two-stage feature extractor G is subjected to a batch random gradient descent method₂And two-stage classifier Class₂Training to obtain the trained network parameters, wherein the training method comprises the following steps:

step 2, from the training key frame sequence set F^trainAnd a corresponding label Y^trainRandomly sampling B batchesSample(s)

Step 3, sending the sampled B samples into a one-stage feature extractor G₁To obtain G₁The output characteristics are sent into a Class classifier of a first stage₁Obtaining a preliminary classification result, and classifying G according to the major class to which the preliminary classification result belongs₁Output feature map of (1) is sent to G₂And G is extracted₂Is sent to Class₂The corresponding classifier in the step (2) obtains a final classification result;

step 4, utilizing a cross entropy loss function L_simCalculating the loss value of the current batch, and fixing G₁And Class₁Only using a batch stochastic gradient descent method to update G₂And Class₂The network parameter of (2);

step 5, repeating the steps 2 to 4 until the iteration time epoch is reached;

step 6, outputting a two-stage feature extractor G₂Weight of (2)

And two-stage classifier Class₂Weight of (2)

S5, testing the key frame sequence F corresponding to the video in the step S1^testAnd sending the two-stage behavior recognition network model trained in the step S4 to obtain the behavior category of the test video.

In another embodiment of the present invention, a two-stage behavior recognition system based on a key frame sequence and behavior information is provided, which can be used to implement the two-stage behavior recognition method based on a key frame sequence and behavior information described above.

Wherein the selection module selects all videos V in the data set^allSelecting key frames one by one to obtain a key frame sequence F corresponding to all videos^allThen divide all videos into training videos V^trainAnd a test video V^testThe corresponding key frame sequence is divided into F^trainAnd F^test；

a network module for constructing a two-stage behavior recognition network model based on the key frame, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G₁Class classifier Class₁Two stage feature extractor G₂And two-stage classifier Class₂Selecting a key frame sequence F corresponding to the module training video^trainAnd a corresponding label Y^trainIs fed in two batchesIn the stage behavior recognition network model, training is carried out by utilizing K large classes divided by a dividing module, and the size of each batch is B;

In yet another embodiment of the present invention, a terminal device is provided that includes a processor and a memory for storing a computer program comprising program instructions, the processor being configured to execute the program instructions stored by the computer storage medium. The Processor may be a Central Processing Unit (CPU), or may be other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable gate array (FPGA) or other Programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc., which is a computing core and a control core of the terminal, and is adapted to implement one or more instructions, and is specifically adapted to load and execute one or more instructions to implement a corresponding method flow or a corresponding function; the processor according to the embodiment of the present invention may be used for the operation of the two-stage behavior recognition method based on the key frame sequence and the behavior information, and includes:

for all videos V in the data set^allSelecting key frames one by one to obtain a key frame sequence F corresponding to all videos^allThen divide all videos into training videos V^trainAnd a test video V^testThe corresponding key frame sequence is divided into F^trainAnd F^test(ii) a Calculating the similarity between all behavior categories in the video to obtain a category similarity matrix S; dividing N categories in the behavior category C into K large categories according to the similarity matrix S; constructing a two-stage behavior recognition network model based on key frames, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G₁Class classifier Class₁Second order, ofSegment feature extractor G₂And two-stage classifier Class₂A sequence F of key frames corresponding to the training video^trainAnd a corresponding label Y^trainSending the data into a two-stage behavior recognition network model in batches, and training by using K divided major classes, wherein the size of each batch is B; a key frame sequence F corresponding to the test video^testAnd sending the two-stage behavior recognition network model which is trained to obtain the behavior category of the test video.

In still another embodiment of the present invention, the present invention further provides a storage medium, specifically a computer-readable storage medium (Memory), which is a Memory device in a terminal device and is used for storing programs and data. It is understood that the computer readable storage medium herein may include a built-in storage medium in the terminal device, and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also, one or more instructions, which may be one or more computer programs (including program code), are stored in the memory space and are adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory.

One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to perform the corresponding steps of the above embodiments with respect to a two-stage behavior recognition method based on a sequence of key frames and behavior information; one or more instructions in the computer-readable storage medium are loaded by the processor and perform the steps of:

for all videos V in the data set^allSelecting key frames one by one to obtain a key frame sequence F corresponding to all videos^allThen divide all videos into training videos V^trainAnd a test video V^testThe corresponding key frame sequence is divided into F^trainAnd F^test(ii) a Computing all in videoObtaining a category similarity matrix S according to the similarity between the behavior categories; dividing N categories in the behavior category C into K large categories according to the similarity matrix S; constructing a two-stage behavior recognition network model based on key frames, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G₁Class classifier Class₁Two stage feature extractor G₂And two-stage classifier Class₂A sequence F of key frames corresponding to the training video^trainAnd a corresponding label Y^trainSending the data into a two-stage behavior recognition network model in batches, and training by using K divided major classes, wherein the size of each batch is B; a key frame sequence F corresponding to the test video^testAnd sending the two-stage behavior recognition network model which is trained to obtain the behavior category of the test video.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The effect of the present invention can be further illustrated by the following simulation results

1. Simulation conditions

The hardware conditions of the simulation of the invention are as follows: the intelligent sensing and image understanding laboratory graphic workstation carries 4 GPUs with 12G video memories; a data set used by simulation of the invention is a UCF101 data set, the data set comprises 13320 videos with the resolution of 320 x 240 of 101 action categories, after the videos are divided according to a data set division mode provided by the authority of the data set, a training set comprises 9537 videos, and a test set comprises 3783 videos.

2. Simulation content and results

The experiment is carried out under the simulation condition by using the method of the invention, firstly, the video frames contained in each video in the data set are sparsely represented to obtain corresponding sparse representation results, for example, fig. 3 is a comparison graph of the original video frames and the video frames reconstructed according to the sparse representation results, the first line is the original video frames, the second line is the reconstructed video frames, then, the similarity between the frames is calculated according to the sparse representation results of the video frames, and the similar frames are screened out to obtain the key frame sequence. And then all 101 behavior categories are divided into 10 major categories by calculating a category similarity matrix, and a two-stage behavior recognition network model is trained by using a key frame sequence corresponding to the video in the training set. After training, using a trained network model to identify videos in a test set, fig. 5 shows the identification results of a partial video frame sequence in two stages, when viewed from top to bottom, a first row and a second row are video frame sequences randomly intercepted in sequence, a third row and a fourth row are key frame sequences, and fig. 6 shows two similar categories in a data set: the category activation maps of the video frames of eye makeup and lip makeup in the two-stage recognition process are shown, the first row from top to bottom is the category activation map of 'eyebrow makeup', and the second row is the category activation map of 'lip makeup'.

As can be seen from fig. 5 and fig. 6, in the two-stage behavior recognition network model, in the initial recognition process of the first stage, the network model pays more attention to the commonality between similar categories, and although the sequence of video frames is easily recognized incorrectly, the network model is also mistakenly recognized as a similar category.

Table 1 shows the comparison result between the final recognition accuracy of the test set in the UCF101 dataset identified by the method of the present invention and other methods.

TABLE 1

Method	Accuracy (%)
		IDT	85.90
Temporal stream network	83.70
		LRCN	82.90
C3D	76.02
		3D-ResNet18	83.51
3D-ResNet34	83.69
		The method of the invention	87.23

From the results in table 1, the present invention achieves good classification results.

In summary, the two-stage behavior recognition method and system based on the key frame sequence and the behavior information calculate the inter-frame similarity by using the sparse representation result of the video frame, and further screen out the key frame sequence as the input of the network model, so that the information content of the input video frame sequence is effectively increased under the condition of unchanged length; the method uses the behavior information of the behavior category labels to calculate the similarity between categories, divides the similar categories into the same large category, further divides the identification process into two stages, firstly carries out rough classification and then carries out fine classification, so that the model corresponding to each stage only learns the corresponding capacity, the learning process of the whole network model is easier, and good identification accuracy can be achieved.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. The two-stage behavior identification method based on the key frame sequence and the behavior information is characterized by comprising the following steps of:

s4, constructing a two-stage behavior recognition network model based on the key frame, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G₁Class classifier Class₁Two stage feature extractor G₂And two-stage classifier Class₂Step S1 is to train the corresponding key frame sequence F of the video^trainAnd a corresponding label Y^trainSending the data into a two-stage behavior recognition network model in batches, and training by using K classes divided in the step S3, wherein the size of each batch is B;

s5, testing the key frame sequence F corresponding to the video in the step S1^testTraining in step S4And obtaining the behavior category of the test video by the trained two-stage behavior recognition network model.

2. The method according to claim 1, wherein in step S1, all videos V in the data set^allSelecting the key frames one by one to obtain a key frame sequence which specifically comprises the following steps:

s101, converting a video v into a video frame sequence x₁,x₂,…,x_i,…,x_T-, T denotes the length of the video frame sequence, x_iRepresents the ith video frame in the video v;

And S104, acquiring a key frame sequence.

3. The method according to claim 2, wherein step S104 is specifically:

s1041, calculating the similarity matrix S' between frames belongs to R^T×T；

4. The method according to claim 1, wherein in step S2, the obtaining of the category similarity matrix S specifically includes:

5. The method according to claim 4, wherein in step S203, the similarity S_i,jThe method specifically comprises the following steps:

wherein r is a threshold value, and C is the number of categories.

6. The method according to claim 1, wherein step S4 is specifically:

s402, training a stage feature extractor G₁And a stage classifier Class₁；

7. The method of claim 6, wherein in steps S402 and S403, a one-stage feature extractor G is trained₁And a stage classifier Class₁And two-stage feature extractor G₂And two-stage classifier Class₂The method specifically comprises the following steps:

And a stage classifier Class₁Weight of (2)

Output two-stage feature extractor G₂Weight of (2)

And two-stage classifier Class₂Weight of (2)

8. The method of claim 7, wherein the similarity cross entropy loss function used during a stage of training is:

the cross entropy loss function used in the two-stage training process is:

9. The method according to claim 1, wherein step S5 is specifically:

determining the key frame to be sent according to the preliminary classification result obtained in step S501To which major class the sequence belongs, if it belongs to the kth major class, G is used₂Kth feature extractor and Class₂The kth classifier in (1);

10. A two-stage behavior recognition system based on a sequence of key frames and behavior information, comprising: