CN113239869A - Two-stage behavior identification method and system based on key frame sequence and behavior information - Google Patents

Two-stage behavior identification method and system based on key frame sequence and behavior information

Info

Publication number
CN113239869A
CN113239869A (application CN202110605394.3A)
Authority
CN
China
Prior art keywords
stage
class
behavior
key frame
frame sequence
Prior art date
Legal status
Granted
Application number
CN202110605394.3A
Other languages
Chinese (zh)
Other versions
CN113239869B (en)
Inventor
刘芳
李玲玲
唐瑜
焦李成
陈璞华
郭雨薇
刘旭
古晶
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN202110605394.3A priority Critical patent/CN113239869B/en
Publication of CN113239869A publication Critical patent/CN113239869A/en
Application granted granted Critical
Publication of CN113239869B publication Critical patent/CN113239869B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/24 Classification techniques
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a two-stage behavior recognition method and system based on a key frame sequence and behavior information. Similar adjacent frames are screened out by calculating the similarity between the sparse representation results of video frames, yielding a key frame sequence; the similarity between categories is calculated from the behavior information of the behavior category labels, and all behavior categories are divided into several major classes; a two-stage behavior recognition model is constructed and trained, where the first training stage gives the network the ability of coarse classification and the second stage gives it the ability of fine classification; finally, the trained model is used to recognize videos. The method takes the key frame sequence of a video as input data so that the input contains more information, and uses the information in the behavior class labels to divide the network training and recognition processes into two stages, making network learning easier and improving recognition accuracy.

Description

Two-stage behavior identification method and system based on key frame sequence and behavior information
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a two-stage behavior identification method and a two-stage behavior identification system based on a key frame sequence and behavior information.
Background
With the growth of computing power and the development of streaming media, video data is increasing rapidly, and people are no longer satisfied with computers that only process image data. It is desirable for a computer to process video data as well as it processes image data and to analyze the information contained in it, so video analysis has become an important and urgent problem in the field of artificial intelligence. Behavior recognition, one of the tasks of video analysis, is also called action recognition; it aims to analyze the behavior of a person from a video containing a complete action and to recognize the action category performed in the video. Unlike object recognition in static images, the research object of behavior recognition is not static but dynamic video data, so effective recognition must attend to the spatio-temporal motion of people or objects in the video; the data is also transformed from the two-dimensional space of static images to the three-dimensional spatio-temporal space of dynamic video, so the complexity of behavior recognition is much higher than that of image recognition.
Existing deep-learning-based behavior recognition methods differ. Methods based on 3D convolutional networks achieve good recognition results because they can model the temporal and spatial information of a video simultaneously: a deep network is used as a feature extractor to extract features from the input video frame sequence, and a classifier then classifies the features to obtain the behavior category. However, because such methods use 3D convolutions, both the amount of computation and the number of parameters are large; to reduce the computation, only the length of the input video frame sequence can be reduced. In addition, existing 3D-convolution-based behavior recognition methods train the network model end to end and make the network directly learn the correspondence between the data and the true labels, which may increase the difficulty of network learning.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide, in view of the above deficiencies in the prior art, a two-stage behavior recognition method and system based on a key frame sequence and behavior information, which divide the network learning process and the prediction process into two stages by introducing the behavior information of the behavior category labels, and select key frames using the sparse representation of images, so that the input video frame sequence contains more information and the final recognition accuracy is improved.
The invention adopts the following technical scheme:
a two-stage behavior identification method based on a key frame sequence and behavior information comprises the following steps:
S1, for all videos V_all in the data set, selecting key frames one by one to obtain the key frame sequences F_all corresponding to all videos, then dividing all videos into training videos V_train and test videos V_test, with the corresponding key frame sequences divided into F_train and F_test;
S2, calculating the similarity among all behavior categories in the video to obtain a category similarity matrix S;
S3, dividing the N categories in the behavior category set C into K major categories according to the similarity matrix S obtained in step S2;
S4, constructing a two-stage behavior recognition network model based on key frames, the two-stage behavior recognition network model comprising a one-stage feature extractor G_1, a one-stage classifier Class_1, a two-stage feature extractor G_2 and a two-stage classifier Class_2; sending the key frame sequences F_train corresponding to the training videos of step S1 and the corresponding labels Y_train into the two-stage behavior recognition network model in batches, and training with the K major classes divided in step S3, the size of each batch being B;
S5, sending the key frame sequences F_test corresponding to the test videos of step S1 into the two-stage behavior recognition network model trained in step S4 to obtain the behavior categories of the test videos.
Specifically, in step S1, selecting key frames one by one for all videos V_all in the data set to obtain the key frame sequences specifically comprises:
S101, converting a video v into a video frame sequence [x_1, x_2, …, x_i, …, x_T], where T represents the length of the video frame sequence and x_i represents the i-th video frame in the video v;
S102, processing each original video frame to obtain a processed video frame sequence [x'_1, x'_2, …, x'_i, …, x'_T];
S103, obtaining the sparse representation [α_1, α_2, …, α_T] of the video frame sequence [x'_1, x'_2, …, x'_i, …, x'_T] by using the K-SVD algorithm;
S104, acquiring the key frame sequence.
Further, step S104 specifically comprises:
S1041, calculating the inter-frame similarity matrix S' ∈ R^{T×T};
S1042, selecting the first frame x_1, putting it into the key frame sequence f_select, taking x_1 as the current frame x_now, and then setting the maximum interval length τ, at which time f_select = [x_1], x_now = x_1;
S1043, traversing the τ frames after x_now, selecting from them, according to the similarity matrix S', the frame with the lowest similarity to x_now, adding it to the key frame sequence f_select, and taking it as the new x_now;
S1044, repeating step S1043 until all frames are traversed, obtaining the final key frame sequence f_select.
Specifically, in step S2, obtaining the category similarity matrix S specifically comprises:
S201, obtaining the sentence vector Vec = {vec_1, vec_2, …, vec_i, …, vec_N} of each behavior category label by using a BERT model, where vec_i indicates the sentence vector corresponding to the i-th category c_i;
S202, calculating the similarity sim(i, j) = cos(vec_i, vec_j) between two different categories, where cos() represents the cosine distance;
S203, constructing the category similarity matrix S according to the similarity between different categories.
Further, in step S203, the similarity S_{i,j} is specifically:
[formula image: piecewise definition of S_{i,j} in terms of sim(i, j) and the threshold r]
wherein r is a threshold value, and C is the number of categories.
Specifically, step S4 specifically comprises:
S401, constructing a two-stage behavior recognition network model based on key frames, wherein the one-stage feature extractor G_1 is the feature extractor in 3D-ResNet34; the one-stage classifier Class_1 comprises an input layer, a global 3D pooling layer and a fully connected layer which are connected in sequence; the two-stage feature extractor G_2 comprises K lightweight feature extractors, each comprising an input layer, a first 3D convolutional layer and a second 3D convolutional layer which are connected in sequence; the two-stage classifier Class_2 comprises K classifiers, each comprising an input layer, a global 3D pooling layer and a fully connected layer which are connected in sequence;
S402, training the one-stage feature extractor G_1 and the one-stage classifier Class_1;
S403, training the two-stage feature extractor G_2 and the two-stage classifier Class_2.
Further, in steps S402 and S403, training the one-stage feature extractor G_1 and one-stage classifier Class_1 and the two-stage feature extractor G_2 and two-stage classifier Class_2 specifically comprises:
setting the training batch size B to 32 and the number of iterations epoch to 100; randomly sampling a batch of B samples from the training key frame sequence set F_train and the corresponding labels Y_train; sending the B sampled samples into the one-stage feature extractor G_1, sending the features output by G_1 into the one-stage classifier Class_1 to obtain a preliminary classification result, and then, according to the preliminary classification result, sending the output features into the corresponding feature extractor in the two-stage feature extractor G_2 and the corresponding classifier in the two-stage classifier Class_2 to obtain the final classification result; calculating the loss value of the current batch using the similarity cross entropy loss function L_sim and updating the network parameters of G_1 and Class_1 by the batch stochastic gradient descent method; calculating the loss value of the current batch using the cross entropy loss function L_CCE, fixing G_1 and Class_1, and updating the network parameters of G_2 and Class_2 by the batch stochastic gradient descent method; repeating the above steps until the number of iterations epoch is reached; outputting the weights of the one-stage feature extractor G_1 and the one-stage classifier Class_1, and outputting the weights of the two-stage feature extractor G_2 and the two-stage classifier Class_2.
Further, the similarity cross entropy loss function used in the one-stage training is:
L_sim = -(1/B) Σ_{i=1..B} Σ_{c=1..C} S_{y_i,c} · log(p_ic)
The cross entropy loss function used in the two-stage training process is:
L_CCE = -(1/B) Σ_{i=1..B} Σ_{c=1..C} 1{c = y_i} · log(p_ic)
wherein B represents the number of samples in each batch, C represents the number of categories, y_i represents the real class of the i-th sample, S_{y_i,c} is the value in the category similarity matrix S, and p_ic represents the probability that the model predicts the i-th sample to be of class c.
Specifically, step S5 specifically comprises:
S501, sending the key frame sequence to be predicted into the one-stage feature extractor G_1 of step S4, and then sending the features extracted by G_1 into the one-stage classifier Class_1 of step S4 to obtain a preliminary classification result;
S502, determining, according to the preliminary classification result obtained in step S501, which feature extractor in the two-stage feature extractor G_2 and which classifier in the two-stage classifier Class_2 to use; judging, according to the preliminary classification result, which major class the input key frame sequence belongs to, and if the key frame sequence belongs to the k-th major class, using the k-th feature extractor in G_2 and the k-th classifier in Class_2;
S503, sending the output of G_1 in step S501 into the feature extractor in G_2 determined in step S502 to further extract features, and then sending the further extracted features into the classifier in Class_2 determined in step S502 to obtain the final classification result.
Another technical solution of the present invention is a two-stage behavior recognition system based on a key frame sequence and behavior information, comprising:
a selection module for all videos V in the data setallSelecting key frames one by one to obtain a key frame sequence F corresponding to all videosallThen divide all videos into training videos VtrainAnd a test video VtestThe corresponding key frame sequence is divided into FtrainAnd Ftest
The computing module is used for computing the similarity among all behavior categories in the video to obtain a category similarity matrix S;
the dividing module is used for dividing N categories in the behavior category C into K major categories according to the similarity matrix S obtained by the calculating module;
a network module for constructing a two-stage behavior recognition network model based on the key frame, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G1Class classifier Class1Two stage feature extractor G2And two-stage classifier Class2Selecting a key frame sequence F corresponding to the module training videotrainAnd a corresponding label YtrainSending the data into a two-stage behavior recognition network model in batches, and training by using K major classes divided by a dividing module, wherein the size of each batch is B;
an identification module for selecting the key frame sequence F corresponding to the module test videotestAnd sending the two-stage behavior recognition network model which is trained by the network module to obtain the behavior category of the test video.
Compared with the prior art, the invention has at least the following beneficial effects:
compared with the existing method of intercepting a video frame sequence randomly in sequence as input data and carrying out end-to-end training, the two-stage behavior identification method based on the key frame sequence and the behavior information utilizes the sparse representation result of each video frame to calculate the similarity between the frames and screen out the similar frames to obtain the key frame sequence, and the key frame sequence is used as the input data to avoid information redundancy in the input data. Through the similarity between the categories, a plurality of specific categories are aggregated into a large category, and then in the network model training process, the training is carried out in two stages: the first stage aims at learning the difference between different large categories, so that the model has the capability of rough classification; the second stage aims at learning the difference between similar categories in the same large category, so that the model has the capability of fine classification. Through the two-stage training mode, the learning process of the network is more reasonable, and the learning difficulty of the network is also reduced.
Furthermore, the obtained key frame sequence can screen out repeated frames contained in the original video frame sequence, so that the redundancy degree of information is reduced to the minimum, a network receives more information, and the accuracy of final identification is improved.
Furthermore, by repeatedly selecting, from the τ frames following the current frame, the frame that is least similar to it, the key frame sequence is constructed step by step, so that its final time span is maximized and it contains more information.
Further, computing the similarity between categories from the behavior category labels greatly reduces the amount of computation compared with computing the visual similarity between videos to obtain the similarity between categories.
Furthermore, the constructed similarity matrix S can be used when a loss value is calculated, so that the information of the behavior category label is indirectly used in the learning process of the network, the information quantity accepted by the network is increased, and the final identification accuracy can be improved.
Further, a first-stage feature extractor is used for extracting basic features, a second-stage feature extractor is used for extracting different features in each similar category from the basic features, a first-stage classifier is used for rough classification, and a second-stage classifier is used for fine classification. Each module has different functions and separately learns and updates parameters, so that the learning difficulty of the whole network is reduced.
Further, G_1 is used to extract basic features; in this process the one-stage classifier only needs to distinguish the K major classes (i.e., select one of the K major classes). Then, according to the preliminary classification result, the appropriate feature extractor in G_2 and classifier in Class_2 are selected to obtain the final classification result, and in this process the classifier also only needs to select one category from a small number of categories as the predicted final category. Compared with other one-stage recognition methods, this two-stage method only needs to select one class from a small number of classes each time classification is carried out, so the training process of the network is easier.
Further, because the similarity matrix is added into the similarity cross entropy loss function, the common features among similar categories are noticed by the one-stage feature extractor in the training process, so that the one-stage classifier has the capability of rough classification, and the cross entropy loss function is used in the two stages, so that the two-stage feature extractor can notice different features among similar categories and the two-stage classifier has the capability of fine classification.
Further, a preliminary classification result is first obtained through G_1 and Class_1, giving a coarse classification of the video, and then the final classification result is obtained through G_2 and Class_2. Compared with directly recognizing the final result, this recognition method is simpler and easier: each stage is only responsible for one relatively easy task, and the overall fault tolerance is high, so the final recognition accuracy of the method is improved.
In summary, the invention selects the key frame sequence containing more information through the similarity between the video frames, calculates the similarity between the categories by using the information of the behavior category labels, divides all the behavior categories into a plurality of large categories, further divides the behavior identification process into two stages, firstly carries out coarse classification and then carries out fine classification, and improves the classification accuracy.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a network block diagram of a two-stage behavior recognition method based on a sequence of key frames and behavior information according to the present invention;
FIG. 2 is a diagram of a sequence of frames in contrast to a sequence of key frames extracted in the present invention;
FIG. 3 is a comparison of an original video frame and a reconstructed video frame;
FIG. 4 is a schematic diagram of the two-stage behavior recognition network architecture based on a key frame sequence and behavior information according to the present invention; wherein (a) is the network architecture of the one-stage feature extractor G_1; (b) is the network architecture of the one-stage classifier Class_1; (c) is the network architecture of the two-stage feature extractor G_2 and the two-stage classifier Class_2;
FIG. 5 is a diagram of the recognition result of a two-stage behavior recognition method for a partial video frame sequence based on a key frame sequence and behavior information;
FIG. 6 is a category activation graph for a two-stage behavior recognition network based on a sequence of key frames and behavior information when recognizing two similar categories.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers and their relative sizes and positional relationships shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, according to actual needs.
The invention provides a two-stage behavior identification method based on a key frame sequence and behavior information, which is used for selecting key frames of videos to obtain the key frame sequence corresponding to each video; calculating the similarity between the categories by using the information of the behavior category labels, and constructing a category similarity matrix; dividing all behavior categories into a plurality of large categories according to the similarity among the behavior categories, wherein the specific behavior categories contained in each large category are similar; constructing and training a two-stage behavior recognition network model based on key frames, wherein the network model comprises a one-stage feature extractor, a one-stage classifier, a two-stage feature extractor and a two-stage classifier, the one-stage feature extractor and the one-stage classifier are trained in the first stage, and the two-stage feature extractor and the two-stage classifier are trained in the second stage; inputting a key frame sequence corresponding to a test video to obtain a behavior category corresponding to the video; the invention obtains the similarity between adjacent frames by using the sparse representation result of each video frame, then selects the key frame sequence according to the interframe similarity and uses the key frame sequence as the input data of the network model, so that the network model can obtain more input information for identification.
Referring to fig. 1, the present invention provides a two-stage behavior recognition method based on a sequence of key frames and behavior information, including the following steps:
S1, for all videos V_all in the data set, selecting key frames one by one to obtain the key frame sequences F_all corresponding to all videos, then dividing all videos into training videos V_train and test videos V_test, with the corresponding key frame sequences divided into F_train and F_test.
Referring to fig. 2, which compares a sequentially intercepted video frame sequence with the key frame sequence obtained by key frame selection, the key frame sequence shows the whole action better and contains more information for the same length. In a sequentially intercepted video frame sequence, the differences between adjacent frames are small, so the amount of information finally contained is small; therefore, in the present invention, similar frames are screened out by using the similarity between adjacent frames, and the key frame sequence is finally obtained.
Selecting key frames for each video to obtain a key frame sequence specifically comprises:
S101, converting a video v into a video frame sequence [x_1, x_2, …, x_i, …, x_T], where T represents the length of the video frame sequence and x_i represents the i-th video frame in the video v;
S102, processing each original video frame to obtain a processed video frame sequence [x'_1, x'_2, …, x'_i, …, x'_T];
specifically: converting each original frame into a gray-scale image, cutting the gray-scale image into N image blocks of the same size, and flattening each small image block into a one-dimensional vector, so that after processing the size of each video frame changes from W × H × 3 to L × N, where L = W × H / N;
S103, obtaining the sparse representation [α_1, α_2, …, α_T] of the video frame sequence [x'_1, x'_2, …, x'_i, …, x'_T] by using the K-SVD algorithm;
And S104, acquiring a key frame sequence.
S1041, calculating the inter-frame similarity matrix S' ∈ R^{T×T};
first, the sparse representation result of each frame is flattened into a one-dimensional vector to obtain [α'_1, α'_2, α'_3, …, α'_T], and then the similarity between every two frames is calculated in turn using S'_{i,j} = 1 - cos(α'_i, α'_j), where cos() represents the cosine distance and S'_{i,j} represents the similarity of the i-th frame and the j-th frame;
S1042, selecting the first frame x_1, putting it into the key frame sequence f_select, and taking x_1 as the current frame x_now; then setting the maximum interval length τ, at which time f_select = [x_1], x_now = x_1;
S1043, traversing the τ frames after x_now, selecting from them, according to the similarity matrix S', the frame with the lowest similarity to x_now, adding it to the key frame sequence f_select, and taking it as the new x_now;
S1044, repeating step S1043 until all frames are traversed, obtaining the final key frame sequence f_select.
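The following is a minimal Python sketch of steps S101-S104 under stated assumptions: OpenCV is used for frame decoding, scikit-learn's DictionaryLearning stands in for the K-SVD algorithm named in S103, and the function and parameter names (video_to_block_matrix, sparse_codes, select_key_frames, n_blocks_per_side, n_atoms, tau) are illustrative rather than taken from the patent. Since the text defines S'_{i,j} = 1 - cos(·, ·) and then speaks of the frame with the lowest similarity, the sketch picks the frame least similar to the current one (the largest 1 - cos value), which matches the stated goal of spreading the key frames over the whole action.

```python
import cv2
import numpy as np
from numpy.linalg import norm
from sklearn.decomposition import DictionaryLearning

def video_to_block_matrix(path, n_blocks_per_side=4):
    """S101-S102: decode a video, convert each frame to grayscale and split it into
    N = n_blocks_per_side**2 equal blocks, each flattened into a column of an (L, N) matrix."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        h, w = gray.shape
        bh, bw = h // n_blocks_per_side, w // n_blocks_per_side
        blocks = [gray[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].reshape(-1)
                  for i in range(n_blocks_per_side) for j in range(n_blocks_per_side)]
        frames.append(np.stack(blocks, axis=1).astype(np.float32))   # shape (L, N)
    cap.release()
    return frames

def sparse_codes(frames, n_atoms=64, alpha=1.0):
    """S103: learn a dictionary over all block vectors and return one flattened sparse code
    per frame (DictionaryLearning is used here as a stand-in for K-SVD)."""
    X = np.concatenate([f.T for f in frames], axis=0)                 # each block vector as a row
    dico = DictionaryLearning(n_components=n_atoms, alpha=alpha, max_iter=20)
    dico.fit(X)
    return [dico.transform(f.T).reshape(-1) for f in frames]

def select_key_frames(codes, tau=8):
    """S1041-S1044: greedy key-frame selection over the per-frame sparse codes."""
    A = np.stack(codes)
    A = A / (norm(A, axis=1, keepdims=True) + 1e-8)
    S_prime = 1.0 - A @ A.T                                           # S'_{i,j} = 1 - cos(alpha'_i, alpha'_j)
    T = len(codes)
    selected, now = [0], 0                                            # S1042: start from the first frame
    while now + 1 < T:                                                # S1043-S1044: walk forward in windows of tau
        window = range(now + 1, min(now + 1 + tau, T))
        nxt = max(window, key=lambda j: S_prime[now, j])              # frame least similar to the current one
        selected.append(nxt)
        now = nxt
    return selected
```

In practice the dictionary could also be learned per video or on a subsample of blocks; the text does not fix this choice.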
S2, calculating the similarity among the categories of all the behaviors in the video to obtain a category similarity matrix S;
the behavior class is C ═ { C ═ C1,c2,…,ci,…,cNIs the similarity matrix S e RN×NR represents a real number field, and N represents the number of behavior categories; si,jRepresenting the similarity of the ith and jth categories.
The category similarity matrix S is calculated as follows:
S201, obtaining the sentence vector Vec = {vec_1, vec_2, …, vec_i, …, vec_N} of each behavior category label by using the BERT model, where vec_i indicates the sentence vector corresponding to the i-th category c_i;
S202, calculating the similarity sim(i, j) = cos(vec_i, vec_j) between two different categories, where cos() represents the cosine distance;
S203, constructing the category similarity matrix S ∈ R^{C×C} according to the similarity between different categories, the specific calculation formula being:
[formula image: piecewise definition of S_{i,j} in terms of sim(i, j) and the threshold r]
where r is a threshold value with range [0, 1].
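A Python sketch of S201-S203, with assumptions that go beyond the text: the HuggingFace transformers model "bert-base-uncased" with mean pooling is used as the sentence vector (the text only says a BERT sentence vector is obtained), and because the exact piecewise formula for S is in an image not reproduced here, the thresholding rule below (keep sim(i, j) when it is at least r, zero it otherwise, ones on the diagonal) is only one plausible reading.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def label_vector(label: str) -> np.ndarray:
    """Sentence vector of one behavior category label (mean-pooled last hidden state)."""
    with torch.no_grad():
        tokens = tokenizer(label, return_tensors="pt")
        hidden = bert(**tokens).last_hidden_state            # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

def class_similarity_matrix(labels, r=0.5):
    """S202-S203: cosine similarity between label vectors, thresholded with r (assumed rule)."""
    vecs = np.stack([label_vector(l) for l in labels])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = vecs @ vecs.T                                       # sim(i, j)
    S = np.where(sim >= r, sim, 0.0)                          # assumed thresholding
    np.fill_diagonal(S, 1.0)
    return S
```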
S3, dividing N categories in the behavior category C into K large categories according to the similarity matrix S obtained in the step S2;
The K major classes are C' = {C'_1, C'_2, …, C'_i, …, C'_K}, where C'_i represents the i-th major class;
all behavior categories C = {c_1, c_2, …, c_i, …, c_N} are divided into the K major classes C' = {C'_1, C'_2, …, C'_i, …, C'_K}. In the training process of step S4 there are two classifiers: Class_1 is used for coarse classification, i.e. its predicted result is one of {C'_1, C'_2, …, C'_i, …, C'_K}, and Class_2 is used for predicting the final class, i.e. its predicted result is one of {c_1, c_2, …, c_i, …, c_N}.
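The text states only that the N categories are split into K major classes according to S; it does not spell out the grouping rule. The sketch below is therefore an assumption, not the patent's procedure: it treats the thresholded similarity matrix as an adjacency matrix and takes its connected components as the major classes.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def group_classes(S: np.ndarray):
    """Assumed grouping rule: classes linked by a nonzero similarity end up in the same major class."""
    adjacency = csr_matrix(S > 0)
    K, labels = connected_components(adjacency, directed=False)
    major_classes = [np.where(labels == k)[0].tolist() for k in range(K)]
    return K, labels, major_classes   # labels[i] = index of the major class of class i
```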
S4, constructing a two-stage behavior recognition network model based on the key frame, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G1Class classifier Class1Two stage feature extractor G2And two-stage classifier Class2Step S1 is to train the corresponding key frame sequence F of the videotrainAnd a corresponding label YtrainSending the data into a two-stage behavior recognition network model in batches, and dividing the K large classes by using the step S3Training, wherein the size of each batch is B;
s401, constructing a two-stage behavior recognition network model based on key frames, and a one-stage feature extractor G1A feature extractor in 3D-ResNet 34; one-stage classifier Class1The system comprises an input layer, a global 3D pooling layer and a full-connection layer which are connected in sequence; two-stage feature extractor G2The system comprises K lightweight feature extractors, wherein each lightweight feature extractor comprises an input layer, a first 3D convolutional layer and a second 3D convolutional layer which are sequentially connected; two-stage classifier Class2Then K classifiers are included, and each classifier comprises an input layer, a global 3D pooling layer and a full connection layer which are sequentially connected;
wherein, a stage feature extractor G1Taking a key frame sequence as input, and outputting the key frame sequence as an extracted feature graph; one-stage classifier Class1With G1Is used as an input, and the output is the probability distribution of the preliminary classification.
Two-stage feature extractor G2With G1The output of (2) is used as input, and the output is a feature map after further feature extraction; two-stage classifier Class2With G2Is used as input and the output is the probability distribution of the final classification.
S402, training a stage feature extractor G1And a stage classifier Class1
The training data is FtrainAll key frame sequences in (1) and corresponding labels YtrainFirst, to G1Inputting a batch of key frame sequences, and then G1Is input to Class1Obtaining the probability distribution of the primary classification, then calculating the loss by using the probability distribution of the primary classification and the real label, and updating G through a random gradient descent algorithm1And Class1The process is repeated until the loss no longer drops and the model converges.
The similar cross entropy loss function used during the first stage training is:
Figure BDA0003093926540000141
wherein B represents the number of samples in each batch, C represents the number of categories, yiRepresenting the true class of the ith sample, Syi,cIs the value in the category similarity matrix S, representing the y-thiSimilarity of class and class c, picThe representation model predicts the probability that the ith sample is of class c.
Cross entropy loss function L by similaritysimApplying a batch random gradient descent method to a stage feature extractor G1And a stage classifier Class1Training is carried out to obtain the trained network parameters, and the training method comprises the following steps:
step 1, setting the size B of a training batch to be 32 and the iteration number epoch to be 100;
Step 2, randomly sampling a batch of B samples from the training key frame sequence set F_train and the corresponding labels Y_train;
Step 3, sending the B sampled samples into the one-stage feature extractor G_1, and sending the features output by G_1 into the one-stage classifier Class_1 to obtain a preliminary classification result;
Step 4, calculating the loss value of the current batch using the similarity cross entropy loss function L_sim, and updating the network parameters of G_1 and Class_1 by the batch stochastic gradient descent method;
step 5, repeating the steps 2 to 4 until the iteration time epoch is reached;
Step 6, outputting the trained weights of the one-stage feature extractor G_1 and the trained weights of the one-stage classifier Class_1.
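A PyTorch sketch of one stage-one update, continuing the TwoStageBehaviorNet sketch above and under assumptions: the loss below is the similarity-weighted cross entropy implied by the variable descriptions (the formula image itself is not reproduced in the text), S is passed in as a tensor whose rows match the label space of Class_1's output, and the optimizer is plain SGD built over the parameters of G_1 and Class_1 only.

```python
import torch.nn.functional as F

def similarity_cross_entropy(logits, targets, S):
    """L_sim: cross entropy in which the one-hot target of each sample is replaced by the
    row of the class similarity matrix S belonging to its true class."""
    log_p = F.log_softmax(logits, dim=1)          # log p_ic
    weights = S[targets]                          # S_{y_i, c}, shape (B, C)
    return -(weights * log_p).sum(dim=1).mean()

def train_step_stage1(model, optimizer, clips, targets, S):
    """One iteration of S402: only G_1 and Class_1 receive gradient updates
    (the optimizer is assumed to be built over their parameters only)."""
    logits = model.Class1(model.G1(clips))
    loss = similarity_cross_entropy(logits, targets, S)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here the optimizer could be built as, for example, torch.optim.SGD(list(model.G1.parameters()) + list(model.Class1.parameters()), lr=0.01); the learning rate is not specified in the text.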
S403, training two-stage feature extractor G2And two-stage classifier Class2
During this training, G1And Class1Is kept unchanged, only updates G are learned2And Class2The training data is still FtrainAll key frame sequences in (1) and corresponding labels YtrainFirstly, inputting a batch of key frame sequence into G1And Class1Obtaining a preliminary classification result, judging which big class each key frame sequence in the batch belongs to according to the preliminary classification result, and further determining that each key frame sequence is in G2And Class2The feature extractor and classifier in (1), then G1The output characteristic diagram is sent to the determined G2Mid-feature extractor and Class2The classifier in (1) obtains the final classification result, finally calculates the loss, and uses gradient descent algorithm to update G2And Class2Until the model converges.
The loss function used during the two-stage training process is:
Figure BDA0003093926540000151
wherein S isyi,cIs the value in the class similarity matrix S, B represents the number of samples in each batch, C represents the number of classes, yiRepresenting the real class, p, of the ith sampleicThe representation model predicts the probability that the ith sample is of class c.
By cross entropy loss function LCCEThe two-stage feature extractor G is subjected to a batch random gradient descent method2And two-stage classifier Class2Training to obtain the trained network parameters, wherein the training method comprises the following steps:
step 1, setting the size B of a training batch to be 32 and the iteration number epoch to be 100;
Step 2, randomly sampling a batch of B samples from the training key frame sequence set F_train and the corresponding labels Y_train;
Step 3, sending the B sampled samples into the one-stage feature extractor G_1, sending the features output by G_1 into the one-stage classifier Class_1 to obtain a preliminary classification result, sending the feature map output by G_1 into the corresponding feature extractor in G_2 according to the major class indicated by the preliminary classification result, and sending the output of G_2 into the corresponding classifier in Class_2 to obtain the final classification result;
Step 4, calculating the loss value of the current batch using the cross entropy loss function L_CCE, fixing G_1 and Class_1, and updating only the network parameters of G_2 and Class_2 by the batch stochastic gradient descent method;
step 5, repeating the steps 2 to 4 until the iteration time epoch is reached;
Step 6, outputting the trained weights of the two-stage feature extractor G_2 and the trained weights of the two-stage classifier Class_2.
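A sketch of one stage-two update, continuing the sketches above; the per-sample routing loop, the use of torch.no_grad() to keep G_1 and Class_1 fixed, and an SGD optimizer built only over the G_2/Class_2 parameters are implementation assumptions consistent with steps 1 to 6.

```python
import torch
import torch.nn.functional as F

def train_step_stage2(model, optimizer_stage2, clips, fine_targets):
    """One iteration of S403: G_1 and Class_1 are fixed; only the routed branches of
    G_2 and Class_2 receive gradient updates."""
    with torch.no_grad():                                   # stage-one modules stay frozen
        feat = model.G1(clips)
        k = model.Class1(feat).argmax(dim=1)                # predicted major class per sample
    fine_logits = torch.stack([
        model.Class2[ki](model.G2[ki](feat[i:i + 1])).squeeze(0)
        for i, ki in enumerate(k.tolist())])                # route each sample to its k-th branch
    loss = F.cross_entropy(fine_logits, fine_targets)       # L_CCE
    optimizer_stage2.zero_grad()                            # built over G_2/Class_2 parameters only
    loss.backward()
    optimizer_stage2.step()
    return loss.item()
```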
S5, testing the key frame sequence F corresponding to the video in the step S1testAnd sending the two-stage behavior recognition network model trained in the step S4 to obtain the behavior category of the test video.
S501, sending the key frame sequence to be predicted to the stage feature extractor G in the step S41Then G is added1The extracted features are fed to a one-stage classifier Class in step S41Obtaining a preliminary classification result;
s502, determining to use the two-stage feature extractor G according to the primary classification result obtained in the step S5012Which feature extractor and two-stage classifier Class in2Which classifier of (1);
judging which big class the sent key frame sequence belongs to according to the preliminary classification result obtained in the step S501, and if the key frame sequence belongs to the kth big class, using G2Kth feature extractor and Class2The kth classifier in (1).
S503, converting G in the step S5011The output of step S502 is sent to determine good G2The feature extractor in (1) further extracts features, and then sends the further extracted features to step S502 to determine the Class2The classifier in (1) obtains the final classification result.
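The prediction path S501-S503, again continuing the model sketch above; the helper name predict and the single-clip batching are illustrative.

```python
import torch

@torch.no_grad()
def predict(model, key_frame_clip):
    """S501-S503: coarse classification with G_1/Class_1 selects the k-th two-stage branch,
    which then produces the final behavior category."""
    feat = model.G1(key_frame_clip.unsqueeze(0))          # add a batch dimension
    k = model.Class1(feat).argmax(dim=1).item()           # S501-S502: index of the major class
    fine_logits = model.Class2[k](model.G2[k](feat))      # S503: fine classification
    return k, fine_logits.argmax(dim=1).item()
```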
In another embodiment of the present invention, a two-stage behavior recognition system based on a key frame sequence and behavior information is provided, which can be used to implement the two-stage behavior recognition method based on a key frame sequence and behavior information described above.
Wherein the selection module selects key frames one by one for all videos V_all in the data set to obtain the key frame sequences F_all corresponding to all videos, then divides all videos into training videos V_train and test videos V_test, with the corresponding key frame sequences divided into F_train and F_test;
The computing module is used for computing the similarity among all behavior categories in the video to obtain a category similarity matrix S;
the dividing module is used for dividing N categories in the behavior category C into K major categories according to the similarity matrix S obtained by the calculating module;
a network module for constructing a two-stage behavior recognition network model based on the key frame, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G1Class classifier Class1Two stage feature extractor G2And two-stage classifier Class2Selecting a key frame sequence F corresponding to the module training videotrainAnd a corresponding label YtrainIs fed in two batchesIn the stage behavior recognition network model, training is carried out by utilizing K large classes divided by a dividing module, and the size of each batch is B;
an identification module for selecting the key frame sequence F corresponding to the module test videotestAnd sending the two-stage behavior recognition network model which is trained by the network module to obtain the behavior category of the test video.
In yet another embodiment of the present invention, a terminal device is provided that includes a processor and a memory for storing a computer program comprising program instructions, the processor being configured to execute the program instructions stored by the computer storage medium. The processor may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computing core and control core of the terminal and is adapted to implement one or more instructions, specifically to load and execute one or more instructions to implement the corresponding method flow or function. The processor according to the embodiment of the present invention may be used for the operations of the two-stage behavior recognition method based on the key frame sequence and behavior information, including:
for all videos V in the data setallSelecting key frames one by one to obtain a key frame sequence F corresponding to all videosallThen divide all videos into training videos VtrainAnd a test video VtestThe corresponding key frame sequence is divided into FtrainAnd Ftest(ii) a Calculating the similarity between all behavior categories in the video to obtain a category similarity matrix S; dividing N categories in the behavior category C into K large categories according to the similarity matrix S; constructing a two-stage behavior recognition network model based on key frames, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G1Class classifier Class1Second order, ofSegment feature extractor G2And two-stage classifier Class2A sequence F of key frames corresponding to the training videotrainAnd a corresponding label YtrainSending the data into a two-stage behavior recognition network model in batches, and training by using K divided major classes, wherein the size of each batch is B; a key frame sequence F corresponding to the test videotestAnd sending the two-stage behavior recognition network model which is trained to obtain the behavior category of the test video.
In still another embodiment of the present invention, the present invention further provides a storage medium, specifically a computer-readable storage medium (Memory), which is a Memory device in a terminal device and is used for storing programs and data. It is understood that the computer readable storage medium herein may include a built-in storage medium in the terminal device, and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also, one or more instructions, which may be one or more computer programs (including program code), are stored in the memory space and are adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory.
One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to perform the corresponding steps of the above embodiments with respect to a two-stage behavior recognition method based on a sequence of key frames and behavior information; one or more instructions in the computer-readable storage medium are loaded by the processor and perform the steps of:
for all videos V in the data setallSelecting key frames one by one to obtain a key frame sequence F corresponding to all videosallThen divide all videos into training videos VtrainAnd a test video VtestThe corresponding key frame sequence is divided into FtrainAnd Ftest(ii) a Computing all in videoObtaining a category similarity matrix S according to the similarity between the behavior categories; dividing N categories in the behavior category C into K large categories according to the similarity matrix S; constructing a two-stage behavior recognition network model based on key frames, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G1Class classifier Class1Two stage feature extractor G2And two-stage classifier Class2A sequence F of key frames corresponding to the training videotrainAnd a corresponding label YtrainSending the data into a two-stage behavior recognition network model in batches, and training by using K divided major classes, wherein the size of each batch is B; a key frame sequence F corresponding to the test videotestAnd sending the two-stage behavior recognition network model which is trained to obtain the behavior category of the test video.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The effect of the present invention can be further illustrated by the following simulation results
1. Simulation conditions
The hardware conditions of the simulation of the invention are as follows: the intelligent sensing and image understanding laboratory graphics workstation carries 4 GPUs with 12 GB of video memory each. The data set used in the simulation is the UCF101 data set, which comprises 13320 videos with a resolution of 320 × 240 covering 101 action categories; after the videos are divided according to the official split of the data set, the training set comprises 9537 videos and the test set comprises 3783 videos.
2. Simulation content and results
The experiment is carried out under the above simulation conditions using the method of the invention. First, the video frames contained in each video in the data set are sparsely represented to obtain the corresponding sparse representation results; for example, fig. 3 compares original video frames with video frames reconstructed from the sparse representation results, where the first row shows the original video frames and the second row the reconstructed video frames. Then the similarity between frames is calculated from the sparse representation results, and similar frames are screened out to obtain the key frame sequence. All 101 behavior categories are then divided into 10 major categories by calculating the category similarity matrix, and the two-stage behavior recognition network model is trained with the key frame sequences corresponding to the videos in the training set. After training, the trained network model is used to recognize the videos in the test set. Fig. 5 shows the two-stage recognition results for some video frame sequences: viewed from top to bottom, the first and second rows are video frame sequences randomly intercepted in order, and the third and fourth rows are key frame sequences. Fig. 6 shows the category activation maps, during the two-stage recognition process, of video frames from two similar categories in the data set, eye makeup and lip makeup: from top to bottom, the first row is the category activation map for 'eye makeup' and the second row is the category activation map for 'lip makeup'.
As can be seen from fig. 5 and fig. 6, in the first, preliminary recognition stage the two-stage behavior recognition network model pays more attention to the commonality between similar categories, so even when a video frame sequence is recognized incorrectly, it tends to be misrecognized as a similar category.
Table 1 compares the final recognition accuracy of the method of the present invention with that of other methods on the test set of the UCF101 data set.
TABLE 1
Method Accuracy (%)
IDT 85.90
Temporal stream network 83.70
LRCN 82.90
C3D 76.02
3D-ResNet18 83.51
3D-ResNet34 83.69
The method of the invention 87.23
From the results in table 1, the present invention achieves good classification results.
In summary, the two-stage behavior recognition method and system based on the key frame sequence and the behavior information calculate the inter-frame similarity by using the sparse representation result of the video frame, and further screen out the key frame sequence as the input of the network model, so that the information content of the input video frame sequence is effectively increased under the condition of unchanged length; the method uses the behavior information of the behavior category labels to calculate the similarity between categories, divides the similar categories into the same large category, further divides the identification process into two stages, firstly carries out rough classification and then carries out fine classification, so that the model corresponding to each stage only learns the corresponding capacity, the learning process of the whole network model is easier, and good identification accuracy can be achieved.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. The two-stage behavior identification method based on the key frame sequence and the behavior information is characterized by comprising the following steps of:
S1, for all videos V_all in the data set, selecting key frames one by one to obtain the key frame sequences F_all corresponding to all videos, then dividing all videos into training videos V_train and test videos V_test, with the corresponding key frame sequences divided into F_train and F_test;
S2, calculating the similarity among all behavior categories in the video to obtain a category similarity matrix S;
s3, dividing N categories in the behavior category C into K large categories according to the similarity matrix S obtained in the step S2;
s4, constructing a two-stage behavior recognition network model based on the key frame, wherein the two-stage behavior recognition network model comprises a one-stage feature extractor G1Class classifier Class1Two stage feature extractor G2And two-stage classifier Class2Step S1 is to train the corresponding key frame sequence F of the videotrainAnd a corresponding label YtrainSending the data into a two-stage behavior recognition network model in batches, and training by using K classes divided in the step S3, wherein the size of each batch is B;
s5, testing the key frame sequence F corresponding to the video in the step S1testTraining in step S4And obtaining the behavior category of the test video by the trained two-stage behavior recognition network model.
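For illustration only, the following Python sketch shows one way the data split of step S1 could be organized so that the key frame sequences F_train/F_test stay in correspondence with the video split V_train/V_test. The split ratio, the stratified sampling, and the data-structure names (videos, keyframes, labels) are assumptions not fixed by the claim.

```python
from sklearn.model_selection import train_test_split

# videos: list of video identifiers; keyframes: dict video id -> key frame sequence;
# labels: dict video id -> behavior category (all assumed names).
def split_dataset(videos, keyframes, labels, test_size=0.2, seed=0):
    v_train, v_test = train_test_split(videos, test_size=test_size, random_state=seed,
                                       stratify=[labels[v] for v in videos])
    f_train = [keyframes[v] for v in v_train]   # F_train follows the video split
    f_test = [keyframes[v] for v in v_test]     # F_test follows the video split
    y_train = [labels[v] for v in v_train]
    y_test = [labels[v] for v in v_test]
    return (v_train, f_train, y_train), (v_test, f_test, y_test)
```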
2. The method according to claim 1, wherein in step S1, selecting key frames video by video for all videos V_all in the data set to obtain the key frame sequences specifically comprises the following steps:
S101, converting a video v into a video frame sequence [x_1, x_2, …, x_i, …, x_T], where T denotes the length of the video frame sequence and x_i denotes the i-th video frame of the video v;
S102, processing each original video frame to obtain a processed video frame sequence [x'_1, x'_2, …, x'_i, …, x'_T];
S103, obtaining the sparse representations [α_1, α_2, …, α_T] of the video frame sequence [x'_1, x'_2, …, x'_i, …, x'_T] by using the K-SVD algorithm;
S104, acquiring the key frame sequence.
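Claim 2 obtains the sparse representations [α_1, …, α_T] of the processed frames with the K-SVD algorithm. The snippet below is a hedged sketch that substitutes scikit-learn's MiniBatchDictionaryLearning with OMP sparse coding as an accessible stand-in for K-SVD (K-SVD itself is not in scikit-learn); the frame size, dictionary size and sparsity level are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def sparse_codes(frames, n_atoms=64, n_nonzero=5):
    """frames: (T, H, W) processed grayscale frames; returns (T, n_atoms) codes alpha_i.
    Dictionary learning with OMP coding is used here as a stand-in for K-SVD."""
    X = frames.reshape(frames.shape[0], -1).astype(np.float64)
    X -= X.mean(axis=1, keepdims=True)            # simple per-frame normalization
    dico = MiniBatchDictionaryLearning(n_components=n_atoms,
                                       transform_algorithm='omp',
                                       transform_n_nonzero_coefs=n_nonzero,
                                       random_state=0)
    return dico.fit(X).transform(X)               # alpha_1 ... alpha_T

# Example with synthetic frames standing in for a processed video:
alphas = sparse_codes(np.random.rand(40, 32, 32))
```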
3. The method according to claim 2, wherein step S104 specifically comprises:
S1041, calculating the inter-frame similarity matrix S' ∈ R^{T×T};
S1042, selecting the first frame x_1, putting it into the key frame sequence f_select, taking x_1 as the current frame x_now, and setting the maximum interval length τ, so that at this point f_select = [x_1] and x_now = x_1;
S1043, traversing the τ frames after x_now, selecting, according to the similarity matrix S', the frame with the lowest similarity to x_now, adding it to the key frame sequence f_select and taking it as the new x_now;
S1044, repeating step S1043 until all frames have been traversed, obtaining the final key frame sequence f_select.
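Steps S1041 to S1044 describe a greedy traversal that keeps frames dissimilar from the current key frame. A minimal sketch is given below, assuming cosine similarity between the sparse codes as the inter-frame similarity S' (the claim fixes only that S' ∈ R^{T×T}, not how it is computed).

```python
import numpy as np

def select_key_frames(codes, tau):
    """codes: (T, d) sparse representations alpha_i; returns indices of the key frames."""
    normed = codes / (np.linalg.norm(codes, axis=1, keepdims=True) + 1e-12)
    S = normed @ normed.T                    # S'[i, j]: similarity of frames i and j (assumed cosine)
    selected = [0]                           # S1042: the first frame starts the key frame sequence
    now = 0
    T = codes.shape[0]
    while now < T - 1:
        window = range(now + 1, min(now + tau, T - 1) + 1)  # the tau frames after x_now
        nxt = min(window, key=lambda j: S[now, j])          # frame with lowest similarity to x_now
        selected.append(nxt)                                # S1043
        now = nxt
    return selected                                         # S1044: final key frame indices
```

Because each step jumps at most τ frames ahead, the selected sequence spans the whole video while skipping near-duplicate frames.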
4. The method according to claim 1, wherein in step S2, obtaining the category similarity matrix S specifically comprises:
S201, obtaining the sentence vector of each behavior category label by using a BERT model, Vec = {vec_1, vec_2, …, vec_i, …, vec_N}, where vec_i denotes the sentence vector corresponding to the i-th category c_i;
S202, calculating the similarity between two different categories as sim(i, j) = cos(vec_i, vec_j), where cos(·) denotes the cosine similarity;
S203, constructing the category similarity matrix S from the similarities between the different categories.
5. The method according to claim 4, wherein in step S203, the entry S_{i,j} of the similarity matrix is specifically:
S_{i,j} = sim(i, j), if sim(i, j) ≥ r; S_{i,j} = 0, otherwise; for i, j = 1, 2, …, C,
wherein r is a threshold value, and C is the number of categories.
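Claims 4 and 5 build the category similarity matrix from sentence vectors of the label texts and threshold it with r; step S3 then groups similar categories into K major classes, although the claims do not fix the grouping rule. The sketch below uses the sentence-transformers package (a BERT-based encoder) as a stand-in for "a BERT model", and connected components over the thresholded matrix as one plausible grouping; the model name, threshold and example labels are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from scipy.sparse.csgraph import connected_components

def major_classes(label_texts, r=0.6):
    """Returns (S, group): S is the thresholded category similarity matrix,
    group[i] is the major-class index assigned to category i."""
    encoder = SentenceTransformer('all-MiniLM-L6-v2')   # BERT-based sentence encoder (assumed choice)
    vec = encoder.encode(label_texts)                   # vec_1 ... vec_N
    vec = vec / np.linalg.norm(vec, axis=1, keepdims=True)
    sim = vec @ vec.T                                   # sim(i, j) = cos(vec_i, vec_j)
    S = np.where(sim >= r, sim, 0.0)                    # claim 5: keep entries above threshold r
    np.fill_diagonal(S, 0.0)
    K, group = connected_components((S > 0).astype(int), directed=False)  # one plausible grouping rule
    return S, group

S, group = major_classes(["brushing teeth", "brushing hair", "playing guitar", "playing piano"])
```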
6. The method according to claim 1, wherein step S4 is specifically:
S401, constructing the key-frame-based two-stage behavior recognition network model: the one-stage feature extractor G_1 is the feature extractor of 3D-ResNet34; the one-stage classifier Class_1 comprises an input layer, a global 3D pooling layer and a fully connected layer connected in sequence; the two-stage feature extractor G_2 comprises K lightweight feature extractors, each comprising an input layer, a first 3D convolutional layer and a second 3D convolutional layer connected in sequence; the two-stage classifier Class_2 comprises K classifiers, each comprising an input layer, a global 3D pooling layer and a fully connected layer connected in sequence;
S402, training the one-stage feature extractor G_1 and the one-stage classifier Class_1;
S403, training the two-stage feature extractor G_2 and the two-stage classifier Class_2.
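Claim 6 fixes only the layer types of each component, so the following PyTorch sketch is one possible realization: torchvision's r3d_18 backbone stands in for the 3D-ResNet34 feature extractor, Class_1 is a global 3D pooling plus fully connected head over the K major classes, and G_2/Class_2 hold one lightweight branch per major class. Channel sizes and the per-sample routing are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18   # stand-in for the 3D-ResNet34 backbone

class GlobalPoolClassifier(nn.Module):
    """Global 3D pooling + fully connected layer (structure of Class_1 and of each Class_2 head)."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feat):
        return self.fc(self.pool(feat).flatten(1))

class LightweightExtractor(nn.Module):
    """One branch of G_2: two 3D convolutional layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feat):
        return self.relu(self.conv2(self.relu(self.conv1(feat))))

class TwoStageModel(nn.Module):
    def __init__(self, classes_per_major):
        # classes_per_major: number of fine classes in each of the K major classes
        super().__init__()
        backbone = r3d_18(weights=None)
        self.g1 = nn.Sequential(*list(backbone.children())[:-2])  # G_1: drop pooling and fc
        feat_dim = 512
        K = len(classes_per_major)
        self.class1 = GlobalPoolClassifier(feat_dim, K)
        self.g2 = nn.ModuleList([LightweightExtractor(feat_dim) for _ in range(K)])
        self.class2 = nn.ModuleList([GlobalPoolClassifier(feat_dim, n) for n in classes_per_major])

    def forward(self, clip):
        feat = self.g1(clip)                    # shared one-stage features
        coarse_logits = self.class1(feat)       # preliminary (major-class) prediction
        route = coarse_logits.argmax(dim=1)     # routing decision per sample
        fine_logits = [self.class2[int(route[i])](self.g2[int(route[i])](feat[i:i + 1]))
                       for i in range(clip.size(0))]
        return coarse_logits, fine_logits, route
```

Routing each sample to exactly one branch keeps the second stage lightweight, since a branch only needs to discriminate the few fine classes inside its major class.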
7. The method of claim 6, wherein in steps S402 and S403, training the one-stage feature extractor G_1 and one-stage classifier Class_1 and the two-stage feature extractor G_2 and two-stage classifier Class_2 specifically comprises:
setting the training batch size B to 32 and the number of iterations epoch to 100; randomly sampling a batch of B samples from the training key frame sequence set F_train and the corresponding labels Y_train;
sending the B sampled samples into the one-stage feature extractor G_1, feeding the features output by G_1 into the one-stage classifier Class_1 to obtain a preliminary classification result, and then, according to the preliminary classification result, sending the output features into the corresponding feature extractor of the two-stage feature extractor G_2 and the corresponding classifier of the two-stage classifier Class_2 to obtain the final classification result; calculating the loss value of the current batch with the similarity cross entropy loss function L_sim and updating the network parameters of G_1 and Class_1 by mini-batch stochastic gradient descent; calculating the loss value of the current batch with the cross entropy loss function, fixing G_1 and Class_1, and updating the network parameters of G_2 and Class_2 by mini-batch stochastic gradient descent; repeating the above steps until the number of iterations epoch is reached; outputting the weights of the one-stage feature extractor G_1 and of the one-stage classifier Class_1, and the weights of the two-stage feature extractor G_2 and of the two-stage classifier Class_2.
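Claim 7 alternates between updating G_1/Class_1 on the coarse labels and updating G_2/Class_2 with G_1/Class_1 fixed. The loop below is a hedged sketch assuming the TwoStageModel from the previous sketch; plain cross entropy on the major-class labels replaces L_sim (whose exact form is published only as an image), stage-two samples are routed by their true major class so the fine label always indexes a valid branch (the claim routes by the preliminary prediction, which would additionally require handling misrouted samples), and the learning rate is illustrative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_two_stage(model, loader, num_epochs=100):
    # loader yields (clips, major_y, fine_y) with batch size B = 32;
    # fine_y is the index of the fine class inside its major class.
    ce = nn.CrossEntropyLoss()
    opt1 = optim.SGD(list(model.g1.parameters()) + list(model.class1.parameters()), lr=0.01)
    opt2 = optim.SGD(list(model.g2.parameters()) + list(model.class2.parameters()), lr=0.01)

    for epoch in range(num_epochs):
        for clips, major_y, fine_y in loader:
            # Stage one: update G_1 and Class_1 on the K major classes
            # (plain cross entropy stands in for the similarity loss L_sim).
            feat = model.g1(clips)
            loss1 = ce(model.class1(feat), major_y)
            opt1.zero_grad(); loss1.backward(); opt1.step()

            # Stage two: fix G_1 / Class_1, update only G_2 / Class_2.
            with torch.no_grad():
                feat = model.g1(clips)            # features from the (now fixed) extractor
            loss2 = 0.0
            for i in range(clips.size(0)):
                k = int(major_y[i])               # route by the true major class (see lead-in)
                logits = model.class2[k](model.g2[k](feat[i:i + 1]))
                loss2 = loss2 + ce(logits, fine_y[i:i + 1])
            loss2 = loss2 / clips.size(0)
            opt2.zero_grad(); loss2.backward(); opt2.step()
```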
8. The method of claim 7, wherein the similarity cross entropy loss function used in the one-stage training process is:
[L_sim: formula provided as an image in the original publication]
the cross entropy loss function used in the two-stage training process is:
L = -(1/B) Σ_{i=1}^{B} Σ_{c=1}^{C} y_{ic} · log(p_{ic})
wherein B represents the number of samples in each batch, C represents the number of categories, y_i represents the true class of the i-th sample, y_{ic} equals 1 if y_i = c and 0 otherwise, and p_{ic} represents the probability predicted by the model that the i-th sample belongs to class c.
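For reference, the two-stage cross entropy above reduces to averaging the negative log probability assigned to each sample's true class. The small NumPy helper below is a sketch of that computation, where p and y are assumed to be the predicted probability matrix and the true class indices.

```python
import numpy as np

def cross_entropy(p, y):
    """p: (B, C) predicted class probabilities, y: (B,) true class indices.
    Implements L = -(1/B) * sum_i sum_c y_ic * log(p_ic) with one-hot y_ic."""
    B = p.shape[0]
    return -np.mean(np.log(p[np.arange(B), y] + 1e-12))
```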
9. The method according to claim 1, wherein step S5 is specifically:
S501, sending the key frame sequence to be predicted into the one-stage feature extractor G_1 of step S4, and feeding the features extracted by G_1 into the one-stage classifier Class_1 of step S4 to obtain a preliminary classification result;
S502, determining, according to the preliminary classification result obtained in step S501, which feature extractor of the two-stage feature extractor G_2 and which classifier of the two-stage classifier Class_2 to use;
that is, determining, from the preliminary classification result obtained in step S501, to which major class the key frame sequence to be predicted belongs; if it belongs to the k-th major class, the k-th feature extractor of G_2 and the k-th classifier of Class_2 are used;
S503, sending the output of G_1 in step S501 into the feature extractor of G_2 determined in step S502 to further extract features, and then sending the further extracted features into the classifier of Class_2 determined in step S502 to obtain the final classification result.
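A minimal inference sketch matching steps S501 to S503 is shown below, assuming the TwoStageModel from the earlier sketch, a single clip per call, and a bookkeeping table major_to_global that maps a branch-local class index back to the global behavior category (a structure the claims do not name).

```python
import torch

@torch.no_grad()
def predict(model, clip, major_to_global):
    """clip: (1, C, T, H, W) key frame sequence to be predicted."""
    feat = model.g1(clip)                        # S501: one-stage features
    k = int(model.class1(feat).argmax(dim=1))    # S501/S502: preliminary major-class decision
    fine_feat = model.g2[k](feat)                # S503: branch-specific refinement
    j = int(model.class2[k](fine_feat).argmax(dim=1))
    return major_to_global[k][j]                 # final behavior category
```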
10. A two-stage behavior recognition system based on a sequence of key frames and behavior information, comprising:
a selection module, configured to select key frames video by video for all videos V_all in the data set to obtain the key frame sequences F_all corresponding to all videos, then divide all videos into training videos V_train and test videos V_test, and divide the corresponding key frame sequences into F_train and F_test accordingly;
a calculation module, configured to calculate the similarity between all behavior categories in the videos to obtain a category similarity matrix S;
a division module, configured to divide the N categories of the behavior category set C into K major classes according to the similarity matrix S obtained by the calculation module;
a network module, configured to construct a key-frame-based two-stage behavior recognition network model, the two-stage behavior recognition network model comprising a one-stage feature extractor G_1, a one-stage classifier Class_1, a two-stage feature extractor G_2 and a two-stage classifier Class_2, wherein the key frame sequences F_train corresponding to the training videos of the selection module and the corresponding labels Y_train are sent into the two-stage behavior recognition network model in batches and trained with the K major classes divided by the division module, the size of each batch being B;
an identification module, configured to send the key frame sequences F_test corresponding to the test videos of the selection module into the two-stage behavior recognition network model trained by the network module to obtain the behavior categories of the test videos.
CN202110605394.3A 2021-05-31 2021-05-31 Two-stage behavior recognition method and system based on key frame sequence and behavior information Active CN113239869B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110605394.3A CN113239869B (en) 2021-05-31 2021-05-31 Two-stage behavior recognition method and system based on key frame sequence and behavior information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110605394.3A CN113239869B (en) 2021-05-31 2021-05-31 Two-stage behavior recognition method and system based on key frame sequence and behavior information

Publications (2)

Publication Number Publication Date
CN113239869A true CN113239869A (en) 2021-08-10
CN113239869B CN113239869B (en) 2023-08-11

Family

ID=77136003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110605394.3A Active CN113239869B (en) 2021-05-31 2021-05-31 Two-stage behavior recognition method and system based on key frame sequence and behavior information

Country Status (1)

Country Link
CN (1) CN113239869B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882403A (en) * 2022-05-05 2022-08-09 杭州电子科技大学 Video space-time action positioning method based on progressive attention hypergraph
CN114973684A (en) * 2022-07-25 2022-08-30 深圳联和智慧科技有限公司 Construction site fixed-point monitoring method and system
CN115035462A (en) * 2022-08-09 2022-09-09 阿里巴巴(中国)有限公司 Video identification method, device, equipment and storage medium
CN116400812A (en) * 2023-06-05 2023-07-07 中国科学院自动化研究所 Emergency rescue gesture recognition method and device based on surface electromyographic signals
CN116580832A (en) * 2023-05-05 2023-08-11 暨南大学 Auxiliary diagnosis system and method for senile dementia based on video data

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140079297A1 (en) * 2012-09-17 2014-03-20 Saied Tadayon Application of Z-Webs and Z-factors to Analytics, Search Engine, Learning, Recognition, Natural Language, and Other Utilities
CN109800698A (en) * 2019-01-11 2019-05-24 北京邮电大学 Icon detection method based on depth network
CN109961034A (en) * 2019-03-18 2019-07-02 西安电子科技大学 Video object detection method based on convolution gating cycle neural unit
CN110826491A (en) * 2019-11-07 2020-02-21 北京工业大学 Video key frame detection method based on cascading manual features and depth features
CN111626245A (en) * 2020-06-01 2020-09-04 安徽大学 Human behavior identification method based on video key frame
CN111832516A (en) * 2020-07-22 2020-10-27 西安电子科技大学 Video behavior identification method based on unsupervised video representation learning
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN112580555A (en) * 2020-12-25 2021-03-30 中国科学技术大学 Spontaneous micro-expression recognition method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140079297A1 (en) * 2012-09-17 2014-03-20 Saied Tadayon Application of Z-Webs and Z-factors to Analytics, Search Engine, Learning, Recognition, Natural Language, and Other Utilities
CN109800698A (en) * 2019-01-11 2019-05-24 北京邮电大学 Icon detection method based on depth network
CN109961034A (en) * 2019-03-18 2019-07-02 西安电子科技大学 Video object detection method based on convolution gating cycle neural unit
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN110826491A (en) * 2019-11-07 2020-02-21 北京工业大学 Video key frame detection method based on cascading manual features and depth features
CN111626245A (en) * 2020-06-01 2020-09-04 安徽大学 Human behavior identification method based on video key frame
CN111832516A (en) * 2020-07-22 2020-10-27 西安电子科技大学 Video behavior identification method based on unsupervised video representation learning
CN112580555A (en) * 2020-12-25 2021-03-30 中国科学技术大学 Spontaneous micro-expression recognition method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KONG JUAN; TIAN LI: "Video key frame extraction algorithm based on mutual information", Journal of Anyang Institute of Technology, no. 04
ZHANG CONGCONG; HE NING: "Human action recognition method based on key-frame two-stream convolutional network", Journal of Nanjing University of Information Science & Technology (Natural Science Edition), no. 06
LIANG JIANSHENG; WEN HEPING: "Video key frame extraction and video retrieval based on deep learning", Control Engineering of China, no. 05

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114882403A (en) * 2022-05-05 2022-08-09 杭州电子科技大学 Video space-time action positioning method based on progressive attention hypergraph
CN114882403B (en) * 2022-05-05 2022-12-02 杭州电子科技大学 Video space-time action positioning method based on progressive attention hypergraph
CN114973684A (en) * 2022-07-25 2022-08-30 深圳联和智慧科技有限公司 Construction site fixed-point monitoring method and system
CN115035462A (en) * 2022-08-09 2022-09-09 阿里巴巴(中国)有限公司 Video identification method, device, equipment and storage medium
CN116580832A (en) * 2023-05-05 2023-08-11 暨南大学 Auxiliary diagnosis system and method for senile dementia based on video data
CN116400812A (en) * 2023-06-05 2023-07-07 中国科学院自动化研究所 Emergency rescue gesture recognition method and device based on surface electromyographic signals
CN116400812B (en) * 2023-06-05 2023-09-12 中国科学院自动化研究所 Emergency rescue gesture recognition method and device based on surface electromyographic signals

Also Published As

Publication number Publication date
CN113239869B (en) 2023-08-11

Similar Documents

Publication Publication Date Title
CN113239869B (en) Two-stage behavior recognition method and system based on key frame sequence and behavior information
Oh et al. Crowd counting with decomposed uncertainty
CN109993102B (en) Similar face retrieval method, device and storage medium
CN108470172B (en) Text information identification method and device
CN104063883B (en) A kind of monitor video abstraction generating method being combined based on object and key frame
KR102094320B1 (en) Method for improving image using reinforcement learning
CN111046821B (en) Video behavior recognition method and system and electronic equipment
CN112489092B (en) Fine-grained industrial motion modality classification method, storage medium, device and apparatus
CN111597920B (en) Full convolution single-stage human body example segmentation method in natural scene
CN113762138B (en) Identification method, device, computer equipment and storage medium for fake face pictures
Wang et al. Learning efficient binarized object detectors with information compression
CN114359563B (en) Model training method, device, computer equipment and storage medium
CN110390347A (en) Conditions leading formula confrontation for deep neural network generates test method and system
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN112580458A (en) Facial expression recognition method, device, equipment and storage medium
CN112528058B (en) Fine-grained image classification method based on image attribute active learning
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN110827265A (en) Image anomaly detection method based on deep learning
CN114283350A (en) Visual model training and video processing method, device, equipment and storage medium
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
CN114419406A (en) Image change detection method, training method, device and computer equipment
CN115578770A (en) Small sample facial expression recognition method and system based on self-supervision
CN112786160A (en) Multi-image input multi-label gastroscope image classification method based on graph neural network
CN113822134A (en) Instance tracking method, device, equipment and storage medium based on video
CN113378722B (en) Behavior identification method and system based on 3D convolution and multilevel semantic information fusion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant