CN114550310A - Method and device for identifying multi-label behaviors - Google Patents

Method and device for identifying multi-label behaviors

Info

Publication number
CN114550310A
CN114550310A
Authority
CN
China
Prior art keywords
behavior
correlation
feature
characteristic
behaviors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210425904.3A
Other languages
Chinese (zh)
Inventor
张翼翔
叶小培
张江峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Moredian Technology Co ltd
Original Assignee
Hangzhou Moredian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Moredian Technology Co ltd filed Critical Hangzhou Moredian Technology Co ltd
Priority to CN202210425904.3A priority Critical patent/CN114550310A/en
Publication of CN114550310A publication Critical patent/CN114550310A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a method and a device for identifying multi-label behaviors. The method comprises the following steps: recognizing an input image according to a pre-trained behavior recognition model to obtain a feature map; extracting key regions according to the feature map; acquiring a specific feature of at least one behavior according to the key regions; acquiring correlation features among the behaviors according to the specific feature of the at least one behavior; and classifying according to the specific features and the correlation features to obtain a classification result corresponding to each behavior. The scheme provided by the invention can accurately identify the multi-label behaviors appearing in a video.

Description

Method and device for identifying multi-label behaviors
Technical Field
The invention relates to the field of computer technology application, in particular to a method and a device for identifying multi-label behaviors.
Background
With the development of computer technology, behavior recognition technology is maturing. In the security field, crowd detection, fight detection and behavior early warning are widely applied; in the sports field, behavior recognition appears in the training evaluation and action scoring of athletes; behavior recognition is also widely applied in the fields of smart home, human-computer interaction and short video. Multi-label behavior recognition shows better robustness in complex scenes. Multi-label behavior recognition differs from single-label behavior recognition: when several actions occur in a detected video, single-label recognition can only output one behavior label and cannot fully reflect the content contained in the video.
Multi-label behavior recognition is a classification problem: as the name suggests, after a video is input, the behavior recognition model outputs the multiple behavior classes represented in the video. The process can be as follows: several video frames are taken as input, features are extracted through a neural network, and the features are sent into a classifier to obtain a classification result. Among these, the design of the neural network, i.e. the feature extraction, is the most important link in the process.
However, the current methods have the following problems:
Most current mainstream schemes send the video into a network, then uniformly extract features and classify. For a multi-label video, however, the video contains several different sub-behaviors, and uniformly extracting features submerges the specific features of certain behaviors, making high accuracy difficult to obtain.
Aiming at the problem in the prior art that the existing neural network cannot meet the requirements of multi-label behavior recognition in the feature extraction process, no effective solution has yet been proposed.
Disclosure of Invention
In order to solve the above technical problems, embodiments of the present invention are expected to provide a method and an apparatus for identifying a multi-tag behavior, so as to at least solve the problem that an existing neural network cannot meet the requirement of identifying the multi-tag behavior in the feature extraction process in the prior art.
The technical scheme of the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a method for identifying multi-label behaviors, including: recognizing an input image according to a pre-trained behavior recognition model to obtain a feature map; extracting key regions according to the feature map; acquiring a specific feature of at least one behavior according to the key regions; acquiring correlation features among the behaviors according to the specific feature of the at least one behavior; and classifying according to the specific features and the correlation features to obtain a classification result corresponding to each behavior.
Optionally, before recognizing the input image according to the pre-trained behavior recognition model, the method further includes: acquiring a training image; inputting a training image serving as an input image into an end-to-end network model, and acquiring the specific characteristic of at least one behavior in the training image; acquiring correlation characteristics among the behaviors according to the specificity characteristics of at least one behavior; and training the end-to-end network model according to the specific characteristics and the correlation characteristics until the end-to-end network model converges to obtain a behavior recognition model.
Further, optionally, training the end-to-end network model according to the specific feature and the correlation feature until the end-to-end network model converges, and obtaining the behavior recognition model includes: taking the specificity characteristic and the correlation characteristic as input data of the end-to-end network model; the input data passes through a full connection layer of an end-to-end network model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic; inputting the first characteristic value and the second characteristic value into a softmax layer to obtain a classification result; and training the end-to-end network model according to the classification result and the input data until the end-to-end network model is converged to obtain a behavior recognition model.
Optionally, after the first feature value and the second feature value are input into the softmax layer to obtain the classification result, the method further includes: sending the classification result to a preset loss function to calculate a loss value, performing gradient back propagation, and updating the parameters; the preset loss function is used for the classification task.
Optionally, recognizing the input image according to a pre-trained behavior recognition model to obtain a feature map includes: in the case that the input image comprises a video, inputting the video with preset dimensions into the behavior recognition model for image extraction to obtain the feature map, where the preset dimensions include the number of channels, time, width and height.
Optionally, extracting the key region according to the feature map includes: by applying a plurality of attention modules on the feature map, key regions are extracted from the feature map.
Further, optionally, the obtaining of the at least one behavior-specific feature according to the key region includes: and activating through an attention mechanism according to the key area to obtain the specific characteristics of at least one behavior in the key area.
Optionally, the obtaining the correlation characteristic between the behaviors according to the specificity characteristic of the at least one behavior includes: generating a correlation matrix by counting the correlation among the behaviors according to the specific characteristics of at least one behavior; and acquiring the correlation characteristics among the behaviors according to the correlation matrix.
Further, optionally, the classifying according to the specificity feature and the correlation feature to obtain a classification result corresponding to each behavior includes: taking the specific features and the correlation features as input data of a behavior recognition model; the input data are processed through a full connection layer of a behavior recognition model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic; and inputting the first characteristic value and the second characteristic value into the softmax layer to obtain a classification result.
In a second aspect, an embodiment of the present invention provides an apparatus for identifying multi-tag behaviors, including: the recognition module is used for recognizing the input image according to a pre-trained behavior recognition model to obtain a characteristic diagram; the extraction module is used for extracting a key area according to the characteristic diagram; the first acquisition module is used for acquiring the specific characteristics of at least one behavior according to the key area; the second acquisition module is used for acquiring correlation characteristics among the behaviors according to the specific characteristics of at least one behavior; and the classification module is used for classifying according to the specificity characteristics and the correlation characteristics to obtain classification results corresponding to each behavior.
According to the embodiment of the invention, an input image is recognized according to a pre-trained behavior recognition model to obtain a feature map; key regions are extracted according to the feature map; the specific feature of at least one behavior is acquired according to the key regions; correlation features among the behaviors are acquired according to the specific feature of the at least one behavior; and classification is performed according to the specific features and the correlation features to obtain a classification result corresponding to each behavior, so that the multi-label behaviors appearing in a video can be accurately identified.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a schematic flowchart of a method for identifying multi-tag behaviors according to an embodiment of the present invention;
fig. 2a is a schematic diagram of a network structure of C3D in a method for identifying multi-tag behaviors according to an embodiment of the present invention;
fig. 2b is a schematic diagram of a network convolution of C3D in a method for identifying multi-tag behaviors according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating behavior recognition performed by a neural network in a method for recognizing multi-tag behaviors according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an apparatus for identifying multi-tag behaviors according to a second embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and the like in the description and claims of the present invention and the accompanying drawings are used for distinguishing different objects, and are not used for limiting a specific order.
It should be noted that the following embodiments of the present invention may be implemented individually, or may be implemented in combination with each other, and the embodiments of the present invention are not limited in this respect.
Example one
In a first aspect, an embodiment of the present invention provides a method for identifying multi-label behaviors, applied in scenarios of behavior feature identification.
The method for identifying multi-label behaviors provided by the embodiment of the application can be applied to identifying scenes in which multi-label behaviors occur in videos, and has wide application in fields such as short video recommendation.
In the specific implementation, the method identifies a plurality of behaviors existing in a video by creating a model, and therefore comprises: a model generation phase and a model application phase, wherein,
stage one, the model generation stage includes: a model training phase and a model testing phase, wherein,
a model training stage:
s1, acquiring a training image;
In this embodiment, the training image may be a video; specifically, a video clip is taken as an example and denoted as V. A video frame in the video V may be an RGB image with three channels and a scale of 224 × 224, with 32 frames in total, so the input dimension is denoted as 3 × 32 × 224 × 224. Each video has a plurality of real labels; for example, if a video shows a person walking from a distance and then sitting down on a chair, the labels are denoted as walking and sitting, that is, the labels are used to represent the behavior types of the target objects in the video.
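As an illustration of these shapes, the following is a minimal sketch (assuming PyTorch; the class count A and the class indices for walking and sitting are hypothetical):

```python
import torch

# A hypothetical training clip V: 3 channels (RGB), 32 frames, 224 x 224 pixels,
# matching the 3 x 32 x 224 x 224 input dimension described above.
V = torch.rand(3, 32, 224, 224)

# Multi-hot ground truth over A sub-action classes (A = 10 is an arbitrary example);
# the clip shows both "walking" and "sitting", so both positions are set to 1.
A = 10
WALKING, SITTING = 2, 7          # hypothetical class indices
labels = torch.zeros(A)
labels[WALKING] = 1.0
labels[SITTING] = 1.0

print(V.shape, labels)           # torch.Size([3, 32, 224, 224]) and the multi-hot vector
```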
S2, inputting the training image as an input image into the end-to-end network model, and acquiring the specific characteristics of at least one behavior in the training image;
the neural network used in the method for identifying the multi-label behaviors provided by the embodiment of the application can be an end-to-end network model, the training image is used as an input image and is input into the end-to-end network model, and the specific characteristics of at least one behavior in the training image are obtained based on the end-to-end network model. In order to solve the problem that in the related art, for a multi-label video, since a video includes a plurality of different sub-behaviors, and the uniform extraction of features can overwhelm the specific features of some behaviors, and thus it is difficult to achieve high accuracy, the embodiment of the present application proposes to extract the specific features for each sub-behavior (i.e., at least one behavior in the embodiment of the present application) separately.
S3, acquiring correlation characteristics among the behaviors according to the specificity characteristics of at least one behavior;
based on the specific characteristics of at least one behavior obtained in step S2, the correlation characteristics between behaviors are obtained, so that the effect of behavior identification is further improved by using the correlation information between each child behavior and adding prior information.
And S4, training the end-to-end network model according to the specificity characteristics and the correlation characteristics until the end-to-end network model converges to obtain a behavior recognition model.
Training the end-to-end network model according to the specific features and the correlation features until the end-to-end network model converges to obtain the behavior recognition model comprises the following steps: taking the specific features and the correlation features as input data of the end-to-end network model; passing the input data through a fully connected layer of the end-to-end network model to obtain a first feature value corresponding to the specific feature and a second feature value corresponding to the correlation feature; inputting the first feature value and the second feature value into a softmax layer to obtain a classification result; and training the end-to-end network model according to the classification result and the input data until the end-to-end network model converges to obtain the behavior recognition model.
In the embodiment of the present application, the specific feature is denoted as $F_s$ and the correlation feature is denoted as $F_c$. $F_s$ and $F_c$ are input into the fully connected layer of the end-to-end network model, and $y_s$ and $y_c$ are obtained by calculation, where $y_s$ is the first feature value and $y_c$ is the second feature value in the embodiment of the present application.
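As a concrete sketch of this step (a minimal illustration assuming PyTorch; the feature dimension D, the class count A, and the shared fully connected layer are assumptions, not the patent's exact configuration):

```python
import torch
import torch.nn as nn

D, A = 512, 10                     # hypothetical feature dimension and number of classes

fc = nn.Linear(D, A)               # fully connected layer of the end-to-end network model
F_s = torch.rand(1, D)             # specific feature (stand-in values)
F_c = torch.rand(1, D)             # correlation feature (stand-in values)

y_s = fc(F_s)                      # first feature value
y_c = fc(F_c)                      # second feature value

p_s = torch.softmax(y_s, dim=-1)   # softmax output for the specificity branch
p_c = torch.softmax(y_c, dim=-1)   # softmax output for the correlation branch
print(p_s.shape, p_c.shape)        # two 1 x A classification results
```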
After the first feature value and the second feature value are input into the softmax layer to obtain the classification result, the method for identifying multi-label behaviors provided by the embodiment of the application further includes: sending the classification result to a preset loss function to calculate a loss value, performing gradient back propagation, and updating the parameters; the preset loss function is used for the classification task.
The preset loss function formula in the embodiment of the application is as follows:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\right]$$

The loss function is the cross-entropy loss, used for the classification task, where $N$ represents the total number of samples, $y_i$ represents the true label of sample $i$ (1 for the positive class, 0 for the negative class), $p_i$ represents the probability that sample $i$ is predicted as the positive class, and $1 - p_i$ is accordingly the probability that sample $i$ is predicted as the negative class. Since $p_i$ is desired to be as large as possible for positive samples, i.e. $-\log p_i$ as small as possible, the preset loss function is constructed accordingly.
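A small numeric check of this loss (a sketch assuming the reconstructed formula above; the sample labels and probabilities are made up for illustration):

```python
import numpy as np

def preset_loss(y, p):
    # Cross-entropy over N samples: y[i] is the true label (1 positive, 0 negative),
    # p[i] is the predicted probability of the positive class.
    y, p = np.asarray(y, float), np.asarray(p, float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Confident, correct predictions give a small loss; poor ones a large loss.
print(preset_loss([1, 0, 1], [0.9, 0.1, 0.8]))  # ~0.144
print(preset_loss([1, 0, 1], [0.2, 0.8, 0.3]))  # ~1.474
```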
Finally, after n rounds of training, the end-to-end network model converges and the behavior recognition model is obtained.
And (3) a model testing stage:
step 0, inputting a test video segment V with a dimension of 3 × 32 × 224.
And step 1, sending the V into a backbone network, and finishing end-to-end calculation by the network.
Step 2, finally obtaining characteristics
Figure DEST_PATH_IMAGE015
And
Figure 381253DEST_PATH_IMAGE016
step 3, characterizing
Figure DEST_PATH_IMAGE017
And
Figure 456656DEST_PATH_IMAGE018
feeding into softAnd max layers, and obtaining a classification result.
Stage two, model application stage
Fig. 1 is a schematic flowchart of a method for identifying multi-tag behaviors according to an embodiment of the present invention; as shown in fig. 1, a method for identifying multi-tag behaviors provided in an embodiment of the present application includes:
step S102, recognizing an input image according to a pre-trained behavior recognition model to obtain a feature map;
Optionally, recognizing the input image according to the pre-trained behavior recognition model to obtain the feature map includes: in the case that the input image comprises a video, inputting the video with preset dimensions into the behavior recognition model for image extraction to obtain the feature map, where the preset dimensions include the number of channels, time, width and height.
In this embodiment, the behavior recognition model may be a C3D network structure, as shown in fig. 2a, fig. 2a is a schematic diagram of a C3D network structure in the method for recognizing a multi-tag behavior according to an embodiment of the present invention. The main reason why the C3D network structure is used in the embodiment of the present application is that behavior recognition is a task that requires both spatial domain information and time domain information, and a common 2D convolutional neural network can only capture spatial domain information and cannot reasonably and effectively utilize time information, so that the C3D network structure is used as a feature extraction network. As shown in fig. 2b, fig. 2b is a schematic diagram of a C3D network convolution in a method for identifying multi-label behaviors according to an embodiment of the present invention, where the 3D convolution stacks consecutive frames and performs a uniform convolution operation, and the obtained feature map includes both frame sequence information (i.e., time domain information) and spatial domain information.
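The following is a minimal sketch of such a 3D convolution over stacked frames (assuming PyTorch; the channel counts, kernel size, and pooling are illustrative choices, not the exact C3D configuration from fig. 2a):

```python
import torch
import torch.nn as nn

# One C3D-style block: a 3 x 3 x 3 convolution slides jointly over time (T) and
# space (H, W), so the output feature map carries both frame-sequence (temporal)
# information and spatial information.
conv3d = nn.Sequential(
    nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space first, keep early temporal detail
)

clip = torch.rand(1, 3, 32, 224, 224)      # (batch, C, T, H, W) video input
features = conv3d(clip)
print(features.shape)                      # torch.Size([1, 64, 32, 112, 112])
```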
As shown in fig. 3, fig. 3 is a schematic diagram of behavior recognition performed by a neural network in a method for recognizing multi-label behaviors according to an embodiment of the present invention. The process of behavior recognition may be divided into three stages. In the first stage, the feature map in the embodiment of the present application is obtained by passing a video sequence with dimensions 3 × 32 × 224 × 224 through the C3D structure, and is denoted as $X$. In the embodiments of the present application, the dimension is indicated as $C \times (T \times W \times H)$, where $C$ represents the number of channels, $T$ represents time, $W$ represents width, and $H$ represents height.
Step S104, extracting key areas according to the feature map;
Step S104 of extracting the key regions according to the feature map includes: extracting the key regions from the feature map by applying a plurality of attention modules on the feature map.
Specifically, based on the feature map $X$ obtained in step S102, a plurality of attention modules are applied on $X$ to extract the key regions. The purpose is as follows: people are the main subjects of behavior recognition, yet the spatial positions they occupy are small, so the feature map contains a great deal of redundant information that interferes with the training effect; extracting key regions for the subsequent feature calculation therefore saves computation and improves the effect.
It should be noted that the embodiment of the application provides a key-region self-learning mode: in the model training process, better key regions are continuously learned, which supplements the task of multi-label behavior recognition and further improves the effect.
Step S106, acquiring at least one behavior specific characteristic according to the key area;
the step S106 of obtaining at least one behavior specificity feature according to the key region includes: and activating through an attention mechanism according to the key area to obtain the specific characteristics of at least one behavior in the key area.
Specifically, as shown in fig. 3, in the second stage, the specific feature of each sub-behavior (i.e., the specific feature of at least one behavior) is obtained by attention-mechanism activation using the obtained key region features.
Step S108, acquiring correlation characteristics among the behaviors according to the specificity characteristics of at least one behavior;
in step S108, obtaining the correlation characteristic between the behaviors according to the specific characteristic of the at least one behavior includes: generating a correlation matrix by counting the correlation among the behaviors according to the specific characteristics of at least one behavior; and acquiring the correlation characteristics among the behaviors according to the correlation matrix.
Specifically, as shown in fig. 3, the correlation features between behaviors are acquired according to the specific feature of at least one behavior obtained in step S106. The reason is that although a coherent long action contains many sub-actions, the sub-actions are often related to one another; for example, the two actions of "stretching out a hand" and "holding a cup" are more strongly correlated than the two actions of "stretching out a hand" and "lying down".
And step S110, classifying according to the specificity characteristics and the correlation characteristics to obtain classification results corresponding to each behavior.
In step S110, the classifying according to the specificity feature and the correlation feature to obtain a classification result corresponding to each behavior includes: taking the specific features and the correlation features as input data of a behavior recognition model; the input data are processed through a full connection layer of a behavior recognition model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic; and inputting the first characteristic value and the second characteristic value into the softmax layer to obtain a classification result.
Specifically, as shown in fig. 3, in the third stage, in order to avoid confusion between the sub-action specific feature and the sub-action correlation feature during back propagation, $y_s$ and $y_c$ are fed separately into the softmax layer for classification (that is, the first feature value and the second feature value in the embodiment of the present application are input into the softmax layer to obtain the classification result).
In summary, with reference to steps S102 to S110, the method for identifying a multi-tag behavior provided in the embodiment of the present application specifically includes:
after the video sequence passes through the C3D structure, the characteristic diagram is obtained and is marked as
Figure 423585DEST_PATH_IMAGE024
K key regions are marked as
Figure DEST_PATH_IMAGE025
Is applied to
Figure 228861DEST_PATH_IMAGE026
Is noted as an attention module
Figure DEST_PATH_IMAGE027
A sub-action-specific features are expressed as
Figure 65230DEST_PATH_IMAGE028
Applied to sub-action features
Figure DEST_PATH_IMAGE029
Is noted as an attention module
Figure 369172DEST_PATH_IMAGE030
And a sub-action dependency feature is noted
Figure DEST_PATH_IMAGE031
Figure 188836DEST_PATH_IMAGE032
And
Figure DEST_PATH_IMAGE033
separately feeding into the full-connection layer, and recording the obtained characteristics
Figure 629045DEST_PATH_IMAGE034
And
Figure DEST_PATH_IMAGE035
. The classification result is recorded as
Figure 105157DEST_PATH_IMAGE036
And
Figure DEST_PATH_IMAGE037
Figure 647127DEST_PATH_IMAGE038
dimension C (T W H), is a two-dimensional matrix (intuitive understanding: C columns, each with T W H, each representing all the spatio-temporal information of a channel).
The $k$ key regions are obtained by the attention modules $M_1, \dots, M_k$ acting on $X$, giving $k$ modules in total, as in formula (2):

$$r_i = M_i(X), \quad i = 1, \dots, k \tag{2}$$
In formula (2), $W_i$ is a weight matrix whose main purpose is to adjust the number of channels, and the attention module is computed as follows:

$$M_i(X) = \mathrm{softmax}(W_i\, x) \tag{3}$$
Here $x$ is any one row of the feature matrix $X$. The meaning of $\mathrm{softmax}(W_i\, x)$ is that all the spatio-temporal information in each feature channel is activated to obtain an activated feature vector of dimension $(T \times W \times H) \times 1$; the softmax function outputs a probability value of dimension $(T \times W \times H) \times 1$ that represents the degree of activation of each dimension of the feature, and the larger the value, the higher the response at that position, i.e. the more representative that key region is of the behavior. By selecting the first $k$ values in the probability ranking, the self-learning of the key regions is completed; the dimensions of $r_i$ and $M_i$ follow accordingly.
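A sketch of this key-region self-learning under the notation above (assuming PyTorch; the concrete shapes and the learnable weight w are assumptions):

```python
import torch
import torch.nn.functional as F

C, T, W, H, k = 64, 32, 14, 14, 8
X = torch.rand(C, T * W * H)               # feature map as a C x (T*W*H) matrix

w = torch.rand(1, C, requires_grad=True)   # learnable weight adjusting the channel number
a = F.softmax(w @ X, dim=-1)               # (T*W*H)-dimensional probabilities: the
                                           # activation degree of every position

topk = torch.topk(a.squeeze(0), k).indices # the k positions with the highest response
r = X[:, topk]                             # key-region features, one column per region
print(a.shape, r.shape)                    # torch.Size([1, 6272]) torch.Size([64, 8])
```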
The sub-action specific features are obtained by the attention module $M'$ acting on the key regions $r_1, \dots, r_k$; $M'$ works on the same principle as $M_i$ above, where $A$ is the number of all sub-actions in the training data. For each key region $r_i$, after activation, the sub-action category most probably corresponding to that key region is obtained. The sub-action specific features are expressed as $F_s$, as in formula (4):

$$F_s = M'(r_i), \quad i = 1, \dots, k \tag{4}$$
This completes the learning of the sub-action specific features.
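One possible reading of this step as code (a sketch under the same assumed notation; the projection to A sub-action scores and the weighted pooling into $F_s$ are assumptions):

```python
import torch
import torch.nn.functional as F

C, k, A = 64, 8, 10
r = torch.rand(C, k)                          # k key-region features from the previous step

W_act = torch.rand(A, C, requires_grad=True)  # attention weights of M' over sub-actions
scores = F.softmax(W_act @ r, dim=0)          # A x k: sub-action activation per key region
best = scores.argmax(dim=0)                   # most probable sub-action per key region

F_s = scores @ r.t()                          # A x C: one specificity feature per
                                              # sub-action, an attention-weighted
                                              # pooling of the key regions
print(best, F_s.shape)
```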
The sub-action correlation feature $F_c$ is obtained from a correlation matrix of size $A \times A$, which records the frequency of each pair of sub-action combinations; specifically:

$$G_{ij} = \frac{P_{ij}}{P_i} \tag{5}$$

where $P_{ij}$ represents the probability of the $i$-th and $j$-th sub-actions co-occurring in all training samples, and $P_i$ represents the probability of the $i$-th sub-action occurring in all training samples.
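A sketch of building such a correlation matrix by counting label co-occurrence over the training set (the toy label sets and the matrix name G are made up for illustration):

```python
import numpy as np

A = 4
# Toy multi-label annotations: each row is the label set of one training sample.
samples = [{0, 1}, {0, 1, 2}, {1, 3}, {0, 2}]

N = np.zeros((A, A))                 # N[i, j]: co-occurrence count of sub-actions i and j
for labels in samples:
    for i in labels:
        for j in labels:
            N[i, j] += 1

# G[i, j] = P(i and j co-occur) / P(i occurs); the diagonal N[i, i] is the
# occurrence count of sub-action i itself.
G = N / N.diagonal()[:, None]
print(G)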
Based on the above, two features are obtained: $F_s$ and $F_c$. The two features are fed into the fully connected layer to obtain the feature vectors $y_s$ and $y_c$, which are then fed into the softmax layer for classification; the obtained results are denoted as $p_s$ and $p_c$, and their dimensionalities are the same, both being $A$. The result values are added to obtain the probability values corresponding to all sub-categories; a threshold is set, and the sub-actions larger than the threshold are the predicted labels of the behavior.
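A sketch of this final fusion and thresholding step (same assumed notation; the 0.5 threshold is an arbitrary example):

```python
import torch
import torch.nn.functional as F

A = 10
y_s, y_c = torch.rand(A), torch.rand(A)   # feature vectors from the two FC branches

p_s = F.softmax(y_s, dim=0)               # classification result of the specificity branch
p_c = F.softmax(y_c, dim=0)               # classification result of the correlation branch

p = p_s + p_c                             # summed probability for every sub-category
threshold = 0.5                           # hypothetical threshold
predicted = (p > threshold).nonzero().squeeze(-1)
print(predicted)                          # sub-actions above the threshold: the
                                          # predicted labels of the behavior
```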
So far, the whole process is ended.
According to the embodiment of the invention, the input image is recognized according to the pre-trained behavior recognition model to obtain the feature map; key regions are extracted according to the feature map; the specific feature of at least one behavior is acquired according to the key regions; correlation features among the behaviors are acquired according to the specific feature of the at least one behavior; and classification is performed according to the specific features and the correlation features to obtain the classification result corresponding to each behavior, so that the multi-label behaviors appearing in the video can be accurately identified.
Example two
In a second aspect, an embodiment of the present invention provides an apparatus for identifying a multi-tag behavior, and fig. 4 is a schematic diagram of an apparatus for identifying a multi-tag behavior according to a second embodiment of the present invention; as shown in fig. 4, an apparatus for identifying multi-tag behaviors provided in an embodiment of the present application includes: the recognition module 40 is used for recognizing the input image according to a pre-trained behavior recognition model to obtain a feature map; an extraction module 42, configured to extract a key region according to the feature map; a first obtaining module 44, configured to obtain a specific feature of at least one behavior according to the key area; a second obtaining module 46, configured to obtain a correlation feature between the behaviors according to the specific feature of the at least one behavior; and the classification module 48 is configured to perform classification according to the specificity characteristics and the correlation characteristics to obtain classification results corresponding to each behavior.
Optionally, the apparatus for identifying a multi-tag behavior provided in the embodiment of the present application further includes: the image acquisition module is used for acquiring a training image before the input image is identified according to a pre-trained behavior identification model; the first characteristic acquisition module is used for inputting the training image into the end-to-end network model as an input image and acquiring the specific characteristic of at least one behavior in the training image; the second characteristic acquisition module is used for acquiring correlation characteristics among the behaviors according to the specific characteristics of at least one behavior; and the training module is used for training the end-to-end network model according to the specific characteristics and the correlation characteristics until the end-to-end network model converges to obtain a behavior recognition model.
Further, optionally, the training module comprises: a first input unit, configured to use the specificity feature and the correlation feature as input data of an end-to-end network model; the characteristic value acquisition unit is used for enabling the input data to pass through a full connection layer of the end-to-end network model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic; the second input unit is used for inputting the first characteristic value and the second characteristic value into the softmax layer to obtain a classification result; and the training unit is used for training the end-to-end network model according to the classification result and the input data until the end-to-end network model is converged to obtain a behavior recognition model.
Optionally, the apparatus for identifying a multi-tag behavior provided in the embodiment of the present application further includes: the calculation unit is used for inputting the first characteristic value and the second characteristic value into the softmax layer to obtain a classification result, then sending the classification result into a preset loss function to calculate a loss value, performing gradient back propagation and updating parameters; wherein, a loss function is preset and used for the classification task.
Optionally, the identification module 40 includes: the identification unit is used for extracting images from the video input behavior identification model with preset dimensionality under the condition that the input images comprise videos to obtain a feature map, wherein the preset dimensionality comprises the number of channels, time, width and height.
Optionally, the extracting module 42 includes: an extraction unit, used for extracting the key regions from the feature map by applying a plurality of attention modules on the feature map.
Further, optionally, the first obtaining module 44 includes: the first obtaining unit is used for obtaining the specific characteristics of at least one behavior in the key area according to the activation of the key area through an attention mechanism.
Optionally, the second obtaining module 46 includes: the matrix generation unit is used for generating a correlation matrix by counting the correlation among the behaviors according to the specific characteristics of at least one behavior; and the second acquisition unit is used for acquiring the correlation characteristics among the behaviors according to the correlation matrix.
Further, optionally, the classification module 48 includes: a data input unit for taking the specificity characteristic and the correlation characteristic as input data of the behavior recognition model; the computing unit is used for enabling the input data to pass through a full connection layer of the behavior recognition model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic; and the classification unit is used for inputting the first characteristic value and the second characteristic value into the softmax layer to obtain a classification result.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (10)

1. A method of identifying multi-tag behavior, comprising:
recognizing the input image according to a pre-trained behavior recognition model to obtain a feature map;
extracting a key area according to the feature map;
acquiring at least one behavior specific characteristic according to the key area;
acquiring correlation characteristics among the behaviors according to the specificity characteristics of the at least one behavior;
and classifying according to the specificity characteristics and the correlation characteristics to obtain classification results corresponding to each behavior.
2. The method of claim 1, wherein prior to the recognizing the input image according to the pre-trained behavior recognition model, the method further comprises:
acquiring a training image;
inputting the training image into an end-to-end network model as an input image, and acquiring the specific characteristic of at least one behavior in the training image;
acquiring correlation characteristics among the behaviors according to the specificity characteristics of the at least one behavior;
and training the end-to-end network model according to the specificity characteristics and the correlation characteristics until the end-to-end network model converges to obtain the behavior recognition model.
3. The method of claim 2, wherein the training the end-to-end network model according to the specificity feature and the correlation feature until the end-to-end network model converges to obtain the behavior recognition model comprises:
taking the specificity feature and the relevance feature as input data of the end-to-end network model;
enabling the input data to pass through a full connection layer of the end-to-end network model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic;
inputting the first characteristic value and the second characteristic value into a softmax layer to obtain a classification result;
and training the end-to-end network model according to the classification result and the input data until the end-to-end network model is converged to obtain the behavior recognition model.
4. The method of claim 3, wherein after the inputting the first feature value and the second feature value into a softmax layer to obtain a classification result, the method further comprises:
sending the classification result to a preset loss function to calculate a loss value, performing gradient back propagation, and updating parameters; and the preset loss function is used for classifying tasks.
5. The method of claim 1, wherein the recognizing the input image according to the pre-trained behavior recognition model to obtain the feature map comprises:
and under the condition that the input image comprises a video, inputting the video with preset dimensionality into the behavior recognition model for image extraction to obtain the feature map, wherein the preset dimensionality comprises channel number, time, width and height.
6. The method according to claim 1 or 5, wherein the extracting key regions according to the feature map comprises:
extracting the key regions from the feature map by applying a plurality of attention modules on the feature map.
7. The method of claim 6, wherein the obtaining at least one behavior specific feature according to the key region comprises:
and activating through an attention mechanism according to the key area to obtain the specific characteristics of the at least one behavior in the key area.
8. The method of claim 7, wherein the obtaining the correlation characteristic between the behaviors according to the characteristic of the at least one behavior comprises:
generating a correlation matrix by counting the correlation among the behaviors according to the specific characteristics of the at least one behavior;
and acquiring the correlation characteristics among the behaviors according to the correlation matrix.
9. The method of claim 8, wherein the classifying according to the specificity feature and the correlation feature to obtain a classification result corresponding to each behavior comprises:
taking the specificity feature and the correlation feature as input data of the behavior recognition model;
enabling the input data to pass through a full connection layer of the behavior recognition model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic;
and inputting the first characteristic value and the second characteristic value into a softmax layer to obtain the classification result.
10. An apparatus for identifying multi-tag behavior, comprising:
the recognition module is used for recognizing the input image according to a pre-trained behavior recognition model to obtain a characteristic diagram;
the extraction module is used for extracting a key area according to the feature map;
the first acquisition module is used for acquiring the specific characteristics of at least one behavior according to the key area;
the second acquisition module is used for acquiring the correlation characteristics among the behaviors according to the specific characteristics of the at least one behavior;
and the classification module is used for classifying according to the specificity characteristics and the correlation characteristics to obtain classification results corresponding to all behaviors.
CN202210425904.3A 2022-04-22 2022-04-22 Method and device for identifying multi-label behaviors Pending CN114550310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210425904.3A CN114550310A (en) 2022-04-22 2022-04-22 Method and device for identifying multi-label behaviors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210425904.3A CN114550310A (en) 2022-04-22 2022-04-22 Method and device for identifying multi-label behaviors

Publications (1)

Publication Number Publication Date
CN114550310A true CN114550310A (en) 2022-05-27

Family

ID=81667211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210425904.3A Pending CN114550310A (en) 2022-04-22 2022-04-22 Method and device for identifying multi-label behaviors

Country Status (1)

Country Link
CN (1) CN114550310A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN111476315A (en) * 2020-04-27 2020-07-31 中国科学院合肥物质科学研究院 Image multi-label identification method based on statistical correlation and graph convolution technology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANYI ZHANG ET AL.: "Multi-label activity recognition using activity-specific features and activity correlations", 《2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION(CVPR)》 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220527