CN114550310A - Method and device for identifying multi-label behaviors - Google Patents

Method and device for identifying multi-label behaviors

Info

Publication number
CN114550310A
CN114550310A
Authority
CN
China
Prior art keywords
behavior
correlation
feature
characteristic
behaviors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210425904.3A
Other languages
Chinese (zh)
Inventor
张翼翔
叶小培
张江峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Moredian Technology Co ltd
Original Assignee
Hangzhou Moredian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Moredian Technology Co ltd filed Critical Hangzhou Moredian Technology Co ltd
Priority to CN202210425904.3A priority Critical patent/CN114550310A/en
Publication of CN114550310A publication Critical patent/CN114550310A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a method and a device for identifying multi-label behaviors. The method comprises the following steps: recognizing an input image according to a pre-trained behavior recognition model to obtain a feature map; extracting key regions according to the feature map; acquiring a specific feature of at least one behavior according to the key regions; acquiring correlation features among the behaviors according to the specific feature of the at least one behavior; and classifying according to the specific features and the correlation features to obtain a classification result corresponding to each behavior. The scheme provided by the invention can accurately identify the multi-label behaviors appearing in a video.

Description

Method and device for identifying multi-label behaviors
Technical Field
The invention relates to the field of computer technology application, in particular to a method and a device for identifying multi-label behaviors.
Background
With the development of computer technology, behavior recognition technology is maturing. In the security field, crowd detection, fight detection and behavior early warning are widely applied; in the sports field, behavior recognition appears in the training evaluation and action scoring of athletes; behavior recognition is also widely applied in the fields of smart home, human-computer interaction and short video. Multi-label behavior recognition shows better robustness in complex scenes. Multi-label behavior recognition differs from single-label behavior recognition: when several actions occur in a detected video, single-label recognition can only output one behavior label and cannot fully reflect the content contained in the video.
Multi-label behavior recognition is a classification problem: as the name suggests, after a video is input, the behavior recognition model outputs the multiple behavior classes represented in the video. The process can be as follows: several video frames are taken as input, features are extracted through a neural network, and the features are sent into a classifier to obtain a classification result. Among these, the design of the neural network, i.e. the feature extraction, is the most important link in the process.
However, the current methods have the following problems:
Most current mainstream schemes send the video into a network, then uniformly extract features and classify. For a multi-label video, however, the video contains several different sub-behaviors, and uniformly extracting features submerges the specific features of certain behaviors, making high accuracy difficult to obtain.
Aiming at the problem in the prior art that the existing neural network cannot meet the requirements of multi-label behavior recognition in the feature extraction process, no effective solution has yet been proposed.
Disclosure of Invention
In order to solve the above technical problems, embodiments of the present invention are expected to provide a method and an apparatus for identifying a multi-tag behavior, so as to at least solve the problem that an existing neural network cannot meet the requirement of identifying the multi-tag behavior in the feature extraction process in the prior art.
The technical scheme of the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a method for identifying multi-label behaviors, including: recognizing an input image according to a pre-trained behavior recognition model to obtain a feature map; extracting key regions according to the feature map; acquiring a specific feature of at least one behavior according to the key regions; acquiring correlation features among the behaviors according to the specific feature of the at least one behavior; and classifying according to the specific features and the correlation features to obtain a classification result corresponding to each behavior.
Optionally, before recognizing the input image according to the pre-trained behavior recognition model, the method further includes: acquiring a training image; inputting a training image serving as an input image into an end-to-end network model, and acquiring the specific characteristic of at least one behavior in the training image; acquiring correlation characteristics among the behaviors according to the specificity characteristics of at least one behavior; and training the end-to-end network model according to the specific characteristics and the correlation characteristics until the end-to-end network model converges to obtain a behavior recognition model.
Further, optionally, training the end-to-end network model according to the specific feature and the correlation feature until the end-to-end network model converges, and obtaining the behavior recognition model includes: taking the specificity characteristic and the correlation characteristic as input data of the end-to-end network model; the input data passes through a full connection layer of an end-to-end network model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic; inputting the first characteristic value and the second characteristic value into a softmax layer to obtain a classification result; and training the end-to-end network model according to the classification result and the input data until the end-to-end network model is converged to obtain a behavior recognition model.
Optionally, after the first feature value and the second feature value are input into the softmax layer to obtain the classification result, the method further includes: sending the classification result to a preset loss function to calculate a loss value, performing gradient back propagation, and updating the parameters; the preset loss function is used for the classification task.
Optionally, recognizing the input image according to a pre-trained behavior recognition model to obtain a feature map includes: in the case that the input image comprises a video, inputting the video with preset dimensions into the behavior recognition model for image extraction to obtain the feature map, where the preset dimensions include the number of channels, time, width and height.
Optionally, extracting the key region according to the feature map includes: by applying a plurality of attention modules on the feature map, key regions are extracted from the feature map.
Further, optionally, the obtaining of the at least one behavior-specific feature according to the key region includes: and activating through an attention mechanism according to the key area to obtain the specific characteristics of at least one behavior in the key area.
Optionally, the obtaining the correlation characteristic between the behaviors according to the specificity characteristic of the at least one behavior includes: generating a correlation matrix by counting the correlation among the behaviors according to the specific characteristics of at least one behavior; and acquiring the correlation characteristics among the behaviors according to the correlation matrix.
Further, optionally, the classifying according to the specificity feature and the correlation feature to obtain a classification result corresponding to each behavior includes: taking the specific features and the correlation features as input data of a behavior recognition model; the input data are processed through a full connection layer of a behavior recognition model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic; and inputting the first characteristic value and the second characteristic value into the softmax layer to obtain a classification result.
In a second aspect, an embodiment of the present invention provides an apparatus for identifying multi-tag behaviors, including: the recognition module is used for recognizing the input image according to a pre-trained behavior recognition model to obtain a characteristic diagram; the extraction module is used for extracting a key area according to the characteristic diagram; the first acquisition module is used for acquiring the specific characteristics of at least one behavior according to the key area; the second acquisition module is used for acquiring correlation characteristics among the behaviors according to the specific characteristics of at least one behavior; and the classification module is used for classifying according to the specificity characteristics and the correlation characteristics to obtain classification results corresponding to each behavior.
According to the embodiment of the invention, an input image is recognized according to a pre-trained behavior recognition model to obtain a feature map; key regions are extracted according to the feature map; the specific feature of at least one behavior is acquired according to the key regions; correlation features among the behaviors are acquired according to the specific feature of the at least one behavior; and classification is performed according to the specific features and the correlation features to obtain a classification result corresponding to each behavior, so that the multi-label behaviors appearing in a video can be accurately identified.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a schematic flowchart of a method for identifying multi-tag behaviors according to an embodiment of the present invention;
fig. 2a is a schematic diagram of a network structure of C3D in a method for identifying multi-tag behaviors according to an embodiment of the present invention;
fig. 2b is a schematic diagram of a network convolution of C3D in a method for identifying multi-tag behaviors according to an embodiment of the present invention;
fig. 3 is a schematic diagram illustrating behavior recognition performed by a neural network in a method for recognizing multi-tag behaviors according to an embodiment of the present invention;
fig. 4 is a schematic diagram of an apparatus for identifying multi-tag behaviors according to a second embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and the like in the description and claims of the present invention and the accompanying drawings are used for distinguishing different objects, and are not used for limiting a specific order.
It should be noted that the following embodiments of the present invention may be implemented individually, or may be implemented in combination with each other, and the embodiments of the present invention are not limited in this respect.
Example one
In a first aspect, an embodiment of the present invention provides a method for identifying multi-label behaviors, applied in scenarios of behavior feature identification.
The method for identifying multi-label behaviors provided by the embodiment of the application can be applied to identifying scenes in which multi-label behaviors occur in videos, and has wide application in fields such as short video recommendation.
In the specific implementation, the method identifies a plurality of behaviors existing in a video by creating a model, and therefore comprises: a model generation phase and a model application phase, wherein,
stage one, the model generation stage includes: a model training phase and a model testing phase, wherein,
a model training stage:
s1, acquiring a training image;
In this embodiment, the training image may be a video; specifically, a video clip is taken as an example and denoted as V. A video frame in the video V may be an RGB image with three channels and a scale of 224 × 224, with 32 frames in total, so the input dimension is denoted as 3 × 32 × 224 × 224. Each video has a plurality of real labels; for example, if a video shows a person walking from a distance and then sitting down on a chair, the labels are denoted as walking and sitting, that is, the labels are used to represent the behavior types of the target objects in the video.
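As an illustration of these shapes, the following is a minimal sketch (assuming PyTorch; the class count A and the class indices for walking and sitting are hypothetical):

```python
import torch

# A hypothetical training clip V: 3 channels (RGB), 32 frames, 224 x 224 pixels,
# matching the 3 x 32 x 224 x 224 input dimension described above.
V = torch.rand(3, 32, 224, 224)

# Multi-hot ground truth over A sub-action classes (A = 10 is an arbitrary example);
# the clip shows both "walking" and "sitting", so both positions are set to 1.
A = 10
WALKING, SITTING = 2, 7          # hypothetical class indices
labels = torch.zeros(A)
labels[WALKING] = 1.0
labels[SITTING] = 1.0

print(V.shape, labels)           # torch.Size([3, 32, 224, 224]) and the multi-hot vector
```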
S2, inputting the training image as an input image into the end-to-end network model, and acquiring the specific characteristics of at least one behavior in the training image;
the neural network used in the method for identifying the multi-label behaviors provided by the embodiment of the application can be an end-to-end network model, the training image is used as an input image and is input into the end-to-end network model, and the specific characteristics of at least one behavior in the training image are obtained based on the end-to-end network model. In order to solve the problem that in the related art, for a multi-label video, since a video includes a plurality of different sub-behaviors, and the uniform extraction of features can overwhelm the specific features of some behaviors, and thus it is difficult to achieve high accuracy, the embodiment of the present application proposes to extract the specific features for each sub-behavior (i.e., at least one behavior in the embodiment of the present application) separately.
S3, acquiring correlation characteristics among the behaviors according to the specificity characteristics of at least one behavior;
based on the specific characteristics of at least one behavior obtained in step S2, the correlation characteristics between behaviors are obtained, so that the effect of behavior identification is further improved by using the correlation information between each child behavior and adding prior information.
And S4, training the end-to-end network model according to the specificity characteristics and the correlation characteristics until the end-to-end network model converges to obtain a behavior recognition model.
Training the end-to-end network model according to the specific features and the correlation features until the end-to-end network model converges to obtain the behavior recognition model comprises the following steps: taking the specific features and the correlation features as input data of the end-to-end network model; passing the input data through a fully connected layer of the end-to-end network model to obtain a first feature value corresponding to the specific feature and a second feature value corresponding to the correlation feature; inputting the first feature value and the second feature value into a softmax layer to obtain a classification result; and training the end-to-end network model according to the classification result and the input data until the end-to-end network model converges to obtain the behavior recognition model.
In the embodiment of the present application, the specific feature is denoted as $F_s$ and the correlation feature is denoted as $F_c$. $F_s$ and $F_c$ are input into the fully connected layer of the end-to-end network model, and $y_s$ and $y_c$ are obtained by calculation, where $y_s$ is the first feature value and $y_c$ is the second feature value in the embodiment of the present application.
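As a concrete sketch of this step (a minimal illustration assuming PyTorch; the feature dimension D, the class count A, and the shared fully connected layer are assumptions, not the patent's exact configuration):

```python
import torch
import torch.nn as nn

D, A = 512, 10                     # hypothetical feature dimension and number of classes

fc = nn.Linear(D, A)               # fully connected layer of the end-to-end network model
F_s = torch.rand(1, D)             # specific feature (stand-in values)
F_c = torch.rand(1, D)             # correlation feature (stand-in values)

y_s = fc(F_s)                      # first feature value
y_c = fc(F_c)                      # second feature value

p_s = torch.softmax(y_s, dim=-1)   # softmax output for the specificity branch
p_c = torch.softmax(y_c, dim=-1)   # softmax output for the correlation branch
print(p_s.shape, p_c.shape)        # two 1 x A classification results
```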
After the first feature value and the second feature value are input into the softmax layer to obtain the classification result, the method for identifying multi-label behaviors provided by the embodiment of the application further includes: sending the classification result to a preset loss function to calculate a loss value, performing gradient back propagation, and updating the parameters; the preset loss function is used for the classification task.
The preset loss function formula in the embodiment of the application is as follows:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\right]$$

The loss function is the cross-entropy loss, used for the classification task, where $N$ represents the total number of samples, $y_i$ represents the true label of sample $i$ (1 for the positive class, 0 for the negative class), $p_i$ represents the probability that sample $i$ is predicted as the positive class, and $1 - p_i$ is accordingly the probability that sample $i$ is predicted as the negative class. Since $p_i$ is desired to be as large as possible for positive samples, i.e. $-\log p_i$ as small as possible, the preset loss function is constructed accordingly.
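A small numeric check of this loss (a sketch assuming the reconstructed formula above; the sample labels and probabilities are made up for illustration):

```python
import numpy as np

def preset_loss(y, p):
    # Cross-entropy over N samples: y[i] is the true label (1 positive, 0 negative),
    # p[i] is the predicted probability of the positive class.
    y, p = np.asarray(y, float), np.asarray(p, float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Confident, correct predictions give a small loss; poor ones a large loss.
print(preset_loss([1, 0, 1], [0.9, 0.1, 0.8]))  # ~0.144
print(preset_loss([1, 0, 1], [0.2, 0.8, 0.3]))  # ~1.474
```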
Finally, after n rounds of training, the end-to-end network model converges and the behavior recognition model is obtained.
And (3) a model testing stage:
step 0, inputting a test video segment V with a dimension of 3 × 32 × 224.
And step 1, sending the V into a backbone network, and finishing end-to-end calculation by the network.
Step 2, finally obtaining characteristics
Figure DEST_PATH_IMAGE015
And
Figure 381253DEST_PATH_IMAGE016
step 3, characterizing
Figure DEST_PATH_IMAGE017
And
Figure 456656DEST_PATH_IMAGE018
feeding into softAnd max layers, and obtaining a classification result.
Stage two, model application stage
Fig. 1 is a schematic flowchart of a method for identifying multi-tag behaviors according to an embodiment of the present invention; as shown in fig. 1, a method for identifying multi-tag behaviors provided in an embodiment of the present application includes:
step S102, recognizing an input image according to a pre-trained behavior recognition model to obtain a feature map;
Optionally, recognizing the input image according to the pre-trained behavior recognition model to obtain the feature map includes: in the case that the input image comprises a video, inputting the video with preset dimensions into the behavior recognition model for image extraction to obtain the feature map, where the preset dimensions include the number of channels, time, width and height.
In this embodiment, the behavior recognition model may be a C3D network structure, as shown in fig. 2a, fig. 2a is a schematic diagram of a C3D network structure in the method for recognizing a multi-tag behavior according to an embodiment of the present invention. The main reason why the C3D network structure is used in the embodiment of the present application is that behavior recognition is a task that requires both spatial domain information and time domain information, and a common 2D convolutional neural network can only capture spatial domain information and cannot reasonably and effectively utilize time information, so that the C3D network structure is used as a feature extraction network. As shown in fig. 2b, fig. 2b is a schematic diagram of a C3D network convolution in a method for identifying multi-label behaviors according to an embodiment of the present invention, where the 3D convolution stacks consecutive frames and performs a uniform convolution operation, and the obtained feature map includes both frame sequence information (i.e., time domain information) and spatial domain information.
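The following is a minimal sketch of such a 3D convolution over stacked frames (assuming PyTorch; the channel counts, kernel size, and pooling are illustrative choices, not the exact C3D configuration from fig. 2a):

```python
import torch
import torch.nn as nn

# One C3D-style block: a 3 x 3 x 3 convolution slides jointly over time (T) and
# space (H, W), so the output feature map carries both frame-sequence (temporal)
# information and spatial information.
conv3d = nn.Sequential(
    nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space first, keep early temporal detail
)

clip = torch.rand(1, 3, 32, 224, 224)      # (batch, C, T, H, W) video input
features = conv3d(clip)
print(features.shape)                      # torch.Size([1, 64, 32, 112, 112])
```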
As shown in fig. 3, fig. 3 is a schematic diagram of behavior recognition performed by a neural network in a method for recognizing multi-label behaviors according to an embodiment of the present invention. The process of behavior recognition may be divided into three stages. In the first stage, the feature map in the embodiment of the present application is obtained by passing a video sequence with dimensions 3 × 32 × 224 × 224 through the C3D structure, and is denoted as $X$. In the embodiments of the present application, the dimension is indicated as $C \times (T \times W \times H)$, where $C$ represents the number of channels, $T$ represents time, $W$ represents width, and $H$ represents height.
Step S104, extracting key areas according to the feature map;
Step S104 of extracting the key regions according to the feature map includes: extracting the key regions from the feature map by applying a plurality of attention modules on the feature map.
Specifically, based on the feature map $X$ obtained in step S102, a plurality of attention modules are applied on $X$ to extract the key regions. The purpose is as follows: people are the main subjects of behavior recognition, yet the spatial positions they occupy are small, so the feature map contains a great deal of redundant information that interferes with the training effect; extracting key regions for the subsequent feature calculation therefore saves computation and improves the effect.
It should be noted that the embodiment of the application provides a key-region self-learning mode: in the model training process, better key regions are continuously learned, which supplements the task of multi-label behavior recognition and further improves the effect.
Step S106, acquiring at least one behavior specific characteristic according to the key area;
the step S106 of obtaining at least one behavior specificity feature according to the key region includes: and activating through an attention mechanism according to the key area to obtain the specific characteristics of at least one behavior in the key area.
Specifically, as shown in fig. 3, in the second stage, the specific feature of each sub-behavior (i.e., the specific feature of at least one behavior) is obtained by attention-mechanism activation using the obtained key region features.
Step S108, acquiring correlation characteristics among the behaviors according to the specificity characteristics of at least one behavior;
in step S108, obtaining the correlation characteristic between the behaviors according to the specific characteristic of the at least one behavior includes: generating a correlation matrix by counting the correlation among the behaviors according to the specific characteristics of at least one behavior; and acquiring the correlation characteristics among the behaviors according to the correlation matrix.
Specifically, as shown in fig. 3, the correlation features between behaviors are acquired according to the specific feature of at least one behavior obtained in step S106. The reason is that although a coherent long action contains many sub-actions, the sub-actions are often related to one another; for example, the two actions of "stretching out a hand" and "holding a cup" are more strongly correlated than the two actions of "stretching out a hand" and "lying down".
And step S110, classifying according to the specificity characteristics and the correlation characteristics to obtain classification results corresponding to each behavior.
In step S110, the classifying according to the specificity feature and the correlation feature to obtain a classification result corresponding to each behavior includes: taking the specific features and the correlation features as input data of a behavior recognition model; the input data are processed through a full connection layer of a behavior recognition model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic; and inputting the first characteristic value and the second characteristic value into the softmax layer to obtain a classification result.
Specifically, as shown in fig. 3, in the third stage, in order to avoid confusion between the sub-action specific feature and the sub-action correlation feature during back propagation, $y_s$ and $y_c$ are fed separately into the softmax layer for classification (that is, the first feature value and the second feature value in the embodiment of the present application are input into the softmax layer to obtain the classification result).
In summary, with reference to steps S102 to S110, the method for identifying a multi-tag behavior provided in the embodiment of the present application specifically includes:
after the video sequence passes through the C3D structure, the characteristic diagram is obtained and is marked as
Figure 423585DEST_PATH_IMAGE024
K key regions are marked as
Figure DEST_PATH_IMAGE025
Is applied to
Figure 228861DEST_PATH_IMAGE026
Is noted as an attention module
Figure DEST_PATH_IMAGE027
A sub-action-specific features are expressed as
Figure 65230DEST_PATH_IMAGE028
Applied to sub-action features
Figure DEST_PATH_IMAGE029
Is noted as an attention module
Figure 369172DEST_PATH_IMAGE030
And a sub-action dependency feature is noted
Figure DEST_PATH_IMAGE031
Figure 188836DEST_PATH_IMAGE032
And
Figure DEST_PATH_IMAGE033
separately feeding into the full-connection layer, and recording the obtained characteristics
Figure 629045DEST_PATH_IMAGE034
And
Figure DEST_PATH_IMAGE035
. The classification result is recorded as
Figure 105157DEST_PATH_IMAGE036
And
Figure DEST_PATH_IMAGE037
Figure 647127DEST_PATH_IMAGE038
dimension C (T W H), is a two-dimensional matrix (intuitive understanding: C columns, each with T W H, each representing all the spatio-temporal information of a channel).
The $k$ key regions are obtained by the attention modules $M_1, \dots, M_k$ acting on $X$, giving $k$ modules in total, as in formula (2):

$$r_i = M_i(X), \quad i = 1, \dots, k \tag{2}$$
In formula (2), $W_i$ is a weight matrix whose main purpose is to adjust the number of channels, and the attention module is computed as follows:

$$M_i(X) = \mathrm{softmax}(W_i\, x) \tag{3}$$
Here $x$ is any one row of the feature matrix $X$. The meaning of $\mathrm{softmax}(W_i\, x)$ is that all the spatio-temporal information in each feature channel is activated to obtain an activated feature vector of dimension $(T \times W \times H) \times 1$; the softmax function outputs a probability value of dimension $(T \times W \times H) \times 1$ that represents the degree of activation of each dimension of the feature, and the larger the value, the higher the response at that position, i.e. the more representative that key region is of the behavior. By selecting the first $k$ values in the probability ranking, the self-learning of the key regions is completed; the dimensions of $r_i$ and $M_i$ follow accordingly.
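A sketch of this key-region self-learning under the notation above (assuming PyTorch; the concrete shapes and the learnable weight w are assumptions):

```python
import torch
import torch.nn.functional as F

C, T, W, H, k = 64, 32, 14, 14, 8
X = torch.rand(C, T * W * H)               # feature map as a C x (T*W*H) matrix

w = torch.rand(1, C, requires_grad=True)   # learnable weight adjusting the channel number
a = F.softmax(w @ X, dim=-1)               # (T*W*H)-dimensional probabilities: the
                                           # activation degree of every position

topk = torch.topk(a.squeeze(0), k).indices # the k positions with the highest response
r = X[:, topk]                             # key-region features, one column per region
print(a.shape, r.shape)                    # torch.Size([1, 6272]) torch.Size([64, 8])
```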
The sub-action specific features are obtained by the attention module $M'$ acting on the key regions $r_1, \dots, r_k$; $M'$ works on the same principle as $M_i$ above, where $A$ is the number of all sub-actions in the training data. For each key region $r_i$, after activation, the sub-action category most probably corresponding to that key region is obtained. The sub-action specific features are expressed as $F_s$, as in formula (4):

$$F_s = M'(r_i), \quad i = 1, \dots, k \tag{4}$$
This completes the learning of the sub-action specific features.
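One possible reading of this step as code (a sketch under the same assumed notation; the projection to A sub-action scores and the weighted pooling into $F_s$ are assumptions):

```python
import torch
import torch.nn.functional as F

C, k, A = 64, 8, 10
r = torch.rand(C, k)                          # k key-region features from the previous step

W_act = torch.rand(A, C, requires_grad=True)  # attention weights of M' over sub-actions
scores = F.softmax(W_act @ r, dim=0)          # A x k: sub-action activation per key region
best = scores.argmax(dim=0)                   # most probable sub-action per key region

F_s = scores @ r.t()                          # A x C: one specificity feature per
                                              # sub-action, an attention-weighted
                                              # pooling of the key regions
print(best, F_s.shape)
```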
The sub-action correlation feature $F_c$ is obtained from a correlation matrix of size $A \times A$, which records the frequency of each pair of sub-action combinations; specifically:

$$G_{ij} = \frac{P_{ij}}{P_i} \tag{5}$$

where $P_{ij}$ represents the probability of the $i$-th and $j$-th sub-actions co-occurring in all training samples, and $P_i$ represents the probability of the $i$-th sub-action occurring in all training samples.
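A sketch of building such a correlation matrix by counting label co-occurrence over the training set (the toy label sets and the matrix name G are made up for illustration):

```python
import numpy as np

A = 4
# Toy multi-label annotations: each row is the label set of one training sample.
samples = [{0, 1}, {0, 1, 2}, {1, 3}, {0, 2}]

N = np.zeros((A, A))                 # N[i, j]: co-occurrence count of sub-actions i and j
for labels in samples:
    for i in labels:
        for j in labels:
            N[i, j] += 1

# G[i, j] = P(i and j co-occur) / P(i occurs); the diagonal N[i, i] is the
# occurrence count of sub-action i itself.
G = N / N.diagonal()[:, None]
print(G)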
Based on the above, two features are obtained: $F_s$ and $F_c$. The two features are fed into the fully connected layer to obtain the feature vectors $y_s$ and $y_c$, which are then fed into the softmax layer for classification; the obtained results are denoted as $p_s$ and $p_c$, and their dimensionalities are the same, both being $A$. The result values are added to obtain the probability values corresponding to all sub-categories; a threshold is set, and the sub-actions larger than the threshold are the predicted labels of the behavior.
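A sketch of this final fusion and thresholding step (same assumed notation; the 0.5 threshold is an arbitrary example):

```python
import torch
import torch.nn.functional as F

A = 10
y_s, y_c = torch.rand(A), torch.rand(A)   # feature vectors from the two FC branches

p_s = F.softmax(y_s, dim=0)               # classification result of the specificity branch
p_c = F.softmax(y_c, dim=0)               # classification result of the correlation branch

p = p_s + p_c                             # summed probability for every sub-category
threshold = 0.5                           # hypothetical threshold
predicted = (p > threshold).nonzero().squeeze(-1)
print(predicted)                          # sub-actions above the threshold: the
                                          # predicted labels of the behavior
```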
So far, the whole process is ended.
According to the embodiment of the invention, the input image is recognized according to the pre-trained behavior recognition model to obtain the feature map; key regions are extracted according to the feature map; the specific feature of at least one behavior is acquired according to the key regions; correlation features among the behaviors are acquired according to the specific feature of the at least one behavior; and classification is performed according to the specific features and the correlation features to obtain the classification result corresponding to each behavior, so that the multi-label behaviors appearing in the video can be accurately identified.
Example two
In a second aspect, an embodiment of the present invention provides an apparatus for identifying a multi-tag behavior, and fig. 4 is a schematic diagram of an apparatus for identifying a multi-tag behavior according to a second embodiment of the present invention; as shown in fig. 4, an apparatus for identifying multi-tag behaviors provided in an embodiment of the present application includes: the recognition module 40 is used for recognizing the input image according to a pre-trained behavior recognition model to obtain a feature map; an extraction module 42, configured to extract a key region according to the feature map; a first obtaining module 44, configured to obtain a specific feature of at least one behavior according to the key area; a second obtaining module 46, configured to obtain a correlation feature between the behaviors according to the specific feature of the at least one behavior; and the classification module 48 is configured to perform classification according to the specificity characteristics and the correlation characteristics to obtain classification results corresponding to each behavior.
Optionally, the apparatus for identifying a multi-tag behavior provided in the embodiment of the present application further includes: the image acquisition module is used for acquiring a training image before the input image is identified according to a pre-trained behavior identification model; the first characteristic acquisition module is used for inputting the training image into the end-to-end network model as an input image and acquiring the specific characteristic of at least one behavior in the training image; the second characteristic acquisition module is used for acquiring correlation characteristics among the behaviors according to the specific characteristics of at least one behavior; and the training module is used for training the end-to-end network model according to the specific characteristics and the correlation characteristics until the end-to-end network model converges to obtain a behavior recognition model.
Further, optionally, the training module comprises: a first input unit, configured to use the specificity feature and the correlation feature as input data of an end-to-end network model; the characteristic value acquisition unit is used for enabling the input data to pass through a full connection layer of the end-to-end network model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic; the second input unit is used for inputting the first characteristic value and the second characteristic value into the softmax layer to obtain a classification result; and the training unit is used for training the end-to-end network model according to the classification result and the input data until the end-to-end network model is converged to obtain a behavior recognition model.
Optionally, the apparatus for identifying a multi-tag behavior provided in the embodiment of the present application further includes: the calculation unit is used for inputting the first characteristic value and the second characteristic value into the softmax layer to obtain a classification result, then sending the classification result into a preset loss function to calculate a loss value, performing gradient back propagation and updating parameters; wherein, a loss function is preset and used for the classification task.
Optionally, the identification module 40 includes: the identification unit is used for extracting images from the video input behavior identification model with preset dimensionality under the condition that the input images comprise videos to obtain a feature map, wherein the preset dimensionality comprises the number of channels, time, width and height.
Optionally, the extracting module 42 includes: an extraction unit, used for extracting the key regions from the feature map by applying a plurality of attention modules on the feature map.
Further, optionally, the first obtaining module 44 includes: the first obtaining unit is used for obtaining the specific characteristics of at least one behavior in the key area according to the activation of the key area through an attention mechanism.
Optionally, the second obtaining module 46 includes: the matrix generation unit is used for generating a correlation matrix by counting the correlation among the behaviors according to the specific characteristics of at least one behavior; and the second acquisition unit is used for acquiring the correlation characteristics among the behaviors according to the correlation matrix.
Further, optionally, the classification module 48 includes: a data input unit for taking the specificity characteristic and the correlation characteristic as input data of the behavior recognition model; the computing unit is used for enabling the input data to pass through a full connection layer of the behavior recognition model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic; and the classification unit is used for inputting the first characteristic value and the second characteristic value into the softmax layer to obtain a classification result.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (10)

1. A method of identifying multi-tag behavior, comprising:
recognizing the input image according to a pre-trained behavior recognition model to obtain a feature map;
extracting a key area according to the feature map;
acquiring at least one behavior specific characteristic according to the key area;
acquiring correlation characteristics among the behaviors according to the specificity characteristics of the at least one behavior;
and classifying according to the specificity characteristics and the correlation characteristics to obtain classification results corresponding to each behavior.
2. The method of claim 1, wherein prior to the recognizing the input image according to the pre-trained behavior recognition model, the method further comprises:
acquiring a training image;
inputting the training image into an end-to-end network model as an input image, and acquiring the specific characteristic of at least one behavior in the training image;
acquiring correlation characteristics among the behaviors according to the specificity characteristics of the at least one behavior;
and training the end-to-end network model according to the specificity characteristics and the correlation characteristics until the end-to-end network model converges to obtain the behavior recognition model.
3. The method of claim 2, wherein the training the end-to-end network model according to the specificity feature and the correlation feature until the end-to-end network model converges to obtain the behavior recognition model comprises:
taking the specificity feature and the relevance feature as input data of the end-to-end network model;
enabling the input data to pass through a full connection layer of the end-to-end network model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic;
inputting the first characteristic value and the second characteristic value into a softmax layer to obtain a classification result;
and training the end-to-end network model according to the classification result and the input data until the end-to-end network model is converged to obtain the behavior recognition model.
4. The method of claim 3, wherein after the inputting the first feature value and the second feature value into a softmax layer to obtain a classification result, the method further comprises:
sending the classification result to a preset loss function to calculate a loss value, performing gradient back propagation, and updating parameters; and the preset loss function is used for classifying tasks.
5. The method of claim 1, wherein the recognizing the input image according to the pre-trained behavior recognition model to obtain the feature map comprises:
and under the condition that the input image comprises a video, inputting the video with preset dimensionality into the behavior recognition model for image extraction to obtain the feature map, wherein the preset dimensionality comprises channel number, time, width and height.
6. The method according to claim 1 or 5, wherein the extracting key regions according to the feature map comprises:
extracting the key regions from the feature map by applying a plurality of attention modules on the feature map.
7. The method of claim 6, wherein the obtaining at least one behavior specific feature according to the key region comprises:
and activating through an attention mechanism according to the key area to obtain the specific characteristics of the at least one behavior in the key area.
8. The method of claim 7, wherein the obtaining the correlation characteristic between the behaviors according to the characteristic of the at least one behavior comprises:
generating a correlation matrix by counting the correlation among the behaviors according to the specific characteristics of the at least one behavior;
and acquiring the correlation characteristics among the behaviors according to the correlation matrix.
9. The method of claim 8, wherein the classifying according to the specificity feature and the correlation feature to obtain a classification result corresponding to each behavior comprises:
taking the specificity feature and the correlation feature as input data of the behavior recognition model;
enabling the input data to pass through a full connection layer of the behavior recognition model to obtain a first characteristic value corresponding to the specific characteristic and a second characteristic value corresponding to the correlation characteristic;
and inputting the first characteristic value and the second characteristic value into a softmax layer to obtain the classification result.
10. An apparatus for identifying multi-tag behavior, comprising:
the recognition module is used for recognizing the input image according to a pre-trained behavior recognition model to obtain a characteristic diagram;
the extraction module is used for extracting a key area according to the feature map;
the first acquisition module is used for acquiring the specific characteristics of at least one behavior according to the key area;
the second acquisition module is used for acquiring the correlation characteristics among the behaviors according to the specific characteristics of the at least one behavior;
and the classification module is used for classifying according to the specificity characteristics and the correlation characteristics to obtain classification results corresponding to all behaviors.
CN202210425904.3A 2022-04-22 2022-04-22 Method and device for identifying multi-label behaviors Pending CN114550310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210425904.3A CN114550310A (en) 2022-04-22 2022-04-22 Method and device for identifying multi-label behaviors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210425904.3A CN114550310A (en) 2022-04-22 2022-04-22 Method and device for identifying multi-label behaviors

Publications (1)

Publication Number Publication Date
CN114550310A true CN114550310A (en) 2022-05-27

Family

ID=81667211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210425904.3A Pending CN114550310A (en) 2022-04-22 2022-04-22 Method and device for identifying multi-label behaviors

Country Status (1)

Country Link
CN (1) CN114550310A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN111476315A (en) * 2020-04-27 2020-07-31 中国科学院合肥物质科学研究院 Image multi-label identification method based on statistical correlation and graph convolution technology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANYI ZHANG ET AL.: "Multi-label activity recognition using activity-specific features and activity correlations", 《2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION(CVPR)》 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220527