CN114140879A - Behavior identification method and device based on multi-head cascade attention network and time convolution network - Google Patents

Behavior identification method and device based on multi-head cascade attention network and time convolution network Download PDF

Info

Publication number
CN114140879A
CN114140879A (application CN202111446154.XA)
Authority
CN
China
Prior art keywords
attention
network
video
local
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111446154.XA
Other languages
Chinese (zh)
Inventor
Guo Yuanjun (郭媛君)
Yang Zhile (杨之乐)
Chen Xuejian (陈雪健)
Feng Wei (冯伟)
Wang Yao (王尧)
Wu Chengke (吴承科)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202111446154.XA priority Critical patent/CN114140879A/en
Publication of CN114140879A publication Critical patent/CN114140879A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a behavior identification method and device based on a multi-head cascade attention network and a time convolution network, comprising the following steps: collecting a video and extracting video feature information from it; capturing local attention weights in a self-attention manner; capturing further feature information in the video with a multi-head attention mechanism; weighting the feature values in feature space by linear transformation and normalization to increase the diversity of the self-attention features; integrating the local features into multiple global representations using the local attention weights, and learning attention weights with the self-attention features as input; extracting time-series features with a multi-stage time convolution network to refine the prediction; and analyzing the prediction through an expert system to obtain the final behavior category. The method overcomes the limitations of existing recognition approaches: its monitoring results are accurate and timely, and it is unlikely to be affected by external factors such as dust and volatile gases.

Description

Behavior identification method and device based on multi-head cascade attention network and time convolution network
Technical Field
The invention belongs to the field of behavior identification, and particularly relates to a behavior identification method, a behavior identification system, an electronic device and application based on a multi-head cascade attention network and a time convolution network.
Background
With the development of monitoring technology, image information acquired by an optical camera can be input into a computer, where computer vision techniques perform real-time information processing and pattern recognition on the image sequences in the video, according to a pre-designed algorithm, to detect smoking behavior. Compared with manual supervision and traditional smoke-sensor alarms, a computer-vision-based smoking detection system offers a wide monitoring range, high utilization of monitoring resources, automatic localization of smokers, and automatic alarms.
Traditional smoking detection generally relies on manual supervision, smoke sensors, or wearable devices. These approaches have several limitations: first, smoke in outdoor scenes is heavily diluted and cannot be sensed by a smoke sensor; second, wearable devices are costly to deploy; third, manual inspection requires substantial manpower. Moreover, traditional physical detection methods cannot locate smokers in real time.
Smoking detection and intervention have used different available technologies in the past few years, including sensors, computer vision, wearable sensory computing technologies, and the like. Due to the characteristics of low cigarette concentration and easiness in dispersion, the smoke detection equipment based on the sensor is limited by the size and the sealing degree of a use space, is easily interfered by external factors such as dust, volatile gas and the like, and cannot be applied to smoking behavior detection in most public places. Meanwhile, the traditional smoke sensing equipment cannot position smokers in real time and cannot effectively guarantee the effective operation of smoke prohibition work.
Therefore, a monitoring means which is low in cost, efficient and capable of determining the target action in real time is urgently needed to be developed.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a behavior identification method and device based on a multi-head cascade attention network and a time convolution network, which are used for solving at least one of the technical problems.
In order to achieve the purpose, the invention adopts the specific scheme that:
a behavior identification method based on a multi-head cascade attention network and a time convolution network comprises the following steps:
collecting a video and extracting video characteristic information in the video;
learning at least 1 attention feature through the video feature information, and capturing a local attention weight in a self-attention mode;
capturing other characteristic information in the video by adopting a multi-head attention mechanism;
weighting the characteristic values in the characteristic space by adopting a linear transformation and normalization method, and increasing the diversity of the self-attention characteristics;
adopting a multi-head cascade attention network, integrating local features into a plurality of global representations by using the local attention weight, and learning attention weight by taking the self-attention feature as input;
acquiring a first action label corresponding to the video characteristic information according to the attention weight, and extracting time sequence characteristics of the first action label according to a multi-stage time convolution network to improve a prediction result;
and analyzing the prediction result through an expert system to obtain a final behavior category.
The acquiring a video and extracting video feature information in the video includes:
The video is represented by its K segments I = [I_1, I_2, …, I_K], and a feature extraction network with parameter θ_1 extracts the feature information of the video:

X = [x_1, x_2, …, x_K] = [r(I_1; θ_1), …, r(I_K; θ_1)]

where I_i ∈ R^(H×W×3×L), H and W are the height and width of the input video segment, respectively, and L is the length of the video segment.
The "learning at least 1 attention feature through the video feature information and capturing a local attention weight in a self-attention manner" includes:
The video feature information is input into the next two fully connected layers: the first fully connected layer learns the self-attention weights, and the second, combined with data normalization, yields the multiple learned attention features.
The self-attention weight α_ij for the input is defined as a softmax over the video segments:

α_ij = exp(w_i^T x_j) / Σ_{t=1}^{k} exp(w_i^T x_t)
Each output of the first FC layer is the weighted sum of the original features with the attention weights of the i-th head attention module, defined as follows:

y_i = Σ_{j=1}^{k} α_ij x_j
where k is the number of video segments; x_j is the feature information of the j-th video frame; w is a parameter of the fully connected layer of the global attention module.
The weighting of the feature values in the feature space by adopting the linear transformation and normalization method comprises the following steps:
The linear transformation is performed by the following procedure (the defining equation is given only as an image in the source): each self-attention output y_i is linearly transformed and normalized to y_i′, for i = 1, 2, …, N, where y_i′ is obtained by a linear transformation of the fully connected layer output y_i, and N is the number of self-attention modules.
The "adopting a multi-head cascade attention network, integrating local features into a plurality of global representations by using the local attention weight, and learning the attention weight by using the self-attention feature as an input" includes:
learning attention weights by connecting the video representation and the cascaded layers of self-attention features, each attention weight being defined as follows, taking the self-attention features as input:
βi=sigmoid(wT[yi′;G])
where w is a parameter of the fully connected layer of the global attention module; [y_i′; G] denotes that y_i′ and G are connected by a concatenation operator; i = 1, 2, 3, ….
The "acquiring a first action tag corresponding to the video feature information according to the attention weight and extracting a time series feature of the first action tag according to a multi-stage time convolution network" includes:
introducing a multi-stage time convolution network to finish the task of dividing the time action, and introducing expansion convolution in the time convolution network;
each stage in the time convolutional network takes an initial prediction from a previous stage and refines it.
A behavior recognition system based on a multi-head cascade attention network and a time convolution network comprises:
the multi-head cascade network module is used for acquiring local attention weights in the video and integrating local features into a plurality of global representations according to the local attention weights;
and the action logic combination module performs data interaction with the multi-head cascade network module and is used for acquiring behavior classification of the video information.
The multi-head cascade network module comprises:
the local attention module is connected with the outside and used for learning a plurality of attention weights of each network segment from network segment characteristics generated by the backbone of the multi-head cascade network module, capturing the importance of the local characteristics in a self-attention mode and obtaining local attention weights;
the global attention module is in data interaction with the local attention module and is used for integrating local features into a plurality of global representations by using the local attention weight values and then learning secondary attention of global information in a relational manner;
and the global attention module performs data interaction with the action logic combination module and is used for performing behavior recognition and classification.
An electronic device based on behavior recognition of a multi-headed cascade attention network and a time convolution network, comprising:
a storage medium for storing a computer program;
a processing unit, which exchanges data with the storage medium, and is used for executing the computer program through the processing unit when performing behavior recognition, so as to perform the steps of the behavior recognition method based on the multi-head cascade attention network and the time convolution network according to any one of claims 1 to 6.
The behavior identification method based on the multi-head cascade attention network and the time convolution network is applied to smoking monitoring.
Has the advantages that: the invention has the following advantages:
the method comprises the steps of firstly collecting video clips in monitoring videos of various public places, and marking the obtained videos to form a data set. And inputting the marked data set into a multi-head cascade attention network for pre-training to obtain pre-training weights, testing, training again and updating the pre-training weights, so that the accuracy of the network for identifying and positioning smoking behaviors achieves a better effect. The method effectively solves the limitation of the existing identification method, and has the advantages of accurate and timely monitoring result and low possibility of being influenced by external factors such as dust, volatile gas and the like.
The system of the invention completes the identification and classification of behaviors in a video by constructing two layers of attention modules combined with a time convolution network. The two layers comprise a local attention module and a global attention module: the local attention module captures the importance of local features in a self-attention manner to obtain local attention weights; the global attention module then integrates the local features into multiple global representations and learns secondary attention over the global information in a relational manner; finally, a multi-stage time convolution network performs the final identification and classification. The system has a simple structure, and the two-stage recognition yields accurate results.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a multi-head cascade network according to the present invention.
FIG. 3 is a block diagram of a multi-stage time convolutional network action logic combination.
Fig. 4 is a block diagram of an electronic device based on behavior recognition of a multi-headed cascade attention network and a time convolution network.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
In this context, each self-attention module obtains a self-attention feature, and all the self-attention features are collectively called attention features.

Specific example I: in this embodiment, the behavior identification method and device based on a multi-head cascade attention network and a time convolution network according to the present invention are described in detail, taking real-time smoking monitoring as an example.
The specific technical flow chart of the invention is shown in the attached figures 1-3, and the detailed scheme of the behavior identification method based on the multi-head cascade attention network and the time convolution network comprises the following steps:
s1, first using I ═ I1,I2...Ik]k segments represent the video, and then feature information of the video is extracted through a feature extraction network with a parameter theta: x ═ X1,x2,...xK]=[r(I1;θ1),...,r(IK;θ1)]Wherein, Ii∈RH *W*3*LH and W are the height and width of the incoming video segment, respectively, and L is the length of the video segment.
S2: The extracted video feature information is input into the next two fully connected layers; the first fully connected layer learns the self-attention weights, while the second, combined with data normalization, learns the multiple attention features, so that each self-attention module obtains its own attention weights. First, the video of K frames is divided into {I_1, I_2, …, I_K}, and the features X = [x_1, x_2, …, x_K] of the K video frames are obtained through the feature extraction network r(·; θ_1) with parameter θ_1. The self-attention weight α_ij of the input video segment feature x_j is defined as a softmax over the segments:

α_ij = exp(w_i^T x_j) / Σ_{t=1}^{k} exp(w_i^T x_t)
Each output of the first fully connected layer is the weighted sum of the original features with the attention weights of the i-th head attention module, defined as follows:

y_i = Σ_{j=1}^{k} α_ij x_j
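Step S2 can be sketched as follows; the per-head parameters w_i, the softmax form of α_ij, and all dimensions are assumptions made for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def local_self_attention(X, W):
    """X: (k, d) segment features; W: (n_heads, d) per-head parameters.
    Returns alpha: (n_heads, k) softmax weights over segments, and
    y: (n_heads, d), where y_i = sum_j alpha_ij * x_j."""
    scores = W @ X.T                                 # (n_heads, k)
    alpha = np.apply_along_axis(softmax, 1, scores)  # row-wise softmax
    y = alpha @ X                                    # per-head weighted sum
    return alpha, y

rng = np.random.default_rng(1)
X = rng.random((5, 16))    # k = 5 segments, d = 16
W = rng.random((3, 16))    # n_heads = 3 self-attention modules
alpha, y = local_self_attention(X, W)
print(alpha.shape, y.shape)  # (3, 5) (3, 16)
```

Each row of `alpha` sums to 1, so y_i is a convex combination of the segment features, one per attention head.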
s3, because a single-head self-attention weight feature may intelligently reflect the feature of a certain aspect of the video, for this reason, a multi-head attention mechanism is adopted to capture the feature of more aspects of the video.
S4: To prevent these self-attention features from always attending to similar signals, and to increase their diversity, we weight or shift the feature values in feature space by linear transformation and normalization. This makes the features differ from one another in value and distribution while preserving scale invariance, which also benefits network optimization. The process is defined as follows:
Each self-attention output y_i is linearly transformed and normalized to y_i′ (the defining equation is given only as an image in the source), where y_i′ is obtained by linearly transforming the fully connected layer output y_i.
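A minimal sketch of one plausible reading of this linear-transformation-and-normalization step, standardising each head's output to zero mean and unit variance; the exact transform in the patent is shown only as an image, so this form is an assumption:

```python
import numpy as np

def diversify_heads(Y):
    """Y: (N, d) outputs of the N self-attention modules.
    Standardise each head's output to zero mean and unit variance,
    so the heads stay on a common scale while differing in direction."""
    mu = Y.mean(axis=1, keepdims=True)
    sigma = Y.std(axis=1, keepdims=True) + 1e-8  # avoid division by zero
    return (Y - mu) / sigma

Y = np.array([[1.0, 2.0, 3.0],
              [10.0, 10.0, 40.0]])
Y_prime = diversify_heads(Y)
print(Y_prime.mean(axis=1))  # approximately [0, 0]
```

After standardisation the heads share a common scale, which is the stated goal (scale invariance) even if the patent's exact operator differs.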
S5: The global attention module learns the attention weights by concatenating the video representation and the approximate video representations as input. This module operates mainly on global features, and each attention weight can be defined in the form:
βi=sigmoid(wT[yi′;G])
where w is a parameter of the fully connected layer of the global attention module, and [y_i′; G] indicates that y_i′ and G are connected by a concatenation operator.
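The global attention weights β_i = sigmoid(w^T [y_i′; G]) can be computed directly; the dimensions and random parameters below are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def global_attention(Y_prime, G, w):
    """beta_i = sigmoid(w^T [y_i'; G]) for each normalized head output y_i';
    G is the global video representation, w the global-module FC parameter."""
    return np.array([sigmoid(w @ np.concatenate([y, G])) for y in Y_prime])

rng = np.random.default_rng(2)
Y_prime = rng.random((3, 8))   # N = 3 head outputs, each 8-dim (toy sizes)
G = rng.random(8)              # global representation
w = rng.random(16)             # matches the concatenated [y_i'; G] length
beta = global_attention(Y_prime, G, w)
print(beta.shape)  # (3,): one weight in (0, 1) per head
```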
S6: After the first action label is obtained, a multi-stage time convolution network is appended at the end to extract the time-series features of the result. The effect of this combination is a gradual refinement of the predictions of the earlier stages.
On the basis, a multi-stage time convolution network is introduced to complete the division task of the time action; in order to reduce the number of parameters that need to be processed, a dilation convolution is introduced into this time convolution network. In this multi-stage model, each stage takes an initial prediction from the previous stage and refines it. Using such a multi-stage architecture helps to provide more context to predict class labels for each segment. In addition, since the output of each stage is an initial prediction, the network can capture the dependency relationship between the action classes and learn the possible action sequences, which helps to reduce over-segmentation errors.
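The multi-stage refinement idea (each stage smooths the previous stage's frame-wise predictions with dilated convolutions whose dilation grows per stage) can be sketched as follows; the toy depthwise smoothing kernel stands in for the learned dilated convolutions of an MS-TCN-style network and is an assumption:

```python
import numpy as np

def dilated_smooth(x, kernel, dilation):
    """Depthwise 1-D dilated convolution over a (T, C) sequence, 'same' length."""
    T, _ = x.shape
    K = kernel.shape[0]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(T):
        for k in range(K):
            out[t] += kernel[k] * xp[t + k * dilation]
    return out

def multi_stage_refine(probs, n_stages=3):
    """Each stage takes the previous stage's frame-wise class probabilities
    and refines them; dilation grows per stage, mimicking the MS-TCN idea."""
    kernel = np.array([0.25, 0.5, 0.25])  # fixed smoothing kernel (toy)
    for s in range(n_stages):
        probs = dilated_smooth(probs, kernel, dilation=2 ** s)
        probs = probs / probs.sum(axis=1, keepdims=True)  # keep valid distributions
    return probs

rng = np.random.default_rng(3)
p0 = rng.random((20, 4))
p0 = p0 / p0.sum(axis=1, keepdims=True)   # 20 frames, 4 action classes
p = multi_stage_refine(p0)
print(p.shape)  # (20, 4)
```

The smoothing across increasingly long temporal contexts is what reduces over-segmentation: isolated one-frame label flips get absorbed by their neighbours.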
S7: After the action class labels are obtained, the logical relations between the actions can be analyzed through an expert system, yielding the final, accurate behavior category.
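The expert-system step can be illustrated with a toy rule base over hypothetical sub-action labels; the labels and ordering rules below are invented for illustration and do not come from the patent:

```python
def expert_system(label_sequence):
    """Toy rule base for the action-logic analysis: a 'smoking' verdict requires
    the hypothetical sub-actions to occur in a plausible temporal order."""
    rules = [("raise_hand", "hold_to_mouth"), ("hold_to_mouth", "exhale_smoke")]
    idx = {a: label_sequence.index(a) for a in set(label_sequence)}
    for before, after in rules:
        if before not in idx or after not in idx or idx[before] > idx[after]:
            return "non_smoking"
    return "smoking"

print(expert_system(["raise_hand", "hold_to_mouth", "exhale_smoke"]))  # smoking
print(expert_system(["exhale_smoke", "raise_hand"]))                   # non_smoking
```

A real expert system would encode many more rules (durations, co-occurring objects, scene context), but the structure, ordered predicates over predicted action labels, is the same.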
Specific example II:
the invention also discloses an embodiment: as shown in fig. 4, a behavior recognition system based on a multi-head cascade attention network and a time convolution network includes: a multi-head cascade network module 100 and an action logic combination module 200; the multi-head cascade network module 100 is configured to obtain a local attention weight in a video and integrate local features into a plurality of global representations according to the local attention weight; the action logic combination module 200 performs data interaction with the multi-head cascade network module, and is used for behavior classification of video information.
The multi-head cascade network module 100 includes: a local attention module 101 and a global attention module 102; the local attention module 101 is connected with the outside, and is configured to learn a plurality of attention weights of each network segment from network segment features generated by a backbone of the multi-head cascade network module, and capture the importance of the local features in a self-attention manner to obtain local attention weights; the global attention module 102 performs data interaction with the local attention module 101, and is configured to integrate local features into a plurality of global representations by using the local attention weights, and then learn secondary attention of global information in a relational manner; the global attention module 102 performs data interaction with the action logic combination module 200 for behavior recognition and classification.
Specific example III:
the invention also provides an embodiment: an electronic device based on behavior recognition of a multi-headed cascade attention network and a time convolution network, comprising: a storage medium and a processing unit; a storage medium for storing a computer program; the processing unit exchanges data with the storage medium, and is used for executing the computer program through the processing unit when performing behavior recognition, so as to perform the steps of the behavior recognition method based on the multi-head cascade attention network and the time convolution network.
In the electronic device, the storage medium is preferably a storage device such as a mobile hard disk, a solid-state disk, or a USB flash disk; the processing unit, preferably a CPU, exchanges data with the storage medium and executes the computer program when performing behavior recognition, so as to perform the above-mentioned steps of behavior recognition based on the multi-head cascade attention network and the time convolution network.
The CPU described above can execute various appropriate actions and processes according to a program stored in the storage medium. The electronic device also includes peripherals: an input part such as a keyboard and a mouse, and an output part such as a cathode ray tube (CRT) or liquid crystal display (LCD) and a speaker. In particular, according to the disclosed embodiments of the invention, the processes described in FIG. 1 may be implemented as computer software programs.
An embodiment provided by the invention comprises a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart of FIG. 1. The computer program may be downloaded and installed from a network. When executed by the CPU, the computer program performs the above-described functions defined in the system of the present invention.
The present invention also provides a computer-readable storage medium having a computer program stored therein; the computer program, when executed, performs the steps of the behavior recognition method based on the multi-headed cascade attention network and the time convolution network as described above.
In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The above description is only an embodiment of the present invention, and the scope of the present invention is not limited thereto; any modification or substitution readily conceivable by a person skilled in the art within the technical scope disclosed herein shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention is subject to the protection scope of the claims.

Claims (10)

1. A behavior identification method based on a multi-head cascade attention network and a time convolution network is characterized by comprising the following steps:
collecting a video and extracting video characteristic information in the video;
learning at least 1 attention feature through the video feature information, and capturing a local attention weight in a self-attention mode;
capturing other characteristic information in the video by adopting a multi-head attention mechanism;
weighting the characteristic values in the characteristic space by adopting a linear transformation and normalization method, and increasing the diversity of the self-attention characteristics;
adopting a multi-head cascade attention network, integrating local features into a plurality of global representations by using the local attention weight, and learning attention weight by taking the self-attention feature as input;
acquiring a first action label corresponding to the video characteristic information according to the attention weight, and extracting time sequence characteristics of the first action label according to a multi-stage time convolution network to improve a prediction result;
and analyzing the prediction result through an expert system to obtain a final behavior category.
2. The behavior identification method based on the multi-head cascade attention network and the time convolution network as claimed in claim 1, wherein said "acquiring a video and extracting video feature information in the video" comprises:
the video is represented by its K segments I = [I_1, I_2, …, I_K], and a feature extraction network with parameter θ_1 extracts the feature information of the video:

X = [x_1, x_2, …, x_K] = [r(I_1; θ_1), …, r(I_K; θ_1)]

where I_i ∈ R^(H×W×3×L), H and W are the height and width of the input video segment, respectively, and L is the length of the video segment.
3. The method according to claim 1, wherein the "learning at least 1 attention feature through the video feature information and capturing local attention weight in a self-attention manner" includes:
inputting the video feature information into the next two fully connected layers, wherein the first fully connected layer learns the self-attention weights and the second, combined with data normalization, yields the multiple learned attention features;
the video of K frames is first divided into { I }1,I2,…,IkA feature extraction network r (-) through a parameter theta1) Obtaining the characteristic X ═ X of the video K frame1,x2,…,xk];
Input video clip feature XjSelf-attention weight of (a)ijIs defined as follows:
Figure FDA0003384048180000021
each output of the first FC layer is the weighted sum of the original features with the attention weights of the i-th head attention module, defined as follows:

y_i = Σ_{j=1}^{k} α_ij x_j
where k is the number of video segments; x_j is the feature information of the j-th video frame; w is a parameter of the fully connected layer of the global attention module.
4. The behavior identification method based on the multi-head cascade attention network and the time convolution network as claimed in claim 1, wherein the "weighting feature values in feature space by using linear transformation and normalization" comprises:
the linear transformation is performed by the following procedure (the defining equation is given only as an image in the source): each self-attention output y_i is linearly transformed and normalized to y_i′, where y_i′ is obtained by linear transformation of the fully connected layer output y_i, and N is the number of self-attention modules.
5. The method according to claim 1, wherein the step of integrating local features into a plurality of global representations by using the local attention weight value and learning the attention weight value by using the self-attention feature as an input comprises the steps of:
learning attention weights by connecting the video representation and the cascaded layers of self-attention features, each attention weight being defined as follows, taking the self-attention features as input:
βi=sigmoid(wT[yi′;G])
where w is a parameter of the fully connected layer of the global attention module; [y_i′; G] denotes that y_i′ and G are connected by a concatenation operator; i = 1, 2, 3, ….
6. The method according to claim 1, wherein the step of obtaining a first action tag corresponding to the video feature information according to the attention weight and extracting a time series feature of the first action tag according to a multi-stage time convolution network comprises:
introducing a multi-stage time convolution network to finish the task of dividing the time action, and introducing expansion convolution in the time convolution network;
each stage in the time convolutional network takes an initial prediction from a previous stage and refines it.
7. A behavior recognition system based on a multi-head cascade attention network and a time convolution network is characterized by comprising:
a multi-head cascade network module, configured to acquire local attention weights from the video and to integrate local features into a plurality of global representations according to the local attention weights;
and an action logic combination module, which exchanges data with the multi-head cascade network module and is configured to obtain the behavior classification of the video information.
8. The behavior recognition system based on the multi-head cascade attention network and the time convolution network as claimed in claim 7, wherein the multi-head cascade network module comprises:
a local attention module, connected to the external input, configured to learn a plurality of attention weights for each segment from the segment features generated by the backbone of the multi-head cascade network module, capturing the importance of local features in a self-attention manner to obtain the local attention weights;
a global attention module, which exchanges data with the local attention module and is configured to integrate the local features into a plurality of global representations using the local attention weights, and then to learn a second-level attention over the global information in a relational manner;
and the global attention module exchanges data with the action logic combination module for behavior recognition and classification.
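The data flow among the three claimed modules can be summarized in a short sketch. The module implementations below are placeholder lambdas, purely hypothetical stand-ins used to show how segment features pass from the local attention module through the global attention module to the action logic combination module.

```python
def recognize(video_segments, local_attention, global_attention, classifier):
    # Claimed pipeline: local attention weights each segment's features,
    # global attention integrates them into a global representation,
    # and the action logic combination module outputs the behavior class.
    local_feats = [local_attention(seg) for seg in video_segments]
    global_repr = global_attention(local_feats)
    return classifier(global_repr)

# Toy stand-ins for the three modules (not the patent's networks).
label = recognize(
    [[0.2, 0.8], [0.6, 0.4]],
    local_attention=lambda seg: max(seg),       # importance of local feature
    global_attention=lambda feats: sum(feats),  # integrate into global repr
    classifier=lambda g: "action" if g > 1.0 else "background",
)
```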
9. An electronic device based on behavior recognition of a multi-headed cascade attention network and a time convolution network, comprising:
a storage medium for storing a computer program;
a processing unit, which exchanges data with the storage medium and is configured to execute the computer program when performing behavior recognition, so as to carry out the steps of the behavior recognition method based on the multi-head cascade attention network and the time convolution network according to any one of claims 1 to 6.
10. Use of the behavior recognition method based on the multi-head cascade attention network and the time convolution network according to any one of claims 1 to 6 in smoking-behavior monitoring.
CN202111446154.XA 2021-11-30 2021-11-30 Behavior identification method and device based on multi-head cascade attention network and time convolution network Pending CN114140879A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111446154.XA CN114140879A (en) 2021-11-30 2021-11-30 Behavior identification method and device based on multi-head cascade attention network and time convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111446154.XA CN114140879A (en) 2021-11-30 2021-11-30 Behavior identification method and device based on multi-head cascade attention network and time convolution network

Publications (1)

Publication Number Publication Date
CN114140879A true CN114140879A (en) 2022-03-04

Family

ID=80386065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111446154.XA Pending CN114140879A (en) 2021-11-30 2021-11-30 Behavior identification method and device based on multi-head cascade attention network and time convolution network

Country Status (1)

Country Link
CN (1) CN114140879A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649630A (en) * 2024-01-29 2024-03-05 武汉纺织大学 Examination room cheating behavior identification method based on monitoring video stream
CN117649630B (en) * 2024-01-29 2024-04-26 武汉纺织大学 Examination room cheating behavior identification method based on monitoring video stream

Similar Documents

Publication Publication Date Title
CN110728209B (en) Gesture recognition method and device, electronic equipment and storage medium
CN109086811B (en) Multi-label image classification method and device and electronic equipment
CN110796018B (en) Hand motion recognition method based on depth image and color image
CN113537070B (en) Detection method, detection device, electronic equipment and storage medium
CN116579616B (en) Risk identification method based on deep learning
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN112215831B (en) Method and system for evaluating quality of face image
CN116310850B (en) Remote sensing image target detection method based on improved RetinaNet
CN111199238A (en) Behavior identification method and equipment based on double-current convolutional neural network
CN116229052B (en) Method for detecting state change of substation equipment based on twin network
Geng et al. An improved helmet detection method for YOLOv3 on an unbalanced dataset
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN111310837A (en) Vehicle refitting recognition method, device, system, medium and equipment
CN114724140A (en) Strawberry maturity detection method and device based on YOLO V3
CN114140879A (en) Behavior identification method and device based on multi-head cascade attention network and time convolution network
CN112712005B (en) Training method of recognition model, target recognition method and terminal equipment
CN111428567B (en) Pedestrian tracking system and method based on affine multitask regression
CN117152815A (en) Student activity accompanying data analysis method, device and equipment
CN115719428A (en) Face image clustering method, device, equipment and medium based on classification model
CN116309343A (en) Defect detection method and device based on deep learning and storage medium
CN116189286A (en) Video image violence behavior detection model and detection method
CN112633089B (en) Video pedestrian re-identification method, intelligent terminal and storage medium
CN117407557B (en) Zero sample instance segmentation method, system, readable storage medium and computer
Priyadharsini et al. Performance Investigation of Handwritten Equation Solver using CNN for Betterment
CN114998990B (en) Method and device for identifying safety behaviors of personnel on construction site

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination