CN113033657A - Multi-user behavior identification method based on Transformer network - Google Patents
- Publication number
- Publication number: CN113033657A (application CN202110312085.7A)
- Authority
- CN
- China
- Prior art keywords
- vector
- matrix
- data
- network
- time sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G08—SIGNALLING
- G08B—SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
- G08B21/00—Alarms responsive to a single specified undesired or abnormal condition and not otherwise provided for
- G08B21/02—Alarms for ensuring the safety of persons
- G08B21/04—Alarms for ensuring the safety of persons responsive to non-activity, e.g. of elderly persons
- G08B21/0438—Sensor means for detecting
Abstract
The invention discloses a multi-user behavior identification method based on a Transformer network, comprising the following steps: collecting an environmental sensor data set, feeding the time-series sensor data into the model as input, and sampling it through a fixed-size sliding window; embedding the sampled events into an initial vector, adding position codes to represent the order of the events in the sequence, and passing the vectors into the Encoder of the Transformer network; applying a top fully connected layer to classify the labels of users and activities. The invention uses an end-to-end method, avoiding the hand-crafted features and separate training/test sets required by traditional machine learning methods. The invention uses a temporal attention mechanism so that the network pays more attention to the key frames that contribute most to behavior recognition, effectively addressing the problem that a deep neural network assigns equal importance to all time-series data when extracting features automatically.
Description
Technical Field
The invention belongs to the field of human behavior recognition, and particularly relates to a multi-person behavior recognition method based on the Encoder of a Transformer network, mainly used for recognizing human behaviors from environmental sensor data.
Background
In recent years, human behavior recognition has received much attention. Accurate and efficient human behavior recognition plays an important role in human-computer interaction, home safety monitoring and the like. Human behavior recognition may contribute to detecting behavioral activities of the elderly, identifying potential safety hazards and physical degradation, etc. As the basis of the smart home, human behavior recognition needs to be performed on data obtained by sensors. Compared with video sensors and wearable sensors, environmental sensors are installed on floors, doors, windows or electrical equipment, reduce the inconvenience that data acquisition may cause to resident activities, and are more widely applicable. Current research on human behavior recognition based on environmental sensor data faces the following problems:
1. Multi-person behavior recognition is difficult: much research currently focuses on identifying the behavior of a single resident; however, a room usually contains multiple residents with different behavior habits, and parallel or cooperative activities exist, which poses complex challenges for activity recognition.
2. Traditional machine learning methods have low recognition efficiency: they require hand-crafted statistical and frequency features to represent segments of the raw sensor stream and to train a model to classify residents and activities. The effectiveness of this approach depends largely on the quality of the manual features.
3. Neural networks are not well suited to binary data: with advances in deep learning, CNNs have gradually been applied to human behavior recognition, but they are mainly used to process continuous signal data and lack adaptability to binary environmental sensor data.
The invention content is as follows:
in order to overcome the defects of the background art, the invention provides a multi-user behavior identification method based on a Transformer network, which simultaneously identifies multiple users and their corresponding activities from data collected by environmental sensors.
In order to solve the technical problems, the invention adopts the technical scheme that:
a multi-person behavior identification method based on a Transformer network comprises the following steps:
step 1, collecting an environment sensor data set, taking sensor data based on a time sequence as input, entering a model, and sampling through a sliding window with a fixed size;
step 2, embedding the sampled events into an initial vector, then adding position codes to represent the order of the events in the sequence, and then passing the vectors into the Encoder of the Transformer network;
step 3, classifying the labels of the users and the activities by applying a top fully connected layer.
Preferably, the specific method of step 1 comprises:
step 1.1, arranging an environmental sensor in a measured space region, and collecting user behavior data;
step 1.2, the collected environmental sensor data is represented by ON or OFF, wherein ON represents that the sensor is triggered, and OFF represents that the sensor is not triggered;
step 1.3, screening original data, removing data with the attribute of OFF, reserving data with the attribute of ON, taking each ON data as an event, and arranging the screened ON data according to a time sequence to form time sequence data;
and step 1.4, segmenting the time sequence data obtained in the step 1.3 to obtain a data slice sample.
Preferably, the specific method of step 1.4 comprises: arranging the screened data with ON attributes according to a time sequence to form a group of time sequence data; and acquiring original information on the time sequence data by using a sliding window with a preset fixed size, wherein the acquired result of the sliding window is used as a data slice sample.
Preferably, the predetermined fixed-size sliding window size k is an empirical parameter.
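The event filtering and sliding-window sampling of steps 1.2-1.4 can be sketched as follows; the event-tuple layout, the `make_windows` helper name, and the window stride of 1 are illustrative assumptions not fixed by the patent.

```python
from typing import List, Tuple

# Hypothetical event record: (timestamp, sensor_id, state)
Event = Tuple[float, str, str]

def make_windows(events: List[Event], k: int, stride: int = 1) -> List[List[Event]]:
    """Keep only ON events, order them by time, and cut fixed-size slices.

    k is the empirical sliding-window size; the stride is an assumption,
    since the patent does not state the window step.
    """
    on_events = sorted((e for e in events if e[2] == "ON"), key=lambda e: e[0])
    return [on_events[i:i + k] for i in range(0, len(on_events) - k + 1, stride)]

events = [
    (0.0, "door", "ON"), (0.5, "door", "OFF"),
    (1.0, "motion1", "ON"), (2.0, "motion2", "ON"), (3.0, "door", "ON"),
]
windows = make_windows(events, k=2)
print(len(windows))  # 3 slices over the 4 retained ON events
```

Each slice then becomes one data slice sample fed to the embedding step.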
Preferably, the specific method of step 2 comprises:
step 2.1, mapping the discrete data variable corresponding to each slice data sample to a continuous characterization vector through an Embedding algorithm; the Embedding algorithm one-hot encodes each sample datum and converts it into a vector;
step 2.2, the result of the embedding is the embedding matrix R^(T×C), where T represents the time-series dimension and C represents the channel dimension; in this process, the time-series dimension equals the length k of the sliding window used in data slicing, and each channel represents a corresponding sensor, of which there are N;
step 2.3, adding position codes: construct a matrix PE with the same dimensions as the embedding matrix, where the rows of PE represent time-sequence samples and the columns represent sensors, and each value in PE is obtained by the following formulas;
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where PE is the position coding matrix, pos is the serial number corresponding to the sensor, i is the position of the row vector in the matrix, and d_model is the dimension of the row vector;
adding the PE matrix and the embedded matrix to obtain a new eigenvector matrix introduced with position coding;
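A minimal NumPy sketch of the position coding above; the window length 12 and sensor count 37 are taken from the embodiment later in the description, and the zero embedding matrix is only a stand-in for the result of step 2.2.

```python
import numpy as np

def positional_encoding(T: int, d_model: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos/10000^(2i/d_model)); PE(pos, 2i+1) = cos(same angle)."""
    pe = np.zeros((T, d_model))
    pos = np.arange(T)[:, None]                # one row per position in the window
    i = np.arange(0, d_model, 2)[None, :]      # even feature indices
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)[:, : d_model // 2]  # slice handles odd d_model
    return pe

k, n_sensors = 12, 37                  # window length and sensor count from the embodiment
pe = positional_encoding(k, n_sensors)
embedded = np.zeros((k, n_sensors))    # stand-in for the embedding matrix of step 2.2
x = embedded + pe                      # feature matrix with position codes added
```

The element-wise sum `embedded + pe` is exactly the "adding the PE matrix and the embedding matrix" operation described above.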
step 2.4, inputting m row vectors of the new feature vector matrix into the Encoder, where m equals the Batch size set for the Transformer network;
step 2.5, the vectors entering the encoder are first passed to a multi-head attention layer to obtain new characterization vectors; a multi-head attention mechanism computes attention values under the different attention heads, so that the network pays more attention to the key frames that contribute most to behavior recognition, using the following formula:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)
where Q, K and V respectively denote the Query, Key and Value vectors of the attention mechanism: the Query vector Q represents the sample attributes to be matched, the Key vector K represents the attributes of a sample, and the Value vector V represents the information contained in a sample;
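The per-head attention and concatenation of step 2.5 can be sketched with NumPy as below. The random projection matrices stand in for learned weights, and the final output projection is taken from the standard Transformer formulation, which the patent's formula leaves implicit — an assumption on our part.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, h, rng):
    """Toy multi-head self-attention: per head, softmax(Q K^T / sqrt(d_k)) V;
    head outputs are concatenated and linearly projected back to d_model."""
    T, d_model = X.shape
    d_k = d_model // h
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_k))   # attention weights over the window
        heads.append(A @ V)
    Wo = rng.standard_normal((h * d_k, d_model))
    return np.concatenate(heads, axis=1) @ Wo  # Concat(head_1, ..., head_h) projected

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 64))   # a window of 12 embedded events, d_model = 64
out = multi_head_attention(X, h=8, rng=rng)
print(out.shape)  # (12, 64)
```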
step 2.6, normalizing the new characterization vectors generated by the attention layer in step 2.5 through Layer Normalization: the input matrix of step 2.4 is summed with the matrix obtained in step 2.5 (a residual connection) and normalized to obtain a new matrix;
step 2.7, transferring the matrix obtained in the step 2.6 to a Feed-Forward neural network Feed Forward for processing to obtain a reinforced characterization vector matrix;
step 2.8, the reinforced characterization vector matrix obtained in the step 2.7 is accessed into a normalization layer, and elements in the matrix are unitized according to rows to obtain a normalization matrix;
step 2.9, the output normalized matrix is sent on to the next encoder, yielding the final feature matrix.
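Steps 2.5-2.9 together describe one encoder layer — attention, add & normalize, feed forward, add & normalize — repeated across the stacked encoders. A minimal sketch under stated assumptions: the attention sublayer is replaced by an identity function, and the feed-forward weights are random stand-ins for learned parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Unitize each row: subtract the row mean and divide by the row std."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_block(x, attn, ffn):
    """One encoder layer per steps 2.5-2.8: sublayer output is summed with its
    input (residual connection) and then normalized."""
    x = layer_norm(x + attn(x))   # multi-head attention, add & normalize
    x = layer_norm(x + ffn(x))    # feed-forward network, add & normalize
    return x

def stack(x, n_layers, attn, ffn):
    for _ in range(n_layers):     # the patent stacks six encoders
        x = encoder_block(x, attn, ffn)
    return x

rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((37, 64)), rng.standard_normal((64, 37))
ffn = lambda x: np.maximum(x @ W1, 0.0) @ W2   # ReLU feed-forward sublayer
attn = lambda x: x                             # identity stand-in for attention
y = stack(rng.standard_normal((12, 37)), n_layers=6, attn=attn, ffn=ffn)
print(y.shape)  # (12, 37)
```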
Preferably, the number of row vectors input to the Encoder in step 2.4 is m, the value of m being the Batch size set by the transform network.
Preferably, six sequentially arranged Encoders are included.
Preferably, the specific method of step 3 comprises:
step 3.1, inputting the two-dimensional matrix of the T multiplied by C eigenvector obtained in the step 2.9 into a full connection layer, and automatically tiling to generate a one-dimensional vector with the length of T multiplied by C;
step 3.2, mapping the T multiplied by C one-dimensional feature vector to a sample marking space through a full connection layer to obtain a classification result vector, wherein elements in the vector are numerical values of each category obtained by weighting and summing the features;
step 3.3, a Softmax function is adopted as the classifier in the fully connected layer, mapping the inputs of the fully connected layer's neurons to the output end and converting each output value in the classification result vector into a probability, giving the final classification vector Yt; the Cross Entropy between the expected output and the actual output is used as the loss function to measure their difference;
step 3.4, the whole network model, improved from the Transformer network, finally outputs the classification vector Yt, which contains user identification information and activity identification information: the first a elements of the vector represent the corresponding residents, the following b elements represent the corresponding behavior activities, and the value of each element represents the probability of identifying the corresponding resident or activity.
Preferably, the dimension of the vector Yt in step 3.4 is the sum of the number of residents a and the number of activities b.
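A sketch of the classification head of steps 3.1-3.4, simplified to a single linear layer with one joint softmax over the a + b outputs (the embodiment later in the description actually uses two fully connected layers with 256 hidden neurons); the weight shapes and the sizes a = 2, b = 15 are illustrative.

```python
import numpy as np

def classify(features, W, b_vec, n_residents, n_activities):
    """Flatten the T x C feature matrix, apply a linear layer, and split the
    softmax output into resident and activity probabilities (length a + b)."""
    x = features.reshape(-1)        # tile the T x C matrix into a 1-D vector
    logits = x @ W + b_vec
    e = np.exp(logits - logits.max())
    probs = e / e.sum()             # softmax over all a + b outputs
    return probs[:n_residents], probs[n_residents:]

def cross_entropy(probs, target):
    """Cross-entropy between the predicted distribution and a one-hot target."""
    return -np.log(probs[target] + 1e-12)

T, C, a, b = 12, 37, 2, 15          # feature size and a/b counts are illustrative
rng = np.random.default_rng(2)
W = rng.standard_normal((T * C, a + b))
bias = np.zeros(a + b)
res_p, act_p = classify(rng.standard_normal((T, C)), W, bias, a, b)
loss = cross_entropy(np.concatenate([res_p, act_p]), target=0)
```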
The invention has the following beneficial effects. The invention uses an end-to-end method, avoiding the hand-crafted features and separate training/test sets required by traditional machine learning methods. The invention uses a temporal attention mechanism so that the network pays more attention to the key frames that contribute most to behavior recognition, effectively addressing the problem that a deep neural network assigns equal importance to all time-series data when extracting features automatically. The invention uses an improved Transformer structure: since the task only requires classification, the decoder of the original model is removed, and the more streamlined framework improves the accuracy of user and activity identification. The invention can identify multiple users and simultaneously output the corresponding activity of each user.
Drawings
FIG. 1 is a schematic overall flow diagram of an embodiment of the present invention;
FIG. 2 is a schematic diagram of sliding window sampling according to an embodiment of the present invention;
FIG. 3 is a schematic view of a model of an attention mechanism according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of the Encoder in the embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings and examples.
A multi-person behavior identification method based on a Transformer network comprises the following steps:
step 1, collecting an environment sensor data set, taking sensor data based on a time sequence as input, entering a model, and sampling through a sliding window with a fixed size; the specific method of step 1 comprises:
step 1.1, arranging an environmental sensor in a measured space region, and collecting user behavior data;
step 1.2, the collected environmental sensor data is represented by ON or OFF, wherein ON represents that the sensor is triggered, and OFF represents that the sensor is not triggered;
step 1.3, screening original data, removing data with the attribute of OFF, reserving data with the attribute of ON, taking each ON data as an event, and arranging the screened ON data according to a time sequence to form time sequence data;
step 1.4, segmenting the time sequence data obtained in the step 1.3 to obtain a data slice sample; the specific method comprises the following steps: arranging the screened data with ON attributes according to a time sequence to form a group of time sequence data; and acquiring original information on the time sequence data by using a sliding window with a preset fixed size, wherein the acquired result of the sliding window is used as a data slice sample. The predetermined fixed size sliding window size k is an empirical parameter.
Step 2, embedding the sampled events into an initial vector, then adding position codes to represent the order of the events in the sequence, and then passing the vectors into the Encoder of the Transformer network; the specific method of step 2 comprises the following steps:
step 2.1, mapping the discrete data variable corresponding to each slice data sample to a continuous characterization vector through an Embedding algorithm; the Embedding algorithm one-hot encodes each sample datum and converts it into a vector;
step 2.2, the result of the embedding is the embedding matrix R^(T×C), where T represents the time-series dimension and C represents the channel dimension; in this process, the time-series dimension equals the length k of the sliding window used in data slicing, and each channel represents a corresponding sensor, of which there are N;
step 2.3, adding position codes: construct a matrix PE with the same dimensions as the embedding matrix, where the rows of PE represent time-sequence samples and the columns represent sensors, and each value in PE is obtained by the following formulas;
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where PE is the position coding matrix, pos is the serial number corresponding to the sensor, i is the position of the row vector in the matrix, and d_model is the dimension of the row vector;
adding the PE matrix and the embedded matrix to obtain a new eigenvector matrix introduced with position coding;
step 2.4, inputting m row vectors of the new feature vector matrix into the Encoder, where m equals the Batch size set for the Transformer network;
step 2.5, the vectors entering the Encoder are first passed to a multi-head attention layer (referring to fig. 4, the multi-head attention layer is one of the internal structures of the Encoder) to obtain new characterization vectors; a multi-head attention mechanism computes attention values under the different attention heads, so that the network pays more attention to the key frames that contribute most to behavior recognition, using the following formula:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)
where Q, K and V respectively denote the Query, Key and Value vectors of the attention mechanism: the Query vector Q represents the sample attributes to be matched, the Key vector K represents the attributes of a sample, and the Value vector V represents the information contained in a sample;
step 2.6, the new characterization vectors generated by the attention layer in step 2.5 are normalized through Layer Normalization, which facilitates the subsequent nonlinear processing of the data by the ReLU activation function in the Feed Forward neural network. The input matrix of step 2.4 is summed with the matrix obtained in step 2.5 (a residual connection) and normalized to obtain a new matrix;
step 2.7, transferring the matrix obtained in the step 2.6 to a Feed-Forward neural network Feed Forward for processing to obtain a reinforced characterization vector matrix; the expressive power of the characterization vector is enhanced by activating the function.
Step 2.8, in order to avoid gradient disappearance and accelerate the convergence process of full-connection layer training, the strengthened characterization vector matrix obtained in the step 2.7 is connected into a normalization layer, and elements in the matrix are unitized according to rows to obtain a normalization matrix;
step 2.9, the output normalized matrix is sent on to the next encoder, yielding the final feature matrix.
Preferably, in step 2.4, the number of row vectors input to the Encoder is m, where m equals the Batch size set for the Transformer network; the optimal parameters are obtained from repeated experimental results. The Batch size is likewise a parameter tuned for the network, and the value obtained by experiment is 64. The present embodiment includes six sequentially arranged Encoders.
Step 3, classifying the labels of the users and the activities by applying a top fully connected layer. The specific method of step 3 comprises the following steps:
step 3.1, inputting the two-dimensional matrix of the T multiplied by C eigenvector obtained in the step 2.9 into a full connection layer, and automatically tiling to generate a one-dimensional vector with the length of T multiplied by C;
step 3.2, mapping the T multiplied by C one-dimensional feature vector to a sample marking space through a full connection layer to obtain a classification result vector, wherein elements in the vector are numerical values of each category obtained by weighting and summing the features;
step 3.3, a Softmax function is adopted as the classifier in the fully connected layer, mapping the inputs of the fully connected layer's neurons to the output end and converting each output value in the classification result vector into a probability, giving the final classification vector Yt; the Cross Entropy between the expected output and the actual output is used as the loss function to measure their difference;
step 3.4, the whole network model, improved from the Transformer network, finally outputs the classification vector Yt, which contains user identification information and activity identification information: the first a elements of the vector represent the corresponding residents, the following b elements represent the corresponding behavior activities, and the value of each element represents the probability of identifying the corresponding resident or activity. The dimension of the vector Yt is the sum of the number of residents a and the number of activities b.
The vector Yt is the classification result generated after the feature vector passes through the fully connected layer. The Transformer network was originally designed for natural language processing and is here applied to the field of human behavior recognition for the first time. The invention improves the Transformer network according to the requirements of the recognition task, removing its decoder and adding a fully connected layer.
In summary, the multi-user behavior recognition method based on the Transformer network provided by the invention collects and samples environmental sensor data, preprocesses the data to obtain sampling segments, adds position codes, assigns the data different importance with an attention mechanism, and finally recognizes and classifies users and activities through a fully connected layer.
The following examples are given to illustrate embodiments of the present invention. Referring to fig. 1, the present embodiment provides a method for identifying multi-user behaviors based on a Transformer network, comprising the following steps:
(1) Data is collected using environmental sensors. 37 binary sensors are installed in the workplace, and several volunteer participants are recruited to perform a series of activities in the smart home; 15 daily living activities are collected, including opening doors, climbing stairs, opening windows, drying clothes, moving furniture, cleaning floors, watering flowers and the like.
(2) Screening original data, screening data with sensor readings of ON from the collected data, removing data with attributes of OFF, and identifying each ON data as an event.
(3) The data segment is intercepted using a sliding window method, with the sliding window size set to 12.
(4) Each sample is converted into a vector by an embedding algorithm.
(5) The result of the embedding is an embedding matrix of the form R^(T×C), where T and C are respectively the time-series dimension and the channel dimension. In this process, the time-series dimension equals the sliding-window length 12, and each channel represents a corresponding sensor, of which there are 37.
(6) Position coding is added: a position vector is added to each input embedding. By adding different values to these embedding vectors, meaningful distances between them can be provided.
(7) A certain number (the Batch size) of the vectors obtained in (6) enter the encoder as input.
(8) These vectors are passed to a multi-head attention layer. A multi-head attention mechanism computes attention values under the different attention heads, so that the network pays more attention to the key frames that contribute most to behavior identification, using the following formula:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)
wherein Q, K, V represent the query vector, key vector and value vector in the attention mechanism, respectively.
(9) The new characterization vectors generated by the attention layer are normalized through Layer Normalization, which facilitates the subsequent nonlinear processing of the data by the ReLU activation function in the Feed Forward layer. The input matrix of step (7) is summed with the matrix obtained in step (8) and normalized.
(10) And then the vector is transferred to a Feed-Forward neural network for processing, and the expression capability of the characterization vector is enhanced through an activation function.
(11) And then entering a normalization layer to carry out a summation normalization step.
(12) The output is sent to the next encoder, and steps (8)-(11) are repeated; the architecture proposed by the invention comprises 6 encoders.
(13) When the two-dimensional T×C feature matrix enters the fully connected layer, it is tiled into a one-dimensional vector of length T×C.
(14) The vector from (13) enters the fully connected layers for processing. The number of fully connected layers is set to 2, with 256 hidden neurons.
(15) And a Softmax function is adopted in the full connection layer as a classifier, and the cross entropy is adopted as a loss function. The Softmax function maps the inputs of the fully-connected layer neurons to the outputs, transforming each output value into a probability corresponding to each class.
(16) Outputting a predefined resident and activity vector containing user identification information and activity identification information, the Boolean value of each element in the vector reflecting a determination of whether the corresponding resident performs the corresponding activity.
(17) Finally, Accuracy is taken as the index for evaluating the correctness of user identification and activity identification.
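The Accuracy metric of step (17) reduces to the fraction of windows whose predicted label matches the ground truth, computed separately for users and activities; the prediction and label lists below are hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of samples whose predicted label equals the ground truth."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical predictions for five test windows
user_acc = accuracy([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])  # resident labels
act_acc = accuracy([3, 3, 7, 2, 2], [3, 3, 7, 2, 5])   # activity labels
print(user_acc, act_acc)  # 0.8 0.8
```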
In summary, the multi-user behavior identification method based on the Transformer network can effectively identify multiple users and their corresponding activities. The method first collects and samples environmental sensor data, then preprocesses the data to obtain sampling segments, then adds position codes and lets an attention mechanism assign the data different importance, and finally identifies and classifies users and activities through a fully connected layer. The method is end-to-end and simple to operate; the temporal attention mechanism makes the network pay more attention to the key frames that contribute most to behavior identification; the model is light, simple and effective; and multiple users can be identified at the same time.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.
Claims (9)
1. A multi-user behavior identification method based on a Transformer network is characterized by comprising the following steps:
step 1, collecting an environment sensor data set, taking sensor data based on a time sequence as input, entering a model, and sampling through a sliding window with a fixed size;
step 2, embedding the sampled events into an initial vector, adding position codes to represent the order of the events in the sequence, and then feeding the vector into the Encoder of the Transformer network;
step 3, classifying the labels of the users and the activities by applying a top full connection layer.
2. The method for recognizing multi-user behaviors based on the Transformer network as claimed in claim 1, wherein the specific method in the step 1 comprises:
step 1.1, arranging an environmental sensor in a measured space region, and collecting user behavior data;
step 1.2, the collected environmental sensor data is represented by ON or OFF, wherein ON represents that the sensor is triggered, and OFF represents that the sensor is not triggered;
step 1.3, screening original data, removing data with the attribute of OFF, reserving data with the attribute of ON, taking each ON data as an event, and arranging the screened ON data according to a time sequence to form time sequence data;
step 1.4, segmenting the time sequence data obtained in step 1.3 to obtain data slice samples.
3. The method for recognizing multi-person behaviors based on the Transformer network as claimed in claim 1, wherein the specific method in step 1.4 comprises: arranging the screened data with ON attributes according to a time sequence to form a group of time sequence data; and acquiring original information on the time sequence data by using a preset sliding window with a fixed size, wherein the acquired result of the sliding window is used as a data slice sample.
4. The method for multi-user behavior recognition based on the Transformer network as claimed in claim 1, wherein: the preset fixed-size sliding window size k is an empirical parameter.
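A minimal sketch of the ON-event screening and fixed-size sliding-window slicing described in claims 2-4; the log format and sensor IDs are hypothetical:

```python
def slice_events(events, k):
    """Keep only ON events, order them by time, and cut sliding-window
    samples of fixed size k (stride 1)."""
    on_events = sorted(
        (e for e in events if e["state"] == "ON"),
        key=lambda e: e["time"],
    )
    return [on_events[i:i + k] for i in range(len(on_events) - k + 1)]

# Hypothetical environmental-sensor log.
log = [
    {"time": 0, "sensor": "M01", "state": "ON"},
    {"time": 1, "sensor": "M01", "state": "OFF"},
    {"time": 2, "sensor": "M02", "state": "ON"},
    {"time": 3, "sensor": "M03", "state": "ON"},
    {"time": 4, "sensor": "M02", "state": "OFF"},
    {"time": 5, "sensor": "M04", "state": "ON"},
]
samples = slice_events(log, k=3)
```

Four ON events sliced with k = 3 yield two overlapping data slice samples.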
5. The method for recognizing multi-user behaviors based on the Transformer network as claimed in claim 1, wherein the specific method in the step 2 comprises:
step 2.1, mapping the discrete data variable corresponding to each slice data sample to a continuous characterization vector through an Embedding algorithm; the Embedding algorithm one-hot encodes each sample datum and converts it into a vector;
step 2.2, the embedded result set is the embedding matrix R^(T×C), wherein T represents the time sequence dimension and C represents the channel dimension; in this process, the time sequence dimension equals the length k of the sliding window used in data slicing, and each channel represents a corresponding sensor, of which there are N;
step 2.3, adding position codes: constructing a matrix PE with the same dimensions as the embedding matrix, wherein the rows of the matrix PE represent time sequence samples and the columns represent sensors; each value in the matrix PE is obtained by the following formulas:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
wherein PE represents the position coding matrix, pos represents the serial number corresponding to the sensor, i represents the position of the row vector in the matrix, and d_model represents the dimension of the row vector;
adding the PE matrix to the embedding matrix yields a new feature vector matrix with position coding introduced;
step 2.4, inputting m row vectors of the new feature vector matrix into the Encoder, wherein the value of m is the Batch size set for the Transformer network;
step 2.5, the vectors entering the Encoder are first passed to a multi-head attention layer to obtain a new characterization vector; the multi-head attention mechanism computes attention values under the different attention heads respectively, so that the network pays more attention to the key frames that contribute most to behavior identification; the calculation is:
MultiHead(Q, K, V) = Concat(head_1, …, head_h)
wherein Q, K, and V respectively represent the Query, Key, and Value vectors in the attention mechanism: the Query vector represents the sample attributes to be matched, the Key vector represents the attributes of the sample, and the Value vector represents the information contained in the sample;
step 2.6, normalizing the new characterization vector generated by the attention layer in step 2.5 through Layer Normalization: the input matrix of step 2.4 and the matrix obtained in step 2.5 are summed and then normalized to obtain a new matrix;
step 2.7, transferring the matrix obtained in the step 2.6 to a Feed-Forward neural network Feed Forward for processing to obtain a reinforced characterization vector matrix;
step 2.8, the reinforced characterization vector matrix obtained in the step 2.7 is accessed into a normalization layer, and elements in the matrix are unitized according to rows to obtain a normalization matrix;
step 2.9, sending the output normalized matrix to the next Encoder, and so on, to obtain the final feature matrix.
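Each head of the multi-head attention layer in claim 5 computes scaled dot-product attention; the sketch below shows a single head with toy Q, K, V matrices and omits the learned projection and concatenation steps:

```python
import math

def matmul(A, B):
    # Plain list-of-lists matrix product A . B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax_rows(M):
    # Row-wise softmax with max-subtraction for numerical stability.
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]  # transpose of K
    scores = matmul(Q, K_t)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, V), weights

# Toy 2-event sequence with 2-dimensional embeddings.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1 and concentrates on the key that best matches the query, which is how the network assigns greater importance to the most informative events.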
6. The method for multi-person behavior recognition based on Transformer network as claimed in claim 1, wherein the number of row vectors input into the Encoder in step 2.4 is m, and the value of m is Batch size set by the Transformer network.
7. The method for multi-user behavior recognition based on the Transformer network as claimed in claim 1, wherein the Transformer network comprises 6 sequentially arranged Encoders.
8. The method for recognizing multi-user behaviors based on the Transformer network as claimed in claim 5, wherein the specific method in the step 3 comprises:
step 3.1, inputting the T×C two-dimensional feature vector matrix obtained in step 2.9 into the full connection layer, where it is automatically flattened into a one-dimensional vector of length T×C;
step 3.2, mapping the T×C one-dimensional feature vector to the sample label space through the full connection layer to obtain a classification result vector, wherein each element of the vector is the value of a category obtained by weighted summation of the features;
step 3.3, a Softmax function is adopted as the classifier in the full connection layer, mapping the inputs of the fully-connected-layer neurons to the outputs and converting each value of the classification result vector into a probability, yielding the final classification vector Yt; the Cross Entropy between the expected output and the actual output is taken as the loss function to measure the difference between them;
step 3.4, the whole improved Transformer-based network model finally outputs a final classification vector Yt comprising user identification information and activity identification information: the first a elements of the vector represent the corresponding residents, the following b elements represent the corresponding behavior activities, and the value of each element represents the probability of identifying the corresponding resident or activity.
9. The method for multi-person behavior recognition based on the Transformer network as claimed in claim 1, wherein the dimension of the vector Yt in step 3.4 is the sum of the number of residents a and the number of activities b.
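A sketch of reading out the classification vector Yt described in claims 8 and 9, assuming the first a elements correspond to residents and the next b to activities; the probabilities and the 0.5 threshold are illustrative:

```python
def decode_output(yt, a, b, threshold=0.5):
    """Split Yt into resident and activity probabilities and turn each
    element into a Boolean decision by thresholding."""
    assert len(yt) == a + b  # the dimension of Yt is a + b (claim 9)
    residents = [p >= threshold for p in yt[:a]]
    activities = [p >= threshold for p in yt[a:]]
    return residents, activities

# Hypothetical output for a = 2 residents and b = 3 activities.
yt = [0.91, 0.12, 0.05, 0.88, 0.33]
residents, activities = decode_output(yt, a=2, b=3)
```

Each Boolean reflects whether the corresponding resident is identified or the corresponding activity is detected, matching the output vector described in step (16).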
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110312085.7A CN113033657A (en) | 2021-03-24 | 2021-03-24 | Multi-user behavior identification method based on Transformer network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113033657A true CN113033657A (en) | 2021-06-25 |
Family
ID=76473181
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110312085.7A Pending CN113033657A (en) | 2021-03-24 | 2021-03-24 | Multi-user behavior identification method based on Transformer network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113033657A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110020623A (en) * | 2019-04-04 | 2019-07-16 | 中山大学 | Physical activity identifying system and method based on condition variation self-encoding encoder |
CN112464861A (en) * | 2020-12-10 | 2021-03-09 | 中山大学 | Behavior early recognition method, system and storage medium for intelligent human-computer interaction |
Non-Patent Citations (1)
Title |
---|
CHEN WEI; HE JIAHUAN; PEI XIPING: "Power quality disturbance classification based on phase space reconstruction and convolutional neural network", Power System Protection and Control, no. 14, 16 July 2018 (2018-07-16), pages 99 - 93 *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113255597A (en) * | 2021-06-29 | 2021-08-13 | 南京视察者智能科技有限公司 | Transformer-based behavior analysis method and device and terminal equipment thereof |
CN113256637A (en) * | 2021-07-15 | 2021-08-13 | 北京小蝇科技有限责任公司 | Urine visible component detection method based on deep learning and context correlation |
CN113256637B (en) * | 2021-07-15 | 2021-11-05 | 北京小蝇科技有限责任公司 | Urine visible component detection method based on deep learning and context correlation |
CN113627266A (en) * | 2021-07-15 | 2021-11-09 | 武汉大学 | Video pedestrian re-identification method based on Transformer space-time modeling |
CN113627266B (en) * | 2021-07-15 | 2023-08-18 | 武汉大学 | Video pedestrian re-recognition method based on transform space-time modeling |
CN113397572A (en) * | 2021-07-23 | 2021-09-17 | 中国科学技术大学 | Surface electromyographic signal classification method and system based on Transformer model |
CN113688871A (en) * | 2021-07-26 | 2021-11-23 | 南京信息工程大学 | Transformer-based video multi-label action identification method |
CN113688871B (en) * | 2021-07-26 | 2022-07-01 | 南京信息工程大学 | Transformer-based video multi-label action identification method |
CN113936339B (en) * | 2021-12-16 | 2022-04-22 | 之江实验室 | Fighting identification method and device based on double-channel cross attention mechanism |
CN113936243A (en) * | 2021-12-16 | 2022-01-14 | 之江实验室 | Discrete representation video behavior identification system and method |
CN113936339A (en) * | 2021-12-16 | 2022-01-14 | 之江实验室 | Fighting identification method and device based on double-channel cross attention mechanism |
CN115937990A (en) * | 2023-02-27 | 2023-04-07 | 珠海金智维信息科技有限公司 | Multi-person interactive action detection system and method |
CN115937990B (en) * | 2023-02-27 | 2023-06-23 | 珠海金智维信息科技有限公司 | Multi-person interaction detection system and method |
CN116127364A (en) * | 2023-04-12 | 2023-05-16 | 上海术理智能科技有限公司 | Integrated transducer-based motor imagery decoding method and system |
CN116502069A (en) * | 2023-06-25 | 2023-07-28 | 四川大学 | Haptic time sequence signal identification method based on deep learning |
CN116502069B (en) * | 2023-06-25 | 2023-09-12 | 四川大学 | Haptic time sequence signal identification method based on deep learning |
CN117576150A (en) * | 2023-11-03 | 2024-02-20 | 扬州万方科技股份有限公司 | Multi-mode multi-target 3D tracking method and device considering far-frame dependency relationship |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113033657A (en) | Multi-user behavior identification method based on Transformer network | |
CN111898736B (en) | Efficient pedestrian re-identification method based on attribute perception | |
CN111860600B (en) | User electricity utilization characteristic selection method based on maximum correlation minimum redundancy criterion | |
CN105654037A (en) | Myoelectric signal gesture recognition method based on depth learning and feature images | |
CN111859010B (en) | Semi-supervised audio event identification method based on depth mutual information maximization | |
CN110188653A (en) | Activity recognition method based on local feature polymerization coding and shot and long term memory network | |
CN110108914A (en) | One kind is opposed electricity-stealing intelligent decision making method, system, equipment and medium | |
CN110287863A (en) | A kind of gesture identification method based on WiFi signal | |
CN108392213B (en) | Psychological analysis method and device based on painting psychology | |
CN112464730B (en) | Pedestrian re-identification method based on domain-independent foreground feature learning | |
CN110119545A (en) | A kind of non-intrusive electrical load recognition methods based on stack self-encoding encoder | |
Yang et al. | Auroral sequence representation and classification using hidden Markov models | |
CN110245707B (en) | Human body walking posture vibration information identification method and system based on scorpion positioning | |
CN114176607B (en) | Electroencephalogram signal classification method based on vision transducer | |
Li et al. | Dating ancient paintings of Mogao Grottoes using deeply learnt visual codes | |
CN113069117A (en) | Electroencephalogram emotion recognition method and system based on time convolution neural network | |
CN109241870B (en) | Coal mine underground personnel identity identification method based on gait identification | |
CN115457403A (en) | Intelligent crop identification method based on multi-type remote sensing images | |
Hagiwara et al. | BEANS: The benchmark of animal sounds | |
CN117540908A (en) | Agricultural resource integration method and system based on big data | |
CN107045624B (en) | Electroencephalogram signal preprocessing and classifying method based on maximum weighted cluster | |
Tang et al. | Transound: Hyper-head attention transformer for birds sound recognition | |
CN110889335A (en) | Human skeleton double-person interaction behavior recognition method based on multi-channel space-time fusion network | |
CN113705695A (en) | Power distribution network fault data identification method based on convolutional neural network | |
CN117272230A (en) | Non-invasive load monitoring method and system based on multi-task learning model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||