CN117725528A - Depth feature fusion-based personnel action recognition method in industrial scene - Google Patents

Depth feature fusion-based personnel action recognition method in industrial scene

Info

Publication number
CN117725528A
CN117725528A (application CN202410127029.XA)
Authority
CN
China
Prior art keywords
feature
data
personnel
vector
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410127029.XA
Other languages
Chinese (zh)
Inventor
杨岳毅
翟家博
王海泉
温盛军
寇祥
王瑷珲
于浩玮
王瑞琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongyuan University of Technology
Original Assignee
Zhongyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongyuan University of Technology filed Critical Zhongyuan University of Technology
Priority to CN202410127029.XA priority Critical patent/CN117725528A/en
Publication of CN117725528A publication Critical patent/CN117725528A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a personnel action recognition method for industrial scenes based on depth feature fusion, which comprises the following steps: acquiring personnel production action data with a six-axis wearable sensor; preprocessing the personnel production action data to obtain preprocessed data; constructing a depth feature fusion neural network model based on a convolutional neural network with combined attention and a bidirectional long short-term memory network, and inputting the preprocessed data into the model for action classification to obtain a personnel action type recognition result. The convolutional neural network with the combined attention mechanism has strong capability for extracting local features of the signals, the bidirectional long short-term memory network has good capability for extracting timing features of the signals, and the extracted local and timing features are deeply fused, so that more useful depth feature information is extracted from the signals and the accuracy of personnel action classification is improved.

Description

Depth feature fusion-based personnel action recognition method in industrial scene
Technical Field
The invention relates to the technical field of safety management of industrial production sites, in particular to a personnel action recognition method in an industrial scene based on depth feature fusion.
Background
In industrial production, some raw materials have dangerous characteristics such as flammability, explosiveness and high toxicity, and the industrial production process is complex, for example controlling the temperature and flow of a feed/discharge port, or the temperature, pressure and flow rate in a reaction kettle; this places higher demands on operator actions.
Although various video monitoring devices are installed in current industrial production workshops, most can only be used for manual monitoring and after-the-fact playback and cannot automatically identify personnel actions in real time. Moreover, when machine vision technology is applied to industrial scenes, problems such as insufficient light, severe occlusion and limited fields of view are common, which limits the application of machine-vision-based personnel action recognition. Therefore, the invention discloses a method for identifying personnel actions in industrial scenes based on depth feature fusion.
Disclosure of Invention
The invention aims to provide a method for identifying personnel actions in an industrial scene based on depth feature fusion, so as to solve the problems in the prior art.
The invention provides a personnel action recognition method in an industrial scene based on depth feature fusion, which comprises the following steps:
acquiring personnel production action data based on a six-axis wearable sensor;
preprocessing the personnel production action data to obtain preprocessed data;
and constructing a depth feature fusion neural network model based on a convolution neural network with combined attention and a bidirectional long-short-time memory network, and inputting the preprocessing data into the depth feature fusion neural network model for action classification to obtain a personnel action type recognition result.
Optionally, the process of preprocessing the personnel production action data includes:
smoothing the personnel production action data based on a Butterworth low pass filter to obtain filtered data;
normalizing the filtered data to obtain normalized data;
dividing the normalized data by adopting a window segmentation technology to obtain a plurality of data fragments;
and obtaining preprocessing data based on a plurality of the data fragments.
Optionally, the butterworth low-pass filter has a calculation formula as follows:
$H(D) = \dfrac{1}{1 + \left(D / D_0\right)^{2n}}$

wherein $D_0$ is the cut-off frequency of the filter and n is the order of the filter; in practical applications, $D_0$ and n can be adjusted according to different filtering requirements.
Optionally, the depth fusion network model comprises a convolutional neural network model with combined attention and a bidirectional long-short-term memory network model;
the process for obtaining the recognition result of the personnel action type based on the depth fusion network model comprises the following steps:
inputting the preprocessed data into the convolutional neural network model with the combined attention to obtain a spatial feature vector;
meanwhile, inputting the preprocessing data into the bidirectional long-short-time memory network model to obtain a time sequence feature vector;
performing matrix reshaping on the time sequence feature vector to obtain a signal feature vector with the same dimension as the space feature vector;
performing feature vector splicing on the spatial feature vector and the signal feature vector to obtain a depth fusion feature vector;
and inputting the depth fusion feature vector into a full connection layer for regression analysis to obtain a personnel action type recognition result.
Optionally, activating the fused feature vector by adopting a SaveReLU activation function in the process of carrying out regression analysis on the full connection layer;
the calculation formula of the SaveReLU activation function is as follows:
where e is a constant with a value in the range (0, 0.5].
Optionally, the convolutional neural network model with combined attention comprises a combined attention module, wherein the combined attention module comprises a channel attention unit and a spatial attention unit;
wherein the process of obtaining the local feature vector based on the combined attention module comprises:
the spatial attention unit carries out convolution processing on the preprocessed data to obtain feature mapping on a time signal; processing the feature mapping on the time sequence signal based on an activation function to obtain a time sequence weight vector;
meanwhile, the spatial attention unit encodes the characteristic information of the preprocessed data to obtain the characteristic information among the local time sequence signal segments;
multiplying the time weight vector and the characteristic information between the local time sequence signal segments to obtain a first calibration characteristic diagram;
adding the first calibration feature map and the preprocessing data based on a residual error method to obtain a spatial attention unit feature map;
after the space attention unit feature map is input into the channel attention unit, a global average feature vector and a global maximum feature vector are obtained through an average pooling layer and a maximum pooling layer of the channel attention unit respectively;
obtaining a channel attention feature map based on the global average feature vector and the global maximum feature vector;
multiplying the spatial attention unit feature map and the channel attention feature map to obtain a second calibration feature map;
adding the second calibration feature map and the spatial attention unit feature map based on a residual error method to obtain an optimized feature map;
and obtaining local feature vectors based on the optimized feature map.
Optionally, the periodic signal features extracted by the bidirectional long short-term memory network (BiLSTM) model are calculated as follows:

$\overrightarrow{h_t} = \mathrm{LSTM}\left(x_t, \overrightarrow{h_{t-1}}\right), \quad t = 1, 2, \ldots, N$

$\overleftarrow{h_t} = \mathrm{LSTM}\left(x_t, \overleftarrow{h_{t+1}}\right), \quad t = 1, 2, \ldots, N$

$h_t = \omega_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t$

wherein $\overrightarrow{h_t}$ is the output vector of the forward LSTM hidden layer at time t, determined jointly by the current input vector $x_t$ and the forward LSTM output vector $\overrightarrow{h_{t-1}}$ of the previous moment; $\overleftarrow{h_t}$ is the output vector of the reverse LSTM hidden layer at time t, determined jointly by the current input vector $x_t$ and the reverse LSTM output vector $\overleftarrow{h_{t+1}}$ of the previous step in the reverse direction; $h_t$ is the output of the BiLSTM model; $\omega_t$ is the weight matrix of the forward LSTM output; $v_t$ is the weight matrix of the reverse LSTM output; and $b_t$ is the bias of the weight matrix.
The invention has the following technical effects:
compared with the traditional mathematical model prediction method, the deep learning action recognition method can better process nonlinear signals, avoids the excessive dependence of the traditional method on professional knowledge and priori experience in feature extraction and mathematical model selection, and can extract more signal features in feature information extraction.
In the combined attention module designed by the invention, the core of the spatial attention unit is to learn the local sequence features related to the action, and the core of the channel attention unit is to adaptively enhance the activation maps related to the action features while suppressing action-irrelevant features. This design assigns different weights to the input channel features and timing features, so the depth features are captured better.
The convolutional neural network with the combined attention mechanism in the depth feature fusion neural network model has stronger signal local feature extraction capability, the bidirectional long-short-term memory network has good signal time sequence feature extraction capability, and the extracted signal local features and time sequence features are subjected to depth fusion, so that more useful depth feature information in the signals can be extracted, and the accuracy of personnel action classification is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for identifying personnel actions in an industrial scene based on depth feature fusion in an embodiment of the invention;
FIG. 2 is a schematic diagram of the wearing position of an IMU when a personnel action data set is used for collecting data in an industrial background constructed in an embodiment of the invention;
FIG. 3 is a schematic diagram of a one-dimensional convolutional neural network model with downsampling properties according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a combined attention module according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a convolutional neural network model with a combined attention mechanism constructed in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a spatial attention module in a combined attention module according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a channel attention module in a combined attention module according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a BiLSTM module of the bidirectional long and short term memory network in an embodiment of the invention;
fig. 9 is a schematic structural diagram of a depth feature fusion neural network model in an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Industrial production occupies a very important position in the national economic chain and is a key support for promoting high-quality economic development; it plays an important role in stabilizing economic growth and meeting residents' growing needs for a better life. However, in industrial production some raw materials have dangerous characteristics such as flammability, explosiveness and high toxicity, and a safety problem during processing or storage can cause immeasurable loss of property and even casualties.
Personnel factors are one of the main causes of safety accidents on industrial production sites. Because the industrial production process is complex, for example controlling the temperature and flow of a feed/discharge port, or the temperature, pressure and flow rate in a reaction kettle, high demands are placed on operator actions. To guarantee real-time recognition of operators' actions, the invention therefore provides a personnel action recognition method for industrial scenes based on depth feature fusion; personnel action recognition based on inertial measurement units not only addresses the problems above but also effectively guarantees recognition accuracy while reducing cost.
As shown in fig. 1, this embodiment discloses a method for identifying personnel actions in an industrial scene based on depth feature fusion, which includes:
acquiring personnel production action data based on a six-axis wearable sensor; preprocessing personnel production action data to obtain preprocessed data; the specific implementation process comprises the following steps:
step S1: and determining the common action category of the personnel according to the characteristics of the industrial scene.
Step S2: fixing 6-axis wearable sensors at key positions of human arms, legs and the like, and collecting 3-axis acceleration and 3-axis gyroscope data of each key position of the human body;
the data used for training the model are acquired by IMUs (a triaxial accelerometer and a triaxial gyroscope), and the wearing positions of the 4 IMUs during acquisition are shown in a figure 2, wherein a yellow rectangle represents the IMU; the collected actions under the industrial background are divided into 10 types, namely walking, ascending stairs, descending stairs, sitting, running, screwing a valve leftwards, screwing a valve rightwards, climbing a ladder upwards, climbing a ladder downwards and leaning, and the collection frequency of acceleration and a gyroscope is 148Hz for ensuring the consistency of data in time;
step S3: filtering and normalizing the motion data produced by personnel in the industrial scene in the step S2, and then dividing;
because the inherent noise of the sensor and unintentional jitter of a tester can lead to noise of acquired data, the acquired data needs to be filtered and denoised before feature extraction, the embodiment selects a low-pass filter with the largest flat amplitude response, namely a Butterworth low-pass filter, in consideration of the fact that the frequency of a human motion signal is mainly concentrated in a low-frequency section of 0-10 Hz, and determines that the order of the filter is 4 according to certain given indexes such as the acceleration and gyroscope acquisition frequency of 128Hz, the pass and stop band cut-off frequency and the like, and therefore the 4-order Butterworth low-pass filter with the cut-off frequency of 10Hz is selected for carrying out smoothing treatment on the signal to obtain filtered data;
The Butterworth low-pass filter is calculated as:

$H(D) = \dfrac{1}{1 + \left(D / D_0\right)^{2n}}$

wherein $D_0$ is the cut-off frequency of the filter and n is the order of the filter; in practical applications, $D_0$ and n can be adjusted according to different filtering requirements.
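As an illustrative sketch (not part of the patented method), the 4th-order Butterworth low-pass filtering with a 10 Hz cut-off described above could be implemented with SciPy; the sampling rate, the channel layout and the use of zero-phase filtering are assumptions made for this example:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def butterworth_smooth(signal, fs=148.0, cutoff=10.0, order=4):
    """Low-pass filter one IMU channel with a Butterworth filter.

    signal : 1-D array of raw accelerometer/gyroscope samples
    fs     : sampling frequency in Hz (assumed value)
    cutoff : cut-off frequency D_0 in Hz
    order  : filter order n
    """
    nyquist = fs / 2.0
    b, a = butter(order, cutoff / nyquist, btype="low")
    return filtfilt(b, a, signal)  # zero-phase filtering avoids phase distortion

# Example: filter every channel of one recording
# (4 IMUs x 6 axes = 24 channels is an assumption based on the description).
raw = np.random.randn(10000, 24)
filtered = np.stack([butterworth_smooth(raw[:, c]) for c in range(raw.shape[1])], axis=1)
```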
The filtered data are normalized with a linear function to obtain normalized data; to improve the network convergence speed and avoid over-saturation of neurons, the maximum-minimum normalization method is used to map the original signal data to [0, 1];
the calculation formula of the normalization process is as follows:
$x_{norm} = \dfrac{x - x_{min}}{x_{max} - x_{min}}$

where x is the input signal value, $x_{max}$ and $x_{min}$ are the maximum and minimum of all signals, and $x_{norm}$ is the normalized value.
The normalized data are divided into a plurality of data segments with a window segmentation technique. Most behavior recognition methods divide time-series data into smaller time windows, extract features in each window and recognize the corresponding action; the window length and the window overlap rate strongly influence the accuracy and real-time performance of human behavior recognition.
At present, the window length is generally set to be between 0.5 and 10 seconds, and the window overlapping rate is generally set to be 20% -50%. In the constructed personnel action data set in the industrial scene, taking the time span of the target action and the requirement on real-time into consideration, in the embodiment, a sliding window with the length of 860ms and the overlapping rate of 50% is adopted to divide the sample, so that training and testing samples are obtained.
A plurality of data segments are obtained by the division; all data segments together form a sample set S, which is split into a training set and a test set at a ratio of 7:3. The specific process is as follows: the motion data are segmented with a sliding window of width 860 ms and overlap rate 50%; the segmented motion data are $F_m = [x_1, x_2, \ldots, x_t]$, where m denotes the number of the motion data segment and each data segment has a length of 128; all data segments together form the sample set $S = [F_1, F_2, \ldots, F_m]$;
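A minimal sketch, under assumed array shapes, of the min-max normalization, the 860 ms sliding-window segmentation with 50% overlap, and the 7:3 train/test split described above (the window length of 128 samples is taken from the text):

```python
import numpy as np

def min_max_normalize(x):
    """Map each channel to [0, 1]: x_norm = (x - x_min) / (x_max - x_min)."""
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return (x - x_min) / (x_max - x_min + 1e-12)

def sliding_windows(x, window=128, overlap=0.5):
    """Cut a (num_samples, channels) recording into overlapping segments F_m."""
    step = int(window * (1 - overlap))
    return np.stack([x[i:i + window] for i in range(0, len(x) - window + 1, step)])

recording = np.random.randn(10000, 24)                    # one filtered recording (assumed shape)
segments = sliding_windows(min_max_normalize(recording))  # sample set S = [F_1, ..., F_m]

split = int(0.7 * len(segments))                          # 7:3 train/test split
train_set, test_set = segments[:split], segments[split:]
```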
Preprocessing data is obtained based on the plurality of data segments.
A depth fusion network model is constructed based on a convolutional neural network with combined attention and a bidirectional long short-term memory network, and the preprocessed data are input into the depth fusion network model for action classification to obtain the personnel action type recognition result. The specific implementation steps are:
inputting the preprocessed data into a convolutional neural network model with combined attention to obtain a spatial feature vector; meanwhile, inputting the preprocessed data into a bidirectional long-short-time memory network model to obtain a time sequence feature vector; matrix reshaping is carried out on the time sequence feature vector to obtain a signal feature vector with the same dimension as the space feature vector; performing feature vector splicing on the spatial feature vector and the signal feature vector to obtain a depth fusion feature vector; and inputting the depth fusion feature vector into a full connection layer for regression analysis to obtain a personnel action type recognition result.
Step S4: and inputting the data of the action sample training set into a convolutional neural network with a combined attention mechanism and a bidirectional long-short-time memory network depth feature fusion neural network model, and performing model training.
Specifically, this embodiment constructs a one-dimensional convolutional neural network with down-sampling properties, as shown in fig. 3. Hierarchical (strided) convolution is adopted instead of max pooling to aggregate feature information, which reduces information loss. Convolution kernels of different sizes are used in the network to learn the long- and short-term characteristics of the input signal; as the number of network layers increases, the convolution kernels gradually shrink while the number of convolution channels and the convolution stride gradually increase. Tests show that this improves the recognition accuracy of the model (an illustrative sketch is given after Table 1).
TABLE 1
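Since Table 1 is not reproduced here, the following PyTorch sketch only illustrates the idea of the down-sampling 1-D backbone (strided, hierarchical convolutions instead of max pooling, with shrinking kernels and growing channel counts and strides); all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class DownsamplingCNN1D(nn.Module):
    """1-D CNN that aggregates features with strided convolutions instead of max pooling."""
    def __init__(self, in_channels=24):
        super().__init__()
        # Kernel sizes shrink while channel counts and strides grow with depth (assumed values).
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=9, stride=1, padding=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):                 # x: (batch, channels, 128 time steps)
        return self.features(x)

feats = DownsamplingCNN1D()(torch.randn(8, 24, 128))
print(feats.shape)                        # torch.Size([8, 256, 16])
```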
In a specific embodiment, the combined attention module is composed of a channel attention unit and a space attention unit;
the channel attention unit models the interdependence among channels, the characteristics of each layer are optimized in a self-adaptive mode, the attention weight of the model to important action characteristics can be effectively improved by the space attention unit, and the learning of the important action signal characteristics is selectively enhanced, so that more distinguishing characteristics are obtained.
Because different activation maps identify data features to different extents, not all features are related to human motion information. Therefore, the combined attention module formed by the channel attention unit and the space attention unit is introduced, so that not only can information irrelevant to main characteristics of the action signals be restrained, more distinguishing characteristics can be learned from the input one-dimensional signals by the model, but also the characteristic extraction capability of the model can be effectively improved.
The structure of the combined attention module is schematically shown in fig. 4.
The spatial attention module is shown in fig. 6. A 1×1 convolution layer first produces a mapping of the input features on the time signal, i.e. the feature map on the time signal; a sigmoid function then processes this feature map to obtain the time weight vector. The attention module also uses a convolution layer to encode the feature information between local time-signal segments to prevent excessive focus on individual segments. The encoded features are multiplied by the time weight vector to obtain the first calibration feature map, and the first calibration feature map is added to the preprocessed data by a residual method to obtain the spatial attention unit feature map, which serves as the input feature map of the channel attention unit.
The working process of the channel attention module is divided into three steps: compression, excitation, and product-and-addition, as shown in fig. 7. The input is the feature map processed by the spatial attention module. The input spatial attention unit feature map is first compressed by global average pooling and global max pooling to obtain the global average and global maximum of each channel. The excitation operation then passes the globally average-pooled and max-pooled channel features through a multi-layer perceptron consisting of two fully connected layers and a sigmoid activation function and adds them to obtain the final channel attention feature map. Finally, the input feature map is multiplied by the channel attention feature map to obtain the second calibration feature map, which is added to the spatial attention unit feature map by a residual method to generate the optimized feature map of the combined attention module.
The specific implementation steps of the combined attention module are as follows:
the space attention unit firstly obtains the mapping of the features on the time signal through a 1X 1 convolution layer, namely the feature projection on the time signal; and then, the feature mapping on the time signal is processed through the Sigmoid activation function to obtain a time weight vector, and the convolution layer with the size of 1 multiplied by 1 can reduce the input data dimension and can enable the network to extract more deep features.
The method comprises the steps that a spatial attention unit simultaneously encodes feature information among local time signal segments through a one-dimensional convolution layer according to feature images to be processed, namely preprocessing data, so as to prevent the attention unit from focusing over part of the signal segments, and then multiplication operation is carried out on the feature information among the local time signal segments and a time weight vector to obtain a first calibration feature image;
the spatial attention unit introduces the residual concept to add the first calibration feature map to the feature map to be processed to obtain the input feature map of the channel attention unit, i.e. the spatial attention unit feature map, which aims to avoid that repeated calibration results in partial feature loss.
The channel attention unit first pools the acquired input feature map with an average pooling layer and a max pooling layer to obtain feature vectors; because a single pooling operation loses too much information, the average pooling layer and the max pooling layer are connected in parallel, which makes the extracted deep features more comprehensive and richer;
the global average feature vector and the global maximum feature vector are respectively transmitted to a multi-layer perceptron through the full-connection layer and the Sigmoid activation function, and then added to obtain a channel attention feature map;
and multiplying the channel attention feature map and the input feature map to obtain a second calibration feature map.
The channel attention unit introduces the residual concept of adding the second calibration feature map to the input feature map of the channel attention unit to obtain an optimized feature map, which aims to avoid partial feature loss caused by repeated calibration.
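A hedged PyTorch sketch of the combined attention module as described above — spatial attention (1×1 convolution and sigmoid time weights, local segment encoding, residual calibration) followed by channel attention (parallel average and max pooling, a shared multi-layer perceptron, sigmoid, residual calibration); the local kernel size and the reduction ratio are assumptions:

```python
import torch
import torch.nn as nn

class SpatialAttention1D(nn.Module):
    """Weights each time step of a (batch, C, T) feature map."""
    def __init__(self, channels, local_kernel=7):
        super().__init__()
        self.time_weight = nn.Sequential(nn.Conv1d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.local_encode = nn.Conv1d(channels, channels, kernel_size=local_kernel,
                                      padding=local_kernel // 2)   # encodes local segment context

    def forward(self, x):
        w = self.time_weight(x)                    # (batch, 1, T) time weight vector
        calibrated = self.local_encode(x) * w      # first calibration feature map
        return x + calibrated                      # residual addition

class ChannelAttention1D(nn.Module):
    """Re-weights channels using global average and max statistics."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg = self.mlp(x.mean(dim=2))              # global average pooling branch
        mx = self.mlp(x.amax(dim=2))               # global max pooling branch
        w = self.sigmoid(avg + mx).unsqueeze(-1)   # channel attention feature map
        return x + x * w                           # second calibration + residual

class CombinedAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.spatial = SpatialAttention1D(channels)
        self.channel = ChannelAttention1D(channels)

    def forward(self, x):
        return self.channel(self.spatial(x))

out = CombinedAttention(64)(torch.randn(8, 64, 128))   # optimized feature map, same shape
```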
In a specific embodiment, a structural schematic diagram of the convolutional neural network model with the combined attention mechanism is shown in fig. 5. By embedding combined attention modules at different network depths, the constructed model lets the input adaptively optimize the feature mapping from both the channel and spatial perspectives, learns deeper action features, and further mines the latent relation between the model's input and output, which is expected to yield more interpretable feature knowledge.
The specific implementation steps of the convolutional neural network model with the combined attention mechanism are as follows:
the input layer inputs the motion sample data into a first convolution layer and then obtains a first feature map through convolution operation;
the first combined attention module performs depth feature extraction on the first feature map to obtain a first optimized feature map;
the second convolution layer carries out convolution operation on the first optimized feature map to obtain a second feature map;
the second combined attention module performs depth feature extraction on the second feature map to obtain a second optimized feature map;
the third convolution layer carries out convolution operation on the second optimized feature map to obtain a third feature map;
the third combined attention module performs depth feature extraction on the third feature map to obtain a third optimized feature map;
the fourth convolution layer carries out convolution operation on the third optimized feature map to obtain a fourth feature map;
the fourth combined attention module performs depth feature extraction on the fourth feature map to obtain a fourth optimized feature map;
the average pooling layer reduces the dimension of the fourth optimization feature map and transmits the fourth optimization feature map to the full-connection layer;
the fully connected layer transmits the feature map from the average pooling layer to the Softmax layer, and the classifier in the Softmax layer performs personnel action recognition and outputs the classification label;
the classification referred to herein is to use a training data set with labels when training a deep feature fusion neural network model, where each sample has a known class label, and the model performs classification by learning to extract features from the input one-dimensional data and mapping them to the corresponding class, and in general, the model performs a chain derivation algorithm to minimize the cross entropy loss function and obtain a set of weight coefficients. When the test is carried out, the input data is multiplied by the weight coefficient, the value with the maximum probability is used as output through the Softmax activation function, the pseudo tag is obtained, and the pseudo tag is consistent with the real tag, namely the classification is successful.
Although the convolutional neural network with the combined attention mechanism has excellent spatial signal extraction capability, it lacks the timing signal extraction capability of a long short-term memory network. If the two kinds of feature signals are deeply fused, their respective advantages are retained, the shortcomings of a single network are avoided, and the invention can extract as much useful depth feature information from the signals as possible.
For data with periodic signal characteristics, such as acceleration and gyroscope data, the bidirectional long short-term memory network has a good feature extraction effect, mainly because it is based on the LSTM and combines a forward LSTM network and a reverse LSTM network; the model can therefore effectively consider the relation between the preceding and following parts of the signal when extracting timing features, so the periodic signal characteristics in the data are extracted effectively.
Fig. 8 shows the bidirectional long short-term memory network (BiLSTM); its feature extraction formulas are:

$\overrightarrow{h_t} = \mathrm{LSTM}\left(x_t, \overrightarrow{h_{t-1}}\right), \quad t = 1, 2, \ldots, N$

$\overleftarrow{h_t} = \mathrm{LSTM}\left(x_t, \overleftarrow{h_{t+1}}\right), \quad t = 1, 2, \ldots, N$

$h_t = \omega_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t$

wherein $\overrightarrow{h_t}$ is the output vector of the forward LSTM hidden layer at time t, determined jointly by the current input vector $x_t$ and the forward LSTM output vector $\overrightarrow{h_{t-1}}$ of the previous moment; $\overleftarrow{h_t}$ is the output vector of the reverse LSTM hidden layer at time t, determined jointly by the current input vector $x_t$ and the reverse LSTM output vector $\overleftarrow{h_{t+1}}$ of the previous step in the reverse direction; $h_t$ is the output of the bidirectional long short-term memory network model; $\omega_t$ is the weight matrix of the forward LSTM output; $v_t$ is the weight matrix of the reverse LSTM output; and $b_t$ is the bias of the weight matrix.
Meanwhile, considering that each unit's output influences the final task result to a different degree, a self-attention mechanism is added for network optimization, assigning a weight to the output of each unit so that the timing features of personnel actions are extracted more effectively (see the sketch below).
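A minimal PyTorch sketch of the bidirectional LSTM branch with a simple self-attention weighting over the per-step outputs; the hidden size and the exact attention form are assumptions:

```python
import torch
import torch.nn as nn

class BiLSTMBranch(nn.Module):
    """Extracts timing features from (batch, T, channels) segments."""
    def __init__(self, in_channels=24, hidden=64):
        super().__init__()
        self.bilstm = nn.LSTM(in_channels, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)       # scores each time step's output

    def forward(self, x):
        h, _ = self.bilstm(x)                      # h: (batch, T, 2*hidden), forward || backward
        weights = torch.softmax(self.attn(h), dim=1)
        return (weights * h).sum(dim=1)            # weighted timing feature vector

timing_vec = BiLSTMBranch()(torch.randn(8, 128, 24))   # (batch, 2*hidden)
```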
In the feature fusion process, the fused features come from the feature information separately extracted by the convolutional neural network with the combined attention mechanism and by the bidirectional long short-term memory network. The extracted feature information is output as feature vectors, and the feature fusion layer concatenates the two output feature vectors into a new feature vector. The fusion mode is back-end fusion: the feature signals separately extracted by the two networks are fused, so that both the local feature signals and the timing feature signals in the data are obtained.
Finally, the fused features are input into the fully connected layer, which is a regression analysis layer. Because the collected gyroscope data contain both positive and negative values, using a ReLU activation function would give a zero gradient for inputs smaller than zero during back-propagation, so some ReLU neurons could never be activated in subsequent training; the SaveReLU activation function is therefore adopted for activation;
the calculation formula of the SaveReLU activation function is:
e is a constant in the range (0, 0.5], which ensures that the derivative of SaveReLU is always non-zero; the gradient for inputs smaller than zero can therefore still be calculated during back-propagation, avoiding the dead-neuron problem.
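The back-end fusion and the fully connected classification head might be sketched as follows. The patent's SaveReLU formula is not reproduced in this text, so the example assumes a leaky-ReLU-style function with slope e on negative inputs, consistent only with the stated requirement of a never-zero derivative; this is an assumption, not the patented definition:

```python
import torch
import torch.nn as nn

def save_relu(x, e=0.3):
    """Assumed SaveReLU: identity for x > 0, slope e (0 < e <= 0.5) otherwise,
    so the derivative never vanishes during back-propagation."""
    return torch.where(x > 0, x, e * x)

class FusionHead(nn.Module):
    """Concatenates the CNN spatial vector and the reshaped BiLSTM timing vector."""
    def __init__(self, spatial_dim=256, timing_dim=128, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(spatial_dim + timing_dim, num_classes)

    def forward(self, spatial_vec, timing_feat):
        timing_vec = timing_feat.reshape(timing_feat.size(0), -1)   # matrix reshaping to 1-D
        fused = torch.cat([spatial_vec, timing_vec], dim=1)         # depth fusion feature vector
        return self.fc(save_relu(fused))

logits = FusionHead()(torch.randn(8, 256), torch.randn(8, 128))     # (batch, num_classes)
```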
Step S5: and inputting the data of the motion sample test set to the trained depth feature fusion neural network model, and performing motion classification to obtain a classification result.
The structure of the depth feature fusion neural network model is shown in fig. 9.
The model training adopts a cross-entropy loss function, and the key hyperparameter settings are shown in Table 2:
TABLE 2
The performance of the model is evaluated by using the test data sample, and the specific method is to calculate the recognition accuracy, precision, recall rate and F1 value of the model to various actions of the test data.
Firstly, the depth characteristic fusion neural network model provided by the invention is compared and analyzed with two advanced algorithm models Deep ConvLSTM and H-LSTM under the condition of personnel action data sets in the industrial background, so that the superiority of the proposed algorithm model is proved.
Secondly, in order to prove the effectiveness of each step of improvement from the one-dimensional convolutional neural network model to the depth feature fusion neural network model, verification is needed through an ablation experiment.
The specific experimental strategy is as follows:
step S1, verifying a personnel action data set of a traditional one-dimensional convolutional neural network under an industrial background
Step S2, based on the model in step S1, improving the model into a down-sampled one-dimensional convolutional neural network and verifying a personnel action data set in an industrial background
Step S3, based on the model in the step S2, carrying out parallel training on the model and a bidirectional long-short-time memory network, then fusing and outputting the model and verifying a personnel action data set in an industrial background
Step S4, based on the model in step S3, embedding a combined attention module behind each convolution layer and verifying the personnel action data set in the industrial context
Experimental execution strategy:
according to the embodiment, the personnel action data set under the industrial background is firstly collected, the data set is subjected to filtering processing, then the data normalization is adopted to perform model data preprocessing, the normalization can convert the data into the data set between 0 and 1, the situation that the data value difference is overlarge to cause the coverage of the decimal number in the data set to be excessively reduced can be avoided through the data normalization, and meanwhile, the situation that the model optimization path is deviated when the situation occurs is avoided.
After data normalization, the data are divided and divided by a sliding window with the width of 860ms and the overlapping rate of 50%, and the divided data set is divided according to the following steps: 3, the proportion of the model is divided into a training set and a testing set, the training set is used as model input training data for model feature extraction and parameter training work, the testing set can be used for verifying a model predicted value, and model parameter fine adjustment is carried out through testing set verification.
Firstly, a training data set is transmitted into a convolutional neural network model with a combined force injection mechanism to extract local characteristics, and meanwhile, data is transmitted into a bidirectional long-short-time memory network model to extract periodic signals. The feature extracted by the convolutional neural network with the combined attention mechanism is a two-dimensional matrix, and the feature extracted by the bidirectional long-short-time memory network is a one-dimensional vector, so that the two-dimensional matrix is remodelled into the one-dimensional vector before fusion. And finally, inputting the feature vectors after feature fusion into a full-connection layer, carrying out regression analysis on the full-connection layer, extracting local features and periodic features of data through the steps, gradually optimizing model parameters in the training process, and finishing model training.
And finally, identifying the action type of the worker by combining the extracted characteristics with a model regression analysis result.
The one-dimensional convolutional neural network model is a down-sampling one-dimensional convolutional neural network model: hierarchical convolution first reduces the feature dimension without losing information, convolution kernels of different sizes in the network learn the long- and short-term characteristics of the input signal, and as the number of network layers increases the number of convolution channels and the stride gradually increase while the convolution kernels gradually shrink.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (7)

1. The method for identifying the personnel actions in the industrial scene based on depth feature fusion is characterized by comprising the following steps:
acquiring personnel production action data based on a six-axis wearable sensor;
preprocessing the personnel production action data to obtain preprocessed data;
and constructing a depth feature fusion neural network model based on a convolution neural network with combined attention and a bidirectional long-short-time memory network, and inputting the preprocessing data into the depth feature fusion neural network model for action classification to obtain a personnel action type recognition result.
2. The method for identifying personnel actions in an industrial scene based on depth feature fusion according to claim 1, wherein the process of preprocessing the personnel production action data comprises:
smoothing the personnel production action data based on a Butterworth low pass filter to obtain filtered data;
normalizing the filtered data to obtain normalized data;
dividing the normalized data by adopting a window segmentation technology to obtain a plurality of data fragments;
and obtaining preprocessing data based on a plurality of the data fragments.
3. The method for identifying personnel actions in an industrial scene based on depth feature fusion according to claim 2, wherein the calculation formula of the butterworth low-pass filter is as follows:
$H(D) = \dfrac{1}{1 + \left(D / D_0\right)^{2n}}$

wherein $D_0$ is the cut-off frequency of the filter and n is the order of the filter; in practical applications, $D_0$ and n can be adjusted according to different filtering requirements.
4. The method for identifying personnel actions in an industrial scene based on depth feature fusion according to claim 1, wherein the depth fusion network model comprises a convolutional neural network model with combined attention and a bidirectional long-short-term memory network model;
the process for obtaining the recognition result of the personnel action type based on the depth fusion network model comprises the following steps:
inputting the preprocessed data into the convolutional neural network model with the combined attention to obtain a spatial feature vector;
meanwhile, inputting the preprocessing data into the bidirectional long-short-time memory network model to obtain a time sequence feature vector;
performing matrix reshaping on the time sequence feature vector to obtain a signal feature vector with the same dimension as the space feature vector;
performing feature vector splicing on the spatial feature vector and the signal feature vector to obtain a depth fusion feature vector;
and inputting the depth fusion feature vector into a full connection layer for regression analysis to obtain a personnel action type recognition result.
5. The method for identifying personnel actions in an industrial scene based on depth feature fusion according to claim 4, wherein the fused feature vector is activated by adopting a SaveReLU activation function in the process of performing regression analysis on a full connection layer;
the calculation formula of the SaveReLU activation function is as follows:
where e is a constant with a value in the range (0, 0.5].
6. The method for identifying personnel actions in an industrial scene based on depth feature fusion according to claim 4, wherein the convolutional neural network model with combined attention comprises a combined attention module, wherein the combined attention module comprises a channel attention unit and a spatial attention unit;
wherein the process of obtaining the local feature vector based on the combined attention module comprises:
the spatial attention unit carries out convolution processing on the preprocessed data to obtain feature mapping on a time signal; processing the feature mapping on the time sequence signal based on an activation function to obtain a time sequence weight vector;
meanwhile, the spatial attention unit encodes the characteristic information of the preprocessed data to obtain the characteristic information among the local time sequence signal segments;
multiplying the time weight vector and the characteristic information between the local time sequence signal segments to obtain a first calibration characteristic diagram;
adding the first calibration feature map and the preprocessing data based on a residual error method to obtain a spatial attention unit feature map;
after the space attention unit feature map is input into the channel attention unit, a global average feature vector and a global maximum feature vector are obtained through an average pooling layer and a maximum pooling layer of the channel attention unit respectively;
obtaining a channel attention feature map based on the global average feature vector and the global maximum feature vector;
multiplying the spatial attention unit feature map and the channel attention feature map to obtain a second calibration feature map;
adding the second calibration feature map and the spatial attention unit feature map based on a residual error method to obtain an optimized feature map;
and obtaining local feature vectors based on the optimized feature map.
7. The method for identifying personnel actions in an industrial scene based on depth feature fusion according to claim 6, wherein the periodic signal features extracted by the bidirectional long short-term memory network (BiLSTM) model are calculated as follows:

$\overrightarrow{h_t} = \mathrm{LSTM}\left(x_t, \overrightarrow{h_{t-1}}\right), \quad t = 1, 2, \ldots, N$

$\overleftarrow{h_t} = \mathrm{LSTM}\left(x_t, \overleftarrow{h_{t+1}}\right), \quad t = 1, 2, \ldots, N$

$h_t = \omega_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t$

wherein $\overrightarrow{h_t}$ is the output vector of the forward LSTM hidden layer at time t, determined jointly by the current input vector $x_t$ and the forward LSTM output vector $\overrightarrow{h_{t-1}}$ of the previous moment; $\overleftarrow{h_t}$ is the output vector of the reverse LSTM hidden layer at time t, determined jointly by the current input vector $x_t$ and the reverse LSTM output vector $\overleftarrow{h_{t+1}}$ of the previous step in the reverse direction; $h_t$ is the output of the BiLSTM model; $\omega_t$ is the weight matrix of the forward LSTM output; $v_t$ is the weight matrix of the reverse LSTM output; and $b_t$ is the bias of the weight matrix.
CN202410127029.XA 2024-01-30 2024-01-30 Depth feature fusion-based personnel action recognition method in industrial scene Pending CN117725528A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410127029.XA CN117725528A (en) 2024-01-30 2024-01-30 Depth feature fusion-based personnel action recognition method in industrial scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410127029.XA CN117725528A (en) 2024-01-30 2024-01-30 Depth feature fusion-based personnel action recognition method in industrial scene

Publications (1)

Publication Number Publication Date
CN117725528A true CN117725528A (en) 2024-03-19

Family

ID=90203711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410127029.XA Pending CN117725528A (en) 2024-01-30 2024-01-30 Depth feature fusion-based personnel action recognition method in industrial scene

Country Status (1)

Country Link
CN (1) CN117725528A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111052151A (en) * 2017-10-06 2020-04-21 高通股份有限公司 Video motion localization based on attention suggestions
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system
US20210319420A1 (en) * 2020-04-12 2021-10-14 Shenzhen Malong Technologies Co., Ltd. Retail system and methods with visual object tracking
CN114943990A (en) * 2022-06-23 2022-08-26 天津理工大学 Continuous sign language recognition method and device based on ResNet34 network-attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUNJIE ZHANG等: "Attention-Based Residual BiLSTM Networks for Human Activity Recognition", IEEE ACCESS, 7 September 2023 (2023-09-07), pages 94173 - 94187 *
余金锁 (YU Jinsuo): "Research on attention-based CNN_RNN human behavior recognition algorithms", China Master's Theses Full-text Database, Information Science and Technology, 15 February 2023 (2023-02-15), pages 3 - 4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953588A (en) * 2024-03-26 2024-04-30 南昌航空大学 Badminton player action intelligent recognition method integrating scene information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination