CN111325166A - Sitting posture identification method based on projection reconstruction and multi-input multi-output neural network - Google Patents


Publication number
CN111325166A
Authority
CN
China
Prior art keywords
sitting posture
human body
depth
input
map
Prior art date
Legal status
Granted
Application number
CN202010119569.5A
Other languages
Chinese (zh)
Other versions
CN111325166B (en
Inventor
沈捷 (Shen Jie)
黄安义 (Huang Anyi)
王莉 (Wang Li)
曹磊 (Cao Lei)
Current Assignee
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date
Filing date
Publication date
Application filed by Nanjing Tech University filed Critical Nanjing Tech University
Priority to CN202010119569.5A priority Critical patent/CN111325166B/en
Publication of CN111325166A publication Critical patent/CN111325166A/en
Application granted granted Critical
Publication of CN111325166B publication Critical patent/CN111325166B/en
Status
Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing

Abstract

The invention relates to a sitting posture identification method based on projection reconstruction and a multiple-input multiple-output neural network (MIMO-CNN), comprising the following steps: acquiring a depth image of the upper half of the human body and a foreground contour map of the human body; preprocessing; projecting the depth information of the sitting posture contour and reconstructing it to obtain a three-view depth map; designing a MIMO-CNN for sitting posture identification and learning the model parameters; and self-learning of the model. Advantages: the preprocessed depth image is combined with the human body contour map, eliminating interference of the surrounding background on sitting posture recognition. A three-view depth map is obtained by projection reconstruction, making the sitting posture information richer. The designed MIMO-CNN structure is particularly suited to the projection-reconstructed feature information and integrates an attention mechanism, so it focuses better on the hot-spot regions of different sitting postures and improves identification accuracy. Model self-learning balances the requirements of real-time performance and accuracy well, and the method is strongly resistant to viewing-angle changes and complex environmental backgrounds.

Description

Sitting posture identification method based on projection reconstruction and multi-input multi-output neural network
Technical Field
The invention discloses a sitting posture identification method based on projection information and a multiple-input multiple-output neural network (MIMO-CNN), and belongs to the technical field of human body posture identification.
Background
With the rapid development of science and technology, sitting has become one of the most common daily states of modern people and is closely related to human health. Most writing and office work, and most time spent in front of computers, is done in a sitting posture; the sitting postures of teenagers and children during study are often not standard. Yet few people notice the influence of sitting posture on physical health, and a great many people have bad habits when sitting and using computers, such as lowering the head, hunching the back, slouching, and sitting tilted to one side. A method that automatically recognizes a person's sitting posture therefore has high practical value for sitting posture correction and guidance.
The key to correcting and guiding sitting posture is to recognize it accurately and quickly, and research on sitting posture recognition in China and abroad started long ago. Building on experience in human posture recognition and motion detection, several approaches exist. Some place piezoelectric sensors on the seat and identify the sitting posture by analyzing the collected data; this approach is easily affected by external unstable factors such as environmental noise, offset and crosstalk. Moreover, the pressure distribution is related not only to the sitting posture: the person's weight and seating area strongly and irregularly affect the sensor data, making the data hard to interpret and greatly reducing accuracy. In recent years, with the development of computer vision, sitting posture recognition methods based on vision and image processing have diversified: some researchers recognize sitting postures from the size and position of the area occupied by the face in a video; some acquire human skeleton information with a depth sensor and recognize the posture from the angles of skeleton joints; others collect large amounts of sitting posture data and train a sitting posture model by machine learning. In summary, current methods have three main defects. First, they resist interference poorly against complex backgrounds. Second, traditional vision-based sitting posture recognition methods are extremely sensitive to changes in camera angle. Third, they adapt poorly to the diversity of human sitting postures: existing methods recognize postures under relatively ideal actions, but people's natural postures on real seats are varied and complex, and the adaptability and robustness of recognition methods remain active research topics.
Disclosure of Invention
The invention provides a sitting posture identification method based on projection reconstruction and a multiple-input multiple-output neural network (MIMO-CNN), aiming to overcome the defects of existing sitting posture recognition: poor anti-interference performance, high modeling difficulty, and excessive sensitivity to camera angle. The method obtains depth sitting posture images of the human body with a depth camera, preprocesses them, and reconstructs three-dimensional information to obtain multi-view depth information; the MIMO-CNN then identifies the sitting posture state in the front-back direction and in the left-right direction. Users can feed back misjudged samples during use, and the network parameters are periodically relearned and optimized, improving the recognition accuracy of the network.
The technical solution of the invention is as follows: a sitting posture identification method based on projection reconstruction and a multiple-input multiple-output neural network (MIMO-CNN) comprises the following steps:
(1) image acquisition: acquiring a depth image and a human body foreground contour map by using a depth camera;
(2) image preprocessing: performing preprocessing operations such as histogram equalization and filtering on the obtained depth image and human body foreground contour map, and applying data enhancement to the samples to expand the training data set;
(3) depth image projection reconstruction: performing projection reconstruction on the human body foreground contour depth map, taking the opposite directions of the X, Y and Z axes as projection directions to obtain a left view, a top view and a main view in turn, namely the three-view depth map, as shown in fig. 3;
(4) establishing a sitting posture identification model: designing a MIMO-CNN for sitting posture identification and taking the three-view depth maps processed in step (3) as the inputs of the three channels of the MIMO-CNN for network training;
(5) sitting posture identification: inputting the preprocessed three-view depth map into the MIMO-CNN and finally identifying the sitting posture according to the distribution of the human body in space;
(6) self-learning of the model: self-screening the fed-back error samples, collecting the screened misjudged samples, and automatically relearning the model to improve its recognition accuracy.
The step (1) of obtaining the image comprises the following specific steps:
1) acquiring a depth image by using a depth camera;
2) acquiring the human body foreground contour by using a built-in algorithm of the depth camera.
The image preprocessing of the step (2) comprises the following specific steps:
1) carrying out histogram equalization on the native depth image;
2) performing an opening operation on the depth image to remove outer-edge burrs and fill missing blocks inside the contour;
3) denoising the human body foreground contour by adopting a median filtering method;
4) taking the human body foreground contour as a template and grabbing the human body depth information from the depth image to obtain the human body foreground depth map;
5) cutting off the redundant white background of the human body foreground depth map by adaptive cropping;
6) normalizing the size to 224 × 224 by bilinear interpolation;
7) performing data enhancement on the cropped human body foreground depth map to expand the training samples (a sketch of the whole pipeline follows).
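For illustration only, steps 1)-6) above might be sketched with OpenCV and NumPy as follows. The kernel sizes follow the embodiment described later (a 5 × 5 opening kernel, an 8-neighborhood median window); the function name and the assumption of 8-bit inputs are ours, not fixed by the patent:

```python
import cv2
import numpy as np

def preprocess(depth_raw, contour_mask):
    """Sketch of preprocessing steps 1)-6); assumes 8-bit single-channel inputs."""
    # 1) histogram equalization of the raw depth image
    depth = cv2.equalizeHist(depth_raw)
    # 2) opening (erosion then dilation) to remove burrs and small holes
    depth = cv2.morphologyEx(depth, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    # 3) median filtering of the foreground contour (8-neighborhood -> 3x3 window)
    mask = cv2.medianBlur(contour_mask, 3)
    # 4) contour as template: keep depth inside the body, white (255) outside
    fg = np.where(mask > 0, depth, 255).astype(np.uint8)
    # 5) adaptive cropping: drop rows/columns that contain no foreground
    ys, xs = np.where(mask > 0)
    fg = fg[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # 6) bilinear resize to the 224 x 224 network input size
    return cv2.resize(fg, (224, 224), interpolation=cv2.INTER_LINEAR)
```

Data enhancement (step 7) is sketched in the embodiment section below.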
The depth image projection reconstruction in the step (3) comprises the following specific steps:
1) For the human body foreground depth map from step (2), the upper left corner is the coordinate origin, the right is the positive X direction, downward is the positive Y direction, and the pixel value is the Z direction, so each pixel of a depth image can be regarded as a three-dimensional point. Swapping the original Z axis with the Y axis and normalizing the data from 0-224 to 0-255 gives the top-view projection of the human body foreground depth map.
2) Swapping the original Z axis of the human body foreground depth map with the X axis and normalizing the data from 0-224 to 0-255 gives the left-view projection of the human body foreground depth map.
3) The left view and the top view are size-normalized to 224 × 224 by bilinear interpolation; together with the human body foreground depth map processed in step (2), they are collectively called the three-view depth map.
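A NumPy sketch of this axis-swapping projection, under our own reading of the steps above: where several 3-D points fall into the same projected cell, the largest coordinate value is kept, a choice the patent does not specify:

```python
import cv2
import numpy as np

def project_views(fg_depth):
    """Top- and left-view projections of a 224 x 224 foreground depth map.

    Per the text: origin at the top-left, X to the right, Y downward,
    Z = pixel value, so every pixel is a 3-D point (x, y, z).
    """
    h, w = fg_depth.shape
    top = np.zeros((256, w), np.uint8)      # rows indexed by z (0..255)
    left = np.zeros((h, 256), np.uint8)     # columns indexed by z (0..255)
    ys, xs = np.nonzero(fg_depth < 255)     # foreground = non-white pixels
    zs = fg_depth[ys, xs].astype(np.intp)
    # top view: Z becomes the row axis, the pixel value is the normalized y
    np.maximum.at(top, (zs, xs), (ys * 255 // (h - 1)).astype(np.uint8))
    # left view: Z becomes the column axis, the pixel value is the normalized x
    np.maximum.at(left, (ys, zs), (xs * 255 // (w - 1)).astype(np.uint8))
    # step 3): size-normalize both views to 224 x 224 by bilinear interpolation
    top = cv2.resize(top, (224, 224), interpolation=cv2.INTER_LINEAR)
    left = cv2.resize(left, (224, 224), interpolation=cv2.INTER_LINEAR)
    return left, top
```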
The step (4) of establishing the sitting posture recognition model comprises the following specific steps:
1) MIMO-CNN design: the MIMO-CNN takes the three-view depth maps (left view, top view and main view) as input and feeds them into three branch networks, obtaining 3 different feature matrices. The three feature matrices of the left view, top view and main view are then concatenated (concat) along the channel dimension to form the front-back sitting posture feature, and the two feature matrices of the top view and main view are concatenated to form the left-right sitting posture feature. The two spliced sitting posture state features are fed into two deep sub-network branches, which finally output two 1-dimensional feature vectors corresponding to the front-back and left-right directions of the sitting posture; finally, 2 softmax layers output the probability distributions of the sitting posture vectors in the left-right and front-back states;
2) training model parameters: the three-view depth maps are input into the three channels of the MIMO-CNN to obtain the model's sitting posture prediction, and the cross entropy loss between the prediction and the real label is calculated. The network parameters are then continuously updated and optimized with the back-propagation gradient descent algorithm according to the loss function to complete network training.
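The branch-and-merge topology of step 1) can be rendered as a minimal PyTorch sketch. The decomposition into a `branch` factory (building one shallow extractor per view) and two head modules is our packaging, not the patent's:

```python
import torch
import torch.nn as nn

class MIMOCNN(nn.Module):
    """Three input branches -> concat -> two classification heads."""
    def __init__(self, branch, fb_head, lr_head):
        super().__init__()
        # one shallow extractor per view, each emitting 14 x 14 x 64 features
        self.front, self.top, self.left = branch(), branch(), branch()
        self.fb_head = fb_head   # deep branch on 192 channels -> 4 classes
        self.lr_head = lr_head   # deep branch on 128 channels -> 3 classes

    def forward(self, v_front, v_top, v_left):
        f, t, l = self.front(v_front), self.top(v_top), self.left(v_left)
        fb = torch.cat([l, t, f], dim=1)   # front-back feature, 14 x 14 x 192
        lr = torch.cat([t, f], dim=1)      # left-right feature, 14 x 14 x 128
        # two softmax layers give the two probability distributions
        return (torch.softmax(self.fb_head(fb), dim=1),
                torch.softmax(self.lr_head(lr), dim=1))
```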
The step 1) MIMO-CNN design comprises the following specific steps:
(a) An original image is input; a convolution is first performed with 3 × 3 kernels, followed by BatchNorm normalization and activation with the Relu6 activation function, yielding a 112 × 112 × 32 feature map.
The calculation process of the convolutional layer is:

u_j^l = Σ_{i∈M_j} x_i^{l-1} * k_{ij}^l + b_j^l
x_j^l = f(u_j^l)

where u_j^l, called the net activation of the jth channel of convolutional layer l, is obtained by convolving and summing the output feature maps x_i^{l-1} of the previous layer and adding a bias; x_j^l is the output of the jth channel of layer l; f(·) is the activation function, here the Relu6 function; M_j denotes the set of input feature maps used to compute u_j^l; k_{ij}^l is the convolution kernel matrix; and b_j^l is the bias of the convolved feature map. For a given output feature map x_j^l, the convolution kernel k_{ij}^l corresponding to each input feature map x_i^{l-1} may differ; "*" is the convolution symbol.
The Relu6 activation function f (x) is:
f(x)=Min(Max(0,x),6)
After convolution and activation, the data are normalized by BatchNorm to zero mean and unit variance:

X̂_k = (X_k - E(X_k)) / √(Var(X_k))

where X_k is the kth feature map in the feature layer, E(X_k) is the mean of the input feature map X_k, Var(X_k) is its variance, and X̂_k is the normalized output;
(b) The convolved feature map is passed through a CBAM attention convolution module; the main role of CBAM is to make the network concentrate more on the important regions and channels of the feature map, spatially and channel-wise.
(c) Features are then extracted with inverted Residual Block modules. The dimension of the input feature map is first expanded with a point-wise convolution, followed by BatchNorm normalization and Relu6 activation; a convolution is then performed in depth-wise fashion, again followed by BatchNorm normalization and the Relu6 function; finally the dimension is reduced with a point-wise convolution. Note that after this last point-wise convolution and BatchNorm normalization, the Relu6 activation function is no longer used; a linear activation is used instead, so as to retain more feature information and preserve the expressive capacity of the model, following the idea of Resnet. After step (a) is finished, feature extraction is performed with four inverted Residual Block modules, finally yielding a 14 × 14 × 64 feature map for each of the three views.
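A PyTorch sketch of the inverted Residual Block just described; the expansion factor of 6 and the stride are assumptions borrowed from MobileNetV2, not stated above:

```python
import torch.nn as nn

def inverted_residual(c_in, c_out, stride=1, expand=6):
    """point-wise expand -> BN -> Relu6 -> depth-wise -> BN -> Relu6
    -> point-wise project -> BN with linear (no Relu6) activation."""
    c_mid = c_in * expand
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, 1, bias=False),               # point-wise expansion
        nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
        nn.Conv2d(c_mid, c_mid, 3, stride, 1,
                  groups=c_mid, bias=False),                 # depth-wise convolution
        nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
        nn.Conv2d(c_mid, c_out, 1, bias=False),              # point-wise projection
        nn.BatchNorm2d(c_out),                               # linear activation only
    )
```

The Resnet-style idea mentioned above would add a skip connection around this sequence when the input and output shapes match.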
(d) Along the channel dimension, the three 14 × 14 × 64 features of the left view, top view and main view are concatenated into the 14 × 14 × 192 front-back sitting posture feature, and the two 14 × 14 × 64 features of the top view and main view are concatenated into the 14 × 14 × 128 left-right sitting posture feature;
(e) The two spliced features are passed through a CBAM attention convolution module, which again makes the network concentrate on the important regions and channels of the feature map. After this convolution, a 14 × 14 × 192 feature map in the front-back direction and a 14 × 14 × 128 sitting posture feature in the left-right direction are obtained;
(f) The two features are then processed identically: three inverted Residual Block operations give a 7 × 7 × 320 feature map; a point-wise convolution expands it to a 7 × 7 × 1280 feature map; average pooling gives a 1 × 128 one-dimensional feature; and a final point-wise convolution gives a 1 × 4 one-dimensional feature in the front-back sub-network and a 1 × 3 one-dimensional feature in the left-right sub-network;
(g) Finally, 2 softmax layers output the probability distributions of the sitting posture vectors in the left-right and front-back states.

The operational function of the softmax layer is:

σ(Z)_j = e^{Z_j} / Σ_{m=1}^{M} e^{Z_m}

where Z_j is the jth input variable, M is the number of input variables, and σ(Z)_j is the output, representing the probability that the output class is j.
The step 2) of model parameter training comprises the following specific steps:
(a) The three-view depth maps are input into the three channels of the MIMO-CNN to obtain the model's sitting posture prediction, and the cross entropy loss between the prediction and the real label is calculated.

The cross entropy is calculated as:

L = -(1/m) Σ_{i=1}^{m} label_i · log(p_i)

where label_i is the one-hot encoded label of the ith sample, p_i is the predicted probability distribution, and m is the number of samples in the batch.

The loss function of the model is:

Loss = L_v + L_h + γ Σ_j ||w_j||^2

where L_v is the cross entropy of the front-back direction output, L_h is the cross entropy of the left-right direction output, Σ_j ||w_j||^2 is the L2 regularization term over the network weights, and γ is its coefficient, used to prevent over-fitting.
(b) The network parameters are continuously updated and optimized with the back-propagation gradient descent algorithm, so the model output approaches the real labels; when the accuracy on the validation set plateaus and no longer increases, network training is finished.
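Assuming standard per-head cross entropy, the combined loss might be sketched as follows; the heads are taken pre-softmax, and the value of γ is an assumption:

```python
import torch.nn.functional as F

def mimo_loss(fb_logits, lr_logits, fb_label, lr_label, model, gamma=1e-4):
    """Loss = L_v + L_h + gamma * sum_j ||w_j||^2 (gamma value assumed)."""
    l_v = F.cross_entropy(fb_logits, fb_label)   # front-back head (4 classes)
    l_h = F.cross_entropy(lr_logits, lr_label)   # left-right head (3 classes)
    l2 = sum((w ** 2).sum() for w in model.parameters())
    return l_v + l_h + gamma * l2
```

`F.cross_entropy` takes integer class indices, which is equivalent to the one-hot formulation above.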
The step (6) of model self-learning comprises the following specific steps:
1) during use, cases of misidentification found by the user are fed back; the background analyzes them automatically, and error samples meeting the conditions are uploaded to the cloud;
2) the error samples fed back from the client to the cloud undergo a manual secondary judgment, are labeled, and are added to the data set;
3) at regular intervals, the database is fed into the model again for training, and the model is fine-tuned after a small number of iterations.
the invention has the beneficial effects that:
1) The depth image obtained by the depth camera is combined with the preprocessed human body contour map, eliminating interference of the surrounding background on sitting posture recognition.
2) Three-view depth maps are reconstructed from the human body contour depth information, making the sitting posture information richer.
3) The MIMO-CNN structure designed by the invention is better suited to feature extraction from multi-view sitting posture information, and its high-performance attention mechanism focuses on the regions that most need attention in different sitting posture images. While reducing the model size, it balances real-time performance against recognition accuracy well, and it recognizes the front-back and left-right states of the sitting posture simultaneously, making recognition of the sitting posture state more accurate. The method is therefore strongly resistant to changes in the sitting posture background and in the camera angle.
4) Through model self-learning, feedback of the user's error samples during use continuously improves model accuracy.
Drawings
FIG. 1 is a flow chart of a sitting posture identification method;
FIG. 2 is a depth image acquired by a depth camera;
FIG. 3 is a human foreground profile acquired by a depth camera;
FIG. 4 is a comparison graph before and after histogram equalization of depth images;
FIG. 5 is a comparison graph before and after a depth image histogram opening operation;
FIG. 6 is a comparison graph before and after median filtering of a human foreground contour map;
FIG. 7 is a process diagram for obtaining the human body foreground sitting posture depth image;
FIG. 8 is a background adaptive clipping flow diagram;
FIG. 9 is a diagram of the effect after clipping;
FIG. 10 is a three-view depth map of human sitting posture depth;
FIG. 11 is a main structure diagram of the MIMO-CNN.
FIG. 12 is a schematic representation of a CBAM;
FIG. 13 is a schematic view of a CBAM channel attention model;
FIG. 14 is a schematic diagram of a CBAM spatial attention model;
FIG. 15 is a diagram of an Inverted residual module architecture for MIMO-CNN;
FIG. 16 is a sitting posture state classification diagram;
FIG. 17 is a front network framework parameter table;
FIG. 18 is a table of predicted network parameters for a back network in-and-out sitting position;
FIG. 19 is a table of predicted network parameters for a rear network left and right sitting posture;
Detailed Description
A sitting posture identification method based on projection reconstruction and a multiple-input multiple-output neural network (MIMO-CNN) comprises the following steps:
(1) image acquisition: acquiring a depth image and a human body foreground contour map by using a depth camera;
(2) image preprocessing: performing preprocessing operations such as histogram equalization and filtering on the obtained depth image and human body foreground contour map, and applying data enhancement to the samples to expand the training data set;
(3) depth image projection reconstruction: performing projection reconstruction on the human body foreground contour depth map, taking the opposite directions of the X, Y and Z axes as projection directions to obtain a left view, a top view and a main view in turn, namely the three-view depth map, as shown in fig. 3;
(4) establishing a sitting posture identification model: designing a MIMO-CNN for sitting posture identification and taking the three-view depth maps processed in step (3) as the inputs of the three channels of the MIMO-CNN for network training;
(5) sitting posture identification: inputting the preprocessed three-view depth map into the MIMO-CNN and finally identifying the sitting posture according to the distribution of the human body in space;
(6) self-learning of the model: self-screening the fed-back error samples, collecting the screened misjudged samples, and automatically relearning the model to improve its recognition accuracy.
Example 1
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
A sitting posture identification method combining projection information and a multiple-input multiple-output neural network (MIMO-CNN); the flow chart is shown in fig. 1. The method comprises the following steps:
step 1, the specific implementation process of image acquisition is as follows:
a1, acquiring a depth image by using a depth camera, as shown in figure 2;
a2, using a random decision forest classification algorithm of an official SDK of a depth camera to finally divide a human body into 32 parts, and finally obtaining a human body foreground contour required by people, as shown in figure 3;
step 2, the image preprocessing process comprises:
b1, histogram equalization is performed on the raw depth image, and histogram equalization is a method for adjusting contrast using an image histogram in the field of image processing.
Figure BDA0002392543030000111
g(x,y)=Sf(x,y)*(L-1)
L is the total number of possible gray levels in the image, the total number of pixels in the original image f is n, f (x, y) is the original pixel value at the position of the original image f (x, y), and g (x, y) is the pixel value after histogram equalization.
In this way, the depth information can be better distributed over the histogram. This can be used to enhance the local contrast without affecting the overall contrast, so that the depth information can more clearly express the distance information, as shown in fig. 4.
B2, an opening operation is performed on the human body foreground contour to remove outer-edge burrs of the foreground and defect blocks inside the contour. The opening operation consists of erosion followed by dilation: erosion shrinks the boundary and dilation expands it. Here the opening is performed with a 5 × 5 kernel; a comparison is shown in fig. 5.
B3, the human body foreground contour is denoised with median filtering: the gray value of each pixel is set to the median of the gray values of all pixels in its 8-neighborhood window. After denoising, a human body foreground contour with smooth edges is obtained; a comparison is shown in fig. 6.
B4, based on the human body foreground contour, the contour is used as a coordinate template to grab the human body depth information from the background-containing depth image, forming the required human body depth sitting posture data, i.e., the human body foreground depth map, as shown in fig. 7.

The human body foreground depth map is obtained as:

F(x,y) = G(x,y) if D(x,y) belongs to the foreground; F(x,y) = 255 otherwise,
x ∈ [0, rows-1], y ∈ [0, cols-1]

where F(x,y) is the finally generated human body foreground depth map, G(x,y) is the depth image, and D(x,y) is the human body foreground contour map.
B5, the redundant white background is cut off by adaptive cropping: the pure white background is still too large and meaningless for recognition, so the optimal picture region is obtained automatically by traversing the pixel values of rows and columns from the top, bottom, left and right. The adaptive cropping flow chart is shown in fig. 8 and the cropping result in fig. 9.
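A minimal sketch of this row/column scan (our formulation of the flow chart in fig. 8; the function name is ours):

```python
import numpy as np

def adaptive_crop(fg_depth, white=255):
    """Drop the all-white rows and columns on each side of the foreground."""
    rows = np.where((fg_depth < white).any(axis=1))[0]   # rows with foreground
    cols = np.where((fg_depth < white).any(axis=0))[0]   # columns with foreground
    return fg_depth[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```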
B6, the three-view depth maps are randomly cropped to 0.85 times their length and width. Main views of postures that are centered in the left-right direction (upright sitting, lying on the desk, leaning back and lowering the head) are flipped horizontally at random, i.e., with probability 50%. This data enhancement expands the training samples (see the sketch after B7).
B7, performing bilinear interpolation on the extended samples to normalize the samples to 224 × 224.
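B6 and B7 together might look as follows in NumPy/OpenCV; the function name and the use of a NumPy random generator are ours:

```python
import cv2
import numpy as np

def augment(view, centered, rng=np.random.default_rng()):
    """Random 0.85x crop, optional 50% horizontal flip, resize to 224 x 224."""
    h, w = view.shape
    ch, cw = int(h * 0.85), int(w * 0.85)
    y, x = rng.integers(0, h - ch + 1), rng.integers(0, w - cw + 1)
    out = view[y:y + ch, x:x + cw]
    if centered and rng.random() < 0.5:   # only postures centered left-right
        out = cv2.flip(out, 1)            # horizontal flip
    return cv2.resize(out, (224, 224), interpolation=cv2.INTER_LINEAR)
```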
And step 3, the process of depth image projection reconstruction is as follows:
C1, for the human body foreground depth map from step 2, the upper left corner is the coordinate origin, the right is the positive X direction, downward is the positive Y direction, and the pixel value is the Z direction, so each pixel of a depth image can be regarded as a three-dimensional point. Swapping the original Z axis with the Y axis and normalizing the data from 0-224 to 0-255 gives the top-view projection of the human body foreground depth map.
C2, swapping the original Z axis of the human body foreground depth map with the X axis and normalizing the data from 0-224 to 0-255 gives the left-view projection of the human body foreground depth map.
C3, the left view and the top view are size-normalized to 224 × 224 by bilinear interpolation; together with the human body foreground depth map processed in step 2, they are collectively called the three-view depth map, as shown in fig. 10.
Step 4, the process of establishing the sitting posture identification model comprises the following steps:
D1, the MIMO-CNN takes the three-view depth maps (left view, top view and main view) as input and feeds them into three branch networks, obtaining 3 different feature matrices. The three feature matrices of the left view, top view and main view are concatenated (concat) along the channel dimension to form the front-back sitting posture feature, and the two feature matrices of the top view and main view are concatenated to form the left-right sitting posture feature. The two spliced sitting posture state features are fed into two deep sub-network branches, which output two 1-dimensional feature vectors corresponding to the front-back and left-right directions of the sitting posture; finally, 2 softmax layers output the probability distributions of the sitting posture vectors in the left-right and front-back states. The main structure of the MIMO-CNN, i.e., its design framework, is shown in fig. 11.
D2, model parameter training: the three-view depth maps are first input into the three channels of the MIMO-CNN to obtain the model's sitting posture prediction, and the cross entropy loss between the prediction and the real label is calculated. The network parameters are then continuously updated and optimized with the back-propagation gradient descent algorithm according to the loss function to complete network training.
The step D1, the design of MIMO-CNN, includes the following specific steps:
(a) An original three-view depth image is input; the three shallow sub-network branches each first perform a convolution with 3 × 3 kernels, followed by BatchNorm normalization and activation with the Relu6 activation function, yielding a 112 × 112 × 32 feature map.
The calculation process of the convolutional layer is:

u_j^l = Σ_{i∈M_j} x_i^{l-1} * k_{ij}^l + b_j^l
x_j^l = f(u_j^l)

where u_j^l, called the net activation of the jth channel of convolutional layer l, is obtained by convolving and summing the output feature maps x_i^{l-1} of the previous layer and adding a bias; x_j^l is the output of the jth channel of layer l; f(·) is the activation function, here the Relu6 function; M_j denotes the set of input feature maps used to compute u_j^l; k_{ij}^l is the convolution kernel matrix; and b_j^l is the bias of the convolved feature map. For a given output feature map x_j^l, the convolution kernel k_{ij}^l corresponding to each input feature map x_i^{l-1} may differ; "*" is the convolution symbol.
The Relu6 activation function f (x) is:
f(x)=Min(Max(0,x),6)
After convolution and activation, the data are normalized by BatchNorm to zero mean and unit variance:

X̂_k = (X_k - E(X_k)) / √(Var(X_k))

where X_k is the kth feature map in the feature layer, E(X_k) is the mean of the input feature map X_k, Var(X_k) is its variance, and X̂_k is the normalized output;
(b) The convolved feature map is passed through a CBAM attention convolution module, yielding a 112 × 112 × 32 feature map. The main role of CBAM is to make the network concentrate more on the important feature regions and key channels; a schematic of CBAM is shown in fig. 12.
The process of the CBAM attention convolution module is as follows:
First, spatial max and avg pooling are applied to the input feature map (112 × 112 × 32), followed by a multi-layer perceptron; the two resulting one-dimensional vectors are added and passed through a relu activation function, giving a 1 × 1 × 32 channel attention map, which is multiplied with the input feature map to give a 112 × 112 × 32 feature map. The channel attention model of the CBAM is shown in fig. 13.
The channel attention model formula is:

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))

where F is the input feature map, M_c is the channel attention map, σ denotes the relu activation function, MLP denotes the multi-layer perceptron, AvgPool is spatial average pooling, and MaxPool is spatial maximum pooling.
Then, max and avg pooling along the channel axis are applied to the feature map obtained in the previous step, the results are concatenated (concat) along the channel axis, a 7 × 7 convolution on the spliced tensor gives a 112 × 112 × 1 map, and a sigmoid activation is applied; this map is multiplied with the input feature map to give the final 112 × 112 × 32 feature map. The spatial attention model of CBAM is shown in fig. 14.
The spatial attention model formula is:

M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)]))

where F is the input feature map, M_s is the spatial attention map, σ denotes the sigmoid activation function, AvgPool is channel-wise average pooling, MaxPool is channel-wise maximum pooling, and f^{7×7} is a 7 × 7 convolution.
(c) After step (b), feature extraction is performed with four inverted Residual Block modules as proposed by the MobileNetV2 network, finally yielding 14 × 14 × 64 feature maps for the three views; the parameters of the front network are shown in fig. 17. The inverted Residual Block structure is shown in fig. 15: the dimension of the input feature map is first expanded with a point-wise convolution, followed by BatchNorm normalization and Relu6 activation; a convolution is then performed in depth-wise fashion, again followed by BatchNorm normalization and the Relu6 function; finally the dimension is reduced with a point-wise convolution. Note that after this last point-wise convolution and BatchNorm normalization, the Relu6 activation function is no longer used; a linear activation is used instead, so as to retain more feature information and preserve the expressive capacity of the model, following the idea of Resnet.
(d) Along the channel dimension, the three 14 × 14 × 64 features of the left view, top view and main view are concatenated into the 14 × 14 × 192 front-back sitting posture feature, and the two 14 × 14 × 64 features of the top view and main view are concatenated into the 14 × 14 × 128 left-right sitting posture feature;
(e) The two spliced features are fed into the two deep sub-network branches, each continuing with a CBAM attention convolution module; after this convolution, a 14 × 14 × 192 feature map in the front-back direction and a 14 × 14 × 128 sitting posture feature in the left-right direction are obtained;
(f) The two features are then processed identically: three inverted Residual Block operations give a 7 × 7 × 320 feature map; a point-wise convolution expands it to a 7 × 7 × 1280 feature map; average pooling gives a 1 × 128 one-dimensional feature; and a final point-wise convolution gives a 1 × 4 one-dimensional feature in the front-back sub-network branch and a 1 × 3 one-dimensional feature in the left-right sub-network branch. The prediction network parameters of the deep sub-network branches for the front-back and left-right sitting posture states are shown in figs. 18 and 19;
(g) Finally, 2 softmax layers output the probability distributions of the sitting posture vectors in the left-right and front-back states.

The operational function of the softmax layer is:

σ(Z)_j = e^{Z_j} / Σ_{m=1}^{M} e^{Z_m}

where Z_j is the jth input variable, M is the number of input variables, and σ(Z)_j is the output, representing the probability that the output class is j.
The step D2, model training, comprises the following specific steps:
(a) The three-view depth maps are input into the three channels of the MIMO-CNN to obtain the model's sitting posture prediction, and the cross entropy loss between the prediction and the real label is calculated.

The cross entropy is calculated as:

L = -(1/m) Σ_{i=1}^{m} label_i · log(p_i)

where label_i is the one-hot encoded label of the ith sample, p_i is the predicted probability distribution, and m is the number of samples in the batch.

The loss function of the model is:

Loss = L_v + L_h + γ Σ_j ||w_j||^2

where L_v is the cross entropy of the front-back direction output, L_h is the cross entropy of the left-right direction output, Σ_j ||w_j||^2 is the L2 regularization term over the network weights, and γ is its coefficient, used to prevent over-fitting.
(b) The network parameters are continuously updated and optimized with the back-propagation gradient descent algorithm, so the model output approaches the real labels; when the accuracy on the validation set plateaus and no longer increases, network training is finished.
Step 5, sitting posture identification: and (4) inputting the three-view depth map obtained by preprocessing as an input quantity into the sitting posture recognition model trained in the step 4 to recognize the sitting posture. The sitting posture can be divided into forward sitting, head lowering, backward leaning and lying down tables and left and right directions, namely left leaning, middle leaning and right leaning. Considering that the left and right direction judgment is not carried out any more in the particularity of sitting postures of the lying table, the sitting postures are classified as shown in fig. 16;
step 6, model self-learning is automatically carried out along with feedback collection of error samples, and the performance of the model is improved, and the method comprises the following specific steps:
E1, during use, the user feeds back cases of misidentification. The background analyzes the probability distribution output by the softmax layer for the misidentified sample: if the probability of the predicted class is below 0.65, the classification is regarded as ambiguous, i.e., a correct judgment could not be made between two similar classes, and the sample must be fed back to the cloud as an error sample for learning. When the model is stable, high-confidence predictions are almost never wrong, so such errors are ignored by the background;
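The screening rule of E1 reduces to a confidence test on the softmax output; a minimal sketch (the 0.65 threshold is from the text, the function name is ours):

```python
def should_upload(softmax_probs, threshold=0.65):
    """True if the top predicted probability is below the threshold, i.e. the
    classification was ambiguous and the sample goes to the cloud (step E1)."""
    return max(softmax_probs) < threshold
```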
E2, the error samples fed back from the client to the cloud undergo a manual secondary judgment, are labeled, and are added to the data set;
E3, the database is fed into the model again for training, and the model is fine-tuned after a small number of iterations.
the above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (8)

1. The sitting posture identification method based on projection reconstruction and the multi-input multi-output neural network is characterized by comprising the following steps of:
(1) image acquisition: acquiring a depth image and a human body foreground contour map by using a depth camera;
(2) image preprocessing: performing histogram equalization and filtering preprocessing operations on the obtained depth image and human body foreground contour map, and applying data enhancement to the samples to expand the training data set;
(3) depth image projection reconstruction: performing projection reconstruction on the human body foreground contour depth map, taking the opposite directions of the X, Y and Z axes as projection directions to obtain a left view, a top view and a main view in turn, namely a three-view depth map;
(4) establishing a sitting posture identification model: designing a multi-input multi-output neural network for sitting posture identification and taking the three-view depth maps processed in step (3) as the inputs of the three channels of the multi-input multi-output neural network for network training;
(5) sitting posture identification: inputting the preprocessed three-view depth map into the multi-input multi-output neural network and finally identifying the sitting posture according to the distribution of the human body in space;
(6) self-learning of the model: self-screening the fed-back error samples, collecting the screened misjudged samples, and automatically relearning the model to improve its recognition accuracy.
2. The sitting posture identifying method based on projection reconstruction and multiple-input multiple-output neural network as claimed in claim 1, wherein the step (1) of image acquisition comprises the following specific steps:
1) acquiring a depth image by using a depth camera;
2) acquiring the human body foreground contour by using a built-in algorithm of the depth camera.
3. The sitting posture identifying method based on projection reconstruction and multiple-input multiple-output neural network as claimed in claim 1, wherein the step (2) of image preprocessing comprises the following specific steps:
1) carrying out histogram equalization on the native depth image;
2) performing an opening operation on the depth image to remove outer-edge burrs and fill missing blocks inside the contour;
3) denoising the human body foreground contour by adopting a median filtering method;
4) taking the human body foreground contour as a template and grabbing the human body depth information from the depth image to obtain the human body foreground depth map;
5) cutting off the redundant white background of the human body foreground depth map by adaptive cropping;
6) normalizing the size to 224 × 224 by bilinear interpolation;
7) performing data enhancement on the cropped human body foreground depth map to expand the training samples.
4. The sitting posture identifying method based on projection reconstruction and multiple-input multiple-output neural network as claimed in claim 1, wherein the step (3) of depth image projection reconstruction comprises the following specific steps:
1) for the human body foreground depth map preprocessed in step (2), the upper left corner is the coordinate origin, the right is the positive X direction, downward is the positive Y direction, and the pixel value is the Z direction, so each pixel of a depth map is regarded as a three-dimensional point; swapping the original Z axis with the Y axis and normalizing the data from 0-224 to 0-255 gives the top-view projection of the human body foreground depth map;
2) swapping the original Z axis of the human body foreground depth map preprocessed in step (2) with the X axis and normalizing the data from 0-224 to 0-255 gives the left-view projection of the human body foreground depth map;
3) size-normalizing the left view and the top view of the human body foreground depth map to 224 × 224 by bilinear interpolation; together with the human body foreground depth map preprocessed in step (2), they are collectively called the three-view depth map.
5. The sitting posture recognition method based on projection reconstruction and multiple-input multiple-output neural network as claimed in claim 1, wherein the step (4) of establishing a sitting posture recognition model comprises the following specific steps:
1) designing the multi-input multi-output neural network: the network takes the left-view, top-view and main-view three-view depth maps as input and feeds them into three branch networks, obtaining 3 different feature matrices; the three feature matrices of the left view, top view and main view are concatenated (concat) along the channel dimension to form the front-back sitting posture state feature; the two feature matrices of the top view and main view are concatenated to form the left-right sitting posture feature; the two spliced sitting posture state features are fed into two deep sub-network branches, which finally output two 1-dimensional feature vectors corresponding to the front-back and left-right directions of the sitting posture; finally, 2 softmax layers output the probability distributions of the sitting posture vectors in the left-right and front-back states;
2) training model parameters: inputting the three-view depth maps into the three channels of the multi-input multi-output neural network to obtain the model's sitting posture prediction, and calculating the cross entropy loss between the prediction and the real label; continuously updating and optimizing the network parameters with the back-propagation gradient descent algorithm according to the loss function to complete network training.
6. The sitting posture identifying method based on projection reconstruction and multiple-input multiple-output neural network as claimed in claim 5, wherein the step 1) multiple-input multiple-output neural network design comprises the following specific steps:
(a) inputting an original image, first performing a convolution operation with 3 × 3 kernels, then applying BatchNorm normalization and Relu6 activation, to obtain a 112 × 112 × 32 feature map;
the calculation process of the convolutional layer is:

u_j^l = Σ_{i∈M_j} x_i^{l-1} * k_{ij}^l + b_j^l
x_j^l = f(u_j^l)

wherein u_j^l, called the net activation of the jth channel of convolutional layer l, is obtained by convolving and summing the output feature maps x_i^{l-1} of the previous layer and adding a bias; x_j^l is the output of the jth channel of layer l; f(·) is the activation function, using the Relu6 function; M_j denotes the set of input feature maps used to compute u_j^l; k_{ij}^l is the convolution kernel matrix; b_j^l is the bias of the convolved feature map; for a given output feature map x_j^l, the convolution kernel k_{ij}^l corresponding to each input feature map x_i^{l-1} may differ; "*" is the convolution symbol;
the Relu6 activation function f (x) is:
f(x)=Min(Max(0,x),6)
after convolution and activation, the data are normalized by BatchNorm to zero mean and unit variance:

X̂_k = (X_k - E(X_k)) / √(Var(X_k))

wherein X_k is the kth feature map in the feature layer, E(X_k) is the mean of the input feature map X_k, Var(X_k) is its variance, and X̂_k is the normalized output;
(b) passing the convolved feature map through a CBAM attention convolution module, the main role of CBAM being to make the network concentrate more on important feature regions and key channels;
(c) then extracting features with inverted Residual Block modules: the dimension of the input feature map is first expanded with a point-wise convolution, followed by BatchNorm normalization and Relu6 activation; a convolution is then performed in depth-wise fashion, followed again by BatchNorm normalization and the Relu6 function; finally the dimension is reduced with a point-wise convolution; after this last point-wise convolution and BatchNorm normalization, the Relu6 activation function is no longer used, a linear activation being used instead, so as to retain more feature information and preserve the expressive capacity of the model, following the idea of Resnet; after step (a) is finished, feature extraction is performed with four inverted Residual Block modules, finally yielding a 14 × 14 × 64 feature map for each of the three views;
(d) along the channel dimension, concatenating the three 14 × 14 × 64 features of the left view, top view and main view into the 14 × 14 × 192 front-back sitting posture feature, and concatenating the two 14 × 14 × 64 features of the top view and main view into the 14 × 14 × 128 left-right sitting posture feature;
(e) passing the two spliced features through a CBAM attention convolution module; after convolution, obtaining a 14 × 14 × 192 feature map in the front-back direction and a 14 × 14 × 128 sitting posture feature in the left-right direction;
(f) processing the two features identically: three inverted Residual Block operations give a 7 × 7 × 320 feature map; a point-wise convolution expands it to a 7 × 7 × 1280 feature map; average pooling gives a 1 × 128 one-dimensional feature; and a final point-wise convolution gives a 1 × 4 one-dimensional feature in the front-back sub-network and a 1 × 3 one-dimensional feature in the left-right sub-network;
(g) outputting the probability distributions of the sitting posture vectors in the left-right and front-back states with 2 softmax layers:

the operational function of the softmax layer is:

σ(Z)_j = e^{Z_j} / Σ_{m=1}^{M} e^{Z_m}

wherein Z_j is the jth input variable, M is the number of input variables, and σ(Z)_j is the output, representing the probability that the output class is j.
7. The sitting posture recognition method based on projection reconstruction and multiple-input multiple-output neural network as claimed in claim 5, wherein the step 2) model parameter training comprises the following specific steps:
(a) inputting the three-view depth maps into the three channels of the multi-input multi-output neural network to obtain the model's sitting posture prediction, and then calculating the cross entropy loss between the prediction and the real label; the cross entropy is calculated as:

L = -(1/m) Σ_{i=1}^{m} label_i · log(p_i)

wherein label_i is the one-hot encoded label of the ith sample, p_i is the predicted probability distribution, and m is the number of samples in the batch;

the loss function of the model is:

Loss = L_v + L_h + γ Σ_j ||w_j||^2

wherein L_v is the cross entropy of the front-back direction output, L_h is the cross entropy of the left-right direction output, Σ_j ||w_j||^2 is the L2 regularization term over the network weights, and γ is its coefficient, used to prevent over-fitting;
(b) continuously updating and optimizing the network parameters with the back-propagation gradient descent algorithm, the model output continuously approaching the real labels; when the accuracy on the validation set plateaus and no longer increases, network training is finished.
8. The sitting posture identifying method based on projection reconstruction and multiple-input multiple-output neural network as claimed in claim 1, wherein the step (6) of model self-learning comprises the following specific steps:
1) during use, cases of misidentification found by the user are fed back; the background analyzes them automatically, and error samples meeting the conditions are uploaded to the cloud;
2) the error samples fed back to the cloud by the client undergo a manual secondary judgment, and are then labeled and added to the data set;
3) the updated data set is periodically fed back into the model for training, and the model is fine-tuned after a small number of iterations (a toy sketch of this cycle follows).
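The self-learning cycle of claim 8 might be orchestrated as in the toy sketch below. Every name here (Sample, fetch_feedback_samples, manually_relabel, fine_tune) is a hypothetical stand-in: the claim specifies the feedback, relabeling and fine-tuning cycle, not any concrete API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sample:
    features: List[float]
    label: int = -1                      # -1: awaiting manual secondary judgment

@dataclass
class Dataset:
    samples: List[Sample] = field(default_factory=list)

def fetch_feedback_samples() -> List[Sample]:
    # Stand-in for error samples uploaded to the cloud after backend filtering
    return [Sample([0.1, 0.2]), Sample([0.3, 0.4])]

def manually_relabel(sample: Sample) -> int:
    # Stand-in for the annotator's manual secondary judgment
    return 0

def fine_tune(dataset: Dataset, iterations: int) -> None:
    # Stand-in for a short fine-tuning run of the model on the updated data set
    for _ in range(iterations):
        pass

dataset = Dataset()
for sample in fetch_feedback_samples():      # step 1): collect fed-back errors
    sample.label = manually_relabel(sample)  # step 2): manual relabeling
    dataset.samples.append(sample)
fine_tune(dataset, iterations=100)           # step 3): periodic fine-tuning
print(len(dataset.samples), "relabeled samples added")
```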
CN202010119569.5A 2020-02-26 2020-02-26 Sitting posture identification method based on projection reconstruction and MIMO neural network Active CN111325166B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010119569.5A CN111325166B (en) 2020-02-26 2020-02-26 Sitting posture identification method based on projection reconstruction and MIMO neural network

Publications (2)

Publication Number Publication Date
CN111325166A CN111325166A (en) 2020-06-23
CN111325166B CN111325166B (en) 2023-07-07

Family

ID=71172893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010119569.5A Active CN111325166B (en) 2020-02-26 2020-02-26 Sitting posture identification method based on projection reconstruction and MIMO neural network

Country Status (1)

Country Link
CN (1) CN111325166B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622257A (en) * 2017-10-13 2018-01-23 深圳市未来媒体技术研究院 A kind of neural network training method and three-dimension gesture Attitude estimation method
CN108345869A (en) * 2018-03-09 2018-07-31 南京理工大学 Driver's gesture recognition method based on depth image and virtual data
CN108491880A (en) * 2018-03-23 2018-09-04 西安电子科技大学 Object classification based on neural network and position and orientation estimation method
CN110175566A (en) * 2019-05-27 2019-08-27 大连理工大学 A kind of hand gestures estimating system and method based on RGBD converged network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BIN LIANG et al.: "Three Dimensional Motion Trail Model for Gesture Recognition" *
YUAN DIBO et al.: "Multi-class feature fusion and recognition of nonstandard writing sitting postures" *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507920A (en) * 2020-12-16 2021-03-16 重庆交通大学 Examination abnormal behavior identification method based on time displacement and attention mechanism
CN112348947A (en) * 2021-01-07 2021-02-09 南京理工大学智能计算成像研究院有限公司 Three-dimensional reconstruction method for deep learning based on reference information assistance
CN112348947B (en) * 2021-01-07 2021-04-09 南京理工大学智能计算成像研究院有限公司 Three-dimensional reconstruction method for deep learning based on reference information assistance
CN113657271A (en) * 2021-08-17 2021-11-16 上海科技大学 Sitting posture detection method and system combining quantifiable factors and non-quantifiable factors for judgment
CN113657271B (en) * 2021-08-17 2023-10-03 上海科技大学 Sitting posture detection method and system combining quantifiable factors and unquantifiable factor judgment
CN114582014A (en) * 2022-01-25 2022-06-03 珠海视熙科技有限公司 Method and device for recognizing human body sitting posture in depth image and storage medium
CN114898463A (en) * 2022-05-09 2022-08-12 河海大学 Sitting posture identification method based on improved depth residual error network
CN114898463B (en) * 2022-05-09 2024-05-14 河海大学 Sitting posture identification method based on improved depth residual error network

Also Published As

Publication number Publication date
CN111325166B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN111325166A (en) Sitting posture identification method based on projection reconstruction and multi-input multi-output neural network
Rao et al. Deep convolutional neural networks for sign language recognition
CN111652827B (en) Front face synthesis method and system based on generative adversarial network
CN107145889B (en) Target identification method based on double CNN network with RoI pooling
US6185337B1 (en) System and method for image recognition
CN110738161A (en) Face image correction method based on improved generative adversarial network
CN107871101A (en) Face detection method and device
CN111507334B (en) Instance segmentation method based on key points
CN115439857B (en) Inclined character recognition method based on complex background image
CN109766898A (en) Image character recognition method, device, computer equipment and storage medium
CN110728183A (en) Human body action recognition method based on attention mechanism neural network
CN111695633A (en) Low-illumination target detection method based on RPF-CAM
CN105046202B (en) Adaptive illumination processing method for face recognition
CN115457568B (en) Historical document image noise reduction method and system based on generative adversarial network
CN111783693A (en) Intelligent identification method of fruit and vegetable picking robot
Kishore et al. Selfie sign language recognition with convolutional neural networks
CN111414875A (en) Three-dimensional point cloud head pose estimation system based on deep regression forest
CN112149521A (en) Palm print ROI extraction and enhancement method based on multitask convolutional neural network
CN110334631B (en) Sitting posture detection method based on face detection and binary operation
CN108052932A (en) Occlusion-adaptive face recognition method
CN111179294A (en) Bionic contour detection method based on X and Y parallel visual channel responses
CN111898560B (en) Classification regression feature decoupling method in target detection
CN114596605A (en) Expression recognition method with multi-feature fusion
CN112926430A (en) Multi-angle facial expression recognition method based on deep learning
CN111611963A (en) Face recognition method based on neighbor preserving canonical correlation analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant