WO2021237913A1 - Sitting posture recognition method based on monocular video image sequence - Google Patents

Sitting posture recognition method based on monocular video image sequence

Info

Publication number
WO2021237913A1
WO2021237913A1 (PCT/CN2020/104054)
Authority
WO
WIPO (PCT)
Prior art keywords
recognition
behavior
estimation
sitting posture
behavior recognition
Prior art date
Application number
PCT/CN2020/104054
Other languages
French (fr)
Chinese (zh)
Inventor
李灏为
杨志
Original Assignee
大连成者云软件有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大连成者云软件有限公司
Publication of WO2021237913A1 publication Critical patent/WO2021237913A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition

Definitions

  • the present invention relates to the fields of video image processing, computer vision and human body posture recognition, and in particular, to a sitting posture recognition method based on a monocular video image sequence.
  • sitting posture recognition algorithms typically use sensors to extract the upper-body posture of the subject and, based on a measure of how standard the sitting posture is, help users correct an incorrect posture in time to protect their health.
  • the non-contact sensors on which current sitting posture recognition algorithms are based fall mainly into the following types:
  • Ultrasonic sensors: ultrasound places certain requirements on the measurement surface. If the surface density is low, the ultrasound penetrates the object and produces multiple echoes; if the surface is uneven, the ultrasound is scattered and likewise produces multiple echoes; if the surface is inclined, the ultrasound is not reflected correctly; and if the surface is too small, not enough ultrasound is reflected back. Ultrasonic measurement therefore performs poorly.
  • Binocular vision sensors: these sensors have demanding manufacturing requirements, are very sensitive to ambient light, perform poorly in texture-poor scenes, and have high computational complexity; the camera baseline limits the measurement range, and blind spots exist in use.
  • Monocular vision sensors: hardware cost is low, but generally only two-dimensional information can be obtained, so sitting posture recognition is not as effective as with a binocular camera; recognition is less robust to occlusion, sudden lighting changes, and similar conditions; and a pinhole imaging model plus additional prior knowledge is required to recover three-dimensional information.
  • the present invention is proposed in view of at least one of the above-mentioned problems.
  • the present invention focuses on sitting posture recognition methods based on a monocular vision sensor, in particular on a monocular video image sequence; it aims to improve the accuracy of such methods, as well as their robustness under abnormal usage conditions such as occlusion and sudden lighting changes.
  • the present invention is also based on behavior recognition. Practical applications have shown that it improves recognition accuracy when the recognized subject exhibits dynamic behavior, without requiring additional external detection results. In addition, it can adaptively incorporate desktop position information during recognition.
  • the purpose of the present invention is to provide a sitting posture recognition method based on a monocular video image sequence, comprising the steps below (a control-flow sketch follows this list):
  • S1. obtaining the current video frame from a monocular camera and updating a video frame sequence of fixed capacity;
  • S2. sending the video frame sequence to a human body posture estimation and behavior recognition module, which estimates the human body posture and recognizes the behavior type from the three-dimensional coordinates of key points; the behavior types comprise static behaviors and dynamic behaviors; if the recognized behavior type is a static behavior, S3 is executed, otherwise S1 is executed;
  • S3. a sitting posture evaluation module receives both the human posture estimation result and the behavior type recognition result, evaluates the sitting posture based on the two, and gives corresponding prompts according to the evaluation result.
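A minimal sketch of the S1-S3 control flow described above, assuming a fixed-capacity clip of T frames; the module interfaces (`pose_behavior_module`, `desktop_module`, `evaluator`, the `is_static` flag) and the OpenCV capture backend are illustrative assumptions, not part of the patent:

```python
from collections import deque

import cv2  # assumed capture backend

T = 16  # clip capacity; the patent fixes T but does not prescribe a value

def run(pose_behavior_module, desktop_module, evaluator, camera_id=0):
    clip = deque(maxlen=T)              # VideoClip: fixed capacity, oldest frame dropped
    cap = cv2.VideoCapture(camera_id)
    while True:
        ok, frame_k = cap.read()        # S1: acquire the current frame
        if not ok:
            break
        clip.append(frame_k)            # S1: update the video frame sequence
        if len(clip) < T:
            continue                    # wait until the clip holds T frames
        pose_3d, behavior = pose_behavior_module(list(clip))  # S2: joint estimation/recognition
        desktop_pose = desktop_module(frame_k)                # optional desktop pose detection
        if behavior.is_static:
            evaluator(pose_3d, behavior, desktop_pose)        # S3: evaluate only static behaviors
        # dynamic behavior: skip evaluation and return to S1
```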
  • the present invention has the following advantages:
  • the present invention is developed on the basis of a monocular video image sequence, uses a multi-task end-to-end network structure to realize human body posture estimation and behavior recognition, and uses the accurate posture estimation results to improve behavior recognition accuracy.
  • in the behavior recognition process, the present invention uses the low-level feature map results to obtain environmental context information related to the posture, further improving the accuracy of recognition between similar behaviors.
  • the present invention adds an attention mechanism in the spatial domain when acquiring 3-dimensional posture information, and improves the accuracy of the key points of the posture through the context of the spatial domain.
  • the present invention combines the desktop position and posture information in the actual scene to perform sitting posture evaluation, and improves the sitting posture recognition accuracy.
  • the present invention can be widely used in office equipment and teaching equipment.
  • Figure 1 is a flowchart of the sitting posture recognition method of the present invention.
  • Figure 2 is a schematic diagram of the structure of the human body posture estimation and behavior recognition module of the present invention.
  • Figure 3 is a schematic diagram of the structure of the low-level feature extraction sub-module in the embodiment.
  • Figure 4 is a schematic diagram of the distribution of the 11 key points in a sitting state in the embodiment.
  • Figure 5 is a schematic diagram of the SACAM network structure in the embodiment.
  • Figure 6 is a flowchart of posture estimation heat map decoding in an embodiment.
  • Figure 7 is a schematic diagram of the pose estimation results of a video sequence as input to the behavior recognition part in an embodiment.
  • Figure 8 is a schematic diagram of the SRLRTM network structure in the embodiment.
  • Figure 1 shows a sitting posture recognition method based on a monocular video image sequence provided by the present invention.
  • the process for performing sitting posture evaluation includes the following steps:
  • S1. obtain the current video frame Frame_k from the monocular camera and update the video frame sequence VideoClip = {Frame_i | i ∈ {k-T+1, ..., k}}, which stores T frames of images.
  • S2. the video frame sequence is sent to the human body posture estimation and behavior recognition module, which estimates the human body posture and recognizes the behavior type from the 3-dimensional coordinates of key points; the behavior types comprise static behaviors and dynamic behaviors; if the recognition result is a static behavior, S3 is executed, otherwise S1 is executed.
  • this step may also include sending the current frame Frame_k to the desktop detection module for desktop pose detection.
  • S3. the sitting posture evaluation module receives both the human posture estimation result and the behavior type recognition result, evaluates the sitting posture based on the two, and gives corresponding prompts according to the evaluation result.
  • this step may also include the sitting posture evaluation module receiving the desktop pose detection result to assist the sitting posture evaluation.
  • when the behavior recognition result indicates that the recognized subject is in a relatively static state such as typing, writing, or reading, its sitting posture is evaluated.
  • sitting posture evaluation in the present invention may adopt, but is not limited to, the following methods: 1) entering the standard sitting posture in advance and calculating the similarity of each joint vector between the current sitting posture and the standard sitting posture; 2) judging the distance between the head and the desktop; 3) treating it as a classification task and training a neural network to discriminate. A sketch of method 1) follows.
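A hedged sketch of evaluation method 1), cosine similarity between the joint vectors of the current pose and a pre-recorded standard pose. The skeleton edges and the alert threshold are illustrative assumptions; the patent only specifies that per-joint vector similarities are computed:

```python
import numpy as np

# (parent, child) pairs over the 11 key point ids; an assumed upper-body topology
SKELETON = [(6, 8), (8, 10), (7, 9), (9, 11), (6, 7)]

def joint_vectors(kpts):
    """kpts: (11, 3) array of 3-D key point coordinates, rows 0..10 for ids 1..11."""
    return np.stack([kpts[c - 1] - kpts[p - 1] for p, c in SKELETON])

def posture_similarity(current, standard):
    a, b = joint_vectors(current), joint_vectors(standard)
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    return float(cos.mean())  # 1.0 means identical joint directions

# usage: the standard posture is entered in advance; remind the user when the
# similarity drops below an assumed threshold, e.g.
# if posture_similarity(current_pose, standard_pose) < 0.9: remind_user()
```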
  • for human body posture estimation and behavior recognition, it is preferable to adopt a multi-task end-to-end network structure as the human body posture estimation and behavior recognition module.
  • compared with the commonly used staged, multi-task networks, such a structure can use the posture estimation results more precisely to help improve behavior recognition accuracy; and since sitting posture recognition accuracy depends largely on the accuracy of human posture estimation and of behavior recognition, the accuracy of sitting posture recognition can be improved further.
  • in the previously used staged, cascaded recognition algorithms, the input is only the human posture, and such input features cause behaviors with similar postures to be confused with one another during recognition; for example, the postures of drinking water and smoking are very similar.
  • the human body posture estimation and behavior recognition module further includes a low-level feature extraction sub-module and at least one level of estimation and recognition working groups.
  • the low-level feature extraction sub-module is mainly used to process each frame image in the video frame sequence into a low-level feature map.
  • each estimation and recognition working group includes a three-dimensional pose estimation part and a behavior recognition part working in parallel.
  • the three-dimensional pose estimation part of the first-level working group takes the low-level feature map as its input feature and outputs a human pose estimation result; its behavior recognition part takes the same-level human pose estimation result and the low-level feature map as input features and outputs a behavior recognition result.
  • the three-dimensional pose estimation parts of the other working groups all take the low-level feature map and the previous level's human pose estimation result as input features and output a human pose estimation result; their behavior recognition parts take the same-level human pose estimation result and the previous level's behavior recognition result as input features and output a behavior recognition result.
  • the human body posture estimation and behavior recognition module outputs the human pose estimation result and the behavior recognition result obtained by the last-level working group.
  • the present invention introduces a re-injection mechanism between the estimation and recognition working groups at all levels, and between the three-dimensional pose estimation part and the behavior recognition part within each working group, which significantly improves the accuracy of the pose estimation and behavior recognition results.
  • the low-level feature extraction submodule is the input part of the network, that is, the stem of the network.
  • the T-frame video frame sequence is resized to the same size and then sent to the network.
  • the output of this part is a low-level feature.
  • the inventive concept underlying the present invention is to use as few convolutional layers as possible to compress the features into the desired shape. The emphasis is on network efficiency; the features extracted at this stage are not required to have strong fitting ability. To improve the effectiveness of these features, the present invention introduces a re-injection mechanism into the network to refine them.
  • the pose estimation part and the behavior recognition part also use specially designed network structures to model the spatial and temporal domains separately, which is elaborated in the following content.
  • the present invention is based on the Resnet bottleneck layer of the residual network, and optimizes the network structure to improve the speed of the network.
  • the original network's 1×1 convolution is replaced with a 1×1 group convolution (1×1 groupconv) plus channel shuffle, which implements the function of the 1×1 convolution while reducing computation; the 3×3 convolution is replaced with a 3×3 depthwise convolution (depthwise conv) with stride 2, which likewise reduces computation.
  • the final addition operation is changed to a channel concatenation (concat) operation, and each identity mapping undergoes a max pooling operation with stride 2.
  • the above optimized design ensures that the original image can be brought to the desired feature map shape through only a few modified bottleneck layers; a sketch of such a layer follows.
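A sketch of the modified bottleneck layer described above, assuming PyTorch: a 1×1 group convolution followed by channel shuffle, a stride-2 3×3 depthwise convolution, and a max-pooled identity branch joined by concatenation instead of addition. The group count and the placement of normalization and activation are assumptions, since Figure 3's exact layout is not reproduced here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class FastBottleneck(nn.Module):
    """Halves resolution; output channels = in_ch (identity) + branch_ch (concat)."""
    def __init__(self, in_ch, branch_ch, groups=3):  # groups must divide both channel counts
        super().__init__()
        self.pw1 = nn.Conv2d(in_ch, branch_ch, 1, groups=groups, bias=False)
        self.dw = nn.Conv2d(branch_ch, branch_ch, 3, stride=2, padding=1,
                            groups=branch_ch, bias=False)   # 3x3 depthwise, stride 2
        self.pw2 = nn.Conv2d(branch_ch, branch_ch, 1, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(branch_ch)
        self.groups = groups

    def forward(self, x):
        identity = F.max_pool2d(x, 3, stride=2, padding=1)   # stride-2 identity mapping
        y = channel_shuffle(self.pw1(x), self.groups)        # 1x1 group conv + shuffle
        y = F.relu(self.bn(self.pw2(self.dw(y))))
        return torch.cat([identity, y], dim=1)               # concat replaces addition
```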
  • in addition, the present invention introduces the re-injection mechanism (re-injection) into both the three-dimensional pose estimation and the behavior recognition, forming the structure of the entire human body posture estimation and behavior recognition module, as shown in Figure 2.
  • each three-dimensional pose estimation module adds the low-level features and the features of the previous three-dimensional pose estimation module as its input features; each behavior recognition module adds its current input features and the pre-global-pooling features of the previous behavior recognition module as its new input features.
  • through this re-injection mechanism, the features are continuously refined, and the network's results gradually become more accurate.
  • the three-dimensional pose estimation unit is used to perform: a heat map extraction step and a heat map decoding step.
  • the heat map extraction step is executed once or stacked multiple times.
  • the present invention defines 3-dimensional pose estimation in the sitting state as the 3-dimensional coordinates of 11 key points; once these coordinates are determined, the human posture can be connected according to the topology of the human body.
  • the 11 key points are: left eye (1), right eye (2), nose (3), left mouth corner (4), right mouth corner (5), left shoulder (6), right shoulder (7), left elbow (8), right elbow (9), left wrist (10), and right wrist (11), as shown in Figure 4.
  • the structure of the 3-dimensional pose estimation part is likewise optimized based on Resnet, and a new network structure, SACAM (spatial attention and channel attention module), is proposed.
  • in this structure, max pooling is performed along the channel dimension, and a 3×3 convolution is applied to the pooled result to obtain spatial attention, i.e., weights for different pixel positions, which refine the features.
  • an SE layer is then introduced to learn the weights of different channels, i.e., channel-level attention, which re-refines the features of the different channels.
  • the SACAM structure is shown in Figure 5.
  • since the preceding low-level feature extraction stage has already brought the feature map to the required resolution, no down-sampling is performed in the SACAM block: all convolutions have stride 1, and the pooling operations serve only to extract attention.
  • the resolutions of the SACAM input and output feature maps remain the same; a sketch of the block follows.
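A sketch of the SACAM block under the description above, assuming PyTorch: channel-wise max pooling plus a stride-1 3×3 convolution for spatial attention, then an SE layer for channel attention. The sigmoid gating and the SE reduction ratio are assumptions; Figure 5 may specify further details not reproduced here:

```python
import torch
import torch.nn as nn

class SACAM(nn.Module):
    def __init__(self, channels, se_reduction=8):
        super().__init__()
        self.spatial_conv = nn.Conv2d(1, 1, 3, padding=1)   # stride 1, keeps resolution
        self.se = nn.Sequential(                             # SE layer: channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // se_reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // se_reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        pooled, _ = x.max(dim=1, keepdim=True)                # max pool along the channel dim
        spatial_w = torch.sigmoid(self.spatial_conv(pooled))  # per-pixel spatial weights
        x = x * spatial_w                                     # refine features spatially
        return x * self.se(x)                                 # re-refine per channel
```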
  • in the heat map decoding step, the pose estimation features pass through one or more stacked SACAM structures to generate a key point heat map; global max pooling over its depth dimension yields the two-dimensional heat map Hxy of size (hx, hy, hk), and global max pooling over its first two dimensions yields the depth heat map Hz of size (hz, hk). soft-argmax is then used to parse the two-dimensional key point coordinates and the depth coordinate from these two heat maps, which together form the three-dimensional key point coordinates.
  • traditional algorithms often use argmax to obtain coordinate values from the heat map, but the result of this operation is not differentiable, which breaks the back-propagation chain.
  • soft-argmax is used instead; it essentially defines the event as the maximum value falling at coordinate (x, y), so the heat maps Hxy and Hz naturally become the corresponding probability mass functions, and finding the coordinate of the maximum value becomes computing an expectation. The formula is as follows:
  • soft-argmax(x) = Σ_i i · exp(x_i) / Σ_j exp(x_j)
  • where x is the input image; i and j index positions in the image; x_i is the pixel value at position i of image x, and likewise for x_j; the output of the function is the coordinate of the maximum value of the image. A sketch of this decoding follows.
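A sketch of soft-argmax decoding, assuming PyTorch: the heat map is normalized by softmax into a probability mass function and the coordinate is recovered as its expectation, which keeps the operation differentiable, unlike argmax:

```python
import torch

def soft_argmax_1d(heat):
    """heat: (..., L) scores along one axis; returns the expected index."""
    prob = torch.softmax(heat, dim=-1)  # p_i = exp(x_i) / sum_j exp(x_j)
    idx = torch.arange(heat.shape[-1], dtype=heat.dtype, device=heat.device)
    return (prob * idx).sum(dim=-1)     # E[i] = sum_i i * p_i

def decode_xy(hxy):
    """hxy: (hk, hx, hy) per-key-point 2-D heat map; returns (hk, 2) coordinates."""
    hk, hx, hy = hxy.shape
    prob = torch.softmax(hxy.reshape(hk, -1), dim=-1).reshape(hk, hx, hy)
    xs = torch.arange(hx, dtype=hxy.dtype, device=hxy.device)
    ys = torch.arange(hy, dtype=hxy.dtype, device=hxy.device)
    ex = (prob.sum(dim=2) * xs).sum(dim=1)  # expectation over x
    ey = (prob.sum(dim=1) * ys).sum(dim=1)  # expectation over y
    return torch.stack([ex, ey], dim=-1)

# the depth coordinate is decoded the same way from Hz, e.g.
# z = soft_argmax_1d(hz.transpose(0, 1))  # hz: (hz, hk) -> (hk,) depths
```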
  • the behavior recognition unit is used to perform the behavior recognition model building step, the recognition input feature building step, the behavior recognition step and the classification step.
  • when designing the model, short-term information and long-term information are modeled separately from the behavior recognition input features, and the two models are connected in series to form the recognition model.
  • the SRLRTM block structure is designed for the shape of the input feature, and the short-term information and the long-term information can be modeled by using ordinary 2-dimensional convolution.
  • SRLRTM is divided into two parts. The left half of SRLRTM models short-term information: it uses a 1×1 convolution to enhance the flow of information between channels and reduce the number of channels, and an hk×3 convolution to model short-term information; because the second dimension of the features represents time T, setting the second dimension of the convolution kernel to 3 models 3 adjacent frames.
  • channel max pooling is then performed to obtain a spatio-temporal attention, which is correlated with the identity-mapped features to obtain locally enhanced features; to preserve the completeness of the information, a skip connection adds the original features and the locally enhanced features.
  • the right half of SRLRTM models long-term information: its first 1×1 convolution likewise enhances the flow of information between channels and reduces the number of channels, and the hk×T convolution models all T frames simultaneously; together with a 1×1 convolution it yields a channel attention, which is multiplied with the identity-mapped features in the channel dimension to obtain globally enhanced features, which are then added to the identity-mapped features to retain the original information.
  • the left half and the right half are connected in series to form one SRLRTM block. After multiple stacked SRLRTM blocks, a global max pooling layer, a fully connected layer, and then a softmax are connected to obtain the recognition classification result; a sketch of one block follows.
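A sketch of one SRLRTM block under the description above, assuming PyTorch, inputs laid out as (N, C, hk, T) with C = 3 + hd, and an assumed channel reduction ratio; the exact channel counts of Figure 8 are not reproduced here:

```python
import torch
import torch.nn as nn

class SRLRTM(nn.Module):
    def __init__(self, channels, hk, T, reduction=4):
        super().__init__()
        mid = channels // reduction
        # left half: short-term modelling over 3 adjacent frames
        self.short = nn.Sequential(
            nn.Conv2d(channels, mid, 1),                        # mix channels, reduce count
            nn.Conv2d(mid, channels, (hk, 3), padding=(0, 1)),  # hk x 3 kernel: 3 frames
        )
        # right half: long-term modelling over all T frames -> channel attention
        self.long = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.Conv2d(mid, mid, (hk, T)),                       # hk x T kernel: whole clip
            nn.Conv2d(mid, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (N, C, hk, T)
        s = self.short(x)                          # (N, C, 1, T)
        att_st = torch.sigmoid(s.max(dim=1, keepdim=True)[0])  # channel max pool: spatio-temporal attention
        x = x + x * att_st                         # local enhancement + skip connection
        att_ch = self.long(x)                      # (N, C, 1, 1) channel attention
        return x + x * att_ch                      # global enhancement, original info kept

# classification head after stacked blocks (an assumption of the head layout):
# global max pool -> fully connected layer -> softmax over behavior classes
```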
  • This step is mainly used to extract pose estimation features and scene context features, and stitch the two to form behavior recognition input features.
  • the input of the behavior recognition part includes two parts, one is the result of pose estimation, and the other is the low-level features extracted by the low-level feature extraction sub-module.
  • the appearance of the human body and the environmental context are combined to perform behavior recognition, which solves the problem that judging behavior by posture alone is not accurate enough.
  • since the pose estimation results are sets of three-dimensional key point coordinates, their format needs to be converted to facilitate processing by the network: the time dimension is taken as the horizontal axis, the key point category as the vertical axis, and the x, y, and z coordinates of the 3-dimensional key points correspond to 3 channels.
  • such features can be processed directly by ordinary 2-dimensional convolution. The feature is shown in Figure 7, and its shape is (hk, T, 3).
  • this embodiment takes the outer product of the low-level features and the heat map.
  • the heat map Hxy has size (hx, hy, hk), that is, (hw, hh, hk), since the path from the low-level features to the heat map Hxy involves no down-sampling; the low-level feature is denoted F, with size (hw, hh, hd), where hd is the number of channels.
  • the outer product of two vectors can express their similarity and reflect their lengths, and the outer product of matrices is essentially the outer product of corresponding columns.
  • the purpose of computing the outer product in this embodiment is to extract, using all key point positions at once, the human body appearance information and context information at the locations indicated by the heat map.
  • per frame, the resulting feature is flattened to shape (hk*hd); the features of the T video frames are then concatenated to obtain the human body appearance and scene context feature Representf of shape (T, hk*hd); the second dimension is split and the order adjusted so that the final feature shape is (hk, T, hd).
  • the shape of the pose estimation feature is (hk, T, 3) and the shape of the human body appearance and scene context feature is (hk, T, hd); since their first two dimensions are the same, they are concatenated along the channel dimension to form the behavior recognition input feature of shape (hk, T, 3+hd). A sketch of this construction follows.
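A sketch of this input feature construction, assuming PyTorch tensors; the einsum contracts over all spatial positions, which matches computing the matrix outer product of the flattened heat map and feature map per frame:

```python
import torch

def build_recognition_input(pose_seq, hxy_seq, feat_seq):
    """pose_seq: (T, hk, 3); hxy_seq: (T, hw, hh, hk); feat_seq: (T, hw, hh, hd)."""
    # outer product over spatial positions: context[t, k, d] = sum_wh Hxy * F
    context = torch.einsum('twhk,twhd->tkd', hxy_seq, feat_seq)  # (T, hk, hd)
    context = context.permute(1, 0, 2)          # (hk, T, hd): appearance + scene context
    pose = pose_seq.permute(1, 0, 2)            # (hk, T, 3): x, y, z as 3 channels
    return torch.cat([pose, context], dim=-1)   # (hk, T, 3 + hd) recognition input
```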
  • the recognition input feature is input into the recognition model to obtain the recognition classification result.
  • for the sitting state, the recognition results are divided into static behaviors and dynamic behaviors, where dynamic behaviors include, but are not limited to: stretching, standing up, sitting down, reaching for things, shaking the head, turning around, making a phone call, and talking with others.
  • Static behavior includes, but is not limited to: writing, typing, and reading.
  • a desktop three-dimensional detection method is preferably used as the desktop detection module. It obtains the desktop pose and position relying only on monocular images and requires no additional sensors. Since the pose and position of the desktop are known, the relative relationship between the person and the desktop can be obtained, which improves the accuracy of sitting posture evaluation; previous sitting posture monitoring algorithms have not considered using this information.
  • the desktop detection module further includes a plane area detection unit and a plane three-dimensional parameter inference unit.
  • the plane area detection unit is realized by the Mask-RCNN network.
  • the Mask-RCNN network is an instance segmentation model, which can determine the location and category of each target in the picture, and give pixel-level predictions.
  • the desktop detection frame, the desktop segmentation result, and the output feature map of its backbone network obtained by Mask-RCNN are used as the input of the plane 3D parameter inference unit.
  • the plane 3D parameter inference unit first uses ROIAlign to extract the features of the desktop and regresses the normal vector of the desktop.
  • ROIAlign finds the position of the corresponding desktop detection frame on the backbone network's output feature map and then performs bilinear interpolation to obtain the desktop features.
  • the plane 3D parameter inference unit also decodes the backbone network feature map to obtain a global depth map.
  • the decoding part adopts bilinear interpolation to make the predicted depth map and the input monocular image reach the same resolution.
  • the plane 3D parameter inference unit uses the desktop segmentation mask to extract the desktop depth from the global depth map.
  • the extraction operation is implemented using an AND operation; the desktop position vector is then regressed through a max pooling layer and two 1×1 convolutions. Finally, the desktop normal vector and the desktop position vector are used as the output of the desktop detection module; a sketch of this head follows.
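A sketch of the inference tail described above, assuming PyTorch; the ROIAlign features are taken as given (e.g., from a Mask-RCNN backbone), and the layer widths plus the mask-multiplication reading of the "AND operation" are assumptions:

```python
import torch
import torch.nn as nn

class PlaneParamHead(nn.Module):
    def __init__(self, roi_ch=256):
        super().__init__()
        self.normal_head = nn.Sequential(   # regress the desktop normal from ROIAlign features
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(roi_ch, 3))
        self.pos_head = nn.Sequential(      # max pool + two 1x1 convs -> position vector
            nn.MaxPool2d(4),
            nn.Conv2d(1, 8, 1), nn.ReLU(inplace=True), nn.Conv2d(8, 3, 1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, roi_feat, depth_map, desk_mask):
        """roi_feat: (N, roi_ch, h, w); depth_map, desk_mask: (N, 1, H, W)."""
        normal = self.normal_head(roi_feat)       # (N, 3) desktop normal vector
        desk_depth = depth_map * desk_mask        # AND-style extraction of desktop depth
        position = self.pos_head(desk_depth)      # (N, 3) desktop position vector
        return normal, position
```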
  • a sitting posture evaluation method is preferably used as a sitting posture evaluation module.
  • in the algorithm preparation stage, the standard sitting posture is entered in advance.
  • the algorithm calculates the similarity of each joint vector between the current sitting posture and the standard sitting posture, the angle between the line connecting the left and right eyes and the desktop normal vector, and the distance between the nose and the desktop position.
  • the sitting posture is graded as standard, slightly incorrect, severely incorrect, and so on, and users are reminded according to the level; a sketch of such an evaluation follows.
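A hedged sketch of this preferred evaluation, reusing the `posture_similarity` helper from the earlier sketch; all thresholds, and the choice that a healthy eye line is roughly perpendicular to the desktop normal, are illustrative assumptions:

```python
import numpy as np

def evaluate_sitting(kpts, desk_normal, desk_pos, standard_kpts):
    """kpts: (11, 3) key points (ids 1..11); desk_normal, desk_pos: (3,) vectors."""
    sim = posture_similarity(kpts, standard_kpts)     # joint vector similarity
    eye_line = kpts[1] - kpts[0]                      # right eye (2) minus left eye (1)
    cos_a = np.dot(eye_line, desk_normal) / (
        np.linalg.norm(eye_line) * np.linalg.norm(desk_normal) + 1e-8)
    eye_angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    nose_dist = np.linalg.norm(kpts[2] - desk_pos)    # nose (3) to desktop position
    level_ok = abs(eye_angle - 90.0) < 10.0           # head not tilted (assumed bound)
    if sim > 0.9 and nose_dist > 0.35 and level_ok:   # metres; thresholds assumed
        return "standard"
    if sim > 0.75 and nose_dist > 0.25:
        return "slightly incorrect"
    return "severely incorrect"
```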
  • the human body pose estimation and behavior recognition module starts to work; the low-level feature extraction sub-module adopts the bottleneck layer structure shown in Figure 3, stacked 4 times.
  • the output low-level feature resolution is 32 ⁇ 32, and the number of channels is increased from 3 to 576.
  • the first bottleneck layer expands the channels to 12, the second to 48, the third to 192, and the fourth to 576.
  • the posture estimation unit sends each of the T video frames into the SACAM structure for three-dimensional pose estimation.
  • the SACAM blocks are stacked 5 times, all with convolution stride 1, and the pose estimation features are obtained.
  • the pose estimation features are sent to the heat map decoding module to obtain Pxy, Pz, and Conf, each with 11 channels corresponding to the 11 key points. Since the T video frames are processed separately, body pose results for T frames are obtained here.
  • each 3D pose estimation module adds the low-level features and the previous 3D pose estimation module's features as its input features, and each behavior recognition module adds its current input features and the previous behavior recognition module's pre-global-pooling features as its new input features, improving the accuracy of network recognition.
  • Desktop detection is essentially a plane detection problem. The purpose is to obtain the position and posture of the desktop from the image.
  • the desktop detection module performs 3D plane detection on the monocular image, and obtains the depth map and normal vector describing each plane as the position information and posture information of the plane. Then according to the placement of the camera, search upwards from the bottom of the image to determine the range of the desktop.
  • when the behavior recognition unit recognizes that the subject is in a relatively static state such as typing, writing, or reading, it evaluates the subject's sitting posture and gives corresponding prompts based on the evaluation results.
  • the disclosed technical content can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units may be a logical function division.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of the present invention, in essence, or the part that contributes beyond the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium.
  • a computer device which may be a personal computer, a server, or a network device, etc.
  • the aforementioned storage media include: USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard disks, magnetic disks, optical disks, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a sitting posture recognition method based on a monocular video image sequence. First, a video frame is acquired from a monocular camera; then, on one hand, the current video frame is sent to a desktop detection module to detect the desktop position and pose, and on the other hand, the current frame is used to update a video frame sequence, which is sent to a human pose estimation and behavior recognition module to obtain the three-dimensional human pose and the current behavior category. Whether the current behavior category is a static behavior is then determined: if so, the human pose, the current behavior category, and the desktop position and pose are sent to a sitting posture evaluation module to evaluate the current sitting posture; otherwise, processing proceeds directly to the next frame. With the method provided in the present invention, a three-dimensional human pose can be acquired directly from monocular images; the use of a multi-frame image sequence resists occlusion and lighting changes, giving good robustness; non-static behaviors are filtered out by means of behavior recognition; and accuracy is improved by incorporating desktop pose information.

Description

A Sitting Posture Recognition Method Based on a Monocular Video Image Sequence

Technical Field

The present invention relates to the fields of video image processing, computer vision, and human body posture recognition, and in particular to a sitting posture recognition method based on a monocular video image sequence.

Background Art

As the pace of life continues to accelerate, people spend most of their day working and studying. Maintaining a nonstandard sitting posture for a long time easily fosters bad habits such as hunching and twisting of the body; in severe cases it causes diseases such as cervical spondylosis, lumbar disc herniation, and myopia, doing irreversible damage to the body and thereby greatly affecting daily study, work, and life. Sitting posture recognition algorithms typically use sensors to extract the upper-body posture of the subject and, based on a measure of how standard the sitting posture is, help users correct an incorrect posture in time to protect their health.
The non-contact sensors on which current sitting posture recognition algorithms are based fall mainly into the following types:

Ultrasonic sensors. Ultrasound places certain requirements on the measurement surface. If the surface density is low, the ultrasound penetrates the object and produces multiple echoes; if the surface is uneven, the ultrasound is scattered and likewise produces multiple echoes; if the surface is inclined, the ultrasound is not reflected correctly; and if the surface is too small, not enough ultrasound is reflected back. Ultrasonic measurement therefore performs poorly.

Binocular vision sensors. These sensors have demanding manufacturing requirements, are very sensitive to ambient light, perform poorly in texture-poor scenes, and have high computational complexity; the camera baseline limits the measurement range, and blind spots exist in use.

Monocular vision sensors. Hardware cost is low, but generally only two-dimensional information can be obtained, so sitting posture recognition is not as effective as with a binocular camera; recognition is less robust to occlusion, sudden lighting changes, and similar conditions; and a pinhole imaging model plus additional prior knowledge is required to recover three-dimensional information.

In addition, most sitting posture recognition methods consider only the relatively static behaviors of typing, writing, and reading, but in actual application scenarios the recognized subject may also exhibit dynamic behaviors such as stretching, swinging the head, drinking water, and answering the phone. When such dynamic behaviors occur, they are easily recognized as an incorrect sitting posture. Existing sitting posture recognition methods also do not incorporate desktop position information for the specific scene, which severely limits improvements in recognition accuracy.
Summary of the Invention

The present invention is proposed in view of at least one of the above-mentioned problems.

The present invention focuses on sitting posture recognition methods based on a monocular vision sensor, in particular on a monocular video image sequence. It aims to improve the accuracy of such methods, as well as their robustness under abnormal usage conditions such as occlusion and sudden lighting changes.

The present invention is also based on behavior recognition. Practical applications have shown that it improves recognition accuracy when the recognized subject exhibits dynamic behavior, without requiring additional external detection results. In addition, it can adaptively incorporate desktop position information during recognition.

The purpose of the present invention is to provide a sitting posture recognition method based on a monocular video image sequence, comprising:

S1. Obtaining the current video frame from a monocular camera and updating a video frame sequence of fixed capacity;

S2. Sending the video frame sequence to a human body posture estimation and behavior recognition module, which estimates the human body posture and recognizes the behavior type from the three-dimensional coordinates of key points, the behavior types comprising static behaviors and dynamic behaviors; if the recognized behavior type is a static behavior, executing S3, otherwise executing S1;

S3. A sitting posture evaluation module receiving both the human posture estimation result and the behavior type recognition result, evaluating the sitting posture based on the two, and giving corresponding prompts according to the evaluation result.
Compared with the prior art, the present invention has the following advantages:

1. The present invention is developed on the basis of a monocular video image sequence, uses a multi-task end-to-end network structure to realize human body posture estimation and behavior recognition, and uses the accurate posture estimation results to improve behavior recognition accuracy.

2. In the behavior recognition process, the present invention uses the low-level feature map results to obtain environmental context information related to the posture, further improving the accuracy of recognition between similar behaviors.

3. When acquiring 3-dimensional posture information, the present invention adds an attention mechanism in the spatial domain and improves the accuracy of the posture key points through spatial context.

4. The present invention combines the desktop position and posture information of the actual scene to perform sitting posture evaluation, improving sitting posture recognition accuracy.

For the above reasons, the present invention can be widely used in office equipment and teaching equipment.
Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.

Figure 1 is a flowchart of the sitting posture recognition method of the present invention.

Figure 2 is a schematic diagram of the structure of the human body posture estimation and behavior recognition module of the present invention.

Figure 3 is a schematic diagram of the structure of the low-level feature extraction sub-module in the embodiment.

Figure 4 is a schematic diagram of the distribution of the 11 key points in a sitting state in the embodiment.

Figure 5 is a schematic diagram of the SACAM network structure in the embodiment.

Figure 6 is a flowchart of posture estimation heat map decoding in an embodiment.

Figure 7 is a schematic diagram of the pose estimation results of a video sequence as input to the behavior recognition part in an embodiment.

Figure 8 is a schematic diagram of the SRLRTM network structure in the embodiment.
Detailed Description

In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way can be interchanged under appropriate circumstances so that the embodiments of the present invention described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.
Figure 1 shows a sitting posture recognition method based on a monocular video image sequence provided by the present invention; its process for sitting posture evaluation includes the following steps:

S1. Obtain the current video frame Frame_k from the monocular camera and update the video frame sequence VideoClip = {Frame_i | i ∈ {k-T+1, ..., k}}, which can store T frames of images.

S2. Send the video frame sequence to the human body posture estimation and behavior recognition module, which estimates the human body posture and recognizes the behavior type from the 3-dimensional coordinates of key points; the behavior types comprise static behaviors and dynamic behaviors. If the recognition result is a static behavior, execute S3, otherwise execute S1. In addition, this step may also include sending the current frame Frame_k to the desktop detection module for desktop pose detection.

S3. The sitting posture evaluation module receives both the human posture estimation result and the behavior type recognition result, evaluates the sitting posture based on the two, and gives corresponding prompts according to the evaluation result. Correspondingly, this step may also include the sitting posture evaluation module receiving the desktop pose detection result to assist the evaluation. When the behavior recognition result indicates that the recognized subject is in a relatively static state such as typing, writing, or reading, its sitting posture is evaluated. Sitting posture evaluation in the present invention may adopt, but is not limited to, the following methods: 1) entering the standard sitting posture in advance and calculating the similarity of each joint vector between the current sitting posture and the standard sitting posture; 2) judging the distance between the head and the desktop; 3) treating it as a classification task and training a neural network to discriminate.
In the present invention, it is preferable to adopt a multi-task end-to-end network structure for human body posture estimation and behavior recognition as the human body posture estimation and behavior recognition module. Compared with the commonly used staged, multi-task networks, it can use the posture estimation results more precisely to help improve behavior recognition accuracy; and since sitting posture recognition accuracy depends largely on the accuracy of human posture estimation and of behavior recognition, sitting posture recognition accuracy can be improved further. In the previously used staged, cascaded recognition algorithms, the input is only the human posture, and such input features cause behaviors with similar postures to be confused with one another during recognition; for example, the postures of drinking water and smoking are very similar.

To solve the above problems, the human body posture estimation and behavior recognition module further includes a low-level feature extraction sub-module and at least one level of estimation and recognition working groups. The low-level feature extraction sub-module is mainly used to process each frame image in the video frame sequence into a low-level feature map. Each estimation and recognition working group includes a three-dimensional pose estimation part and a behavior recognition part working in parallel. The three-dimensional pose estimation part of the first-level working group takes the low-level feature map as its input feature and outputs a human pose estimation result; its behavior recognition part takes the same-level human pose estimation result and the low-level feature map as input features and outputs a behavior recognition result. The three-dimensional pose estimation parts of the other working groups all take the low-level feature map and the previous level's human pose estimation result as input features and output a human pose estimation result; their behavior recognition parts take the same-level human pose estimation result and the previous level's behavior recognition result as input features and output a behavior recognition result. As a preferred embodiment of the present invention, the human body posture estimation and behavior recognition module outputs the human pose estimation result and the behavior recognition result obtained by the last-level working group. By introducing a re-injection mechanism between the working groups at all levels, and between the three-dimensional pose estimation part and the behavior recognition part within each working group, the present invention significantly improves the accuracy of the pose estimation and behavior recognition results.

Specifically, the low-level feature extraction sub-module is the input part of the network, i.e., the stem of the network; the T-frame video frame sequence is resized to a uniform size and then fed into this network, and the output of this part is a low-level feature. The inventive concept underlying the present invention is to use as few convolutional layers as possible to compress the features into the desired shape. The emphasis is on network efficiency; the features extracted at this stage are not required to have strong fitting ability. To improve the effectiveness of these features, the present invention introduces a re-injection mechanism (re-injection) into the network to refine them; at the same time, the pose estimation part and the behavior recognition part use specially designed network structures to model the spatial and temporal domains separately, which is elaborated below. The present invention takes the Resnet residual-network bottleneck layer as a basis and optimizes the network structure to improve the network's speed. As shown in Figure 3, preferably, the original network's 1×1 convolution is replaced with a 1×1 group convolution (1×1 groupconv) plus channel shuffle, which implements the function of the 1×1 convolution while reducing computation; the 3×3 convolution is replaced with a 3×3 depthwise convolution (depthwise conv) with stride 2, which likewise reduces computation. The final addition operation is changed to a channel concatenation (concat) operation, and each identity mapping undergoes a max pooling operation with stride 2. This optimized design ensures that the original image can be brought to the desired feature map shape through only a few modified bottleneck layers.

In addition, when realizing these functions, the present invention introduces the re-injection mechanism into both the three-dimensional pose estimation and the behavior recognition, forming the structure of the entire human body posture estimation and behavior recognition module, as shown in Figure 2. Each three-dimensional pose estimation module adds the low-level features and the features of the previous three-dimensional pose estimation module as its input features; each behavior recognition module adds its current input features and the pre-global-pooling features of the previous behavior recognition module as its new input features. Through this re-injection mechanism, the features are continuously refined, and the network's results gradually become more accurate.
In one embodiment of the present invention, the three-dimensional pose estimation part performs a heat map extraction step and a heat map decoding step, where the heat map extraction step is executed once or stacked multiple times.

Specifically, the present invention defines 3-dimensional pose estimation in the sitting state as the 3-dimensional coordinates of 11 key points; once these coordinates are determined, the human posture can be connected according to the topology of the human body. The 11 key points are: left eye (1), right eye (2), nose (3), left mouth corner (4), right mouth corner (5), left shoulder (6), right shoulder (7), left elbow (8), right elbow (9), left wrist (10), and right wrist (11), as shown in Figure 4.

In the heat map extraction step, the structure of the 3-dimensional pose estimation part is likewise optimized based on Resnet, and a new network structure, SACAM (spatial attention and channel attention module), is proposed. In this structure, max pooling is performed along the channel dimension, and a 3×3 convolution is applied to the pooled result to obtain spatial attention, i.e., weights for different pixel positions, which refine the features. An SE layer is then introduced to learn the weights of different channels, i.e., channel-level attention, which re-refines the features of the different channels. The SACAM structure is shown in Figure 5. Since the preceding low-level feature extraction stage has already brought the feature map to the required resolution, no down-sampling is performed in the SACAM block: all convolutions have stride 1, the pooling operations serve only to extract attention, and the resolutions of the SACAM input and output feature maps remain the same.

Further, in the heat map decoding step, after the pose estimation input features pass through one or more stacked SACAM structures, a key point heat map Heatmap of size (hw, hh, hc) is generated. It is converted to (hx, hy, hz, hk) by a reshape operation, where hx and hy correspond to the two-dimensional pose estimation result, hz is the key point depth value, and hk is the number of key point categories (set to 11 in this embodiment), with hc = hz*hk, hw = hx, and hh = hy.

Then, global max pooling over the third dimension of the Heatmap gives the heat map Hxy of size (hx, hy, hk), and global max pooling over the first two dimensions gives the heat map Hz of size (hz, hk). In this embodiment, soft-argmax is used to parse the two-dimensional key point coordinates and the depth coordinate from the two heat maps, which together form the three-dimensional key point coordinates. Traditional algorithms often use argmax to obtain coordinate values from the heat map, but the result of this operation is not differentiable, which breaks the back-propagation chain. The present invention instead uses soft-argmax, which essentially defines the event as the maximum value falling at coordinate (x, y); the heat maps Hxy and Hz then naturally become the corresponding probability mass functions, and finding the coordinate of the maximum value becomes computing an expectation. The formula is as follows:
soft-argmax(x) = Σ_i i · exp(x_i) / Σ_j exp(x_j)

where x is the input image; i and j index positions in the image; x_i is the pixel value at position i of image x, and likewise for x_j; the output of the function is the coordinate of the maximum value of the image.
对于关键点的置信度,我们对热图Hxy前两个维度做全局最大池化得到Cxy,对热图Hz的第一个维度做全局池化得到Cz,二者按通道相加,得到置信度Conf。整个姿态估计热图解码的流程如图6所示。For the confidence of key points, we do global maximum pooling on the first two dimensions of the heat map Hxy to get Cxy, and do global pooling on the first dimension of the heat map Hz to get Cz, and add the two according to the channel to get the confidence. Conf. The whole process of posture estimation heat map decoding is shown in Figure 6.
In a further embodiment of the present invention, the behavior recognition unit is configured to perform a behavior recognition model building step, a recognition input feature construction step, and behavior recognition and classification steps.
Behavior recognition model building step
When designing the model, the behavior recognition input features are used to model short-term information and long-term information separately, and the two models are connected in series to form the recognition model. As a further preferred embodiment, an SRLRTM block structure is designed around the shape of the input features, so that both short-term and long-term information can be modeled with ordinary 2-dimensional convolutions. As shown in Figure 8, SRLRTM consists of two parts. The left half of SRLRTM models short-term information. It uses a 1×1 convolution to enhance the flow of information between channels and reduce the channel count, while an hk×3 convolution models short-term information: since the second dimension of the features represents time T, setting the second dimension of the convolution kernel to 3 models 3 adjacent frames. Channel-wise max pooling then yields a spatiotemporal attention, which is correlated with the identity-mapping features to produce locally enhanced features; to preserve the completeness of the information, a skip connection adds the original features to the locally enhanced features. The right half of SRLRTM models long-term information. Its first 1×1 convolution likewise enhances inter-channel information flow and reduces the channel count; an hk×T convolution models all T frames at once and, combined with a 1×1 convolution, yields a channel attention, which is multiplied with the identity-mapping features along the channel dimension to obtain globally enhanced features, and these are added to the identity-mapping features to retain the original information. The left half and the right half connected in series form one SRLRTM block. After several stacked SRLRTM blocks, a global max pooling layer, a fully connected layer, and a softmax are attached to produce the recognition and classification result.
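A rough PyTorch sketch of one such block follows. The channel reduction ratio, the sigmoid gates, and the exact placement of the attention multiplications are assumptions; the description fixes only the 1×1, hk×3, and hk×T convolutions, the attention-style products with the identity-mapping features, and the residual additions.

```python
import torch
import torch.nn as nn

class SRLRTMBlock(nn.Module):
    """Rough sketch of one SRLRTM block over features shaped (N, C, hk, T)."""
    def __init__(self, channels, hk, t, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        # Left half: short-term modeling over 3 adjacent frames.
        self.short = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1),   # mix channels, reduce count
            nn.Conv2d(reduced, channels, kernel_size=(hk, 3),
                      padding=(0, 1)),                     # hk x 3: 3 neighboring frames
        )
        # Right half: long-term modeling over all T frames at once.
        self.long = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1),
            nn.Conv2d(reduced, channels, kernel_size=(hk, t)),  # hk x T kernel
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),                                  # channel attention gate
        )

    def forward(self, x):
        # Short-term branch: spatiotemporal attention plus skip connection.
        att = torch.sigmoid(self.short(x).max(dim=1, keepdim=True).values)
        x = x + x * att                # locally enhanced + original features
        # Long-term branch: channel attention plus residual addition.
        gate = self.long(x)            # (N, C, 1, 1)
        return x + x * gate            # globally enhanced + original information
```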
Recognition input feature construction step
This step mainly extracts pose estimation features and scene context features and concatenates the two to form the behavior recognition input features. The input of the behavior recognition part consists of two pieces: one is the pose estimation result, the other is the low-level features extracted by the low-level feature extraction sub-module. Combining human-body appearance with environmental context for behavior recognition, as done in this embodiment, addresses the problem that judging behavior from posture alone is not accurate enough.
The pose estimation result needs a format conversion so that the network can process it conveniently. In this embodiment, the time dimension is taken as the horizontal axis and the key point category as the vertical axis, with the x, y, z coordinates of the 3-dimensional key points mapped to 3 channels; features in this layout can be processed directly with ordinary 2-dimensional convolutions. The feature, shown in Figure 7, has shape (hk, T, 3).
For the human-body appearance and scene context features, this embodiment extracts them by taking the outer product of the low-level features and the heat map. Specifically, the extracted heat map Hxy has shape (hx, hy, hk), i.e. (hw, hh, hk), and since no down-sampling occurs between the low-level features and the heat map Hxy, the low-level features, denoted F, have shape (hw, hh, hd), where hd is the number of channels. The outer product is computed between each channel of Hxy and each channel of F, giving a result of shape (hx, hy, hk*hd). Because the magnitude of the outer product of two vectors equals the area of the parallelogram they span, its result reflects both the similarity of the two vectors and their lengths, and the outer product of matrices is in essence the outer products of their corresponding columns. The purpose of computing the outer product here is to use all key point positions at one moment to extract human-body appearance information and context information from the heat map. After the outer product is obtained, global average pooling over the first two axes turns the feature shape into (hk*hd); the features of the T video frames are then concatenated to obtain the human-body appearance and scene context feature Representf of shape (T, hk*hd), whose second axis is split and reordered so that the final feature shape is (hk, T, hd). Since the pose estimation features have shape (hk, T, 3) and the appearance and scene context features have shape (hk, T, hd), their first two dimensions match, and they are concatenated along the channel axis to form the behavior recognition input features of shape (hk, T, 3+hd).
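The construction can be sketched as follows, assuming PyTorch tensors; the einsum is one possible realization of the channel-wise outer product followed by global average pooling, and the exact tensor layout is an assumption.

```python
import torch

def build_recognition_input(hxy_seq, feat_seq, pose_seq):
    """Hedged sketch of the input-feature construction for one clip.

    hxy_seq:  (T, hx, hy, hk) per-frame keypoint heat maps
    feat_seq: (T, hw, hh, hd) per-frame low-level features (hw=hx, hh=hy)
    pose_seq: (T, hk, 3)      per-frame 3D keypoint coordinates
    """
    # Channel-wise outer product: every heat-map channel against every
    # feature channel, summed over the spatial axes.
    rep = torch.einsum('txyk,txyd->tkd', hxy_seq, feat_seq)  # (T, hk, hd)
    rep = rep / (hxy_seq.shape[1] * hxy_seq.shape[2])        # global average pooling
    rep = rep.permute(1, 0, 2)                               # (hk, T, hd)
    pose = pose_seq.permute(1, 0, 2)                         # (hk, T, 3)
    return torch.cat([pose, rep], dim=-1)                    # (hk, T, 3+hd)
```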
Behavior recognition and classification steps
In this step, the recognition input features are fed into the recognition model to obtain the recognition and classification result. In this embodiment, the recognition results are divided into static behaviors and dynamic behaviors with respect to the sitting state. Dynamic behaviors include, but are not limited to: stretching, standing up, sitting down, reaching for objects, shaking the head, turning around, making a phone call, and talking with others. Static behaviors include, but are not limited to: writing, typing, and reading.
In the present invention, a desktop three-dimensional detection method is preferably used as the desktop detection module. It obtains the desktop pose and position from monocular images alone, without additional sensors. Knowing the pose and position of the desktop makes it possible to derive the relative relationship between the person and the desktop, which improves the accuracy of the sitting posture evaluation. Previous sitting posture monitoring algorithms did not consider making use of this information.
The desktop detection module further includes a plane area detection unit and a plane three-dimensional parameter inference unit.
The plane area detection unit is implemented with a Mask-RCNN network. Mask-RCNN is an instance segmentation model that can determine the position and category of each target in an image and give pixel-level predictions. The desktop detection box, the desktop segmentation result, and the output feature map of the backbone network produced by Mask-RCNN serve as the input of the plane three-dimensional parameter inference unit.
The plane three-dimensional parameter inference unit first uses ROI Align to extract the desktop features and regress the desktop normal vector. ROI Align locates the corresponding desktop detection box on the backbone output feature map and performs bilinear interpolation there to obtain the desktop features.
The plane three-dimensional parameter inference unit also decodes the backbone feature map to obtain a global depth map. The decoding part uses bilinear interpolation so that the predicted depth map reaches the same resolution as the input monocular image.
The plane three-dimensional parameter inference unit uses the desktop segmentation mask to extract the desktop depth from the global depth map. The extraction is implemented with an AND operation, after which the desktop position vector is regressed through one max pooling layer and two 1x1 convolutions. Finally, the desktop normal vector and the desktop position vector are the output of the desktop detection module.
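A minimal sketch of this extraction and regression head follows, assuming PyTorch; the mask multiplication stands in for the AND operation, and the pooling output size and hidden channel count are invented, since the text does not specify them.

```python
import torch
import torch.nn as nn

class DeskPositionHead(nn.Module):
    """Sketch of the desktop position regression head; sizes are assumed."""
    def __init__(self, hidden=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveMaxPool2d(8),                  # the max pooling layer
            nn.Conv2d(1, hidden, kernel_size=1),      # first 1x1 convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 3, kernel_size=1),      # second 1x1 conv -> 3D position
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, depth, mask):
        # AND operation: keep only depth values inside the desktop mask.
        desk_depth = depth * (mask > 0.5).float()     # (N, H, W)
        return self.head(desk_depth.unsqueeze(1)).flatten(1)  # (N, 3)
```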
In the present invention, a sitting posture evaluation method is preferably used as the sitting posture evaluation module. In the algorithm preparation stage, a standard sitting posture is recorded in advance. While the algorithm runs, it computes the similarity between each joint vector of the current sitting posture and of the standard sitting posture, the angle between the left-right eye line and the desktop normal vector, and the distance between the nose and the desktop position. According to the magnitude of the deviation, the sitting posture is classified as standard, slightly incorrect, or severely incorrect, and the user is reminded according to the level.
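The evaluation logic can be sketched as below; the joint-vector similarity measure, the tilt computation, and in particular all numeric thresholds are invented for illustration and would need to be tuned against the pre-recorded standard sitting posture.

```python
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def evaluate_posture(current, standard, eye_l, eye_r, nose, desk_n, desk_p):
    """Hedged sketch of the evaluation; all thresholds are invented.

    current/standard map joint names to joint vectors; eye_l, eye_r, nose
    are 3D keypoints; desk_n and desk_p come from the desktop module.
    """
    # Similarity of each joint vector to the pre-recorded standard posture.
    worst_sim = min(cos_sim(current[j], standard[j]) for j in standard)
    # Head tilt: the eye line should be roughly perpendicular to the
    # desktop normal, so measure the deviation from 90 degrees.
    angle = np.degrees(np.arccos(abs(cos_sim(eye_r - eye_l, desk_n))))
    tilt = abs(90.0 - angle)
    # Distance between the nose and the desktop position (meters, assumed).
    dist = float(np.linalg.norm(nose - desk_p))

    if worst_sim > 0.95 and tilt < 10 and dist > 0.30:   # example thresholds
        return 'standard'
    if worst_sim > 0.85 and tilt < 20 and dist > 0.20:
        return 'slightly incorrect'
    return 'severely incorrect'
```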
A specific application example further illustrates the scheme of the present invention below.
1. 512×512 video frames are continuously obtained from the monocular camera and processed in two ways: a) a video queue of capacity T=10 is updated, and the whole video queue is sent into the human pose estimation and behavior recognition module; b) the current video frame is sent directly into the desktop detection module.
2. The human pose estimation and behavior recognition module starts working. The low-level feature extraction sub-module adopts the bottleneck layer structure shown in Figure 3, stacked 4 times. The output low-level features have resolution 32×32, with the channel count expanded from 3 to 576: the first bottleneck layer expands the channels to 12, the second to 48, the third to 192, and the fourth to 576. The pose estimation unit sends each of the T video frames into the SACAM structure for three-dimensional pose estimation. The SACAM blocks are stacked 5 times, with all convolution strides equal to 1, producing the pose estimation features. These features are then sent into the heat map decoding module to obtain Pxy, Pz, and Conf, each with 11 channels corresponding to 11 key points. Because the T video frames are processed separately, human pose results for T frames are obtained here. The behavior recognition unit first constructs the behavior recognition input features, whose size after construction is (hk, T, 3+hd) = (11, 10, 579), and then feeds them into the SRLRTM block structure. After passing through 5 stacked SRLRTM blocks, the behavior recognition input features are followed by a global max pooling layer, a fully connected layer, and a softmax to obtain the recognition and classification result.
3. A re-injection mechanism is introduced, as shown in Figure 2: each three-dimensional pose estimation module takes the sum of the low-level features and the features of the previous three-dimensional pose estimation module as its input features, and each behavior recognition module takes the sum of its current input features and the pre-global-pooling features of the previous behavior recognition module as its new input features, improving the recognition accuracy of the network.
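The wiring of this mechanism across two or more stages could look roughly as follows; all module names here are placeholders, not the names used by the implementation.

```python
def reinjected_forward(low, pose_stages, rec_stages, make_rec_input):
    """Hypothetical wiring of the re-injection mechanism of Figure 2.

    pose_stages / rec_stages are lists of per-level modules;
    make_rec_input builds the recognition input features from a pose
    feature and the shared low-level features `low`.
    """
    pose_feat, rec_feat = None, None
    for pose_stage, rec_stage in zip(pose_stages, rec_stages):
        pose_in = low if pose_feat is None else low + pose_feat  # re-inject
        pose_feat = pose_stage(pose_in)
        rec_in = make_rec_input(pose_feat, low)
        if rec_feat is not None:
            rec_in = rec_in + rec_feat       # add pre-pooling features
        rec_feat = rec_stage(rec_in)         # features before global pooling
    return pose_feat, rec_feat
```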
4. Desktop detection is performed. Desktop detection is essentially a plane detection problem whose purpose is to obtain the position and pose of the desktop from the image. The desktop detection module performs 3D plane detection on the monocular image and obtains the depth map and normal vector describing each plane as the plane's position and pose information. Then, according to the placement of the camera, the desktop extent is determined by searching upward from the lower part of the image.
5. Sitting posture evaluation and prompting are performed. When the behavior recognition unit recognizes that the subject is in a relatively static state such as typing, writing, or reading, its sitting posture is evaluated and a corresponding prompt is given according to the evaluation result.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, units, or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence the part that contributes to the prior art, or all or part of that technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some or all of the technical features therein, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

  1. A sitting posture recognition method based on a monocular video image sequence, characterized in that it comprises:
    S1. Obtaining the current video frame from a monocular camera and updating a video frame sequence, the capacity of the video frame sequence being fixed;
    S2. Sending the video frame sequence into a human pose estimation and behavior recognition module, and performing human pose estimation and behavior type recognition by obtaining the 3-dimensional coordinates of key points, the behavior types comprising static behaviors and dynamic behaviors; if the behavior type recognition result is judged to be a static behavior, executing S3, otherwise executing S1;
    S3. A sitting posture evaluation module simultaneously receiving the human pose estimation result and the behavior type recognition result, performing sitting posture evaluation on the basis of both, and giving a corresponding prompt according to the evaluation result.
  2. The sitting posture recognition method according to claim 1, characterized in that the human pose estimation and behavior recognition module comprises:
    a low-level feature extraction sub-module, which processes each frame image in the video frame sequence into a low-level feature map;
    and at least one level of estimation and recognition working group, the estimation and recognition working group comprising a three-dimensional pose estimation part and a behavior recognition part working in parallel, wherein:
    the three-dimensional pose estimation part of the first-level estimation and recognition working group takes the low-level feature map as its input feature and outputs a human pose estimation result,
    and its behavior recognition part takes the human pose estimation result of the same level and the low-level feature map as input features and outputs a behavior recognition result;
    the three-dimensional pose estimation parts of the other estimation and recognition working groups each take the low-level feature map and the human pose estimation result of the previous level as input features and output a human pose estimation result,
    and their behavior recognition parts take the human pose estimation result of the same level and the behavior recognition result of the previous level as input features and output a behavior recognition result.
  3. The sitting posture recognition method according to claim 2, characterized in that the three-dimensional pose estimation part is configured to execute:
    a heat map extraction step: extracting spatial-domain attention and channel-level attention from the input features of the three-dimensional pose estimation part, and generating a key point heat map according to the spatial-domain attention and the channel-level attention, the spatial-domain attention being the weight of each pixel position of the image, and the channel-level attention being the weight of each input channel;
    a heat map decoding step: performing global max pooling on the key point heat map to obtain a two-dimensional coordinate heat map and a depth coordinate heat map, and extracting three-dimensional key point coordinates from the two-dimensional coordinate heat map and the depth coordinate heat map.
  4. The sitting posture recognition method according to any one of claims 2-3, characterized in that the human pose estimation and behavior recognition module takes the human pose estimation result and the behavior recognition result obtained by the last-level estimation and recognition working group as its output.
  5. The sitting posture recognition method according to claim 2, characterized in that the behavior recognition part is configured to execute:
    a recognition input feature construction step: extracting pose estimation features and scene context features, and concatenating the two to form behavior recognition input features;
    a behavior recognition step: inputting the recognition input features into a recognition model to obtain a recognition and classification result.
  6. The sitting posture recognition method according to claim 5, characterized in that the recognition part is further configured to execute a behavior recognition model building step, which specifically comprises:
    modeling the short-term information and the long-term information of the behavior recognition input features separately, obtaining a short-term information sub-model and a long-term information sub-model respectively;
    connecting the short-term information sub-model and the long-term information sub-model in series to form a behavior recognition working group;
    stacking a plurality of the behavior recognition working groups to obtain a behavior recognition model.
  7. The sitting posture recognition method according to claim 6, characterized in that the behavior recognition step comprises:
    inputting the behavior recognition input features into the behavior recognition model to obtain behavior recognition intermediate features;
    classifying the behavior recognition intermediate features using max pooling, a fully connected layer, and softmax, so as to obtain a behavior recognition and classification result.
  8. The sitting posture recognition method according to claim 1, characterized in that S2 further comprises: sending the current frame into a desktop detection module for desktop pose detection.
  9. The sitting posture recognition method according to claim 8, characterized in that S3 further comprises: the sitting posture evaluation module receiving the desktop pose detection result to assist in the sitting posture evaluation.
PCT/CN2020/104054 2020-05-27 2020-07-24 Sitting posture recognition method based on monocular video image sequence WO2021237913A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010462958.8A CN111626211B (en) 2020-05-27 2020-05-27 Sitting posture identification method based on monocular video image sequence
CN202010462958.8 2020-05-27

Publications (1)

Publication Number Publication Date
WO2021237913A1 true WO2021237913A1 (en) 2021-12-02

Family

ID=72272365

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/104054 WO2021237913A1 (en) 2020-05-27 2020-07-24 Sitting posture recognition method based on monocular video image sequence

Country Status (2)

Country Link
CN (1) CN111626211B (en)
WO (1) WO2021237913A1 (en)

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN115761885A (en) * 2022-11-16 2023-03-07 之江实验室 Behavior identification method for synchronous and cross-domain asynchronous fusion drive
CN115984384A (en) * 2023-03-20 2023-04-18 乐歌人体工学科技股份有限公司 Desktop lifting control method based on facial posture image estimation
CN117237443A (en) * 2023-02-20 2023-12-15 北京中科海芯科技有限公司 Gesture estimation method, device, electronic equipment and storage medium

Families Citing this family (6)

Publication number Priority date Publication date Assignee Title
CN112102358B (en) * 2020-09-29 2023-04-07 南开大学 Non-invasive animal behavior characteristic observation method
CN113065532B (en) * 2021-05-19 2024-02-09 南京大学 Sitting posture geometric parameter detection method and system based on RGBD image
CN113177365B (en) * 2021-05-26 2022-12-06 上海交通大学 Heuristic method and system for vertically stacking irregular objects, storage medium and terminal
CN115690893A (en) * 2021-07-22 2023-02-03 北京有竹居网络技术有限公司 Information detection method, information detection device, information detection medium, and electronic device
CN114120357B (en) * 2021-10-22 2023-04-07 中山大学中山眼科中心 Neural network-based myopia prevention method and device
CN117746505A (en) * 2023-12-21 2024-03-22 武汉星巡智能科技有限公司 Learning accompanying method and device combined with abnormal sitting posture dynamic detection and robot

Citations (7)

Publication number Priority date Publication date Assignee Title
JP2007153035A (en) * 2005-12-01 2007-06-21 Auto Network Gijutsu Kenkyusho:Kk Occupant sitting judgement system
CN104157107A (en) * 2014-07-24 2014-11-19 燕山大学 Human body posture correction device based on Kinect sensor
CN108491754A (en) * 2018-02-02 2018-09-04 泉州装备制造研究所 A kind of dynamic representation based on skeleton character and matched Human bodys' response method
CN108665687A (en) * 2017-03-28 2018-10-16 上海市眼病防治中心 A kind of sitting posture monitoring method and device
CN110287864A (en) * 2019-06-24 2019-09-27 火石信科(广州)科技有限公司 A kind of intelligent identification of read-write scene read-write element
CN111178280A (en) * 2019-12-31 2020-05-19 北京儒博科技有限公司 Human body sitting posture identification method, device, equipment and storage medium
CN111601088A (en) * 2020-05-27 2020-08-28 大连成者科技有限公司 Sitting posture monitoring system based on monocular camera sitting posture identification technology

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
JP2015056057A (en) * 2013-09-12 2015-03-23 トヨタ自動車株式会社 Method of estimating posture and robot
CN111104816B (en) * 2018-10-25 2023-11-03 杭州海康威视数字技术股份有限公司 Object gesture recognition method and device and camera
CN110717392B (en) * 2019-09-05 2022-02-18 云知声智能科技股份有限公司 Sitting posture detection and correction method and device
CN111161349B (en) * 2019-12-12 2023-12-12 中国科学院深圳先进技术研究院 Object posture estimation method, device and equipment
CN111046825A (en) * 2019-12-19 2020-04-21 杭州晨鹰军泰科技有限公司 Human body posture recognition method, device and system and computer readable storage medium

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
JP2007153035A (en) * 2005-12-01 2007-06-21 Auto Network Gijutsu Kenkyusho:Kk Occupant sitting judgement system
CN104157107A (en) * 2014-07-24 2014-11-19 燕山大学 Human body posture correction device based on Kinect sensor
CN108665687A (en) * 2017-03-28 2018-10-16 上海市眼病防治中心 A kind of sitting posture monitoring method and device
CN108491754A (en) * 2018-02-02 2018-09-04 泉州装备制造研究所 A kind of dynamic representation based on skeleton character and matched Human bodys' response method
CN110287864A (en) * 2019-06-24 2019-09-27 火石信科(广州)科技有限公司 A kind of intelligent identification of read-write scene read-write element
CN111178280A (en) * 2019-12-31 2020-05-19 北京儒博科技有限公司 Human body sitting posture identification method, device, equipment and storage medium
CN111601088A (en) * 2020-05-27 2020-08-28 大连成者科技有限公司 Sitting posture monitoring system based on monocular camera sitting posture identification technology

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN115761885A (en) * 2022-11-16 2023-03-07 之江实验室 Behavior identification method for synchronous and cross-domain asynchronous fusion drive
CN115761885B (en) * 2022-11-16 2023-08-29 之江实验室 Behavior recognition method for common-time and cross-domain asynchronous fusion driving
CN117237443A (en) * 2023-02-20 2023-12-15 北京中科海芯科技有限公司 Gesture estimation method, device, electronic equipment and storage medium
CN117237443B (en) * 2023-02-20 2024-04-19 北京中科海芯科技有限公司 Gesture estimation method, device, electronic equipment and storage medium
CN115984384A (en) * 2023-03-20 2023-04-18 乐歌人体工学科技股份有限公司 Desktop lifting control method based on facial posture image estimation

Also Published As

Publication number Publication date
CN111626211B (en) 2023-09-26
CN111626211A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
WO2021237913A1 (en) Sitting posture recognition method based on monocular video image sequence
WO2021237914A1 (en) Sitting posture monitoring system based on monocular camera sitting posture recognition technology
US10755128B2 (en) Scene and user-input context aided visual search
KR102319177B1 (en) Method and apparatus, equipment, and storage medium for determining object pose in an image
US10096122B1 (en) Segmentation of object image data from background image data
WO2021227726A1 (en) Methods and apparatuses for training face detection and image detection neural networks, and device
US20210366152A1 (en) Method and apparatus with gaze estimation
US9330296B2 (en) Recognizing entity interactions in visual media
JP2022521844A (en) Systems and methods for measuring weight from user photos using deep learning networks
WO2023082882A1 (en) Pose estimation-based pedestrian fall action recognition method and device
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN107392159A (en) A kind of facial focus detecting system and method
KR102338486B1 (en) User Motion Recognition Method and System using 3D Skeleton Information
US20230118864A1 (en) Lifted semantic graph embedding for omnidirectional place recognition
JP7499280B2 (en) Method and system for monocular depth estimation of a person - Patents.com
WO2021218238A1 (en) Image processing method and image processing apparatus
JP2023545190A (en) Image line-of-sight correction method, device, electronic device, and computer program
US20230326173A1 (en) Image processing method and apparatus, and computer-readable storage medium
CN111046734A (en) Multi-modal fusion sight line estimation method based on expansion convolution
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN114036969B (en) 3D human body action recognition algorithm under multi-view condition
JP2023514107A (en) 3D reconstruction method, device, equipment and storage medium
US20220351405A1 (en) Pose determination method and device and non-transitory storage medium
JP5027030B2 (en) Object detection method, object detection apparatus, and object detection program
WO2021184359A1 (en) Target following method, target following apparatus, movable device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20937493

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20937493

Country of ref document: EP

Kind code of ref document: A1