WO2021237913A1 - Sitting posture recognition method based on monocular video image sequence - Google Patents

Sitting posture recognition method based on monocular video image sequence

Info

Publication number
WO2021237913A1
WO2021237913A1 (PCT/CN2020/104054)
Authority
WO
WIPO (PCT)
Prior art keywords
recognition
behavior
estimation
sitting posture
behavior recognition
Prior art date
Application number
PCT/CN2020/104054
Other languages
French (fr)
Chinese (zh)
Inventor
李灏为
杨志
Original Assignee
大连成者云软件有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大连成者云软件有限公司
Publication of WO2021237913A1 publication Critical patent/WO2021237913A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition

Definitions

  • the present invention relates to the fields of video image processing, computer vision and human body posture recognition, and in particular, to a sitting posture recognition method based on a monocular video image sequence.
  • sitting posture recognition algorithms typically use sensors to extract the upper-body posture of the subject and, based on a measure of how standard the sitting posture is, help users correct an incorrect posture in time to protect their health.
  • the non-contact sensors on which current sitting posture recognition algorithms are based fall mainly into the following types:
  • Ultrasonic sensors: ultrasound places certain requirements on the measurement surface. If the surface density is low, the ultrasound penetrates the object and produces multiple echoes; if the surface is uneven, the ultrasound is scattered and likewise produces multiple echoes; if the surface is inclined, the ultrasound is not reflected correctly; and if the surface is too small, not enough ultrasound is reflected back. Ultrasonic measurement therefore performs poorly.
  • Binocular vision sensors: these sensors have demanding manufacturing requirements, are very sensitive to ambient light, perform poorly in texture-poor scenes, and have high computational complexity; the camera baseline limits the measurement range, and blind spots exist in use.
  • Monocular vision sensors: hardware cost is low, but generally only two-dimensional information can be obtained, so sitting posture recognition is not as effective as with a binocular camera; recognition is less robust to occlusion, sudden lighting changes, and similar conditions; and a pinhole imaging model plus additional prior knowledge is required to recover three-dimensional information.
  • the present invention is proposed in view of at least one of the above-mentioned problems.
  • the present invention focuses on sitting posture recognition methods based on a monocular vision sensor, in particular on a monocular video image sequence; it aims to improve the accuracy of such methods, as well as their robustness under abnormal usage conditions such as occlusion and sudden lighting changes.
  • the present invention is also based on behavior recognition. Practical applications have shown that it improves recognition accuracy when the recognized subject exhibits dynamic behavior, without requiring additional external detection results. In addition, it can adaptively incorporate desktop position information during recognition.
  • the purpose of the present invention is to provide a sitting posture recognition method based on a monocular video image sequence, comprising the steps below (a control-flow sketch follows this list):
  • S1. obtaining the current video frame from a monocular camera and updating a video frame sequence of fixed capacity;
  • S2. sending the video frame sequence to a human body posture estimation and behavior recognition module, which estimates the human body posture and recognizes the behavior type from the three-dimensional coordinates of key points; the behavior types comprise static behaviors and dynamic behaviors; if the recognized behavior type is a static behavior, S3 is executed, otherwise S1 is executed;
  • S3. a sitting posture evaluation module receives both the human posture estimation result and the behavior type recognition result, evaluates the sitting posture based on the two, and gives corresponding prompts according to the evaluation result.
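A minimal sketch of the S1-S3 control flow described above, assuming a fixed-capacity clip of T frames; the module interfaces (`pose_behavior_module`, `desktop_module`, `evaluator`, the `is_static` flag) and the OpenCV capture backend are illustrative assumptions, not part of the patent:

```python
from collections import deque

import cv2  # assumed capture backend

T = 16  # clip capacity; the patent fixes T but does not prescribe a value

def run(pose_behavior_module, desktop_module, evaluator, camera_id=0):
    clip = deque(maxlen=T)              # VideoClip: fixed capacity, oldest frame dropped
    cap = cv2.VideoCapture(camera_id)
    while True:
        ok, frame_k = cap.read()        # S1: acquire the current frame
        if not ok:
            break
        clip.append(frame_k)            # S1: update the video frame sequence
        if len(clip) < T:
            continue                    # wait until the clip holds T frames
        pose_3d, behavior = pose_behavior_module(list(clip))  # S2: joint estimation/recognition
        desktop_pose = desktop_module(frame_k)                # optional desktop pose detection
        if behavior.is_static:
            evaluator(pose_3d, behavior, desktop_pose)        # S3: evaluate only static behaviors
        # dynamic behavior: skip evaluation and return to S1
```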
  • the present invention has the following advantages:
  • the present invention is developed on the basis of a monocular video image sequence, uses a multi-task end-to-end network structure to realize human body posture estimation and behavior recognition, and uses the accurate posture estimation results to improve behavior recognition accuracy.
  • in the behavior recognition process, the present invention uses the low-level feature map results to obtain environmental context information related to the posture, further improving the accuracy of recognition between similar behaviors.
  • the present invention adds an attention mechanism in the spatial domain when acquiring 3-dimensional posture information, and improves the accuracy of the key points of the posture through the context of the spatial domain.
  • the present invention combines the desktop position and posture information in the actual scene to perform sitting posture evaluation, and improves the sitting posture recognition accuracy.
  • the present invention can be widely used in office equipment and teaching equipment.
  • Figure 1 is a flowchart of the sitting posture recognition method of the present invention.
  • Figure 2 is a schematic diagram of the structure of the human body posture estimation and behavior recognition module of the present invention.
  • Figure 3 is a schematic diagram of the structure of the low-level feature extraction sub-module in the embodiment.
  • Figure 4 is a schematic diagram of the distribution of the 11 key points in a sitting state in the embodiment.
  • Figure 5 is a schematic diagram of the SACAM network structure in the embodiment.
  • Figure 6 is a flowchart of posture estimation heat map decoding in an embodiment.
  • Figure 7 is a schematic diagram of the pose estimation results of a video sequence as input to the behavior recognition part in an embodiment.
  • Figure 8 is a schematic diagram of the SRLRTM network structure in the embodiment.
  • Figure 1 shows a sitting posture recognition method based on a monocular video image sequence provided by the present invention.
  • the process for performing sitting posture evaluation includes the following steps:
  • S1. obtain the current video frame Frame_k from the monocular camera and update the video frame sequence VideoClip = {Frame_i | i ∈ {k-T+1, ..., k}}, which stores T frames of images.
  • S2. the video frame sequence is sent to the human body posture estimation and behavior recognition module, which estimates the human body posture and recognizes the behavior type from the 3-dimensional coordinates of key points; the behavior types comprise static behaviors and dynamic behaviors; if the recognition result is a static behavior, S3 is executed, otherwise S1 is executed.
  • this step may also include sending the current frame Frame_k to the desktop detection module for desktop pose detection.
  • S3. the sitting posture evaluation module receives both the human posture estimation result and the behavior type recognition result, evaluates the sitting posture based on the two, and gives corresponding prompts according to the evaluation result.
  • this step may also include the sitting posture evaluation module receiving the desktop pose detection result to assist the sitting posture evaluation.
  • when the behavior recognition result indicates that the recognized subject is in a relatively static state such as typing, writing, or reading, its sitting posture is evaluated.
  • sitting posture evaluation in the present invention may adopt, but is not limited to, the following methods: 1) entering the standard sitting posture in advance and calculating the similarity of each joint vector between the current sitting posture and the standard sitting posture; 2) judging the distance between the head and the desktop; 3) treating it as a classification task and training a neural network to discriminate. A sketch of method 1) follows.
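A hedged sketch of evaluation method 1), cosine similarity between the joint vectors of the current pose and a pre-recorded standard pose. The skeleton edges and the alert threshold are illustrative assumptions; the patent only specifies that per-joint vector similarities are computed:

```python
import numpy as np

# (parent, child) pairs over the 11 key point ids; an assumed upper-body topology
SKELETON = [(6, 8), (8, 10), (7, 9), (9, 11), (6, 7)]

def joint_vectors(kpts):
    """kpts: (11, 3) array of 3-D key point coordinates, rows 0..10 for ids 1..11."""
    return np.stack([kpts[c - 1] - kpts[p - 1] for p, c in SKELETON])

def posture_similarity(current, standard):
    a, b = joint_vectors(current), joint_vectors(standard)
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    return float(cos.mean())  # 1.0 means identical joint directions

# usage: the standard posture is entered in advance; remind the user when the
# similarity drops below an assumed threshold, e.g.
# if posture_similarity(current_pose, standard_pose) < 0.9: remind_user()
```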
  • for human body posture estimation and behavior recognition, it is preferable to adopt a multi-task end-to-end network structure as the human body posture estimation and behavior recognition module.
  • compared with the commonly used staged, multi-task networks, such a structure can use the posture estimation results more precisely to help improve behavior recognition accuracy; and since sitting posture recognition accuracy depends largely on the accuracy of human posture estimation and of behavior recognition, the accuracy of sitting posture recognition can be improved further.
  • in the previously used staged, cascaded recognition algorithms, the input is only the human posture, and such input features cause behaviors with similar postures to be confused with one another during recognition; for example, the postures of drinking water and smoking are very similar.
  • the human body posture estimation and behavior recognition module further includes a low-level feature extraction sub-module and at least one level of estimation and recognition working groups.
  • the low-level feature extraction sub-module is mainly used to process each frame image in the video frame sequence into a low-level feature map.
  • each estimation and recognition working group includes a three-dimensional pose estimation part and a behavior recognition part working in parallel.
  • the three-dimensional pose estimation part of the first-level working group takes the low-level feature map as its input feature and outputs a human pose estimation result; its behavior recognition part takes the same-level human pose estimation result and the low-level feature map as input features and outputs a behavior recognition result.
  • the three-dimensional pose estimation parts of the other working groups all take the low-level feature map and the previous level's human pose estimation result as input features and output a human pose estimation result; their behavior recognition parts take the same-level human pose estimation result and the previous level's behavior recognition result as input features and output a behavior recognition result.
  • the human body posture estimation and behavior recognition module outputs the human pose estimation result and the behavior recognition result obtained by the last-level working group.
  • the present invention introduces a re-injection mechanism between the estimation and recognition working groups at all levels, and between the three-dimensional pose estimation part and the behavior recognition part within each working group, which significantly improves the accuracy of the pose estimation and behavior recognition results.
  • the low-level feature extraction submodule is the input part of the network, that is, the stem of the network.
  • the T-frame video frame sequence is resized to the same size and then sent to the network.
  • the output of this part is a low-level feature.
  • the inventive concept underlying the present invention is to use as few convolutional layers as possible to compress the features into the desired shape. The emphasis is on network efficiency; the features extracted at this stage are not required to have strong fitting ability. To improve the effectiveness of these features, the present invention introduces a re-injection mechanism into the network to refine them.
  • the pose estimation part and the behavior recognition part also use specially designed network structures to model the spatial and temporal domains separately, which is elaborated in the following content.
  • the present invention is based on the Resnet bottleneck layer of the residual network, and optimizes the network structure to improve the speed of the network.
  • the original network's 1×1 convolution is replaced with a 1×1 group convolution (1×1 groupconv) plus channel shuffle, which implements the function of the 1×1 convolution while reducing computation; the 3×3 convolution is replaced with a 3×3 depthwise convolution (depthwise conv) with stride 2, which likewise reduces computation.
  • the final addition operation is changed to a channel concatenation (concat) operation, and each identity mapping undergoes a max pooling operation with stride 2.
  • the above optimized design ensures that the original image can be brought to the desired feature map shape through only a few modified bottleneck layers; a sketch of such a layer follows.
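A sketch of the modified bottleneck layer described above, assuming PyTorch: a 1×1 group convolution followed by channel shuffle, a stride-2 3×3 depthwise convolution, and a max-pooled identity branch joined by concatenation instead of addition. The group count and the placement of normalization and activation are assumptions, since Figure 3's exact layout is not reproduced here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class FastBottleneck(nn.Module):
    """Halves resolution; output channels = in_ch (identity) + branch_ch (concat)."""
    def __init__(self, in_ch, branch_ch, groups=3):  # groups must divide both channel counts
        super().__init__()
        self.pw1 = nn.Conv2d(in_ch, branch_ch, 1, groups=groups, bias=False)
        self.dw = nn.Conv2d(branch_ch, branch_ch, 3, stride=2, padding=1,
                            groups=branch_ch, bias=False)   # 3x3 depthwise, stride 2
        self.pw2 = nn.Conv2d(branch_ch, branch_ch, 1, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(branch_ch)
        self.groups = groups

    def forward(self, x):
        identity = F.max_pool2d(x, 3, stride=2, padding=1)   # stride-2 identity mapping
        y = channel_shuffle(self.pw1(x), self.groups)        # 1x1 group conv + shuffle
        y = F.relu(self.bn(self.pw2(self.dw(y))))
        return torch.cat([identity, y], dim=1)               # concat replaces addition
```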
  • in addition, the present invention introduces the re-injection mechanism (re-injection) into both the three-dimensional pose estimation and the behavior recognition, forming the structure of the entire human body posture estimation and behavior recognition module, as shown in Figure 2.
  • each three-dimensional pose estimation module adds the low-level features and the features of the previous three-dimensional pose estimation module as its input features; each behavior recognition module adds its current input features and the pre-global-pooling features of the previous behavior recognition module as its new input features.
  • through this re-injection mechanism, the features are continuously refined, and the network's results gradually become more accurate.
  • the three-dimensional pose estimation unit is used to perform: a heat map extraction step and a heat map decoding step.
  • the heat map extraction step is executed once or stacked multiple times.
  • the present invention defines 3-dimensional pose estimation in the sitting state as the 3-dimensional coordinates of 11 key points; once these coordinates are determined, the human posture can be connected according to the topology of the human body.
  • the 11 key points are: left eye (1), right eye (2), nose (3), left mouth corner (4), right mouth corner (5), left shoulder (6), right shoulder (7), left elbow (8), right elbow (9), left wrist (10), and right wrist (11), as shown in Figure 4.
  • the structure of the 3-dimensional pose estimation part is likewise optimized based on Resnet, and a new network structure, SACAM (spatial attention and channel attention module), is proposed.
  • in this structure, max pooling is performed along the channel dimension, and a 3×3 convolution is applied to the pooled result to obtain spatial attention, i.e., weights for different pixel positions, which refine the features.
  • an SE layer is then introduced to learn the weights of different channels, i.e., channel-level attention, which re-refines the features of the different channels.
  • the SACAM structure is shown in Figure 5.
  • since the preceding low-level feature extraction stage has already brought the feature map to the required resolution, no down-sampling is performed in the SACAM block: all convolutions have stride 1, and the pooling operations serve only to extract attention.
  • the resolutions of the SACAM input and output feature maps remain the same; a sketch of the block follows.
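A sketch of the SACAM block under the description above, assuming PyTorch: channel-wise max pooling plus a stride-1 3×3 convolution for spatial attention, then an SE layer for channel attention. The sigmoid gating and the SE reduction ratio are assumptions; Figure 5 may specify further details not reproduced here:

```python
import torch
import torch.nn as nn

class SACAM(nn.Module):
    def __init__(self, channels, se_reduction=8):
        super().__init__()
        self.spatial_conv = nn.Conv2d(1, 1, 3, padding=1)   # stride 1, keeps resolution
        self.se = nn.Sequential(                             # SE layer: channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // se_reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // se_reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        pooled, _ = x.max(dim=1, keepdim=True)                # max pool along the channel dim
        spatial_w = torch.sigmoid(self.spatial_conv(pooled))  # per-pixel spatial weights
        x = x * spatial_w                                     # refine features spatially
        return x * self.se(x)                                 # re-refine per channel
```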
  • in the heat map decoding step, the pose estimation features pass through one or more stacked SACAM structures to generate a key point heat map; global max pooling over its depth dimension yields the two-dimensional heat map Hxy of size (hx, hy, hk), and global max pooling over its first two dimensions yields the depth heat map Hz of size (hz, hk). soft-argmax is then used to parse the two-dimensional key point coordinates and the depth coordinate from these two heat maps, which together form the three-dimensional key point coordinates.
  • traditional algorithms often use argmax to obtain coordinate values from the heat map, but the result of this operation is not differentiable, which breaks the back-propagation chain.
  • soft-argmax is used instead; it essentially defines the event as the maximum value falling at coordinate (x, y), so the heat maps Hxy and Hz naturally become the corresponding probability mass functions, and finding the coordinate of the maximum value becomes computing an expectation. The formula is as follows:
  • soft-argmax(x) = Σ_i i · exp(x_i) / Σ_j exp(x_j)
  • where x is the input image; i and j index positions in the image; x_i is the pixel value at position i of image x, and likewise for x_j; the output of the function is the coordinate of the maximum value of the image. A sketch of this decoding follows.
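A sketch of soft-argmax decoding, assuming PyTorch: the heat map is normalized by softmax into a probability mass function and the coordinate is recovered as its expectation, which keeps the operation differentiable, unlike argmax:

```python
import torch

def soft_argmax_1d(heat):
    """heat: (..., L) scores along one axis; returns the expected index."""
    prob = torch.softmax(heat, dim=-1)  # p_i = exp(x_i) / sum_j exp(x_j)
    idx = torch.arange(heat.shape[-1], dtype=heat.dtype, device=heat.device)
    return (prob * idx).sum(dim=-1)     # E[i] = sum_i i * p_i

def decode_xy(hxy):
    """hxy: (hk, hx, hy) per-key-point 2-D heat map; returns (hk, 2) coordinates."""
    hk, hx, hy = hxy.shape
    prob = torch.softmax(hxy.reshape(hk, -1), dim=-1).reshape(hk, hx, hy)
    xs = torch.arange(hx, dtype=hxy.dtype, device=hxy.device)
    ys = torch.arange(hy, dtype=hxy.dtype, device=hxy.device)
    ex = (prob.sum(dim=2) * xs).sum(dim=1)  # expectation over x
    ey = (prob.sum(dim=1) * ys).sum(dim=1)  # expectation over y
    return torch.stack([ex, ey], dim=-1)

# the depth coordinate is decoded the same way from Hz, e.g.
# z = soft_argmax_1d(hz.transpose(0, 1))  # hz: (hz, hk) -> (hk,) depths
```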
  • the behavior recognition unit is used to perform the behavior recognition model building step, the recognition input feature building step, the behavior recognition step and the classification step.
  • when designing the model, short-term information and long-term information are modeled separately from the behavior recognition input features, and the two models are connected in series to form the recognition model.
  • the SRLRTM block structure is designed for the shape of the input feature, and the short-term information and the long-term information can be modeled by using ordinary 2-dimensional convolution.
  • SRLRTM is divided into two parts. The left half of SRLRTM models short-term information: it uses a 1×1 convolution to enhance the flow of information between channels and reduce the number of channels, and an hk×3 convolution to model short-term information; because the second dimension of the features represents time T, setting the second dimension of the convolution kernel to 3 models 3 adjacent frames.
  • channel max pooling is then performed to obtain a spatio-temporal attention, which is correlated with the identity-mapped features to obtain locally enhanced features; to preserve the completeness of the information, a skip connection adds the original features and the locally enhanced features.
  • the right half of SRLRTM models long-term information: its first 1×1 convolution likewise enhances the flow of information between channels and reduces the number of channels, and the hk×T convolution models all T frames simultaneously; together with a 1×1 convolution it yields a channel attention, which is multiplied with the identity-mapped features in the channel dimension to obtain globally enhanced features, which are then added to the identity-mapped features to retain the original information.
  • the left half and the right half are connected in series to form one SRLRTM block. After multiple stacked SRLRTM blocks, a global max pooling layer, a fully connected layer, and then a softmax are connected to obtain the recognition classification result; a sketch of one block follows.
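A sketch of one SRLRTM block under the description above, assuming PyTorch, inputs laid out as (N, C, hk, T) with C = 3 + hd, and an assumed channel reduction ratio; the exact channel counts of Figure 8 are not reproduced here:

```python
import torch
import torch.nn as nn

class SRLRTM(nn.Module):
    def __init__(self, channels, hk, T, reduction=4):
        super().__init__()
        mid = channels // reduction
        # left half: short-term modelling over 3 adjacent frames
        self.short = nn.Sequential(
            nn.Conv2d(channels, mid, 1),                        # mix channels, reduce count
            nn.Conv2d(mid, channels, (hk, 3), padding=(0, 1)),  # hk x 3 kernel: 3 frames
        )
        # right half: long-term modelling over all T frames -> channel attention
        self.long = nn.Sequential(
            nn.Conv2d(channels, mid, 1),
            nn.Conv2d(mid, mid, (hk, T)),                       # hk x T kernel: whole clip
            nn.Conv2d(mid, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                          # x: (N, C, hk, T)
        s = self.short(x)                          # (N, C, 1, T)
        att_st = torch.sigmoid(s.max(dim=1, keepdim=True)[0])  # channel max pool: spatio-temporal attention
        x = x + x * att_st                         # local enhancement + skip connection
        att_ch = self.long(x)                      # (N, C, 1, 1) channel attention
        return x + x * att_ch                      # global enhancement, original info kept

# classification head after stacked blocks (an assumption of the head layout):
# global max pool -> fully connected layer -> softmax over behavior classes
```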
  • This step is mainly used to extract pose estimation features and scene context features, and stitch the two to form behavior recognition input features.
  • the input of the behavior recognition part includes two parts, one is the result of pose estimation, and the other is the low-level features extracted by the low-level feature extraction sub-module.
  • the appearance of the human body and the environmental context are combined to perform behavior recognition, which solves the problem that judging behavior by posture alone is not accurate enough.
  • since the pose estimation results are sets of three-dimensional key point coordinates, their format needs to be converted to facilitate processing by the network: the time dimension is taken as the horizontal axis, the key point category as the vertical axis, and the x, y, and z coordinates of the 3-dimensional key points correspond to 3 channels.
  • such features can be processed directly by ordinary 2-dimensional convolution. The feature is shown in Figure 7, and its shape is (hk, T, 3).
  • this embodiment takes the outer product of the low-level features and the heat map.
  • the heat map Hxy has size (hx, hy, hk), that is, (hw, hh, hk), since the path from the low-level features to the heat map Hxy involves no down-sampling; the low-level feature is denoted F, with size (hw, hh, hd), where hd is the number of channels.
  • the outer product of two vectors can express their similarity and reflect their lengths, and the outer product of matrices is essentially the outer product of corresponding columns.
  • the purpose of computing the outer product in this embodiment is to extract, using all key point positions at once, the human body appearance information and context information at the locations indicated by the heat map.
  • per frame, the resulting feature is flattened to shape (hk*hd); the features of the T video frames are then concatenated to obtain the human body appearance and scene context feature Representf of shape (T, hk*hd); the second dimension is split and the order adjusted so that the final feature shape is (hk, T, hd).
  • the shape of the pose estimation feature is (hk, T, 3) and the shape of the human body appearance and scene context feature is (hk, T, hd); since their first two dimensions are the same, they are concatenated along the channel dimension to form the behavior recognition input feature of shape (hk, T, 3+hd). A sketch of this construction follows.
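A sketch of this input feature construction, assuming PyTorch tensors; the einsum contracts over all spatial positions, which matches computing the matrix outer product of the flattened heat map and feature map per frame:

```python
import torch

def build_recognition_input(pose_seq, hxy_seq, feat_seq):
    """pose_seq: (T, hk, 3); hxy_seq: (T, hw, hh, hk); feat_seq: (T, hw, hh, hd)."""
    # outer product over spatial positions: context[t, k, d] = sum_wh Hxy * F
    context = torch.einsum('twhk,twhd->tkd', hxy_seq, feat_seq)  # (T, hk, hd)
    context = context.permute(1, 0, 2)          # (hk, T, hd): appearance + scene context
    pose = pose_seq.permute(1, 0, 2)            # (hk, T, 3): x, y, z as 3 channels
    return torch.cat([pose, context], dim=-1)   # (hk, T, 3 + hd) recognition input
```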
  • the recognition input feature is input into the recognition model to obtain the recognition classification result.
  • for the sitting state, the recognition results are divided into static behaviors and dynamic behaviors, where dynamic behaviors include, but are not limited to: stretching, standing up, sitting down, reaching for things, shaking the head, turning around, making a phone call, and talking with others.
  • Static behavior includes, but is not limited to: writing, typing, and reading.
  • a desktop three-dimensional detection method is preferably used as the desktop detection module. It obtains the desktop pose and position relying only on monocular images and requires no additional sensors. Since the pose and position of the desktop are known, the relative relationship between the person and the desktop can be obtained, which improves the accuracy of sitting posture evaluation; previous sitting posture monitoring algorithms have not considered using this information.
  • the desktop detection module further includes a plane area detection unit and a plane three-dimensional parameter inference unit.
  • the plane area detection unit is realized by the Mask-RCNN network.
  • the Mask-RCNN network is an instance segmentation model, which can determine the location and category of each target in the picture, and give pixel-level predictions.
  • the desktop detection frame, the desktop segmentation result, and the output feature map of its backbone network obtained by Mask-RCNN are used as the input of the plane 3D parameter inference unit.
  • the plane 3D parameter inference unit first uses ROIAlign to extract the features of the desktop and regresses the normal vector of the desktop.
  • ROIAlign finds the position of the corresponding desktop detection frame on the backbone network's output feature map and then performs bilinear interpolation to obtain the desktop features.
  • the plane 3D parameter inference unit also decodes the backbone network feature map to obtain a global depth map.
  • the decoding part adopts bilinear interpolation to make the predicted depth map and the input monocular image reach the same resolution.
  • the plane 3D parameter inference unit uses the desktop segmentation mask to extract the desktop depth from the global depth map.
  • the extraction operation is implemented using an AND operation; the desktop position vector is then regressed through a max pooling layer and two 1×1 convolutions. Finally, the desktop normal vector and the desktop position vector are used as the output of the desktop detection module; a sketch of this head follows.
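A sketch of the inference tail described above, assuming PyTorch; the ROIAlign features are taken as given (e.g., from a Mask-RCNN backbone), and the layer widths plus the mask-multiplication reading of the "AND operation" are assumptions:

```python
import torch
import torch.nn as nn

class PlaneParamHead(nn.Module):
    def __init__(self, roi_ch=256):
        super().__init__()
        self.normal_head = nn.Sequential(   # regress the desktop normal from ROIAlign features
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(roi_ch, 3))
        self.pos_head = nn.Sequential(      # max pool + two 1x1 convs -> position vector
            nn.MaxPool2d(4),
            nn.Conv2d(1, 8, 1), nn.ReLU(inplace=True), nn.Conv2d(8, 3, 1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, roi_feat, depth_map, desk_mask):
        """roi_feat: (N, roi_ch, h, w); depth_map, desk_mask: (N, 1, H, W)."""
        normal = self.normal_head(roi_feat)       # (N, 3) desktop normal vector
        desk_depth = depth_map * desk_mask        # AND-style extraction of desktop depth
        position = self.pos_head(desk_depth)      # (N, 3) desktop position vector
        return normal, position
```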
  • a sitting posture evaluation method is preferably used as a sitting posture evaluation module.
  • in the algorithm preparation stage, the standard sitting posture is entered in advance.
  • the algorithm calculates the similarity of each joint vector between the current sitting posture and the standard sitting posture, the angle between the line connecting the left and right eyes and the desktop normal vector, and the distance between the nose and the desktop position.
  • the sitting posture is graded as standard, slightly incorrect, severely incorrect, and so on, and users are reminded according to the level; a sketch of such an evaluation follows.
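A hedged sketch of this preferred evaluation, reusing the `posture_similarity` helper from the earlier sketch; all thresholds, and the choice that a healthy eye line is roughly perpendicular to the desktop normal, are illustrative assumptions:

```python
import numpy as np

def evaluate_sitting(kpts, desk_normal, desk_pos, standard_kpts):
    """kpts: (11, 3) key points (ids 1..11); desk_normal, desk_pos: (3,) vectors."""
    sim = posture_similarity(kpts, standard_kpts)     # joint vector similarity
    eye_line = kpts[1] - kpts[0]                      # right eye (2) minus left eye (1)
    cos_a = np.dot(eye_line, desk_normal) / (
        np.linalg.norm(eye_line) * np.linalg.norm(desk_normal) + 1e-8)
    eye_angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    nose_dist = np.linalg.norm(kpts[2] - desk_pos)    # nose (3) to desktop position
    level_ok = abs(eye_angle - 90.0) < 10.0           # head not tilted (assumed bound)
    if sim > 0.9 and nose_dist > 0.35 and level_ok:   # metres; thresholds assumed
        return "standard"
    if sim > 0.75 and nose_dist > 0.25:
        return "slightly incorrect"
    return "severely incorrect"
```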
  • the human body pose estimation and behavior recognition module starts to work; the low-level feature extraction sub-module adopts the bottleneck layer structure shown in Figure 3, stacked 4 times.
  • the output low-level feature resolution is 32 ⁇ 32, and the number of channels is increased from 3 to 576.
  • the first bottleneck layer expands the channels to 12, the second to 48, the third to 192, and the fourth to 576.
  • the posture estimation unit sends each of the T video frames into the SACAM structure for three-dimensional pose estimation.
  • the SACAM blocks are stacked 5 times, all with convolution stride 1, and the pose estimation features are obtained.
  • the pose estimation features are sent to the heat map decoding module to obtain Pxy, Pz, and Conf, each with 11 channels corresponding to the 11 key points. Since the T video frames are processed separately, body pose results for T frames are obtained here.
  • each 3D pose estimation module adds the low-level features and the previous 3D pose estimation module's features as its input features, and each behavior recognition module adds its current input features and the previous behavior recognition module's pre-global-pooling features as its new input features, improving the accuracy of network recognition.
  • Desktop detection is essentially a plane detection problem. The purpose is to obtain the position and posture of the desktop from the image.
  • the desktop detection module performs 3D plane detection on the monocular image, and obtains the depth map and normal vector describing each plane as the position information and posture information of the plane. Then according to the placement of the camera, search upwards from the bottom of the image to determine the range of the desktop.
  • when the behavior recognition unit recognizes that the subject is in a relatively static state such as typing, writing, or reading, it evaluates the subject's sitting posture and gives corresponding prompts based on the evaluation results.
  • the disclosed technical content can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units may be a logical function division.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of the present invention, in essence, or the part that contributes beyond the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium.
  • a computer device which may be a personal computer, a server, or a network device, etc.
  • the aforementioned storage media include: USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard disks, magnetic disks, optical disks, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a sitting posture recognition method based on a monocular video image sequence. First, a video frame is acquired from a monocular camera; then, on one hand, the current video frame is sent to a desktop detection module to detect the desktop position and pose, and on the other hand, the current frame is used to update a video frame sequence, which is sent to a human pose estimation and behavior recognition module to obtain the three-dimensional human pose and the current behavior category. Whether the current behavior category is a static behavior is then determined: if so, the human pose, the current behavior category, and the desktop position and pose are sent to a sitting posture evaluation module to evaluate the current sitting posture; otherwise, processing proceeds directly to the next frame. With the method provided in the present invention, a three-dimensional human pose can be acquired directly from monocular images; the use of a multi-frame image sequence resists occlusion and lighting changes, giving good robustness; non-static behaviors are filtered out by means of behavior recognition; and accuracy is improved by incorporating desktop pose information.

Description

A Sitting Posture Recognition Method Based on a Monocular Video Image Sequence

Technical Field

The present invention relates to the fields of video image processing, computer vision, and human body posture recognition, and in particular to a sitting posture recognition method based on a monocular video image sequence.

Background Art

As the pace of life continues to accelerate, people spend most of their day working and studying. Maintaining a nonstandard sitting posture for a long time easily fosters bad habits such as hunching and twisting of the body; in severe cases it causes diseases such as cervical spondylosis, lumbar disc herniation, and myopia, doing irreversible damage to the body and thereby greatly affecting daily study, work, and life. Sitting posture recognition algorithms typically use sensors to extract the upper-body posture of the subject and, based on a measure of how standard the sitting posture is, help users correct an incorrect posture in time to protect their health.
The non-contact sensors on which current sitting posture recognition algorithms are based fall mainly into the following types:

Ultrasonic sensors. Ultrasound places certain requirements on the measurement surface. If the surface density is low, the ultrasound penetrates the object and produces multiple echoes; if the surface is uneven, the ultrasound is scattered and likewise produces multiple echoes; if the surface is inclined, the ultrasound is not reflected correctly; and if the surface is too small, not enough ultrasound is reflected back. Ultrasonic measurement therefore performs poorly.

Binocular vision sensors. These sensors have demanding manufacturing requirements, are very sensitive to ambient light, perform poorly in texture-poor scenes, and have high computational complexity; the camera baseline limits the measurement range, and blind spots exist in use.

Monocular vision sensors. Hardware cost is low, but generally only two-dimensional information can be obtained, so sitting posture recognition is not as effective as with a binocular camera; recognition is less robust to occlusion, sudden lighting changes, and similar conditions; and a pinhole imaging model plus additional prior knowledge is required to recover three-dimensional information.

In addition, most sitting posture recognition methods consider only the relatively static behaviors of typing, writing, and reading, but in actual application scenarios the recognized subject may also exhibit dynamic behaviors such as stretching, swinging the head, drinking water, and answering the phone. When such dynamic behaviors occur, they are easily recognized as an incorrect sitting posture. Existing sitting posture recognition methods also do not incorporate desktop position information for the specific scene, which severely limits improvements in recognition accuracy.
Summary of the Invention

The present invention is proposed in view of at least one of the above-mentioned problems.

The present invention focuses on sitting posture recognition methods based on a monocular vision sensor, in particular on a monocular video image sequence. It aims to improve the accuracy of such methods, as well as their robustness under abnormal usage conditions such as occlusion and sudden lighting changes.

The present invention is also based on behavior recognition. Practical applications have shown that it improves recognition accuracy when the recognized subject exhibits dynamic behavior, without requiring additional external detection results. In addition, it can adaptively incorporate desktop position information during recognition.

The purpose of the present invention is to provide a sitting posture recognition method based on a monocular video image sequence, comprising:

S1. Obtaining the current video frame from a monocular camera and updating a video frame sequence of fixed capacity;

S2. Sending the video frame sequence to a human body posture estimation and behavior recognition module, which estimates the human body posture and recognizes the behavior type from the three-dimensional coordinates of key points, the behavior types comprising static behaviors and dynamic behaviors; if the recognized behavior type is a static behavior, executing S3, otherwise executing S1;

S3. A sitting posture evaluation module receiving both the human posture estimation result and the behavior type recognition result, evaluating the sitting posture based on the two, and giving corresponding prompts according to the evaluation result.
Compared with the prior art, the present invention has the following advantages:

1. The present invention is developed on the basis of a monocular video image sequence, uses a multi-task end-to-end network structure to realize human body posture estimation and behavior recognition, and uses the accurate posture estimation results to improve behavior recognition accuracy.

2. In the behavior recognition process, the present invention uses the low-level feature map results to obtain environmental context information related to the posture, further improving the accuracy of recognition between similar behaviors.

3. When acquiring 3-dimensional posture information, the present invention adds an attention mechanism in the spatial domain and improves the accuracy of the posture key points through spatial context.

4. The present invention combines the desktop position and posture information of the actual scene to perform sitting posture evaluation, improving sitting posture recognition accuracy.

For the above reasons, the present invention can be widely used in office equipment and teaching equipment.
Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.

Figure 1 is a flowchart of the sitting posture recognition method of the present invention.

Figure 2 is a schematic diagram of the structure of the human body posture estimation and behavior recognition module of the present invention.

Figure 3 is a schematic diagram of the structure of the low-level feature extraction sub-module in the embodiment.

Figure 4 is a schematic diagram of the distribution of the 11 key points in a sitting state in the embodiment.

Figure 5 is a schematic diagram of the SACAM network structure in the embodiment.

Figure 6 is a flowchart of posture estimation heat map decoding in an embodiment.

Figure 7 is a schematic diagram of the pose estimation results of a video sequence as input to the behavior recognition part in an embodiment.

Figure 8 is a schematic diagram of the SRLRTM network structure in the embodiment.
Detailed Description

In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way can be interchanged under appropriate circumstances so that the embodiments of the present invention described herein can be implemented in an order other than those illustrated or described herein. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.
Figure 1 shows a sitting posture recognition method based on a monocular video image sequence provided by the present invention; its process for sitting posture evaluation includes the following steps:

S1. Obtain the current video frame Frame_k from the monocular camera and update the video frame sequence VideoClip = {Frame_i | i ∈ {k-T+1, ..., k}}, which can store T frames of images.

S2. Send the video frame sequence to the human body posture estimation and behavior recognition module, which estimates the human body posture and recognizes the behavior type from the 3-dimensional coordinates of key points; the behavior types comprise static behaviors and dynamic behaviors. If the recognition result is a static behavior, execute S3, otherwise execute S1. In addition, this step may also include sending the current frame Frame_k to the desktop detection module for desktop pose detection.

S3. The sitting posture evaluation module receives both the human posture estimation result and the behavior type recognition result, evaluates the sitting posture based on the two, and gives corresponding prompts according to the evaluation result. Correspondingly, this step may also include the sitting posture evaluation module receiving the desktop pose detection result to assist the evaluation. When the behavior recognition result indicates that the recognized subject is in a relatively static state such as typing, writing, or reading, its sitting posture is evaluated. Sitting posture evaluation in the present invention may adopt, but is not limited to, the following methods: 1) entering the standard sitting posture in advance and calculating the similarity of each joint vector between the current sitting posture and the standard sitting posture; 2) judging the distance between the head and the desktop; 3) treating it as a classification task and training a neural network to discriminate.
In the present invention, it is preferable to adopt a multi-task end-to-end network structure for human body posture estimation and behavior recognition as the human body posture estimation and behavior recognition module. Compared with the commonly used staged, multi-task networks, it can use the posture estimation results more precisely to help improve behavior recognition accuracy; and since sitting posture recognition accuracy depends largely on the accuracy of human posture estimation and of behavior recognition, sitting posture recognition accuracy can be improved further. In the previously used staged, cascaded recognition algorithms, the input is only the human posture, and such input features cause behaviors with similar postures to be confused with one another during recognition; for example, the postures of drinking water and smoking are very similar.

To solve the above problems, the human body posture estimation and behavior recognition module further includes a low-level feature extraction sub-module and at least one level of estimation and recognition working groups. The low-level feature extraction sub-module is mainly used to process each frame image in the video frame sequence into a low-level feature map. Each estimation and recognition working group includes a three-dimensional pose estimation part and a behavior recognition part working in parallel. The three-dimensional pose estimation part of the first-level working group takes the low-level feature map as its input feature and outputs a human pose estimation result; its behavior recognition part takes the same-level human pose estimation result and the low-level feature map as input features and outputs a behavior recognition result. The three-dimensional pose estimation parts of the other working groups all take the low-level feature map and the previous level's human pose estimation result as input features and output a human pose estimation result; their behavior recognition parts take the same-level human pose estimation result and the previous level's behavior recognition result as input features and output a behavior recognition result. As a preferred embodiment of the present invention, the human body posture estimation and behavior recognition module outputs the human pose estimation result and the behavior recognition result obtained by the last-level working group. By introducing a re-injection mechanism between the working groups at all levels, and between the three-dimensional pose estimation part and the behavior recognition part within each working group, the present invention significantly improves the accuracy of the pose estimation and behavior recognition results.

Specifically, the low-level feature extraction sub-module is the input part of the network, i.e., the stem of the network; the T-frame video frame sequence is resized to a uniform size and then fed into this network, and the output of this part is a low-level feature. The inventive concept underlying the present invention is to use as few convolutional layers as possible to compress the features into the desired shape. The emphasis is on network efficiency; the features extracted at this stage are not required to have strong fitting ability. To improve the effectiveness of these features, the present invention introduces a re-injection mechanism (re-injection) into the network to refine them; at the same time, the pose estimation part and the behavior recognition part use specially designed network structures to model the spatial and temporal domains separately, which is elaborated below. The present invention takes the Resnet residual-network bottleneck layer as a basis and optimizes the network structure to improve the network's speed. As shown in Figure 3, preferably, the original network's 1×1 convolution is replaced with a 1×1 group convolution (1×1 groupconv) plus channel shuffle, which implements the function of the 1×1 convolution while reducing computation; the 3×3 convolution is replaced with a 3×3 depthwise convolution (depthwise conv) with stride 2, which likewise reduces computation. The final addition operation is changed to a channel concatenation (concat) operation, and each identity mapping undergoes a max pooling operation with stride 2. This optimized design ensures that the original image can be brought to the desired feature map shape through only a few modified bottleneck layers.

In addition, when realizing these functions, the present invention introduces the re-injection mechanism into both the three-dimensional pose estimation and the behavior recognition, forming the structure of the entire human body posture estimation and behavior recognition module, as shown in Figure 2. Each three-dimensional pose estimation module adds the low-level features and the features of the previous three-dimensional pose estimation module as its input features; each behavior recognition module adds its current input features and the pre-global-pooling features of the previous behavior recognition module as its new input features. Through this re-injection mechanism, the features are continuously refined, and the network's results gradually become more accurate.
In one embodiment of the present invention, the three-dimensional pose estimation part performs a heat map extraction step and a heat map decoding step, where the heat map extraction step is executed once or stacked multiple times.

Specifically, the present invention defines 3-dimensional pose estimation in the sitting state as the 3-dimensional coordinates of 11 key points; once these coordinates are determined, the human posture can be connected according to the topology of the human body. The 11 key points are: left eye (1), right eye (2), nose (3), left mouth corner (4), right mouth corner (5), left shoulder (6), right shoulder (7), left elbow (8), right elbow (9), left wrist (10), and right wrist (11), as shown in Figure 4.

In the heat map extraction step, the structure of the 3-dimensional pose estimation part is likewise optimized based on Resnet, and a new network structure, SACAM (spatial attention and channel attention module), is proposed. In this structure, max pooling is performed along the channel dimension, and a 3×3 convolution is applied to the pooled result to obtain spatial attention, i.e., weights for different pixel positions, which refine the features. An SE layer is then introduced to learn the weights of different channels, i.e., channel-level attention, which re-refines the features of the different channels. The SACAM structure is shown in Figure 5. Since the preceding low-level feature extraction stage has already brought the feature map to the required resolution, no down-sampling is performed in the SACAM block: all convolutions have stride 1, the pooling operations serve only to extract attention, and the resolutions of the SACAM input and output feature maps remain the same.

Further, in the heat map decoding step, after the pose estimation input features pass through one or more stacked SACAM structures, a key point heat map Heatmap of size (hw, hh, hc) is generated. It is converted to (hx, hy, hz, hk) by a reshape operation, where hx and hy correspond to the two-dimensional pose estimation result, hz is the key point depth value, and hk is the number of key point categories (set to 11 in this embodiment), with hc = hz*hk, hw = hx, and hh = hy.

Then, global max pooling over the third dimension of the Heatmap gives the heat map Hxy of size (hx, hy, hk), and global max pooling over the first two dimensions gives the heat map Hz of size (hz, hk). In this embodiment, soft-argmax is used to parse the two-dimensional key point coordinates and the depth coordinate from the two heat maps, which together form the three-dimensional key point coordinates. Traditional algorithms often use argmax to obtain coordinate values from the heat map, but the result of this operation is not differentiable, which breaks the back-propagation chain. The present invention instead uses soft-argmax, which essentially defines the event as the maximum value falling at coordinate (x, y); the heat maps Hxy and Hz then naturally become the corresponding probability mass functions, and finding the coordinate of the maximum value becomes computing an expectation. The formula is as follows:
soft-argmax(x) = Σ_i i · exp(x_i) / Σ_j exp(x_j)

where x is the input image; i and j index positions in the image; x_i is the pixel value at position i of image x, and likewise for x_j; the output of the function is the coordinate of the maximum value of the image.
对于关键点的置信度,我们对热图Hxy前两个维度做全局最大池化得到Cxy,对热图Hz的第一个维度做全局池化得到Cz,二者按通道相加,得到置信度Conf。整个姿态估计热图解码的流程如图6所示。For the confidence of key points, we do global maximum pooling on the first two dimensions of the heat map Hxy to get Cxy, and do global pooling on the first dimension of the heat map Hz to get Cz, and add the two according to the channel to get the confidence. Conf. The whole process of posture estimation heat map decoding is shown in Figure 6.
In a further embodiment of the present invention, the behavior recognition unit is configured to perform a behavior recognition model building step, a recognition input feature construction step, and behavior recognition and classification steps.
Behavior recognition model building step
When designing the model, the behavior recognition input features are used to model short-term information and long-term information separately, and the two models are connected in series to form the recognition model. As a further preferred embodiment, an SRLRTM block structure is designed around the shape of the input features, so that both short-term and long-term information can be modeled with ordinary 2-dimensional convolutions. As shown in Figure 8, SRLRTM consists of two parts. The left half of SRLRTM models short-term information. It uses a 1×1 convolution to enhance the flow of information between channels and reduce the channel count, while an hk×3 convolution models short-term information: since the second dimension of the features represents time T, setting the second dimension of the convolution kernel to 3 models 3 adjacent frames. Channel-wise max pooling then yields a spatiotemporal attention, which is correlated with the identity-mapping features to produce locally enhanced features; to preserve the completeness of the information, a skip connection adds the original features to the locally enhanced features. The right half of SRLRTM models long-term information. Its first 1×1 convolution likewise enhances inter-channel information flow and reduces the channel count; an hk×T convolution models all T frames at once and, combined with a 1×1 convolution, yields a channel attention, which is multiplied with the identity-mapping features along the channel dimension to obtain globally enhanced features, and these are added to the identity-mapping features to retain the original information. The left half and the right half connected in series form one SRLRTM block. After several stacked SRLRTM blocks, a global max pooling layer, a fully connected layer, and a softmax are attached to produce the recognition and classification result.
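A rough PyTorch sketch of one such block follows. The channel reduction ratio, the sigmoid gates, and the exact placement of the attention multiplications are assumptions; the description fixes only the 1×1, hk×3, and hk×T convolutions, the attention-style products with the identity-mapping features, and the residual additions.

```python
import torch
import torch.nn as nn

class SRLRTMBlock(nn.Module):
    """Rough sketch of one SRLRTM block over features shaped (N, C, hk, T)."""
    def __init__(self, channels, hk, t, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        # Left half: short-term modeling over 3 adjacent frames.
        self.short = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1),   # mix channels, reduce count
            nn.Conv2d(reduced, channels, kernel_size=(hk, 3),
                      padding=(0, 1)),                     # hk x 3: 3 neighboring frames
        )
        # Right half: long-term modeling over all T frames at once.
        self.long = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1),
            nn.Conv2d(reduced, channels, kernel_size=(hk, t)),  # hk x T kernel
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),                                  # channel attention gate
        )

    def forward(self, x):
        # Short-term branch: spatiotemporal attention plus skip connection.
        att = torch.sigmoid(self.short(x).max(dim=1, keepdim=True).values)
        x = x + x * att                # locally enhanced + original features
        # Long-term branch: channel attention plus residual addition.
        gate = self.long(x)            # (N, C, 1, 1)
        return x + x * gate            # globally enhanced + original information
```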
Recognition input feature construction step
This step mainly extracts pose estimation features and scene context features and concatenates the two to form the behavior recognition input features. The input of the behavior recognition part consists of two pieces: one is the pose estimation result, the other is the low-level features extracted by the low-level feature extraction sub-module. Combining human-body appearance with environmental context for behavior recognition, as done in this embodiment, addresses the problem that judging behavior from posture alone is not accurate enough.
The pose estimation result needs a format conversion so that the network can process it conveniently. In this embodiment, the time dimension is taken as the horizontal axis and the key point category as the vertical axis, with the x, y, z coordinates of the 3-dimensional key points mapped to 3 channels; features in this layout can be processed directly with ordinary 2-dimensional convolutions. The feature, shown in Figure 7, has shape (hk, T, 3).
For the human-body appearance and scene context features, this embodiment extracts them by taking the outer product of the low-level features and the heat map. Specifically, the extracted heat map Hxy has shape (hx, hy, hk), i.e. (hw, hh, hk), and since no down-sampling occurs between the low-level features and the heat map Hxy, the low-level features, denoted F, have shape (hw, hh, hd), where hd is the number of channels. The outer product is computed between each channel of Hxy and each channel of F, giving a result of shape (hx, hy, hk*hd). Because the magnitude of the outer product of two vectors equals the area of the parallelogram they span, its result reflects both the similarity of the two vectors and their lengths, and the outer product of matrices is in essence the outer products of their corresponding columns. The purpose of computing the outer product here is to use all key point positions at one moment to extract human-body appearance information and context information from the heat map. After the outer product is obtained, global average pooling over the first two axes turns the feature shape into (hk*hd); the features of the T video frames are then concatenated to obtain the human-body appearance and scene context feature Representf of shape (T, hk*hd), whose second axis is split and reordered so that the final feature shape is (hk, T, hd). Since the pose estimation features have shape (hk, T, 3) and the appearance and scene context features have shape (hk, T, hd), their first two dimensions match, and they are concatenated along the channel axis to form the behavior recognition input features of shape (hk, T, 3+hd).
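The construction can be sketched as follows, assuming PyTorch tensors; the einsum is one possible realization of the channel-wise outer product followed by global average pooling, and the exact tensor layout is an assumption.

```python
import torch

def build_recognition_input(hxy_seq, feat_seq, pose_seq):
    """Hedged sketch of the input-feature construction for one clip.

    hxy_seq:  (T, hx, hy, hk) per-frame keypoint heat maps
    feat_seq: (T, hw, hh, hd) per-frame low-level features (hw=hx, hh=hy)
    pose_seq: (T, hk, 3)      per-frame 3D keypoint coordinates
    """
    # Channel-wise outer product: every heat-map channel against every
    # feature channel, summed over the spatial axes.
    rep = torch.einsum('txyk,txyd->tkd', hxy_seq, feat_seq)  # (T, hk, hd)
    rep = rep / (hxy_seq.shape[1] * hxy_seq.shape[2])        # global average pooling
    rep = rep.permute(1, 0, 2)                               # (hk, T, hd)
    pose = pose_seq.permute(1, 0, 2)                         # (hk, T, 3)
    return torch.cat([pose, rep], dim=-1)                    # (hk, T, 3+hd)
```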
Behavior recognition and classification steps
In this step, the recognition input features are fed into the recognition model to obtain the recognition and classification result. In this embodiment, the recognition results are divided into static behaviors and dynamic behaviors with respect to the sitting state. Dynamic behaviors include, but are not limited to: stretching, standing up, sitting down, reaching for objects, shaking the head, turning around, making a phone call, and talking with others. Static behaviors include, but are not limited to: writing, typing, and reading.
In the present invention, a desktop three-dimensional detection method is preferably used as the desktop detection module. It obtains the desktop pose and position from monocular images alone, without additional sensors. Knowing the pose and position of the desktop makes it possible to derive the relative relationship between the person and the desktop, which improves the accuracy of the sitting posture evaluation. Previous sitting posture monitoring algorithms did not consider making use of this information.
The desktop detection module further includes a plane area detection unit and a plane three-dimensional parameter inference unit.
The plane area detection unit is implemented with a Mask-RCNN network. Mask-RCNN is an instance segmentation model that can determine the position and category of each target in an image and give pixel-level predictions. The desktop detection box, the desktop segmentation result, and the output feature map of the backbone network produced by Mask-RCNN serve as the input of the plane three-dimensional parameter inference unit.
The plane three-dimensional parameter inference unit first uses ROI Align to extract the desktop features and regress the desktop normal vector. ROI Align locates the corresponding desktop detection box on the backbone output feature map and performs bilinear interpolation there to obtain the desktop features.
The plane three-dimensional parameter inference unit also decodes the backbone feature map to obtain a global depth map. The decoding part uses bilinear interpolation so that the predicted depth map reaches the same resolution as the input monocular image.
The plane three-dimensional parameter inference unit uses the desktop segmentation mask to extract the desktop depth from the global depth map. The extraction is implemented with an AND operation, after which the desktop position vector is regressed through one max pooling layer and two 1x1 convolutions. Finally, the desktop normal vector and the desktop position vector are the output of the desktop detection module.
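A minimal sketch of this extraction and regression head follows, assuming PyTorch; the mask multiplication stands in for the AND operation, and the pooling output size and hidden channel count are invented, since the text does not specify them.

```python
import torch
import torch.nn as nn

class DeskPositionHead(nn.Module):
    """Sketch of the desktop position regression head; sizes are assumed."""
    def __init__(self, hidden=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveMaxPool2d(8),                  # the max pooling layer
            nn.Conv2d(1, hidden, kernel_size=1),      # first 1x1 convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 3, kernel_size=1),      # second 1x1 conv -> 3D position
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, depth, mask):
        # AND operation: keep only depth values inside the desktop mask.
        desk_depth = depth * (mask > 0.5).float()     # (N, H, W)
        return self.head(desk_depth.unsqueeze(1)).flatten(1)  # (N, 3)
```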
In the present invention, a sitting posture evaluation method is preferably used as the sitting posture evaluation module. In the algorithm preparation stage, a standard sitting posture is recorded in advance. While the algorithm runs, it computes the similarity between each joint vector of the current sitting posture and of the standard sitting posture, the angle between the left-right eye line and the desktop normal vector, and the distance between the nose and the desktop position. According to the magnitude of the deviation, the sitting posture is classified as standard, slightly incorrect, or severely incorrect, and the user is reminded according to the level.
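The evaluation logic can be sketched as below; the joint-vector similarity measure, the tilt computation, and in particular all numeric thresholds are invented for illustration and would need to be tuned against the pre-recorded standard sitting posture.

```python
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def evaluate_posture(current, standard, eye_l, eye_r, nose, desk_n, desk_p):
    """Hedged sketch of the evaluation; all thresholds are invented.

    current/standard map joint names to joint vectors; eye_l, eye_r, nose
    are 3D keypoints; desk_n and desk_p come from the desktop module.
    """
    # Similarity of each joint vector to the pre-recorded standard posture.
    worst_sim = min(cos_sim(current[j], standard[j]) for j in standard)
    # Head tilt: the eye line should be roughly perpendicular to the
    # desktop normal, so measure the deviation from 90 degrees.
    angle = np.degrees(np.arccos(abs(cos_sim(eye_r - eye_l, desk_n))))
    tilt = abs(90.0 - angle)
    # Distance between the nose and the desktop position (meters, assumed).
    dist = float(np.linalg.norm(nose - desk_p))

    if worst_sim > 0.95 and tilt < 10 and dist > 0.30:   # example thresholds
        return 'standard'
    if worst_sim > 0.85 and tilt < 20 and dist > 0.20:
        return 'slightly incorrect'
    return 'severely incorrect'
```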
A specific application example further illustrates the scheme of the present invention below.
1. 512×512 video frames are continuously obtained from the monocular camera and processed in two ways: a) a video queue of capacity T=10 is updated, and the whole video queue is sent into the human pose estimation and behavior recognition module; b) the current video frame is sent directly into the desktop detection module.
2. The human pose estimation and behavior recognition module starts working. The low-level feature extraction sub-module adopts the bottleneck layer structure shown in Figure 3, stacked 4 times. The output low-level features have resolution 32×32, with the channel count expanded from 3 to 576: the first bottleneck layer expands the channels to 12, the second to 48, the third to 192, and the fourth to 576. The pose estimation unit sends each of the T video frames into the SACAM structure for three-dimensional pose estimation. The SACAM blocks are stacked 5 times, with all convolution strides equal to 1, producing the pose estimation features. These features are then sent into the heat map decoding module to obtain Pxy, Pz, and Conf, each with 11 channels corresponding to 11 key points. Because the T video frames are processed separately, human pose results for T frames are obtained here. The behavior recognition unit first constructs the behavior recognition input features, whose size after construction is (hk, T, 3+hd) = (11, 10, 579), and then feeds them into the SRLRTM block structure. After passing through 5 stacked SRLRTM blocks, the behavior recognition input features are followed by a global max pooling layer, a fully connected layer, and a softmax to obtain the recognition and classification result.
3. A re-injection mechanism is introduced, as shown in Figure 2: each three-dimensional pose estimation module takes the sum of the low-level features and the features of the previous three-dimensional pose estimation module as its input features, and each behavior recognition module takes the sum of its current input features and the pre-global-pooling features of the previous behavior recognition module as its new input features, improving the recognition accuracy of the network.
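The wiring of this mechanism across two or more stages could look roughly as follows; all module names here are placeholders, not the names used by the implementation.

```python
def reinjected_forward(low, pose_stages, rec_stages, make_rec_input):
    """Hypothetical wiring of the re-injection mechanism of Figure 2.

    pose_stages / rec_stages are lists of per-level modules;
    make_rec_input builds the recognition input features from a pose
    feature and the shared low-level features `low`.
    """
    pose_feat, rec_feat = None, None
    for pose_stage, rec_stage in zip(pose_stages, rec_stages):
        pose_in = low if pose_feat is None else low + pose_feat  # re-inject
        pose_feat = pose_stage(pose_in)
        rec_in = make_rec_input(pose_feat, low)
        if rec_feat is not None:
            rec_in = rec_in + rec_feat       # add pre-pooling features
        rec_feat = rec_stage(rec_in)         # features before global pooling
    return pose_feat, rec_feat
```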
4. Desktop detection is performed. Desktop detection is essentially a plane detection problem whose purpose is to obtain the position and pose of the desktop from the image. The desktop detection module performs 3D plane detection on the monocular image and obtains the depth map and normal vector describing each plane as the plane's position and pose information. Then, according to the placement of the camera, the desktop extent is determined by searching upward from the lower part of the image.
5. Sitting posture evaluation and prompting are performed. When the behavior recognition unit recognizes that the subject is in a relatively static state such as typing, writing, or reading, its sitting posture is evaluated and a corresponding prompt is given according to the evaluation result.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, units, or modules, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence the part that contributes to the prior art, or all or part of that technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage media include various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some or all of the technical features therein, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

  1. A sitting posture recognition method based on a monocular video image sequence, characterized in that it comprises:
    S1. Obtaining the current video frame from a monocular camera and updating a video frame sequence, the capacity of the video frame sequence being fixed;
    S2. Sending the video frame sequence into a human pose estimation and behavior recognition module, and performing human pose estimation and behavior type recognition by obtaining the 3-dimensional coordinates of key points, the behavior types comprising static behaviors and dynamic behaviors; if the behavior type recognition result is judged to be a static behavior, executing S3, otherwise executing S1;
    S3. A sitting posture evaluation module simultaneously receiving the human pose estimation result and the behavior type recognition result, performing sitting posture evaluation on the basis of both, and giving a corresponding prompt according to the evaluation result.
  2. The sitting posture recognition method according to claim 1, characterized in that the human pose estimation and behavior recognition module comprises:
    a low-level feature extraction sub-module, which processes each frame image in the video frame sequence into a low-level feature map;
    and at least one level of estimation and recognition working group, the estimation and recognition working group comprising a three-dimensional pose estimation part and a behavior recognition part working in parallel, wherein:
    the three-dimensional pose estimation part of the first-level estimation and recognition working group takes the low-level feature map as its input feature and outputs a human pose estimation result,
    and its behavior recognition part takes the human pose estimation result of the same level and the low-level feature map as input features and outputs a behavior recognition result;
    the three-dimensional pose estimation parts of the other estimation and recognition working groups each take the low-level feature map and the human pose estimation result of the previous level as input features and output a human pose estimation result,
    and their behavior recognition parts take the human pose estimation result of the same level and the behavior recognition result of the previous level as input features and output a behavior recognition result.
  3. The sitting posture recognition method according to claim 2, characterized in that the three-dimensional pose estimation part is configured to execute:
    a heat map extraction step: extracting spatial-domain attention and channel-level attention from the input features of the three-dimensional pose estimation part, and generating a key point heat map according to the spatial-domain attention and the channel-level attention, the spatial-domain attention being the weight of each pixel position of the image, and the channel-level attention being the weight of each input channel;
    a heat map decoding step: performing global max pooling on the key point heat map to obtain a two-dimensional coordinate heat map and a depth coordinate heat map, and extracting three-dimensional key point coordinates from the two-dimensional coordinate heat map and the depth coordinate heat map.
  4. The sitting posture recognition method according to any one of claims 2-3, characterized in that the human pose estimation and behavior recognition module takes the human pose estimation result and the behavior recognition result obtained by the last-level estimation and recognition working group as its output.
  5. The sitting posture recognition method according to claim 2, characterized in that the behavior recognition part is configured to execute:
    a recognition input feature construction step: extracting pose estimation features and scene context features, and concatenating the two to form behavior recognition input features;
    a behavior recognition step: inputting the recognition input features into a recognition model to obtain a recognition and classification result.
  6. The sitting posture recognition method according to claim 5, characterized in that the recognition part is further configured to execute a behavior recognition model building step, which specifically comprises:
    modeling the short-term information and the long-term information of the behavior recognition input features separately, obtaining a short-term information sub-model and a long-term information sub-model respectively;
    connecting the short-term information sub-model and the long-term information sub-model in series to form a behavior recognition working group;
    stacking a plurality of the behavior recognition working groups to obtain a behavior recognition model.
  7. The sitting posture recognition method according to claim 6, characterized in that the behavior recognition step comprises:
    inputting the behavior recognition input features into the behavior recognition model to obtain behavior recognition intermediate features;
    classifying the behavior recognition intermediate features using max pooling, a fully connected layer, and softmax, so as to obtain a behavior recognition and classification result.
  8. The sitting posture recognition method according to claim 1, characterized in that S2 further comprises: sending the current frame into a desktop detection module for desktop pose detection.
  9. The sitting posture recognition method according to claim 8, characterized in that S3 further comprises: the sitting posture evaluation module receiving the desktop pose detection result to assist in the sitting posture evaluation.
PCT/CN2020/104054 2020-05-27 2020-07-24 Sitting posture recognition method based on monocular video image sequence WO2021237913A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010462958.8A CN111626211B (en) 2020-05-27 2020-05-27 Sitting posture identification method based on monocular video image sequence
CN202010462958.8 2020-05-27

Publications (1)

Publication Number Publication Date
WO2021237913A1 true WO2021237913A1 (en) 2021-12-02

Family

ID=72272365

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/104054 WO2021237913A1 (en) 2020-05-27 2020-07-24 Sitting posture recognition method based on monocular video image sequence

Country Status (2)

Country Link
CN (1) CN111626211B (en)
WO (1) WO2021237913A1 (en)

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN115761885A (en) * 2022-11-16 2023-03-07 之江实验室 Behavior identification method for synchronous and cross-domain asynchronous fusion drive
CN115984384A (en) * 2023-03-20 2023-04-18 乐歌人体工学科技股份有限公司 Desktop lifting control method based on facial posture image estimation
CN117237443A (en) * 2023-02-20 2023-12-15 北京中科海芯科技有限公司 Gesture estimation method, device, electronic equipment and storage medium

Families Citing this family (6)

Publication number Priority date Publication date Assignee Title
CN112102358B (en) * 2020-09-29 2023-04-07 南开大学 Non-invasive animal behavior characteristic observation method
CN113065532B (en) * 2021-05-19 2024-02-09 南京大学 Sitting posture geometric parameter detection method and system based on RGBD image
CN113177365B (en) * 2021-05-26 2022-12-06 上海交通大学 Heuristic method and system for vertically stacking irregular objects, storage medium and terminal
CN115690893A (en) * 2021-07-22 2023-02-03 北京有竹居网络技术有限公司 Information detection method, information detection device, information detection medium, and electronic device
CN114120357B (en) * 2021-10-22 2023-04-07 中山大学中山眼科中心 Neural network-based myopia prevention method and device
CN117746505A (en) * 2023-12-21 2024-03-22 武汉星巡智能科技有限公司 Learning accompanying method and device combined with abnormal sitting posture dynamic detection and robot

Citations (7)

Publication number Priority date Publication date Assignee Title
JP2007153035A (en) * 2005-12-01 2007-06-21 Auto Network Gijutsu Kenkyusho:Kk Occupant sitting judgement system
CN104157107A (en) * 2014-07-24 2014-11-19 燕山大学 Human body posture correction device based on Kinect sensor
CN108491754A (en) * 2018-02-02 2018-09-04 泉州装备制造研究所 A kind of dynamic representation based on skeleton character and matched Human bodys' response method
CN108665687A (en) * 2017-03-28 2018-10-16 上海市眼病防治中心 A kind of sitting posture monitoring method and device
CN110287864A (en) * 2019-06-24 2019-09-27 火石信科(广州)科技有限公司 A kind of intelligent identification of read-write scene read-write element
CN111178280A (en) * 2019-12-31 2020-05-19 北京儒博科技有限公司 Human body sitting posture identification method, device, equipment and storage medium
CN111601088A (en) * 2020-05-27 2020-08-28 大连成者科技有限公司 Sitting posture monitoring system based on monocular camera sitting posture identification technology

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
JP2015056057A (en) * 2013-09-12 2015-03-23 トヨタ自動車株式会社 Method of estimating posture and robot
CN111104816B (en) * 2018-10-25 2023-11-03 杭州海康威视数字技术股份有限公司 Object gesture recognition method and device and camera
CN110717392B (en) * 2019-09-05 2022-02-18 云知声智能科技股份有限公司 Sitting posture detection and correction method and device
CN111161349B (en) * 2019-12-12 2023-12-12 中国科学院深圳先进技术研究院 Object posture estimation method, device and equipment
CN111046825A (en) * 2019-12-19 2020-04-21 杭州晨鹰军泰科技有限公司 Human body posture recognition method, device and system and computer readable storage medium

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
JP2007153035A (en) * 2005-12-01 2007-06-21 Auto Network Gijutsu Kenkyusho:Kk Occupant sitting judgement system
CN104157107A (en) * 2014-07-24 2014-11-19 燕山大学 Human body posture correction device based on Kinect sensor
CN108665687A (en) * 2017-03-28 2018-10-16 上海市眼病防治中心 A kind of sitting posture monitoring method and device
CN108491754A (en) * 2018-02-02 2018-09-04 泉州装备制造研究所 A kind of dynamic representation based on skeleton character and matched Human bodys' response method
CN110287864A (en) * 2019-06-24 2019-09-27 火石信科(广州)科技有限公司 A kind of intelligent identification of read-write scene read-write element
CN111178280A (en) * 2019-12-31 2020-05-19 北京儒博科技有限公司 Human body sitting posture identification method, device, equipment and storage medium
CN111601088A (en) * 2020-05-27 2020-08-28 大连成者科技有限公司 Sitting posture monitoring system based on monocular camera sitting posture identification technology

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN115761885A (en) * 2022-11-16 2023-03-07 之江实验室 Behavior identification method for synchronous and cross-domain asynchronous fusion drive
CN115761885B (en) * 2022-11-16 2023-08-29 之江实验室 Behavior recognition method for common-time and cross-domain asynchronous fusion driving
CN117237443A (en) * 2023-02-20 2023-12-15 北京中科海芯科技有限公司 Gesture estimation method, device, electronic equipment and storage medium
CN117237443B (en) * 2023-02-20 2024-04-19 北京中科海芯科技有限公司 Gesture estimation method, device, electronic equipment and storage medium
CN115984384A (en) * 2023-03-20 2023-04-18 乐歌人体工学科技股份有限公司 Desktop lifting control method based on facial posture image estimation

Also Published As

Publication number Publication date
CN111626211B (en) 2023-09-26
CN111626211A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
WO2021237913A1 (en) Sitting posture recognition method based on monocular video image sequence
WO2021237914A1 (en) Sitting posture monitoring system based on monocular camera sitting posture recognition technology
US10755128B2 (en) Scene and user-input context aided visual search
KR102319177B1 (en) Method and apparatus, equipment, and storage medium for determining object pose in an image
US10096122B1 (en) Segmentation of object image data from background image data
WO2021227726A1 (en) Methods and apparatuses for training face detection and image detection neural networks, and device
US20210366152A1 (en) Method and apparatus with gaze estimation
US9330296B2 (en) Recognizing entity interactions in visual media
JP2022521844A (en) Systems and methods for measuring weight from user photos using deep learning networks
WO2023082882A1 (en) Pose estimation-based pedestrian fall action recognition method and device
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN107392159A (en) A kind of facial focus detecting system and method
KR102338486B1 (en) User Motion Recognition Method and System using 3D Skeleton Information
US20230118864A1 (en) Lifted semantic graph embedding for omnidirectional place recognition
JP7499280B2 (en) Method and system for monocular depth estimation of a person - Patents.com
WO2021218238A1 (en) Image processing method and image processing apparatus
JP2023545190A (en) Image line-of-sight correction method, device, electronic device, and computer program
US20230326173A1 (en) Image processing method and apparatus, and computer-readable storage medium
CN111046734A (en) Multi-modal fusion sight line estimation method based on expansion convolution
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
CN114036969B (en) 3D human body action recognition algorithm under multi-view condition
JP2023514107A (en) 3D reconstruction method, device, equipment and storage medium
US20220351405A1 (en) Pose determination method and device and non-transitory storage medium
JP5027030B2 (en) Object detection method, object detection apparatus, and object detection program
WO2021184359A1 (en) Target following method, target following apparatus, movable device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20937493

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20937493

Country of ref document: EP

Kind code of ref document: A1