WO2022247147A1 - Methods and systems for posture prediction - Google Patents

Methods and systems for posture prediction

Info

Publication number
WO2022247147A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
image
local
human body
sequence
Prior art date
Application number
PCT/CN2021/128377
Other languages
English (en)
French (fr)
Inventor
Tao Xiong
Naike WEI
Huadong PAN
Jun Yin
Original Assignee
Zhejiang Dahua Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co., Ltd. filed Critical Zhejiang Dahua Technology Co., Ltd.
Publication of WO2022247147A1 publication Critical patent/WO2022247147A1/en


Classifications

    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0499 Feedforward networks
    • G06N3/08 Learning methods
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/82 Arrangements for image or video recognition or understanding using neural networks
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition

Definitions

  • the present disclosure relates to the field of image recognition, and in particular to a method and a system for posture prediction.
  • the posture prediction is an important research topic in the field of image recognition, and has a wide range of application scenarios, for example, health monitoring.
  • the posture prediction is mainly realized through vision-based approaches; by analyzing an image of the human body, the human posture and behavior may be determined.
  • the human body may be occluded, thereby decreasing the accuracy and efficiency of posture prediction. How to achieve the posture prediction with improved accuracy and efficiency under the occlusion of the human body is a technical problem to be solved urgently.
  • the method for posture prediction may include obtaining an image of a human body; determining a prediction result of a posture of the human body in the image by processing the image based on a posture prediction model, wherein the processing the image based on a posture prediction model includes: determining at least one feature sequence based on the image; determining feature representation information based on the at least one feature sequence; and determining the prediction result of the posture of the human body based on the feature representation information.
  • the system for posture prediction may include: at least one storage medium including a set of instructions; and at least one processor configured to communicate with the at least one storage medium, wherein when executing the set of instructions, the at least one processor is configured to direct the system to perform operations including: obtaining an image of a human body; determining a prediction result of a posture of the human body in the image by processing the image based on a posture prediction model, wherein the processing the image based on a posture prediction model includes: determining at least one feature sequence based on the image; determining feature representation information based on the at least one feature sequence; and determining the prediction result of the posture of the human body based on the feature representation information.
  • Another aspect of embodiments of the present disclosure may provide a non-transitory computer-readable medium including executable instructions.
  • The executable instructions, when executed by at least one processor, may direct the at least one processor to perform a method, the method comprising: obtaining an image of a human body; determining a prediction result of a posture of the human body in the image by processing the image based on a posture prediction model, wherein the processing the image based on a posture prediction model includes: determining at least one feature sequence based on the image; determining feature representation information based on the at least one feature sequence; and determining the prediction result of the posture of the human body based on the feature representation information.
  • FIG. 1 is a schematic diagram illustrating an exemplary posture prediction system according to some embodiments of the present disclosure
  • FIG. 2 is a schematic diagram illustrating an exemplary processing device 112 according to some embodiments of the present disclosure
  • FIG. 3 is a flowchart illustrating an exemplary process for posture prediction according to some embodiments of the present disclosure
  • FIG. 4 is a diagram illustrating an exemplary posture prediction model according to some embodiments of the present disclosure.
  • FIG. 5 is a flowchart illustrating an exemplary process for some modules in the posture prediction model according to some embodiments of the present disclosure
  • FIG. 6 is a diagram illustrating an exemplary relation extraction module according to some embodiments of the present disclosure.
  • FIG. 7 is a diagram illustrating an exemplary posture prediction model according to other embodiments of the present disclosure.
  • FIG. 8 is a flowchart illustrating an exemplary process for posture prediction model training according to some embodiments of the present disclosure
  • FIG. 9 is a flowchart illustrating an exemplary process for posture prediction according to some embodiments of the present disclosure.
  • FIG. 10 is a flowchart illustrating another exemplary process for posture prediction according to some embodiments of the present disclosure.
  • FIG. 11 is a topological structure diagram illustrating an exemplary process for posture prediction according to some embodiments of the present disclosure
  • FIG. 12 is a topological structure diagram illustrating an exemplary embodiment of operation S1004 in FIG. 10 according to some embodiments of the present disclosure
  • FIG. 13 is a flowchart illustrating an exemplary process for posture prediction according to some embodiments of the present disclosure
  • FIG. 14 is a schematic diagram illustrating an exemplary electronic device according to some embodiments of the present disclosure.
  • FIG. 15 is a schematic diagram illustrating exemplary computer readable storage medium according to some embodiments of the present disclosure.
  • The terms "system," "engine," "unit," "module," and/or "block" used herein are one method to distinguish different components, elements, parts, sections, or assemblies of different levels in ascending order. However, the terms may be displaced by other expressions if they achieve the same purpose.
  • the flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts may not be implemented in order. Conversely, the operations may be implemented in an inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts, and one or more operations may be removed from the flowcharts.
  • the images may have occlusion problems, which may affect the prediction effect.
  • these problems may be addressed by the systems and methods for human body posture prediction provided by some embodiments of the specification.
  • the feature matrix of the images may include the feature information of key points in the images, and the feature representation information may include autocorrelation information of the key points included in the images.
  • because the prediction result of the posture of the human body further includes the autocorrelation information of the key points, even if part of the key points in the images is occluded, the positions of the occluded key points may be predicted by fitting the autocorrelation information of the key points in the images, thereby improving the accuracy and efficiency of posture prediction.
  • FIG. 1 is a schematic diagram illustrating an exemplary posture prediction system according to some embodiments of the present disclosure.
  • the posture prediction system 100 may include a server 110, a network 120, a terminal device 130, a storage device 140, an image 150, and an image collecting device 160.
  • the posture prediction system is mainly used for scenes involving human body posture prediction.
  • the posture prediction system may be applied to tasks such as human skeleton behavior recognition, pedestrian re-recognition, etc.
  • the posture prediction system may be combined with other techniques to solve actual technical problems.
  • the posture prediction system may be applied to sports software to determine whether the action of a sportsman is standard by analyzing the posture of the sportsman.
  • the server 110 may include a processing device 112 for posture prediction.
  • the processing device 112 may obtain a posture prediction model, and further perform posture prediction on an image of a human body based on the posture prediction model to determine the prediction result of the posture of the human body in the image.
  • the server 110 may be local or remote.
  • the server 110 may be locally connected to the terminal device 130 to obtain the information and/or data sent by the terminal device 130.
  • the server 110 may remotely receive information and/or data sent by the terminal device 130 via the network 120.
  • the server 110 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-layer cloud, etc., or any combination thereof.
  • the network 120 may facilitate the exchange of information and/or data.
  • one or more components of the posture prediction system 100 may transmit information to other components of the posture prediction system 100 via the network 120.
  • the terminal device 130 may transmit the images 150 to the server 110 via the network 120.
  • the server 110 may transmit the prediction result obtained by performing posture prediction on the human body image to the storage device for storage.
  • the network 120 may be any form of wired or wireless network, or any combination thereof.
  • the terminal device 130 may be configured to process information and/or data associated with image processing and posture prediction to perform one or more functions disclosed in the specification.
  • the terminal device 130 may be an image obtaining device 130-1 (such as a surveillance camera, etc. ) for obtaining images.
  • the terminal device 130 may be a portable device with data collection, storage and/or sending functions, for example, a tablet computer 130-2, a notebook computer 130-3, a smart phone 130-4, a camera, etc., or any combination thereof.
  • the terminal device 130 may include one or more processing engines (for example, a single-core processing engine or a multi-core processor) .
  • the storage device 140 may store data and/or instructions related to image processing and posture prediction, for example, the images 150, the posture prediction model, etc.
  • the storage device 140 may store data obtained/acquired by the terminal device 130 and/or the server 110.
  • the storage device 140 may store data and/or instructions used by the server 110 to execute or use to complete the exemplary methods described in the present disclosure.
  • the storage device 140 may include mass storage, removable storage, volatile read-write storage, read-only storage (ROM) , etc., or any combination thereof.
  • the storage device 140 may be connected to the network 120 to communicate with one or more components in the application scenario 100 (for example, the server 110, the terminal device 130) .
  • One or more components in the application scenario 100 may access data or instructions stored in the storage device 140 via the network 120.
  • the storage device 140 may directly connect or communicate with one or more components in the application scenario 100 (for example, the server 110, the terminal device 130, etc. ) .
  • the storage device 140 may be part of the server 110.
  • the images 150 may include one or more human body images.
  • the human body images in the images 150 may be one or more of a two-dimensional image, a three-dimensional image, an image frame sequence, or a video.
  • the images 150 may be presented in the form of a human body block diagram (such as a human body block diagram 150-1, a human body block diagram 150-2) .
  • the human body block diagram may delimit the human body in the images 150 with a rectangular selection frame that is as small as possible.
  • a human body block diagram may include one single human body.
  • the image collecting device 160 may be any device which obtains the image 150.
  • the image collecting device 160 may collect the human body image as the image 150.
  • the image collecting device 160 may include a camera (e.g., a digital camera, an analog camera, an IP camera (IPC) , etc. ) , a video recorder, a scanner, a mobile phone, a tablet computing device, a wearable computing device, an infrared imaging device (e.g., thermal imaging device) etc.
  • the images 150 may be obtained through the terminal device 130 (for example, obtained by shooting) and transmitted to the server 110 via the network 120.
  • the server 110 may download the images 150 from Internet resources via the network 120.
  • the output result of the processing device 112 may be reflected by the positions of a plurality of specific key points of the human body, wherein the specific key points (i.e., the key points of the human body) may be understood as identifying parts of the human body, for example, a knee joint, an elbow joint, etc.
  • the output result may be a plurality of thermal diagrams 114, wherein the processing device 112 may perform posture prediction on the images 150 to obtain the output result related to the posture prediction (i.e., the thermal diagrams 114) .
  • the thermal diagrams 114 may have a one-to-one correspondence with the key points of the human body, reflecting the position distribution of the key points of the human body in the image.
  • the point with the highest distribution value in the thermal diagram may be used as the prediction position of a key point of the human body. For more descriptions of the thermal diagram, please refer to the related description of operation 325.
  • Each pixel value in the thermal diagram may reflect the possibility that the corresponding key point of the human body appears at the pixel location.
  • Each pixel value in the thermal diagram may also be referred to as a distribution value or a heating value.
  • the pixel value (the distribution value) may be characterized as the gray value in the thermal diagram.
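  • As a concrete illustration of this decoding step, the following minimal sketch (an illustrative assumption, not code from the patent) converts a stack of thermal diagrams into key-point coordinates by selecting the pixel with the highest distribution value in each diagram:

```python
import numpy as np

def decode_heatmaps(heatmaps: np.ndarray) -> np.ndarray:
    """Decode key-point positions from thermal diagrams.

    heatmaps: array of shape (K, H, W), one diagram per key point;
    each pixel value is the distribution (heating) value.
    Returns an array of shape (K, 2) holding (x, y) per key point.
    """
    K, H, W = heatmaps.shape
    coords = np.zeros((K, 2), dtype=np.int64)
    for k in range(K):
        # The point with the highest distribution value is used as
        # the predicted position of the k-th key point.
        idx = np.argmax(heatmaps[k])
        coords[k] = (idx % W, idx // W)  # (x, y)
    return coords

# Example: 17 hypothetical key points over a 64 x 48 diagram grid.
coords = decode_heatmaps(np.random.rand(17, 64, 48))
```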
  • FIG. 2 is a schematic diagram illustrating an exemplary processing device 112 according to some embodiments of the present disclosure.
  • the processing device 112 may include an input module 210 and the posture prediction model 220.
  • the posture prediction model 220 may include a feature extraction module 221, a relationship extraction module 223, and a fusion output module 225. It should be noted that the description of the processing device 112 in FIG. 1 is not limited to this. FIG. 2 is intended to illustrate, but not to limit the scope of the present disclosure.
  • the input module 210 may be configured to obtain an image, wherein the image may include the human body image.
  • the posture prediction method provided in the present disclosure may predict the posture information of the human body in the image.
  • the human body image obtained by the image collecting device 160 may be used as the image.
  • the human body image including a human body obtained by the image obtaining device 130-1 may be used as the image.
  • the human body image including the human body stored in the storage device 140 may be used as the image.
  • the human body image including the human body after being preprocessed by the processing device 112 may be used as the image.
  • the posture prediction model 220 may be configured to process the human body image to determine the prediction result of the posture of the human body in the human body image.
  • the prediction result of the posture of the human body may include a characterization data configured to describe the human body posture.
  • the characterization data may be a position image of key points of the human body.
  • the key points of the human body are the parts that play a landmark role in the human body image, such as hands, feet, knees, etc.
  • the position image may be configured to describe the position of the key points of the human body in the image, and then describe the human body posture.
  • the posture prediction model 220 may be a trained neural network model, wherein modules in the posture prediction model 220 may be implemented by sub-layers or a combination of sub-layers in the neural network model.
  • the feature extraction module 221 may be configured to determine at least one feature sequence based on the human body image.
  • the feature sequence may be a collection of attributes of the human body image.
  • each human body image has its own attributes (for example, an image texture, an image edge, etc.); different attributes are represented by different attribute values, and a plurality of attribute values may be combined to represent a sequence, which may be the feature sequence.
  • the human body image may include a plurality of key points, and the collection of attributes of the human body image may also be understood as a collection of the key points.
  • the feature extraction module 221 may be implemented by a feature extraction algorithm, for example, LBP (local binary patterns) algorithm, HOG (histogram of oriented gradient) algorithm, SIFT (scale invariant feature transform) algorithm and other feature extraction algorithms.
  • the feature extraction module 221 may be implemented by a neural network algorithm, for example, a convolutional neural network model or a deep neural network model.
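  • For illustration, a minimal convolutional feature extractor of this kind might look as follows (a PyTorch sketch; the layer sizes are illustrative assumptions, as the patent does not fix an architecture):

```python
import torch
import torch.nn as nn

# Maps a 3-channel human body image of size H0 x W0 to a
# d-dimensional feature image of reduced spatial size.
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
    nn.ReLU(inplace=True),
)

image = torch.randn(1, 3, 256, 192)       # (batch, channels, H0, W0)
feature_image = feature_extractor(image)  # (batch, 128, 64, 48)
```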
  • the relationship extraction module 223 may be configured to determine feature representation information based on the at least one feature sequence.
  • the feature representation information may include an internal correlation of the feature sequence.
  • the internal correlation is configured to describe the degree of correlation between elements in the feature sequence.
  • the relationship extraction module 223 may be composed of a neural network including a sequence labeling structure, for example, a recurrent neural network model, a convolutional neural network model, or a deep neural network model.
  • the sequence labeling structure may be configured to label elements in feature sequences to calculate the correlation between elements.
  • the relationship extraction module 223 may be implemented by a self-attention model, for example, a transformer model and other models that refer to a self-attention mechanism.
  • the fusion output module 225 may be configured to determine the prediction result of the posture of the human body based on the feature representation information. In some embodiments, the fusion output module 225 may change the size and dimension of the feature representation information to obtain the prediction result of the posture of the human body.
  • the prediction result of the posture of the human body may include the thermal diagram, wherein the thermal diagram may reflect the distribution of the key points of the human body in the image. Each pixel value in the thermal diagram may reflect the possibility that the corresponding key point of the human body appears at the pixel location. Each pixel value in the thermal diagram may also be referred to as a distribution value or a heating value. In some embodiments, the pixel value (the distribution value) may be characterized as the gray value in the thermal diagram. The fusion output module 225 may use the point with the highest distribution value in the thermal diagram as the prediction position of the key point of the human body; more specific descriptions of the thermal diagram can be found in other parts of the present disclosure.
  • FIG. 3 is a flowchart illustrating an exemplary process for posture prediction according to some embodiments of the present disclosure.
  • FIG. 4 is a diagram illustrating an exemplary posture prediction model according to some embodiments of the present disclosure.
  • FIG. 4 shows the internal structure and a parameter transfer relationship of the posture prediction model.
  • the human body image 410 may be the input of the posture prediction model 220, and the human body image 410 may be sequentially processed by the feature extraction module 221, the relationship extraction module 223, and the fusion output module 225 to obtain the prediction result of the posture of the human body.
  • Process 300 shown in FIG. 3 may be described in detail below in conjunction with FIG. 4. In some embodiments, process 300 may be executed by the processing device 112.
  • the human body image may be obtained.
  • operation 310 may be performed by the input module 210.
  • the human body image may be extracted from an image to be predicted that is acquired by an image acquisition device.
  • the human body image may include at least a portion of the human body.
  • the image to be predicted obtained by the image acquisition device 130-1 may be composed of a background image and the human body image, and the human body image may be a local image of the range of the human body in the image.
  • the human body image may be a two-dimensional image under a plurality of channels. The width of the two-dimensional image may be denoted as W0, the height as H0, and the number of channels as d; that is, the size of the human body image may be denoted as W0 × H0, and the dimension as d.
  • in the present disclosure, numerical multiplication is represented by the symbol "*" and spatial multiplication by the symbol "×": W0*H0 represents the arithmetic value of the multiplication, while W0 × H0 represents a matrix, a graph, or a sequence with a width of W0 and a height of H0.
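  • The notation can be illustrated with a tensor (a hypothetical Python/PyTorch sketch; the concrete sizes are assumptions):

```python
import torch

W0, H0, d = 192, 256, 3
image = torch.randn(d, H0, W0)  # a W0 x H0 image with d channels

# W0 x H0 denotes the spatial grid; W0 * H0 is the arithmetic
# product, i.e. the number of spatial positions in that grid.
assert image[0].numel() == W0 * H0
```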
  • the human body image may be obtained by an object detection algorithm.
  • the range of the human body may be determined by performing object detection on the image to be predicted that contains the human body, and the region including the range of the human body in the image to be predicted may be used as the human body image.
  • the human body image may include the human body and background information within the range of the human body.
  • the object detection algorithm may include an R-CNN (Region-Convolutional Neural Networks) algorithm, a YOLO algorithm, and related algorithms.
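  • As an illustration, once an object detector has produced a bounding box for the range of the human body, extracting the human body image reduces to cropping that region (the detector itself is assumed and not implemented here):

```python
import numpy as np

def crop_human_body(image: np.ndarray, box: tuple) -> np.ndarray:
    """Crop the human body region from the image to be predicted.

    image: (H, W, C) array; box: (x1, y1, x2, y2) produced by an
    object detector (e.g. a YOLO-style model, assumed here).
    """
    x1, y1, x2, y2 = box
    # The region covering the range of the human body is used as
    # the human body image (it may still contain some background).
    return image[y1:y2, x1:x2]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)
body_image = crop_human_body(frame, (400, 100, 700, 650))
```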
  • the image may include a plurality of human bodies to be predicted.
  • each human body in the image may be detected one by one to obtain a plurality of human body images, and the posture prediction method may be executed on each human body image to obtain the posture information of each human body.
  • the above process of determining the human body image from the image to be predicted may be performed when obtaining the image to be predicted.
  • a processed human body image may be obtained in operation 310.
  • the human body image may be pre-processed before the input module 210 obtains the human body image, so that the input human body images have the same size.
  • the human body image may be processed based on the posture prediction model to determine a prediction result of the posture of the human body in the human body image.
  • operation 320 may be performed by the posture prediction model 220.
  • the posture prediction model 220 refers to an algorithm model that processes an input human body image and outputs the prediction result, and the prediction result may be configured to represent the human body posture in the human body image.
  • the posture prediction model 220 may include at least one trained neural network model, wherein modules in the posture prediction model 220 may be implemented by corresponding algorithms or sub-layers of the neural network model. In some embodiments, a part of the functional structure of the posture prediction model 220 may be implemented by the trained neural network model.
  • the human body posture in the human body image may be determined based on the prediction result of the posture of the human body.
  • the prediction result of the posture of the human body may include thermal diagrams of a plurality of key points of the human body. Each key point of the human body may correspond to a thermal diagram, and each pixel value in the thermal diagram may reflect the possibility that the corresponding key point of the human body appears at the pixel location.
  • the key points of the human body may represent the identified parts of the human body, and the key points of the human body may be adjusted based on the prediction accuracy and/or the use of the algorithm.
  • the key points of the human body may include, for example, the nose, ankles, etc.
  • the key points of the human body may be specified in other ways.
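  • For illustration, one common way to realize such a per-key-point thermal diagram (an assumption here; the patent does not specify the distribution) is to render a Gaussian bump centred on the key-point position, so that the pixel value decays with distance from the key point:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Render one thermal diagram whose pixel values reflect the
    possibility that a key point appears at each location.

    A Gaussian bump centred at (cx, cy) is an illustrative
    convention, not a requirement of the patent.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

hm = gaussian_heatmap(64, 48, cx=20, cy=30)
assert hm.argmax() == 30 * 48 + 20  # peak at the key-point position
```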
  • operation 320 may be implemented according to a process including operations 321-325.
  • At least one feature sequence may be determined based on the human body image.
  • operation 321 may be performed by the feature extraction module 221.
  • the feature sequence may include a plurality of features.
  • Each feature may include one or more values, which may be configured to reflect certain properties of the image.
  • Features may have a similar structure.
  • the dimension of the feature sequence may be the same as the input image, and the size of the feature sequence may correspond to the size of the input image.
  • the feature sequence may be expressed as a matrix and/or vector with a plurality of elements.
  • an element of the feature sequence may be at least one of a value, a vector, and a matrix.
  • the feature sequence may be determined based on the human body image according to a feature extraction algorithm.
  • the feature extraction algorithm may include the LBP (local binary patterns) algorithm, the HOG (histogram of oriented gradients) algorithm, the SIFT (scale invariant feature transform) algorithm, and other feature extraction algorithms.
  • the feature sequence may be determined based on the human body image according to a neural network algorithm such as a convolutional neural network model.
  • the feature sequence may include at least one local feature sequence, and the at least one local feature sequence may be configured to represent features of at least one local position or region in the human body image.
  • the feature extraction module 221 may process a local image of the human body image based on a feature extraction algorithm to obtain the corresponding local feature sequence. For more descriptions of obtaining the local feature sequence, please refer to the detailed description of FIG. 5 in the specification.
  • the local feature sequence may be the feature sequence corresponding to the human body local image.
  • the feature extraction may be performed on a local head image based on the feature extraction algorithm to obtain a corresponding local head feature sequence.
  • the amount of information of the feature sequence may be increased, and the targeted processing may be performed on the local features of the human body, so that the feature extraction is more accurate.
  • the feature extraction module 221 may determine the human body local image from the human body image to obtain the local feature sequence. For example, the feature extraction module 221 may recognize a portion (e.g., the head, the legs) of the human body and determine the human body local image based on the recognized portion of the human body.
  • the local region image may include a local region of the human body image.
  • the human body local image may also be referred to as a local region image of the human body image.
  • the local region image may be determined based on the key points of the human body.
  • key points of the human body may be classified according to their regions, and the human body local image may be determined according to the classification and region.
  • the human body local image may include the local head image, a local torso image, a local leg image, etc.
  • the local head image may include key points of the human body such as nose and eyes.
  • the extraction process of the local region image may refer to the detection of a local region of the human body, such as using the object detection algorithm and the extraction of the local region from the human body image to generate the local region image.
  • the posture prediction model 220 may further include a local segmentation layer 420.
  • the human body image 410 may be inputted into the local segmentation layer 420, and the local segmentation layer 420 may process the human body image 410 to obtain a plurality of the human body local images, and then input the human body local images to the feature extraction module 221 to obtain the corresponding local feature sequences.
  • the local segmentation layer 420 may be set according to actual needs. For example, when obtaining the local face images, the other local images may not be needed. In some embodiments, the posture prediction model 220 may not be provided with the local segmentation layer 420. For example, when the human body image 410 is determined according to the edge of the human body, the human body image may not include background information, and the increase in accuracy brought about by extracting the local region image may not offset the increase in cost caused by the additional computation of extracting the local region image; therefore, the local segmentation layer 420 may not be provided.
  • the local segmentation layer 420 may split the human body image 410 into several human body local images.
  • the local segmentation layer 420 may be provided with the target detection algorithm, that is, the local segmentation layer 420 may use a specific part of the human body image 410 as a detection object, and extract this part as the human body local image through the target detection algorithm.
  • the local segmentation layer may be implemented based on various feasible algorithms, such as YOLO.
  • the local segmentation layer 420 may be separately trained according to the objects.
  • the local region image may include less background information than the human body image, which reduces the influence of other background information on the extraction of local features of the human body.
  • the extraction process of the local feature sequence may be performed according to a process including operations S32101-S32103:
  • the human body image may be segmented based on the local segmentation layer to obtain at least one human body local image.
  • the at least one local feature sequence may be obtained based on the at least one local region image processed by the posture prediction model.
  • each of the at least one local feature sequence may be extracted from one of the at least one local region image, such that each local feature sequence corresponds to one local region image. For more description of obtaining the local feature sequence, refer to the relevant descriptions of the local feature extraction layer 433 in FIG. 5.
  • a global feature extraction layer and the local feature extraction layer in the feature extraction module 221 may be set according to actual needs. How to set up the feature extraction module 221 is related to the recognition task of the posture prediction method and the recognition accuracy.
  • the feature extraction module 221 may only include the facial feature extraction layer.
  • the feature extraction module 221 may only include the global feature extraction layer.
  • a block diagram illustrating a portion of the structure of the modules of the posture prediction model may be provided.
  • a local segmentation layer 420 may be provided with a plurality of sub-layers, for example, the local segmentation layer 420 may include a head segmentation layer 421, a body segmentation layer 423, and a leg segmentation layer 425.
  • the local segmentation layer 420 may be set in other ways, such as the local segmentation layer 420 may include the head segmentation layer 421, the body segmentation layer 423, and a hand segmentation layer.
  • the posture prediction model may include an image preprocessing module 427 configured to process the at least one local image of the human body.
  • the image preprocessing module 427 may process the at least one local image of the human body so that the sizes of the processed local images of the human body are consistent.
  • a preprocessing operation may include an image clipping operation, a width compression operation, a length compression operation, or the like.
  • the image preprocessing module 427 may be set after a feature extraction module 221 to process an obtained local feature sequence.
  • the processed local images of the human body with the same size may thus be obtained, which may be convenient for the subsequent analysis and processing of the local images of the human body; a processing result with a relatively consistent format may be obtained for comparison or further analysis of the processing result.
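  • A minimal preprocessing sketch (PyTorch; bilinear resizing is one illustrative choice among the clipping and compression operations mentioned above):

```python
import torch
import torch.nn.functional as F

def preprocess_local_images(local_images, size=(64, 48)):
    """Resize local images of the human body to a consistent size.

    local_images: list of (C, H, W) tensors of varying sizes.
    Returns one (N, C, size[0], size[1]) batch.
    """
    resized = [
        F.interpolate(img.unsqueeze(0), size=size, mode="bilinear",
                      align_corners=False)
        for img in local_images
    ]
    return torch.cat(resized, dim=0)

batch = preprocess_local_images(
    [torch.randn(3, 80, 40), torch.randn(3, 50, 60)])
```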
  • the posture prediction model may include at least one of a global feature extraction layer 431 or a local feature extraction layer 433.
  • the feature extraction module 221 may include the global feature extraction layer 431 and the local feature extraction layer 433.
  • the local feature extraction layer 433 may include a plurality of local feature extraction sub-layers.
  • operation 321 may include determining a feature image, and determining a feature sequence.
  • a human body image 410 may be processed based on the global feature extraction layer 431 and/or the local feature extraction layer 433 to obtain at least one feature image. Then, at least one feature sequence may be determined based on at least one feature image.
  • the feature image may represent human body features as an image.
  • the feature image may be a matrix of a specified size, such as a matrix with a size of W × H.
  • Each element of the matrix may reflect some features of the human body image.
  • the element may be a d-dimensional vector, wherein the elements of the feature image may be represented as feature points in the present disclosure.
  • a correspondence between dimension information of the feature image and the dimension information of the human body image may be determined based on processing of different channel information by the global feature extraction layer 431 and the local feature extraction layer 433.
  • the global feature extraction layer 431 and the local feature extraction layer 433 may involve a multi-dimensional convolution kernel, and the dimension information of the feature image may be different from the dimension information of the human body image. If the multi-dimensional convolution kernel is not involved in the global feature extraction layer 431 and the local feature extraction layer 433, the dimension information of the feature image may be the same as the dimension information of the human body image.
  • the global feature extraction layer 431 may be used to extract a global feature relationship from the human body image, and generate a corresponding feature image based on the global feature relationship.
  • the global feature extraction layer 431 may be implemented by a feature extraction algorithm and/or a neural network algorithm.
  • the local feature extraction layer 433 may be used to extract a local feature relationship of human body from the local image of the human body, and generate a corresponding local feature image based on the local feature relationship of the human body. As shown in FIG. 5, the local feature extraction layer 433 may include a plurality of sub-layers (also referred to as local feature extraction sub-layers) .
  • the setting of the sub-layers of the local feature extraction layer may be determined based on the setting of the local segmentation sub-layers.
  • the sub-layers of local feature extraction layer may be used to process the local image of the human body output by one local segmentation sub-layer.
  • sub-layers of the local feature extraction layer 433 may include a head feature extraction layer 4331, a body feature extraction layer 4333, and a leg feature extraction layer 4335.
  • sub-layer of the local feature extraction layer 433 may be used to process a corresponding local image of the human body to obtain a corresponding local feature image.
  • the head feature extraction layer 4331 may be used to obtain a head feature image based on a local image of the head.
  • the global feature extraction layer 431 and the local feature extraction layer 433 may adopt the same structure and parameters in whole or in part. In some embodiments, the global feature extraction layer 431 and local feature extraction layer 433 may adopt different structures and/or parameters.
  • a corresponding relationship may exist between the feature sequence and the feature image.
  • the feature image may be transformed into the feature sequence according to the corresponding relationship.
  • the feature sequence may include at least one of a global feature sequence and a local feature sequence.
  • the global feature image may be generated by extracting features from the human body image 410. As shown in FIG. 5, the feature image obtained after processing the human body image 410 by the global feature extraction layer 431 is the global feature image F0.
  • the global feature sequence may be a feature sequence used to represent image features of the human body image 410.
  • the global feature sequence S0 may be a sequence including W*H elements, and each element may be an element in a matrix of the feature image.
  • one or more elements in the global feature sequence S0 may reflect corresponding features of the human body image.
  • the elements in the global feature sequence S0 may correspond to the global feature image F0. For example, each element in the global feature sequence S0 may correspond to the information of a point and/or a lattice in the global feature image F0.
  • the posture prediction model 400 may include a sequence generating module 435 used to generate at least one feature sequence based on at least one feature image. For example, the global feature image F0 output by the global feature extraction layer 431 may be converted into the global feature sequence S0.
  • the sequence generating module 435 may generate the feature sequence by converting the information of the feature image into information of elements; that is, the sequence generating module 435 may convert the channel values in the feature image into element values to obtain the feature sequence. For example, a feature image with a size of W × H and dimension d may be transformed into a vector with W*H elements, each element being a d-dimensional vector.
  • the feature sequence of the human body may be represented as a matrix or in other forms of representation, and the specific form of the representation may be determined according to requirements of operation accuracy.
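  • A minimal sketch of this sequence generating operation (PyTorch; the shapes are illustrative): a feature image of size W × H with d-dimensional feature points becomes a feature sequence of W*H elements, each a d-dimensional vector.

```python
import torch

d, H, W = 128, 64, 48
feature_image = torch.randn(1, d, H, W)               # (batch, d, H, W)
feature_sequence = feature_image.flatten(2)           # (batch, d, H*W)
feature_sequence = feature_sequence.permute(0, 2, 1)  # (batch, W*H, d)
assert feature_sequence.shape == (1, W * H, d)
```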
  • the local feature image may be extracted from the local region image.
  • the local feature image may correspond to the setting of the local feature extraction layer.
  • the human body image 410 may be segmented by the head segmentation layer 421, the body segmentation layer 423, and the leg segmentation layer 425 to obtain a local image of the head, a local image of the body, and a local image of the leg.
  • the local image of the head, the local image of the body, and the local image of the leg may be processed by the head feature extraction layer 4331, the body feature extraction layer 4333, and the leg feature extraction layer 4335 to obtain a head feature image F1, a body feature image F2, and a leg feature image F3.
  • the local feature sequence may correspond to the local feature image.
  • the local feature sequence may be obtained based on the sequence generating module 435.
  • Specific conversion process may refer to the above-mentioned description of the sequence generating module 435.
  • the global feature sequence and local feature sequence in the posture prediction model may be generated by the same sequence generating module. In some embodiments, the global feature sequence and the local feature sequence may be generated by corresponding sequence generating modules, respectively.
  • Generating a feature sequence based on feature extraction from the human body image by the global feature extraction layer and/or local feature extraction layer may make image features more abundant, which may be conducive to more effective target recognition through the internal relationship of feature images or the feature sequences.
  • each feature image of the at least one human body feature image may include one or more channels, and the channels of the feature image may correspond to the channels of the human body image.
  • a feature element of the feature sequence may correspond to the channel information of a feature point of the feature image. For example, a feature image represented as an image with a size of W × H and dimension d may correspond to a feature sequence represented as a matrix with W × H × d elements, where each element in the matrix is a channel value of the corresponding position in the corresponding feature image, and the channel value may be a pixel value of the human body image under the corresponding channel.
  • the image attributes of the various human body images may thereby be represented by elements of the feature sequence, and it may be convenient to reflect a correlation relationship in the human body image by analyzing the correlation relationships between elements.
  • in order to reflect a correlation relationship between local images of the human body, a cross feature sequence may be determined to represent the correlation relationship between the local images of the human body.
  • the posture prediction model may include a cross sequence generating module 437.
  • the cross sequence generating module 437 may be used to process at least a portion of the at least one feature sequence to obtain a cross feature sequence.
  • each element in the cross feature sequence may correspond to a local feature image, reflecting the corresponding attributes of that local feature image.
  • an element of the cross feature sequence may be generated based on one of the at least a portion of the at least one feature sequence.
  • the cross feature sequence may be determined based on the local feature images. For example, each local feature image may be used as an element of the cross feature sequence.
  • the cross feature sequence may be generated based on a plurality of local feature sequences, and the correlation relationship between the local feature sequences may be represented by the cross feature sequence; thereby, an associated relationship between key points of the human body corresponding to each local feature sequence may be determined, and a specific position of occluded key points may be determined.
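  • A sketch of building such a cross feature sequence (PyTorch; pooling each local feature image into a single element is an illustrative assumption — the patent only requires one element per local feature image):

```python
import torch

d = 128
local_feature_images = [torch.randn(1, d, 16, 12),   # head, F1
                        torch.randn(1, d, 32, 24),   # body, F2
                        torch.randn(1, d, 32, 12)]   # leg,  F3

# Pool each local feature image to one d-dimensional element,
# then stack the elements into a 3-element cross feature sequence.
elements = [f.mean(dim=(2, 3)) for f in local_feature_images]  # (1, d)
cross_feature_sequence = torch.stack(elements, dim=1)          # (1, 3, d)
```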
  • the feature sequences (e.g., the global feature sequence, the local feature sequence, the cross feature sequence) determined in the present disclosure may not only reflect overall information of the human body image, but also information of local image.
  • the posture prediction methods and systems thereby may have better accuracy.
  • the feature extraction module may be trained with the posture prediction model, that is, a training target of the feature extraction module may be the same as the posture prediction model, so that the feature sequence obtained by the feature extraction module may have a higher representation ability in any posture prediction layer.
  • the feature representation information may be determined based on the at least one feature sequence.
  • operation 323 may be implemented by the relationship extraction module 223.
  • the feature representation information may be feature information including a relationship between elements of the feature sequence.
  • the feature representation information may correspond to the feature sequence.
  • the feature representation information may include global feature representation information, local feature representation information, and cross feature representation information.
  • the global feature representation information may correspond to the global feature sequence.
  • operation 323 may be implemented by a machine learning model with a sequence structure, such as an RNN model, an LSTM model, or any other neural network algorithm.
  • the sequence structure may be used to label the elements of the feature sequence to determine the correlation relationships between the elements.
  • operation 323 may also be implemented using a self-attention mechanism, as described in detail below.
  • a prediction result of a posture of the human body including an internal correlation relationship of the human body image may be obtained.
  • the key points of the human body may be predicted based on the internal correlation relationship, and the accuracy of posture prediction may be improved.
  • FIG. 6 of the present disclosure may provide a block diagram illustrating an exemplary relationship extraction module.
  • the posture prediction model may include at least one of a global relationship extraction layer, a local relationship extraction layer, or a cross relationship extraction layer.
  • the relationship extraction module 223 may be implemented by a global relationship extraction layer 441, a local relationship extraction layer 443, and a cross relationship extraction layer 445.
  • At least one of the global relationship extraction layer, the local relationship extraction layer, or the cross relationship extraction layer may be implemented by a convolutional neural network and/or a self-attention model including a sequence annotation structure.
  • the algorithm or model adopted by the global relationship extraction layer, the local relationship extraction layer, and the cross relationship extraction layer may be the same, and may share parameters.
  • the correlation relationships of the elements of the global feature sequence, the local feature sequences, and the cross feature sequence may be extracted, and the correlation relationship between key points of the whole human body, the correlation relationship between different key points in a same part of the human body, and the correlation relationship between different parts of the human body may be obtained. Based on these correlation relationships, the correlation degree of each key point and the degree of importance of a group of key points in the human body image may be shown, information of key points on different layers may be obtained, and the accuracy of predicting key points may be improved.
  • a process of extracting the feature representation information by the relationship extraction module may include determining a feature relationship matrix based on the feature sequence and determining the feature representation information based on the feature relationship matrix.
  • the layer structure of the global relationship extraction layer, the layer structure of the local relationship extraction layer, and/or the layer structure of the cross relationship extraction layer may include a self-attention layer and a feedforward neural network layer. The feedforward neural network layer may be used to determine the feature representation information based on the feature relationship matrix.
  • the feature relationship matrix may be a feature matrix representing a correlation relationship between various elements of the feature sequence.
  • the feature relationship matrix may represent an internal relational matrix to describe an internal dependency of the elements of the feature sequence. For example, if the feature sequence includes 4 elements, a corresponding feature relationship matrix may be a 4 × 4 matrix, and each value in the matrix may correspond to a correlation relationship between two elements. For example, the value at coordinates (3, 1) in the matrix may represent the correlation relationship between the third element and the first element in the feature sequence.
  • the feature relationship matrix may be classified based on the corresponding feature sequence. For example, the feature relationship matrix may be classified into a global feature relationship matrix, a local feature relational matrix, and a cross feature relationship matrix.
  • the relationship extraction layer may be constructed by the self-attention layer and the feedforward neural network layer, wherein the correlation relationship between the elements in the feature sequence may be encoded by the self-attention layer, and the feature representation information including the correlation relationship between the elements may be directly output by the feedforward neural network layer.
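  • The following minimal sketch (PyTorch; the single-head design and sizes are illustrative assumptions) shows such a relationship extraction layer: the self-attention layer produces the feature relationship matrix, and the feedforward neural network layer outputs the feature representation information.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationExtractionLayer(nn.Module):
    """Sketch of a relationship extraction layer: self-attention
    builds an N x N feature relationship matrix over the N elements
    of a feature sequence; a feedforward layer emits the feature
    representation information."""

    def __init__(self, d: int, d_ff: int = 256):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d))

    def forward(self, seq):  # seq: (batch, N, d)
        q, k, v = self.q(seq), self.k(seq), self.v(seq)
        # Feature relationship matrix: entries describe the degree of
        # correlation between pairs of elements of the sequence.
        rel = F.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5,
                        dim=-1)
        # Feedforward layer converts the attended features into the
        # feature representation information.
        return self.ffn(rel @ v), rel

layer = RelationExtractionLayer(d=128)
y, z = layer(torch.randn(1, 64 * 48, 128))  # z: (1, 3072, 3072)
```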
  • the global relationship extraction layer may be used to process the global feature sequence to obtain a global feature relationship matrix.
  • the input of the global relationship extraction layer may be a global feature sequence, and the output may be a global feature relationship matrix.
  • a size of the global feature relationship matrix may be related to the number (or count) of elements of the global feature sequence.
  • for example, if the global feature sequence includes W*H elements, the global feature relationship matrix may be a (W*H) × (W*H) matrix.
  • the correlation relationship between the global feature relationship matrix and the global feature sequence may refer to above-mentioned detailed description.
  • the output of the posture prediction model may be thermal diagrams of key points, and the internal correlation relationship within the human body image may be represented by a correlation relationship between key points.
  • since the global feature sequence is determined based on the human body image, which includes a plurality of key points, the attributes of the key points in the human body image may be represented by the global feature sequence; that is, the global feature relationship matrix may reflect the internal correlation relationship within the human body image according to the correlation relationship between key points.
  • the local relationship extraction layer may include a plurality of sub-layers to process local feature sequences.
  • the local relationship extraction layer may be used to process the local feature sequence to obtain a local feature relationship matrix.
  • the input of the local relationship extraction layer may be a corresponding local feature sequence, and the output may be a corresponding local feature relationship matrix.
  • the local feature relationship matrix may reflect an internal correlation relationship within a corresponding local image of the human body; it may be used to describe the correlation degree between elements of the corresponding local feature sequence.
  • the internal correlation relationship of a local image of the human body may be understood as a correlation relationship between key points of the local image of the human body. For example, the internal correlation relationship of the local image of the head may be understood as a correlation relationship between the nose, the left eye, the right eye, or any other key points.
  • the local relationship extraction layer may include a head relationship extraction layer 4431, a body relationship extraction layer 4433, and a leg relationship extraction layer 4435.
  • An input of the local relationship extraction layer may be a local feature sequence, an output may be a local feature relationship matrix, and the local feature relationship matrix may reflect an internal correlation relationship of a local image of the human body.
  • the head relationship extraction layer may be used to obtain a head feature relationship matrix from the head feature sequence; the head feature relationship matrix may reflect the correlation relationship between the elements of the head feature sequence, and since those elements reflect attributes of the local image of the head, the internal correlation relationship between the key points of the local image of the head may thereby be reflected.
  • the cross relationship extraction layer may be used to process the cross feature sequence to obtain a cross feature relationship matrix.
  • Input of the cross relationship extraction layer may be a cross feature sequence, and output of the cross relationship extraction layer may be a cross feature relationship matrix.
  • the cross feature relationship matrix may be used to describe the correlation degree between elements of the cross feature sequence; considering that each element of the cross feature sequence is actually a feature image of a local image of the human body, the internal correlation relationship described by the cross feature relationship matrix may be understood as a correlation relationship between the local images of the human body.
  • operation 323 may be implemented according to operations S3231 and S3233 described as follows:
  • At least one feature relationship matrix of the human body image may be determined.
  • the position information of elements may be used to represent positions of at least a portion of the elements of the at least one feature sequence within that feature sequence.
  • operation S3231 may be implemented by the self-attention layer of relationship extraction layer.
  • the position information of elements may be denoted by position encoding. That is, according to a position of an element in the feature sequence, a position code may be given.
  • a process of position encoding may be implemented outside the self-attention layer.
  • the process of position encoding may be implemented by the self-attention layer, that is, the self-attention layer may include a position encoder, the position encoder may be used to convert an inputted feature sequence to a feature sequence including the position information of elements.
  • the self-attention layer may include a self-attention network encoder (e.g., a transformer encoder) , and the feature relationship matrix may be constructed by the self-attention network encoder.
  • the global feature relationship matrix Z 0 may be a matrix of (W*H) ⁇ (W*H) .
  • for a local feature sequence with W’*H’ elements, a corresponding local feature relationship matrix Z i may be a (W’*H’) × (W’*H’) matrix.
  • the number of elements of the cross feature sequence may be 3,
  • and each element may be a local feature image;
  • accordingly, a corresponding cross feature relationship matrix Z 4 may be a 3 × 3 matrix.
  • the feature representation information may be determined. Operation S3233 may be implemented by the feedforward neural network layer.
  • the feedforward neural network layer may be used to convert the feature relationship matrix to the feature representation information.
  • An input of the feedforward neural network layer may be a feature relationship matrix
  • an output of the feedforward neural network layer may be the feature representation information.
  • the feature representation information obtained from the feedforward neural network layer may represent an internal correlation relationship of the feature relationship matrix corresponding to the feature representation information.
  • the global feature relationship matrix Z 0 may be converted to a global feature representation information Y 0 by the feedforward neural network layer of the global relationship extraction layer.
  • a process of determining position information of elements may be implemented by a sequence annotation structure
  • a process of determining a feature relationship matrix and determining feature representation information may be implemented by convolutional neural networks.
  • the feature representation information may include the global feature representation information representing the internal correlation relationship of the human body image, the local feature representation information representing the internal correlation relationship of the local image of human body, and the cross feature representation information representing the correlation relationship of the local images.
  • the human body image and the local image of human body may include a plurality of key points
  • the internal correlation relationship of images may be understood as a correlation relationship of the key points of the images.
  • the method for posture prediction of the present disclosure may not only recognize a posture based on the content of the human body image, but also reflect the correlation relationships of the key points. Therefore, when a portion of the key points of the human body image are blocked, the method for posture prediction of the present disclosure may fit the blocked key points based on the correlation relationships of the key points, so as to predict the positions of the blocked key points.
  • a prediction result of a posture of the human body may be determined.
  • the fusion output module 225 may implement operation 325.
  • the prediction result of the posture may be represented as a thermal diagram of key point of the human body.
  • the prediction result of the posture may include multiple thermal diagrams, each of which may correspond to one of the key points.
  • the value of a pixel of the thermal diagram may reflect a possibility that a key point appears at the position of the pixel.
  • the value of pixel of the thermal diagram may be represented as distribution value or thermal value.
  • the value of the pixel (distribution value) may be represented as a gray value in the thermal diagram.
  • a point with the highest distribution value (e.g., a point with the largest gray value) may be selected as a position of a key point.
  • a size of the thermal diagram may correspond to a size of the human body image, and a position of the key point in the human body image may be determined according to a position of the key point in the thermal diagram.
  • the size of the thermal diagram may be determined according to calculation accuracy and calculation amount, and may generally be set to 1/16 of the area of the human body image (that is, the width of the thermal diagram is 1/4 of the width of the human body image, and the height of the thermal diagram is 1/4 of the height of the human body image) .
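  • a minimal sketch of recovering one key point from such a thermal diagram, assuming the 1/4-per-dimension ratio above (sizes and names are illustrative):

```python
import numpy as np

img_w, img_h = 192, 256                            # human body image size
heatmap = np.random.rand(img_h // 4, img_w // 4)   # one key point's diagram

# The pixel with the highest distribution value is taken as the key point.
row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)

# Map back to human-body-image coordinates (each heatmap pixel covers 4x4).
x, y = col * 4, row * 4
print(x, y)
```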
  • the fusion output module 225 may be implemented by at least one of a convolutional network, an up-sampling layer, a sampling layer, a pooling layer, or the like.
  • the size of the feature representation information may need to be adjusted to the size of the thermal diagram, and the number of channels of the feature representation information may be adjusted to the number of the key points.
  • the fusion output module 225 may obtain fusion information of the feature representation information (also referred to as global feature representation fusion information) by fusing the feature representation information output by the relationship extraction module 223, and may output the prediction result of the posture of the human body based on the global feature representation fusion information.
  • the fusion output module 225 may be implemented by different neural network structures.
  • the fusion output module 225 may include a feature fusing module and an output layer.
  • the feature representation information (e.g., the global feature representation information and the local feature representation information) may be fused by the feature fusing module, and the prediction result of the posture of the human body may be output by the output layer based on the global feature representation fusion information.
  • the feature fusing module may be implemented by a convolutional neural network.
  • the output layer may be used to adjust the size and dimension of the global feature representation fusion information.
  • the output layer may implement a deconvolution operation, a convolution operation, or an up-sampling operation.
  • the global feature representation fusion information processed by the output layer may be used as the thermal diagrams, wherein the number of channels of the processed global feature representation fusion information may be the same as the number of the key points and the channels may correspond to the key points one by one, and the size of the processed global feature representation fusion information may be the same as the size of the thermal diagram; that is, the images of the channels of the processed global feature representation fusion information may be used as the thermal diagrams of the key points to obtain the thermal diagrams of the human body key points.
  • the feature representation information may be fused by the feature fusing module to obtain the global feature representation fusion information; a plurality of pieces of feature representation information may thus be embodied in the global feature representation fusion information, which may be advantageous for increasing the effective information of the thermal diagrams of the key points obtained from the global feature representation fusion information.
  • the sizes of the feature representation information may be normalized before feature fusion, and then the feature fusion may be performed.
  • the feature fusing module may include a normalization sub-module and a weighted summation sub-module.
  • the normalization sub-module may be used to convert the size of the other feature representation information into the size of the global feature representation information Y 0 , and specifically may include the following two aspects:
  • the normalization sub-module may be used to splice a plurality of local feature representation information (e.g., Y 1 , Y 2 , Y 3 ) to obtain fusion information Ylocal of local feature representation (also referred to as local feature representation fusion information) .
  • the local feature representation information may be spliced from top to bottom to obtain the local feature representation fusion information.
  • for example, head feature representation information Y 1 , body feature representation information Y 2 , and leg feature representation information Y 3 may each have a size of Wy × Hy
  • and a number of channels of d.
  • Y 1 , Y 2 , and Y 3 may be spliced in the height direction to obtain the local feature representation fusion information Ylocal with a size of Wy × 3Hy and d channels.
  • a result of the splicing (i.e., the local feature representation fusion information Ylocal) may be filled (padded) so that the size of the filled local feature representation fusion information Ylocal is the same as that of the global feature representation information Y 0 .
  • the normalization sub-module may be used to splice the cross feature representation information to obtain fusion information Ycross of cross feature representation (also referred to as cross feature representation fusion information) .
  • elements in the cross feature representation information may be spliced in a manner similar to or the same as that of the local feature representation fusion information Ylocal.
  • the weighted summation sub-module may be used to fuse the global feature representation information Y 0 , the local feature representation fusion information Ylocal, and the cross feature representation fusion information Ycross to obtain the global feature representation fusion information.
  • a process of fusing may be implemented according to weighted summation.
  • the weight values may be determined experimentally or from empirical values.
  • an optimal weight value may be determined by an optimization algorithm (e.g., an annealing algorithm, a genetic algorithm, etc. ) .
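  • a toy fusion sketch consistent with the description above (sizes, weights, and names are illustrative; the cross feature representation fusion information Ycross would be spliced, filled, and weighted analogously with a third weight):

```python
import numpy as np

Wy, Hy, d = 16, 16, 32                 # assumed size of Y1-Y3
W0, H0 = 16, 64                        # assumed size of Y0
Y0 = np.random.rand(H0, W0, d)
Y1, Y2, Y3 = (np.random.rand(Hy, Wy, d) for _ in range(3))

# Splice the local feature representation information from top to bottom.
Y_local = np.concatenate([Y1, Y2, Y3], axis=0)     # (3*Hy, Wy, d)

# Fill (zero-pad) so the spliced result matches Y0's size.
pad_h = H0 - Y_local.shape[0]
Y_local = np.pad(Y_local, ((0, pad_h), (0, 0), (0, 0)))

# Weighted summation; the weights might come from experiments or from an
# optimization algorithm, per the description above.
alpha, beta = 0.6, 0.4
Y_add = alpha * Y0 + beta * Y_local
print(Y_add.shape)                     # same size as Y0
```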
  • the present disclosure may fuse the feature representation information to output the prediction result of the posture of the human body, which may improve the amount of information of the output result and thus the accuracy of posture prediction. Further, when fusing the feature representation information, the weights may be adjusted according to an actual situation, thereby making the output result conform to the actual application scenario.
  • FIG. 7 is a diagram illustrating an exemplary posture prediction model according to other embodiments of the present disclosure. Components in FIG. 7 may be described elsewhere in the present disclosure. Combined with FIG. 7, a process for posture prediction may be further described according to the present disclosure.
  • an input module may obtain a human body image as an input of the posture prediction model.
  • the human body image may be segmented to obtain local images of the human body.
  • the posture prediction model may include a local segmentation layer, and the local segmentation layer may include a plurality of segmentation sub-layers for recognizing local images of the human body image by segmenting the human body image.
  • the local segmentation layer may include a head segmentation layer, a body segmentation layer and a leg segmentation layer.
  • Inputs of the head segmentation layer, the body segmentation layer, and the leg segmentation layer may be the human body image, and outputs of the head segmentation layer, the body segmentation layer, and the leg segmentation layer may be a local image of head, a local image of body and a local image of leg, respectively.
  • each segmentation sub-layer may include a preprocessing module to make the local images of the human body have the same size.
  • the posture prediction model may include a feature extraction module and a sequence generating module.
  • the feature extraction module may include a global feature extraction layer and a local feature extraction layer.
  • the global feature extraction layer of the feature extraction module may be used to process the human body image to obtain a global feature image F 0 .
  • the local feature extraction layer may include a plurality of extraction sub-layers. The plurality of extraction sub-layers may be used to process a local image of human body to obtain a corresponding local feature image.
  • a head feature extraction layer, a body feature extraction layer, a leg feature extraction layer may be respectively used to process a local image of head, a local image of body and a local image of leg to obtain a head feature image F 1 , a body feature image F 2 , and a leg feature image F 3 .
  • the feature images F 0 -F 3 may be recorded as a human body feature image.
  • the sequence generating module may be used to generate a feature sequence of the human body according to the feature images (e.g., the global feature image and the local feature images) .
  • the posture prediction model may include a cross sequence generating module that is used to generate a cross feature sequence based on the local feature images.
  • the posture prediction model may include a global relationship extraction layer, a local relationship extraction layer, and a cross relationship extraction layer.
  • the local relationship extraction layer may include a plurality of sub-layers.
  • the local feature sequences may be inputted into the plurality of sub-layers of the local relationship extraction layer; for example, each local feature sequence may be inputted into a corresponding one of the plurality of sub-layers.
  • the relationship extraction layer may include a self-attention layer and a feedforward neural network layer.
  • An input of the self-attention layer may be S i +PE i
  • an output of the self-attention layer may be a feature relationship matrix Z i , wherein the PE i may be a position code of the feature sequence S i .
  • An input of the feedforward neural network layer may be the feature relationship matrix Z i , and an output of the feedforward neural network layer may be the feature representation information Y i .
  • the posture prediction model may include a feature fusing module.
  • an output of the feature fusing module may be the global feature representation fusion information Y add .
  • the feature fusing module may further include a normalization sub-module to adjust the data size and a weighted summation sub-module to fuse the data.
  • an output of the normalization sub-module may be local feature representation fusion information Y local .
  • an output of the normalization sub-module may be cross feature representation fusion information Y cross .
  • an input of the weighted summation sub-module is the global feature representation information Y 0
  • the local feature representation fusion information Y local and the cross feature representation fusion information Y cross
  • an output of the weighted summation sub-module may be global feature representation fusion information Y add .
  • the posture prediction model may include an output layer, and the output layer may be used to change size and dimension of the global feature representation fusion information Y add to obtain a group of thermal diagrams.
  • An input of the output layer may be the global feature representation fusion information Y add , and an output may be the thermal diagrams.
  • the thermal diagrams may have the same size.
  • the number of the thermal diagrams may be the same as the number of the human body key points.
  • each of the thermal diagrams may correspond to one of the key points of the human body.
  • a thermal diagram corresponding to a key point may represent a distribution of the key points, wherein a thermal value of a pixel (i.e., a pixel value) of the thermal diagram may be a distribution value of the key point and the distribution value may reflect a possibility that the key point appears at a position of the pixel.
  • the value of each pixel (i.e., the pixel value) of the thermal diagram may represent the distribution value or the thermal value.
  • the value of the pixel (distribution value) may be represented as a gray value in the thermal diagram.
  • a point with the highest distribution value in the thermal diagram may be selected as a position of a key point to obtain the prediction result of posture of the human body.
  • the posture prediction model may segment the human body image into a local image of the human body to reduce interference of the background information on a prediction result, and improve the accuracy of prediction.
  • the feature sequence may be introduced as an intermediate variable to avoid analyzing the dependent relationship based on the image, reducing the amount of calculation in subsequent operations.
  • the dependent relationships of locations (e.g., a global dependent relationship and a local dependent relationship) may be obtained from the feature sequences.
  • when the human body is covered, a covered human body key point may be fitted according to the relationships between other human body key points and the covered key point.
  • the posture prediction model may be a machine learning model and may be obtained based on end-to-end training.
  • FIG. 8 shows a flowchart illustrating an exemplary process for training a posture prediction model according to some embodiments of the present disclosure.
  • a training sample may be obtained.
  • the human body image may be similar to or the same as the human body image as described in FIG. 3.
  • the human body image may include human body image frames (bounding boxes) containing the human body to be recognized.
  • operation 810 may obtain a human body image as a sample image from a disclosed data set (COCO, MPII, CrowdHuman, etc. ) .
  • multiple training samples may be obtained through an image collecting device 160.
  • an image may include not only the human body to be recognized but also other content (e.g., background).
  • the sample image of the human body may need to be extracted from the image by an object detection algorithm after the image is obtained.
  • a specific object detection algorithm may refer to the description of operation 310.
  • the sample image may be enhanced.
  • a great number of training samples may be needed to achieve posture prediction under occlusion.
  • the sample images in the disclosed data sets may not fully cover situations in which recognition interference exists in the human body (for example, the human body is blocked); in some embodiments, the sample images may be pre-enhanced to increase the accuracy of the posture prediction model.
  • the sample image may be enhanced by processing the sample image without damaging the spatial relationship between the key points of the human body represented in the sample image.
  • the enhancement of the sample image may be implemented by using the data enhancement tool Albumentations, and may include operations such as random blurring, hue/saturation/value adjustment, color space conversion, and adaptive histogram equalization, as detailed in the data expansion methods described below.
  • a preliminary posture prediction model may be obtained.
  • the preliminary posture prediction model may be a previous posture prediction model for posture prediction that has been trained using a training set that is different from the training samples of the posture prediction model.
  • the preliminary posture prediction model may be a machine learning model that has not been trained.
  • the preliminary posture prediction model may be set referring to the description of the posture prediction model.
  • some parts of the preliminary posture prediction model may adopt a well-trained module.
  • a local segmentation layer and a feature extraction layer of the preliminary posture prediction model may use well-trained algorithms or networks.
  • the posture prediction model may be trained as a whole, that is, the preliminary posture prediction model may not include well-trained modules, to avoid the deficiencies of a multi-module design.
  • the multiple modules may have different training targets, and an objective function of a module may deviate from the macro target of the trained system (i.e., the posture prediction model), which may make it difficult for the trained system to achieve optimal performance and may cause error accumulation.
  • a posture prediction model may be obtained by iterating parameters of the preliminary posture prediction model based on a loss function.
  • operation 840 may iterate the parameters of the preliminary posture prediction model based on the loss function, and a preliminary posture prediction model meeting output conditions may be used as a well-trained posture prediction model. For example, positions of the key points of the human body in the sample image may be labeled in advance. When training, a prediction result of the preliminary posture prediction model may be compared with the label to obtain the value of the loss function.
  • the loss function applied in operation 840 may be described as follows:
  • a value of the loss function may be an arithmetic mean of a square of position differences between the labels and the prediction results of the key points.
  • a position difference may be a difference between a prediction position (i.e., the prediction result) and an actual position (i.e., the label) of a key point.
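  • as a sketch consistent with the two items above (the symbols K, p k , and p̂ k are illustrative, not the disclosure's notation):

$$\mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} \left\| \hat{p}_k - p_k \right\|^2$$

where K is the number of labeled key points, and $\hat{p}_k$ and $p_k$ are the predicted and labeled positions of the k-th key point, respectively.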
  • an overall packaged training of the posture prediction model may be achieved, the complexity of the project may be reduced, and the macroscopic target deviation caused by per-module training targets in a multi-module design may be avoided, thereby improving the overall performance of the system for posture prediction.
  • the training samples may better reflect occlusion of the human body, and then recognition ability of the posture prediction model to an occluded human body may be strengthened.
  • a feature extraction module 221 may be represented by a CNN feature extraction module.
  • a relationship extraction module 223 may be represented by the Transformer module.
  • a fusion output module 225 may be represented by a prediction module.
  • the internal correlation may be a dependency, wherein the dependency may include a dependency between feature points in the feature sequence, and the feature point may be an element in a feature sequence.
  • an element of a feature sequence may itself correspond to a feature image (e.g., in the cross feature sequence); accordingly, the dependency may also include a dependency between at least part of the feature sequences.
  • the cross sequence generation module may be implemented by a leveling operation (Flatten operation) .
  • in the training of the posture prediction model, the human body image sample may be represented by a first image comprising the human body.
  • a block diagram (bounding box) comprising a human body to be identified may be referred to as a human body image frame.
  • data enhancement may also be understood as data expansion operations.
  • FIG. 9 is a flowchart illustrating an exemplary method for posture prediction according to some embodiments of the present disclosure, including:
  • features of the human body image may be extracted by the feature extraction module to obtain an overall feature image and a plurality of local feature images.
  • the human body image may be a human body image for training or a human body image frame extracted from a video image frame.
  • the feature extraction module is used to extract the features of the human body image to obtain the overall feature image corresponding to the human body image and a plurality of local feature images corresponding to a plurality of local regions on the human body image.
  • the feature extraction module includes multi-layer convolution kernels; after the human body image is input to the feature extraction module, the overall feature image is output after multi-layer convolution, and features of different parts of the same human body image are also extracted.
  • the key points of the head area include the nose, the left eye, the right eye, the left ear, and the right ear.
  • the key points of the torso area include the left shoulder, the right shoulder, the left elbow, the right elbow, the left wrist, and the right wrist;
  • the key points of the leg area include the left hip, the right hip, the left knee, the right knee, the left ankle, and the right ankle.
  • the local images corresponding to the head region, the torso region, and the leg region are input to the feature extraction module, which outputs the corresponding local feature images.
  • S902: constructing a plurality of feature sequences based on the overall feature image and a plurality of local feature images.
  • the size and number of channels of the overall feature image are obtained. Assuming that the width of the overall feature image is W1, the height of the overall feature image is H1 and the number of channels of the overall feature image is d, the feature sequence S1 composed of W1*H1 d-dimensional feature points is directly constructed based on the overall feature image. Obtain the size and number of channels of the local feature image. Assuming that the width of the local feature image is W2, the height of the local feature image is H2 and the number of channels of the local feature image is d, the feature sequence Sn composed of W2*H2 d-dimensional feature points is directly constructed based on the local feature image. The overall feature image and local feature image pass through the same feature extraction module, and the number of channels is the same.
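  • a sketch of this construction under assumed toy sizes (names follow the text; values are illustrative):

```python
import numpy as np

W1, H1, W2, H2, d = 8, 8, 4, 4, 32
overall = np.random.rand(W1, H1, d)    # stands in for the overall feature image
local = np.random.rand(W2, H2, d)      # stands in for one local feature image

S1 = overall.reshape(W1 * H1, d)       # W1*H1 d-dimensional feature points
Sn = local.reshape(W2 * H2, d)         # W2*H2 d-dimensional feature points

# Both images pass through the same feature extraction module, so the
# feature-point dimension d matches across the sequences.
assert S1.shape[1] == Sn.shape[1] == d
```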
  • S903: using the Transformer module to perform relationship extraction on the feature sequences to obtain a dependency between the feature points of each feature sequence and a dependency between at least a portion of the feature sequences.
  • the Transformer module is used to perform relationship extraction on the feature sequences, thereby acquiring the dependencies between the feature points in each feature sequence; that is, the feature sequences S1-SN+1 are respectively input to the Transformer module to obtain the dependencies between the feature points of each of S1-SN+1.
  • the feature sequence S1 is constructed based on the overall feature image.
  • the dependency between the feature points in the feature sequence S1 is the dependency between the pixels in the overall feature image, and the dependency between the pixels in the overall feature image includes the dependency between the key points in the overall feature image.
  • the feature sequence S2-SN is constructed based on the local feature image.
  • the dependency between the feature points in the feature sequence S2-SN is the dependency between the pixels in local feature image, and the dependency between the pixels in the local feature image includes the dependency between the key points in the local feature image.
  • the feature sequence SN+1 is constructed based on multiple local feature images, and the feature points in the feature sequence SN+1 correspond to the local feature images one by one, so the dependency between the feature points in the feature sequence SN+1 is the dependency between the local feature images, that is, the dependency between the local feature sequences.
  • all feature sequences are fused to obtain a fusion result, and the fusion result is converted to thermal diagrams using the prediction module to obtain a prediction result of the posture of the human body.
  • the position of the occluded key points can be predicted based on the dependency relationship, to obtain the prediction results of posture in the occluded scene, improve the human posture prediction performance of various scenes, and make the posture prediction results more accurate.
  • the initial prediction results can be modified based on the dependency relationship to obtain more accurate posture prediction results.
  • feature extraction may be performed on the human body image to obtain the overall feature image and a plurality of local feature images, and then a plurality of feature sequences are constructed and generated.
  • the Transformer module may be used to extract relationships from the feature sequences to obtain the dependency relationships between the feature points in each feature sequence and the dependency relationships between at least some feature sequences, so as to obtain the dependencies between the key points in the human body image.
  • FIG. 10 is a flowchart illustrating another exemplary method for posture prediction of the present disclosure, including:
  • features of the human body image are extracted by the feature extraction module to obtain an overall feature image and a plurality of local feature images.
  • FIG. 11 is a topology schematic diagram illustrating another exemplary method for posture prediction according to the present disclosure; division of the human body image into three local images is taken as an example, and other embodiments are not limited thereto.
  • the feature extraction module is used to perform overall feature extraction on the human body image to obtain an overall feature image.
  • the target detection module is used to extract local images corresponding to a plurality of preset regions on the human body image.
  • the local image is locally extracted by the feature extraction module to obtain a plurality of local feature images.
  • the feature extraction module may be a convolutional neural network (CNN) module.
  • the feature extraction module includes a series of "convolution + BN normalization + ReLU activation" operations (a minimal sketch appears below).
  • after the human body image is input to the feature extraction module, the overall feature image is obtained; the overall feature image may be regarded as a W1 × H1 × d feature matrix F0, where W1 is the width of the overall feature image, H1 is the height of the overall feature image, and d is the number of channels of the overall feature image.
  • the overall feature image may provide raw data for the subsequent Transformer modules and a basis for obtaining the dependency between pixels in the overall feature image.
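  • a minimal PyTorch sketch of such a "convolution + BN + ReLU" extractor (layer sizes and the stride schedule are assumptions, not the disclosed network):

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Sequential):
    def __init__(self, c_in, c_out, stride=1):
        super().__init__(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

# A small stack standing in for the feature extraction module; it maps a
# human body image to a feature matrix F0 with d = 64 channels here.
feature_extractor = nn.Sequential(
    ConvBNReLU(3, 32, stride=2),
    ConvBNReLU(32, 64, stride=2),
)

image = torch.randn(1, 3, 256, 192)    # N x C x H x W human body image
F0 = feature_extractor(image)
print(F0.shape)                        # torch.Size([1, 64, 64, 48])
```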
  • the human body image is input to the target detection module; the head area, the trunk area, and the leg area of the input body image are respectively detected by the target detection module, and target detection boxes B_head, B_trunk, and B_legs are obtained.
  • detecting the local areas on the human body provides raw data for the subsequent Transformer modules, and provides the basis for obtaining the dependencies between the pixel points within each local feature image and between the local feature images.
  • after B_head, B_trunk, and B_legs are obtained, these target detection frames are preprocessed so that the sizes of B_head, B_trunk, and B_legs are consistent; the preprocessed B_head, B_trunk, and B_legs then pass through the CNN feature extraction operation consistent with the above, and multiple local feature images are obtained, each of which may be regarded as a W2 × H2 × d feature matrix (F1, F2, F3).
  • W2 is the width of the local feature image
  • H2 is the height of the local feature image
  • d is the number of channels of the local feature image.
  • S1002: a first feature sequence consisting of a plurality of feature points is constructed based on the overall feature image, and second feature sequences consisting of a plurality of feature points are constructed based on the local feature images.
  • the first feature sequence S0 composed of W1*H1 d-dimensional feature points is constructed based on F0; that is, the length of S0 is W1*H1,
  • and each element of S0 is a d-dimensional vector.
  • the second feature sequences S1, S2, and S3 composed of W2*H2 d-dimensional feature points are constructed respectively; that is, the lengths of S1, S2, and S3 are W2*H2, and each element is also a d-dimensional vector.
  • S1003: all of the local feature images are flattened, and a third feature sequence consisting of feature points corresponding one-to-one to the flattened local feature images is constructed based on the flattened local feature images.
  • a third feature sequence S4 composed of three (W2*H2*d) -dimensional feature vectors is constructed; that is, the length of S4 is 3, in which each element (an element refers to the result of flattening F1, F2, or F3) is a (W2*H2*d) -dimensional vector. Therefore, each feature point corresponds to a local feature image one by one.
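  • a sketch of this flattening under toy sizes (names follow the text):

```python
import numpy as np

W2, H2, d = 4, 4, 32
F1, F2, F3 = (np.random.rand(W2, H2, d) for _ in range(3))

# Each flattened local feature image becomes one (W2*H2*d)-dimensional
# feature point of the third feature sequence S4, whose length is 3.
S4 = np.stack([F.reshape(-1) for F in (F1, F2, F3)])
print(S4.shape)                        # (3, W2*H2*d) == (3, 512)
```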
  • the CNN feature extraction module can not only down-sample the input image, reducing the scale of the feature image and speeding up processing, but also extract deeper, implicit image information. The pixels in the feature image are then arranged in turn by constructing the feature sequence, and the feature sequence is input to the subsequent Transformer module, which may more efficiently obtain the dependencies between feature points in different feature sequences.
  • S1004: using the Transformer module to perform relationship extraction on the feature sequences to obtain a dependency between the feature points of each feature sequence and a dependency between at least a portion of the feature sequences.
  • relationship extraction is performed on the first feature sequence, the second feature sequences, and the third feature sequence by the Transformer module, to obtain a first dependency between the feature points in the first feature sequence and a second dependency between the feature points in each second feature sequence.
  • the feature points in the third feature sequence correspond to the local feature images one by one; therefore, a third dependency between the feature points in the third feature sequence is the dependency between the local feature images.
  • the global Transformer module branch is used to extract the global dependencies between feature points of F0, that is, the first dependency.
  • the cross Transformer module branch is used to extract the dependencies among F1, F2, and F3, that is, the third dependency.
  • S0 may be taken as the input of the global transformer module branch, and S1, S2, and S3 are taken as the input of the three local transformer module branches.
  • the global Transformer module branch can extract the global dependency between feature points of F0, that is, the dependency between each d-dimensional feature point of the W1 × H1 × d feature image and the other feature points in the feature image.
  • the global dependency is characterized by a (W1*H1) × (W1*H1) matrix A0.
  • the local Transformer module branches can extract the local dependencies between feature points of the local feature images (F1, F2, F3), that is, the dependency between each d-dimensional feature point of a W2 × H2 × d feature image and the other feature points in that feature image.
  • the cross transformer module branch can extract the dependency between the three local regions F1, F2, and F3, that is, the dependency between each (W2*H2*d) dimension feature point and other feature points in the 3*(W2*H2*d) dimension feature image.
  • the cross dependency is characterized by a 3 × 3 matrix A4.
  • the first dependency corresponding to the first feature sequence S0 is the dependency between pixels/key points on the human body image
  • the second dependency corresponding to the second feature sequences S1, S2 and S3 is the dependency between pixels/key points on local images in different regions on the human body image
  • the third dependency corresponding to the third feature sequence S4 is the dependency between local images in different regions on the human body image; all dependencies between different pixel points/key points in different regions on the human body image are thereby obtained, and for scenes in which some pixel points/key points on the human body image are occluded, the accuracy and robustness of human posture prediction may be improved based on all the above dependencies.
  • before relationship extraction, position codes corresponding to the feature points in each feature sequence may be fused into the first feature sequence, the second feature sequences, and the third feature sequence, respectively.
  • the positional coding vector is generated based on the position of the feature point in the corresponding feature sequence.
  • the Transformer module then performs relationship extraction on the first feature sequence, the second feature sequences, and the third feature sequence into which the position codes have been fused, respectively.
  • FIG. 12 is a schematic diagram of an implementation of the topology corresponding to operation S1004 in the first embodiment of the present disclosure, including a self-attention layer (Self-Attention) and a feedforward neural network layer.
  • the location coding uses SIN-COS rules, and the specific calculation formula is as follows:
  • PE (pos, 2i) = sin (pos/10000^ (2i/d) ) (1)
  • PE (pos, 2i+1) = cos (pos/10000^ (2i/d) ) (2)
  • where pos represents the position of an element in the feature sequence, 2i and 2i+1 indicate the dimension of the position encoding, and the value of i may be in [0, d/2] ; even dimensions use the sine encoding, and odd dimensions use the cosine encoding.
  • by adding the location encoding, the location information of each feature point is introduced, and the dependencies between different feature points may be obtained more efficiently.
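  • a sketch of formulas (1) and (2) with toy length and dimension (the 10000 base follows the standard SIN-COS rule assumed in the reconstruction above):

```python
import numpy as np

def position_encoding(length, d):
    pe = np.zeros((length, d))
    pos = np.arange(length)[:, None]               # element positions
    i = np.arange(0, d, 2)[None, :]                # encoding dimensions
    pe[:, 0::2] = np.sin(pos / 10000 ** (i / d))   # even dimensions: sine
    pe[:, 1::2] = np.cos(pos / 10000 ** (i / d))   # odd dimensions: cosine
    return pe

PE = position_encoding(length=12, d=32)
print(PE.shape)                                    # (12, 32)
```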
  • V = (S i +PE i ) W v (5) , where W v is a weight matrix; Q and K may be obtained from S i +PE i with weight matrices W q and W k in the same manner.
  • softmax then converts the attention scores into a probability distribution, and the dependencies between different elements are obtained through the softmax conversion.
  • the output of the self-attention layer may then be obtained from the probability distribution and V, and is input to the feedforward neural network; the dimension of the output of the Transformer module is consistent with the dimension of the input feature sequence.
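  • a hedged sketch of one such layer in the standard Transformer formulation (weight shapes and sizes are assumptions; the text abbreviates the feedforward step as converting Z i to Y i , while the standard formulation applies it to Z i V):

```python
import torch

L, d = 12, 32                          # sequence length, feature dimension
S_plus_PE = torch.randn(L, d)          # stands in for S_i + PE_i

W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
Q, K, V = S_plus_PE @ W_q, S_plus_PE @ W_k, S_plus_PE @ W_v

# Softmax turns pairwise scores into a probability distribution per row;
# Z describes the correlation degree between sequence elements.
Z = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)      # (L, L)

ffn = torch.nn.Sequential(
    torch.nn.Linear(d, 4 * d), torch.nn.ReLU(), torch.nn.Linear(4 * d, d)
)
Y = ffn(Z @ V)                         # (L, d): matches the input dimension
print(Z.shape, Y.shape)
```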
  • the dimension of Y0 is W1 ⁇ H1 ⁇ d.
  • the dimensions of Y1, Y2 and Y3 are W2 ⁇ H2 ⁇ d.
  • the dimension of Y4 is 3*(W2*H2*d) .
  • the weight values assigned to the different feature sequences in advance are obtained; that is, different weight values α, β, and γ are set for Y0, Y concate , and Y cross .
  • the feature sequences after size unification are weighted and summed, thereby obtaining a fusion feature.
  • the above process is expressed as follows:
  • Y add = α Y0 + β Y concate + γ Y cross (8)
  • α is the weight value of the first feature sequence (β and γ are the weight values for the second and third feature sequences, respectively),
  • Y0 is the output of the first feature sequence after passing through the Transformer module,
  • Y concate is the output of the second feature sequences synthesized (spliced) after passing through the Transformer module,
  • Y cross is the output obtained after the third feature sequence is converted by the Transformer module.
  • the importance of different branches is represented by different weight values; after the weight parameters are applied, Y0, Y concate , and Y cross are added at the feature point level (an add operation) to obtain the fusion feature image Y add , and Y add is then input to the prediction module to estimate the positions of human key points, so as to obtain the prediction result of the posture of the human body.
  • the fusion feature is converted to corresponding thermal diagrams using the prediction module, and the positions of the key points in the human body image are labeled on the thermal diagrams based on the dependencies to obtain the posture prediction of the human image.
  • the prediction module may be a Head module, which may mainly include a shape operation and a 1 × 1 convolution operation, where the purpose of the shape operation is to convert a size of W × H to the thermal diagram size of W heat × H heat ; the shape operation can be a convolution or a deconvolution.
  • the specific operation depends on the size of the thermal diagram to be estimated. Assuming that K key points in the human body image need to be estimated, after the Head module, the positions of the predicted K key points are marked on the thermal diagrams to obtain the final human body key point prediction result.
  • the size of the fusion feature image is converted by the prediction module to match the final thermal diagram size to be predicted, and the positions of the key points are marked on the thermal diagrams based on the dependencies between the key points. For human posture estimation in occluded scenes, more accurate positions of the key points may thereby be obtained.
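  • an illustrative Head module sketch (all sizes assumed): a deconvolution enlarges the fused feature image to the target thermal diagram size and a 1 × 1 convolution maps it to K channels, one per key point:

```python
import torch
import torch.nn as nn

K, d = 17, 64                          # e.g., the 17 key points listed above
head = nn.Sequential(
    nn.ConvTranspose2d(d, d, kernel_size=4, stride=2, padding=1),  # shape op
    nn.ReLU(inplace=True),
    nn.Conv2d(d, K, kernel_size=1),    # 1x1 convolution to K thermal diagrams
)

Y_add = torch.randn(1, d, 32, 24)      # fused feature image (toy size)
heatmaps = head(Y_add)
print(heatmaps.shape)                  # torch.Size([1, 17, 64, 48])
```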
  • the method for posture prediction constructs the feature extraction module, the Transformer module, and the prediction module; constructs the first feature sequence composed of multiple feature points based on the overall feature image; constructs the second feature sequences composed of multiple feature points based on the local feature images; constructs a third feature sequence composed of feature points corresponding one-to-one to the local feature images after the leveling (flatten) operation; extracts the dependencies of feature points in the feature sequences through the Transformer module, thereby obtaining the dependencies between pixel points/key points on the human body image, the dependencies between pixels/key points in local images in different regions on the human body image, and the dependencies between local images in different regions on the human body image; and uses the prediction module to predict the posture, so as to improve the accuracy and robustness of posture prediction.
  • FIG. 13 is a flowchart illustrating exemplary training method for posture prediction according to some embodiments of the present disclosure, including:
  • S1301: feature extraction is performed on the human body image using the feature extraction module to obtain an overall feature image and a plurality of local feature images.
  • S1302: a plurality of feature sequences are constructed based on the overall feature image and the plurality of local feature images.
  • S1303: relationship extraction is performed on the feature sequences by the Transformer module to obtain a dependency between the feature points in each feature sequence and a dependency between at least part of the feature sequences.
  • after the prediction result is obtained, based on the size of the thermal diagram corresponding to the prediction result, the original human body image is converted to the same size as the thermal diagram, and the loss between the prediction result and the actual result is calculated. Based on the losses continuously obtained in the training stage, the feature extraction module, the Transformer module, and the prediction module are iteratively optimized, so that more accurate prediction results are obtained after continuous optimization.
  • the size of the human body image is converted into the same size as the thermal diagram corresponding to the prediction result to obtain the actual result of the posture in the human body image; the loss function module may be used to calculate the loss between the prediction result and the actual result; based on the loss, the parameters in the feature extraction module, the Transformer module, and the prediction module are iteratively optimized.
  • the thermal diagram with the size of W heat × H heat corresponding to the prediction result is obtained, and the size of the human body image is also converted to W heat × H heat by using an existing image pixel conversion method, so that the sizes of the prediction result and the actual result are the same before calculating the loss; the loss between the prediction result and the actual result is then calculated by using the loss function module.
  • the above process is expressed by the following formula:
  • Loss = (1/J) Σ (ŷ j − y j )^2 , where y j and ŷ j are the actual and predicted results of the j-th key point, respectively, and the final loss function is the average (mean) loss over the J human body key points.
  • the step of using the feature extraction module to extract the human body image to obtain an overall feature image and a plurality of local feature images may include: in response to obtaining the first image containing the human body, using the target detection module to extract the human body image frame, and performing data expansion operations on the human body image frame to obtain a plurality of human body images corresponding to the same human body image frame.
  • ways of obtaining a first image containing the human body may include screening from a public data set (COCO, MPII, CrowdHuman, or the like) or manual collection.
  • the target detection model is used to detect the human body in the data and extract the human body image frame; the target detection model here includes, but is not limited to, the YoloV3 model.
  • data expansion is performed on the human body frame samples to extend the preliminary data samples, so that the human body images used are abundant.
  • the data expansion of the human frame samples is realized by using the Albumentations data expansion tool.
  • the specific data expansion methods include one or more of: blurring the human frame samples with a random blur kernel; adjusting the hue, saturation, and value parameters of the human frame samples to realize sample transformation; converting the human frame samples from the RGB color space to another color space, increasing or reducing color parameters, and returning to the RGB color space; and enhancing the input human frame samples by adaptive histogram equalization.
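  • a hedged Albumentations sketch of such a pipeline (the transforms and probabilities are illustrative choices, not the disclosed settings):

```python
import albumentations as A

augment = A.Compose(
    [
        A.Blur(blur_limit=7, p=0.5),       # random blurring
        A.HueSaturationValue(p=0.5),       # hue/saturation/value adjustment
        A.RGBShift(p=0.5),                 # color-parameter shift
        A.CLAHE(p=0.5),                    # adaptive histogram equalization
    ],
    # Keep key point labels consistent with the transformed image.
    keypoint_params=A.KeypointParams(format="xy"),
)

# usage: sample = augment(image=image, keypoints=keypoints)
```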
  • the prediction results of multiple human body images generated from the same first image are compared, and the loss between the prediction results corresponding to human body images with different blur degrees is calculated, to iteratively optimize the feature extraction module, the Transformer module, and the prediction module, making the prediction results of the model for the same image with different blur degrees close to each other and improving the accuracy of prediction.
  • in the above scheme, feature extraction is performed by the feature extraction module to obtain an overall feature image and a plurality of local feature images, a plurality of feature sequences are generated, and relationship extraction is performed on the feature sequences by the Transformer module to obtain the dependencies.
  • the electronic device 1400 includes a storage 1401 and a processor 1402 coupled to each other, wherein the storage 1401 stores program data (not shown) ,
  • the processor 1402 may call the program data to realize the human posture prediction method or human posture prediction model training method in any of the above embodiments.
  • for relevant contents, please refer to the detailed description of the above method embodiments, which will not be repeated here.
  • FIG. 15 is a structural diagram illustrating an exemplary computer-readable storage medium according to some embodiments of the present disclosure.
  • the computer-readable storage medium 1500 stores program data 1501.
  • when the program data 1501 is executed, the human posture prediction method or the human posture prediction model training method in any of the above embodiments is realized; for the description of relevant contents, please refer to the detailed description of the above method embodiments, which will not be repeated here.
  • the numbers expressing quantities of ingredients, properties, and so forth, used to describe and claim certain embodiments of the application are to be understood as being modified in some instances by the term “about,” “approximate,” or “substantially”. Unless otherwise stated, “about,” “approximate,” or “substantially” may indicate ±20% variation of the value it describes. Accordingly, in some embodiments, the numerical parameters set forth in the description and attached claims are approximations that may vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should consider specified significant digits and adopt ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters configured to illustrate the broad scope of some embodiments of the present disclosure are approximations, the numerical values in specific examples may be as accurate as possible within a practical scope.

PCT/CN2021/128377 2021-05-24 2021-11-03 Methods and systems for posture prediction WO2022247147A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110567479.7 2021-05-24
CN202110567479.7A CN113486708B (zh) 2021-05-24 Human body posture estimation method, model training method, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022247147A1 (en)

Family

ID=77933029

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/128377 WO2022247147A1 (en) 2021-05-24 2021-11-03 Methods and systems for posture prediction

Country Status (2)

Country Link
CN (1) CN113486708B (zh)
WO (1) WO2022247147A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486708B (zh) * 2021-05-24 2022-03-25 Zhejiang Dahua Technology Co., Ltd. Human body posture estimation method, model training method, electronic device, and storage medium
CN113673489B (zh) * 2021-10-21 2022-04-08 Zhejiang Lab Video group behavior recognition method based on cascaded Transformer
CN114550305B (zh) * 2022-03-04 2022-10-18 Hefei University of Technology Transformer-based human body posture estimation method and system
CN116524548B (zh) * 2023-07-03 2023-12-26 Institute of Automation, Chinese Academy of Sciences Blood vessel structure information extraction method, device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017116403A (ja) * 2015-12-24 2017-06-29 Toyota Motor Corp. Posture estimation device, posture estimation method, and program
JP2020052867A (ja) * 2018-09-28 2020-04-02 Axive Co., Ltd. Posture analysis device, posture analysis method, and program
CN111480178A (zh) * 2017-12-14 2020-07-31 Fujitsu Ltd. Technique recognition program, technique recognition method, and technique recognition system
WO2021096669A1 (en) * 2019-11-15 2021-05-20 Microsoft Technology Licensing, Llc Assessing a pose-based sport
CN113486708A (zh) * 2021-05-24 2021-10-08 Zhejiang Dahua Technology Co., Ltd. Human body posture estimation method, model training method, electronic device, and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8861870B2 (en) * 2011-02-25 2014-10-14 Microsoft Corporation Image labeling with global parameters
CN108573231B (zh) * 2018-04-17 2021-08-31 Civil Aviation University of China Human behavior recognition method based on depth motion maps generated from motion history point clouds
CN110532873A (zh) * 2019-07-24 2019-12-03 Xi'an Jiaotong University Deep network learning method combining human body detection and posture estimation
CN110781736A (zh) * 2019-09-19 2020-02-11 Hangzhou Dianzi University Pedestrian re-identification method combining posture and attention based on a dual-stream network
CN112052886B (zh) * 2020-08-21 2022-06-03 Jinan University Intelligent human action posture estimation method and device based on a convolutional neural network
CN112200165A (zh) * 2020-12-04 2021-01-08 Beijing Softcom Smart City Technology Co., Ltd. Model training method, human body posture estimation method, apparatus, device, and medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116110076A (zh) * 2023-02-09 2023-05-12 State Grid Jiangsu Electric Power Co., Ltd. Suzhou Power Supply Branch Identity re-identification method and system for power transmission high-altitude workers based on a mixed-granularity network
CN116110076B (zh) * 2023-02-09 2023-11-07 State Grid Jiangsu Electric Power Co., Ltd. Suzhou Power Supply Branch Identity re-identification method and system for power transmission high-altitude workers based on a mixed-granularity network
CN116028663A (zh) * 2023-03-29 2023-04-28 Shenzhen Yuanshijie Technology Co., Ltd. Three-dimensional data engine platform
CN116028663B (zh) * 2023-03-29 2023-06-20 Shenzhen Yuanshijie Technology Co., Ltd. Three-dimensional data engine platform
CN116580444A (zh) * 2023-07-14 2023-08-11 Guangzhou Silinjie Technology Co., Ltd. Long-distance running timing test method and device based on multi-antenna radio frequency identification technology
CN117643252A (zh) * 2024-01-12 2024-03-05 Tibet Tianshuo Agricultural Technology Co., Ltd. Facility cultivation method for overcoming continuous cropping obstacles of Pinellia ternata
CN117643252B (zh) * 2024-01-12 2024-05-24 Tibet Tianshuo Agricultural Technology Co., Ltd. Facility cultivation method for overcoming continuous cropping obstacles of Pinellia ternata

Also Published As

Publication number Publication date
CN113486708A (zh) 2021-10-08
CN113486708B (zh) 2022-03-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21942711; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21942711; Country of ref document: EP; Kind code of ref document: A1)