WO2023219901A1 - Methods and apparatus for human pose estimation from images using dynamic multi-headed convolutional attention - Google Patents

Methods and apparatus for human pose estimation from images using dynamic multi-headed convolutional attention

Info

Publication number
WO2023219901A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor, temporal, joints, profile, machine
Application number
PCT/US2023/021201
Other languages
French (fr)
Inventor
Alec DIAZ-ARIAS
Dmitriy Shin
Jean E. Robillard
Mitchell MESSMORE
John RACHID
Original Assignee
INSEER Inc.
Priority claimed from U.S. Application No. 17/740,650 (US 11,482,048 B1)
Application filed by INSEER Inc.
Publication of WO2023219901A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving models
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Definitions

  • the present disclosure relates to the field of artificial intelligence, specifically to methods and apparatus for human pose estimation using dynamic multi-headed convolutional attention based on images and/or videos collected from a sensor(s) such as a camera.
  • an apparatus for human pose estimation is provided.
  • the apparatus includes a processor and a memory operatively connected to the processor.
  • the memory contains instructions configuring the processor to receive image frames, each image frame from the plurality of image frames containing a measured temporal joint data of a subject.
  • the memory further stores instructions causing the processor to train a limb segment machine-learning model of machine-learning models using an interrelation training set that contains limb segment image data correlated to a limb matrix in a motion sequence.
  • the memory further stores instructions causing the processor to execute the limb segment machine-learning model to identify frame interrelations using the plurality of image frames as an input and generate a temporal joints profile based on the plurality of frame interrelations.
  • a non-transitory, processor-readable medium causes the processor to receive, from a sensor operatively coupled to the processor, a plurality of image frames containing a measured temporal joint data of a subject across at least two frames of image data from the sensor and compute a plurality of averaged convoluted quantitative identifiers based on the plurality of image frames.
  • the non-transitory, processor-readable medium also causes the processor to train a multi-headed temporal profile machine-learning model of a plurality of machine-learning models using a temporal joints training set that includes a concatenated temporal profile correlated to a concatenated temporal sequence, execute the multi-headed temporal profile machine-learning model to output a plurality of temporal joints interrelations using the plurality of averaged convoluted quantitative identifiers as an input, and generate an aggregated temporal joints profile based on the temporal joints interrelations.
  • FIG. 1 is a block diagram illustrating an embodiment of a non-transient computer readable medium for producing a temporal joints profile, according to an embodiment.
  • FIG. 2 is a flow diagram illustrating an embodiment of a dynamic convolutional multi-headed attention for estimating a human pose, according to an embodiment.
  • FIG. 3 is a block diagram of an embodiment of a dynamic convolutional multi- headed attention for estimating a human pose, according to an embodiment.
  • FIG. 4 is a schematic illustration of a method for estimating a set of poses, according to an embodiment.
  • FIG. 5 is a schematic illustration of determining static load on a joint from an image frame to determine risk of injury and physical therapy improvement, according to an embodiment.
  • an apparatus is for human pose estimation using dynamic multi-headed convolutional attention.
  • the apparatus includes a transformer called ConvFormer that leverages a dynamic convolutional multi-headed attention mechanism.
  • a compute device incorporating the ConvFormer can receive a two- dimensional (2D) input such as a video and/or image frames representing a human subject.
  • the ConvFormer can convert the 2D input into a 3D skeletal representation and produce a temporal joints profile, representing a sequence of 3D poses highlighting the movement of a joint of the 3D skeleton in a temporal input.
  • the ConvFormer receives the 2D input of a human subject and produces an image of a skeletal outline following the joints and limbs of the human subject, which can be overlayed on the original input, thereby producing an outline of the human subject's skeletal framework on top of the human subject in the 2D input. This is so, at least in part, to identify the human subject's pose and/or movement in a sequence of time.
  • the ConvFormer can also produce a 3D skeletal representation based on the 2D input and the skeletal outline to estimate the skeletal outline’s movement and position in three dimensions.
  • the ConvFormer produces the 3D skeletal representations outlining the human subject in each frame found in the video stream.
  • the ConvFormer can also reduce complexity between frames from using convolution based on a convolution filter size.
  • the ConvFormer can also be used, for example, to reproduce a spatially and temporally smooth 3D skeletal representation of a human subject from the reduced sparsity.
  • the ConvFormer can produce a temporal joints profile for each joint of a human subject and query each of those temporal joints profiles to produce a complete temporal joints profile for the overall human subject that is being analyzed by the ConvFormer.
  • the compute device can generate multiple layers of queries, keys, and values for each joint to produce a 3D skeletal representation and a 3D joint(s) sequential model denoted by a temporal joints profile.
  • the ConvFormer can generate the queries, keys, and values from each frame and average them to produce the temporal joints profile.
  • the ConvFormer can also be used, for example, to generate temporal joints profiles to reproduce skeletal joint and limb movements of the human subject accurately and smoothly from a 2D input.
  • a temporal joints profile includes a sequential movement of a specific joint in a time and/or motion sequence.
  • Each joint of the skeletal outlines for each frame can be mapped throughout the frames, thereby identifying the sequential movement of each joint by following the movement and location of each joint throughout the frames.
  • the temporal joints profile can be used to analyze a joint’s movement in sequence.
  • the plurality of temporal joints profiles can be used to analyze joints’ movements in sequence and how they interact in sequence.
  • the ConvFormer can also perform a scaled-dot product attention following the fusing of temporal information resulting in the temporal joints profile or the plurality of temporal joints profiles.
  • the ConvFormer’s dynamic convolutional multi-headed attention mechanism includes a scaled dot product attention applied for each head of the multi-headed attention mechanism, where the attention can be described as a mapping function that maps a query matrix Q, a key matrix K, and a value matrix V to an output attention matrix.
  • Each head of the multi-headed attention mechanism can produce a temporal joints profile or temporal joints profiles different from the other heads.
  • the temporal joints profile(s) produced from different heads can include temporal joints profile(s) produced from different camera angles.
  • one head can focus on producing a singular temporal joints profile for a singular joint such as a right elbow, while the other heads can focus on producing a singular temporal joints profile for every other joint of the human subject.
  • the ConvFormer can run each head and perform the attentions in parallel in which each attention result is then concatenated, thereby producing a linearly transformed aggregated temporal joints profile in the expected dimension.
  • the resulting aggregated temporal joints profile can include the most accurate and smooth 3D output of a temporal joints profile or temporal joints profiles.
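As a rough illustration of the multi-headed attention flow described above (per-head scaled dot-product attention run in parallel, followed by concatenation and a linear map back to the expected dimension), the following NumPy sketch shows the generic mechanism; the shapes, head count, and projection weights are illustrative assumptions rather than the ConvFormer's actual configuration.

```python
# Minimal NumPy sketch of multi-headed scaled dot-product attention with
# concatenation of the per-head outputs. Shapes, head count, and the final
# linear projection are illustrative assumptions, not the patent's exact design.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (T, d) matrices for one head; T = sequence length, d = head dim.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise correlation scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # (T, d) attended output

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    # x: (T, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) projection weights.
    T, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for h in range(num_heads):                       # heads can run in parallel
        s = slice(h * d_head, (h + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    concatenated = np.concatenate(heads, axis=-1)    # aggregate per-head results
    return concatenated @ Wo                         # linear map to expected dimension

# Example: 9 frames, 32-dim features, 4 heads.
rng = np.random.default_rng(0)
x = rng.standard_normal((9, 32))
W = [rng.standard_normal((32, 32)) * 0.1 for _ in range(4)]
out = multi_head_attention(x, *W, num_heads=4)       # shape (9, 32)
```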
  • the compute device can also be used to predict body injury.
  • a sensor can continuously record, from a certain angle, a human subject.
  • the recorded video can be sent to the ConvFormer, which can analyze the subject’s joint movements.
  • the compute device can use, for example, a machine-learning model and/or algorithm to determine potential body injuries based on the joint movements computed as temporal joints profiles and the frequency of such movements.
  • the compute device can also be used, for example, to determine physical therapy treatments to improve the body (state of health) of a human subject.
  • the compute device can use, for example, specific temporal joints profiles indicating a healthy bodily movement.
  • a sensor can record people performing manual labor in which a specific sequence of movements represents a healthy and/or ideal mode of performance. Any significant deviations from this specific sequence can be identified by the sensor capturing data from manual laborers and their joint movements for the compute device to produce a specific physical therapy improvement and/or recommendation to achieve the healthy and/or ideal sequence of movements.
  • FIG. 1 shows a system block diagram for a compute system to produce a temporal joints profile, according to an embodiment.
  • Compute device 100 can include any processor as described in this disclosure.
  • the processor can include without limitation a microcontroller, microprocessor, digital signal processor (DSP) and/or system on a chip (SoC) as described in this disclosure.
  • Compute device 100 includes a memory 104 containing instructions for at least a processor 108.
  • Memory 104 of compute device can be, for example, a memory buffer, a random-access memory (RAM), a read-only memory (ROM), a hard drive, a flash drive, a secure digital (SD) memory card, an external hard drive, an erasable programmable read-only memory (EPROM), an embedded multi-time programmable (MTP) memory, an embedded multi-media card (eMMC), a universal flash storage (UFS) device, and/or the like.
  • Memory 104 can store, for example, video data, image data, fitness data, medical record data, and/or the like.
  • Memory 104 can further store, for example, one or more machine learning models, and/or code that includes instructions to cause processor 108 to execute one or more processes or functions such as, but not limited to, a data preprocessor, one or more machine-learning models, or the combination thereof.
  • Compute device 100 can include, for example, a hardware and/or software component to facilitate data communication between compute device 100 and external devices such as, but not limited to, a camera, a sensor, a server, a network, a communication interface, or the combination thereof.
  • Compute device 100 can include, for example a communication interface operatively coupled to and used by processor 108 and/or the memory 104.
  • a communication interface can include a network interface card (NIC), a Wi-Fi® module, a Bluetooth® module, an optical communication module, and/or any other suitable wired, wireless communication interface, or the like thereof.
  • the communication interface can be configured to connect the compute device 100 to a network.
  • Compute device 100 also includes a ConvFormer module 124.
  • ConvFormer module can be, for example, a hardware/software module and/or convolutional transformer that leverages a dynamic convolutional multi-headed self-attention mechanism for monocular 3D human pose estimation.
  • ConvFormer module 124 can include a spatial-temporal convolutional transformer.
  • Compute device 100 also includes processor 108, which contains and/or can access (e.g., from memory 104) instructions to cause compute device 100 and/or ConvFormer module 124 to receive, from a sensor 116, a video input containing image frames 120 representing a subject 112.
  • the image frames 120 can contain a measured temporal joint data of subject 112 across at least two frames of image data from sensor 116.
  • the measured temporal joint data can include the dimensional location of a recognizable human subject as the location shifts following the movements of the human subject captured across multiple image frames.
  • the temporal joint data can include a rectangular outline and/or overlay surrounding the human subject as well as 2D coordinates of the corners of the outline and/or overlay.
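A minimal sketch of how such measured temporal joint data could be represented in code, assuming a simple axis-aligned rectangular outline per frame; the class and field names are hypothetical and for illustration only.

```python
# Illustrative (hypothetical) representation of per-frame measured temporal joint
# data: a bounding box around the detected subject plus its 2D corner coordinates,
# tracked across at least two frames.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FrameJointData:
    frame_index: int
    # Axis-aligned rectangular outline around the subject: (x_min, y_min, x_max, y_max).
    bounding_box: Tuple[float, float, float, float]

    def corners(self) -> List[Tuple[float, float]]:
        """Return the 2D coordinates of the four corners of the overlay."""
        x0, y0, x1, y1 = self.bounding_box
        return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

# The subject's location shifting across frames as it moves.
temporal_joint_data = [
    FrameJointData(frame_index=0, bounding_box=(120.0, 80.0, 260.0, 400.0)),
    FrameJointData(frame_index=1, bounding_box=(124.0, 82.0, 263.0, 401.0)),
]
```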
  • Sensor 116 can be, for example, a device used to capture a physical phenomenon and output an electrical signal.
  • sensor 116 can include any one or more of a video camera, an image sensor, motion sensor, thermal sensor, biochemical sensor, pressure sensor, or the like thereof.
  • Sensor 116 can include multiple sensors where each sensor is housed with each other in a sensor case, forming a sensor suite.
  • a subject 112 can be, for example, a recognizable moving human captured by sensor 116 and/or within the video input.
  • Subject 112 can include, for example, a central figure in the video input.
  • subject 112 can include multiple subjects, where each subject is identified by sensor 116 and/or compute device 100.
  • a “video input,” as used in this disclosure, can be for example an electronic input containing electronic signals of a 2D visual media.
  • the video input can be embodied in any electronic medium for recording, copying, playback, broadcasting, displaying, or the like thereof.
  • the video input can include image frames 120.
  • An “image frame,” as used in this disclosure, can be, for example, an image in a sequence of images forming a video.
  • a video input having 60 frames per second can include 60 image frames for one second of video input.
  • a video input of a human person performing manual labor can be multiple image frames where each image frame contains an image of a pose the human subject is making while performing the manual labor.
  • the multiple image frames can include, for example, a sequence of such pose and/or manual labor.
  • the video input can include, for example, a recorded 2D visual media of subject 112 and its movements from a fixed position via placement of sensor 116.
  • Memory 104 stores instructions to cause processor 108 to identify, via ConvFormer module 124, multiple joints from subject 112.
  • a “joint” is for example a connection point between two or more limbs and/or bones.
  • compute device 100 can denote a joint as a circular dot placed on top of each image frame 120 where a joint within an image frame is located.
  • ConvFormer module 124 is further configured to identify multiple joints using a spatial joint machine-learning model 128.
  • a “spatial joint machine-learning model” can be, for example, any machine-learning model and/or algorithm used to output multiple joint localization overlays. Spatial joint machine-learning model 128 can be consistent with any machine-learning model as described herein.
  • a machine-learning model can include any supervised and/or unsupervised machine-learning model.
  • a machine-learning model can enable deep learning.
  • compute device 100 can use convolutional neural networks (CNNs) for image processing and object detection.
  • ConvFormer module 124 can employ recurrent neural networks (RNNs) to enable the algorithms of a machine-learning model to teach itself to identify joints, limbs, 3D skeletal representations, and the like thereof.
  • compute device 100 can use a feed-forward neural network and multiple hidden layers used to transform the video input and its image frames 120 into recognizable 3D frame interrelations 136.
  • the feed-forward neural network and hidden layers can also be used to identify multiple joints of subject 112 and/or compute 2D joint localizations from the video input and/or image frames 120.
  • the feed-forward neural network and hidden layers can also be used to identify limb segments and/or compute 2D segmental localizations.
  • Memory 104 stores instructions to cause processor 108 to train, via ConvFormer module 124, spatial joint machine-learning model 128 using a spatial joints training set.
  • a “spatial joints training set” can be, for example, any training data containing a two- dimensional human pose correlated to a two-dimensional joints profile.
  • Processor 108 can be further instructed to retrieve any training set from a database.
  • Processor 108 can be further instructed to record and retrieve any inputs and outputs for any machine-learning model in the database.
  • the database can be a cloud database where compute device 100 is connected via a network interface to access the cloud database.
  • Memory 104 stores instructions to cause processor 108 to train, via ConvFormer module 124, spatial joint machine-learning model 128 with the spatial joints training set and execute spatial joint machine-learning model 128 with image frames 120 as an input to output the multiple joint localization overlays.
  • a “joint localization overlay,” is, for example, a 2D map of joints on an image.
  • the 2D joint localization can include an image of circular dots placed on top of each image frame 120, where each circular dot denotes a joint of a human subject 112.
  • ConvFormer module 124 can produce a joint localization overlay for each individual frame of image frames 120.
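A minimal sketch of rendering such a joint localization overlay, assuming 2D joint coordinates are already available for a frame; OpenCV is used here purely for illustration and is not prescribed by the description.

```python
# Draw circular dots on top of an image frame at each detected 2D joint location,
# producing a per-frame joint localization overlay. The joint coordinates are
# assumed to come from the spatial joint machine-learning model.
import cv2
import numpy as np

def draw_joint_localization_overlay(frame, joints_2d, radius=4):
    """frame: HxWx3 image; joints_2d: iterable of (x, y) pixel coordinates."""
    overlay = frame.copy()
    for (x, y) in joints_2d:
        cv2.circle(overlay, (int(round(x)), int(round(y))), radius, (0, 255, 0), -1)
    return overlay

# Example with a blank frame and a few hypothetical joint positions.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
joints = [(320, 100), (300, 180), (340, 180), (320, 260)]
overlaid = draw_joint_localization_overlay(frame, joints)
```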
  • Memory 104 stores further instructions to cause processor 108 to determine, via ConvFormer module 124, multiple frame interrelations based on the joint localization overlay using a limb segment machine-learning model 132.
  • a “limb segment machine-learning model” is, for example, any machine-learning model used to output frame interrelation 136.
  • a “frame interrelation” is, for example, a frame-wise relation and/or sequence between two or more frames, where the frame-wise relation identifies limb segments connecting multiple joints from a joint localization overlay.
  • frame interrelations 136 can incorporate limb segments to be included in the joint localization overlay, thereby creating a comprehensive skeletal outline of joints and limbs.
  • Memory 104 further stores instructions to cause processor 108 to train, via ConvFormer module 124, limb segment machine-learning model 132 using an interrelation training set, which can be, for example, a training set that contains limb segment image data correlated to a limb matrix in a motion sequence.
  • ConvFormer module 124 then executes limb segment machine-learning model 132 to identify the multiple frame interrelations 136 using the image frames 120 as an input.
  • limb segment machine-learning model 132 can receive the joint localization overlays as an input to train limb segment machine-learning model 132.
  • ConvFormer module 124 can include a spatial attention module and a temporal transformer module.
  • the spatial attention module receives image frames 120 of 2D human poses as inputs and extracts a high dimensional feature for each frame's joint correlations as denoted by the 2D segmental localization.
  • the spatial attention module can also extract multiple 2D segmental localizations for each individual frame and globally across a motion sequence of subject 112 captured in image frames 120.
  • the spatial attention module contains spatial joint machine-learning model 128 and/or limb segment machine-learning model 132, as limb segment machine-learning model 132 outputs the frame interrelations 136, where the frame interrelations 136 include a sequence of spatial blocks denoted by the joint localization overlays.
  • the input for ConvFormer module 124 includes image frames 120 of 2D human poses. For example, an input is a 2D pose with J joints that is represented by two coordinates (u, v).
  • the spatial attention module and/or limb segment machine-learning model 132 maps the coordinate of each joint found in a frame into a high-dimensional feature with a trainable linear layer.
  • a learned positional encoding via summation is applied to the high- dimensional feature.
  • 3D skeletal representations can be reconstructed with the joints relative to the camera reference frame, where the camera reference frame is where the root joint sits at the origin.
  • the spatial attention module and/or limb segment machine-learning model 132 encodes each joint coordinate into the embedding space, where d represents the dimension of the embedding.
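A hedged PyTorch sketch of the per-joint embedding step described above: a trainable linear layer maps each (u, v) coordinate to a d-dimensional feature, and a learned positional encoding is added by summation. The module name, dimensions, and initialization are illustrative assumptions.

```python
# Sketch of the spatial embedding: 2D joint coordinates -> d-dimensional features
# plus a learned positional encoding (added via summation).
import torch
import torch.nn as nn

class SpatialJointEmbedding(nn.Module):
    def __init__(self, num_joints: int, d: int):
        super().__init__()
        self.linear = nn.Linear(2, d)                                   # (u, v) -> d-dim feature
        self.pos_embedding = nn.Parameter(torch.zeros(num_joints, d))   # learned positional encoding

    def forward(self, pose_2d: torch.Tensor) -> torch.Tensor:
        # pose_2d: (batch, num_joints, 2), a 2D pose with J joints per frame.
        return self.linear(pose_2d) + self.pos_embedding                # summation of encodings

# Example: batch of 8 frames, 17-joint 2D poses embedded into d = 64.
embed = SpatialJointEmbedding(num_joints=17, d=64)
features = embed(torch.randn(8, 17, 2))                                  # shape (8, 17, 64)
```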
  • the spatial feature sequence is fed into the spatial attention module of ConvFormer module 124, which applies a spatial attention mechanism with respect to the joint dimension to integrate information across the complete pose.
  • the output for the i-th frame of the ti-th spatial attention module of ConvFormer module 124 is denoted
  • Memory 104 can further instruct processor 108 to compute, via ConvFormer module 124, a 3D pose sequence based on the frame interrelations 136.
  • a “3D pose sequence” can be, for example, a sequence of 3D frames where each 3D frame corresponds to a 2D frame of the image frames 120.
  • ConvFormer module 124 can convert a 2D frame into a 3D frame.
  • ConvFormer module 124 can use a CNN to produce the 3D pose sequence.
  • ConvFormer module 124 can also extract multiple scaled frame interrelations based on a convolutional filter size prior to generating the temporal joints profile.
  • a “convolutional filter size” can be, for example, a 2D kernel size for which convolutional layers are to follow.
  • the convolutional filter size can be used to adjust the level of sparsity to be applied for the frame interrelations 136 denoted by the scaled frame interrelations.
  • Memory 104 further stores instructions to cause processor 108 to receive, via ConvFormer module 124, the frame interrelations 136 from limb segment machine-learning model 132, train a temporal profile machine-learning model 140 using a temporal sequence training set, and execute temporal profile machine-learning model 140 using the frame interrelations 136 as an input to output a temporal joints profile 144.
  • a “temporal joints profile machine-learning model” can be, for example, any machine-learning model and/or algorithm used to output a temporal joints profile.
  • a “temporal sequence training set” can be, for example, a training set that contains a temporal pose sequence correlated to a temporal joint sequence.
  • a “temporal joints profile” can be, for example, a sequence of a joint from 3D poses by a human subject given a video input.
  • temporal joints profile 144 can include a graphical indicator of a joint based on the frame interrelations 136 and/or 3D pose sequence.
  • the frame interrelations 136 can include a 3D sequence of a human subject’s skeletal representation.
  • Temporal joints profile 144 can include a line following the specific movement of a specific joint from 3D frame interrelations, identifying a fluid temporal motion of that specific joint.
  • ConvFormer module 124 can enable temporal profile machine-learning model 140 to produce multiple temporal joints profiles for each joint of a joint localization overlay, where each joint represents the corresponding joint of a human subject captured by the image frames.
  • temporal profile machine-learning model 140 can include deep learning. For example, using a neural network, temporal profile machine-learning model 140 can structure algorithms in hidden layers that map each joint and limb segment of a human body through the hidden layers to produce temporal joints profile 144 and/or multiple temporal joints profiles.
  • Temporal joints profile 144 can vary, for example, based on a temporal window.
  • a “temporal window” can be, for example, a finite window of time of a video stream.
  • a video stream with a duration of 1 hour can be inputted into ConvFormer module 124.
  • the video stream can include a human subject performing a variety of manual labor and body movements.
  • ConvFormer module 124 can analyze the movements of a specific window of that duration based on the temporal window.
  • the temporal window can include, for example, a time window of the last 5 minutes of the inputted video stream. This is so, at least in part, for ConvFormer module 124 to parse temporal joints profile 144 of an entire sequence of the video input into smaller chunks.
  • Processor 108 can be instructed to determine an injury risk datum 152 and a physical therapy improvement 148 based on temporal joints profile 144.
  • injury risk datum can be, for example, any readable information describing a bodily injury or harm that is present in a human subject based on the human subject’s 3D skeletal representation or temporal joints profile(s).
  • injury risk datum 152 can include information about potential injury risks that can develop based on temporal joints profile 144.
  • a human subject captured by a camera can be mapped by ConvFormer module 124 to a 3D skeletal representation, denoted by temporal joints profile 144, which can indicate, based on the human subject's movements, present bodily injury and/or potential bodily injury.
  • a “physical therapy improvement” can be, for example, a physical improvement plan used to rehabilitate a human subject suffering from an identified bodily injury or potential injury based on the human subject's 3D temporal joints profile.
  • FIG. 2 is a flow diagram illustrating an embodiment of a method 200 for dynamic multi-headed convolutional attention for estimating a human pose.
  • method 200 includes receiving, from a sensor operatively coupled to a compute device, image frames containing a measured temporal joint data of a subject across at least two frames of image data from the sensor.
  • method 200 can include receiving, from multiple sensors, multiple video inputs, each sensor pointing to the at least a subject from a different angle.
  • method 200 can include causing the processor of a non-transitory, processor-readable medium to receive the image frames as an input, train a spatial joint machine-learning model of multiple machine-learning models using a spatial joints training set containing a two-dimensional human pose correlated to a two-dimensional joints profile, and execute the spatial joint machine-learning model using the image frames as an input, where the spatial joint machine-learning model outputs multiple joint localization overlays.
  • method 200 includes computing multiple averaged convoluted quantitative identifiers based on the plurality of image frames.
  • method 200 can include computing multiple convoluted quantitative identifiers such as queries, keys, and values.
  • Method 200 can further include causing the processor to extract multiple scaled frame interrelations based on a convolutional filter size.
  • method 200 can include, prior to training a multi-headed temporal profile machine-learning model, receiving the plurality of joint localization overlays from the spatial joint machine-learning model, training the limb segment machine-learning model using an interrelation training set that contains limb segment image data correlated to a limb matrix in a motion sequence, and executing the limb segment machine-learning model using the plurality of joint localization overlays as an input, where the limb segment machine-learning model outputs multiple frame interrelations.
  • the limb segment machine-learning model uses the plurality of joint localization overlays from the spatial joint machine-learning model to produce a completed 2D overlay identifying joints and limbs of a human subject.
  • method 200 can also include receiving the plurality of frame interrelations from the limb segment machine-learning model, training a temporal profile machine-learning model of the plurality of machine-learning models using a temporal sequence training set containing a temporal pose sequence correlated to a temporal joint sequence, and executing the temporal profile machine-learning model using the plurality of frame interrelations as an input, where the temporal profile machine-learning model outputs a temporal joints profile.
  • the temporal profile machine-learning model can produce a temporal joints profile and/or temporal joints profiles within each head of a multi-headed attention process, prior to the attention and the concatenation of each head of a transformer with a dynamic multi-headed convolutional attention mechanism.
  • method 200 includes training the multi-headed temporal profile machine-learning model of multiple machine-learning models using a temporal joints training set that includes a concatenated temporal profile correlated to a concatenated temporal sequence.
  • training the temporal profile machine-learning model and/or training any machine-learning model can include deep learning.
  • method 200 includes executing the multi-headed temporal profile machine-learning model to output multiple temporal joints interrelations using the plurality of averaged convoluted quantitative identifiers as an input.
  • the plurality of averaged quantitative identifiers forms a temporal joints profile and/or multiple temporal joints profiles, where the averaged quantitative identifiers are computed as a result of convolution.
  • method 200 can also include generating multiple frame interrelations based on the averaged convolutional quantitative identifiers, and reducing sparsity of multiple joint interrelations between the plurality of frame interrelations using the plurality of averaged convoluted quantitative identifiers.
  • method 200 includes generating an aggregated temporal joints profile based on the plurality of temporal joints interrelations.
  • the aggregated temporal joints profile can include multiple temporal joints profiles that underwent a scaled dot product attention and were concatenated to produce the aggregated temporal joints profile.
  • method 200 can include computing the plurality of averaged convoluted quantitative identifiers and generating the aggregated temporal joints profile simultaneously.
  • FIG. 3 is a block diagram of an embodiment of a dynamic multi-headed convolutional attention mechanism 300 of the ConvFormer for estimating a human pose.
  • the ConvFormer for estimating a human pose can be consistent with the ConvFormer module described in further detail above.
  • queries, keys, and values are generated via convolutions with weights of the following dimension (T, T, k), where k is the kernel size and the 1D convolutions have depth the size of the input sequence.
  • the ConvFormer's dynamic multi-headed convolutional attention mechanism fuses the temporal evolution of a patch of deep joint features in one shot, which is distinct from typical temporal attentions, which attend to complete pose encodings throughout the motion sequence.
  • Dynamic multi-headed convolutional attention mechanism 300 includes queries, keys, and values that undergo convolution to produce averaged convoluted quantitative identifiers such as query average 320, key average 340, and value average 360. It is noted that the spatial attention mechanism of the dynamic multi-headed convolutional attention mechanism outputs a sequence in which B is the number of spatial blocks and T' is the number of frames in the sequence. In some embodiments, the ConvFormer can concatenate these features along a first axis.
  • the dynamic multi-headed convolutional attention mechanism incorporates a learned temporal embedding into the deep joint features' evolution throughout time, and the resulting features are the inputs into the temporal attention mechanism of the ConvFormer as described in FIG. 1.
  • the dynamic multi-headed convolutional attention mechanism applies a Mean Per Joint Position Error (MPJPE) loss to minimize the error during optimization, where p is the ground truth 3D pose, p̂ is the predicted pose, and i indexes specific joints in a skeleton.
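The loss can be written in the standard MPJPE form, shown here as a hedged reconstruction consistent with the variables named above (J joints, ground-truth pose p, predicted pose p̂):

```latex
% Standard MPJPE form (assumed), consistent with the surrounding description.
\mathcal{L}_{\mathrm{MPJPE}} \;=\; \frac{1}{J}\sum_{i=1}^{J} \left\lVert \hat{p}_{i} - p_{i} \right\rVert_{2}
```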
  • Convolutional Scaled Dot Product Attention 364 can be described as a mapping function that maps a query matrix Q of Query Average 320, a key matrix K of Key Average 340, and a value matrix V of Value Average 360 to an output attention matrix where the output attention matrix entries are scores representing the strength of correlation between any two elements in the axis being attended.
  • the output of Convolutional Scaled Dot Product Attention 364 can be expressed as a closed-form function of the query, key, and value matrices.
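Assuming the conventional scaled dot-product formulation, one standard closed form consistent with this mapping is (d_k denotes the key dimension):

```latex
% Standard scaled dot-product attention (assumed form).
\mathrm{Attention}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```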
  • the query, keys, and values are computed in the same manner for a fixed filter length.
  • the feature aggregation method to produce aggregated temporal joints profile 380 can be expressed in terms of k, the kernel size, and c_out, the output filter.
  • the dynamic multi-headed convolutional attention mechanism 300 introduces sparsity via convolutions to decrease connectivity while simultaneously fusing complete temporal information prior to Convolutional Scaled Dot Product Attention 364. Moreover, due to the feature aggregation method of the dynamic multi-headed convolutional attention mechanism, the dynamic multi-headed convolutional attention mechanism provides context at different scales. Moreover, due to the convolution mechanism of the dynamic multi-headed convolutional attention mechanism, the dynamic multi-headed convolutional attention mechanism queries on an inter-frame level where the temporal joints profile is learned.
  • the dynamic multi-headed convolutional attention mechanism can use convolutional filter sizes to extract different local contexts at different scales and then perform an averaging operation to generate the query, keys, and values that attention is applied to for each head, where n is the number of convolution filters used, k denotes the kernel size, and c_out denotes the output filter; as shown in FIG. 3, the corresponding convolutional equations are depicted with reference labels 312, 332, and 352, respectively, and with reference labels 316, 332, and 356, respectively.
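A hedged PyTorch sketch of this idea: queries, keys, and values are produced by 1D temporal convolutions with several kernel sizes and then averaged before attention is applied. Kernel sizes, padding, and channel counts are illustrative assumptions rather than the patent's exact configuration.

```python
# Generate averaged convoluted queries, keys, and values for one attention head
# by applying temporal 1D convolutions at several scales and averaging.
import torch
import torch.nn as nn

class DynamicConvQKV(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        def branch():
            # One temporal convolution per kernel size k; "same" padding keeps T fixed.
            return nn.ModuleList(
                nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernel_sizes
            )
        self.q_convs, self.k_convs, self.v_convs = branch(), branch(), branch()

    @staticmethod
    def _average(convs, x):
        # x: (batch, channels, T). Average the n convolved variants, yielding the
        # "averaged convoluted quantitative identifiers".
        return torch.stack([conv(x) for conv in convs], dim=0).mean(dim=0)

    def forward(self, x):
        q = self._average(self.q_convs, x)
        k = self._average(self.k_convs, x)
        v = self._average(self.v_convs, x)
        return q, k, v

# Example: batch of 2 sequences, 64 feature channels, 27 frames.
qkv = DynamicConvQKV(channels=64)
q, k, v = qkv(torch.randn(2, 64, 27))   # each of shape (2, 64, 27)
```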
  • Multi-headed Dynamic Convolutional Attention (MDCA) leverages multiple heads (e.g., heads 372, 368 and 376) to jointly model information from multiple representation spaces. As shown in FIG. 3, each head 372, 368, 376 applies Convolutional Scaled Dot Product Attention 364 in parallel.
  • the output of the MDCA is the concatenation of each head 372, 368, 376 and its attention outputs, where the concatenated result is the aggregated temporal joints profile 380.
  • the concatenated result is then fed into a feed-forward network such as convolutional feed-forward network 385, with the concatenated head outputs computed via the procedure defined above.
  • the ConvFormer block can be defined in terms of a layer normalization and a feed-forward network (FFN).
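One plausible arrangement, assuming a standard transformer-style residual structure around the multi-headed dynamic convolutional attention (MDCA), layer normalization (LN), and the feed-forward network (FFN), is:

```latex
% Hedged reconstruction; the exact placement of residual connections is an assumption.
y \;=\; \mathrm{LN}\!\left(x + \mathrm{MDCA}(x)\right), \qquad
\mathrm{ConvFormer}(x) \;=\; \mathrm{LN}\!\left(y + \mathrm{FFN}(y)\right)
```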
  • the ConvFormer module of FIG. 1 can contain two mechanisms, such as the spatial attention mechanism and the temporal attention mechanism. Both the spatial and temporal attention mechanisms of the ConvFormer module have B identical blocks.
  • the output of the spatial ConvFormer encoder, such as the spatial attention mechanism described above, is a sequence where T is the frame sequence length, J is the number of joints, and d is the embedding dimension.
  • the output of the temporal ConvFormer, such as the temporal attention mechanism described above, lies in R^(T x Jd).
  • FIG. 4 shows a schematic illustration of a method for estimating a set of poses, according to an embodiment.
  • FIG. 4 illustrates a process 400 of receiving image frames of human subjects, identifying the joints and limbs of the human subjects, and producing a 2D skeletal outline and/or overlay based on the joints and limbs.
  • the image at reference label 404 depicts an unfiltered image frame containing two human subjects.
  • the image at reference label 408 depicts a schematic 2D skeletal representation of each human subject’s joints and limbs combinations.
  • the image at reference label 412 depicts the 2D skeletal representation overlayed on top of the original unfiltered image frame 404.
  • FIG. 5 shows a schematic illustration of determining static load on a joint from an image frame to determine risk of injury and physical therapy improvement, according to an embodiment.
  • the ConvFormer module can also identify the torque delivered around multiple joints, which can be used to further identify bodily injury, potential risk, and some physical therapy improvement plan to alleviate the bodily injury or risk.
  • a joint torque can refer to a total torque delivered around a joint, usually delivered by muscles.
  • a dynamic load model for the back joint (L5/S1 joint) can be computed by a method as described herein. The method, however, can be similarly applied to any of the other joints of the subject.
  • a total dynamic load on the back joint can be the sum of the torques caused by weight, linear acceleration, and angular acceleration of the body segments above the L5/S1 joint.
  • a weighted torque of the L5/S1 joint can be computed by a sum of all weighted torques of body parts and objects weighted above the back. Those can include the head, the torso, the arms, the hands, or an object(s) in the hands.
  • the weighted torque of a body part can be given in terms of m, the mass value of the body part or the object(s), g, the gravitational constant, and r, the distance between the center of mass (COM) of the segment and the L5/S1 joint in the horizontal plane.
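In terms of those variables, the weighted torque takes the form (a hedged reconstruction from the stated definitions, i.e., gravitational force times horizontal moment arm):

```latex
\tau_{\mathrm{weight}} \;=\; m \, g \, r
```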
  • the COM, the percentage of total body weight, and the radius of gyration for each body part or the object(s) can be modeled, for example, after data sets obtained from exact calculations made on cadaver bodies.
  • the subject's total mass can be given by the user or can be estimated using a 3D representation of a skeleton (as described with respect to FIG. 1) in conjunction with an auxiliary neural network that can predict the subject's Body Mass Index (BMI) and/or weight based on facial features of the subject and/or the 3D representation of the skeleton.
  • a total linear inertial torque is the sum of linear inertial torques of all body parts and any auxiliary objects interacting with the joint of interest (L5/S1 joint).
  • the 3D reconstruction is formatted so that the vertical direction contains all information used to compute the linear force due to movement.
  • the linear inertial torque can be computed from r, the torque arm, m, the mass value of the body part or object, and a_z, the vertical acceleration of the COM of a body part (e.g., head, torso, arms, hands, or object in the hands).
  • the linear inertial torque can be computed for each image/frame from the 3D representation of the skeleton using a central difference method of differentiation.
  • the linear inertial torque can be filtered to remove noise without changing characteristics of the image/frame using a double pass Butterworth filter whose cutoff frequency is obtained by applying Jackson’s algorithm described above.
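A hedged Python sketch of that differentiation-and-filtering pipeline, using a central difference scheme and a zero-phase (double pass) Butterworth low-pass filter; the sampling rate, filter order, and cutoff frequency below are illustrative assumptions, whereas the text obtains the cutoff from Jackson's algorithm.

```python
# Differentiate a vertical COM trajectory twice (central differences) and smooth
# the resulting acceleration with a double-pass Butterworth low-pass filter.
import numpy as np
from scipy.signal import butter, filtfilt

def vertical_acceleration(z_com: np.ndarray, fps: float, cutoff_hz: float = 6.0):
    """z_com: per-frame vertical position of a segment's center of mass."""
    dt = 1.0 / fps
    velocity = np.gradient(z_com, dt)          # central difference, first derivative
    acceleration = np.gradient(velocity, dt)   # central difference, second derivative
    # 4th-order low-pass Butterworth; filtfilt runs it forward and backward
    # (double pass), removing noise without introducing phase lag.
    b, a = butter(4, cutoff_hz, btype="low", fs=fps)
    return filtfilt(b, a, acceleration)

# Example: 3 seconds of a synthetic noisy trajectory at 30 frames per second.
fps = 30.0
t = np.arange(0, 3, 1 / fps)
z = 0.05 * np.sin(2 * np.pi * 0.5 * t) + 0.002 * np.random.default_rng(1).standard_normal(t.size)
a_z = vertical_acceleration(z, fps)
```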
  • a total angular inertial torque is the sum of the angular inertial torques of all body parts and any auxiliary objects interacting with the back.
  • the angular inertial torque for each body part can be computed from m, the mass of the body part, ρ, the radius of gyration, and α, the angular acceleration.
  • the angle of interest here is the segment angle between the body part and the transverse plane.
  • the acceleration of this angle can be computed and filtered using the same techniques described above for the linear inertial torque.
  • the total torque about the joint of interest (L5/S1 joint) can be computed as the sum of the weighted torques, the linear inertial torques, and the angular inertial torques.
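A hedged reconstruction of this total, summing the three torque components over the body segments (and any held object) s above the joint, and assuming the usual moment of inertia I_s = m_s ρ_s² for the angular term:

```latex
\tau_{\mathrm{total}} \;=\; \sum_{s} m_s\, g\, r_s
\;+\; \sum_{s} m_s\, a_{z,s}\, r_s
\;+\; \sum_{s} m_s\, \rho_s^{2}\, \alpha_s
```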
  • Examples of computer code include, but are not limited to, micro-code or micro- instructions, machine instructions, such as produced by a compiler, code used to produce a web service, and files containing higher-level instructions that are executed by a computer using an interpreter. For example, embodiments can be implemented using Python, Java, JavaScript, C++, and/or other programming languages and development tools. Additional examples of computer code include, but are not limited to, control signals, encrypted code, and compressed code.
  • a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
  • the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements.
  • This definition also allows that elements can optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified.
  • “at least one of A and B” can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
  • Some embodiments described herein relate to a computer storage product with a non-transitory computer-readable medium (also can be referred to as a non-transitory processor-readable medium) having instructions or computer code thereon for performing various computer-implemented operations.
  • the computer-readable medium (or processor-readable medium) is non-transitory in the sense that it does not include transitory propagating signals per se (e.g., a propagating electromagnetic wave carrying information on a transmission medium such as space or a cable).
  • the media and computer code also can be referred to as code
  • code can be those designed and constructed for the specific purpose or purposes.
  • non-transitory computer-readable media include, but are not limited to, magnetic storage media such as hard disks, floppy disks, and magnetic tape; optical storage media such as Compact Disc/Digital Video Discs (CD/DVDs), Compact Disc-Read Only Memories (CD-ROMs), and holographic devices; magneto-optical storage media such as optical disks; carrier wave signal processing modules; and hardware devices that are specially configured to store and execute program code, such as Application-Specific Integrated Circuits (ASICs), Programmable Logic Devices (PLDs), Read-Only Memory (ROM) and Random-Access Memory (RAM) devices.
  • Other embodiments described herein relate to a computer program product, which can include, for example, the instructions and/or computer code discussed herein.
  • Hardware modules may include, for example, a processor, a field programmable gate array (FPGA), and/or an application specific integrated circuit (ASIC).
  • Software modules (executed on hardware) can include instructions stored in a memory that is operably coupled to a processor, and can be expressed in a variety of software languages (e.g., computer code), including C, C++, Java™, Ruby, Visual Basic™, and/or other object-oriented, procedural, or other programming language and development tools.
  • Examples of computer code include, but are not limited to, micro-code or micro-instructions, machine instructions, such as produced by a compiler, code used to produce a web service, and files containing higher-level instructions that are executed by a computer using an interpreter.
  • embodiments may be implemented using imperative programming languages (e.g., C, Fortran, etc.), functional programming languages (Haskell, Erlang, etc.), logical programming languages (e.g., Prolog), object-oriented programming languages (e.g., Java, C++, etc.) or other suitable programming languages and/or development tools.
  • Additional examples of computer code include, but are not limited to, control signals, encrypted code, and compressed code.

Abstract

An apparatus for 3D human pose estimation using a dynamic multi-headed convolutional attention mechanism is presented. The apparatus contains two dynamic multi-headed convolutional attention mechanisms, one with spatial attention and another with temporal attention. The spatial attention mechanism extracts frame-wise inter-joint dependencies by analyzing sections of limbs that are related, while the temporal attention mechanism extracts global inter-frame relationships by analyzing correlations between the temporal profiles of joints. The temporal profile mechanism leads to a more diverse temporal attention map while achieving substantial parameter reduction.

Description

METHODS AND APPARATUS FOR HUMAN POSE ESTIMATION FROM IMAGES
USING DYNAMIC MULTI-HEADED CONVOLUTIONAL ATTENTION
CROSS-REFERENCE TO RELATED PATENT APPLICATIONS
[0001] This application claims priority to and is a continuation of (1) U.S. Patent Application No. 17/740,650 filed on May 10, 2022, titled “Methods and Apparatus for Human Pose Estimation From Images Using Dynamic Multi-Headed Convolutional Attention”, and (2) U.S. Patent Application No. 17/944,418 filed on September 14, 2022, titled “Methods and Apparatus for Human Pose Estimation From Images Using Dynamic Multi-Headed Convolutional Attention”, which is a continuation of U.S. Patent Application No. 17/740,650 filed on May 10, 2022; the contents of which are hereby incorporated by reference in their entireties.
FIELD
[0002] The present disclosure relates to the field of artificial intelligence, specifically to methods and apparatus for human pose estimation using dynamic multi-headed convolutional attention based on images and/or videos collected from a sensor(s) such as a camera.
BACKGROUND
[0003] Artificial intelligence and image processing enables computers to reproduce three-dimensional (3D) skeletal representations of humans based on images and/or video collected from a camera. By leveraging deep neural networks for 3D human pose estimation, image processors can learn mappings from red-green-blue (RGB) images to 3D skeletal representations. Deep neural networks can also use an input containing two-dimensional frames of a human subject to identify specific joints and interconnecting limbs to produce such 3D skeletal representations. Current 3D human pose estimators, however, typically suffer substantial computational overhead and simultaneously poor generalizability due to capturing motion in staged environments. Furthermore, current implementations of deep neural networks attempt to solve such issues with superfluous hidden layers and overcomplicate the mapping of spatial and temporal parameters. Therefore, excess noise resulting from such overcomplication can disrupt the overall smoothness of a 3D skeletal representation.
SUMMARY OF THE DISCLOSURE
[0004] In some embodiments, an apparatus for human pose estimation is provided. The apparatus includes a processor and a memory operatively connected to the processor. The memory contains instructions configuring the processor to receive image frames, each image frame from the plurality of image frames containing a measured temporal joint data of a subject. The memory further stores instructions causing the processor to train a limb segment machine-learning model of machine-learning models using an interrelation training set that contains limb segment image data correlated to a limb matrix in a motion sequence. The memory further stores instructions causing the processor to execute the limb segment machine-learning model to identify frame interrelations using the plurality of image frames as an input and generate a temporal joints profile based on the plurality of frame interrelations.
[0005] In some embodiments, a non-transitory, processor-readable medium causes the processor to receive, from a sensor operatively coupled to the processor, a plurality of image frames containing a measured temporal joint data of a subject across at least two frames of image data from the sensor and compute a plurality of averaged convoluted quantitative identifiers based on the plurality of image frames. The non-transitory, processor-readable medium also causes the processor to train a multi-headed temporal profile machine-learning model of a plurality of machine-learning models using a temporal joints training set that includes a concatenated temporal profile correlated to a concatenated temporal sequence, execute the multi-headed temporal profile machine-learning model to output a plurality of temporal joints interrelations using the plurality of averaged convoluted quantitative identifiers as an input, and generate an aggregated temporal joints profile based on the temporal joints interrelations.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] For the purpose of illustrating the disclosure, the drawings show aspects of one or more embodiments. It should be understood, however, that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings.
[0007] FIG. 1 is a block diagram illustrating an embodiment of a non-transient computer readable medium for producing a temporal joints profile, according to an embodiment.
[0008] FIG. 2 is a flow diagram illustrating an embodiment of a dynamic convolutional multi-headed attention for estimating a human pose, according to an embodiment.
[0009] FIG. 3 is a block diagram of an embodiment of a dynamic convolutional multi- headed attention for estimating a human pose, according to an embodiment.
[0010] FIG. 4 is a schematic illustration of a method for estimating a set of poses, according to an embodiment.
[0011] FIG. 5 is a schematic illustration of determining static load on a joint from an image frame to determine risk of injury and physical therapy improvement, according to an embodiment.
[0012] The drawings are not necessarily to scale and can be illustrated by phantom lines, diagrammatic representations and fragmentary views. In certain instances, details that are not necessary for an understanding of the embodiments or that render other details difficult to perceive can have been omitted.
DETAILED DESCRIPTION
[0013] In some embodiments, an apparatus is for human pose estimation using dynamic multi-headed convolutional attention. The apparatus includes a transformer called ConvFormer that leverages a dynamic convolutional multi-headed attention mechanism. In one or more embodiments, a compute device incorporating the ConvFormer can receive a two-dimensional (2D) input such as a video and/or image frames representing a human subject. The ConvFormer can convert the 2D input into a 3D skeletal representation and produce a temporal joints profile, representing a sequence of 3D poses highlighting the movement of a joint of the 3D skeleton in a temporal input. For instance, the ConvFormer receives the 2D input of a human subject and produces an image of a skeletal outline following the joints and limbs of the human subject, which can be overlayed on the original input, thereby producing an outline of the human subject's skeletal framework on top of the human subject in the 2D input. This is so, at least in part, to identify the human subject's pose and/or movement in a sequence of time. The ConvFormer can also produce a 3D skeletal representation based on the 2D input and the skeletal outline to estimate the skeletal outline's movement and position in three dimensions. As the 2D input can include a video stream containing individual image frames, the ConvFormer produces the 3D skeletal representations outlining the human subject in each frame found in the video stream. In estimating the poses of the human subject from each 3D skeletal representation on each frame, the ConvFormer can also reduce complexity between frames from using convolution based on a convolution filter size.
[0014] In some embodiments, the ConvFormer can also be used, for example, to reproduce a spatially and temporally smooth 3D skeletal representation of a human subject from the reduced sparsity. In one or more embodiments, the ConvFormer can produce a temporal joints profile for each joint of a human subject and query each of those temporal joints profiles to produce a complete temporal joints profile for the overall human subject that is being analyzed by the ConvFormer. In one or more embodiments, the compute device can generate multiple layers of queries, keys, and values for each joint to produce a 3D skeletal representation and a 3D joint(s) sequential model denoted by a temporal joints profile. For instance, the ConvFormer can generate the queries, keys, and values from each frame and average them to produce the temporal joints profile. The ConvFormer can also be used, for example, to generate temporal joints profiles to reproduce skeletal joint and limb movements of the human subject accurately and smoothly from a 2D input.
[0015] Instead of generating queries, keys, and values that represent latent pose representations for individual frames found in typical 3D pose estimators or transformers, the ConvFormer queries skeletal outlines, effectively fusing temporal information and producing temporal joints profiles. As described above, a temporal joints profile includes a sequential movement of a specific joint in a time and/or motion sequence. Each joint of the skeletal outlines for each frame can be mapped throughout the frames, thereby identifying the sequential movement of each joint by following the movement and location of each joint throughout the frames. The temporal joints profile can be used to analyze a joint's movement in sequence. The plurality of temporal joints profiles can be used to analyze joints' movements in sequence and how they interact in sequence.
[0016] In some embodiments, the ConvFormer can also perform a scaled dot product attention following the fusing of temporal information resulting in the temporal joints profile or the plurality of temporal joints profiles. The ConvFormer's dynamic convolutional multi-headed attention mechanism includes a scaled dot product attention applied for each head of the multi-headed attention mechanism, where the attention can be described as a mapping function that maps a query matrix Q, a key matrix K, and a value matrix V to an output attention matrix. Each head of the multi-headed attention mechanism can produce a temporal joints profile or plurality of temporal joints profiles different from those of the other heads. For instance, the temporal joints profile(s) produced from different heads can include temporal joints profile(s) produced from different camera angles. In some embodiments, one head can focus on producing a singular temporal joints profile for a singular joint such as a right elbow, while the other heads can focus on producing a singular temporal joints profile for every other joint of the human subject. The ConvFormer can run each head and perform the attentions in parallel, with each attention result then concatenated, thereby producing a linearly transformed aggregated temporal joints profile in the expected dimension. The resulting aggregated temporal joints profile can include the most accurate and smooth 3D output of a temporal joints profile or plurality of temporal joints profiles.
[0017] In some embodiments, the compute device can also be used to predict body injury. For example, a sensor can continuously record, from a certain angle, a human subject. The recorded video can be sent to the ConvFormer, which can analyze the subject's joint movements. The compute device can use, for example, a machine-learning model and/or algorithm to determine potential body injuries based on the joint movements computed as temporal joints profiles and the frequency of such movements. The compute device can also be used, for example, to determine physical therapy treatments to improve the body (state of health) of a human subject. The compute device can use, for example, a specific temporal joints profile indicating a healthy bodily movement. For instance, a sensor can record people performing manual labor in which a specific sequence of movements represents a healthy and/or ideal mode of performance. Any significant deviations from this specific sequence can be identified from the sensor data capturing the manual laborers and their joint movements, allowing the compute device to produce a specific physical therapy improvement and/or recommendation to achieve the healthy and/or ideal sequence of movements.
[0018] FIG. 1 shows a system block diagram for a compute system to produce a temporal joints profile, according to an embodiment. Compute device 100 can include any processor as described in this disclosure. For instance, the processor can include without limitation a microcontroller, microprocessor, digital signal processor (DSP) and/or system on a chip (SoC) as described in this disclosure. Compute device 100 includes a memory 104 containing instructions for at least a processor 108. Memory 104 of compute device can be, for example, a memory buffer, a random-access memory (RAM), a read-only memory (ROM), a hard drive, a flash drive, a secure digital (SD) memory card, an external hard drive, an erasable programmable read-only memory (EPROM), an embedded multi-time programmable (MTP) memory, an embedded multi-media card (eMMC), a universal flash storage (UFS) device, and/or the like. Memory 104 can store, for example, video data, image data, fitness data, medical record data, and/or the like. Memory 104 can further store, for example, one or more machine learning models, and/or code that includes instructions to cause processor 108 to execute one or more processes or functions such as, but not limited to, a data preprocessor, one or more machine-learning models, or the combination thereof.
[0019] Compute device 100 can include, for example, a hardware and/or software component to facilitate data communication between compute device 100 and external devices such as, but not limited to, a camera, a sensor, a server, a network, a communication interface, or the combination thereof. Compute device 100 can include, for example, a communication interface operatively coupled to and used by processor 108 and/or the memory 104. For example, a communication interface can include a network interface card (NIC), a Wi-Fi® module, a Bluetooth® module, an optical communication module, and/or any other suitable wired or wireless communication interface, or the like thereof. The communication interface can be configured to connect the compute device 100 to a network. In some instances, the communication interface can facilitate receiving and/or transmitting data such as video data, image data, fitness data, medical record data, or the like thereof as a function of the network.
[0020] Compute device 100 also includes a ConvFormer module 124. ConvFormer module 124 can be, for example, a hardware/software module and/or convolutional transformer that leverages a dynamic convolutional multi-headed self-attention mechanism for monocular 3D human pose estimation. For example, ConvFormer module 124 can include a spatial-temporal convolutional transformer. Compute device 100 also includes processor 108, which contains and/or can access (e.g., from memory 104) instructions to cause compute device 100 and/or ConvFormer module 124 to receive, from a sensor 116, a video input containing image frames 120 representing a subject 112. In some instances, the image frames 120 can contain a measured temporal joint data of subject 112 across at least two frames of image data from sensor 116. For example, the measured temporal joint data can include the dimensional location of a recognizable human subject as the location shifts following the movements of the human subject captured across multiple image frames. In some cases, the temporal joint data can include a rectangular outline and/or overlay surrounding the human subject as well as 2D coordinates of the corners of the outline and/or overlay.
[0021] Sensor 116 can be, for example, a device used to capture a physical phenomenon and output an electrical signal. For example, sensor 116 can include any one or more of a video camera, an image sensor, motion sensor, thermal sensor, biochemical sensor, pressure sensor, or the like thereof. Sensor 116 can include multiple sensors, where the sensors are housed together in a sensor case, forming a sensor suite. A subject 112 can be, for example, a recognizable moving human captured by sensor 116 and/or within the video input. Subject 112 can include, for example, a central figure in the video input. In some instances, subject 112 can include multiple subjects, where each subject is identified by sensor 116 and/or compute device 100. A "video input," as used in this disclosure, can be, for example, an electronic input containing electronic signals of a 2D visual media. The video input can be embodied in any electronic medium for recording, copying, playback, broadcasting, displaying, or the like thereof. The video input can include image frames 120. An "image frame," as used in this disclosure, can be, for example, an image in a sequence of images forming a video. For example, a video input having 60 frames per second can include 60 image frames for one second of video input. A video input of a human person performing manual labor can be multiple image frames where each image frame contains an image of a pose the human subject is making while performing the manual labor. The multiple image frames can include, for example, a sequence of such poses and/or manual labor. The video input can include, for example, a recorded 2D visual media of subject 112 and its movements from a fixed position via placement of sensor 116.
[0022] Memory 104 stores instructions to cause processor 108 to identify, via ConvFormer module 124, multiple joints from subject 112. A "joint" is, for example, a connection point between two or more limbs and/or bones. In some implementations, compute device 100 can denote a joint as a circular dot placed on top of each image frame 120 where a joint within an image frame is located. ConvFormer module 124 is further configured to identify multiple joints using a spatial joint machine-learning model 128. A "spatial joint machine-learning model" can be, for example, any machine-learning model and/or algorithm used to output multiple joint localization overlays. Spatial joint machine-learning model 128 can be consistent with any machine-learning model as described herein. A machine-learning model can include any supervised and/or unsupervised machine-learning model. A machine-learning model can enable deep learning. In some implementations, compute device 100 can use convolutional neural networks (CNNs) for image processing and object detection. In other implementations, ConvFormer module 124 can employ recurrent neural networks (RNNs) to enable the algorithms of a machine-learning model to teach themselves to identify joints, limbs, 3D skeletal representations, and the like thereof. In another example, compute device 100 can use a feed-forward neural network and multiple hidden layers to transform the video input and its image frames 120 into recognizable 3D frame interrelations 136. The feed-forward neural network and hidden layers can also be used to identify multiple joints of subject 112 and/or compute 2D joint localizations from the video input and/or image frames 120. The feed-forward neural network and hidden layers can also be used to identify limb segments and/or compute 2D segmental localizations.
[0023] Memory 104 stores instructions to cause processor 108 to train, via ConvFormer module 124, spatial joint machine-learning model 128 using a spatial joints training set. A "spatial joints training set" can be, for example, any training data containing a two-dimensional human pose correlated to a two-dimensional joints profile. Processor 108 can be further instructed to retrieve any training set from a database. Processor 108 can be further instructed to record and retrieve any inputs and outputs for any machine-learning model in the database. In some implementations, the database can be a cloud database where compute device 100 is connected via a network interface to access the cloud database. Memory 104 stores instructions to cause processor 108 to train, via ConvFormer module 124, spatial joint machine-learning model 128 with the spatial joints training set and execute spatial joint machine-learning model 128 with image frames 120 as an input to output the multiple joint localization overlays. A "joint localization overlay" is, for example, a 2D map of joints on an image. For example, the 2D joint localization can include an image of circular dots placed on top of each image frame 120, where each circular dot denotes a joint of a human subject 112. ConvFormer module 124 can produce a joint localization overlay for each individual frame of image frames 120.
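By way of non-limiting illustration only, the following sketch shows one way a joint localization overlay of the kind described above could be rendered, placing a circular dot at each predicted 2D joint coordinate on a copy of an image frame. The frame size, joint coordinates, and dot radius are hypothetical placeholders, not values produced by spatial joint machine-learning model 128.

```python
import numpy as np

def draw_joint_overlay(frame: np.ndarray, joints_2d: np.ndarray,
                       radius: int = 4, color=(255, 0, 0)) -> np.ndarray:
    """Place a circular dot on a copy of `frame` (H x W x 3, uint8) at each
    (u, v) joint coordinate, producing a simple joint localization overlay."""
    overlay = frame.copy()
    h, w = frame.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    for u, v in joints_2d:                      # (u, v) = column, row in pixels
        mask = (xx - u) ** 2 + (yy - v) ** 2 <= radius ** 2
        overlay[mask] = color
    return overlay

# Hypothetical example: one 64x64 frame and three joint detections.
frame = np.zeros((64, 64, 3), dtype=np.uint8)
joints = np.array([[32, 10], [28, 30], [36, 50]])   # e.g., head, hip, knee
overlaid = draw_joint_overlay(frame, joints)
print(overlaid.shape, overlaid.max())
```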
[0024] Memory 104 stores further instructions to cause processor 108 to determine, via ConvFormer module 124, multiple frame interrelations based on the joint localization overlay using a limb segment machine-learning model 132. A "limb segment machine-learning model" is, for example, any machine-learning model used to output frame interrelation 136. A "frame interrelation" is, for example, a frame-wise relation and/or sequence between two or more frames, where the frame-wise relation identifies limb segments connecting multiple joints from a joint localization overlay. In some implementations, frame interrelations 136 can incorporate limb segments to be included in the joint localization overlay, thereby creating a comprehensive skeletal outline of joints and limbs. Memory 104 further stores instructions to cause processor 108 to train, via ConvFormer module 124, limb segment machine-learning model 132 using an interrelation training set, which can be, for example, a training set that contains a limb segment image data correlated to a limb matrix in a motion sequence.
ConvFormer module 124 then executes limb segment machine-learning model 132 to identify the multiple frame interrelations 136 using the image frames 120 as an input. In some implementations, limb segment machine-learning model 132 can receive the joint localization overlays as an input to train limb segment machine-learning model 132.
[0025] Alternatively and additionally, ConvFormer module 124 can include a spatial attention module and a temporal transformer module. The spatial attention module receives image frames 120 of 2D human poses as inputs and extracts a high-dimensional feature for each frame's joint correlations as denoted by the 2D segmental localization. The spatial attention module can also extract multiple 2D segmental localizations for each individual frame and globally across a motion sequence of subject 112 captured in image frames 120. In some implementations, the spatial attention module contains spatial joint machine-learning model 128 and/or limb segment machine-learning model 132, as limb segment machine-learning model 132 outputs the frame interrelations 136, where the frame interrelations 136 include a sequence of spatial blocks denoted by the joint localization overlays. In some implementations, the input for ConvFormer module 124 includes image frames 120 of 2D human poses. For example, an input is a 2D pose with J joints, each joint represented by two coordinates (u, v). The spatial attention module and/or limb segment machine-learning model 132 maps the coordinate of each joint found in a frame into a high-dimensional feature with a trainable linear layer. Then a learned positional encoding is applied to the high-dimensional feature via summation. Given a sequence of 2D poses

X = {x_t ∈ R^(J×2)}, t = 1, ..., T,

where T represents the number of frames in the sequence, 3D skeletal representations y_t ∈ R^(J×3) can be reconstructed with each joint expressed relative to the camera reference frame, where the camera reference frame is the frame in which the root joint sits at the origin. For example, given a sequence of poses x_1, ..., x_T, a trainable embedding matrix W ∈ R^(2×d), and a learned positional encoding E_pos ∈ R^(J×d), the spatial attention module and/or limb segment machine-learning model 132 encodes as follows:

z_t = x_t W + E_pos, z_t ∈ R^(J×d),

where d represents the dimension of the embedding. Subsequently, the spatial feature sequence Z = (z_1, ..., z_T) is fed into the spatial attention module of ConvFormer module 124, which applies a spatial attention mechanism with respect to the joint dimension to integrate information across the complete pose. The output for the i-th frame of the b-th spatial attention block of ConvFormer module 124 is denoted z_i^b ∈ R^(J×d).
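A minimal numerical sketch of the embedding step described in paragraph [0025] is given below, assuming a trainable linear layer of width d and a learned spatial positional encoding added by summation; the random matrices stand in for trained parameters, and the frame, joint, and embedding sizes are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

T, J, d = 8, 17, 32                 # frames, joints, embedding dimension (assumed values)
poses_2d = rng.normal(size=(T, J, 2))          # (u, v) coordinates per joint per frame

W_emb = rng.normal(scale=0.02, size=(2, d))    # stand-in for the trainable linear layer
E_pos = rng.normal(scale=0.02, size=(J, d))    # stand-in for the learned positional encoding

# Map each joint coordinate to a d-dimensional feature, then add the
# positional encoding via summation, as described for the spatial module.
Z0 = poses_2d @ W_emb + E_pos                  # shape (T, J, d)
print(Z0.shape)
```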
[0026] Memory 104 can further instruct processor 108 to compute, via ConvFormer module 124, a 3D pose sequence based on the frame interrelations 136. A “3D pose sequence” can be, for example, a sequence of 3D frames where each 3D frame corresponds to a 2D frame of the image frames 120. In other words, ConvFormer module 124 can convert a 2D frame into a 3D frame. In some implementations, ConvFormer module 124 can use a CNN to produce the 3D pose sequence. In another implementation, ConvFormer module 124 can also extract multiple scaled frame interrelations based on a convolutional filter size prior to generating the temporal joints profile. A “convolutional filter size” can be, for example, a 2D kernel size for which convolutional layers are to follow. The convolutional filter size can be used to adjust the level of sparsity to be applied for the frame interrelations 136 denoted by the scaled frame interrelations.
[0027] Memory 104 further stores instructions to cause processor 108 to receive, via ConvFormer module 124, the frame interrelations 136 from limb segment machine-learning model 132, train a temporal profile machine-learning model 140 using a temporal sequence training set, and execute temporal profile machine-learning model 140 using the frame interrelations 136 as an input to output a temporal joints profile 144. A "temporal profile machine-learning model" can be, for example, any machine-learning model and/or algorithm used to output a temporal joints profile. A "temporal sequence training set" can be, for example, a training set that contains a temporal pose sequence correlated to a temporal joint sequence. A "temporal joints profile" can be, for example, a sequence of a joint from 3D poses by a human subject given a video input. In some implementations, temporal joints profile 144 can include a graphical indicator of a joint based on the frame interrelations 136 and/or 3D pose sequence. For example, the frame interrelations 136 can include a 3D sequence of a human subject's skeletal representation. Temporal joints profile 144 can include a line following the specific movement of a specific joint from the 3D frame interrelations, identifying a fluid temporal motion of that specific joint. In some implementations, ConvFormer module 124 can enable temporal profile machine-learning model 140 to produce multiple temporal joints profiles, one for each joint of a joint localization overlay, where each joint represents the corresponding joint of a human subject captured by the image frames. In some implementations, temporal profile machine-learning model 140 can include deep learning. For example, using a neural network, temporal profile machine-learning model 140 can structure algorithms in hidden layers that map each joint and limb segment of a human body through the hidden layers to produce temporal joints profile 144 and/or multiple temporal joints profiles.
[0028] Temporal joints profile 144 can vary, for example, based on a temporal window. A "temporal window" can be, for example, a finite window of time of a video stream. For instance, a video stream with a duration of 1 hour can be inputted into ConvFormer module 124. The video stream can include a human subject performing a variety of manual labor and body movements. ConvFormer module 124 can analyze the movements within a specific window of that duration based on the temporal window. The temporal window can include, for example, a time window of the last 5 minutes of the inputted video stream. This is so, at least in part, for ConvFormer module 124 to parse temporal joints profile 144 of an entire sequence of the video input into smaller chunks.
[0029] Processor 108 can be instructed to determine an injury risk datum 152 and a physical therapy improvement 148 based on temporal joints profile 144. An "injury risk datum" can be, for example, any readable information describing a bodily injury or harm that is present in a human subject based on the human subject's 3D skeletal representation or temporal joints profile(s). In some implementations, injury risk datum 152 can include information about potential injury risks that can develop based on temporal joints profile 144. For example, a human subject captured by a camera can be mapped by ConvFormer module 124 to a 3D skeletal representation, which can be denoted by temporal joints profile 144 and can indicate, based on the human subject's movements, present bodily injury and/or potential bodily injury.
A "physical therapy improvement" can be, for example, a physical improvement plan used to rehabilitate a human subject suffering from an identified bodily injury or potential injury based on the human subject's 3D temporal joints profile.
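By way of non-limiting illustration, the sketch below shows how a temporal joints profile for a single joint could be read out of a sequence of 3D poses and restricted to a finite temporal window such as the last few seconds of a stream; the frame rate, joint index, and window length are hypothetical.

```python
import numpy as np

def temporal_joint_profile(poses_3d: np.ndarray, joint_idx: int,
                           fps: float = 30.0, window_s=None) -> np.ndarray:
    """Return the (T, 3) trajectory of one joint from a (T, J, 3) pose sequence,
    optionally keeping only the last `window_s` seconds (the temporal window)."""
    profile = poses_3d[:, joint_idx, :]
    if window_s is not None:
        n = int(round(window_s * fps))
        profile = profile[-n:]
    return profile

poses_3d = np.zeros((300, 17, 3))              # e.g., 10 s of video at 30 fps, 17 joints
right_elbow = temporal_joint_profile(poses_3d, joint_idx=3, window_s=5.0)
print(right_elbow.shape)                       # (150, 3)
```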
[0030] FIG. 2 is a flow diagram illustrating an embodiment of a method 200 for dynamic multi-headed convolutional attention for estimating a human pose. At 205, method 200 includes receiving, from a sensor operatively coupled to a compute device, image frames containing a measured temporal joint data of a subject across at least two frames of image data from the sensor. In some implementations, method 200 can include receiving, from multiple sensors, multiple video inputs, each sensor pointing to the at least a subject from a different angle.
[0031] In some implementations, method 200 can include causing the processor of the non-transitory, processor-readable medium to receive the image frames as an input, train a spatial joint machine-learning model of multiple machine-learning models using a spatial joints training set containing a two-dimensional human pose correlated to a two-dimensional joints profile, and execute the spatial joint machine-learning model using the image frames as an input, where the spatial joint machine-learning model outputs multiple joint localization overlays.
[0032] At 210, method 200 includes computing multiple averaged convoluted quantitative identifiers based on the plurality of image frames. In a non-limiting embodiment, method 200 can include computing multiple convoluted quantitative identifiers such as queries, keys, and values. Method 200 can further include causing the processor to extract multiple scaled frame interrelations based on a convolutional filter size.
[0033] In some implementations, method 200 can include, prior to training a multi-headed temporal profile machine-learning model, receiving the plurality of joint localization overlays from the spatial joint machine-learning model, training the limb segment machine-learning model using an interrelation training set that contains a limb segment image data correlated to a limb matrix in a motion sequence, and executing the limb segment machine-learning model using the plurality of joint localization overlays as an input, where the limb segment machine-learning model outputs multiple frame interrelations. For instance, the limb segment machine-learning model uses the plurality of joint localization overlays from the spatial joint machine-learning model to produce a completed 2D overlay identifying joints and limbs of a human subject.
[0034] In some implementations, method 200 can also include receiving the plurality of frame interrelations from the limb segment machine-learning model, training a temporal profile machine-learning model of the plurality of machine-learning models using a temporal sequence training set containing a temporal pose sequence correlated to a temporal joint sequence, and executing the temporal profile machine-learning model using the plurality of frame interrelations as an input, where the temporal profile machine-learning model outputs a temporal joints profile. The temporal profile machine-learning model can produce a temporal joints profile or plurality of temporal joints profiles within each head of a multi-headed attention process, prior to the attention and the concatenation of each head of a transformer with a dynamic multi-headed convolutional attention mechanism.
[0035] At 215, method 200 includes training the multi-headed temporal profile machine-learning model of multiple machine-learning models using a temporal joints training set that includes a concatenated temporal profile correlated to a concatenated temporal sequence. In some implementations, training the temporal profile machine-learning model and/or training any machine-learning model can include deep learning.
[0036] At 220, method 200 includes executing the multi-headed temporal profile machine-learning model to output multiple temporal joints interrelations using the plurality of averaged convoluted quantitative identifiers as an input. In some implementations, the plurality of averaged quantitative identifiers forms a temporal joints profile and/or multiple temporal joints profiles, where the averaged quantitative identifiers are computed as a result of convolution. For instance, method 200 can also include generating multiple frame interrelations based on the averaged convoluted quantitative identifiers, and reducing sparsity of multiple joint interrelations between the plurality of frame interrelations of a temporal joints model using the plurality of averaged convoluted quantitative identifiers.
[0037] At 225, method 200 includes generating an aggregated temporal joints profile based on the plurality of temporal joints interrelations. The aggregated temporal joints profile can include multiple temporal joints profiles that underwent a scaled dot product attention and were concatenated to produce the aggregated temporal joints profile. In some implementations, method 200 can include computing the plurality of averaged convoluted quantitative identifiers and generating the aggregated temporal joints profile simultaneously.
[0038] FIG. 3 is a block diagram of an embodiment of a dynamic multi-headed convolutional attention mechanism 300 of the ConvFormer for estimating a human pose. The ConvFormer for estimating a human pose can be consistent with the ConvFormer module described in further detail above. Unlike typical multi-headed attention mechanisms, queries, keys, and values are generated here via convolutions with weights of dimension (T, T, k), where k is the kernel size and the 1D convolutions have a depth equal to the size of the input sequence. In one or more implementations, the ConvFormer's dynamic multi-headed convolutional attention mechanism fuses the temporal evolution of a patch of deep joint features in one shot, which is distinct from typical temporal attention, which attends over complete pose encodings throughout the motion sequence. Dynamic multi-headed convolutional attention mechanism 300 includes queries, keys, and values that undergo convolution to produce averaged convoluted quantitative identifiers such as query average 320, key average 340, and value average 360. It is noted that the spatial attention mechanism of the dynamic multi-headed convolutional attention mechanism outputs a sequence

{z_t^B ∈ R^(J×d)}, t = 1, ..., T',

where B is the number of spatial blocks and T' is the number of frames in the sequence. In some embodiments, each z_t^B can be represented in R^(J·d), and thus the ConvFormer can concatenate these features along a first axis, resulting in:

Z ∈ R^(T'×(J·d)).

Following this procedure, the dynamic multi-headed convolutional attention mechanism incorporates a learned temporal embedding E_temp ∈ R^(T'×(J·d)) into the deep joint feature evolution throughout time, such that Z + E_temp are the inputs into the temporal attention mechanism of the ConvFormer as described in FIG. 1. As the dynamic multi-headed convolutional attention mechanism follows a many-to-one prediction scheme, the dynamic multi-headed convolutional attention mechanism down-samples the spatial axis with a linear projection and then performs a temporal convolution with one output channel, such as p = Conv1D(Z + E_temp), where Conv1D denotes a temporal convolution with one output channel and T inputs. To train this network, the dynamic multi-headed convolutional attention mechanism applies a Mean Per Joint Position Error (MPJPE) loss to minimize the error during optimization, where the loss function is defined as:

L = (1/J) Σ_{i=1}^{J} || p_i − p̂_i ||_2,

where p_i is the ground truth 3D position of joint i, p̂_i is the predicted position, and i indexes specific joints in a skeleton.
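A short sketch of the MPJPE loss defined above, computed as the mean Euclidean distance between predicted and ground-truth 3D joint positions, is shown below; the arrays are toy inputs.

```python
import numpy as np

def mpjpe(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean Per Joint Position Error: average L2 distance over joints (and frames)
    between predicted and ground-truth 3D joint positions, shape (..., J, 3)."""
    return float(np.linalg.norm(pred - target, axis=-1).mean())

pred = np.zeros((2, 17, 3))
target = np.ones((2, 17, 3))
print(mpjpe(pred, target))   # sqrt(3) ≈ 1.732
```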
[0039] Instead of generating queries, keys, and values that represent latent pose representations for individual frames, as found in typical 3D pose estimators or transformers, the dynamic multi-headed convolutional attention mechanism queries skeletal outlines, effectively fusing temporal information and producing a temporal joints profile. Convolutional Scaled Dot Product Attention 364 can be described as a mapping function that maps a query matrix Q of Query Average 320, a key matrix K of Key Average 340, and a value matrix V of Value Average 360 to an output attention matrix, where the output attention matrix entries are scores representing the strength of correlation between any two elements in the axis being attended. The output of Convolutional Scaled Dot Product Attention 364 can be expressed as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V.

The query, keys, and values are computed in the same manner for a fixed filter length. The feature aggregation method to produce aggregated temporal joints profile 380 can be expressed as:

Q_i = Conv_(k_i, c_out)(Z), K_i = Conv_(k_i, c_out)(Z), V_i = Conv_(k_i, c_out)(Z).

Here, k denotes the kernel size, c_out denotes the output filter, and d_out denotes the output dimension. This is juxtaposed against a classic scaled dot product attention where queries, keys, and values are generated via a linear projection, which provides global scope and redundancy due to the complete connectivity. The dynamic multi-headed convolutional attention mechanism 300 introduces sparsity via convolutions to decrease connectivity while simultaneously fusing complete temporal information prior to Convolutional Scaled Dot Product Attention 364. Moreover, due to the feature aggregation method of the dynamic multi-headed convolutional attention mechanism, the dynamic multi-headed convolutional attention mechanism provides context at different scales. Moreover, due to the convolution mechanism of the dynamic multi-headed convolutional attention mechanism, the dynamic multi-headed convolutional attention mechanism queries on an inter-frame level where the temporal joints profile is learned. To this end, the dynamic multi-headed convolutional attention mechanism can use n convolutional filter sizes k_1, ..., k_n to extract different local contexts at different scales and then perform an averaging operation to generate the query, keys, and values that attention is applied to, for each head:

Q = (1/n) Σ_{i=1}^{n} Q_i,   K = (1/n) Σ_{i=1}^{n} K_i,   V = (1/n) Σ_{i=1}^{n} V_i,

where n is the number of convolution filters used, and Q_i, K_i, and V_i are generated as above. As shown in FIG. 3, the first convolution filters for Q, K, and V are depicted with reference labels 304, 324, and 344, respectively, while, depending on the number of convolution filters, the last convolutional filters for Q, K, and V are found with reference labels 308, 328, and 348, respectively. As described above, k denotes the kernel size and c_out denotes the output filter; therefore, as shown in FIG. 3, the convolutional equations for Q_1, K_1, and V_1 are depicted with reference labels 312, 332, and 352, respectively, while the convolutional equations for Q_n, K_n, and V_n are depicted with reference labels 316, 336, and 356, respectively.
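The following non-limiting sketch illustrates the averaging-and-attention procedure described above: queries, keys, and values are produced by temporal convolutions with several kernel sizes, averaged, and passed through a scaled dot product attention. The shared, randomly initialized kernels and the feature shapes are simplifications and assumed values, not the trained filters of mechanism 300.

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_conv(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Depth-preserving 1D convolution along the time axis of x (T, D) with
    'same' padding; `kernel` has shape (k,) and is shared across channels
    (a simplification of the per-filter weights in the mechanism above)."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)))
    return np.stack([kernel @ xp[t:t + k] for t in range(x.shape[0])])

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_conv_attention(x: np.ndarray, kernel_sizes=(3, 5, 7)) -> np.ndarray:
    """Average queries/keys/values produced by convolutions of several filter
    sizes, then apply scaled dot-product attention along the time axis."""
    T, D = x.shape
    Q = np.mean([temporal_conv(x, rng.normal(size=(k,)) / k) for k in kernel_sizes], axis=0)
    K = np.mean([temporal_conv(x, rng.normal(size=(k,)) / k) for k in kernel_sizes], axis=0)
    V = np.mean([temporal_conv(x, rng.normal(size=(k,)) / k) for k in kernel_sizes], axis=0)
    scores = softmax(Q @ K.T / np.sqrt(D))      # (T, T) correlation scores
    return scores @ V                           # fused temporal features, (T, D)

x = rng.normal(size=(9, 16))                    # 9 frames of 16-dim deep joint features
print(dynamic_conv_attention(x).shape)          # (9, 16)
```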
[0040] Multi-headed Dynamic Convolutional Attention (MDCA) leverages multiple heads (e.g., heads 372, 368 and 376) to jointly model information from multiple representation spaces. As shown in FIG. 3, each head 372, 368, 376 applies Convolutional Scaled Dot Product Attention 364 in parallel. The output of the MDCA is the concatenation of each head 372, 368, 376 and its attention outputs, where the concatenated result is the aggregated temporal joints profile 380. The concatenated result is then fed into a feed-forward network such as convolutional feed forward network 385 and can be expressed as:

MDCA(Q, K, V) = Concatenate(head_1, ..., head_h),   head_i = Attention(Q_i, K_i, V_i),

where the individual heads are computed via the procedure defined above. Lastly, the ConvFormer can be defined as:

Z' = LN(Z + MDCA(Z)),   Z'' = LN(Z' + FFN(Z')),

where LN denotes layer normalization and FFN denotes a feed forward network. As described previously above, the ConvFormer module of FIG. 1 can contain two mechanisms such as the spatial attention mechanism and the temporal attention mechanism. Both the spatial and temporal attention mechanisms of the ConvFormer module have B identical blocks. The output of the spatial ConvFormer encoder such as the spatial attention mechanism as described in FIG. 1 lies in R^(T×J×d), where T is the frame sequence length, J is the number of joints, and d is the embedding dimension. Additionally, the output of the temporal ConvFormer such as the temporal attention mechanism as described in FIG. 1 lies in R^(T×(J·d)).
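A compact sketch of the block structure defined above is given below: the attention output is added residually and layer-normalized, then passed through a feed-forward network with a second residual connection and normalization. The attention function here is a plain softmax self-attention used only as a stand-in for MDCA, and the weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_stand_in(x):
    """Plain scaled dot-product self-attention used as a stand-in for MDCA."""
    d = x.shape[-1]
    s = x @ x.T / np.sqrt(d)
    s = np.exp(s - s.max(-1, keepdims=True))
    s = s / s.sum(-1, keepdims=True)
    return s @ x

def feed_forward(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2        # simple ReLU MLP

def convformer_block(x, W1, W2):
    """LN(x + Attention(x)) followed by LN( . + FFN( . )), per the definition above."""
    y = layer_norm(x + attention_stand_in(x))
    return layer_norm(y + feed_forward(y, W1, W2))

T, d = 9, 32
x = rng.normal(size=(T, d))
W1 = rng.normal(scale=0.05, size=(d, 4 * d))
W2 = rng.normal(scale=0.05, size=(4 * d, d))
print(convformer_block(x, W1, W2).shape)       # (9, 32)
```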
[0041] FIG. 4 shows a schematic illustration of a method for estimating a set of poses, according to an embodiment. FIG. 4 illustrates a process 400 of receiving image frames of human subjects, identifying the joints and limbs of the human subjects, and producing a 2D skeletal outline and/or overlay based on the joints and limbs. The image at reference label 404 depicts an unfiltered image frame containing two human subjects. The image at reference label 408 depicts a schematic 2D skeletal representation of each human subject's joint and limb combinations. The image at reference label 412 depicts the 2D skeletal representation overlaid on top of the original unfiltered image frame 404. The ConvFormer module of FIG. 1 can use a spatial joint machine-learning model and/or a limb segment machine-learning model to generate a set of joints, a set of limbs, and a pose estimation for each subject from multiple subjects in an image recorded by a sensor.
[0042] FIG. 5 shows a schematic illustration of determining static load on a joint from an image frame to determine risk of injury and physical therapy improvement, according to an embodiment. As a result of the ConvFormer module generating a temporal joints profile or multiple temporal joints profiles, the ConvFormer can also identify the torque delivered around multiple joints, which can be used to further identify bodily injury, potential risk, and a physical therapy improvement plan to alleviate the bodily injury or risk. A joint torque can refer to a total torque delivered around a joint, usually delivered by muscles. For each joint from a set of joints in a body of a subject (e.g., a patient, a worker, an athlete, etc.), multiple body parts can often contribute to a torque of force about the joint. The sum of all such torques can yield a total joint torque, which can be viewed as a rotational force about the joint. As shown in FIG. 5, a dynamic load model for the back joint (L5/S1 joint) can be computed by a method as described herein. The method, however, can be similarly applied to any of the other joints of the subject. A total dynamic load on the back joint can be the sum of the torques caused by weight, linear acceleration, and angular acceleration of the body segments above the L5/S1 joint.
[0043] A weighted torque of the L5/S1 joint can be computed as a sum of all weighted torques of the body parts and objects carried above the back. Those can include the head, the torso, the arms, the hands, or an object(s) in the hands. The weighted torque of a body part can be given by:

τ_w = m · g · r,

where m is the mass value of the body part or the object(s), g is the gravitational constant, and r is the distance between the center of mass (COM) of the segment and the L5/S1 joint in the horizontal plane. The COM, the percentage of total body weight, and the radius of gyration for each body part or the object(s) can be modeled, for example, after data sets obtained from exact calculations made on cadaver bodies. The subject's total mass can be given by the user or can be estimated using a 3D representation of a skeleton (as described with respect to FIG. 1) in conjunction with an auxiliary neural network that can predict the subject's Body Mass Index (BMI) and/or weight based on facial features of the subject and/or the 3D representation of the skeleton.
[0044] A total linear inertial torque is the sum of linear inertial torques of all body parts and any auxiliary objects interacting with the joint of interest (the L5/S1 joint). The 3D reconstruction is formatted so that the vertical direction contains all information used to compute the linear force due to movement. The linear inertial torque can be computed using:

τ_L = r · m · a_z,

where r is the torque arm, m is the mass value of the body part or object, and a_z denotes a vertical acceleration of the COM of a body part (e.g., head, torso, arms, hands, or object in the hands). The linear inertial torque can be computed for each image/frame from the 3D representation of the skeleton using a central difference method of differentiation. The linear inertial torque can be filtered to remove noise without changing characteristics of the image/frame using a double pass Butterworth filter whose cutoff frequency is obtained by applying Jackson's algorithm described above.
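By way of non-limiting illustration, the sketch below applies a central difference to a toy center-of-mass height trajectory to obtain vertical acceleration and then smooths it with a double pass (zero-phase) Butterworth filter. The 6 Hz cutoff is a placeholder rather than the cutoff produced by the algorithm referenced above, and availability of scipy is assumed.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def vertical_acceleration(z: np.ndarray, fps: float) -> np.ndarray:
    """Second central difference of a COM height trajectory z(t) sampled at `fps`,
    giving the vertical acceleration used in the linear inertial torque."""
    dt = 1.0 / fps
    a = np.zeros_like(z)
    a[1:-1] = (z[2:] - 2.0 * z[1:-1] + z[:-2]) / dt ** 2
    return a

def smooth(signal: np.ndarray, fps: float, cutoff_hz: float = 6.0) -> np.ndarray:
    """Zero-phase (double-pass) low-pass Butterworth filter; the 6 Hz cutoff is a
    placeholder for the cutoff selected by the referenced algorithm."""
    b, a = butter(2, cutoff_hz / (fps / 2.0), btype="low")
    return filtfilt(b, a, signal)

fps = 30.0
t = np.arange(0, 2, 1 / fps)
z = 0.05 * np.sin(2 * np.pi * 1.0 * t)          # toy COM height trajectory (metres)
a_z = smooth(vertical_acceleration(z, fps), fps)
print(a_z.shape)
```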
[0045] A total angular inertial torque is the sum of the angular inertial torques of all body parts and any auxiliary objects interacting with the back. The angular inertial torque for each body part can be computed using:

τ_A = m · ρ² · α,

where m is a mass of the body part, ρ is a radius of gyration, and α is an angular acceleration. The angle of interest here is the segment angle between the body part and the transverse plane. The acceleration of this angle can be computed and filtered using the same techniques described above for the linear inertial torque. Finally, the total torque about the joint of interest (the L5/S1 joint) can be computed as:

τ_total = τ_w + τ_L + τ_A.
Setting all accelerations equal to zero in the above equations yields the static torque.
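A small worked sketch of the torque model above is shown below, summing the weight, linear inertial, and angular inertial torques of several segments about the L5/S1 joint; all segment masses, moment arms, accelerations, and radii of gyration are made-up example numbers, and setting the accelerations to zero recovers the static torque.

```python
import numpy as np

G = 9.81  # gravitational constant, m/s^2

def segment_torque(m, r, a_z, rho, alpha):
    """Weighted, linear-inertial, and angular-inertial torque of one body segment
    (or held object) about the joint of interest, per the equations above."""
    tau_w = m * G * r             # weight torque
    tau_l = r * m * a_z           # linear inertial torque
    tau_a = m * rho ** 2 * alpha  # angular inertial torque
    return tau_w + tau_l + tau_a

# Hypothetical segments above the L5/S1 joint: (mass kg, arm m, a_z m/s^2, rho m, alpha rad/s^2)
segments = [
    (5.0, 0.30, 0.2, 0.12, 0.5),   # head
    (30.0, 0.20, 0.1, 0.25, 0.3),  # torso
    (8.0, 0.35, 0.4, 0.18, 0.8),   # arms and hands
    (4.0, 0.45, 0.6, 0.10, 0.0),   # object in the hands
]
total_dynamic = sum(segment_torque(*s) for s in segments)
total_static = sum(segment_torque(m, r, 0.0, rho, 0.0) for m, r, _, rho, _ in segments)
print(round(total_dynamic, 2), round(total_static, 2))   # N*m
```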
[0046] It should be understood that the disclosed embodiments are not intended to be exhaustive, and functional, logical, operational, organizational, structural and/or topological modifications may be made without departing from the scope of the disclosure. As such, all examples and/or embodiments are deemed to be non-limiting throughout this disclosure.
[0047] All definitions, as defined and used herein, should be understood to control over dictionary definitions, definitions in documents incorporated by reference, and/or ordinary meanings of the defined terms.
[0048] Examples of computer code include, but are not limited to, micro-code or micro- instructions, machine instructions, such as produced by a compiler, code used to produce a web service, and files containing higher-level instructions that are executed by a computer using an interpreter. For example, embodiments can be implemented using Python, Java, JavaScript, C++, and/or other programming languages and development tools. Additional examples of computer code include, but are not limited to, control signals, encrypted code, and compressed code.
[0049] The drawings primarily are for illustrative purposes and are not intended to limit the scope of the subject matter described herein. The drawings are not necessarily to scale; in some instances, various aspects of the subject matter disclosed herein can be shown exaggerated or enlarged in the drawings to facilitate an understanding of different features. In the drawings, like reference characters generally refer to like features (e.g., functionally similar and/or structurally similar elements).
[0050] The acts performed as part of a disclosed method(s) can be ordered in any suitable way. Accordingly, embodiments can be constructed in which processes or steps are executed in an order different than illustrated, which can include performing some steps or processes simultaneously, even though shown as sequential acts in illustrative embodiments. Put differently, it is to be understood that such features may not necessarily be limited to a particular order of execution, but rather, any number of threads, processes, services, servers, and/or the like that may execute serially, asynchronously, concurrently, in parallel, simultaneously, synchronously, and/or the like in a manner consistent with the disclosure. As such, some of these features may be mutually contradictory, in that they cannot be simultaneously present in a single embodiment. Similarly, some features are applicable to one aspect of the innovations, and inapplicable to others.
[0051] Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range and any other stated or intervening value in that stated range is encompassed within the disclosure. That the upper and lower limits of these smaller ranges can independently be included in the smaller ranges is also encompassed within the disclosure, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the disclosure.
[0052] The phrase "and/or," as used herein in the specification and in the embodiments, should be understood to mean "either or both" of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with "and/or" should be construed in the same fashion, i.e., "one or more" of the elements so conjoined. Other elements can optionally be present other than the elements specifically identified by the "and/or" clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to "A and/or B", when used in conjunction with open-ended language such as "comprising" can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
[0053] As used herein in the specification and in the embodiments, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of’ or “exactly one of,” or, when used in the embodiments, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e., “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the embodiments, shall have its ordinary meaning as used in the field of patent law.
[0054] As used herein in the specification and in the embodiments, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements can optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
[0055] In the embodiments, as well as in the specification above, all transitional phrases such as "comprising," "including," "carrying," "having," "containing," "involving," "holding," "composed of," and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases "consisting of" and "consisting essentially of" shall be closed or semi-closed transitional phrases, respectively, as set forth in the United States Patent Office Manual of Patent Examining Procedures, Section 2111.03.
[0056] Some embodiments described herein relate to a computer storage product with a non-transitory computer-readable medium (also can be referred to as a non-transitory processor-readable medium) having instructions or computer code thereon for performing various computer-implemented operations. The computer-readable medium (or processor-readable medium) is non-transitory in the sense that it does not include transitory propagating signals per se (e.g., a propagating electromagnetic wave carrying information on a transmission medium such as space or a cable). The media and computer code (also can be referred to as code) can be those designed and constructed for the specific purpose or purposes. Examples of non-transitory computer-readable media include, but are not limited to, magnetic storage media such as hard disks, floppy disks, and magnetic tape; optical storage media such as Compact Disc/Digital Video Discs (CD/DVDs), Compact Disc-Read Only Memories (CD-ROMs), and holographic devices; magneto-optical storage media such as optical disks; carrier wave signal processing modules; and hardware devices that are specially configured to store and execute program code, such as Application-Specific Integrated Circuits (ASICs), Programmable Logic Devices (PLDs), Read-Only Memory (ROM) and Random-Access Memory (RAM) devices. Other embodiments described herein relate to a computer program product, which can include, for example, the instructions and/or computer code discussed herein.
[0057] Some embodiments and/or methods described herein can be performed by software (executed on hardware), hardware, or a combination thereof. Hardware modules may include, for example, a processor, a field programmable gate array (FPGA), and/or an application specific integrated circuit (ASIC). Software modules (executed on hardware) can include instructions stored in a memory that is operably coupled to a processor, and can be expressed in a variety of software languages (e.g., computer code), including C, C++, Java™, Ruby, Visual Basic™, and/or other object-oriented, procedural, or other programming language and development tools. Examples of computer code include, but are not limited to, micro-code or micro-instructions, machine instructions, such as produced by a compiler, code used to produce a web service, and files containing higher-level instructions that are executed by a computer using an interpreter. For example, embodiments may be implemented using imperative programming languages (e.g., C, Fortran, etc.), functional programming languages (Haskell, Erlang, etc.), logical programming languages (e.g., Prolog), object-oriented programming languages (e.g., Java, C++, etc.) or other suitable programming languages and/or development tools. Additional examples of computer code include, but are not limited to, control signals, encrypted code, and compressed code.

Claims

1. An apparatus, comprising: at least a processor; and a memory operatively coupled to the processor, the memory storing instructions that when executed cause the processor to: receive a plurality of image frames, each image frame from the plurality of image frames containing a measured temporal joint data of a subject; execute a trained limb segment machine-learning model to identify a plurality of frame interrelations using the plurality of image frames as an input, the trained limb segment machine-learning model trained using an interrelation training set that contains limb segment image data in a motion sequence; and generate a temporal joints profile based on the plurality of frame interrelations.
2. The apparatus of claim 1, wherein the instructions to cause the processor to generate the temporal joints profile are based on a plurality of video inputs from a plurality of sensors pointing at the subject from different angles, the plurality of sensors including the sensor.
3. The apparatus of claim 1, wherein the memory further stores instructions that when executed, cause the processor to extract a plurality of scaled frame interrelations based on a convolutional filter size prior to generating the temporal joints profile.
4. The apparatus of claim 1, wherein the memory further stores instructions to cause the processor to independently generate a plurality of temporal joints profiles for a plurality of subjects captured by the sensor.
5. The apparatus of claim 1, wherein the memory further stores instructions to cause the processor to determine an injury risk datum based on the temporal joints profile.
6. The apparatus of claim 1, wherein the memory further stores instructions to cause the processor to determine a physical therapy improvement based on the temporal joints profile.
7. A non-transitory, processor-readable medium storing processor-executable instructions to cause the processor to: receive, from a sensor operatively coupled to the processor, a plurality of image frames containing a measured temporal joint data of a subject across at least two frames of image data from the sensor; compute a plurality of averaged convoluted quantitative identifiers based on the plurality of image frames; execute a trained multi-headed temporal profile machine-learning model to output a plurality of temporal joints interrelations using the plurality of averaged convoluted quantitative identifiers as an input, the trained multi-headed temporal profile machine-learning model trained using a temporal joints training set; and generate an aggregated temporal joints profile based on the temporal joints interrelations.
8. The non-transitory, processor-readable medium of claim 7, the non-transitory, processor-readable medium further causing the processor to receive, from a plurality of sensors, a plurality of video inputs, each sensor from the plurality of sensors pointing to the at least a subject from a different angle.
9. The non-transitory, processor-readable medium of claim 7, the non-transitory, processor-readable medium further stores instructions to cause the processor to extract a plurality of scaled frame interrelations based on a convolutional filter size prior to generating the aggregated temporal joints profile.
10. The non-transitory, processor-readable medium of claim 7, wherein the non-transitory, processor-readable medium further stores instructions to cause the processor to compute the plurality of averaged convoluted quantitative identifiers and generate the aggregated temporal joints profile simultaneously.
11. The non-transitory, processor-readable medium of claim 7, wherein the non-transitory, processor-readable medium further stores instructions to cause the processor to: generate a plurality of frame interrelations based on the plurality of averaged convoluted quantitative identifiers; and reduce sparsity of a plurality of joint interrelations between the plurality of frame interrelations of a temporal joints model using the plurality of averaged convoluted quantitative identifiers.
12. The non-transitory, processor-readable medium of claim 7, wherein the non-transitory, processor-readable medium further stores instructions to cause the processor to map the plurality of averaged convoluted quantitative identifiers to an output attention matrix of a plurality of output attention matrices, each output attention matrix representing a sequential pose prior to generating the aggregated temporal joints profile.
13. The non-transitory, processor-readable medium of claim 7, the non-transitory, processor-readable medium further causing the processor to determine an injury risk datum and a physical therapy improvement based on the aggregated temporal joints profile.
14. A non-transitory, processor-readable medium storing processor-executable instructions to cause the processor to: receive a temporal joints training set that includes a concatenated temporal profile correlated to a concatenated temporal sequence; and train a multi-headed temporal profile machine-learning model of a plurality of machine-learning models using the temporal joints training set, the multi-headed temporal profile machine-learning model, when executed, outputting a plurality of temporal joints interrelations using as an input a plurality of averaged convoluted quantitative identifiers computed based on a plurality of image frames containing a measured temporal joint data of a subject across at least two frames of image data from a sensor operatively coupled to the processor.
15. The non-transitory, processor-readable medium of claim 14, wherein the non-transitory, processor-readable medium further causing the processor to: train a spatial joint machine-learning model of the plurality of machine-learning models using a spatial joints training set containing a two-dimensional human pose correlated to a two-dimensional joints profile, the spatial joint machine-learning model, when executed, using the plurality of image frames as an input, the spatial joint machine-learning model outputs a plurality of joint localization overlays.
16. The non-transitory, processor-readable medium of claim 15, the non-transitory, processor-readable medium further causing the processor to: receive the plurality of joint localization overlays from the spatial joint machine-learning model; and train the limb segment machine-learning model using an interrelation training set that contains a limb segment image data correlated to a limb matrix in a motion sequence, the limb segment machine-learning model, when executed, using the plurality of joint localization overlays as an input and outputting the plurality of frame interrelations.
17. The non-transitory, processor-readable medium of claim 16, the non-transitory, processor-readable medium further causing the processor to: receive the plurality of frame interrelations from the limb segment machine-learning model; and train a temporal profile machine-learning model of the plurality of machine-learning models using a temporal sequence training set containing a temporal pose sequence correlated to a temporal joint sequence, the temporal profile machine-learning model, when executed, using the plurality of frame interrelations as an input and outputting a temporal joints profile.
18. The non-transitory, processor-readable medium of claim 14, wherein the plurality of machine-learning models includes deep learning.
19. The non-transitory, processor-readable medium of claim 14, wherein each image frame from the plurality of image frames contains a measured temporal joint data of the subject.
20. The non-transitory, processor-readable medium of claim 14, wherein the plurality of image frames are received from a plurality of sensors that point at the subject from a plurality of different angles.
PCT/US2023/021201 2022-05-10 2023-05-05 Methods and apparatus for human pose estimation from images using dynamic multi-headed convolutional attention WO2023219901A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US17/740,650 2022-05-10
US17/740,650 US11482048B1 (en) 2022-05-10 2022-05-10 Methods and apparatus for human pose estimation from images using dynamic multi-headed convolutional attention
US17/944,418 2022-09-14
US17/944,418 US20230368578A1 (en) 2022-05-10 2022-09-14 Methods and apparatus for human pose estimation from images using dynamic multi-headed convolutional attention

Publications (1)

Publication Number Publication Date
WO2023219901A1 true WO2023219901A1 (en) 2023-11-16

Family

ID=86710835

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/021201 WO2023219901A1 (en) 2022-05-10 2023-05-05 Methods and apparatus for human pose estimation from images using dynamic multi-headed convolutional attention

Country Status (1)

Country Link
WO (1) WO2023219901A1 (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LI WENHAO ET AL: "Exploiting Temporal Contexts With Strided Transformer for 3D Human Pose Estimation", IEEE TRANSACTIONS ON MULTIMEDIA, IEEE, USA, vol. 25, 7 January 2022 (2022-01-07), pages 1282 - 1293, XP011938381, ISSN: 1520-9210, [retrieved on 20220107], DOI: 10.1109/TMM.2022.3141231 *
LI WENHAO ET AL: "MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation", 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 29 April 2022 (2022-04-29), pages 1 - 15, XP093074988, ISBN: 978-1-6654-6946-3, Retrieved from the Internet <URL:https://arxiv.org/pdf/2111.12707v3.pdf> DOI: 10.1109/CVPR52688.2022.01280 *
ZHENG CE ET AL: "3D Human Pose Estimation with Spatial and Temporal Transformers", 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), IEEE, 10 October 2021 (2021-10-10), pages 11636 - 11645, XP034093479, DOI: 10.1109/ICCV48922.2021.01145 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23729206

Country of ref document: EP

Kind code of ref document: A1