EP4315282A1 - Systems and methods for computer recognition of 3D gesture movements - Google Patents

Systems and methods for computer recognition of 3D gesture movements

Info

Publication number
EP4315282A1
Authority
EP
European Patent Office
Prior art keywords
point
images
key
processor
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP22721841.9A
Other languages
German (de)
English (en)
Inventor
Martin FISCH
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Publication of EP4315282A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G06V20/647 Three-dimensional objects by matching two-dimensional images to three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/03 Recognition of patterns in medical or anatomical images
    • G06V2201/033 Recognition of patterns in medical or anatomical images of skeletal patterns

Definitions

  • the present disclosure generally relates to computer-based platforms/systems, improved computing devices/components and/or improved computing objects configured for automated computer recognition of three-dimensional (3D) gesture movements, including computer recognition of 3D gesture movements based on orientation key-points and 3D human poses.
  • 3D three-dimensional
  • Human pose key-points are typically defined as the major joint positions on the human skeleton. These key-points can correspond to major skeletal joints, and can include features such as eyes, ears, or nose. Identifying and separating the key-point mappings for multi-person images without mixing body parts from different individuals is a complex problem.
  • Single (Red, Green, Blue) RGB images and videos lack depth information, and images in the wild lack scale information or skeletal measurements. While 2D images can be annotated with 2D key-points, computing 3D key-point data is a more complex problem in part because these key-points lack important skeletal rotation information. Even more complex is the recognition of gestures in single- or multi-person images using 3D key-point data.
  • the techniques described herein relate to a system, including: a processor; and a memory storing instructions which, when executed by the processor, cause the processor to: receive a time-series of images depicting at least one subject at a plurality of time points; predict, for each image in the time-series of images, at least one orientation key-point associated with a section of a body part of the at least one subject via a neural network detector; compute, for each image in the time-series of images, a three-axis joint rotation associated with the section of the body part of the at least one subject based on at least one orientation key-point associated with the body part of the at least one subject and at least one joint key-point associated with the body part of the at least one subject; generate, for each image in the time-series of images, at least one feature including at least one of: the at least one orientation key-point, the at least one joint key-point, or the three-axis joint rotation; and predict at least one motion performed by the at least one subject via a motion recognition machine learning model based on the at least one feature of each image in the time-series of images.
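  • As a hedged illustration (not the patent's own algorithm), a three-axis joint rotation of the kind described above could be derived from one joint key-point and two orientation key-points by building an orthonormal frame; the function name, axis conventions, and offsets below are assumptions made for this sketch only.

```python
import numpy as np

def joint_rotation(joint, fwd_kp, side_kp):
    """Build a 3x3 rotation matrix for a bone section from one joint key-point
    and two orientation key-points rigidly attached to it.

    joint, fwd_kp, side_kp: (3,) arrays of 3D positions. fwd_kp is assumed to
    lie in the bone's forward direction and side_kp to its side (illustrative)."""
    x = fwd_kp - joint                  # forward axis
    x = x / np.linalg.norm(x)
    y = side_kp - joint                 # rough lateral axis
    y = y - np.dot(y, x) * x            # orthogonalize against x (Gram-Schmidt)
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)                  # third axis completes a right-handed frame
    return np.stack([x, y, z], axis=1)  # columns are the local axes

# Example: joint at the origin, orientation key-points offset forward and to the side.
R = joint_rotation(np.zeros(3), np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 0.0]))
print(R)  # an orthonormal rotation matrix with determinant +1
```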
  • the techniques described herein relate to a system, wherein the memory stores instructions which, when executed by the processor, further cause the processor to: train a neural network based on an orientation key-point from the at least one orientation key-point.
  • the techniques described herein relate to a system, wherein the memory stores instructions which, when executed by the processor, further cause the processor to: produce training data for a neural network to estimate a subject pose.
  • the techniques described herein relate to a system, wherein the memory stores instructions which, when executed by the processor, further cause the processor to: compute, for each image in the time-series of images, a change in position of the at least one orientation key-point over time; calculate, for each image in the time-series of images, a rotational velocity or acceleration associated with the at least one orientation key-point based on the change in position of the at least one orientation key-point; and generate, for each image in the time-series of images, at least one feature including the rotational velocity or acceleration.
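  • A minimal sketch of deriving velocity and acceleration from the change in an orientation key-point's position over the time-series, assuming a fixed frame interval; the shapes and the 30 fps value are illustrative assumptions.

```python
import numpy as np

def key_point_velocity_acceleration(positions, dt):
    """positions: (T, 3) array of one orientation key-point's 3D position per frame.
    Returns velocity (T-1, 3) and acceleration (T-2, 3) by finite differences."""
    velocity = np.diff(positions, axis=0) / dt
    acceleration = np.diff(velocity, axis=0) / dt
    return velocity, acceleration

# Synthetic 30-frame, 30 fps trajectory of a single orientation key-point.
t = np.linspace(0.0, 1.0, 30)
positions = np.stack([np.sin(t), np.cos(t), t], axis=1)
vel, acc = key_point_velocity_acceleration(positions, dt=1.0 / 30.0)
```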
  • the techniques described herein relate to a system, wherein the time-series of images is produced by at least one of a camera or a video camera.
  • the techniques described herein relate to a system, wherein the time-series of images includes depth information.
  • the techniques described herein relate to a system, wherein the at least one feature of each image includes at least one feature vector encoding at least one of: the at least one orientation key-point, the at least one joint key-point, or the three-axis joint rotation.
  • the techniques described herein relate to a system, wherein the at least one feature of each image includes at least one feature map including the at least one feature vector, wherein the at least one feature vector is a plurality of feature vectors and each feature vector of the plurality of feature vectors is associated with a particular section of a particular body part of the at least one subject.
  • the techniques described herein relate to a system, wherein the motion recognition machine learning model includes at least one temporal probabilistic model.
  • the techniques described herein relate to a system, wherein the at least one temporal probabilistic model includes at least one hierarchical temporal probabilistic model.
  • the techniques described herein relate to a method, including: receiving, by a processor, a time-series of images depicting at least one subject at a plurality of time points; predicting, by the processor for each image in the time-series of images, at least one orientation key-point associated with a section of a body part of the at least one subject via a neural network detector; computing, by the processor for each image in the time-series of images, a three-axis joint rotation associated with the section of the body part of the at least one subject based on at least one orientation key-point associated with the body part of the at least one subject and at least one joint key-point associated with the body part of the at least one subject; generating, by the processor for each image in the time-series of images, at least one feature including at least one of: the at least one orientation key-point, the at least one joint key-point, or the three-axis joint rotation; and predicting, by the processor, at least one motion performed by the at least one subject via a motion recognition machine learning model based on the at least one feature of each image in the time-series of images.
  • the techniques described herein relate to a method, further including: training, by the processor, a neural network based on an orientation key-point from the at least one orientation key-point.
  • the techniques described herein relate to a method, further including: producing, by the processor, training data for a neural network to estimate a subject pose.
  • the techniques described herein relate to a method, further including: computing, by the processor for each image in the time-series of images, a change in position of the at least one orientation key-point over time; calculating, by the processor for each image in the time-series of images, a rotational velocity or acceleration associated with the at least one orientation key-point based on the change in position of the at least one orientation key-point; and generating, by the processor for each image in the time-series of images, at least one feature including the rotational velocity or acceleration.
  • the techniques described herein relate to a method, wherein the time-series of images is produced by at least one of a camera or a video camera.
  • the techniques described herein relate to a method, wherein the time-series of images includes depth information.
  • the techniques described herein relate to a method, wherein the at least one feature of each image includes at least one feature vector encoding at least one of: the at least one orientation key-point, the at least one joint key-point, or the three-axis joint rotation.
  • the techniques described herein relate to a method, wherein the at least one feature of each image includes at least one feature map including the at least one feature vector, wherein the at least one feature vector is a plurality of feature vectors and each feature vector of the plurality of feature vectors is associated with a particular section of a particular body part of the at least one subject.
  • the techniques described herein relate to a method, wherein the motion recognition machine learning model includes at least one temporal probabilistic model.
  • the techniques described herein relate to a method, wherein the at least one temporal probabilistic model includes at least one hierarchical temporal probabilistic model.
  • FIGs. 1, 2A, 2B, 3, 4, 5A, 5B, 5C, 6A, 6B, 6C, 7, 8, 9, 10 and 11 show one or more schematic flow diagrams, certain computer-based architectures, and/or screenshots of various specialized graphical user interfaces which are illustrative of some exemplary aspects of at least some embodiments of the present disclosure.
  • the terms “and” and “or” may be used interchangeably to refer to a set of items in both the conjunctive and disjunctive in order to encompass the full description of combinations and alternatives of the items.
  • a set of items may be listed with the disjunctive “or”, or with the conjunction “and.” In either case, the set is to be interpreted as meaning each of the items singularly as alternatives, as well as any combination of the listed items.
  • Figures 1 through 11 illustrate systems and methods of action and gesture recognition captured by one or more imaging devices.
  • Action and gesture recognition are fields used to identify, recognize, and interpret human gestures, posture, gait, and behavior. This can be used for human-machine interfaces and input as well as monitoring, input, and feedback loops.
  • Many of the techniques rely on key-points represented in a 3D coordinate system and use the relative motion over time to identify a gesture, including measures of relative motion such as, e.g., velocity, acceleration, rotational velocity, rotational acceleration, or any other suitable measure of motion.
  • a system to recognize human motions by classifying gestures and actions based on the localization of human joints and detection of 3D human poses in terms of both position and full 3-axis rotations, using at least one red-green-blue (RGB) monocular image frame.
  • RGB red-green-blue
  • a machine learning model trained to identify human motions from 3D human poses and sequences thereof may predict or otherwise recognize a motion such as an action or gesture.
  • the system is enabled by a motion recognition model engine 118 that leverages sets of key-points representing the six-dimensional (6D) position of each joint to predict an action or gesture, or both, captured in sequences of RGB monocular image frames.
  • FIG. 1 illustrates an example of an implementation of orientation key-point-based motion recognition according to an illustrative embodiment of the present disclosure.
  • FIG. 1 conceptually illustrates a motion recognition system 100 for the implementation of a motion recognition model engine 118.
  • the motion recognition system 100 may include hardware components such as a processor(s) 109, which may include local or remote processing components.
  • the processor(s) 109 may include any type of data processing capacity, such as a hardware logic circuit, for example an application specific integrated circuit (ASIC) or programmable logic, or such as a computing device, for example, a microcomputer or microcontroller that includes a programmable microprocessor.
  • the processor(s) 109 may include data-processing capacity provided by the microprocessor.
  • the microprocessor may include memory, processing, interface resources, controllers, and counters.
  • the microprocessor may also include one or more programs stored in memory.
  • the motion recognition system 100 may include a storage device 101, such as local hard-drive, solid-state drive, flash drive, database or other local storage, or remote storage such as a server, mainframe, database, or cloud provided storage solution.
  • the motion recognition system 100 may be implemented as a computing device.
  • the motion recognition system 100 may include or be incorporated, partially or entirely into at least one personal computer (PC), laptop computer, ultra-laptop computer, tablet, touch pad, portable computer, handheld computer, palmtop computer, personal digital assistant (PDA), cellular telephone, combination cellular telephone/PDA, television, smart device (e.g., smart phone, smart tablet or smart television), mobile internet device (MID), messaging device, data communication device, and so forth.
  • PC personal computer
  • PDA personal digital assistant
  • MID mobile internet device
  • the motion recognition system 100 may be implemented across one or more servers and/or cloud platforms.
  • one or more servers for the motion recognition system 100 may include a service point which provides processing, database, and communication facilities.
  • server can refer to a single, physical processor with associated communications and data storage and database facilities, or it can refer to a networked or clustered complex of processors and associated network and storage devices, as well as operating software and one or more database systems and application software that support the services provided by the server. Cloud servers are examples.
  • the motion recognition system 100 may implement computer engines for the recognition of human performed motions as represented in a sequence of images. Accordingly, in some embodiments, the motion recognition system 100 may include computer engines such as, e.g., a detector engine 117 to detect 3D human poses in each image to output key-point vectors representing the 3D human poses, a motion recognition model engine 118 to utilize a machine learning model trained to ingest the key-point vectors and identify one or more motions in the sequence of images, and a command engine 119 to recognize a human/machine interface command and issue an instruction to a computing device according to the command, as sketched below.
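  • A minimal sketch of how the three engines could be composed into one pipeline; the callable signatures and class name are assumptions for illustration, not the patent's interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MotionPipeline:
    """Illustrative composition of the detector engine 117, motion recognition
    model engine 118 and command engine 119 as plain callables."""
    detect_key_points: Callable[[object], list]        # image -> per-image key-point vector
    classify_motion: Callable[[Sequence[list]], str]   # key-point vectors -> motion label
    to_command: Callable[[str], str]                    # motion label -> device command

    def run(self, images: Sequence[object]) -> str:
        vectors = [self.detect_key_points(img) for img in images]
        motion = self.classify_motion(vectors)
        return self.to_command(motion)
```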
  • the terms “computer engine” and “engine” identify at least one software component and/or a combination of at least one software component and at least one hardware component which are designed/programmed/configured to manage/control other software and/or hardware components (such as the libraries, software development kits (SDKs), objects, etc.).
  • Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth.
  • the one or more processors may be implemented as Complex Instruction Set Computer (CISC) or Reduced Instruction Set Computer (RISC) processors, x86 instruction set compatible processors, multi-core processors, or any other microprocessor or central processing unit (CPU).
  • the one or more processors may be dual-core processor(s), dual-core mobile processor(s), and so forth.
  • Examples of software may include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. Determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.
  • the bus 115 collectively represents system, peripheral, and/or chipset buses that communicatively connect the numerous internal devices of the motion recognition system 100.
  • the bus 115 communicatively connects the processor(s) 109 with the read-only memory 111, the system memory 103, and the storage device 101.
  • the processor(s) 109 can retrieve instructions to execute and/or data to process to perform the processes of the subject technology.
  • the processor(s) 109 can be a single processor or a multi core processor in different implementations.
  • the processor(s) 109 can be any suitable processor such as, for example, a general-purpose processor, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC) and/or other suitable hardware devices.
  • the read-only memory (ROM) 111 stores static data and instructions that are used by the processor(s) 109 and/or other modules of the compute device.
  • the storage device 101 is a read-and-write memory device. In some embodiments, this device is a non-volatile memory unit that stores instructions and data even when the motion recognition system 100 is disconnected from power.
  • a mass-storage device (for example a magnetic or optical disk and its corresponding disk drive) can be used as the storage device 101.
  • Other implementations can use removable storage devices (for example a flash drive, or other suitable type of removable storage devices) as the storage device 101.
  • the system memory 103 can be a read-and-write memory device. Unlike storage device 101, however, the system memory 103 is a volatile read-and-write memory, such as a random-access memory.
  • the system memory 103 stores some of the processor-executable instructions and data that the processor(s) 109 uses at runtime including processor-executable instructions to instantiate and maintain a detector engine 117, motion recognition model engine 118 or command engine 119 or any combination thereof, each of which is further described below.
  • the detector engine 117, motion recognition model engine 118 or command engine 119 or any combination thereof or any component or combination of components thereof can reside in the storage device 101.
  • states and/or properties of an instance of the detector engine 117, motion recognition model engine 118 or command engine 119 or any combination thereof can persist in non-volatile memory even when the motion recognition system 100 is disconnected from power.
  • the front-end synchronized application can be configured to automatically relaunch and synchronize (if required) when the motion recognition system 100 is reconnected to power.
  • the detector system can execute according to the last state of the detector engine 117, motion recognition model engine 118 or command engine 119 or any combination thereof stored in the storage device 101, and synchronization may be used for those elements of the motion recognition system 100 that have changed during the time the motion recognition system 100 was turned off.
  • the executable instructions to run the processes described herein on the motion recognition system 100 can be stored in the system memory 103, permanent storage device 101, and/or the read-only memory 111.
  • the various memory units can include instructions for the computing of orientation key-points including executable instructions to implement the detector engine 117, motion recognition model engine 118 or command engine 119 or any combination thereof in accordance with some implementations.
  • permanent storage device 101 can include processor executable instructions and/or code to cause the processor(s) 109 to instantiate a local instance of the detector engine 117 operatively coupled to a local instance of the motion recognition model engine 118.
  • Processor executable instructions can further cause processor(s) 109 to receive images or videos from non-local computing devices not shown in FIG. 1.
  • the processor(s) 109 coupled to one or more of memories 103 and 111, and storage device 101 receive an image depicting at least one subject.
  • the processor can predict at least one orientation key-point associated with a section of the body part of the at least one subject and compute a three-axis joint rotation via the detector engine 117.
  • a neural network detector may be configured to predict the at least one orientation key-point and a three-axis joint rotation engine may compute the three-axis joint rotation.
  • the orientation key-points can be associated with the section of the body part of the at least one subject based on at least one orientation key-point associated with the body part of the at least one subject and at least one joint key-point associated with the body part of the at least one subject.
  • the processor(s) 109 coupled to one or more of memories 103 and 111, and storage device 101 receive an image depicting at least one subject.
  • the processor(s) 109 may use the image to predict at least one orientation key- point associated with a section of a body part of the at least one subject.
  • the processor(s) 109 may use at least one orientation key-point to predict an aspect of a pose associated with the at least one subject based on the at least one orientation key-point. The aspect of the pose associated with the at least one subject can include a position, size, velocity, acceleration, and/or a movement associated with the at least one subject.
  • the processor(s) 109 may use the aspect of the pose of the at least one subject along with an aspect of a pose in at least one prior and/or subsequent image to recognize a motion based on a machine learning model.
  • the motion may be utilized to, e.g., display a motion label to a user via a computing device, e.g., via the output device interface 107, generate a command using the command engine 119 to cause at least one computing device to execute at least one instruction, determine a motion-related disorder or recommendation, among other applications.
  • the storage device 101 can include processor-executable instructions to render a graphical representation on a display comprising a motion classification identifying the motion captured in the sequence of images.
  • a graphical representation is indicative of a gesture and/or action of the at least one subject.
  • bus 115 can also couple the motion recognition system 100 to a network (not shown in FIG. 1) through a network interface 105.
  • the motion recognition system 100 can be part of a network of computers (for example a local area network (“LAN”), a wide area network (“WAN”), or an Intranet), or a network of networks, for example the Internet. Any or all components of the motion recognition system 100 can be used in conjunction with the embodiments described herein.
  • the bus 115 also connects to the input device interface 113 and output device interface 107.
  • the input device interface 113 enables the motion recognition system 100 to receive information or data, for example, images or video.
  • output devices that may be used with the output device interface 107 can include, for example, printers, audio devices (e.g., speakers), haptic output devices, and display devices (e.g., cathode ray tubes (CRT), liquid crystal displays (LCD), gas plasma displays, touch screen monitors, capacitive touchscreens, and/or other suitable display devices).
  • Some implementations include devices that function as both input and output devices (e.g., a touchscreen display).
  • FIG. 2A is a block diagram of another exemplary computer-based system for motion recognition based on detected key-points and 3D human poses in sequences of images in accordance with one or more embodiments of the present disclosure.
  • a sequence of images 202 may be produced by an imaging device 201.
  • the sequence of images 202 may include, e.g., a series of still photographs, a series of video frames, a series of light detection and ranging (LiDAR) or radio detection and ranging (RADAR) measurements, or any other representations of a field of view of the imaging device 201.
  • the sequence of images 202 may include, e.g., a suitable sequence indicia, such as, e.g., time-stamps, ordered counts, or any other suitable mechanism for marking a chronological order of the images 202 in the sequence.
  • a detector engine 117 may ingest the images 202, e.g., sequentially according to the chronological order.
  • the detector engine 117 may employ a neural network detector that is trained to predict the 3D location of a full set of key-points by predicting sets of one-dimensional heatmaps, significantly reducing the computation and memory complexity associated with volumetric heatmaps.
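  • One way such per-axis one-dimensional heatmaps could be decoded into a 3D key-point location is a soft-argmax along each axis; this is an assumption for illustration, not necessarily the patent's decoding method, and the bin count and extent are placeholders.

```python
import numpy as np

def decode_1d_heatmaps(hx, hy, hz, extent):
    """Decode one key-point's 3D location from three 1D heatmaps (one per axis)
    using a soft-argmax; `extent` maps bin indices into metric coordinates.
    Storing three N-bin vectors is far cheaper than an N^3 volumetric heatmap."""
    def soft_argmax(h):
        p = np.exp(h - h.max())
        p /= p.sum()
        return float(np.dot(p, np.arange(len(h))))
    n = len(hx)
    coords = np.array([soft_argmax(hx), soft_argmax(hy), soft_argmax(hz)])
    return coords / (n - 1) * extent   # scale bin index into metric space

# Three 64-bin heatmaps for one key-point, 2 m extent per axis (synthetic).
hx, hy, hz = (np.random.randn(64) for _ in range(3))
print(decode_1d_heatmaps(hx, hy, hz, extent=2.0))
```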
  • PPR person posture recognition
  • CGI Computer-Generated Imagery
  • a neural network detector determines 2D and 3D key-points related to the pose of a human from an image, an image providing depth information, or a video. Such key-points can then be post-processed to estimate the rotational pose of the human subject.
  • two feedforward neural networks can be implemented: for instance, a convolutional neural network for detection and a regression-based neural network with fully connected layers for adding depth ('lifting') and refining a pose.
  • Developing a model requires identifying and designing a suitable architecture, obtaining and preparing useful data from which to learn, training the model with the data, and validating the model.
  • Joint key-points correspond to skeletal joints and in some instances, can include features such as eyes, ears, or nose.
  • Orientation key-points refer to a set or sets of arbitrary points rigidly attached to a joint. They differ from dense pose correspondences in that orientation key-points do not correspond to a specific or recognizable body part but instead are rigidly anchored in specific directions from a joint (e.g., forward, or to a side).
  • Orientation key-points can be independent of a body shape.
  • unlike markers used in motion capture, orientation key-points include a freedom feature, e.g., they do not need to be on the body or a body part.
  • two sets of orientation key-points can be assigned to the lower left leg, both sets midway between knee and ankle, with one offset in a forward direction and another offset assigned to the outside (e.g., to the left for the left leg).
  • multiple offsets can be used, for instance 0.5 bone lengths, which for the lower leg implies points well off the body. Bone lengths as a unit have the benefit of being independent of the size of a subject and can be customized to the size of each limb. For some smaller bones, the distance can be increased, for example, to reduce the relative significance of detection errors.
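  • A hedged sketch of placing the two lower-leg orientation key-points described above, with offsets expressed in bone lengths so they scale with the subject; the function name, the forward-direction input, and the 0.5 bone-length default are illustrative assumptions.

```python
import numpy as np

def lower_leg_orientation_key_points(knee, ankle, forward, offset_bone_lengths=0.5):
    """Place two orientation key-points midway between knee and ankle, one offset
    in the bone's forward direction and one to its outside. `forward` is a unit
    vector giving the forward direction of the lower leg (assumed known)."""
    bone = ankle - knee
    bone_length = np.linalg.norm(bone)
    midpoint = knee + 0.5 * bone
    forward = forward / np.linalg.norm(forward)
    outside = np.cross(bone / bone_length, forward)   # perpendicular to bone and forward
    return (midpoint + offset_bone_lengths * bone_length * forward,
            midpoint + offset_bone_lengths * bone_length * outside)

# Left lower leg, knee 0.5 m above the ankle, subject facing +y.
fwd_kp, out_kp = lower_leg_orientation_key_points(
    knee=np.array([0.0, 0.0, 0.5]), ankle=np.array([0.0, 0.0, 0.0]),
    forward=np.array([0.0, 1.0, 0.0]))
```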
  • the detector engine 117 may produce a set of key-point vectors 203 representing the pose according to the orientation key-points for each image.
  • the detector engine 117 may include the neural network detector, or may capture motion capture key-point data, or both.
  • the per-image key-point vectors 203 may include, e.g., a separate vector of key-points for each image 202, an ordered concatenation of the key-points for the sequence of the images 202, a feature map including an array where each column or each row of the array is a separate vector of key-points, or other vector representation of the key-points representing the pose captured in each image 202.
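  • A minimal sketch of assembling such a per-image feature map, with one feature vector per body-part section; the layout (and the assumption that every section carries the same number of orientation key-points) is illustrative, not the patent's format.

```python
import numpy as np

def per_image_feature_map(joint_kps, orient_kps, rotations):
    """Assemble one feature vector per body-part section and stack them into a
    feature map (rows = sections). joint_kps and orient_kps map a section name
    to (3,) and (k, 3) arrays; rotations maps it to a (3, 3) matrix."""
    rows = []
    for section in sorted(joint_kps):
        rows.append(np.concatenate([
            joint_kps[section].ravel(),
            orient_kps[section].ravel(),
            rotations[section].ravel(),
        ]))
    return np.stack(rows)   # shape: (num_sections, features_per_section)
```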
  • motion recognition model engine 118 may ingest the per-image key-point vectors 203 from the detector engine 117.
  • orientation key-points can be used for action recognition and gesture recognition.
  • Conventional skeletal joint points lack rotational information, leaving one degree of freedom undefined.
  • Orientation key-points provide additional information by uniquely defining the full 6D position of each joint, which can then be used to identify gestures and actions more accurately.
  • because orientation key-points are represented by points in 3D space, their movement over time is continuous and uniquely defined, unlike alternative rotational representations (such as quaternions or Euler angles).
  • the motion recognition model engine 118 may utilize a motion recognition model 218 that relies on relative motion over time using a continuous representation to improve accuracy and robustness for many techniques analyzing full skeletal motion over time.
  • the motion recognition model 218 may include a suitable machine learning model for processing a continuous time-series representation of motion, including the sequence of per-image key-point vectors 203 representing poses across the ordered sequence of images 202.
  • the motion recognition model 218 may include one or more exemplary AI/machine learning techniques chosen from, but not limited to, decision trees, boosting, support-vector machines, neural networks, nearest neighbor algorithms, Naive Bayes, bagging, random forests, and the like.
  • an exemplary neural network technique may be one of, without limitation, a feedforward neural network, radial basis function network, recurrent neural network, convolutional network (e.g., U-net), or other suitable network.
  • an exemplary implementation of Neural Network may be executed as follows: a. define Neural Network architecture/model, b. transfer the input data to the exemplary neural network model, c. train the exemplary model incrementally, d. determine the accuracy for a specific number of timesteps, e. apply the exemplary trained model to process the newly-received input data, and f. optionally and in parallel, continue to train the exemplary trained model with a predetermined periodicity.
  • the exemplary trained neural network model may specify a neural network by at least a neural network topology, a series of activation functions, and connection weights.
  • the topology of a neural network may include a configuration of nodes of the neural network and connections between such nodes.
  • the exemplary trained neural network model may also be specified to include other parameters, including but not limited to, bias values/functions and/or aggregation functions.
  • an activation function of a node may be a step function, sine function, continuous or piecewise linear function, sigmoid function, hyperbolic tangent function, or other type of mathematical function that represents a threshold at which the node is activated.
  • the exemplary aggregation function may be a mathematical function that combines (e.g., sum, product, etc.) input signals to the node.
  • an output of the exemplary aggregation function may be used as input to the exemplary activation function.
  • the bias may be a constant value or function that may be used by the aggregation function and/or the activation function to make the node more or less likely to be activated.
  • the motion recognition model 218 may utilize supervised machine learning, unsupervised machine learning, semi-supervised machine learning, or any combination thereof.
  • an illustrative supervised machine learning model may include, e.g., one or more neural networks.
  • the per-image key-point vectors 203 may represent a chronological order of poses.
  • the per-image key-point vectors 203 may represent time-series data.
  • the motion recognition model 218 may employ one or more neural networks for time-series description based on the time-series data to identify motions, such as gestures and actions, formed by the poses.
  • Some examples may include, e.g., recurrent neural networks employing, e.g., long-short term memory (LSTM), vector autoregressive models, Autoregressive Conditional Heteroscedasticity (ARCH) models, Autoregressive Integrated Moving Average (ARIMA) models, or other suitable supervised learning models for time-series analysis.
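  • A minimal sketch of a recurrent (LSTM-based) motion classifier that consumes a time-series of per-image key-point feature vectors and emits scores over motion classes; all sizes and the class name are placeholder assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class MotionLSTM(nn.Module):
    """Sketch of a recurrent motion classifier over per-image key-point vectors."""
    def __init__(self, feature_dim=90, hidden_dim=128, num_motions=10):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_motions)

    def forward(self, x):               # x: (batch, time, feature_dim)
        _, (h_n, _) = self.lstm(x)      # h_n: (1, batch, hidden_dim)
        return self.head(h_n[-1])       # one score vector per sequence

# One batch of 4 sequences, 30 frames of 90-dimensional key-point vectors.
scores = MotionLSTM()(torch.randn(4, 30, 90))   # shape: (4, 10)
```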
  • LSTM long-short term memory
  • ARCH Autoregressive Conditional Heteroscedasticity
  • ARIMA Autoregressive Integrated Moving Average
  • the motion recognition model 218 may include one or more semi-supervised or unsupervised learning models to recognize motions, such as actions or gestures, from the poses of the images 202 represented by the per-image key-point vectors 203.
  • the motion recognition model 218 may include, e.g., temporal probabilistic models, generative models such as the Hidden Markov Model (HMM) and the more generally formulated Dynamic Bayesian Networks (DBN), Discriminative models such as Conditional Random Fields (CRF), among others or any combination thereof.
  • HMM Hidden Markov Model
  • DBN Dynamic Bayesian Networks
  • CRF Conditional Random Fields
  • temporal probabilistic models such as the hidden Markov model (HMM) and conditional random fields (CRF) model directly model the correlations between the activities and the observed sensor data, such as the per-image key-point vectors 203.
  • the temporal probabilistic models may be configured as one or more hierarchical models which take into account the rich hierarchical structure that exists in human behavioral data. Such hierarchical models do not directly correlate the activities with the sensor data, but instead break the activity into sub-activities (sometimes referred to as actions) and model the underlying correlations accordingly.
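  • As a hedged illustration of the temporal probabilistic approach, the forward algorithm below scores a discretized observation sequence under one HMM; one such model could be trained per gesture and the most likely model selected. The model parameters and the observation encoding are assumptions for this sketch.

```python
import numpy as np

def hmm_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log-likelihood of a discrete observation
    sequence under an HMM with start probabilities `start` (S,), transition
    matrix `trans` (S, S) and emission matrix `emit` (S, O)."""
    alpha = start * emit[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]    # propagate and weight by emission
        log_p += np.log(alpha.sum())            # accumulate the scaling factors
        alpha = alpha / alpha.sum()
    return log_p

# Two-state toy model scored on a short symbol sequence.
ll = hmm_log_likelihood(
    obs=[0, 1, 1, 0],
    start=np.array([0.6, 0.4]),
    trans=np.array([[0.7, 0.3], [0.4, 0.6]]),
    emit=np.array([[0.9, 0.1], [0.2, 0.8]]))
```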
  • semi-supervised and/or unsupervised models of the motion recognition model 218 may include vision-based models, such as, e.g., optical flow, Kalman filtering, hidden Markov models, among others or any combination thereof.
  • the motion recognition model 218 may produce descriptive labelling for the images 202 including motion classifications 204.
  • the motion classifications 204 may include one or more labels or annotations that identify a motion, motion type, or other descriptor of the motion formed by the sequence of poses in the images 202.
  • the motion classifications 204 may include a file, data object, metadata label for one or more of the images 202, text string, vector, array, data table entry, or other type of data.
  • the motion classifications 204 are structured data, such as, e.g., label, vector, tuple, array, value, etc.
  • the command engine 119 may ingest a motion classification 204 to identify an associated software instruction to cause a computing device 206 to execute processes associated with an action or gesture identified in the images 202.
  • the command engine 119 may match an action and/or gesture of the motion classification 204 to a device command 205.
  • the command engine 119 may include or be in communication with a command library, e.g., stored in a storage device such as the storage device 101.
  • the command library may return device commands 205 based on a database query in a suitable database query language, using a look-up table, an index, a heuristic search, or via any other suitable storage referencing and search technique.
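  • A command library of the kind described above could be as simple as a mapping from recognized motion labels to standardized device commands; the entries and names below are placeholders, not commands defined by the patent.

```python
from typing import Optional

# Placeholder command library mapping motion classifications 204 to device commands 205.
COMMAND_LIBRARY = {
    "wave": "greet_user",
    "swipe_left": "previous_page",
    "swipe_right": "next_page",
}

def to_device_command(motion_classification: str) -> Optional[str]:
    """Look up the device command for a motion classification; returns None
    when no command is registered for the recognized motion."""
    return COMMAND_LIBRARY.get(motion_classification)
```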
  • the command engine 119 may be employed to enable a human/machine interface for input to the computing device 206.
  • the command engine 119 may enable an event-based action recognition controller for the computing device.
  • the command engine 119 may utilize the motion classifications 204 to leverage recognized actions and gestures, generate a standardized list of events for one or more software programs (e.g., based on the command library described above) and send the standardized list of events to the one or more software programs of the computing device 206 to cause software events such as program inputs, interactions, user interface interactions and/or selections, user interface navigation, game control, password inputs, home automation control, etc.
  • the software programs of the computing device 206 need not be specially coded for action and/or gesture based input but may simply receive the standardized events via the device commands 205 from the command engine 119.
  • the computing device 206 may be a separate local or remote device in communication with the motion recognition system 100, e.g., via the output device interface 107 and/or the input device interface 113 and/or the network interface 105.
  • the motion recognition system 100 is incorporated into the computing device 206 such that, e.g., the processor(s) 109 and storage device 101 and other components are hardware of the computing device 206.
  • FIG. 2B is a block diagram of another exemplary computer-based system for motion recognition based on detected key-points and 3D human poses in sequences of images in accordance with one or more embodiments of the present disclosure.
  • a sequence of images 202 may be produced by an imaging device 201.
  • the sequence of images 202 may include, e.g., a series of still photographs, a series of video frames, a series of light detection and ranging (LiDAR) or radio detection and ranging (RADAR) measurements, or any other representations of a field of view of the imaging device 201.
  • the sequence of images 202 may include, e.g., a suitable sequence indicia, such as, e.g., time-stamps, ordered counts, or any other suitable mechanism for marking a chronological order of the images 202 in the sequence.
  • a detector engine 117 may ingest the images 202, e.g., sequentially according to the chronological order.
  • the detector engine 117 may employ a neural network detector that is trained to predict the 3D location of a full set of key-points by predicting sets of one-dimensional heatmaps, significantly reducing the computation and memory complexity associated with volumetric heatmaps.
  • PPR person posture recognition
  • CGI Computer-Generated Imagery
  • a neural network detector determines 2D and 3D key-points related to the pose of a human from an image, an image providing depth information, or a video. Such key-points can then be post-processed to estimate the rotational pose of the human subject.
  • two feedforward neural networks can be implemented: for instance, a convolutional neural network for detection and a regression-based neural network with fully connected layers for adding depth ('lifting') and refining a pose.
  • Developing a model requires identifying and designing a suitable architecture, obtaining and preparing useful data from which to learn, training the model with the data, and validating the model.
  • Joint key-points correspond to skeletal joints and in some instances, can include features such as eyes, ears, or nose.
  • Orientation key-points refer to a set or sets of arbitrary points rigidly attached to a joint. They differ from dense pose correspondences in that orientation key-points do not correspond to a specific or recognizable body part but instead are rigidly anchored in specific directions from a joint (e.g., forward, or to a side).
  • Orientation key-points can be independent of a body shape.
  • unlike markers used in motion capture, orientation key-points include a freedom feature, e.g., they do not need to be on the body or a body part.
  • two sets of orientation key-points can be assigned to the lower left leg, both sets midway between knee and ankle, with one offset in a forward direction and another offset assigned to the outside (e.g., to the left for the left leg).
  • multiple offsets can be used, for instance 0.5 bone lengths, which for the lower leg implies points well off the body. Bone lengths as a unit have the benefit of being independent of the size of a subject and can be customized to the size of each limb. For some smaller bones, the distance can be increased, for example, to reduce the relative significance of detection errors.
  • the detector engine 117 may produce a set of key-point vectors 203 representing the pose according to the orientation key-points for each image.
  • the detector engine 117 may include the neural network detector, or may capture motion capture key-point data, or both.
  • the per-image key-point vectors 203 may include, e.g., a separate vector of key-points for each image 202, an ordered concatenation of the key-points for the sequence of the images 202, a feature map including an array where each column or each row of the array is a separate vector of key-points, or other vector representation of the key-points representing the pose captured in each image 202.
  • motion recognition model engine 118 may ingest the per-image key-point vectors 203 from the detector engine 117.
  • orientation key-points can be used for action recognition and gesture recognition.
  • Conventional skeletal joint points lack rotational information, leaving one degree of freedom undefined.
  • Orientation key-points provide additional information by uniquely defining the full 6D position of each joint, which can then be used to identify gestures and actions more accurately.
  • because orientation key-points are represented by points in 3D space, their movement over time is continuous and uniquely defined, unlike alternative rotational representations (such as quaternions or Euler angles).
  • the motion recognition model engine 118 may utilize a motion recognition model 218 that relies on relative motion over time using a continuous representation to improve accuracy and robustness for many techniques analyzing full skeletal motion over time.
  • the motion recognition model 218 may include a suitable machine learning model for processing a continuous time-series representation of motion, including the sequence of per-image key-point vectors 203 representing poses across the ordered sequence of images 202.
  • the motion recognition model 218 may include one or more exemplary AI/machine learning techniques chosen from, but not limited to, decision trees, boosting, support-vector machines, neural networks, nearest neighbor algorithms, Naive Bayes, bagging, random forests, and the like.
  • an exemplary neural network technique may be one of, without limitation, a feedforward neural network, radial basis function network, recurrent neural network, convolutional network (e.g., U-net), or other suitable network.
  • an exemplary implementation of Neural Network may be executed as follows: a. define Neural Network architecture/model, b. transfer the input data to the exemplary neural network model, c. train the exemplary model incrementally, d. determine the accuracy for a specific number of timesteps, e. apply the exemplary trained model to process the newly-received input data, and f. optionally and in parallel, continue to train the exemplary trained model with a predetermined periodicity.
  • the exemplary trained neural network model may specify a neural network by at least a neural network topology, a series of activation functions, and connection weights.
  • the topology of a neural network may include a configuration of nodes of the neural network and connections between such nodes.
  • the exemplary trained neural network model may also be specified to include other parameters, including but not limited to, bias values/functions and/or aggregation functions.
  • an activation function of a node may be a step function, sine function, continuous or piecewise linear function, sigmoid function, hyperbolic tangent function, or other type of mathematical function that represents a threshold at which the node is activated.
  • the exemplary aggregation function may be a mathematical function that combines (e.g., sum, product, etc.) input signals to the node.
  • an output of the exemplary aggregation function may be used as input to the exemplary activation function.
  • the bias may be a constant value or function that may be used by the aggregation function and/or the activation function to make the node more or less likely to be activated.
  • the motion recognition model 218 may utilize supervised machine learning, unsupervised machine learning, semi-supervised machine learning, or any combination thereof.
  • an illustrative supervised machine learning model may include, e.g., one or more neural networks.
  • because the per-image key-point vectors 203 represent a chronological order of poses, the per-image key-point vectors 203 may represent time-series data.
  • the motion recognition model 218 may employ one or more neural networks for time-series description based on the time-series data to identify motions, such as gestures and actions, formed by the poses.
  • Some examples may include, e.g., recurrent neural networks employing, e.g., long-short term memory (LSTM), vector autoregressive models, Autoregressive Conditional Heteroscedasticity (ARCH) models, Autoregressive Integrated Moving Average (ARIMA) models, or other suitable supervised learning models for time-series analysis.
  • LSTM long-short term memory
  • ARCH Autoregressive Conditional Heteroscedasticity
  • ARIMA Autoregressive Integrated Moving Average
  • the motion recognition model 218 may include one or more semi-supervised or unsupervised learning models to recognize motions, such as actions or gestures, from the poses of the images 202 represented by the per-image key-point vectors 203.
  • the motion recognition model 218 may include, e.g., temporal probabilistic models, generative models such as the Hidden Markov Model (HMM) and the more generally formulated Dynamic Bayesian Networks (DBN), Discriminative models such as Conditional Random Fields (CRF), among others or any combination thereof.
  • HMM Hidden Markov Model
  • DBN Dynamic Bayesian Networks
  • CRF Conditional Random Fields
  • temporal probabilistic models such as the hidden Markov model (HMM) and conditional random fields (CRF) model directly model the correlations between the activities and the observed sensor data, such as the per-image key-point vectors 203.
  • the temporal probabilistic models may be configured as one or more hierarchical models which take into account the rich hierarchical structure that exists in human behavioral data. Such hierarchical models do not directly correlate the activities with the sensor data, but instead break the activity into sub-activities (sometimes referred to as actions) and model the underlying correlations accordingly.
  • semi-supervised and/or unsupervised models of the motion recognition model 218 may include vision-based models, such as, e.g., optical flow, Kalman filtering, hidden Markov models, among others or any combination thereof.
  • the motion recognition model 218 may produce descriptive labelling for the images 202 including motion classifications 204.
  • the motion classifications 204 may include one or more labels or annotations that identify a motion, motion type, or other descriptor of the motion formed by the sequence of poses in the images 202.
  • the motion classifications 204 may include a file, data object, metadata label for one or more of the images 202, text string, vector, array, data table entry, or other type of data.
  • the motion classifications 204 are structured data, such as, e.g., label, vector, tuple, array, value, etc.
  • the motion recognition model 218 may predict the gesture or action performed by the subject. For example, where a human subject captured in the images 202 makes a hello wave hand gesture, the detector engine 117 may produce the per-image key-point vectors 203 including estimates of the position of the subject's arm and fingers and their change relative to the body, expressed as a series of joint and orientation key-points. The motion recognition model 218 may use the key-points represented in the per-image key-point vectors 203 to decide the human subject has just "waved" instead of, for example, "punched".
  • the motion recognition model 218 may recognize techniques in one or more sports, such as, e.g., tennis.
  • the motion recognition model 218 may use the change in key-points over a period of time measured by the sequence of images 202, such as, e.g., a one-second interval, and determine that a tennis-playing subject has hit a topspin forehand instead of a slice or backhand.
  • the use of key-points, including orientation key-points, in the per-image key-point vectors 203 enables the motion recognition model 218 to differentiate between the two because rotation is a key difference between a slice and a topspin.
  • the motion classification 204 may be provided as an indication of a gesture and/or action 207 captured in the images 202.
  • a computing device 206 may receive the indication of the gesture/action 207 and, e.g., display the indication via a visualization layer for forming a user interface.
  • the indication may be represented as text or as an animation.
  • the computing device 206 may use the indication of the gesture/action 207 to specify that a subject in the images 202 used a hand movement signifying a hello wave gesture instead of, for example, a punch.
  • the computing device 206 may depict that, e.g., a tennis stroke is a topspin forehand instead of a slice or backhand.
  • the computing device 206 may provide motion-based analysis and representations for a user.
  • the motion recognition model engine 118 recognizes actions and/or gestures and sends a code indicative of the action or gesture over a wire or network connection. Rather than just predicting the current state of the subject, the motion recognition model engine 118 determines a discrete point in time that represents a completed action, such as "Punch", "Chop", or "Salute", and announces the event, in addition to or instead of a current state such as "Fighting".
  • the controller interface listens to the software and alerts a connected program or operating system on the computing device 206 via standard notification or event protocols that an action and/or gesture has been made.
  • the controller interface may control the connected program and/or the operating system on the computing device 206, or any other software and/or hardware in communication with, or otherwise associated with, the computing device 206, to perform at least one operation in response to the action or gesture.
  • the at least one operation can include, e.g., controlling an avatar in a virtual environment (e.g., in a video game, an augmented and/or virtual reality environment, etc.), controlling user interface elements (e.g., mouse cursor, keyboard, touch display touch targets, scrolling, moving windows, moving virtual objects in a software program such as computer aided design or graphics applications, or other selection and/or movement and/or control), or other suitable software and/or hardware control or any combination thereof.
  • the computing device 206 may be a separate local or remote device in communication with the motion recognition system 100, e.g., via the output device interface 107 and/or the input device interface 113 and/or the network interface 105.
  • the motion recognition system 100 is incorporated into the computing device 206 such that, e.g., the processor(s) 109 and storage device 101 and other components are hardware of the computing device 206.
  • FIG. 3 is a block diagram of another exemplary computer-based system for training a motion recognition model for motion recognition based on detected key -points and 3D human poses in sequences of images in accordance with one or more embodiments of the present disclosure.
  • the motion recognition model engine 118 may utilize the motion recognition model 318 to predict a motion classification 303 for the per-image key-point vector 301 associated with the user's account, e.g., the per-image key-point vector 301 as described above with reference to FIG. 3.
  • the motion recognition model 318 ingests the per-image key-point vector 301 and produces a prediction of a motion classification 303 for each per-image key-point vector 301.
  • the motion recognition model 318 may include a machine learning model including a classification model, such as, e.g., a convolutional neural network (CNN), a Naive Bayes classifier, decision trees, random forest, support vector machine (SVM), K-Nearest Neighbors, or any other suitable algorithm for a classification model.
  • the motion recognition model 318 may advantageously include a random forest classification model.
  • the per-image key-point vector 301 may include key-point data from a motion capture system using, e.g., condensed tracking data.
  • the key-points include key-points detected, e.g., via the detector engine 117 described above.
• the per-image key-point vector 301 may include a combination of motion capture system key-point data and key-points detected via the detector engine 117.
  • the motion recognition model 318 ingests a per-image key-point vector 301 and processes the attributes encoded therein using the classification model to produce a model output vector.
  • the model output vector may be decoded to generate a label including the motion classification 303.
  • the model output vector may include or may be decoded to reveal a numerical output, e.g., a probability value between 0 and 1.
• the probability value may indicate a degree of probability that the sequence of images of the per-image key-point vector 301 includes a particular action and/or gesture or other motion.
  • the motion recognition model 318 may test the probability value against a probability threshold, where a probability value greater than the probability threshold indicates, e.g., that the sequence of images includes a particular action and/or gesture or other motion, or that the sequence of images does not include a particular action and/or gesture or other motion.
  • the probability threshold can be, e.g., greater than 0.5, greater than 0.6, greater than 0.7, greater than 0.8, greater than 0.9, or other suitable threshold value.
  • the motion recognition model 318 may produce the motion classification 303 based on the probability value and the probability threshold.
  • the motion classification 303 may include a classification as a particular action and/or gesture or other motion where the probability value is greater than the probability threshold.
• the motion recognition model 318 may be configured such that the motion classification 303 may include a classification as not a particular action and/or gesture or other motion where the probability value is greater than the probability threshold.
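The probability-threshold test described above can be illustrated with a minimal sketch; the threshold value and label names below are illustrative assumptions.

```python
# Minimal sketch of testing a model output probability against a threshold.
def classify_motion(probability: float, threshold: float = 0.7,
                    label: str = "Punch") -> str:
    """Return the motion label when the probability exceeds the threshold,
    otherwise report that no motion of that class was detected."""
    return label if probability > threshold else f"not {label}"

print(classify_motion(0.83))   # "Punch"
print(classify_motion(0.41))   # "not Punch"
```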
  • the motion classification 303 may be provided to a computing device, e.g., as an indication of the motion, as a software program command and/or event, etc.
  • the motion classification 303 of the motion may trigger the computing device to generate and display a user interface for displaying to a user the motion classification 303, the software event, the performed software command, or any combination thereof.
• the motion recognition model 318 may be trained based on the motion classification 303 and an image sequence label 302 associated with the per-image key-point vectors 301.
• the image sequence label 302 includes a pre-defined (e.g., via human annotation) annotation labeling the action and/or gesture performed in the images associated with the per-image key-point vectors 301.
  • the parameters of the motion recognition model 318 may be updated to improve the accuracy of the motion recognition.
  • training is performed using the optimizer 320.
• the motion classification 303 is fed back to the optimizer 320.
  • the optimizer 320 may also ingest the image sequence label 302.
  • the optimizer 320 may employ a loss function, such as, e.g., Hinge Loss, Multi-class SVM Loss, Cross Entropy Loss, Negative Log Likelihood, or other suitable classification loss function.
• the loss function determines an error based on the image sequence label 302 and the motion classification 303.
  • the optimizer 320 may, e.g., backpropagate the error to the motion recognition model 318 to update the parameters using, e.g., gradient descent, heuristic, convergence or other optimization techniques and combinations thereof.
• the optimizer 320 may therefore train the parameters of the motion recognition model 318 to more accurately recognize actions and/or gestures based on feedback including the image sequence label 302. As a result, the motion recognition model 318 may be continually trained and optimized based on such feedback.
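As a minimal sketch of one optimizer step, assuming a neural-network variant of the motion recognition model trained with cross-entropy loss and backpropagation as described above, the following PyTorch snippet is illustrative only; the tiny multilayer perceptron, tensor shapes, and learning rate are assumptions rather than the disclosed architecture.

```python
# Minimal sketch of one training step: forward pass, loss against the
# annotated labels, backpropagation, and a gradient-descent parameter update.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(75, 64), nn.ReLU(), nn.Linear(64, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

key_point_vectors = torch.randn(32, 75)   # batch of per-image key-point vectors
labels = torch.randint(0, 3, (32,))       # image sequence labels (class indices)

logits = model(key_point_vectors)         # forward pass
loss = loss_fn(logits, labels)            # error vs. the annotated labels
optimizer.zero_grad()
loss.backward()                           # backpropagate the error
optimizer.step()                          # update parameters via gradient descent
```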
• FIG. 4 illustrates an example of detection of key-points, including orientation key-points, according to an illustrative embodiment of the present disclosure. Circles located within the skeletal structure shown in FIG. 4 represent joint key-points (e.g., joint key-point 401). Orientation key-points such as 403 represent a forward direction from the center of a section of a body part.
  • Orientation key-points such as 405 represent an outward direction from the center of a section of a body part, for example, to the left for left side limbs and trunk, and to the right for right side limbs.
• Orientation key-points such as 407 represent an inward direction from the center of a section of a body part.
  • Orientation key-points such as 409 represent a backward direction from the center of a section of the body part.
• Other directions that can be represented by orientation key-points can include a higher direction from the center of a section of the body part, a lower direction from the center of a section of the body part, and/or other suitable directions with respect to the center of a section of the body part.
  • orientation key-points can be located outside the sections describing the skeletal model of a subject.
• training of the motion recognition model 218 and/or motion recognition model 318 can be based on weakly supervised learning, using one or more metrics to provide feedback during training. For example, the greater availability of 2D human pose annotations compared to 3D annotations can enable weakly supervised training by re-projecting predicted 3D poses into 2D and comparing the reprojection to the 2D annotation.
• one or more of supervised training, intermediate supervision, or weak supervision can be employed individually or in any combination. For instance, supervised training by itself, weakly supervised training by itself, or a combination of supervised learning with intermediate supervision can be used.
  • each neural network node can be a function f (x) which transforms an input vector x into an output value.
  • the input vectors can have any number of elements, often organized in multiple dimensions.
• a network chains different functions f, g, and h to produce a final output y, where y = f(g(h(x))).
• each intermediate layer can have many nodes, and the number of elements and input shapes can vary.
  • some functions computed within the neural network node can include:
• normalization: there are a variety of normalization functions, including batch normalization, which transforms a batch of different samples by the sample mean value and sample standard deviation. Batch normalization can speed up training by maintaining a broadly steady range of outputs even as the inputs and weights change. For inference, values can be frozen based on the mean and standard deviation of the training set.
• SoftMax layer: this technique rescales input neurons by taking their exponential values and dividing by the sum of these values. This means all values sum to one, approximating a probability distribution. Due to the exponential function, higher values will be accentuated, and the resulting distribution will be leptokurtic.
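A minimal NumPy sketch of the two layer functions just described follows; the array shapes and the epsilon constant are illustrative assumptions.

```python
# Minimal sketch of batch normalization and softmax.
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize a batch (rows = samples) by its mean and standard deviation."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def softmax(x):
    """Rescale values by exponentials divided by their sum along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=-1, keepdims=True)

batch = np.random.randn(8, 4)
print(batch_norm(batch).mean(axis=0))       # approximately zero per feature
print(softmax(np.array([1.0, 2.0, 3.0])))   # sums to 1, largest value accentuated
```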
  • the neural network can be implemented as a feedforward neural network.
• in a feedforward neural network, the data flows from the input to the output, layer by layer, without looping back, e.g., the outputs of the neural network may not provide feedback for the calculations. This flow is called a forward pass and, depending on the size of the neural network, can represent millions of calculations for a single input sample.
  • loss refers to the amount of error in the neural network model, with the goal of learning generally to minimize the loss.
• loss is most often the mean squared error. This measures, for some sample of data, the average squared difference between the predicted values and the actual values. Large outlier losses are particularly penalized with this measure, and its popularity stems from its simplicity, mathematical convenience, and prevalence in statistical analysis.
  • Another alternative is the mean absolute error, which does not highly weight large errors.
  • the embodiments described herein can be implemented using one or more loss functions including mean squared error, mean absolute error, or other suitable loss function.
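A minimal NumPy sketch of the two loss functions mentioned above follows; the sample values are illustrative.

```python
# Minimal sketch of mean squared error and mean absolute error.
import numpy as np

def mean_squared_error(y_pred, y_true):
    """Average of squared differences; penalizes large outlier errors heavily."""
    return np.mean((y_pred - y_true) ** 2)

def mean_absolute_error(y_pred, y_true):
    """Average of absolute differences; less sensitive to large errors."""
    return np.mean(np.abs(y_pred - y_true))

pred = np.array([0.9, 2.1, 10.0])
true = np.array([1.0, 2.0, 3.0])
print(mean_squared_error(pred, true))    # dominated by the outlier (10 vs 3)
print(mean_absolute_error(pred, true))   # weights the outlier only linearly
```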
  • a stochastic gradient descent procedure can be applied to converge toward an optimal solution.
  • the method is stochastic because data is randomly shuffled and fed to the current state of a neural network model.
• the gradient is the partial derivative of the loss with respect to the neural network model parameters, and at each iteration the parameters can be updated by a fraction of the gradient, e.g., scaled by the learning rate. Accordingly, the values of the parameters progress toward values which minimize the loss for the training data at each repeated iteration.
• the neural network model can be configured through backpropagation. This means that each time training data passes through the model, a function calculates a measure of loss based on the resulting predictions. From the resulting loss the gradient of the final layer can then be derived, and consequently each previous layer's gradient can be derived. This continues until the beginning of the neural network model, and then the complete gradients are used to update the model weights, as in stochastic gradient descent.
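The stochastic gradient descent update described above can be sketched with a one-parameter linear model; the model, data, and learning rate below are illustrative assumptions.

```python
# Minimal sketch of stochastic gradient descent: shuffle the data, compute the
# gradient of the loss with respect to the parameter, and move the parameter
# by a learning-rate-scaled fraction of that gradient.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # true slope is 3.0

w, lr = 0.0, 0.05
for _ in range(50):
    idx = rng.permutation(len(x))               # random shuffling (stochastic)
    for i in idx:
        grad = 2 * (w * x[i] - y[i]) * x[i]     # d/dw of the squared error
        w -= lr * grad                          # learning-rate-scaled step
print(w)                                        # converges toward 3.0
```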
  • training can become impaired as the gradient of neurons in middle layers may approach zero. This can limit the ability of the neural network model to learn as weights cease to update when the gradient nears zero.
• Rectified Linear Units (ReLUs) are less susceptible to the vanishing gradient than other activation functions such as sigmoid, as the derivative only changes when the activation is negative. Rectified Linear Units can be used as the principal activation function. Residual connections allow layers to pass data forward and focus on modifying the data only by applying additive changes (e.g., residuals), and can be used to develop deeper networks without vanishing gradients.
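A minimal PyTorch sketch of a residual block built from ReLU activations, one common way to realize the residual connections described above, follows; the layer sizes are illustrative assumptions.

```python
# Minimal sketch of a residual block: the block learns only an additive
# change (residual) to its input, so gradients can flow through the identity path.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.body(x))   # identity path plus learned residual

block = ResidualBlock()
print(block(torch.randn(2, 64)).shape)       # torch.Size([2, 64])
```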
  • the neural network model can be implemented as a convolutional neural network.
  • Convolutional neural networks can exploit the structure of an image to identify simple patterns, and then combine the simple patterns into more complex ones.
  • Each filter in a convolutional neural network scans an adjacent area of the previous layer, combining the values based on learned weights.
  • the same filter, with the same weights, can then be slid across relevant dimensions to find a pattern throughout an input.
  • the filter generally penetrates the full depth of a layer, recombining lower level features to express higher level features.
• the early levels of an image targeted convolution network typically find edges, then lines, and then basic shapes like corners and curves. This often means that the early layers of a trained convolutional neural network can be reused in other networks in a process called transfer learning.
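A minimal PyTorch sketch follows, showing a small convolutional stack whose early layers are frozen and reused under a new task-specific layer, in the spirit of the transfer learning process described above; channel counts and image size are illustrative assumptions.

```python
# Minimal sketch of convolutional layers and transfer learning by freezing
# early layers: the same filter weights slide across the whole image.
import torch
import torch.nn as nn

early_layers = nn.Sequential(                 # tend to learn edges and lines
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
)
for p in early_layers.parameters():           # reuse: freeze the pretrained filters
    p.requires_grad = False

new_head = nn.Conv2d(32, 8, kernel_size=3, padding=1)   # task-specific layer
features = new_head(early_layers(torch.randn(1, 3, 64, 64)))
print(features.shape)                         # torch.Size([1, 8, 64, 64])
```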
  • FIGs. 5A-5C illustrate three joint key-points during an action or gesture, according to an illustrative embodiment of the present disclosure.
• the darker spots 501, 503, and 505 each represent a joint key-point localized by the system.
  • heatmaps can be used as intermediate representation of an input of a convolutional neural network for key-point detection.
• a prediction draws a point or blob in the predicted location. The higher the value ('heat') for a given pixel, the more likely it is, according to the convolutional neural network model, that the feature is centered in that position.
• Heatmaps can be trained by drawing a Gaussian blob at the ground truth location, and predicted heatmaps can be directly compared to these ground truth heatmaps. This technique allows a vision based network to remain in the vision space throughout training. For inference, either the pixel with the maximum value can be converted to a location address (hardmax) or a probabilistic weight of values can be converted to a blended address (soft argmax). An example of this technique is illustrated in FIGs. 5A-5C.
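A minimal NumPy sketch of the heatmap technique described above follows: a Gaussian blob is drawn at a ground-truth location, and a heatmap is decoded either with a hard argmax or with a probabilistic soft argmax. The heatmap size and sigma are illustrative assumptions.

```python
# Minimal sketch of Gaussian target heatmaps and hardmax / soft-argmax decoding.
import numpy as np

def gaussian_heatmap(center, size=64, sigma=2.0):
    """Draw a Gaussian blob centered at (row, col) = center."""
    ys, xs = np.mgrid[0:size, 0:size]
    cy, cx = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def hard_argmax(heatmap):
    """Convert the maximum-value pixel to an integer location address."""
    return np.unravel_index(np.argmax(heatmap), heatmap.shape)

def soft_argmax(heatmap):
    """Convert a probabilistic weighting of values to a blended address."""
    w = heatmap / heatmap.sum()
    ys, xs = np.mgrid[0:heatmap.shape[0], 0:heatmap.shape[1]]
    return (w * ys).sum(), (w * xs).sum()

target = gaussian_heatmap((20.3, 41.7))
print(hard_argmax(target))    # (20, 42): nearest pixel
print(soft_argmax(target))    # approximately (20.3, 41.7): sub-pixel location
```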
  • FIGs. 6A-6C illustrate three orientation key-points during an action or gesture, according to an illustrative embodiment of the present disclosure.
  • the darker spots 601, 603, and 605 each represents an orientation key-point localized by the neural network.
  • the darker spots represent a convolutional neural network model localizing a point.
  • the convolutional neural network can recover some of the resolution that can be lost on a heatmap.
  • additional techniques that can be used include 2D and 3D hardmax heatmaps.
• results can be shifted by 0.25 pixels based on which neighboring pixel has the next highest prediction.
  • This technique can effectively double the resolution in each direction.
• during training, when generating target heatmaps, a symmetrical Gaussian rounded to the nearest heatmap pixel may not be used; instead, higher-resolution data from a more precise discrete sampling of a probability distribution function may be incorporated when generating a target.
  • This technique can allow an almost perfect reconstruction of a high-resolution location from the heatmap when using, for example, a spatial soft argmax layer.
  • a full set of predicted orientation key-points can be jointly transformed (with a single transformation matrix) to match the original points.
• a 3D motion, such as an action or a gesture across three dimensions, can be recognized using the set of predicted orientation key-points for a series of images.
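One common way to realize the single-matrix joint transformation of predicted key-points mentioned above is a Kabsch/Procrustes-style rigid alignment; the following NumPy sketch is illustrative, and the point counts and rotation used are assumptions.

```python
# Minimal sketch of jointly transforming a full set of predicted key-points
# with a single rigid transformation so they best match the original points.
import numpy as np

def rigid_align(predicted, original):
    """Rotate and translate the predicted points onto the original points."""
    p_mean, o_mean = predicted.mean(0), original.mean(0)
    p, o = predicted - p_mean, original - o_mean
    u, _, vt = np.linalg.svd(p.T @ o)
    d = np.sign(np.linalg.det(vt.T @ u.T))            # avoid reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return (rot @ p.T).T + o_mean

rng = np.random.default_rng(0)
original = rng.normal(size=(17, 3))                   # e.g., 17 key-points in 3D
angle = np.pi / 6
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
predicted = original @ rot_z.T + np.array([0.5, -0.2, 1.0])
aligned = rigid_align(predicted, original)
print(np.abs(aligned - original).max())               # close to zero
```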
  • FIG. 7 illustrates an example of detection of orientation key-points according to an illustrative embodiment of the present disclosure.
  • FIG. 7 shows an image 701 with ground truth data 703 and predictions enabled by the described embodiments including joint predictions, pose predictions, and rotation angles.
  • Handles such as 707 indicate a forward orientation while handles such as 709 indicate a left orientation.
  • FIG. 8 depicts a block diagram of an exemplary computer-based system and platform 800 in accordance with one or more embodiments of the present disclosure.
• the illustrative computing devices and the illustrative computing components of the exemplary computer-based system and platform 800 may be configured to manage a large number of members and concurrent transactions, as detailed herein.
• the exemplary computer-based system and platform 800 may be based on a scalable computer and network architecture that incorporates various strategies for assessing the data, caching, searching, and/or database connection pooling.
  • An example of the scalable architecture is an architecture that is capable of operating multiple servers.
  • client device 802, client device 803 through client device 804 (e.g., clients) of the exemplary computer-based system and platform 800 may include virtually any computing device capable of receiving and sending a message over a network (e.g., cloud network), such as network 805, to and from another computing device, such as servers 806 and 807, each other, and the like.
  • the member devices 802-804 may be personal computers, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, and the like.
• one or more member devices within member devices 802-804 may include computing devices that typically connect using a wireless communications medium such as cell phones, smart phones, pagers, walkie talkies, radio frequency (RF) devices, infrared (IR) devices, CB (citizens band) radios, integrated devices combining one or more of the preceding devices, or virtually any mobile computing device, and the like.
• one or more member devices within member devices 802-804 may be devices that are capable of connecting using a wired or wireless communication medium such as a PDA, POCKET PC, wearable computer, a laptop, tablet, desktop computer, a netbook, a video game device, a pager, a smart phone, an ultra-mobile personal computer (UMPC), and/or any other device that is equipped to communicate over a wired and/or wireless communication medium (e.g., NFC, RFID, NBIOT, 3G, 4G, 5G, GSM, GPRS, WiFi, WiMax, CDMA, OFDM, OFDMA, LTE, satellite, ZigBee, etc.).
• one or more member devices within member devices 802-804 may run one or more applications, such as Internet browsers, mobile applications, voice calls, video games, videoconferencing, and email, among others. In some embodiments, one or more member devices within member devices 802-804 may be configured to receive and to send web pages, and the like.
• an exemplary specifically programmed browser application of the present disclosure may be configured to receive and display graphics, text, multimedia, and the like, employing virtually any web based language, including, but not limited to, Standard Generalized Markup Language (SGML), such as HyperText Markup Language (HTML), a wireless application protocol (WAP), a Handheld Device Markup Language (HDML), such as Wireless Markup Language (WML), WMLScript, XML, JavaScript, and the like.
• a member device within member devices 802-804 may be specifically programmed by either Java, .Net, QT, C, C++, Python, PHP and/or other suitable programming language.
  • device control may be distributed between multiple standalone applications.
  • software components/applications can be updated and redeployed remotely as individual units or as a full software suite.
  • a member device may periodically report status or send alerts over text or email.
  • a member device may contain a data recorder which is remotely downloadable by the user using network protocols such as FTP, SSH, or other file transfer mechanisms.
• a member device may provide several levels of user interface, for example, advanced user and standard user.
• one or more member devices within member devices 802-804 may be specifically programmed to include or execute an application to perform a variety of possible tasks, such as, without limitation, messaging functionality, browsing, searching, playing, streaming, or displaying various forms of content, including locally stored or uploaded messages, images and/or video, and/or games.
  • the exemplary network 805 may provide network access, data transport and/or other services to any computing device coupled to it.
  • the exemplary network 805 may include and implement at least one specialized network architecture that may be based at least in part on one or more standards set by, for example, without limitation, Global System for Mobile communication (GSM) Association, the Internet Engineering Task Force (IETF), and the Worldwide Interoperability for Microwave Access (WiMAX) forum.
  • the exemplary network 805 may implement one or more of a GSM architecture, a General Packet Radio Service (GPRS) architecture, a Universal Mobile Telecommunications System (UMTS) architecture, and an evolution of UMTS referred to as Long Term Evolution (LTE).
  • the exemplary network 805 may include and implement, as an alternative or in conjunction with one or more of the above, a WiMAX architecture defined by the WiMAX forum. In some embodiments and, optionally, in combination of any embodiment described above or below, the exemplary network 805 may also include, for instance, at least one of a local area network (LAN), a wide area network (WAN), the Internet, a virtual LAN (VLAN), an enterprise LAN, a layer 3 virtual private network (VPN), an enterprise IP network, or any combination thereof.
  • At least one computer network communication over the exemplary network 805 may be transmitted based at least in part on one of more communication modes such as but not limited to: NFC, RFID, Narrow Band Internet of Things (NBIOT), ZigBee, 3G, 4G, 5G, GSM, GPRS, WiFi, WiMax, CDMA, OFDM, OFDMA, LTE, satellite and any combination thereof.
  • the exemplary network 805 may also include mass storage, such as network attached storage (NAS), a storage area network (SAN), a content delivery network (CDN) or other forms of computer or machine readable media.
  • the exemplary server 806 or the exemplary server 807 may be a web server (or a series of servers) running a network operating system, examples of which may include but are not limited to Apache on Linux or Microsoft IIS (Internet Information Services).
  • the exemplary server 806 or the exemplary server 807 may be used for and/or provide cloud and/or network computing.
  • the exemplary server 806 or the exemplary server 807 may have connections to external systems like email, SMS messaging, text messaging, ad content providers, etc. Any of the features of the exemplary server 806 may be also implemented in the exemplary server 807 and vice versa.
• one or more of the exemplary servers 806 and 807 may be specifically programmed to perform, in non-limiting example, as authentication servers, search servers, email servers, social networking services servers, Short Message Service (SMS) servers, Instant Messaging (IM) servers, Multimedia Messaging Service (MMS) servers, exchange servers, photo-sharing services servers, advertisement providing servers, financial/banking-related services servers, travel services servers, or any similarly suitable service-based servers for users of the client devices 801-804.
• the exemplary server 806, and/or the exemplary server 807 may include a specifically programmed software module that may be configured to send, process, and receive information using a scripting language, a remote procedure call, an email, a tweet, Short Message Service (SMS), Multimedia Message Service (MMS), instant messaging (IM), an application programming interface, Simple Object Access Protocol (SOAP) methods, Common Object Request Broker Architecture (CORBA), HTTP (Hypertext Transfer Protocol), REST (Representational State Transfer), MLLP (Minimum Lower Layer Protocol), or any combination thereof.
  • FIG. 9 depicts a block diagram of another exemplary computer-based system and platform 900 in accordance with one or more embodiments of the present disclosure.
  • the client device 902a, client device 902b through client device 902n shown each at least includes a computer-readable medium, such as a random-access memory (RAM) 908 coupled to a processor 910 or FLASH memory.
  • the processor 910 may execute computer-executable program instructions stored in memory 908.
  • the processor 910 may include a microprocessor, an ASIC, and/or a state machine.
  • the processor 910 may include, or may be in communication with, media, for example computer-readable media, which stores instructions that, when executed by the processor 910, may cause the processor 910 to perform one or more steps described herein.
  • examples of computer-readable media may include, but are not limited to, an electronic, optical, magnetic, or other storage or transmission device capable of providing a processor, such as the processor 910 of client device 902a, with computer-readable instructions.
  • suitable media may include, but are not limited to, a floppy disk, CD-ROM, DVD, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape, or other magnetic media, or any other medium from which a computer processor can read instructions.
  • various other forms of computer-readable media may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless.
• the instructions may comprise code from any computer programming language, including, for example, C, C++, Visual Basic, Java, Python, Perl, JavaScript, etc.
  • client devices 902a through 902n may also comprise a number of external or internal devices such as a mouse, a CD-ROM, DVD, a physical or virtual keyboard, a display, or other input or output devices.
  • client devices 902a through 902n may be any type of processor-based platforms that are connected to a network 906 such as, without limitation, personal computers, digital assistants, personal digital assistants, smart phones, pagers, digital tablets, laptop computers, Internet appliances, and other processor-based devices.
  • client devices 902a through 902n may be specifically programmed with one or more application programs in accordance with one or more principles/methodologies detailed herein.
• client devices 902a through 902n may operate on any operating system capable of supporting a browser or browser-enabled application, such as Microsoft™ Windows™ and/or Linux.
• client devices 902a through 902n shown may include, for example, personal computers executing a browser application program such as Microsoft Corporation's Internet Explorer™, Apple Computer, Inc.'s Safari™, Mozilla Firefox, and/or Opera.
• user 912a, user 912b through user 912n may communicate over the exemplary network 906 with each other and/or with other systems and/or devices coupled to the network 906.
• As shown in FIG. 9, exemplary server devices 904 and 913 may include processor 905 and processor 914, respectively, as well as memory 917 and memory 916, respectively. In some embodiments, the server devices 904 and 913 may be also coupled to the network 906. In some embodiments, one or more client devices 902a through 902n may be mobile clients.
  • At least one database of exemplary databases 907 and 915 may be any type of database, including a database managed by a database management system (DBMS).
  • an exemplary DBMS-managed database may be specifically programmed as an engine that controls organization, storage, management, and/or retrieval of data in the respective database.
• the exemplary DBMS-managed database may be specifically programmed to provide the ability to query, backup and replicate, enforce rules, provide security, compute, perform change and access logging, and/or automate optimization.
  • the exemplary DBMS-managed database may be chosen from Oracle database, IBM DB2, Adaptive Server Enterprise, FileMaker, Microsoft Access, Microsoft SQL Server, MySQL, PostgreSQL, and a NoSQL implementation.
  • the exemplary DBMS-managed database may be specifically programmed to define each respective schema of each database in the exemplary DBMS, according to a particular database model of the present disclosure which may include a hierarchical model, network model, relational model, object model, or some other suitable organization that may result in one or more applicable data structures that may include fields, records, files, and/or objects.
  • the exemplary DBMS-managed database may be specifically programmed to include metadata about the data that is stored.
• the exemplary inventive computer-based systems/platforms, the exemplary inventive computer-based devices, and/or the exemplary inventive computer-based components of the present disclosure may be specifically configured to operate in a cloud computing/architecture 925 such as, but not limited to: infrastructure as a service (IaaS) 1110, platform as a service (PaaS) 1108, and/or software as a service (SaaS) 1106 using a web browser, mobile app, thin client, terminal emulator or other endpoint 1104.
  • FIGS. 10 and 11 illustrate schematics of exemplary implementations of the cloud computing/architecture(s) in which the exemplary inventive computer-based systems/platforms, the exemplary inventive computer-based devices, and/or the exemplary inventive computer-based components of the present disclosure may be specifically configured to operate.
  • the term “real-time” is directed to an event/action that can occur instantaneously or almost instantaneously in time when another event/action has occurred.
  • the “real-time processing,” “real-time computation,” and “real-time execution” all pertain to the performance of a computation during the actual time that the related physical process (e.g., a user interacting with an application on a mobile device) occurs, in order that results of the computation can be used in guiding the physical process.
  • events and/or actions in accordance with the present disclosure can be in real-time and/or based on a predetermined periodicity of at least one of: nanosecond, several nanoseconds, millisecond, several milliseconds, second, several seconds, minute, several minutes, hourly, several hours, daily, several days, weekly, monthly, etc.
  • runtime corresponds to any behavior that is dynamically determined during an execution of a software application or at least a portion of software application.
• exemplary inventive, specially programmed computing systems and platforms with associated devices are configured to operate in the distributed network environment, communicating with one another over one or more suitable data communication networks (e.g., the Internet, satellite, etc.) and utilizing one or more suitable data communication protocols/modes such as, without limitation, IPX/SPX, X.25, AX.25, AppleTalk(TM), TCP/IP (e.g., HTTP), near-field wireless communication (NFC), RFID, Narrow Band Internet of Things (NBIOT), 3G, 4G, 5G, GSM, GPRS, WiFi, WiMax, CDMA, satellite, ZigBee, and other suitable communication modes.
• the NFC can represent a short-range wireless communications technology in which NFC-enabled devices are "swiped," "bumped," "tapped" or otherwise moved in close proximity to communicate.
  • the NFC could include a set of short-range wireless technologies, typically requiring a distance of 10 cm or less.
  • the NFC may operate at 13.56 MHz on ISO/IEC 18000-3 air interface and at rates ranging from 106 kbit/s to 424 kbit/s.
  • the NFC can involve an initiator and a target; the initiator actively generates an RF field that can power a passive target.
  • this can enable NFC targets to take very simple form factors such as tags, stickers, key fobs, or cards that do not require batteries.
• the NFC's peer-to-peer communication can be conducted when a plurality of NFC-enabled devices (e.g., smartphones) are within close proximity of each other.
  • a machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device).
  • a machine-readable medium may include read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), and others.
• the terms "computer engine" and "engine" identify at least one software component and/or a combination of at least one software component and at least one hardware component which are designed/programmed/configured to manage/control other software and/or hardware components (such as the libraries, software development kits (SDKs), objects, etc.).
  • Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate array (FPGA), logic gates, registers, semiconductor device, chips, microchips, chip sets, and so forth.
  • the one or more processors may be implemented as a Complex Instruction Set Computer (CISC) or Reduced Instruction Set Computer (RISC) processors; x86 instruction set compatible processors, multi-core, or any other microprocessor or central processing unit (CPU).
  • the one or more processors may be dual-core processor(s), dual-core mobile processor(s), and so forth.
  • Computer-related systems, computer systems, and systems include any combination of hardware and software.
  • Examples of software may include software components, programs, applications, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computer code, computer code segments, words, values, symbols, or any combination thereof. Determining whether an embodiment is implemented using hardware elements and/or software elements may vary in accordance with any number of factors, such as desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.
  • One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represents various logic within the processor, which when read by a machine causes the machine to fabricate logic to perform the techniques described herein.
• Such representations, known as "IP cores," may be stored on a tangible, machine readable medium and supplied to various customers or manufacturing facilities to load into the fabrication machines that make the logic or processor.
• various embodiments described herein may, of course, be implemented using any appropriate hardware and/or computing software languages (e.g., C++, Objective-C, Swift, Java, JavaScript, Python, Perl, QT, etc.).
  • one or more of illustrative computer-based systems or platforms of the present disclosure may include or be incorporated, partially or entirely into at least one personal computer (PC), laptop computer, ultra-laptop computer, tablet, touch pad, portable computer, handheld computer, palmtop computer, personal digital assistant (PDA), cellular telephone, combination cellular telephone/PDA, television, smart device (e.g., smart phone, smart tablet or smart television), mobile internet device (MID), messaging device, data communication device, and so forth.
• the term "server" should be understood to refer to a service point which provides processing, database, and communication facilities.
  • server can refer to a single, physical processor with associated communications and data storage and database facilities, or it can refer to a networked or clustered complex of processors and associated network and storage devices, as well as operating software and one or more database systems and application software that support the services provided by the server. Cloud servers are examples.
  • one or more of the computer-based systems of the present disclosure may obtain, manipulate, transfer, store, transform, generate, and/or output any digital object and/or data unit (e.g., from inside and/or outside of a particular application) that can be in any suitable form such as, without limitation, a file, a contact, a task, an email, a message, a map, an entire application (e.g., a calculator), data points, and other suitable data.
• one or more of the computer-based systems of the present disclosure may be implemented across one or more of various computer platforms such as, but not limited to: (1) FreeBSD, NetBSD, OpenBSD; (2) Linux; (3) Microsoft Windows™; (4) OpenVMS™; (5) OS X (MacOS™); (6) UNIX™; (7) Android; (8) iOS™; (9) Embedded Linux; (10) Tizen™; (11) WebOS™; (12) Adobe AIR™; (13) Binary Runtime Environment for Wireless (BREW™); (14) Cocoa™ (API); (15) Cocoa™ Touch; (16) Java™ Platforms; (17) JavaFX™; (18) QNX™; (19) Mono; (20) Google Blink; (21) Apple WebKit; (22) Mozilla Gecko™; (23) Mozilla XUL; (24) .NET Framework; (25) Silverlight™; (26) Open Web Platform; (27) Oracle Database; (28) Qt™; (29) SAP NetWeaver™; (30) Smartface™; (31)
  • illustrative computer-based systems or platforms of the present disclosure may be configured to utilize hardwired circuitry that may be used in place of or in combination with software instructions to implement features consistent with principles of the disclosure.
  • implementations consistent with principles of the disclosure are not limited to any specific combination of hardware circuitry and software.
• various embodiments may be embodied in many different ways as a software component such as, without limitation, a stand-alone software package, a combination of software packages, or it may be a software package incorporated as a "tool" in a larger software product.
  • exemplary software specifically programmed in accordance with one or more principles of the present disclosure may be downloadable from a network, for example, a website, as a stand-alone product or as an add-in package for installation in an existing software application.
  • exemplary software specifically programmed in accordance with one or more principles of the present disclosure may also be available as a client-server software application, or as a web-enabled software application.
  • exemplary software specifically programmed in accordance with one or more principles of the present disclosure may also be embodied as a software package installed on a hardware device.
• illustrative computer-based systems or platforms of the present disclosure may be configured to handle numerous concurrent users that may be, but are not limited to, at least 100 (e.g., but not limited to, 100-999), at least 1,000 (e.g., but not limited to, 1,000-9,999), at least 10,000 (e.g., but not limited to, 10,000-99,999), at least 100,000 (e.g., but not limited to, 100,000-999,999), at least 1,000,000 (e.g., but not limited to, 1,000,000-9,999,999), at least 10,000,000 (e.g., but not limited to, 10,000,000-99,999,999), at least 100,000,000 (e.g., but not limited to, 100,000,000-999,999,999), at least 1,000,000,000 (e.g., but not limited to, 1,000,000,000-9,999,999,999), and so on.
  • illustrative computer-based systems or platforms of the present disclosure may be configured to output to distinct, specifically programmed graphical user interface implementations of the present disclosure (e.g., a desktop, a web app., etc.).
  • a final output may be displayed on a displaying screen which may be, without limitation, a screen of a computer, a screen of a mobile device, or the like.
  • the display may be a holographic display.
  • the display may be a transparent surface that may receive a visual projection.
  • Such projections may convey various forms of information, images, or objects.
  • such projections may be a visual overlay for a mobile augmented reality (MAR) application.
• illustrative computer-based systems or platforms of the present disclosure may be configured to be utilized in various applications which may include, but are not limited to, gaming, mobile-device games, video chats, video conferences, live video streaming, video streaming and/or augmented reality applications, mobile-device messenger applications, and other similarly suitable computer-device applications.
  • the term “mobile electronic device,” or the like may refer to any portable electronic device that may or may not be enabled with location tracking functionality (e.g., MAC address, Internet Protocol (IP) address, or the like).
  • a mobile electronic device can include, but is not limited to, a mobile phone, Personal Digital Assistant (PDA), Blackberry TM, Pager, Smartphone, or any other reasonable mobile electronic device.
  • proximity detection refers to any form of location tracking technology or locating method that can be used to provide a location of, for example, a particular computing device, system or platform of the present disclosure and any associated computing devices, based at least in part on one or more of the following techniques and devices, without limitation: accelerometer(s), gyroscope(s), Global Positioning Systems (GPS); GPS accessed using BluetoothTM; GPS accessed using any reasonable form of wireless and non-wireless communication; WiFiTM server location data; Bluetooth TM based location data; triangulation such as, but not limited to, network based triangulation, WiFiTM server information based triangulation, BluetoothTM server information based triangulation; Cell Identification based triangulation, Enhanced Cell Identification based triangulation, Uplink-Time difference of arrival (U-TDOA) based triangulation, Time of arrival (TOA) based triangulation, Angle of arrival (AOA) based triang
• As used herein, terms "cloud," "Internet cloud," "cloud computing," "cloud architecture," and similar terms correspond to at least one of the following: (1) a large number of computers connected through a real-time communication network (e.g., Internet); (2) providing the ability to run a program or application on many connected computers (e.g., physical machines, virtual machines (VMs)) at the same time; (3) network-based services, which appear to be provided by real server hardware, and are in fact served up by virtual hardware (e.g., virtual servers), simulated by software running on one or more real machines (e.g., allowing to be moved around and scaled up (or down) on the fly without affecting the end user).
• the illustrative computer-based systems or platforms of the present disclosure may be configured to securely store and/or transmit data by utilizing one or more encryption techniques (e.g., private/public key pair, Triple Data Encryption Standard (3DES), block cipher algorithms (e.g., IDEA, RC2, RC5, CAST and Skipjack), cryptographic hash algorithms (e.g., MD5, RIPEMD-160, RTRO, SHA-1, SHA-2, Tiger (TTH), WHIRLPOOL), RNGs).
  • the term “user” shall have a meaning of at least one user.
• the terms "user", "subscriber", "consumer" or "customer" should be understood to refer to a user of an application or applications as described herein and/or a consumer of data supplied by a data provider.
  • the terms “user” or “subscriber” can refer to a person who receives data provided by the data or service provider over the Internet in a browser session or can refer to an automated software application which receives the data and stores or processes the data.
  • a method comprising: receiving, by at least one processor, a sequence of images from an image capture device; detecting, by the at least one processor, a set of key-points for each image in the sequence of images; wherein the set of key-points for each image represent a set of anatomical features of a human subject represented in each image; utilizing, by the at least one processor, a motion recognition machine learning model to generate a motion classification identifying at least one of an action or a gesture based at least in part on the set of key-points for each image across the set of images and trained model parameters; and outputting, by the at least one processor, an indication of the at least one of the action or the gesture to a computing device.
• a system comprising: at least one processor configured to execute software instructions causing the at least one processor to perform steps to: receive a sequence of images from an image capture device; detect a set of key-points for each image in the sequence of images; wherein the set of key-points for each image represent a set of anatomical features of a human subject represented in each image; utilize a motion recognition machine learning model to generate a motion classification identifying at least one of an action or a gesture based at least in part on the set of key-points for each image across the set of images and trained model parameters; and output an indication of the at least one of the action or the gesture to a computing device.
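A minimal end-to-end sketch of the claimed flow follows: receive a sequence of images, detect a set of key-points per image, feed the per-image key-point vectors to a motion recognition model, and output an indication of the recognized action or gesture. The detector and classifier stubs, shapes, and labels are illustrative placeholders for the components described above, not the disclosed implementations.

```python
# Minimal sketch of the claimed pipeline with hypothetical stub components.
import numpy as np

def detect_key_points(image):
    """Stub detector: return 25 key-points x 3 coordinates for one image."""
    return np.zeros((25, 3))

def recognize_motion(key_point_vectors):
    """Stub motion recognition model applied across the whole image sequence."""
    return "Salute"

def process_sequence(images):
    """Detect key-points per image, classify the motion, and return it."""
    vectors = np.stack([detect_key_points(img).ravel() for img in images])
    return recognize_motion(vectors)   # e.g., forwarded to a computing device

frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(16)]
print(process_sequence(frames))        # "Salute"
```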

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Systems and methods according to the invention enable automated motion recognition by receiving a time series of images representing a subject at a plurality of points in time. For each image in the time series of images, an orientation key-point associated with a section of a body part of the subject is predicted via a neural network detector, a three-axis joint rotation associated with the section of the body part is computed based on the orientation key-point associated with the body part and a joint key-point associated with the body part, and features are generated comprising at least one of: the orientation key-point, the joint key-point, or the three-axis joint rotation. A motion performed by the subject is predicted via a motion recognition machine learning model based on the features of each image in the time series of images, and an operation is performed in response to the motion.
EP22721841.9A 2021-03-30 2022-03-30 Systèmes et procédés de reconnaissance informatique de mouvements de gestes 3d Pending EP4315282A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163167987P 2021-03-30 2021-03-30
PCT/IB2022/000168 WO2022208168A1 (fr) 2021-03-30 2022-03-30 Systèmes et procédés de reconnaissance informatique de mouvements de gestes 3d

Publications (1)

Publication Number Publication Date
EP4315282A1 true EP4315282A1 (fr) 2024-02-07

Family

ID=81585566

Family Applications (1)

Application Number Title Priority Date Filing Date
EP22721841.9A Pending EP4315282A1 (fr) 2021-03-30 2022-03-30 Systèmes et procédés de reconnaissance informatique de mouvements de gestes 3d

Country Status (3)

Country Link
EP (1) EP4315282A1 (fr)
AU (1) AU2022247290A1 (fr)
WO (1) WO2022208168A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152670A (zh) * 2023-10-31 2023-12-01 江西拓世智能科技股份有限公司 一种基于人工智能的行为识别方法及系统
CN118015288B (zh) * 2024-01-12 2024-06-14 广州图语信息科技有限公司 一种手部关键点去噪方法、装置、电子设备及存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8867786B2 (en) * 2012-10-31 2014-10-21 Microsoft Corporation Scenario-specific body-part tracking
CN110020633B (zh) * 2019-04-12 2022-11-04 腾讯科技(深圳)有限公司 姿态识别模型的训练方法、图像识别方法及装置
US11164336B2 (en) * 2019-09-27 2021-11-02 Martin Adrian FISCH Methods and apparatus for orientation keypoints for complete 3D human pose computerized estimation

Also Published As

Publication number Publication date
AU2022247290A1 (en) 2023-11-16
WO2022208168A1 (fr) 2022-10-06

Similar Documents

Publication Publication Date Title
US11816851B2 (en) Methods and apparatus for orientation keypoints for complete 3D human pose computerized estimation
KR102222642B1 (ko) 이미지 내의 객체 검출을 위한 신경망
US10949648B1 (en) Region-based stabilized face tracking
KR20220062338A (ko) 스테레오 카메라들로부터의 손 포즈 추정
JP2020522285A (ja) 全身測定値抽出のためのシステムおよび方法
KR20210106444A (ko) 개별 사용자에 대한 개인화된 식이 및 건강 권고 또는 추천을 생성하기 위한 자동화된 방법 및 시스템
CN116830158B (zh) 人类角色的音乐反应动画
US11715223B2 (en) Active image depth prediction
EP4315282A1 (fr) Systèmes et procédés de reconnaissance informatique de mouvements de gestes 3d
Herath et al. Development of an IoT based systems to mitigate the impact of COVID-19 pandemic in smart cities
EP4281901A1 (fr) Reconnaissance d'actions à l'aide de données de pose et d'un apprentissage automatique
US11450010B2 (en) Repetition counting and classification of movements systems and methods
An et al. A survey of embedded machine learning for smart and sustainable healthcare applications
WO2019022829A1 (fr) Rétroaction humaine dans un ajustement de modèle 3d
CN110955840B (zh) 通知和推送的联合优化
Li et al. [Retracted] Human Sports Action and Ideological and PoliticalEvaluation by Lightweight Deep Learning Model
US20230289560A1 (en) Machine learning techniques to predict content actions
US20230244985A1 (en) Optimized active learning using integer programming
US20240103610A1 (en) Egocentric human body pose tracking
Zhou et al. Motion balance ability detection based on video analysis in virtual reality environment
Zhang et al. Lightweight network for small target fall detection based on feature fusion and dynamic convolution
US20240356873A1 (en) Personal ai intent understanding
US20240310921A1 (en) Methods and Systems for Offloading Pose Processing to a Mobile Device for Motion Tracking on a Hardware Device without a Camera
US20240203069A1 (en) Method and system for tracking object for augmented reality
US20230419599A1 (en) Light estimation method for three-dimensional (3d) rendered objects

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20231030

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)