WO2020207281A1 - Training method for posture recognition model, image recognition method and apparatus - Google Patents

Training method for posture recognition model, image recognition method and apparatus

Info

Publication number
WO2020207281A1
Authority
WO
WIPO (PCT)
Prior art keywords
dimensional
human body
model
parameters
posture
Prior art date
Application number
PCT/CN2020/082039
Other languages
English (en)
French (fr)
Inventor
罗镜民
朱晓龙
王一同
季兴
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2020207281A1
Priority to US17/330,261 (US11907848B2)

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 Pattern recognition
            • G06F 18/20 Analysing
              • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/045 Combinations of networks
              • G06N 3/08 Learning methods
                • G06N 3/084 Backpropagation, e.g. using gradient descent
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 7/00 Image analysis
            • G06T 7/70 Determining position or orientation of objects or cameras
              • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
                • G06T 7/74 Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
                • G06T 7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
          • G06T 2207/00 Indexing scheme for image analysis or image enhancement
            • G06T 2207/20 Special algorithmic details
              • G06T 2207/20081 Training; Learning
              • G06T 2207/20084 Artificial neural networks [ANN]
            • G06T 2207/30 Subject of image; Context of image processing
              • G06T 2207/30196 Human being; Person
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00 Arrangements for image or video recognition or understanding
            • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
          • G06V 20/00 Scenes; Scene-specific elements
            • G06V 20/60 Type of objects
              • G06V 20/64 Three-dimensional objects
                • G06V 20/647 Three-dimensional objects by matching two-dimensional images to three-dimensional objects
          • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
            • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
              • G06V 40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
            • G06V 40/20 Movements or behaviour, e.g. gesture recognition
              • G06V 40/23 Recognition of whole body movements, e.g. for sport training

Definitions

  • This application relates to the field of machine learning technology, in particular to a gesture recognition technology.
  • Machine Learning is a branch of artificial intelligence. Its purpose is to allow machines to learn based on prior knowledge, so as to have the logical ability to classify and judge.
  • the machine learning model represented by neural network is constantly developing and is gradually applied to human body gesture recognition, so as to realize various intelligent applications based on human body gesture.
  • the neural network models used to recognize the two-dimensional posture information and the three-dimensional posture information of the human body are incompatible with each other and need to be trained separately, which consumes a large amount of computing resources and results in low training efficiency.
  • the embodiments of the present application provide a method for training a gesture recognition model, an image recognition method, device, and storage medium based on the gesture recognition model, which can realize a model that can recognize two-dimensional posture information and three-dimensional posture information of a human body.
  • the embodiment of the application provides a method for training a gesture recognition model, including: processing a sample image annotated with human body key points through the feature map model included in the gesture recognition model to obtain a feature map corresponding to the sample image; processing the feature map through the two-dimensional model included in the gesture recognition model to obtain two-dimensional key point parameters used to characterize the two-dimensional human body posture; processing the target human body feature map cropped from the feature map and the two-dimensional key point parameters through the three-dimensional model included in the gesture recognition model to obtain three-dimensional posture parameters used to characterize the three-dimensional human body posture; constructing a target loss function by combining the two-dimensional key point parameters and the three-dimensional posture parameters; and updating the model parameters of the gesture recognition model based on the target loss function.
  • the embodiment of the present application also provides an image recognition method based on a gesture recognition model, the method including: processing an image to be recognized containing a human body through the feature map model included in the gesture recognition model to obtain a feature map corresponding to the image to be recognized; processing the feature map through the two-dimensional model included in the gesture recognition model to obtain two-dimensional key point parameters used to characterize the two-dimensional human body posture, where the two-dimensional key point parameters are used to recognize the two-dimensional posture of the human body; and processing the target human body feature map cropped from the feature map and the two-dimensional key point parameters through the three-dimensional model included in the gesture recognition model to obtain three-dimensional posture parameters used to characterize the three-dimensional human body posture, where the three-dimensional posture parameters are used to recognize the three-dimensional posture of the human body.
  • An embodiment of the present application also provides a training device for a gesture recognition model, including:
  • the first processing unit is configured to process the sample image marked with key points of the human body through the feature map model included in the gesture recognition model to obtain a feature map corresponding to the sample image;
  • the second processing unit is configured to process the feature map through the two-dimensional model included in the gesture recognition model to obtain two-dimensional key point parameters used to characterize the two-dimensional human posture;
  • the third processing unit is configured to process, through the three-dimensional model included in the gesture recognition model, the target human body feature map cropped from the feature map and the two-dimensional key point parameters, to obtain three-dimensional posture parameters used to characterize the three-dimensional human body posture;
  • the construction unit is configured to construct a target loss function by combining the two-dimensional key point parameters and the three-dimensional posture parameters;
  • the update unit is configured to update the model parameters of the gesture recognition model based on the target loss function.
  • the device further includes:
  • the marking unit is configured to determine the human body key points from a key point set according to the type of the configuration scenario, and to annotate the sample image with reference to the key point set.
  • the key point set includes:
  • a reference key point used to locate a human body part, and an extended key point that cooperates with the reference key point to characterize various three-dimensional postures of the part.
  • the target loss function includes a first loss function corresponding to the three-dimensional model;
  • the construction unit is further configured to determine corresponding two-dimensional key point information based on the three-dimensional posture parameters, and to construct the first loss function corresponding to the three-dimensional model by combining the two-dimensional key point parameters output by the two-dimensional model with the determined two-dimensional key point information.
  • the target loss function further includes a loss function corresponding to the two-dimensional model and a second loss function corresponding to the three-dimensional model;
  • the two-dimensional key point parameters include: partial affinity field parameters of the human body key points and heat maps of the human body key points;
  • the three-dimensional posture parameters include: shape parameters and morphological parameters of the human body;
  • the construction unit is further configured to construct the loss function corresponding to the two-dimensional model by combining the difference between the partial affinity field parameters output by the two-dimensional model and the partial affinity field parameters of the corresponding human body key points in the sample image, and the difference between the heat map output by the two-dimensional model and the heat map of the corresponding human body key points in the sample image;
  • the device further includes:
  • a cropping unit configured to determine the target human body in the feature map based on the two-dimensional key point parameters, and to crop the feature map according to the target human body to obtain the target human body feature map.
  • the update unit is further configured to determine the value of the target loss function based on the two-dimensional key point parameters and the three-dimensional posture parameters, and, when the value of the target loss function exceeds a preset threshold, to determine an error signal of the gesture recognition model based on the target loss function; the error signal is propagated back in the gesture recognition model, and the model parameters of each layer are updated during the propagation process.
  • An embodiment of the present application also provides an image recognition device based on a gesture recognition model, the device including:
  • the first acquiring unit is configured to input the image to be recognized including the human body into the feature map model included in the gesture recognition model, and output a feature map corresponding to the image to be recognized;
  • the second acquisition unit is configured to input the feature map into the two-dimensional model included in the gesture recognition model and output two-dimensional key point parameters used to characterize the two-dimensional human body posture, where the two-dimensional key point parameters are used to recognize the two-dimensional posture of the human body;
  • the third acquisition unit is configured to input the target human body feature map cropped from the feature map and the two-dimensional key point parameters into the three-dimensional model included in the gesture recognition model, and output three-dimensional posture parameters used to characterize the three-dimensional human body posture, where the three-dimensional posture parameters are used to recognize the three-dimensional posture of the human body.
  • the device further includes:
  • the matching unit is configured to recognize the two-dimensional posture of the human body in the image to be recognized based on the two-dimensional key point parameters, where the image to be recognized is collected after a specific person posture is output for imitation, and to match the recognized two-dimensional posture against the specific person posture;
  • the prompt unit is used to output prompt information used to characterize the matching result.
  • the device further includes:
  • a human body model unit configured to construct a three-dimensional human body model corresponding to the target human body based on the three-dimensional posture parameters
  • the control unit is configured to control the three-dimensional human body model to execute a target action, and the target action matches the action performed by the target human body.
  • the embodiment of the present application also provides an image processing device, including:
  • a memory, configured to store executable instructions;
  • the processor is configured to implement any one of the gesture recognition model training methods provided in the embodiments of the present application or the image recognition method based on the gesture recognition model when executing the executable instructions stored in the memory.
  • the embodiment of the present application also provides a storage medium storing executable instructions, which cause a processor to execute any one of the training methods for the gesture recognition model provided in the embodiments of the present application, or any one of the image recognition methods based on the gesture recognition model.
  • In the above technical solution, the sample image annotated with human body key points is processed through the feature map model included in the gesture recognition model to obtain a feature map corresponding to the sample image. Then, the feature map is processed through the two-dimensional model included in the gesture recognition model to obtain two-dimensional key point parameters used to characterize the two-dimensional human body posture, and the target human body feature map cropped from the feature map together with the two-dimensional key point parameters is processed through the three-dimensional model included in the gesture recognition model to obtain three-dimensional posture parameters used to characterize the three-dimensional human body posture. The two-dimensional key point parameters and the three-dimensional posture parameters are combined to construct the target loss function.
  • Since the target loss function considers both the output of the two-dimensional model (the two-dimensional key point parameters) and the output of the three-dimensional model (the three-dimensional posture parameters), updating the model parameters of the gesture recognition model based on the target loss function allows both the two-dimensional model and the three-dimensional model in the obtained gesture recognition model to output good results. That is, the trained gesture recognition model can output both the two-dimensional posture information and the three-dimensional posture information of the human body, realizing compatibility of the two-dimensional and three-dimensional posture information.
  • Moreover, training the posture recognition model that outputs both the two-dimensional and three-dimensional posture information of the human body uses a single set of training samples; the model is simple and the training efficiency is high.
  • Figure 1 is a schematic diagram of a training method for a two-dimensional key point recognition model provided by related technologies
  • Figure 2 is a schematic diagram of a training method for a three-dimensional human body model provided by related technologies
  • FIG. 3 is a schematic diagram of an implementation scenario of a gesture recognition model provided by an embodiment of the application.
  • FIG. 4 is a schematic diagram of the composition structure of a training device for a gesture recognition model provided by an embodiment of the application;
  • Figure 5 is a schematic structural diagram of a gesture recognition model provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a method for training a gesture recognition model provided by an embodiment of the present application
  • FIG. 7 is a schematic diagram of a process for acquiring key points of hand extension provided by an embodiment of the application.
  • FIG. 8A is a schematic diagram of human body key points corresponding to the first configuration scenario provided by an embodiment of the application.
  • FIG. 8B is a schematic diagram of human body key points corresponding to the second configuration scenario provided by an embodiment of the application.
  • FIG. 8C is a schematic diagram of the key points of the human body corresponding to the third configuration scenario provided by an embodiment of the application.
  • FIG. 8D is a schematic diagram of key points of the human body corresponding to the fourth configuration scene provided by an embodiment of the application.
  • FIG. 9 is a schematic diagram of performing feature map extraction according to an embodiment of the application.
  • FIG. 10 is a schematic diagram of a heat map of key points of the human body provided by an embodiment of the application.
  • FIG. 11 is a schematic flowchart of an image recognition method based on a gesture recognition model provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a process of image recognition using a gesture recognition model according to an embodiment of the application.
  • FIG. 13 is a schematic diagram of an application scenario of a gesture recognition model provided by an embodiment of the application.
  • FIG. 14 is a schematic diagram of an application scenario of a gesture recognition model provided by an embodiment of the application.
  • FIG. 15 is a schematic diagram of the composition structure of a training device for a gesture recognition model provided by an embodiment of the application;
  • FIG. 16 is a schematic diagram of the composition structure of an image recognition device based on a gesture recognition model provided by an embodiment of the application.
  • The terms "first" and "second" involved are only used to distinguish similar objects and do not represent a specific order of the objects. Understandably, where permitted, "first" and "second" may be interchanged in a specific order or sequence, so that the embodiments of the present application described herein can be implemented in a sequence other than those illustrated or described herein.
  • the key points of the human body are the key points representative of the posture of the human body, and the posture of the human body can be recognized through the key points of the human body.
  • the key points of the human body can be bone key points of the human body, and the bones are the connections between the key points of the human body, such as head key points, neck key points, shoulder key points, elbow key points, wrist key points, knee key points, hip key points, ankle key points, and the like.
  • In related technologies, human body posture is recognized through key point recognition models, so as to realize various intelligent applications based on human body posture.
  • Figure 1 is a schematic diagram of the training method of the two-dimensional key point recognition model provided by related technologies. See Figure 1.
  • the training samples used for training the two-dimensional key point recognition model are taken from the COCO database (an image data set); the 17 human body key points disclosed by the COCO database are used, and image data annotated with these 17 key points serve as training samples.
  • A feature map is extracted from the sample data through a deep learning network (for example, the network named Darknet), and then partial affinity field (PAF, Part Affinity Fields) processing and heat map (Heatmap) processing are applied, with training using a loss function such as L2 loss; two-dimensional (2D, Two Dimension) human body key points and the person to whom each key point belongs are then obtained through non-maximum suppression (NMS, Non-Maximum Suppression) and aggregation (Grouping) operations.
  • PAF processing is used for the detection of key points of multiple human bodies. Through a collection of two-dimensional direction vectors, it indicates the position and direction of the limbs (and also represents the degree of association between two key points), thereby solving the problem of which person each human body key point belongs to. Based on the two-dimensional direction vectors of the human body key points obtained by the PAF, the Grouping operation is performed to confirm which person in the image each key point belongs to. After the grouping operation, the key points of the human body can be connected into a skeleton.
  • the 18 human key points solution of Openpose (a full-featured library) and the basic 8 human key points solution can also be used to recognize the two-dimensional posture of the human body.
  • Figure 2 is a schematic diagram of the training method of the human body three-dimensional model provided by related technologies. See Figure 2.
  • the sample data set is constructed using the standard of the skinned multi-person model (SMPL, A Skinned Multi-Person Linear Model).
  • Each image carries shape and pose annotations; the parameters (shape and pose) of the SMPL three-dimensional model are output for three-dimensional model training, and L2 loss is used to regress the parameters.
  • In the related technology, the human body key points used for training are always a fixed set, which leads to key point information redundancy or deficiency when dealing with different businesses. For example, a scene that only requires simple 2D upper-body posture information needs only 8 key points of the upper body; in this case, using 17 or 18 key points for model training is obviously key point redundancy and causes a waste of computing resources.
  • The model parameters used in the training of the above SMPL model are the shape parameters and pose parameters of the human body, without considering the constraints of two-dimensional information. The human body posture recognized by a model trained in this way has angular errors and insufficiently accurate actions, that is, the recognition accuracy is low, and the model also has the problem of key point information redundancy and deficiency in different business scenarios.
  • For example, in a human-computer interaction scene that only requires the three-dimensional posture of the upper body, training a three-dimensional model corresponding to the entire human body obviously causes a waste of computing resources.
  • In addition, the training data used by the above two models are completely different and incompatible with each other, and the training processes are different. To obtain both the two-dimensional posture information and the three-dimensional posture information of the human body, two different models need to be trained separately to process different data, which is time-consuming and also causes a waste of computing resources, occupying a large amount of central processing unit (CPU, Central Processing Unit), graphics processing unit (GPU, Graphics Processing Unit) and other resources.
  • the trained posture recognition model can output not only the two-dimensional posture information of the human body, but also the three-dimensional posture information of the human body, which realizes the compatibility of the two-dimensional posture information and the three-dimensional posture information of the human body.
  • a set of training samples is used, the model is simple, and the training efficiency is high;
  • In addition, the posture recognition model includes a three-dimensional model, and the training of the three-dimensional model is constrained by the two-dimensional information output by the two-dimensional model, so that the three-dimensional posture information of the human body output by the three-dimensional model is more accurate.
  • Artificial Intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech technology, natural language processing technology, and machine learning/deep learning.
  • Machine Learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other subjects. Specializing in the study of how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills, and reorganize the existing knowledge structure to continuously improve its own performance.
  • Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning techniques.
  • the embodiment of the application specifically trains the gesture recognition model through machine learning, so that the trained gesture recognition model can accurately perform gesture recognition for the image to be recognized.
  • Computer vision technology may also be involved in the process of gesture recognition model training or gesture recognition.
  • Computer Vision (CV) is a science that studies how to make machines "see". More specifically, it refers to using cameras and computers instead of human eyes to identify, track, and measure targets, and to further perform graphics processing so that the processed images are more suitable for human eyes to observe or for transmission to instruments for detection.
  • computer vision studies related theories and technologies, trying to establish an artificial intelligence system that can obtain information from images or multi-dimensional data.
  • the embodiments of the present application specifically relate to image processing and image semantic understanding in computer vision technology. For example, after obtaining an image such as an image to be recognized or a training sample, image processing is performed, such as cropping the image; and image semantic understanding technology is used to perform key point annotation, image classification (for example, determining the attribution of human body key points), extraction of image features (feature maps), and so on.
  • Figure 3 is a schematic diagram of the implementation scenario of the gesture recognition model provided in the embodiment of this application. Referring to Figure 3, in order to support an exemplary application, terminals (including terminal 40-1 and terminal 40-2) are provided with a client for image recognition and are connected to the server 200 through the network 300.
  • the network 300 can be a wide area network or a local area network, or a combination of the two, using wireless links to realize data transmission.
  • the server 200 is configured to input the sample image annotated with human body key points into the feature map model included in the gesture recognition model and output the feature map corresponding to the sample image; input the feature map into the two-dimensional model included in the gesture recognition model and output the two-dimensional key point parameters used to characterize the two-dimensional human body posture; and input the target human body feature map cropped from the feature map together with the two-dimensional key point parameters into the three-dimensional model included in the posture recognition model and output the three-dimensional posture parameters used to characterize the three-dimensional human body posture.
  • the terminal (terminal 40-1 and/or terminal 40-2) is used to send an identification request carrying an image to be identified to the server 200, and the image to be identified includes one or more human bodies.
  • the server 200 is further configured to receive a recognition request sent by the terminal, use the obtained gesture recognition model to recognize the image to be recognized, and return the recognition result (two-dimensional key point parameter and/or three-dimensional posture parameter) to the terminal.
  • the terminal (terminal 40-1 and/or terminal 40-2) is also used to execute corresponding applications based on the recognition results returned by the server 200, such as driving a three-dimensional human body model, or determining the corresponding two-dimensional human body posture based on the recognition result and performing a corresponding assessment.
  • the training device for the gesture recognition model and the image recognition device based on the gesture recognition model provided by the embodiments of the present application can be implemented by image processing equipment.
  • the image processing equipment can be, for example, a terminal or a server.
  • the methods provided in the examples can be implemented separately by terminals such as smart phones, tablet computers, and desktop computers, or implemented separately by servers, or implemented by terminals and servers in cooperation.
  • the training device for the gesture recognition model and the image recognition device based on the gesture recognition model provided in the embodiments of the present application can be implemented as hardware or a combination of software and hardware. Taking the training device for the gesture recognition model as an example, various exemplary implementations of the apparatus provided in the embodiments of the present application are described below.
  • FIG. 4 is a schematic diagram of the composition structure of the image processing device provided by the embodiment of the application. It can be understood that FIG. 4 only shows an exemplary structure of the image processing device rather than the entire structure; part or all of the structure shown in FIG. 4 can be implemented as required.
  • the image processing device includes: at least one processor 401, a memory 402, a user interface 403, and at least one network interface 404.
  • the various components in the training device 40 of the gesture recognition model are coupled together through the bus system 405.
  • the bus system 405 is used to implement connection and communication between these components.
  • In addition to the data bus, the bus system 405 also includes a power bus, a control bus, and a status signal bus; for clarity, the various buses are all marked as the bus system 405 in FIG. 4.
  • the user interface 403 may include a display, a keyboard, a mouse, a trackball, a click wheel, keys, buttons, a touch panel, or a touch screen.
  • the memory 402 may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory can be a read-only memory (ROM, Read Only Memory), a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), a flash memory, and the like.
  • the volatile memory may be random access memory (RAM, Random Access Memory), which is used as an external cache.
  • many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory), and synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory).
  • the memory 402 in the embodiment of the present application can store data to support the operation of the terminal (such as 40-1).
  • data include: any computer program used to operate on a terminal (such as 40-1), such as an operating system and application programs.
  • the operating system contains various system programs, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks.
  • Applications can include various applications.
  • the image processing device provided by the embodiment of the present application may be directly embodied as a combination of software modules executed by the processor 401, and the software modules may be located in a storage medium.
  • the storage medium is located in the memory 402
  • the processor 401 reads the executable instructions included in the software module in the memory 402, and, in combination with necessary hardware (for example, including the processor 401 and other components connected to the bus 405), completes the training method of the gesture recognition model provided in the embodiment of the present application.
  • the processor 401 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like, where the general-purpose processor may be a microprocessor or any conventional processor.
  • the device provided in the embodiment of the application may also be implemented directly by the processor 401 in the form of a hardware decoding processor, for example, executed by one or more Application Specific Integrated Circuits (ASIC), DSPs, Programmable Logic Devices (PLD), Complex Programmable Logic Devices (CPLD), Field-Programmable Gate Arrays (FPGA) or other electronic components.
  • the memory 402 in the embodiment of the present application is used to store various types of data to support the operation of the image processing device 40. Examples of these data include any executable instructions for operating on the image processing device 40; a program that implements the method for training a gesture recognition model of the embodiment of the present application may be included in the executable instructions.
  • In practical applications, the gesture recognition model is not limited to the scenarios or fields mentioned below:
  • the terminal is provided with a client.
  • the client can be a game client, a human body 3D modeling client, etc.
  • the terminal is also provided with a graphical interface, an image acquisition device, and a processing chip; images containing the human body are collected through the image acquisition device, and the two-dimensional and three-dimensional postures of the human body in the images are recognized based on the gesture recognition model.
  • the terminal displays the actions of game characters through the graphical interface, so that the user can imitate the actions of the characters displayed on the terminal; images of the actions made by the user are collected through the image acquisition device, the two-dimensional posture of the human body in the images is recognized based on the gesture recognition model, and the game is evaluated, for example scored, based on the similarity between the recognition result and the actions of the characters in the game.
  • the terminal collects an image containing the user through the image acquisition device, recognizes the three-dimensional posture of the human body in the image based on the posture recognition model to construct a three-dimensional human body model corresponding to the user, and drives the constructed three-dimensional human body model to perform the same actions as the user, realizing somatosensory interaction of the user in the game.
  • the intelligent robot is equipped with an image acquisition device and a processing chip.
  • the image acquisition device can collect images of the area in front of the intelligent robot, and the processing chip recognizes the posture of the human body in the area images based on the gesture recognition model; based on the recognized posture, the intelligent robot is controlled to make a preset response. For example, when the recognized human body posture is a waving gesture, the intelligent robot is controlled to make a welcome gesture.
  • the unmanned vehicle is equipped with an image acquisition device and a processing chip.
  • the image acquisition device can collect the front image of the unmanned vehicle during driving.
  • the processing chip recognizes the human body posture (two-dimensional and/or three-dimensional) in the image based on the posture recognition model, in order to determine whether there is a person in front and the position of that person, so as to control the unmanned vehicle to slow down or brake.
  • the medical equipment is equipped with an image acquisition device and a processing chip.
  • the image acquisition device can collect the user's image.
  • the processing chip recognizes the three-dimensional human body posture in the image based on the gesture recognition model to construct a three-dimensional human body model corresponding to the user, and performs medical analysis based on the constructed three-dimensional human body model.
  • the monitoring system includes a front-end image acquisition device and a back-end image processing device.
  • the image acquisition device collects an image containing the user and sends it to the image processing device.
  • the image processing device recognizes the human body posture (two-dimensional and/or three-dimensional) in the image, and performs target tracking, posture analysis, and early warning based on the recognition results.
  • FIG. 5 is a schematic diagram of the structure of the gesture recognition model provided by the embodiment of the present application; see FIG. 5.
  • the gesture recognition model includes: a feature map model 51, a feature map cropping unit 52, a two-dimensional model 53, and a three-dimensional model 54. The feature map model 51 is used to perform feature extraction on the input image to obtain a corresponding feature map; the feature map cropping unit 52 is used to crop the feature map output by the feature map model to obtain the target human body feature map; the two-dimensional model 53 is used to recognize the feature map output by the feature map model and output two-dimensional key point parameters used to characterize the two-dimensional human body posture; the three-dimensional model 54 is used to process the target human body feature map cropped by the feature map cropping unit and the two-dimensional key point parameters output by the two-dimensional model, and output three-dimensional posture parameters used to characterize the three-dimensional human body posture.
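  • As an illustrative sketch only (not the network definition given in this application), the four components in FIG. 5 can be wired together as follows; the class and argument names (PoseRecognitionModel, crop_fn, etc.) are hypothetical placeholders, assuming a PyTorch-style implementation.

```python
import torch
import torch.nn as nn

class PoseRecognitionModel(nn.Module):
    """Illustrative wiring of the four components of FIG. 5 (hypothetical names)."""

    def __init__(self, feature_map_model, two_dim_model, three_dim_model, crop_fn):
        super().__init__()
        self.feature_map_model = feature_map_model  # 51: feature extraction backbone
        self.two_dim_model = two_dim_model          # 53: outputs PAF + heat maps
        self.three_dim_model = three_dim_model      # 54: outputs shape + pose parameters
        self.crop_fn = crop_fn                      # 52: crops the target human body feature map

    def forward(self, image):
        feature_map = self.feature_map_model(image)           # feature map of the input image
        paf, heatmap = self.two_dim_model(feature_map)         # 2D key point parameters
        target_feat = self.crop_fn(feature_map, paf, heatmap)  # target human body feature map
        x = torch.cat([target_feat, heatmap], dim=1)           # splice along the channel axis
        shape, pose = self.three_dim_model(x)                  # 3D posture parameters
        return paf, heatmap, shape, pose
```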
  • Fig. 6 is a schematic flowchart of a training method for a gesture recognition model provided by an embodiment of the present application.
  • the training method may be implemented by a server or a terminal, or implemented by the server and the terminal in cooperation.
  • Taking the server implementation as an example, such as the server 200 in FIG. 3, and with reference to FIG. 5 and FIG. 6, the training method of the gesture recognition model provided by the embodiment of the present application includes:
  • Step 601 Use the feature map model included in the gesture recognition model to process a sample image marked with key points of the human body to obtain a feature map corresponding to the sample image.
  • the server inputs the sample image with the key points of the human body into the feature map model included in the gesture recognition model, and then uses the feature map model to output a feature map corresponding to the sample image.
  • a sample image for model training needs to be constructed.
  • the sample image contains a human body
  • the server labels the human body key points on the sample image containing the human body.
  • multiple configuration scenarios are included. Different configuration scenarios correspond to different numbers of human body key points in the key point set.
  • the server determines the human body key points corresponding to the current configuration scenario from the key point set according to the type of the configuration scenario, and, based on the determined human body key points, annotates the human body key points on the sample image containing the human body with reference to the key point set.
  • the key point set includes: reference key points for locating human body parts, and extended key points that cooperate with the reference key points to characterize different three-dimensional postures of the part.
  • the benchmark key points can be the 17 human key points provided in the COCO data set
  • the extended key points can be used to synergize with one or more of these 17 human key points to characterize different three-dimensional poses of the part.
  • the extended key points can be the top key points and/or the chin key points, which cooperate with the nose key points (reference key points) to represent the movements of raising, nodding, and turning the head;
  • the extended key points can be at least one of the thumb key points, the palm key points, and the middle finger key points, which together with the wrist key points (reference key points) represent the three-dimensional hand posture.
  • the extended key point can be the crotch midpoint key point, which together with the left crotch key point and/or the right crotch key point (reference key points) represents the three-dimensional posture of the waist, such as a waist twist.
  • the number of extended key points is 16, which together with the 17 human key points provided by the COCO data set form a 33 key point set.
  • the extended key points can be obtained by means such as point interpolation or separate recognition. For example, using interpolation, the midpoint between the left crotch key point and the right crotch key point is taken as the crotch midpoint key point, and the midpoint between the left shoulder key point and the right shoulder key point is taken as the neck key point (Huagai point). The hand and foot key points of the human body can be obtained through separate recognition: specifically, a hand and/or foot recognition model can be constructed, or one from the related technology can be used, which takes an image containing the hand or foot as input and outputs the corresponding extended key point information.
  • FIG. 7 is a schematic diagram of the process of obtaining the extended key points of the hand provided in an embodiment of this application. See FIG. 7. First, the image containing the human body is cropped to obtain an image of the hand; then the cropped image is input into the hand key point model to obtain a hand key point set including thumb key points, palm key points, and middle finger key points.
  • the image of the hand can be cropped in the following manner: centered on the wrist key point, with the distance between the wrist key point and the corresponding shoulder key point as the side length, or with the distance between the wrist key point and the corresponding elbow key point as the side length, a square image containing the hand is cropped as the input of the hand key point model.
  • Similarly, the image of the foot can be cropped in the following manner: centered on the ankle key point, with the distance between the ankle key point and the corresponding knee key point as the side length, or with the distance between the ankle key point and the corresponding hip key point as the side length, a square image containing the foot is cropped as the input of the foot key point model.
  • In Figure 7, number 2 corresponds to the right shoulder key point of the human body, number 3 corresponds to the right elbow key point, and number 4 corresponds to the right wrist key point. Centered on the right wrist key point, and with the length of the line 4-2 between the right wrist key point and the right shoulder key point as the side length, the image is cropped to obtain a square image containing the right hand, as sketched in the example below.
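  • A minimal sketch of the square cropping described above, assuming images are numpy arrays indexed as (height, width, channels) and key points are (x, y) pixel coordinates; the function name crop_square_around is hypothetical.

```python
import numpy as np

def crop_square_around(image, center_xy, ref_xy):
    """Crop a square patch centered on `center_xy` (e.g. the wrist key point),
    with side length equal to the distance to `ref_xy` (e.g. the shoulder or
    elbow key point), as input for the hand key point model."""
    h, w = image.shape[:2]
    cx, cy = center_xy
    side = np.linalg.norm(np.asarray(center_xy, float) - np.asarray(ref_xy, float))
    half = max(int(side) // 2, 1)
    x0, x1 = max(int(cx) - half, 0), min(int(cx) + half, w)
    y0, y1 = max(int(cy) - half, 0), min(int(cy) + half, h)
    return image[y0:y1, x0:x1]

# Example: crop the right-hand patch around the right wrist (number 4 in Figure 7),
# using the wrist-to-shoulder distance (line 4-2) as the side length.
# hand_patch = crop_square_around(frame, right_wrist_xy, right_shoulder_xy)
```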
  • FIGS 8A to 8D are schematic diagrams of human body key points corresponding to different types of configuration scenarios provided by embodiments of the application.
  • The types of configuration scenarios can include four types, respectively corresponding to different numbers of human body key points in the key point set. FIG. 8A is a schematic diagram of the human body key points corresponding to the first configuration scenario provided by the embodiment of the application; the first configuration scenario can correspond to all the key points in the key point set (that is, 33 human body key points);
  • FIG. 8B is a schematic diagram of the human body key points corresponding to the second configuration scenario provided by an embodiment of the application.
  • the second configuration scenario can correspond to the 20 human body key points of the upper body in the key point set;
  • Figure 8C is a schematic diagram of the human body key points corresponding to the third configuration scenario provided by the embodiment of the application; the third configuration scenario can correspond to the 8 upper-body human body key points in the key point set;
  • FIG. 8D is a schematic diagram of the human body key points corresponding to the fourth configuration scenario provided by an embodiment of the application; the fourth configuration scenario corresponds to the 15 whole-body human body key points in the key point set. An illustrative mapping is given below.
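  • For illustration only, the four configuration scenarios and their key point counts from FIGS. 8A to 8D can be kept in a simple lookup table; the scenario labels below are hypothetical, and only the counts (33, 20, 8, 15) come from the figures.

```python
# Hypothetical mapping from configuration scenario to the number of key points
# selected from the 33-key-point set (counts per FIGS. 8A-8D).
KEYPOINT_CONFIG = {
    "scenario_1_full_set":   33,  # FIG. 8A: all key points in the set
    "scenario_2_upper_body": 20,  # FIG. 8B: upper-body key points
    "scenario_3_upper_2d":   8,   # FIG. 8C: simple upper-body posture
    "scenario_4_whole_body": 15,  # FIG. 8D: whole-body key points
}

def select_keypoints(scene_type, keypoint_set):
    """Pick the first N key points for the given scenario (illustrative only;
    the actual subset membership is defined by the figures, not by index order)."""
    n = KEYPOINT_CONFIG[scene_type]
    return keypoint_set[:n]
```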
  • the feature map model included in the gesture recognition model may be a neural network model based on the Darknet framework.
  • the feature map model is used to perform feature extraction on the input image (such as a sample image) to obtain a corresponding feature map.
  • Figure 9 is a schematic diagram of feature map extraction provided by an embodiment of the application. See Figure 9.
  • In Figure 9, the Darknet framework is used, and feature extraction can be performed based on bounding boxes (bbox, Bounding box) obtained using a sliding window.
  • Bounding box refers to a rectangular box with the smallest area that can just surround the object. It is used to calibrate the position and relative size of the object and define the range that an object occupies in the image.
  • Step 602 Process the feature map through the two-dimensional model included in the gesture recognition model to obtain two-dimensional key point parameters for representing the two-dimensional human posture.
  • the server inputs the feature map into the two-dimensional model included in the gesture recognition model, and outputs two-dimensional key point parameters used to characterize the two-dimensional human posture.
  • the two-dimensional model may be a convolutional neural network model
  • the output two-dimensional key point parameters may include the Part Affinity Fields (PAF) parameters of the human body key points and the heat map (Heatmap) of the human body key points.
  • the PAF parameter of the key points of the human body can be a two-dimensional direction vector corresponding to the key points of the human body, which represents the position and direction of the bone joints (limbs) of the human body.
  • the PAF parameters determine the attribution of the key points of the human body.
  • the PAF parameters of the key points of the human body may include the coordinate parameters of the key points of the human body.
  • the heat map of the human body key points refers to a grayscale image, of the original image size, in which each key point position is represented by a circular Gaussian, that is, the probability that a pixel in the input feature map belongs to a human body key point. FIG. 10 is a schematic diagram of the heat map of human body key points provided by the embodiments of this application; refer to FIG. 10.
  • Taking the left elbow as an example, the heat map represents the probability that each pixel is the left elbow key point; that is, the probability reflects how likely the left elbow key point is to appear at that pixel: the closer a pixel is to the left elbow key point, the higher the probability, and the farther away, the lower the probability. In other words, the probability that a pixel (number 2 in Figure 10) is the left elbow key point, as a function of the relative position of that pixel to the center point (number 1 in Figure 10), obeys a Gaussian distribution. A sketch of generating such a heat map follows.
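  • A minimal sketch of generating such a circular-Gaussian heat map for a single key point, assuming numpy; the sigma value is an arbitrary illustrative choice.

```python
import numpy as np

def keypoint_heatmap(height, width, center_xy, sigma=4.0):
    """Gray-scale heat map where each pixel value is the probability-like score of
    being the key point; values fall off as a Gaussian of the distance to the center
    point (number 1 in Figure 10), matching the description above."""
    xs = np.arange(width)[None, :]   # column coordinates, shape (1, W)
    ys = np.arange(height)[:, None]  # row coordinates, shape (H, 1)
    cx, cy = center_xy
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Example: heat map of the left elbow key point at pixel (120, 80)
# in a 192 x 256 (height x width) map.
left_elbow_map = keypoint_heatmap(192, 256, center_xy=(120, 80))
```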
  • Step 603 Use the three-dimensional model included in the gesture recognition model to process the target human body feature map cut out from the feature map and the two-dimensional key point parameters to obtain three-dimensional posture parameters for representing the three-dimensional human posture .
  • the server inputs the target human body feature map cropped from the feature map together with the two-dimensional key point parameters into the three-dimensional model included in the gesture recognition model, and outputs three-dimensional posture parameters used to characterize the three-dimensional human body posture.
  • The three-dimensional posture parameters output by the three-dimensional model correspond to a single human body. Therefore, before the feature map is input into the three-dimensional model, if the sample image includes multiple human bodies, the feature map output by the feature map model can be cropped for the target human body.
  • the server can use the following methods to tailor the feature map:
  • the server determines the target human body in the feature map based on the two-dimensional key point parameters output by the two-dimensional model, and tailors the feature map according to the determined target human body to obtain the feature map of the target human body.
  • The sample image input to the feature map model can contain multiple human bodies. Based on the two-dimensional key point parameters recognized by the two-dimensional model, the human body to which each human body key point belongs is determined, and the feature map is then cropped for a single human body to obtain the feature map corresponding to that single human body.
  • the three-dimensional model may be a convolutional neural network model
  • the server splices (Concat) the cropped target human body feature map with the heat map of the human body key points output by the two-dimensional model, that is, the heat map and the feature map are spliced as two matrices, and the splicing result is input into the three-dimensional model.
  • the three-dimensional posture parameters output by the three-dimensional model include the shape parameters and the pose (morphological) parameters of the human body; the shape parameters can represent the height, weight and similar attributes of the human body, and the pose parameters can represent the posture of the human body. Based on the three-dimensional posture parameters of the human body, a three-dimensional skinned model of the human body can be constructed. The splicing step can be sketched as follows.
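  • As a sketch of the splicing step only, assuming the cropped target human body feature map has already been resized to the heat map resolution so that the two arrays share spatial dimensions; the array shapes below are illustrative.

```python
import numpy as np

# target_feat: cropped target human body feature map, shape (C_feat, H, W)
# heatmaps:    heat maps of the human body key points,  shape (C_kpt,  H, W)
target_feat = np.random.rand(64, 46, 46).astype(np.float32)
heatmaps = np.random.rand(17, 46, 46).astype(np.float32)

# Concat splicing along the channel axis: the two matrices are stacked so the
# three-dimensional model sees both appearance features and 2D key point evidence.
three_dim_input = np.concatenate([target_feat, heatmaps], axis=0)  # shape (81, 46, 46)
```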
  • Step 604 Combine the two-dimensional key point parameters and the three-dimensional pose parameters to construct a target loss function.
  • the target loss function of the gesture recognition model includes the first loss function corresponding to the three-dimensional model. The server can construct the first loss function in the following manner: the server determines the corresponding two-dimensional key point information based on the three-dimensional posture parameters output by the three-dimensional model, and constructs the first loss function corresponding to the three-dimensional model by combining the two-dimensional key point parameters output by the two-dimensional model with the obtained two-dimensional key point information. It can be seen that the construction of the first loss function corresponding to the three-dimensional model adopts two-dimensional key point information as a constraint, which makes the output accuracy of the three-dimensional model higher.
  • Specifically, the server calculates the positions of the two-dimensional human body key points from the shape parameters and pose parameters included in the three-dimensional posture parameters through a projection matrix function, and then constructs the first loss function corresponding to the three-dimensional model based on the difference between the annotated key point positions in the key point set and the two-dimensional key point positions calculated from the three-dimensional posture parameters, and the difference between the two-dimensional key point positions output by the two-dimensional model and the two-dimensional key point positions calculated from the three-dimensional posture parameters.
  • the constructed first loss function Loss1 can be:
  • Loss1 = a·v·(Xgt - r(Xp))² + b·(X2dp - r(Xp))²    (1)
  • where a and b are the weight coefficients in the first loss function; v indicates whether the human body key point X is visible in the two-dimensional image; Xp denotes the three-dimensional posture parameters output by the three-dimensional model, namely the shape parameters and pose parameters; r(Xp) represents the two-dimensional position of the human body key point back-projected from the three-dimensional posture parameters through the projection matrix function r(·); Xgt represents the annotated position of the human body key point X in the key point set; and X2dp is the position of the human body key point X predicted by the two-dimensional model.
  • the constructed first loss function corresponding to the three-dimensional model uses the two-dimensional human body posture information as a constraint. In this way, the accuracy of the three-dimensional posture parameters output by the three-dimensional model can be improved.
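  • A numpy sketch of equation (1); the projection function r(·) is not given a concrete form here, so its output is passed in directly, and the weights a and b are illustrative defaults.

```python
import numpy as np

def loss1(x_gt, x_2dp, x_proj, visible, a=1.0, b=1.0):
    """First loss function per equation (1):
    a*v*(Xgt - r(Xp))^2 + b*(X2dp - r(Xp))^2, summed over all key points.

    x_gt:    annotated 2D key point positions, shape (K, 2)
    x_2dp:   2D key point positions predicted by the 2D model, shape (K, 2)
    x_proj:  r(Xp), 2D positions back-projected from the 3D posture parameters, shape (K, 2)
    visible: v, 1 if the key point is visible in the 2D image else 0, shape (K,)
    """
    term_gt = visible[:, None] * (x_gt - x_proj) ** 2   # constraint against annotations
    term_2d = (x_2dp - x_proj) ** 2                     # constraint against the 2D model output
    return float(np.sum(a * term_gt + b * term_2d))
```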
  • the target loss function of the gesture recognition model further includes the loss function corresponding to the two-dimensional model and the second loss function corresponding to the three-dimensional model; accordingly, the server can construct the loss function corresponding to the two-dimensional model and the second loss function corresponding to the three-dimensional model in the following manner:
  • For example, in practical applications, the constructed loss function Loss2 corresponding to the two-dimensional model can be:
  • Loss2 = (PAF − PAF')² + (heatmap − heatmap')²  (3)
  • where (PAF − PAF')² represents the difference between the PAF parameters output by the two-dimensional model and the PAF parameters of the corresponding human body key points in the sample image, and (heatmap − heatmap')² represents the difference between the heat map output by the two-dimensional model and the heat map of the corresponding human body key points in the sample image.
  • For example, in practical applications, the constructed second loss function Loss3 corresponding to the three-dimensional model can be:
  • Loss3 = (β − β')² + (θ − θ')²  (4)
  • where β is the shape parameter of the human body and θ is the pose parameter of the human body; (β − β')² represents the difference between the shape parameters output by the three-dimensional model and the shape parameters of the corresponding human body in the sample image, and (θ − θ')² represents the difference between the pose parameters output by the three-dimensional model and the pose parameters of the corresponding human body in the sample image.
  • Based on the above description of the loss function of the two-dimensional model and the loss functions of the three-dimensional model included in the pose recognition model, in some embodiments the target loss function of the pose recognition model may be:
  • Loss = (PAF − PAF')² + (heatmap − heatmap')² + (β − β')² + (θ − θ')² + a·v·(Xgt − r(Xp))² + b·(X2dp − r(Xp))²  (5)
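  • As an illustration only, the target loss (5) can be assembled from its individual terms roughly as follows; the argument names are placeholders for the PAF and heatmap outputs of the two-dimensional model, the shape/pose outputs of the three-dimensional model, their counterparts derived from the annotated sample image, and the value of Loss1 from the sketch above.

```python
import torch.nn.functional as F

def target_loss(paf, paf_gt, heatmap, heatmap_gt, beta, beta_gt, theta, theta_gt, loss1):
    """Sketch of Loss = Loss2 + Loss3 + Loss1, following equations (3), (4) and (5)."""
    loss2 = F.mse_loss(paf, paf_gt) + F.mse_loss(heatmap, heatmap_gt)  # 2D model, eq. (3)
    loss3 = F.mse_loss(beta, beta_gt) + F.mse_loss(theta, theta_gt)    # 3D model, eq. (4)
    return loss2 + loss3 + loss1                                       # total, eq. (5)
```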
  • Step 605: Update the model parameters of the pose recognition model based on the target loss function.
  • In some embodiments, the server can update the model parameters of the pose recognition model in the following manner:
  • The server determines the value of the target loss function based on the two-dimensional key point parameters output by the two-dimensional model and the three-dimensional posture parameters output by the three-dimensional model, and judges whether the value of the target loss function exceeds a preset threshold. When the value of the target loss function exceeds the preset threshold, an error signal of the pose recognition model is determined based on the target loss function, the error signal is back-propagated in the pose recognition model, and the model parameters of each layer are updated during the propagation.
  • Back propagation is briefly explained here: training sample data is input to the input layer of the neural network model, passes through the hidden layers and finally reaches the output layer, where the result is output; this is the forward propagation process. Since there is an error between the output of the neural network model and the actual result, the error between the output result and the actual value is calculated and propagated backward from the output layer through the hidden layers to the input layer, and the model parameters are adjusted according to the error during the back propagation; this process is iterated until convergence.
  • Taking the target loss function (5) of the pose recognition model as an example, the server determines the error signal based on the target loss function and back-propagates it layer by layer from the output layers of the two-dimensional model and the three-dimensional model. When the error signal reaches a layer, the gradient (that is, the partial derivative of the Loss function with respect to the parameters of that layer) is solved by combining the propagated error signal, and the parameters of that layer are updated with the corresponding gradient values.
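  • A minimal sketch of the joint update step described above might look as follows; feature_model, model_2d, model_3d and compute_target_loss are placeholders for the three sub-networks and for equation (5), and the single-person cropping is reduced to a stub.

```python
import torch

def crop_target_human(feature_map, heatmap):
    """Stub: a real implementation would use the 2D key points to locate the
    target person and cut that region out of the shared feature map."""
    return feature_map

def train_step(feature_model, model_2d, model_3d, optimizer,
               sample_image, labels, compute_target_loss, threshold=0.0):
    """One joint parameter update of the 2D and 3D branches (a sketch)."""
    feature_map = feature_model(sample_image)               # forward: shared feature map
    paf, heatmap = model_2d(feature_map)                    # forward: 2D branch
    target_crop = crop_target_human(feature_map, heatmap)   # single-person region
    shape, pose = model_3d(torch.cat([target_crop, heatmap], dim=1))  # forward: 3D branch

    loss = compute_target_loss(paf, heatmap, shape, pose, labels)     # value of equation (5)
    if loss.item() > threshold:       # update only when the loss exceeds the preset threshold
        optimizer.zero_grad()
        loss.backward()               # back-propagate the error signal
        optimizer.step()              # update each layer with its gradient value
    return loss.item()
```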
  • FIG. 11 is a schematic flowchart of an image recognition method based on a pose recognition model provided by an embodiment of the present application. As shown in FIG. 11, the method mainly includes three stages, namely a data preparation stage, a model training stage and a model application stage, which are explained separately below.
  • 1. Data preparation stage. The data preparation stage mainly covers the construction of a key point set containing 33 human body key points, and the selection of the different numbers of human body key points corresponding to different types of configuration scenarios (settings).
  • The key point set used in this embodiment is based on the 17 human body key points of the COCO data set (or the 18 human body key points of Openpose). Key points for the top of the head and the chin are added to facilitate characterizing the rotation of nodding and raising the head; key points for the middle finger and the thumb are added at the wrist, which together with the palm key point characterize the rotation of the wrist; in order to be compatible with the root point located at the mid-crotch that is common in three-dimensional models, and with the related bone information, a crotch midpoint is added; the foot likewise uses the heel, left toe and right toe to characterize its three-dimensional information. The key point set contains a total of 33 human body key points; the newly added extension key points allow the two-dimensional pose recognition process to carry more information about the three-dimensional rotation of the limbs.
  • In practical implementation, the constructed key point set includes the 17 human body key points provided by the COCO data set, and the remaining 16 human body key points are extension key points. Based on the 17 base key points, extension key points can be obtained by mean-based point addition, the extension key points of the hands and feet can be obtained by separate detection, and the data are then fused to obtain the 33-point data.
  • For example, the crotch midpoint can be calculated from the left hip key point and the right hip key point, and the neck (Huagai point) can be calculated from the left shoulder key point and the right shoulder key point.
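  • A minimal sketch of the mean-based point addition mentioned above, assuming key points are given as (x, y) coordinates keyed by part name; the part names are illustrative.

```python
def add_mean_points(kpts):
    """Derive extension key points as midpoints of existing base key points."""
    def midpoint(a, b):
        return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)

    kpts["crotch_mid"] = midpoint(kpts["left_hip"], kpts["right_hip"])      # mid-crotch root point
    kpts["neck"] = midpoint(kpts["left_shoulder"], kpts["right_shoulder"])  # neck (Huagai point)
    return kpts

example = add_mean_points({
    "left_hip": (210.0, 400.0), "right_hip": (260.0, 402.0),
    "left_shoulder": (200.0, 220.0), "right_shoulder": (270.0, 222.0),
})
print(example["crotch_mid"], example["neck"])
```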
  • In practical implementation, the extension key points can be obtained using the hand and foot key point detection models provided in the related art, or a detection model for recognizing the extension key points of the corresponding hand and foot parts can be trained separately, so that the detection model is able to output the corresponding extension key point information from an input image containing a hand or a foot. In practical applications, different settings use sample images annotated with different numbers of human body key points for training the pose recognition model; for example, for a setting with 20 upper-body key points, only those 20 key points need to be annotated in the sample images, avoiding the waste of computing resources caused by annotating all 33 key points.
  • 2. Model training stage. In some embodiments, the trained model is a fully convolutional network (FCN) model, which includes three parts, namely a feature map model (for example, DarknetX), a two-dimensional model and a three-dimensional model, so as to realize joint training of the two-dimensional model and the three-dimensional model.
  • During model training, the corresponding setting is selected according to the business requirements, that is, the corresponding human body key point configuration is selected. The feature map output by DarknetX is input into the two-dimensional model, which is trained with an L2 loss on the PAF and heatmap outputs; the position and direction information of the two-dimensional human body key points is obtained through NMS and the PAF Grouping operation, and the person to whom each human body key point belongs is determined. When jointly training the three-dimensional model, the feature map output by DarknetX needs to be cropped to obtain the target human body feature map of a single human body, and the target human body feature map is then concatenated (Concat) with the heat map output by the two-dimensional model as the input of the three-dimensional model. This mainly uses the two-dimensional human body key points to reduce the amount of computation required by the three-dimensional model, which only needs to regress the single target person, while sharing and reusing the feature map output by DarknetX.
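  • The cropping and concatenation step can be sketched as follows; the bounding box of the target person is assumed to already have been derived from the two-dimensional key points, and tensors are assumed to be in NCHW layout with matching spatial size.

```python
import torch
import torch.nn.functional as F

def build_3d_input(feature_map, heatmap, box, out_size=(56, 56)):
    """Crop the shared feature map to the target person and concatenate it with the
    2D heatmaps to form the input of the three-dimensional model (a sketch).

    feature_map: (1, C, H, W) feature map output by DarknetX
    heatmap:     (1, K, H, W) key point heatmaps output by the two-dimensional model
    box:         (x1, y1, x2, y2) target-person bounding box in feature-map coordinates
    """
    x1, y1, x2, y2 = box
    crop_f = F.interpolate(feature_map[:, :, y1:y2, x1:x2], size=out_size,
                           mode="bilinear", align_corners=False)
    crop_h = F.interpolate(heatmap[:, :, y1:y2, x1:x2], size=out_size,
                           mode="bilinear", align_corners=False)
    return torch.cat([crop_f, crop_h], dim=1)   # channel-wise Concat; shared features are reused

x = build_3d_input(torch.randn(1, 256, 52, 52), torch.randn(1, 33, 52, 52), (10, 5, 40, 50))
print(x.shape)   # torch.Size([1, 289, 56, 56])
```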
  • 3. Model application stage. The three-dimensional human body information output by the model can be used for three-dimensional pose recognition of the human body and for driving a three-dimensional skinned model; for example, a three-dimensional character model is driven to synchronize the user's actions according to the obtained three-dimensional posture parameters (shape, pose) of the user.
  • The two-dimensional human body information output by the model can be used for two-dimensional pose recognition of the human body; in practical applications, it can be used for static action recognition and temporal action recognition. For example, the terminal screen displays the actions of an animated character, the terminal collects images of the user imitating the animated character's actions, performs two-dimensional pose recognition, and gives a score according to how closely the user's actions match.
  • Next, the application of the pose recognition model trained in the embodiments of the present application is described. In some embodiments, the pose recognition model can be used for image recognition. Taking a terminal provided with an image recognition client as an example, FIG. 12 is a schematic flowchart of image recognition using the pose recognition model provided by an embodiment of the present application.
  • Referring to FIG. 12, the terminal inputs the to-be-recognized image containing a human body into the feature map model included in the pose recognition model, which outputs the feature map corresponding to the to-be-recognized image; the feature map is input into the two-dimensional model included in the pose recognition model, which outputs the two-dimensional key point parameters used to characterize the two-dimensional human body posture, and the two-dimensional key point parameters are used to recognize the two-dimensional posture of the human body; the target human body feature map cropped from the feature map and the two-dimensional key point parameters are input into the three-dimensional model included in the pose recognition model, which outputs the three-dimensional posture parameters used to characterize the three-dimensional human body posture, and the three-dimensional posture parameters are used to recognize the three-dimensional posture of the human body.
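  • Put together, the recognition flow of FIG. 12 could be expressed roughly as below; the three sub-models and the two helper callables are placeholders for whatever trained components and cropping/concatenation logic are actually used.

```python
def recognize(image, feature_model, model_2d, model_3d, crop_fn, concat_fn):
    """Sketch of the two-stage flow: 2D key points first, then 3D posture parameters."""
    feature_map = feature_model(image)                 # feature map of the image to be recognized
    keypoints_2d = model_2d(feature_map)               # PAFs + heatmaps -> 2D posture
    target_crop = crop_fn(feature_map, keypoints_2d)   # target human body feature map
    shape, pose = model_3d(concat_fn(target_crop, keypoints_2d))  # 3D posture parameters
    return keypoints_2d, (shape, pose)
```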
  • In some embodiments, after the terminal displays an image of a specific character's posture, the to-be-recognized image is collected. The terminal recognizes the two-dimensional posture of the human body in the to-be-recognized image based on the two-dimensional key point parameters output by the two-dimensional model, matches the recognized two-dimensional posture against the specific character's posture for similarity to obtain a matching result, and outputs prompt information characterizing the matching result.
  • For example, taking a terminal provided with a dance game client, FIG. 13 is a schematic diagram of an application scenario of the pose recognition model provided by an embodiment of the present application. Referring to FIG. 13, the terminal displays the actions of an animated character, i.e., the specific character posture, through the dance game client; the user makes the corresponding actions according to the action prompts on the terminal screen; the terminal collects the user's action image, i.e., the to-be-recognized image, inputs it into the pose recognition model to perform two-dimensional human body posture recognition, matches the recognition result against the animated character's posture for similarity, and outputs corresponding prompt information according to the obtained similarity, such as a score or prompts like "great", "good" and "miss".
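  • A minimal sketch of the similarity matching and prompt selection in this scenario, assuming both poses are given as (K, 2) arrays of key point coordinates that are first normalized for translation and scale; the thresholds are purely illustrative.

```python
import numpy as np

def pose_similarity(user_kpts, ref_kpts):
    """Compare two 2D poses after removing translation and scale (a sketch)."""
    def normalize(p):
        p = np.asarray(p, dtype=float)
        p = p - p.mean(axis=0)                     # translation invariance
        return p / (np.linalg.norm(p) + 1e-8)      # scale invariance
    return float(np.sum(normalize(user_kpts) * normalize(ref_kpts)))  # cosine similarity

def prompt(similarity):
    """Map the similarity to the kind of feedback shown on screen."""
    if similarity > 0.95:
        return "great"
    if similarity > 0.85:
        return "good"
    return "miss"

ref = np.random.rand(33, 2)
print(prompt(pose_similarity(ref + 0.01 * np.random.randn(33, 2), ref)))
```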
  • In some embodiments, the terminal constructs a three-dimensional human body model corresponding to the target human body based on the three-dimensional posture parameters output by the three-dimensional model, and controls the three-dimensional human body model to perform a target action, where the target action matches the action performed by the target human body.
  • For example, taking a terminal provided with a three-dimensional human body model client, FIG. 14 is a schematic diagram of an application scenario of the pose recognition model provided by an embodiment of the present application. Referring to FIG. 14, the terminal collects user images to obtain the to-be-recognized image, inputs the to-be-recognized image into the pose recognition model to perform three-dimensional human body posture recognition, constructs a three-dimensional skinned model according to the output three-dimensional posture parameters, and controls the three-dimensional skinned model to synchronize the user's actions.
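  • As an illustration, driving an on-screen character from the recognized (shape, pose) parameters could look like the following; SkinnedModel is a hypothetical stand-in for whatever SMPL-style body model and rendering engine is actually used, and the parameter sizes are illustrative.

```python
import numpy as np

class SkinnedModel:
    """Hypothetical stand-in for an SMPL-style skinned body model."""
    def __init__(self, n_shape=10, n_pose=72):
        self.n_shape, self.n_pose = n_shape, n_pose
        self.shape = np.zeros(n_shape)
        self.pose = np.zeros(n_pose)

    def set_parameters(self, shape, pose):
        assert len(shape) == self.n_shape and len(pose) == self.n_pose
        self.shape, self.pose = np.asarray(shape), np.asarray(pose)
        # a real implementation would regress mesh vertices here and hand them to the renderer

def drive_character(model_3d_output, character):
    """Synchronize the character with the user's recognized 3D posture, frame by frame."""
    shape, pose = model_3d_output           # (shape, pose) from the three-dimensional model
    character.set_parameters(shape, pose)

drive_character((np.zeros(10), np.zeros(72)), SkinnedModel())
```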
  • Next, the apparatus provided by the embodiments of the present application, implemented as software units, is described. FIG. 15 is a schematic diagram of the composition of a training apparatus for a pose recognition model provided by an embodiment of the present application. Referring to FIG. 15, the training apparatus for a pose recognition model in an embodiment of the present application includes:
  • the first processing unit 151, configured to process a sample image annotated with human body key points through the feature map model included in the pose recognition model to obtain a feature map corresponding to the sample image;
  • the second processing unit 152, configured to process the feature map through the two-dimensional model included in the pose recognition model to obtain two-dimensional key point parameters used to characterize a two-dimensional human body posture;
  • the third processing unit 153, configured to process the target human body feature map cropped from the feature map and the two-dimensional key point parameters through the three-dimensional model included in the pose recognition model to obtain three-dimensional posture parameters used to characterize a three-dimensional human body posture;
  • the construction unit 154, configured to construct a target loss function by combining the two-dimensional key point parameters and the three-dimensional posture parameters;
  • the updating unit 155, configured to update the model parameters of the pose recognition model based on the target loss function.
  • the device further includes:
  • an annotation unit, configured to determine the human body key points from a key point set according to the type of the configuration scenario, and to annotate the sample image with reference to the key point set based on the determined human body key points.
  • In some embodiments, the key point set includes: reference key points used to locate human body parts, and extension key points that, together with the reference key points, characterize different three-dimensional postures of the parts to which they belong.
  • In some embodiments, the target loss function includes a first loss function corresponding to the three-dimensional model;
  • the construction unit is further configured to determine corresponding two-dimensional key point information based on the three-dimensional posture parameters, and to construct the first loss function by combining the two-dimensional key point parameters and the two-dimensional key point information.
  • In some embodiments, the target loss function further includes a loss function corresponding to the two-dimensional model and a second loss function corresponding to the three-dimensional model; the two-dimensional key point parameters include the partial affinity field parameters of the human body key points and the heat map of the human body key points, and the three-dimensional posture parameters include the shape parameters and pose parameters of the human body;
  • the construction unit is further configured to construct the loss function corresponding to the two-dimensional model by combining the difference between the partial affinity field parameters output by the two-dimensional model and the partial affinity field parameters of the corresponding human body key points in the sample image, and the difference between the heat map output by the two-dimensional model and the heat map of the corresponding human body key points in the sample image;
  • and to construct the second loss function corresponding to the three-dimensional model by combining the difference between the shape parameters output by the three-dimensional model and the shape parameters of the corresponding human body in the sample image, and the difference between the pose parameters output by the three-dimensional model and the pose parameters of the corresponding human body in the sample image.
  • the device further includes:
  • a cropping unit, configured to determine the target human body in the feature map based on the two-dimensional key point parameters, and to crop the feature map according to the target human body to obtain the target human body feature map.
  • In some embodiments, the updating unit is further configured to determine the value of the target loss function based on the two-dimensional key point parameters and the three-dimensional posture parameters; when the value of the target loss function exceeds a preset threshold, determine an error signal of the pose recognition model based on the target loss function; and back-propagate the error signal in the pose recognition model, updating the model parameters of each layer during the propagation.
  • FIG. 16 is a schematic diagram of the composition of an image recognition apparatus based on a pose recognition model provided by an embodiment of the present application. Referring to FIG. 16, the image recognition apparatus 160 based on a pose recognition model in an embodiment of the present application includes:
  • the first acquiring unit 161, configured to process the to-be-recognized image containing a human body through the feature map model included in the pose recognition model to obtain a feature map corresponding to the to-be-recognized image;
  • the second acquiring unit 162, configured to process the feature map through the two-dimensional model included in the pose recognition model to obtain two-dimensional key point parameters used to characterize a two-dimensional human body posture, the two-dimensional key point parameters being used to recognize the two-dimensional posture of the human body;
  • the third acquiring unit 163, configured to process the target human body feature map cropped from the feature map and the two-dimensional key point parameters through the three-dimensional model included in the pose recognition model to obtain three-dimensional posture parameters used to characterize a three-dimensional human body posture, the three-dimensional posture parameters being used to recognize the three-dimensional posture of the human body.
  • the device further includes:
  • a matching unit, configured to recognize the two-dimensional posture of the human body in the to-be-recognized image based on the two-dimensional key point parameters, the to-be-recognized image being collected based on an output image of a specific character's posture, and to match the recognized two-dimensional posture against the specific character's posture for similarity to obtain a matching result;
  • the prompt unit is used to output prompt information used to characterize the matching result.
  • the device further includes:
  • a human body model unit configured to construct a three-dimensional human body model corresponding to the target human body based on the three-dimensional posture parameters
  • the control unit is configured to control the three-dimensional human body model to execute a target action, and the target action matches the action performed by the target human body.
  • An embodiment of the present application further provides a storage medium storing executable instructions. When the executable instructions are executed by a processor, the processor is caused to execute the training method for a pose recognition model provided by the embodiments of the present application.
  • An embodiment of the present application further provides a storage medium storing executable instructions. When the executable instructions are executed by a processor, the processor is caused to execute the image recognition method based on a pose recognition model provided by the embodiments of the present application.
  • In some embodiments, the storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, a CD-ROM, or the like; it may also be any of various devices including one of, or any combination of, the foregoing memories.
  • In some embodiments, the executable instructions may take the form of programs, software, software modules, scripts or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as a standalone program or as a module, component, subroutine or other unit suitable for use in a computing environment.
  • As an example, the executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (for example, files storing one or more modules, subroutines or code sections).
  • As an example, the executable instructions may be deployed to be executed on a single computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Psychiatry (AREA)
  • Medical Informatics (AREA)
  • Social Psychology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

一种姿态识别模型的训练方法、图像识别方法及装置;姿态识别模型的训练方法包括:将标注有人体关键点的样本图像,输入姿态识别模型包括的特征图模型,输出对应样本图像的特征图;将特征图输入姿态识别模型包括的二维模型,输出用于表征二维人体姿态的二维关键点参数;将从特征图中剪裁出的目标人体特征图及二维关键点参数,输入姿态识别模型包括的三维模型,输出用于表征三维人体姿态的三维姿态参数;结合二维关键点参数及三维姿态参数,构建目标损失函数;基于目标损失函数,更新姿态识别模型的模型参数。

Description

姿态识别模型的训练方法、图像识别方法及装置
本申请要求于2019年4月12日提交中国专利局、申请号201910294734.8、申请名称为“姿态识别模型的训练方法、图像识别方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及机器学习技术领域,尤其涉及一种姿态识别技术。
背景技术
机器学习(ML,machine Learning)是人工智能的一个分支,其目的是让机器根据先验的知识进行学习,从而具有分类和判断的逻辑能力。以神经网络为代表的机器学习模型不断发展,被逐渐应用到人体姿态识别中,从而实现基于人体姿态的各种智能化应用。
相关技术中,用于识别人体二维姿态信息以及三维姿态信息的神经网络模型互不兼容,需要单独进行训练,训练所需计算资源量大、训练效率低。
发明内容
本申请实施例提供一种姿态识别模型的训练方法、基于姿态识别模型的图像识别方法、装置及存储介质,能够实现兼容识别人体二维姿态信息及三维姿态信息的模型。
本申请实施例提供一种姿态识别模型的训练方法,包括:
通过姿态识别模型包括的特征图模型,对标注有人体关键点的样本图像进行处理,获得对应所述样本图像的特征图;
通过所述姿态识别模型包括的二维模型,对所述特征图进行处理,获得用于表征二维人体姿态的二维关键点参数;
通过所述姿态识别模型包括的三维模型,对从所述特征图中剪裁出的目标人体特征图及所述二维关键点参数进行处理,获得用于表征三维人体姿态的三维姿态参数;
结合所述二维关键点参数及所述三维姿态参数,构建目标损失函数;
基于所述目标损失函数,更新所述姿态识别模型的模型参数。
本申请实施例还提供了一种基于姿态识别模型的图像识别方法,所述方法包括:
通过所述姿态识别模型包括的特征图模型,对包含人体的待识别图像进行处理,获得对应所述待识别图像的特征图;
通过所述姿态识别模型包括的二维模型,对所述特征图进行处理,获得用于表征二维人体姿态的二维关键点参数,所述二维关键点参数用于识别得到所述人体的二维姿态;
通过所述姿态识别模型包括的三维模型,对从所述特征图中剪裁出的目标人体特征图及所述二维关键点参数进行处理,获得用于表征三维人体姿态的三维姿态参数,所述三维姿态参数用于识别得到所述人体的三维姿态。
本申请实施例还提供了一种姿态识别模型的训练装置,包括:
第一处理单元,用于通过姿态识别模型包括的特征图模型,对标注有人体关键点的样本图像进行处理,获得对应所述样本图像的特征图;
第二处理单元,用于通过所述姿态识别模型包括的二维模型,对所述特征图进行处理,获得用于表征二维人体姿态的二维关键点参数;
第三处理单元,用于通过所述姿态识别模型包括的三维模型,对从所述特征图中剪裁出的目标人体特征图及所述二维关键点参数进行处理,获得用于表征三维人体姿态的三维姿态参数;
构建单元,用于结合所述二维关键点参数及所述三维姿态参数,构建目标损失函数;
更新单元,用于基于所述目标损失函数,更新所述姿态识别模型的模型参数。
上述方案中,所述装置还包括:
标注单元,用于根据配置场景的类型,从关键点集中确定所述人体关键点;
基于所述人体关键点,参照所述关键点集对所述样本图像进行标注。
上述方案中,所述关键点集包括:
用于定位人体部位的基准关键点、以及与所述基准关键点协同表征所属部位的多种三维姿态的扩展关键点。
上述方案中,所述目标损失函数包括对应所述三维模型的第一损失函数;
所述构建单元,还用于基于所述三维姿态参数,确定相应的二维关键点信息;
结合所述二维关键点参数以及所述二维关键点信息,构造所述第一损失函数。
上述方案中,所述目标损失函数还包括对应所述二维模型的损失函数及对应所述三维模型的第二损失函数;所述二维关键点参数包括:人体关键点的部分亲和字段参数及人体关键点的热力图,所述三维姿态参数包括:人体的形状参数及形态参数;
所述构建单元,还用于结合所述二维模型输出的部分亲和字段参数与相应人体关键点在样本图像中的部分亲和字段参数的差异、所述二维模型输出的热力图与相应人体关键点在样本图像中的热力图的差异,构建对应所述二维模型的损失函数;
结合所述三维模型输出的形状参数与相应人体在样本图像中的形状参数的差异、所述三维模型输出的形态参数与相应人体在样本图像中的形态参数的差异,构建对应所述三维模型的第二损失函数。
上述方案中,所述装置还包括:
剪裁单元,用于基于所述二维关键点参数,确定所述特征图中的目标人体;
根据所述目标人体对所述特征图进行剪裁,得到所述目标人体特征图。
上述方案中,所述更新单元,还用于基于所述二维关键点参数及所述三维姿态参数,确定所述目标损失函数的值;
当所述目标损失函数的值超出预设阈值时,基于所述目标损失函数确定所述姿态识别模型的误差信号;
将所述误差信号在所述姿态识别模型中反向传播,并在传播的过程中更新各个层的模型参数。
本申请实施例还提供了一种基于姿态识别模型的图像识别装置,所述装置包括:
第一获取单元,用于将包含人体的待识别图像,输入所述姿态识别模型包括的特征图模型,输出对应所述待识别图像的特征图;
第二获取单元,用于将所述特征图输入所述姿态识别模型包括的二维模型,输出用于表征二维人体姿态的二维关键点参数,所述二维关键点参数用于识别得到所述人体的二维姿态;
第三获取单元,用于将从所述特征图中剪裁出的目标人体特征图及所述二维关键点参数,输入所述姿态识别模型包括的三维模型,输出用于表征三维人体姿态的三维姿态参数,所述三维姿态参数用于识别得到所述人体的三维姿态。
上述方案中,所述装置还包括:
匹配单元,用于基于所述二维关键点参数,识别得到所述待识别图像中人体的二维姿态;所述待识别图像为基于输出的特定人物姿态的图像采集得到的;
将所述二维姿态与所述特定人物姿态进行相似度匹配,得到匹配结果;
提示单元,用于输出用于表征所述匹配结果的提示信息。
上述方案中,所述装置还包括:
人体模型单元,用于基于所述三维姿态参数,构建对应所述目标人体的三维人体模型;
控制单元,用于控制所述三维人体模型执行目标动作,所述目标动作与所述目标人体所执行的动作相匹配。
本申请实施例还提供了一种图像处理设备,包括:
存储器,用于存储可执行指令;
处理器,用于执行所述存储器中存储的可执行指令时,实现本申请实施例提供的任一项姿态识别模型的训练方法,或基于姿态识别模型的图像识别方法。
本申请实施例还提供了一种存储介质,存储有可执行指令,用于引起处理器执行时,实现本申请实施例提供的任一项姿态识别模型的训练方法,或基于姿态识别模型的图像识别方法。
应用本申请实施例具有以下有益效果:
在对姿态识别模型进行训练时,通过姿态识别模型包括的特征图模型,对标注有人体关键点的样本图像进行处理,获得对应样本图像的特征图。接着,通过姿态识别模型包括的二维模型,对特征图进行处理,获得用于表征二维人体姿态的二维关键点参数,以及通过姿态识别模型包括的三维模型,对从特征图中剪裁出的目标人体特征图及二维关键点参数进行处理,获得用于表征三维人体姿态的三维姿态参数。结合二维关键点参数及三维姿态参数构建目标损失函数,由于目标损失函数考虑了二维模型的输出结果(二维关键点参数)和三维模型的输出结果(三维姿态参数),这样,基于目标损失函数更新姿态识别模型的模型参数后,得到的姿态识别模型中的二维模型和三维模型可以输出较好的结果,即训练得到的姿态识别模型既能够输出人体二维姿态信息,又能够输出人体的三维姿态信息,实现了人体二维姿态信息及三维姿态信息的兼容。同时,对输出人体二维姿态信息及三维姿态信息的姿态识别模型的训练,采用一套训练样本,模型简单,训练效率高。
附图说明
图1为相关技术提供的二维关键点识别模型的训练方法示意图;
图2为相关技术提供的人体三维模型的训练方法示意图;
图3为本申请实施例提供的姿态识别模型的实施场景的示意图;
图4为本申请实施例提供的姿态识别模型的训练装置的组成结构示意图;
图5是本申请实施例提供的姿态识别模型的结构示意图;
图6是本申请实施例提供的姿态识别模型的训练方法的流程示意图;
图7为本申请实施例提供的获取手部扩展关键点的流程示意图;
图8A为本申请实施例提供的对应第一配置场景的人体关键点的示意图;
图8B为本申请实施例提供的对应第二配置场景的人体关键点的示意图;
图8C为本申请实施例提供的对应第三配置场景的人体关键点的示意图;
图8D为本申请实施例提供的对应第四配置场景的人体关键点的示意图;
图9为本申请实施例提供的进行特征图提取的示意图;
图10为本申请实施例提供的人体关键点热力图的示意图;
图11是本申请实施例提供的基于姿态识别模型的图像识别方法的流程示意图;
图12为本申请实施例提供的采用姿态识别模型进行图像识别的流程示意图;
图13为本申请实施例提供的姿态识别模型的应用场景示意图;
图14为本申请实施例提供的姿态识别模型的应用场景示意图;
图15为本申请实施例提供的姿态识别模型的训练装置的组成结构示意图;
图16为本申请实施例提供的基于姿态识别模型的图像识别装置的组成结构示意图。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。
在以下的描述中,所涉及的术语“第一\第二”仅仅是是区别类似的对象,不代表针对对象的特定排序,可以理解地,“第一\第二”在允许的情况下可以互换特定的顺序或先后次序,以使这里描述的本申请实施例能够以除了在这里图示或描述的以外的顺序实施。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
人体关键点是对人体姿态具有代表性的关键点,通过人体关键点可以识别人体姿态。在实际应用中,该人体关键点可以是人体的骨骼关键点,骨骼便为人体关键点之间的连线,如头部关键点、脖颈关键点、肩部关键点、手肘关键点、手腕关键点、脚腕关键点、膝盖关键点、胯部关键点及脚踝关键点等。
相关技术中,通过关键点识别模型识别人体姿态识别,从而实现基于人体姿态的各种智能化应用。
图1为相关技术提供的二维关键点识别模型的训练方法示意图,参见图1,二维关键点识别模型的训练所采用的训练样本取自COCO数据库(图像数据集),采用COCO数据库公开的17个人体关键点的方案进行训练,使用标注了17个人体关键点的图像数据作为 训练样本,样本数据通过深度学习网络(例如名称为Darknet的网络)提取特征图,然后经部分亲和字段(PAF,Part Affinity Fields)处理以及热力图(Heatmap)处理,使用损失函数2例如L2的Loss训练,通过非极大值抑制(NMS,Non-Maximum Suppression)以及聚合(Grouping)操作获得人体二维(2D,Two Dimension)关键点以及人体关键点的归属人。
这里对PAF进行说明,PAF处理用于多人体关键点检测,通过二维方向向量的集合,表示肢体的位置和方向(也代表了两个关键点的关联程度),进而解决人体关键点归属哪个人的问题。基于PAF得到的人体关键点的二维方向向量,进行Grouping操作,使得关键点分属于图像中第几个人得以确认,经Grouping操作,人体关键点可连成骨架。
在一些实施方式中,还可以采用Openpose(一种功能齐全的库)的18个人体关键点的方案以及基础8个人体关键点的方案进行人体二维姿态的识别。
图2为相关技术提供的人体三维模型的训练方法示意图,参见图2,采用蒙皮多人模型(SMPL,A Skinned Multi-Person Linear Model)的标准构建样本数据集,通过输入样本图像,该样本图像携带形状(shape)和形态(pose),输出SMPL 3D模型的参数(shape和pose)进行3D模型的训练,并使用L2Loss来回归参数。
通过上述对相关技术中人体的二维姿态信息的模型及人体三维模型的说明,可知:
对于二维关键点识别模型来说,无论是采用COCO数据库的17关键点方案,还是采用Openpose的18关键点方案,训练所采用的人体关键点总是一套,在应对不同业务时存在关键点信息的冗余以及缺陷,例如,只需要2D的上半身简单姿态信息的场景,只需要上半身8个关键点即可,此时采用17关键点或18关键点进行模型训练显然是关键点冗余,造成计算资源浪费。
对于人体三维模型来说,上述SMPL模型的训练所采用的模型参数为人体的形状(shape)参数及人体的形态(pose)参数,没有考虑二维信息的约束,如此训练得到的模型识别得到的人体的姿态动作会存在角度误差,动作不够准确,即识别准确度低,且该模型同样存在不同业务场景下存在关键点信息冗余以及缺陷的问题,例如,单纯需求上半身3D进行人机交互的场景,训练对应整个人体的三维模型显然造成计算资源浪费。
上述两种模型(对应二维信息识别的模型及人体三维模型)所采用的训练数据完全不同,互不兼容,且训练流程不同,若既想得到人体的二维姿态信息又想得到人体的三维姿态信息,需要分开训练两个不同的模型,处理不同的数据,耗费时间的同时也造成了计算资源的浪费,中央处理器(CPU,Central Processing Unit)及图形处理器(GPU,Graphics Processing Unit)等资源占用大。
基于此提出本申请实施例的姿态识别模型,训练得到的姿态识别模型既能够输出人体二维姿态信息,又能够输出人体的三维姿态信息,实现了人体二维姿态信息及三维姿态信息的兼容,且对输出人体二维姿态信息及三维姿态信息的姿态识别模型的训练,采用一套 训练样本,模型简单,训练效率高;姿态识别模型中包括三维模型,对三维模型的训练,采用二维模型输出的二维信息进行约束,使得三维模型输出的人体三维姿态信息的准确度更高。
需要强调的是,本申请实施例所提供的姿态识别模型的训练方法以及图像识别方法可以是基于人工智能实现的。人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音技术、自然语言处理技术以及机器学习/深度学习等几大方向。
本申请实施例提供的方案涉及人工智能的机器学习/深度学习、计算机视觉等技术。其中,机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、式教学习等技术。
本申请实施例具体通过机器学习训练姿态识别模型,使得训练得到的姿态识别模型可以针对待识别图像准确的进行姿态识别。
在进行姿态识别模型训练或进行姿态识别过程中,还可能涉及计算机视觉技术。计算机视觉技术(Computer Vision,CV)是一门研究如何使机器“看”的科学,更进一步的说,就是指用摄影机和电脑代替人眼对目标进行识别、跟踪和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能系统。
本申请实施例具体涉及计算机视觉技术中的图像处理、图像语义理解等技术,例如在得到图像例如待识别图像或训练样本后,进行图像处理,例如对图像进行裁剪等;又如,利用图像语义理解技术进行关键点标注、图像分类(例如确定人体关键点的归属人)、提取图像特征(特征图)等。
首先对本申请实施例的姿态识别模型的实施场景进行说明,图3为本申请实施例提供的姿态识别模型的实施场景的示意图,参见图3,为实现支撑一个示例性应用,终端(包括终端40-1和终端40-2),终端上设置有用于图像识别的客户端,终端通过网络300连接服务器200,网络300可以是广域网或者局域网,又或者是二者的组合,使用无线链路实现数据传输。
服务器200,用于将标注有人体关键点的样本图像,输入姿态识别模型包括的特征图模型,输出对应样本图像的特征图;将特征图输入姿态识别模型包括的二维模型,输出用于表征二维人体姿态的二维关键点参数;将从特征图中剪裁出的目标人体特征图及二维关键点参数,输入姿态识别模型包括的三维模型,输出用于表征三维人体姿态的三维姿态参数;结合二维关键点参数及三维姿态参数,构建目标损失函数;基于目标损失函数,更新姿态识别模型的模型参数;如此,实现对姿态识别模型的训练。
终端(终端40-1和/或终端40-2),用于发送携带待识别图像的识别请求给服务器200,该待识别图像中包括一个或多个人体。
服务器200,还用于接收终端发送的识别请求,采用得到的姿态识别模型对待识别图像进行识别,将识别结果(二维关键点参数和/或三维姿态参数)返回给终端。
终端(终端40-1和/或终端40-2),还用于基于服务器200返回的识别结果执行相应的应用,如驱动人体三维模型,基于识别结果确定相应的二维人体姿态并进行相应的评估。
接下来对本申请实施例提供的姿态识别模型的训练装置及基于姿态识别模型的图像识别装置进行说明。本申请实施例的姿态识别模型的训练装置及基于姿态识别模型的图像识别装置,均可以通过图像处理设备来实施,图像处理设备例如可以是终端,也可以是服务器,也就是说,本申请实施例提供的方法可以由智能手机、平板电脑和台式机等终端单独实施,或者有服务器单独实施,或者由终端、服务器协同实施。本申请实施例提供的姿态识别模型的训练装置及基于姿态识别模型的图像识别装置,均可以实施为硬件或者软硬件结合的方式,以本申请实施例的姿态识别模型的训练装置为例,下面说明本申请实施例提供的装置的各种示例性实施。
下面对本申请实施例的图像处理设备的硬件结构做详细说明,图4为本申请实施例提供的图像处理设备的组成结构示意图,可以理解,图4仅仅示出了图像处理设备的示例性结构而非全部结构,根据需要可以实施图4示出的部分结构或全部结构。
本申请实施例提供的图像处理设备包括:至少一个处理器401、存储器402、用户接口403和至少一个网络接口404。姿态识别模型的训练装置40中的各个组件通过总线系统405耦合在一起。可以理解,总线系统405用于实现这些组件之间的连接通信。总线系统405除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图4中将各种总线都标为总线系统405。
其中,用户接口403可以包括显示器、键盘、鼠标、轨迹球、点击轮、按键、按钮、触感板或者触摸屏等。
可以理解,存储器402可以是易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(ROM,Read Only Memory)、可编程只读存储器(PROM,Programmable Read-Only Memory)、可擦除可编程只读存储器(EPROM,Erasable Programmable Read-Only Memory)、闪存(Flash Memory)等。易失性存储器可以是随机存取存储器(RAM,Random Access Memory),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(SRAM,Static Random Access Memory)、同步静态随机存取存储器(SSRAM,Synchronous Static Random Access Memory)。本申请实施例描述的存储器402旨在包括这些和任意其它适合类型的存储器。
本申请实施例中的存储器402能够存储数据以支持终端(如40-1)的操作。这些数据的示例包括:用于在终端(如40-1)上操作的任何计算机程序,如操作系统和应用程序。其中,操作系统包含各种系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务。应用程序可以包含各种应用程序。
作为本申请实施例提供的图像处理设备采用软硬件结合实施的示例,本申请实施例所提供的图像处理设备可以直接体现为由处理器401执行的软件模块组合,软件模块可以位于存储介质中,存储介质位于存储器402,处理器401读取存储器402中软件模块包括的可执行指令,结合必要的硬件(例如,包括处理器401以及连接到总线405的其他组件)完成本申请实施例提供的姿态识别模型的训练方法。
作为示例,处理器401可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
作为本申请实施例提供的图像处理设备采用硬件实施的示例,本申请实施例所提供的装置可以直接采用硬件译码处理器形式的处理器401来执行完成,例如,被一个或多个应用专用集成电路(ASIC,Application Specific Integrated Circuit)、DSP、可编程逻辑器件(PLD,Programmable Logic Device)、复杂可编程逻辑器件(CPLD,Complex Programmable Logic Device)、现场可编程门阵列(FPGA,Field-Programmable Gate Array)或其他电子元件执行实现本申请实施例提供的姿态识别模型的训练方法。
本申请实施例中的存储器402用于存储各种类型的数据以支持图像处理设备40的操作。这些数据的示例包括:用于在图像处理设备40上操作的任何可执行指令,如可执行指令,实现本申请实施例的姿态识别模型的训练方法的程序可以包含在可执行指令中。
基于上述对本申请实施例的姿态识别模型的实施场景、图像处理设备的说明,接下来对本申请实施例的姿态识别模型的所应用的场景或领域进行说明,需要说明的是,本申请实施例的姿态识别模型并不限于以下所提到的场景或领域:
1、人机交互;
以用户与终端进行交互为例,终端中设置有客户端,在实际应用中,该客户端可以为游戏客户端、人体三维建模客户端等,终端上还设置有图形界面、图像采集装置及处理芯片,通过图像采集装置采集包含人体的图像,基于姿态识别模型识别图像中人体的二维人体姿态及三维人体姿态。
以客户端为游戏客户端为例,终端通过图形界面展示游戏人物的动作,以使用户基于终端所展示的人物动作进行模仿,通过图像采集装置采集用户所做的动作的图像,基于姿态识别模型识别图像中人体的二维人体姿态,并基于识别结果与游戏中人物所做的动作的相似度进行游戏的评估,如打分等。
以客户端为体感游戏客户端为例,终端通过图像采集装置采集包含用户的图像,基于姿态识别模型识别图像中人体的三维人体姿态,以构建对应用户的人体三维模型,并驱动所构建的人体三维模型执行与用户执行相同的动作,实现用户在游戏中的体感互动。
以用户与智能机器人交互为例,该智能机器人中设置有图像采集装置及处理芯片,图像采集装置能够采集智能机器人的前方区域的图像,处理芯片能够基于姿态识别模型识别区域图像中的人体姿态,并在识别出的人体姿态是预设姿态时,控制智能机器人做出预设响应。比如,当识别出的人体姿态是挥手姿态时,控制智能机器人做出欢迎动作。
2、无人驾驶;
无人驾驶车上设置有图像采集装置及处理芯片,图像采集装置能够采集无人驾驶车在行驶过程中前方的图像,处理芯片基于姿态识别模型识别图像中的人体姿态(二维和/或三维),以判别前方是否有人,以及人所处前方的位置等信息,以控制无人驾驶车减速或刹车等。
3、医疗领域;
医疗设备上设置有图像采集装置及处理芯片,图像采集装置能够采集用户的图像,处理芯片基于姿态识别模型识别图像中的三维人体姿态,以构建对应用户的人体三维模型,并基于构建的人体三维模型进行医学分析。
4、监控领域;
监控系统包括前端的图像采集设备及后端的图像处理设备,图像采集设备采集包含用户的图像后发送给图像处理设备,图像处理设备基于姿态识别模型识别图像中的人体姿态(二维和/或三维),并基于识别结果进行目标跟踪、姿态分析预警等。
在对本申请实施例提供的姿态识别模型的训练方法说明之前,先对本申请实施例提供的姿态识别模型的结构进行说明,图5是本申请实施例提供的姿态识别模型的结构示意图,参见图5,本申请实施例提供的姿态识别模型包括:特征图模型51、特征图剪裁单元52、二维模型53及三维模型54;其中,特征图模型51,用于对输入的图像进行特征提取,得到相应的特征图;特征图剪裁单元52,用于特征图模型输出的特征图进行剪裁,得到目标人体特征图;二维模型53,用于对特征图模型输出的特征图进行识别,输出用于表征二维人体姿态的二维关键点参数;三维模型54,用于对特征图剪裁单元剪裁得到的目标人体特征图、以及二维模型输出的二维关键点参数进行处理,输出用于表征三维人体姿态的三维姿态参数。
基于上述对姿态识别模型的结构的说明,接下来对本申请实施例提供的姿态识别模型的训练方法进行说明。图6是本申请实施例提供的姿态识别模型的训练方法的流程示意图,在一些实施例中,该训练方法可由服务器或终端实施,或由服务器及终端协同实施,以服务器实施为例,如通过图3中的服务器200实施,结合图5及图6,本申请实施例提供的姿态识别模型的训练方法包括:
步骤601:通过姿态识别模型包括的特征图模型,对标注有人体关键点的样本图像进行处理,获得对应所述样本图像的特征图。
服务器将标注有人体关键点的样本图像,输入姿态识别模型包括的特征图模型,从而利用特征图模型输出对应样本图像的特征图。
在实际实施时,在进行姿态识别模型的训练之前,需要构建用于模型训练的样本图像,样本图像中包含人体,服务器对包含人体的样本图像进行人体关键点标注。在一些实施例中,包含多个配置场景,不同的配置场景对应关键点集中不同数量的人体关键点,服务器在根据配置场景的类型,从关键点集中确定出对应当前配置场景的人体关键点后,基于所确定的人体关键点,参照关键点集对包含人体的样本图像进行人体关键点标注。
这里,对关键点集进行说明,在一些实施例中,关键点集包括:用于定位人体部位的基准关键点、与基准关键点协同表征所属部位的不同三维姿态的扩展关键点。
在实际应用中,基准关键点可以为COCO数据集中提供的17个人体关键点,而扩展关键点则与这17个人体关键点中的一个或多个协同表征所属部位的不同三维姿态。例如,为了表征人体头部的三维姿态,扩展关键点可以为头顶关键点和/或下巴关键点,与鼻尖关键点(基准关键点)协同表征头部的抬头、点头、转头等动作;再如,为了表征人体手部的三维姿态,扩展关键点可以为大拇指关键点、手心关键点及中指关键点中的至少一个,与手腕关键点(基准关键点)协同表征手部的三维姿态,如手腕的旋转;再如,为了表征人体腰部的三维姿态,扩展关键点可以为胯中点关键点,与左胯关键点和/或右胯关键点(基准关键点)协同表征腰部的三维姿态,如腰部扭转。在一些实施例中,扩展关键点的数量为16个,与COCO数据集提供的17个人体关键点共同组成33关键点集。
接下来对关键点集中扩展关键点的获取进行说明。在实际实施时,可以通过均值增点、单独识别等方式获取扩展关键点,例如,采用均值增点的方式,取左胯关键点及右胯关键点的中点作为胯中点关键点,采用均值增点的方式,取左肩关键点及右肩关键点的中点作为脖子(华盖穴)关键点;可通过单独识别的方式获取人体手部及脚部的关键点,具体地,可构建或采用相关技术中的手部和/脚部的识别模型,输入包含手部或脚部的图像,输出相应的扩展关键点信息。
示例性地,以获取手部扩展关键点为例进行说明,图7为本申请实施例提供的获取手部扩展关键点的流程示意图,参见图7,首先对包含人体的图像进行剪裁,得到手部的图像,然后将剪裁得到的图像输入至手部关键点模型,得到包含大拇指关键点、手心关键点及中指关键点等的手部关键点集。
这里,对人体手部或脚部的图像的获取进行说明,继续参见图7,在一些实施例中,可通过如下方式剪裁得到手部的图像:以手腕关键点为中心,以手腕关键点与相应肩关键点之间的长度为边长,或者以手腕关键点与相应手肘关键点之间的长度为边长,剪裁得到包含手部的正方形图像,作为手部关键点模型的输入。在一些实施例中,可通过如下方式剪裁得到脚部的图像:以脚踝关键点为中心,以脚踝关键点与相应膝盖关键点的长度为边长,或者以脚踝关键点与相应胯关键点的长度为边长,剪裁得到包含脚部的正方形图像,作为脚部关键点模型的输入。
示例性地,继续参见图7,图7中编号2对应人体右肩关键点,编号3对应右手肘关键点,编号4对应右手腕关键点,以右手腕关键点为中心,以右手腕关键点与人体右肩关键点之间的连线4-2为边长,对图像进行剪裁得到包含右手的正方形图像。
基于上述对关键点集的说明,接下来对配置场景进行介绍,图8A至图8D为本申请实施例提供的对应不同类型的配置场景的人体关键点的示意图,在一些实施例中,配置场景的类型可以包括四种,分别对应关键点集中不同数量的人体关键点,以关键点集为上述33关键点集为例,图8A为本申请实施例提供的对应第一种配置场景的人体关键点的示意图,第一种配置场景可以对应关键点集中的全量关键点(即33个人体关键点);图8B为本申请实施例提供的对应第二种配置场景的人体关键点的示意图,第二种配置场景可以对应关键点集中人体上半身的20个人体关键点;图8C为本申请实施例提供的对应第三种配置场景的人体关键点的示意图,第三种配置场景可以对应关键点集中人体上半身的8个人体关键点;图8D为本申请实施例提供的对应第四种配置场景的人体关键点的示意图,第四种配置场景对应关键点集中人体全身的15个人体关键点。
在一些实施例中,姿态识别模型所包括的特征图模型可以为基于Darknet框架的神经网络模型,通过特征图模型对输入的图像(如样本图像)进行特征提取,得到相应的特征图。图9为本申请实施例提供的进行特征图提取的示意图,参见图9,特征图模型在进行特征提取时,采用Darknet框架,可基于边界框(bbox,Bounding box)采用滑窗的方式得 到对应输入图像的特征图,Bounding box指的是能恰好环绕物体的一个最小面积矩形框,用于标定物体的位置与相对大小,定义一个物体在图像中所占据的范围。
步骤602:通过所述姿态识别模型包括的二维模型,对所述特征图进行处理,获得用于表征二维人体姿态的二维关键点参数。
服务器将特征图输入姿态识别模型包括的二维模型,输出用于表征二维人体姿态的二维关键点参数。
在一些实施例中,二维模型可以为卷积神经网络模型,输出的二维关键点参数可包括人体关键点的部分亲和字段(PAF,Part Affinity Fields)参数及人体关键点的热力图(Heatmap)。
这里,人体关键点的PAF参数可以为对应人体关键点的二维方向向量,表示人体骨骼关节(肢体)的位置和方向,也表征了两个人体关键点的关联程度,进而可基于人体关键点的PAF参数确定人体关键点的归属人,在实际应用中,人体关键点的PAF参数可以包括人体关键点的坐标参数。
人体关键点的热力图指的是对人体关键点在原图像大小的灰度图中,在相同位置用圆形高斯表示,也即表征输入的特征图中的像素属于人体关键点的概率,图10为本申请实施例提供的人体关键点热力图的示意图,参见图10,对于人体左手肘关键点来说,热力图表征了像素点为左手肘关键点的概率,即用概率表达的形式反映左手肘关键点在该像素点处出现的可能性,像素点距离左手肘关键点越近概率越高,距离左手肘关键点越远概率越低,也即像素点(如图10中编号2)为左手肘关键点的概率和像素点距离中心点(如图10中编号1)的相对位置关系服从高斯分布。
步骤603:通过所述姿态识别模型包括的三维模型,对从所述特征图中剪裁出的目标人体特征图及所述二维关键点参数进行处理,获得用于表征三维人体姿态的三维姿态参数。
服务器可以将从特征图中剪裁出的目标人体特征图及二维关键点参数,输入姿态识别模型包括的三维模型,输出用于表征三维人体姿态的三维姿态参数。
在实际实施时,三维模型输出的人体三维姿态参数所对应的是单个人体,因此,在将特征图输入三维模型之前,若样本图像包括多个人体,可以对特征图模型输出的特征图进行目标人体剪裁,在一些实施例中,服务器可采用如下方式实现对特征图的剪裁:
服务器基于二维模型输出的二维关键点参数,确定特征图中的目标人体,根据确定的目标人体对特征图进行剪裁,得到目标人体的特征图。也就是说,输入特征图模型的样本图像可以包含多个人体,基于二维模型识别得到的二维关键点参数,确定人体关键点所属的人体,进而对单个人体进行剪裁,得到对应单个人体的特征图。
在一些实施例中,三维模型可为卷积神经网络模型,服务器将剪裁得到的目标人体特征图与二维模型输出的人体关键点的热力图进行拼接,并将拼接结果输入三维模型,例如, 对目标人体的特征图及人体关键点的热力图进行Concat拼接,即将热力图及特征图以两个矩阵拼接,以将拼接结果输入三维模型。
在一些实施例中,三维模型输出的三维姿态参数包括人体的形状参数(shape)及形态参数(pose);其中,形状参数可以表征人体的高矮肥瘦等,而形态参数可以表征人体的位姿等,基于人体的三维姿态参数可构建人体的三维蒙皮模型。
步骤604:结合二维关键点参数及三维姿态参数,构建目标损失函数。
在一些实施例中,姿态识别模型的目标损失函数包括对应三维模型的第一损失函数;服务器可通过如下方式实现第一损失函数的构建:服务器基于三维模型输出的三维姿态参数,确定相应的二维关键点信息,结合二维模型输出的二维关键点参数、以及得到的二维关键点信息,构造对应三维模型的第一损失函数。可见,对应三维模型的第一损失函数的构建采用了二维关键点信息作为约束,使得三维模型的输出准确度更高。
示例性地,服务器基于三维姿态参数包括的形状参数及形态参数,通过投影矩阵函数计算二维人体关键点的位置,然后基于关键点集中人体关键点的位置与基于三维姿态参数计算得到的二维人体关键点的位置的差异、以及二维模型输出的二维人体关键点的位置与基于三维姿态参数计算得到的二维人体关键点的位置的差异,构造对应三维模型的第一损失函数。
例如,在实际应用中,所构造的第一损失函数Loss1可以为:
Loss 1=av(Xgt-r(Xp)) 2+b(X2dp-r(Xp)) 2  (1)
相应的,第一损失函数的约束为:
min Loss(Xgt,r,a,b)=av|(Xgt)-r(Xp)|+b|X2dp-r(Xp)|  (2)
其中,a和b均为第一损失函数中的权重系数;v表示人体关键点X在二维图像中是否可见;Xp为三维模型输出的三维姿态参数,即形状参数和形态参数;r(Xp)表示基于三维姿态参数,通过投影矩阵函数r()反算出来的二维人体关键点的位置;Xgt表示人体关键点X在关键点集中的位置;X2dp为二维模型预测得到的人体关键点X的位置。
基于上述函数(1)及(2)可知,所构建的对应三维模型的第一损失函数中采用了二维人体姿态信息作为约束,如此,可提升三维模型输出的三维姿态参数的准确度。
在一些实施例中,姿态识别模型的目标损失函数还包括对应二维模型的损失函数及对应三维模型的第二损失函数;相应的,服务器可通过如下方式构建对应二维模型的损失函数及对应三维模型的第二损失函数:
结合二维模型输出的部分亲和字段参数与相应人体关键点在样本图像中的部分亲和字段参数的差异、二维模型输出的热力图与相应人体关键点在样本图像中的热力图的差异,构建对应二维模型的损失函数;
结合三维模型输出的形状参数与相应人体在样本图像中的形状参数的差异、三维模型输出的形态参数与相应人体在样本图像中的形态参数的差异,构建对应三维模型的第二损失函数。
例如,在实际应用中,所构造的对应二维模型的损失函数Loss2可以为:
Loss 2=(PAF-PAF') 2+(heatmap-heatmap') 2  (3)
其中,(PAF-PAF') 2表示二维模型输出的PAF参数与相应人体关键点在样本图像中的PAF参数的差异,(heatmap-heatmap') 2表示三维模型输出的形态参数与相应人体在样本图像中的形态参数的差异。
例如,在实际应用中,所构造的三维模型的第二损失函数Loss3可以为:
Loss 3=(β-β') 2+(θ-θ') 2  (4)
其中,β为人体形状参数,θ为人体形态参数,(β-β') 2表示三维模型输出的形状参数与相应人体在样本图像中的形状参数的差异,(θ-θ') 2表示三维模型输出的形态参数与相应人体在样本图像中的形态参数的差异。
基于上述对姿态识别模型所包括的二维模型的损失函数及三维模型的损失函数的说明,可知,在一些实施例中,姿态识别模型的目标损失函数可以为:
Loss=(PAF-PAF') 2+(heatmap-heatmap') 2+(β-β') 2+(θ-θ') 2+av(Xgt-r(Xp)) 2+b(X2dp-r(Xp)) 2  (5)
步骤605:基于目标损失函数,更新姿态识别模型的模型参数。
在一些实施例中,服务器可采用如下方式实现姿态识别模型的模型参数的更新:
服务器基于二维模型输出的二维关键点参数及三维模型输出的三维姿态参数,确定目标损失函数的值,并判断目标损失函数的值是否超出预设阈值,当目标损失函数的值超出预设阈值时,基于目标损失函数确定姿态识别模型的误差信号,将误差信号在姿态识别模型中反向传播,并在传播的过程中更新各个层的模型参数。
这里对反向传播进行说明,将训练样本数据输入到神经网络模型的输入层,经过隐藏层,最后达到输出层并输出结果,这是神经网络模型的前向传播过程,由于神经网络模型的输出结果与实际结果有误差,则计算输出结果与实际值之间的误差,并将该误差从输出层向隐藏层反向传播,直至传播到输入层,在反向传播的过程中,根据误差调整模型参数的值;不断迭代上述过程,直至收敛。
以姿态识别模型的目标损失函数为(5)为例,服务器基于目标损失函数确定误差信号,分别从二维模型及三维模型的输出层反向传播,逐层反向传播误差信号,在误差信号到达每一层时,结合传导的误差信号来求解梯度(也就是Loss函数对该层参数的偏导数),将该层的参数更新对应的梯度值。
在一些实施例中,图11是本申请实施例提供的基于姿态识别模型的图像识别方法的流程示意图,如图11所示,本申请实施例提供的基于姿态识别模型的图像识别方法主要包括三个阶段,分别为数据准备阶段、模型训练阶段及模型应用阶段,接下来分别进行说明。
1、数据准备阶段;
数据准备阶段主要实现了包含33个人体关键点的关键点集的构建,以及不同类型配置场景(setting)对应的不同人体关键点数量的选择。
其中,本实施例所采用的关键点集在COCO数据集的17个人体关键点(亦可为Openpose的18个人体关键点)的基础上,增加头顶,下巴的关键点,以方便表征点头抬头的旋转;在手腕部分增加了中指与大拇指的关键点,与手心关键点一并表征手腕的旋转;为了兼容三维模型中常见的位于跨中的root点,以及相关骨骼信息,增加了跨中点;脚上同样通过脚跟,左脚尖,右脚尖来表征其三维信息。关键点集一共包含个人体关键点33点,通过新增的扩展关键点,让二维姿态识别过程中包含了更多肢体三维旋转的信息。
这里,在实际实施中,所构建的关键点集包括COCO数据集提供的17个人体关键点,剩余的16个人体关键点为扩展关键点,可以基于上述17个人体关键点,通过均值增点,以及单独识别的方式获取手脚部位的扩展关键点,然后经融合数据的办法获得33点的数据。其中如胯中点可以由左跨关键点及右跨关键点计算得到,脖子(华盖穴)则可由左肩关键点及右肩关键点计算得到。
在实际实施中,对于扩展关键点的获取既可采用相关技术中提供的手脚部分的关键点检测模型识别得到,也可单独训练用于识别对应手脚部位的扩展关键点的检测模型,使得该检测模型具备依据输入的包含手或脚的图像,输出相应的扩展关键点信息的性能。而对输入检测模型之前的图像的剪裁可参照前述实施例的描述,此处不做赘述。
在实际应用中,对于不同的setting,进行姿态识别模型训练时所采用的样本图像中标注不同数量的人体关键点,例如,对于人体上半身的20个人体关键点的setting来说,进行模型训练时,便可只对样本图像中该20个人体关键点进行标注,避免了标注33个人体关键点所造成的计算资源的浪费。
2、模型训练阶段;
在一些实施例中,训练得到的模型为全卷积神经网络(FCN,Fully Convolutional Networks)模型,包括三部分,分别为特征图模型(例如DarknetX)、二维模型及三维模型,实现二维模型及三维模型的联合训练。在进行模型训练时,根据业务需求选取相应的setting,即选取相应的人体关键点配置,经DarknetX输出的特征图,输入至二维模型,通过PAF以及heatmap使用L2的loss训练,通过NMS以及PAF的Grouping操作获得人体二维关键点的位置及方向信息,并确定人体关键点的归属人;在联合训练三维模型时,需要对DarknetX输出的特征图进行剪裁,得到单个人体的目标人体特征图,然后将目标人体特征图与二维模型输出的热力图进行Concat拼接作为三维模型的输入,这主要是利用二维 人体关键点,减少三维模型所需的计算量,只需回归目标单人的情况,并且共享和复用了DarknetX输出的特征图。
3、模型应用阶段;
对于模型输出的人体三维信息可用于人体的三维姿态识别及三维蒙皮模型驱动,例如根据得到的用户的三维姿态参数(shape、pose)驱动一个三维人物模型同步用户的动作。
对于模型输出的人体二维信息可用于人体的二维姿态识别,在实际应用中,可用于静态动作识别和时序动作识别,例如,终端屏幕显示动画人物的动作,终端采集用户模仿该动画人物的动作,终端进行二维姿态识别,根据动作契合程度进行评分等。
接下来对本申请实施例训练得到的姿态识别模型的应用进行说明,在一些实施例中,姿态识别模型可用于图像识别,以终端中设置有图像识别客户端为例,图12为本申请实施例提供的采用姿态识别模型进行图像识别的流程示意图,参见图12,终端将包含人体的待识别图像,输入姿态识别模型包括的特征图模型,输出对应待识别图像的特征图;将特征图输入姿态识别模型包括的二维模型,输出用于表征二维人体姿态的二维关键点参数,二维关键点参数用于识别得到人体的二维姿态;将从特征图中剪裁出的目标人体特征图及二维关键点参数,输入姿态识别模型包括的三维模型,输出用于表征三维人体姿态的三维姿态参数,三维姿态参数用于识别得到人体的三维姿态。
在一些实施例中,终端输出特定人物姿态的图像后,采集得到待识别图像,终端基于二维模型输出的二维关键点参数,识别得到待识别图像中人体的二维姿态,将识别得到的二维姿态与特定人物姿态进行相似度匹配,得到匹配结果,输出用于表征匹配结果的提示信息。
示例性地,以终端中设置有舞蹈游戏客户端为例,图13为本申请实施例提供的姿态识别模型的应用场景示意图,参见图13,终端通过舞蹈游戏客户端展示动画人物的动作即特定人物姿态,用户根据终端屏幕上的动作提示做出相应的动作,终端采集用户的动作图像即待识别图像,并将待识别图像输入至姿态识别模型,进行二维人体姿态识别,将识别结果与动画人物的姿态进行相似度匹配,并根据得到的相似度输出相应的提示信息,如输出相应的评分、“great”、“good”、“miss”等提示。
在一些实施例中,终端基于三维模型输出的三维姿态参数,构建对应目标人体的三维人体模型;控制三维人体模型执行目标动作,目标动作与目标人体所执行的动作相匹配。
示例性地,以终端中设置有人体三维模型客户端,图14为本申请实施例提供的姿态识别模型的应用场景示意图,参见图14,终端进行用户图像采集得到待识别图像,将待识别图像输入至姿态识别模型,进行三维人体姿态识别,根据输出的三维姿态参数进行三维蒙皮模型构建,并控制三维蒙皮模型同步用户的动作。
接下来对本申请实施例提供的装置采用软件单元实施进行说明。图15为本申请实施例提供的姿态识别模型的训练装置的组成结构示意图,参见图15,本申请实施例的姿态识别模型的训练装置包括:
第一处理单元151,用于通过姿态识别模型包括的特征图模型,对标注有人体关键点的样本图像进行处理,获得对应所述样本图像的特征图;
第二处理单元152,用于通过所述姿态识别模型包括的二维模型,对所述特征图进行处理,获得用于表征二维人体姿态的二维关键点参数;
第三处理单元153,用于通过所述姿态识别模型包括的三维模型,对从所述特征图中剪裁出的目标人体特征图及所述二维关键点参数进行处理,获得用于表征三维人体姿态的三维姿态参数;
构建单元154,用于结合所述二维关键点参数及所述三维姿态参数,构建目标损失函数;
更新单元155,用于基于所述目标损失函数,更新所述姿态识别模型的模型参数。
在一些实施例中,所述装置还包括:
标注单元,用于根据配置场景的类型,从关键点集中确定所述人体关键点;
基于所述人体关键点,参照所述关键点集对所述样本图像进行标注。
在一些实施例中,所述关键点集包括:
用于定位人体部位的基准关键点、以及与所述基准关键点协同表征所属部位的不同三维姿态的扩展关键点。
在一些实施例中,所述目标损失函数包括对应所述三维模型的第一损失函数;
所述构建单元,还用于基于所述三维姿态参数,确定相应的二维关键点信息;
结合所述二维关键点参数以及所述二维关键点信息,构造所述第一损失函数。
在一些实施例中,所述目标损失函数还包括对应所述二维模型的损失函数及对应所述三维模型的第二损失函数;所述二维关键点参数包括:人体关键点的部分亲和字段参数及人体关键点的热力图,所述三维姿态参数包括:人体的形状参数及形态参数;
所述构建单元,还用于结合所述二维模型输出的部分亲和字段参数与相应人体关键点在样本图像中的部分亲和字段参数的差异、所述二维模型输出的热力图与相应人体关键点在样本图像中的热力图的差异,构建对应所述二维模型的损失函数;
结合所述三维模型输出的形状参数与相应人体在样本图像中的形状参数的差异、所述三维模型输出的形态参数与相应人体在样本图像中的形态参数的差异,构建对应所述三维模型的第二损失函数。
在一些实施例中,所述装置还包括:
剪裁单元,用于基于所述二维关键点参数,确定所述特征图中的目标人体;
根据所述目标人体对所述特征图进行剪裁,得到所述目标人体特征图。
在一些实施例中,所述更新单元,还用于基于所述二维关键点参数及所述三维姿态参数,确定所述目标损失函数的值;
当所述目标损失函数的值超出预设阈值时,基于所述目标损失函数确定所述姿态识别模型的误差信号;
将所述误差信号在所述姿态识别模型中反向传播,并在传播的过程中更新各个层的模型参数。
图15为本申请实施例提供的基于姿态识别模型的图像识别装置的组成结构示意图,参见图16,本申请实施例的基于姿态识别模型的图像识别装置160包括:
第一获取单元161,用于通过所述姿态识别模型包括的特征图模型,对包含人体的待识别图像进行处理,获得对应所述待识别图像的特征图;
第二获取单元162,用于通过所述姿态识别模型包括的二维模型,对所述特征图进行处理,获得用于表征二维人体姿态的二维关键点参数,所述二维关键点参数用于识别得到所述人体的二维姿态;
第三获取单元163,用于通过所述姿态识别模型包括的三维模型,对从所述特征图中剪裁出的目标人体特征图及所述二维关键点参数进行处理,获得用于表征三维人体姿态的三维姿态参数,所述三维姿态参数用于识别得到所述人体的三维姿态。
在一些实施例中,所述装置还包括:
匹配单元,用于基于所述二维关键点参数,识别得到所述待识别图像中人体的二维姿态;所述待识别图像为基于输出的特定人物姿态的图像采集得到的;
将识别得到的所述二维姿态与所述特定人物姿态进行相似度匹配,得到匹配结果;
提示单元,用于输出用于表征所述匹配结果的提示信息。
在一些实施例中,所述装置还包括:
人体模型单元,用于基于所述三维姿态参数,构建对应所述目标人体的三维人体模型;
控制单元,用于控制所述三维人体模型执行目标动作,所述目标动作与所述目标人体所执行的动作相匹配。
这里需要指出的是:以上涉及装置的描述,与上述方法描述是类似的,同方法的有益效果描述,不做赘述,对于本申请实施例所述装置中未披露的技术细节,请参照本申请方法实施例的描述。
本申请实施例还提供一种存储有可执行指令的存储介质,其中存储有可执行指令,当可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的姿态识别模型的训练方法。
本申请实施例还提供一种存储有可执行指令的存储介质,其中存储有可执行指令,当可执行指令被处理器执行时,将引起处理器执行本申请实施例提供的基于姿态识别模型的图像识别方法。
在一些实施例中,存储介质可以是FRAM、ROM、PROM、EPROM、EEPROM、闪存、磁表面存储器、光盘、或CD-ROM等存储器;也可以是包括上述存储器之一或任意组合的各种设备。
在一些实施例中,可执行指令可以采用程序、软件、软件模块、脚本或代码的形式,按任意形式的编程语言(包括编译或解释语言,或者声明性或过程性语言)来编写,并且其可按任意形式部署,包括被部署为独立的程序或者被部署为模块、组件、子例程或者适合在计算环境中使用的其它单元。
作为示例,可执行指令可以但不一定对应于文件系统中的文件,可以可被存储在保存其它程序或数据的文件的一部分,例如,存储在超文本标记语言(HTML,Hyper Text Markup Language)文档中的一个或多个脚本中,存储在专用于所讨论的程序的单个文件中,或者,存储在多个协同文件(例如,存储一个或多个模块、子程序或代码部分的文件)中。
作为示例,可执行指令可被部署为在一个计算设备上执行,或者在位于一个地点的多个计算设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算设备上执行。
以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和范围之内所作的任何修改、等同替换和改进等,均包含在本申请的保护范围之内。

Claims (15)

  1. 一种姿态识别模型的训练方法,所述方法应用于图像处理设备,所述方法包括:
    通过姿态识别模型包括的特征图模型,对标注有人体关键点的样本图像进行处理,获得对应所述样本图像的特征图;
    通过所述姿态识别模型包括的二维模型,对所述特征图进行处理,获得用于表征二维人体姿态的二维关键点参数;
    通过所述姿态识别模型包括的三维模型,对从所述特征图中剪裁出的目标人体特征图及所述二维关键点参数进行处理,获得用于表征三维人体姿态的三维姿态参数;
    结合所述二维关键点参数及所述三维姿态参数,构建目标损失函数;
    基于所述目标损失函数,更新所述姿态识别模型的模型参数。
  2. 如权利要求1所述的方法,通过姿态识别模型包括的特征图模型,对标注有人体关键点的样本图像进行处理,获得对应所述样本图像的特征图之前,所述方法还包括:
    根据配置场景的类型,从关键点集中确定所述人体关键点;
    基于所述人体关键点,参照所述关键点集对所述样本图像进行标注。
  3. 如权利要求2所述的方法,所述关键点集包括:
    用于定位人体部位的基准关键点、以及与所述基准关键点协同表征所属部位的多种三维姿态的扩展关键点。
  4. 如权利要求1所述的方法,所述目标损失函数包括对应所述三维模型的第一损失函数;所述结合所述二维关键点参数及所述三维姿态参数,构建目标损失函数,包括:
    基于所述三维姿态参数,确定相应的二维关键点信息;
    结合所述二维关键点参数以及所述二维关键点信息,构造所述第一损失函数。
  5. 如权利要求4所述的方法,所述目标损失函数还包括对应所述二维模型的损失函数及对应所述三维模型的第二损失函数;所述二维关键点参数包括:所述人体关键点的部分亲和字段参数及所述人体关键点的热力图,所述三维姿态参数包括:人体的形状参数及形态参数;
    所述结合所述二维关键点参数及所述三维姿态参数,构建目标损失函数,包括:
    结合所述二维模型输出的部分亲和字段参数与所述人体关键点在样本图像中的部分亲和字段参数的差异、所述二维模型输出的热力图与相应人体关键点在样本图像中的热力图的差异,构建对应所述二维模型的损失函数;
    结合所述三维模型输出的形状参数与相应人体在样本图像中的形状参数的差异、所述三维模型输出的形态参数与相应人体在样本图像中的形态参数的差异,构建对应所述第二损失函数。
  6. 如权利要求1所述的方法,通过所述姿态识别模型包括的三维模型,对从所述特征图中剪裁出的目标人体特征图及所述二维关键点参数进行处理,获得用于表征三维人体姿态的三维姿态参数之前,所述方法还包括:
    基于所述二维关键点参数,确定所述特征图中的目标人体;
    根据所述目标人体对所述特征图进行剪裁,得到所述目标人体特征图。
  7. 如权利要求1所述的方法,所述基于所述目标损失函数,更新所述姿态识别模型的模型参数,包括:
    基于所述二维关键点参数及所述三维姿态参数,确定所述目标损失函数的值;
    当所述目标损失函数的值超出预设阈值时,基于所述目标损失函数确定所述姿态识别模型的误差信号;
    将所述误差信号在所述姿态识别模型中反向传播,并在传播的过程中更新各个层的模型参数。
  8. 一种基于姿态识别模型的图像识别方法,所述方法应用于图像处理设备,所述方法包括:
    通过所述姿态识别模型包括的特征图模型,对包含人体的待识别图像进行处理,获得对应所述待识别图像的特征图;
    通过所述姿态识别模型包括的二维模型,对所述特征图进行处理,获得用于表征二维人体姿态的二维关键点参数,所述二维关键点参数用于识别得到所述人体的二维姿态;
    通过所述姿态识别模型包括的三维模型,对从所述特征图中剪裁出的目标人体特征图及所述二维关键点参数进行处理,获得用于表征三维人体姿态的三维姿态参数,所述三维姿态参数用于识别得到所述人体的三维姿态。
  9. 如权利要求8所述的方法,所述方法还包括:
    基于所述二维关键点参数,识别得到所述待识别图像中人体的二维姿态;所述待识别图像为基于输出的特定人物姿态的图像采集得到的;
    将所述二维姿态与所述特定人物姿态进行相似度匹配,得到匹配结果;
    输出用于表征所述匹配结果的提示信息。
  10. 如权利要求8所述的方法,所述方法还包括:
    基于所述三维姿态参数,构建对应所述目标人体的三维人体模型;
    控制所述三维人体模型执行目标动作,所述目标动作与所述目标人体所执行的动作相匹配。
  11. 一种姿态识别模型的训练装置,所述装置包括:
    第一处理单元,用于通过姿态识别模型包括的特征图模型,对标注有人体关键点的样本图像进行处理,获得对应所述样本图像的特征图;
    第二处理单元,用于通过所述姿态识别模型包括的二维模型,对所述特征图进行处理,获得用于表征二维人体姿态的二维关键点参数;
    第三处理单元,用于通过所述姿态识别模型包括的三维模型,对从所述特征图中剪裁出的目标人体特征图及所述二维关键点参数进行处理,获得用于表征三维人体姿态的三维姿态参数;
    构建单元,用于结合所述二维关键点参数及所述三维姿态参数,构建目标损失函数;
    更新单元,用于基于所述目标损失函数,更新所述姿态识别模型的模型参数。
  12. 一种基于姿态识别模型的图像识别装置,所述装置包括:
    第一获取单元,用于通过所述姿态识别模型包括的特征图模型,对包含人体的待识别图像进行处理,获得对应所述待识别图像的特征图;
    第二获取单元,用于通过所述姿态识别模型包括的二维模型,对所述特征图进行处理,获得用于表征二维人体姿态的二维关键点参数,所述二维关键点参数用于识别得到所述人体的二维姿态;
    第三获取单元,用于通过所述姿态识别模型包括的三维模型,对从所述特征图中剪裁出的目标人体特征图及所述二维关键点参数进行处理,获得用于表征三维人体姿态的三维姿态参数,所述三维姿态参数用于识别得到所述人体的三维姿态。
  13. 如权利要求12所述的装置,所述装置还包括:
    匹配单元,用于基于所述二维关键点参数,识别得到所述待识别图像中人体的二维姿态;所述待识别图像为基于输出的特定人物姿态的图像采集得到的;;
    将识别得到的所述二维姿态与所述特定人物姿态进行相似度匹配,得到匹配结果;
    提示单元,用于输出用于表征所述匹配结果的提示信息。
  14. 一种图像处理设备,所述设备包括:
    存储器,用于存储可执行指令;
    处理器,用于执行所述存储器中存储的可执行指令时,实现权利要求1至10中任一项所述的方法。
  15. 一种存储介质,所述存储介质存储有可执行指令,当其被处理器执行时,实现权利要求1至10中任一项所述的方法。
PCT/CN2020/082039 2019-04-12 2020-03-30 姿态识别模型的训练方法、图像识别方法及装置 WO2020207281A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/330,261 US11907848B2 (en) 2019-04-12 2021-05-25 Method and apparatus for training pose recognition model, and method and apparatus for image recognition

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910294734.8 2019-04-12
CN201910294734.8A CN110020633B (zh) 2019-04-12 2019-04-12 姿态识别模型的训练方法、图像识别方法及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/330,261 Continuation US11907848B2 (en) 2019-04-12 2021-05-25 Method and apparatus for training pose recognition model, and method and apparatus for image recognition

Publications (1)

Publication Number Publication Date
WO2020207281A1 true WO2020207281A1 (zh) 2020-10-15

Family

ID=67191240

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/082039 WO2020207281A1 (zh) 2019-04-12 2020-03-30 姿态识别模型的训练方法、图像识别方法及装置

Country Status (3)

Country Link
US (1) US11907848B2 (zh)
CN (1) CN110020633B (zh)
WO (1) WO2020207281A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270669A (zh) * 2020-11-09 2021-01-26 北京百度网讯科技有限公司 人体3d关键点检测方法、模型训练方法及相关装置
CN112488005A (zh) * 2020-12-04 2021-03-12 临沂市新商网络技术有限公司 基于人体骨骼识别和多角度转换的在岗监测方法及系统
CN112509123A (zh) * 2020-12-09 2021-03-16 北京达佳互联信息技术有限公司 三维重建方法、装置、电子设备及存储介质
CN112528858A (zh) * 2020-12-10 2021-03-19 北京百度网讯科技有限公司 人体姿态估计模型的训练方法、装置、设备、介质及产品
CN112801138A (zh) * 2021-01-05 2021-05-14 北京交通大学 基于人体拓扑结构对齐的多人姿态估计方法
CN113100755A (zh) * 2021-03-26 2021-07-13 河北工业大学 一种基于视觉追踪控制的肢体康复训练与评估系统
CN113627083A (zh) * 2021-08-05 2021-11-09 广州帕克西软件开发有限公司 一种基于虚拟试穿实现div衣服的方法
CN113724393A (zh) * 2021-08-12 2021-11-30 北京达佳互联信息技术有限公司 三维重建方法、装置、设备及存储介质

Families Citing this family (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111435432B (zh) * 2019-01-15 2023-05-26 北京市商汤科技开发有限公司 网络优化方法及装置、图像处理方法及装置、存储介质
CN110020633B (zh) * 2019-04-12 2022-11-04 腾讯科技(深圳)有限公司 姿态识别模型的训练方法、图像识别方法及装置
CN110102050B (zh) * 2019-04-30 2022-02-18 腾讯科技(深圳)有限公司 虚拟对象显示方法、装置、电子设备及存储介质
CN110570455B (zh) * 2019-07-22 2021-12-07 浙江工业大学 一种面向房间vr的全身三维姿态跟踪方法
CN112307801A (zh) * 2019-07-24 2021-02-02 鲁班嫡系机器人(深圳)有限公司 一种姿态识别方法、装置及系统
CN112287730A (zh) * 2019-07-24 2021-01-29 鲁班嫡系机器人(深圳)有限公司 姿态识别方法、装置、系统、存储介质及设备
CN110634160B (zh) * 2019-08-12 2022-11-18 西北大学 二维图形中目标三维关键点提取模型构建及姿态识别方法
CN110544301A (zh) * 2019-09-06 2019-12-06 广东工业大学 一种三维人体动作重建系统、方法和动作训练系统
CN112700510B (zh) * 2019-10-23 2024-03-15 北京地平线机器人技术研发有限公司 一种热力图构建方法及装置
CN111160088A (zh) * 2019-11-22 2020-05-15 深圳壹账通智能科技有限公司 Vr体感数据检测方法、装置、计算机设备及存储介质
CN110827383B (zh) * 2019-11-25 2020-11-10 腾讯科技(深圳)有限公司 三维模型的姿态模拟方法、装置、存储介质和电子设备
CN110991319B (zh) * 2019-11-29 2021-10-19 广州市百果园信息技术有限公司 手部关键点检测方法、手势识别方法及相关装置
CN111103981B (zh) * 2019-12-20 2024-06-11 北京奇艺世纪科技有限公司 控制指令生成方法及装置
CN111178280A (zh) * 2019-12-31 2020-05-19 北京儒博科技有限公司 一种人体坐姿识别方法、装置、设备及存储介质
CN111402228B (zh) * 2020-03-13 2021-05-07 腾讯科技(深圳)有限公司 图像检测方法、装置和计算机可读存储介质
CN111414839B (zh) * 2020-03-16 2023-05-23 清华大学 基于姿态的情感识别方法及装置
CN111462169B (zh) * 2020-03-27 2022-07-15 杭州视在科技有限公司 一种基于背景建模的老鼠轨迹追踪方法
CN113449570A (zh) * 2020-03-27 2021-09-28 虹软科技股份有限公司 图像处理方法和装置
CN113456058A (zh) * 2020-03-30 2021-10-01 Oppo广东移动通信有限公司 头部姿态的检测方法、装置、电子设备和可读存储介质
CN111488824B (zh) * 2020-04-09 2023-08-08 北京百度网讯科技有限公司 运动提示方法、装置、电子设备和存储介质
CN111539377A (zh) * 2020-05-11 2020-08-14 浙江大学 基于视频的人体运动障碍检测方法、装置及设备
CN111611903B (zh) * 2020-05-15 2021-10-26 北京百度网讯科技有限公司 动作识别模型的训练方法、使用方法、装置、设备和介质
CN111679737B (zh) * 2020-05-27 2022-06-21 维沃移动通信有限公司 手部分割方法和电子设备
CN111723687A (zh) * 2020-06-02 2020-09-29 北京的卢深视科技有限公司 基于神经网路的人体动作识别方法和装置
CN113822097B (zh) * 2020-06-18 2024-01-26 北京达佳互联信息技术有限公司 单视角人体姿态识别方法、装置、电子设备和存储介质
CN111783609A (zh) * 2020-06-28 2020-10-16 北京百度网讯科技有限公司 行人再识别的方法、装置、设备和计算机可读存储介质
CN111898642B (zh) * 2020-06-30 2021-08-13 北京市商汤科技开发有限公司 关键点检测方法、装置、电子设备及存储介质
CN111964606B (zh) * 2020-08-18 2021-12-07 广州小鹏汽车科技有限公司 一种三维信息的处理方法和装置
CN111985556A (zh) * 2020-08-19 2020-11-24 南京地平线机器人技术有限公司 关键点识别模型的生成方法和关键点识别方法
CN111967406A (zh) * 2020-08-20 2020-11-20 高新兴科技集团股份有限公司 人体关键点检测模型生成方法、系统、设备和存储介质
CN112163480B (zh) * 2020-09-16 2022-09-13 北京邮电大学 一种行为识别方法及装置
CN112307940A (zh) * 2020-10-28 2021-02-02 有半岛(北京)信息科技有限公司 模型训练方法、人体姿态检测方法、装置、设备及介质
CN112287865B (zh) * 2020-11-10 2024-03-26 上海依图网络科技有限公司 一种人体姿态识别的方法及装置
CN112464791B (zh) * 2020-11-25 2023-10-27 平安科技(深圳)有限公司 基于二维相机的姿态识别方法、装置、设备和存储介质
CN112465695B (zh) * 2020-12-01 2024-01-02 北京达佳互联信息技术有限公司 图像处理方法、装置、电子设备及存储介质
CN112464895B (zh) * 2020-12-14 2023-09-01 深圳市优必选科技股份有限公司 姿态识别模型训练方法、装置、姿态识别方法和终端设备
CN112580488B (zh) * 2020-12-15 2023-12-22 深圳大学 一种基于自启发的人体姿态估计模型训练方法及装置
CN112560962B (zh) * 2020-12-17 2024-03-22 咪咕文化科技有限公司 骨骼动画的姿态匹配方法、装置、电子设备及存储介质
CN113065458B (zh) * 2021-03-29 2024-05-28 芯算一体(深圳)科技有限公司 基于手势识别的投票方法与系统、电子设备
EP4315282A1 (en) * 2021-03-30 2024-02-07 Fisch, Martin Systems and methods for computer recognition of 3d gesture movements
CN113158920B (zh) * 2021-04-26 2023-12-22 平安科技(深圳)有限公司 特定动作识别模型的训练方法、装置以及计算机设备
CN113298922B (zh) * 2021-06-11 2023-08-29 深圳市优必选科技股份有限公司 人体姿态估计方法、装置及终端设备
CN113850865A (zh) * 2021-09-26 2021-12-28 北京欧比邻科技有限公司 一种基于双目视觉的人体姿态定位方法、系统和存储介质
CN113947635A (zh) * 2021-10-15 2022-01-18 北京百度网讯科技有限公司 图像定位方法、装置、电子设备以及存储介质
CN114220162A (zh) * 2021-11-17 2022-03-22 深圳职业技术学院 一种猪只姿态识别方法及装置
CN114675657B (zh) * 2022-05-25 2022-09-23 天津卡雷尔机器人技术有限公司 一种基于红外摄像头模糊控制算法回巢充电的方法
CN114881893B (zh) * 2022-07-05 2022-10-21 腾讯科技(深圳)有限公司 图像处理方法、装置、设备及计算机可读存储介质
CN115830640B (zh) * 2022-12-26 2024-03-05 北京百度网讯科技有限公司 一种人体姿态识别和模型训练方法、装置、设备和介质
CN115984972B (zh) * 2023-03-20 2023-08-11 乐歌人体工学科技股份有限公司 基于运动视频驱动的人体姿态识别方法
CN116129016B (zh) * 2023-04-17 2023-07-14 广州趣丸网络科技有限公司 一种姿态运动的数字同步方法、装置、设备及存储介质
CN116310012B (zh) * 2023-05-25 2023-07-25 成都索贝数码科技股份有限公司 一种基于视频的三维数字人姿态驱动方法、设备及系统
CN117102856B (zh) * 2023-10-23 2024-02-13 浙江大学 一种大型舱体双平台五自由度位姿识别与调整方法
CN117854156B (zh) * 2024-03-07 2024-05-07 腾讯科技(深圳)有限公司 一种特征提取模型的训练方法和相关装置

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622250A (zh) * 2017-09-27 2018-01-23 深圳市得色科技有限公司 基于机器学习的3d图像识别方法及其系统
US20180357518A1 (en) * 2017-06-13 2018-12-13 Konica Minolta, Inc. Image Recognition Device and Image Recognition Method
CN110020633A (zh) * 2019-04-12 2019-07-16 腾讯科技(深圳)有限公司 姿态识别模型的训练方法、图像识别方法及装置

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8577154B2 (en) * 2008-06-16 2013-11-05 University Of Southern California Automated single viewpoint human action recognition by matching linked sequences of key poses
BRPI0917864A2 (pt) * 2008-08-15 2015-11-24 Univ Brown aparelho e método para estimativa da forma corporal
US8861800B2 (en) * 2010-07-19 2014-10-14 Carnegie Mellon University Rapid 3D face reconstruction from a 2D image and methods using such rapid 3D face reconstruction
US9646384B2 (en) * 2013-09-11 2017-05-09 Google Technology Holdings LLC 3D feature descriptors with camera pose information
WO2018226621A1 (en) * 2017-06-05 2018-12-13 Umajin Inc. Methods and systems for an application system
JP6833630B2 (ja) * 2017-06-22 2021-02-24 株式会社東芝 物体検出装置、物体検出方法およびプログラム
US10733755B2 (en) * 2017-07-18 2020-08-04 Qualcomm Incorporated Learning geometric differentials for matching 3D models to objects in a 2D image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180357518A1 (en) * 2017-06-13 2018-12-13 Konica Minolta, Inc. Image Recognition Device and Image Recognition Method
CN107622250A (zh) * 2017-09-27 2018-01-23 深圳市得色科技有限公司 基于机器学习的3d图像识别方法及其系统
CN110020633A (zh) * 2019-04-12 2019-07-16 腾讯科技(深圳)有限公司 姿态识别模型的训练方法、图像识别方法及装置

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CAO, ZHE ET AL.: "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", HTTPS://ARXIV.ORG/ABS/1812.08008V1, 18 December 2018 (2018-12-18), XP080994611, DOI: 20200612145752Y *
E., SIMO-SERRA ET AL.: "A Joint Model for 2D and 3D Pose Estimation from a Single Image", 2013 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 28 June 2013 (2013-06-28), XP032493257, ISSN: 1063-6919, DOI: 20200612150842Y *
KUDO, YASUNORI ET AL.: "Unsupervised Adversarial Learning of 3D Human Pose from 2D Joint Locations", HTTPS://ARXIV.ORG/ABS/1803.08244V1, 22 March 2018 (2018-03-22), XP080861842, DOI: 20200612150252Y *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112270669A (zh) * 2020-11-09 2021-01-26 北京百度网讯科技有限公司 人体3d关键点检测方法、模型训练方法及相关装置
CN112270669B (zh) * 2020-11-09 2024-03-01 北京百度网讯科技有限公司 人体3d关键点检测方法、模型训练方法及相关装置
CN112488005A (zh) * 2020-12-04 2021-03-12 临沂市新商网络技术有限公司 基于人体骨骼识别和多角度转换的在岗监测方法及系统
CN112488005B (zh) * 2020-12-04 2022-10-14 临沂市新商网络技术有限公司 基于人体骨骼识别和多角度转换的在岗监测方法及系统
CN112509123A (zh) * 2020-12-09 2021-03-16 北京达佳互联信息技术有限公司 三维重建方法、装置、电子设备及存储介质
CN112528858A (zh) * 2020-12-10 2021-03-19 北京百度网讯科技有限公司 人体姿态估计模型的训练方法、装置、设备、介质及产品
CN112801138A (zh) * 2021-01-05 2021-05-14 北京交通大学 基于人体拓扑结构对齐的多人姿态估计方法
CN112801138B (zh) * 2021-01-05 2024-04-09 北京交通大学 基于人体拓扑结构对齐的多人姿态估计方法
CN113100755A (zh) * 2021-03-26 2021-07-13 河北工业大学 一种基于视觉追踪控制的肢体康复训练与评估系统
CN113627083A (zh) * 2021-08-05 2021-11-09 广州帕克西软件开发有限公司 一种基于虚拟试穿实现div衣服的方法
CN113724393A (zh) * 2021-08-12 2021-11-30 北京达佳互联信息技术有限公司 三维重建方法、装置、设备及存储介质
CN113724393B (zh) * 2021-08-12 2024-03-19 北京达佳互联信息技术有限公司 三维重建方法、装置、设备及存储介质

Also Published As

Publication number Publication date
US20210279456A1 (en) 2021-09-09
US11907848B2 (en) 2024-02-20
CN110020633A (zh) 2019-07-16
CN110020633B (zh) 2022-11-04

Similar Documents

Publication Publication Date Title
WO2020207281A1 (zh) 姿态识别模型的训练方法、图像识别方法及装置
CN110781765B (zh) 一种人体姿态识别方法、装置、设备及存储介质
CN111680562A (zh) 一种基于骨骼关键点的人体姿态识别方法、装置、存储介质及终端
Martins et al. Accessible options for deaf people in e-learning platforms: technology solutions for sign language translation
CN113496507A (zh) 一种人体三维模型重建方法
CN105051755A (zh) 用于姿势识别的部位和状态检测
CN111222486B (zh) 手部姿态识别模型的训练方法、装置、设备及存储介质
CN111095170B (zh) 虚拟现实场景及其交互方法、终端设备
US20240005211A1 (en) Data processing method and apparatus
Haggag et al. Semantic body parts segmentation for quadrupedal animals
CN113435236A (zh) 居家老人姿态检测方法、系统、存储介质、设备及应用
CN115578393A (zh) 关键点检测方法、训练方法、装置、设备、介质及产品
CN109858402B (zh) 一种图像检测方法、装置、终端以及存储介质
CN116092120B (zh) 基于图像的动作确定方法、装置、电子设备及存储介质
CN116977506A (zh) 模型动作重定向的方法、装置、电子设备及存储介质
CN115994944A (zh) 三维关键点预测方法、训练方法及相关设备
De Paolis et al. Augmented Reality, Virtual Reality, and Computer Graphics: 6th International Conference, AVR 2019, Santa Maria al Bagno, Italy, June 24–27, 2019, Proceedings, Part II
Li et al. A preliminary exploration to make stereotactic surgery robots aware of the semantic 2D/3D working scene
CN113822137A (zh) 一种数据标注方法、装置、设备及计算机可读存储介质
CN113592986A (zh) 基于神经网络的动作生成方法、装置及计算设备
CN112309181A (zh) 一种舞蹈教学辅助方法及装置
CN117058405B (zh) 一种基于图像的情绪识别方法、系统、存储介质及终端
Cao et al. A novel augmented reality guidance system for future informatization experimental teaching
Li et al. Near-convex decomposition of 2D shape using visibility range
CN117557699B (zh) 动画数据生成方法、装置、计算机设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20787404

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20787404

Country of ref document: EP

Kind code of ref document: A1