WO2024055748A1 - A head posture estimation method, device, equipment and storage medium - Google Patents

A head posture estimation method, device, equipment and storage medium

Info

Publication number
WO2024055748A1
Authority
WO
WIPO (PCT)
Prior art keywords
key point
image
point coordinate
dimensional key
coordinate set
Prior art date
Application number
PCT/CN2023/108312
Other languages
English (en)
French (fr)
Inventor
卫华威
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Publication of WO2024055748A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/20: Movements or behaviour, e.g. gesture recognition
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Definitions

  • the present application relates to the field of image processing, and in particular to head pose estimation technology.
  • head pose estimation refers to inferring the orientation of a person's head relative to the camera view. In visual motion capture scenarios, head pose estimation is an essential component: an accurate head posture allows an avatar to faithfully replicate the human head movements, making the avatar animation more vivid, agile, and realistic.
  • mainstream head posture estimation methods generally require traditional motion sensors or three-dimensional (3-dimension, 3D) image acquisition equipment to obtain the 3D coordinate information of the head. Since current mainstream image acquisition equipment collects two-dimensional (2D) image information, the 2D coordinate information of the facial key points must be converted into 3D coordinates in the world coordinate system to obtain the 3D coordinate information of the head posture; the head posture is then estimated, and head movement judged, based on changes in that coordinate information.
  • the above approach is based on solving the motion of 3D-to-2D point pairs (also known as Perspective-n-Point, PnP). This method first estimates the 2D key points of the human face, then calibrates the corresponding 3D points on a fixed 3D head model based on those 2D key points; by solving the PnP problem, the transformation posture between the 3D points and the corresponding 2D key points is obtained.
  • Embodiments of the present application provide a head posture estimation method, device, equipment and storage medium, which can ensure the stability and reliability of head posture estimation.
  • the present application provides a head posture estimation method, executed by a computer device, including: obtaining an image to be recognized, where the image to be recognized includes a target face image; performing key point recognition processing on the image to be recognized through a first network model to obtain a two-dimensional key point coordinate set of the target face image in the image to be recognized and a three-dimensional key point coordinate set of the target face image, wherein the first network model includes a first branch network and a second branch network, the first branch network is used to identify the two-dimensional key point coordinate set, and the second branch network is used to identify the three-dimensional key point coordinate set; and determining, according to the two-dimensional key point coordinate set and the three-dimensional key point coordinate set, the head posture corresponding to the target face image in the image to be recognized.
  • a head posture estimation device including:
  • An acquisition module used to acquire an image to be recognized, which includes a target face image
  • a processing module configured to perform key point recognition processing on the image to be recognized through the first network model, to obtain a two-dimensional key point coordinate set of the target face image in the image to be recognized and a three-dimensional key point coordinate set of the target face image.
  • the first network model includes a first branch network and a second branch network, wherein the first branch network is used to identify the two-dimensional key point coordinate set, and the second branch network is used to identify the three-dimensional key point coordinate set.
  • the output module is used to determine the head posture corresponding to the target face image in the image to be recognized based on the two-dimensional key point coordinate set and the three-dimensional key point coordinate set.
  • a computer device including: a memory, a processor and a bus system;
  • the memory is used to store programs
  • the processor is used to execute the program in the memory, performing the methods of the above aspects according to the instructions in the program code;
  • the bus system is used to connect the memory and the processor so that the memory and the processor can communicate.
  • the computer-readable storage medium stores instructions, which when run on a computer, cause the computer to perform the methods of the above aspects.
  • Another aspect of the present application provides a computer program product or computer program, the computer program product or computer program including computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the methods provided by the above aspects.
  • the embodiments of the present application have the following advantages: two branch networks respectively output the 2D key point coordinates and the 3D key point coordinates of the target face image in the image to be recognized, and the head posture of the target face image is then calculated from the 2D key point coordinates and the 3D key point coordinates. Because the 3D key point coordinates are obtained in real time, the 3D head model can change as the expression changes. Accordingly, the coordinate correspondence between the 2D key points and the 3D key points is more accurate, so the head posture calculation remains stable and reliable even when large expressions are made.
  • Figure 1 is an architectural schematic diagram of the application system in the embodiment of the present application
  • Figure 2 is an architectural schematic diagram of the first network model in the embodiment of the present application.
  • Figure 3 is another architectural schematic diagram of the first network model in the embodiment of the present application.
  • Figure 4 is a schematic flow chart of the head posture estimation method in the embodiment of the present application.
  • Figure 5 is a schematic diagram of the target face image in the image to be recognized in the embodiment of the present application.
  • Figure 6 is a schematic flowchart of the image to be processed to obtain the image to be recognized through the image processing model in the embodiment of the present application;
  • Figure 7a is a schematic diagram of an embodiment of head posture estimation in the embodiment of the present application.
  • Figure 7b is a schematic diagram of a virtual image generated after head posture estimation of the image to be processed in the embodiment of the present application.
  • Figure 8 is a schematic diagram of an embodiment of the head posture estimation device in the embodiment of the present application.
  • Figure 9 is a schematic diagram of another embodiment of the head posture estimation device in the embodiment of the present application.
  • Figure 10 is a schematic diagram of another embodiment of the head posture estimation device in the embodiment of the present application.
  • Figure 11 is a schematic diagram of another embodiment of the head posture estimation device in the embodiment of the present application.
  • Embodiments of the present application provide a head posture estimation method, device, equipment and storage medium to ensure the stability and reliability of head posture estimation.
  • the method includes: obtaining an image to be recognized, which includes a target face image; inputting the image to be recognized into the first network model to obtain a two-dimensional key point coordinate set of the target face image in the image to be recognized and a three-dimensional key point coordinate set of the target face image, wherein the first network model includes a first branch network and a second branch network, the first branch network is used to identify the two-dimensional key point coordinate set, and the second branch network is used to identify the three-dimensional key point coordinate set; and determining the head posture corresponding to the target face image in the image to be recognized based on the two-dimensional key point coordinate set and the three-dimensional key point coordinate set.
  • because the 3D key point coordinates of the face image in the image to be recognized can be obtained in real time, the 3D head model can change with the character's expression. The coordinate correspondence between the 2D key points and the 3D key points is therefore more accurate, which ensures that the head posture calculation remains stable and reliable when the character in the image to be recognized makes large expressions.
  • Key points of facial features: used to represent the positions of the facial features on the human face; these positions can be represented by key points.
  • the facial features key points involved in the embodiment of this application include five points corresponding to the left pupil, right pupil, nose tip, left mouth corner and right mouth corner of the human face.
  • Euler angles refer to a set of three independent angular parameters, proposed by Euler, that determine the orientation of a rigid body rotating about a fixed point.
  • the embodiment of the present application establishes a rectangular coordinate system based on the human face, taking the case where the face posture angle is expressed as Euler angles as an example. The Euler angles are defined in a three-dimensional rectangular coordinate system whose origin is the center (or center of gravity) of the person's head: the X-axis points from one ear of the face to the other, the Y-axis points from the top of the head to the neck, and the Z-axis points from the face to the back of the head.
  • Euler angles include the following three angles:
  • Pitch angle (pitch): the angle of rotation around the X-axis;
  • Yaw angle (yaw): the angle of rotation around the Y-axis;
  • Roll angle (roll): the angle of rotation around the Z-axis.
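  • For reference, the head orientation can be written as a rotation matrix assembled from the three Euler angles. The composition order below (pitch about X, then yaw about Y, then roll about Z) is a common convention and an assumption here, as the application does not specify one:

$$
R = R_z(\mathrm{roll})\,R_y(\mathrm{yaw})\,R_x(\mathrm{pitch}),\quad
R_x(\alpha)=\begin{pmatrix}1&0&0\\0&\cos\alpha&-\sin\alpha\\0&\sin\alpha&\cos\alpha\end{pmatrix},\;
R_y(\beta)=\begin{pmatrix}\cos\beta&0&\sin\beta\\0&1&0\\-\sin\beta&0&\cos\beta\end{pmatrix},\;
R_z(\gamma)=\begin{pmatrix}\cos\gamma&-\sin\gamma&0\\\sin\gamma&\cos\gamma&0\\0&0&1\end{pmatrix}
$$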
  • Visual motion capture: traditional motion capture uses inertial sensors or markers attached to the human body, whereas visual motion capture requires no wearable equipment and can capture a person's facial and body movements using one or more cameras.
  • Degrees of freedom (DoF): the number of independent directions in which an object can move in 3D space, six in total. That is, the human head posture includes rotation and translation: the rotation is represented by three Euler angles, and the translation by displacement along three directions, together forming a 6-degree-of-freedom posture parameter.
  • PnP (Perspective-n-Point) is a method for solving the motion between 3D and 2D point pairs. Its purpose is to solve the pose of the camera coordinate system relative to the world coordinate system: given the coordinates of several 3D points (in the world coordinate system) and the 2D image coordinates of those points, PnP estimates the camera pose, i.e., the rotation matrix and translation vector from the world coordinate system to the camera coordinate system.
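  • As an illustration only (not taken from the application itself), a minimal PnP solve with OpenCV's solvePnP might look as follows; the point values and camera intrinsics are placeholder assumptions:

```python
import cv2
import numpy as np

# Six corresponding points: 3D coordinates in the head model frame and their
# observed 2D projections in the image. All values are illustrative only.
object_points = np.array([
    [0.0, 0.0, 0.0],        # nose tip
    [-30.0, 35.0, -30.0],   # left eye
    [30.0, 35.0, -30.0],    # right eye
    [-25.0, -30.0, -20.0],  # left mouth corner
    [25.0, -30.0, -20.0],   # right mouth corner
    [0.0, -65.0, -5.0],     # chin
], dtype=np.float64)
image_points = np.array([
    [128.0, 128.0], [98.0, 95.0], [158.0, 95.0],
    [103.0, 160.0], [153.0, 160.0], [128.0, 195.0],
], dtype=np.float64)

# Pinhole intrinsics for a 256x256 crop; the focal length is an assumption.
f, cx, cy = 256.0, 128.0, 128.0
camera_matrix = np.array([[f, 0, cx], [0, f, cy], [0, 0, 1]], dtype=np.float64)

# Iteratively find the rotation/translation that reprojects the 3D points
# as close as possible to the observed 2D points.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix,
                              np.zeros(4), flags=cv2.SOLVEPNP_ITERATIVE)
R, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix of the pose
```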
  • Convolutional layer: a layered structure composed of several convolution units in a convolutional neural network.
  • Convolutional Neural Network (CNN) is a feedforward neural network.
  • a convolutional neural network includes at least two neural network layers. Each layer contains several neurons arranged in layers; neurons in the same layer are not connected to each other, and information is transmitted between layers in only one direction.
  • the pooling layer, also known as the sampling layer, refers to a layered structure that performs secondary feature extraction on its input. The pooling layer preserves the main features of the previous layer while reducing the parameters and computation of the next layer. It is composed of multiple feature maps, and each feature map of the convolutional layer corresponds to one feature map in the pooling layer, so the number of feature maps is not changed. By reducing the resolution of the feature maps, spatially invariant features are obtained.
  • FC: fully connected layer.
  • Forward propagation refers to the feedforward processing process of the model.
  • Backpropagation is the opposite of forward propagation: it updates the weight parameters of each layer of the model based on the model's output. For example, if the model includes an input layer, a hidden layer, and an output layer, forward propagation processes in the order input layer, hidden layer, output layer, while backpropagation updates the weight parameters of each layer in the order output layer, hidden layer, input layer.
  • the head posture estimation method, device, equipment and storage medium provided by the embodiments of the present application can ensure the stability and reliability of head posture estimation.
  • Exemplary applications of the electronic devices provided by the embodiments of the present application are described below.
  • the electronic devices provided by the embodiments of the present application can be implemented as various types of user terminals or as servers.
  • the electronic device can ensure the stability and reliability of head posture estimation, that is, improve the stability and reliability of its own head posture estimation, and is suitable for multiple application scenarios of head posture estimation, for example: augmented reality (AR) games, virtual reality (VR) games, aiding gaze estimation, modeling attention, fitting 3D models to videos, and performing facial alignment.
  • Figure 1 is an optional architectural schematic diagram in an application scenario of the head posture estimation solution provided by the embodiment of the present application.
  • the terminal device 100 is exemplarily shown as terminal device 1001 and terminal device 1002.
  • the server 300 is connected to the database 400.
  • the network 200 can be a wide area network or a local area network, or a combination of the two.
  • a client used to implement the head pose estimation solution is deployed on the terminal device 100.
  • the client can run on the terminal device 100 in the form of a browser, or in the form of an independent application (application, APP); the specific presentation form of the client is not limited here.
  • the server 300 involved in this application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms.
  • the terminal device 100 may be a smartphone, a tablet computer, a notebook computer, a handheld computer, a personal computer, a smart TV, a smart watch, a vehicle-mounted device, a wearable device, etc., but is not limited thereto.
  • the terminal device 100 and the server 300 can be connected directly or indirectly through the network 200 through wired or wireless communication methods, which is not limited in this application.
  • the number of servers 300 and terminal devices 100 is also not limited.
  • the solution provided by this application can be completed independently by the terminal device 100, or can be completed independently by the server 300, or can be completed by the terminal device 100 and the server 300 in cooperation. This application does not specifically limit this.
  • the database 400 can be simply regarded as an electronic file cabinet - a place where electronic files are stored. Users can perform operations such as adding, querying, updating, and deleting the data in the files.
  • a database is a collection of data that is stored in a certain manner, can be shared with multiple users, has as little redundancy as possible, and is independent of applications.
  • Database Management System (DBMS) is a computer software system designed for managing databases. It generally has basic functions such as storage, interception, security, and backup.
  • Database management systems can be classified according to the database models they support, such as relational or Extensible Markup Language (XML); according to the types of computers they support, such as server clusters or mobile phones; according to the query language used, such as Structured Query Language (SQL) or XQuery; according to the focus of performance impact, such as maximum scale or maximum running speed; or by other classification methods. Regardless of the classification scheme used, some DBMSs span categories, for example supporting multiple query languages simultaneously.
  • the database 400 can be used to store the training sample set and the image to be recognized.
  • the storage location of the training sample set is not limited to the database.
  • it can also be stored in the terminal device 100, in a blockchain, or in the distributed file system of the server 300.
  • the server 300 can execute the head pose estimation method provided by the embodiment of the present application and the training method of the first network model in the head pose estimation.
  • the first network model includes a first branch network and a second branch network, wherein the first branch network is used to identify the two-dimensional key point coordinates and the uncertainty, and the second branch network is used to identify the three-dimensional key point coordinates.
  • the specific process may be as follows: obtain, from the terminal device 100 and/or the database 400, a first training sample set labeled with real 2D key point coordinates and real 3D key point coordinates, and perform detection processing on the first training sample set through the first initial network model to be trained, to obtain the predicted 2D key point coordinates and predicted 3D key point coordinates of each face in the first training sample set.
  • the uncertainty corresponding to the predicted 2D key point coordinates of each face in the first training sample set can also be obtained.
  • the server may use Gaussian Negative Log Likelihood Loss (GaussianNLLLoss, GNLL) to calculate the first loss value.
  • the specific calculation process may use Formula 1:
  • the N is used to indicate the number of the 2D key points
  • the y is used to indicate the real 2D key point coordinates
  • the f(x) is used to represent the predicted 2D key point coordinates output by the first branch network
  • the σ is used to represent the uncertainty of the predicted 2D key point coordinates.
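  • Formula 1 itself is not reproduced in this text. Based on the symbol definitions above, it is presumably the standard Gaussian negative log-likelihood (a reconstruction, with constant terms omitted):

$$
\mathcal{L}_{1} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{2}\log \sigma_i^{2} + \frac{\bigl(y_i - f(x_i)\bigr)^{2}}{2\sigma_i^{2}}\right)
$$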
  • the server may use a regression loss function to calculate the second loss value.
  • the server uses L2LOSS, and its specific calculation process may use Formula 2:
  • the N is used to indicate the number of the 3D key points
  • the y is used to indicate the real 3D key point coordinates
  • the f(x) is used to indicate the predicted 3D key point coordinates output by the second branch network.
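  • Formula 2 is likewise not reproduced here; given the symbols above, it is presumably the mean squared (L2) error (a reconstruction):

$$
\mathcal{L}_{2} = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^{2}
$$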
  • the initial model architecture of the first network model may include a feature extraction network, a fully connected layer, a pooling layer, and the first branch network and the second branch network.
  • the feature extraction network can be a CNN such as a Residual Neural Network (ResNet), LeNet or AlexNet, a high-resolution network with feature pyramid (High-Resolution netV2P, HRNetV2P), or a hierarchical visual self-attention model based on shifted windows (Swin Transformer); the first branch network and the second branch network can be fully connected layers.
  • the first network model is illustrated below taking ResNet50 as the feature extraction network.
  • the first network model includes the ResNet50, where the ResNet50 includes 49 convolutional layers and a fully connected layer, and the fully connected layer is connected to the pooling layer; the output of the pooling layer is connected to two fully connected layers, one of which is the first branch network and the other the second branch network.
  • for the standard ResNet50, the input of the network is 224×224×3 and the output of the last convolutional stage is 7×7×2048; the pooling layer converts this output into a feature vector, from which the classifier computes and outputs class probabilities.
  • the ResNet50 network structure can be divided into five parts. The first part does not contain residual blocks and mainly performs convolution, regularization, activation function, and max pooling computations on the input. The second, third, fourth and fifth parts all contain residual blocks, which do not change the size of the feature map and are only used to change its dimension.
  • in the embodiment of the present application, the input of the ResNet50 network is 256×256×3, and the output is an N×2048×8×8 feature map, where N is the number of samples selected for one training pass (also called the batch size).
  • the N×2048×8×8 feature map is passed through the pooling layer to obtain N×2048 features.
  • the N×2048 features then pass through the two FC layers, which respectively output the 2D key point coordinates with their uncertainty, and the 3D key point coordinates.
  • the weight dimension of each FC layer is 2048×3660 (1220 points), where 3660 is regarded as 1220×3. For the first branch network, 3 represents the x and y coordinates plus the uncertainty σ; for the second branch network, 3 represents the x, y, and z coordinate values.
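  • A minimal PyTorch sketch of this two-branch architecture is shown below; the class and variable names, and the use of torchvision's resnet50 as the backbone, are assumptions for illustration:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TwoBranchKeypointNet(nn.Module):
    def __init__(self, num_points: int = 1220):
        super().__init__()
        self.num_points = num_points
        backbone = resnet50()
        # Drop ResNet50's own avgpool and classifier so that a 256x256 input
        # yields an N x 2048 x 8 x 8 feature map.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)               # -> N x 2048 x 1 x 1
        # Each branch is a 2048 x 3660 fully connected layer (1220 points x 3).
        self.branch_2d = nn.Linear(2048, num_points * 3)  # x, y, sigma per point
        self.branch_3d = nn.Linear(2048, num_points * 3)  # x, y, z per point

    def forward(self, x):
        feat = self.pool(self.features(x)).flatten(1)     # N x 2048
        kp2d = self.branch_2d(feat).view(-1, self.num_points, 3)
        kp3d = self.branch_3d(feat).view(-1, self.num_points, 3)
        return kp2d, kp3d

# Usage: a batch of two 256x256 RGB face crops.
model = TwoBranchKeypointNet()
kp2d, kp3d = model(torch.randn(2, 3, 256, 256))
```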
  • the server 300 can execute the head pose estimation method provided by the embodiment of the present application and the training method of the first network model in the head pose estimation.
  • in another implementation, the first network model includes a first branch network, a second branch network and a computing network, wherein the first branch network is used to identify the two-dimensional key point coordinates, the second branch network is used to identify the three-dimensional key point coordinates, and the computing network is used to estimate the head posture from the 2D key point coordinates and the 3D key point coordinates.
  • the specific process can be as follows: obtain, from the terminal device 100 and/or the database 400, a second training sample set labeled with real 2D key point coordinates, real 3D key point coordinates and real head postures, and perform detection processing on the second training sample set through the second initial network model to be trained, to obtain the predicted 2D key point coordinates and predicted 3D key point coordinates of each face in the second training sample set.
  • the uncertainty corresponding to the predicted 2D key point coordinates can also be obtained.
  • using a loss function that includes pre-designed loss factors (such as interval value and distance), the first branch network and the second branch network can be jointly trained, so that the two branches influence each other and enhance the network's learning ability.
  • when training the first network model, the server can calculate the first loss value using the Gaussian Negative Log Likelihood Loss (GaussianNLLLoss, GNLL), as in Formula 1 above.
  • the server may again use a regression loss function to calculate the second loss value; for example, the server uses the L2 loss, whose calculation can use Formula 2 above with the same symbol definitions.
  • the initial model architecture of the first network model may include a feature extraction network, a fully connected layer, a pooling layer, the first branch network, the second branch network and the computing network.
  • the feature extraction network can be a CNN such as a Residual Neural Network (ResNet), LeNet or AlexNet, a high-resolution network with feature pyramid (High-Resolution netV2P, HRNetV2P), or a hierarchical visual self-attention model based on shifted windows (Swin Transformer); the first branch network and the second branch network can be fully connected layers.
  • the first network model is again illustrated below taking ResNet50 as the feature extraction network.
  • the first network model includes the ResNet50, where the ResNet50 includes 49 convolutional layers and a fully connected layer, and the fully connected layer is connected to the pooling layer; the output of the pooling layer is connected to two fully connected layers, one of which is the first branch network and the other the second branch network.
  • for the standard ResNet50, the input of the network is 224×224×3 and the output of the last convolutional stage is 7×7×2048; the pooling layer converts this output into a feature vector, from which the classifier computes and outputs class probabilities.
  • the ResNet50 network structure can be divided into five parts. The first part does not contain residual blocks and mainly performs convolution, regularization, activation function, and max pooling computations on the input. The second, third, fourth and fifth parts all contain residual blocks, which do not change the size of the feature map and are only used to change its dimension.
  • each residual block contains three convolutional layers.
  • in the embodiment of the present application, the input of the ResNet50 network is 256×256×3, and the output is an N×2048×8×8 feature map, where N is the number of samples selected for one training pass (also called the batch size).
  • the N×2048×8×8 feature map is passed through the pooling layer to obtain N×2048 features.
  • the N×2048 features then pass through the two FC layers, which respectively output the 2D key point coordinates with their uncertainty, and the 3D key point coordinates.
  • the weight dimension of each FC layer is 2048×3660 (1220 points), where 3660 is regarded as 1220×3. For the first branch network, 3 represents the x and y coordinates plus the uncertainty σ; for the second branch network, 3 represents the x, y, and z coordinate values.
  • the predicted head posture is calculated based on the predicted 2D key point coordinates and uncertainty output by the first branch network and the predicted 3D key point coordinates output by the second branch network.
  • the data of the first training sample set and the second training sample set can be collected as follows: face images are collected through the depth camera of the terminal device, facial 3D point cloud data and the corresponding head posture are captured in real time using the device system's augmented reality technology (ARKit), and data collection software developed based on ARKit then collects facial data at 60 frames per second (FPS).
  • external cameras can also be used when acquiring the first training sample set and the second training sample set. The external camera can be a depth camera or another camera, as long as facial data collection can be achieved; the specific method is not limited here. Using external devices for facial data collection reduces the requirements on hardware devices, thereby reducing costs.
  • the server 300 can save the first network model locally, thereby providing the terminal device 100 with a remote head posture estimation function.
  • the server 300 can receive the image to be recognized sent by the terminal device 100, detect and process it through the first network model to obtain the head posture corresponding to the target face image in the image to be recognized and the corresponding confidence probability, and finally send the head posture to the terminal device 100, so that the terminal device 100 displays it in the graphical interface 110 (graphical interface 1101 and graphical interface 1102 are shown as examples).
  • the server 300 may also send (deploy) the trained first network model to the terminal device 100, thereby realizing head pose estimation locally on the terminal device 100.
  • the terminal device 100 can obtain the image to be recognized in real time or from other devices, and detect and process it through the first network model to obtain the head posture corresponding to the target face image in the image to be recognized and the corresponding confidence probability; finally, the terminal device 100 displays the head posture in the graphical interface 110 (graphical interface 1101 and graphical interface 1102 are shown as examples).
  • an execution process of the head pose estimation method in this application can be as follows:
  • Step 1: generate an image to be recognized for the target face.
  • the image to be recognized includes a target face image, where the target face image refers to an area in the image to be recognized that only includes face images, and does not include other background images.
  • as shown in Figure 5, (a) in Figure 5 is an image that includes a background image, and (b) in Figure 5 is an image to be recognized that includes the target face image.
  • various cameras can first collect the image to be processed including other background images, and then preprocess the image to be processed to obtain the image to be recognized.
  • the specific process can be as shown in Figure 6: the image a to be processed is collected through the camera; sparse key points in the image a are then obtained through face detection, where the sparse key points can be facial feature points and face contour points; the target face image is extracted from the image to be processed according to the face contour points; the target face image is then horizontally aligned using the eye key points among the facial feature points, and the image is scaled to the target size, thereby obtaining the image to be recognized.
  • Step 2: detect the image to be recognized through the first network model to obtain the 2D key point coordinates, the uncertainty of the 2D key point coordinates, and the 3D key point coordinates.
  • Step 3: screen the 2D key point coordinates and the 3D key point coordinates according to the uncertainty, and obtain the target 2D key point coordinates and target 3D key point coordinates whose uncertainty is less than the prediction threshold.
  • Step 4: use the PnP algorithm to estimate the head posture corresponding to the target face image from the target 2D key point coordinates and the target 3D key point coordinates.
  • One embodiment of the head posture estimation method in this application embodiment includes:
  • the terminal device can collect the image to be processed through its own camera, and then input the image to be processed into the image processing model for processing to obtain the image to be recognized.
  • the terminal device can also obtain the image to be recognized stored in the memory.
  • the terminal device can also obtain the image to be recognized through an instant messaging application, where the instant messaging application refers to software that enables online chatting and communication through instant messaging technology.
  • the terminal device can also obtain the image to be recognized from the Internet, for example, by obtaining a video image from an online video site and extracting a face image from it, or by directly downloading a face image from the Internet.
  • the specific process can be as follows: obtain an image to be processed, which includes a target face image collected by a camera; determine the sparse key points of the target face image in the image to be processed through an image preprocessing network, where the sparse key points include the facial feature points and face contour points of the target face image; obtain the target face image from the image to be processed based on the face contour points; and horizontally align the target face image based on the facial feature points and scale it to the target size to obtain the image to be recognized.
  • the target face image here means that the image to be recognized includes only the face image and no longer includes other background images.
  • (a) in Figure 5 is an image including a background image
  • (b) in Figure 5 is an image to be recognized including a target face image.
  • various cameras can first collect the image to be processed including the background image, and then preprocess the image to be processed to obtain the image to be recognized.
  • the image a to be processed is collected through the camera; face detection is then performed on the image a through the image preprocessing network to obtain the sparse key points in the image a, where the sparse key points can be facial feature points and face contour points; the target face image is then extracted from the image a according to the face contour points; finally, the target face image is horizontally aligned using the eye key points among the facial feature points, and the horizontally aligned target face image is scaled to a preset target size, thereby obtaining the image to be recognized.
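  • A hedged sketch of this preprocessing in Python with OpenCV follows; the function name and its inputs are illustrative assumptions (the application only specifies detection, cropping, eye-level alignment, and scaling):

```python
import cv2
import numpy as np

def preprocess_face(image_bgr, face_box, eye_left, eye_right, target_size=256):
    # Crop the face region given by the detected contour/bounding box.
    x, y, w, h = face_box
    face = image_bgr[y:y + h, x:x + w]
    # Rotate so that the line through the two eye key points is horizontal.
    dx = eye_right[0] - eye_left[0]
    dy = eye_right[1] - eye_left[1]
    angle = np.degrees(np.arctan2(dy, dx))
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    aligned = cv2.warpAffine(face, M, (w, h))
    # Scale the aligned crop to the preset target size.
    return cv2.resize(aligned, (target_size, target_size))
```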
  • through the first network model, key point recognition processing is performed on the image to be recognized to obtain a two-dimensional key point coordinate set of the target face image in the image to be recognized and a three-dimensional key point coordinate set of the target face image.
  • the first network model includes a first branch network and a second branch network, wherein the first branch network is used to identify the two-dimensional key point coordinate set, and the second branch network is used to identify the three-dimensional key point coordinate set.
  • the uncertainty corresponding to each two-dimensional key point in the above two-dimensional key point coordinate set can also be obtained; this uncertainty is identified by the first branch network in the above first network model.
  • the terminal device inputs the image to be recognized into the first network model; the feature extraction network of the first network model then performs feature extraction on the image to be recognized to obtain its final feature representation, which is input into both the first branch network and the second branch network of the first network model. The first branch network outputs the two-dimensional key point coordinate set (i.e., the 2D key point coordinates) of the target face image in the image to be recognized, together with the uncertainty; the second branch network outputs the three-dimensional key point coordinate set (i.e., the 3D key point coordinates) of the target face image in the image to be recognized.
  • according to the two-dimensional key point coordinate set and the three-dimensional key point coordinate set, the head posture corresponding to the target face image in the image to be recognized is determined.
  • the terminal device can determine the head posture corresponding to the target face image in the image to be recognized according to the two-dimensional key point coordinate set, the uncertainty corresponding to each two-dimensional key point coordinate in the set, and the three-dimensional key point coordinate set.
  • the terminal device can filter the two-dimensional key point coordinates in the two-dimensional key point coordinate set and the three-dimensional key point coordinates in the three-dimensional key point coordinate set according to the uncertainty, to obtain intermediate two-dimensional key point coordinates and intermediate three-dimensional key point coordinates, and then use the PnP algorithm to obtain the head posture corresponding to the target face image based on these intermediate coordinates.
  • the terminal device sorts the uncertainties corresponding to the two-dimensional key point coordinates, eliminates the 20% of two-dimensional key point coordinates with the largest uncertainty, determines from the retained intermediate two-dimensional key point coordinates which intermediate three-dimensional key point coordinates to retain, and then solves for the head posture by PnP based on the retained intermediate two-dimensional and three-dimensional key point coordinates.
  • the terminal device can use OpenCV's built-in solvePnP algorithm, which iteratively solves for the posture so that the intermediate 3D key point coordinates, after being projected with the posture, are as close as possible to the intermediate 2D key point coordinates.
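  • A minimal sketch of this screening-plus-PnP step is shown below; the function name and the 80% keep-ratio parameterization are illustrative assumptions:

```python
import cv2
import numpy as np

def estimate_head_pose(kp2d, sigma, kp3d, camera_matrix, keep_ratio=0.8):
    # Keep the most certain points: drop the 20% with the largest uncertainty.
    n_keep = int(len(sigma) * keep_ratio)
    keep = np.argsort(sigma)[:n_keep]
    # Iteratively solve for the posture that reprojects the retained 3D
    # points as close as possible to the retained 2D points.
    ok, rvec, tvec = cv2.solvePnP(
        kp3d[keep].astype(np.float64),
        kp2d[keep].astype(np.float64),
        camera_matrix, np.zeros(4), flags=cv2.SOLVEPNP_ITERATIVE)
    return rvec, tvec  # rotation (Rodrigues vector) and translation
```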
  • the technical solutions provided by the embodiments of the present application can be applied to virtual image construction, aiding gaze estimation, modeling attention, fitting 3D models to videos, and performing facial alignment.
  • the construction of a game character is taken as an example.
  • the game device collects the user's facial data through the camera and performs head posture estimation to generate an image of the virtual object corresponding to the user; the virtual object then interacts with other virtual objects in the game to achieve game interaction.
  • the game device collects facial data through a camera and generates a virtual object based on the facial data, as shown in Figure 7b.
  • the head movements of the game character are generated by collecting facial movements.
  • the collected image to be processed can be as shown in (a) in Figure 7b, that is, the head movement displayed in the target face image is a head tilt; through the head posture estimation provided by the embodiment of the present application, the corresponding game character is obtained, whose head movement simultaneously displays the head tilt.
  • the head movements of the corresponding virtual object are generated by collecting the user's facial movements, improving the game interaction experience; at the same time, real-time synchronization of the virtual object's head movements is achieved, which improves data processing efficiency.
  • this head posture estimation method can also be applied to live broadcasts or video recording. That is, when users do not want to appear in a live video with their own appearance, their facial data can be collected through the camera, a corresponding virtual image can be generated based on that facial data, and the virtual image can then be used for the live broadcast or video recording. In this way, the actions of the avatar are synchronized with the user's actions, effectively realizing interaction with the users watching the video while protecting the user's privacy.
  • FIG. 8 is a schematic diagram of a head posture estimation device in an embodiment of the present application.
  • the head posture estimation device 20 includes:
  • the acquisition module 201 is used to acquire an image to be recognized, where the image to be recognized includes a target face image;
  • the processing module 202 is configured to perform key point recognition processing based on the image to be recognized through the first network model, and obtain a set of two-dimensional key point coordinates of the target face image in the image to be recognized, and a three-dimensional key point coordinate set of the target face image in the image to be recognized.
  • the output module 203 is used to determine the head posture corresponding to the target face image in the image to be recognized based on the two-dimensional key point coordinate set and the three-dimensional key point coordinate set.
  • An embodiment of the present application provides a head posture estimation device.
  • With the above device, two branch networks respectively output the 2D key point coordinates and 3D key point coordinates of the target face image in the image to be recognized, and the head posture of the target face image is then calculated based on the 2D key point coordinates and the 3D key point coordinates. Since the 3D key point coordinates can be obtained in real time, the 3D head model can change with changes in expression; correspondingly, the coordinate correspondence between the 2D key points and the 3D key points is more accurate, so the head posture estimation remains stable and reliable when large expressions are made.
  • the processing module 202 is specifically configured to perform key point identification processing based on the image to be recognized through the first network model to obtain a two-dimensional key point coordinate set and the corresponding uncertainty of each two-dimensional key point coordinate in the two-dimensional key point coordinate set. and a three-dimensional key point coordinate set; wherein, the first branch network in the first network model is also used to identify the corresponding uncertainty of each two-dimensional key point coordinate in the two-dimensional key point coordinate set;
  • the output module 203 is specifically used to determine the head posture corresponding to the target face image in the image to be recognized based on the two-dimensional key point coordinate set, uncertainty and the three-dimensional key point coordinate set.
  • the output module 203 is specifically used to eliminate, from the two-dimensional key point coordinate set, the two-dimensional key point coordinates whose corresponding uncertainty is greater than a preset threshold, to obtain an intermediate two-dimensional key point coordinate set; to obtain an intermediate three-dimensional key point coordinate set from the three-dimensional key point coordinate set according to the intermediate two-dimensional key point coordinate set; and to determine accordingly the head posture corresponding to the target face image in the image to be recognized.
  • An embodiment of the present application provides a head posture estimation device. Using the above device, the 2D key point coordinates and the 3D key point coordinates are screened according to the uncertainty corresponding to the 2D key point coordinates, so that points with greater uncertainty are removed when estimating the head posture, making the head posture estimation more robust.
  • the output module 203 is specifically used to determine, by PnP solving, the head posture corresponding to the target face image in the image to be recognized according to the intermediate two-dimensional key point coordinate set and the intermediate three-dimensional key point coordinate set.
  • An embodiment of the present application provides a head posture estimation device. Using the above device and using the PnP solution method to perform posture estimation makes the head posture estimation more feasible.
  • the acquisition module 201 is also used to acquire a first training sample set, which includes training samples labeled with a real two-dimensional key point coordinate set and a real three-dimensional key point coordinate set of the human face image;
  • the head posture estimation device also includes a training module 204, used to perform feature extraction processing on the face images corresponding to the training samples in the first training sample set through the feature extraction network layer in the first initial network model to be trained, to obtain the feature representations of the training samples; to determine, through the initial first branch network of the first initial network model, the predicted two-dimensional key point coordinate set corresponding to each training sample based on its feature representation; and to determine, through the initial second branch network of the first initial network model, the predicted three-dimensional key point coordinate set corresponding to each training sample based on its feature representation;
  • the first loss value is calculated based on the predicted two-dimensional key point coordinate set and the real two-dimensional key point coordinate set in the training sample, and the second loss value is calculated based on the predicted three-dimensional key point coordinate set and the real three-dimensional key point coordinate set in the training sample; the initial first branch network and the initial second branch network are then adjusted according to the first loss value and the second loss value.
  • the first network model is obtained according to the first branch network and the second branch network.
  • An embodiment of the present application provides a head posture estimation device.
  • in this device, the first branch network and the second branch network are trained during the training process, and the two branch networks respectively output the 2D key point coordinates and the 3D key point coordinates of the target face image in the image to be recognized; the head posture of the target face image is then calculated based on the 2D key point coordinates and the 3D key point coordinates. Since the 3D key point coordinates can be obtained in real time, the 3D head model can change with changes in expression; correspondingly, the coordinate correspondence between the 2D key points and the 3D key points is more accurate, so the head posture estimation is guaranteed to be stable and reliable when large expressions are made.
  • the first branch network and the second branch network are trained independently, which can increase the generalization of the model.
  • the training module 204 is specifically used to:
  • determine the predicted two-dimensional key point coordinate set corresponding to the training sample and the prediction uncertainty corresponding to each predicted key point coordinate in that set;
  • the first loss value is calculated according to the predicted two-dimensional key point coordinate set, the prediction uncertainty and the true two-dimensional key point coordinate set.
  • the acquisition module 201 is specifically used to collect training images through a depth camera to obtain a training image set.
  • each training image in the training image set includes three-dimensional point cloud data of face images and real head poses;
  • the first training sample set is determined based on the training image set.
  • An embodiment of the present application provides a head posture estimation device. Using the above device and using a depth camera to collect training images, the 3D point cloud data and the real head posture can be obtained more conveniently, thereby simplifying the acquisition process of the training sample set.
  • the acquisition module 201 is specifically configured to determine, using the image processing network and based on the training images in the training image set, the sparse key points of the face image in each training image, where the sparse key points include the facial feature points and face contour points of the face image in the training image; and to horizontally align the face image according to the facial feature points and scale it to a target size, to obtain the training samples in the first training sample set.
  • An embodiment of the present application provides a head posture estimation device. Using the above device, the face image is cut out and aligned through sparse key points, which can reduce the interference caused by background information in the image collected by the camera. Scaling the image to a uniform size will facilitate image feature extraction and reduce training difficulty.
  • in some embodiments, the sparse key points are at least five facial feature points and four face contour points.
  • An embodiment of the present application provides a head posture estimation device. Using the above device, the image preprocessing process can be reduced while ensuring accurate cutout and alignment.
  • the feature extraction network includes a residual neural network ResNet and a pooling layer.
  • the first branch network is a fully connected layer
  • the second branch network is a fully connected layer.
  • An embodiment of the present application provides a head posture estimation device. Using the above device can increase the feasibility of the solution.
  • the training module 204 is specifically used to calculate the first loss value based on the predicted two-dimensional key point coordinate set, the prediction uncertainty and the real two-dimensional key point coordinate set using the Gaussian negative log-likelihood loss, and to calculate the second loss value based on the predicted three-dimensional key point coordinate set and the real three-dimensional key point coordinate set using a regression loss function.
  • An embodiment of the present application provides a head posture estimation device. Using the above device can increase the feasibility of the solution.
  • the acquisition module 201 is specifically used to acquire a second training sample set.
  • the second training sample set includes training samples labeled with a real two-dimensional key point coordinate set of the human face image, a real three-dimensional key point coordinate set, and a real head posture;
  • the head posture estimation device also includes a training module 204, used to perform feature extraction processing on the face images corresponding to the training samples in the second training sample set through the feature extraction network layer in the second initial network model to be trained, to obtain the feature representations of the training samples; to determine, through the initial first branch network of the second initial network model, the predicted two-dimensional key point coordinate set corresponding to each training sample based on its feature representation; and to determine, through the initial second branch network of the second initial network model, the predicted three-dimensional key point coordinate set corresponding to each training sample based on its feature representation;
  • the first loss value is calculated based on the predicted two-dimensional key point coordinate set and the real two-dimensional key point coordinate set in the training sample; the second loss value is calculated based on the predicted three-dimensional key point coordinate set and the real three-dimensional key point coordinate set in the training sample; and a third loss value is calculated based on the predicted head posture and the real head posture in the training sample;
  • the second initial network model is adjusted according to the first loss value, the second loss value and the third loss value to obtain the first network model.
  • An embodiment of the present application provides a head posture estimation device.
  • the first branch network and the second branch network are trained during the training process; the two branch networks respectively output the 2D key point coordinates and the 3D key point coordinates of the target face image in the image to be recognized, and the head posture of the target face image is then calculated based on the 2D key point coordinates and the 3D key point coordinates.
  • the 3D head model can change with changes in expressions.
  • the coordinate correspondence between the 2D key points and the 3D key points is made more accurate, so that the head posture calculation remains stable and reliable even when exaggerated expressions are made.
  • the first branch network and the second branch network are jointly trained, thereby increasing the learnability of the model.
  • the training module 204 is specifically used to:
  • through the initial first branch network, the predicted two-dimensional key point coordinate set corresponding to the training sample, and the prediction uncertainty corresponding to each predicted key point coordinate in the predicted two-dimensional key point coordinate set, are determined based on the feature representation of the training sample;
  • the first loss value is calculated according to the predicted two-dimensional key point coordinate set, the prediction uncertainty and the true two-dimensional key point coordinate set.
  • the feature extraction network includes a residual neural network ResNet and a pooling layer.
  • the first branch network is a fully connected layer
  • the second branch network is a fully connected layer
  • the computing network is a differentiable PnP solution network.
  • An embodiment of the present application provides a head posture estimation device. Using the above device can increase the feasibility of the solution.
  • the acquisition module 201 is specifically used to acquire the image to be processed, the image to be processed including the target face image collected by the camera;
  • the sparse key points of the target face image in the image to be processed are determined.
  • the sparse key points include the facial feature points and face contour points of the target face image;
  • the target face image is horizontally aligned and scaled to the target size according to the facial feature points, to obtain the image to be recognized.
  • An embodiment of the present application provides a head posture estimation device. Using the above device, the face image is cut out and aligned through sparse key points, which can reduce the interference caused by background information in the image collected by the camera. Scaling the image to a uniform size will help extract features from the image.
  • FIG. 10 is a schematic structural diagram of a server provided by an embodiment of this application.
  • the server 300 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 322 (e.g., one or more processors), memory 332, and one or more storage media 330 (e.g., one or more mass storage devices) storing application programs 342 or data 344. The memory 332 and the storage medium 330 may be short-term storage or persistent storage.
  • the program stored in the storage medium 330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
  • the central processor 322 may be configured to communicate with the storage medium 330 and execute a series of instruction operations in the storage medium 330 on the server 300 .
  • Server 300 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input and output interfaces 358, and/or, one or more operating systems 341, such as Windows Server TM , Mac OS X TM , Unix TM , Linux TM , FreeBSD TM and so on.
  • the steps performed by the server in the above embodiment may be based on the server structure shown in FIG. 10 .
  • the head posture estimation device provided by this application can be used in terminal equipment. Please refer to Figure 11. For convenience of explanation, only the parts related to the embodiments of this application are shown; for specific technical details that are not disclosed, please refer to the method portion of the embodiments of this application.
  • the terminal device being a smartphone is taken as an example for explanation:
  • FIG. 11 shows a block diagram of a partial structure of a smartphone related to the terminal device provided by the embodiment of the present application.
  • the smart phone includes: radio frequency (radio frequency, RF) circuit 410, memory 420, input unit 430, display unit 440, sensor 450, audio circuit 460, wireless fidelity (wireless fidelity, WiFi) module 470, processor 480, and power supply 490 and other components.
  • the structure of the smart phone shown in FIG. 11 does not limit the smart phone, and may include more or fewer components than shown in the figure, or combine certain components, or arrange different components.
  • the RF circuit 410 can be used for receiving and sending signals during information transmission or a call. In particular, after receiving downlink information from the base station, it sends the information to the processor 480 for processing; in addition, it sends uplink data to the base station.
  • the RF circuit 410 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, etc.
  • the RF circuit 410 can also communicate with the network and other devices through wireless communication.
  • the above-mentioned wireless communication can use any communication standard or protocol, including but not limited to the global system of mobile communication (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short messaging service (SMS), etc.
  • the memory 420 can be used to store software programs and modules.
  • the processor 480 executes various functional applications and data processing of the smart phone by running the software programs and modules stored in the memory 420 .
  • the memory 420 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like; the storage data area may store data created according to the use of the smartphone (such as audio data, a phone book, etc.), and the like.
  • memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • the input unit 430 may be used to receive input numeric or character information, and generate key signal input related to user settings and function control of the smartphone.
  • the input unit 430 may include a touch panel 431 and other input devices 432.
  • the touch panel 431, also known as a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 431 using a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • optionally, the touch panel 431 may include two parts: a touch detection device and a touch controller.
  • the touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact point coordinates, and sends them to the processor 480, and can receive and execute commands sent by the processor 480.
  • the touch panel 431 can be implemented using various types such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 430 may also include other input devices 432.
  • other input devices 432 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), trackball, mouse, joystick, etc.
  • the display unit 440 may be used to display information input by the user or information provided to the user and various menus of the smartphone.
  • the display unit 440 may include a display panel 441.
  • the display panel 441 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), etc.
  • the touch panel 431 can cover the display panel 441. When the touch panel 431 detects a touch operation on or near it, the operation is sent to the processor 480 to determine the type of the touch event, and the processor 480 then provides corresponding visual output on the display panel 441 according to the type of the touch event.
  • although in FIG. 11 the touch panel 431 and the display panel 441 are used as two independent components to implement the input and output functions of the smartphone, in some embodiments the touch panel 431 and the display panel 441 can be integrated to implement the input and output functions of the smartphone.
  • the smartphone may also include at least one sensor 450, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor.
  • the ambient light sensor may adjust the brightness of the display panel 441 according to the brightness of the ambient light.
  • the proximity sensor may turn off the display panel 441 and/or the backlight when the smartphone is moved to the ear.
  • as one type of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), and can detect the magnitude and direction of gravity when stationary.
  • it can be used for applications that recognize smartphone posture (such as horizontal and vertical screen switching, related games, and magnetometer attitude calibration) and vibration-recognition-related functions (such as a pedometer and tapping), etc.
  • as for other sensors that can also be configured on the smartphone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, details are not repeated here.
  • the audio circuit 460, speaker 461, and microphone 462 can provide an audio interface between the user and the smartphone.
  • the audio circuit 460 can transmit the electrical signal converted from the received audio data to the speaker 461, and the speaker 461 converts it into a sound signal for output; on the other hand, the microphone 462 converts the collected sound signal into an electrical signal, which the audio circuit 460 receives and converts into audio data; the audio data is then output to the processor 480 for processing, and subsequently sent through the RF circuit 410 to, for example, another smartphone, or output to the memory 420 for further processing.
  • WiFi is a short-distance wireless transmission technology. Smartphones can help users send and receive emails, browse web pages, and access streaming media through the WiFi module 470. It provides users with wireless broadband Internet access.
  • the WiFi module 470 is shown in FIG. 11 , it can be understood that it is not a necessary component of the smart phone and can be omitted as needed without changing the essence of the invention.
  • the processor 480 is the control center of the smartphone. It connects various parts of the entire smartphone using various interfaces and lines, and performs the various functions of the smartphone and processes data by running or executing software programs and/or modules stored in the memory 420 and calling data stored in the memory 420, thereby monitoring the smartphone as a whole.
  • the processor 480 may include one or more processing units; optionally, the processor 480 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the above modem processor may alternatively not be integrated into the processor 480.
  • the smartphone also includes a power supply 490 (such as a battery) that supplies power to the various components.
  • optionally, the power supply can be logically connected to the processor 480 through a power management system, so that functions such as charging, discharging, and power consumption management are managed through the power management system.
  • the smart phone may also include a camera, a Bluetooth module, etc., which will not be described in detail here.
  • the steps performed by the terminal device in the above embodiment may be based on the terminal device structure shown in FIG. 11 .
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium stores a computer program that, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
  • An embodiment of the present application also provides a computer program product including a program, which when run on a computer causes the computer to execute the methods described in the foregoing embodiments.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling or direct coupling or communication connection between components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part contributing to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Abstract

Embodiments of this application provide a head pose estimation method, apparatus, device, and storage medium, used to ensure that head pose estimation is stable and reliable. The method includes: acquiring an image to be recognized, the image to be recognized including a target face image; performing keypoint recognition processing based on the image to be recognized through a first network model, to obtain a set of two-dimensional keypoint coordinates of the target face image in the image to be recognized and a set of three-dimensional keypoint coordinates of the target face image, where the first network model includes a first branch network for recognizing the two-dimensional keypoint coordinates and a second branch network for recognizing the three-dimensional keypoint coordinates; and determining the head pose corresponding to the target face image in the image to be recognized according to the set of two-dimensional keypoint coordinates and the set of three-dimensional keypoint coordinates. The technical solution provided by this application can be applied to the fields of artificial intelligence and computer vision.

Description

Head pose estimation method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 2022111304414, filed with the Chinese Patent Office on September 15, 2022 and entitled "Head pose estimation method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of image processing, and in particular to head pose estimation technology.
Background
In the context of computer vision, head pose estimation refers to the ability to infer the orientation of a person's head relative to the camera view. In visual motion capture scenarios, head pose estimation is a crucial step. An accurate head pose allows a virtual avatar to perfectly reproduce a person's head movements, making virtual-human animation more vivid, agile, and lifelike.
At present, mainstream head pose estimation methods generally need to use traditional motion sensors, or to obtain three-dimensional coordinate information of the head through three-dimensional (3-dimension, 3D) image acquisition devices. However, since mainstream image acquisition devices collect two-dimensional (2-dimension, 2D) image information, the 2D coordinate information needs to be converted into 3D in the world coordinate system based on face keypoint coordinate information, so as to obtain 3D coordinate information of the head pose; the head pose is then estimated and head movements are judged from changes in the coordinate information.
The above method is based on solving 3D-to-2D point-pair motion (also known as Perspective-n-Point, PnP). It first estimates the 2D keypoints of the face, and then calibrates the corresponding 3D points on a fixed 3D head model according to the 2D keypoints. Through PnP solving, the transformation pose of the 3D points corresponding to the 2D keypoints can be obtained. Although this method is roughly accurate and highly interpretable, when a person makes an exaggerated expression, jitter becomes obvious and the estimated head pose is not sufficiently stable and reliable.
Summary
Embodiments of this application provide a head pose estimation method, apparatus, device, and storage medium, which can ensure that head pose estimation is stable and reliable.
In view of this, in one aspect this application provides a head pose estimation method, executed by a computer device, including: acquiring an image to be recognized, the image to be recognized including a target face image; performing keypoint recognition processing based on the image to be recognized through a first network model, to obtain a set of two-dimensional keypoint coordinates of the target face image in the image to be recognized and a set of three-dimensional keypoint coordinates of the target face image, where the first network model includes a first branch network and a second branch network, the first branch network is used to recognize the set of two-dimensional keypoint coordinates, and the second branch network is used to recognize the set of three-dimensional keypoint coordinates; and determining the head pose corresponding to the target face image in the image to be recognized according to the set of two-dimensional keypoint coordinates and the set of three-dimensional keypoint coordinates.
In another aspect, this application provides a head pose estimation apparatus, including:
an acquisition module, configured to acquire an image to be recognized, the image to be recognized including a target face image;
a processing module, configured to perform keypoint recognition processing based on the image to be recognized through a first network model, to obtain a set of two-dimensional keypoint coordinates of the target face image in the image to be recognized and a set of three-dimensional keypoint coordinates of the target face image, where the first network model includes a first branch network and a second branch network, the first branch network is used to recognize the set of two-dimensional keypoint coordinates, and the second branch network is used to recognize the set of three-dimensional keypoint coordinates; and
an output module, configured to determine the head pose corresponding to the target face image in the image to be recognized according to the set of two-dimensional keypoint coordinates and the set of three-dimensional keypoint coordinates.
In another aspect, this application provides a computer device, including: a memory, a processor, and a bus system;
where the memory is configured to store a program;
the processor is configured to execute the program in the memory, and to perform the methods of the above aspects according to instructions in the program code; and
the bus system is configured to connect the memory and the processor, so that the memory and the processor communicate.
In another aspect, this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods of the above aspects.
In another aspect, this application provides a computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the methods provided in the above aspects.
It can be seen from the above technical solutions that the embodiments of this application have the following advantages: two branch networks respectively output the 2D keypoint coordinates and the 3D keypoint coordinates of the target face image in the image to be recognized, and the head pose of the target face image is then calculated from the 2D keypoint coordinates and the 3D keypoint coordinates. Since the 3D keypoint coordinates can be obtained in real time, the 3D head model can change as the expression changes; accordingly, the coordinate correspondence between the 2D keypoints and the 3D keypoints is more accurate, ensuring that the head pose solution is stable and reliable even when exaggerated expressions are made.
Brief Description of the Drawings
FIG. 1 is a schematic architecture diagram of an application system in an embodiment of this application;
FIG. 2 is a schematic architecture diagram of a first network model in an embodiment of this application;
FIG. 3 is another schematic architecture diagram of the first network model in an embodiment of this application;
FIG. 4 is a schematic flowchart of a head pose estimation method in an embodiment of this application;
FIG. 5 is a schematic diagram of a target face image in an image to be recognized in an embodiment of this application;
FIG. 6 is a schematic flowchart of processing an image to be processed through an image processing model to obtain an image to be recognized in an embodiment of this application;
FIG. 7a is a schematic diagram of an embodiment of head pose estimation in an embodiment of this application;
FIG. 7b is a schematic diagram of a virtual avatar generated after head pose estimation is performed on an image to be processed in an embodiment of this application;
FIG. 8 is a schematic diagram of an embodiment of a head pose estimation apparatus in an embodiment of this application;
FIG. 9 is a schematic diagram of another embodiment of the head pose estimation apparatus in an embodiment of this application;
FIG. 10 is a schematic diagram of another embodiment of the head pose estimation apparatus in an embodiment of this application;
FIG. 11 is a schematic diagram of another embodiment of the head pose estimation apparatus in an embodiment of this application.
Detailed Description
Embodiments of this application provide a head pose estimation method, apparatus, device, and storage medium, used to ensure that head pose estimation is stable and reliable.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification and claims of this application and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of this application described here can be implemented, for example, in orders other than those illustrated or described here. In addition, the terms "include" and "correspond to" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product, or device.
To solve the problem in the related art that head pose estimation implemented on the basis of the PnP algorithm is not sufficiently stable and reliable, this application provides the following technical solution:
An image to be recognized is acquired, the image to be recognized including a target face image; the image to be recognized is input into a first network model to obtain a set of two-dimensional keypoint coordinates of the target face image in the image to be recognized and a set of three-dimensional keypoint coordinates of the target face image, where the first network model includes a first branch network and a second branch network, the first branch network being used to recognize the set of two-dimensional keypoint coordinates and the second branch network being used to recognize the set of three-dimensional keypoint coordinates; and the head pose corresponding to the target face image in the image to be recognized is determined according to the set of two-dimensional keypoint coordinates and the set of three-dimensional keypoint coordinates. In the embodiments of this application, the 3D keypoint coordinates of the face image in the image to be recognized can be obtained in real time, which ensures that the 3D head model can change as the person's expression changes; accordingly, the coordinate correspondence between the 2D keypoints and the 3D keypoints can be made more accurate, and the head pose solution remains stable and reliable even when the person in the image to be recognized makes exaggerated expressions.
For ease of understanding, some technical terms used in this application are explained below:
Facial feature keypoints: used to indicate the positions of the facial features on the face; the positions of the facial features can be represented by keypoints. The facial feature keypoints involved in the embodiments of this application include points corresponding to five positions on the face: the left pupil, the right pupil, the nose tip, the left mouth corner, and the right mouth corner.
Euler angles (Eulerian angles): a set of three independent angular parameters, proposed by Euler, used to determine the position of a rigid body rotating about a fixed point. The embodiments of this application establish a rectangular coordinate system based on the face, and take the case where the face pose angles are Euler angles as an example. In this three-dimensional rectangular coordinate system, the origin is the center (or center of gravity) of the head; the X axis points from one ear of the face to the other; the Y axis points from the top of the head to the neck; and the Z axis points from the face to the back of the head. The Euler angles include the following three angles (a brief illustrative sketch follows the three definitions):
pitch: the angle of rotation about the X axis;
yaw: the angle of rotation about the Y axis;
roll: the angle of rotation about the Z axis.
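For illustration only (not part of the original description), the following minimal Python sketch shows how the three Euler angles defined above can be composed into a head-rotation matrix; the "XYZ" axis order and the sample angle values are our own assumptions, since the text does not fix a convention:

    # Illustrative sketch: Euler angles (pitch, yaw, roll) -> rotation matrix.
    # Assumption: intrinsic "XYZ" order; the text does not specify one.
    from scipy.spatial.transform import Rotation as R

    pitch, yaw, roll = 10.0, -25.0, 5.0                      # degrees, sample values
    rot = R.from_euler("XYZ", [pitch, yaw, roll], degrees=True)
    print(rot.as_matrix())                                    # 3x3 head rotation matrix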
Visual motion capture: traditional motion capture uses inertial sensors or markers attached to the body. Visual motion capture requires the person to wear no devices; using one or more cameras, it can capture the person's facial and body movements.
6DoF (six degrees of freedom): DoF is the number of directions in which an object can move in 3D space, six in total. That is, the head pose includes rotation and translation, where the rotation is represented by three Euler angles and the translation by displacements in three directions; together these form the six-degree-of-freedom pose parameters.
PnP (Perspective-n-Point): PnP is a method for solving 3D-to-2D point-pair motion; its purpose is to solve the pose of the camera coordinate system relative to the world coordinate system. It describes how, given the coordinates of several 3D points (relative to the world coordinate system) and the 2D coordinates of these points, the camera pose is estimated (i.e., the rotation matrix and translation vector from the world coordinate system to the camera coordinate system are solved).
A convolutional layer (Conv) is a layered structure composed of several convolution units in a convolutional neural network. A convolutional neural network (CNN) is a feed-forward neural network including at least two neural network layers, each containing several neurons arranged in layers; neurons in the same layer are not connected to each other, and information between layers is transmitted in only one direction.
A pooling layer, also called a subsampling layer, is a layered structure capable of extracting features a second time from its input. The pooling layer preserves the main features of the previous layer's values and reduces the parameters and computation of the next layer. The pooling layer is composed of multiple feature maps; one feature map of the convolutional layer corresponds to one feature map in the pooling layer, so the number of feature maps is unchanged, and spatially invariant features are obtained by reducing the resolution of the feature maps.
A fully connected layer (FC) is a layered structure in which each node is connected to all nodes of the previous layer; it can be used to synthesize the features extracted by the previous neural network layer, and acts as a "classifier" in a neural network model.
Back-propagation: forward propagation refers to the feed-forward processing of a model; back-propagation is the opposite, and refers to updating the weight parameters of each layer of the model according to the model's output. For example, if a model includes an input layer, a hidden layer, and an output layer, forward propagation processes in the order input layer - hidden layer - output layer, while back-propagation updates the weight parameters of each layer in the order output layer - hidden layer - input layer.
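Purely as an illustration of the definition above (our own toy example, not from this application), a minimal PyTorch sketch of one forward-then-backward pass:

    # Minimal illustration of forward propagation followed by back-propagation.
    import torch

    w = torch.tensor(2.0, requires_grad=True)     # one weight parameter
    x, y_true = torch.tensor(3.0), torch.tensor(7.0)

    y_pred = w * x                   # forward propagation: input -> output
    loss = (y_true - y_pred) ** 2    # scalar loss
    loss.backward()                  # back-propagation: compute dloss/dw
    print(w.grad)                    # tensor(-6.), used to update w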
The head pose estimation method, apparatus, device, and storage medium provided by the embodiments of this application can ensure that head pose estimation is stable and reliable. Exemplary applications of the electronic device provided by the embodiments of this application are described below; the electronic device may be implemented as various types of user terminals or as a server.
By running the head pose estimation solution provided by the embodiments of this application, the electronic device can ensure that head pose estimation is stable and reliable, i.e., improve the stability and reliability of its own head pose estimation, which is applicable to multiple head pose estimation scenarios, for example augmented reality (AR) games, virtual reality (VR) games, aiding gaze estimation, modeling attention, fitting 3D models to video, and performing face alignment.
Referring to FIG. 1, FIG. 1 is an optional schematic architecture diagram of an application scenario of the head pose estimation solution provided by an embodiment of this application. To support a head pose estimation application, terminal devices 100 (terminal device 1001 and terminal device 1002 are shown as examples) are connected to a server 300 through a network 200, the server 300 is connected to a database 400, and the network 200 may be a wide area network, a local area network, or a combination of the two.
A client for implementing the head pose estimation solution is deployed on the terminal device 100; the client may run on the terminal device 100 in the form of a browser or in the form of a standalone application (APP), and the specific presentation form of the client is not limited here.
The server 300 involved in this application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), and big data and artificial intelligence platforms. The terminal device 100 may be, but is not limited to, a smartphone, a tablet computer, a laptop, a palmtop computer, a personal computer, a smart TV, a smart watch, an in-vehicle device, or a wearable device. The terminal device 100 and the server 300 may be connected directly or indirectly through the network 200 by wired or wireless communication, which is not limited here. The numbers of servers 300 and terminal devices 100 are also not limited. The solution provided by this application may be completed independently by the terminal device 100, independently by the server 300, or by the terminal device 100 and the server 300 in cooperation, which is not specifically limited in this application.
In short, the database 400 can be regarded as an electronic filing cabinet - a place for storing electronic files - in which users can add, query, update, and delete data. A database is a collection of data stored in a certain way, shareable by multiple users, with as little redundancy as possible, and independent of applications. A database management system (DBMS) is computer software designed to manage databases, generally with basic functions such as storage, retrieval, security, and backup. Database management systems can be classified by the database model they support, such as relational or Extensible Markup Language (XML); by the type of computer they support, such as server clusters or mobile phones; by the query language used, such as Structured Query Language (SQL) or XQuery; by their performance focus, such as maximum scale or maximum running speed; or in other ways. Regardless of the classification used, some DBMSs can span categories, for example supporting multiple query languages simultaneously. In this application, the database 400 can be used to store the training sample set and the image to be recognized; of course, the storage location of the training sample set is not limited to the database, and it may also be stored, for example, in the terminal device 100, a blockchain, or a distributed file system of the server 300.
In some embodiments, the server 300 may perform the head pose estimation method provided by the embodiments of this application, as well as the training method of the first network model used in head pose estimation. In this embodiment, the first network model includes a first branch network and a second branch network, where the first branch network is used to recognize the two-dimensional keypoint coordinates and the uncertainty, and the second branch network is used to recognize the three-dimensional keypoint coordinates. When the training method of the first network model is performed, the specific process may be as follows: a first training sample set labeled with real 2D keypoint coordinates and real 3D keypoint coordinates is acquired from the terminal device 100 and/or the database 400; the first training sample set is detected through a first initial network model to be trained, to obtain predicted 2D keypoint coordinates and predicted 3D keypoint coordinates of each face in the first training sample set; optionally, the uncertainty corresponding to the predicted 2D keypoint coordinates can also be obtained; according to a loss function including pre-designed loss factors (such as margin value and distance), a first loss value corresponding to the predicted 2D keypoint coordinates and their uncertainty is determined, and a second loss value corresponding to the predicted 3D keypoint coordinates is determined; the parameters of the first branch network are then adjusted by back-propagation according to the first loss value, and the parameters of the second branch network are adjusted by back-propagation according to the second loss value, thereby training the first initial network model to obtain the first network model. In this embodiment, the first branch network and the second branch network are trained independently of each other and do not affect each other's parameter adjustment, which improves the generalization ability of the two branch networks. In this embodiment, when the first network model is trained, the server may calculate the first loss value using the Gaussian negative log-likelihood loss (GaussianNLLLoss, GNLL). In an exemplary solution, the specific calculation process may use Formula 1 (reconstructed here in the standard GNLL form consistent with the variable definitions below):
$L_{GNLL}=\frac{1}{N}\sum_{i=1}^{N}\left(\frac{\left(y_i-f(x_i)\right)^2}{2\delta_i^2}+\frac{1}{2}\log\delta_i^2\right)$    (Formula 1)
where N indicates the number of 2D keypoints, y indicates the real 2D keypoint coordinates, f(x) represents the predicted 2D keypoint coordinates output by the first branch network, and δ represents the uncertainty of the predicted 2D keypoint coordinates.
The server may calculate the second loss value using a regression loss function. In an exemplary solution, the server uses L2 loss, and the specific calculation process may use Formula 2 (likewise reconstructed in the standard form):
$L_{2}=\frac{1}{N}\sum_{i=1}^{N}\left(y_i-f(x_i)\right)^2$    (Formula 2)
where N indicates the number of 3D keypoints, y indicates the real 3D keypoint coordinates, and f(x) represents the predicted 3D keypoint coordinates output by the second branch network.
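As a non-authoritative PyTorch sketch of Formula 1 and Formula 2 (the tensor shapes and names are our assumptions; PyTorch's built-in torch.nn.GaussianNLLLoss could equally be used for the first loss):

    # Minimal sketch of the two training losses described above.
    import torch

    def first_loss_gnll(pred_2d, delta, gt_2d, eps=1e-6):
        # Formula 1: Gaussian negative log-likelihood over N 2D keypoints,
        # where delta is the predicted per-keypoint uncertainty.
        var = delta.clamp(min=eps) ** 2
        return (0.5 * torch.log(var) + (gt_2d - pred_2d) ** 2 / (2.0 * var)).mean()

    def second_loss_l2(pred_3d, gt_3d):
        # Formula 2: L2 regression loss over N 3D keypoints.
        return ((gt_3d - pred_3d) ** 2).mean()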
When the server 300 trains the first network model, the initial model architecture of the first network model may include a feature extraction network, a fully connected layer, a pooling layer, and the first branch network and the second branch network. The feature extraction network may be a CNN such as a residual neural network (ResNet), LeNet, or AlexNet, a high-resolution network with a feature pyramid (HRNetV2P), or a hierarchical vision self-attention model based on shifted windows (Swin Transformer), and the first branch network and the second branch network may be fully connected layers. In an exemplary solution, the first network model is described below taking ResNet50 as the feature extraction network. As shown in FIG. 2, the first network model includes the ResNet50, where the ResNet50 includes 49 convolutional layers and one fully connected layer, the fully connected layer is followed by the pooling layer, and the output of the pooling layer is connected to two fully connected layers, one of which is the first branch network and the other the second branch network. For a standard 224×224×3 input, the convolution computation of the above structure outputs 7×7×2048; the pooling layer converts this into a feature vector, and finally a classifier computes class probabilities from this feature vector. The ResNet50 network structure can be divided into five parts. The first part contains no residual blocks and mainly performs convolution, normalization, activation, and max-pooling computations on the input. The second, third, fourth, and fifth parts all contain residual blocks; these do not change the spatial size and are only used to change the dimensionality. In the ResNet50 network structure, each residual block has three convolutional layers, so the network has 1+3×(3+4+6+3)=49 convolutional layers in total, which together with the final fully connected layer makes 50 layers; this is the origin of the name ResNet50. In this application, the input of the ResNet50 network is 256×256×3; after the convolution computation of the first five parts, the output is an N×2048×8×8 feature map, where N is the number of samples selected for one training pass (also called the batch size). The N×2048×8×8 feature map is then passed through the pooling layer to obtain N×2048 features, which pass through two FC layers that respectively output the 2D keypoint coordinates with uncertainty, and the 3D keypoint coordinates. The weight dimension of each FC layer is 2048×3660 (1220 points), where 3660 is regarded as 1220×3. For the 2D branch, 3 represents the x and y coordinates and the uncertainty δ; for the 3D branch, 3 represents the x, y, and z coordinate values.
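A minimal sketch of the two-branch structure just described, under the assumption of a standard torchvision ResNet-50 backbone and 1220 keypoints; this is our reading of the figure, not the authors' code:

    # Sketch: ResNet50 backbone + pooling, followed by two FC branch heads.
    import torch.nn as nn
    import torchvision

    class TwoBranchKeypointNet(nn.Module):
        def __init__(self, num_points=1220):
            super().__init__()
            self.num_points = num_points
            backbone = torchvision.models.resnet50()
            # Keep the conv stages + global average pooling, drop the classifier FC.
            self.features = nn.Sequential(*list(backbone.children())[:-1])
            self.branch_2d = nn.Linear(2048, num_points * 3)  # x, y, uncertainty
            self.branch_3d = nn.Linear(2048, num_points * 3)  # x, y, z

        def forward(self, img):                        # img: (N, 3, 256, 256)
            feat = self.features(img).flatten(1)       # (N, 2048)
            out_2d = self.branch_2d(feat).view(-1, self.num_points, 3)
            out_3d = self.branch_3d(feat).view(-1, self.num_points, 3)
            return out_2d, out_3d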
In other embodiments, the server 300 may perform the head pose estimation method provided by the embodiments of this application, as well as the training method of the first network model used in head pose estimation. In this embodiment, the first network model includes a first branch network, a second branch network, and a computing network, where the first branch network is used to recognize the 2D keypoint coordinates, the second branch network is used to recognize the 3D keypoint coordinates, and the computing network is used to estimate the head pose from the 2D keypoint coordinates and the 3D keypoint coordinates. When the training method of the first network model is performed, the specific process may be as follows: a second training sample set labeled with real 2D keypoint coordinates, real 3D keypoint coordinates, and real head pose is acquired from the terminal device 100 and/or the database 400; the second training sample set is detected through a second initial network model to be trained, to obtain predicted 2D keypoint coordinates and predicted 3D keypoint coordinates of each face in the second training sample set; optionally, the uncertainty corresponding to the predicted 2D keypoint coordinates can also be obtained; according to a loss function including pre-designed loss factors (such as margin value and distance), a first loss value corresponding to the predicted 2D keypoint coordinates and their uncertainty is determined, and a second loss value corresponding to the predicted 3D keypoint coordinates is determined; a predicted head pose is then calculated from the predicted 2D keypoint coordinates and the predicted 3D keypoint coordinates, and a third loss value is calculated from the predicted head pose and the real head pose; finally, according to the first loss value, the second loss value, and the third loss value, the parameters of the first branch network and the second branch network are adjusted by back-propagation, thereby training the second initial network model to obtain the first network model. In this embodiment, the first branch network and the second branch network may be trained jointly, so that the two branches can influence each other and the network's learning ability is enhanced. In this embodiment, when the first network model is trained, the server may calculate the first loss value using the Gaussian negative log-likelihood loss (GaussianNLLLoss, GNLL); in an exemplary solution, the specific calculation process may use Formula 1 above,
where N indicates the number of 2D keypoints, y indicates the real 2D keypoint coordinates, f(x) represents the predicted 2D keypoint coordinates output by the first branch network, and δ represents the uncertainty of the predicted 2D keypoint coordinates.
The server may calculate the second loss value using a regression loss function; in an exemplary solution, the server uses L2 loss, and the specific calculation process may use Formula 2 above,
where N indicates the number of 3D keypoints, y indicates the real 3D keypoint coordinates, and f(x) represents the predicted 3D keypoint coordinates output by the second branch network.
When the server 300 trains the first network model, the initial model architecture of the first network model may include a feature extraction network, a fully connected layer, a pooling layer, the first branch network, the second branch network, and the computing network. The feature extraction network may be a CNN such as a residual neural network (ResNet), LeNet, or AlexNet, a high-resolution network with a feature pyramid (HRNetV2P), or a hierarchical vision self-attention model based on shifted windows (Swin Transformer), and the first branch network and the second branch network may be fully connected layers. In an exemplary solution, taking ResNet50 as the feature extraction network, the first network model is as shown in FIG. 3; the structure of the ResNet50 backbone, the pooling layer, and the two fully connected branch layers is the same as described above for FIG. 2: the input is 256×256×3, the first five parts of ResNet50 output an N×2048×8×8 feature map (N being the batch size), the pooling layer turns this into N×2048 features, and the two FC layers (each with weight dimension 2048×3660, i.e., 1220 points × 3) respectively output the 2D keypoint coordinates with uncertainty δ and the 3D keypoint coordinates. The predicted head pose is then calculated from the predicted 2D keypoint coordinates and uncertainties output by the first branch network and the predicted 3D keypoint coordinates output by the second branch network.
In this embodiment, the data of the first training sample set and the second training sample set may be obtained with the following technical solution: face images are collected through the depth camera built into the terminal device, facial 3D point cloud data and the corresponding head pose are captured in real time using the device system's augmented reality technology (ARKit), data collection software is developed based on ARKit, and facial data is collected at 60 frames per second (FPS). Collecting facial data based on existing technology in this way can reduce the difficulty of collecting the training sample set.
It can be understood that other external cameras may also be used when acquiring the first training sample set and the second training sample set. The external camera may be a depth camera or another camera, i.e., any camera capable of facial data collection, and the specific manner is not limited here. Using external devices for facial data collection can lower the hardware requirements and thus reduce costs.
After the first network model is trained, the server 300 may save the first network model locally to provide a remote head pose estimation function for the terminal device 100. For example, the server 300 may receive an image to be recognized sent by the terminal device 100, detect it through the first network model to obtain the head pose corresponding to the target face image in the image to be recognized and the corresponding confidence probability, and finally send the head pose to the terminal device 100, so that the terminal device 100 displays the head pose in a graphical interface 110 (graphical interface 1101 and graphical interface 1102 are shown as examples).
The server 300 may also send (deploy) the trained first network model to the terminal device 100, so that head pose estimation is implemented locally on the terminal device 100. For example, the terminal device 100 may acquire the image to be recognized in real time or from another device, detect it through the first network model to obtain the head pose corresponding to the target face image in the image to be recognized and the corresponding confidence probability; finally, the terminal device 100 displays the head pose in the graphical interface 110 (graphical interface 1101 and graphical interface 1102 are shown as examples).
Based on the above system, and referring to FIG. 4, an execution flow of the head pose estimation method in this application may be as follows:
Step 1: Generate an image to be recognized for the target face. In this embodiment, the image to be recognized includes the target face image, where the target face image refers to the region of the image to be recognized that contains only the face image and no other background image. As shown in FIG. 5, (a) in FIG. 5 is an image including a background image, while (b) in FIG. 5 is an image to be recognized including the target face image. In this embodiment, various cameras may first collect an image to be processed that includes other background images, and then preprocess the image to be processed to obtain the image to be recognized. The specific flow may be as shown in FIG. 6: an image to be processed a is collected by a camera; sparse keypoints in the image to be processed a are then obtained through face detection, where the sparse keypoints may be facial feature points and face contour points; the target face image is then cut out from the image to be processed according to the face contour points; the target face image is then horizontally aligned using the eye keypoints among the facial feature points and scaled to the target size, thereby obtaining the image to be recognized.
Step 2: Detect the image to be recognized through the first network model to obtain the 2D keypoint coordinates, the uncertainty of the 2D keypoint coordinates, and the 3D keypoint coordinates.
Step 3: Filter the 2D keypoint coordinates and the 3D keypoint coordinates according to the uncertainty, to obtain target 2D keypoint coordinates and target 3D keypoint coordinates whose uncertainty is less than a preset threshold.
Step 4: Estimate the head pose corresponding to the target face image from the target 2D keypoint coordinates and the target 3D keypoint coordinates using the PnP algorithm.
It can be understood that the specific implementations of this application involve data related to images to be detected and training sample sets. When the above embodiments of this application are applied to specific products or technologies, user permission or consent must be obtained, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
With reference to the above introduction, the head pose estimation method in this application is described below taking a terminal device as the execution subject. Referring to FIG. 7a, an embodiment of the head pose estimation method in the embodiments of this application includes:
701. Acquire an image to be recognized, the image to be recognized including a target face image.
The terminal device may collect an image to be processed through its own camera and then input the image to be processed into an image processing model for processing, to obtain the image to be recognized. Alternatively, the terminal device may acquire an image to be recognized saved in memory. Alternatively, the terminal device may acquire the image to be recognized through an instant messaging application, where an instant messaging application refers to software that enables online chatting and communication through instant messaging technology. Alternatively, the terminal device may acquire the image to be recognized from the Internet, for example by acquiring a video image from a video network on the Internet and extracting a face image from it, or by directly downloading a face image from the Internet, and so on.
In an exemplary solution, the specific flow may be as follows: an image to be processed is acquired, the image to be processed including a target face image collected by a camera; the sparse keypoints of the target face image in the image to be processed are then determined through an image preprocessing network, the sparse keypoints including the facial feature points and face contour points of the target face image; the target face image is acquired from the image to be processed according to the face contour points; and the target face image is horizontally aligned and scaled to the target size according to the facial feature points, to obtain the image to be recognized.
The target face image means that the image to be recognized includes only the face image and no other background image. As shown in FIG. 5, (a) in FIG. 5 is an image including a background image, while (b) in FIG. 5 is an image to be recognized including the target face image. In this embodiment, various cameras may first collect an image to be processed that includes a background image and then preprocess it to obtain the image to be recognized.
The specific flow may be as shown in FIG. 6: an image to be processed a is collected by a camera; face detection is then performed on the image to be processed a through the image preprocessing network to obtain the sparse keypoints in the image to be processed a, where the sparse keypoints may be facial feature points and face contour points; the target face image is then cut out from the image to be processed a according to the face contour points; the target face image is then horizontally aligned using the eye keypoints among the facial feature points, and the horizontally aligned target face image is scaled to a preset target size, thereby obtaining the image to be recognized.
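The following OpenCV sketch illustrates the crop-align-scale flow of FIG. 6 under our own simplifying assumptions (an axis-aligned box derived from the contour points, and eye coordinates given relative to the cropped face); it is not the patent's implementation:

    # Sketch: crop by contour box, rotate so the eyes are level, scale to target.
    import cv2
    import numpy as np

    def make_image_to_recognize(img, box, eye_l, eye_r, target=256):
        x0, y0, x1, y1 = box                      # box from the face contour points
        face = img[y0:y1, x0:x1]
        # Angle of the eye line; eye_l / eye_r are (x, y) in the cropped face.
        angle = np.degrees(np.arctan2(eye_r[1] - eye_l[1], eye_r[0] - eye_l[0]))
        h, w = face.shape[:2]
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        face = cv2.warpAffine(face, M, (w, h))    # horizontal alignment
        return cv2.resize(face, (target, target))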
702. Perform keypoint recognition processing based on the image to be recognized through the first network model, to obtain the set of two-dimensional keypoint coordinates of the target face image in the image to be recognized and the set of three-dimensional keypoint coordinates of the target face image, where the first network model includes a first branch network and a second branch network, the first branch network being used to recognize the set of two-dimensional keypoint coordinates and the second branch network being used to recognize the set of three-dimensional keypoint coordinates.
In addition, by performing keypoint processing based on the image to be recognized through the first network model, the uncertainty corresponding to each two-dimensional keypoint in the above set of two-dimensional keypoint coordinates can also be obtained, the uncertainty being recognized by the first branch network in the first network model.
The terminal device inputs the image to be recognized into the first network model; the feature extraction network of the first network model performs the corresponding feature extraction on the image to be recognized to obtain its final feature representation; the final feature representation is then input into the first branch network and the second branch network of the first network model, where the first branch network outputs the set of two-dimensional keypoint coordinates (i.e., 2D keypoint coordinates) and the uncertainties of the target face image in the image to be recognized, and the second branch network outputs the set of three-dimensional keypoint coordinates (i.e., 3D keypoint coordinates) of the target face image in the image to be recognized.
It can be understood that for the training process of the first network model in this application, reference may be made to FIG. 2 and FIG. 3; details are not repeated here.
703. Determine the head pose corresponding to the target face image in the image to be recognized according to the set of two-dimensional keypoint coordinates and the set of three-dimensional keypoint coordinates.
It should be understood that when the first network model also outputs the uncertainty corresponding to each two-dimensional keypoint coordinate in the set of two-dimensional keypoint coordinates, the terminal device may determine the head pose corresponding to the target face image in the image to be recognized according to the set of two-dimensional keypoint coordinates, the uncertainties corresponding to the two-dimensional keypoint coordinates in the set, and the set of three-dimensional keypoint coordinates.
In this embodiment, the terminal device may filter the two-dimensional keypoint coordinates in the set of two-dimensional keypoint coordinates and the three-dimensional keypoint coordinates in the set of three-dimensional keypoint coordinates according to the uncertainty, to obtain intermediate three-dimensional keypoint coordinates and intermediate two-dimensional keypoint coordinates, and then use the PnP algorithm to solve for the head pose corresponding to the target face image from the intermediate three-dimensional keypoint coordinates and the intermediate two-dimensional keypoint coordinates. In an exemplary solution, the terminal device sorts the uncertainties corresponding to the two-dimensional keypoint coordinates and removes the 20% of two-dimensional keypoint coordinates with the largest uncertainty, keeping the intermediate two-dimensional keypoint coordinates and determining their corresponding intermediate three-dimensional keypoint coordinates; PnP solving is then used to solve the head pose from the kept intermediate two-dimensional and three-dimensional keypoint coordinates. In an exemplary solution, the terminal device may use OpenCV's built-in solvePnP algorithm, whose principle is to solve the pose iteratively so that the intermediate 3D keypoint coordinates, after being projected by the pose, are as close as possible to the intermediate 2D keypoint coordinates.
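A minimal sketch of this filtering-plus-PnP step, assuming NumPy arrays for the keypoints and a known camera intrinsic matrix K (the 20% cutoff follows the text; the function and variable names are ours):

    # Sketch: drop the 20% most-uncertain 2D points, then solve PnP with OpenCV.
    import cv2
    import numpy as np

    def solve_head_pose(pts_2d, pts_3d, delta, K, keep_ratio=0.8):
        keep = np.argsort(delta)[: int(len(delta) * keep_ratio)]  # lowest uncertainty
        ok, rvec, tvec = cv2.solvePnP(
            pts_3d[keep].astype(np.float64),   # intermediate 3D keypoints
            pts_2d[keep].astype(np.float64),   # intermediate 2D keypoints
            K, None, flags=cv2.SOLVEPNP_ITERATIVE)
        return rvec, tvec                       # rotation + translation (6DoF pose)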
In this way, two-dimensional keypoint coordinates with higher uncertainty are removed, and head pose estimation is performed only on the basis of the two-dimensional keypoint coordinates with lower uncertainty and their corresponding three-dimensional keypoint coordinates. This effectively prevents keypoints with higher uncertainty from affecting the final pose estimation result, i.e., avoids the negative effects caused by such keypoints, making the head pose solution more reliable and robust.
It can be understood that the technical solution provided by the embodiments of this application can be applied to virtual avatar construction, aiding gaze estimation, modeling attention, fitting 3D models to video, and performing face alignment. In an exemplary application scenario, game character construction is taken as an example. A game device collects a user's facial data through a camera and performs head pose estimation, thereby generating the avatar of the virtual object corresponding to the user; the virtual object then interacts with other virtual objects in the game, thereby realizing game interaction.
The process by which the game device collects facial data through the camera and generates the virtual object from the facial data may be as shown in FIG. 7b: head movements of the game character are generated by capturing facial movements. Specifically, the collected image to be processed may be as shown in (a) of FIG. 7b, i.e., the head movement shown by the target face image is a tilted head; after the head pose estimation provided by the embodiments of this application, the result shown in (b) of FIG. 7b is obtained, i.e., the corresponding game character's head movement synchronously shows the tilted head. In FIG. 7b, generating the corresponding virtual object's head movement by capturing the user's facial movements improves the game interaction experience, and real-time synchronization of the virtual object's head movement improves data processing efficiency.
In practical applications, the head pose estimation method can also be applied to live streaming or video recording; that is, when a user does not want to appear in a live video with their own appearance, the user's facial data can be collected through a camera, a corresponding virtual avatar can be generated from the facial data, and the virtual avatar can then be used for live streaming or video recording. In this way, the movements of the virtual avatar can be synchronized with the user's movements, effectively enabling interaction between the user and viewers while protecting the user's privacy.
The beneficial effects of the technical solution provided by this application are described below with a specific example:
A dataset of about 500,000 images of 40 people was acquired to evaluate the technical indicators of the method provided by the embodiments of this application against other methods. Three methods participated in the evaluation: 1. directly estimating the 6DoF parameters; 2. using the PnP solution but without estimating uncertainty; 3. the technical solution provided by this application. The results are shown in Table 1:
Table 1
As shown in Table 1, the comparison covers all six dimensions of 6DoF, namely pitch, yaw, roll, tx, ty, and tz. From the results, the indicators of the technical solution provided by this application are significantly better than those of the other methods in every dimension.
The head pose estimation apparatus in this application is described in detail below. Referring to FIG. 8, FIG. 8 is a schematic diagram of an embodiment of the head pose estimation apparatus in an embodiment of this application. The head pose estimation apparatus 20 includes:
an acquisition module 201, configured to acquire an image to be recognized, the image to be recognized including a target face image;
a processing module 202, configured to perform keypoint recognition processing based on the image to be recognized through a first network model, to obtain the set of two-dimensional keypoint coordinates of the target face image in the image to be recognized and the set of three-dimensional keypoint coordinates of the target face image, where the first network model includes a first branch network and a second branch network, the first branch network being used to recognize the set of two-dimensional keypoint coordinates and the second branch network being used to recognize the set of three-dimensional keypoint coordinates; and
an output module 203, configured to determine the head pose corresponding to the target face image in the image to be recognized according to the set of two-dimensional keypoint coordinates and the set of three-dimensional keypoint coordinates.
An embodiment of this application provides a head pose estimation apparatus. With the above apparatus, two branch networks respectively output the 2D keypoint coordinates and the 3D keypoint coordinates of the target face image in the image to be recognized, and the head pose of the target face image is then calculated from the 2D keypoint coordinates and the 3D keypoint coordinates. Since the 3D keypoint coordinates can be obtained in real time, the 3D head model can change as the expression changes; accordingly, the coordinate correspondence between the 2D keypoints and the 3D keypoints is more accurate, ensuring that head pose estimation is stable and reliable even when exaggerated expressions are made.
Optionally, on the basis of the embodiment corresponding to FIG. 8 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application,
the processing module 202 is specifically configured to perform keypoint recognition processing based on the image to be recognized through the first network model, to obtain the set of two-dimensional keypoint coordinates, the uncertainty corresponding to each two-dimensional keypoint coordinate in the set, and the set of three-dimensional keypoint coordinates, where the first branch network in the first network model is further used to recognize the uncertainty corresponding to each two-dimensional keypoint coordinate in the set of two-dimensional keypoint coordinates; and
the output module 203 is specifically configured to determine the head pose corresponding to the target face image in the image to be recognized according to the set of two-dimensional keypoint coordinates, the uncertainties, and the set of three-dimensional keypoint coordinates.
Optionally, on the basis of the embodiment corresponding to FIG. 8 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application,
the output module 203 is specifically configured to remove the two-dimensional keypoint coordinates whose corresponding uncertainty is greater than a preset threshold from the set of two-dimensional keypoint coordinates, to obtain an intermediate set of two-dimensional keypoint coordinates;
acquire an intermediate set of three-dimensional keypoint coordinates from the set of three-dimensional keypoint coordinates according to the intermediate set of two-dimensional keypoint coordinates; and
determine the head pose corresponding to the target face image in the image to be recognized according to the intermediate set of two-dimensional keypoint coordinates and the intermediate set of three-dimensional keypoint coordinates.
An embodiment of this application provides a head pose estimation apparatus. With the above apparatus, the 2D keypoint coordinates and the 3D keypoint coordinates are filtered according to the uncertainty corresponding to the 2D keypoint coordinates, so that points with larger uncertainty are removed when head pose estimation is performed, making head pose estimation more robust.
Optionally, on the basis of the embodiment corresponding to FIG. 8 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application, the output module 203 is specifically configured to determine the head pose corresponding to the target face image in the image to be recognized according to the intermediate set of two-dimensional keypoint coordinates and the intermediate set of three-dimensional keypoint coordinates using PnP solving.
An embodiment of this application provides a head pose estimation apparatus. With the above apparatus, pose estimation is performed by means of PnP solving, which makes the head pose estimation more practicable.
Optionally, on the basis of the embodiment corresponding to FIG. 8 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application, as shown in FIG. 9:
the acquisition module 201 is further configured to acquire a first training sample set, the first training sample set including training samples labeled with a real set of two-dimensional keypoint coordinates and a real set of three-dimensional keypoint coordinates of a face image;
the head pose estimation apparatus further includes a training module 204, configured to perform feature extraction processing on the face images corresponding to the training samples in the first training sample set through the feature extraction network layer in a first initial network model to be trained, to obtain the feature representations of the training samples;
determine, through the initial first branch network in the first initial network model and according to the feature representation of the training sample, the set of predicted two-dimensional keypoint coordinates corresponding to the training sample; and determine, through the initial second branch network in the first initial network model and according to the feature representation of the training sample, the set of predicted three-dimensional keypoint coordinates corresponding to the training sample;
calculate a first loss value from the set of predicted two-dimensional keypoint coordinates and the real set of two-dimensional keypoint coordinates in the training sample, and calculate a second loss value from the set of predicted three-dimensional keypoint coordinates and the real set of three-dimensional keypoint coordinates in the training sample;
adjust the initial first branch network according to the first loss value to obtain the first branch network, and adjust the initial second branch network according to the second loss value to obtain the second branch network; and
obtain the first network model from the first branch network and the second branch network.
An embodiment of this application provides a head pose estimation apparatus. With the above apparatus, the first branch network and the second branch network are obtained during training; the two branch networks respectively output the 2D keypoint coordinates and the 3D keypoint coordinates of the target face image in the image to be recognized, and the head pose of the target face image is then calculated from the 2D keypoint coordinates and the 3D keypoint coordinates. Since the 3D keypoint coordinates can be obtained in real time, the 3D head model can change as the expression changes; accordingly, the coordinate correspondence between the 2D keypoints and the 3D keypoints is more accurate, ensuring stable and reliable head pose estimation even when exaggerated expressions are made. At the same time, the first branch network and the second branch network are trained independently, which can improve the generalization of the model.
Optionally, on the basis of the embodiment corresponding to FIG. 9 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application, the training module 204 is specifically configured to:
determine, through the initial first branch network and according to the feature representation of the training sample, the set of predicted two-dimensional keypoint coordinates corresponding to the training sample and the prediction uncertainty corresponding to each predicted keypoint coordinate in the set of predicted two-dimensional keypoint coordinates; and
calculate the first loss value from the set of predicted two-dimensional keypoint coordinates, the prediction uncertainties, and the real set of two-dimensional keypoint coordinates.
Optionally, on the basis of the embodiment corresponding to FIG. 9 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application, the acquisition module 201 is specifically configured to collect a training image set through a depth camera, each training image in the training image set including three-dimensional point cloud data of a face image and the real head pose;
perform pose projection on the three-dimensional point cloud data to obtain two-dimensional keypoint data of the face image in the training image; and
determine the first training sample set from the training image set through an image processing network.
An embodiment of this application provides a head pose estimation apparatus. With the above apparatus, using a depth camera to collect training images makes it easier to obtain the 3D point cloud data and the real head pose, thereby simplifying the acquisition process of the training sample set.
Optionally, on the basis of the embodiment corresponding to FIG. 9 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application, the acquisition module 201 is specifically configured to determine, through the image processing network and according to the training images in the training image set, the sparse keypoints of the face image in the training image, the sparse keypoints including the facial feature points and face contour points of the face image in the training image;
acquire the face image from the training image according to the face contour points; and
horizontally align the face image and scale it to the target size according to the facial feature points, to obtain the training samples in the first training sample set.
An embodiment of this application provides a head pose estimation apparatus. With the above apparatus, the face image is cut out and aligned through the sparse keypoints, which can reduce the interference caused by background information in the image collected by the camera. Scaling the image to a uniform size facilitates feature extraction from the image and reduces training difficulty.
Optionally, on the basis of the embodiment corresponding to FIG. 9 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application, the sparse keypoints are at least five facial feature points and four face contour points.
An embodiment of this application provides a head pose estimation apparatus. With the above apparatus, the image preprocessing process can be reduced while accurate cutout and alignment are ensured.
Optionally, on the basis of the embodiment corresponding to FIG. 9 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application, the feature extraction network includes a residual neural network ResNet and a pooling layer, the first branch network is a fully connected layer, and the second branch network is a fully connected layer.
An embodiment of this application provides a head pose estimation apparatus. With the above apparatus, the practicability of the solution can be increased.
Optionally, on the basis of the embodiment corresponding to FIG. 9 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application, the training module 204 is specifically configured to calculate the first loss value from the set of predicted two-dimensional keypoint coordinates, the prediction uncertainties, and the real set of two-dimensional keypoint coordinates using the Gaussian negative log-likelihood loss, and to calculate the second loss value from the set of predicted three-dimensional keypoint coordinates and the real set of three-dimensional keypoint coordinates using a regression loss function.
An embodiment of this application provides a head pose estimation apparatus. With the above apparatus, the practicability of the solution can be increased.
Optionally, on the basis of the embodiment corresponding to FIG. 8 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application, as shown in FIG. 9:
the acquisition module 201 is specifically configured to acquire a second training sample set, the second training sample set including training samples labeled with a real set of two-dimensional keypoint coordinates, a real set of three-dimensional keypoint coordinates, and a real head pose of a face image;
the head pose estimation apparatus further includes a training module 204, configured to perform feature extraction processing on the face images corresponding to the training samples in the second training sample set through the feature extraction network layer in a second initial network model to be trained, to obtain the feature representations of the training samples;
determine, through the initial first branch network in the second initial network model and according to the feature representation of the training sample, the set of predicted two-dimensional keypoint coordinates corresponding to the training sample; and determine, through the initial second branch network in the second initial network model and according to the feature representation of the training sample, the set of predicted three-dimensional keypoint coordinates corresponding to the training sample;
calculate a predicted head pose from the set of predicted two-dimensional keypoint coordinates and the set of predicted three-dimensional keypoint coordinates through the computing network in the second initial network model;
calculate a first loss value from the set of predicted two-dimensional keypoint coordinates and the real set of two-dimensional keypoint coordinates in the training sample, calculate a second loss value from the set of predicted three-dimensional keypoint coordinates and the real set of three-dimensional keypoint coordinates in the training sample, and calculate a third loss value from the predicted head pose and the real head pose in the training sample; and
adjust the second initial network model according to the first loss value, the second loss value, and the third loss value, to obtain the first network model.
An embodiment of this application provides a head pose estimation apparatus. With the above apparatus, the first branch network and the second branch network are obtained during training; the two branch networks respectively output the 2D keypoint coordinates and the 3D keypoint coordinates of the target face image in the image to be recognized, and the head pose of the target face image is then calculated from them. Since the 3D keypoint coordinates can be obtained in real time, the 3D head model can change as the expression changes; accordingly, the coordinate correspondence between the 2D keypoints and the 3D keypoints is more accurate, making the head pose solution stable and reliable even when exaggerated expressions are made. At the same time, the first branch network and the second branch network are trained jointly, which increases the learnability of the model.
Optionally, on the basis of the embodiment corresponding to FIG. 9 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application, the training module 204 is specifically configured to:
determine, through the initial first branch network and according to the feature representation of the training sample, the set of predicted two-dimensional keypoint coordinates corresponding to the training sample and the prediction uncertainty corresponding to each predicted keypoint coordinate in the set of predicted two-dimensional keypoint coordinates; and
calculate the first loss value from the set of predicted two-dimensional keypoint coordinates, the prediction uncertainties, and the real set of two-dimensional keypoint coordinates.
Optionally, on the basis of the embodiment corresponding to FIG. 9 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application, the feature extraction network includes a residual neural network ResNet and a pooling layer, the first branch network is a fully connected layer, the second branch network is a fully connected layer, and the computing network is a differentiable PnP solving network.
An embodiment of this application provides a head pose estimation apparatus. With the above apparatus, the practicability of the solution can be increased.
Optionally, on the basis of the embodiment corresponding to FIG. 8 above, in another embodiment of the head pose estimation apparatus 20 provided by an embodiment of this application, the acquisition module 201 is specifically configured to acquire an image to be processed, the image to be processed including a target face image collected by a camera;
determine, through an image preprocessing network, the sparse keypoints of the target face image in the image to be processed, the sparse keypoints including the facial feature points and face contour points of the target face image;
acquire the target face image from the image to be processed according to the face contour points; and
horizontally align the target face image and scale it to the target size according to the facial feature points, to obtain the image to be recognized.
An embodiment of this application provides a head pose estimation apparatus. With the above apparatus, the face image is cut out and aligned through the sparse keypoints, which can reduce the interference caused by background information in the image collected by the camera. Scaling the image to a uniform size facilitates feature extraction from the image.
The head pose estimation apparatus provided by this application can be used in a server. Referring to FIG. 10, FIG. 10 is a schematic structural diagram of a server provided by an embodiment of this application. The server 300 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 322 (for example, one or more processors), memory 332, and one or more storage media 330 (for example, one or more mass storage devices) storing application programs 342 or data 344. The memory 332 and the storage medium 330 may be transient or persistent storage. The program stored in the storage medium 330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 322 may be configured to communicate with the storage medium 330 and execute, on the server 300, the series of instruction operations in the storage medium 330.
The server 300 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiments may be based on the server structure shown in FIG. 10.
The head pose estimation apparatus provided by this application can be used in a terminal device. Referring to FIG. 11: for convenience of explanation, only the parts related to the embodiments of this application are shown; for specific technical details not disclosed, please refer to the method portion of the embodiments of this application. In the embodiments of this application, the terminal device being a smartphone is taken as an example for explanation:
FIG. 11 is a block diagram of part of the structure of a smartphone related to the terminal device provided by an embodiment of this application. Referring to FIG. 11, the smartphone includes components such as a radio frequency (RF) circuit 410, a memory 420, an input unit 430, a display unit 440, a sensor 450, an audio circuit 460, a wireless fidelity (WiFi) module 470, a processor 480, and a power supply 490. Those skilled in the art can understand that the smartphone structure shown in FIG. 11 does not constitute a limitation on the smartphone, which may include more or fewer components than shown, combine certain components, or arrange components differently.
The components of the smartphone are introduced in detail below with reference to FIG. 11:
The RF circuit 410 can be used for receiving and sending signals during information transmission or a call; in particular, after receiving downlink information from a base station, it passes the information to the processor 480 for processing, and it also sends uplink data to the base station. Generally, the RF circuit 410 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 410 can also communicate with networks and other devices through wireless communication. The above wireless communication may use any communication standard or protocol, including but not limited to the global system of mobile communication (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short messaging service (SMS), and the like.
The memory 420 can be used to store software programs and modules; the processor 480 executes the various functional applications and data processing of the smartphone by running the software programs and modules stored in the memory 420. The memory 420 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like, and the storage data area may store data created according to the use of the smartphone (such as audio data, a phone book, etc.), and the like. In addition, the memory 420 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 430 can be used to receive input numeric or character information and to generate key signal input related to the user settings and function control of the smartphone. Specifically, the input unit 430 may include a touch panel 431 and other input devices 432. The touch panel 431, also known as a touch screen, can collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel 431 using a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 431 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact point coordinates, sends them to the processor 480, and can receive and execute commands sent by the processor 480. In addition, the touch panel 431 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 431, the input unit 430 may also include other input devices 432. Specifically, the other input devices 432 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, a joystick, and the like.
The display unit 440 can be used to display information input by the user or provided to the user, as well as the various menus of the smartphone. The display unit 440 may include a display panel 441; optionally, the display panel 441 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 431 can cover the display panel 441; when the touch panel 431 detects a touch operation on or near it, the operation is sent to the processor 480 to determine the type of the touch event, and the processor 480 then provides corresponding visual output on the display panel 441 according to the type of the touch event. Although in FIG. 11 the touch panel 431 and the display panel 441 are two independent components implementing the input and output functions of the smartphone, in some embodiments the touch panel 431 and the display panel 441 can be integrated to implement the input and output functions of the smartphone.
The smartphone may also include at least one sensor 450, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor can adjust the brightness of the display panel 441 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 441 and/or the backlight when the smartphone is moved to the ear. As one type of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes) and can detect the magnitude and direction of gravity when stationary; it can be used for applications that recognize smartphone posture (such as horizontal/vertical screen switching, related games, and magnetometer attitude calibration) and vibration-recognition-related functions (such as a pedometer and tapping), etc. As for other sensors that can also be configured on the smartphone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, details are not repeated here.
The audio circuit 460, the speaker 461, and the microphone 462 can provide an audio interface between the user and the smartphone. The audio circuit 460 can transmit the electrical signal converted from the received audio data to the speaker 461, and the speaker 461 converts it into a sound signal for output; on the other hand, the microphone 462 converts the collected sound signal into an electrical signal, which the audio circuit 460 receives and converts into audio data; the audio data is then output to the processor 480 for processing and subsequently sent through the RF circuit 410 to, for example, another smartphone, or output to the memory 420 for further processing.
WiFi is a short-distance wireless transmission technology. Through the WiFi module 470, the smartphone can help users send and receive e-mail, browse web pages, access streaming media, and so on; it provides users with wireless broadband Internet access. Although FIG. 11 shows the WiFi module 470, it can be understood that it is not an essential component of the smartphone and can be omitted as needed without changing the essence of the invention.
The processor 480 is the control center of the smartphone. It connects various parts of the entire smartphone using various interfaces and lines, and performs the various functions of the smartphone and processes data by running or executing software programs and/or modules stored in the memory 420 and calling data stored in the memory 420, thereby monitoring the smartphone as a whole. Optionally, the processor 480 may include one or more processing units; optionally, the processor 480 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the above modem processor may alternatively not be integrated into the processor 480.
The smartphone also includes a power supply 490 (such as a battery) that supplies power to the various components; optionally, the power supply can be logically connected to the processor 480 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
Although not shown, the smartphone may also include a camera, a Bluetooth module, and so on, which will not be described in detail here.
The steps performed by the terminal device in the above embodiments may be based on the terminal device structure shown in FIG. 11.
An embodiment of this application also provides a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
An embodiment of this application also provides a computer program product including a program that, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.
The above embodiments are only used to illustrate the technical solution of this application, not to limit it. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or replace some of the technical features with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.

Claims (19)

  1. A head pose estimation method, executed by a computer device, comprising:
    acquiring an image to be recognized, the image to be recognized comprising a target face image;
    performing keypoint recognition processing based on the image to be recognized through a first network model, to obtain a set of two-dimensional keypoint coordinates of the target face image in the image to be recognized and a set of three-dimensional keypoint coordinates of the target face image, wherein the first network model comprises a first branch network and a second branch network, the first branch network is used to recognize the set of two-dimensional keypoint coordinates, and the second branch network is used to recognize the set of three-dimensional keypoint coordinates; and
    determining a head pose corresponding to the target face image in the image to be recognized according to the set of two-dimensional keypoint coordinates and the set of three-dimensional keypoint coordinates.
  2. The method according to claim 1, wherein the performing keypoint recognition processing based on the image to be recognized through a first network model, to obtain a set of two-dimensional keypoint coordinates of the target face image in the image to be recognized and a set of three-dimensional keypoint coordinates of the target face image comprises:
    performing keypoint recognition processing based on the image to be recognized through the first network model, to obtain the set of two-dimensional keypoint coordinates, an uncertainty corresponding to each two-dimensional keypoint coordinate in the set of two-dimensional keypoint coordinates, and the set of three-dimensional keypoint coordinates, wherein the first branch network in the first network model is further used to recognize the uncertainty corresponding to each two-dimensional keypoint coordinate in the set of two-dimensional keypoint coordinates; and
    the determining a head pose corresponding to the target face image in the image to be recognized according to the set of two-dimensional keypoint coordinates and the set of three-dimensional keypoint coordinates comprises:
    determining the head pose corresponding to the target face image in the image to be recognized according to the set of two-dimensional keypoint coordinates, the uncertainties, and the set of three-dimensional keypoint coordinates.
  3. The method according to claim 2, wherein the determining the head pose corresponding to the target face image in the image to be recognized according to the set of two-dimensional keypoint coordinates, the uncertainties, and the set of three-dimensional keypoint coordinates comprises:
    removing two-dimensional keypoint coordinates whose corresponding uncertainty is greater than a preset threshold from the set of two-dimensional keypoint coordinates, to obtain an intermediate set of two-dimensional keypoint coordinates;
    acquiring an intermediate set of three-dimensional keypoint coordinates from the set of three-dimensional keypoint coordinates according to the intermediate set of two-dimensional keypoint coordinates; and
    determining the head pose corresponding to the target face image in the image to be recognized according to the intermediate set of two-dimensional keypoint coordinates and the intermediate set of three-dimensional keypoint coordinates.
  4. The method according to claim 3, wherein the determining the head pose corresponding to the target face image in the image to be recognized according to the intermediate set of two-dimensional keypoint coordinates and the intermediate set of three-dimensional keypoint coordinates comprises:
    determining, using PnP solving, the head pose corresponding to the target face image in the image to be recognized according to the intermediate set of two-dimensional keypoint coordinates and the intermediate set of three-dimensional keypoint coordinates.
  5. The method according to any one of claims 1 to 4, further comprising:
    acquiring a first training sample set, the first training sample set comprising training samples labeled with a real set of two-dimensional keypoint coordinates and a real set of three-dimensional keypoint coordinates of a face image;
    performing feature extraction processing on the face images corresponding to the training samples in the first training sample set through a feature extraction network layer in a first initial network model to be trained, to obtain feature representations of the training samples;
    determining, through an initial first branch network in the first initial network model and according to the feature representation of the training sample, a set of predicted two-dimensional keypoint coordinates corresponding to the training sample; determining, through an initial second branch network in the first initial network model and according to the feature representation of the training sample, a set of predicted three-dimensional keypoint coordinates corresponding to the training sample;
    calculating a first loss value from the set of predicted two-dimensional keypoint coordinates and the real set of two-dimensional keypoint coordinates in the training sample, and calculating a second loss value from the set of predicted three-dimensional keypoint coordinates and the real set of three-dimensional keypoint coordinates in the training sample;
    adjusting the initial first branch network according to the first loss value to obtain the first branch network, and adjusting the initial second branch network according to the second loss value to obtain the second branch network; and
    obtaining the first network model from the first branch network and the second branch network.
  6. The method according to claim 5, wherein the determining, through the initial first branch network in the first initial network model and according to the feature representation of the training sample, the set of predicted two-dimensional keypoint coordinates corresponding to the training sample comprises:
    determining, through the initial first branch network and according to the feature representation of the training sample, the set of predicted two-dimensional keypoint coordinates corresponding to the training sample and a prediction uncertainty corresponding to each predicted keypoint coordinate in the set of predicted two-dimensional keypoint coordinates; and
    the calculating a first loss value from the set of predicted two-dimensional keypoint coordinates and the real set of two-dimensional keypoint coordinates in the training sample comprises:
    calculating the first loss value from the set of predicted two-dimensional keypoint coordinates, the prediction uncertainties, and the real set of two-dimensional keypoint coordinates.
  7. The method according to claim 5 or 6, wherein the acquiring a first training sample set comprises:
    collecting a training image set through a depth camera, each training image in the training image set comprising three-dimensional point cloud data of a face image and a real head pose;
    performing pose projection on the three-dimensional point cloud data to obtain two-dimensional keypoint data of the face image in the training image; and
    determining the first training sample set from the training image set through an image processing network.
  8. The method according to claim 7, wherein the determining the first training sample set from the training image set through an image processing network comprises:
    determining, through the image processing network and according to the training images in the training image set, sparse keypoints of the face image in the training image, the sparse keypoints comprising facial feature points and face contour points of the face image;
    acquiring the face image from the training image according to the face contour points; and
    horizontally aligning the face image and scaling it to a target size according to the facial feature points, to obtain the training samples in the first training sample set.
  9. The method according to claim 8, wherein the sparse keypoints comprise at least five facial feature points and four face contour points.
  10. The method according to any one of claims 5 to 9, wherein the feature extraction network comprises a residual neural network ResNet and a pooling layer, the first branch network is a fully connected layer, and the second branch network is a fully connected layer.
  11. The method according to any one of claims 6 to 9, wherein the calculating the first loss value from the set of predicted two-dimensional keypoint coordinates, the prediction uncertainties, and the real set of two-dimensional keypoint coordinates comprises:
    calculating the first loss value from the set of predicted two-dimensional keypoint coordinates, the prediction uncertainties, and the real set of two-dimensional keypoint coordinates using a Gaussian negative log-likelihood loss; and
    the calculating a second loss value from the set of predicted three-dimensional keypoint coordinates and the real set of three-dimensional keypoint coordinates in the training sample comprises:
    calculating the second loss value from the set of predicted three-dimensional keypoint coordinates and the real set of three-dimensional keypoint coordinates using a regression loss function.
  12. The method according to any one of claims 1 to 4, further comprising:
    acquiring a second training sample set, the second training sample set comprising training samples labeled with a real set of two-dimensional keypoint coordinates, a real set of three-dimensional keypoint coordinates, and a real head pose of a face image;
    performing feature extraction processing on the face images corresponding to the training samples in the second training sample set through a feature extraction network layer in a second initial network model to be trained, to obtain feature representations of the training samples;
    determining, through an initial first branch network in the second initial network model and according to the feature representation of the training sample, a set of predicted two-dimensional keypoint coordinates corresponding to the training sample; determining, through an initial second branch network in the second initial network model and according to the feature representation of the training sample, a set of predicted three-dimensional keypoint coordinates corresponding to the training sample;
    calculating a predicted head pose from the set of predicted two-dimensional keypoint coordinates and the set of predicted three-dimensional keypoint coordinates through a computing network in the second initial network model;
    calculating a first loss value from the set of predicted two-dimensional keypoint coordinates and the real set of two-dimensional keypoint coordinates in the training sample, calculating a second loss value from the set of predicted three-dimensional keypoint coordinates and the real set of three-dimensional keypoint coordinates in the training sample, and calculating a third loss value from the predicted head pose and the real head pose in the training sample; and
    adjusting the second initial network model according to the first loss value, the second loss value, and the third loss value, to obtain the first network model.
  13. The method according to claim 12, wherein the determining, through the initial first branch network in the second initial network model and according to the feature representation of the training sample, the set of predicted two-dimensional keypoint coordinates corresponding to the training sample comprises:
    determining, through the initial first branch network and according to the feature representation of the training sample, the set of predicted two-dimensional keypoint coordinates corresponding to the training sample and a prediction uncertainty corresponding to each predicted keypoint coordinate in the set of predicted two-dimensional keypoint coordinates; and
    the calculating a first loss value from the set of predicted two-dimensional keypoint coordinates and the real set of two-dimensional keypoint coordinates in the training sample comprises:
    calculating the first loss value from the set of predicted two-dimensional keypoint coordinates, the prediction uncertainties, and the real set of two-dimensional keypoint coordinates.
  14. The method according to claim 12 or 13, wherein the feature extraction network comprises a residual neural network ResNet and a pooling layer, the first branch network is a fully connected layer, the second branch network is a fully connected layer, and the computing network is a differentiable PnP solving network.
  15. The method according to any one of claims 1 to 14, wherein the acquiring an image to be recognized comprises:
    acquiring an image to be processed, the image to be processed comprising a target face image collected by a camera;
    determining, through an image preprocessing network, sparse keypoints of the target face image in the image to be processed, the sparse keypoints comprising facial feature points and face contour points of the target face image;
    acquiring the target face image from the image to be processed according to the face contour points; and
    horizontally aligning the target face image and scaling it to a target size according to the facial feature points, to obtain the image to be recognized.
  16. A head pose estimation apparatus, comprising:
    an acquisition module, configured to acquire an image to be recognized, the image to be recognized comprising a target face image;
    a processing module, configured to perform keypoint recognition processing based on the image to be recognized through a first network model, to obtain a set of two-dimensional keypoint coordinates of the target face image in the image to be recognized and a set of three-dimensional keypoint coordinates of the target face image, wherein the first network model comprises a first branch network and a second branch network, the first branch network is used to recognize the set of two-dimensional keypoint coordinates, and the second branch network is used to recognize the set of three-dimensional keypoint coordinates; and
    an output module, configured to determine a head pose corresponding to the target face image in the image to be recognized according to the set of two-dimensional keypoint coordinates and the set of three-dimensional keypoint coordinates.
  17. A computer device, comprising: a memory, a processor, and a bus system;
    wherein the memory is configured to store a program;
    the processor is configured to execute the program in the memory, and the processor is configured to perform the method according to any one of claims 1 to 15 according to instructions in program code; and
    the bus system is configured to connect the memory and the processor, so that the memory and the processor communicate.
  18. A computer-readable storage medium, comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 15.
  19. A computer program product, comprising a computer program or instructions which, when executed by a processor, implement the method according to any one of claims 1 to 15.
PCT/CN2023/108312 2022-09-15 2023-07-20 Head pose estimation method, apparatus, device, and storage medium WO2024055748A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211130441.4A CN117011929A (zh) 2022-09-15 2022-09-15 Head pose estimation method, apparatus, device, and storage medium
CN202211130441.4 2022-09-15

Publications (1)

Publication Number Publication Date
WO2024055748A1 true WO2024055748A1 (zh) 2024-03-21

Family

ID=88566103

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/108312 WO2024055748A1 (zh) 2022-09-15 2023-07-20 Head pose estimation method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN117011929A (zh)
WO (1) WO2024055748A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117315211B * 2023-11-29 2024-02-23 苏州元脑智能科技有限公司 Digital human synthesis and model training method, apparatus, device, and storage medium therefor


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447462A (zh) * 2015-11-20 2016-03-30 Face pose estimation method and apparatus
US20210150264A1 (en) * 2017-07-05 2021-05-20 Siemens Aktiengesellschaft Semi-supervised iterative keypoint and viewpoint invariant feature learning for visual recognition
CN112241731A (zh) * 2020-12-03 2021-01-19 Pose determination method, apparatus, device, and storage medium
CN114333034A (zh) * 2022-01-04 2022-04-12 Face pose estimation method and apparatus, electronic device, and readable storage medium
CN114299152A (zh) * 2022-01-21 2022-04-08 Method for acquiring pose data and neural network construction method
CN114360031A (zh) * 2022-03-15 2022-04-15 Head pose estimation method, computer device, and storage medium

Also Published As

Publication number Publication date
CN117011929A (zh) 2023-11-07

Similar Documents

Publication Publication Date Title
WO2021244217A1 (zh) Expression transfer model training method, and expression transfer method and apparatus
US20220051061A1 (en) Artificial intelligence-based action recognition method and related apparatus
US20210019627A1 (en) Target tracking method and apparatus, medium, and device
WO2020216054A1 (zh) Gaze tracking model training method, and gaze tracking method and apparatus
US20220076000A1 (en) Image Processing Method And Apparatus
WO2019114696A1 (zh) Augmented reality processing method, object recognition method, and related device
US11383166B2 (en) Interaction method of application scene, mobile terminal, and storage medium
US11715224B2 (en) Three-dimensional object reconstruction method and apparatus
US11366528B2 (en) Gesture movement recognition method, apparatus, and device
CN108985220B (zh) Face image processing method and apparatus, and storage medium
CN109005336B (zh) Image shooting method and terminal device
CN106127829B (zh) Augmented reality processing method and apparatus, and terminal
CN111209423B (zh) Electronic-album-based image management method and apparatus, and storage medium
CN108683850B (zh) Shooting prompt method and mobile terminal
CN109272473B (zh) Image processing method and mobile terminal
WO2022088819A1 (zh) Video processing method, video processing apparatus, and storage medium
CN113426117B (zh) Virtual camera shooting parameter acquisition method and apparatus, electronic device, and storage medium
WO2024055748A1 (zh) Head pose estimation method, apparatus, device, and storage medium
CN112818733B (zh) Information processing method, apparatus, storage medium, and terminal
CN113409468A (zh) Image processing method and apparatus, electronic device, and storage medium
CN116958715A (zh) Hand keypoint detection method and apparatus, and storage medium
WO2023137923A1 (zh) Pose-guided person re-identification method, apparatus, device, and storage medium
CN108108017B (zh) Search information processing method and mobile terminal
CN110969085B (zh) Facial feature point locating method and electronic device
CN112954480B (zh) Data transmission progress display method and data transmission progress display apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23864483

Country of ref document: EP

Kind code of ref document: A1