WO2023027712A1 - Methods and systems for simultaneously reconstructing a pose and parametric 3D human models in mobile devices - Google Patents

Methods and systems for simultaneously reconstructing a pose and parametric 3D human models in mobile devices

Info

Publication number
WO2023027712A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
neural network
human
person
model
Prior art date
Application number
PCT/US2021/047793
Other languages
English (en)
Inventor
Zhong Li
Shuxue Quan
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2021/047793 priority Critical patent/WO2023027712A1/fr
Priority to CN202180101541.1A priority patent/CN117916773A/zh
Publication of WO2023027712A1 publication Critical patent/WO2023027712A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Definitions

  • This application relates generally to image data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for rendering an avatar in real time based on information of a person captured in an image.
  • Human pose estimation provides information of human motion for use in movies, games, and health applications.
  • Current practice normally requires industrial-grade imaging equipment that is expensive to manufacture, requires professional training to operate, and is oftentimes used with physical markers attached to the surfaces of tracked objects. Physical markers are inconvenient to use, cause data pollution, and can even interfere with an object’s movement in some situations.
  • researchers use multiple optical or depth cameras with multiple viewing angles to provide image input and have developed markerless algorithms to capture human motion. These optical cameras are not suitable for outdoor environments; particularly in sunlight, the resolution and collection distance of optical or depth cameras are limited.
  • the markerless algorithms are executed offline on a personal computer having strong computing power, so enabling handheld devices to capture human motion in real time remains a challenge. It would be beneficial to have a more convenient human pose estimation mechanism at a mobile device than the current practice.
  • a convenient human pose estimation mechanism is described for identifying joints of human bodies in images and determining associated human motion in real time, particularly in images taken by conventional cameras (e.g., a camera of a mobile phone or augmented reality glasses).
  • Various embodiments of this application are directed to an end-to-end pipeline that simultaneously recovers parametric human pose, shapes, geometry, and color from one or more images in a real-time fashion on a mobile platform.
  • a human region is detected and cropped from the image.
  • the human region is provided as an input to a pose estimation network to infer a three-dimensional (3D) human model (e.g., a skinned multi-person linear (SMPL) model) including a first set of human model parameters describing at least a pose and a shape of a human body of the person and a second set of human model parameters concerning a plurality of vertices of the human body (e.g., vertex offset, vertex color).
  • Such human model parameters are applied to generate a parametric colored human mesh of the 3D human model associated with a person in the input image.
  • the pose estimation network is optionally trained using public datasets, such as MPII, COCO, and Human3.6M, to generate the first set of human model parameters.
  • the pose estimation network is trained to generate vertex offset and color in a semi-supervised manner, e.g., using the ThuHuman depth dataset for supervised training and using a differentiable renderer for unsupervised training.
  • a pose estimation network can be reliably executed on a consumer level mobile device in real time.
  • a method is implemented at a computer system for driving an avatar.
  • the method includes obtaining an image captured by a camera.
  • the image includes a person.
  • the method further includes extracting a plurality of features from the image using a convolutional neural network and generating a 3D human model of the person from the plurality of features using a regression neural network.
  • the 3D human model includes a first set of human model parameters describing at least a pose and a shape of a human body of the person and a second set of human model parameters concerning a plurality of vertices of the human body.
  • the method further includes rendering an avatar based on the 3D human model of the person.
  • the first set of human model parameters describe the pose of the human body via positional information of a plurality of key points of the human body in the image.
  • the plurality of key points includes a root point, and the positional information of each key point includes a 3D rotational position of the respective key point measured with respect to the root point.
  • the 3D human model is meshed to the plurality of vertices, and the second set of human model parameters includes at least a 3D vertex offset and a vertex color value of each of the plurality of vertices.
  • the 3D vertex offset indicates a positional deviation between a location of the respective vertex of the 3D human model and a location of a corresponding spot of the human body.
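  • For illustration only, the two sets of human model parameters described above can be captured in a simple container such as the following Python sketch; the field names, the 24-joint and 10-element sizes, and the 6,890-vertex count (the SMPL template size) are assumptions for the example rather than limitations of the disclosure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HumanModelParameters:
    """Hypothetical container for the two sets of human model parameters."""
    pose: np.ndarray           # (24, 3) rotation of each key point relative to the root point
    shape: np.ndarray          # (10,) coefficients describing the shape of the human body
    vertex_offset: np.ndarray  # (N, 3) 3D offset of each mesh vertex from the template surface
    vertex_color: np.ndarray   # (N, 3) RGB color value of each mesh vertex

# A fixed, non-adaptive vertex count (here the 6890 vertices of the SMPL template).
params = HumanModelParameters(
    pose=np.zeros((24, 3)),
    shape=np.zeros(10),
    vertex_offset=np.zeros((6890, 3)),
    vertex_color=np.ones((6890, 3)),
)
```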
  • some implementations include a computer system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1A is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments
  • Figure 1B is a pair of AR glasses that can be communicatively coupled in a data processing environment, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments
  • Figure 4B is an example node in the neural network, in accordance with some embodiments.
  • Figure 5 is a block diagram of a data processing model that is applied to render an avatar based on image data, in accordance with some embodiments.
  • Figure 6A is a flow diagram of a data inference process in which an avatar is rendered based on image data, in accordance with some embodiments.
  • Figure 6B is a flow diagram of a training process in which a pose estimation model is trained to render an avatar based on image data, in accordance with some embodiments.
  • Figure 7 is a flowchart of a method for rendering and driving an avatar based on an image captured by a camera, in accordance with some embodiments.
  • Three-dimensional (3D) reconstruction of the human body is widely used in movie special effects, games and health applications.
  • In movies, it can be used to create realistic virtual characters; in games, to create game characters; and in health applications, 3D human modeling can be used to evaluate human health parameters and behaviors.
  • Various embodiments of this application are directed to parametric model generation and photorealistic human reconstruction.
  • a human parametric model has pose parameters and beta parameters. The pose parameters control human pose and associated motion, and the beta parameters control the human shape to be reconstructed based on its appearance in real input images.
  • This application proposes a portable, client-side, real-time 3D human shape and parameter reconstruction solution.
  • Image data are applied to reconstruct the human body and posture simultaneously and in an end-to-end manner, while demanding a reasonable level of computational power that can be provided by a mobile device.
  • an end-to-end method is applied for real-time reconstruction of geometry and shape of a human body.
  • a differentiable renderer is used with neural networks that are trained in a semi-supervised manner to return camera parameters and shape, geometric, and color information of a human body concurrently.
  • Figure 1A is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • the networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time.
  • the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and shares information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera remotely and in real time.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, i.e., a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.
  • deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • the content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequently to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D).
  • the client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g. the client device 104A and HMD 104D).
  • the server 102A obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized or predicted device poses) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results.
  • the client device 104 itself implements no or little data processing on the content data prior to sending the content data to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.
  • FIG. 1B illustrates a pair of AR glasses 104D (also called an HMD) that can be communicatively coupled to a data processing environment 100, in accordance with some embodiments.
  • the AR glasses 104D include a camera, a microphone, a speaker, one or more inertial sensors (e.g., a gyroscope, an accelerometer), and a display.
  • the camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures hand gestures of a user wearing the AR glasses 104D.
  • the microphone records ambient sound, including user’s voice commands.
  • both the video and static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses.
  • deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses.
  • the device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D.
  • the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.
  • deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D.
  • 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model.
  • Visual content is optionally generated using a second data processing model.
  • Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D.
  • Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
  • FIG. 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments.
  • the data processing system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure IB), a storage 106, or a combination thereof.
  • the data processing system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
  • Data processing module 228 (e.g., a data processing module 500 in Figure 5) for processing content data using data processing models 240 (e.g., a pose estimation model 550 in Figure 5), thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224; and
  • One or more databases 230 for storing at least data including one or more of:
    o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings; and
    o Network parameters 236 for the one or more communication networks 108.
  • the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • the above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206 optionally, stores a subset of the modules and data structures identified above.
  • memory 206 optionally, stores additional modules and data structures not described above.
  • FIG. 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 240 is trained according to a type of the content data to be processed.
  • the training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 that is applied to process the training data 306.
  • an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size.
  • an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform.
  • the model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 240 is provided to the data processing module 228 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of video, image, audio, textual, and other types of data.
  • each image is pre-processed to extract an ROI or cropped to a predefined image size
  • an audio clip is pre-processed to convert to a frequency domain using a Fourier transform.
  • the content data includes two or more types, e.g., video data and textual data.
  • the model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • the processed content data is further processed by the data postprocessing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments
  • Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments.
  • the data processing model 240 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs.
  • the node output is provided via one or more links 412 to one or more other nodes 420
  • a weight w associated with each link 412 is applied to the node output.
  • the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the one or more layers includes a single layer acting as both an input layer and an output layer.
  • the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406.
  • a deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
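  • As a minimal sketch of the node computation described above (the function name, the tanh activation, and the example values are arbitrary choices for illustration), a node output can be computed as a non-linear activation applied to the weighted combination of the node inputs plus the bias term b described later in this application:

```python
import numpy as np

def node_output(inputs, weights, bias, activation=np.tanh):
    """Propagation function of a single node: an activation applied to the
    linear weighted combination of the node inputs plus a bias term b."""
    return activation(np.dot(weights, inputs) + bias)

# Example: a node with four inputs combined with weights w1..w4.
x = np.array([0.2, -0.5, 1.0, 0.3])
w = np.array([0.4, 0.1, -0.3, 0.8])
print(node_output(x, w, bias=0.05))
```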
  • a convolutional neural network is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
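  • A minimal sketch of such a CNN follows; the layer count, channel widths, and 224 x 224 input size are illustrative assumptions and do not reproduce the disclosed encoder 508.

```python
import torch
import torch.nn as nn

class TinyCNNEncoder(nn.Module):
    """Illustrative CNN encoder: each convolutional layer abstracts the
    pre-processed image into a feature map over a small receptive area."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, image):
        return self.features(image)

encoder = TinyCNNEncoder()
feature_map = encoder(torch.randn(1, 3, 224, 224))  # e.g., a 224 x 224 cropped human area
print(feature_map.shape)  # torch.Size([1, 64, 28, 28])
```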
  • a recurrent neural network is applied in the data processing model 240 to process content data (particularly, textual and audio data).
  • Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor.
  • the RNN can be used for hand
  • the training process is a process for calibrating all of the weights wi for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • forward propagation the set of weights for different layers are applied to the input data and intermediate results from the previous layers.
  • backward propagation a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the NN 400 avoid over fitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
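  • A minimal sketch of this training loop is given below; the toy model, optimizer settings, synthetic data, and loss threshold are placeholders for illustration, not the disclosed pose estimation model.

```python
import torch
import torch.nn as nn

# Toy model and data standing in for the learning model and training data set.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
inputs, targets = torch.randn(32, 8), torch.randn(32, 4)

loss_threshold = 1e-3
for step in range(10_000):
    prediction = model(inputs)           # forward propagation through all layers
    loss = loss_fn(prediction, targets)  # margin of error of the output
    optimizer.zero_grad()
    loss.backward()                      # backward propagation of the error
    optimizer.step()                     # adjust weights (and bias terms b) to decrease the error
    if loss.item() < loss_threshold:     # predefined convergence condition
        break
```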
  • FIG. 5 is a block diagram of a data processing module 500 that renders an avatar 504 based on image data, in accordance with some embodiments.
  • the image data includes one or more images 502 captured by a camera (e.g., included in a mobile phone 104C or AR glasses 104D).
  • the data processing module 500 obtains an image 502, renders the avatar 504 based on the image 502, and causes the avatar 504 to be displayed on a screen of the mobile phone or AR glasses 104D.
  • a client device 104 includes the data processing module 500 and is configured to render and drive the avatar 504 based on the image 502 captured by the client device 104 itself.
  • a first client device 104 includes the data processing module 500 and is configured to render and drive the avatar 504 based on the image 502 captured by a second distinct client device 104.
  • the data processing module 500 includes a subset or all of a human detection module 506, a CNN encoder 508, a regression neural network 510, a 3D human pose estimation module 512, a global position localization module 514, and an avatar rendering module 516. These modules 506-516 cause the data processing module 500 to extract a plurality of features 518 from the image 502, generate a 3D human model 520 of a person in the image, and render the avatar 504.
  • the 3D human model 520 includes a first set of human model parameters 522 describing at least a pose and a shape of a human body of the person and a second set of human model parameters 524 concerning a plurality of vertices of the human body.
  • the 3D human model 520 further includes information of a camera pose 526 (e.g., a camera position or orientation).
  • the human detection module 506 obtains the image 502 (e.g., an RGB image), detects a human body of a person from the image 502, and generates a human area 528 to enclose the human body.
  • the human area 528 has a rectangular bounding box that closely encloses the human body.
  • a human detection model is trained and applied to detect the human body and generate the human area 528.
  • the human detection model optionally includes an inverted residual block.
  • the human detection model includes a compact and lightweight fully convolutional neural network (CNN) encoder and uses an anchor-based one-shot detection framework (e.g., a single-stage real-time object detection model, YoloV2) which is configured to generate a regression result associated with the human region 528.
  • the COCO dataset is applied to train the neural network applied in the human detection module 506.
  • the bounding box of the human area 528 has a predefined aspect ratio that applies to any bounding box associated with human bodies detected within the image 502. Given the predefined aspect ratio, a width or a length of the bounding box is expanded to enclose a distinct human body entirely without distorting an image aspect ratio of the image 502.
  • the bounding box 528’ includes 224 x 224 pixels. In some embodiments not shown, the image 502 is cropped and/or scaled to 224 x 224 pixels, and the bounding box 528’ is less than 224 x 224 pixels and enclosed within the cropped image 502.
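  • A minimal sketch of expanding a tight bounding box to a predefined aspect ratio before cropping, as described above, is given below; the function name and the 1.0 aspect ratio are illustrative assumptions.

```python
def expand_to_aspect(box, aspect=1.0):
    """Expand a tight human bounding box (x0, y0, x1, y1) so that its
    width/height ratio equals the predefined aspect ratio, widening the
    shorter side about the box center so the body stays fully enclosed."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    if w / h < aspect:      # too narrow: expand the width
        w = aspect * h
    else:                   # too short: expand the height
        h = w / aspect
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# The expanded region would then be cropped and resized to, e.g., 224 x 224 pixels.
print(expand_to_aspect((100, 50, 180, 250), aspect=1.0))
```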
  • the CNN encoder 508 is coupled to the human detection module 506, and configured to extract a plurality of features from the image 502 (specifically, from the human area 528 of the image 502).
  • the regression neural network 510 is coupled to the CNN encoder 508, and configured to generate a first set of human model parameters 522 including pose parameters 522A and shape parameters 522B and a second set of human model parameters 524 concerning a plurality of vertices of the human body.
  • the pose parameters 522A and shape parameters 522B describe a pose and a shape of the human body of the person, respectively.
  • Examples of the second set of human model parameters are 3D vertex offset 524A and vertex color 524B of each vertex of the human body.
  • the regression neural network 510 also detects a camera pose 526 (e.g., a camera position and a camera orientation) with respect to a scene where the camera capturing the image 502 is disposed.
  • the regression neural network 510 includes an output neural network layer, and the first set of human model parameters 522 and the second set of human model parameters 524 are outputted from the output neural network layer.
  • a camera intrinsic matrix transforms 3D camera coordinates to 2D homogeneous image coordinates, and is known and fixed for each input image 502.
  • a pose estimation model 550 includes a neural network of the human detection module 506, the CNN encoder 508, and the regression neural network 510, and is configured to predict the human model parameters 522 and 524 and camera pose 526.
  • a feature extraction layer (i.e., the CNN encoder 508) is implemented by an inverted bottleneck module with skip connections, and the regression neural network 510 includes at least two fully connected layers applied to produce camera extrinsic parameters (i.e., camera pose 526), SMPL pose parameters 522A for 24 joints of the human body, and shape parameters 522B (e.g., controlled by a 10-element vector).
  • the regression neural network 510 also provides a vertex displacement 524A and per-vertex color information 524B in the SMPL template.
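  • A hypothetical sketch of such a regression network with fully connected layers and separate output heads is shown below; the feature dimension, layer widths, head names, and vertex count are assumptions for illustration and do not reproduce the disclosed network 510.

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Illustrative regression network mapping extracted image features to camera
    pose, pose parameters for 24 joints, a 10-element shape vector, and
    per-vertex offset and color."""
    def __init__(self, feature_dim=1024, num_vertices=6890):
        super().__init__()
        self.num_vertices = num_vertices
        self.backbone = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.camera_pose = nn.Linear(512, 6)    # camera position and orientation
        self.pose = nn.Linear(512, 24 * 3)      # pose parameters for 24 joints
        self.shape = nn.Linear(512, 10)         # 10-element shape vector
        self.vertex_offset = nn.Linear(512, num_vertices * 3)
        self.vertex_color = nn.Linear(512, num_vertices * 3)

    def forward(self, features):
        h = self.backbone(features)
        return {
            "camera_pose": self.camera_pose(h),
            "pose": self.pose(h).view(-1, 24, 3),
            "shape": self.shape(h),
            "vertex_offset": self.vertex_offset(h).view(-1, self.num_vertices, 3),
            "vertex_color": torch.sigmoid(self.vertex_color(h)).view(-1, self.num_vertices, 3),
        }

head = RegressionHead()
outputs = head(torch.randn(1, 1024))  # features extracted by the CNN encoder
```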
  • the 3D human pose estimation module 512 is coupled to the regression neural network 510 and forms a 3D human model 520 (e.g., a skinned multi-person linear (SMPL) model) based on the first and second sets of human model parameters 522 and 524 and camera pose 526.
  • the first and second sets of human model parameters 522 and 524 describe the human body of the SMPL model.
  • the avatar rendering module 516 is coupled to the 3D human pose estimation module 512, and configured to render the avatar 504.
  • the pose estimation model 550 of the modules 506-510 is trained in a self-supervised manner.
  • a human model is reconstructed with mesh based on the pose parameters 522A, shape parameters 522B, 3D vertex offset 524A, and vertex color 524B.
  • the avatar rendering module 516 includes a differentiable render module 530 configured to receive an output image including the avatar 504 and the image 502 including the human area 528 and compare the image 502 and output image.
  • a corresponding loss function is defined as a difference between the human area 528 of the image 502 and the rendered avatar 504 to train the pose estimation model 550 to obtain at least the vertex offset 524A and vertex color 524B accurately.
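  • A minimal sketch of such a self-supervised loss is shown below; the differentiable_render call in the comments is a hypothetical placeholder standing in for the differentiable render module 530, and the L1 form of the pixel difference is an assumption for illustration.

```python
import torch

def photometric_loss(rendered_rgb, target_rgb, mask=None):
    """Pixel-wise difference between the rendered avatar and the human area
    cropped from the input image; gradients can flow back through a
    differentiable renderer into the pose estimation model."""
    diff = torch.abs(rendered_rgb - target_rgb)
    if mask is not None:
        diff = diff * mask          # optionally restrict the loss to the human region
    return diff.mean()

# Hypothetical usage (differentiable_render stands in for the differentiable render module 530):
# rendered = differentiable_render(vertices + vertex_offset, vertex_color, camera_pose)
# loss = photometric_loss(rendered, cropped_human_area)
# loss.backward()
```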
  • the pose estimation model 550 of the modules 506-510 is trained in a supervised manner, particularly to obtain accurate pose parameters 522A and shape parameters 522B of the human body or camera pose 526.
  • Training data provided for supervised training include a plurality of test images and 2D labels for ground truths of the pose parameters 522A and shape parameters 522B of human bodies or camera pose 526 associated with the test images.
  • the pose estimation model 550 of the modules 506-510 projects joints of human bodies in the test images.
  • the pose estimation model 550 is adjusted to optimize (e.g., minimize, suppress below a joint threshold) an L2 loss between the projected joints and the ground truths.
  • the training data includes 3D pose information of the human bodies in the test images.
  • the pose estimation model 550 of the modules 506-510 regresses 3D positions of the joints of the human bodies in the test images to obtain the 3D human model 520.
  • the pose estimation model 550 is adjusted to optimize (e.g., minimize, suppress below a 3D offset threshold) a 3D loss between the 3D positions of the joints and the ground truths.
  • the training data has ground truth geometry. A 3D surface loss is determined between a geometry of a human body associated with vertex offset 524A and the ground truth geometry, thereby guiding determination of geometry displacement 524A and vertex color 524B.
  • the pose estimation model 550 is adjusted to optimize (e.g., minimize, suppress below a surface loss threshold) the 3D surface loss.
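  • One plausible form of the three supervised losses described above is sketched below; the function names and the mean-reduction choices are illustrative assumptions rather than the disclosed loss definitions.

```python
import torch

def joint_reprojection_loss(projected_joints_2d, gt_joints_2d):
    """L2 loss between joints projected from the 3D human model and 2D labels."""
    return torch.mean((projected_joints_2d - gt_joints_2d) ** 2)

def joint_3d_loss(pred_joints_3d, gt_joints_3d):
    """3D loss between regressed 3D joint positions and ground-truth 3D pose."""
    return torch.mean((pred_joints_3d - gt_joints_3d) ** 2)

def surface_loss(pred_vertices, gt_vertices):
    """3D surface loss guiding the vertex offsets toward the ground-truth geometry."""
    return torch.mean(torch.norm(pred_vertices - gt_vertices, dim=-1))
```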
  • a differentiable render module 530 in the avatar rendering module 516 is used for determining the rendered avatar 504 and comparing it with the human area 528 of the image 502.
  • the pose estimation model 550 in the data processing module 500 is trained on one or more public datasets (e.g., MPII, COCO, Human3.6M) to obtain accurate pose parameters 522A and shape parameters 522B of human bodies.
  • a public dataset, ThuHuman, contains 2,000 human scans and is applied to regress the geometry and color information 524A and 524B.
  • a synthetic human dataset is built from a high-fidelity human model on a public data source.
  • the pose estimation model 550 in the data processing module 500 takes a computing cost of about 0.45G floating point operations (GFLOPs) to render the avatar 504 based on the image 502.
  • the global position localization module 514 is coupled to the 3D human pose estimation module 512, and receives the 3D human model 520 including the 3D joint positions of joints of the human body captured in the image 502. Such 3D joint positions are converted to human motion in a 3D space.
  • the global position localization module 514 enables an AR real-time human motion capture system that solves a global position T of a human object (i.e., the avatar 504) for estimating the avatar’s motion relative to the real world.
  • Suppose a camera intrinsic projection matrix is P, the 3D joint positions determined from the image 502 are X, and a human global position movement is Δx in real time. The 2D joint positions X2d in the image 502 are then represented, in homogeneous image coordinates, as: X2d = P (X + Δx). (1)
  • Equation (1) is derived into a linear system, and solved using singular value decomposition (SVD).
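  • A hypothetical sketch of that derivation follows: each joint contributes two equations that are linear in the global translation, and the resulting over-determined system is solved by least squares (NumPy's lstsq uses an SVD-based solver). The function name, argument layout, and the pinhole form of the intrinsic matrix K are assumptions for illustration.

```python
import numpy as np

def solve_global_translation(joints_3d, joints_2d, K):
    """Recover the global translation T that best explains the observed 2D
    joints, assuming x_2d ~ K (X + T) as in Equation (1)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    A, b = [], []
    for (X, Y, Z), (u, v) in zip(joints_3d, joints_2d):
        # Multiply out the projection and collect the terms linear in T.
        A.append([fx, 0.0, cx - u]); b.append(u * Z - fx * X - cx * Z)
        A.append([0.0, fy, cy - v]); b.append(v * Z - fy * Y - cy * Z)
    T, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return T  # (Tx, Ty, Tz) global position movement of the human object
```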
  • the avatar rendering module 516 is configured to render the 3D avatar model (i.e., the avatar 504) on a display of a client device 104.
  • the client device 104 has a camera configured to capture images of a field of view of a scene, and the avatar 504 is overlaid on top of the field of view on the display.
  • the same camera is applied to capture the human body from which the 3D human model 520 is extracted for driving and rendering the avatar 504, and the avatar 504 is displayed in real time on top of the human body in the field of view of the camera.
  • the avatar 504 substantially overlaps the human body captured by the camera.
  • a first camera is applied to capture the human body from which the 3D human model 520 is extracted for driving and rendering the avatar 504, and the avatar 504 is displayed in real time in a field of view of a distinct second camera.
  • a latency between rendering the avatar 504 and capturing the image 502 from which the avatar 504 is rendered is substantially small (e.g., less than a threshold latency (e.g., 5 milliseconds)), such that the avatar 504 is regarded as being rendered substantially in real time.
  • the data processing module 500 is implemented in real time on a mobile device (e.g., a mobile device 104C), and corresponds to the pose estimation model 550 that further includes at least the CNN of the human detection module 506, CNN encoder 508, and regression neural network 510.
  • Post-processing and linear calculation can be optimized in the data processing module 500.
  • the CNN of the human detection module 506, CNN encoder 508, and regression neural network 510 are quantized.
  • Each of the CNN of the human detection module 506 and regression neural network 510 includes a plurality of layers, and each layer has a respective number of filters. Each filter is associated with a plurality of weights.
  • a float32 format is maintained for the plurality of weights of each filter while the respective network is trained. After the respective network is generated, the plurality of weights of each filter are quantized to an int8, uint8, int16, or uint16 format.
  • a server trains the CNN of the human detection module 506, CNN encoder 508, and regression neural network 510 in the float32 format, and quantizes them to the int8, uint8, int16, or uint16 format.
  • the quantized CNN and regression neural network 510 are provided to the mobile device for use in inference of the avatar 504.
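  • A simplified, symmetric per-tensor quantization sketch illustrating the idea is shown below; the function is an assumption for illustration, not the disclosed procedure, and a real deployment would typically rely on the inference engine's own quantization tooling.

```python
import numpy as np

def quantize_weights(weights, dtype=np.int8):
    """Symmetric post-training quantization of float32 weights to a signed
    lower-precision integer format (e.g., int8 or int16)."""
    qmax = np.iinfo(dtype).max
    scale = max(float(np.max(np.abs(weights))) / qmax, 1e-12)
    quantized = np.clip(np.round(weights / scale), -qmax, qmax).astype(dtype)
    return quantized, scale  # dequantize with quantized.astype(np.float32) * scale

weights = np.random.randn(64, 3, 3, 3).astype(np.float32)  # one layer's filter weights
q, scale = quantize_weights(weights, np.int8)
```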
  • the CNN of the human detection module 506, CNN encoder 508, and regression neural network 510 are executed by a neural network inference engine of a digital signal processing (DSP) unit or a graphics processing unit (GPU), e.g., a Qualcomm Snapdragon Neural Processing Engine (SNPE).
  • computing power consumption is roughly 0.8 GFLOPs, which can be conveniently executed on many chips in the market.
  • FIG. 6A is a flow diagram of a data inference process 600 in which an avatar 504 is rendered based on image data, in accordance with some embodiments.
  • the image data is captured by a camera of an electronic device 104, and includes one or more images 502.
  • each image 502 includes a person.
  • the respective image 502 is cropped to keep an image area 528 containing the person.
  • the image area 528’ has a predefined aspect ratio.
  • a length or width of the image area 528’ matches that of the person in the image 502, and a corresponding width or length of the image area 528’ is greater than that of the person in the image 502, such that the person is fully contained within the image area 528.
  • the image area 528 or 528’ of the image 502 is processed using a CNN encoder 508 to extract a plurality of features.
  • a regression neural network 510 generates (604) parameters 522 of a pose and a shape of a human body of the person, and vertex offset 524A and vertex color 524B of a plurality of vertices of the human body simultaneously.
  • a 3D human model 520 is formed (606) from the pose parameters 522A and shape parameters 522B of the human body of the person, and vertex offset 524A and vertex color 524B of the plurality of vertices of the human body.
  • the pose parameters 522A of the human body of the person belong to a first set of human model parameters, and include positional information of a plurality of key points of the human body in each image 502.
  • the plurality of key points includes a root point, and the positional information of each key point includes a 3D rotational position of the respective key point measured with respect to the root point.
  • Each key point corresponds to a joint.
  • the 3D human model 520 includes 24 joints represented by 24 key points.
  • the root point corresponds to a hip joint
  • a position of each key point is measured with respect to the hip joint in a spherical coordinate system centered at the hip joint.
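  • A minimal sketch of expressing a key point in a spherical coordinate system centered at the hip joint (root point) is shown below; the function name and the example key point are illustrative assumptions.

```python
import numpy as np

def spherical_about_root(key_point, root_point):
    """Express a key point's position in spherical coordinates centered at the
    root point (hip joint): radius r, polar angle theta, azimuth phi."""
    d = np.asarray(key_point, dtype=float) - np.asarray(root_point, dtype=float)
    r = np.linalg.norm(d)
    theta = np.arccos(d[2] / r) if r > 0 else 0.0  # angle from the +z axis
    phi = np.arctan2(d[1], d[0])                   # azimuth in the x-y plane
    return r, theta, phi

# Example: a knee key point measured with respect to the hip joint (root point).
print(spherical_about_root(key_point=(0.1, -0.05, -0.45), root_point=(0.0, 0.0, 0.0)))
```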
  • the 3D human model 520 has a surface that is meshed to the plurality of vertices, and each vertex is described by the respective vertex offset 524A and vertex color 524B.
  • the respective vertex offset 524A indicates a positional deviation between a location of the respective vertex of the 3D human model in each image 502 and a nominal or static location of a corresponding spot of the human body.
  • the plurality of vertices have a predefined number of vertices, and the predefined number is fixed and not adaptively adjusted by the regression neural network 510 during training.
  • the regression neural network 510 generates (608) a camera pose 526 (e.g., a camera position and orientation) of the camera capturing the one or more images 502.
  • the electronic device 104 includes AR glasses 104D.
  • the AR glasses 104D scan a scene where the AR glasses 104D are located, and create and update a 3D map of the scene (i.e., a virtual 3D camera space).
  • Each image 502 is compared with the 3D map to identify the camera pose 526 with respect to the 3D map.
  • the 3D map includes a plurality of feature points, and each image 502 includes a subset of feature points.
  • the subset of feature points are compared to the plurality of feature points to determine where the camera of the AR glasses 104D is located and oriented within the scene.
  • the plurality of feature points are optionally updated according to the subset of feature points of each image 502 as well.
  • the avatar 504 is generated based on the 3D human model 520 and rendered (610) in a user application executed at an electronic device 104-1.
  • the operations 602-608 are implemented on an electronic device 104-2.
  • the camera capturing the image 502 is integrated on an electronic device 104-3.
  • all three electronic devices 104-1, 104-2, and 104-3 are distinct from each other.
  • all three electronic devices 104-1, 104-2, and 104-3 are the same device.
  • the electronic devices 104-1 and 104-2 are distinct, and the electronic devices 104-2 and 104-3 are identical.
  • the avatar 504 is generated based on the 3D human model 520 and provided to a distinct electronic device to be rendered in a user application executed on the distinct electronic device that does not implement operations 602-608.
  • the electronic devices 104-1 and 104-2 are the same device, and the electronic devices 104-2 and 104-3 are identical or distinct.
  • the avatar 504 is generated based on the 3D human model 520 and rendered in a user application executed locally on the electronic device that implements operations 602-608.
  • any of the electronic devices 104-1, 104-2, and 104-3 is optionally a portable device, e.g., a mobile phone, a tablet computer, and a laptop computer.
  • the user application is one of an image processing application configured to render one or more augmented reality (AR) effects on the avatar 504, a gaming application configured to place the avatar 504 in a game scene, and a health application configured to evaluate human health conditions and behaviors based on the 3D human model of the person.
  • the electronic device 104-2 configured to implement the operations 602-608 includes one of a GPU and a DSP. The one of the GPU and DSP has a precision setting.
  • Each of the convolutional neural network 508 and the regression neural network 510 includes one or more layers, and each layer has a plurality of weights associated with each filter in the respective layer.
  • the plurality of weights of each layer are quantized according to the precision setting after the convolutional neural network 508 and the regression neural network 510 are trained.
  • the one or more images 502 include a first image 502A and the 3D human model 520 of the person includes a first 3D human model.
  • the electronic device configured to implement the operations 602-608 obtains a second image 502B including the person and generates a second 3D human model of the person from the second image 502B using the CNN encoder 508 and regression neural network 510.
  • the second image 502B is captured by the camera subsequently to the first image 502A.
  • the electronic device re-renders the avatar 504 based on the second 3D human model of the person, thereby making the avatar track motion of the person.
  • FIG. 6B is a flow diagram of a training process 650 in which a pose estimation model 550 is trained to render an avatar based on image data, in accordance with some embodiments.
  • the data processing module 500 corresponds to a comprehensive pose estimation model 550 including at least the CNN of the human detection module 506, CNN encoder 508, and regression neural network 510.
  • the comprehensive pose estimation model 550 is trained in an end-to-end manner. Alternatively, each of the CNN of the human detection module 506, CNN encoder 508, and regression neural network 510 is trained separately. Model training is optionally implemented at a server 102 or a client device 104, while the data processing module 500 is executed at the client device 104 to render the avatar 504 in a new image 612.
  • the convolutional neural network 508 and the regression neural network 510 are trained end-to-end in a supervised manner for the first set of human model parameters 522, e.g., using one or more public database sets (such as MPII, COCO, Human3.6M, etc.).
  • Training data include both test images 502 and ground truth 614 of the first set of human model parameters 522 of the test images 502.
  • the first set of human model parameters 522 describing the pose and shape of a human body of the person are compared with the ground truth 614.
  • Weights associated with filters of neural networks in the pose estimation model 550 are adjusted to optimize a loss function that combines the first set of human model parameters 522 and the ground truth 614.
  • the convolutional neural network 508 and the regression neural network 510 are also trained end-to-end in a supervised manner for the camera pose 526.
  • training data includes test images 502 captured with the camera facing towards a Z-axis.
  • a camera pose loss is determined between a camera pose 526 determined based on a test image 502 and a ground truth camera pose.
  • the weights of the filters of the neural networks in the pose estimation model 550 are adjusted to optimize the camera pose loss (e.g., to minimize it or suppress it below a camera pose loss threshold).
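  • A minimal sketch of such a supervised objective, assuming a PyTorch-style training loop in which the ground-truth tensors and weighting factors are hypothetical:

      import torch
      import torch.nn.functional as F

      def supervised_loss(pred, gt, w_pose=1.0, w_shape=1.0, w_cam=1.0):
          # pred / gt: dicts with 'pose' (B, 24, 3), 'shape' (B, 10), 'camera' (B, 3)
          pose_loss = F.mse_loss(pred['pose'], gt['pose'])      # first-set parameters 522
          shape_loss = F.mse_loss(pred['shape'], gt['shape'])
          cam_loss = F.mse_loss(pred['camera'], gt['camera'])   # camera pose loss
          return w_pose * pose_loss + w_shape * shape_loss + w_cam * cam_loss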
  • the filters of neural networks in the pose estimation model 550 are at least partially trained end-to-end in an un-supervised manner.
  • Each test image 502 includes a respective person.
  • a plurality of test features are extracted from each test image 502 (specifically, respective human regions 528) using the convolutional neural network.
  • a respective 3D test model 520 of the respective person is generated from the plurality of test features using the regression neural network.
  • the respective 3D test model 520 includes a first set of test parameters 522 describing at least a pose and a shape of a respective human body of the respective person and a second set of test parameters 524 concerning a plurality of vertices of the respective human body.
  • a new image 612 is rendered based on the respective 3D test model of the respective person.
  • the second set of test parameters concerning the plurality of vertices include at least a color value of each vertex of the respective human body, and the new image is rendered based on a camera pose including a position and an orientation of the camera.
  • a loss function 616 is determined, e.g., based on an L1 or L2 norm, indicating an overall difference between the rendered new image 612 and the respective test image 502. The filters of the neural networks in the pose estimation model 550 are adjusted to minimize the loss function 616.
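  • A minimal sketch of such an image-reconstruction loss 616 under the stated L1/L2 choice (tensor shapes are assumed):

      import torch

      def photometric_loss(rendered_image, test_image, norm="l1"):
          # rendered_image, test_image: (B, 3, H, W) tensors with values in [0, 1]
          diff = rendered_image - test_image
          if norm == "l1":
              return diff.abs().mean()
          return (diff ** 2).mean()  # L2 variant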
  • the pose estimation network is trained to generate the second set of test parameters 524 (e.g., vertex offset and color) in a semi-supervised manner, e.g., using a Thu depth dataset for supervised training and using differentiable rendering for unsupervised training.
  • both supervised and unsupervised training are implemented for each test image 502, such that the weights of the filters of the neural networks in the pose estimation model 550 are trained to obtain both the first and second sets of human model parameters 522 and 524.
  • FIG. 7 is a flowchart of a method 700 for rendering and driving an avatar 504 based on an image 502 captured by a camera, in accordance with some embodiments.
  • the method 700 is described as being implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof).
  • An example of the client device 104 is a mobile phone 104C or AR glasses 104D.
  • Method 700 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 7 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the computer system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 700 may be combined and/or the order of some operations may be changed.
  • the computer system obtains (702) an image captured by a camera, and extracts (704) a plurality of features from the image using a convolutional neural network.
  • the image includes a person.
  • the person is identified in the image, and the image is cropped to keep an image area containing the person.
  • the plurality of features are extracted from the image area.
  • the computer system extends at least one of a width and a length of the image area to reach a predefined aspect ratio, and resizes the image area to a predefined image resolution while keeping the predefined aspect ratio for the image area.
  • the plurality of features are extracted from the resized image area.
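  • A hedged sketch of this crop-pad-resize step, assuming OpenCV; the 1.0 aspect ratio and 224x224 output resolution are illustrative values, not values recited by the embodiments:

      import cv2

      def prepare_human_region(image, bbox, aspect=1.0, out_size=(224, 224)):
          # bbox: (x, y, w, h) of the detected person in the image
          x, y, w, h = bbox
          crop = image[y:y + h, x:x + w]
          # extend the shorter side so that width / height reaches the predefined aspect ratio
          target_w = max(w, int(round(h * aspect)))
          target_h = max(h, int(round(w / aspect)))
          pad_w, pad_h = target_w - w, target_h - h
          crop = cv2.copyMakeBorder(crop, pad_h // 2, pad_h - pad_h // 2,
                                    pad_w // 2, pad_w - pad_w // 2,
                                    cv2.BORDER_CONSTANT, value=0)
          # resizing now keeps the predefined aspect ratio because padding already fixed it
          return cv2.resize(crop, out_size)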
  • the computer system generates (706) a 3D human model of the person from the plurality of features using a regression neural network.
  • the 3D human model includes (708) a first set of human model parameters describing at least a pose and a shape of a human body of the person and a second set of human model parameters concerning a plurality of vertices of the human body.
  • the regression neural network includes an output neural network layer, and the first set of human model parameters and the second set of human model parameters of the 3D human model are outputted from the output neural network layer.
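  • A minimal sketch of such an output layer, assuming a PyTorch linear head; the 2048-d feature size, the 24x3 pose / 10 shape / 3 camera split, and the fixed mesh of 6890 vertices (an SMPL-like template) are all assumptions for illustration:

      import torch.nn as nn

      class RegressionHead(nn.Module):
          def __init__(self, feat_dim=2048, num_vertices=6890):
              super().__init__()
              self.num_vertices = num_vertices
              self.first_set = nn.Linear(feat_dim, 24 * 3 + 10 + 3)    # parameters 522
              self.second_set = nn.Linear(feat_dim, num_vertices * 6)  # parameters 524

          def forward(self, features):
              p1 = self.first_set(features)
              p2 = self.second_set(features)
              pose, shape, camera = p1[:, :72], p1[:, 72:82], p1[:, 82:]
              offsets, colors = p2.chunk(2, dim=-1)
              offsets = offsets.view(-1, self.num_vertices, 3)   # per-vertex 3D offsets
              colors = colors.view(-1, self.num_vertices, 3)     # per-vertex RGB colors
              return pose, shape, camera, offsets, colors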
  • the computer system then renders (710) an avatar based on the 3D human model of the person. In some embodiments, this avatar is rendered substantially in real time with capturing the image. That is, a latency between avatar rendering and image capturing is less than a threshold time duration (e.g., 10 milliseconds).
  • the first set of human model parameters describe (712) the pose of the human body via positional information of a plurality of key points of the human body in the image.
  • the plurality of key points include a root point, and the positional information of each key point includes a 3D rotational position of the respective key point measured with respect to the root point.
  • each key point is measured under a spherical coordinate centered at the root point.
  • each key point corresponds to a joint, and the plurality of key points has 24 key points corresponding to 24 joints of the human body.
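  • Purely as an assumed illustration (the embodiments describe the rotations in a spherical coordinate centered at the root point; an axis-angle encoding is shown here only as one common alternative), per-joint 3D rotations can be converted to rotation matrices with SciPy:

      from scipy.spatial.transform import Rotation as R

      def joint_rotations_to_matrices(joint_axis_angle):
          # joint_axis_angle: (24, 3) rotation of each key point relative to the root point
          return R.from_rotvec(joint_axis_angle).as_matrix()  # (24, 3, 3) rotation matrices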
  • the first set of human model parameters further include a plurality of shape characteristics describing the shape of the human body.
  • the first set of human model parameters further includes positional information of the camera.
  • the 3D human model of the person is generated from the plurality of features by determining the positional information of the camera in a virtual 3D space associated with a scene in which the image is captured by the camera. A camera position is regressed directly from the regression neural network. It is assumed that the camera is oriented along a Z-axis.
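  • A minimal sketch of the assumed Z-axis camera, treating the regressed camera as a weak-perspective triple (scale, tx, ty); these parameter names are hypothetical:

      import torch

      def weak_perspective_project(points_3d, cam):
          # points_3d: (B, N, 3); cam: (B, 3) = (scale, tx, ty) regressed by the network.
          # The camera is assumed to be oriented along the Z-axis, so projection drops Z.
          scale = cam[:, 0:1].unsqueeze(-1)            # (B, 1, 1)
          trans = cam[:, 1:3].unsqueeze(1)             # (B, 1, 2)
          return scale * points_3d[..., :2] + trans    # (B, N, 2) image-plane coordinates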
  • the 3D human model is meshed (714) to the plurality of vertices
  • the second set of human model parameters includes at least a 3D vertex offset and a vertex color value of each of the plurality of vertices, the 3D vertex offset indicating a positional deviation between a location of the respective vertex of the 3D human model and a location of a corresponding spot of the human body.
  • the plurality of vertices have a predefined number of vertices, and the predefined number is fixed and not adaptively adjusted by the regression neural network during training.
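  • A minimal sketch of how the second set of human model parameters can be applied to the fixed-size mesh (array shapes are assumptions):

      import numpy as np

      def apply_vertex_parameters(template_vertices, vertex_offsets, vertex_colors):
          # template_vertices: (V, 3) vertices of the meshed parametric body (fixed count V)
          # vertex_offsets:    (V, 3) positional deviation toward the person's true surface
          # vertex_colors:     (V, 3) per-vertex RGB values in [0, 1]
          corrected = template_vertices + vertex_offsets
          colors = np.clip(vertex_colors, 0.0, 1.0)
          return corrected, colors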
  • the avatar is rendered in a user application executed at an electronic device configured to implement the method.
  • the camera that captures the image is optionally integrated on this electronic device or a distinct, remote electronic device.
  • the convolutional neural network and the regression neural network are trained remotely in a server, and provided to the electronic device.
  • the convolutional neural network and the regression neural network are both trained and applied in the electronic device.
  • the electronic device is a portable device, e.g., a mobile phone, a tablet computer, a laptop computer.
  • the electronic device provides the 3D human model to a distinct electronic device, and the avatar is rendered on the distinct electronic device based on the 3D human model provided by the electronic device.
  • the user application is one of an image processing application configured to render one or more augmented reality (AR) effects on the avatar, a gaming application configured to place the avatar in a game scene, and a health application configured to evaluate human health conditions and behaviors based on the 3D human model of the person.
  • the electronic device includes one of a GPU and a DSP, the one of the GPU and DSP having a precision setting and configured to implement the method.
  • Each of the convolutional neural network and the regression neural network includes one or more layers, and each layer has a plurality of weights associated with each filter in the respective layer.
  • the plurality of weights of each layer are quantized according to the precision setting after the convolutional neural network and the regression neural network are trained. For example, the plurality of weights maintain a float32 format during training, and all of the plurality of weights of the convolutional neural network and the regression neural network are then quantized to an int8, uint8, int16, or uint16 format.
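  • A hedged sketch of such post-training weight quantization, shown here for a symmetric signed format (an unsigned format such as uint8 would additionally need a zero-point offset); NumPy is used for illustration only:

      import numpy as np

      def quantize_layer_weights(weights, dtype=np.int8):
          # weights: trained float32 weights of one layer's filters
          info = np.iinfo(dtype)
          max_abs = float(np.abs(weights).max()) or 1.0   # avoid division by zero
          scale = max_abs / info.max
          q = np.clip(np.round(weights / scale), info.min, info.max).astype(dtype)
          return q, scale  # dequantize later as q.astype(np.float32) * scale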
  • the convolutional neural network and regression neural network are trained end-to-end in a supervised manner for the first set of human model parameters, e.g., using one or more public database sets (e.g., MPII, COCO, Human3.6M).
  • the convolutional neural network and the regression neural network are trained by a server and provided to an electronic device to infer the 3D human model and render the avatar.
  • the convolutional neural network and regression neural network are at least partially trained end-to-end in an un-supervised manner.
  • the computer system obtains one or more test images. Each test image includes a respective person.
  • the computer system extracts a plurality of test features from the respective test image using the convolutional neural network and generates a respective 3D test model of the respective person from the plurality of test features using the regression neural network.
  • the respective 3D test model includes a first set of test parameters describing at least a pose and a shape of a respective human body of the respective person and a second set of test parameters concerning a plurality of vertices (e.g., a color value of each vertex) of the respective human body.
  • the computer system renders a new image based on the respective 3D test model of the respective person.
  • the computer system establishes a loss function indicating an overall difference between the rendered new image and the respective test image, e.g., based on an L1 or L2 norm.
  • the computer system adjusts the convolutional neural network and the regression neural network to minimize the loss function.
  • the second set of test parameters concerning the plurality of vertices include at least a color value of each vertex of the respective human body, and the new image is rendered based on a camera pose including a position and an orientation of the camera.
  • the computer system supervises the neural networks using 2D projection of a mesh model (e.g., the 3D human model of the person), and renders a difference between an inferred image and the input test image, thereby allowing the 3D human model to be optimized in an unsupervised manner.
  • the image includes a first image and the 3D human model of the person includes a first 3D human model.
  • the computer system obtains a second image including the person. The second image is captured by the camera subsequently to the first image.
  • the computer system generates a second 3D human model of the person from the second image using the convolutional neural network and regression neural network, and rerenders the avatar based on the second 3D human model of the person, thereby making the avatar track motion of the person.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

This application is directed to driving an avatar based on image data of a person. A computer system obtains an image captured by a camera. The image includes a person. The computer system extracts a plurality of features from the image using a convolutional neural network and generates a 3D human model of the person from the plurality of features using a regression neural network. The 3D human model includes a first set of human model parameters describing at least a pose and a shape of a human body of the person and a second set of human model parameters concerning a plurality of vertices of the human body. An avatar is rendered based on the 3D human model of the person.

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/US2021/047793 WO2023027712A1 (fr) 2021-08-26 2021-08-26 Procédés et systèmes permettant de reconstruire simultanément une pose et des modèles humains 3d paramétriques dans des dispositifs mobiles
CN202180101541.1A CN117916773A (zh) 2021-08-26 2021-08-26 用于在移动设备中同时重建姿态和参数化3d人体模型的方法和系统

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/047793 WO2023027712A1 (fr) 2021-08-26 2021-08-26 Procédés et systèmes permettant de reconstruire simultanément une pose et des modèles humains 3d paramétriques dans des dispositifs mobiles

Publications (1)

Publication Number Publication Date
WO2023027712A1 true WO2023027712A1 (fr) 2023-03-02

Family

ID=85323111

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/047793 WO2023027712A1 (fr) 2021-08-26 2021-08-26 Procédés et systèmes permettant de reconstruire simultanément une pose et des modèles humains 3d paramétriques dans des dispositifs mobiles

Country Status (2)

Country Link
CN (1) CN117916773A (fr)
WO (1) WO2023027712A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758124A (zh) * 2023-06-16 2023-09-15 北京代码空间科技有限公司 一种3d模型修正方法及终端设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150042663A1 (en) * 2013-08-09 2015-02-12 David Mandel System and method for creating avatars or animated sequences using human body features extracted from a still image
US20150262333A1 (en) * 2010-07-13 2015-09-17 Google Inc. Method and system for automatically cropping images
US20160042548A1 (en) * 2014-03-19 2016-02-11 Intel Corporation Facial expression and/or interaction driven avatar apparatus and method
US20190035149A1 (en) * 2015-08-14 2019-01-31 Metail Limited Methods of generating personalized 3d head models or 3d body models


Also Published As

Publication number Publication date
CN117916773A (zh) 2024-04-19

Similar Documents

Publication Publication Date Title
US11132845B2 (en) Real-world object recognition for computing device
US20230013451A1 (en) Information pushing method in vehicle driving scene and related apparatus
CN113835522A (zh) 手语视频生成、翻译、客服方法、设备和可读介质
WO2021077140A2 (fr) Systèmes et procédés de transfert de connaissance préalable pour la retouche d'image
WO2023101679A1 (fr) Récupération inter-modale d'image de texte sur la base d'une expansion de mots virtuels
CN115244495A (zh) 针对虚拟环境运动的实时式样
WO2021092600A2 (fr) Réseau pose-over-parts pour estimation de pose multi-personnes
WO2023102223A1 (fr) Apprentissage multitâche en couplage croisé pour cartographie de profondeur et segmentation sémantique
CN116391209A (zh) 现实的音频驱动的3d化身生成
CN108509830B (zh) 一种视频数据处理方法及设备
WO2022148248A1 (fr) Procédé d'entraînement de modèle de traitement d'image, procédé et appareil de traitement d'image, dispositif électronique et produit programme informatique
WO2023027712A1 (fr) Procédés et systèmes permettant de reconstruire simultanément une pose et des modèles humains 3d paramétriques dans des dispositifs mobiles
WO2024055748A1 (fr) Procédé et appareil d'estimation de posture de tête, ainsi que dispositif et support de stockage
WO2023086398A1 (fr) Réseaux de rendu 3d basés sur des champs de radiance neurale de réfraction
US11188787B1 (en) End-to-end room layout estimation
WO2023277877A1 (fr) Détection et reconstruction de plan sémantique 3d
WO2023133285A1 (fr) Anticrénelage de bordures d'objet comportant un mélange alpha de multiples surfaces 3d segmentées
WO2023091131A1 (fr) Procédés et systèmes pour récupérer des images sur la base de caractéristiques de plan sémantique
WO2023277888A1 (fr) Suivi de la main selon multiples perspectives
US20240153184A1 (en) Real-time hand-held markerless human motion recording and avatar rendering in a mobile platform
US11734888B2 (en) Real-time 3D facial animation from binocular video
WO2023069086A1 (fr) Système et procédé de ré-éclairage de portrait dynamique
WO2023063944A1 (fr) Reconnaissance de gestes de la main en deux étapes
US20240087344A1 (en) Real-time scene text area detection
CN117813626A (zh) 从多视图立体(mvs)图像重建深度信息

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21955232

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180101541.1

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE