WO2023069085A1 - Systems and methods of hand image synthesis - Google Patents

Systems and methods of hand image synthesis

Info

Publication number
WO2023069085A1
Authority
WO
WIPO (PCT)
Prior art keywords
hand
rigged
model
data
image
Prior art date
Application number
PCT/US2021/055775
Other languages
English (en)
Inventor
Celong LIU
Yuanzhou HA
Lingyu Wang
Yang Zhou
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2021/055775 priority Critical patent/WO2023069085A1/fr
Publication of WO2023069085A1 publication Critical patent/WO2023069085A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Definitions

  • This application relates generally to image processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for generating realistic hand images.
  • 3D hand model generation processes apply simple 3D hand models in initial synthetic data generation stages. These 3D hand models are not generated from a photo-realistic renderer and neglect many hand details. Deep learning techniques are applied to convert low-quality synthetic hand data to photo-realistic hand data. However, the deep learning techniques often lead to large biases and generate hand data with erroneous backgrounds. During image rendering, many deep learning techniques do not consider shadow and lighting (e.g., of a random background) and compromise the photorealism of the dataset being fed to training. It would be beneficial to have an efficient mechanism to generate realistic 3D hand images.
  • Synthetic realistic hand images are useful in many applications such as training artificial intelligence or neural network models. Although the realistic hand images are useful in creating training datasets, gathering of millions of realistic hand images is costly and time consuming.
  • This application is directed to generating synthetic realistic hand images efficiently.
  • the synthetic realistic hand images include a 3D hand model that mimics natural hands and appears realistic in these images. Rigging and rendering of the realistic hand images are carefully designed, such that the realistic hand images are close to real photos of the natural hands before any artificial intelligence (AI)/neural network enhancement.
  • a synthetic realistic hand dataset is created for a plurality of hand gestures, and a gesture distribution of the realistic hand dataset is consistent with real-world human use cases.
  • the 3D hand model corresponding to each hand gesture of the realistic hand dataset is fit to a depth image of a real hand gesture captured by a depth sensor system.
  • neural network trained with this synthetic realistic hand dataset is not biased, and the synthetic images have consistent lighting with their background.
  • High dynamic range (HDR) environment maps or random background images are used with estimated lighting, and background lighting is considered and used during image rendering.
  • a method is implemented at an electronic device for rendering realistic hand images with background.
  • the method includes creating a 3D rigged hand model corresponding to a unique combination of hand characteristics, posing the 3D rigged hand model to a hand gesture based on one or more kinesiology parameters, adjusting the hand gesture of the 3D rigged hand model with reference to a plurality of natural gestures, and rendering the 3D rigged hand model having the adjusted hand gesture on a background image to obtain a hand image.
  • adjusting the hand gesture of the 3D rigged hand model with reference to the plurality of natural gestures includes marking key points on hands of a sampling pool of persons; for each of the plurality of natural gestures, capturing and measuring depth data from the marked key points of a respective subset of the hands that are posed with the respective natural gesture; and transferring a subset of the captured depth data of the hands with the plurality of natural gestures to the 3D rigged hand model.
  • rendering the 3D rigged hand model having the adjusted hand gesture on the background image further includes collecting one or more indoor environment maps with a lighting condition, extracting the lighting condition from the environment maps; rendering the 3D rigged hand model having the adjusted hand gesture with a respective camera viewpoint and using a path tracer to obtain the hand image, and applying the lighting condition to the hand image according to the camera viewpoint.
  • some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer- readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.
  • Figure 3 is an example data processing environment for training and applying a neural network based (NN-based) data processing model for processing visual and/or audio data, in accordance with some embodiments.
  • Figure 4A is an example neural network (NN) applied to process content data in an NN-based data processing model, in accordance with some embodiments.
  • Figure 4B is an example node in the neural network (NN), in accordance with some embodiments.
  • Figure 5 is a flow chart of the photo-realistic hand rendering process, in accordance with some embodiments.
  • Figure 6 shows exemplary pictures of hand gestures that are captured by a marker-based depth capturing system and converted to rigged hand models, in accordance with some embodiments.
  • FIG. 7 is a block diagram of a cycle-generative adversarial network (GAN), in accordance with some embodiments.
  • FIG. 8 is a diagram of the task management system with a pool of graphics processing units (GPUs), in accordance with some embodiments.
  • Figure 9 is a flow chart of a method for rendering realistic hand images with a background, in accordance with some embodiments.
  • the methods and systems disclosed herein use depth hand data from key-point-tagged real hand images, and the gesture distribution of the synthetic data is the same as in real-world human use cases.
  • the methods and systems disclosed herein also use a deep generative neural network to learn hand details from real photos and enhance the rendered images.
  • Systems and methods for generating photorealistic hand images are disclosed herein.
  • the generated photorealistic images are used for training deep neural networks for hand gesture recognition.
  • the systems and methods disclosed herein can generate realistic hand images.
  • the images are rendered by a physically correct path tracer that produces photorealistic images of human hands.
  • the 3D hands are posed to a common gesture that a human can easily achieve so that the hand gesture is physically realistic.
  • the systems and methods disclosed herein use a deep generative network to learn hand details from a large-scale real hand database. Then the learned knowledge is applied to improve the details of the rendered hand images.
  • in order for the systems and methods disclosed herein to handle large-scale image generation tasks stably, a GPU pool across several servers and an automatic system are utilized to manage the data generation and to handle the heavy computational cost of path tracing and neural network inferencing.
  • the automatic system has a complete set of infrastructure that supports progress monitoring, failed-task tolerance and restoration, highly concurrent task management, and bad-result recognition.
  • the systems and methods disclosed herein can provide high quality training data for hand detection and hand tracking system.
  • the task of hand detection and tracking is to predict the joints of a hand from an input image.
  • the training data is not easy to obtain. If markers are attached to the joints on the hand, the captured image will be polluted, because the markers are not common in real scenarios. Manually labeling the data is possible, but the cost is high, and the process will be very long and error prone. Moreover, self-occlusion of the hand is also ambiguous for manual labeling.
  • with synthetic data, full control is obtained over what labels are needed for a particular use, such as joints, segmentation masks, depth maps, and visibility maps. The accuracy of labeling can also be guaranteed.
  • the generated hand images are indistinguishable from the real hand photos. This can accelerate the development of hand detection and tracking algorithms to a new level.
  • Photorealistic rendering refers to a physically correct rendering method which turns a three-dimensional (3D) scene to a photo according to specific camera position and lighting condition.
  • Realistic hand gesture generation refers to a task of posing a 3D hand to a common gesture that a human can easily achieve.
  • a deep generative network includes one or more neural network models that have many hidden layers trained from a huge number of samples to handle complex, high-dimensional probability distributions.
  • the models include some different approaches, e.g., normalizing flows (NF), variational autoencoders (VAE), and generative adversarial networks (GAN).
  • Surface registration includes mapping and adding constraints between corresponding points on a template surface and a target surface during a surface deformation.
  • Spherical harmonics (SHs) are a low-dimensional lighting representation (36 values for 5th- degree SH for each color channel) and are predicted with a compact decoder architecture.
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera).
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • the collected data or user inputs are processed locally (e.g., for training and/or for prediction) at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • storage 106 may store video content for training a machine learning model (e.g., deep learning network) and/or video content obtained by a user to which a trained machine learning model is applied to determine one or more actions associated with the video content.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the client devices 104 include a networked surveillance camera and a mobile phone 104C.
  • the networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera in real time and remotely.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communication links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.
  • Deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data.
  • data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data.
  • both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C).
  • the client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models.
  • the client device 104C obtains the content data (e.g., captures image or video data via an internal camera) and processes the content data locally using the trained data processing models.
  • both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g. the client device 104A).
  • the server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models.
  • the client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application).
  • the client device 104A itself implements no or little data processing on the content data prior to sending them to the server 102A.
  • data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B.
  • the server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models.
  • the trained data processing models are optionally stored in the server 102B or storage 106.
  • the client device 104B imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally.
  • FIG. 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments.
  • the data processing system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof.
  • the data processing system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the data processing system 200 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more cameras, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium.
  • memory 206 stores the following programs, modules, and data structures, or a subset or superset thereof: • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Model training module 226 for receiving training data (e.g., training data 238) and establishing a data processing model (e.g., data processing module 228) for processing content data (e.g., video data, visual data, audio data) to be collected or obtained by a client device 104;
  • Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224; and
  • One or more databases 230 for storing at least data including one or more of:
    o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
    o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
    o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name; and
    o Training data 238 for training one or more data processing models 240.
  • the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • memory 206 optionally, stores a subset of the modules and data structures identified above.
  • memory 206 optionally, stores additional modules and data structures not described above.
  • FIG. 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video data, visual data, audio data), in accordance with some embodiments.
  • the data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240.
  • both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104.
  • the training data source 304 is optionally a server 102 or storage 106.
  • both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300.
  • the training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106.
  • the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.
  • the model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312.
  • the data processing model 240 is trained according to a type of the content data to be processed.
  • the training data 306 is consistent with the type of the content data, and a data pre-processing module 308 is applied to process the training data 306 consistent with the type of the content data.
  • a video pre-processing module 308 is configured to process video training data 306 to a predefined image format, e.g., group frames (e.g., video frames, visual frames) of the video content into video segments.
  • the data pre-processing module 308 may also extract a region of interest (ROI) in each frame or separate a frame into foreground and background components, and crop each frame to a predefined image size.
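  • As an illustrative, non-limiting sketch of such pre-processing, the following Python fragment groups video frames into fixed-length segments and crops each frame; the segment length, crop box, and function name are assumed for illustration only and are not part of the disclosure.

```python
import numpy as np

def preprocess_video(frames, segment_len=16, crop_box=(0, 0, 224, 224)):
    """frames: list of HxWx3 arrays -> list of stacked, cropped fixed-length segments."""
    x0, y0, x1, y1 = crop_box
    cropped = [f[y0:y1, x0:x1] for f in frames]                  # crop each frame to a predefined size
    return [np.stack(cropped[i:i + segment_len])                 # group frames into video segments
            for i in range(0, len(cropped) - segment_len + 1, segment_len)]
```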
  • the model training engine 310 receives pre-processed training data provided by the data preprocessing module(s) 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item.
  • the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item.
  • the model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criteria (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold).
  • the modified data processing model 240 is provided to the data processing module 228 to process the content data.
  • the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled.
  • the model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.
  • the data processing module 228 includes one or more data pre-processing modules 314, a model-based processing module 316, and a data post-processing module 318.
  • the data pre-processing modules 314 pre-process the content data based on the type of the content data. Functions of the data pre-processing modules 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by the inputs of the model-based processing module 316.
  • Examples of the content data include one or more of: video data, visual data (e.g., image data), audio data, textual data, and other types of data. For example, each video is pre-processed to group frames in the video into video segments.
  • the model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data.
  • the model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 240.
  • the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
  • Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments.
  • Figure 4B is an example node 420 in the neural network 400, in accordance with some embodiments.
  • the data processing model 240 is established based on the neural network 400.
  • a corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format.
  • the neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs.
  • the node output is provided via one or more links 412 to one or more other nodes 420
  • a weight w associated with each link 412 is applied to the node output.
  • the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function.
  • the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
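  • As an illustrative, non-limiting sketch, the propagation function of a single node can be written as a non-linear activation applied to the linear weighted combination of the node inputs; the rectified linear unit used below is only one possible choice of activation.

```python
import numpy as np

def node_output(inputs, weights, bias=0.0):
    """Propagation function of one node 420: activation(w1*x1 + w2*x2 + ... + b)."""
    z = np.dot(weights, inputs) + bias   # linear weighted combination of the node inputs
    return max(z, 0.0)                   # non-linear activation (rectified linear unit)
```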
  • the collection of nodes 420 is organized into one or more layers in the neural network 400.
  • the one or more layers includes a single layer acting as both an input layer and an output layer.
  • the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input layer 402 and the output layer 406.
  • a deep neural network has more than one hidden layer 404 between the input layer 402 and the output layer 406.
  • each layer may be only connected with its immediately preceding and/or immediately following layer.
  • a layer 402 or 404B is a fully connected neural network layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer.
  • one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers.
  • max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
  • a convolutional neural network is applied in a data processing model 240 to process content data (particularly, video and image data).
  • the CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that moves data forward from the input layer 402 through the hidden layers to the output layer 406.
  • the one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product.
  • Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network.
  • Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN.
  • the pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map.
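  • As an illustrative, non-limiting sketch of the convolution operation, the naive routine below computes each output value from a small kernel-sized receptive area of the previous layer; as is conventional in deep learning, no kernel flip is applied.

```python
import numpy as np

def conv2d_valid(feature_map, kernel):
    """'Valid' 2-D convolution: each output node only sees a kernel-sized receptive area."""
    kh, kw = kernel.shape
    h, w = feature_map.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feature_map[i:i + kh, j:j + kw] * kernel)  # dot product over the receptive area
    return out
```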
  • a recurrent neural network is applied in the data processing model 240 to process content data (particularly, visual data and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior.
  • each node 420 of the RNN has a time-varying real-valued activation.
  • the RNN examples include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently RNN (IndRNN), a recursive neural network, and a neural history compressor.
  • a generative neural network is applied in the data processing model 240 to process content data (particularly, visual data and audio data).
  • a generative neural network is trained by providing it with a large amount of data (e.g., millions of images, sentences, or sounds) and then the neural network is trained to generate data like the input data.
  • the generative neural networks have a significantly smaller number of parameters than the amount of input training data, so the generative neural networks are forced to find and efficiently internalize the essence of the data for generating data.
  • two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
  • the training process is a process for calibrating all of the weights wi for each layer of the learning model using a training data set which is provided in the input layer 402.
  • the training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied.
  • during forward propagation, the set of weights for different layers is applied to the input data and intermediate results from the previous layers.
  • during backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error.
  • the activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types.
  • a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied.
  • the network bias b provides a perturbation that helps the neural network 400 avoid overfitting the training data.
  • the result of the training includes the network bias parameter b for each layer.
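  • As an illustrative, non-limiting sketch, the two-step training loop for a single linear layer with a bias term b may look as follows; the squared-error loss, learning rate, and epoch count are assumed for illustration only.

```python
import numpy as np

def train_linear_layer(x, y, lr=0.01, epochs=100):
    """x: (n, d) inputs, y: (n,) targets. Returns calibrated weights w and bias b."""
    rng = np.random.default_rng(0)
    w, b = rng.normal(size=x.shape[1]), 0.0
    for _ in range(epochs):
        pred = x @ w + b                 # forward propagation: weighted inputs plus bias b
        err = pred - y                   # margin of error of the output
        w -= lr * (x.T @ err) / len(y)   # backward propagation: adjust weights to decrease the error
        b -= lr * err.mean()             # the bias term b is calibrated as well
    return w, b
```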
  • FIG. 5 is a flow chart of the photorealistic hand rendering process 500, in accordance with some embodiments.
  • a 3D hand model creation operation 510 is implemented.
  • a number of high-resolution scanned 3D hand models with different characteristics such as ethnicity, skin color and/or age are obtained.
  • the hand models are scanned from real persons with different characteristics such as skin color, skin textures, and nail color.
  • the scanned hand models have a high resolution (e.g., a resolution that exceeds a threshold resolution), such that pores and textures on a hand can be seen on the hand models.
  • the scanned 3D hand models are in the form of polygonal mesh and color textures.
  • the scanned 3D models are in the form of points with associated colors.
  • the number of scanned hand models is from 10 to 20. In some embodiments, the number of scanned hand models is from 20 to 50. In some embodiments, the number of scanned hand models is from 50 to 100. In some embodiments, the number of scanned hand models is 100 or more.
  • a rigging system is built for each hand model according to human hand kinesiology. For example, the rigging system is built by adding one or more key points, manually or with software, on the hand models and associating each key point with muscle movement. The rigging system defines a set of gesture parameters, and the hand models are posed to one or more gestures by the set of gesture parameters. The hand model is deformed and posed based on the rigging system, and muscles of the hand models are shaped accordingly.
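  • As an illustrative, non-limiting sketch of posing a rigged chain of key points by gesture parameters, the forward-kinematics fragment below accumulates one flexion angle per joint; the single-axis rotation and the bone offsets are simplifying assumptions, not the actual rigging system.

```python
import numpy as np

def rotation_x(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def pose_finger(bone_offsets, joint_angles):
    """bone_offsets: list of rest-pose bone vectors; joint_angles: one flexion angle per joint.
    Returns the posed key-point positions of the chain."""
    positions, rotation = [np.zeros(3)], np.eye(3)
    for offset, angle in zip(bone_offsets, joint_angles):
        rotation = rotation @ rotation_x(angle)              # accumulate this joint's rotation
        positions.append(positions[-1] + rotation @ offset)  # place the next key point
    return np.array(positions)
```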
  • a gesture generation process 520 is implemented after 3D hand model creation 510.
  • the hand model adopts random gestures, and the gestures of the hand models are randomly sampled. All possible gestures are randomly sampled with the same probability. Under some circumstances, this gesture probability distribution is not realistic and can fool the neural network and harm its performance.
  • a real human uses some gestures more frequently than other gestures.
  • a marker based capturing system such as TrakStar is used to provide a reasonable gesture sampling distribution.
  • the marker based capturing system utilizes a plurality of sensors, such as magnetic or light trackers, and simultaneously tracks motions of each sensor at a high frequency, for example, from 200 to 500 times per second.
  • the marker based capturing system dynamically monitors motion of people and objects for interactive real-time visualization.
  • a ToF camera measures depth data of gestures from real persons with the key point markers.
  • ToF is a method for measuring distance (depth) between an object and a sensor, based on a time difference between transmission of a signal (such as light) from the sensor to the object and reflection of the signal from the object to the sensor.
  • ToF sensors are used for measuring the depth data of the hand, and these depth data are collected from more than one person, e.g., from 10 to 100 persons.
  • a fitting algorithm is developed to fit the rigged hand model to the depth data gathered from real persons. This fitting algorithm is configured to optimize orientation of each joint for the rigged hand model by minimizing a surface distance between the depth data and the rigged hand model.
  • the fitting algorithm is a non-linear optimization algorithm.
  • the non-linear optimization algorithm is solved in a manner analogous to a non-rigid surface registration algorithm.
  • the non-rigid surface registration algorithm loops over a plurality of stiffness weights that gradually decrease, and incrementally finds the optimal deformation for a given stiffness using an optimal iterative closest point step.
  • the non-rigid surface registration algorithm deforms a template for some correspondences estimated from a nearest-point search and calculates the active stiffness.
  • the non-rigid surface registration algorithm continues to deform the template towards a target with new correspondences from searching of the displaced template vertices.
  • the non-rigid surface registration algorithm utilizes a locally affine regularization to assign an affine transformation to each vertex and minimizes the distance of the transformation of neighboring vertices.
  • the optimal deformation for fixed correspondences and fixed stiffness is determined efficiently and exactly by using the regularization of the non-rigid surface registration algorithm.
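  • As an illustrative, non-limiting sketch of the joint-orientation fitting, the fragment below minimizes nearest-point surface distances between the posed model and the captured depth data, with a small stiffness-style regularizer that keeps the pose near its prior; the sample_surface helper, the optimizer, and the regularizer weight are assumptions rather than the disclosed algorithm.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial import cKDTree

def fit_joint_angles(sample_surface, depth_points, initial_angles, stiffness=1e-2):
    """sample_surface(angles) -> (N, 3) points on the posed rigged hand (hypothetical helper);
    depth_points: (M, 3) ToF point cloud; initial_angles: starting joint orientations."""
    initial_angles = np.asarray(initial_angles, dtype=float)
    tree = cKDTree(depth_points)                       # nearest-point correspondences

    def residuals(angles):
        dists, _ = tree.query(sample_surface(angles))  # surface distance to the depth data
        prior = stiffness * (angles - initial_angles)  # stiffness-style term toward the prior pose
        return np.concatenate([dists, prior])

    return least_squares(residuals, initial_angles).x  # optimized joint orientations
```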
  • Infrared (IR) images captured by a ToF camera are shown in column (a) on the left, and overlay images that consist of the IR images and background images are shown in column (c) on the right.
  • images are rendered from the rigged hand model, IR images, and background images according to the methods and systems disclosed herein.
  • a process of rendering with photorealistic path tracer 530 is implemented after the gesture generation 520 process.
  • a path tracer is used to render the hand with a random camera viewpoint.
  • a path tracer gathers information about light and color by casting many rays into the scene to shade a particular pixel. The path tracer can generate unbiased result and is not limited to the number of samples used.
  • a path tracer mimics the way that real light travels through and bounces off objects to render how real light behaves.
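  • As an illustrative, non-limiting sketch of the Monte Carlo nature of path tracing, the fragment below averages the radiance of many randomly traced light paths for one pixel; the sample_path_radiance callable stands in for the actual ray casting and shading and is a hypothetical placeholder.

```python
import numpy as np

def estimate_pixel(sample_path_radiance, n_samples=256, rng=None):
    """Average the RGB radiance of n_samples random light paths for one pixel.
    The estimate is unbiased; more samples reduce variance (noise), not bias."""
    rng = rng or np.random.default_rng()
    return sum(sample_path_radiance(rng) for _ in range(n_samples)) / n_samples
```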
  • the first type of background is from a collection of indoor high dynamic range (HDR) environment maps, which provide the environment lighting that is used in the renderer.
  • a renderer is a process that generates images from two-dimensional (2D) or 3D models on a computer. During the rendering, backgrounds are automatically generated.
  • the second type of background is from a large-scale image database. An image is randomly selected, and the lighting is estimated from this image.
  • the lighting estimation algorithm is based on Differentiable Screen-Space Rendering (DSSR).
  • the lighting is parameterized using 5th-order spherical harmonics (SH), and a microfacet BRDF model is used.
  • a microfacet BRDF model is a shading model which compares many options for each term and uses the same input parameters. The microfacet BRDF model evaluates the importance of each term compared with more efficient alternatives. Transmission and self-luminous effects are not considered in DSSR.
  • the integral of the rendering equation is formulated analytically as a low-cost linear combination of polynomials of the attributes of the three decoders for diffuse albedo (A), specular roughness (R), and surface normal (N), together with the SH coefficients.
  • DSSR is differentiable.
  • the estimated lighting is used in the renderer to ensure the appearance of the rendered hand is consistent with the background.
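  • As an illustrative, non-limiting sketch, spherical harmonics up to degree l_max have (l_max + 1)^2 basis functions, so the 5th-degree representation mentioned above uses 6^2 = 36 coefficients per color channel; the fragment below also shows one assumed way of applying such SH lighting as a linear combination.

```python
import numpy as np

def num_sh_coeffs(l_max=5):
    """Number of SH basis functions up to degree l_max: (5 + 1)**2 = 36."""
    return (l_max + 1) ** 2

def sh_shaded_color(sh_coeffs, sh_basis_at_normal):
    """sh_coeffs: (3, 36) estimated lighting per color channel; sh_basis_at_normal: (36,)
    SH basis values evaluated for the surface normal direction (assumed precomputed)."""
    return np.asarray(sh_coeffs) @ np.asarray(sh_basis_at_normal)  # -> RGB shading
```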
  • an exposure fusion algorithm is an imaging technique to fuse multiple exposure images of the same scene into a high-quality image.
  • the exposure fusion algorithm can generate an image with a higher dynamic range than a camera is able to capture in a single exposure.
  • unlike most HDR imaging methods, exposure fusion generates the final low dynamic range (LDR) image by seamlessly fusing the best regions of the input image sequence without generating an intermediate HDR image.
  • the bit depth of the final image blended by the exposure fusion algorithm is usually not different from that of the input images.
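  • As an illustrative, non-limiting sketch of exposure fusion, the single-scale fragment below weights each exposure by contrast, saturation, and well-exposedness and blends the normalized result; practical implementations blend across a multi-resolution pyramid, which is omitted here.

```python
import numpy as np

def exposure_fusion(images, sigma=0.2):
    """images: list of HxWx3 float arrays in [0, 1] of the same scene at different exposures."""
    images = [np.asarray(im, dtype=np.float64) for im in images]
    weights = []
    for im in images:
        gy, gx = np.gradient(im.mean(axis=2))                          # simple gradient-based contrast
        contrast = np.abs(gx) + np.abs(gy)
        saturation = im.std(axis=2)                                    # spread across the color channels
        well_exposed = np.exp(-((im - 0.5) ** 2) / (2 * sigma ** 2)).prod(axis=2)
        weights.append(contrast * saturation * well_exposed + 1e-12)
    weights = np.stack(weights)
    weights /= weights.sum(axis=0)                                     # normalize the weights per pixel
    return sum(w[..., None] * im for w, im in zip(weights, images))    # blended LDR result
```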
  • a real hand photo dataset is collected from a number of real persons.
  • the number of real persons is 10 to 500.
  • the hand photos of the real persons are captured before a green screen so that the hands are easily segmented by thresholding based on color.
  • the neural networks illustrated in Figures 1, 3, 4A, and 4B are trained using the photo dataset.
  • a generative neural network is trained using the photo dataset.
  • a Cycle-GAN is trained on the dataset.
  • an adversarial loss is utilized that forces the generated images to be indistinguishable from the real images.
  • Figure 7 illustrates a Cycle-GAN, in accordance with some embodiments.
  • an adversarial loss is also utilized to learn the mapping such that the translated image cannot be distinguished from the target images.
  • Mapping functions are learned between two domains S (702) and T (704) given training samples. Two mappings X (706): S → T and Y (708): T → S are included.
  • the learned mapping functions are cycle-consistent in Cycle-GAN.
  • the image translation cycle should be able to bring the image back to the original image, to reach a forward cycle consistency.
  • Y and X should also satisfy backward cycle consistency.
  • a cycle consistency loss is utilized as a way of using transitivity to supervise the training in some instances. For example, after mapping X (706), the sample 710 in the domain S becomes the sample 712 in the domain T, and after mapping Y (708), the sample 712 in the domain T becomes sample 716 in the domain S. The difference between the sample 710 and 716 is represented as a cycle consistency loss 718.
  • the sample 714 in the domain T becomes the sample 720 in the domain S
  • the sample 720 in the domain S becomes sample 722 in the domain T.
  • the difference between the sample 714 and 722 is represented as a cycle consistency loss 724.
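  • As an illustrative, non-limiting sketch of the cycle consistency losses 718 and 724, the fragment below computes the L1 differences between the original samples and the samples mapped through both generators; a PyTorch-style formulation is assumed, and x_mapping and y_mapping stand for the generator networks X and Y.

```python
import torch

def cycle_consistency_loss(x_mapping, y_mapping, s_batch, t_batch):
    """Forward cycle S -> T -> S (loss 718) plus backward cycle T -> S -> T (loss 724)."""
    forward_cycle = torch.mean(torch.abs(y_mapping(x_mapping(s_batch)) - s_batch))
    backward_cycle = torch.mean(torch.abs(x_mapping(y_mapping(t_batch)) - t_batch))
    return forward_cycle + backward_cycle
```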
  • the trained neural network is applied to enhance the hand details of the rendered hand images.
  • a stable rendering task management system 550 is implemented for a stable rendering of the hand images.
  • the stable rendering task management system 550 is applied to one or more of the processes of rendering with the photorealistic path tracer 530 and GAN-based detail enhancement 540.
  • the process of rendering with photo-realistic path tracer 530 is computationally heavy. It can take 10 seconds to render a single image. During the rendering process, rendering failure occasionally occurs.
  • Figure 8 is a diagram of the task management system with the pool of graphics processing units (GPUs), in accordance with some embodiments.
  • Synthesis of an image is called a task, such as task 802-1, 802-2, 802-3, ..., 802-n, and dataset generation is regarded as handling a task queue.
  • this management system will find an available graphics processing unit (GPU) from a pool of GPUs, such as 804-1, 804-2, 804-3 . . .804-n, and assign the available GPU to each incoming task.
  • rendering failures will be handled, and the task will be reset 808 if failure occurs.
  • all resources will be released, and the GPU will be put back to the pool.
  • tools are developed that can monitor the progress of the task queue.
  • a number of GPUs are used, for example, 32 Nvidia V100 video cards and the final data generation speed is 30 pictures per second. It takes about 18.5 hours to generate a dataset with 2 million hand images.
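  • As an illustrative, non-limiting sketch of such task queue handling, the fragment below assigns each incoming task an available GPU from the pool, resets and re-queues a task if rendering fails, and always releases the GPU back to the pool; the render_one worker and the retry limit are assumptions for illustration only.

```python
import queue
import threading

def run_task_queue(tasks, gpu_ids, render_one, max_retries=3):
    """render_one(task, gpu_id) synthesizes one image (hypothetical worker function)."""
    task_q, gpu_pool = queue.Queue(), queue.Queue()
    for task in tasks:
        task_q.put((task, 0))                        # (task, number of attempts so far)
    for gpu in gpu_ids:
        gpu_pool.put(gpu)

    def worker():
        while True:
            try:
                task, attempts = task_q.get_nowait()
            except queue.Empty:
                return                               # no tasks left for this worker
            gpu = gpu_pool.get()                     # find an available GPU from the pool
            try:
                render_one(task, gpu)
            except Exception:
                if attempts + 1 < max_retries:
                    task_q.put((task, attempts + 1)) # reset the failed task and retry
            finally:
                gpu_pool.put(gpu)                    # release the GPU back to the pool

    threads = [threading.Thread(target=worker) for _ in gpu_ids]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
```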
  • Figure 9 is a flowchart of a method 900 for rendering realistic hand images with background, in accordance with some embodiments.
  • the method 900 is implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof).
  • An example of the client device 104 is a mobile phone.
  • Method 900 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system.
  • Each of the operations shown in Figure 9 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 900 may be combined and/or the order of some operations may be changed.
  • the computer system creates (910) a 3D rigged hand model corresponding to a unique combination of hand characteristics.
  • the hand characteristics include: ethnicity, skin color and age.
  • the computer system poses (920) the 3D rigged hand model to a hand gesture based on one or more kinesiology parameters.
  • the computer system additionally adjusts (930) the hand gesture of the 3D rigged hand model with reference to a plurality of natural gestures.
  • the computer system further renders (940) the 3D rigged hand model having the adjusted hand gesture on a background image to obtain a hand image.
  • the computer system scans a hand having the unique combination of hand characteristics to obtain the 3D rigged hand model. In some embodiments, in the step (910) of creating the 3D rigged hand models, the computer system further obtains a plurality of rigged hand models based on a plurality of unique combinations of hand characteristics; and selects the 3D rigged hand model from the plurality of rigged hand models.
  • the computer system marks key points on hands of a sampling pool of persons; for each of the plurality of natural gestures, the computer system captures and measures the depth data from the marked key points of a respective subset of the hands that are posed with the respective natural gesture; and the computer system transfers a subset of the captured depth data of the hands with the plurality of natural gestures to the 3D rigged hand model.
  • in the step of transferring the captured depth data, the computer system forms a fitting between the depth data and the 3D rigged hand model, and the computer system optimizes orientations of each joint of the 3D rigged hand model based on the formed fitting.
  • the computer system minimizes a surface distance between the depth data and the 3D rigged hand model for each of the marked key points.
  • the computer system collects one or more indoor environment maps with a lighting condition; the computer system extracts the lighting condition from the environment maps; the computer system renders the 3D rigged hand model having the adjusted hand gesture with a respective camera viewpoint and using a path tracer to obtain the hand image; and the computer system applies the lighting condition to the hand image according to the camera viewpoint.
  • the computer system further selects the background image; the computer system estimates a respective lighting condition from the background image; and the computer system applies the estimated respective lighting condition in rendering the 3D rigged hand model on the background image with a respective camera viewpoint using a path tracer.
  • in the step (940) of rendering the 3D rigged hand model having the adjusted hand gesture on the background image to obtain the hand image, the computer system further adjusts a tone of the hand image according to a fusion of a hand gesture tone and a respective background image tone.
  • the computer system further enhances detailed hand features of the rendered 3D rigged hand model of the hand image to render a final hand image.
  • the computer system collects a plurality of real hand photos, trains a neural network to learn the detailed hand features of the real hand photos; and applies the trained neural network to the rendered hand images to enhance the detailed hand features.
  • the final hand image is rendered using a pool of GPUs that forms a task management system.
  • a task management system has a throughput of at least 30 hand pictures per second.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

This application relates to a method implemented at an electronic device for rendering realistic hand images. A computer system creates a three-dimensional (3D) rigged hand model having unique hand characteristics. The 3D rigged hand model is posed to a hand gesture based on one or more kinesiology parameters. The hand gesture of the 3D rigged hand model is adjusted with reference to one or more natural gestures and rendered on a background image to obtain a hand image. In some embodiments, for each natural gesture, depth data is captured and measured from one or more marked key points of a respective subset of real hands that are posed with the respective natural gesture. A subset of the depth data of the real hands with the one or more natural gestures is transferred to the 3D rigged hand model.
PCT/US2021/055775 2021-10-20 2021-10-20 Systems and methods of hand image synthesis WO2023069085A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/055775 WO2023069085A1 (fr) 2021-10-20 2021-10-20 Systems and methods of hand image synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/055775 WO2023069085A1 (fr) 2021-10-20 2021-10-20 Systems and methods of hand image synthesis

Publications (1)

Publication Number Publication Date
WO2023069085A1 true WO2023069085A1 (fr) 2023-04-27

Family

ID=86058481

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/055775 WO2023069085A1 (fr) 2021-10-20 2021-10-20 Systèmes et procédés de synthèse d'images de main

Country Status (1)

Country Link
WO (1) WO2023069085A1 (fr)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180315329A1 (en) * 2017-04-19 2018-11-01 Vidoni, Inc. Augmented reality learning system and method using motion captured virtual hands
US20210142568A1 (en) * 2019-11-08 2021-05-13 Fuji Xerox Co., Ltd. Web-based remote assistance system with context & content-aware 3d hand gesture visualization

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310659A (zh) * 2023-05-17 2023-06-23 中数元宇数字科技(上海)有限公司 Method and device for generating a training data set
CN116310659B (zh) * 2023-05-17 2023-08-08 中数元宇数字科技(上海)有限公司 Method and device for generating a training data set

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21961578

Country of ref document: EP

Kind code of ref document: A1