WO2022096105A1 - 3D tongue reconstruction from single images - Google Patents
3D tongue reconstruction from single images
- Publication number
- WO2022096105A1 (application PCT/EP2020/081148)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- tongue
- synthetic
- head
- latent
- meshes
- Prior art date
Classifications
- G—PHYSICS; G06—COMPUTING, CALCULATING OR COUNTING; G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/20—Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T2210/00—Indexing scheme for image generation or computer graphics
- G06T2210/56—Particle system, point based geometry or rendering
- G06T2219/00—Indexing scheme for manipulating 3D models or images for computer graphics
- G06T2219/20—Indexing scheme for editing of 3D models
- G06T2219/2004—Aligning objects, relative positioning of parts
- G06T2219/2021—Shape modification
Definitions
- the present disclosure relates to 3D reconstruction of human tongues for applications in 3D face reconstruction, animation and/or verification.
- the present disclosure provides, to this end, a method, a computer program and a device.
- a first aspect of the present disclosure provides a method for 3D reconstruction of a tongue, comprising: determining one of a plurality of synthetic tongue-and-head 3D meshes based on an image of the tongue; and modifying the determined synthetic tongue-and-head 3D mesh based on a synthetic tongue 3D point-cloud being indicative of the image of the tongue.
- a combined tongue-and-head 3D model may be obtained which has a 3D tongue pose reconstructed from the single image of the tongue.
- the determining of the synthetic tongue-and-head 3D mesh comprises encoding the image of the tongue into one of a plurality of latent tongue 3D features representing a corresponding one of a plurality of raw tongue 3D point-clouds; transforming the obtained latent tongue 3D feature into a corresponding one of a plurality of latent tongue-and-head expression shape parameters of a corresponding one of the plurality of synthetic tongue-and-head 3D meshes; and converting the obtained latent tongue-and-head expression shape parameters into the corresponding one of the plurality of synthetic tongue-and-head 3D meshes.
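The inference chain described above (encode the image, regress the latent feature to expression parameters, reconstruct the mesh from the PCA basis) reduces to two matrix products at run time. A minimal sketch in NumPy; all dimensions, the regression matrix, the PCA basis and the mean mesh below are illustrative stand-ins for quantities that would be learned in the training phase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 256-D latent tongue feature, 50 expression
# parameters, a head template of 1,000 vertices (3,000 coordinates).
D_LATENT, D_EXPR, N_COORDS = 256, 50, 3000

# Stand-ins for quantities established during training:
W_ty = rng.standard_normal((D_EXPR, D_LATENT)) * 0.01             # first regression matrix
U_t = np.linalg.qr(rng.standard_normal((N_COORDS, D_EXPR)))[0]    # PCA basis (orthonormal columns)
mean_mesh = rng.standard_normal(N_COORDS)                         # mean tongue-and-head template

def reconstruct(y):
    """Map a latent tongue 3D feature y to a synthetic tongue-and-head mesh."""
    p_t = W_ty @ y                  # transform: latent feature -> expression shape parameters
    mesh = mean_mesh + U_t @ p_t    # convert: PCA reconstruction
    return mesh.reshape(-1, 3)      # (n_vertices, 3)

y = rng.standard_normal(D_LATENT)   # would come from the embedding network
mesh = reconstruct(y)
print(mesh.shape)                   # (1000, 3)
```

Because both steps are linear, the whole determining stage is a single cheap matrix pipeline, which is what makes the inference phase computationally inexpensive.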
- the encoding of the image of the tongue into one of a plurality of latent tongue 3D features comprises using an embedding network trained to encode a plurality of images of tongues into the corresponding plurality of the latent tongue 3D features.
- the plurality of the latent tongue 3D features may be established by autoencoding the plurality of raw tongue 3D point-clouds.
- the plurality of images of the tongues comprises a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D point-clouds.
- the plurality of images of the tongues are rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D point-cloud.
- the transforming the obtained latent tongue 3D feature comprises using a first regression matrix W_t,y transforming the plurality of latent tongue 3D features into the plurality of corresponding latent tongue-and-head expression shape parameters of the plurality of corresponding synthetic tongue-and-head 3D meshes.
- the plurality of latent tongue-and-head expression shape parameters are established by applying a second regression matrix to a plurality of first latent tongue-and-mouth expression shape parameters of a plurality of corresponding first tongue-and-mouth landmark vertex sets defined in the plurality of raw tongue 3D point-clouds.
- the plurality of first latent tongue-and-mouth expression shape parameters of the plurality of corresponding first tongue-and-mouth landmark vertex sets may be established by using a first Principal Component Analysis (PCA) on the plurality of first tongue-and-mouth landmark vertex sets.
- PCA Principal Component Analysis
- the first PCA may be established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets defined in the plurality of corresponding synthetic tongue-and-head 3D meshes and corresponding to the plurality of first tongue-and-mouth landmark vertex sets in terms of the underlying landmarks.
- the second regression matrix may be established by regressing a plurality of second latent tongue-and-mouth expression shape parameters of the plurality of second tongue-and-mouth landmark vertex sets to the plurality of corresponding latent tongue-and-head expression shape parameters.
- the plurality of latent tongue-and-head expression shape parameters may be established by using a second PCA on the plurality of synthetic tongue-and-head 3D meshes.
- the second PCA may be established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes.
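Establishing such a regression matrix between two sets of latent parameters is a linear least-squares problem. A minimal sketch with entirely synthetic, illustrative data (dimensions and names are assumptions, not the patent's values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 200 synthetic meshes, 20 latent landmark expression
# parameters per mesh, 50 full-mesh expression parameters per mesh.
n, d_land, d_expr = 200, 20, 50
P_l = rng.standard_normal((n, d_land))       # latent landmark parameters (rows: meshes)
W_true = rng.standard_normal((d_expr, d_land))
P_t = P_l @ W_true.T                         # corresponding full-mesh parameters

# Estimate the regression matrix mapping p_l -> p_t by least squares:
sol, *_ = np.linalg.lstsq(P_l, P_t, rcond=None)
W_tl = sol.T                                 # shape (d_expr, d_land)

print(W_tl.shape)  # (50, 20)
```

Since the synthetic relation here is exactly linear, the recovered matrix matches the generating one up to numerical precision; on real data the least-squares fit simply minimizes the residual.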
- the above-mentioned computationally inexpensive inference phase is prepared by associating the latent tongue 3D features of raw tongue 3D point-clouds with appropriate synthetic tongue-and-head 3D meshes in a PCA representation.
- the converting the obtained latent tongue-and-head expression shape parameters comprises using the second PCA to convert the obtained latent tongue-and-head expression shape parameters into the corresponding one of the plurality of synthetic tongue-and-head 3D meshes.
- the first and second tongue-and-mouth landmark vertex sets are arranged circumferentially around a tongue and mouth of the respective raw tongue 3D point-cloud or respective synthetic tongue-and-head 3D mesh.
- the modifying the determined synthetic tongue-and-head 3D mesh comprises generating a plurality of synthetic points of the synthetic tongue 3D point-cloud based on the one of the plurality of latent tongue 3D features; and adapting, based on an optimization procedure, the one of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters.
- a shape of the pre-shaped synthetic tongue-and-head 3D mesh may further be optimized in accordance with the 3D tongue pose depicted in the single input image.
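Because the mesh is a linear function of the expression parameters (mesh = mean + U·p), the gradient of any mesh-space error can be pulled back to the parameters by a single matrix product, U.T @ grad. A toy sketch of this refinement loop, with a plain squared-error metric standing in for the full multi-term loss and all quantities randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n_coords, d_expr = 300, 10
U = np.linalg.qr(rng.standard_normal((n_coords, d_expr)))[0]  # orthonormal PCA basis
mean = rng.standard_normal(n_coords)

# Stand-in for the generated synthetic tongue point-cloud target:
target = mean + U @ rng.standard_normal(d_expr)

p = np.zeros(d_expr)       # expression shape parameters to refine
lr = 0.5
for _ in range(100):
    mesh = mean + U @ p
    grad_mesh = mesh - target      # derivative of 0.5 * ||mesh - target||^2
    p -= lr * (U.T @ grad_mesh)    # back-propagate the error to the parameters

print(float(np.linalg.norm(mean + U @ p - target)))  # shrinks toward machine zero
```

With an orthonormal basis the update contracts the parameter error geometrically, so the pre-shaped mesh converges to the pose indicated by the point-cloud.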
- the generating a plurality of synthetic points of the synthetic tongue 3D point-cloud comprises using a generative network of a Generative Adversarial Network (GAN) trained to establish individual synthetic points of the synthetic tongue 3D point-cloud based on the one of the plurality of latent tongue 3D features and respective Gaussian noise samples to be indistinguishable, by a discriminative network of the GAN, from points of one of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features.
- GAN Generative Adversarial Network
- raw tongue 3D point-clouds may be generated in accordance with latent tongue 3D features of a raw tongue 3D point-cloud. Accordingly, raw tongue 3D point-clouds do not need to have the same number of points, which reduces dataset preprocessing, and as many points as needed may be generated on demand, so that there is no constraint in terms of resolution.
- the points of the one of the plurality of raw tongue 3D point-clouds undergo a diversification by an isotropic multi-variate normal distribution N having a variance that declines with progressing training epoch e of the GAN.
- the training of the GAN is improved and stabilized, by softening the binary behavior of the discriminative network especially in an early training phase.
- the error metric comprises a Chamfer distance loss to modify a 3D position of points of the one of the plurality of synthetic tongue-and-head 3D meshes; a normal loss to modify a 3D orientation of the one of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss to constrain relative 3D positions of neighboring points of the one of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss to constrain any possible outlier points of the one of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss to inhibit penetration of a surface of an oral cavity of the one of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points.
- the collision loss comprises a sum of distances of each colliding point to a plurality of spheres having a radius and being centered at mouth landmark vertices of the tongue-and-mouth 3D landmark vertex set defined in the one of the synthetic tongue-and-head 3D meshes.
- a second aspect of the present disclosure provides a computer program, comprising executable instructions which, when executed by a processor, cause the processor to perform the method of the first aspect or any of its implementations.
- a third aspect of the present disclosure provides a device for 3D reconstruction of a tongue, comprising a processor being configured to perform the method of the first aspect or any of its implementations.
- FIG. 1 illustrates a flow chart of a method according to an embodiment of the invention;
- FIG. 2 illustrates a tongue reconstruction framework during inference being used in connection with the method of Fig. 1;
- FIG. 3 illustrates an exemplary tongue-and-mouth landmark vertex set as it may be used in connection with the method of Fig. 1;
- FIG. 4 illustrates an exemplary GAN as it may be used in connection with the method of Fig. 1;
- FIG. 4a illustrates exemplary Layer and Injection building blocks that may be used in connection with the GAN according to FIG. 4;
- FIG. 5 illustrates a diversification of points of a raw tongue 3D point-cloud used in connection with the method of Fig. 1.
- FIG. 1 illustrates a flow chart of a method 1 according to an embodiment of the invention.
- FIG. 2 illustrates a tongue reconstruction framework during inference being used in connection with the method 1 of Fig. 1.
- a tongue may refer to a muscular organ being anchored in an oral cavity of a human subject.
- the method 1 may achieve a 3D reconstruction of a tongue by comprising: determining 101 one 207 of a plurality of synthetic tongue-and-head 3D meshes based on an image 201 of the tongue; and modifying 108 the determined synthetic tongue-and-head 3D mesh 207 based on a synthetic tongue 3D point-cloud 210 being indicative of the image 201 of the tongue.
- the method 1 may define a processing pipeline as shown in FIG. 2 that can predict a 3D tongue mesh with fixed topology from a single image, which can be further optimized based on a generated point-cloud for more accurate results.
- the determining 101 of the synthetic tongue-and-head 3D mesh 207 may comprise encoding 102 the image 201 of the tongue into one of a plurality of latent tongue 3D features 203, y representing a corresponding one 401 of a plurality of raw tongue 3D point-clouds.
- the encoding 102 of the image 201 of the tongue into one of a plurality of latent tongue 3D features 203, y may comprise using 103 an embedding network 202.
- the determining 101 of the synthetic tongue-and-head 3D mesh 207 may further comprise transforming 104 the obtained latent tongue 3D feature 203, y into a corresponding one 205 of a plurality of latent tongue-and-head expression shape parameters p_t of a corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
- the transforming 104 the obtained latent tongue 3D feature 203, y may comprise using 105 a first regression matrix 204, W_t,y transforming the plurality of latent tongue 3D features 203, y into the plurality of corresponding latent tongue-and-head expression shape parameters 205, p_t of the plurality of corresponding synthetic tongue-and-head 3D meshes.
- the determining 101 of the synthetic tongue-and-head 3D mesh 207 may further comprise converting 106 the obtained latent tongue-and-head expression shape parameters 205, p_t into the corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
- the converting 106 of the obtained latent tongue-and-head expression shape parameters 205, p_t of the plurality of synthetic tongue-and-head 3D meshes may comprise using 107 a second PCA 206, U_t to convert the obtained latent tongue-and-head expression shape parameters 205, p_t into the corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
- steps 101 - 107 of the method 1 are able to make use of knowledge acquired in a training phase that may be concluded before the inference phase illustrated in FIGs. 1 and 2.
- This preparatory work in the training phase is described in the following:
- a first dataset may be captured under controlled conditions and is comprised of raw 3D tongue scans in a 3D point-cloud form known as the ‘plurality of raw tongue 3D point-clouds’.
- a second dataset may be manually created and comprises only synthetic full head data with tongue expressions known as the ‘plurality of synthetic tongue-and-head 3D meshes’.
- Each of these meshes may be based on the mean template of the Universal Head Model (UHM), which has manually been diversified to render a wide range of tongue expressions.
- UHM Universal Head Model
- a set of landmark vertices 3 may be annotated around a circumference of the respective tongue and mouth areas.
- the same landmark protocol may be utilized to annotate the plurality of raw tongue 3D point-clouds as well as the plurality of synthetic tongue-and-head 3D meshes based on the same underlying landmarks.
- the total number of landmarks in each set of landmark vertices 3 is exemplarily 24 as can be seen in FIG. 3, and is divided into two groups 302, 301 which highlight the tongue and the mouth, respectively. This constitutes a first tongue-and-mouth landmark vertex set 3, l_r per raw tongue 3D point-cloud, and a second tongue-and-mouth landmark vertex set 3, l_t per synthetic tongue-and-head 3D mesh.
- the set of landmark vertices 3 serve to associate the two datasets, so that a raw tongue 3D point-cloud can be linked to a synthetic tongue-and-head 3D mesh having a (closest) corresponding tongue expression.
- a first PCA U_l may be established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets l_t of the plurality of corresponding synthetic tongue-and-head 3D meshes, and used to establish a plurality of second latent tongue-and-mouth expression shape parameters p_l.
- a PCA may refer to a process of computing principal components of an n-dimensional point cloud, by fitting an n-dimensional ellipsoid to the point cloud, wherein each axis of the ellipsoid represents a principal component.
- An i th principal component may be a direction of a line that is orthogonal to the first i-1 vectors and minimizes the average squared distance from the points to that line.
- the principal components may be linearly uncorrelated and constitute an orthonormal basis that best fits the point cloud.
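The PCA described above can be computed from a centered data matrix via the singular value decomposition; the right singular vectors are the orthonormal principal axes. A small self-contained sketch with illustrative synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
# 100 samples of a 6-D point cloud whose variance lies mostly in 2 directions:
X = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 6))
X += 0.01 * rng.standard_normal((100, 6))   # small isotropic noise

Xc = X - X.mean(axis=0)                      # center the cloud
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                              # rows: orthonormal principal axes

# The components form an orthonormal basis:
assert np.allclose(components @ components.T, np.eye(6), atol=1e-8)

# Projecting onto the first 2 components retains almost all the variance:
explained = (s**2)[:2].sum() / (s**2).sum()
print(round(explained, 3))                   # close to 1.0
```

Truncating to the leading components is exactly what turns a full vertex set into a low-dimensional vector of latent expression shape parameters.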
- a second PCA 206, U_t may be established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes as a whole, and used to establish a plurality of latent tongue-and-head expression shape parameters 205, p_t.
- a second regression matrix W_t,l may be established by regressing the plurality of second latent tongue-and-mouth expression shape parameters p_l to the plurality of latent tongue-and-head expression shape parameters 205, p_t.
- the second regression matrix W_t,l associates the landmark vertex sets 3, l_t of the synthetic tongue-and-head 3D meshes with the plurality of corresponding synthetic tongue-and-head 3D meshes.
- the second regression matrix W_t,l may also associate the plurality of first tongue-and-mouth landmark vertex sets 3, l_r of the plurality of raw tongue 3D point-clouds with the plurality of corresponding synthetic tongue-and-head 3D meshes.
- the above-mentioned plurality of the latent tongue 3D features 203, y may be established by auto-encoding the plurality of raw tongue 3D point-clouds, and the embedding network 202 may be trained to encode a plurality of images 201 of tongues into the corresponding plurality of the latent tongue 3D features 203, y.
- the embedding network 202 may be based on a ResNet-50 model pre-trained on the image database ImageNet and fine-tuned to work as an image encoder.
- a last layer of the embedding network 202 may be modified to output a vector ŷ matching the dimensions of the ground-truth vector y. The goal of the embedding task is then to minimize an L2 loss between ŷ and y.
- the above-mentioned plurality of images 201 of the tongues may comprise a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D point-clouds, and may be rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D point-cloud.
- the plurality of synthetic tongue-and-head 3D meshes may be rendered with a precomputed radiance transfer technique using spherical harmonics which efficiently represent global light scattering.
- 145 second-order spherical harmonics of more than 15 different indoor scenes may be coupled with random light positions and mesh orientations around all 3D axes, resulting in a rich plurality of images 201.
- the modifying 108 the determined synthetic tongue-and-head 3D mesh 207 may comprise generating 109 a plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y, and adapting 111, based on an optimization procedure, the one 207 of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud 210 to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters 205, p_t.
- the generating 109 a plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 may comprise using 110 a generative network 209, G of a generative adversarial network, GAN 4.
- a Generative Adversarial Network may refer to a machine learning framework in which two neural networks (generative and discriminative network agents, respectively) contest with each other in a zero-sum game where one agent's gain is another agent's loss. Given a training set of (real) samples, this framework learns to generate new (synthetic) samples with the same statistics as the training set.
- the discriminative network is trained by presenting samples from the training set until it achieves acceptable accuracy.
- the generative network is seeded with randomized input that is sampled from a predefined latent space, and learns to generate (synthetic) samples, i.e., to map from the latent space to a data distribution of interest.
- the discriminative network seeks to distinguish the synthetic samples produced by the generator from the real samples, i.e., the true data distribution. Backpropagation is applied in both networks so that the generative network generates better synthetic images, while the discriminative network becomes more skilled at flagging synthetic images.
- the generative network 209, G may randomly predict 10K synthetic points G(z, y) that describe a tongue surface in accordance with Gaussian noise samples 208, z, each of which may be drawn from a standard normal distribution, z ~ N(0, I).
- the afore-mentioned step 110 of the method 1 makes use of knowledge acquired in the training phase.
- This preparatory work in the training phase may comprise that the GAN 4 is trained to establish individual synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y and respective Gaussian noise samples 208, z to be indistinguishable, by a discriminative network 402 of the GAN 4, from points x_t of one 401 of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features 203, y.
- the above-mentioned error metric may comprise a Chamfer distance loss L_CD to modify a 3D position of points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; a normal loss L_norm to modify a 3D orientation of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss L_lap to constrain relative 3D positions of neighboring points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss L_edge to constrain any possible outlier points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss L_col to inhibit penetration of a surface of an oral cavity of the one 207 of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points, e.g. combined as E = L_CD + L_norm + L_lap + L_edge + L_col.
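Of these terms, the Chamfer distance is the one that pulls the mesh onto the generated point-cloud: each point of one set is matched to its nearest neighbor in the other set, in both directions. A brute-force sketch suitable for small point sets (a quadratic-cost illustration, not an optimized implementation):

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (n, 3) and b (m, 3)."""
    # Pairwise Euclidean distances, shape (n, m):
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Mean nearest-neighbor distance in both directions:
    return d.min(axis=1).mean() + d.min(axis=0).mean()

a = np.array([[0.0, 0, 0], [1, 0, 0]])
b = np.array([[0.0, 0, 0], [1, 0, 0], [1, 0, 0]])
print(chamfer(a, a))  # 0.0
print(chamfer(a, b))  # 0.0 as well: every point has an exact neighbor
```

Note the Chamfer distance is symmetric in the two sets but does not require them to have the same number of points, which fits point-clouds of varying resolution.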
- the collision loss L_col may comprise a sum of distances of each colliding point q' to a plurality of spheres k having a radius r and being centered at mouth landmark vertices 301 (see FIG. 3) having coordinates (x_k, y_k, z_k) of the tongue-and-mouth 3D landmark vertex set 3, l_t defined in the one 207 of the synthetic tongue-and-head 3D meshes, e.g. L_col = Σ_q' Σ_k max(0, r − ‖q' − (x_k, y_k, z_k)‖).
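Under one reading of this description, the collision term sums the penetration depth of every point that lies inside any landmark-centered sphere; points outside all spheres contribute nothing. A sketch with an illustrative radius and coordinates (not the patent's values):

```python
import numpy as np

def collision_loss(points, centers, r):
    """Sum of penetration depths of points inside spheres of radius r
    centered at mouth landmark vertices (one reading of the collision term)."""
    # Pairwise distances point -> sphere center, shape (n_points, n_centers):
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    pen = np.maximum(r - d, 0.0)   # positive only for colliding points
    return pen.sum()

centers = np.array([[0.0, 0, 0]])        # one mouth landmark sphere
pts = np.array([[0.05, 0, 0],            # collides (inside the sphere)
                [1.0, 0, 0]])            # free
print(collision_loss(pts, centers, r=0.1))  # 0.05
```

The hinge form makes the term differentiable almost everywhere, so it slots into the same gradient-descent refinement as the other losses.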
- FIG. 3 illustrates an exemplary tongue-and-mouth landmark vertex set as it may be used in connection with the method 1 of Fig. 1.
- the depicted tongue-and-mouth landmark vertex set 3, l t is arranged circumferentially around a tongue and mouth, respectively, of a synthetic tongue-and-head 3D mesh 207.
- the 24 landmarks around the oral cavity of the UHM template are divided into two groups 302, 301 which highlight the tongue and the mouth, respectively.
- the first and second tongue-and-mouth landmark vertex sets 3, l_r, l_t may be arranged circumferentially around the tongue and the mouth of the respective raw tongue 3D point-cloud 401 or the respective synthetic tongue-and-head 3D mesh 207.
- FIG. 4 illustrates an exemplary GAN 4 as it may be used in connection with the method 1 of Fig. 1.
- the generating 109 of the plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 may comprise using 110 a generative network 209, G of a generative adversarial network, GAN 4, trained to establish individual synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y and respective Gaussian noise samples 208, z to be indistinguishable, by a discriminative network 402 of the GAN 4, from points x_t of one 401 of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features 203, y.
- a conditional GAN setting may be used in which the generative network 209, G is guided by labels throughout the training process in order to learn to produce samples that belong to specific categories which are dictated by the labels.
- the GAN 4 is preferably guided by meaningful labels which capture all the desired 3D surface information.
- These labels i.e., the plurality of latent tongue 3D features 203, y, may be learned by auto-encoding the plurality of raw tongue 3D point-clouds.
- a self-organizing map framework may be used for hierarchical feature extraction.
- the discriminative network 403, D receives as inputs the label y together with either a real point-cloud point x_t (which belongs to the tongue represented by the label y) or the output G(z, y) of the generative network 209, G, and tries to discriminate the fake (i.e., generated) point from the real one.
- this may be described as: L_D = E_x_t[log D(x_t, y)] + E_z[log(1 − D(G(z, y), y))]
- D tries to maximize L_D
- G tries to minimize L_D
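As a numeric illustration, the standard conditional-GAN discriminator objective can be evaluated directly from discriminator output probabilities; the values below are purely illustrative:

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Conditional GAN discriminator objective:
    L_D = E[log D(x_t, y)] + E[log(1 - D(G(z, y), y))]."""
    return np.log(d_real).mean() + np.log(1.0 - d_fake).mean()

# A confident, correct discriminator scores near 0 (the maximum):
print(round(d_loss(np.array([0.99, 0.98]), np.array([0.01, 0.02])), 3))  # -0.03
# An undecided discriminator (0.5 everywhere) scores 2*log(0.5):
print(round(d_loss(np.array([0.5]), np.array([0.5])), 3))                # -1.386
```

Training drives the two networks toward the undecided regime, where generated points are statistically indistinguishable from real ones.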
- one point corresponding to the surface which the label y represents may be generated at a time.
- the raw tongue 3D point-clouds of the training set do not need to have a same number of points, so that the GAN 4 may be trained without any data preprocessing.
- as many points as needed may be generated on demand, so that there is no constraint in terms of resolution.
- the generative network 209, G may comprise L(ayer) and I(njection) building blocks 404, 405 being interconnected as shown in FIG. 4, for example. These L/I building blocks 404, 405 may respectively comprise Multilayer Perceptron (MLP) 406 and Rectified Linear Unit (ReLU) 407 layers, as can be seen in FIG. 4a. Further building blocks describe a processing of the propagated signals. Processing block 408 (symbol c) stands for row-wise concatenation along the channel dimension, whereas processing block 409 (symbol o) stands for the element-wise (i.e., Hadamard) product.
- the inputs of the generative network 209, G are: a label y corresponding to a particular raw tongue 3D point-cloud from which a 3D point is to be sampled, and a Gaussian noise sample z.
- the discriminative network 403, D may be based on the building blocks already mentioned in connection with the generative network 209, G which may be interconnected as shown in FIG. 4, for example.
- the inputs of the discriminative network 403, D are: (y, x_t), where y is a label corresponding to a raw tongue 3D point-cloud and x_t is a point of this particular raw tongue 3D point-cloud, or (y, G(z, y)), where G(z, y) is a point generated in accordance with this particular raw tongue 3D point-cloud.
- the switch symbol between the generative network 209, G and the discriminative network 403, D indicates that this feed is performed on a random basis.
- FIG. 5 illustrates a diversification of points x_t of a raw tongue 3D point-cloud 401 used in connection with the method 1 of Fig. 1.
- the discriminative network 403, D shows a binary behavior in that it decides whether a point is either fake or real. This rigidity is not very helpful especially in the early steps of the training process, as the generative network 209, G struggles to learn the distribution of points of the plurality of raw tongue 3D point-clouds (i.e., all of the generated points are discarded as fake by the discriminator with high confidence). To remedy this, the strict nature of the discriminative network 403, D may be softened, especially in the initial training steps, by diversifying the points x_t fed to it. To achieve that, instead of directly feeding a real point x_t corresponding to a label y to the discriminative network 403, D, the following is provided:
- the points x_t of the one 401 of the plurality of raw tongue 3D point-clouds may undergo a diversification 402 (see FIG. 4) by an isotropic multi-variate normal distribution N having mean x_t and (isotropic) variance σ_e that declines with progressing training epoch e of the GAN 4: x̃_t ~ N(x_t, σ_e I).
- the generative network 209, G can better learn the distribution of points of the plurality of raw tongue 3D point-clouds as it does not get severely punished by the discriminative network 403, D when it slightly misses the actual surface. This yields better results and stabilizes the training.
- the training may be started with a relatively small value for the variance σ_e, which is further reduced subsequently until it becomes zero towards the final training epochs e.
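This diversification amounts to jittering the real points with noise whose variance is annealed to zero. A minimal sketch; the linear decay schedule and the initial variance are illustrative assumptions (the patent only requires that the variance decline over epochs):

```python
import numpy as np

rng = np.random.default_rng(4)

def diversify(x, epoch, n_epochs, sigma0=0.05):
    """Jitter real points with isotropic Gaussian noise whose variance
    decays linearly to zero over training (one possible schedule)."""
    sigma = sigma0 * max(0.0, 1.0 - epoch / n_epochs)  # variance, declines to 0
    if sigma == 0.0:
        return x
    return x + rng.normal(0.0, np.sqrt(sigma), size=x.shape)

x = np.zeros((100, 3))
early = diversify(x, epoch=0, n_epochs=100)    # noticeably jittered
late = diversify(x, epoch=100, n_epochs=100)   # identical to the real points
print(np.abs(early).max() > 0, np.allclose(late, x))  # True True
```

Early in training the softened targets keep the discriminator from rejecting every generated point outright; by the final epochs the discriminator again sees the exact real points.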
- a computer program (not shown) may be provided comprising executable instructions which, when executed by a processor, cause the processor to perform the above-mentioned method 1, and that a device (not shown) for 3D reconstruction of a tongue may be provided comprising a processor configured to perform the above-mentioned method 1.
- the processor or processing circuitry of the device may comprise hardware and/or the processing circuitry may be controlled by software.
- the hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry.
- the digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors.
Abstract
The invention relates to a method (1) for 3D reconstruction of a tongue. The method (1) comprises the following steps: determining (101) a mesh (207) of a plurality of synthetic tongue-and-head 3D meshes based on an image (201) of the tongue; and modifying (108) the determined synthetic tongue-and-head 3D mesh (207) based on a synthetic tongue 3D point cloud (210) which in turn represents the image (201) of the tongue. This makes it possible to reconstruct a 3D model of a head together with the 3D tongue pose of the depicted tongue.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2020/081148 WO2022096105A1 (fr) | 2020-11-05 | 2020-11-05 | 3D tongue reconstruction from single images |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022096105A1 true WO2022096105A1 (fr) | 2022-05-12 |
Family
ID=73172707
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2020/081148 WO2022096105A1 (fr) | 2020-11-05 | 2020-11-05 | 3D tongue reconstruction from single images |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2022096105A1 (fr) |
- 2020-11-05 WO PCT/EP2020/081148 patent/WO2022096105A1/fr active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017028961A1 (fr) * | 2015-08-14 | 2017-02-23 | Thomson Licensing | Three-dimensional (3D) reconstruction of a human ear from a point cloud |
CN107256575A (zh) * | 2017-04-07 | 2017-10-17 | 天津市天中依脉科技开发有限公司 | Three-dimensional tongue image reconstruction method based on binocular stereo vision |
WO2020053551A1 (fr) * | 2018-09-12 | 2020-03-19 | Sony Interactive Entertainment Inc. | Method and system for generating a 3D reconstruction of a human being |
EP3726467A1 (fr) * | 2019-04-18 | 2020-10-21 | Zebra Medical Vision Ltd. | Systems and methods for reconstruction of 3D anatomical images from 2D anatomical images |
Non-Patent Citations (3)
Title |
---|
HEWER ALEXANDER ET AL: "A multilinear tongue model derived from speech related MRI data of the human vocal tract", COMPUTER SPEECH AND LANGUAGE, ELSEVIER, LONDON, GB, vol. 51, 21 February 2018 (2018-02-21), pages 68 - 92, XP085398906, ISSN: 0885-2308, DOI: 10.1016/J.CSL.2018.02.001 * |
JUN YU ET AL: "A realistic and reliable 3D pronunciation visualization instruction system for computer-assisted language learning", 2016 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), IEEE, 15 December 2016 (2016-12-15), pages 786 - 789, XP033046448, DOI: 10.1109/BIBM.2016.7822623 * |
YU JUN: "A Real-Time Music VR System for 3D External and Internal Articulators", 2019 IEEE CONFERENCE ON VIRTUAL REALITY AND 3D USER INTERFACES (VR), IEEE, 23 March 2019 (2019-03-23), pages 1259 - 1260, XP033597801, DOI: 10.1109/VR.2019.8798288 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210248812A1 (en) * | 2021-03-05 | 2021-08-12 | University Of Electronic Science And Technology Of China | Method for reconstructing a 3d object based on dynamic graph network |
US11715258B2 (en) * | 2021-03-05 | 2023-08-01 | University Of Electronic Science And Technology Of China | Method for reconstructing a 3D object based on dynamic graph network |
CN117649494A (zh) * | 2024-01-29 | 2024-03-05 | 南京信息工程大学 | Three-dimensional tongue reconstruction method and system based on point cloud-pixel matching |
CN117649494B (zh) * | 2024-01-29 | 2024-04-19 | 南京信息工程大学 | Three-dimensional tongue reconstruction method and system based on point cloud-pixel matching |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Ichim et al. | Dynamic 3D avatar creation from hand-held video input | |
Tewari et al. | High-fidelity monocular face reconstruction based on an unsupervised model-based face autoencoder | |
US11727617B2 (en) | Single image-based real-time body animation | |
Karunratanakul et al. | A skeleton-driven neural occupancy representation for articulated hands | |
US11557391B2 (en) | Systems and methods for human pose and shape recovery | |
Bermano et al. | Facial performance enhancement using dynamic shape space analysis | |
Ye et al. | Audio-driven talking face video generation with dynamic convolution kernels | |
CN110599395A (zh) | Target image generation method and apparatus, server, and storage medium | |
Yu et al. | A video, text, and speech-driven realistic 3-D virtual head for human–machine interface | |
Ma et al. | Otavatar: One-shot talking face avatar with controllable tri-plane rendering | |
CA2423212A1 (fr) | Device and method for three-dimensional representation from a two-dimensional image | |
US11963741B2 (en) | Systems and methods for human pose and shape recovery | |
CN115004236A (zh) | Photorealistic talking faces from audio | |
WO2022096105A1 (fr) | 3D tongue reconstruction from single images | |
WO2024114321A1 (fr) | Image data processing method and apparatus, computer device, computer-readable storage medium, and computer program product | |
Sun et al. | Masked lip-sync prediction by audio-visual contextual exploitation in transformers | |
Claes | A robust statistical surface registration framework using implicit function representations-application in craniofacial reconstruction | |
Dundar et al. | Fine detailed texture learning for 3D meshes with generative models | |
CN117635897B (zh) | Pose completion method, apparatus, device, storage medium and product for a three-dimensional object | |
Huang et al. | Object-occluded human shape and pose estimation with probabilistic latent consistency | |
Lifkooee et al. | Real-time avatar pose transfer and motion generation using locally encoded laplacian offsets | |
Ekmen et al. | From 2D to 3D real-time expression transfer for facial animation | |
Gan et al. | Fine-grained multi-view hand reconstruction using inverse rendering | |
Huang et al. | Detail-preserving controllable deformation from sparse examples | |
Park et al. | DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20803514; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 20803514; Country of ref document: EP; Kind code of ref document: A1 |