WO2022096105A1 - 3D tongue reconstruction from single images - Google Patents

3D tongue reconstruction from single images

Info

Publication number
WO2022096105A1
WO2022096105A1 (PCT/EP2020/081148; EP2020081148W)
Authority
WO
WIPO (PCT)
Prior art keywords
tongue
synthetic
head
latent
meshes
Prior art date
Application number
PCT/EP2020/081148
Other languages
English (en)
Inventor
Stylianos PLOUMPIS
Stylianos MOSCHOGLOU
Vasilios TRIANTAFYLLOU
Stefanos ZAFEIRIOU
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2020/081148
Publication of WO2022096105A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2210/00 Indexing scheme for image generation or computer graphics
    • G06T 2210/56 Particle system, point based geometry or rendering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T 2219/20 Indexing scheme for editing of 3D models
    • G06T 2219/2004 Aligning objects, relative positioning of parts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2219/00 Indexing scheme for manipulating 3D models or images for computer graphics
    • G06T 2219/20 Indexing scheme for editing of 3D models
    • G06T 2219/2021 Shape modification

Definitions

  • the present disclosure relates to 3D reconstruction of human tongues for applications in 3D face reconstruction, animation and/or verification.
  • the present disclosure provides, to this end, a method, a computer program and a device.
  • a first aspect of the present disclosure provides a method for 3D reconstruction of a tongue, comprising: determining one of a plurality of synthetic tongue-and-head 3D meshes based on an image of the tongue; and modifying the determined synthetic tongue-and-head 3D mesh based on a synthetic tongue 3D point-cloud being indicative of the image of the tongue.
  • a combined tongue-and-head 3D model may be obtained which has a 3D tongue pose reconstructed from the single image of the tongue.
  • the determining of the synthetic tongue-and-head 3D mesh comprises encoding the image of the tongue into one of a plurality of latent tongue 3D features representing a corresponding one of a plurality of raw tongue 3D point-clouds; transforming the obtained latent tongue 3D feature into a corresponding one of a plurality of latent tongue-and-head expression shape parameters of a corresponding one of the plurality of synthetic tongue-and-head 3D meshes; and converting the obtained latent tongue-and-head expression shape parameters into the corresponding one of the plurality of synthetic tongue-and-head 3D meshes.
  • the encoding of the image of the tongue into one of a plurality of latent tongue 3D features comprises using an embedding network trained to encode a plurality of images of tongues into the corresponding plurality of the latent tongue 3D features.
  • the plurality of the latent tongue 3D features may be established by autoencoding the plurality of raw tongue 3D point-clouds.
  • the plurality of images of the tongues comprises a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D point-clouds.
  • the plurality of images of the tongues are rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D point-cloud.
  • the transforming the obtained latent tongue 3D feature comprises using a first regression matrix transforming the plurality of latent tongue 3D features into the plurality of corresponding latent tongue-and-head expression shape parameters of the plurality of corresponding synthetic tongue-and-head 3D meshes.
  • the plurality of latent tongue-and-head expression shape parameters are established by applying a second regression matrix to a plurality of first latent tongue-and-mouth expression shape parameters of a plurality of corresponding first tongue-and-mouth landmark vertex sets defined in the plurality of raw tongue 3D point-clouds.
  • the plurality of first latent tongue-and-mouth expression shape parameters of the plurality of corresponding first tongue-and-mouth landmark vertex sets may be established by using a first Principal Component Analysis (PCA) on the plurality of first tongue-and-mouth landmark vertex sets.
  • PCA Principal Component Analysis
  • the first PCA may be established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets defined in the plurality of corresponding synthetic tongue-and-head 3D meshes and corresponding to the plurality of first tongue-and-mouth landmark vertex sets in terms of the underlying landmarks.
  • the second regression matrix may be established by regressing a plurality of second latent tongue-and-mouth expression shape parameters of the plurality of second tongue-and-mouth landmark vertex sets to the plurality of corresponding latent tongue-and-head expression shape parameters.
  • the plurality of latent tongue-and-head expression shape parameters may be established by using a second PCA on the plurality of synthetic tongue-and-head 3D meshes.
  • the second PCA may be established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes.
  • the above-mentioned computationally inexpensive inference phase is prepared by associating the latent tongue 3D features of raw tongue 3D point-clouds with appropriate synthetic tongue-and-head 3D meshes in a PCA representation.
  • the converting the obtained latent tongue-and-head expression shape parameters comprises using the second PCA to convert the obtained latent tongue-and-head expression shape parameters into the corresponding one of the plurality of synthetic tongue-and-head 3D meshes.
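  • By way of illustration, the determining step 101 (encoding 102, transforming 104, converting 106) may be sketched in a few lines of Python. This is a minimal sketch assuming PyTorch tensors; the names embedding_net, W_ty, U_t and mean_mesh are placeholders rather than elements disclosed by the patent:

```python
import torch

def determine_mesh(image, embedding_net, W_ty, U_t, mean_mesh):
    """Steps 102-107: image -> latent feature y -> parameters p_t -> mesh."""
    y = embedding_net(image)           # encoding 102 into a latent tongue 3D feature
    p_t = W_ty @ y                     # transforming 104 via the first regression matrix
    vertices = mean_mesh + U_t @ p_t   # converting 106 via the second PCA basis
    return vertices.reshape(-1, 3)     # one (x, y, z) vertex per row
```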
  • the first and second tongue-and-mouth landmark vertex sets are arranged circumferentially around a tongue and mouth of the respective raw tongue 3D point-cloud or respective synthetic tongue-and-head 3D mesh.
  • the modifying the determined synthetic tongue-and-head 3D mesh comprises generating a plurality of synthetic points of the synthetic tongue 3D point-cloud based on the one of the plurality of latent tongue 3D features; and adapting, based on an optimization procedure, the one of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters.
  • a shape of the pre-shaped synthetic tongue-and-head 3D mesh may further be optimized in accordance with the 3D tongue pose depicted in the single input image.
  • the generating a plurality of synthetic points of the synthetic tongue 3D point-cloud comprises using a generative network of a Generative Adversarial Network (GAN) trained to establish individual synthetic points of the synthetic tongue 3D point-cloud based on the one of the plurality of latent tongue 3D features and respective Gaussian noise samples to be indistinguishable, by a discriminative network of the GAN, from points of one of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features.
  • GAN Generative Adversarial Network
  • synthetic tongue 3D point-clouds may be generated in accordance with the latent tongue 3D features of a raw tongue 3D point-cloud. Accordingly, raw tongue 3D point-clouds do not need to have the same number of points, which reduces dataset preprocessing, and as many points as needed may be generated on demand, so that there is no constraint in terms of resolution.
  • the points of the one of the plurality of raw tongue 3D point-clouds undergo a diversification by an isotropic multi-variate normal distribution N having a variance that declines with progressing training epoch e of the GAN.
  • the training of the GAN is improved and stabilized, by softening the binary behavior of the discriminative network especially in an early training phase.
  • the error metric comprises a Chamfer distance loss to modify a 3D position of points of the one of the plurality of synthetic tongue-and-head 3D meshes; a normal loss to modify a 3D orientation of the one of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss to constrain relative 3D positions of neighboring points of the one of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss to constrain any possible outlier points of the one of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss to inhibit penetration of a surface of an oral cavity of the one of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points.
  • the collision loss comprises a sum of distances of each colliding point to a plurality of spheres having a radius and being centered at mouth landmark vertices of the tongue-and-mouth 3D landmark vertex set defined in the one of the synthetic tongue-and-head 3D meshes.
  • a second aspect of the present disclosure provides a computer program, comprising executable instructions which, when executed by a processor, cause the processor to perform the method of the first aspect or any of its implementations.
  • a third aspect of the present disclosure provides a device for 3D reconstruction of a tongue, comprising a processor being configured to perform the method of the first aspect or any of its implementations.
  • FIG. 1 illustrates a flow chart of a method according to an embodiment of the invention;
  • FIG. 2 illustrates a tongue reconstruction framework during inference being used in connection with the method of Fig. 1;
  • FIG. 3 illustrates an exemplary tongue-and-mouth landmark vertex set as it may be used in connection with the method of Fig. 1;
  • FIG. 4 illustrates an exemplary GAN as it may be used in connection with the method of Fig. 1;
  • FIG. 4a illustrates exemplary Layer and Injection building blocks that may be used in connection with the GAN according to FIG. 4; and
  • FIG. 5 illustrates a diversification of points of a raw tongue 3D point-cloud used in connection with the method of Fig. 1.
  • FIG. 1 illustrates a flow chart of a method 1 according to an embodiment of the invention.
  • FIG. 2 illustrates a tongue reconstruction framework during inference being used in connection with the method 1 of Fig. 1.
  • a tongue may refer to a muscular organ being anchored in an oral cavity of a human subject.
  • the method 1 may achieve a 3D reconstruction of a tongue by comprising: determining 101 one 207 of a plurality of synthetic tongue-and-head 3D meshes based on an image 201 of the tongue; and modifying 108 the determined synthetic tongue-and-head 3D mesh 207 based on a synthetic tongue 3D point-cloud 210 being indicative of the image 201 of the tongue.
  • the method 1 may define a processing pipeline as shown in FIG. 2 that can predict a 3D tongue mesh with fixed topology from a single image, which can be further optimized based on a generated point-cloud for more accurate results.
  • the determining 101 of the synthetic tongue-and-head 3D mesh 207 may comprise encoding 102 the image 201 of the tongue into one of a plurality of latent tongue 3D features 203, y representing a corresponding one 401 of a plurality of raw tongue 3D point-clouds.
  • the encoding 102 of the image 201 of the tongue into one of a plurality of latent tongue 3D features 203, y may comprise using 103 an embedding network 202.
  • the determining 101 of the synthetic tongue-and-head 3D mesh 207 may further comprise transforming 104 the obtained latent tongue 3D feature 203, y into a corresponding one 205 of a plurality of latent tongue-and-head expression shape parameters p_t of a corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
  • the transforming 104 the obtained latent tongue 3D feature 203, y may comprise using 105 a first regression matrix 204, W_t,y transforming the plurality of latent tongue 3D features 203, y into the plurality of corresponding latent tongue-and-head expression shape parameters 205, p_t of the plurality of corresponding synthetic tongue-and-head 3D meshes.
  • the determining 101 of the synthetic tongue-and-head 3D mesh 207 may further comprise converting 106 the obtained latent tongue-and-head expression shape parameters 205, p_t into the corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
  • the converting 106 of the obtained latent tongue-and-head expression shape parameters 205, p_t of the plurality of synthetic tongue-and-head 3D meshes may comprise using 107 a second PCA 206, U_t to convert the obtained latent tongue-and-head expression shape parameters 205, p_t into the corresponding one 207 of the plurality of synthetic tongue-and-head 3D meshes.
  • steps 101 - 107 of the method 1 are able to make use of knowledge acquired in a training phase that may be concluded before the inference phase illustrated in FIGs. 1 and 2.
  • This preparatory work in the training phase is described in the following:
  • a first dataset may be captured under controlled conditions and comprises raw 3D tongue scans in 3D point-cloud form, referred to as the ‘plurality of raw tongue 3D point-clouds’.
  • a second dataset may be manually created and comprises only synthetic full-head data with tongue expressions, referred to as the ‘plurality of synthetic tongue-and-head 3D meshes’.
  • Each of these meshes may be based on the mean template of the Universal Head Model (UHM), which has manually been diversified to render a wide range of tongue expressions.
  • UHM Universal Head Model
  • a set of landmark vertices 3 may be annotated around a circumference of the respective tongue and mouth areas.
  • the same landmark protocol may be utilized to annotate the plurality of raw tongue 3D point-clouds as well as the plurality of synthetic tongue-and-head 3D meshes based on the same underlying landmarks.
  • the total number of landmarks in each set of landmark vertices 3 is exemplarily 24, as can be seen in FIG. 3, and is divided into two groups 302, 301 which highlight the tongue and the mouth, respectively. This constitutes a first tongue-and-mouth landmark vertex set 3, l_r per raw tongue 3D point-cloud, and a second tongue-and-mouth landmark vertex set 3, l_t per synthetic tongue-and-head 3D mesh.
  • the set of landmark vertices 3 serves to associate the two datasets, so that a raw tongue 3D point-cloud can be linked to a synthetic tongue-and-head 3D mesh having a (closest) corresponding tongue expression.
  • a first PCA U_l may be established by analyzing principal components of a plurality of second tongue-and-mouth landmark vertex sets l_t of the plurality of corresponding synthetic tongue-and-head 3D meshes, and used to establish a plurality of second latent tongue-and-mouth expression shape parameters p_l.
  • a PCA may refer to a process of computing principal components of an n-dimensional point cloud, by fitting an n-dimensional ellipsoid to the point cloud, wherein each axis of the ellipsoid represents a principal component.
  • An i th principal component may be a direction of a line that is orthogonal to the first i-1 vectors and minimizes the average squared distance from the points to that line.
  • the principal components may be linearly uncorrelated and constitute an orthonormal basis that best fits the point cloud.
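  • As a concrete illustration of such a PCA, a basis may be fitted to row-stacked, flattened vertex sets with plain NumPy. This sketch assumes each sample is already in dense correspondence; fit_pca, encode and decode are illustrative names:

```python
import numpy as np

def fit_pca(samples, num_components):
    """samples: (num_meshes, 3 * num_vertices) row-stacked flattened vertex sets."""
    mean = samples.mean(axis=0)
    # Right singular vectors of the centered data are the principal components.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:num_components].T        # orthonormal basis U, one column each

def encode(x, mean, U):
    return U.T @ (x - mean)                   # latent shape parameters p

def decode(p, mean, U):
    return mean + U @ p                       # reconstruction from parameters
```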
  • a second PCA 206, U_t may be established by analyzing principal components of the plurality of synthetic tongue-and-head 3D meshes as a whole, and used to establish a plurality of latent tongue-and-head expression shape parameters 205, p_t.
  • a second regression matrix W_t,l may be established by regressing the plurality of second latent tongue-and-mouth expression shape parameters p_l to the plurality of latent tongue-and-head expression shape parameters 205, p_t.
  • the second regression matrix W_t,l associates the landmark vertex sets 3, l_t of the synthetic tongue-and-head 3D meshes with the plurality of corresponding synthetic tongue-and-head 3D meshes.
  • the second regression matrix W_t,l may also associate the plurality of first tongue-and-mouth landmark vertex sets 3, l_r of the plurality of raw tongue 3D point-clouds with the plurality of corresponding synthetic tongue-and-head 3D meshes.
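  • A regression matrix of this kind may, for instance, be obtained by ordinary least squares over the training pairs. The following sketch assumes NumPy arrays with one training sample per row; the names P_l and P_t are placeholders:

```python
import numpy as np

def fit_regression_matrix(P_l, P_t):
    """P_l: (N, d_l) landmark parameters p_l; P_t: (N, d_t) mesh parameters p_t.

    Solves P_l @ W ~= P_t in the least-squares sense, so that a new p_l is
    mapped to the mesh parameters via p_t = p_l @ W.
    """
    W, *_ = np.linalg.lstsq(P_l, P_t, rcond=None)
    return W
```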
  • the above-mentioned plurality of the latent tongue 3D features 203, y may be established by auto-encoding the plurality of raw tongue 3D point-clouds, and the embedding network 202 may be trained to encode a plurality of images 201 of tongues into the corresponding plurality of the latent tongue 3D features 203, y.
  • the embedding network 202 may be based on a ResNet-50 model pre-trained on the image database ImageNet and fine-tuned to work as an image encoder.
  • a last layer of the embedding network 202 may be modified to output a vector ŷ matching the dimensions of the ground-truth vector y. The goal of the embedding task is then to minimize an L2 loss between ŷ and y.
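  • A minimal sketch of such an embedding network in PyTorch/torchvision follows; the latent dimension LATENT_DIM is a placeholder, since the patent does not state the size of y:

```python
import torch
import torch.nn as nn
from torchvision import models

LATENT_DIM = 128  # placeholder dimension of the latent tongue 3D feature y

# ResNet-50 pre-trained on ImageNet, with the last layer swapped for a
# regression head so that the network works as an image encoder.
embedding_net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
embedding_net.fc = nn.Linear(embedding_net.fc.in_features, LATENT_DIM)

def embedding_loss(y_hat, y):
    """L2 loss between the predicted and ground-truth latent features."""
    return ((y_hat - y) ** 2).sum(dim=-1).mean()
```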
  • the above-mentioned plurality of images 201 of the tongues may comprise a plurality of rendered images of respective raw tongue 3D point-clouds of the plurality of raw tongue 3D point-clouds, and may be rendered using different scaling with respect to a center, using different rotation with respect to a rotary axis, and/or using different illumination, of the respective raw tongue 3D point-cloud.
  • the plurality of synthetic tongue-and-head 3D meshes may be rendered with a precomputed radiance transfer technique using spherical harmonics which efficiently represent global light scattering.
  • 145 second-order spherical harmonics of more than 15 different indoor scenes may be coupled with random light positions and mesh orientations around all 3D axes, resulting in a rich plurality of images 201.
  • the modifying 108 the determined synthetic tongue-and-head 3D mesh 207 may comprise generating 109 a plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y, and adapting 111, based on an optimization procedure, the one 207 of the plurality of synthetic tongue-and-head 3D meshes by performing an iterative gradient descent according to first order derivatives of an error metric between the one of the plurality of synthetic tongue-and-head 3D meshes and the determined synthetic tongue 3D point-cloud 210 to back-propagate the error to the one of the plurality of latent tongue-and-head expression shape parameters 205, p_t.
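  • The adaptation 111 may be sketched as a small fitting loop in which only the expression shape parameters p_t are free variables. In this sketch, error_metric stands for the combined loss described further below, and the step count and learning rate are illustrative choices:

```python
import torch

def adapt_mesh(p_t, U_t, mean_mesh, point_cloud, error_metric,
               steps=100, lr=1e-2):
    p_t = p_t.clone().requires_grad_(True)
    optimizer = torch.optim.SGD([p_t], lr=lr)      # iterative gradient descent
    for _ in range(steps):
        optimizer.zero_grad()
        mesh = (mean_mesh + U_t @ p_t).reshape(-1, 3)
        loss = error_metric(mesh, point_cloud)     # error between mesh and cloud
        loss.backward()                            # back-propagate the error to p_t
        optimizer.step()
    return p_t.detach()
```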
  • the generating 109 a plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 may comprise using 110 a generative network 209, G of a generative adversarial network, GAN 4.
  • a Generative Adversarial Network may refer to a machine learning framework in which two neural networks (generative and discriminative network agents, respectively) contest with each other in a zero-sum game where one agent's gain is another agent's loss. Given a training set of (real) samples, this framework learns to generate new (synthetic) samples with the same statistics as the training set.
  • the discriminative network is trained by presenting samples from the training set until it achieves acceptable accuracy.
  • the generative network is seeded with randomized input that is sampled from a predefined latent space, and learns to generate (synthetic) samples, i.e., to map from the latent space to a data distribution of interest.
  • the discriminative network seeks to distinguish the synthetic samples produced by the generator from the real samples, i.e., the true data distribution. Backpropagation is applied in both networks so that the generative network generates better synthetic images, while the discriminative network becomes more skilled at flagging synthetic images.
  • the generative network 209, G may randomly predict 10K synthetic points G(z, y) that describe a tongue surface in accordance with Gaussian noise samples 208, z, constituted as draws from a standard normal distribution, $z \sim \mathcal{N}(0, I)$.
  • the afore-mentioned step 110 of the method 1 makes use of knowledge acquired in the training phase.
  • This preparatory work in the training phase may comprise that the GAN 4 is trained to establish individual synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y and respective Gaussian noise samples 208, z to be indistinguishable, by a discriminative network 403, D of the GAN 4, from points x_t of one 401 of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features 203, y.
  • the above-mentioned error metric may comprise a Chamfer distance loss L_CD to modify a 3D position of points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; a normal loss L_norm to modify a 3D orientation of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; a Laplacian regularization loss L_lap to constrain relative 3D positions of neighboring points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; an edge length loss L_edge to constrain any possible outlier points of the one 207 of the plurality of synthetic tongue-and-head 3D meshes; and a collision loss L_col to inhibit penetration of a surface of an oral cavity of the one 207 of the plurality of synthetic tongue-and-head 3D meshes by any possible colliding points, as follows: $\mathcal{L} = \mathcal{L}_{CD} + \mathcal{L}_{norm} + \mathcal{L}_{lap} + \mathcal{L}_{edge} + \mathcal{L}_{col}$
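  • Of these terms, the Chamfer distance admits a particularly compact realization. The following is one plausible symmetric formulation in PyTorch, not necessarily the exact variant used:

```python
import torch

def chamfer_distance(mesh_points, cloud_points):
    """mesh_points: (N, 3) tensor; cloud_points: (M, 3) tensor."""
    d = torch.cdist(mesh_points, cloud_points)   # (N, M) pairwise distances
    # Nearest-neighbor distance in both directions, averaged per point set.
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
```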
  • the collision loss L_col may comprise a sum of distances of each colliding point q' to a plurality of spheres k having a radius r and being centered at mouth landmark vertices 301 (see FIG. 3) having coordinates (x_k, y_k, z_k) of the tongue-and-mouth 3D landmark vertex set 3, l_t defined in the one 207 of the synthetic tongue-and-head 3D meshes, as follows: $\mathcal{L}_{col} = \sum_{q'} \sum_{k} \left| \sqrt{(x_{q'} - x_k)^2 + (y_{q'} - y_k)^2 + (z_{q'} - z_k)^2} - r \right|$
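  • Under the distance-to-sphere reading above, the collision loss may be sketched as follows. Treating points strictly inside a sphere as colliding is an assumption, and the radius r is a tunable choice:

```python
import torch

def collision_loss(points, mouth_landmarks, r):
    """points: (Q, 3) mesh points; mouth_landmarks: (K, 3) sphere centers."""
    d = torch.cdist(points, mouth_landmarks)   # distances to all sphere centers
    colliding = d < r                          # point/sphere pairs in collision
    return (r - d)[colliding].sum()            # summed distances to the surfaces
```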
  • FIG. 3 illustrates an exemplary tongue-and-mouth landmark vertex set as it may be used in connection with the method 1 of Fig. 1.
  • the depicted tongue-and-mouth landmark vertex set 3, l_t is arranged circumferentially around a tongue and mouth, respectively, of a synthetic tongue-and-head 3D mesh 207.
  • the 24 landmarks around the oral cavity of the UHM template are divided into two groups 302, 301 which highlight the tongue and the mouth, respectively.
  • the first and second tongue-and-mouth landmark vertex sets 3, l_r and 3, l_t may be arranged circumferentially around the tongue and the mouth of the respective raw tongue 3D point-cloud 401 or the respective synthetic tongue-and-head 3D mesh 207.
  • FIG. 4 illustrates an exemplary GAN 4 as it may be used in connection with the method 1 of Fig. 1.
  • the generating 109 of the plurality of synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 may comprise using 110 a generative network 209, G of a generative adversarial network, GAN 4, trained to establish individual synthetic points G(z, y) of the synthetic tongue 3D point-cloud 210 based on the one of the plurality of latent tongue 3D features 203, y and respective Gaussian noise samples 208, z to be indistinguishable, by a discriminative network 403, D of the GAN 4, from points x_t of one 401 of the plurality of raw tongue 3D point-clouds corresponding to the one of the plurality of latent tongue 3D features 203, y.
  • a conditional GAN setting may be used in which the generative network 209, G is guided by labels throughout the training process in order to learn to produce samples that belong to specific categories which are dictated by the labels.
  • the GAN 4 is preferably guided by meaningful labels which capture all the desired 3D surface information.
  • These labels, i.e., the plurality of latent tongue 3D features 203, y, may be learned by auto-encoding the plurality of raw tongue 3D point-clouds.
  • a self-organizing map framework may be used for hierarchical feature extraction.
  • the discriminative network 403, D receives as inputs the label y together with either a real point-cloud point x_t (which belongs to the tongue represented by the label y) or the output G(z, y) of the generative network 209, G, and tries to discriminate the fake (i.e., generated) point from the real one.
  • this may be described by the objective $\mathcal{L}_D = \mathbb{E}_{x_t}\left[\log D(x_t, y)\right] + \mathbb{E}_{z}\left[\log\left(1 - D(G(z, y), y)\right)\right]$
  • D tries to maximize $\mathcal{L}_D$
  • G tries to minimize $\mathcal{L}_D$
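  • One adversarial training step under this objective may be sketched as below; G and D are assumed to map (z, y) to a 3D point and (x, y) to a probability in (0, 1), respectively, and the noise dimension Z_DIM is a placeholder:

```python
import torch

Z_DIM = 64  # placeholder dimension of the Gaussian noise sample z

def gan_step(G, D, x_t, y, opt_G, opt_D):
    z = torch.randn(x_t.shape[0], Z_DIM)
    fake = G(z, y)

    # Discriminator: ascend log D(x_t, y) + log(1 - D(G(z, y), y)).
    opt_D.zero_grad()
    loss_D = -(torch.log(D(x_t, y)).mean()
               + torch.log(1.0 - D(fake.detach(), y)).mean())
    loss_D.backward()
    opt_D.step()

    # Generator: non-saturating form, maximize log D(G(z, y), y).
    opt_G.zero_grad()
    loss_G = -torch.log(D(fake, y)).mean()
    loss_G.backward()
    opt_G.step()
```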
  • one point corresponding to the surface which the label y represents may be generated at a time.
  • the raw tongue 3D point-clouds of the training set do not need to have a same number of points, so that the GAN 4 may be trained without any data preprocessing.
  • as many points as needed may be generated on demand, so that there is no constraint in terms of resolution.
  • the generative network 209, G may comprise L(ayer) and I(njection) building blocks 404, 405, interconnected as shown in FIG. 4, for example. These L/I building blocks 404, 405 may respectively comprise Multilayer Perceptron (MLP) 406 and Rectified Linear Unit (ReLU) 407 layers, as can be seen in FIG. 4a. Further building blocks describe a processing of the propagated signals: processing block 408 (symbol c) stands for row-wise concatenation along the channel dimension, whereas processing block 409 (symbol o) stands for the element-wise (i.e., Hadamard) product.
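  • A speculative PyTorch rendering of these building blocks follows. The widths and the exact wiring of FIG. 4/4a are not reproduced here, so the classes below show only one way of combining an MLP, a ReLU, concatenation 408 and the Hadamard product 409:

```python
import torch
import torch.nn as nn

class LayerBlock(nn.Module):
    """L building block 404: an MLP 406 followed by a ReLU 407."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, x):
        return self.mlp(x)

class InjectionBlock(nn.Module):
    """I building block 405: mixes the propagated signal with an injected one."""
    def __init__(self, x_dim, inj_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(x_dim + inj_dim, inj_dim), nn.ReLU())

    def forward(self, x, injected):
        h = torch.cat([x, injected], dim=-1)   # 408: concatenation along channels
        return self.mlp(h) * injected          # 409: element-wise (Hadamard) product
```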
  • the inputs of the generative network 209, G are: a label y corresponding to a particular raw tongue 3D point-cloud from which a 3D point is to be sampled, and a Gaussian noise sample z.
  • the discriminative network 403, D may be based on the building blocks already mentioned in connection with the generative network 209, G which may be interconnected as shown in FIG. 4, for example.
  • the inputs of the discriminative network 403, D are: (y, x_t), where y is a label corresponding to a raw tongue 3D point-cloud and x_t is a point of this particular raw tongue 3D point-cloud, or (y, G(z, y)), where G(z, y) is a point generated in accordance with this particular raw tongue 3D point-cloud.
  • the switch symbol between the generative network 209, G and the discriminative network 403, D indicates that this feed is performed on a random basis.
  • FIG. 5 illustrates a diversification of points x_t of a raw tongue 3D point-cloud 401 used in connection with the method 1 of Fig. 1.
  • the discriminative network 403, D shows a binary behavior in that it decides whether a point is either fake or real. This rigidity is not very helpful, especially in the early steps of the training process, as the generative network 209, G struggles to learn the distribution of points of the plurality of raw tongue 3D point-clouds (i.e., all of the generated points are discarded as fake by the discriminator with high confidence). To remedy this, the strict nature of the discriminative network 403, D may be softened, especially in the initial training steps, by diversifying the points x_t fed to it. To achieve that, instead of directly feeding a real point x_t corresponding to a label y to the discriminative network 403, D, the following is provided:
  • the points x_t of the one 401 of the plurality of raw tongue 3D point-clouds may undergo a diversification 402 (see FIG. 4) by an isotropic multi-variate normal distribution $\mathcal{N}$ having mean x_t and (isotropic) variance σ_e that declines with progressing training epoch e of the GAN 4: $\tilde{x}_t \sim \mathcal{N}(x_t, \sigma_e I)$
  • the generative network 209, G can better learn the distribution of points of the plurality of raw tongue 3D point-clouds as it does not get severely punished by the discriminative network 403, D when it slightly misses the actual surface. This yields better results and stabilizes the training.
  • the training may be started with a relatively small value for the variance σ_e, which is further reduced subsequently until it becomes zero towards the final training epochs e.
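  • A sketch of this diversification with a linearly decaying schedule follows; the initial variance and the linear decay are illustrative choices, since the description only requires that σ_e declines with the epoch e:

```python
import torch

def diversify(x_t, epoch, total_epochs, sigma_0=1e-2):
    """Jitter real points x_t with isotropic Gaussian noise N(x_t, sigma_e * I)."""
    sigma_e = sigma_0 * max(0.0, 1.0 - epoch / total_epochs)  # decays to zero
    return x_t + torch.randn_like(x_t) * sigma_e ** 0.5       # sigma_e as variance
```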
  • a computer program (not shown) may be provided comprising executable instructions which, when executed by a processor, cause the processor to perform the above-mentioned method 1, and that a device (not shown) for 3D reconstruction of a tongue may be provided comprising a processor configured to perform the above-mentioned method 1.
  • the processor or processing circuitry of the device may comprise hardware and/or the processing circuitry may be controlled by software.
  • the hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry.
  • the digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors.
  • ASICs application-specific integrated circuits
  • FPGAs field-programmable gate arrays
  • DSPs digital signal processors
  • multi-purpose processors

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Architecture (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a method (1) for 3D reconstruction of a tongue. The method (1) comprises the following steps: determining (101) one mesh (207) of a plurality of synthetic tongue-and-head 3D meshes based on an image (201) of the tongue; and modifying (108) the determined synthetic tongue-and-head 3D mesh (207) based on a synthetic tongue 3D point-cloud (210) which in turn is indicative of the image (201) of the tongue. This makes it possible to reconstruct a 3D model of a head together with the 3D tongue pose of the depicted tongue.
PCT/EP2020/081148 2020-11-05 2020-11-05 3D tongue reconstruction from single images WO2022096105A1

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/081148 WO2022096105A1 2020-11-05 2020-11-05 3D tongue reconstruction from single images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/081148 WO2022096105A1 2020-11-05 2020-11-05 3D tongue reconstruction from single images

Publications (1)

Publication Number Publication Date
WO2022096105A1 2022-05-12

Family

ID: 73172707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/081148 WO2022096105A1

Country Status (1)

Country Link
WO (1) WO2022096105A1

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017028961A1 (fr) * 2015-08-14 2017-02-23 Thomson Licensing Reconstruction tridimensionnelle (3d) d'une oreille humaine à partir d'un nuage de points
CN107256575A (zh) * 2017-04-07 2017-10-17 天津市天中依脉科技开发有限公司 一种基于双目立体视觉的三维舌像重建方法
WO2020053551A1 (fr) * 2018-09-12 2020-03-19 Sony Interactive Entertainment Inc. Procédé et système pour générer une reconstitution 3d d'un être humain
EP3726467A1 (fr) * 2019-04-18 2020-10-21 Zebra Medical Vision Ltd. Systèmes et procédés de reconstruction d'images anatomiques 3d à partir d'images anatomiques 2d

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HEWER ALEXANDER ET AL: "A multilinear tongue model derived from speech related MRI data of the human vocal tract", COMPUTER SPEECH AND LANGUAGE, ELSEVIER, LONDON, GB, vol. 51, 21 February 2018 (2018-02-21), pages 68 - 92, XP085398906, ISSN: 0885-2308, DOI: 10.1016/J.CSL.2018.02.001 *
JUN YU ET AL: "A realistic and reliable 3D pronunciation visualization instruction system for computer-assisted language learning", 2016 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), IEEE, 15 December 2016 (2016-12-15), pages 786 - 789, XP033046448, DOI: 10.1109/BIBM.2016.7822623 *
YU JUN: "A Real-Time Music VR System for 3D External and Internal Articulators", 2019 IEEE CONFERENCE ON VIRTUAL REALITY AND 3D USER INTERFACES (VR), IEEE, 23 March 2019 (2019-03-23), pages 1259 - 1260, XP033597801, DOI: 10.1109/VR.2019.8798288 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210248812A1 (en) * 2021-03-05 2021-08-12 University Of Electronic Science And Technology Of China Method for reconstructing a 3d object based on dynamic graph network
US11715258B2 (en) * 2021-03-05 2023-08-01 University Of Electronic Science And Technology Of China Method for reconstructing a 3D object based on dynamic graph network
CN117649494A (zh) * 2024-01-29 2024-03-05 南京信息工程大学 一种基于点云像素匹配的三维舌体的重建方法及系统
CN117649494B (zh) * 2024-01-29 2024-04-19 南京信息工程大学 一种基于点云像素匹配的三维舌体的重建方法及系统

Legal Events

121 — EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20803514; Country of ref document: EP; Kind code of ref document: A1)

NENP — Non-entry into the national phase (Ref country code: DE)

122 — EP: PCT application non-entry in European phase (Ref document number: 20803514; Country of ref document: EP; Kind code of ref document: A1)