GB2597111A - Computer vision method and system

Computer vision method and system

Info

Publication number
GB2597111A
Authority
GB
United Kingdom
Prior art keywords
latent
space
ground truth
image
code
Prior art date
Legal status
Withdrawn
Application number
GB2011293.4A
Other versions
GB202011293D0 (en)
Inventor
Katranidis Vasileios
Chamaka Kularatna Shashitha
Current Assignee
Carv3d Ltd
Original Assignee
Carv3d Ltd
Priority date
Filing date
Publication date
Application filed by Carv3d Ltd
Publication of GB202011293D0
Publication of GB2597111A

Classifications

    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06T 15/205 Image-based rendering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Image Processing (AREA)

Abstract

Generating an image from ground truth images, the image being described using two characteristics such as geometry and texture, comprises: first 201 and second 203 generative networks, each network comprising an output corresponding to a first and second characteristic space, respectively, and an input comprising a first 61 and second 63 latent space, respectively; selecting a first latent code from said first latent space, wherein the first latent space comprises first ground-truth latent codes, each latent code mapping via said first generative network to a first characteristic of a ground-truth image in said first characteristic space; selecting a second latent code from allowable regions of said second latent space, the allowable regions being indicated by masks; and generating the image from the selected first and second latent codes in the first and second latent spaces, respectively. Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) are examples of generative networks. Also claimed are: mapping the first latent code to select the second latent code; and a method of obtaining a representation of ground truth images in latent spaces based on backpropagation. Privacy of individuals providing training data is preserved by masking (excluding) or promoting (biasing towards) regions of the latent representations.

Description

Computer Vision Method and System
FIELD
Embodiments described herein are concerned with a computer vision method and system.
BACKGROUND:
In computer graphics and media applications, 3D content involves 3D assets, for example faces, human forms, etc., and their animation in space. Such 3D assets are essentially data instances of a geometry in the form of a meshed surface and various texture maps, i.e. images with pixel-vertex correspondence on the respective mesh.
Conventionally, 3D assets can be created by technically skilled 3D artists by means of explicit control over the geometry and texture, by procedural models which use a set of parametric, stochastic and conditional rules which govern features of the generated geometry (and/or textures), or by means of digitizing real world objects via some embodiment of photogrammetry techniques.
In many cases where the "correctness" of the result is a purely subjective, artistic and arbitrary notion, the creation of a 3D asset can effectively be executed by human artistic labour or procedural models. However, in cases where the result is required to fall within certain visual constraints, i.e. realism, and the 3D asset itself is sophisticated and nuanced, i.e. a human face, the task of creation via manual labour or procedural models becomes difficult, slow and even impossible. That is because the space of possible creations in 3D space is mostly occupied by unrealistic results which do not satisfy the realism criterion. It is in these cases that photogrammetry techniques offer a more effective and efficient way of obtaining realistic 3D assets.
Nevertheless, photogrammetry is a costly and demanding process which can only duplicate a real-world object/subject to the digital 3D domain.
BRIEF DESCRIPTION OF THE DRAWINGS:
Figure 1 is a schematic of a system in accordance with an embodiment;
Figure 2 is a schematic showing a 3D asset of a human head and how it is composed of geometry and texture;
Figure 3 is a schematic of two latent spaces which are used in a method in accordance with an embodiment;
Figure 4 is a flow diagram showing a simplified method in accordance with an embodiment;
Figures 5(a) and 5(b) are diagrams showing the construction of a latent space in accordance with an embodiment;
Figure 6 is a diagram showing a method of asset generation which is designed to preserve the privacy of the individuals who provided the training data;
Figures 7(a), 7(b) and 7(c) are variations on the type of training data that can be used to express an asset;
Figure 8 is a diagram showing a further method of asset generation in accordance with a further embodiment where it is desired to bias the selection of the latent codes to help provide a realistic asset;
Figure 9 is a schematic of two latent spaces which are used in a method in accordance with an embodiment to select an asset with both biasing and privacy considerations;
Figure 10 is a diagram showing a further method of asset generation in accordance with a further embodiment having a single shared latent space; and
Figure 11 is a diagram of a system in accordance with an embodiment.
DETAILED DESCRIPTION:
In an embodiment, a computer implemented method of generating an image from a plurality of ground truth images is provided, wherein the image is described using at least two characteristics, the method comprising: providing a first generative network and a second generative network, said first generative network comprising an output corresponding to a first characteristic space and an input comprising a first latent space, said second generative network comprising an output corresponding to a second characteristic space and an input comprising a second latent space; selecting a first latent code from said first latent space, wherein the first latent space comprises first ground truth latent codes, each first ground truth latent code mapping via said first generative network to a first characteristic of a ground truth image in said first characteristic space; selecting a second latent code from allowable regions of said second latent space, wherein the second latent space comprises second ground truth latent codes, each second ground truth latent code mapping via said second generative network to a second characteristic of a ground truth image in said second characteristic space, wherein masks are applied to said second latent space indicating allowable regions; and generating the image from the selected first latent code in the first latent space and the selected second latent code in the second latent space.
The disclosed method provides an improvement to computer functionality by allowing computer performance of a function not previously performed by a computer. Specifically, the disclosed system provides for a computer to generate assets where the similarity or difference from the ground truth training data is controlled in an unsupervised manner.
To aid this explanation, it will be assumed that the images are human faces, the first characteristic is geometry and the second characteristic is texture. However, the first and second characteristics can be reversed. Also, there is no limitation to faces or human forms since there are many situations where the relationship of a generated synthetic image to the ground truth data needs to be controlled. The images may be 2D or 3D.
Embodiments generally relate to enabling the generation of computer-generated 3D assets via a rule-based combinatorial system which controls generative models that involve some form of latent representation of the training data, e.g. VAEs (including normalizing flow architectures) and GANs. This functionality can prove valuable where the texture and the geometry of a 3D asset contribute meaningful and specific information on the visual characteristics and identity of the generated 3D asset. In an embodiment, at least two generative models are trained with a training set of 3D assets: the first model is trained with some form of representation of the geometries of the training set and the second model is trained with the texture data (images) of the training set. The correspondence between geometry and texture for each one of the training data is saved and used to create appropriate masks or biases on the latent space of both networks, according to user preferences and user-defined parameters.
A technique is described that is applied on pairs (or more) of generative networks (geometry and texture generators) and enables bespoke and procedural control on the generation of new 3D assets.
In one exemplary use, the above method can be used for preserving the privacy of people who have provided their images as training data for a generative network to produce further images or assets.
Control is implemented by masking (excluding) and/or promoting (biasing towards) certain regions of the latent representations of geometries and textures, given a set of rules which depends on the training examples and the preferences of the user. The generative models (at least one for texture generation and at least one for geometry generation) are trained on ground-truth data, each one of which is composed by geometry data and corresponding texture data. The information of the correspondence between texture and geometry of the ground-truth training data, along with user input, is employed to regulate the generation of new objects (assets).
The above method has many uses; for example, it can be used to preserve the privacy of the ground truth data, which may or may not be part of the training set. In an example, the masks can be provided around second ground truth latent codes in the second latent space. For example, if the first latent space represents geometry and the second latent space represents texture, then by placing masks on the second latent space, it is possible to avoid a face being generated with a texture too similar to that of the ground truth data. The size of the masks may be determined from the proximity of the first latent code to the first ground truth latent codes in the first latent space. For example, the diameter of the masks can be inversely proportional to the distance of the first latent code to the ground truth latent codes in the first latent space. The diameter of the masks may also be controlled by a user inputted parameter.
The ground truth images may comprise images that were used for training the network and images that were not used for training the network. For example, to protect the privacy of a famous person or persons, their images can be added to the ground truth data to ensure that the synthetic asset is not inadvertently too close to a famous person. However, the image of the famous person would not be used in the training data.
The above has discussed using the masks to avoid selecting latent codes which are similar to other codes. In further embodiments, however, the masks are used to bias the sampled latent codes towards certain latent codes.
For example, in an embodiment, the second latent code in the second latent space is selected by mapping said first latent code in said first latent space to said second latent space. The mapping may be achieved by a neural network.
The second latent code in the second latent space may also be selected by: mapping said first latent code in said first latent space to a corresponding latent code in the second latent space; defining a mask around the corresponding latent code; and selecting the latent code in the second latent space from within said mask.
The mask around the corresponding latent code can be thought of as a bias zone around the corresponding point. Selecting other points within this bias zone ensures that the first and second characteristics are compatible.
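As a rough sketch of this bias-zone selection, the second latent code can be drawn from a small neighbourhood around the mapped code. The linear stand-in for the mapping network, the latent dimensionalities and the bias radius below are illustrative assumptions only, not taken from the disclosure.

import numpy as np

rng = np.random.default_rng(0)

d_a, d_b = 64, 128                      # assumed latent dimensionalities of spaces a and b
W = rng.normal(size=(d_b, d_a)) * 0.1   # stand-in for a trained mapping network from space a to space b

def map_a_to_b(z_a):
    # In practice the mapping would be a trained neural network; a linear map is
    # used here purely as a placeholder.
    return W @ z_a

def sample_in_bias_zone(z_a, radius=0.5):
    # Map the first latent code into latent space b and perturb it within a
    # sphere of the given radius (the "bias zone"), so that the selected
    # second characteristic stays compatible with the first.
    centre = map_a_to_b(z_a)
    direction = rng.normal(size=centre.shape)
    direction /= np.linalg.norm(direction)
    return centre + direction * rng.uniform(0.0, radius)

z_a = rng.normal(size=d_a)          # first latent code (e.g. geometry)
z_b = sample_in_bias_zone(z_a)      # second latent code (e.g. texture), biased towards the mapped point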
In an embodiment, a computer implemented method of generating an image from a plurality of ground truth images is provided, wherein the image is described using at least two characteristics, the method comprising: providing a first generative network and a second generative network, said first generative network comprising an output corresponding to a first characteristic space and an input comprising a first latent space, said second generative network comprising an output corresponding to a second characteristic space and an input comprising a second latent space; selecting a first latent code from said first latent space, wherein the first latent space comprises first ground truth latent codes, each first ground truth latent code mapping via said first generative network to a first characteristic of a ground truth image in said first characteristic space; mapping said first latent code to select a second latent code in said second latent space; and generating the image from the selected first latent code in the first latent space and the selected second latent code in the second latent space.
Masks used to enforce privacy and to allow biasing can be used in the same latent space.
The first latent code in the first latent space may be selected by random sampling.
However, the sampling may be biased to one or more areas of the first latent space. For example, sampling can be biased towards an area of the latent space where there is data from ground truth images of a certain race.
Assuming that a number of humans are scanned via photogrammetry techniques and their facial data (geometry and texture files) are used to train a plurality of networks (at least two, one for geometry and one for texture generation) to generate original and realistic digital humans, using the above embodiment, it can be guaranteed that the results of the generative processes will not appear similar to any of the humans in the training set.
Furthermore, in an embodiment, the control imposed on the generative network is procedural (in the sense that it is learnt in an unsupervised manner, is not explicitly defined and does not require any human oversight or other manual control once achieved) and bespoke (in the sense that the user has control over the degree of acceptable similarity, considering any application-specific constraints). Avoiding the risk of self-identification or other compromise of the privacy of the scanned humans may be an enabling measure for the realisation of such a task. Doing so in a procedural manner can make run-time and autonomous generation of digital humans conceivable, where it is difficult to control manually each generated datapoint, but privacy of the training examples is important.
The notion of similarity in this instance depends on a large number of parameters and is difficult to define in an explicit manner. Furthermore, the higher the quality and resolution of the digital human, the more nuanced and abstract the notion of similarity becomes, associated with the higher number of degrees of freedom of larger data points.
The similarity between two human faces is a function of the similarity of both their geometries and their respective texture data.
In an embodiment where the geometry data is represented by a meshed surface of the kind which is typically used for animation applications (low number of vertices) and the texture files are multi-channel image files, the following assumptions can be made. The geometry data dictate the size, shape and relative ratios of low-frequency facial features, i.e. nose, jaw, eyes, ears etc. The texture map data generally govern the way light interacts with skin (i.e. reflectance, albedo colour, specular roughness) as well as the high-frequency features such as wrinkles on the skin and other such surface geometry. Additionally, texture data contain information on the location and shape of key facial features that influence the identity of a person, such as eyebrows, hairline and other facial hair or skin artefacts.
The task of preserving privacy in this case presents a suitable problem for unsupervised learning since it is very difficult to formalise and encode all the ways in which a human face can have a distinctive visual appearance.
In an embodiment, the mapping of the ground truth points to the first latent space is performed using a neural network.
VAEs may be trained to encode a plurality of input data to a latent representation of distributions in the latent space and, respectively, to decode such latent codes variationally into usable, generated data. GANs achieve an indirect encoding of the training data by training a pair of generator and discriminator networks. The trained generator in a GAN can generate acceptable results from random codes. A VAE may be used to generate said first characteristic and a GAN to generate the second characteristic, or two GANs can be used to generate both the first and second characteristics. Similarly, two VAEs can be used to generate both the first and second characteristics.
The training of a VAE or a GAN results in a latent representation of the training data in which data which are similar to each other are placed closer together in the latent space, in a way that depends on the representation of such data in the training phase of the networks. In embodiments where the representation of the training data (geometries and textures) effectively captures the meaningful visual information, which is the intended subject of learning in this embodiment, the distance between two points in the high-dimensional latent space is a measure of their visual similarity. The criteria of this type of similarity are arbitrarily complex and are learnt in the training process in an unsupervised manner.
In an embodiment, the positions of the ground truth points in the first latent space are determined from: selecting a point to be a predicted ground truth latent code for a ground truth first characteristic in a first latent space; mapping said predicted latent code using said first generative neural network to a prediction in the first characteristic space; determining the error between the prediction and the ground truth first characteristic in the first characteristic space; backpropagating the error through said first generative neural network while keeping the weights of the first generative network fixed; applying said backpropagated error to the predicted ground truth latent code and updating the predicted ground truth latent code; and repeating the process until the prediction and the ground truth in the first characteristic space converge.
The above method is not limited to the examples described above. Thus, a further embodiment provides a method of obtaining a representation of an image, the method comprising: providing a generative network, said generative network comprising an output corresponding to an image space and an input comprising a latent space; receiving image data of a ground truth image; selecting a point in said latent space to be a predicted ground truth latent code for said ground truth image; mapping said predicted latent code using said generative network to a prediction in the image space; determining the error between the prediction and the ground truth image in the image space; backpropagating the error through said generative network while keeping the weights of the generative network fixed; applying said backpropagated error to the predicted ground truth latent code and updating the predicted ground truth latent code; and repeating the process until the prediction and the ground truth in the image space converge, to obtain a representation of said ground truth image in said latent space from the prediction.
The above teachings can also be applied to a method having a single latent space. Thus, in a further embodiment, a computer implemented method of generating an image from a plurality of ground truth images is provided, comprising: providing a generative network, said generative network comprising an output corresponding to an image space and an input comprising a latent space, wherein said latent space comprises ground truth latent codes, each ground truth latent code mapping via said generative network to a characteristic of a ground truth image in said image space; applying masks around said ground truth latent codes; selecting a latent code from the latent space which is outside of the masks around the ground truth latent codes; and generating an image from the selected latent code using said generative network.
A user interface may be provided to allow a user to select an indication of the allowable similarity of the generated image to the ground truth images, wherein applying said masks comprises using said user-selected indication to determine the size of said masks.
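A minimal sketch of such a user-controlled mask in a single latent space is given below; the L1 distance, the Gaussian sampling and the linear mapping from the user's similarity setting to a mask radius are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(1)

d = 32                                   # assumed latent dimensionality
Z_gt = rng.normal(size=(10, d))          # stand-in ground-truth latent codes, one per row

def sample_outside_masks(Z_gt, user_similarity=0.5, max_tries=1000):
    # user_similarity in [0, 1]: larger values allow codes closer to the ground
    # truth, smaller values enforce larger exclusion masks (illustrative mapping).
    radius = (1.0 - user_similarity) * 5.0
    for _ in range(max_tries):
        z = rng.normal(size=d)
        distances = np.abs(Z_gt - z).sum(axis=1)   # L1 distance to every ground-truth code
        if np.all(distances > radius):             # outside every mask, so acceptable
            return z
    raise RuntimeError("no allowable latent code found; relax the similarity setting")

z = sample_outside_masks(Z_gt, user_similarity=0.3)
# z would then be passed through the trained generator to produce the image.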
In a further embodiment, the positions of the ground truth points in the first latent space are determined from: selecting a point in a d-dimensional space to be a predicted latent code for a ground truth first characteristic in a first latent space; mapping said predicted latent code using a generative neural network to a prediction in the first characteristic space; determining the error between the prediction and the ground truth first characteristic in the first characteristic space; backpropagating the error through said neural network while keeping the weights of the generative network fixed; applying said backpropagated error to the predicted latent code and updating the predicted latent code; and repeating the process until convergence.
Assuming a plurality of trained networks (at least two) on the human face ground-truth data (geometries in the format of a meshed surface and textures in the format of multichannel images), the method disclosed herein achieves a semi-procedural control of the generative process in the following manner. The above embodiment utilises the knowledge of the latent codes of the training data, the correspondence between geometry and texture for every datapoint in the training set, the latent code of a generated example, and a user-defined input which helps define the tolerance of acceptable similarity between a generated example of a face and the ground-truth data used for training, or ground truth data which lies outside the training set.
Deep learning techniques rely on learning statistical properties of the training data via a hierarchical and progressive layered decomposition of such data which enables formalising of complex and abstract notions (inherent to the data) that cannot be explicitly instructed to the computer. Thus, shared representation of the training data essentially permits different datapoints to participate in the training process and be processed in a way that enables them to be compared to each other and be decomposed so that their meaningful statistical properties can be leveraged for learning.
Various decomposition mechanisms can be employed to capture the meaningful information of the training set and discard non-meaningful information. For example, in structured, grid-type data such as images or video, convolution layers and multiple channels enable shift-invariance and capture the local connectivity and compositionality information of the pixels. In that way, meaningful visual features can be identified in any part of an image and taken into account in the learning process by downsampling in-between convolution layers.
In the case of geometry data, which in its raw form right after acquisition is generally a point cloud, there is no Euclidean structure to the data and the concepts of locality and connectivity are not straightforward.
One approach to convert such data into a structured format, where hierarchical convolutions can be applied seamlessly, is to construct a voxel representation which captures the geometry in space and is suitable for volumetric convolutions. However, volumetric operations are significantly more expensive in terms of memory and computation, and are thus limited to low-resolution geometry representations.
Another approach is to capture multiple 2D projections of 3D objects and train with the resulting set of images. Yet this approach is not efficient, since a lot of information is repeated and significant geometrical information is occluded.
A 3D object can also be described by a point cloud, which is often the case in applications such as photogrammetry and laser scanning. Spatial features from point clouds can be identified and learned via graph-based edge convolutions.
Describing a geometry as a surface manifold in 3D space is more attractive in terms of computational efficiency in rendering applications, but it is an unstructured form of data, unfit for conventional convolution operations.
There are various approaches and significant ongoing research enabling efficient and effective training with unstructured (non-Euclidean) forms of data. The challenge lies in defining and using the notions of stationarity (translation/rotation invariance), locality (local connectivity) and compositionality on unstructured data. However, these approaches are not directly related to the field of the invention herein. Such approaches include differential coordinates, Poisson-based gradient field reconstruction, rotation invariant mesh difference (RIMD), spectral embeddings, diffusion maps, stochastic neighbourhood embeddings and other similar locality-preserving and mapping frameworks. Some additional representation approaches for 3D geometry data, specific to certain computer graphics applications, will be discussed later in the specification.
The data are typically embedded in a Euclidean latent space of some dimension; however, the present invention is not limited in application by the choice of dimensionality or topology of the latent space used by the networks in possible embodiments. Some data sets are better captured in non-Euclidean topologies, e.g. human motion data.
In a further embodiment, a system for generating an image from a plurality of ground truth images is provided, wherein the image is described using at least two characteristics, the system comprising a processor adapted to: provide a first generative network and a second generative network, said first generative network comprising an output corresponding to a first characteristic space and an input comprising a first latent space, said second generative network comprising an output corresponding to a second characteristic space and an input comprising a second latent space; select a first latent code from said first latent space, wherein the first latent space comprises first ground truth latent codes, each first ground truth latent code mapping via said first generative network to a first characteristic of a ground truth image in said first characteristic space; select a second latent code from allowable regions of said second latent space, wherein the second latent space comprises second ground truth latent codes, each second ground truth latent code mapping via said second generative network to a second characteristic of a ground truth image in said second characteristic space, wherein masks are applied to said second latent space indicating allowable regions; and generate the image from the selected first latent code in the first latent space and the selected second latent code in the second latent space.
In a further embodiment, a system for generating an image from a plurality of ground truth images is provided, wherein the image is described using at least two characteristics, the system comprising a processor adapted to: provide a first generative network and a second generative network, said first generative network comprising an output corresponding to a first characteristic space and an input comprising a first latent space, said second generative network comprising an output corresponding to a second characteristic space and an input comprising a second latent space; select a first latent code from said first latent space, wherein the first latent space comprises first ground truth latent codes, each first ground truth latent code mapping via said first generative network to a first characteristic of a ground truth image in said first characteristic space; map said first latent code to select a second latent code in said second latent space; and generate and output the image from the selected first latent code in the first latent space and the selected second latent code in the second latent space.
In a further embodiment, a system for obtaining a representation of an image is provided, the system comprising a processor adapted to: provide a generative network, said generative network comprising an output corresponding to an image space and an input comprising a latent space; receive image data of a ground truth image; select a point in said latent space to be a predicted ground truth latent code for said ground truth image; map said predicted latent code using said generative network to a prediction in the image space; determine the error between the prediction and the ground truth image in the image space; backpropagate the error through said generative network while keeping the weights of the generative network fixed; and apply said backpropagated error to the predicted ground truth latent code and update the predicted ground truth latent code; the processor being configured to repeat the process until the prediction and the ground truth in the image space converge, to obtain a representation of said ground truth image in said latent space from the prediction.
In a further embodiment, a system for generating an image from a plurality of ground truth images is provided, the system comprising a processor adapted to: provide a generative network, said generative network comprising an output corresponding to an image space and an input comprising a latent space, said latent space comprises ground truth latent codes, each ground truth latent code mapping via said generative network to a characteristic of a ground truth image in said image space; apply masks around said ground truth latent codes; select a latent code from the latent space which is outside of the masks around the ground truth latent codes; and generate and output an image from the selected latent code using said generative network.
Figure 1 is a schematic showing a very basic overview of the system. The system can be thought of as containing three basic components: a user interface 1, a candidate generation module 3 and an image generation module 5.
The user interface allows the user to request the generation of a synthetic image. Various examples of types of synthetic images will be discussed later in the specification. However, the system can be applied to the generation of any synthetic image.
The input of the user is received via the user input module 7. The system may be configured in different ways. In some embodiments, the user may just be required to request generation of a synthetic image. In other variations that will be discussed later, the user may be requested to provide some further information, for example the type of image required and possibly some further parameters.
The user interface 1 also comprises a selection control interface 9. The selection control interface 9 can be viewed as having two main sections: (i) a selection of candidate module 13; and (ii) a mask generation module 11.
The selection of candidate module 13 allows the generation of potential candidates for the synthetic image. The mask generation module 11 provides certain restrictions on the requirements of the generated image. The mask generation module will be controlled by both the selection control interface and the user input module 7. How the mask generation module and the selection of candidate module interface with each other will be described later in the specification.
However, in an embodiment, the candidates are selected from a d-dimensional space. How the d-dimensional space is constructed will be described in detail later. The mask generation module provides masks on this d-dimensional space. Dependent on the user requirement, sometimes it will be necessary to select a candidate within these masks and sometimes these masks will be indicating non-acceptable selections.
An image is then generated from an allowable candidate selected from the d-dimensional latent space. The generation is provided by the generation module 5.
To put the system of figure 1 into context, a specific example will be described where the system is used for the generation of synthetic 3D assets. Figure 2 shows an example where the asset is a human head 51. The head can be described using two characteristics, in this case, geometry 53 which, in this example, is a mesh of the head and texture 55 which is to be applied to the geometry. Exactly how the head can be represented and variations on this will be described later with reference to figure 7. Further, how the latent spaces are constructed will be described with reference to figures 5(a) and 5(b).
In an embodiment, a latent space is formed for each characteristic. Figure 3 is used to demonstrate this where a first latent space 61 is formed for geometry and a second latent space 63 is formed for texture. How both the latent spaces are formed will be described in relation to figures 5(a) and 5(b).
In figure 3, the two latent spaces 61 and 63 are shown as two-dimensional. However, this is just for clarity. In practice, both latent spaces will be d-dimensional where d is an integer of at least two. Further, the dimensionality of the first and second latent spaces can match or be different to each other.
A point in the first latent space will map via a generator network to a representation of the geometry as described with reference to figure 2, and a point in the second latent space will map via a generator network to a texture as described in relation to figure 2.
Figure 3 is a conceptual illustration of the masking of a latent space b conditional on a latent code in a latent space a. The data-points x_1 and x_2 are 3D assets which belong to the training set of the neural networks a and b (captured 3D faces) and are further decomposed into their respective 3D meshes (x_1^a and x_2^a) and texture maps (x_1^b and x_2^b). The latent codes for the training data x_1^a and x_2^a are z_1^a and z_2^a respectively and they exist in latent space a. The latent codes for the training data x_1^b and x_2^b are z_1^b and z_2^b respectively and they exist in latent space b. The latent code z^a represents the embedding of a new synthetic 3D mesh and the latent code z^b represents the embedding of a new set of synthetic texture maps, which when combined will yield the new synthetic 3D face x̂. In this embodiment, the latent spaces a and b are the result of training a Variational Autoencoder (VAE) neural network a and a Generative Adversarial Network (GAN) neural network b respectively. For both latent spaces a and b, it is true that latent codes which are in proximity with each other yield correspondingly similar synthetic data. Although a VAE and a GAN are mentioned above, neural network a and neural network b could be of the same type.
Figure 3 shows intuitively that the closer the latent code of the synthetic 3D mesh z^a is to the latent codes of the ground truth data z_1^a and z_2^a, the larger the corresponding masks that are applied to the latent codes z_1^b and z_2^b in the texture latent space b. In this manner, if a generated 3D mesh is similar to any of the ground truth data, the corresponding generated textures will be dissimilar from the respective ground truth data in a proportional fashion. Thus, the generated synthetic 3D face will always avoid being similar to any of the captured 3D faces which were used for training.
In an embodiment, there will be a plurality of assets provided to the system that will be termed "ground truth assets". It is not necessary for all of the ground truth assets to be part of the training set of the generative network. Ground truth assets can be provided which are outside of the training set, so that the system can protect the privacy of ground truth assets other than those used in the training of the network. For each asset, there is a geometry and a corresponding texture. Each ground truth geometry will be mapped to the first latent space 61 (latent space a) as a point z_i^a and each corresponding ground truth texture will be mapped to the second latent space 63 (latent space b) as a point z_i^b, where i is an index denoting the ground truth assets.
When the user requests via the user interface to generate an asset, a point z^a in the first latent space 61 is selected. This point may be selected at random, or the user and/or system might provide some input or guidance on the selection of this point. How this may be done will be described later. However, for the clarity of this specific example, it will be assumed that the point is selected at random.
In this specific example, two ground truth points z_1^a and z_2^a are shown in the first latent space. However, it will be appreciated that there can be many ground truth points.
It will be assumed in this example, that the system will be required to generate assets that will not look too similar to any of the people who provided the original training data.
To prevent this from happening, in the first latent space 61, the distance between the selected point z^a and the two ground truth points is determined. As will be appreciated, this is a d-dimensional space and there may be many different ways of determining the distance; some examples will be given later.
To avoid generating an asset that is too similar to the training data, restrictions are put in place on how the corresponding texture data can be selected from the second latent space 63. In this example, it can be seen that the selected point z^a in the first latent space is quite close to the second ground truth point z_2^a in the first latent space, but is further away from the first ground truth point z_1^a in the first latent space.
The corresponding texture ground truth points z_1^b and z_2^b in the second latent space are shown in the diagram. Around each of these points, a mask is shown. As the selected point z^a in the first latent space is close to the second ground truth point, a large mask 65 is shown around the second ground truth point z_2^b in the second latent space, to avoid an asset being generated where both the geometry and texture are similar to the second ground truth asset. Also, a mask 67 is shown around the first ground truth point z_1^b in the second latent space.
However, this mask is much smaller due to the fact that the selected point in the first latent space is much further away from the first ground truth point. In this example, the second latent code z^b is selected such that it is outside both of the masks 65 and 67. Once this point has been selected, the texture is generated from latent space b by taking point z^b and passing it through the generator network for the texture space. The geometry can be generated from point z^a by passing it through the generator for latent space a. Alternatively, the user might want to explore the generated geometries first (i.e. sample and generate different z^a) and then, once a geometry is determined that the user likes, the user can generate a texture respecting the privacy constraints associated with that geometry. In fact, many synthetic geometries can be generated and assessed by the user before the masks are applied and an appropriate texture is generated.
Figure 4 is a very simple flow diagram showing the basic steps. In step S21, as explained with reference to figure 1, the user requests generation of an asset. In step S23, sampling of the first latent space 61 is then performed. How this is achieved will be described in detail later in the specification.
In step S25, a mask is applied to the second latent space 63. How the mask is formed will depend on the other requirements around generating the synthetic image. In some of the later embodiments, a scenario is described where it is desirable not to allow the synthetic image to be too close to any of the ground truth images used to generate the first 61 and second 63 latent spaces. In this situation, if the point sampled in the first latent space is close to one of the ground truth images, steps are taken to ensure that the candidate selected in the second latent space is not close to the latent code in the second latent space relating to the same ground truth image. However, other variations are possible where it may be desirable to select a latent code in the second latent space that is closer to or further from the ground truth candidates, as required. In further embodiments, the system allows the user to browse synthetic textures or geometries first, before the user generates the complete asset. In step S29, the image is then generated using a combination of the selected candidates in the first and second latent spaces.
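Purely as an illustrative sketch, the flow of figure 4 could be orchestrated as follows. The generator callables, the mask-construction helper and the rejection-sampling loop are assumptions for illustration and are not part of the disclosed interface; the actual mask computation is described with reference to figures 5 and 6.

import numpy as np

rng = np.random.default_rng(0)

def generate_asset(gen_a, gen_b, Z_a_gt, Z_b_gt, build_mask_radii, max_tries=1000):
    # Step S23: sample a latent code from the first latent space (e.g. geometry).
    z_a = rng.normal(size=Z_a_gt.shape[1])
    # Step S25: build masks in the second latent space, conditioned on z_a and on
    # the ground-truth latent codes of both spaces.
    radii = build_mask_radii(z_a, Z_a_gt, Z_b_gt)
    # Select an allowable second latent code, i.e. one lying outside every mask.
    for _ in range(max_tries):
        z_b = rng.normal(size=Z_b_gt.shape[1])
        if np.all(np.abs(Z_b_gt - z_b).sum(axis=1) > radii):
            break
    else:
        raise RuntimeError("no allowable latent code found in the second latent space")
    # Step S29: generate the asset by decoding and combining the two characteristics.
    return gen_a(z_a), gen_b(z_b)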
How the first and second latent spaces can be generated will now be described with reference to figure 5. After training of the two generator networks a and b, the latent codes for the training data in the latent spaces of both generator network a and generator network b can be approximated. Normally, the mapping from the data space to the latent space is achieved by an encoder network which is trained for that purpose.
Such examples of encoder networks are found in auto-encoders and Bidirectional GAN architectures, and the same mapping can be achieved in normalising-flow-based models. However, the direct mapping from data space x to latent space z in VAEs and GANs is not enabled inherently: given ground truth data, the encoder of a VAE only produces two values, a mean and a standard deviation of a normal distribution, not a specific latent code, so there is no way to use the encoder of a VAE to get direct access to a specific latent code.
In an embodiment, the latent codes can be produced by using an iterative optimisation process, sampling from the latent space through a trained generator. Figures 5(a) and 5(b) illustrate in detail the example of approximating the latent code for the ground-truth 3D mesh x_1^a in the latent space a.
Specifically, given a trained generator neural network a, to approximate the latent code for each ground-truth data point, a latent code z_1 is sampled randomly from the latent space. This is then propagated through the trained generator neural network.
In step S303, the reconstruction loss L is calculated between the generated synthetic data x̂ and the ground-truth data x_ground truth which is to be approximated. In this instance, the loss is the squared difference between the generated synthetic data and the target ground-truth data:

L = (x̂ - x_ground truth)^2    Equation 1

The loss L is then checked in step S305 to see if it is close to zero or below a threshold. If the loss is not close to zero or below a threshold, then the gradients of the loss L are computed with respect to the weights of the trained generator neural network via backpropagation in step S307. Once the backpropagation reaches the latent space, the gradients with respect to the previously sampled latent code are estimated and the latent code is updated into a next latent code following a gradient descent method and according to a learning rate in step S309. The updated latent code is then propagated through the trained generator neural network to yield new synthetic data in step S311, which will have a lower reconstruction loss with respect to the target ground truth data. Note that the weights of the generator neural network are not updated during the gradient descent update of the latent code. The optimisation process is repeated until the reconstruction loss between the synthetic data and the target ground-truth data is minimised, at which point the latent code of the target ground-truth data has been sufficiently approximated and the process terminates in step S313. The process is shown in figure 5(b); here the progression of the successive approximations in latent space is shown as z_1, which is then updated into z_2 etc., the corresponding losses as L_1, L_2 etc., and x̂_1, x̂_2 as the synthetic data corresponding to the approximations in latent space.
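A minimal sketch of this latent-code approximation is given below, assuming the trained generator is available as a torch.nn.Module and using a plain gradient-descent update of the latent code with the generator weights frozen; the learning rate, step count and stopping threshold are illustrative values.

import torch

def approximate_latent_code(generator, x_ground_truth, latent_dim,
                            steps=2000, lr=0.01, tol=1e-6):
    # Freeze the generator weights; only the latent code is optimised.
    for p in generator.parameters():
        p.requires_grad_(False)
    z = torch.randn(1, latent_dim, requires_grad=True)    # randomly sampled initial code z_1
    optimiser = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        x_hat = generator(z)                               # propagate through the trained generator
        loss = ((x_hat - x_ground_truth) ** 2).sum()       # reconstruction loss (Equation 1)
        if loss.item() < tol:                              # step S305: loss close to zero, converged
            break
        optimiser.zero_grad()
        loss.backward()                                    # steps S307/S309: backpropagate to the latent code
        optimiser.step()                                   # gradient-descent update of the latent code only
    return z.detach()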
The process illustrated in figures 5(a) and 5(b) is repeated for all the captured 3D meshes [x_1^a, ..., x_n^a] so that the respective latent codes [z_1^a, ..., z_n^a] are approximated and stored in a set Z^a. Each entry of Z^a is a vector of dimensionality d^a (the dimensionality of latent space a). The same is done for all the ground truth texture maps [x_1^b, ..., x_n^b] so that the respective latent codes [z_1^b, ..., z_n^b] are approximated and stored in Z^b. It is possible for latent space a and latent space b to be of different dimensionality, i.e. d^a ≠ d^b. Both sets Z^a and Z^b have the same size n, equal to the number of training data:

Z^a = {z_1^a, ..., z_n^a},  Z^b = {z_1^b, ..., z_n^b}    Equation 2
Once Z^a and Z^b are known, the dynamic masking of latent space a or b is conditional on a first sampled latent code in latent space b or a respectively. Figure 3 is a conceptual illustration of the masking of a latent space b conditional on a latent code in a latent space a, so that the privacy of the data-points x_1 and x_2 is preserved according to this embodiment. The data-points x_1 and x_2 are 3D faces which belong to the training set of the generator neural networks a and b and are further decomposed into their respective 3D geometries (i.e. meshes) x_1^a and x_2^a and texture maps x_1^b and x_2^b. The latent codes for the training data x_1^a and x_2^a are z_1^a and z_2^a respectively and they exist in latent space a. The latent codes for the training data x_1^b and x_2^b are z_1^b and z_2^b respectively and they exist in latent space b. The latent code z^a represents a new embedding which yields a synthetic 3D mesh when propagated through generator neural network a, and the latent code z^b represents the embedding of a new set of synthetic texture maps which, when combined together, will yield the new synthetic 3D face x̂. For both latent spaces a and b, it is true that latent codes which are in proximity with each other yield respectively similar synthetic data.
Figure 6 shows a more detailed overview of the system and how the user interface interacts with the selection of points in the latent spaces.
In response to a request from a user to generate an asset, which is handled by the user interface 1 explained in relation to figure 1, the user samples a latent point z^a, or "latent code", from the first latent space 61 in step S101.
The user interface may allow the user some control over the sampling; for example, in its simplest form, sampling can be random from a uniform distribution. In another embodiment, the user may interact with the user interface to focus the sampling of the latent space, drawing from a non-uniform distribution over a particular region of the latent space which features a higher density of embeddings with a desired semantic label; in the present embodiment such labels could be race, BMI, age etc. In step S103, the sampled latent code z^a from S101 is input to the generator neural network to generate x̂^a = G^a(z^a), which can be viewed by the user via the user interface.
In step S105, the sampled latent code z^a from S101 is used to generate the masks in the second latent space 63. To do this, a metric of distance between z^a and each of the elements of Z^a (the set of the latent codes of the ground-truth data) is calculated in S107 and stored in the set D(z^a, Z^a). This will be used as a measure of similarity between the synthetic data x̂^a = G^a(z^a) and the ground truth data [x_1^a, ..., x_n^a]. In this embodiment (Equation 3), the Manhattan distance (L1 norm) is used. It has been demonstrated that in high-dimensional spaces, Lk norms with lower values of k are more effective at providing contrast between different points than Lk norms with higher values of k. However, other distance metrics, for example the Euclidean distance metric (L2 norm), could also be used.
Furthermore, it will be appreciated that any type of distance metric could be used.
D(z^a, Z^a) = { ||z^a - z_i^a||_1 : i = 1, ..., n }    Equation 3

As soon as the proximity between the sampled latent code and the latent codes of the ground-truth data, D(z^a, Z^a), is calculated, its elements are normalised with respect to the maximum distance that is found (Equation 4). The values in D'(z^a, Z^a) then lie in (0, 1] and express the relative distance (or dissimilarity) between the sampled latent code z^a and the latent codes of the ground-truth data Z^a.

D'(z^a, Z^a) = D(z^a, Z^a) / max D(z^a, Z^a)    Equation 4
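For concreteness, Equations 3 and 4 could be computed as in the short sketch below; the array shapes are illustrative (Z_a holds one ground-truth latent code per row).

import numpy as np

def latent_distances(z_a, Z_a):
    # Equation 3: Manhattan (L1) distance between the sampled code z_a and every
    # ground-truth latent code in Z_a (an n x d_a array, one code per row).
    return np.abs(Z_a - z_a).sum(axis=1)

def normalised_distances(z_a, Z_a):
    # Equation 4: normalise by the maximum distance so the values lie in (0, 1].
    D = latent_distances(z_a, Z_a)
    return D / D.max()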
Next, the masks in latent space b are created in step S109; these can be thought of as a set of d^b-dimensional spheres centred on the latent codes [z_1^b, ..., z_n^b], where d^b is the dimensionality of latent space b. The radii of these d^b-dimensional spheres are different from each other and depend on D'(z^a, Z^a). Given a sampled z^a, the radii of the masks in latent space b are stored in the set R(Z^b | z^a) and can be calculated as described in Equation 5.
Note that both sets D'(z^a, Z^a) and R(Z^b | z^a) have the same number of elements, equal to the number of ground-truth data.
Equation 5 defines the radii R(Z^b | z^a) as an inverse function of the corresponding elements of D'(z^a, Z^a), where τ and κ are non-negative, user-defined parameters entered via the user interface in step S111 to control the masks in latent space b in step S113. Given a sampled z^a, any latent code in latent space b which falls within the space of the masked regions is considered to be violating the privacy of the ground truth data. Equation 5 shows that the elements of R(Z^b | z^a) are inversely proportional to the corresponding elements of D'(z^a, Z^a). The inverse proportionality illustrates that the closer z^a is to any of the ground-truth embeddings [z_1^a, ..., z_n^a], the larger the respective mask of the corresponding embedding [z_1^b, ..., z_n^b] will be in latent space b. In this embodiment, the proximity of the first latent code to the latent codes of the ground-truth data in the first latent space and the size of the applied masks on the respective latent codes of the ground-truth data in the second latent space are inversely proportional. It will be appreciated that any formulation which achieves this inverse proportionality between D'(z^a, Z^a) and R(Z^b | z^a) could be used.
In this embodiment, the masks in latent space b depend on: (i) the approximated latent codes of the training data in latent space a and latent space b; (ii) the latent code z^a; and (iii) the user input τ and κ. Since the sampling of z^b does not affect the masks in latent space b, the user interface can be informed by R(Z^b | z^a) so that it only allows valid sampling of z^b. A valid latent code z^b, according to this embodiment, satisfies the inequality in Equation 6 for all the respective elements of D(z^b, Z^b) and R(Z^b | z^a).
D(z^b, Z^b)_i > R(Z^b | z^a)_i  for all i = 1, ..., n    Equation 6

The set D(z^b, Z^b) contains the proximity information between the latent code z^b and [z_1^b, ..., z_n^b] and is calculated in accordance with Equations 2 and 3 for a sampled z^b.
Ultimately, the condition imposed by Equation 6 ensures that, given a set of ground-truth data, any z^a and z^b which satisfy Equation 6 will yield x̂^a = G^a(z^a) and x̂^b = G^b(z^b) which, when combined, will yield synthetic data x̂ that is meaningfully dissimilar to all the ground-truth data and does not violate their privacy.
If the sampled z^b does not fall within any of the defined masks, it is considered to be of acceptable dissimilarity with the ground-truth data, given x̂^a = G^a(z^a). If the sampled z^b falls within any of the masks, then the user interface notifies the user and z^b is resampled.
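A sketch of the mask construction and the validity test is given below. The exact functional form of Equation 5 is not reproduced in this text, so the radius formula shown is merely one possible choice that satisfies the stated inverse proportionality; the parameters tau and kappa stand in for the user-defined inputs.

import numpy as np

rng = np.random.default_rng(0)

def mask_radii(D_norm, tau, kappa):
    # One possible realisation of Equation 5: the radii decrease as the normalised
    # distances D'(z^a, Z^a) increase, scaled by the non-negative user parameters.
    return tau + kappa * (1.0 - D_norm)

def is_allowable(z_b, Z_b, radii):
    # Equation 6: z_b is valid only if its L1 distance to every ground-truth code
    # in latent space b exceeds the corresponding mask radius.
    return np.all(np.abs(Z_b - z_b).sum(axis=1) > radii)

def sample_allowable_z_b(Z_b, radii, d_b, max_tries=1000):
    # Resample until an acceptable code is found, mirroring the notify-and-resample
    # behaviour of the user interface.
    for _ in range(max_tries):
        z_b = rng.normal(size=d_b)
        if is_allowable(z_b, Z_b, radii):
            return z_b
    raise RuntimeError("no allowable latent code found; reduce tau and/or kappa")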
In step S103, the sampled latent code z^a from S101 is input to the generator neural network G^a 201 to generate x̂^a = G^a(z^a), where x̂^a is the mapping of point z^a from the first latent space 61 to the geometry space. x̂^a will be the geometry of the generated asset. In step S115, the sampled latent code z^b is input to the generator neural network G^b 203 to generate x̂^b = G^b(z^b), where x̂^b is the mapping of point z^b from the second latent space 63 to the texture space. x̂^b will be the texture of the generated asset. The geometry of the generated asset x̂^a and the texture of the generated asset x̂^b are then combined in step S117 to form the generated asset x̂.
In the above description, the masks are spheres. However, the masks do not need to be spherical.
The above description has concentrated on a specific embodiment where the asset is a head. Embodiments described herein relate to the generation of image data which are composed of discrete (image data) components and which can be re-arranged. One such example is 3D assets in computer graphics and VFX applications. 3D assets are composed of a geometry file and corresponding texture file(s).
In real-time rendering applications (i.e. games), there is strong incentive to minimise the number of vertices in the 3D mesh so that the size of the 3D asset in the memory is small.
The most efficient way to represent a 3D geometry in that regard (minimum number of vertices) is in the form of a triangulated surface mesh. The same applies for traditional multi-view stereo reconstruction algorithms which calculate a point cloud from images of a 3D object. However, this is not a structured form of data (it is non-Euclidean), the concepts of locality and connectivity are not straightforward, and thus 2D convolutions are not applicable. Although progress is being made in modelling and representing non-Euclidean data, the performance and achievements of modelling 2D data (i.e. images) are still far ahead.
In applications where the 3D asset needs to be animated via non-rigid deformation (i.e. expressions on a character's face), the triangulated 3D mesh may be replaced by a quad-mesh so that there are no small angles between adjacent cells (i.e. edge flow). The vertices of the quad-meshes used for animation pipelines are not distributed evenly on the object; rather, they are concentrated around regions where more detail is required. Another feature of artist-made topologies (quad-meshes specific to certain asset classes, i.e. the human face) is that they are made so that they can represent different assets of the same class via non-rigid deformation. That means that the same 3D quad-mesh for a particular face can be deformed to fit any other face since all plausible faces follow the same general structure.
The quad-representation has enabled the modelling of 3D assets by means of modelling the deformation of the quad-mesh itself. For human faces in particular, this type of modelling approach is known as a 3D Morphable Model (3DMM). The consistent nature of the 3D quad-meshes for a given class of assets also enables applying any texture which is mapped to that quad-mesh (i.e. applying the texture of face B to the 3D quad-mesh of face A). Recently, this correspondence between texture space and 3D-mesh space has been used to represent the 3D geometry as an image file. This is achieved by projecting all the vertices of the 3D quad-mesh onto the texture space so that the space coordinates become equivalent to RGB values in image space. Then, interpolation is used to create values for all vacant pixels. This enables 3D meshes to be treated like images, and models to be trained which capitalise on the representational power of 2D convolutions.
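By way of illustration only, the projection of a quad-mesh into an image-like position map could be sketched as follows. The availability of per-vertex UV coordinates, the interpolation scheme and the map resolution are illustrative assumptions, not a definitive implementation.

import numpy as np
from scipy.interpolate import griddata

def geometry_to_position_map(vertices, uvs, resolution=256):
    # Rasterise 3D vertex positions into UV (texture) space so that XYZ coordinates
    # play the role of RGB values; vacant pixels are filled by interpolation.
    grid_u, grid_v = np.meshgrid(np.linspace(0.0, 1.0, resolution),
                                 np.linspace(0.0, 1.0, resolution))
    channels = []
    for c in range(3):
        linear = griddata(uvs, vertices[:, c], (grid_u, grid_v), method="linear")
        nearest = griddata(uvs, vertices[:, c], (grid_u, grid_v), method="nearest")
        channels.append(np.where(np.isnan(linear), nearest, linear))
    return np.stack(channels, axis=-1)   # shape (resolution, resolution, 3), image-like

# Toy usage: four vertices of a quad mapped onto the unit UV square
vertices = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 0.], [0., 1., 1.]])
uvs = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
position_map = geometry_to_position_map(vertices, uvs, resolution=64)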
Figure 7 illustrates the representation of a 3D geometry of a human head in the ways discussed above. In figure 7(a), a non-Euclidean graph mesh is shown, in figure 7(b) a consistent quad mesh is shown. In figure 7(c), a representation as an image file is shown.
In the above embodiment, the first characteristic is geometry and the second characteristic is texture. However, these could be reversed.
In an embodiment where the generation of a texture (for example of a human face) by the texture-generating network/s precedes the generation of geometry by the geometry-generating network, the masking of the latent space of geometries will be calculated and applied with respect to the generated texture. Again, some metric of distance between the generated texture and the latent codes of the ground-truth textures in the latent space of the texture-generating network is used, together with a user-defined input and the known correspondence between textures and geometries in the ground truth data, to calculate and apply masks of appropriate dimensionality, shape and size to the latent space of the geometry-generating network.
Whether the generation of geometry or texture occurs first and guides the combinatorial control method may be conditional to other, external control that is applied to the generative process via some form of interface or procedure, outside the scope of this disclosure. For example, in embodiments where the training data is labelled with descriptive labels and a framework/interface is provided to guide the generation (i.e. "Asian male, 30 years old, BMI of 25"), some labels may be associated more with texture data and some more with geometry data and the priority in generating geometry versus texture will reflect the priority of the given labels (e.g. BMI affects geometry whereas age affects texture, gender and race affect both).
The above example has concentrated largely on a situation where the assets are heads and where the system is being used to prevent the generation of a synthetic asset that is too close to the original training data, to avoid privacy issues. However, it will be appreciated that the system could be used in many different ways. For example, in the case of asset generation, if it is known that certain textures do not correspond well to certain geometry types, then the second latent space could have masks which also require the candidate in the second latent space to be close to certain types of ground truth assets while maintaining separation from others. Also, although the above system can be used to protect the privacy of those who provided the training data, it is also possible for the system to be used for other purposes. For example, the system can be used to generate realistic combinations of images, so-called "biased generation".
In another embodiment, two generator neural networks are trained with captured appearance data of real human faces with the purpose of generating new synthetic and realistic 3D faces. The data representation and generator networks are the same as in the privacy preservation embodiment described above. For example, one neural network (VAE) is trained to generate synthetic 3D meshes and the other neural network (GAN) is trained to generate the texture maps (all as one image). Any generated set of textures is compatible with any generated 3D mesh because they share the same dense correspondence between texture space and 3D mesh space (i.e. UV maps).
In this case, the user is concerned not with the privacy of the initial training data but with the realism of the generated synthetic faces. In this example, the realism of a synthetic face must satisfy the plausible combination of texture and geometry (morphology) data. There are notable visual features in both geometries and textures of faces which are typically coupled with each other, associated with racial origin, age and other properties of the human face. For example, faces of Asian origin typically feature circular eye-sockets whilst faces of African descent have rectangular eye-sockets and a more prominent mouth region when viewed in profile. Analogous morphological distinctions apply to faces of European origin. In the real world, morphological features of faces are coupled with certain features of the appearance of the individual, for example skin complexion and colour. An individual with the morphological facial features of African descent is expected to have a darker skin colour as well and, in that regard, an Asian skin complexion on an African face morphology may look unnatural. A similar unnatural outcome would result from applying invalid combinations of "old" skin with wrinkles on "young" face morphologies. The variation in skin appearance and higher-frequency features such as wrinkles is encoded in the texture data which is mapped onto the 3D mesh.
Since the 3D geometry data (facial morphology) and texture data are typically represented differently, the generation of synthetic data is usually assigned to distinct generator networks which are not coupled with each other and are optimised differently for the generation of each different data type. Thus, the realistic combination of synthetic face geometries and textures requires some kind of supervision post-generation.
It is also possible to train a geometry generator and a texture generator with only a subset of the training data being common to both (corresponding textures and geometries) and still preserve privacy or produce realistic results.
This task can be fully automated by providing appropriate bias and constraints to either one of the generator networks (geometry generator or texture generator), conditional to the synthetic data of the other one.
After training of the two generator networks a and b, the latent codes for the training data in both the latent space for generator network a and generator network b can be approximated, following the same optimisation method described in the privacy-preservation embodiment (figures 4 to 6). In detail, the set of latent codes [za_1, ..., za_n] which correspond to the geometry training data (captured 3D meshes) [xa_1, ..., xa_n] is approximated, and respectively [zb_1, ..., zb_n], corresponding to the ground truth texture maps [xb_1, ..., xb_n]. The correspondence between the ground truth geometry data and texture data is known, therefore the correspondence between their respective latent codes [za_1, ..., za_n] and [zb_1, ..., zb_n] is also known.
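By way of illustration only, the approximation of the latent code corresponding to one ground-truth sample, with the generator weights kept fixed, could be sketched as follows. The optimiser, loss function and step count are illustrative assumptions; generator stands for any pre-trained decoder such as Ga or Gb.

import torch

def approximate_latent_code(generator, x_ground_truth, latent_dim, steps=500, lr=1e-2):
    # Optimise a latent code so that the frozen generator reproduces one
    # ground-truth sample; only the latent code is updated.
    generator.eval()
    for p in generator.parameters():
        p.requires_grad_(False)                  # generator weights stay fixed
    z = torch.zeros(1, latent_dim, requires_grad=True)
    optimiser = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        optimiser.zero_grad()
        loss = torch.nn.functional.mse_loss(generator(z), x_ground_truth)
        loss.backward()                          # error backpropagated to z only
        optimiser.step()
    return z.detach()                            # approximated ground-truth latent code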
The next step involves leveraging the correspondence between the ground-truth latent codes in latent space a and latent space b, and approximating mapping functions from latent space a to latent space b (Fab) and the other way around (Fba). The mapping functions Fab and Fba are fully connected neural networks which are trained with [za_1, ..., za_n] and the respective [zb_1, ..., zb_n]. After training, for an arbitrary latent code za for generating a synthetic face geometry x̂a, Fab can approximate a latent code zb = Fab(za) so that the generated texture x̂b = Gb(zb) respects the semantic correspondence of the ground truth texture data to the ground truth geometry data. In that way, x̂b = Gb(zb) is constrained to be the most plausible synthetic texture, given za, learnt in an unsupervised manner from the correspondence between textures and geometries in the ground-truth data. The same applies in the opposite direction, where for an arbitrary latent code zb for generating a synthetic face texture, Fba can approximate a latent code za = Fba(zb) so that the generated geometry x̂a = Ga(za) respects the semantic correspondence of the ground truth geometry data to the ground truth texture data.
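A minimal sketch of such a mapping network is given below; the layer sizes, optimiser and training loop are illustrative assumptions. Fab is trained on the pairs of corresponding approximated ground-truth latent codes, and Fba would be trained analogously with the inputs and targets swapped.

import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    # Fully connected network mapping codes from latent space a to latent space b.
    def __init__(self, dim_a, dim_b, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_a, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim_b),
        )

    def forward(self, z_a):
        return self.net(z_a)

def train_mapper(Z_a, Z_b, epochs=200, lr=1e-3):
    # Z_a and Z_b are tensors of corresponding ground-truth latent codes.
    mapper = LatentMapper(Z_a.shape[1], Z_b.shape[1])
    optimiser = torch.optim.Adam(mapper.parameters(), lr=lr)
    for _ in range(epochs):
        optimiser.zero_grad()
        loss = nn.functional.mse_loss(mapper(Z_a), Z_b)
        loss.backward()
        optimiser.step()
    return mapper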
Up to now, the method for obtaining the most plausible combination between arbitrary synthetic face geometries and synthetic face textures has been disclosed. In some cases, the user might want to browse a number of plausible synthetic textures given a synthetic face geometry, or respectively, they might want to browse a number of plausible synthetic face geometries, given a synthetic face texture. In these cases, the browsing of any number of plausible combinations of textures and geometries can be achieved by dynamically creating adjustable bias-zones in the latent spaces of Ga or Gb.
Following from the mapping of an arbitrary latent code za to zb = Fab(za), instead of actually using zb to generate the synthetic datapoint x̂b = Gb(zb), a bias-zone is created around zb where the sampling of alternative latent codes to zb occurs from a Gaussian distribution with a standard deviation σ defined by the user, via a user interface. Then, given an arbitrary latent code za for a synthetic geometry, the user can browse any number of plausible synthetic textures for that particular synthetic geometry by sampling from the bias-zone around zb.
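By way of illustration only, the bias-zone sampling could be sketched as follows; the isotropic Gaussian form and the helper names are assumptions for the example.

import numpy as np

def sample_bias_zone(zb_centre, sigma, n_samples, rng):
    # Draw alternative latent codes around zb = Fab(za) from a Gaussian bias-zone
    # whose standard deviation sigma is chosen by the user via the interface.
    dim = zb_centre.shape[0]
    return zb_centre[None, :] + sigma * rng.standard_normal((n_samples, dim))

# Toy usage: browse five plausible textures for one synthetic geometry
# candidates = sample_bias_zone(zb_centre, sigma=0.3, n_samples=5, rng=rng)
# textures = [Gb(z) for z in candidates]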
A conceptual illustration of the system can be seen in Figure 8. Prior to the user interacting with the system of figure 8, as described in relation to figures 5(a) and 5(b), latent codes are approximated for the training data of the first data type and the second data type.
Neural networks are then trained to map the corresponding latent codes of the training data from the first latent space a 61 (of the first network) to the second latent space b 63 (of the second network).
In the system of figure 8, a user interface 411 is provided. The user interface 411 will allow the user to enter the size of the bias zone at 413. In an embodiment, the size of the bias zone is the standard deviation σ of a Gaussian distribution. It should be noted that the user might enter an exact figure or might be asked one or more questions to determine the size of the bias zone. For example: "Select similarity from 1 to 3, where 1 represents the highest similarity".
The user interface 411 will also sample the first latent space 61 to sample latent code za. Once latent code za is sampled, it is then mapped via mapping neural network 415 to produce latent code zb. A bias zone 405 is then created around the latent code zb in the second latent space 63 in step 417. The bias zone 405 is created using the user input σ from 413.
The size of the bias zone 405 is then provided to the interface 411. The interface 411 then samples codes within the bias zone 405. The latent code za from the first latent space a 61 is then fed through the first generator neural network 421 to produce the synthetic geometry x̂a = Ga(za). A sampled latent code zb from the second latent space b 63 is then fed through the second generator neural network 423 to produce the synthetic texture x̂b = Gb(zb). The synthetic geometry and the synthetic texture are then combined to form the synthetic asset x̂. As noted above, many different latent codes from the second latent space can be sampled. Synthetic texture data can be generated for each of these, and synthetic assets with the same geometry but with different textures can be provided to the user interface 411 to allow the user to select their preferred asset. It will be appreciated that changes in the particular order of generation (geometry first or texture first) or changes in the number of discrete generator neural networks which participate in the generation are possible.
It is noted that the same method can be used to perform image translation tasks where a variational output is desired (i.e. different versions of plausible image translation from a first domain to a second domain).
In this example, as for the previous example, two generator networks are trained on training data. In this example, as for the privacy example, the first network is trained on data of a first type and the second network is trained on data of a second data type. In this example, the first data type represents geometries and the second data type represents textures.
In a further example, the above two examples, (i) privacy preservation and (ii) biased generation, are combined. Such a combination provides privacy preservation of the training data and biased generation towards realistic synthetic data.
Here, as above, two generator neural networks are trained with captured appearance data of real human faces, and the user aims both to generate realistic synthetic 3D faces and to preserve the privacy of the training (ground-truth) data.
In this case, the two previously described methods can be applied jointly. Given a sampled first latent code in a first latent space (either for geometry or texture generation), both methods rely on the knowledge of the latent codes of the ground truth data in all latent spaces and the one-to-one correspondence between them. In that way, the creation of dynamic masks in the latent space for privacy preservation and the creation of a bias-zone to enforce realistic combination of synthetic data are independent of each other and can be applied autonomously.
Figure 9 shows masks and bias zones in the second latent space to achieve the goals of creation of a bias-zone to enforce realistic combination of synthetic data and privacy preservation. To avoid unnecessary repetition, like reference numerals have been used to denote like features. Here, sampling in latent space b would be allowed within bias zone 405, but the part where bias zone 405 overlapped with privacy zone 67 would not be available for sampling.
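By way of illustration only, the joint application of the bias-zone and the privacy masks could be sketched as follows, reusing the conventions of the earlier sketches; all helper names are illustrative.

import numpy as np

def sample_biased_and_private(zb_centre, sigma, Z_b, radii, rng, max_tries=1000):
    # Sample from the Gaussian bias-zone around zb_centre, rejecting any candidate
    # that falls inside a privacy mask around the ground-truth codes Z_b.
    for _ in range(max_tries):
        z_b = zb_centre + sigma * rng.standard_normal(zb_centre.shape)
        if np.all(np.linalg.norm(Z_b - z_b[None, :], axis=1) >= radii):
            return z_b                  # inside the bias-zone, outside every mask
    raise RuntimeError("bias-zone is fully covered by privacy masks; adjust sigma")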
In the above examples, the user may also choose to apply multiple levels of rule-based masking in the generation of geometries (or textures equivalently). In such cases, the union of all resulting masks in the latent space is considered for the generation process. For example, the privacy of the original training set may be preserved and a further restriction with regards to similarity to a subset of the training set may be applied (i.e. trying to obtain the geometry of an Asian face but excluding any skin colour and complexion of Asian origin).
The features that define the appearance of a human face (or any asset for that matter) are numerous and not discrete, in the sense that an attempt to label the training data with all the relevant features would be impossible. For example, how many freckles does a face need to have to be labelled as a 'freckled face'? In such a case, whether it is for the geometry- or the texture-generating network, the first layer of masks (pertinent to the privacy of the training set) is applied to the latent space. If there is any overlap of masks in it, the union is considered. Subsequently, the second restriction will apply similar masks to the latent space around the latent codes which correspond to training data of Asian faces and their complexion. The size of the masks in the second layer will depend on the user input which defines the similarity tolerance. If there is any overlap between the two layers of masks, their union is considered.
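By way of illustration only, the layering of masks and the use of their union could be sketched as follows; representing each mask layer as a set of (centre, radius) pairs is an illustrative simplification.

import numpy as np

def violates_any_layer(z_b, mask_layers):
    # Return True if z_b falls inside the union of all mask layers, e.g. a privacy
    # layer plus a layer excluding the complexion of a subset of the training data.
    for centres, radii in mask_layers:
        if np.any(np.linalg.norm(centres - z_b[None, :], axis=1) < radii):
            return True
    return False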
In the above examples, the size of the mask in the second latent space is calculated dependent on the distance between the candidate latent code in the first latent space and the projections of the ground truth assets in the first latent space. However, in other embodiments, the mask size could be fixed, or a combination of both approaches could be used.
The above examples have contained two latent spaces. However, it is also possible to apply masking with a single latent space. In this example, one generator neural network is trained with 2D images of real human faces with the purpose of generating new synthetic 2D images of faces. In this embodiment, a generative adversarial network (GAN) is trained to generate the synthetic image data. Once the generator neural network is trained, the privacy of the original ground-truth data can be preserved simply by (i) approximating the latent codes for the ground-truth data using the optimisation process described in figures 5(a) and 5(b); and (ii) applying a static mask around the approximated latent codes of the ground-truth data, defined by a user input parameter y. Any sampled latent code which is not within the applied masks yields a synthetic image which is meaningfully different from all the ground-truth data and respects their privacy. A conceptual illustration of the method can be seen in figure 10.
In figure 10, the user interface 501 accepts user input y. User input y is an indication of how different the synthetic data is from the ground truth training data. The user may input an actual figure or they may otherwise indicate how similar the synthetic data may be to the ground truth data. In a further example, the system may have a pre-set value for y and the user does not change this value.
Once a level for y is known, this is used to create masks around the ground truth points in the latent space 505. As the parameter that governs the size of the masks is y, the size of the mask around each ground-truth point is the same. The actual size can be determined by trial and error.
To generate a synthetic asset, the latent space 505 is sampled outside the masked areas. The sampling may be random or may be weighted if a specific characteristic is required. For example, if it is required to generate a synthetic asset with an Asian appearance, then the sampling may be weighted towards the ground truth assets known to have an Asian appearance.
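By way of illustration only, the single-latent-space case could be sketched as follows; the Gaussian prior used for sampling and the helper names are assumptions for the example, and y is the user-defined mask radius described above.

import numpy as np

def sample_private_latent(Z_ground_truth, y, dim, rng, max_tries=1000):
    # Sample a latent code that lies outside the fixed-radius masks of size y
    # around all approximated ground-truth latent codes, resampling as necessary.
    for _ in range(max_tries):
        z = rng.standard_normal(dim)
        if np.all(np.linalg.norm(Z_ground_truth - z[None, :], axis=1) >= y):
            return z
    raise RuntimeError("no valid sample found; reduce y")

# Toy usage; the sampled code would then be fed to the trained generator 507
rng = np.random.default_rng(0)
Z_gt = rng.standard_normal((10, 32))     # approximated ground-truth latent codes
z = sample_private_latent(Z_gt, y=0.5, dim=32, rng=rng)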
Once a latent code has been sampled, it is fed into generative network 507 to produce asset 509.
While it will be appreciated that the above embodiments are applicable to any computing system, an example computing system is illustrated in Figure 11, which provides means capable of putting an embodiment, as described herein, into effect. As illustrated, the computing system 600 comprises a processor 601 coupled to a mass storage unit 603 and accessing a working memory 605. As illustrated, a synthetic image generation model 613 is represented as a software product stored in working memory 605. However, it will be appreciated that elements of the synthetic image generation model 613 may, for convenience, be stored in the mass storage unit 603. Depending on the use, the synthetic image generation model 613 may be used within a gaming environment to generate characters.
Usual procedures for the loading of software into memory and the storage of data in the mass storage unit 603 apply. The processor 601 also accesses, via bus 609, an input/output interface 611 that is configured to receive data from and output data to an external system (e.g. an external network or a user input or output device). The input/output interface 611 may be a single component or may be divided into a separate input interface and a separate output interface.
Thus, execution of the synthetic image generation model 613 by the processor 601 will cause embodiments as described herein to be implemented.
The synthetic image generation model 613 can be embedded in original equipment, or can be provided, as a whole or in part, after manufacture. For instance, the synthetic image generation model 613 can be introduced, as a whole, as a computer program product, which may be in the form of a download, or be introduced via a computer program storage medium, such as an optical disk. Alternatively, modifications to existing synthetic image generation software can be made by an update, or plug-in, to provide features of the above described embodiment.

Implementations of the subject matter and the operations described in this specification can be realized in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be realized using one or more computer programs, i.e., one or more modules of computer program instructions, encoded on computer storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially-generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).
The above embodiments have described methods and systems that enable unsupervised control when performing image data generation with one or multiple neural network generators. The control is achieved by dynamically (or statically) regulating the sampling of the latent space of the generator(s). Some of the features of the above are: (i) the approximation of the latent codes which yield the training data, (ii) the correspondence between latent codes from different latent spaces (if more than one generator is used), (iii) the conditional regulation of a second generator neural network on the sampling of a latent code for a first generative neural network, and
(iv) a user-defined input to adjust the system.
These components are used to create appropriate masks or biases on the latent space(s) of the generator network(s).
The above can be used in: (i) privacy of synthetic 3D humans (involves dynamic and user-adjustable masking of a latent space, conditional on the sampling of the other latent space, where one latent space is used to generate an image representation of the geometry and the other is used to generate an image representation of the textures); (ii) realism of synthetic 3D humans (involves dynamic and user-adjustable biasing of the sampling of a latent space, conditional on the sampling of the other latent space, where one latent space is used to generate an image representation of the geometry and the other is used to generate an image representation of the textures); (iii) variational image translation tasks (can be thought of as enforcing realism as a bias-zone); (iv) a combination of (i) and (ii); and (v) privacy of synthetic 2D faces (involves static and user-adjustable masking of a latent space, where one latent space is used to generate synthetic image data).

While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms of modifications as would fall within the scope and spirit of the inventions.

Claims (27)

  1. CLAIMS: 1. A computer implemented method of generating an image from a plurality of ground truth images, wherein the image is described using at least two characteristics, the method comprising: providing a first generative network and a second generative network, said first generative network comprising an output corresponding to a first characteristic space and an input comprising a first latent space, said second generative network comprising an output corresponding to a second characteristic space and an input comprising a second latent space; selecting a first latent code from said first latent space, wherein the first latent space comprises first ground truth latent codes, each first ground truth latent code mapping via said first generative network to a first characteristic of a ground truth image in said first characteristic space; selecting a second latent code from allowable regions of said second latent space, wherein the second latent space comprises second ground truth latent codes, each second ground truth latent codes mapping via said second generative network to a second characteristic of a ground truth image in said second characteristic space, wherein masks are applied to said second latent space indicating allowable regions; and generating the image from the selected first latent code in the first latent space and the selected second latent code in the second latent space.
  2. 2. A computer implemented method according to claim 1, wherein the masks indicating the allowable regions are provided around second ground truth latent codes in the second latent space.
  3. 3. A computer implemented method according to claim 2, wherein the size of the masks are determined from the proximity of the first latent code to the first ground truth latent codes in the first latent space.
  4. 4. A computer implemented method according to claim 3, wherein the diameter of the masks is inversely proportional to the distance of the first latent code to the ground truth latent codes in the first latent space.
  5. 5. A computer implemented method according to any preceding claim, wherein the diameter of the masks is dependent on a user input indicating the allowed similarity of generated synthetic image to the ground truth images.
  6. 6. A computer implemented method according to any preceding claim, wherein the masks indicating allowable regions are provided around latent codes corresponding to a mapping of the first latent code to the second latent space
  7. 7. A computer implemented method according to any preceding claim, wherein the ground truth images comprises images that were used for training the network and images that were not used for training the network.
  8. 8. A computer implemented method according to any preceding claim, wherein the image is a 3D image
  9. 9. A computer implemented method according to any preceding claim, wherein the first characteristic is geometry and the second characteristic is texture.
  10. 10. A computer implemented method according to any preceding claim, wherein the image is a representation of a human.
  11. 11. A computer implemented method according to any preceding claims wherein said generative networks are selected from variational autoencoders and generative adversarial network.
  12. 12. A computer implemented method according to any preceding claim, wherein the first latent code in the first latent space is selected by random sampling.
  13. 13. A computer implemented method according to any preceding claim, wherein the second latent code in the second latent space is selected by mapping said first latent code in said first latent space to said second latent space.
  14. 14. A computer implemented method of generating an image from a plurality of ground truth images, wherein the image is described using at least two characteristics, the method comprising: providing a first generative network and a second generative network, said first generative network comprising an output corresponding to a first characteristic space and an input comprising a first latent space, said second generative network comprising an output corresponding to a second characteristic space and an input comprising a second latent space; selecting a first latent code from said first latent space, wherein the first latent space comprises first ground truth latent codes, each first ground truth latent code mapping via said first generative network to a first characteristic of a ground truth image in said first characteristic space; mapping said first latent code to select a second latent code in said second latent space; and generating the image from the selected first latent code in the first latent space and the selected second latent code in the second latent space.
  15. 15. A computer implemented method according to any of claims 1 to 11 or 13, wherein the second latent code in the second latent space is selected by: mapping said first latent code in said first latent space to a corresponding latent code in the second latent space; defining a mask around the corresponding latent code; and selecting the latent code in the second latent space from within said mask.
  16. 16. A computer implemented method according to claim 13, wherein the mapping of the first latent code in the first latent space to the corresponding latent code in the second latent space is performed using a neural network.
  17. 17. A method according any preceding claim, wherein the positions of the ground truth points in the first latent space are determined from: selecting a point to be a predicted ground truth latent code for a ground truth first characteristic in a first latent space; mapping said predicted latent code using said first generative neural network to a prediction in the first characteristic space; determining the error between the prediction and ground truth first characteristic in the first characteristic space; backpropagating the error through said first generative neural network while keeping the weights of the first generative network fixed; applying said back propagated error to the predicted ground truth latent code and updating the predicted ground truth latent code; and repeating the process until the prediction and the ground truth in the first characteristic space converge.
  18. 18. A method of obtaining a representation of an image, the method comprising: providing a generative network, said generative network comprising an output corresponding to an image space and an input comprising a latent space, receiving image data of a ground truth image; selecting a point in said latent space to be a predicted ground truth latent code for said ground truth image; mapping said predicted latent code using said generative network to a prediction in the image space; determining the error between the prediction and ground truth image in the image space; backpropagating the error through said generative network while keeping the weights of the generative network fixed; applying said back propagated error to the predicted ground truth latent code and updating the predicted ground truth latent code; and repeating the process until the prediction and the ground truth in the first characteristic space converge to obtain a representation of said ground truth image in said latent space from the prediction.
  19. 19. A computer implemented method of generating an image from a plurality of ground truth images, wherein the image is described using at least two characteristics, the method comprising: providing a first generative network and a second generative network, said first generative network comprising an output corresponding to a first characteristic space and an input comprising a first latent space, said second generative network comprising an output corresponding to a second characteristic space and an input comprising a second latent space; selecting a first latent code from said first latent space, wherein the first latent space comprises first ground truth latent codes, each first ground truth latent code mapping via said first generative network to a first characteristic of a ground truth image in said first characteristic space; mapping said first latent code to select a second latent code in said second latent space; and generating the image from the selected first latent code in the first latent space and the selected second latent code in the second latent space.
  20. 20. A computer implemented method of generating an image from a plurality of ground truth images, comprising: providing a generative network, said generative network comprising an output corresponding to an image space and an input comprising a latent space, said latent space comprises ground truth latent codes, each ground truth latent code mapping via said generative network to a characteristic of a ground truth image in said image space; applying masks around said ground truth latent codes; selecting a latent code from the latent space which is outside of the masks around the ground truth latent codes; and generating an image from the selected latent code using said generative network.
  21. 21. A computer implemented method according to claim 20, further comprising: providing a user interface to allow a user to select an indication of the allowable similarity of the generated image to the ground truth images, wherein applying said masks comprises using said user selected indication determine the size of said masks.
  22. 22. A system for generating an image from a plurality of ground truth images, wherein the image is described using at least two characteristics, the system comprising a processor adapted to: provide a first generative network and a second generative network, said first generative network comprising an output corresponding to a first characteristic space and an input comprising a first latent space, said second generative network comprising an output corresponding to a second characteristic space and an input comprising a second latent space; select a first latent code from said first latent space, wherein the first latent space comprises first ground truth latent codes, each first ground truth latent code mapping via said first generative network to a first characteristic of a ground truth image in said first characteristic space; select a second latent code from allowable regions of said second latent space, wherein the second latent space comprises second ground truth latent codes, each second ground truth latent codes mapping via said second generative network to a second characteristic of a ground truth image in said second characteristic space, wherein masks are applied to said second latent space indicating allowable regions; and generate the image from the selected first latent code in the first latent space and the selected second latent code in the second latent space.
  23. 23. A system for generating an image from a plurality of ground truth images, wherein the image is described using at least two characteristics, the system comprising a processor adapted to: provide a first generative network and a second generative network, said first generative network comprising an output corresponding to a first characteristic space and an input comprising a first latent space, said second generative network comprising an output corresponding to a second characteristic space and an input comprising a second latent space; select a first latent code from said first latent space, wherein the first latent space comprises first ground truth latent codes, each first ground truth latent code mapping via said first generative network to a first characteristic of a ground truth image in said first characteristic space; map said first latent code to select a second latent code in said second latent space; and generate and output the image from the selected first latent code in the first latent space and the selected second latent code in the second latent space.
  24. 24. A system for obtaining a representation of an image, the system comprising a processor adapted to: provide a generative network, said generative network comprising an output corresponding to an image space and an input comprising a latent space, receive image data of a ground truth image; select a point in said latent space to be a predicted ground truth latent code for said ground truth image; map said predicted latent code using said generative network to a prediction in the image space; determine the error between the prediction and ground truth image in the image space; backpropagate the error through said generative network while keeping the weights of the generative network fixed; and apply said back propagated error to the predicted ground truth latent code and updating the predicted ground truth latent code; the processor being configured to repeat the process until the prediction and the ground truth in the first characteristic space converge to obtain a representation of said ground truth image in said latent space from the prediction.
  25. 25. A system for generating an image from a plurality of ground truth images, wherein the image is described using at least two characteristics, the system comprising a processor adapted to: provide a first generative network and a second generative network, said first generative network comprising an output corresponding to a first characteristic space and an input comprising a first latent space, said second generative network comprising an output corresponding to a second characteristic space and an input comprising a second latent space; select a first latent code from said first latent space, wherein the first latent space comprises first ground truth latent codes, each first ground truth latent code mapping via said first generative network to a first characteristic of a ground truth image in said first characteristic space; map said first latent code to select a second latent code in said second latent space; and generate and output the image from the selected first latent code in the first latent space and the selected second latent code in the second latent space.
  26. 26. A system for generating an image from a plurality of ground truth images, the system comprising a processor adapted to: provide a generative network, said generative network comprising an output corresponding to an image space and an input comprising a latent space, said latent space comprises ground truth latent codes, each ground truth latent code mapping via said generative network to a characteristic of a ground truth image in said image space; apply masks around said ground truth latent codes; select a latent code from the latent space which is outside of the masks around the ground truth latent codes; and generate and output an image from the selected latent code using said generative network.
  27. 27. A computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 1 to 21.
GB2011293.4A 2020-07-14 2020-07-21 Computer vision method and system Withdrawn GB2597111A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GR20200100408 2020-07-14

Publications (2)

Publication Number Publication Date
GB202011293D0 GB202011293D0 (en) 2020-09-02
GB2597111A true GB2597111A (en) 2022-01-19

Family

ID=72338906

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2011293.4A Withdrawn GB2597111A (en) 2020-07-14 2020-07-21 Computer vision method and system

Country Status (1)

Country Link
GB (1) GB2597111A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961736A (en) * 2021-09-14 2022-01-21 华南理工大学 Method and device for generating image by text, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190318040A1 (en) * 2018-04-16 2019-10-17 International Business Machines Corporation Generating cross-domain data using variational mapping between embedding spaces
US20190354804A1 (en) * 2018-05-15 2019-11-21 Toyota Research Institute, Inc. Systems and methods for conditional image translation
US20200143079A1 (en) * 2018-11-07 2020-05-07 Nec Laboratories America, Inc. Privacy-preserving visual recognition via adversarial learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190318040A1 (en) * 2018-04-16 2019-10-17 International Business Machines Corporation Generating cross-domain data using variational mapping between embedding spaces
US20190354804A1 (en) * 2018-05-15 2019-11-21 Toyota Research Institute, Inc. Systems and methods for conditional image translation
US20200143079A1 (en) * 2018-11-07 2020-05-07 Nec Laboratories America, Inc. Privacy-preserving visual recognition via adversarial learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
arXiv.org, Chen A et al, "A Free Viewpoint Portrait Generator with Dynamic Styling", 7 Jul 2020, available from https://arxiv.org/abs/2007.03780 *
arXiv.org, Chen WT et al, "Texture Deformation Based Generative Adversarial Networks for Face Editing", 24 December 2018, available from https://arxiv.org/abs/1812.09832 *

Also Published As

Publication number Publication date
GB202011293D0 (en) 2020-09-02

Similar Documents

Publication Publication Date Title
CN111489412B (en) Semantic image synthesis for generating substantially realistic images using neural networks
US11093733B2 (en) Image processing method and apparatus, storage medium and electronic device
CN111325851B (en) Image processing method and device, electronic equipment and computer readable storage medium
CN110785767B (en) Compact linguistics-free facial expression embedding and novel triple training scheme
WO2021027759A1 (en) Facial image processing
US10552730B2 (en) Procedural modeling using autoencoder neural networks
US11615516B2 (en) Image-to-image translation using unpaired data for supervised learning
US20230044644A1 (en) Large-scale generation of photorealistic 3d models
CN108780519A (en) Structure learning in convolutional neural networks
US11250245B2 (en) Data-driven, photorealistic social face-trait encoding, prediction, and manipulation using deep neural networks
CN110555896B (en) Image generation method and device and storage medium
Tieleman Optimizing neural networks that generate images
CN115205949A (en) Image generation method and related device
JP7493532B2 (en) Changing the appearance of the hair
Lv et al. Approximate intrinsic voxel structure for point cloud simplification
CN115797606B (en) 3D virtual digital human interaction action generation method and system based on deep learning
WO2024109374A1 (en) Training method and apparatus for face swapping model, and device, storage medium and program product
US20230100427A1 (en) Face image processing method, face image processing model training method, apparatus, device, storage medium, and program product
CN114330736A (en) Latent variable generative model with noise contrast prior
CN111950389A (en) Depth binary feature facial expression recognition method based on lightweight network
CN118202388A (en) Attention-based depth point cloud compression method
CN110415261B (en) Expression animation conversion method and system for regional training
Salvi et al. Attention-based 3D object reconstruction from a single image
CN117218300B (en) Three-dimensional model construction method, three-dimensional model construction training method and device
GB2597111A (en) Computer vision method and system

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)