CN115516460A - Variational autoencoder for mixed data types - Google Patents


Info

Publication number
CN115516460A
Authority
CN
China
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number
CN202180033226.XA
Other languages
Chinese (zh)
Inventor
Cheng Zhang
Chao Ma
R. E. Turner
J. M. Hernández-Lobato
S. Tschiatschek
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB application GB2006809.4A (GB202006809D0)
Application filed by Microsoft Technology Licensing LLC
Publication of CN115516460A

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00 Computing arrangements based on biological models > G06N3/02 Neural networks
    • G06N3/084 Backpropagation, e.g. using gradient descent (under G06N3/08 Learning methods)
    • G06N3/045 Combinations of networks (under G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/047 Probabilistic or stochastic networks (under G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/088 Non-supervised learning, e.g. competitive learning (under G06N3/08 Learning methods)


Abstract

In a first stage, each of a plurality of first variational autoencoders (VAEs) is trained, each first VAE comprising: a respective first encoder arranged to encode a respective subset of one or more features of the feature space into a respective first latent representation; and a respective first decoder arranged to decode from the respective latent representation back to a decoded version of the respective subset of the feature space, wherein different subsets comprise features of different data types. A second VAE is trained in a second stage subsequent to the first stage, the second VAE comprising: a second encoder arranged to encode a plurality of inputs into a second latent representation; and a second decoder arranged to decode the second latent representation into a decoded version of the first latent representations, wherein each of the plurality of inputs comprises a different respective one of the subsets of features in combination with the respective first latent representation.

Description

Variational autoencoder for mixed data types
Background
Neural networks are used in the fields of machine learning and Artificial Intelligence (AI). A neural network is made up of a number of nodes that are connected to one another by links, sometimes referred to as edges. The input edges of one or more nodes form the inputs of the overall network, the output edges of one or more other nodes form the outputs of the overall network, and the output edges of individual nodes within the network form the input edges of other nodes. Each node represents a function of the values on its input edges, weighted by respective weights, and the result is output on its output edge(s). The weights may be adjusted in steps based on a set of empirical data (training data) so as to tend toward a state in which the network will output a desired value for a given input.
Typically, the nodes are arranged in layers having at least an input layer and an output layer. A "deep" neural network includes one or more intermediate or "hidden" layers between an input layer and an output layer. The neural network may take input data and propagate the input data through the network layer to generate output data. Some nodes in the network perform operations on the data, the results of which are passed to other nodes, and so on.
FIG. 1A presents a simplified representation of an example neural network 101 in a diagrammatic manner. The example neural network includes multiple layers of nodes 104: an input layer 102i, one or more hidden layers 102h, and an output layer 102o. In practice, there may be many nodes in each layer, but for simplicity only a few of them are illustrated. Each node 104 is configured to generate an output by performing a function on a value input to that node. The inputs of one or more nodes form the inputs of the neural network, the outputs of some nodes form the inputs of other nodes, and the outputs of one or more nodes form the outputs of the network.
At some or all of the nodes of the network, the inputs to the node are weighted by respective weights. The weights may define the connectivity between a node of a given layer and a node of the next layer of the neural network. A weight may take the form of a single scalar value, or may be modelled as a probability distribution. When the weights are defined by distributions, as in a Bayesian model, the neural network can be fully probabilistic and captures the concept of uncertainty. The values on the connections 106 between nodes may also be modelled as distributions, as shown in FIG. 1B. A distribution may be represented as a set of samples, or as a set of parameters parameterising the distribution (e.g. mean μ and standard deviation σ, or variance σ²).
The network learns by operating on data input at the input layer and adjusting the weights applied by some or all of the nodes according to the input data. There are different learning approaches, but in general each involves a forward propagation through the network from left to right in FIG. 1A, a calculation of an overall error, and a backward propagation of the error through the network from right to left in FIG. 1A. In the next cycle, each node takes the back-propagated error into account and produces a revised set of weights. In this way the network can be trained to perform its desired operation.
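The forward/backward weight-revision cycle described above can be illustrated with a deliberately minimal sketch (not taken from the patent): a single linear node with one weight, trained by gradient descent on a squared error. The data, learning rate and function are made-up toy values.

```python
# Toy training data: the node should learn the mapping y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0    # single weight, initialised arbitrarily
lr = 0.05  # learning rate (size of each weight revision step)

for epoch in range(200):
    for x, y in data:
        y_pred = w * x        # forward propagation through the "network"
        error = y_pred - y    # overall error at the output
        grad = 2 * error * x  # back-propagated gradient of squared error
        w -= lr * grad        # revise the weight against the gradient

print(round(w, 3))  # prints 2.0
```

Each cycle nudges the weight toward the state in which the network outputs the desired value for a given input, exactly the trend described above.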
The input to the network is typically a vector, each element of which represents a different corresponding feature. For example, in the case of image recognition, the elements of the feature vector may represent different pixel values; or in a medical application, the different features may represent different symptoms or patient questionnaire responses. The output of the network may be a scalar or a vector. The output may represent a classification, e.g. an indication of whether a certain object, such as an elephant, is recognised in the image, or a diagnosis of the patient in the medical example.
FIG. 1C shows a simple arrangement in which a neural network is arranged to predict a classification from an input feature vector. In a training phase, empirical data comprising a large number of input data points X is provided to the neural network, each data point comprising a set of example values of the feature vector and being labelled with a corresponding value of the classification Y. The classification Y may be a single scalar value (e.g. representing elephant or not-elephant), or a vector (e.g. a one-hot vector whose elements represent different possible classification outcomes, such as elephant, hippopotamus, rhinoceros, etc.). The possible classification values may be binary, or may be soft values representing percentage probabilities. Over many example data points, the learning algorithm adjusts the weights so as to reduce the overall error between the labelled classification and the classification predicted by the network. Once trained with a suitable number of data points, an unlabelled feature vector can be input to the neural network, and the network can predict the value of the classification based on the input feature values and the adjusted weights.
Training in this manner is sometimes referred to as a supervised approach. Other approaches are also possible, such as a reinforcement approach, in which each data point is initially unlabelled: the learning algorithm begins by guessing the corresponding output for each point, is then told whether it was correct, and gradually adjusts the weights based on each such piece of feedback. Another example is an unsupervised approach, in which the input data points are not labelled at all and the learning algorithm is instead left to infer structure from the empirical data itself. The term "training" herein is not necessarily limited to a supervised, reinforcement or unsupervised approach.
A machine learning model (also referred to as a "knowledge model") may also be formed from a plurality of constituent neural networks. An example is an autoencoder, as shown in FIGS. 4A-D. In an autoencoder, an encoder network is arranged to encode an observed input vector Xo into a latent vector Z, and a decoder network is arranged to decode the latent vector back into the real feature space of the input vector. The difference between the actual input vector Xo and the decoded version X̂o predicted by the decoder is used to adjust the weights of the encoder and decoder so as to minimise a measure of overall difference, e.g. based on an evidence lower bound (ELBO) function. The latent vector Z can be thought of as a compressed form of the information in the input feature space. In a variational autoencoder (VAE), each element of the latent vector Z is modelled as a probabilistic or statistical distribution, such as a Gaussian. In this case, for each element of Z, the encoder learns one or more parameters of the distribution, e.g. a measure of the centre point and of the spread of the distribution. For example, the centre point may be the mean, and the spread may be the variance or standard deviation. The element values input to the decoder are then randomly sampled from the learned distributions.
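The per-element sampling just described can be sketched in a few lines. The `encode` function below is a hypothetical stand-in: in a real VAE the mean and spread would be produced by a trained neural network, whereas here they are arbitrary fixed formulas chosen only to illustrate the interface.

```python
import math
import random

def encode(x):
    # Hypothetical stand-in encoder: returns the learned parameters
    # (centre point mu, spread sigma) of a Gaussian for one latent element.
    mu = 0.5 * x
    sigma = math.exp(-abs(x))  # exp(.) keeps the spread positive
    return mu, sigma

def sample_latent(mu, sigma, rng):
    # Draw the latent element value fed to the decoder: z = mu + sigma * eps
    # with eps ~ N(0, 1). Writing the sample this way (the so-called
    # reparameterisation trick) keeps it differentiable w.r.t. mu and sigma.
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

rng = random.Random(0)
mu, sigma = encode(2.0)
samples = [sample_latent(mu, sigma, rng) for _ in range(10000)]
mean_z = sum(samples) / len(samples)  # concentrates around mu
```

Averaged over many draws, the sampled latent values centre on the learned mean, while their scatter reflects the learned spread.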
The encoder is sometimes referred to as an inference network, since it infers the latent vector Z from the input observations Xo. The decoder is sometimes referred to as a generative network, since it generates a version X̂o of the input feature space from the latent vector Z.
After training, the autoencoder can be used to impute missing values in a subsequently observed feature vector Xo. Alternatively, a third network may be trained to predict the classification Y based on the latent vector, and then, after training, used to predict the classification of subsequent unlabelled observations.
Disclosure of Invention
It is recognised herein that the performance of a conventional VAE is particularly poor when the feature space of the input vector contains mixed-type data. For example, in a medical setting, one or more features of the input feature space may be categorical values (e.g. yes/no answers to a questionnaire, or gender), while one or more other features may be continuous values (e.g. height or weight). Contrast this with, say, image recognition, where all the input features may represent pixel values.
In a VAE, the performance of any imputation or prediction performed on the basis of the latent vector depends on the dimensionality of the latent space. In other words, the more elements in the latent vector (the greater the dimensionality), the better the performance (performance may be measured in terms of prediction accuracy compared with a known ground truth in some test data). However, it is recognised herein that, when modelling mixed-type data, the limiting factor of a conventional VAE is not the size of the latent vector but rather the mixed nature of the data types. It is recognised herein that in this case, increasing the latent dimensionality does not significantly improve performance. On the other hand, the computational complexity (of training, and of prediction or imputation) continues to grow with the dimensionality of the latent space (the number of elements in the latent vector Z), even once increasing the dimensionality no longer improves performance. A conventional VAE therefore cannot make efficient use of the computational complexity it incurs in applications that process mixed-type data.
It is desirable to provide a machine learning model that can process mixed types of data with reduced computational complexity for a given performance or with increased performance for a given computational complexity.
According to one aspect of the present disclosure, there is provided a method comprising a first stage and a second stage. In the first stage, the method comprises training each of a plurality of separate first variational autoencoders (VAEs), each first VAE comprising a separate respective first encoder arranged to encode a respective subset of one or more features of the feature space into a separate respective first latent representation having one or more dimensions, and a separate respective first decoder arranged to decode from the respective latent representation back to a decoded version of the respective subset of the feature space, wherein different subsets comprise features of different data types. In the second stage, after the first stage, the method comprises training a second VAE comprising a second encoder arranged to encode a plurality of inputs into a second latent representation having a plurality of dimensions, and a second decoder arranged to decode the second latent representation into a decoded version of the first latent representations, wherein each respective input of the plurality of inputs comprises a combination of a different respective one of the subsets of features with the respective first latent representation.
Since the first VAEs are trained independently of one another, the different data types do not interfere with one another during training. The second encoder and second decoder may then be trained in the subsequent stage, encoding into the second latent space and decoding back to the respective first latent representations, so as to learn the dependencies between the different data types. This two-stage approach, with its first stage separating out the different data types, provides improved performance when processing mixed data.
In a conventional ("vanilla") VAE, the dimensionality of the latent space is simply the dimensionality of the single latent vector Z between the encoder and decoder. In the presently disclosed method, it is the sum of the dimensionality of the second latent representation (the number of elements in the second latent vector) plus the dimensionality of each of the first latent representations (in embodiments, one dimension each). For example, the dimensionality may be expressed as dim(H) + D, where dim(H) is the number of elements in the second latent vector H and D is the number of features or feature subsets. However, one problem with the vanilla VAE is that it does not use its latent space very efficiently with mixed-type data, so simply increasing the size of the latent space is of little use. In contrast, because the disclosed method has a two-stage structure, it effectively has a larger latent size if H and Z have the same dimensionality; and by separating the different feature types in the first learning stage, the disclosed model puts this increase in latent size to significantly better use than the vanilla VAE. The latent space and training procedure are thus designed to use the latent space more efficiently.
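The dim(H) + D expression above can be made concrete with a trivial worked example (the particular numbers are made up for illustration, not taken from the patent):

```python
def total_latent_dim(dim_h, num_subsets):
    # dim(H) + D: the second-stage latent vector H contributes dim(H)
    # dimensions, and each of the D feature subsets contributes one
    # first-stage latent dimension (one dimension each, as in embodiments).
    return dim_h + num_subsets

# A vanilla VAE with dim(Z) = 10 has an effective latent size of 10; the
# two-stage model with dim(H) = 10 over D = 5 feature subsets has 10 + 5.
vanilla_latent = 10
two_stage_latent = total_latent_dim(10, 5)  # 15
```

So with H and Z of equal size, the two-stage model always has the larger effective latent dimensionality; the point of the disclosure is that, unlike in the vanilla VAE, this extra capacity is actually exploited.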
This summary introduces a number of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Nor is the claimed subject matter limited to implementations that solve any or all disadvantages noted herein.
Drawings
To assist in understanding embodiments of the invention and to show how these embodiments may be carried into effect, reference is made, by way of example only, to the accompanying drawings, in which:
figure 1A is a schematic diagram of a neural network,
figure 1B is a schematic diagram of a bayesian neural network node,
FIG. 1C is a schematic diagram of a neural network for predicting a classification based on input feature vectors,
figure 2 is a schematic diagram of a computing device for implementing a neural network,
fig. 3 schematically illustrates a data set comprising a plurality of data points, each data point comprising one or more feature values,
figure 4A is a schematic diagram of a variational autoencoder (VAE),
figure 4B is another schematic diagram of a VAE,
figure 4C is a high level schematic diagram of a VAE,
figure 4D is a high level schematic diagram of a VAE,
figure 5A schematically illustrates a first stage of machine learning model training according to an embodiment of the present disclosure,
figure 5B is a schematic diagram of a second stage of machine learning model training according to an embodiment of the present disclosure,
figure 5C is a high level schematic of the knowledge model of figures 5A and 5B,
figure 5D shows a variation of the decoder in the models of figures 5A and 5B,
figure 5E illustrates how the model is used to predict the classification,
figure 6 shows a part of the inference network for imputing missing values,
figure 7A shows a pair plot of three-dimensional ground-truth data,
figure 7B shows a pair-wise plot of three-dimensional data generated using a model,
figure 7C shows a pair-wise plot of three-dimensional data generated using another model,
figure 7D shows a pair-wise plot of three-dimensional data generated using another model,
figure 7E shows a pair-wise plot of three-dimensional data generated using another model,
figure 7F shows a pair-wise plot of three-dimensional data generated using another model,
FIG. 8 shows some information curves for sequential active information acquisition in some example scenarios (a) - (e), and (f) corresponding Area Under Information Curve (AUIC) comparison, and
fig. 9 is a flow chart of an overall method in accordance with the techniques of this disclosure.
Detailed Description
Due to the heterogeneity of natural data sets, deep generative models often do not perform well in practical applications. Heterogeneity arises from having different types of features (e.g. categorical features, continuous features, etc.), each with its own marginal properties, which may vary greatly. "Marginal" here refers to the distribution of the different possible values of a feature and the number of samples taking each value, ignoring correlation with other features. In other words, the shapes of the distributions of different types of features may vary greatly. The data types may include, for example: categorical (the feature value takes one of a number of non-numeric categories), ordinal (integer-valued) and/or continuous (continuously valued). A VAE will try to optimise all the likelihood functions simultaneously. In practice, some likelihood functions may take larger values, so the VAE will focus on those particular likelihoods and neglect the others. In this case the contributions of the different likelihoods to the training objective can be very different, resulting in a challenging optimisation problem in which some data dimensions are modelled well while others are modelled poorly. FIG. 7(d) shows an example in which a vanilla VAE fits some categorical variables but underperforms on the continuous data.
The use of VAEs to model mixed-type real-world data is under-explored in the literature, especially in combination with downstream decision-making tasks. To overcome the limitations of VAEs in this setting, the present disclosure provides a new approach, which may be referred to as a variational autoencoder for heterogeneous mixed-type data (VAEM). Examples of its decision-making performance in practical applications are studied later. VAEM uses a hierarchy of latent variables which is fitted in two stages. In the first stage, a type-specific VAE is learned for each dimension. These initial one-dimensional VAEs capture the marginal distribution properties and provide a latent representation that is uniform across dimensions. In the second stage, another VAE is used to capture the dependencies between the one-dimensional latent representations from the first stage.
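The two-stage data flow can be sketched structurally as follows. The `MarginalModel` class is a hypothetical stand-in for a learned one-dimensional marginal VAE: a real implementation would train an encoder/decoder pair by maximising the ELBO, whereas here a simple standardiser plays the same structural role of mapping each feature's own marginal onto a uniform, comparably-scaled representation. The toy data and class names are illustrative assumptions, not the patent's implementation.

```python
import statistics

class MarginalModel:
    """Stage-1 stand-in for a one-dimensional marginal VAE."""
    def fit(self, values):
        # "Learn" the marginal of one feature (here: just its mean/spread).
        self.mean = statistics.mean(values)
        self.std = statistics.pstdev(values) or 1.0
    def encode(self, x):
        return (x - self.mean) / self.std  # per-dimension latent z_d
    def decode(self, z):
        return z * self.std + self.mean    # back to the feature's own scale

# Toy data with very different marginals per feature (one list per feature).
columns = [[1.0, 2.0, 3.0], [100.0, 200.0, 300.0]]

# Stage 1: fit each marginal model independently, one per feature, so the
# different data types cannot interfere with one another.
marginals = []
for col in columns:
    m = MarginalModel()
    m.fit(col)
    marginals.append(m)

# Stage 2 input: for one data point, each feature value x_d is combined with
# its first-stage latent z_d; a second VAE (omitted here) would then model
# the dependencies across these uniform representations.
row = [col[0] for col in columns]
stage2_input = [(x, m.encode(x)) for x, m in zip(row, marginals)]
```

Note that whatever the raw scales of the features (here 1-3 versus 100-300), the first-stage latents land on a comparable scale, which is what makes the second-stage dependency model tractable across mixed types.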
Thus, an improved model is provided for heterogeneous mixed-type data that alleviates the limitations of conventional VAEs. In embodiments, VAEM is a deep generative model for heterogeneous mixed-type data.
The present disclosure will examine the quality of the data generated by VAEM, compared with a VAE and other baselines, on five different datasets (see e.g. FIG. 8). The results show that VAEM models mixed-type data more successfully than the other baselines.
In embodiments, the VAEM may be extended to handle missing data, to perform conditional data generation, and to be used with an algorithm that enables efficient sequential active information acquisition. It will be shown herein that VAEM performs strongly in conditional data generation and sequential active information acquisition in cases where a VAE performs poorly.
The two-stage VAEM model will be discussed in more detail later with reference to fig. 5A. However, a general overview of neural networks and their use in VAEs will first be discussed with reference to fig. 2 to 4D.
Fig. 2 illustrates an example computing device 200 for implementing an Artificial Intelligence (AI) algorithm including a Machine Learning (ML) model according to embodiments described herein. Computing device 200 may comprise one or more user terminals, such as a desktop computer, a laptop computer, a tablet computer, a smartphone, a wearable smart device such as a smart watch, or the on-board computer of a vehicle such as a car. Additionally or alternatively, computing device 200 may comprise a server. A server here refers to a logical entity which may comprise one or more physical server units located at one or more geographical sites. Where required, distributed or "cloud" computing techniques are known per se in the art. The one or more user terminals and/or the one or more server units of the server may be connected to one another via a packet-switched network, which may comprise a wide-area internetwork such as the Internet, a mobile cellular network such as a 3GPP network, a wired local area network (LAN) such as Ethernet, or a wireless LAN such as a Wi-Fi, Thread or 6LoWPAN network.
The computing device 200 includes a controller 202, an interface 204, and an Artificial Intelligence (AI) algorithm 206. The controller 202 is operatively coupled to each of the interface 204 and the AI algorithm 206.
Each of the controller 202, interface 204, and AI algorithm 206 may be implemented in the form of software code embodied on computer readable memory and run on a processing device including one or more processors, such as CPUs, work accelerator co-processors, such as GPUs, and/or other application specific processors, implemented on one or more computer terminals or units at one or more geographic locations. The memory storing the code may include one or more storage devices utilizing one or more storage media (e.g., electronic or magnetic media) which may also be implemented on one or more computer terminals or units at one or more geographic locations. In an embodiment, one, some, or all of controller 202, interface 204, and AI algorithm 206 may be implemented on a server. Alternatively, respective instances of one, some, or all of these components may be implemented partially or even fully on each, some, or all of one or more user terminals. In a further example, the functionality of the above-described components may be split between any combination of user terminals and servers. Note again that distributed computing techniques are known per se in the art, where required. Nor does it exclude that one or more of the elements may be implemented in dedicated hardware.
The controller 202 includes control functions for coordinating the functions of the interface 204 and the AI algorithm 206. The interface 204 refers to a function for receiving and/or outputting data. The interface 204 may include a User Interface (UI) for receiving and/or outputting data to and/or from one or more users, respectively; or it may include an interface to one or more other external devices that may provide an interface to one or more users. Alternatively, the interface may be arranged to collect data from an automation function or a device implemented on the same device and/or one or more external devices and/or output data, for example from a sensor device such as an industrial sensor device or an internet of things device. In the case of interfacing with an external device, the interface 204 may include a wired or wireless interface for communicating with the external device through a wired or wireless connection, respectively. The interface 204 may include one or more component types of interfaces, such as a voice interface and/or a graphical user interface.
Thus the interface 204 is arranged to collect observations (i.e. observed values) of the various features of the input feature space. For example, it may be arranged to collect inputs from one or more users via a UI front end, e.g. microphone, touch screen, etc.; or to collect data automatically from unmanned devices such as sensor devices. The logic of the interface may be implemented on a server and arranged to collect data from one or more external devices, e.g. user devices or sensor devices. Alternatively, some or all of the logic of the interface 204 may be implemented on the user device or sensor device itself.
The controller 202 is configured to control the AI algorithm 206 to perform operations according to the embodiments described in the present disclosure. It will be appreciated that any of the operations of the present disclosure may be performed by the AI algorithm 206 under the control of the controller 202: the controller 202 collects empirical data from a user and/or automated process via the interface 204 and passes it to the AI algorithm 206, receives predictions back from the AI algorithm, and outputs the predictions to a user and/or automated process via the interface 204.
The Machine Learning (ML) algorithm 206 includes a machine learning model 208, including one or more component neural networks 101. Such machine learning models 208 may also be referred to as knowledge models. The machine learning algorithm 206 further comprises a learning function 209 arranged to adjust the weights w of the nodes 104 of the neural network 101 of the machine learning model 208 according to a learning process (e.g. training based on a set of training data).
Fig. 1A illustrates the principle of a neural network. The neural network 101 includes a graph of interconnected nodes 104 and edges 106 connecting between the nodes, all of which are implemented in software. Each node 104 has one or more input edges and one or more output edges, with each node of at least some of the nodes 104 having multiple input edges and each node of at least some of the nodes 104 having multiple output edges. The input edges of one or more nodes 104 form the overall input 108i of the graph (typically an input vector, i.e., having multiple input edges). The output edges of one or more nodes 104 constitute the overall output 108o of the graph (which may be an output vector in the case of multiple output edges). Furthermore, the output edges of at least some of the nodes 104 form the input edges of at least some of the other nodes 104.
Each node 104 represents a function of the input values received on its input edge 106i, the output of which is output onto the output edge 106o of the respective node 104, such that the value output at the output edge 106o of the node 104 depends on the respective input value according to the respective function. The function of each node 104 is also parameterized by one or more respective parameters w, sometimes also referred to as weights (not necessarily weights in the sense of multiplicative weights, but this is certainly a possibility). Thus, the relationship between the values of the inputs 106i and the outputs 106o of each node 104 depends on the respective function of the node and its respective weight.
Each weight may be simply a scalar value. Alternatively, as shown in FIG. 1B, at some or all of the nodes 104 of the network 101 the respective weight may be modelled as a probability distribution, such as a Gaussian. In this case the neural network 101 is sometimes referred to as a Bayesian neural network. Optionally, the value input/output on each of some or all of the edges 106 may also each be modelled as a respective probability distribution. For any given weight or edge, the distribution may be modelled by a set of samples of the distribution, or by a set of parameters parameterising the respective distribution, e.g. a pair of parameters specifying its centre point and width (e.g. its mean μ and standard deviation σ, or variance σ²). The value of the weight or edge may be a random sample from its distribution. Learning the weights or edge values may comprise adjusting one or more parameters of each of the distributions.
As shown in fig. 1A, the nodes 104 of the neural network 101 may be arranged in multiple layers, each layer including one or more nodes 104. In so-called "deep" neural networks, the neural network 101 includes an input layer 102i, the input layer 102i containing one or more input nodes 104i, one or more hidden layers 102h (also referred to as inner layers), each hidden layer including one or more hidden nodes 104h (or internal nodes), and an output layer 102o including one or more output nodes 104o. For simplicity, only two hidden layers 102h are shown in fig. 1A, but more hidden layers may be present.
The different weights of the various nodes 104 in the neural network 101 may be adjusted in steps based on a set of empirical data (training data) in order to trend toward a state where the output 108o of the network will produce a desired value for a given input 108 i. For example, the neural network 101 may first be trained for an actual application before use in that application. Training involves inputting empirical data in the form of training data to the inputs 108i of the graph and then adjusting the weights w of the nodes 104 in accordance with the feedback from the outputs 108o of the graph. The training data comprises a plurality of different input data points, each data point containing a value or vector of values corresponding to an input edge 108i of the graph 101.
For example, consider the simple example shown in FIG. 1C, where the machine learning model comprises a single neural network 101 arranged to take a feature vector X as its input 108i and to output a classification Y as its output 108o. The input feature vector X comprises a plurality of elements x_d, each representing a different feature d = 0, 1, 2, … etc. E.g. in the example of image recognition, each element of the feature vector X may represent a respective pixel value. For instance, one element represents the red channel for pixel (0,0); another element represents the green channel for pixel (0,0); another element represents the blue channel for pixel (0,0); another element represents the red channel for pixel (0,1); and so forth. As another example, where the neural network is used to make a medical diagnosis, each element of the feature vector may represent a value of a different symptom of the subject, a physical feature of the subject, or another condition of the subject (e.g. body temperature, blood pressure, etc.).
Fig. 3 shows an example data set comprising a plurality of data points i = 0, 1, 2, …. Each data point i comprises a respective set of values of the feature vector (where x_id is the value of the d-th feature in the i-th data point). The input feature vector X_i represents the input observations for a given data point, where in general any given observation i may or may not comprise a complete set of values for all the elements of the feature vector X. The classification Y_i represents a corresponding classification of the observation i. In the training data, the classification Y_i is specified along with the observed feature vector elements for each data point (an input data point in the training data is said to be "labelled" with the classification Y_i). In a subsequent prediction phase, the classification Y is predicted by the neural network 101 for a further input observation X.
The classification Y may be a scalar or a vector. For example, in a simple example of an elephant recognizer, Y could be a single binary value representing elephant or not-elephant, or a soft value representing a probability or confidence that the image comprises an image of an elephant. Similarly, if the neural network 101 is used to test for a particular medical condition, Y could be a single binary value representing whether or not the subject has the condition, or a soft value representing a probability or confidence that the subject has the condition. As another example, Y could comprise a "1-hot" vector, where each element represents a different animal or condition. E.g. Y = [1, 0, 0, …] represents an elephant, Y = [0, 1, 0, …] represents a hippopotamus, Y = [0, 0, 1, …] represents a rhinoceros, etc. Or if soft values are used, Y = [0.81, 0.12, 0.05, …] represents an 81% confidence that the image comprises an elephant, 12% confidence that it comprises a hippopotamus, 5% confidence that it comprises a rhinoceros, etc.
In the training phase, the true value of the classification Y_i for each data point i is known. For each training data point i, the AI algorithm 206 measures the resulting output value(s) at the output edge(s) 108o of the graph, and uses this feedback to gradually tune the different weights w of the various nodes 104, so that over many observed data points the weights tend toward a state whereby the output 108o (Y) of the graph 101 is as close as possible to the actual observed values in the empirical data of the training inputs (over some overall error metric). I.e. for each piece of input training data, the predetermined training output is compared with the actual observed output of the graph 108o. This comparison provides feedback which, over many pieces of training data, is used to gradually tune the weights of the various nodes 104 in the graph toward a state whereby the actual output 108o of the graph will closely match the desired or expected output for a given input 108i. Examples of such feedback techniques include stochastic back-propagation.
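By way of illustration only, the feedback loop described above can be sketched in Python with a single linear node standing in for the graph 101. The toy data, learning rate, squared-error feedback and update rule are assumptions of this sketch, not part of the disclosed apparatus.

```python
# Minimal sketch of supervised weight tuning: for each labelled data
# point, the output y_hat is compared to the label y, and the error is
# fed back to nudge the weight w and bias b toward the target.
def train(data, lr=0.1, epochs=200):
    """data: list of (x, y) pairs; returns the learned weight and bias."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            y_hat = w * x + b      # forward pass (output 108o)
            err = y_hat - y        # feedback: compare to labelled Y
            w -= lr * err * x      # gradually tune the weights
            b -= lr * err
    return w, b

# Toy labelled data generated from y = 2x + 1:
w, b = train([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

After enough passes over the training data, w and b settle close to the values that generated the labels, illustrating how the comparison at the output drives the weights toward a matching state.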
After training, neural network 101 may be used to infer the value of output 108o (Y) for a given value of input vector 108i (X), and vice versa.
Explicit training based on labelled training data is sometimes referred to as a supervised approach. Other approaches to machine learning are also possible. Another example is the reinforcement approach. In this case, the neural network 101 begins by making a prediction of the classification Y_i for each data point i, at first with little or no accuracy. After making the prediction for each data point i (or at least some of them), the AI algorithm 206 receives feedback (e.g. from a human) as to whether the prediction was correct, and uses this to tune the weights so as to perform better next time. Another example is referred to as the unsupervised approach. In this case the AI algorithm receives no labelling or feedback, and is instead left to infer structure in the empirical input data on its own.
Fig. 1C is a simple example of the use of a neural network 101. In some cases, the machine learning model 208 may comprise a structure of two or more constituent neural networks 101.
Fig. 4A schematically illustrates one such example, known as a variational auto-encoder (VAE). In this case, the machine learning model 208 comprises an encoder 208q comprising an inference network, and a decoder 208p comprising a generative network. Each of the inference network and the generative network comprises one or more constituent neural networks 101, such as shown in FIG. 1A. As referred to herein, an inference network means a neural network arranged to encode an input into a latent representation of that input, and a generative network means a neural network arranged to at least partially decode from a latent representation.
The encoder 208q is arranged to receive the observed feature vector X_o as an input and to encode it into a latent vector Z (a representation in a latent space). The decoder 208p is arranged to receive the latent vector Z and decode it back into the original feature space of the feature vector. The version of the feature vector output by the decoder 208p may be labelled X̂ herein.

The latent vector Z is a compressed (i.e. encoded) representation of the information contained in the input observations X_o. No one element of the latent vector Z necessarily represents directly any real-world quantity, but the vector Z as a whole represents the information in the input data in compressed form. It could be considered conceptually to represent abstract features abstracted from the input data X_o, such as "wrinkliness" and "trunk-likeness" in the example of elephant recognition (though no one element of the latent vector Z can necessarily be mapped onto any one such factor, and rather the latent vector Z as a whole encodes such abstract information). The decoder 208p is arranged to decode the latent vector Z back into values in the real-world feature space, i.e. back into an uncompressed form X̂ representing the actual observed properties (e.g. pixel values). The decoded feature vector X̂ has the same number of elements as the input vector X_o, representing the same respective features.
The weights w of the inference network (encoder) 208q are labelled φ, and the weights w of the generative network (decoder) 208p are labelled θ. Each node 104 applies its own respective weight, as illustrated in fig. 4A.
For each data point in the training data (each piece of empirical data in the learning process), the learning function 209 tunes the weights φ and θ so that the VAE 208 learns to encode the feature vector X into the latent space Z and back again. For instance, this may be done by minimizing a measure of difference between q_φ(Z_i|X_i) and p_θ(X_i|Z_i), where q_φ(Z_i|X_i) is a function parameterized by φ representing a vector of probability distributions of the elements of Z_i output by the encoder 208q given an input value of X_i, whilst p_θ(X_i|Z_i) is a function parameterized by θ representing a vector of probability distributions of the elements of X_i output by the decoder 208p given Z_i. The symbol "|" means "given". The model is trained to reconstruct X_i and therefore maintains a distribution over X_i. At the "input side", the value of X_oi is known, and at the "output side", the likelihood of X_i under the output distribution of the model is evaluated. Generally, p(z|x) is referred to as the posterior and q(z|x) as the approximate posterior, whilst p(z) and q(z) are referred to as the priors.
This may be done, for example, by minimizing the Kullback-Leibler (KL) divergence between q_φ(Z_i|X_i) and p_θ(X_i|Z_i). The minimization may be performed using an optimization function such as an ELBO (evidence lower bound) function, which uses a cost function minimized based on gradient descent. An ELBO function may be referred to by way of example in the present disclosure, but this is not limiting, and other metrics and functions for tuning the encoder and decoder networks of a VAE are also known in the art.
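The KL term referred to above has a well-known closed form when both distributions are Gaussian. The following Python sketch (function names are illustrative) computes KL(N(μ, σ²) ‖ N(0, 1)), the regularization term that appears in an ELBO-style objective alongside the reconstruction log-likelihood.

```python
import math

def kl_gaussian_std_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ),
    the latent regularization term of a Gaussian VAE."""
    return 0.5 * (mu * mu + sigma * sigma - 1.0) - math.log(sigma)

def elbo(recon_log_likelihood, mu, sigma):
    """ELBO = reconstruction log-likelihood minus the KL penalty.
    Maximizing this (or minimizing its negative by gradient descent)
    tunes the encoder and decoder weights."""
    return recon_log_likelihood - kl_gaussian_std_normal(mu, sigma)
```

When the approximate posterior equals the standard-normal prior (μ = 0, σ = 1), the KL penalty is zero, so the ELBO reduces to the reconstruction term alone.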
The requirement to learn to encode into Z and back again amounts to imposing a constraint on the overall neural network 208 of the VAE formed of the constituent neural networks of the encoder 208q and decoder 208p. This is the general principle of an auto-encoder. The purpose of forcing the auto-encoder to learn to encode and then decode a compressed form of the data is that this can achieve one or more advantages in the learning compared to a generic neural network; such as learning to ignore noise in the input data, giving better generalization, or because the compressed form provides better gradient information on how to quickly converge to a solution when far from the solution. In a variational auto-encoder, the latent vector Z is subject to an additional constraint that it follows a predetermined form of probability distribution, such as a multi-dimensional Gaussian distribution or gamma distribution.
Figure 4B shows a more abstract representation of the VAE as shown in figure 4A.
Fig. 4C shows a higher-level representation of the VAE as shown in figs. 4A and 4B. In fig. 4C, the solid lines represent the generative network of the decoder 208p, and the dashed lines represent the inference network of the encoder 208q. In this form of the diagram, a vector shown within a circle represents a vector of distributions. So here, each element of the feature vector X (= x_1 … x_d) is modelled as a distribution, as shown in FIG. 1C. Similarly, each element of the latent vector Z is modelled as a distribution. On the other hand, a vector without a circle represents a fixed point. So in the illustrated example, the weights of the generative network are modelled as simple values rather than distributions (though that would also be a possibility). The rounded rectangle labelled N represents a "plate", meaning that the vectors within the plate are iterated over N learning steps (one per data point), i.e. i = 0, …, N−1. A vector outside the plate is global, i.e. it does not scale with the number of data points i (nor with the number of features d in the feature vector). The rounded rectangle labelled D represents that the feature vector X comprises a plurality of elements x_1 … x_d.
The VAE 208 may be put to practical use in a number of ways. One use, once the VAE has been trained, is to generate new, unobserved instances of the feature vector X by inputting random or unobserved values of the latent vector Z into the decoder 208p. For example, if the feature space of X represents the pixels of an image, and the VAE has been trained to encode and decode images of human faces, then by inputting random values of Z into the decoder 208p it is possible to generate new faces that do not belong to any of the subjects sampled during training. This could be used, e.g., to generate fictional characters for movies or video games.
Another use is to interpolate missing values. In this case, once the VAE has been trained, an instance of the input vector X_o may be input to the encoder 208q, but with one or more values missing. I.e. the values of one or more (but not all) of the elements of the feature vector X_o are not observed. The values of these elements (representing the unobserved features) may be set to zero, or 50%, or some other predetermined value representing "unobserved". The value of the corresponding element(s) in the decoded version X̂ of the feature vector may then be read out from the decoder 208p in order to interpolate the missing value(s). The VAE may also be trained with some of the data points missing values of some of the features.
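The interpolation flow just described can be sketched in Python as follows. The encode/decode stubs stand in for the trained VAE 208; the function names and the choice of 0.0 as the "unobserved" placeholder are assumptions of this sketch.

```python
# Sketch of missing-value interpolation: unobserved elements of the
# input vector are replaced by a predetermined placeholder before
# encoding, and the same positions are read back out of the decoded
# vector X-hat.
UNOBSERVED = 0.0  # predetermined value representing "unobserved"

def interpolate(x_partial, encode, decode):
    """x_partial: list where None marks a missing feature.
    Returns {feature index: interpolated value}."""
    missing = [d for d, v in enumerate(x_partial) if v is None]
    x_in = [UNOBSERVED if v is None else v for v in x_partial]
    x_hat = decode(encode(x_in))           # full decoded vector X-hat
    return {d: x_hat[d] for d in missing}  # read out only missing slots

# With identity stubs in place of the trained networks, the read-out
# values are simply the placeholder:
filled = interpolate([1.5, None, 0.2], encode=lambda x: x, decode=lambda z: z)
```

With a real trained VAE in place of the stubs, the decoded values at the missing positions would instead reflect what the model has learned about the dependencies between features.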
Another possible use of a VAE is for predicting a classification, similar to the idea described in relation to fig. 1A. In such a case, as shown in fig. 4D, a further decoder 208pY is provided, arranged to decode the latent vector Z into a classification Y, which could be a single element or a vector of multiple elements (e.g. a 1-hot vector). During training, each input data point (each observation of X_o) is labelled with an observed value of the classification Y, and the further decoder 208pY is thereby trained to decode the latent vector Z into the classification Y. After training, this can then be used to input an unlabelled feature vector X_o and have the decoder 208pY generate a prediction of the classification Y for the observed feature vector X_o.
An improved method of forming the machine learning model 208' according to an embodiment of the present disclosure is now described with reference to fig. 5A-5E. In particular, the methods of the present disclosure are particularly well suited to processing mixed types of data. For example, a Machine Learning (ML) model 208' may be used in place of the standard VAE in the apparatus 200 of fig. 2 for prediction or interpolation.
The model is trained in two stages. In the first stage, an individual VAE is trained separately for each individual feature or feature type, without the features influencing one another. In the second stage, a further VAE is trained to learn the correlations between the features.
Both a conventional VAE and the presently disclosed form of VAE employ multiple likelihood functions. However, a problem with the conventional VAE is that it attempts to optimize all of the likelihood functions simultaneously. In practice some likelihood functions may have larger values than others, so the VAE will focus on certain of the likelihood functions and ignore the others. In contrast, the presently disclosed method optimizes each likelihood function individually, thus alleviating this problem.
As shown in fig. 5A, in the first stage, an individual VAE is trained for each of a plurality of different subsets X_o1, X_o2, X_o3 of the observed feature vector X_o (i.e. different subsets of the feature space of the vector). Three subsets are shown by way of example, but it will be noted that other numbers are possible. Each subset comprises a different one or more of the features in the feature space; i.e. each subset is a different one or more of the elements of the feature vector X_o. Different ones of the subsets comprise features of different data types. The types may for example comprise two or more of: categorical, ordinal, and continuous. Categorical refers to data whose value takes one of a discrete number of categories. An example of this would be the answer to a question with some number of qualitative answers. In some cases, categorical data can be divided into two types: binary categorical data and non-binary categorical data. An example of binary data is the answer to a yes/no question, or smoker/non-smoker. An example of non-binary categorical data could be gender, e.g. male, female or other; or town or country of residence, etc. An example of ordinal data is age measured in whole years, or the answer to a question on a scale of 1 to 10, or one to five stars, etc. An example of continuous data is weight or height. Notably, these different types of data have very different statistical properties.
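Purely as an illustration of the kind of mixed-type feature grouping described above, the following Python fragment partitions a set of invented feature names by data type; all of the names and type tags are hypothetical examples, not part of the disclosure.

```python
# Hypothetical mixed-type record: each feature is tagged with its
# statistical type, of the kinds discussed above.
FEATURE_TYPES = {
    "smoker": "binary-categorical",   # yes/no answer
    "country": "categorical",         # non-binary category
    "stars": "ordinal",               # one-to-five-star rating
    "weight_kg": "continuous",        # continuous measurement
}

def group_by_type(feature_types):
    """Partition feature names into subsets X_o1, X_o2, ... by data type."""
    groups = {}
    for name, kind in feature_types.items():
        groups.setdefault(kind, []).append(name)
    return groups

groups = group_by_type(FEATURE_TYPES)
```

A grouping of this sort corresponds to the alternative mentioned below in which features of like type are gathered into a single subset for training by one of the individual VAEs.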
The subsets may be labelled herein d = 1 … D, where d is the index of the subset and D is the total number of subsets. In embodiments, each subset X_od is an individual single feature. For instance one feature X_o1 could be gender, another feature X_o2 could be age, and another feature X_o3 could be weight (e.g. in an example of predicting or estimating a medical condition of a user). Alternatively, features of like type could be grouped together into a single subset for training by a given one of the VAEs. E.g. one subset X_o1 could consist of the categorical variables, another subset X_o2 could consist of the ordinal variables, and another subset X_o3 could consist of the continuous variables.
Each individual VAE comprises a respective first encoder 208q_d (d = 1 … D) arranged to encode the respective feature subset X_od into a respective first latent representation Z_d (i.e. into a respective latent space). Each individual VAE also comprises a respective first decoder 208p_d (d = 1 … D) arranged to decode the respective latent representation Z_d back into the respective dimension(s) of the feature space of the respective feature subset, i.e. to generate a respective decoded version X̂_od of the respective observed feature subset X_od. Thus X_o1 is encoded into Z_1 and then decoded into X̂_o1, X_o2 is encoded into Z_2 and then decoded into X̂_o2, and X_o3 is encoded into Z_3 and then decoded into X̂_o3 (and so forth if there are more than three feature subsets).
In embodiments, each latent representation Z_d is one-dimensional, i.e. consists of only a single latent variable (element). Note however that this does not mean that the latent variable Z_d is simply a fixed scalar value. Rather, since the auto-encoders are variational auto-encoders, the encoder learns a statistical or probabilistic distribution for each of the latent variables Z_d, and the value input to the decoder is a random sample from that distribution. This means that for each individual element of the latent space, the encoder learns one or more parameters of the respective distribution, e.g. a measure of the center point and spread of the distribution. For instance, each of the latent variables Z_d (each a single dimension) may be modelled in the encoder by a respective mean μ_d and standard deviation σ_d or variance σ_d². The possibility of a multi-dimensional Z_d is not excluded (in which case each dimension is modelled by one or more parameters of its own individual distribution), though this increases the computational complexity, and in general the idea of a latent representation is to compress the information in the input feature space down to a lower dimensionality.
In the first stage, the learning function 209 (e.g. an ELBO function) trains each VAE (i.e. tunes its weights) so as to minimize a measure of difference between the respective observed feature subset X_od and the respective decoded version X̂_od of that feature subset.
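The structure of this first stage, with one VAE per feature subset trained in isolation, can be sketched as follows in Python. The FeatureVAE class is a hypothetical stub that merely records the column it was trained on; a real implementation would run the ELBO-based optimization described above in its place.

```python
# Structural sketch of stage one: D independent per-feature VAEs, each
# fitted to its own column of the data without seeing any other column.
class FeatureVAE:
    def __init__(self, feature_index):
        self.d = feature_index
        self.trained_on = None

    def fit(self, column):
        """Stub standing in for ELBO training of encoder 208q_d and
        decoder 208p_d on the single feature subset X_od."""
        self.trained_on = list(column)
        return self

def stage_one(data_rows, num_features):
    """Train D independent individual VAEs, one per feature d."""
    return [
        FeatureVAE(d).fit(row[d] for row in data_rows)
        for d in range(num_features)
    ]

# Two toy data points with D = 2 features each:
feature_vaes = stage_one([[0.1, 5.0], [0.3, 7.0]], num_features=2)
```

The point of the sketch is that the D training runs share no state: each VAE only ever receives values of its own feature subset, matching the separation described above.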
As shown in fig. 5B, the second stage employs a second encoder 208qH and a second decoder 208pH, which together form a second VAE. The second stage of the method comprises training this second VAE.
At the input of the second encoder 208qH, each feature subset X_od is combined with its respective latent representation Z_d (the values of Z_d as learned using the first VAEs in the first stage). In embodiments the combination comprises a concatenation of each feature subset X_od with its respective latent representation Z_d. In principle however any function combining the two pieces of information could be used, such as a multiplication or interleaving, etc. Whatever function is used, each such combination forms an input to the second encoder 208qH. The second encoder 208qH is arranged to encode these inputs into a second latent representation in the form of a latent vector H, having a plurality of dimensions (each dimension, i.e. each element of the vector, being modelled as a respective distribution and thus represented by one or more parameters of the respective distribution, e.g. a respective mean and variance or standard deviation). The latent vector H may also be written h herein (in vector form), not to be confused with the function h(·) discussed later.
The second decoder 208pH is arranged to decode the second latent representation H back into a version Ẑ of the respective first latent representations. In the second learning stage, the second VAE is trained (i.e. its weights tuned) by the learning function 209 so as to minimize a measure of difference between the first latent representations Z and their decoded version Ẑ (where Z is the vector formed of the individual first latent representations Z_1, Z_2, Z_3, …; and Ẑ is the corresponding decoded version). In figs. 5A-5B, the weights of the first encoders 208q_d are labelled φ, the weights of the first decoders 208p_d are labelled θ, the weights of the second encoder 208qH are labelled λ, and the weights of the second decoder 208pH are labelled ψ.
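The way the second encoder's inputs are formed, each feature subset combined with its first-stage latent value, can be sketched as follows in Python. The per-feature encoder stubs and the pairing of values into (x_d, z_d) tuples are illustrative assumptions standing in for the trained first encoders and for whichever combining function (concatenation, multiplication, etc.) is chosen.

```python
# Sketch of how the inputs to the second encoder 208qH are assembled:
# each observed feature value x_d is combined with the latent value
# z_d produced for it by the corresponding stage-one encoder.
def combine_inputs(x_obs, feature_encoders):
    """Return the list of (x_d, z_d) combinations fed to the second
    encoder; here "combination" is simply pairing/concatenation."""
    return [(x_d, enc(x_d)) for x_d, enc in zip(x_obs, feature_encoders)]

# Stub per-feature encoders z_d = f_d(x_d), in place of 208q_1, 208q_2:
encoders = [lambda x: 2 * x, lambda x: x - 1]
pairs = combine_inputs([0.5, 3.0], encoders)
```

In a real model each z_d would be a random sample from the learned per-feature posterior distribution rather than a deterministic function of x_d.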
A more abstract, higher-level representation of model 208' is shown in FIG. 5C.
Based on this two-stage approach, the model thus first learns the individual properties of the different data types separately from one another, and then learns the dependencies between the data types.
The computational complexity of an auto-encoder increases with the dimensionality of the latent space. For example, consider a conventional VAE 208 as shown in figs. 4A-D, encoding into a latent vector Z with 20 elements (each modelled as a distribution). The dimensionality of this is said to be 20. Now consider an implementation of the presently disclosed model 208' in which H has 20 elements, each Z_d is a single element, and D = 3. In this case the total dimensionality of the latent space is 20 + 3×1 = 23; or more generally dim(H) + D, if dim(H) is the number of dimensions (elements) of H and D is the number of features or feature subsets (i.e. the number of first encoders/decoders). However, one problem with a conventional VAE is that it does not use the latent space very efficiently for mixed-type data, so simply increasing the size of the latent space brings little benefit. In contrast, by separating out the different feature types in the first learning stage, the increase in latent size in the disclosed model improves performance compared to the conventional VAE. The latent space and training procedure are thus designed to make more efficient use of the latent space.
Once trained, as shown in FIG. 5D, the model 208' can be used to make predictions or interpolations in a manner similar to that described in relation to figs. 4A-4D. Values of the second latent vector H are input to the second decoder 208pH, which decodes them into the first latent vector Z, and then each element Z_1, Z_2, Z_3, … of the first latent vector Z is decoded by its respective first decoder 208p1, 208p2, 208p3, …. For instance, random or unobserved values of the latent vector H may be input to the second decoder 208pH in order to generate new instances of the feature vector X̂ that were not observed in the training data. This could be used, e.g., to generate fictional faces for movies or games, or to generate details of fictional patients for training or study purposes.
Fig. 5E shows another example, in which a third decoder network 208pY is trained in the second training stage to decode the second latent vector H into a classification Y. In the second training stage, each data point (each instance of the input feature vector X_o) is labelled with an observed value of the corresponding classification Y. Based on this, the learning function 209 trains the third decoder 208pY (i.e. tunes its weights) so as to minimize a measure of difference between the labelled classifications and the predicted classifications. After training, this can then be used to input an unlabelled feature vector X_o into the second encoder 208qH of the model 208' and generate a corresponding prediction of the classification Y for the observed feature vector X_o.
In another example, the model 208' can be used to interpolate missing values in the input feature vector X_o. After training, a subsequently observed instance of the feature vector X_o may be input to the second encoder 208qH, where the instance of the feature vector X_o is missing (i.e. lacks observations of) some but not all of the features (i.e. elements) of the feature vector. The missing elements may be set to zero, 50%, or some other predetermined value representing "not observed". The values of the corresponding features (i.e. the same elements) may then be read from the decoded version X̂ of the feature vector and used as interpolations of the missing observations. In embodiments, some of the data points used to train the model 208' may also have one or more missing values.
One problem with this basic interpolation method is that the predetermined value representing "no observation" may still be interpreted by the encoder as a sample value. For instance, if 0 is used, the encoder cannot distinguish between "no observation" and an actual observation of zero (e.g. a black pixel, or a sensor reading of zero, etc.). A similar problem can occur if a predetermined value representing 50% probability is used.
Fig. 6 shows an example structure of the second encoder 208qH that can be used to improve the interpolation of missing features. In fig. 6, each function h(·) represents a separate constituent neural network (note that h(·) here is a function, not to be confused with the latent vector discussed above, which may also be written H). The input of each is a combination (e.g. multiplication) of a respective embedding e with a respective value v, where v is x_d or z_d. Ideally, as many values of X and Z as possible are input during training and/or interpolation. However, for any given data point during training and/or interpolation, some features d may be missing their values of x_d and/or z_d (and if x_d is missing then z_d will also be missing). For any value that is not present, the corresponding input is simply omitted (it does not need to be replaced by an input of a predetermined value such as 0 or 50%). This is possible because of the use of the permutation-invariant operator g(·), as will be discussed shortly.
Each value v is combined with its own embedding e, e.g. by multiplication or concatenation, etc. In embodiments multiplication is used, but it could be any operator that combines the information of the two values. The embedding e acts as a coordinate for each input: it tells the encoder which element is being input at that input. This could be, for example, the coordinates of a pixel, or an index of the feature d.
Each individual neural network h(·) outputs a respective vector. These vectors are combined by a permutation-invariant operator g(·), e.g. a sum. A permutation-invariant operator is an operator whose output value (in this case a vector) depends on the values input to the operator, but not on the order in which those inputs are presented. Furthermore, the size of the output vector is fixed, regardless of the number of inputs to the operator. This means that g(·) can provide a vector c in a given format irrespective of which inputs are present, which are absent, and the order in which they are provided. This is what enables the encoder 208qH to handle missing inputs.
The encoder 208qH comprises a further neural network f(·) which is common to all the inputs v. The output c of the permutation-invariant operator g(·) is provided to the input of this further neural network f, which encodes the output c of g(·) into the second latent vector H (also written h when referring to the vector rather than the function). In embodiments, the further neural network f(·) is used, rather than using c directly, because the number of observed features is not fixed. Hence a shared function f is preferably first applied over all of the observed features.
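The arrangement of fig. 6 can be sketched as follows in Python, with stub networks in place of h(·) and f(·) and summation as the permutation-invariant operator g(·); all of the stub bodies are assumptions of the sketch. The point shown is that missing inputs are simply omitted, and permuting the remaining inputs leaves the resulting code unchanged.

```python
# Minimal set-encoder sketch in the shape of fig. 6.
def h(e, v):
    """Per-input network (stub): maps (embedding, value) to a vector."""
    return [e * v, e + v]

def g(vectors):
    """Permutation-invariant aggregation: element-wise sum. The output
    size is fixed regardless of how many inputs are present."""
    return [sum(col) for col in zip(*vectors)]

def f(c):
    """Final shared network (stub): maps the aggregate c to a code."""
    return sum(c)

def encode(observed):
    """observed: dict {embedding index e: value v}; missing inputs are
    simply absent from the dict rather than filled with a placeholder."""
    return f(g([h(e, v) for e, v in observed.items()]))

a = encode({1: 0.5, 3: 2.0})   # two of, say, five features observed
b = encode({3: 2.0, 1: 0.5})   # same set, presented in another order
```

Because g(·) is a sum, the two calls produce the identical code, and dropping a feature just drops one term from the sum instead of injecting a fake observation.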
In an optional additional application of the disclosed model, a reward function R_I may be used to determine the next observation to make after the first and second training stages of the model 208'. The reward function is a function of the observations obtained so far, representing the amount of new information that would be added by observing a given missing input. By determining which currently missing feature maximizes the reward function (or equivalently minimizes a cost function), this determines which of the unobserved inputs would be the most informative to collect next. This reflects the fact that some inputs have greater interdependencies with others than others do, such that the input which is least correlated with the inputs already observed will provide the most new information. The reward function is evaluated for a number of different candidate unobserved features, and the feature that maximizes the reward (or minimizes the cost) is the one that would provide the most new information when observed next. In some cases, the model 208' could then go through another cycle of the first and second training stages, now with the new observation added. Alternatively, the new observation could be used to improve the quality of a prediction, or could simply be used by a human analyst (e.g. a physician) in conjunction with the results of the trained model 208' (e.g. the classification Y or an interpolated missing feature x_d).
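By way of a hedged illustration only: the present description does not prescribe a particular form for the reward function, so the Python sketch below uses predictive variance as a stand-in reward, selecting the still-unobserved feature with the greatest current uncertainty; the feature names and variance values are invented for the example.

```python
# Illustrative next-observation selection: among the features not yet
# observed, pick the one whose current predictive uncertainty (here a
# stand-in variance) is largest, i.e. the candidate whose observation
# would be expected to add the most new information.
def next_observation(predictive_variances, observed):
    """predictive_variances: {feature: variance proxy};
    observed: set of features already observed."""
    candidates = {d: v for d, v in predictive_variances.items()
                  if d not in observed}
    return max(candidates, key=candidates.get)

pick = next_observation({"age": 0.2, "weight": 1.4, "bp": 0.9},
                        observed={"age"})
```

In a full system, the variance proxy would itself come from the trained model's predictive distributions over the missing features, re-evaluated each time a new observation arrives.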
FIG. 9 is a flow chart summarizing a method in accordance with the present disclosure. At step S1, the first training stage is performed, in which each individual VAE is trained separately on a respective one of the feature subsets, learning to encode that feature subset in separation from the other feature subsets, i.e. modelling the marginal distribution of each feature subset. At step S2, the second training stage is performed, in which the second, common VAE is trained to learn (i.e. model) the dependencies between the feature subsets. At step S3, the trained model 208' may be put to a practical purpose, such as prediction or interpolation. In an optional additional step (not shown) between S2 and S3, or after S3, the reward function may be used to determine which missing feature to observe next. In some cases, the method may comprise making the observation of the missing feature and looping back to S1, retraining the model 208' including the new observation of the previously missing feature.
Note that although the examples herein have been described in terms of labelled training data, the disclosed techniques are not limited to supervised approaches. More generally, the term "training" as used herein may refer to any form of supervised, reinforcement or unsupervised learning. The disclosed method is a particular way of obtaining a model that can model a dataset with mixed-type variables. Once the model is trained, it may be used in a variety of ways, e.g. for reinforcement learning or prediction.
Some example implementation details of the various concepts discussed above will now be further discussed by way of example.
In order to correctly model mixed-type data with heterogeneous marginals, the method fits the data in two stages. As shown in fig. 5C, in the first stage, we fit a low-dimensional VAE independently to the marginal distribution of each variable; we call these the marginal VAEs. Then, in the second stage, in order to capture the dependency relationships between the variables, a new multi-dimensional VAE, called the dependency network, is built on top of the latent representations provided by the encoders of the first stage. We denote the dimensionality of the observations by D and the number of data points by N; hence x_nd is the d-th dimension of the n-th data point. We provide details as follows.
Stage one: training one marginal VAE per variable. In the first stage, we train D individual VAEs, {p_{θ_d}(x_d)}_{d=1,…,D}, independently, each fitting a single dimension x_d of the dataset:

$$\max_{\theta_d,\phi_d}\ \sum_{n}\ \mathbb{E}_{q_{\phi_d}(z_{nd}\mid x_{nd})}\big[\log p_{\theta_d}(x_{nd}\mid z_{nd})\big]-\mathrm{KL}\big[q_{\phi_d}(z_{nd}\mid x_{nd})\,\big\|\,p(z_{nd})\big],\quad 1\le d\le D \qquad \text{(Formula 1)}$$

where p(z_d) is a standard Gaussian prior and q_{φ_d}(z_d|x_d) is the Gaussian encoder of the d-th marginal VAE. For the likelihood terms p_{θ_d}(x_d|z_d), we use a Gaussian likelihood for continuous data and a categorical likelihood over one-hot vectors for categorical data.

Note that Formula 1 contains D independent objectives. Each VAE p_{θ_d}(x_d; θ_d) is trained independently and is responsible for modeling only the statistical properties of a single dimension x_d of the dataset. We therefore assume, without loss of generality, that z_d is a scalar, although its dimensionality is not so limited. Each marginal VAE can be trained independently until convergence, thereby avoiding the optimization difficulties of a single monolithic VAE. The parameters θ_d of the marginal VAEs are then fixed.
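The per-dimension objective of stage one can be illustrated with a small numerical sketch. The snippet below is illustrative only: the "encoder" and "decoder" are fixed toy functions rather than trained neural networks, and all function and variable names are our own. It evaluates a Monte-Carlo ELBO for one dimension, using the closed-form KL divergence between the Gaussian posterior and the standard-normal prior p(z_d).

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_std_normal(mu, var):
    # closed-form KL[ N(mu, var) || N(0, 1) ], elementwise
    return 0.5 * (var + mu ** 2 - 1.0 - np.log(var))

def gaussian_log_pdf(x, mu, var):
    # log N(x; mu, var), elementwise
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def marginal_vae_elbo(x_d, enc_mu, enc_var, dec, n_samples=64):
    """Monte-Carlo estimate of the per-dimension marginal-VAE objective.

    x_d              : (N,) values of one data dimension
    enc_mu, enc_var  : (N,) moments of the Gaussian encoder q(z_d | x_d)
    dec              : function z -> (mu_x, var_x), the Gaussian likelihood
    """
    kl = kl_to_std_normal(enc_mu, enc_var)
    eps = rng.standard_normal((n_samples, x_d.size))
    z = enc_mu + np.sqrt(enc_var) * eps            # reparameterised samples
    mu_x, var_x = dec(z)
    recon = gaussian_log_pdf(x_d, mu_x, var_x).mean(axis=0)
    return float((recon - kl).mean())              # ELBO averaged over points

# toy check with a hand-picked "encoder" and an identity-like decoder
x = rng.standard_normal(500)
elbo = marginal_vae_elbo(x, enc_mu=0.9 * x, enc_var=np.full(x.size, 0.2),
                         dec=lambda z: (z, np.full_like(z, 1.1)))
```

Because each such objective involves only one dimension, the D objectives can be optimized entirely in parallel, which is the property the two-stage scheme exploits.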
Stage two: training the dependency network and combining the marginal VAEs. In the second stage, we train a new multi-dimensional VAE, called the dependency network, p_ψ(z), built on top of the latent representations z provided by the marginal-VAE encoders from the first stage. Specifically, we train p_ψ(z) as follows:

$$x_n \sim p_{\mathrm{data}}(x) \qquad \text{(Formula 2)}$$
$$z_{nd} \sim q_{\phi_d}(z_{nd}\mid x_{nd}),\quad 1\le d\le D \qquad \text{(Formula 3)}$$
$$\max_{\psi,\lambda}\ \sum_n\ \mathbb{E}_{q_\lambda(h_n\mid z_n)}\big[\log p_\psi(z_n\mid h_n)\big]-\mathrm{KL}\big[q_\lambda(h_n\mid z_n)\,\big\|\,p(h_n)\big] \qquad \text{(Formula 4)}$$

where h is the latent space of the dependency network. This procedure effectively separates the heterogeneous single-variable properties of the mixed-type data (modeled by the marginal VAEs) from the inter-variable correlations (modeled by the dependency network). We refer to our model as the VAE for heterogeneous mixed-type data (VAEM).
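The stage-two data flow of Formulas 2-4 can be sketched as follows. This is a minimal illustration with invented names: the frozen stage-one encoders are replaced by fixed toy transforms, and only the construction of the continuous training set for the dependency network p_ψ(z) is shown, not its training.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 200
x_cont = rng.standard_normal(N) * 10.0 + 3.0       # a continuous column x_1
x_cat = rng.integers(0, 3, size=N).astype(float)   # a categorical column x_2

# stand-ins for the frozen marginal encoders q_phi_d(z_d | x_d)
encoders = [
    lambda c: (c - c.mean()) / c.std(),            # toy encoder for x_1
    lambda c: (c - 1.0) * 0.8,                     # toy encoder for x_2
]

# Formula 2: draw the data; Formula 3: encode each dimension independently
Z = np.stack([enc(col) for enc, col in zip(encoders, (x_cont, x_cat))], axis=1)

# Z (shape N x D) is the homogeneous, continuous dataset on which the
# dependency VAE p_psi(z) of Formula 4 would then be trained.
```

The point of the sketch is that, whatever the types of the raw columns, the dependency network only ever sees a dense continuous matrix Z.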
After training the marginal VAEs and the dependency network, our final generative model is:

$$p_\theta(x)=\mathbb{E}_{p_\psi(z)}\Big[\prod_{d=1}^{D}p_{\theta_d}(x_d\mid z_d)\Big],\qquad p_\psi(z)=\int p_\psi(z\mid h)\,p(h)\,dh \qquad \text{(Formula 5)}$$

To handle complex statistical dependencies, we use the VampPrior, which uses a mixture of Gaussians (MoG) as the prior distribution over the high-level latent variable h:

$$p(h)=\frac{1}{K}\sum_{k=1}^{K}q_\lambda(h\mid u_k) \qquad \text{(Formula 6)}$$

where K ≪ N and the u_k are pseudo-inputs (for example, a subset of the data points).
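A mixture-of-Gaussians prior of the form in Formula 6 can be evaluated as below. This is a sketch under our own naming: each of the K components stands in for q_λ(h | u_k), summarized by its Gaussian moments, and the equally-weighted mixture density is computed with a log-mean-exp for numerical stability.

```python
import numpy as np

def mog_log_pdf(h, mus, variances):
    """log p(h) for an equally-weighted mixture of K diagonal Gaussians.

    h         : (L,) point in the latent space
    mus       : (K, L) component means
    variances : (K, L) component variances
    """
    comp = -0.5 * np.sum(np.log(2 * np.pi * variances)
                         + (h - mus) ** 2 / variances, axis=1)
    m = comp.max()                                  # log-mean-exp over components
    return float(m + np.log(np.exp(comp - m).mean()))

K, L = 4, 2
mus = np.zeros((K, L))
variances = np.ones((K, L))
lp = mog_log_pdf(np.zeros(L), mus, variances)       # here the mixture collapses
                                                    # to a 2-D standard normal
```

In the degenerate case above all components coincide, so the mixture density at the origin equals the standard-normal density, which gives an easy correctness check.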
In typical machine learning applications, normalization is regarded as an essential preprocessing step: for example, the data are often first normalized to zero mean and unit standard deviation. For mixed-type data, however, standard normalization methods cannot be applied. With our VAEM, each marginal VAE is trained independently to model the heterogeneous properties of one data dimension, converting the mixed-type variable x_d into a continuous representation z_d. Due to the regularization effect of the prior p(z), the set of z_d values forms an aggregated posterior distribution that approximates a standard normal distribution. In this way we overcome the heterogeneous mixed-type problem, allowing the dependency VAE to focus on learning the relationships between the variables.
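The "learned normalization" effect described above can be imitated, for intuition only, by a rank-based Gaussianization: like a trained marginal VAE (whose aggregated posterior is regularized towards the prior), it maps an arbitrary, even heavy-tailed, marginal to an approximately standard-normal code. This is a plain statistical transform, not the VAEM procedure itself, and all names below are our own.

```python
import numpy as np
from statistics import NormalDist

def rank_gaussianise(col):
    # map values to (0, 1) by rank, then through the inverse normal CDF
    n = len(col)
    ranks = np.argsort(np.argsort(col))            # 0 .. n-1
    u = (ranks + 0.5) / n
    return np.array([NormalDist().inv_cdf(p) for p in u])

rng = np.random.default_rng(1)
heavy_tailed = rng.standard_cauchy(2000)           # e.g. a "duration"-like column
z = rank_gaussianise(heavy_tailed)                 # approximately N(0, 1)
```

No matter how extreme the input tails are, the transformed column has roughly zero mean, unit variance and bounded extremes, which is the kind of well-conditioned input the dependency network benefits from.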
We further extend the method to decision-making under uncertainty. In particular, as an example, we focus on the application of sequential active information acquisition. In this setting, we propose extensions that allow our model to be used in the presence of missing data, together with an estimate of the Lindley information.
Suppose that, for a data instance x, we are interested in predicting a target x_Φ ∈ x_U given the currently observed variables x_O, where x_O denotes the set of currently observed variables and x_U the set of unobserved variables. One important problem is sequential active information acquisition (SAIA): how do we decide which variable x_i ∈ x_U \ x_Φ is the next best one to observe, so as to maximally increase our knowledge of (e.g. our predictive power for) x_Φ?

Solving this problem requires:
1) a good generative model that can handle missing data and efficiently generate conditional samples from p(x_U | x_O); and
2) the ability to evaluate a reward function (the Lindley information in this example) so that decisions can be made based on the generative model.
We now introduce our extensions of VAEM that meet both requirements.
The amortized inference network of VAEM cannot directly handle partially observed data, because the dimensionality of the observed set x_O may differ from one data instance to the next. We therefore apply a PointNet-style encoding structure to build a partial inference network for the dependency VAE, which infers h from partial observations in a stage-wise manner. Specifically, in the first stage, we estimate each marginal VAE using only the samples in which that dimension is observed:

$$\max_{\theta_d,\phi_d}\ \sum_{n:\,x_{nd}\in x_{n,O}}\ \mathbb{E}_{q_{\phi_d}(z_{nd}\mid x_{nd})}\big[\log p_{\theta_d}(x_{nd}\mid z_{nd})\big]-\mathrm{KL}\big[q_{\phi_d}(z_{nd}\mid x_{nd})\,\big\|\,p(z_{nd})\big] \qquad \text{(Formula 7)}$$

where x_{n,O} is the set of observed variables of the n-th data instance. In the second stage, a VAE that can handle partial observations is required. Similarly to the partial VAE, the dependency VAE in the presence of missing data is defined as:

$$p_\psi(z_O)=\int p_\psi(z_O\mid h)\,p(h)\,dh \qquad \text{(Formula 8)}$$

which is trained by maximizing the partial ELBO:

$$\log p_\psi(z_O)\ \ge\ \mathbb{E}_{q_\lambda(h\mid z_O,x_O)}\big[\log p_\psi(z_O\mid h)\big]-\mathrm{KL}\big[q_\lambda(h\mid z_O,x_O)\,\big\|\,p(h)\big] \qquad \text{(Formula 9)}$$
where h is the latent space of the dependency network and q_λ(h | z_O, x_O) is a set function, the so-called partial inference network, whose structure is shown in fig. 6. Basically, for each data instance, the inputs to the partial inference network are first mapped, using element-wise multiplication, to s_O := { ν e_ν | ν ∈ z_O ∪ x_O }, where e_ν is a feature embedding. Each element of s_O is then passed through a shared feature map (a neural network) c(·): ℝ^M → ℝ^K, where M and K are the feature-embedding and feature-map dimensions, respectively. Finally, a permutation-invariant aggregation operation g(·), such as summation, is applied. As a result, q_λ(h | z_O, x_O) is invariant to permutations of the elements of x_O, and x_O may be of any length.
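A minimal sketch of the permutation-invariant set encoding described above follows (untrained, with random weights and invented names): each observed value scales its feature embedding e_ν, a shared one-layer map plays the role of c(·), and summation is the aggregation g(·). The resulting code does not depend on the order of the observed features, and sets of any size can be encoded.

```python
import numpy as np

rng = np.random.default_rng(0)

M, K = 8, 16                                       # embedding / feature-map dims
n_features = 5
E = rng.standard_normal((n_features, M))           # one embedding e_v per feature
W = rng.standard_normal((M, K))                    # weights of a one-layer c(.)

def partial_encode(observed):
    """Encode a variable-length set of (feature_index, value) pairs."""
    s = np.stack([v * E[i] for i, v in observed])  # s_O = { v * e_v }
    feats = np.tanh(s @ W)                         # shared feature map c(.)
    return feats.sum(axis=0)                       # permutation-invariant g(.)

obs = [(0, 1.3), (2, -0.7), (4, 0.2)]
code = partial_encode(obs)
code_reordered = partial_encode(obs[::-1])         # same set, different order
```

In the real model the fixed-size code would then be passed to a further network that outputs the moments of q_λ(h | z_O, x_O); the sketch stops at the aggregation step.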
Once the marginal VAEs and the partial dependency network are trained, we can perform the following inference process to generate conditional samples from p_θ(x_U | x_O): first, derive the latent representation z_d of each observed variable; with this representation, use the partial inference network to infer h, the latent code of the second-stage VAE; from h, generate z_U, the latent codes of the unobserved dimensions, and finally x_U:

$$z_d \sim q_{\phi_d}(z_d\mid x_d),\quad \forall d\in O \qquad \text{(Formula 10)}$$
$$h \sim q_\lambda(h\mid z_O,x_O) \qquad \text{(Formula 11)}$$
$$z_U \sim p_\psi(z_U\mid h) \qquad \text{(Formula 12)}$$
$$x_U \sim \prod_{d\in U} p_{\theta_d}(x_d\mid z_d) \qquad \text{(Formula 13)}$$
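The four-step conditional-sampling chain of Formulas 10-13 can be sketched end-to-end with linear-Gaussian stand-ins for every trained component: identity marginal encoders/decoders, a linear dependency decoder, and a least-squares solve in place of the partial inference network. All of these stand-ins, and the names below, are our own simplifications; a trained VAEM would use its neural networks instead.

```python
import numpy as np

rng = np.random.default_rng(0)

D, L = 3, 2
A = rng.standard_normal((D, L))                    # stand-in for p_psi(z | h): z ~ A h

def impute(x_obs, observed, unobserved, noise=0.05, n_samples=500):
    """Approximate E[x_U | x_O] by averaging conditional samples."""
    draws = []
    for _ in range(n_samples):
        z_o = np.array([x_obs[d] for d in observed])           # Formula 10 (identity encoder)
        h = np.linalg.lstsq(A[observed], z_o, rcond=None)[0]   # Formula 11 (stand-in inference)
        h = h + noise * rng.standard_normal(L)
        z_u = A[unobserved] @ h + noise * rng.standard_normal(len(unobserved))  # Formula 12
        draws.append(z_u)                                      # Formula 13 (identity decoder)
    return np.mean(draws, axis=0)

h_true = rng.standard_normal(L)
x_true = A @ h_true                                # a fully consistent toy instance
x_hat = impute(x_true, observed=[0, 1], unobserved=[2])
```

Because the toy instance is generated exactly by the linear model, the imputed value of the held-out dimension should land close to its true value, which illustrates what the chain is meant to achieve.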
SAIA can be framed as a Bayesian experimental-design problem: the next variable to observe, x_i ∈ x_U \ x_Φ, is selected by the following information reward function:

$$R(i, x_O)=\mathbb{E}_{p(x_i\mid x_O)}\,\mathrm{KL}\big[\,p(x_\Phi\mid x_i,x_O)\,\big\|\,p(x_\Phi\mid x_O)\,\big] \qquad \text{(Formula 14)}$$

We use the pre-trained partial VAEM model to estimate the required distributions p(x_i | x_O), p(x_Φ | x_i, x_O) and p(x_Φ | x_O). Since these distributions, and hence the KL term of Formula 14, are intractable in the observation space, we must resort to an approximation. R(i, x_O) can be estimated efficiently in the latent space as:

$$\hat R(i, x_O)=\mathbb{E}_{p(x_i\mid x_O)}\,\mathrm{KL}\big[\,q_\lambda(h\mid x_i,x_O)\,\big\|\,q_\lambda(h\mid x_O)\,\big]-\mathbb{E}_{p(x_i,x_\Phi\mid x_O)}\,\mathrm{KL}\big[\,q_\lambda(h\mid x_\Phi,x_i,x_O)\,\big\|\,q_\lambda(h\mid x_\Phi,x_O)\,\big] \qquad \text{(Formula 15)}$$

Note that, for compactness, we omit the corresponding latent inputs z to the partial inference network in this notation. The partial inference network q_λ(h | z_O) is assumed to be a Gaussian distribution (or another common distribution such as a normalizing flow), so each KL term can be evaluated efficiently.
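Since q_λ(h | ·) is Gaussian, each KL term of the latent-space approximation has a closed form. The sketch below (our own names; the posterior moments are hand-picked rather than produced by a real inference network) evaluates a one-sample estimate of the reward from such moments.

```python
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    """Closed-form KL[ N(mu1, var1) || N(mu2, var2) ] for diagonal Gaussians."""
    return float(0.5 * np.sum(np.log(var2 / var1)
                              + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0))

def info_reward(post_i, post_o, post_phi_i, post_phi):
    """One-sample latent-space reward estimate from four (mu, var) pairs:
    q(h | x_i, x_O), q(h | x_O), q(h | x_phi, x_i, x_O), q(h | x_phi, x_O).
    In practice the two terms are averaged over samples of x_i and x_phi."""
    return kl_diag_gauss(*post_i, *post_o) - kl_diag_gauss(*post_phi_i, *post_phi)

mu0, v0 = np.zeros(2), np.ones(2)
# observing x_i shifts and sharpens the posterior a lot,
# while adding x_i on top of x_phi changes it only slightly
r = info_reward((np.array([1.0, 0.5]), 0.5 * np.ones(2)), (mu0, v0),
                (np.array([0.1, 0.0]), 0.9 * np.ones(2)), (mu0, v0))
```

A candidate variable whose observation moves the latent posterior substantially, beyond what the target itself already explains, receives a large positive reward and is therefore acquired first.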
In active information acquisition, the variable of interest x_Φ is often the target we are trying to predict. To improve the predictive performance of VAEM, we propose the following factorization:

$$p(x_\Phi\mid x_O)\ \approx\ \mathbb{E}_{q_\lambda(h\mid z_O,x_O)}\,\mathbb{E}_{p(x_U\mid h)}\big[\,p_\lambda(x_\Phi\mid x_O,x_U,h)\,\big] \qquad \text{(Formula 16)}$$

where p_λ(x_Φ | x_O, x_U, h) is a discriminator that makes a probabilistic prediction of x_Φ based on the observed variables x_O, the imputed variables x_U and the global latent representation h (the last being optional). The discriminator in Formula 16 provides additional predictive power for the target x_Φ.
To evaluate the performance and effectiveness of our proposed VAEM model, we first evaluate it on a mixed-type heterogeneous data generation task. We then compare performance on conditional mixed-type data generation (interpolation). Finally, to evaluate the conditional generation quality of the model more fully, we apply VAEM to the sequential active information acquisition task. In this task, the underlying generative model is required to generate samples of the unobserved variables for each instance, and then to decide which variable to acquire next. All experiments use the same set of datasets, including two UCI benchmark datasets (Boston Housing and Energy), two real-world datasets (Avocado and Bank), and a medical dataset (MIMIC-III). We compare our proposed VAEM with several baseline methods.
For our proposed method, VAEM, we use the partial version presented above, together with the discriminator structure specified by Formula 16, unless otherwise stated.
Table 1. Data generation quality, measured by the per-variable test negative log-likelihood (NLL), with standard errors as error bars.
Throughout the experiments we consider several baselines. Unless otherwise specified, all VAE baselines also use a similar partial inference method and discriminator structure. Furthermore, all baselines are equipped with MoG priors. Our main baselines are:
- Heterogeneous-Incomplete VAE (HI-VAE). We adopt the multi-head structure of HI-VAE and match the hidden-unit dimensionality to that of our VAEM. HI-VAE is an important baseline because it is motivated in a similar fashion to our VAEM, but is trained end-to-end rather than in two stages. We denote it VAE-HI.
- VAE: the original VAE equipped with the VampPrior. The number of latent dimensions is the same as in the second stage of VAEM. We denote it VAE.
- VAE with extended latent dimensionality: note that the overall latent dimensionality of VAEM is D + L, where D and L are the dimensionalities of a data instance and of h, respectively. For fairness, in this baseline we extend the latent dimensionality of the original VAE to D + L. We denote this baseline VAE-extended.
- VAE with automatically balanced likelihoods. This baseline attempts to balance the ratio of the log-likelihood values of the different variable types in the ELBO automatically, by adaptively multiplying the likelihood terms by constants. We denote this baseline VAE-balanced.
We use the same mixed-type datasets for all tasks. They are:
- two standard UCI benchmark datasets: Boston Housing (13 continuous, 1 categorical) and Energy Efficiency (6 continuous, 3 categorical);
- two relatively large real-world datasets: the Bank marketing dataset (45211 instances; 11 continuous, 8 categorical, 2 discrete) and the Avocado sales prediction dataset (18249 instances; 9 continuous, 5 categorical); and
- a real medical dataset: the Medical Information Mart for Intensive Care (MIMIC-III) database, the largest public medical dataset, containing 21139 patient records (we focus mainly on the mortality prediction task based on 17 modalities: 13 continuous, 4 categorical). Because the dataset is imbalanced (more than 80% of the records have a mortality outcome of 0), we balance it by down-sampling, to better demonstrate model behavior. All time-series variables are averaged to give static features.
Mixed type data generation task
In this task, we evaluate the quality of the generative models from the perspective of mixed-type data generation. For training, all variables are scaled to the range 0 to 1. For each dataset, we first train the models using a 90%-10% train-test split and then quantitatively compare their performance on the test set. All experiments are repeated 5 times with different random seeds.
Pair-plot visualization: for a deep generative model, the quality of the generated data reflects how well the model has captured the data distribution. We therefore first visualize the data generation quality of each model on a representative dataset (Bank marketing). The Bank dataset contains three different data types, each with distinct marginals, which presents a learning challenge. We fit the models to the Bank dataset and then generate pair plots (fig. 7) for three variables x_0, x_1 and x_2 selected from the data (the first two categorical, the third continuous). In each sub-plot of fig. 7, the diagonal shows the marginal histogram of each variable; the upper-triangular part shows a scatter plot of generated samples for each pair of variables; and the lower-triangular part shows a heat map identifying the regions of high probability density for each pair. Fig. 7(a) shows the ground truth for each variable; fig. 7(b) shows the values generated by VAEM; fig. 7(c) those of the original VAE; fig. 7(d) those of VAE-extended; fig. 7(e) those of VAE-balanced; and fig. 7(f) those of VAE-HI.
The original VAE can generate the second categorical variable. Note, however, that the third variable of the dataset (fig. 7(a)), which corresponds to the "duration" feature and is a very important variable, is heavy-tailed. The original VAE (fig. 7(c)) cannot model this heavy-tailed behavior. On the other hand, while VAE-balanced and VAE-HI (fig. 7(e), (f)) can capture part of this heavy-tailed behavior, they do not model the second categorical variable well. Our VAEM model (fig. 7(b)) is able to generate accurate marginal and joint distributions for both the categorical and the heavy-tailed continuous variables.
Quantitative evaluation on all datasets: to assess data generation quality quantitatively, we compute the marginal negative log-likelihood (NLL) of each model on the test set. Note that all NLL numbers are divided by the number of variables in the dataset. As shown in Table 1, VAEM consistently generates realistic samples and is, on average, significantly better than the other baselines.
Mixed type conditional data generation tasks
An important capability of a generative model is conditional data generation: that is, given the observed part x_O of a data instance, inferring the posterior distribution of the unobserved variables x_U. For all baselines evaluated in this task, we train their partial versions (i.e. generator + partial inference network). To train the partial models, we randomly draw 90% of the dataset as the training set and, during training, remove a random fraction of the observations in each epoch (sampled uniformly between 0% and 99%). We then remove 50% of each test instance and use the generative models to impute the unobserved data. Since all inference is probabilistic, we report the test NLL on the unobserved data, rather than the interpolation RMSE commonly used in the literature.
Table 2 summarizes the results, where all NLL values are divided by the number of observed variables. We repeat the experiment 5 times and report standard errors. Note that the automatic balancing strategy, VAE-balanced, almost always degrades performance. In contrast, Table 2 shows that our proposed method is very robust, yielding better performance than all baselines on four of the five datasets and competitive performance on the Energy dataset.
Table 2. Conditional data generation (interpolation) quality, measured by the test NLL on unobserved data, with standard errors.
In our final experiment, we apply VAEM to the sequential active information acquisition (SAIA) task formulated above. We use this task as an example of how VAEM can be used for decision-making under uncertainty. In SAIA, at each step the underlying generative model is required to generate the unobserved variables x_U of each data instance x, and then to decide which variable to acquire next. SAIA is an ideal task for evaluating mixed-type data generation models, as it integrates data generation, conditional generation, target prediction and decision-making into a single task. A deep generative model capable of handling partial observations, together with efficient inference, are the key components of the SAIA task.
We first pre-train the models and baselines according to the settings above. Then, in SAIA, we actively select variables for each test instance, starting from the empty observation set x_O = ∅. The reward function for VAEM is estimated as described above. We add an extra baseline, denoted VAE-no-disc, which is a VAE without the discriminator structure; this baseline shows the importance of the extension described above for the prediction task. All other settings are the same as for the VAE baseline. All experiments are repeated ten times.
FIG. 8 shows the RMSE evaluation of x_Φ at each variable-selection step on all five datasets, where x_Φ is the target variable of each dataset. The y-axis shows the prediction error and the x-axis the number of features acquired for prediction. We refer to the curves in fig. 8 as information curves; the area under the information curve (AUIC) can be used to evaluate the SAIA performance of a given model: the smaller the area, the better. As can be seen in fig. 8, VAEM consistently performs better than the other baselines. Note that on the Bank marketing and Avocado sales datasets, which involve a large number of heterogeneous variables, the other baselines are hardly able to reduce the RMSE quickly, whereas VAEM performs far better. These experiments show that VAEM can acquire information efficiently for mixed-type datasets.
It should be understood that the above embodiments have been described by way of example only.
More generally, according to one aspect disclosed herein, there is provided a method comprising: in a first stage, training a plurality of separate first variational autoencoders, VAEs, each comprising a separate respective first encoder arranged to encode a respective subset of one or more features of a feature space into a separate respective first latent representation having one or more dimensions, and a separate respective first decoder arranged to decode from the respective latent representation back to a decoded version of the respective subset of the feature space, wherein different subsets comprise features of different data types; and in a second stage, subsequent to the first stage, training a second VAE comprising a second encoder arranged to encode a plurality of inputs into a second latent representation having a plurality of dimensions, and a second decoder arranged to decode the second latent representation into a decoded version of the first latent representations, wherein each of the plurality of inputs comprises a different respective one of the subsets of features in combination with the respective first latent representation.
Since the autoencoders are variational autoencoders, each dimension of their latent representations is modeled as a probability distribution. In embodiments, the decoded version of the features output by a decoder may also be modeled as a distribution, or may be a simple scalar value. The weights of the nodes in the neural networks may likewise be modeled as distributions or as scalars.
Each encoder and decoder may include one or more neural networks. Training of each VAE may include comparing features of the decoder output to features of the encoder input, and adjusting parameters of neural network nodes in the VAE to reduce differences between the two.
In an embodiment, each of the subsets is a single feature.
Alternatively, in an embodiment, each of the subsets may comprise more than one feature. In this case, the respective features within each subset may be of the same type, but of a respectively different data type with respect to the other subsets.
In an embodiment, each of the first latent representations is a single respective one-dimensional latent variable.
Note again, however, that since the autoencoders are variational autoencoders, each of the latent variables is still modeled as a distribution.
In embodiments, the different data types may include two or more of: categorical, ordinal and continuous.
In an embodiment, the different data types may include: binary categorical, and categorical with more than two categories.
In an embodiment, a feature may include one or more sensor readings from one or more sensors that sense a material or a machine.
In embodiments, the features may include one or more sensor readings regarding the user's health and/or questionnaire responses from the user.
In an embodiment, a third decoder may be trained to generate a classification from the second latent representation.
In an embodiment, the second encoder may comprise: respective separate second encoders arranged to encode each of the plurality of feature subsets and/or first latent representations; a permutation-invariant operator arranged to combine the encoded outputs of the separate second encoders into a fixed-size output; and a further encoder arranged to encode the fixed-size output into the second latent representation.
In an embodiment, the combining may comprise concatenation.
Aspects disclosed herein also provide a method of performing prediction or interpolation using the second VAE after training according to the method mentioned above in any aspect or embodiment.
In an embodiment, the method may use the second VAE to predict or infer a condition of the material or machine.
In an embodiment, the method may predict or infer a health condition of the user using the second VAE.
In an embodiment, the method may use the third decoder and the second encoder, after training, to predict a classification of a subsequently observed feature vector of the feature space.
In an embodiment, the method may use the second VAE, after training, to interpolate the values of one or more missing features in a subsequently observed feature vector of the feature space by:
- providing the observed values of the feature vector as the feature values of the respective inputs to the second encoder,
- setting each unobserved feature in said inputs to a predetermined value representing no observation, and
- reading the values of the features of the feature space corresponding to the unobserved features, as output by the first decoders.
In an embodiment, the method may use the second encoder, after training, to interpolate one or more unobserved features by:
- providing the observed values of the feature vector as the feature values of the respective inputs to the second encoder, omitting the inputs corresponding to the one or more unobserved features,
- using the permutation-invariant operator to convert the remaining observed features into a fixed-size output of the same size as during training,
- providing the generated first latent representations to the respective first decoders trained in the first training phase, and
- reading the values of the features of the feature space corresponding to the unobserved features, as output by the first decoders.
Another aspect provides a computer program embodied on a computer readable memory and configured to run on one or more processing units to perform the method of any aspect or embodiment defined above.
Another aspect provides a computer system, comprising: a memory including one or more storage units, and a processing device including one or more processing units; wherein the memory stores code arranged to run on the processing apparatus, the code being configured to perform the method of any aspect or embodiment defined above when run on the processing apparatus.
In embodiments, the computer system may be implemented as a server comprising one or more server units for one or more geographical locations, the server being arranged to perform one or both of:
-collecting observations of said features from a plurality of devices over a network and using these observations to perform said training; and/or
-providing a predictive or interpolation service to the user over the network based on the trained second VAE.
In an embodiment, the network for one or both services may be a wide area Internet network, such as the Internet. In the case of collecting observations, the collecting may comprise collecting some or all of the observations from a plurality of different users by different respective user devices. As another example, collecting may include collecting some or all of the observations from a plurality of different sensor devices (e.g., internet of things devices or industrial measurement devices).
Another aspect provides the use of a second VAE trained as follows: in a first stage, training each of a plurality of separate first variational autoencoders, VAEs, each comprising a separate respective first encoder arranged to encode a respective subset of one or more features of a feature space into a separate respective first latent representation having one or more dimensions, and a separate respective first decoder arranged to decode from the respective latent representation back to a decoded version of the respective subset of the feature space, wherein different subsets comprise features of different data types; and in a second stage, subsequent to the first stage, training the second VAE, the second VAE comprising a second encoder arranged to encode a plurality of inputs into a second latent representation having a plurality of dimensions, and a second decoder arranged to decode the second latent representation into a decoded version of the first latent representations, wherein each of the plurality of inputs comprises a different respective one of the subsets of features in combination with the respective first latent representation.
In an example application, the trained model may be used to predict a user's state, such as a disease or other health condition. For example, once trained, the model may receive answers to questions posed to the user about their health, providing data to the model. A user interface may be provided to output the questions to the user and to receive the user's responses, for example through a voice interface or other interface means. In some examples, the user interface may comprise a chat bot. In other examples, the user interface may comprise a graphical user interface (GUI), such as a point-and-click user interface or a touch-screen user interface. The trained model may be configured to generate an overall score based on the user's responses, which provide his or her health data, and thereby to predict the user's condition based on that data. In some embodiments, the model may be used to predict the occurrence of a particular condition of the user, for example a health condition such as asthma, depression or heart disease.
The user's condition may be monitored by asking questions, which may be repeated instances of the same question (i.e. the same question content) and/or different questions (i.e. different question content). The questions may relate to the user's condition in order to monitor it; for example, the condition may be a health condition such as asthma or depression. The purpose of the monitoring may be to predict a future condition of the user, for example to predict the occurrence of a health problem, or to provide information to the user, a health practitioner, a clinical trial, etc.
User data may also be provided from sensor devices, such as wearable or portable sensor devices worn or carried on the user. For example, such a device may take the form of an inhaler or spirometer with an embedded communication interface for connecting to and providing data to a controller. Data from the sensors may be input into the model and form part of the patient data for prediction using the model.
Context metadata may also be provided for training and using the algorithm. Such metadata may comprise the location of the user. The user's location may be monitored by a portable or wearable device carried or worn by the user, employing any one or more of a variety of known positioning technologies, such as triangulation, trilateration, multilateration or fingerprinting relative to known nodes of a network, such as WLAN access points, cellular base stations, satellites, or anchor nodes of a dedicated positioning network such as an indoor positioning network.
Other contextual information, such as sleep quality, may be inferred from the personal device data, such as by using a wearable sleep monitor. In further alternative or additional examples, sensor data from cameras, positioning systems, motion sensors, and/or heart rate monitors, etc. may be used as metadata.
The model may be trained to identify specific disease or health outcomes. For example, for a particular health condition, such as a certain type of cancer or diabetes, an existing feature set of patients may be used to train the model. Once the model is trained, it can be used to diagnose the particular disease when a new patient provides patient data. The model may also make other health-related predictions; for example, once trained on a suitable set of patient training data with known mortality outcomes, it may predict mortality.
Another example use of the model is determining geological conditions, for example in drilling, to determine the likelihood of encountering oil or gas. Different sensors may be used on a tool at a particular geographic location. The sensors may include, for example, radar, lidar and location sensors. Other sensors, such as thermometers or vibration sensors, may also be used. The data from the sensors may belong to different data types, thus constituting mixed-type data. Once the model has been effectively trained on such mixed data, it can be applied in an unknown environment by taking sensor readings from equivalent sensors in that environment, and used to generate predictions of the geological conditions.
Another possible application is to determine the status of an autonomous vehicle. In this case, data may be generated from sensors such as radar sensors, lidar sensors, and position sensors on the automobile and used as a feature set to train the model for the particular conditions the automobile may be in. Once the model is trained, a corresponding hybrid data set may be provided to the model to predict a particular vehicle condition.
Another possible application of the training model is machine diagnostics and management in an industrial environment. For example, readings from different machine sensors, including but not limited to temperature sensors, vibration sensors, accelerometers, hydraulic sensors, may be used to train the model for certain fault conditions of the machine. Once the model is trained, the model may be used to predict the cause of possible machine failure once data from the machine is provided from the corresponding sensors.
Another application is predicting the heating and cooling loads of different buildings. For training, the model may be provided with building attributes including surface area, wall area, roof area, height, orientation, and so on. These attributes may be of mixed data types: for example, orientation may be a categorical data type while area is a continuous data type. After training, given the corresponding data for a new building, the model can be used to predict the heating or cooling load of that building.
Other variations and uses of the disclosed technology may become apparent to those skilled in the art once the present disclosure is given. The scope of the present disclosure is not limited by the described embodiments, but only by the appended claims.

Claims (15)

1. A method comprising
In a first stage, training each of a plurality of separate first variational autoencoders, VAEs, each first VAE comprising a separate respective first encoder arranged to encode a respective subset of one or more features of a feature space into a separate respective first latent representation having one or more dimensions, and a separate respective first decoder arranged to decode from the respective latent representation back to a decoded version of the respective subset of the feature space, wherein different subsets comprise features of different data types; and
in a second stage, subsequent to the first stage, training a second VAE comprising a second encoder arranged to encode a plurality of inputs into a second latent representation having a plurality of dimensions, and a second decoder arranged to decode the second latent representation into a decoded version of the first latent representations, wherein each respective input of the plurality of inputs comprises a combination of a different respective one of the subsets of features with the respective first latent representation.
2. The method of claim 1, wherein each of the subsets is a single feature.
3. The method of claim 1, wherein each of the subsets is a plurality of features, and wherein the features within each subset are of the same data type as one another but of a different data type relative to the features of the other subsets.
4. A method according to claim 1, 2 or 3, wherein each of the first latent representations is a single respective one-dimensional latent variable.
5. A method as claimed in any preceding claim, wherein the different data types comprise two or more of: categorical, ordinal and continuous.
6. The method of any preceding claim, wherein the different data types comprise: binary categorical, and categorical with more than two categories.
7. A method according to any preceding claim, wherein the features comprise one or more sensor readings from one or more sensors sensing a material or machine.
8. The method of any one of claims 1 to 6, wherein the features comprise one or more sensor readings regarding a user's health and/or questionnaire responses from the user.
9. A method as claimed in any preceding claim, comprising training a third decoder to generate a classification from the second latent representation.
10. The method of any preceding claim, wherein the second encoder comprises: a respective separate second encoder arranged to encode each of a plurality of said feature subsets and/or first latent representations; a permutation-invariant operator arranged to combine the encoded outputs of the separate second encoders into a fixed-size output; and a further encoder arranged to encode the fixed-size output into the second latent representation.
11. A method of performing prediction or imputation using the second VAE trained according to any preceding claim.
12. The method of claim 11, using the second VAE of claim 7 to predict or impute a condition of a material or machine.
13. The method of claim 11, using the second VAE of claim 8 to predict or impute a health condition of a user.
14. A computer program embodied on a computer readable memory and configured so as when run on one or more processing units to perform the method according to any preceding claim.
15. A computer system, comprising:
a memory comprising one or more memory units; and
a processing apparatus comprising one or more processing units;
wherein the memory stores code arranged to run on the processing apparatus, the code being configured to perform the method of any preceding claim when run on the processing apparatus.
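At a data-flow level, the two training stages recited in claim 1 can be sketched as below. The linear "encoders" and "decoders" here are untrained random placeholders chosen only to show the shapes involved; an actual implementation would use neural networks fitted by maximizing each VAE's evidence lower bound, with type-appropriate likelihoods per feature:

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyVAE:
    """Placeholder VAE: linear encode/decode maps, no training loop."""
    def __init__(self, in_dim, latent_dim, out_dim=None):
        out_dim = in_dim if out_dim is None else out_dim
        self.enc = rng.normal(size=(in_dim, latent_dim))
        self.dec = rng.normal(size=(latent_dim, out_dim))
    def encode(self, x):
        return x @ self.enc   # stands in for the mean of q(z|x)
    def decode(self, z):
        return z @ self.dec

n, n_features = 8, 3
X = rng.normal(size=(n, n_features))  # one column per feature subset

# Stage 1: one individual first VAE per feature, each with a 1-D latent.
first_vaes = [TinyVAE(in_dim=1, latent_dim=1) for _ in range(n_features)]
Z1 = np.hstack([vae.encode(X[:, [d]]) for d, vae in enumerate(first_vaes)])

# Stage 2: each input to the second VAE pairs a feature subset with its
# first latent representation; the second decoder maps the second latent
# representation back to a decoded version of the first latents.
H = np.hstack([X, Z1])
second_vae = TinyVAE(in_dim=2 * n_features, latent_dim=4, out_dim=n_features)
Z2 = second_vae.encode(H)
Z1_hat = second_vae.decode(Z2)
```

The two-stage split lets each first VAE absorb the marginal statistics of one data type, so the second VAE only has to model the dependencies between the already-normalized latents.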
CN202180033226.XA 2020-05-07 2021-04-09 Variational autocoder for mixed data types Pending CN115516460A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
GB2006809.4 2020-05-07
GBGB2006809.4A GB202006809D0 (en) 2020-05-07 2020-05-07 Variational auto encoder for mixed data types
US16/996,348 2020-08-18
US16/996,348 US20210358577A1 (en) 2020-05-07 2020-08-18 Variational auto encoder for mixed data types
PCT/US2021/026502 WO2021225741A1 (en) 2020-05-07 2021-04-09 Variational auto encoder for mixed data types

Publications (1)

Publication Number Publication Date
CN115516460A true CN115516460A (en) 2022-12-23

Family

ID=75660419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180033226.XA Pending CN115516460A (en) 2020-05-07 2021-04-09 Variational autocoder for mixed data types

Country Status (3)

Country Link
EP (1) EP4147173A1 (en)
CN (1) CN115516460A (en)
WO (1) WO2021225741A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230147185A1 (en) * 2021-11-08 2023-05-11 Lemon Inc. Controllable music generation
CN116434005A (en) * 2023-03-29 2023-07-14 深圳智现未来工业软件有限公司 Wafer defect data enhancement method and device
GR1010739B (en) * 2023-06-09 2024-08-27 KOP ΚΑΙΝΟΤΟΜΙΑ ΚΑΙ ΤΕΧΝΟΛΟΓΙΑ ΟΕ με δ.τ. CORE INNOVATION, Power transformer's predictive maintenance system using a variational autoencoder

Also Published As

Publication number Publication date
WO2021225741A1 (en) 2021-11-11
EP4147173A1 (en) 2023-03-15

Similar Documents

Publication Publication Date Title
US20210358577A1 (en) Variational auto encoder for mixed data types
CN115516460A (en) Variational autocoder for mixed data types
Zhang et al. Exploiting unlabeled data to enhance ensemble diversity
WO2021226132A2 (en) Systems and methods for managing autoimmune conditions, disorders and diseases
US20230394368A1 (en) Collecting observations for machine learning
US11830187B2 (en) Automatic condition diagnosis using a segmentation-guided framework
US20220147818A1 (en) Auxiliary model for predicting new model parameters
CN114298234B (en) Brain medical image classification method and device, computer equipment and storage medium
US11875898B2 (en) Automatic condition diagnosis using an attention-guided framework
Srikanth et al. Predict early pneumonitis in health care using hybrid model algorithms
CN116502129B (en) Unbalanced clinical data classification system driven by knowledge and data in cooperation
US20210406765A1 (en) Partially-observed sequential variational auto encoder
CN113673244A (en) Medical text processing method and device, computer equipment and storage medium
CN116261733A (en) Auxiliary model for predicting new model parameters
Zailan et al. Deep Learning Approach for Prediction of Brain Tumor from Small Number of MRI Images
Shukla et al. Vl4pose: Active learning through out-of-distribution detection for pose estimation
Marchesi et al. Mitigating health data poverty: generative approaches versus resampling for time-series clinical data
Nia et al. The Power of ANN-Random Forest Algorithm in Human Activities Recognition Using IMU Data
EP4123509A1 (en) Causal discovery and missing value imputation
CN115879564A (en) Adaptive aggregation for joint learning
WO2022170203A1 (en) System and method for automatic diagnosis of idiopathic pulmonary fibrosis (ipf) using neural networks
JP2024500470A (en) Lesion analysis methods in medical images
Marco et al. Improving Conditional Variational Autoencoder with Resampling Strategies for Regression Synthetic Project Generation.
Raja Deep learning algorithms for real-time healthcare monitoring systems
Yifan et al. An efficient deep learning model for predicting Alzheimer's disease diagnosis by using pet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination