JP2024059100A

JP2024059100A - Apparatus and method for determining an analysis of an image constructed by an encoder

Publication number: JP2024059100A
Application number: JP2023178260A
Authority: JP
Inventors: リーユメン; コレヴァアンナ; チャンダン
Original assignee: Robert Bosch GmbH
Current assignee: Robert Bosch GmbH
Priority date: 2022-10-17
Filing date: 2023-10-16
Publication date: 2024-04-30
Also published as: CN117911806A; US20240135699A1; KR20240053554A; EP4357977A1

Abstract

Problem: To provide a computer-implemented method and system for training an encoder configured to determine a latent representation of an image.
Solution: The method includes the steps of: determining a latent representation (w) and a noise image (ε) by providing a training image (x_i) to an encoder (70), which determines a latent representation and a noise image for a provided image; determining a masked noise image (ε_m) by a masking unit (74) masking out parts of the noise image; determining a predicted image by providing the latent representation and the masked noise image to a generator (80) of a generative adversarial network; and training the encoder (70) by adapting parameters of the encoder (70) based on a loss value that characterizes a difference between the predicted image and the training image.
Selected drawing: FIG. 1

Description

The present invention relates to a computer-implemented method for training an encoder, a method for determining an augmentation of an image, a method for training a machine learning system, a method for determining a control signal, a training system, a control system, a computer program, and a computer-readable storage medium.

Prior Art
“Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation” by Richardson et al., 2021, https://arxiv.org/pdf/2008.00951.pdf, discloses a generic framework for image-to-image translation.

“A style-based generator architecture for generative adversarial networks” by Karras et al., 2019, https://arxiv.org/pdf/1812.04948.pdf, discloses StyleGAN, a neural network architecture that automatically learns, without supervision, to separate high-level attributes from stochastic variation in generated images.

“Analyzing and Improving the Image Quality of StyleGAN” by Karras et al., 2020, https://arxiv.org/pdf/1912.04958.pdf, discloses StyleGAN2, an improved version of the StyleGAN neural network.

“The Unreasonable Effectiveness of Deep Features as a Perceptual Metric” by Zhang et al., 2018, https://arxiv.org/pdf/1801.03924.pdf, discloses the Learned Perceptual Image Patch Similarity (LPIPS) metric.

Background Art
Automatically analyzing latent factors of an image is a task that practitioners face in several technical fields. Determining an image from a latent representation is readily achieved, for example, by neural networks known as generative adversarial networks (GANs). The opposite direction, i.e., finding a latent representation for a given image, remains a difficult problem. For machine learning systems in particular, solving it is desirable because it allows existing datasets to be augmented in order to train machine learning systems with respect to semantic aspects encoded in the images. For example, a latent factor of an image may be the weather condition currently depicted in the image. By adapting the value of this latent factor and feeding the adapted latent representation to the GAN, augmentations can be created that characterize different weather conditions for a given image. These augmentations can then be used to train the machine learning system. Because the machine learning system is then trained with data that is more diverse with respect to the latent factors, e.g., semantic factors, of the training images, its performance in classification and/or regression analysis can be improved.

The process of determining latent factors for an image can be accomplished based on a GAN. Such methods are also referred to in the art as "GAN inversion". Previous work on GAN inversion has shown promising results for simple face datasets such as FFHQ. Using a GAN generator, Richardson et al. propose training an encoder to extract features from a given image and to map these features to intermediate latent variables, which can then be used to manipulate the image, e.g., to change hair color and other facial details. However, for datasets with a relatively high degree of structural complexity, such as driving scene datasets, known methods are not sufficiently capable of reconstructing all objects in the scene, i.e., of recovering all details in the image. In face datasets, for example, the human face is a single object located roughly in the center, whereas in datasets depicting driving scenes there are multiple objects, such as cars, in an image, so that image layouts are far more diverse.

Richardson et al., “Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation”, 2021, https://arxiv.org/pdf/2008.00951.pdf
Karras et al., “A style-based generator architecture for generative adversarial networks”, 2019, https://arxiv.org/pdf/1812.04948.pdf
Karras et al., “Analyzing and Improving the Image Quality of StyleGAN”, 2020, https://arxiv.org/pdf/1912.04958.pdf
Zhang et al., “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric”, 2018, https://arxiv.org/pdf/1801.03924.pdf

Advantageously, the method having the features of independent claim 1 makes it possible to train an encoder that is able to accurately analyze the latent factors of an image. This has the additional advantage that the encoder is suitable for augmenting images with a high degree of structural complexity and thereby for determining accurate augmentations.

Disclosure of the Invention
In a first aspect, the present invention relates to a computer-implemented method for training an encoder configured to determine a latent representation of an image, wherein training the encoder comprises the steps of:
- determining a latent representation and a noise image by providing a training image to the encoder, the encoder being configured to determine a latent representation and a noise image for a provided image;
- determining a masked noise image by masking out parts of the noise image;
- determining a predicted image by providing the latent representation and the masked noise image to a generator of a generative adversarial network;
- training the encoder by adapting parameters of the encoder based on a loss value, the loss value characterizing a difference between the predicted image and the training image.

The encoder may be understood as a machine learning system configured to receive an image as input and to predict a latent representation based on the pixel values of the image. Preferably, the encoder is or includes a neural network. In the method, a training image is provided to the encoder in order to predict a latent representation. Determining the latent representation may be understood as a particular form of image analysis. The encoder is trained to analyze an image with respect to particular latent factors that characterize the image, the latent factors being contained in the latent representation.

The latent factors contained in the latent representation may also be referred to in the art as "style". In other words, the latent representation may also be understood as characterizing at least one latent style of an image. A latent factor may generally be understood as an aspect of the appearance of an image. For example, one latent factor may be the brightness of the situation depicted in the image. One specific value of this latent factor may then characterize, for example, an image depicting a daytime scene, while another value may characterize a nighttime scene depicted by the image.

The encoder is further configured to predict a second component during training, namely a noise image. This noise image may be understood as an image that preferably has the same aspect ratio as the training image. The name "noise image" has been chosen to mirror the naming of the similar entity in StyleGAN. In other words, the noise image should not be understood as a prediction of noise in the image. The noise image is an entity that characterizes one part of the inversion of an image as provided by the generator (the other part being the latent representation). In other words, if the training image was generated by the generator, the encoder learns to determine the noise that was used as input to the generator to generate that training image. The noise image may, for example, contain values between 0 and 1 that characterize, as a percentage, how strongly a pixel in the training image is subject to noise. The noise image may have the same size as the training image, in which case there is a one-to-one correspondence between pixels in the noise image and pixels in the training image. However, it is also possible for the encoder to predict a noise image that is scaled down in size compared to the training image.

The encoder may be configured to process images from different types of sensors. In this sense, an image may be understood as a sensor measurement obtained from a camera, a LIDAR sensor, a radar sensor, an ultrasonic sensor, or a thermal camera.

In the method, parts of the noise image are masked out. This may be understood as replacing pixels in the noise image with other values. For example, pixels may be masked out by replacing their values in the noise image with randomly drawn values, preferably drawn from a Gaussian distribution. To select which pixels to mask out, each pixel in the image may be randomly assigned either to the set of pixels to be masked out or to the set of pixels not to be masked out. Alternatively, regions of the image to be masked out, e.g., rectangular regions, may be determined, for example randomly. Such rectangular regions are sometimes also referred to as patches.
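The two masking variants described above can be sketched as follows. This is a minimal illustration only, not the implementation of the application: the function names, the representation of the noise image as a nested list, the standard normal replacement values, and the choice of a single patch are all assumptions made for the example.

```python
import random

def random_pixel_mask(height, width, ratio, rng):
    """Assign each pixel independently to the masked-out set with probability `ratio` (assumed scheme)."""
    return [[rng.random() < ratio for _ in range(width)] for _ in range(height)]

def random_patch_mask(height, width, patch_h, patch_w, rng):
    """Mask out one randomly placed rectangular region (a 'patch')."""
    top = rng.randrange(height - patch_h + 1)
    left = rng.randrange(width - patch_w + 1)
    return [
        [top <= r < top + patch_h and left <= c < left + patch_w for c in range(width)]
        for r in range(height)
    ]

def mask_noise_image(noise, mask, rng):
    """Replace masked-out pixels of the noise image with Gaussian samples; keep all others."""
    return [
        [rng.gauss(0.0, 1.0) if masked_out else value
         for value, masked_out in zip(noise_row, mask_row)]
        for noise_row, mask_row in zip(noise, mask)
    ]

rng = random.Random(0)
noise = [[0.5] * 6 for _ in range(4)]            # stand-in for a predicted noise image
patch = random_patch_mask(4, 6, 2, 3, rng)       # one 2x3 patch
masked = mask_noise_image(noise, patch, rng)
```

Pixels outside the mask keep the values predicted by the encoder, so the generator still receives the encoder's noise for those positions.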

The latent representation and the masked noise image are provided to the generator, which is configured to determine an image based on a latent representation and a noise image. The generator may be a neural network, in particular a neural network configured to receive the latent representation and the noise image at different layers of the neural network. Preferably, the encoder provides the latent representation to all inputs of the generator that require a latent representation. Alternatively, the encoder may be configured to predict multiple different latent representations and to provide these to the inputs that require a latent representation. With respect to the noise image, the encoder preferably predicts a single noise image. In that case, the noise image may be provided to the generator at all inputs that require a noise image. Alternatively, the noise image may be provided to only a single input of the generator that requires a noise image, while all other inputs that require a noise image are provided with copies of a single randomly drawn noise image or with multiple differently drawn noise images.

Preferably, the generator is configured according to the StyleGAN or StyleGAN2 architecture. Such a generator is also referred to as a "StyleGAN or StyleGAN2 generator". In embodiments using a StyleGAN or StyleGAN2 generator, the latent representation is preferably provided directly to the generator, i.e., the mapping network is not used. Since StyleGAN and StyleGAN2 can also be configured to receive different latent representations and/or noise images for different parts of the network, the encoder can also be configured to determine multiple latent representations and/or noise images that serve as inputs to the StyleGAN or StyleGAN2. StyleGAN and StyleGAN2 are the preferred generative adversarial networks from which to obtain the generator, but other machine learning systems are possible as well, as long as they determine an image based on at least a latent representation and a noise image.

During training of the encoder, the parameters of the generator are preferably fixed, i.e., they are not adapted. In general, however, it is also possible to update the parameters of the generator as part of the method.

The generator determines a predicted image based on the latent representation and the masked noise image. The encoder is then trained by adapting the parameters of the encoder based on the difference between the training image and the predicted image. This is achieved by determining a loss value and adapting the parameters based on this loss value. Preferably, the gradient of the loss with respect to the parameters is determined by a backpropagation algorithm, and the parameters are adapted along the negative gradient. Alternatively, other optimization methods, e.g., evolutionary optimization methods, can be used as well.
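As a rough illustration of adapting only the encoder parameters while the generator stays fixed, the following toy sketch minimizes a reconstruction loss by gradient descent. Everything in it is an assumption made for the example: the two-pixel "images", the linear encoder and generator, the learning rate, and the use of central finite differences in place of backpropagation. The noise image and its masking are omitted to keep the sketch short.

```python
def generator(w):
    """Frozen toy generator: maps a scalar latent to a two-pixel 'image'."""
    return [2.0 * w, -1.0 * w]

def encoder(x, theta):
    """Toy linear encoder with trainable parameters theta."""
    return theta[0] * x[0] + theta[1] * x[1]

def loss(theta, x):
    """Mean squared difference between the predicted image and the training image."""
    predicted = generator(encoder(x, theta))
    return sum((p - t) ** 2 for p, t in zip(predicted, x)) / len(x)

def numerical_gradient(theta, x, eps=1e-6):
    """Central finite differences, used here as a stand-in for backpropagation."""
    grad = []
    for i in range(len(theta)):
        up = theta.copy()
        down = theta.copy()
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up, x) - loss(down, x)) / (2 * eps))
    return grad

x = generator(0.7)            # a training image the generator can reproduce exactly
theta = [0.0, 0.0]            # encoder parameters; the generator has none to adapt
initial_loss = loss(theta, x)
for _ in range(200):
    grad = numerical_gradient(theta, x)
    theta = [t - 0.05 * g for t, g in zip(theta, grad)]   # step along the negative gradient
final_loss = loss(theta, x)
```

After training, the encoder maps the image back to a latent close to the one that produced it, which is the sense in which the procedure inverts the generator.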

The method can also be understood as a method for inverting the generator, i.e., as a method for GAN inversion. It allows the encoder to recover latent factors from the latent space determined during training of the generator. The inventors have discovered that masking parts of the noise image improves the performance of the encoder in accurately determining the latent factors for a provided image, i.e., the encoder becomes better able to analyze the image. This is especially true for images depicting a high degree of structural complexity, such as images of natural scenes or images containing multiple, potentially different, objects.

The loss value can in particular be determined based on a loss function, a first term of which characterizes the difference between the predicted image and the training image.

Preferably, the first term further characterizes a masking of the difference, the masking removing from the difference those pixels that fall within the masked-out parts.

This difference may be, for example, the average L_2 norm over corresponding pixels of the training image and the predicted image. The inventors have discovered that it is beneficial for the performance of the encoder not to consider, in this difference, the pixels that were masked out.
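A masked first term of this kind might be computed as in the following sketch. The function name and the nested-list image layout are assumptions, and the squared pixel difference is used here as one common reading of an average L_2 difference; the application does not fix these details.

```python
def masked_mean_squared_difference(predicted, target, mask):
    """Mean squared pixel difference over pixels that were NOT masked out.

    mask[r][c] is True where the corresponding noise-image pixel was masked out.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(predicted, target, mask):
        for p, t, masked_out in zip(p_row, t_row, m_row):
            if not masked_out:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0

target = [[1.0, 2.0], [3.0, 4.0]]
predicted = [[1.0, 2.0], [3.0, 10.0]]            # differs only in the bottom-right pixel
mask = [[False, False], [False, True]]           # that pixel was masked out
no_mask = [[False, False], [False, False]]

masked_loss = masked_mean_squared_difference(predicted, target, mask)
unmasked_loss = masked_mean_squared_difference(predicted, target, no_mask)
```

In this example the only deviating pixel lies inside the masked-out part, so it contributes to the unmasked difference but not to the masked one.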

Preferably, the loss function includes a second term that characterizes a norm of the noise image predicted by the encoder.

This is advantageous because the second term discourages the encoder from learning to predict noise images with a large variance, thereby limiting the amount of information carried by the noise image. As a result, the latent factors of the image are encoded more faithfully in the latent representation and do not leak into the noise image.

Preferably, the loss function includes a third term characterizing the negative log-likelihood of an output signal of a discriminator, the output signal being determined by the discriminator by providing the predicted image to the discriminator.

The inventors have discovered that using a discriminator further improves the accuracy of the encoder in determining an accurate latent representation for a provided image. The parameters of the discriminator are also preferably fixed during training of the encoder. Using the decoder further encourages the encoder to determine a latent representation that characterizes accurate values for the respective latent factors. Since the generator and the discriminator may be fixed, the encoder is preferably the only entity that can be adapted during training, i.e., the only entity that can change the predicted image. As the inventors have discovered, this term provides an advantageous incentive, during optimization, for the generator to produce images that look real. In other words, the encoder is motivated to determine latent representations that the generator maps to realistic-looking images.

Preferably, the training image can be determined by providing a randomly sampled or user-defined latent representation to the generator, and the loss function includes a fourth term that characterizes the difference between the randomly sampled or user-defined latent representation and the latent representation determined by the encoder.

This can be understood as providing cycle consistency when mapping back and forth between latent representations and predicted images. The starting point may be a randomly selected latent representation, or a latent representation selected at the user's discretion, which is provided to the generator to determine a training image. The training image is then provided to the encoder to determine a latent representation as predicted by the encoder. This latent representation should be close to the previously selected latent representation, i.e., mapping back and forth between latent representation and image should yield similar results. The fourth term advantageously motivates the encoder to ensure such cycle consistency. The inventors have found that the fourth term thereby further improves the accuracy of the encoder.
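The cycle described above (sample a latent, generate a training image, encode it, compare latents) can be sketched with toy components. The four-pixel generator, the averaging encoder, the 0.1 noise scale, and the mean squared latent difference are all assumptions chosen so that the example stays self-contained; they do not reflect a StyleGAN generator.

```python
import random

def generator(w, noise):
    """Frozen toy generator: each latent value sets the level of two pixels, noise adds detail."""
    return [w[0] + 0.1 * noise[0], w[0] + 0.1 * noise[1],
            w[1] + 0.1 * noise[2], w[1] + 0.1 * noise[3]]

def encoder(image):
    """Toy encoder: recovers each latent value as the mean of its two pixels."""
    return [(image[0] + image[1]) / 2, (image[2] + image[3]) / 2]

def latent_cycle_term(w_sampled, noise):
    """Mean squared difference between the sampled latent and the re-encoded latent."""
    training_image = generator(w_sampled, noise)
    w_predicted = encoder(training_image)
    return sum((a - b) ** 2 for a, b in zip(w_sampled, w_predicted)) / len(w_sampled)

rng = random.Random(0)
w_sampled = [rng.gauss(0.0, 1.0) for _ in range(2)]     # randomly sampled latent representation
term_without_noise = latent_cycle_term(w_sampled, [0.0] * 4)
term_with_noise = latent_cycle_term(w_sampled, [rng.gauss(0.0, 1.0) for _ in range(4)])
```

With a zero noise image the toy encoder recovers the sampled latent exactly, so the term vanishes; with a random noise image the term measures how far the re-encoded latent drifts from the sampled one.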

In embodiments that include the fourth term, the noise image that the generator requires to generate the predicted image may be randomly sampled or may be a predetermined noise image.

Preferably, the loss function includes a fifth term characterizing the difference between a first feature representation, determined by providing the training image to a feature extractor, and a second feature representation, determined by providing the predicted image to the feature extractor, wherein the difference preferably does not take into account features that characterize pixels within the masked-out parts.

This can be understood as adding to the loss function a term that characterizes the LPIPS metric. The feature extractor can be understood as a machine learning system configured to determine features, in the machine learning sense, from the supplied training image and the predicted image respectively. For example, the feature extractor may be a neural network, such as the convolutional part of a VGGnet. The inventors have found that adding the fifth term further improves the accuracy of the encoder.

The first to fifth terms can be combined arbitrarily in the loss function. In other words, it is possible to train with some of these terms included and others excluded.
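One way to write a loss that combines all five terms is the weighted sum below. This is an illustrative reading, not a formula given in the application. The symbols x_i (training image), w (latent representation predicted by the encoder), and ε (predicted noise image) follow the text; the predicted image \hat{x}_i, the sampled latent \tilde{w}, the binary mask m (1 where a pixel was masked out), the mask m_F at feature resolution, the discriminator D, the feature extractor F, and the weights λ_1 to λ_5 are notation assumed here. Averaging constants are omitted.

```latex
\mathcal{L}
  = \lambda_1 \,\bigl\lVert (1 - m) \odot (x_i - \hat{x}_i) \bigr\rVert_2
  + \lambda_2 \,\lVert \varepsilon \rVert_2
  - \lambda_3 \,\log D(\hat{x}_i)
  + \lambda_4 \,\lVert \tilde{w} - w \rVert_2
  + \lambda_5 \,\bigl\lVert (1 - m_F) \odot \bigl( F(x_i) - F(\hat{x}_i) \bigr) \bigr\rVert_2
```

Setting a weight to zero removes the corresponding term, which matches the statement that the terms can be combined arbitrarily. The fourth term applies only when x_i was itself generated from the sampled latent \tilde{w}.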

In another aspect, the present invention relates to a computer-implemented method for determining an augmentation of an image, the method comprising the steps of:
- obtaining an encoder based on training the encoder by the method described above;
- determining a first latent representation and a noise image by providing the image to the encoder;
- determining a second latent representation by modifying the first latent representation;
- determining the augmentation by providing the second latent representation and the noise image as input to the generator that was used in training the encoder.

Obtaining the encoder based on training can be understood as carrying out the training method as part of the method for determining an augmentation. Alternatively, it can be understood as obtaining an already trained encoder, in which case the encoder has been trained by the training method presented above.

In the method for determining an augmentation, the augmentation is determined by using the encoder to extract a latent representation and a noise image from the image, modifying a latent factor in the latent representation, and providing the modified latent representation and the noise image to the generator that was used in training the encoder.
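The augmentation procedure (encode, edit the latent, regenerate with the unchanged noise image) can be sketched as below. The toy encoder and generator are hypothetical stand-ins: here the "style" is simply the mean brightness and the noise image holds the per-pixel residual, which is an assumption made to keep the example runnable rather than a description of a trained encoder or a StyleGAN generator.

```python
def toy_encoder(image):
    """Split an image into a latent 'style' (mean brightness) and a noise image (residual detail)."""
    w = sum(image) / len(image)
    noise = [pixel - w for pixel in image]
    return w, noise

def toy_generator(w, noise):
    """Recombine a latent and a noise image into an image."""
    return [w + n for n in noise]

def augment(image, encoder, generator, edit_latent):
    """Encode the image, modify the latent, and regenerate with the unchanged noise image."""
    w_first, noise = encoder(image)
    w_second = edit_latent(w_first)
    return generator(w_second, noise)

image = [0.2, 0.4, 0.9]
brighter = augment(image, toy_encoder, toy_generator, lambda w: w + 0.3)
unchanged = augment(image, toy_encoder, toy_generator, lambda w: w)
```

Because the noise image is passed through unchanged, the pixel-to-pixel structure of the input is preserved while the edited latent shifts its overall appearance.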

Advantageously, the method makes it possible to create images that can be used to train a machine learning system. Because a latent factor has been changed, the augmentation characterizes a different style of the image while preserving at least part of its content. Using augmentations to train a machine learning system therefore provides the machine learning system with a more diverse set of images, since the augmentations characterize different styles. The inventors have found that this improves the performance of the machine learning system.

Thus, in another aspect, the present invention relates to a computer-implemented method for training a machine learning system, the machine learning system being configured to determine an output signal characterizing a classification and/or a regression analysis of an image, the method comprising the steps of:
- determining an augmentation of a training image according to claim 9;
- training the machine learning system based on the augmentation.

Embodiments of the present invention are described in more detail below with reference to the following drawings.

FIG. 1 schematically shows a part of a training method for training an encoder.
FIG. 2 schematically shows an example of masking a noise image.
FIG. 3 shows an augmentation device for augmenting an image.
FIG. 4 shows a training system for training a machine learning system.
FIG. 5 shows a control system including a machine learning system for controlling an actuator in the environment of the actuator.
FIG. 6 shows a control system controlling an at least semi-autonomous robot.
FIG. 7 shows a control system controlling a manufacturing machine.
FIG. 8 shows a control system controlling an automated personal assistant.
FIG. 9 shows a control system controlling an access control system.
FIG. 10 shows a control system controlling a surveillance system.
FIG. 11 shows a control system controlling an imaging system.
FIG. 12 shows a control system controlling a medical analysis system.

Description of Embodiments
FIG. 1 shows a part of an embodiment of a method for training an encoder (70). During the method, the encoder (70) is trained to determine a latent representation (w), which characterizes latent factors (also known as style) of an image, and a noise image (ε), which can be understood as predicting regions of noise in the image.

The encoder (70) can be trained on a single training image (x_i). Preferably, however, the method uses a plurality of training images (x_i) to train the encoder (70). The training image or images (x_i) preferably depict scenes with a high degree of structural complexity, for example scenes of natural environments such as scenes encountered while driving a car and/or urban scenes.

In this embodiment, the encoder (70) is characterized by a neural network that predicts the latent representation (w) and/or the noise image (ε). In other embodiments, other machine learning models may be used to predict the latent representation (w) and/or the noise image (ε). The encoder (70) preferably comprises a feature extractor (71) for extracting features (f) from the training image (x_i) supplied to the encoder (70). The features (f) can preferably be forwarded to a style unit (72), which is configured to determine the latent representation (w), and to a noise unit (73), which is configured to determine the noise image (ε). The style unit (72) and/or the noise unit (73) may preferably be neural networks. In general, however, other machine learning models can also be used for the style unit (72) and/or the noise unit (73). In other embodiments, the encoder (70) may comprise a single neural network for predicting both the latent representation (w) and the noise image (ε).

In this embodiment, the latent representation (w) is configured as a matrix or tensor, and the noise image (ε) is configured as a matrix. The encoder (70) is configured such that the width and height of the latent representation (w) and of the noise image (ε) have the same ratio as the width and height of the training image (x_i). This can preferably be achieved by using aspect-preserving operations in the feature extractor (71), the style unit (72) and the noise unit (73), for example convolutions with equal strides along width and height.
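By way of illustration only, the effect of equal strides on the width-to-height ratio may be sketched with the standard output-size formula of a strided convolution, floor((n + 2p - k) / s) + 1. The kernel size, stride, padding and number of layers below are assumed values chosen for the example and are not taken from the embodiment.

```python
def conv_out_size(n: int, kernel: int, stride: int, padding: int) -> int:
    """Output length along one axis of a strided convolution."""
    return (n + 2 * padding - kernel) // stride + 1


def feature_map_shape(height, width, layers=3, kernel=3, stride=2, padding=1):
    """Apply `layers` convolutions that use the same stride along both axes.

    All default values are hypothetical; only the equal stride matters here.
    """
    for _ in range(layers):
        height = conv_out_size(height, kernel, stride, padding)
        width = conv_out_size(width, kernel, stride, padding)
    return height, width


# A 256 x 512 input (ratio 1:2) is reduced along both axes by the same factor.
h, w = feature_map_shape(256, 512)
print(h, w, w / h)
```

For input sizes that are not divisible by the overall stride, the floor operation can make the ratio only approximately equal.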

The noise image (ε) is provided to a masking unit (74) configured to mask the noise image (ε). In this embodiment, masking is carried out by randomly selecting elements of the noise image (ε) and replacing each selected element with a value drawn at random from a Gaussian distribution, thereby determining a masked noise image (ε_m). In further embodiments, the random values can also be drawn from other probability distributions.
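The masking step described above may, for example, be sketched as follows using NumPy. The function name `mask_noise_image` and the parameter `mask_fraction` (the expected share of masked elements) are hypothetical and serve only the illustration; the returned binary mask M uses the value 1 for masked-out elements, in line with the description of FIG. 2.

```python
import numpy as np


def mask_noise_image(eps, mask_fraction=0.25, rng=None):
    """Replace randomly selected elements of `eps` with Gaussian samples.

    Returns the masked noise image and a binary mask M (1 = masked out).
    `mask_fraction` is an assumed hyperparameter, not part of the source.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Each element is selected independently with probability mask_fraction.
    mask = (rng.random(eps.shape) < mask_fraction).astype(eps.dtype)
    random_values = rng.standard_normal(eps.shape).astype(eps.dtype)
    # Keep original values where mask == 0, insert random values where mask == 1.
    eps_m = (1.0 - mask) * eps + mask * random_values
    return eps_m, mask


eps = np.zeros((4, 8), dtype=np.float32)
eps_m, M = mask_noise_image(eps, rng=np.random.default_rng(0))
```

Other selection schemes, such as masking contiguous patches, would fit the same interface.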

The latent representation (w) and the masked noise image (ε_m) are then provided as input to a generator (80) of a generative adversarial network. The generative adversarial network has preferably been trained before the method for training the encoder (70) is carried out. However, it is also possible to train the generative adversarial network as an additional step when training the encoder (70). The generative adversarial network is configured to determine an image based on a provided latent representation and a provided noise image. Preferably, the generative adversarial network is a StyleGAN or a StyleGAN2.

潜在表現は、好ましくはＳｔｙｌｅＧＡＮ又はＳｔｙｌｅＧＡＮ２のマッピングネットワークを使用することなく生成器（８０）に提供される。このことは、エンコーダがＳｔｙｌｅＧＡＮ又はＳｔｙｌｅＧＡＮ２の中間潜在空間から潜在表現を決定することを学習するので、有利である。この中間潜在空間は、ＳｔｙｌｅＧＡＮ又はＳｔｙｌｅＧＡＮ２の元の潜在空間よりも良好な解きほぐし（disentanglement）を有するので、このことにより、画像に関する潜在因子を決定する際におけるエンコーダの性能が、有利にはさらに向上する。 The latent representation is preferably provided to the generator (80) without using a StyleGAN or StyleGAN2 mapping network. This is advantageous because the encoder learns to determine the latent representation from an intermediate latent space of StyleGAN or StyleGAN2. This advantageously further improves the performance of the encoder in determining latent factors for the image, since this intermediate latent space has better disentanglement than the original latent space of StyleGAN or StyleGAN2.

The generator (80) determines a predicted image $\hat{x}_i$ based on the latent representation (w) and the noise image (ε). A loss value characterizing the difference between the training image (x_i) and the predicted image $\hat{x}_i$ can then be determined, and training can be carried out so as to minimize this loss value. The loss value can, for example, be determined from a loss function. The loss function can in particular comprise a first term characterizing the difference, namely

$$L_{rec} = \lVert (1 - M) \odot (x_i - \hat{x}_i) \rVert_2$$

where $x_i$ and $\hat{x}_i$ are the training image and the predicted image, respectively, and $\odot$ is the Hadamard product. The term $1 - M$ expresses a preferred weighting of the pixels in the difference: pixels that were masked out in the masked noise image (ε_m) are not taken into account when determining the first term $L_{rec}$. Here, $1$ is to be understood as an all-ones matrix of the same shape as the training image (x_i) and the predicted image $\hat{x}_i$; if the noise image (ε) has a different shape than the training image (x_i), the mask used for masking the noise image (ε) is scaled to the size of the training image (x_i). Determining the $L_2$-norm of the difference $x_i - \hat{x}_i$ is understood in particular as averaging the Euclidean distances between corresponding pixels of $x_i$ and $\hat{x}_i$.
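As a non-limiting sketch, the first term L_rec with its (1 - M) weighting may be computed as shown below. The nearest-neighbour scaling of the mask and the averaging over all pixels (masked-out pixels contributing zero) are assumptions made for the example; the function names are hypothetical.

```python
import numpy as np


def upscale_mask(mask, height, width):
    """Nearest-neighbour scaling of a binary mask to the image size."""
    rows = np.arange(height) * mask.shape[0] // height
    cols = np.arange(width) * mask.shape[1] // width
    return mask[np.ix_(rows, cols)]


def reconstruction_loss(x, x_hat, mask):
    """Average over all pixels of the (1 - M)-weighted Euclidean distance.

    x, x_hat: arrays of shape (H, W, C); mask: binary, 1 = masked out.
    Masked-out pixels contribute zero to the average.
    """
    m = upscale_mask(mask, x.shape[0], x.shape[1])
    diff = (1.0 - m)[..., None] * (x - x_hat)      # Hadamard product with (1 - M)
    per_pixel = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean distance per pixel
    return float(per_pixel.mean())


x = np.zeros((4, 4, 3))
x_hat = np.ones((4, 4, 3))
M = np.array([[1.0, 0.0], [0.0, 0.0]])  # coarse mask: top-left block masked out
loss = reconstruction_loss(x, x_hat, M)
```

In this example the four pixels of the top-left block are ignored, so only the remaining twelve pixels contribute to the loss.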

Preferably, the loss function comprises a second term characterizing a norm of the noise image (ε). Preferably, this is the sum of the values in the noise image ε, which encourages sparsity when predicting the noise image. The second term can be expressed by the formula

$$L_{noise\_reg} = \lvert \varepsilon \rvert$$

Preferably, the loss function comprises a third term characterizing the negative log-likelihood of an output signal of a discriminator, the output signal being determined by the discriminator when the predicted image $\hat{x}_i$ is provided to it. In other words, the discriminator used when training the generative adversarial network can serve as additional guidance when training the encoder (70). Via the decoder, the encoder obtains additional information about how “realistic” the predicted image $\hat{x}_i$ looks and thereby about how useful the latent representation is for predicting “realistic”-looking images. The third term can be expressed by the formula

$$L_{adv} = -\mathbb{E}\left[\log D(\hat{x}_i)\right]$$

where $D$ is the discriminator and $\mathbb{E}$ is the expectation function.
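For illustration, the negative log-likelihood of the discriminator output may be estimated over a batch of predicted images as follows. The sketch assumes that the discriminator outputs are already probabilities in (0, 1]; a discriminator that returns unnormalized scores would first need a sigmoid, which is not shown here.

```python
import math


def adversarial_loss(discriminator_outputs):
    """Negative mean log of D(x_hat) over a batch.

    `discriminator_outputs` are assumed to be probabilities in (0, 1].
    """
    eps = 1e-12  # guards against log(0)
    logs = [math.log(max(p, eps)) for p in discriminator_outputs]
    return -sum(logs) / len(logs)


# The loss shrinks as the discriminator rates the predictions as more realistic.
print(adversarial_loss([0.5, 0.5]), adversarial_loss([0.9, 0.9]))
```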

Preferably, the loss function comprises a fourth term characterizing a difference between a randomly sampled or user-defined latent representation and the latent representation determined by the encoder (70), wherein the randomly sampled or user-defined latent representation is provided to the generator (80), thereby determining the training image (x_i). In other words, the training image (x_i) is determined from the randomly sampled or user-defined latent representation. The fourth term can be expressed by the formula

$$L_w = \lVert w_{gt} - w \rVert$$

where $w_{gt}$ is the randomly sampled or user-defined latent representation.

Preferably, the loss function comprises a fifth term characterizing a difference between a first feature representation, determined by providing the training image (x_i) to a feature extractor, and a second feature representation, determined by providing the predicted image $\hat{x}_i$ to the feature extractor, wherein the difference preferably does not characterize features that characterize pixels lying within the masked-out portion. This can be understood as using the LPIPS metric as additional guidance when training the encoder (70). The fifth term can be expressed by the formula

$$L_{lpips} = \lVert (1 - M) \odot (V(x_i) - V(\hat{x}_i)) \rVert_2$$

where $V$ is the feature extractor and the mask $M$ is scaled to the width and height of the features, as is done for the first term $L_{rec}$.

Any combination of these terms can be used to determine the loss function L. Preferably, each term is assigned a weight that controls its importance relative to the other terms. The loss function can therefore be expressed by the formula

$$L = \alpha_1 L_{rec} + \alpha_2 L_{noise\_reg} + \alpha_3 L_{adv} + \alpha_4 L_w + \alpha_5 L_{lpips}$$

where $L_{rec}$, $L_{noise\_reg}$, $L_{adv}$, $L_w$ and $L_{lpips}$ denote the first to fifth terms and $\alpha_1$ to $\alpha_5$ are the weights of the respective terms. These weights can be understood as hyperparameters of the training method.
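The weighted combination may be sketched, by way of example, as a sum over whichever terms are actually used. The dictionary keys and the numerical values below are hypothetical placeholders; suitable weights would have to be found as hyperparameters.

```python
def total_loss(terms, weights):
    """Weighted sum over the supplied loss terms.

    `terms` maps a term name to its value; `weights` maps the same names
    to the corresponding weight. Terms that are not supplied are skipped.
    """
    return sum(weights[name] * value for name, value in terms.items())


# Example using only three of the five terms (all numbers are made up).
terms = {"rec": 1.0, "noise_reg": 2.0, "adv": 4.0}
weights = {"rec": 0.5, "noise_reg": 0.25, "adv": 0.125, "w": 1.0, "lpips": 1.0}
loss = total_loss(terms, weights)
```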

The encoder (70) can then be trained by gradient descent. This can be understood in particular as adapting the parameters along the negative gradient of the loss with respect to the parameters.
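A single gradient descent update may be illustrated on a toy problem as follows. The quadratic loss, the learning rate and the number of steps are assumptions for the example; in practice the gradient of the encoder loss would be obtained by automatic differentiation.

```python
def gradient_descent_step(params, grads, learning_rate=0.1):
    """Move each parameter along the negative gradient of the loss."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Toy loss L(p) = sum(p_k ** 2) with gradient 2 * p_k and minimum at zero.
params = [1.0, -2.0]
for _ in range(50):
    grads = [2.0 * p for p in params]
    params = gradient_descent_step(params, grads)
```

With these settings each step shrinks every parameter by a factor of 0.8, so the parameters approach the minimum at zero.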

FIG. 2 shows how the noise image (ε) can be masked in order to determine the masked noise image (ε_m). The noise image (ε) is characterized by a matrix whose elements are noise values. A plurality of elements is selected to be masked out; these elements are also referred to as the masked-out portion (p) of the noise image (ε). They can be characterized by a binary matrix M that contains the value 1 for the masked-out portion (p) and the value 0 for all other elements. The masked-out portion can then be replaced by randomly sampled values, for example values sampled from a Gaussian distribution.

FIG. 3 shows an embodiment of an augmentation unit (90) configured to augment a provided image (b_i). The augmentation unit (90) comprises an encoder (70) that has been trained with the training method presented above. The encoder (70) receives the provided image (b_i) and determines a noise image (ε) and a latent representation (w). The latent representation (w) is provided to a variation unit (91), which is configured to vary one or more latent factors of the latent representation. Preferably, the variation unit (91) randomly selects the one or more factors to be varied. The amount of variation can likewise be chosen randomly within an interval that is understood as a hyperparameter of the variation unit (91). By varying the latent factors of the latent representation (w), the variation unit (91) determines a second latent representation. The second latent representation and the noise image (ε) are provided to the generator (80) that was used when training the encoder (70). The generator (80) then determines an image, which is provided as the augmentation.
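The random variation of latent factors may, for instance, be sketched as follows. For simplicity the latent representation is treated as a flat list of factors, whereas the embodiment describes a matrix or tensor; the names `vary_latent`, `num_factors` and `max_delta` are hypothetical, with `max_delta` playing the role of the interval hyperparameter.

```python
import random


def vary_latent(w, num_factors=1, max_delta=0.5, rng=None):
    """Return a copy of `w` with randomly chosen factors shifted.

    `w` is a flat list of latent factors (a simplifying assumption);
    each chosen factor changes by at most `max_delta` in either direction.
    """
    rng = rng or random.Random()
    w_new = list(w)
    for idx in rng.sample(range(len(w)), k=num_factors):
        w_new[idx] += rng.uniform(-max_delta, max_delta)
    return w_new


w = [0.0] * 8
w_second = vary_latent(w, num_factors=2, rng=random.Random(0))
```

The second latent representation obtained in this way would then be passed to the generator together with the unchanged noise image.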

FIG. 4 shows an embodiment of a training system (140) that uses the augmentation unit (90) for training a machine learning system (60) by means of a training data set (T). The training data set (T) comprises a plurality of images (b_i) used for training the machine learning system (60) and further comprises, for each image (b_i), a desired output signal (t_i) that corresponds to the image (b_i) and characterizes a desired classification and/or a desired regression result of the image (b_i).

For training, a training data unit (150) accesses a computer-implemented database (St_2), which provides the training data set (T). The training data unit (150) determines, preferably at random, at least one image (b_i) from the training data set (T) together with the desired output signal (t_i) corresponding to this image (b_i), and transmits the image (b_i) to the machine learning system (60). The machine learning system (60) determines an output signal (y_i) based on the image (b_i).

The desired output signal (t_i) and the determined output signal (y_i) are transmitted to a modification unit (180).

The modification unit (180) then determines new parameters (Φ') for the machine learning system (60) based on the desired output signal (t_i) and the determined output signal (y_i). To this end, the modification unit (180) compares the desired output signal (t_i) and the determined output signal (y_i) using a loss function. The loss function determines a first loss value that characterizes how far the determined output signal (y_i) deviates from the desired output signal (t_i). In the given embodiment, a negative log-likelihood function is used as the loss function. Other loss functions are conceivable in alternative embodiments.

Furthermore, it is conceivable that the determined output signal (y_i) and the desired output signal (t_i) each comprise a plurality of sub-signals, for example in the form of tensors, wherein a sub-signal of the desired output signal (t_i) corresponds to a sub-signal of the determined output signal (y_i). For example, the machine learning system (60) may be configured for object detection, with a first sub-signal characterizing the probability of occurrence of an object with respect to a part of the image (b_i) and a second sub-signal characterizing the exact position of the object. If the determined output signal (y_i) and the desired output signal (t_i) comprise a plurality of corresponding sub-signals, a second loss value is preferably determined for each pair of corresponding sub-signals by means of a suitable loss function, and the determined second loss values are suitably combined to form the first loss value, for example by a weighted sum.
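For the object detection example, the combination of second loss values into the first loss value may be sketched as follows. The choice of a negative log-likelihood for the occurrence probability, a squared error for the position, and the weights `w_cls` and `w_loc` are assumptions made only for this illustration.

```python
import math


def detection_loss(pred_prob, target_label, pred_box, target_box,
                   w_cls=1.0, w_loc=1.0):
    """Weighted sum of a classification and a localization sub-loss.

    pred_prob: predicted probability that an object is present.
    target_label: 1 if an object is present, otherwise 0.
    pred_box, target_box: position values of equal length.
    """
    p = pred_prob if target_label == 1 else 1.0 - pred_prob
    cls_loss = -math.log(max(p, 1e-12))  # negative log-likelihood
    loc_loss = sum((a - b) ** 2 for a, b in zip(pred_box, target_box))
    return w_cls * cls_loss + w_loc * loc_loss


first_loss = detection_loss(0.5, 1, [0, 0, 1, 1], [0, 0, 1, 2], w_loc=2.0)
```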

The modification unit (180) determines the new parameters (Φ') based on the first loss value. In the given embodiment, this is done using a gradient descent method, preferably stochastic gradient descent, Adam or AdamW. In further embodiments, training can also be based on evolutionary algorithms or second-order methods for training neural networks.

In other preferred embodiments, the training described above is repeated iteratively for a predetermined number of iteration steps, or until the first loss value falls below a predetermined threshold. Alternatively or additionally, it is conceivable to terminate training when an average first loss value on a test or validation data set falls below a predetermined threshold. In at least one of the iterations, the new parameters (Φ') determined in a previous iteration are used as the parameters (Φ) of the machine learning system (60).
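The two stopping criteria (a fixed number of iteration steps and a loss threshold) may be sketched as follows. The callable `step_fn` is a hypothetical stand-in for one parameter update that returns the first loss value of that step; the default values are arbitrary.

```python
def train(step_fn, max_steps=1000, loss_threshold=1e-3):
    """Repeat `step_fn` until the step budget is used or the loss is small.

    Returns the number of steps performed and the last loss value.
    """
    loss = float("inf")
    for step in range(1, max_steps + 1):
        loss = step_fn()
        if loss < loss_threshold:
            return step, loss
    return max_steps, loss


# Toy update whose loss halves at every step.
state = {"loss": 1.0}

def halve_loss():
    state["loss"] *= 0.5
    return state["loss"]

steps, final_loss = train(halve_loss)
```

A validation-based criterion could be added in the same way by evaluating an average loss on held-out data inside the loop.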

さらに、トレーニングシステム（１４０）は、少なくとも１つのプロセッサ（１４５）と、少なくとも１つの機械可読記憶媒体（１４６）とを含み得るものであり、少なくとも１つの機械可読記憶媒体（１４６）は、プロセッサ（１４５）によって実行された場合に本発明の態様のうちの１つによるトレーニング方法をトレーニングシステム（１４０）に実行させる命令を含む。 Further, the training system (140) may include at least one processor (145) and at least one machine-readable storage medium (146), the at least one machine-readable storage medium (146) including instructions that, when executed by the processor (145), cause the training system (140) to perform a training method according to one of the aspects of the present invention.

図５は、アクチュエータ（１０）の環境（２０）におけるアクチュエータ（１０）の実施形態を示している。アクチュエータ（１０）は、制御システム（４０）と相互作用し、制御システム（４０）は、アクチュエータ（１０）を制御するために機械学習システム（６０）を使用する。アクチュエータ（１０）とアクチュエータ（１０）の環境（２０）とを、合わせてアクチュエータシステムと称することとする。好ましくは等間隔の時点に、センサ（３０）がアクチュエータシステムの状態を感知する。センサ（３０）は、複数のセンサを含み得る。好ましくは、センサ（３０）は、環境（２０）の画像を撮影する光学センサである。感知された状況を符号化する、センサ（３０）の出力信号（Ｓ）（又はセンサ（３０）が複数のセンサを含む場合には、これらのセンサの各々ごとの出力信号（Ｓ））が、制御システム（４０）に送信される。 Figure 5 shows an embodiment of an actuator (10) in its environment (20). The actuator (10) interacts with a control system (40), which uses a machine learning system (60) to control the actuator (10). The actuator (10) and its environment (20) are collectively referred to as the actuator system. A sensor (30), preferably at equally spaced time points, senses the state of the actuator system. The sensor (30) may include multiple sensors. Preferably, the sensor (30) is an optical sensor that takes an image of the environment (20). An output signal (S) of the sensor (30) (or an output signal (S) for each of these sensors, if the sensor (30) includes multiple sensors), encoding the sensed situation, is sent to the control system (40).

それにより、制御システム（４０）は、センサ信号（Ｓ）のストリームを受信する。次いで、制御システム（４０）は、センサ信号（Ｓ）のストリームに依存して一連の制御信号（Ａ）を計算し、これらの制御信号（Ａ）は、次いで、アクチュエータ（１０）に送信される。 Thereby, the control system (40) receives the stream of sensor signals (S). The control system (40) then calculates a set of control signals (A) depending on the stream of sensor signals (S), which are then transmitted to the actuator (10).

制御システム（４０）は、センサ（３０）のセンサ信号（Ｓ）のストリームを、任意選択肢の受信ユニット（５０）において受信する。受信ユニット（５０）は、センサ信号（Ｓ）を画像（ｘ）に変換する。代替的に、受信ユニット（５０）が設けられていない場合には、それぞれのセンサ信号（Ｓ）を直接的に画像（ｘ）として取得するものとしてもよい。画像（ｘ）を、例えばセンサ信号（Ｓ）の抜粋として提供することができる。代替的に、センサ信号（Ｓ）を処理して画像（ｘ）を生成するものとしてもよい。換言すれば、画像（ｘ）は、センサ信号（Ｓ）に従って提供される。 The control system (40) receives a stream of sensor signals (S) of the sensor (30) at an optional receiving unit (50). The receiving unit (50) converts the sensor signals (S) into an image (x). Alternatively, if the receiving unit (50) is not provided, the respective sensor signal (S) may be directly acquired as an image (x). The image (x) may be provided, for example, as an excerpt of the sensor signal (S). Alternatively, the sensor signal (S) may be processed to generate the image (x). In other words, the image (x) is provided according to the sensor signal (S).

次いで、画像（ｘ）は、機械学習システム（６０）に伝送される。 The image (x) is then transmitted to a machine learning system (60).

The machine learning system (60) is parameterized by parameters (Φ), which are stored in and provided by a parameter storage (St_1).

機械学習システム（６０）は、画像（ｘ）から出力信号（ｙ）を決定する。出力信号（ｙ）は、画像（ｘ）に１つ又は複数のラベルを割り当てる情報を含む。出力信号（ｙ）は、任意選択肢の変換ユニット（８０）に送信され、変換ユニット（８０）は、出力信号（ｙ）を制御信号（Ａ）に変換する。次いで、制御信号（Ａ）は、アクチュエータ（１０）を相応に制御するためにアクチュエータ（１０）に送信される。代替的に、出力信号（ｙ）を直接的に制御信号（Ａ）として取得するものとしてもよい。 The machine learning system (60) determines an output signal (y) from an image (x). The output signal (y) includes information that assigns one or more labels to the image (x). The output signal (y) is sent to an optional transformation unit (80), which transforms the output signal (y) into a control signal (A). The control signal (A) is then sent to the actuator (10) to control the actuator (10) accordingly. Alternatively, the output signal (y) may be obtained directly as the control signal (A).
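The signal flow from sensor signal (S) to control signal (A) may be sketched as a single function, purely for illustration. The callables `receive` and `convert` stand in for the optional receiving unit (50) and the optional transformation unit (80); the toy model and the control signal names are hypothetical.

```python
def control_step(sensor_signal, model, receive=None, convert=None):
    """One pass from sensor signal (S) to control signal (A).

    When `receive` or `convert` is omitted, the corresponding signal is
    passed through unchanged, mirroring the optional units.
    """
    image = receive(sensor_signal) if receive is not None else sensor_signal
    output = model(image)
    return convert(output) if convert is not None else output


# Hypothetical stand-in model: labels an image by its mean brightness.
def toy_model(image):
    return "bright" if sum(image) / len(image) > 0.5 else "dark"


control = control_step(
    [0.9, 0.8, 0.7],
    toy_model,
    convert=lambda label: {"bright": "close_shutter", "dark": "open_shutter"}[label],
)
print(control)  # close_shutter
```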

アクチュエータ（１０）は、制御信号（Ａ）を受信し、相応に制御され、制御信号（Ａ）に対応する行動を実施する。アクチュエータ（１０）は、制御信号（Ａ）をさらなる制御信号に変換する制御ロジックを含み得るものであり、その場合、このさらなる制御信号を使用してアクチュエータ（１０）が制御される。 Actuator (10) receives control signal (A) and is accordingly controlled to perform an action corresponding to control signal (A). Actuator (10) may include control logic that converts control signal (A) into a further control signal, which is then used to control actuator (10).

さらなる実施形態においては、制御システム（４０）は、センサ（３０）を含み得る。さらに他の実施形態においては、制御システム（４０）は、代替的又は追加的にアクチュエータ（１０）を含み得る。 In further embodiments, the control system (40) may include a sensor (30). In yet other embodiments, the control system (40) may alternatively or additionally include an actuator (10).

さらに他の実施形態においては、制御システム（４０）が、アクチュエータ（１０）に代えて又はこれに加えて、ディスプレイ（１０ａ）を制御することを想定することができる。 In yet other embodiments, it is contemplated that the control system (40) controls the display (10a) instead of or in addition to the actuator (10).

さらに、制御システム（４０）は、少なくとも１つのプロセッサ（４５）と、少なくとも１つの機械可読記憶媒体（４６）とを含み得るものであり、少なくとも１つの機械可読記憶媒体（４６）上には、実行された場合に本発明の態様による方法を制御システム（４０）に実行させる命令が格納されている。 Further, the control system (40) may include at least one processor (45) and at least one machine-readable storage medium (46) having instructions stored thereon that, when executed, cause the control system (40) to perform a method according to an aspect of the present invention.

図６は、少なくとも半自律的なロボット、例えば少なくとも半自律的な車両（１００）を制御するために制御システム（４０）が使用される実施形態を示している。 Figure 6 shows an embodiment in which the control system (40) is used to control an at least semi-autonomous robot, such as an at least semi-autonomous vehicle (100).

センサ（３０）は、１つ又は複数のビデオセンサ、及び／又は、１つ又は複数のレーダセンサ、及び／又は、１つ又は複数の超音波センサ、及び／又は、１つ又は複数のＬｉＤＡＲセンサを含み得る。これらのセンサの一部又は全部は、必須ではないが、好ましくは車両（１００）に搭載されている。 The sensors (30) may include one or more video sensors, one or more radar sensors, one or more ultrasonic sensors, and/or one or more LiDAR sensors. Some or all of these sensors are preferably, but not necessarily, on board the vehicle (100).

機械学習システム（６０）は、画像（ｘ）に基づいて、少なくとも半自律的なロボットの近傍にあるオブジェクトを検出するように構成可能である。出力信号（ｙ）は、少なくとも半自律的なロボットの近傍におけるどこにオブジェクトが位置しているかを特徴付ける情報を含み得る。次いで、例えば検出されたオブジェクトとの衝突を回避するために、この情報に従って制御信号（Ａ）を決定することができる。 The machine learning system (60) can be configured to detect an object in the vicinity of the at least semi-autonomous robot based on the image (x). The output signal (y) can include information characterizing where the object is located in the vicinity of the at least semi-autonomous robot. A control signal (A) can then be determined according to this information, for example to avoid a collision with the detected object.

好ましくは車両（１００）に搭載されているアクチュエータ（１０）は、車両（１００）のブレーキ、推進システム、エンジン、ドライブトレイン又はステアリングによって提供可能である。検出されたオブジェクトとの衝突を車両（１００）が回避するように、アクチュエータ（１０）が制御されるように、制御信号（Ａ）を決定することができる。検出されたオブジェクトを、機械学習システム（６０）が最も尤もらしいと見なした、それらのオブジェクトの正体、例えば歩行者や樹木に従って分類し、その分類に依存して、制御信号（Ａ）を決定することもできる。 The actuators (10), preferably on board the vehicle (100), can be provided by the brakes, propulsion system, engine, drivetrain or steering of the vehicle (100). A control signal (A) can be determined such that the actuators (10) are controlled such that the vehicle (100) avoids a collision with the detected object. The detected objects can also be classified according to their identity, e.g. pedestrian or tree, which the machine learning system (60) considers to be most plausible, and the control signal (A) can be determined depending on the classification.

代替的又は追加的に、制御信号（Ａ）は、例えば機械学習システム（６０）によって検出されたオブジェクトが表示されるように、ディスプレイ（１０ａ）を制御するためにも使用可能である。車両（１００）が、検出されたオブジェクトのうちの少なくとも１つと衝突しそうになった場合に、警告信号が生成されるように、制御信号（Ａ）がディスプレイ（１０ａ）を制御することができるようにすることも想像することができる。警告信号は、警告音及び／又は触覚信号、例えば車両のステアリングホイールの振動であるものとしてよい。 Alternatively or additionally, the control signal (A) can also be used to control the display (10a) so that objects detected, for example, by the machine learning system (60) are displayed. It is also conceivable that the control signal (A) can control the display (10a) so that a warning signal is generated if the vehicle (100) is about to collide with at least one of the detected objects. The warning signal can be an audible warning and/or a haptic signal, for example a vibration on the steering wheel of the vehicle.

さらなる実施形態においては、少なくとも半自律的なロボットは、例えば、飛行、水泳、潜水又は歩行によって移動することができる他の移動型ロボット（図示せず）によって提供可能である。移動型ロボットは、特に、少なくとも半自律的な芝刈り機、又は、少なくとも半自律的な掃除ロボットであるものとしてよい。上記の全ての実施形態において、移動型ロボットが前述の識別されたオブジェクトとの衝突を回避することができるように、移動型ロボットの推進ユニット及び／又はステアリング及び／又はブレーキが制御されるように、制御信号（Ａ）を決定することができる。 In further embodiments, the at least semi-autonomous robot may be provided by another mobile robot (not shown), which may for example move by flying, swimming, diving or walking. The mobile robot may in particular be an at least semi-autonomous lawnmower or an at least semi-autonomous cleaning robot. In all the above embodiments, the control signal (A) may be determined such that the propulsion unit and/or steering and/or braking of the mobile robot are controlled so that the mobile robot can avoid a collision with said identified object.

さらなる実施形態においては、少なくとも半自律的なロボットは、園芸用ロボット（図示せず）によって提供可能であり、園芸用ロボットは、センサ（３０）、好ましくは光学センサを使用して、環境（２０）における植物の状態を特定する。アクチュエータ（１０）は、液体を噴霧するためのノズル、及び／又は、切断装置、例えば、ブレードを制御することができる。植物の識別された種及び／又は識別された状態に依存して、アクチュエータ（１０）に、適当な液体の適当な量を植物に噴霧させるように、及び／又は、植物を切断させるように、制御信号（Ａ）を決定することができる。 In a further embodiment, the at least semi-autonomous robot can be provided by a gardening robot (not shown), which uses a sensor (30), preferably an optical sensor, to identify the state of the plants in the environment (20). The actuator (10) can control a nozzle for spraying a liquid and/or a cutting device, e.g. a blade. Depending on the identified species of the plant and/or the identified state, a control signal (A) can be determined to cause the actuator (10) to spray an appropriate amount of a suitable liquid on the plant and/or to cut the plant.

さらに他の実施形態においては、少なくとも半自律的なロボットは、例えば、洗濯機、ストーブ、オーブン、電子レンジ又は食器洗浄機のような家電装置（図示せず）によって提供可能である。センサ（３０）、例えば光学センサは、家電装置によって処理が施されるべきオブジェクトの状態を検出することができる。例えば、家電装置が洗濯機である場合には、センサ（３０）は、洗濯機内の洗濯物の状態を検出することができる。次いで、検出された洗濯物の素材に依存して、制御信号（Ａ）を決定することができる。 In yet another embodiment, the at least semi-autonomous robot can be provided by a household appliance (not shown), such as a washing machine, stove, oven, microwave or dishwasher. A sensor (30), such as an optical sensor, can detect the state of an object to be processed by the household appliance. For example, if the household appliance is a washing machine, the sensor (30) can detect the state of the laundry in the washing machine. A control signal (A) can then be determined depending on the detected material of the laundry.

図７は、例えば生産ラインの一部としての、製造システム（２００）の製造機械（１１）（例えば、パンチカッタ、カッタ、ガンドリル、又は、グリッパ）を制御するために制御システム（４０）が使用される実施形態を示している。製造機械は、製造された製品（１２）を移動させる搬送装置、例えば、コンベヤベルト又は組み立てラインを含み得る。制御システム（４０）は、アクチュエータ（１０）を制御し、アクチュエータ（１０）が、今度は製造機械（１１）を制御する。 Figure 7 shows an embodiment in which a control system (40) is used to control a manufacturing machine (11) (e.g., a punch cutter, cutter, gun drill, or gripper) of a manufacturing system (200), for example as part of a production line. The manufacturing machine may include a conveying device, for example a conveyor belt or an assembly line, that moves the manufactured product (12). The control system (40) controls an actuator (10), which in turn controls the manufacturing machine (11).

センサ（３０）は、例えば製造された製品（１２）の特性を捕捉する光学センサによって提供可能である。したがって、機械学習システム（６０）は、画像分類器として理解することが可能である。 The sensor (30) can be provided, for example, by an optical sensor that captures characteristics of the manufactured product (12). The machine learning system (60) can therefore be understood as an image classifier.

機械学習システム（６０）は、搬送装置に対する製造された製品（１２）の位置を特定することができる。次いで、製造された製品（１２）の後続の製造工程のために、製造された製品（１２）の特定された位置に依存してアクチュエータ（１０）を制御することができる。例えば、製造された製品をこの製造された製品自体の特定の箇所において切断するように、アクチュエータ（１０）を制御することができる。代替的に、製造された製品が破損しているかどうか、及び／又は、欠陥を示しているかどうかを、機械学習システム（６０）が分類することを想定することができる。その場合、その製造された製品を搬送装置から除去するように、アクチュエータ（１０）を制御することができる。 The machine learning system (60) can determine the position of the manufactured product (12) relative to the conveying device. The actuator (10) can then be controlled depending on the determined position of the manufactured product (12) for a subsequent manufacturing process of the manufactured product (12). For example, the actuator (10) can be controlled to cut the manufactured product at a specific point on the manufactured product itself. Alternatively, it can be envisaged that the machine learning system (60) classifies whether the manufactured product is damaged and/or exhibits a defect. In that case, the actuator (10) can be controlled to remove the manufactured product from the conveying device.

図８は、自動化されたパーソナルアシスタント（２５０）を制御するために制御システム（４０）が使用される実施形態を示している。センサ（３０）は、例えば、ユーザ（２４９）のジェスチャのビデオ画像を受信するための光学センサであるものとしてよい。代替的に、センサ（３０）は、例えば、ユーザ（２４９）の音声コマンドを受信するための音響センサであるものとしてもよい。 Figure 8 shows an embodiment in which the control system (40) is used to control an automated personal assistant (250). The sensor (30) may be, for example, an optical sensor for receiving video images of the gestures of the user (249). Alternatively, the sensor (30) may be, for example, an acoustic sensor for receiving voice commands of the user (249).

The control system (40) then determines a control signal (A) for controlling the automated personal assistant (250). The control signal (A) is determined according to the sensor signal (S) of the sensor (30), which is transmitted to the control system (40). For example, the machine learning system (60) can be configured to execute a gesture recognition algorithm that identifies a gesture performed by the user (249). The control system (40) can then determine the control signal (A) and transmit it to the automated personal assistant (250).

For example, the control signal (A) may be determined according to the user gesture identified by the machine learning system (60). The control signal (A) may include information that causes the automated personal assistant (250) to retrieve information from a database and to output the retrieved information in a form suitable for reception by the user (249).
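The mapping from an identified gesture to a control signal that triggers a database lookup can be sketched as below. The gesture names, the query keys, the in-memory `DATABASE`, and both function names are hypothetical placeholders chosen for illustration only.

```python
from typing import Dict, Optional

# Hypothetical gesture vocabulary and database contents (not from the source).
GESTURE_TO_QUERY: Dict[str, str] = {"swipe_up": "weather", "thumbs_up": "calendar"}
DATABASE: Dict[str, str] = {"weather": "Sunny, 21 C", "calendar": "No meetings today"}


def control_signal_for(gesture: str) -> Optional[dict]:
    """Build a control signal (A) for an identified gesture, or None if unknown."""
    query = GESTURE_TO_QUERY.get(gesture)
    if query is None:
        return None
    return {"command": "retrieve_and_output", "query": query}


def assistant_handle(signal: dict, database: Dict[str, str] = DATABASE) -> str:
    """Assistant side: retrieve the requested information for output to the user."""
    return database.get(signal["query"], "no information found")
```

An unrecognized gesture yields no control signal in this sketch, so the assistant is not triggered by it.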

In a further embodiment, instead of the automated personal assistant (250), the control system (40) may control a home appliance (not shown) according to the identified user gesture. The home appliance may be a washing machine, a stove, an oven, a microwave oven, or a dishwasher.

Figure 9 shows an embodiment in which the control system (40) controls an access control system (300). The access control system (300) can be designed to control access physically and may include, for example, a door (401). The sensor (30) can be configured to detect a scene that is relevant for deciding whether access should be granted. For example, the sensor (30) may be an optical sensor that provides image or video data, for example for detecting a person's face. The machine learning system (60) can therefore be understood as an image classifier.

The machine learning system (60) can be configured to classify the identity of a person, for example by matching the detected face against the faces of known persons stored in a database and thereby determining the person's identity. The control signal (A) can then be determined depending on the classification of the machine learning system (60), for example according to the determined identity. The actuator (10) may be a lock that opens or closes the door depending on the control signal (A). Alternatively, the access control system (300) may be a non-physical, logical access control system. In this case, the control signal can be used to control a display (10a) so that it shows information about the person's identity and/or whether the person should be granted access.
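One common way to match a detected face against stored faces is to compare fixed-length feature vectors by cosine similarity. The source does not specify the matching technique, so the following is only a sketch under that assumption; the function names, the threshold value, and the string-valued control signal are hypothetical.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(embedding: List[float], known: Dict[str, List[float]],
             threshold: float = 0.8) -> Optional[str]:
    """Return the best-matching known identity, or None if no match reaches the threshold."""
    best_name, best_score = None, threshold
    for name, reference in known.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


def control_signal(identity: Optional[str]) -> str:
    """Control signal (A) for the lock: unlock only for a recognized identity."""
    return "unlock" if identity is not None else "keep_locked"
```

The threshold trades false accepts against false rejects; the value 0.8 here is arbitrary and would need to be chosen for the actual feature extractor.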

Figure 10 shows an embodiment in which the control system (40) controls a monitoring system (400). This embodiment is largely identical to the embodiment shown in Figure 9, so only the differing aspects are described in detail. The sensor (30) is configured to detect a scene that is under surveillance. The control system (40) need not necessarily control an actuator (10) but may alternatively control a display (10a). For example, the machine learning system (60) can determine a classification of the scene, e.g., whether the scene detected by the optical sensor (30) is normal or shows an anomaly. The control signal (A) transmitted to the display (10a) can then be configured to cause the display (10a) to adjust the displayed content depending on the determined classification, e.g., to highlight an object that the machine learning system (60) has determined to be anomalous.

Figure 11 shows an embodiment of a medical imaging system (500) controlled by the control system (40). The imaging system may be, for example, an MRI device, an X-ray imaging device, or an ultrasound imaging device. The sensor (30) may be, for example, an imaging sensor that captures at least one image of a patient, e.g., an image showing different types of body tissue of the patient.

The machine learning system (60) can then determine a classification of at least a part of the sensed image. At least a part of the image is therefore used as the input image (x) to the machine learning system (60).

The control signal (A) can then be selected according to this classification, thereby controlling the display (10a). For example, the machine learning system (60) can be configured to detect different types of tissue in the sensed image, e.g., by classifying the tissue shown in the image as either malignant or benign. This can be done by a semantic segmentation of the input image (x) by the machine learning system (60). The control signal (A) can then be determined so as to cause the display (10a) to show the different tissues, e.g., by displaying the input image (x) and coloring different regions of the same tissue type in the same color.
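Coloring all regions of one tissue type in the same color amounts to a per-pixel lookup from class label to color. The sketch below assumes the segmentation is available as a 2D list of integer class labels; the class indices, the palette, and the function name `colorize` are hypothetical.

```python
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]  # (R, G, B)

# Hypothetical label-to-color palette: 0 = background, 1 = benign, 2 = malignant.
PALETTE: Dict[int, Color] = {0: (0, 0, 0), 1: (0, 255, 0), 2: (255, 0, 0)}


def colorize(label_map: List[List[int]],
             palette: Dict[int, Color] = PALETTE) -> List[List[Color]]:
    """Turn a per-pixel class label map into an RGB overlay.

    Every pixel with the same label receives the same color, so separate
    regions of one tissue type are displayed identically.
    """
    return [[palette[label] for label in row] for row in label_map]
```

For example, `colorize([[0, 1], [2, 1]])` gives both pixels labeled 1 the same green value even though they are not adjacent.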

In further embodiments (not shown), the imaging system (500) can be used for non-medical purposes, for example to determine material properties of a workpiece. In these embodiments, the machine learning system (60) can be configured to receive an input image (x) of at least a part of the workpiece and to perform a semantic segmentation of the input image (x), thereby classifying the material properties of the workpiece. The control signal (A) can then be determined so as to cause the display (10a) to show the input image (x) together with information about the detected material properties.

Figure 12 shows an embodiment of a medical analysis system (600) controlled by the control system (40). The medical analysis system (600) is supplied with a microarray (601) comprising a plurality of spots (602, also known as features) that have been exposed to a medical sample. The medical sample may be, for example, a human sample or an animal sample, e.g., obtained from a swab.

The microarray (601) may be a DNA microarray or a protein microarray.

The sensor (30) is configured to sense the microarray (601). The sensor (30) is preferably an optical sensor, such as a video sensor.

The machine learning system (60) is configured to classify a result of the sample based on an input image (x) of the microarray supplied by the sensor (30). In particular, the machine learning system (60) can be configured to determine whether the microarray (601) indicates the presence of a virus in the sample.

The control signal (A) can then be selected such that the display (10a) shows the result of the classification.

The term "computer" may be understood to cover any device for processing predefined calculation rules. These calculation rules may be in the form of software, hardware, or a mixture of software and hardware.

In general, a plurality can be understood to be indexed, i.e., each element of the plurality is assigned a unique index, preferably by assigning consecutive integers to the elements contained in the plurality. Preferably, if a plurality comprises N elements, where N is the number of elements in the plurality, the elements are assigned the integers from 1 to N. It can also be understood that each element of the plurality is accessible via its index.

Claims

1. A computer-implemented method for training an encoder (70) configured to determine a latent representation of an image (x_i), wherein training the encoder comprises:
- determining a latent representation (w) and a noise image (ε) by providing a training image (x_i) to the encoder (70), the encoder (70) being configured to determine a latent representation and a noise image for a provided image;
- determining a masked noise image (ε_m) by masking out a portion (p) of the noise image (ε);
- determining a predicted image by providing the latent representation (w) and the masked noise image (ε_m) to a generator (80) of a generative adversarial network;
- training the encoder (70) by adapting parameters of the encoder (70) based on a loss value, the loss value characterizing a difference between the predicted image and the training image (x_i).

2. The method according to claim 1, wherein masking out the portion (p) of the noise image (ε) comprises replacing values in the portion (p) with randomly drawn values.

3. The method according to claim 1 or 2, wherein the loss value is determined based on a loss function, and a first term of the loss function characterizes a difference between the predicted image and the training image (x_i).

4. The method according to claim 3, wherein the first term further characterizes a masking of the difference, the masking removing from the difference those pixels that fall within the masked-out portion (p).
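Illustrative note, not part of the claims: the masking of claim 2 and the masked difference of claim 4 can be sketched as follows. The rectangular shape of the portion (p), the standard normal distribution for the random values, the mean squared error as the difference measure, and all function names are assumptions made for this sketch; the claims do not prescribe them.

```python
import random
from typing import List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # assumed portion (p): (top, left, height, width)


def mask_noise(noise: Image, box: Box, rng: random.Random) -> Image:
    """Return a copy of the noise image with the portion (p) replaced by random values."""
    top, left, height, width = box
    masked = [row[:] for row in noise]  # copy so the input is left unchanged
    for r in range(top, top + height):
        for c in range(left, left + width):
            masked[r][c] = rng.gauss(0.0, 1.0)  # assumed: standard normal draw
    return masked


def masked_mse(predicted: Image, target: Image, box: Box) -> float:
    """Mean squared difference over pixels outside the masked-out portion (p)."""
    top, left, height, width = box
    total, count = 0.0, 0
    for r, (pred_row, target_row) in enumerate(zip(predicted, target)):
        for c, (p, t) in enumerate(zip(pred_row, target_row)):
            inside = top <= r < top + height and left <= c < left + width
            if not inside:  # pixels inside (p) are removed from the difference
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0
```

Excluding the masked-out pixels from the difference means the generator is not penalized for content in a region where the encoder's noise image was discarded.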

5. The method according to claim 3 or 4, wherein the loss function includes a second term that characterizes a norm of the noise image (ε) predicted by the encoder (70).

6. The method according to any one of claims 3 to 5, wherein the loss function includes a third term that characterizes a negative log-likelihood of an output signal of a discriminator, the output signal being determined by providing the predicted image to the discriminator.

7. The method according to any one of claims 3 to 6, wherein the training image (x_i) is determined by providing a randomly sampled or user-defined latent representation to the generator, and the loss function includes a fourth term that characterizes a difference between the randomly sampled or user-defined latent representation and the latent representation determined by the encoder (70).

8. The method according to any one of claims 3 to 7, wherein the loss function includes a fifth term characterizing a difference between a first feature representation, determined by providing the training image (x_i) to a feature extractor, and a second feature representation, determined by providing the predicted image to the feature extractor, wherein the difference preferably does not characterize features that characterize pixels within the masked-out portion (p).

9. A computer-implemented method for determining an augmentation of an image (b_i), comprising:
- obtaining an encoder (70) that has been trained by the method according to any one of claims 1 to 8;
- determining a first latent representation (w) and a noise image (ε) by providing the image (b_i) to the encoder (70);
- determining a second latent representation by modifying the first latent representation (w);
- determining the augmentation by providing the second latent representation and the noise image (ε) as input to the generator (80) used in training the encoder (70).

10. A computer-implemented method for training a machine learning system (60), the machine learning system being configured to determine an output signal (y) characterizing a classification and/or a regression analysis of an image (x), the method comprising:
- determining an augmentation of a training image (b_i) by the method according to claim 9;
- training the machine learning system (60) based on the augmentation.

11. A computer-implemented method for determining a control signal (A) for an actuator (10), wherein the control signal (A) is determined based on an output signal (y) of a machine learning system (60) trained according to claim 10, and the output signal (y) is determined based on an image (x).

12. A training system (140) configured to carry out the training method according to any one of claims 1 to 8.

13. A control system (40) configured to carry out the method according to claim 11.

14. A computer program configured to cause a computer to carry out all steps of the method according to any one of claims 1 to 11 when executed by a processor (45, 145).

15. A machine-readable storage medium (46, 146) on which the computer program according to claim 14 is stored.