CN115240185A - Three-dimensional self-encoder, training method thereof, electronic device and storage medium - Google Patents


Info

Publication number
CN115240185A
Authority
CN
China
Prior art keywords
dimensional
encoder
training
module
decoder
Prior art date
Legal status
Pending
Application number
CN202210864352.6A
Other languages
Chinese (zh)
Inventor
郝霖
王国权
叶德建
Current Assignee
Zhejiang Qinghe Technology Co ltd
Shanghai Qinghe Technology Co ltd
Original Assignee
Zhejiang Qinghe Technology Co ltd
Shanghai Qinghe Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Qinghe Technology Co ltd, Shanghai Qinghe Technology Co ltd filed Critical Zhejiang Qinghe Technology Co ltd
Priority to CN202210864352.6A priority Critical patent/CN115240185A/en
Priority to PCT/CN2022/120231 priority patent/WO2024016464A1/en
Publication of CN115240185A publication Critical patent/CN115240185A/en
Pending legal-status Critical Current

Classifications

    • G06V20/64 — Scenes; scene-specific elements; type of objects; three-dimensional objects
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/469 — Extraction of image or video features; descriptors for shape, contour or point-related descriptors (e.g. SIFT, bags of words); contour-based spatial representations, e.g. vector-coding
    • G06V10/764 — Image or video recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
    • G06V10/774 — Processing image or video features in feature spaces (e.g. PCA, ICA, SOM); generating sets of training patterns; bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a three-dimensional self-encoder and a training method thereof, a training method of a three-dimensional vision module, an electronic device and a storage medium. The three-dimensional self-encoder comprises an encoder and a decoder; the encoder is used for extracting spatial feature parameters of an input picture to be processed and outputting the spatial feature parameters to the decoder; the decoder is used for outputting, according to the spatial feature parameters, a target picture comprising depth information of the picture to be processed, and the picture to be processed and the target picture are used for determining a loss function of the three-dimensional self-encoder. The method realizes perception of the three-dimensional scene features of an image in a self-supervised manner; massive training sets can be adopted, a large amount of unlabeled raw data can be used for training, labor and cost are saved, and the trained model can be conveniently transferred to other scenes. The neural network design with an added inductive bias for three-dimensional spatial relationships can perceive the interaction with the environment contained in an action, improving recognition accuracy.

Description

Three-dimensional self-encoder, training method thereof, electronic device and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a three-dimensional self-encoder and a training method thereof, a training method of a three-dimensional vision module, an electronic device and a storage medium.
Background
Existing action recognition techniques based on deep learning neural networks are usually trained with supervised learning, in which a loss function is computed directly between the network output and labels, and all data carry labels or ground-truth values. In such methods the training set is relatively small and the scenes covered are limited; the trained model is only applicable to scenes related to the training data set and cannot be conveniently transferred to other scenes; a labeled data set is required for training, correct labeling demands considerable labor and computation cost, and the extracted features depend on the labels rather than on the data itself. Self-supervised learning performs supervised training with self-generated labels, which are generally produced by rotation, cropping, inpainting, contrastive learning and the like, followed by fine-tuning for downstream tasks.
Action recognition, such as fall recognition and object-carrying recognition, generally involves interaction between the human body and surrounding scene objects. With a perceived 3D spatial position structure, the action type can be judged more accurately without being affected by the spatial position of the camera; for example, spatial position information makes it easier to infer whether a person is lying on a chair or has fallen to the ground. 3D GANs (generative adversarial networks) that can perceive 3D structure differ from ordinary 2D GANs in the following two basic components:
1. An inductive bias capable of perceiving 3D structure is introduced into the generator network.
2. A neural network rendering engine is used to provide view-consistent image results.
Among these, the inductive bias for perceiving 3D structure can generally be modeled with an explicit voxel grid or with an implicit neural network.
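For illustration only, the following sketch contrasts the two modeling choices just mentioned: an explicit, learnable voxel grid queried by trilinear interpolation versus an implicit neural network that maps a 3D coordinate to a feature vector. The class names, feature dimensions and resolutions are assumptions for the sketch, not part of the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExplicitVoxelGrid(nn.Module):
    """Explicit modelling: a learnable grid of features queried by trilinear interpolation."""
    def __init__(self, resolution=64, channels=32):
        super().__init__()
        # (1, C, D, H, W) learnable feature volume
        self.grid = nn.Parameter(0.01 * torch.randn(1, channels, resolution, resolution, resolution))

    def forward(self, points):                      # points: (N, 3), coordinates in [-1, 1]
        coords = points.view(1, -1, 1, 1, 3)        # sample locations for grid_sample
        feats = F.grid_sample(self.grid, coords, align_corners=True)   # (1, C, N, 1, 1)
        return feats.view(self.grid.shape[1], -1).t()                  # (N, C)

class ImplicitField(nn.Module):
    """Implicit modelling: an MLP maps a 3D coordinate directly to a feature vector."""
    def __init__(self, channels=32, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, channels),
        )

    def forward(self, points):                      # (N, 3) -> (N, C)
        return self.mlp(points)
```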
Vision Transformer (ViT) is a Transformer-based vision module that can perform machine-vision tasks. Common convolutional neural network modules for machine vision (such as ResNet) tend to saturate once the amount of training data grows large — for example, once the training set reaches about 100 million pictures, adding more training data no longer improves recognition accuracy — whereas the recognition accuracy of ViT keeps improving beyond 100 million pictures, which makes it suitable as a base model for large-scale pre-training. Video Vision Transformer (ViViT) is likewise a Transformer-based vision module: just as ViT uses image tiles for recognition and token partitioning, ViViT uses video tiles for recognition tasks and token partitioning. Vision Transformers can be trained in the MAE (Masked Autoencoders) manner, a form of self-supervised learning, and the resulting models can achieve high accuracy. Similarly, 3D ViViT is a Transformer-based three-dimensional vision module, i.e. one that can perceive three-dimensional features of an image.
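As background only, here is a compact sketch of how a ViT-style module turns an image into tokens and how MAE-style training keeps only a random subset of those tokens for the encoder; the patch size, embedding dimension and mask ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping tiles and project each tile to a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # a strided convolution is the usual way to tokenise image tiles
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                             # (B, 3, H, W)
        tokens = self.proj(x)                         # (B, D, H/P, W/P)
        return tokens.flatten(2).transpose(1, 2)      # (B, N, D)

def random_masking(tokens, mask_ratio=0.75):
    """MAE-style masking: keep a random subset of tokens for the encoder."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    ids_shuffle = torch.rand(B, N, device=tokens.device).argsort(dim=1)   # random permutation
    ids_keep = ids_shuffle[:, :num_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return kept, ids_keep          # the masked-out tokens are reconstructed by the decoder
```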
Disclosure of Invention
The present invention is directed to overcoming the above-mentioned drawbacks of the prior art, and provides a three-dimensional auto-encoder and a training method thereof, a training method of a three-dimensional visual module, an electronic device, and a storage medium.
The invention solves the technical problems through the following technical scheme:
the invention provides a three-dimensional self-encoder, which comprises an encoder and a decoder; the encoder is used for extracting the spatial characteristic parameters of the input picture to be processed and outputting the spatial characteristic parameters to the decoder; the decoder is used for outputting a target picture comprising the depth information of the picture to be processed according to the spatial feature parameters, and the picture to be processed and the target picture are used for determining a loss function of the three-dimensional self-encoder.
Preferably, the decoder comprises a three-dimensional feature extraction module and an upsampling module; the three-dimensional feature extraction module is used for acquiring the depth information of the picture to be processed according to the spatial feature parameters; the up-sampling module is used for generating the target picture according to the depth information and the spatial characteristic parameters processed by the three-dimensional characteristic extraction module.
Preferably, the three-dimensional feature extraction module comprises a mapping network, a two-dimensional generative adversarial network generator and a neural network renderer; the spatial feature parameters comprise camera parameters and a latent code;
the mapping network is used for processing the latent code to match the specification of the picture to be processed and outputting the processed latent code to the two-dimensional generative adversarial network generator and the up-sampling module;
the two-dimensional generative adversarial network generator is used for generating a feature map after feature deformation according to the processed spatial feature parameters and inputting the feature map into the neural network renderer;
and the neural network renderer is used for processing the feature map after feature deformation together with the camera parameters so as to output the depth information of the picture to be processed.
Preferably, the neural network renderer comprises a tri-plane decoder and a voxel rendering module;
the tri-plane decoder is used for processing the feature map after feature deformation to output voxel information to the voxel rendering module;
and the voxel rendering module is used for outputting the depth information of the picture to be processed according to the voxel information and the camera parameters.
Preferably, the tri-plane decoder comprises three fully connected layers connected in sequence; the mapping network also comprises three fully connected layers connected in sequence.
Preferably, the encoder comprises a downsampling convolutional neural network; the downsampling convolutional neural network is used for downsampling the picture to be processed.
Preferably, the encoder further comprises a fully connected layer; the downsampling convolutional neural network comprises a convolutional layer and a pooling layer; the convolutional layer, the pooling layer and the fully connected layer are connected in sequence; and the convolutional layer is implemented based on an attention mechanism plus a ReLU activation function.
Preferably, the three-dimensional feature extraction module is obtained from a feature extraction module of a three-dimensional generative adversarial network, the feature extraction module being configured to extract three-dimensional scene information of the picture to be processed.
The invention also provides a training method of the three-dimensional self-encoder, which is used for training the three-dimensional self-encoder described above; the training method comprises the following steps:
locking the parameters of the decoder of the three-dimensional self-encoder, and training the parameters of the encoder of the three-dimensional self-encoder with a first training data set until convergence, wherein the first training data set is the one used for training to obtain the three-dimensional generative adversarial network;
and unlocking the parameters of the decoder, and continuing to train the parameters of the encoder and the decoder until convergence.
The invention also provides a training method of a three-dimensional vision module, comprising the following steps:
processing a video frame by using the three-dimensional self-encoder described above to obtain three-dimensional spatial features corresponding to the video frame;
dividing the video frame associated with the three-dimensional spatial features into a plurality of video tiles;
assigning a position vector and a time vector to each of the video tiles;
randomly screening out a first preset proportion of the video frames according to the time vectors;
randomly selecting a second preset proportion of video tiles from the remaining video frames as training data and inputting them into the encoder of the three-dimensional vision module, and inputting the video tiles output by the encoder together with tokens corresponding to the screened-out first preset proportion of video frames into the decoder of the three-dimensional vision module, wherein the video tiles are arranged in the order of the time vectors and the position vectors;
training the three-dimensional vision module based on the output data of the decoder of the three-dimensional vision module and the video frame until convergence.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the above training method of the three-dimensional vision module when executing the computer program.
The invention also provides a computer-readable storage medium on which a computer program is stored which, when executed by a processor, carries out the training method of the three-dimensional vision module described above.
The positive effects of the invention are as follows: by providing the three-dimensional self-encoder and its training method, the training method of the three-dimensional vision module, the electronic device and the storage medium, perception of the three-dimensional scene features of an image can be realized in a self-supervised manner; massive training sets covering more scenes can be adopted, no labeled data set is needed for training, a large amount of unlabeled raw data can be used instead, labor and cost are saved, and the trained model can be conveniently transferred to other scenes. The neural network design with an added inductive bias for three-dimensional spatial relationships can perceive the interaction with the environment contained in an action, improving recognition accuracy.
Drawings
Fig. 1 is a schematic structural diagram of a three-dimensional self-encoder according to embodiment 1 of the present invention.
Fig. 2 is a schematic structural diagram of a decoder of a three-dimensional self-encoder according to embodiment 1 of the present invention.
Fig. 3 is a schematic structural diagram of an encoder of the three-dimensional self-encoder according to embodiment 1 of the present invention.
Fig. 4 is a flowchart of a training method of a three-dimensional self-encoder according to embodiment 2 of the present invention.
Fig. 5 is a flowchart of a method for training a three-dimensional vision module according to embodiment 3 of the present invention.
Fig. 6 is a schematic processing flow diagram for sensing three-dimensional scene characteristics of a video frame in embodiment 3 of the present invention.
Fig. 7 is a schematic structural diagram of an encoder of the three-dimensional vision module in embodiment 3 of the present invention.
Fig. 8 is a schematic structural diagram of a decoder of the three-dimensional vision module in embodiment 3 of the present invention.
Fig. 9 is a block diagram of an electronic device according to embodiment 4 of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the invention thereto.
Example 1
Referring to fig. 1-3, the present embodiment provides a three-dimensional self-encoder, which comprises an encoder and a decoder; the encoder is used for extracting spatial feature parameters of an input picture to be processed and outputting the spatial feature parameters to the decoder; the decoder is used for outputting, according to the spatial feature parameters, a target picture comprising depth information of the picture to be processed, and the picture to be processed and the target picture are used for determining a loss function of the three-dimensional self-encoder.
The present embodiment provides a three-dimensional self-encoder constructed in the encoder-decoder pattern; as shown in fig. 1, it comprises one encoder and one decoder. The three-dimensional feature extraction module is obtained from the feature extraction module of a three-dimensional generative adversarial network, which is used to extract three-dimensional scene information of the picture to be processed. The three-dimensional generative adversarial network is a GAN capable of perceiving 3D structure: an inductive bias capable of perceiving 3D structure is introduced into the generator network, a neural network rendering engine provides view-consistent image results, and a feature map corresponding to the 3D structure can be produced from random features. The inductive bias for perceiving 3D structure can generally be modeled with an explicit voxel grid or with an implicit neural network.
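A minimal sketch of how the encoder and decoder are wired together and how the loss is determined from the picture to be processed and the target picture. The `Encoder` and decoder modules stand for the components described below; the L1 reconstruction loss is an illustrative assumption rather than the patent's specified loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeDAutoEncoder(nn.Module):
    """Encoder predicts spatial feature parameters (camera parameters + latent code);
    the GAN-based decoder renders a target picture carrying depth information."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder     # e.g. the downsampling CNN sketched later in this embodiment
        self.decoder = decoder     # e.g. the generator of a pretrained 3D GAN

    def forward(self, picture):
        camera_params, latent_code = self.encoder(picture)   # spatial feature parameters
        target = self.decoder(latent_code, camera_params)    # same specification as the input picture
        return target

def reconstruction_loss(picture, target):
    # the picture to be processed and the target picture determine the loss function
    return F.l1_loss(target, picture)
```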
As a preferred embodiment, referring to fig. 2, the decoder comprises a three-dimensional feature extraction module and an up-sampling module; the three-dimensional feature extraction module is used for acquiring the depth information of the picture to be processed according to the spatial feature parameters; the up-sampling module is used for generating the target picture according to the depth information and the spatial feature parameters processed by the three-dimensional feature extraction module. Up-sampling, i.e. enlarging an image to a higher resolution, is usually implemented by deconvolution, unpooling, bilinear interpolation and the like. In this embodiment the image is enlarged by the up-sampling module to the same specification as the picture to be processed, so that the loss function can then be calculated by comparison.
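A sketch of an up-sampling module of the kind described above, assuming bilinear interpolation followed by a refining convolution; the scale factor and channel counts are illustrative assumptions.

```python
import torch.nn as nn

class UpsampleModule(nn.Module):
    """Enlarge a low-resolution rendering back to the specification of the picture to be processed."""
    def __init__(self, in_chans=32, out_chans=3, scale=4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode='bilinear', align_corners=False)
        self.refine = nn.Conv2d(in_chans, out_chans, kernel_size=3, padding=1)

    def forward(self, x):                    # (B, C, h, w)
        return self.refine(self.up(x))       # (B, out_chans, h * scale, w * scale)
```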
Preferably, the three-dimensional feature extraction module comprises a mapping network, a two-dimensional generative adversarial network generator and a neural network renderer; the spatial feature parameters comprise camera parameters and a latent code. The mapping network processes the latent code to match the specification of the picture to be processed and outputs the processed latent code to the two-dimensional generative adversarial network generator and the up-sampling module; the two-dimensional generative adversarial network generator generates a feature map after feature deformation according to the processed spatial feature parameters and inputs it into the neural network renderer; and the neural network renderer processes the feature map after feature deformation together with the camera parameters so as to output the depth information of the picture to be processed.
The neural network renderer comprises a tri-plane decoder and a voxel rendering module; the tri-plane decoder processes the feature map after feature deformation to output voxel information to the voxel rendering module; and the voxel rendering module outputs the depth information of the picture to be processed according to the voxel information and the camera parameters. Optionally, the tri-plane decoder comprises three fully connected layers connected in sequence; the mapping network may likewise comprise three fully connected layers connected in sequence.
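The following sketch illustrates the preferred structure above: a mapping network and a tri-plane decoder of three fully connected layers each, plus a simplified voxel (volume) rendering step that composites densities and colours along camera rays into a per-ray colour and an expected depth. All dimensions and the sampling scheme are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Three fully connected layers that process the latent code."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, z):
        return self.net(z)

class TriPlaneDecoder(nn.Module):
    """Three fully connected layers mapping aggregated tri-plane features
    to voxel information (density + colour) for the voxel rendering module."""
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 3),                  # density + RGB per sample point
        )

    def forward(self, plane_features):                 # (R, S, feat_dim): R rays, S samples per ray
        out = self.net(plane_features)
        return out[..., :1], out[..., 1:]              # density (R, S, 1), colour (R, S, 3)

def composite(density, colour, t_vals):
    """Simplified volume rendering along rays: returns per-ray colour and expected depth."""
    deltas = t_vals[:, 1:] - t_vals[:, :-1]                                    # (R, S-1)
    deltas = torch.cat([deltas, torch.full_like(deltas[:, :1], 1e10)], dim=1)  # pad last sample
    alpha = 1.0 - torch.exp(-torch.relu(density.squeeze(-1)) * deltas)         # (R, S)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans                                                    # (R, S)
    rgb = (weights.unsqueeze(-1) * colour).sum(dim=1)                          # (R, 3)
    depth = (weights * t_vals).sum(dim=1)                                      # (R,) depth information
    return rgb, depth
```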
As a preferred embodiment, referring to fig. 3, the encoder comprises a downsampling convolutional neural network used for downsampling the picture to be processed. Downsampling, i.e. reducing an image, mainly generates a thumbnail of the image so that it fits the size of the display area; it is usually implemented by pooling layers or convolutional layers in a convolutional neural network, the difference being that reduction by convolution serves to extract features while reduction by pooling serves to reduce the dimensionality of the features. Preferably, the encoder further comprises a fully connected layer; the downsampling convolutional neural network comprises a convolutional layer and a pooling layer; the convolutional layer, the pooling layer and the fully connected layer are connected in sequence; and the convolutional layer is implemented based on an attention mechanism plus a ReLU activation function.
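A sketch of the encoder described above: a convolutional stage implemented with an attention mechanism plus ReLU, a pooling layer, and a fully connected layer whose output is split into the camera parameters and the latent code. The channel-attention form (squeeze-and-excitation style), the channel counts and the 25-dimensional camera parameter vector are assumptions, not the patent's specification.

```python
import torch
import torch.nn as nn

class AttentionConvBlock(nn.Module):
    """Convolution followed by a simple channel-attention gate and a ReLU activation."""
    def __init__(self, in_chans, out_chans):
        super().__init__()
        self.conv = nn.Conv2d(in_chans, out_chans, kernel_size=3, stride=2, padding=1)
        self.attn = nn.Sequential(                      # squeeze-and-excitation style gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_chans, out_chans // 4, 1), nn.ReLU(),
            nn.Conv2d(out_chans // 4, out_chans, 1), nn.Sigmoid(),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.conv(x)
        return self.act(x * self.attn(x))

class Encoder(nn.Module):
    """Downsampling CNN: convolutional layers (attention + ReLU) -> pooling -> fully connected layer.
    Outputs the camera parameters and the latent code as the spatial feature parameters."""
    def __init__(self, latent_dim=512, cam_dim=25):
        super().__init__()
        self.features = nn.Sequential(
            AttentionConvBlock(3, 64),
            AttentionConvBlock(64, 128),
            AttentionConvBlock(128, 256),
            nn.AdaptiveAvgPool2d(1),                    # pooling layer
        )
        self.fc = nn.Linear(256, latent_dim + cam_dim)  # fully connected layer
        self.latent_dim = latent_dim

    def forward(self, picture):                          # (B, 3, H, W)
        h = self.features(picture).flatten(1)            # (B, 256)
        out = self.fc(h)
        return out[:, self.latent_dim:], out[:, :self.latent_dim]   # camera_params, latent_code
```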
The three-dimensional self-encoder of this embodiment makes full use of the scene-feature perception advantage of 3D GANs to process images, so that video frames containing scene features can be obtained for subsequent self-supervised training. Massive training sets covering more scenes can be adopted, no labeled data set is needed for training, a large amount of unlabeled raw data can be used instead, labor and cost are saved, and the trained model can be conveniently transferred to other scenes. The neural network design with an added inductive bias for three-dimensional spatial relationships can perceive the interaction with the environment contained in an action, improving recognition accuracy.
Example 2
Referring to fig. 4, this embodiment specifically provides a training method for a three-dimensional self-encoder, which is used for training the three-dimensional self-encoder in embodiment 1; the training method comprises the following steps:
S0, training a three-dimensional generative adversarial network (3D GAN) with a first training data set (for example, more than 7 thousand RGB pictures), and extracting the generator of the trained 3D GAN as the decoder of the three-dimensional self-encoder.
S1, locking the parameters of the decoder of the three-dimensional self-encoder, and training the parameters of the encoder of the three-dimensional self-encoder with the first training data set until convergence; wherein the first training data set is the one used for training to obtain the three-dimensional generative adversarial network.
S2, unlocking the parameters of the decoder, and continuing to train the parameters of the encoder and the decoder until convergence.
The first training data set may comprise more than 7 thousand RGB (primary-color) pictures, which are also used to train the three-dimensional self-encoder.
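A sketch of the two training stages S1 and S2 above: the decoder parameters are first frozen while the encoder is trained on the first training data set, and the decoder is then unfrozen so that encoder and decoder are trained together. The optimizer, learning rate, epoch count and L1 reconstruction loss are illustrative assumptions, and `autoencoder` refers to the module sketched in embodiment 1.

```python
import torch
import torch.nn.functional as F

def train_stage(model, loader, train_decoder, epochs=10, lr=1e-4):
    """S1: lock (freeze) the decoder; S2: unlock it and train encoder and decoder jointly."""
    for p in model.decoder.parameters():
        p.requires_grad = train_decoder
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):                              # "until convergence" in the embodiment
        for picture in loader:                           # first training data set (RGB pictures)
            target = model(picture)
            loss = F.l1_loss(target, picture)            # illustrative reconstruction loss
            opt.zero_grad()
            loss.backward()
            opt.step()

# train_stage(autoencoder, first_dataset_loader, train_decoder=False)  # S1: encoder only
# train_stage(autoencoder, first_dataset_loader, train_decoder=True)   # S2: encoder + decoder
```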
The training method of this embodiment is designed around the composition of the three-dimensional self-encoder: the encoder is trained first and the three-dimensional self-encoder is then trained as a whole, so that a three-dimensional self-encoder meeting the requirements can be obtained. The three-dimensional self-encoder uses the scene-feature perception advantage of 3D GANs to process images, and the resulting video frames containing scene features can be used for subsequent self-supervised training. Massive training sets covering more scenes can be adopted, no labeled data set is needed for training, a large amount of unlabeled raw data can be used instead, labor and cost are saved, and the trained model can be conveniently transferred to other scenes. The neural network design with an added inductive bias for three-dimensional spatial relationships can perceive the interaction with the environment contained in an action, improving recognition accuracy.
Example 3
Referring to fig. 5, this embodiment specifically provides a training method of a three-dimensional vision module, comprising the following steps:
S101, processing a video frame with the three-dimensional self-encoder described above to obtain three-dimensional spatial features corresponding to the video frame;
S102, dividing the video frame associated with the three-dimensional spatial features into a plurality of video tiles;
S103, assigning a position vector and a time vector to each of the video tiles;
S104, randomly screening out a first preset proportion of the video frames according to the time vectors;
S105, randomly selecting a second preset proportion of video tiles from the remaining video frames as training data and inputting them into the encoder of the three-dimensional vision module, and inputting the video tiles output by the encoder together with tokens corresponding to the screened-out first preset proportion of video frames into the decoder of the three-dimensional vision module, the video tiles being arranged in the order of the time vectors and the position vectors;
S106, training the three-dimensional vision module based on the output data of the decoder of the three-dimensional vision module and the video frame until convergence.
In this embodiment, the training data set may consist of 4 million short video files on the order of 10 seconds each. Based on the above steps, those skilled in the art will appreciate that the accuracy of action recognition can be improved by performing three-dimensional scene perception on the video frame images using the processing flow shown in fig. 6. Specifically, a three-dimensional generative adversarial network is trained, its feature extraction module is taken as the decoder of the three-dimensional self-encoder, and the three-dimensional self-encoder is trained to process video frames and used to pre-train the three-dimensional vision module 3D ViViT.
The encoder and decoder of the three-dimensional vision module of this embodiment are shown in fig. 7 and fig. 8, respectively. A three-dimensional video block (3D tubelet) is converted into tokens; the token embedding combines the position features with a spatial encoding and the time features with a temporal encoding, and then outputs a latent (implicitly expressed) representation of the video block.
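A sketch of the tokenisation step shown in fig. 7: a 3D convolution turns the video into tubelet tokens, each token is given a time vector and a position vector, and a proportion of the frame groups is randomly screened out along the time axis before the encoder. The tubelet size, embedding dimension and masking ratio are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    """Convert a video clip into 3D tubelet tokens with added time and position vectors."""
    def __init__(self, frames=16, img_size=224, tube=(2, 16, 16), in_chans=3, dim=768):
        super().__init__()
        t, h, w = frames // tube[0], img_size // tube[1], img_size // tube[2]
        self.proj = nn.Conv3d(in_chans, dim, kernel_size=tube, stride=tube)
        self.time_embed = nn.Parameter(torch.zeros(1, t, 1, dim))     # time vector per frame group
        self.pos_embed = nn.Parameter(torch.zeros(1, 1, h * w, dim))  # position vector per tile

    def forward(self, video):                       # (B, 3, T, H, W)
        x = self.proj(video)                        # (B, D, t, h, w)
        B, D, t, _, _ = x.shape
        x = x.flatten(3).permute(0, 2, 3, 1)        # (B, t, h*w, D)
        x = x + self.time_embed + self.pos_embed    # add time and position vectors
        return x.reshape(B, -1, D)                  # tokens ordered by time, then position

def screen_out_frames(tokens, num_frame_groups, frame_ratio=0.5):
    """Randomly screen out a proportion of frame groups (time axis) before the encoder."""
    B, N, D = tokens.shape
    tiles_per_group = N // num_frame_groups
    keep = torch.rand(num_frame_groups).argsort()[: int(num_frame_groups * (1 - frame_ratio))]
    idx = (keep.unsqueeze(1) * tiles_per_group + torch.arange(tiles_per_group)).flatten().sort().values
    return tokens[:, idx, :], idx                   # kept tokens and their indices
```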
It will be appreciated that after the pre-training described above, the three-dimensional vision module can be fine-tuned with other, smaller data sets. Fine-tuning freezes part of the convolutional layers of the pre-trained model (usually most of the layers close to the input) and trains the remaining convolutional layers (usually the layers close to the output) and the fully connected layers, or even freezes no layer at all; in this way the parameters of the layers before the last one are finely adjusted using a known network structure and known network parameters, the generalization ability of the deep neural network is exploited effectively, and complicated model design and time-consuming training are avoided.
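A small sketch of the fine-tuning strategy just described: freeze the child modules closest to the input of the pre-trained model and pass only the remaining parameters to the optimizer. The attribute layout and the hyper-parameters in the usage comment are assumptions.

```python
import torch
import torch.nn as nn

def prepare_fine_tuning(model: nn.Module, num_frozen: int):
    """Freeze the first `num_frozen` child modules (those closest to the input);
    the remaining layers and the fully connected head stay trainable."""
    for layer in list(model.children())[:num_frozen]:
        for p in layer.parameters():
            p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]

# optimizer = torch.optim.AdamW(prepare_fine_tuning(vivit_model, num_frozen=10), lr=1e-5)
```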
The training method of the three-dimensional vision module can realize perception of the three-dimensional scene features of an image in a self-supervised manner; massive training sets covering more scenes can be adopted, no labeled data set is needed for training, a large amount of unlabeled raw data can be used instead, labor and cost are saved, and the trained model can be conveniently transferred to other scenes. The neural network design with an added inductive bias for three-dimensional spatial relationships can perceive the interaction with the environment contained in an action, improving recognition accuracy.
Example 4
Referring to fig. 9, the present embodiment provides an electronic device 30, which comprises a processor 31, a memory 32, and a computer program stored on the memory 32 and executable on the processor 31; when executing the program, the processor 31 implements the training method of the three-dimensional self-encoder of embodiment 2 and/or the training method of the three-dimensional vision module of embodiment 3. The electronic device 30 shown in fig. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
The electronic device 30 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 30 may include, but are not limited to: the at least one processor 31, the at least one memory 32, and a bus 33 connecting the various system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus, and a control bus.
The memory 32 may include volatile memory, such as Random Access Memory (RAM) 321 and/or cache memory 322, and may further include Read Only Memory (ROM) 323.
Memory 32 may also include a program/utility 325 having a set (at least one) of program modules 324, such program modules 324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.
The processor 31 executes various functional applications and data processing, such as a training method of the three-dimensional self-encoder in embodiment 2 and/or a training method of the three-dimensional visual module in embodiment 3 of the present invention, by running a computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34 (e.g. a keyboard, a pointing device, etc.). Such communication may take place through input/output (I/O) interfaces 35. The electronic device 30 may also communicate with one or more networks (e.g. a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) via a network adapter 36. The network adapter 36 communicates with the other modules of the electronic device 30 via the bus 33. Other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives and data backup storage systems.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units/modules described above may be embodied in a single unit/module; conversely, the features and functions of one unit/module described above may be further divided among and embodied by a plurality of units/modules.
Example 5
The present embodiment provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the training method of the three-dimensional self-encoder in embodiment 2 and/or the training method of the three-dimensional visual module in embodiment 3.
More specific examples that may be employed by the readable storage medium include, but are not limited to: a portable disk, hard disk, random access memory, read only memory, erasable programmable read only memory, optical storage device, magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the invention can also be implemented as a program product comprising program code which, when the program product is run on a terminal device, causes the terminal device to perform the training method of the three-dimensional self-encoder in embodiment 2 and/or the training method of the three-dimensional vision module in embodiment 3.
The program code for carrying out the invention may be written in any combination of one or more programming languages and may execute entirely on the user's device, partly on the user's device, as a stand-alone software package, partly on the user's device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be understood by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes or modifications to these embodiments may be made by those skilled in the art without departing from the principle and spirit of this invention, and these changes and modifications are within the scope of this invention.

Claims (12)

1. A three-dimensional self-encoder, comprising an encoder and a decoder; the encoder is used for extracting spatial feature parameters of an input picture to be processed and outputting the spatial feature parameters to the decoder; the decoder is used for outputting, according to the spatial feature parameters, a target picture comprising depth information of the picture to be processed, and the picture to be processed and the target picture are used for determining a loss function of the three-dimensional self-encoder.
2. The three-dimensional self-encoder according to claim 1, wherein the decoder comprises a three-dimensional feature extraction module and an upsampling module; the three-dimensional feature extraction module is used for acquiring the depth information of the picture to be processed according to the spatial feature parameters; the up-sampling module is used for generating the target picture according to the depth information and the spatial feature parameters processed by the three-dimensional feature extraction module.
3. The three-dimensional self-encoder according to claim 2, wherein the three-dimensional feature extraction module comprises a mapping network, a two-dimensional generative adversarial network generator and a neural network renderer; the spatial feature parameters comprise camera parameters and a latent code;
the mapping network is used for processing the latent code to match the specification of the picture to be processed and outputting the processed latent code to the two-dimensional generative adversarial network generator and the up-sampling module;
the two-dimensional generative adversarial network generator is used for generating a feature map after feature deformation according to the processed spatial feature parameters and inputting the feature map into the neural network renderer;
and the neural network renderer is used for processing the feature map after feature deformation together with the camera parameters so as to output the depth information of the picture to be processed.
4. The three-dimensional self-encoder according to claim 3, wherein the neural network renderer comprises a tri-plane decoder and a voxel rendering module;
the tri-plane decoder is used for processing the feature map after feature deformation to output voxel information to the voxel rendering module;
and the voxel rendering module is used for outputting the depth information of the picture to be processed according to the voxel information and the camera parameters.
5. The three-dimensional self-encoder according to claim 4, wherein the tri-plane decoder comprises three fully connected layers connected in sequence; the mapping network comprises three fully connected layers connected in sequence.
6. The three-dimensional self-encoder according to claim 2, wherein the encoder comprises a downsampling convolutional neural network; the downsampling convolutional neural network is used for downsampling the picture to be processed.
7. The three-dimensional self-encoder according to claim 6, wherein the encoder further comprises a fully connected layer; the downsampling convolutional neural network comprises a convolutional layer and a pooling layer; the convolutional layer, the pooling layer and the fully connected layer are connected in sequence; and the convolutional layer is implemented based on an attention mechanism plus a ReLU activation function.
8. The three-dimensional self-encoder according to any one of claims 2-7, wherein the three-dimensional feature extraction module is obtained from a feature extraction module of a three-dimensional generative adversarial network, the feature extraction module being configured to extract three-dimensional scene information of the picture to be processed.
9. A training method of a three-dimensional self-encoder, for training the three-dimensional self-encoder according to claim 8, the training method comprising the following steps:
locking the parameters of the decoder of the three-dimensional self-encoder, and training the parameters of the encoder of the three-dimensional self-encoder with a first training data set until convergence, wherein the first training data set is the one used for training to obtain the three-dimensional generative adversarial network;
and unlocking the parameters of the decoder, and continuing to train the parameters of the encoder and the decoder until convergence.
10. A training method of a three-dimensional vision module, comprising the following steps:
processing a video frame by using the three-dimensional self-encoder of claim 8 to obtain three-dimensional spatial features corresponding to the video frame;
dividing the video frame associated with the three-dimensional spatial features into a plurality of video tiles;
assigning a position vector and a time vector to each of the video tiles;
randomly screening out a first preset proportion of the video frames according to the time vectors;
randomly selecting a second preset proportion of video tiles from the remaining video frames as training data and inputting them into the encoder of the three-dimensional vision module, and inputting the video tiles output by the encoder together with tokens corresponding to the screened-out first preset proportion of video frames into the decoder of the three-dimensional vision module, wherein the video tiles are arranged in the order of the time vectors and the position vectors;
training the three-dimensional vision module based on the output data of the decoder of the three-dimensional vision module and the video frame until convergence.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of training the three-dimensional self-encoder of claim 9 and/or the method of training the three-dimensional vision module of claim 10 when executing the computer program.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of training a three-dimensional self-encoder according to claim 9 and/or the method of training a three-dimensional visual module according to claim 10.
CN202210864352.6A 2022-07-21 2022-07-21 Three-dimensional self-encoder, training method thereof, electronic device and storage medium Pending CN115240185A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210864352.6A CN115240185A (en) 2022-07-21 2022-07-21 Three-dimensional self-encoder, training method thereof, electronic device and storage medium
PCT/CN2022/120231 WO2024016464A1 (en) 2022-07-21 2022-09-21 Three-dimensional auto-encoder and training method therefor, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210864352.6A CN115240185A (en) 2022-07-21 2022-07-21 Three-dimensional self-encoder, training method thereof, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN115240185A true CN115240185A (en) 2022-10-25

Family

ID=83675628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210864352.6A Pending CN115240185A (en) 2022-07-21 2022-07-21 Three-dimensional self-encoder, training method thereof, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN115240185A (en)
WO (1) WO2024016464A1 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628216A (en) * 2021-08-11 2021-11-09 北京百度网讯科技有限公司 Model training method, image segmentation method, device and related products
CN114418069A (en) * 2022-01-19 2022-04-29 腾讯科技(深圳)有限公司 Method and device for training encoder and storage medium

Also Published As

Publication number Publication date
WO2024016464A1 (en) 2024-01-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination