WO2021004738A1

WO2021004738A1 - Device and method for training a neural network

Info

Publication number: WO2021004738A1
Application number: PCT/EP2020/066728
Authority: WO
Inventors: Konrad Groh; Matthias Woehrle
Original assignee: Robert Bosch Gmbh
Priority date: 2019-07-09
Filing date: 2020-06-17
Publication date: 2021-01-14
Also published as: DE102019210091A1; CN114041144A

Abstract

A device and a method for training a neural network are disclosed, the method for training a neural network comprising: training a first neural sub-network with first digital training data (302), which describe a first context, the first neural sub-network being designed as an autoencoder network and having a first encoder portion (306) and a first decoder portion, and the first encoder portion (306) providing a mapping of the first digital training data (302) to a first latent space (308); training a first mapping (404) of first digital data (402), which are semantically related to the first digital training data (302), to the first latent space (308) using the first digital training data (302) mapped to the first latent space (308) by means of the trained first neural sub-network; training a second neural sub-network with second digital training data (322), which describe a second context different from the first context, the second neural sub-network being designed as an autoencoder network and having a second encoder portion and a second decoder portion (330), and the second encoder portion providing a mapping of the second digital training data (322) to a second latent space; training a second mapping (424) of second digital data (422), which are semantically related to the second digital training data (322), to the second latent space (328), using the second digital training data (322) mapped to the second latent space (328) by means of the trained second neural sub-network; training a third mapping (502) of digital latent data from the first latent space (308) to the second latent space (328) using third digital training data and third digital data, the third digital training data comprising digital training data describing the first context and digital training data describing the second context, the third digital data comprising digital data semantically related to the digital training data describing the first context and digital data semantically related to the digital training data describing the second context.

Description


Device and method for training a neural network

Various embodiments generally relate to a device and a method for training a neural network.

Neural networks are used, for example, to generate output data based on input data and on a function learned by training the neural network. The desired output data can depend on the context in which the input data were generated. It may therefore be necessary to transform data that describe a first context into data that describe a second context. The context is important, for example, for recognizing the intentions of road users.

Suwajanakorn et al., "Synthesizing Obama: Learning Lip Sync from Audio", ACM Transactions on Graphics, Vol. 36, No. 4, 2017, describes a method for transforming video data by means of a neural network.

The method and the device having the features of independent claims 1 (first example) and 9 (thirty-second example) make it possible to train a neural network to transform digital data from a first context into a second context.

The context of the digital data, for example of the first digital training data and the second digital training data, can differ in terms of the context in which the digital data were generated, i.e. in which context or under which boundary conditions the digital data were generated, and/or in terms of the intrinsic context of the digital data, for example which environment is described by the digital data. In various embodiments, the context, i.e. the first context and the second context, can differ territorially. For example, the district, the region or the country can differ. The context can differ in terms of language and/or in terms of the facial expressions and gestures associated with a language and/or a region or country. The context can thus differ culturally, i.e. in terms of territory, language, facial expressions, gestures, etc. According to one example, the digital data are digital image data, and the context can differ territorially, in that the digital image data were generated in different countries, and/or intrinsically, in that the digital image data differ in the gestures and facial expressions accompanying a text spoken by a person (i.e. the digital image data also differ with regard to the language-dependent movements of the face).

The language can, for example, also differ for the digital data that are semantically related to the digital data describing a first context or a second context. The digital data that are semantically related to the digital data describing the first context or the second context can be, for example, text data having a plurality of text strings, and the semantic relationship can mean that exactly one text string of the plurality of text strings is assigned to each datum of the digital data describing the first context or the second context. In other words, digital text data that describe the digital data can be assigned to digital data that describe a context. The digital text data can thus illustratively describe the content of the digital data, and they can contain additional information relating to the digital data. For example, the digital data can be digital image data representing a scene, and the digital text data can describe the scene.

The first mapping, the second mapping and the third mapping can each comprise a neural sub-network. Each of these neural sub-networks can be any neural network, for example an autoencoder network or a convolutional neural network. Each neural sub-network, i.e. including the first neural sub-network and the second neural sub-network, can have any number of layers and can be trained using any method, such as backpropagation. Each encoder section of an autoencoder network can have any number of encoder layers, where each encoder layer can have a convolutional layer with any properties (for example any filter size), an activation function (for example a ReLU activation function), a pooling layer with any properties (for example a max pooling layer with any stride) and a normalization layer. Each decoder section of an autoencoder network can have any number of decoder layers, where each decoder layer can have a transposed convolutional layer with any properties, a convolutional layer with any properties, an activation function (for example a ReLU activation function) and a normalization layer.
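The layer types listed above can be illustrated with a minimal one-dimensional sketch in plain Python. It assumes a single channel, an unpadded ("valid") convolution, max pooling with size and stride 2, and mean/variance normalization; the helper names and kernel values are hypothetical and not taken from the patent, which leaves all layer properties open.

```python
import math

def conv1d(x, kernel, stride=1):
    """Unpadded ('valid') 1-D cross-correlation, as most deep learning frameworks implement convolution."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(x) - k + 1, stride)]

def relu(x):
    return [max(0.0, v) for v in x]

def max_pool1d(x, size=2, stride=2):
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]

def normalize(x, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def conv_transpose1d(x, kernel, stride=2):
    """Transposed convolution: each input value adds a scaled copy of the kernel to the output."""
    k = len(kernel)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, v in enumerate(x):
        for j in range(k):
            out[i * stride + j] += v * kernel[j]
    return out

def encoder_layer(x, kernel):
    # convolution -> ReLU -> max pooling -> normalization (shortens the signal)
    return normalize(max_pool1d(relu(conv1d(x, kernel))))

def decoder_layer(z, up_kernel, kernel):
    # transposed convolution -> convolution -> ReLU -> normalization (lengthens the signal)
    return normalize(relu(conv1d(conv_transpose1d(z, up_kernel), kernel)))

x = [float(i % 5) for i in range(16)]               # toy one-channel input of length 16
z = encoder_layer(x, [1.0, 0.0, -1.0])              # length 7
y = decoder_layer(z, [0.5, 1.0, 1.0, 0.5], [1.0])   # length 16 again
```

Stacking several such layers and choosing the kernels by training would give the encoder and decoder sections; here the kernels are fixed only to show how each layer changes the signal length.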

At least a part of the first neural sub-network can be implemented by one or more processors. At least part of the first mapping can be implemented by one or more processors. At least a part of the second neural sub-network can be implemented by one or more processors. At least part of the second mapping can be implemented by one or more processors. At least part of the third mapping can be implemented by one or more processors. The features described in this paragraph in combination with the first example form a second example.

The first digital training data and the second digital training data can include digital image data. The feature described in this paragraph in combination with the first example or the second example forms a third example.

The first neural sub-network can be trained by the first decoder section reconstructing the first digital training data mapped into the first latent space by the first encoder section, and by comparing the reconstructed first digital training data with the first digital training data. The features described in this paragraph in combination with one or more of the first example to the third example form a fourth example.

Comparing the reconstructed first digital training data with the first digital training data can include determining a first loss value. The first loss value can be determined based on a loss function. The features described in this paragraph in combination with the fourth example form a fifth example.

The training of the first neural sub-network can include adapting the first encoder section and the first decoder section, wherein the adaptation of the first encoder section and the first decoder section can include minimizing the first loss value. This means that the first encoder section of the trained first neural sub-network can output a code that describes, in a first latent space, digital data describing a first context, and that the first decoder section of the trained first neural sub-network can process a code that describes, in the first latent space, digital data describing a first context, and can output digital data based on that code. The features described in this paragraph in combination with the fifth example form a sixth example.
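As a minimal sketch of this reconstruction training, assuming a purely linear encoder and decoder (no convolutions), a mean squared error as the loss function and plain per-sample gradient descent, the first loss value can be minimized as follows. The function `train_autoencoder` and all hyperparameters are illustrative assumptions; the same routine would apply unchanged to the second neural sub-network with the second digital training data.

```python
import random

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def train_autoencoder(data, code_dim, lr=0.05, epochs=200, seed=0):
    """Linear autoencoder: enc maps R^d -> R^code_dim, dec maps back to R^d."""
    rng = random.Random(seed)
    d = len(data[0])
    enc = [[rng.gauss(0, 0.3) for _ in range(d)] for _ in range(code_dim)]
    dec = [[rng.gauss(0, 0.3) for _ in range(code_dim)] for _ in range(d)]
    history = []                                   # mean loss value per epoch
    for _ in range(epochs):
        total = 0.0
        for x in data:
            z = matvec(enc, x)                     # code in the latent space
            x_hat = matvec(dec, z)                 # reconstructed training datum
            total += mse(x_hat, x)                 # loss value for this datum
            g = [2 * (p - q) / d for p, q in zip(x_hat, x)]          # dLoss/dx_hat
            gz = [sum(dec[i][j] * g[i] for i in range(d)) for j in range(code_dim)]
            for i in range(d):                     # adapt the decoder section
                for j in range(code_dim):
                    dec[i][j] -= lr * g[i] * z[j]
            for j in range(code_dim):              # adapt the encoder section
                for i in range(d):
                    enc[j][i] -= lr * gz[j] * x[i]
        history.append(total / len(data))
    return enc, dec, history

rng = random.Random(1)
samples = []
for _ in range(20):
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    samples.append([a, b, a + b, a - b])           # 4-D toy data in a 2-D subspace
enc, dec, history = train_autoencoder(samples, code_dim=2)
```

The toy data lie in a two-dimensional subspace of a four-dimensional space, so a two-dimensional code can in principle reconstruct them, and the recorded loss history should fall as training proceeds.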

The second neural sub-network can be trained by the second decoder section reconstructing the second digital training data mapped into the second latent space by the second encoder section and comparing the reconstructed second digital training data with the second digital training data. The features described in this paragraph in combination with one or more of the first example to the sixth example form a seventh example.

Comparing the reconstructed second digital training data with the second digital training data can include determining a second loss value. The second loss value can be determined based on a loss function. The features described in this paragraph in combination with the seventh example form an eighth example.

The training of the second neural sub-network can include adapting the second encoder section and the second decoder section, wherein the adaptation of the second encoder section and the second decoder section can include minimizing the second loss value. This means that the second encoder section of the trained second neural sub-network can output a code that describes, in a second latent space, digital data describing a second context, and that the second decoder section of the trained second neural sub-network can process a code that describes, in the second latent space, digital data describing a second context, and can output digital data based on that code. The features described in this paragraph in combination with the eighth example form a ninth example.

The first mapping can have a third neural sub-network and the second mapping can have a fourth neural sub-network. The features described in this paragraph in combination with one or more of the first example to the ninth example form a tenth example.

The first digital data and the second digital data can each have a plurality of text strings, each text string describing the assigned first digital training data or the assigned second digital training data. This means that each first digital training datum of the first digital training data can be assigned exactly one text string of the plurality of text strings of the first digital data, and that each second digital training datum of the second digital training data can be assigned exactly one text string of the plurality of text strings of the second digital data. For example, the first digital training data or the second digital training data can have digital image data that represent a scene, and the first digital data or the second digital data can have a plurality of text strings that describe the scene shown in each case. The features described in this paragraph in combination with one or more of the first example to the tenth example form an eleventh example.

Training the first mapping may include comparing the code output by the first encoder section based on the first digital training data with a code output by the first mapping based on the first digital data. The feature described in this paragraph in combination with one or more of the first example to the eleventh example forms a twelfth example.

The comparison of the code output by the first encoder section based on the first digital training data with a code output by the first mapping based on the first digital data can include determining a first mapping loss value. The first mapping loss value can be determined based on a loss function. The features described in this paragraph in combination with the twelfth example form a thirteenth example.

Training the first mapping can include adapting the first mapping, wherein adapting the first mapping can include minimizing the first mapping loss value. This means that the trained first mapping can output a code that describes, in the first latent space, digital text data describing a first context, where the code describing the digital text data can be assigned to a code describing the associated digital data in the first latent space. The features described in this paragraph in combination with the thirteenth example form a fourteenth example.
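The mapping training can be sketched in the same hedged way: the codes produced by the already trained first encoder section are treated as fixed targets, and only the mapping from text to the first latent space is adapted to minimize the first mapping loss value. Here the text strings are turned into bag-of-words count vectors over a hypothetical vocabulary `VOCAB`, and the mapping is a single linear layer trained with a mean squared error; both choices are assumptions for illustration only.

```python
import random

VOCAB = ["car", "pedestrian", "crossing", "waiting", "road"]   # hypothetical vocabulary

def text_features(text):
    """Bag-of-words counts over VOCAB (a stand-in for any text encoder)."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def train_mapping(text_vecs, target_codes, lr=0.05, epochs=300, seed=0):
    """Fit a linear mapping so that mapping(text) approximates the frozen encoder code of the paired datum."""
    rng = random.Random(seed)
    n_in, n_out = len(text_vecs[0]), len(target_codes[0])
    m = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    history = []                                   # mean mapping loss value per epoch
    for _ in range(epochs):
        total = 0.0
        for t, z in zip(text_vecs, target_codes):
            err = [p - q for p, q in zip(matvec(m, t), z)]
            total += sum(e * e for e in err) / n_out          # mapping loss (MSE)
            for i in range(n_out):                            # only the mapping is adapted
                for j in range(n_in):
                    m[i][j] -= lr * 2 * err[i] / n_out * t[j]
        history.append(total / len(text_vecs))
    return m, history

texts = ["pedestrian waiting crossing", "car road", "pedestrian road", "car waiting"]
codes = [[0.9, -0.2], [0.1, 0.8], [0.7, 0.3], [0.2, 0.5]]  # stand-ins for frozen encoder codes
mapping, history = train_mapping([text_features(t) for t in texts], codes)
```

Keeping the encoder frozen means the mapping learns to land in the latent space the autoencoder already defined, which is what allows a text code to be assigned to a data code later on.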

At least part of the first digital data can be provided by an additional first neural network, where the additional first neural network can process at least part of the first digital training data. The features described in this paragraph in combination with one or more of the first example to the fourteenth example form a fifteenth example.

The training of the second mapping can include comparing the code output by the second encoder section based on the second digital training data with a code output by the second mapping based on the second digital data. The features described in this paragraph in combination with one or more of the first example to the fifteenth example form a sixteenth example.

The comparison of the code output by the second encoder section based on the second digital training data with a code output by the second mapping based on the second digital data can include determining a second mapping loss value. The second mapping loss value can be determined based on a loss function. The features described in this paragraph in combination with the sixteenth example form a seventeenth example.

The training of the second mapping can comprise adapting the second mapping, wherein adapting the second mapping can comprise minimizing the second mapping loss value. This means that the trained second mapping can output a code that describes, in the second latent space, digital text data describing a second context, where the code describing the digital text data can be assigned to a code describing the associated digital data in the second latent space. The features described in this paragraph in combination with the seventeenth example form an eighteenth example.

At least part of the second digital data can be provided by an additional second neural network, wherein the additional second neural network can process at least part of the second digital training data. The features described in this paragraph in combination with one or more of the first example through the eighteenth example form a nineteenth example.

The third mapping can have a fifth neural sub-network. The feature described in this paragraph in combination with one or more of the first example through the nineteenth example forms a twentieth example.

The digital training data of the third digital training data that describe the first context can comprise at least a subset of the first digital training data (for example all of the first digital training data), and the digital data of the third digital data that are semantically related to these digital training data can comprise the subset of the first digital data assigned to that subset of the first digital training data. The features described in this paragraph in combination with one or more of the first example through the twentieth example form a twenty-first example.

The digital training data of the third digital training data that describe the second context can comprise at least a subset of the second digital training data (for example all of the second digital training data), and the digital data of the third digital data that are semantically related to these digital training data can comprise the subset of the second digital data assigned to that subset of the second digital training data. The features described in this paragraph in combination with one or more of the first example through the twenty-first example form a twenty-second example.
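Assembling the third digital training data and the third digital data from the two contexts could look like the following sketch. It assumes that each context's data are stored as (datum, text string) pairs, so that the one-to-one assignment between a training datum and its text string is preserved when subsets are taken; the function name `build_third_sets` and the context tags are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Sequence[float], str]   # (digital training datum, assigned text string)

def build_third_sets(first_pairs: List[Pair], second_pairs: List[Pair],
                     n_first: Optional[int] = None, n_second: Optional[int] = None):
    """Take a subset (by default all) of each context and keep every datum
    aligned with the text string assigned to it."""
    tagged = ([("first", x, t) for x, t in first_pairs[:n_first]]
              + [("second", x, t) for x, t in second_pairs[:n_second]])
    third_training_data = [(ctx, x) for ctx, x, _ in tagged]
    third_data = [(ctx, t) for ctx, _, t in tagged]
    return third_training_data, third_data

first = [([0.1, 0.2], "pedestrian waiting"), ([0.3, 0.1], "car on road"), ([0.0, 0.5], "empty road")]
second = [([0.7, 0.4], "pedestrian crossing"), ([0.2, 0.9], "car waiting")]
train3, text3 = build_third_sets(first, second, n_first=2)   # two first-context pairs, all second-context pairs
```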

The third mapping can process a code that describes, in the first latent space, digital training data describing the first context; a code that describes, in the first latent space, digital data semantically related to the digital training data describing the first context; and a code that describes, in the second latent space, digital data semantically related to the digital training data describing the second context. Based on these codes, it can output a code that describes digital training data in the second latent space. The features described in this paragraph in combination with one or more of the first example through the twenty-second example form a twenty-third example.

The second decoder section can process the code that describes digital training data in the second latent space and can output reconstructed digital training data that describe the second context. The features described in this paragraph in combination with one or more of the first example through the twenty-third example form a twenty-fourth example.

The training of the third mapping can include comparing the reconstructed digital training data describing the second context with the digital training data describing the second context. The feature described in this paragraph in combination with one or more of the first example through the twenty-fourth example forms a twenty-fifth example.

Comparing the reconstructed digital training data describing the second context with the digital training data describing the second context can include determining a third mapping loss value. The third mapping loss value can be determined based on a loss function. The features described in this paragraph in combination with the twenty-fifth example form a twenty-sixth example.

The training of the third mapping can comprise adapting the third mapping, wherein adapting the third mapping can comprise minimizing the third mapping loss value. As a result, the trained third mapping can process a code that describes, in the first latent space, digital training data describing the first context; a code that describes, in the first latent space, digital data semantically related to the digital training data describing the first context; and a code that describes, in the second latent space, digital data semantically related to the digital training data describing the second context, so that digital training data describing the second context can be output. The features described in this paragraph in combination with the twenty-sixth example form a twenty-seventh example.
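Under strongly simplifying assumptions, the training of the third mapping can be sketched as follows: the third mapping is a single linear layer applied to the concatenation of the three input codes, the second decoder section is a frozen linear map, and the third mapping loss value is a mean squared error between the reconstruction and the digital training data of the second context. Only the weights of the third mapping are adapted. The function names are hypothetical, and the codes in the usage lines are random stand-ins rather than outputs of trained encoders or mappings.

```python
import random

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def train_third_mapping(samples, dec2, code_dim, lr=0.05, epochs=300, seed=0):
    """samples: (z1, c1, c2, x2) tuples; dec2 stays frozen, only the third mapping W is adapted."""
    rng = random.Random(seed)
    n_in = sum(len(part) for part in samples[0][:3])
    w = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(code_dim)]
    d = len(dec2)
    history = []                                    # mean third mapping loss value per epoch
    for _ in range(epochs):
        total = 0.0
        for z1, c1, c2, x2 in samples:
            u = list(z1) + list(c1) + list(c2)      # concatenated input codes
            z2 = matvec(w, u)                       # code in the second latent space
            x2_hat = matvec(dec2, z2)               # reconstruction by the second decoder section
            total += sum((p - q) ** 2 for p, q in zip(x2_hat, x2)) / d
            g = [2 * (p - q) / d for p, q in zip(x2_hat, x2)]
            gz = [sum(dec2[i][j] * g[i] for i in range(d)) for j in range(code_dim)]
            for j in range(code_dim):
                for k in range(n_in):
                    w[j][k] -= lr * gz[j] * u[k]
        history.append(total / len(samples))
    return w, history

dec2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # stand-in for the frozen second decoder section
rng = random.Random(3)
samples = []
for _ in range(10):
    z1, c1, c2 = ([rng.uniform(-1, 1) for _ in range(2)] for _ in range(3))
    z2_true = [p - q + r for p, q, r in zip(z1, c1, c2)]   # arbitrary toy relation
    samples.append((z1, c1, c2, matvec(dec2, z2_true)))
w, history = train_third_mapping(samples, dec2, code_dim=2)
```

Freezing the second decoder section forces the third mapping to produce codes that the already trained second autoencoder can decode, which is the point of comparing in data space rather than in latent space.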

A first transformation network can have the first encoder section of the trained first neural sub-network, the second decoder section of the trained second neural sub-network, the trained first mapping, the trained second mapping and the trained third mapping. The first transformation network can process digital data describing a first context, digital text data describing the first context and assigned to the digital data describing the first context, and digital text data describing a second context, and can output digital data that describe the second context. That is to say, the first transformation network can transform digital data that describe a first context into digital data that describe a second context. This has the advantage that, if digital data describing a first context are extensively available but digital data describing a second context are not, digital data describing the second context can be generated based on the digital data describing the first context. In other words, digital data can be generated for a second context, so that extensive digital data are available both for the first context and for the second context. The features described in this paragraph in combination with one or more of the first example through the twenty-seventh example form a twenty-eighth example.
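How the trained components chain together in the first transformation network can be shown with a small composition sketch. Each component is passed in as a plain Python callable; the parameter names are hypothetical, and the toy lambdas in the usage lines only demonstrate the data flow, not real trained networks.

```python
from typing import Callable, List, Sequence

Vector = List[float]

def first_transformation_network(
        encoder1: Callable[[Sequence[float]], Vector],
        mapping1: Callable[[str], Vector],
        mapping2: Callable[[str], Vector],
        mapping3: Callable[[Vector, Vector, Vector], Vector],
        decoder2: Callable[[Vector], Vector]):
    """Chain trained components: data and text of context 1 plus text of context 2 -> data of context 2."""
    def transform(x1: Sequence[float], text1: str, text2: str) -> Vector:
        z1 = encoder1(x1)           # first encoder section -> first latent space
        c1 = mapping1(text1)        # trained first mapping -> first latent space
        c2 = mapping2(text2)        # trained second mapping -> second latent space
        z2 = mapping3(z1, c1, c2)   # trained third mapping -> second latent space
        return decoder2(z2)         # second decoder section -> data of the second context
    return transform

# Toy stand-ins that only demonstrate the data flow.
transform = first_transformation_network(
    encoder1=lambda x: [sum(x)],
    mapping1=lambda t: [float(len(t))],
    mapping2=lambda t: [float(len(t))],
    mapping3=lambda z, a, b: [z[0] - a[0] + b[0]],
    decoder2=lambda z: [z[0], 2 * z[0]],
)
result = transform([1.0, 2.0], "ab", "abcd")
```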

A second transformation network can have the second encoder section of the trained second neural sub-network, the first decoder section of the trained first neural sub-network, the trained first mapping, the trained second mapping and the inverse of the trained third mapping. The second transformation network can process digital data describing a second context, digital text data describing the second context and assigned to the digital data describing the second context, and digital text data describing a first context, and can output digital data that describe the first context. That is to say, the second transformation network can transform digital data that describe a second context into digital data that describe a first context. This has the advantage that digital data can be adapted to the respective context, so that the digital data can be further processed based on the context in which they were generated or on their intrinsic context. The features described in this paragraph in combination with one or more of the first example through the twenty-eighth example form a twenty-ninth example.
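The patent does not specify how the inverse of the third mapping is obtained. One possible illustration, assuming the third mapping is affine in the first latent code with an invertible 2×2 matrix A, i.e. z2 = A·z1 + B1·c1 + B2·c2, is to solve for z1 given z2 and the two text codes. The affine form, the matrix names and the helper functions are all assumptions made for this sketch only.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def inv2x2(a):
    """Inverse of a 2x2 matrix; raises if the matrix is (numerically) singular."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if abs(det) < 1e-12:
        raise ValueError("third mapping is not invertible in z1")
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def third_map(params, z1, c1, c2):
    """Assumed affine third mapping: z2 = A z1 + B1 c1 + B2 c2."""
    A, B1, B2 = params
    return [p + q + r for p, q, r in zip(matvec(A, z1), matvec(B1, c1), matvec(B2, c2))]

def inverse_third_map(params, z2, c1, c2):
    """Recover z1 = A^-1 (z2 - B1 c1 - B2 c2) from a second-latent-space code and both text codes."""
    A, B1, B2 = params
    residual = [p - q - r for p, q, r in zip(z2, matvec(B1, c1), matvec(B2, c2))]
    return matvec(inv2x2(A), residual)

params = ([[2.0, 1.0], [1.0, 1.0]], [[0.5, 0.0], [0.0, 0.5]], [[1.0, 0.0], [0.0, -1.0]])
z1, c1, c2 = [0.3, -0.7], [1.0, 2.0], [-0.5, 0.25]
z2 = third_map(params, z1, c1, c2)
recovered = inverse_third_map(params, z2, c1, c2)   # round trip back to z1
```

In the second transformation network, z2 would come from the second encoder section, c2 and c1 from the trained second and first mappings, and the recovered z1 would be passed to the first decoder section.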

A computer program can have program instructions which, when they are executed by one or more processors, are set up to carry out the method according to one or more of the first example to the twenty-ninth example. The feature described in this paragraph constitutes a thirtieth example.

The computer program can be stored in a machine-readable storage medium. The feature described in this paragraph in combination with the thirtieth example forms a thirty-first example.

At least part of the first neural sub-network can be implemented by one or more processors. At least part of the first mapping can be implemented by one or more processors. At least part of the second neural sub-network can be implemented by one or more processors. At least part of the second mapping can be implemented by one or more processors. At least part of the third mapping can be implemented by one or more processors. The features described in this paragraph in combination with the thirty-second example form a thirty-third example.

A system may include a device according to the thirty-second example or the thirty-third example. The system can have a sensor, for example an imaging sensor, which is set up to provide digital data that describe the first context or the second context. The features described in this paragraph constitute a thirty-fourth example.

The system may further comprise an additional neural network that is set up to generate, based on the digital data describing the first context or the second context, digital text data having text strings that describe these digital data. The feature described in this paragraph in combination with the thirty-fourth example forms a thirty-fifth example.

The imaging sensor can be a camera sensor or a video sensor. The imaging sensor can be a remote location sensor, such as a radar sensor, a LIDAR sensor or an ultrasonic sensor, whose sensor signals are processed using imaging methods to provide image data. The features described in this paragraph in combination with the thirty-fourth example or the thirty-fifth example form a thirty-sixth example.

A vehicle can have a driver assistance system. The driver assistance system may have the system according to one or more of the thirty-fourth example to the thirty-sixth example. The features described in this paragraph constitute a thirty-seventh example.

A vehicle can have at least one imaging sensor or a remote location sensor which is set up to provide digital image data. The vehicle can also have a driver assistance system. The driver assistance system can have the first neural transformation network according to the twenty-eighth example and/or the second neural transformation network according to the twenty-ninth example. The driver assistance system can furthermore be set up to classify and/or segment the digital data output by the first neural transformation network or the second neural transformation network. The driver assistance system can be set up to control the vehicle based on the classified and/or segmented digital data. That means that the driver assistance system can be set up to process the classified and/or segmented digital data and to output at least one control command based on them. This has the advantage that the driver assistance system can influence the driving behavior based on the context of the digital data. For example, the driver assistance system can recognize the intention of a road user based on the context and influence the driving behavior accordingly (e.g. by changing or maintaining the driving behavior). The features described in this paragraph form a thirty-eighth example.

Exemplary embodiments of the invention are shown in the drawing and explained in more detail in the description below.

In the figures:

FIG. 1 shows a device according to various embodiments;

FIG. 2 shows an imaging device according to various embodiments;

FIG. 3A shows a processing system for training a first neural sub-network according to various embodiments;

FIG. 3B shows a processing system for training a second neural sub-network according to various embodiments;

FIG. 4A shows a processing system for training a first mapping according to various embodiments;

FIG. 4B shows a processing system for training a second mapping according to various embodiments;

FIG. 5 shows a processing system for training a third mapping according to various embodiments;

FIG. 6 shows a method for training a neural network according to various embodiments;

FIG. 7A shows a first processing system for transforming digital data between different contexts according to various embodiments;

FIG. 7B shows a second processing system for transforming digital data between different contexts according to various embodiments; and

FIG. 8 shows a vehicle according to various embodiments.

In one embodiment, a “circuit” can be any type of logic-implementing entity, which can be hardware, software, firmware or a combination thereof. Therefore, in one embodiment, a “circuit” can be a hardwired logic circuit or a programmable logic circuit, such as a programmable processor, for example a microprocessor (e.g. a CISC (complex instruction set computer) processor or a RISC (reduced instruction set computer) processor). A “circuit” can also be software that is implemented or executed by a processor, for example any type of computer program, for example a computer program that uses virtual machine code such as Java. Any other type of implementation of the respective functions, which are described in more detail below, may be understood as a “circuit” in accordance with an alternative embodiment.

Various exemplary embodiments illustratively describe a method for training a neural network so that the trained neural network can transform digital data, such as digital image data, from a first context into a second context. In other words, digital data can have context-specific properties, such as country-specific properties, and the trained neural network can transfer the digital data to a different context.

FIG. 1 illustrates a device 100 according to various embodiments. The device 100 may have one or more sensors 102. The sensor 102 can be configured to provide digital data 104. The sensor 102 can be an imaging sensor, such as a camera sensor or a video sensor, or a remote location sensor, such as a radar sensor, a LIDAR sensor or an ultrasonic sensor. According to various embodiments, the sensor 102 is a different type of sensor. According to various embodiments, the digital data 104 comprise digital image data (in the context of this description, recorded radar, LIDAR and ultrasonic sensor signals that have been processed by means of imaging methods are also understood as digital image data). If there is a plurality of sensors, they may be of the same type or of different types.

The device 100 may further include a storage device 106. The storage device 106 may include a memory. The memory can be used, for example, in the processing performed by a processor. A memory used in the embodiments may be a volatile memory such as a DRAM (dynamic random access memory), or a non-volatile memory such as a PROM (programmable read-only memory), an EPROM (erasable PROM), an EEPROM (electrically erasable PROM), or a flash memory, such as a floating gate memory device, a charge trapping memory device, an MRAM (magnetoresistive random access memory), or a PCRAM (phase change random access memory). The storage device 106 may be configured to store the digital data 104. The device 100 can furthermore have at least one processor 108 (for example exactly one processor, two processors, or more than two processors). As described above, the at least one processor 108 can be any type of circuit, i.e. any type of logic-implementing entity. In various embodiments, the at least one processor 108 is set up to process the digital data 104.

The exemplary embodiments are described below using digital image data 204 as digital data 104. It should be pointed out, however, that other (digital) data can also be used which are dependent on the context, such as any type of digital sensor data.

FIG. 2 illustrates an imaging device 200 in which the sensor is implemented as an imaging sensor 202 in accordance with various embodiments. The imaging sensor 202 can be a camera sensor or a video sensor. The imaging sensor 202 may be configured to provide digital image data 204.

In the context of this description, radar, LIDAR and ultrasonic sensors that are set up to provide digital image data 204 are also understood as an imaging sensor 202. The digital image data 204 may include a plurality of digital images 206. The plurality of digital images 206 may represent a scene in a respective context. According to various embodiments, the imaging device 200 has a plurality of imaging sensors.

FIG. 3A illustrates a processing system 300A for training a first neural sub-network in accordance with various embodiments. The processing system 300A may include the storage device 106 for storing the digital image data 204, such as first digital training data 302. The first digital training data 302 can describe a first context. The processing system 300A may further include the at least one processor 108. The processor 108 implements at least part of a first neural sub-network 304. The first neural sub-network 304 is set up to process the first digital training data 302. The first neural sub-network 304 can be an auto-encoder network. The first neural sub-network 304 can have a first encoder section 306. The first encoder section 306 can have at least one encoder and can be set up to represent the features of the first digital training data 302 in a first latent space 308, i.e., in a lower dimension than the dimension of the first digital training data 302. In other words, the first encoder section 306 can output a code which has a lower dimension than the first digital training data 302. The first neural sub-network 304 can also have a first decoder section 310. The first decoder section 310 can have at least one decoder and can be set up to process the code present in the first latent space 308 and to output first digital output data 312. The dimension of the first digital output data 312 can correspond to the dimension of the first digital training data 302. In other words, the first decoder section 310 can increase the dimension of the code present in the first latent space 308 to the dimension of the first digital training data 302. The first decoder section 310 can reconstruct the first digital training data 302 from the code output by the first encoder section 306. The processor 108 can be set up to determine a first loss value 314 by comparing the first digital output data 312 with the first digital training data 302.
The first neural sub-network 304 can be trained by adapting the first encoder section 306 and the first decoder section 310 in such a way that the first loss value 314 is minimized. That is, the trained first neural sub-network 304 can output a code that describes, in the first latent space 308, digital image data describing a first context, and can reconstruct digital image data from a code in the first latent space 308.
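The training of the first neural sub-network 304 can be illustrated with a minimal sketch. It assumes, for illustration only, a linear auto-encoder with a one-dimensional latent space, plain gradient descent, and a mean squared reconstruction error standing in for the first loss value 314. The description does not specify the architecture, the optimizer, or the loss; the names `encode`, `decode`, `reconstruction_loss` and `train_autoencoder` are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def encode(a, x):
    """Encoder: map an input vector x to a 1-D latent code z = a . x."""
    return dot(a, x)

def decode(b, z):
    """Decoder: map a latent code z back to the input dimension, x_hat = b * z."""
    return [bi * z for bi in b]

def reconstruction_loss(a, b, data):
    """Mean squared reconstruction error (stand-in for the first loss value)."""
    total = 0.0
    for x in data:
        x_hat = decode(b, encode(a, x))
        total += sum((xh - xi) ** 2 for xh, xi in zip(x_hat, x))
    return total / len(data)

def train_autoencoder(data, lr=0.05, steps=2000):
    """Adapt encoder weights a and decoder weights b to minimize the loss."""
    dim = len(data[0])
    a = [0.1] * dim  # encoder weights
    b = [0.1] * dim  # decoder weights
    n = len(data)
    for _ in range(steps):
        grad_a = [0.0] * dim
        grad_b = [0.0] * dim
        for x in data:
            z = encode(a, x)
            r = [xh - xi for xh, xi in zip(decode(b, z), x)]  # residual
            br = dot(b, r)
            for i in range(dim):
                grad_b[i] += 2 * r[i] * z   # d/db_i of ||b z - x||^2
                grad_a[i] += 2 * br * x[i]  # d/da_i of ||b (a . x) - x||^2
        a = [ai - lr * g / n for ai, g in zip(a, grad_a)]
        b = [bi - lr * g / n for bi, g in zip(b, grad_b)]
    return a, b

# Toy data: 2-D points on a line, so a 1-D latent space suffices.
data = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.5, 1.0)]
a, b = train_autoencoder(data)
print("reconstruction loss:", reconstruction_loss(a, b, data))
```

The decoder output has the same dimension as the input, mirroring the statement that the first digital output data 312 can have the dimension of the first digital training data 302.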

FIG. 3B illustrates a processing system 300B for training a second neural subnetwork in accordance with various embodiments. The processing system 300B may include the storage device 106 for storing the digital image data 204, such as second digital training data 322. The second digital training data 322 can describe a second context that is different from the first context.

In various embodiments, the first context and the second context can differ territorially and/or intrinsically, as described above.

The processing system 300B may further include the at least one processor 108. The processor 108 implements at least part of a second neural sub-network 324. The second neural sub-network 324 is set up to process the second digital training data 322. The second neural sub-network 324 can be an auto-encoder network. The architecture of the second neural sub-network 324 can essentially correspond to the architecture of the first neural sub-network 304. The second neural sub-network 324 can have a second encoder section 326 and a second decoder section 330, wherein the second encoder section 326 can generate a code in a second latent space 328 based on the second digital training data 322, and wherein the second decoder section 330 can reconstruct data from the code present in the second latent space 328. That is, the second decoder section 330 can generate second digital output data 332, wherein the dimension of the second digital output data 332 can correspond to the dimension of the second digital training data 322. The processor 108 can be set up to determine a second loss value 334 by comparing the second digital output data 332 with the second digital training data 322 and to minimize the second loss value 334 by adapting the second encoder section 326 and the second decoder section 330. That is, the trained second neural sub-network 324 can output a code that describes, in the second latent space 328, digital image data describing a second context, and can reconstruct digital image data from a code in the second latent space 328.

FIG. 4A illustrates a processing system 400A for training a first mapping in accordance with various embodiments. The processing system 400A may include the storage device 106 for storing the first digital training data 302. The storage device 106 may also store first digital data 402. The first digital data 402 can be semantically related to the first digital training data 302, which describe a first context. According to various embodiments, the first digital data 402 have a plurality of text strings, the text strings describing the scene represented in the first digital training data 302. For example, the first digital training data 302 can have a first digital image 302-1 and a second digital image 302-2, which describe a scene in a first context, and the first digital data 402 can have a first text string 402-1, which is assigned to the first digital image 302-1, and a second text string 402-2, which is assigned to the second digital image 302-2.

For example, the first digital image 302-1 of the first digital training data 302 depicts a street, parked cars and a pedestrian standing on the street, and the first text string 402-1 describes the scene as “pedestrian standing on the street”. The second digital image 302-2, which follows the first digital image 302-1, represents the scene according to the example with the pedestrian stopping on the street, and the second text string 402-2 describes the scene as “pedestrian stopping”.

The processing system 400A may further include the at least one processor 108. The processor 108 implements at least a part of the trained first neural sub-network 304. The first encoder section 306 of the trained first neural sub-network 304 can output a code that describes the first digital training data 302 in the first latent space 308. The processor 108 further implements at least a portion of a first mapping 404. The first mapping 404 can map the first digital data 402 into the first latent space 308. In other words, the first mapping 404 can process the first digital data 402 and can output a code that describes the first digital data 402 in the first latent space 308. That is, the first encoder section 306 can output a first digital training data code 406 that describes the first digital training data 302 in the first latent space 308, and the first mapping 404 can output a first digital data code 408 that describes the first digital data 402 in the first latent space 308. In other words, the first mapping 404 assigns the first digital data code 408 in the first latent space 308 to the first digital training data code 406. The processor 108 can be configured to determine a first mapping loss value 410 by comparing the first digital training data code 406 with the first digital data code 408. The first mapping loss value 410 can be determined based on a loss function. The loss function can be any type of loss function, for example any type of loss function that is based on a regression model.

The first mapping 404 can be trained by adapting it such that the first mapping loss value 410 is minimized. That is, the trained first mapping 404 can output a code that describes digital text data in the first latent space 308, the digital text data comprising a text string and describing digital image data that represent a first context, and the code representing the digital text data being assigned to a code that describes the digital image data in the first latent space 308. The first mapping 404 can be a third neural sub-network.
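The training of the first mapping 404 can be sketched as a regression from text to latent codes. The sketch assumes that the text strings are split into words and that the mapping is a linear model with one learnable latent vector per word, summed over the words of a string. The target codes stand in for the first digital training data codes 406 produced by the frozen first encoder section 306; their values are invented for illustration, as are the names `tokenize`, `map_text`, `mapping_loss` and `train_text_mapping`. The mean squared error plays the role of the first mapping loss value 410 and is one instance of the regression-type loss mentioned above.

```python
def tokenize(text):
    """Very simple tokenizer: lower-case and split on whitespace."""
    return text.lower().split()

def map_text(weights, text, latent_dim):
    """Map a text string into the latent space by summing per-word vectors."""
    code = [0.0] * latent_dim
    for token in tokenize(text):
        vec = weights.get(token)
        if vec is not None:  # unknown words contribute nothing
            code = [c + v for c, v in zip(code, vec)]
    return code

def mapping_loss(weights, pairs, latent_dim):
    """Mean squared distance between text codes and image codes."""
    total = 0.0
    for text, target in pairs:
        pred = map_text(weights, text, latent_dim)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(pairs)

def train_text_mapping(pairs, latent_dim, lr=0.05, steps=500):
    """Fit per-word vectors so that map_text(text) matches the encoder code."""
    weights = {tok: [0.0] * latent_dim for text, _ in pairs for tok in tokenize(text)}
    n = len(pairs)
    for _ in range(steps):
        grads = {tok: [0.0] * latent_dim for tok in weights}
        for text, target in pairs:
            pred = map_text(weights, text, latent_dim)
            residual = [p - t for p, t in zip(pred, target)]
            for tok in tokenize(text):
                grads[tok] = [g + 2 * r / n for g, r in zip(grads[tok], residual)]
        for tok in weights:
            weights[tok] = [w - lr * g for w, g in zip(weights[tok], grads[tok])]
    return weights

# Hypothetical encoder codes 406 of the images 302-1 and 302-2 (2-D latent space).
pairs = [
    ("pedestrian standing on the street", [0.8, -0.2]),
    ("pedestrian stopping", [0.1, 0.5]),
]
weights = train_text_mapping(pairs, latent_dim=2)
print(map_text(weights, "pedestrian stopping", 2))
```

Only the mapping is adapted here; the image codes are treated as fixed targets, which corresponds to training the first mapping 404 with the already trained first neural sub-network 304.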

The processing system 400A may further comprise at least one additional first neural network that is set up to generate at least a part (for example all) of the first digital data 402 using the first digital training data 302 that describe the first context.

FIG. 4B illustrates a processing system 400B for training a second mapping in accordance with various embodiments. The processing system 400B may include the storage device 106 for storing the second digital training data 322. The storage device 106 may also store second digital data 422. The second digital data 422 can have a semantic relationship to the second digital training data 322, which describe a second context. According to various embodiments, the second digital data 422 have a plurality of text strings, the text strings describing the scene represented in the second digital training data 322. For example, the second digital training data 322 can have a first digital image 322-1 and a second digital image 322-2, which describe a scene in a second context, and the second digital data 422 can have a first text string 422-1, which is assigned to the first digital image 322-1, and a second text string 422-2, which is assigned to the second digital image 322-2. For example, the first digital image 322-1 of the second digital training data 322 represents essentially the same scene as the first digital image 302-1 of the first digital training data 302, that is, a street, parked cars and a pedestrian standing on the street, and the first text string 422-1 of the second digital data 422 accordingly describes the scene as “pedestrian standing on the street”.

According to an example, the second digital image 322-2 of the second digital training data 322, which follows the first digital image 322-1, represents a scene in the second context in which the pedestrian crosses the street, and the assigned second text string 422-2 describes the scene as “pedestrian crosses the street”.

The processing system 400B may further include the at least one processor 108. The processor 108 implements at least a part of the trained second neural sub-network 324. The second encoder section 326 of the trained second neural sub-network 324 can output a code that describes the second digital training data 322 in the second latent space 328. The processor 108 further implements at least a portion of a second mapping 424. The second mapping 424 can map the second digital data 422 into the second latent space 328. In other words, the second mapping 424 can process the second digital data 422 and can output a code that describes the second digital data 422 in the second latent space 328. That is, the second encoder section 326 can output a second digital training data code 426 that describes the second digital training data 322 in the second latent space 328, and the second mapping 424 can output a second digital data code 428 that describes the second digital data 422 in the second latent space 328. In other words, the second mapping 424 assigns the second digital data code 428 in the second latent space 328 to the second digital training data code 426. The processor 108 can be configured to determine a second mapping loss value 430 by comparing the second digital training data code 426 with the second digital data code 428. The second mapping loss value 430 can be determined based on a loss function. The second mapping 424 can be trained by adapting it such that the second mapping loss value 430 is minimized. That is, the trained second mapping 424 can output a code that describes digital text data in the second latent space 328, the digital text data comprising a text string and describing digital image data that represent a second context, and the code representing the digital text data being assigned to a code that describes the digital image data in the second latent space 328.
The second mapping 424 can be a fourth neural sub-network.

The processing system 400B may further include at least one additional second neural network that is configured to generate at least a portion (e.g., all) of the second digital data 422 using the second digital training data 322 that describe the second context.

FIG. 5 illustrates a processing system 500 for training a third mapping in accordance with various embodiments. The processing system 500 may include the storage device 106 for storing digital image data 204 and digital text data describing the digital image data 204. The storage device 106 may store third digital training data and third digital data. The third digital training data can have digital training data that describe a first context and digital training data that describe a second context. The third digital data can have digital data that are semantically related to the digital training data that describe the first context, and digital data that are semantically related to the digital training data that describe the second context. According to various embodiments, the digital training data that describe a first context have at least a subset (for example all) of the first digital training data 302, and the digital data that are semantically related to this digital training data have the first digital data 402 assigned to the first digital training data 302. According to various embodiments, the digital training data that describe a second context have at least a subset (for example all) of the second digital training data 322, and the digital data that are semantically related to this digital training data have the second digital data 422 assigned to the second digital training data 322.
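A minimal sketch of how the third digital training data and the third digital data might be assembled, assuming each context's data are held as parallel lists of images and their assigned text strings. The record layout (`Sample`), the context labels and the function name `build_third_training_set` are hypothetical. The set combines a subset (by default all) of the first-context pairs with a subset of the second-context pairs, keeping each image together with its assigned text string.

```python
from typing import List, NamedTuple, Optional

class Sample(NamedTuple):
    image: object   # e.g. an image array; left abstract here
    text: str       # text string assigned to the image
    context: int    # 1 = first context, 2 = second context

def build_third_training_set(first_images, first_texts, second_images, second_texts,
                             first_subset: Optional[List[int]] = None,
                             second_subset: Optional[List[int]] = None) -> List[Sample]:
    """Combine image/text pairs of both contexts into one training set.

    If no subset indices are given, all pairs of that context are used.
    """
    if len(first_images) != len(first_texts) or len(second_images) != len(second_texts):
        raise ValueError("every image needs exactly one assigned text string")
    if first_subset is None:
        first_subset = list(range(len(first_images)))
    if second_subset is None:
        second_subset = list(range(len(second_images)))
    samples = [Sample(first_images[i], first_texts[i], 1) for i in first_subset]
    samples += [Sample(second_images[i], second_texts[i], 2) for i in second_subset]
    return samples

# Placeholder strings stand in for the images 302-1, 302-2, 322-1 and 322-2.
third_set = build_third_training_set(
    ["img_302_1", "img_302_2"],
    ["pedestrian standing on the street", "pedestrian stopping"],
    ["img_322_1", "img_322_2"],
    ["pedestrian standing on the street", "pedestrian crosses the street"],
)
```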

The processing system 500 may further include the at least one processor 108. The processor 108 implements at least a portion of the trained first neural sub-network 304, at least a portion of the trained second neural sub-network 324, at least a portion of the trained first mapping 404, and at least a portion of the trained second mapping 424. The processor 108 further implements at least a part of a third mapping 502. The third mapping 502 can be a fifth neural sub-network. The trained second mapping 424 can output a code that describes the second digital data 422 in the second latent space 328. The first encoder section 306 of the trained first neural sub-network 304 can output a code which describes the first digital training data 302 in the first latent space 308, and the trained first mapping 404 can output a code that describes the first digital data 402 in the first latent space 308. The third mapping 502 can transform digital latent data of the first latent space 308, that is to say a code that describes digital training data or digital text data in the first latent space 308, into digital latent data of the second latent space 328, that is to say into a code that describes the data in the second latent space 328. In other words, the third mapping 502 can map the code that describes the first digital training data 302 in the first latent space 308 and the code that describes the first digital data 402 in the first latent space 308 into the second latent space 328.

The third mapping 502 can process the code describing the first digital training data 302 in the first latent space 308, the code describing the first digital data 402 in the first latent space 308, and the code describing the second digital data 422 in the second latent space 328, and can output a code that describes digital training data in the second latent space 328. The second decoder section 330 of the second neural sub-network 324 can process the code that describes, in the second latent space 328, the digital training data describing the second context and can output third digital output data 504, wherein the dimension of the third digital output data 504 can correspond to the dimension of the second digital training data 322. The processor 108 can be set up to determine a third mapping loss value 506 by comparing the third digital output data 504 with the second digital training data 322. The third mapping 502 can be trained by adapting it in such a way that the third mapping loss value 506 is minimized. That is, the trained third mapping 502 can output a code that describes training data in the second latent space 328, from which the second decoder section 330 can output, after processing the code, training data that can correspond to the second digital training data 322.
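A minimal sketch of training the third mapping 502 follows. It assumes one-dimensional latent spaces, an affine third mapping over the three input codes (data code in the first latent space 308, first-context text code in 308, second-context text code in the second latent space 328), and a frozen linear stand-in for the second decoder section 330. Only the third mapping's parameters are updated; the mean squared error between the decoded output (the third digital output data 504) and the second-context target stands in for the third mapping loss value 506. All numbers and the names `DECODER_2`, `decode_second`, `third_mapping`, `third_loss`, `train_third_mapping` and `hypothetical_true_code` are hypothetical.

```python
DECODER_2 = [1.0, 2.0]  # frozen weights of a hypothetical linear second decoder

def decode_second(z):
    """Frozen second decoder: 1-D code in latent space 2 -> 2-D output."""
    return [w * z for w in DECODER_2]

def third_mapping(params, inputs):
    """Affine map of (data code 1, text code 1, text code 2) to a code in latent space 2."""
    weights, bias = params
    return sum(w * v for w, v in zip(weights, inputs)) + bias

def third_loss(params, samples):
    """Mean squared error between decoded output and second-context training data."""
    total = 0.0
    for inputs, target in samples:
        out = decode_second(third_mapping(params, inputs))
        total += sum((o - t) ** 2 for o, t in zip(out, target))
    return total / len(samples)

def train_third_mapping(samples, lr=0.05, steps=1000):
    """Gradient descent on the third mapping only; the decoder stays frozen."""
    params = ([0.0, 0.0, 0.0], 0.0)
    n = len(samples)
    for _ in range(steps):
        weights, bias = params
        gw = [0.0, 0.0, 0.0]
        gb = 0.0
        for inputs, target in samples:
            out = decode_second(third_mapping(params, inputs))
            # d(loss)/d(code) = 2 * sum_i DECODER_2[i] * (out_i - target_i)
            dcode = 2 * sum(d * (o - t) for d, o, t in zip(DECODER_2, out, target))
            gw = [g + dcode * v / n for g, v in zip(gw, inputs)]
            gb += dcode / n
        params = ([w - lr * g for w, g in zip(weights, gw)], bias - lr * gb)
    return params

def hypothetical_true_code(z1, t1, t2):
    """Invented rule used only to generate consistent toy targets."""
    return 0.5 * z1 + 1.0 * t1 - 1.0 * t2 + 0.2

inputs = [(1.0, 0.5, 0.0), (0.0, 1.0, 0.5), (0.5, 0.0, 1.0)]
samples = [(x, decode_second(hypothetical_true_code(*x))) for x in inputs]
params = train_third_mapping(samples)
print("third mapping loss:", third_loss(params, samples))
```

Because the loss is measured after the frozen decoder, the gradient reaching the third mapping is the decoder's output error pulled back into the second latent space.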

FIG. 6 illustrates a method 600 for training a neural network according to various embodiments. The method 600 may include training a first neural sub-network 304 (at 602). The first neural sub-network 304 can have a first encoder section 306 and a first decoder section 310 and can be trained based on first digital training data 302 that describe a first context. The first encoder section 306 can provide a mapping of the first digital training data 302 into a first latent space 308. The method 600 may include training a first mapping 404 (at 604). The first mapping 404 can map first digital data 402, which are semantically related to the first digital training data 302, into the first latent space 308 and can be trained using the first digital training data 302 mapped into the first latent space 308 by means of the trained first neural sub-network 304. The method 600 may include training a second neural sub-network 324 (at 606). The second neural sub-network 324 can have a second encoder section 326 and a second decoder section 330 and can be trained based on second digital training data 322 which describe a second context. The second encoder section 326 can provide a mapping of the second digital training data 322 into a second latent space 328. The method 600 may include training a second mapping 424 (at 608). The second mapping 424 can map second digital data 422, which are semantically related to the second digital training data 322, into the second latent space 328 and can be trained using the second digital training data 322 mapped into the second latent space 328 by means of the trained second neural sub-network 324. The method 600 may further include training a third mapping 502 (at 610). The third mapping 502 may map digital latent data from the first latent space 308 into the second latent space 328.
The third mapping 502 can be trained based on third digital training data and third digital data, wherein the third digital training data can include digital training data that describe the first context and digital training data that describe the second context, and the third digital data can include digital data which are semantically related to the digital training data describing the first context and digital data which are semantically related to the digital training data describing the second context. The digital training data that describe the first context can have at least a subset (for example all) of the first digital training data 302, and the digital data that are semantically related to this digital training data can have the first digital data 402 associated with the first digital training data 302. The digital training data that describe the second context can have at least a subset (for example all) of the second digital training data 322, and the digital data that are semantically related to this digital training data can have the second digital data 422 associated with the second digital training data 322.

FIG. 7A illustrates a processing system 700A for transforming digital data between different contexts according to various embodiments. The processing system 700A may include the storage device 106 for storing digital data 702, such as digital image data 204. The digital data 702 can have first context data 704, the first context data 704 having digital image data 204 which describe the first context. The digital data 702 can further include first context text data 706, the first context text data 706 having a plurality of text strings which are assigned to the first context data 704 and which describe the first context. The digital data 702 can furthermore have second context text data 708, the second context text data 708 having a plurality of text strings which describe a second context.

The processing system 700A may further include the at least one processor 108. The processor 108 implements at least a portion of a first neural transformation network 710A. The first neural transformation network 710A may include at least a portion of the trained first neural sub-network 304 and at least a portion of the trained second neural sub-network 324. The first neural transformation network 710A can have the first encoder section 306 of the trained first neural sub-network 304, the second decoder section 330 of the trained second neural sub-network 324, the trained first mapping 404 and the trained second mapping 424. The first neural transformation network 710A may further include the third mapping 502. The first encoder section 306 can process the first context data 704 and can output a code that describes the first context data 704 in the first latent space 308. The first mapping 404 can process the first context text data 706 and can output a code that describes the first context text data 706 in the first latent space 308. The second mapping 424 can process the second context text data 708 and can output a code that describes the second context text data 708 in the second latent space 328. The third mapping 502 can process the code describing the first context data 704 in the first latent space 308, the code describing the first context text data 706 in the first latent space 308, and the code describing the second context text data 708 in the second latent space 328, and can output a code that describes digital data in the second latent space 328. The second decoder section 330 can process the code that describes digital data in the second latent space 328 and can output second context data 712 describing the second context. That is, the first neural transformation network 710A can process digital data describing the first context and digital text data describing the first context and the second context, and can output digital data describing the second context.
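The data flow through the first neural transformation network 710A can be sketched as plain function composition, assuming each trained component is available as a callable. The factory name `make_transformation_network`, its parameter names, and the toy stand-in functions in the usage part are hypothetical; they only show which code lives in which latent space and where the third mapping 502 sits.

```python
from typing import Callable

def make_transformation_network(encoder_1: Callable, text_map_1: Callable,
                                text_map_2: Callable, third_map: Callable,
                                decoder_2: Callable) -> Callable:
    """Compose trained parts into a first-context -> second-context transformer."""
    def transform(first_context_data, first_context_text, second_context_text):
        data_code_1 = encoder_1(first_context_data)    # code in first latent space 308
        text_code_1 = text_map_1(first_context_text)   # code in first latent space 308
        text_code_2 = text_map_2(second_context_text)  # code in second latent space 328
        code_2 = third_map(data_code_1, text_code_1, text_code_2)  # second latent space 328
        return decoder_2(code_2)                       # second context data 712
    return transform

# Toy stand-ins for the trained components (hypothetical, not learned).
transform_710a = make_transformation_network(
    encoder_1=lambda x: sum(x),
    text_map_1=lambda s: float(len(s.split())),
    text_map_2=lambda s: float(len(s.split())),
    third_map=lambda z, t1, t2: z + t2 - t1,
    decoder_2=lambda z: [z, 2 * z],
)
out = transform_710a([1.0, 2.0], "pedestrian stopping", "pedestrian crosses the street")
print(out)
```

The first decoder section 310 and the second encoder section 326 do not appear in this direction, matching the composition of the first neural transformation network 710A described above.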

The processor 108 can further be set up to process the second context data 712 and to output classified and/or segmented second context data 714A. The processor 108 can implement a first neural classification network, the first classification network being set up to classify and/or segment digital data.

FIG. 7B illustrates a second processing system 700B for transforming digital data between different contexts according to various embodiments. The processing system 700B may include the storage device 106 for storing digital data 702, such as digital image data 204. The digital data 702 can have second context data 712, the second context data 712 having digital image data 204 which describe the second context. The digital data 702 can furthermore have first context text data 706, the first context text data 706 having a plurality of text strings which describe a first context. The digital data 702 may further include second context text data 708, the second context text data 708 having a plurality of text strings which are assigned to the second context data 712 and which describe a second context.

The processing system 700B may further include the at least one processor 108. The processor 108 implements at least part of a second neural transformation network 710B. The second neural transformation network 710B can have at least part of the trained first neural sub-network 304, at least part of the trained second neural sub-network 324, the trained first mapping 404 and the trained second mapping 424. The second neural transformation network 710B can have the second encoder section 326 of the trained second neural sub-network 324 and the first decoder section 310 of the trained first neural sub-network 304. The second encoder section 326 can process the second context data 712 and can output a code that describes the second context data 712 in the second latent space 328. The first mapping 404 can process the first context text data 706 and can output a code that describes the first context text data 706 in the first latent space 308. The second mapping 424 can process the second context text data 708 and can output a code that describes the second context text data 708 in the second latent space 328. The second neural transformation network 710B can also have at least part of an inverse third mapping 716, wherein the inverse third mapping 716 can correspond to the inverse of the trained third mapping 502. That is, the inverse third mapping 716 can process a code that describes, in the second latent space 328, digital data describing a second context, a code that describes digital text data in the second latent space 328, and a code that describes digital text data in the first latent space 308, and can output a code describing digital data in the first latent space 308. In other words, the inverse third mapping 716 can map a code that describes, in the second latent space 328, digital data describing a second context into the first latent space 308 using digital text data describing the first context and the second context.
The inverse third mapping 716 can process the code describing the second context data 712 in the second latent space 328, the code describing the first context text data 706 in the first latent space 308, and the code describing the second context text data 708 in the second latent space 328, and can output a code that describes digital data in the first latent space 308. The first decoder section 310 can process the code that describes digital data in the first latent space 308 and can output first context data 704 that describe the first context. That is, the second neural transformation network 710B can process digital data describing the second context and digital text data describing a first context and a second context, and can output digital data describing the first context.
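The description leaves open how the inverse third mapping 716 is obtained; a general neural network has no closed-form inverse. As one possibility, assuming the third mapping is affine in the data code with a nonzero coefficient (a hypothetical one-dimensional setting), it can be inverted for fixed text codes by solving for the data code. The function names and parameter values below are illustrative only.

```python
def third_mapping(params, data_code_1, text_code_1, text_code_2):
    """Hypothetical affine third mapping: latent space 1 -> latent space 2."""
    w_data, w_text_1, w_text_2, bias = params
    return w_data * data_code_1 + w_text_1 * text_code_1 + w_text_2 * text_code_2 + bias

def inverse_third_mapping(params, data_code_2, text_code_1, text_code_2):
    """Solve third_mapping(params, z1, t1, t2) == data_code_2 for z1."""
    w_data, w_text_1, w_text_2, bias = params
    if w_data == 0:
        raise ValueError("mapping is not invertible in the data code")
    return (data_code_2 - w_text_1 * text_code_1 - w_text_2 * text_code_2 - bias) / w_data

params = (0.5, 1.0, -1.0, 0.2)
z2 = third_mapping(params, 1.0, 0.5, 0.0)          # forward: latent space 1 -> 2
z1 = inverse_third_mapping(params, z2, 0.5, 0.0)   # inverse: latent space 2 -> 1
print(z1)
```

The text codes enter the inverse unchanged, which reflects that the inverse third mapping 716 uses the same first-context and second-context text data as the forward direction.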

The processor 108 can furthermore be configured to process the first context data 704 and to output classified and/or segmented first context data 714B. The processor 108 can implement a second neural classification network, the second classification network being set up to classify and/or segment digital data. The second classification network can correspond to the first classification network.

FIG. 8 illustrates a vehicle 800 in accordance with various embodiments. The vehicle 800 may be an internal combustion engine vehicle, an electric vehicle, a hybrid vehicle, or a combination thereof. Further, the vehicle 800 can be a car, a truck, a ship, a drone, an airplane, and the like.

The vehicle 800 may include at least one sensor 802 (e.g., an imaging sensor such as the sensor 102). The vehicle 800 may have a driver assistance system 804. The driver assistance system 804 can have the storage device 106. The driver assistance system 804 can include the processor 108. The processor 108 can implement the first neural transformation network 710A and/or the second neural transformation network 710B. The first neural transformation network 710A can be set up to process digital data that describe a first context and to output digital data that describe a second context. The second neural transformation network 710B can be set up to process digital data that describe a second context and to output digital data that describe a first context. According to various embodiments, the first neural transformation network 710A and/or the second neural transformation network 710B were trained according to the method 600 for training a neural network, so that the first neural transformation network 710A can transform digital data describing a first context into digital data describing a second context, and the second neural transformation network 710B can transform digital data describing a second context into digital data describing a first context.

The processor 108 may also be set up to classify and/or segment the digital data output by the first neural transformation network 710A and/or the second neural transformation network 710B.

The processor 108 can implement a neural classification network that is set up to classify and/or segment the digital data output by the first neural transformation network 710A and/or the second neural transformation network 710B.

According to various embodiments, the classified and/or segmented digital data 714A, 714B have the intention of road users as a feature. In combination with the transformation of the digital image data, this has the advantage, for example, that the intention of a road user can be determined depending on the context of the situation, for example the territorial context with regard to a district, a region, a country, etc. The driver assistance system 804 can be set up to control the vehicle 800 based on the classified and/or segmented digital data 714A, 714B. In other words, the driver assistance system 804 can be set up to process the classified and/or segmented digital data 714A, 714B and to output at least one control command to one or more actuators of the vehicle 800 based on the classified and/or segmented digital data 714A, 714B.

That is to say, the driver assistance system 804 can influence the current driving behavior based on the context of the digital image data 204 and thus the context of the classified and/or segmented digital data 714A, 714B; for example, the current driving behavior can be maintained or changed. For example, the driver assistance system 804 can determine that, in the context of a situation, a pedestrian is trying to cross a street and can then intervene in the driving behavior for safety reasons, for example by emergency braking.
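The final decision step can be sketched as a lookup from a classified road-user intention to a control command, assuming the classification network returns an intention label with a confidence score. The labels, the threshold and the command names are hypothetical; the description only gives the example that a pedestrian trying to cross the street may lead to an intervention such as emergency braking.

```python
from typing import NamedTuple

class Classification(NamedTuple):
    road_user: str     # e.g. "pedestrian"
    intention: str     # e.g. "crossing", "stopping", "standing"
    confidence: float  # classifier score in [0, 1]

def control_command(result: Classification, threshold: float = 0.5) -> str:
    """Map a classified intention to a (hypothetical) actuator command."""
    if (result.road_user == "pedestrian" and result.intention == "crossing"
            and result.confidence >= threshold):
        return "emergency_brake"  # intervene in the driving behavior for safety
    return "maintain"             # keep the current driving behavior

print(control_command(Classification("pedestrian", "crossing", 0.9)))
```

Because the classification runs on context-transformed data, the same observed scene can yield "stopping" in one context and "crossing" in another, which is what makes the resulting command context-dependent.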

Claims

1. A method for training a neural network, carried out by one or more processors, the method comprising:

Training a first neural sub-network with first digital training data that describe a first context,

• wherein the first neural sub-network is set up as an autoencoder network and has a first encoder section and a first decoder section, and

• wherein the first encoder section provides a mapping of the first digital training data into a first latent space;

Training a first mapping of first digital data, which are semantically related to the first digital training data, into the first latent space using the first digital training data mapped into the first latent space by means of the trained first neural sub-network;

Training a second neural sub-network with second digital training data that describe a second context that is different from the first context,

• wherein the second neural sub-network is set up as an autoencoder network and has a second encoder section and a second decoder section, and

• wherein the second encoder section provides a mapping of the second digital training data into a second latent space;

Training a second mapping of second digital data, which are semantically related to the second digital training data, into the second latent space using the second digital training data mapped into the second latent space by means of the trained second neural sub-network;

Training a third mapping of digital latent data from the first latent space into the second latent space using third digital training data and third digital data,

• wherein the third digital training data comprise:

- digital training data that describe the first context, and

- digital training data that describe the second context,

• wherein the third digital data comprise:

- digital data which are semantically related to the digital training data which describe the first context, and

- digital data which are semantically related to the digital training data which describe the second context.

2. The method according to claim 1,

wherein the first digital training data, the second digital training data, and the third digital training data comprise digital image data.

3. The method according to claim 1 or 2,

wherein the first digital training data, the second digital training data, and the third digital training data comprise digital sensor data.

4. The method according to any one of claims 1 to 3, further comprising:

Generating the first digital data using an additional first neural network and the first digital training data which describe the first context; and/or

Generating the second digital data using an additional second neural network and the second digital training data which describe the second context; and/or

Generating the third digital data using an additional third neural network and digital training data which describe the first context and the second context.

5. The method according to any one of claims 1 to 4, further comprising:

Transforming digital data describing the first context into digital data describing the second context using a first neural transformation network that is formed by the first encoder section of the trained first neural sub-network, the trained first mapping, the trained second mapping, the trained third mapping for mapping from the first latent space into the second latent space, and the second decoder section of the trained second neural sub-network.

6. The method of claim 5, further comprising:

Carrying out a classification and/or segmentation of digital data which describe the first context.

7. The method according to any one of claims 1 to 4, further comprising:

Transforming digital data describing the second context into digital data describing the first context using a second neural transformation network that is formed by the second encoder section of the trained second neural sub-network, the trained first mapping, the trained second mapping, the inverse mapping of the trained third mapping from the second latent space into the first latent space, and the first decoder section of the trained first neural sub-network.

8. The method of claim 7, further comprising:

Carrying out a classification and/or segmentation of digital data which describe the second context.

9. Device which is set up to carry out the method according to one of claims 1 to 8.

10. System, comprising:

an apparatus according to claim 9; and

• a sensor that is set up to provide the digital data to the device.

11. Vehicle, comprising:

at least one sensor which is set up to provide digital data; and

a driver assistance system that has a neural network trained according to one of claims 1 to 4, wherein the neural network is set up to classify and/or segment the digital data according to claim 6 or claim 8, and wherein the driver assistance system is set up to control the vehicle based on the classified and/or segmented digital data.