CN117095266A - Generation domain adaptation in neural networks

Info

Publication number
CN117095266A
CN117095266A
Authority
CN
China
Prior art keywords
data
domain
neural network
level
source domain
Prior art date
Legal status
Pending
Application number
CN202210502660.4A
Other languages
Chinese (zh)
Inventor
Praveen Narayanan
Nikita Jaipuria
A. Malik
Punarjay Chakravarty
G. Kumar
Current Assignee
Ford Global Technologies LLC
Original Assignee
Ford Global Technologies LLC
Priority date
Filing date
Publication date
Application filed by Ford Global Technologies LLC filed Critical Ford Global Technologies LLC
Priority to CN202210502660.4A
Publication of CN117095266A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00: Manipulating 3D models or images for computer graphics
    • G06T 19/006: Mixed reality


Abstract

The invention provides "generation domain adaptation in neural networks". A system includes a computer including a processor and a memory. The memory stores instructions executable by the processor to cause the processor to: generate a low-level representation of input source domain data; generate an embedding of the input source domain data; generate a high-level feature representation of features of the input source domain data; generate output target domain data including semantics corresponding to the input source domain data in the target domain by processing the high-level feature representation using a target domain low-level decoder neural network layer specific to generating data from the target domain; and modify the loss function such that potential attributes corresponding to the embedding are selected from the same probability distribution.

Description

Generation domain adaptation in neural networks
Technical Field
The present disclosure relates to neural networks in vehicles.
Background
Neural networks are machine-learning models that employ one or more layers of nonlinear units to predict the output of a received input. In addition to the output layer, some neural networks include one or more hidden layers. The output of each hidden layer serves as an input to the next layer in the network (i.e., the next hidden layer or output layer). Each layer of the network generates an output from the received input according to the current value of the respective set of weights.
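By way of illustration only, the layered computation described above may be sketched as follows (the PyTorch framework and the layer sizes are assumptions, not part of this disclosure):

import torch
import torch.nn as nn

# Each hidden layer generates its output from the received input according to
# the current values of its weights; that output feeds the next layer.
network = nn.Sequential(
    nn.Linear(8, 16),   # hidden layer 1: W1 @ x + b1
    nn.ReLU(),          # nonlinear unit
    nn.Linear(16, 16),  # hidden layer 2 consumes hidden layer 1's output
    nn.ReLU(),
    nn.Linear(16, 4),   # output layer
)

x = torch.randn(1, 8)   # a received input
y = network(x)          # prediction computed layer by layer
print(y.shape)          # torch.Size([1, 4])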
Disclosure of Invention
Neural networks are used for many tasks in operating autonomous vehicles. For example, a neural network may input image data acquired by vehicle sensors to determine objects in the environment surrounding the vehicle and use data about the objects to determine a vehicle path on which to operate the vehicle. A neural network may also be trained to determine commands spoken by vehicle occupants and operate the vehicle based on the determined commands. For example, the spoken commands may include phrases such as "forward", "stop", and "left turn". Neural networks may also be trained to process video data to determine the real world location of a vehicle based on visual odometry. Visual odometry is a technique for determining a vehicle location by processing a sequence of video images to determine changes in the location of features in the sequence of video images. A feature is an arrangement of pixel values that can be determined in two or more video images. A neural network may be trained to accomplish these tasks by collecting a large amount of training data including input data and examples of corresponding ground truth. The input data may be images of the surroundings of the vehicle, including objects such as other vehicles and pedestrians. In other examples, the training data may include commands spoken by a plurality of different people having different vocal characteristics. Ground truth is data corresponding to the correct output for the neural network, acquired from a source independent of the neural network. In the example of image data, a human observer may view the training image and determine the identity and location of objects in the image data. In the example of a spoken command, a human listener may listen to the spoken command and determine the correct vehicle command corresponding to it.
A problem with training data is that a large number of training examples (typically greater than 1000) may be required to train a neural network. Because each training example requires a corresponding ground truth, compiling a training data set can be very expensive and require many person-hours to complete. Additional neural networks may be trained to generate simulated training data, including ground truth, from a smaller number of real world examples, thereby reducing the time and expense required to generate a training data set. A training data set generated in this way is only useful when the simulated training data accurately corresponds to the real world data used to generate it. The techniques discussed herein improve the process of generating a training data set using a neural network by improving techniques for generating accurate simulated training data based on a limited amount of input real world training data, thereby reducing the time and expense required to generate a training data set for neural network training. The techniques described herein may improve neural network generation of training data sets by improving the determination of a loss function. The loss function is used in training the neural network by comparing a generated result with the ground truth to determine a difference between the generated result and the corresponding ground truth.
A system includes a computer including a processor and a memory. The memory stores instructions executable by the processor to cause the processor to: generate a low-level representation of input source domain data by processing the source domain data using a source domain low-level encoder neural network layer specific to data from the source domain; generate an embedding of the input source domain data by processing the low-level representation using a high-level encoder neural network layer shared between data from the source domain and data from the target domain; generate a high-level feature representation of the features of the input source domain data by processing the embedding of the input source domain data using a high-level decoder neural network layer shared between data from the source domain and data from the target domain; generate output target domain data including semantics corresponding to the input source domain data in the target domain by processing the high-level feature representation of the features of the input source domain data using a target domain low-level decoder neural network layer specific to generating data from the target domain; and modify the loss function such that potential attributes corresponding to the embedding are selected from the same probability distribution.
In other features, the processor is further programmed to: the loss function is modified by calculating a maximum mean difference between a first potential attribute corresponding to the source domain and a second potential attribute corresponding to the target domain.
In other features, the processor is further programmed to: the loss function is modified based on a prediction from the discriminator, wherein the prediction indicates a domain corresponding to the potential attribute.
In other features, the discriminator includes one or more convolution layers, one or more batch normalization layers, and one or more rectified linear unit layers.
In other features, the last layer of the discriminator comprises a softmax layer.
In other features, the discriminator generates a multidimensional vector representing the prediction.
In other features, the multi-dimensional vector comprises a four-dimensional vector corresponding to four domains.
In other features, the multi-dimensional vector comprises a two-dimensional vector corresponding to two domains.
In other features, the loss function of the discriminator includes: L_D = -Σ ŷ log D(Z), summed over Z ∈ {Z_AA, Z_AB, Z_BA, Z_BB}, wherein L_D is defined as a loss function, ŷ is defined as a label of the corresponding domain, log D is defined as an estimate of the probability that the potential attribute corresponds to the particular domain, and Z_AA, Z_AB, Z_BA, Z_BB are defined as the predicted domain outputs.
In other features, the processor is further programmed to: generate a low-level representation of input target domain data by processing the input target domain data using a target domain low-level encoder neural network layer that is specific to data from the target domain; generate an embedding of the input target domain data by processing the low-level representation using a high-level encoder neural network layer shared between data from the source domain and data from the target domain; generate a high-level feature representation of features of the input target domain data by processing the embedding of the input target domain data using a high-level decoder neural network layer shared between data from the source domain and data from the target domain; and generate output source domain data from the source domain including semantics corresponding to the input target domain data by processing the high-level feature representation of the features of the input target domain data using a source domain low-level decoder neural network layer specific to data from the source domain.
A method includes: generating a low-level representation of input source domain data by processing the source domain data using a source domain low-level encoder neural network layer specific to data from the source domain; generating an embedding of the input source domain data by processing the low-level representation using a high-level encoder neural network layer shared between data from the source domain and data from the target domain; generating a high-level feature representation of features of the input source domain data by processing the embedding of the input source domain data using a high-level decoder neural network layer shared between data from the source domain and data from the target domain; generating output target domain data including semantics corresponding to the input source domain data in the target domain by processing the high-level feature representation of the features of the input source domain data using a target domain low-level decoder neural network layer specific to generating data from the target domain; and generating output source domain data from the source domain including semantics corresponding to input target domain data by processing a high-level feature representation of features of the input target domain data using a source domain low-level decoder neural network layer specific to data from the source domain.
In other features, the method includes: the loss function is modified by calculating a maximum mean difference between a first potential attribute corresponding to the source domain and a second potential attribute corresponding to the target domain.
In other features, the method includes: the loss function is modified based on a prediction from the discriminator, wherein the prediction indicates a domain corresponding to the potential attribute.
In other features, the discriminator includes one or more convolution layers, one or more batch normalization layers, and one or more rectified linear unit layers.
In other features, the last layer of the discriminator comprises a softmax layer.
In other features, the discriminator generates a multidimensional vector representing the prediction.
In other features, the multi-dimensional vector comprises a four-dimensional vector corresponding to four domains.
In other features, the multi-dimensional vector comprises a two-dimensional vector corresponding to two domains.
In other features, the loss function of the discriminator includes: L_D = -Σ ŷ log D(Z), summed over Z ∈ {Z_AA, Z_AB, Z_BA, Z_BB}, wherein L_D is defined as a loss function, ŷ is defined as a label of the corresponding domain, log D is defined as an estimate of the probability that the potential attribute corresponds to the particular domain, and Z_AA, Z_AB, Z_BA, Z_BB are defined as the predicted domain outputs.
In other features, the method includes: generating a low-level representation of input target domain data by processing the input target domain data using a target domain low-level encoder neural network layer that is specific to data from the target domain; generating an embedding of the input target domain data by processing the low-level representation using a high-level encoder neural network layer shared between data from the source domain and data from the target domain; generating a high-level feature representation of features of the input target domain data by processing the embedding of the input target domain data using a high-level decoder neural network layer shared between data from the source domain and data from the target domain; and generating output source domain data from the source domain including semantics corresponding to the input target domain data by processing the high-level feature representation of the features of the input target domain data using a source domain low-level decoder neural network layer specific to data from the source domain.
The present disclosure describes a domain-adaptive network that can receive data (such as images) from a source domain and convert the data into data from a target domain that has similar semantics to the source domain data (e.g., maintain semantic content within the images). Semantics in this context refer to data to be maintained between images, such as objects within an image. Typically, the source domain (e.g., daytime image or virtual environment image) is different from the target domain (e.g., nighttime image or real world image). For example, the distribution of pixel values in the image from the source domain is different from the distribution of pixel values in the image from the target domain. Thus, if one image is from a source domain and another image is from a target domain, images having the same semantics may appear different. For example, the source domain may be an image of a virtual environment simulating a real world environment, and the target domain may be an image of a real world environment.
The source domain image may be an image of a virtual environment simulating a real world environment to be interacted with by an autonomous or semi-autonomous vehicle, and the target domain image may be an image of a real world environment captured by the vehicle. During training, one or more weights of the domain adaptation network are updated using the loss function. As described in more detail herein, the loss function may be modified such that embedded potential attributes are selected from the same probability distribution to create more realistic data in the target domain.
By transforming the source domain image into the target domain image, the target domain image may be used to develop a control strategy for the vehicle, or the target domain image may be used when training a neural network for selecting actions to be performed by the vehicle. Thus, the performance of the vehicle in a real world environment may be improved by exposing the neural network and/or control strategy to additional conditions created within the virtual environment.
Drawings
FIG. 1 is a diagrammatic illustration of an exemplary vehicle system.
Fig. 2 is an illustration of an exemplary server.
Fig. 3 is an illustration of an exemplary domain transfer network.
Fig. 4 is another illustration of an exemplary domain transfer network.
Fig. 5 is another illustration of an exemplary domain transfer network including a discriminator.
Fig. 6A-6D are illustrations of an exemplary discriminator.
Fig. 7 is a diagram illustrating exemplary layers of a discriminator.
Fig. 8 is an illustration of an exemplary deep neural network.
Fig. 9 is a flowchart illustrating an exemplary process for generating target domain data from source domain data.
Fig. 10 is a flowchart illustrating an exemplary process for generating source domain data from target domain data.
Detailed Description
FIG. 1 is a block diagram of an exemplary vehicle system in which the domain transfer network 300 described below may be deployed. The system includes a vehicle 105, which is a land vehicle such as an automobile, truck, or the like. The vehicle 105 includes a computer 110, vehicle sensors 115, actuators 120 for actuating various vehicle components 125, and a vehicle communication module 130. The communication module 130 allows the computer 110 to communicate with a server 145 via a communication network 135. The system also includes a roadside device 150 and an authentication device 155 that can communicate with the server 145 and the vehicle 105 via the communication network 135.
The computer 110 includes a processor and a memory. The memory includes one or more forms of computer-readable media and stores instructions executable by the computer 110 to perform various operations, including operations as disclosed herein.
The computer 110 may operate the vehicle 105 in an autonomous mode, a semi-autonomous mode, or a non-autonomous (manual) mode. For purposes of this disclosure, an autonomous mode is defined as one in which each of propulsion, braking, and steering of the vehicle 105 is controlled by the computer 110; in a semi-autonomous mode, the computer 110 controls one or two of propulsion, braking, and steering of the vehicle 105; in a non-autonomous mode, a human operator controls each of propulsion, braking, and steering of the vehicle 105.
The computer 110 may include programming to operate one or more of the vehicle 105 brakes, propulsion (e.g., controlling acceleration of the vehicle by controlling one or more of an internal combustion engine, an electric motor, a hybrid engine, etc.), steering, climate control, interior and/or exterior lights, etc., and to determine whether and when the computer 110, as opposed to a human operator, is to control such operations. In addition, the computer 110 may be programmed to determine whether and when a human operator controls such operations.
The computer 110 may include or be communicatively coupled to more than one processor, e.g., included in electronic control units (ECUs) or the like included in the vehicle 105 (e.g., a powertrain controller, a brake controller, a steering controller, etc.) for monitoring and/or controlling various vehicle components 125, such as via the communication module 130 of the vehicle 105 as further described below. In addition, the computer 110 may communicate with a navigation system that uses the Global Positioning System (GPS) via the communication module 130 of the vehicle 105. As one example, the computer 110 may request and receive location data for the vehicle 105. The location data may be in a known form, such as geographic coordinates (latitude and longitude).
The computer 110 is typically arranged to communicate by means of the vehicle 105 communication module 130 and also by means of a wired and/or wireless network inside the vehicle 105 (e.g. a bus in the vehicle 105, etc., such as a Controller Area Network (CAN), etc.) and/or other wired and/or wireless mechanisms.
Via the vehicle 105 communication network, the computer 110 may transmit and/or receive messages to and/or from various devices in the vehicle 105, such as vehicle sensors 115, actuators 120, vehicle components 125, human-machine interfaces (HMI), and the like. Alternatively or additionally, where the computer 110 actually includes a plurality of devices, the vehicle 105 communication network may be used for communication between the devices represented in this disclosure as the computer 110. Further, as mentioned below, various controllers and/or vehicle sensors 115 may provide data to the computer 110.
The vehicle sensors 115 may include a variety of devices such as are known for providing data to the computer 110. For example, the vehicle sensors 115 may include light detection and ranging (lidar) sensors 115, etc., disposed on top of the vehicle 105, behind a front windshield of the vehicle 105, around the vehicle 105, etc., that provide the relative positions, sizes, and shapes of objects around the vehicle 105 and/or conditions of the surroundings. As another example, one or more radar sensors 115 secured to bumpers of the vehicle 105 may provide data regarding the speed and range of objects (possibly including a second vehicle 106) relative to the position of the vehicle 105. The vehicle sensors 115 may also include camera sensors 115 (e.g., front view, side view, rear view, etc.) that provide data from a field of view inside and/or outside the vehicle 105.
The vehicle 105 actuators 120 are implemented via circuits, chips, motors, or other electronic and/or mechanical components that may actuate various vehicle subsystems according to appropriate control signals as is known. The actuators 120 may be used to control components 125, including braking, acceleration, and steering of the vehicle 105.
In the context of the present disclosure, the vehicle component 125 is one or more hardware components adapted to perform mechanical or electromechanical functions or operations, such as moving the vehicle 105, decelerating or stopping the vehicle 105, steering the vehicle 105, and the like. Non-limiting examples of components 125 include propulsion components (which include, for example, an internal combustion engine and/or an electric motor, etc.), transmission components, steering components (which may include, for example, one or more of a steering wheel, a steering rack, etc.), braking components (as described below), parking assist components, adaptive cruise control components, adaptive steering components, movable seats, etc.
Further, the computer 110 may be configured to communicate with devices external to the vehicle 105 via the vehicle communication module or interface 130, for example, by vehicle-to-vehicle (V2V) or vehicle-to-everything (V2X) wireless communication with another vehicle or with a remote server 145 (typically via a communication network 135). The module 130 may include one or more mechanisms by which the computer 110 may communicate, including any desired combination of wireless (e.g., cellular, wireless, satellite, microwave, and radio frequency) communication mechanisms and any desired network topology (or topologies when multiple communication mechanisms are utilized). Exemplary communications provided via the module 130 include cellular networks providing data communication services, Bluetooth®, IEEE 802.11, Dedicated Short Range Communications (DSRC), and/or Wide Area Networks (WAN), including the Internet.
The communication network 135 may be one or more of a variety of wired or wireless communication mechanisms, including any desired combination of wired (e.g., cable and fiber) and/or wireless (e.g., cellular, wireless, satellite, microwave, and radio frequency) communication mechanisms and any desired network topology (or topologies when multiple communication mechanisms are utilized). Exemplary communication networks include wireless communication networks (e.g., using Bluetooth®, Bluetooth® Low Energy (BLE), IEEE 802.11, vehicle-to-vehicle (V2V) such as Dedicated Short Range Communication (DSRC), etc.), Local Area Networks (LAN), and/or Wide Area Networks (WAN), including the Internet, that provide data communication services.
The computer 110 may receive and analyze data from the sensors 115 substantially continuously, periodically, and/or when instructed by the server 145, etc. Further, object classification or identification techniques may be used in, for example, computer 110 to identify the type of object (e.g., vehicle, person, rock, pothole, bicycle, motorcycle, etc.) and the physical characteristics of the object based on data from lidar sensor 115, camera sensor 115, etc.
Fig. 2 is a block diagram of an exemplary server 145. The server 145 includes a computer 235 and a communication module 240. The computer 235 includes a processor and a memory. The memory includes one or more forms of computer-readable media and stores instructions executable by the computer 235 for performing various operations, including operations as disclosed herein. The communication module 240 allows the computer 235 to communicate with other devices, such as the vehicle 105.
Fig. 3 illustrates an exemplary domain transfer network 300 that may be implemented as one or more computer programs executable by the computer 110 and/or the server 145. The domain transfer network 300 is a system that, at least during training, transforms input source domain data 302 into output target domain data 342 and transforms input target domain data 304 into output source domain data 362. For example, the domain transfer network 300 may receive a sequence of data in a source domain, such as daytime data, and output a sequence of data in a target domain, such as nighttime data.
The domain transfer network 300 processes the source domain data 302 using one or more source domain low-level encoder neural network layers 310 that are specific to data from the source domain to generate a low-level representation 312 of the input source domain data. For example, the source domain low-level encoder neural network layer 310 is used when encoding data from the source domain, but not when encoding data from the target domain. The low-level representation 312 is the output of the last of the low-level encoder layers.
The domain transfer network 300 then processes the low-level representation 312 using one or more high-level encoder neural network layers 320 shared between data from the source domain and data from the target domain to generate embeddings 322, 324 of the input source domain data 302 and the input target domain data 304, respectively. That is, the high-level encoder neural network layer 320 is used to generate embeddings based on both the source domain data and the target domain data. The embedding 322 may be a vector of probability distributions, where each probability distribution represents one or more potential attributes of the input data. In this context, a vector means an ordered set of values, and potential attributes are features within the input data. For example, a potential attribute of an input image of a person may be a characteristic representing eyes or a nose. In another example, a potential attribute of an input image of a vehicle may be a characteristic representing a tire, a bumper, or a body part.
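One way to realize an embedding whose entries are probability distributions over potential attributes is sketched below; the Gaussian parameterization and all names are assumptions for illustration, not a definitive implementation of the layer 320:

import torch
import torch.nn as nn

class HighLevelEncoder(nn.Module):
    """Maps a low-level representation to a per-attribute Gaussian embedding."""

    def __init__(self, in_dim: int = 256, latent_dim: int = 64):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)      # mean of each potential attribute
        self.logvar = nn.Linear(in_dim, latent_dim)  # log-variance of each attribute

    def forward(self, low_level_rep: torch.Tensor):
        mu = self.mu(low_level_rep)
        logvar = self.logvar(low_level_rep)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # sample the embedding (an ordered set of values)
        return z, mu, logvar

z, mu, logvar = HighLevelEncoder()(torch.randn(2, 256))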
The domain transfer network 300 processes the embedding 322 of the input source domain data using one or more high-level decoder neural network layers 330 shared between data from the source domain and data from the target domain to generate a high-level feature representation 332 of the features of the input source domain data 302. The high-level feature representation is the output of the last of the high-level decoder layers 330.
The domain transfer network 300 then processes the high-level feature representation 332 of the features of the input source domain data using one or more target domain low-level decoder neural network layers 340 that are specific to generating data from the target domain to generate output target domain data 342 that is from the target domain but has similar semantics to the input source domain data 302. Similar semantics means that the target domain data 342 has regions of pixel values corresponding to the same objects as the regions of pixel values in the input source domain data 302. For example, the output target domain data 342 may have a distribution of pixel values that matches pixel values of data from the target domain but with similar semantics to the input source domain data 302, meaning that the output target domain data 342 includes objects that are recognizable by a user as the same objects included in the input source domain data 302. For example, the target domain data 342 may include a trailer having an A-frame trailer tongue corresponding to the A-frame trailer tongue presented in the input source domain data 302.
During training, the domain transfer network 300 may also generate output source domain data 362 from the input target domain data 304, i.e., transform the target domain data into source domain data having similar semantics to the original target domain data.
In the exemplary embodiment, to transform the input target domain data 304, the domain transfer network 300 processes the input target domain data 304 using one or more target domain low-level encoder neural network layers 350 that are specific to data from the target domain to generate a low-level representation 352 of the input target domain data. The target domain low-level encoder neural network layer 350 is used only when encoding data from the target domain, and not when encoding data from the source domain.
The domain transfer network 300 then processes the low-level representation 352 using one or more high-level encoder neural network layers 320 shared between data from the source domain and data from the target domain to generate an embedding 324 of the input target domain data 304.
The domain transfer network 300 processes the embedding 324 of the input target domain data using one or more high-level decoder neural network layers 330 shared between data from the source domain and data from the target domain to generate a high-level feature representation 334 of the features of the target domain data 304. Similar to the embedding 322, the embedding 324 may be a vector of probability distributions, where each probability distribution represents a potential attribute of the input data.
The domain transfer network 300 then processes the high-level feature representation 334 of the features of the input target domain data using one or more source domain low-level decoder neural network layers 360 that are specific to generating data from the source domain to generate output source domain data 362 that is from the source domain but has similar semantics to the input target domain data 304. That is, the output source domain data 362 has a distribution of data values that matches that of data from the source domain but with similar semantics to the input target domain data 304. For example, a source domain image in which one or more objects are depicted may be transformed to the target domain such that the generated target domain image appears to be from the target domain but maintains the semantics of the corresponding source domain image, e.g., depicts the one or more objects. During training, the target domain low-level decoder neural network layer 340 trains with the target domain low-level encoder neural network layer 350 and the source domain low-level encoder neural network layer 310.
FIG. 4 is an exemplary illustration of the domain transfer network 300 receiving input data X_A, X_B and generating output data X'_AA, X'_BA, X'_AB, X'_BB. As shown, the target domain low-level encoder neural network layer 350 may receive the input X_A, and the source domain low-level encoder neural network layer 310 may receive the input X_B. Since the target domain low-level encoder neural network layer 350 and the source domain low-level encoder neural network layer 310 are connected to the source domain low-level decoder neural network layer 360 and the target domain low-level decoder neural network layer 340 via the shared high-level encoder neural network layer 320 and the shared high-level decoder neural network layer 330, the domain transfer network 300 can perform inter-domain and intra-domain transformations. The domain transfer network 300 is shown with the shared high-level encoder neural network layer 320 and the shared high-level decoder network layer 330 having shared weights. The domain transfer network 300 may also operate with a high-level encoder neural network layer 320 and a high-level decoder network layer 330 that do not share weights.
Input data X_A may be data in the source domain, where the subscript "A" represents a first domain of data, e.g., daytime images, and input data X_B may be data in the target domain, where the subscript "B" represents a second domain of data, e.g., nighttime images. Data X'_AA, X'_BA, X'_AB, X'_BB represent output data generated by the domain transfer network 300, where the subscripts each denote an inter-domain or intra-domain transform of the data X'. For example, if the input data is an image depicting one or more objects in a first domain, the domain transfer network 300 may generate images X'_AA, X'_BA, X'_AB, X'_BB depicting the one or more objects in a first domain (X'_AA) (e.g., daytime), a second domain (X'_BB) (e.g., night), a third domain (X'_AB) (e.g., morning), and a fourth domain (X'_BA) (e.g., dusk). Elements Z_AA, Z_AB, Z_BA, Z_BB represent one or more potential attributes generated by the high-level encoder neural network layer 320.
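The four transform paths of FIG. 4 may be sketched, for illustration only, as follows (the convolutional parameterization and all identifiers are assumptions; the mapping of modules to reference numerals 310 through 360 is approximate):

import torch
import torch.nn as nn

class DomainTransferNetwork(nn.Module):
    def __init__(self, channels: int = 3, features: int = 64):
        super().__init__()
        # domain-specific low-level encoders (cf. layers 350 and 310)
        self.enc_low_A = nn.Conv2d(channels, features, 3, padding=1)
        self.enc_low_B = nn.Conv2d(channels, features, 3, padding=1)
        # shared high-level encoder/decoder (cf. layers 320 and 330)
        self.enc_high = nn.Conv2d(features, features, 3, padding=1)
        self.dec_high = nn.Conv2d(features, features, 3, padding=1)
        # domain-specific low-level decoders (cf. layers 360 and 340)
        self.dec_low_A = nn.Conv2d(features, channels, 3, padding=1)
        self.dec_low_B = nn.Conv2d(features, channels, 3, padding=1)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> dict:
        z_a = self.enc_high(self.enc_low_A(x_a))  # potential attributes of X_A
        z_b = self.enc_high(self.enc_low_B(x_b))  # potential attributes of X_B
        h_a, h_b = self.dec_high(z_a), self.dec_high(z_b)
        return {
            "X_AA": self.dec_low_A(h_a),  # intra-domain transform of X_A
            "X_AB": self.dec_low_B(h_a),  # inter-domain transform of X_A
            "X_BA": self.dec_low_A(h_b),  # inter-domain transform of X_B
            "X_BB": self.dec_low_B(h_b),  # intra-domain transform of X_B
        }

outputs = DomainTransferNetwork()(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))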
During training, the domain transfer network 300 may modify the conventional loss function of the domain transfer network 300 such that potential attributes are selected from the same probability distribution. The conventional loss function may determine the loss value by calculating the probability that the result and the ground truth data correspond to the same probability distribution. The techniques described herein add a constraint that the potential attributes be selected from the same probability distribution. By selecting potential attributes from the same probability distribution, features within the input source domain data 302 may be represented in the output target domain data 342. For example, as discussed above, the features may represent portions of an object depicted within the image.
The conventional loss function may be modified by minimizing the conventional maximum mean difference (MMD) between the embeddings 322, 324 (e.g., the potential attributes Z_AA, Z_AB, Z_BA, Z_BB). The maximum mean difference may be defined as a numerical difference between the embeddings 322, 324, such as the difference between the mean of the probability distribution from the embedding 322 and the mean of the probability distribution from the embedding 324. During training of the domain transfer network 300, the maximum mean difference may be used to modify the conventional loss function to update one or more weights within the domain transfer network 300. Equations 1 through 6 show the modification of the conventional loss function (LF) using the maximum mean difference:
L = LF + MMD(Z_AA, Z_BB)    (Equation 1)
L = LF + MMD(Z_AA, Z_AB)    (Equation 2)
L = LF + MMD(Z_AA, Z_BA)    (Equation 3)
L = LF + MMD(Z_AB, Z_BA)    (Equation 4)
L = LF + MMD(Z_AB, Z_BB)    (Equation 5)
L = LF + MMD(Z_BA, Z_BB)    (Equation 6)
Where L is defined as the loss function of the domain transfer network 300, LF is defined as the conventional loss function, and MMD is defined as the maximum mean difference between potential attributes. During training, one or more weights within domain transfer network 300 may be updated using loss function L.
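A minimal sketch of equations 1 through 6, using the simple definition of the MMD given above as a difference between the mean statistics of two embeddings (a kernel-based MMD estimator could be substituted; the function names are illustrative, not from this disclosure):

import torch

def mmd(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    # squared difference between the means of two batches of embeddings
    return (z1.mean(dim=0) - z2.mean(dim=0)).pow(2).sum()

def modified_loss(lf: torch.Tensor, z_x: torch.Tensor, z_y: torch.Tensor) -> torch.Tensor:
    # e.g., equation 1: L = LF + MMD(Z_AA, Z_BB); equations 2-6 pair the
    # remaining potential attributes in the same way
    return lf + mmd(z_x, z_y)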
As shown in FIG. 5, in some embodiments, the domain transfer network 300 may include a discriminator 502. The output of the discriminator 502 may be used to update the conventional loss function during training so that potential attributes are selected from the same probability distribution during data generation. The discriminator 502 may receive the potential attributes Z_AA, Z_AB, Z_BA, Z_BB, evaluate the potential attributes Z_AA, Z_AB, Z_BA, Z_BB, and generate a prediction indicating whether a potential attribute Z_AA, Z_AB, Z_BA, Z_BB corresponds to a particular domain. In various embodiments, the discriminator 502 may include one or more Convolution-BatchNorm-ReLU layers, as discussed below.
As shown in FIGS. 6A-6D, the discriminator 502 can receive a potential attribute and output a multidimensional vector 602 that classifies the corresponding potential attribute into a particular domain. In the example shown in FIGS. 6A-6D, the multidimensional vector 602 is a four-dimensional vector that includes a prediction label indicating which domain the corresponding potential attribute came from. For example, the discriminator 502 generates a "1" within the element of the multidimensional vector 602 corresponding to the predicted domain and a "0" within the elements corresponding to all other domains. In some implementations, the multidimensional vector 602 may be a two-dimensional vector. In these embodiments, the discriminator 502 generates a prediction between the first domain and the second domain.
Equation 7 shows the loss function of the discriminator 502 that generates a four-dimensional vector:

L_D = -Σ ŷ log D(Z), summed over Z ∈ {Z_AA, Z_AB, Z_BA, Z_BB}    (Equation 7)

where L_D is defined as the loss function of the discriminator 502, ŷ is defined as a label of the corresponding domain, log D is defined as the discriminator 502's estimate of the probability that the potential attribute corresponds to the particular domain, and Z_AA, Z_AB, Z_BA, Z_BB are defined as the predicted domain outputs from the discriminator 502, e.g., "0" or "1". Equation 8 shows the loss function of a discriminator generating a two-dimensional vector:
L_D = -Σ ŷ log D(Z), summed over Z ∈ {Z_AA, Z_BB}    (Equation 8)

where L_D is defined as the loss function of the discriminator 502, ŷ is defined as a label of the corresponding domain, log D is defined as the discriminator 502's estimate of the probability that the potential attribute corresponds to the particular domain, and Z_AA, Z_BB are defined as the predicted domain outputs from the discriminator 502, e.g., "0" or "1". The loss function L_D may be used to update the weights of the discriminator 502 and/or the conventional loss function of the domain transfer network 300.
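For illustration, equation 7 may be computed as a standard cross-entropy between the one-hot domain label ŷ and the discriminator prediction D(Z), summed over the four potential attributes; this concrete form and all names are assumptions (the discriminator here is assumed to return unnormalized scores, with the softmax folded into the cross-entropy):

import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, latents: dict) -> torch.Tensor:
    """latents maps a domain index (0..3 for Z_AA, Z_AB, Z_BA, Z_BB) to a
    batch of potential attributes of shape (batch, features)."""
    loss = torch.zeros(())
    for domain_idx, z in latents.items():
        scores = discriminator(z)  # per-domain scores for each latent in the batch
        labels = torch.full((z.shape[0],), domain_idx, dtype=torch.long)
        loss = loss + F.cross_entropy(scores, labels)  # -y_hat * log D(Z)
    return loss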
Fig. 7 is a block diagram showing an example of the discriminator 502. The discriminator architecture is assumed to use standard components internally (e.g., Conv-BatchNorm-ReLU). Conv-BatchNorm-ReLU is a technique for training a neural network that treats the latent variables between convolutional layers of the neural network by performing batch normalization (BatchNorm) to avoid problems caused by wide variations in input data values and performing rectified linear activation (ReLU) to accelerate convergence during training. The latent variables may be batched together by saving results from a series of training runs and normalizing them, without changing the output results, so that the batched latent variables have zero mean and unit standard deviation. The normalized latent variables may then be passed through a ReLU, which outputs positive values directly to the subsequent stage and converts negative values to zero.
The discriminator 502 may include a number of different types of layers based on connectivity and weight sharing. As shown in FIG. 7, the discriminator 502 may include one or more convolutional layers (CONV) 702, one or more batch normalization layers (BATCHNORM) 704, and one or more rectified linear unit layers (ReLU) 706. The convolutional layer 702 may include one or more convolution filters applied to the potential attributes. The filtered potential attributes may be provided to the batch normalization layer 704, which normalizes the filtered potential attributes. The normalized filtered potential attributes may be provided to the rectified linear unit layer 706, which includes an activation function, e.g., a piecewise linear function, that generates an output based on the normalized filtered potential attributes. The output of the rectified linear unit layer 706 may be provided as input to a softmax layer 708 to generate a predicted domain. The softmax layer 708 is the last layer of the discriminator 502 and is an activation function that generates the probability of the domain to which the normalized filtered potential attribute belongs. Although only a single convolutional layer 702, batch normalization layer 704, rectified linear unit layer 706, and softmax layer 708 are shown, the discriminator 502 may include additional layers, depending on the implementation of the discriminator 502.
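A sketch of the Conv-BatchNorm-ReLU discriminator of FIG. 7, with illustrative layer sizes (only the layer ordering, i.e., convolution 702, batch normalization 704, ReLU 706, and softmax 708, follows the description above):

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_channels: int = 64, n_domains: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, stride=2, padding=1),  # CONV 702
            nn.BatchNorm2d(128),                                              # BATCHNORM 704
            nn.ReLU(inplace=True),                                            # ReLU 706
        )
        self.classify = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(128, n_domains),
            nn.Softmax(dim=1),  # SOFTMAX 708: probability of each candidate domain
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.classify(self.features(z))

prediction = Discriminator()(torch.randn(2, 64, 16, 16))  # cf. multidimensional vector 602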
The techniques described herein may improve the training of deep neural networks by allowing single-sample or few-sample training. As discussed above, generating a training data set for a deep neural network may require acquiring thousands of exemplary input images along with corresponding ground truth. The training data set should include multiple examples of all of the different types of objects to be recognized, in examples of all of the environmental conditions expected during operation of the deep neural network. For example, a deep neural network may be trained to identify and locate vehicle trailers. During training, all types and configurations of trailers that will be encountered while operating the vehicle should be included in the training data set. Furthermore, each type and configuration of trailer in the training data set should be included in each different environmental condition that would be encountered when operating the deep neural network. Environmental conditions include weather and lighting conditions such as rain, snow, fog, bright sunlight, night, and the like.
Acquiring a training data set that includes multiple images of objects of all of the types and configurations to be identified and located under all types of environmental conditions can be expensive and time consuming. The techniques discussed herein may improve the training of the deep neural network by allowing a single input image to be used to train the deep neural network, modifying the input image to simulate different types of objects in different configurations under different environmental conditions. For example, the vehicle may acquire an image of a new type or configuration of object during operation. A single image of the object may be transferred back to the server computer and used to retrain the deep neural network by simulating multiple orientations and positions of the new object based on previously acquired training images including ground truth. The techniques described herein allow deep neural networks to be trained using a limited training data set, thereby saving time and expense.
The techniques described herein may be applied to deep neural networks that process image data, video data, and human speech data. Images may be processed using convolutional neural networks. A convolutional neural network includes convolutional layers that encode an image to form latent variables, which can be decoded by convolutional layers that reconstruct the latent variables to form output image data. Recurrent neural networks may be used to process video data and human speech data. A recurrent neural network includes memory that stores results from a plurality of previously decoded layers and previously encoded layers to combine with the current encoded layer and decoded layer. In an exemplary recurrent neural network that processes video data, the encoding and decoding layers may include convolutional layers. In an exemplary recurrent neural network that processes human speech, the encoding and decoding layers may be fully connected layers. An exemplary convolutional neural network or recurrent neural network may be configured as a domain transfer network including a loss function and discriminator as described above with respect to FIGS. 3-7, allowing the network to transfer image data, video data, or human speech data from a source domain to a target domain. This improves the ability to train neural networks by increasing the amount of training data without additionally acquiring real world data and corresponding ground truth, thereby saving time and expense.
Fig. 8 illustrates an exemplary Deep Neural Network (DNN) 800 that may perform the functions described above and herein. For example, domain transfer network 300 may be implemented as one or more DNNs 800. For example, DNN 800 may be a software program that may be loaded into memory and executed by a processor included in computer 110 or server 145. In an exemplary embodiment, DNN 800 may include, but is not limited to, convolutional Neural Networks (CNNs), R-CNNs (regions with CNN features), fast R-CNNs, faster R-CNNs, and Recurrent Neural Networks (RNNs). DNN 800 includes a plurality of nodes 805, and nodes 805 are arranged such that DNN 800 includes an input layer, one or more hidden layers, and an output layer. Each layer of DNN 800 may include a plurality of nodes 805. Although fig. 8 shows three (3) hidden layers, it should be understood that DNN 800 may include additional or fewer hidden layers. The input and output layers may also include more than one (1) node 805.
Nodes 805 are sometimes referred to as artificial neurons 805 because they are designed to emulate biological (e.g., human) neurons. A set of inputs (represented by arrows) for each neuron 805 are each multiplied by a corresponding weight. The weighted inputs may then be summed in an input function to provide a net input with adjustments made through the bias. The net input may then be provided to an activation function, which in turn provides an output for the connected neuron 805. The activation function may be a variety of suitable functions that are typically selected based on empirical analysis. As indicated by the arrows in fig. 8, the output of the neuron 805 may then be provided to be included in a set of inputs to one or more neurons 805 in a next layer.
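The node computation just described may be sketched as follows (illustrative only; the ReLU activation is one of the suitable functions mentioned above):

import torch

def neuron(inputs: torch.Tensor, weights: torch.Tensor, bias: float) -> torch.Tensor:
    net_input = (inputs * weights).sum() + bias  # weighted sum adjusted by the bias
    return torch.relu(net_input)                 # activation provides the neuron's output

output = neuron(torch.tensor([0.5, -1.0, 2.0]), torch.tensor([0.1, 0.4, -0.2]), 0.05)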
DNN 800 may be trained to accept data as input, for example, from the vehicle 105 CAN bus, sensors, or other network, and generate a distribution of possible outputs based on the input. DNN 800 may be trained with ground truth data, i.e., data regarding real world conditions or states. For example, DNN 800 may be trained with ground truth data or updated with additional data by the processor of the server 145. DNN 800 may be transmitted to the vehicle 105 via the network 135. The weights may be initialized, for example, by using a Gaussian distribution, and a bias of each node 805 may be set to zero. Training DNN 800 may include updating weights and biases via suitable techniques, such as backpropagation with optimization. Ground truth data may include, but is not limited to, data specifying an object within the data or data specifying a physical parameter, e.g., a speed, distance, or angle of an object relative to another object.
Fig. 9 is a flow chart of an exemplary process 900 for generating target domain data from source domain data. The blocks of process 900 may be performed by a processor of computer 110 and/or server 145. Process 900 may begin at block 905 with receiving data, such as an image, from a source domain. At block 910, source domain data is processed using one or more source domain low-level encoder neural network layers that are specific to the data from the source domain to generate a low-level representation of the input source domain data.
At block 915, the low-level representation is processed using one or more high-level encoder neural network layers shared between data from the source domain and data from the target domain to generate an embedding, e.g., potential attributes, of the input source domain data. At block 920, the embedding of the input source domain data is processed using one or more high-level decoder neural network layers shared between data from the source domain and data from the target domain to generate a high-level feature representation of the features of the input source domain data.
At block 925, the high-level feature representation of the features of the input source domain data is processed using one or more target domain low-level decoder neural network layers that are specific to generating data from the target domain to generate output target domain data from the target domain but having similar semantics to the input source domain data.
At block 930, one or more weights of the domain transfer network are updated based on the loss function. In an exemplary embodiment, the maximum mean difference between the various embeddings is calculated. In this embodiment, the maximum mean difference is used to modify the conventional loss function. In another exemplary embodiment, the domain transfer network includes a discriminator, and the discriminator generates a prediction of which domain an embedding belongs to. The prediction may be compared to ground truth data, and the loss function of the discriminator may be updated based on the comparison. Additionally or alternatively, the weights of the domain transfer network may be updated based on the comparison. The domain transfer network may update its weights according to an update rule (e.g., the ADAM update rule or a stochastic gradient descent (SGD) update rule). Process 900 then ends.
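For illustration, one training step of process 900 might look as follows, assuming the DomainTransferNetwork sketch given earlier; the reconstruction loss, the mean-difference form of the MMD, and all names are assumptions rather than the claimed method:

import torch
import torch.nn.functional as F

def train_step(net, optimizer, x_a: torch.Tensor, x_b: torch.Tensor) -> float:
    out = net(x_a, x_b)  # blocks 905-925: forward pass producing X'_AA ... X'_BB
    z_a = net.enc_high(net.enc_low_A(x_a)).flatten(1)  # embedding of the source batch
    z_b = net.enc_high(net.enc_low_B(x_b)).flatten(1)  # embedding of the target batch
    conventional = F.mse_loss(out["X_AA"], x_a) + F.mse_loss(out["X_BB"], x_b)
    mmd_term = (z_a.mean(dim=0) - z_b.mean(dim=0)).pow(2).sum()
    loss = conventional + mmd_term  # L = LF + MMD, in the style of equation 1
    optimizer.zero_grad()
    loss.backward()   # backpropagate the modified loss
    optimizer.step()  # block 930: weight update, e.g., the ADAM update rule
    return loss.item()

# usage: net = DomainTransferNetwork()
#        optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)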
Fig. 10 is a flow chart of an exemplary process 1000 for generating source domain data from target domain data. The blocks of process 1000 may be performed by a processor of the computer 110 and/or the server 145. Process 1000 may begin at block 1005, where data from a target domain is received. At block 1010, the target domain data is processed using one or more target domain low-level encoder neural network layers that are specific to data from the target domain to generate a low-level representation of the input target domain data. At block 1015, the low-level representation is processed using one or more high-level encoder neural network layers shared between data from the source domain and data from the target domain to generate an embedding of the input target domain data.
At block 1020, the embedding of the input target domain data is processed using one or more high-level decoder neural network layers shared between data from the source domain and data from the target domain to generate a high-level feature representation of the features of the input target domain data. At block 1025, the high-level feature representation of the features of the input target domain data is processed using one or more source domain low-level decoder neural network layers that are specific to generating data from the source domain to generate output source domain data that is from the source domain but has similar semantics to the input target domain data.
At block 1030, one or more weights of the domain transfer network are updated based on the loss function. In an exemplary embodiment, the maximum mean difference between the various embeddings is calculated. In this embodiment, the maximum mean difference is used to modify the conventional loss function. In another exemplary embodiment, the domain transfer network includes a discriminator, and the discriminator generates a prediction of which domain an embedding belongs to. The prediction may be compared to ground truth data, and the loss function of the discriminator may be updated based on the comparison. Additionally or alternatively, the weights of the domain transfer network may be updated based on the comparison. The domain transfer network may update its weights according to an update rule (e.g., the ADAM update rule or a stochastic gradient descent (SGD) update rule).
In general, the described computing systems and/or devices may employ any of a variety of computer operating systems, including, but in no way limited to, versions and/or varieties of the Ford SYNC® application, the AppLink/Smart Device Link middleware, the Microsoft Automotive® operating system, the Microsoft Windows® operating system, the Unix operating system (e.g., the Solaris® operating system distributed by Oracle Corporation of Redwood Shores, California), the AIX UNIX operating system distributed by International Business Machines of Armonk, New York, the Linux operating system, the Mac OSX and iOS operating systems distributed by Apple Inc. of Cupertino, California, the BlackBerry OS distributed by BlackBerry, Ltd. of Waterloo, Canada, the Android operating system developed by Google, Inc. and the Open Handset Alliance, or the QNX® CAR Platform for Infotainment. Examples of computing devices include, but are not limited to, an in-vehicle computer, a computer workstation, a server, a desktop, notebook, laptop, or handheld computer, or some other computing system and/or device.
Computers and computing devices generally include computer-executable instructions that may be executed by one or more computing devices such as those listed above. Computer-executable instructions may be compiled or interpreted from computer programs created using a variety of programming languages and/or technologies, including, without limitation, and either alone or in combination, Java™, C, C++, Matlab, Simulink, Stateflow, Visual Basic, JavaScript, Perl, HTML, etc. Some of these applications may be compiled and executed on a virtual machine, such as the Java Virtual Machine, the Dalvik virtual machine, or the like. In general, a processor (e.g., a microprocessor) receives instructions, e.g., from a memory, a computer-readable medium, etc., and executes these instructions, thereby performing one or more processes, including one or more of the processes described herein. Such instructions and other data may be stored and transmitted using a variety of computer-readable media. A file in a computing device is generally a collection of data stored on a computer-readable medium, such as a storage medium, a random access memory, etc.
The memory may include computer-readable media (also referred to as processor-readable media) including any non-transitory (e.g., tangible) media that participates in providing data (e.g., instructions) that may be read by a computer (e.g., by a processor of a computer). Such a medium may take many forms, including but not limited to, non-volatile media and volatile media. Non-volatile media may include, for example, optical or magnetic disks, and other persistent memory. Volatile media may include, for example, dynamic Random Access Memory (DRAM), which typically constitutes a main memory. Such instructions may be transmitted by one or more transmission media, including coaxial cables, copper wire and fiber optics, including the wires that comprise a system bus coupled to the processor of the ECU. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, a flash EEPROM, any other memory chip or cartridge, or any other medium from which a computer can read.
Databases, data repositories, or other data stores described herein may include various mechanisms for storing, accessing, and retrieving various kinds of data, including a hierarchical database, a set of files in a file system, an application database in a proprietary format, a relational database management system (RDBMS), and so on. Each such data store is generally included within a computing device employing a computer operating system, such as one of those mentioned above, and is accessed via a network in any one or more of a variety of manners. A file system may be accessible from a computer operating system and may include files stored in various formats. An RDBMS generally employs the Structured Query Language (SQL) in addition to a language for creating, storing, editing, and executing stored procedures, such as the PL/SQL language.
In some examples, system elements may be implemented as computer-readable instructions (e.g., software) on one or more computing devices (e.g., servers, personal computers, etc.), stored on a computer-readable medium (e.g., disk, memory, etc.) associated therewith. The computer program product may include such instructions stored on a computer-readable medium for performing the functions described herein.
With respect to the media, processes, systems, methods, heuristics, etc. described herein, it should be understood that, while the steps of such processes, etc. have been described as occurring in a certain ordered sequence, such processes may be practiced by executing the steps in an order different than that described herein. It should also be understood that certain steps may be performed concurrently, other steps may be added, or certain steps described herein may be omitted. In other words, the description of the processes herein is provided for the purpose of illustrating certain embodiments and should not be construed as limiting the claims in any way.
Accordingly, it is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the invention should be determined, not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. It is contemplated and anticipated that future developments will occur in the arts discussed herein, and that the disclosed systems and methods will be incorporated into such future embodiments. In summary, it is to be understood that the invention is capable of modification and variation and is limited only by the following claims.
Unless explicitly indicated to the contrary herein, all terms used in the claims are intended to be given their ordinary and customary meaning as understood by those skilled in the art. In particular, the use of singular articles such as "a," "an," "the," and the like are to be construed to recite one or more of the indicated elements unless a claim recites an explicit limitation to the contrary.
According to the present invention, there is provided a system having a computer including a processor and a memory, the memory storing instructions executable by the processor to cause the processor to: generate a low-level representation of input source domain data by processing the source domain data using a source domain low-level encoder neural network layer that is specific to data from the source domain; generate an embedding of the input source domain data by processing the low-level representation using a high-level encoder neural network layer shared between data from the source domain and data from a target domain; generate a high-level feature representation of features of the input source domain data by processing the embedding using a high-level decoder neural network layer shared between data from the source domain and data from the target domain; generate output target domain data including semantics corresponding to the input source domain data in the target domain by processing the high-level feature representation using a target domain low-level decoder neural network layer that is specific to data from the target domain; and modify a loss function such that potential attributes corresponding to the embeddings are selected from the same probability distribution.
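The following is a sketch of the layer layout described above, assuming 1-D data and small fully connected layers for brevity; a convolutional or recurrent variant (per the embodiments below) would follow the same pattern, and all dimensions and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DomainTransferNet(nn.Module):
    """Per-domain low-level layers around shared high-level layers."""
    def __init__(self, dim=64, hidden=32, latent=16):
        super().__init__()
        self.src_low_enc = nn.Linear(dim, hidden)         # source-specific encoder
        self.tgt_low_enc = nn.Linear(dim, hidden)         # target-specific encoder
        self.shared_high_enc = nn.Linear(hidden, latent)  # shared high-level encoder
        self.shared_high_dec = nn.Linear(latent, hidden)  # shared high-level decoder
        self.src_low_dec = nn.Linear(hidden, dim)         # source-specific decoder
        self.tgt_low_dec = nn.Linear(hidden, dim)         # target-specific decoder

    def source_to_target(self, x):
        low = torch.relu(self.src_low_enc(x))        # low-level representation
        z = self.shared_high_enc(low)                # embedding
        feats = torch.relu(self.shared_high_dec(z))  # high-level feature representation
        return self.tgt_low_dec(feats)               # output in the target domain

    def target_to_source(self, x):                   # reverse path, described below
        low = torch.relu(self.tgt_low_enc(x))
        z = self.shared_high_enc(low)
        feats = torch.relu(self.shared_high_dec(z))
        return self.src_low_dec(feats)

net = DomainTransferNet()
out = net.source_to_target(torch.randn(4, 64))  # target-domain data, source semantics
```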
According to one embodiment, the low-level encoder neural network layer, the high-level encoder neural network layer, the low-level decoder neural network layer, and the high-level decoder neural network layer are included in a convolutional neural network.
According to one embodiment, the low-level encoder neural network layer, the high-level encoder neural network layer, the low-level decoder neural network layer, and the high-level decoder neural network layer are included in a recurrent neural network.
According to one embodiment, the input source domain and the target domain include image data, video data, and human voice data.
According to one embodiment, the processor is further programmed to modify the loss function by calculating a maximum mean discrepancy (MMD) between a first potential attribute corresponding to the source domain and a second potential attribute corresponding to the target domain.
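A minimal sketch of an MMD penalty between source and target latents follows; the RBF kernel and bandwidth are assumptions, since the patent does not fix a kernel:

```python
import torch

def mmd_rbf(z_src, z_tgt, sigma=1.0):
    """Maximum mean discrepancy with an RBF kernel between a batch of
    source-domain latents and a batch of target-domain latents (rows
    are samples)."""
    k = lambda a, b: torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(z_src, z_src).mean() + k(z_tgt, z_tgt).mean() - 2 * k(z_src, z_tgt).mean()

# Added to the usual loss so both sets of latents are pulled toward the
# same distribution, e.g.: total = task_loss + lam * mmd_rbf(z_src, z_tgt)
penalty = mmd_rbf(torch.randn(8, 16), torch.randn(8, 16))
```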
According to one embodiment, the processor is further programmed to modify the loss function based on a prediction from the discriminator, wherein the prediction indicates a domain corresponding to the potential attribute.
According to one embodiment, the discriminator includes one or more convolutional layers, one or more batch normalization layers, and one or more rectified linear unit (ReLU) layers.
According to one embodiment, the last layer of the discriminator comprises a softmax layer.
According to one embodiment, the discriminator generates a multidimensional vector representing the prediction.
According to one embodiment, the multi-dimensional vector comprises four-dimensional vectors corresponding to four domains.
According to one embodiment, the multi-dimensional vector comprises two-dimensional vectors corresponding to two domains.
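A sketch matching the discriminator layers listed above (convolution, batch normalization, and ReLU blocks ending in a softmax over four domains) is given below; the channel counts and the 2-D latent shape are assumptions:

```python
import torch
import torch.nn as nn

class LatentDiscriminator(nn.Module):
    def __init__(self, in_ch=16, n_domains=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # collapse spatial dimensions
        )
        self.classify = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, n_domains),
            nn.Softmax(dim=1),         # last layer is a softmax layer
        )

    def forward(self, z):
        return self.classify(self.features(z))  # one probability per domain

probs = LatentDiscriminator()(torch.randn(2, 16, 8, 8))  # probs.shape == (2, 4)
```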
According to one embodiment, the loss function of the discriminator comprises:

$$L_D = -\sum_{i \in \{AA,\, AB,\, BA,\, BB\}} \hat{y}_i \log D(Z_i)$$

where $L_D$ is defined as the loss function, $\hat{y}$ is defined as the label of the corresponding domain, $\log D$ is defined as an estimate of the probability that the potential attribute corresponds to the particular domain, and $Z_{AA}$, $Z_{AB}$, $Z_{BA}$, $Z_{BB}$ are defined as the predicted domain outputs.
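Written out as code, the loss above is a cross-entropy summed over the four predicted domain outputs; the batch size and the stand-in predictions here are assumptions:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(pred_probs, one_hot_labels):
    """L_D over the four predicted domain outputs: -sum of y_hat * log D(Z)."""
    loss = torch.zeros(())
    for probs, y_hat in zip(pred_probs, one_hot_labels):
        loss = loss - (y_hat * torch.log(probs + 1e-8)).sum(dim=1).mean()
    return loss

# Stand-ins for Z_AA, Z_AB, Z_BA, Z_BB after the discriminator's softmax.
preds = [F.softmax(torch.randn(8, 4), dim=1) for _ in range(4)]
labels = [F.one_hot(torch.full((8,), d), num_classes=4).float() for d in range(4)]
l_d = discriminator_loss(preds, labels)
```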
According to one embodiment, the processor is further programmed to: generate a low-level representation of input target domain data by processing the input target domain data using a target domain low-level encoder neural network layer that is specific to data from the target domain; generate an embedding of the input target domain data by processing the low-level representation using the high-level encoder neural network layer shared between data from the source domain and data from the target domain; generate a high-level feature representation of features of the input target domain data by processing the embedding using the high-level decoder neural network layer shared between data from the source domain and data from the target domain; and generate output source domain data including semantics corresponding to the input target domain data by processing the high-level feature representation using a source domain low-level decoder neural network layer that is specific to data from the source domain.
According to the invention, a method comprises: generating a low-level representation of input source domain data by processing the source domain data using a source domain low-level encoder neural network layer that is specific to data from the source domain; generating an embedding of the input source domain data by processing the low-level representation using a high-level encoder neural network layer shared between data from the source domain and data from a target domain; generating a high-level feature representation of features of the input source domain data by processing the embedding using a high-level decoder neural network layer shared between data from the source domain and data from the target domain; generating output target domain data including semantics corresponding to the input source domain data in the target domain by processing the high-level feature representation using a target domain low-level decoder neural network layer that is specific to data from the target domain; and modifying a loss function such that potential attributes corresponding to the embeddings are selected from the same probability distribution.
In one aspect of the invention, the low-level encoder neural network layer, the high-level encoder neural network layer, the low-level decoder neural network layer, and the high-level decoder neural network layer are included in a convolutional neural network.
In one aspect of the invention, the low-level encoder neural network layer, the high-level encoder neural network layer, the low-level decoder neural network layer, and the high-level decoder neural network layer are included in a recurrent neural network.
In one aspect of the invention, the input source domain and the target domain include image data, video data, and human voice data.
In one aspect of the invention, the method includes modifying the loss function by calculating a maximum mean discrepancy (MMD) between a first potential attribute corresponding to the source domain and a second potential attribute corresponding to the target domain.
In one aspect of the invention, the method includes modifying the loss function based on a prediction from the discriminator, wherein the prediction indicates a domain corresponding to the potential attribute.
In one aspect of the invention, the discriminator includes one or more convolutional layers, one or more batch normalization layers, and one or more rectified linear unit (ReLU) layers.

Claims (15)

1. A method, comprising:
generating a low-level representation of input source domain data by processing the source domain data using a source domain low-level encoder neural network layer that is specific to data from the source domain;
generating an embedding of the input source domain data by processing the low-level representation using a high-level encoder neural network layer shared between data from the source domain and data from a target domain;
generating a high-level feature representation of features of the input source domain data by processing the embedding using a high-level decoder neural network layer shared between data from the source domain and data from the target domain;
generating output target domain data including semantics corresponding to the input source domain data in the target domain by processing the high-level feature representation using a target domain low-level decoder neural network layer that is specific to data from the target domain; and
modifying a loss function such that potential attributes corresponding to the embedding are selected from the same probability distribution.
2. The method of claim 1, wherein the low-level encoder neural network layer, the high-level encoder neural network layer, the low-level decoder neural network layer, and the high-level decoder neural network layer are included in a convolutional neural network.
3. The method of claim 1, wherein the low-level encoder neural network layer, the high-level encoder neural network layer, the low-level decoder neural network layer, and the high-level decoder neural network layer are included in a recurrent neural network.
4. The method of claim 1, wherein the input source domain and the target domain comprise image data, video data, and human voice data.
5. The method of claim 1, further comprising modifying the loss function by calculating a maximum mean discrepancy (MMD) between a first potential attribute corresponding to a source domain and a second potential attribute corresponding to a target domain.
6. The method of claim 1, further comprising modifying the loss function based on a prediction from a discriminator, wherein the prediction indicates a domain corresponding to a potential attribute.
7. The method of claim 6, wherein the discriminator comprises one or more convolutional layers, one or more batch normalization layers, and one or more rectified linear unit (ReLU) layers.
8. The method of claim 7, wherein the last layer of the discriminator comprises a softmax layer.
9. The method of claim 6, wherein the discriminator generates a multidimensional vector representing the prediction.
10. The method of claim 9, wherein the multi-dimensional vector comprises a four-dimensional vector corresponding to four domains.
11. The method of claim 9, wherein the multi-dimensional vector comprises a two-dimensional vector corresponding to two domains.
12. The method of claim 6, wherein the loss function of the discriminator comprises:

$$L_D = -\sum_{i \in \{AA,\, AB,\, BA,\, BB\}} \hat{y}_i \log D(Z_i)$$

where $L_D$ is defined as the loss function, $\hat{y}$ is defined as the label of the corresponding domain, $\log D$ is defined as an estimate of the probability that the potential attribute corresponds to the particular domain, and $Z_{AA}$, $Z_{AB}$, $Z_{BA}$, $Z_{BB}$ are defined as the predicted domain outputs.
13. The method of claim 1, further comprising: generating a low-level representation of input target domain data by processing the input target domain data using a target domain low-level encoder neural network layer that is specific to data from the target domain;
generating an embedding of the input target domain data by processing the low-level representation using the high-level encoder neural network layer shared between data from the source domain and data from the target domain;
generating a high-level feature representation of features of the input target domain data by processing the embedding using the high-level decoder neural network layer shared between data from the source domain and data from the target domain; and
generating output source domain data including semantics corresponding to the input target domain data by processing the high-level feature representation using a source domain low-level decoder neural network layer that is specific to data from the source domain.
14. The method of claim 13, wherein the low-level encoder neural network layer, the high-level encoder neural network layer, the low-level decoder neural network layer, and the high-level decoder neural network layer are included in a convolutional neural network.
15. A system comprising a computer programmed to perform the method of any one of claims 1 to 14.
CN202210502660.4A 2022-05-10 2022-05-10 Generation domain adaptation in neural networks Pending CN117095266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210502660.4A CN117095266A (en) 2022-05-10 2022-05-10 Generation domain adaptation in neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210502660.4A CN117095266A (en) 2022-05-10 2022-05-10 Generation domain adaptation in neural networks

Publications (1)

Publication Number Publication Date
CN117095266A (en) 2023-11-21

Family

ID=88770447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210502660.4A Pending CN117095266A (en) 2022-05-10 2022-05-10 Generation domain adaptation in neural networks

Country Status (1)

Country Link
CN (1) CN117095266A (en)

Similar Documents

Publication Publication Date Title
US11480972B2 (en) Hybrid reinforcement learning for autonomous driving
US11107228B1 (en) Realistic image perspective transformation using neural networks
US11657635B2 (en) Measuring confidence in deep neural networks
US11100372B2 (en) Training deep neural networks with synthetic images
CN113298250A (en) Neural network for localization and object detection
US20230153623A1 (en) Adaptively pruning neural network systems
CN114119625A (en) Segmentation and classification of point cloud data
US20210264284A1 (en) Dynamically routed patch discriminator
US20220188621A1 (en) Generative domain adaptation in a neural network
US20230162480A1 (en) Frequency-based feature constraint for a neural network
CN116168210A (en) Selective culling of robust features for neural networks
US11620475B2 (en) Domain translation network for performing image translation
US20230192118A1 (en) Automated driving system with desired level of driving aggressiveness
US11068749B1 (en) RCCC to RGB domain translation with deep neural networks
US20220172062A1 (en) Measuring confidence in deep neural networks
US10977783B1 (en) Quantifying photorealism in simulated data with GANs
CN117095266A (en) Generation domain adaptation in neural networks
US20210209381A1 (en) Temporal cnn rear impact alert system
US20210103800A1 (en) Certified adversarial robustness for deep reinforcement learning
US20230316728A1 (en) Robust neural network learning system
US11321587B2 (en) Domain generation via learned partial domain translations
US20240227845A1 (en) System for motion planning with natural language command interpretation
US20230139521A1 (en) Neural network validation system
DE102022111402A1 (en) GENERATIVE DOMAIN ADAPTATION IN A NEURONAL NETWORK
CN117115625A (en) Unseen environmental classification

Legal Events

Date Code Title Description
PB01 Publication