US20240202533A1 - Generating artificial video with changed domain - Google Patents

Generating artificial video with changed domain

Info

Publication number
US20240202533A1
Authority
US
United States
Prior art keywords
encoder
computer
generator
video
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US18/065,672
Inventor
Akhil Perincherry
Arpita Chand
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ford Global Technologies LLC
Original Assignee
Ford Global Technologies LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ford Global Technologies LLC
Priority to US 18/065,672
Assigned to FORD GLOBAL TECHNOLOGIES, LLC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: Chand, Arpita; Perincherry, Akhil
Priority to DE102023133614.2A (published as DE102023133614A1)
Priority to CN202311626240.8A (published as CN118200462A)
Publication of US20240202533A1
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/776Validation; Performance evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778Active pattern-learning, e.g. online learning of image or video features
    • G06V10/7784Active pattern-learning, e.g. online learning of image or video features based on feedback from supervisors
    • G06V10/7788Active pattern-learning, e.g. online learning of image or video features based on feedback from supervisors the supervisor being a human, e.g. interactive learning with a human teacher
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads

Definitions

  • Video can be acquired by image sensors such as cameras and processed using a computer to determine data regarding objects in an environment around a system. Operation of a sensing system can include acquiring accurate and timely data regarding objects in the system's environment.
  • a computer can acquire video from one or more image sensors that can be processed to determine locations of objects.
  • Object location data extracted from video can be used by a computer to operate systems including vehicles, robots, security, and object-tracking systems.
  • Machine-learning algorithms can be used on board vehicles to operate advanced driver assistance systems (ADAS) or perform autonomous operation based on detecting objects in video, e.g., taken by cameras on board vehicles as the vehicles are driving.
  • FIG. 1 is a block diagram of an example system for collecting and analyzing video data.
  • FIG. 2 is a diagram of an example machine-learning architecture for generating an artificial video.
  • FIG. 5 is a diagram of another example training architecture for training the machine-learning architecture to generate the artificial video.
  • This disclosure presents techniques for generating an artificial output video based on an input video.
  • the output video has the same semantic content as the input video.
  • the semantic content in both the input video and the output video may show the point of view of a vehicle traveling along a particular stretch of road, with the same other vehicles, surrounding environment, etc.
  • the output video has a changed domain from the input video.
  • the domain may be changed from daytime to nighttime, clear to rainy, low-resolution to high-resolution, red-clear-clear-clear (RCCC) to red-green-blue (RGB), etc. Because the output video is artificial, i.e., synthetic, video in the new domain does not need to be manually collected.
  • Having a dataset of videos in different domains can be useful for training a machine-learning algorithm for tasks such as object recognition that can depend on interpreting domain data.
  • the techniques herein can provide diversity in training data and targeting of particular scenarios, both of which may contribute to a trained machine-learning algorithm with an enhanced ability to generalize.
  • An issue that can arise when generating the output video is maintaining a temporal consistency with the input video, e.g., ensuring that events occur or objects move at the same time in the output video as in the input video.
  • audio data is used to supervise the generation of the output video in order to maintain the temporal consistency.
  • the audio can be generated contemporaneously with the input video.
  • video refers to sequential visual data, not to audio data.
  • the audio data provides a way to track events in a manner that generally excludes domain data in order to conform the output video to the same temporal sequence as the input video.
  • a computer is programmed to receive an input video of a scene in a first domain and audio data associated with the input video, execute an encoder to map the input video and the audio data to a latent vector in a lower-dimensional latent space, and execute a generator to generate an output video of the scene in a second domain from the latent vector.
  • the encoder and the generator are trained to maintain temporal consistency between the input video and the output video by using the audio data.
  • the use of the latent vector helps to strip away the domain data while preserving the semantic content of the input video.
  • a computer includes a processor and a memory, and the memory stores instructions executable by the processor to receive an input video of a scene and audio data associated with the input video, the input video being in a first domain, execute an encoder to map the input video and the audio data to a latent vector in a lower-dimensional latent space, and execute a generator to generate an output video of the scene from the latent vector, the output video being in a second domain.
  • the encoder and the generator are trained to maintain temporal consistency between the input video and the output video by using the audio data.
  • the encoder and the generator may be supervised by a discriminator during training.
  • the discriminator may supervise the training of the encoder and the generator by testing a consistency of the output video with the audio data.
  • the discriminator may use a correlation between the output video and the audio data to test the consistency of the output video with the audio data.
  • the discriminator may receive the correlation from a correlation module, and the correlation module may be pretrained with contrastive learning.
  • the discriminator may supervise the training of the encoder and the generator by testing a consistency of the output video with the second domain.
  • the instructions may further include instructions to determine an adversarial loss based on an output of the discriminator and to update the encoder and the generator based on the adversarial loss.
  • the first domain and the second domain may be mutually exclusive environmental conditions of the scene.
  • the environmental conditions may be one of a lighting condition or a weather condition.
  • the first domain and the second domain may be mutually exclusive visual rendering characteristics of the input video and output video.
  • the visual rendering characteristics are one of a resolution, a color representation scheme, or simulatedness.
  • the instructions may further include instructions to extract visual features from the input video, and executing the encoder may be based on the visual features.
  • the instructions may further include instructions to extract audio features from the audio data and to fuse the visual features and the audio features, and executing the encoder may be based on the fusion of the visual features and the audio features.
  • the encoder may be trained to include semantic content of the input video in the latent vector and to exclude domain data of the input video from the latent vector.
  • the encoder may be a first encoder
  • the generator may be a first generator
  • the latent vector may be a first latent vector
  • training the first encoder and the first generator may include executing a second encoder to map the output video and the audio data to a second latent vector in the lower-dimensional latent space, and executing a second generator to generate a test video of the scene from the second latent vector in the first domain.
  • training the first encoder and the first generator may include updating the first encoder and the first generator based on a difference between the test video and the input video.
  • the instructions may further include instructions to train a machine-learning model on training data, and the training data includes the output video.
  • the machine-learning model is an object-recognition model.
  • a computer 100 includes a processor 105 and a memory 110 , and the memory 110 stores instructions executable by the processor 105 to receive an input video 205 of a scene and audio data 210 associated with the input video 205 , the input video 205 being in a first domain; execute an encoder 215 to map the input video 205 and the audio data 210 to a latent vector 220 in a lower-dimensional latent space; and execute a generator 225 to generate an output video 230 of the scene from the latent vector 220 , the output video 230 being in a second domain.
  • the encoder 215 and the generator 225 are trained to maintain temporal consistency between the input video 205 and the output video 230 by using the audio data 210 .
  • the computer 100 is a microprocessor-based computing device, e.g., a generic computing device including the processor 105 and the memory 110 , an electronic controller or the like, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a combination of the foregoing, etc.
  • a hardware description language such as VHDL (Very High Speed Integrated Circuit Hardware Description Language) is used in electronic design automation to describe digital and mixed-signal systems such as FPGA and ASIC.
  • an ASIC is manufactured based on VHDL programming provided pre-manufacturing
  • logical components inside an FPGA may be configured based on VHDL programming, e.g., stored in a memory 110 electrically connected to the FPGA circuit.
  • the computer 100 can thus include the processor 105 , the memory 110 , etc.
  • the memory 110 of the computer 100 can include media for storing instructions executable by the processor 105 as well as for electronically storing data and/or databases, and/or the computer 100 can include structures such as the foregoing by which programming is provided.
  • the computer 100 can be multiple computers coupled together.
  • the computer 100 is connected to a network 115 .
  • the network 115 represents one or more mechanisms by which the computer 100 may communicate with remote servers.
  • the network 115 may be one or more of various wired or wireless communication mechanisms, including any desired combination of wired (e.g., cable and fiber) and/or wireless (e.g., cellular, wireless, satellite, microwave, and radio frequency) communication mechanisms and any desired network topology (or topologies when multiple communication mechanisms are utilized).
  • Exemplary communication networks include wireless communication networks (e.g., using Bluetooth, IEEE 802.11, etc.), local area networks (LAN) and/or wide area networks (WAN), including the Internet, providing data communication services.
  • the computer 100 can communicate with, e.g., a plurality of vehicles 120 including cameras 125 and microphones 130 and/or a plurality of standalone cameras 125 and microphones 130 .
  • the vehicles 120 may be any passenger or commercial automobiles such as cars, trucks, sport utility vehicles, crossovers, vans, minivans, taxis, buses, etc.
  • the cameras 125 can detect electromagnetic radiation in some range of wavelengths.
  • the cameras 125 may detect visible light, infrared radiation, ultraviolet light, or some range of wavelengths including visible, infrared, and/or ultraviolet light.
  • each camera 125 can be a charge-coupled device (CCD), complementary metal oxide semiconductor (CMOS), or any other suitable type.
  • the cameras 125 may be time-of-flight (TOF) cameras, which include a modulated light source for illuminating the environment and detect both reflected light from the modulated light source and ambient light to sense reflectivity amplitudes and distances to the scene.
  • the cameras 125 generate video data, which may be used as described below as the input video 205 .
  • the video data are sequences of image frames depicting the scenes contained in the fields of view of the respective cameras 125 .
  • Each image frame is a two-dimensional matrix of pixels.
  • Each pixel has a brightness or color represented as one or more numerical values, e.g., a scalar unitless value of photometric light intensity between 0 (black) and 1 (white), or values for each of red, green, and blue, e.g., each on an 8-bit scale (0 to 255) or a 12- or 16-bit scale.
  • the pixels may be a mix of representations, e.g., a repeating pattern of scalar values of intensity for three pixels and a fourth pixel with three numerical color values, or some other pattern.
  • Position in the image frame, i.e., position in the field of view of the sensor at the time that the image frame was recorded, can be specified in pixel dimensions or coordinates, e.g., an ordered pair of pixel distances, such as a number of pixels from a top edge and a number of pixels from a left edge of the field of view.
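  • The frame and pixel conventions above can be made concrete with the short sketch below; the clip length, resolution, and 8-bit-to-unit-interval conversion are illustrative assumptions rather than values taken from the disclosure.

```python
import numpy as np

# Hypothetical input video: T frames of H x W RGB pixels, 8 bits per channel.
T, H, W = 30, 480, 640                                   # illustrative dimensions only
video_u8 = np.random.randint(0, 256, size=(T, H, W, 3), dtype=np.uint8)

# Convert 8-bit color values (0-255) to unitless intensities between 0 (black) and 1 (white).
video = video_u8.astype(np.float32) / 255.0

# Position in an image frame as pixel coordinates: pixels from the top edge and from the left edge.
row, col = 100, 250
pixel = video[0, row, col]                               # RGB triple from the first frame
print(video.shape, pixel.shape)
```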
  • the image data includes semantic content and domain data.
  • semantic content is defined as the physical features of the environment and the objects populating the environment within the scene captured in a video.
  • the semantic content of a video captured by a camera 125 mounted to a vehicle 120 may include the road, sidewalks, other terrain, buildings and infrastructure, other vehicles, pedestrians, bicyclists, etc.
  • a “domain” is a characteristic or condition affecting the scene as a whole or the image frame as a whole. Multiple domains may apply to a given video.
  • domains may include environmental conditions, e.g., lighting conditions such as daytime versus nighttime, overcast versus sunny, sun angle, etc., or weather conditions such as clear, foggy, rainy, snowy, etc.
  • domains may include visual rendering characteristics such as resolution, color representation scheme, simulatedness, etc. The resolution is the sharpness of the image frame, e.g., as represented by the pixel density.
  • the color representation scheme is how colors are represented pixel-by-pixel in the image frame, e.g., red-green-blue (RGB), red-clear-clear-clear (RCCC), etc.
  • the simulatedness is how realistic versus computer-generated surfaces appear in the image frame, e.g., no textures, computer-generated texture, photorealistic, etc.
  • the input video 205 is in a first domain
  • the output video 230 is in a second domain.
  • the first domain and the second domain may be mutually exclusive domains of a same type.
  • the first and second domains are mutually exclusive in that the first and second domains are different and nonoverlapping.
  • a type of domain is a group of domains of which at most one domain applies to a given video or image frame.
  • a type can be cloud cover, and the possible domains of that type can include overcast, mostly cloudy, partly cloudy, and sunny.
  • a type can be color representation scheme, and the possible domains of that type can be RGB, RCCC, and black-and-white.
  • the microphones 130 are transducers that convert sound to electrical signals.
  • Each microphone 130 can be any suitable type, e.g., a dynamic microphone, which includes a coil of wire suspended in a magnetic field; a condenser microphone, which uses a vibrating diaphragm as a capacitor plate; a contact microphone, which uses a piezoelectric crystal; etc.
  • the microphones 130 generate the audio data 210 .
  • the audio data 210 is recorded sound data in any suitable format, e.g., a standard audio file format such as .wav.
  • the audio data 210 includes sound as a function at least of time.
  • the audio data 210 may be represented as a spectrogram, which shows amplitude as a function of time and frequency.
  • the audio data 210 is associated with the input video 205 , e.g., is recorded in the scene depicted in the input video 205 contemporaneously with recording the input video 205 .
  • audio data 210 may be time-synchronized to the input video 205 .
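  • As a small illustration of the spectrogram representation described above, the sketch below computes amplitude as a function of time and frequency from a placeholder waveform; the sampling rate, FFT parameters, and synthetic tone are assumptions.

```python
import numpy as np
from scipy import signal

# Hypothetical audio data: two seconds of mono audio sampled at 16 kHz (placeholder tone).
fs = 16_000
t = np.arange(0, 2.0, 1 / fs)
waveform = 0.5 * np.sin(2 * np.pi * 440 * t)

# Spectrogram: amplitude as a function of time and frequency.
freqs, times, spec = signal.spectrogram(waveform, fs=fs, nperseg=512, noverlap=256)
print(spec.shape)   # (frequency bins, time steps)
```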
  • a machine-learning architecture 200 is shown that is trained to generate the output video 230 in the second domain based on the input video 205 in the first domain and based on the audio data 210 associated with the input video 205 .
  • the machine-learning architecture 200 may be trained for a specific first domain and second domain. (The training of the machine-learning architecture 200 is described below with respect to FIGS. 4 - 6 .)
  • the machine-learning architecture 200 may include a visual feature extractor 235 , an audio feature extractor 240 , a fusion block 245 , the encoder 215 , and the generator 225 .
  • the visual feature extractor 235 extracts visual features, e.g., in the form of a visual feature map 250 , from the input video 205 .
  • the audio feature extractor 240 extracts audio features, e.g., in the form of an audio feature map 255 , from the audio data 210 .
  • the fusion block 245 fuses the visual feature map 250 and the audio feature map 255 and outputs a resulting fused feature map 260 to the encoder 215 .
  • the encoder 215 uses the fused feature map 260 to generate the latent vector 220 .
  • the generator 225 uses the latent vector 220 to generate the output video 230 .
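  • One way to picture how these components connect is the PyTorch-style sketch below. The layer choices, channel counts, and fixed clip size are assumptions made for illustration; the disclosure does not prescribe them.

```python
import torch
import torch.nn as nn

class ArtificialVideoPipeline(nn.Module):
    """Sketch of machine-learning architecture 200: extractors, fusion, encoder, generator."""

    def __init__(self, latent_dim: int = 256):
        super().__init__()
        # Placeholder visual feature extractor 235 and audio feature extractor 240.
        self.visual_extractor = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool3d((4, 8, 8)))
        self.audio_extractor = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((8, 8)))
        # Encoder 215: fused feature map 260 -> latent vector 220.
        self.encoder = nn.Linear(32 * 4 * 8 * 8 + 32 * 8 * 8, latent_dim)
        # Generator 225: latent vector 220 -> output video 230 (fixed 8-frame, 64x64 clip here).
        self.generator = nn.Sequential(nn.Linear(latent_dim, 3 * 8 * 64 * 64), nn.Sigmoid())

    def forward(self, input_video: torch.Tensor, spectrogram: torch.Tensor) -> torch.Tensor:
        vis = self.visual_extractor(input_video)                 # visual feature map 250
        aud = self.audio_extractor(spectrogram)                  # audio feature map 255
        fused = torch.cat([vis.flatten(1), aud.flatten(1)], 1)   # fusion block 245 (concatenation)
        latent = self.encoder(fused)                             # latent vector 220
        return self.generator(latent).view(-1, 3, 8, 64, 64)     # output video 230 (second domain)

# Usage: a batch with one input video (B, C, T, H, W) and its audio spectrogram (B, 1, F, T).
pipeline = ArtificialVideoPipeline()
output_video = pipeline(torch.rand(1, 3, 8, 64, 64), torch.rand(1, 1, 128, 100))
print(output_video.shape)
```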
  • the computer 100 may be programmed to extract visual features from the video data.
  • feature is used in its machine-learning sense as a piece of information about the content of a type of data (e.g., image, video, audio), specifically about whether a certain region of the data (e.g., image frame, spectrogram) has certain properties.
  • Types of visual features may include edges, corners, blobs, etc.
  • the visual features may take the form of the visual feature map 250 .
  • the visual feature map 250 includes a plurality of visual features.
  • the visual feature map 250 provides locations in an image frame, e.g., in pixel coordinates, of the visual features.
  • the visual feature map 250 has a reduced dimensionality compared to the image frame or video.
  • Extracting the visual features may include executing the visual feature extractor 235 .
  • the visual feature extractor 235 may include one or more suitable techniques for feature extraction, e.g., low-level techniques such as edge detection, corner detection, blob detection, ridge detection, scale-invariant feature transform (SIFT), etc.; shape-based techniques such as thresholding, blob extraction, template matching, Hough transform, generalized Hough transform, etc.; flexible methods such as deformable parameterized shapes, active contours, etc.; etc.
  • the visual feature extractor 235 may include machine-learning operations.
  • the visual feature extractor 235 may include residual network (ResNet) layers followed by a convolutional neural network.
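  • As one illustration of "ResNet layers followed by a convolutional neural network," the sketch below runs each frame through a truncated torchvision ResNet-18 backbone and a small convolutional head; the choice of backbone, the random weights, and the head are assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class VisualFeatureExtractor(nn.Module):
    """Sketch of visual feature extractor 235: ResNet layers followed by a convolutional head."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)        # torchvision >= 0.13
        # Keep the residual stages, drop the average pool and classifier.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.head = nn.Sequential(nn.Conv2d(512, 64, kernel_size=3, padding=1), nn.ReLU())

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 3, H, W); process frames independently, then restack along time.
        b, t, c, h, w = video.shape
        feats = self.head(self.backbone(video.reshape(b * t, c, h, w)))
        return feats.reshape(b, t, *feats.shape[1:])              # visual feature map 250

extractor = VisualFeatureExtractor()
print(extractor(torch.rand(2, 8, 3, 224, 224)).shape)             # (2, 8, 64, 7, 7)
```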
  • the computer 100 may be programmed to extract audio features from the audio data 210 .
  • the audio features may take the form of the audio feature map 255 .
  • the audio feature map 255 includes a plurality of audio features.
  • the audio feature map 255 provides temporal positions or time-frequency positions in the audio data 210 of the audio features.
  • the audio feature map 255 has a reduced dimensionality compared to the audio data 210 .
  • Extracting the audio features may include executing the audio feature extractor 240 .
  • the audio feature extractor 240 may include one or more suitable techniques for feature extraction, e.g., low-, medium-, and/or high-level; instantaneous, segment-level, and/or global; time domain, frequency domain, and/or time-frequency representation; etc.
  • the audio data 210 may be represented as a spectrogram, and the audio feature extractor 240 may treat the spectrogram as an image, with two-dimensional pixel coordinates representing time and frequency and a value of the pixel representing the amplitude at the respective time and frequency.
  • the audio feature extractor 240 may thus perform similar analysis as described above for the visual feature extractor 235 .
  • the audio feature extractor 240 may include machine-learning operations.
  • the audio feature extractor 240 may include ResNet layers followed by a convolutional neural network.
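  • The audio branch can mirror the visual one by treating the spectrogram as a one-channel image, as in the sketch below; adapting the first ResNet convolution to a single input channel is one workable design assumed here, not a detail taken from the disclosure.

```python
import torch
import torch.nn as nn
import torchvision

class AudioFeatureExtractor(nn.Module):
    """Sketch of audio feature extractor 240: ResNet layers applied to a spectrogram 'image'."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        # Spectrograms have a single channel (amplitude), so swap in a 1-channel stem.
        resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.head = nn.Sequential(nn.Conv2d(512, 64, kernel_size=3, padding=1), nn.ReLU())

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (B, 1, frequency bins, time steps) -> audio feature map 255.
        return self.head(self.backbone(spectrogram))

audio_extractor = AudioFeatureExtractor()
print(audio_extractor(torch.rand(2, 1, 128, 256)).shape)
```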
  • the computer 100 may be programmed to fuse the visual features and the audio features, e.g., the visual feature map 250 and the audio feature map 255 , resulting in the fused feature map 260 .
  • Data fusion is combining data from disparate sources together so that the resulting data has less uncertainty than if the data from each source were used individually, e.g., in creating a unified model of the surrounding environment.
  • the data fusion may include concatenating the visual feature map 250 and the audio feature map 255 .
  • the data fusion can be performed with one or more algorithms, e.g., Kalman filter, central limit theorem, Bayesian networks, Dempster-Shafer, convolutional neural networks, etc.
  • the computer 100 may be programmed to map the input video 205 and the audio data 210 to the latent vector 220 in the lower-dimensional latent space.
  • the latent space has a lower dimensionality than the input video 205 has and/or than the fused feature map 260 has.
  • the latent space includes dimensions for the semantic content and excludes domain data for the type of domain that the first domain and second domain are (e.g., without precipitation data when converting from clear to rainy, or without resolution data when converting from low resolution to high resolution).
  • the latent vector 220 provides values for the semantic content in the latent space.
  • the computer 100 may be programmed to execute the encoder 215 to map the input video 205 and the audio data 210 to the latent vector 220 .
  • Executing the encoder 215 may be based on the visual features and the audio features, e.g., on the fusion of the visual features and the audio features, e.g., on the fused feature map 260 .
  • the encoder 215 may be or include an artificial neural network such as a residual neural network (ResNet).
  • the encoder 215 may include a sequence of layers and may reduce a dimensionality of the data from one layer to the next layer.
  • the output of the final layer may be the latent vector 220 .
  • the computer 100 may be programmed to execute the generator 225 to generate the output video 230 of the scene from the latent vector 220 .
  • the latent vector 220 may be an input to the generator 225 .
  • the generator 225 may be or include an artificial neural network such as a ResNet.
  • the generator 225 may include a series of layers and may increase a dimensionality of the data from one layer to the next.
  • the output of the final layer may be the output video 230 .
  • the encoder 215 and the generator 225 are trained to maintain the same semantic content in the output video 230 as in the input video 205 while changing the domain from the first domain to the second domain.
  • the encoder 215 and the generator 225 may be trained for a specific first domain and second domain of the same type.
  • the encoder 215 and the generator 225 are further trained to maintain temporal consistency between the input video 205 and the output video 230 , e.g., an event involving the semantic content occurs at the same time in both the input video 205 and the output video 230 , by using the audio data 210 , e.g., as included in the fused feature map 260 .
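  • A minimal realization of the encoder 215 and generator 225 as mirrored networks is sketched below, assuming convolutional downsampling into the latent vector 220 and transposed-convolution upsampling back to a short clip; the layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of encoder 215: reduce the fused feature map to the lower-dimensional latent vector 220."""
    def __init__(self, in_channels: int = 128, latent_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, latent_dim))

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.net(fused)

class Generator(nn.Module):
    """Sketch of generator 225: expand the latent vector 220 into the output video 230."""
    def __init__(self, latent_dim: int = 256, frames: int = 8):
        super().__init__()
        self.frames = frames
        self.fc = nn.Linear(latent_dim, 128 * 4 * 4)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),            # 8 x 8
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),             # 16 x 16
            nn.ConvTranspose2d(32, 3 * frames, 4, stride=2, padding=1), nn.Sigmoid())  # 32 x 32

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        video = self.net(self.fc(latent).view(-1, 128, 4, 4))    # (B, 3 * frames, 32, 32)
        return video.view(-1, self.frames, 3, 32, 32)

encoder, generator = Encoder(), Generator()
latent = encoder(torch.rand(1, 128, 16, 16))                     # latent vector 220
print(generator(latent).shape)                                   # (1, 8, 3, 32, 32)
```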
  • FIG. 3 is a process flow diagram illustrating an example process 300 for generating the output video 230 .
  • the memory 110 of the computer 100 stores executable instructions for performing the steps of the process 300 and/or programming can be implemented in structures such as mentioned above.
  • the computer 100 receives the input video 205 and the associated audio data 210 , extracts the visual and audio features from the input video 205 and audio data 210 , fuses the visual and audio features, maps the fused data to the latent vector 220 , generates the output video 230 from the latent vector 220 , and trains a machine-learning model with the output video 230 as training data.
  • the process 300 begins in a block 305 , in which the computer 100 receives the input video 205 of the scene in the first domain and the audio data 210 associated with the input video 205 , as described above.
  • the computer 100 extracts the visual features from the input video 205 , as described above.
  • the computer 100 extracts the audio features from the audio data 210 , as described above.
  • the computer 100 fuses the visual features and the audio features, as described above.
  • the computer 100 executes the encoder 215 to map the input video 205 and the audio data 210 to the latent vector 220 in the lower-dimensional latent space, as described above.
  • the computer 100 executes the generator 225 to generate the output video 230 of the scene in the second domain from the latent vector 220 , as described above.
  • the computer 100 trains a machine-learning model on training data that includes the output video 230 .
  • the machine-learning model may be an object-recognition model, e.g., using a convolutional neural network.
  • the training data includes the output video 230 .
  • the training data may further include the input video 205 and/or similarly generated output videos 230 and associated input videos 205 .
  • the machine-learning model may be installed on a vehicle computer of a vehicle 120 , and the vehicle computer may be programmed to actuate the vehicle 120 based on classifications outputted by the machine-learning model, e.g., by actuating one or more of a propulsion system, a brake system, and/or a steering system of the vehicle 120 .
  • the vehicle computer may use the classification of an object as a particular type of road user to determine an amount of lateral distance to maintain when passing the road user, e.g., a greater distance for a bicycle than for a motor vehicle. Because of the different domains represented in the training data used for training the machine-learning model, i.e., both the first domain and the second domain, the machine-learning model may classify objects accurately in a greater variety of conditions, thereby enhancing the autonomous operation of the vehicle 120 . After the block 335 , the process 300 ends.
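  • Read as code, process 300 is the sequence of calls sketched below; the callables are placeholders with assumed interfaces, and the downstream object-recognition training is reduced to assembling the augmented training data.

```python
import torch

def process_300(input_video, audio_data, extract_visual, extract_audio, fuse, encoder, generator):
    """Sketch of process 300: extract, fuse, encode, and generate the second-domain output video."""
    visual_features = extract_visual(input_video)
    audio_features = extract_audio(audio_data)
    fused = fuse(visual_features, audio_features)
    latent = encoder(fused)                       # latent vector 220
    return generator(latent)                      # output video 230 in the second domain

# Placeholder components with illustrative sizes (not taken from the disclosure).
output = process_300(
    torch.rand(1, 3, 4, 32, 32), torch.rand(1, 1, 64, 50),
    extract_visual=lambda v: v.flatten(1),
    extract_audio=lambda a: a.flatten(1),
    fuse=lambda v, a: torch.cat([v, a], dim=1),
    encoder=torch.nn.Linear(3 * 4 * 32 * 32 + 64 * 50, 256),
    generator=torch.nn.Linear(256, 3 * 4 * 32 * 32))

# The generated video extends the training data for a downstream object-recognition model
# alongside the original first-domain input video.
training_data = [torch.rand(1, 3, 4, 32, 32), output.view(1, 3, 4, 32, 32)]
print(len(training_data))
```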
  • a training architecture 400 is shown for training the encoder 215 and the generator 225 to generate the output video 230 in the second domain based on the input video 205 in the first domain and based on the audio data 210 associated with the input video 205 .
  • the encoder 215 and the generator 225 are trained to maintain temporal consistency between the input video 205 and the output video 230 by using the audio data 210 .
  • the encoder 215 is trained to include the semantic content of the input video 205 in the latent vector 220 and to exclude domain data from the first domain of the input video 205 from the latent vector 220 .
  • the generator 225 is trained to construct the output video 230 including the semantic content from the latent vector 220 with the second domain applied.
  • the encoder 215 and the generator 225 are trained for a specific first domain and a specific second domain of the same type as the first domain, e.g., daytime and nighttime, respectively; i.e., the encoder 215 and the generator 225 are trained to convert from one specific domain to another specific domain of the same type. This is beneficial so that the encoder 215 may be trained to exclude domain data from the type of domain of the first domain but not necessarily exclude domain data from other types of domains.
  • the training data for training the encoder 215 and the generator 225 may include a plurality of input videos 205 in the first domain and respective audio data 210 associated with the input videos 205 .
  • the training data may further include videos or images in the second domain that are not paired with the input videos 205 in the first domain, i.e., are of different scenes than captured in the input videos 205 . Because the training architecture 400 uses the audio data 210 for supervision, the training architecture 400 does not need a video in the second domain of the same scene as an input video 205 in the first domain.
  • the training of the encoder 215 and the generator 225 includes iterative execution of the encoder 215 and the generator 225 together, i.e., joint training of the encoder 215 and the generator 225 .
  • a loss function may be calculated by a discriminator 405 (in the example of FIG. 4 ) or a discriminator 405 and a comparator 505 (in the example of FIG. 5 ).
  • the loss function is based on a score of the temporal consistency of the output video 230 with the audio data 210 and based on a score of the output video 230 being in the second domain.
  • the loss function is used to update the encoder 215 and the generator 225 .
  • the training is completed in response to a termination condition being satisfied, e.g., the loss function being below a threshold value and/or the training elapsing for a preset number of iterations.
  • the threshold value or the preset number of iterations may be chosen to indicate or produce accurate operation by the encoder 215 and the generator 225 .
  • the encoder 215 and the generator 225 may be used as described above with respect to FIGS. 2 and 3 .
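  • A hedged sketch of how the two supervision signals described above might be combined into a single loss and checked against a termination condition; the weighting, score ranges, and threshold are assumptions.

```python
import torch

def training_loss(temporal_score: torch.Tensor, domain_score: torch.Tensor,
                  w_temporal: float = 1.0, w_domain: float = 1.0) -> torch.Tensor:
    """Penalize low temporal consistency with the audio data and low consistency with the
    second domain; both scores are assumed to lie in [0, 1]."""
    return w_temporal * (1.0 - temporal_score) + w_domain * (1.0 - domain_score)

loss = training_loss(torch.tensor(0.8), torch.tensor(0.6))

# Termination condition: loss below a threshold value and/or a preset number of iterations elapsed.
threshold, max_iterations, iteration = 0.05, 10_000, 0
terminate = loss.item() < threshold or iteration >= max_iterations
print(loss.item(), terminate)
```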
  • the training architecture 400 may include a correlation module 410 .
  • the correlation module 410 may include a correlation visual feature extractor 415 , a correlation audio feature extractor 420 , and a contrastive-learning block 425 .
  • the correlation visual feature extractor 415 and the correlation audio feature extractor 420 may operate similarly to the visual feature extractor 235 and the audio feature extractor 240 , respectively, as described above.
  • the correlation visual feature extractor 415 may receive the output videos 230 as inputs each iteration.
  • the correlation audio feature extractor 420 may receive the audio data 210 (unchanged) as an input each iteration.
  • the contrastive-learning block 425 receives the outputs from the correlation visual feature extractor 415 and the correlation audio feature extractor 420 as inputs.
  • the contrastive-learning block 425 outputs a correlation score between the output video 230 of the current iteration and the audio data 210 .
  • the correlation score indicates a temporal closeness between events in the output video 230 and the audio data 210 .
  • the correlation module 410 may be pretrained, i.e., trained separately from the rest of the training architecture 400 before training the encoder 215 and the generator 225 .
  • the correlation module 410 , e.g., the contrastive-learning block 425 , may be trained using contrastive learning on pairs of samples. Each pair includes a positive sample and a negative sample.
  • the positive sample may be one of the input videos 205 along with the audio data 210 synched to the input video 205
  • the negative sample may be the same input video 205 along with the audio data 210 time-shifted with respect to the input video 205 .
  • the correlation module 410 may thus be trained using easy-to-gather training data that does not require manual annotation.
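  • One way to realize the contrastive pretraining described above is sketched below: score synced (positive) and time-shifted (negative) video/audio pairs and push the scores apart with a margin loss. The embedding networks, margin, and shift amount are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelationModule(nn.Module):
    """Sketch of correlation module 410: embed video and audio and score their temporal closeness."""
    def __init__(self, video_dim: int, audio_dim: int, dim: int = 128):
        super().__init__()
        self.video_net = nn.Sequential(nn.Flatten(), nn.Linear(video_dim, dim))  # stands in for extractor 415
        self.audio_net = nn.Sequential(nn.Flatten(), nn.Linear(audio_dim, dim))  # stands in for extractor 420

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        v = F.normalize(self.video_net(video), dim=1)
        a = F.normalize(self.audio_net(audio), dim=1)
        return (v * a).sum(dim=1)                       # cosine similarity as the correlation score

video = torch.rand(4, 3, 8, 32, 32)                     # input videos 205 (illustrative sizes)
audio = torch.rand(4, 1, 64, 80)                        # synced spectrograms of audio data 210
module = CorrelationModule(video_dim=3 * 8 * 32 * 32, audio_dim=64 * 80)
optimizer = torch.optim.Adam(module.parameters(), lr=1e-4)

for _ in range(3):                                      # a few illustrative pretraining steps
    positive = module(video, audio)                     # positive samples: video with synced audio
    negative = module(video, torch.roll(audio, shifts=10, dims=-1))  # negative: time-shifted audio
    loss = F.relu(0.5 - positive + negative).mean()     # positives should outscore negatives by a margin
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(float(loss))
```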
  • the encoder 215 and the generator 225 may be supervised by the discriminator 405 during training.
  • the generator 225 and the discriminator 405 may together constitute a generative adversarial network.
  • the discriminator 405 may be a machine-learning model, e.g., a convolutional neural network.
  • the discriminator 405 may be trained to classify inputs as (a) temporally consistent or (b) temporally inconsistent.
  • the discriminator 405 may also be trained to classify inputs as (a) actual videos in the second domain or (b) output videos 230 in the second domain from the generator 225 .
  • the discriminator 405 may be jointly trained with the generator 225 (and the encoder 215 ).
  • the inputs to the discriminator 405 are the output videos 230 generated by the generator 225 each iteration.
  • the discriminator 405 may supervise the training of the encoder 215 and the generator 225 by testing a consistency of the output video 230 with the audio data 210 .
  • the discriminator 405 may receive the correlation score from the correlation module 410 each iteration.
  • the classification outputted by the discriminator 405 may be based on the correlation score as well as on the output videos 230 and the audio data 210 as direct inputs.
  • the computer 100 determines an adversarial loss based on the accuracy of the classification by the discriminator 405 .
  • the computer 100 updates the encoder 215 and the generator 225 based on the adversarial loss, i.e., according to a generative-adversarial-network scheme, e.g., by adjusting weights within the encoder 215 and/or the generator 225 .
  • the computer 100 also updates the discriminator 405 based on the adversarial loss, i.e., according to a generative-adversarial-network scheme, e.g., by adjusting weights within the discriminator 405 .
  • the discriminator 405 may supervise the training of the encoder 215 and the generator 225 by testing a consistency of the output video 230 with the second domain. For example, the discriminator 405 may receive the output videos 230 generated by the generator 225 and actual videos in the second domain, i.e., training data as described above. The discriminator 405 attempts to classify the videos as actual or as generated by the generator 225 .
  • the computer 100 determines an adversarial loss based on the accuracy of the classification.
  • the computer 100 updates the encoder 215 and the generator 225 based on the adversarial loss, i.e., according to a generative-adversarial-network scheme, e.g., by adjusting weights within the encoder 215 and/or the generator 225 .
  • the computer 100 also updates the discriminator 405 based on the adversarial loss, i.e., according to a generative-adversarial-network scheme, e.g., by adjusting weights within the discriminator 405 .
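  • The adversarial supervision described above can be sketched in the usual generative-adversarial form. The toy modules below only show the domain-consistency branch (actual versus generated second-domain videos); in the full architecture the discriminator 405 may also take the correlation score as an input, and the binary-cross-entropy formulation is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

clip_dim, latent_dim, batch = 3 * 8 * 32 * 32, 256, 4
generator = nn.Linear(latent_dim, clip_dim)        # stands in for generator 225 (fed by encoder 215)
discriminator = nn.Linear(clip_dim, 1)             # stands in for discriminator 405 (outputs a logit)

g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

real_videos = torch.rand(batch, clip_dim)          # actual (unpaired) videos in the second domain
latents = torch.rand(batch, latent_dim)            # latent vectors 220 from the encoder

# Discriminator update: classify actual second-domain videos versus generated output videos 230.
fake_videos = generator(latents).detach()
d_loss = F.binary_cross_entropy_with_logits(discriminator(real_videos), torch.ones(batch, 1)) + \
         F.binary_cross_entropy_with_logits(discriminator(fake_videos), torch.zeros(batch, 1))
d_opt.zero_grad(); d_loss.backward(); d_opt.step()

# Generator (and, in the full architecture, encoder) update: the adversarial loss rewards output
# videos that the discriminator classifies as actual second-domain videos.
g_loss = F.binary_cross_entropy_with_logits(discriminator(generator(latents)), torch.ones(batch, 1))
g_opt.zero_grad(); g_loss.backward(); g_opt.step()
print(float(d_loss), float(g_loss))
```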
  • the training architecture 400 may include a second encoder 510 , a second generator 515 , and the comparator 505 , in addition to the discriminator 405 .
  • the discriminator 405 may supervise the training of the first encoder 215 and the first generator 225 by testing a consistency of the output video 230 with the second domain, as described above in the example of FIG. 4 . In this example, the discriminator 405 is not used for testing the consistency of the output video 230 with the audio data 210 , i.e., the temporal consistency.
  • the training architecture 400 may further include the correlation module 410 described above, and the discriminator 405 may be used for both consistencies as described above. In that case, the discriminator 405 is trained jointly with the second encoder 510 and second generator 515 described below, as well as with the rest of the training architecture 400 .
  • the second encoder 510 and the second generator 515 may have the same structure as the first encoder 215 and the first generator 225 , respectively, as described above with respect to FIG. 2 .
  • the training architecture 400 may further include a second visual feature extractor 530 , a second audio feature extractor 535 , and a second fusion block 540 , which operate in the same manner as described above for the visual feature extractor 235 , the audio feature extractor 240 , and the fusion block 245 , respectively.
  • the second encoder 510 may receive the output video 230 in the second domain from the generator 225 , as well as the audio data 210 , after processing by the second visual feature extractor 530 , second audio feature extractor 535 , and second fusion block 540 .
  • the second encoder 510 may execute to map the output video 230 and the audio data 210 to a second latent vector 520 in the lower-dimensional latent space.
  • the second generator 515 may execute to generate a test video 525 in the first domain from the second latent vector 520 .
  • the second encoder 510 and the second generator 515 are trained to convert video from the second domain to the first domain, the reverse of the first encoder 215 and the first generator 225 .
  • the audio data 210 inputted into the second encoder 510 is the same as the audio data 210 inputted into the first encoder 215 .
  • the comparator 505 determines a difference between the test video 525 and the input video 205 , e.g., a level of cycle consistency.
  • the input video 205 and the test video 525 are both in the first domain.
  • the comparator 505 may be a pretrained module.
  • the computer 100 may calculate a loss function based on the difference, e.g., a cycle consistency loss as used in cycle-consistent adversarial networks such as cycleGAN.
  • the computer 100 updates the first encoder 215 , the first generator 225 , the second encoder 510 , and the second generator 515 based on the loss function, e.g., by adjusting weights within the first encoder 215 , the first generator 225 , the second encoder 510 , and the second generator 515 .
  • the second generator 515 and the second encoder 510 may thus be jointly trained with the first encoder 215 and the first generator 225 .
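  • A sketch of the cycle-consistency supervision of FIG. 5: the second encoder 510 and second generator 515 map the output video back to the first domain, and the comparator 505 penalizes its difference from the input video, in the spirit of cycleGAN. The flattened toy modules and the L1 comparison are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: first encoder 215 / generator 225 (first domain -> second domain) and
# second encoder 510 / generator 515 (second domain -> first domain), on flattened clips.
clip_dim, latent_dim = 3 * 8 * 32 * 32, 256
encoder_1, generator_1 = nn.Linear(clip_dim, latent_dim), nn.Linear(latent_dim, clip_dim)
encoder_2, generator_2 = nn.Linear(clip_dim, latent_dim), nn.Linear(latent_dim, clip_dim)
optimizer = torch.optim.Adam(
    [*encoder_1.parameters(), *generator_1.parameters(),
     *encoder_2.parameters(), *generator_2.parameters()], lr=2e-4)

input_video = torch.rand(4, clip_dim)                       # input video 205 in the first domain
# (The audio data 210 fed to both encoders is omitted from this toy sketch for brevity.)
output_video = generator_1(encoder_1(input_video))          # output video 230 (second domain)
test_video = generator_2(encoder_2(output_video))           # test video 525 (back to first domain)

# Comparator 505: cycle-consistency loss on the difference between test video and input video.
cycle_loss = F.l1_loss(test_video, input_video)
optimizer.zero_grad()
cycle_loss.backward()
optimizer.step()
print(float(cycle_loss))
```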
  • FIG. 6 is a process flow diagram illustrating an example process 600 for training the machine-learning architecture 200 .
  • the memory 110 of the computer 100 stores executable instructions for performing the steps of the process 600 and/or programming can be implemented in structures such as mentioned above.
  • the computer 100 provides pretrained components of the training architecture 400 and generates the initial values of the trainable components of the training architecture 400 , including the encoder 215 and the generator 225 .
  • the computer 100 then iteratively executes the training architecture 400 .
  • the computer 100 executes the machine-learning architecture 200 to generate the output videos 230 , tests the output videos 230 , and updates the training architecture 400 based on testing the output videos 230 .
  • Upon reaching a termination condition, the computer 100 outputs the encoder 215 and the generator 225 from the final iteration.
  • the process 600 begins in a block 605 , in which the computer 100 provides pretrained components of the training architecture 400 , e.g., the correlation module 410 in the example of FIG. 4 or the comparator 505 in the example of FIG. 5 .
  • the computer 100 may load components already installed on the computer 100 .
  • the computer 100 generates the initial encoder 215 and generator 225 .
  • the computer 100 may assign weights at random to the structures described above for the encoder 215 and the generator 225 .
  • the computer 100 may also generate initial values for the other components of the training architecture 400 that will be trained, e.g., the discriminator 405 in the example of FIG. 4 or the second encoder 510 , second generator 515 , and discriminator 405 in the example of FIG. 5 .
  • the computer 100 may assign weights at random to the structures described above for these components.
  • the computer 100 executes the machine-learning architecture 200 , including the encoder 215 and the generator 225 , as described above.
  • the inputs are the training data of input videos 205 in the first domain and the respective associated audio data 210 .
  • the computer 100 tests the output videos 230 generated in the block 615 .
  • the computer 100 may execute the correlation module 410 to output correlation scores to the discriminator 405 and execute the discriminator 405 to classify the output videos 230 and other training videos.
  • the computer 100 may execute the discriminator 405 to classify the output videos 230 and other training videos, execute the second encoder 510 and second generator 515 to output the test videos 525 , and execute the comparator 505 to determine differences between the input videos 205 and the respective test videos 525 .
  • the computer 100 calculates the loss functions based on the results from the block 620 , as described above.
  • the computer 100 determines whether the termination condition is satisfied, as described above. If not, the process 600 proceeds to a block 635 to update the trainable components for a next iteration of training. If so, the process 600 proceeds to a block 640 .
  • the computer 100 updates the trainable components of the training architecture 400 based on the losses calculated in the block 625 , as described above.
  • the computer 100 updates the encoder 215 and the generator 225 .
  • the computer 100 may also update the discriminator 405 in the example of FIG. 4 or the discriminator 405 , the second encoder 510 , and the second generator 515 in the example of FIG. 5 .
  • the process 600 returns to the block 615 for the next iteration of training.
  • the computer 100 outputs the encoder 215 and the generator 225 in their current state, i.e., with the values of the weights as updated in the iterations of training. After the block 640 , the process 600 ends.
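  • Process 600 can be summarized as the training loop sketched below; the callables, loss handling, and termination constants are placeholders standing in for the components and blocks described above.

```python
import torch

def process_600(architecture, test_outputs, compute_loss, update, training_data,
                max_iterations: int = 10_000, loss_threshold: float = 0.05):
    """Sketch of process 600: iterate generate -> test -> loss -> update until termination."""
    for _ in range(max_iterations):
        output_videos = architecture(training_data)   # block 615: execute machine-learning architecture 200
        test_results = test_outputs(output_videos)    # block 620: discriminator 405 / comparator 505 tests
        loss = compute_loss(test_results)             # block 625: calculate the loss functions
        if loss < loss_threshold:                     # termination condition satisfied
            break
        update(loss)                                  # block 635: update the trainable components
    return architecture                               # block 640: output the trained encoder and generator

# Placeholder callables for illustration only.
trained = process_600(
    architecture=lambda data: data,
    test_outputs=lambda videos: videos.mean(),
    compute_loss=lambda score: float((1.0 - score).abs()),
    update=lambda loss: None,
    training_data=torch.rand(4, 16))
```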
  • the computing systems and/or devices described may employ any of a number of computer operating systems, including, but by no means limited to, versions and/or varieties of the Ford Sync® application, AppLink/Smart Device Link middleware, the Microsoft Automotive® operating system, the Microsoft Windows® operating system, the Unix operating system (e.g., the Solaris® operating system distributed by Oracle Corporation of Redwood Shores, California), the AIX UNIX operating system distributed by International Business Machines of Armonk, New York, the Linux operating system, the Mac OSX and iOS operating systems distributed by Apple Inc. of Cupertino, California, the BlackBerry OS distributed by Blackberry, Ltd. of Waterloo, Canada, and the Android operating system developed by Google, Inc.
  • computing devices include, without limitation, an on-board vehicle computer, a computer workstation, a server, a desktop, notebook, laptop, or handheld computer, or some other computing system and/or device.
  • Computing devices generally include computer-executable instructions, where the instructions may be executable by one or more computing devices such as those listed above.
  • Computer-executable instructions may be compiled or interpreted from computer programs created using a variety of programming languages and/or technologies, including, without limitation, and either alone or in combination, Java™, C, C++, Matlab, Simulink, Stateflow, Visual Basic, JavaScript, Python, Perl, HTML, etc. Some of these applications may be compiled and executed on a virtual machine, such as the Java Virtual Machine, the Dalvik virtual machine, or the like.
  • a processor receives instructions, e.g., from a memory, a computer readable medium, etc., and executes these instructions, thereby performing one or more processes, including one or more of the processes described herein.
  • Such instructions and other data may be stored and transmitted using a variety of computer readable media.
  • a file in a computing device is generally a collection of data stored on a computer readable medium, such as a storage medium, a random access memory, etc.
  • a computer-readable medium includes any non-transitory (e.g., tangible) medium that participates in providing data (e.g., instructions) that may be read by a computer (e.g., by a processor of a computer). Such a medium may take many forms, including, but not limited to, non-volatile media and volatile media. Instructions may be transmitted by one or more transmission media, including fiber optics, wires, wireless communication, including the internals that comprise a system bus coupled to a processor of a computer. Common forms of computer-readable media include, for example, RAM, a PROM, an EPROM, a FLASH-EEPROM, any other memory chip or cartridge, or any other medium from which a computer can read.
  • Databases, data repositories or other data stores described herein may include various kinds of mechanisms for storing, accessing, and retrieving various kinds of data, including a hierarchical database, a set of files in a file system, an application database in a proprietary format, a relational database management system (RDBMS), a nonrelational database (NoSQL), a graph database (GDB), etc.
  • Each such data store is generally included within a computing device employing a computer operating system such as one of those mentioned above, and is accessed via a network in any one or more of a variety of manners.
  • a file system may be accessible from a computer operating system, and may include files stored in various formats.
  • An RDBMS generally employs the Structured Query Language (SQL) in addition to a language for creating, storing, editing, and executing stored procedures, such as the PL/SQL language.
  • system elements may be implemented as computer-readable instructions (e.g., software) on one or more computing devices (e.g., servers, personal computers, etc.), stored on computer readable media associated therewith (e.g., disks, memories, etc.).
  • a computer program product may comprise such instructions stored on computer readable media for carrying out the functions described herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Probability & Statistics with Applications (AREA)

Abstract

A computer includes a processor and a memory, and the memory stores instructions executable by the processor to receive an input video of a scene and audio data associated with the input video, the input video being in a first domain; execute an encoder to map the input video and the audio data to a latent vector in a lower-dimensional latent space; and execute a generator to generate an output video of the scene from the latent vector, the output video being in a second domain. The encoder and the generator are trained to maintain temporal consistency between the input video and the output video by using the audio data.

Description

    BACKGROUND
  • Video can be acquired by image sensors such as cameras and processed using a computer to determine data regarding objects in an environment around a system. Operation of a sensing system can include acquiring accurate and timely data regarding objects in the system's environment. A computer can acquire video from one or more image sensors that can be processed to determine locations of objects. Object location data extracted from video can be used by a computer to operate systems including vehicles, robots, security, and object-tracking systems. Machine-learning algorithms can be used on board vehicles to operate advanced driver assistance systems (ADAS) or perform autonomous operation based on detecting objects in video, e.g., taken by cameras on board vehicles as the vehicles are driving.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram of an example system for collecting and analyzing video data.
  • FIG. 2 is a diagram of an example machine-learning architecture for generating an artificial video.
  • FIG. 3 is a process flow diagram of an example process for generating the artificial video.
  • FIG. 4 is a diagram of an example training architecture for training the machine-learning architecture to generate the artificial video.
  • FIG. 5 is a diagram of another example training architecture for training the machine-learning architecture to generate the artificial video.
  • FIG. 6 is a process flow diagram of an example process for training the machine-learning architecture to generate the artificial video.
  • DETAILED DESCRIPTION
  • This disclosure presents techniques for generating an artificial output video based on an input video. The output video has the same semantic content as the input video. For example, the semantic content in both the input video and the output video may show the point of view of a vehicle traveling along a particular stretch of road, with the same other vehicles, surrounding environment, etc. However, the output video has a changed domain from the input video. The domain may be changed from daytime to nighttime, clear to rainy, low-resolution to high-resolution, red-clear-clear-clear (RCCC) to red-green-blue (RGB), etc. Because the output video is artificial, i.e., synthetic, video in the new domain does not need to be manually collected. Having a dataset of videos in different domains can be useful for training a machine-learning algorithm for tasks such as object recognition that can depend on interpreting domain data. To that end, the techniques herein can provide diversity in training data and targeting of particular scenarios, both of which may contribute to a trained machine-learning algorithm with an enhanced ability to generalize.
  • An issue that can arise when generating the output video is maintaining a temporal consistency with the input video, e.g., ensuring that events occur or objects move at the same time in the output video as in the input video. As described herein, audio data is used to supervise the generation of the output video in order to maintain the temporal consistency. The audio can be generated contemporaneously with the input video. (As used herein, the term “video” refers to sequential visual data, not to audio data.) The audio data provides a way to track events in a manner that generally excludes domain data in order to conform the output video to the same temporal sequence as the input video. In particular, a computer is programmed to receive an input video of a scene in a first domain and audio data associated with the input video, execute an encoder to map the input video and the audio data to a latent vector in a lower-dimensional latent space, and execute a generator to generate an output video of the scene in a second domain from the latent vector. The encoder and the generator are trained to maintain temporal consistency between the input video and the output video by using the audio data. The use of the latent vector helps to strip away the domain data while preserving the semantic content of the input video.
  • A computer includes a processor and a memory, and the memory stores instructions executable by the processor to receive an input video of a scene and audio data associated with the input video, the input video being in a first domain, execute an encoder to map the input video and the audio data to a latent vector in a lower-dimensional latent space, and execute a generator to generate an output video of the scene from the latent vector, the output video being in a second domain. The encoder and the generator are trained to maintain temporal consistency between the input video and the output video by using the audio data.
  • In an example, the encoder and the generator may be supervised by a discriminator during training. In a further example, the discriminator may supervise the training of the encoder and the generator by testing a consistency of the output video with the audio data. In a yet further example, while training the encoder and the generator, the discriminator may use a correlation between the output video and the audio data to test the consistency of the output video with the audio data. In a still yet further example, while training the encoder and the generator, the discriminator may receive the correlation from a correlation module, and the correlation module may be pretrained with contrastive learning.
  • In another further example, the discriminator may supervise the training of the encoder and the generator by testing a consistency of the output video with the second domain. In a yet further example, the instructions may further include instructions to determine an adversarial loss based on an output of the discriminator and to update the encoder and the generator based on the adversarial loss.
  • In an example, the first domain and the second domain may be mutually exclusive environmental conditions of the scene. In a further example, the environmental conditions may be one of a lighting condition or a weather condition.
  • In an example, the first domain and the second domain may be mutually exclusive visual rendering characteristics of the input video and output video. In a further example, the visual rendering characteristics are one of a resolution, a color representation scheme, or simulatedness.
  • In an example, the instructions may further include instructions to extract visual features from the input video, and executing the encoder may be based on the visual features. In a further example, the instructions may further include instructions to extract audio features from the audio data and to fuse the visual features and the audio features, and executing the encoder may be based on the fusion of the visual features and the audio features.
  • In an example, the instructions may further include instructions to extract audio features from the audio data, and executing the encoder is based on the audio features.
  • In an example, the encoder may be trained to include semantic content of the input video in the latent vector and to exclude domain data of the input video from the latent vector.
  • In an example, the encoder may be a first encoder, the generator may be a first generator, the latent vector may be a first latent vector, and training the first encoder and the first generator may include executing a second encoder to map the output video and the audio data to a second latent vector in the lower-dimensional latent space, and executing a second generator to generate a test video of the scene from the second latent vector in the first domain. In a further example, training the first encoder and the first generator may include updating the first encoder and the first generator based on a difference between the test video and the input video.
  • In an example, the instructions may further include instructions to train a machine-learning model on training data, and the training data includes the output video. In a further example, the machine-learning model is an object-recognition model.
  • A method includes receiving an input video of a scene and audio data associated with the input video, the input video being in a first domain, executing an encoder to map the input video and the audio data to a latent vector in a lower-dimensional latent space, and executing a generator to generate an output video of the scene from the latent vector, the output video being in a second domain. The encoder and the generator are trained to maintain temporal consistency between the input video and the output video by using the audio data.
  • With reference to the Figures, wherein like numerals indicate like parts throughout the several views, a computer 100 includes a processor 105 and a memory 110, and the memory 110 stores instructions executable by the processor 105 to receive an input video 205 of a scene and audio data 210 associated with the input video 205, the input video 205 being in a first domain; execute an encoder 215 to map the input video 205 and the audio data 210 to a latent vector 220 in a lower-dimensional latent space; and execute a generator 225 to generate an output video 230 of the scene from the latent vector 220, the output video 230 being in a second domain. The encoder 215 and the generator 225 are trained to maintain temporal consistency between the input video 205 and the output video 230 by using the audio data 210.
  • With reference to FIG. 1 , the computer 100 is a microprocessor-based computing device, e.g., a generic computing device including the processor 105 and the memory 110, an electronic controller or the like, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a combination of the foregoing, etc. Typically, a hardware description language such as VHDL (Very High Speed Integrated Circuit Hardware Description Language) is used in electronic design automation to describe digital and mixed-signal systems such as FPGA and ASIC. For example, an ASIC is manufactured based on VHDL programming provided pre-manufacturing, whereas logical components inside an FPGA may be configured based on VHDL programming, e.g., stored in a memory 110 electrically connected to the FPGA circuit. The computer 100 can thus include the processor 105, the memory 110, etc. The memory 110 of the computer 100 can include media for storing instructions executable by the processor 105 as well as for electronically storing data and/or databases, and/or the computer 100 can include structures such as the foregoing by which programming is provided. The computer 100 can be multiple computers coupled together.
  • The computer 100 is connected to a network 115. The network 115 represents one or more mechanisms by which the computer 100 may communicate with remote servers.
  • Accordingly, the network 115 may be one or more of various wired or wireless communication mechanisms, including any desired combination of wired (e.g., cable and fiber) and/or wireless (e.g., cellular, wireless, satellite, microwave, and radio frequency) communication mechanisms and any desired network topology (or topologies when multiple communication mechanisms are utilized). Exemplary communication networks include wireless communication networks (e.g., using Bluetooth, IEEE 802.11, etc.), local area networks (LAN) and/or wide area networks (WAN), including the Internet, providing data communication services. The computer 100 can communicate with, e.g., a plurality of vehicles 120 including cameras 125 and microphones 130 and/or a plurality of standalone cameras 125 and microphones 130.
  • The vehicles 120 may be any passenger or commercial automobiles such as cars, trucks, sport utility vehicles, crossovers, vans, minivans, taxis, buses, etc.
  • The cameras 125 can detect electromagnetic radiation in some range of wavelengths. For example, the cameras 125 may detect visible light, infrared radiation, ultraviolet light, or some range of wavelengths including visible, infrared, and/or ultraviolet light. For example, each camera 125 can be a charge-coupled device (CCD), complementary metal oxide semiconductor (CMOS), or any other suitable type. For another example, the cameras 125 may be time-of-flight (TOF) cameras, which include a modulated light source for illuminating the environment and detect both reflected light from the modulated light source and ambient light to sense reflectivity amplitudes and distances to the scene.
  • The cameras 125 generate video data, which may be used as described below as the input video 205. The video data are sequences of image frames depicting the scenes contained in the fields of view of the respective cameras 125. Each image frame is a two-dimensional matrix of pixels. Each pixel has a brightness or color represented as one or more numerical values, e.g., a scalar unitless value of photometric light intensity between 0 (black) and 1 (white), or values for each of red, green, and blue, e.g., each on an 8-bit scale (0 to 255) or a 12- or 16-bit scale. The pixels may be a mix of representations, e.g., a repeating pattern of scalar values of intensity for three pixels and a fourth pixel with three numerical color values, or some other pattern. Position in the image frame, i.e., position in the field of view of the sensor at the time that the image frame was recorded, can be specified in pixel dimensions or coordinates, e.g., an ordered pair of pixel distances, such as a number of pixels from a top edge and a number of pixels from a left edge of the field of view.
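  • As a brief illustration of these pixel conventions (a sketch only; the 8-bit RGB format, the frame size, and the use of NumPy are illustrative assumptions), an image frame can be held as a rows-by-columns-by-channels array and indexed by pixel coordinates measured from the top and left edges:

```python
import numpy as np

# Hypothetical 720p RGB image frame: rows x columns x channels, 8-bit values 0-255.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)

# Pixel coordinates: number of pixels from the top edge, then from the left edge.
row, col = 100, 250
frame[row, col] = (255, 0, 0)            # set one pixel to pure red

# A scalar-intensity representation scales each pixel to a unitless value in [0, 1].
gray = frame.mean(axis=2) / 255.0
print(gray.shape, float(gray[row, col]))
```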
  • The video data includes semantic content and domain data. For the purposes of this disclosure, “semantic content” is defined as the physical features of the environment and the objects populating the environment within the scene captured in a video. For example, the semantic content of a video captured by a camera 125 mounted to a vehicle 120 may include the road, sidewalks, other terrain, buildings and infrastructure, other vehicles, pedestrians, bicyclists, etc.
  • For the purposes of this disclosure, a “domain” is a characteristic or condition affecting the scene as a whole or the image frame as a whole. Multiple domains may apply to a given video. For example, domains may include environmental conditions, e.g., lighting conditions such as daytime versus nighttime, overcast versus sunny, sun angle, etc., or weather conditions such as clear, foggy, rainy, snowy, etc. For another example, domains may include visual rendering characteristics such as resolution, color representation scheme, simulatedness, etc. The resolution is the sharpness of the image frame, e.g., as represented by the pixel density. The color representation scheme is how colors are represented pixel-by-pixel in the image frame, e.g., red-green-blue (RGB), red-clear-clear-clear (RCCC), etc. The simulatedness is the degree to which surfaces in the image frame appear realistic versus computer-generated, e.g., no textures, computer-generated textures, photorealistic, etc.
  • The input video 205 is in a first domain, and the output video 230 is in a second domain. The first domain and the second domain may be mutually exclusive domains of a same type. The first and second domains are mutually exclusive in that the first and second domains are different and nonoverlapping. A type of domain is a group of domains of which at most one domain applies to a given video or image frame. For example, a type can be cloud cover, and the possible domains of that type can include overcast, mostly cloudy, partly cloudy, and sunny. For another example, a type can be color representation scheme, and the possible domains of that type can be RGB, RCCC, and black-and-white.
  • The microphones 130 are transducers that convert sound to electrical signals. Each microphone 130 can be any suitable type, e.g., a dynamic microphone, which includes a coil of wire suspended in a magnetic field; a condenser microphone, which uses a vibrating diaphragm as a capacitor plate; a contact microphone, which uses a piezoelectric crystal; etc.
  • The microphones 130 generate the audio data 210. The audio data 210 is recorded sound data in any suitable format, e.g., a standard audio file format such as .wav. The audio data 210 includes sound as a function at least of time. For example, the audio data 210 may be represented as a spectrogram, which shows amplitude as a function of time and frequency. The audio data 210 is associated with the input video 205, e.g., is recorded in the scene depicted in the input video 205 contemporaneously with recording the input video 205. For example, the audio data 210 may be time-synchronized to the input video 205.
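  • A minimal sketch of producing such a spectrogram follows, using SciPy and a synthetic tone in place of recorded audio data 210; the sample rate and windowing parameters are illustrative assumptions:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16_000                                 # assumed sample rate of a microphone 130 (Hz)
t = np.arange(0, 2.0, 1 / fs)               # two seconds of audio
audio = 0.5 * np.sin(2 * np.pi * 440 * t)   # synthetic tone standing in for audio data 210

# Spectrogram: amplitude as a function of time and frequency.
freqs, times, Sxx = spectrogram(audio, fs=fs, nperseg=512, noverlap=256)
print(Sxx.shape)                            # (frequency bins, time frames)
```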
  • With reference to FIG. 2 , a machine-learning architecture 200 is shown that is trained to generate the output video 230 in the second domain based on the input video 205 in the first domain and based on the audio data 210 associated with the input video 205. The machine-learning architecture 200 may be trained for a specific first domain and second domain. (The training of the machine-learning architecture 200 is described below with respect to FIGS. 4-6 .) The machine-learning architecture 200 may include a visual feature extractor 235, an audio feature extractor 240, a fusion block 245, the encoder 215, and the generator 225. The visual feature extractor 235 extracts visual features, e.g., in the form of a visual feature map 250, from the input video 205. The audio feature extractor 240 extracts audio features, e.g., in the form of an audio feature map 255, from the audio data 210. The fusion block 245 fuses the visual feature map 250 and the audio feature map 255 and outputs a resulting fused feature map 260 to the encoder 215. The encoder 215 uses the fused feature map 260 to generate the latent vector 220. The generator 225 uses the latent vector 220 to generate the output video 230.
  • The computer 100 may be programmed to extract visual features from the video data. For the purposes of this disclosure, the term “feature” is used in its machine-learning sense as a piece of information about the content of a type of data (e.g., image, video, audio), specifically about whether a certain region of the data (e.g., image frame, spectrogram) has certain properties. Types of visual features may include edges, corners, blobs, etc. The visual features may take the form of the visual feature map 250. The visual feature map 250 includes a plurality of visual features. The visual feature map 250 provides locations in an image frame, e.g., in pixel coordinates, of the visual features. The visual feature map 250 has a reduced dimensionality compared to the image frame or video.
  • Extracting the visual features may include executing the visual feature extractor 235. The visual feature extractor 235 may include one or more suitable techniques for feature extraction, e.g., low-level techniques such as edge detection, corner detection, blob detection, ridge detection, scale-invariant feature transform (SIFT), etc.; shape-based techniques such as thresholding, blob extraction, template matching, Hough transform, generalized Hough transform, etc.; flexible methods such as deformable parameterized shapes, active contours, etc.; etc. The visual feature extractor 235 may include machine-learning operations. For example, the visual feature extractor 235 may include residual network (ResNet) layers followed by a convolutional neural network.
  • The computer 100 may be programmed to extract audio features from the audio data 210. The audio features may take the form of the audio feature map 255. The audio feature map 255 includes a plurality of audio features. The audio feature map 255 provides temporal positions or time-frequency positions in the audio data 210 of the audio features. The audio feature map 255 has a reduced dimensionality compared to the audio data 210.
  • Extracting the audio features may include executing the audio feature extractor 240. The audio feature extractor 240 may include one or more suitable techniques for feature extraction, e.g., low-, medium-, and/or high-level; instantaneous, segment-level, and/or global; time domain, frequency domain, and/or time-frequency representation; etc. The audio data 210 may be represented as a spectrogram, and the audio feature extractor 240 may treat the spectrogram as an image, with two-dimensional pixel coordinates representing time and frequency and a value of the pixel representing the amplitude at the respective time and frequency. The audio feature extractor 240 may thus perform similar analysis as described above for the visual feature extractor 235. The audio feature extractor 240 may include machine-learning operations. For example, the audio feature extractor 240 may include ResNet layers followed by a convolutional neural network.
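  • The sketch below shows feature extractors of the general kind described above, i.e., ResNet-style residual layers followed by a convolutional head, with the spectrogram treated as a one-channel image. The layer sizes, the per-frame processing, and the PyTorch implementation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """ResNet-style block: two 3x3 convolutions with a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.conv2(self.act(self.conv1(x))))

class FeatureExtractor(nn.Module):
    """Residual layers followed by a convolutional head, producing a feature map."""
    def __init__(self, in_channels: int, out_channels: int = 64):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, out_channels, 7, stride=2, padding=3)
        self.res = nn.Sequential(ResidualBlock(out_channels), ResidualBlock(out_channels))
        self.head = nn.Conv2d(out_channels, out_channels, 3, stride=2, padding=1)

    def forward(self, x):
        return self.head(self.res(torch.relu(self.stem(x))))

visual_extractor = FeatureExtractor(in_channels=3)   # RGB image frame
audio_extractor = FeatureExtractor(in_channels=1)    # spectrogram treated as an image

frame = torch.randn(1, 3, 256, 256)     # one frame of the input video 205 (placeholder)
spec = torch.randn(1, 1, 128, 128)      # spectrogram of the audio data 210 (placeholder)
visual_feature_map = visual_extractor(frame)
audio_feature_map = audio_extractor(spec)
print(visual_feature_map.shape, audio_feature_map.shape)
```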
  • The computer 100 may be programmed to fuse the visual features and the audio features, e.g., the visual feature map 250 and the audio feature map 255, resulting in the fused feature map 260. Data fusion is combining data from disparate sources together so that the resulting data has less uncertainty than if the data from each source were used individually, e.g., in creating a unified model of the surrounding environment. The data fusion may include concatenating the visual feature map 250 and the audio feature map 255. Alternatively or additionally, the data fusion can be performed with one or more algorithms, e.g., Kalman filter, central limit theorem, Bayesian networks, Dempster-Shafer, convolutional neural networks, etc.
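  • A minimal concatenation-based fusion sketch follows; resizing the audio feature map 255 to the spatial size of the visual feature map 250 before concatenation is an assumption made here for illustration, and the tensor shapes are placeholders:

```python
import torch
import torch.nn.functional as F

# Stand-ins for the visual feature map 250 and the audio feature map 255.
visual_feature_map = torch.randn(1, 64, 64, 64)
audio_feature_map = torch.randn(1, 64, 32, 32)

# Bring the audio map to the visual map's spatial size, then concatenate along the
# channel dimension to produce the fused feature map 260.
audio_resized = F.interpolate(audio_feature_map, size=tuple(visual_feature_map.shape[-2:]))
fused_feature_map = torch.cat([visual_feature_map, audio_resized], dim=1)
print(fused_feature_map.shape)   # torch.Size([1, 128, 64, 64])
```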
  • The computer 100 may be programmed to map the input video 205 and the audio data 210 to the latent vector 220 in the lower-dimensional latent space. The latent space has a lower dimensionality than the input video 205 has and/or than the fused feature map 260 has. The latent space includes dimensions for the semantic content and excludes domain data for the type of domain that the first domain and second domain are (e.g., without precipitation data when converting from clear to rainy, or without resolution data when converting from low resolution to high resolution). The latent vector 220 provides values for the semantic content in the latent space.
  • The computer 100 may be programmed to execute the encoder 215 to map the input video 205 and the audio data 210 to the latent vector 220. Executing the encoder 215 may be based on the visual features and the audio features, e.g., on the fusion of the visual features and the audio features, e.g., on the fused feature map 260. For example, the encoder 215 may be or include an artificial neural network such as a residual neural network (ResNet). The encoder 215 may include a sequence of layers and may reduce a dimensionality of the data from one layer to the next layer. The output of the final layer may be the latent vector 220.
  • The computer 100 may be programmed to execute the generator 225 to generate the output video 230 of the scene from the latent vector 220. The latent vector 220 may be an input to the generator 225. For example, the generator 225 may be or include an artificial neural network such as a ResNet. The generator 225 may include a series of layers and may increase a dimensionality of the data from one layer to the next. The output of the final layer may be the output video 230.
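  • The sketch below illustrates an encoder/generator pair of the general shape described above: the encoder reduces dimensionality layer by layer down to the latent vector, and the generator increases dimensionality layer by layer up to an output frame. The layer counts, the latent dimensionality, and the per-frame (rather than full-clip) processing are illustrative assumptions; a practical implementation could instead use deeper ResNet-based networks operating on whole videos:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a fused feature map to a latent vector in a lower-dimensional latent space."""
    def __init__(self, in_channels: int = 128, latent_dim: int = 256):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.to_latent = nn.Linear(32 * 4 * 4, latent_dim)

    def forward(self, fused):
        return self.to_latent(self.down(fused).flatten(1))

class Generator(nn.Module):
    """Expands a latent vector back up to an image frame in the second domain."""
    def __init__(self, latent_dim: int = 256, out_channels: int = 3):
        super().__init__()
        self.from_latent = nn.Linear(latent_dim, 32 * 8 * 8)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, out_channels, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.up(self.from_latent(z).view(-1, 32, 8, 8))

encoder, generator = Encoder(), Generator()
fused_feature_map = torch.randn(1, 128, 64, 64)   # placeholder fused feature map 260
latent = encoder(fused_feature_map)               # latent vector 220
frame_out = generator(latent)                     # one frame of the output video 230
print(latent.shape, frame_out.shape)
```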
  • The encoder 215 and the generator 225 are trained to maintain the same semantic content in the output video 230 as in the input video 205 while changing the domain from the first domain to the second domain. The encoder 215 and the generator 225 may be trained for a specific first domain and second domain of the same type. The encoder 215 and the generator 225 are further trained to maintain temporal consistency between the input video 205 and the output video 230, e.g., an event involving the semantic content occurs at the same time in both the input video 205 and the output video 230, by using the audio data 210, e.g., as included in the fused feature map 260.
  • FIG. 3 is a process flow diagram illustrating an example process 300 for generating the output video 230. The memory 110 of the computer 100 stores executable instructions for performing the steps of the process 300 and/or programming can be implemented in structures such as mentioned above. As a general overview of the process 300, the computer 100 receives the input video 205 and the associated audio data 210, extracts the visual and audio features from the input video 205 and audio data 210, fuses the visual and audio features, maps the fused data to the latent vector 220, generates the output video 230 from the latent vector 220, and trains a machine-learning model with the output video 230 as training data.
  • The process 300 begins in a block 305, in which the computer 100 receives the input video 205 of the scene in the first domain and the audio data 210 associated with the input video 205, as described above.
  • Next, in a block 310, the computer 100 extracts the visual features from the input video 205, as described above.
  • Next, in a block 315, the computer 100 extracts the audio features from the audio data 210, as described above.
  • Next, in a block 320, the computer 100 fuses the visual features and the audio features, as described above.
  • Next, in a block 325, the computer 100 executes the encoder 215 to map the input video 205 and the audio data 210 to the latent vector 220 in the lower-dimensional latent space, as described above.
  • Next, in a block 330, the computer 100 executes the generator 225 to generate the output video 230 of the scene in the second domain from the latent vector 220, as described above.
  • Next, in a block 335, the computer 100 trains a machine-learning model on training data that includes the output video 230. The machine-learning model may be an object-recognition model, e.g., using a convolutional neural network. The training data includes the output video 230. The training data may further include the input video 205 and/or similarly generated output videos 230 and associated input videos 205. Once trained, the machine-learning model may be installed on a vehicle computer of a vehicle 120, and the vehicle computer may be programmed to actuate the vehicle 120 based on classifications outputted by the machine-learning model, e.g., by actuating one or more of a propulsion system, a brake system, and/or a steering system of the vehicle 120. For example, the vehicle computer may use the classification of an object as a particular type of road user to determine an amount of lateral distance to maintain when passing the road user, e.g., a greater distance for a bicycle than for a motor vehicle. Because of the different domains represented in the training data used for training the machine-learning model, i.e., both the first domain and the second domain, the machine-learning model may classify objects accurately in a greater variety of conditions, thereby enhancing the autonomous operation of the vehicle 120. After the block 335, the process 300 ends.
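  • A compact sketch of block 335 follows, assuming a small convolutional classifier, placeholder labels, and frames drawn from both domains; a real object-recognition model, dataset, and label set would be substantially larger:

```python
import torch
import torch.nn as nn

# Train an object-recognition classifier on frames drawn from both the input videos
# (first domain) and the generated output videos (second domain). The data and the
# five-class label set here are placeholders.
classifier = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 5),
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

frames = torch.randn(8, 3, 64, 64)    # mixed-domain training frames (placeholder)
labels = torch.randint(0, 5, (8,))    # placeholder object-class labels

for _ in range(3):                    # a few illustrative training steps
    optimizer.zero_grad()
    loss = loss_fn(classifier(frames), labels)
    loss.backward()
    optimizer.step()
```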
  • With reference to FIGS. 4-5 , a training architecture 400 is shown for training the encoder 215 and the generator 225 to generate the output video 230 in the second domain based on the input video 205 in the first domain and based on the audio data 210 associated with the input video 205. The encoder 215 and the generator 225 are trained to maintain temporal consistency between the input video 205 and the output video 230 by using the audio data 210. The encoder 215 is trained to include the semantic content of the input video 205 in the latent vector 220 and to exclude domain data from the first domain of the input video 205 from the latent vector 220. The generator 225 is trained to construct the output video 230 including the semantic content from the latent vector 220 with the second domain applied. The encoder 215 and the generator 225 are trained for a specific first domain and a specific second domain of the same type as the first domain, e.g., daytime and nighttime, respectively; i.e., the encoder 215 and the generator 225 are trained to convert from one specific domain to another specific domain of the same type. This is beneficial so that the encoder 215 may be trained to exclude domain data from the type of domain of the first domain but not necessarily exclude domain data from other types of domains.
  • The training data for training the encoder 215 and the generator 225 may include a plurality of input videos 205 in the first domain and respective audio data 210 associated with the input videos 205. The training data may further include videos or images in the second domain that are not paired with the input videos 205 in the first domain, i.e., are of different scenes than captured in the input videos 205. Because the training architecture 400 uses the audio data 210 for supervision, the training architecture 400 does not need a video in the second domain of the same scene as an input video 205 in the first domain.
  • The training of the encoder 215 and the generator 225 includes iterative execution of the encoder 215 and the generator 225 together, i.e., joint training of the encoder 215 and the generator 225. For each iteration, a loss function may be calculated by a discriminator 405 (in the example of FIG. 4) or a discriminator 405 and a comparator 505 (in the example of FIG. 5). The loss function is based on a score of the temporal consistency of the output video 230 with the audio data 210 and based on a score of the output video 230 being in the second domain. The loss function is used to update the encoder 215 and the generator 225, e.g., by adjusting weights within the encoder 215 and/or the generator 225. The training is completed in response to a termination condition being satisfied, e.g., the loss function being below a threshold value and/or the training elapsing for a preset number of iterations. The threshold value or the preset number of iterations may be chosen to indicate or produce accurate operation by the encoder 215 and the generator 225. Once the termination condition is satisfied, the encoder 215 and the generator 225 may be used as described above with respect to FIGS. 2 and 3.
  • With reference to FIG. 4 , the training architecture 400 may include a correlation module 410. The correlation module 410 may include a correlation visual feature extractor 415, a correlation audio feature extractor 420, and a contrastive-learning block 425. The correlation visual feature extractor 415 and the correlation audio feature extractor 420 may operate similarly to the visual feature extractor 235 and the audio feature extractor 240, respectively, as described above. The correlation visual feature extractor 415 may receive the output videos 230 as inputs each iteration. The correlation audio feature extractor 420 may receive the audio data 210 (unchanged) as an input each iteration. The contrastive-learning block 425 receives the outputs from the correlation visual feature extractor 415 and the correlation audio feature extractor 420 as inputs. The contrastive-learning block 425 outputs a correlation score between the output video 230 of the current iteration and the audio data 210. The correlation score indicates a temporal closeness between events in the output video 230 and the audio data 210.
  • The correlation module 410, e.g., the contrastive-learning block 425, may be pretrained, i.e., trained separately from the rest of the training architecture 400 before training the encoder 215 and the generator 225. The correlation module 410, e.g., the contrastive-learning block 425, may be trained using contrastive learning using pairs of samples. Each pair includes a positive sample and a negative sample. The positive sample may be one of the input videos 205 along with the audio data 210 synched to the input video 205, and the negative sample may be the same input video 205 along with the audio data 210 time-shifted with respect to the input video 205. The correlation module 410 may thus be trained using easy-to-gather training data that does not require manual annotation.
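  • The sketch below shows one way such contrastive pretraining could look, with a margin-based objective standing in for whatever contrastive-learning formulation is used: embeddings of synchronized video/audio pairs are pulled together, while embeddings of pairs with time-shifted audio are pushed apart. The embedding networks, margin, and tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder embedding networks for short video clips and their spectrograms.
video_embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 128))
audio_embed = nn.Sequential(nn.Flatten(), nn.Linear(128 * 64, 128))
optimizer = torch.optim.Adam(
    list(video_embed.parameters()) + list(audio_embed.parameters()), lr=1e-3)

clips = torch.randn(4, 3, 8, 32, 32)                          # short video clips
audio_synced = torch.randn(4, 128, 64)                        # spectrograms recorded with the clips
audio_shifted = torch.roll(audio_synced, shifts=16, dims=-1)  # time-shifted negative samples

v = F.normalize(video_embed(clips), dim=1)
a_pos = F.normalize(audio_embed(audio_synced), dim=1)
a_neg = F.normalize(audio_embed(audio_shifted), dim=1)

# Margin-based contrastive loss: synced pairs score high, shifted pairs score low.
margin = 0.5
loss = F.relu(margin - (v * a_pos).sum(dim=1) + (v * a_neg).sum(dim=1)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# After pretraining, the correlation score for an output-video/audio pair can be the
# cosine similarity of their embeddings.
correlation_score = (v * a_pos).sum(dim=1)
```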
  • The encoder 215 and the generator 225 may be supervised by the discriminator 405 during training. For example, the generator 225 and the discriminator 405 may together constitute a generative adversarial network. The discriminator 405 may be a machine-learning model, e.g., a convolutional neural network. The discriminator 405 may be trained to classify inputs as (a) temporally consistent or (b) temporally inconsistent. The discriminator 405 may also be trained to classify inputs as (a) actual videos in the second domain or (b) output videos 230 in the second domain from the generator 225. The discriminator 405 may be jointly trained with the generator 225 (and the encoder 215). The inputs to the discriminator 405 are the output videos 230 generated by the generator 225 each iteration.
  • The discriminator 405 may supervise the training of the encoder 215 and the generator 225 by testing a consistency of the output video 230 with the audio data 210. For example, the discriminator 405 may receive the correlation score from the correlation module 410 each iteration. The classification outputted by the discriminator 405 may be based on the correlation score as well as on the output videos 230 and the audio data 210 as direct inputs. The computer 100 determines an adversarial loss based on the accuracy of the classification by the discriminator 405. The computer 100 updates the encoder 215 and the generator 225 based on the adversarial loss, i.e., according to a generative-adversarial-network scheme, e.g., by adjusting weights within the encoder 215 and/or the generator 225. The computer 100 also updates the discriminator 405 based on the adversarial loss, i.e., according to a generative-adversarial-network scheme, e.g., by adjusting weights within the discriminator 405.
  • The discriminator 405 may supervise the training of the encoder 215 and the generator 225 by testing a consistency of the output video 230 with the second domain. For example, the discriminator 405 may receive the output videos 230 generated by the generator 225 and actual videos in the second domain, i.e., training data as described above. The discriminator 405 attempts to classify the videos as actual or as generated by the generator 225. The computer 100 determines an adversarial loss based on the accuracy of the classification. The computer 100 updates the encoder 215 and the generator 225 based on the adversarial loss, i.e., according to a generative-adversarial-network scheme, e.g., by adjusting weights within the encoder 215 and/or the generator 225. The computer 100 also updates the discriminator 405 based on the adversarial loss, i.e., according to a generative-adversarial-network scheme, e.g., by adjusting weights within the discriminator 405.
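  • A minimal sketch of this domain-consistency adversarial term follows, using standard binary-cross-entropy GAN losses as an assumed formulation; the discriminator architecture and the frame tensors are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder discriminator that scores a frame as real (second domain) or generated.
discriminator = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
)

real_second_domain = torch.randn(8, 3, 64, 64)  # unpaired real frames in the second domain
generated = torch.randn(8, 3, 64, 64)           # frames of output videos 230 from the generator

# Discriminator update: classify real frames as 1 and generated frames as 0.
d_loss = (F.binary_cross_entropy_with_logits(discriminator(real_second_domain), torch.ones(8, 1))
          + F.binary_cross_entropy_with_logits(discriminator(generated.detach()), torch.zeros(8, 1)))

# Encoder/generator update: fool the discriminator into scoring generated frames as real.
g_loss = F.binary_cross_entropy_with_logits(discriminator(generated), torch.ones(8, 1))
```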
  • With reference to FIG. 5 , the training architecture 400 may include a second encoder 510, a second generator 515, and the comparator 505, in addition to the discriminator 405. (For clarity, the encoder 215 and the generator 225 will be referred to as the first encoder 215 and the first generator 225, respectively, in the discussion of FIG. 5 .) The discriminator 405 may supervise the training of the first encoder 215 and the first generator 225 by testing a consistency of the output video 230 with the second domain, as described above in the example of FIG. 4 . In this example, the discriminator 405 is not used for testing the consistency of the output video 230 with the audio data 210, i.e., the temporal consistency. Alternatively, the training architecture 400 may further include the correlation module 410 described above, and the discriminator 405 may be used for both consistencies as described above. In that case, the discriminator 405 is trained jointly with the second encoder 510 and second generator 515 described below, as well as with the rest of the training architecture 400.
  • The second encoder 510 and the second generator 515 may have the same structure as the first encoder 215 and the first generator 225, respectively, as described above with respect to FIG. 2. The training architecture 400 may further include a second visual feature extractor 530, a second audio feature extractor 535, and a second fusion block 540, which operate in the same manner as described above for the visual feature extractor 235, the audio feature extractor 240, and the fusion block 245, respectively. The second encoder 510 may receive the output video 230 in the second domain from the first generator 225, as well as the audio data 210, after processing by the second visual feature extractor 530, the second audio feature extractor 535, and the second fusion block 540. The second encoder 510 may execute to map the output video 230 and the audio data 210 to a second latent vector 520 in the lower-dimensional latent space. The second generator 515 may execute to generate a test video 525 in the first domain from the second latent vector 520. In other words, the second encoder 510 and the second generator 515 are trained to convert video from the second domain to the first domain, the reverse of the first encoder 215 and the first generator 225. The audio data 210 inputted into the second encoder 510 is the same as the audio data 210 inputted into the first encoder 215.
  • The comparator 505 determines a difference between the test video 525 and the input video 205, e.g., a level of cycle consistency. The input video 205 and the test video 525 are both in the first domain. The comparator 505 may be a pretrained module. The computer 100 may calculate a loss function based on the difference, e.g., a cycle consistency loss as used in cycle-consistent adversarial networks such as cycleGAN. The computer 100 updates the first encoder 215, the first generator 225, the second encoder 510, and the second generator 515 based on the loss function, e.g., by adjusting weights within the first encoder 215, the first generator 225, the second encoder 510, and the second generator 515. The second generator 515 and the second encoder 510 may thus be jointly trained with the first encoder 215 and the first generator 225.
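  • A minimal sketch of the cycle-consistency term follows; the L1 penalty and its weighting are assumptions borrowed from cycle-consistent adversarial networks, whereas the description above only requires a difference between the test video 525 and the input video 205:

```python
import torch
import torch.nn.functional as F

input_video = torch.randn(1, 3, 8, 64, 64)   # input video 205, first domain (placeholder)
test_video = torch.randn(1, 3, 8, 64, 64)    # test video 525, mapped back to the first domain

lambda_cycle = 10.0                          # assumed weighting of the cycle term
cycle_loss = lambda_cycle * F.l1_loss(test_video, input_video)

# The cycle loss is combined with the adversarial term before updating both
# encoder/generator pairs, e.g., total_loss = g_loss + cycle_loss.
```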
  • FIG. 6 is a process flow diagram illustrating an example process 600 for training the machine-learning architecture 200. The memory 110 of the computer 100 stores executable instructions for performing the steps of the process 600 and/or programming can be implemented in structures such as mentioned above. As a general overview of the process 600, the computer 100 provides pretrained components of the training architecture 400 and generates the initial values of the trainable components of the training architecture 400, including the encoder 215 and the generator 225. The computer 100 then iteratively executes the training architecture 400. During each iteration, the computer 100 executes the machine-learning architecture 200 to generate the output videos 230, tests the output videos 230, and updates the training architecture 400 based on testing the output videos 230. Upon reaching a termination condition, the computer 100 outputs the encoder 215 and the generator 225 from the final iteration.
  • The process 600 begins in a block 605, in which the computer 100 provides pretrained components of the training architecture 400, e.g., the correlation module 410 in the example of FIG. 4 or the comparator 505 in the example of FIG. 5 . For example, the computer 100 may load components already installed on the computer 100.
  • Next, in a block 610, the computer 100 generates the initial encoder 215 and generator 225. For example, the computer 100 may assign weights at random to the structures described above for the encoder 215 and the generator 225. The computer 100 may also generate initial values for the other components of the training architecture 400 that will be trained, e.g., the discriminator 405 in the example of FIG. 4 or the second encoder 510, second generator 515, and discriminator 405 in the example of FIG. 5 . The computer 100 may assign weights at random to the structures described above for these components.
  • Next, in a block 615, the computer 100 executes the machine-learning architecture 200, including the encoder 215 and the generator 225, as described above. The inputs are the training data of input videos 205 in the first domain and the respective associated audio data 210.
  • Next, in a block 620, the computer 100 tests the output videos 230 generated in the block 615. For example, as described with respect to FIG. 4 , the computer 100 may execute the correlation module 410 to output correlation scores to the discriminator 405 and execute the discriminator 405 to classify the output videos 230 and other training videos. For another example, as described with respect to FIG. 5 , the computer 100 may execute the discriminator 405 to classify the output videos 230 and other training videos, execute the second encoder 510 and second generator 515 to output the test videos 525, and execute the comparator 505 to determine differences between the input videos 205 and the respective test videos 525.
  • Next, in a block 625, the computer 100 calculates the loss functions based on the results from the block 620, as described above.
  • Next, in a decision block 630, the computer 100 determines whether the termination condition is satisfied, as described above. If not, the process 600 proceeds to a block 635 to update the trainable components for a next iteration of training. If so, the process 600 proceeds to a block 640.
  • In the block 635, the computer 100 updates the trainable components of the training architecture 400 based on the losses calculated in the block 625, as described above. The computer 100 updates the encoder 215 and the generator 225. The computer 100 may also update the discriminator 405 in the example of FIG. 4 or the discriminator 405, the second encoder 510, and the second generator 515 in the example of FIG. 5 . After the block 635, the process 600 returns to the block 615 for the next iteration of training.
  • In the block 640, the computer 100 outputs the encoder 215 and the generator 225 in their current state, i.e., with the values of the weights as updated in the iterations of training. After the block 640, the process 600 ends.
  • In general, the computing systems and/or devices described may employ any of a number of computer operating systems, including, but by no means limited to, versions and/or varieties of the Ford Sync® application, AppLink/Smart Device Link middleware, the Microsoft Automotive® operating system, the Microsoft Windows® operating system, the Unix operating system (e.g., the Solaris® operating system distributed by Oracle Corporation of Redwood Shores, California), the AIX UNIX operating system distributed by International Business Machines of Armonk, New York, the Linux operating system, the Mac OSX and iOS operating systems distributed by Apple Inc. of Cupertino, California, the BlackBerry OS distributed by Blackberry, Ltd. of Waterloo, Canada, and the Android operating system developed by Google, Inc. and the Open Handset Alliance, or the QNX® CAR Platform for Infotainment offered by QNX Software Systems. Examples of computing devices include, without limitation, an on-board vehicle computer, a computer workstation, a server, a desktop, notebook, laptop, or handheld computer, or some other computing system and/or device.
  • Computing devices generally include computer-executable instructions, where the instructions may be executable by one or more computing devices such as those listed above. Computer executable instructions may be compiled or interpreted from computer programs created using a variety of programming languages and/or technologies, including, without limitation, and either alone or in combination, Java™, C, C++, Matlab, Simulink, Stateflow, Visual Basic, Java Script, Python, Perl, HTML, etc. Some of these applications may be compiled and executed on a virtual machine, such as the Java Virtual Machine, the Dalvik virtual machine, or the like. In general, a processor (e.g., a microprocessor) receives instructions, e.g., from a memory, a computer readable medium, etc., and executes these instructions, thereby performing one or more processes, including one or more of the processes described herein. Such instructions and other data may be stored and transmitted using a variety of computer readable media. A file in a computing device is generally a collection of data stored on a computer readable medium, such as a storage medium, a random access memory, etc.
  • A computer-readable medium (also referred to as a processor-readable medium) includes any non-transitory (e.g., tangible) medium that participates in providing data (e.g., instructions) that may be read by a computer (e.g., by a processor of a computer). Such a medium may take many forms, including, but not limited to, non-volatile media and volatile media. Instructions may be transmitted by one or more transmission media, including fiber optics, wires, wireless communication, including the internals that comprise a system bus coupled to a processor of a computer. Common forms of computer-readable media include, for example, RAM, a PROM, an EPROM, a FLASH-EEPROM, any other memory chip or cartridge, or any other medium from which a computer can read.
  • Databases, data repositories or other data stores described herein may include various kinds of mechanisms for storing, accessing, and retrieving various kinds of data, including a hierarchical database, a set of files in a file system, an application database in a proprietary format, a relational database management system (RDBMS), a nonrelational database (NoSQL), a graph database (GDB), etc. Each such data store is generally included within a computing device employing a computer operating system such as one of those mentioned above, and is accessed via a network in any one or more of a variety of manners. A file system may be accessible from a computer operating system, and may include files stored in various formats. An RDBMS generally employs the Structured Query Language (SQL) in addition to a language for creating, storing, editing, and executing stored procedures, such as the PL/SQL language.
  • In some examples, system elements may be implemented as computer-readable instructions (e.g., software) on one or more computing devices (e.g., servers, personal computers, etc.), stored on computer readable media associated therewith (e.g., disks, memories, etc.). A computer program product may comprise such instructions stored on computer readable media for carrying out the functions described herein.
  • In the drawings, the same reference numbers indicate the same elements. Further, some or all of these elements could be changed. With regard to the media, processes, systems, methods, heuristics, etc. described herein, it should be understood that, although the steps of such processes, etc. have been described as occurring according to a certain ordered sequence, such processes could be practiced with the described steps performed in an order other than the order described herein. It further should be understood that certain steps could be performed simultaneously, that other steps could be added, or that certain steps described herein could be omitted.
  • All terms used in the claims are intended to be given their plain and ordinary meanings as understood by those skilled in the art unless an explicit indication to the contrary is made herein. In particular, use of the singular articles such as “a,” “the,” “said,” etc. should be read to recite one or more of the indicated elements unless a claim recites an explicit limitation to the contrary. The adjectives “first” and “second” are used throughout this document as identifiers and are not intended to signify importance, order, or quantity. Use of “in response to” and “upon determining” indicates a causal relationship, not merely a temporal relationship.
  • The disclosure has been described in an illustrative manner, and it is to be understood that the terminology which has been used is intended to be in the nature of words of description rather than of limitation. Many modifications and variations of the present disclosure are possible in light of the above teachings, and the disclosure may be practiced otherwise than as specifically described.

Claims (20)

1. A computer comprising a processor and a memory, the memory storing instructions executable by the processor to:
receive an input video of a scene and audio data associated with the input video, the input video being in a first domain;
execute an encoder to map the input video and the audio data to a latent vector in a lower-dimensional latent space; and
execute a generator to generate an output video of the scene from the latent vector, the output video being in a second domain;
wherein the encoder and the generator are trained to maintain temporal consistency between the input video and the output video by using the audio data.
2. The computer of claim 1, wherein the encoder and the generator are supervised by a discriminator during training.
3. The computer of claim 2, wherein the discriminator supervises the training of the encoder and the generator by testing a consistency of the output video with the audio data.
4. The computer of claim 3, wherein, while training the encoder and the generator, the discriminator uses a correlation between the output video and the audio data to test the consistency of the output video with the audio data.
5. The computer of claim 4, wherein, while training the encoder and the generator, the discriminator receives the correlation from a correlation module, the correlation module being pretrained with contrastive learning.
6. The computer of claim 2, wherein the discriminator supervises the training of the encoder and the generator by testing a consistency of the output video with the second domain.
7. The computer of claim 6, wherein the instructions further include instructions to determine an adversarial loss based on an output of the discriminator and to update the encoder and the generator based on the adversarial loss.
8. The computer of claim 1, wherein the first domain and the second domain are mutually exclusive environmental conditions of the scene.
9. The computer of claim 8, wherein the environmental conditions are one of a lighting condition or a weather condition.
10. The computer of claim 1, wherein the first domain and the second domain are mutually exclusive visual rendering characteristics of the input video and output video.
11. The computer of claim 10, wherein the visual rendering characteristics are one of a resolution, a color representation scheme, or simulatedness.
12. The computer of claim 1, wherein
the instructions further include instructions to extract visual features from the input video; and
executing the encoder is based on the visual features.
13. The computer of claim 12, wherein
the instructions further include instructions to extract audio features from the audio data and to fuse the visual features and the audio features; and
executing the encoder is based on the fusion of the visual features and the audio features.
14. The computer of claim 1, wherein
the instructions further include instructions to extract audio features from the audio data; and
executing the encoder is based on the audio features.
15. The computer of claim 1, wherein the encoder is trained to include semantic content of the input video in the latent vector and to exclude domain data of the input video from the latent vector.
16. The computer of claim 1, wherein
the encoder is a first encoder;
the generator is a first generator;
the latent vector is a first latent vector; and
training the first encoder and the first generator includes executing a second encoder to map the output video and the audio data to a second latent vector in the lower-dimensional latent space, and executing a second generator to generate a test video of the scene from the second latent vector in the first domain.
17. The computer of claim 16, wherein training the first encoder and the first generator includes updating the first encoder and the first generator based on a difference between the test video and the input video.
18. The computer of claim 1, wherein
the instructions further include instructions to train a machine-learning model on training data; and
the training data includes the output video.
19. The computer of claim 18, wherein the machine-learning model is an object-recognition model.
20. A method comprising:
receiving an input video of a scene and audio data associated with the input video, the input video being in a first domain;
executing an encoder to map the input video and the audio data to a latent vector in a lower-dimensional latent space; and
executing a generator to generate an output video of the scene from the latent vector, the output video being in a second domain;
wherein the encoder and the generator are trained to maintain temporal consistency between the input video and the output video by using the audio data.
US18/065,672 2022-12-14 2022-12-14 Generating artificial video with changed domain Pending US20240202533A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US18/065,672 US20240202533A1 (en) 2022-12-14 2022-12-14 Generating artificial video with changed domain
DE102023133614.2A DE102023133614A1 (en) 2022-12-14 2023-11-30 GENERATE ARTIFICIAL VIDEO WITH CHANGED DOMAIN
CN202311626240.8A CN118200462A (en) 2022-12-14 2023-11-30 Generating artificial video with change fields

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US18/065,672 US20240202533A1 (en) 2022-12-14 2022-12-14 Generating artificial video with changed domain

Publications (1)

Publication Number Publication Date
US20240202533A1 true US20240202533A1 (en) 2024-06-20

Family

ID=91279017

Family Applications (1)

Application Number Title Priority Date Filing Date
US18/065,672 Pending US20240202533A1 (en) 2022-12-14 2022-12-14 Generating artificial video with changed domain

Country Status (3)

Country Link
US (1) US20240202533A1 (en)
CN (1) CN118200462A (en)
DE (1) DE102023133614A1 (en)

Also Published As

Publication number Publication date
DE102023133614A1 (en) 2024-06-20
CN118200462A (en) 2024-06-14

Legal Events

Date Code Title Description
AS Assignment

Owner name: FORD GLOBAL TECHNOLOGIES, LLC, MICHIGAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:PERINCHERRY, AKHIL;CHAND, ARPITA;REEL/FRAME:062082/0477

Effective date: 20220912

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION