WO2021063476A1 - Method for training a generative adversarial network, modified image generation module and system for detecting features in an image - Google Patents

Method for training a generative adversarial network, modified image generation module and system for detecting features in an image Download PDF

Info

Publication number
WO2021063476A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
generator
training
discriminator
ssd
Prior art date
Application number
PCT/EP2019/076434
Other languages
English (en)
Inventor
Luca MINCIULLO
Sven Meier
Norimasa Kobori
Fabian MANHARDT
Original Assignee
Toyota Motor Europe
Technical University Of Munich
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toyota Motor Europe, Technical University Of Munich filed Critical Toyota Motor Europe
Priority to PCT/EP2019/076434 priority Critical patent/WO2021063476A1/fr
Priority to DE112019007762.7T priority patent/DE112019007762T5/de
Publication of WO2021063476A1 publication Critical patent/WO2021063476A1/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/60Extraction of image or video features relating to illumination properties, e.g. using a reflectance or lighting model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure concerns the detection of features in images, more specifically images of outer scenes, in which identification of features can become difficult because of the variety of the lighting conditions to which the scene is exposed.
  • Cameras have become a very common sensor for acquiring information. Indeed, the high resolution images they provide can be processed to extract information which can be used in many subsequent applications. In this process accordingly, the images outputted by the cameras are not used directly; they are processed by a processor executing a program which processes the image(s) so as to extract useful information therefrom.
  • the processor and its image-processing program constitute an image-based detector, suitable for outputting useful information on the basis of one or more images (or 'input images') provided to the processor.
  • the images which can be processed by such an image-based detector can be obtained by a camera (CCD or CMOS cameras, operating in the visible, infrared, etc. range), or more generally by any sensor capable of outputting an image of its environment.
  • the images can also be calculated by a computer.
  • Image-based detectors of the type defined above can be of many different types, depending on the type of information which has to be detected in the images.
  • the image-based detector can be for instance an object detector, which will output information relative to the presence of object(s) in the images, or of specific types of objects in the images; in some implementations, the output information will include the bounding boxes of the objects detected in the images.
  • deep neural networks can be used.
  • SSD-GAN: Single Shot Detector - Generative Adversarial Network
  • While the above-presented process (image acquisition with a camera, and subsequent processing by an image-based detector) has already been successfully implemented in a number of use cases, it has appeared however that one of the biggest challenges when dealing with real world images is the large variation of appearance of physical objects, people and environments, leading in some cases to lower-than-expected performance.
  • a first solution to overcome this problem has been to perform a pre-processing of the images, in order to normalize the illumination in the images, before inputting them into the image-based detector.
  • This pre-processing included traditionally changing the light histogram of the images.
  • however, this function, like the traditional contrast normalization techniques, appeared insufficient to correctly compensate for the differences in lighting conditions in real world images, resulting in poor performance.
  • a modified image generation module is proposed.
  • This module is configured to generate a modified image of a scene, based on an initial image of the scene.
  • This module comprises a generator which is the generator of a generative adversarial network (GAN), which generative adversarial network is trained using a training method which will be described below.
  • this modified image generation module is configured to input an initial image acquired or calculated under any lighting conditions, and to output a modified image which represents the same scene, but as it appears under 'optimal lighting conditions'.
  • 'optimal lighting conditions' are the lighting conditions which are assumed to be best suited for carrying out the image-based process of the image-based detector.
  • the modified image generator is capable of performing a pre-processing of an image so as to output a modified image which is very close to an image of the same scene, but under the optimal lighting conditions. Therefore, advantageously, when modified images obtained thanks to this pre-processing (rather than the raw initial images) are inputted to the image-based detector, it is possible to obtain a high level of performance of the image-based detector, despite the diversity of lighting conditions of the initial images.
  • the modified image generation module can be coupled to an image- based detection module in order to detect features in an image.
  • the system for detecting features in an image obtained in this manner includes the above-defined modified image generation module, and an image-based detection module comprising an image-based detector configured to process the modified image of the scene outputted by the modified image generation module.
  • the generator is coupled to a discriminator so as to form a generative adversarial network.
  • the generator is an image encoder-decoder configured to input an input image of a scene, and to output a modified image of the scene suitable for being processed by an image-based detector.
  • the training method of the generative adversarial network comprises: B30) training the generative adversarial network by alternately training the discriminator and the generator; the generator being trained to produce, for each input image of an 'any lighting' training set of input images of one or more scenes acquired or calculated under arbitrary lighting conditions, a modified image representing the same scene as said input image but under predetermined first lighting conditions; the generator being trained using a generator loss function which comprises a detector loss term; the detector loss term being calculated based on an output of the image-based detector.
  • "alternatively training the discriminator and the generator” includes of course repeating the steps of training the discriminator one or more times, and then training the generator one or more times.
  • the detector loss term is preferably calculated so as to represent how well the image-based detector performs its function on the basis of a modified image.
  • the detector loss term is calculated by comparing an output of the image-based detector based on a Ground Truth Image corresponding to said input image and acquired or obtained under the first lighting conditions, and an output of the image-based detector based on the modified image obtained on the basis of said each input image.
  • the discriminator is trained, for each input image it receives (which can be for instance a modified image outputted by the generator), to determine whether the input image is a real image or a fake image, wherein a real image is an image which was initially acquired or calculated under the optimal conditions (that is, under the first lighting conditions), and conversely, a fake image is a modified image outputted by the generator.
  • the generator learns to produce modified images which cannot be distinguished - at least by the discriminator - from the real images. Consequently, when the training has been completed, the generator produces images which appear to represent the scene under the optimal (or first) lighting conditions. Since the optimal conditions are lighting conditions under which the processing performed by the image-based detector works best, the pre-processing performed by the generator thus makes it possible for the image-based detector to process the modified images with a high performance level.
  • the generator loss function comprises the detector loss term, which term is calculated in particular on the basis of an output of the image-based detector when processing the modified image outputted by the modified image generator.
  • by including this detector loss term in the generator loss function, it has been possible to increase the performance of the generator; the image-based detector thus exhibits particularly good performance when processing modified images obtained with a generator trained using this generator loss function.
  • the generator loss function usually also includes other terms.
  • the generator loss function can also include an L1 norm as a reconstruction error.
  • Both the generator and the discriminator are essentially constituted by neural networks.
  • the generator can preferably be an encoder-decoder. It can in particular have a U-net architecture (A U-net architecture is described for instance by publication [6]).
  • An image encoder-decoder is a neural network essentially comprising an encoder and a decoder, and configured to learn a compressed representation of an image.
  • An encoder-decoder comprises two parts:
  • an encoder, which learns a representation of the image using fewer neurons than the input image; and a decoder, which reconstructs an image from this compressed representation.
  • a U-net is an image encoder-decoder modified by adding skip connections which directly connect encoder layers to decoder layers: typically, a layer at a position i is connected to the layer at position n-i, where n is the total number of layers.
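  • As a purely illustrative sketch, such a U-net-style encoder-decoder could be organised as follows; the number of resolution levels, the channel widths and the activations are assumptions made for this sketch, not features of the present disclosure:

```python
# Minimal U-net-style generator sketch (illustrative assumptions: three
# resolution levels, arbitrary channel widths, sigmoid output).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # convolution-BatchNorm-ReLU module
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class UNetGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: learns a compressed representation of the input image.
        self.enc1 = conv_block(3, 32)
        self.enc2 = conv_block(32, 64)
        self.bottleneck = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        # Decoder: reconstructs the modified image from that representation.
        self.up2 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec2 = conv_block(128, 64)  # 64 upsampled + 64 from the skip connection
        self.up1 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = conv_block(64, 32)   # 32 upsampled + 32 from the skip connection
        self.out = nn.Conv2d(32, 3, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        # Skip connections: encoder layer i is concatenated with decoder layer n-i.
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return torch.sigmoid(self.out(d1))  # modified image, same size as the input

# Example: modified = UNetGenerator()(torch.rand(1, 3, 256, 256))
```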
  • the discriminator includes one or more neural networks used to calculate the loss function of the generator.
  • the discriminator comprises two sub-neural networks, that is, a global discriminator and a local discriminator.
  • the global discriminator is a neural network which outputs, based on the whole inputted image, an estimate of whether the inputted image is a fake or a true image.
  • the local discriminator comprises a pre-processing unit with a convolutional layer, and a main processing unit.
  • the pre-processing unit is configured to output patches, each patch being a representation of a sub-patch of the inputted image.
  • the main processing unit includes a neural network, and is configured to output, based on said patches, an estimate of whether the inputted image is a fake or a true image.
  • the patches can preferably be based on non-overlapping regions of the inputted image.
  • the global discriminator can be a neural network trained using a structured loss function which penalizes the output of the global discriminator only at a large scale; and the local discriminator is trained using a structured loss function which penalizes the output of the local discriminator only at a small scale smaller than the large scale.
  • a patch outputted by the pre-processing unit typically represents a local texture of the inputted image.
  • the global discriminator is configured to output an estimate based on the global resemblance of the whole image to a 'true' image, that is, an image obtained for normal lighting conditions.
  • the local discriminator is used, which forces the generator to take into account low frequency details (as they appear in the different regions) of the input image.
  • the local discriminator is therefore used to judge if an image is 'real' or 'fake' (that is, if an image is imaged under normal lighting conditions, or not) by looking locally at the texture of the image, rather than by taking into account large scale indices.
  • the global objective function of the GAN network may include an L2-loss term, calculated for the global discriminator.
  • the global objective function of the GAN network may include an L1-loss term, calculated for the local discriminator, which forces the GAN network to take into account the local scale (this is described in above-referenced publication [2]; refer in particular to section 3.1).
  • the image-based detector is an object detector, in particular a neural network configured to detect objects in an image.
  • the object detector can in particular detect the bounding boxes of objects shown in the image.
  • the function of the image-based detector can more broadly be any computer vision task: activity recognition, semantic segmentation, SLAM, etc.
  • the generator loss function further takes into account a perceptual loss, wherein the perceptual loss is an image-based function which increases when a loss of sharpness in the image increases.
  • the image-based function is based on information outputted respectively by two different layers of a convolutional neural network.
  • This convolutional neural network can be for instance a VGG16 neural network or a similar neural network.
  • the perceptual loss can be expressed by the sum of two terms, as follows:
  • the input image and the output image are fed into the VGG16 or similar neural network.
  • the outputs of a low-rank convolutional layer and of a high-rank convolutional layer are then collected; they are called respectively the "low level features" and the "high level features".
  • the perceptual loss can then be expressed as the sum of a term based on the low level features and a term based on the high level features of the two images considered (the full expression is given in the detailed training procedure below).
  • the method further comprises a step of creating a GAN training set, the GAN training set comprising: as input, images of at least one scene, wherein the at least one scene is imaged under a plurality of lighting conditions; and as ground truth, images of the at least one scene under the first lighting conditions.
  • the image-based neural network comprises a generator coupled to an image-based detector; the image-based detector is a neural network configured to process a modified image of the scene outputted by the generator; the generator is an image encoder-decoder configured to input an input image of a scene and to output a modified image of the scene suitable for being processed by the image-based detector; the generator is part of a generative adversarial network comprising the generator and a discriminator.
  • the method comprises the step of:
  • the method may further comprise the following optional steps for training the image-based detector:
  • the method may further comprise the following optional steps for optimizing the image-based detector: C10) creating a third training set comprising, as input, images outputted by the generator when processing the first training set, and as ground truth, the desired outputs of the image-based detector for the first images; C20) training the image-based detector on the third training set.
  • the desired outputs of the image-based detector can be for instance the actual bounding boxes of the various objects represented in the image.
  • the training method further comprises repeating at least once the step B) of training the generative adversarial neural network, and the steps C10) of creating the third training set and C20) of training the image-based detector.
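  • Purely as an illustration, the overall sequence of steps A, B and C10/C20, including the optional repetition, can be summarized by the following sketch; the function and variable names are hypothetical placeholders and not part of the present disclosure:

```python
# High-level training pipeline sketch; `train_detector`, `train_gan` and
# `generator` are hypothetical callables supplied by the caller.
def train_full_pipeline(train_detector, train_gan, generator,
                        ssd_ts1, gan_ts, num_outer_rounds=2):
    # A10/A20: initial training of the image-based detector on images
    # acquired or calculated under the optimal (first) lighting conditions.
    train_detector(ssd_ts1)
    for _ in range(num_outer_rounds):
        # B30: alternately train the discriminator and the generator of the GAN,
        # using a generator loss that includes the detector loss term.
        train_gan(gan_ts)
        # C10: third training set = generator outputs paired with the ground
        # truth bounding boxes of the first training set.
        ssd_ts3 = [(generator(image), boxes) for image, boxes in ssd_ts1]
        # C20: retrain the detector on these modified images.
        train_detector(ssd_ts3)
```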
  • the proposed training method is determined by computer program instructions.
  • an object of the present disclosure is to propose a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the steps of such a method.
  • the computer program is preferably stored on a non-transitory computer-readable storage medium.
  • the computer program may use any programming language, and be in the form of source code, object code, or code intermediate between source code and object code, such as in a partially compiled form, or in any other desirable form.
  • the computer may be any data processing means, for instance a personal computer, an electronic control unit configured to be mounted in a car, etc.
  • the present disclosure also includes a non-transitory computer readable medium having the computer program stored thereon.
  • the computer-readable medium may be an entity or device capable of storing the program.
  • the computer-readable medium may comprise storage means, such as a read only memory (ROM), e.g. a compact disk (CD) ROM, or a microelectronic circuit ROM, or indeed magnetic recording means, e.g. a floppy disk or a hard disk.
  • the computer-readable medium may be an integrated circuit in which the program is incorporated, the circuit being adapted to execute or to be used in the execution of the control method in question.
  • Another object of the present disclosure is to propose a data-processing apparatus configured for carrying out the steps of the above-defined training method.
  • Fig. 1 shows a schematic representation of an embodiment of an image- based detector
  • Fig. 2 shows a schematic representation of an input image to be processed by the image-based detector of Fig. 1;
  • Fig. 3 shows a schematic representation of a generative adversarial network according to an embodiment of the present disclosure, during training of the discriminator thereof;
  • Fig. 4 shows a schematic representation of the generative adversarial network of Fig.3, during training of the generator thereof;
  • Fig. 5 shows a block diagram of a training method according to an embodiment of the present disclosure.
  • Fig.6 shows a schematic representation of a vehicle, comprising a data processing apparatus according to an embodiment of the present disclosure.
  • the neural network architectures according to the present disclosure are based essentially on a generative adversarial network GAN, complemented by auxiliary loss-function calculation modules, including an image-based detector SSD.
  • the image-based detector is an algorithm capable of determining the bounding boxes of the objects shown in an image.
  • the methods and systems of the present disclosure can be implemented using an image-based detector configured to perform another function than object detection.
  • the image-based detector is a single-shot detector SSD, configured to detect objects in an image (hereinafter: the 'SSD network'). It can be for instance the single-shot multibox detector proposed by publication [3] (refer in particular to section 2.1 and Fig.2).
  • the SSD network is configured to be fed with input images (II), and to output the predicted bounding boxes (PBB) of the objects detected in the input image.
  • Fig. 1 shows an image IIA, which is the first image of a set of input images II, being inputted into the SSD network.
  • This image IIA represents two apples (Fig.2).
  • the SSD network processes image IIA and outputs the predicted values PBB1 and PBB2 of the bounding boxes ((u1min, v1min); (u1max, v1max)) and ((u2min, v2min); (u2max, v2max)) of the representations of the two apples in image IIA.
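  • Purely to illustrate the notation, the predicted output for such an image could be represented as follows; the coordinate values are made up for the example:

```python
# Illustrative representation of the SSD output for image IIA; each bounding
# box is ((u_min, v_min), (u_max, v_max)) in pixel coordinates (values made up).
PBB1 = ((120, 80), (260, 210))   # predicted bounding box of the first apple
PBB2 = ((300, 95), (430, 220))   # predicted bounding box of the second apple
PBB = [PBB1, PBB2]               # set of predicted bounding boxes for the image
```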
  • the SSD network outputs a set of bounding boxes PBB.
  • the generative adversarial network GAN comprises a first neural network G which constitutes the generator, and two neural networks, namely a global discriminator GD and a local discriminator LD, which constitute the discriminator (Figs.3, 4).
  • the generator G is an encoder-decoder configured to input an input image II of a scene, and to output a modified image MI of the scene.
  • the modified image MI has the same size (height and width) as the input image II.
  • the input image II can of course have been acquired or calculated with the scene being under any lighting conditions.
  • the generator G is trained so that the modified image MI is as close as possible to an image of the scene as it would appear under the first lighting conditions. That is, the generator G is configured to correct the lighting of the scene so as to output a modified image in which the lighting conditions are modified so as to correspond to the optimal lighting conditions.
  • the generator G can have for instance an encoder-decoder architecture.
  • it has a U-net architecture, that is, an encoder-decoder architecture complemented by skip connections between the encoder and the decoder layers.
  • the discriminator D comprises, in addition to the global discriminator GD, an additional discriminator called the 'local discriminator', referenced LD.
  • Both of these two networks are configured to input a first image, which is a 'raw' image of the scene, obtained under any lighting conditions, and a second image, which is an image of the same scene, but now in the optimal lighting conditions.
  • This second image can be of two types: It can be either a ground truth image, that is, an image of the scene under the optimal lighting conditions, or a modified image MI, outputted by the generator G (which image is supposed to be in the optimal lighting conditions).
  • each of these networks outputs an estimate, respectively E_GD and E_LD (each a real number in the [0-1] range), which expresses to what extent the second image is likely to be the ground truth image, rather than a modified image.
  • the discriminator networks GD and LD can be configured to output binary values instead of values in a range (in the [0-1] range, in the present embodiment).
  • Each of the networks G, GD and LD is formed of modules of the form convolution-BatchNorm-ReLU.
  • the local discriminator LD comprises two units, a pre-processing unit LD1, which comprises a convolutional layer LD11, and a main processing unit LD2.
  • the pre-processing unit LD1 is configured to output patches. Each of these patches is a representation of a sub-patch of the inputted image. In this embodiment, each input image is subdivided into eight sub-patches p11-p41 and p12-p42, called collectively the sub-patches p. Accordingly, for each sub-patch p of the input image II, the pre-processing unit LD1 outputs a patch. This patch is based only on the sub-patch p considered and represents its texture.
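  • A minimal sketch of such a patch-based local discriminator, covering the pre-processing unit LD1 and the main processing unit LD2 described in the next paragraph, is given below. The 2x4 grid of sub-patches, the channel width and the pooling of each sub-patch into a single descriptor are assumptions of this sketch; the sketch also scores a single image, whereas the discriminators described above receive image pairs:

```python
# Illustrative sketch of the local discriminator (LD1 + LD2); grid size,
# channel width and the per-patch pooling are assumptions.
import torch
import torch.nn as nn

class LocalDiscriminator(nn.Module):
    def __init__(self, grid=(2, 4), channels=32):
        super().__init__()
        self.grid = grid
        # LD1: convolutional pre-processing producing a texture representation.
        self.ld1 = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # LD2: main processing unit scoring the image from the patch descriptors.
        self.ld2 = nn.Sequential(
            nn.Linear(channels * grid[0] * grid[1], 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, 1),
            nn.Sigmoid(),  # estimate E_LD in the [0, 1] range
        )

    def forward(self, image):
        features = self.ld1(image)                     # (B, C, H, W)
        rows, cols = self.grid
        b, c, h, w = features.shape
        # Split the feature map into non-overlapping sub-patches and pool each
        # one into a single vector describing the local texture of that region.
        patches = features.reshape(b, c, rows, h // rows, cols, w // cols)
        descriptors = patches.mean(dim=(3, 5))         # (B, C, rows, cols)
        return self.ld2(descriptors.flatten(1))        # E_LD per image

# Example: e_ld = LocalDiscriminator()(torch.rand(1, 3, 128, 256))
```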
  • the main processing unit LD2, on the basis of the eight patches outputted by the pre-processing unit LD1, outputs its estimate E_LD.
CONFIGURATION AND TRAINING PROCEDURE
  • the present disclosure may be implemented in particular to improve the performance of an existing image-based detector.
  • the image-based detector is a detector (namely, the SSD network) which can be trained.
  • a first step consists in an initial training of the image-based detector.
  • the first step consists in training the SSD network.
  • the SSD network is trained by carrying out the following steps A10, A20:
  • A10 - Creation of a first training set SSD-TS1: A first training set SSD-TS1 is generated for the initial training of the SSD network. This first training set SSD-TS1 serves to train the SSD network to determine the location of the bounding boxes of objects present in images.
  • the first training set SSD-TS1 comprises first input images II1 of several scenes. All these input images are acquired or calculated under the optimal lighting conditions.
  • these optimal lighting conditions are in the present case the lighting conditions which are considered as the most preferable to perform bounding box detection in images.
  • the ground truth is the correct or desired bounding boxes for the objects shown in images II1.
  • A20 - Training of the SSD network: The SSD network is trained using the first training set SSD-TS1.
  • In order to train these networks, a GAN training set GAN-TS is then created.
  • This training set comprises all the data used to train the generator and the discriminator.
  • This training set comprises a set of input images 112, and a set of 'Ground Truth Images' TGI.
  • the images of the second set of input images II2 are images acquired or calculated under a variety of (any) lighting conditions.
  • for each input image II2, the training set GAN-TS comprises a corresponding Ground Truth Image TGI, which is an image of the same scene, but under the optimal lighting conditions.
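  • As a simple illustration, such a paired training set could be organised as follows; the directory layout, with matching file names for the input image and its Ground Truth Image, is an assumption made for this sketch:

```python
# Illustrative paired dataset sketch: any-lighting input image II2 paired with
# the Ground Truth Image TGI of the same scene under the optimal lighting.
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class PairedLightingDataset(Dataset):
    def __init__(self, input_dir, gt_dir, size=256):
        self.input_dir, self.gt_dir = input_dir, gt_dir
        self.names = sorted(os.listdir(input_dir))  # assumed matching file names
        self.to_tensor = transforms.Compose([
            transforms.Resize((size, size)),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        any_lighting = Image.open(os.path.join(self.input_dir, name)).convert("RGB")
        optimal_lighting = Image.open(os.path.join(self.gt_dir, name)).convert("RGB")
        return self.to_tensor(any_lighting), self.to_tensor(optimal_lighting)
```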
  • the discriminators GD and LD are trained by performing the following actions (Figs. 3, 5): a) Pass the input images II2 of the GAN training set GAN-TS through the generator G and the discriminators GD, LD
  • Each input image II2 of the GAN training set GAN-TS (as a 'first image') is passed through the generator G, which yields as many modified images MI (as 'second images').
  • the global and local discriminators GD, LD each yield their estimate (estimates E_GD, E_LD) of whether the corresponding second image is an original image with optimal lighting, or a modified image.
  • a pair of images comprising both the input image II2 and the corresponding modified image MI outputted by the generator G (as the first and second images) is transmitted to the global discriminator GD and to the local discriminator LD.
  • pairs of images comprising an input image II2 and the corresponding ground truth image TGI (as first and second images) are also passed to the global and local discriminators GD and LD.
  • each of the global discriminator GD and the local discriminator LD outputs its estimate E_GD or E_LD that the second image it has received is a 'true' image, that is, a Ground Truth Image obtained under the optimal lighting conditions, rather than a 'fake' image, that is, an image produced by generator G to resemble an original image imaged under the optimal lighting conditions.
  • the values of the global discriminator loss and the local discriminator loss are then calculated, respectively for the global discriminator GD and for the local discriminator LD. These values are based on the outputs E_GD and E_LD of the global and local discriminators, which are compared to the true values TV expected from the global and local discriminators (these true values TV are annotations prepared beforehand in the training data-set GAN-TS).
  • the two loss functions, for the global discriminator and for the local discriminator, are sigmoid cross entropy loss functions.
  • the Discriminator loss is calculated as a weighted sum of the global discriminator loss and the local discriminator loss.
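  • A minimal sketch of this combined discriminator loss is given below, under the assumptions that the discriminators return raw logits (before the sigmoid) and that both terms receive equal weights; the helper names are hypothetical:

```python
# Sketch of the combined discriminator loss: sigmoid cross entropy for the
# global and local discriminators, combined as a weighted sum.
import torch
import torch.nn.functional as F

def _sigmoid_cross_entropy(logits, target_value):
    target = torch.full_like(logits, target_value)
    return F.binary_cross_entropy_with_logits(logits, target)

def discriminator_loss(e_gd_real, e_gd_fake, e_ld_real, e_ld_fake,
                       w_global=1.0, w_local=1.0):
    # Real pairs (input image + Ground Truth Image) should be scored 1,
    # fake pairs (input image + modified image MI) should be scored 0.
    global_loss = (_sigmoid_cross_entropy(e_gd_real, 1.0)
                   + _sigmoid_cross_entropy(e_gd_fake, 0.0))
    local_loss = (_sigmoid_cross_entropy(e_ld_real, 1.0)
                  + _sigmoid_cross_entropy(e_ld_fake, 0.0))
    return w_global * global_loss + w_local * local_loss
```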
  • the weights of the global discriminator GD and of the local discriminator LD are then updated by backpropagation.
  • the generator G is trained by performing the following actions (Figs. 4, 5): a) Pass the data again through the architecture
  • the modified images MI are then calculated using generator G, based on the input images II2 of the GAN training set GAN-TS.
  • the modified images MI are passed through the image-based detector SSD, which yields calculated or predicted values PBB of the bounding boxes of the objects identified in the modified image MI.
  • the pairs of images comprising an input image II2 and the corresponding modified image MI are passed through the updated global discriminator GD and the local discriminator LD, which yield updated values of their estimates E_GD and E_LD. b) Compute the Generator loss G_Loss
  • the Generator loss G_Loss is then calculated as follows:
  • G_Loss = L1 + Fool_global_discriminator + 0.5 * Fool_local_discriminator + Perceptual_Loss + SSD_loss
  • L1 is an L1 norm term.
  • the L1 loss term is calculated by comparing the Ground Truth Image GTI and the modified image MI outputted by generator G;
  • Perceptual_Loss is a term to take into account a proximity between a modified image MI and a Ground Truth Image GTI.
  • the proximity between these two images is preferably evaluated in a feature space. That is, rather than evaluating directly the proximity between these two images, a transformation is applied to each of these images, using a neural network. The proximity is then evaluated between the outputs of these respective transformations, rather than between the two images themselves.
  • the neural network used to apply the transformation is a neural network suitable for carrying out image recognition tasks. For instance, a neural network such as a VGG16 can be used in this purpose.
  • SSD_loss is a term which takes into account the capacity of generator G to output images for which the SSD network can efficiently (that is, with a small loss) determine the bounding boxes of the objects shown in the image.
  • the L1 term is calculated using the L1 norm, by comparing the Ground Truth Images GTI with the corresponding modified images MI outputted by the generator.
  • the two fool-discriminator terms are Fool_global_discriminator and Fool_local_discriminator, calculated respectively for the global discriminator GD and the local discriminator LD.
  • the local discriminator LD and the global discriminator GD output their estimates E_GD and E_LD that these images are ground truth images.
  • a sigmoid cross entropy loss is calculated between the estimate (E_GD or E_LD respectively) outputted by the considered discriminator and the 'real image labels', which are the real values assigned on the basis of the actual Ground Truth Images.
  • the value determined in this manner is the fool-discriminator term (respectively the Fool_global_discriminator and Fool_local_discriminator terms).
  • the Fool_global_discriminator and Fool_local_discriminator terms constrain the generator G to generate modified images MI that tend to be identified by the discriminators GD and LD as Ground Truth Images (with high values of E_GD and E_LD). Indeed, by comparing the estimates (E_GD or E_LD respectively) outputted by the discriminators with real image labels, the G_Loss loss function takes into account to what extent the discriminators estimate that the outputs of the generator G are Ground Truth Images. If the discriminators GD and LD output estimates indicating that the modified images MI are fake images, then the loss will be high (signaling that the generator G is not doing well); otherwise the loss will be low.
  • the perceptual loss is a term measuring the ability of the generator to reconstruct images that look similar to the input.
  • this neural network has a VGG16 architecture (this neural network is called, for the sake of simplicity, VGG16).
  • VGG16 can be for instance a pre-trained network, in particular trained on the Imagenet dataset as proposed in reference publication [4].
  • An example of VGG16 architecture is known for instance from publication [5].
  • the perceptual loss is calculated as:
  • Perceptual_Loss = 100 * ( ||low level features(GTI) - low level features(MI)|| + ||high level features(GTI) - high level features(MI)|| )
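  • A sketch of this computation with a pre-trained VGG16 from torchvision is given below; the choice of the convolutional layers from which the low level and high level features are read, and the use of a mean L1 distance between feature maps, are assumptions made for this illustration:

```python
# Illustrative perceptual loss: distances between low-level and high-level
# VGG16 features of the Ground Truth Image GTI and the modified image MI.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    def __init__(self, low_layer=4, high_layer=23, scale=100.0):
        super().__init__()
        layers = list(vgg16(weights="IMAGENET1K_V1").features.children())
        self.low = nn.Sequential(*layers[:low_layer])    # up to relu1_2 (assumed)
        self.high = nn.Sequential(*layers[:high_layer])  # up to relu4_3 (assumed)
        for p in self.parameters():
            p.requires_grad_(False)
        self.scale = scale

    def forward(self, gti, mi):
        low_term = torch.mean(torch.abs(self.low(gti) - self.low(mi)))
        high_term = torch.mean(torch.abs(self.high(gti) - self.high(mi)))
        return self.scale * (low_term + high_term)

# Example: loss = PerceptualLoss()(ground_truth_image, modified_image)
```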
  • the SSD loss takes the value outputted by the loss function of the image-based detector.
  • the image-based detector is the single-shot detector SSD proposed in publication [3]. Accordingly, the SSD loss is the loss function used to train this network. The loss term is therefore a weighted sum of a localization loss and a confidence loss.
  • the localization loss is a smooth L1 loss term between a predicted location and a correct location (ground truth location) of a bounding box.
  • a detailed presentation of the SSD loss function can be found in publication [3] (refer to function L presented in the section 'Training Objective').
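  • A heavily simplified sketch of such a detector loss term is given below; it assumes that the predicted boxes have already been matched to the ground truth boxes, and omits the anchor matching and hard negative mining described in publication [3]:

```python
# Simplified detector (SSD-style) loss sketch: weighted sum of a smooth-L1
# localization term and a cross-entropy confidence term.
import torch.nn.functional as F

def ssd_loss(pred_locs, gt_locs, pred_class_logits, gt_labels, alpha=1.0):
    # pred_locs, gt_locs: (N, 4) box coordinates for matched predictions;
    # pred_class_logits: (N, num_classes); gt_labels: (N,) class indices.
    localization = F.smooth_l1_loss(pred_locs, gt_locs)
    confidence = F.cross_entropy(pred_class_logits, gt_labels)
    return confidence + alpha * localization
```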
  • the SSD_loss term is calculated on the basis of the compared values of:
  • the various terms (L1, Fool_global_discriminator, Fool_local_discriminator, Perceptual_Loss, SSD_loss) are weighted to take into account their respective effects.
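  • Assuming per-term values computed as in the earlier sketches, the weighted combination could be assembled along the following lines; the default weights reproduce the values used in the present embodiment, and the helper names are illustrative:

```python
# Sketch of G_Loss as a weighted sum of its five terms.
import torch
import torch.nn.functional as F

def generator_loss(mi, gti, e_gd_fake, e_ld_fake, detector_loss_value,
                   perceptual_loss=None,
                   w_l1=1.0, w_global=1.0, w_local=0.5,
                   w_perceptual=1.0, w_ssd=1.0):
    # L1 reconstruction term between the Ground Truth Image and the modified image.
    l1_term = F.l1_loss(mi, gti)
    # 'Fool discriminator' terms: the discriminator logits obtained for the
    # modified images are compared with real-image labels (ones).
    fool_global = F.binary_cross_entropy_with_logits(
        e_gd_fake, torch.ones_like(e_gd_fake))
    fool_local = F.binary_cross_entropy_with_logits(
        e_ld_fake, torch.ones_like(e_ld_fake))
    # Perceptual term (e.g. the VGG16-based sketch above), if provided.
    perceptual_term = perceptual_loss(gti, mi) if perceptual_loss is not None else 0.0
    return (w_l1 * l1_term
            + w_global * fool_global
            + w_local * fool_local
            + w_perceptual * perceptual_term
            + w_ssd * detector_loss_value)
```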
  • the weights can be modified depending on the effect which is to be privileged. In the present embodiments, the weights found optimal are 1, 1, 0.5, 1 and 1, whereby the Fool_local_discriminator term is weighted half as much as the Fool_global_discriminator term. c) Back-propagate through Generator G
  • the weights of the generator G are then updated by backpropagation.
  • the Generator and the Discriminator are repeatedly trained, alternately, by performing the above-described operations.
  • the training is stopped when it is considered that the training algorithm has converged, or when stable values for the weights of these networks have been determined.
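  • A compact sketch of this alternation is given below; the step functions are hypothetical placeholders wrapping the loss computations and weight updates described above, and a fixed number of epochs stands in for a convergence criterion:

```python
# Alternating training loop sketch for step B30; `discriminator_step` and
# `generator_step` are hypothetical callables performing one optimisation step.
def train_gan(generator, discriminators, detector, dataloader,
              discriminator_step, generator_step, num_epochs=100):
    for epoch in range(num_epochs):  # stand-in for a convergence test
        for any_lighting_images, ground_truth_images in dataloader:
            # Train the discriminators (generator weights frozen).
            discriminator_step(generator, discriminators,
                               any_lighting_images, ground_truth_images)
            # Train the generator (discriminator weights frozen), including
            # the detector loss term computed with `detector`.
            generator_step(generator, discriminators, detector,
                           any_lighting_images, ground_truth_images)
```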
  • the SSD network is trained again on the basis of modified images outputted by the updated generator G.
  • the image-based detector SSD is thus trained by carrying out the following steps C10, C20:
  • a second training set SSD-TS2 is generated for the training of the SSD network.
  • the purpose of this second SSD-TS2 training set is to train the SSD network to determine the location of the bounding boxes of the objects present in an image.
  • While the first training set SSD-TS1 comprised images of scenes under normal lighting conditions, the input images included in the second training set SSD-TS2 are modified images outputted by generator G.
  • It has been observed indeed that training the image-based detector (in the present embodiment, the SSD network) on the modified images as outputted by generator G, rather than on original input images, increases the performance of the SSD network.
  • the second training set SSD-TS2 comprises modified images which are outputted by generator G based on the initial input images II1 of the first training set SSD-TS1. That is, instead of training the SSD network on original images imaged under the first lighting conditions (that is, under normal lighting conditions), the SSD network is trained on fake images, produced by generator G to resemble these original images.
  • the ground truth is the correct or desired bounding boxes for the objects shown by the input images.
  • FIG. 6 shows a car 1000 (an example of a vehicle) equipped with an automated driving system 500.
  • the automated driving system 500 comprises an image-based detection system 100 as an exemplary computerized system or data-processing apparatus on which the present disclosure may be implemented in whole or in part.
  • the image-based detection system 100 (or, in short, the system 100) comprises several sensor units, including in particular a forward-facing camera 110.
  • Camera 110 is mounted slightly above the windshield of the car on a mount (not shown).
  • the image-based detection system 100 includes a computer system 150 which comprises a storage 151, one or more processor(s) 152, a memory 153, an operating system 154 and a communication infrastructure 155.
  • the communication infrastructure 155 is a data bus to which all the above-mentioned sensor units are connected, and therefore through which the signals outputted by these sensor units are transmitted to the other components of system 100.
  • the storage 151, the processor(s) 152, the memory 153, and the operating system 154 are communicatively coupled over the communication infrastructure 155.
  • the computer system 150 may interact with a user, or environment, via input/output device(s) 156, as well as over one or more networks 157.
  • the operating system 154 may interact with other components to control one or more applications 158. All components of the image-based detection system 100 are shared or possibly shared with other units of the automated driving system 500 or of car 1000.
  • a computer program to perform object detection according to the present disclosure is stored in memory 153.
  • This program and the memory 153 are examples, respectively, of a computer program and a computer-readable recording medium pursuant to the present disclosure.
  • the memory 153 of the computer system 150 indeed constitutes a recording medium according to the invention, readable by the one or more processor(s) 152 and on which said program is recorded.
  • the systems and methods described herein can be implemented in software or hardware or any combination thereof.
  • the systems and methods described herein can be implemented using one or more computing devices which may or may not be physically or logically separate from each other.
  • the systems and methods described herein may be implemented using a combination of any of hardware, firmware and/or software.
  • the present systems and methods described herein (or any part(s) or function(s) thereof) may be implemented using hardware, software, firmware, or a combination thereof and may be implemented in one or more computer systems or other processing systems.
  • the present embodiments are embodied in machine-executable instructions.
  • the instructions can be used to cause a processing device, for example a general-purpose or special-purpose processor, which is programmed with the instructions, to perform the steps of the present disclosure.
  • the steps of the present disclosure can be performed by specific hardware components that contain hardwired logic for performing the steps, or by any combination of programmed computer components and custom hardware components.
  • the present disclosure can be provided as a computer program product, as outlined above.
  • the embodiments can include a machine-readable medium having instructions stored on it.
  • the instructions can be used to program any processor or processors (or other electronic devices) to perform a process or method according to the present exemplary embodiments.
  • the present disclosure can also be downloaded and stored on a computer program product.
  • the program can be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection) and ultimately such signals may be stored on the computer systems for subsequent execution.
  • the methods can be implemented in a computer program product accessible from a computer-usable or computer-readable storage medium that provides program code for use by or in connection with a computer or any instruction execution system.
  • a computer-usable or computer-readable storage medium can be any apparatus that can contain or store the program for use by or in connection with the computer or instruction execution system, apparatus, or device.
  • a data processing system suitable for storing and/or executing the corresponding program code can include at least one processor coupled directly or indirectly to computerized data storage devices such as memory elements.
  • the systems and methods described herein can be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them.
  • the components of the system can be connected by any form or medium of digital data communication such as a communication network.
  • the terms "computer program medium" and "computer readable medium" may be used to refer generally to media such as, but not limited to, a removable storage drive or a hard disk installed in a hard disk drive.
  • These computer program products may provide software to computer system.
  • the systems and methods described herein may be directed to such computer program products.
  • All the neural networks mentioned herein are artificial neural networks formed from one or more processors (e.g., microprocessors, integrated circuits, field programmable gate arrays, or the like). These neural networks are divided into two or more layers, which comprise an input layer that may for instance receive images, an output layer that may for instance output an image or loss function (e.g., error, as described below), and one or more intermediate layers.
  • the layers of the neural networks G, GD and LD represent different groups or sets of artificial neurons, which can represent different functions performed by the processors on the images to calculate modified images and/or determine errors in the calculation of the modified images.
  • any range set forth in the description, including the claims should be understood as including its end value(s) unless otherwise stated. Specific values for described elements should be understood to be within accepted manufacturing or industry tolerances known to one of skill in the art, and any use of the terms “substantially” and/or “approximately” and/or “generally” should be understood to mean falling within such accepted tolerances.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A computer-implemented method for training a generative adversarial network GAN comprising a generator (G) and a discriminator (GD, LD) is disclosed. The generator is an image encoder-decoder which inputs an input image (II1, II2) of a scene and outputs a modified image (MI) of the scene suitable for being processed by an image-based detector (SSD). The GAN network is trained by alternately training the discriminator (GD, LD) and the generator (G). The generator (G) is trained to produce, for each input image among input images (II2) acquired under any lighting conditions, a modified image (MI) representing the same scene as said input image but under predetermined first lighting conditions. This training is carried out using a generator loss function (G_loss), which comprises a detector loss term (SSD_loss) calculated on the basis of an output (PBB) of the image-based detector (SSD). A modified image generation module and a system for detecting features in an image are also disclosed.
PCT/EP2019/076434 2019-09-30 2019-09-30 Method for training a generative adversarial network, modified image generation module and system for detecting features in an image WO2021063476A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2019/076434 WO2021063476A1 (fr) 2019-09-30 2019-09-30 Method for training a generative adversarial network, modified image generation module and system for detecting features in an image
DE112019007762.7T DE112019007762T5 (de) 2019-09-30 2019-09-30 Method for training a generative adversarial network, modified image generation module and system for detecting features in an image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/076434 WO2021063476A1 (fr) 2019-09-30 2019-09-30 Method for training a generative adversarial network, modified image generation module and system for detecting features in an image

Publications (1)

Publication Number Publication Date
WO2021063476A1 true WO2021063476A1 (fr) 2021-04-08

Family

ID=68136379

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/076434 WO2021063476A1 (fr) 2019-09-30 2019-09-30 Method for training a generative adversarial network, modified image generation module and system for detecting features in an image

Country Status (2)

Country Link
DE (1) DE112019007762T5 (fr)
WO (1) WO2021063476A1 (fr)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487512A (zh) * 2021-07-20 2021-10-08 陕西师范大学 Digital image restoration method and apparatus guided by edge information
CN113706646A (zh) * 2021-06-30 2021-11-26 酷栈(宁波)创意科技有限公司 Data processing method for generating landscape paintings
CN113888443A (zh) * 2021-10-21 2022-01-04 福州大学 Concert shooting method based on adaptive layer-instance normalization GAN
CN115830723A (zh) * 2023-02-23 2023-03-21 苏州浪潮智能科技有限公司 Method and apparatus relating to training set images
WO2023071285A1 (fr) * 2021-11-01 2023-05-04 Huawei Technologies Co., Ltd. Generative adversarial neural architecture search
WO2023102709A1 (fr) * 2021-12-07 2023-06-15 深圳先进技术研究院 Dynamic parameter image synthesis method and system based on a static PET image
WO2023207531A1 (fr) * 2022-04-29 2023-11-02 华为技术有限公司 Image processing method and related device

Non-Patent Citations (12)

* Cited by examiner, † Cited by third party
Title
JUSTIN JOHNSON, ALEXANDRE ALAHI, LI FEI-FEI: "European conference on computer vision", 2016, SPRINGER, article "Perceptual losses for real-time style transfer and super-resolution"
KAREN SIMONYAN, ANDREW ZISSERMAN: "Very deep convolutional networks for large-scale image recognition", ARXIV PREPRINT ARXIV:1409.1556, 2014
KIM GUISIK ET AL: "Low-Lightgan: Low-Light Enhancement Via Advanced Generative Adversarial Network With Task-Driven Training", 2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), IEEE, 22 September 2019 (2019-09-22), pages 2811 - 2815, XP033647287, DOI: 10.1109/ICIP.2019.8803328 *
LIU, WEI ET AL.: "14th European Conference on Computer Vision, ECCV 2016", 2016, SPRINGER VERLAG, article "SSD: Single shot multibox detector"
OLAF RONNEBERGER, PHILIPP FISCHER, THOMAS BROX: "International Conference on Medical image computing and computer-assisted intervention", 2015, SPRINGER, article "U-net: Convolutional networks for biomedical image segmentation"
PHILLIP ISOLA ET AL.: "2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR", 2017, IEEE, article "Image-to-image Translation with Conditional Adversarial Networks"
PHILLIP ISOLA ET AL: "Image-to-Image Translation with Conditional Adversarial Networks", ARXIV - 1611.07004V2, 21 July 2017 (2017-07-21), pages 5967 - 5976, XP055620831, ISBN: 978-1-5386-0457-1, DOI: 10.1109/CVPR.2017.632 *
QIAN YICHEN ET AL: "Unsupervised Face Normalization With Extreme Pose and Expression in the Wild", 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 15 June 2019 (2019-06-15), pages 9843 - 9850, XP033687519, DOI: 10.1109/CVPR.2019.01008 *
RAD MAHDI, PETER M. ROTH, VINCENT LEPETIT: "ALCN: Adaptive Local Contrast Normalization for Robust Object Detection and 3D Pose Estimation", BRITISH MACHINE VISION CONFERENCE, 2017
SHIN YONG-GOO ET AL: "Adversarial Context Aggregation Network for Low-Light Image Enhancement", 2018 DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA), IEEE, 10 December 2018 (2018-12-10), pages 1 - 5, XP033503508, DOI: 10.1109/DICTA.2018.8615848 *
WEI MA ET AL: "Face Image Illumination Processing Based on Generative Adversarial Nets", 2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), August 2018 (2018-08-01), pages 2558 - 2563, XP055706580, ISBN: 978-1-5386-3788-3, DOI: 10.1109/ICPR.2018.8545434 *
YANG ZHANG ET AL: "IL-GAN: Illumination-invariant representation learning for single sample face recognition", JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION., vol. 59, February 2019 (2019-02-01), US, pages 501 - 513, XP055706567, ISSN: 1047-3203, DOI: 10.1016/j.jvcir.2019.02.007 *


Also Published As

Publication number Publication date
DE112019007762T5 (de) 2022-06-15

Similar Documents

Publication Publication Date Title
Kwon et al. Predicting future frames using retrospective cycle gan
WO2021063476A1 (fr) Method for training a generative adversarial network, modified image generation module and system for detecting features in an image
CN109948796B (zh) Autoencoder learning method and apparatus, computer device, and storage medium
US11074438B2 (en) Disentangling human dynamics for pedestrian locomotion forecasting with noisy supervision
CN113657560B (zh) Weakly supervised image semantic segmentation method and system based on node classification
WO2019099537A1 (fr) Spatio-temporal action and actor localization
CN110532883B (zh) Improving an online tracking algorithm using an offline tracking algorithm
CN113065645B (zh) Siamese attention network, image processing method, and apparatus
KR102132407B1 (ko) Method and apparatus for estimating emotion based on adaptive image recognition using progressive deep learning
WO2021069945A1 (fr) Method for recognizing activities using separate spatial and temporal attention weights
CN110598587B (zh) Expression recognition network training method, system, medium and terminal combining weak supervision
GB2547760A (en) Method of image processing
CN114937083A Laser SLAM system and method for dynamic environments
Hammam et al. DeepPet: A pet animal tracking system in internet of things using deep neural networks
Ukwuoma et al. Image inpainting and classification agent training based on reinforcement learning and generative models with attention mechanism
Wang et al. Intrusion detection for high-speed railways based on unsupervised anomaly detection models
Cheng et al. Language-guided 3d object detection in point cloud for autonomous driving
Mobahi et al. An improved deep learning solution for object detection in self-driving cars
Teng et al. Unimodal face classification with multimodal training
JP2023126130A (ja) Computer-implemented method, data processing apparatus, and computer program for object detection
CN114708353A Image reconstruction method and apparatus, electronic device, and storage medium
US11138468B2 (en) Neural network based solution
Khan et al. Automatic multi-gait recognition using pedestrian’s spatiotemporal features
Langerman et al. Domain Adaptation of Networks for Camera Pose Estimation: Learning Camera Pose Estimation Without Pose Labels
Sophonpattanakit GAN-based water droplet removal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19782546

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 19782546

Country of ref document: EP

Kind code of ref document: A1