EP3895415A1

EP3895415A1 - Transfer of additional information between camera systems

Info

Publication number: EP3895415A1
Application number: EP19797243.3A
Authority: EP
Inventors: Dirk Raproeger; Paul Robert Herzog; Lidia Rosario Torres Lopez; Paul-Sebastian Lauer; Uwe Brosch
Original assignee: Robert Bosch GmbH
Current assignee: Robert Bosch GmbH
Priority date: 2018-12-13
Filing date: 2019-10-29
Publication date: 2021-10-20
Also published as: DE102018221625A1; US20210329219A1; CN113196746A; WO2020119996A1

Abstract

A method (100) for enriching a target picture (31), which a target camera system (3) has taken of a scenery (1), with additional information (4, 41, 42) with which at least one source picture (21), which a source camera system (2) has taken of the same scenery (1) from a different perspective, is already enriched, having the steps of: • assigning (110) 3D locations (5) in the three-dimensional space to source pixels (21a) of the source picture (21), which 3D locations correspond to the positions of the source pixels (21a) in the source picture (21); • assigning (120) additional information (4, 41, 42) assigned to source pixels (21a) to the respective associated 3D locations (5); • assigning (130) those target pixels (31a) of the target picture (31) whose positions in the target picture (31) correspond to the 3D locations (5) to the 3D locations (5); assigning (140) additional information (4, 41, 42) assigned to 3D locations (5) to the associated target pixels (31a). A method (200) for training an AI module (50), wherein at least some learning additional information (54) is assigned (215) to the pixels (53a) of a learning picture (53) as target pixels (31a) using the method (100). Associated computer program.

Description

Title:

Transfer of additional information between camera systems

The present invention relates to a method for processing images that have been recorded with different camera systems. The method can be used in particular for driver assistance systems and systems for at least partially automated driving.

State of the art

For driver assistance systems and for systems for at least partially automated driving, images of the vehicle environment recorded with camera systems are the most important source of information. Often, additional information about the images is available, such as a semantic segmentation obtained with an artificial neural network. This additional information is bound to the camera system used in each case.

US Pat. No. 8,958,630 B1 discloses a method for producing a classifier for the semantic classification of image pixels that belong to different object types. The database of the learning data is enlarged in an unsupervised learning process.

US 9,414,048 B2 and US 8,330,801 B2 disclose methods with which two-dimensional images and video sequences can be converted into three-dimensional images.

Disclosure of the invention

Within the scope of the invention, a method was developed for enriching a target image, which a target camera system has recorded of a scenery, with additional information. The additional information is assigned to a source image, which a source camera system has recorded of the same scenery from a different perspective, and/or to source pixels of this source image. In other words, the source image is already enriched with this additional information.

The additional information can be of any type. For example, it can contain physical measurement data that were acquired in connection with the acquisition of the source image. For example, the source camera system can be a camera system that includes a source camera that is sensitive to visible light and a thermal imaging camera that is oriented to the same observation area. This source camera system can then record a source image with visible light, and each pixel of the source image is then assigned additional information as an intensity value from the thermal image recorded at the same time.

The source pixels of the source image are assigned 3D locations in three-dimensional space, which correspond to the positions of the source pixels in the source image. A three-dimensional representation of the scenery is thus determined which, when imaged with the source camera system, leads to the input source image. This representation does not have to be continuous and/or complete in three-dimensional space like a conventional three-dimensional scenery, especially since a specific three-dimensional scenery cannot be unambiguously inferred from a single two-dimensional image.

Rather, there are several three-dimensional sceneries that, when imaged with the source camera system, generate the same two-dimensional source image. The three-dimensional representation obtained from a single source image can thus be, for example, a point cloud in three-dimensional space in which there are as many points as the source image has source pixels and in which the three-dimensional space is otherwise assumed to be empty. When these points are plotted in a three-dimensional representation, the three-dimensional volume is thus sparsely populated.
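By way of illustration only, the assignment of source pixels to 3D locations can be sketched as follows. The sketch assumes an ideal pinhole camera with known intrinsic parameters (`fx`, `fy`, `cx`, `cy`) and a known depth for each source pixel. Neither the camera model nor any of the names come from the text above, which leaves open how the 3D locations are obtained.

```python
from typing import Dict, Tuple

Pixel = Tuple[int, int]
Point3D = Tuple[float, float, float]


def back_project(depth: Dict[Pixel, float], fx: float, fy: float,
                 cx: float, cy: float) -> Dict[Pixel, Point3D]:
    """Assign one 3D location to every source pixel with a known depth.

    Assumption: ideal pinhole camera with hypothetical intrinsics
    fx, fy (focal lengths in pixels) and cx, cy (principal point).
    The result is a sparse point cloud with one point per source pixel.
    """
    cloud: Dict[Pixel, Point3D] = {}
    for (u, v), z in depth.items():
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        cloud[(u, v)] = (x, y, z)
    return cloud


# Illustrative values only: two source pixels with made-up depths in metres.
source_depth = {(320, 240): 10.0, (420, 240): 5.0}
cloud = back_project(source_depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

The point cloud contains exactly as many entries as there are source pixels with depth, which matches the sparsely populated representation described above.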

Additional information that is assigned to source pixels is assigned to the respectively associated 3D locations. In the aforementioned example with the additional thermal imaging camera, each point in the three-dimensional point cloud that corresponds to the source image is assigned the intensity value of the thermal image associated with the corresponding pixel in the source image. The 3D locations are now assigned those target pixels of the target image whose positions in the target image correspond to the 3D locations. It is determined which target pixels in the target image the 3D locations are mapped to when the three-dimensional scenery is recorded with the target camera system. This assignment results from the interaction of the arrangement of the target camera system in space with the imaging properties of the target camera system.

The additional information that is assigned to the 3D locations is now assigned to the associated target pixels.

In this way, the additional information that was originally developed in connection with the source image can be transferred to the target image. It is therefore possible to provide the target image with this additional information without having to physically record the additional information.
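The transfer via the 3D locations can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the target camera is modelled as an ideal pinhole camera with a hypothetical pose (`rotation`, `translation`, mapping world coordinates to camera coordinates) and hypothetical intrinsics. Keeping the nearest point when several 3D locations fall on the same target pixel is an additional assumption that the text does not prescribe.

```python
from typing import Dict, List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[int, int]


def transfer_to_target(labelled_points: List[Tuple[Point3D, str]],
                       rotation: Sequence[Sequence[float]],
                       translation: Point3D,
                       fx: float, fy: float, cx: float, cy: float,
                       width: int, height: int) -> Dict[Pixel, str]:
    """Project labelled 3D locations into the target image.

    Returns a mapping from target pixel to the transferred label.
    Target pixels hit by no 3D location simply stay absent.
    """
    target_labels: Dict[Pixel, str] = {}
    nearest_depth: Dict[Pixel, float] = {}
    for point, label in labelled_points:
        # World coordinates -> target camera coordinates.
        xc, yc, zc = (
            sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        if zc <= 0.0:
            continue  # behind the target camera
        u = round(fx * xc / zc + cx)
        v = round(fy * yc / zc + cy)
        if not (0 <= u < width and 0 <= v < height):
            continue  # outside the target image
        # Assumption: the point closest to the camera wins a shared pixel.
        if zc < nearest_depth.get((u, v), float("inf")):
            nearest_depth[(u, v)] = zc
            target_labels[(u, v)] = label
    return target_labels


# Illustrative values only: target camera shifted 1 m along x, same orientation.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [((0.0, 0.0, 10.0), "vehicle"), ((1.0, 0.0, 5.0), "road_marking")]
target = transfer_to_target(points, identity, (-1.0, 0.0, 0.0),
                            fx=500.0, fy=500.0, cx=320.0, cy=240.0,
                            width=640, height=480)
```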

The basic idea behind the method is that the additional information, such as the infrared intensity from the thermal image in the above example, is not primarily physically linked to the source pixel of the source image, but to the associated 3D location in three-dimensional space. In this example, there is matter at this 3D location that emits infrared radiation. This 3D location is merely mapped to different positions in the source image and in the target image, since the source camera and the target camera view the 3D location from different perspectives. The method takes advantage of this relationship by reconstructing 3D locations in a three-dimensional “world coordinate system” for source pixels of the source image and then assigning these 3D locations to target pixels of the target image.

In a particularly advantageous embodiment, a semantic classification of image pixels is selected as additional information. Such a semantic classification can, for example, assign to each pixel the type of the object to which the pixel belongs. The object can be, for example, a vehicle, a roadway, a roadway marking, a roadway boundary, a structural obstacle or a traffic sign. The semantic classification is often carried out with neural networks or other AI modules. These AI modules are trained by presenting them with a large number of learning images for which the correct semantic classification is known as “ground truth”. It is checked to what extent the classification output by the AI module corresponds to the “ground truth”, and the module learns from the deviations in that the processing of the AI module is optimized accordingly.

The “ground truth” is usually obtained by having people semantically classify a large number of images. In other words, people mark in the images which pixels belong to objects of which classes. This process, called “labeling”, is time-consuming and expensive. So far, the additional information entered by people in this way has always been bound to precisely the camera system with which the learning images were taken. If one switched to a different type of camera system, such as from a normal perspective camera to a fish-eye camera, or merely changed the perspective of the existing camera system, the labeling process had to start all over again. Since the semantic classification already available for the source images recorded with the source camera system can now be transferred to the target images recorded with the target camera system, the work previously invested in connection with the source images can continue to be used.

This is particularly important in connection with applications in vehicles. Driver assistance systems and systems for at least partially automated driving are using more and more cameras and more and more different camera perspectives.

For example, it is common to install a front camera centrally behind the windshield. For this camera perspective, a large amount of “ground truth” in the form of images semantically classified by people exists and is still being produced. Increasingly, however, systems are being created that contain further cameras in addition to the front camera system, for example in the front area near the radiator, in the side mirror or in the tailgate. The neural network, which was trained with images from the front camera and the associated “ground truth”, now provides a semantic classification of what the other cameras see from their other perspectives. This semantic classification can be used as “ground truth” for training a neural network with recordings from these other cameras. The “ground truth” acquired in connection with the front camera as the source camera can therefore be used for training the other cameras as target cameras. Thus, for training several cameras, “ground truth” only has to be acquired once, i.e. the effort for acquiring “ground truth” does not multiply with the number of cameras and perspectives.

The source pixels can be assigned to 3D locations in any way. For example, the associated 3D location for at least one source pixel can be determined from a time program, according to which at least one source camera of the source camera system moves in space. For example, a “structure from motion” algorithm can be used to convert the time program of the movement of a single source camera into an assignment of the source pixels to 3D locations.

In a particularly advantageous embodiment, a source camera system with at least two source cameras is selected. On the one hand, the 3D locations associated with source pixels can then be determined by stereoscopic evaluation of source images that were recorded by both source cameras. The at least two source cameras can in particular be contained in a stereo camera system that provides depth information for each pixel. This depth information can be used to directly assign the source pixels of the source image to 3D locations.
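As a hedged illustration of how per-pixel depth can follow from stereoscopic evaluation, the sketch below uses the standard relation for a rectified stereo pair, depth = focal length × baseline / disparity. The rectified setup and the names `focal_px`, `baseline_m` and `disparity_px` are assumptions made for this example and are not taken from the text.

```python
from typing import Optional


def depth_from_disparity(disparity_px: float, focal_px: float,
                         baseline_m: float) -> Optional[float]:
    """Depth in metres for one pixel of a rectified stereo pair.

    Returns None when no valid match exists (disparity <= 0),
    so that the pixel receives no 3D location.
    """
    if disparity_px <= 0.0:
        return None
    return focal_px * baseline_m / disparity_px


# Illustrative values only: 500 px focal length, 0.25 m baseline.
depth_near = depth_from_disparity(25.0, focal_px=500.0, baseline_m=0.25)
depth_far = depth_from_disparity(12.5, focal_px=500.0, baseline_m=0.25)
no_match = depth_from_disparity(0.0, focal_px=500.0, baseline_m=0.25)
```

Larger disparities correspond to closer 3D locations, and pixels without a valid match are left out of the point cloud.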

On the other hand, source pixels from source images that were recorded by both source cameras can also be combined in order to assign additional information to more target pixels of the target image. Since the perspectives of the source camera system and the target camera system are different, the two camera systems do not depict exactly the same section of the three-dimensional scenery. Thus, if the additional information is transferred from all source pixels of a single source image to target pixels of the target image, not all target pixels of the target image will be covered. There will therefore be target pixels to which no additional information has yet been assigned. If several source cameras are used, preferably two or three, such gaps in the target image can be filled. However, this is not absolutely necessary for training a neural network or other AI module on the basis of the target image. In particular, during such training, target pixels of the target image for which there is no additional information can be excluded from the evaluation by the quality measure (such as an error function) used during training.
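Excluding target pixels without additional information from the quality measure can be sketched as follows. The choice of a cross-entropy error function, the use of `None` as the marker for a missing label, and all names are assumptions made for this example.

```python
import math
from typing import Optional, Sequence


def masked_cross_entropy(probabilities: Sequence[Sequence[float]],
                         labels: Sequence[Optional[int]]) -> float:
    """Mean cross-entropy over labelled pixels only.

    probabilities[i] holds the predicted class probabilities of pixel i,
    labels[i] is the transferred class index or None if no additional
    information reached that target pixel.
    """
    losses = [-math.log(p[y])
              for p, y in zip(probabilities, labels)
              if y is not None]
    if not losses:
        return 0.0  # nothing to evaluate
    return sum(losses) / len(losses)


# Illustrative values only: the middle pixel has no transferred label.
predictions = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]
transferred = [0, None, 1]
loss = masked_cross_entropy(predictions, transferred)
```

Whatever the module predicts for the unlabelled pixel has no influence on `loss`, so gaps in the transferred additional information do not distort the training signal.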

In a further embodiment of the system, in order to obtain the 3D structure observed by both the source and the target camera system, any 3D sensor can deliver a point cloud which, with a suitable calibration procedure, locates both the source pixels and the target pixels in 3D space, thus ensuring that the training information can be transferred from the source system to the target system.

Possible additional 3D sensors, which determine only the connecting 3D structure of the observed scene for the training, include an additional imaging time-of-flight (TOF) sensor or a lidar sensor.

In a further advantageous embodiment, a source image and a target image are selected which have been recorded simultaneously. In this way it is ensured that, especially in the case of dynamic scenery with moving objects, the source image and the target image show the same state of the scenery, apart from the different camera perspective. If, on the other hand, there is a temporal offset between the source image and the target image, an object that was still present in one image may already have disappeared from the detection range by the time the other image is captured.
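One conceivable way of selecting simultaneously recorded images is to pair frames by their timestamps, as in the following sketch. The timestamp representation (integer milliseconds), the tolerance and the function name are assumptions for illustration. The text itself does not specify how simultaneity is established.

```python
from typing import List, Sequence, Tuple


def pair_simultaneous(source_ms: Sequence[int], target_ms: Sequence[int],
                      tolerance_ms: int) -> List[Tuple[int, int]]:
    """Pair each source frame with the closest target frame in time.

    Returns (source_index, target_index) pairs. Source frames without
    a target frame inside the tolerance are dropped.
    """
    pairs: List[Tuple[int, int]] = []
    if not target_ms:
        return pairs
    for i, ts in enumerate(source_ms):
        j = min(range(len(target_ms)), key=lambda k: abs(target_ms[k] - ts))
        if abs(target_ms[j] - ts) <= tolerance_ms:
            pairs.append((i, j))
    return pairs


# Illustrative values only: timestamps in milliseconds.
pairs = pair_simultaneous([0, 100, 200], [10, 150, 210], tolerance_ms=20)
```

In this example the second source frame is discarded because the closest target frame is 50 ms away, which in a dynamic scenery could already show a different state.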

In a particularly advantageous embodiment, a source camera system and a target camera system are selected which are mounted on the same vehicle in a fixed relative orientation to one another. Especially in applications in and on vehicles, the observed sceneries are generally dynamic. If the two camera systems are mounted in a fixed relative orientation to one another, simultaneous image acquisition in particular is possible. The fixed connection of the two camera systems ensures that the difference in perspective between the two camera systems remains constant while driving.

As previously discussed, the transfer of additional information from a source image to a target image is useful regardless of what the additional information actually consists of. However, an important application is the continued use of “ground truth”, which was generated for the processing of images from one camera system with an AI module, for the processing of images from another camera system.

Therefore, the invention also relates to a method for training an AI module which assigns additional information to an image recorded by a camera system, and/or to pixels of such an image, by processing in an internal processing chain. This additional information can in particular be a classification of image pixels. The internal processing chain of the AI module can in particular include an artificial neural network (ANN).

The behavior of the internal processing chain is determined by parameters. These parameters are optimized when training the AI module. For an ANN, for example, the parameters can be weights with which the inputs received by a neuron are weighted relative to one another.

During training, learning images are input into the AI module. The additional information output by the AI module is compared with additional learning information associated with the respective learning image. The result of the comparison is used to adjust the parameters. For example, an error function (loss function) can depend on the deviation determined in the comparison, and the parameters can be optimized with the aim of minimizing this error function. Any multivariate optimization method can be used for this, such as a gradient descent method.
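The principle of adjusting parameters by gradient descent on an error function can be illustrated with a deliberately tiny example. It assumes a single parameter `w`, a model output `w * x` and a mean squared error. None of this is taken from the text, which covers arbitrary processing chains and optimization methods.

```python
from typing import Sequence, Tuple


def train_weight(samples: Sequence[Tuple[float, float]],
                 learning_rate: float = 0.1, steps: int = 200) -> float:
    """Fit y ~ w * x by gradient descent on the mean squared error."""
    w = 0.0
    for _ in range(steps):
        # Derivative of mean((w * x - y) ** 2) with respect to w.
        gradient = sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * gradient
    return w


# Illustrative values only: the "ground truth" follows y = 2 * x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
weight = train_weight(samples)
```

Each step compares the output with the learning information, derives the deviation and moves the parameter in the direction that reduces the error function. In this example the weight settles close to 2.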

The additional learning information is at least partially assigned to the pixels of the learning image as target pixels using the previously described method. This means that additional learning information created for another camera system and/or for a camera system observing from a different perspective continues to be used. The generation of “ground truth” for the specific camera system that is to be used in connection with the trained AI module can therefore be at least partially automated. Since the manual generation of “ground truth” is very labor-intensive, the development costs for combinations of AI modules and new camera systems are significantly reduced. Furthermore, the susceptibility to errors is also reduced, since “ground truth” that has been checked once can be used many times.

The methods can in particular be carried out on a computer and/or on a control device and can to that extent be embodied in software. This software is an independent product with customer benefits. The invention therefore also relates to a computer program with machine-readable instructions which, when executed on a computer and/or a control device, cause the computer and/or the control device to carry out one of the methods described.

Further measures improving the invention are shown below together with the description of the preferred exemplary embodiments of the invention with reference to figures.

Embodiments

The figures show:

FIG. 1 Exemplary embodiment of the method 100;

FIG. 2 Exemplary source image 21;

FIG. 3 Exemplary translation of the source image 21 into a point cloud in three-dimensional space;

FIG. 4 Exemplary target image 31 with additional information 4, 41, 42 transferred from the source image 21;

FIG. 5 Exemplary arrangement of a source camera system 2 and a target camera system 3 on a vehicle 6;

FIG. 6 Exemplary embodiment of the method 200.

According to FIG. 1, in step 110 of the method 100, source pixels 21a of a source image 21 are assigned to 3D locations 5 in three-dimensional space. According to block 111, the associated 3D location 5 for at least one source pixel 21a can be determined from a time program, according to which at least one source camera of the source camera system 2 moves in space. Alternatively or in combination with this, according to block 112, the associated 3D location 5 for at least one source pixel 21a can be determined by stereoscopic evaluation of source images 21 which were recorded by two source cameras.

The latter option requires that a source camera system with at least two source cameras was selected in step 105. Furthermore, according to the optional step 106, a source image 21a and a target image 31a can be selected which have been recorded simultaneously. According to the optional step 107, a source camera system 2 and a target camera system 3 can also be selected, which are mounted on the same vehicle 6 in a fixed relative orientation 61 to one another.

In step 120, the additional information 4, 41, 42, which is assigned to the source pixels 21a of the source image 21, is assigned to the respectively associated 3D locations 5. In step 130, those target pixels 31a of the target image 31 are assigned to the 3D locations whose positions in the target image 31 correspond to the 3D locations 5. In step 140, the additional information 4, 41, 42, which is assigned to 3D locations 5, is assigned to the associated target pixels 31a.

This process is explained in more detail in FIGS. 2 to 4.

FIG. 2 shows a two-dimensional source image 21 with coordinate directions x and y, which a source camera system 2 has recorded of a scenery 1. The source image 21 was segmented semantically. In the example shown in FIG. 2, the additional information 4, 41 that this sub-area belongs to a vehicle 11 present in the scenery 1 was thus acquired for one sub-area of the source image 21. For other sub-areas of the source image 21, the additional information 4, 42 was acquired that these sub-areas belong to road markings 12 present in the scenery 1. A single pixel 21a of the source image 21 is marked there as an example.

In FIG. 3, the source pixels 21a are translated into 3D locations 5 in three-dimensional space; for the marked source pixel 21a, the associated 3D location is denoted by the reference symbol 5. If the additional information 4, 41 that the source pixel 21a belongs to a vehicle 11 was stored for a source pixel 21a, then this additional information 4, 41 was also assigned to the corresponding 3D location 5. If the additional information 4, 42 that the source pixel 21a belongs to a road marking 12 was stored for a source pixel 21a, then this additional information 4, 42 was also assigned to the corresponding 3D location 5. This is represented by the different symbols with which the respective 3D locations 5 are drawn in the point cloud shown in FIG. 3.

In Figure 3 there are only as many 3D locations 5 as there are source pixels 21a in the source image 21. Therefore, the three-dimensional space in Figure 3 is not completely filled, but rather only sparsely populated by the point cloud. In particular, only the rear area of the vehicle 11 is shown, since only this area is visible in FIG. 2.

FIG. 3 also shows that the source image 21 shown in FIG. 2 was taken from perspective A. As a purely illustrative example with no claim to real applicability, the target image 31 is taken from the perspective B drawn in FIG. 3.

This exemplary target image 31 is shown in FIG. 4. It is shown here by way of example that the source pixel 21a was ultimately assigned to the target pixel 31a via the detour of the associated 3D location 5. All target pixels 31a for which there is an associated source pixel 21a with stored additional information 4, 41, 42 in the source image 21 are accordingly assigned this additional information 4, 41, 42 via the detour of the associated 3D location 5. The work previously invested in the semantic segmentation of the source image 21 is therefore fully reused.

As indicated in FIG. 4, more of the vehicle 11 is visible in the perspective B shown here than in the perspective A of the source image. The additional information 4, 41 that source pixels 21a belong to the vehicle 11 was only recorded for the rear area of the vehicle 11 visible in FIG. 2. Thus, the front area of the vehicle 11, shown in dashed lines in FIG. 4, is not provided with this additional information 4, 41. This extreme, constructed example shows that it is advantageous to combine source images 21 from several source cameras in order to provide as many target pixels 31a of the target image 31 as possible with additional information 4, 41, 42.
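Combining the contributions of several source cameras can be sketched as follows. The sketch assumes that the additional information from each source camera has already been transferred into the target image as a mapping from target pixel to label, and that the first source camera to label a pixel takes precedence. Both the data layout and this precedence rule are assumptions for illustration.

```python
from typing import Dict, Sequence, Tuple

Pixel = Tuple[int, int]


def merge_transfers(transfers: Sequence[Dict[Pixel, str]]) -> Dict[Pixel, str]:
    """Merge per-source-camera label transfers into one target labelling.

    Earlier entries in `transfers` win when two source cameras label
    the same target pixel (an assumed, simple conflict rule).
    """
    merged: Dict[Pixel, str] = {}
    for transfer in transfers:
        for pixel, label in transfer.items():
            merged.setdefault(pixel, label)
    return merged


# Illustrative values only: camera A sees the rear, camera B the front.
from_camera_a = {(10, 5): "vehicle", (11, 5): "vehicle"}
from_camera_b = {(11, 5): "vehicle", (12, 5): "vehicle", (3, 9): "road_marking"}
merged = merge_transfers([from_camera_a, from_camera_b])
```

Target pixels that neither source camera covers remain absent from `merged` and can be treated as unlabelled.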

FIG. 5 shows an exemplary arrangement of a source camera system 2 and a target camera system 3, both of which are mounted on the same vehicle 6 in a fixed relative orientation 61 to one another. This fixed relative orientation 61 is specified in the example shown in FIG. 5 by a rigid test vehicle.

The source camera system 2 observes the scenery 1 from a first perspective A'. The target camera system 3 observes the same scenery 1 from a second perspective B'. The described method 100 enables additional information 4, 41, 42, which was acquired in connection with the source camera system 2, to be used in the context of the target camera system 3.

FIG. 6 shows an exemplary embodiment of the method 200 for training an AI module 50. The AI module 50 comprises an internal processing chain 51, the behavior of which is determined by parameters 52.

In step 210 of the method 200, learning images 53 with pixels 53a are input into the AI module 50. The AI module 50 supplies additional information 4, 41, 42, such as a semantic segmentation, for these learning images. Additional learning information 54, which specifies the additional information 4, 41, 42 expected in each case for a given learning image 53, is transferred according to step 215 by means of the method 100 into the perspective from which the learning image 53 was recorded.

In step 220, the additional information 4, 41, 42 actually supplied by the AI module 50 is compared with the additional learning information 54. The result 220a of this comparison 220 is used in step 230 in order to optimize the parameters 52 of the internal processing chain 51 of the AI module 50.

Claims

1. A method (100) for enriching a target image (31), which a target camera system (3) has recorded of a scenery (1), with additional information (4, 41, 42) with which at least one source image (21), which a source camera system (2) has recorded of the same scenery (1) from a different perspective, has already been enriched, with the steps:

• Source pixels (21a) of the source image (21) are assigned (110) 3D locations (5) in three-dimensional space which correspond to the positions of the source pixels (21a) in the source image (21);

• Additional information (4, 41, 42) which is assigned to source pixels (21a) is assigned (120) to the respectively associated 3D locations (5);

• The 3D locations (5) are assigned (130) those target pixels (31a) of the target image (31) whose positions in the target image (31) correspond to the 3D locations (5);

• Additional information (4, 41, 42) that is assigned to 3D locations (5) is assigned to the associated target pixels (31a) (140).

2. The method (100) according to claim 1, wherein for at least one source pixel (21a) the associated 3D location (5) is determined (111) from a time program according to which at least one source camera of the source camera system (2) moves in space.

3. The method (100) according to any one of claims 1 to 2, wherein a source camera system (2) with at least two source cameras is selected (105).

4. The method (100) according to claim 3, wherein for at least one source pixel (21a) the associated 3D location (5) is determined (112) by stereoscopic evaluation of source images (21) which were recorded by both source cameras.

5. The method (100) according to any one of claims 3 to 4, wherein source pixels from source images (21) which were recorded by both source cameras are merged in order to assign additional information (4, 41, 42) to more target pixels (31a) of the target image (31).

6. The method (100) according to any one of claims 1 to 5, wherein a source image (21a) and a target image (31a) are selected (106) which have been recorded simultaneously.

7. The method (100) according to any one of claims 1 to 6, wherein a source camera system (2) and a target camera system (3) are selected (107) which are mounted on the same vehicle (6) in a fixed relative orientation (61) to one another.

8. A method (200) for training an AI module (50) which assigns additional information (4, 41, 42) to an image (31) recorded by a camera system (3), and/or to pixels (31a) of such an image (31), by processing in an internal processing chain (51), the behavior of the internal processing chain (51) being determined by parameters (52), wherein

• learning images (53) are input (210) into the AI module (50),

• the additional information (4, 41, 42) output by the AI module (50) is compared (220) with the additional learning information (54) associated with the respective learning image (53),

• the result (220a) of the comparison (220) is used (230) for adjusting the parameters (52), and

• the additional learning information (54) is at least partially assigned (215), using the method (100) according to one of claims 1 to 5, to the pixels (53a) of the learning image (53) as target pixels (31a).

9. The method (100, 200) according to one of claims 1 to 8, wherein a semantic classification of image pixels (21a, 31a) is selected as additional information (4, 41, 42).

10. A computer program containing machine-readable instructions which, when executed on a computer and/or a control device, cause the computer and/or the control device to execute a method (100, 200) according to one of claims 1 to 9.