WO2022042903A1

WO2022042903A1 - Method for identifying three-dimensional objects, computer program, machine-readable storage medium, control unit, vehicle and video monitoring system

Info

Publication number: WO2022042903A1
Application number: PCT/EP2021/068017
Authority: WO
Inventors: Emil Schreiber; Fabian GIGENGACK
Original assignee: Robert Bosch Gmbh
Priority date: 2020-08-27
Filing date: 2021-06-30
Publication date: 2022-03-03
Also published as: DE102020210816A1

Abstract

The invention relates to a method for identifying three-dimensional objects (180) in a field of view (191) of a camera (111, 112, 120), comprising the following steps: capturing (210) at least one camera image (300) by means of the at least one camera (111, 112, 120); semantic segmenting (220) of the camera image (300) by a first learned machine identification method; allocating (221) a piece of segment information to the pixels of the camera image (300) according to the semantic segmenting; determining (222) at least one image detail as a segment (410, 420, 430, 440), wherein adjacent pixels of the camera image (300) are grouped into a segment (410, 420, 430, 440) according to the allocated semantic piece of segment information; determining (250) distance data (501 to 507) between surroundings objects (180) in the camera field of view (191) and the camera (111, 112, 120) and allocating (252) a piece of distance information to the pixels of at least one part of the camera image (300) according to the determined distance data (501 to 507), and/or determining (230) an optical flow to at least one part of the pixels of the captured camera image according to the camera image (300) and at least one additional camera image captured previously and/or afterward; and determining (270) at least one three-dimensional object hypothesis (510, 511, 512, 520, 530) as a segment section of a determined segment (410, 420, 430, 440) according to the piece of distance information allocated to the pixels of the segment and/or according to the determined optical flow.

Description

title

Method for detecting three-dimensional objects, computer program, machine-readable storage medium, control unit, vehicle and video surveillance system

The present invention relates to a method for detecting three-dimensional objects in a field of view of a camera. The invention also relates to a computer program that is set up to carry out this method, and a machine-readable storage medium on which the computer program is stored. The invention also relates to a control unit that is set up to carry out the method according to the invention, and a vehicle with this control unit, and also a video surveillance system with this control unit.

State of the art

The acquisition of camera images on a vehicle by means of a mono or stereo camera is known, with the vehicle camera capturing, for example, the area of the vehicle's surroundings that lies to the rear or to the front in the direction of travel. Based on at least one recorded camera image, a learned machine recognition method can be used, for example, to carry out a semantic segmentation and/or an object recognition. The recognized segments and/or objects in the surroundings of the vehicle are used in driver assistance methods or in partially or fully autonomous guidance of a vehicle and/or for display in a virtual three-dimensional environment model; for example, a driving maneuver is carried out depending on a recognized object. In addition, for driver assistance methods or partially or fully autonomous guidance of a vehicle, distance data to detected objects is recorded, for example in order to follow another vehicle, to avoid an object or to be able to carry out a driving maneuver.

Machine recognition methods are, for example, neural networks, in particular those with a large number of layers, each of which includes so-called neurons. A neuron of each layer is typically linked to neurons of a previous layer and neurons of a subsequent layer. For example, the links between the neurons each have associated weights. Machine recognition methods are advantageously trained with a large amount of data; in particular, this data includes a large number of images, each of which has an assigned label or an expected output of the recognition method for at least a partial area of the respective image. Alternatively or additionally, a machine recognition method can be trained with data of only one known output. During training, at least the weights of the links are typically adjusted. Each of the layers of the neural network advantageously represents an abstraction level of the image. Through the training, in particular by adjusting the weights of the links between the neurons, a machine recognition method can learn, for example, to distinguish a vehicle in an image from a person or a tree, or to recognize the vehicle, with the machine recognition method typically providing a probability for the presence of the determined object. The result is a computationally efficient, trained machine recognition method. Both the structure of the machine recognition method, or the number of layers and neurons per layer, and the training or the training data of the machine recognition method have a major influence on the recognition quality. However, due to the large number of layers and neurons, the resulting operating principle of the machine recognition method in the application often remains unclear to a user. In other words, a machine recognition method can be described as a non-analytical method: a user often does not know exactly why a neural network, for example, recognizes a vehicle as a vehicle and a person as a person. 
Accordingly, machine recognition methods, for example for object detection, can deliver unreliable results, since generally no physical models and no abstract model knowledge are implemented. Semantic segmentation is known as a learned machine recognition method. A method for semantic segmentation delivers as a result a classification of the pixels of a camera image into semantic categories (e.g. person, car, street, ...), with, in particular but not necessarily, all pixels of the image being classified. This corresponds in particular to a rough classification of the image content mapped by the pixels. As a simple example of semantic segmentation, an image can be divided into two classes or sub-areas, for example a sub-area that depicts a person and another sub-area that depicts the background of the person depicted.

The detection of static and/or dynamic objects based on three-dimensional point clouds is also known, for example from the publication by Y. Zhou and O. Tuzel, "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection" (CVPR 2018).

C. Godard et al. disclose in their paper entitled “Unsupervised Monocular Depth Estimation with Left-Right Consistency” (arXiv:1609.03677v3) a depth estimation using a learned machine recognition method.

D. Eigen et al., in their paper titled "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network", disclose depth estimation using a different trained machine recognition method, with one trained machine recognition method performing a rough estimate of depth for larger areas of an image and another trained machine recognition method locally performing a more accurate estimate of the depth of the image's pixels (http://papers.nips.cc/paper/5539-depth-map-prediction-from-a-single-image-using-a-multi-scale-deep-network.pdf).

Furthermore, stereo cameras allow, in a known manner, the determination of distance data between surrounding objects in the field of view of a camera and the camera by means of two cameras arranged at a fixed distance from one another and a triangulation method. Such a stereo camera, comprising two cameras, can be arranged on a vehicle, for example. Alternatively or additionally, distance data between objects and a vehicle or the camera can be recorded by means of an ultrasonic sensor, by means of a radar sensor or by means of a lidar sensor. A sensor placed in addition to a camera, however, increases the costs of the overall system, for example the vehicle, and in the case of a sensor data fusion requires a complex and possibly regular calibration between the sensor and the camera as well as a more powerful computing unit for the sensor data fusion.
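The triangulation mentioned above can be illustrated for the common special case of a rectified stereo pair, in which the distance along the optical axis follows from the focal length, the camera spacing (baseline) and the pixel disparity. The following is only a minimal sketch under that assumption; the function name and parameters are hypothetical and are not taken from the publication.

```python
def stereo_depth_m(focal_length_px: float, baseline_m: float, disparity_px: float) -> float:
    """Distance along the optical axis for a rectified stereo pair: Z = f * B / d.

    Assumption: both cameras are rectified, so a surroundings point appears in the
    same image row in both images and is shifted horizontally by `disparity_px`.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    return focal_length_px * baseline_m / disparity_px
```

For example, with an assumed focal length of 800 pixels and a baseline of 0.25 m, a disparity of 8 pixels corresponds to a distance of 25 m. Small disparities at large distances make the result sensitive to matching errors.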

The object of the present invention is to improve the detection of static and/or dynamic objects compared to the prior art.

Disclosure of Invention

The above object is achieved according to the invention by independent claims 1 and 11 to 15.

The present invention relates to a method for detecting three-dimensional objects in a field of view of a camera. The camera is in particular a vehicle camera, which preferably captures at least part of the surroundings of the vehicle. For example, a vehicle camera is arranged on the vehicle at an elevated position behind the windshield. In a first step, at least one camera image is captured using at least the camera. Provision can advantageously be made for multiple camera images to be captured approximately simultaneously by means of one camera each, with the cameras having different fields of view or perspectives or capturing a different partial area of the environment. In other words, provision can be made for multiple camera images to be captured by multiple vehicle cameras, which are each advantageously part of a surround view system of the vehicle. The camera or vehicle camera is advantageously set up to capture an area of an environment of the vehicle that is at the front in the direction of travel. The camera or vehicle camera is preferably a mono camera, in particular for cost reasons, it being possible for the mono camera to have wide-angle optics. Alternatively, the camera or vehicle camera is advantageously part of a stereo camera, in particular in order to achieve increased reliability or accuracy of the method. In a second method step, a semantic segmentation of the at least one camera image is carried out using a first learned machine recognition method. Advantageously, at least one image region that depicts a static and/or moving object class is recognized in at least one partial area of the camera image; for example, other vehicles are detected in the camera image. Subsequently, in a further method step, segment information is assigned to the pixels of the camera image as a function of the semantic segmentation, with the respective pixels in particular imaging the recognized object. 
For example, the pixels of the camera image that depict a vehicle are assigned a value as segment information that represents vehicles. At least one image section of the camera image is then determined as a segment, which has adjacent pixels with the same assigned segment information. In other words, adjacent pixels of the camera image are grouped into a segment depending on the respectively assigned segment information. All pixels of a camera image that have a connection to the respective pixel through pixels with the same segment information are advantageously understood as neighboring pixels. A single segment can consequently in particular include more than one vehicle or more than one person. Distance data is then determined between surrounding objects in the camera's field of view and the camera, in particular between objects in the vicinity of the vehicle and the vehicle. The distance data are preferably determined as a function of the captured camera image. The distance data are particularly preferably determined as a function of the captured camera image by a second, learned machine recognition method, see publication by C. Godard et al. or D. Eigen et al. Alternatively or additionally, the distance data can be determined by a stereo vision method and/or a structure-from-motion method. Alternatively or additionally, the distance data can be determined by an ultrasonic sensor, a radar sensor and/or a lidar sensor. In a further step of the method, distance information is assigned to the pixels of at least part of the camera image as a function of the determined distance data. The associated distance information of the respective pixel advantageously represents a distance of an object in the surroundings, which is imaged by the pixel, from the vehicle. 
Alternatively or in addition to determining distance data between surrounding objects in the camera's field of view and the camera, an optical flow or a movement for at least some of the pixels of the captured camera image is determined. In particular, an optical flow of the pixels of a determined segment is determined. The optical flow for at least some of the pixels of the camera image is determined as a function of the captured camera image and at least one other previously and/or subsequently captured camera image. In a further step, at least one three-dimensional object hypothesis is determined in a determined segment, with pixels of the determined segment advantageously being subgrouped or grouped into a segment section depending on the respectively assigned distance information. The grouping of the pixels in a segment to form a three-dimensional object hypothesis takes place in particular when a difference between the assigned distance information of these pixels, or of at least a predetermined number of these pixels, is less than or equal to a distance tolerance value. In other words, a segment section of the segment is advantageously determined as a three-dimensional object hypothesis as a function of the assigned distance information of the pixels of the segment, this segment section advantageously having at least a defined number of pixels whose assigned distance information differs by no more than a distance tolerance value. Alternatively or additionally, the at least one three-dimensional object hypothesis is determined as a segment section of a determined segment depending on the determined optical flow of the pixels of the segment. In other words, when determining the three-dimensional object hypothesis, the pixels of a determined segment are advantageously additionally or alternatively combined or grouped into a segment section depending on the determined optical flow. 
In particular, those pixels of the segment whose optical flow vectors are approximately the same and/or whose changes in flow vectors are approximately the same and/or whose optical flow vectors point in approximately the same direction are additionally or alternatively combined to form the segment section. Advantageously, the pixels grouped into a three-dimensional object hypothesis are adjacent. The method has the advantage that the object hypotheses are reliably determined because the learned machine recognition method or methods are linked to a physical model. In other words, errors in object recognition and/or in a determined object extension are avoided, especially when two objects cover each other, since mutually covering objects have a different distance from the camera and/or a different direction of movement and/or a different speed. The physical model used states that, in an image section of the camera image or in a segment that depicts the same semantic content, there can be no significantly different distances to the camera or vehicle and/or no significantly different speeds or directions of movement if this segment were to represent only one object. In other words, different segment sections, which represent different three-dimensional object hypotheses, can advantageously be identified in a segment. This is advantageous, for example, when a vehicle in the captured camera image is covered by another vehicle or a person in the captured camera image is covered by another person. In addition, the first learned machine recognition method can advantageously be trained more robustly, since it can generate a more abstract output compared to classic object recognition methods; for example, static and moving objects or vehicle classes do not initially have to be differentiated. 
The method is preferably carried out using only one camera or vehicle camera, or using a mono camera or a stereo camera, and additional active sensors which emit electromagnetic radiation or pressure or ultrasound are dispensed with. As a result, the method can be carried out in a cost-effective and very computationally efficient manner.
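The two grouping stages described above (adjacent pixels with the same segment information form a segment, and a segment is then split into segment sections by distance) can be sketched as follows. This is only an illustrative sketch and not the implementation of the publication: it assumes 4-connectivity as the neighborhood definition, per-pixel label and distance grids given as nested lists, and a neighbor-to-neighbor comparison against the distance tolerance value. All function names are hypothetical. Grouping by optical flow could be sketched analogously by comparing flow vectors instead of distances.

```python
from collections import deque


def _neighbors(y, x):
    """4-connected neighborhood (an assumption; other definitions are possible)."""
    return ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))


def connected_segments(labels):
    """Group adjacent pixels with identical segment information into segments."""
    height, width = len(labels), len(labels[0])
    seen = [[False] * width for _ in range(height)]
    segments = []
    for r in range(height):
        for c in range(width):
            if seen[r][c]:
                continue
            label = labels[r][c]
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in _neighbors(y, x):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and labels[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            segments.append((label, pixels))
    return segments


def split_by_distance(pixels, distance, tolerance):
    """Split one segment into sections of adjacent pixels with similar distance.

    Two neighboring pixels end up in the same section if their assigned
    distances differ by at most `tolerance`.
    """
    remaining = set(pixels)
    sections = []
    while remaining:
        start = min(remaining)  # deterministic starting pixel
        remaining.remove(start)
        queue, section = deque([start]), []
        while queue:
            y, x = queue.popleft()
            section.append((y, x))
            for n in _neighbors(y, x):
                if n in remaining and abs(distance[n[0]][n[1]] - distance[y][x]) <= tolerance:
                    remaining.remove(n)
                    queue.append(n)
        sections.append(sorted(section))
    return sections
```

Applied to a small label grid in which one "car" segment covers pixels at roughly 10 m and roughly 25 m, `split_by_distance` with a tolerance of 1 m returns two segment sections, corresponding to two object hypotheses for two vehicles that partially cover one another.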

In an advantageous development of the invention, the object hypothesis is only determined if the distance information assigned to the pixels of the segment section is less than or equal to a distance threshold value for at least a predetermined number of pixels. This makes the method more computationally efficient and more reliable.

In one embodiment of the method, the three-dimensional object hypothesis is only determined if the number of pixels in the segment section is greater than or equal to a minimum value. This avoids unrealistically small extensions of object hypotheses or unimportant object hypotheses.
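The two optional conditions above (distance threshold value for a predetermined number of pixels, and a minimum number of pixels per segment section) can be combined into a simple acceptance check. The following is a minimal sketch under the assumption that the distances of the pixels of one segment section are available as a list of values in meters; the function and parameter names are hypothetical.

```python
def keep_hypothesis(section_distances, distance_threshold, min_near_pixels, min_pixels):
    """Decide whether a segment section is kept as a three-dimensional object hypothesis.

    section_distances: distances assigned to the pixels of the segment section.
    distance_threshold: only objects within this distance are of interest.
    min_near_pixels: predetermined number of pixels that must lie within the threshold.
    min_pixels: minimum number of pixels of the segment section.
    """
    if len(section_distances) < min_pixels:
        return False  # section too small to be a plausible object
    near = sum(1 for d in section_distances if d <= distance_threshold)
    return near >= min_near_pixels
```

Both checks are cheap compared to the segmentation itself, which is one plausible reason why they are described as making the method more computationally efficient: distant or very small candidates are discarded before any further processing.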

In a further development of the method, when determining the three-dimensional object hypothesis, the distance tolerance value is adjusted as a function of the assigned segment information, the assigned distance information and/or a detected speed of the vehicle. As a result, the distance tolerance value can advantageously be adapted to an expected extent of an object class and/or to an expected orientation of an object class, for example vehicles or people that cover one another, and/or to an accuracy of the determined distance data that changes as the vehicle speed changes. Advantageously, for example, the distance tolerance value for separating object hypotheses in a segment is smaller for assigned segment information representing people than for assigned segment information representing vehicles.
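How such an adjustment could look is sketched below. The numerical values, the category names and the linear scaling with distance and speed are purely illustrative assumptions and are not given in the publication; only the ordering (smaller tolerance for people than for vehicles) follows the text above.

```python
# Hypothetical base tolerances in meters per semantic category.
BASE_TOLERANCE_M = {"person": 0.5, "vehicle": 2.0}
DEFAULT_TOLERANCE_M = 1.0  # assumed fallback for other categories


def distance_tolerance(segment_info, mean_distance_m, vehicle_speed_mps):
    """Return a distance tolerance value for separating object hypotheses.

    Assumption: camera-based distance data becomes less accurate with growing
    distance and vehicle speed, so the tolerance is widened accordingly.
    """
    base = BASE_TOLERANCE_M.get(segment_info, DEFAULT_TOLERANCE_M)
    distance_factor = 1.0 + 0.02 * mean_distance_m
    speed_factor = 1.0 + 0.01 * vehicle_speed_mps
    return base * distance_factor * speed_factor
```

In practice the scaling would have to be derived from the measured accuracy of the distance estimation used, so the factors above should be read as placeholders.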

In another embodiment, before the three-dimensional object hypothesis is determined, at least one object in the determined segment is recognized by a further learned machine recognition method. For example, a person's head or a license plate is recognized. The object hypothesis in a segment is then determined as a function of the detected object; for example, the number of object hypotheses is determined as a function of the number of vehicles or people depicted in the segment. Optionally, in a further step before the three-dimensional object hypothesis is determined, object information is assigned, depending on the detected object, to the respective pixels of the determined segment which depict the detected object. In this optional refinement, the at least one three-dimensional object hypothesis is then determined as a segment section of a determined segment, additionally as a function of the object information assigned to at least some pixels in the segment section. In other words, the determination of the at least one three-dimensional object hypothesis as a segment section of a determined segment is optionally carried out additionally as a function of the object information assigned to the pixels. In this embodiment, for example, a distance tolerance value can be adjusted as a function of the number of objects detected if the number of object hypotheses determined does not correspond to the number of objects detected. Alternatively or additionally, the number of object hypotheses is determined as a function of the number of objects recognized. In this embodiment, an object hypothesis for an object located in the foreground in the camera's field of view is thus advantageously determined if, for example, a necessary condition, such as a number plate of a vehicle or a person's head, is detected. 
Alternatively or additionally, the number of determined three-dimensional object hypotheses is advantageously checked and, if necessary, a parameter of the method is adjusted if the number of determined object hypotheses does not correspond to the number of objects recognized.
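The adjustment of the distance tolerance value when the number of object hypotheses does not match the number of detected detail objects (for example heads or license plates) could be sketched as a simple iterative search. This is only one conceivable strategy under simplifying assumptions: the distances of a segment are treated as a one-dimensional list, hypotheses are counted as clusters separated by gaps larger than the tolerance, and the tolerance is scaled by a fixed factor. All names and the update rule are hypothetical.

```python
def count_clusters(distances, tolerance):
    """Count groups of distances separated by gaps larger than `tolerance`."""
    if not distances:
        return 0
    ordered = sorted(distances)
    clusters = 1
    for nearer, farther in zip(ordered, ordered[1:]):
        if farther - nearer > tolerance:
            clusters += 1
    return clusters


def adapt_tolerance(distances, detected_objects, tolerance, factor=0.5, max_iters=10):
    """Scale the tolerance until the hypothesis count matches the detected object count.

    Too few hypotheses: the tolerance is reduced. Too many: it is increased.
    The loop is bounded by `max_iters` because a match is not guaranteed.
    """
    for _ in range(max_iters):
        hypotheses = count_clusters(distances, tolerance)
        if hypotheses == detected_objects:
            break
        tolerance = tolerance * factor if hypotheses < detected_objects else tolerance / factor
    return tolerance
```

If, for instance, two license plates are detected in a vehicle segment whose pixel distances cluster around 10 m and 12 m, an initial tolerance of 4 m yields a single hypothesis and is reduced until the two vehicles are separated.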

In a further embodiment, provision can be made for at least one item of texture information and/or color information of the pixels in the determined segment to be determined. The determined texture information and/or the determined color information is then assigned to the respective pixels of the determined segment which map the determined texture information and/or the determined color information. The at least one three-dimensional object hypothesis is then determined as a segment section of a determined segment, additionally depending on the assigned texture information and/or the assigned color information. In other words, the pixels of a determined segment are also combined or grouped depending on the determined texture information and/or the determined color information to form a three-dimensional object hypothesis or a segment section, with in particular those pixels of the segment being combined whose determined or assigned texture information and/or color information is approximately the same. This results in the advantage that different vehicles partially covering one another and/or driving next to each other, or people partially covering one another and/or walking next to each other, can be more easily determined as different three-dimensional object hypotheses.

In another development, the distance data between the surroundings of the vehicle and the vehicle, determined by means of a vehicle camera, are corrected by means of an ultrasonic sensor, a lidar sensor and/or a radar sensor. This results in the advantage that the determined distance data is recorded or determined precisely. This enables three-dimensional object hypotheses to be determined more precisely, so that advantageously, for example in a crowd of people at a traffic light, a number of people standing next to or behind one another and/or partially covering one another can be more easily determined as three-dimensional object hypotheses.

Furthermore, after the determination of the three-dimensional object hypothesis, a validation of the three-dimensional object hypothesis can preferably be carried out, the method being carried out repeatedly based on another camera image previously or later captured by the camera or vehicle camera. As a result, the determination of the object hypotheses is advantageously checked for temporal consistency. In other words, in this embodiment it is checked whether a person or a vehicle has been detected before and after in the camera image and has already been determined as an object hypothesis, since the person or the vehicle cannot suddenly disappear or appear. In addition, the three-dimensional object hypothesis can optionally be validated, with the method being carried out on the basis of another camera image captured earlier or later or at the same time using a different camera from a different perspective. As a result, the determination of the object hypotheses is advantageously checked for perspective consistency. Advantageously, the other camera and the camera or the vehicle camera in this embodiment are part of a stereo camera, so that the distance data can also be precisely recorded or determined. In this development, the method is particularly accurate and reliable.

In an optional refinement of the method, the at least one determined three-dimensional object hypothesis is then displayed in a virtual three-dimensional environment model. The environment model is advantageously displayed or represented from a bird's-eye view. Provision can be made for the three-dimensional object hypothesis to be displayed or represented by means of a synthetic model loaded as a function of the three-dimensional object hypothesis, with the synthetic model representing the object hypothesis. In this case, in particular, the three-dimensional object hypothesis for the vehicle is displayed as a function of the distance information assigned to the pixels which represent the respective determined object hypothesis. Furthermore, it can advantageously be provided that the three-dimensional object hypothesis is additionally displayed as a function of an orientation of the determined object hypothesis, this orientation being determined by another learned machine recognition method.

The invention also relates to a computer program which is set up to carry out a method according to the invention for recognizing three-dimensional objects in a field of view of a camera.

The invention also relates to a machine-readable storage medium on which the computer program product according to the invention is stored.

Furthermore, the invention relates to a control unit. The control unit according to the invention is set up to be connected to at least one camera, the camera being in particular a vehicle camera. The control unit is also set up to carry out a method according to the invention for detecting three-dimensional objects in a field of view of a camera.

Furthermore, the invention relates to a vehicle with a control unit according to the invention.

In addition, the invention relates to a video surveillance system with a control unit according to the invention.

Further advantages result from the following description of exemplary embodiments with reference to the figures.

Figure 1: Vehicle

Figure 2: Procedure

Figure 3: Captured camera image

Figure 4: Rough division of the captured camera image into segments

Figure 5: Determined distance data for the captured camera image

Exemplary embodiments

A vehicle 100 is shown schematically in FIG. 1. Vehicle 100 has a camera 111 or vehicle camera, which is advantageously designed as a mono camera for reasons of cost. The camera 111 captures a partial area 191 of the surroundings 190 which is in the field of view 191 of the camera. Camera 111 is set up to capture at least one camera image of partial area 191 in field of view 191 of surroundings 190 of vehicle 100 or a sequence of camera images of surroundings 190. In particular, camera 111 captures a field of view 191 or a partial area of surroundings 190 in the direction of travel of vehicle 100, or the front surroundings 190 of vehicle 100. It can be provided that, alternatively, a rear portion of surroundings 190 of vehicle 100 is also captured by means of the camera 111 or by means of another camera 120, with each camera 111 being able to be designed as a wide-angle camera. Furthermore, provision can be made for several wide-angle cameras 120 of a surround view camera system to be arranged on the vehicle as cameras 111, as an alternative or in addition. Vehicle 100 optionally includes a stereo vision system 110 which includes camera 111 or vehicle camera and a further camera 112. Camera 111 and the additional camera 112 can be used to capture camera images or a sequence of camera images and, using a triangulation method based on simultaneously captured camera images from camera 111 and the additional camera 112, to determine distances or distance data between camera 111 or the vehicle and surroundings 190 or objects 180 in the surroundings 190 of the vehicle 100. Surrounding objects 180 are, for example, other vehicles or third-party vehicles that are driving ahead of or behind vehicle 100, for example on a common lane 182, or other vehicles that are approaching vehicle 100, for example on another lane 182, or people who are moving, for example, on a sidewalk 181 next to the roadway. 
Alternatively or additionally, vehicle 100 can have at least one radar sensor 130, a lidar sensor (not shown) and/or an ultrasonic sensor 140 as an optional sensor in addition to camera 111 for detecting or determining distance data. The vehicle 100 also has a display device 150 which is set up to display information based on the detected sensor data of the various sensors 111, 112, 120, 130, 140 to a user or driver of the vehicle 100. The vehicle 100 can optionally be set up by means of a control unit to support guidance of the vehicle 100. Vehicle 100 can also optionally be set up by means of a control unit to carry out some driving situations semi-autonomously or fully autonomously, for example a parking maneuver or driving on a freeway.

FIG. 2 shows a sequence of the method for detecting three-dimensional objects 180 in a field of view 191 of a camera 111 as a block diagram. The method begins with acquisition 210 of at least one camera image using camera 111, 112 and/or 120, with camera 111, 112 and/or 120 being arranged in particular on a vehicle 100, or with camera 111, 112 and/or 120 being in particular the vehicle camera according to FIG. 1. Alternatively, the camera 111, 112 and/or 120 can be part of a surveillance system, with the surveillance system being stationary in particular. Subsequently, in step 220, a semantic segmentation of the camera image is carried out using a first learned machine recognition method. For example, in step 220, the first learned machine recognition method or the semantic segmentation detects subareas of the camera image that depict semantic categories, such as at least one person, a vehicle or a car and/or a road and/or a background of the environment that is unimportant for driving a vehicle or for monitoring. In particular, all pixels of the camera image are classified by the semantic segmentation 220, with the semantic segmentation 220 representing a rough classification of the camera image into the respective categories shown; for example, the categories include a background of the camera image or moving objects. In other words, all objects in the vicinity of the camera 111 that are in the field of view 191 of the camera are preferably classified by the semantic segmentation 220 and, in particular, a partial area of the camera image is also recognized or classified as the background. Subsequently, segment information is assigned 221 to those pixels of the respective partial area of the camera image for which a category was recognized. The segment information assigned in step 221 to a respective pixel of the camera image or a representation of the camera image represents the recognized semantic category which is mapped by the pixel. 
Thereafter, in step 222, adjacent pixels of the camera image are grouped into a segment 410, 420 depending on the respectively associated semantic segment information. In other words, in step 222 at least one image section is determined as a segment 410, 420 depending on the semantic segment information assigned to the pixels, with a segment 410, 420 preferably only having pixels that are adjacent to one another. The neighborhood of pixels can be determined in a number of ways according to the prior art. For example, pixels can be considered to be adjacent to one another if only pixels with the same assigned semantic segment information are arranged between two pixels or if a direct connection through pixels with the same assigned semantic segment information is possible between two pixels. A segment 410, 420 can have one or more objects that at least partially cover one another, for example a number of people or a number of vehicles. In a step 230 that is an alternative or addition to step 250 (see below), optical flow vectors or an optical flow for at least part of the pixels of the captured camera image are determined. In step 230, in particular optical flow vectors for the pixels of each determined segment 410, 420 are determined as a function of the camera image and at least one other previously and/or subsequently captured camera image, in particular if the segment 410, 420 or the partial area of the camera image contains at least one moving and/or non-moving environmental object. Furthermore, a determination 240 of at least one item of texture information and/or one item of color information of the pixels in the determined segment 410, 420 can optionally be carried out. In this optional refinement, an assignment 241 of the ascertained texture information and/or the ascertained color information to the respective pixels of the ascertained segment 410, 420 is then carried out. 
The respective pixels to which the determined texture information and/or the determined color information is assigned depict the determined texture information and/or the determined color information. In a further step 250 of the method, which is carried out as an alternative or in addition to step 230, distance data 501 to 507 are determined between surrounding objects 180 in the camera field of view 191, or in the partial area of the surroundings captured by camera 111, 112 and/or 120, and the camera 111, 112 and/or 120. The distance data 501 to 507 are determined in step 250 preferably using a trained second machine recognition method based on the camera image 300 of a mono camera as camera 111, or using a stereo camera 110. Alternatively, the distance data between surrounding objects 180 in the camera field of view 191 and the camera 111, 112 and/or 120 can be determined at least by means of an ultrasonic sensor, a lidar sensor and/or a radar sensor. In an optional step 251, the camera-based distance data determined in step 250 can be corrected and/or validated by distance data determined by means of an ultrasonic sensor, lidar sensor and/or radar sensor. Then, in step 252, pixels of at least part of the camera image are each assigned an item of distance information depending on the distance data determined in step 250 or in step 251. In an optional step 260, at least one object or detail object in the determined segment 410, 420 is recognized by a further trained machine recognition method, the recognized detail object in the segment 410, 420 having a lower degree of abstraction than the assigned segment information or the recognized semantic category of the segment 410, 420. For example, in optional step 260, a number plate is recognized for a determined segment of the category vehicle or moving object. In an optional step 261, not shown in FIG. 2, the recognized detail object is assigned to the respective pixels of the determined, associated or superordinate segment 410, 420.
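The description leaves open how the correction and/or validation in optional step 251 is carried out. The following sketch shows one conceivable variant under loudly stated assumptions: camera-based distances are given per pixel, the ultrasonic, lidar or radar sensor supplies distances for a few pixels only, the camera-based distances are rescaled by the median ratio of sensor distance to camera distance, and sensor measurements that still deviate strongly after rescaling are reported as implausible. The function name `correct_camera_distances`, the median-ratio rule and the parameter `max_relative_error` are hypothetical and not taken from the method described here.

```python
from statistics import median


def correct_camera_distances(camera_distance, sensor_distance,
                             max_relative_error=0.2):
    """Rescale camera-based distances using sparse sensor distances.

    camera_distance: dict mapping pixel (row, col) -> distance estimate.
    sensor_distance: dict mapping pixel (row, col) -> measured distance.
    Returns (corrected distances, set of pixels failing validation).
    """
    # Ratios are formed only where both sources provide a value.
    ratios = [
        sensor_distance[p] / camera_distance[p]
        for p in sensor_distance
        if p in camera_distance and camera_distance[p] > 0
    ]
    if not ratios:
        return dict(camera_distance), set()
    scale = median(ratios)  # assumed correction rule, robust to outliers
    corrected = {p: d * scale for p, d in camera_distance.items()}
    # Validation: remaining deviation relative to the sensor measurement.
    implausible = {
        p for p, d in sensor_distance.items()
        if p in corrected and abs(corrected[p] - d) > max_relative_error * d
    }
    return corrected, implausible


camera = {(0, 0): 4.0, (0, 1): 8.0, (0, 2): 10.0}
sensor = {(0, 0): 5.0, (0, 1): 10.0}
corrected, implausible = correct_camera_distances(camera, sensor)
print(corrected[(0, 2)], implausible)
```

The corrected distances could then serve as the basis for the assignment of distance information in step 252.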
In the next step 270, at least one three-dimensional object hypothesis is determined as a segment section of a determined segment 410, 420 depending on the distance information assigned to the pixels of the segment. In particular, an object hypothesis is determined in step 270 if the distance information assigned to neighboring pixels is approximately the same, or if the assigned distance information of the pixels differs by no more than a distance tolerance value. In other words, the object hypothesis is advantageously determined in step 270 if the distance information assigned to the pixels of a segment section of segment 410, 420, in particular for at least a predefined number of pixels, is approximately the same, or if the distance information assigned to the pixels in at least one segment section differs from one another in each case by no more than a distance tolerance value. The determination 270 of the object hypothesis is set up to separate two different objects that are imaged in the same segment 410, 420 and that in particular cover one another, since they are at different distances from the camera, which is represented by the distance information. It can optionally be provided in step 270 that the object hypothesis is only determined if the distance information assigned to the pixels of a segment section of segment 410, 420 is less than or equal to a distance threshold value for at least a predetermined number of pixels. In other words, three-dimensional object hypotheses are advantageously determined in step 270 only within a closer environment of the camera or of the vehicle, this closer environment being defined by the distance threshold value. Furthermore, in an optional development, the three-dimensional object hypothesis is determined 270 only if the number of pixels of the segment section is greater than or equal to a minimum value.
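The determination 270 based on distance information can be sketched as a region-growing procedure, given here only as one possible illustration. It is assumed that a segment is given as a list of pixel coordinates, that the distance information is a dictionary per pixel, and that neighboring pixels belong to the same segment section if their distances differ by no more than the distance tolerance value. The function name `object_hypotheses` and the parameter names `tolerance`, `distance_threshold`, `min_near_pixels` and `min_pixels` are hypothetical.

```python
from collections import deque


def object_hypotheses(segment_pixels, distance, tolerance,
                      distance_threshold=float("inf"),
                      min_near_pixels=1, min_pixels=1):
    """Split one segment into segment sections of similar distance.

    segment_pixels: iterable of (row, col) pixels of one segment.
    distance: dict mapping (row, col) -> assigned distance information.
    Returns a list of segment sections (sorted pixel lists).
    """
    remaining = set(segment_pixels)
    hypotheses = []
    while remaining:
        seed = min(remaining)  # deterministic choice of the start pixel
        remaining.remove(seed)
        section = [seed]
        queue = deque([seed])
        while queue:
            y, x = queue.popleft()
            for n in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                # Neighbors join the section if within the tolerance.
                if n in remaining and abs(distance[n] - distance[(y, x)]) <= tolerance:
                    remaining.remove(n)
                    section.append(n)
                    queue.append(n)
        near = sum(1 for p in section if distance[p] <= distance_threshold)
        # Optional conditions: closer environment and minimum section size.
        if len(section) >= min_pixels and near >= min_near_pixels:
            hypotheses.append(sorted(section))
    return hypotheses


pixels = [(0, c) for c in range(6)]
dist = dict(zip(pixels, [5.0, 5.1, 5.2, 9.0, 9.1, 9.0]))
print(object_hypotheses(pixels, dist, tolerance=0.5))
print(object_hypotheses(pixels, dist, tolerance=0.5, distance_threshold=8.0))
```

In the example, the first call yields two segment sections, one at a distance of roughly 5 and one at roughly 9, whereas the second call keeps only the nearer section because of the distance threshold value.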
It can also be provided in step 270 that, when determining 270 the three-dimensional object hypothesis, a distance tolerance value is adjusted depending on the segment information assigned to the pixels of the segment, depending on the distance information assigned to the pixels of the segment and/or depending on a detected speed of the vehicle. Advantageously, the at least one three-dimensional object hypothesis is determined 270 as a segment section of a determined segment additionally or alternatively as a function of the determined optical flow. Furthermore, it can be provided that the determination 270 of the at least one three-dimensional object hypothesis as a segment section of a determined segment is additionally carried out as a function of the recognized object or the recognized detail object. For example, a vehicle driving ahead is advantageously recognized when a number plate is recognized in the segment section. Provision can furthermore be made in step 270 for the three-dimensional object hypothesis to be determined as a function of a determined number of objects in the segment, for example by adjusting the distance tolerance value. Furthermore, the determination 270 of the three-dimensional object hypothesis as a segment section of a determined segment can also take place depending on the assigned texture information and/or the assigned color information, so that a green vehicle can be separated or differentiated more easily from a red vehicle. In a further optional method step 280, the method is first carried out repeatedly on the basis of another camera image captured previously or later by the vehicle camera. Then, in optional step 280, the consistency of the object hypothesis with object hypotheses determined earlier or later is checked, or the determined object hypothesis is validated or discarded as a function of the object hypothesis determined at a different point in time.
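How the consistency check in optional step 280 is carried out is not specified in detail. One conceivable variant is sketched below under the following assumptions: object hypotheses are given as collections of pixel coordinates, the object moves only slightly in the image between the two points in time, and a hypothesis counts as validated if its pixel overlap (intersection over union) with at least one hypothesis from the other point in time reaches a minimum value. The function names `overlap_ratio` and `validate_hypothesis` and the parameter `min_overlap` are hypothetical.

```python
def overlap_ratio(pixels_a, pixels_b):
    """Intersection over union of two pixel collections."""
    a, b = set(pixels_a), set(pixels_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def validate_hypothesis(current, other_hypotheses, min_overlap=0.5):
    """Return True if `current` is consistent with any other hypothesis.

    current: pixels of the hypothesis to be validated.
    other_hypotheses: hypotheses determined at a different point in time.
    """
    return any(
        overlap_ratio(current, other) >= min_overlap
        for other in other_hypotheses
    )


current = [(0, 0), (0, 1), (0, 2)]
earlier = [[(0, 1), (0, 2), (0, 3)]]
print(validate_hypothesis(current, earlier))      # overlap 2/4 = 0.5
print(validate_hypothesis(current, [[(5, 5)]]))   # no overlap
```

A hypothesis for which the check fails could be discarded, as described for step 280; with larger image motion, the earlier hypothesis would first have to be shifted, for example using the optical flow.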
Furthermore, in another optional step 281, the method can be carried out repeatedly based on a camera image captured from a different perspective. Then, in optional step 281, the consistency of the determined three-dimensional object hypothesis with an object hypothesis determined from a different perspective is checked, or the determined object hypothesis is validated or rejected depending on the object hypothesis determined from the different perspective. Finally, in an optional method step 290, the at least one determined three-dimensional object hypothesis can be displayed in a virtual three-dimensional environment model.

A captured camera image 300 is shown schematically in FIG. 3. The camera image 300 depicts the partial area of the surroundings captured in the field of view 191 of the camera 111, 112 and/or 120. Shown, for example, are a roadway or lane 182 with a vehicle 320 driving ahead as a moving object 180, pedestrians 310 partially covering one another on a sidewalk 181 as further moving objects 180, and a vehicle 330 parked on a sidewalk 181 as a stationary moving object 180, the parked vehicle 330 being partially covered by the vehicle 320 driving ahead.

FIG. 4 shows a categorized representation or rough classification 400, determined according to steps 220, 221 and 222, of the captured camera image 300 from FIG. 3. In the camera image 300, moving objects 180, for example people and vehicles, are initially recognized as a semantic category by a first learned machine recognition method. Furthermore, the semantic segmentation 220 recognizes a background in the camera image 300 that is not relevant to the driving of the vehicle. Provision can be made for recognizing further semantic categories, for example the roadway 182. The respective pixels which depict the vehicles and people are assigned the category moving object 180 as segment information in step 221. Subsequently, in step 222, segments 410 and 420, for example, and advantageously at least one segment 430 for the roadway 182 and at least one segment 440 for the background, are formed or determined by grouping adjacent pixels with the same assigned segment information, in particular moving object 180. Through steps 221 and 222, the semantic segmentation 220 of the captured camera image consequently results in the rough division 400 of the camera image 300 shown in FIG. 4 into segments 410, 420, 430 and 440, this rough division 400 in particular separating adjacent pixels with different assigned segment information from one another. A segment 410, 420 of the camera image can represent or include a number of people and/or vehicles.

In FIG. 5, distance data for the captured camera image 300 from FIG. 3, determined by means of the second learned machine recognition method, are shown schematically. The areas 501 to 507, some of which, but not necessarily all, run in the form of a ring, each represent a different distance between the surroundings with the surrounding objects 180, 310, 320, 330 and the camera 111, 112 and/or 120 or the vehicle 100. It can be seen that, based on the detected or determined distance data 501 to 507, determined distance information can be assigned to at least a large number of pixels of the camera image 300. The distance data 501 to 507 are advantageously estimated, determined or detected in a very computationally efficient manner by the second learned machine recognition method or, not shown in FIG. 5, are preferably determined by a stereo camera method, since distance data determined by means of a stereo camera method have a high quality or reliability. Alternatively, the distance data can be recorded or determined by an ultrasonic, radar or lidar sensor, which advantageously results in distance data of high quality or reliability. In step 270, the person 510 in the foreground, for example, can easily be determined as a separate three-dimensional object hypothesis in the segment 410 and distinguished from the people 511, 512 located behind, based on the determined distance data. Similarly, with the method according to the invention, vehicles driving one behind the other that conceal one another can be reliably separated from one another as separate three-dimensional object hypotheses (not shown). Vehicles 320 and 330 depicted in camera image 300 cannot be clearly distinguished from each other based on distance data, despite the different distances of their respective rears, since vehicles 320 and 330 have partly different and partly identical distances to the camera due to their respective spatial depth.
However, the optical flow vectors for vehicles 320 and 330 have very different magnitudes because vehicle 320 is driving and vehicle 330 is parked or stationary. Vehicles 320 and 330 can therefore advantageously be determined very reliably as different three-dimensional object hypotheses in the same segment 420 if the three-dimensional object hypothesis is determined as a function of the optical flow or the optical flow vectors of the respective pixels of a segment.
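The separation of vehicles 320 and 330 based on the optical flow can be illustrated by a minimal sketch. It assumes that an optical flow vector (dx, dy) in pixels per frame is already available for each pixel of the segment, and that a single threshold on the magnitude of the flow vector is sufficient to separate the two objects. The function name `split_by_flow` and the parameter `magnitude_threshold` are hypothetical; the optical flow itself would have to be determined beforehand by a method according to the prior art.

```python
import math


def split_by_flow(segment_pixels, flow, magnitude_threshold):
    """Split a segment into two sections by optical flow magnitude.

    segment_pixels: iterable of (row, col) pixels of one segment.
    flow: dict mapping (row, col) -> (dx, dy) optical flow vector.
    Returns (pixels with high flow magnitude, pixels with low magnitude).
    """
    high_flow, low_flow = [], []
    for p in segment_pixels:
        dx, dy = flow[p]
        if math.hypot(dx, dy) > magnitude_threshold:
            high_flow.append(p)
        else:
            low_flow.append(p)
    return high_flow, low_flow


segment = [(0, 0), (0, 1), (0, 2), (0, 3)]
flow = {(0, 0): (3.0, 4.0), (0, 1): (2.5, 4.0), (0, 2): (0.1, 0.0), (0, 3): (0.0, 0.2)}
print(split_by_flow(segment, flow, magnitude_threshold=1.0))
```

Which of the two sections corresponds to the driving vehicle and which to the parked vehicle depends on the camera's own motion, so the sketch deliberately only distinguishes high and low flow magnitudes.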

Claims

1. Method for detecting three-dimensional objects (180) in a field of view (191) of a camera (111, 112, 120), wherein the camera (111, 112, 120) in particular captures at least a part of an environment (190) of a vehicle (100), comprising the following steps:

• Acquisition (210) of at least one camera image (300) by means of the at least one camera (111, 112, 120), the camera (111, 112, 120) being arranged in particular on the vehicle (100),

• semantic segmentation (220) of the camera image (300) by a first learned machine recognition method,

• Allocation (221) of segment information to the pixels of the camera image (300) depending on the semantic segmentation,

• Determination (222) of at least one image section as a segment (410, 420, 430, 440), with adjacent pixels of the camera image (300) being grouped into a segment (410, 420, 430, 440) depending on the respectively assigned semantic segment information,

• Determination (250) of distance data (501 to 507) between surrounding objects (180) in the camera field of view (191) and the camera (111, 112, 120) and assignment (252) of distance information to the pixels of at least part of the camera image (300) depending on the determined distance data (501 to 507), and/or

• determining (230) an optical flow to at least some of the pixels of the captured camera image, in particular the determined segment, depending on the camera image (300) and at least one other camera image captured beforehand and/or afterwards, and

• Determining (270) at least one three-dimensional object hypothesis (510, 511, 512, 520, 530) as a segment section of a determined segment (410, 420, 430, 440) depending on the distance information assigned to the pixels of the segment and/or depending on the determined optical flow of the pixels of the segment.

2. Method according to claim 1, wherein the determination (270) of the three-dimensional object hypothesis (510, 511, 512, 520, 530) only takes place if the distance information assigned to the pixels of the segment section is less than or equal to a distance threshold value for at least a predetermined number of pixels.

3. Method according to one of the preceding claims, wherein the three-dimensional object hypothesis (510, 511, 512, 520, 530) is only determined (270) if the number of pixels in the segment section is greater than or equal to a minimum value.

4. Method according to one of the preceding claims, wherein, when determining (270) the three-dimensional object hypothesis (510, 511, 512, 520, 530), a distance tolerance value is adjusted depending on the assigned segment information, the assigned distance information and/or a detected speed of the vehicle (100).

5. Method according to one of the preceding claims, wherein the following step is performed before the determination (270) of the three-dimensional object hypothesis (510, 511, 512, 520, 530):

• Recognition (240) of at least one detail object in the determined segment by a further learned machine recognition method, wherein

• the determination (270) of the at least one three-dimensional object hypothesis as a segment section of a determined segment is additionally carried out as a function of the recognized detail object.

6. Method according to one of the preceding claims, wherein the following steps are carried out before the determination (270) of the three-dimensional object hypothesis (510, 511, 512, 520, 530):

• Determination (260) of at least one item of texture information and/or one item of color information of the pixels in the determined segment, and

• Allocation (261) of the determined texture information and/or the determined color information to the respective pixels of the determined segment which depict the determined texture information and/or the determined color information, wherein

• the determination (270) of the at least one three-dimensional object hypothesis (510, 511, 512, 520, 530) as a segment section of a determined segment also takes place as a function of the assigned texture information and/or the assigned color information.

7. Method according to one of the preceding claims, wherein the distance data between surrounding objects (180) in the camera field of view (191) and the camera (111, 112, 120) are determined or corrected at least by means of an ultrasonic sensor (140), a lidar sensor and/or a radar sensor (130).

8. Method according to one of the preceding claims, wherein the following step is performed after the determination (270) of the three-dimensional object hypothesis (510, 511, 512, 520, 530):

• Validation (280) of the determined three-dimensional object hypothesis (510, 511, 512, 520, 530), the method being carried out repeatedly based on another camera image captured previously and/or later by means of the camera (111, 112, 120), and a three-dimensional object hypothesis determined earlier and/or later being compared with the determined three-dimensional object hypothesis (510, 511, 512, 520, 530).

9. Method according to one of the preceding claims, wherein the following step is performed after the determination (270) of the three-dimensional object hypothesis (510, 511, 512, 520, 530):

• Validation (281) of the three-dimensional object hypothesis (510, 511, 512, 520, 530), the method being carried out repeatedly based on another camera image captured previously, later or at the same time by means of another camera (112) from a different perspective, and a three-dimensional object hypothesis determined from the other perspective being compared with the determined three-dimensional object hypothesis (510, 511, 512, 520, 530).

10. Method according to one of the preceding claims, wherein the following step is carried out:

• Display (290) of the at least one determined three-dimensional object hypothesis (510, 511, 512, 520, 530) in a virtual three-dimensional environment model.

11. Computer program which is set up to carry out a method for detecting three-dimensional objects according to one of claims 1 to 10.

12. Machine-readable storage medium on which the computer program product according to claim 11 is stored.

13. Control unit, the control unit being set up to be connected to at least one camera (111, 112, 120) and to carry out a method for detecting three-dimensional objects according to one of claims 1 to 10.

14. Vehicle (100) with a control unit according to claim 13.

15. Video surveillance system with a control unit according to claim 13.