
WO2021148392A1 - Method and device for object identification on the basis of sensor data

Info

Publication number: WO2021148392A1
Application number: PCT/EP2021/051045
Authority: WO
Inventors: Raimund BOHL; William Harris Beluch
Original assignee: Robert Bosch GmbH
Priority date: 2020-01-24
Filing date: 2021-01-19
Publication date: 2021-07-29
Also published as: DE102020200847A1

Abstract

The invention relates to a method for detecting a specific object, in particular a drone, in a surrounding area, comprising the following steps: - capturing image data of at least one optical capturing source; - capturing sound data from at least one acoustic capturing source; - generating image data feature maps for the image data on the basis of an image data feature extractor; - generating sound data feature maps for the sound data on the basis of a sound data feature extractor; - processing the image data feature maps and the sound data feature maps in order to provide a statement of presence which states whether the specific object is or is not present in the surrounding area.

Description

Method and device for object identification based on sensor data

Technical field

The invention relates to a method for identifying an object based on sensor data from various types of sensors. In particular, the present invention relates to determining the presence or absence of an object in a surrounding area based on visual and acoustic information.

Technical background

Methods for identifying the presence of an object are known. Optical detection sources, such as cameras that record a surrounding area, are often used for this purpose. With the aid of object recognition, the presence of an object can then be recognized by evaluating the recorded image data.

There are various established deep learning methods for object recognition and classification based on image data. These usually use convolutional neural networks as feature extractors in conjunction with a classification stage, as in SSD, YOLO, Faster R-CNN and the like.

The reliability of such object recognition systems is limited and depends to a considerable extent on the quality of the image data to be processed.

Disclosure of the invention

According to the invention, a method for detecting a certain object, in particular a (for example at least partially autonomous) robot such as a drone, in a surrounding area according to claim 1, and a corresponding device and a detection system according to the independent claims, are provided.

Further refinements are given in the dependent claims.

According to a first aspect, a computer-implemented method for detecting a (specific) object, in particular a (for example at least partially autonomous) robot such as a drone, in a surrounding area is provided, with the following steps:

Acquiring image data of at least one optical acquisition source;

Acquiring sound data from at least one microphone;

Generating one or more image data feature maps for the image data based on an optical feature extractor;

Generating one or more sound data feature maps for the sound data based on an acoustic feature extractor;

Processing the one or more image data feature maps and the one or more sound data feature maps to provide an indication of presence indicating whether or not the particular object is present in the surrounding area.
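The steps above can be sketched as a small pipeline. This is only an illustrative sketch, not the claimed implementation: the function name `detect_presence`, the toy extractors, the toy classifier and the 0.5 threshold are all assumptions made for the example, and feature maps are simplified to flat lists of numbers.

```python
from typing import Callable, List, Sequence

FeatureMap = List[float]  # simplification: a feature map as a flat list


def detect_presence(
    image_data: Sequence[float],
    sound_data: Sequence[float],
    image_extractor: Callable[[Sequence[float]], List[FeatureMap]],
    sound_extractor: Callable[[Sequence[float]], List[FeatureMap]],
    classifier: Callable[[List[FeatureMap], List[FeatureMap]], float],
    threshold: float = 0.5,  # assumed decision threshold
) -> bool:
    """Return True if the object is judged to be present."""
    image_maps = image_extractor(image_data)     # image data feature maps
    sound_maps = sound_extractor(sound_data)     # sound data feature maps
    score = classifier(image_maps, sound_maps)   # joint processing
    return score >= threshold                    # indication of presence


# Hypothetical stand-ins for trained networks, for illustration only.
def toy_image_extractor(pixels: Sequence[float]) -> List[FeatureMap]:
    return [[sum(pixels) / len(pixels)]]


def toy_sound_extractor(samples: Sequence[float]) -> List[FeatureMap]:
    return [[sum(abs(s) for s in samples) / len(samples)]]


def toy_classifier(image_maps: List[FeatureMap], sound_maps: List[FeatureMap]) -> float:
    return 0.5 * image_maps[0][0] + 0.5 * sound_maps[0][0]
```

In a real system the two extractors and the classifier would be trained models; the sketch only shows how the acquired data flows through them to a single yes/no indication.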

In addition to the aforementioned method for visual object recognition and identification, methods for acoustic event identification and classic signal processing methods based on acoustic sensor signals are also known.

The above method enables the detection of the presence of a certain object, in particular an object that emits sound at least during operation, in a surrounding area, i.e. within a radius around a predetermined position, for example to detect the presence of a drone in the vicinity of a certain building. For this purpose, image data from one or more optical detection sources, such as surveillance cameras, and sound data as acoustic sensor signals from one or more acoustic detection sources are used.

However, the visual detection of the presence of the specific object can be made more difficult by disturbed or noisy image data, for example a blurred or noisy image due to sensor noise, insufficient lighting in night or twilight shots, unsuitable weather conditions or the like. This leads to false detections or missed detections of the presence of the particular object.

Likewise, detection of the presence of the particular object from sound data of the ambient noise can be unreliable in the presence of strong background noise, adverse weather conditions or low signal strengths of the sound signals.

The combination of the information from both detection sources can significantly increase the detection reliability.

With the aid of sensor data fusion, the trustworthiness of the detection of the presence or absence of the specific object, or of the identification of the specific object, can be improved, so that false identifications (false positives) or non-identifications (false negatives) can be avoided more reliably. For this purpose, use is made of the additional information that results from the use of different detection sources, namely one or more optical detection sources (cameras) and one or more acoustic detection sources (microphones).

By using a feature extractor (image data feature extractor), in particular one based on a convolutional neural network (CNN), for each optical detection source (camera), object detection can be carried out on the basis of image data; this detection covers both the presence of an object and, optionally, the type of the recognized object. Each of the feature extractors can have a number of different layers that provide learned feature maps, each of which has its own dimensionality depending on the depth of the layer. One or more of these feature maps can be used in a feature extractor designed as an image classifier for visual object recognition in order to detect a specific object or a specific type of object. As a result, the presence of the specific object can be indicated by a bounding frame, for which an object classification can furthermore be carried out.

The sound data can also be processed with a feature extractor (sound data feature extractor), in particular one based on a convolutional neural network (CNN), for each acoustic detection source (microphone), so that feature maps are obtained. These are used by a sound classifier to identify the particular object in the environment.

The classification results are each provided with uncertainty information by the respective classifier, which indicates how reliable the classification of the corresponding feature data from the feature extractors was.

Furthermore, the image data feature maps and the sound data feature maps can be processed with a data-based classification model which is trained to recognize the presence of the specific object in the surrounding area.

In this way, classification results that indicate whether or not a specific object is present, each provided with an uncertainty measure, can be determined from the feature extractions.

In particular, the processing of the image data feature maps and the sound data feature maps can each be carried out with a weighting that is dependent on the degree of uncertainty.

Furthermore, the image data feature maps can each correspond to an output from different layers of a convolutional neural network of the image data feature extractor, and/or the sound data feature maps can each correspond to an output from different layers of a convolutional neural network of the sound data feature extractor.

According to one embodiment, the image data feature maps and/or the sound data feature maps can each be provided with an uncertainty measure.

According to a further aspect, a device for detecting a specific object, in particular a drone, in a surrounding area is provided, comprising: at least one optical acquisition source for acquiring image data; at least one acoustic acquisition source for acquiring sound data; an image data feature extractor for generating one or more image data feature maps for the image data; a sound data feature extractor for generating one or more sound data feature maps for the sound data; a coordinator which is configured to process the one or more image data feature maps and the one or more sound data feature maps in order to provide an indication of presence, which indicates whether or not the particular object is present in the surrounding area.

Brief description of the drawings

Embodiments are explained in more detail below with reference to the accompanying drawings. In the drawings:

Figure 1 is a schematic representation of an object recognition system using the example of recognition of the presence or absence of a drone in the vicinity of a building; and

Figure 2 is a block diagram to illustrate the structure of an object recognition device.

Description of embodiments

FIG. 1 shows a schematic illustration of an object recognition system 1 using the example of recognition of the presence or absence of a drone 3 as a specific object in the vicinity of a building 2.

The object recognition system 1 comprises an object recognition device 4, which is connected to a camera system 5 comprising one or more cameras 51 (or other optical acquisition sources) for acquiring image signals and to a microphone system 6 with one or more microphones 61 for acquiring sound signals. Detection of the presence of a drone 3 as the specific object can be signaled by the object recognition device 4 in a suitable manner, for example by outputting an alarm or the like with the aid of a signaling device 7.

The object recognition device 4 has a schematic structure as shown in FIG. 2.

In an image data feature extractor 11, the image signals B from the cameras 51 are analyzed as image data using a convolutional neural network (CNN), such as a VGG-16 network, in order to create image data feature maps MB from the relevant camera image. The image data feature extractor 11 is trained so that one or more specific objects 3, such as a drone, can be detected. The image data feature extractor 11 has a CNN with a plurality of layers, each of which provides image data feature maps MB. The image data feature maps MB each correspond to the output of a layer and each have their own dimensionality depending on the depth of the respective layer.
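To illustrate how the dimensionality of a feature map depends on layer depth, the following sketch computes the output shape after each of the five convolutional blocks of a standard VGG-16. It assumes the usual VGG-16 layout (3x3 convolutions with padding 1, followed by 2x2 max pooling with stride 2) and takes feature maps at block outputs; which layers the extractor 11 actually taps is not specified in the text, and the function name is chosen for the example.

```python
# Standard VGG-16 convolutional blocks: (number of 3x3 conv layers, output channels).
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg16_block_output_shapes(height: int, width: int):
    """Return (channels, height, width) after each block's 2x2 max pooling."""
    shapes = []
    for _num_convs, channels in VGG16_BLOCKS:
        # 3x3 convolutions with padding 1 keep the spatial size unchanged;
        # the 2x2 max pooling with stride 2 that ends each block halves it.
        height //= 2
        width //= 2
        shapes.append((channels, height, width))
    return shapes


for depth, shape in enumerate(vgg16_block_output_shapes(224, 224), start=1):
    print(f"block {depth}: channels x height x width = {shape}")
```

For an assumed 224 x 224 input, the spatial resolution shrinks from 112 x 112 to 7 x 7 while the channel count grows from 64 to 512, so deeper feature maps are spatially coarser but richer in channels.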

Some of these image data feature maps MB can be used for object recognition in a detection module 12, so that a bounding frame is generated for each of the objects 3 determined in the vicinity of the object recognition device 4. For each of the bounding frames, an image data classifier 13 can perform an object classification which indicates the type of object 3 marked by the bounding frame, i.e. the type of drone in the present exemplary embodiment.

With the aid of a sound data feature extractor 14, the sound signals received by one or more microphones 61 of the microphone system 6 can be analyzed. The sound signals provided by the microphones 61 of the microphone system 6 can be unprocessed sound data or preprocessed sound data, such as a log-mel spectrum.
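As an illustration of the log-mel preprocessing mentioned above, the following sketch computes a log-mel vector for a single audio frame using only the Python standard library. The parameter choices (Hann window, 8 triangular filters, natural logarithm, the common 2595 * log10(1 + f/700) mel formula) are assumptions for the example; the text does not specify them. A practical system would use an FFT-based library routine instead of the naive DFT used here.

```python
import cmath
import math


def hz_to_mel(hz: float) -> float:
    """One common mel-scale convention (assumed here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum of a Hann-windowed frame (bins 0..N/2)."""
    n = len(frame)  # assumes n >= 2
    windowed = [
        x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(num_filters: int, n_fft: int, sample_rate: float):
    """Triangular filters whose corner points are equally spaced in mel."""
    max_mel = hz_to_mel(sample_rate / 2.0)
    mel_points = [max_mel * i / (num_filters + 1) for i in range(num_filters + 2)]
    # Fractional DFT-bin positions of the filter corner frequencies.
    bins = [mel_to_hz(m) * n_fft / sample_rate for m in mel_points]
    filters = []
    for f in range(1, num_filters + 1):
        left, centre, right = bins[f - 1], bins[f], bins[f + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if left < k <= centre:
                row.append((k - left) / (centre - left))    # rising edge
            elif centre < k < right:
                row.append((right - k) / (right - centre))  # falling edge
            else:
                row.append(0.0)
        filters.append(row)
    return filters


def log_mel_frame(frame, sample_rate: float, num_filters: int = 8, eps: float = 1e-10):
    """Log of the mel-filtered power spectrum of one frame."""
    power = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    return [math.log(sum(w * p for w, p in zip(row, power)) + eps) for row in bank]
```

Stacking such vectors for consecutive frames gives a two-dimensional time-frequency representation, which is one plausible reason a CNN can be applied to sound data in the same way as to images.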

The sound data feature extractor 14 is trained so that the presence of one or more specific objects 3, such as a drone, can be detected. The sound data feature extractor 14 has a multi-layered CNN, the layers of which each provide sound data feature maps MK. The sound data feature maps MK each have their own dimensionality depending on the depth of the respective layer.

The sound data feature maps MK enable the recorded sound signal to be classified, which can indicate the presence or absence of a specific object, such as a drone, in the environment.

Uncertainty measures which indicate the prediction reliability can be provided for the classifications of the feature extractors 11, 14. The uncertainty measures are provided for each of the feature maps so that the relevant feature map can be taken into account according to its reliability. The uncertainty measure of each feature map can thus be applied as a weighting when the relevant feature map is taken into account.

A coordinator 15, which carries out the joint classification of the feature maps MB, MK, can be provided for the data fusion. For this purpose, the feature maps from the two feature extractors 11, 14 are taken jointly as input data of the coordinator 15 and weighted according to their corresponding prediction uncertainties. The coordinator 15 includes a trainable data-based classification model that is trained to recognize the presence of the specific object on the basis of the feature maps.
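A minimal sketch of such an uncertainty-weighted fusion is given below. The weighting rule (weight = 1 - uncertainty), the concatenation of flattened feature maps and the logistic classifier are all assumptions chosen for illustration; the text does not fix how the weighting or the classification model of the coordinator 15 is realized.

```python
import math


def weight_from_uncertainty(uncertainty: float) -> float:
    """Assumed rule: map an uncertainty in [0, 1] to a weight in [0, 1]."""
    return 1.0 - min(max(uncertainty, 0.0), 1.0)


def fuse(feature_maps, uncertainties):
    """Scale each flattened feature map by its weight and concatenate."""
    fused = []
    for fmap, uncertainty in zip(feature_maps, uncertainties):
        weight = weight_from_uncertainty(uncertainty)
        fused.extend(weight * value for value in fmap)
    return fused


def presence_probability(fused, params, bias: float) -> float:
    """Logistic model standing in for the trained classification model."""
    z = bias + sum(p * x for p, x in zip(params, fused))
    return 1.0 / (1.0 + math.exp(-z))


# Example: one image feature map (certain) and one sound feature map (uncertain).
fused_input = fuse([[1.0, 2.0], [4.0]], [0.0, 0.5])
print(fused_input)
```

With this rule, a feature map with uncertainty 1 contributes nothing to the classifier input, while a feature map with uncertainty 0 is passed through unchanged.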

The combined feature maps are fed to a classification module of the coordinator 15, which is parameterized by classification parameters and undertakes the final classification with regard to the presence of the specific object 3. The classification module can have a CNN or a DNN, and on the input side it can be adapted to the dimensionality of the feature maps MB, MK. The classification module can be trained to provide an indication of whether a specific object, i.e. a drone 3, is in the vicinity of the object recognition system 1, depending on the feature maps MB, MK and the associated uncertainty measures.
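One possible way to adapt the input side to feature maps of differing dimensionality is global average pooling, which reduces each map to one value per channel before concatenation. The text does not say how the adaptation is done, so the following is only a sketch under that assumption; the function names are hypothetical, and feature maps are represented as nested lists (channels x rows x columns).

```python
def global_average_pool(feature_map):
    """Reduce a channels x rows x columns feature map to one mean per channel."""
    pooled = []
    for channel in feature_map:
        values = [value for row in channel for value in row]
        pooled.append(sum(values) / len(values))
    return pooled


def build_classifier_input(image_maps, sound_maps):
    """Concatenate pooled image and sound feature maps into one flat vector."""
    vector = []
    for feature_map in list(image_maps) + list(sound_maps):
        vector.extend(global_average_pool(feature_map))
    return vector


# Example: a 2-channel 2x2 image feature map and a 1-channel 1x3 sound feature map.
image_map = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
sound_map = [[[3.0, 3.0, 3.0]]]
print(build_classifier_input([image_map], [sound_map]))
```

The length of the resulting vector depends only on the channel counts, not on the spatial sizes, so a classifier with a fixed input width can accept maps taken from layers of different depth.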

If a bounding frame or a type of the object is to be output, this must be taken into account in the training data with which the coordinator is trained.

The uncertainty measures for certain feature maps with regard to the classification parameters can be determined, for example, from a multiple classification of the image data or the sound data, so that the number of matching classification results indicates a measure for the prediction uncertainty. This is known, for example, from Lakshminarayanan, B. et al., “Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles”, NIPS, 2017.
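The idea that the number of matching classification results indicates the prediction uncertainty can be sketched as follows. The specific formula (one minus the fraction of results that agree with the majority) is an assumption for illustration and is not taken from the cited paper; the function name is hypothetical.

```python
from collections import Counter


def agreement_uncertainty(predictions):
    """Return (majority_label, uncertainty) from repeated classifications.

    The uncertainty is 1 minus the fraction of results that match the
    majority label, so full agreement gives 0.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    label, count = Counter(predictions).most_common(1)[0]
    return label, 1.0 - count / len(predictions)


# Example: five classifications of the same input, four of which agree.
label, uncertainty = agreement_uncertainty([True, True, True, True, False])
print(label, uncertainty)
```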

Furthermore, using a so-called Monte Carlo dropout method, as known from Gal, Y. et al., “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”, ICML, 2016, the camera images and/or the microphone data can each be superimposed with different types of noise signals (in particular from a stochastic noise process) and the corresponding classification can be made in each case. The number of matching predictions about the presence or absence of the object can then indicate a measure of the prediction uncertainty for the feature maps. The individual classification predictions with the corresponding uncertainty measures can then be combined to determine the uncertainty of the overall prediction about the presence or absence of an object.
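The noise-superposition procedure described in this paragraph can be sketched as below. Note that the sketch perturbs the input data with Gaussian noise, as the paragraph describes, rather than randomly dropping units inside a network as in the cited dropout paper. The function name, the Gaussian noise model, the number of runs and the toy classifier in the example are assumptions for illustration.

```python
import random


def noisy_vote_uncertainty(signal, classify, num_runs=50, noise_std=0.1, seed=0):
    """Classify noisy copies of a signal and measure how often results match.

    `classify` is any callable returning a truthy value when the object is
    judged present. Returns (present, uncertainty), where the uncertainty is
    1 minus the fraction of runs that agree with the majority decision.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    positive = 0
    for _ in range(num_runs):
        noisy = [x + rng.gauss(0.0, noise_std) for x in signal]
        if classify(noisy):
            positive += 1
    present = 2 * positive >= num_runs
    matching = positive if present else num_runs - positive
    return present, 1.0 - matching / num_runs


# Hypothetical classifier: "present" if the mean signal level exceeds 0.5.
def mean_above_half(samples):
    return sum(samples) / len(samples) > 0.5


print(noisy_vote_uncertainty([5.0] * 10, mean_above_half))
```

An input far from the decision boundary yields the same result in every run and hence an uncertainty of 0, whereas an input near the boundary produces mixed votes and a higher uncertainty.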

Claims


1. A method for detecting an object, in particular a robot, in a surrounding area, with the following steps:

Acquiring image data of at least one optical acquisition source;

Acquiring sound data from at least one acoustic acquisition source;

Generating one or more image data feature maps for the image data based on an image data feature extractor;

Generating one or more sound data feature maps for the sound data based on a sound data feature extractor;

Processing the one or more image data feature maps and the one or more sound data feature maps to provide an indication of presence indicating whether or not the particular object is present in the surrounding area.

2. The method according to claim 1, wherein the image data feature maps each correspond to an output from different layers of a convolutional neural network of the image data feature extractor and/or wherein the sound data feature maps each correspond to an output from different layers of a convolutional neural network of the sound data feature extractor.

3. The method of claim 1 or 2, wherein the image data feature maps and the sound data feature maps are processed with a data-based classification model that is trained to recognize the presence of the specific object in the surrounding area.

4. The method according to claim 3, wherein the image data feature maps and/or the sound data feature maps are each provided with an uncertainty measure.

5. The method according to claim 4, wherein the processing of the image data feature maps and the sound data feature maps is each carried out with a weighting that is dependent on the degree of uncertainty.

6. A device for detecting a specific object, in particular a drone, in a surrounding area, comprising: at least one optical acquisition source for acquiring image data; at least one acoustic acquisition source for acquiring sound data; an image data feature extractor for generating one or more image data feature maps for the image data; a sound data feature extractor for generating one or more sound data feature maps for the sound data; a coordinator which is configured to process the one or more image data feature maps and the one or more sound data feature maps in order to provide an indication of presence, which indicates whether or not the particular object is present in the surrounding area.

7. A computer program with program code means which is set up to carry out a method according to any one of claims 1 to 5 when the computer program is run on a computing unit.

8. Machine-readable storage medium with a computer program according to claim 7 stored thereon.