WO2018224447A1 - Method and apparatus for analysing image data - Google Patents
- Publication number: WO2018224447A1 (PCT/EP2018/064644)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- processing results
- processing
- image
- neural nets
- concepts
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/768—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Definitions
- The invention relates to a method and an apparatus for analysing image data using an artificial deep neural network, also denoted as a neural net.
- Image analysis is a common problem in various applications. It is, for example, often necessary or useful to detect positions and/or numbers of objects in a scene, or to segment a scene or an image semantically.
- Deep neural networks have been successfully used for specific image analysis tasks, such as image classification, object detection, or semantic segmentation. Even though deep neural nets or models are successful in these tasks, they have the disadvantage of being designed for only one specific task and of requiring large amounts of annotated training data. Each time a new deep model is to be created or trained for a specific task, appropriate annotated or labelled training data is required.
- Another example is an application that requires doing the same task for which the existing deep model has been trained but doing it on different data, that is, a different target dataset.
- Conventional approaches using deep learning techniques for image analysis involve explicitly specifying during training what is to be analysed. For example, if different people in a scene are to be detected, the training would involve marking each person in the scene separately so that the neural net can differentiate between the different persons. Each time a different analysis is required, a new neural net may need to be trained. While this brings with it the above-mentioned challenges of training multiple neural nets, it can also be problematic to have to use and switch between different neural nets.
- Conventionally, either the respective deep model is trained directly for the specific task with explicit annotation of all concepts, or transfer learning is used.
- Transfer learning involves using a model that is trained for one task or source distribution for a different task or distribution, to avoid the need for additional specifically labelled training data.
- Other approaches involve using unsupervised learning to pre-train neural nets and using semi-supervised learning when at least some of the training data are correctly labelled.
- Still another approach lies in using a multi-task objective function for one model.
- This objective is achieved by a method having the features of patent claim 1 and an apparatus having the features of patent claim 15.
- Advantageous embodiments with expedient developments of the invention are indicated in the dependent patent claims as well as in the following description and the drawings.
- A method according to the present invention is concerned with analysing image data.
- The image data to be analysed may also be referred to as input or target data.
- The method comprises providing one or more pre-trained artificial deep neural nets. These may be pre-trained for image analysis, although this is not always necessary.
- Each of the one or more pre-trained neural nets is then adapted for at least one specific image analysis task. This means that if only one pre-trained neural net is provided, it is adapted for one specific task. If, however, multiple pre-trained neural nets are provided, some or all of the respective specific tasks for which the multiple pre-trained neural nets are adapted may be different from one another.
- Adapting a neural net may also be referred to as tuning the neural net.
- The image data is processed by means of a respective forward pass through the one or more adapted neural nets, which have by then learned multiple logically related concepts.
- This processing of the image data generates multiple processing results corresponding to the multiple logically related concepts which the adapted neural nets have learned.
- At least two of these multiple processing results, corresponding to different ones of the multiple logically related concepts, are then selected.
- The at least two selected processing results are then combined in dependence on the logical relation between the concepts to which the selected processing results correspond. This combining generates an image analysis result.
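- The sequence of steps described above (forward pass, selection, combination) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `analyse` and `intersection`, the use of small binary masks as processing results, and the two stand-in "nets", which are plain functions returning fixed masks rather than adapted deep neural nets.

```python
from typing import Callable, Dict, List

Mask = List[List[int]]  # binary map: 1 where a concept is present


def intersection(a: Mask, b: Mask) -> Mask:
    """Element-wise AND of two masks of equal shape."""
    return [[x & y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def analyse(image, nets: Dict[str, Callable], selected: List[str],
            combine: Callable[[Mask, Mask], Mask]) -> Mask:
    # Forward pass: each (stand-in) adapted net yields one processing result.
    results = {concept: net(image) for concept, net in nets.items()}
    # Select two processing results that belong to different concepts.
    first, second = (results[name] for name in selected)
    # Combine them according to the logical relation between the concepts.
    return combine(first, second)


# Hypothetical stand-ins for adapted nets; they ignore the image entirely.
nets = {
    "person": lambda img: [[1, 1, 0], [0, 1, 0]],
    "glasses": lambda img: [[0, 1, 0], [0, 1, 1]],
}
result = analyse(None, nets, ["person", "glasses"], intersection)
```

Here the combination is an intersection, so `result` marks only positions where both concepts are present.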
- The present invention can in principle be used to analyse arbitrary image data. Therefore, the term image is to be interpreted broadly and can refer to different kinds of image data or images.
- An input, that is, an image to be analysed, might for example be an image captured by a surveillance camera or a camera that is part of an assistance system of a car.
- The input can also be an image or image data that has been pre-processed.
- The image data can for example be or comprise a crop of a larger image that has been subdivided.
- The image data may be or comprise a whole image or a crop of an image output by another algorithm that is not capable of separating closely spaced objects or that identifies larger areas compared to the sizes of actual objects present in the respective image data.
- In an example used in this description, the neural net is pre-trained for counting objects, in particular for counting pedestrians in images, and the method is used for detecting objects.
- This example can be generalised and details and terms of the example can be replaced by the broader terms as used in the claims.
- The logically related concepts may refer to different numbers or counts of objects such as the pedestrians. It is, however, to be understood that the present invention is by no means limited to this example, since the present invention can be used with differently trained neural nets and different concepts.
- A deep neural net is a neural net comprising multiple layers.
- The pre-trained neural net could be trained from scratch, starting with randomly initialised weights and/or other parameters. It could also be pre-trained by transfer learning starting from a baseline neural net trained for general image classification or analysis. While the pre-trained neural net is not yet trained or adapted for a specific use case or application as may be required in a respective productive environment, it has received at least some degree of training, preferably in terms of data or image analysis and/or regarding different concepts and/or logical relations. In the present description a trained, pre-trained, or adapted neural net may also be referred to as a model.
- Adapting or tuning the pre-trained neural net for one specific image analysis task allows or forces the adapted neural net to focus on the specific task. This can mean that the neural net is tuned to specifically react to or recognise features, properties, or concepts corresponding to the specific task.
- To adapt the pre-trained neural net, training data can be provided and, for example, processed using a backpropagation method. This may comprise executing multiple iterations of gradient descent to update weights and/or other parameters of the neural net in dependence on the specific image analysis task and/or at least one of the multiple logically related concepts.
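- As a rough illustration of adapting by iterated gradient descent, the following sketch tunes a copy of some "pre-trained" weights on a handful of task-specific samples. It is a deliberately tiny stand-in, assuming a single linear unit and a mean squared error loss; the function name `fine_tune`, the learning rate, and the sample data are all hypothetical and not taken from the patent.

```python
def fine_tune(weights, samples, lr=0.1, steps=200):
    """Adapt pre-trained weights of a single linear unit by gradient descent."""
    w = list(weights)  # work on a copy so the pre-trained weights stay intact
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, target in samples:
            # Prediction error of the current weights on one sample.
            error = sum(wi * xi for wi, xi in zip(w, x)) - target
            # Gradient of the mean squared error with respect to each weight.
            for i, xi in enumerate(x):
                grad[i] += 2.0 * error * xi / len(samples)
        # One gradient descent update.
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


pretrained = [0.5, 0.5]
# Task-specific data: the target depends only on the first input feature.
task_data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]
adapted = fine_tune(pretrained, task_data)
```

Starting from the pre-trained weights rather than from random values is what allows the small dataset to suffice in this toy setting; the adapted weights end up close to `[1.0, 0.0]`.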
- Since the pre-trained neural net is used as a starting point for the tuning, this can advantageously be achieved with significantly less training data than would be necessary for creating a completely new neural net trained for the specific image analysis task.
- The present invention can effectively, efficiently, and flexibly be used for various applications, even applications with a changing target dataset, with comparatively low effort, cost, and time investment. It can be especially efficient to create a new instance, that is, a copy of the pre-trained neural net, and to adapt this new instance or copy.
- The adapted neural nets may be adapted online and/or offline using synthetic and/or natural images.
- Offline training or tuning refers to using training data that differs from a target image or target dataset to be analysed. This means that the training data is another physical dataset separate from the target dataset.
- The training data and the target dataset may, however, comprise similar kinds of images or image data.
- Online training or tuning on the other hand refers to using the same image or target data later to be analysed for training and/or tuning purposes.
- Online and/or offline training or tuning can be used for every image or for a group of images. If a group of images is used, then all images of the group can be analysed after the tuning without further adaptation steps to reduce the overall time required for processing all images of the group. Online training can yield particularly accurate results since the adaptation does not need to rely on training data that differs from the target data, that is, the actual image data to be analysed. Offline adaptation can be advantageous since the training data and therefore the adaptation can be controlled and supervised, which lowers the chance of the neural net acquiring an unintended bias.
- Using synthetic images as training data can be advantageous since those can be easily mass-produced, managed, and specifically created or tailored to fit a respective use case and to avoid any unintentional biases.
- Using natural or real images can, on the other hand, have the advantage of better preparing the neural net for its intended application. It is possible to use online and/or offline tuning as well as synthetic and/or natural images as training data in various combinations.
- Processing the image data by means of a forward pass through the at least one adapted neural net means that the image data is provided as an input to a first or input layer of the adapted neural net, which then works on this input and provides a corresponding output or processing result.
- The forward pass through the adapted neural net therefore comprises a data flow from the input layer to an output layer, that is, from a lowest to a highest layer in a hierarchical layer structure of the adapted neural net.
- The processing result can be an output of the respective adapted neural net at its output layer.
- The processing result can, however, also be or comprise at least one representation or activation from other parts or layers of the respective neural net.
- The selected processing results are taken from different activations of the layers, in particular of filters or channels, of the one or more adapted neural nets.
- The processing results can, in other words, be or comprise different representations of the image data or outputs of intermediate layers or parts thereof.
- Combining the selected processing results can therefore comprise processing in a representation space, that is, processing data in or from an abstract space of representations of image data.
- The different layers of a deep neural net, in particular higher layers such as the layers in, for example, the highest third of the layer structure, can learn different concepts.
- The different layers can therefore offer a useful and flexible source of data for the processing results. It can be very efficient to use outputs or activations of different layers of one individual adapted neural net, since this can limit the number of adapted neural nets required for the image analysis. At least one of the selected processing results or at least a part thereof may, however, be taken from different adapted neural nets.
- This approach can advantageously offer a high degree of flexibility and customisability and can therefore improve the overall image analysis result.
- It is, for example, possible for the selected processing results to comprise an output or a classification from an output layer of a first adapted neural net, an activation or representation of a specific filter or channel of a second or third highest layer of the first adapted neural net, and another activation of a specific channel of an intermediate or output layer of a second adapted neural net.
- The present invention is therefore applicable and usable in a variety of different use cases.
- A concept in terms of the present invention may refer to a more abstract and/or meaningful part or content of an image as compared to a simple contextless pixel-level feature or property of the image. If, for example, an image depicts multiple persons, one concept may be 'all female persons', while another concept may be 'all persons wearing glasses'. In order to analyse image data, the learning of concepts can be leveraged. Processing image data by means of a deep neural net leads to different representations at different layers of the deep neural net. These representations learned at the different layers can be or correspond to the different learned concepts.
- The goal of representation learning is to learn representations that are disentangled causes of the image.
- One application of such disentangled representations is the generation of images by varying codes in a latent space, where different dimensions of the latent space represent different disentangled causes of the input image.
- One dimension might, for example, represent whether the person in the input image is male or female, while another dimension might represent whether or not the person in the image is wearing glasses.
- By varying the code along these two dimensions of the latent space, various combinations of male or female with or without glasses can be obtained.
- The use of the disentangled representation in this case is synthesis or generation.
- The method for image analysis in accordance with the present invention uses representation learning for learning and using different representations of or for the image data or parts thereof to be analysed, wherein these representations are disentangled. This means that through the different representations there exists a separation of the corresponding causes from a composition of all causes.
- A cause in this sense can be understood to be a feature, property, or characteristic of the image or image data that is responsible for, that is, is causing the corresponding representation.
- A cause can therefore be thought of as a root or an origin of a certain state, behaviour, and/or output of the deep neural net.
- The different disentangled representations or concepts are, however, not necessarily independent of each other. Instead, they have a logical relation to each other. This can, for example, mean that they may be described using set theory and/or partly overlap, such that there may be partial redundancy between them.
- The data analysis can be done using simple operations that may not have been as effective in the space of the original input image data, since the different representations or activations can highlight or be focused on different features of the image data, that is, causes of the respective representations or activations.
- An activation in this sense is an output of a certain filter or channel of the deep neural net.
- It is possible for the provided one or more pre-trained deep neural nets to have learned the logically related concepts. It is, however, also possible that only the one or more adapted neural nets have learned the logically related concepts. Learning the logically related concepts can therefore be part of or a result of the adaptation or tuning.
- Selecting the at least two of the multiple processing results can be done automatically.
- For this purpose, a predetermined criterion may be provided, such as a threshold value for a confidence rating. For example, only processing results having or corresponding to a confidence rating or confidence value higher than the predetermined threshold value may be selected. It can also be possible to provide a predetermined table or a family of characteristics which can then be used to automatically select processing results, for example in dependence on a respective use case, application, or a predetermined goal of the image analysis.
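- Automatic selection by a confidence threshold could look roughly like the sketch below. The data layout (a dictionary mapping concept names to a confidence value and some payload), the function name `select_results`, and the threshold of 0.8 are assumptions for illustration only.

```python
def select_results(results, threshold=0.8):
    """Keep only processing results whose confidence exceeds the threshold."""
    return {
        concept: result
        for concept, result in results.items()
        if result["confidence"] > threshold
    }


# Hypothetical processing results, one per learned concept.
processing_results = {
    "one_object": {"confidence": 0.93, "activation": [0.1, 0.9]},
    "two_objects": {"confidence": 0.88, "activation": [0.7, 0.8]},
    "three_objects": {"confidence": 0.41, "activation": [0.2, 0.1]},
}
selected = select_results(processing_results)
```

A predetermined table keyed by use case could replace the single threshold without changing the surrounding logic.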
- Combining the selected processing results means that data from multiple sources are combined, analysed, or processed together. This can comprise adding the processing results or parts thereof to one another or subtracting them from one another.
- The image analysis result can for example comprise a classification of the image data or parts thereof, an indication of a detected object, a segmentation, etc.
- The image analysis result may also be a complemented or modified version of the analysed image data, which can for example comprise bounding boxes around detected objects.
- The selected processing results are selected such that the corresponding different logically related concepts have a nonempty intersection.
- The selected processing results do, in other words, have an overlap or a redundancy. They can have a partial overlap or redundancy, or a first concept may include or contain a second concept, which is then a subset of the first concept.
- By combining intersecting or overlapping concepts, additional or more detailed image analysis results can be obtained. This approach also avoids the need, and therefore the time and effort required, for specifically training a neural net for the overlap, since the data corresponding to the overlap can be obtained by combining the different processing results.
- A first concept may be 'animal' and a second concept may be 'having two legs'.
- A first processing result corresponding to the first concept may be a detection or indication of all animals shown in a picture.
- A second processing result corresponding to the second concept may be a detection or indication of all two-legged objects or beings.
- The nonempty intersection, that is, the overlap, would therefore consist of all two-legged animals, but would for example include neither four-legged animals nor humans.
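- The 'animal' and 'having two legs' example maps directly onto a set intersection. In the sketch below, the labels are invented stand-ins for detections; in practice each processing result would come from an adapted neural net rather than a hand-written set.

```python
# Hypothetical detections for the concept 'animal'.
animals = {"dog", "ostrich", "cat", "chicken"}
# Hypothetical detections for the concept 'having two legs'.
two_legged = {"ostrich", "chicken", "human"}

# The overlap of both concepts: two-legged animals.
two_legged_animals = animals & two_legged
```

No net was trained for 'two-legged animal'; the overlap is obtained purely by combining the two existing results.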
- The selected processing results are combined by using logical operators corresponding to the logical relation between the concepts to which the selected processing results correspond.
- The logical relation between the concepts is, in other words, mapped to the logical operators used for combining, analysing, and/or further processing the selected processing results.
- This can advantageously enable a consistent and automated yet flexible image analysis. This is especially advantageous and useful for automation since the logical operators to be used do not have to be manually specified in advance.
- Logical operators in this sense may for example refer to Boolean operators such as conjunction, disjunction, negation, and comparisons, as well as operators of set theory such as union or intersection.
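- One way to picture the mapping from a logical relation to an operator is a lookup table applied element by element to binary maps. This is only a sketch: the relation names, the `combine` function, and the representation of processing results as 0/1 masks are assumptions, not the claimed implementation.

```python
# Hypothetical table mapping a named logical relation to a Boolean operator.
OPERATORS = {
    "conjunction": lambda x, y: x and y,      # present in both results
    "disjunction": lambda x, y: x or y,       # present in at least one result
    "difference": lambda x, y: x and not y,   # in the first but not the second
}


def combine(a, b, relation):
    """Combine two binary masks with the operator chosen by the relation."""
    op = OPERATORS[relation]
    return [
        [int(bool(op(x, y))) for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


mask_a = [[1, 1, 0]]
mask_b = [[0, 1, 1]]
both = combine(mask_a, mask_b, "conjunction")
either = combine(mask_a, mask_b, "disjunction")
only_a = combine(mask_a, mask_b, "difference")
```

Because the operator is looked up from the relation, it does not have to be hard-coded for each pair of concepts.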
- The processing results may be selected based on or in dependence on a control signal and/or a predetermined configuration or setting of a corresponding apparatus used to carry out the method.
- The higher layers in this sense may for example be fully connected layers and/or layers of the highest third of a layer structure of a respective deep neural net.
- The different concepts can be learned by the same model or different models on or from one and the same image or multiple images or multiple image parts of the same or different images.
- At least one of the selected processing results is processed using an image processing algorithm before or after combining the at least two selected processing results.
- The image processing techniques may be used on the processing results and/or a combination of the processing results, that is, a result of combining the selected processing results.
- The image processing techniques, that is, the image processing algorithm, may be used on only a part of the processing results and/or the combination.
- The image processing algorithm may in particular be a low-level image processing algorithm.
- A low-level image processing algorithm, as well as low-level image processing in terms of the present invention, refers to algorithms and methods that are not part of the machine learning domain and/or that are not concerned with the interpretation or classification of a scene or an image as a whole. Rather, these low-level techniques may for example include pixel-based operations, finding corresponding points, edge detection, and the like.
- Using the image processing algorithm as well as combining the processing results to generate the overall image analysis result can advantageously yield an image analysis result that is easier to understand, interpret, or process for a respective user and/or other applications or systems which use the image analysis result as an input.
- Processing the at least one selected processing result using the image processing algorithm comprises pseudo-colouring, and/or highlighting regions based on intensity, and/or thresholding, and/or contour detection, and/or generating a bounding box, in particular a bounding box surrounding a detected contour or object.
- These image processing tasks can be automatically carried out using low-level image processing techniques and thus do not require the use of a neural net.
- Pseudo-colouring can be used to generate or create colour variations based on sine wave generation for different colour channels.
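- A possible reading of sine-based pseudo-colouring is sketched below: each colour channel is a sine wave over the intensity with its own phase offset. The phase values, the single-period frequency, and the function name `pseudo_colour` are assumptions; the description does not specify them.

```python
import math


def pseudo_colour(intensity, phases=(0.0, 2.0, 4.0)):
    """Map an intensity in [0, 1] to an (R, G, B) triple with values 0..255.

    Each channel is a sine wave over the intensity, shifted by its own phase,
    so that neighbouring intensities receive visibly different colours.
    """
    return tuple(
        round(127.5 * (1.0 + math.sin(2.0 * math.pi * intensity + phase)))
        for phase in phases
    )


# Colour a small activation map pixel by pixel.
activation_map = [[0.0, 0.25], [0.5, 1.0]]
coloured = [[pseudo_colour(v) for v in row] for row in activation_map]
```

The result is a map of the same shape whose entries are RGB triples, suitable for display or overlay.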
- Thresholding can be used to discard parts of the processing results or their combination having one or more values, such as intensity, below a predetermined threshold value.
- Generating a bounding box can advantageously provide a reference area or region assigned to or associated with a detected contour, a detected object, a specific feature, or the like.
- The bounding box can therefore be used to indicate a specific part of the processing result, the combination, and/or the original input image data to a user and/or another program or system. Since the bounding box can be a geometric primitive form, for example a rectangle, this can be done with less processing effort than using a complex detailed outline or shape or segmentation and can therefore result in fast and efficient processing, especially in time-sensitive applications.
- The bounding box may be added to the at least one processing result, the combination of the selected processing results, and/or to the input image data.
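- Thresholding followed by bounding box generation can be sketched in a few lines. The heatmap values, the threshold of 0.5, and the helper names `threshold` and `bounding_box` are illustrative assumptions; the box is returned as (top row, left column, bottom row, right column).

```python
def threshold(heatmap, limit):
    """Discard (zero out) all values below the given threshold."""
    return [[v if v >= limit else 0.0 for v in row] for row in heatmap]


def bounding_box(heatmap):
    """Smallest rectangle enclosing all non-zero values, or None if empty."""
    coords = [
        (r, c)
        for r, row in enumerate(heatmap)
        for c, v in enumerate(row)
        if v > 0
    ]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


# Hypothetical processing result, for example a combined activation map.
heatmap = [
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 0.2, 0.0],
]
box = bounding_box(threshold(heatmap, 0.5))
```

Without the thresholding step, the weak response in the top-left corner would enlarge the box to cover almost the whole map.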
- At least a part of the image data is processed by means of the respective forward pass and a respective subsequent backward pass through the one or more adapted neural nets to generate the multiple processing results.
- The multiple processing results therefore comprise an output or a result of the at least one backward pass.
- The backward pass thus constitutes an additional processing step to generate at least one of the multiple processing results.
- A backward pass through a neural net refers to processing data using the neural net in the opposite direction to the direction of data processing or data flow used in the forward pass.
- The backward pass therefore comprises a data flow from a higher layer to a lower layer of the respective neural net.
- The output of the backward pass can represent or resemble an image wherein only features or causes of the input image data corresponding to one of the logically related concepts and/or corresponding to the respective image analysis task for which the respective neural net has been adapted are visible, highlighted, or emphasised. This is the case since other features not contributing to an intermediary processing result used as input for the backward pass are not reconstructed during the backward pass.
- Using the additional processing step of the backward pass can be advantageous since the processing result obtained from the backward pass may be less abstract than a processing result taken directly from the forward pass and may therefore be easier to further process using conventional image processing techniques. Employing the backward pass may therefore lead to an improved overall image analysis result.
- The backward pass can provide a respective inverse representation, which may reveal or highlight features, properties, or aspects of the image data more clearly and/or reveal additional features, properties, or aspects.
- Multiple inverse representations could be obtained from different filters, channels, or layers of the same adapted neural net or from multiple adapted neural nets, wherein the respective filters, channels, layers, and/or adapted neural nets correspond to different logically related concepts.
- A space of these inverse representations may be referred to as inverse representation space or the inverse domain.
- Processing data from the inverse representation space may be referred to as inverse domain processing.
- Since the inverse representations can be images with each image containing only features corresponding to one concept, inverse domain processing advantageously allows for accurate processing of image data on a level of concepts, that is, correctly and accurately with regard to actual content or meaning of the image data as it could be understood by a human user. It can for example be possible to remove parts or features from an image, or add parts or features to an image selectively, wherein each part or feature corresponds to a specific concept, such as a specific type or number of objects.
- The present invention may therefore advantageously allow content-based analysis and editing of image data without necessarily relying on techniques such as segmentation and with improved accuracy and flexibility, since features can be handled based on their belonging to a concept without necessarily being related on a pixel level.
- The features may, for example, belong to the same concept or a single manifestation thereof, but may at the same time not belong to a continuous or connected area, which would make using only segmentation techniques unreliable.
- The selected processing results may comprise representations or activations, that is, data from a representation space, and/or inverse representations, that is, data from the inverse domain.
- The method for image analysis according to the present invention therefore proposes using processing in the representation space and/or in the inverse representation space as a means for obtaining improved image analysis results, in particular as compared to conventional methods operating only in the space of the original input image data.
- A transpose of weights of the one or more adapted neural nets is used for processing the respective image data by means of the backward pass.
- An intermediary processing result after the respective forward pass is processed again using the transpose of the weights or weight matrix used for the forward pass. If the neural net has been correctly trained, taking the inverse of this intermediary processing result, that is, processing it by means of the backward pass, reveals the corresponding concepts present in the input image data.
- Using the transpose of the weights is an advantageous method for obtaining the inverse representations since the transpose can be obtained or calculated with minimal investment of time and processing effort.
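- The role of the transposed weights can be shown on a single fully connected layer. The sketch below is a simplified assumption: one layer with a hand-picked weight matrix `W`, a rectifier as non-linearity, and a backward pass that projects one chosen activation back to input space with the transpose of `W`. A real adapted net would have many layers and learned weights.

```python
def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def relu(vector):
    return [max(0.0, v) for v in vector]


# Hypothetical layer weights: unit 0 reacts to input 0, unit 1 to inputs 2 and 3.
W = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
x = [0.5, 0.2, 0.8, 0.6]

# Forward pass: intermediary processing result of the layer.
activation = relu(matvec(W, x))

# Keep only the activation of unit 1, i.e. one learned concept.
selected = [0.0, activation[1]]

# Backward pass: project the selected activation back with the transposed weights.
inverse = matvec(transpose(W), selected)
```

In `inverse`, only the input positions that contributed to unit 1 are non-zero, which mirrors the statement that features not contributing to the chosen intermediary result are not reconstructed.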
- One or more deep convolutional neural nets, and/or deep feedforward neural nets, and/or deep recurrent neural nets are used as the one or more deep neural nets.
- The deep neural nets can therefore have features or characteristics of one of these types of neural nets or features or characteristics of a combination of some or all of these types of neural nets. This allows for adapting the properties and behaviour of the neural net to different applications.
- CNN: convolutional neural net
- This problem is avoided with the present invention by using a pre-trained neural net as a starting point and obtaining the one or more adapted neural nets by tuning the pre-trained neural net for one specific image analysis task.
- The pre-trained CNN can be trained using synthetic image data.
- The adaptation or tuning of this pre-trained neural net requires significantly less training data, which can also be annotated with significantly less effort.
- A backward pass through the CNN may also be referred to as a deconvolution pass or simply a deconvolution.
- A deconvolution pass may therefore involve starting from one of the learned filters at a layer of the CNN and doing the reverse data processing steps of successive unpooling, rectification, and filtering to reconstruct the activity in the layer beneath that gave rise to the chosen activation of the learned filter where the deconvolution pass started. Accordingly, the output of the deconvolution pass may be referred to as deconvolved output.
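- The three reverse steps (unpooling, rectification, filtering) can be traced on a one-dimensional toy example. This is a simplified sketch under several assumptions: a single filter, max pooling with recorded positions ("switches"), and filtering in the backward direction done as a transposed convolution with the same kernel. All function names and data are hypothetical.

```python
def conv1d(signal, kernel):
    """Valid 1D correlation of a signal with a kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool(values, size=2):
    """Max pooling that also records where each maximum came from."""
    pooled, switches = [], []
    for start in range(0, len(values) - size + 1, size):
        window = values[start:start + size]
        best = max(range(size), key=lambda j: window[j])
        pooled.append(window[best])
        switches.append(start + best)
    return pooled, switches


def unpool(pooled, switches, length):
    """Place each pooled value back at its recorded position."""
    out = [0.0] * length
    for value, position in zip(pooled, switches):
        out[position] = value
    return out


def conv1d_transpose(values, kernel, length):
    """Filtering step of the backward pass (transposed convolution)."""
    out = [0.0] * length
    for i, value in enumerate(values):
        for j, weight in enumerate(kernel):
            out[i + j] += value * weight
    return out


signal = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
kernel = [1.0, 1.0]

# Forward pass: filtering, rectification, pooling.
features = [max(0.0, v) for v in conv1d(signal, kernel)]
pooled, switches = max_pool(features)

# Deconvolution pass: unpooling, rectification, filtering.
unpooled = [max(0.0, v) for v in unpool(pooled, switches, len(features))]
reconstruction = conv1d_transpose(unpooled, kernel, len(signal))
```

The reconstruction is non-zero only around the pattern that caused the filter to respond and stays zero elsewhere.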
- Feedforward neural nets have the advantage of being very robust, meaning that their performance degrades gracefully in the presence of increasing amounts of noise.
- Using a recurrent neural net can be especially advantageous for analysing data with a temporal structure, such as for example a time series of multiple images or a video feed.
- The one or more pre-trained neural nets are pre-trained for counting objects in images.
- The neural net is, in other words, pre-trained to classify images according to a number of objects depicted therein.
- Each class of a classifier or output layer of the pre-trained neural net may therefore represent or correspond to a different count of objects.
- The neural net might be capable of classifying images with anywhere from 0 to 15 or 1 to 16 objects according to the respective number of objects depicted in each image.
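- Counting as classification means that the index of the winning output class is read as a count. The short sketch below assumes 16 output classes and invented scores; `first_count` is a hypothetical parameter that switches between the 0 to 15 and the 1 to 16 labelling mentioned above.

```python
def predicted_count(class_scores, first_count=0):
    """Interpret the index of the highest-scoring class as an object count."""
    best_class = max(range(len(class_scores)), key=lambda i: class_scores[i])
    return first_count + best_class


# Hypothetical output layer scores for 16 count classes.
scores = [0.01] * 16
scores[3] = 0.85

count_from_zero = predicted_count(scores)                 # classes mean 0..15
count_from_one = predicted_count(scores, first_count=1)   # classes mean 1..16
```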
- counting objects does not necessarily include detecting or specifying individual locations or outlines of each object. Training the neural net for counting objects therefore requires significantly less detailed annotations in the training data than, for example, training the neural net to detect individual objects.
- the neural net can be pre-trained for counting at least one specific type of object, such as for example pedestrians or cars. Since the different concepts may refer to different kinds of objects, one adapted neural net or one layer thereof may be adapted to count a first number of objects, while a second layer or a second adapted neural net may be adapted for counting a different second number of objects. Because the larger number or count of objects includes the smaller number or count, these concepts are logically related: the lower number or count is a subset of the larger count or number. One processing result can therefore correspond to the first number of objects, whereas a second processing result may correspond to the second number of objects.
- Counting a specific number of objects means focusing on this number of objects or features corresponding to this number of objects and disregarding features belonging or corresponding to other objects that might also be depicted in the same image. If there are at least the specified number of objects present in the image, the neural net adapted to count this specific number of objects will provide a classification of the image as containing this specific number of objects, meaning that it has counted exactly that many of the objects depicted in the image. It can be especially advantageous to have at least one layer or adapted neural net adapted to count exactly one object. This allows for using a neural net trained for counting objects to detect objects, in particular one individual object.
- Using a neural net pre-trained for counting objects therefore allows for a detailed image analysis with significantly less training effort than would be required for creating a neural net capable of detecting individual objects or different numbers of objects even in images with a larger number of objects.
- the described method of removing the detected object can also be used in an iterative process to sequentially count and detect all objects shown in the respective image.
- This approach can yield improved accuracy especially for images with closely spaced and/or partly occluded objects.
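The iterative detect-and-remove idea can be sketched as a loop that stops once no further object is found. The sketch below is an illustration under stated assumptions, not the described method itself: `detect_one` is a hypothetical stand-in for the neural-net-based single-object detection (it simply picks the connected above-threshold region around the brightest pixel), and images are plain nested lists of intensities.

```python
def detect_one(image, threshold):
    """Hypothetical stand-in detector.

    Returns the inclusive bounding box (r0, c0, r1, c1) of the connected
    above-threshold region containing the brightest pixel, or None.
    """
    rows, cols = len(image), len(image[0])
    value, pr, pc = max((image[r][c], r, c) for r in range(rows) for c in range(cols))
    if value <= threshold:
        return None
    stack, seen = [(pr, pc)], {(pr, pc)}
    while stack:  # flood fill over 4-connected neighbours
        r, c = stack.pop()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in seen and image[nr][nc] > threshold:
                seen.add((nr, nc))
                stack.append((nr, nc))
    rs = [r for r, _ in seen]
    cs = [c for _, c in seen]
    return (min(rs), min(cs), max(rs), max(cs))


def detect_all(image, threshold=0.5, fill=0.0, max_iterations=100):
    """Detect one object, remove it, and repeat until nothing is left."""
    work = [row[:] for row in image]  # do not modify the caller's image
    boxes = []
    for _ in range(max_iterations):
        box = detect_one(work, threshold)
        if box is None:  # exit condition: no more objects can be detected
            break
        boxes.append(box)
        r0, c0, r1, c1 = box
        for r in range(r0, r1 + 1):  # remove the detected object
            for c in range(c0, c1 + 1):
                work[r][c] = fill
    return boxes


image = [
    [0.0, 0.9, 0.0, 0.0, 0.0],
    [0.0, 0.8, 0.0, 0.7, 0.7],
    [0.0, 0.0, 0.0, 0.0, 0.6],
]
boxes = detect_all(image)
print(len(boxes), boxes)
```

The number of returned boxes is the object count, so counting and detection fall out of the same loop. The `max_iterations` guard is an added safety measure for the sketch.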
- multiple differently adapted neural nets are created by adapting the one pre-trained neural net for different concepts, in particular by adapting the one pre-trained neural net to count different numbers of objects.
- the differently adapted neural nets are then used to generate the multiple processing results. This means that only one pre-trained neural net is provided.
- From this one pre-trained neural net, the multiple adapted neural nets can be created or generated as copies or new instances of the one pre-trained neural net which are then adapted using different or differently labelled or annotated training data.
- This approach can be very efficient, requiring a minimal amount of time, effort, and storage space for the one pre-trained neural net used as a starting point for the adapted neural nets.
- At least one object is detected in the at least two processing results and/or the combination of the at least two processing results by means of an image processing algorithm to generate the image analysis result.
- This image processing algorithm can be the above-mentioned low-level image processing algorithm. Object detection is a common problem in various applications.
- detecting objects is an important task. Detecting the at least one object in the processing results or their combination can yield especially accurate and reliable results since the processing results or the combination can be focused on the object to be detected such that its features are relatively emphasised compared to the rest of the image data.
- the image analysis result can be a final result obtained after combining the selected processing results and processing them and/or their combination by means of the image processing algorithm.
- the at least one object is detected using at least one predetermined optimisation criterion.
- at least one predetermined constraint may be used. This can comprise at least one predetermined constraint for a boundary smoothness, and/or for an area of the processing results and/or the combination of the processing results.
- a constraint in this sense may comprise one or more predetermined threshold values, such that for example a boundary or contour is interpreted to indicate an object if the corresponding boundary smoothness and/or area surrounded by the boundary or contour is greater or smaller than the predetermined threshold value and/or lies between two predetermined threshold values.
- Different threshold values and/or constraints may be provided and used for detecting different objects and/or different kinds of objects.
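A threshold-based area constraint of this kind might look as follows. This is a sketch only: the function names, the object kinds, and all numeric limits in `AREA_LIMITS` are hypothetical values chosen for illustration, not values from the description.

```python
def satisfies_area_constraint(area, min_area=None, max_area=None):
    """Check an enclosed area against optional lower and upper thresholds."""
    if min_area is not None and area < min_area:
        return False
    if max_area is not None and area > max_area:
        return False
    return True


# Hypothetical per-kind thresholds (in pixels); different kinds of objects
# can use different constraints.
AREA_LIMITS = {
    "pedestrian": (200, 5000),
    "car": (800, 20000),
}


def is_object(area, kind):
    """Interpret a contour as an object of `kind` if its area lies in range."""
    min_area, max_area = AREA_LIMITS[kind]
    return satisfies_area_constraint(area, min_area, max_area)


print(is_object(1000, "pedestrian"))  # within both thresholds
print(is_object(100, "pedestrian"))   # below the lower threshold
```

A boundary-smoothness constraint could be checked in the same way, with its own pair of thresholds.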
- Using at least one such predetermined optimisation criterion enables fast and reliable object detection, in particular since features or causes corresponding to the object to be detected are highlighted or relatively emphasised by processing the image data using the adapted neural net prior to using the image processing algorithm.
- the at least one object is detected by treating pixels of the processing result and/or the combination of the processing results as a Markov random field and using a predetermined constraint on a gradient of intensities.
- a Markov random field can be used to detect and segment objects in an image simultaneously and can therefore yield fast and accurate object detection and segmentation, as might be required in time-sensitive and/or safety related applications.
- a Markov random field model or approach can also advantageously be used to detect moving objects with improved accuracy.
- one of the objects can be selected according to or in dependence of a predetermined object selection criterion. Multiple objects might for example be detected due to the tuning of the pre-trained neural net, and/or due to parameters of the low-level image processing or thresholding.
- Providing a predetermined object selection criterion for selecting one object advantageously allows for reliably marking exactly one object.
- the criterion advantageously allows for improved flexibility and customisation of the image analysis. Selecting one object can advantageously enhance the capability of the present invention to be used for object detection since it strengthens the focus on detecting only a single object at a time.
- the object selection criterion may for example be a size of a respective area corresponding to each of the multiple detected objects.
- the area can for example be enclosed by a bounding box or a detected contour or boundary.
- the object corresponding to the bounding box having the largest area can be automatically selected. This can be advantageous based on the assumption that the majority of specific regions - for example high-intensity regions - of the image or processing result belong to a single object.
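Selecting the object with the largest bounding box can be sketched in a few lines. The box format `(x0, y0, x1, y1)` with exclusive upper bounds and the function names are assumptions made for this illustration.

```python
def box_area(box):
    """Area of a box given as (x0, y0, x1, y1), upper bounds exclusive."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def select_largest(boxes):
    """Return the box with the largest area, or None if there are no boxes."""
    return max(boxes, key=box_area) if boxes else None


candidates = [(0, 0, 10, 10), (5, 5, 40, 60), (2, 2, 6, 6)]
print(select_largest(candidates))
```

Any other predetermined selection criterion could be swapped in by passing a different `key` function.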
- the apparatus comprises one or more artificial adapted deep neural nets, and a separate image processing unit.
- the apparatus is configured to process the image data by means of a respective forward pass through the one or more adapted neural nets to generate multiple processing results.
- the one or more adapted neural nets are adapted for a specific image analysis task from at least one pre-trained neural net and have learned multiple logically related concepts.
- the multiple generated processing results correspond to the multiple logically related concepts.
- the apparatus is further configured to select at least two of the multiple processing results corresponding to different ones of the multiple logically related concepts.
- the apparatus is further configured to provide the at least two selected processing results as input to the image processing unit.
- the image processing unit is configured to combine the at least two selected processing results in dependence of the logical relation between the concepts to which the selected processing results correspond to generate an image analysis result.
- the apparatus may also comprise the at least one pre-trained neural net and/or be configured to adapt the one or more adapted neural nets from the at least one pre-trained neural net.
- the apparatus may comprise a processing unit (CPU), a memory device, and an I/O-system.
- the apparatus according to the present invention may be configured to carry out or conduct at least one embodiment of a method according to the present invention.
- the apparatus may comprise a memory device or data store containing program code representing or encoding the steps of this method.
- the memory device or data storage containing this program code may on its own also be one aspect of the present invention.
- the respective embodiments of the method according to the present invention as well as their respective advantages may be applied to the apparatus, the memory device or data store, and/or the program code contained therein according to the present invention as applicable and vice versa.
- the apparatus may further be configured to iteratively use the image analysis result as an input for the one or more adapted neural nets to further analyse the image analysis result of a respective previous iteration. At least some steps or parts of the method may therefore be executed multiple times iteratively until a predetermined exit condition is met. If, for example, an objective of the image analysis is to detect objects, the exit condition might be met if no more objects can be detected. In this case, the iterative process may automatically stop. Further advantages, features, and details of the present invention derive from the following description of preferred embodiments of the present invention as well as from the drawings pertaining to the present invention.
- FIG 1 schematically depicts a flow diagram illustrating a method for analysing image data using deep neural nets
- FIG 2 depicts a schematic illustrating a structure of a deep convolutional neural net which can be used to analyse images
- FIG 3 schematically depicts a first logical relation between two concepts
- FIG 4 schematically depicts a second logical relation between two concepts
- FIG 5 schematically depicts a third logical relation between two concepts
- FIG 6 depicts a first schematic illustrating a first method of combining and processing two processing results
- FIG 7 schematically depicts multiple processing steps of an image analysis using the method shown in FIG 6
- FIG 8 schematically depicts multiple processing steps of an image analysis using a slight variation of the method shown in FIG 6
- FIG 9 depicts a second schematic illustrating a second method of combining and processing two processing results
- FIG 10 schematically depicts multiple processing steps of an image analysis using the method shown in FIG 9
- FIG 11 depicts a third schematic illustrating a third method of combining and processing two processing results
- FIG 12 schematically depicts multiple processing steps of an image analysis using the method shown in FIG 11
- FIG 1 schematically depicts a flow diagram 1 illustrating a method for analysing image data using deep neural nets. Below, the steps and parts of the flow diagram 1 are described with reference to FIGs 1 to 5.
- An input 2 comprising at least one image to be analysed is provided or fed to adapted neural nets 3.
- adapted neural nets 3 are each adapted from at least one pre-trained neural net for a specific image analysis task or a specific class.
- Each of the adapted neural nets 3 is adapted for a different specific image analysis task and/or has learned one or more concepts.
- These concepts which the adapted neural nets 3 have learned are logically related to one another. In particular, these concepts are partly redundant, meaning that each of the concepts at least partly overlaps with at least one other concept. While only two adapted neural nets 3 are shown, more adapted neural nets 3 may be used.
- FIG 2 schematically depicts a layer structure 15 of a deep convolutional neural net (CNN), such as the adapted neural nets 3.
- the input 2 is processed by means of a respective forward pass through the adapted neural nets 3.
- the input 2 is received at an input data layer 16.
- the input data layer 16 is followed by five convolutional layers 17 which in turn are followed by three fully connected layers 18.
- the different shapes and sizes of the layers 16, 17, 18 schematically indicate different corresponding dimensions, that is, numbers of neurons and filters.
- the input data layer 16 may have a size of 227 by 227 neurons with a kernel size of 11 by 11.
- the first convolutional layer 20 may have a size of 55 by 55 neurons with a thickness of 96 which indicates the number of filters in the direction of a data flow as indicated by arrow 19.
- the kernel size of the first convolutional layer 20 may for example be 5 by 5.
- the second convolutional layer 21 may have a size of 27 by 27 neurons with 256 filters.
- the kernel size for the second convolutional layer 21, the third convolutional layer 22 and the fourth convolutional layer 23 may all be the same at 3 by 3.
- the third convolutional layer 22 and the fourth convolutional layer 23 may have the same dimensions at for example 13 by 13 neurons with 384 filters each.
- the fifth convolutional layer 24 may have the same size at 13 by 13 neurons but only 256 filters.
- the first fully connected layer 25 and the second fully connected layer 26 may have 1024 filters each.
- a CNN having the layer structure 15 may for example be trained to count 0 to 15 or 1 to 16 pedestrians depicted in respective images.
- in the adapted neural nets 3, rectified linear units (ReLUs) may be used as activation functions, while pooling and local response normalisation layers can be present after the convolutional layers 17. Dropout can be used to reduce overfitting.
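The layer sizes given above (227, 55, 27, 13) are consistent with the standard output-size formula for convolution and pooling, floor((n + 2p - k) / s) + 1. The following sketch checks this arithmetic. The strides and the 3 by 3 pooling windows are assumptions in the style of AlexNet-like nets; the description itself only states the layer and kernel sizes.

```python
def output_size(input_size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (input_size + 2 * padding - kernel) // stride + 1


# Assumed: 11x11 kernel applied with stride 4 to the 227x227 input.
conv1 = output_size(227, kernel=11, stride=4)
# Assumed: 3x3 max pooling with stride 2 after the first convolution.
pool1 = output_size(conv1, kernel=3, stride=2)
# Assumed: 5x5 convolution with padding 2 keeps the size, then pooling again.
conv2 = output_size(pool1, kernel=5, stride=1, padding=2)
pool2 = output_size(conv2, kernel=3, stride=2)

print(conv1, pool1, conv2, pool2)
```

Under these assumptions the sizes come out as 55, 27, 27 and 13, which matches the 55 by 55, 27 by 27 and 13 by 13 layers listed above.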
- an output, that is, an intermediate processing result, of the forward pass of the input 2 through the first adapted neural net 4 is processed by means of a backward pass 6 through the first adapted neural net 4.
- an output, that is, a second intermediate processing result, of the forward pass of the input 2 through the second adapted neural net 5 is further processed by means of a backward pass 7 through the second adapted neural net 5.
- the first adapted neural net 4 may, for example, be adapted to count exactly one pedestrian.
- the second adapted neural net 5 may, for example, be adapted to count exactly two pedestrians.
- training data used for tuning the first adapted neural net 4 can therefore be automatically labelled with a count of one, while training data used for tuning the second adapted neural net 5 can be automatically labelled with a count of two.
- the respective forward passes through the adapted neural nets 3 result in different representations or activations 9 at the different layers and channels of the adapted neural nets 3.
- the backward pass 6 and the backward pass 7 result in respective inverse representations, that is, respective images in which features causing the representations or activations 9 are emphasised.
- a selection stage 8 selects at least two of the representations or activations 9 and the processing results generated by the backward pass 6 and the backward pass 7, wherein the at least two selected processing results and/or intermediate processing results correspond to different logically related concepts.
- the selection or selection process can be controlled by a control signal 10 or a corresponding configuration provided to the selection stage 8.
- the control signal 10 can be based on a respective analysis task to be achieved for the input 2, meaning that the selection can be dependent on a respective use case or application of the described method for image analysis.
- FIG 3 schematically depicts a first logical relation between two concepts.
- a first example 28 of a first concept is in its entirety a subset of a first example 29 of a second concept.
- the first example 28 of the first concept is therefore equal to the overlap between the two concepts. If, for example, the first concept is or describes exactly one pedestrian and the second concept is or describes exactly two pedestrians, then an intersection between the first example 29 and the first example 28 is the one pedestrian described by the first example 28.
- FIG 4 schematically depicts a second logical relation between two concepts.
- a second example 30 of a first concept partly overlaps with a second example 31 of a second concept. Even though there is only a partial overlap, both concepts have or share a nonempty intersection 32. If, for example, the first concept represents females and the second concept represents persons with glasses, then the intersection 32 represents females with glasses.
- FIG 5 schematically depicts a third logical relation between two concepts.
- a third example 33 of a first concept is completely separate or disjunct from a third example 34 of a second concept.
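The three logical relations of FIGs 3 to 5 map directly onto set operations. The sketch below models each concept as a set of the instances it covers; the function name `relation` and the instance labels are hypothetical and only illustrate the idea.

```python
def relation(first, second):
    """Classify the logical relation between two concepts given as sets."""
    first, second = set(first), set(second)
    if first <= second or second <= first:
        return "subset"           # FIG 3: one concept lies entirely in the other
    if first & second:
        return "partial overlap"  # FIG 4: nonempty intersection
    return "disjoint"             # FIG 5: completely separate concepts


# FIG 3: 'exactly one pedestrian' is contained in 'exactly two pedestrians'.
one_pedestrian = {"pedestrian_a"}
two_pedestrians = {"pedestrian_a", "pedestrian_b"}

# FIG 4: females and persons with glasses share 'females with glasses'.
females = {"anna", "bea"}
with_glasses = {"bea", "carl"}

print(relation(one_pedestrian, two_pedestrians), one_pedestrian & two_pedestrians)
print(relation(females, with_glasses), females & with_glasses)
print(relation({"cat"}, {"car"}))
```

Which set operation is used to combine two processing results can then depend on the relation found, for example a set difference when one concept is a subset of the other.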
- the selection of the adapted neural nets 3, the intermediate processing results, that is, the representations or activations 9, and/or the processing results of the backward pass 6 and/or the backward pass 7 are then provided to a post-processing stage 11.
- the selection may be provided in the form of multiple data streams, wherein each data stream corresponds to one selected processing result, output, representation or activation 9.
- the post-processing stage 11 may comprise different parts or processing steps such as a combination step 12 and a low-level image processing step 13.
- the combination step 12 combines the multiple data streams, meaning that information or data from different sources, such as different filters or channels of the adapted neural nets 3, is analysed and processed together.
- the combination step 12 also can take the logical relations between the multiple data streams or the concepts to which the multiple data streams correspond into account.
- low-level image processing can be used to turn an output of the combination step 12 into a usable output such as a regional segmentation and/or an image complemented with one or more bounding boxes.
- This low-level image processing presently takes place in the low-level image processing step 13.
- the low-level image processing step 13 can be carried out by an image processing unit using an image processing algorithm.
- the post-processing stage 11 provides as an output a final image analysis result 14.
- the post-processing stage 11 can have different internal structures corresponding to different sequences and/or combinations of the combination step 12 and the low-level image processing step 13 or parts thereof.
- three examples for different structures of the post-processing stage 11 will be described referring to FIGs 6 to 12.
- the examples will be described using two data streams provided by the selection stage 8.
- the examples and the underlying method can, however, easily be extended to or adapted for more than two data streams.
- FIG 6 schematically illustrates a first structure 35 of the post-processing stage 11.
- a first data stream 36 from the representation space or the inverse representation space is processed using low-level image processing steps 38, which can be thought of as an instance or an example of the low-level image processing step 13.
- the low-level image processing steps 38 yield a first partial image analysis result 39.
- the first partial image analysis result 39 or a part thereof is provided as input to a combination stage 40.
- a second data stream 37 from the representation space or the inverse representation space is also provided to the combination stage 40 as a second input.
- the combination stage 40 combines the first partial image analysis result 39 and the second data stream 37.
- the resulting combination, that is, an output of the combination stage 40, then undergoes further low-level image processing steps 41, resulting in a second partial image analysis result 42.
- the partial image analysis results 39, 42 can together form the final image analysis result 14 (see FIG 1).
- the combination stage 40 can operate on data from the representation space and/or the inverse representation space as well as on data provided by a low-level image processing algorithm. The same is true for other variations of combination stages.
- FIG 7 schematically depicts a first overview 43 of multiple processing steps of an image analysis, wherein the first structure 35 is used for post-processing.
- an input image 44, which is an example of the input 2 (see FIG 1), is to be analysed.
- the input image 44 presently is a crop of a larger image and partly depicts a first pedestrian 45 and a second pedestrian 46.
- the two pedestrians 45, 46 are closely spaced such that the first pedestrian 45 partly occludes the second pedestrian 46.
- a forward pass of the input image 44 through the first adapted neural net 4 followed by the subsequent backward pass 6 provides a first processing result 47, which is a deconvolved output of the backward pass 6.
- in the first processing result 47, features corresponding to the first pedestrian 45 are emphasised such that a first inverse representation 48 of the first pedestrian 45 is visible.
- the first processing result 47 then undergoes multiple low-level image processing steps 38, which may comprise pseudo-colouring and/or highlighting of regions based on intensity as well as thresholding with a predetermined threshold value. After the thresholding step this yields an intermediate post-processing result 49 wherein a contour 50 corresponding to the first inverse representation 48, and therefore to the first pedestrian 45, is visible.
- the contour 50 can be detected and a corresponding first bounding box 52 can be generated. Complementing the input image 44 with the first bounding box 52 surrounding and therefore clearly marking the first pedestrian 45 results in a first partial image analysis result 51, which is an example of the first partial image analysis result 39.
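The thresholding and bounding-box steps can be illustrated on a small intensity map. This is a simplified sketch: it assumes a single emphasised region, skips pseudo-colouring and explicit contour tracing, and uses plain nested lists instead of real image data. The function names and the threshold value are hypothetical.

```python
def threshold(image, value):
    """Binary mask: 1 where the intensity exceeds the threshold, else 0."""
    return [[1 if px > value else 0 for px in row] for row in image]


def bounding_box(mask):
    """Inclusive bounding box (r0, c0, r1, c1) of all set pixels, or None."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    rs = [r for r, _ in coords]
    cs = [c for _, c in coords]
    return (min(rs), min(cs), max(rs), max(cs))


# Toy stand-in for a deconvolved output with one emphasised region.
deconvolved = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.0, 0.7, 0.9, 0.1],
    [0.0, 0.1, 0.2, 0.0],
]
mask = threshold(deconvolved, 0.5)
print(bounding_box(mask))
```

The resulting box can then be drawn onto the input image to mark the detected object.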
- the input image 44 is also processed by means of a forward pass through the second adapted neural net 5 and the corresponding backward pass 7 yielding a second processing result 53, which is a deconvolved output of the backward pass 7.
- in the second processing result 53, features corresponding to both of the pedestrians 45, 46 are highlighted or emphasised such that a second inverse representation 54 of the first pedestrian 45 and a first inverse representation 55 of the second pedestrian 46 are visible.
- the second processing result 53 is combined with the first partial image analysis result 51 or a part thereof yielding a combination result 56.
- an area 58 equal to the region surrounded by the first bounding box 52 is replaced with uniform pixel values such as intensity and/or colour.
- the area 58 therefore covers or obscures all features corresponding to the first pedestrian 45.
- a remainder 57 of the first inverse representation 55, that is, the parts of the inverse representation 55 outside of the area 58, remains visible in the combination result 56.
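Blocking out the region of the first bounding box in the second processing result can be sketched as follows. The function name `block_out`, the inclusive box format `(r0, c0, r1, c1)`, and the fill value are assumptions for illustration.

```python
def block_out(result, box, fill=0.0):
    """Return a copy of `result` with the boxed region set to a uniform value."""
    r0, c0, r1, c1 = box
    return [
        [fill if r0 <= r <= r1 and c0 <= c <= c1 else px for c, px in enumerate(row)]
        for r, row in enumerate(result)
    ]


# Toy second processing result: both pedestrians are emphasised.
second_result = [
    [0.9, 0.8, 0.7],
    [0.9, 0.8, 0.6],
]
first_box = (0, 0, 1, 1)  # region already attributed to the first pedestrian

combination = block_out(second_result, first_box)
print(combination)
```

Only the values outside the box survive, which corresponds to the remainder that is then passed on to the low-level image processing.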
- the combination result 56 then undergoes the low-level image processing steps 41, which can be essentially equal to the above-mentioned low-level image processing steps 38.
- the contour 60 can be detected through low-level image processing techniques resulting in the detection of the second pedestrian 46.
- a correspondingly generated second bounding box 62 surrounding the second pedestrian 46 can be added to the input image 44 to generate a second partial image analysis result 61.
- the second partial image analysis result 61 is an example of the second partial image analysis result 42.
- FIG 8 schematically depicts a second overview 63 of multiple processing steps of an image analysis using a slight variation of the structure 35 for post-processing.
- the same input image 44 is processed to generate the same first partial image analysis result 51 and the same second processing result 53.
- the second processing result 53 is then combined with the intermediate post-processing result 49 to generate a different combination result 64.
- the contour 50 or a corresponding segmented output of a low-level image processing step is used as a reference instead of the first bounding box 52. Accordingly, a different replacement area 66 equal to the parts of the intermediate post-processing result 49 surrounded by the contour 50 is filled with pixels of uniform intensity and colour.
- This has the advantage that the contour 50 or the corresponding segmentation can more closely match an actual area or outline of the first pedestrian 45 as compared to the first bounding box 52, resulting in fewer parts of the inverse representation 55 of the second pedestrian 46 being replaced.
- a detection result of the second pedestrian 46 may be improved. After filling in the replacement area 66 a remainder 65 of the first inverse representation 55 of the second pedestrian 46 remains visible in the combination result 64.
- the combination result 64 may also undergo the low-level image processing steps 41. After a corresponding thresholding step a contour 68 corresponding to the remainder 65 and therefore to the second pedestrian 46 becomes visible and can be detected in an intermediate post-processing result 67. Because of the different sources combined to generate the combination result 64, the contour 68 differs from the contour 60. After detecting the contour 68 a corresponding second bounding box 70 can be generated. Another second partial image analysis result 69 can then be generated by adding the second bounding box 70 to the input image 44.
- the second bounding box 70 can more closely match the visible parts of the second pedestrian 46 as compared to the second bounding box 62.
- the intensity and/or colour values used for the area 58 and the replacement area 66 can be predetermined or they can, for example, be derived from values of one or more neighbouring pixels in the respective combination result 56, 64.
- the first structure 35 can advantageously provide the first partial image analysis result 39, 51 as soon as possible after selecting the first data stream 36, in particular, possibly even before selecting the second data stream 37 and/or before executing the combination stage 40.
- FIG 9 schematically depicts an alternative second structure 71 of the post-processing stage 11.
- the first data stream 36 is again processed using the low-level image processing steps 38 to generate the first partial image analysis result 39, 51.
- Using the second structure 71, however, the first data stream 36 and the second data stream 37 are processed using a combination stage 72 in parallel to, that is, at the same time as the low-level image processing steps 38 are executed.
- An output of the combination stage 72 then undergoes low-level image processing steps 73 to generate a second partial image analysis result 74.
- the second structure 71 has the advantage of providing the final image analysis result 14, which may comprise the first and second partial image analysis results 39, 74, as fast or as soon as possible without any delays. Since the raw first data stream 36 instead of the first partial image analysis result 39 is used as an input for the combination stage 72 this can, however, make the combination stage 72 more complex. Combining the two data streams 36, 37 can, in other words, be more difficult than combining the first partial image analysis result 39 and the second data stream 37.
- FIG 10 schematically depicts a third overview 75 illustrating multiple processing steps of an image analysis using the second structure 71.
- the input image 44 is again processed the same way to generate the first partial image analysis result 51 and the second processing result 53.
- the first and second processing results 47, 53, that is, the deconvolved outputs of the backward passes 6, 7, are normalised and then combined in the combination stage 72.
- to obtain a combination result 76, a logical difference between the first and second processing results 47, 53 after normalisation is taken. This can, for example, mean that the first processing result 47 is subtracted from the second processing result 53.
- in the combination result 76, only a remainder 77 corresponding to the first inverse representation 55 which is not expressed or visible in the first processing result 47 remains.
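The normalise-then-subtract combination can be sketched as below. Two choices in this sketch are assumptions not stated in the description: min-max scaling to the range 0 to 1 as the normalisation, and clipping negative differences to zero. The function names are hypothetical as well.

```python
def normalise(image):
    """Min-max scale all values to [0, 1] (assumed normalisation scheme)."""
    flat = [px for row in image for px in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in image]
    return [[(px - lo) / (hi - lo) for px in row] for row in image]


def difference(second, first):
    """Subtract the normalised first result from the normalised second one.

    Negative values are clipped to zero (an assumption of this sketch).
    """
    a, b = normalise(second), normalise(first)
    return [[max(0.0, x - y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy deconvolved outputs with different value ranges.
first_result = [[4.0, 0.0, 0.0]]   # emphasises only the first object
second_result = [[2.0, 2.0, 0.0]]  # emphasises both objects

print(difference(second_result, first_result))
```

Normalising first matters because the two backward passes can produce outputs on different intensity scales; without it, the subtraction would not cancel the shared features.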
- an intermediate post-processing result 78 is obtained.
- a contour 79 is visible and can be detected.
- the contour 79 corresponds to the second pedestrian 46 but differs from the contour 68.
- FIG 11 schematically depicts an alternative third structure 82 of the post-processing stage 11.
- the low-level image processing steps 38 are applied to the first data stream 36 to generate the first partial image analysis result 39.
- the data streams 36, 37 are simultaneously combined using a combination stage 83. While the first and second data streams 36, 37 are being combined, processed, or analysed together in the combination stage 83, however, the low-level image processing steps 38 may already be completed.
- the first partial image analysis result 39 or a part thereof is used as an additional input for the combination stage 83 in addition to the data streams 36, 37.
- the combination stage 83 can start processing or combining the data streams 36, 37 as soon as possible but can also take advantage of results obtained while the data streams 36, 37 are still being processed in the combination stage 83.
- the third structure 82 can therefore be regarded as a combination of the first structure 35 and the second structure 71 implementing the respective advantages of both of these structures 35, 71.
- FIG 12 schematically depicts a fourth overview 86 of multiple processing steps of an image analysis using the third structure 82 for the post-processing stage 11.
- the input image 44 is processed to generate the first partial image analysis result 51 as well as the second processing result 53.
- the first partial image analysis result 51 or a part thereof is also used to generate a combination result 87.
- the first and second processing results 47, 53 can be combined as described with reference to FIG 10 to reliably cover or obscure any leftover features of the second inverse representation 54, that is, features corresponding to the first pedestrian 45.
- An area equal to the region surrounded by the first bounding box 52 in the first partial image analysis result 51 can be blocked out, that is, be replaced or filled with uniform pixel values. It is also possible to mask an area of the first processing result 47 equal to the region of the first bounding box 52 before taking the difference between the thusly modified first processing result 47 and the second processing result 53 to generate the combination result 87.
- a remainder 88 corresponding to the second pedestrian 46 remains visible in the combination result 87.
- low-level image processing steps 84 can then be applied to the combination result 87, which after thresholding yields an intermediate post-processing result 89 containing a contour 90 corresponding to the second pedestrian 46.
- the contour 90 differs from the contours 60, 68, and 79 as a result of the difference between the structures 35, 71, and 82.
- a corresponding second bounding box 92 can be generated. Adding the second bounding box 92 to the input image 44 then results in a second partial image analysis result 91.
- the first adapted neural net 4 used to gen ⁇ erate the first processing result 47 can be adapted to count exactly one object, that is, to count exactly the one first pedestrian 45. This corresponds to the first concept of 'ex ⁇ actly one object' , that is, of 'one pedestrian' .
- the second adapted neural net 5 used to generate the second processing result 53 can be adapted to count exactly two objects, that is, to count exactly the two pedestrians 45, 46. This corre ⁇ sponds to a second concept of 'exactly two objects' , that is, of 'exactly two pedestrians' .
- the described examples there- fore illustrate how processing in representation space and inverse representation space of a deep neural network as well as using multiple learned representations of concepts with at least partly overlap or redundancy among the concepts, that is, a relation among the learned concepts, can be used to analyse image data.
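The steps described for FIG 12 can be illustrated with a minimal sketch. It assumes that the two processing results are available as 2-D arrays of equal size, that the region of the first bounding box is blocked out after taking the difference, and that a fixed threshold is adequate. The function names, the toy arrays, and the threshold value are hypothetical and serve only as an illustration.

```python
import numpy as np

def bounding_box(mask):
    """Return (row_min, row_max, col_min, col_max) of the True pixels, or None if empty."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return int(rows.min()), int(rows.max()), int(cols.min()), int(cols.max())

def second_object_box(first_result, second_result, first_box, threshold=0.5):
    """Take the difference, block out the first box, threshold, and box the remainder."""
    r0, r1, c0, c1 = first_box
    combination = np.abs(second_result - first_result)
    combination[r0:r1 + 1, c0:c1 + 1] = 0.0   # fill the first box with uniform values
    return bounding_box(combination > threshold)

# Toy stand-ins for the processing results (hypothetical data).
first = np.zeros((8, 8))
first[1:4, 1:3] = 1.0            # result for 'exactly one pedestrian'
second = np.zeros((8, 8))
second[1:4, 1:3] = 0.8           # leftover features of the first pedestrian
second[2:6, 5:7] = 1.0           # features of the second pedestrian

first_box = bounding_box(first > 0.5)
second_box = second_object_box(first, second, first_box)
```

Blocking out the first box before thresholding removes the weak leftover response of the first pedestrian, so only the remainder belonging to the second pedestrian produces a box.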
Abstract
The invention relates to a method (1) and an apparatus for analysing image data (2, 44). Therein, one or more provided pre-trained neural nets are adapted for at least one specific image analysis task. The image data (2, 44) is processed by means of a respective forward pass through the one or more adapted neural nets (3, 4, 5) that have learned multiple logically related concepts. This results in the generation of multiple processing results (47, 53) corresponding to the concepts. At least two of the multiple processing results (47, 53) corresponding to different ones of the multiple logically related concepts are then selected. The at least two selected processing results (47, 53) are then combined in dependence of the logical relation between the concepts to which the selected processing results (47, 53) correspond. This generates an overall image analysis result (14).
Description
Method and apparatus for analysing image data
The invention relates to a method and an apparatus for analysing image data using an artificial deep neural network, also denoted as a neural net.
Image analysis is a common problem in various applications. It is, for example, often necessary or useful to detect positions and/or numbers of objects in a scene, or to segment a scene or an image semantically. Deep neural networks have been successfully used for specific image analysis tasks, such as image classification, object detection, or semantic segmentation. Even though deep neural nets or models are successful in these tasks, they at the same time have the disadvantage of being designed only for one specific task and requiring large amounts of annotated training data. Each time a new deep model is to be created or trained for a specific task, appropriate annotated or labelled training data is required. Additionally, a significant effort is required for training the model by way of hyper-parameter searches and a significant time investment is required for the large number of backpropagation iterations necessary for training or generating the model. This can make conventional approaches using a deep neural net costly and time-consuming, especially when a respective problem or application requires doing an - even slightly - different task than what the existing deep neural net was trained for. Examples of this are having trained a neural net for object detection with the new task or goal being semantic segmentation, or having a neural net trained for detecting one person with the new goal being to detect two persons or a second person, that is, doing multiples of the same task on the same data. Another example is an application that requires doing the same task for which the existing deep model has been trained but doing it on different data, that is, a different target dataset.
Conventional approaches using deep learning techniques for image analysis involve explicitly specifying during training what is to be analysed. For example, if different people in a scene are to be detected, the training would involve marking each person in the scene separately for the neural net to be able to differentiate between the different persons. Each time a different analysis is required, a new neural net may need to be trained. While this brings with it the above-mentioned challenges of training multiple neural nets, it can also be problematic to have to use and switch between different neural nets. Currently, in order to achieve a certain specific task, the respective deep model is directly trained for this specific task with explicit annotation of all concepts, or transfer learning is used. Transfer learning involves using a model that is trained for a task or a source distribution for a different task or distribution to avoid the need for additional specifically labelled training data. Other approaches involve using unsupervised learning to pre-train neural nets and using semi-supervised learning when at least some of the training data are correctly labelled. Still another approach lies in using a multi-task objective function for one model.
It is an objective of the present invention to provide a means for image analysis using a deep neural net that avoids the need for explicit annotation of all concepts to be analysed as well as the need for training a completely new neural net for each individual analysis task. This objective is achieved by a method having the features of patent claim 1 and an apparatus having the features of patent claim 15. Advantageous embodiments with expedient developments of the invention are indicated in the dependent patent claims as well as in the following description and the drawings.
A method according to the present invention is concerned with analysing image data. The image data to be analysed may also be referred to as input or target data. The method comprises providing one or more pre-trained artificial deep neural nets. These may be pre-trained for image analysis although this is not always necessary. Each of the one or more pre-trained neural nets is then adapted for at least one specific image analysis task. This means that if only one pre-trained neural net is provided, it is adapted for one specific task. If, however, multiple pre-trained neural nets are provided, some or all of the respective specific tasks for which the multiple pre-trained neural nets are adapted may be different from one another. The specific tasks for which the multiple pre-trained neural nets are adapted may, however, also be the same. Adapting a neural net may also be referred to as tuning the neural net. In a next step the image data is processed by means of a respective forward pass through the one or more adapted neural nets that have by then learned multiple logically related concepts. This processing of the image data generates multiple processing results corresponding to the multiple logically related concepts which the adapted neural nets have learned. At least two of these multiple processing results corresponding to different ones of the multiple logically related concepts are then selected. The at least two selected processing results are then combined in dependence of the logical relation between the concepts to which the selected processing results correspond. This combining generates an image analysis result.
The present invention can in principle be used to analyse arbitrary image data. Therefore, the term image is to be interpreted broadly and can refer to different kinds of image data or images. An input, that is, an image to be analysed might for example be an image captured by a surveillance camera or a camera that is part of an assistance system of a car. Instead of directly using an image captured by a camera it is, for example, also possible to use an image or image data that has been pre-processed. The image data can for example be or comprise a crop of a larger image that has been subdivided. It is also possible to use as the image data an output - for example the contents of a bounding box - resulting from a regional proposal algorithm or object detection algorithm, or a result or an output from a change detection algorithm. Such a pre-processing algorithm, the outputs of which are used as the image data to be analysed by the present invention, could for example be a low-complexity algorithm operating in the compressed domain. In particular, the image data may be or comprise a whole image or a crop of an image output by another algorithm that is not capable of separating closely spaced objects or that identifies larger areas compared to the sizes of actual objects present in the respective image data.
Throughout the following description an example is referred to on occasion for illustrative purposes. In the example, the neural net is pre-trained for counting objects, in particular for counting pedestrians in images, and the method is used for detecting objects. This example can be generalised and details and terms of the example can be replaced by the broader terms as used in the claims. In this sense, the logically related concepts may refer to different numbers or counts of objects such as the pedestrians. It is, however, to be understood that the present invention is by no means limited to this example, since the present invention can be used with differently trained neural nets and different concepts.
A deep neural net is a neural net comprising multiple layers. The pre-trained neural net could be trained from scratch, starting with randomly initialised weights and/or other parameters. It could also be pre-trained by transfer learning starting from a baseline neural net trained for general image classification or analysis. While the pre-trained neural net is not yet trained or adapted for a specific use case or application as may be required in a respective productive environment, it has received at least some degree of training, preferably in terms of data or image analysis and/or regarding different concepts and/or logical relations.
In the present description a trained, pre-trained, or adapted neural net may also be referred to as a model.
Adapting or tuning the pre-trained neural net for one specific image analysis task allows or forces the adapted neural net to focus on the specific task. This can mean that the neural net is tuned to specifically react to or recognise features, properties, or concepts corresponding to the specific task. To adapt the pre-trained neural net, training data can be provided and, for example, processed using a backpropagation method. This may comprise executing multiple iterations of gradient descent to update weights and/or other parameters of the neural net in dependence of the specific image analysis task and/or at least one of the multiple logically related concepts. Since the pre-trained neural net is used as a starting point for the tuning, this can advantageously be achieved with significantly less training data than would be necessary for creating a completely new neural net trained for the specific image analysis task. This means that the present invention can effectively, efficiently, and flexibly be used for various applications - and even applications with a changing target dataset - with comparatively low effort, cost, and time investment. It can be especially efficient to create a new instance, that is, a copy of the pre-trained neural net, and to adapt this new instance or copy.
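The idea of adapting a copy of a pre-trained model by iterations of gradient descent can be sketched as follows. The sketch uses a single linear layer with a squared-error loss as a stand-in for a deep neural net. The parameter dictionary, the learning rate, the number of steps, and the toy data are assumptions chosen for illustration only.

```python
import copy
import numpy as np

def mse(params, inputs, targets):
    """Mean squared error of the linear model on the given data."""
    return float(np.mean((inputs @ params["w"] + params["b"] - targets) ** 2))

def tune(pretrained, inputs, targets, lr=0.1, steps=200):
    """Adapt a copy of the pre-trained parameters with plain gradient descent."""
    adapted = copy.deepcopy(pretrained)   # the pre-trained original stays available
    for _ in range(steps):
        err = inputs @ adapted["w"] + adapted["b"] - targets
        adapted["w"] -= lr * inputs.T @ err / len(inputs)
        adapted["b"] -= lr * err.mean()
    return adapted

rng = np.random.default_rng(0)
pretrained = {"w": np.array([1.0, -1.0]), "b": 0.0}   # hypothetical starting point
x = rng.normal(size=(32, 2))                          # small task-specific training set
y = x @ np.array([1.2, -0.7]) + 0.3
adapted = tune(pretrained, x, y)
```

Because the adaptation works on a deep copy, the same pre-trained parameters can be reused as the starting point for further adaptations.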
This way, the originally provided pre-trained neural net remains available as a starting point for additional adaptations or uses of the presently described method in general. The adapted neural nets may be adapted online and/or offline using synthetic and/or natural images. In terms of the present invention offline training or tuning refers to using training data that differs from a target image or target dataset to be analysed. This means that the training data is another physical dataset separate from the target dataset. The training data and the target dataset may, however, comprise similar kinds of images or image data. Online training or tuning on the other hand refers to using the same image or target data later to be analysed for training and/or tuning purposes. Online and/or offline training or tuning can be used for every image or for a group of images. If a group of images is used, then all images of the group can be analysed after the tuning without further adaptation steps to reduce the overall time required for processing all images of the group. Online training can yield particularly accurate results since the adaptation does not need to rely on training data that differs from the target data, that is, the actual image data to be analysed. Offline adaptation can be advantageous since the training data and therefore the adaptation can be controlled and supervised, which lowers the chance of the neural net acquiring an unintended bias. Using synthetic images as training data can be advantageous since those can be easily mass-produced, managed, and specifically created or tailored to fit a respective use case and to avoid any unintentional biases. Using natural or real images can, on the other hand, have the advantage of better preparing the neural net for its intended application. It is possible to use online and/or offline tuning as well as synthetic and/or natural images as training data in various combinations.
Processing the image data by means of a forward pass through the at least one adapted neural net means that the image data is provided as an input to a first or input layer of the adapted neural net, which then works on this input and provides a corresponding output or processing result. The forward pass through the adapted neural net therefore comprises a data flow from the input layer to an output layer, that is, from a lowest to a highest layer in a hierarchical layer structure of the adapted neural net.
The processing result can be an output of the respective adapted neural net at its output layer. The processing result can, however, also be or comprise at least one representation or activation from other parts or layers of the respective neural net. In an advantageous development of the present invention the selected processing results are taken from different activations of the layers, in particular of filters or channels, of the one or more adapted neural nets. The processing results can, in other words, be or comprise different representations of the image data or outputs of intermediate layers or parts thereof. Combining the selected processing results can therefore comprise processing in a representation space, that is, processing data in or from an abstract space of representations of image data. Since the different layers of a deep neural net, in particular higher layers such as the layers in, for example, the highest third of the layer structure, can learn different concepts, the different layers can offer a useful and flexible source of data for the processing results. It can be very efficient to use outputs or activations of different layers of one individual adapted neural net, since this can limit the number of adapted neural nets required for the image analysis. At least one of the selected processing results or at least a part thereof may, however, be taken from different adapted neural nets. This approach can advantageously offer a high degree of flexibility and customisability and can therefore improve the overall image analysis result. It is, for example, possible for the selected processing results to comprise an output or a classification from an output layer of a first adapted neural net, an activation or representation of a specific filter or channel of a second or third highest layer of the first adapted neural net, and another activation of a specific channel of an intermediate or output layer of a second adapted neural net. The present invention is therefore applicable and usable in a variety of different use cases.
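Taking processing results from different layers of one net can be sketched as a forward pass that keeps every layer's activation. The sketch assumes a small stack of fully connected ReLU layers with random weights in place of an adapted deep neural net. The layer sizes and the chosen channel index are hypothetical.

```python
import numpy as np

def forward(x, weights):
    """Run a forward pass from the lowest to the highest layer, keeping all activations."""
    activations = []
    for w in weights:
        x = np.maximum(w @ x, 0.0)   # fully connected layer followed by ReLU
        activations.append(x)
    return activations

rng = np.random.default_rng(0)
weights = [rng.normal(size=(6, 4)), rng.normal(size=(5, 6)), rng.normal(size=(3, 5))]
activations = forward(rng.normal(size=4), weights)

# Processing results may come from the output layer or from intermediate layers.
selected = {
    "output": activations[-1],        # output of the highest layer
    "channel": activations[-2][1],    # one channel of the second highest layer
}
```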
Higher layers in the hierarchical layer structure of a deep neural net correspond to or learn higher levels of abstraction. In these higher layers concepts are learned or represented. A concept in terms of the present invention may refer to a more abstract and/or meaningful part or content of an image as compared to a simple contextless pixel-level feature or property of the image. If, for example, an image depicts multiple persons, one concept may be 'all female persons', while another concept may be 'all persons wearing glasses'. In order to analyse image data, the learning of concepts can be leveraged. Processing image data by means of a deep neural net leads to different representations at different layers of the deep neural net. These representations learned at the different layers can be or correspond to the different learned concepts.
Generally, the goal of representation learning is to learn representations that are disentangled causes of the image. One application of such disentangled representations is the generation of images by varying codes in a latent space, where different dimensions of the latent space represent different disentangled causes of the input image. One dimension might, for example, represent whether the person in the input image is a male or a female, while another dimension might represent whether or not the person in the image is wearing glasses. By varying the code along these two dimensions of the latent space, various combinations of male or female with or without glasses can be obtained. Thus, the use of the disentangled representation in this case is synthesis or generation. The method for image analysis in accordance with the present invention uses representation learning for learning and using different representations of or for the image data or parts thereof to be analysed, wherein these representations are disentangled. This means that through the different representations there exists a separation of the corresponding causes from a composition of all causes.
A cause in this sense can be understood to be a feature, property, or characteristic of the image or image data that is responsible for, that is, is causing the corresponding representation. A cause can therefore be thought of as a root or an origin of a certain state, behaviour, and/or output of the deep neural net. The different disentangled representations or concepts are, however, not necessarily independent of each other. Instead, they have a logical relation to each other. This can, for example, mean that they may be described using set theory, and/or partly overlap, such that there may be partial redundancy between them. These logical relations between the learned concepts can advantageously be used for the purpose of data analysis. By using or exploiting these logical relations the data analysis can be done using simple operations that may not have been as effective in the space of the original input image data, since the different representations or activations can highlight or be focused on different features of the image data, that is, causes of the respective representations or activations. An activation in this sense is an output of a certain filter or channel of the deep neural net.
It is possible for the provided one or more pre-trained deep neural nets to have learned the logically related concepts. It is, however, also possible that only the one or more adapted neural nets have learned the logically related concepts. Learning the logically related concepts can therefore be part of or a result of the adaptation or tuning.
Selecting the at least two of the multiple processing results can be done automatically. For this purpose a predetermined criterion may be provided such as a threshold value for a confidence rating. For example, only processing results having or corresponding to a confidence rating or confidence value higher than the predetermined threshold value may be selected. It can also be possible to provide a predetermined table or a family of characteristics which can then be used to automatically select processing results, for example in dependence of a respective use case, application, or a predetermined goal of the image analysis. Combining the selected processing results means that data from multiple sources are combined, analysed, or processed together. This can comprise adding or subtracting the processing results or parts thereof to one another or from one another, respectively. Other mathematical and/or logical operations may also be used in combining the selected processing results. The logical relation between the corresponding concepts ensures that by combining the selected processing results new data or information about the processed image data can be obtained. This method can for example make it possible to obtain new information about the image data for which none of the neural nets have been explicitly trained or adapted. The present invention can therefore be used flexibly and can provide correct data analysis even in difficult situations, such as for example detecting partly occluded or occluding objects. Similarly, the present invention can advantageously be used to analyse image data for concepts that have not explicitly been annotated in the training data used to train, pre-train, or adapt the neural net or nets.
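Automatic selection by a confidence threshold followed by a subtractive combination can be sketched as below. The dictionary layout, the concept names, the confidence values, and the threshold are hypothetical; the 2x2 arrays stand in for real processing results.

```python
import numpy as np

def select_results(results, min_confidence):
    """Keep only processing results whose confidence exceeds the predetermined threshold."""
    return {name: r for name, r in results.items() if r["confidence"] > min_confidence}

results = {
    "one_object": {"confidence": 0.9, "map": np.array([[1.0, 0.0], [0.0, 0.0]])},
    "two_objects": {"confidence": 0.8, "map": np.array([[1.0, 0.0], [0.0, 1.0]])},
    "three_objects": {"confidence": 0.2, "map": np.zeros((2, 2))},
}

selected = select_results(results, min_confidence=0.5)

# 'Two objects' contains 'one object', so subtracting isolates the second object.
remainder = selected["two_objects"]["map"] - selected["one_object"]["map"]
```

The remainder carries information about the second object alone, a concept for which neither result was produced explicitly.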
The image analysis result can for example comprise a classification of the image data or parts thereof, an indication of a detected object, a segmentation, etc. The image analysis result may also be a complemented or modified version of the analysed image data, which can for example comprise bounding boxes around detected objects.
In an advantageous development of the present invention the selected processing results are selected such that the corresponding different logically related concepts have a nonempty intersection. The selected processing results do, in other words, have an overlap or a redundancy. They can have a partial overlap or redundancy, or a first concept may include or contain a second concept, which therefore then is a subset of the first concept. Through the use of intersecting or overlapping concepts additional or more detailed image analysis results can be obtained. This approach also avoids the need - and therefore the time and effort required - for specifically training a neural net for the overlap, since the data corresponding to the overlap can be obtained by combining the different processing results. As an example of a partial overlap a first concept may be 'animal' and a second concept may be 'having two legs'. In this case, a first processing result corresponding to the first concept may be a detection or indication of all animals shown in a picture, while a second processing result corresponding to the second concept may be a detection or indication of all two-legged objects or beings. The nonempty intersection, that is, the overlap would therefore consist of all two-legged animals, but would for example neither include four-legged animals nor humans.
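The 'animal' and 'having two legs' example can be expressed with ordinary sets. The detected labels below are invented for illustration, and humans are treated as not belonging to the concept 'animal', as in the example above.

```python
# Hypothetical detections for the two concepts.
animals = {"dog", "ostrich", "chicken"}
two_legged = {"ostrich", "chicken", "human"}

# The nonempty intersection is obtained without training a net for it.
two_legged_animals = animals & two_legged
```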
In an advantageous development of the present invention the selected processing results are combined by using logical operators corresponding to the logical relation between the concepts to which the selected processing results correspond. The logical relation between the concepts is, in other words, mapped to the logical operators used for combining, analysing, and/or further processing the selected processing results. This can advantageously enable a consistent and automated yet flexible image analysis. This is especially advantageous and useful for automation since the logical operators to be used do not have to be manually specified in advance. Logical operators in this sense may for example refer to Boolean operators such as conjunction, disjunction, negation, and comparisons, as well as operators of set theory such as union or intersection.
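Mapping a logical relation to an operator can be sketched with a lookup table over Boolean masks. The relation names, the table, and the masks are assumptions made for this illustration; the text does not prescribe a particular mapping.

```python
import numpy as np

# Hypothetical table mapping a logical relation to the operator used for combining.
OPERATORS = {
    "intersection": np.logical_and,
    "union": np.logical_or,
    "difference": lambda a, b: np.logical_and(a, np.logical_not(b)),
}

def combine(first, second, relation):
    """Combine two Boolean processing results according to their logical relation."""
    return OPERATORS[relation](first, second)

animal_mask = np.array([True, True, False])
two_legged_mask = np.array([True, False, False])

both = combine(animal_mask, two_legged_mask, "intersection")
either = combine(animal_mask, two_legged_mask, "union")
only_animal = combine(animal_mask, two_legged_mask, "difference")
```

Because the operator is looked up from the relation, it does not have to be specified by hand for each pair of processing results.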
In a practical realisation of the method according to the present invention the processing results may be selected based on or in dependence of a control signal, and/or a predetermined configuration or setting of a corresponding apparatus used to carry out the method.
Since different layers, in particular the higher layers, of a deep model learn concepts, a combination of such concepts or the corresponding processing results can lead to meaningful insights, in particular, if the concepts can be isolated. The higher layers in this sense may for example be fully connected layers and/or layers of the highest third of a layer structure of a respective deep neural net. The different concepts can be learned by the same model or different models on or from one and the same image or multiple images or multiple image parts of the same or different images.
In an advantageous development of the present invention at least one of the selected processing results is processed using an image processing algorithm before or after combining the at least two selected processing results. This means that the image processing techniques may be used on the processing results and/or a combination of the processing results, that is, a result of combining the selected processing results. The image processing techniques, that is, the image processing algorithm may be used on only a part of the processing results and/or the combination. The image processing algorithm may in particular be a low-level image processing algorithm. A low-level image processing algorithm as well as low-level image processing in terms of the present invention refers to algorithms and methods that are not part of the machine learning domain and/or that are not concerned with the interpretation or classification of a scene or an image as a whole. Rather, these low-level techniques may for example include pixel-based operations, finding corresponding points, edge detection, and the like. Using the image processing algorithm as well as combining the processing results to generate the overall image analysis result can advantageously yield an image analysis result that is easier to understand, interpret, or process for a respective user and/or other applications or systems, which use the image analysis result as an input.
In a further advantageous development of the present invention processing the at least one selected processing result using the image processing algorithm comprises pseudo-colouring, and/or highlighting regions based on intensity, and/or thresholding, and/or contour detection, and/or generating a bounding box, in particular a bounding box surrounding a detected contour or object. These image processing tasks can be automatically carried out using low-level image processing techniques and thus do not require the use of a neural net. Pseudo-colouring can be used to generate or create colour variations based on sine wave generation for different colour channels. Thresholding can be used to discard parts of the processing results or their combination having one or more values, such as intensity, below a predetermined threshold value. Generating a bounding box can advantageously provide a reference area or region assigned to or associated with a detected contour, a detected object, a specific feature, or the like. The bounding box can therefore be used to indicate a specific part of the processing result, the combination, and/or the original input image data to a user and/or another program or system. Since the bounding box can be a geometric primitive form, for example a rectangle, this can be done with less processing effort than using a complex detailed outline or shape or segmentation and can therefore result in fast and efficient processing, especially in time-sensitive applications. The bounding box may be added to the at least one processing result, the combination of the selected processing results, and/or to the input image data.
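Pseudo-colouring by sine waves and thresholding can be sketched as follows. The sketch assumes intensities normalised to the range 0 to 1. The frequency, the phase offsets per colour channel, and the threshold value are illustrative assumptions and are not taken from the description.

```python
import numpy as np

def pseudo_colour(intensity, frequency=2.0 * np.pi):
    """Map intensities to RGB using one phase-shifted sine wave per colour channel."""
    phases = np.array([0.0, 2.0 * np.pi / 3.0, 4.0 * np.pi / 3.0])
    # The trailing axis of size 3 holds the red, green, and blue channels.
    return 0.5 + 0.5 * np.sin(frequency * intensity[..., None] + phases)

def threshold(values, minimum):
    """Discard values below the predetermined threshold by setting them to zero."""
    return np.where(values >= minimum, values, 0.0)

intensity = np.array([[0.1, 0.9], [0.4, 0.7]])   # hypothetical processing result
coloured = pseudo_colour(intensity)
kept = threshold(intensity, minimum=0.5)
```

Both operations are pixel-based and need no neural net, which matches the low-level character of the post-processing described above.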
In an advantageous development of the present invention at least a part of the image data is processed by means of the respective forward pass and a respective subsequent backward pass through the one or more adapted neural nets to generate the multiple processing results. The multiple processing results therefore comprise an output or a result of the at least one backward pass. The backward pass thus constitutes an additional processing step to generate at least one of the multiple processing results. A backward pass through a neural net refers to processing data using the neural net in an opposite direction as compared to the direction of data processing or data flow used in the forward pass. The backward pass therefore comprises a data flow from a higher layer to a lower layer of the respective neural net. After the backward pass the respective processing result, that is, the output of the backward pass can represent or resemble an image wherein only features or causes of the input image data corresponding to one of the logically related concepts and/or corresponding to the respective image analysis task for which the respective neural net has been adapted are visible, highlighted, or emphasised. This is the case, since other features not contributing to an intermediary processing result used as input for the backward pass are not reconstructed during the backward pass. Using the additional processing step of the backward pass can be advantageous since the processing result obtained from the backward pass may be less abstract than a processing result taken directly from the forward pass and may therefore be easier to further process using conventional image processing techniques. Employing the backward pass may therefore lead to an improved overall image analysis result.
While after a forward pass the layers, in particular the higher layers, of a deep neural net provide a respective representation of the input, the backward pass can provide a respective inverse representation, which may reveal or highlight features, properties, or aspects of the image data more clearly and/or reveal additional features, properties, or aspects. Multiple inverse representations could be obtained from different filters, channels, or layers of the same adapted neural net or from multiple adapted neural nets, wherein the respective filters, channels, layers, and/or adapted neural nets correspond to different logically related concepts.
A space of these inverse representations may be referred to as inverse representation space or the inverse domain. Accordingly, processing data from the inverse representation space may be referred to as inverse domain processing. Since the inverse representations can be images with each image containing only features corresponding to one concept, inverse domain processing advantageously allows for accurate processing of image data on a level of concepts, that is, correctly and accurately with regards to actual content or meaning of the image data as it could be understood by a human user. It can for example be possible to remove parts or features from an image, or add parts or features to an image selectively, wherein each part or feature corresponds to a specific concept, such as a specific type or number of objects. The present invention may therefore advantageously allow content-based analysis and editing of image data without necessarily relying on techniques such as segmentation and with improved accuracy and flexibility, since features can be handled based on their belonging to a concept without necessarily being related on a pixel level. The features may, for example, belong to the same concept or a single manifestation thereof, but may at the same time not belong to a continuous or connected area, which would make only using segmentation techniques unreliable.
The selected processing results may comprise representations or activations, that is, data from a representation space, and/or inverse representations, that is, data from the inverse domain. The method for image analysis according to the present invention therefore proposes using processing in the representation space and/or in the inverse representation space as a means for obtaining improved image analysis results, in particular as compared to conventional methods operating only in the space of the original input image data.
In a further advantageous development of the present invention a transpose of weights of the one or more adapted neural nets is used for processing the respective image data by means of the backward pass. In other words, an intermediary processing result after the respective forward pass is processed again using the transpose of the weights or weight matrix used for the forward pass. If the neural net has been correctly trained, taking the inverse of this intermediary processing result, that is, processing it by means of the backward pass, reveals the corresponding concepts present in the input image data. Using the transpose of the weights is an advantageous method for obtaining the inverse representations since the transpose can be obtained or calculated with minimal investment of time and processing effort.
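As a minimal illustration of this idea, the following sketch runs a forward pass through a single fully connected layer and then maps the resulting activation back to the input space using the transposed weight matrix. The layer size, the weight values, and the helper names (`forward`, `backward`) are assumptions chosen for illustration only and are not taken from the description.

```python
# Illustrative sketch only: one fully connected layer with made-up weights.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def relu(vector):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in vector]


def forward(weights, x):
    """Forward pass: representation of the input at this layer."""
    return relu(matvec(weights, x))


def backward(weights, activation):
    """Backward pass: reuse the forward weights in transposed form."""
    return matvec(transpose(weights), activation)


W = [[1.0, 0.0, -1.0],
     [0.0, 2.0, 0.0]]          # hypothetical 2x3 weight matrix
x = [3.0, 1.0, 5.0]            # hypothetical input

activation = forward(W, x)     # only the second unit is active
inverse = backward(W, activation)
print(activation, inverse)
```

In this toy case only the second unit fires, and the backward pass places all of its weight on the second input component, which is the component that caused the activation.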
In an advantageous development of the present invention one or more deep convolutional neural nets, and/or deep feedforward neural nets, and/or deep recurrent neural nets are used as the one or more deep neural nets. The deep neural nets can therefore have features or characteristics of one of these types of neural nets or features or characteristics of a combination of some or all of these types of neural nets. This allows for adapting the properties and behaviour of the neural net to different applications.
Using a convolutional neural net (CNN) is especially advantageous for image processing and analysis since a high accuracy can be achieved. This is partly because of the implicit assumption of locality. This means that by using a CNN it is possible to take advantage of the fact that in typical images and with typical objects depicted therein pixels located in the same region of the image are more likely to be related, that is, to belong to the same object, than pixels that are farther away or apart. A disadvantage of conventional approaches using a CNN for image analysis is that large amounts of extensively annotated training data are required to train the CNN. This problem is avoided with the present invention by using a pre-trained neural net as a starting point and obtaining the one or more adapted neural nets by tuning the pre-trained neural net for one specific image analysis task. The pre-trained CNN can be trained using synthetic image data. The adaptation or tuning of this pre-trained neural net requires significantly less training data, which can also be annotated with significantly less effort.
The present invention thus enables the analysis of new image data even for concepts not explicitly annotated in the training data and therefore without the need for extensive accurate labelling or annotations. Especially in the case of using a CNN, a backward pass through the CNN may also be referred to as a deconvolution pass or simply a deconvolution. A deconvolution pass may therefore involve starting from one of the learned filters at a layer of the CNN and performing the reverse data processing steps of successive unpooling, rectification, and filtering to reconstruct the activity in the layer beneath that gave rise to the chosen activation of the learned filter where the deconvolution pass started. Accordingly, the output of the deconvolution pass may be referred to as deconvolved output.
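The three reverse steps named above (unpooling, rectification, filtering) can be sketched on a one-dimensional feature map. This is a simplified illustration under several assumptions that are not part of the description: max pooling that records the positions of the maxima ("switches"), a stride of one for the filter, and made-up values for the feature map and the kernel.

```python
# Illustrative 1-D sketch of one deconvolution step (assumed details, see lead-in).

def max_pool_with_switches(values, size):
    """Forward max pooling that also records where each maximum came from."""
    pooled, switches = [], []
    for start in range(0, len(values), size):
        window = values[start:start + size]
        best = max(range(len(window)), key=lambda i: window[i])
        pooled.append(window[best])
        switches.append(start + best)
    return pooled, switches


def unpool(pooled, switches, length):
    """Reverse pooling: put each pooled value back at its recorded position."""
    restored = [0.0] * length
    for value, position in zip(pooled, switches):
        restored[position] = value
    return restored


def rectify(values):
    """Keep only positive activity."""
    return [max(0.0, v) for v in values]


def transposed_filter(values, kernel):
    """Scatter each value back through the kernel (transpose of a stride-1 filter)."""
    out = [0.0] * (len(values) + len(kernel) - 1)
    for i, v in enumerate(values):
        for j, k in enumerate(kernel):
            out[i + j] += v * k
    return out


feature_map = [0.2, 0.9, -0.4, 0.1]      # hypothetical activations
kernel = [1.0, 0.5]                       # hypothetical learned filter

pooled, switches = max_pool_with_switches(feature_map, size=2)
unpooled = unpool(pooled, switches, len(feature_map))
reconstruction = transposed_filter(rectify(unpooled), kernel)
print(pooled, switches, unpooled, reconstruction)
```

Recording the switch positions during the forward pass is what allows the unpooling step to restore each activation to the location it originated from.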
Feedforward neural nets have the advantage of being very robust, meaning that their performance degrades gracefully in the presence of increasing amounts of noise. Using a recurrent neural net can be especially advantageous for analysing data with a temporal structure, such as for example a time series of multiple images or a video feed.
In an advantageous development of the present invention the one or more pre-trained neural nets are pre-trained for counting objects in images. The neural net is, in other words, pre-trained to classify images according to a number of objects depicted therein. Each class of a classifier or output layer of the pre-trained neural net may therefore represent or correspond to a different count of objects. If there are, for example, 16 different classes or categories, the neural net might be capable of classifying images with anywhere from 0 to 15 or 1 to 16 objects according to the respective number of objects depicted in each image.

It is to be noted that counting objects does not necessarily include detecting or specifying individual locations or outlines of each object. Training the neural net for counting objects therefore requires significantly less detailed annotations in the training data than, for example, training the neural net to detect individual objects.
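The correspondence between output classes and object counts described above can be sketched as follows. The function name `predicted_count`, the parameter `first_count`, and the score values are hypothetical and only illustrate how 16 classes may map to the counts 0 to 15 or 1 to 16.

```python
# Illustrative sketch: map the strongest output class to an object count.

def predicted_count(class_scores, first_count=0):
    """Return the count represented by the highest-scoring class.

    first_count is the count assigned to class index 0
    (0 for the range 0..15, 1 for the range 1..16).
    """
    best_class = max(range(len(class_scores)), key=lambda i: class_scores[i])
    return first_count + best_class


# Hypothetical scores of a 16-class output layer; class index 2 is strongest.
scores = [0.01, 0.10, 0.70, 0.05] + [0.01] * 12

count_from_zero = predicted_count(scores, first_count=0)
count_from_one = predicted_count(scores, first_count=1)
print(count_from_zero, count_from_one)
```

Only an image-level count label is needed to train such a classifier, which is why no object locations or outlines appear anywhere in this sketch.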
The neural net can be pre-trained for counting at least one specific type of object, such as for example pedestrians or cars. Since the different concepts may refer to different kinds of objects, one adapted neural net or one layer thereof may be adapted to count a first number of objects, while a second layer or a second adapted neural net may be adapted for counting a different second number of objects. Since the larger number or count of objects includes the smaller number or count, these concepts are logically related, the lower number or count being a subset of the larger count or number. One processing result can therefore correspond to the first number of objects, whereas a second processing result may correspond to the second number of objects. This can, for example, mean that in the first processing result a first number of objects or corresponding features are visible or emphasised, whereas in the second processing result a second number of objects or features thereof are visible or emphasised. Combining these two processing results therefore can be used to, for example, remove a number of objects equal to the first or the second number of objects from the image data.
Counting a specific number of objects means focusing on this number of objects or features corresponding to this number of objects and disregarding features belonging or corresponding to other objects that might also be depicted in the same image. If there are at least the specified number of objects present in the image, the neural net adapted to count this specific number of objects will provide a classification of the image as containing this specific number of objects, meaning that it has counted exactly that many of the objects depicted in the image. It can be especially advantageous to have at least one layer or adapted neural net adapted to count exactly one object. This allows for using a neural net trained for counting objects to detect objects, in particular one individual object. Using a neural net pre-trained for counting objects therefore allows for a detailed image analysis with significantly less training effort than would be required for creating a neural net capable of detecting individual objects or different numbers of objects even in images with a larger number of objects.
The described method of removing the detected object can also be used in an iterative process to sequentially count and detect all objects shown in the respective image. This approach can yield improved accuracy especially for images with closely spaced and/or partly occluded objects.

In an advantageous development of the present invention multiple differently adapted neural nets are created by adapting the one pre-trained neural net for different concepts, in particular by adapting the one pre-trained neural net to count different numbers of objects. The differently adapted neural nets are then used to generate the multiple processing results. This means that only one pre-trained neural net is provided. From this one pre-trained neural net, the multiple adapted neural nets can be created or generated as copies or new instances, which are then adapted using different or differently labelled or annotated training data. This approach can be very efficient, requiring a minimal amount of time, effort, and storage space for the one pre-trained neural net used as a starting point for the adapted neural nets.
It is, however, also possible to provide only the one pre-trained neural net and create from that only one adapted neural net which has learned different concepts and representations at different filters or layers. It is also possible to provide multiple pre-trained neural nets and from these create multiple adapted neural nets. In general, the number of provided pre-trained neural nets can be different from the number of adapted neural nets.

In an advantageous development of the present invention at least one object is detected in the at least two processing results and/or the combination of the at least two processing results by means of an image processing algorithm to generate the image analysis result. This image processing algorithm can be the above-mentioned low-level image processing algorithm. Object detection is a common problem in various applications. In applications like surveillance or driver assistance systems or autonomous driving or any application where information about the environment is required, detecting objects is an important task. Detecting the at least one object in the processing results or their combination can yield especially accurate and reliable results since the processing results or the combination can be focused on the object to be detected such that its features are relatively emphasised compared to the rest of the image data. The image analysis result can be a final result obtained after combining the selected processing results and processing them and/or their combination by means of the image processing algorithm.
In a further advantageous development of the present invention the at least one object is detected using at least one predetermined optimisation criterion. In particular, at least one predetermined constraint may be used. This can comprise at least one predetermined constraint for a boundary smoothness, and/or for an area of the processing results and/or the combination of the processing results. A constraint in this sense may comprise one or more predetermined threshold values, such that for example a boundary or contour is interpreted to indicate an object if the corresponding boundary smoothness and/or area surrounded by the boundary or contour is greater or smaller than the predetermined threshold value and/or lies between two predetermined threshold values. Different threshold values and/or constraints may be provided and used for detecting different objects and/or different kinds of objects. Using at least one such predetermined optimisation criterion enables fast and reliable object detection, in particular since features or causes corresponding to the object to be detected are highlighted or relatively emphasised by processing the image data using the adapted neural net prior to using the image processing algorithm.
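A threshold-based constraint of this kind might be sketched as follows. The `Candidate` type, the notion of a smoothness score between 0 and 1, and all threshold values are assumptions made for illustration; the description does not specify how boundary smoothness is measured.

```python
# Illustrative sketch: accept a contour as an object only if it satisfies
# predetermined area and smoothness constraints (all values hypothetical).
from dataclasses import dataclass


@dataclass
class Candidate:
    area: float        # area enclosed by the contour, in pixels
    smoothness: float  # hypothetical smoothness score in [0, 1]


def satisfies_constraints(candidate, min_area, max_area, min_smoothness):
    """True if the area lies between two thresholds and the boundary is smooth enough."""
    area_ok = min_area <= candidate.area <= max_area
    smooth_ok = candidate.smoothness >= min_smoothness
    return area_ok and smooth_ok


candidates = [
    Candidate(area=12.0, smoothness=0.9),    # too small, likely noise
    Candidate(area=850.0, smoothness=0.8),   # plausible object
    Candidate(area=900.0, smoothness=0.2),   # ragged boundary
]

accepted = [c for c in candidates
            if satisfies_constraints(c, min_area=100.0, max_area=5000.0,
                                     min_smoothness=0.5)]
print(accepted)
```

Different kinds of objects could be given different threshold sets simply by calling `satisfies_constraints` with other arguments.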
In a further advantageous development of the present invention the at least one object is detected by treating pixels of the processing result and/or the combination of the processing results as a Markov random field and using a predetermined constraint on a gradient of intensities. This approach can be especially advantageous since a Markov random field can be used to detect and segment objects in an image simultaneously and can therefore yield fast and accurate object detection and segmentation, as might be required in time-sensitive and/or safety-related applications. A Markov random field model or approach can also advantageously be used to detect moving objects with improved accuracy.
If multiple objects are detected by the image processing algorithm, one of the objects can be selected according to or in dependence on a predetermined object selection criterion. Multiple objects might for example be detected due to the tuning of the pre-trained neural net, and/or due to parameters of the low-level image processing or thresholding. Providing a predetermined object selection criterion for selecting one object advantageously allows for reliably marking exactly one object. The criterion advantageously allows for improved flexibility and customisation of the image analysis. Selecting one object can advantageously enhance the capability of the present invention to be used for object detection since it strengthens the focus on detecting only a single object at a time.

The object selection criterion may for example be a size of a respective area corresponding to each of the multiple detected objects. The area can for example be enclosed by a bounding box or a detected contour or boundary. For example, the object corresponding to the bounding box having the largest area can be automatically selected. This can be advantageous based on the assumption that the majority of specific regions - for example high-intensity regions - of the image or processing result belong to a single object.
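The largest-area selection criterion mentioned above can be sketched in a few lines. The box format `(left, top, right, bottom)` and the example coordinates are assumptions for illustration.

```python
# Illustrative sketch: select the detected object whose bounding box is largest.

def box_area(box):
    """Area of a box given as (left, top, right, bottom) in pixel coordinates."""
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)


def select_largest(boxes):
    """Return the bounding box with the largest area."""
    return max(boxes, key=box_area)


detected_boxes = [
    (10, 20, 30, 60),    # area 20 * 40 = 800
    (40, 15, 90, 95),    # area 50 * 80 = 4000
    (5, 5, 12, 9),       # area 7 * 4 = 28
]

selected = select_largest(detected_boxes)
print(selected)
```

Other selection criteria could be substituted by passing a different key function to `max`.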
Another aspect of the present invention besides the method described herein is an apparatus for analysing image data.
The apparatus comprises one or more artificial adapted deep neural nets, and a separate image processing unit. The apparatus is configured to process the image data by means of a respective forward pass through the one or more adapted neural nets to generate multiple processing results. The one or more adapted neural nets are adapted for a specific image analysis task from at least one pre-trained neural net and have learned multiple logically related concepts. The multiple generated processing results correspond to the multiple logically related concepts. The apparatus is further configured to select at least two of the multiple processing results corresponding to different ones of the multiple logically related concepts. The apparatus is further configured to provide the at least two selected processing results as input to the image processing unit. The image processing unit is configured to combine the at least two selected processing results in dependence on the logical relation between the concepts to which the selected processing results correspond to generate an image analysis result.
The apparatus may also comprise the at least one pre-trained neural net and/or be configured to adapt the one or more adapted neural nets from the at least one pre-trained neural net.
The apparatus may comprise a processing unit (CPU), a memory device, and an I/O system. In particular, the apparatus according to the present invention may be configured to carry out or conduct at least one embodiment of a method according to the present invention. For this purpose the apparatus may comprise a memory device or data store containing program code representing or encoding the steps of this method. The memory device or data store containing this program code may on its own also be one aspect of the present invention. The respective embodiments of the method according to the present invention as well as their respective advantages may be applied to the apparatus, the memory device or data store, and/or the program code contained therein according to the present invention as applicable and vice versa.
The apparatus may further be configured to iteratively use the image analysis result as an input for the one or more adapted neural nets to further analyse the image analysis result of a respective previous iteration. At least some steps or parts of the method may therefore be executed multiple times iteratively until a predetermined exit condition is met. If, for example, an objective of the image analysis is to detect objects, the exit condition might be met if no more objects can be detected. In this case, the iterative process may automatically stop.

Further advantages, features, and details of the present invention derive from the following description of preferred embodiments of the present invention as well as from the drawings pertaining to the present invention. The features and feature combinations previously mentioned in the description as well as the features and feature combinations mentioned in the following description of the figures and/or shown in the figures alone can be employed not only in the respectively indicated combination but also in other combinations or taken alone without leaving the scope of the invention.
In the drawings
FIG 1 schematically depicts a flow diagram illustrating a method for analysing image data using deep neural nets;
FIG 2 depicts a schematic illustrating a structure of a deep convolutional neural net which can be used to analyse images;
FIG 3 schematically depicts a first logical relation between two concepts;
FIG 4 schematically depicts a second logical relation between two concepts;
FIG 5 schematically depicts a third logical relation between two concepts;
FIG 6 depicts a first schematic illustrating a first method of combining and processing two processing results;
FIG 7 schematically depicts multiple processing steps of an image analysis using the method shown in FIG 6;
FIG 8 schematically depicts multiple processing steps of an image analysis using a slight variation of the method shown in FIG 6;
FIG 9 depicts a second schematic illustrating a second method of combining and processing two processing results;
FIG 10 schematically depicts multiple processing steps of an image analysis using the method shown in FIG 9;
FIG 11 depicts a third schematic illustrating a third method of combining and processing two processing results; and
FIG 12 schematically depicts multiple processing steps of an image analysis using the method shown in FIG 11.
In the FIGs elements that provide the same function are marked with identical reference signs.
FIG 1 schematically depicts a flow diagram 1 illustrating a method for analysing image data using deep neural nets. Below, the steps and parts of the flow diagram 1 are described with reference to FIGs 1 to 5.
An input 2 comprising at least one image to be analysed is provided or fed to adapted neural nets 3. These adapted neural nets 3 are each adapted from at least one pre-trained neural net for a specific image analysis task or a specific class. Each of the adapted neural nets 3 is adapted for a different specific image analysis task and/or has learned one or more concepts. These concepts which the adapted neural nets 3 have learned are logically related to one another. In particular, these concepts are partly redundant, meaning that each of the concepts at least partly overlaps with at least one other concept. While only two adapted neural nets 3 are shown, more adapted neural nets 3 may be used.
In the present example the adapted neural nets 3 are deep convolutional neural nets. FIG 2 schematically depicts a layer structure 15 of a deep convolutional neural net (CNN), such as the adapted neural nets 3.

The input 2 is processed by means of a respective forward pass through the adapted neural nets 3. To execute such a forward pass through the layer structure 15 the input 2 is received at an input data layer 16. The input data layer 16 is followed by five convolutional layers 17 which in turn are followed by three fully connected layers 18. The different shapes and sizes of the layers 16, 17, 18 schematically indicate different corresponding dimensions, that is, numbers of neurons and filters. The smaller squares in the input data layer 16 and the first four convolutional layers 20 to 23 indicate a respective kernel size. In the present example the input data layer 16 may have a size of 227 by 227 neurons with a kernel size of 11 by 11. The first convolutional layer 20 may have a size of 55 by 55 neurons with a thickness of 96, which indicates the number of filters in the direction of a data flow as indicated by arrow 19. The kernel size of the first convolutional layer 20 may for example be 5 by 5. The second convolutional layer 21 may have a size of 27 by 27 neurons with 256 filters. The kernel size for the second convolutional layer 21, the third convolutional layer 22 and the fourth convolutional layer 23 may all be the same at 3 by 3. The third convolutional layer 22 and the fourth convolutional layer 23 may have the same dimensions at for example 13 by 13 neurons with 384 filters each. The fifth convolutional layer 24 may have the same size at 13 by 13 neurons but only 256 filters. The first fully connected layer 25 and the second fully connected layer 26 may have 1024 filters each.

A CNN having the layer structure 15 may for example be trained to count 0 to 15 or 1 to 16 pedestrians depicted in respective images. Correspondingly, the third fully connected layer 27, that is, the output layer which acts as a classifier, comprises 16 classes for the different counts of pedestrians. As part of the adapted neural nets 3 rectified linear units (ReLUs) may be used as activation functions, while pooling and local response normalisation layers can be present after the convolutional layers 17. Dropout can be used to reduce overfitting.
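The spatial sizes quoted for the layer structure 15 (227, 55, 27, 13) are consistent with the usual output-size formula for convolution and pooling layers. The following sketch checks this under assumptions that are not stated in the description: a stride of 4 without padding for the first 11 by 11 convolution, and 3 by 3 pooling with a stride of 2 after the first and second convolutional layers, as is common in networks of this shape.

```python
# Illustrative sketch: spatial output size of a convolution or pooling layer.
# Stride, padding, and pooling parameters below are assumptions, not taken
# from the description.

def output_size(input_size, kernel, stride, padding=0):
    """Number of output positions along one spatial dimension."""
    return (input_size + 2 * padding - kernel) // stride + 1


conv1 = output_size(227, kernel=11, stride=4)    # first convolutional layer
pool1 = output_size(conv1, kernel=3, stride=2)   # assumed pooling after layer 1
pool2 = output_size(pool1, kernel=3, stride=2)   # assumed pooling after layer 2

print(conv1, pool1, pool2)
```

With these assumed parameters the computed sizes match the 55 by 55, 27 by 27, and 13 by 13 dimensions given for the convolutional layers 20 to 24.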
Referring now back to FIG 1, an output, that is, an intermediate processing result, of the forward pass of the input 2 through the first adapted neural net 4 is processed by means of a backward pass 6 through the first adapted neural net 4. Correspondingly, an output, that is, a second intermediate processing result, of the forward pass of the input 2 through the second adapted neural net 5 is further processed by means of a backward pass 7 through the second adapted neural net 5. The first adapted neural net 4 may, for example, be adapted to count exactly one pedestrian, whereas the second adapted neural net 5 may, for example, be adapted to count exactly two pedestrians. Correspondingly, training data used for tuning the first adapted neural net 4 can be automatically labelled with a count of one, while training data used for tuning the second adapted neural net 5 can be automatically labelled with a count of two.
The respective forward passes through the adapted neural nets 3 result in different representations or activations 9 at the different layers and channels of the adapted neural nets 3. The backward pass 6 and the backward pass 7 result in respective inverse representations, that is, respective images in which features causing the representations or activations 9 are emphasised.
A selection stage 8 then selects at least two of the representations or activations 9 and the processing results generated by the backward pass 6 and the backward pass 7, wherein the at least two selected processing results and/or intermediate processing results correspond to different logically related concepts. The selection or selection process can be controlled by a control signal 10 or a corresponding configuration provided to the selection stage 8. The control signal 10 can be based on a respective analysis task to be achieved for the input 2, meaning that the selection can be dependent on a respective use case or application of the described method for image analysis.
The selection can also be based on the logical relation or relations between the different concepts. FIG 3 schematically depicts a first logical relation between two concepts. Here a first example 28 of a first concept is in its entirety a subset of a first example 29 of a second concept. The first example 28 of the first concept is therefore equal to the overlap between the two concepts. If, for example, the first concept is or describes exactly one pedestrian and the second concept is or describes exactly two pedestrians, then an intersection between the first example 29 and the first example 28 is the one pedestrian described by the first example 28. A difference between the first example 29 and the first example 28, that is, the part of the first example 29 that is not contained within the first example 28, corresponds to or reveals the second pedestrian.
FIG 4 schematically depicts a second logical relation between two concepts. Here a second example 30 of a first concept partly overlaps with a second example 31 of a second concept. Even though there is only a partial overlap, both concepts have or share a nonempty intersection 32. If, for example, the first concept represents females and the second concept represents persons with glasses, then the intersection 32 represents females with glasses.

FIG 5 schematically depicts a third logical relation between two concepts. Here, a third example 33 of a first concept is completely separate or disjoint from a third example 34 of a second concept. In this third logical relation there is no overlap between the two concepts. Expressed in terms of set theory, an intersection between the third example 33 and the third example 34 therefore is an empty set. If, for example, the first concept represents dogs and the second concept represents cats, then there is no overlap between the two concepts or classes.
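The three logical relations of FIGs 3 to 5 can be expressed directly with set operations. In the sketch below the concepts are modelled as sets of hypothetical labels; the function name `relation` and the labels are illustrative assumptions.

```python
# Illustrative sketch: classify the logical relation between two concepts
# modelled as sets (labels are hypothetical).

def relation(a, b):
    """Return 'disjoint', 'subset', or 'partial overlap' for two sets."""
    if not (a & b):
        return "disjoint"
    if a <= b or b <= a:
        return "subset"
    return "partial overlap"


one_pedestrian = {"pedestrian_1"}
two_pedestrians = {"pedestrian_1", "pedestrian_2"}
females = {"anna", "beth"}
with_glasses = {"beth", "carl"}
dogs = {"rex"}
cats = {"tom"}

fig3 = relation(one_pedestrian, two_pedestrians)   # first logical relation
fig4 = relation(females, with_glasses)             # second logical relation
fig5 = relation(dogs, cats)                        # third logical relation

# The set difference reveals the second pedestrian, as described for FIG 3.
second_pedestrian = two_pedestrians - one_pedestrian
print(fig3, fig4, fig5, second_pedestrian)
```

The set difference in the last step is the same operation that the later combination examples carry out on image regions rather than on labels.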
It is also possible to have more than two concepts being used for the image analysis, wherein these multiple concepts can have multiple logical relations among them.

Again referring back to FIG 1, the selection of the adapted neural nets 3, the intermediate processing results, that is, the representations or activations 9, and/or the processing results of the backward pass 6 and/or the backward pass 7 are then provided to a post-processing stage 11. The selection may be provided in the form of multiple data streams, wherein each data stream corresponds to one selected processing result, output, representation or activation 9. The post-processing stage 11 may comprise different parts or processing steps such as a combination step 12 and a low-level image processing step 13. The combination step 12 combines the multiple data streams, meaning that information or data from different sources, such as different filters or channels of the adapted neural nets 3, is analysed and processed together. The combination step 12 can also take the logical relations between the multiple data streams or the concepts to which the multiple data streams correspond into account.

To convert the data streams and/or the combination of the data streams, that is, an output of the combination step 12, into a usable output such as a regional segmentation and/or an image complemented with one or more bounding boxes, low-level image processing may be required. This low-level image processing presently takes place in the low-level image processing step 13. The low-level image processing step 13 can be carried out by an image processing unit using an image processing algorithm. The post-processing stage 11 provides as an output a final image analysis result 14.

The post-processing stage 11 can have different internal structures corresponding to different sequences and/or combinations of the combination step 12 and the low-level image processing step 13 or parts thereof. Below, three examples for different structures of the post-processing stage 11 will be described referring to FIGs 6 to 12. The examples will be described using two data streams provided by the selection stage 8. The examples and the underlying method can, however, easily be extended to or adapted for more than two data streams.
FIG 6 schematically illustrates a first structure 35 of the post-processing stage 11. Here, a first data stream 36 from the representation space or the inverse representation space is processed using low-level image processing steps 38, which can be thought of as an instance or an example of the low-level image processing step 13. The low-level image processing steps 38 yield a first partial image analysis result 39. The first partial image analysis result 39 or a part thereof is provided as input to a combination stage 40. A second data stream 37 from the representation space or the inverse representation space is also provided to the combination stage 40 as a second input. The combination stage 40 combines the first partial image analysis result 39 and the second data stream 37. The resulting combination, that is, an output of the combination stage 40, then undergoes further low-level image processing steps 41, resulting in a second partial image analysis result 42. The partial image analysis results 39, 42 can together form the final image analysis result 14 (see FIG 1). The combination stage 40 can operate on data from the representation space and/or the inverse representation space as well as on data provided by a low-level image processing algorithm. The same is true for other variations of combination stages.
FIG 7 schematically depicts a first overview 43 of multiple processing steps of an image analysis, wherein the first structure 35 is used for post-processing. Here, an input image 44, which is an example of the input 2 (see FIG 1), is to be analysed. The input image 44 presently is a crop of a larger image and partly depicts a first pedestrian 45 and a second pedestrian 46. In this example the two pedestrians 45, 46 are closely spaced such that the first pedestrian 45 partly occludes the second pedestrian 46. A forward pass of the input image 44 through the first adapted neural net 4 followed by the subsequent backward pass 6 provides a first processing result 47, which is a deconvolved output of the backward pass 6.
In the first processing result 47 features corresponding to the first pedestrian 45 are emphasised such that a first inverse representation 48 of the first pedestrian 45 is visible. The first processing result 47 then undergoes multiple low-level image processing steps 38, which may comprise pseudo-colouring and/or highlighting of regions based on intensity as well as thresholding with a predetermined threshold value. After the thresholding step this yields an intermediate post-processing result 49 wherein a contour 50 corresponding to the first inverse representation 48 - and therefore to the first pedestrian 45 - is visible. As part of the low-level image processing steps 38 the contour 50 can be detected and a corresponding first bounding box 52 can be generated. Complementing the input image 44 with the first bounding box 52 surrounding and therefore clearly marking the first pedestrian 45 results in a first partial image analysis result 51, which is an example of the first partial image analysis result 39.
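The thresholding and bounding box generation described above can be sketched on a small intensity array. The array values, the threshold of 0.5, and the box format `(top, left, bottom, right)` are assumptions for illustration; a real implementation would typically rely on an image processing library for contour detection.

```python
# Illustrative sketch: threshold an intensity image and derive a bounding box
# around the remaining high-intensity pixels (all values hypothetical).

def threshold(image, value):
    """Binary mask: 1 where the intensity reaches the threshold, else 0."""
    return [[1 if pixel >= value else 0 for pixel in row] for row in image]


def bounding_box(mask):
    """Smallest box (top, left, bottom, right) enclosing all set pixels, or None."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


# Hypothetical deconvolved output emphasising a single object.
deconvolved = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.8, 0.9, 0.0, 0.0],
    [0.0, 0.7, 0.6, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

mask = threshold(deconvolved, 0.5)
box = bounding_box(mask)
print(box)
```

Because the deconvolved output already emphasises one object, a plain threshold is enough in this toy case to isolate the region to be boxed.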
Meanwhile, the input image 44 is also processed by means of a forward pass through the second adapted neural net 5 and the corresponding backward pass 7, yielding a second processing result 53, which is a deconvolved output of the backward pass 7. In the second processing result 53 features corresponding to both of the pedestrians 45, 46 are highlighted or emphasised such that a second inverse representation 54 of the first pedestrian 45 and a first inverse representation 55 of the second pedestrian 46 are visible. Using the combination stage 40 the second processing result 53 is combined with the first partial image analysis result 51 or a part thereof, yielding a combination result 56. In the combination result 56 an area 58 equal to the region surrounded by the first bounding box 52 is replaced with uniform pixel values such as intensity and/or colour. The area 58 therefore covers or obscures all features corresponding to the first pedestrian 45. A remainder 57 of the first inverse representation 55, that is, the parts of the inverse representation 55 outside of the area 58, remains visible in the combination result 56.
The combination result 56 then undergoes the low-level image processing steps 41, which can be essentially equal to the above-mentioned low-level image processing steps 38. After a corresponding thresholding step this yields another intermediate post-processing result 59 containing a contour 60 corresponding to the remainder 57 and therefore to the second pedestrian 46. Similar to the contour 50, the contour 60 can be detected through low-level image processing techniques resulting in the detection of the second pedestrian 46. A correspondingly generated second bounding box 62 surrounding the second pedestrian 46 can be added to the input image 44 to generate a second partial image analysis result 61. The second partial image analysis result 61 is an example of the second partial image analysis result 42.
FIG 8 schematically depicts a second overview 63 of multiple processing steps of an image analysis using a slight variation of the structure 35 for post-processing. For the sake of clarity only differences to the process as described with respect to FIG 7 will be described in further detail. In the example shown in FIG 8 the same input image 44 is processed to generate the same first partial image analysis result 51 and the same second processing result 53. Differing from the sequence in the first overview 43, however, the second processing result 53 is then combined with the intermediate post-processing result 49 to generate a different combination result 64. Here, the contour 50 or a corresponding segmented output of a low-level image processing step is used as a reference instead of the first bounding box 52. Accordingly, a different replacement area 66 equal to the parts of the intermediate post-processing result 49 surrounded by the contour 50 is filled with pixels of uniform intensity and colour. This has the advantage that the contour 50 or the corresponding segmentation can more closely match an actual area or outline of the first pedestrian 45 as compared to the first bounding box 52, resulting in fewer parts of the inverse representation 55 of the second pedestrian 46 being replaced. Therefore a detection result of the second pedestrian 46 may be improved. After filling in the replacement area 66 a remainder 65 of the first inverse representation 55 of the second pedestrian 46 remains visible in the combination result 64.
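The difference between blocking out a bounding box and blocking out a contour-shaped replacement area can be made concrete with a small sketch. The function name `block_out_mask`, the segmentation mask, and the toy values are assumptions for illustration only; how the segmentation is obtained from the contour 50 is not modelled here.

```python
from typing import List

Image = List[List[float]]
Mask = List[List[bool]]


def block_out_mask(image: Image, mask: Mask, fill: float = 0.0) -> Image:
    """Return a copy of `image` with pixels under True entries of `mask` set to `fill`."""
    return [
        [fill if covered else pixel for pixel, covered in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


# Toy second processing result. The pixel at row 0, column 1 belongs to the
# second object but lies inside the bounding box of the first object.
second_result = [
    [0.9, 0.7, 0.6],
    [0.9, 0.9, 0.0],
]
# Hypothetical L-shaped segmentation of the first object.
first_mask = [
    [True, False, False],
    [True, True, False],
]
combined = block_out_mask(second_result, first_mask)
```

With the L-shaped mask the value 0.7 at row 0, column 1 survives, whereas a bounding box spanning rows 0 to 1 and columns 0 to 1 would have replaced it as well.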
Similar to the combination result 56 the combination result 64 may also undergo the low-level image processing steps 41. After a corresponding thresholding step a contour 68 corresponding to the remainder 65 and therefore to the second pedestrian 46 becomes visible and can be detected in an intermediate post-processing result 67. Because of the different sources combined to generate the combination result 64, the contour 68 differs from the contour 60. After detecting the contour 68 a corresponding second bounding box 70 can be generated. Another second partial image analysis result 69 can then be generated by adding the second bounding box 70 to the input image 44. Corresponding to the difference between the contours 68 and 60 the second bounding box 70 can more closely match the visible parts of the second pedestrian 46 as compared to the second bounding box 62.
The intensity and/or colour values used for the area 58 and the replacement area 66 can be predetermined or they can, for example, be derived from values of one or more neighbouring pixels in the respective combination result 56, 64.
The first structure 35 can advantageously provide the first partial image analysis result 39, 51 as soon as possible after selecting the first data stream 36, in particular, possibly even before selecting the second data stream 37 and/or before executing the combination stage 40. This can be advantageous for time-sensitive and/or safety related applications since a delay can be minimised and the first partial image analysis result 39, 51 may already indicate that some action might need to be taken.
FIG 9 schematically depicts an alternative second structure 71 of the post-processing stage 11. As with the first structure 35 the first data stream 36 is again processed using the low-level image processing steps 38 to generate the first partial image analysis result 39, 51. Using the second structure 71, however, the first data stream 36 and the second data stream 37 are processed using a combination stage 72 in parallel to, that is, at the same time as the low-level image processing steps 38 are executed. An output of the combination stage 72 then undergoes low-level image processing steps 73 to generate a second partial image analysis result 74.
The second structure 71 has the advantage of providing the final image analysis result 14, which may comprise the first and second partial image analysis results 39, 74, as fast or as soon as possible without any delays. Since the raw first data stream 36 instead of the first partial image analysis result 39 is used as an input for the combination stage 72 this can, however, make the combination stage 72 more complex. Combining the two data streams 36, 37 can, in other words, be more difficult than combining the first partial image analysis result 39 and the second data stream 37.
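The data flow of the second structure 71 can be sketched with two tasks that both start from the raw data streams. This is a schematic illustration only: `low_level_steps` and `combine` are hypothetical placeholders (a fixed threshold and a clipped pixel-wise difference), and the thread pool merely demonstrates that neither task waits for the other; in CPython, threads do not by themselves speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

Image = List[List[float]]


def low_level_steps(stream: Image) -> List[List[bool]]:
    """Placeholder for the low-level steps: here only a fixed threshold."""
    return [[pixel > 0.5 for pixel in row] for row in stream]


def combine(first: Image, second: Image) -> Image:
    """Placeholder combination stage: clipped pixel-wise difference."""
    return [[max(b - a, 0.0) for a, b in zip(ra, rb)]
            for ra, rb in zip(first, second)]


first_stream = [[0.9, 0.0], [0.8, 0.0]]
second_stream = [[0.9, 0.7], [0.8, 0.9]]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Both tasks start from the raw streams, so neither waits for the other.
    first_future = pool.submit(low_level_steps, first_stream)
    combined_future = pool.submit(combine, first_stream, second_stream)
    first_partial = first_future.result()
    second_partial = low_level_steps(combined_future.result())
```

The trade-off named in the text is visible in the signature of `combine`: it receives the raw first stream rather than an already detected box or contour, so any logic for isolating the first object has to live inside the combination stage.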
FIG 10 schematically depicts a third overview 75 illustrating multiple processing steps of an image analysis using the second structure 71. As described in conjunction with FIGs 7 and 8 the input image 44 is again processed the same way to generate the first partial image analysis result 51 and the second processing result 53. In this case, however, the first and second processing results 47, 53, that is, the deconvolved outputs of the backward passes 6, 7, are normalised and then combined in the combination stage 72. To generate a combination result 76 a logical difference between the first and second processing results 47, 53 after normalisation is taken. This can, for example, mean that the first processing result 47 is subtracted from the second processing result 53. In the combination result 76 only a remainder 77 corresponding to the first inverse representation 55 which is not expressed or visible in the first processing result 47 remains.
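The normalise-then-subtract combination can be sketched as follows. The description does not specify the normalisation, so min-max scaling to the range 0 to 1 is an assumption, as is clipping negative differences to zero; the names `normalise` and `difference` and the toy data are likewise hypothetical.

```python
from typing import List

Image = List[List[float]]


def normalise(image: Image) -> Image:
    """Min-max scale an image to [0, 1]; a constant image maps to all zeros."""
    flat = [pixel for row in image for pixel in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in image]
    return [[(pixel - lo) / (hi - lo) for pixel in row] for row in image]


def difference(second: Image, first: Image) -> Image:
    """Subtract `first` from `second` pixel-wise, clipping negatives to zero."""
    return [[max(b - a, 0.0) for a, b in zip(ra, rb)]
            for ra, rb in zip(first, second)]


# Toy processing results on different intensity scales.
first_result = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0]]   # one object, bottom left
second_result = [[0.0, 0.0, 1.0], [2.0, 0.0, 2.0]]  # two objects
remainder = difference(normalise(second_result), normalise(first_result))
```

Normalising first matters because the two results come from different nets and need not share an intensity scale; here the object at the bottom left cancels only after both images are scaled to the same range.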
After a thresholding step executed as part of the low-level image processing steps 73 applied to the combination result 76 an intermediate post-processing result 78 is obtained. In this intermediate post-processing result 78 a contour 79 is visible and can be detected. The contour 79 corresponds to the second pedestrian 46 but differs from the contour 68.
FIG 11 schematically depicts an alternative third structure 82 of the post-processing stage 11. As with the first and second structures 35, 71 the low-level image processing steps 38 are applied to the first data stream 36 to generate the first partial image analysis result 39. As with the second structure 71 the data streams 36, 37 are simultaneously combined using a combination stage 83. While the first and second data streams 36, 37 are being combined, processed, or analysed together in the combination stage 83, however, the low-level image processing steps 38 may already be completed. In this case, the first partial image analysis result 39 or a part thereof is used as an additional input for the combination stage 83 alongside the data streams 36, 37. This means that the combination stage 83 can start processing or combining the data streams 36, 37 as soon as possible but can also take advantage of results obtained while the data streams 36, 37 are still being processed in the combination stage 83.
As with the structures 35, 71 low-level image processing steps 84 are then applied to an output of the combination stage 83 to generate a second partial image analysis result 85. The third structure 82 can therefore be regarded as a combination of the first structure 35 and the second structure 71 implementing the respective advantages of both of these structures 35, 71.
FIG 12 schematically depicts a fourth overview 86 of multiple processing steps of an image analysis using the third structure 82 for the post-processing stage 11. Again, the input image 44 is processed to generate the first partial image analysis result 51 as well as the second processing result 53. In this case, however, in addition to the first processing result 47 and the second processing result 53 the first partial image analysis result 51 or a part thereof is also used to generate a combination result 87. Here, the first and second processing results 47, 53 can be combined as described with reference to FIG 10 to reliably cover or obscure any leftover features of the second inverse representation 54, that is, features corresponding to the first pedestrian 45. An area equal to the region surrounded by the first bounding box 52 in the first partial image analysis result 51 can be blocked out, that is, be replaced or filled with uniform pixel values. It is also possible to mask an area of the first processing result 47 equal to the region of the first bounding box 52 before taking the difference between the thus modified first processing result 47 and the second processing result 53 to generate the combination result 87.
In any case, a remainder 88 corresponding to the second pedestrian 46 remains visible in the combination result 87. As described with respect to FIGs 7, 8, and 10 low-level image processing steps 84 can then be applied to the combination result 87, which after thresholding yields an intermediate post-processing result 89 containing a contour 90 corresponding to the second pedestrian 46. The contour 90 differs from the contours 60, 68, and 79 as a result of the difference between the structures 35, 71, and 82. After detecting the contour 90 a corresponding second bounding box 92 can be generated. Adding the second bounding box 92 to the input image 44 then results in a second partial image analysis result 91.
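One possible reading of the combination used with the third structure 82 is sketched below: the pixel-wise difference of the two processing results is taken, and the region of the first bounding box is additionally blocked out. The function name `combine_with_box`, the fill value, the box coordinates, and the toy data are assumptions; the inputs are taken to be already normalised, and the alternative of masking the first processing result before subtraction is not shown.

```python
from typing import List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def combine_with_box(first: Image, second: Image, box: Box,
                     fill: float = 0.0) -> Image:
    """Clipped difference `second - first`, with the region of `box` set to `fill`."""
    top, left, bottom, right = box
    return [
        [fill if top <= r <= bottom and left <= c <= right else max(b - a, 0.0)
         for c, (a, b) in enumerate(zip(row_a, row_b))]
        for r, (row_a, row_b) in enumerate(zip(first, second))
    ]


# Toy, already normalised processing results. The first object is expressed
# more weakly in `first` than in `second` at row 1, column 0.
first = [[1.0, 0.0, 0.0], [0.5, 0.0, 0.0]]
second = [[1.0, 0.0, 0.5], [1.0, 0.0, 1.0]]
first_box = (0, 0, 1, 0)  # hypothetical box of the first object

combined = combine_with_box(first, second, first_box)
```

A plain difference would leave a leftover value of 0.5 at row 1, column 0, which belongs to the first object; blocking out the box removes it so that only the region of the second object remains.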
In the examples described above, in particular with respect to FIGs 6 to 13, the first adapted neural net 4 used to generate the first processing result 47 can be adapted to count exactly one object, that is, to count exactly the one first pedestrian 45. This corresponds to the first concept of 'exactly one object', that is, of 'one pedestrian'. The second adapted neural net 5 used to generate the second processing result 53 can be adapted to count exactly two objects, that is, to count exactly the two pedestrians 45, 46. This corresponds to a second concept of 'exactly two objects', that is, of 'exactly two pedestrians'. The described examples therefore illustrate how processing in representation space and inverse representation space of a deep neural network as well as using multiple learned representations of concepts with at least partial overlap or redundancy among the concepts, that is, a relation among the learned concepts, can be used to analyse image data.
Claims
1. Method (1) for analysing image data (2, 44), comprising the following steps:
- providing one or more pre-trained artificial deep neural nets,
- adapting each of the one or more pre-trained neural nets for at least one specific image analysis task,
- processing the image data (2, 44) by means of a forward pass through the one or more adapted neural nets (3, 4, 5) that have learned multiple logically related concepts, to generate multiple processing results (47, 53) corresponding to the concepts,
- selecting at least two of the multiple processing results (47, 53) corresponding to different ones of the multiple logically related concepts,
- combining the at least two selected processing results (47, 53) in dependence of the logical relation between the concepts to which the selected processing results (47, 53) correspond to generate an image analysis result (14).
2. Method (1) in accordance with claim 1, characterised in that the selected processing results (47, 53) are taken from different activations (9) of the layers (17, 18), in particular of filters, of the one or more adapted neural nets (3, 4, 5), such that combining the selected processing results (47, 53) comprises processing in a representation space.
3. Method (1) in accordance with any of the preceding claims, characterised in that the selected processing results (47, 53) are selected such that the corresponding different logically related concepts have a nonempty intersection (32).
4. Method (1) in accordance with any of the preceding claims, characterised in that the selected processing results (47, 53) are combined by using logical operators corresponding to the logical relation between the concepts to which the selected processing results (47, 53) correspond.
5. Method (1) in accordance with any of the preceding claims, characterised in that at least one of the selected processing results (47; 53) is processed (13, 38, 41, 73, 84) using an image processing algorithm before or after combining the at least two selected processing results (47, 53).
6. Method (1) in accordance with claim 5, characterised in that processing the at least one selected processing result (47, 53) using the image processing algorithm comprises pseudo-colouring, and/or highlighting regions based on intensity, and/or thresholding, and/or contour detection, and/or generating a bounding box (52, 62, 70, 80, 91), in particular a bounding box (52, 62, 70, 80, 91) surrounding a detected contour (50, 60, 68, 79, 90) or object (45, 46).
7. Method (1) in accordance with any of the preceding claims, characterised in that at least a part of the image data (2, 44) is processed by means of the forward pass and a subsequent backward pass (6, 7) through the one or more adapted neural nets (3, 4, 5) to generate the multiple processing results (47, 53).
8. Method (1) in accordance with claim 7, characterised in that a transpose of weights of the one or more adapted neural nets (3, 4, 5) is used for processing the respective image data (2, 44) by means of the backward pass (6, 7).
9. Method (1) in accordance with any of the preceding claims, characterised in that one or more deep convolutional neural nets (15), and/or deep feedforward neural nets, and/or deep recurrent neural nets are used as the one or more deep neural nets (3, 4, 5).
10. Method (1) in accordance with any of the preceding claims, characterised in that the one or more pre-trained neural nets are pre-trained for counting objects (45, 46) in images (2, 44).
11. Method (1) in accordance with any of the preceding claims, characterised in that
- multiple differently adapted neural nets (3, 4, 5) are created by adapting the one pre-trained neural net for different concepts, in particular by adapting the one pre-trained neural net to count different numbers of objects (45, 46), and
- the differently adapted neural nets (3, 4, 5) are used to generate the multiple processing results (47, 53).
12. Method (1) in accordance with any of the preceding claims, characterised in that at least one object (45, 46) is detected in the selected at least two processing results (47, 53) and/or the combination (56, 64, 76, 87) of the selected at least two processing results (47, 53) by means of an image processing algorithm to generate the image analysis result (14).
13. Method (1) in accordance with claim 12, characterised in that the at least one object (45, 46) is detected using at least one predetermined optimisation criterion, in particular using at least one predetermined constraint for a boundary smoothness, and/or for an area of the processing results (47, 53) and/or the combination (56, 64, 76, 87) of the processing results (47, 53).
14. Method (1) in accordance with any of the claims 12 and 13, characterised in that the at least one object (45, 46) is detected by treating pixels of the processing results (47, 53) and/or the combination (56, 64, 76, 87) of the processing results (47, 53) as a Markov random field and using a predetermined constraint on a gradient of intensities.
15. Apparatus for analysing image data, comprising one or more artificial adapted deep neural nets (3, 4, 5), and a separate image processing unit (11), wherein the apparatus is configured to
- process the image data (2, 44) by means of a respective forward pass through the one or more adapted neural nets (3, 4, 5) to generate multiple processing results (47, 53), wherein
- the one or more adapted neural nets (3, 4, 5) are adapted for a specific image analysis task from at least one pre-trained neural net, and have learned multiple logically related concepts, and
- the multiple processing results (47, 53) correspond to the multiple logically related concepts,
- select at least two of the multiple processing results (47, 53) corresponding to different ones of the multiple logically related concepts,
- provide the at least two selected processing results (47, 53) as input to the image processing unit (11),
wherein the image processing unit (11) is configured to combine the at least two selected processing results (47, 53) in dependence of the logical relation between the concepts to which the selected processing results (47, 53) correspond to generate an image analysis result (14).
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IN201741019627 | 2017-06-05 | ||
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018224447A1 (en) | 2018-12-13 |
Family
ID=62684749
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2018/064644 WO2018224447A1 (en) | 2017-06-05 | 2018-06-04 | Method and apparatus for analysing image data |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2018224447A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110874595A (en) * | 2019-10-22 | 2020-03-10 | 杭州效准智能科技有限公司 | Multi-dish dinner plate intelligent segmentation method based on deep learning |
US11176422B2 (en) | 2019-08-08 | 2021-11-16 | Robert Bosch Gmbh | Apparatus and system for unsupervised disentangled representation learning with a residual variational autoencoder |
-
2018
- 2018-06-04 WO PCT/EP2018/064644 patent/WO2018224447A1/en active Application Filing
Non-Patent Citations (6)
Title |
---|
"Medical image computing and computer-assisted intervention - MICCAI 2015 : 18th international conference, Munich, Germany, October 5-9, 2015; proceedings", vol. 9004, 1 January 2015, SPRINGER INTERNATIONAL PUBLISHING, Cham, ISBN: 978-3-540-70543-7, ISSN: 0302-9743, article MARCEL SIMON ET AL: "Part Detector Discovery in Deep Convolutional Neural Networks", pages: 162 - 177, XP055480906, 032548, DOI: 10.1007/978-3-319-16808-1_12 * |
IONUT SORODOC ET AL: "Pay Attention to Those Sets! Learning Quantification from Images", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 10 April 2017 (2017-04-10), XP080761939 * |
LI-JIA LI ET AL: "Objects as Attributes for Scene Classification", 10 September 2010, TRENDS AND TOPICS IN COMPUTER VISION, SPRINGER BERLIN HEIDELBERG, BERLIN, HEIDELBERG, PAGE(S) 57 - 69, ISBN: 978-3-642-35748-0, XP047006618 * |
MARIAN GEORGE ET AL: "Semantic Clustering for Robust Fine-Grained Scene Recognition", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 26 July 2016 (2016-07-26), XP080714921 * |
SEGUI SANTI ET AL: "Learning to count with deep object features", 2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW), IEEE, 7 June 2015 (2015-06-07), pages 90 - 96, XP032795536, DOI: 10.1109/CVPRW.2015.7301276 * |
ZHANG JIANMING ET AL: "Salient Object Subitizing", 2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 7 June 2015 (2015-06-07), pages 4045 - 4054, XP032793858, DOI: 10.1109/CVPR.2015.7299031 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18732681; Country of ref document: EP; Kind code of ref document: A1 |
 | DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 18732681; Country of ref document: EP; Kind code of ref document: A1 |