WO2023117534A1

WO2023117534A1 - Image compression by means of artificial neural networks

Info

Publication number: WO2023117534A1
Application number: PCT/EP2022/085363
Authority: WO
Inventors: Chup Chung Wong; Senthil YOGAMANI
Original assignee: Connaught Electronics Ltd.
Priority date: 2021-12-20
Filing date: 2022-12-12
Publication date: 2023-06-29
Also published as: DE102021133878A1

Abstract

According to a computer-implemented method for image compression, a compressed image (13) is generated by applying a compression module (14a) of an artificial neural network (14), which is trained for image compression, to input data, which comprises an input image (12) or depends on the input image (12). A further artificial neural network (15), which is trained for carrying out at least one computer vision task and comprises a first hidden layer (19a, 19b, 19c), is applied to the input image (12). The input data comprises an output of the first hidden layer (19a, 19b, 19c) or depends on the output of the first hidden layer (19a, 19b, 19c).

Description

Image compression by means of artificial neural networks

The present invention is directed to a computer-implemented method for image compression, wherein a compressed image is generated by applying a compression module of an artificial neural network, which is trained for image compression, to input data, which comprises an input image or depends on the input image. The invention is further directed to a corresponding computer-implemented training method, to a method for guiding a vehicle at least in part automatically, to a data processing system, to an electronic vehicle guidance system for a vehicle and to a computer program product.

Image compression techniques, such as JPEG, H.264/AVC and H.265/HEVC, may be employed to reduce the image size for storage and streaming. However these are lossy compression methods, which may introduce artifacts and reduce image details, which may lead to performance problems when the decompressed images are used for computer vision tasks. Furthermore, the artifacts and reduced image details are visually observable by human observers. Both problems arise, for example, in the context of automotive applications, in particular when computer vision tasks are involved for autonomous or semi-autonomous driving or driver assistance functions, but also when displaying images to a driver or user of the vehicle, for example for semi-autonomous driving or driver assistance functions. Similar problems arise, however, in other fields of application, where the decompressed images are used for computer vision tasks, as well.

It is an objective of the present invention to provide a possibility for image compression, wherein the performance or reliability of a computer vision task, which is carried out based on the compressed image, is improved, in particular, failures of the computer vision task resulting from artifacts due to the image compression, are avoided.

This objective is achieved by the respective subject matter of the independent claims. Further implementations and preferred embodiments are subject matter of the dependent claims.

The invention is based on the idea to provide a combination of two artificial neural networks, wherein one of the artificial neural networks is trained for image compression and the other artificial neural network is trained for computer vision. In order to perform the image compression, the artificial neural network for image compression uses an output of a hidden layer of the artificial neural network for computer vision. In particular, the image compression by the artificial neural network for image compression is guided by the output of the hidden layer of the artificial neural network for computer vision.

According to an aspect of the invention, a computer-implemented method for image compression is provided. Therein, a compressed image is generated by applying a compression module of an artificial neural network, which is trained for image compression, to input data, wherein the input data comprises an input image or depends on the input image. A further artificial neural network, which is trained for carrying out at least one computer vision task and which comprises a first hidden layer, is applied to the input image, in particular to carry out the at least one computer vision task. The input data to which the compression module is applied comprises an output of the first hidden layer or depends on the output of the first hidden layer.

Unless stated otherwise, all steps of the computer-implemented method may be performed by at least one computing unit. In particular, a data processing apparatus comprising at least one processor configured or adapted to perform a computer- implemented method according to the invention may perform the steps of the computer- implemented method. For this purpose, the data processing apparatus, which may correspond to the at least one computing unit, may in particular store a computer program comprising instructions which, when executed by the data processing device, in particular the at least one processor, cause the data processing device to execute the computer- implemented method.

In case the input data does not comprise the input image but depends on the input image, the input data comprises a dataset, in particular a two-dimensional dataset, which is generated depending on the input image, for example by the further artificial neural network. In particular, the dataset depending on the input image may correspond to an intermediate result of the further artificial neural network, which is produced when carrying out the at least one computer vision task. For example, the further artificial neural network may comprise an encoder module for encoding features of the input image and one or more decoder modules for carrying out the at least one computer vision task. The dataset depending on the input image, which is part of the input data for the artificial neural network, may then for example comprise an output of the encoder module or an output of a layer, in particular an intermediate layer, of the encoder module. The artificial neural network may also be denoted as compression network, even though it may comprise further modules apart from the compression module, for example a decompression module. The further neural network may be denoted as computer vision network. The compression network and the computer vision network may be considered as respective parts of a superordinate artificial neural network. For example, the superordinate artificial neural network may be trained in an end-to-end fashion such that the computer vision network and the compression network are trained commonly.

In general, an artificial neural network may be considered to comprise an input layer, an output layer and optionally one or more hidden layers, also denoted as intra-layers. In case the number of hidden layers is one or more, the artificial neural network is sometimes denoted as deep neural network, DNN. In particular, in a sequence of layers of an artificial neural network, the hidden layers lay between the input layer and the output layer of that network. The computer vision network is, in particular, implemented as a convolutional neural network, CNN. The compression network may also be implemented as a CNN. Since the computer vision network comprises the first hidden layer, it is a DNN. The compression network may also be implemented as a DNN.

Apart from the compression module, the compression network may also comprise a decompression module. The decompression module may be applied to the compressed image to generate a reconstructed or decompressed image. However, the step of applying the decompression module to the compressed image is not necessarily a part of the computer-implemented method for image compression. Rather, the compressed image may be stored, in particular for later usage. For example, the later usage may comprise applying the decompression module to the compressed image.

The decompressed image may be used for different purposes. On the one hand, the decompressed image may be used to carry out the at least one computer vision task, for example by applying the computer vision network to the decompressed image. The purpose of this may for example be error analysis of the computer vision network or an analysis of the performance of the computer vision network, for example in order to increase the performance of the computer vision network by adapting its architecture.

However, the decompressed image may, in addition or alternatively, also be provided to a human user or human observer. For example, in the context of tele-operated driving, a human observer may analyze the decompressed image to reconstruct the reasons for an accident et cetera. By means of the invention, an optimal tradeoff between small digital image size of the compressed image and a level of reserved details specifically optimized for computer vision tasks, but also for different applications such as long term storage, streaming and visualization purposes, is achieved by using the trained compression network, which explicitly takes into account the output of the first hidden layer of the computer vision network for the compression. In particular, the compression is guided by the output of the first hidden layers and, optionally, by the output of further hidden layers of the computer vision network to enhance or reduce the compression of different regions in the input image depending on their relevance for the at least one computer vision task. Since the image compression is guided according to the interests of the computer vision network, level of details of different regions in the input image can be retained specifically for the at least one computer vision task. Therefore, the reconstruction of the input image via decompression of the compressed image can be used for both visualization and computer vision applications with an improved performance or, in other words, with an improved reliability or reduced failure rate.

In several implementations, the compression module comprises a cascade of compression blocks with a progressively reduced resolution. Each of the compression blocks consists of one or more convolution layers, for example three convolution layers, at an input followed by a weighted downsampling layer, which reduces the resolution by a factor determined by a predefined compression ratio and a total number of the compression blocks.

For example, integer downsampling, for example by a factor of two, is efficient but fractional scaling may be supported as well in the weighted downsampling layer. The input feature volume and the output feature volume may be divided into identical number of tiles, wherein the tile sizes are different in input and output as resolutions are different.

Each output tile feature map may be a linear combination of the corresponding input tile and its eight connected neighbors. The neighborhood links establish redundancy and thus achieve compression. A sigmoidal non-linearity may be provided at the end of the weighted downsampling layer followed by one or more convolution layers, for example two convolution layers.

This may be progressively repeated in each of the compression blocks. For example, if the compression ratio needs a reduction in resolution by a factor of eight and there are three compression blocks, downsampling is may be equally distributed in the three blocks, wherein the downsampling factor in each block is two.

The weights in the compression module may be trained by auxiliary decoding maximizing the information content needed for a particular computer vision task. Thus the compression is tuned for the targeted computer vision application and it is not a generic compression scheme.

The decompression module may be designed as a symmetric inverse of the compression module, wherein weighted upsampling layers are used instead of the downsampling layers. For example, the compression and decompression module may be designed according to an encoder-decoder architecture.

According to several implementations of the computer-implemented method, a set of encoded features is generated by applying an encoder module of the further artificial neural network to the input image. A decoder module of the further artificial neural network is applied to the set of encoded features, wherein the decoder module is trained to carry out the at least one computer vision task and the decoder module comprises the first hidden layer.

In such implementations, the output of the hidden layer carries specific information regarding the actual computer vision task to be carried out. Therefore, the compression module is guided in a way that is specifically adapted for this computer vision task and not only for computer vision in general. The set of encoded features may also be denoted as the output of the encoder module.

The decoder module may comprise one or more decoder sub-modules, wherein each decoder sub-module is trained for carrying out a respective computer vision task of the at least one computer vision task. In other words, the decoder module may be a single decoder for carrying out exactly one computer vision task or it may be a multi-task decoder with two or more decoder sub-modules, which use the same set of encoder features for carrying out different computer vision tasks. The output of the decoder module or, in other words, the respective outputs of the decoder sub-modules, may be considered as the respective outputs of the computer vision network.

The content of the output depends on the corresponding computer vision task. For example, if the computer vision task is an object detection task, the output may comprise a description of a region of interest, ROI, also denoted as bounding box, for an object in the input image, wherein the ROI specifies the location and/or orientation of the object in the input image or the corresponding scenario. The output may also comprise an object class for the object and/or a confidence level for the object class and/or for one or more further object classes. In case of a semantic segmentation task, the output may comprise a respective pixel-level class for each pixel of the input image, wherein each pixel level class defines a type of object to which the respective pixel belongs with a certain probability or confidence. However, several other computer vision tasks are known and yield respective outputs. In particular, the architecture of the computer vision network may be designed according to a known architecture for one or more computer vision tasks.

According to several implementations, the input data comprises the set of encoded features and, in particular, does not comprise the input image. Using the encoded representation of the input image, the performance of the compression and/or decompression may be enhanced.

According to several implementations, a first decoder sub-module of the decoder module, wherein the first decoder sub-module is trained for carrying out a first computer vision task of the at least one computer vision task, is applied to the set of encoded features, wherein the first decoder sub-module comprises the first hidden layer. A second decoder submodule of the decoder module, wherein the second decoder sub-module is trained for carrying out a second computer vision task of the at least one computer vision task, is applied to the set of encoded features, wherein the second decoder sub-module comprises a second hidden layer of the further artificial neural network. Therein, the second computer vision task is, in particular, different from the first computer vision task. The input data to which the compression module is applied comprises an output of the second hidden layer or depends on the output of the second hidden layer.

Consequently, the compression is guided according to the interests of the first and second computer vision task at the same time. Therefore, the compressed image and, in particular, the decompressed image, is particularly suitable for carrying out different computer vision tasks, which increases the flexibility of the method or enables the application of the method without requiring multiple different compressed images for different computer vision tasks.

According to several implementations, the further artificial neural network, in particular the first decoder sub-module, comprises a further first hidden layer and the input data comprises an output of the further first hidden layer or depends on the output of the further first hidden layer.

According to several implementations, the first decoder sub-module comprises the further first hidden layer of the further artificial neural network and the input data comprises an output of the further first hidden layer or depends on the output of the further first hidden layer.

Therefore, the compression is guided by different levels of the decoder process, which allows for an improved adaptation of the compression to the first computer vision task.

According to several implementations, the further artificial neural network, in particular the second decoder sub-module, comprises a further second hidden layer of the further artificial neural network and the input data comprises an output of the further second hidden layer or depends on the output of the further second hidden layer.

According to several implementations, the output of the first hidden layer is downsampled and normalized, in particular by a downsampling and normalization module of the superordinate neural network, for example of the compression network or the computer vision network. The input data comprises the downsampled and normalized output of the first hidden layer.

In this way, the dimensions of the output of the first hidden layer and/or the further hidden layers, are adapted to an input format, which may be directly processed by the compression module.

According to several implementations, the output of the second hidden layer is downsampled and normalized, in particular by the downsampling and normalization module. The input data comprises the downsampled and normalized output of the second hidden layer.

According to a further aspect of the invention, a computer-implemented training method for training an artificial neural network for image compression and a further artificial neural network for carrying out at least one computer vision task is provided. A compressed training image is generated by applying a compression module of the artificial neural network to training input data, wherein the training input data comprises a training image or depends on the training image and the training input data comprises an output of a first hidden layer of the further artificial neural network or depends on the output of the first hidden layer. A decompressed training image is generated by applying a decompression module of the artificial neural network to the compressed training image and at least one loss function is evaluated depending on the decompressed training image. The further artificial neural network is applied to the training image and at least one further loss function is evaluated depending on an output of the further artificial neural network. Network parameters of the artificial neural network and further network parameters of the further artificial neural network are adapted depending on a result of the evaluation of the at least one loss function and the at least one further loss function.

Therein, adapting the network parameters may be understood to correspond to a training step for training the artificial neural network and the further artificial neural network in a training epoch. The adaptation of the parameters is for example carried out to minimize the at least one loss function and the at least one further loss functions or to minimize a common loss function, which depends on the at least one loss function and the at least one further loss function, over the course of several training epochs. In particular, a plurality of training images may be used in the described way for the different training epochs. Evaluating a loss function may be understood as computing a value of the loss function.

The network parameters and the further network parameters comprise, in particular, weighting parameters and/or bias parameters of the artificial neural network and the further artificial neural network, respectively, for example of an encoder module of the artificial neural network, a decoder module of the artificial neural network, the compression module and/or the decompression module.

According to several implementations of the computer-implemented training method, a common loss function comprising a combination, in particular a sum, of the at least one loss function and the at least one further loss function is evaluated and the network parameters and the further network parameters are adapted depending on a result of the evaluation of the common loss function, in particular in an end-to-end fashion.

The combination of the at least one loss function and the at least one further loss function may, in particular, comprise a sum of all loss functions of the at least one loss function and all further loss functions of the at least one further loss functions. In this way, an end-to-end training is achieved, which results in a particularly close relation between the trained artificial neural network and the trained further artificial neural network and consequently in a particularly good suitability of the compressed image for the at least one computer vision task.

According to several implementations of the computer-implemented training method, the at least one loss function comprises a first loss function, wherein the at least one further loss function also comprises the first loss function.

In other words, the first loss function is a loss function for training or assessing the corresponding computer vision task carried out by the further artificial neural network. In this way, the performance of the compression module may be directly adapted to the computer vision task.

According to several implementations, the at least one loss function comprises a second loss function, which depends on a total reconstruction error of the decompressed training image with respect to the training input image.

The second loss function is, in particular, a loss function for evaluating the quality of the decompressed training image or the compression for human vision purposes.

In particular, the total reconstruction error E may be given by

wherein a^ are elements of an MxN matrix A comprising the pixel values of the training input image and by are elements of an MxN matrix B comprising the pixel values of the decompressed training input image.

The second loss function may be given directly by the total reconstruction error or by a quantity computed depending on the total reconstruction error, such as a mean square error, MSE, wherein MSE = E/(M*N), a signal to noise ratio, SNR, or a peak signal to noise ratio, PSNR, wherein

Therein, wherein V denotes the maximum possible value of the pixel values, for example V=255 for an Sbit image.

According to several implementations of the computer-implemented method for image compression, the artificial neural network and the further artificial network are trained or have been trained by using a computer-implemented training method according to the invention.

Further implementations of the computer-implemented training method follow directly from the various implementations of the computer-implemented method for image compression and vice versa.

According to a further aspect of the invention, a computer-implemented method for image compression and decompression is provided. The method comprises an implementation of a computer-implemented method for image compression and an additional method step, wherein the decompression module of the artificial neural network is applied to the compressed image to generate a decompressed image.

For example, the compressed image may be stored to a memory device and the decompression may be carried out at a later point after the storage. For example, the decompressed image may be stored to the memory device or a further memory device.

According to a further aspect of the invention, a computer-implemented method for image compression and computer vision is provided. The method comprises an implementation of a computer-implemented method for image compression according to the invention, wherein the at least one computer vision task is carried out by applying the further artificial neural network to the input image.

According to a further aspect of the invention, a method for guiding a vehicle at least in part automatically is provided. Therein, an input image, which represents and environment of the vehicle, in particular an exterior environment of the vehicle, is generated by an environmental sensor system, for example by a camera system, of the vehicle. A compressed image is generated by carrying out a computer-implemented method for image compression according to the invention, in particular by at least one processing circuit of the vehicle. The vehicle is guided at least in part automatically depending on an output, in particular a final output, of the further artificial neural network applied to the input image. The compressed image is stored, in particular to a memory device of the vehicle or to an external memory device, for example of a server computer.

The vehicle is for example guided at least in part automatically by an electronic vehicle guidance system of the vehicle, which may comprise the environmental sensor system and the at least one processing circuit.

A control unit of the electronic vehicle guidance system may generate at least one control signal for guiding the vehicle at least in part automatically depending on the output of the further artificial neural network and provide the at least one control signal to one or more actuators of the motor vehicle or the electronic vehicle guidance system, respectively. The one or more actuators may affect or carry out a longitudinal and/or lateral control of the vehicle at least in part automatically depending on the at least one control signal.

Alternatively or in addition, a warning and/or information signal and/or another output for warning and/or informing a user or driver of the vehicle, may be generated by the at least one processing circuit depending on the output of the further artificial neural network for guiding the vehicle in part automatically.

According to several implementations of the method for guiding a vehicle, the at least one computer vision task comprises an object detection task and the output of the further artificial neural network applied to the input image comprises an object class of an object depicted by the input image and/or a bounding box for the object.

According to several implementations the at least one computer vision task comprises a semantic segmentation task and the output of the further artificial neural network applied to the input image comprises a pixel-level class for each of a plurality of pixels of the input image, for example for all pixels of the input image.

According to a further aspect of the invention, a data processing system, which comprises at least one processing circuit, is provided. The at least one processing circuit is adapted to carry out a computer-implemented method for image compression according to the invention and/or a computer-implemented training method according to the invention. According to a further aspect of the invention, an electronic vehicle guidance system for a vehicle is provided. The electronic vehicle guidance system comprises at least one memory device, which stores a compression module of an artificial neural network, which is trained for image compression. The electronic vehicle guidance system comprises at least one processing circuit, which is configured to generate a compressed image by applying the compression module to input data, which comprises an input image or depends on the input image, and to store the compressed image to the at least one memory device. The at least one memory device stores a further artificial neural network, which is trained for carrying out at least one computer vision task and which comprises a first hidden layer. The at least one processing circuit is configured to apply the further artificial neural network to the input image. The electronic vehicle guidance system comprises at least one control unit, which is configured to guide the vehicle at least in part automatically depending on an output of the further artificial neural network applied to the input image. The input data comprises an output of the first hidden layer or depends on the output of the first hidden layer.

The at least one processing circuit may receive the input image directly from an environmental sensor system of the vehicle or of the electronic vehicle guidance system. Alternatively, the input image may be stored on the at least one memory device and the at least one processing circuit receives the input image from the at least one memory device.

According to several implementations of the electronic vehicle guidance system, the electronic vehicle guidance system comprises an environmental sensor system, in particular a camera system, which is configured to generate image data representing an environment of the vehicle or, in other words, an environment of the environmental sensor system, and the at least one processing circuit is configured to generate the input image depending on the image data.

The image data may for example correspond to raw image data of an imager chip of the environmental sensor system or the pre-processed raw image data. The at least one processing circuit may for example comprise an image signal processor, ISP, which is configured to generate the input image depending on the image data. The ISP may be a part of the camera system in some implementations.

Further implementations of the electronic vehicle guidance system according to the invention follow directly from the various implementations of the computer-implemented method for image compression and the various implementations of the method for guiding a vehicle at least in part automatically and vice versa. In particular, an electronic vehicle guidance system according to the invention is configured to carry out a method for guiding a vehicle at least in part automatically or carries out such a method.

According to a further aspect of the invention, a first computer program comprising first instructions is provided. When the first instructions or the first computer program, respectively, are carried out by a data processing system, in particular by a data processing system according to the invention, the first instructions cause the data processing system to carry out a computer-implemented method for image compression according to the invention.

According to a further aspect of the invention, a second computer program comprising second instructions is provided. When the second instructions or the second computer program, respectively, are carried out by an electronic vehicle guidance system according to the invention, in particular by the at least one processing circuit and/or the control unit, the second instructions cause the electronic vehicle guidance system to carry out a method for guiding a vehicle at least in part automatically according to the invention.

According to a further aspect of the invention, a computer readable storage medium is provided, which stores a first and/or a second computer program according to the invention.

The first and the second computer program as well as the computer readable storage medium may be considered as respective computer program products comprising the first and/or the second instructions.

An electronic vehicle guidance system may be understood as an electronic system, configured to guide a vehicle in a fully automated or a fully autonomous manner and, in particular, without a manual intervention or control by a driver or user of the vehicle being necessary. The vehicle carries out all required functions, such as steering maneuvers, deceleration maneuvers and/or acceleration maneuvers as well as monitoring and recording the road traffic and corresponding reactions automatically. In particular, the electronic vehicle guidance system may implement a fully automatic or fully autonomous driving mode according to level 5 of the SAE J3016 classification. An electronic vehicle guidance system may also be implemented as an advanced driver assistance system, ADAS, assisting a driver for partially automatic or partially autonomous driving. In particular, the electronic vehicle guidance system may implement a partly automatic or partly autonomous driving mode according to levels 1 to 4 of the SAE J3016 classification. Here and in the following, SAE J3016 refers to the respective standard dated June 2018.

Guiding the vehicle at least in part automatically may therefore comprise guiding the vehicle according to a fully automatic or fully autonomous driving mode according to level 5 of the SAE J3016 classification. Guiding the vehicle at least in part automatically may also comprise guiding the vehicle according to a partly automatic or partly autonomous driving mode according to levels 1 to 4 of the SAE J3016 classification.

If it is mentioned in the present disclosure that a component of the data processing system according to the invention or the electronic vehicle guidance system according to the invention, in particular the at least one processing circuit the electronic vehicle guidance system, is adapted, configured or designed to, et cetera, to perform or realize a certain function, to achieve a certain effect or to serve a certain purpose, this can be understood such that the component, beyond being usable or suitable for this function, effect or purpose in principle or theoretically, is concretely and actually capable of executing or realizing the function, achieving the effect or serving the purpose by a corresponding adaptation, programming, physical design and so on.

A computing unit may in particular be understood as a data processing device, which comprises processing circuitry. The computing unit can therefore in particular process data to perform computing operations. This may also include operations to perform indexed accesses to a data structure, for example a look-up table, LUT. The at least one processing circuitry and the control unit of the electronic vehicle guidance system can therefore also be considered as one or more computing units.

In particular, the computing unit may include one or more computers, one or more microcontrollers, and/or one or more integrated circuits, for example, one or more application-specific integrated circuits, ASIC, one or more field-programmable gate arrays, FPGA, and/or one or more systems on a chip, SoC. The computing unit may also include one or more processors, for example one or more microprocessors, one or more central processing units, CPU, one or more graphics processing units, GPU, and/or one or more signal processors, in particular one or more digital signal processors, DSP. The computing unit may also include a physical or a virtual cluster of computers or other of said units.

In various embodiments, the computing unit includes one or more hardware and/or software interfaces and/or one or more memory devices. A memory device may be implemented as a volatile data memory, for example a dynamic random access memory, DRAM, or a static random access memory, SRAM, or as a nonvolatile data memory, for example a read-only memory, ROM, a programmable read-only memory, PROM, an erasable read-only memory, EPROM, an electrically erasable readonly memory, EEPROM, a flash memory or flash EEPROM, a ferroelectric random access memory, FRAM, a magnetoresistive random access memory, MRAM, or a phase-change random access memory, PCRAM.

An artificial neural network can be understood as a software code or a compilation of several software code components, wherein the software code may comprise several software modules for different functions, for example one or more encoder modules and one or more decoder modules. An artificial neural network can be understood as a nonlinear model or algorithm that maps an input to an output, wherein the input is given by an input feature vector or an input sequence. A software module may be understood as a portion of software code functionally connected and combined to a unit. A software module may comprise or implement several processing steps and/or data structures.

Computer vision algorithms, which may also be denoted as machine vision algorithms or algorithms for automatic visual perception, may be considered as computer algorithms for performing a visual perception task automatically. A visual perception task, also denoted as computer vision task, may for example be understood as a task for extracting visual information from image data. In particular, the visual perception task may in principle be performed by a human, which is able to visually perceive an image corresponding to the image data. In the present context, however, visual perception tasks are performed automatically without requiring the support of a human.

For example, a computer vision algorithm may be understood as an image processing algorithm or an algorithm for image analysis, which is trained using machine learning and may for example be based on an artificial neural network, in particular a convolutional neural network.

For example, the computer vision algorithm may include an object detection algorithm, an obstacle detection algorithm, an object tracking algorithm, a classification algorithm, and/or a segmentation algorithm.

The output of a computer vision algorithm depends on the specific underlying visual perception task. For example, an output of an object detection algorithm may include one or more bounding boxes defining a spatial location and, optionally, orientation of one or more respective objects in the environment and/or corresponding object classes for the one or more objects. A semantic segmentation algorithm applied to a camera image may include a pixel level class for each pixel of the camera image. The pixel level classes may, for example, define a type of object the respective pixel or point belongs to.

Further features of the invention are apparent from the claims, the figures and the figure description. The features and combinations of features mentioned above in the description as well as the features and combinations of features mentioned below in the description of figures and/or shown in the figures may be comprised by the invention not only in the respective combination stated, but also in other combinations. In particular, embodiments and combinations of features, which do not have all the features of an originally formulated claim, are also comprised by the invention. Moreover, embodiments and combinations of features which go beyond or deviate from the combinations of features set forth in the recitations of the claims are comprised by the invention.

In the following, the invention will be explained in detail with reference to specific exemplary implementations and respective schematic drawings. In the drawings, identical or functionally identical elements may be denoted by the same reference signs. The description of identical or functionally identical elements is not necessarily repeated with respect to different figures.

In the figures,

Fig. 1 shows a schematic representation of a vehicle with an exemplary implementation of an electronic vehicle guidance system according to the invention;

Fig. 2 shows a schematic illustration of an exemplary implementation of a computer-implemented method for image compression according to the invention;

Fig. 3 shows a schematic illustration of a decompression according to an exemplary implementation of a computer-implemented method for compression and decompression according to the invention; Fig. 4 shows a schematic illustration of a decompression according to a further exemplary implementation of a computer-implemented method for compression and decompression according to the invention;

Fig. 5 shows a schematic illustration of an exemplary implementation of a computer-implemented training method according to the invention; and

Fig. 6 shows a schematic illustration of a further part of a further exemplary implementation of a computer-implemented method for image compression according to the invention.

Fig. 1 shows schematically a motor vehicle 1 , which comprises an exemplary implementation of an electronic vehicle guidance system 2 according to the invention.

The electronic vehicle guidance system 2 comprises a camera 4 mounted on the vehicle 1 to capture a field of view in the exterior environment of the vehicle 1 . The electronic vehicle guidance system 2 comprises at least one memory device 8, 9, for example a volatile memory 8 and a non-volatile memory 9. The volatile memory 8 may for example be implemented as a random-access memory, RAM, for example a DDR-RAM.

The electronic vehicle guidance system 2 further comprises at least one processing circuit

5, 6, 7, for example an image signal processor, ISP, 5, a computer vision and compression engine, CVCE, 6 and a central processing unit, CPU, 7. The processing circuits 5, 6, 7 may for example be part of a system on a chip, SoC, 3. It is noted, however, that the described separation of the processing circuits 5, 6, 7 is not necessarily given or may be realized differently. In particular, the functions of the processing circuits 5,

6, 7 may be carried out by different processing circuits or a different number of processing circuits or by a single processing circuit.

The electronic vehicle guidance system 2, for example the SoC 3, may further comprise a storage interface 10, which is connected to the memory devices 8, 9 and a communication interface 11 , which may also be denoted as network interface. The SoC 3, in particular the processing circuits 5, 6, 7, the memory devices 8, 9 and the interfaces 10, 11 , may be part of an electronic control unit, ECU, of the motor vehicle 1 . However, said components may also be distributed amongst two or more ECUs or other components of the motor vehicle 1 .

The camera 4 may generate a set of image data representing an environment of the motor vehicle 1 and provide the set of image data to the ISP 5, which may process the image signals, in particular Bayer patterns originating from an imager chip of the camera 4. In alternative designs, the ISP 5 may be located inside the camera 4.

The input image may then be written to the volatile memory 8 by the ISP 5 or video streams comprising a plurality of such images may be written accordingly to the volatile memory 8. The input image may then be read by the CVCE 6. The CVCE 6 stores an artificial neural network 14, which is trained for image compression and a further artificial neural network 15, which is trained for carrying out at least one computer vision task (see for example Fig. 2 and Fig. 5). In this sense, while the CVCE 6 is a processing circuit it may also be considered to comprise a memory device, which stores the neural networks 14, 15.

In particular, the at least one processing circuit 5, 6, 7 may carry out a computer- implemented method for image compression according to the invention. An exemplary implementation of such a method is illustrated in Fig. 2, which shows schematically a compression module 14a of the neural network 14, which may also be denoted as compression network, as well as the further artificial neural network 15, which may also be denoted as computer vision network.

The input image 12 is received by the CVCE 6, which is configured to apply the computer vision network 15 to the input image 12 in order to carry out the at least one computer vision task. Each of the at least one computer vision task yields a corresponding output 20a, 20b, 20c.

Furthermore, the CVCE 6 applies the compression module 14a to input data, which comprises the input image 12 or depends on the input image 12, to generate a compressed image 13. The CVCE 6 may then store the outputs 20a, 20b, 20c of the computer vision tasks and/or the compressed image 13 to the volatile memory 8. The storage interface 10 may for example read the compressed image 13 from the volatile memory 8 and store it to the non-volatile memory 9.

Alternatively or in addition, the communication interface 11 may read the compressed image 13 from the volatile memory 8 and provide it to an external device, for example a further ECU of the motor vehicle 1 or to an external server computer.

The computer vision network 15 comprises at least one hidden layer 19a, 19b, 19c. When the computer vision network 15 is applied to the input image 12, a plurality of layers including an input layer and an output layer and the one or more hidden layers 19a, 19b, 19c, which are arranged between the input layer and the output layer, are successively applied to the input image 12. A corresponding output of at least one of the hidden layers 19a, 19b, 19c is provided to the compression module 14a as an additional part of the input data to generate the compressed image 13. In particular, the compression network 14 may comprise a downsampling and normalization stage 21 , which receives the output of the at least one hidden layer 19a, 19b, 19c and downsamples and normalizes it before providing it to the compression module 14a.

The electronic vehicle guidance system 2 further comprises a control unit (not shown), which receives the output 20a, 20b, 20c of the computer vision tasks and generates control signals for one or more actuators (not shown) of the motor vehicle 1 , which may affect a lateral and/or longitudinal control of the vehicle 1 at least in part automatically depending on the control signals.

As shown in Fig. 3, in some applications, the storage interface 10 may read the compressed image 13 from the non-volatile memory 9 and provide it to a decompression module 14b of the compression network 14. The CVCE 6 applies the decompression module 14b to the compressed image 13 and generates a decompressed image based thereupon. Then, the decompressed image may for example be provided to the computer vision network 15 or to another computer vision network (not shown), which may carry out the at least one computer vision task depending on the decompressed image.

Fig. 4 shows a similar application, wherein the storage interface 10 reads the compressed image 13 from the non-volatile memory 9 and writes it to a further volatile memory 22. The communication interface 11 reads the compressed image 13 from the further volatile memory 22 and provides it to a corresponding further communication interface 11 ' of a further ECU. The further communication interface 11 ' stores the received compressed image 13 to a further volatile memory 22', which provides it to a further decompression module 14', which is applied by a further CVCE (not shown) of the further ECU to generate the decompressed image. The decompressed image may then be provided to a further computer vision network 17', which performs the at least one computer vision task accordingly.

In some implementations, the computer vision network 15 is implemented as an autoencoder CNN network, as shown for instance in Fig. 2. The computer vision network 15 comprises an encoder module 16, which receives the input image 12 and outputs a set of encoded features 18. The computer vision network 15 also comprises a decoder module 17, which receives the set of encoded features 18 to carry out the at least one computer vision task and generates the outputs 20a, 20b, 20c depending on the set of encoded features 18.

Several layers of the encoder module 16 may for example comprise a convolution operation, a maximum pooling operation and/or a non-linear operation, such as a ReLU operation. The output size of a convolution layer is typically smaller than its input. The convolutions therefore compress the input image 12. The level of compression can be different for different network designs of the encoder module 16. At the same time, information and details contained by the input image 12 are maintained for the needs of the targeted computer vision task.

In some implementations, the decoder module 17 is a multi-task decoder comprising two or more decoder sub-modules 17a, 17b, 17c, each of them being trained for another computer vision task based on the same set of encoded features 18. Each of the decoder sub-modules 17a, 17b, 17c therefore generates a respective output 20a, 20b, 20c of the computer vision network 15, for example for object detection, semantic segmentation, pedestrian detection, vehicle detection and speed limit recognition, depth estimation, motion estimation, et cetera.

The decoder module 17, for example each of the decoder sub-modules 17a, 17b, 17, comprises hidden layers 19a, 19b, 19c, which are also denotes as intra-layers 19a, 19b, 19c. At least some of the intra-layers 19a, 19b, 19c are extracted as part of the input to the compression module 14a, which may be based on a CNN as well.

Since the multiple decoder sub-modules 17a, 17b, 17 are, for example, trained to detect objects and features that are used for guiding the motor vehicle 1 at least in part automatically, the intra-layers 19a, 19b, 19c are used to guide the compression module 14a on which features are important and hence in which regions in the input image 12 are more details shall be retained. This helps to achieve an optimal tradeoff between the image compression factor and image quality in view of the use for computer vision applications.

Consequently, also the decompressed image is optimized for computer vision applications. It can therefore be used by the computer vision network 15 and/or by other CNN networks for the same or other computer vision tasks.

Furthermore, an improved level of security is obtained, since the decompression cannot be carried out by any standard decompression engine.

One exemplary application for the compressed image is an offline scenario analysis. In case a failure in operation of the at least partially automatic guidance of the vehicle 1 based on the outputs 20a, 20b, 20c, as well as for corner cases, which have not been observed during development, the stored compressed images can be decompressed and examined in a backend via a simulation that involves system behavior decisions based on an computer vision analysis.

In some implementations of the invention, image compression may be considered as an additional task in an existing multi-task perception model like OmniDet in a computationally efficient way.

In contrast to known compression algorithms, the invention may for example leverage semantic and/or geometric features from the computer vision network 15. Features from one or more, for example all, hidden decoder layers are used to directly guide the compression module 14a by normalizing the resolution of each feature map to match the compressed feature map. This feature guidance enables the explicit usage of information form the computer vision tasks to guide the compression in an application specific way. For example, pedestrians and vehicles may be important objects and the respective detection features can guide the compression to give more importance to the respective image areas and therefore compress these areas less than other, for example road and sky areas, which may be less important and can be compressed more. Motion features can enable tracking of regions and objects over time to achieve an explicit feature reuse and a higher level of compression.

The compression module 14a may be implemented as an encoder, which progressively reduces the received encoded feature set based on the required compression ratio needs. The decompression module 14b may be a decoder, which progressively reconstructs the compressed feature map to produce the decompressed image.

A computer-implemented training method according to the invention is illustrated in Fig. 5. As a compression algorithm is a necessarily a tradeoff between the compression factor and the reconstruction accuracy, specific losses 22a for image reconstruction based on computer vision tasks, such as object detection, may be used for training, as depicted in Fig. 5. Furthermore, losses 22b quantifying the quality of the decompressed image for human vision may be used.

For example, a progressive information reduction loss in the compression module 14a may be used, where the next layer feature has an auxiliary reconstruction branch 23 with a respective error calculation block 24 to ensure that important information is preserved. The focus may for example be to preserve application dependent information rather than generic image content.

Each of the computer vision tasks may have a corresponding loss 26a, 26b, 26c defined in a known manner. A common loss function may be a sum of all the losses 26a, 26b, 26c, 22a, 22b and may be evaluated to achieve a joint end-to-end training of the computer vision network 15 and the compression network 15.

Optionally, the compression may be extended by leveraging a temporal redundancy in the video stream as illustrated in Fig. 6. Several, for example three in Fig. 6, encoder streams T1 , T2, T3 may be passed to a feature fusion module 25, which may for example operate in a rolling buffer fashion to avoid storage of a large set of previous encoder frames. The feature encoder may have a predictor module, which can predict the encoder module for the next frame and then take a differential to obtain a further degree of compression. This is particularly beneficial in static or slow moving scenes. In case of static scenes with no moving objects, the differential encoder has no information. The predictor module is trained based on a reconstruction loss of the predicted and reconstructed next frame using the observed next frame as a ground truth. The loss is designed to maximize the compression factor across the video stream and capture only novel events, which are changing mimicking an event care in software.

In addition to classical lossy image or video encoders, such as JPEG, HEIC, H.264/AVC, H.265/HEVC, that aim to maximize the image or video compression ratio with minimized visual artifacts, a CNN based image compression technique as described with respect to several implementations of the invention can also minimize the loss of image details for computer vision applications. In order to achieve the three goals of best compression ratio, least visual artefacts and least loss of details for computer vision, intermediate layers of the computer vision network are used to provide information to guide the compression module.

In vehicle ECUs, computer vision and image or video compression could make use of two respective different functional blocks, which are processed separately. For instance, computer vision may aim at the highest possible detection performance with least false positives, while the compression may aim at the highest possible compression with least impact to its visual quality. However, if the compressed images or streams are used for computer vision, they may less details and introduce unwanted artifacts, resulting in reduced performance of the computer vision. In addition, the processing paths for computer vision are separated from that for image or video compression. This would result in an increased demand for processing power, number of processing blocks as well as memory size and bandwidth. Some implementation of the invention, on the other hand, enable the use of a single CNN processor, also denoted as CVCE engine, for both computer vision and compression, eliminating the need for a separate image compression engine. Hence, hardware costs and/or required computational resources are reduced.

Claims

24

Claims Computer-implemented method for image compression, wherein a compressed image (13) is generated by applying a compression module (14a) of an artificial neural network (14), which is trained for image compression, to input data, which comprises an input image (12) or depends on the input image (12), characterized in that a further artificial neural network (15), which is trained for carrying out at least one computer vision task and comprises a first hidden layer (19a, 19b, 19c), is applied to the input image (12); and the input data comprises an output of the first hidden layer (19a, 19b, 19c) or depends on the output of the first hidden layer (19a, 19b, 19c). Computer-implemented method according to claim 1 , characterized in that a set of encoded features (18) is generated by applying an encoder module (16) of the further artificial neural network (15) to the input image (12); a decoder module (17), which is trained for carrying out the at least one computer vision task, of the further artificial neural network (15) is applied to the set of encoded features (18), wherein the decoder module (17) comprises the first hidden layer (19a, 19b, 19c). Computer-implemented method according to claim 2, characterized in that a first decoder sub-module (17a, 17b, 17c) of the decoder module (17), which is trained for carrying out a first computer vision task of the at least one computer vision task, is applied to the set of encoded features (18), wherein the first decoder sub-module (17a, 17b, 17c) comprises the first hidden layer (19a, 19b, 19c); a second decoder sub-module (17a, 17b, 17c) of the decoder module (17), which is trained for carrying out a second computer vision task of the at least one computer vision task, is applied to the set of encoded features (18), wherein the second decoder sub-module (17a, 17b, 17c) comprises a second hidden layer (19a, 19b, 19c) of the further artificial neural network (15); and the input data comprises an output of the second hidden layer (19a, 19b, 19c) or depends on the output of the second hidden layer (19a, 19b, 19c). Computer-implemented method according to claim 3, characterized in that the first decoder sub-module (17a, 17b, 17c) comprises a further first hidden layer (19a, 19b, 19c) of the further artificial neural network (15) and the input data comprises an output of the further first hidden layer (19a, 19b, 19c) or depends on the output of the further first hidden layer (19a, 19b, 19c); and/or the second decoder sub-module (17a, 17b, 17c) comprises a further second hidden layer (19a, 19b, 19c) of the further artificial neural network (15) and the input data comprises an output of the further second hidden layer (19a, 19b, 19c) or depends on the output of the further second hidden layer (19a, 19b, 19c). Computer-implemented method according to one of the preceding claims, wherein the output of the first hidden layer (19a, 19b, 19c) is downsampled and normalized; and the input data comprises the downsampled and normalized output of the first hidden layer (19a, 19b, 19c). Computer-implemented training method for training an artificial neural network (14) for image compression and a further artificial neural network (15) for carrying out at least one computer vision task, which comprises a first hidden layer (19a, 19b, 19c), characterized in that a compressed training image is generated by applying a compression module (14a) of the artificial neural network (14) to training input data, wherein the training input data comprises a training image (12') or depends on the training image (12') and the training input data comprises an output of the first hidden layer (19a, 19b, 19c) or depends on the output of the first hidden layer (19a, 19b, 19c); a decompressed training image is generated by applying a decompression module (14b) of the artificial neural network (14) to the compressed training image and at least one loss function (22a, 22b) is evaluated depending on the decompressed training image; the further artificial neural network (15) is applied to the training image (12') and at least one further loss function (26a, 26b, 26c) is evaluated depending on an output of the further artificial neural network (15); network parameters of the artificial neural network (14) and further network parameters of the further artificial neural network (15) are adapted depending on a result of the evaluation of the at least one loss function (22a, 22b) and the at least one further loss function (26a, 26b, 26c).

7. Computer-implemented training method according to claim 6, characterized in that a common loss function comprising a combination of the at least one loss function (22a, 22b) and the at least one further loss function (26a, 26b, 26c) is evaluated and the network parameters and the further network parameters are adapted depending on a result of the evaluation of the common loss function.

8. Computer-implemented training method according to one of claims 6 or 7, characterized in that the at least one loss function (22a, 22b) comprises a first loss function, wherein the at least one further loss function (26a, 26b, 26c) comprises the first loss function as well; and/or the at least one loss function (22a, 22b) comprises a second loss function, which depends on a total reconstruction error of the decompressed training image with respect to the training image (12').

9. Computer-implemented method according to one of claims 1 to 5, characterized in that the artificial neural network (14) and the further artificial neural network (15) are trained by using a computer-implemented training method according to one of claims 6 to 8.

10. Method for guiding a vehicle (1 ) at least in part automatically, wherein an input image (12), which represents an environment of the vehicle (1 ), is generated by an environmental sensor system (4) of the vehicle (1 ); a compressed image (13) is generated by carrying out a computer-implemented method according to one of claims 1 to 5 or 9; 27 the vehicle (1) is guided at least in part automatically depending on an output (20a, 20b, 20c) of the further artificial neural network (15) applied to the input image (12); and the compressed image (13) is stored. Method according to claim 10, characterized in that the at least one computer vision task comprises an object detection task and the output (20a, 20b, 20c) of the further artificial neural network (15) applied to the input image (12) comprises an object class and/or a bounding box for an object depicted by the input image (12); and/or the at least one computer vision task comprises a semantic segmentation task and the output (20a, 20b, 20c) of the further artificial neural network (15) applied to the input image (12) comprises a pixel-level class for each of a plurality of pixels of the input image (12). Data processing system comprising at least one processing circuit (5, 6, 7), which is adapted to carry out a computer-implemented method according to one of claims 1 to 5 or 9. Electronic vehicle guidance system (2) for a vehicle (1 ), wherein the electronic vehicle guidance system (2) comprises at least one memory device (8, 9) storing a compression module (14a) of an artificial neural network (14), which is trained for image compression; and the electronic vehicle guidance system (2) comprises at least one processing circuit (5, 6, 7), which is configured to generate a compressed image (13) by applying the compression module (14a) to input data, which comprises an input image (12) or depends on the input image (12), and to store the compressed image (13) to the at least one memory device (8, 9); the at least one memory device (8, 9) stores a further artificial neural network (15), which is trained for carrying out at least one computer vision task and comprises a first hidden layer (19a, 19b, 19c), the at least one processing circuit (5, 6, 7) is configured to apply the further artificial neural network (15) to the input image (12); 28 the electronic vehicle guidance system (2) comprises at least one control unit, which is configured to guide the vehicle (1) at least in part automatically depending on an output (20a, 20b, 20c) of the further artificial neural network (15) applied to the input image (12); and the input data comprises an output of the first hidden layer (19a, 19b, 19c) or depends on the output of the first hidden layer (19a, 19b, 19c).

14. Electronic vehicle guidance system (2) according to claim 13, characterized in that the electronic vehicle guidance system (2) comprises an environmental sensor system (4), which is configured to generate image data representing an environment of the vehicle (1) and the at least one processing circuit (5, 6, 7) is configured to generate the input image (12) depending on the image data.

15. Computer program product comprising instructions, which, when carried out by a data processing system, cause the data processing system to carry out a computer-implemented method according to one of claims 1 to 5 or 9; or when carried out by an electronic vehicle guidance system (2) according to one of claims 13 or 14, cause the electronic vehicle guidance system (2) to carry out a method according to one of claims 10 or 11 .