WO2023082162A1 - Image processing method and device - Google Patents

Image processing method and device

Info

Publication number
WO2023082162A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature map
reconstructed image
reconstructed
standard image
Prior art date
Application number
PCT/CN2021/130201
Other languages
English (en)
French (fr)
Inventor
Lin Yongbing (林永兵)
Zhang Peike (张培科)
Ma Sha (马莎)
Wan Lei (万蕾)
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2021/130201
Publication of WO2023082162A1


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals

Definitions

  • the embodiments of the present application relate to the field of image processing, and more specifically, to an image processing method and device.
  • the camera has the characteristics of high resolution, non-contact operation, convenient use, and low cost, and is widely applied in the field of environmental perception.
  • more and more cameras are installed on vehicles to provide blind-spot-free coverage and machine vision for artificial intelligence (AI) vehicles.
  • as a result, the video output by the cameras requires more and more transmission bandwidth.
  • Figure 1 shows a schematic block diagram of an existing solution. As shown in Figure 1, the Bayer raw image or video output by the camera often has high precision and requires high transmission bandwidth.
  • the RGB image obtained through encoder compression and ISP processing is transmitted to the MDC for image reconstruction.
  • the image reconstruction is performed by the decoder.
  • the quality of the reconstructed image is of great significance to subsequent tasks such as target detection and semantic segmentation, so it is particularly important that the image reconstructed by the decoder has high quality.
  • however, with existing methods the quality of the reconstructed image cannot be guaranteed.
  • the decoder therefore needs to be optimized to ensure that the reconstructed image it outputs has high quality in practical applications.
  • an embodiment of the present application provides an image processing method, which is used to obtain the degree of distortion between the reconstructed image and the original image in order to evaluate the quality of the reconstructed image, so as to guide the optimization of an encoder and/or a decoder.
  • in a first aspect, an image processing method is provided, comprising: acquiring a reconstructed image and a standard image of the reconstructed image, where the reconstructed image is an image reconstructed from a first image and the first image is an image obtained by compressing the standard image; and inputting the reconstructed image and the standard image into a single-layer convolutional neural network to obtain the feature map of the reconstructed image and the feature map of the standard image.
  • the parameters of the single-layer convolutional neural network come from the parameters of the first convolutional layer of a pre-trained model, where the pre-trained model is a pre-trained convolutional neural network; the method further comprises obtaining the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image.
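  • as an illustration, the following is a minimal sketch of this feature-extraction step, assuming PyTorch and a torchvision ResNet-50 as the pre-trained model (the embodiment does not mandate a specific framework or model; the stride override is an assumption to keep the feature maps at full resolution, as described later):

```python
import torch
from torchvision import models

# Load a pre-trained classification model and reuse only its first
# convolutional layer as the single-layer convolutional neural network.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
single_conv = backbone.conv1      # 64 kernels of size 7x7 in ResNet-50
single_conv.stride = (1, 1)       # assumption: stride 1, so no downsampling
for p in single_conv.parameters():
    p.requires_grad = False       # the layer stays fixed; only the codec is optimized

x = torch.randn(1, 3, 256, 256)   # standard image (placeholder tensor)
y = torch.randn(1, 3, 256, 256)   # reconstructed image (placeholder tensor)
fx = single_conv(x)               # feature map of the standard image
fy = single_conv(y)               # feature map of the reconstructed image
print(fx.shape)                   # torch.Size([1, 64, 256, 256])
```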
  • since the pre-trained model is originally trained for machine tasks, using a single-layer convolutional neural network taken from the pre-trained model to extract image features can better adapt to machine vision tasks; the initial convolutional layer often extracts the underlying common features of an image.
  • the single-layer convolutional neural network in the embodiment of the present application is the first convolutional layer of the pre-trained model.
  • the image features extracted by the single-layer convolutional neural network are therefore also underlying common features.
  • a codec optimized on these features can adapt to a variety of task scenarios and improve multi-task generalization; in addition, compared with the existing technology that uses a complete neural network to process images, the embodiment of this application uses a single-layer convolutional neural network, which only needs to perform single-layer convolution calculations, reducing the computational complexity and the computing power requirements on hardware.
  • in some implementations, the single-layer convolutional neural network includes a plurality of convolution kernels; the first feature map of the reconstructed image and the first feature map of the standard image have a first weight and are obtained by a first convolution kernel; the second feature map of the reconstructed image and the second feature map of the standard image have a second weight and are obtained by a second convolution kernel; the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.
  • weights are assigned to different feature maps to achieve different effects: for example, for feature maps related to detail features, the weight can be increased appropriately to raise the importance of details, which benefits the subsequent execution of machine vision tasks.
  • the weight is determined by a normalization parameter, and the normalization parameter is a normalization parameter of the pre-trained model.
  • the weighting coefficients used for weighting can be determined manually, or can be determined according to normalization parameters, as sketched below.
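  • a sketch of the second option, assuming the pre-trained model's first batch-normalization layer follows its first convolutional layer (as in ResNet) and that each weight is the scaling coefficient divided by the normalization coefficient, i.e. w_i = γ_i/σ_i (this mapping is an assumption for illustration):

```python
import torch
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
bn1 = backbone.bn1                             # BatchNorm2d following conv1

# Assumption: weight the i-th feature map by gamma_i / sigma_i taken
# from the pre-trained normalization layer (one value per channel).
gamma = bn1.weight.detach()                    # scaling coefficients, shape [64]
sigma = torch.sqrt(bn1.running_var + bn1.eps)  # normalization coefficients, shape [64]
w = gamma / sigma                              # per-channel weight coefficients w_i
print(w.shape)                                 # torch.Size([64])
```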
  • in some implementations, obtaining the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image includes calculating it according to the following formula:

    $\mathrm{wfMSE} = \frac{1}{CHW}\sum_{i=1}^{C} w_i \left\| f_i(x) - f_i(y) \right\|_2^2$

  • where wfMSE is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, x is the standard image, y is the reconstructed image, f() is the convolution operation and i indexes the i-th convolution kernel, w_i is the weight coefficient, C is the number of feature map channels of the reconstructed image (equal to that of the standard image), H is the height of the standard image or the reconstructed image, W is the width of the standard image or the reconstructed image, and $\|\cdot\|_2^2$ sums the squared differences over the H×W spatial positions.
  • the above method is based on the existing MSE calculation and performs only a single-layer convolution operation on the image, so it is simple to compute and has a small amount of calculation; the convolution kernel parameters and weight parameters come from a machine-task-oriented pre-trained model, and optimizing the codec on this basis makes the output reconstructed image more suitable for machine vision tasks; different weights are assigned to different feature maps and the weighting coefficients can be adjusted freely, e.g. the weight of detail-related feature maps can be increased to enhance image detail and texture features.
  • since the pre-trained model is trained for machine vision tasks, taking the weight coefficients from the pre-trained model instead of designing them manually helps ensure that an encoder and/or decoder optimized accordingly performs well when facing the same kind of machine vision task. A sketch of the wfMSE computation follows.
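  • a minimal sketch of the wfMSE computation as reconstructed above, assuming feature maps fx and fy of shape [B, C, H, W] from the single-layer network and per-channel weights w (names follow the earlier sketches and are illustrative):

```python
import torch

def wf_mse(fx: torch.Tensor, fy: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Weighted feature-map MSE between the standard image's feature map fx
    and the reconstructed image's feature map fy; w holds one weight per channel."""
    # Mean squared residual per channel, averaged over batch and H x W positions.
    per_channel = ((fx - fy) ** 2).mean(dim=(0, 2, 3))  # shape [C]
    return (w * per_channel).mean()  # average over the C channels

# Example usage with random stand-ins for the feature maps:
fx = torch.randn(1, 64, 256, 256)
fy = fx + 0.01 * torch.randn_like(fx)
w = torch.ones(64)
print(wf_mse(fx, fy, w).item())
```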
  • in some implementations, the method further includes calculating the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formula:

    $\mathrm{wfSSIM} = \frac{1}{C}\sum_{i=1}^{C} w_i \cdot \frac{(2\mu_{f_i(x)}\mu_{f_i(y)} + C_1)(2\sigma_{f_i(x)f_i(y)} + C_2)}{(\mu_{f_i(x)}^2 + \mu_{f_i(y)}^2 + C_1)(\sigma_{f_i(x)}^2 + \sigma_{f_i(y)}^2 + C_2)}$

  • where wfSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, $\mu$ is the mean, $\sigma^2$ is the variance, $\sigma_{f_i(x)f_i(y)}$ is the covariance, and $C_1$ and $C_2$ are constants.
  • the degree of distortion of the reconstructed image may also be obtained based on other indicators, such as a calculation method based on the structural similarity between images. Because the structural similarity calculation uses statistics such as the mean and variance, it is easier to avoid the influence of noise (such as ringing noise), thereby obtaining more stable results, and it can effectively improve accuracy in machine tasks such as semantic segmentation. A simplified sketch follows.
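  • a simplified sketch of this structural-similarity variant, computing one global SSIM value per feature-map channel and weighting the results (practical SSIM implementations use local sliding windows, and the constants c1 and c2 below are illustrative defaults):

```python
import torch

def wf_ssim(fx: torch.Tensor, fy: torch.Tensor, w: torch.Tensor,
            c1: float = 1e-4, c2: float = 9e-4) -> torch.Tensor:
    """Weighted SSIM over feature-map channels; fx, fy have shape [1, C, H, W].
    SSIM is a similarity (higher is better); 1 - wf_ssim(...) can serve as a loss."""
    mu_x = fx.mean(dim=(2, 3))
    mu_y = fy.mean(dim=(2, 3))
    var_x = fx.var(dim=(2, 3), unbiased=False)
    var_y = fy.var(dim=(2, 3), unbiased=False)
    cov = ((fx - mu_x[..., None, None]) *
           (fy - mu_y[..., None, None])).mean(dim=(2, 3))
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return (w * ssim.squeeze(0)).mean()  # weighted average over channels

fx = torch.randn(1, 64, 256, 256)
fy = fx + 0.01 * torch.randn_like(fx)
print(wf_ssim(fx, fy, torch.ones(64)).item())
```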
  • in some implementations, the method further includes calculating the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formula:

    $\mathrm{wfMSSSIM} = \frac{1}{C}\sum_{i=1}^{C} w_i \cdot \mathrm{MSSSIM}\big(f_i(x), f_i(y)\big)$

  • where wfMSSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, MSSSIM is the multi-scale structural similarity, and $C_3$ is a constant (in the standard MSSSIM formulation, $C_3$ appears in the structure comparison term $s(a,b) = (\sigma_{ab}+C_3)/(\sigma_a\sigma_b+C_3)$).
  • the calculation method based on the structural similarity SSIM can also have other variants, such as a calculation method based on the multi-scale structural similarity MSSSIM.
  • this further enhances image details while retaining the noise resistance, so the reconstructed image output by the optimized codec is of higher quality, which is of great significance for subsequent machine tasks such as target detection.
  • in some implementations, the single-layer convolutional neural network includes a plurality of convolution kernels; a first convolution kernel is used to obtain the first feature of the reconstructed image and the first feature of the standard image, and the coefficients of the first convolution kernel have a first weight; a second convolution kernel is used to obtain the second feature of the reconstructed image and the second feature of the standard image, and the coefficients of the second convolution kernel have a second weight; the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.
  • this method uses the weighted convolution kernels to extract the features of the reconstructed image and the standard image, so that the feature maps do not need to be weighted separately, which saves computation.
  • the number of convolution kernel coefficients is much smaller than the number of image pixels (which depends on the specific resolution), so weighting the kernel coefficients requires far less computation than weighting the feature maps, especially when the image resolution is high; a sketch follows.
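  • a sketch of this kernel-coefficient weighting, assuming the per-channel weights w are folded into a copy of the single-layer convolution from the earlier sketches (scaling the i-th kernel by √w_i means the squared feature residual already carries the factor w_i):

```python
import copy
import torch

# single_conv and w are taken from the earlier sketches.
weighted_conv = copy.deepcopy(single_conv)
with torch.no_grad():
    # Scale the i-th kernel by sqrt(w_i); the squared residual of its
    # feature map is then weighted by w_i without any per-map weighting.
    weighted_conv.weight.mul_(w.sqrt()[:, None, None, None])
```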
  • in some implementations, obtaining the degree of distortion of the feature map of the reconstructed image includes calculating it according to the following formula:

    $\mathrm{wfMSE} = \frac{1}{CHW}\sum_{i=1}^{C} w_i \left\| f_i(x - y) \right\|_2^2$

  • where wfMSE is the degree of distortion of the feature map of the reconstructed image, x is the standard image, y is the reconstructed image, f() is the convolution operation and i indexes the i-th convolution kernel, w_i is the weight coefficient, C is the number of feature map channels of the reconstructed image (equal to that of the standard image), H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image; by the linearity of convolution, $f_i(x) - f_i(y) = f_i(x - y)$.
  • this method is a fast implementation of the previous MSE method.
  • the previous method performs a convolution operation on the standard image and the reconstructed image separately and then calculates the residual, so each convolution kernel requires two convolution operations; this method first calculates the residual and then convolves the residual, so each convolution kernel requires only one convolution operation, which saves computation.
  • in some implementations, the method further includes: evaluating the quality of the reconstructed image according to the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, so as to obtain an evaluation result; and optimizing, according to the evaluation result, the encoder and/or decoder used to output the reconstructed image.
  • the parameters of the encoder and/or decoder can be updated according to the degree of distortion, so that the degree of distortion of the output reconstructed image is as small as possible, thereby optimizing the encoder and/or decoder; a sketch of such an optimization loop follows.
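  • a minimal sketch of this optimization loop, assuming a differentiable codec module, a data loader of standard images, and the wf_mse helper, single_conv layer, and weights w from the earlier sketches (all names are illustrative placeholders; no particular optimizer or training setup is prescribed):

```python
import torch

# codec: any differentiable encoder/decoder module mapping an image to its
# reconstruction; loader: iterable of image batches (both are placeholders).
optimizer = torch.optim.Adam(codec.parameters(), lr=1e-4)

for x in loader:                  # x: batch of standard images
    y = codec(x)                  # reconstructed images
    # The distortion of the feature maps serves directly as the loss.
    loss = wf_mse(single_conv(x), single_conv(y), w)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```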
  • in some implementations, the feature map of the reconstructed image and the feature map of the standard image are full-resolution images.
  • the single-layer convolutional neural network used in the embodiment of the present application does not include a pooling layer, and the convolutional layer does not perform downsampling on the reconstructed image or the standard image; since no downsampling is performed, the image is not compressed, so a full-resolution feature map can be output, and optimizing the codec based on evaluation results computed at full resolution helps ensure that the reconstructed image output by the optimized codec has high quality and is more friendly to human vision.
  • in a second aspect, an image processing device is provided, which includes: an acquisition unit configured to acquire a reconstructed image and a standard image of the reconstructed image, where the reconstructed image is an image reconstructed from a first image and the first image is an image obtained by compressing the standard image; and a processing unit configured to input the reconstructed image and the standard image into a single-layer convolutional neural network to obtain the feature map of the reconstructed image and the feature map of the standard image.
  • the parameters of the single-layer convolutional neural network come from the parameters of the first convolutional layer of a pre-trained model, where the pre-trained model is a pre-trained convolutional neural network; the processing unit is further configured to obtain the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image.
  • in some implementations, the single-layer convolutional neural network includes a plurality of convolution kernels; the first feature map of the reconstructed image and the first feature map of the standard image have a first weight and are obtained by a first convolution kernel; the second feature map of the reconstructed image and the second feature map of the standard image have a second weight and are obtained by a second convolution kernel; the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.
  • the weight is determined by a normalization parameter, and the normalization parameter is a normalization parameter of the pre-trained model.
  • in some implementations, the processing unit is specifically configured to calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formula:

    $\mathrm{wfMSE} = \frac{1}{CHW}\sum_{i=1}^{C} w_i \left\| f_i(x) - f_i(y) \right\|_2^2$

  • where wfMSE is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, x is the standard image, y is the reconstructed image, f() is the convolution operation and i indexes the i-th convolution kernel, w_i is the weight coefficient, C is the number of feature map channels of the reconstructed image (equal to that of the standard image), H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.
  • in some implementations, the processing unit is further configured to calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formula:

    $\mathrm{wfSSIM} = \frac{1}{C}\sum_{i=1}^{C} w_i \cdot \frac{(2\mu_{f_i(x)}\mu_{f_i(y)} + C_1)(2\sigma_{f_i(x)f_i(y)} + C_2)}{(\mu_{f_i(x)}^2 + \mu_{f_i(y)}^2 + C_1)(\sigma_{f_i(x)}^2 + \sigma_{f_i(y)}^2 + C_2)}$

  • where wfSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, $\mu$ is the mean, $\sigma^2$ is the variance, $\sigma_{f_i(x)f_i(y)}$ is the covariance, and $C_1$ and $C_2$ are constants.
  • in some implementations, the processing unit is further configured to calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formula:

    $\mathrm{wfMSSSIM} = \frac{1}{C}\sum_{i=1}^{C} w_i \cdot \mathrm{MSSSIM}\big(f_i(x), f_i(y)\big)$

  • where wfMSSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, MSSSIM is the multi-scale structural similarity, and $C_3$ is a constant.
  • in some implementations, the single-layer convolutional neural network includes a plurality of convolution kernels; a first convolution kernel is used to obtain the first feature of the reconstructed image and the first feature of the standard image, and the coefficients of the first convolution kernel have a first weight; a second convolution kernel is used to obtain the second feature of the reconstructed image and the second feature of the standard image, and the coefficients of the second convolution kernel have a second weight; the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.
  • in some implementations, the processing unit is specifically configured to calculate the degree of distortion of the feature map of the reconstructed image according to the following formula:

    $\mathrm{wfMSE} = \frac{1}{CHW}\sum_{i=1}^{C} w_i \left\| f_i(x - y) \right\|_2^2$

  • where wfMSE is the degree of distortion of the feature map of the reconstructed image, x is the standard image, y is the reconstructed image, f() is the convolution operation and i indexes the i-th convolution kernel, w_i is the weight coefficient, C is the number of feature map channels of the reconstructed image (equal to that of the standard image), H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.
  • in some implementations, the processing unit is further configured to: evaluate the quality of the reconstructed image according to the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, so as to obtain an evaluation result; and optimize, according to the evaluation result, the encoder and/or decoder used to output the reconstructed image.
  • the feature map of the reconstructed image and the feature map of the standard image are full-resolution images.
  • in a third aspect, a computer-readable medium is provided, which stores program code for execution by a device, the program code including instructions for executing the method in any implementation of the above first aspect.
  • in a fourth aspect, a chip is provided, which includes a processor and a data interface; the processor reads, through the data interface, instructions stored in a memory to execute the method in any implementation of the above first aspect.
  • FIG. 1 is a schematic block diagram of compressing and transmitting images output by a camera according to an embodiment of the present application;
  • FIG. 2 is a schematic block diagram of images acquired by a camera being used for various tasks according to an embodiment of the present application;
  • FIG. 3 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application;
  • FIG. 4 is a schematic flowchart of an image processing method according to an embodiment of the present application;
  • FIG. 5 is a schematic diagram of a convolution operation on a standard image and a reconstructed image according to an embodiment of the present application;
  • FIG. 6 is a schematic diagram of the process of obtaining the degree of distortion of a reconstructed image according to an embodiment of the present application;
  • FIG. 7 is a schematic diagram of an application scenario of the image processing method in the optimization of an encoder and/or decoder according to an embodiment of the present application;
  • FIG. 8 is a schematic block diagram of an image processing device according to an embodiment of the present application;
  • FIG. 9 is a schematic structural diagram of an image processing apparatus 900 according to an embodiment of the present application.
  • FIG. 2 shows a schematic block diagram of the images captured by the camera in the embodiment of the present application being used for various tasks.
  • the images around the vehicle captured by the camera sensor installed on the vehicle are compressed by the encoder, or processed by the ISP to obtain RGB images; the precision of the RGB image is much lower than that of the original image output by the camera, thereby reducing the bandwidth required for network transmission.
  • the RGB image is processed by an encoder and decoder (CODEC), including compression and reconstruction of the original image, to obtain a reconstructed image, and the reconstructed image is used for target detection, semantic segmentation, and detection of traffic lights and lane lines.
  • the quality of the reconstructed image is of great significance to the performance of tasks such as target detection, semantic segmentation, and detection of traffic lights and lane lines: for example, if the quality of the reconstructed image is poor, it is difficult to accurately identify targets in the target detection task, which may bring great danger to automatic driving, so it is necessary to ensure that the reconstructed image output by the CODEC has high quality. Furthermore, a method is needed to determine the degree of distortion between the reconstructed image output by the CODEC and the original image, so as to evaluate the quality of the reconstructed image; according to the evaluation result, the CODEC can be optimized so that the optimized CODEC outputs reconstructed images of higher quality.
  • the above-mentioned tasks such as target detection, semantic segmentation, and detection of traffic lights and lane lines can be called machine vision tasks, that is, the reconstructed image is directly processed by the machine system, so the reconstructed image mainly needs to be quickly identified and detected by the machine system.
  • the existing technology proposes a variety of machine-vision-oriented evaluation indicators for evaluating the quality of reconstructed images, including the image classification evaluation index Top-1 Acc, the target detection evaluation index mAP, the semantic segmentation evaluation index mIoU, and the lane line detection evaluation index Acc, etc.
  • for example, the CODEC can be optimized using the image classification evaluation index Top-1 Acc.
  • the reconstructed image output by the CODEC optimized in this way performs well on image classification tasks, but is still not guaranteed to be effective for tasks such as object detection, semantic segmentation, and lane line detection; that is, such task-specific indicators cannot evaluate task generalization.
  • the prior art also includes evaluation indicators oriented to human vision, including the peak signal-to-noise ratio (PSNR), the multiscale structural similarity index (MSSSIM), the learned perceptual image patch similarity (LPIPS), and other evaluation indicators.
  • the reconstructed image output by a CODEC optimized according to an evaluation index oriented to human vision can better match the subjective perception of the human eye, for example when the reconstructed image is directly displayed on the display screen inside the vehicle for the driver to watch, in which case the reconstructed image needs to have higher definition and be easy for human eyes to view.
  • however, the complexity of human vision is difficult to capture with a single evaluation index, and the existing evaluation indexes for human vision have their own shortcomings.
  • for example, when using the LPIPS evaluation index to evaluate the quality of the reconstructed image, all the convolutional layers of the network must be computed, which has high computational complexity; in addition, since the network involves pooling and downsampling, image information is lost during the evaluation process and only low-resolution feature maps are used, making it difficult to obtain accurate evaluation results; inaccurate evaluation results in turn make it difficult to guarantee the optimization of the CODEC, and hence the quality of the reconstructed images.
  • the images acquired by the camera are mainly used by the machine system to perform related tasks; for example, in the autonomous driving scenario, the images acquired by the camera sensor are mainly used for target detection, semantic segmentation, and detection of traffic lights and lane lines, i.e. for perception of the environment, so the method of evaluating the quality of the reconstructed image should first be oriented to machine vision; at the same time, in order to cover the application scenarios of various machine vision tasks, the evaluation method should be decoupled from any specific task; in addition, in some cases human vision must also be taken into account, for example when the image captured by the camera sensor also needs to be displayed on the in-car display screen for the driver to watch.
  • therefore, the embodiment of the present application proposes an image processing method which is used to obtain the degree of distortion between the reconstructed image and the original image, so as to evaluate the quality of the reconstructed image and guide the optimization of the CODEC.
  • the image processing method of the embodiment of the present application is oriented to machine vision tasks and can meet the requirements of various tasks while taking human vision into consideration.
  • Fig. 3 shows a schematic structural diagram of the convolutional neural network (CNN) of the embodiment of the present application.
  • CNN 300 can comprise an input layer 310, a convolutional layer/pooling layer 320 (where the pooling layer is optional), and a fully connected layer 330.
  • the convolutional layer/pooling layer 320 may include layers 321 to 326, for example: in one implementation, layer 321 is a convolutional layer, layer 322 is a pooling layer, layer 323 is a convolutional layer, layer 324 is a pooling layer, layer 325 is a convolutional layer, and layer 326 is a pooling layer; in another implementation, layers 321 and 322 are convolutional layers, layer 323 is a pooling layer, layers 324 and 325 are convolutional layers, and layer 326 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
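  • for illustration, the two layer arrangements described above might be expressed as follows (a sketch with arbitrary channel counts, not the network of Fig. 3 itself):

```python
import torch.nn as nn

# First arrangement: convolutional and pooling layers alternate (321-326).
cnn_a = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.MaxPool2d(2),    # 321, 322
    nn.Conv2d(16, 32, 3, padding=1), nn.MaxPool2d(2),   # 323, 324
    nn.Conv2d(32, 64, 3, padding=1), nn.MaxPool2d(2),   # 325, 326
)

# Second arrangement: two convolutional layers before each pooling layer.
cnn_b = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.Conv2d(16, 32, 3, padding=1),   # 321, 322
    nn.MaxPool2d(2),                                                   # 323
    nn.Conv2d(32, 64, 3, padding=1), nn.Conv2d(64, 64, 3, padding=1),  # 324, 325
    nn.MaxPool2d(2),                                                   # 326
)
```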
  • the convolutional layer 321 can include many convolution operators, also called convolution kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • a convolution operator is essentially a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), so as to extract a specific feature from the image.
  • the size of the weight matrix should be related to the size of the image; note that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • during the convolution operation, the weight matrix extends over the entire depth of the input image, so convolution with a single weight matrix produces a convolutional output with a single depth dimension; in most cases, however, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), i.e. multiple matrices of the same shape, are applied.
  • the outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where this dimension can be understood as being determined by the "multiple" mentioned above.
  • different weight matrices can be used to extract different features from the image: for example, one weight matrix is used to extract image edge information, another to extract specific colors of the image, and yet another to filter out unwanted noise.
  • the multiple weight matrices have the same size (rows × columns), so the feature maps extracted by them also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation.
  • in practical applications, the weight values in these weight matrices need to be obtained through extensive training, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 300 can make correct predictions.
  • the initial convolutional layer (for example, 321) often extracts more general features, which can also be called underlying common features or low-level features;
  • the features extracted by later convolutional layers (for example, 326) become more and more complex, such as high-level semantic features.
  • a pooling layer may follow a single convolutional layer, or one or more pooling layers may follow several convolutional layers.
  • the sole purpose of pooling layers is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling an input image to obtain an image of a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling. Also, just like the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after being processed by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
  • the convolutional layer itself can also compress images.
  • when a convolution kernel performs a convolution operation on an image with a stride greater than 1, the image is compressed; this compression of the image is called downsampling, as the following small example shows.
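  • a small PyTorch illustration of stride-based downsampling (shapes are illustrative only):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 224, 224)
print(conv(x).shape)  # torch.Size([1, 8, 112, 112]): stride 2 halves H and W
```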
  • after processing by the convolutional layer/pooling layer 320, the convolutional neural network 300 is not yet able to output the required output information, because, as mentioned earlier, the convolutional layer/pooling layer 320 only extracts features and reduces the number of parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 300 needs the fully connected layer 330 to generate one output or a group of outputs with the required number of classes. Therefore, the fully connected layer 330 may include multiple hidden layers (331, 332 to 33n as shown in FIG. 3) and an output layer 340, and the parameters contained in these hidden layers may be pre-trained on training data related to the specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • the output layer 340 has a loss function similar to categorical cross-entropy and is specifically used to calculate the prediction error.
  • the convolutional neural network 300 shown in FIG. 3 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models.
  • if the convolutional neural network shown in Figure 3 were used to process the reconstructed image and the original image, the pooling and downsampling operations in the processing would lose image information, the resolution of the obtained feature maps would be low, and the obtained degree of distortion between the reconstructed image and the original image would be inaccurate, so the evaluation result of the reconstructed image quality would also be inaccurate.
  • if the CODEC were optimized according to such inaccurate evaluation results, the optimization results would be poor, and the reconstructed image output by the CODEC might have flaws.
  • in addition, using all the convolutional layers shown in Figure 3 to process the image has high computational complexity.
  • Fig. 4 shows a schematic flowchart of the image processing method of the embodiment of the present application.
  • by this method, the degree of distortion between the reconstructed image and the original image can be obtained, so as to evaluate the quality of the reconstructed image and thereby guide the optimization of the encoder, the decoder, the ISP image processing algorithm, and so on.
  • specific application scenarios include assisted/autonomous driving vehicles processing images captured by cameras, and safe city systems/video surveillance systems processing images captured by cameras.
  • the method in FIG. 4 includes step 401 to step 403, which will be introduced respectively below.
  • the image processing method shown in Figure 4 can be applied to the training process of the codec.
  • the standard image and its corresponding compressed image can be obtained from any training set such as ImageNet, KITTI, COCO, or Cityscapes; the compressed image is then input into the codec, and the codec outputs a reconstructed image, which corresponds to the aforementioned compressed image and standard image.
  • the degree of distortion between the reconstructed image and the standard image determines the optimization direction of the codec; generally speaking, it is hoped that the reconstructed image output by the codec is as close as possible to the standard image, so it is necessary to obtain the degree of distortion between the reconstructed image and the standard image.
  • the parameters of the single-layer convolutional neural network come from the parameters of the first convolutional layer of the pre-trained model, where the pre-trained model is a pre-trained convolutional neural network.
  • a convolutional neural network is used to extract features from the reconstructed image and the standard image.
  • the parameters of the single-layer convolutional neural network come from the parameters of the first convolutional layer of the pre-trained model.
  • the pre-trained model is a pre-trained convolutional neural network model, such as ResNet, AlexNet, VGGNet, RegNet, or other classification models trained on the large-scale training set ImageNet.
  • the single-layer convolutional neural network includes multiple convolution kernels, and different convolution kernels are used to extract different features; each convolution kernel has a clear physical meaning, for example, a first convolution kernel is used to extract rapidly changing texture details, and a second convolution kernel is used to extract image edge features and color information.
  • the single-layer convolutional neural network whose parameters come from the pre-trained model is used to extract image features, which can better adapt to machine vision tasks; as can be seen from the above description of Figure 3, the initial convolutional layer often extracts the underlying general features of an image.
  • the single-layer convolutional neural network in the embodiment of the present application is the first convolutional layer of the pre-trained model.
  • the image features extracted by the single-layer convolutional neural network are therefore also underlying general features.
  • a codec optimized on these underlying general features can adapt to various task scenarios and improve multi-task generalization;
  • moreover, using a single-layer convolutional neural network only requires single-layer convolution calculations, which reduces the computational complexity and the computing power requirements on hardware.
  • the single-layer convolutional neural network used in the embodiment of the present application does not include a pooling layer, and the convolutional layer does not perform downsampling on the reconstructed image or the standard image; since no downsampling is performed, the image is not compressed, so a full-resolution feature map can be output, and optimizing the codec based on evaluation results computed at full resolution helps ensure that the reconstructed image output by the optimized codec has high quality and is more friendly to human vision.
  • the first convolution kernel is used to extract a first feature to obtain the first feature map of the reconstructed image and the first feature map of the standard image, and the second convolution kernel is used to extract a second feature to obtain the second feature map of the reconstructed image and the second feature map of the standard image.
  • the first feature map and the second feature map can be weighted respectively, so that the first feature map has a first weight and the second feature map has a second weight; different weights are assigned to different feature maps to achieve different effects.
  • the weight of feature maps related to detail features can be appropriately increased to raise the importance of details, which benefits both the human visual experience and the subsequent execution of machine vision tasks.
  • alternatively, the coefficients of the first convolution kernel and the coefficients of the second convolution kernel can be directly weighted, so that the coefficients of the first convolution kernel have a third weight and the coefficients of the second convolution kernel have a fourth weight.
  • the weighted convolution kernel performs feature extraction on the reconstructed image and the standard image separately, so that the feature maps do not need to be weighted separately, which can save the amount of calculation.
  • the number of convolution kernel coefficients is much smaller than the number of image pixels (which depends on the specific resolution), so weighting the kernel coefficients requires far less computation than weighting the feature maps, especially when the image resolution is high.
  • both the convolution kernel coefficient and the feature map may be weighted.
  • the weighting coefficients used for weighting can be determined manually, or can be determined according to normalization parameters, wherein the normalization parameters come from the normalization parameters of the pre-trained model.
  • the algorithm for obtaining the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image may be based on an existing evaluation index algorithm, such as MSE or MSSSIM.
  • four calculation methods for obtaining the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image are given below; it should be understood that these four methods are only examples and do not constitute a limitation of the present application, and other possible calculation methods may also be used in the embodiments of the present application.
  • Method 1: the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image can be calculated according to the following formula:

    $\mathrm{wfMSE} = \frac{1}{CHW}\sum_{i=1}^{C} w_i \left\| f_i(x) - f_i(y) \right\|_2^2$

  • where wfMSE is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, x is the standard image, y is the reconstructed image, f() is the convolution operation and i indexes the i-th convolution kernel, w_i is the weight coefficient, C is the number of feature map channels of the reconstructed image (equal to that of the standard image), H is the height of the standard image or the reconstructed image, W is the width of the standard image or the reconstructed image, and $\|\cdot\|_2^2$ sums the squared differences over the H×W spatial positions.
  • the convolution kernel parameters and weight coefficients all come from the pre-trained model.
  • the convolution kernel parameters comprise 64 convolution kernels of size 7×7.
  • the weight coefficients can be set manually or taken from the normalization parameters of the pre-trained model; the normalization parameters include the scaling coefficient γ_i and the normalization coefficient σ_i.
  • the convolution operation on the standard image and the reconstructed image is shown in Figure 5: the 64 7×7 convolution kernels perform feature extraction on the standard image x to obtain the feature map f(x) of the standard image, where the feature map extracted by the i-th convolution kernel is f_i(x); likewise, the 64 7×7 convolution kernels perform feature extraction on the reconstructed image y to obtain the feature map f(y) of the reconstructed image, where the feature map extracted by the i-th convolution kernel is f_i(y).
  • the above method 1 is based on the existing MSE calculation and performs only a single-layer convolution operation on the image, so it is simple to compute and has a small amount of calculation; the convolution kernel parameters and weight parameters come from a machine-task-oriented pre-trained model, and optimizing the codec on this basis makes the output reconstructed image more suitable for machine vision tasks; different weights are assigned to different feature maps and the weighting coefficients can be adjusted freely, e.g. the weight of detail-related feature maps can be increased to enhance image detail and texture features.
  • Method 2: for the above method 1, the embodiment of this application proposes a fast implementation. Specifically, the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image can be calculated according to the following formula:

    $\mathrm{wfMSE} = \frac{1}{CHW}\sum_{i=1}^{C} w_i \left\| f_i(x - y) \right\|_2^2$

  • where wfMSE is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, x is the standard image, y is the reconstructed image, f() is the convolution operation and i indexes the i-th convolution kernel, w_i is the weight coefficient, C is the number of feature map channels of the reconstructed image (equal to that of the standard image), H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image; by the linearity of convolution, $f_i(x) - f_i(y) = f_i(x - y)$.
  • in method 1, the standard image and the reconstructed image are convolved separately and then the residual is calculated, so each convolution kernel requires two convolution operations; method 2 first calculates the residual and then convolves the residual, so each convolution kernel requires only one convolution operation, which saves computation.
  • another difference is that method 1 weights the feature maps, whereas in method 2 the weighting can be applied to the convolution kernels.
  • the process of obtaining the degree of distortion of the reconstructed image according to method 2 is shown in Figure 6: the convolution kernel g_i() performs a convolution operation on the residual to obtain the convolution result, and wfMSE is obtained by computing the variance of the convolution result. A sketch of this fast implementation follows.
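  • a sketch of this fast implementation, assuming the weighted single-layer convolution weighted_conv from the earlier sketch plays the role of g_i() (this requires a bias-free convolution, as in ResNet's first layer, so that conv(x) - conv(y) = conv(x - y); the sketch uses the mean of squared responses as the distortion value):

```python
import torch

def wf_mse_fast(x: torch.Tensor, y: torch.Tensor, conv) -> torch.Tensor:
    """Convolve the residual once instead of convolving x and y separately."""
    r = conv(x - y)           # one convolution per kernel, applied to the residual
    return (r ** 2).mean()    # mean squared response over all channels/positions

x = torch.randn(1, 3, 256, 256)
y = x + 0.01 * torch.randn_like(x)
print(wf_mse_fast(x, y, weighted_conv).item())
```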
  • Method 3: in method 1 and method 2, the MSE calculation is used to obtain the degree of distortion of the reconstructed image, which is a calculation method based on the degree of pixel distortion between images.
  • the embodiment of the present application can also obtain the distortion of the reconstructed image based on other indicators, such as a calculation method based on the structural similarity index (SSIM) between images:

    $\mathrm{wfSSIM} = \frac{1}{C}\sum_{i=1}^{C} w_i \cdot \frac{(2\mu_{f_i(x)}\mu_{f_i(y)} + C_1)(2\sigma_{f_i(x)f_i(y)} + C_2)}{(\mu_{f_i(x)}^2 + \mu_{f_i(y)}^2 + C_1)(\sigma_{f_i(x)}^2 + \sigma_{f_i(y)}^2 + C_2)}$

  • where wfSSIM is the degree of distortion of the reconstructed image relative to the standard image, $\mu$ is the mean, $\sigma^2$ is the variance, $\sigma_{f_i(x)f_i(y)}$ is the covariance, and $C_1$ and $C_2$ are constants.
  • method 3 calculates the degree of distortion of the reconstructed image relative to the standard image based on the structural similarity index: an SSIM calculation is performed on the convolved feature maps, and the results are weighted and averaged to obtain wfSSIM.
  • since method 3 is based on structural similarity calculations, which use statistics such as the mean and variance, it is easier to avoid the influence of noise (such as ringing noise) and obtain more stable results, and it can effectively improve accuracy in machine tasks such as semantic segmentation.
  • Method 4: the calculation method based on the structural similarity SSIM in method 3 can also have other variants, such as a calculation method based on the multi-scale structural similarity MSSSIM:

    $\mathrm{wfMSSSIM} = \frac{1}{C}\sum_{i=1}^{C} w_i \cdot \mathrm{MSSSIM}\big(f_i(x), f_i(y)\big)$

  • where wfMSSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, MSSSIM is the standard multi-scale structural similarity, and $C_3$ is a constant (in the standard MSSSIM formulation, $C_3$ appears in the structure comparison term $s(a,b) = (\sigma_{ab}+C_3)/(\sigma_a\sigma_b+C_3)$).
  • method 4 measures the structural similarity between images at multiple scales; compared with method 3, it further enhances image details while retaining the noise resistance, so the reconstructed image output by the optimized codec has higher quality, which is of great significance for subsequent machine tasks such as object detection.
  • the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image can be obtained.
  • the method of the embodiment of the present application further includes evaluating the quality of the reconstructed image according to the degree of distortion to obtain an evaluation result;
  • the encoder and/or decoder are then optimized according to the evaluation results, wherein the encoder and/or decoder are used to output the reconstructed image.
  • the degree of distortion of the reconstructed image reflects the difference between the reconstructed image and the standard image, so the parameters of the encoder and/or decoder can be updated according to the degree of distortion, so that the degree of distortion of the output reconstructed image is as small as possible, thereby optimizing the encoder and/or decoder.
  • FIG. 7 shows a schematic diagram of an application scenario of the image processing method according to an embodiment of the present application in the optimization of an encoder and/or a decoder.
  • the image processing method of the embodiment of the present application can be used to optimize the encoder and the decoder separately, or, when the encoder and decoder are designed as one, to optimize the codec as a whole.
  • since the single-layer convolutional neural network used in the image processing method of the embodiment of the present application comes from an existing pre-trained model rather than a manual design, it is compatible with existing AI models; in addition, the single-layer convolutional neural network is the first convolutional layer of the pre-trained model and extracts the underlying common features of the image, so the optimized encoder and decoder can simultaneously adapt to a variety of machine vision tasks, such as the target detection task, the semantic segmentation task, and the traffic light and lane line detection tasks.
  • alternatively, the encoder and decoder can first be optimized using the image processing method of the embodiment of the present application; the encoder is then fixed, and the decoder is jointly optimized with specific machine vision tasks using existing machine-vision evaluation indicators. The encoder obtained in this way can adapt to a variety of machine vision tasks and improve task generalization, while combining the decoder with specific machine vision tasks makes the output reconstructed image better match the application scenarios of those specific tasks.
  • alternatively, the encoder and decoder can first be optimized using the image processing method of the embodiment of the present application; the encoder is then fixed, and the existing machine-vision evaluation indicators are used, in combination with specific machine vision tasks, to optimize the backbone network of the decoder without optimizing the head network.
  • the backbone network is the part of the decoder network used for feature extraction.
  • the head network makes further predictions based on the features extracted by the backbone network and is also part of the decoder network; in this way, self-supervised learning can be used in the training process of the decoder without label data, because label data is only required when the head network makes predictions.
  • the image processing method of the embodiment of the present application can also be used to guide the optimization of the ISP processing method.
  • ISP processing includes a series of steps such as demosaicing, color transformation, white balance, denoising, tone mapping, and gamma correction.
  • existing machine-vision image quality evaluation indexes generally require end-to-end parameter tuning that combines the ISP with specific tasks, that is, the output accuracy of a specific task is used to guide ISP parameter adjustment; since specific tasks need label data, the applicability of this tuning method is limited.
  • because the image processing method in the embodiment of the present application is decoupled from specific tasks, it can directly guide ISP parameter tuning without performing specific tasks, which simplifies the ISP parameter tuning process.
  • Table 1 below ranks the effects, on machine vision tasks and on human vision, of the reconstructed images output by encoders and/or decoders optimized with the image processing method proposed in the embodiment of the present application and with prior-art methods (the table itself is not reproduced here).
  • wfMSE-w0 means no weighting
  • wfMSE-w1 means that the weighting coefficient is
  • FIG. 8 is a schematic block diagram of an image processing device according to an embodiment of the present application.
  • the image processing device may be a terminal or a chip inside the terminal; as shown in FIG. 8, it includes an acquisition unit 801 and a processing unit 802, which are briefly introduced below.
  • An acquiring unit 801 configured to acquire a reconstructed image and a standard image of the reconstructed image.
  • the processing unit 802 is configured to input the reconstructed image and the standard image into the single-layer convolutional neural network to obtain the feature map of the reconstructed image and the feature map of the standard image.
  • the single-layer convolutional neural network is the first convolutional layer of the pre-trained model.
  • the processing unit 802 is further configured to acquire the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image.
  • in some implementations, the single-layer convolutional neural network includes a plurality of convolution kernels; the first feature map of the reconstructed image and the first feature map of the standard image have a first weight and are obtained by a first convolution kernel; the second feature map of the reconstructed image and the second feature map of the standard image have a second weight and are obtained by a second convolution kernel; the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.
  • the weight is determined by a normalization parameter, which is a normalization parameter of the pre-trained model.
  • in some implementations, the processing unit 802 is specifically configured to calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formula:

    $\mathrm{wfMSE} = \frac{1}{CHW}\sum_{i=1}^{C} w_i \left\| f_i(x) - f_i(y) \right\|_2^2$

  • where wfMSE is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, x is the standard image, y is the reconstructed image, f() is the convolution operation and i indexes the i-th convolution kernel, w_i is the weight coefficient, C is the number of feature map channels of the reconstructed image (equal to that of the standard image), H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.
  • in some implementations, the processing unit 802 is further configured to calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formula:

    $\mathrm{wfSSIM} = \frac{1}{C}\sum_{i=1}^{C} w_i \cdot \frac{(2\mu_{f_i(x)}\mu_{f_i(y)} + C_1)(2\sigma_{f_i(x)f_i(y)} + C_2)}{(\mu_{f_i(x)}^2 + \mu_{f_i(y)}^2 + C_1)(\sigma_{f_i(x)}^2 + \sigma_{f_i(y)}^2 + C_2)}$

  • where wfSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, $\mu$ is the mean, $\sigma^2$ is the variance, $\sigma_{f_i(x)f_i(y)}$ is the covariance, and $C_1$ and $C_2$ are constants.
  • in some implementations, the processing unit 802 is further configured to calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formula:

    $\mathrm{wfMSSSIM} = \frac{1}{C}\sum_{i=1}^{C} w_i \cdot \mathrm{MSSSIM}\big(f_i(x), f_i(y)\big)$

  • where wfMSSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, MSSSIM is the multi-scale structural similarity, and $C_3$ is a constant.
  • in some implementations, the single-layer convolutional neural network includes a plurality of convolution kernels; a first convolution kernel is used to obtain the first feature of the reconstructed image and the first feature of the standard image, and the coefficients of the first convolution kernel have a first weight; a second convolution kernel is used to obtain the second feature of the reconstructed image and the second feature of the standard image, and the coefficients of the second convolution kernel have a second weight; the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.
  • in some implementations, the processing unit 802 is specifically configured to calculate the degree of distortion of the feature map of the reconstructed image according to the following formula:

    $\mathrm{wfMSE} = \frac{1}{CHW}\sum_{i=1}^{C} w_i \left\| f_i(x - y) \right\|_2^2$

  • where wfMSE is the degree of distortion of the feature map of the reconstructed image, x is the standard image, y is the reconstructed image, f() is the convolution operation and i indexes the i-th convolution kernel, w_i is the weight coefficient, C is the number of feature map channels of the reconstructed image (equal to that of the standard image), H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.
  • in some implementations, the processing unit 802 is further configured to: evaluate the quality of the reconstructed image according to the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, so as to obtain an evaluation result, and optimize, according to the evaluation result, the encoder and/or decoder used to output the reconstructed image.
  • the single-layer convolutional neural network does not perform pooling and downsampling operations on the reconstructed image and the standard image.
  • the image processing device shown in FIG. 8 can be used to implement the above image processing method 400, where the acquisition unit 801 is used to implement step 401 and the processing unit 802 is used to implement steps 402 and 403.
  • the image processing device shown in FIG. 8 can also be used to implement the image processing described in FIG. 5 to FIG. 7; for the specific steps, refer to the above descriptions of FIG. 5 to FIG. 7, and details are not repeated here.
The image processing apparatus 800 in the embodiments of the present application may be implemented by software, for example by a computer program or instructions having the above functions: the corresponding computer program or instructions may be stored in a memory inside the terminal, and a processor reads them from the memory to realize the above functions. Alternatively, the image processing apparatus 800 may be implemented by hardware, in which case the processing unit 802 is a processor (such as a processor in an NPU, a GPU, or a system chip) and the acquisition unit 801 is a data interface. The image processing apparatus 800 may also be implemented by a combination of a processor and a software unit; specifically, the acquisition unit 801 may be an interface circuit of the processor, or an encoder and/or a decoder, and the like. For example, the encoder and/or decoder sends the output reconstructed image to the interface circuit of the processor.
FIG. 9 is a schematic structural diagram of an image processing apparatus 900 according to an embodiment of the present application. The apparatus 900 shown in FIG. 9 includes a memory 901, a processor 902, a communication interface 903, and a bus 904, which are communicatively connected to one another through the bus 904.
It should be understood that the acquisition unit 801 in FIG. 8 may correspond to the communication interface 903 in the apparatus 900, and the processing unit 802 may correspond to the processor 902 in the apparatus 900. Each component of the apparatus 900 is described in detail below.
The memory 901 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 901 may store a program, and when the program stored in the memory 901 is executed by the processor 902, the processor 902 is configured to execute each step of the image processing method of the embodiments of the present application. Specifically, the processor 902 may be configured to execute step 402 and step 403 of the method shown in FIG. 4, and may also execute the processes shown in FIG. 5 to FIG. 7.
When the processor 902 executes step 402 and step 403, it can acquire, through the communication interface 903, the reconstructed image output by the encoder and/or decoder and its corresponding standard image, and process the acquired reconstructed image and its corresponding standard image.
The processor 902 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, configured to execute related programs to implement the image processing method of the embodiments of the present application.

The processor 902 may also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the method of the present application may be completed by an integrated logic circuit of hardware in the processor 902 or by instructions in the form of software.

The above processor 902 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or any conventional processor.
The steps of the methods disclosed in connection with the embodiments of the present application may be directly executed by a hardware decoding processor, or by a combination of hardware and software units in a decoding processor. The software unit may be located in a mature storage medium in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 901; the processor 902 reads the information in the memory 901 and, in combination with its hardware, completes the functions required of the units included in the apparatus, or executes the image processing method of the method embodiments of the present application.
The communication interface 903 uses a transceiver apparatus, such as but not limited to a transceiver, to implement communication between the apparatus 900 and other devices or communication networks.
For example, the reconstructed image and its corresponding standard image can be acquired through the communication interface 903.

The bus 904 may include a path for transferring information between the components of the apparatus 900 (for example, the memory 901, the processor 902, and the communication interface 903).
An embodiment of the present application further provides a computer-readable medium storing program code; when the program code runs on a computer, the computer executes the methods described above in FIG. 4 to FIG. 7.

An embodiment of the present application further provides a chip, including at least one processor and a memory, the at least one processor being coupled to the memory and configured to read and execute instructions in the memory, so as to execute the methods described above in FIG. 4 to FIG. 7.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative: the division into units is only a division by logical function, and there may be other divisions in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be electrical, mechanical, or in other forms. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the present application provides an image processing method, including: acquiring a reconstructed image and a standard image of the reconstructed image, the reconstructed image being an image reconstructed from a first image, and the first image being an image obtained by compressing the standard image; inputting the reconstructed image and the standard image into a single-layer convolutional neural network to obtain a feature map of the reconstructed image and a feature map of the standard image, the single-layer convolutional neural network being the first convolutional layer of a pre-trained model, and the pre-trained model being a pre-trained convolutional neural network; and acquiring the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image. The image processing method of the embodiments of the present application is used to obtain the degree of distortion between the reconstructed image and the original image so as to evaluate the quality of the reconstructed image, thereby guiding the optimization of an encoder and/or a decoder.

Description

Image processing method and apparatus

Technical Field

The embodiments of the present application relate to the field of image processing, and more specifically, to an image processing method and apparatus.

Background

Cameras feature high resolution, non-contact sensing, ease of use, and low cost, and are widely used in environment perception. In the field of autonomous driving, for example, more and more cameras are mounted on vehicles to achieve blind-spot-free coverage and machine-vision artificial intelligence (AI) vehicles. As camera resolution, frame rate, and sampling depth keep increasing, the video output by a camera demands ever more transmission bandwidth. To relieve the pressure on the transmission network, FIG. 1 shows a schematic block diagram of an existing solution. As shown in FIG. 1, the Bayer raw image or video output by a camera typically has high precision and requires high transmission bandwidth; for example, ultra high definition (UHD) video with a frame rate of 30 fps, a sampling depth of 16 bits, and 4K resolution requires a bandwidth of up to 4 Gbps (4K*2K*30*16). Therefore, to relieve the pressure on the transmission network, the image or video must be compressed before being transmitted to the mobile data center (MDC), and image signal processing (ISP) must also be applied, so as to reduce the bandwidth requirement and enable high-definition video services without upgrading the existing network bandwidth. To meet the existing bandwidth requirement, the Bayer raw image or video output by the camera must be compressed at a relatively high compression ratio, which usually relies on lossy image or video compression techniques and therefore inevitably impairs image or video quality; ISP processing likewise damages image or video information. The RGB image obtained after encoder compression and ISP processing is transmitted to the MDC for image reconstruction, which is performed by a decoder. The quality of the reconstructed image is critical to subsequent tasks such as object detection and semantic segmentation, so it is particularly important that the image reconstructed by the decoder be of high quality. However, because the RGB image obtained after compression and ISP processing is impaired, the quality of the reconstructed image cannot be guaranteed. A method is therefore urgently needed to determine the degree of distortion between the reconstructed image and the original image, evaluate the quality of the reconstructed image accordingly, and optimize the decoder based on the evaluation result, so as to ensure that the reconstructed image output by the decoder in practical applications has high quality.
Summary

The embodiments of the present application provide an image processing method for obtaining the degree of distortion between a reconstructed image and an original image so as to evaluate the quality of the reconstructed image, thereby guiding the optimization of an encoder and/or a decoder.

According to a first aspect, an image processing method is provided, including: acquiring a reconstructed image and a standard image of the reconstructed image, the reconstructed image being an image reconstructed from a first image, and the first image being an image obtained by compressing the standard image; inputting the reconstructed image and the standard image into a single-layer convolutional neural network to obtain a feature map of the reconstructed image and a feature map of the standard image, the parameters of the single-layer convolutional neural network coming from the parameters of the first convolutional layer of a pre-trained model, and the pre-trained model being a pre-trained convolutional neural network; and acquiring the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image.

In the embodiments of the present application, since the pre-trained model is trained for machine tasks, extracting image features with a single-layer convolutional neural network taken from the pre-trained model adapts better to machine-vision tasks. An initial convolutional layer usually extracts the low-level general features of an image; the single-layer convolutional neural network of the embodiments of the present application is the first convolutional layer of the pre-trained model, so the image features it extracts are also low-level general features, and an encoder/decoder optimized for low-level general features can adapt to a variety of task scenarios, improving multi-task generalization. In addition, compared with processing images with a complete neural network as in the prior art, using a single-layer convolutional neural network requires only a single layer of convolution computation, which reduces computational complexity and the computing-power requirements on hardware.
With reference to the first aspect, in some implementations of the first aspect, the single-layer convolutional neural network includes a plurality of convolution kernels; a first feature map of the reconstructed image and a first feature map of the standard image have a first weight and are obtained by a first convolution kernel; a second feature map of the reconstructed image and a second feature map of the standard image have a second weight and are obtained by a second convolution kernel; the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.

The embodiments of the present application assign different weights to different feature maps to achieve different effects. For example, the weight of feature maps related to detail features can be appropriately increased to raise the importance of detail, which benefits both human visual perception and the subsequent execution of machine-vision tasks.

With reference to the first aspect, in some implementations of the first aspect, the weights are determined by normalization parameters, the normalization parameters being normalization parameters of the pre-trained model.

The weighting coefficients used for weighting may be determined manually, or may be determined from the normalization parameters.
With reference to the first aspect, in some implementations of the first aspect, acquiring the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image includes calculating it according to the following formula:

$$\mathrm{wfMSE}(x,y)=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{HW}\bigl\|w_i\bigl(f_i(x)-f_i(y)\bigr)\bigr\|_2^2$$

where wfMSE is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, x is the standard image, y is the reconstructed image, f() is the convolution operation, i indexes the i-th convolution kernel, w is the weight coefficient, C is the number of feature-map channels of the reconstructed image or of the standard image, H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.

The above method is based on the existing MSE computation and applies a single-layer convolution to the images, so the computation is simple and light. The convolution-kernel parameters and weight parameters come from a pre-trained model oriented to machine tasks, so an encoder/decoder optimized on this basis outputs reconstructed images better adapted to machine-vision tasks. Different feature maps are assigned different weights, the weighting coefficients can be adjusted freely, and the weights of detail-related feature maps can be increased to enhance detail and texture features.

With reference to the first aspect, in some implementations of the first aspect, w = 1, or w is determined from the scaling coefficients and normalization coefficients of the pre-trained model, for example

$$w_i=\frac{\gamma_i}{\sigma_i}$$

where γ_i is a scaling coefficient of the pre-trained model and σ_i is a normalization coefficient of the pre-trained model.

Since the pre-trained model is trained for machine-vision tasks, when the weight coefficients come from the pre-trained model rather than being designed manually, the encoder and/or decoder optimized accordingly can be ensured to perform better on the same machine-vision tasks.
With reference to the first aspect, in some implementations of the first aspect, the method further includes calculating the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formulas:

$$\mathrm{SSIM}\bigl(f_i(x),f_i(y)\bigr)=\frac{\bigl(2\mu_{f_i(x)}\mu_{f_i(y)}+C_1\bigr)\bigl(2\sigma_{f_i(x)f_i(y)}+C_2\bigr)}{\bigl(\mu_{f_i(x)}^2+\mu_{f_i(y)}^2+C_1\bigr)\bigl(\sigma_{f_i(x)}^2+\sigma_{f_i(y)}^2+C_2\bigr)}$$

$$\mathrm{wfSSIM}(x,y)=\frac{1}{C}\sum_{i=1}^{C}w_i\,\mathrm{SSIM}\bigl(f_i(x),f_i(y)\bigr)$$

where wfSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, μ is the mean, σ is the covariance, and C₁ and C₂ are constants.

The embodiments of the present application can also obtain the degree of distortion of the reconstructed image based on other indicators, for example the computation of inter-image structural similarity. Because the structural-similarity computation uses statistics such as means and variances, it is easier to avoid the influence of noise (such as ringing noise) and thus obtain a more stable effect, which can effectively improve accuracy in machine tasks such as semantic segmentation.
With reference to the first aspect, in some implementations of the first aspect, the method further includes calculating the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formulas:

$$l(x,y)=\frac{2\mu_x\mu_y+C_1}{\mu_x^2+\mu_y^2+C_1},\qquad c(x,y)=\frac{2\sigma_x\sigma_y+C_2}{\sigma_x^2+\sigma_y^2+C_2},\qquad s(x,y)=\frac{\sigma_{xy}+C_3}{\sigma_x\sigma_y+C_3}$$

$$\mathrm{MSSSIM}(x,y)=\bigl[l_M(x,y)\bigr]^{\alpha_M}\prod_{j=1}^{M}\bigl[c_j(x,y)\bigr]^{\beta_j}\bigl[s_j(x,y)\bigr]^{\gamma_j}$$

$$\mathrm{wfMSSSIM}(x,y)=\frac{1}{C}\sum_{i=1}^{C}w_i\,\mathrm{MSSSIM}\bigl(f_i(x),f_i(y)\bigr)$$

where wfMSSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, and C₃ is a constant.

The structural-similarity (SSIM) based computation can also take other variants, for example a computation based on multiscale structural similarity (MSSSIM). Compared with the SSIM computation, it further enhances image detail while retaining noise resistance, so the reconstructed image output by the encoder/decoder optimized in this way has higher quality, which is of great significance for subsequent machine tasks such as object detection.
With reference to the first aspect, in some implementations of the first aspect, the single-layer convolutional neural network includes a plurality of convolution kernels; a first convolution kernel is used to obtain a first feature of the reconstructed image and a first feature of the standard image, and the coefficients of the first convolution kernel have a first weight; a second convolution kernel is used to obtain a second feature of the reconstructed image and a second feature of the standard image, and the coefficients of the second convolution kernel have a second weight; the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.

This approach extracts features from the reconstructed image and the standard image with already-weighted convolution kernels, so the feature maps no longer need to be weighted separately, which saves computation. Moreover, a convolution kernel generally has far fewer coefficients than an image has pixels (the pixel count depends on the resolution), so weighting the kernel coefficients also costs far less computation than weighting the feature maps, especially at high image resolutions.
With reference to the first aspect, in some implementations of the first aspect, acquiring the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image includes calculating the degree of distortion of the feature map of the reconstructed image according to the following formula:

$$\mathrm{wfMSE}(x,y)=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{HW}\bigl\|g_i(x-y)\bigr\|_2^2$$

where wfMSE is the degree of distortion of the feature map of the reconstructed image, x is the standard image, y is the reconstructed image, g_i() = w_i × f_i() = fw_i(), f() is the convolution operation, i indexes the i-th convolution kernel, w is the weight coefficient, C is the number of feature-map channels of the reconstructed image or of the standard image, H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.

This method is a fast implementation of the preceding MSE method. The preceding method convolves the standard image and the reconstructed image separately and then takes the residual, so each convolution kernel requires two convolution operations; this method takes the residual first and then convolves it, so each convolution kernel requires only one convolution operation, which saves computation.
With reference to the first aspect, in some implementations of the first aspect, the method further includes: evaluating the quality of the reconstructed image according to the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, so as to obtain an evaluation result; and optimizing an encoder and/or a decoder according to the evaluation result, the encoder and/or decoder being used to output the reconstructed image.

During the training of the encoder and/or decoder, one naturally hopes that the reconstructed image output by the encoder and/or decoder is as close as possible to the standard image, and the degree of distortion of the reconstructed image expresses the difference between the reconstructed image and the standard image, so the parameters of the encoder and/or decoder can be updated according to the degree of distortion so that the distortion of the output reconstructed image is as small as possible, thereby optimizing the encoder and/or decoder.

With reference to the first aspect, in some implementations of the first aspect, the feature map of the reconstructed image and the feature map of the standard image are full-resolution images.

The single-layer convolutional neural network used in the embodiments of the present application includes no pooling layer, and the convolutional layer performs no downsampling on the reconstructed image or the standard image. Without downsampling, the image is not compressed, so a full-resolution image can be output; optimizing the encoder/decoder based on the evaluation of full-resolution images guarantees the quality of the reconstructed image output by the optimized encoder/decoder and is friendlier to human vision.
According to a second aspect, an image processing apparatus is provided, including: an acquisition unit configured to acquire a reconstructed image and a standard image of the reconstructed image, the reconstructed image being an image reconstructed from a first image, and the first image being an image obtained by compressing the standard image; and a processing unit configured to input the reconstructed image and the standard image into a single-layer convolutional neural network to obtain a feature map of the reconstructed image and a feature map of the standard image, the parameters of the single-layer convolutional neural network coming from the parameters of the first convolutional layer of a pre-trained model, and the pre-trained model being a pre-trained convolutional neural network; the processing unit being further configured to acquire the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image.

With reference to the second aspect, in some implementations of the second aspect, the single-layer convolutional neural network includes a plurality of convolution kernels; a first feature map of the reconstructed image and a first feature map of the standard image have a first weight and are obtained by a first convolution kernel; a second feature map of the reconstructed image and a second feature map of the standard image have a second weight and are obtained by a second convolution kernel; the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.

With reference to the second aspect, in some implementations of the second aspect, the weights are determined by normalization parameters, the normalization parameters being normalization parameters of the pre-trained model.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is specifically configured to calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formula:

$$\mathrm{wfMSE}(x,y)=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{HW}\bigl\|w_i\bigl(f_i(x)-f_i(y)\bigr)\bigr\|_2^2$$

where wfMSE is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, x is the standard image, y is the reconstructed image, f() is the convolution operation, i indexes the i-th convolution kernel, w is the weight coefficient, C is the number of feature-map channels of the reconstructed image or of the standard image, H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.

With reference to the second aspect, in some implementations of the second aspect, w = 1, or w is determined from the scaling coefficients and normalization coefficients of the pre-trained model, for example

$$w_i=\frac{\gamma_i}{\sigma_i}$$

where γ_i is a scaling coefficient of the pre-trained model and σ_i is a normalization coefficient of the pre-trained model.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is further configured to calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formulas:

$$\mathrm{SSIM}\bigl(f_i(x),f_i(y)\bigr)=\frac{\bigl(2\mu_{f_i(x)}\mu_{f_i(y)}+C_1\bigr)\bigl(2\sigma_{f_i(x)f_i(y)}+C_2\bigr)}{\bigl(\mu_{f_i(x)}^2+\mu_{f_i(y)}^2+C_1\bigr)\bigl(\sigma_{f_i(x)}^2+\sigma_{f_i(y)}^2+C_2\bigr)}$$

$$\mathrm{wfSSIM}(x,y)=\frac{1}{C}\sum_{i=1}^{C}w_i\,\mathrm{SSIM}\bigl(f_i(x),f_i(y)\bigr)$$

where wfSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, μ is the mean, σ is the covariance, and C₁ and C₂ are constants.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is further configured to calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formulas:

$$l(x,y)=\frac{2\mu_x\mu_y+C_1}{\mu_x^2+\mu_y^2+C_1},\qquad c(x,y)=\frac{2\sigma_x\sigma_y+C_2}{\sigma_x^2+\sigma_y^2+C_2},\qquad s(x,y)=\frac{\sigma_{xy}+C_3}{\sigma_x\sigma_y+C_3}$$

$$\mathrm{MSSSIM}(x,y)=\bigl[l_M(x,y)\bigr]^{\alpha_M}\prod_{j=1}^{M}\bigl[c_j(x,y)\bigr]^{\beta_j}\bigl[s_j(x,y)\bigr]^{\gamma_j}$$

$$\mathrm{wfMSSSIM}(x,y)=\frac{1}{C}\sum_{i=1}^{C}w_i\,\mathrm{MSSSIM}\bigl(f_i(x),f_i(y)\bigr)$$

where wfMSSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, and C₃ is a constant.

With reference to the second aspect, in some implementations of the second aspect, the single-layer convolutional neural network includes a plurality of convolution kernels; a first convolution kernel is used to obtain a first feature of the reconstructed image and a first feature of the standard image, and the coefficients of the first convolution kernel have a first weight; a second convolution kernel is used to obtain a second feature of the reconstructed image and a second feature of the standard image, and the coefficients of the second convolution kernel have a second weight; the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is specifically configured to calculate the degree of distortion of the feature map of the reconstructed image according to the following formula:

$$\mathrm{wfMSE}(x,y)=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{HW}\bigl\|g_i(x-y)\bigr\|_2^2$$

where wfMSE is the degree of distortion of the feature map of the reconstructed image, x is the standard image, y is the reconstructed image, g_i() = w_i × f_i() = fw_i(), f() is the convolution operation, i indexes the i-th convolution kernel, w is the weight coefficient, C is the number of feature-map channels of the reconstructed image or of the standard image, H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is further configured to: evaluate the quality of the reconstructed image according to the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, so as to obtain an evaluation result; and optimize an encoder and/or a decoder according to the evaluation result, the encoder and/or decoder being used to output the reconstructed image.

With reference to the second aspect, in some implementations of the second aspect, the feature map of the reconstructed image and the feature map of the standard image are full-resolution images.

According to a third aspect, a computer-readable medium is provided, storing program code for execution by a device, the program code including instructions for executing the method of any implementation of the first aspect.

According to a fourth aspect, a chip is provided, the chip including a processor and a data interface, the processor reading, through the data interface, instructions stored on a memory to execute the method of any implementation of the first aspect.
Brief Description of the Drawings

FIG. 1 is a schematic block diagram of compressing and transmitting an image output by a camera according to an embodiment of the present application;

FIG. 2 is a schematic block diagram of images acquired by a camera being used for multiple tasks according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application;

FIG. 4 is a schematic flowchart of an image processing method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of the convolution operation on a standard image and a reconstructed image according to an embodiment of the present application;

FIG. 6 is a schematic diagram of the process of obtaining the degree of distortion of a reconstructed image according to an embodiment of the present application;

FIG. 7 is a schematic diagram of application scenarios of the image processing method in the optimization of an encoder and/or a decoder according to an embodiment of the present application;

FIG. 8 is a schematic block diagram of an image processing apparatus according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of an image processing apparatus 900 according to an embodiment of the present application.
Detailed Description

The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.

FIG. 2 shows a schematic block diagram of images acquired by a camera being used for multiple tasks according to an embodiment of the present application. Taking an autonomous-driving scenario as an example, images of the vehicle's surroundings acquired by camera sensors mounted on the vehicle are compressed by an encoder or processed by an ISP to obtain RGB images, whose precision is far lower than that of the raw images output by the camera, thereby reducing the bandwidth requirement of network transmission. The RGB images are processed by a codec (encoder and decoder, CODEC), including compression and reconstruction of the original image, to obtain reconstructed images, which are used for tasks such as object detection, semantic segmentation, and the detection of traffic lights and lane lines. Clearly, the quality of the reconstructed image matters greatly to the performance of these tasks: if the reconstructed image is of poor quality, for example, targets are hard to identify accurately in the object-detection task, which may pose great danger to autonomous driving. It is therefore necessary to ensure that the reconstructed image output by the CODEC has high quality, and hence a method is needed to determine the degree of distortion between the reconstructed image output by the CODEC and the original image, so as to evaluate the quality of the reconstructed image and optimize the CODEC according to the evaluation result, so that the optimized CODEC can output high-quality reconstructed images.

To some extent, the above tasks of object detection, semantic segmentation, and traffic-light and lane-line detection can be called machine-vision tasks; that is, the reconstructed image is processed directly by a machine system, so it mainly needs to satisfy the requirement of being quickly recognizable and detectable by the machine system. For these needs, the prior art proposes a variety of machine-vision-oriented evaluation metrics for the quality of reconstructed images, including the image-classification metric Top-1 Acc, the object-detection metric mAP, the semantic-segmentation metric mIoU, and the lane-line-detection metric Acc. Optimizing a CODEC for a single machine-vision task with these metrics can achieve good results, but the CODEC optimized in this way is coupled to that single task and cannot adapt to multiple task scenarios. For example, a CODEC optimized with the image-classification metric Top-1 Acc outputs reconstructed images that work well for image classification but still work poorly for object detection, semantic segmentation, and lane-line detection, so task-generalized evaluation cannot be achieved. Besides the machine-vision-oriented metrics, the prior art also includes human-vision-oriented metrics such as peak signal-to-noise ratio (PSNR), multiscale structural similarity (MSSSIM), and learned perceptual image patch similarity (LPIPS). Reconstructed images output by a CODEC optimized with human-vision-oriented metrics can better match subjective human perception, for example when the reconstructed image is displayed directly on a screen inside the vehicle for the driver, in which case the reconstructed image needs higher clarity and viewing comfort. In reality, however, the complexity of the human eye can hardly be satisfied by a single metric, and the existing human-vision-oriented metrics each have their own shortcomings. Taking LPIPS as an example, evaluating reconstructed-image quality with LPIPS requires computing all convolutional layers of the network, which is computationally expensive; moreover, because the network involves pooling and downsampling, image information is lost during evaluation, and only low-resolution feature maps are used, so accurate evaluation results are hard to obtain, inaccurate evaluation results cannot guarantee the optimization of the CODEC, and the quality of the reconstructed image is hard to guarantee.

Since the images acquired by the camera are mainly used by machine systems to execute related tasks — for example, in autonomous driving the images acquired by camera sensors are mainly used for object detection, semantic segmentation, and the detection of traffic lights and lane lines, to perceive the environment around the vehicle — a method for evaluating the quality of reconstructed images should first be oriented to machine vision. At the same time, to cover the application scenarios of multiple machine-vision tasks, the method should be decoupled from the specific task. In addition, human vision needs to be taken into account in some cases, for example when the images acquired by the camera sensors must also be displayed on the in-vehicle screen for the driver.

Therefore, the embodiments of the present application propose an image processing method for obtaining the degree of distortion between the reconstructed image and the original image so as to evaluate the quality of the reconstructed image, thereby guiding the optimization of the CODEC. The image processing method of the embodiments of the present application is oriented to machine-vision tasks, can satisfy the needs of multiple tasks, and also takes human vision into account.
To better understand the solutions of the embodiments of the present application, a brief introduction to the convolutional neural network (CNN) that may be involved in the embodiments is given first.

FIG. 3 shows a schematic structural diagram of the convolutional neural network of an embodiment of the present application. The CNN 300 may include an input layer 310, convolutional/pooling layers 320 (the pooling layers being optional), and fully connected layers 330. These layers are described in detail below.

Convolutional/pooling layers 320:

Convolutional layers:

As shown in FIG. 3, the convolutional/pooling layers 320 may include layers 321 to 326 by way of example. In one implementation, layer 321 is a convolutional layer, 322 a pooling layer, 323 a convolutional layer, 324 a pooling layer, 325 a convolutional layer, and 326 a pooling layer; in another implementation, layers 321 and 322 are convolutional layers, 323 is a pooling layer, 324 and 325 are convolutional layers, and 326 is a pooling layer. That is, the output of a convolutional layer can serve as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.

Taking convolutional layer 321 as an example, the internal working principle of a convolutional layer is introduced below.

Convolutional layer 321 may include many convolution operators, also called convolution kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined. When convolving an image, the weight matrix is typically moved over the input image in the horizontal direction one pixel at a time (or two pixels at a time, depending on the value of the stride), extracting a specific feature from the image. The size of the weight matrix should be related to the size of the image. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends over the entire depth of the input image. Convolving with a single weight matrix therefore produces a convolved output with a single depth dimension, but in most cases multiple weight matrices of the same size (rows × columns) — i.e., multiple matrices of the same shape — are applied instead of a single one. The outputs of the individual weight matrices are stacked to form the depth dimension of the convolved image, where this dimension can be understood as being determined by the "multiple" above. Different weight matrices can be used to extract different features from the image: for example, one weight matrix extracts image edge information, another extracts a specific color of the image, and yet another blurs unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), the feature maps extracted by them also have the same size, and the extracted feature maps of the same size are then combined to form the output of the convolution operation.

In practical applications, the weight values in these weight matrices need to be obtained through extensive training; the weight matrices formed by the trained weight values can be used to extract information from the input image, enabling the convolutional neural network 300 to make correct predictions.

When the convolutional neural network 300 has multiple convolutional layers, the initial convolutional layer (for example, 321) usually extracts more general features, which may also be called low-level general features; as the depth of the convolutional neural network 300 increases, the features extracted by later convolutional layers (for example, 326) become more and more complex, such as high-level semantic features.
Pooling layers:

Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. In the layers 321 to 326 exemplified by 320 in FIG. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average-pooling operator and/or a max-pooling operator for sampling the input image to obtain a smaller image: the average-pooling operator computes the average of the pixel values within a given range as the result of average pooling, while the max-pooling operator takes the pixel with the maximum value within that range as the result of max pooling. In addition, just as the size of the weight matrix in a convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The image output after pooling can be smaller than the image input to the pooling layer, and each pixel of the output image represents the average or maximum of the corresponding sub-region of the input image.

It should be noted that besides the pooling layer, the convolutional layer itself can also compress the image. For example, when a convolution kernel convolves the image with a stride greater than 1, the image is compressed; this kind of image compression is called downsampling.
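As a quick, hedged illustration of stride as downsampling (PyTorch assumed, shapes chosen arbitrarily for the example): a stride-2 convolution halves the spatial size, while a stride-1 convolution with matching padding preserves it.

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                          # a dummy RGB image
down = nn.Conv2d(3, 8, kernel_size=7, stride=2, padding=3)  # stride > 1: downsamples
same = nn.Conv2d(3, 8, kernel_size=7, stride=1, padding=3)  # stride = 1: keeps size
print(down(img).shape)   # torch.Size([1, 8, 112, 112])
print(same(img).shape)   # torch.Size([1, 8, 224, 224])
```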
Fully connected layers 330:

After processing by the convolutional/pooling layers 320, the convolutional neural network 300 is not yet able to output the required output information, because, as described above, the convolutional/pooling layers 320 only extract features and reduce the parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 300 uses the fully connected layers 330 to generate one output, or a set of outputs of the required number of classes. The fully connected layers 330 may therefore include multiple hidden layers (331, 332 to 33n as shown in FIG. 3) and an output layer 340; the parameters contained in the multiple hidden layers may be pre-trained on the relevant training data of a specific task type, which may include, for example, image recognition, image classification, and image super-resolution reconstruction.

After the multiple hidden layers in the fully connected layers 330, the last layer of the whole convolutional neural network 300 is the output layer 340, which has a loss function similar to categorical cross-entropy for computing the prediction error. Once the forward propagation of the whole convolutional neural network 300 (propagation from 310 to 340 in FIG. 3) is complete, back propagation (propagation from 340 to 310 in FIG. 3) starts to update the weight values and biases of the layers mentioned above, so as to reduce the loss of the convolutional neural network 300, i.e., the error between the result output through the output layer and the ideal result.

It should be noted that the convolutional neural network 300 shown in FIG. 3 is only an example of a convolutional neural network; in specific applications, the convolutional neural network may also exist in the form of other network models.

When obtaining the degree of distortion between the reconstructed image and the original image, the prior art processes them with the convolutional neural network shown in FIG. 3. Because pooling and downsampling occur during processing, image information is lost, the resulting feature maps have low resolution, and the obtained degree of distortion between the reconstructed image and the original image is inaccurate, so the evaluation of reconstructed-image quality is also inaccurate. Optimizing the CODEC according to inaccurate evaluation results yields poor optimization, and the reconstructed image output by the CODEC may be flawed. In addition, processing the image with all the convolutional layers shown in FIG. 3 has high computational complexity.
FIG. 4 shows a schematic flowchart of the image processing method of an embodiment of the present application. The method shown in FIG. 4 can obtain the degree of distortion between the reconstructed image and the original image so as to evaluate the quality of the reconstructed image, thereby guiding the optimization of encoders, decoders, ISP image-processing algorithms, and the like. Specific application scenarios include assisted/autonomous-driving vehicles processing images acquired by cameras, or safe-city/video-surveillance systems processing images acquired by cameras. The method of FIG. 4 includes steps 401 to 403, introduced below.

S401. Acquire a reconstructed image and a standard image of the reconstructed image, the reconstructed image being an image reconstructed from a first image, and the first image being an image obtained by compressing the standard image.

The image processing method shown in FIG. 4 can be applied to the training process of a codec. In one possible training approach, a standard image and its corresponding compressed image are needed; they may come from any training set such as ImageNet, Kitti, coco, or Cityscapes. The compressed image is then input into the codec, which outputs a reconstructed image corresponding to the aforementioned compressed image and standard image. The degree of distortion between the reconstructed image and the standard image determines the optimization direction of the codec; in general, one hopes that the reconstructed image output by the codec is as close as possible to the standard image, so the degree of distortion between the reconstructed image and the standard image must be obtained.

S402. Input the reconstructed image and the standard image into a single-layer convolutional neural network to obtain a feature map of the reconstructed image and a feature map of the standard image, the parameters of the single-layer convolutional neural network coming from the parameters of the first convolutional layer of a pre-trained model, and the pre-trained model being a pre-trained convolutional neural network.

Obtaining the degree of distortion between the reconstructed image and the standard image first requires their feature maps, and the embodiments of the present application extract features from the reconstructed image and the standard image with a convolutional neural network. Unlike the prior art, which uses a complete neural network, the embodiments of the present application use a single-layer neural network for feature extraction. The parameters of this single-layer convolutional neural network come from the parameters of the first convolutional layer of a pre-trained model, including the convolution-kernel parameters and the normalization parameters; the pre-trained model is a pre-trained convolutional neural network model, for example classification models such as Resnet, alexnet, vggnet, or regnet trained on the large-scale ImageNet training set. The single-layer convolutional neural network includes multiple convolution kernels, different kernels extract different features, and each kernel has a clear physical meaning: for example, the first kernel extracts rapidly varying texture details, while the second extracts image edge features and color information. In the embodiments of the present application, since the pre-trained model is trained for machine tasks, extracting image features with a single-layer convolutional neural network whose parameters come from the pre-trained model adapts better to machine-vision tasks. As described above for FIG. 3, an initial convolutional layer usually extracts the low-level general features of an image; the single-layer convolutional neural network of the embodiments of the present application is the first convolutional layer of the pre-trained model, so the extracted image features are also low-level general features, and a codec optimized for low-level general features can adapt to multiple task scenarios, improving multi-task generalization. In addition, compared with processing images with a complete neural network as in the prior art, using a single-layer convolutional neural network requires only a single layer of convolution computation, reducing computational complexity and the computing-power requirements on hardware.
In addition, the single-layer convolutional neural network used in the embodiments of the present application includes no pooling layer, and the convolutional layer performs no downsampling on the reconstructed image or the standard image. Without downsampling, the image is not compressed, so a full-resolution image can be output; optimizing the codec based on the evaluation of full-resolution images guarantees the quality of the reconstructed image output by the optimized codec and is friendlier to human vision.

In one possible implementation, a first convolution kernel is used to extract a first feature, to obtain a first feature map of the reconstructed image and a first feature map of the standard image, and a second convolution kernel is used to extract a second feature, to obtain a second feature map of the reconstructed image and a second feature map of the standard image. The embodiments of the present application weight the first feature map and the second feature map separately, so that the first feature map has a first weight and the second feature map has a second weight; assigning different weights to different feature maps achieves different effects. For example, the weight of feature maps related to detail features can be appropriately increased to raise the importance of detail, which benefits both human visual perception and the subsequent execution of machine-vision tasks.

In another possible implementation, the coefficients of the first convolution kernel and of the second convolution kernel can be weighted directly, so that the coefficients of the first kernel have a third weight and the coefficients of the second kernel have a fourth weight. The weighted kernels then extract features from the reconstructed image and the standard image, so the feature maps no longer need to be weighted separately, which saves computation. Moreover, a convolution kernel generally has far fewer coefficients than an image has pixels (the pixel count depends on the resolution), so weighting the kernel coefficients also costs far less computation than weighting the feature maps, especially at high image resolutions.

In another possible implementation, depending on the specific application scenario, both the kernel coefficients and the feature maps can be weighted.

The weighting coefficients used for weighting may be determined manually, or may be determined from normalization parameters, where the normalization parameters come from the normalization parameters of the pre-trained model.
S403. Acquire the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image.

The algorithm for acquiring this degree of distortion can be based on the algorithms of existing evaluation metrics, such as MSE or MSSSIM. In light of the above description, four computation methods are given below; it should be understood that they are merely examples of computing the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image and do not limit the present application — besides the following four methods, the embodiments of the present application may also use other possible computation methods.

Method one: the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image can be calculated according to the following formula:

$$\mathrm{wfMSE}(x,y)=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{HW}\bigl\|w_i\bigl(f_i(x)-f_i(y)\bigr)\bigr\|_2^2$$

where wfMSE is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, x is the standard image, y is the reconstructed image, f() is the convolution operation, i indexes the i-th convolution kernel, w is the weight coefficient, C is the number of feature-map channels of the reconstructed image or of the standard image, H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.

The convolution-kernel parameters and weight coefficients both come from the pre-trained model; the kernel parameters include 64 kernels of size 7×7, and the weight coefficients can be set manually or come from the normalization parameters of the pre-trained model, which include the scaling coefficients γ_i and the normalization coefficients σ_i.

Specifically, the convolution operation on the standard image and the reconstructed image is shown in FIG. 5: 64 kernels of size 7×7 extract features from the standard image x to obtain the feature maps f(x) of the standard image, the feature map extracted by the i-th kernel being f_i(x); likewise, 64 kernels of size 7×7 extract features from the reconstructed image y to obtain the feature maps f(y) of the reconstructed image, the feature map extracted by the i-th kernel being f_i(y). stride = 1 means the image is not downsampled, which keeps the image at a high resolution and helps recover more details of the reconstructed image.

When the weight coefficients come from the normalization parameters of the pre-trained model, their values may be w = 1, or may be derived from the scaling and normalization coefficients, for example

$$w_i=\frac{\gamma_i}{\sigma_i}$$

Different values of w correspond to different enhancement effects; increasing the weights of detail-related feature maps achieves a detail-enhancement effect.
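A hedged sketch of reading such weights from a pre-trained model, reusing the `model` from the earlier sketch and assuming torchvision's ResNet-50, whose bn1 batch-normalization layer follows conv1; its scale parameter stands in for γ_i and the square root of its running variance for σ_i (these parameter names are properties of that library, not of the embodiment):

```python
bn = model.bn1                                # BatchNorm layer following conv1
gamma = bn.weight.detach()                    # scaling coefficients, one per kernel
sigma = torch.sqrt(bn.running_var + bn.eps)   # normalization coefficients
w = gamma / sigma                             # one choice of weight coefficients
```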
Method one above is based on the existing MSE computation and applies a single-layer convolution to the images, so the computation is simple and light; the kernel and weight parameters come from a machine-task-oriented pre-trained model, so a codec optimized on this basis outputs reconstructed images better adapted to machine-vision tasks; different feature maps are assigned different weights, the weighting coefficients can be adjusted freely, and the weights of detail-related feature maps can be increased to enhance detail and texture features.

Method two: for method one above, the embodiments of the present application propose a fast implementation, in which the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image can be calculated according to the following formula:

$$\mathrm{wfMSE}(x,y)=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{HW}\bigl\|g_i(x-y)\bigr\|_2^2$$

where wfMSE is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, x is the standard image, y is the reconstructed image, g_i() = w_i × f_i() = fw_i(), f() is the convolution operation, i indexes the i-th convolution kernel, w is the weight coefficient, C is the number of feature-map channels of the reconstructed image or of the standard image, H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.

Method one convolves the standard image and the reconstructed image separately and then takes the residual, so each convolution kernel requires two convolution operations; method two takes the residual first and then convolves the residual, so each convolution kernel requires only one convolution operation, which saves computation.

Method one weights the feature maps, whereas method two weights the kernel coefficients to obtain new kernels g_i() = w_i × f_i() = fw_i(). A convolution kernel has only 7×7 coefficients, independent of the image resolution, whereas an image has W×H pixels (generally far more than 7×7), and the higher the resolution, the more pixels; therefore weighting the kernel coefficients saves computation compared with weighting the feature maps, especially for high-resolution images.

The process of obtaining the degree of distortion of the reconstructed image according to method two is shown in FIG. 6. First, the codec outputs the reconstructed image y corresponding to the standard image x; then the residual z = x − y between the standard image and the reconstructed image is computed; the weighted kernels g_i() then convolve the residual to obtain the convolution results g_i(z); finally, the variance of the convolution results g_i(z) is computed, which is the wfMSE.
Method three: methods one and two use the MSE computation to obtain the degree of distortion of the reconstructed image, which is a computation based on pixel distortion between images. The embodiments of the present application can also obtain the degree of distortion of the reconstructed image based on other indicators, for example the computation of inter-image structural similarity (structural similarity index, SSIM):

$$\mathrm{SSIM}\bigl(f_i(x),f_i(y)\bigr)=\frac{\bigl(2\mu_{f_i(x)}\mu_{f_i(y)}+C_1\bigr)\bigl(2\sigma_{f_i(x)f_i(y)}+C_2\bigr)}{\bigl(\mu_{f_i(x)}^2+\mu_{f_i(y)}^2+C_1\bigr)\bigl(\sigma_{f_i(x)}^2+\sigma_{f_i(y)}^2+C_2\bigr)}$$

$$\mathrm{wfSSIM}(x,y)=\frac{1}{C}\sum_{i=1}^{C}w_i\,\mathrm{SSIM}\bigl(f_i(x),f_i(y)\bigr)$$

where wfSSIM is the degree of distortion of the reconstructed image relative to the standard image, μ is the mean, σ is the covariance, and C₁ and C₂ are constants.

Method three computes the degree of distortion of the reconstructed image relative to the standard image based on the structural-similarity indicator: SSIM is computed on the convolved feature maps, and the weighted average of the computation results gives wfSSIM. Relative to methods one and two, which are pixel-based, method three is based on structural similarity; because it uses statistics such as means and variances, it is easier to avoid the influence of noise (such as ringing noise) and thus obtain a more stable effect, which can effectively improve accuracy in machine tasks such as semantic segmentation.
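A minimal sketch of this weighted-feature SSIM, reusing the `f` and `w` defined in the earlier sketches; it uses global per-channel statistics rather than the usual sliding Gaussian window, purely to keep the example short, and the C1/C2 defaults are common SSIM choices rather than values from the embodiment:

```python
def wfssim(x, y, w, C1=0.01 ** 2, C2=0.03 ** 2):
    fx, fy = f(x), f(y)                                   # (N, C, H, W) feature maps
    mx, my = fx.mean(dim=(2, 3)), fy.mean(dim=(2, 3))     # per-channel means
    vx = fx.var(dim=(2, 3), unbiased=False)               # per-channel variances
    vy = fy.var(dim=(2, 3), unbiased=False)
    cov = ((fx - mx[..., None, None]) * (fy - my[..., None, None])).mean(dim=(2, 3))
    ssim = (2 * mx * my + C1) * (2 * cov + C2) / ((mx**2 + my**2 + C1) * (vx + vy + C2))
    return (w * ssim).mean()                              # weighted average over channels
```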
Method four: the SSIM-based computation of method three can also take other variants, for example a computation based on multiscale structural similarity (MSSSIM):

$$l(x,y)=\frac{2\mu_x\mu_y+C_1}{\mu_x^2+\mu_y^2+C_1},\qquad c(x,y)=\frac{2\sigma_x\sigma_y+C_2}{\sigma_x^2+\sigma_y^2+C_2},\qquad s(x,y)=\frac{\sigma_{xy}+C_3}{\sigma_x\sigma_y+C_3}$$

$$\mathrm{MSSSIM}(x,y)=\bigl[l_M(x,y)\bigr]^{\alpha_M}\prod_{j=1}^{M}\bigl[c_j(x,y)\bigr]^{\beta_j}\bigl[s_j(x,y)\bigr]^{\gamma_j}$$

$$\mathrm{wfMSSSIM}(x,y)=\frac{1}{C}\sum_{i=1}^{C}w_i\,\mathrm{MSSSIM}\bigl(f_i(x),f_i(y)\bigr)$$

where wfMSSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, and C₃ is a constant. The values of α_M, β_j, and γ_j are:

β₁ = γ₁ = 0.0448
β₂ = γ₂ = 0.2856
β₃ = γ₃ = 0.3001
β₄ = γ₄ = 0.2363
α₅ = β₅ = γ₅ = 0.1333

Method four measures inter-image structural similarity at multiple scales; compared with method three, it further enhances image detail while retaining noise resistance, so the reconstructed image output by the codec optimized in this way has higher quality, which is of great significance for subsequent machine tasks such as object detection.
From the above description, the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image can be obtained. After the degree of distortion is obtained, the method of the embodiments of the present application further includes: evaluating the quality of the reconstructed image according to the degree of distortion to obtain an evaluation result, and then optimizing the encoder and/or decoder according to the evaluation result, where the encoder and/or decoder is used to output the reconstructed image. For example, during the training of the encoder and/or decoder, one naturally hopes that the reconstructed image it outputs is as close as possible to the standard image, and the degree of distortion of the reconstructed image expresses the difference between the reconstructed image and the standard image, so the parameters of the encoder and/or decoder can be updated according to the degree of distortion so that the distortion of the output reconstructed image is as small as possible, thereby optimizing the encoder and/or decoder.
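A hedged sketch of this use of the distortion as a training loss: `codec` (any autoencoder-style nn.Module mapping a standard image to its reconstruction) and `loader` (batches of standard images) are assumptions of the sketch, and `wfmse` and `w` are the function and weights defined earlier:

```python
opt = torch.optim.Adam(codec.parameters(), lr=1e-4)
for x in loader:                  # x: a batch of standard images
    y = codec(x)                  # reconstructed images
    loss = wfmse(x, y, w)         # distortion of the weighted feature maps
    opt.zero_grad()
    loss.backward()               # gradients flow through the fixed layer f
    opt.step()                    # update only the codec parameters
```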
FIG. 7 shows application scenarios of the image processing method of the embodiments of the present application in the optimization of an encoder and/or a decoder. As shown in part (a) of FIG. 7, the image processing method of the embodiments of the present application can be used to optimize the encoder and the decoder separately, or, when the encoder and the decoder are designed as one, to optimize the codec as a whole. Since the single-layer convolutional neural network used by the image processing method comes from an existing pre-trained model rather than being designed manually, it is compatible with existing AI models; moreover, the single-layer convolutional neural network is the first convolutional layer of the pre-trained model and extracts the low-level general features of the image, so the optimized encoder and decoder can simultaneously adapt to multiple machine-vision tasks, such as the object-detection, semantic-segmentation, and traffic-light and lane-line detection tasks in FIG. 7(a). As shown in part (b) of FIG. 7, the image processing method can be used to optimize the encoder and the decoder, after which the encoder is fixed and the decoder is jointly optimized with a specific machine-vision task using an existing machine-vision-oriented evaluation metric. The encoder optimized in this way can adapt to multiple machine-vision tasks, improving task generalization, while the decoder, combined with the specific machine-vision task, makes the output reconstructed image better fit the application scenario of that task. In addition, as shown in part (c) of FIG. 7, the image processing method can be used to optimize the encoder and the decoder, after which the encoder is fixed and the backbone network of the decoder is optimized jointly with a specific machine-vision task using an existing machine-vision-oriented evaluation metric, while the head network is not optimized. The backbone network is the part of the decoder used for feature extraction, and the head network is the part of the decoder that makes further predictions from the features extracted by the backbone. In this way, self-supervised learning can be used in the training of the decoder without label data, since label data is needed only when the head network makes predictions.
In addition, the image processing method of the embodiments of the present application can also guide the optimization of ISP processing methods. ISP processing includes a series of steps such as demosaicing, color transformation, white balance, denoising, tone mapping, and gamma correction. Owing to the lack of machine-vision-oriented image-quality evaluation metrics, parameter tuning generally has to be performed end to end with the ISP coupled to a specific task, i.e., the output accuracy of the specific task guides the ISP tuning; the specific task needs label data, which limits this style of tuning. Because the image processing method of the embodiments of the present application is decoupled from specific tasks, it can guide ISP tuning directly without executing a specific task, simplifying the ISP tuning process.
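A hedged, minimal sketch of such decoupled tuning: each candidate parameter set is scored by the feature distortion against a reference image, with no task labels involved. `isp`, `raw`, `ref`, and `candidate_params` are all assumptions of the sketch, and `wfmse` and `w` come from the earlier sketches.

```python
# Pick the ISP parameter set whose output is least distorted in feature space.
best_params = min(
    candidate_params,                                  # e.g. a grid of settings
    key=lambda p: wfmse(ref, isp(raw, p), w).item(),   # lower distortion is better
)
```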
Table 1 below shows the ranking of the effects, in machine-vision tasks and for human vision, of the reconstructed images output by encoders and/or decoders optimized with the image processing method proposed in the embodiments of the present application and with prior-art methods.

Table 1

Metric      Object detection rank   Semantic segmentation rank   Human vision rank   Overall rank
wfMSSSIM    2                       2                            2                   1
wfMSE-w1    1                       4                            1                   2
wfMSE-w0    3                       3                            3                   3
MSSSIM      4                       1                            4                   4
MSE         5                       6                            5                   5
DISTS       6                       5                            6                   6

where wfMSE-w0 denotes no weighting and wfMSE-w1 denotes weighting coefficients derived from the scaling and normalization coefficients of the pre-trained model.

As can be seen from Table 1, the encoder and/or decoder optimized according to the image processing method of the embodiments of the present application achieves better results than the prior art, both in the processing of machine-vision tasks and for human vision.
The image processing method of the embodiments of the present application has been described in detail above with reference to the accompanying drawings; the image processing apparatus of the embodiments of the present application is described below with reference to the accompanying drawings. It should be understood that the image processing apparatus introduced below can execute each step of the image processing method of the embodiments of the present application, and repeated descriptions are appropriately omitted when introducing the apparatus below.

FIG. 8 is a schematic block diagram of the image processing apparatus of an embodiment of the present application. The apparatus may be a terminal, or a chip inside a terminal, and, as shown in FIG. 8, includes an acquisition unit 801 and a processing unit 802, briefly introduced below.

The acquisition unit 801 is configured to acquire a reconstructed image and a standard image of the reconstructed image.

The processing unit 802 is configured to input the reconstructed image and the standard image into a single-layer convolutional neural network to obtain a feature map of the reconstructed image and a feature map of the standard image, the single-layer convolutional neural network being the first convolutional layer of a pre-trained model.

The processing unit 802 is further configured to acquire the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image.
In some implementations, the single-layer convolutional neural network includes a plurality of convolution kernels; a first feature map of the reconstructed image and a first feature map of the standard image have a first weight and are obtained by a first convolution kernel; a second feature map of the reconstructed image and a second feature map of the standard image have a second weight and are obtained by a second convolution kernel; the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.

In some implementations, the weights are determined by normalization parameters, the normalization parameters being normalization parameters of the pre-trained model.

In some implementations, the processing unit 802 is specifically configured to calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formula:

$$\mathrm{wfMSE}(x,y)=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{HW}\bigl\|w_i\bigl(f_i(x)-f_i(y)\bigr)\bigr\|_2^2$$

where wfMSE is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, x is the standard image, y is the reconstructed image, f() is the convolution operation, i indexes the i-th convolution kernel, w is the weight coefficient, C is the number of feature-map channels of the reconstructed image or of the standard image, H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.

In some implementations, w = 1, or w is determined from the scaling coefficients and normalization coefficients of the pre-trained model, for example

$$w_i=\frac{\gamma_i}{\sigma_i}$$

where γ_i is a scaling coefficient of the pre-trained model and σ_i is a normalization coefficient of the pre-trained model.

In some implementations, the processing unit 802 is further configured to calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formulas:

$$\mathrm{SSIM}\bigl(f_i(x),f_i(y)\bigr)=\frac{\bigl(2\mu_{f_i(x)}\mu_{f_i(y)}+C_1\bigr)\bigl(2\sigma_{f_i(x)f_i(y)}+C_2\bigr)}{\bigl(\mu_{f_i(x)}^2+\mu_{f_i(y)}^2+C_1\bigr)\bigl(\sigma_{f_i(x)}^2+\sigma_{f_i(y)}^2+C_2\bigr)}$$

$$\mathrm{wfSSIM}(x,y)=\frac{1}{C}\sum_{i=1}^{C}w_i\,\mathrm{SSIM}\bigl(f_i(x),f_i(y)\bigr)$$

where wfSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, μ is the mean, σ is the covariance, and C₁ and C₂ are constants.

In some implementations, the processing unit 802 is further configured to calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formulas:

$$l(x,y)=\frac{2\mu_x\mu_y+C_1}{\mu_x^2+\mu_y^2+C_1},\qquad c(x,y)=\frac{2\sigma_x\sigma_y+C_2}{\sigma_x^2+\sigma_y^2+C_2},\qquad s(x,y)=\frac{\sigma_{xy}+C_3}{\sigma_x\sigma_y+C_3}$$

$$\mathrm{MSSSIM}(x,y)=\bigl[l_M(x,y)\bigr]^{\alpha_M}\prod_{j=1}^{M}\bigl[c_j(x,y)\bigr]^{\beta_j}\bigl[s_j(x,y)\bigr]^{\gamma_j}$$

$$\mathrm{wfMSSSIM}(x,y)=\frac{1}{C}\sum_{i=1}^{C}w_i\,\mathrm{MSSSIM}\bigl(f_i(x),f_i(y)\bigr)$$

where wfMSSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, and C₃ is a constant.

In some implementations, the single-layer convolutional neural network includes a plurality of convolution kernels; a first convolution kernel is used to obtain a first feature of the reconstructed image and a first feature of the standard image, and the coefficients of the first convolution kernel have a first weight; a second convolution kernel is used to obtain a second feature of the reconstructed image and a second feature of the standard image, and the coefficients of the second convolution kernel have a second weight; the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.

In some implementations, the processing unit 802 is specifically configured to calculate the degree of distortion of the feature map of the reconstructed image according to the following formula:

$$\mathrm{wfMSE}(x,y)=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{HW}\bigl\|g_i(x-y)\bigr\|_2^2$$

where wfMSE is the degree of distortion of the feature map of the reconstructed image, x is the standard image, y is the reconstructed image, g_i() = w_i × f_i() = fw_i(), f() is the convolution operation, i indexes the i-th convolution kernel, w is the weight coefficient, C is the number of feature-map channels of the reconstructed image or of the standard image, H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.

In some implementations, the processing unit 802 is further configured to: evaluate the quality of the reconstructed image according to the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, so as to obtain an evaluation result; and optimize an encoder and/or a decoder according to the evaluation result, the encoder and/or decoder being used to output the reconstructed image.

In some implementations, the single-layer convolutional neural network performs neither pooling nor downsampling on the reconstructed image and the standard image.
It should be understood that the image processing apparatus shown in FIG. 8 can be used to implement the image processing method 400 described above, where the acquisition unit 801 implements step 401 and the processing unit 802 implements steps 402 and 403; the image processing apparatus shown in FIG. 8 can also be used to implement the image processing methods described in FIG. 5 to FIG. 7, for whose specific steps reference may be made to the descriptions of FIG. 5 to FIG. 7 above. For brevity, they are not repeated here.

It should be understood that the image processing apparatus 800 in the embodiments of the present application may be implemented by software, for example by a computer program or instructions having the above functions; the corresponding computer program or instructions may be stored in a memory inside the terminal, and a processor reads the corresponding computer program or instructions from the memory to realize the above functions. Alternatively, the image processing apparatus 800 may be implemented by hardware, in which case the processing unit 802 is a processor (such as a processor in an NPU, a GPU, or a system chip) and the acquisition unit 801 is a data interface. The image processing apparatus 800 may also be implemented by a combination of a processor and a software unit; specifically, the acquisition unit 801 may be an interface circuit of the processor, or an encoder and/or a decoder, and the like. For example, the encoder and/or decoder sends the output reconstructed image to the interface circuit of the processor.

FIG. 9 is a schematic structural diagram of an image processing apparatus 900 of an embodiment of the present application. The apparatus 900 shown in FIG. 9 includes a memory 901, a processor 902, a communication interface 903, and a bus 904, where the memory 901, the processor 902, and the communication interface 903 are communicatively connected to one another through the bus 904.

It should be understood that the acquisition unit 801 in FIG. 8 may correspond to the communication interface 903 in the apparatus 900, and the processing unit 802 may correspond to the processor 902 in the apparatus 900. Each component of the apparatus 900 is described in detail below.
The memory 901 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 901 may store a program, and when the program stored in the memory 901 is executed by the processor 902, the processor 902 is configured to execute each step of the image processing method of the embodiments of the present application.

Specifically, the processor 902 may be configured to execute step 402 and step 403 of the method shown in FIG. 4, and may also execute the processes shown in FIG. 5 to FIG. 7.

When the processor 902 executes step 402 and step 403, it can acquire, through the communication interface 903, the reconstructed image output by the encoder and/or decoder and its corresponding standard image, and process the acquired reconstructed image and its corresponding standard image.

The processor 902 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, configured to execute related programs to implement the image processing method of the embodiments of the present application.

The processor 902 may also be an integrated circuit chip with signal processing capability. In the implementation process, each step of the method of the present application may be completed by an integrated logic circuit of hardware in the processor 902 or by instructions in the form of software.

The above processor 902 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be directly executed by a hardware decoding processor, or by a combination of hardware and software units in a decoding processor. The software unit may be located in a mature storage medium in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 901; the processor 902 reads the information in the memory 901 and, in combination with its hardware, completes the functions required of the units included in the apparatus, or executes the image processing method of the method embodiments of the present application.

The communication interface 903 uses a transceiver apparatus, such as but not limited to a transceiver, to implement communication between the apparatus 900 and other devices or communication networks; for example, the reconstructed image and its corresponding standard image can be acquired through the communication interface 903.

The bus 904 may include a path for transferring information between the components of the apparatus 900 (for example, the memory 901, the processor 902, and the communication interface 903).
An embodiment of the present application further provides a computer-readable medium storing program code; when the computer program code runs on a computer, the computer executes the methods described above in FIG. 4 to FIG. 7.

An embodiment of the present application further provides a chip, including at least one processor and a memory, the at least one processor being coupled to the memory and configured to read and execute instructions in the memory, so as to execute the methods described above in FIG. 4 to FIG. 7.

A person of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled person may implement the described functions differently for each specific application, but such implementations should not be considered beyond the scope of the present application.

A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.

In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative: the division into units is only a division by logical function, and there may be other divisions in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto; any person skilled in the art can easily think of changes or substitutions within the technical scope disclosed in the present application, which shall all be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (24)

  1. An image processing method, characterized by comprising:
    acquiring a reconstructed image and a standard image of the reconstructed image, wherein the reconstructed image is an image reconstructed from a first image, and the first image is an image obtained by compressing the standard image;
    inputting the reconstructed image and the standard image into a single-layer convolutional neural network to obtain a feature map of the reconstructed image and a feature map of the standard image, wherein parameters of the single-layer convolutional neural network come from parameters of a first convolutional layer of a pre-trained model, and the pre-trained model is a pre-trained convolutional neural network; and
    acquiring a degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image.
  2. The method according to claim 1, wherein the single-layer convolutional neural network comprises a plurality of convolution kernels; a first feature map of the reconstructed image and a first feature map of the standard image have a first weight and are obtained by a first convolution kernel; a second feature map of the reconstructed image and a second feature map of the standard image have a second weight and are obtained by a second convolution kernel; and the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.
  3. The method according to claim 2, wherein the weights are determined by normalization parameters, and the normalization parameters are normalization parameters of the pre-trained model.
  4. The method according to any one of claims 1 to 3, wherein acquiring the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image comprises:
    calculating the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formula:
    $$\mathrm{wfMSE}(x,y)=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{HW}\bigl\|w_i\bigl(f_i(x)-f_i(y)\bigr)\bigr\|_2^2$$
    wherein wfMSE is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, x is the standard image, y is the reconstructed image, f() is the convolution operation, i indexes the i-th convolution kernel, w is the weight coefficient, C is the number of feature-map channels of the reconstructed image or of the standard image, H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.
  5. The method according to claim 4, wherein w = 1, or w is determined from the scaling coefficient γ_i and the normalization coefficient σ_i of the pre-trained model, for example
    $$w_i=\frac{\gamma_i}{\sigma_i}$$
    wherein γ_i is a scaling coefficient of the pre-trained model and σ_i is a normalization coefficient of the pre-trained model.
  6. The method according to any one of claims 1 to 5, further comprising:
    calculating the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formulas:
    $$\mathrm{SSIM}\bigl(f_i(x),f_i(y)\bigr)=\frac{\bigl(2\mu_{f_i(x)}\mu_{f_i(y)}+C_1\bigr)\bigl(2\sigma_{f_i(x)f_i(y)}+C_2\bigr)}{\bigl(\mu_{f_i(x)}^2+\mu_{f_i(y)}^2+C_1\bigr)\bigl(\sigma_{f_i(x)}^2+\sigma_{f_i(y)}^2+C_2\bigr)}$$
    $$\mathrm{wfSSIM}(x,y)=\frac{1}{C}\sum_{i=1}^{C}w_i\,\mathrm{SSIM}\bigl(f_i(x),f_i(y)\bigr)$$
    wherein wfSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, μ is the mean, σ is the covariance, C₁ is a constant, and C₂ is a constant.
  7. The method according to any one of claims 1 to 6, further comprising:
    calculating the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formulas:
    $$l(x,y)=\frac{2\mu_x\mu_y+C_1}{\mu_x^2+\mu_y^2+C_1},\qquad c(x,y)=\frac{2\sigma_x\sigma_y+C_2}{\sigma_x^2+\sigma_y^2+C_2},\qquad s(x,y)=\frac{\sigma_{xy}+C_3}{\sigma_x\sigma_y+C_3}$$
    $$\mathrm{MSSSIM}(x,y)=\bigl[l_M(x,y)\bigr]^{\alpha_M}\prod_{j=1}^{M}\bigl[c_j(x,y)\bigr]^{\beta_j}\bigl[s_j(x,y)\bigr]^{\gamma_j}$$
    $$\mathrm{wfMSSSIM}(x,y)=\frac{1}{C}\sum_{i=1}^{C}w_i\,\mathrm{MSSSIM}\bigl(f_i(x),f_i(y)\bigr)$$
    wherein wfMSSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, and C₃ is a constant.
  8. The method according to any one of claims 1 to 7, wherein the single-layer convolutional neural network comprises a plurality of convolution kernels; a first convolution kernel is used to obtain a first feature of the reconstructed image and a first feature of the standard image, and coefficients of the first convolution kernel have a first weight; a second convolution kernel is used to obtain a second feature of the reconstructed image and a second feature of the standard image, and coefficients of the second convolution kernel have a second weight; and the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.
  9. The method according to claim 8, wherein acquiring the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image comprises:
    calculating the degree of distortion of the feature map of the reconstructed image according to the following formula:
    $$\mathrm{wfMSE}(x,y)=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{HW}\bigl\|g_i(x-y)\bigr\|_2^2$$
    wherein wfMSE is the degree of distortion of the feature map of the reconstructed image, x is the standard image, y is the reconstructed image, g_i() = w_i × f_i() = fw_i(), f() is the convolution operation, i indexes the i-th convolution kernel, w is the weight coefficient, C is the number of feature-map channels of the reconstructed image or of the standard image, H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.
  10. The method according to any one of claims 1 to 9, further comprising:
    evaluating the quality of the reconstructed image according to the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, so as to obtain an evaluation result; and
    optimizing an encoder and/or a decoder according to the evaluation result, wherein the encoder and/or decoder is used to output the reconstructed image.
  11. The method according to any one of claims 1 to 10, wherein the feature map of the reconstructed image and the feature map of the standard image are full-resolution images.
  12. An image processing apparatus, characterized by comprising:
    an acquisition unit configured to acquire a reconstructed image and a standard image of the reconstructed image, wherein the reconstructed image is an image reconstructed from a first image, and the first image is an image obtained by compressing the standard image; and
    a processing unit configured to input the reconstructed image and the standard image into a single-layer convolutional neural network to obtain a feature map of the reconstructed image and a feature map of the standard image, wherein parameters of the single-layer convolutional neural network come from parameters of a first convolutional layer of a pre-trained model, and the pre-trained model is a pre-trained convolutional neural network;
    the processing unit being further configured to acquire a degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image.
  13. The apparatus according to claim 12, wherein the single-layer convolutional neural network comprises a plurality of convolution kernels; a first feature map of the reconstructed image and a first feature map of the standard image have a first weight and are obtained by a first convolution kernel; a second feature map of the reconstructed image and a second feature map of the standard image have a second weight and are obtained by a second convolution kernel; and the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.
  14. The apparatus according to claim 13, wherein the weights are determined by normalization parameters, and the normalization parameters are normalization parameters of the pre-trained model.
  15. The apparatus according to any one of claims 12 to 14, wherein the processing unit is specifically configured to:
    calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formula:
    $$\mathrm{wfMSE}(x,y)=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{HW}\bigl\|w_i\bigl(f_i(x)-f_i(y)\bigr)\bigr\|_2^2$$
    wherein wfMSE is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, x is the standard image, y is the reconstructed image, f() is the convolution operation, i indexes the i-th convolution kernel, w is the weight coefficient, C is the number of feature-map channels of the reconstructed image or of the standard image, H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.
  16. The apparatus according to claim 15, wherein w = 1, or w is determined from the scaling coefficient γ_i and the normalization coefficient σ_i of the pre-trained model, for example
    $$w_i=\frac{\gamma_i}{\sigma_i}$$
    wherein γ_i is a scaling coefficient of the pre-trained model and σ_i is a normalization coefficient of the pre-trained model.
  17. The apparatus according to any one of claims 12 to 16, wherein the processing unit is further configured to:
    calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formulas:
    $$\mathrm{SSIM}\bigl(f_i(x),f_i(y)\bigr)=\frac{\bigl(2\mu_{f_i(x)}\mu_{f_i(y)}+C_1\bigr)\bigl(2\sigma_{f_i(x)f_i(y)}+C_2\bigr)}{\bigl(\mu_{f_i(x)}^2+\mu_{f_i(y)}^2+C_1\bigr)\bigl(\sigma_{f_i(x)}^2+\sigma_{f_i(y)}^2+C_2\bigr)}$$
    $$\mathrm{wfSSIM}(x,y)=\frac{1}{C}\sum_{i=1}^{C}w_i\,\mathrm{SSIM}\bigl(f_i(x),f_i(y)\bigr)$$
    wherein wfSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, μ is the mean, σ is the covariance, C₁ is a constant, and C₂ is a constant.
  18. The apparatus according to any one of claims 12 to 17, wherein the processing unit is further configured to:
    calculate the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image according to the following formulas:
    $$l(x,y)=\frac{2\mu_x\mu_y+C_1}{\mu_x^2+\mu_y^2+C_1},\qquad c(x,y)=\frac{2\sigma_x\sigma_y+C_2}{\sigma_x^2+\sigma_y^2+C_2},\qquad s(x,y)=\frac{\sigma_{xy}+C_3}{\sigma_x\sigma_y+C_3}$$
    $$\mathrm{MSSSIM}(x,y)=\bigl[l_M(x,y)\bigr]^{\alpha_M}\prod_{j=1}^{M}\bigl[c_j(x,y)\bigr]^{\beta_j}\bigl[s_j(x,y)\bigr]^{\gamma_j}$$
    $$\mathrm{wfMSSSIM}(x,y)=\frac{1}{C}\sum_{i=1}^{C}w_i\,\mathrm{MSSSIM}\bigl(f_i(x),f_i(y)\bigr)$$
    wherein wfMSSSIM is the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, and C₃ is a constant.
  19. The apparatus according to any one of claims 12 to 18, wherein the single-layer convolutional neural network comprises a plurality of convolution kernels; a first convolution kernel is used to obtain a first feature of the reconstructed image and a first feature of the standard image, and coefficients of the first convolution kernel have a first weight; a second convolution kernel is used to obtain a second feature of the reconstructed image and a second feature of the standard image, and coefficients of the second convolution kernel have a second weight; and the first convolution kernel and the second convolution kernel belong to the plurality of convolution kernels.
  20. The apparatus according to claim 19, wherein the processing unit is specifically configured to:
    calculate the degree of distortion of the feature map of the reconstructed image according to the following formula:
    $$\mathrm{wfMSE}(x,y)=\frac{1}{C}\sum_{i=1}^{C}\frac{1}{HW}\bigl\|g_i(x-y)\bigr\|_2^2$$
    wherein wfMSE is the degree of distortion of the feature map of the reconstructed image, x is the standard image, y is the reconstructed image, g_i() = w_i × f_i() = fw_i(), f() is the convolution operation, i indexes the i-th convolution kernel, w is the weight coefficient, C is the number of feature-map channels of the reconstructed image or of the standard image, H is the height of the standard image or the reconstructed image, and W is the width of the standard image or the reconstructed image.
  21. The apparatus according to any one of claims 12 to 20, wherein the processing unit is further configured to:
    evaluate the quality of the reconstructed image according to the degree of distortion of the feature map of the reconstructed image relative to the feature map of the standard image, so as to obtain an evaluation result; and
    optimize an encoder and/or a decoder according to the evaluation result, wherein the encoder and/or decoder is used to output the reconstructed image.
  22. The apparatus according to any one of claims 12 to 21, wherein the feature map of the reconstructed image and the feature map of the standard image are full-resolution images.
  23. A computer-readable medium, characterized in that the computer-readable medium stores program code for execution by a device, the program code comprising instructions for executing the method according to any one of claims 1 to 11.
  24. A chip, comprising a processor and a data interface, wherein the processor reads, through the data interface, instructions stored on a memory to execute the method according to any one of claims 1 to 11.
PCT/CN2021/130201 2021-11-12 2021-11-12 图像处理方法和装置 WO2023082162A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/130201 WO2023082162A1 (zh) 2021-11-12 2021-11-12 Image processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/130201 WO2023082162A1 (zh) 2021-11-12 2021-11-12 Image processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2023082162A1 true WO2023082162A1 (zh) 2023-05-19

Family

ID=86334824

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/130201 WO2023082162A1 (zh) 2021-11-12 2021-11-12 图像处理方法和装置

Country Status (1)

Country Link
WO (1) WO2023082162A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180158224A1 (en) * 2015-07-31 2018-06-07 Eberhard Karls Universitaet Tuebingen Method and device for image synthesis
  • CN111046893A (zh) * 2018-10-12 2020-04-21 Fujitsu Limited Image similarity determination method and apparatus, and image processing method and apparatus
  • CN111754403A (zh) * 2020-06-15 2020-10-09 Nanjing University of Posts and Telecommunications Image super-resolution reconstruction method based on residual learning
  • CN112418332A (zh) * 2020-11-26 2021-02-26 Beijing SenseTime Technology Development Co., Ltd. Image processing method and apparatus, and image generation method and apparatus
  • CN112525851A (zh) * 2020-12-10 2021-03-19 Shenzhen Institutes of Advanced Technology Terahertz single-pixel imaging method and system


Similar Documents

Publication Publication Date Title
WO2021164731A1 (zh) Image enhancement method and image enhancement apparatus
US20220188999A1 Image enhancement method and apparatus
WO2020192483A1 (zh) Image display method and device
WO2021164234A1 (zh) Image processing method and image processing apparatus
US20230214976A1 Image fusion method and apparatus and training method and apparatus for image fusion model
US20210398252A1 Image denoising method and apparatus
CN110717868B (zh) Video high-dynamic-range inverse tone-mapping model construction and mapping method and apparatus
CN111079764B (zh) Deep-learning-based low-illumination license-plate image recognition method and apparatus
EP4163832A1 Neural network training method and apparatus, and image processing method and apparatus
CN111951164B (zh) Image super-resolution reconstruction network structure and image-reconstruction-effect analysis method
WO2022021938A1 (zh) Image processing method and apparatus, and neural network training method and apparatus
WO2024002211A1 (zh) Image processing method and related apparatus
WO2023082453A1 (zh) Image processing method and apparatus
US20220398698A1 Image processing model generation method, processing method, storage medium, and terminal
CN111145102A (zh) Convolutional-neural-network-based synthetic aperture radar image denoising method
CN112308866A (zh) Image processing method and apparatus, electronic device, and storage medium
WO2022116104A1 (zh) Image processing method, apparatus, device, and storage medium
CN114842216A (zh) Indoor RGB-D image semantic segmentation method based on wavelet transform
Wu et al. FW-GAN: Underwater image enhancement using generative adversarial network with multi-scale fusion
CN110503002B (zh) Face detection method and storage medium
WO2019228450A1 (zh) Image processing method, apparatus and device, and readable medium
CN113743300A (zh) Semantic-segmentation-based cloud detection method and apparatus for high-resolution remote-sensing images
WO2021042774A1 (zh) Image restoration method, image restoration network training method, apparatus, and storage medium
WO2023082162A1 (zh) Image processing method and apparatus
CN117078574A (zh) Image deraining method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21963614

Country of ref document: EP

Kind code of ref document: A1