WO2021196050A1 - Neural network-based image processing method and apparatus - Google Patents

Neural network-based image processing method and apparatus

Info

Publication number
WO2021196050A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
neural network
processed
images
frame
Prior art date
Application number
PCT/CN2020/082634
Other languages
French (fr)
Chinese (zh)
Inventor
李蒙
郑成林
胡慧
陈海
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority to PCT/CN2020/082634 priority Critical patent/WO2021196050A1/en
Priority to CN202080099095.0A priority patent/CN115335852A/en
Publication of WO2021196050A1 publication Critical patent/WO2021196050A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/60Image enhancement or restoration using machine learning, e.g. neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • the embodiments of the present application relate to the field of image processing technology, and in particular, to a neural network-based image processing method and device.
  • the mobile terminal performs image signal processing (ISP) on the image signal.
  • the main function of the ISP is to perform post-processing on the image signal output by the front-end image sensor. Through ISP processing, images captured under different optical conditions can better restore the details of the scene.
  • the ISP processing flow is shown in Figure 1.
  • the natural scene 101 passes through the lens 102 to form a Bayer image, which is converted by the sensor 103 through photoelectric conversion 104 into an analog electrical signal 105; after noise reduction and analog-to-digital (A/D) conversion 106, a digital electrical signal (i.e., a raw image) 107 is obtained, which then enters the digital signal processing chip 100.
  • the steps in the digital signal processing chip 100 are the core steps of ISP processing.
  • the digital signal processing chip 100 generally includes modules such as black level compensation (BLC) 108, lens shading correction 109, bad pixel correction (BPC) 110, demosaicing 111, Bayer-domain noise reduction (denoise) 112, auto white balance (AWB) 113, Ygamma 114, auto exposure (AE) 115, auto focus (AF) (not shown in Figure 1), color correction (CC) 116, gamma correction 117, color gamut conversion 118, color denoising/detail enhancement 119, color enhancement (CE) 120, formatter 121, and input/output (I/O) control 122.
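As a concrete illustration of one of these stages, black level compensation can be sketched as subtracting a fixed dark offset from the raw values, clipping at zero, and rescaling. This is a simplified sketch, not the chip's actual implementation; the 10-bit white level (1023), the offset of 64, and the sample values are hypothetical.

```python
import numpy as np

def black_level_compensation(raw, black_level=64, white_level=1023):
    """Subtract the sensor's black level offset from a raw image and
    rescale so the maximum representable value is preserved.

    The offset and the 10-bit range are illustrative; real sensors
    report their own black and white levels.
    """
    corrected = np.clip(raw.astype(np.float64) - black_level, 0.0, None)
    return corrected * (white_level / (white_level - black_level))

raw = np.array([[60, 64, 100],
                [500, 1023, 64]])
out = black_level_compensation(raw)
```

Values at or below the black level map to zero, while a full-scale pixel stays at full scale after rescaling.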
  • deep-learning-based ISP has achieved certain results in many tasks.
  • a deep-learning-based ISP processes the image data through a neural network and then outputs it.
  • however, the processing complexity of the neural network is generally very high; although the expected results can be achieved, scenarios that require real-time processing generally face problems such as high energy consumption and long running time.
  • therefore, neural-network-based ISP needs to be further optimized.
  • This application provides a neural network-based image processing method and device, in order to optimize the neural network-based image signal processing performance.
  • a neural network-based image processing method uses a first neural network and a second neural network to process multiple frames of to-be-processed images and output second images.
  • the steps of the method are as follows: input the multiple frames of to-be-processed images into the first neural network for computation to obtain a first image; input multiple image groups into multiple second neural networks for computation respectively to obtain multiple frames of second images, where each image group includes the first image and one frame among the multiple frames of images to be processed.
  • the first image is obtained after the multiple frames of images to be processed undergo the first neural network operation; that is, it captures the image features common to the multiple frames of images to be processed.
  • the first image and one frame of the image to be processed then undergo a second neural network operation to obtain a second image, so that multiple frames of second images are obtained respectively. Because the first neural network and the second neural network jointly process the multiple frames of images to be processed, applying the first image to the processing of the second neural network reduces the computational complexity of the second neural network while ensuring the quality of image processing.
  • inputting multiple image groups into multiple second neural networks for computation includes: inputting one frame of the multiple frames of images to be processed into a second neural network for computation to obtain a third image; and merging the first image and the third image to obtain the second image.
  • the first neural network includes an image output layer and multiple feature map output layers, the image output layer outputs the first image, the feature map output layer outputs multiple intermediate feature maps, and the multiple intermediate feature maps are used to participate in the second The operation of the neural network to obtain the third image.
  • the complexity of the second neural network is lower than the complexity of the first neural network.
  • the multiple frames of images to be processed include multiple frames of temporally adjacent images.
  • the processing capability of the first neural network on the static region of the image is greater than that of the second neural network.
  • the first neural network is used to process the static area of the image to be processed.
  • the second neural network is used to process the motion area of the image to be processed.
  • the first neural network and the second neural network form an image processing system, and the image processing system is used to perform noise reduction and demosaicing on the image to be processed.
  • a neural network-based image processing device may be a mobile terminal, a device in a mobile terminal (such as a chip, a chip system, or a circuit), or a device that can be used in conjunction with the mobile terminal.
  • the device may include modules that correspond one-to-one to the methods/operations/steps/actions described in the first aspect.
  • the modules may be hardware circuits, software, or hardware circuits combined with software.
  • the device processes multiple frames of to-be-processed images to obtain a second image.
  • the device may include an arithmetic module.
  • an arithmetic module, configured to input multiple frames of images to be processed into a first neural network for operation to obtain a first image; the arithmetic module is further configured to input multiple image groups into multiple second neural networks respectively for operation to obtain multiple frames of second images, where each image group includes the first image and one frame of the multiple frames of images to be processed.
  • the arithmetic module is configured to: input one frame of the images to be processed into the second neural network for operation to obtain a third image; and merge the first image and the third image to obtain the second image.
  • the first neural network includes an image output layer and multiple feature map output layers, the image output layer outputs the first image, the feature map output layer outputs multiple intermediate feature maps, and the multiple intermediate feature maps are used to participate in the second The operation of the neural network to obtain the third image.
  • the complexity of the second neural network is lower than the complexity of the first neural network.
  • the multiple frames of images to be processed include multiple frames of temporally adjacent images.
  • the processing capability of the first neural network on the static region of the image is greater than that of the second neural network.
  • the first neural network is used to process the static area of the image to be processed.
  • the second neural network is used to process the motion area of the image to be processed.
  • the first neural network and the second neural network constitute an image processing system, and the image processing system is used to perform noise reduction and demosaicing on the image to be processed.
  • an embodiment of the present application provides an image processing device based on a neural network.
  • the device includes a processor, and the processor is configured to call a set of programs, instructions, or data to execute the method described in the first aspect or any one of the possible designs of the first aspect.
  • the device may also include a memory for storing programs, instructions or data called by the processor.
  • the memory is coupled with the processor, and when the processor executes the instructions or data stored in the memory, it can implement the method described in the first aspect or any possible design.
  • an embodiment of the present application provides a chip system, which includes a processor and may also include a memory, for implementing the method described in the first aspect or any one of the possible designs of the first aspect.
  • the chip system can be composed of chips, and can also include chips and other discrete devices.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores computer-readable instructions. When the computer-readable instructions are run on a computer, the method described in the first aspect or any one of the possible designs of the first aspect is executed.
  • the embodiments of the present application also provide a computer program product containing instructions, which, when run on a computer, causes the computer to execute the method described in the first aspect or any possible design of the first aspect.
  • FIG. 1 is a schematic diagram of an ISP processing flow in the prior art
  • FIG. 2 is a schematic structural diagram of a system architecture provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of the principle of a neural network provided by an embodiment of the application.
  • FIG. 4 is a flowchart of a neural network-based image processing method provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of an implementation manner of image processing provided by an embodiment of the application.
  • FIG. 6 is a schematic diagram of an implementation manner of image processing provided by an embodiment of the application.
  • FIG. 7 is a schematic diagram of an implementation manner of image processing provided by an embodiment of the application.
  • FIG. 8 is a schematic diagram of an RGrGbB image processing process provided by an embodiment of the application.
  • FIG. 9a is one of the schematic structural diagrams of the first neural network provided by an embodiment of this application.
  • FIG. 9b is the second schematic diagram of the structure of the first neural network provided by an embodiment of the application.
  • FIG. 10a is one of the schematic structural diagrams of the second neural network provided by an embodiment of this application.
  • FIG. 10b is the second of the schematic structural diagrams of the second neural network provided by an embodiment of this application;
  • FIG. 11a is a schematic structural diagram of a first neural network and a second neural network provided by an embodiment of this application;
  • FIG. 11b is a schematic diagram of the structure of the first neural network and the second neural network provided by an embodiment of the application;
  • FIG. 12 is a schematic structural diagram of an image processing device based on a neural network provided by an embodiment of the application.
  • FIG. 13 is a schematic structural diagram of an image processing device based on a neural network provided by an embodiment of the application.
  • words such as “exemplary” or “for example” are used to present examples or illustrations. Any embodiment or design described as “exemplary” or “for example” in the embodiments of the present application should not be construed as more preferable or advantageous than other embodiments or designs. Rather, such words are used to present related concepts in a concrete manner.
  • the image processing method and device based on neural network (NN) provided by the embodiments of this application can be applied to electronic equipment.
  • the electronic equipment may be a mobile device such as a mobile terminal (mobile terminal), a mobile station (MS), or user equipment (UE); it may also be a fixed device, such as a fixed telephone or a desktop computer, or a video monitor.
  • the electronic device is an image acquisition and processing device with image signal acquisition and processing functions, and has an ISP processing function.
  • the electronic device can also optionally have a wireless connection function to provide users with a handheld device with voice and/or data connectivity, or other processing devices connected to a wireless modem.
  • the electronic device may be a mobile phone (also called a “cellular” phone) or a computer with a mobile terminal; it may also be a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile device, or of course a wearable device (such as a smart watch or smart bracelet), a tablet computer, a personal computer (PC), a personal digital assistant (PDA), a point of sale (POS) terminal, etc.
  • the following takes the electronic device as a mobile terminal as an example for description.
  • FIG. 2 is a schematic diagram of an optional hardware structure of the mobile terminal 200 according to an embodiment of the application.
  • the mobile terminal 200 mainly includes a chipset and peripheral devices.
  • Components such as USB interface, memory, display screen, battery/mains power, earphone/speaker, antenna, sensor, etc. can be understood as peripheral devices.
  • the arithmetic processor, RAM, I/O, display interface, ISP, sensor interface, baseband and other components in the chipset can form a system-on-a-chip (SOC), which is the main part of the chipset.
  • the components in the SOC can all be integrated into a complete chip, or part of the components in the SOC can be integrated, and the other parts are not integrated.
  • the baseband communication module in the SOC may not be integrated with the other parts and may instead form an independent part.
  • the components in the SOC can be connected to each other through a bus or other connecting lines.
  • the PMU, voice codec, RF, and other components outside the SOC usually contain analog circuitry, so they are often kept outside the SOC rather than integrated into it.
  • the PMU is used to connect to the mains or battery to supply power to the SOC, and the mains can be used to charge the battery.
  • the voice codec is used as the sound codec unit to connect with earphones or speakers to realize the conversion between natural analog voice signals and digital voice signals that can be processed by the SOC.
  • the short-range module can include wireless fidelity (WiFi) and Bluetooth, and can also optionally include infrared, near field communication (NFC), radio (FM), or global positioning system (GPS) modules, etc.
  • the RF is connected with the baseband communication module in the SOC to realize the conversion between the air interface RF signal and the baseband signal, that is, mixing. For mobile phones, receiving is down-conversion, and sending is up-conversion.
  • the baseband is used for baseband communication, supporting one or more of a variety of communication modes and processing wireless communication protocols, including the physical layer (layer 1), medium access control (MAC, layer 2), radio resource control (RRC, layer 3), and other protocol layers. It can support various cellular communication standards, such as long term evolution (LTE) communication or 5G new radio (NR) communication.
  • the sensor interface is an interface between the SOC and an external sensor, and is used to collect and process data from at least one external sensor.
  • the external sensor may be, for example, an accelerometer, a gyroscope, a control sensor, an image sensor, and so on.
  • the arithmetic processor may be a general-purpose processor, such as a central processing unit (CPU); one or more integrated circuits, such as one or more application specific integrated circuits (ASICs); one or more digital signal processors (DSPs) or microprocessors; or one or more field programmable gate arrays (FPGAs), etc.
  • the arithmetic processor can include one or more cores, and can selectively schedule other units.
  • RAM can store some intermediate data during calculation or processing, such as intermediate calculation data of CPU and baseband.
  • ISP is used to process the data collected by the image sensor.
  • I/O is used for the SOC to interact with various external interfaces, such as the universal serial bus (USB) interface for data transmission.
  • the memory can be a chip or a group of chips.
  • the display screen can be a touch screen, which is connected to the bus through a display interface.
  • the display interface can be used for data processing before image display, such as superimposing multiple layers to be displayed, buffering display data, or controlling and adjusting screen brightness.
  • the mobile terminal 200 involved in the embodiment of the present application includes an image sensor, which can collect external signals such as light from the outside, and process and convert the external signals into sensor signals, that is, electrical signals.
  • the sensor signal can be a static image signal or a dynamic video image signal.
  • the image sensor may be a camera, for example.
  • the mobile terminal 200 involved in the embodiments of the present application further includes an image signal processor.
  • the image sensor collects sensor signals and transmits them to the image signal processor.
  • the image signal processor obtains the sensor signal and can perform image signal processing on it, in order to obtain an image signal whose sharpness, color, brightness, and other characteristics are in line with the characteristics of the human eye.
  • the image signal processor involved in the embodiment of the present application may be one or a group of chips, that is, it may be integrated or independent.
  • the image signal processor included in the mobile terminal 200 may be an integrated ISP chip integrated in the arithmetic processor.
  • the mobile terminal 200 involved in the embodiments of the present application has the function of taking photos or recording videos.
  • the neural network-based image processing method provided in the embodiments of the present application mainly focuses on how to perform image signal processing based on the neural network.
  • a neural network is used to process the multi-frame images to be processed.
  • a neural network is a network structure that imitates the behavioral characteristics of animal neural networks to process information.
  • the neural network can be composed of neural units. A neural unit can be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and the output of the arithmetic unit can be as shown in formula (1):

    h_{W,b}(x) = f(W^T x) = f(\sum_{s=1}^{n} W_s x_s + b)    (1)

  • where s = 1, 2, ..., n, and n is a natural number greater than 1;
  • W_s is the weight of x_s;
  • b is the bias of the neural unit;
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
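The single neural unit described above can be sketched directly; the input values, weights, and bias below are arbitrary illustrative numbers.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation function f."""
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    """One neural unit: the weighted sum of the inputs x_s with
    weights W_s, plus the bias b, passed through the activation f."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x_s
w = np.array([0.1, 0.4, 0.2])    # weights W_s
b = 0.05                         # bias b
y = neural_unit(x, w, b)         # output lies in (0, 1)
```

Because the sigmoid squashes its argument, the unit's output is always strictly between 0 and 1, which is why it can serve directly as the input of a following layer.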
  • a neural network is a network formed by connecting multiple above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be a region composed of several neural units.
  • the neural network 300 has N processing layers, where N ≥ 3 and N is a natural number.
  • the first layer of the neural network is the input layer 301, which is responsible for receiving input signals.
  • the last layer of the neural network is the output layer 303, which outputs the processing results of the neural network.
  • the layers other than the first and last layers are the intermediate layers 304. These intermediate layers together form the hidden layer 302. Each intermediate layer of the hidden layer can receive input signals and output signals, and the hidden layer is responsible for processing the input signals.
  • Each layer represents a logic level of signal processing. Through multiple layers, data signals can be processed by multiple levels of logic.
  • the input signal of the neural network may be a signal in various forms such as a voice signal, a text signal, an image signal, and a temperature signal.
  • the processed image signals may be various sensor signals, such as landscape signals captured by a camera (image sensor), image signals of a community environment captured by a video monitoring device, and face signals acquired by an access control system.
  • the input signals of the neural network include various other engineering signals that can be processed by computers, which are not listed here one by one. If the neural network is used for deep learning of the image signal, the image quality can be improved.
  • a deep neural network (DNN) is also known as a multi-layer neural network.
  • the DNN is divided according to the positions of different layers: the layers inside the DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers.
  • the layers are fully connected; that is to say, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer.
  • the work of each layer can be expressed as y = α(Wx + b), where x is the input vector, y is the output vector, b is the bias vector, W is the weight matrix (also called the coefficients), and α() is the activation function.
  • each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has a large number of layers, the number of coefficient matrices W and bias vectors b is also large.
  • the definition of these parameters in a DNN is as follows, taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the fourth neuron in the second layer to the second neuron in the third layer is defined as W_{24}^3, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer.
  • in summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W_{jk}^L.
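The per-layer operation y = α(Wx + b) and the W_{jk}^L indexing can be sketched as follows; the layer sizes (4 → 3 → 2) and random weights are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Run a fully connected DNN: each layer computes y = f(W @ x + b).

    layers is a list of (W, b) pairs.  W[j, k] plays the role of the
    coefficient W_jk from neuron k of the previous layer to neuron j
    of the current layer.
    """
    for W, b in layers:
        x = sigmoid(W @ x + b)
    return x

rng = np.random.default_rng(0)
# A small DNN with layer sizes 4 -> 3 -> 2.
layers = [
    (rng.standard_normal((3, 4)), rng.standard_normal(3)),
    (rng.standard_normal((2, 3)), rng.standard_normal(2)),
]
y = forward(np.ones(4), layers)
```

Each weight matrix has one row per output neuron and one column per input neuron, so the number of parameters grows with both layer width and depth, as the text notes.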
  • Convolutional neural network (convolutional neuron network, CNN) is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer.
  • the feature extractor can be regarded as a filter.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • in a convolutional layer, a neuron may be connected to only some of the neurons in adjacent layers.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights are the convolution kernel.
  • weight sharing can be understood as meaning that the way image information is extracted is independent of location.
  • the convolution kernel can be initialized in the form of a matrix of random size. In the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
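Weight sharing can be illustrated with a single kernel slid over an image: every output position reuses the same kernel weights, so the parameter count is independent of the image size. This sketch uses plain numpy loops rather than a deep learning framework; the 4×4 image and 3×3 averaging kernel are illustrative.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as used in CNNs).
    The same kernel weights are shared at every spatial position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0          # 3x3 averaging kernel
feature_map = conv2d(image, kernel)     # shape (2, 2)
```

Only the nine kernel weights are learned, however large the input image; this is the reduction in connections (and overfitting risk) described above.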
  • the neural network in the embodiment of the present application may be a convolutional neural network, and of course, it may also be another type of neural network, such as a recurrent neural network (recurrent neural network, RNN).
  • the images in the embodiments of the present application may be static images (or referred to as static pictures) or dynamic images (or referred to as dynamic pictures).
  • the images in the present application may be videos or dynamic pictures, or they may be static pictures or photos.
  • static images or dynamic images are collectively referred to as images.
  • the method is executed by an image processing device based on a neural network.
  • the neural network-based image processing device may be any device or apparatus with image processing functions; for example, the method is executed by the mobile terminal 200 shown in FIG. 2, by a device related to the mobile terminal, or by part of the equipment included in the mobile terminal.
  • multiple neural networks are used for image processing, for example, two neural networks are used to process the image to be processed, and the two neural networks are denoted as the first neural network and the second neural network.
  • the first neural network and the second neural network conform to the above description of the neural network.
  • the neural network-based image processing method provided by the embodiment of the present application includes the following steps.
  • S401 Input multiple frames of to-be-processed images into a first neural network for calculation to obtain a first image.
  • S402 Input multiple image groups into multiple second neural networks to perform operations to obtain multiple frames of second images, each image group includes the first image and one frame of the multiple frames of images to be processed.
  • n is an integer greater than or equal to 2.
  • each second neural network receives the first image and one frame of the image to be processed, and outputs one frame of the second image; that is, the first second neural network outputs the first frame of the second image, the second second neural network outputs the second frame of the second image, and so on, until the n-th second neural network outputs the n-th frame of the second image.
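The data flow of S401 and S402 can be sketched with placeholder networks. Here first_nn and second_nn are stand-ins invented to make the flow runnable (a per-pixel mean and a blend), not the patent's trained networks; the frame count and image sizes are arbitrary.

```python
import numpy as np

def first_nn(frames):
    """Placeholder for the first neural network: extracts features
    common to all frames (here, simply the per-pixel mean)."""
    return np.mean(frames, axis=0)

def second_nn(first_image, frame):
    """Placeholder for one second neural network: refines a single
    frame using the shared first image."""
    return 0.5 * first_image + 0.5 * frame

# n = 3 frames of 2 x 2 to-be-processed images (synthetic data).
frames = [np.full((2, 2), float(i)) for i in range(3)]

# S401: run all frames through the first neural network together.
first_image = first_nn(frames)

# S402: form n image groups (first image + one frame each) and run
# each group through its own second neural network.
second_images = [second_nn(first_image, f) for f in frames]
```

The shared first image is computed once, so each second network only handles one frame plus the shared result, which is the source of the complexity reduction described above.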
  • the first image is obtained after the multiple frames of images to be processed undergo the first neural network operation; that is, it captures the image features common to the multiple frames of images to be processed.
  • the first image and one frame of the image to be processed then undergo a second neural network operation to obtain a second image, so that multiple frames of second images are obtained respectively. Because the first neural network and the second neural network jointly process the multiple frames of images to be processed, applying the first image to the processing of the second neural network reduces the computational complexity of the second neural network while ensuring the quality of image processing.
  • the first neural network is used to process static regions of multiple frames of images to be processed.
  • the first image may be an image of the static area shared by the multiple frames of images to be processed. Because the features of the static region account for a high proportion of the network complexity, these features are processed first by the first neural network, and the processing result is input into the second neural network as an intermediate result, which lowers the complexity required of the second neural network. Through the combined use of the two neural networks, multiple frames of images can be processed with lower complexity than with a single neural network.
  • the second neural network is used to process the motion regions of multiple frames of images to be processed.
  • each second neural network receives the first image and a frame of image to be processed. Further, each second neural network outputs a frame of the third image, and merges the first image and the third image to obtain the second image.
  • each second neural network outputs one frame of the second image; that is, the first second neural network outputs the first frame of the second image, the second second neural network outputs the second frame of the second image, and so on, until the n-th second neural network outputs the n-th frame of the second image.
  • merging the first image and the third image can also be regarded as superimposing the two, for example, performing a matrix addition operation on the first image and the third image to obtain the second image.
  • the first image is an image of the static area processed by the first neural network, and the third image is an image of the moving area processed by the second neural network. Combining the first image and the third image, that is, combining the processed image of the static area with the processed image of the moving area, yields a complete second image.
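The merge step described above can be sketched as a per-pixel matrix addition of the static-area result and the motion-area result; the 2×2 arrays below are synthetic values chosen only for illustration.

```python
import numpy as np

# First image: static-area result from the first neural network.
first_image = np.array([[10.0, 20.0],
                        [30.0, 40.0]])

# Third image: motion-area result from one second neural network,
# here treated as a residual to be added on top of the static result.
third_image = np.array([[1.0, -2.0],
                        [0.5, 0.0]])

# Merging by element-wise matrix addition yields the complete
# second image for this frame.
second_image = first_image + third_image
```

Because the addition is element-wise, the two networks' outputs must share the same spatial resolution for the merge to be defined.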
  • the first neural network and/or the second neural network do not divide or recognize the image to be processed as static regions and/or moving regions, but process the image to be processed as a whole.
  • the characteristics of the first neural network will make it process the image areas with static-area characteristics in the image to be processed
  • the characteristics of the second neural network will make it process the image areas with moving-area characteristics in the image to be processed and in the intermediate image processed by the first neural network.
  • the characteristics of the first neural network will cause it to perform higher-intensity processing on image areas with static-area characteristics in the image to be processed and lower-intensity processing on image areas with moving-area characteristics, while the characteristics of the second neural network will cause it to perform higher-intensity processing on image areas with moving-area characteristics in the image to be processed and in the intermediate image processed by the first neural network, and lower-intensity processing on image areas with static-area characteristics.
  • the first image may be an image processed by the first neural network in a static area
  • the third image may be an image processed by the second neural network in a moving area.
  • the difference from the implementation manner described in FIG. 6 is that the second neural network also receives the intermediate feature map output by the first neural network.
  • the intermediate feature map is used to participate in the operation of the second neural network to obtain the third image.
  • a frame of to-be-processed image and an intermediate feature map may be subjected to vector splicing or vector addition to obtain the to-be-processed image matrix, and the to-be-processed image matrix may be input to the second neural network for operation to obtain the third image.
  • the vector stitching of a frame of image to be processed and the intermediate feature map can be regarded as the internal processing process of the second neural network.
  • the input to the second neural network is an overall matrix, that is, the matrix of images to be processed.
  • the second neural network also receives a multi-frame intermediate feature map output by the first neural network.
  • a frame of image to be processed and the intermediate feature map of the first frame can be vector spliced or vector added to obtain an image matrix to be processed, and the image matrix to be processed can be input to the second neural network for operation to obtain the intermediate feature map of the second neural network.
  • the first neural network includes one image output layer and multiple feature map output layers.
  • the image output layer outputs the first image.
  • the feature map output layer outputs multiple intermediate feature maps.
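The vector splicing and vector addition described above can be sketched as follows (NumPy; the channel counts are illustrative assumptions, not taken from the patent):

```python
import numpy as np

frame = np.zeros((4, 8, 8))          # one frame to be processed (4 Bayer channels)
intermediate = np.zeros((16, 8, 8))  # an intermediate feature map from the first network

# Vector splicing: concatenate along the channel dimension, giving 4 + 16 = 20 channels.
spliced = np.concatenate([frame, intermediate], axis=0)

# Vector addition requires matching shapes, e.g. adding two 4-channel tensors.
added = frame + np.ones((4, 8, 8))
```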
  • no specific distinction is made between the first image, the second image, and the third image.
  • no specific distinction is made between the color and texture of the first image, the second image, and the third image.
  • the multiple frames of images to be processed include multiple frames of temporally adjacent images.
  • multiple frames of temporally adjacent images include multiple frames of temporally continuous images.
  • correspondingly, the processed images are also multiple frames.
  • the second images obtained through multiple second neural networks are multiple frames, and each frame of the image to be processed corresponds to one frame of the second image.
  • Each frame of the second image corresponds to the first image and one frame of the third image.
  • the format of the image to be processed may be a red-green-blue (RGB) format, a luminance-chrominance (YUV) format, or a Bayer format; the format is not limited in this application.
  • the number of multi-frame to-be-processed images is 4 frames, and the 4 frames of to-be-processed images are input to the first neural network for calculation to obtain the first image.
  • the first image is obtained corresponding to 4 frames of images to be processed.
  • the 4 image groups are respectively input to 4 second neural networks to perform operations to obtain 4 second images respectively, and each image group includes the first image and one of the 4 images to be processed.
  • the first frame of the image to be processed corresponds to the first frame of the second image
  • the second frame of image to be processed corresponds to the second frame of the second image
  • the third frame of image to be processed corresponds to the third frame of the second image
  • the fourth frame of the image to be processed corresponds to the fourth frame of the second image.
  • the first frame of image to be processed is input to the first second neural network for calculation to obtain the first frame of the third image, and the first image and the first frame of the third image are combined to obtain the first frame of second image.
  • the second frame of image to be processed is input to the second second neural network for operation to obtain the second frame of the third image, and the first image and the second frame of the third image are combined to obtain the second frame of the second image.
  • the third frame of image to be processed is input to the third second neural network for operation to obtain the third frame of the third image, and the first image and the third frame of the third image are combined to obtain the third frame of the second image.
  • the fourth frame of image to be processed is input to the fourth second neural network for operation to obtain the fourth frame of the third image, and the first image and the fourth frame of the third image are combined to obtain the fourth frame of the second image.
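The four steps above can be sketched as one loop. The two stand-in functions below are placeholders for the real networks (a simple average and a residual), used only to show the data flow, not the patent's actual layers:

```python
import numpy as np

def first_network(frames):
    # Placeholder for the first (shared) network: all frames in, one first image out.
    return sum(frames) / len(frames)

def second_network(frame, first_image):
    # Placeholder for one second (per-frame) network: outputs a third image.
    return frame - first_image

frames = [np.full((4, 8, 8), float(i)) for i in range(4)]

first_image = first_network(frames)            # shared by all image groups
second_images = []
for frame in frames:                           # one image group per frame
    third_image = second_network(frame, first_image)
    second_images.append(first_image + third_image)   # merge -> second image
```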
  • the first neural network and the second neural network can be combined into an image processing system, and the image processing system is used to process the images to be processed to improve the quality of the image or video.
  • the processing process can include processing such as noise reduction and elimination of mosaic effects.
  • the complexity of the first neural network is higher than the complexity of the second neural network.
  • the processing capability of the first neural network on the static area of the image is greater than that of the second neural network.
  • multiple frames of images are often synthesized into one frame of output through a neural network to improve image or video quality.
  • such a neural network requires high complexity, and in a video scene a high processing speed is also required.
  • implementing video processing on a mobile terminal requires that a video with a resolution of 8K reach a processing speed of 30 frames/s, that is, a frame rate of 30.
  • when a single neural network is used to synthesize multiple frames of images into one frame for output, it faces problems of computational complexity and computational resource consumption and incurs a large time delay. Simply reducing the complexity of the neural network and using a lower-complexity network will degrade the quality of the image or video.
  • the first neural network is used to handle the computationally heavy processing shared among the multiple frames of images
  • the second neural network is used to handle the lighter per-frame processing within the multiple frames of images and to output the multiple frames of processed images
  • the combined computing power of the first neural network and the second neural network is allocated across the multiple frames of images, so the processing complexity of each frame is reduced compared with the above-mentioned solution, while the quality of the image or video is guaranteed.
  • the first image is an image of a static area
  • the third image is an image of a moving area
  • the first neural network is used to process the static area of the multi-frame image to be processed
  • the second neural network is used to process the motion of the multi-frame image to be processed
  • the first neural network and the second neural network are convolutional neural networks as an example.
  • the image to be processed is 4 frames
  • the second image is 4 frames.
  • the format of the image to be processed is a Bayer format image, in particular the image format is an RGrGbB format, and one frame of an RGrGbB format image includes 4 channels (R, Gr, Gb, B).
  • the image processing system includes a first neural network and a second neural network.
  • the 16 channels include (R1, Gr1, Gb1, B1, R2, Gr2, Gb2, B2, R3, Gr3, Gb3, B3, R4, Gr4, Gb4, B4).
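Packing the 4 RGrGbB frames into the 16-channel input of the first neural network can be sketched as a channel-wise concatenation (shapes are illustrative):

```python
import numpy as np

# Four Bayer frames, each with the 4 channels (R, Gr, Gb, B).
frames = [np.zeros((4, 8, 8)) for _ in range(4)]

# 16-channel input ordered (R1, Gr1, Gb1, B1, R2, ..., Gb4, B4).
network_input = np.concatenate(frames, axis=0)
```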
  • Exemplarily, the architecture of the first neural network is shown in Figures 9a and 9b. Because the drawing of the first neural network is too large, it is split into two parts, shown in Figure 9a and Figure 9b respectively; together, Figures 9a and 9b form the architecture of the first neural network. The add layer at the end of Figure 9a connects to the first layer in Figure 9b.
  • the convolutional layer is represented by a rectangular box.
  • Conv2d represents a 2-dimensional convolution
  • bias represents the bias term
  • 1x1/3x3 represents the size of the convolution kernel
  • stride represents the step size
  • _32_16 represents the numbers of input and output feature maps
  • 32 means that the number of feature maps input to the layer is 32
  • 16 means that the number of feature maps output by the layer is 16.
  • Split represents the split layer, which splits the feature maps in the channel dimension.
  • Split 2 means splitting into two in the feature-map dimension; for example, 32 input feature maps become two groups of 16 feature maps after this operation.
  • concat represents the concatenation (skip-connection) layer, which merges tensors in the feature-map dimension; for example, two tensors of 16 feature maps are merged into one tensor of 32 feature maps.
  • add represents a matrix addition operation.
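The Split, concat and add layers described above correspond to the following tensor operations (NumPy sketch; 32 feature maps of hypothetical size 8x8):

```python
import numpy as np

x = np.arange(32 * 8 * 8, dtype=float).reshape(32, 8, 8)  # 32 feature maps

# Split 2: divide the 32 feature maps into two groups of 16 along the channel axis.
a, b = np.split(x, 2, axis=0)

# concat: merge two 16-feature-map tensors back into one 32-feature-map tensor.
y = np.concatenate([a, b], axis=0)

# add: element-wise matrix addition of two equal-shaped tensors.
z = a + b
```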
  • the first neural network shown in FIG. 9a and FIG. 9b is a typical convolutional neural network, which can handle the static areas of the multiple frames of images to be processed well. The 4 frames of images to be processed are input to the first neural network, and the first neural network outputs the first image.
  • the convolutional layer of the first neural network may also adopt a group convolution (group convolution) operation.
  • group convolution is a special kind of convolutional layer.
  • suppose the previous layer outputs N channels and the number of groups of the group convolution is M.
  • the group convolutional layer first divides the N channels into M groups of N/M channels each; the convolution of each group is performed independently, and after completion the output feature maps are vector-concatenated (concat) together as the output channels of this layer.
  • the group convolution operation can obtain the same or a similar technical effect as the branch structure.
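A naive NumPy sketch of the group convolution just described (N = 8 input channels, M = 2 groups; all sizes are illustrative, and the convolution is a minimal "valid" implementation rather than the patent's actual layers):

```python
import numpy as np

def conv2d(x, w):
    """Naive 'valid' 2-D convolution: x is (Cin, H, W), w is (Cout, Cin, k, k)."""
    cout, cin, k, _ = w.shape
    _, h, width = x.shape
    out = np.zeros((cout, h - k + 1, width - k + 1))
    for o in range(cout):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return out

def group_conv2d(x, group_weights):
    """Split the N input channels into M groups, convolve each group
    independently, then concat the outputs along the channel axis."""
    m = len(group_weights)                   # number of groups M
    groups = np.split(x, m, axis=0)          # M groups of N/M channels each
    outs = [conv2d(g, w) for g, w in zip(groups, group_weights)]
    return np.concatenate(outs, axis=0)

x = np.ones((8, 6, 6))                                        # N = 8 channels
group_weights = [np.ones((2, 4, 3, 3)) for _ in range(2)]     # M = 2 groups
y = group_conv2d(x, group_weights)                            # (4, 4, 4) output
```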
  • Exemplarily, the architecture of the 4 second neural networks that process the 4 images to be processed is shown in Fig. 10a and Fig. 10b.
  • the first image output in Fig. 10a is input to the first layer in Fig. 10b.
  • the convolutional layer is represented by a rectangular box.
  • for the specific notation, refer to the explanation of Fig. 9a and Fig. 9b.
  • the first frame of image to be processed is input to the first second neural network to obtain the first frame of the third image, and the first image and the first frame of the third image are combined to obtain the first frame of the second image.
  • the second frame of image to be processed is input to the second second neural network to obtain the second frame of the third image, and the first image and the second frame of the third image are combined to obtain the second frame of the second image.
  • the third frame of the image to be processed is input to the third second neural network to obtain the third frame of the third image, and the first image and the third frame of the third image are combined to obtain the third frame of the second image.
  • the fourth frame of image to be processed is input to the fourth second neural network to obtain the fourth frame of the third image, and the first image and the fourth frame of the third image are combined to obtain the fourth frame of the second image.
  • the first neural network can also output intermediate feature maps, which are input to the second neural network and participate in the calculation of the second neural network to obtain the third image.
  • the intermediate feature map output by the second convolutional layer in the first neural network is used for vector stitching with the image output by the first convolutional layer of the second neural network to obtain a processed image.
  • the intermediate feature map output by the fourth convolution layer in the first neural network is used for vector stitching with the intermediate feature map output by the third convolution layer of the second neural network to obtain a processed image.
  • the neural network model needs to be trained.
  • the training data can include training images and ground truth images.
  • the output image is compared with the true value image of the first image until the network converges, and the training of the first neural network model is completed.
  • the so-called network convergence may mean, for example, that the difference between the output image and the true value image of the first image is smaller than the set first threshold.
  • the parameters of the first neural network obtained by training are then fixed, and the true value image of the third image is used to train the second neural network: the collected training images are processed to obtain an output image.
  • the output image is compared with the true value image of the third image until the network converges, and the training of the second neural network model is completed.
  • network convergence may mean that the difference between the output image and the true value image of the third image is less than the set second threshold.
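The two-stage training described above (train the first network until its output is close enough to the first image's ground truth, freeze its parameters, then train the second network against the third image's ground truth) can be sketched with toy linear "networks". Everything here, from the data to the thresholds, is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 4))        # toy "training images" (one row per sample)
gt_first = x @ np.full(4, 0.5)          # ground truth for the first image
gt_third = x @ np.full(4, -0.2)         # ground truth for the third image

def train(inputs, target, threshold, lr=0.05, max_steps=50_000):
    """Gradient descent on a linear model until the loss falls below the
    convergence threshold (the 'network convergence' criterion above)."""
    w = np.zeros(inputs.shape[1])
    for _ in range(max_steps):
        err = inputs @ w - target
        if np.mean(err ** 2) < threshold:
            break
        w -= lr * 2 * inputs.T @ err / len(inputs)
    return w

# Stage 1: train the first network until the difference to the ground truth
# of the first image is below the first threshold, then freeze w1.
w1 = train(x, gt_first, threshold=1e-4)

# Stage 2: with w1 fixed, the second network sees the frame plus the frozen
# first-network output, and is trained against the third image's ground truth.
inputs2 = np.column_stack([x, x @ w1])
w2 = train(inputs2, gt_third, threshold=1e-4)
```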
  • an image processing system is formed by a first neural network and a second neural network, and the image processing system is used to process multiple frames of to-be-processed images, and output multiple frames of processed images.
  • the complexity of the second neural network is lower than the complexity of the first neural network.
  • the calculation amount of the image processing system for each frame of the image to be processed is reduced to a certain extent compared with the scheme of processing multiple frames of images into one frame through the basic network in some technologies. In turn, the image processing time delay can be reduced, and the quality of the image or video can be guaranteed.
  • the computing power of the two neural networks for processing multiple frames of images to be processed will be illustrated below with examples.
  • the processed image output by the first neural network and the second neural network is 4 frames.
  • a frame is output after basic network processing.
  • the first neural network is shown in Figures 9a and 9b
  • the second neural network is shown in Figures 10a and 10b.
  • the calculation amount of the first neural network is about the same as that of the basic network, which is about 12000 MAC.
  • the calculation process of the network complexity of the basic network is as follows:
  • the calculation amount of each second neural network is approximately 4000 MAC; here it is assumed to be 4000.
  • the multi-frame input and multi-frame output schemes performed by the first neural network and the second neural network can reduce the amount of calculation, thereby reducing the delay of image processing, and can meet the requirements of video in video scenarios.
  • regarding the image processing delay requirement: the network computing power requirement of a video with a resolution of 8 thousand (K) pixels and a frame rate of 30 frames per second is about 50000 MAC.
  • when the embodiment of this application outputs 8 frames, the amount of calculation required by the image processing system can basically meet the network computing power requirement of 8K 30-fps video.
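Using the figures quoted above as assumptions (first network about 12000 MAC, each second network about 4000 MAC, 4 frames per pass), the per-frame saving works out as follows:

```python
# All numbers are the illustrative figures from the text (units: MAC, as given).
basic_network_mac = 12000    # baseline: multi-frame in, one frame out
first_network_mac = 12000    # roughly the same as the basic network
second_network_mac = 4000    # per second-network pass
frames = 4

# Baseline cost: one full basic-network pass per output frame.
baseline_per_frame = basic_network_mac

# Proposed scheme: one shared first-network pass plus one second-network
# pass per frame, amortized over the 4 output frames.
total = first_network_mac + frames * second_network_mac
per_frame = total / frames
```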
  • the neural network-based image processing device may include a hardware structure and/or a software module, and implement the above functions in the form of a hardware structure, a software module, or a hardware structure plus a software module.
  • whether a certain function among the above-mentioned functions is executed by a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraints of the technical solution.
  • an embodiment of the present application also provides a neural network-based image processing apparatus 1200.
  • the neural network-based image processing apparatus 1200 may be a mobile terminal or any device with image processing functions.
  • the neural network-based image processing device 1200 may include modules that perform one-to-one correspondence of the methods/operations/steps/actions in the foregoing method embodiments.
  • the modules may be implemented by hardware circuits, by software, or by hardware circuits combined with software.
  • the neural network-based image processing device 1200 may include an arithmetic module 1201.
  • the arithmetic module 1201 is configured to input multiple frames of to-be-processed images into a first neural network for calculation to obtain a first image; and input multiple image groups into multiple second neural networks to perform calculations to obtain multiple frames of second images respectively , wherein each of the image groups includes one frame of the first image and the multiple frames of images to be processed.
  • the division of modules in the embodiments of this application is illustrative and is only a division by logical function; there may be other division methods in actual implementation.
  • the functional modules in the various embodiments of this application may be integrated into one processing module, may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules.
  • an embodiment of the present application also provides an image processing device 1300 based on a neural network.
  • the neural network-based image processing device 1300 includes a processor 1301.
  • the processor 1301 is used to call a set of programs so that the foregoing method embodiments are executed.
  • the neural network-based image processing device 1300 further includes a memory 1302, and the memory 1302 is configured to store program instructions and/or data executed by the processor 1301.
  • the memory 1302 is coupled with the processor 1301.
  • the coupling in the embodiments of the present application is an indirect coupling or communication connection between devices, units or modules, and may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules.
  • the processor 1301 may operate in cooperation with the memory 1302.
  • the processor 1301 may execute program instructions stored in the memory 1302.
  • the memory 1302 may be included in the processor 1301.
  • the neural network-based image processing device 1300 may be a chip system.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • the processor 1301 is configured to input multiple frames of to-be-processed images into a first neural network for operation to obtain a first image; and input multiple image groups into multiple second neural networks for operation to obtain multiple frames of second images respectively , wherein each of the image groups includes one frame of the first image and the multiple frames of images to be processed.
  • the processor 1301 may be a general-purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and may implement or execute the The disclosed methods, steps and logic block diagrams.
  • the general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in combination with the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor.
  • the memory 1302 may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM).
  • the memory is any other medium that can be used to carry or store desired program codes in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory in the embodiment of the present application may also be a circuit or any other device capable of realizing a storage function for storing program instructions and/or data.
  • An embodiment of the present application also provides a chip including a processor, which is used to support the neural network-based image processing device to implement the functions involved in the foregoing method embodiments.
  • the chip is connected to a memory or the chip includes a memory, and the memory is used to store the necessary program instructions and data of the communication device.
  • the embodiment of the present application provides a computer-readable storage medium that stores a computer program, and the computer program includes instructions for executing the foregoing method embodiments.
  • the embodiments of the present application provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the foregoing method embodiments.
  • this application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device.
  • the device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
  • these computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Picture Signal Circuits (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

The present application relates to the technical field of image processing, and disclosed are a neural network-based image processing method and apparatus, for use in ensuring image quality and reducing image processing delay. The method comprises: inputting multiple frames of images to be processed into a first neural network for calculation to obtain a first image; and respectively inputting multiple image groups into multiple second neural networks for calculation to respectively obtain multiple frames of second images, each image group comprising the first image and one of the multiple frames of images to be processed.

Description

一种基于神经网络的图像处理方法及装置Image processing method and device based on neural network 技术领域Technical field
本申请实施例涉及图像处理技术领域,尤其涉及一种基于神经网络的图像处理方法及装置。The embodiments of the present application relate to the field of image processing technology, and in particular, to a neural network-based image processing method and device.
背景技术Background technique
随着科学技术的发展,手机、平板电脑等具有拍照和视频录制功能的移动终端已被人们广泛使用。移动终端在拍照或视频录制过程中,对图像信号进行图像信号处理(image Signal processing,ISP)。With the development of science and technology, mobile terminals with camera and video recording functions such as mobile phones and tablet computers have been widely used by people. In the process of photographing or video recording, the mobile terminal performs image signal processing (ISP) on the image signal.
ISP主要作用是对前端图像传感器输出的图像信号进行后期处理。依赖于ISP,在不同的光学条件下得到的图像才能较好的还原现场细节。ISP处理流程如图1所示,自然景物101通过镜头(lens)102获得贝尔(bayer)图像,然后通过传感器103、光电转换104得到模拟电信号105,进一步通过消噪和数模转换(A/D)106获得数字电信号(即原始图像(raw image))107,接下来会进入数字信号处理芯片100中。在数字信号处理芯片100中的步骤是ISP处理的核心步骤,数字信号处理芯片100一般包含黑电平矫正(black level compensation,BLC)108、镜头阴影矫正(lens shading correction)109、坏点矫正(bad pixel correction,BPC)110、去马赛克(demosaic)111、拜耳域降噪(denoise)112、自动白平衡(auto white balance,AWB)113、Ygamma114、自动曝光(auto exposure,AE)115、自动对焦(auto focus,AF)(图1中未示出)、色彩矫正(color correction,CC)116、伽玛(gamma)矫正117、色域转换118、色彩去噪/细节增强119、色彩增强(color enhance,CE)120、编织器(formater)121、输入输出(input/output,I/O)控制122等模块。The main function of ISP is to perform post-processing on the image signal output by the front-end image sensor. Depending on the ISP, the images obtained under different optical conditions can better restore the details of the scene. The ISP processing flow is shown in Figure 1. The natural scene 101 obtains the Bayer image through the lens 102, and then obtains the analog electrical signal 105 through the sensor 103 and the photoelectric conversion 104, and further passes noise reduction and digital-to-analog conversion (A/ D) 106 obtains a digital electrical signal (ie raw image) 107, and then enters the digital signal processing chip 100. The steps in the digital signal processing chip 100 are the core steps of ISP processing. The digital signal processing chip 100 generally includes black level compensation (BLC) 108, lens shading correction 109, and dead pixel correction ( bad pixel correction, BPC) 110, demosaic (demosaic) 111, Bayer domain noise reduction (denoise) 112, auto white balance (AWB) 113, Ygamma 114, auto exposure (AE) 115, auto focus (auto focus, AF) (not shown in Figure 1), color correction (CC) 116, gamma correction 117, color gamut conversion 118, color denoising/detail enhancement 119, color enhancement (color Enhance (CE) 120, formater (formater) 121, input/output (input/output, I/O) control 122 and other modules.
目前,深度学习的应用越来越广泛,基于深度学习的ISP,在很多任务的应用中取得一定的效果。基于深度学习的ISP,会将图像数据经过神经网络进行处理后输出,但是神经网络的处理复杂度一般会很高,在非实时处理场景下,可以达到预计目的,但在需要实时处理的场景中,一般存在能耗、运行时间等问题。At present, the application of deep learning is becoming more and more extensive. ISP based on deep learning has achieved certain results in the application of many tasks. The ISP based on deep learning will process the image data through a neural network and then output it. However, the processing complexity of the neural network is generally very high. In non-real-time processing scenarios, the expected purpose can be achieved, but in scenarios that require real-time processing , Generally there are problems such as energy consumption and running time.
因此基于神经网络的ISP需要进一步优化。Therefore, ISP based on neural network needs to be further optimized.
发明内容Summary of the invention
本申请提供一种基于神经网络的图像处理方法及装置,以期优化基于神经网络的图像信号处理性能。This application provides a neural network-based image processing method and device, in order to optimize the neural network-based image signal processing performance.
第一方面,提供一种基于神经网络的图像处理方法,采用第一神经网络和第二神经网络对多帧待处理图像进行处理,输出第二图像。该方法的步骤如下所述:将多帧待处理图像输入第一神经网络进行运算,以获得第一图像;将多个图像组分别输入多个第二神经网络进行运算,以分别获得多帧第二图像,其中,每个图像组包括第一图像和多帧待处理图像中的一帧图像。In a first aspect, a neural network-based image processing method is provided, which uses a first neural network and a second neural network to process multiple frames of to-be-processed images, and output a second image. The steps of the method are as follows: input multiple frames of to-be-processed images into the first neural network for calculation to obtain the first image; input multiple image groups into multiple second neural networks for calculation respectively to obtain multiple frames of the first neural network. Two images, where each image group includes the first image and one frame of images among the multiple frames of images to be processed.
本申请提供的基于神经网络的图像处理方法,将多帧待处理图像经过第一神经网 络运算后获得的第一图像,即得到多帧待处理图像共有的图像特征。将第一图像和一帧待处理图像经过一个第二神经网络运算后获得第二图像,以分别获得多帧第二图像。由于分别利用第一神经网络和第二神经网络分别对多帧待处理图像进行处理,将第一图像运用到第二神经网络的处理过程中,减小第二神经网络的计算复杂度,并能够保证图像处理质量。In the neural network-based image processing method provided in this application, the first image obtained after multiple frames of images to be processed is subjected to a first neural network operation, that is, the image characteristics common to the multiple frames of images to be processed are obtained. The first image and a frame of image to be processed are subjected to a second neural network operation to obtain a second image, so as to obtain multiple frames of second images respectively. Since the first neural network and the second neural network are used to process multiple frames of images to be processed, the first image is applied to the processing of the second neural network, which reduces the computational complexity of the second neural network and can Ensure the quality of image processing.
在一种可能的实现方式中,将多个图像组分别输入多个第二神经网络进行运算,包括:将多帧待处理图像中的一帧图像输入第二神经网络进行运算,以获得第三图像;合并第一图像和第三图像,以获得第二图像。In a possible implementation manner, inputting multiple image groups into multiple second neural networks to perform calculations includes: inputting one frame of the multiple frames of images to be processed into the second neural network to perform calculations, so as to obtain the third neural network. Image; merge the first image and the third image to obtain the second image.
Optionally, the first neural network includes one image output layer and multiple feature map output layers; the image output layer outputs the first image, the feature map output layers output multiple intermediate feature maps, and the multiple intermediate feature maps are used to participate in the operation of the second neural network to obtain the third image.
Optionally, the complexity of the second neural network is lower than that of the first neural network.
Optionally, the multiple frames of to-be-processed images include multiple temporally adjacent frames.
Optionally, the first neural network has a greater capability of processing static regions of an image than the second neural network.
In a possible design, the first neural network is used to process static regions of the to-be-processed images.
In a possible design, the second neural network is used to process motion regions of the to-be-processed images.
In a possible design, the first neural network and the second neural network form an image processing system, and the image processing system is used to perform noise reduction and mosaic-effect elimination on the to-be-processed images.
According to a second aspect, a neural network-based image processing apparatus is provided. The apparatus may be a mobile terminal, an apparatus in a mobile terminal (for example, a chip, a chip system, or a circuit), or an apparatus that can be used in cooperation with a mobile terminal. In one design, the apparatus may include modules that correspond one-to-one to the methods/operations/steps/actions described in the first aspect; the modules may be implemented as hardware circuits, as software, or as hardware circuits combined with software. The apparatus processes multiple frames of to-be-processed images to obtain second images. In one design, the apparatus may include an operation module. For example: the operation module is configured to input multiple frames of to-be-processed images into a first neural network for operation to obtain a first image; the operation module is further configured to input multiple image groups into multiple second neural networks respectively for operation to obtain multiple frames of second images respectively, where each image group includes the first image and one frame of the multiple frames of to-be-processed images.
In a possible implementation, the operation module is configured to: input one frame of the multiple frames of to-be-processed images into a second neural network for operation to obtain a third image; and merge the first image and the third image to obtain a second image.
Optionally, the first neural network includes one image output layer and multiple feature map output layers; the image output layer outputs the first image, the feature map output layers output multiple intermediate feature maps, and the multiple intermediate feature maps are used to participate in the operation of the second neural network to obtain the third image.
Optionally, the complexity of the second neural network is lower than that of the first neural network.
Optionally, the multiple frames of to-be-processed images include multiple temporally adjacent frames.
Optionally, the first neural network has a greater capability of processing static regions of an image than the second neural network.
In a possible design, the first neural network is used to process static regions of the to-be-processed images.
In a possible design, the second neural network is used to process motion regions of the to-be-processed images.
In a possible design, the first neural network and the second neural network form an image processing system, and the image processing system is used to perform noise reduction and mosaic-effect elimination on the to-be-processed images.
For the beneficial effects of the second aspect, reference may be made to the corresponding effects of the first aspect; details are not repeated here.
According to a third aspect, an embodiment of this application provides a neural network-based image processing apparatus. The apparatus includes a processor, and the processor is configured to invoke a set of programs, instructions, or data to perform the method described in the first aspect or any possible design of the first aspect. The apparatus may further include a memory configured to store the programs, instructions, or data invoked by the processor. The memory is coupled to the processor, and when the processor executes the programs, instructions, or data stored in the memory, the method described in the first aspect or any possible design thereof can be implemented.
According to a fourth aspect, an embodiment of this application provides a chip system. The chip system includes a processor and may further include a memory, and is configured to implement the method described in the first aspect or any possible design of the first aspect. The chip system may consist of a chip, or may include a chip and other discrete components.
According to a fifth aspect, an embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are run on a computer, the method described in the first aspect or any possible design of the first aspect is performed.
According to a sixth aspect, an embodiment of this application further provides a computer program product containing instructions. When the computer program product is run on a computer, the computer is caused to perform the method described in the first aspect or any possible design of the first aspect.
Description of the drawings
Figure 1 is a schematic diagram of an ISP processing flow in the prior art;
Figure 2 is a schematic structural diagram of a system architecture provided by an embodiment of this application;
Figure 3 is a schematic diagram of the principle of a neural network provided by an embodiment of this application;
Figure 4 is a flowchart of a neural network-based image processing method provided by an embodiment of this application;
Figure 5 is a schematic diagram of an implementation of image processing provided by an embodiment of this application;
Figure 6 is a schematic diagram of an implementation of image processing provided by an embodiment of this application;
Figure 7 is a schematic diagram of an implementation of image processing provided by an embodiment of this application;
Figure 8 is a schematic diagram of an RGrGbB image processing procedure provided by an embodiment of this application;
Figure 9a is a first schematic structural diagram of a first neural network provided by an embodiment of this application;
Figure 9b is a second schematic structural diagram of a first neural network provided by an embodiment of this application;
Figure 10a is a first schematic structural diagram of a second neural network provided by an embodiment of this application;
Figure 10b is a second schematic structural diagram of a second neural network provided by an embodiment of this application;
Figure 11a is a schematic structural diagram of a first neural network and a second neural network provided by an embodiment of this application;
Figure 11b is a schematic structural diagram of a first neural network and a second neural network provided by an embodiment of this application;
Figure 12 is a schematic structural diagram of a neural network-based image processing apparatus provided by an embodiment of this application;
Figure 13 is a schematic structural diagram of a neural network-based image processing apparatus provided by an embodiment of this application.
Detailed description of embodiments
The terms "first", "second", and "third" in the specification, claims, and accompanying drawings of this application are used to distinguish different objects, not to define a specific order.
In the embodiments of this application, words such as "exemplary" or "for example" are used to indicate an example, an illustration, or a description. Any embodiment or design described as "exemplary" or "for example" in the embodiments of this application should not be construed as being preferred over or more advantageous than other embodiments or designs. Rather, words such as "exemplary" or "for example" are intended to present related concepts in a specific manner.
The neural network (NN)-based image processing method and apparatus provided in the embodiments of this application can be applied to an electronic device. The electronic device may be a mobile device such as a mobile terminal, a mobile station (MS), or user equipment (UE); it may also be a fixed device such as a fixed telephone or a desktop computer, or a video monitor. The electronic device is an image acquisition and processing device with image signal acquisition and processing functions, and has an ISP processing function. The electronic device may also optionally have a wireless connection function, as a handheld device that provides a user with voice and/or data connectivity, or as another processing device connected to a wireless modem. For example, the electronic device may be a mobile phone (or "cellular" phone) or a computer with a mobile terminal; it may also be a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile apparatus; it may of course also be a wearable device (such as a smart watch or a smart band), a tablet computer, a personal computer (PC), a personal digital assistant (PDA), a point-of-sale (POS) terminal, and so on. In the following embodiments of this application, the description takes a mobile terminal as the electronic device by way of example.
Figure 2 is a schematic diagram of an optional hardware structure of a mobile terminal 200 according to an embodiment of this application.
As shown in Figure 2, the mobile terminal 200 mainly includes a chipset and peripheral devices. The components in the solid-line box in Figure 2, namely the power management unit (PMU), voice codec, short-range module, radio frequency (RF) module, computing processor, random-access memory (RAM), input/output (I/O), display interface, image signal processor (ISP), sensor interface (sensor hub), and baseband communication module, form a chip or chipset. Components such as the USB interface, memory, display screen, battery/mains power, earphone/speaker, antenna, and sensors can be understood as peripheral devices. The computing processor, RAM, I/O, display interface, ISP, sensor interface, baseband, and other components in the chipset can form a system-on-a-chip (SOC), which is the main part of the chipset. The components in the SOC may all be integrated into one complete chip, or some components in the SOC may be integrated while others are not; for example, the baseband communication module in the SOC may be kept separate from the other parts as an independent component. The components in the SOC can be connected to each other through a bus or other connecting lines. The PMU, voice codec, RF module, and the like outside the SOC usually include analog circuit parts, so they are often kept outside the SOC and are not integrated with it.
In Figure 2, the PMU is connected to the mains or a battery to supply power to the SOC, and the mains can be used to charge the battery. The voice codec, as the sound coding/decoding unit, is connected to an earphone or a speaker to convert between natural analog voice signals and digital voice signals that the SOC can process. The short-range module may include wireless fidelity (WiFi) and Bluetooth, and may optionally include infrared, near field communication (NFC), radio (FM), or global positioning system (GPS) modules. The RF module is connected to the baseband communication module in the SOC to convert between air-interface RF signals and baseband signals, that is, frequency mixing; for a mobile phone, receiving is down-conversion and sending is up-conversion. Both the short-range module and the RF module may have one or more antennas for signal transmission or reception. The baseband is used for baseband communication, includes one or more of multiple communication modes, and performs wireless communication protocol processing, including the processing of protocol layers such as the physical layer (layer 1), medium access control (MAC) (layer 2), and radio resource control (RRC) (layer 3); it can support various cellular communication standards, such as long term evolution (LTE) communication or 5G new radio (NR) communication. The sensor interface is the interface between the SOC and external sensors, and is used to collect and process data from at least one external sensor; an external sensor may be, for example, an accelerometer, a gyroscope, a control sensor, or an image sensor. The computing processor may be a general-purpose processor, for example, a central processing unit (CPU), or one or more integrated circuits, for example, one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), one or more microprocessors, or one or more field programmable gate arrays (FPGAs). The computing processor may include one or more cores and may selectively schedule other units. The RAM can store intermediate data produced during calculation or processing, such as intermediate calculation data of the CPU and the baseband. The ISP is used to process the data collected by the image sensor. The I/O is used for the SOC to interact with various external interfaces, for example, a universal serial bus (USB) interface used for data transmission. The memory may be one chip or a group of chips. The display screen may be a touch screen and is connected to the bus through the display interface; the display interface may perform data processing before image display, such as blending multiple layers to be displayed, buffering display data, or controlling and adjusting screen brightness.
The mobile terminal 200 involved in the embodiments of this application includes an image sensor. The image sensor can collect external signals such as light, and process and convert the external signals into a sensor signal, that is, an electrical signal. The sensor signal may be a static image signal or a dynamic video image signal. The image sensor may be, for example, a camera.
The mobile terminal 200 involved in the embodiments of this application further includes an image signal processor. The image sensor transmits the collected sensor signal to the image signal processor, and the image signal processor performs image signal processing on the sensor signal to obtain an image signal whose sharpness, color, brightness, and other aspects all conform to the characteristics of the human eye.
It can be understood that the image signal processor involved in the embodiments of this application may be one chip or a group of chips, that is, it may be integrated or independent. For example, the image signal processor included in the mobile terminal 200 may be an ISP chip integrated in the computing processor.
The mobile terminal 200 involved in the embodiments of this application has the function of taking photos or recording videos.
The neural network-based image processing method provided in the embodiments of this application mainly describes how to perform image signal processing based on a neural network.
For a better understanding of the solutions in the embodiments of this application, the terms involved in the embodiments of this application are explained first.
(1) Neural network
In the embodiments of this application, a neural network is used to process the multiple frames of images to be processed. A neural network is a network structure that processes information by imitating the behavioral characteristics of animal neural networks, and is also referred to simply as a network.
A neural network may be composed of neural units. A neural unit may be an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit may be as shown in formula (1):

$h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_{s} x_{s} + b\right)$  (1)

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which introduces nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be a region composed of several neural units.
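As a rough illustration (not part of the claimed method), the single-neuron computation of formula (1) can be sketched in Python; the sigmoid here is chosen only as one example of the activation function f:

```python
import math

def sigmoid(z):
    # example activation function f: maps the weighted sum into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, b):
    # formula (1): output = f(sum over s of W_s * x_s + b)
    z = sum(w_s * x_s for w_s, x_s in zip(w, x)) + b
    return sigmoid(z)

# illustrative inputs, weights, and bias (arbitrary values)
out = neuron_output(x=[0.5, -1.0, 2.0], w=[0.3, 0.8, -0.5], b=0.1)
```

Here the weighted sum is 0.15 - 0.8 - 1.0 + 0.1 = -1.55, so the output is sigmoid(-1.55), a value between 0 and 1 that could feed the next layer.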
As shown in Figure 3, which is a schematic diagram of the principle of a neural network, the neural network 300 has N processing layers, where N ≥ 3 and N is a natural number. The first layer of the neural network is the input layer 301, which is responsible for receiving the input signal, and the last layer is the output layer 303, which outputs the processing result of the neural network. The layers other than the first and last layers are intermediate layers 304, and these intermediate layers together form the hidden layer 302. Each intermediate layer in the hidden layer can both receive and output signals, and the hidden layer is responsible for processing the input signal. Each layer represents a logic level of signal processing; through multiple layers, a data signal can be processed by multiple levels of logic.
In some feasible embodiments, the input signal of the neural network may take various forms, such as a voice signal, a text signal, an image signal, or a temperature signal. In this embodiment, the processed image signal may be any of various sensor signals, such as a landscape signal captured by a camera (image sensor), an image signal of a community environment captured by a monitoring device, or a facial signal of a human face acquired by an access control system. The input signals of the neural network also include various other computer-processable engineering signals, which are not listed one by one here. Performing deep learning on the image signal with the neural network can improve image quality.
(2) Deep neural network
A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers. Dividing the DNN according to the positions of the different layers, the layers inside the DNN can be classified into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer.
Although a DNN looks complicated, the work of each layer is actually not complicated. Simply put, each layer computes the linear relationship expression y = α(Wx + b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, the numbers of coefficients W and offset vectors b are also large. These parameters are defined in the DNN as follows, taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the fourth neuron in the second layer to the second neuron in the third layer is defined as $W^{3}_{24}$, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer. In summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as $W^{L}_{jk}$.
It should be noted that the input layer has no W parameter. In a deep neural network, more hidden layers enable the network to better portray complex situations in the real world. In theory, a model with more parameters has higher complexity and greater "capacity", which means that it can complete more complex learning tasks. Training a deep neural network is the process of learning the weight matrices, and its ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
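As a rough illustration (not part of the claimed method), the layer-by-layer computation y = α(Wx + b) and the indexing convention $W^{L}_{jk}$ can be sketched as follows; the layer sizes, random weights, and the choice of ReLU as α() are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # one possible choice of the activation function alpha()
    return np.maximum(z, 0.0)

# hypothetical three-layer DNN; layer sizes are illustrative only
sizes = [5, 4, 3]  # input layer, one hidden layer, output layer
weights = [rng.standard_normal((sizes[i + 1], sizes[i]))
           for i in range(len(sizes) - 1)]
biases = [rng.standard_normal(sizes[i + 1]) for i in range(len(sizes) - 1)]

def forward(x):
    # every layer after the input computes y = alpha(W x + b);
    # the input layer itself has no W parameter
    for W, b in zip(weights, biases):
        x = relu(W @ x + b)
    return x

# weights[1][j, k] plays the role of W^L_jk: the coefficient from
# neuron k of layer L-1 to neuron j of layer L (L = output layer here)
y = forward(np.ones(5))
```

Training would adjust `weights` and `biases`; here they stay fixed, since only the forward data flow is being shown.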
(3) Convolutional neural network
A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor composed of convolutional layers and sub-sampling layers, and the feature extractor can be regarded as a filter. A convolutional layer is a layer of neurons in the convolutional neural network that performs convolution processing on the input signal. In a convolutional layer of a convolutional neural network, a neuron may be connected only to some neurons of the neighboring layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights are the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of location. A convolution kernel can be initialized as a matrix of random size, and during the training of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, a direct benefit of weight sharing is that it reduces the connections between the layers of the convolutional neural network while also reducing the risk of overfitting.
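As a rough illustration (not part of the claimed method) of weight sharing, the sketch below slides one kernel over a small image to produce one feature plane; the image contents and the 3x3 averaging kernel are illustrative assumptions only:

```python
import numpy as np

def conv2d(image, kernel):
    # the single shared kernel slides over the whole image: the same
    # weights are applied at every position (weight sharing), so the way
    # image information is extracted does not depend on location
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0        # a 3x3 averaging kernel
feature_map = conv2d(image, kernel)   # one feature plane, shape (4, 4)
```

In a trained CNN the kernel entries would be learned rather than fixed, and a convolutional layer would hold several such kernels, one per feature plane.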
The neural network in the embodiments of this application may be a convolutional neural network, and may of course also be another type of neural network, such as a recurrent neural network (RNN).
It should be understood that an image in the embodiments of this application may be a static image (also called a static picture) or a dynamic image (also called a dynamic picture). For example, an image in this application may be a video or a dynamic picture, or an image in this application may be a static picture or a photo. For ease of description, in the following embodiments of this application, static images and dynamic images are collectively referred to as images.
The neural network-based image processing method provided by the embodiments of this application is introduced below. The method is performed by a neural network-based image processing apparatus. The neural network-based image processing apparatus may be any apparatus or device with an image processing function. For example, the method is performed by the mobile terminal 200 shown in Figure 2, by a device related to the mobile terminal, or by some of the components included in the mobile terminal.
In the embodiments of this application, multiple neural networks are used for image processing. For example, two neural networks, denoted as the first neural network and the second neural network, are used to process the images to be processed. The first neural network and the second neural network conform to the above description of neural networks.
As shown in Figure 4, the neural network-based image processing method provided by the embodiment of this application includes the following steps.
S401: Input multiple frames of to-be-processed images into the first neural network for operation to obtain a first image.
S402: Input multiple image groups into multiple second neural networks respectively for operation to obtain multiple frames of second images respectively, where each image group includes the first image and one frame of the multiple frames of to-be-processed images.
As shown in Figure 5, n frames of to-be-processed images are taken as an example, where n is an integer greater than or equal to 2. The n frames of to-be-processed images are input into the first neural network to obtain the first image. The first image and the first frame of the to-be-processed images are input into the first second neural network, the first image and the second frame are input into the second second neural network, and so on, until the first image and the n-th frame are input into the n-th second neural network. It can be understood that each second neural network receives the first image and one frame of the to-be-processed images, and each second neural network outputs one frame of a second image: the first second neural network outputs the first frame of the second images, the second second neural network outputs the second frame, and so on, until the n-th second neural network outputs the n-th frame of the second images.
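The data flow of S401/S402 and Figure 5 can be sketched as follows. The two network functions below are toy stand-ins (simple arithmetic, not learned networks, and not the claimed networks) used only to show how one first-network pass feeds all n second-network passes:

```python
import numpy as np

def first_network(frames):
    # toy stand-in for the heavier first neural network: averaging the
    # frames is a crude proxy for extracting content shared by all frames
    # (e.g. the static region); the real network is learned
    return np.mean(frames, axis=0)

def second_network(frame, first_image):
    # toy stand-in for one lightweight second neural network, which
    # combines a single to-be-processed frame with the first image
    return 0.5 * frame + 0.5 * first_image

def process(frames):
    # S401: run the first network once over all n frames
    first_image = first_network(frames)
    # S402: run one second network per image group (first image + one frame)
    return [second_network(f, first_image) for f in frames]

frames = [np.full((4, 4), float(k)) for k in range(3)]  # n = 3 toy frames
second_images = process(frames)
```

The point of the structure is that the expensive shared computation happens once in `first_network`, while each `second_network` call only handles one frame plus the precomputed first image.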
Through the method shown in FIG. 4, the first image is obtained by passing the multiple frames of to-be-processed images through the first neural network, that is, the image features common to the multiple frames are obtained. The first image and one frame of the to-be-processed images are then passed through one second neural network to obtain a second image, so that multiple frames of second images are obtained respectively. Because the first neural network and the second neural networks are used separately to process the multiple frames of to-be-processed images, and the first image is reused in the processing of each second neural network, the computational complexity of the second neural networks is reduced while the image processing quality is preserved.
For example, the first neural network is used to process the static region of the multiple frames of to-be-processed images. The first image may be an image of the static region shared by the multiple frames of to-be-processed images. Because the features of the static region account for a high proportion of the network complexity, the static-region features are processed first by the first neural network, and the processing result is fed into each second neural network as an intermediate result, so the complexity required of the second neural networks is reduced. By using the two neural networks in combination, a lower complexity can be achieved than with a single neural network when processing multiple frames of images.
Optionally, the second neural network is used to process the moving region of the multiple frames of to-be-processed images.
Some optional designs of the neural-network-based image processing method provided in the embodiments of the present application are described below.
In a possible implementation, as shown in FIG. 6, the n frames of to-be-processed images are input into the first neural network to obtain the first image; the first image and the 1st frame are input into the 1st second neural network, the first image and the 2nd frame are input into the 2nd second neural network, and so on, until the first image and the nth frame are input into the nth second neural network. It can be understood that each second neural network receives the first image and one frame of the to-be-processed images. Further, each second neural network outputs one frame of a third image, and the first image and the third image are merged to obtain a second image. Each second neural network thus yields one frame of a second image: the 1st second neural network yields the 1st frame of the second images, the 2nd yields the 2nd frame, and so on, until the nth yields the nth frame.
Optionally, merging the first image and the third image may also be regarded as combining the first image and the third image, for example performing a matrix addition on the first image and the third image to obtain the second image. For example, if the first image is the image of the static region processed by the first neural network, and the third image is the image of the moving region processed by a second neural network, then merging the first image and the third image means merging the processed static-region image with the processed moving-region image to obtain a complete second image.
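The merge step described above can be illustrated as an element-wise matrix addition; the channel count and pixel values below are hypothetical:

```python
import numpy as np

# Hypothetical 4-channel (R, Gr, Gb, B) images of size 2x2.
first_image = np.full((4, 2, 2), 10.0)  # processed static-region image
third_image = np.full((4, 2, 2), 5.0)   # processed moving-region image

# Merging the first image and the third image = matrix addition,
# yielding the complete second image.
second_image = first_image + third_image
assert (second_image == 15.0).all()
```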
In a possible implementation, the first neural network and/or the second neural network do not divide or recognize the to-be-processed images into a static region and/or a moving region, but process each to-be-processed image as a whole. For example, the characteristics of the first neural network itself cause it to process the image areas with static-region features in the to-be-processed images, while the characteristics of the second neural network itself cause it to process the image areas with moving-region features in the to-be-processed images and in the intermediate images processed by the first neural network. As another example, the characteristics of the first neural network itself cause it to apply higher-intensity processing to image areas with static-region features in the to-be-processed images and lower-intensity processing to image areas with moving-region features, while the characteristics of the second neural network itself cause it to apply higher-intensity processing to image areas with moving-region features, and lower-intensity processing to image areas with static-region features, in the to-be-processed images and the intermediate images processed by the first neural network. Correspondingly, the first image may be the image of the static region after processing by the first neural network, and the third image may be the image of the moving region after processing by the second neural network. It should be understood that, for convenience of description, the foregoing implementations are also briefly described as the first neural network processing static-region images and the second neural network processing moving-region images.
In another possible implementation, as shown in FIG. 7, the difference from the implementation of FIG. 6 is that the second neural network also receives an intermediate feature map output by the first neural network. The intermediate feature map participates in the computation of the second neural network to obtain the third image.
For example, one frame of the to-be-processed images and the intermediate feature map may be vector-spliced or vector-added to obtain a to-be-processed image matrix, which is then input into the second neural network for computation to obtain the third image. It can be understood that the vector splicing of a frame with the intermediate feature map may be regarded as an internal processing step of the second neural network; what is input into the second neural network is a single overall matrix, namely the to-be-processed image matrix.
Optionally, the second neural network also receives multiple intermediate feature maps output by the first neural network. One frame of the to-be-processed images and the first intermediate feature map may be vector-spliced or vector-added to obtain the to-be-processed image matrix, which is input into the second neural network for computation to obtain an intermediate feature map of the second neural network. The intermediate feature map of the second neural network and an intermediate feature map of the first neural network are then vector-spliced or matrix-added, and the result is processed by the remaining network layers of the second neural network to obtain the third image.
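The vector splicing of one to-be-processed frame with an intermediate feature map (concatenation along the channel dimension) can be sketched as follows; the channel counts are hypothetical:

```python
import numpy as np

frame = np.zeros((4, 8, 8))         # one to-be-processed frame, 4 channels
feature_map = np.zeros((16, 8, 8))  # intermediate feature map from the first network

# Vector splicing: join along the channel axis to form the single
# overall matrix that is fed into the second neural network.
image_matrix = np.concatenate([frame, feature_map], axis=0)
assert image_matrix.shape == (20, 8, 8)
```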
Optionally, the first neural network includes one image output layer and multiple feature-map output layers. The image output layer outputs the first image, and the feature-map output layers output the multiple intermediate feature maps.
In the embodiments of the present application, no specific distinction is made between the first image, the second image, and the third image. For example, no specific distinction is made between features such as the color and texture of the first image, the second image, and the third image.
In the embodiments of the present application, the multiple frames of to-be-processed images include multiple temporally adjacent frames. Optionally, the temporally adjacent frames include multiple temporally consecutive frames. After the multiple frames of to-be-processed images are processed by the multiple second neural networks, the processed images also comprise the corresponding number of frames. For example, the second images obtained through the multiple second neural networks comprise multiple frames, each frame of the to-be-processed images corresponding to one frame of the second images, and each frame of the second images corresponding to the first image and one frame of the third images.
In the embodiments of the present application, optionally, the format of the to-be-processed images may be a red-green-blue (RGB) format, a luminance-chrominance (YUV) format, or a Bayer format; this is not limited in the present application.
For example, the number of to-be-processed frames is 4, and the 4 frames of to-be-processed images are input into the first neural network for computation to obtain the first image; that is, the first image is obtained from the 4 frames of to-be-processed images together. Four image groups are respectively input into 4 second neural networks for computation to obtain 4 frames of second images, where each image group includes the first image and one of the 4 to-be-processed frames. For example, the 1st to-be-processed frame corresponds to the 1st frame of the second images, the 2nd to-be-processed frame to the 2nd frame of the second images, the 3rd to-be-processed frame to the 3rd frame of the second images, and the 4th to-be-processed frame to the 4th frame of the second images.
Optionally, the 1st to-be-processed frame is input into the 1st second neural network for computation to obtain the 1st frame of the third images, and the first image and the 1st frame of the third images are merged to obtain the 1st frame of the second images. The 2nd to-be-processed frame is input into the 2nd second neural network for computation to obtain the 2nd frame of the third images, which is merged with the first image to obtain the 2nd frame of the second images. The 3rd to-be-processed frame is input into the 3rd second neural network for computation to obtain the 3rd frame of the third images, which is merged with the first image to obtain the 3rd frame of the second images. The 4th to-be-processed frame is input into the 4th second neural network for computation to obtain the 4th frame of the third images, which is merged with the first image to obtain the 4th frame of the second images.
In the embodiments of the present application, the first neural network and the second neural networks may be combined into an image processing system, and the image processing system is used to process the to-be-processed images to improve image or video quality. The processing may include operations such as noise reduction and mosaic-effect removal.
In general, the complexity of the first neural network is higher than that of the second neural network; for example, the first neural network's capability of processing the static region of an image is greater than that of the second neural network.
In some technologies, multiple frames of images are often synthesized by a neural network into a single output frame to improve image or video quality. However, such a neural network requires very high complexity, and video scenarios demand very high processing speed. For example, real-time video processing on a mobile terminal requires that video with a resolution of 8K be processed at 30 frames/s, that is, a frame rate of 30. Under such processing-speed requirements, if a neural network is used to synthesize multiple frames into one output frame, large computational complexity and resource consumption must be faced, and a long delay is incurred. Conversely, if the complexity of the neural network is blindly reduced and a lower-complexity network is used, the image or video quality suffers.
In the embodiments of the present application, the first neural network handles the computationally heavy processing shared across the multiple frames, while the second neural networks handle the computationally lighter processing of each individual frame and output the multiple processed frames. The combined computation of the first neural network and the second neural networks is thus amortized over the multiple frames, so that the processing complexity per frame is reduced compared with the above solution while the image or video quality is preserved. For example, the first image is an image of the static region and the third image is an image of the moving region: the first neural network processes the static region of the multiple frames of to-be-processed images, and the second neural networks process the moving region. Through the joint processing of the two neural networks, the image processing system provided by the present application achieves lower complexity during image processing while ensuring image or video quality, improving the applicability of deep learning technology in the field of image signal processing.
The following description takes the case where the first neural network and the second neural networks are convolutional neural networks as an example. Assume that there are 4 frames of to-be-processed images and 4 frames of second images. The format of the to-be-processed images is a Bayer-format image, specifically the RGrGbB format; one frame of an RGrGbB-format image includes 4 channels (R, Gr, Gb, B). After the 4 to-be-processed frames pass through the image processing system, 4 processed frames are output. The image processing system includes the first neural network and the second neural networks.
As shown in FIG. 8, the 4 consecutive RGrGbB frames to be processed are split into 4*4=16 channels: (R1, Gr1, Gb1, B1, R2, Gr2, Gb2, B2, R3, Gr3, Gb3, B3, R4, Gr4, Gb4, B4). The 4 consecutive RGrGbB frames are input into the first neural network to obtain the first image (4 channels (R, Gr, Gb, B)).
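The split of the 4 consecutive RGrGbB frames into 4*4=16 channels can be sketched as follows; the spatial size is hypothetical:

```python
import numpy as np

# 4 hypothetical frames, each with 4 Bayer channels (R, Gr, Gb, B).
frames = [np.zeros((4, 8, 8)) for _ in range(4)]

# Stack the per-frame channels in order:
# (R1, Gr1, Gb1, B1, R2, ..., R4, Gr4, Gb4, B4).
network_input = np.concatenate(frames, axis=0)
assert network_input.shape == (16, 8, 8)  # 4 frames x 4 channels = 16 channels
```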
The 1st RGrGbB frame (4 channels (R1, Gr1, Gb1, B1)) is input into the 1st second neural network to obtain the 1st frame of the third images (4 channels (R1', Gr1', Gb1', B1')); the first image (R, Gr, Gb, B) and the 1st frame of the third images (4 channels (R1', Gr1', Gb1', B1')) are merged to obtain the 1st frame of the second images (4 channels (R1'', Gr1'', Gb1'', B1'')).

The 2nd RGrGbB frame (4 channels (R2, Gr2, Gb2, B2)) is input into the 2nd second neural network to obtain the 2nd frame of the third images (4 channels (R2', Gr2', Gb2', B2')); the first image (R, Gr, Gb, B) and the 2nd frame of the third images (4 channels (R2', Gr2', Gb2', B2')) are merged to obtain the 2nd frame of the second images (4 channels (R2'', Gr2'', Gb2'', B2'')).

The 3rd RGrGbB frame (4 channels (R3, Gr3, Gb3, B3)) is input into the 3rd second neural network to obtain the 3rd frame of the third images (4 channels (R3', Gr3', Gb3', B3')); the first image (R, Gr, Gb, B) and the 3rd frame of the third images (4 channels (R3', Gr3', Gb3', B3')) are merged to obtain the 3rd frame of the second images (4 channels (R3'', Gr3'', Gb3'', B3'')).

The 4th RGrGbB frame (4 channels (R4, Gr4, Gb4, B4)) is input into the 4th second neural network to obtain the 4th frame of the third images (4 channels (R4', Gr4', Gb4', B4')); the first image (R, Gr, Gb, B) and the 4th frame of the third images (4 channels (R4', Gr4', Gb4', B4')) are merged to obtain the 4th frame of the second images (4 channels (R4'', Gr4'', Gb4'', B4'')).
Exemplarily, the architecture of the first neural network is shown in FIG. 9a and FIG. 9b. Because the drawing of the first neural network is too large, the first neural network is split into two parts, shown in FIG. 9a and FIG. 9b respectively; FIG. 9a and FIG. 9b together form the architecture of the first neural network. The add layer at the end of FIG. 9a connects to the first layer in FIG. 9b.
In FIG. 9a and FIG. 9b, convolutional layers are represented by rectangular boxes. A label such as "Conv2d+bias stride=2 3x3_16_32" in a rectangular box denotes a convolutional layer, where Conv2d denotes a two-dimensional convolution, bias denotes the bias term, 1x1 or 3x3 denotes the convolution kernel size, stride denotes the stride, and a suffix such as _32_16 denotes the numbers of input and output feature maps: 32 means that 32 feature maps are input to the layer, and 16 means that 16 feature maps are output by the layer.
split denotes a split layer, which splits the feature maps along the channel dimension. For example, split 2 splits an image along the feature-map dimension: an input of 32 feature maps becomes two images of 16 feature maps each after this operation.
concat denotes a skip-connection concatenation layer, which merges images along the feature-map dimension, for example merging two images of 16 feature maps each into one image of 32 feature maps.
add denotes a matrix addition operation.
The first neural network shown in FIG. 9a and FIG. 9b is a typical convolutional neural network and handles the static region of multiple frames of to-be-processed images well. Assume that 4 frames of to-be-processed images are input into the first neural network, and the first neural network outputs the first image.
Optionally, instead of a typical convolutional neural network, a multi-branch neural network may also be used. The convolutional layers of the first neural network may also adopt a group convolution operation. Group convolution is a special convolutional layer: assume that the previous layer outputs N feature maps, that is, the number of channels is N (in other words, the previous layer has N convolution kernels), and assume that the number of groups of the group convolution is M. The operation of the group convolution layer is to first divide the N channels into M parts; each group corresponds to N/M channels, the convolution of each group is performed independently, and after completion the output feature maps are vector-spliced (concatenated) together as the output channels of this layer. The group convolution operation can achieve the same or similar technical effect as the multi-branch approach.
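The group convolution described above (N channels split into M groups, each group convolved independently, outputs concatenated) can be sketched as follows — assuming, purely for illustration, a 1x1 convolution implemented as a per-group channel mix:

```python
import numpy as np

def group_conv_1x1(x, weights, groups):
    # x: (N, H, W) feature maps; weights: one (out_c, N // groups) matrix
    # per group. Each group of channels is convolved (here: 1x1, i.e. a
    # channel mix) independently, and the group outputs are concatenated
    # along the channel axis, like the concat layer described above.
    n = x.shape[0]
    per_group = n // groups
    outs = []
    for g in range(groups):
        xs = x[g * per_group:(g + 1) * per_group]         # this group's channels
        w = weights[g]                                    # (out_c, per_group)
        outs.append(np.tensordot(w, xs, axes=([1], [0]))) # mix the group's channels
    return np.concatenate(outs, axis=0)

x = np.ones((8, 4, 4))                         # N = 8 input channels
weights = [np.ones((2, 4)) for _ in range(2)]  # M = 2 groups, 4 in / 2 out each
y = group_conv_1x1(x, weights, groups=2)
assert y.shape == (4, 4, 4)  # 2 groups x 2 output channels = 4 channels
```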
Exemplarily, the architecture in which 4 second neural networks process 4 frames of to-be-processed images is shown in FIG. 10a and FIG. 10b. The first image output in FIG. 10a is input into the first layer in FIG. 10b. In FIG. 10a and FIG. 10b, convolutional layers are represented by rectangular boxes; for the specific notation, refer to the explanation of FIG. 9a and FIG. 9b.
As shown in FIG. 10a, the 1st to-be-processed frame is input into the 1st second neural network to obtain the 1st frame of the third images, and the first image and the 1st frame of the third images are merged to obtain the 1st frame of the second images.

The 2nd to-be-processed frame is input into the 2nd second neural network to obtain the 2nd frame of the third images, and the first image and the 2nd frame of the third images are merged to obtain the 2nd frame of the second images.

As shown in FIG. 10b, the 3rd to-be-processed frame is input into the 3rd second neural network to obtain the 3rd frame of the third images, and the first image and the 3rd frame of the third images are merged to obtain the 3rd frame of the second images.

The 4th to-be-processed frame is input into the 4th second neural network to obtain the 4th frame of the third images, and the first image and the 4th frame of the third images are merged to obtain the 4th frame of the second images.
Optionally, as shown in FIG. 11a and FIG. 11b, the first neural network may also output intermediate feature maps, which are input into the second neural network; the intermediate feature maps participate in the computation of the second neural network to obtain the third image. For example, the intermediate feature map output by the second convolutional layer of the first neural network is vector-spliced with the image output by the first convolutional layer of the second neural network to obtain a processed image. As another example, the intermediate feature map output by the fourth convolutional layer of the first neural network is vector-spliced with the intermediate feature map output by the third convolutional layer of the second neural network to obtain a processed image.
In the embodiments of the present application, before the first neural network and the second neural network are used, the neural network models need to be trained. During the training of the neural networks, the training data may include training images and ground-truth images.
When training the model of the first neural network: first, the collected training images are processed against the ground-truth image of the first image, and an image is obtained and output. The output image is compared with the ground-truth image of the first image until the network converges, completing the training of the model of the first neural network. Network convergence here may mean, for example, that the difference between the output image and the ground-truth image of the first image is smaller than a set first threshold.
The parameters of the first neural network obtained by training on the first image are then fixed, the collected training images are processed against the ground-truth image of the third image, and an image is obtained and output. The output image is compared with the ground-truth image of the third image until the network converges, completing the training of the model of the second neural network. Network convergence here may mean that the difference between the output image and the ground-truth image of the third image is smaller than a set second threshold.
In the embodiments of the present application, the first neural network and the second neural networks form an image processing system, which is used to process multiple frames of to-be-processed images and output multiple processed frames. The complexity of the second neural network is lower than that of the first neural network. Compared with solutions in some technologies that process multiple frames into one frame through a basic network, the computation required by the image processing system for each to-be-processed frame is reduced to a certain extent, which in turn reduces the image processing delay while ensuring image or video quality. The computing power of the two neural networks for processing multiple frames of to-be-processed images is illustrated below with an example. Assume that there are 4 frames of to-be-processed images, that the first neural network and the second neural networks output 4 processed frames, and that the basic network outputs one frame after processing. The first neural network is shown in FIG. 9a and FIG. 9b, and the second neural network in FIG. 10a and FIG. 10b.
The computation amount of the first neural network is approximately the same as that of the basic network, about 12000 MAC. For example, the network complexity of the basic network is calculated as follows:
(23*32*1*1+32*16*3*3)/4 # 1336
+16*32*3*3/16 # 288
+(32*32*3*3)/16 # 576
+32*64*3*3/64 # 288
+(64*96*3*3+(48*48*3*3*2+96*96*1*1*1)*2+96*64*3*3+32*32*3*3*2+64*64*1*1*1+64*64*3*3*1)/64 # 4240
+(64*32*2*2)/16+(concat) # 512
+(64*32*3*3)/16 # 1152
+(32*16*2*2)/4+(concat) # 512
+(32*16*3*3)/4 # 1152
+(16*16*3*3+16*16*3*3+16*4*3*3)/4 # 1296
= 11352
The network complexity of the second neural network is calculated as follows:
(4*16*3*3)/4 # 144
+16*32*3*3/16 # 288
+(32*32*3*3)/16 # 576
+32*64*3*3/64 # 288
+(64*64*3*3)/64 # 576
+(64*32*2*2)/16+(concat) # 512
+(64*32*3*3)/16 # 1152
+(32*16*2*2)/4+(concat) # 512
+(32*4*3*3)/4 # 288
= 4336
It can be seen that the computation amount of the second neural network is about 4000; assume it is 4000.
Then, when 4 to-be-processed frames are input and 4 processed frames are output simultaneously, the computation amount of the image processing system is (4000*4+12000)/4=7000; when 8 to-be-processed frames are input and 8 processed frames are output simultaneously, it is (4000*8+12000)/8=5500; when 16 to-be-processed frames are input and 16 processed frames are output simultaneously, it is (4000*16+12000)/16=4750. All of these are smaller than the computing power of 12000 required to process multiple frames into one frame through the basic network. It can be seen that the multi-frame-in, multi-frame-out solution using the first neural network and the second neural networks provided by the embodiments of the present application can reduce the computation amount and thereby the image processing delay, meeting the delay requirements of video scenarios. The network computing power required for video with a resolution of 8 thousand (K) pixels and a frame rate of 30 frames per second is about 50000 MAC; when 8 frames are output, the computation amount of the image processing system of the embodiments of the present application can basically meet the network computing power requirement of 8K 30-fps video.
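The per-frame computation figures above can be verified arithmetically, using the rounded per-network costs from the text (12000 MAC for the first neural network, 4000 MAC for one second neural network):

```python
FIRST_NN = 12000   # rounded cost of the first neural network (computed: 11352)
SECOND_NN = 4000   # rounded cost of one second neural network (computed: 4336)

def per_frame_cost(n_frames):
    # The n second networks each run once and the first network runs once;
    # the total is amortized over the n output frames.
    return (SECOND_NN * n_frames + FIRST_NN) / n_frames

assert per_frame_cost(4) == 7000
assert per_frame_cost(8) == 5500
assert per_frame_cost(16) == 4750
assert all(per_frame_cost(n) < 12000 for n in (4, 8, 16))  # below the basic network
```

Note that (4000*16+12000)/16 evaluates to 4750 exactly; as n grows, the per-frame cost approaches the 4000-MAC cost of one second network.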
需要说明的是,本申请中的各个应用场景中的举例仅仅表现了一些可能的实现方式,是为了对本申请的方法更好的理解和说明。本领域技术人员可以根据本申请提供的方法,得到一些演变形式的举例。It should be noted that the examples in the various application scenarios of this application merely show some possible implementations, and are intended to provide a better understanding and description of the methods of this application. Those skilled in the art can derive some evolved examples from the methods provided in this application.
为了实现上述本申请实施例提供的方法中的各功能,基于神经网络的图像处理装置可以包括硬件结构和/或软件模块,以硬件结构、软件模块、或硬件结构加软件模块的形式来实现上述各功能。上述各功能中的某个功能以硬件结构、软件模块、还是硬件结构加软件模块的方式来执行,取决于技术方案的特定应用和设计约束条件。In order to implement the functions in the methods provided in the foregoing embodiments of the present application, the neural network-based image processing apparatus may include a hardware structure and/or a software module, and implement the foregoing functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Whether a certain one of the foregoing functions is executed by a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraints of the technical solution.
如图12所示,基于同一技术构思,本申请实施例还提供了一种基于神经网络的图像处理装置1200,该基于神经网络的图像处理装置1200可以是移动终端或任意具有图像处理功能的设备。一种设计中,该基于神经网络的图像处理装置1200可以包括执行上述方法实施例中各方法/操作/步骤/动作所一一对应的模块,该模块可以是硬件电路,也可是软件,也可以是硬件电路结合软件实现。一种设计中,该基于神经网络的图像处理装置1200可以包括运算模块1201。As shown in FIG. 12, based on the same technical concept, an embodiment of the present application further provides a neural network-based image processing apparatus 1200. The neural network-based image processing apparatus 1200 may be a mobile terminal or any device with an image processing function. In one design, the neural network-based image processing apparatus 1200 may include modules corresponding one-to-one to the methods/operations/steps/actions in the foregoing method embodiments; each module may be a hardware circuit, software, or a hardware circuit combined with software. In one design, the neural network-based image processing apparatus 1200 may include an arithmetic module 1201.
运算模块1201用于将多帧待处理图像输入第一神经网络进行运算,以获得第一图像;以及将多个图像组分别输入多个第二神经网络进行运算,以分别获得多帧第二图像,其中,每个所述图像组包括所述第一图像和所述多帧待处理图像中的一帧图像。The arithmetic module 1201 is configured to input multiple frames of images to be processed into a first neural network for operation to obtain a first image; and to input multiple image groups into multiple second neural networks respectively for operation to respectively obtain multiple frames of second images, wherein each of the image groups includes the first image and one frame of the multiple frames of images to be processed.
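The data flow handled by arithmetic module 1201 can be sketched as follows. The concrete first and second neural networks are not specified in this passage, so simple pixelwise functions stand in for them here (a placeholder mean and blend, an assumption made purely for illustration); each "frame" is a flat list of pixel values.

```python
# Placeholder stand-ins for the first and second neural networks; the real
# networks are not specified here, so simple pixelwise functions are used
# purely to illustrate the multi-frame-in, multi-frame-out data flow.

def first_network(frames):
    # Fuse all input frames into one "first image" (placeholder: pixel mean).
    n = len(frames)
    return [sum(pixels) / n for pixels in zip(*frames)]

def second_network(first_image, frame):
    # Process one image group (first image + one input frame) into one
    # "second image" (placeholder: pixelwise blend).
    return [0.5 * a + 0.5 * b for a, b in zip(first_image, frame)]

def process(frames):
    first_image = first_network(frames)   # one shared first-network pass
    # One second-network pass per image group, one output frame per group.
    return [second_network(first_image, f) for f in frames]

frames = [[0.0, 2.0], [2.0, 4.0]]  # two tiny 2-pixel "frames"
print(process(frames))             # [[0.5, 2.5], [1.5, 3.5]]
```

Note that `process` returns exactly as many output frames as it receives, which is the multi-frame-input, multi-frame-output property the embodiment relies on.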
本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,另外,在本申请各个实施例中的各功能模块可以集成在一个处理器中,也可以是单独物理存在,也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。The division of modules in the embodiments of this application is illustrative and is merely a logical function division; in actual implementation, there may be other division manners. In addition, the functional modules in the embodiments of this application may be integrated into one processor, may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
基于同一技术构思,如图13所示,本申请实施例还提供一种基于神经网络的图像处理装置1300。该神经网络的图像处理装置1300包括处理器1301。该处理器1301用于调用一组程序,以使得上述方法实施例被执行。该神经网络的图像处理装置1300还包括存储器1302,存储器1302用于存储处理器1301执行的程序指令和/或数据。存储器1302和处理器1301耦合。本申请实施例中的耦合是装置、单元或模块之间的间接耦合或通信连接,可以是电性,机械或其它的形式,用于装置、单元或模块之间的信息交互。处理器1301可能和存储器1302协同操作。处理器1301可能执行存储器1302中存储的程序指令。存储器1302可以包括于处理器1301中。Based on the same technical concept, as shown in FIG. 13, an embodiment of the present application also provides an image processing device 1300 based on a neural network. The image processing device 1300 of the neural network includes a processor 1301. The processor 1301 is used to call a group of programs to enable the foregoing method embodiments to be executed. The image processing device 1300 of the neural network further includes a memory 1302, and the memory 1302 is configured to store program instructions and/or data executed by the processor 1301. The memory 1302 is coupled with the processor 1301. The coupling in the embodiments of the present application is an indirect coupling or communication connection between devices, units or modules, and may be in electrical, mechanical or other forms, and is used for information exchange between devices, units or modules. The processor 1301 may operate in cooperation with the memory 1302. The processor 1301 may execute program instructions stored in the memory 1302. The memory 1302 may be included in the processor 1301.
该基于神经网络的图像处理装置1300可以为芯片系统。本申请实施例中,芯片系统可以由芯片构成,也可以包含芯片和其他分立器件。The neural network-based image processing device 1300 may be a chip system. In the embodiments of the present application, the chip system may be composed of chips, or may include chips and other discrete devices.
处理器1301用于将多帧待处理图像输入第一神经网络进行运算,以获得第一图像;以及将多个图像组分别输入多个第二神经网络进行运算,以分别获得多帧第二图像,其中,每个所述图像组包括所述第一图像和所述多帧待处理图像中的一帧图像。The processor 1301 is configured to input multiple frames of images to be processed into a first neural network for operation to obtain a first image; and to input multiple image groups into multiple second neural networks respectively for operation to respectively obtain multiple frames of second images, wherein each of the image groups includes the first image and one frame of the multiple frames of images to be processed.
处理器1301可以是通用处理器、数字信号处理器、专用集成电路、现场可编程门阵列或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件,可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件处理器执行完成,或者用处理器中的硬件及软件模块组合执行完成。The processor 1301 may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed with reference to the embodiments of the present application may be directly embodied as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
存储器1302可以是非易失性存储器,比如硬盘(hard disk drive,HDD)或固态硬盘(solid-state drive,SSD)等,还可以是易失性存储器(volatile memory),例如随机存取存储器(random-access memory,RAM)。存储器是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。本申请实施例中的存储器还可以是电路或者其它任意能够实现存储功能的装置,用于存储程序指令和/或数据。The memory 1302 may be a non-volatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or may be a volatile memory, such as a random-access memory (RAM). The memory is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory in the embodiments of the present application may also be a circuit or any other apparatus capable of implementing a storage function, configured to store program instructions and/or data.
本申请上述方法实施例所描述的各个操作和功能中的部分或全部,可以用芯片或集成电路来完成。Part or all of the various operations and functions described in the foregoing method embodiments of the present application may be completed by chips or integrated circuits.
本申请实施例还提供一种芯片,包括处理器,用于支持该基于神经网络的图像处理装置实现上述方法实施例所涉及的功能。在一种可能的设计中,该芯片与存储器连接或者该芯片包括存储器,该存储器用于保存该图像处理装置必要的程序指令和数据。An embodiment of the present application further provides a chip including a processor, configured to support the neural network-based image processing apparatus in implementing the functions involved in the foregoing method embodiments. In a possible design, the chip is connected to a memory, or the chip includes a memory, and the memory is used to store the program instructions and data necessary for the image processing apparatus.
本申请实施例提供了一种计算机可读存储介质,存储有计算机程序,该计算机程序包括用于执行上述方法实施例的指令。The embodiment of the present application provides a computer-readable storage medium that stores a computer program, and the computer program includes instructions for executing the foregoing method embodiments.
本申请实施例提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述方法实施例。The embodiments of the present application provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the foregoing method embodiments.
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art should understand that the embodiments of the present application can be provided as methods, systems, or computer program products. Therefore, this application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。This application is described with reference to flowcharts and/or block diagrams of the methods, devices (systems), and computer program products according to the embodiments of this application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
尽管已描述了本申请的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例作出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本申请范围的所有变更和修改。Although the preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic creative concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the present application.
显然,本领域的技术人员可以对本申请实施例进行各种改动和变型而不脱离本申请实施例的精神和范围。这样,倘若本申请实施例的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。Obviously, those skilled in the art can make various changes and modifications to the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. In this way, if these modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application also intends to include these modifications and variations.

Claims (19)

  1. 一种基于神经网络的图像处理方法,其特征在于,包括:A neural network-based image processing method, which is characterized in that it includes:
    将多帧待处理图像输入第一神经网络进行运算,以获得第一图像;Input multiple frames of to-be-processed images into the first neural network for calculation to obtain the first image;
    将多个图像组分别输入多个第二神经网络进行运算,以分别获得多帧第二图像,其中,每个所述图像组包括所述第一图像和所述多帧待处理图像中的一帧图像。inputting multiple image groups into multiple second neural networks respectively for operation, to respectively obtain multiple frames of second images, wherein each of the image groups includes the first image and one frame of the multiple frames of images to be processed.
  2. 根据权利要求1所述的方法,其特征在于,所述将多个图像组分别输入多个第二神经网络进行运算,包括:The method according to claim 1, wherein said inputting multiple image groups into multiple second neural networks for calculation respectively comprises:
    将所述多帧待处理图像中的一帧图像输入所述第二神经网络进行运算,以获得第三图像;Input one frame of the multiple frames of images to be processed into the second neural network for calculation to obtain a third image;
    合并所述第一图像和所述第三图像,以获得所述第二图像。Combining the first image and the third image to obtain the second image.
  3. 根据权利要求1所述的方法,其特征在于,所述第一神经网络包括一个图像输出层和多个特征图输出层,所述图像输出层输出所述第一图像,所述特征图输出层输出多个中间特征图,所述多个中间特征图用于参与所述第二神经网络的运算,以获得第三图像。The method according to claim 1, wherein the first neural network comprises an image output layer and a plurality of feature map output layers, the image output layer outputs the first image, and the feature map output layers output a plurality of intermediate feature maps, the plurality of intermediate feature maps being used to participate in the operation of the second neural network to obtain a third image.
  4. 根据权利要求1-3中任一项所述的方法,其特征在于,所述多帧待处理图像包括多帧时域邻近的图像。The method according to any one of claims 1 to 3, wherein the multiple frames of to-be-processed images include multiple frames of temporally adjacent images.
  5. 根据权利要求1-4中任一项所述的方法,其特征在于,所述第一神经网络用于处理所述待处理图像的静止区域。The method according to any one of claims 1 to 4, wherein the first neural network is used to process the static area of the image to be processed.
  6. 根据权利要求1-5中任一项所述的方法,其特征在于,所述第二神经网络用于处理所述待处理图像的运动区域。The method according to any one of claims 1 to 5, wherein the second neural network is used to process the motion area of the image to be processed.
  7. 根据权利要求1-6中任一项所述的方法,其特征在于,所述第一神经网络对图像静止区域的处理能力大于第二神经网络。The method according to any one of claims 1 to 6, wherein the processing capability of the first neural network on the static area of the image is greater than that of the second neural network.
  8. 根据权利要求1-7中任一项所述的方法,其特征在于,所述第一神经网络和所述第二神经网络组成图像处理系统,所述图像处理系统用于对所述待处理图像进行降噪和消除马赛克效应处理。The method according to any one of claims 1-7, wherein the first neural network and the second neural network form an image processing system, and the image processing system is configured to perform noise reduction and mosaic-effect removal on the image to be processed.
  9. 一种基于神经网络的图像处理装置,其特征在于,包括:An image processing device based on neural network, characterized in that it comprises:
    运算模块,用于将多帧待处理图像输入第一神经网络进行运算,以获得第一图像;An arithmetic module, configured to input multiple frames of to-be-processed images into the first neural network for calculation to obtain the first image;
    所述运算模块,还用于将多个图像组分别输入多个第二神经网络进行运算,以分别获得多帧第二图像,其中,每个所述图像组包括所述第一图像和所述多帧待处理图像中的一帧图像。The arithmetic module is further configured to input multiple image groups into multiple second neural networks respectively for operation, to respectively obtain multiple frames of second images, wherein each of the image groups includes the first image and one frame of the multiple frames of images to be processed.
  10. 根据权利要求9所述的装置,其特征在于,所述运算模块用于:The device according to claim 9, wherein the computing module is used for:
    将所述多帧待处理图像中的一帧图像输入所述第二神经网络进行运算,以获得第三图像;Input one frame of the multiple frames of images to be processed into the second neural network for calculation to obtain a third image;
    合并所述第一图像和所述第三图像,以获得所述第二图像。Combining the first image and the third image to obtain the second image.
  11. 根据权利要求9所述的装置,其特征在于,所述第一神经网络包括一个图像输出层和多个特征图输出层,所述图像输出层输出所述第一图像,所述特征图输出层输出多个中间特征图,所述多个中间特征图用于参与所述第二神经网络的运算,以获得第三图像。The apparatus according to claim 9, wherein the first neural network comprises an image output layer and a plurality of feature map output layers, the image output layer outputs the first image, and the feature map output layers output a plurality of intermediate feature maps, the plurality of intermediate feature maps being used to participate in the operation of the second neural network to obtain a third image.
  12. 根据权利要求9-11中任一项所述的装置,其特征在于,所述多帧待处理图像包括多帧时域邻近的图像。The apparatus according to any one of claims 9-11, wherein the multiple frames of images to be processed comprise multiple frames of temporally adjacent images.
  13. 根据权利要求9-12中任一项所述的装置,其特征在于,所述第一神经网络用于处理所述待处理图像的静止区域。The device according to any one of claims 9-12, wherein the first neural network is used to process the static area of the image to be processed.
  14. 根据权利要求9-13中任一项所述的装置,其特征在于,所述第二神经网络用于处理所述待处理图像的运动区域。The device according to any one of claims 9-13, wherein the second neural network is used to process the motion area of the image to be processed.
  15. 根据权利要求9-14中任一项所述的装置,其特征在于,所述第一神经网络对图像静止区域的处理能力大于第二神经网络。The device according to any one of claims 9-14, wherein the processing capability of the first neural network on the static area of the image is greater than that of the second neural network.
  16. 根据权利要求9-15中任一项所述的装置,其特征在于,所述第一神经网络和所述第二神经网络组成图像处理系统,所述图像处理系统用于对所述待处理图像进行降噪和消除马赛克效应处理。The apparatus according to any one of claims 9-15, wherein the first neural network and the second neural network form an image processing system, and the image processing system is configured to perform noise reduction and mosaic-effect removal on the image to be processed.
  17. 一种芯片,其特征在于,所述芯片与存储器相连,用于读取并执行所述存储器中存储的软件程序,以实现如权利要求1-8中任一项所述的方法。A chip, characterized in that the chip is connected to a memory, and is used to read and execute a software program stored in the memory to implement the method according to any one of claims 1-8.
  18. 一种基于神经网络的图像处理装置,其特征在于,包括处理器和存储器,所述处理器用于运行一组程序,以使得如权利要求1-8中任一项的方法被执行。An image processing device based on a neural network, which is characterized by comprising a processor and a memory, and the processor is used to run a set of programs to enable the method according to any one of claims 1 to 8 to be executed.
  19. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有计算机可读指令,当所述计算机可读指令在基于神经网络的图像处理装置上运行时,使得所述基于神经网络的图像处理装置执行权利要求1-8任一项所述的方法。A computer-readable storage medium, wherein computer-readable instructions are stored in the computer-readable storage medium, and when the computer-readable instructions are run on a neural network-based image processing apparatus, the neural network-based image processing apparatus is caused to execute the method according to any one of claims 1-8.
PCT/CN2020/082634 2020-03-31 2020-03-31 Neural network-based image processing method and apparatus WO2021196050A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/082634 WO2021196050A1 (en) 2020-03-31 2020-03-31 Neural network-based image processing method and apparatus
CN202080099095.0A CN115335852A (en) 2020-03-31 2020-03-31 Image processing method and device based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/082634 WO2021196050A1 (en) 2020-03-31 2020-03-31 Neural network-based image processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2021196050A1 true WO2021196050A1 (en) 2021-10-07

Family

ID=77927278

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/082634 WO2021196050A1 (en) 2020-03-31 2020-03-31 Neural network-based image processing method and apparatus

Country Status (2)

Country Link
CN (1) CN115335852A (en)
WO (1) WO2021196050A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107808365A (en) * 2017-09-30 2018-03-16 广州智慧城市发展研究院 One kind is based on the self-compressed image denoising method of convolutional neural networks and system
CN108197623A (en) * 2018-01-19 2018-06-22 百度在线网络技术(北京)有限公司 For detecting the method and apparatus of target
CN108737750A (en) * 2018-06-07 2018-11-02 北京旷视科技有限公司 Image processing method, device and electronic equipment
US10255663B2 (en) * 2016-11-11 2019-04-09 Kabushiki Kaisha Toshiba Image processing device, image processing method, computer program product
CN109886892A (en) * 2019-01-17 2019-06-14 迈格威科技有限公司 Image processing method, image processing apparatus and storage medium

Also Published As

Publication number Publication date
CN115335852A (en) 2022-11-11

Similar Documents

Publication Publication Date Title
US11430209B2 (en) Image signal processing method, apparatus, and device
US10136110B2 (en) Low-light image quality enhancement method for image processing device and method of operating image processing system performing the method
US10417771B2 (en) Fast MRF energy optimization for solving scene labeling problems
US10348978B2 (en) Processor selecting between image signals in response to illuminance condition, image processing device including same, and related method for image processing
CN107431770A (en) Adaptive line brightness domain video pipeline framework
WO2020062312A1 (en) Signal processing device and signal processing method
CN112202986A (en) Image processing method, image processing apparatus, readable medium and electronic device thereof
CN113850367A (en) Network model training method, image processing method and related equipment thereof
JP2020042774A (en) Artificial intelligence inference computing device
CN115049783B (en) Model determining method, scene reconstruction model, medium, equipment and product
CN111951171A (en) HDR image generation method and device, readable storage medium and terminal equipment
CN117768774A (en) Image processor, image processing method, photographing device and electronic device
WO2021196050A1 (en) Neural network-based image processing method and apparatus
US20230388623A1 (en) Composite image signal processor
KR20130018899A (en) Single pipeline stereo image capture
WO2021179147A1 (en) Image processing method and apparatus based on neural network
CN111598781B (en) Image super-resolution method based on hybrid high-order attention network
US11941789B2 (en) Tone mapping and tone control integrations for image processing
CN114363693B (en) Image quality adjusting method and device
CN116205806B (en) Image enhancement method and electronic equipment
CN109688333B (en) Color image acquisition method, device, equipment and storage medium
WO2024025224A1 (en) Method and system for generation of a plurality of portrait effects in an electronic device
WO2024130715A1 (en) Video processing method, video processing apparatus and readable storage medium
US20240292112A1 (en) Image signal processor, operating method thereof, and application processor including the image signal processor
WO2023283855A1 (en) Super resolution based on saliency

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20928798

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20928798

Country of ref document: EP

Kind code of ref document: A1