WO2024002211A1 - Image processing method and related apparatus - Google Patents

Image processing method and related apparatus

Info

Publication number
WO2024002211A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
feature representation
image
representation
blurred image
Prior art date
Application number
PCT/CN2023/103616
Other languages
English (en)
French (fr)
Inventor
余磊
林明远
汪涛
李卫
李琤
刘健庄
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2024002211A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
        • G06T 5/00 Image enhancement or restoration
            • G06T 5/73 Deblurring; Sharpening
        • G06T 3/00 Geometric image transformations in the plane of the image
            • G06T 3/02 Affine transformations
        • G06T 2207/00 Indexing scheme for image analysis or image enhancement
            • G06T 2207/10 Image acquisition modality
                • G06T 2207/10004 Still image; Photographic image
                • G06T 2207/10016 Video; Image sequence
            • G06T 2207/20 Special algorithmic details
                • G06T 2207/20081 Training; Learning
                • G06T 2207/20084 Artificial neural networks [ANN]
                • G06T 2207/20212 Image combination
                    • G06T 2207/20221 Image fusion; Image merging

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to an image processing method and related devices.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and produce a new class of intelligent machines that can respond in a manner similar to human intelligence.
  • Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Motion blur generally occurs in scenes with obvious movement during the exposure time, especially in low-light environments with lightweight mobile devices, such as mobile phones and car cameras.
  • Although motion blur causes undesirable image degradation and makes visual content less interpretable, motion-blurred images also encode rich information about the relative motion between the camera and the observed scene. Therefore, recovering (reconstructing) a sequence of clear frames (photo-sequencing) from a single motion-blurred image helps in understanding the dynamics of the scene and has wide applications in image reconstruction, autonomous driving, and video surveillance.
  • a motion-blurred image can be viewed as an average of high-definition frames over the exposure time. Since averaging destroys the temporal order of the frames, recovering a clear sequence of frames from a single motion-blurred image is highly ill-posed. That is to say, the sequence to be recovered is not unique: different sequences of high-definition frames may compose the same motion-blurred image.
  • In order to resolve the non-uniqueness of the sequence to be restored, an event camera is introduced.
  • the event camera can provide the inter-frame changes of the time series to guide the recovery of the sequence.
  • the event camera is a bio-inspired, event-driven, time-based neuromorphic vision sensor that perceives the world using a completely different principle than traditional cameras. It measures brightness changes by working asynchronously and triggers an event once the change exceeds a threshold.
  • Event cameras do away with concepts such as exposure time and frames used in traditional intensity cameras, and are able to capture nearly continuous motion in a frameless mode (microsecond time resolution), so they do not suffer from problems such as blur. Utilizing event cameras is therefore very helpful for recovering clear frames from blurry images.
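  • Purely for illustration (not part of the patent text), the sketch below shows one common way to turn an asynchronous event stream into a dense, image-like representation that a neural network can consume; the event tuple layout (x, y, t, p) and the fixed number of temporal bins are assumptions.

```python
import numpy as np

def events_to_frames(events, height, width, t_start, t_end, num_bins=5):
    """Accumulate asynchronous events into `num_bins` temporal frames covering
    [t_start, t_end). `events` is assumed to be a dict of arrays with integer
    pixel coordinates 'x', 'y', timestamps 't' and polarities 'p'."""
    frames = np.zeros((num_bins, height, width), dtype=np.float32)
    x, y, t, p = events["x"], events["y"], events["t"], events["p"]
    # Map each event's timestamp to a temporal bin.
    b = ((t - t_start) / (t_end - t_start) * num_bins).astype(np.int64)
    b = np.clip(b, 0, num_bins - 1)
    # Signed accumulation: +1 for a brightness increase, -1 for a decrease.
    np.add.at(frames, (b, y, x), np.where(p > 0, 1.0, -1.0))
    return frames
```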
  • Image deblurring is performed through the optical flow of event information.
  • the core idea is to compute the optical flow from the event information, use this optical flow to perform an affine transformation (warp) on the blurred image, and cooperate with various losses to deblur the image at any moment within the exposure time.
  • However, the optical flow obtained in this way is not precise, and there is a problem of pixel-level misalignment.
  • This application provides an image processing method that achieves refined alignment of blurred-image features and event-information features through the alignment of multi-scale bidirectional scene flows, so as to obtain accurate scene flow information. This addresses the problem that existing event-based deblurring does not properly account for pixel-level alignment, thereby improving the deblurring effect on blurry images.
  • In a first aspect, embodiments of the present application provide an image processing method, including: obtaining a first feature representation of a blurred image and a second feature representation of event data collected by an event camera, where the sizes of the first feature representation and the second feature representation are consistent; obtaining, according to the first feature representation of the blurred image and the second feature representation of the event data and through a scene flow prediction network, a first scene flow corresponding to the blurred image and a second scene flow corresponding to the event data, where the size of the first scene flow is consistent with that of the first feature representation, each pixel feature in the first scene flow indicates motion information from the pixel feature at the corresponding pixel position in the first feature representation to the pixel feature at the corresponding pixel position in the second feature representation, the size of the second scene flow is consistent with that of the second feature representation, and each pixel feature in the second scene flow indicates motion information from the pixel feature at the corresponding pixel position in the second feature representation to the pixel feature at the corresponding pixel position in the first feature representation; performing an affine transformation (warp) on the first feature representation according to the first scene flow to obtain a third feature representation; and performing an affine transformation on the second feature representation according to the second scene flow to obtain a fourth feature representation, where the third feature representation and the fourth feature representation are used to deblur the blurred image.
  • the "size" of the feature representation here can be understood as the width and height of the feature representation.
  • the pixel feature here can refer to a point at the spatial position (x, y), which may include multiple channels.
  • the blurred image and the event data are collected for the same scene in the same time period.
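  • As an illustrative sketch only (not the patent's implementation), the following PyTorch-style code shows how a feature representation could be warped by a dense per-pixel flow of the same spatial size, which is the kind of affine transformation (warp) referred to in the first aspect above; the function name and the bilinear grid-sampling choice are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_by_flow(feature, flow):
    """Warp `feature` (N, C, H, W) with a dense `flow` (N, 2, H, W), where
    flow[:, 0] is the horizontal and flow[:, 1] the vertical displacement.
    A minimal sketch using bilinear sampling with simplified border handling."""
    n, _, h, w = feature.shape
    # Base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feature.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                # displaced coordinates
    # Normalize to [-1, 1] as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)            # (N, H, W, 2)
    return F.grid_sample(feature, norm_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

  • Under these assumptions, the third feature representation would correspond to warp_by_flow(first_feature, first_scene_flow), and the fourth to warp_by_flow(second_feature, second_scene_flow).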
  • the scene flow prediction network may include a first encoding module, a second encoding module, a fusion module, a first decoding module and a second decoding module; obtaining the first scene flow corresponding to the blurred image and the second scene flow corresponding to the event data through the scene flow prediction network, according to the first feature representation of the blurred image and the second feature representation of the event data, may include:
  • according to the first feature representation of the blurred image, a first encoding result is obtained through the first encoding module; according to the second feature representation of the event data, a second encoding result is obtained through the second encoding module; according to the first encoding result and the second encoding result, a fusion result is obtained through the fusion module; and according to the fusion result, the first scene flow corresponding to the blurred image and the second scene flow corresponding to the event data are obtained through the first decoding module and the second decoding module respectively.
  • the fusion module is configured to implement a first fusion of the first encoding result and the second encoding result based on an attention mechanism.
  • the first scene flow can represent the alignment relationship from the blurred-image features to the event-information features
  • the second scene flow can represent the alignment relationship from the event-information features to the blurred-image features
  • The feature representation of the blurred image and the feature representation of the event data are not information of the same modality. If they were fused directly, the resulting fusion would be inaccurate.
  • In embodiments of this application, two different encoding modules are therefore first used to encode the feature representation of the blurred image and the feature representation of the event data respectively, converting them into data of a similar modality, and the encoding results are then fused to obtain an accurate fusion result.
  • the scene flow in the embodiment of the present application is similar to the optical flow, and the information of each pixel position is a vector with direction.
  • This application can achieve pixel-level alignment between blurred-image features and event-data features through scene flow prediction.
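  • Purely as an illustration of the dual-encoder, attention-fused, dual-decoder structure described above, the sketch below outlines one way such a scene flow prediction network could be organized; the layer sizes, the plain convolutional encoders/decoders and the gating-style attention fusion are assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class SceneFlowPredictor(nn.Module):
    """Sketch: two modality-specific encoders, an attention-based fusion module,
    and two decoders producing the two scene flows (2 channels each)."""
    def __init__(self, channels=64):
        super().__init__()
        self.image_encoder = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                           nn.ReLU(inplace=True))
        self.event_encoder = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                           nn.ReLU(inplace=True))
        # Attention weights computed from both encoding results (first fusion).
        self.attention = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.image_decoder = nn.Conv2d(channels, 2, 3, padding=1)  # first scene flow
        self.event_decoder = nn.Conv2d(channels, 2, 3, padding=1)  # second scene flow

    def forward(self, image_feat, event_feat):
        enc_i = self.image_encoder(image_feat)
        enc_e = self.event_encoder(event_feat)
        attn = self.attention(torch.cat([enc_i, enc_e], dim=1))
        fused = attn * enc_i + (1.0 - attn) * enc_e
        return self.image_decoder(fused), self.event_decoder(fused)
```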
  • the fused result may lose part of the information compared to the event data (the event data collected by the event camera includes multiple frames of event data collected for a certain scene within the exposure time).
  • Since event data is recorded only when the brightness change at a pixel position exceeds a threshold, some of the image data may be invalid. Therefore, if information from originally invalid (or occluded) areas in the image data or event data is used directly, artifacts and reduced image quality will result.
  • a second occlusion area can be identified based on the fourth feature representation and the first feature representation (for example, a lightweight network, such as successive convolution and residual layers, can be used to determine the occlusion area), wherein the image data of the second occlusion area in the blurred image is valid in the second occlusion area of the event data; the feature representations in the first feature representation other than the second occlusion area are subjected to a second fusion with the feature representations of the second occlusion area in the fourth feature representation to obtain a second fused feature representation.
  • some features of the blurred image are occluded, and the information in the event data can be used to replace these occluded features to obtain a more accurate feature representation.
  • the second occlusion area can be represented by a second mask; the size of the second mask is consistent with that of the fourth feature representation, and each pixel in the second mask is used to indicate whether the pixel feature at the corresponding position in the first feature representation is valid in the blurred image. For example, 0 and 1 can be used in the second mask to identify whether the pixel feature at the corresponding position is valid, for example, 0 means invalid and 1 means valid.
  • the first occlusion area in the event data can be determined according to the third feature representation and the second feature representation (for example, a lightweight network, such as successive convolution and residual layers, can be used to determine the occlusion area), wherein the image data of the first occlusion area in the event data is valid in the first occlusion area of the blurred image; the feature representations in the second feature representation other than the first occlusion area are subjected to a second fusion with the feature representations of the first occlusion area in the third feature representation to obtain a first fused feature representation. That is to say, some features of the event information are occluded, and the information in the blurred image can be used to replace these occluded features, thereby obtaining a more accurate feature representation.
  • the first occlusion area is represented by a first mask
  • the size of the first mask and the third feature representation are consistent
  • each pixel in the first mask is used to indicate whether the pixel feature at the corresponding position in the third feature representation is valid in the event data. For example, 0 and 1 can be used in the first mask to identify whether the pixel feature at the corresponding position is valid in the event data, for example, 0 means invalid and 1 means valid.
  • the second fusion is an addition operation of corresponding pixel positions.
  • the occlusion area in the blurred image can be processed, thereby reducing the artifact problem caused by the occlusion area.
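  • The sketch below illustrates, under assumed shapes and a simple convolutional mask predictor, how an occlusion mask could be estimated from two aligned feature maps and used for the kind of mask-guided second fusion described above (an addition at corresponding pixel positions); it is not the patent's concrete implementation.

```python
import torch
import torch.nn as nn

class OcclusionFusion(nn.Module):
    """Predict a soft occlusion mask from two aligned feature maps and fuse them:
    keep `base` outside the occluded area and take `other` inside it. The
    lightweight mask predictor (a couple of convolutions) is an assumption."""
    def __init__(self, channels=64):
        super().__init__()
        self.mask_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
            nn.Sigmoid(),   # mask in [0, 1], ~1 where `base` is invalid
        )

    def forward(self, base, other):
        mask = self.mask_net(torch.cat([base, other], dim=1))
        # Second fusion: the two kept parts are added at corresponding pixel positions.
        return (1.0 - mask) * base + mask * other
```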
  • the method further includes: processing the feature representation of the blurred image and the feature representation of the event data through N series-connected feature nesting blocks to obtain a processing result used for deblurring, wherein each of the feature nesting blocks is used to perform the image processing method of the first aspect, and the first feature nesting block is used to obtain the feature representations extracted from the blurred image and the event data through a feature extraction network.
  • the nth feature nesting block is used to obtain the feature representation output by the n-1th feature nesting block, and n is less than N.
  • the feature representation output by the Nth feature nesting block is used to fuse with the feature representation extracted from the blurred image through the feature extraction network to obtain residual information.
  • the residual information is used to fuse with the blurred image to achieve deblurring of the blurred image.
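  • For illustration only, the following sketch shows how N series-connected feature nesting blocks could be chained, with the output of the last block fused with the features extracted from the blurred image to form residual information that is then added back to the blurred image; the block internals, channel counts and the use of simple additions for the fusions are assumptions.

```python
import torch
import torch.nn as nn

class DeblurPipeline(nn.Module):
    """Sketch of the overall flow: feature extraction, N nested alignment/fusion
    blocks in series, then a residual that is fused with the blurred image.
    `nesting_block` is a callable returning a module that maps a pair of feature
    maps to an updated pair; its design is assumed, not specified here."""
    def __init__(self, nesting_block, num_blocks=3, channels=64):
        super().__init__()
        self.image_features = nn.Conv2d(3, channels, 3, padding=1)   # blurred-image branch
        self.event_features = nn.Conv2d(5, channels, 3, padding=1)   # event branch (5 bins assumed)
        self.blocks = nn.ModuleList(nesting_block(channels) for _ in range(num_blocks))
        self.to_residual = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, blurred_image, event_frames):
        img_feat = self.image_features(blurred_image)
        feat_i, feat_e = img_feat, self.event_features(event_frames)
        for block in self.blocks:            # block n consumes block n-1's output
            feat_i, feat_e = block(feat_i, feat_e)
        residual = self.to_residual(feat_i + img_feat)   # fuse with extracted image features
        return blurred_image + residual                  # deblurred estimate
```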
  • this application provides an image processing device, which includes:
  • An acquisition module configured to acquire a first feature representation of the blurred image and a second feature representation of the event data collected by the event camera; the size of the first feature representation and the second feature representation are consistent;
  • a scene flow prediction module, configured to obtain, according to the first feature representation of the blurred image and the second feature representation of the event data, a first scene flow corresponding to the blurred image and a second scene flow corresponding to the event data through a scene flow prediction network, where the size of the first scene flow is consistent with that of the first feature representation, each pixel feature in the first scene flow indicates motion information from the pixel feature at the corresponding pixel position in the first feature representation to the pixel feature at the corresponding pixel position in the second feature representation, the size of the second scene flow is consistent with that of the second feature representation, and each pixel feature in the second scene flow indicates motion information from the pixel feature at the corresponding pixel position in the second feature representation to the pixel feature at the corresponding pixel position in the first feature representation;
  • an affine transformation module, configured to perform an affine transformation on the first feature representation according to the first scene flow to obtain a third feature representation, and to perform an affine transformation on the second feature representation according to the second scene flow to obtain a fourth feature representation; the third feature representation and the fourth feature representation are used to deblur the blurred image.
  • the blurred image and the event data are collected for the same scene in the same time period.
  • the scene flow prediction network includes a first encoding module, a second encoding module, a fusion module, a first decoding module and a second decoding module;
  • the scene flow prediction module is specifically used for:
  • according to the first feature representation, a first encoding result is obtained through the first encoding module; according to the second feature representation, a second encoding result is obtained through the second encoding module; according to the first encoding result and the second encoding result, a fusion result is obtained through the fusion module; and according to the fusion result, the first scene flow corresponding to the blurred image and the second scene flow corresponding to the event data are obtained through the first decoding module and the second decoding module respectively.
  • The feature representation of the blurred image and the feature representation of the event data are not information of the same modality. If they were fused directly, the resulting fusion would be inaccurate.
  • In embodiments of this application, two different encoding modules are therefore first used to encode the feature representation of the blurred image and the feature representation of the event data respectively, converting them into data of a similar modality, and the encoding results are then fused to obtain an accurate fusion result.
  • the device further includes:
  • an occlusion area identification module, configured to identify a second occlusion area according to the fourth feature representation and the first feature representation, wherein the image data of the second occlusion area in the blurred image is valid in the second occlusion area of the event data;
  • a second fusion is performed on the feature representation of the first feature representation other than the second occlusion area and the feature representation of the second occlusion area in the fourth feature representation to obtain a second fused feature representation.
  • the second occlusion area is represented by a second mask, the size of the second mask and the fourth feature representation are consistent, and each pixel in the second mask is used for Indicates whether the pixel feature at the corresponding position in the first feature representation is valid in the blurred image.
  • the device further includes:
  • an occlusion area identification module, configured to determine a first occlusion area according to the third feature representation and the second feature representation, wherein the image data of the first occlusion area in the event data is valid in the first occlusion area of the blurred image;
  • a second fusion is performed on the feature representation of the second feature representation other than the first occlusion area and the feature representation of the first occlusion area in the third feature representation to obtain a first fused feature representation.
  • the first occlusion area is represented by a first mask, the size of the first mask is consistent with that of the third feature representation, and each pixel in the first mask is used to indicate whether the pixel feature at the corresponding position in the third feature representation is valid in the event data.
  • the occlusion area in the blurred image can be processed, thereby reducing the artifact problem caused by the occlusion area.
  • the second fusion is an addition operation of corresponding pixel positions.
  • the device further includes: a feature nesting module, configured to process the feature representation of the blurred image and the feature representation of the event data through N series-connected feature nesting blocks to obtain a processing result used for deblurring, wherein each of the feature nesting blocks is used to perform the image processing method of the first aspect, and the first feature nesting block is used to obtain the feature representations extracted from the blurred image and the event data through a feature extraction network.
  • the n-th feature nesting block is used to obtain the feature representation output by the n-1th feature nesting block, and n is less than N.
  • the feature representation output by the Nth feature nesting block is used to fuse with the feature representation extracted from the blurred image through the feature extraction network to obtain residual information, and the residual information is used to fuse with the blurred image to achieve deblurring of the blurred image.
  • embodiments of the present application provide an image processing device, which may include a memory, a processor, and a bus system.
  • the memory is used to store programs
  • the processor is used to execute the programs in the memory to perform the method of the above-mentioned first aspect or any optional method thereof.
  • embodiments of the present application provide a computer-readable storage medium.
  • a computer program is stored in the computer-readable storage medium, and when it is run on a computer, it causes the computer to execute the method of the above-mentioned first aspect or any optional method thereof.
  • embodiments of the present application provide a computer program product, which includes code that, when executed, is used to implement the method of the above first aspect or any optional method thereof.
  • the present application provides a chip system, which includes a processor configured to support an execution device or a training device in implementing the functions involved in the above aspects, for example, sending or processing the data or information involved in the above methods.
  • the chip system further includes a memory, which is used to store the program instructions and data necessary for the execution device or the training device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • Figure 1 is a structural schematic diagram of the main framework of artificial intelligence
  • Figure 2 is a schematic diagram of an application scenario provided by the embodiment of the present application.
  • Figure 3 is a schematic diagram of an application scenario provided by the embodiment of the present application.
  • Figure 4 is a schematic diagram of a convolutional neural network provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a convolutional neural network provided by an embodiment of the present application.
  • Figure 6 is a schematic structural diagram of a system provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • Figure 8 is a schematic flow chart of an image processing method provided by an embodiment of the present application.
  • Figure 9 is a flowchart of an image processing method
  • Figure 10 is a flowchart of an image processing method
  • Figure 11 is a flowchart of an image processing method
  • Figure 12 is a flowchart of an image processing method
  • Figure 13 is a schematic diagram of the effect of an image processing method provided by an embodiment of the present application.
  • Figure 14 is a schematic diagram of the effect of an image processing method provided by an embodiment of the present application.
  • Figure 15 is a schematic structural diagram of an image processing device provided by an embodiment of the present application.
  • Figure 16 is a schematic diagram of an execution device provided by an embodiment of the present application.
  • Figure 17 is a schematic diagram of a training device provided by an embodiment of the present application.
  • Figure 1 shows a structural schematic diagram of the artificial intelligence main framework.
  • The above artificial intelligence theme framework is elaborated below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has gone through the condensation process of "data-information-knowledge-wisdom".
  • the "IT value chain” reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of human intelligence and information (providing and processing technology implementation) to the systematic industrial ecological process.
  • Infrastructure provides computing power support for artificial intelligence systems, enables communication with the external world, and supports it through basic platforms.
  • computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA, etc.);
  • the basic platform includes distributed computing frameworks, networks and other related platform guarantees and support, which can include cloud storage and computing, interconnection networks, etc.
  • sensors communicate with the outside world to obtain data, which are provided to smart chips in the distributed computing system provided by the basic platform for calculation.
  • Data from the upper layer of the infrastructure is used to represent data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other methods.
  • machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, performing machine thinking and problem solving using formalized information according to reasoning control strategies. Typical functions are search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of further data processing, such as algorithms or a general system, for example translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. Their application fields mainly include intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, etc.
  • the image processing method in the embodiment of the present application can be applied in assisted driving and autonomous driving smart cars, and can also be applied in fields that require image enhancement (such as image denoising) in the field of computer vision such as smart cities and smart terminals.
  • the following is a brief introduction to the video streaming transmission scenario and video monitoring scenario based on Figure 2 and Figure 3 respectively.
  • Video streaming scenario:
  • the server can transmit a downsampled, lower-resolution, low-quality video stream to the client over the network.
  • the client can then perform enhancements on the images in this low-quality video stream. For example, super-resolution, denoising and other operations are performed on images in videos, and finally high-quality images are presented to users.
  • the image processing method provided by the embodiments of the present application can be used to convert low-quality video surveillance videos into high-quality high-definition videos, thereby achieving effective recovery of a large number of details in the surveillance images, and providing more effective and efficient methods for subsequent target recognition tasks. Richer information.
  • the neural network can be composed of neural units.
  • the neural unit can refer to an operation unit that takes xs (ie, input data) and intercept 1 as input.
  • the output of the operation unit can be: h_{W,b}(x) = f(W^T x) = f(∑_{s=1}^{n} W_s·x_s + b)
  • where s = 1, 2, …, n, and n is a natural number greater than 1
  • Ws is the weight of xs
  • b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of this activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
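  • A minimal sketch of the single neural unit just described, assuming the sigmoid activation mentioned above (all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(xs, ws, b):
    """Output of one neural unit: f(sum_s Ws * xs + b), with f = sigmoid."""
    return sigmoid(np.dot(ws, xs) + b)

# Example: three inputs, three weights, one bias.
print(neural_unit(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.3]), b=0.2))
```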
  • a neural network is a network formed by connecting multiple above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor consisting of a convolutional layer and a subsampling layer, which can be regarded as a filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can be connected to only some of the neighboring layer neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as extracting features in a way that is independent of location.
  • the convolution kernel can be formalized as a matrix of random size. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • CNN is a very common neural network.
  • a convolutional neural network is a deep neural network with a convolutional structure. It is a deep learning architecture.
  • a deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • CNN is a feed-forward artificial neural network. Each neuron in the feed-forward artificial neural network can respond to the image input into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230.
  • the convolutional layer/pooling layer 220 may include, for example, layers 221-226. In one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • The following takes convolutional layer 221 as an example to introduce the internal working principle of a convolutional layer.
  • the convolution layer 221 can include many convolution operators.
  • the convolution operator is also called a kernel. Its role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, and this weight matrix is usually predefined. During the convolution operation on an image, the weight matrix is usually slid one pixel at a time (or two pixels at a time, depending on the value of the stride) along the horizontal direction on the input image, so as to complete the process of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolved output with a single depth dimension, but in most cases multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same type, are applied instead of a single weight matrix.
  • the output of each weight matrix is stacked to form the depth dimension of the convolution image.
  • the dimension here can be understood as being determined by the "multiple" mentioned above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to remove unnecessary noise in the image.
  • the multiple weight matrices have the same size (rows × columns), so the feature maps extracted by these weight matrices also have the same size; the extracted feature maps of the same size are then merged to form the output of the convolution operation.
  • In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training. Each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, thereby allowing the convolutional neural network 200 to make correct predictions.
  • the initial convolutional layers (for example, 221) often extract more general features; as the depth of the convolutional neural network 200 increases, the features extracted by subsequent convolutional layers (for example, 226) become more and more complex, such as high-level semantic features.
  • for the layers 221-226 shown at 220 in Figure 4, one convolutional layer can be followed by one pooling layer, or multiple convolutional layers can be followed by one or more pooling layers.
  • the only purpose of the pooling layer is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • the max pooling operator can select the pixel with the largest value in a specific range as the result of max pooling.
  • the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer.
  • Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
  • After being processed by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. In order to generate the final output information (the required class information or other related information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate one output or a set of outputs whose number equals the required number of classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n as shown in Figure 4), and the parameters contained in these hidden layers can be obtained by pre-training based on the relevant training data of the specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • the output layer 240 has a loss function similar to categorical cross entropy and is specifically used to calculate the prediction error.
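  • As a purely illustrative sketch (layer counts, channel sizes and the number of classes are assumptions, not taken from Figure 4), the structure just described, with an input, alternating convolutional and pooling layers, fully connected hidden layers and an output layer, could look roughly as follows:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the CNN 200 structure described above; sizes assumed.
cnn_200 = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 221, 222
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 223, 224
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 225, 226
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 128), nn.ReLU(),                        # hidden layers 231..23n
    nn.Linear(128, 10),                                           # output layer (10 classes assumed)
)

logits = cnn_200(torch.randn(1, 3, 32, 32))   # a 32x32 RGB input is assumed
print(logits.shape)                           # torch.Size([1, 10])
```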
  • the convolutional neural network 200 shown in Figure 4 is only an example of a convolutional neural network.
  • the convolutional neural network can also exist in the form of other network models, for example, including only part of the network structure shown in Figure 4; for example, the convolutional neural network used in the embodiment of the present application may include only an input layer 210, a convolutional layer/pooling layer 220 and an output layer 240.
  • It should be noted that the convolutional neural network 200 shown in Figure 4 is only an example of a convolutional neural network. In specific applications, the convolutional neural network can also exist in the form of other network models, for example, with multiple convolutional layers/pooling layers in parallel as shown in Figure 5, where the extracted features are all input to the fully connected layer 230 for processing.
  • Deep Neural Network also known as multi-layer neural network
  • DNN Deep Neural Network
  • the neural network inside DNN can be divided into three categories: input layer, hidden layer, and output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in between are hidden layers.
  • the layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • the coefficient from the k-th neuron in layer L-1 to the j-th neuron in layer L is defined as W_{jk}^{L}. It should be noted that the input layer has no W parameter.
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically, a model with more parameters has higher complexity and greater "capacity", which means it can complete more complex learning tasks.
  • Training a deep neural network is the process of learning the weight matrix. The ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (a weight matrix formed by the vectors W of many layers).
  • Super-resolution is an image enhancement technology. Given a low-resolution image or a set of low-resolution images, it learns the prior knowledge of the image, the self-similarity of the image, and the complementary information of multiple frames of images to recover the high-frequency detail information of the image and generate a higher-resolution target image. In super-resolution applications, depending on the number of input images, it can be divided into single-frame image super-resolution and video super-resolution. Super-resolution has important application value in fields such as high-definition television, surveillance equipment, satellite imagery, and medical imaging.
  • image noise reduction sometimes also called image denoising.
  • Image features mainly include color features, texture features, shape features and spatial relationship features of the image.
  • Color feature is a global feature that describes the surface properties of the scene corresponding to the image or image area. Generally, color features are based on pixel point features. At this time, all pixels belonging to the image or image area have their own contributions. Since color is insensitive to changes in the direction, size, etc. of the image or image area, color features cannot well capture the local characteristics of objects in the image.
  • Texture feature is also a global feature, which likewise describes the surface properties of the scene corresponding to the image or image area; however, since texture is only a characteristic of an object's surface and cannot fully reflect the essential properties of the object, high-level image content cannot be obtained by using texture features alone. Unlike color features, texture features are not pixel-based features and require statistical calculation over an area containing multiple pixels.
  • There are two types of representation methods for shape features: one is contour features, and the other is regional features.
  • the contour features of the image are mainly aimed at the outer boundary of the object, while the regional features of the image are related to the entire shape area.
  • the spatial relationship feature refers to the mutual spatial position or relative direction relationship between multiple targets segmented in the image. These relationships can also be divided into connection/adjacency relationships, overlapping/overlapping relationships, and inclusion/inclusion relationships.
  • spatial location information can be divided into two categories: relative spatial location information and absolute spatial location information. The former relationship emphasizes the relative situation between targets, such as the up, down, left, and right relationships, etc., while the latter relationship emphasizes the distance and orientation between targets.
  • image features listed above can be used as some examples of features in images. Images can also have other features, such as higher-level features: semantic features, which will not be expanded here.
  • Image/video enhancement refers to actions performed on images/videos that can improve imaging quality.
  • enhancement processing includes super-resolution, noise reduction, sharpening or demosaicing.
  • Peak signal-to-noise ratio (PSNR) is often used as a measure of signal reconstruction quality in fields such as image processing, and is often defined simply via the mean square error. Generally speaking, the higher the PSNR, the smaller the difference between the reconstruction and the true value.
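  • For concreteness (a general formula, not quoted from the patent), PSNR for images with peak value MAX is typically computed from the mean square error as 10·log10(MAX^2 / MSE); a minimal sketch:

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB, defined via the mean square error."""
    mse = np.mean((reference.astype(np.float64) - estimate.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```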
  • Receptive field: a term in the field of deep neural networks for computer vision, used to indicate the size of the region of the original image that neurons at different positions within the neural network can perceive.
  • the value of the receptive field can be roughly used to judge the abstraction level of each layer.
  • optical flow is caused by the movement of the foreground object itself in the scene, the movement of the camera, or the joint movement of both.
  • Optical flow represents the instantaneous speed of pixels and is generally obtained from features of the same modality, such as adjacent image frames and images from different RGB cameras.
  • Scene flow: the positional relationship between information of two different modalities, event information and image information (e.g., grayscale information), is represented by a scene flow.
  • Warp operations with a flow represent the affine transformation of an image according to the flow (such as an optical flow or a scene flow), for example rotation, translation, scaling, etc.
  • the error back propagation (BP) algorithm can be used to correct the values of the parameters in the initial model during the training process, so that the error loss of the model becomes smaller and smaller. Specifically, the input signal is forward-propagated until the output produces an error loss, and the error loss information is back-propagated to update the parameters in the initial model, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain optimal model parameters, such as weight matrices.
  • FIG. 6 is a schematic diagram of the system architecture provided by an embodiment of the present application.
  • the system architecture 500 includes an execution device 510 , a training device 520 , a database 530 , a client device 540 , a data storage system 550 and a data collection system 560 .
  • the execution device 510 includes a computing module 511, an I/O interface 512, a preprocessing module 513 and a preprocessing module 514.
  • the target model/rule 501 may be included in the calculation module 511, and the preprocessing module 513 and the preprocessing module 514 are optional.
  • Data collection device 560 is used to collect training data.
  • the image samples can be low-quality images, and the supervision images are high-quality images corresponding to the image samples, obtained in advance before model training.
  • the image sample may be, for example, a low-resolution image, and the supervision image may be a high-resolution image; or the image sample may be, for example, an image containing fog or noise, and the supervision image may be an image with the fog or noise removed.
  • the data collection device 560 stores the training data into the database 530, and the training device 520 trains to obtain the target model/rule 501 based on the training data maintained in the database 530.
  • the above target model/rule 501 (for example, the model including the scene flow prediction network in the embodiment of the present application) can be used to implement the image denoising task, that is, the image to be processed is input into the target model/rule 501 to obtain the denoised image.
  • the training data maintained in the database 530 may not necessarily be collected by the data collection device 560, but may also be received from other devices.
  • the training device 520 does not necessarily perform the training of the target model/rule 501 entirely based on the training data maintained in the database 530; it may also obtain training data from the cloud or other places for model training. The above description should not be regarded as a limitation on the embodiments of this application.
  • the target model/rules 501 trained according to the training device 520 can be applied to different systems or devices, such as to the execution device 510 shown in Figure 6.
  • the execution device 510 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, and can also be a server, a cloud, etc.
  • the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with external devices. The user can input data to the I/O interface 512 through the client device 540 .
  • the preprocessing module 513 and the preprocessing module 514 are used to perform preprocessing according to the input data received by the I/O interface 512. It should be understood that there may be no preprocessing module 513 and 514 or only one preprocessing module. When the preprocessing module 513 and the preprocessing module 514 do not exist, the computing module 511 can be directly used to process the input data.
  • When the execution device 510 preprocesses the input data, or when the calculation module 511 of the execution device 510 performs calculation and other related processing, the execution device 510 can call data, code, etc. in the data storage system 550 for the corresponding processing, and the data, instructions, etc. obtained by the corresponding processing can also be stored in the data storage system 550.
  • the I/O interface 512 presents the processing results, such as the denoised image obtained after processing, to the client device 540, thereby providing it to the user.
  • the training device 520 can generate corresponding target models/rules 501 based on different training data for different goals or different tasks, and the corresponding target models/rules 501 can be used to implement the image denoising task, thereby providing users with the desired results.
  • the user can manually set the input data, and the "manually set input data" can be operated through the interface provided by the I/O interface 512 .
  • the client device 540 can automatically send input data to the I/O interface 512. If requiring the client device 540 to automatically send the input data requires the user's authorization, the user can set corresponding permissions in the client device 540. The user can view the results output by the execution device 510 on the client device 540, and the specific presentation form may be display, sound, action, etc.
  • the client device 540 can also be used as a data collection terminal to collect the input data of the input I/O interface 512 and the output results of the output I/O interface 512 as new sample data, and store them in the database 530.
  • Alternatively, as shown in the figure, the I/O interface 512 can directly store the input data input to the I/O interface 512 and the output results of the I/O interface 512 as new sample data in the database 530.
  • Figure 6 is only a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the positional relationship between the devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 550 is an external memory relative to the execution device 510; in other cases, the data storage system 550 can also be placed in the execution device 510.
  • Figure 7 is a chip hardware structure diagram provided by an embodiment of the present application.
  • the chip includes a neural network processor 700.
  • the chip can be disposed in the execution device 510 as shown in Figure 6 to complete the calculation work of the calculation module 511.
  • the chip can also be installed in the training device 520 as shown in Figure 6 to complete the training work of the training device 520 and output the target model/rules 501.
  • the algorithms at each layer in the model shown in Figure 6 can be implemented in the chip shown in Figure 7.
  • the neural network processor (neural processing unit, NPU) 700 is mounted on the main central processing unit (host central processing unit, host CPU) as a co-processor, and the main CPU allocates tasks.
  • the core part of the NPU is the arithmetic circuit 703.
  • the controller 704 controls the arithmetic circuit 703 to extract data in the memory (weight memory 702 or input memory 701) and perform operations.
  • the computing circuit 703 internally includes multiple processing engines (PEs).
  • In some implementations, the arithmetic circuit 703 is a two-dimensional systolic array. The arithmetic circuit 703 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 703 is a general-purpose matrix processor.
  • the arithmetic circuit 703 obtains the data corresponding to matrix B from the weight memory 702 and caches it on each PE in the arithmetic circuit 703. The arithmetic circuit 703 takes the data of matrix A from the input memory 701, performs a matrix operation with matrix B, and stores the partial result or the final result of the matrix in the accumulator 708.
  • the vector calculation unit 707 can further process the output of the operation circuit 703, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc.
  • the vector calculation unit 707 can be used for network calculations of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc.
  • vector calculation unit 707 can store the processed output vectors to unified memory 706 .
  • the vector calculation unit 707 may apply a nonlinear function to the output of the operation circuit 703, such as a vector of accumulated values, to generate an activation value.
  • vector calculation unit 707 generates normalized values, merged values, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 703, such as for use in a subsequent layer in a neural network.
  • the unified memory 706 is used to store input data and output data.
  • The direct memory access controller (DMAC) 705 is used to transfer the input data in the external memory to the input memory 701 and/or the unified memory 706, to store the weight data in the external memory into the weight memory 702, and to store the data in the unified memory 706 into the external memory.
  • the bus interface unit (bus interface unit, BIU) 710 is used to realize the interaction between the main CPU, the DMAC and the fetch memory 709 through the bus.
  • An instruction fetch buffer 709 connected to the controller 704 is used to store instructions used by the controller 704.
  • the controller 704 is used to call instructions cached in the fetch memory 709 to control the working process of the computing accelerator.
  • the unified memory 706, the input memory 701, the weight memory 702 and the fetch memory 709 are all on-chip memories, and the external memory is a memory external to the NPU.
  • the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM) or another readable and writable memory.
  • Motion blur generally occurs in scenes with obvious movement during the exposure time, especially in low-light environments with lightweight mobile devices, such as mobile phones and car cameras.
  • Although motion blur causes undesirable image degradation and makes visual content less interpretable, motion-blurred images also encode rich information about the relative motion between the camera and the observed scene. Therefore, recovering (reconstructing) a sequence of clear frames (photo-sequencing) from a single motion-blurred image helps in understanding the dynamics of the scene and has wide applications in image reconstruction, autonomous driving, and video surveillance.
  • a motion-blurred image can be viewed as an average of high-definition frames over the exposure time. Since averaging destroys the temporal order of the frames, recovering a clear sequence of frames from a single motion-blurred image is highly ill-posed. That is to say, the sequence to be recovered is not unique: different sequences of high-definition frames may compose the same motion-blurred image.
  • In order to resolve the non-uniqueness of the sequence to be restored, an event camera is introduced.
  • the event camera can provide the inter-frame changes of the time series to guide the recovery of the sequence.
  • the event camera is a bio-inspired, event-driven, time-based neuromorphic vision sensor that perceives the world using a completely different principle than traditional cameras. It measures brightness changes by working asynchronously and triggers an event once the change exceeds a threshold.
  • Event cameras do away with concepts such as exposure time and frames used in traditional intensity cameras and are able to capture nearly continuous motion in a frameless mode (microsecond time resolution), so they do not suffer from problems such as blur. Using an event camera is therefore very helpful for recovering clear frames from a blurred image.
  • Image deblurring is performed through the optical flow of event information.
  • the core idea is to calculate the optical flow from the event information, use this optical flow to perform an affine transformation (warp) on the blurred image, and, together with various losses, achieve deblurring of the image at any time within the exposure time.
  • However, the optical flow obtained in this way is not precise, and a problem of pixel-level misalignment remains.
  • the image processing method can be a feedforward process of model training or an inference process.
  • FIG 8 is a schematic diagram of an image processing method provided by an embodiment of the present application.
  • an image processing method provided by an embodiment of the present application includes:
  • the execution subject of step 801 may be a terminal device, and the terminal device may be a portable mobile device, such as but not limited to a mobile or portable computing device (such as a smart phone), a personal computer, a server computer, a handheld device (such as a tablet) or laptop device, a multi-processor system, a game console or controller, a microprocessor-based system, a set-top box, programmable consumer electronics, a mobile phone, a device in a wearable or accessory form factor (e.g., a watch, glasses, a headset, or earbuds), a network PC, a minicomputer, a mainframe computer, a distributed computing environment including any of the above systems or devices, and the like.
  • the execution subject of step 801 can be a server on the cloud side.
  • the server can receive the blurred image and the event data collected by the event camera from the terminal device, thereby obtaining the blurred image and the event data.
  • the blurred image and the event data are collected for the same scene in the same time period.
  • the blurred image may be an image collected by an RGB camera on the terminal device;
  • the event data may be data collected by the event camera on the terminal device for the same scene.
  • the blurred image may be an average of multiple frames of images (images obtained within the exposure time), and the event data may include the event points within the time period corresponding to the blurred image. That is to say, a synthesized one-frame blurred image can be obtained by averaging multiple existing consecutive frames of images.
  • the time period corresponding to the above-mentioned blurred image can be determined by the time corresponding to the above-mentioned existing continuous multiple frames of high-definition images.
  • This time period can be the exposure time of the camera during the actual shooting. That is to say, the blur caused by the subject's motion during the exposure time produces one frame of blurred image.
  • This frame of blurred image corresponds to a sequence of image frames. For example, assuming that six consecutive frames of images between T0 and T1 are averaged to obtain blurred image B1, then the time period corresponding to blurred image B1 is T0-T1.
  • Event data can include multiple event points, and event points can also be called events.
  • the most basic principle of an event camera is to output an event point when the cumulative brightness change of a certain pixel reaches the trigger condition (the change reaches a certain level). Therefore, an event point can be understood as an expression of an event: at what time (time stamp) and at which pixel point (pixel coordinates) the brightness increased or decreased (brightness change).
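A minimal sketch of this triggering rule is given below, using synthetic log-brightness frames; the threshold value, the timestamps, and the per-pixel reset behaviour are illustrative assumptions, not parameters of this application.

```python
import numpy as np

def events_from_frames(frames, timestamps, threshold=0.2):
    """Emit (t, x, y, polarity) tuples whenever the accumulated
    log-brightness change at a pixel exceeds the threshold."""
    eps = 1e-6
    log_prev = np.log(frames[0] + eps)
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_cur = np.log(frame + eps)
        diff = log_cur - log_prev
        ys, xs = np.where(np.abs(diff) > threshold)
        for x, y in zip(xs, ys):
            events.append((t, x, y, 1 if diff[y, x] > 0 else -1))
        # Reset the reference only where an event fired, mimicking the
        # per-pixel accumulation behaviour of an event camera.
        fired = np.abs(diff) > threshold
        log_prev[fired] = log_cur[fired]
    return events

frames = np.random.rand(5, 64, 64).astype(np.float32)
timestamps = np.linspace(0.0, 1.0, 5)
print(len(events_from_frames(frames, timestamps)))
```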
  • the blurred image can be a grayscale image whose size is H*W, where H represents the height of the image and W represents the width of the image.
  • event data can be extracted into event features F E (such as the second feature representation in the embodiment of the present application) through a feature extraction network (such as multiple convolutional layers).
  • the first feature representation and the second feature representation obtained through the feature extraction network may be feature representations of the same size.
  • the "size" of the feature representation here can be understood as the width and height of the feature representation.
  • the feature representation of the blurred image and the feature representation of the event data are processed through N series-connected feature nesting blocks, wherein the first feature representation and the second feature representation can be input to the N series-connected feature nesting blocks.
  • According to the first feature representation of the blurred image and the second feature representation of the event data, the first scene flow corresponding to the blurred image and the second scene flow corresponding to the event data are obtained through the scene flow prediction network.
  • The size of the first scene flow is consistent with the first feature representation, and each pixel feature in the first scene flow indicates the motion information from the pixel feature at the corresponding pixel position in the first feature representation to the pixel feature at the corresponding pixel position in the second feature representation; the size of the second scene flow is consistent with the second feature representation, and each pixel feature in the second scene flow indicates the motion information from the pixel feature at the corresponding pixel position in the second feature representation to the pixel feature at the corresponding pixel position in the first feature representation.
  • the scene flow prediction network can be a network included in a feature nesting block introduced above, and the first feature representation of the blur image and the second feature representation of the event data can be input to the scene flow prediction network.
  • the first feature representation of the blurred image and the second feature representation of the event data can be input into the scene flow prediction network to obtain the first scene flow corresponding to the blurred image and the second scene flow corresponding to the event data.
  • the scene flow prediction network may include a first encoding module, a second encoding module, a fusion module, a first decoding module and a second decoding module; obtaining the first scene flow corresponding to the blurred image and the second scene flow corresponding to the event data through the scene flow prediction network, according to the first feature representation of the blurred image and the second feature representation of the event data, may include: obtaining a first encoding result through the first encoding module according to the first feature representation; obtaining a second encoding result through the second encoding module according to the second feature representation; obtaining a fusion result through the fusion module according to the first encoding result and the second encoding result; and obtaining, according to the fusion result, the first scene flow corresponding to the blurred image and the second scene flow corresponding to the event data through the first decoding module and the second decoding module respectively.
  • the fusion module is configured to implement a first fusion of the first encoding result and the second encoding result based on an attention mechanism.
  • the scene flow prediction network can also be called a multi-scale bidirectional scene flow network (for example, the multi-scale bidirectional scene flow prediction 2i10 shown in Figure 9).
  • the scene flow prediction network can be a "two-input-two-output" network. The specific structure can be seen in the diagram in Figure 10.
  • For the input first feature representation (blurred image feature) and second feature representation (event information feature), features are first extracted through independent encoder networks (the first encoding module processes the first feature representation, and the second encoding module processes the second feature representation); the fusion module then fuses the blurred image features and the event information features (for example, fusion based on an attention module, where the attention module generates attention features used to fuse the blurred image features and the event data features); finally, the fused features are passed through independent decoder networks (the first decoding module generates the first scene flow, and the second decoding module generates the second scene flow) to produce the corresponding scene flows. A minimal sketch of this structure is given below.
  • the first scene flow can represent the alignment relationship between the blurred image features and the event information features;
  • the second scene flow can represent the alignment relationship between the event information features and the blurred image features.
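The sketch below shows one way such a two-input, two-output network could be organized in PyTorch: separate encoders, an attention-based fusion, and separate decoders that each output a 2-channel flow field. It is a minimal sketch under assumed layer sizes and an assumed channel-attention form, not the exact network of this application.

```python
import torch
import torch.nn as nn

class BiSceneFlowNet(nn.Module):
    """Two encoders, an attention-based fusion, and two decoders."""
    def __init__(self, ch=32):
        super().__init__()
        enc = lambda: nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.enc_b, self.enc_e = enc(), enc()
        # Simple channel attention over the concatenated encodings (assumed form).
        self.att = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(2 * ch, 2 * ch, 1), nn.Sigmoid())
        self.mix = nn.Conv2d(2 * ch, ch, 1)
        dec = lambda: nn.Sequential(
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2, 3, padding=1))   # 2-channel flow field (dx, dy)
        self.dec_b, self.dec_e = dec(), dec()

    def forward(self, f_b, f_e):
        e_b, e_e = self.enc_b(f_b), self.enc_e(f_e)
        cat = torch.cat([e_b, e_e], dim=1)
        fused = self.mix(cat * self.att(cat))
        return self.dec_b(fused), self.dec_e(fused)  # first / second scene flow

net = BiSceneFlowNet()
f_b, f_e = torch.rand(1, 32, 64, 64), torch.rand(1, 32, 64, 64)
flow_b2e, flow_e2b = net(f_b, f_e)
print(flow_b2e.shape, flow_e2b.shape)  # both 1 x 2 x 64 x 64
```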
  • the feature representation of the blurred image and the feature representation of the event data are not the same modal information. If the feature representation of the blurred image and the feature representation of the event data are directly fused, the fusion result obtained will be inaccurate.
  • In the embodiments of this application, two different encoding modules are first used to encode the feature representation of the blurred image and the feature representation of the event data respectively, converting them into data of a similar modality; the encoding results are then fused, so that an accurate fusion result can be obtained.
  • the size of the first scene flow is consistent with the first feature representation, and each pixel feature in the first scene flow indicates the motion information from the pixel feature at the corresponding pixel position in the first feature representation to the pixel feature at the corresponding pixel position in the second feature representation; the size of the second scene flow is consistent with the second feature representation, and each pixel feature in the second scene flow indicates the motion information from the pixel feature at the corresponding pixel position in the second feature representation to the pixel feature at the corresponding pixel position in the first feature representation.
  • the pixel feature here can refer to a point at the spatial position (x, y), which may include multiple channels.
  • the motion information can be expressed as a two-dimensional instantaneous velocity field, in which the two-dimensional velocity vector is the projection of the three-dimensional velocity vector of the visible point in the scene on the imaging surface.
  • the scene flow in the embodiment of the present application is similar to the optical flow, and the information of each pixel position is a vector with direction.
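In symbols, one common way to write the warp guided by such a per-pixel motion field is a backward warp; this is stated here as an assumed formulation rather than the application's exact definition:

```latex
% Backward warp of a feature map F by a flow field s (assumed formulation):
% the warped feature at pixel x samples F at the displaced position x + s(x).
\[
  \mathrm{warp}(F, s)(\mathbf{x}) \;=\; F\bigl(\mathbf{x} + s(\mathbf{x})\bigr),
  \qquad \mathbf{x} = (u, v), \quad s(\mathbf{x}) \in \mathbb{R}^{2}.
\]
```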
  • the encoding module (encoder, such as the first encoding module and the second encoding module introduced above) can consist of successive convolutions and downsampling; the fusion module can be a spatial attention structure or a channel attention structure.
  • the decoding module decoder (such as the first decoding module and the second decoding module introduced above) can be continuous upsampling and convolution.
  • Pixel-level alignment between blur image features and event data features can be achieved through scene flow prediction.
  • The multi-scale bidirectional scene flow alignment 2i1 specifically includes multi-scale bidirectional scene flow prediction 2i10, the blurred image feature warp operation 2i11, and the event information feature warp operation 2i12.
  • After the two scene flows are obtained, a cross warp operation can be performed with the corresponding features to obtain the warped features (the third feature representation and the fourth feature representation).
  • warp(*) is a traditional pixel-to-pixel spatial warp operation and is a non-learnable operator.
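A minimal PyTorch sketch of such a non-learnable, pixel-to-pixel warp operator is shown below; it uses torch.nn.functional.grid_sample with a flow field expressed in pixels, which is an assumed convention rather than the application's exact operator.

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warp a feature map by a per-pixel flow field.

    feat: (B, C, H, W) feature representation
    flow: (B, 2, H, W) motion in pixels, channel 0 = dx, channel 1 = dy
    """
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid_x = xs.to(feat) + flow[:, 0]          # sampling positions x + dx
    grid_y = ys.to(feat) + flow[:, 1]          # sampling positions y + dy
    # Normalize to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * grid_x / (w - 1) - 1.0
    grid_y = 2.0 * grid_y / (h - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)   # (B, H, W, 2)
    return F.grid_sample(feat, grid, mode='bilinear',
                         padding_mode='border', align_corners=True)

feat = torch.rand(1, 32, 64, 64)
flow = torch.zeros(1, 2, 64, 64)   # zero flow leaves the features unchanged
print(torch.allclose(warp(feat, flow), feat, atol=1e-5))
```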
  • pixel-level alignment between blurred image features and event features can be achieved through multi-scale bidirectional scene flow alignment.
  • This network structure, combined with the attention structure and the warp operation, can obtain fine information at different granularity levels from the event information, which is conducive to extracting clear texture structures and facilitates deblurring of the blurred image.
  • For an image collected by an RGB camera, since it is obtained by fusing multiple frames collected within the exposure time, the fused image may lose part of the information compared with the event data (the event data collected by the event camera includes multiple frames of event data collected within the exposure time). For example, an object in the scene may be occluded during the exposure time, so that the object is invalid in the blurred image collected by the RGB camera while the information of that object is still valid in the event data.
  • Similarly, since event data is recorded only when the brightness change of a point at a pixel position is greater than the threshold, some of the data may be invalid. Therefore, if the information of an originally invalid (or occluded) area in the image data or event data is used directly, artifacts caused by reduced image quality will appear.
  • For the occluded area in the blurred image, a second occlusion area can be identified based on the fourth feature representation and the first feature representation (for example, a lightweight network, such as successive convolutions and residual connections, can be used to determine the occlusion area), where the image data of the second occlusion area in the blurred image is valid in the second occlusion area of the event data; the feature representation outside the second occlusion area in the first feature representation and the feature representation of the second occlusion area in the fourth feature representation are then subjected to a second fusion to obtain a second fused feature representation.
  • some features of the blurred image are occluded, and the information in the event data can be used to replace these occluded features to obtain a more accurate feature representation.
  • the second occlusion area can be represented by a second mask, the size of the second mask is consistent with the fourth feature representation, and each pixel in the second mask is used to indicate whether the pixel feature at the corresponding position in the first feature representation is valid in the blurred image. For example, 0 and 1 can be used in the second mask to identify whether the pixel feature at the corresponding position is valid, with 0 meaning invalid and 1 meaning valid.
  • For the occluded area in the event data, a first occlusion area can be determined according to the third feature representation and the second feature representation (for example, a lightweight network, such as successive convolutions and residual connections, can be used to determine the occlusion area), where the image data of the first occlusion area in the event data is valid in the first occlusion area of the blurred image; the feature representation outside the first occlusion area in the second feature representation and the feature representation of the first occlusion area in the third feature representation are subjected to a second fusion to obtain a first fused feature representation.
  • some features of the event information are occluded, and the information in the blurred image can be used to replace these occluded features, thereby obtaining a more accurate feature representation.
  • the first occlusion area is represented by a first mask, the size of the first mask is consistent with the third feature representation, and each pixel in the first mask is used to indicate whether the pixel feature at the corresponding position in the third feature representation is valid in the event data. For example, 0 and 1 can be used in the first mask to identify whether the pixel feature at the corresponding position is valid, with 0 meaning invalid and 1 meaning valid.
  • the second fusion is an addition operation of corresponding pixel positions.
  • the occlusion area in the blurred image can be processed, thereby reducing the artifact problem caused by the occlusion area.
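A minimal sketch of such occlusion-aware fusion is shown below: a lightweight convolutional mask generator, an element-wise (dot) product with the warped feature, and a concatenation-plus-convolution fusion. The layer sizes and the hard 0/1 thresholding are illustrative assumptions (a soft mask would typically be used during training to keep gradients), not the exact modules of this application.

```python
import torch
import torch.nn as nn

class OcclusionAwareFusion(nn.Module):
    """Generate an occlusion mask from (warped feature, self feature),
    keep the warped feature only where it is valid, then fuse."""
    def __init__(self, ch=32):
        super().__init__()
        self.mask_net = nn.Sequential(           # lightweight mask generator
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, warped, self_feat):
        mask = self.mask_net(torch.cat([warped, self_feat], dim=1))
        mask = (mask > 0.5).float()              # one-hot style 0/1 mask (illustrative)
        visible = warped * mask                  # occlusion-processed feature
        return self.fuse(torch.cat([self_feat, visible], dim=1))

fusion = OcclusionAwareFusion()
warped_event = torch.rand(1, 32, 64, 64)   # e.g. the fourth feature representation
blur_feat = torch.rand(1, 32, 64, 64)      # e.g. the first feature representation
fused = fusion(warped_event, blur_feat)
print(fused.shape)                         # 1 x 32 x 64 x 64
```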
  • the feature representation of the blurred image and the feature representation of the event data can be processed through N feature nested blocks connected in series to obtain the processing result for deblurring; where, each The feature nesting block is used to perform the image processing method described above.
  • the first feature nesting block is used to obtain the feature representations extracted from the blurred image and the event data through the feature extraction network;
  • the n-th feature nesting block is used to obtain the feature representation output by the (n-1)-th feature nesting block, where n is less than N.
  • the feature representation output by the Nth feature nesting block is used to fuse with the feature representation extracted from the blurred image through the feature extraction network to obtain residual information.
  • the residual information is used to be fused with the blurred image to achieve deblurring of the blurred image.
  • each scene-flow-guided dual feature nesting block 2i0 includes two symmetrical occlusion-aware feature fusions (event information feature occlusion-aware feature fusion 2i2 and blurred image feature occlusion-aware feature fusion 2i3).
  • For the blurred image feature occlusion-aware feature fusion 2i3, its internal structure is shown in Figure 11, including occlusion area mask generation 2i30, occlusion area feature generation 2i31, and feature fusion 2i32.
  • Occlusion area mask generation 2i30: taking the feature generated by the blurred image feature warp operation (warp feature for short) and the event information feature (self feature for short) as input, an occlusion area mask M B with the same resolution is adaptively generated through a lightweight network (such as successive convolutions and residual connections) (optionally, the mask can use one-hot encoding, with values of only 0 and 1), representing the correlation between the warped blurred image features and the event information features.
  • Occlusion area feature generation 2i31: after the occlusion mask is obtained, a dot multiplication operation can be performed with the warp feature to obtain the occlusion-processed feature (module 2i3 generates the occlusion-processed blurred image feature, and the dual module 2i2 generates the corresponding event feature), which represents the visible features after occlusion processing.
  • Feature fusion 2i32: taking the occlusion-processed feature and the original self feature as input, the final fused feature is generated through concatenation (channel splicing), convolution and other operations.
  • the resulting representation incorporates the occlusion-processed blurred image features into the event information features.
  • the feature representation output by the Nth feature nesting block is used to fuse with the feature representation extracted from the blurred image through the feature extraction network to obtain residual information.
  • the residual information is used to be fused with the blurred image to achieve deblurring of the blurred image.
  • the blurred image feature F B and the event feature F E can be received as input by the scene-flow-guided dual feature nesting 200; after processing by N scene-flow-guided dual feature nesting blocks, the nested blurred image feature and event feature are generated. The global feature fusion 300 receives the nested blurred image feature, the nested event feature and the original blurred feature F B as input, and generates the fused mixed feature F mix through operations such as convolution (or addition, or splicing). The summation operation 400 receives the mixed feature F mix and the original input blurred image B as input, and generates the final deblurred sharp result O through an addition operation.
  • Blurred image feature extraction 100: for a given input blurred image B (generally a grayscale image whose size is H*W, where H represents the height of the image and W represents the width of the image), the blurred image feature F B is extracted through multiple convolutional layers.
  • Event information feature extraction 101: for a given input event information E, event features F E are extracted through multiple convolutional layers. It should be noted that the spatial resolution of the event information at a certain moment is the same as that of the blurred image, namely H*W; but what is input here is all the event information within the exposure time of the blurred image, comprising N channels, so the event information input is H*W*N, where N represents the number of event information frames.
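The following sketch shows one way the events inside the exposure time could be binned into N channels of size H x W; the binning scheme (simple polarity accumulation per temporal bin) is an assumption for illustration, not necessarily the representation used by this application.

```python
import numpy as np

def events_to_channels(events, H, W, t0, t1, N=16):
    """Accumulate event polarities into N temporal bins of size H x W."""
    vol = np.zeros((N, H, W), dtype=np.float32)
    for t, x, y, p in events:
        if t0 <= t < t1:
            k = min(int((t - t0) / (t1 - t0) * N), N - 1)
            vol[k, y, x] += p
    return vol  # shape (N, H, W), i.e. an H*W*N event input

events = [(0.1, 5, 7, 1), (0.55, 5, 7, -1), (0.9, 20, 30, 1)]
print(events_to_channels(events, H=64, W=64, t0=0.0, t1=1.0).shape)
```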
  • the scene flow-guided dual feature nesting 200 receives the fuzzy image feature F B and the event feature F E as input, and after N times of scene flow-guided dual feature nesting block processing, the nested fuzzy image feature is generated.
  • For the i-th scene-flow-guided dual feature embedding processing, the operation can be expressed as Formula 2.
  • the input is the blurred image feature information and the event information feature output by the previous block.
  • After multi-scale bidirectional scene flow alignment 2i1, event information feature occlusion-aware feature fusion 2i2, blurred image feature occlusion-aware feature fusion 2i3 and other processing, the updated blurred image feature information and event information feature are generated.
  • multi-scale scene flow alignment 2i1 internally includes multi-scale bidirectional scene flow prediction 2i10, blur image feature warp operation 2i11, and event information feature warp operation 2i12.
  • For details, refer to the multi-scale bidirectional scene flow alignment 2i1 introduced above.
  • Event information feature occlusion-aware feature fusion 2i2: the blurred image feature and the warped event information feature are received as input. First, the occlusion mask generation module 2i20, composed of lightweight convolutions, generates a one-hot encoded mask M E with the same resolution as the input feature (an encoding of 0 indicates that the regional feature is occluded in the event information, so the blurred image feature is preferred; conversely, an encoding of 1 indicates that the regional feature is occluded in the blurred image, so the event information feature is preferred); then the warped event information feature is multiplied element-wise with the occlusion mask M E to obtain the occlusion-processed feature; finally, the blurred image feature and the occlusion-processed feature undergo channel fusion and convolution operations to generate the fused blurred image feature.
  • Blurred image feature occlusion-aware feature fusion 2i3: the event information feature and the warped blurred image feature are received as input;
  • through modules such as occlusion area mask generation 2i30, occlusion area feature generation 2i31, and feature fusion 2i32, the fused event information feature is generated.
  • Global feature fusion 300: the nested blurred image feature, the nested event feature and the original blurred feature F B are received as input, and the fused mixed feature F mix is generated through operations such as convolution (or addition, or splicing).
  • the summation operation 400 receives the mixed feature F mix and the original input blurred image B as input, and generates the final clear result O after deblurring through the addition operation.
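Putting the pieces together, the sketch below mirrors the overall data flow (feature extraction 100/101, N nesting blocks 200, global fusion 300, summation 400) with simplified placeholder blocks; all layer choices are assumptions, and the real nesting blocks contain the scene-flow alignment and occlusion-aware fusion described above rather than the plain convolutions used here.

```python
import torch
import torch.nn as nn

class NestingBlock(nn.Module):
    """Stand-in for one scene-flow-guided dual feature nesting block 2i0."""
    def __init__(self, ch):
        super().__init__()
        self.update_b = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.update_e = nn.Conv2d(2 * ch, ch, 3, padding=1)
    def forward(self, f_b, f_e):
        cat = torch.cat([f_b, f_e], dim=1)
        return f_b + self.update_b(cat), f_e + self.update_e(cat)

class DeblurNet(nn.Module):
    def __init__(self, ch=32, n_blocks=4, event_ch=16):
        super().__init__()
        self.ext_b = nn.Conv2d(1, ch, 3, padding=1)         # 100: blurred image features F_B
        self.ext_e = nn.Conv2d(event_ch, ch, 3, padding=1)  # 101: event features F_E
        self.blocks = nn.ModuleList(                        # 200: N nesting blocks
            [NestingBlock(ch) for _ in range(n_blocks)])
        self.global_fuse = nn.Conv2d(3 * ch, 1, 3, padding=1)  # 300: mixed feature F_mix
    def forward(self, blurred, events):
        f_b0 = self.ext_b(blurred)
        f_b, f_e = f_b0, self.ext_e(events)
        for blk in self.blocks:
            f_b, f_e = blk(f_b, f_e)
        f_mix = self.global_fuse(torch.cat([f_b, f_e, f_b0], dim=1))
        return blurred + f_mix                               # 400: residual addition

net = DeblurNet()
out = net(torch.rand(1, 1, 64, 64), torch.rand(1, 16, 64, 64))
print(out.shape)  # 1 x 1 x 64 x 64, the deblurred result O
```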
  • the method of this application achieves better PSNR/SSIM (higher is better) than existing methods.
  • the single-frame deblurring PSNR is improved by 2.9 dB compared with the existing SOTA, and multi-frame deblurring is improved by 2.7 dB. See Table 1 for details.
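For reference, PSNR, the metric quoted here, can be computed as below; this is the standard definition, independent of this application, and SSIM is usually taken from an image-processing library.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(max_val ** 2 / mse)

gt = np.random.rand(64, 64)
noisy = np.clip(gt + 0.05 * np.random.randn(64, 64), 0, 1)
print(round(psnr(noisy, gt), 2))
```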
  • the dual feature nesting, multi-scale bidirectional scene flow prediction, and occlusion-aware feature fusion described in the present invention all have positive effects.
  • the multi-scale bidirectional scene flow prediction effect is the most obvious, which can improve PSNR by more than 2 db.
  • Table 2 shows the results of ablation experiments. Among them, D in Table 2 represents dual feature nesting without the multi-scale bidirectional scene flow prediction described in the present invention, MSE represents the multi-scale scene flow prediction described in the present invention, and OAFF represents occlusion-aware feature fusion.
  • In Figure 14, the scene flow and the occlusion-aware mask are visualized.
  • (a) in Figure 14 is the blurred image. It can be seen that the regions with large scene flow changes in (b) of Figure 14 are well annotated by the mask in (c) of Figure 14, which marks exactly the occluded areas. Through additional processing of the occlusion areas, the final result in (d) of Figure 14 handles occlusion well and produces a clear result.
  • An embodiment of the present application provides an image processing method, including: obtaining a first feature representation of a blurred image and a second feature representation of event data collected by an event camera, where the sizes of the first feature representation and the second feature representation are consistent; obtaining, according to the first feature representation of the blurred image and the second feature representation of the event data and through a scene flow prediction network, a first scene flow corresponding to the blurred image and a second scene flow corresponding to the event data, where the size of the first scene flow is consistent with the first feature representation, each pixel feature in the first scene flow indicates the motion information from the pixel feature at the corresponding pixel position in the first feature representation to the pixel feature at the corresponding pixel position in the second feature representation, the size of the second scene flow is consistent with the second feature representation, and each pixel feature in the second scene flow indicates the motion information from the pixel feature at the corresponding pixel position in the second feature representation to the pixel feature at the corresponding pixel position in the first feature representation; performing an affine transformation (warp) on the first feature representation according to the first scene flow to obtain a third feature representation; and performing an affine transformation on the second feature representation according to the second scene flow to obtain a fourth feature representation, where the third feature representation and the fourth feature representation are used to deblur the blurred image.
  • Embodiments of the present application further provide an image processing system.
  • the image processing system may include user equipment and data processing equipment.
  • user equipment includes smart terminals such as mobile phones, personal computers, or information processing centers.
  • the user equipment is the initiator of the image processing;
  • as the initiator of an image enhancement request, the user usually initiates the request through the user equipment.
  • the above-mentioned data processing equipment may be a cloud server, a network server, an application server, a management server, and other equipment or servers with data processing functions.
  • the data processing device receives the image enhancement request from the smart terminal through the interactive interface, and then performs image processing in machine learning, deep learning, search, reasoning, decision-making and other methods through the memory that stores the data and the processor that processes the data.
  • the memory in the data processing device can be a general term, including local storage and a database that stores historical data.
  • the database can be on the data processing device or on other network servers.
  • the user equipment can receive instructions from the user. For example, the user equipment can obtain an image input/selected by the user and then initiate a request to the data processing device, so that the data processing device executes an image enhancement processing application (such as image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, etc.) on the image obtained by the user equipment, thereby obtaining the corresponding processing result for the image. For example, the user device can obtain an image input by the user and then initiate an image denoising request to the data processing device, so that the data processing device performs image denoising on the image, thereby obtaining a denoised image.
  • the data processing device can execute the image processing method according to the embodiment of the present application.
  • the user device can directly serve as a data processing device, and the user device can directly obtain input from the user and process it directly by the hardware of the user device itself.
  • the user equipment can receive instructions from the user. For example, the user equipment can obtain an image selected by the user in the user equipment, and then the user equipment itself executes an image processing application (such as image super-resolution reconstruction, image denoising, image defogging, image deblurring, image contrast enhancement, etc.) on the image, thereby obtaining the corresponding processing result for the image. At this time, the user equipment itself can execute the image processing method in the embodiments of the present application.
  • Figure 15 is a schematic structural diagram of an image processing device provided by an embodiment of the present application. As shown in Figure 15, an image processing device 1500 provided by an embodiment of the present application includes:
  • the acquisition module 1501 is configured to acquire the first feature representation of the blurred image and the second feature representation of the event data collected by the event camera; the size of the first feature representation and the second feature representation are consistent.
  • For a specific description of the acquisition module 1501, reference may be made to the description of step 801 in the above embodiment, which will not be repeated here.
  • the scene flow prediction module 1502 is configured to obtain, according to the first feature representation of the blurred image and the second feature representation of the event data and through the scene flow prediction network, the first scene flow corresponding to the blurred image and the second scene flow corresponding to the event data;
  • the size of the first scene flow is consistent with the first feature representation, and each pixel feature in the first scene flow indicates the motion information from the pixel feature at the corresponding pixel position in the first feature representation to the pixel feature at the corresponding pixel position in the second feature representation; the size of the second scene flow is consistent with the second feature representation, and each pixel feature in the second scene flow indicates the motion information from the pixel feature at the corresponding pixel position in the second feature representation to the pixel feature at the corresponding pixel position in the first feature representation;
  • For a specific description of the scene flow prediction module 1502, reference may be made to the description of step 802 in the above embodiment, which will not be repeated here.
  • the affine transformation module 1503 is used to perform affine transformation on the first feature representation according to the first scene flow to obtain a third feature representation, and
  • to perform affine transformation on the second feature representation according to the second scene flow to obtain a fourth feature representation; the third feature representation and the fourth feature representation are used to deblur the blurred image.
  • For a specific description of the affine transformation module 1503, reference may be made to the description of step 803 and step 804 in the above embodiment, which will not be repeated here.
  • the blurred image and the event data are collected for the same scene in the same time period.
  • the scene flow prediction network includes a first encoding module, a second encoding module, a fusion module, a first decoding module and a second decoding module;
  • the scene flow prediction module is specifically used for:
  • a first encoding result is obtained through the first encoding module
  • a second encoding result is obtained through the second encoding module
  • a fusion result is obtained through the fusion module
  • the first scene stream corresponding to the blurred image and the second scene stream corresponding to the event data are obtained through the first decoding module and the second decoding module respectively.
  • the feature representation of the blurred image and the feature representation of the event data are not the same modal information. If the feature representation of the blurred image and the feature representation of the event data are directly fused, the fusion result obtained will be inaccurate.
  • In the embodiments of this application, two different encoding modules are first used to encode the feature representation of the blurred image and the feature representation of the event data respectively, converting them into data of a similar modality; the encoding results are then fused, so that an accurate fusion result can be obtained.
  • the device further includes:
  • An occlusion area identification module, configured to identify a second occlusion area according to the fourth feature representation and the first feature representation, wherein the image data of the second occlusion area in the blurred image is valid in the second occlusion area of the event data;
  • a second fusion is performed on the feature representation of the first feature representation other than the second occlusion area and the feature representation of the second occlusion area in the fourth feature representation to obtain a second fused feature representation.
  • the second occlusion area is represented by a second mask, the size of the second mask is consistent with the fourth feature representation, and each pixel in the second mask is used to indicate whether the pixel feature at the corresponding position in the first feature representation is valid in the blurred image.
  • the device further includes:
  • An occlusion area identification module, configured to determine a first occlusion area according to the third feature representation and the second feature representation, wherein the image data of the first occlusion area in the event data is valid in the first occlusion area of the blurred image;
  • a second fusion is performed on the feature representation of the second feature representation other than the first occlusion area and the feature representation of the first occlusion area in the third feature representation to obtain a first fused feature representation.
  • the first occlusion area is represented by a first mask, the size of the first mask is consistent with the third feature representation, and each pixel in the first mask is used to indicate whether the pixel feature at the corresponding position in the third feature representation is valid in the event data.
  • the occlusion area in the blurred image can be processed, thereby reducing the artifact problem caused by the occlusion area.
  • the second fusion is an addition operation of corresponding pixel positions.
  • the device further includes: a feature nesting module, configured to process the feature representation of the blurred image and the feature representation of the event data through N series-connected feature nesting blocks to obtain a processing result for deblurring;
  • wherein each of the feature nesting blocks is used to perform the image processing method described above, and the first feature nesting block is used to obtain the feature representations extracted from the blurred image and the event data through the feature extraction network;
  • the n-th feature nesting block is used to obtain the feature representation output by the (n-1)-th feature nesting block, where n is less than N.
  • the feature representation output by the N-th feature nesting block is used to be fused with the feature representation extracted from the blurred image through the feature extraction network to obtain residual information, and the residual information is used to be fused with the blurred image to achieve deblurring of the blurred image.
  • FIG. 16 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • the execution device 1600 can be embodied as a mobile phone, a tablet, a notebook computer, a smart wearable device, a server, etc., which is not limited here. The execution device 1600 implements the functions of the image processing method in the embodiment corresponding to Figure 8.
  • the execution device 1600 includes: a receiver 1601, a transmitter 1602, a processor 1603, and a memory 1604 (the number of processors 1603 in the execution device 1600 may be one or more), wherein the processor 1603 may include application processing processor 16031 and communication processor 16032.
  • the receiver 1601, the transmitter 1602, the processor 1603, and the memory 1604 may be connected by a bus or other means.
  • Memory 1604 may include read-only memory and random access memory and provides instructions and data to processor 1603 .
  • a portion of memory 1604 may also include non-volatile random access memory (NVRAM).
  • the memory 1604 stores processor and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
  • the processor 1603 controls the execution of operations of the device.
  • various components of the execution device are coupled together through a bus system.
  • the bus system may also include a power bus, a control bus, a status signal bus, etc.
  • various buses are called bus systems in the figure.
  • the methods disclosed in the above embodiments of the present application can be applied to the processor 1603 or implemented by the processor 1603.
  • the processor 1603 may be an integrated circuit chip with signal processing capabilities. During the implementation process, each step of the above method can be completed by instructions in the form of hardware integrated logic circuits or software in the processor 1603 .
  • the above-mentioned processor 1603 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, a vision processing unit (VPU), a tensor processing unit (TPU) or another processor suitable for AI computing, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor 1603 can implement or execute each method, step and logical block diagram disclosed in the embodiment of this application.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory 1604.
  • the processor 1603 reads the information in the memory 1604 and completes steps 801 to 804 in the above embodiment in conjunction with its hardware.
  • the receiver 1601 may be configured to receive input numeric or character information and generate signal inputs related to performing relevant settings and functional controls of the device.
  • the transmitter 1602 can be used to output numeric or character information through the first interface; the transmitter 1602 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1602 can also include a display device such as a display screen .
  • FIG. 17 is a schematic structural diagram of the training device provided by the embodiment of the present application.
  • the training device 1700 is implemented by one or more servers.
  • the training device 1700 may differ considerably due to different configurations or performance, and may include one or more central processing units (CPU) 1717 (for example, one or more processors), memory 1732, and one or more storage media 1730 (for example, one or more mass storage devices) that store application programs 1742 or data 1744.
  • the memory 1732 and the storage medium 1730 may be short-term storage or persistent storage.
  • the program stored in the storage medium 1730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the training device. Furthermore, the central processor 1717 may be configured to communicate with the storage medium 1730 and execute a series of instruction operations in the storage medium 1730 on the training device 1700 .
  • the training device 1700 may also include one or more power supplies 1726, one or more wired or wireless network interfaces 1750, one or more input and output interfaces 1758; or, one or more operating systems 1741, such as Windows ServerTM, Mac OS XTM , UnixTM, LinuxTM, FreeBSDTM and so on.
  • the training device may perform steps 801 to 804 in the above embodiment.
  • An embodiment of the present application also provides a computer program product that, when run on a computer, causes the computer to perform the steps performed by the foregoing execution device, or causes the computer to perform the steps performed by the foregoing training device.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores a program for performing signal processing.
  • When the program is run on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
  • the execution device, training device or terminal device provided by the embodiment of the present application may specifically be a chip.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit may be, for example, a processor.
  • the communication unit may be, for example, an input/output interface, a pin or a circuit.
  • the processing unit can execute the computer execution instructions stored in the storage unit, so that the chip in the execution device executes the data processing method described in the above embodiment, or so that the chip in the training device executes the data processing method described in the above embodiment.
  • the storage unit is a storage unit within the chip, such as a register, cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as Read-only memory (ROM) or other types of static storage devices that can store static information and instructions, random access memory (random access memory, RAM), etc.
  • the processor mentioned in any of the above places can be a general central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above programs.
  • the device embodiments described above are only illustrative.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physically separate.
  • the physical unit can be located in one place, or it can be distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • the present application can be implemented by software plus the necessary general-purpose hardware; of course, it can also be implemented by dedicated hardware, including dedicated integrated circuits, dedicated CPUs, dedicated memories, dedicated components, etc. In general, all functions performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits or dedicated circuits. However, for this application, a software program implementation is the better implementation in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product.
  • the computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disk, and includes several instructions to cause a computer device (which can be a personal computer, a training device, a network device, etc.) to execute the methods described in the various embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device or data center to another website, computer, training device or data center through wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a training device or a data center integrated with one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media (eg, solid state disk (Solid State Disk, SSD)), etc.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

An image processing method, applicable to the field of artificial intelligence, the method including: obtaining a first feature representation of a blurred image and a second feature representation of event data collected by an event camera, the sizes of the first feature representation and the second feature representation being consistent; obtaining, according to the first feature representation of the blurred image and the second feature representation of the event data and through a scene flow prediction network, a first scene flow corresponding to the blurred image and a second scene flow corresponding to the event data; performing an affine transformation on the first feature representation according to the first scene flow to obtain a third feature representation; and performing an affine transformation on the second feature representation according to the second scene flow to obtain a fourth feature representation, the third feature representation and the fourth feature representation being used to deblur the blurred image. Through the alignment of multi-scale bidirectional scene flows, the present application can achieve fine-grained alignment between blurred image features and event information features, thereby improving the deblurring effect for blurred images.

Description

一种图像处理方法及相关装置
本申请要求于2022年6月30日提交中国专利局、申请号为202210764024.9、发明名称为“一种图像处理方法及相关装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,尤其涉及一种图像处理方法及相关装置。
背景技术
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
运动模糊一般发生在曝光时间内有明显运动的场景,特别是轻量级移动设备在低光环境下,例如手机和车载相机。虽然运动模糊导致不希望的图像退化,使得视觉内容变得不易解释,但运动模糊图像还编码关于照相机和观察场景之间相对运动的丰富信息。因此,从单个运动模糊图像中恢复(重建)清晰帧序列(photo-sequencing),有助于理解场景的动态,并在图像重建、自动驾驶和视频监控中具有广泛的应用。运动模糊的图像可以看作曝光时间内高清帧的平均值。由于平均会破坏帧的时间顺序,因此从单个运动模糊图像中恢复一组清晰的帧序列是非常不恰当的,也就是说待恢复的序列不具有唯一性,可能存在不同高清帧的序列组成相同的运动模糊图像。
为了解决待恢复序列的不唯一性,引入了事件相机,事件相机可以提供时间序列帧间变化来引导序列的恢复。事件相机是生物启发、事件驱动、基于时间的神经形态视觉传感器,它以与传统相机截然不同的原理感知世界,通过异步工作测量亮度变化,一旦变化超过阈值,就会触发事件。事件相机摒弃了传统强度相机中曝光时间和帧等概念,能够以无帧模式(微秒时间分辨率)捕捉几乎连续的运动,因此不会遇到像模糊这样的问题。利用事件摄像机将非常有助于从模糊的图像中恢复清晰的帧。
通过事件信息的光流进行图像去模糊,核心思想是通过事件信息计算光流,使用该光流对模糊图像进行仿射变换(warp),并配合多种损失,实现曝光时间内任意时刻的图像去模糊。然而,由于事件信息和模糊图像属于两种不同的模态,当前没有表征两种场景不一致的度量方法,光流并非精确,且存在对像素级不对齐的问题。
发明内容
本申请提供了一种图像处理方法,通过多尺度双向场景流的对齐,可以实现模糊图像特征与事件信息特征的精细化对齐,得到精确的场景流信息,从而解决已有基于事件信息进行去模糊方法中对像素级对齐考虑欠妥的问题,进而提升模糊图片的去模糊效果。
第一方面,本申请实施例提供了一种图像处理方法,包括:获取模糊图像的第一特征表示以及事件相机采集的事件数据的第二特征表示;所述第一特征表示和所述第二特征表示的尺寸一致;根据所述模糊图像的第一特征表示以及所述事件数据的第二特征表示,通过场景流预测网络,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流,所述第一场景流和所述第一特征表示的尺寸一致,所述第一场景流中的每个像素特征指示由所述第一特征表示中对应像素位置的像素特征到所述第二特征表示中对应像素位置的像素特征的运动信息,所述第二场景流和所述第二特征表示的尺寸一致,所述第二场景流中的每个像素特征指示由所述第二特征表示中对应像素位置的像素特征到所述第一特征表示中对应像素位置的像素特征的运动信息;根据所述第一场景流,对第一特征表示进行仿射变换(warp),得到第三特征表示;根据所述第二场景流,对第二特征表示进行仿射变换,得到第四特征表示;所述第三特征表示和所述第四特征表示用于对所述模糊图像进行去模糊处理。
其中,这里的特征表示的“尺寸”可以理解为特征表示的宽和高。
其中,这里的像素特征可以指空间位置(x,y)的一个点,可能包含多个通道。
通过多尺度双向场景流的对齐,可以实现模糊图像特征与事件信息特征的精细化对齐,得到精确的场景流信息,从而解决已有基于事件信息进行去模糊方法中对像素级对齐考虑欠妥的问题,进而提升模糊图片的去模糊效果。
在一种可能的实现中,所述模糊图像和所述事件数据为在相同时间段针对于同一场景采集得到的。
在一种可能的实现中,所述场景流预测网络可以包括第一编码模块、第二编码模块、融合模块、第一解码模块以及第二解码模块;所述根据所述模糊图像的第一特征表示以及所述事件数据的第二特征表示,通过场景流预测网络,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流,可以包括:
根据所述第一特征表示,通过所述第一编码模块,得到第一编码结果;根据所述第二特征表示,通过所述第二编码模块,得到第二编码结果;根据所述第一编码结果和所述第二编码结果,通过所述融合模块,得到融合结果;根据所述融合结果,分别通过所述第一解码模块和所述第二解码模块,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流。
在一种可能的实现中,所述融合模块用于基于注意力机制实现所述第一编码结果和所述第二编码结果的第一融合。
本申请实施例中,第一场景流可以表征模糊图像特征与事件信息特征的对齐关系,第二场景流可以事件特征信息与模糊图像的对齐关系。
其中,模糊图像的特征表示和事件数据的特征表示不是同一个模态的信息,若对模糊图像的特征表示和事件数据的特征表示直接进行融合,得到的融合结果是不准确的,本申请实施例中,首先通过两个不同的编码模块分别对模糊图像的特征表示和事件数据的特征表示进行编码,使其转换为类似同一模态的数据,并对编码结果进行融合,进而可以得到准确的融合结果。
需要说明的是,本申请实施例中的场景流与光流类似,每个像素位置的信息是带方向的矢量。
本申请通过场景流预测可以实现模糊图像特征和事件数据特征之间的像素级对齐。
对于RGB相机采集的图像来说,由于其是曝光时间内采集的多帧图像通过融合得到的,融合后的图像相比于事件数据可能会丢失部分信息(事件相机采集的事件数据包括在曝光时间内采集的多帧事件数据),例如在采集某一场景的图像时,场景中的某一个对象在曝光时间内被遮挡了,在RGB采集的模糊图像中,该对象无效,而该部分对象的信息在事件数据中是有效的。类似的,由于事件数据是在像素位置的点的亮度变化大于阈值时才会标识出来,因此部分图像数据可能会无效。因此,若直接使用图像数据或者时间数据中原本无效(或者称之为被遮挡)区域的信息,则会出现图像质量下降导致的伪影。
在一种可能的实现中,针对于模糊图片中被遮挡区域,可以根据所述第四特征表示和所述第一特征表示,识别第二遮挡区域(例如可以使用轻量级网络(如连续的卷积和残差)来实现遮挡区域的确定),其中所述模糊图像中所述第二遮挡区域的图像数据在所述事件数据的所述第二遮挡区域中有效;将所述第一特征表示中除所述第二遮挡区域之外的特征表示和所述第四特征表示中所述第二遮挡区域的特征表示进行第二融合,以得到第二融合特征表示。也就是说,模糊图片的某些特征是遮挡的,可以使用事件数据中的信息替换这部分被遮挡的特征,进而得到更准确的特征表示。
在一种可能的实现中,所述第二遮挡区域可以通过第二掩膜mask表示,所述第二mask和所述第四特征表示的尺寸一致,所述第二mask中的每个像素用于指示所述第一特征表示中对应位置的像素特征是否在所述模糊图像中有效。例如,可以在第二mask中利用0和1来标识对应位置的像素特征是否在事件数据中有效,例如0表示无效,1表示有效。
在一种可能的实现中,针对于事件数据中被遮挡的区域,可以根据所述第三特征表示和所述第二特征表示,确定第一遮挡区域(例如可以使用轻量级网络(如连续的卷积和残差)来实现遮挡区域的确定),其中所述事件数据中所述第一遮挡区域的图像数据在所述模糊图像的所述第一遮挡区域中有效;将所述第二特征表示中除所述第一遮挡区域之外的特征表示和所述第三特征表示中所述第一遮挡区域的特征表示进行第二融合,以得到第一融合特征表示。也就是说,事件信息的某些特征是遮挡的,可以使用模 糊图片中的信息替换这部分被遮挡的特征,进而得到更准确的特征表示。
在一种可能的实现中,所述第一遮挡区域通过第一掩膜mask表示,所述第一mask和所述第三特征表示的尺寸一致,所述第一mask中的每个像素用于指示所述第三特征表示中对应位置的像素特征是否在所述事件数据中有效。例如,可以在第一mask中利用0和1来标识对应位置的像素特征是否在事件数据中有效,例如0表示无效,1表示有效。
在一种可能的实现中,所述第二融合为对应像素位置的相加运算。
通过上述方式,通过设置显式的遮挡感知特征融合,可以实现对模糊图片中遮挡区域进行处理,从而降低遮挡区域产生的伪影问题。
在一种可能的实现中,所述方法还包括:通过N个串联连接的特征嵌套块,处理模糊图像的特征表示以及事件数据的特征表示,以得到用于进行去模糊处理的处理结果;其中,每个所述特征嵌套块用于执行如第一方面的图像处理方法,第1个特征嵌套块用于获取到通过特征提取网络对所述模糊图像和所述事件数据提取的特征表示,第n个特征嵌套块用于获取到第n-1个特征嵌套块输出的特征表示,所述n小于N。
在一种可能的实现中,第N个所述特征嵌套块输出的特征表示用于和所述通过特征提取网络对所述模糊图像提取的特征表示进行融合,得到残差信息,所述残差信息用于和所述模糊图像进行融合以实现所述模糊图像的去模糊处理。
第二方面,本申请提供了一种图像处理装置,所述装置包括:
获取模块,用于获取模糊图像的第一特征表示以及事件相机采集的事件数据的第二特征表示;所述第一特征表示和所述第二特征表示的尺寸一致;
场景流预测模块,用于根据所述模糊图像的第一特征表示以及所述事件数据的第二特征表示,通过场景流预测网络,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流,所述第一场景流和所述第一特征表示的尺寸一致,所述第一场景流中的每个像素特征指示由所述第一特征表示中对应像素位置的像素特征到所述第二特征表示中对应像素位置的像素特征的运动信息,所述第二场景流和所述第二特征表示的尺寸一致,所述第二场景流中的每个像素特征指示由所述第二特征表示中对应像素位置的像素特征到所述第一特征表示中对应像素位置的像素特征的运动信息;
仿射变换模块,用于根据所述第一场景流,对第一特征表示进行仿射变换,得到第三特征表示;
根据所述第二场景流,对第二特征表示进行仿射变换,得到第四特征表示;所述第三特征表示和所述第四特征表示用于对所述模糊图像进行去模糊处理。
通过多尺度双向场景流的对齐,可以实现模糊图像特征与事件信息特征的精细化对齐,得到精确的场景流信息,从而解决已有基于事件信息进行去模糊方法中对像素级对齐考虑欠妥的问题,进而提升模糊图片的去模糊效果。
在一种可能的实现中,所述模糊图像和所述事件数据为在相同时间段针对于同一场景采集得到的。
在一种可能的实现中,所述场景流预测网络包括第一编码模块、第二编码模块、融合模块、第一解码模块以及第二解码模块;
所述场景流预测模块,具体用于:
根据所述第一特征表示,通过所述第一编码模块,得到第一编码结果;
根据所述第二特征表示,通过所述第二编码模块,得到第二编码结果;
根据所述第一编码结果和所述第二编码结果,通过所述融合模块,得到融合结果;
根据所述融合结果,分别通过所述第一解码模块和所述第二解码模块,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流。
其中,模糊图像的特征表示和事件数据的特征表示不是同一个模态的信息,若对模糊图像的特征表示和事件数据的特征表示直接进行融合,得到的融合结果是不准确的,本申请实施例中,首先通过两个不同的编码模块分别对模糊图像的特征表示和事件数据的特征表示进行编码,使其转换为类似同一模态 的数据,并对编码结果进行融合,进而可以得到准确的融合结果。
在一种可能的实现中,所述装置还包括:
遮挡区域识别模块,用于根据所述第四特征表示和所述第一特征表示,识别第二遮挡区域,其中所述模糊图像中所述第二遮挡区域的图像数据在所述事件数据的所述第二遮挡区域中有效;
将所述第一特征表示中除所述第二遮挡区域之外的特征表示和所述第四特征表示中所述第二遮挡区域的特征表示进行第二融合,以得到第二融合特征表示。
在一种可能的实现中,所述第二遮挡区域通过第二掩膜mask表示,所述第二mask和所述第四特征表示的尺寸一致,所述第二mask中的每个像素用于指示所述第一特征表示中对应位置的像素特征是否在所述模糊图像中有效。
在一种可能的实现中,所述装置还包括:
遮挡区域识别模块,用于根据所述第三特征表示和所述第二特征表示,确定第一遮挡区域,其中所述事件数据中所述第一遮挡区域的图像数据在所述模糊图像的所述第一遮挡区域中有效;
将所述第二特征表示中除所述第一遮挡区域之外的特征表示和所述第三特征表示中所述第一遮挡区域的特征表示进行第二融合,以得到第一融合特征表示。
在一种可能的实现中,所述第一遮挡区域通过第一掩膜mask表示,所述第一mask和所述第三特征表示的尺寸一致,所述第一mask中的每个像素用于指示所述第三特征表示中对应位置的像素特征是否在所述事件数据中有效。
通过上述方式,通过设置显式的遮挡感知特征融合,可以实现对模糊图片中遮挡区域进行处理,从而降低遮挡区域产生的伪影问题。
在一种可能的实现中,所述第二融合为对应像素位置的相加运算。
在一种可能的实现中,所述装置还包括:特征嵌套模块,用于通过N个串联连接的特征嵌套块,处理模糊图像的特征表示以及事件数据的特征表示,以得到用于进行去模糊处理的处理结果;其中,每个所述特征嵌套块用于执行如第一方面的图像处理方法,第1个特征嵌套块用于获取到通过特征提取网络对所述模糊图像和所述事件数据提取的特征表示,第n个特征嵌套块用于获取到第n-1个特征嵌套块输出的特征表示,所述n小于N。
在一种可能的实现中,所述第N个所述特征嵌套块输出的特征表示用于和所述通过特征提取网络对所述模糊图像提取的特征表示进行融合,得到残差信息,所述残差信息用于和所述模糊图像进行融合以实现所述模糊图像的去模糊处理。
第三方面,本申请实施例提供了一种图像处理装置,可以包括存储器、处理器以及总线系统,其中,存储器用于存储程序,处理器用于执行存储器中的程序,以执行如上述第一方面任一可选的方法。
第四方面,本申请实施例提供了一种计算机可读存储介质,计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面及任一可选的方法。
第五方面,本申请实施例提供了一种计算机程序产品,包括代码,当代码被执行时,用于实现上述第一方面及任一可选的方法。
第六方面,本申请提供了一种芯片系统,该芯片系统包括处理器,用于支持执行设备或训练设备实现上述方面中所涉及的功能,例如,发送或处理上述方法中所涉及的数据;或,信息。在一种可能的设计中,芯片系统还包括存储器,所述存储器,用于保存执行设备或训练设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。
附图说明
图1为人工智能主体框架的一种结构示意图;
图2为本申请实施例提供的一种应用场景的示意图;
图3为本申请实施例提供的一种应用场景的示意图;
图4为本申请实施例提供的卷积神经网络的示意图;
图5为本申请实施例提供的卷积神经网络的示意图;
图6为本申请实施例提供的一种系统的结构示意;
图7为本申请实施例提供的一种芯片的结构示意;
图8为本申请实施例提供的一种图像处理方法的流程示意图;
图9为一种图像处理方法的流程示意;
图10为一种图像处理方法的流程示意;
图11为一种图像处理方法的流程示意;
图12为一种图像处理方法的流程示意;
图13为本申请实施例提供的一种图像处理方法的效果示意;
图14为本申请实施例提供的一种图像处理方法的效果示意;
图15为本申请实施例提供的一种图像处理装置的结构示意图;
图16为本申请实施例提供的一种执行设备的示意图;
图17为本申请实施例提供的一种训练设备的示意图。
具体实施方式
下面结合本发明实施例中的附图对本发明实施例进行描述。本发明的实施方式部分使用的术语仅用于对本发明的具体实施例进行解释,而非旨在限定本发明。
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。
首先对人工智能系统总体工作流程进行描述，请参见图1，图1示出的为人工智能主体框架的一种结构示意图，下面从“智能信息链”（水平轴）和“IT价值链”（垂直轴）两个维度对上述人工智能主体框架进行阐述。其中，“智能信息链”反映从数据的获取到处理的一系列过程。举例来说，可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中，数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人工智能的底层基础设施、信息（提供和处理技术实现）到系统的产业生态过程，反映人工智能为信息技术产业带来的价值。
(1)基础设施
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片(CPU、NPU、GPU、ASIC、FPGA等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中，模拟人类的智能推理方式，依据推理控制策略，利用形式化的信息进行机器思维和求解问题的过程，典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能交通、智能医疗、自动驾驶、智慧城市等。
本申请实施例中的图像处理方法可以应用在辅助驾驶、自动驾驶的智能车中,也可应用在智慧城市、智能终端等计算机视觉领域中的需要进行图像增强(例如图像去噪)的领域。下面分别结合图2和图3对视频流传输场景和视频监控场景进行简单的介绍。
视频流传输场景:
例如,在使用智能终端(例如,手机、车、机器人、平板电脑、台式电脑、智能手表、虚拟现实VR、增强现实AR设备等等中)的客户端播放视频时,为了减少视频流的带宽需求,服务器可以通过网络向客户端传输经过下采样的、分辨率较低的低质量视频流。然后客户端可以对该低质量视频流中的图像进行增强。例如,对视频中的图像进行超分辨率、去降噪等操作,最后向用户呈现高质量的图像。
视频监控场景:
在安防领域中,受限于监控相机安装位置、有限的存储空间等不利条件,部分视频监控的图像质量较差,这样会影响人或识别算法识别目标的准确性。因此,可以利用本申请实施例提供的图像处理方法将低质量的视频监控视频转化为高质量的高清视频,从而实现对监控图像中大量细节的有效恢复,为后续的目标识别任务提供更有效、更丰富的信息。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以xs(即输入数据)和截距1为输入的运算单元,该运算单元的输出可以为:
$h_{W,b}(x)=f\left(W^{T}x\right)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$
其中，s=1、2、……n，n为大于1的自然数，Ws为xs的权重，b为神经单元的偏置。f为神经单元的激活函数（activation functions），用于将非线性特性引入神经网络中，来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入，激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络，即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连，来提取局部接受域的特征，局部接受域可以是由若干个神经单元组成的区域。
(2)卷积神经网络（convolutional neuron network，CNN）是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器，该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中，一个神经元可以只与部分邻层神经元连接。一个卷积层中，通常包含若干个特征平面，每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重，这里共享的权重就是卷积核。共享权重可以理解为提取特征的方式与位置无关。卷积核可以以随机大小的矩阵的形式初始化，在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外，共享权重带来的直接好处是减少卷积神经网络各层之间的连接，同时又降低了过拟合的风险。
CNN是一种非常常见的神经网络,下面结合图4重点对CNN的结构进行详细的介绍。如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图4所示,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及全连接层(fully connected layer)230。
卷积层/池化层220:
卷积层:
如图4所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。
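作为示意，下面用PyTorch给出一个包含多个权重矩阵（卷积核）的卷积层示例，说明步长、填充与输出纵深维度的关系（代码仅为便于理解的草图，参数取值均为示例性假设，并非本申请实施例的实际实现）：

```python
import torch
import torch.nn as nn

# 输入为 1 张 3 通道、32x32 的图像（批大小为 1）
x = torch.randn(1, 3, 32, 32)

# 16 个 3x3 的卷积核（权重矩阵），每个卷积核的纵深维度与输入图像的通道数一致（均为 3）；
# 步长 stride=1、填充 padding=1 使输出的空间尺寸与输入一致
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

y = conv(x)
print(y.shape)  # torch.Size([1, 16, 32, 32])：16 个卷积核的输出堆叠成输出特征图的纵深维度
```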
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图4中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
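下面给出平均池化与最大池化使图像空间尺寸减半的一个简单示例（基于PyTorch，仅为示意）：

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)  # 平均池化：取每个 2x2 子区域的平均值
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)  # 最大池化：取每个 2x2 子区域的最大值

print(avg_pool(x).shape)  # torch.Size([1, 16, 16, 16])，空间尺寸减半
print(max_pool(x).shape)  # torch.Size([1, 16, 16, 16])
```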
全连接层230:
在经过卷积层/池化层220的处理后，卷积神经网络200还不足以输出所需要的输出信息。因为如前所述，卷积层/池化层220只会提取特征，并减少输入图像带来的参数。然而为了生成最终的输出信息（所需要的类信息或其他相关信息），卷积神经网络200需要利用全连接层230来生成一个或者一组所需要的类的数量的输出。因此，在全连接层230中可以包括多层隐含层（如图4所示的231、232至23n），该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到，例如该任务类型可以包括图像识别，图像分类，图像超分辨率重建等等……
在全连接层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图4由210至240方向的传播为前向传播)完成,反向传播(如图4由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图4所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,仅包括图4中所示的网络结构的一部分,比如,本申请实施例中所采用的卷积神经网络可以仅包括输入层210、卷积层/池化层220和输出层240。
需要说明的是，如图4所示的卷积神经网络200仅作为一种卷积神经网络的示例，在具体的应用中，卷积神经网络还可以以其他网络模型的形式存在，例如，如图5所示的多个卷积层/池化层并行，将分别提取的特征均输入给全连接层230进行处理。
(3)深度神经网络
深度神经网络（Deep Neural Network，DNN），也称多层神经网络，可以理解为具有很多层隐含层的神经网络，这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分，DNN内部的神经网络可以分为三类：输入层，隐含层，输出层。一般来说第一层是输入层，最后一层是输出层，中间的层数都是隐含层。层与层之间是全连接的，也就是说，第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂，但是就每一层的工作来说，其实并不复杂，简单来说就是如下线性关系表达式：$\vec{y}=\alpha(W\vec{x}+\vec{b})$，其中，$\vec{x}$是输入向量，$\vec{y}$是输出向量，$\vec{b}$是偏移向量，W是权重矩阵（也称系数），α()是激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。由于DNN层数多，则系数W和偏移向量$\vec{b}$的数量也就很多了。这些参数在DNN中的定义如下所述：以系数W为例：假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$，上标3代表系数W所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是：第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。需要注意的是，输入层是没有W参数的。在深度神经网络中，更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言，参数越多的模型复杂度越高，“容量”也就越大，也就意味着它能完成更复杂的学习任务。训练深度神经网络的过程也就是学习权重矩阵的过程，其最终目的是得到训练好的深度神经网络的所有层的权重矩阵（由很多层的向量W形成的权重矩阵）。
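作为示意，单层的线性关系与系数下标的含义可以用如下几行代码说明（基于PyTorch，数值均为随机示例，仅为帮助理解上述记号）：

```python
import torch

# 单个全连接层：y = α(W x + b)，这里激活函数 α 取 ReLU
x = torch.randn(4)          # 输入向量（上一层 4 个神经元）
W = torch.randn(3, 4)       # 权重矩阵：3 行对应本层 3 个神经元，4 列对应上一层 4 个神经元
b = torch.randn(3)          # 偏移向量

y = torch.relu(W @ x + b)

# W[j, k] 即“上一层第 k 个神经元到本层第 j 个神经元”的系数，
# 例如 W[1, 3] 对应上一层第 4 个神经元到本层第 2 个神经元（下标从 0 开始计数）
print(y.shape)  # torch.Size([3])
```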
(4)超分辨率
超分辨率(Super Resolution,SR)是一种图像增强技术,给定一张或一组低分辨率的图像,通过学习图像的先验知识、图像本身的相似性、多帧图像信息互补等手段恢复图像的高频细节信息,生成较高分辨率的目标图像。超分辨率在应用中,按照输入图像的数量,可分为单帧图像超分辨率和视频超分辨率。超分辨率在高清电视、监控设备、卫星图像和医学影像等领域有重要的应用价值。
(5)降噪
图像在数字化和传输过程中常受到成像设备与外部环境的影响,导致图像包含噪声。减少图像中噪声的过程称为图像降噪,有时候也可称为图像去噪。
(6)图像特征
图像特征主要有图像的颜色特征、纹理特征、形状特征和空间关系特征等。
颜色特征是一种全局特征,描述了图像或图像区域所对应的景物的表面性质;一般颜色特征是基于像素点的特征,此时所有属于图像或图像区域的像素都有各自的贡献。由于颜色对图像或图像区域的方向、大小等变化不敏感,所以颜色特征不能很好地捕捉图像中对象的局部特征。
纹理特征也是一种全局特征,它也描述了图像或图像区域所对应景物的表面性质;但由于纹理只是一种物体表面的特性,并不能完全反映出物体的本质属性,所以仅仅利用纹理特征是无法获得高层次图像内容的。与颜色特征不同,纹理特征不是基于像素点的特征,它需要在包含多个像素点的区域中进行统计计算。
形状特征有两类表示方法,一类是轮廓特征,另一类是区域特征,图像的轮廓特征主要针对物体的外边界,而图像的区域特征则关系到整个形状区域。
空间关系特征,是指图像中分割出来的多个目标之间的相互的空间位置或相对方向关系,这些关系也可分为连接/邻接关系、交叠/重叠关系和包含/包容关系等。通常空间位置信息可以分为两类:相对空间位置信息和绝对空间位置信息。前一种关系强调的是目标之间的相对情况,如上下左右关系等,后一种关系强调的是目标之间的距离大小以及方位。
需要说明的,上述列举的图像特征可以作为图像中具有的特征的一些举例,图像还可以具有其他特征,如更高层级的特征:语义特征,此处不再展开。
(7)图像/视频增强
图像/视频增强指的是对图像/视频所做的能够提高成像质量的动作。例如,增强处理包括超分、降噪、锐化或去马赛克等。
(8)峰值信噪比(Peak signal-to-noise ratio,PSNR)
一个表示信号最大可能功率和影响它的表示精度的破坏性噪声功率的比值的工程术语。峰值信噪比经常用作图像处理等领域中信号重建质量的测量方法,通常简单地通过均方误差进行定义。一般而言,PSNR越高,表征与真值的差距越小。
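作为示意，PSNR可以通过均方误差按如下方式计算（NumPy示例，假设像素最大值为255，仅为说明定义）：

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """通过均方误差 MSE 计算峰值信噪比 PSNR（单位 dB），PSNR 越高表示与真值差距越小。"""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # 两幅图像完全一致
    return 10.0 * np.log10(max_val ** 2 / mse)
```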
(9)结构相似性(Structural SIMilarity,SSIM)
是一种衡量两幅图像相似度的指标,范围为0到1。当两张图像一模一样时,SSIM的值等于1。
(10)感受野(Receptive Field)
在计算机视觉领域的深度神经网络领域的一个术语,用来表示神经网络内部的不同位置的神经元对原图像的感受范围的大小。神经元感受野的值越大,表示其能接触到的原始图像范围就越大,也意味着该神经元可能蕴含更为全局、语义层次更高的特征;而值越小,则表示其包含的特征越趋向于局部和细节。感受野的值可以大致用来判断每一层的抽象层次。
(11)事件相机(event cameras)
是一种生物启发的视觉传感器,以完全不同于标准相机的方式工作。事件相机不是以恒定速率输出强度图像帧,而是仅输出局部像素级亮度变化的相关信息。
(12)光流(optical flow)
表示的是相邻两帧图像中每个像素的运动速度和运动方向,是空间运动物体在观察成像平面上的像素运动的瞬时速度,是利用图像序列中像素在时间域上的变化以及相邻帧之间的相关性来找到上一帧跟当前帧之间存在的对应关系,从而计算出相邻帧之间物体的运动信息的一种方法。一般而言,光流是由于场景中前景目标本身的移动、相机的运动,或者两者的共同运动所产生的。
(13)场景流
与光流类似，但是不是严格意义上的光流。光流表征像素的瞬时速度，一般由同模态的特征求取，例如相邻的图像帧、不同RGB相机的图像。在本申请实施例中，可以表征的是事件信息和图像信息（例如灰度信息）两个不同模态信息的空间位置关系，使用场景流表示。
(14)Warp操作
一般配合流(如光流,以及本申请中的场景流)的操作,表征的是一个图像相对于流(如光流、场景流)进行的仿射变换,如旋转、移动、缩放等。
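作为示意，下面给出一个利用流对特征进行逐像素warp的常见实现草图（基于PyTorch的grid_sample；这里假设流张量的两个通道分别为x、y方向的像素位移，该约定以及函数接口均为示例性假设，并非本申请实施例的实际实现）：

```python
import torch
import torch.nn.functional as F

def warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """按照流 flow（形状 [N,2,H,W]，单位为像素）对特征 feat（形状 [N,C,H,W]）做逐像素的 warp。"""
    n, _, h, w = feat.shape
    # 构造基础坐标网格，通道顺序为 (x, y)
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)   # [2,H,W]
    coords = grid.unsqueeze(0) + flow                             # 加上流得到采样坐标
    # 归一化到 [-1, 1] 供 grid_sample 使用
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)         # [N,H,W,2]
    return F.grid_sample(feat, grid_norm, mode="bilinear",
                         padding_mode="border", align_corners=True)
```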
(15)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(16)反向传播算法
可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始模型中参数的大小,使得模型的误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始模型中的参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的模型参数,例如权重矩阵。
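下面以均方误差损失为例，给出一次前向传播、反向传播与参数更新的最小示例（基于PyTorch，网络结构与数据均为随机示意，与本申请实施例的具体网络无关）：

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()              # 损失函数：衡量预测值与目标值的差异

blur = torch.randn(2, 1, 64, 64)    # 模拟的输入
sharp = torch.randn(2, 1, 64, 64)   # 模拟的监督目标

pred = model(blur)                  # 前向传播
loss = loss_fn(pred, sharp)
optimizer.zero_grad()
loss.backward()                     # 误差反向传播，计算各层参数的梯度
optimizer.step()                    # 按梯度更新权重，使 loss 逐步减小
```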
下面结合图6对本申请实施例提供的系统架构进行详细的介绍。图6为本申请一实施例提供的系统架构示意图。如图6所示,系统架构500包括执行设备510、训练设备520、数据库530、客户设备540、数据存储系统550以及数据采集系统560。
执行设备510包括计算模块511、I/O接口512、预处理模块513和预处理模块514。计算模块511中可以包括目标模型/规则501,预处理模块513和预处理模块514是可选的。
数据采集设备560用于采集训练数据。其中，图像样本可以为低质量图像，监督图像为在模型训练前预先获取的图像样本对应的高质量图像。图像样本例如可以是低分辨率的图像，监督图像为高分辨率图像；或者，图像样本例如可以是包含雾气或噪声的图像，监督图像为去除了雾气或噪声的图像。在采集到训练数据之后，数据采集设备560将这些训练数据存入数据库530，训练设备520基于数据库530中维护的训练数据训练得到目标模型/规则501。
上述目标模型/规则501(例如本申请实施例中的包括场景流预测网络的模型)能够用于实现图像去噪任务,即,将待处理图像输入该目标模型/规则501,即可得到去噪后的图像。需要说明的是,在实际应用中,数据库530中维护的训练数据不一定都来自于数据采集设备560的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备520也不一定完全基于数据库530维护的训练数据进行目标模型/规则501的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备520训练得到的目标模型/规则501可以应用于不同的系统或设备中,如应用于图6所示的执行设备510,所述执行设备510可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备,车载终端等,还可以是服务器或者云端等。在图6中,执行设备510配置输入/输出(input/output,I/O)接口512,用于与外部设备进行数据交互,用户可以通过客户设备540向I/O接口512输入数据。
预处理模块513和预处理模块514用于根据I/O接口512接收到的输入数据进行预处理。应理解，可以没有预处理模块513和预处理模块514或者只有其中的一个预处理模块。当不存在预处理模块513和预处理模块514时，可以直接采用计算模块511对输入数据进行处理。
在执行设备510对输入数据进行预处理,或者在执行设备510的计算模块511执行计算等相关的处理过程中,执行设备510可以调用数据存储系统550中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统550中。
最后,I/O接口512将处理结果,如处理后得到的去噪图像呈现给客户设备540,从而提供给用户。
值得说明的是,训练设备520可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则501,该相应的目标模型/规则501即可以用于实现图像去噪任务,从而为用户提供所需的结果。
在图6所示情况下,用户可以手动给定输入数据,该“手动给定输入数据”可以通过I/O接口512提供的界面进行操作。另一种情况下,客户设备540可以自动地向I/O接口512发送输入数据,如果要求客户设备540自动发送输入数据需要获得用户的授权,则用户可以在客户设备540中设置相应权限。用户可以在客户设备540查看执行设备510输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备540也可以作为数据采集端,采集如图所示输入I/O接口512的输入数据及输出I/O接口512的输出结果作为新的样本数据,并存入数据库530。当然,也可以不经过客户设备540进行采集,而是由I/O接口512直接将如图所示输入I/O接口512的输入数据及输出I/O接口512的输出结果,作为新的样本数据存入数据库530。
值得注意的是,图6仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图6中,数据存储系统550相对执行设备510是外部存储器,在其它情况下,也可以将数据存储系统550置于执行设备510中。
下面介绍本申请实施例提供的一种芯片硬件结构。
图7为本申请一实施例提供的芯片硬件结构图,该芯片包括神经网络处理器700。该芯片可以被设置在如图6所示的执行设备510中,用以完成计算模块511的计算工作。该芯片也可以被设置在如图6所示的训练设备520中,用以完成训练设备520的训练工作并输出目标模型/规则501。如图6所示的模型中各层的算法均可在如图7所示的芯片中得以实现。
神经网络处理器(neural processing unit,NPU)700作为协处理器挂载到主中央处理单元(host central processing unit,host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路703,控制器704控制运算电路703提取存储器(权重存储器702或输入存储器701)中的数据并进行运算。
在一些实现中,运算电路703内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路703是二维脉动阵列。运算电路703还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路703是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路703从权重存储器702中取矩阵B相应的数据,并缓存在运算电路703中每一个PE上。运算电路703从输入存储器701中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)708中。
向量计算单元707可以对运算电路703的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元707可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现中,向量计算单元707能将经处理的输出的向量存储到统一存储器706。例如,向量计算单元707可以将非线性函数应用到运算电路703的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元707生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路703的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器706用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器(direct memory access controller,DMAC)705将外部存储器中的输入数据搬运到输入存储器701和/或统一存储器706、将外部存储器中的权重数据存入权重存储器702,以及将统一存储器706中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)710,用于通过总线实现主CPU、DMAC和取指存储器709之间进行交互。
与控制器704连接的取指存储器(instruction fetch buffer)709,用于存储控制器704使用的指令。
控制器704,用于调用取指存储器709中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器706、输入存储器701、权重存储器702以及取指存储器709均为片上(on-chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器 (double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
运动模糊一般发生在曝光时间内有明显运动的场景，特别是轻量级移动设备在低光环境下，例如手机和车载相机。虽然运动模糊导致不希望的图像退化，使得视觉内容变得不易解释，但运动模糊图像还编码关于照相机和观察场景之间相对运动的丰富信息。因此，从单个运动模糊图像中恢复（重建）清晰帧序列（photo-sequencing），有助于理解场景的动态，并在图像重建、自动驾驶和视频监控中具有广泛的应用。运动模糊的图像可以看作曝光时间内高清帧的平均值。由于平均会破坏帧的时间顺序，因此从单个运动模糊图像中恢复一组清晰的帧序列是高度不适定的，也就是说待恢复的序列不具有唯一性，可能存在不同高清帧的序列组成相同的运动模糊图像。
为了解决待恢复序列的不唯一性,引入了事件相机,事件相机可以提供时间序列帧间变化来引导序列的恢复。事件相机是生物启发、事件驱动、基于时间的神经形态视觉传感器,它以与传统相机截然不同的原理感知世界,通过异步工作测量亮度变化,一旦变化超过阈值,就会触发事件。事件相机摒弃了传统强度相机中曝光时间和帧等概念,能够以无帧模式(微秒时间分辨率)捕捉几乎连续的运动,因此不会遇到像模糊这样的问题。利用事件摄像机将非常有助于从模糊的图像中恢复清晰的帧。
通过事件信息的光流进行图像去模糊，核心思想是通过事件信息计算光流，使用该光流对模糊图像进行仿射变换（warp），并配合多种损失，实现曝光时间内任意时刻的图像去模糊。然而，由于事件信息和模糊图像属于两种不同的模态，当前没有表征两种场景不一致的度量方法，计算得到的光流并不精确，且存在像素级不对齐的问题。
为了解决上述问题,本申请提供了一种图像处理方法,该图像处理方法可以为模型训练的前馈过程,也可以为推理过程。
参照图8,图8为本申请实施例提供的一种图像处理方法的实施例示意,如图8示出的那样,本申请实施例提供的一种图像处理方法包括:
801、获取模糊图像的第一特征表示以及事件相机采集的事件数据的第二特征表示;所述第一特征表示和所述第二特征表示的尺寸一致。
本申请实施例中,步骤801的执行主体可以为终端设备,终端设备可以为便携式移动设备,例如但不限于移动或便携式计算设备(如智能手机)、个人计算机、服务器计算机、手持式设备(例如平板)或膝上型设备、多处理器系统、游戏控制台或控制器、基于微处理器的系统、机顶盒、可编程消费电子产品、移动电话、具有可穿戴或配件形状因子(例如,手表、眼镜、头戴式耳机或耳塞)的移动计算和/或通信设备、网络PC、小型计算机、大型计算机、包括上面的系统或设备中的任何一种的分布式计算环境等等。
本申请实施例中,步骤801的执行主体可以为云侧的服务器,服务器可以接收来自终端设备发送的模糊图像以及事件相机采集的事件数据,进而服务器可以获取到模糊图像和事件相机采集的事件数据。
在一种可能的实现中,所述模糊图像和所述事件数据为在相同时间段针对于同一场景采集得到的。例如,模糊图像可以为终端设备上的RGB相机采集的图像,事件数据为终端设备上的事件相机对同一场景采集的图像。
其中,模糊图像可以是多帧图像(曝光时间内得到的图像)的平均值,事件数据可以包括模糊图像对应的时间段内的事件点。也就是说,模糊图像可以是对已有的连续多帧图像进行平均化,从而得到一帧合成的模糊图像。
上述模糊图像对应的时间段就可以通过上述已有的连续多帧高清图像对应的时间来确定。该时间段可以是实际拍摄的时候,相机或摄像头的曝光时间,也就是说,曝光时间的时间段内由于被拍摄者的动作导致了模糊,产生了一帧模糊图像,这帧模糊图像对应于一段图像帧序列。举例说明,假设对T0-T1时刻之间的6帧连续图像取平均值得到了模糊图像B1,则模糊图像B1对应的时间段就是T0-T1。
事件数据可以包括多个事件点，事件点也可以称之为事件，事件相机的最基本的原理就是当某个像素的亮度变化累计达到触发条件（变化达到一定程度）后，输出一个事件点。所以一个事件点可以理解为是一次事件的表达：在什么时间（时间戳），哪个像素点（像素坐标），发生了亮度的增加或减小（亮度变化）。
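为便于理解“模糊图像为曝光时间内多帧的平均”以及“亮度变化超过阈值触发事件点”这两点，下面给出一个简化的仿真代码草图（仅为示意，对数亮度、阈值与触发方式均为假设的简化模型，并非真实事件相机或本申请实施例的完整实现）：

```python
import numpy as np

def synthesize_blur_and_events(frames: np.ndarray, threshold: float = 0.2):
    """frames: [T,H,W] 的灰度帧序列（取值 0~1）。
    模糊图像取曝光时间内各帧的平均值；当某像素的对数亮度变化累计超过阈值时触发一个事件点。"""
    blur = frames.mean(axis=0)                      # 曝光时间内多帧的平均 -> 一帧模糊图像
    events = []                                     # 每个事件点: (时间索引, y, x, 极性)
    log_ref = np.log(frames[0] + 1e-6)              # 参考对数亮度
    for t in range(1, frames.shape[0]):
        log_cur = np.log(frames[t] + 1e-6)
        diff = log_cur - log_ref
        ys, xs = np.where(np.abs(diff) >= threshold)
        for y, x in zip(ys, xs):
            events.append((t, y, x, 1 if diff[y, x] > 0 else -1))
            log_ref[y, x] = log_cur[y, x]           # 触发后更新该像素的参考亮度
    return blur, events
```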
在一种可能的实现中,模糊图像可以为灰度图像,其尺寸为H*W,其中H表示图像的高度,W表示图像的宽度,通过特征提取网络(例如多个卷积层),可以提取到模糊图像的图像特征FB(例如本申请实施例中的第一特征表示)。
在一种可能的实现中,事件数据(或者称之为事件信息)可以通过特征提取网络(例如多个卷积层),提取到事件特征FE(例如本申请实施例中的第二特征表示)。需要说明的是,某个时刻的事件信息的空间分辨率可以同模糊图像一致,为H*W;但这里输入的是模糊图像曝光时间内的全部事件信息,包含M个通道,那么事件信息输入为H*W*M,其中M表征事件信息的个数。
通过特征提取网络得到的第一特征表示和第二特征表示可以为尺寸相同的特征表示。其中,这里的特征表示的“尺寸”可以理解为特征表示的宽和高。
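下面给出通过卷积层分别提取模糊图像特征与事件特征、并保证两者宽高一致的一个示意代码（基于PyTorch，通道数M与网络结构均为示例性假设，仅用于说明“第一特征表示与第二特征表示尺寸一致”这一点）：

```python
import torch
import torch.nn as nn

H, W, M = 128, 128, 16   # M 为曝光时间内事件信息的通道数（个数），取值仅为示意

blur_feat_net = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 32, 3, padding=1))
event_feat_net = nn.Sequential(nn.Conv2d(M, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 32, 3, padding=1))

blur = torch.randn(1, 1, H, W)      # 灰度模糊图像 B，尺寸 H*W
events = torch.randn(1, M, H, W)    # 曝光时间内的事件信息 E，尺寸 H*W*M

F_B = blur_feat_net(blur)           # 第一特征表示
F_E = event_feat_net(events)        # 第二特征表示
assert F_B.shape[-2:] == F_E.shape[-2:]   # 两个特征表示的宽高一致
```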
应理解,通过N个串联连接的特征嵌套块,处理模糊图像的特征表示以及事件数据的特征表示,其中,第一特征表示和第二特征表示可以为输入到N个串联连接的特征嵌套块中的某一个特征嵌套块的特征表示,若第一特征表示和第二特征表示可以为输入到第1个特征嵌套块的特征,则第一特征表示和第二特征表示可以为通过特征提取网络对所述模糊图像和所述事件数据提取的特征表示,若第一特征表示和第二特征表示可以为输入到第n(n大于1)个特征嵌套块的特征,则第一特征表示和第二特征表示可以为第n-1个特征嵌套块输出的特征表示。
802、根据所述模糊图像的第一特征表示以及所述事件数据的第二特征表示,通过场景流预测网络,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流,所述第一场景流和所述第一特征表示的尺寸一致,所述第一场景流中的每个像素特征指示由所述第一特征表示中对应像素位置的像素特征到所述第二特征表示中对应像素位置的像素特征的运动信息,所述第二场景流和所述第二特征表示的尺寸一致,所述第二场景流中的每个像素特征指示由所述第二特征表示中对应像素位置的像素特征到所述第一特征表示中对应像素位置的像素特征的运动信息。
在一种可能的实现中,场景流预测网络可以为上述介绍的一个特征嵌套块中包括的网络,所述模糊图像的第一特征表示以及所述事件数据的第二特征表示可以输入到场景流预测网络中。
在一种可能的实现中,可以将所述模糊图像的第一特征表示以及所述事件数据的第二特征表示输入到场景流预测网络中,以得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流。
接下来介绍本申请实施例中的场景流预测网络:
在一种可能的实现中,所述场景流预测网络可以包括第一编码模块、第二编码模块、融合模块、第一解码模块以及第二解码模块;所述根据所述模糊图像的第一特征表示以及所述事件数据的第二特征表示,通过场景流预测网络,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流,可以包括:根据所述第一特征表示,通过所述第一编码模块,得到第一编码结果;根据所述第二特征表示,通过所述第二编码模块,得到第二编码结果;根据所述第一编码结果和所述第二编码结果,通过所述融合模块,得到融合结果;根据所述融合结果,分别通过所述第一解码模块和所述第二解码模块,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流。
在一种可能的实现中,所述融合模块用于基于注意力机制实现所述第一编码结果和所述第二编码结果的第一融合。
其中，场景流预测网络也可以称之为多尺度双向场景流网络（例如图9中示出的多尺度双向场景流预测2i10）。其中，场景流预测网络可以为“两输入-两输出”的网络，具体结构可以见图10的示意。对于输入的第一特征表示和第二特征表示，首先可以分别通过独立的编码encoder网络（例如第一编码模块和第二编码模块，第一编码模块用于处理第一特征表示，第二编码模块用于处理第二特征表示）来提取特征；然后通过融合模块，来实现模糊图像特征和事件信息特征的输入的融合（例如基于attention模块实现融合，该attention模块可以产生注意力特征，该注意力特征用于实现模糊图像特征和事件数据特征的融合）；融合后的特征可以通过独立的解码decoder网络（例如第一解码模块和第二解码模块，第一解码模块用于生成第一场景流，第二解码模块用于生成第二场景流）分别生成对应的场景流（scene flow），即第一场景流和第二场景流。
本申请实施例中，第一场景流可以表征模糊图像特征与事件信息特征的对齐关系，第二场景流可以表征事件信息特征与模糊图像特征的对齐关系。
其中,模糊图像的特征表示和事件数据的特征表示不是同一个模态的信息,若对模糊图像的特征表示和事件数据的特征表示直接进行融合,得到的融合结果是不准确的,本申请实施例中,首先通过两个不同的编码模块分别对模糊图像的特征表示和事件数据的特征表示进行编码,使其转换为类似于同一模态的数据,并对编码结果进行融合,进而可以得到准确的融合结果。
在一种可能的实现中,所述第一场景流和所述第一特征表示的尺寸(例如特征表示的宽和高)一致,所述第一场景流中的每个像素特征指示由所述第一特征表示中对应像素位置的像素特征到所述第二特征表示中对应像素位置的像素特征的运动信息,所述第二场景流和所述第二特征表示的尺寸一致,所述第二场景流中的每个像素特征指示由所述第二特征表示中对应像素位置的像素特征到所述第一特征表示中对应像素位置的像素特征的运动信息。
其中,这里的像素特征可以指空间位置(x,y)的一个点,可能包含多个通道。
其中,运动信息可以表示为二维瞬时速度场,其中的二维速度矢量是景物中可见点的三维速度矢量在成像表面的投影。
需要说明的是,本申请实施例中的场景流与光流类似,每个像素位置的信息是带方向的矢量。
需要说明的是，对于encoder、attention、decoder的具体构造，本专利不做约束。可选的，编码模块encoder（例如上述介绍的第一编码模块和第二编码模块）可以是连续的卷积和下采样；融合模块可以是常见的空间注意力spatial attention结构或者通道注意力channel attention结构；解码模块decoder（例如上述介绍的第一解码模块和第二解码模块）可以是连续的上采样和卷积。
通过场景流预测可以实现模糊图像特征和事件数据特征之间的像素级对齐。
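为便于理解上述“两输入-两输出”的场景流预测网络结构（独立编码、基于注意力的融合、独立解码），下面给出一个基于PyTorch的示意性代码草图，其中的通道数、注意力形式、上/下采样方式等均为示例性假设，并非本申请实施例的实际实现：

```python
import torch
import torch.nn as nn

class SceneFlowNet(nn.Module):
    """两输入-两输出的场景流预测网络草图：独立编码 -> 注意力融合 -> 独立解码。"""
    def __init__(self, c: int = 32):
        super().__init__()
        # 两个独立的编码模块：连续的卷积和下采样
        self.enc_b = nn.Sequential(nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU())
        self.enc_e = nn.Sequential(nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU())
        # 融合模块：此处用简化的通道注意力
        self.att = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * c, 2 * c, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * c, c, 1)
        # 两个独立的解码模块：上采样和卷积，各输出 2 通道的场景流
        self.dec_b = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                                   nn.Conv2d(c, 2, 3, padding=1))
        self.dec_e = nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                                   nn.Conv2d(c, 2, 3, padding=1))

    def forward(self, f_b: torch.Tensor, f_e: torch.Tensor):
        e_b, e_e = self.enc_b(f_b), self.enc_e(f_e)          # 第一编码结果、第二编码结果
        cat = torch.cat([e_b, e_e], dim=1)
        fused = self.fuse(cat * self.att(cat))               # 基于注意力机制的第一融合
        return self.dec_b(fused), self.dec_e(fused)          # 第一场景流、第二场景流

# 用法示例：输入两个宽高一致的特征表示，输出两个与之同宽高的场景流
flow_b, flow_e = SceneFlowNet(32)(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64))
```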
803、根据所述第一场景流,对第一特征表示进行仿射变换(warp),得到第三特征表示。
804、根据所述第二场景流,对第二特征表示进行仿射变换,得到第四特征表示;所述第三特征表示和所述第四特征表示用于对所述模糊图像进行去模糊处理。
如图9所示,为多尺度双向场景流对齐内部实现图,具体包括多尺度双向场景流预测2i10、模糊图像特征warp操作2i11、事件信息特征warp操作2i12。
其中，当获取两个场景流（第一场景流和第二场景流）之后，即可与对应的特征进行交叉warp操作，得到warp之后的特征（第三特征表示和第四特征表示）。示例性的，可以如公式1，其中warp(*)是传统的pixel-to-pixel的空间warp操作，为一个非可学习的算子。
公式1：$F_{B}^{w}=\mathrm{warp}\left(F_{B},S_{B}\right)$，$F_{E}^{w}=\mathrm{warp}\left(F_{E},S_{E}\right)$，其中$F_{B}$、$F_{E}$分别为第一、第二特征表示，$S_{B}$、$S_{E}$分别为第一、第二场景流，$F_{B}^{w}$、$F_{E}^{w}$分别为第三、第四特征表示。
本申请实施例中,通过多尺度双向场景流对齐可以实现模糊图像特征和事件特征之间的像素级对齐。这种网络结构,配合attention结构以及warp操作,可以从事件信息中获取不同粒度水平的精细的信息,有利于抽取出清晰的纹理结构,方便进行模糊图像的去模糊。
对于RGB相机采集的图像来说，由于其是曝光时间内采集的多帧图像通过融合得到的，融合后的图像相比于事件数据可能会丢失部分信息（事件相机采集的事件数据包括在曝光时间内采集的多帧事件数据），例如在采集某一场景的图像时，场景中的某一个对象在曝光时间内被遮挡了，在RGB相机采集的模糊图像中，该对象的图像数据是无效的，而该部分对象的信息在事件数据中是有效的。类似的，由于事件数据是在像素位置的点的亮度变化大于阈值时才会标识出来，因此部分事件数据可能会无效。因此，若直接使用图像数据或者事件数据中原本无效（或者称之为被遮挡）区域的信息，则会出现图像质量下降导致的伪影。
在一种可能的实现中，针对于模糊图片中被遮挡区域，可以根据所述第四特征表示和所述第一特征表示，识别第二遮挡区域（例如可以使用轻量级网络（如连续的卷积和残差）来实现遮挡区域的确定），其中所述模糊图像中所述第二遮挡区域的图像数据在所述事件数据的所述第二遮挡区域中有效；将所述第一特征表示中除所述第二遮挡区域之外的特征表示和所述第四特征表示中所述第二遮挡区域的特征表示进行第二融合，以得到第二融合特征表示。也就是说，模糊图片的某些特征是遮挡的，可以使用事件数据中的信息替换这部分被遮挡的特征，进而得到更准确的特征表示。
在一种可能的实现中，所述第二遮挡区域可以通过第二掩膜mask表示，所述第二mask和所述第四特征表示的尺寸一致，所述第二mask中的每个像素用于指示所述第一特征表示中对应位置的像素特征是否在所述模糊图像中有效。例如，可以在第二mask中利用0和1来标识对应位置的像素特征是否在所述模糊图像中有效，例如0表示无效，1表示有效。
在一种可能的实现中,针对于事件数据中被遮挡的区域,可以根据所述第三特征表示和所述第二特征表示,确定第一遮挡区域(例如可以使用轻量级网络(如连续的卷积和残差)来实现遮挡区域的确定),其中所述事件数据中所述第一遮挡区域的图像数据在所述模糊图像的所述第一遮挡区域中有效;将所述第二特征表示中除所述第一遮挡区域之外的特征表示和所述第三特征表示中所述第一遮挡区域的特征表示进行第二融合,以得到第一融合特征表示。也就是说,事件信息的某些特征是遮挡的,可以使用模糊图片中的信息替换这部分被遮挡的特征,进而得到更准确的特征表示。
在一种可能的实现中,所述第一遮挡区域通过第一掩膜mask表示,所述第一mask和所述第三特征表示的尺寸一致,所述第一mask中的每个像素用于指示所述第三特征表示中对应位置的像素特征是否在所述事件数据中有效。例如,可以在第一mask中利用0和1来标识对应位置的像素特征是否在事件数据中有效,例如0表示无效,1表示有效。
在一种可能的实现中,所述第二融合为对应像素位置的相加运算。
通过上述方式,通过设置显式的遮挡感知特征融合,可以实现对模糊图片中遮挡区域进行处理,从而降低遮挡区域产生的伪影问题。
在一种可能的实现中,可以通过N个串联连接的特征嵌套块,处理模糊图像的特征表示以及事件数据的特征表示,以得到用于进行去模糊处理的处理结果;其中,每个所述特征嵌套块用于执行上述描述中的图像处理方法,第1个特征嵌套块用于获取到通过特征提取网络对所述模糊图像和所述事件数据提取的特征表示,第n个特征嵌套块用于获取到第n-1个特征嵌套块输出的特征表示,所述n小于N。
在一种可能的实现中,第N个所述特征嵌套块输出的特征表示用于和所述通过特征提取网络对所述模糊图像提取的特征表示进行融合,得到残差信息,所述残差信息用于和所述模糊图像进行融合以实现所述模糊图像的去模糊处理。
在一种可能的实现中,参照图9,对于每个场景流引导双特征嵌套块2i0,均包含两个对称的遮挡感知特征融合(事件信息特征遮挡感知特征融合2i2,模糊图像特征遮挡感知特征融合2i3)。在这以模糊特征遮挡感知特征融合2i3为例,进行详细的说明。其内部结构如图11,包含遮挡区掩码生成2i30、遮挡区域特征生成2i31、特征融合2i32。
示例性的，遮挡区掩码生成2i30：接收由模糊图像特征warp操作之后生成的特征（简称warp特征）、事件信息特征（简称自特征）作为输入，通过轻量级网络（如连续的卷积和残差）自适应生成同分辨率的遮挡区掩码mask MB（可选的，掩码可以采用One-hot编码，其数值只能是0和1），表征warp之后的模糊图像特征与事件信息特征之间的关联性。当mask=0，表征该区域特征在warp之后的特征（如模糊图像特征warp操作之后生成的特征）中是遮挡的，倾向于使用原始的自特征（如事件信息特征）；反之当mask=1，表征该区域特征在原始的自特征中是遮挡的，倾向于使用warp特征。
其中，遮挡区域特征生成2i31：当获取遮挡mask之后，即可以与warp特征进行点乘操作，得到遮挡处理之后的特征（2i3模块与其对偶模块2i2分别生成各自对应的遮挡处理之后的特征），表征遮挡效果处理之后的可见特性。
其中，特征融合2i32：接收遮挡处理之后的特征和原始自特征作为输入，通过concatenate（通道拼接）、卷积等操作生成最终融合之后的特征，表征在事件信息特征中通过遮挡处理之后融入了模糊图像特征。
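为便于理解上述遮挡感知特征融合的处理流程（掩码生成、点乘、拼接融合），下面给出一个示意性的代码草图（基于PyTorch，网络结构、通道数与掩码的连续取值均为假设的简化，仅用于说明原理，并非本申请实施例的实际实现）：

```python
import torch
import torch.nn as nn

class OcclusionAwareFusion(nn.Module):
    """遮挡感知特征融合草图：轻量级网络生成遮挡掩码 -> 掩码与 warp 特征点乘 -> 与自特征拼接卷积融合。"""
    def __init__(self, c: int = 32):
        super().__init__()
        self.mask_net = nn.Sequential(nn.Conv2d(2 * c, c, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(c, 1, 3, padding=1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * c, c, 3, padding=1)

    def forward(self, self_feat: torch.Tensor, warp_feat: torch.Tensor) -> torch.Tensor:
        # 生成与特征同分辨率的掩码；实际使用中可进一步离散化为 0/1 的 one-hot 编码
        mask = self.mask_net(torch.cat([self_feat, warp_feat], dim=1))
        occluded = warp_feat * mask                                   # 遮挡处理之后的特征
        return self.fuse(torch.cat([self_feat, occluded], dim=1))    # 通道拼接 + 卷积得到融合特征
```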
在一种可能的实现中,第N个所述特征嵌套块输出的特征表示用于和所述通过特征提取网络对所述模糊图像提取的特征表示进行融合,得到残差信息,所述残差信息用于和所述模糊图像进行融合以实现所述模糊图像的去模糊处理。
参照图12，可以通过场景流引导的双特征嵌套200接收模糊图像特征FB和事件特征FE作为输入，经过N次场景流引导的双特征嵌套块处理之后，生成嵌套之后的模糊图像特征和事件特征；通过全局特征融合300接收嵌套之后的模糊图像特征、事件特征以及原始的模糊特征FB作为输入，通过卷积（或加法、或拼接）等操作，生成融合之后的混合特征Fmix；通过求和操作400接收混合特征Fmix和原始输入模糊图像B作为输入，通过加法操作生成最终去模糊之后的清晰结果O。
更具体的,参照图12,在训练Training阶段,可以针对给定的配对数据集(input=[B,E],output=O),使用相关的损失函数(本申请实施例可以使用MSE loss,perception loss等损失函数)进行训练,最终得到模糊图像特征提取100、事件信息特征提取101、场景流引导的双特征嵌套200、全局特征融合300等可训练的参数。
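以下给出训练阶段的一个极简示意代码（假设已构建好整体模型model、优化器optimizer与给出配对数据(B,E,O)的数据加载器train_loader，这些名称与接口均为示例性假设，仅用于说明训练流程）：

```python
import torch.nn as nn

def train_one_epoch(model, train_loader, optimizer):
    mse = nn.MSELoss()
    for blur, events, sharp_gt in train_loader:   # 配对数据：input=[B, E]，output=O
        pred = model(blur, events)
        loss = mse(pred, sharp_gt)                # 也可在此基础上叠加感知损失 perception loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```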
在推理阶段,可以接收给定的输入图像input=[B,E],经过模糊图像特征提取100、事件信息特征提取101、场景流引导的双特征嵌套200、全局特征融合300、求和操作400等作用之后,生成最终的增强结果O。
具体流程为:
模糊图像特征提取100,对于给定的输入模糊图像B(一般为灰度图像,其尺寸为H*W,其中H表示图像的高度,W表示图像的宽度),通过多个卷积层,提取到模糊图像特征FB
事件信息特征提取101，对于给定的输入事件信息E，通过多个卷积层，提取到事件特征FE。需要说明的是，某个时刻的事件信息的空间分辨率同模糊图像一致，为H*W；但这里输入的是模糊图像曝光时间内的全部事件信息，包含M个通道，那么事件信息输入为H*W*M，其中M表征事件信息的个数。
场景流引导的双特征嵌套200，接收模糊图像特征FB和事件特征FE作为输入，经过N次场景流引导的双特征嵌套块处理之后，生成嵌套之后的模糊图像特征和事件特征。对于第i个场景流引导的双特征嵌套（Scene flow guided Dual Feature Embedding）处理，可以表示为如下公式2：
公式2：$\left[F_{B}^{i},F_{E}^{i}\right]=\mathrm{DFE}_{i}\left(F_{B}^{i-1},F_{E}^{i-1}\right)$
其中$F_{B}^{0}=F_{B}$，$F_{E}^{0}=F_{E}$，$1\le i\le N$，$\mathrm{DFE}_{i}$表示第i个场景流引导的双特征嵌套块。
对于第i次场景流引导的双特征嵌套处理，输入为前一次处理之后的模糊图像特征信息和事件信息特征，经过多尺度双向场景流对齐2i1、事件信息特征遮挡感知特征融合2i2、模糊图像信息遮挡感知特征融合2i3等处理之后，生成本次处理之后的模糊图像特征信息和事件信息特征。
其中,多尺度场景流对齐2i1内部包含多尺度双向场景流预测2i10、模糊图像特征warp操作2i11、事件信息特征warp操作2i12,详细操作参考上述介绍的多尺度双向场景流对齐(multi-scale dual scene flow alignment)2i1。
另外，在事件信息特征遮挡感知特征融合2i2中，接收模糊图像特征和warp之后的事件信息特征，首先经过由轻量级卷积构成的遮挡掩码生成模块2i20处理，生成同输入特征分辨率相同的one-hot编码的掩码ME（编码为0表征该区域特征在事件信息中是遮挡的，倾向于使用模糊图像特征；反之编码为1表征该区域特征在模糊图像中是遮挡的，倾向于使用事件信息特征）；之后，warp之后的事件信息特征与遮挡掩码ME进行点乘操作，得到遮挡处理之后的特征；最后，模糊图像特征与遮挡处理之后的特征进行通道融合和卷积操作，生成融合之后的模糊图像特征。
类似地，在模糊图像信息遮挡感知特征融合2i3中，接收事件信息特征和warp之后的模糊图像特征，经过遮挡区掩码生成2i30、遮挡区域特征生成2i31、特征融合2i32等模块处理之后，生成融合之后的事件信息特征。
不断重复上面的过程，经过N次场景流引导的双特征嵌套处理之后，最终生成嵌套之后的模糊图像特征和事件信息特征。
全局特征融合300，接收嵌套之后的模糊图像特征、事件特征以及原始的模糊特征FB作为输入，通过卷积（或加法、或拼接）等操作，生成融合之后的混合特征Fmix。
求和操作400，接收混合特征Fmix和原始输入模糊图像B作为输入，通过加法操作生成最终去模糊之后的清晰结果O。
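上述具体流程（特征提取、N次场景流引导的双特征嵌套、全局特征融合、求和）的数据流向可以用如下示意性代码概括（其中feat_b_net、feat_e_net、blocks、global_fuse等均为假设的模块接口，仅用于说明各步骤之间的连接方式，并非本申请实施例的实际实现）：

```python
import torch

def deblur_forward(blur, events, feat_b_net, feat_e_net, blocks, global_fuse):
    """整体前馈流程草图：特征提取 -> N 个场景流引导的双特征嵌套块 -> 全局特征融合 -> 与模糊图像相加。"""
    f_b, f_e = feat_b_net(blur), feat_e_net(events)        # 模糊图像特征 FB、事件特征 FE
    nb, ne = f_b, f_e
    for block in blocks:                                   # N 次场景流引导的双特征嵌套
        nb, ne = block(nb, ne)
    # 全局特征融合：拼接嵌套后的两路特征与原始模糊特征；global_fuse 的输出通道数需与模糊图像一致
    f_mix = global_fuse(torch.cat([nb, ne, f_b], dim=1))   # 混合特征（可视为残差信息）
    return blur + f_mix                                    # 与原始模糊图像相加，得到去模糊结果 O
```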
在开源数据集上测试,本专利方法与已有的方法相比,具有更好的PSNR/SSIM(越大越好)。其中单帧PSNR相对于已有的SOTA PSNR提升2.9dB,多帧去模糊提升了2.7dB,具体见表1。
表1
此外，参照图13，本专利方法相比于几个基准方法具有更清晰、更锐利的结果，与GT更加接近，同时没有伪影。
另外，通过消融实验可以看到本发明所述双特征嵌套、多尺度双向场景流预测、遮挡感知特征融合均具有正面作用。其中多尺度双向场景流预测效果最明显，可以使得PSNR提升2个多dB。
表2为消融实验结果。其中,表2中的D表示没有本发明所述的多尺度双向场景流预测的双特征嵌套,MSE表示本发明所述的多尺度场景流预测,OAFF表示遮挡感知特征融合。
表2
进一步地，参照图14，对场景流以及遮挡感知的掩码进行可视化，图14中的(a)为模糊图像，可以看到图14中的(b)中某些变化较大的场景流在图14中的(c)中使用掩码进行了很好的标注，它表征的正是一些遮挡区域。通过遮挡区域的额外处理，最终生成的结果图14中的(d)可以很好地处理遮挡，生成清晰的结果。
本申请实施例提供了一种图像处理方法,包括:获取模糊图像的第一特征表示以及事件相机采集的事件数据的第二特征表示;所述第一特征表示和所述第二特征表示的尺寸一致;根据所述模糊图像的第一特征表示以及所述事件数据的第二特征表示,通过场景流预测网络,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流,所述第一场景流和所述第一特征表示的尺寸一致,所述第一场景流中的每个像素特征指示由所述第一特征表示中对应像素位置的像素特征到所述第二特征表示中对应像素位置的像素特征的运动信息,所述第二场景流和所述第二特征表示的尺寸一致,所述第二场景流中的每个像素特征指示由所述第二特征表示中对应像素位置的像素特征到所述第一特征表示中对应像素位置的像素特征的运动信息;根据所述第一场景流,对第一特征表示进行仿射变换(warp),得到第三特征表示;根据所述第二场景流,对第二特征表示进行仿射变换,得到第四特征表示;所述第三特征表示和所述第四特征表示用于对所述模糊图像进行去模糊处理。通过多尺度双向场景流的对齐,可以实现模糊图像特征与事件信息特征的精细化对齐,得到精确的场景流信息,从而解决已有基于事件信息进行去模糊方法中对像素级对齐考虑欠妥的问题。
此外,本申请实施例提供了一种图像处理系统,图像处理系统可以包括用户设备以及数据处理设备。其中,用户设备包括手机、个人电脑或者信息处理中心等智能终端。用户设备为图像处理的发起端,作为图像增强请求的发起方,通常由用户通过用户设备发起请求。
上述数据处理设备可以是云服务器、网络服务器、应用服务器以及管理服务器等具有数据处理功能的设备或服务器。数据处理设备通过交互接口接收来自智能终端的图像增强请求,再通过存储数据的存储器以及数据处理的处理器环节进行机器学习,深度学习,搜索,推理,决策等方式的图像处理。数据处理设备中的存储器可以是一个统称,包括本地存储以及存储历史数据的数据库,数据库可以在数据处理设备上,也可以在其它网络服务器上。
用户设备可以接收用户的指令,例如用户设备可以获取用户输入/选择的一张图像,然后向数据处理设备发起请求,使得数据处理设备针对用户设备得到的该图像执行图像增强处理应用(例如图像超分辨率重构、图像去噪、图像去雾、图像去模糊以及图像对比度增强等),从而得到针对该图像的对应的处理结果。示例性的,用户设备可以获取用户输入的一张图像,然后向数据处理设备发起图像去噪请求,使得数据处理设备对该图像进行图像去噪,从而得到去噪后的图像。
数据处理设备可以执行本申请实施例的图像处理方法。
可选的,用户设备可以直接作为数据处理设备,该用户设备能够直接获取来自用户的输入并直接由用户设备本身的硬件进行处理。用户设备可以接收用户的指令,例如用户设备可以获取用户在用户设备中所选择的一张图像,然后再由用户设备自身针对该图像执行图像处理应用(例如图像超分辨率重构、图像去噪、图像去雾、图像去模糊以及图像对比度增强等),从而得到针对该图像的对应的处理结果。此时,用户设备自身就可以执行本申请实施例的图像处理方法。
接下来从装置的角度介绍本申请实施例提供的一种图像处理装置,参照图15,图15为本申请实施例提供的一种图像处理装置的结构示意,如图15所示,本申请实施例提供的一种图像处理装置1500包括:
获取模块1501,用于获取模糊图像的第一特征表示以及事件相机采集的事件数据的第二特征表示;所述第一特征表示和所述第二特征表示的尺寸一致。
其中,关于获取模块1501的具体描述可以参照上述实施例中步骤801的描述,这里不再赘述。
场景流预测模块1502,用于根据所述模糊图像的第一特征表示以及所述事件数据的第二特征表示,通过场景流预测网络,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流,所述第一场景流和所述第一特征表示的尺寸一致,所述第一场景流中的每个像素特征指示由所述第一特征表示中对应像素位置的像素特征到所述第二特征表示中对应像素位置的像素特征的运动信息,所述第二场景流和所述第二特征表示的尺寸一致,所述第二场景流中的每个像素特征指示由所述第二特征表示中对 应像素位置的像素特征到所述第一特征表示中对应像素位置的像素特征的运动信息;
其中,关于场景流预测模块1502的具体描述可以参照上述实施例中步骤802的描述,这里不再赘述。
仿射变换模块1503,用于根据所述第一场景流,对第一特征表示进行仿射变换,得到第三特征表示;
根据所述第二场景流,对第二特征表示进行仿射变换,得到第四特征表示;所述第三特征表示和所述第四特征表示用于对所述模糊图像进行去模糊处理。
其中,关于仿射变换模块1503的具体描述可以参照上述实施例中步骤803和步骤804的描述,这里不再赘述。
在一种可能的实现中,所述模糊图像和所述事件数据为在相同时间段针对于同一场景采集得到的。
在一种可能的实现中,所述场景流预测网络包括第一编码模块、第二编码模块、融合模块、第一解码模块以及第二解码模块;
所述场景流预测模块,具体用于:
根据所述第一特征表示,通过所述第一编码模块,得到第一编码结果;
根据所述第二特征表示,通过所述第二编码模块,得到第二编码结果;
根据所述第一编码结果和所述第二编码结果,通过所述融合模块,得到融合结果;
根据所述融合结果,分别通过所述第一解码模块和所述第二解码模块,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流。
其中,模糊图像的特征表示和事件数据的特征表示不是同一个模态的信息,若对模糊图像的特征表示和事件数据的特征表示直接进行融合,得到的融合结果是不准确的,本申请实施例中,首先通过两个不同的编码模块分别对模糊图像的特征表示和事件数据的特征表示进行编码,使其转换为类似同一模态的数据,并对编码结果进行融合,进而可以得到准确的融合结果。
在一种可能的实现中,所述装置还包括:
遮挡区域识别模块,用于根据所述第四特征表示和所述第一特征表示,识别第二遮挡区域,其中所述模糊图像中所述第二遮挡区域的图像数据在所述事件数据的所述第二遮挡区域中有效;
将所述第一特征表示中除所述第二遮挡区域之外的特征表示和所述第四特征表示中所述第二遮挡区域的特征表示进行第二融合,以得到第二融合特征表示。
在一种可能的实现中,所述第二遮挡区域通过第二掩膜mask表示,所述第二mask和所述第四特征表示的尺寸一致,所述第二mask中的每个像素用于指示所述第一特征表示中对应位置的像素特征是否在所述模糊图像中有效。
在一种可能的实现中,所述装置还包括:
遮挡区域识别模块,用于根据所述第三特征表示和所述第二特征表示,确定第一遮挡区域,其中所述事件数据中所述第一遮挡区域的图像数据在所述模糊图像的所述第一遮挡区域中有效;
将所述第二特征表示中除所述第一遮挡区域之外的特征表示和所述第三特征表示中所述第一遮挡区域的特征表示进行第二融合,以得到第一融合特征表示。
在一种可能的实现中,所述第一遮挡区域通过第一掩膜mask表示,所述第一mask和所述第三特征表示的尺寸一致,所述第一mask中的每个像素用于指示所述第三特征表示中对应位置的像素特征是否在所述事件数据中有效。
通过上述方式,通过设置显式的遮挡感知特征融合,可以实现对模糊图片中遮挡区域进行处理,从而降低遮挡区域产生的伪影问题。
在一种可能的实现中,所述第二融合为对应像素位置的相加运算。
在一种可能的实现中,所述装置还包括:特征嵌套模块,用于通过N个串联连接的特征嵌套块,处理模糊图像的特征表示以及事件数据的特征表示,以得到用于进行去模糊处理的处理结果;其中,每个所述特征嵌套块用于执行上述描述中的图像处理方法,第1个特征嵌套块用于获取到通过特征提取网络对所述模糊图像和所述事件数据提取的特征表示,第n个特征嵌套块用于获取到第n-1个特征嵌套块输出的特征表示,所述n小于N。
在一种可能的实现中，所述第N个所述特征嵌套块输出的特征表示用于和所述通过特征提取网络对所述模糊图像提取的特征表示进行融合，得到残差信息，所述残差信息用于和所述模糊图像进行融合以实现所述模糊图像的去模糊处理。
接下来介绍本申请实施例提供的一种执行设备，请参阅图16，图16为本申请实施例提供的执行设备的一种结构示意图，执行设备1600具体可以表现为手机、平板、笔记本电脑、智能穿戴设备、服务器等，此处不做限定。其中，执行设备1600实现图8对应实施例中图像处理方法的功能。具体的，执行设备1600包括：接收器1601、发射器1602、处理器1603和存储器1604（其中执行设备1600中的处理器1603的数量可以是一个或多个），其中，处理器1603可以包括应用处理器16031和通信处理器16032。在本申请的一些实施例中，接收器1601、发射器1602、处理器1603和存储器1604可通过总线或其它方式连接。
存储器1604可以包括只读存储器和随机存取存储器,并向处理器1603提供指令和数据。存储器1604的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器1604存储有处理器和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。
处理器1603控制执行设备的操作。具体的应用中,执行设备的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器1603中,或者由处理器1603实现。处理器1603可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器1603中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1603可以是通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器、以及视觉处理器(vision processing unit,VPU)、张量处理器(tensor processing unit,TPU)等适用于AI运算的处理器,还可进一步包括专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。该处理器1603可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1604,处理器1603读取存储器1604中的信息,结合其硬件完成上述实施例中步骤801至步骤804的步骤。
接收器1601可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器1602可用于通过第一接口输出数字或字符信息;发射器1602还可用于通过第一接口向磁盘组发送指令,以修改磁盘组中的数据;发射器1602还可以包括显示屏等显示设备。
本申请实施例还提供了一种训练设备,请参阅图17,图17是本申请实施例提供的训练设备一种结构示意图,具体的,训练设备1700由一个或多个服务器实现,训练设备1700可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1717(例如,一个或一个以上处理器)和存储器1732,一个或一个以上存储应用程序1742或数据1744的存储介质1730(例如一个或一个以上海量存储设备)。其中,存储器1732和存储介质1730可以是短暂存储或持久存储。存储在存储介质1730的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对训练设备中的一系列指令操作。更进一步地,中央处理器1717可以设置为与存储介质1730通信,在训练设备1700上执行存储介质1730中的一系列指令操作。
训练设备1700还可以包括一个或一个以上电源1726,一个或一个以上有线或无线网络接口1750,一个或一个以上输入输出接口1758;或,一个或一个以上操作系统1741,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
具体的,训练设备可以进行上述实施例中步骤801至步骤804的步骤。
本申请实施例中还提供一种计算机程序产品，当其在计算机上运行时，使得计算机执行如前述执行设备所执行的步骤，或者，使得计算机执行如前述训练设备所执行的步骤。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于进行信号处理的程序,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。
本申请实施例提供的执行设备、训练设备或终端设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使执行设备内的芯片执行上述实施例描述的数据处理方法,或者,以使训练设备内的芯片执行上述实施例描述的数据处理方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述程序执行的集成电路。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。

Claims (23)

  1. 一种图像处理方法,其特征在于,所述方法包括:
    获取模糊图像的第一特征表示以及事件相机采集的事件数据的第二特征表示;所述第一特征表示和所述第二特征表示的尺寸一致;
    根据所述模糊图像的第一特征表示以及所述事件数据的第二特征表示,通过场景流预测网络,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流,所述第一场景流和所述第一特征表示的尺寸一致,所述第一场景流中的每个像素特征指示由所述第一特征表示中对应像素位置的像素特征到所述第二特征表示中对应像素位置的像素特征的运动信息,所述第二场景流和所述第二特征表示的尺寸一致,所述第二场景流中的每个像素特征指示由所述第二特征表示中对应像素位置的像素特征到所述第一特征表示中对应像素位置的像素特征的运动信息;
    根据所述第一场景流,对第一特征表示进行仿射变换(warp),得到第三特征表示;
    根据所述第二场景流,对第二特征表示进行仿射变换,得到第四特征表示;所述第三特征表示和所述第四特征表示用于对所述模糊图像进行去模糊处理。
  2. 根据权利要求1所述的方法,其特征在于,所述模糊图像和所述事件数据为在相同时间段针对于同一场景采集得到的。
  3. 根据权利要求1或2所述的方法,其特征在于,所述场景流预测网络包括第一编码模块、第二编码模块、融合模块、第一解码模块以及第二解码模块;
    所述根据所述模糊图像的第一特征表示以及所述事件数据的第二特征表示,通过场景流预测网络,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流,包括:
    根据所述第一特征表示,通过所述第一编码模块,得到第一编码结果;
    根据所述第二特征表示,通过所述第二编码模块,得到第二编码结果;
    根据所述第一编码结果和所述第二编码结果,通过所述融合模块,得到融合结果;
    根据所述融合结果,分别通过所述第一解码模块和所述第二解码模块,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流。
  4. 根据权利要求1至3任一所述的方法,其特征在于,所述方法还包括:
    根据所述第四特征表示和所述第一特征表示,识别第二遮挡区域,其中所述模糊图像中所述第二遮挡区域的图像数据在所述事件数据的所述第二遮挡区域中有效;
    将所述第一特征表示中除所述第二遮挡区域之外的特征表示和所述第四特征表示中所述第二遮挡区域的特征表示进行第二融合,以得到第二融合特征表示。
  5. 根据权利要求4所述的方法,其特征在于,所述第二遮挡区域通过第二掩膜mask表示,所述第二mask和所述第四特征表示的尺寸一致,所述第二mask中的每个像素用于指示所述第一特征表示中对应位置的像素特征是否在所述模糊图像中有效。
  6. 根据权利要求1至5任一所述的方法,其特征在于,所述方法还包括:
    根据所述第三特征表示和所述第二特征表示,确定第一遮挡区域,其中所述事件数据中所述第一遮挡区域的图像数据在所述模糊图像的所述第一遮挡区域中有效;
    将所述第二特征表示中除所述第一遮挡区域之外的特征表示和所述第三特征表示中所述第一遮挡区域的特征表示进行第二融合,以得到第一融合特征表示。
  7. 根据权利要求6所述的方法，其特征在于，所述第一遮挡区域通过第一掩膜mask表示，所述第一mask和所述第三特征表示的尺寸一致，所述第一mask中的每个像素用于指示所述第三特征表示中对应位置的像素特征是否在所述事件数据中有效。
  8. 根据权利要求4至7任一所述的方法,其特征在于,所述第二融合为对应像素位置的相加运算。
  9. 根据权利要求1至8任一所述的方法,其特征在于,所述方法还包括:
    通过N个串联连接的特征嵌套块，处理模糊图像的特征表示以及事件数据的特征表示，以得到用于进行去模糊处理的处理结果；其中，每个所述特征嵌套块用于执行如权利要求1的图像处理方法，第1个特征嵌套块用于获取到通过特征提取网络对所述模糊图像和所述事件数据提取的特征表示，第n个特征嵌套块用于获取到第n-1个特征嵌套块输出的特征表示，所述n小于N且大于1。
  10. 根据权利要求9所述的方法,其特征在于,第N个所述特征嵌套块输出的特征表示用于和所述通过特征提取网络对所述模糊图像提取的特征表示进行融合,得到残差信息,所述残差信息用于和所述模糊图像进行融合以实现所述模糊图像的去模糊处理。
  11. 一种图像处理装置,其特征在于,所述装置包括:
    获取模块,用于获取模糊图像的第一特征表示以及事件相机采集的事件数据的第二特征表示;所述第一特征表示和所述第二特征表示的尺寸一致;
    场景流预测模块,用于根据所述模糊图像的第一特征表示以及所述事件数据的第二特征表示,通过场景流预测网络,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流,所述第一场景流和所述第一特征表示的尺寸一致,所述第一场景流中的每个像素特征指示由所述第一特征表示中对应像素位置的像素特征到所述第二特征表示中对应像素位置的像素特征的运动信息,所述第二场景流和所述第二特征表示的尺寸一致,所述第二场景流中的每个像素特征指示由所述第二特征表示中对应像素位置的像素特征到所述第一特征表示中对应像素位置的像素特征的运动信息;
    仿射变换模块,用于根据所述第一场景流,对第一特征表示进行仿射变换,得到第三特征表示;
    根据所述第二场景流,对第二特征表示进行仿射变换,得到第四特征表示;所述第三特征表示和所述第四特征表示用于对所述模糊图像进行去模糊处理。
  12. 根据权利要求11所述的装置,其特征在于,所述模糊图像和所述事件数据为在相同时间段针对于同一场景采集得到的。
  13. 根据权利要求11或12所述的装置,其特征在于,所述场景流预测网络包括第一编码模块、第二编码模块、融合模块、第一解码模块以及第二解码模块;
    所述场景流预测模块,具体用于:
    根据所述第一特征表示,通过所述第一编码模块,得到第一编码结果;
    根据所述第二特征表示,通过所述第二编码模块,得到第二编码结果;
    根据所述第一编码结果和所述第二编码结果,通过所述融合模块,得到融合结果;
    根据所述融合结果,分别通过所述第一解码模块和所述第二解码模块,得到所述模糊图像对应的第一场景流、和所述事件数据对应的第二场景流。
  14. 根据权利要求11至13任一所述的装置,其特征在于,所述装置还包括:
    遮挡区域识别模块,用于根据所述第四特征表示和所述第一特征表示,识别第二遮挡区域,其中所述模糊图像中所述第二遮挡区域的图像数据在所述事件数据的所述第二遮挡区域中有效;
    将所述第一特征表示中除所述第二遮挡区域之外的特征表示和所述第四特征表示中所述第二遮挡区域的特征表示进行第二融合,以得到第二融合特征表示。
  15. 根据权利要求14所述的装置,其特征在于,所述第二遮挡区域通过第二掩膜mask表示,所述第二mask和所述第四特征表示的尺寸一致,所述第二mask中的每个像素用于指示所述第一特征表示中对应位置的像素特征是否在所述模糊图像中有效。
  16. 根据权利要求11至15任一所述的装置,其特征在于,所述装置还包括:
    遮挡区域识别模块,用于根据所述第三特征表示和所述第二特征表示,确定第一遮挡区域,其中所述事件数据中所述第一遮挡区域的图像数据在所述模糊图像的所述第一遮挡区域中有效;
    将所述第二特征表示中除所述第一遮挡区域之外的特征表示和所述第三特征表示中所述第一遮挡区域的特征表示进行第二融合,以得到第一融合特征表示。
  17. 根据权利要求16所述的装置,其特征在于,所述第一遮挡区域通过第一掩膜mask表示,所述第一mask和所述第三特征表示的尺寸一致,所述第一mask中的每个像素用于指示所述第三特征表示中对应位置的像素特征是否在所述事件数据中有效。
  18. 根据权利要求14至17任一所述的装置,其特征在于,所述第二融合为对应像素位置的相加运算。
  19. 根据权利要求11至18任一所述的装置,其特征在于,所述装置还包括:特征嵌套模块,用于通过N个串联连接的特征嵌套块,处理模糊图像的特征表示以及事件数据的特征表示,以得到用于进行去模糊处理的处理结果;其中,每个所述特征嵌套块用于执行如权利要求1的图像处理方法,第1个特征嵌套块用于获取到通过特征提取网络对所述模糊图像和所述事件数据提取的特征表示,第n个特征嵌套块用于获取到第n-1个特征嵌套块输出的特征表示,所述n小于N。
  20. 根据权利要求19所述的装置,其特征在于,所述第N个所述特征嵌套块输出的特征表示用于和所述通过特征提取网络对所述模糊图像提取的特征表示进行融合,得到残差信息,所述残差信息用于和所述模糊图像进行融合以实现所述模糊图像的去模糊处理。
  21. 一种计算设备,其特征在于,所述计算设备包括存储器和处理器;所述存储器存储有代码,所述处理器被配置为获取所述代码,并执行如权利要求1至10任一所述的方法。
  22. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有一个或多个指令,所述指令在由一个或多个计算机执行时使得所述一个或多个计算机实施权利要求1至10任一所述的方法。
  23. 一种计算机程序产品,包括代码,其特征在于,在所述代码被执行时用于实现如权利要求1至10任一所述的方法。
PCT/CN2023/103616 2022-06-30 2023-06-29 一种图像处理方法及相关装置 WO2024002211A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210764024.9A CN115239581A (zh) 2022-06-30 2022-06-30 一种图像处理方法及相关装置
CN202210764024.9 2022-06-30

Publications (1)

Publication Number Publication Date
WO2024002211A1 true WO2024002211A1 (zh) 2024-01-04

Family

ID=83670800

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/103616 WO2024002211A1 (zh) 2022-06-30 2023-06-29 一种图像处理方法及相关装置

Country Status (2)

Country Link
CN (1) CN115239581A (zh)
WO (1) WO2024002211A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115239581A (zh) * 2022-06-30 2022-10-25 华为技术有限公司 一种图像处理方法及相关装置
CN116486120B (zh) * 2023-03-17 2024-01-19 广东工业大学 一种相移干涉图空间像素匹配方法
CN117726549B (zh) * 2024-02-07 2024-04-30 中国科学院长春光学精密机械与物理研究所 基于事件引导的图像去模糊方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200265590A1 (en) * 2019-02-19 2020-08-20 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
CN113076685A (zh) * 2021-03-04 2021-07-06 华为技术有限公司 图像重建模型的训练方法、图像重建方法及其装置
US20210321052A1 (en) * 2020-04-13 2021-10-14 Northwestern University System and method for high-resolution, high-speed, and noise-robust imaging
CN114463218A (zh) * 2022-02-10 2022-05-10 中国科学技术大学 一种基于事件数据驱动的视频去模糊方法
CN114494050A (zh) * 2022-01-14 2022-05-13 武汉大学 一种基于事件相机的自监督视频去模糊和图像插帧方法
CN115239581A (zh) * 2022-06-30 2022-10-25 华为技术有限公司 一种图像处理方法及相关装置


Also Published As

Publication number Publication date
CN115239581A (zh) 2022-10-25

Similar Documents

Publication Publication Date Title
Ming et al. Deep learning for monocular depth estimation: A review
CN110532871B (zh) 图像处理的方法和装置
WO2021164731A1 (zh) 图像增强方法以及图像增强装置
WO2021043273A1 (zh) 图像增强方法和装置
WO2021043168A1 (zh) 行人再识别网络的训练方法、行人再识别方法和装置
WO2020192483A1 (zh) 图像显示方法和设备
WO2021018163A1 (zh) 神经网络的搜索方法及装置
WO2024002211A1 (zh) 一种图像处理方法及相关装置
CN111402130B (zh) 数据处理方法和数据处理装置
WO2020177607A1 (zh) 图像去噪方法和装置
WO2021164234A1 (zh) 图像处理方法以及图像处理装置
WO2022116856A1 (zh) 一种模型结构、模型训练方法、图像增强方法及设备
US20220222776A1 (en) Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution
CN110222717B (zh) 图像处理方法和装置
WO2022134971A1 (zh) 一种降噪模型的训练方法及相关装置
WO2021063341A1 (zh) 图像增强方法以及装置
CN112446380A (zh) 图像处理方法和装置
CN113066017B (zh) 一种图像增强方法、模型训练方法及设备
WO2021018106A1 (zh) 行人检测方法、装置、计算机可读存储介质和芯片
WO2022001372A1 (zh) 训练神经网络的方法、图像处理方法及装置
CN113065645B (zh) 孪生注意力网络、图像处理方法和装置
CN113076685A (zh) 图像重建模型的训练方法、图像重建方法及其装置
CN113011562A (zh) 一种模型训练方法及装置
WO2022022288A1 (zh) 一种图像处理方法以及装置
CN113066018A (zh) 一种图像增强方法及相关装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23830380

Country of ref document: EP

Kind code of ref document: A1