WO2022022288A1 - Image processing method and apparatus - Google Patents

Image processing method and apparatus

Info

Publication number
WO2022022288A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
sub
structural
hidden state
state information
Prior art date
Application number
PCT/CN2021/106380
Other languages
English (en)
French (fr)
Inventor
李松江
磯部骏
贾旭
田奇
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP21850081.7A (published as EP4181052A4)
Publication of WO2022022288A1
Priority to US18/161,123 (published as US20230177646A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to an image processing method and apparatus.
  • Super-resolution refers to reconstructing a corresponding high-resolution image from an observed low-resolution image. The low-resolution image is up-sampled and enlarged, and the missing details are filled in by means of image prior knowledge, image self-similarity and complementary information from multiple frames to generate the corresponding high-resolution image. Super-resolution technology has important application value in high-definition television, surveillance equipment, satellite imagery and medical imaging.
  • In a typical multi-frame scheme, the intermediate frame and its 2N neighboring frames (N before and N after) are input to form an input frame sequence of 2N+1 frames, and motion compensation is then performed on the input frame sequence.
  • The adjacent frames are aligned to the intermediate frame, the information of the multiple frames is fused, and the super-resolution output of the intermediate frame is finally produced.
  • However, it is necessary to temporarily buffer the next N neighboring frames, which leads to a delay of N frames.
  • the present application provides an image processing method and device for performing super-resolution processing on an input image to efficiently and accurately obtain a higher-definition image.
  • A first aspect of the present application provides an image processing method, which includes: first, decomposing a first image to obtain a first structure sub-image and a first detail sub-image, where the first image is any frame of image in the video data other than the first frame, and a first frequency is lower than a second frequency, the first frequency being the frequency of the information included in the first structure sub-image and the second frequency being the frequency of the information included in the first detail sub-image, that is, the frequency of the information included in the first structure sub-image is lower than the frequency of the information included in the first detail sub-image; then, fusing first hidden state information and the first structure sub-image to obtain a second structure sub-image, and splicing the first hidden state information and the first detail sub-image to obtain a second detail sub-image, where the first hidden state information includes features extracted from a second image, and the second image includes at least one frame of image in the video data adjacent to the first image;
  • then, feature extraction is performed based on the second structure sub-image and the second detail sub-image to obtain structural features and detail features; then, an output image is obtained according to the structural features and the detail features, and the resolution of the output image is higher than the resolution of the first image.
  • In the embodiments of the present application, the input is decomposed into a structure branch and a detail branch for processing, and the hidden state information is used to further enrich the structure and the details, so that the finally obtained output image is richer in both structure and detail. There is no need to buffer multiple frames to process an intermediate frame, and a high-resolution image of the current frame can be obtained efficiently.
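  • As an illustration of the overall flow in this aspect, the following is a minimal, hedged sketch in PyTorch-style Python. The function and module names (decompose, struct_net, detail_net, upscaler) and the use of channel concatenation are illustrative assumptions, not the patent's actual implementation.

```python
import torch
import torch.nn.functional as F

def decompose(frame: torch.Tensor, scale: int = 2):
    """Split a frame (N, C, H, W) into a low-frequency structure sub-image
    and a high-frequency detail sub-image via down-/up-sampling."""
    down = F.avg_pool2d(frame, kernel_size=scale)
    structure = F.interpolate(down, size=frame.shape[-2:], mode="bilinear",
                              align_corners=False)
    detail = frame - structure                 # residual = high-frequency detail
    return structure, detail

def super_resolve_frame(frame, hidden, struct_net, detail_net, upscaler):
    """One recurrent step: process the current frame together with the hidden
    state carried over from previous frames; return the SR frame and the new
    hidden state."""
    structure, detail = decompose(frame)
    structure2 = torch.cat([structure, hidden], dim=1)   # second structure sub-image
    detail2 = torch.cat([detail, hidden], dim=1)         # second detail sub-image
    struct_feat = struct_net(structure2)                 # structural features
    detail_feat = detail_net(detail2)                    # detail features
    fused = struct_feat + detail_feat
    output = upscaler(fused)                             # higher-resolution output
    new_hidden = fused                                    # simplistic hidden-state update
    return output, new_hidden
```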
  • Using the hidden state information to fuse the first structure sub-image and the first detail sub-image respectively to obtain the second structure sub-image and the second detail sub-image may include: acquiring a similarity matrix between the first hidden state information and the first image, where the similarity matrix includes at least one similarity, and the at least one similarity is used to represent the degree of similarity between an image area included in the first hidden state information and a corresponding image area in the first image; filtering the first hidden state information according to the similarity matrix to obtain second hidden state information, where the degree of similarity between each image area in the second hidden state information and the corresponding image area in the first image is higher than the degree of similarity between each image area in the first hidden state information and the corresponding image area in the first image; and splicing the first structure sub-image with the second hidden state information to obtain the second structure sub-image, and splicing the first detail sub-image with the second hidden state information to obtain the second detail sub-image.
  • Therefore, when the first hidden state information is used, the redundant information in it can be filtered out, and the first structure sub-image and the first detail sub-image are respectively fused with the resulting hidden state information, so that a second detail sub-image with richer details and a second structure sub-image with richer structure can be obtained.
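  • A hedged sketch of one way to realize this filtering is a per-pixel cosine-similarity gate over the hidden state. The patent does not specify the exact similarity measure; cosine similarity and the clamping to non-negative values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def filter_hidden_state(hidden: torch.Tensor, frame: torch.Tensor) -> torch.Tensor:
    """Suppress regions of the hidden state that are dissimilar to the current
    frame. Both tensors are assumed to be (N, C, H, W) with matching shapes."""
    # per-pixel cosine similarity across channels -> similarity matrix (N, 1, H, W)
    sim = F.cosine_similarity(hidden, frame, dim=1, eps=1e-6).unsqueeze(1)
    gate = sim.clamp(min=0.0)     # keep only areas positively correlated with the frame
    return hidden * gate          # second (filtered) hidden state information
```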
  • Performing feature extraction based on the second structure sub-image and the second detail sub-image to obtain the structural features and the detail features may include: performing iterative fusion on the second structure sub-image and the second detail sub-image at least once to obtain an updated second structure sub-image and an updated second detail sub-image; and extracting features from the updated second structure sub-image to obtain the structural features, and extracting features from the updated second detail sub-image to obtain the detail features.
  • Therefore, the information included in the second structure sub-image and the second detail sub-image can be fused, so that the detail information of the second detail sub-image is enriched by the structure information included in the second structure sub-image, and the structure information included in the second structure sub-image is enriched by the detail information included in the second detail sub-image. The finally extracted features are thus more abundant, the final output image is clearer, and the user experience is improved.
  • Any iterative fusion process includes: fusing the second structure sub-image obtained in the previous iteration and the second detail sub-image obtained in the previous iteration to obtain a first fused image of the current iteration; fusing the first fused image with the second structure sub-image obtained in the previous iteration to obtain the second structure sub-image of the current iteration; and fusing the first fused image with the second detail sub-image obtained in the previous iteration to obtain the second detail sub-image of the current iteration.
  • Therefore, the second structure sub-image obtained in the previous iteration and the second detail sub-image obtained in the previous iteration can be fused, and the resulting first fused image is then fused with the second structure sub-image and the second detail sub-image respectively. In this way, the detail information of the second detail sub-image is enriched by the structure information included in the second structure sub-image, and the structure information included in the second structure sub-image is enriched by the detail information included in the second detail sub-image, so that the finally extracted features are more abundant, the final output image is clearer, and the user experience is improved.
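  • A minimal sketch of one such iteration, assuming the three fusion operations are learnable convolutions over concatenated inputs (the operator choice is an assumption for illustration):

```python
import torch
import torch.nn as nn

class TwoWayFusion(nn.Module):
    """One fusion iteration: merge the structure and detail sub-images into a
    first fused image, then feed that fused image back into each branch."""
    def __init__(self, channels: int):
        super().__init__()
        self.merge = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.to_structure = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.to_detail = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, structure, detail):
        fused = self.merge(torch.cat([structure, detail], dim=1))          # first fused image
        new_structure = self.to_structure(torch.cat([fused, structure], dim=1))
        new_detail = self.to_detail(torch.cat([fused, detail], dim=1))
        return new_structure, new_detail
```

  • Applying such a module repeatedly would yield the updated second structure sub-image and updated second detail sub-image described above.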
  • Obtaining an output image according to the structural features and the detail features may include: fusing the structural features and the detail features to obtain a second fused image; and enlarging the second fused image to obtain the output image, where the resolution of the output image is higher than the resolution of the second fused image.
  • the second fusion image may be enlarged to obtain an output image, thereby obtaining an output image with a higher resolution.
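  • The reconstruction step can be sketched as follows; sub-pixel convolution (pixel shuffle) is used here only as one common enlargement choice, since this aspect only requires that the second fused image be enlarged:

```python
import torch
import torch.nn as nn

class Reconstruct(nn.Module):
    """Fuse the structural and detail features, then enlarge by a factor r."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.expand = nn.Conv2d(channels, 3 * r * r, 3, padding=1)
        self.shuffle = nn.PixelShuffle(r)   # (N, 3*r*r, H, W) -> (N, 3, r*H, r*W)

    def forward(self, struct_feat, detail_feat):
        fused = self.fuse(torch.cat([struct_feat, detail_feat], dim=1))  # second fused image
        return self.shuffle(self.expand(fused))                          # output image
```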
  • The above method further includes: updating the first hidden state information according to the structural features and the detail features, where the updated first hidden state information is used to process the next frame of image arranged after the first image in the video data.
  • Therefore, the first hidden state information can be updated, so that the updated first hidden state information can be used when processing the next frame, improving the clarity of the output image corresponding to the next frame and improving the user experience.
  • Decomposing the first image may include: down-sampling the first image to obtain a down-sampled image; up-sampling the down-sampled image to obtain the first structure sub-image; and removing the first structure sub-image from the first image to obtain the first detail sub-image.
  • Therefore, the first structure sub-image and the first detail sub-image can be obtained by down-sampling and up-sampling, which provides a specific way of obtaining the first structure sub-image and the first detail sub-image.
  • an image processing device comprising:
  • a decomposition unit configured to decompose the first image to obtain a first structural sub-image and a first detailed sub-image, where the first image is any frame of images in the video data except the first frame, and the first frequency is lower than the second frequency, where the first frequency is the frequency of the information included in the first structure sub-image, and the second frequency is the frequency of the information included in the first detail sub-image;
  • a fusion unit, configured to fuse the first hidden state information and the first structure sub-image to obtain a second structure sub-image, and to splice the first hidden state information and the first detail sub-image to obtain a second detail sub-image,
  • where the first hidden state information includes features extracted from a second image, and the second image includes at least one frame of image in the video data adjacent to the first image;
  • a feature extraction unit configured to perform feature extraction based on the second structural sub-map and the second detailed sub-map to obtain structural features and detailed features
  • the output unit is configured to obtain an output image according to the structural feature and the detail feature, and the resolution of the output image is higher than that of the first image.
  • the fusion unit is specifically configured to: obtain a similarity matrix between the first hidden state information and the first image, where the similarity matrix includes at least one similarity, and the at least one similarity is used to represent the degree of similarity between an image area included in the first hidden state information and a corresponding image area in the first image; filter the first hidden state information according to the similarity matrix to obtain second hidden state information, where the degree of similarity between each image area in the second hidden state information and the corresponding image area in the first image is higher than the degree of similarity between each image area in the first hidden state information and the corresponding image area in the first image; and splice the first structure sub-image with the second hidden state information to obtain the second structure sub-image, and splice the first detail sub-image with the second hidden state information to obtain the second detail sub-image.
  • the feature extraction unit is configured to: perform at least one iterative fusion on the second structure subgraph and the second detail subgraph to obtain an updated second structure subgraph and an updated second detail subgraph Subgraph; extract features from the updated second structural subgraph to obtain structural features, and extract features from the updated second detail subgraph to obtain detailed features.
  • any iterative fusion process may include: fusing the second structure subgraph obtained in the previous iteration with the second detail subgraph obtained in the previous iteration to obtain the first fusion of the current iteration image; fuse the first fused image and the second structure sub-image obtained in the previous iteration to obtain the second structural sub-image of the current iteration; fuse the first fused image and the second detailed sub-image obtained in the previous iteration , get the second detail subgraph of the current iteration.
  • the output unit is specifically configured to: fuse the structural features and the detail features to obtain a second fused image; and enlarge the second fused image to obtain the output image, where the resolution of the output image is higher than the resolution of the second fused image.
  • the image processing apparatus may further include: an update unit, configured to update the first hidden state information according to the structural features and the detail features, where the updated first hidden state information is used to process the next frame of image arranged after the first image in the video data.
  • the decomposition unit is specifically configured to: down-sample the first image to obtain a down-sampled image; up-sample the down-sampled image to obtain the first structure sub-image; and remove the first structure sub-image from the first image to obtain the first detail sub-image.
  • an embodiment of the present application provides an image processing apparatus, and the image processing apparatus has the function of implementing the image processing method of the first aspect.
  • This function can be implemented by hardware or by executing corresponding software by hardware.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • an embodiment of the present application provides an image processing apparatus, including a processor and a memory, where the processor and the memory are interconnected through a line, and the processor invokes program code in the memory to execute the processing-related functions in the image processing method shown in any one of the implementations of the first aspect above.
  • the image processing device may be a chip.
  • the embodiments of the present application provide an image processing device, which may also be referred to as a digital processing chip or a chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit obtains program instructions through the communication interface, and the program instructions are executed by the processing unit, where the processing unit is configured to execute the processing-related functions in the first aspect or any optional implementation manner of the first aspect.
  • an embodiment of the present application provides a computer-readable storage medium, including instructions, which, when executed on a computer, cause the computer to execute the method in the first aspect or any optional implementation manner of the first aspect.
  • an embodiment of the present application provides a computer program product including instructions, which, when run on a computer, enables the computer to execute the method in the first aspect or any optional implementation manner of the first aspect.
  • FIG. 1 is a schematic diagram of the main framework of artificial intelligence applied in this application;
  • FIG. 2 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of another convolutional neural network provided by an embodiment of the present application.
  • FIG. 4A is a schematic diagram of an application scenario of an image processing method provided by an embodiment of the present application.
  • FIG. 4B is a schematic diagram of an application scenario of an image processing method provided by an embodiment of the present application.
  • FIG. 5A is a schematic diagram of a system architecture provided by the present application.
  • FIG. 5B is a schematic diagram of an application scenario of an image processing method provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of an image processing method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an image processing architecture provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an application scenario of an image processing method provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of another image processing architecture provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a method for filtering a hidden state provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of another hidden state filtering method provided by an embodiment of the present application.
  • FIG. 12 is a schematic flowchart of an image fusion provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of another image processing architecture provided by an embodiment of the present application.
  • FIG. 15 is a schematic flowchart of a hidden state update provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of another image processing architecture provided by an embodiment of the present application.
  • FIG. 17 is a schematic diagram of an image processing effect provided by an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of an image processing apparatus provided by an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of another image processing apparatus provided by an embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of a chip according to an embodiment of the present application.
  • Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
  • artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that responds in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
  • Figure 1 shows a schematic diagram of an artificial intelligence main frame, which describes the overall workflow of an artificial intelligence system and is suitable for general artificial intelligence field requirements.
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".
  • the "IT value chain” reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of human intelligence, information (providing and processing technology implementation) to the industrial ecological process of the system.
  • The infrastructure provides computing power support for the artificial intelligence system, enables communication with the outside world, and provides support through a basic platform. The infrastructure communicates with the outside through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as central processing units (CPU), neural-network processing units (NPU), graphics processing units (GPU), application-specific integrated circuits (ASIC) or field programmable gate arrays (FPGA)); the basic platform includes related platform guarantees and support such as a distributed computing framework and a network, and may include cloud storage and computing, interconnection networks, and the like. For example, the sensors communicate with the outside to obtain data, and the data is provided to intelligent chips in the distributed computing system provided by the basic platform for computation.
  • the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, video, and text, as well as IoT data of traditional devices, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing (such as image recognition, object detection, etc.), speech recognition, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, and the productization and application of intelligent information decision-making. Their application areas mainly include intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, autonomous driving, smart cities, smart terminals, and so on.
  • the embodiments of the present application involve a large number of related applications of neural networks.
  • the related terms and concepts of the neural networks that may be involved in the embodiments of the present application are first introduced below.
  • A neural network can be composed of neural units. A neural unit can refer to an operation unit that takes x_s and an intercept of 1 as input, and the output of the operation unit can be expressed as formula (1-1): h_{W,b}(x) = f(∑_s W_s·x_s + b), where
  • W_s is the weight of x_s, and
  • b is the bias of the neural unit.
  • f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting a plurality of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • a deep neural network also known as a multi-layer neural network, can be understood as a neural network with multiple intermediate layers.
  • The layers inside a DNN can be divided, according to their positions, into three categories: input layer, intermediate layers, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are all intermediate layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although a DNN looks complicated, the work of each layer is not complicated. In short, each layer computes the expression y = α(W·x + b), where x is the input vector, y is the output vector, b is the offset (bias) vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Since a DNN has many layers, the number of coefficient matrices W and offset vectors b is also large.
  • These parameters are defined in the DNN as follows, taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
  • In summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W^L_{jk}.
  • the input layer does not have a W parameter.
  • more intermediate layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors W of many layers).
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional layers and subsampling layers, which can be viewed as a filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • In a convolutional layer of a convolutional neural network, a neuron may be connected to only some of the neurons in adjacent layers.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as the way to extract image information is independent of location.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • Recurrent neural networks (RNN) are used to process sequence data.
  • In a traditional neural network model, the layers from the input layer to the intermediate layers to the output layer are fully connected, while the nodes within each layer are unconnected.
  • Although this ordinary neural network solves many problems, it is still powerless for many others. For example, to predict the next word of a sentence, the previous words are generally needed, because the words in a sentence are not independent of each other. The reason why an RNN is called a recurrent neural network is that the current output of a sequence is also related to the previous outputs.
  • RNN can process sequence data of any length.
  • the training of RNN is the same as the training of traditional CNN or DNN.
  • Super-resolution is an image enhancement technology. Given a single low-resolution image or a group of low-resolution images, it uses learned image prior knowledge, the self-similarity of the image, and the complementarity of multi-frame image information to recover the high-frequency detail information of the image and generate a higher-resolution target image.
  • According to the number of input images, super-resolution can be divided into single-frame image super-resolution and video super-resolution. Super-resolution has important application value in high-definition television, surveillance equipment, satellite imagery and medical imaging.
  • Video super-resolution is an enhancement technology for video processing, whose purpose is to convert low-resolution video into high-quality, high-resolution video. According to the number of input frames, video super-resolution can be divided into multi-frame video super-resolution and recurrent video super-resolution.
  • A convolutional neural network (CNN) is a deep neural network with a convolutional structure.
  • A CNN is a deep learning architecture; a deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms.
  • a CNN is a feed-forward artificial neural network in which each neuron responds to overlapping regions in images fed into it.
  • a convolutional neural network consists of a feature extractor consisting of convolutional and subsampling layers.
  • the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolution with an input image or a convolutional feature map using a trainable filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • In a convolutional layer of a convolutional neural network, a neuron may be connected to only some of the neurons in adjacent layers.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as the way to extract image information is independent of location. The underlying principle is that the statistics of one part of the image are the same as the other parts. This means that image information learned in one part can also be used in another part. So for all positions on the image, we can use the same learned image information. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • The convolutional neural network can use the error back propagation (BP) algorithm to adjust the values of the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is forwarded until the output, where an error loss is produced, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges.
  • The back-propagation algorithm is a backward pass dominated by the error loss, aiming to obtain the parameters of the optimal super-resolution model, such as the weight matrices.
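  • As a hedged illustration only, a generic PyTorch training step shows forward propagation, an error loss, and back propagation of that loss to update the parameters (the L1 loss and the optimizer choice are assumptions, not the patent's training setup):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, low_res, high_res_target):
    """One forward/backward pass of a super-resolution model."""
    optimizer.zero_grad()
    prediction = model(low_res)                      # forward propagation
    loss = F.l1_loss(prediction, high_res_target)    # reconstruction error loss
    loss.backward()                                  # back-propagate the error loss
    optimizer.step()                                 # update weights and biases
    return loss.item()
```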
  • a convolutional neural network (CNN) 100 may include an input layer 110 , a convolutional/pooling layer 120 , where the pooling layer is optional, and a neural network layer 130 .
  • the convolutional/pooling layer 120 may include layers 121-126 as examples.
  • In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
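  • For illustration, the alternating convolution/pooling structure described for layers 121-126 corresponds to a stack like the following sketch; the channel sizes and the assumption of 32x32 RGB inputs are arbitrary and not taken from the patent:

```python
import torch.nn as nn

# Layers named after 121-126 in FIG. 2, followed by the neural network
# layer 130 and output layer 140 (assumes 32x32 RGB input images).
cnn_100 = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # 121: convolutional layer
    nn.MaxPool2d(2),                             # 122: pooling layer
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),  # 123: convolutional layer
    nn.MaxPool2d(2),                             # 124: pooling layer
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),  # 125: convolutional layer
    nn.MaxPool2d(2),                             # 126: pooling layer
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 10),                   # 130/140: hidden + output layers
)
```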
  • the convolution layer 121 may include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator can essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually moved across the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride) in the horizontal direction, so as to extract a specific feature from the image.
  • the size of this weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • the weight matrix will extend to the entire depth of the input image. Therefore, convolution with a single weight matrix will produce a single depth dimension of the convolutional output, but in most cases a single weight matrix is not used, but multiple weight matrices of the same dimension are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolutional image.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on.
  • the multiple weight matrices have the same dimension, and the feature maps extracted from the multiple weight matrices with the same dimension have the same dimension, and then the multiple extracted feature maps with the same dimension are combined to form the output of the convolution operation.
  • the weight values in the weight matrix need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.
  • When the convolutional neural network has multiple convolutional layers, the initial convolutional layers (e.g., 121) often extract relatively general features; as the network becomes deeper, the features extracted by the later convolutional layers become more and more complex, such as high-level semantic features.
  • features with higher semantics are more suitable for the problem to be solved.
  • The layers 121-126 exemplified by 120 in FIG. 2 may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the average value of the pixel values in the image within a certain range.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
  • the convolutional neural network 100 After being processed by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not sufficient to output the required output information. Because as mentioned before, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (required class information or other related information), the convolutional neural network 100 needs to utilize the neural network layer 130 to generate one or a set of outputs of the required number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 2) and an output layer 140. In this application, the convolutional neural network is obtained by deforming the selected starting point network at least once to obtain a serial network, and then obtaining it according to the trained serial network. This convolutional neural network can be used for image recognition, image classification, image super-resolution reconstruction, and more.
  • After the multiple hidden layers in the neural network layer 130, the last layer of the entire convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to categorical cross-entropy and is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 100 (the propagation from 110 to 140 in FIG. 2) is completed, back propagation (the propagation from 140 to 110 in FIG. 2) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
  • the convolutional neural network 100 shown in FIG. 2 is only used as an example of a convolutional neural network.
  • the convolutional neural network may also exist in the form of other network models, for example, a network in which multiple convolutional layers/pooling layers are in parallel, as shown in FIG. 3, and the separately extracted features are all input to the neural network layer 130 for processing.
  • the image processing method provided in this application can be applied to live video, video call, album management, smart city, human-computer interaction, and other scenarios that need to involve video data.
  • the image processing method provided by the present application can be applied to a smart city scenario.
  • Low-quality video data (that is, low-resolution video data) collected by various surveillance devices can be gathered and stored in a memory.
  • super-resolution processing can be performed on the video data through the image processing method provided in the present application, so as to obtain video data with higher resolution and improve the user's viewing experience.
  • The image processing method provided in this application can also be applied to various video shooting scenarios. For example, a user can use a terminal to shoot a video, and in order to reduce the storage space occupied by the video, the video can be compressed or down-sampled to obtain video data with a smaller storage footprint. When the user uses the terminal to play the video, super-resolution processing can be performed on the stored video data through the image processing method provided in the present application, thereby obtaining video data with higher resolution and improving the user's viewing experience.
  • the image processing method provided by the present application can be applied to a live video scene.
  • the server can send a video stream to the client used by the user.
  • the transmitted video stream can be compressed.
  • the client After the client receives the data stream sent by the server, it can perform super-resolution processing on the data stream through the image processing method provided in the present application, thereby obtaining video data with higher resolution and improving the user's viewing experience.
  • the system architecture of the application of the image processing method provided in this application may be as shown in FIG. 5A .
  • the server cluster 410 is implemented by one or more servers, and optionally, cooperates with other computing devices, such as data storage, routers, load balancers and other devices.
  • the server cluster 410 may use the data in the data storage system 250, or invoke the program code in the data storage system 250 to implement the steps of the image processing method provided in this application.
  • Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, etc.
  • Each user's local device can interact with the server cluster 410 through any communication mechanism/standard communication network, which can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • the communication network may include a wireless network, a wired network, or a combination of a wireless network and a wired network, and the like.
  • the wireless network includes but is not limited to: the fifth generation mobile communication technology (5th-Generation, 5G) system, the long term evolution (long term evolution, LTE) system, the global system for mobile communication (global system for mobile communication, GSM) or code division Multiple access (code division multiple access, CDMA) network, wideband code division multiple access (wideband code division multiple access, WCDMA) network, wireless fidelity (wireless fidelity, WiFi), Bluetooth (bluetooth), Zigbee protocol (Zigbee), Any one or a combination of radio frequency identification technology (radio frequency identification, RFID), long range (Long Range, Lora) wireless communication, and near field communication (near field communication, NFC).
  • the wired network may include an optical fiber communication network or a network composed of coaxial cables, and the like.
  • Any server in the server cluster 410 can obtain video data from the data storage system 250 or from other devices, such as terminals or PCs. If the video data is a low-resolution video, the server can send the low-resolution video to the local device through the communication network. If the video data is a high-resolution video, in order to reduce the bandwidth occupied by the transmission of the video data, the server may down-sample the video data to obtain a low-resolution video and send the low-resolution video to the local device through the communication network. Therefore, after receiving the low-resolution video, the local device can perform super-resolution processing on it to obtain a high-resolution video, as shown in FIG. 5B.
  • the super-resolution method based on deep neural network can generate higher-quality super-resolution images with clearer and less artifacts, which further promotes the application of super-resolution technology.
  • A down-sampled, lower-resolution video stream can be transmitted over the network, and after receiving it, the client can convert it into high-resolution images through super-resolution technology and play them, which effectively reduces the network bandwidth requirements. In video surveillance, the resolution of the captured image is usually low due to the limitations of the installation location and storage of the surveillance camera.
  • Super-resolution technology can transform it into a clearer version, providing more detailed information for subsequent tasks such as target face recognition and pedestrian re-identification.
  • Super-resolution technology has also been widely used in applications such as high-definition of old movies and medical images.
  • The present application provides an image processing method for video, which realizes lightweight computation based on a recurrent network, so that the super-resolution processing of video can run in real time.
  • FIG. 6 is a schematic flowchart of an image processing method provided by the present application, described as follows.
  • video data may also be acquired, and the video data may be a video stream, or data of a complete video, or the like.
  • the video data may include multiple frames of images, and the first image is any one of the frames of images.
  • The second image mentioned below is one or more frames of images adjacent to the first image, which will not be repeated below.
  • the second image may be one or more frames of images arranged before the first image according to the playback sequence.
  • the second image may be one or more frames of images arranged after the first image.
  • the structural information is the low-frequency image components, and the detail information corresponds to the high-frequency image components. Therefore, in this step, the information included in the first image can be divided into high-frequency information and low-frequency information, the high-frequency information constitutes the first detail sub-image, and the low-frequency information constitutes the first structural sub-image.
  • the manners of decomposing the first image may include various methods.
  • the first image may be decomposed by means of downsampling combined with upsampling, or may be decomposed by means of low-pass filtering, etc., which may be adjusted according to actual application scenarios, which is not limited here.
  • The specific steps may include: down-sampling the first image to obtain a down-sampled image; up-sampling the down-sampled image to obtain the first structure sub-image; and removing the first structure sub-image from the first image to obtain the first detail sub-image.
  • the features included in the first image can be acquired by down-sampling the first image, and then the dimensions of the first structure sub-image can be kept the same as the dimensions of the first image by up-sampling.
  • the first structure sub-image obtained by upsampling is subtracted from the first image, so as to obtain the first detail sub-image of the first image.
  • The specific steps may include: applying a low-pass filter to extract the low-frequency part of the first image to obtain the first structure sub-image, and then subtracting the first structure sub-image from the first image to obtain the first detail sub-image.
  • Alternatively, the high-frequency part of the first image can be extracted by high-pass filtering to obtain the first detail sub-image, and the first detail sub-image is then removed from the first image to obtain the first structure sub-image.
  • the first hidden state information includes features extracted from the second image.
  • the first hidden state information can also be understood as an image composed of the features of the second structural subgraph, and its dimension is the same as that of the first image.
  • The first hidden state information and the first structure sub-image can be fused to obtain a second structure sub-image that refers to the features of the second image, and the first hidden state information and the first detail sub-image can be fused to obtain a second detail sub-image that likewise refers to the features of the second image.
  • the hidden state information can be understood as a feature map generated by the network, which contains features extracted from past frames, and is stored historical information.
  • the hidden state provides historical information, and the time-space level fusion with the features of the current input frame can obtain richer feature expression, thereby improving the super-resolution effect of the current frame.
  • the existence of hidden state information is conducive to outputting more stable results, effectively reducing the jitter of the video, and improving the look and feel of the picture.
  • the hidden state information stores historical information
  • New historical information may be added to the hidden state information after each frame is processed, which leads to a large amount of redundant (for example, outdated or useless) information in the hidden state information.
  • adaptive filtering may be performed on the first hidden state, so as to filter out redundant information in the first hidden state information.
  • The specific filtering process may include: first, acquiring a similarity matrix between the first hidden state information and the first image, where the similarity matrix consists of one or more similarities, and the one or more similarities are used to represent the degree of similarity between an image area included in the first hidden state information and the corresponding image area in the first image; each image area may include one or more pixels.
  • Then, the first hidden state information is filtered according to the similarity matrix to obtain the second hidden state information, where the degree of similarity between each image area in the second hidden state information and the corresponding image area in the first image is higher than the degree of similarity between each image area in the first hidden state information and the corresponding image area in the first image.
  • Step 602 may include: using the second hidden state information to splice the first structure sub-image and the first detail sub-image respectively, so as to obtain the second structure sub-image and the second detail sub-image.
  • The information in the first hidden state that is not similar to the first image can be filtered out through the similarity matrix, so as to obtain second hidden state information that is more similar to, and more strongly correlated with, the first image. Therefore, the structure and details of the second structure sub-image and the second detail sub-image obtained by fusion using the second hidden state information can be enriched, thereby making the subsequently obtained output image clearer and of higher resolution.
  • features may be extracted from the second structural sub-image and the second detailed sub-image, for example, features are extracted from the second structural sub-image to obtain structural features, and features are extracted from the second detailed sub-image to obtain detailed features.
  • Feature extraction may also be performed in combination with the second structural sub-image and the second detailed sub-image to obtain structural features and detailed features.
  • Iterative fusion can be performed on the second structure sub-image and the second detail sub-image at least once to obtain an updated second structure sub-image and an updated second detail sub-image. Features are then extracted from the updated second structure sub-image to obtain the structural features, and from the updated second detail sub-image to obtain the detail features.
  • Therefore, in the embodiment of the present application, by fusing the structure sub-image and the detail sub-image, each can enrich the information included in the other, so that the finally obtained structural features and detail features are more accurate.
  • When the second structure sub-image and the second detail sub-image are iteratively fused at least once to obtain the updated second structure sub-image and the updated second detail sub-image, any one fusion of the second structure sub-image and the second detail sub-image may include: fusing the second structure sub-image obtained in the previous iteration and the second detail sub-image obtained in the previous iteration to obtain the first fused image of the current iteration; fusing the first fused image with the second structure sub-image obtained in the previous iteration to obtain the second structure sub-image of the current iteration; and fusing the first fused image with the second detail sub-image obtained in the previous iteration to obtain the second detail sub-image of the current iteration.
  • in other words, before the structural features and detail features are extracted, the second structural sub-image and the second detail sub-image interact at least once to exchange the information each includes, so that the finally updated second structural sub-image and second detail sub-image include richer information, and the information included in the final output image is therefore richer.
  • the structural features and the detailed features can be fused to obtain an output image with rich structure and details.
  • the second fused image may be enlarged to obtain an output image with a higher resolution.
  • during super-resolution processing of the video data, the input is decomposed into a structure branch and a detail branch for processing, and the hidden state information is used to further enrich the structure and details, so that the final output image is richer in structure and detail. There is no need to buffer multiple frames to process an intermediate frame, and a high-resolution image of the current frame can be obtained efficiently.
  • step 605 is an optional step.
  • the first hidden state information can be updated based on the structural features and the detail features, so that when super-resolution processing is performed on the next frame, the structure and details of the next frame can be enriched based on the updated hidden state information, thereby obtaining a clear high-resolution image.
  • the first hidden state information can be replaced with the information obtained by fusing the structural features and the detail features, or the information obtained by fusing the structural features and the detail features can be further fused with the original first hidden state information to obtain the updated first hidden state information.
  • after the high-resolution image of each input frame is obtained, the first hidden state information can be updated, so that when super-resolution processing is performed on the next frame, the updated and more relevant hidden state information can be used to enrich the structure and details of the image, making the final image clearer.
  • FIG. 7 is a schematic flowchart of another image processing method provided by the present application.
  • the decomposition may, for example, down-sample the input image 701 to obtain a down-sampled image.
  • the down-sampled image is then up-sampled to obtain a first structure sub-map 702 .
  • by removing the first structural sub-image from the input image 701, the first detail sub-image 703 can be obtained.
  • specifically, for example, the average or median of every four pixels in the input image can be taken and combined into one pixel to obtain a down-sampled image, and the down-sampled image can then be interpolated back to four pixels to obtain an up-sampled image; the up-sampled image is the first structural sub-image, and its dimensions are the same as those of the input image. The value of each pixel of the first structural sub-image is then subtracted from the value of the corresponding pixel of the input image to obtain the first detail sub-image; a minimal sketch of this decomposition is given below.
  • the pixel value here may include gray value, brightness value, value of each channel of RGB, etc., which may be adjusted according to actual application scenarios.
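  • as an illustration only, the following is a minimal PyTorch sketch of this down-sample/up-sample decomposition (PyTorch is the platform mentioned later for the experiments); the pooling factor, interpolation mode, and tensor sizes are illustrative assumptions rather than values fixed by this application:

      import torch
      import torch.nn.functional as F

      def decompose(frame, scale=2):
          # average pooling merges each scale x scale block of pixels into one pixel
          down = F.avg_pool2d(frame, kernel_size=scale)
          # interpolating back to the input size yields the low-frequency first structural sub-image
          structure = F.interpolate(down, size=frame.shape[-2:], mode='bilinear', align_corners=False)
          # removing the structural sub-image from the input leaves the high-frequency first detail sub-image
          detail = frame - structure
          return structure, detail

      structure, detail = decompose(torch.randn(1, 3, 64, 64))  # e.g. a 3-channel 64x64 frame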
  • the first structure sub-picture 702 and the first detail sub-picture 703 are spliced respectively using the first hidden state information 704 to obtain the second structure sub-picture 705 and the second detail sub-picture 706 .
  • for example, if the first structural sub-image 702 includes 3 channels and the first hidden state information includes 3 channels, splicing the first structural sub-image and the first hidden state information can produce a second structural sub-image including 6 channels.
  • alternatively, if the first structural sub-image 702 includes 3 channels and the first hidden state information includes 3 channels, the values of the 3 channels of the first hidden state information may be added to each channel of the first structural sub-image 702; the resulting second structural sub-image still includes 3 channels, but the value of each channel becomes larger.
  • the manner of obtaining the second detail sub-image is similar to that of obtaining the first detail sub-image.
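  • to illustrate the two splicing variants above, the following PyTorch snippet shows channel-wise concatenation versus element-wise addition; the tensor sizes are illustrative assumptions:

      import torch

      structure = torch.randn(1, 3, 64, 64)   # 3-channel first structural sub-image
      hidden = torch.randn(1, 3, 64, 64)      # 3-channel first hidden state information

      # variant 1: splicing along the channel dimension -> a 6-channel second structural sub-image
      spliced = torch.cat([structure, hidden], dim=1)   # shape (1, 6, 64, 64)

      # variant 2: adding the hidden-state values to each channel -> still 3 channels,
      # but the value of each channel changes (becomes larger in the described case)
      added = structure + hidden                        # shape (1, 3, 64, 64)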
  • the feature extraction network 707 can then be used to extract features from the second structural sub-map 705 to obtain structural features 708 , and from the second detailed sub-map 706 to obtain detailed features 709 .
  • the feature extraction network may include one or more convolution kernels.
  • the feature extraction network may refer to the aforementioned convolutional neural network, which is not limited in this application.
  • generally, to implement a lightweight super-resolution network, a feature extraction network including fewer convolution kernels can be used for feature extraction; of course, to make the final output image clearer, a feature extraction network including more convolution kernels can also be used for feature extraction.
  • the structural features 708 and the detailed features 709 can be fused and enlarged to obtain a final output image 710 .
  • in addition, after the structural features 708 and the detail features 709 are obtained, the first hidden state information can also be updated by using them, so that when super-resolution processing is performed on the next frame, the updated first hidden state information can be used for processing, making the structure and details of the final output image richer and improving the user experience.
  • the architecture provided in the aforementioned FIG. 7 can be applied to the scenario shown in FIG. 8 .
  • the user can play the image sent by the server through a mobile phone, a TV, or a PC.
  • in playback order, the image frames included in the video are I_t-1, I_t, I_t+1, I_t+2, and so on; super-resolution processing can be performed on each frame of image, thereby obtaining high-resolution images and improving the user's viewing experience, for example with a recurrent loop such as the sketch below.
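  • a minimal sketch of this per-frame recurrent processing is shown below; `model` is a hypothetical module standing in for the architecture of FIG. 7, not an interface defined by this application:

      def super_resolve_stream(frames, model, init_hidden):
          # frames: low-resolution images I_t-1, I_t, I_t+1, I_t+2, ... in playback order
          hidden = init_hidden
          outputs = []
          for frame in frames:
              # each frame is processed with the hidden state carried over from the previous frame,
              # so no future frames need to be buffered
              sr_frame, hidden = model(frame, hidden)
              outputs.append(sr_frame)
          return outputs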
  • FIG. 7 The architecture shown in FIG. 7 will be further described below. Referring to FIG. 9, wherein 701-706, 708-710 are similar to those shown in the aforementioned FIG. 7, and the differences are described below.
  • the differences between FIG. 9 and the aforementioned FIG. 7 may include: the first hidden state information is filtered, the second hidden state information obtained after filtering is more strongly correlated with the input image 701, and the second hidden state information can subsequently be spliced with the first structural sub-image 702 and the first detail sub-image 703 respectively, so that the second structural sub-image 705 and the second detail sub-image 706 obtained after splicing include richer information, and the final output image is clearer.
  • FIG. 10 Exemplarily, reference may be made to FIG. 10 for the specific process of filtering the first hidden state information.
  • a similarity matrix 1001 can be generated by computing, based on the features of the input image 701, the similarity between the input image 701 and the first hidden state information 704.
  • for example, the input image can be divided into multiple image areas, each image area including one or more pixels; correspondingly, the first hidden state information is divided into multiple image areas according to the same division method, and each of these image areas likewise includes one or more pixels.
  • for example, the distribution of pixels in each image area of the input image can be matched against the distribution of pixels in each image area of the first hidden state information, so as to calculate the similarity between each image area in the input image and the corresponding image area in the first hidden state information, thereby obtaining the similarity matrix.
  • after the similarity matrix 1001 is obtained, the first hidden state information is filtered based on it, and the image regions whose similarity with the input image is low (e.g., lower than a preset similarity) are filtered out, yielding the second hidden state information 902; the second hidden state information includes the image regions whose similarity with the input image is higher (e.g., not lower than the preset similarity).
  • an application scenario is taken as an example to illustrate the filtering process of hidden state information.
  • the similarity calculation part first performs preliminary feature extraction on the input image with one convolutional layer, generating an H×W×k² feature map; for each position (x, y) of this feature map, the 1×1×k² feature is extracted and expanded into a k×k feature map.
  • based on this k×k feature map, a convolution kernel is constructed, and the 1×1×C feature corresponding to position (x, y) of the hidden state matrix (H×W×C, i.e., the first hidden state information) is convolved with it, generating a 1×1×C similarity result that is written to position (x, y) of the similarity matrix; after this convolution has been performed for every position (x, y), a similarity matrix with the same dimensions (H×W×C) as the hidden state matrix is obtained.
  • the filter part first uses the sigmoid function to normalize the similarity matrix to [0, 1], and then multiplies the similarity matrix and the hidden state element-wise to obtain the final filtered hidden state, that is, the second hidden state information.
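  • the following is a PyTorch sketch of this adaptive filtering, written under the assumption that the per-position k×k kernel is correlated with the k×k neighbourhood of each hidden-state channel; the layer sizes and class/variable names are illustrative, not taken from this application:

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class HiddenStateFilter(nn.Module):
          def __init__(self, in_channels=3, k=3):
              super().__init__()
              self.k = k
              # one convolutional layer predicting a k*k kernel for every spatial position
              self.kernel_pred = nn.Conv2d(in_channels, k * k, kernel_size=3, padding=1)

          def forward(self, frame, hidden):
              # frame: (B, in_channels, H, W) current input; hidden: (B, C, H, W) first hidden state
              B, C, H, W = hidden.shape
              k = self.k
              kernels = self.kernel_pred(frame).view(B, 1, k * k, H * W)
              # k*k neighbourhood of the hidden state around every position, per channel
              patches = F.unfold(hidden, k, padding=k // 2).view(B, C, k * k, H * W)
              # per-position, per-channel correlation -> similarity matrix of shape (B, C, H, W)
              similarity = (patches * kernels).sum(dim=2).view(B, C, H, W)
              # sigmoid-normalise to [0, 1] and gate the hidden state element-wise
              return hidden * torch.sigmoid(similarity)   # second hidden state information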
  • the difference between FIG. 9 and the aforementioned FIG. 7 may further include: the feature extraction network may include N Structural Detail (SD) modules, where N is a positive integer, such as 901-90N shown in FIG. 9 .
  • SD Structural Detail
  • Each SD module is used to fuse the structural subgraph and the detail subgraph, so as to enrich the information included in the structural subgraph and the detail subgraph.
  • the SD_n may be any one of N SD modules.
  • the input of the SD_n module is the second structural sub-image 1201 and the second detailed sub-image 1202 output by the SD_n-1 module, and the second structural sub-image 1201 and the second detailed sub-image 1202 can be fused to obtain a second fused image.
  • the second fused image and the second structural sub-image 1201 are then fused, so that the updated second structural sub-image 1203 retains the information included in the second structural sub-image before the update and, on this basis, also incorporates the information included in the second detail sub-image.
  • likewise, the second fused image and the second detail sub-image 1202 are fused, so that the updated second detail sub-image 1204 retains the information included in the second detail sub-image before the update and, on this basis, also incorporates the information included in the second structural sub-image.
  • the updated second structure sub-picture and the updated second detail sub-picture output by SD_n are input to the next SD module, that is, the SD_n+1 module.
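  • a sketch of one such SD module is given below; the channel counts and the use of 3*3 convolutions with ReLU are assumptions made for illustration:

      import torch
      import torch.nn as nn

      class SDBlock(nn.Module):
          def __init__(self, channels=64):
              super().__init__()
              self.fuse = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True))
              self.to_structure = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True))
              self.to_detail = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True))

          def forward(self, structure, detail):
              # fuse the two branches, then re-fuse the result with each branch so that each
              # updated sub-image keeps its own information and gains the other branch's information
              fused = self.fuse(torch.cat([structure, detail], dim=1))
              new_structure = self.to_structure(torch.cat([fused, structure], dim=1))
              new_detail = self.to_detail(torch.cat([fused, detail], dim=1))
              return new_structure, new_detail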
  • the process of fusing structural features can refer to Figure 13.
  • after the structural features and detail features are obtained, a 3*3 convolution is used to process the structural features and the detail features respectively, yielding more stable structural features and detail features. The structural features and detail features after convolution processing are then spliced, and the spliced image is subjected to a further 3*3 convolution to obtain the second fused image. Pixel shuffle is then performed on the second fused image to obtain an enlarged output image.
  • for example, the resolution of the input image may be 4*4*3, the resolution of the second fused image obtained by splicing is 4*4*12, and pixel shuffle processing performed on the second fused image yields an 8*8*3 output image. It can be seen that the resolution of the output image is higher than that of the input image, i.e., a high-resolution image is obtained; a sketch of this head is given below.
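  • the following is a sketch of this fusion and pixel-shuffle upscaling head; the channel counts and the scale factor of 2 mirror the 4*4*3 -> 8*8*3 example above and are otherwise assumptions:

      import torch
      import torch.nn as nn

      class FuseAndUpscale(nn.Module):
          def __init__(self, channels=3, scale=2):
              super().__init__()
              self.conv_s = nn.Conv2d(channels, channels, 3, padding=1)   # stabilise structural features
              self.conv_d = nn.Conv2d(channels, channels, 3, padding=1)   # stabilise detail features
              self.conv_fuse = nn.Conv2d(2 * channels, channels * scale * scale, 3, padding=1)
              self.shuffle = nn.PixelShuffle(scale)                       # e.g. 4x4x12 -> 8x8x3

          def forward(self, structure_feat, detail_feat):
              s = self.conv_s(structure_feat)
              d = self.conv_d(detail_feat)
              fused = self.conv_fuse(torch.cat([s, d], dim=1))   # second fused image
              return self.shuffle(fused)                          # enlarged output image

      out = FuseAndUpscale()(torch.randn(1, 3, 4, 4), torch.randn(1, 3, 4, 4))
      print(out.shape)   # torch.Size([1, 3, 8, 8])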
  • exemplarily, reference may be made to FIG. 14 for the steps of updating the first hidden state information.
  • after the structural features and detail features are obtained, they are fused, and the fused image is subjected to 3*3 convolution and ReLU processing to obtain the updated first hidden state information.
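  • a sketch of this hidden-state update is shown below; the channel counts are assumptions:

      import torch
      import torch.nn as nn

      class HiddenStateUpdate(nn.Module):
          def __init__(self, feat_channels=64, hidden_channels=3):
              super().__init__()
              self.update = nn.Sequential(
                  nn.Conv2d(2 * feat_channels, hidden_channels, 3, padding=1),
                  nn.ReLU(inplace=True),
              )

          def forward(self, structure_feat, detail_feat):
              # fuse the structural and detail features, then apply 3*3 convolution and ReLU
              return self.update(torch.cat([structure_feat, detail_feat], dim=1))   # updated first hidden state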
  • the super-resolution processing flow in the aforementioned FIG. 9 may be represented as the super-resolution processing flow shown in FIG. 15 .
  • each time splicing or fusion is completed, a 3*3 convolution, or a 3*3 convolution followed by a rectified linear unit (ReLU), can be added, so that the features included in the fused or spliced image are more effective.
  • FIG. 16 one frame of image is taken as an example to illustrate the image processing flow provided in the present application.
  • the input image is decomposed, and the filtered hidden state information is fused to obtain a first structural sub-image and a first detailed sub-image.
  • the second structural sub-image and the second detail sub-image are then input into the feature extraction network, where one or more SD modules let the structural sub-image and the detail sub-image interact, so as to extract the structural features and detail features; the structural features and detail features are then fused to obtain the output image.
  • the image processing method provided by the present application thus provides a video super-resolution processing method based on a structure-detail dual-branch recurrent neural network: structure (low-frequency) and detail (high-frequency) information is explicitly separated in the network and processed by two branches, and this explicit dual-branch structure can effectively enrich the information included in the output image and improve the effect of video super-resolution.
  • in addition, a step of adaptively filtering the hidden state in the recurrent neural network is proposed: by calculating the similarity between the current input and the hidden state and filtering the hidden state based on this similarity, outdated information is eliminated, the accumulation of errors is reduced, and the utilization efficiency of the hidden state information is improved.
  • the video super-resolution model, that is, the network performing the methods of the aforementioned FIG. 6 to FIG. 16 of the present application, is trained on the Vimeo-90K data set and tested on commonly used video super-resolution data sets such as Vid4, Vimeo-90K-T, SPMCS, and UDM10, to demonstrate the processing effect of the image processing method proposed in this application on low-definition video.
  • the results of the current best-performing video super-resolution methods in the industry and academia in the same scene will be provided as a horizontal comparison.
  • the Vimeo-90K dataset is one of the commonly used datasets in video super-resolution tasks and contains about 90k video clips.
  • the dataset is collected from a social networking site, covering various scenes of daily life, as well as a large number of movie clips. Due to its huge sample size, diverse scenes, and large motion, it is a challenging video dataset and has been widely used in video processing tasks.
  • the Vimeo-90K dataset can be divided into training set and test set. For its test set, this application uses Vimeo-90K-T representation.
  • a network model is constructed on the PyTorch platform.
  • the original high-resolution ground truth (GT) is used as the standard to calculate, for each frame, the peak signal-to-noise ratio (PSNR) and the structural similarity (structural similarity index measurement, SSIM), and finally the average PSNR and average SSIM are calculated over the entire test set.
  • GT ground truth
  • PSNR peak signal-to-noise ratio
  • SSIM structural similarity index measurement
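  • as a small worked example of the per-frame PSNR evaluation (SSIM would typically be computed with an existing implementation such as skimage.metrics.structural_similarity), assuming frames normalised to [0, 1]:

      import torch

      def psnr(output, gt, max_val=1.0):
          # peak signal-to-noise ratio of one output frame against its ground-truth (GT) frame
          mse = torch.mean((output - gt) ** 2)
          return 10.0 * torch.log10(max_val ** 2 / mse)

      # the reported metric is then the average of the per-frame PSNR values over the whole test set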
  • Table 1 shows the test results of different methods on the Vid4 test set.
  • the Vid4 test set includes Calendar, City, Foliage, and Walk videos full of high-frequency details. It is one of the test sets commonly used in the field of video processing to test the processing capability of high-frequency details.
  • several commonly used methods are selected for comparison with the output results of the image processing method provided in this application, such as Bicubic, SPMC (subpixel motion compensation), Liu (Robust Video Super-resolution With Learned Temporal Dynamics), TOFlow (task-oriented flow), DUF (Dynamic Upsampling Filters)-52L, RBPN (recurrent back-projection network), EDVR (Video Restoration with enhanced deformable convolutional networks)-L, PFNL (Progressive fusion video super resolution network via exploiting non-local spatio-temporal correlations), FRVSR (frame recurrent video super resolution), and RLSP (efficient video super resolution through recurrent latent space propagation). As can be seen from Table 1, the method provided in this application (denoted RSDN) achieves the highest PSNR and SSIM with a far smaller amount of computation (about 0.13T FLOPs), whereas the computation of EDVR-L (0.93T FLOPs) is more than 7 times that of this method.
  • FIG. 17 shows, from the output high-resolution results, the leading effect of the method provided in this application on video super-resolution; higher-definition images can be obtained.
  • the image processing apparatus may include:
  • the decomposition unit 1801 is configured to decompose the first image to obtain a first structural sub-image and a first detail sub-image, where the first image is any frame of image in the video data except the first frame, a first frequency is lower than a second frequency, the first frequency is the frequency of the information included in the first structural sub-image, and the second frequency is the frequency of the information included in the first detail sub-image;
  • the fusion unit 1802 is configured to fuse the first hidden state information and the first structural sub-image to obtain a second structural sub-image, and to splice the first hidden state information and the first detail sub-image to obtain a second detail sub-image, where the first hidden state information includes features extracted from a second image, and the second image includes at least one frame of the video data adjacent to the first image;
  • the feature extraction unit 1803 is used to perform feature extraction based on the second structural sub-map and the second detailed sub-map to obtain structural features and detailed features;
  • the output unit 1804 is configured to obtain an output image according to the structural feature and the detail feature, and the resolution of the output image is higher than that of the first image.
  • the fusion unit 1802 is specifically configured to: obtain a similarity matrix between the first hidden state information and the first image, where the similarity matrix includes at least one similarity, and the at least one similarity is used to represent the degree of similarity between the image areas included in the first hidden state information and the image areas in the first image; filter the first hidden state information according to the similarity matrix to obtain second hidden state information, where the degree of similarity between each image area in the second hidden state information and the corresponding image area in the first image is higher than the degree of similarity between each image area in the first hidden state information and the image area in the first image; and splice the first structural sub-image using the second hidden state information to obtain the second structural sub-image, and splice the first detail sub-image using the second hidden state information to obtain the second detail sub-image.
  • the feature extraction unit 1803 is configured to: perform at least one iterative fusion on the second structure subgraph and the second detail subgraph to obtain an updated second structure subgraph and an updated second Detail sub-map; extract features from the updated second structure sub-map to obtain structural features, and extract features from the updated second detail sub-map to obtain detailed features.
  • any iterative fusion process may include: fusing the second structure subgraph obtained in the previous iteration with the second detail subgraph obtained in the previous iteration to obtain the first fusion of the current iteration image; fuse the first fused image and the second structure sub-image obtained in the previous iteration to obtain the second structural sub-image of the current iteration; fuse the first fused image and the second detailed sub-image obtained in the previous iteration , get the second detail subgraph of the current iteration.
  • the output unit 1804 is specifically configured to: fuse the structural features and the detail features to obtain a second fused image; and amplify the second fused image to obtain the output image, where the resolution of the output image is higher than the resolution of the second fused image.
  • the image processing apparatus may further include: an updating unit 1805, configured to update the first hidden state information according to the structural features and the detail features, where the updated first hidden state information is used to process the frame of image arranged after the first image in the video data.
  • the decomposition unit 1801 is specifically configured to: down-sample the first image to obtain a down-sampled image; up-sample the down-sampled image to obtain the first structural sub-image; and remove the first structural sub-image from the first image to obtain the first detail sub-image.
  • FIG. 19 is a schematic structural diagram of another image processing apparatus provided by the present application, as described below.
  • the image processing apparatus may include a processor 1901 and a memory 1902 .
  • the processor 1901 and the memory 1902 are interconnected by wires.
  • the memory 1902 stores program instructions and data.
  • the memory 1902 stores program instructions and data corresponding to the steps in the foregoing FIGS. 6 to 16 .
  • the processor 1901 is configured to perform the method steps performed by the image processing apparatus shown in any of the foregoing embodiments in FIG. 6 to FIG. 16 .
  • the image processing apparatus may further include a transceiver 1903 for receiving or sending data.
  • Embodiments of the present application also provide a computer-readable storage medium in which a program is stored; when the program runs on a computer, the computer is caused to execute the steps in the methods described in the embodiments shown in the foregoing FIG. 6 to FIG. 16.
  • the aforementioned image processing device shown in FIG. 19 is a chip.
  • the embodiments of the present application also provide an image processing device, which may also be referred to as a digital processing chip or a chip.
  • the chip includes a processing unit and a communication interface.
  • the processing unit acquires program instructions through the communication interface, and the program instructions are executed by the processing unit.
  • the processing unit is configured to perform the method steps performed by the image processing apparatus shown in any of the foregoing embodiments in FIG. 6 to FIG. 16 .
  • the embodiments of the present application also provide a digital processing chip.
  • the digital processing chip integrates circuits and one or more interfaces for realizing the above-mentioned processor 1901 or the functions of the processor 1901 .
  • the digital processing chip can perform the method steps of any one or more of the foregoing embodiments.
  • the digital processing chip does not integrate the memory, it can be connected with the external memory through the communication interface.
  • the digital processing chip implements the actions performed by the image processing apparatus in the above embodiments according to the program codes stored in the external memory.
  • Embodiments of the present application also provide a computer program product that, when run on a computer, causes the computer to execute the steps performed by the image processing apparatus in the methods described in the embodiments shown in the foregoing FIG. 6 to FIG. 16.
  • the image processing apparatus may be a chip, and the chip includes: a processing unit and a communication unit.
  • the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit.
  • the processing unit can execute the computer-executed instructions stored in the storage unit, so that the chip in the server executes the image processing method described in the embodiments shown in FIG. 6 to FIG. 16 .
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), etc.
  • ROM Read-only memory
  • RAM random access memory
  • the aforementioned processing unit or processor may be a central processing unit (CPU), a network processor (neural-network processing unit, NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or it may be any conventional processor or the like.
  • FIG. 20 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the chip may be embodied as a neural network processor NPU 200; the NPU 200 is mounted on the host CPU as a coprocessor, and the host CPU allocates tasks to it.
  • the core part of the NPU is the arithmetic circuit 2003, which is controlled by the controller 2004 to extract the matrix data in the memory and perform multiplication operations.
  • the arithmetic circuit 2003 includes multiple processing units (process engines, PEs). In some implementations, the arithmetic circuit 2003 is a two-dimensional systolic array. The arithmetic circuit 2003 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2003 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 2002 and buffers it on each PE in the arithmetic circuit.
  • the arithmetic circuit fetches the data of matrix A and matrix B from the input memory 2001 to perform matrix operation, and stores the partial result or final result of the matrix in an accumulator 2008 .
  • Unified memory 2006 is used to store input data and output data.
  • the weight data is transferred to the weight memory 2002 through the direct memory access controller (DMAC) 2005.
  • Input data is also transferred to unified memory 2006 via the DMAC.
  • DMAC direct memory access controller
  • a bus interface unit (BIU) 2010 is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (instruction fetch buffer, IFB) 2009.
  • the bus interface unit 2010 (bus interface unit, BIU) is used for the instruction fetch memory 2009 to obtain instructions from the external memory, and also for the storage unit access controller 2005 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 2006, the weight data to the weight memory 2002, or the input data to the input memory 2001.
  • the vector calculation unit 2007 includes a plurality of operation processing units, and further processes the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on, if necessary. It is mainly used for non-convolutional/fully connected layer network computations in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 2007 can store the processed output vectors to the unified memory 2006 .
  • the vector calculation unit 2007 may apply a linear function and/or a nonlinear function to the output of the operation circuit 2003, such as linear interpolation of the feature plane extracted by the convolutional layer, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit 2007 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 2003, eg, for use in subsequent layers in a neural network.
  • the instruction fetch memory (instruction fetch buffer) 2009 connected to the controller 2004 is used to store the instructions used by the controller 2004;
  • Unified memory 2006, input memory 2001, weight memory 2002 and instruction fetch memory 2009 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • each layer in the recurrent neural network can be performed by the operation circuit 2003 or the vector calculation unit 2007 .
  • the processor mentioned in any one of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the above-mentioned methods in FIG. 6-FIG. 16 .
  • the device embodiments described above are only schematic; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • based on such an understanding, the technical solution may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc; the software product includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the various embodiments of the present application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) manner.
  • wire eg, coaxial cable, fiber optic, digital subscriber line (DSL)
  • wireless eg, infrared, wireless, microwave, etc.
  • the computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a server or data center that integrates one or more usable media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media (eg, solid state disks (SSDs)), and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

一种人工智能领域的图像处理方法以及装置,用于对输入图像进行超分辨率处理,高效准确地得到更高清的图像。该方法包括:对第一图像进行分解,得到第一结构子图和第一细节子图(601),第一图像为视频数据中的除第一帧外的任意一帧图像;对第一隐状态信息和第一结构子图进行融合,得到第二结构子图,以及对第一隐状态信息和第一细节子图进行拼接,得到第二细节子图,第一隐状态信息包括从第二图像中提取到的特征,第二图像包括视频数据与第一图像相邻的至少一帧图像;基于第二结构子图和第二细节子图进行特征提取,得到结构特征和细节特征(603);根据结构特征和细节特征,得到输出图像(604),输出图像的分辨率高于第一图像的分辨率。

Description

一种图像处理方法以及装置
本申请要求于2020年07月31日提交中国专利局、申请号为“202010762144.6”、申请名称为“一种图像处理方法以及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,尤其涉及一种图像处理方法以及装置。
背景技术
超分辨率(super resolution,SR)是指从观测到的低分辨率图像重建出相应的高分辨率图像。通过将低分辨率的图像上采样放大,借助图像先验知识、图像自相似性和多帧图像互补信息等手段填充细节,生成对应的高分辨率图像。超分辨率技术在高清电视、观测设备、卫星图像和医学影像等领域有重要的应用价值。
在现有方案中,在对中间帧进行分辨率提升时,输入中间帧及其前后近邻的2N个近邻帧,组成一个2N+1帧的输入帧序列,然后对输入帧序列进行运动补偿,将近邻帧对齐到中间帧,融合多帧信息,最后实现中间帧的超分辨率输出。然而,在此方案中,需要先将未来的N个近邻帧暂存起来,从而导致N帧的延迟,处理视频流等需要实时的应用时,会有明显的延迟,降低了用户体验,并且,需要同时对2N+1帧进行特征提取,所需的特征提取网络较复杂。
发明内容
本申请提供一种图像处理方法以及装置,用于对输入图像进行超分辨率处理,高效准确地得到更高清的图像。
有鉴于此,本申请第一方面提供一种图像处理方法,包括:首先,对第一图像进行分解,得到第一结构子图和第一细节子图,第一图像为视频数据中的除第一帧外的任意一帧图像,且第一频率低于第二频率,第一频率为第一结构子图所包括的信息的频率,第二频率为第一细节子图所包括的信息的频率,即第一结构子图所包括的信息的频率高于第一细节子图所包括的信息的频率;然后,对第一隐状态信息和第一结构子图进行融合,得到第二结构子图,以及对第一隐状态信息和第一细节子图进行拼接,得到第二细节子图,第一隐状态信息包括从第二图像中提取到的特征,第二图像包括视频数据与第一图像相邻的至少一帧图像;随后,基于第二结构子图和第二细节子图进行特征提取,得到结构特征和细节特征;随后,根据结构特征和细节特征,得到输出图像,输出图像的分辨率高于第一图像的分辨率。
因此,在本申请实施方式中,在进行视频数据的超分辨率处理的过程中,分解了结构分支和细节分支进行处理,并使用隐状态信息对结构和细节进行了进一步丰富,使最终得到的输出图像的结构和细节更丰富。无需缓存多帧对中间帧进行处理,可以高效地得到当前帧的高分辨率图像。
在一种可能的实施方式中,使用隐状态信息分别对第一结构子图和第一细节子图进行 融合,得到第二结构子图和第二细节子图,可以包括:获取第一隐状态信息和第一图像的相似度矩阵,相似度矩阵中包括至少一个相似度,至少一个相似度用于表示第一隐状态信息所包括的图像区域和第一图像中的图像区域之间的相似程度;根据相似度矩阵对第一隐状态信息进行过滤,得到第二隐状态信息,第二隐状态信息中每个图像区域与第一图像中对应的图像区域的相似程度,高于第一隐状态信息中每个图像区域与第一图像中的图像区域的相似程度;使用第二隐状态信息对第一结构子图进行拼接,得到第二结构子图,使用第二隐状态信息对第一细节子图进行拼接,得到第二细节子图。
因此,本申请实施方式中,在使用第一隐状态信息时,可以过滤其中的冗余信息,使用过来后的隐状态信息分别对第一结构子图和第一细节子图进行融合,可以得到细节更丰富的第二结构子图,以及结构更丰富的第二结构子图。
在一种可能的实施方式中,基于第二结构子图和第二细节子图中的进行特征提取,得到结构特征和细节特征,可以包括:对第二结构子图和第二细节子图进行至少一次迭代融合,得到更新后的第二结构子图和更新后的第二细节子图;从更新后的第二结构子图中提取特征,得到结构特征,从更新后的第二细节子图中提取特征,得到细节特征。
因此,本申请实施方式中,可以对第二结构子图和第二细节子图所包括的信息进行融合,从而通过第二结构子图所包括的结构信息丰富第二细节子图的细节信息,以及通过第二细节子图所包括的细节信息来丰富第二结构子图包括的结构信息,从而使最终提取到的特征更丰富,进而使最终得到的输出图像更清晰,提高用户体验。
在一种可能的实施方式中,任意一次迭代融合过程包括:对上一次迭代得到的第二结构子图和上一次迭代得到的第二细节子图进行融合,得到当前次迭代的第一融合图像;对第一融合图像和上一次迭代得到的第二结构子图进行融合,得到当前次迭代的第二结构子图;对第一融合图像和上一次迭代得到的第二细节子图进行融合,得到当前次迭代的第二细节子图。
因此,在本申请实施方式中,在每次迭代融合的过程中,都可以融合上一次迭代得到的第二结构子图和上一次迭代得到的第二细节子图,并使用融合得到的第一融合图像分别融合第二结构子图和第二细节子图,从而通过第二结构子图所包括的结构信息丰富第二细节子图的细节信息,以及通过第二细节子图所包括的细节信息来丰富第二结构子图包括的结构信息,从而使最终提取到的特征更丰富,进而使最终得到的输出图像更清晰,提高用户体验。
在一种可能的实施方式中,根据结构特征和细节特征,得到输出图像,可以包括:融合结构特征和细节特征,得到第二融合图像;对第二融合图像进行放大处理,得到输出图像,输出图像的分辨率高于第二融合图像的分辨率。
因此,本申请实施方式中,可以放大第二融合图像得到输出图像,从而得到分辨率更高的输出图像。
在一种可能的实施方式中,在提取第二结构子图中的特征,得到结构特征,以及提取第二细节子图中的特征,得到细节特征之后,上述方法还包括:根据结构特征和细节特征更新第一隐状态信息,第一隐状态信息用于对视频数据中排列在第一图像的下一帧图像进 行处理。
因此,在本申请实施方式中,在对当前帧进行超分辨率处理之后,可以更新第一隐状态信息,从而使在对下一帧进行处理的过程中,可以使用更新后的第一隐状态信息进行处理,提高下一帧对应的输出图像的清晰图,提高用户体验。
在一种可能的实施方式中,对第一图像进行分解,可以包括:对第一图像进行下采样,得到下采样图像;对下采样图像进行上采样,得到第一结构子图;从第一图像中去除第一结构子图,得到第一细节子图。
因此,本申请实施方式中,可以通过下采样以及上采样的方式来得到第一结构子图以及第一细节子图,提供了一种得到第一结构子图以及第一细节子图的具体方式。
第二方面,本申请提供一种图像处理装置,包括:
分解单元,用于对第一图像进行分解,得到第一结构子图和第一细节子图,第一图像为视频数据中的除第一帧外的任意一帧图像,且第一频率低于第二频率,第一频率为第一结构子图所包括的信息的频率,第二频率为第一细节子图所包括的信息的频率;
融合单元,用于对第一隐状态信息和第一结构子图进行融合,得到第二结构子图,以及对第一隐状态信息和第一细节子图进行拼接,得到第二细节子图,第一隐状态信息包括从第二图像中提取到的特征,第二图像包括视频数据与第一图像相邻的至少一帧图像;
特征提取单元,用于基于第二结构子图和第二细节子图进行特征提取,得到结构特征和细节特征;
输出单元,用于根据结构特征和细节特征,得到输出图像,输出图像的分辨率高于第一图像的分辨率。
在一种可能的实施方式中,融合单元,具体用于:获取第一隐状态信息和第一图像的相似度矩阵,相似度矩阵中包括至少一个相似度,至少一个相似度用于表示第一隐状态信息所包括的图像区域和第一图像中的图像区域之间的相似程度;根据相似度矩阵对第一隐状态信息进行过滤,得到第二隐状态信息,第二隐状态信息中每个图像区域与第一图像中对应的图像区域的相似程度,高于第一隐状态信息中每个图像区域与第一图像中的图像区域的相似程度;使用第二隐状态信息对第一结构子图进行拼接,得到第二结构子图,使用第二隐状态信息对第一细节子图进行拼接,得到第二细节子图。
在一种可能的实施方式中,特征提取单元,用于:对第二结构子图和第二细节子图进行至少一次迭代融合,得到更新后的第二结构子图和更新后的第二细节子图;从更新后的第二结构子图中提取特征,得到结构特征,从更新后的第二细节子图中提取特征,得到细节特征。
在一种可能的实施方式中,任意一次迭代融合过程可以包括:对上一次迭代得到的第二结构子图和上一次迭代得到的第二细节子图进行融合,得到当前次迭代的第一融合图像;对第一融合图像和上一次迭代得到的第二结构子图进行融合,得到当前次迭代的第二结构子图;对第一融合图像和上一次迭代得到的第二细节子图进行融合,得到当前次迭代的第二细节子图。
在一种可能的实施方式中,输出单元,具体用于:融合结构特征和细节特征,得到第 二融合图像;对第二融合图像进行放大处理,得到输出图像,输出图像的分辨率高于第二融合图像的分辨率。
在一种可能的实施方式中,该图像处理装置还可以包括:更新单元,用于根据结构特征和细节特征更新第一隐状态信息,第一隐状态信息用于对视频数据中排列在第一图像的下一帧图像进行处理。
在一种可能的实施方式中,分解单元,具体用于:对第一图像进行下采样,得到下采样图像;对下采样图像进行上采样,得到第一结构子图;从第一图像中去除第一结构子图,得到第一细节子图。
第三方面,本申请实施例提供一种图像处理装置,该图像处理装置具有实现上述第一方面图像处理方法的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
第四方面,本申请实施例提供一种图像处理装置,包括:处理器和存储器,其中,处理器和存储器通过线路互联,处理器调用存储器中的程序代码用于执行上述第一方面任一项所示的图像处理方法中与处理相关的功能。可选地,该图像处理装置可以是芯片。
第五方面,本申请实施例提供了一种图像处理装置,该图像处理装置也可以称为数字处理芯片或者芯片,芯片包括处理单元和通信接口,处理单元通过通信接口获取程序指令,程序指令被处理单元执行,处理单元用于执行如上述第一方面或第一方面任一可选实施方式中与处理相关的功能。
第六方面,本申请实施例提供了一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行上述第一方面或第一方面任一可选实施方式中的方法。
第七方面,本申请实施例提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面或第一方面任一可选实施方式中的方法。
附图说明
图1为本申请应用的一种人工智能主体框架示意图;
图2为本申请实施例提供的一种卷积神经网络结构示意图;
图3为本申请实施例提供的另一种卷积神经网络结构示意图;
图4A为本申请实施例提供的一种图像处理方法的应用场景示意图;
图4B为本申请实施例提供的一种图像处理方法的应用场景示意图;
图5A本申请提供的一种系统架构示意图;
图5B为本申请实施例提供的一种图像处理方法的应用场景示意图;
图6为本申请实施例提供的一种图像处理方法的流程示意图;
图7为本申请实施例提供的一种图像处理架构示意图;
图8为本申请实施例提供的一种图像处理方法的应用场景示意图;
图9为本申请实施例提供的另一种图像处理架构示意图;
图10为本申请实施例提供的一种隐状态过滤的方式示意图;
图11为本申请实施例提供的另一种隐状态过滤的方式示意图;
图12为本申请实施例提供的一种图像融合的流程示意图;
图13为本申请实施例提供的另一种图像处理架构示意图;
图14为本申请实施例提供的一种图像方法处理的流程示意图;
图15为本申请实施例提供的一种隐状态更新的流程示意图;
图16为本申请实施例提供的另一种图像处理架构示意图;
图17为本申请实施例提供的一种图像处理效果示意图;
图18为本申请实施例提供的一种图像处理装置的结构示意图;
图19为本申请实施例提供的另一种图像处理装置的结构示意图;
图20为本申请实施例提供的一种芯片的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请提供的图像处理方法可以应用于人工智能(artificial intelligence,AI)场景中。AI是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
图1示出一种人工智能主体框架示意图,该主体框架描述了人工智能系统总体工作流程,适用于通用的人工智能领域需求。
下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主题框架进行阐述。
“智能信息链”反映从数据的获取到处理的一列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。
“IT价值链”从人智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施:
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片,如中央处理器(central processing unit,CPU)、网络处理器(neural-network processing unit,NPU)、图形处理器(英语:graphics processing unit,GPU)、专用集成电路(application specific integrated circuit,ASIC)或现场可编程逻辑门阵列(field programmable gate array, FPGA)等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、视频、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理(如图像识别、目标检测等),语音识别等等。
(5)智能产品及行业应用
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,智慧城市,智能终端等。
本申请实施例涉及了大量神经网络的相关应用,为了更好地理解本申请实施例的方案,下面先对本申请实施例可能涉及的神经网络的相关术语和概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的，神经单元可以是指以x_s和截距1为输入的运算单元，该运算单元的输出可以如公式(1-1)所示：
$h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$　　(1-1)
其中，s=1、2、……n，n为大于1的自然数，$W_s$为$x_s$的权重，b为神经单元的偏置。f为神经单元的激活函数（activation functions），用于将非线性特性引入神经网络中，来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入，激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络，即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连，来提取局部接受域的特征，局部接受域可以是由若干个神经单元组成的区域。
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有多层中间层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,中间层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是中间层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。
虽然DNN看起来很复杂，但是就每一层的工作来说，其实并不复杂，简单来说就是如下线性关系表达式：$\vec{y}=\alpha(W\vec{x}+\vec{b})$。其中，$\vec{x}$是输入向量，$\vec{y}$是输出向量，$\vec{b}$是偏移向量，W是权重矩阵（也称系数），α()是激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。由于DNN层数多，系数W和偏移向量$\vec{b}$的数量也比较多。这些参数在DNN中的定义如下所述：以系数W为例：假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$，上标3代表系数W所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。综上，第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。需要注意的是，输入层是没有W参数的。在深度神经网络中，更多的中间层让网络更能够刻画现实世界中的复杂情形。理论上而言，参数越多的模型复杂度越高，“容量”也就越大，也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程，其最终目的是得到训练好的深度神经网络的所有层的权重矩阵（由很多层的向量W形成的权重矩阵）。
(3)卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)递归神经网络(recurrent neural networks,RNN)是用来处理序列数据的,也称为循环神经网络。在传统的神经网络模型中,是从输入层到中间层再到输出层,层与层之间是全连接的,而对于每一层层内之间的各个节点是无连接的。这种普通的神经网络虽然解决了很多难题,但是却仍然对很多问题无能无力。例如,你要预测句子的下一个单词是什么,一般需要用到前面的单词,因为一个句子中前后单词并不是独立的。RNN之所以称为循环神经网路,即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即中间层本层之间的节点不再无连接 而是有连接的,并且中间层的输入不仅包括输入层的输出还包括上一时刻中间层的输出。理论上,RNN能够对任何长度的序列数据进行处理。对于RNN的训练和对传统的CNN或DNN的训练一样。
既然已经有了卷积神经网络,为什么还要循环神经网络?原因很简单,在卷积神经网络中,有一个前提假设是:元素之间是相互独立的,输入与输出也是独立的,比如猫和狗。但现实世界中,很多元素都是相互连接的,比如股票随时间的变化,再比如一个人说了:我喜欢旅游,其中最喜欢的地方是云南,以后有机会一定要去。这里填空,人类应该都知道是填“云南”。因为人类会根据上下文的内容进行推断,但如何让机器做到这一步?RNN就应运而生了。RNN旨在让机器像人一样拥有记忆的能力。因此,RNN的输出就需要依赖当前的输入信息和历史的记忆信息。
(5)超分辨率
超分辨率(Super Resolution,SR)是一种图像增强技术,给定一张或一组低分辨率的图像,通过学习图像的先验知识、图像本身的相似性、多帧图像信息互补等手段恢复图像的高频细节信息,生成较高分辨率的目标图像。超分辨率在应用中,按照输入图像的数量,可分为单帧图像超分辨率和视频超分辨率。超分辨率在高清电视、观测设备、卫星图像和医学影像等领域有重要的应用价值。
(6)视频超分辨率
视频超分辨率(video super resolution,VSR)是一种针对视频进行处理的增强技术,其目的是将低分辨率的视频转化成高质量的高分辨率视频。按照输入的帧数,视频超分辨率可以分为多帧视频超分辨率和循环视频超分辨率。
示例性地,下面以卷积神经网络(convolutional neural network,CNN)为例。
CNN是一种带有卷积结构的深度神经网络。CNN是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元对输入其中的图像中的重叠区域作出响应。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,我们都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之 间的连接,同时又降低了过拟合的风险。
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的超分辨率模型中参数的大小,使得超分辨率模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的超分辨率模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的超分辨率模型的参数,例如权重矩阵。
如图2所示,卷积神经网络(CNN)100可以包括输入层110,卷积层/池化层120,其中池化层为可选的,以及神经网络层130。
如图2所示卷积层/池化层120可以包括如示例121-126层,在一种实现中,121层为卷积层,122层为池化层,123层为卷积层,124层为池化层,125为卷积层,126为池化层;在另一种实现方式中,121、122为卷积层,123为池化层,124、125为卷积层,126为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
以卷积层121为例,卷积层121可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义。在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关。需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用维度相同的多个权重矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵维度相同,经过该多个维度相同的权重矩阵提取后的特征图维度也相同,再将提取到的多个维度相同的特征图合并形成卷积运算的输出。
通常,权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以从输入图像中提取信息,从而帮助卷积神经网络100进行正确的预测。
当卷积神经网络100有多个卷积层时,初始的卷积层(例如121)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络100深度的加深,越往后的卷积层(例如126)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,即如图2中120所示例的121-126各层,可以是一层卷积层后面跟一层池化层,也可以是 多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像大小相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
神经网络层130:
在经过卷积层/池化层120的处理后,卷积神经网络100还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层120只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或别的相关信息),卷积神经网络100需要利用神经网络层130来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层130中可以包括多层隐含层(如图2所示的131、132至13n)以及输出层140。在本申请中,该卷积神经网络为:对选取的起点网络进行至少一次变形得到串行网络,然后根据训练后的串行网络得到。该卷积神经网络可以用于图像识别,图像分类,图像超分辨率重建等等。
在神经网络层130中的多层隐含层之后,也就是整个卷积神经网络100的最后层为输出层140,该输出层140具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络100的前向传播(如图2由110至140的传播为前向传播)完成,反向传播(如图2由140至110的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络100的损失及卷积神经网络100通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图2所示的卷积神经网络100仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,如图3所示的多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层130进行处理。
本申请提供的图像处理方法可以应用于视频直播、视频通话、相册管理、智慧城市、人机交互以及其他需要涉及到视频数据等的场景。
例如,本申请提供的图像处理方法可以应用于智慧城市场景中,如图4A所示,可以采集各个观测设备采集到的低画质视频数据,即低分辨率的视频数据,并在存储器中存储该低画质视频数据。在播放该视频数据时,可以通过本申请提供的图像处理方法对该视频数据进行超分辨率处理,从而得到分辨率更高的视频数据,提高用户的观看体验。
又例如,本申请提供的图像处理方法还应用于各种视频拍摄场景。如用户可以使用终端拍摄一段视频,为降低该视频所占用的存储量,可以对该视频进行压缩或者下采样处理,得到占用储存量更小的视频数据。当用户使用终端对该视频进行播放时,可以通过本申请提供的图像处理方法,对存储的视频数据进行超分辨率处理,从而得到分辨率更高的视频数据,提高用户的观看体验。
还例如,本申请提供的图像处理方法可以应用于视频直播场景,如图4B所示,服务器 可以向用户使用的客户端发送视频流。为减少直播过程中传输的带宽,可以对传输的视频流进行压缩。当客户端接收到服务器发送的数据流之后,可以通过本申请提供的图像处理方法对对该数据流进行超分辨率处理,从而得到分辨率更高的视频数据,提高用户的观看体验。
示例性地,本申请提供的图像处理方法的应用的系统架构可以如图5A所示。在该系统架构400中,服务器集群410由一个或多个服务器实现,可选的,与其它计算设备配合,例如:数据存储、路由器、负载均衡器等设备。服务器集群410可以使用数据存储系统250中的数据,或者调用数据存储系统250中的程序代码实现本申请提供的图像处理方法的步骤。
用户可以操作各自的用户设备(例如本地设备401和本地设备402)与服务器集群410进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与服务器集群410进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。具体地,该通信网络可以包括无线网络、有线网络或者无线网络与有线网络的组合等。该无线网络包括但不限于:第五代移动通信技术(5th-Generation,5G)系统,长期演进(long term evolution,LTE)系统、全球移动通信系统(global system for mobile communication,GSM)或码分多址(code division multiple access,CDMA)网络、宽带码分多址(wideband code division multiple access,WCDMA)网络、无线保真(wireless fidelity,WiFi)、蓝牙(bluetooth)、紫蜂协议(Zigbee)、射频识别技术(radio frequency identification,RFID)、远程(Long Range,Lora)无线通信、近距离无线通信(near field communication,NFC)中的任意一种或多种的组合。该有线网络可以包括光纤通信网络或同轴电缆组成的网络等。
示例性地,在一种应用场景中,服务器集群410中的任意一个服务器,可以从数据存储系统250,或者其他设备,如终端、PC等中获取到视频数据,若该视频数据是低分辨率视频,则服务器可以将该低分辨率视频通过通信网络发送至本地设备。若该视频数据为高分辨率视频,为降低传输该视频数据占用的带宽,服务器可以对该视频数据进行下采样,得到低分辨率视频,并将该低分辨率视频通过通信网络发送至本地设备。因此,本地设备在接收到该低分辨率视频之后,如图5B所示,可以对该低分辨率视频进行超分辨率处理,得到高分辨率视频。
在超分辨率任务中,深度神经网络凭借其强大的学习能力,迅速超越了基于传统手工特征的方法,取得了巨大的成功。基于深度神经网络的超分辨率方法能够生成更加清晰、更少伪影的高质量超分辨率图片,进一步推动了超分辨率技术的落地应用。例如,在流视频应用中,可以通过网络传输经过降采样的、分辨率较低的视频流,客户端接收后通过超分辨率技术将其转化为高分辨率的画面并播放,这样有效降低了网络带宽的需求;在视频观测中,由于观测相机安装位置和存储的限制,观测画面的分辨率通常比较低。超分辨率 技术可以将其转化为更清晰的版本,为后续的目标人脸识别、行人再识别等任务提供更丰富的细节信息。超分辨率技术也在旧电影高清化、医学图像等应用中得到了广泛应用。
得益于图像处理器(graphic processing unit,GPU)算力的不断提升以及深度卷积网络的快速发展,超分网络的效果得到了大幅度的提高,这进一步推动了超分辨率技术的应用。在效果提升的同时,超分网络也变得更加复杂,计算量也随之增大。这极大限制了超分技术在一些较低计算力设备,如手机、摄像头、智能家居等的应用。随着移动设备摄像头像素的逐渐增加,超分网络的计算量随着图像分辨率的增大而快速增加。
因此,为实现高效、准确地超分辨率处理,本申请提供了一种针对视频的图像处理方法,基于递归网络实现了轻量化的计算,使得视频的超分辨率处理能够达到实时运行。
下面对本申请提供的图像处理方法的流程进行说明。
参阅图6,本申请提供的一种图像处理方法的流程示意图,如下所述。
601、对第一图像进行分解,得到第一结构子图和第一细节子图。
其中,在步骤601之前,还可以获取视频数据,该视频数据可以是视频流,或者完整的视频的数据等。该视频数据中可以包括多帧图像,第一图像是其中的任意一帧图像。
以下提及的第二图像,是与第一图像相邻的一帧或者多帧图像,以下不再赘述。例如,第二图像可以是按照播放时序,排列在第一图像之前的一帧或者多帧图像。或者,若按照与视频的播放时序相反的时序来对视频进行处理,则第二图像可以是排列在第一图像之后的一帧或者多帧图像。
通常,结构信息是低频的图像分量,细节信息对应高频的图像分量。因此,在本步骤中,可以将第一图像中所包括的信息分为高频信息和低频信息,高频信息即组成第一细节子图,低频信息组成第一结构子图。
具体地,对第一图像进行分解的方式可以包括多种。示例性地,可以通过下采样结合上采样的方式对第一图像进行分解,也可以通过低通滤波的方式进行分解等,具体可以根据实际应用场景进行调整,此处不作限定。
例如,若采用下采样结合上采样的方式对第一图像进行分解的方式,具体步骤可以包括:对第一图像进行下采样,得到下采样图像;对下采样图像进行上采样,得到第一结构子图;从第一图像中去除第一结构子图,得到第一细节子图。在本实施方式中,可以通过对第一图像进行下采样的方式来获取第一图像中所包括的特征,然后通过上采样的方式,使第一结构子图的维度与第一图像的维度保持一致,并将第一图像减去通过上采样后得到的第一结构子图,从而得到第一图像的第一细节子图。
又例如,若采用低通滤波的方式对第一图像进行分解的方式,具体步骤可以包括:增加低通滤波器,筛选出第一图像中低频部分,得到第一结构子图,然后在第一图像的基础上减去该第一结构子图,即可得到第一细节子图。当然,也可以通过高通滤波的方式筛选出第一图像中的高频部分,得到第一细节子图,然后在第一图像的基础上去除该第一细节子图,得到第一结构子图。
602、对第一隐状态信息和第一结构子图进行融合,得到第二结构子图,以及对第一隐状态信息和所述第一细节子图进行拼接,得到第二细节子图。
其中,第一隐状态信息包括了从第二图像中提取到的特征。该第一隐状态信息也可以理解为由第二结构子图的特征组成的图像,其维度和第一图像相同。
具体地,可以融合第一隐状态信息和第一结构子图,从而得到参考了第二图像的特征的第二结构子图,融合第一隐状态信息和第一细节子图,从而得到参考了第二图像的特征的第二细节子图。
为便于理解,隐状态信息可以理解为是网络生成的特征图,包含从过去的帧提取的特征,是存储的历史信息。在超分辨率的处理过程中,隐状态提供历史信息,与当前输入帧的特征进行时间-空间层面的融合,能够获得更丰富的特征表达,从而提升当前帧的超分效果。同时,隐状态信息的存在有利于输出更稳定的结果,有效减少视频的抖动,提升画面观感。
通常,由于隐状态信息存储的是历史信息,每处理完一帧之后就可能往隐状态信息中增加新的历史信息,这就导致隐状态信息中往往存在大量冗余(如过时或者无用)的信息。而随着递归处理的帧数的增加,这些冗余的信息会逐渐占据隐状态信息的大部分内容。因此,可选地,为了提高隐状态信息的有效利用率,可以对第一隐状态进行适应性过滤,从而滤除第一隐状态信息中冗余的信息。
具体的过滤过程可以包括,首先,获取第一隐状态信息和第一图像的相似度矩阵,相似度矩阵由一个或者多个相似度组成,该一个或者多个相似度用于表示第一隐状态信息所包括的图像区域和第一图像中对应的图像区域之间的相似程度,每个图像区域可以包括一个或者多个像素点。然后,根据相似度矩阵对第一隐状态信息进行过滤,得到第二隐状态信息,第二隐状态信息中每个图像区域与第一图像中的图像区域的相似度,高于第一隐状态信息中每个图像区域与第一图像中的图像区域的相似度。相应地,步骤602可以包括:使用第二隐状态信息分别对第一结构子图和第一细节子图进行拼接,得到第二结构子图和第二细节子图。
因此,在本申请实施方式中,可以通过相似度矩阵过滤掉第一隐状态中,与第一图像不相似的信息,从而得到与第一图像更相似、关联度更高的第二应状态信息。从而可以使使用第二隐状态信息进行融合得到的第二结构子图和第二细节子图的结构和细节更丰富,进而使后续得到的输出图像更清晰,分辨率更高。
603、基于第二结构子图和第二细节子图中的进行特征提取,得到结构特征和细节特征。
其中,在得到第二结构子图和第二细节子图之后,基于该第二结构子图和第二细节子图进行特征提取,从而得到结构特征和细节特征。
具体地,可以分别从第二结构子图和第二细节子图中提取特征,如从第二结构子图中提取特征,得到结构特征,从第二细节子图中提取特征,得到细节特征。
还可以结合第二结构子图和第二细节子图进行特征提取,得到结构特征和细节特征。例如,可以对第二结构子图和第二细节子图进行至少一次迭代融合,得到更新后的第二结构子图和更新后的第二细节子图,然后从更新后的第二结构子图中提取特征,得到结构特征,从更新后的第二细节子图中提取特征,得到细节特征。因此,在本申请实施方式中,可以通过融合结构子图和细节子图的方式,使结构子图和细节子图可以互相丰富各自包括 的信息,从而使最终得到的结构特征和细节特征更准确。
若可以对第二结构子图和第二细节子图进行至少一次迭代融合,得到更新后的第二结构子图和更新后的第二细节子图,进一步地,对第二结构子图和第二细节子图进行的任意一次融合的过程可以包括:对上一次迭代得到的第二结构子图和上一次迭代得到的第二细节子图进行融合,得到当前次迭代的第一融合图像;对第一融合图像和上一次迭代得到的第二结构子图进行融合,得到当前次迭代的第二结构子图;对第一融合图像和上一次迭代得到的第二细节子图进行融合,得到当前次迭代的第二细节子图。
可以理解为,在提取结构特征和细节特征之前,第二结构子图和第二细节子图进行了至少一次交互,从而交互各自所包括的信息,使最终得到的更新后的第二结构子图和第二细节子图所包括的信息更丰富,进而使最终得到的输出图像所包括的信息更丰富。
604、根据结构特征和细节特征,得到输出图像。
其中,在得到结构特征和细节特征之后,可以对该结构特征和细节特征进行融合,得到结构和细节都丰富的输出图像。
具体地,在融合了结构特征和细节特征,得到第二融合图像之后,可以对该第二融合图像进行放大处理,从而得到分辨率更高的输出图像。
因此,在本申请实施方式中,在进行视频数据的超分辨率处理的过程中,分解了结构分支和细节分支进行处理,并使用隐状态信息对结构和细节进行了进一步丰富,使最终得到的输出图像的结构和细节更丰富。无需缓存多帧对中间帧进行处理,可以高效地得到当前帧的高分辨率图像。
605、更新第一隐状态信息。
其中,步骤605为可选步骤。
在得到结构特征和细节特征之后,可以基于该结构特征和细节特征更新第一隐状态信息,从而在对下一帧进行超分辨率处理时,可以基于更新的隐状态信息,来丰富下一帧图像的结构和细节,从而得到清晰的高分辨率图像。
具体地,可以将该第一隐状态信息替换为结构特征和细节特征融合后的信息,也可以将结构特征和细节特征融合后的信息,再与原有的第一隐状态信息进行融合,得到更新后的第一隐状态信息。
因此,在本申请实施方式中,在得到每一帧输入图像的高分辨率图像之后,可以更新第一隐状态信息,以使在对下一帧进行超分辨率处理时,可以使用更新、关联度更高的隐状态信息来丰富图像的结构和细节,从而使最终得到的图像更清晰。
前述对本申请提供的图像处理方法的流程进行介绍,下面进一步地,基于前述的流程,对本申请提供的图像处理方法进行更详细的展开介绍。
示例性地,参阅图7,本申请提供的另一种图像处理方法的流程示意图。
首先,从视频中选择一帧作为输入图像701,即前述的第一图像,对该输入图像701进行分解,得到第一结构子图702和第一细节子图703。
分解方式例如,对输入图像701进行下采样,得到下采样图像。然后对该下采样图像进行上采样,得到第一结构子图702。从输入图像701中去除第一结构子图,即可得到第 一细节子图703。具体例如,可以将输入图像中的每四个像素点取平均值或者中位数,合并为一个像素点,得到下采样图像,然后将下采样图像进行插值处理的四个像素点,得到上采样图像,上采样图像即第一结构子图,其维度与输入图像相同。然后将输入图像中的每个像素点的值减去第一结构子图中每个像素点的值,即可得到第一细节子图。此处的像素值可以包括灰度值、亮度值、RGB每个通道的值等,具体可以根据实际应用场景进行调整。
然后使用第一隐状态信息704分别对第一结构子图702和第一细节子图703进行拼接,得到第二结构子图705和第二细节子图706。
例如,若第一结构子图702包括了3个通道,第一隐状态信息包括了3个通道,则拼接第一结构子图和第一隐状态信息可以得到包括了6个通道的第二结构子图。或者,又例如,若第一结构子图702包括了3个通道,第一隐状态信息包括了3个通道,可以在第一结构子图702中的每个通道中,增加第一隐状态信息包括的3个通道的值,最终得到的第二结构子图包括了3个通道,但每个通道的值变大。得到第二细节子图的方式与得到第一细节子图的方式类似。
随后,可以使用特征提取网络707从第二结构子图705中提取特征得到结构特征708,以及从第二细节子图706中提取特征得到细节特征709。该特征提取网络可以包括一个或者多个卷积核,例如,该特征提取网络可以参阅前述的卷积神经网络,本申请对此不作限定。通常,为实现轻量化的超分辨率处理网络,可以使用包括了较少卷积核的特征提取网络来进行特征提取,当然,为了使最终的输出网络更清晰,也可以使用包括了较多卷积核的特征提取网络来进行特征提取。
在得到结构特征708和细节特征709之后,可以对该结构特征708和细节特征709进行融合,并进行放大处理,得到最终的输出图像710。
此外,在得到结构特征708和细节特征709之后,还可以使用该结构特征708和细节特征709更新第一隐状态信息,以在对下一帧进行超分辨率处理时,可以使用更新后的第一隐状态信息进行处理,从而使最终得到的输出图像的结构和细节更丰富,提高用户体验。
例如,前述图7提供的架构可以应用于如图8所示的场景中,用户可以通过手机、电视或者PC等播放服务器发送的图像,按照播放顺序,分辨包括的图像帧包括:I_t-1、I_t、I_t+1、I_t+2…,可以对每一帧图像进行超分辨率处理,从而得到高分辨率的图像,提高用户的观看体验。
下面对前述图7所示的架构进行进一步展开描述。参阅图9,其中,701-706、708-710与前述图7中所示的类似,下面对不同之处进行说明。
其中,图9与前述图7的区别可以包括:对第一隐状态信息进行了过滤,过滤后得到的第二隐状态信息与输入图像701的关联度更高,后续可以使用该第二隐状态信息分别与第一结构子图702和第一细节子图703进行拼接,从而使拼接后得到的第二结构子图705和第二细节子图706所包括的信息更丰富,最终得到的输出图像更清晰。
示例性地,对第一隐状态信息进行过滤的具体过程可以参阅图10。
可以基于输入图像701的特征计算其与第一隐状态信息704的相似度,生成相似度矩 阵1001。例如,可以将输入图像划分为多个图像区域,每个图像区域包括一个或者多个像素点,相应地,将第一隐状态信息按照相同的划分方式划分为多个图像区域,每个图像区域包括一个或者多个像素点。例如,可以对输入图像中的每个图像区域中的像素点分布规律,与第一隐状态中的每个图像区域中的像素点的分布规律进行匹配,从而计算输入图像中的每个图像区域和第一隐状态信息中对应的图像区域之间的相似度,从而得到相似度矩阵。
在得到相似度矩阵1001之后,基于该相似度矩阵1001对第一隐状态信息进行过滤,滤除第一隐状态信息中与输入图像相似度较低(如低于预设相似度)的图像区域,从而第二隐状态信息902。第二隐状态信息中包括了与输入图像的相似度较高(如不低于预设相似度)的图像区域。
示例性地,如图11所示,以一个应用场景为例对隐状态信息的过滤过程进行示例性说明。其中,相似性计算部分首先基于一层卷积层对输入图进行初步的特征提取,生成H×W×k 2的特征图;对该特征图的每一个位置(x,y),提取出1×1×k 2的特征,展开成k×k的特征图。基于这个k×k的特征图,构建卷积核,对隐状态(H×W×C)矩阵(即第一隐状态信息)中(x,y)位置对应的1×1×C特征进行卷积,生成1×1×C的相似性结果输入到相似性矩阵对应的(x,y)位置。对所有的(x,y)位置均执行这样的卷积操作之后,得到一个维度与隐状态矩阵一致(H×W×C)的相似性矩阵。过滤器部分则首先利用sigmoid函数将相似性矩阵归一化到[0,1]之间,之后将相似性矩阵与隐状态进行一一对应的相乘,得到最后过滤的隐状态,即第二隐状态信息。
此外,图9与前述图7的区别还可以包括:特征提取网络中可以包括N个结构细节(SD)模块,N为正整数,如图9中所示的901-90N。每个SD模块用于融合结构子图和细节子图,从而丰富结构子图和细节子图所包括的信息。
示例性地,如图12所示,以其中一个SD_n为例,该SD_n可以是N个SD模块中的任意一个。其中,该SD_n模块的输入为SD_n-1模块输出的第二结构子图1201和第二细节子图1202,可以融合第二结构子图1201和第二细节子图1202,得到第二融合图像。
然后将第二融合图像和第二结构子图1201进行融合,从而使更新后的第二结构子图1203可以保留更新前的第二结构子图所包括的信息,在此基础上还融合了第二细节子图所包括的信息。以及,将第二融合图像和第二细节子图1202进行融合,从而使更新后的第二细节子图1204可以保留更新前的第二细节子图所包括的信息,在此基础上还融合了第二结构子图所包括的信息。
随后将SD_n输出的更新后的第二结构子图和更新后的第二细节子图输入至下一个SD模块,即SD_n+1模块。
此外,融合结构特征的流程可以参阅图13。在得到结构特征和细节特征之后,使用3*3卷积分别对结构特征和细节特征进行处理,得到更稳定的结构特征和细节特征。然后对卷积处理后的结构特征和细节特征进行拼接,并对拼接后的图像进行3*3卷积处理,即可得到第二融合图像。然后对第二融合图像进行像素重组(pixel shuffle)处理,从而得到放大后的输出图像。例如,输入图像的分辨率可以是4*4*3,拼接得到的第二融合图像的分 辨率为4*4*12,对该第二融合图像进行pixel shuffle处理,从而得到8*8*3的输出图像,由此可见,输出图像的分辨率是高于输入图像的,得到了高分辨率的图像。
另外,示例性地,更新第一隐状态信息的步骤可以参阅图14。在得到结构特征和细节特征之后,对结构特征和细节特征进行融合,并对融合后的图像进行3*3卷积以及ReLU处理,从而得到更新后的第一隐状态信息。
为便于理解,可以将前述图9中的超分辨率处理流程表示为如图15所示的超分辨率处理流程。
其中,在每次完成了拼接或者融合之后,可以增加3*3卷积或者3*3卷积与线性修正单元(rectified linear unit,ReLU),从而使融合或者拼接之后的图像所包括的特征更有效。
更进一步地,为便于理解,参阅图16,以其中一帧图像为例,为本申请提供的图像处理的流程进行示例性说明。
首先,采集到视频数据中的其中一帧图像作为输入图像之后,对该输入图像进行分解,并融合过滤后的隐状态信息,得到第一结构子图和第一细节子图。然后将该第二结构子图和第二细节子图输入至特征提取网络中,由一个或者多个SD模块对结构子图和细节子图进行交互,从而提取到结构特征和细节特征,并对该结构特征和细节特征进行融合,得到输出图像。
因此,本申请提供的图像处理方法,提供了基于结构-细节双分支递归神经网络的视频超分辨率处理方法,并在网络中显式地将结构(低频)和细节(高频)信息分离并采用两个分支进行处理,这种显式双分支的结构能够有效增丰富输出图像中所包括的信息的,提升视频超分的效果。并且,提出了递归神经网络中对隐状态进行适应性过滤的步骤,通过计算当前输入与隐状态之间的相似性,并基于相似性对隐状态进行过滤,剔除过时的信息,减少错误累积,提升隐状态信息的利用效率。
下面示例性地,对本申请提供的图像处理方法所实现的效果进行介绍。
示例性地,在Vimeo-90K数据集上训练视频超分辨率模型,即执行本申请前述图6-图16的方法的网络,在VID4、Vimeo-90K-T、SPMCS、UDM10等常用的视频超分数据集数据集上进行测试,展示本申请提出的图像处理方法对低清视频的处理效果。为了进一步验证本申请提供的方法的有效性,将同时提供当前业界和学界性能最好的视频超分辨方法在同一场景的结果作为横向比较。
Vimeo-90K数据集是视频超分辨率任务中常用的数据集之一,包含了大约90k个视频片段。该数据集为从某社交网站上采集而来,覆盖了日常生活的各种场景,同时还有大量的电影片段。由于其巨大的样本量、多样的场景、较大的运动,是一个具有挑战性的视频数据集,在视频处理任务中得到了广泛应用。Vimeo-90K数据集可被划分为训练集和测试集。对于其测试集,本申请使用Vimeo-90K-T表示。
基于本申请提供的方法,在PyTorch平台上构建了一个网络模型。为了评价输出结果的质量,以原始的高分辨率真值(Ground Truth,GT)作为标准,分别计算每一帧的峰值信噪比(peak signal-to-noise ratio,PSNR)和结构相似性评价(structural similarity  index measurement,SSIM),最后计算整个测试集的平均PSNR和平均SSIM。
表1展示了不同方法在Vid4测试集上的测试结果。Vid4测试集包括日历(Calendar)、城市(City)、植物(Foliage)和步行(Walk)等充满大量高频细节的视频,是视频处理领域常用的测试高频细节处理能力的测试集之一。
示例性地,选择了几种常用的图像处理方法与本申请提供的图像处理方法的输出结果进行对比,如Bicubic、SPMC(subpixel motion compensation)、Liu(Robust Video Super-resolution With Learned Temporal Dynamics)、TOFlow(task-oriented flow)、DUF(Dynamic Up sampling Filters)-52L、RBPN(recurrent back-projection network)、EDVR(Video Restoration with enhanced deformable convolutional networks)-L、PFNL(Progressive fusion video super resolution network via exploiting non-local spatio-temporal correlations)、FRVSR(frame recurrent video super resolution)、RLSP(efficient video super resolution through recurrent latent space propagation)。从表1可以看出,本申请提供的方法(表示为RSDN)凭借远小于其他方法的计算量(~0.13T Flops)实现了最高的PSNR和SSIM指标。而EDVR-L的计算量(通过Flops衡量)(0.93T)是本发明(0.13T)的7倍以上。这些结果体现了本发明能够更高效利用视频的时空信息,在更小的计算量下实现更好的视频超分效果。
表1
为了进一步验证本申请提供的图像处理方法在恢复高频细节上的优越性,在多个测试集上与当前主流方法进行了横向比较,包括SPMCS、UDM10和Vimeo-90K-T等数据集。横向比较的结果如表2所示。结果表明本审请提供的方法在多个测试集上显著地超过了现有的方法,取得了最好结果。这表明了本申请提供的方法在恢复高频细节上的优越性。
表2
从表2可以进一步看出,本方法在三个数据集上单帧处理的时间分别为18ms、24ms和12ms,均超过20帧/秒,达到了实时运行,体现了本申请提供的图像处理方法的高效率。
最后,选取了多个测试集的部分图像帧进行可视化,进一步从细节上比较不同方法的表现。如图17从输出的高分辨率结果上展示了本申请提供的方法在视频超分辨率上的领先效果,可以得到更高清的图像。
前述对本申请提供的图像处理方法进行了详细介绍,下面介绍本申请提供的装置。
参阅图18,本申请提供的一种图像处理装置的结构示意图。该图像处理装置可以包括:
分解单元1801,用于对第一图像进行分解,得到第一结构子图和第一细节子图,第一图像为视频数据中的除第一帧外的任意一帧图像,且第一频率低于第二频率,第一频率为第一结构子图所包括的信息的频率,第二频率为第一细节子图所包括的信息的频率;
融合单元1802,用于对第一隐状态信息和第一结构子图进行融合,得到第二结构子图,以及对第一隐状态信息和第一细节子图进行拼接,得到第二细节子图,第一隐状态信息包括从第二图像中提取到的特征,第二图像包括视频数据与第一图像相邻的至少一帧图像;
特征提取单元1803,用于基于第二结构子图和第二细节子图进行特征提取,得到结构特征和细节特征;
输出单元1804,用于根据结构特征和细节特征,得到输出图像,输出图像的分辨率高于第一图像。
在一种可能的实施方式中,融合单元1802,具体用于:获取第一隐状态信息和第一图像的相似度矩阵,相似度矩阵中包括至少一个相似度,至少一个相似度用于表示第一隐状态信息所包括的图像区域和第一图像中的图像区域之间的相似程度;根据相似度矩阵对第一隐状态信息进行过滤,得到第二隐状态信息,第二隐状态信息中每个图像区域与第一图像中对应的图像区域的相似程度,高于第一隐状态信息中每个图像区域与第一图像中的图像区域的相似程度;使用第二隐状态信息对第一结构子图进行拼接,得到第二结构子图,使用第二隐状态信息对第一细节子图进行拼接,得到第二细节子图。
在一种可能的实施方式中,特征提取单元1803,用于:对第二结构子图和第二细节子 图进行至少一次迭代融合,得到更新后的第二结构子图和更新后的第二细节子图;从更新后的第二结构子图中提取特征,得到结构特征,从更新后的第二细节子图中提取特征,得到细节特征。
在一种可能的实施方式中,任意一次迭代融合过程可以包括:对上一次迭代得到的第二结构子图和上一次迭代得到的第二细节子图进行融合,得到当前次迭代的第一融合图像;对第一融合图像和上一次迭代得到的第二结构子图进行融合,得到当前次迭代的第二结构子图;对第一融合图像和上一次迭代得到的第二细节子图进行融合,得到当前次迭代的第二细节子图。
在一种可能的实施方式中,输出单元1804,具体用于:融合结构特征和细节特征,得到第二融合图像;对第二融合图像进行放大处理,得到输出图像,输出图像的分辨率高于第二融合图像的分辨率。
在一种可能的实施方式中,该图像处理装置还可以包括:更新单元1805,用于根据结构特征和细节特征更新第一隐状态信息,第一隐状态信息用于对视频数据中排列在第一图像的下一帧图像进行处理。
在一种可能的实施方式中,分解单元1801,具体用于:对第一图像进行下采样,得到下采样图像;对下采样图像进行上采样,得到第一结构子图;从第一图像中去除第一结构子图,得到第一细节子图。
请参阅图19,本申请提供的另一种图像处理装置的结构示意图,如下所述。
该图像处理装置可以包括处理器1901和存储器1902。该处理器1901和存储器1902通过线路互联。其中,存储器1902中存储有程序指令和数据。
存储器1902中存储了前述图6-图16中的步骤对应的程序指令以及数据。
处理器1901用于执行前述图6-图16中任一实施例所示的图像处理装置执行的方法步骤。
可选地,该图像处理装置还可以包括收发器1903,用于接收或者发送数据。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于生成车辆行驶速度的程序,当其在计算机上行驶时,使得计算机执行如前述图6-图16所示实施例描述的方法中的步骤。
可选地,前述的图19中所示的图像处理装置为芯片。
本申请实施例还提供了一种图像处理装置,该图像处理装置也可以称为数字处理芯片或者芯片,芯片包括处理单元和通信接口,处理单元通过通信接口获取程序指令,程序指令被处理单元执行,处理单元用于执行前述图6-图16中任一实施例所示的图像处理装置执行的方法步骤。
本申请实施例还提供一种数字处理芯片。该数字处理芯片中集成了用于实现上述处理器1901,或者处理器1901的功能的电路和一个或者多个接口。当该数字处理芯片中集成了存储器时,该数字处理芯片可以完成前述实施例中的任一个或多个实施例的方法步骤。当该数字处理芯片中未集成存储器时,可以通过通信接口与外置的存储器连接。该数字处理芯片根据外置的存储器中存储的程序代码来实现上述实施例中图像处理装置执行的动作。
An embodiment of this application further provides a computer program product which, when run on a computer, causes the computer to perform the steps performed by the image processing apparatus in the methods described in the embodiments shown in FIG. 6 to FIG. 16 above.
The image processing apparatus provided in the embodiments of this application may be a chip. The chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute computer-executable instructions stored in a storage unit, so that the chip in the server performs the image processing method described in the embodiments shown in FIG. 6 to FIG. 16 above. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may alternatively be a storage unit located outside the chip in the radio access device, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).
Specifically, the foregoing processing unit or processor may be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
By way of example, referring to FIG. 20, FIG. 20 is a schematic structural diagram of a chip provided in an embodiment of this application. The chip may be embodied as a neural-network processing unit NPU 200. The NPU 200 is mounted on a host CPU as a coprocessor, and the host CPU allocates tasks. The core part of the NPU is an operation circuit 2003, and a controller 2004 controls the operation circuit 2003 to extract matrix data from a memory and perform multiplication operations.
In some implementations, the operation circuit 2003 internally includes multiple processing engines (PEs). In some implementations, the operation circuit 2003 is a two-dimensional systolic array. The operation circuit 2003 may alternatively be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 2003 is a general-purpose matrix processor.
For example, assume that there are an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to matrix B from a weight memory 2002 and buffers it on each PE in the operation circuit. The operation circuit fetches the data of matrix A from an input memory 2001, performs a matrix operation with matrix B, and stores partial results or final results of the obtained matrix in an accumulator 2008.
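As a purely illustrative software analogy of this dataflow (weights buffered per PE, inputs streamed in, partial sums accumulated), the following tiled matrix multiplication mimics the accumulator behaviour; it is a conceptual model, not a description of the actual circuit, and the tile size is an arbitrary assumption.

```python
# Conceptual model of the accumulate-as-you-go matrix multiplication C = A @ B
# described above (illustrative only; not the hardware implementation).
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 16) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.float32)   # plays the role of the accumulator 2008
    for k0 in range(0, K, tile):              # stream one slice of A and B at a time
        a_tile = A[:, k0:k0 + tile]           # data fetched from the input memory
        b_tile = B[k0:k0 + tile, :]           # weights buffered for the PEs
        C += a_tile @ b_tile                  # partial results accumulated into C
    return C

# The final content of C equals np.matmul(A, B) up to floating-point rounding.
```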
A unified memory 2006 is used to store input data and output data. The weight data is directly transferred to the weight memory 2002 through a direct memory access controller (DMAC) 2005. The input data is also transferred to the unified memory 2006 through the DMAC.
A bus interface unit (BIU) 2010 is used for interaction between the AXI bus and the DMAC and an instruction fetch buffer (IFB) 2009.
The bus interface unit 2010 is used for the instruction fetch buffer 2009 to obtain instructions from an external memory, and is further used for the direct memory access controller 2005 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 2006, to transfer weight data to the weight memory 2002, or to transfer input data to the input memory 2001.
A vector computation unit 2007 includes multiple operation processing units and, where necessary, performs further processing on the output of the operation circuit, such as vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. The vector computation unit is mainly used for network computations at layers other than convolutional/fully connected layers in a neural network, such as batch normalization, pixel-wise summation, and upsampling of feature planes.
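For illustration only, the kinds of element-wise operations listed above can be expressed in software as follows; this is an analogy of what the vector computation unit performs, not its implementation, and the tensor shapes are arbitrary examples.

```python
# Software analogy of typical vector-unit operations (illustrative only).
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)     # example output of the operation circuit
y = torch.randn(1, 64, 32, 32)

pixel_sum = x + y                                                   # pixel-wise summation
normed = F.batch_norm(x, running_mean=torch.zeros(64),
                      running_var=torch.ones(64), training=False)   # batch normalization (inference form)
upsampled = F.interpolate(x, scale_factor=2, mode="nearest")        # upsampling a feature plane
activated = torch.relu(x)                                           # activation applied to accumulated values
```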
In some implementations, the vector computation unit 2007 can store the processed output vector in the unified memory 2006. For example, the vector computation unit 2007 can apply a linear function and/or a non-linear function to the output of the operation circuit 2003, for example performing linear interpolation on a feature plane extracted by a convolutional layer, or, for another example, to a vector of accumulated values, so as to generate activation values. In some implementations, the vector computation unit 2007 generates normalized values, pixel-wise summed values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 2003, for example for use in a subsequent layer of the neural network.
An instruction fetch buffer 2009 connected to the controller 2004 is used to store instructions used by the controller 2004.
The unified memory 2006, the input memory 2001, the weight memory 2002, and the instruction fetch buffer 2009 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The operations of the layers in the recurrent neural network may be performed by the operation circuit 2003 or the vector computation unit 2007.
The processor mentioned anywhere above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the methods in FIG. 6 to FIG. 16 above.
In addition, it should be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the drawings of the apparatus embodiments provided in this application, the connection relationships between modules indicate that they have communication connections with each other, which may be specifically implemented as one or more communication buses or signal lines.
From the description of the foregoing implementations, a person skilled in the art can clearly understand that this application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structure used to implement the same function can take many forms, such as an analog circuit, a digital circuit, or a dedicated circuit. However, for this application, a software program implementation is, in more cases, the better implementation. Based on such an understanding, the technical solutions of this application, in essence or the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc of a computer, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of this application.
In the foregoing embodiments, the implementation may be entirely or partly realized by software, hardware, firmware, or any combination thereof. When software is used for implementation, the implementation may be entirely or partly in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, via coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, via infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
The terms "first", "second", "third", "fourth", and the like (if any) in the specification, the claims, and the above drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable in appropriate circumstances, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "include" and "have" and any variants thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.
Finally, it should be noted that the foregoing is merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (17)

  1. An image processing method, comprising:
    decomposing a first image to obtain a first structural sub-image and a first detail sub-image, wherein the first image is any frame of image other than the first frame in video data, a first frequency is lower than a second frequency, the first frequency is the frequency of the information comprised in the first structural sub-image, and the second frequency is the frequency of the information comprised in the first detail sub-image;
    fusing first hidden state information with the first structural sub-image to obtain a second structural sub-image, and concatenating the first hidden state information with the first detail sub-image to obtain a second detail sub-image, wherein the first hidden state information comprises features extracted from a second image, and the second image comprises at least one frame of image in the video data that is adjacent to the first image;
    performing feature extraction based on the second structural sub-image and the second detail sub-image to obtain a structural feature and a detail feature; and
    obtaining an output image according to the structural feature and the detail feature, wherein the resolution of the output image is higher than the resolution of the first image.
  2. The method according to claim 1, wherein the fusing first hidden state information with the first structural sub-image to obtain a second structural sub-image, and concatenating the first hidden state information with the first detail sub-image to obtain a second detail sub-image comprises:
    obtaining a similarity matrix of the first hidden state information and the first image, wherein the similarity matrix comprises at least one similarity, and the at least one similarity is used to represent a degree of similarity between an image region comprised in the first hidden state information and an image region in the first image;
    filtering the first hidden state information according to the similarity matrix to obtain second hidden state information, wherein a degree of similarity between each image region in the second hidden state information and a corresponding image region in the first image is higher than a degree of similarity between each image region in the first hidden state information and the image region in the first image; and
    concatenating the second hidden state information with the first structural sub-image to obtain the second structural sub-image, and concatenating the second hidden state information with the first detail sub-image to obtain the second detail sub-image.
  3. The method according to claim 1 or 2, wherein the performing feature extraction based on the second structural sub-image and the second detail sub-image to obtain a structural feature and a detail feature comprises:
    performing at least one iteration of fusion on the second structural sub-image and the second detail sub-image to obtain an updated second structural sub-image and an updated second detail sub-image; and
    extracting features from the updated second structural sub-image to obtain the structural feature, and extracting features from the updated second detail sub-image to obtain the detail feature.
  4. The method according to claim 3, wherein any one iteration of the fusion process comprises:
    fusing the second structural sub-image obtained in the previous iteration with the second detail sub-image obtained in the previous iteration to obtain a first fused image of the current iteration;
    fusing the first fused image with the second structural sub-image obtained in the previous iteration to obtain the second structural sub-image of the current iteration; and
    fusing the first fused image with the second detail sub-image obtained in the previous iteration to obtain the second detail sub-image of the current iteration.
  5. The method according to any one of claims 1 to 4, wherein the obtaining an output image according to the structural feature and the detail feature comprises:
    fusing the structural feature and the detail feature to obtain a second fused image; and
    performing enlargement processing on the second fused image to obtain the output image, wherein the resolution of the output image is higher than the resolution of the second fused image.
  6. The method according to any one of claims 1 to 5, wherein after the extracting features from the second structural sub-image to obtain the structural feature and the extracting features from the second detail sub-image to obtain the detail feature, the method further comprises:
    updating the first hidden state information according to the structural feature and the detail feature, wherein the first hidden state information is used to process the next frame of image that follows the first image in the video data.
  7. The method according to any one of claims 1 to 6, wherein the decomposing a first image comprises:
    downsampling the first image to obtain a downsampled image;
    upsampling the downsampled image to obtain the first structural sub-image; and
    removing the first structural sub-image from the first image to obtain the first detail sub-image.
  8. An image processing apparatus, comprising:
    a decomposition unit, configured to decompose a first image to obtain a first structural sub-image and a first detail sub-image, wherein the first image is any frame of image other than the first frame in video data, a first frequency is lower than a second frequency, the first frequency is the frequency of the information comprised in the first structural sub-image, and the second frequency is the frequency of the information comprised in the first detail sub-image;
    a fusion unit, configured to fuse first hidden state information with the first structural sub-image to obtain a second structural sub-image, and to concatenate the first hidden state information with the first detail sub-image to obtain a second detail sub-image, wherein the first hidden state information comprises features extracted from a second image, and the second image comprises at least one frame of image in the video data that is adjacent to the first image;
    a feature extraction unit, configured to perform feature extraction based on the second structural sub-image and the second detail sub-image to obtain a structural feature and a detail feature; and
    an output unit, configured to obtain an output image according to the structural feature and the detail feature, wherein the resolution of the output image is higher than the resolution of the first image.
  9. The apparatus according to claim 8, wherein the fusion unit is configured to:
    obtain a similarity matrix of the first hidden state information and the first image, wherein the similarity matrix comprises at least one similarity, and the at least one similarity is used to represent a degree of similarity between an image region comprised in the first hidden state information and an image region in the first image;
    filter the first hidden state information according to the similarity matrix to obtain second hidden state information, wherein a degree of similarity between each image region in the second hidden state information and a corresponding image region in the first image is higher than a degree of similarity between each image region in the first hidden state information and the image region in the first image; and
    concatenate the second hidden state information with the first structural sub-image to obtain the second structural sub-image, and concatenate the second hidden state information with the first detail sub-image to obtain the second detail sub-image.
  10. The apparatus according to claim 8 or 9, wherein the feature extraction unit is configured to:
    perform at least one iteration of fusion on the second structural sub-image and the second detail sub-image to obtain an updated second structural sub-image and an updated second detail sub-image; and
    extract features from the updated second structural sub-image to obtain the structural feature, and extract features from the updated second detail sub-image to obtain the detail feature.
  11. The apparatus according to claim 10, wherein any one iteration of the fusion process comprises:
    fusing the second structural sub-image obtained in the previous iteration with the second detail sub-image obtained in the previous iteration to obtain a first fused image of the current iteration;
    fusing the first fused image with the second structural sub-image obtained in the previous iteration to obtain the second structural sub-image of the current iteration; and
    fusing the first fused image with the second detail sub-image obtained in the previous iteration to obtain the second detail sub-image of the current iteration.
  12. The apparatus according to any one of claims 8 to 11, wherein the output unit is configured to:
    fuse the structural feature and the detail feature to obtain a second fused image; and
    perform enlargement processing on the second fused image to obtain the output image, wherein the resolution of the output image is higher than the resolution of the second fused image.
  13. The apparatus according to any one of claims 8 to 12, wherein the apparatus further comprises:
    an updating unit, configured to update the first hidden state information according to the structural feature and the detail feature, wherein the first hidden state information is used to process the next frame of image that follows the first image in the video data.
  14. The apparatus according to any one of claims 8 to 13, wherein the decomposition unit is configured to:
    downsample the first image to obtain a downsampled image;
    upsample the downsampled image to obtain the first structural sub-image; and
    remove the first structural sub-image from the first image to obtain the first detail sub-image.
  15. An image processing apparatus, comprising a processor, wherein the processor is coupled to a memory, the memory stores a program, and the method according to any one of claims 1 to 7 is implemented when the program instructions stored in the memory are executed by the processor.
  16. A computer-readable storage medium, comprising a program, which, when executed by a processing unit, performs the method according to any one of claims 1 to 7.
  17. An image processing apparatus, comprising a processing unit and a communication interface, wherein the processing unit obtains program instructions through the communication interface, and the method according to any one of claims 1 to 7 is implemented when the program instructions are executed by the processing unit.
PCT/CN2021/106380 2020-07-31 2021-07-15 Image processing method and apparatus WO2022022288A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21850081.7A EP4181052A4 (en) 2020-07-31 2021-07-15 IMAGE PROCESSING METHOD AND DEVICE
US18/161,123 US20230177646A1 (en) 2020-07-31 2023-01-30 Image processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010762144.6 2020-07-31
CN202010762144.6A CN112070664B (zh) 2020-07-31 2020-07-31 一种图像处理方法以及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/161,123 Continuation US20230177646A1 (en) 2020-07-31 2023-01-30 Image processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2022022288A1 true WO2022022288A1 (zh) 2022-02-03

Family

ID=73657157

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/106380 WO2022022288A1 (zh) 2020-07-31 2021-07-15 一种图像处理方法以及装置

Country Status (4)

Country Link
US (1) US20230177646A1 (zh)
EP (1) EP4181052A4 (zh)
CN (1) CN112070664B (zh)
WO (1) WO2022022288A1 (zh)


Also Published As

Publication number Publication date
EP4181052A1 (en) 2023-05-17
US20230177646A1 (en) 2023-06-08
CN112070664B (zh) 2023-11-03
CN112070664A (zh) 2020-12-11
EP4181052A4 (en) 2024-01-10


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21850081

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021850081

Country of ref document: EP

Effective date: 20230209

NENP Non-entry into the national phase

Ref country code: DE