CN113011562B - Model training method and device
- Publication number: CN113011562B (application CN202110292062.4A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- intermediate feature
- inter
- video
- video processing
- Prior art date
- Legal status: Active (the status listed is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4053—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/73—Deblurring; Sharpening
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
Abstract
The application discloses a model training method, which can be applied in the field of artificial intelligence and comprises the following steps: acquiring a video sample, a first video processing network and a second video processing network, and processing the video sample through the first video processing network and the second video processing network to obtain a first intermediate feature map output and a second intermediate feature map output, respectively; processing the first intermediate feature map output and the second intermediate feature map output through a recurrent neural network to obtain first inter-frame information and second inter-frame information, respectively; determining a target loss for performing knowledge distillation according to the first inter-frame information and the second inter-frame information; and performing knowledge distillation on the second video processing network based on the target loss. Because inter-frame information is added to the target loss, the video quality of the video obtained after the distilled student model performs the video processing task is improved.
Description
Technical Field
The application relates to the field of artificial intelligence, in particular to a model training method and device.
Background
Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Photographing and video recording on terminal devices such as smartphones have improved markedly, but, limited by the hardware performance of their optical sensors, the quality of the captured pictures and videos is still not high enough, with problems such as high noise, low resolving power, missing detail and color cast. Meanwhile, because of the hardware area and power consumption constraints of the image signal processor, it is very difficult for conventional image processing algorithms to overcome the above challenges. In order to improve the picture quality of an image or video, the video may be processed.
Deep learning has been a key driving force in the field of artificial intelligence in recent years and has achieved remarkable results on various computer vision tasks. In the field of video processing, video processing models based on deep learning achieve the best performance in the industry, with effects clearly superior to those of traditional methods.
However, the computing power of mobile terminals is weak, while current models for video processing have very complex structures and place very high demands on hardware computing resources. This severely limits the application of such neural networks in scenarios with high real-time requirements and makes them difficult to deploy on devices with weak computing power, such as mobile terminals.
Disclosure of Invention
In a first aspect, the present application provides a model training method, the method comprising:
Acquiring a video sample, a first video processing network and a second video processing network, wherein the first video processing network is a teacher model, and the second video processing network is a student model to be trained;
In one possible implementation, the video sample may include a plurality of image frames, and the first video processing network and the second video processing network are used to implement a video enhancement task, where a video enhancement task may be understood as a task for enhancing the quality of video, for example, the video enhancement task may be a video denoising task, a video defogging task, a super resolution task, a high dynamic range task, or the like, which is not limited herein;
processing the video samples through the first video processing network and the second video processing network to respectively obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network;
The first intermediate feature map output may be a feature map output of an intermediate network layer when the first video processing network processes the video sample, and the second intermediate feature map output may be a feature map output of an intermediate network layer when the second video processing network processes the video sample, where a position of the network layer outputting the first intermediate feature map output in the first video processing network is the same as a position of the network layer outputting the second intermediate feature map output in the second video processing network;
The intermediate network layer may be any network layer in the first video processing network and the second video processing network that outputs a feature map. As long as the output feature map can carry the image features of the image frames, the embodiment of the application does not limit the position of the intermediate network layer in the first video processing network and the second video processing network, nor the type of the network layer;
Processing the first intermediate feature map output and the second intermediate feature map output respectively to obtain first inter-frame information and second inter-frame information, wherein the first inter-frame information and the second inter-frame information are used for representing feature change relations among all image frames of the video sample;
In one implementation, the first intermediate feature map output and the second intermediate feature map output may be processed by a recurrent neural network, respectively. Because a recurrent neural network memorizes previous information and applies it to the calculation of the current output when processing sequence data, the resulting first inter-frame information and second inter-frame information can represent the feature change relations between the image frames of the video sample. Specifically, the feature change relation may refer to continuity and change information between frames, where the continuity information is the relation between static areas of adjacent frames and the change information is the relation between moving objects of adjacent frames;
determining a target loss according to the first inter-frame information and the second inter-frame information, and performing knowledge distillation on the second video processing network based on the target loss and the first video processing network to acquire a trained second video processing network, wherein the target loss is related to the difference between the first inter-frame information and the second inter-frame information.
By the above method, without changing the model structure, inter-frame information is added to the target loss used for knowledge distillation, so that the teacher model's ability to recognize and exploit inter-frame information during video processing is transferred to the student model, improving the video quality of the videos obtained after the distilled student model performs video processing.
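For illustration only, the following PyTorch-style sketch shows how the training step described above could be organized: per-frame intermediate features from the teacher and student are passed through a shared recurrent network, and the distillation loss compares the resulting inter-frame information. The names teacher, student, rnn and the intermediate() helper, as well as the flattening of feature maps into per-frame vectors, are assumptions made for brevity and are not prescribed by the application.

```python
# Minimal sketch (PyTorch) of the distillation step described above; the
# intermediate() helper and the flattening of feature maps are assumptions.
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, rnn, video, optimizer):
    # video: tensor of shape (T, C, H, W), one video sample with T image frames
    with torch.no_grad():
        t_feats = torch.stack([teacher.intermediate(f) for f in video])  # (T, D)
    s_feats = torch.stack([student.intermediate(f) for f in video])      # (T, D)

    # The recurrent network consumes the per-frame features in temporal order;
    # its outputs serve as the first and second inter-frame information.
    t_inter, _ = rnn(t_feats.unsqueeze(1))   # (T, 1, hidden)
    s_inter, _ = rnn(s_feats.unsqueeze(1))

    # Target loss related to the difference between the two inter-frame signals.
    loss = F.mse_loss(s_inter, t_inter.detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The full target loss described later may also include spatial-information and ground-truth terms; this sketch isolates only the inter-frame term.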
In one possible implementation, the processing the first intermediate feature map output and the second intermediate feature map output, respectively, includes:
and respectively processing the first intermediate feature map output and the second intermediate feature map output through a recurrent neural network.
In one possible implementation, the first inter-frame information and the second inter-frame information are hidden layer states (hidden state) of the recurrent neural network.
In one possible implementation, the video sample includes a plurality of frame images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame image in the plurality of frame images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame image in the plurality of frame images, the first inter-frame information includes M hidden layer states obtained by the recurrent neural network processing the first sub-intermediate feature maps corresponding to the last M frame images in the plurality of frame images, and the second inter-frame information includes M hidden layer states obtained by the recurrent neural network processing the second sub-intermediate feature maps corresponding to the last M frame images in the plurality of frame images.
In one implementation, the hidden layer states of the LSTM output may also be obtained; that is, the first intermediate feature map output of the first video processing network and the second intermediate feature map output of the second video processing network are respectively processed by the LSTM to obtain a first hidden layer state and a second hidden layer state, where the first hidden layer state may be all of the hidden layer states, or part of the hidden layer states, of the LSTM when processing the first intermediate feature map output;
In one possible implementation, the recurrent neural network is a long short-term memory (LSTM) network, and the first inter-frame information and the second inter-frame information are cell states (cell state) of the LSTM output.
In one possible implementation, the video sample includes a plurality of frame images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame image in the plurality of frame images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame image in the plurality of frame images, the first inter-frame information is a cell state obtained by the LSTM network processing a first sub-intermediate feature map corresponding to a last frame image in the plurality of frame images, and the second inter-frame information is a cell state obtained by the LSTM network processing a second sub-intermediate feature map corresponding to the last frame image in the plurality of frame images.
The LSTM network may process the input image frames sequentially to obtain a cell state corresponding to each image frame. Because the hidden outputs obtained when processing later image frames generally carry more inter-frame information, in order to reduce the amount of calculation, the cell state obtained when the recurrent network processes the intermediate feature map corresponding to a later image frame of the multi-frame images may be selected.
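The two variants above (keeping the hidden states of the last M frames, or only the cell state after the last frame) can be sketched as follows; the flattening of feature maps to vectors and all sizes are assumptions.

```python
# Sketch: obtaining inter-frame information from an LSTM, either as the hidden
# states of the last M frames or as the cell state after the last frame.
# The sizes and the flattening of feature maps to vectors are assumptions.
import torch
import torch.nn as nn

T, D, H = 8, 256, 128                # frames, flattened feature size, LSTM width (assumed)
M = 3                                # number of trailing frames to keep (assumed)
lstm = nn.LSTM(input_size=D, hidden_size=H)

feats = torch.randn(T, 1, D)         # per-frame sub-intermediate features, (seq_len, batch, D)
outputs, (h_n, c_n) = lstm(feats)    # outputs: (T, 1, H); c_n: cell state after the last frame

hidden_last_M = outputs[-M:]         # hidden-state variant: last M frames only
cell_last = c_n.squeeze(0)           # cell-state variant: a single (1, H) tensor
```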
In a possible implementation, the video sample includes a plurality of frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each of the plurality of frames of images, and the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each of the plurality of frames of images, the method further includes:
Processing each first sub-intermediate feature map and each second sub-intermediate feature map to obtain first spatial information of each first sub-intermediate feature map and second spatial information of each second sub-intermediate feature map, wherein the first spatial information and the second spatial information are used for representing feature distribution of the feature map;
The determining a target loss according to the first inter-frame information and the second inter-frame information includes:
determining a target loss based on the first and second inter-frame information and the first and second spatial information, the target loss being related to a difference between the first and second inter-frame information and a difference between the first and second spatial information.
In one implementation, in addition to the difference between the first inter-frame information and the second inter-frame information, the target loss may also be related to the difference between the spatial information of the first intermediate feature map output and the spatial information of the second intermediate feature map output. The spatial information is used to represent the feature distribution of a feature map, which may contain rich image content and represent image features of the corresponding image frames, such as frequency features and texture detail features.
In one possible implementation, the first spatial information is a first spatial attention map, the second spatial information is a second spatial attention map, and the performing information statistics on each of the first sub-intermediate feature map and each of the second sub-intermediate feature maps includes:
and mapping the first intermediate feature map output and the second intermediate feature map output respectively based on a spatial attention mechanism to obtain the first spatial attention map and the second spatial attention map respectively.
In an alternative implementation, each first sub-intermediate feature map may be averaged per channel to obtain first spatial information, and each second sub-intermediate feature map may be averaged per channel to obtain second spatial information, which may also be referred to as a spatial attention map in the case where the information statistics are averaged per channel.
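As a concrete example of the per-channel averaging mentioned above, a spatial attention map can be computed as the mean of a sub-intermediate feature map over its channel dimension; the shape below is an arbitrary assumption.

```python
# Sketch: a spatial attention map obtained by averaging a sub-intermediate
# feature map over its channel dimension; the (C, H, W) shape is assumed.
import torch

feat = torch.randn(64, 32, 32)          # one sub-intermediate feature map, (C, H, W)
spatial_attention = feat.mean(dim=0)    # (H, W): one statistic per spatial position
```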
In one possible implementation, the processing the video samples through the first video processing network and the second video processing network includes:
Processing the video samples through the first video processing network and the second video processing network to respectively obtain a first intermediate feature map output of the first video processing network, a first video processing result output by the first video processing network, and a second intermediate feature map output of the second video processing network;
The determining a target loss according to the first inter-frame information and the second inter-frame information includes:
Acquiring a true value (ground truth) corresponding to the video sample;
Determining a target loss based on the first and second inter-frame information and the first video processing result and the true value, the target loss being related to a difference between the first and second inter-frame information and a difference between the first video processing result and the true value.
In one implementation, in addition to the difference between the first inter-frame information and the second inter-frame information, the target loss may also relate to the difference between the first video processing result and the true value (ground truth) corresponding to the video sample. Where the first video processing network and the second video processing network are used to implement a video enhancement task, the true value corresponding to the video sample may be understood as a version of the video sample with improved video quality. In one implementation, the true value corresponding to the video sample may be preset, or may be obtained after image enhancement of the video sample by the first video processing network, which is not limited herein. In one implementation, the target loss may be constructed based on the difference between the first inter-frame information and the second inter-frame information, the difference between the first spatial information and the second spatial information, and the difference between the first video processing result and the true value corresponding to the video sample.
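A hedged sketch of such a combined target loss is given below; the use of mean-squared error and the weighting factors are assumptions, since the application does not fix a particular distance measure or weighting.

```python
# Sketch of a combined target loss built from the three differences described
# above; the MSE distance and the w_* weights are assumptions.
import torch.nn.functional as F

def target_loss(t_inter, s_inter, t_spatial, s_spatial, video_result, ground_truth,
                w_inter=1.0, w_spatial=1.0, w_rec=1.0):
    inter_term = F.mse_loss(s_inter, t_inter.detach())        # inter-frame information difference
    spatial_term = F.mse_loss(s_spatial, t_spatial.detach())  # spatial information difference
    rec_term = F.mse_loss(video_result, ground_truth)         # video processing result vs. ground truth
    return w_inter * inter_term + w_spatial * spatial_term + w_rec * rec_term
```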
In one possible implementation, the first video processing network and the second video processing network are used to implement video enhancement tasks.
In one possible implementation, the video enhancement task is a video denoising task, a video defogging task, a super resolution task, or a high dynamic range task.
In one possible implementation, before the processing the first intermediate feature map output and the second intermediate feature map output, respectively, the method further includes:
Respectively performing deblurring processing on the first intermediate feature map and the second intermediate feature map to obtain the first intermediate feature map after deblurring processing and the second intermediate feature map after deblurring processing;
the processing, by the recurrent neural network, the first intermediate feature map and the second intermediate feature map, respectively, includes:
and respectively processing the first intermediate feature map after the deblurring processing and the second intermediate feature map after the deblurring processing through a recurrent neural network.
In a second aspect, the present application provides a model training apparatus, the apparatus comprising:
The system comprises an acquisition module, a training module and a training module, wherein the acquisition module is used for acquiring a video sample, a first video processing network and a second video processing network, the first video processing network is a teacher model, and the second video processing network is a student model to be trained;
The video processing module is used for processing the video samples through the first video processing network and the second video processing network to respectively obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network;
The feature map processing module is used for respectively processing the first intermediate feature map output and the second intermediate feature map output to respectively obtain first inter-frame information and second inter-frame information, wherein the first inter-frame information and the second inter-frame information are used for representing feature change relations among all image frames of the video sample;
And the knowledge distillation module is used for determining target loss according to the first inter-frame information and the second inter-frame information, and carrying out knowledge distillation on the second video processing network based on the target loss and the first video processing network so as to acquire a trained second video processing network, wherein the target loss is related to the difference between the first inter-frame information and the second inter-frame information.
In one possible implementation, the feature map processing module is configured to process the first intermediate feature map output and the second intermediate feature map output through a recurrent neural network, respectively.
In one possible implementation, the first inter-frame information and the second inter-frame information are hidden layer states (hidden state) of the recurrent neural network.
In one possible implementation, the recurrent neural network is a long short-term memory (LSTM) network, and the first inter-frame information and the second inter-frame information are cell states of the LSTM output.
In one possible implementation, the video sample includes a plurality of frame images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame image in the plurality of frame images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame image in the plurality of frame images, the first inter-frame information is a cell state obtained by the LSTM network processing a first sub-intermediate feature map corresponding to a last frame image in the plurality of frame images, and the second inter-frame information is a cell state obtained by the LSTM network processing a second sub-intermediate feature map corresponding to the last frame image in the plurality of frame images.
In one possible implementation, the video sample includes a plurality of frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each of the plurality of frames of images, and the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each of the plurality of frames of images, the apparatus further comprising:
The information statistics module is used for processing each first sub-intermediate feature map and each second sub-intermediate feature map to obtain first spatial information of each first sub-intermediate feature map and second spatial information of each second sub-intermediate feature map, wherein the first spatial information and the second spatial information are used for representing feature distribution of the feature map;
The knowledge distillation module is configured to determine a target loss based on the first and second inter-frame information and the first and second spatial information, the target loss being related to a difference between the first and second inter-frame information and a difference between the first and second spatial information.
In a possible implementation, the first spatial information is a first spatial attention map, the second spatial information is a second spatial attention map, and the information statistics module is configured to map the first intermediate feature map output and the second intermediate feature map output based on a spatial attention mechanism, respectively, to obtain the first spatial attention map and the second spatial attention map, respectively.
In one possible implementation, the video processing module is configured to process the video samples through the first video processing network and the second video processing network, to obtain a first intermediate feature map output of the first video processing network, a first video processing result output by the first video processing network, and a second intermediate feature map output by the second video processing network, respectively;
The knowledge distillation module is used for obtaining a true value (ground truth) corresponding to the video sample; determining a target loss based on the first and second inter-frame information and the first video processing result and the true value, the target loss being related to a difference between the first and second inter-frame information and a difference between the first video processing result and the true value.
In one possible implementation, the first video processing network and the second video processing network are used to implement video enhancement tasks.
In one possible implementation, the video enhancement task is a video denoising task, a video defogging task, a super resolution task, or a high dynamic range task.
In one possible implementation, the apparatus further includes: a deblurring module, configured to respectively perform deblurring processing on the first intermediate feature map and the second intermediate feature map before the first intermediate feature map output and the second intermediate feature map output are respectively processed through the recurrent neural network, so as to obtain the first intermediate feature map after deblurring processing and the second intermediate feature map after deblurring processing;
The feature map processing module is configured to respectively process the first intermediate feature map after the deblurring processing and the second intermediate feature map after the deblurring processing through a recurrent neural network.
In a third aspect, an embodiment of the present application provides a model training apparatus, which may include a memory, a processor, and a bus system, where the memory is configured to store a program, and the processor is configured to execute the program in the memory, so as to perform any one of the optional methods according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the method of any of the alternatives of the first aspect described above.
In a fifth aspect, embodiments of the present application provide a computer program comprising code for implementing any of the alternative methods of the first aspect described above when the code is executed.
In a sixth aspect, the present application provides a chip system, the chip system comprising a processor configured to support an execution device or a training device to implement the functions involved in the above aspects, for example sending or processing the data and/or information involved in the above methods. In one possible design, the chip system further includes a memory, the memory being configured to store program instructions and data necessary for the execution device or the training device. The chip system may be composed of chips, or may include chips and other discrete devices.
The embodiment of the application provides a model training method, the method comprising: acquiring a video sample, a first video processing network and a second video processing network, wherein the first video processing network is a teacher model and the second video processing network is a student model to be trained; processing the video sample through the first video processing network and the second video processing network to respectively obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network; processing the first intermediate feature map output and the second intermediate feature map output through a recurrent neural network respectively to obtain first inter-frame information and second inter-frame information, the first inter-frame information and the second inter-frame information being used to represent feature change relations between the image frames of the video sample; and determining a target loss according to the first inter-frame information and the second inter-frame information, and performing knowledge distillation on the second video processing network based on the target loss and the first video processing network to obtain a trained second video processing network, wherein the target loss is related to the difference between the first inter-frame information and the second inter-frame information. By the above method, without changing the model structure, inter-frame information is added to the target loss used for knowledge distillation, so that the teacher model's ability to recognize and exploit inter-frame information during video processing is transferred to the student model, improving the video quality of the videos obtained after the distilled student model performs video processing.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence main body framework;
Fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application;
fig. 3 is a schematic diagram of an application scenario provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a convolutional neural network provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a convolutional neural network provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a system according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a chip according to an embodiment of the present application;
FIG. 8 is a schematic illustration of a model training method provided by an embodiment of the present application;
FIG. 9 is a schematic illustration of a video enhancement network provided by an embodiment of the present application;
fig. 10 is a schematic diagram of a super-resolution network according to an embodiment of the present application;
FIG. 11 is a schematic diagram of an RNN according to an embodiment of the present application;
FIG. 12 is a schematic diagram of an RNN according to an embodiment of the present application;
FIG. 13 is a schematic diagram of an RNN according to an embodiment of the present application;
FIG. 14 is a schematic illustration of a model training method provided by an embodiment of the present application;
FIG. 15 is a schematic illustration of a model training method provided by an embodiment of the present application;
fig. 16 to 19 are schematic views illustrating the effect of a model training method according to an embodiment of the present application;
FIG. 20 is a schematic illustration of a model training apparatus provided in an embodiment of the present application;
FIG. 21 is a schematic structural diagram of an execution device according to an embodiment of the present application;
Fig. 22 is a schematic structural diagram of a training device according to an embodiment of the present application.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings in the embodiments of the present invention. The terminology used in the description of the embodiments of the invention herein is for the purpose of describing particular embodiments of the invention only and is not intended to be limiting of the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As one of ordinary skill in the art can know, with the development of technology and the appearance of new scenes, the technical scheme provided by the embodiment of the application is also applicable to similar technical problems.
The terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that the terms so used are interchangeable under appropriate circumstances and are merely a way of distinguishing objects having the same attributes when the embodiments of the application are described. Furthermore, the terms "comprise" and "have" and any variations thereof are intended to cover a non-exclusive inclusion, so that a process, method, system, product or device that comprises a series of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, product or device.
Referring to fig. 1, fig. 1 shows a schematic structural diagram of an artificial intelligence main body framework. The framework is described below from two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from data acquisition to data processing, for example the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, the data undergoes a condensation process of "data-information-knowledge-wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (technologies for providing and processing information) to the industrial ecological process of the system.
(1) Infrastructure of
The infrastructure provides computing capability support for the artificial intelligence system, realizes communication with the outside world, and is supported by the base platform. It communicates with the outside through sensors. Computing power is provided by smart chips, such as hardware acceleration chips including the central processing unit (CPU), neural-network processing unit (NPU), graphics processing unit (GPU), application-specific integrated circuit (ASIC) or field programmable gate array (FPGA). The base platform includes relevant platform guarantees and support such as a distributed computing framework and networks, and may include cloud storage and computing, interconnection and interworking networks, and the like. For example, a sensor communicates with the outside to obtain data, and the data is provided to a smart chip in the distributed computing system provided by the base platform for computation.
(2) Data
The data at the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence. The data relates to graphics, images, voice and text, and also to internet-of-things data of traditional devices, including service data of existing systems and sensing data such as force, displacement, liquid level, temperature and humidity.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after reasoning about intelligent information, and generally provides functions such as classification, ranking and prediction.
(4) General capability
After the data has been processed, some general-purpose capabilities can be formed based on the result of the data processing, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5) Intelligent product and industry application
Intelligent products and industry applications refer to the products and applications of the artificial intelligence system in various fields; they are the encapsulation of the overall artificial intelligence solution, productizing intelligent information decision-making and realizing practical applications. The application fields mainly include: intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, and the like.
The model training method provided by the embodiment of the application may be applied in data processing methods such as data training, machine learning and deep learning, performing symbolized and formalized intelligent information modeling, extraction, preprocessing and training on training data to finally obtain a trained neural network model (such as the trained second video processing network in the embodiment of the application). The trained second video processing network can be used for model inference; specifically, a video can be input into the trained second video processing network to obtain a video processing result.
The trained second video processing network provided by the embodiment of the application can be applied to intelligent vehicles for assisted driving and automatic driving, and can also be applied to fields that require video enhancement, such as computer vision in smart cities and intelligent terminals. The technical scheme of the application can be applied to video streaming scenarios and video surveillance scenarios, which are briefly described below in conjunction with fig. 2 and 3, respectively.
Video streaming scenarios:
For example, when a client on a smart terminal (e.g., a cell phone, car, robot, tablet, desktop computer, smartwatch, virtual reality (VR) or augmented reality (AR) device, etc.) plays video, the server may transmit a downsampled, lower-resolution, low-quality video stream to the client over the network in order to reduce the bandwidth requirement of the video stream. The client may then enhance the images in the low-quality video stream using the trained second video processing network, for example performing super-resolution, noise reduction and other operations on the images in the video, and finally present high-quality images to the user.
Video monitoring scene:
In the security field, limited by unfavorable conditions such as the mounting positions of surveillance cameras and limited storage space, the image quality of some surveillance video is poor, which may affect the accuracy with which humans or recognition algorithms identify targets. Therefore, the trained second video processing network provided by the embodiment of the application can be used to convert low-quality surveillance video into high-quality, high-definition video, effectively recovering a large amount of detail in the monitored images and providing more effective and richer information for subsequent target recognition tasks.
Because the embodiments of the present application relate to a large number of applications of neural networks, for convenience of understanding, related terms and related concepts of the neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an arithmetic unit that takes xs (i.e., input data) and an intercept of 1 as inputs, and the output of the arithmetic unit may be:

$h_{W,b}(x) = f(W^{T}x + b) = f\left(\sum_{s=1}^{n} W_{s}x_{s} + b\right)$

where s = 1, 2, ..., n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many of the above single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of the local receptive field; the local receptive field may be an area composed of several neural units.
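A purely numeric illustration of this formula, with arbitrary values and a sigmoid activation, is:

```python
# Numeric illustration of the neural-unit formula above; all values are arbitrary.
import torch

x = torch.tensor([0.5, -1.0, 2.0])    # inputs x_s
W = torch.tensor([0.2, 0.4, -0.1])    # weights W_s
b = 1.0                               # bias of the neural unit

output = torch.sigmoid(W @ x + b)     # f(sum_s W_s * x_s + b) with f = sigmoid
```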
(2) The convolutional neural network (convolutional neural network, CNN) is a deep neural network with a convolutional structure. The convolutional neural network contains a feature extractor composed of convolutional layers and sub-sampling layers, and the feature extractor can be regarded as a filter. The convolutional layer refers to a neuron layer in the convolutional neural network that performs convolution processing on the input signal. In a convolutional layer of a convolutional neural network, a neuron may be connected to only some of the neurons of the adjacent layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of neural units arranged in a rectangle. Neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as a way of extracting features that is independent of position. The convolution kernel can be initialized in the form of a matrix of random size, and reasonable weights can be obtained through learning during the training of the convolutional neural network. In addition, a direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network while reducing the risk of overfitting.
CNN is a very common neural network, and the structure of CNN is described in detail below with reference to fig. 4. As described in the foregoing introduction of basic concepts, the convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. The deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network in which the individual neurons can respond to the image input into it.
As shown in fig. 4, convolutional Neural Network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a fully-connected layer (fully connected layer) 230.
Convolution layer/pooling layer 220:
Convolution layer:
the convolution/pooling layer 220 as shown in fig. 4 may include layers as examples 221-226, for example: in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, 221, 222 are convolutional layers, 223 are pooling layers, 224, 225 are convolutional layers, and 226 are pooling layers. I.e. the output of the convolution layer may be used as input to a subsequent pooling layer or as input to another convolution layer to continue the convolution operation.
The internal principle of operation of one convolution layer will be described below using the convolution layer 221 as an example.
The convolution layer 221 may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually processed over the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride) in the horizontal direction, so as to complete the work of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Therefore, convolving with a single weight matrix produces a convolution output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows by columns), that is, multiple matrices of the same type, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" described above. Different weight matrices may be used to extract different features in the image: for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a particular color of the image, and yet another weight matrix is used to blur unwanted noise in the image. The multiple weight matrices have the same size (rows by columns), so the feature maps extracted by them also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation.
The weight values in the weight matrices are required to be obtained through a large amount of training in practical application, and each weight matrix formed by the weight values obtained through training can be used for extracting information from an input image, so that the convolutional neural network 200 can perform correct prediction.
When convolutional neural network 200 has multiple convolutional layers, the initial convolutional layer (e.g., 221) tends to extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network 200 increases, features extracted by the later convolutional layers (e.g., 226) become more complex, such as features of high level semantics, which are more suitable for the problem to be solved.
Pooling layer:
Since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after the convolutional layers: as illustrated by layers 221-226 within 220 in FIG. 4, there may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers. During image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator may calculate the pixel values in the image within a particular range to produce an average value as the result of average pooling. The max pooling operator may take the pixel with the largest value within a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
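A minimal sketch of one convolutional layer followed by one pooling layer, with arbitrary example sizes, is shown below.

```python
# Sketch: a convolutional layer followed by a max-pooling layer as described
# above; all layer sizes are arbitrary examples.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)      # halves the spatial size of the feature map

image = torch.randn(1, 3, 224, 224)     # one RGB input image
features = pool(conv(image))            # shape: (1, 16, 112, 112)
```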
Full connection layer 230:
After the processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet sufficient to output the required output information, because, as described above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate one output or a group of outputs of the required number of classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n as shown in fig. 4), and the parameters contained in the multiple hidden layers may be pre-trained according to the relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the hidden layers in the fully connected layer 230, the final layer of the overall convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to categorical cross-entropy, specifically used for calculating the prediction error. Once the forward propagation of the overall convolutional neural network 200 (propagation in the direction from 210 to 240 in fig. 4) is completed, backward propagation (propagation in the direction from 240 to 210 in fig. 4) begins to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network 200 shown in fig. 4 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, only includes a part of the network structure shown in fig. 4, for example, the convolutional neural network used in the embodiment of the present application may include only the input layer 210, the convolutional layer/pooling layer 220, and the output layer 240.
It should also be noted that the convolutional neural network 200 shown in fig. 4 is only an example of a convolutional neural network; in a specific application, the convolutional neural network may also exist in the form of other network models, for example, with a plurality of convolutional layers/pooling layers in parallel as shown in fig. 5, where the features extracted respectively are all input to the fully-connected layer 230 for processing.
(3) Deep neural network
Deep neural networks (deep neural network, DNN), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers; there is no particular metric for "many" here. Based on the positions of the different layers, the layers inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer. Typically the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer. Although a DNN appears complex, the work of each layer is not complex and is simply the following linear relational expression: y = α(Wx + b), where x is the input vector, y is the output vector, b is the offset (bias) vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Since the number of DNN layers is large, the number of coefficients W and offset vectors b is also large. These parameters are defined in a DNN as follows, taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W_{24}^3, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4. In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W_{jk}^L. It should be noted that the input layer has no W parameters. In deep neural networks, more hidden layers make the network more capable of characterizing complex situations in the real world. Theoretically, a model with more parameters has higher complexity and greater "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and the final objective is to obtain the weight matrices of all layers of the trained deep neural network (weight matrices formed by the vectors W of many layers).
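As a minimal illustration of the per-layer expression y = α(Wx + b) above (assuming, purely for the example, a layer with 4 inputs, 2 outputs, and a ReLU activation), the computation of one fully connected layer can be sketched as follows.

```python
import torch

torch.manual_seed(0)

x = torch.randn(4)         # input vector
W = torch.randn(2, 4)      # weight matrix (2 outputs, 4 inputs)
b = torch.randn(2)         # offset (bias) vector

y = torch.relu(W @ x + b)  # y = alpha(Wx + b), with alpha taken as ReLU here
print(y)
```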
(4) Loss function
In training a deep neural network, since the output of the deep neural network is expected to be as close as possible to the value actually desired, the weight vectors of each layer of the neural network can be updated by comparing the predicted value of the current network with the actually desired target value and adjusting the weight vectors according to the difference between the two (of course, there is usually an initialization process before the first update, that is, pre-configuring parameters for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted so that the prediction becomes lower, and the adjustment is continued until the deep neural network can predict the actually desired target value or a value very close to it. Thus, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the larger the difference, and training the deep neural network then becomes a process of reducing this loss as much as possible.
(5) Back propagation algorithm
The convolutional neural network can adopt a back propagation (BP) algorithm to correct the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation process dominated by the error loss, and aims to obtain the parameters of the optimal super-resolution model, such as the weight matrices.
(6) A recurrent neural network (recurrent neural network, RNN) is used to process sequence data. In the traditional neural network model, from the input layer to the hidden layer to the output layer, the layers are fully connected, while the nodes within each layer are unconnected. Although this common neural network solves many problems, it is still powerless for many other problems. For example, to predict what the next word of a sentence will be, it is generally necessary to use the previous words, because the words in a sentence are not independent of each other. The RNN is called a recurrent neural network in the sense that the current output of a sequence is related to the previous outputs. The specific expression is that the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes between the hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, RNNs are able to process sequence data of any length. Training an RNN is the same as training a traditional CNN or DNN: the error back propagation algorithm is also used, but with a difference: if the RNN is unfolded over time, the parameters therein, such as W, are shared, whereas this is not the case for the traditional neural networks described above. Moreover, when using a gradient descent algorithm, the output of each step depends not only on the network of the current step, but also on the network states of the previous steps. This learning algorithm is referred to as back propagation through time (BPTT).
Why do we need recurrent neural networks when convolutional neural networks already exist? The reason is simple: in convolutional neural networks, there is a precondition assumption that the elements are independent of each other and that the inputs and outputs are also independent of each other, such as images of cats and dogs. However, in the real world, many elements are interconnected; for example, stock prices change over time. As another example, a person says: "I like traveling, and my favorite place is Yunnan; in the future, when I have the opportunity, I will go to ____." Here, humans know that the blank should be filled with "Yunnan", because humans infer from the context; but how can a machine do this? RNNs were developed for this purpose. RNNs aim to give machines the ability to memorize in the way humans do. Thus, the output of an RNN needs to rely on the current input information and the historical memory information.
(7) Pixel value
The pixel value of an image may be a red green blue (RGB) color value, and the pixel value may be a long integer representing the color. For example, a pixel value may be 256×Red + 100×Green + 76×Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. For each color component, the smaller the value, the lower the luminance, and the larger the value, the higher the luminance. For a grayscale image, the pixel value may be a gray value.
(8) Super resolution
Super resolution (SR) is an image enhancement technique in which, given a low-resolution image or a group of low-resolution images, the high-frequency detail information of the images is recovered by learning prior knowledge of the images, the similarity of the images, the complementarity of multi-frame image information, and the like, so as to generate a target image with higher resolution. In application, super-resolution can be divided into single-frame image super-resolution and video super-resolution according to the number of input images. Super-resolution has important application value in fields such as high-definition television, monitoring equipment, satellite imagery, and medical imaging.
(9) Video super resolution
Video super-resolution (video super resolution, VSR) is an enhancement technique for video processing, the purpose of which is to convert low-resolution video into high-quality high-resolution video. The video super-resolution can be divided into multi-frame video super-resolution and cyclic video super-resolution according to the number of input frames.
The image processing method provided by the application can be applied to live video streaming, video calls, album management, smart city, human-machine interaction, and other scenes involving video data.
(10) Noise reduction
Images are often affected by the imaging device and the external environment during digitization and transmission, resulting in images containing noise. The process of reducing noise in an image is referred to as image noise reduction, and sometimes may also be referred to as image denoising.
(11) Image features
The image features mainly comprise color features, texture features, shape features, spatial relationship features and the like of the image.
The color feature is a global feature describing the surface properties of the scene to which the image or image area corresponds; the general color feature is a pixel-based feature, where all pixels belonging to an image or image area have their own contribution. Since color is insensitive to changes in direction, size, etc. of an image or image area, color features do not capture local features of objects in an image well.
Texture features are also global features that also describe the surface properties of the scene to which an image or image region corresponds; however, since texture is only a characteristic of the surface of an object, and cannot fully reflect the intrinsic properties of the object, high-level image contents cannot be obtained by using only texture features. Unlike color features, texture features are not pixel-based features, which require statistical calculations in areas containing multiple pixels.
The shape features have two types of representation methods, one is contour features, the other is region features, the contour features of the image are mainly aimed at the outer boundary of the object, and the region features of the image are related to the whole shape region.
The spatial relationship feature refers to the mutual spatial positions or relative direction relationships between multiple objects segmented from an image; these relationships may be classified into connection/adjacency relationships, overlapping relationships, inclusion/containment relationships, and the like. In general, spatial position information can be divided into two categories: relative spatial position information and absolute spatial position information. The former emphasizes the relative situation between targets, such as the up-down-left-right relationship, while the latter emphasizes the distance and orientation between targets.
It should be noted that the above listed image features may be taken as some examples of features in the image, and the image may also have other features, such as higher-level features: semantic features, which are not expanded here.
(12) Image/video enhancement
Image/video enhancement refers to operations performed on images/videos that can improve imaging quality. For example, enhancement processing includes super-resolution, noise reduction, sharpening, demosaicing, and the like.
The system architecture provided by the embodiment of the present application is described in detail below with reference to fig. 6. Fig. 6 is a schematic diagram of a system architecture according to an embodiment of the application. As shown in fig. 6, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data acquisition system 560.
The execution device 510 includes a computing module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514. The calculation module 511 may include a target model/rule 501 therein, with the preprocessing module 513 and preprocessing module 514 being optional.
The data acquisition device 560 is used to acquire training data. The training data in embodiments of the present application includes video samples and supervision videos (also known as true values (ground truth)). The video samples can be low-quality videos, and the supervision videos are high-quality videos corresponding to the video samples, obtained in advance before model training. For example, the video sample may be a low-resolution video and the supervision video a high-resolution video; or the video sample may be a video containing fog or noise, and the supervision video is a video from which the fog or noise has been removed. After the training data is collected, the data acquisition device 560 stores the training data in the database 530, and the training device 520 trains the target model/rule 501 based on the training data maintained in the database 530.
In this embodiment of the present application, the training device 520 performs knowledge distillation on the student model (e.g., the second video processing model in the embodiment of the present application) based on the training data maintained in the database 530 and the teacher model (e.g., the first video processing model in the embodiment of the present application), to obtain the target model/rule 501 (e.g., the trained second video processing model in the embodiment of the present application).
The target model/rule 501 can be used to implement a video enhancement task, that is, a video to be processed is input into the target model/rule 501, so that a processed enhanced video can be obtained. In practical applications, the training data maintained in the database 530 is not necessarily acquired by the data acquisition device 560, but may be received from other devices. It should be further noted that the training device 520 is not necessarily completely based on the training data maintained by the database 530 to perform training of the target model/rule 501, and it is also possible to obtain the training data from the cloud or other places to perform model training, which should not be taken as a limitation of the embodiments of the present application.
The target model/rule 501 obtained by training according to the training device 520 may be applied to different systems or devices, such as the execution device 510 shown in fig. 6, where the execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (augmented reality, AR)/virtual reality (VR) device, a vehicle-mounted terminal, or may also be a server or cloud. In fig. 6, the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with an external device, and a user may input data to the I/O interface 512 through the client device 540, where, in an embodiment of the present application, the input data may include the video to be processed input by the client device.
The preprocessing module 513 and the preprocessing module 514 are configured to perform preprocessing according to input data (such as video to be processed) received by the I/O interface 512. It should be appreciated that there may be no pre-processing module 513 and pre-processing module 514 or only one pre-processing module. When the preprocessing module 513 and the preprocessing module 514 are not present, the calculation module 511 may be directly employed to process the input data.
In preprocessing input data by the execution device 510, or in performing processing related to computation or the like by the computation module 511 of the execution device 510, the execution device 510 may call data, codes or the like in the data storage system 550 for corresponding processing, or may store data, instructions or the like obtained by corresponding processing in the data storage system 550.
Finally, the I/O interface 512 presents the processing results, such as the processed enhanced video, to the client device 540 for presentation to the user.
It should be noted that the training device 520 may generate, based on different training data, a corresponding target model/rule 501 for different targets or different tasks, where the corresponding target model/rule 501 may be used to implement the video enhancement task, thereby providing the user with the desired result.
In the case shown in fig. 6, the user may manually give input data (which may be video to be processed), which may be operated through an interface provided by the I/O interface 512. In another case, the client device 540 may automatically send the input data to the I/O interface 512, and if the client device 540 is required to automatically send the input data requiring authorization from the user, the user may set the corresponding permissions in the client device 540. The user may view the results output by the execution device 510 at the client device 540, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 540 may also be used as a data collection terminal to collect input data from the input I/O interface 512 and output data from the output I/O interface 512 as new sample data, and store the new sample data in the database 530. Of course, instead of being collected by the client device 540, the I/O interface 512 may directly store the input data of the I/O interface 512 and the output result of the I/O interface 512 as new sample data into the database 530.
It should be noted that fig. 6 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawing is not limited in any way, for example, in fig. 6, the data storage system 550 is an external memory with respect to the execution device 510, and in other cases, the data storage system 550 may be disposed in the execution device 510.
The following describes a chip hardware structure provided by the embodiment of the application.
Fig. 7 is a hardware architecture diagram of a chip including a neural network processor 700 according to an embodiment of the present application. The chip may be provided in an execution device 510 as shown in fig. 6 to perform the calculation of the calculation module 511. The chip may also be provided in a training device 520 as shown in fig. 6 for completing training work of the training device 520 and outputting the target model/rule 501. The algorithms of the various layers in the video processing network as shown in fig. 6 can be implemented in a chip as shown in fig. 7.
The neural network processor (neural processing unit, NPU) 700 is mounted as a co-processor to a main central processing unit (host central processing unit, host CPU), which allocates tasks. The core part of the NPU is the arithmetic circuit 703; the controller 704 controls the arithmetic circuit 703 to extract data from a memory (the weight memory 702 or the input memory 701) and perform operations.
In some implementations, the arithmetic circuitry 703 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 703 is a two-dimensional systolic array. The arithmetic circuit 703 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 703 is a general purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit 703 takes the data corresponding to the matrix B from the weight memory 702 and buffers the data on each PE in the arithmetic circuit 703. The arithmetic circuit 703 takes the matrix a data from the input memory 701 and performs matrix operation with the matrix B, and the obtained partial result or the final result of the matrix is stored in an accumulator (accumulator) 708.
The vector calculation unit 707 may further process the output of the operation circuit 703, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, vector computation unit 707 may be used for network computation of non-convolutional/non-FC layers in a neural network, such as pooling (pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector computation unit 707 can store the vector of processed outputs to the unified memory 706. For example, the vector calculation unit 707 may apply a nonlinear function to an output of the operation circuit 703, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 707 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuit 703, for example for use in subsequent layers in a neural network.
The unified memory 706 is used for storing input data and output data.
Data is transferred directly by the direct memory access controller (DMAC) 705: the input data in the external memory is transferred to the input memory 701 and/or the unified memory 706, the weight data in the external memory is stored into the weight memory 702, and the data in the unified memory 706 is stored into the external memory.
A bus interface unit (bus interface unit, BIU) 710 is used for implementing interactions between the main CPU, the DMAC, and the instruction fetch memory 709 over the bus.
An instruction fetch memory (instruction fetch buffer) 709 coupled to the controller 704 for storing instructions for use by the controller 704.
The controller 704 is configured to invoke an instruction cached in the instruction fetch memory 709, so as to control a working process of the operation accelerator.
Typically, the unified memory 706, the input memory 701, the weight memory 702, and the instruction fetch memory 709 are on-chip memories, and the external memory is a memory external to the NPU, which may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), a high bandwidth memory (high bandwidth memory, HBM), or another readable and writable memory.
Referring to fig. 8, fig. 8 is an embodiment schematic diagram of a model training method provided by the embodiment of the present application, and as shown in fig. 8, the model training method provided by the embodiment of the present application includes:
801. Obtain a video sample, a first video processing network, and a second video processing network, where the first video processing network is a teacher model and the second video processing network is a student model to be trained.
In an embodiment of the present application, a video sample may include a plurality of image frames.
In the embodiment of the present application, the first video processing network and the second video processing network may be used to implement a video enhancement task, where the video enhancement task may be understood as a task for enhancing the quality of video, for example, the video enhancement task may be a video denoising task, a video defogging task, a super-resolution task, a high dynamic range task, or the like, which is not limited herein.
It should be appreciated that the first video processing network and the second video processing network are models for implementing the same video processing task, and the application is not limited to a particular type of video processing task. Taking the video enhancement task as an example of the super-resolution task, one illustration of the network structure of the first video processing network and the second video processing network is described next.
Referring to fig. 9, fig. 9 is a schematic diagram of a video processing network. As shown in fig. 9, the image to be processed may be a low resolution (LR) image; image features may be obtained after the low-resolution image frame is processed by a feature extraction module, and the feature map may then be processed by a plurality of basic units, where a basic unit may be a network structure obtained by connecting basic modules through basic operations of a neural network. The network structure may include basic operations or combinations of basic operations in a predetermined convolutional neural network, and these basic operations or combinations of basic operations may be collectively referred to as basic operations. For example, a basic operation may refer to a convolution operation, a pooling operation, a residual connection, and the like, and the basic operations enable connections between the respective basic modules, thereby obtaining the network structure of the basic unit. The nonlinear transformation part is used for transforming the image features of the input image and mapping them into a high-dimensional feature space; in the mapped high-dimensional space it is usually easier to reconstruct a super-resolution image. The reconstruction part is used for performing up-sampling and convolution processing on the image features output by the nonlinear transformation part to obtain the super-resolution image corresponding to the input image.
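The following is a minimal sketch, not the network of fig. 9 itself, of such a three-part single-image super-resolution network (feature extraction, nonlinear mapping through a few basic units, and reconstruction with up-sampling), assuming a PyTorch implementation; the channel counts, the number of residual blocks, and the ×2 up-scaling factor are assumptions made purely for the example.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic unit: two 3x3 convolutions with a residual connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class TinySRNet(nn.Module):
    """Feature extraction -> nonlinear mapping -> reconstruction (x2 up-sampling)."""
    def __init__(self, channels: int = 64, num_blocks: int = 4, scale: int = 2):
        super().__init__()
        self.extract = nn.Conv2d(3, channels, 3, padding=1)                       # feature extraction
        self.mapping = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
        self.reconstruct = nn.Sequential(                                         # up-sampling + convolution
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr):
        feat = self.extract(lr)
        feat = self.mapping(feat)
        return self.reconstruct(feat)

lr = torch.randn(1, 3, 64, 64)   # a low-resolution frame
sr = TinySRNet()(lr)
print(sr.shape)                  # torch.Size([1, 3, 128, 128])
```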
Taking a super-resolution task model as an example, as shown in fig. 10, a low-resolution video (including a plurality of low-resolution image frames) is input, and then feature extraction can be performed on the low-resolution image frames to obtain a low-scale feature map, where a low-scale feature map refers to a feature map containing more low-frequency information, or in other words, a feature map containing less texture detail information. In the super-resolution task model, the low-scale feature map may be obtained with a pyramid (Pyramid) structure, in which features are extracted by a plurality of residual blocks and convolutional layers with a stride of 2. The super-resolution task model uses the pyramid structure to pre-deblur the input image frames.
Then, a denoised low-scale feature map can be obtained by performing alignment and/or fusion operations on the low-scale feature map.
Specifically, deformable convolution (deformable conv) can be applied to the low-scale feature maps to realize image alignment, which effectively avoids the problem that traditional alignment methods need to explicitly or implicitly calculate/estimate the optical flow between images. The input low-scale feature map can be convolved by convolutional layers with a stride of 2 to obtain a pyramid of L layers. For a reference frame t and any adjacent frame t+i, a similar operation is carried out on each layer of the pyramid: the two feature maps are concatenated and convolved to obtain the operation parameters of the deformable convolution (called offsets in the super-resolution task model), the feature map at time t+i is input to the deformable conv, and a new feature map at time t+i is output by the deformable conv. In addition, the offsets of the lower layer of the pyramid may be used as an input of the upper-layer offset convolution for more accurate offset estimation, and the feature map output by the deformable conv may be up-sampled and then fused with the corresponding features of the upper layer. This continues until the first layer of the pyramid, where the feature map output by the deformable conv and fused with the lower layer is concatenated with the feature map of the reference frame to serve as the offsets of a new deformable conv, from which the final feature map aligned to time t+i can be predicted.
In addition, due to unavoidable factors such as hand shake and target motion, different image frames exhibit different degrees of blur, so that different adjacent frames contribute differently to enhancing the reference frame. Conventional approaches generally treat them equally, but this is not ideal. Therefore, the super-resolution task model introduces an attention mechanism in the fusion process, giving different weights to different feature maps in the two dimensions of the spatial domain and the temporal domain.
Specifically, first, based on the already aligned feature maps, the reference frame and the adjacent frames are further processed by passing through different convolutional layers again (with shared parameters for the adjacent frames), and the similarity between each adjacent frame and the reference frame is calculated, which is defined as the temporal attention map at that moment. The feature map at each moment, including that of the reference frame, performs such an operation with the feature map of the reference frame, so that a temporal attention map is obtained for each moment; multiplying it with the aligned feature map in the spatial dimension is equivalent to adjusting the proportion that the feature maps at different moments occupy in the restoration/enhancement task. Then, all the feature maps are convolved, i.e., the feature fusion operation is carried out. Further, a spatial attention map is obtained through a pyramid structure, and a new feature map is obtained after up-sampling.
After obtaining the new low-scale feature map, a high-scale feature map can be obtained through reconstruction processing (for example, reconstruction can be performed through a plurality of residual blocks), and finally a final high-resolution image frame is obtained through a convolution operation. The high-scale feature map before the convolution operation can be a multi-channel feature map, and the result of the convolution operation can represent a high-resolution image frame; for example, the result of the convolution operation can be a three-channel image (such as an RGB image).
In the embodiment of the application, the first video processing network and the second video processing network are used for realizing the video enhancement task, the first video processing network is a teacher model, the second video processing network is a student model, and in the embodiment of the application, the first video processing network is used as the teacher model to carry out knowledge distillation on the second video processing network.
Here, the teacher model may also be referred to as a mentor model, a guidance model, or the like, which is not limited herein.
In performing knowledge distillation, another simple network (second video processing network) may be trained by employing a pre-trained complex network (first video processing network) such that the simple network (second video processing network) may have the same or similar data processing capabilities as the complex network (first video processing network). Knowledge distillation is the migration of the "knowledge" that a trained complex network has into a network of simpler structure. Wherein the simple network may have a smaller amount of parameters than a complex network.
It should be noted that the same or similar data processing capability is understood that, when the same data to be processed is processed, the processing results obtained by the student model and the teacher model after knowledge distillation are the same or similar.
In knowledge distillation, a loss needs to be constructed based on the output of the teacher model and the output of the student model. The model output used to construct the loss can be the output of the output layer of the model, can be an intermediate feature map output of an intermediate network layer, or can be a result obtained by processing the output of the output layer and/or the intermediate feature map output of the intermediate network layer. In existing implementations, the model output used to construct the loss is spatial information representing the feature distribution of the feature map, obtained by statistics over the intermediate output of the intermediate network layer for each image frame in a video. However, in a video enhancement scene, the spatial information of an image frame can only represent the feature distribution of the feature map of that single image frame and does not carry inter-frame information, where the inter-frame information can be the continuity and change information between frames: a stationary region between frames corresponds to continuity information, and an object that moves between frames corresponds to change information.
The teacher model has a large number of parameters and strong data processing capability, and can well handle the continuity and change information between frames; that is, the teacher model can well identify the inter-frame information and enhance the video by utilizing the inter-frame information, so that the video quality of the enhanced video is very high. If the loss is only related to the spatial information of each image frame, the student model cannot learn the teacher model's ability to process the inter-frame information, and the video enhancement effect of the distilled student model is limited.
In the embodiment of the application, when constructing the target loss for knowledge distillation, the inter-frame information is also considered, and how to acquire the inter-frame information and how to construct the target loss based on the inter-frame information are described in detail below.
802. Process the video sample through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network, respectively.
In the embodiment of the application, in the knowledge distillation process, the teacher model and the student model need to process the video samples, that is, the feedforward process of the models is performed, after the video samples are processed through the first video processing network and the second video processing network, the enhanced video can be obtained, and in addition, the first intermediate feature map output of the first video processing network and the second intermediate feature map output of the second video processing network can also be obtained. The first intermediate feature map output of the first video processing network and the second intermediate feature map output of the second video processing network are described next:
in the embodiment of the application, the first intermediate feature map output may be a feature map output of an intermediate network layer when the first video processing network processes the video sample, the second intermediate feature map output may be a feature map output of an intermediate network layer when the second video processing network processes the video sample, and the position of the network layer outputting the first intermediate feature map output in the first video processing network is the same as the position of the network layer outputting the second intermediate feature map output in the second video processing network.
The intermediate network layer may be a network layer for outputting a feature map in the first video processing network and the second video processing network, so long as the output feature map may carry image features of an image frame.
Taking the video enhancement task as a super-resolution task as an example, the first intermediate feature map output and the second intermediate feature map output may be obtained by performing feature extraction on the video sample, or may be obtained by performing other processing on the feature maps obtained by feature extraction, for example, deblurred intermediate feature maps obtained by performing deblurring processing on them, feature maps obtained by performing alignment and/or fusion operations on them, or high-scale feature maps obtained by performing reconstruction on them.
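One possible way to obtain such intermediate feature map outputs in practice, assuming a PyTorch implementation (which the embodiment does not prescribe), is to register forward hooks on chosen intermediate layers of the teacher and student networks; the toy networks and layer names below are hypothetical stand-ins for the first and second video processing networks.

```python
import torch
import torch.nn as nn

def capture_intermediate(model, layer_name):
    """Register a forward hook that stores the output of the named intermediate layer."""
    captured = {}
    def hook(module, inputs, output):
        captured["feat"] = output
    dict(model.named_modules())[layer_name].register_forward_hook(hook)
    return captured

# Toy stand-ins for the teacher and student video processing networks (hypothetical structures).
teacher_net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1))
student_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1))

teacher_feat = capture_intermediate(teacher_net, "0")  # first intermediate feature map output
student_feat = capture_intermediate(student_net, "0")  # second intermediate feature map output

frames = torch.randn(5, 3, 64, 64)   # a video sample of 5 frames (arbitrary size)
with torch.no_grad():
    _ = teacher_net(frames)
_ = student_net(frames)
print(teacher_feat["feat"].shape, student_feat["feat"].shape)
```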
803. Process the first intermediate feature map output and the second intermediate feature map output respectively to obtain first inter-frame information and second inter-frame information, where the first inter-frame information and the second inter-frame information are used to represent feature change relationships between the image frames of the video sample.
In one implementation, the first intermediate feature map output and the second intermediate feature map output may be respectively processed by a recurrent neural network to respectively obtain first inter-frame information and second inter-frame information, where the first inter-frame information and the second inter-frame information are used to represent feature change relationships between image frames of the video sample.
It will be appreciated that the first intermediate feature map output and the second intermediate feature map output may be processed by other networks or functional mappings that enable determination of inter-frame information between image frames in a video, and are not limited in this regard.
RNNs are called recurrent neural networks in the sense that a sequence's current output is also related to the previous output. The specific expression is that the network memorizes the previous information and applies the previous information to the calculation of the current output, namely, the nodes between the hidden layers are connected, and the input of the hidden layers not only comprises the output of the input layer but also comprises the output of the hidden layer at the last moment. In theory, RNNs are able to process sequence data of any length. Training for RNNs is the same as training for traditional CNNs or DNNs. Error back propagation algorithms are also used, but with a few differences: that is, if the RNN is network extended, parameters therein, such as W, are shared. And in using a gradient descent algorithm, the output of each step depends not only on the network of the current step, but also on the state of the previous steps of the network. This learning algorithm is referred to as a time-based back-propagation algorithm (back propagation through time, BPTT).
FIG. 11 is a schematic diagram of the structure of an RNN, in which each circle can be considered as a unit and each unit does the same thing, so the network can be folded into the form shown in the left half. An RNN is a sequence-to-sequence model, where Xt in fig. 12 represents the input at time t, Ot represents the output at time t, St represents the memory at time t, U is the weight of the input layer, W is the weight of the memory, and V is the weight of the output layer. The output at the current time is determined by the memory and the input at the current time, where St = f(U·Xt + W·St-1), and f() is an activation function in the neural network.
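The recurrence St = f(U·Xt + W·St-1) described above can be sketched as follows, assuming a PyTorch implementation; the dimensions, the choice of tanh for f, and the simple linear read-out through V are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)

input_size, hidden_size = 8, 16            # arbitrary example dimensions
U = torch.randn(hidden_size, input_size)   # input-to-hidden weights
W = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden (memory) weights
V = torch.randn(4, hidden_size)            # hidden-to-output weights (output size assumed to be 4)

def rnn_step(x_t, s_prev):
    """One RNN step: the memory S_t depends on the current input and the previous memory."""
    s_t = torch.tanh(U @ x_t + W @ s_prev)  # S_t = f(U*X_t + W*S_{t-1}), with f = tanh here
    o_t = V @ s_t                           # output at time t (simple linear read-out assumed)
    return s_t, o_t

s = torch.zeros(hidden_size)
for x in torch.randn(5, input_size):        # a sequence of 5 inputs
    s, o = rnn_step(x, s)
```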
In the embodiment of the application, the video sample comprises a plurality of frames of images, the first intermediate feature map output comprises a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the second intermediate feature map output comprises a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, each first sub-intermediate feature map and each second sub-intermediate feature map are respectively processed through a cyclic neural network, so that first inter-frame information and second inter-frame information can be obtained. Specifically, the feature change relationship may refer to continuous and change information between frames, where the continuous information is a relationship between regions that are stationary between frames, and the change information is a relationship between objects that have motion between frames.
It should be understood that the inter-frame information in the embodiment of the present application may also be referred to as timing information (Temporal Context).
In one implementation, the first inter-frame information and the second inter-frame information may be hidden layer states (HIDDEN STATE) output by the recurrent neural network, where the hidden layer states may be hidden layer outputs in the RNN, and since the RNN may obtain multiple hidden layer states (each hidden layer state corresponds to one image frame) when processing intermediate feature map outputs of multiple image frames in the video, all hidden layer states or part of hidden layer states obtained by the RNN when processing intermediate feature map outputs may be obtained.
In general, RNN may carry more inter-frame information in hidden layer output obtained by processing a later image frame, so in order to reduce the amount of calculation, a hidden layer state obtained by RNN processing an intermediate feature map corresponding to a later image frame in a multi-frame image may be selected.
Specifically, the video sample includes a plurality of frame images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame image in the plurality of frame images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame image in the plurality of frame images, the first inter-frame information includes M hidden layer states obtained by the cyclic neural network processing a first sub-intermediate feature map corresponding to a later M frame image in the plurality of frame images, and the second inter-frame information is M hidden layer states obtained by the cyclic neural network processing a second sub-intermediate feature map corresponding to a later M frame image in the plurality of frame images. It should be understood that M may be flexibly selected here, and the present application is not limited thereto.
Taking a recurrent neural network as an example of a long short-term memory (LSTM), the first inter-frame information and the second inter-frame information may be cell states (CELL STATE) output by the LSTM.
Next, LSTM in an embodiment of the present application will be described.
Referring to fig. 13, fig. 13 is a schematic diagram of the structure of an LSTM. Each image frame in a video sample may be sequentially input into the LSTM unit shown in fig. 13; for each image frame, the LSTM obtains a cell state and a hidden state and transfers them to the processing of the next adjacent image frame. As shown in fig. 13, Ct-1 is the cell state obtained when the LSTM processed the intermediate feature map of the previous image frame. The cell state is transferred like a conveyor belt: it runs along the entire chain of the LSTM, with only a few small linear operations acting on it, and the LSTM has the capability of deleting information from or adding information to the cell state, a capability provided by structures called gates (Gate). A gate is a way to optionally let information through; for example, it may consist of a Sigmoid neural network layer that outputs a number between 0 and 1, describing how much of each component is allowed to pass, where 0 means nothing passes and 1 means everything passes. An LSTM can have three gates for protecting and controlling the cell state.
The first step of the LSTM is to decide what information to discard from the cell state. This decision is implemented by a Sigmoid layer called the "forget gate": it looks at Ht-1 (the previous hidden layer state) and Ft (the current input) and outputs a number between 0 and 1 for each number in the cell state Ct-1 (the previous cell state), where 1 represents complete retention and 0 represents complete deletion. The next step is to decide what new information to store in the cell state. Specifically, a Sigmoid layer called the "input gate" decides which values need to be updated, and a tanh layer creates a candidate vector C̃t that will be added to the cell state. The two can then be combined to create the updated value: the previous cell state Ct-1 is multiplied by ft to express the portion expected to be forgotten, and the result is then added to it·C̃t to obtain the cell state Ct of the current image frame. Finally, it is necessary to decide what to output as the hidden layer state; this output is based on the cell state. A Sigmoid layer decides which parts of the cell state will be output, the cell state is passed through tanh (normalizing the values to between -1 and 1), and the result is multiplied by the output of the Sigmoid gate. The specific process can refer to the following formulas:
ft = σ(Wf·[Ht-1, Ft] + bf)
it = σ(Wi·[Ht-1, Ft] + bi)
C̃t = tanh(WC·[Ht-1, Ft] + bC)
Ct = ft⊙Ct-1 + it⊙C̃t
ot = σ(Wo·[Ht-1, Ft] + bo)
Ht = ot⊙tanh(Ct)
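A minimal sketch of one LSTM step following the gate equations above, assuming a PyTorch implementation with arbitrary example dimensions; in practice torch.nn.LSTM or torch.nn.LSTMCell could be used instead of writing the gates out by hand.

```python
import torch
import torch.nn as nn

feat_size, hidden_size = 32, 64  # arbitrary example dimensions

# One linear layer per gate, each acting on the concatenation [H_{t-1}, F_t].
W_f = nn.Linear(hidden_size + feat_size, hidden_size)  # forget gate
W_i = nn.Linear(hidden_size + feat_size, hidden_size)  # input gate
W_c = nn.Linear(hidden_size + feat_size, hidden_size)  # candidate
W_o = nn.Linear(hidden_size + feat_size, hidden_size)  # output gate

def lstm_step(f_t, h_prev, c_prev):
    """One LSTM step on the intermediate feature F_t of the current image frame."""
    z = torch.cat([h_prev, f_t], dim=-1)
    forget = torch.sigmoid(W_f(z))      # what to discard from C_{t-1}
    inp = torch.sigmoid(W_i(z))         # what to update
    cand = torch.tanh(W_c(z))           # candidate values
    c_t = forget * c_prev + inp * cand  # new cell state C_t
    out = torch.sigmoid(W_o(z))         # what to output
    h_t = out * torch.tanh(c_t)         # new hidden state H_t
    return h_t, c_t

h = torch.zeros(1, hidden_size)
c = torch.zeros(1, hidden_size)
for f_t in torch.randn(5, 1, feat_size):  # features of 5 image frames
    h, c = lstm_step(f_t, h, c)
```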
In the embodiment of the application, the LSTM network can sequentially process the input image frames to obtain the cell state corresponding to each image frame, and the LSTM network can generally carry more interframe information in hidden layer output obtained by processing the later image frames, so that the cell state obtained by processing the intermediate feature map corresponding to the later image frames in the multi-frame image by the LSTM network can be selected for reducing the calculation amount.
In one implementation, the cell state obtained by the LSTM network in processing the intermediate feature map corresponding to the last image frame in the multi-frame images may be directly selected. Specifically, for the first intermediate feature map output of the teacher network, the first intermediate feature map output may include a plurality of first sub-intermediate feature maps, where each first sub-intermediate feature map may correspond to one image frame; then the first sub-intermediate feature map corresponding to the last image frame in the video may be obtained, and the cell state obtained by the LSTM network in processing the first sub-intermediate feature map corresponding to the last image frame is determined to be the first inter-frame information. Similarly, for the second intermediate feature map output of the student network, the second intermediate feature map output may include a plurality of second sub-intermediate feature maps, where each second sub-intermediate feature map may correspond to one image frame; then the second sub-intermediate feature map corresponding to the last image frame in the video may be obtained, and the cell state obtained by the LSTM network in processing the second sub-intermediate feature map corresponding to the last image frame is determined to be the second inter-frame information.
It should be understood that, in addition to the cell state, the first inter-frame information and the second inter-frame information may be determined according to hidden states obtained by processing the first intermediate feature map and the second intermediate feature map by the LSTM network, specifically, the first hidden state and the second hidden state may be obtained by respectively processing the first intermediate feature map output by the first video processing network and the second intermediate feature map output by the second video processing network by the LSTM, and the first hidden state and the second hidden state may be used as the first inter-frame information and the second inter-frame information respectively.
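Under the assumption that each sub-intermediate feature map has first been reduced to a per-frame feature vector (for example by global pooling), the following sketch shows how the cell state after the last frame, or the hidden states of the last M frames, could be taken as the inter-frame information; torch.nn.LSTM is used here as one possible recurrent network, and all shapes are illustrative.

```python
import torch
import torch.nn as nn

feat_size, hidden_size, num_frames = 64, 128, 7   # illustrative sizes
lstm = nn.LSTM(input_size=feat_size, hidden_size=hidden_size, batch_first=True)

# Per-frame features derived from the sub-intermediate feature maps (e.g., by global pooling).
frame_feats = torch.randn(1, num_frames, feat_size)

outputs, (h_n, c_n) = lstm(frame_feats)

# Option 1: the cell state after the last image frame as the inter-frame information.
inter_frame_info = c_n[-1]               # shape: (1, hidden_size)

# Option 2: the hidden states of the last M image frames as the inter-frame information.
M = 3
inter_frame_info_m = outputs[:, -M:, :]  # shape: (1, M, hidden_size)
```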
804. Determine a target loss according to the first inter-frame information and the second inter-frame information, and perform knowledge distillation on the second video processing network based on the target loss and the first video processing network to obtain a trained second video processing network, where the target loss is related to the difference between the first inter-frame information and the second inter-frame information.
In the embodiment of the application, after the first inter-frame information and the second inter-frame information are obtained, a target loss for performing knowledge distillation can be constructed based on the difference between the first inter-frame information and the second inter-frame information, and the second video processing network is subjected to knowledge distillation based on the target loss and the first video processing network so as to obtain a trained second video processing network.
Specifically, a constraint is imposed between the first inter-frame information corresponding to the teacher model (the first video processing network) and the second inter-frame information corresponding to the student model (the second video processing network), namely:
LTD = Ld(CT, CS)
where Ld denotes a norm constraint, which may be, but is not limited to, the L2-norm distance, LTD is the target loss, CT is the first inter-frame information, and CS is the second inter-frame information.
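A sketch of the time-domain loss LTD = Ld(CT, CS), with Ld taken as the L2-norm distance and under the assumption that CT and CS are tensors of the same shape; detaching the teacher-side tensor so that only the student is updated is an implementation assumption.

```python
import torch

def temporal_distillation_loss(c_teacher: torch.Tensor, c_student: torch.Tensor) -> torch.Tensor:
    """L_TD = L_d(C_T, C_S), here using the L2-norm distance as the norm constraint."""
    return torch.norm(c_teacher - c_student, p=2)

# Example: inter-frame information (e.g., LSTM cell states) from the teacher and the student.
c_t = torch.randn(1, 128)
c_s = torch.randn(1, 128)
loss_td = temporal_distillation_loss(c_t.detach(), c_s)  # the teacher side is not updated
```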
It is to be appreciated that the loss determined based on the difference between the first inter-frame information and the second inter-frame information may also be referred to as a time domain loss.
In one implementation, the target loss may be related to a difference between spatial information of the first intermediate feature map output and the second intermediate feature map output, in addition to a difference between the first inter-frame information and the second inter-frame information, as described in detail below:
In the embodiment of the present application, the video sample may include a plurality of frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, and each of the first sub-intermediate feature map and each of the second sub-intermediate feature map is processed to obtain first spatial information of each of the first sub-intermediate feature maps and second spatial information of each of the second sub-intermediate feature maps, where the first spatial information and the second spatial information are used to represent feature distribution of the feature maps; determining a target loss based on the first and second inter-frame information and the first and second spatial information, the target loss being related to a difference between the first and second inter-frame information and a difference between the first and second spatial information.
Wherein the spatial information is used to represent a feature distribution of the feature map, which may comprise rich image content and represent image features of the corresponding image frames, such as frequency features, texture detail features, etc.
In an alternative implementation, each first sub-intermediate feature map may be summed over its channels to obtain the first spatial information, and each second sub-intermediate feature map may be summed over its channels to obtain the second spatial information; spatial information obtained by summing statistics over the channels may also be referred to as a spatial attention map. Specifically, the first spatial information may be obtained by summing the squares of each first sub-intermediate feature map over the channel dimension, and the second spatial information may be obtained by summing the squares of each second sub-intermediate feature map over the channel dimension.
It should be understood that the above-mentioned square sum operation is only an illustration, and the first spatial information and the second spatial information may be calculated by other operations in practical application, which is not limited herein.
By the above method, the first spatial information corresponding to the first video processing model and the second spatial information corresponding to the second video processing model can be obtained, and a constraint can be imposed between the first spatial information and the second spatial information; the target loss can then include a loss determined based on the difference between the first spatial information and the second spatial information. It should be appreciated that the loss determined based on the difference between the first spatial information and the second spatial information may also be referred to as the spatial-domain loss. For the calculation of the spatial-domain and time-domain losses, reference may be made to fig. 14.
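A sketch of the spatial information (spatial attention map) computed as the channel-wise sum of squares of a sub-intermediate feature map, and of a spatial-domain loss built from the difference between the teacher's and student's spatial information; using an L2 distance for that difference is an assumption. Note that after summing over the channel dimension, the teacher and student feature maps may have different channel counts yet still yield attention maps of the same spatial size.

```python
import torch

def spatial_attention_map(feat: torch.Tensor) -> torch.Tensor:
    """Sum of squares over the channel dimension of a (N, C, H, W) feature map."""
    return (feat ** 2).sum(dim=1)  # shape: (N, H, W)

def spatial_distillation_loss(feat_teacher: torch.Tensor, feat_student: torch.Tensor) -> torch.Tensor:
    s_t = spatial_attention_map(feat_teacher.detach())  # first spatial information
    s_s = spatial_attention_map(feat_student)           # second spatial information
    return torch.norm(s_t - s_s, p=2)

# Teacher with 64 channels, student with 16 channels (illustrative sizes).
loss_sd = spatial_distillation_loss(torch.randn(1, 64, 32, 32), torch.randn(1, 16, 32, 32))
```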
In one implementation, the target loss may be related to a difference between a first video processing result obtained by the first video processing model processing and a true value (ground truth) corresponding to the video sample, in addition to a difference between the first inter-frame information and the second inter-frame information, which will be described in detail below:
In the embodiment of the present application, the video samples may be processed by the first video processing network and the second video processing network, so as to obtain a first intermediate feature map output of the first video processing network, a first video processing result output by the first video processing network, and a second intermediate feature map output by the second video processing network, respectively, and a true value (ground truth) corresponding to the video samples may be obtained; determining a target loss based on the first and second inter-frame information and the first video processing result and the true value, the target loss being related to a difference between the first and second inter-frame information and a difference between the first video processing result and the true value. Taking the first video processing network and the second video processing network as examples for implementing the video enhancement task, the true value (ground truth) corresponding to the video sample may be understood as a video sample with improved video quality, and in one implementation, the true value (ground truth) corresponding to the video sample may also be preset, or may be obtained after the video sample is subjected to image enhancement through the first video processing network, which is not limited herein.
By the method, the first video processing result output by the first video processing model can be obtained, then the first video processing result and the true value (ground truth) corresponding to the video sample can be constrained, and further the target loss can comprise a loss determined based on the difference between the first video processing result and the true value (ground truth) corresponding to the video sample. It should be appreciated that the loss determined based on the difference between the first video processing result and the true value (ground truth) corresponding to the video sample may also be referred to as a reconstruction loss.
Referring to fig. 15, in one implementation, a target loss may be constructed based on the difference between the first inter-frame information and the second inter-frame information, the difference between the first spatial information and the second spatial information, and the difference between the first video processing result and the true value (ground truth) corresponding to the video sample. Specifically, the target loss may take the form of the following formula:
L = Lrec + λ1·LSD + λ2·LTD
where L is the target loss, λ1 and λ2 are hyper-parameters, LSD is the spatial-domain loss, LTD is the temporal-domain loss, and Lrec is the reconstruction loss.
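A minimal sketch of the combined objective, assuming λ1 weights the spatial-domain term and λ2 the temporal-domain term, and assuming an L1 reconstruction loss; both assumptions, and the default weight values, are illustrative and not fixed by the formula above.

```python
import torch
import torch.nn.functional as F

def target_loss(student_output: torch.Tensor, ground_truth: torch.Tensor,
                l_sd: torch.Tensor, l_td: torch.Tensor,
                lambda1: float = 0.1, lambda2: float = 0.1) -> torch.Tensor:
    """L = Lrec + lambda1 * LSD + lambda2 * LTD (weights are assumed values)."""
    l_rec = F.l1_loss(student_output, ground_truth)   # reconstruction loss Lrec
    return l_rec + lambda1 * l_sd + lambda2 * l_td
```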
The embodiment of the present application provides a model training method, comprising: acquiring a video sample, a first video processing network, and a second video processing network, wherein the first video processing network is a teacher model and the second video processing network is a student model to be trained; processing the video sample through the first video processing network and the second video processing network to respectively obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network; processing the first intermediate feature map output and the second intermediate feature map output respectively through a recurrent neural network to obtain first inter-frame information and second inter-frame information, wherein the first inter-frame information and the second inter-frame information are used for representing feature change relationships among the image frames of the video sample; and determining a target loss according to the first inter-frame information and the second inter-frame information, and performing knowledge distillation on the second video processing network based on the target loss and the first video processing network to obtain a trained second video processing network, wherein the target loss is related to the difference between the first inter-frame information and the second inter-frame information. In this way, on the premise that the model structure is not changed, inter-frame information is added to the target loss used for knowledge distillation, so that the teacher model's ability to identify inter-frame information and to exploit it for video enhancement is transferred to the student model, thereby improving the video quality of the enhanced video obtained after the distilled student model performs the video enhancement task.
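To make the overall flow concrete, the following hypothetical PyTorch-style training step sketches the method summarized above. The assumption that the teacher and student networks return (per-frame feature maps, video output), the global average pooling used to feed feature maps into a standard LSTM, the frozen teacher, and the loss weight are all illustrative choices, not details prescribed by the embodiment (the spatial-domain term is omitted here for brevity).

```python
import torch
import torch.nn.functional as F
from torch import nn

def interframe_info(frame_feats, lstm: nn.LSTM) -> torch.Tensor:
    """frame_feats: list of T per-frame feature maps, each of shape (N, C, H, W).
    Pools each map to an (N, C) vector, runs the sequence through the LSTM and
    returns the final cell state as the inter-frame information."""
    seq = torch.stack([f.mean(dim=(2, 3)) for f in frame_feats], dim=0)  # (T, N, C)
    _, (_, c_n) = lstm(seq)                 # c_n: (num_layers, N, hidden)
    return c_n[-1]

def distillation_step(teacher, student, lstm, optimizer, video, ground_truth,
                      lambda_td: float = 0.1) -> float:
    with torch.no_grad():                   # the teacher model is kept fixed
        t_feats, _ = teacher(video)
    s_feats, s_out = student(video)
    l_td = F.mse_loss(interframe_info(s_feats, lstm),
                      interframe_info(t_feats, lstm))    # inter-frame (temporal) loss
    l_rec = F.l1_loss(s_out, ground_truth)               # reconstruction loss
    loss = l_rec + lambda_td * l_td         # target loss (spatial term omitted)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```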
The advantageous effects of the embodiments of the present application are described below based on experimental results.
According to the flow provided by the embodiment of the present application, the public standard Vimeo90K dataset is used as the training set, and Vimeo90K-Test and the Vid4 test set are used as test sets. Specifically, using EDVR as the teacher model, the flow provided by the embodiment of the present application is tested on the VDSR, VESPCN, VSRNet, and FastDVDnet models on the Vid4 and Vimeo90K-Test video datasets, respectively. Referring to table 1 and table 2, the tables give the quantitative evaluation results: peak signal-to-noise ratio (PSNR) and structural similarity (structural similarity, SSIM) index values. It can be seen that the embodiment of the present application achieves a certain improvement over the results obtained without distillation. The flower-shaped markers denote spatial-domain distillation based on autoencoders and statistics; the embodiment of the present application achieves improvements of 0.17 dB and 0.55 dB over that method on Vid4 and Vimeo90K-Test, respectively.
Table 1 PSNR quantization index results (marked entries denote the method provided by the embodiments of the present application)
Table 2 SSIM quantization index results (marked entries denote the method provided by the embodiments of the present application)
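For reference, the PSNR values quoted in the tables follow the standard definition sketched below; this is the generic metric, not code taken from the embodiment.

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))
```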
As shown in fig. 16 and fig. 17, the model obtained by the model training method of the embodiment of the present application (the trained second video processing model) has better detail and texture restoration capability, for example in the restoration of the building windows and the tablecloth grid in the figures. Fig. 18 is a comparison of inter-frame consistency, where STD denotes the processing result of the model obtained by the model training method provided by the embodiment of the present application. As shown in fig. 19, the distillation effect of the model training method is tested on student models with different computation amounts/parameter counts, where the computation amount/parameter count is reduced by modifying the number of convolution channels of the student model; compared with the results without distillation, the model training method provided by the embodiment of the present application yields improvements for student models across different computation amounts/parameter counts.
Referring to fig. 20, fig. 20 is a schematic diagram of a model training apparatus 2000 according to an embodiment of the present application, and as shown in fig. 20, the model training apparatus 2000 according to the present application includes:
an acquiring module 2001, configured to acquire a video sample, a first video processing network, and a second video processing network, where the first video processing network is a teacher model, and the second video processing network is a student model to be trained;
for a specific description of the acquisition module 2001, reference may be made to the description of step 801, which is not repeated here.
The video processing module 2002 is configured to process the video samples through the first video processing network and the second video processing network, to obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network, respectively;
For a specific description of the video processing module 2002, reference may be made to the description of step 802, which is not repeated here.
A feature map processing module 2003, configured to process the first intermediate feature map output and the second intermediate feature map output, respectively, to obtain first inter-frame information and second inter-frame information, where the first inter-frame information and the second inter-frame information are used to represent feature change relationships between image frames of the video sample;
for a specific description of the feature map processing module 2003, reference may be made to the description of step 803, which is not repeated here.
A knowledge distillation module 2004 for determining a target loss based on the first inter-frame information and the second inter-frame information, and performing knowledge distillation on the second video processing network based on the target loss and the first video processing network to obtain a trained second video processing network, wherein the target loss relates to a difference between the first inter-frame information and the second inter-frame information.
For a specific description of the knowledge distillation module 2004, reference may be made to the description of step 804, which is not repeated here.
In one possible implementation, the feature map processing module is configured to process the first intermediate feature map output and the second intermediate feature map output through a recurrent neural network, respectively.
In one possible implementation, the first inter-frame information and the second inter-frame information are hidden layer states (hidden state) of the recurrent neural network.
In one possible implementation, the video sample includes a plurality of frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the first inter-frame information includes M hidden layer states obtained by the recurrent neural network processing the first sub-intermediate feature maps corresponding to the last M frames of images in the plurality of frames of images, and the second inter-frame information includes M hidden layer states obtained by the recurrent neural network processing the second sub-intermediate feature maps corresponding to the last M frames of images in the plurality of frames of images.
In one possible implementation, the recurrent neural network is a long-short-term memory LSTM network, and the first inter-frame information and the second inter-frame information are cell states of the LSTM output.
In one possible implementation, the video sample includes a plurality of frame images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each frame image in the plurality of frame images, the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each frame image in the plurality of frame images, the first inter-frame information is a cell state obtained by the LSTM network processing a first sub-intermediate feature map corresponding to a last frame image in the plurality of frame images, and the second inter-frame information is a cell state obtained by the LSTM network processing a second sub-intermediate feature map corresponding to the last frame image in the plurality of frame images.
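A hypothetical sketch of the two variants described above (taking the cell state after the last frame, or the hidden layer states of the last M frames, as the inter-frame information); as in the earlier sketch, pooling each per-frame feature map to a vector before the LSTM is an illustrative assumption.

```python
import torch
from torch import nn

def interframe_states(frame_feats, lstm: nn.LSTM, m: int = 3):
    """frame_feats: list of T per-frame feature maps, each of shape (N, C, H, W)."""
    seq = torch.stack([f.mean(dim=(2, 3)) for f in frame_feats], dim=0)  # (T, N, C)
    outputs, (h_n, c_n) = lstm(seq)        # outputs: (T, N, hidden)
    last_cell_state = c_n[-1]              # cell state after the last frame
    last_m_hidden = outputs[-m:]           # hidden layer states of the last M frames
    return last_cell_state, last_m_hidden
```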
In one possible implementation, the video sample includes a plurality of frames of images, the first intermediate feature map output includes a first sub-intermediate feature map corresponding to each of the plurality of frames of images, and the second intermediate feature map output includes a second sub-intermediate feature map corresponding to each of the plurality of frames of images, the apparatus further comprising:
The information statistics module is used for processing each first sub-intermediate feature map and each second sub-intermediate feature map to obtain first spatial information of each first sub-intermediate feature map and second spatial information of each second sub-intermediate feature map, wherein the first spatial information and the second spatial information are used for representing feature distribution of the feature map;
The knowledge distillation module is configured to determine a target loss based on the first and second inter-frame information and the first and second spatial information, the target loss being related to a difference between the first and second inter-frame information and a difference between the first and second spatial information.
In a possible implementation, the first spatial information is a first spatial attention map, the second spatial information is a second spatial attention map, and the information statistics module is configured to map the first intermediate feature map output and the second intermediate feature map output based on a spatial attention mechanism, respectively, to obtain the first spatial attention map and the second spatial attention map, respectively.
In one possible implementation, the video processing module is configured to process the video samples through the first video processing network and the second video processing network, to obtain a first intermediate feature map output of the first video processing network, a first video processing result output by the first video processing network, and a second intermediate feature map output by the second video processing network, respectively;
The knowledge distillation module is configured to determine a target loss according to the first inter-frame information and the second inter-frame information, and the first video processing result and the true value, where the target loss is related to a difference between the first inter-frame information and the second inter-frame information, and a difference between the first video processing result and the true value.
In one possible implementation, the first video processing network and the second video processing network are used to implement video enhancement tasks.
In one possible implementation, the video enhancement task is a video denoising task, a video defogging task, a super resolution task, or a high dynamic range task.
In one possible implementation, the apparatus further includes: a deblurring module, configured to perform deblurring processing on the first intermediate feature map output and the second intermediate feature map output respectively, before the first intermediate feature map output and the second intermediate feature map output are respectively processed through the recurrent neural network, so as to obtain a deblurred first intermediate feature map and a deblurred second intermediate feature map;
the feature map processing module is configured to process the deblurred first intermediate feature map and the deblurred second intermediate feature map respectively through the recurrent neural network.
Next, referring to fig. 21, fig. 21 is a schematic structural diagram of an execution device provided by an embodiment of the present application. The execution device 2100 may specifically be a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a server, or the like, which is not limited herein. The execution device 2100 may run the trained second video processing network obtained by the embodiment corresponding to fig. 8. Specifically, the execution device 2100 includes: a receiver 2101, a transmitter 2102, a processor 2103, and a memory 2104 (the number of processors 2103 in the execution device 2100 may be one or more, and one processor is taken as an example in fig. 21), where the processor 2103 may include an application processor 21031 and a communication processor 21032. In some embodiments of the application, the receiver 2101, the transmitter 2102, the processor 2103, and the memory 2104 may be connected by a bus or in another manner.
The memory 2104 may include a read-only memory and a random access memory, and provides instructions and data to the processor 2103. A portion of the memory 2104 may also include a non-volatile random access memory (non-volatile random access memory, NVRAM). The memory 2104 stores operating instructions executable by the processor, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
The processor 2103 controls the operation of the execution device. In a specific application, the individual components of the execution device are coupled together by a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, etc. For clarity of illustration, however, the various buses are referred to in the figures as bus systems.
The methods disclosed in the embodiments of the present application described above may be applied to the processor 2103 or implemented by the processor 2103. The processor 2103 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be completed by an integrated logic circuit of hardware in the processor 2103 or by instructions in the form of software. The processor 2103 may be a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor or a microcontroller, a vision processor (vision processing unit, VPU), a tensor processor (tensor processing unit, TPU), or another processor suitable for AI operations, and may further include an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 2103 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 2104, and the processor 2103 reads the information in the memory 2104 and completes the steps of the above method in combination with its hardware.
The receiver 2101 may be configured to receive input digital or character information and to generate signal inputs related to the relevant settings and function control of the execution device. The transmitter 2102 may be configured to output digital or character information through a first interface; the transmitter 2102 may be further configured to send instructions to a disk group through the first interface to modify data in the disk group; the transmitter 2102 may also include a display device such as a display screen.
The execution device may obtain the trained second video processing network produced by the model training method in the embodiment corresponding to fig. 8, and perform model inference.
Referring to fig. 22, fig. 22 is a schematic structural diagram of a training device provided by an embodiment of the present application. Specifically, the training device 2200 is implemented by one or more servers, may vary considerably depending on configuration or performance, and may include one or more central processing units (central processing units, CPU) 2219 (e.g., one or more processors), a memory 2232, and one or more storage media 2230 (e.g., one or more mass storage devices) storing application programs 2242 or data 2244. The memory 2232 and the storage medium 2230 may be transitory or persistent storage. The program stored in the storage medium 2230 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the training device. Furthermore, the central processing unit 2219 may be configured to communicate with the storage medium 2230 and execute, on the training device 2200, the series of instruction operations in the storage medium 2230.
The training device 2200 may also include one or more power supplies 2226, one or more wired or wireless network interfaces 2250, one or more input/output interfaces 2258, and one or more operating systems 2241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Specifically, the training apparatus may perform the model training method in the corresponding embodiment of fig. 8.
The model training apparatus 2000 depicted in fig. 20 may be a module in the training device 2200, and the processor in the training device 2200 may perform the model training method performed by the model training apparatus 2000.
Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
The embodiments of the present application also provide a computer-readable storage medium in which a program for signal processing is stored; when the program is run on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
The execution device, training device or terminal device provided in the embodiment of the present application may be a chip, where the chip includes: a processing unit, which may be, for example, a processor, and a communication unit, which may be, for example, an input/output interface, pins or circuitry, etc. The processing unit may execute the computer-executable instructions stored in the storage unit to cause the chip in the execution device to perform the data processing method described in the above embodiment, or to cause the chip in the training device to perform the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, or the like, and the storage unit may also be a storage unit in the wireless access device side located outside the chip, such as a read-only memory (ROM) or other type of static storage device that may store static information and instructions, a random access memory (random access memory, RAM), or the like.
It should be further noted that the apparatus embodiments described above are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the application, the connection relationship between modules indicates that they have a communication connection, which may be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus the necessary general purpose hardware, or of course by special purpose hardware including application specific integrated circuits, special purpose CPUs, special purpose memories, special purpose components, and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software program implementation is the preferred implementation in most cases. Based on such understanding, the technical solution of the present application, or the part contributing to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk of a computer, including several instructions for causing a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods according to the embodiments of the present application.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a training device or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid state disk (solid state disk, SSD)), or the like.
Claims (25)
1. A method of model training, the method comprising:
Acquiring a video sample, a first video processing network and a second video processing network, wherein the first video processing network is a teacher model, and the second video processing network is a student model to be trained;
processing the video samples through the first video processing network and the second video processing network to respectively obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network;
Processing the first intermediate feature map output and the second intermediate feature map output respectively to obtain first inter-frame information and second inter-frame information, wherein the first inter-frame information and the second inter-frame information are used for representing feature change relations among all image frames of the video sample, the feature change relations comprise continuous information and change information among frames of all the image frames, the continuous information is a relation among areas with static inter-frame states, and the change information is a relation among objects with moving inter-frame states;
determining a target loss according to the first inter-frame information and the second inter-frame information, and performing knowledge distillation on the second video processing network based on the target loss and the first video processing network to acquire a trained second video processing network, wherein the target loss is related to the difference between the first inter-frame information and the second inter-frame information.
2. The method of claim 1, wherein the processing the first intermediate feature map output and the second intermediate feature map output, respectively, comprises:
And respectively processing the first intermediate feature map output and the second intermediate feature map output through a recurrent neural network.
3. The method of claim 2, wherein the first inter-frame information and the second inter-frame information are hidden layer states (hidden state) of the recurrent neural network.
4. A method according to claim 2 or 3, wherein the recurrent neural network is a long-short-term memory LSTM network, and the first and second inter-frame information are cell states (cell state) of the LSTM output.
5. The method of claim 4, wherein the video sample comprises a plurality of frames of images, the first intermediate feature map output comprises a first sub-intermediate feature map corresponding to each of the plurality of frames of images, the second intermediate feature map output comprises a second sub-intermediate feature map corresponding to each of the plurality of frames of images, the first inter-frame information is a cell state obtained by the LSTM network processing a first sub-intermediate feature map corresponding to a last of the plurality of frames of images, and the second inter-frame information is a cell state obtained by the LSTM network processing a second sub-intermediate feature map corresponding to the last of the plurality of frames of images.
6. The method of claim 4, wherein the video sample comprises a plurality of frames of images, the first intermediate feature map output comprises a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the second intermediate feature map output comprises a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the first inter-frame information comprises M hidden layer states obtained by the recurrent neural network processing the first sub-intermediate feature maps corresponding to the last M frames of images in the plurality of frames of images, and the second inter-frame information is M hidden layer states obtained by the recurrent neural network processing the second sub-intermediate feature maps corresponding to the last M frames of images in the plurality of frames of images.
7. A method according to any one of claims 1 to 3, wherein the video sample comprises a plurality of frames of images, the first intermediate feature map output comprises a first sub-intermediate feature map corresponding to each of the plurality of frames of images, and the second intermediate feature map output comprises a second sub-intermediate feature map corresponding to each of the plurality of frames of images, the method further comprising:
Processing each first sub-intermediate feature map and each second sub-intermediate feature map to obtain first spatial information of each first sub-intermediate feature map and second spatial information of each second sub-intermediate feature map, wherein the first spatial information and the second spatial information are used for representing feature distribution of the feature map;
The determining a target loss according to the first inter-frame information and the second inter-frame information includes:
determining a target loss based on the first and second inter-frame information and the first and second spatial information, the target loss being related to a difference between the first and second inter-frame information and a difference between the first and second spatial information.
8. The method of claim 7, wherein the first spatial information is a first spatial attention map and the second spatial information is a second spatial attention map, and wherein the processing of each first sub-intermediate feature map and each second sub-intermediate feature map comprises:
and mapping the first intermediate feature map output and the second intermediate feature map output respectively based on a spatial attention mechanism to obtain the first spatial attention map and the second spatial attention map respectively.
9. A method according to any one of claims 1 to 3, wherein said processing said video samples through said first video processing network and said second video processing network comprises:
Processing the video samples through the first video processing network and the second video processing network to respectively obtain a first intermediate feature map output of the first video processing network, a first video processing result output by the first video processing network and a second intermediate feature map output by the second video processing network;
The determining a target loss according to the first inter-frame information and the second inter-frame information includes:
Acquiring a true value (ground truth) corresponding to the video sample;
Determining a target loss based on the first and second inter-frame information and the first video processing result and the true value, the target loss being related to a difference between the first and second inter-frame information and a difference between the first video processing result and the true value.
10. A method according to any one of claims 1 to 3, wherein the first video processing network and the second video processing network are adapted to perform video enhancement tasks.
11. A method according to any one of claims 1 to 3, wherein prior to processing the first intermediate profile output and the second intermediate profile output respectively, the method further comprises:
Respectively performing deblurring processing on the first intermediate feature map output and the second intermediate feature map output to obtain a deblurred first intermediate feature map and a deblurred second intermediate feature map;
The processing the first intermediate feature map output and the second intermediate feature map output, respectively, includes:
and respectively processing the first intermediate feature map after the deblurring processing and the second intermediate feature map after the deblurring processing.
12. A model training apparatus, the apparatus comprising:
The system comprises an acquisition module, a training module and a training module, wherein the acquisition module is used for acquiring a video sample, a first video processing network and a second video processing network, the first video processing network is a teacher model, and the second video processing network is a student model to be trained;
The video processing module is used for processing the video samples through the first video processing network and the second video processing network to respectively obtain a first intermediate feature map output of the first video processing network and a second intermediate feature map output of the second video processing network;
The feature map processing module is used for respectively processing the first intermediate feature map output and the second intermediate feature map output to respectively obtain first inter-frame information and second inter-frame information, wherein the first inter-frame information and the second inter-frame information are used for representing feature change relations among the image frames of the video sample, the feature change relations comprise continuous information and change information among the frames of the image frames, the continuous information is a relation among areas with static inter-frame states, and the change information is a relation among objects with moving inter-frame states;
And the knowledge distillation module is used for determining target loss according to the first inter-frame information and the second inter-frame information, and carrying out knowledge distillation on the second video processing network based on the target loss and the first video processing network so as to acquire a trained second video processing network, wherein the target loss is related to the difference between the first inter-frame information and the second inter-frame information.
13. The apparatus of claim 12, wherein the profile processing module is configured to process the first intermediate profile output and the second intermediate profile output, respectively, via a recurrent neural network.
14. The apparatus of claim 13, wherein the first inter-frame information and the second inter-frame information are hidden layer states (hidden state) of the recurrent neural network.
15. The apparatus of claim 13 or 14, wherein the recurrent neural network is a long-short-term memory LSTM network, and the first and second inter-frame information are cell states of the LSTM output.
16. The apparatus of claim 15, wherein the video sample comprises a plurality of frames of images, the first intermediate feature map output comprises a first sub-intermediate feature map corresponding to each of the plurality of frames of images, the second intermediate feature map output comprises a second sub-intermediate feature map corresponding to each of the plurality of frames of images, the first inter-frame information is a cell state obtained by the LSTM network processing a first sub-intermediate feature map corresponding to a last of the plurality of frames of images, and the second inter-frame information is a cell state obtained by the LSTM network processing a second sub-intermediate feature map corresponding to a last of the plurality of frames of images.
17. The apparatus of claim 15, wherein the video sample comprises a plurality of frames of images, the first intermediate feature map output comprises a first sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the second intermediate feature map output comprises a second sub-intermediate feature map corresponding to each frame of image in the plurality of frames of images, the first inter-frame information comprises M hidden layer states obtained by the recurrent neural network processing the first sub-intermediate feature maps corresponding to the last M frames of images in the plurality of frames of images, and the second inter-frame information is M hidden layer states obtained by the recurrent neural network processing the second sub-intermediate feature maps corresponding to the last M frames of images in the plurality of frames of images.
18. The apparatus of any of claims 12 to 14, wherein the video sample comprises a plurality of frames of images, the first intermediate feature map output comprises a first sub-intermediate feature map corresponding to each of the plurality of frames of images, and the second intermediate feature map output comprises a second sub-intermediate feature map corresponding to each of the plurality of frames of images, the apparatus further comprising:
The information statistics module is used for processing each first sub-intermediate feature map and each second sub-intermediate feature map to obtain first spatial information of each first sub-intermediate feature map and second spatial information of each second sub-intermediate feature map, wherein the first spatial information and the second spatial information are used for representing feature distribution of the feature map;
The knowledge distillation module is configured to determine a target loss based on the first and second inter-frame information and the first and second spatial information, the target loss being related to a difference between the first and second inter-frame information and a difference between the first and second spatial information.
19. The apparatus of claim 18, wherein the first spatial information is a first spatial attention map and the second spatial information is a second spatial attention map, and wherein the information statistics module is configured to map the first intermediate feature map output and the second intermediate feature map output, respectively, based on a spatial attention mechanism, to obtain the first spatial attention map and the second spatial attention map, respectively.
20. The apparatus according to any one of claims 12 to 14, wherein the video processing module is configured to process the video samples through the first video processing network and the second video processing network to obtain a first intermediate feature map output of the first video processing network, a first video processing result output by the first video processing network, and a second intermediate feature map output by the second video processing network, respectively;
The knowledge distillation module is used for obtaining a true value (ground truth) corresponding to the video sample; determining a target loss based on the first and second inter-frame information and the first video processing result and the true value, the target loss being related to a difference between the first and second inter-frame information and a difference between the first video processing result and the true value.
21. The apparatus according to any of claims 12 to 14, wherein the first video processing network and the second video processing network are configured to perform video enhancement tasks.
22. The apparatus according to any one of claims 12 to 14, further comprising: the deblurring module is used for respectively carrying out deblurring processing on the first intermediate feature map output and the second intermediate feature map output before the first intermediate feature map output and the second intermediate feature map output are respectively processed, so as to obtain the first intermediate feature map after deblurring processing and the second intermediate feature map after deblurring processing;
the feature map processing module is configured to process the deblurred first intermediate feature map and the deblurred second intermediate feature map respectively.
23. A model training apparatus, the apparatus comprising a memory and a processor; the memory stores code, the processor being configured to retrieve the code and to perform the method of any of claims 1 to 11.
24. A computer storage medium storing one or more instructions which, when executed by one or more computers, cause the one or more computers to implement the method of any one of claims 1 to 11.
25. A computer program product comprising code which, when executed, is operable to implement the method of any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110292062.4A CN113011562B (en) | 2021-03-18 | 2021-03-18 | Model training method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110292062.4A CN113011562B (en) | 2021-03-18 | 2021-03-18 | Model training method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113011562A CN113011562A (en) | 2021-06-22 |
CN113011562B true CN113011562B (en) | 2024-07-26 |
Family
ID=76402492
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110292062.4A Active CN113011562B (en) | 2021-03-18 | 2021-03-18 | Model training method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113011562B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113536970A (en) * | 2021-06-25 | 2021-10-22 | 华为技术有限公司 | Training method of video classification model and related device |
CN113516235B (en) * | 2021-07-13 | 2024-10-18 | 南京大学 | Deformable convolution accelerator and deformable convolution acceleration method |
CN113821673B (en) * | 2021-10-09 | 2023-05-05 | 成都统信软件技术有限公司 | Picture processing method, computing device and readable storage medium |
CN113963176B (en) * | 2021-10-28 | 2023-07-07 | 北京百度网讯科技有限公司 | Model distillation method and device, electronic equipment and storage medium |
CN114339409B (en) * | 2021-12-09 | 2023-06-20 | 腾讯科技(上海)有限公司 | Video processing method, device, computer equipment and storage medium |
CN114648723B (en) * | 2022-04-28 | 2024-08-02 | 之江实验室 | Action normalization detection method and device based on time consistency comparison learning |
TWI843109B (en) * | 2022-05-24 | 2024-05-21 | 鴻海精密工業股份有限公司 | Method for identifying medical image, computer device and computer readable storage medium |
CN116977200A (en) * | 2023-04-18 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Processing method and device of video denoising model, computer equipment and storage medium |
CN117274869B (en) * | 2023-09-25 | 2024-03-26 | 北方工业大学 | Cell deformation dynamic classification method and system based on deformation field extraction |
CN117892096B (en) * | 2024-03-14 | 2024-05-14 | 中国海洋大学 | Small sample ocean sound velocity profile forecasting method based on transfer learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111598213A (en) * | 2020-04-01 | 2020-08-28 | 北京迈格威科技有限公司 | Network training method, data identification method, device, equipment and medium |
CN111882031A (en) * | 2020-06-30 | 2020-11-03 | 华为技术有限公司 | Neural network distillation method and device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11636337B2 (en) * | 2019-03-22 | 2023-04-25 | Royal Bank Of Canada | System and method for knowledge distillation between neural networks |
CN110070067B (en) * | 2019-04-29 | 2021-11-12 | 北京金山云网络技术有限公司 | Video classification method, training method and device of video classification method model and electronic equipment |
CN111401406B (en) * | 2020-02-21 | 2023-07-18 | 华为技术有限公司 | Neural network training method, video frame processing method and related equipment |
CN112070677B (en) * | 2020-09-18 | 2024-04-02 | 中国科学技术大学 | Video space-time super-resolution enhancement method based on time slicing |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111598213A (en) * | 2020-04-01 | 2020-08-28 | 北京迈格威科技有限公司 | Network training method, data identification method, device, equipment and medium |
CN111882031A (en) * | 2020-06-30 | 2020-11-03 | 华为技术有限公司 | Neural network distillation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN113011562A (en) | 2021-06-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113011562B (en) | Model training method and device | |
WO2022042713A1 (en) | Deep learning training method and apparatus for use in computing device | |
WO2021043112A1 (en) | Image classification method and apparatus | |
CN112446834B (en) | Image enhancement method and device | |
CN111402130B (en) | Data processing method and data processing device | |
CN111914997B (en) | Method for training neural network, image processing method and device | |
US12062158B2 (en) | Image denoising method and apparatus | |
WO2021018163A1 (en) | Neural network search method and apparatus | |
CN111797881B (en) | Image classification method and device | |
WO2022001805A1 (en) | Neural network distillation method and device | |
CN111291809B (en) | Processing device, method and storage medium | |
CN110222717B (en) | Image processing method and device | |
WO2022134971A1 (en) | Noise reduction model training method and related apparatus | |
CN110309856A (en) | Image classification method, the training method of neural network and device | |
CN111368972B (en) | Convolutional layer quantization method and device | |
CN111797882B (en) | Image classification method and device | |
CN112070664B (en) | Image processing method and device | |
CN113807399B (en) | Neural network training method, neural network detection method and neural network training device | |
CN113065645B (en) | Twin attention network, image processing method and device | |
CN111695673B (en) | Method for training neural network predictor, image processing method and device | |
CN113191489B (en) | Training method of binary neural network model, image processing method and device | |
WO2024002211A1 (en) | Image processing method and related apparatus | |
CN113807183A (en) | Model training method and related equipment | |
CN112464930A (en) | Target detection network construction method, target detection method, device and storage medium | |
CN113066018A (en) | Image enhancement method and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |