CN111402130A - Data processing method and data processing device

Info

Publication number
CN111402130A
Authority
CN
China
Prior art keywords
frame
frames
groups
data processing
target
Prior art date
Legal status
Granted
Application number
CN202010110945.4A
Other languages
Chinese (zh)
Other versions
CN111402130B (en)
Inventor
李松江
磯部骏
贾旭
袁善欣
格雷戈里·斯拉堡
许春景
田奇
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010110945.4A priority Critical patent/CN111402130B/en
Publication of CN111402130A publication Critical patent/CN111402130A/en
Application granted granted Critical
Publication of CN111402130B publication Critical patent/CN111402130B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007 Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a data processing method, which is applied to the field of artificial intelligence, in particular to an image processing technology, and comprises the following steps: acquiring a sequence of frames, frames in the sequence of frames having a first resolution; determining at least two frame groups from the frame sequence, wherein the frame groups comprise a first target frame and at least two adjacent frames of the first target frame, the first target frame is any one frame in the frame sequence, and the adjacent frames are frames except the first target frame in the frame sequence; determining the characteristics of each frame group in at least two frame groups through a three-dimensional convolution neural network, wherein the size of a convolution kernel in the three-dimensional convolution neural network in a time dimension is positively correlated with the number of frames in the frame groups; fusing the characteristics of each frame group of the at least two frame groups to determine the detail characteristics of the first target frame; and acquiring a first target frame with a second resolution according to the detail features and the first target frame, wherein the second resolution is greater than the first resolution.

Description

Data processing method and data processing device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a data processing method and a data processing apparatus.
Background
Super resolution (SR) refers to a technique for reconstructing a corresponding high-resolution image from a low-resolution image: the low-resolution image is up-sampled and amplified, and details are filled in by means of image prior knowledge and the like to generate the corresponding high-resolution image. Super-resolution technology has important application value in fields such as high-definition television, surveillance equipment, satellite imagery and medical imaging.
Video super resolution (VSR) generates a corresponding high-resolution video from a low-resolution video. The core operation that distinguishes video super resolution from image super resolution is motion compensation, that is, extracting and fusing information from multiple frames of images near a target frame and using the similarity between the adjacent frames before and after the target frame to obtain detail information, so as to generate the high-resolution video. Specifically, in the prior art, a low-resolution frame sequence of at least 7 frames containing the target frame is input into a three-dimensional (3D) convolutional neural network, and the extraction and fusion of detail information are performed implicitly through a convolution kernel of size 3×3×3. Since the size of this convolution kernel in the time dimension is 3, only 3 frames can be processed at a time; the convolution kernel then slides by 1 frame at a time to extract information and acquire the detail information, the low-resolution target frame is amplified by upsampling according to the detail information, and the high-resolution target frame is finally obtained.
In the prior art, when a 3D convolutional neural network is used for motion compensation, the size of the convolution kernel is limited and 3 frames are processed per sliding step. For example, when a 7-frame sequence is input and the 4th frame is the target frame, the window covering the 1st, 2nd and 3rd frames lacks the guidance information of the target frame, so the feature extraction is relatively blind and the feature extraction efficiency is low.
Disclosure of Invention
The embodiment of the application provides a data processing method, which is used for frame sequence super-resolution and can improve the feature extraction efficiency and reduce the calculation amount.
A first aspect of an embodiment of the present application provides a data processing method, including: the data processing device acquires a frame sequence, wherein frames in the frame sequence have a first resolution; the data processing device determines at least two frame groups from the frame sequence, wherein the frame groups comprise a first target frame and at least two adjacent frames of the first target frame, the first target frame is any one frame in the frame sequence, and the adjacent frames are frames in the frame sequence except the first target frame; the data processing apparatus determines a feature of each of the at least two frame groups through a three-dimensional convolutional neural network, the feature of each frame group indicating detail information acquired from adjacent frames within the each frame group based on the first target frame, the size of a convolutional kernel in a time dimension in the three-dimensional convolutional neural network being positively correlated with the number of frames in the frame group; the data processing apparatus fusing features of each of the at least two frame groups to determine a detail feature of the first target frame, the detail feature indicating detail information obtained from adjacent frames within the at least two frame groups based on the first target frame; and the data processing device acquires a first target frame with a second resolution according to the detail features and the first target frame, wherein the second resolution is greater than the first resolution. Optionally, the larger the size of the convolution kernel in the three-dimensional convolution neural network in the time dimension, the larger the number of frames in the frame group, and optionally, the size of the convolution kernel in the time dimension in the three-dimensional convolution neural network is equal to the number of frames in the frame group.
In the data processing method provided by the embodiment of the application, in the process of super-resolving a first target frame in a frame sequence, at least two frame groups are determined from the frame sequence, each frame group including the first target frame; the frame groups are respectively input into a three-dimensional convolutional neural network to extract group features of the frame groups, the group features are then fused to determine the detail features of the first target frame, and the first target frame with a first resolution is converted into the first target frame with a second resolution according to the detail features. In this data processing method, the frame group determined from the frame sequence includes the first target frame, the size of the convolution kernel in the three-dimensional convolutional neural network in the time dimension is positively correlated with the number of frames in the frame group, and the number of frames in the frame group is set according to the size of the convolution kernel in the time dimension, so that when the features of the frame group are extracted through the three-dimensional convolutional neural network, convolution kernel sliding can be reduced, the calculation amount is reduced, the guidance of the target frame is available, and the detail feature extraction efficiency is high.
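The following is a minimal sketch of the flow just described, written in PyTorch as an assumption; the module names, channel counts and upscaling factor are illustrative placeholders rather than the patented network.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GroupSRSketch(nn.Module):
        """Sketch of the described flow: per-group 3D-conv features -> fusion -> upsampling."""
        def __init__(self, channels=3, feat=64, num_groups=3, scale=4):
            super().__init__()
            # temporal kernel size 3 matches the 3 frames per group, so no temporal sliding is needed
            self.group_extractor = nn.Conv3d(channels, feat, kernel_size=(3, 3, 3), padding=(0, 1, 1))
            self.fuse = nn.Conv2d(feat * num_groups, channels * scale * scale, kernel_size=3, padding=1)
            self.scale = scale

        def forward(self, groups, target):
            # groups: list of num_groups tensors of shape (B, C, 3, H, W); target: (B, C, H, W)
            feats = [self.group_extractor(g).squeeze(2) for g in groups]      # each (B, feat, H, W)
            detail = F.pixel_shuffle(self.fuse(torch.cat(feats, dim=1)), self.scale)
            # add an upsampled (e.g. bicubic) target so the network only predicts detail
            base = F.interpolate(target, scale_factor=self.scale, mode='bicubic', align_corners=False)
            return base + detail                                              # high-resolution target frame

Because the temporal kernel size equals the number of frames per group, each group is reduced to a single feature map in one convolution, without temporal sliding.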
In one possible implementation form of the first aspect, the frame group includes the first target frame and two of the neighboring frames.
According to the data processing method provided by the embodiment of the application, the number of frames in each determined frame group is 3, that is, the size of the convolution kernel in the time dimension is 3, so feature extraction can be performed through a 3D convolutional neural network with a convolution kernel of size 3×3×3, and the calculation amount is small.
In one possible implementation form of the first aspect, the two neighboring frames include a first neighboring frame and a second neighboring frame, and an interval between the first neighboring frame and the first target frame in the frame sequence is equal to an interval between the second neighboring frame and the first target frame in the frame sequence.
According to the data processing method provided by the embodiment of the application, the first adjacent frame and the second adjacent frame in the frame group are symmetric about the target frame in the frame sequence. Considering the continuity of motion, for a frame sequence acquired continuously at equal time intervals, making the two adjacent frames symmetric about the target frame in the time dimension allows features to be extracted more effectively.
In one possible implementation form of the first aspect, the at least two frame groups include three frame groups.
The larger the number of frame groups, the larger the calculation amount of feature extraction; the smaller the number of frame groups, the less detail information can be acquired. The data processing method provided by the embodiment of the application extracts detail features by fusing the features of three frame groups, which strikes a good balance between providing a sufficient amount of information and reducing the calculation amount.
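As a concrete illustration of this grouping, the sketch below (an assumption, not text from the patent) builds three symmetric 3-frame groups around the target frame; the clamping of indices at the sequence border is likewise an assumed convention.

    def build_frame_groups(frames, t, offsets=(1, 2, 3)):
        """Build symmetric 3-frame groups around target index t.
        For a 7-frame sequence with t=3 this yields the index triples
        (2, 3, 4), (1, 3, 5) and (0, 3, 6)."""
        groups = []
        for d in offsets:
            left = max(t - d, 0)                      # clamp at the sequence border (assumption)
            right = min(t + d, len(frames) - 1)
            groups.append((frames[left], frames[t], frames[right]))
        return groups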
In a possible implementation manner of the first aspect, the method further includes: aligning frames within each of the at least two frame groups by the data processing apparatus, determining at least two frame groups that are aligned; the data processing apparatus determining the characteristics of each of the at least two frame groups through a three-dimensional convolutional neural network comprises: the data processing apparatus determines a characteristic of each of the at least two frame groups that are aligned through a three-dimensional convolutional neural network.
According to the data processing method provided by the embodiment of the application, before the features of the frame groups are extracted through the three-dimensional convolutional neural network, frame alignment processing can be performed on the frame groups, so that the features of the frame groups can be extracted more effectively.
In one possible implementation manner of the first aspect, the data processing apparatus aligns frames in each of the at least two frame groups, and determining the aligned at least two frame groups includes: the data processing device determines a homography matrix between all continuous two frames in a queue formed by the frames in the at least two frame groups; the data processing apparatus determines the aligned at least two frame groups from the homography matrix.
According to the data processing method provided by the embodiment of the application, the frame group of the first target frame is aligned by the method of the homography matrix, so that the calculation amount can be reduced.
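A minimal sketch of homography-based alignment, assuming OpenCV; the ORB feature matching, RANSAC threshold and chaining scheme are illustrative choices, not prescribed by the patent.

    import cv2
    import numpy as np

    def pairwise_homography(src, dst):
        """Estimate the homography mapping src onto dst (ORB features + RANSAC)."""
        to_gray = lambda im: cv2.cvtColor(im, cv2.COLOR_BGR2GRAY) if im.ndim == 3 else im
        orb = cv2.ORB_create(1000)
        k1, d1 = orb.detectAndCompute(to_gray(src), None)
        k2, d2 = orb.detectAndCompute(to_gray(dst), None)
        matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
        pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
        pts2 = np.float32([k2[m.trainIdx].pt for m in matches])
        H, _ = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
        return H

    def align_to_target(frames, t):
        """Warp every frame onto target frame t by chaining consecutive homographies."""
        h, w = frames[t].shape[:2]
        consecutive = [pairwise_homography(frames[i], frames[i + 1])
                       for i in range(len(frames) - 1)]
        aligned = []
        for i, frame in enumerate(frames):
            if i == t:
                aligned.append(frame)
                continue
            H = np.eye(3)
            for j in range(min(i, t), max(i, t)):     # chain H_{j -> j+1}
                H = consecutive[j] @ H
            if i > t:                                 # invert the chained matrix to get i -> t
                H = np.linalg.inv(H)
            aligned.append(cv2.warpPerspective(frame, H, (w, h)))
        return aligned

Only one homography is estimated per consecutive frame pair; mappings to the target frame are obtained by chaining and inverting these matrices, which keeps the computation small.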
In a possible implementation manner of the first aspect, the method further includes: the data processing device determines a weight of a feature of each of the at least two frame groups; the data processing apparatus fusing the features of each of the at least two frame groups to determine the detail feature of the first target frame comprises: and the data processing device fuses the characteristics of each frame group in the at least two frame groups according to the weight value so as to determine the detail characteristics of the first target frame.
According to the data processing method provided by the embodiment of the application, when a plurality of features are fused, the attention mask can be calculated through a deep learning network attention mechanism, the weight of the features of each frame group is determined, and accordingly, each feature is fused, and the detailed features of the target frame are finally determined.
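A possible sketch of this attention-based fusion, assuming PyTorch; the mask head and the softmax normalization over the group axis are assumptions about one reasonable realization, not the patent's exact network.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GroupAttentionFusion(nn.Module):
        """Predict per-group attention masks, normalize them over the group axis,
        and sum the group features with those weights to form the detail feature."""
        def __init__(self, feat=64, num_groups=3):
            super().__init__()
            self.mask_head = nn.Conv2d(feat * num_groups, num_groups, kernel_size=3, padding=1)

        def forward(self, group_feats):
            # group_feats: list of num_groups tensors, each of shape (B, feat, H, W)
            stacked = torch.stack(group_feats, dim=1)                 # (B, G, feat, H, W)
            masks = self.mask_head(torch.cat(group_feats, dim=1))     # (B, G, H, W)
            weights = F.softmax(masks, dim=1).unsqueeze(2)            # (B, G, 1, H, W)
            return (stacked * weights).sum(dim=1)                     # (B, feat, H, W)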
A second aspect of the embodiments of the present application provides a data processing apparatus, including: an obtaining unit configured to obtain a sequence of frames, wherein frames in the sequence of frames have a first resolution; a determining unit, configured to determine at least two frame groups from the frame sequence, where the frame groups include a first target frame and at least two adjacent frames of the first target frame, where the first target frame is any one of the frame sequence, and the adjacent frames are frames other than the first target frame in the frame sequence; the determining unit is further configured to determine, through a three-dimensional convolutional neural network, a feature of each of the at least two frame groups, the feature of each frame group indicating detail information acquired from adjacent frames within the each frame group based on the first target frame, wherein a size of a convolutional kernel in a time dimension in the three-dimensional convolutional neural network is positively correlated with the number of frames in the frame group; a processing unit for fusing features of each of the at least two frame groups to determine a detail feature of the first target frame, the detail feature indicating detail information obtained from adjacent frames within the at least two frame groups based on the first target frame; the acquiring unit is further configured to acquire a first target frame with a second resolution according to the detail feature and the first target frame, where the second resolution is greater than the first resolution.
In one possible implementation of the second aspect, the frame group includes the first target frame and two of the neighboring frames.
In one possible implementation of the second aspect, the two neighboring frames include a first neighboring frame and a second neighboring frame, and an interval between the first neighboring frame and the first target frame in the frame sequence is equal to an interval between the second neighboring frame and the first target frame in the frame sequence.
In one possible implementation form of the second aspect, the at least two frame groups include three frame groups.
In a possible implementation manner of the second aspect, the determining unit is further configured to align frames in each of the at least two frame groups, and determine at least two aligned frame groups; the determining unit is specifically configured to: determining, by a three-dimensional convolutional neural network, a characteristic of each of the aligned at least two frame groups.
In a possible implementation manner of the second aspect, the determining unit is specifically configured to: determining a homography matrix between all the two continuous frames in the queue formed by the frames in the at least two frame groups; and determining the at least two aligned frame groups according to the homography matrix.
In a possible implementation manner of the second aspect, the determining unit is further configured to: determine a weight of the features of each of the at least two frame groups through a deep neural network; the processing unit is specifically configured to: fuse the features of each of the at least two frame groups according to the weight to determine the detail feature of the first target frame.
In one possible implementation of the second aspect, the size of the convolution kernel in the time dimension in the three-dimensional convolutional neural network is equal to the number of frames in the frame group.
A third aspect of embodiments of the present application provides a computer program product containing instructions, which when run on a computer, causes the computer to perform the method according to the first aspect or any one of the possible implementation manners of the first aspect.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method according to the first aspect or any one of the possible implementation manners of the first aspect.
A fifth aspect of the embodiments of the present application provides a chip system, where the chip system includes a processor, and the processor is configured to read and execute a computer program stored in a memory to perform a function according to any possible implementation manner of any one of the above aspects. In one possible design, the system-on-chip further includes a memory electrically connected to the processor. Further optionally, the chip further comprises a communication interface, and the processor is connected to the communication interface. The communication interface is used for receiving data and/or information needing to be processed, the processor acquires the data and/or information from the communication interface, processes the data and/or information, and outputs a processing result through the communication interface. The communication interface may be an input output interface. The chip system may be constituted by a chip, or may include a chip and other discrete devices.
For technical effects brought by any one implementation manner of the second aspect, the third aspect, the fourth aspect, and the fifth aspect, reference may be made to technical effects brought by a corresponding implementation manner in the first aspect, and details are not described here.
The data processing method provided by the embodiment of the application has the advantages that:
in the data processing method provided by the embodiment of the application, in the process of super-resolving a first target frame in a frame sequence, at least two frame groups are determined from the frame sequence, each frame group including the first target frame; the frame groups are respectively input into a three-dimensional convolutional neural network to extract group features of the frame groups, the group features are then fused to determine the detail features of the first target frame, and the first target frame with a first resolution is converted into the first target frame with a second resolution according to the detail features. In this data processing method, the frame group determined from the frame sequence includes the first target frame, and the size of the convolution kernel in the three-dimensional convolutional neural network in the time dimension is matched with the number of frames in the frame group, so that when the group features of the frame group are extracted through the three-dimensional convolutional neural network, convolution kernel sliding can be reduced, the calculation amount is reduced, the guidance of the target frame is available, and the detail feature extraction efficiency is high.
Drawings
FIG. 1 is a schematic diagram of an artificial intelligence agent framework provided by an embodiment of the present application;
fig. 2 is a schematic diagram of an application environment according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of another convolutional neural network structure provided in the embodiments of the present application;
FIG. 5 is a schematic diagram of an application scenario of the data processing method in the embodiment of the present application;
fig. 6 is a schematic view of another application scenario of the data processing method in the embodiment of the present application;
fig. 7 is a schematic diagram of an embodiment of a data processing method provided in an embodiment of the present application;
FIG. 8 is a diagram of an embodiment of a frame alignment method in an embodiment of the present application;
FIG. 9 is a schematic diagram of an embodiment of homography matrix computation in an embodiment of the present application;
FIG. 10 is a schematic diagram of an embodiment of a time-series grouping in an embodiment of the present application;
FIG. 11 is a schematic diagram of an embodiment of inter-group feature fusion in an embodiment of the present application;
FIG. 12 is a schematic diagram of another embodiment of inter-group feature fusion performed in an embodiment of the present application;
FIG. 13 is a schematic diagram of an embodiment of sample amplification in an embodiment of the present application;
fig. 14 is a schematic diagram of another embodiment of a data processing method provided in an embodiment of the present application;
fig. 15 is a schematic diagram of an embodiment of a data processing apparatus according to an embodiment of the present application;
fig. 16 is a diagram of a chip hardware structure according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a data processing method for super-resolution of a frame sequence acquired by continuous shooting, which can reduce the calculation amount and improve the detail feature extraction effect.
The following provides a brief description of the terms related to the embodiments of the present application.
Video: a video is composed of a series of static images, where each image is usually called a frame. The number of frames transmitted or displayed per second is called the frame rate (FPS); the larger the frame rate, the smoother the picture, and the smaller the frame rate, the more jerky the picture. When the video frame rate is not lower than 24 fps, due to the persistence of vision, human eyes cannot distinguish a single static picture and the sequence appears as a smooth, continuous visual effect; such a continuous frame sequence is a video. The video in the embodiment of the present application refers to a sequence of frames obtained by continuous shooting.
Resolution refers to the amount of information stored in an image, that is, the number of pixels per inch of the image, measured in PPI (pixels per inch). Resolution is commonly expressed as the number of horizontal pixels × the number of vertical pixels, with common specifications such as 1280 × 720 and 1920 × 1080. The higher the resolution, the larger the image; conversely, the lower the resolution, the smaller the image.
Video super-resolution: reconstructing a corresponding high-resolution video from a low-resolution video. The low-resolution images are up-sampled and amplified, and details are filled in by means of image prior knowledge, image self-similarity, complementary information from multiple frames, and the like, so as to generate the corresponding high-resolution images.
In a sequence of frames acquired by continuous shooting, adjacent frames are generally very similar. For the tasks of super-resolution of frame sequences, such as super-resolution of video streams, super-resolution of video monitoring, high definition of old movies, and the like, the ideal effect cannot be achieved by simply applying the image super-resolution method to process the video frame by frame. On one hand, due to the fact that the information of the front frame and the rear frame is not considered in the image super-resolution, time continuity is lacked, and the generated high-resolution video has artifacts such as flicker, jitter and the like, so that watching and subsequent application are greatly influenced; on the other hand, the image super-resolution method lacks complementary information of previous and next frames, and the super-resolution performance is limited by the lower information utilization rate. Therefore, a video super-resolution technology (VSR) capable of effectively using front and rear frame information has drawn much attention on the basis of image super-resolution.
Motion compensation: a method of describing the difference between adjacent frames. In a frame sequence obtained by continuous shooting, adjacent frames are usually very similar, that is, they contain much redundancy. Simple motion compensation subtracts a reference frame from the current frame to obtain the difference between the frames, that is, the detail information that needs to be acquired in super-resolution.
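As a minimal numerical illustration of this idea (an assumption for illustration only, not part of the patent), the residual between a frame and an aligned reference frame is small wherever the two frames are redundant and large only where detail differs:

    import numpy as np

    def frame_residual(current, reference):
        """Difference between the current frame and a (previously aligned) reference frame."""
        return current.astype(np.int16) - reference.astype(np.int16)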
One core operation of video super-resolution is the extraction and fusion of multi-frame spatio-temporal information, and thus motion compensation is required to deal with motion between video frames. According to the motion compensation method, the existing video super-resolution methods can be divided into two categories: explicit motion compensation and implicit motion compensation.
Explicit motion compensation methods directly warp and align images using optical flow and the like in the preprocessing stage, and such methods are computationally intensive and have significant artifacts.
Implicit motion compensation is performed implicitly inside a neural network by means of operations such as 3D convolution or deformable convolution. Such methods are limited by the structures of 3D convolution and deformable convolution, require a huge amount of computation, and run slowly. Therefore, video super-resolution methods that fuse multi-frame temporal information in a video sequence more effectively under a limited calculation budget have become a hot spot of current research in industry and academia.
Three-dimensional convolution, 3D convolution for short, is a convolution applied to three-dimensional data, and its convolution kernel includes three dimensions, one depth dimension more than the length and width dimensions of the commonly used 2D convolution kernel, and the depth dimension may be multiple frames of a video or different slices of a stereo image. The 3D convolution can effectively extract the space-time information in a video sequence or a three-dimensional image, and is often applied to tasks such as motion recognition, medical image processing and video processing. The depth dimension in the three-dimensional convolution kernel in the embodiment of the application refers to a time dimension, and refers to a plurality of frames acquired at different time points in a video.
Frame sequence: refers to a plurality of frame images having a sequence.
Frame group: a plurality of frames that are simultaneously input into the three-dimensional convolutional neural network for information extraction in the embodiments of the present application, including a target frame and frames adjacent to the target frame.
Embodiments of the present application will now be described with reference to the accompanying drawings, and it is to be understood that the described embodiments are merely illustrative of some, but not all, embodiments of the present application. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The term "and/or" appearing in the present application may be an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. In addition, the character "/" in this application generally indicates that the former and latter related objects are in an "or" relationship. In the present application, "at least one" means one or more, "a plurality" means two or more. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
FIG. 1 shows a schematic diagram of an artificial intelligence body framework that describes the overall workflow of an artificial intelligence system, applicable to the general artificial intelligence field requirements.
The artificial intelligence topic framework described above is set forth below in terms of two dimensions, the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "smart information chain" reflects a series of processes from data acquisition onward. For example, it can cover the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process.
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) to the industrial ecology of the system.
(1) Infrastructure:
the infrastructure provides computing power support for the artificial intelligent system, realizes communication with the outside world, and realizes support through a foundation platform. Communicating with the outside through a sensor; the computing power is provided by intelligent chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA and the like); the basic platform comprises distributed computing framework, network and other related platform guarantees and supports, and can comprise cloud storage and computing, interconnection and intercommunication networks and the like. For example, sensors and external communications acquire data that is provided to intelligent chips in a distributed computing system provided by the base platform for computation.
(2) Data
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
The intelligent product and industry application refers to the product and application of an artificial intelligence system in various fields, and is the encapsulation of an artificial intelligence integral solution, the intelligent information decision is commercialized, and the landing application is realized, and the application field mainly comprises: intelligent manufacturing, intelligent transportation, intelligent home, intelligent medical treatment, intelligent security, automatic driving, safe city, intelligent terminal and the like.
Referring to fig. 2, a system architecture 200 is provided in an embodiment of the present application. The data acquisition device 260 is configured to acquire a continuously shot frame sequence and store it in the database 230, and the training device 220 generates the target model/rule 201 based on the frame sequence data maintained in the database 230. How training device 220 derives target model/rule 201 based on frame sequence data will be described in more detail below; target model/rule 201 can be used in video super-resolution, image sequence super-resolution, and other application scenarios.
The target model/rule 201 may be derived based on a deep neural network, which is described below.
The operation of each layer in the deep neural network can be described mathematically by the expression y = a(W·x + b). From the physical level, the work of each layer in the deep neural network can be understood as completing the transformation from input space to output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. zooming in/out; 3. rotating; 4. translating; 5. "bending". Operations 1, 2 and 3 are completed by W·x, operation 4 is completed by +b, and operation 5 is implemented by a(). The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of that class. W is a weight vector, each value in the vector representing the weight of a neuron in that layer of the neural network. The vector W determines the spatial transformation from input space to output space described above, that is, the weight W of each layer controls how the space is transformed. The purpose of training the deep neural network is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrices formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
Because it is desirable that the output of the deep neural network is as close as possible to the value actually desired to be predicted, the weight vector of each layer of the neural network can be updated by comparing the predicted value of the current network with the value actually desired to be predicted, and then updating the weight vector according to the difference between the predicted value and the value actually desired (of course, there is usually an initialization process before the first update, that is, parameters are configured in advance for each layer in the deep neural network). Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value", which are loss functions (loss functions) or objective functions (objective functions), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, if the higher the output value (loss) of the loss function indicates the larger the difference, the training of the deep neural network becomes the process of reducing the loss as much as possible.
The target models/rules obtained by the training device 220 may be applied in different systems or devices. In FIG. 2, the execution device 210 is configured with an I/O interface 212 to interact with data from an external device, and a "user" may input data to the I/O interface 212 via a client device 240.
The execution device 210 may call data, code, etc. from the data storage system 250 and may store data, instructions, etc. in the data storage system 250.
The calculation module 211 processes the input data using the target model/rule 201. Taking image super-resolution as an example, the calculation module 211 may analyze the input image or image sequence to obtain the image features.
The correlation function 213 may pre-process the image data in the calculation module 211, for example, perform frame alignment or image grouping, etc.
The correlation function 214 may pre-process the image data in the calculation module 211, for example, perform frame alignment or image grouping, etc.
Finally, the I/O interface 212 returns the results of the processing to the client device 240 for presentation to the user.
Further, the training device 220 may generate corresponding target models/rules 201 based on different data for different targets to provide better results to the user.
In the case shown in FIG. 2, the user may manually specify data to be input into the execution device 210, for example, to operate in an interface provided by the I/O interface 212. Alternatively, the client device 240 may automatically enter data into the I/O interface 212 and obtain the results, and if the client device 240 automatically enters data to obtain authorization from the user, the user may set the corresponding permissions in the client device 240. The user can view the result output by the execution device 210 at the client device 240, and the specific presentation form can be display, sound, action, and the like. The client device 240 may also act as a data collection end to store the collected training data in the database 230.
It should be noted that fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 2, the data storage system 250 is an external memory with respect to the execution device 210, and in other cases, the data storage system 250 may also be disposed in the execution device 210.
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to learning at multiple levels in different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network, for example, image processing, in which individual neurons respond to overlapping regions in an image input thereto.
As shown in fig. 3, Convolutional Neural Network (CNN)100 may include an input layer 110, a convolutional/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.
Convolutional layer/pooling layer 120:
Convolutional layer:
as shown in FIG. 3, convolutional layer/pooling layer 120 may include, for example, 121-126 layers, in one implementation, 121 layers are convolutional layers, 122 layers are pooling layers, 123 layers are convolutional layers, 124 layers are pooling layers, 125 layers are convolutional layers, and 126 layers are pooling layers; in another implementation, 121, 122 are convolutional layers, 123 are pooling layers, 124, 125 are convolutional layers, and 126 are pooling layers. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels, whose role in image processing is to act as a filter for extracting specific information from an input image matrix. A convolution operator is essentially a weight matrix, which is usually predefined; during the convolution operation on an image, the weight matrix is usually processed over the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride) in the horizontal direction, so as to complete the task of extracting a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, a plurality of weight matrices of the same dimensions are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features in the image; for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unwanted noise in the image. The multiple weight matrices have the same dimensions, the feature maps extracted by these weight matrices also have the same dimensions, and the extracted feature maps of the same dimensions are combined to form the output of the convolution operation.
Convolution kernels also come in a variety of formats depending on the dimensionality of the data that needs to be processed. Commonly used convolution kernels include two-dimensional convolution kernels and three-dimensional convolution kernels. The two-dimensional convolution kernel is mainly applied to processing two-dimensional image data, and the three-dimensional convolution kernel can be applied to video processing, stereo image processing and the like due to the fact that the dimension of the depth/time direction is increased. Compared with a two-dimensional convolution kernel, the number of parameters and the calculation amount required by the three-dimensional convolution kernel with one added dimension are both greatly increased. In practical applications, a three-dimensional convolutional network needs to be carefully designed: when the convolution kernel is larger, the sliding times are less, but the calculation amount is greatly increased; when the convolution kernel is small, the sliding times are more, the feature extraction of the depth/time dimension is more blind and inefficient, and the performance of the network performance is limited.
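The parameter growth mentioned above can be made concrete with a small comparison, assuming PyTorch and 64 input/output channels (the channel count is an illustrative assumption):

    import torch.nn as nn

    conv2d = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
    conv3d = nn.Conv3d(64, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1), bias=False)

    print(sum(p.numel() for p in conv2d.parameters()))  # 64*64*3*3   = 36,864 parameters
    print(sum(p.numel() for p in conv3d.parameters()))  # 64*64*3*3*3 = 110,592 parameters

The extra temporal dimension triples the parameter count of every kernel, which is one reason the temporal kernel size is usually kept at 3.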
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 to make correct prediction.
When convolutional neural network 100 has multiple convolutional layers, the initial convolutional layer (e.g., 121) tends to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 100 increases, the later convolutional layers (e.g., 126) extract more complex features, such as features with high-level semantics; the more semantically abstract the features, the more suitable they are for the problem to be solved. To facilitate description of the network structure, a group of convolutional layers may be referred to as a block.
A pooling layer:
Since it is often necessary to reduce the number of training parameters, pooling layers are often introduced periodically after convolutional layers; that is, in the layers 121-126 illustrated by 120 in fig. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to a smaller size. The average pooling operator may calculate the pixel values in the image over a particular range to produce an average value. The max pooling operator may take the pixel with the largest value in a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or maximum value of the corresponding sub-region of the image input to the pooling layer.
The neural network layer 130:
after processing by convolutional layer/pooling layer 120, convolutional neural network 100 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (class information or other relevant information as needed), the convolutional neural network 100 needs to generate one or a set of outputs of the number of classes as needed using the neural network layer 130. Accordingly, a plurality of hidden layers (such as 131, 132, to 13n shown in fig. 3) and an output layer 140 may be included in the neural network layer 130, and parameters included in the plurality of hidden layers may be pre-trained according to related training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the hidden layers in the neural network layer 130, the last layer of the whole convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to the categorical cross entropy, which is specifically used for calculating the prediction error. Once the forward propagation (i.e., the propagation from 110 to 140 in fig. 3) of the whole convolutional neural network 100 is completed, the backward propagation (i.e., the propagation from 140 to 110 in fig. 3) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
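A minimal sketch of this forward/backward cycle, assuming PyTorch, an MSE loss, and a placeholder one-layer network standing in for the super-resolution model; all names and values are illustrative assumptions.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Conv2d(3, 3, kernel_size=3, padding=1))  # placeholder network
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    inputs = torch.rand(1, 3, 32, 32)        # stand-in input frame
    target = torch.rand(1, 3, 32, 32)        # stand-in ideal result

    prediction = model(inputs)               # forward propagation
    loss = loss_fn(prediction, target)       # compare predicted value with desired value
    optimizer.zero_grad()
    loss.backward()                          # backward propagation
    optimizer.step()                         # update weights and biases to reduce the loss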
It should be noted that the convolutional neural network 100 shown in fig. 3 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 4, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the overall neural network layer 130 for processing.
The application scenario of the video super-resolution method is wide, and is described below with reference to fig. 5 and 6 by way of example.
Application scenario 1: high-definition streaming video system
Please refer to fig. 5, which is a schematic view of an application scenario of the data processing method in the embodiment of the present application;
with the popularization of smart phones and tablet computers, streaming video has gradually become one of the mainstream video entertainment modes at present. The resolution of video resources provided by a streaming video platform is gradually improved, which puts higher requirements on network bandwidth and network stability. Based on the video super-resolution technology, a video picture with lower resolution can be directly transmitted to a user, the client can perform video super-resolution on a low-quality video by virtue of the computing power of the client, and finally the high-quality picture is presented to the user. This way the bandwidth requirement of the video stream can be reduced significantly without significantly degrading the high definition picture quality.
Application scenario 2: high-definition monitoring system
Please refer to fig. 6, which is a schematic view of another application scenario of the data processing method in the embodiment of the present application;
video monitoring is an important component of a safe city system, and more video monitoring is arranged at each corner of a city to safeguard the safety of the city. The method is limited by the adverse conditions such as camera quality, installation position, limited storage space and the like, and the picture quality of partial video monitoring is poor, so that the subsequent application is limited. As shown in fig. 2, the embodiment of the present application can convert a low-resolution video monitoring picture into a high-resolution high-definition picture. Through the information of the front frame and the back frame and the priori knowledge of the image, the effective recovery of a large number of details in the monitoring picture is realized, more effective and rich information is provided for the subsequent video analysis, and the robustness of a safe city system is improved.
The existing video super-resolution methods mainly include a video super-resolution method based on optical flow and a video super-resolution method based on implicit motion compensation, and are briefly introduced below.
The video super-resolution method based on the optical flow, namely an explicit motion compensation method, estimates dense optical flow from adjacent frames to intermediate frames from input multi-frame images frame by frame, distorts and aligns the adjacent frames to the intermediate frames based on the optical flow estimation result to form an aligned image sequence, performs feature extraction and fusion on the aligned image sequence, and outputs a super-resolution result of the intermediate frames.
Since the explicit motion compensation method directly warps and aligns the images by means of optical flow and the like in the preprocessing stage, the accuracy of optical flow calculation is limited, and large artifacts tend to exist.
In order to avoid the huge calculation amount of dense optical flow and its artifact problem, the video super-resolution method based on implicit motion compensation implicitly performs motion compensation in the process of extracting and fusing image information, through a neural network module with motion compensation capability such as a 3D convolutional neural network. Structurally, a 3D convolution kernel adds one dimension to a 2D convolution kernel, and enlarging the convolution kernel leads to a dramatic increase in the number of parameters and computations, so it is difficult to deepen the 3D convolution kernel in the time dimension. In practical applications, a 3D convolution kernel of size 3x3x3 is often used to balance performance and computational complexity; such a kernel extracts information from 3 frames at a time. In the video super-resolution task, when 7 frames of images are directly input into the three-dimensional convolutional neural network, 3 frames are processed each time and the convolution kernel slides by 1 frame for the next processing, so 5 windows are processed in total. Taking the 4th frame as the target frame, the farthest adjacent frame is 3 frames away from the target frame, and the first window covers frames 1 to 3, which does not include the target frame; therefore, the feature extraction process cannot obtain the guidance of the target frame, the feature fusion process is indirect and blind, and the feature extraction efficiency is low.
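The sliding-window arithmetic in this paragraph can be checked with a few lines (illustrative only; frame indices are 0-based here):

    num_frames, kernel_t, stride = 7, 3, 1
    positions = (num_frames - kernel_t) // stride + 1        # 5 temporal windows
    windows = [list(range(i, i + kernel_t)) for i in range(positions)]
    print(windows)   # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
    # With the 4th frame (index 3) as the target, the first window [0, 1, 2] never
    # contains the target frame, so its feature extraction has no target-frame guidance.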
The data processing method provided by the embodiment of the present application may be applied to a video, and the video frame rate may be 24, 30, 60, 120, or 300, and the specific frame rate of the video is not limited here. Furthermore, the data processing method can also be applied to a sequence of images taken continuously, for example, a sequence of images taken continuously by a user at different time intervals. The data processing objects in the embodiments of the present application are collectively referred to as a frame sequence.
The data processing method provided by the embodiment of the application is used for efficiently fusing multi-frame information in the super-resolution of the frame sequence, so that the calculation amount can be reduced, and the processing speed can be improved. Please refer to fig. 7, which is a schematic diagram illustrating an embodiment of a data processing method according to an embodiment of the present application.
701. The data processing device carries out frame alignment processing on the frame sequence;
the data processing device performs frame alignment on the input frame sequence, aligning the same picture content in different frames to obtain aligned video frames; this reduces the picture difference between frames and lowers the difficulty of subsequent information extraction and fusion.
Optionally, the adjacent frames used for information extraction are aligned with the target frame; for example, for a first target frame, the plurality of adjacent frames of the first target frame used for extracting detail information are aligned with the first target frame, e.g. 6 adjacent frames of the first target frame are aligned with it. Similarly, the frame alignment process is performed once for each target frame to be super-resolved.
Optionally, a plurality of frames input to the convolutional neural network at the same time for information extraction are aligned, for example, 3 frames including the target frame are aligned before being input to the three-dimensional convolutional neural network.
Optionally, the data processing method provided in the embodiment of the present application implements fast frame alignment by a homography matrix method. For ease of understanding, the principle of fast frame alignment and the properties of the homography matrix are briefly described below.
For video frames obtained by continuous shooting, the motion between frames, i.e. the change of picture content between preceding and following frames, consists of two parts: camera motion and object motion. Camera motion can be roughly described by a homography matrix, so the method uses homography-based frame alignment to realize coarse motion compensation.
A perspective transformation maps one plane onto another, and a homography matrix describes such a plane-to-plane perspective transformation. The homography matrix has the following properties:
1) The homography matrix from A to C can be calculated from the homography matrix from A to B and the homography matrix from B to C:
H_{A→C} = H_{A→B} · H_{B→C}
2) The homography matrix from B to A is the inverse of the homography matrix from A to B:
H_{B→A} = (H_{A→B})^{-1}
Please refer to fig. 8, which is a diagram illustrating an embodiment of a frame alignment method according to an embodiment of the present application.
The data processing device calculates a homography matrix between every two consecutive frames. Optionally, the data processing device checks the homography matrix: limited by the complexity of motion, some samples are not suitable for alignment, for example when a fixed surveillance camera does not move, or when the proportion of object motion between frames is too large; aligning such samples between frames can introduce alignment errors. To prevent wrong alignment from affecting the subsequent feature extraction and fusion, optionally, an applicability check is performed before the frame alignment: the data processing device judges through the homography matrix check whether the frame sequence is suitable for alignment; if so, the alignment is performed, and if not, the originally input frame sequence is output directly.
In the embodiment of the present application, frame alignment is realized by a homography matrix method, which reduces the amount of computation and speeds up frame alignment; this is explained below by comparison with the existing optical flow method.
Referring to fig. 9, a schematic diagram of an embodiment of homography matrix calculation in the embodiment of the present application is shown. The data processing device only calculates a basic homography matrix between each frame and its previous frame; homography matrices between other frames are derived from the basic homography matrices according to the properties of the homography matrix. Consider processing a video of M frames, taking 2N+1 frames as one group of network input, i.e. a target frame and 2N adjacent frames. An optical-flow-based method needs to compute optical flow between each adjacent frame and the target frame, i.e. 2NM times in total, whereas in the homography-based alignment method the homography matrices only need to be calculated once for each input of 2N+1 frames, so the total number of calculations is M. The amount of computation is therefore greatly reduced and an obvious speed improvement is achieved. Meanwhile, the method avoids the pixel-level deformation introduced by optical flow calculation and provides good alignment for the subsequent super-resolution processing.
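For ease of understanding, the following sketch illustrates this alignment scheme (a minimal Python/OpenCV sketch under the assumption that the basic homographies are estimated with ORB features, RANSAC and cv2.findHomography; the function names pairwise_homography and align_group are hypothetical and do not denote modules of the present application):

import cv2
import numpy as np

def pairwise_homography(src, dst):
    # Estimate the basic homography mapping points of frame `src` onto frame `dst`
    # (assumption: ORB features + RANSAC; any robust estimator would do).
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(src, None)
    kp2, des2 = orb.detectAndCompute(dst, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    H, _ = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
    return H

def align_group(frames, t):
    # Warp every frame onto the target frame frames[t].
    # Only len(frames) - 1 basic homographies are estimated (each frame to the
    # next frame); all other matrices follow from the two properties above:
    # composition for frames before the target, inversion for frames after it.
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    base = [pairwise_homography(gray[i], gray[i + 1]) for i in range(len(frames) - 1)]

    h, w = frames[t].shape[:2]
    aligned = []
    for i, frame in enumerate(frames):
        if i < t:                      # compose i -> i+1 -> ... -> t
            H = np.eye(3)
            for k in range(i, t):
                H = base[k] @ H
        elif i > t:                    # invert the composed t -> i homography
            H_t_to_i = np.eye(3)
            for k in range(t, i):
                H_t_to_i = base[k] @ H_t_to_i
            H = np.linalg.inv(H_t_to_i)
        else:
            H = np.eye(3)
        aligned.append(cv2.warpPerspective(frame, H, (w, h)))
    return aligned

For M frames this estimates only M - 1 basic homographies, which is the source of the speed-up described above.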
It should be noted that step 701 is an optional step, and may or may not be executed, and is not limited herein.
Optionally, the frame groups of the first target frame may be respectively frame-aligned, or all the frame groups of the first target frame may be frame-aligned together, which is not limited herein.
702. The data processing device determines at least two frame groups of a first target frame;
when extracting detail information for a target frame in a frame sequence, the detail information may be acquired from adjacent frames of the target frame; the adjacent frames of the target frame, hereinafter simply referred to as adjacent frames, are any frames in the frame sequence other than the target frame. Generally, a frame that was captured at a time close to the target frame and whose image content partially overlaps that of the target frame can provide detail information that the target frame does not have.
The data processing apparatus groups a sequence of frames, and determines for each frame of the sequence of frames at least two groups of frames for which information extraction is to be performed. The frame sequence may be an aligned frame sequence output after aligning the frame sequences in step 701, or a frame sequence that is not aligned, and is not limited herein.
Taking the first target frame as an example, the data processing apparatus determines frame groups, each of which includes the first target frame and at least two adjacent frames of the first target frame. The specific number of frame groups is not limited here, and may be, for example, 3, 4 or 7. It can be understood that, when each frame group contains a fixed number of frames, the more frame groups there are, the more detail information can be acquired and the larger the amount of computation; conversely, the fewer frame groups there are, the less detail information can be acquired and the smaller the amount of computation. In practical applications, the number of frame groups may be determined according to the super-resolution requirement.
The frame group simultaneously input into the 3D convolutional neural network in the embodiment of the application includes the target frame, so that the target frame can provide guidance information for feature extraction during convolution and improve the effectiveness of feature extraction. The number of frames in a frame group matches the size of the convolution kernel in the depth (time) dimension, and the number of frames in a frame group is odd. Optionally, if a convolution kernel of size 3 × 3 × 3 is selected, the number of frames in a frame group is 3, i.e. one target frame and two adjacent frames; optionally, if a convolution kernel of size 5 × 5 × 5 is used, the number of frames in a frame group is 5, and if a convolution kernel of size 7 × 7 × 7 is used, the number of frames in a frame group is 7; the specific size is not limited. Since the amount of computation grows rapidly as the convolution kernel size increases, terminal data processing devices, which are typically computationally limited, use a convolution kernel of size 3 × 3 × 3 for feature extraction.
There are multiple ways to select the adjacent frames of a frame group. Optionally, the set formed by the first target frame and all adjacent frames in the plurality of frame groups determined for the first target frame is a continuous frame sequence, and no adjacent frame is repeated across the plurality of frame groups.
Optionally, frame groups are determined according to the interval from the target frame. The N frame groups are denoted {G_1, G_2, …, G_N}. According to the different temporal distances between the corresponding adjacent frames and the intermediate frame, each group contains 3 frames: G_n = {I_{t-n}, I_t, I_{t+n}}, n ∈ [1, N], where I_t is the target frame, I_{t-n} is a preceding adjacent frame, specifically the frame n positions before the target frame, and I_{t+n} is a following adjacent frame, specifically the frame n positions after the target frame. For a frame sequence acquired at equal time intervals, in the frame groups determined in this way the preceding and following adjacent frames have the same time interval to the target frame, i.e. the two adjacent frames are symmetric about the target frame in the time dimension; considering the continuity of motion, two adjacent frames that are symmetric about the target frame allow features to be extracted more effectively.
Taking a 7-frame input as an example, please refer to fig. 10, which illustrates a schematic diagram of an embodiment of temporal grouping in an embodiment of the present application.
The module groups the adjacent frames into 3 groups according to their temporal distance to the target frame. The 1st, 4th and 7th frames form the first frame group, the 2nd, 4th and 6th frames form the second frame group, and the 3rd, 4th and 5th frames form the third frame group.
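A minimal sketch of this temporal grouping (the function name temporal_groups and the 0-based frame indexing are illustrative assumptions; here the groups are listed from the nearest to the farthest adjacent frames):

def temporal_groups(frames, t, num_groups):
    # Group n (n = 1..num_groups) contains {frame t-n, frame t, frame t+n}:
    # every group includes the target frame plus two adjacent frames that are
    # symmetric about it in time, with inter-frame distance n.
    return [[frames[t - n], frames[t], frames[t + n]]
            for n in range(1, num_groups + 1)]

# 7-frame example, 0-based indices 0..6, target frame index 3 (the 4th frame)
print(temporal_groups(list(range(7)), t=3, num_groups=3))
# [[2, 3, 4], [1, 3, 5], [0, 3, 6]]  -> the three groups of fig. 10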
Because each frame group comprises the target frame as the guidance of the information fusion in the frame group, the information extraction efficiency can be effectively improved.
In addition, in the prior art where 3 frames are processed each time by sliding the convolution kernel, 6 groups of data need to be calculated, whereas the data processing method provided by the embodiment of the application only needs to perform 3 groups of calculation, so the amount of computation can be reduced remarkably.
703. The data processing device acquires group characteristics of a frame group;
according to the frame groups determined in step 702, the data processing apparatus extracts and fuses the features of each frame group through a three-dimensional convolutional neural network; for convenience of description, the feature of a frame group is referred to as a group feature. Optionally, feature extraction and fusion can be performed by combining a two-dimensional convolutional neural network and a three-dimensional convolutional neural network. The data processing apparatus thus acquires the group features F_1, F_2, …, F_N of the N frame groups.
Optionally, when the frame groups are determined according to the temporal distance from the adjacent frames to the target frame, the data processing apparatus may extract and fuse features with weight sharing across the different frame groups of the target frame. Weight sharing means that different batches of data are processed by the same network entity, i.e. with the same set of weights.
Taking a 7-frame input as an example, the 4th frame is the target frame; it is also called the intermediate frame below because it lies in the middle of the frame sequence. The 1st, 4th and 7th frames form the first frame group, the 2nd, 4th and 6th frames form the second frame group, and the 3rd, 4th and 5th frames form the third frame group. Considering that the inter-frame distances of the 3 frame groups are 1, 2 and 3 respectively, the same 3D network is used when extracting the features of each frame group, and the dilation rate of the convolution kernels in the network is set to the inter-frame distance of the corresponding frame group, i.e. 1, 2 or 3. The dilation rate determines the receptive field of the convolution kernel, i.e. its spatial coverage; using a larger dilation rate for a group with larger motion extracts spatial motion information better and achieves more efficient intra-group fusion.
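A minimal PyTorch-style sketch of this weight-shared, dilated intra-group feature extraction (a single 3D convolution stands in for the full 3D network; the channel counts and the class name IntraGroupFusion are illustrative assumptions, not the patented network):

import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraGroupFusion(nn.Module):
    # One shared 3D convolution applied to every frame group; the spatial
    # dilation is set per group to that group's inter-frame distance, so groups
    # with larger motion get a larger receptive field with the same weights.
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, feat_ch, kernel_size=3, padding=1)

    def forward(self, group, dilation):
        # group: (batch, channels, T=3, H, W)
        out = F.conv3d(group, self.conv.weight, self.conv.bias,
                       padding=(1, dilation, dilation),
                       dilation=(1, dilation, dilation))
        return torch.relu(out)

fusion = IntraGroupFusion()                       # the same weights for all groups
groups = [torch.randn(1, 3, 3, 64, 64) for _ in range(3)]
feats = [fusion(g, d) for g, d in zip(groups, (1, 2, 3))]   # dilation = inter-frame distance
print([tuple(f.shape) for f in feats])            # each (1, 64, 3, 64, 64)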
Similarly, the data processing apparatus extracts the group features for the plurality of frame groups of each target frame, which is not described in detail again.
704. The data processing device fuses the group characteristics of the first target frame and acquires the inter-group characteristics of the first target frame;
the data processing device fuses the group features F_1, F_2, …, F_N of the plurality of frame groups of the same target frame together to obtain the inter-group feature F_A of the target frame. The inter-group feature is the detail feature extracted from the adjacent frames, and can be used for the super-resolution of the target frame in the next step.
There are various ways of fusing a plurality of group features to determine the feature F_A of the target frame, which are not limited here. Optionally, please refer to fig. 11, which illustrates one embodiment of inter-group feature fusion in the embodiment of the present application.
For the group features F_1, F_2, …, F_N of all the frame groups of the first target frame, 2D convolution is first used to perform feature extraction on them, obtaining one-dimensional (single-channel) feature maps f_1, f_2, …, f_N. The attention masks M_n are then computed from f_1, f_2, …, f_N. The attention mask M_n can be understood as the weight of the group feature F_n, and is calculated as follows:

M_n(x, y) = exp(f_n(x, y)) / Σ_{i=1}^{N} exp(f_i(x, y))

where M_n(x, y) is the attention mask of the n-th frame group at pixel coordinate (x, y), f_i is the one-dimensional feature map of the i-th frame group, and N is the total number of frame groups.

The group features are weighted according to the calculated attention masks, and the weighted group features of the target frame are obtained according to the following formula:

F′_n = M_n ⊙ F_n

where ⊙ denotes the element-wise (Hadamard) product.
Optionally, a three-dimensional convolution block (3D Block) containing 3D convolutions and a two-dimensional convolution block (2D Block) containing 2D convolutions are then used to further fuse the weighted group features F′_1, F′_2, …, F′_N and generate the fused feature F_A. Please refer to fig. 12, which is a diagram illustrating another embodiment of inter-group feature fusion according to an embodiment of the present application.
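The attention-based inter-group fusion described above may be sketched as follows (PyTorch, illustrative only; it assumes each group feature has already been collapsed to a 2D map of shape (batch, channels, H, W), and the small Conv2d stack merely stands in for the 3D Block / 2D Block of fig. 12):

import torch
import torch.nn as nn

class InterGroupFusion(nn.Module):
    # Attention-weighted fusion of N group features into the inter-group feature F_A.
    def __init__(self, feat_ch=64, num_groups=3):
        super().__init__()
        self.to_map = nn.Conv2d(feat_ch, 1, kernel_size=3, padding=1)    # 2D conv -> one-channel map f_n
        self.fuse = nn.Sequential(                                        # stands in for the 3D/2D blocks
            nn.Conv2d(feat_ch * num_groups, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1))

    def forward(self, group_feats):
        # group_feats: list of N tensors, each (batch, channels, H, W)
        maps = torch.stack([self.to_map(f) for f in group_feats], dim=1)  # (B, N, 1, H, W)
        masks = torch.softmax(maps, dim=1)           # softmax across the N groups at every pixel
        weighted = [m * f                             # Hadamard product M_n ⊙ F_n
                    for m, f in zip(masks.unbind(dim=1), group_feats)]
        return self.fuse(torch.cat(weighted, dim=1))  # fused inter-group feature F_A

feats = [torch.randn(1, 64, 64, 64) for _ in range(3)]
print(tuple(InterGroupFusion()(feats).shape))         # (1, 64, 64, 64)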
Similarly, the data processing apparatus performs inter-group fusion on the frame group characteristics of each target frame in the frame sequence, and obtains the characteristics of each target frame, which is not described herein again.
705. The data processing device acquires a target frame with high resolution according to the inter-group features.
The feature F_A of the target frame can be understood as a multi-channel image, also called a feature map. The data processing device enlarges the feature map to the target resolution and outputs a residual image, i.e. the detail features. Optionally, the data processing device uses cascaded 2D convolution and a pixel shuffle (PixelShuffle) operation to achieve the feature map magnification, optionally by a factor of 2 each time, until the target resolution is reached.
The data processing device enlarges the low-resolution target frame by upsampling to obtain an enlarged blurred image. In general, it can be considered that: sharp image = blurred image + residual image. The blurred image can be obtained by directly interpolating the input low-resolution target frame, and the sharp high-resolution target frame is then obtained by adding the residual image.
Fig. 13 is a schematic diagram illustrating sample amplification of a target frame according to an embodiment of the present application.
The data processing device enlarges the fused feature F_A by a factor of 2 at a time through cascaded 2D convolution and pixel shuffle (PixelShuffle) until the target resolution is reached, and outputs a residual image. The data processing device also upsamples and enlarges the original target frame by bicubic (Bicubic) interpolation to obtain a blurred enlarged image, and adds the blurred enlarged image and the residual image to obtain the target frame, i.e. the high-resolution target frame corresponding to the intermediate frame I4.
Similarly, the data processing apparatus performs upsampling and amplifying according to the characteristics of each target frame in the frame sequence, and respectively obtains the high-resolution frame of each target frame, which is not described herein again.
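A minimal sketch of this reconstruction step (PyTorch; a 4x upscaling example in which the channel counts and the class name Upsampler are illustrative assumptions):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Upsampler(nn.Module):
    # Cascaded 2D convolution + PixelShuffle (2x per stage) producing a residual
    # image, which is added to a bicubically enlarged copy of the target frame.
    def __init__(self, feat_ch=64, scale=4):
        super().__init__()
        stages = []
        for _ in range(int(math.log2(scale))):        # one 2x stage per factor of 2
            stages += [nn.Conv2d(feat_ch, feat_ch * 4, 3, padding=1),
                       nn.PixelShuffle(2),
                       nn.ReLU(inplace=True)]
        stages.append(nn.Conv2d(feat_ch, 3, 3, padding=1))   # residual RGB image
        self.body = nn.Sequential(*stages)
        self.scale = scale

    def forward(self, fused_feature, lr_target_frame):
        residual = self.body(fused_feature)
        blurred = F.interpolate(lr_target_frame, scale_factor=self.scale,
                                mode='bicubic', align_corners=False)
        return blurred + residual                     # sharp frame = blurred frame + residual

sr = Upsampler()(torch.randn(1, 64, 32, 32), torch.randn(1, 3, 32, 32))
print(tuple(sr.shape))                                # (1, 3, 128, 128)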
Please refer to fig. 14, which is a schematic diagram illustrating another embodiment of a data processing method according to an embodiment of the present application;
In the data processing method provided by the embodiment of the application, after the data processing device obtains an input frame sequence, fast frame alignment is first performed on the frame sequence, and temporal grouping is then performed according to the time interval between the adjacent frames and the target frame. Intra-group feature fusion is performed on groups 1 to N respectively by a 3D convolutional neural network, the weights of the intra-group fusion features are obtained through an attention mechanism, and inter-group fusion is performed according to the weights and the N intra-group fusion features to obtain the detail features of the target frame; finally, a high-resolution frame is output through upsampling.
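Putting the preceding sketches together, a hypothetical forward pass for one target frame could look as follows (it reuses the illustrative, untrained helpers align_group, temporal_groups, IntraGroupFusion, InterGroupFusion and Upsampler defined in the sketches above, and only shows the data flow, not the actual patented network):

import torch
import torchvision.transforms.functional as TF

def super_resolve_target(frames, t, num_groups=3):
    # frames: list of low-resolution images; frames[t] is the target frame.
    intra, inter, up = IntraGroupFusion(), InterGroupFusion(num_groups=num_groups), Upsampler()

    aligned = align_group(frames, t)                      # step 701, optional fast alignment
    groups = temporal_groups(aligned, t, num_groups)      # step 702, temporal grouping

    group_feats = []
    for n, group in enumerate(groups, start=1):
        x = torch.stack([TF.to_tensor(f) for f in group], dim=1).unsqueeze(0)  # (1, C, 3, H, W)
        feat = intra(x, dilation=n)                       # step 703, dilation = inter-frame distance
        group_feats.append(feat.mean(dim=2))              # collapse the temporal axis to (1, C, H, W)
    f_a = inter(group_feats)                              # step 704, attention-weighted inter-group fusion
    return up(f_a, TF.to_tensor(frames[t]).unsqueeze(0))  # step 705, residual + bicubic upsampling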
Referring to fig. 15, a schematic diagram of an embodiment of a data processing apparatus according to an embodiment of the present application is shown;
the data processing device provided by the embodiment of the application comprises:
an obtaining unit 1501, configured to obtain a sequence of frames, a frame of the sequence of frames having a first resolution;
a determining unit 1502 configured to determine at least two frame groups from the frame sequence, where the frame groups include a first target frame and at least two adjacent frames of the first target frame, where the first target frame is any one of the frame sequence, and the adjacent frames are frames other than the first target frame in the frame sequence;
the determining unit 1502 is further configured to determine, through a three-dimensional convolutional neural network, a feature of each of the at least two frame groups, where the feature of each frame group indicates detail information obtained from adjacent frames within the each frame group based on the first target frame, and a size of a convolutional kernel in a time dimension of the three-dimensional convolutional neural network is positively correlated with the number of frames in the frame group;
a processing unit 1503 for fusing features of each of the at least two frame groups to determine a detail feature of the first target frame, the detail feature indicating detail information obtained from adjacent frames within the at least two frame groups based on the first target frame;
the acquiring unit 1501 is further configured to acquire a first target frame with a second resolution according to the detail feature and the first target frame, where the second resolution is greater than the first resolution.
Optionally, the frame group includes the first target frame and two adjacent frames.
Optionally, the two adjacent frames include a first adjacent frame and a second adjacent frame, and an interval between the first adjacent frame and the first target frame in the frame sequence is equal to an interval between the second adjacent frame and the first target frame in the frame sequence.
Optionally, the at least two frame groups include three frame groups.
Optionally, the determining unit 1502 is further configured to align frames in each of the at least two frame groups, and determine at least two aligned frame groups; the determining unit is specifically configured to: determining, by a three-dimensional convolutional neural network, a characteristic of each of the aligned at least two frame groups.
Optionally, the determining unit 1502 is specifically configured to: determining a homography matrix between all the two continuous frames in the queue formed by the frames in the at least two frame groups; and determining the at least two aligned frame groups according to the homography matrix.
Optionally, the determining unit 1502 is further configured to determine a weight of the features of each of the at least two frame groups through a deep neural network; the processing unit is specifically configured to fuse the features of each of the at least two frame groups according to the weight to determine the detail feature of the first target frame.
Optionally, the size of the convolution kernel in the time dimension in the three-dimensional convolution neural network is equal to the number of frames in the frame group.
Fig. 16 is a diagram of a chip hardware structure according to an embodiment of the present application.
The convolutional neural network based algorithm shown in fig. 3 and 4 may be implemented in the NPU chip shown in fig. 16.
The neural network processor NPU 50 is mounted as a coprocessor on a main CPU (Host CPU), and tasks are allocated by the Host CPU. The core portion of the NPU is an arithmetic circuit 503, and the controller 504 controls the arithmetic circuit 503 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 503 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory 501, performs a matrix operation with matrix B, and stores the partial or final results of the obtained matrix in the accumulator 508.
The unified memory 506 is used to store input data as well as output data. The weight data is directly transferred to the weight memory 502 by a direct memory access controller (DMAC) 505. The input data is also carried into the unified memory 506 through the DMAC.
The bus interface unit (BIU) 510 is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer 509. The bus interface unit 510 is configured to fetch instructions from the instruction fetch buffer 509 and to obtain the original data of the input matrix A or the weight matrix B from the external memory for the memory access controller 505.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 506 or to transfer weight data into the weight memory 502 or to transfer input data into the input memory 501.
The vector calculation unit 507 may include a plurality of operation processing units, and further processes the output of the arithmetic circuit if necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation and magnitude comparison. The vector calculation unit 507 is mainly used for non-convolution/FC layer computations in a neural network, such as pooling, batch normalization and local response normalization.
In some implementations, the vector calculation unit 507 can store the processed output vector in the unified memory 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as to a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 503, for example for use in subsequent layers of the neural network.
An instruction fetch buffer 509 connected to the controller 504 for storing instructions used by the controller 504;
the unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are all On-Chip memories. The external memory is private to the NPU hardware architecture.
Among them, the operations of the layers in the convolutional neural networks shown in fig. 3 and 4 may be performed by the matrix calculation unit 212 or the vector calculation unit 507.
The above method embodiments of the present application may be applied to a processor, or the steps of the above method embodiments may be implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method embodiments may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component. The various methods, steps, and logic blocks disclosed in this application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in this application may be directly implemented by a hardware decoding processor, or may be implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM or an EPROM, or a register. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware. Although only one processor is shown in the figure, the apparatus may comprise a plurality of processors, or a processor may comprise a plurality of processing units. Specifically, the processor may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.
The memory is used for storing computer instructions executed by the processor. The memory may be a memory circuit or a memory. The memory may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. The memory may be independent of the processor, or may be a storage unit in the processor, which is not limited herein. Although only one memory is shown in the figure, the apparatus may comprise a plurality of memories or the memory may comprise a plurality of memory units.
The transceiver is used for enabling the processor to interact with the content of other elements or network elements. Specifically, the transceiver may be a communication interface of the apparatus, a transceiving circuit or a communication unit, and may also be a transceiver. The transceiver may also be a communication interface or transceiving circuitry of the processor. In one possible implementation, the transceiver may be a transceiver chip. The transceiver may also include a transmitting unit and/or a receiving unit. In one possible implementation, the transceiver may include at least one communication interface. In another possible implementation, the transceiver may also be a unit implemented in software. In embodiments of the application, the processor may interact with other elements or network elements via the transceiver. For example: the processor obtains or receives content from other network elements through the transceiver. If the processor and the transceiver are physically separate components, the processor may interact with other elements of the apparatus without going through the transceiver.
In one possible implementation, the processor, the memory, and the transceiver may be connected to each other by a bus. The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc.
In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present concepts related in a concrete fashion.
In the embodiments of the present application, various illustrations are made for the sake of an understanding of aspects. However, these examples are merely examples and are not meant to be the best mode of carrying out the present application.
The above-described embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof, and when implemented using software, may be implemented in whole or in part in the form of a computer program product.
The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave).
The technical solutions provided by the present application have been introduced in detail. Specific examples are used herein to explain the principles and embodiments of the present application, and the descriptions of the above embodiments are only intended to help understand the method and core ideas of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (19)

1. A method of data processing, the method comprising:
the data processing device acquires a frame sequence, wherein frames in the frame sequence have a first resolution;
the data processing device determines at least two frame groups from the frame sequence, wherein the frame groups comprise a first target frame and at least two adjacent frames of the first target frame, the first target frame is any one frame in the frame sequence, and the adjacent frames are frames in the frame sequence except the first target frame;
the data processing apparatus determines a feature of each of the at least two frame groups through a three-dimensional convolutional neural network, the feature of each frame group indicating detail information acquired from adjacent frames within the each frame group based on the first target frame, the size of a convolutional kernel in a time dimension in the three-dimensional convolutional neural network being positively correlated with the number of frames in the frame group;
the data processing apparatus fusing features of each of the at least two frame groups to determine a detail feature of the first target frame, the detail feature indicating detail information obtained from adjacent frames within the at least two frame groups based on the first target frame;
and the data processing device acquires a first target frame with a second resolution according to the detail features and the first target frame, wherein the second resolution is greater than the first resolution.
2. The method of claim 1, wherein the group of frames includes the first target frame and two of the neighboring frames.
3. The method of claim 2, wherein the two neighboring frames comprise a first neighboring frame and a second neighboring frame, and wherein an interval between the first neighboring frame and the first target frame in the frame sequence is equal to an interval between the second neighboring frame and the first target frame in the frame sequence.
4. The method according to any of claims 1 to 3, wherein the at least two frame groups comprise three frame groups.
5. The method of claim 1, further comprising:
aligning frames within each of the at least two frame groups by the data processing apparatus, determining at least two frame groups that are aligned;
the data processing apparatus determining the characteristics of each of the at least two frame groups through a three-dimensional convolutional neural network comprises:
the data processing apparatus determines a characteristic of each of the at least two frame groups that are aligned through a three-dimensional convolutional neural network.
6. The method of claim 5, wherein the data processing apparatus aligns frames within each of the at least two frame groups, and wherein determining the aligned at least two frame groups comprises:
the data processing device determines a homography matrix between all continuous two frames in a queue formed by the frames in the at least two frame groups;
the data processing apparatus determines the aligned at least two frame groups from the homography matrix.
7. The method according to any one of claims 1 to 6, further comprising:
the data processing device determines the weight of the characteristics of each frame group in the at least two frame groups through a deep neural network;
the data processing apparatus fusing the features of each of the at least two frame groups to determine the detail feature of the first target frame comprises:
and the data processing device fuses the characteristics of each frame group in the at least two frame groups according to the weight value so as to determine the detail characteristics of the first target frame.
8. The method of any one of claims 1 to 7, wherein the size of the convolution kernel in the time dimension in the three-dimensional convolutional neural network is equal to the number of frames in the set of frames.
9. A data processing apparatus, comprising:
an obtaining unit configured to obtain a sequence of frames, wherein frames in the sequence of frames have a first resolution;
a determining unit, configured to determine at least two frame groups from the frame sequence, where the frame groups include a first target frame and at least two adjacent frames of the first target frame, where the first target frame is any one of the frame sequence, and the adjacent frames are frames other than the first target frame in the frame sequence;
the determining unit is further configured to determine, through a three-dimensional convolutional neural network, a feature of each of the at least two frame groups, the feature of each frame group indicating detail information acquired from adjacent frames within the each frame group based on the first target frame, wherein a size of a convolutional kernel in a time dimension in the three-dimensional convolutional neural network is positively correlated with the number of frames in the frame group;
a processing unit for fusing features of each of the at least two frame groups to determine a detail feature of the first target frame, the detail feature indicating detail information obtained from adjacent frames within the at least two frame groups based on the first target frame;
the acquiring unit is further configured to acquire a first target frame with a second resolution according to the detail feature and the first target frame, where the second resolution is greater than the first resolution.
10. The apparatus of claim 9, wherein the group of frames comprises the first target frame and two of the neighboring frames.
11. The apparatus of claim 10, wherein the two neighboring frames comprise a first neighboring frame and a second neighboring frame, and wherein an interval between the first neighboring frame and the first target frame in the frame sequence is equal to an interval between the second neighboring frame and the first target frame in the frame sequence.
12. The apparatus according to any one of claims 9 to 11,
the at least two frame groups include three frame groups.
13. The apparatus of claim 9, wherein the determination unit,
the frame alignment device is further used for aligning frames in each frame group of the at least two frame groups and determining at least two frame groups which are aligned;
the determining unit is specifically configured to:
determining, by a three-dimensional convolutional neural network, a characteristic of each of the aligned at least two frame groups.
14. The apparatus according to claim 13, wherein the determining unit is specifically configured to:
determining a homography matrix between all the two continuous frames in the queue formed by the frames in the at least two frame groups;
and determining the at least two aligned frame groups according to the homography matrix.
15. The apparatus according to any one of claims 9 to 14,
the determination unit is further configured to:
determining a weight of the features of each of the at least two frame groups through a deep neural network;
the processing unit is specifically configured to:
and fusing the characteristics of each frame group in the at least two frame groups according to the weight to determine the detail characteristics of the first target frame.
16. The apparatus of any one of claims 9 to 15, wherein the size of the convolution kernel in the time dimension in the three-dimensional convolutional neural network is equal to the number of frames in the group of frames.
17. A data processing apparatus comprising a processor and a memory, said processor and said memory being interconnected, wherein said memory is adapted to store a computer program comprising program instructions, said processor being adapted to invoke said program instructions to perform the method of any one of claims 1 to 8.
18. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 8.
19. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1 to 8.
CN202010110945.4A 2020-02-21 2020-02-21 Data processing method and data processing device Active CN111402130B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010110945.4A CN111402130B (en) 2020-02-21 2020-02-21 Data processing method and data processing device

Publications (2)

Publication Number Publication Date
CN111402130A true CN111402130A (en) 2020-07-10
CN111402130B CN111402130B (en) 2023-07-18

Family

ID=71430396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010110945.4A Active CN111402130B (en) 2020-02-21 2020-02-21 Data processing method and data processing device

Country Status (1)

Country Link
CN (1) CN111402130B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3166075A1 (en) * 2015-11-05 2017-05-10 Facebook, Inc. Systems and methods for processing content using convolutional neural networks
WO2019179036A1 (en) * 2018-03-19 2019-09-26 平安科技(深圳)有限公司 Deep neural network model, electronic device, identity authentication method, and storage medium
US20190297326A1 (en) * 2018-03-21 2019-09-26 Nvidia Corporation Video prediction using spatially displaced convolution
WO2020017871A1 (en) * 2018-07-16 2020-01-23 삼성전자 주식회사 Image processing apparatus and operation method thereof
CN109255351A (en) * 2018-09-05 2019-01-22 华南理工大学 Bounding box homing method, system, equipment and medium based on Three dimensional convolution neural network
CN110120011A (en) * 2019-05-07 2019-08-13 电子科技大学 A kind of video super resolution based on convolutional neural networks and mixed-resolution
CN110472531A (en) * 2019-07-29 2019-11-19 腾讯科技(深圳)有限公司 Method for processing video frequency, device, electronic equipment and storage medium
CN110634105A (en) * 2019-09-24 2019-12-31 南京工程学院 Video high-space-time resolution signal processing method combining optical flow method and deep network
KR20190117416A (en) * 2019-09-26 2019-10-16 엘지전자 주식회사 Method and apparatus for enhancing video frame resolution

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HOU Jingxuan et al., "Research on a frame rate up-conversion algorithm based on convolutional networks", Application Research of Computers (《计算机应用研究》), no. 02, 15 March 2017 (2017-03-15) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022012276A1 (en) * 2020-07-13 2022-01-20 广东博智林机器人有限公司 Temperature calibration method and apparatus, and device and storage medium
CN112070664A (en) * 2020-07-31 2020-12-11 华为技术有限公司 Image processing method and device
WO2022022288A1 (en) * 2020-07-31 2022-02-03 华为技术有限公司 Image processing method and apparatus
CN112070664B (en) * 2020-07-31 2023-11-03 华为技术有限公司 Image processing method and device
WO2022104774A1 (en) * 2020-11-23 2022-05-27 华为技术有限公司 Target detection method and apparatus
CN112862101A (en) * 2021-01-29 2021-05-28 网易有道信息技术(北京)有限公司 Method and apparatus for optimizing neural network model inference
CN113592709A (en) * 2021-02-19 2021-11-02 腾讯科技(深圳)有限公司 Image super-resolution processing method, device, equipment and storage medium
CN113592709B (en) * 2021-02-19 2023-07-25 腾讯科技(深圳)有限公司 Image super processing method, device, equipment and storage medium
CN113205148A (en) * 2021-05-20 2021-08-03 山东财经大学 Medical image frame interpolation method and terminal for iterative interlayer information fusion
CN114242100A (en) * 2021-12-16 2022-03-25 北京百度网讯科技有限公司 Audio signal processing method, training method and device, equipment and storage medium thereof
CN114554213A (en) * 2022-02-21 2022-05-27 电子科技大学 Motion adaptive and detail-focused compressed video quality enhancement method
CN114677412A (en) * 2022-03-18 2022-06-28 苏州大学 Method, device and equipment for estimating optical flow

Also Published As

Publication number Publication date
CN111402130B (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN111402130B (en) Data processing method and data processing device
WO2020177651A1 (en) Image segmentation method and image processing device
CN112308200B (en) Searching method and device for neural network
CN110188795B (en) Image classification method, data processing method and device
CN111667399B (en) Training method of style migration model, video style migration method and device
CN110717851B (en) Image processing method and device, training method of neural network and storage medium
CN109993707B (en) Image denoising method and device
CN112236779A (en) Image processing method and image processing device based on convolutional neural network
CN111914997B (en) Method for training neural network, image processing method and device
CN112446380A (en) Image processing method and device
WO2022134971A1 (en) Noise reduction model training method and related apparatus
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN112070664B (en) Image processing method and device
CN113011562A (en) Model training method and device
CN113076685A (en) Training method of image reconstruction model, image reconstruction method and device thereof
WO2021103731A1 (en) Semantic segmentation method, and model training method and apparatus
WO2024002211A1 (en) Image processing method and related apparatus
CN110222718A (en) The method and device of image procossing
WO2024061269A1 (en) Three-dimensional reconstruction method and related apparatus
CN111797881A (en) Image classification method and device
CN113673545A (en) Optical flow estimation method, related device, equipment and computer readable storage medium
CN113066018A (en) Image enhancement method and related device
US20220215617A1 (en) Viewpoint image processing method and related device
CN113284055A (en) Image processing method and device
CN113065575A (en) Image processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant