CN118053092A - Video processing method and device, chip, storage medium and electronic equipment - Google Patents

Video processing method and device, chip, storage medium and electronic equipment

Info

Publication number
CN118053092A
Authority
CN
China
Prior art keywords
image
interest
region
branch
frame image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311756121.4A
Other languages
Chinese (zh)
Inventor
董至恺
周祥云
吴方熠
游森福
夏海军
汪子晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rockchip Electronics Co Ltd
Original Assignee
Rockchip Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rockchip Electronics Co Ltd filed Critical Rockchip Electronics Co Ltd
Priority to CN202311756121.4A priority Critical patent/CN118053092A/en
Publication of CN118053092A publication Critical patent/CN118053092A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure provides a video processing method and apparatus, a chip, a storage medium, and an electronic device. The video processing method includes: obtaining a current frame image of a video; performing region-of-interest detection on the current frame image to obtain at least one region of interest; performing enhancement processing on each region of interest using a first neural network to correspondingly obtain a region-of-interest branch image; performing enhancement processing on the current frame image using a second neural network to obtain a full-frame branch image; and fusing all obtained region-of-interest branch images with the full-frame branch image to obtain a current frame enhanced image. The method and apparatus achieve real-time enhancement and super-resolution of low-quality video and can run in real time on mobile devices.

Description

Video processing method and device, chip, storage medium and electronic equipment
Technical Field
The disclosure belongs to the technical field of video processing, and in particular relates to a video processing method and device, a chip, a storage medium and electronic equipment.
Background
With the rapid development of the consumer electronics industry in recent years, the screen size and resolution of electronic products have become larger and larger. However, a large number of videos were shot in the DVD era or decades ago; owing to the technical limitations of that time, these videos have low image quality and low resolution, such as 360P or 480P, and they are usually transcoded multiple times during streaming, which further degrades image quality.
Disclosure of Invention
Embodiments of the present disclosure provide a video processing method and apparatus, a chip, a storage medium, and an electronic device for enhancing image quality and resolution of a low-resolution, low-quality video.
In a first aspect, embodiments of the present disclosure provide a video processing method. The video processing method includes the following steps: obtaining a current frame image of a video; performing region-of-interest detection on the current frame image to obtain at least one region of interest; performing enhancement processing on the region of interest using a first neural network to obtain a region-of-interest branch image; performing enhancement processing on the current frame image using a second neural network to obtain a full-frame branch image; and fusing the region-of-interest branch image with the full-frame branch image to obtain a current frame enhanced image.
In an implementation manner of the first aspect, performing enhancement processing on the region of interest using a first neural network includes: performing enhancement processing on the region of interest, using a generative model trained separately for the target region of interest, to add detail textures.
In an implementation manner of the first aspect, performing enhancement processing on the current frame image using a second neural network includes: performing enhancement processing on the information of the current frame image using a general-purpose model, where the general-purpose model can be used to enhance arbitrary images.
In an implementation manner of the first aspect, fusing the region of interest branch image with the full-frame branch image includes: smoothing the enhancement result of the detail texture of the region-of-interest branch image and the enhancement intensity of the region-of-interest branch image; and fusing the region of interest branch image after the smoothing processing with the full-frame branch image.
In an implementation manner of the first aspect, the method further includes: a plurality of frame enhanced images are output at a predetermined frame rate for playing an enhanced video corresponding to the video in real time based on the plurality of frame enhanced images.
In an implementation manner of the first aspect, performing region-of-interest detection on the current frame image to obtain at least one region of interest includes: performing feature extraction on the current frame image using a deep neural network to obtain image features; and inputting the image features into at least one type of region-of-interest prediction model to obtain at least one correspondingly output type of region of interest.
In an implementation manner of the first aspect, the first neural network includes a generative adversarial network and a super-resolution network, wherein performing enhancement processing on the region of interest using the first neural network includes: determining a corresponding affine transformation method according to the type of the region of interest, and performing affine transformation on the region of interest to obtain a region-of-interest transformation branch; extracting a first picture feature of the region-of-interest transformation branch using a generator of the generative adversarial network, and performing enhancement processing on the first picture feature to obtain a second picture feature; computing a mask of the first picture feature using a mask branch; fusing the second picture feature with the region-of-interest transformation branch based on the mask of the first picture feature to obtain an enhancement result; magnifying the enhancement result using the super-resolution network to obtain an enhanced magnification result; and performing the corresponding inverse affine transformation on the enhanced magnification result to obtain the region-of-interest branch image.
In an implementation manner of the first aspect, the second neural network is a deep neural network, wherein performing enhancement processing on the current frame image with the second neural network, to obtain a full-frame branch image includes: and inputting the current frame image with the first resolution into the deep neural network to output the current frame image with the second resolution, wherein the second resolution is higher than the first resolution.
In an implementation manner of the first aspect, the method further includes: and scaling the current frame image to a preset resolution ratio and then detecting the region of interest.
In an implementation manner of the first aspect, fusing the region of interest branch image and the full-frame branch image to obtain a current frame enhanced image includes: performing time domain smoothing processing on the region of interest branch image of the current frame image and the region of interest branch image of the corresponding previous frame image to obtain a time domain smoothing result of the region of interest branch image of the current frame image; and fusing the time domain smoothing results of all the region-of-interest branch images of the current frame image with the full-frame branch image according to the fusion proportion to obtain the current frame enhanced image.
In an implementation manner of the first aspect, performing temporal smoothing processing on the region of interest branch image of the current frame image and the region of interest branch image of the corresponding previous frame image, and obtaining a temporal smoothing result of the region of interest branch image of the current frame image includes: calculating gradients of the region-of-interest branch image of the current frame image and the corresponding region-of-interest branch image of the previous frame image to obtain gradient gain coefficients; calculating to obtain a smoothing coefficient of the current frame image according to the pixel difference value of the region-of-interest branch image of the current frame image and the region-of-interest branch image of the corresponding previous frame image and the gradient gain coefficient; and according to the smoothing coefficient, fusing the region-of-interest branch image of the current frame image with the corresponding region-of-interest branch image of the previous frame image to obtain a time domain smoothing result of the region-of-interest branch image of the current frame image.
In an implementation manner of the first aspect, fusing the temporal smoothing results of all region-of-interest branch images of the current frame image with the full-frame branch image according to a fusion ratio to obtain the current frame enhanced image includes: calculating the confidence of all region-of-interest branch images of the current frame image, the confidence being the product of a size coefficient and a blur coefficient of the region of interest; acquiring the region-of-interest branch image of the current frame image and calculating its positional distance to the nearest region-of-interest branch image of the same type in the previous frame image; obtaining a branch fusion coefficient corresponding to the region-of-interest branch image of the current frame image according to the confidence, the mask and the smoothing coefficient of the region-of-interest branch image of the current frame image and the confidence of the nearest region-of-interest branch image of the same type in the previous frame image; and fusing the temporal smoothing result of each region-of-interest branch image of the current frame image with the full-frame branch image according to the branch fusion coefficient to obtain the current frame enhanced image.
In a second aspect, embodiments of the present disclosure provide a video processing apparatus. The video processing apparatus includes: the input module is configured to receive a video stream and obtain a current frame image of the video; the ROI detection module is electrically coupled with the input module and is configured to detect the region of interest of the current frame image to obtain at least one region of interest; the ROI enhancement and amplification module is electrically coupled with the ROI detection module and is configured to enhance the region of interest by using a first neural network to obtain a region of interest branch image; the full-frame recovery and amplification module is electrically coupled with the input module and is configured to perform enhancement processing on the current frame image by using a second neural network to obtain a full-frame branch image; the time domain smoothing and fusing module is electrically coupled with the ROI enhancement and amplification module and the full-frame restoration and amplification module respectively and is configured to fuse the region-of-interest branch image with the full-frame branch image to obtain a current frame enhanced image; and an output module, electrically coupled to the temporal smoothing and fusion module, configured to output an enhanced video stream composed of a series of current frame enhanced images.
In a third aspect, embodiments of the present disclosure provide a chip. The chip includes: a central processing unit (CPU) configured to, when receiving the n-th frame image of a video stream, output each region of interest of the (n-1)-th frame image and perform region-of-interest detection on the n-th frame image to obtain at least one region of interest of the n-th frame image, where n is a natural number greater than zero; a neural network processing unit (NPU) configured to, when the CPU receives the n-th frame image of the video stream, perform enhancement processing on each region of interest of the (n-1)-th frame image to obtain each region-of-interest branch image of the (n-1)-th frame image, and perform enhancement processing on the (n-1)-th frame image to obtain a full-frame branch image of the (n-1)-th frame image; and a graphics processing unit (GPU) configured to, when the CPU receives the n-th frame image of the video stream, fuse all obtained region-of-interest branch images of the (n-1)-th frame image with the full-frame branch image of the (n-1)-th frame image to obtain an enhanced image of the (n-1)-th frame image (an illustrative software sketch of this frame-level pipelining is given below).
In an implementation manner of the third aspect, the CPU is further configured to output each region of interest of an nth frame image when receiving the (n+1) th frame image of the video stream, and perform region of interest detection on the (n+1) th frame image to obtain at least one region of interest of the (n+1) th frame image; the NPU is further configured to perform enhancement processing on each region of interest of the nth frame image to obtain each region of interest branch image of the nth frame image and perform enhancement processing on the nth frame image to obtain a full frame branch image of the nth frame image when the CPU receives the (n+1) th frame image of the video stream; and the GPU is further configured to fuse each region of interest branch image of the obtained nth frame image with a full frame branch image of the nth frame image when the CPU receives the (n+1) th frame image of the video stream, so as to obtain an enhanced image of the nth frame image.
In one implementation manner of the third aspect, the NPU includes a plurality of NPU units; the number of NPU units is configured to be the same as the number of regions of interest of the image.
In one implementation of the third aspect, the GPU is further configured to output a plurality of frame enhanced images at a predetermined frame rate for playing an enhanced video stream corresponding to the video stream in real time based on the plurality of frame enhanced images.
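As a purely illustrative, non-limiting sketch of the frame-level pipelining described in the third aspect, the scheduling can be mocked in software as follows. The function names and the single-threaded Python loop standing in for dedicated CPU, NPU and GPU hardware are assumptions for illustration only, not the claimed hardware behavior.

```python
# Illustrative software mock of the CPU/NPU/GPU frame pipeline described above.
# Hardware units are replaced by plain Python functions; names are hypothetical.

def detect_rois(frame):            # CPU stage: ROI detection on frame n
    return [("face", (10, 10, 64, 64))]   # dummy result

def enhance_rois(frame, rois):     # NPU stage: ROI branches + full-frame branch
    roi_branches = [(roi, frame) for roi in rois]
    full_branch = frame
    return roi_branches, full_branch

def fuse(roi_branches, full_branch):   # GPU stage: temporal smoothing and fusion
    return full_branch

def process_stream(frames):
    """Two-stage pipeline: while frame n is being detected, frame n-1 is enhanced and fused."""
    pending = None                      # (frame, rois) of the previous frame
    outputs = []
    for frame in frames:
        rois = detect_rois(frame)       # CPU works on the current frame
        if pending is not None:         # NPU/GPU work on the previous frame
            prev_frame, prev_rois = pending
            roi_branches, full_branch = enhance_rois(prev_frame, prev_rois)
            outputs.append(fuse(roi_branches, full_branch))
        pending = (frame, rois)
    if pending is not None:             # flush the last frame
        roi_branches, full_branch = enhance_rois(*pending)
        outputs.append(fuse(roi_branches, full_branch))
    return outputs
```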
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium having stored thereon a computer program that, when executed, implements the video processing method of any one of the first aspects of the present disclosure.
In a fifth aspect, embodiments of the present disclosure provide an electronic device. The electronic device includes: a memory configured to store a processor executable program; and a processor configured to invoke the program to perform the video processing method according to any of the first aspects of the present disclosure.
In a sixth aspect, embodiments of the present disclosure provide an electronic device. The electronic device comprises the chip of any one of the third aspects of the present disclosure and plays an enhanced video stream comprising enhanced images of a plurality of frame images in real time based on the received video stream.
According to embodiments of the present disclosure, an entire frame image is divided into ROI areas and non-ROI areas. For non-ROI areas, a lower-complexity deep neural network (such as a generation network) is used for enhancement and magnification; for ROI areas, a higher-complexity generative network is used for enhancement and magnification. The generative network can add detail textures to low-quality video, achieving a better visual effect. Finally, the outputs of the two branches are temporally smoothed and fused to obtain the final enhancement result.
Drawings
Fig. 1 is a schematic flow chart of a video processing method according to an embodiment of the disclosure.
Fig. 2 is a schematic flow chart of a video processing method according to an embodiment of the disclosure.
Fig. 3 is a schematic flow chart of step S120 of the video processing method according to the embodiment of the disclosure.
Fig. 4 is a schematic flow chart of step S130 of the video processing method according to the embodiment of the disclosure.
Fig. 5 is a schematic flow chart of step S140 of the video processing method according to the embodiment of the disclosure.
Fig. 6 is a schematic flow chart of step S150 of the video processing method according to the embodiment of the disclosure.
Fig. 7A is a schematic structural diagram of a video processing apparatus according to an embodiment of the disclosure.
Fig. 7B is a schematic diagram of another configuration of a video processing apparatus according to an embodiment of the disclosure.
Fig. 8A is a schematic diagram illustrating a ROI enhancement and enlargement module of a video processing apparatus according to an embodiment of the present disclosure.
Fig. 8B is a schematic diagram illustrating a structure of an ROI enhancing and amplifying unit of the video processing apparatus according to an embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of a chip according to an embodiment of the disclosure.
Fig. 10A is a flow timing diagram of a video processing method according to an embodiment of the disclosure at a single frame time on a chip.
Fig. 10B is a flow timing diagram of a video processing method according to an embodiment of the disclosure at a dual frame time on a chip.
Fig. 11 is a schematic diagram of an implementation structure of an electronic device according to an embodiment of the disclosure.
Detailed Description
Other advantages and effects of the embodiments of the present disclosure will be readily apparent to those skilled in the art from the following detailed description of the embodiments. The embodiments of the present disclosure may also be implemented or applied in other specific embodiments, and various modifications and changes may be made to the details of this description from different viewpoints and for different applications without departing from the spirit of the embodiments of the present disclosure. It should be noted that, in the absence of conflict, the following embodiments and the features in the embodiments may be combined with each other.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the embodiments of the disclosure by way of example. The drawings show only the components related to the embodiments of the disclosure and are not drawn according to the number, shape and size of the components in an actual implementation; in an actual implementation, the form, number, proportion and layout of the components may be changed arbitrarily and may be more complicated.
Fig. 1 is a flowchart illustrating a video processing method according to an embodiment of the present disclosure. As shown in fig. 1, the video processing method provided by the embodiment of the present disclosure may include the following steps S110 to S150.
In step S110, a current frame image of a video is obtained.
In step S120, region of interest detection is performed on the current frame image, and at least one region of interest (Region of Interest, ROI) is obtained.
In step S130, enhancement processing is performed on each region of interest using the first neural network to correspondingly obtain a region-of-interest branch image. In some embodiments, the region of interest is enhanced, using a generative model trained separately for the target region of interest, to add detail textures.
In step S140, enhancement processing is performed on the current frame image using a second neural network to obtain a full-frame branch image. In some embodiments, the information of the current frame image is enhanced using a general-purpose model, and the general-purpose model can be used to enhance arbitrary images.
In step S150, all obtained region-of-interest branch images are fused with the full-frame branch image to obtain the current frame enhanced image. In some embodiments, the detail-texture enhancement result of the region-of-interest branch image and the enhancement strength of the region-of-interest branch image are smoothed, and the smoothed region-of-interest branch image is then fused with the full-frame branch image.
In some embodiments, the video processing method may further include: a plurality of frame enhanced images are output at a predetermined frame rate for playing an enhanced video corresponding to the video in real time based on the plurality of frame enhanced images.
According to embodiments of the present disclosure, an entire frame image is divided into ROI areas and non-ROI areas. For non-ROI areas, a lower-complexity deep neural network (such as a generation network) is used for enhancement and magnification; for ROI areas, a higher-complexity generative network is used for enhancement and magnification. The generative network can add detail textures to low-quality video, achieving a better visual effect. Finally, the outputs of the two branches are temporally smoothed and fused to obtain the final enhancement result.
Most existing image-quality enhancement and super-resolution schemes perform enhancement and super-resolution by removing picture artifacts and sharpening edges. However, for low-resolution, low-quality video, the textures and details of the input may already be lost, so the improvement such schemes can bring is limited. In contrast, the present disclosure introduces a generative model as the first neural network, which can generate the lost textures and details for key areas and achieves a better visual effect in a "generate-from-nothing" manner.
Existing learning-based image-quality enhancement and super-resolution schemes generally enhance and super-resolve video frames in an end-to-end manner, that is, the input is passed directly through a network to produce the output. However, running in real time on mobile devices places particularly severe compute requirements on the network, so the practical network size is limited and a good visual effect is difficult to achieve. To address this problem, the present disclosure proposes that, in addition to processing the whole frame image, additional enhancement processing is performed on key regions of interest (ROI), where the enhancement complexity of the key regions is far greater than the algorithm complexity applied to the whole frame image. This is equivalent to allocating the computing resources of non-interest regions to the regions of interest. Since the user's attention is visually concentrated on the ROI areas, this non-uniform distribution of computing power improves the visual effect under the same computing resources. A region of interest here refers to a type of picture region that users focus on, including but not limited to faces and subtitles, and the definition of the ROI may differ according to the actual service scenario.
Fig. 2 is another flow chart illustrating a video processing method according to an embodiment of the present disclosure. As shown in fig. 2, the video processing method provided by the present embodiment may further include step S115 in addition to steps S110 to S150 described above.
In step S115, the current frame image is scaled to a predetermined resolution before region-of-interest detection is performed.
Fig. 3 is a flowchart illustrating a video processing method according to an embodiment of the present disclosure in step S120. As shown in fig. 3, in step S120, region of interest detection is performed on the current frame image, and obtaining at least one region of interest includes the following steps S121 to S122.
In step S121, feature extraction is performed on the current frame image by using a deep neural network, so as to obtain an image feature.
In step S122, the image features are input into at least one type of region-of-interest prediction model, and at least one type of region-of-interest corresponding to the output is obtained.
Fig. 4 is a flowchart illustrating a video processing method according to an embodiment of the present disclosure in step S130. As shown in fig. 4, in step S130, enhancement processing is performed on each region of interest using the first neural network, and the corresponding acquisition of the region-of-interest branch image includes the following steps S131 to S136.
In step S131, a corresponding affine transformation method is determined according to the type of the region of interest, and affine transformation is performed on the region of interest to obtain a region of interest transformation branch.
In step S132, a first picture feature of the region-of-interest transformation branch is extracted by a generator of the generative adversarial network, and enhancement processing is performed on the first picture feature to obtain a second picture feature.
In step S133, a mask of the first picture feature is obtained by using a mask branch calculation.
In step S134, the second picture feature is fused with the region of interest transformation branch based on the mask of the first picture feature, and an enhancement result is obtained.
In step S135, the enhancement result is magnified using the super-resolution network to obtain an enhanced magnification result.
In step S136, the corresponding inverse affine transformation is performed on the enhanced magnification result to obtain the region-of-interest branch image; the first neural network includes the generative adversarial network and the super-resolution network.
Fig. 5 is a flowchart illustrating a video processing method according to an embodiment of the present disclosure in step S140. As shown in fig. 5, in step S140, the enhancement processing is performed on the current frame image using a second neural network, and obtaining a full-frame branch image includes the following step S141.
In step S141, the second neural network is a deep neural network; the current frame image at a first resolution is input into the deep neural network, which outputs the current frame image at a second resolution, where the second resolution is higher than the first resolution.
Fig. 6 is a flowchart illustrating a video processing method according to an embodiment of the present disclosure in step S150. As shown in fig. 6, in step S150, the obtained all region of interest branch image is fused with the full-frame branch image, and the obtaining of the current frame enhanced image includes the following steps S151 to S152.
In step S151, the region of interest branch image of the current frame image and the corresponding region of interest branch image of the previous frame image are subjected to temporal smoothing processing, so as to obtain a temporal smoothing result of the region of interest branch image of the current frame image.
In step S152, the time domain smoothing results of all the region of interest branch images of the current frame image are fused with the full frame branch image according to the fusion ratio, so as to obtain the enhanced image of the current frame.
According to an embodiment of the present disclosure, in step S151, performing temporal smoothing processing on the region-of-interest branch image of the current frame image and the region-of-interest branch image of the corresponding previous frame image, and obtaining a temporal smoothing result of the region-of-interest branch image of the current frame image includes the following steps S1511 to S1513.
In step S1511, gradients of the region-of-interest branch image of the current frame image and the corresponding region-of-interest branch image of the previous frame image are calculated, and gradient gain coefficients are obtained.
In step S1512, a smoothing coefficient of the current frame image is calculated according to the pixel difference value of the region of interest branch image of the current frame image and the region of interest branch image of the corresponding previous frame image and the gradient gain coefficient.
In step S1513, according to the smoothing coefficient, the region-of-interest branch image of the current frame image and the corresponding region-of-interest branch image of the previous frame image are fused, so as to obtain a time domain smoothing result of the region-of-interest branch image of the current frame image.
Most existing image-quality enhancement schemes are based on single-frame images, and applying them directly to every frame of a video easily produces flicker artifacts caused by temporal discontinuity. The flicker artifacts manifest in two ways. First, false detections or missed detections of the ROI make the detection result discontinuous, so the enhancement effect within the ROI area appears and disappears between frames. Second, because a generative model is used to enhance details within the ROI and no information is passed between frames, the added details may differ from frame to frame, producing detail or texture flicker in the temporal domain when the video is played continuously. To address these temporal artifacts, a temporally stable fusion scheme is introduced: the ROI enhancement strength is temporally smoothed according to the historical detection results, the ROI enhancement result is temporally denoised, and finally weighted fusion is performed. With this temporal fusion scheme, an algorithm based on single-frame images can be extended to video without temporal flicker artifacts, further improving the visual impression.
According to an embodiment of the present disclosure, in step S152, the time domain smoothing result of all the region of interest branch images of the current frame image is fused with the full frame branch image according to a fusion ratio, and obtaining the current frame enhanced image includes the following steps S1521 to S1524.
In step S1521, the confidence of all region-of-interest branch images of the current frame image is calculated; the confidence is the product of a size coefficient and a blur coefficient of the region of interest.
In step S1522, the region-of-interest branch image of the current frame image is acquired, and its positional distance to the nearest region-of-interest branch image of the same type in the previous frame image is calculated.
In step S1523, a branch fusion coefficient corresponding to the region-of-interest branch image of the current frame image is obtained according to the confidence, the mask and the smoothing coefficient of the region-of-interest branch image of the current frame image and the confidence of the nearest region-of-interest branch image of the same type in the previous frame image.
In step S1524, the temporal smoothing result of each region-of-interest branch image of the current frame image is fused with the full-frame branch image according to the branch fusion coefficient to obtain the current frame enhanced image.
The protection scope of the video processing method according to the embodiments of the present disclosure is not limited to the execution order of the steps listed in the embodiments of the present disclosure; all schemes obtained by adding, removing or replacing steps according to the prior art based on the principles of the present disclosure are included in the protection scope of the present disclosure.
The embodiments of the present disclosure also provide a video processing apparatus that can implement the video processing method of the present disclosure. However, the apparatus for implementing the video processing method of the present disclosure includes, but is not limited to, the structure of the video processing apparatus listed in this embodiment; all structural modifications and substitutions made according to the principles of the present disclosure in the prior art are included in the protection scope of the present disclosure.
Fig. 7A is a schematic diagram illustrating a configuration of a video processing apparatus according to an embodiment of the present disclosure. As shown in fig. 7A, an embodiment of the present disclosure provides a video processing apparatus 700. The video processing apparatus 700 includes: an input module 710, an ROI detection module 720, an ROI enhancement and amplification module 730, a full frame recovery and amplification module 740, a temporal smoothing and fusion module 750, and an output module 760.
The input module 710 is configured to receive a video stream and obtain a current frame image of the video.
The ROI detection module 720 is electrically coupled to the input module 710 and configured to perform region of interest detection on the current frame image to obtain at least one region of interest.
The ROI enhancement and amplification module 730 is electrically coupled to the ROI detection module 720 and configured to perform enhancement processing on each region of interest using a first neural network to correspondingly obtain a region-of-interest branch image.
The full-frame restoration and amplification module 740 is electrically coupled to the input module 710 and configured to perform enhancement processing on the current frame image by using a second neural network to obtain a full-frame branched image.
The temporal smoothing and fusion module 750 is electrically coupled to the ROI enhancement and amplification module 730 and the full-frame restoration and amplification module 740, respectively, and is configured to fuse all obtained region-of-interest branch images with the full-frame branch image to obtain a current frame enhanced image.
The output module 760 is electrically coupled to the temporal smoothing and fusion module 750 and is configured to output an enhanced video stream comprised of a series of current frame enhanced images.
The video processing apparatus according to the embodiments of the present disclosure divides an entire frame image into ROI areas and non-ROI areas. For non-ROI areas, a lower-complexity deep neural network (such as a generation network) is used for enhancement and magnification; for ROI areas, a higher-complexity generative network is used for enhancement and magnification. The generative network can add detail textures to low-quality video, achieving a better visual effect. Finally, the outputs of the two branches are temporally smoothed and fused to obtain the final enhancement result.
Fig. 7B is a schematic diagram illustrating another configuration of a video processing apparatus according to an embodiment of the present disclosure. As shown in fig. 7B, the video processing apparatus 700 provided in the embodiment of the present disclosure further includes a pre-scaling module 770 in addition to the input module 710, the ROI detection module 720, the ROI enhancement and amplification module 730, the full-frame recovery and amplification module 740, the temporal smoothing and fusion module 750, and the output module 760 described above. The pre-scaling module 770 is electrically coupled to the input module 710, the ROI detection module 720, and the ROI enhancement and magnification module 730, respectively, and is configured to scale the current frame image to a predetermined resolution prior to region of interest detection and full frame recovery and magnification.
In an embodiment of the disclosure, when the input is not at the specified resolution (e.g., 960×540), the pre-scaling module 770 first scales the input to the specified resolution (e.g., 960×540) using a simple scaling method, and the subsequent algorithm then only has to handle a fixed magnification (e.g., 960×540 to 3840×2160, a fixed 4× magnification). This makes it convenient for a mobile device to accept input of any resolution and operate on it.
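A minimal sketch of this pre-scaling step, assuming OpenCV is available; the 960×540 working resolution and the 4× factor follow the example in the text, while the function name is hypothetical.

```python
import cv2

WORK_W, WORK_H = 960, 540   # fixed working resolution assumed from the example
SCALE = 4                   # fixed magnification (960x540 -> 3840x2160)

def prescale(frame_bgr):
    """Scale an arbitrary-resolution input frame to the fixed working resolution."""
    h, w = frame_bgr.shape[:2]
    if (w, h) == (WORK_W, WORK_H):
        return frame_bgr
    # A simple scaling method is enough here; later stages assume a fixed input size.
    return cv2.resize(frame_bgr, (WORK_W, WORK_H), interpolation=cv2.INTER_LINEAR)
```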
Next, the modules of the video processing apparatus 700 according to the embodiments of the present disclosure are described, taking a 960×540 input as an example that is magnified by the algorithm to 3840×2160 (i.e., 4× magnification), with faces and subtitles as the example ROI types.
In an embodiment of the disclosure, the ROI detection module 720 is configured to perform feature extraction on the current frame image by using a first deep neural network to obtain image features; and inputting the image features into at least one type of interest region prediction model to obtain at least one type of interest region which is correspondingly output.
In an embodiment of the disclosure, the ROI detection module 720 uses a deep neural network to detect ROI areas of the input frame. The number of ROI areas is related to the selected computing platform; typically, 3 ROI areas are selected for processing on a particular chip. In the detection network, the input first passes through a feature-extraction part shared by the different ROI types to reduce computational complexity, and is then fed into prediction branches whose number corresponds to the number of ROI types. This embodiment includes a face detection branch and a subtitle detection branch: the face detection branch detects 5 facial keypoints (specifically, the left and right eyes, the nose, and the left and right corners of the mouth), and the subtitle detection branch outputs a rotated bounding box (specifically, the box center coordinates, the width and height, and the rotation angle).
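A hedged PyTorch sketch of the shared-backbone detector described above; the layer sizes, head designs and dense prediction format are illustrative assumptions, not the actual network of this disclosure.

```python
import torch
import torch.nn as nn

class ROIDetector(nn.Module):
    """Shared feature extractor followed by per-ROI-type prediction branches."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(            # shared by all ROI types
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Face branch: 5 keypoints -> 10 values (x, y per point) per location.
        self.face_head = nn.Conv2d(64, 10, 1)
        # Subtitle branch: rotated box -> (cx, cy, w, h, angle) per location.
        self.subtitle_head = nn.Conv2d(64, 5, 1)

    def forward(self, x):
        feats = self.backbone(x)
        return {"face": self.face_head(feats), "subtitle": self.subtitle_head(feats)}

# Usage sketch on a 960x540 input frame.
detector = ROIDetector()
dummy = torch.randn(1, 3, 540, 960)
preds = detector(dummy)   # dict of per-type dense predictions
```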
In the disclosed embodiment, the network for the ROI area (i.e., the region of interest) is trained separately for each target type and is a dedicated network, whereas the network for non-ROI areas is general-purpose and can be used to enhance arbitrary pictures.
Fig. 8A is a schematic diagram illustrating a structure of an ROI enhancement and enlargement module of a video processing apparatus according to an embodiment of the present disclosure. As shown in fig. 8A, the ROI enhancement and magnification module 730 includes sets of processing units, each set of processing units including an ROI affine transformation unit, an ROI enhancement and magnification unit, and an ROI inverse transformation unit.
Fig. 8B is a schematic diagram illustrating a structure of an ROI enhancement and amplification unit of the video processing apparatus according to the embodiment of the present disclosure. As shown in fig. 8B, the ROI enhancement and amplification unit includes a generator, a discriminator, a mask branch, and a super-resolution network. The generator includes a feature encoder and a feature decoder.
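As a purely illustrative sketch of this unit, the generator and the mask branch can share one encoder as described below; the channel counts and depth are assumptions, and the discriminator (used only during training) is omitted.

```python
import torch
import torch.nn as nn

class ROIEnhanceNet(nn.Module):
    """U-shaped generator whose encoder is shared with a mask branch (illustrative sizes)."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.dec1 = nn.Conv2d(ch * 2, 3, 3, padding=1)   # decoder head, skip connection from enc1
        self.mask_head = nn.Sequential(                  # mask branch reuses encoder features
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        f1 = self.enc1(x)                                 # shared encoder, level 1
        f2 = self.enc2(f1)                                # shared encoder, level 2
        d = self.dec2(f2)
        enhanced = self.dec1(torch.cat([d, f1], dim=1))   # decoder output (enhanced ROI)
        mask = self.mask_head(f2)                         # soft mask of the truly interesting part
        return enhanced, mask

net = ROIEnhanceNet()
out, m = net(torch.randn(1, 3, 128, 128))   # out: (1, 3, 128, 128), m: (1, 1, 128, 128)
```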
In an embodiment of the disclosure, the ROI enhancement and amplification module 730 is configured to perform enhancement processing on each region of interest using the first neural network, and obtaining the region-of-interest branch image includes: determining a corresponding affine transformation method according to the type of the region of interest, and performing affine transformation on the region of interest to obtain a region-of-interest transformation branch; extracting a first picture feature of the region-of-interest transformation branch using a generator of a generative adversarial network, and performing enhancement processing on the first picture feature to obtain a second picture feature; computing a mask of the first picture feature using a mask branch; fusing the second picture feature with the region-of-interest transformation branch based on the mask of the first picture feature to obtain an enhancement result; magnifying the enhancement result using a super-resolution network to obtain an enhanced magnification result; and performing the corresponding inverse affine transformation on the enhanced magnification result to obtain the region-of-interest branch image. The first neural network includes the generative adversarial network and the super-resolution network.
In an embodiment of the disclosure, the ROI enhancement and amplification module 730 transforms the content in the ROI detection result to a preset size based on the detection result of the ROI detection module 720, so that it can subsequently be enhanced and magnified. This embodiment may set the transformation method according to the ROI type, including but not limited to the following examples:
For a face, standard facial spatial-structure feature points, such as a set of two-dimensional spatial coordinates, are preset, and an affine transformation matrix from the detected face keypoints to the standard face keypoints is calculated; the calculation method includes, but is not limited to, least squares.
For subtitles, the 4 vertices of the subtitle bounding box are aligned with the vertices of a preset frame and a transformation matrix is calculated. After the transformed standard ROI is obtained, it is sent into a deep neural network for enhancement and magnification.
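A minimal sketch, assuming OpenCV/NumPy, of the two alignment examples above: least-squares alignment of detected face keypoints to preset standard keypoints, and alignment of the four subtitle-box vertices to a preset rectangle. The standard coordinates are made-up placeholders, and the use of a perspective transform for the 4-point subtitle case is an illustrative choice.

```python
import numpy as np
import cv2

# Hypothetical standard 5-point face template in the aligned ROI (x, y), pixels.
STD_FACE_PTS = np.float32([[38, 52], [90, 52], [64, 80], [44, 108], [84, 108]])
FACE_ROI_SIZE = (128, 128)

def align_face(image, face_pts):
    """Estimate an affine transform from detected keypoints to the template (least squares)."""
    M, _ = cv2.estimateAffinePartial2D(np.float32(face_pts), STD_FACE_PTS)
    roi = cv2.warpAffine(image, M, FACE_ROI_SIZE)
    return roi, M

def align_subtitle(image, box_pts, out_size=(512, 64)):
    """Map the 4 vertices of the rotated subtitle box onto a preset rectangle."""
    dst = np.float32([[0, 0], [out_size[0], 0], [out_size[0], out_size[1]], [0, out_size[1]]])
    M = cv2.getPerspectiveTransform(np.float32(box_pts), dst)
    roi = cv2.warpPerspective(image, M, out_size)
    return roi, M
```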
In an embodiment of the disclosure, the ROI enhancement and amplification module 730 enhances the details and textures of the ROI area through a generative adversarial network. The generator in the generative adversarial network consists of a feature encoder (encoder for short), which extracts picture features of the ROI area, and a feature decoder (decoder for short), which restores and enhances the picture features. The discriminator judges whether the output of the generator looks natural and is used only in the training stage. The mask branch segments the truly interesting part within the ROI area, such as the exact position of the face in the face-ROI case, for use by the subsequent fusion module. The mask branch shares the feature extraction module (i.e., the feature encoder) with the generator, which reduces the amount of computation. After the mask branch computes the mask of the truly interesting part of the ROI area, the enhanced image (i.e., the enhanced region of interest) is fused with the original image using this mask: the truly interesting part takes the enhancement result of the generator, and the remainder uses the original input. The fused result is then magnified 4× by a super-resolution network to obtain the enhanced ROI area. Finally, the ROI area is transformed back to its original position in the image according to the inverse of the previous affine transformation matrix (because of the size enlargement, the translation component of the inverse matrix needs to be multiplied by the corresponding magnification factor).
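A hedged sketch of the mask-based fusion, 4× super-resolution, and inverse-affine paste-back described above. Here generator, mask_branch and sr_net are placeholders for the trained networks, and the array conventions are assumptions for illustration.

```python
import numpy as np
import cv2

SCALE = 4  # fixed magnification used in the running example

def enhance_and_pasteback(roi, M, full_sr_size, generator, mask_branch, sr_net):
    """Enhance one affine-aligned ROI and map it back into the 4x-enlarged frame.

    roi: float32 HxWx3 aligned ROI; M: 2x3 forward affine matrix used for alignment;
    full_sr_size: (width, height) of the enlarged frame, e.g. (3840, 2160).
    """
    enhanced = generator(roi)                          # add detail and texture
    mask = mask_branch(roi)                            # soft mask in [0, 1] of the true ROI part
    fused = mask[..., None] * enhanced + (1.0 - mask[..., None]) * roi
    roi_sr = sr_net(fused)                             # 4x super-resolution of the fused ROI

    # Inverse affine back to the original location. Because of the enlargement,
    # only the translation column of the inverse matrix is multiplied by SCALE.
    M_inv = cv2.invertAffineTransform(np.float64(M)).copy()
    M_inv[:, 2] *= SCALE
    w, h = full_sr_size
    mask_sr = cv2.resize(mask, (roi_sr.shape[1], roi_sr.shape[0]), interpolation=cv2.INTER_LINEAR)
    roi_in_full = cv2.warpAffine(roi_sr, M_inv, (w, h))    # enhanced ROI in full-frame coordinates
    mask_in_full = cv2.warpAffine(mask_sr, M_inv, (w, h))  # mask in full-frame coordinates (w_seg)
    return roi_in_full, mask_in_full
```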
In an embodiment of the disclosure, regarding the structural design of the ROI enhancement and amplification module 730, the generative adversarial network used for the ROI area consists of a generator and a discriminator. The generator receives a low-quality picture and, after processing, outputs a high-quality picture; its structure may be based on a U-shaped architecture divided into an encoder and a decoder, where the encoder extracts features of the input picture and the decoder restores a clear target picture from those features as the output of the generator. The discriminator is used only during training to judge whether the output quality of the generator is good, thereby assisting the training of the generator. The non-ROI area uses an ordinary neural network that differs structurally from the ROI network in two respects: first, no discriminator is needed; second, the number of layers is small and, instead of a U-shaped structure, plain direct connections are used to reduce computational complexity, for example EDSR, ECBSR, and the like.
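A minimal PyTorch sketch of a low-complexity, directly connected (non-U-shaped) full-frame super-resolution network in the spirit of EDSR/ECBSR; the channel count and depth are illustrative assumptions, not the network actually used.

```python
import torch
import torch.nn as nn

class TinyFullFrameSR(nn.Module):
    """Plain directly-connected SR net: a few convs plus a 4x pixel-shuffle upsampler."""
    def __init__(self, channels=32, scale=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),   # 960x540 -> 3840x2160 when scale = 4
        )

    def forward(self, x):
        return self.upsample(self.body(x))

sr = TinyFullFrameSR()
lr = torch.randn(1, 3, 540, 960)
hr = sr(lr)            # shape: (1, 3, 2160, 3840)
```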
The video processing apparatus in the embodiments of the present disclosure tends to perform "generate-from-nothing" detail recovery on ROI areas for specific kinds of targets, for example faces, text, license plates and other areas. Since the kind of the area is known to the neural network in advance, details can be filled in according to priors; for example, if an area is known to be a face, details such as beard and eyebrows can be added. Non-ROI areas, in contrast, are enhanced based only on the input information, for example by sharpening edges and removing noise, and no details that do not exist in the input are produced.
In an embodiment of the disclosure, the full-frame recovery and amplification module 740 is configured to perform enhancement processing on the current frame image using the second neural network to obtain a full-frame branch image. Specifically, the second neural network is a deep neural network; the current frame image at a first resolution is input into the deep neural network, which outputs the current frame image at a second resolution, where the second resolution is higher than the first resolution.
In an embodiment of the disclosure, the full-frame recovery and amplification module 740 is mainly used to enhance and magnify the non-ROI areas. It processes the frame with a deep neural network, taking a 960×540 low-quality image as input and outputting a 3840×2160 processed image. This deep neural network is mainly used to remove background artifacts and refine edges, and its overall complexity is low.
In an embodiment of the disclosure, the temporal smoothing and fusing module 750 is configured to fuse all the obtained region of interest branch images with the full-frame branch image to obtain a current frame enhanced image. Specifically, the temporal smoothing and fusing module 750 is configured to perform temporal smoothing processing on the region of interest branch image of the current frame image and the corresponding region of interest branch image of the previous frame image, so as to obtain a temporal smoothing result of the region of interest branch image of the current frame image; and fusing the time domain smoothing results of all the region-of-interest branch images of the current frame image with the full-frame branch image according to the fusion proportion to obtain the current frame enhanced image.
In an embodiment of the disclosure, the temporal smoothing and fusing module 750 is configured to perform temporal smoothing on the region of interest branch image of the current frame image and the corresponding region of interest branch image of the previous frame image, to obtain a temporal smoothing result of the region of interest branch image of the current frame image. Specifically, the temporal smoothing and fusing module 750 is configured to calculate gradients of the region of interest branch image of the current frame image and the corresponding region of interest branch image of the previous frame image, and obtain gradient gain coefficients; calculating to obtain a smoothing coefficient of the current frame image according to the pixel difference value of the region-of-interest branch image of the current frame image and the region-of-interest branch image of the corresponding previous frame image and the gradient gain coefficient; and according to the smoothing coefficient, fusing the region-of-interest branch image of the current frame image with the corresponding region-of-interest branch image of the previous frame image to obtain a time domain smoothing result of the region-of-interest branch image of the current frame image.
In one implementation of the present disclosure, the fusion coefficient between the ROI enhancement and magnification results of the current frame and the previous frame is first calculated, and the two results are fused; the fusion coefficient and the fusion are computed in the following steps 1) to 3).
1) A gradient gain coefficient g_grad is calculated from the ROI content of the current frame (frame n) and of the previous frame (frame n-1), where Sobel(·) is the magnitude of the Sobel filter, max(·) is the pixel-wise maximum, and x0, x1, y0, y1 are the two endpoint coordinates of the mapping function. The ROI content of frame n is the region detected in the n-th frame after affine alignment according to the target type. The linear mapping function is y = f_map(x, x0, x1, y0, y1), which linearly interpolates the input x according to the two endpoints and truncates values outside the range.
2) A fusion coefficient k_n of the current n-th frame is calculated from the ROI content of the current frame and of the previous frame and the gradient gain coefficient g_grad, where abs(·) is the absolute-value function and mean_3x3(·) is a 3×3 mean filter.
3) Based on the fusion coefficient k_n, the ROI enhancement and magnification results of the current frame and of the previous frame are fused, where Enhance(·) denotes the enhancement operation and SR(·) denotes the super-resolution magnification operation.
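A hedged sketch of steps 1) to 3) above. The precise formulas for g_grad and k_n are not spelled out in the text, so the mapping endpoints and the way the gradient gain modulates the pixel difference below are illustrative assumptions; only the use of Sobel magnitudes, the pixel-wise maximum, the 3×3 mean of the absolute difference, and a per-pixel blend follows the description.

```python
import numpy as np
import cv2

def f_map(x, x0, x1, y0, y1):
    """Linear mapping defined by two endpoints, truncated outside the range."""
    t = np.clip((x - x0) / (x1 - x0 + 1e-8), 0.0, 1.0)
    return y0 + t * (y1 - y0)

def sobel_mag(img):
    gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)
    return np.sqrt(gx * gx + gy * gy)

def temporal_smooth_roi(roi_n, roi_prev, x0=2.0, x1=10.0, y0=1.0, y1=4.0, k0=0.2, k1=1.0):
    """Blend the enhanced ROI of frame n with that of frame n-1 (grayscale float32 arrays).

    The endpoint values and the combination of the pixel difference with the
    gradient gain are assumptions made for this sketch.
    """
    grad = np.maximum(sobel_mag(roi_n), sobel_mag(roi_prev))   # pixel-wise max of gradient magnitudes
    g_grad = f_map(grad, x0, x1, y0, y1)                       # gradient gain coefficient
    diff = cv2.blur(np.abs(roi_n - roi_prev), (3, 3))          # 3x3 mean of |difference|
    k_n = np.clip(f_map(diff * g_grad, 0.0, 16.0, k0, k1), 0.0, 1.0)  # per-pixel fusion coefficient
    return k_n * roi_n + (1.0 - k_n) * roi_prev, k_n
```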
In an embodiment of the disclosure, the temporal smoothing and fusing module 750 is configured to fuse the temporal smoothing result of all the region of interest branch images of the current frame image with the full frame branch image according to a fusion ratio, so as to obtain the current frame enhanced image. Specifically, the temporal smoothing and fusion module 750 is configured to calculate the confidence of all region-of-interest branch images of the current frame image; the confidence is the product of the size coefficient and the fuzzy coefficient of the region of interest; acquiring the region of interest branch image of the current frame image, and calculating the position distance between the region of interest branch image and the same type of region of interest branch image which is closest to the last frame image; obtaining a branch fusion coefficient corresponding to the region of interest branch image of the current frame image according to the confidence coefficient, mask, smoothing coefficient of the region of interest branch image of the current frame image and the confidence coefficient of the region of interest branch image of the same type in the last frame image, which is nearest to the current frame image; and fusing the full-frame branch image with the branch fusion coefficient based on the time domain smoothing result of each interest region branch image of the current frame image to obtain the current frame enhanced image.
In one implementation of the present disclosure, the temporal smoothing and fusion module (fusion module for short) computes, according to the temporal information, the fusion coefficient used to fuse the ROI branch with the full-frame super-resolution branch SR(frame). The calculation rule of the fusion coefficient is divided into the following steps 1) to 3).
1) The confidence C_n of the ROI of the current n-th frame is calculated. The confidence of ROI areas of different types is calculated in different ways. When the type is a face, the ROI confidence is the product of a size coefficient and a blur coefficient. The size coefficient reflects whether the face size is suitable: when the face is too large or too small the coefficient approaches 0, and when the face size is suitable the coefficient is 1. The blur coefficient measures the degree of blur of the face: when the face is too blurry the coefficient approaches 0; the degree of blur can be measured by the variance of the ROI area after applying a Laplacian operator. When the type is a subtitle, the ROI confidence is likewise the product of a size coefficient and a blur coefficient; the size coefficient is calculated differently from that of a face and mainly considers the subtitle height (the coefficient is 1 when the height is suitable and 0 when it is too large or too small), while the blur coefficient is calculated similarly to that of a face, for example by the variance or the maximum gradient after a Laplacian operator. When the type is a license plate, the ROI confidence is the product of an aspect-ratio coefficient and a blur coefficient; the aspect-ratio coefficient is based on the standard license-plate aspect ratio and is 1 when the proportion is suitable; the blur coefficient is similar to the above and is not repeated here. A code sketch of the confidence and distance calculations follows step 3) below.
2) For each ROI of the current frame, select the same-type ROI frame of the previous frame that is closest to it and calculate the position distance between the two. The distance can be calculated in ways including, but not limited to, IOU (Intersection over Union) or the average distance between corresponding corner points. The distance is then mapped, according to a threshold, to a coefficient w_dist between 0 and 1; the farther the distance, the smaller the coefficient.
3) Calculate the fusion coefficient w_roi of the ROI enhancement branch according to the following formula, where w_seg is the output of the mask branch, C_{n-1} is the confidence of the closest ROI frame in the previous frame, and γ is a preset smoothing coefficient:
w_roi = w_seg * (γ * C_n + (1 - γ) * w_dist * C_{n-1})
After the fusion coefficients are calculated, the ROI branches and the full-frame branch are fused according to these coefficients to obtain the final output SR(Enhance(frame)). A minimal code sketch of steps 1) to 3) is given below.
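A minimal Python sketch of steps 1) to 3) for the face case follows (using OpenCV and NumPy). The size and blur thresholds, the hard size cut-off, the IOU-based mapping to w_dist, and the default value of γ are illustrative assumptions; the disclosure specifies only the qualitative behavior of these coefficients.

```python
import cv2
import numpy as np

def blur_coefficient(roi_gray, var_lo=20.0, var_hi=200.0):
    # Blur is measured by the variance of the Laplacian response over the ROI;
    # var_lo / var_hi are illustrative thresholds, not values from the disclosure.
    var = cv2.Laplacian(roi_gray, cv2.CV_64F).var()
    return float(np.clip((var - var_lo) / (var_hi - var_lo), 0.0, 1.0))

def face_size_coefficient(h, w, lo=48, hi=512):
    # Close to 0 when the face is far too small or too large, 1 when the size is
    # suitable; the bounds and the hard cut-off are illustrative assumptions.
    size = max(h, w)
    return 1.0 if lo <= size <= hi else 0.0

def face_confidence(roi_gray):
    # Step 1): ROI confidence C_n for a face = size coefficient * blur coefficient.
    h, w = roi_gray.shape[:2]
    return face_size_coefficient(h, w) * blur_coefficient(roi_gray)

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) in pixel coordinates.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def distance_weight(box_curr, box_prev, iou_floor=0.3):
    # Step 2): map the overlap with the closest same-type ROI of the previous frame
    # to w_dist in [0, 1]; the farther apart the boxes, the smaller the coefficient.
    # The linear mapping and the iou_floor threshold are illustrative assumptions.
    overlap = iou(box_curr, box_prev)
    return 0.0 if overlap <= iou_floor else (overlap - iou_floor) / (1.0 - iou_floor)

def roi_fusion_coefficient(w_seg, c_n, c_prev, w_dist, gamma=0.5):
    # Step 3): w_roi = w_seg * (gamma * C_n + (1 - gamma) * w_dist * C_{n-1}).
    # gamma is the preset smoothing coefficient; 0.5 is only an illustrative default.
    return w_seg * (gamma * c_n + (1.0 - gamma) * w_dist * c_prev)
```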
The ROI areas do not overlap one another, so only one ROI frame exists at any given position. Whatever its type, each ROI frame has its own fusion coefficient w_roi, which is used to fuse it with the full-frame SR result. The fusion formula is therefore the same for all ROI types; the types differ only in how the quantities entering w_roi, in particular the confidence, are calculated.
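The blending of a single ROI branch into the full-frame super-resolution result can then be sketched as follows. The array-layout assumptions (channel-last images, a slice describing the ROI position, and w_roi given either as a scalar or as a per-pixel map from the mask branch) are illustrative, not the disclosed implementation.

```python
import numpy as np

def fuse_roi_into_full_frame(sr_full, roi_enhanced, w_roi, roi_slice):
    # sr_full: full-frame SR image (H, W, C); roi_enhanced: the enhanced ROI branch,
    # already inverse-affine-warped back to frame coordinates and cropped to roi_slice.
    w = np.asarray(w_roi, dtype=np.float32)
    if w.ndim == 2:          # per-pixel mask: add a channel axis so it broadcasts
        w = w[..., None]
    out = sr_full.astype(np.float32).copy()
    out[roi_slice] = w * roi_enhanced + (1.0 - w) * out[roi_slice]
    return out
```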
Applying the image enhancement and super-resolution algorithm independently to each frame of a video can cause flicker artifacts because the temporal information is discontinuous. To alleviate this, the present disclosure proposes a temporal smoothing and fusion module. Temporal smoothing has two aspects. The first is smoothing the enhancement result between frames: because each frame is enhanced independently by a generative model, the added details and textures may be temporally unstable and need to be smoothed. The second is smoothing the enhancement intensity between frames: the ROI detection result of each frame may be unstable (ROIs may be missed or falsely detected, or the detection may jump from one target to another when there are too many targets), so the enhancement intensity may fluctuate abruptly. The temporal smoothing and fusion module 750 performs temporal smoothing on both the ROI enhancement result and the enhancement intensity, and fuses them with the full-frame super-resolution branch, finally producing a temporally stable enhancement effect.
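As one illustration of the first aspect, the sketch below smooths the ROI enhancement result between frames by blending the current ROI branch with the previous one, using a per-pixel coefficient derived from a gradient gain and the pixel difference, following the rule described for the temporal smoothing module. The specific gain curve and the constants base_alpha and diff_scale are illustrative assumptions, not values from the disclosure.

```python
import cv2
import numpy as np

def temporal_smooth_roi(curr_roi, prev_roi, base_alpha=0.5, diff_scale=0.1):
    # Gradient gain: strong edges in either frame reduce smoothing so genuine detail
    # is preserved; flat regions are smoothed more aggressively.
    g_curr = cv2.Laplacian(cv2.cvtColor(curr_roi, cv2.COLOR_BGR2GRAY), cv2.CV_32F)
    g_prev = cv2.Laplacian(cv2.cvtColor(prev_roi, cv2.COLOR_BGR2GRAY), cv2.CV_32F)
    grad_gain = np.clip(np.maximum(np.abs(g_curr), np.abs(g_prev)) / 64.0, 0.0, 1.0)

    # Large pixel differences (new content, scene change) also reduce smoothing.
    diff = np.abs(curr_roi.astype(np.float32) - prev_roi.astype(np.float32)).mean(axis=-1)
    alpha = base_alpha * (1.0 - grad_gain) * np.exp(-diff_scale * diff)

    alpha = alpha[..., None]
    return (1.0 - alpha) * curr_roi.astype(np.float32) + alpha * prev_roi.astype(np.float32)
```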
Fig. 9 is a diagram illustrating an effect comparison between a current frame image and a current frame enhanced image according to an embodiment of the present disclosure. Fig. 9 compares the original video with the output of the video processing method of the present disclosure, where the ROI areas are face and subtitle regions; it can be seen that the detail and sharpness of the face and subtitle regions are significantly improved. (Note: "original video" is a 960x540 input that has been bilinearly upscaled to 3840x2160 and then cropped; "Rockchip 4K AI-PQ" is the same 960x540 input processed by the real-time enhancement and super-resolution algorithm proposed in the present disclosure and then cropped.)
Fig. 10 is a schematic diagram illustrating a structure of a chip according to an embodiment of the present disclosure. As shown in Fig. 10, an embodiment of the present disclosure provides a chip 100. The chip 100 includes a CPU 101, an NPU 102, a GPU 103, and a cache 104. The CPU 101 is electrically coupled to the cache 104; the NPU 102 is electrically coupled to the CPU 101, the GPU 103, and the cache 104; and the GPU 103 is electrically coupled to the CPU 101.
Fig. 11A shows a timing diagram of the video processing method of the present disclosure over a single frame time on the chip. As shown in Fig. 11A, the workflow of the chip 100 at the n-th frame time is described as follows.
The cache 104 is configured to buffer the video stream.
The CPU 101 is configured to, when receiving the n-th frame image of a video stream, output the regions of interest of the (n-1)-th frame image, and perform region-of-interest detection on the n-th frame image to obtain at least one region of interest of the n-th frame image; n is a natural number greater than zero.
The NPU 102 is electrically coupled to the CPU 101, the GPU 103, and the cache 104, and is configured to, when the CPU receives the n-th frame image of the video stream, perform enhancement processing on each region of interest of the (n-1)-th frame image to obtain the region-of-interest branch images of the (n-1)-th frame image, and perform enhancement processing on the (n-1)-th frame image to obtain the full-frame branch image of the (n-1)-th frame image.
The GPU 103 is electrically coupled to the CPU 101 and is configured to, when the CPU receives the n-th frame image of the video stream, fuse all the region-of-interest branch images of the (n-1)-th frame image with the full-frame branch image of the (n-1)-th frame image, so as to obtain the enhanced image of the (n-1)-th frame image.
Fig. 11B shows a timing diagram of the video processing method of the present disclosure over two frame times on the chip. As shown in Fig. 11B, the workflow of the chip 100 at the (n+1)-th frame time is described as follows.
The CPU 101 is configured to, when receiving the (n+1)-th frame image of the video stream, output the regions of interest of the (n+1)-th frame image, and perform region-of-interest detection on the (n+1)-th frame image to obtain at least one region of interest of the (n+1)-th frame image.
The NPU 102 is configured to, when the CPU receives the (n+1)-th frame image of the video stream, perform enhancement processing on each region of interest of the n-th frame image to obtain the region-of-interest branch images of the n-th frame image, and perform enhancement processing on the n-th frame image to obtain the full-frame branch image of the n-th frame image.
The GPU 103 is configured to, when the CPU receives the (n+1)-th frame image of the video stream, fuse the obtained region-of-interest branch images of the n-th frame image with the full-frame branch image of the n-th frame image, so as to obtain the enhanced image of the n-th frame image.
In accordance with an embodiment of the present disclosure, the NPU 102 includes a number of NPU units; the number of NPU units may be configured to be the same as the number of regions of interest of the image.
The method designs a pipelined parallel execution flow for the chip, so that the chip of the embodiment of the disclosure can run the video processing method of the embodiment of the disclosure in real time on a mobile terminal (at a frame rate of 30 FPS).
In one embodiment of the present disclosure, the hardware architecture of a particular chip includes 1 CPU, 1 GPU, and 3 NPU cores (i.e., a 3-core NPU). As shown in Fig. 11A, to fully use the CPU/GPU/NPU hardware resources of the chip in parallel, the pipeline adopts an asynchronous execution scheme with one frame of delay in the algorithm: when the algorithm receives the data of the n-th frame of the video stream, it outputs the data of the (n-1)-th frame. The solid-line blocks in the timing diagram mark the times at which the n-th frame data is processed; the dashed-line blocks mark the times at which the (n-1)-th frame data is processed during the n-th frame time.
In the embodiment of the disclosure, the overall execution flow may be divided into 2 threads, namely a thread processing the n-th frame data and a thread processing the (n-1)-th frame data. The two threads execute in parallel and each mobilizes different hardware resources, so as to maximize hardware utilization.
The solid-line thread takes the n-th frame data of the video stream as input. First, the CPU is invoked to perform ROI detection; ROI detection is in practice the forward pass of a neural network, and it can be run on the CPU through a neural-network inference framework (including, but not limited to, NCNN/MNN and the like). Then, according to the ROI detection results, the GPU is invoked to apply an affine transformation to each ROI in turn, warping it to a predefined pose and size, and the aligned ROIs are stored in the cache. Finally, the aligned ROIs are sent to the NPU, the segmentation model is invoked to obtain the real processing area within each ROI, and the segmentation results are stored in the cache. Note that the NPU has 3 cores, so at most 3 ROIs can be processed.
The dashed-line thread first fetches the cached data of the (n-1)-th frame and invokes all NPU cores to perform the full-frame super-resolution operation. Next, the aligned ROIs of the (n-1)-th frame are fetched from the cache, and the NPU cores are invoked to enhance and super-resolve at most 3 ROI areas. At the same time, the ROIs of the (n-1)-th and (n-2)-th frames are used on the GPU to calculate the inter-frame fusion coefficients of the enhancement results according to the scheme described for the temporal smoothing module. Finally, the GPU applies the inverse affine transformation to each ROI, performs temporal smoothing and fusion, and outputs the enhancement result of the (n-1)-th frame. A structural sketch of this two-thread pipeline is given below.
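In the following sketch, the class name, queue layout, and the cpu_/gpu_/npu_ methods are hypothetical placeholders for the operations described above (ROI detection on the CPU through an inference framework, affine alignment and fusion on the GPU, segmentation, enhancement, and super-resolution on the 3-core NPU); it illustrates only the one-frame-latency, two-thread structure, not the disclosed implementation.

```python
import queue
import threading

class FramePipeline:
    """Asynchronous two-stage pipeline with one frame of latency: while the prepare
    stage works on frame n, the enhance stage finishes frame n-1."""

    def __init__(self):
        self.frames = queue.Queue(maxsize=2)   # decoded input frames: (n, image)
        self.cache = queue.Queue(maxsize=2)    # per-frame ROI cache handed to the enhance stage
        self.outputs = queue.Queue(maxsize=2)  # enhanced frames, emitted one frame behind the input
        self.prev_rois = None                  # ROIs of the frame before the one being enhanced

    def prepare_stage(self):
        # "Solid-line" thread: ROI detection, affine alignment, and segmentation for frame n.
        while True:
            n, frame = self.frames.get()
            rois = self.cpu_detect_rois(frame)[:3]                  # CPU inference; 3-core NPU handles at most 3 ROIs
            aligned = [self.gpu_affine_align(frame, r) for r in rois]
            masks = [self.npu_segment(a) for a in aligned]
            self.cache.put((n, frame, rois, aligned, masks))

    def enhance_stage(self):
        # "Dashed-line" thread: full-frame SR, ROI enhancement, and fusion for frame n-1.
        while True:
            n, frame, rois, aligned, masks = self.cache.get()
            sr_full = self.npu_full_frame_sr(frame)                 # all NPU cores
            enhanced = [self.npu_enhance_and_sr(a) for a in aligned]
            w = self.gpu_fusion_coefficients(rois, self.prev_rois)  # temporal fusion vs. previous frame
            out = self.gpu_inverse_affine_and_fuse(sr_full, enhanced, masks, w)
            self.prev_rois = rois
            self.outputs.put((n, out))

    def run(self):
        threading.Thread(target=self.prepare_stage, daemon=True).start()
        threading.Thread(target=self.enhance_stage, daemon=True).start()

    # Hypothetical placeholders for the hardware-specific operations described above.
    def cpu_detect_rois(self, frame): raise NotImplementedError
    def gpu_affine_align(self, frame, roi): raise NotImplementedError
    def npu_segment(self, aligned_roi): raise NotImplementedError
    def npu_full_frame_sr(self, frame): raise NotImplementedError
    def npu_enhance_and_sr(self, aligned_roi): raise NotImplementedError
    def gpu_fusion_coefficients(self, rois, prev_rois): raise NotImplementedError
    def gpu_inverse_affine_and_fuse(self, sr_full, enhanced, masks, w): raise NotImplementedError
```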
In one embodiment of the present disclosure, Fig. 11B shows the complete timing of processing the n-th frame; see the solid-line box in the figure. At frame time n, the ROI detection, ROI affine transformation, and segmentation of the n-th frame data are processed; at frame time n+1, the full-frame super-resolution, ROI enhancement and super-resolution, inverse affine transformation, and temporal-fusion-related computation of the n-th frame data are processed. Meanwhile, the (n-1)-th frame data is processed at frame time n and the (n+1)-th frame data at frame time n+1, in parallel and asynchronously (these flows are each only half drawn, since the figure mainly shows the timing path of the n-th frame, and they are not described further).
Most existing image-quality enhancement schemes cannot perform real-time processing on a mobile terminal; the video must be processed offline on a server before playback, which significantly increases network bandwidth and storage requirements and lacks real-time capability. The present disclosure proposes a pipelined, module-parallel timing flow that makes full use of the CPU/GPU/NPU hardware resources of a chip and can play in real time at a frame rate of 30 FPS on a mobile platform, exemplified by a specific chip. The method and device can enhance and super-resolve low-quality video in real time and ultimately present a good visual effect on a 4K large screen.
The present disclosure provides a video processing method for real-time enhancement and super-resolution of low-quality video on a mobile terminal based on key regions (ROI regions), capable of running in real time (30 FPS) on the mobile terminal (for example, using a specific chip). Key regions and non-key regions are processed differently, and the key regions are enhanced and upscaled by a generative adversarial model to achieve the enhancement of low-quality video.
The disclosure provides a video processing method in which a temporal smoothing and fusion scheme performs temporal smoothing of the ROI enhancement intensity according to historical detection results, performs temporal denoising of the ROI enhancement result, and finally performs weighted fusion. With this temporal smoothing and fusion scheme, an algorithm based on single-frame images can be extended to video without introducing temporal flicker artifacts, improving the visual impression.
The chip can implement the pipelined parallel timing flow of the modules of the video processing method, fully utilizes the CPU/GPU/NPU hardware resources of the chip, and enables real-time playback at a frame rate of 30 FPS on a mobile platform, taking a specific chip as an example.
The disclosed embodiments also provide a computer-readable storage medium having a computer program stored thereon. The computer program, when executed, implements the video processing method according to any embodiment of the present disclosure.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing a processor, where the program may be stored in a computer-readable storage medium, and the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disc, or any combination thereof. The storage medium may be any available medium accessible to a computer, or a data storage device such as a server or a data center that contains one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), a semiconductor medium (e.g., a solid state drive (SSD)), or the like.
Fig. 12 is another structural schematic diagram illustrating an electronic device 1200 provided according to an embodiment of the present disclosure. As shown in fig. 12, an electronic device 1200 in an embodiment of the disclosure includes a memory 1210 and a processor 1220.
The memory 1210 is used for storing a computer program. In some possible implementations, the memory may include computer system readable media in the form of volatile memory, such as RAM and/or cache memory. The electronic device may further include other removable/non-removable, volatile/nonvolatile computer system storage media. The memory may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the various embodiments of the disclosure.
The processor 1220 is electrically coupled to the memory 1210 and is configured to execute the computer program stored in the memory 1210, so that the electronic device 1200 performs the video processing method provided in any embodiment of the present disclosure. In some possible implementations, the processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like. In other implementations, the processor may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In some possible implementations, the electronic device 1200 provided by embodiments of the present disclosure may also include a display 1230. The display is electrically coupled to the memory and the processor and is configured to display a related graphical user interface (GUI) of the video processing method provided by any embodiment of the present disclosure.
In the embodiments of the present disclosure, the display may include a display panel. In some implementations, the display panel may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. In addition, the display may also be a touch panel (touch screen), which may include a display screen and a touch-sensitive surface. When the touch-sensitive surface detects a touch operation on or near it, the operation is passed to the processor to determine the type of touch event, and the processor then provides a corresponding visual output on the display device based on the type of touch event.
The embodiment of the disclosure also provides electronic equipment. The electronic equipment comprises the chip disclosed by the embodiment of the disclosure.
In an embodiment of the disclosure, the electronic device is a mobile terminal.
The electronic device provided in the embodiment of the present disclosure may implement the video processing method described in the present disclosure, but the implementation apparatus of the video processing method described in the present disclosure includes, but is not limited to, the structure of the electronic device listed in the embodiment of the present disclosure, and all structural modifications and substitutions made according to the principles of the present disclosure in the prior art are included in the protection scope of the present disclosure.
The chip provided in the embodiment of the present disclosure may implement the video processing method described in the present disclosure, but the implementation device of the video processing method described in the present disclosure includes, but is not limited to, the structure of the chip listed in the embodiment of the present disclosure, and all structural modifications and substitutions made according to the principles of the present disclosure in the prior art are included in the protection scope of the present disclosure.
In the several embodiments provided in the embodiments of the present disclosure, it should be understood that the disclosed system, apparatus, or method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules/units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or units may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules or units, which may be in electrical, mechanical or other forms.
The modules/units illustrated as separate components may or may not be physically separate, and components shown as modules/units may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules/units may be selected according to actual needs to achieve the objectives of the embodiments of the present disclosure. For example, functional modules/units in various embodiments of the present disclosure may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units may be integrated into one module/unit.
Those of ordinary skill would further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present disclosure.
Embodiments of the present disclosure may also provide a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the processes or functions described in accordance with the embodiments of the present disclosure are produced in whole or in part. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
The computer program product is executed by a computer, which performs the method according to the preceding method embodiment. The computer program product may be a software installation package, which may be downloaded and executed on a computer in case the aforementioned method is required.
Each description of a process or structure corresponding to the drawings has its own emphasis; for parts of a process or structure that are not described in detail, reference may be made to the descriptions of other processes or structures.
The above embodiments are merely illustrative of the principles of the embodiments of the present disclosure and their efficacy, and are not intended to limit the embodiments of the disclosure. Modifications and variations may be made to the above-described embodiments by those of ordinary skill in the art without departing from the spirit and scope of the disclosed embodiments. Accordingly, all equivalent modifications and variations that a person having ordinary skill in the art accomplishes without departing from the spirit and technical ideas of the disclosed embodiments shall be covered by the claims of the disclosed embodiments.

Claims (20)

1. A video processing method, comprising:
Obtaining a current frame image of a video;
detecting the interest area of the current frame image to obtain at least one interest area;
performing enhancement processing on the region of interest by using a first neural network to obtain a branch image of the region of interest;
Performing enhancement processing on the current frame image by using a second neural network to obtain a full-frame branch image; and
fusing the region-of-interest branch image with the full-frame branch image to obtain a current frame enhanced image.
2. The video processing method according to claim 1, wherein the enhancement processing of the region of interest using the first neural network comprises:
enhancement processing is performed on the region of interest to add detail textures using a generative model that is trained separately for the target region of interest.
3. The video processing method according to claim 1, wherein performing enhancement processing on the current frame image using a second neural network includes:
carrying out enhancement processing on the information of the current frame image by using a universal model, wherein the universal model is used for enhancing any image.
4. The video processing method according to claim 2, wherein fusing the region-of-interest branch image with the full-frame branch image comprises:
smoothing the enhancement result of the detail texture of the region-of-interest branch image and the enhancement intensity of the region-of-interest branch image; and
fusing the region-of-interest branch image after the smoothing processing with the full-frame branch image.
5. The video processing method according to claim 1, further comprising:
A plurality of frame enhanced images are output at a predetermined frame rate for playing an enhanced video corresponding to the video in real time based on the plurality of frame enhanced images.
6. The video processing method according to claim 1, wherein performing region-of-interest detection on the current frame image to obtain at least one region-of-interest comprises:
extracting features of the current frame image by using a deep neural network to obtain image features; and
inputting the image features into at least one type of region-of-interest prediction model to obtain, as corresponding output, at least one type of region of interest.
7. The video processing method according to claim 1, wherein the first neural network includes a generative adversarial network and a super-resolution network, and wherein performing enhancement processing on the region of interest using the first neural network to obtain the region-of-interest branch image includes:
determining a corresponding affine transformation method according to the type of the region of interest, and performing affine transformation on the region of interest to obtain a region-of-interest transformation branch;
extracting a first picture feature of the region-of-interest transformation branch using a generator of the generative adversarial network, and performing enhancement processing on the first picture feature to obtain a second picture feature;
calculating a mask of the first picture feature using a mask branch;
fusing, based on the mask of the first picture feature, the second picture feature with the region-of-interest transformation branch to obtain an enhancement result;
upscaling the enhancement result using the super-resolution network to obtain an enhanced and upscaled result; and
performing a corresponding inverse affine transformation on the enhanced and upscaled result to obtain the region-of-interest branch image.
8. The video processing method according to claim 1, wherein the second neural network is a deep neural network, and wherein performing enhancement processing on the current frame image using the second neural network to obtain the full-frame branch image includes:
inputting the current frame image at a first resolution into the deep neural network to output the current frame image at a second resolution, wherein the second resolution is higher than the first resolution.
9. The video processing method according to claim 1, further comprising:
scaling the current frame image to a preset resolution and then performing the region-of-interest detection.
10. The video processing method according to claim 1, wherein fusing the region-of-interest branch image and the full-frame branch image to obtain a current frame enhanced image includes:
Performing time domain smoothing processing on the region of interest branch image of the current frame image and the region of interest branch image of the corresponding previous frame image to obtain a time domain smoothing result of the region of interest branch image of the current frame image; and
fusing the time domain smoothing results of all the region-of-interest branch images of the current frame image with the full-frame branch image according to the fusion ratio to obtain the current frame enhanced image.
11. The video processing method according to claim 10, wherein performing temporal smoothing processing on the region-of-interest branch image of the current frame image and the region-of-interest branch image of the corresponding previous frame image, obtaining a temporal smoothing result of the region-of-interest branch image of the current frame image includes:
calculating gradients of the region-of-interest branch image of the current frame image and the corresponding region-of-interest branch image of the previous frame image to obtain gradient gain coefficients;
calculating to obtain a smoothing coefficient of the current frame image according to the pixel difference value of the region-of-interest branch image of the current frame image and the region-of-interest branch image of the corresponding previous frame image and the gradient gain coefficient; and
fusing, according to the smoothing coefficient, the region-of-interest branch image of the current frame image with the corresponding region-of-interest branch image of the previous frame image to obtain the temporal smoothing result of the region-of-interest branch image of the current frame image.
12. The video processing method according to claim 11, wherein fusing the temporal smoothing results of all the region-of-interest branch images of the current frame image with the full-frame branch image according to the fusion ratio to obtain the current frame enhanced image includes:
calculating the confidence of each region-of-interest branch image of the current frame image, the confidence being the product of a size coefficient and a blur coefficient of the region of interest;
for each region-of-interest branch image of the current frame image, calculating the position distance to the closest region-of-interest branch image of the same type in the previous frame image;
obtaining a branch fusion coefficient corresponding to the region-of-interest branch image of the current frame image according to the confidence, mask, and smoothing coefficient of the region-of-interest branch image of the current frame image and the confidence of the closest same-type region-of-interest branch image in the previous frame image; and
fusing the full-frame branch image with the temporal smoothing result of each region-of-interest branch image of the current frame image according to the branch fusion coefficient to obtain the current frame enhanced image.
13. A video processing apparatus, comprising:
The input module is configured to receive a video stream and obtain a current frame image of the video;
the ROI detection module is electrically coupled with the input module and is configured to detect the region of interest of the current frame image to obtain at least one region of interest;
The ROI enhancement and amplification module is electrically coupled with the ROI detection module and is configured to enhance the region of interest by using a first neural network to obtain a region of interest branch image;
the full-frame restoration and amplification module is electrically coupled with the input module and is configured to perform enhancement processing on the current frame image by using a second neural network to obtain a full-frame branch image;
The time domain smoothing and fusing module is electrically coupled with the ROI enhancement and amplification module and the full-frame restoration and amplification module respectively and is configured to fuse the region-of-interest branch image with the full-frame branch image to obtain a current frame enhanced image; and
An output module, electrically coupled to the temporal smoothing and fusion module, is configured to output an enhanced video stream composed of a sequence of current frame enhanced images.
14. A chip, comprising:
The central processing unit (CPU) is configured to, when receiving the n-th frame image of a video stream, output the regions of interest of the (n-1)-th frame image, and perform region-of-interest detection on the n-th frame image to obtain at least one region of interest of the n-th frame image, wherein n is a natural number greater than zero;
The neural network processing unit (NPU) is configured to, when the CPU receives the n-th frame image of the video stream, perform enhancement processing on each region of interest of the (n-1)-th frame image to obtain the region-of-interest branch images of the (n-1)-th frame image, and perform enhancement processing on the (n-1)-th frame image to obtain a full-frame branch image of the (n-1)-th frame image; and
The graphics processing unit (GPU) is configured to, when the CPU receives the n-th frame image of the video stream, fuse the obtained region-of-interest branch images of the (n-1)-th frame image with the full-frame branch image of the (n-1)-th frame image, so as to obtain an enhanced image of the (n-1)-th frame image.
15. The chip according to claim 14, wherein:
The CPU is further configured to output each interest region of the n+1th frame image when receiving the n+1th frame image of the video stream, and perform interest region detection on the n+1th frame image to obtain at least one interest region of the n+1th frame image;
The NPU is further configured to perform enhancement processing on each region of interest of the nth frame image to obtain each region of interest branch image of the nth frame image and perform enhancement processing on the nth frame image to obtain a full frame branch image of the nth frame image when the CPU receives the (n+1) th frame image of the video stream; and
The GPU is further configured to fuse each region of interest branch image of the obtained nth frame image with a full frame branch image of the nth frame image when the CPU receives the (n+1) th frame image of the video stream, and obtain an enhanced image of the nth frame image.
16. The chip of claim 14, wherein the NPU comprises a plurality of NPU units; the number of NPU units is configured to be the same as the number of regions of interest of the image.
17. The chip of claim 14, wherein the GPU is further configured to output a plurality of frame enhanced images at a predetermined frame rate for playing an enhanced video stream corresponding to the video stream in real-time based on the plurality of frame enhanced images.
18. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed, implements the video processing method according to any one of claims 1 to 12.
19. An electronic device, comprising:
a memory configured to store a processor executable program; and
A processor configured to call the program to perform the video processing method according to any one of claims 1 to 12.
20. An electronic device comprising a chip according to any of claims 14 to 17 and playing an enhanced video stream comprising enhanced images of a plurality of frame images in real time based on the received video stream.