WO2023202447A1 - Training method for image quality improvement model and method for improving image quality of video conferencing system - Google Patents

Training method for image quality improvement model and method for improving image quality of video conferencing system

Info

Publication number
WO2023202447A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
image quality
quality improvement
noise
model
Prior art date
Application number
PCT/CN2023/087910
Other languages
English (en)
French (fr)
Inventor
徐茜
于维纳
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2023202447A1

Links

Classifications

    • G06T 5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 5/70 - Denoising; Smoothing
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N 3/08 - Learning methods
    • G06T 3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053 - Scaling based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • H04N 7/15 - Conference systems
    • G06T 2207/10016 - Video; Image sequence
    • G06T 2207/20081 - Training; Learning
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G06T 2207/20182 - Noise reduction or smoothing in the temporal domain; Spatio-temporal filtering
    • Y02T 10/40 - Engine management systems

Definitions

  • Embodiments of the present application relate to the field of computer vision technology, and in particular, to a training method for an image quality improvement model and a method for improving the image quality of a video conferencing system.
  • The purpose of the present invention is to solve the above problems by providing a training method for an image quality improvement model and a method for improving the image quality of a video conferencing system, which solve the problems of slow image quality improvement and insignificant improvement effect in video conferencing systems.
  • The image quality improvement model is used to improve the quality of video images in a video conferencing system. The method includes: obtaining at least one noise model, where the noise model is trained on a first noisy image and is used to simulate the noise characteristics of the video conferencing system; obtaining a set of noise-free images and inputting each noise-free image in the set into the noise model to obtain a second noisy image, the noise-free image and the second noisy image forming a training data pair; and using the training data pairs to train an initial image quality improvement model and, after training is complete, applying convolution folding to the trained model to obtain the final image quality improvement model.
  • Embodiments of the present application provide a method for improving the image quality of a video conferencing system, which includes: obtaining video images transmitted by the video conferencing system; and using an image quality improvement model to improve the quality of the video images, obtaining quality-improved video images, where the image quality improvement model is obtained by the above training method of the image quality improvement model.
  • Embodiments of the present application provide a training device for an image quality improvement model, where the model is used to improve the quality of video images of a video conferencing system, including: an acquisition module that obtains at least one noise model, where the noise model is trained on a first noisy image and is used to simulate the noise characteristics of the video conferencing system; a noise-adding module that obtains a set of noise-free images and inputs each noise-free image in the set into the noise model to obtain a second noisy image, the noise-free image and the second noisy image forming a training data pair; and a training module that uses the training data pairs to train the initial image quality improvement model and, after training is complete, applies convolution folding to the trained model to obtain the final image quality improvement model.
  • Embodiments of the present application provide a device for improving the image quality of a video conferencing system, including: an acquisition module that obtains video images transmitted by the video conferencing system; and an image quality improvement module that uses an image quality improvement model to improve the quality of the video images, obtaining quality-improved video images, where the image quality improvement model is obtained by the above training method of the image quality improvement model.
  • An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above training method of the image quality improvement model or the above method for improving the image quality of the video conferencing system.
  • Embodiments of the present application also provide a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the above training method of the image quality improvement model, or the above method for improving the image quality of the video conferencing system, is implemented.
  • Figure 1 is a flow chart of a training method for an image quality improvement model provided by an embodiment of the present application
  • Figure 2 is a flow chart of a method for improving the image quality of a video conferencing system provided by an embodiment of the present application
  • Figure 3 is a schematic structural diagram of a training device for an image quality improvement model provided by an embodiment of the present application
  • Figure 4 is a schematic structural diagram of a device for improving the image quality of a video conferencing system provided by an embodiment of the present application
  • Figure 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • The image quality improvement model is used to improve the quality of video images of the video conferencing system. The method includes: obtaining at least one noise model, where the noise model is trained on a first noisy image and is used to simulate the noise characteristics of the video conferencing system; obtaining a set of noise-free images and inputting each noise-free image in the set into the noise model to obtain a second noisy image, the noise-free image and the second noisy image forming a training data pair; and using the training data pairs to train the initial image quality improvement model and, after training is complete, applying convolution folding to the trained model to obtain the final image quality improvement model. This improves video image quality while ensuring the real-time performance of the video conferencing system, solving the problems of slow image quality improvement and insignificant improvement effect.
  • In step 101, at least one noise model is obtained, where the noise model is trained on a first noisy image and is used to simulate the noise characteristics of the video conferencing system.
  • Before the noise models are obtained, a first noisy image set is acquired from the receiving end of the video conferencing system, where the pre-encoding resolution of each first noisy image in the set is below a first threshold and the encoding bit rate of each first noisy image is below a second threshold.
  • The resolution of the video image is downsampled by a fixed factor before encoding, for example two or four times, that is, the resolution of the video image is reduced by a factor of two or four, and the bit rate of the encoder of the video conferencing system is reduced; the video images encoded at the low bit rate, transmitted, and decoded by the video conferencing system are collected at the receiving end, yielding the first noisy image set.
  • The video conferencing system encodes the reduced-resolution video images at the reduced bit rate using coding methods such as H.264 or H.265 to obtain bitstream data; the bitstream data is transmitted to the receiving end of the video conferencing system over the network and decoded there.
  • The first noisy image is used to train a noise model.
  • The noise model is composed of multiple convolutional residual blocks and a downsampling module; the downsampling method and factor in the noise model are the same as those in the video conferencing system, so that the downsampling the video image undergoes through the video conferencing system can be simulated.
  • When the video conferencing system encodes, transmits, and decodes video images, the images acquire blur, mosaic, ringing, and other noise; this noise can be regarded as filters with different weight parameters applied to the image, and such filters can be simulated by convolutional layers. A noise model composed of convolutional layers and a downsampling module can therefore learn the various complex changes a video image undergoes in the video conferencing system, thereby simulating the system's noise characteristics.
  • A set of noise models is obtained from the first noisy images, with each first noisy image in the first noisy image set corresponding to one noise model.
  • Because the bit rate, frame rate, and resolution in a video conferencing system change dynamically, the first noisy images collected at the receiving end have different noise characteristics; a noise model is therefore trained for each first noisy image in the collected set, yielding a set of noise models.
  • The first noisy image is input into the initial noise model to obtain a third noisy image; when the first noisy image is input, whether to superimpose sinc-filter noise is selected at random. The first noisy image and the third noisy image are input into a conventional convolutional neural network model to obtain a first set of feature maps of the first noisy image and a second set of feature maps of the third noisy image of identical size, and the initial noise model is trained from the first and second sets of feature maps.
  • The kernel-size range of the sinc filter is configured according to the actual situation of the video conferencing system, and the size of the randomly superimposed sinc-filter noise is randomly selected within that kernel-size range.
  • The original image before encoding at the sending end and the decoded image at the receiving end are not necessarily scaled proportionally, but their pixel distributions are largely consistent.
  • Since the noise model learns the various complex changes the video image undergoes in the video conferencing system, the first noisy image input to the noise model and the third noisy image it outputs are likewise not necessarily proportionally scaled, while their pixel distributions remain largely consistent. A convolutional neural network model is therefore used to convert the first and third noisy images into fixed-size feature maps; the feature map of the first noisy image is taken as the ground-truth value of the noise model and the feature map of the third noisy image as its predicted value; the loss function of the noise model is constructed from these predicted and ground-truth values, and the noise model is trained iteratively on this loss.
  • When training the noise model, a first noisy image is divided into multiple image blocks of equal size, and one first noisy image block is input into the noise model to obtain a third noisy image block. An image classification model (i.e., a conventional convolutional neural network), such as ResNet (Residual Network) or the efficient neural network (EfficientNet) proposed by Google, is then selected to extract the features of the first and third noisy image blocks, with the global average pooling layer, the fully connected layer, and other layers containing no spatial features removed so that only the feature-extraction part is retained.
  • After feature extraction from the first and third noisy image blocks, a feature-size normalization method such as ROI Pooling (region-of-interest pooling) or ROI Align (region-of-interest alignment) is added to convert the extracted feature map of the first noisy image block and the feature map of the third noisy image block into two fixed-size sets of feature maps. Based on the distance between the two sets of feature maps, iterative training continues with other first noisy image blocks until the noise model training is complete.
  • In step 102, a noise-free image set is obtained, and the noise-free images in the set are input into the noise model to obtain second noisy images.
  • Each noise-free image and its second noisy image form a training data pair.
  • For each noise-free image in the acquired set, the image is input into a noise model randomly selected from the set of noise models to obtain a second noisy image.
  • Noise-free images from public datasets are obtained, for example by downloading and collecting public video image datasets such as BVI-DVC, LDV, DIV2K, and Flickr2K, to build the noise-free image set.
  • Each noise-free image in the noise-free image set is traversed: a noise model is randomly selected from the set of trained noise models, the noise-free image is input into the noise model, and a second noisy image with video conferencing noise characteristics is obtained; the input noise-free image and the obtained second noisy image form a training data pair, used as training data for the image quality improvement model.
  • Because the video image data in a video conferencing system generally must be forwarded through a multipoint control unit, it is difficult to achieve a one-to-one correspondence between the video images at the sending end and the receiving end, and because the system generally adjusts the video image resolution dynamically according to network conditions and receiving-end configuration, it cannot be guaranteed that the magnification between high-resolution and low-resolution images in the collected data is consistent. In addition, most high-resolution images captured by the sending-end video acquisition device have already been pre-processed by the sending-end equipment and are not unprocessed original images; noise information has already been introduced into them. It is therefore difficult to obtain high-definition, noise-free images directly from a video conferencing system.
  • The embodiment of the present application therefore proposes a method for generating the data required for image quality improvement network learning: there is no need to obtain high-definition, noise-free images from the sending end of the video conferencing system; instead, the respective image change processes caused by low-bit-rate, low-resolution encoding, transmission, and decoding in the video conferencing system are simulated directly to obtain a noise model with the noise characteristics of the video conferencing system.
  • The noise model is used to add noise to the high-resolution, noise-free images of public datasets to obtain second noisy images, which are then combined with the high-resolution, noise-free images to form training data pairs for image quality improvement model learning.
  • In step 103, the training data pairs are used to train the initial image quality improvement model, and after training is complete, convolution folding is applied to the trained model to obtain the final image quality improvement model.
  • Before the training data pairs are used to train the initial image quality improvement model, they are converted into binary data format or lmdb (Lightning Memory-Mapped Database) format; patches of a specified size are randomly extracted from the second noisy image of each converted training data pair, and corresponding patches are extracted from the noise-free image of the converted pair; training the initial image quality improvement model with the training data pairs then means training the model with the patches extracted from the pairs.
  • A group of training data pairs is randomly selected from the training set in binary or lmdb database format, and patches of a specified size, for example 64x64, are randomly cropped from the second noisy image; the patch size can be configured flexibly according to the hardware resources used for training, and if GPU memory or system memory is ample, sizes such as 96x96, 128x128, or 192x192 can be used. With scaling ratio s, after the 64x64 patch is cropped from the second noisy image, a 64s x 64s patch is cropped at the corresponding position of the noise-free image. Each training iteration samples multiple groups of data for model training; the exact number of sampled groups is determined by the actual training situation.
  • The base training loss of the image quality improvement model is the L1 loss.
  • The Canny edge-detection maps of the video image output by the image quality improvement model and of the label video image are computed separately, and the L1 loss between the edge maps is then computed. The weighted sum of the two losses is used as the final loss, with an edge-loss weight of 0.5 and a base L1-loss weight of 1.0.
  • The image quality improvement model consists of multiple convolutional residual blocks and an upsampling module. With the above training configuration, the model is trained for multiple iterations so that its loss function decreases gradually; after convergence, a trained image quality improvement model with fixed weight parameters is obtained.
  • Applying convolution folding to the trained image quality improvement model includes: traversing all convolutional residual blocks in the model and, for each convolutional residual block, folding it into a single convolution; and copying the weight of the single convolution into the image quality improvement model obtained after convolution folding.
  • Based on the properties of convolutional layers, convolution folding is applied to the trained model: the convolutional residual blocks in the model are traversed, convolution folding is performed on each foldable block, all foldable blocks are folded into single convolutions, and the folded convolution weights are copied into a new image quality improvement model composed of multiple convolutions and upsampling modules, which is the image quality improvement model ultimately used for image quality improvement.
  • A convolutional residual block consists of multiple sequentially connected convolutional layers and multiple residual connections.
  • The first convolutional layer has a small number of input channels and a large number of output channels; the last convolutional layer has a large number of input channels and a small number of output channels; and the input and output channels of the middle convolutional layers are both large.
  • One residual connection links the input of the convolutional residual block to the output of the residual connection, and the other residual connections link the input of an intermediate convolutional layer to the output of the residual connection.
  • Taking a block of three convolutional layers and two residual connections as an example, the layer configurations can be set to 8x256x1x1, 256x256x3x3, and 256x8x1x1, where the first number denotes the number of convolution input channels, the second the number of convolution output channels, and the third the convolution kernel size; no convolutional residual block performs upsampling or downsampling, and all convolution strides are 1.
  • The first residual connection links the input of the convolutional residual block to its output, and the second residual connection links the input of the 3x3 convolution to its output. After the first and second convolutional layers are fused and the result is then fused with the third convolutional layer, the convolutional residual block can be folded into a single 8x8x3x3 convolution.
  • A residual connection can be regarded as a convolutional layer whose weight parameters form an identity matrix; by the additivity of convolution, it can be added to the weight and bias of the corresponding convolutional layer, merging into a single convolutional layer.
  • Taking three convolutions as an example, let each convolution compute y = w*x + b, where w is the convolution weight, b the bias, and * the convolution operation; the three convolutions then compute y3 = w3*(w2*(w1*x + b1) + b2) + b3, which expands to y3 = (w3*w2*w1)*x + w3*w2*b1 + w3*b2 + b3, so the folded convolution weight is w3*w2*w1 and the bias is w3*w2*b1 + w3*b2 + b3.
  • The weight w3*w2*w1 is computed as follows: use an identity matrix as input, convolve successively with the w1, w2, and w3 weight matrices, then flip the result and rearrange the data order so that it conforms to the convolution-kernel parameter format. The bias term w3*w2*b1 is obtained by expanding b1 into a k x k convolution kernel and then performing matrix operations with w2 and w3 in turn, and the bias term w3*b2 is obtained by expanding b2 into a k x k convolution kernel and then performing a matrix operation with w3. After the series-connected convolutions are fused into a single convolution, the outermost residual connection is fused with this convolution into a single convolution, completing the convolution folding of the convolutional residual block.
  • Convolution folding reduces the number of convolutional layers of the image-quality-improvement convolutional neural network, reduces the number of feature channels, and removes the residual connections, thereby reducing the parameter count and the number of memory accesses of the network; without changing the accuracy of image quality improvement, it further increases the inference speed of the image quality improvement method and reduces resource consumption.
  • The image quality improvement model is subjected to int8 quantization, and the int8-quantized model is used as the final image quality improvement model.
  • Transplanting the pre- and post-processing operations of model inference, such as mean addition and subtraction, dimension transformation, and color-space transformation, into the structure of the image quality improvement model can effectively increase the speed of image quality improvement for video images.
  • The image format of the input and output images of the image quality improvement model is RGB (Red, Green, Blue); adopting the RGB format as the image format of the image quality improvement model yields better inference results.
  • A noise model for simulating the noise characteristics of the video conferencing system is trained from the first noisy image, and the noise model is used to add noise to noise-free images to obtain second noisy images with the noise characteristics of the video conferencing system; the image quality improvement model is trained on the noise-free images and the second noisy images.
  • Convolution folding is then applied to the trained image quality improvement model, which improves the image-quality-improvement effect while also ensuring the real-time performance of the system.
  • The image quality improvement model trained with the training method provided by the embodiments of the present application can realize the functions of super-resolution reconstruction and enhancement-denoising simultaneously, reducing image quality improvement processing time, improving the quality of video images, and solving the problems existing in conventional technical means.
  • The training method proposed in the embodiments of this application exploits the noise characteristics caused by low resolution and low bit rate and the general consistency of the pixel distributions of the original image and the terminal-decoded image: a convolutional neural network is used to simulate the various complex image change processes caused by low-bit-rate, low-resolution encoding, transmission, and decoding in the video conferencing system to obtain the noise model; the noise model is then used to add noise to noise-free images to obtain second noisy images conforming to the noise characteristics of the video conferencing system; finally, these are combined with the corresponding noise-free images to form the training data for the image quality improvement model.
  • The convolutional neural network features and weight parameters learned under the two tasks of super-resolution reconstruction and enhancement-denoising show clear similarity.
  • Based on this property, the embodiment of the present application proposes an image-quality-improvement convolutional neural network (the image quality improvement model) that performs super-resolution reconstruction and enhancement-denoising simultaneously.
  • The embodiments of this application construct mixed super-resolution-reconstruction and enhancement-denoising data for image quality improvement network learning, and use only a single single-branch convolutional neural network to achieve both super-resolution reconstruction and enhancement-denoising.
  • Convolution folding of the convolutional residual blocks reduces the number of convolutional layers and feature channels of the image-quality-improvement convolutional neural network and removes the residual connections, thereby reducing the parameter count and the number of memory accesses of the network; without changing the accuracy of image quality improvement, it further increases the inference speed of the image quality improvement method and reduces resource consumption.
  • Embodiments of the present application also relate to a method for improving the image quality of a video conferencing system, which includes: obtaining video images transmitted by the video conferencing system; and using an image quality improvement model to improve the quality of the video images, obtaining quality-improved video images, where the image quality improvement model is obtained by the above training method of the image quality improvement model.
  • In step 201, video images transmitted by the video conferencing system are obtained.
  • Before encoding, the resolution of the video image is reduced and the bit rate of the encoder of the video conferencing system is lowered; the video images encoded at the low bit rate, transmitted, and decoded by the video conferencing system are then collected at the receiving end.
  • In step 202, an image quality improvement model is used to improve the quality of the video images of the video conferencing system, obtaining quality-improved video images, where the image quality improvement model is obtained by the above training method of the image quality improvement model.
  • For each frame of the collected video images, the image quality improvement model is used to raise the resolution of the video image and to enhance and denoise it; the quality-improved video image is obtained and sent to the display device for display.
  • Before the image quality improvement model is used to improve the quality of the video images of the video conferencing system, the type of the image quality improvement model is converted into the type required by the video conferencing system terminal.
  • The image quality improvement model is converted into the engine type required for video conferencing terminal deployment, such as MNN (Mobile Neural Network), TNN (Tencent Neural Network), TFLITE (TensorFlow Lite), or ONNX (Open Neural Network Exchange), and model quantization is performed.
  • Before the image quality improvement model is used to improve the quality of the video images of the video conferencing system, the method further includes: splitting the video image into N video image blocks of equal size, with adjacent video image blocks overlapping by M pixels; inputting the N equal-size blocks into the image quality improvement model at the same moment; and, after obtaining N high-definition video image blocks, fusing them into the quality-improved video image, where M and N are both integers greater than 1.
  • Before the video image is input into the image quality improvement model, it is split top-down into four video image blocks of equal size overlapping by 2 pixels; four threads are then started to enhance the blocks simultaneously, yielding four high-definition blocks, which are fused according to the magnification factor to finally obtain the quality-improved video image sent to the display device for display. Splitting the video image and improving its quality in parallel improves the actual display effect while also reducing inference time.
  • After the quality-improved video image is obtained, the image format of the high-resolution video image is converted into the image format of the video conferencing system; the data format of the high-resolution video image is converted into the data format of the video conferencing system; and the data type of the high-resolution video image is converted into the data type of the video conferencing system.
  • The video image format of the video conferencing system is YUV, where Y denotes luminance and U and V denote chrominance; the meanings and importance of the three components are not the same, yet the different channels of the input image of a convolutional neural network are processed with equal weight. When YUV is used as the input and output image format of the image quality improvement network, the improvement effect is therefore significantly lower than when RGB is used. The embodiment of the present application accordingly uses the RGB format as the image format of the input and output images, NCHW as the data format, and float as the data type. Since the image format of the video conferencing system is YUV, with data type unsigned char and data format NHWC, the format of the video image must be converted into a format suitable for the video conferencing system; methods based on GPU (Graphics Processing Unit) operators are provided for color-space conversion between YUV and RGB, data-format conversion between NHWC and NCHW, and data-type conversion between unsigned char and float, thereby reducing CPU (Central Processing Unit) computation and lowering the CPU consumption of video conferencing terminal deployment.
  • The method for improving the image quality of a video conferencing system uses only one image quality improvement model to accomplish both the super-resolution reconstruction and the enhancement-denoising task, enhancing image quality while also reducing the processing time of image quality enhancement.
  • The type of the model and the image format, data format, and data type of its output image are adapted, making the method for improving the image quality of the video conferencing system highly versatile, with a clear quality improvement and high speed, while also meeting deployment needs in real scenarios.
  • The embodiment of the present application also relates to a training device for an image quality improvement model, as shown in Figure 3, including an acquisition module 301, a noise-adding module 302, and a training module 303.
  • The acquisition module 301 is used to obtain at least one noise model, where the noise model is trained on a first noisy image and is used to simulate the noise characteristics of the video conferencing system; the noise-adding module 302 is used to obtain a noise-free image set and input the noise-free images in the set into the noise model to obtain second noisy images, each noise-free image and its second noisy image forming a training data pair; the training module 303 is used to train the initial image quality improvement model with the training data pairs and, after training is complete, apply convolution folding to the trained model to obtain the final image quality improvement model.
  • The acquisition module 301 uses the first noisy image to train a noise model.
  • The noise model is composed of multiple convolutional residual blocks and a downsampling module; the downsampling method and factor in the noise model are the same as those in the video conferencing system, so that the downsampling the video image undergoes through the video conferencing system can be simulated.
  • When the video conferencing system encodes, transmits, and decodes video images, the images acquire blur, mosaic, ringing, and other noise; this noise can be regarded as filters with different weight parameters applied to the image, and such filters can be simulated by convolutional layers. A noise model composed of convolutional layers and a downsampling module can therefore learn the various complex changes a video image undergoes in the video conferencing system, thereby simulating the system's noise characteristics.
  • The noise-adding module 302 obtains noise-free images from public datasets, for example by downloading and collecting public video image datasets such as BVI-DVC, LDV, DIV2K, and Flickr2K, to construct the noise-free image set. It traverses each noise-free image in the set, randomly selects a noise model from the set of trained noise models, inputs the noise-free image into the noise model, and obtains a second noisy image with video conferencing noise characteristics; the input noise-free image and the obtained second noisy image form a training data pair, used as training data for the image quality improvement model.
  • The training module 303 randomly selects a group of training data pairs from the training set in binary or lmdb database format and randomly crops patches of a specified size, for example 64x64, from the second noisy image; the patch size can be configured flexibly according to the hardware resources used for training, and if GPU memory or system memory is ample, sizes such as 96x96, 128x128, or 192x192 can be used. With scaling ratio s, after the 64x64 patch is cropped from the second noisy image, a 64s x 64s patch is cropped at the corresponding position of the noise-free image. Each training iteration samples multiple groups of data for model training; the exact number of sampled groups is determined by the actual training situation.
  • The training module 303 applies convolution folding to the trained image quality improvement model based on the properties of convolutional layers: it traverses the convolutional residual blocks in the model, performs convolution folding on each foldable block, folds all foldable blocks into single convolutions, and copies the folded convolution weights into a new image quality improvement model composed of multiple convolutions and upsampling modules, which is the image quality improvement model ultimately used for image quality improvement.
  • The training device proposed in the embodiment of the present application exploits the noise characteristics caused by low resolution and low bit rate and the general consistency of the pixel distributions of the original image and the terminal-decoded image.
  • A convolutional neural network is used to simulate the various complex image change processes caused by low-bit-rate, low-resolution encoding, transmission, and decoding in the video conferencing system to obtain the noise model; the noise model is then used to add noise to noise-free images to obtain second noisy images conforming to the noise characteristics of the video conferencing system, which are finally combined with the corresponding noise-free images to form the training data for the image quality improvement model, greatly improving the image-quality-improvement effect. After the training of the image quality improvement model is completed, convolution folding of the convolutional residual blocks reduces the number of convolutional layers and feature channels of the network and removes the residual connections, thereby reducing the parameter count and memory accesses of the image-quality-improvement convolutional neural network.
  • This embodiment is a device embodiment corresponding to the above embodiment of the training method for the image quality improvement model, and it can be implemented in cooperation with that method embodiment.
  • The relevant technical details mentioned in the training-method embodiment remain valid in this embodiment and are not repeated here to reduce duplication; conversely, the relevant technical details mentioned in this embodiment can also be applied in the training-method embodiment.
  • The embodiment of the present application also relates to a device for improving the image quality of a video conferencing system, as shown in Figure 4, including an acquisition module 401 and an image quality improvement module 402.
  • The acquisition module 401 is used to obtain the video images transmitted by the video conferencing system; the image quality improvement module 402 is used to improve the quality of the video images of the video conferencing system with the image quality improvement model and obtain quality-improved video images, where the image quality improvement model is obtained by the above training method of the image quality improvement model.
  • The acquisition module 401 reduces the resolution of the video image before the video conferencing system encodes it and lowers the bit rate of the system's encoder, then collects at the receiving end of the video conferencing system the video images encoded at the low bit rate, transmitted, and decoded by the system.
  • For each frame of the collected video images, the image quality improvement module 402 uses the image quality improvement model to raise the resolution of the video image and to enhance and denoise it, obtaining a quality-improved video image that is sent to the display device for display.
  • The image quality improvement module 402 splits the video image top-down into four video image blocks of equal size overlapping by 2 pixels, then starts four threads to enhance the blocks simultaneously, obtaining four high-definition video image blocks; these are fused according to the magnification factor to finally obtain the quality-improved video image, which is sent to the display device for display. Splitting the video image and improving its quality in parallel improves the actual display effect while also reducing inference time.
  • The device for improving the image quality of a video conferencing system uses only one image quality improvement model to accomplish both the super-resolution reconstruction and the enhancement-denoising task, enhancing image quality while also reducing the processing time of image quality enhancement.
  • The type of the model and the image format, data format, and data type of its output image are adapted, making the image quality improvement of the video conferencing system highly versatile, with a clear quality improvement and high speed, while also meeting deployment needs in real scenarios.
  • This embodiment is a device embodiment corresponding to the above embodiment of the method for improving the image quality of the video conferencing system, and it can be implemented in cooperation with that method embodiment.
  • The relevant technical details mentioned in the method embodiment remain valid in this embodiment and are not repeated here to reduce duplication; conversely, the relevant technical details mentioned in this embodiment can also be applied in the method embodiment.
  • Each module involved in the above two embodiments of this application is a logical module; a logical unit may be a physical unit, part of a physical unit, or a combination of multiple physical units.
  • Units not closely related to solving the technical problems raised in this application are not introduced in these embodiments, but this does not mean that no other units exist in these embodiments.
  • An embodiment of the present application also provides an electronic device, as shown in Figure 5, including at least one processor 501 and a memory 502 communicatively connected to the at least one processor 501; the memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501 so that the at least one processor can perform the above training method of the image quality improvement model or the above method of improving the image quality of the video conferencing system.
  • The bus can include any number of interconnected buses and bridges; the bus links together one or more processors and the various circuits of the memory.
  • The bus may also link together various other circuits, such as peripherals, voltage regulators, and power management circuits, which are all well known in the art and therefore will not be described further herein.
  • The bus interface provides an interface between the bus and the transceiver.
  • The transceiver may be one element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices over a transmission medium.
  • Data processed by the processor is transmitted over the wireless medium through the antenna; further, the antenna also receives data and passes it to the processor.
  • The processor is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions.
  • The memory can be used to store data used by the processor when performing operations.
  • Embodiments of the present application also provide a computer-readable storage medium storing a computer program; the above method embodiments are implemented when the computer program is executed by a processor.
  • Those skilled in the art can understand that all or part of the steps of the above method embodiments can be completed by a program instructing the relevant hardware; the program is stored in a storage medium and includes several instructions that cause a device (which may be a microcontroller, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of this application.
  • The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present application relate to the field of computer vision technology and disclose a training method for an image quality improvement model and a method for improving the image quality of a video conferencing system. The image quality improvement model is used to improve the quality of video images of a video conferencing system. The method includes: obtaining at least one noise model, where the noise model is trained on a first noisy image and is used to simulate the noise characteristics of the video conferencing system; obtaining a noise-free image set and inputting each noise-free image in the set into the noise model to obtain a second noisy image, the noise-free image and the second noisy image forming a training data pair; and using the training data pairs to train an initial image quality improvement model and, after training is complete, applying convolution folding to the trained image quality improvement model to obtain the final image quality improvement model.

Description

Training method for image quality improvement model and method for improving image quality of video conferencing system
Cross-Reference
This application claims priority to the Chinese patent application filed with the China Patent Office on April 20, 2022, with application number 202210420820.0 and invention title "Training method for image quality improvement model and method for improving image quality of video conferencing system", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the field of computer vision technology, and in particular to a training method for an image quality improvement model and a method for improving the image quality of a video conferencing system.
Background
Since the outbreak of the COVID-19 pandemic, epidemic prevention and control requirements have led many companies to gradually adopt remote working to reduce the movement of people, and global demand for remote communication applications such as online offices and online meetings has surged. In video conferencing scenarios, a higher bit rate puts greater pressure on network bandwidth; because video conferencing is a two-way interactive product, excessive latency increases speech overlap and can even cause audio and video to fall out of sync, affecting the experience of the entire video conference. To reduce the risk of high latency, video conferencing generally encodes and transmits video images at a low bit rate, which leaves the video images decoded by the receiving device with poor quality. In addition, the increasing number of high-definition display devices at the receiving end requires that image resolution and clarity not be too low. How to maintain high image quality while transmitting at a low bit rate in video conferencing is therefore a difficult problem that needs to be solved.
However, methods in the related art suffer from poor versatility, low speed, and insignificant image quality improvement, making it difficult to meet deployment needs in real scenarios.
Summary
The purpose of the present invention is to solve the above problems by providing a training method for an image quality improvement model and a method for improving the image quality of a video conferencing system, which solve the problems of slow image quality improvement and insignificant improvement effect in video conferencing systems.
To solve the above problems, embodiments of the present application provide a training method for an image quality improvement model, where the model is used to improve the quality of video images of a video conferencing system. The method includes: obtaining at least one noise model, where the noise model is trained on a first noisy image and is used to simulate the noise characteristics of the video conferencing system; obtaining a noise-free image set and inputting each noise-free image in the set into the noise model to obtain a second noisy image, the noise-free image and the second noisy image forming a training data pair; and using the training data pairs to train an initial image quality improvement model and, after training is complete, applying convolution folding to the trained model to obtain the final image quality improvement model.
To solve the above problems, embodiments of the present application provide a method for improving the image quality of a video conferencing system, including: obtaining video images transmitted by the video conferencing system; and using an image quality improvement model to improve the quality of the video images of the video conferencing system, obtaining quality-improved video images, where the image quality improvement model is obtained by the above training method of the image quality improvement model.
To solve the above problems, embodiments of the present application provide a training device for an image quality improvement model, where the model is used to improve the quality of video images of a video conferencing system, including: an acquisition module that obtains at least one noise model, where the noise model is trained on a first noisy image and is used to simulate the noise characteristics of the video conferencing system; a noise-adding module that obtains a noise-free image set and inputs each noise-free image in the set into the noise model to obtain a second noisy image, the noise-free image and the second noisy image forming a training data pair; and a training module that uses the training data pairs to train the initial image quality improvement model and, after training is complete, applies convolution folding to the trained model to obtain the final image quality improvement model.
To solve the above problems, embodiments of the present application provide a device for improving the image quality of a video conferencing system, including: an acquisition module that obtains video images transmitted by the video conferencing system; and an image quality improvement module that uses an image quality improvement model to improve the quality of the video images of the video conferencing system, obtaining quality-improved video images, where the image quality improvement model is obtained by the above training method of the image quality improvement model.
To solve the above problems, embodiments of the present application also provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above training method of the image quality improvement model or the above method for improving the image quality of the video conferencing system.
To solve the above problems, embodiments of the present application also provide a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the above training method of the image quality improvement model is implemented, or the above method for improving the image quality of the video conferencing system can be implemented.
Brief Description of the Drawings
One or more embodiments are illustrated by the figures in the corresponding drawings; these illustrations do not limit the embodiments. Elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures in the drawings are not drawn to scale.
Figure 1 is a flowchart of a training method for an image quality improvement model provided by an embodiment of the present application;
Figure 2 is a flowchart of a method for improving the image quality of a video conferencing system provided by an embodiment of the present application;
Figure 3 is a schematic structural diagram of a training device for an image quality improvement model provided by an embodiment of the present application;
Figure 4 is a schematic structural diagram of a device for improving the image quality of a video conferencing system provided by an embodiment of the present application;
Figure 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of this application clearer, the embodiments of this application are described in detail below with reference to the drawings. However, those of ordinary skill in the art can understand that many technical details are given in the embodiments of this application so that the reader can better understand this application; the technical solutions claimed in this application can be implemented even without these technical details and with various changes and modifications based on the following embodiments.
An embodiment of the present application relates to a training method for an image quality improvement model, where the image quality improvement model is used to improve the quality of video images of a video conferencing system. The method includes: obtaining at least one noise model, where the noise model is trained on a first noisy image and is used to simulate the noise characteristics of the video conferencing system; obtaining a noise-free image set and inputting each noise-free image in the set into the noise model to obtain a second noisy image, the noise-free image and the second noisy image forming a training data pair; and using the training data pairs to train the initial image quality improvement model and, after training is complete, applying convolution folding to the trained model to obtain the final image quality improvement model. This improves the quality of the video images while ensuring the real-time performance of the video conferencing system, solving the problems of slow image quality improvement and insignificant improvement effect.
The implementation details of the training method for the image quality improvement model in this embodiment are described in detail below. The following content is provided only to facilitate understanding of the implementation details of this solution and is not required for implementing it. The specific flow, shown in Figure 1, may include the following steps:
In step 101, at least one noise model is obtained, where the noise model is trained on a first noisy image and is used to simulate the noise characteristics of the video conferencing system.
In the embodiment of the present application, before the at least one noise model is obtained, a first noisy image set is acquired from the receiving end of the video conferencing system, where the pre-encoding resolution of each first noisy image in the first noisy image set is below a first threshold and the encoding bit rate of each first noisy image is below a second threshold.
In one example, the resolution of the video image before encoding by the video conferencing system is downsampled by a fixed factor, for example two or four times, that is, the resolution of the video image is reduced by a factor of two or four, and the bit rate of the encoder of the video conferencing system is reduced; the video images encoded at the low bit rate, transmitted, and decoded by the video conferencing system are collected at the receiving end of the video conferencing system, yielding the first noisy image set. The video conferencing system encodes the reduced-resolution video images at the reduced bit rate based on coding methods such as H.264 or H.265 to obtain bitstream data; the bitstream data is transmitted to the receiving end of the video conferencing system over the network, and the bitstream data is decoded there.
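This data-collection step can be approximated offline. The sketch below is a minimal illustration under assumptions not fixed by the text above: the 2x downsampling factor, the 200k bit rate, and the file names are all illustrative, and in a real video conferencing system the frames would instead be captured at the receiving end after network transmission.

```python
# Hypothetical offline approximation of first-noisy-image collection:
# downsample 2x, encode with H.264 at a low bit rate, then decode back
# to frames that play the role of "first noisy images".
import subprocess

subprocess.run(
    ["ffmpeg", "-y", "-i", "source.mp4",
     "-vf", "scale=iw/2:ih/2",           # fixed-factor downsampling
     "-c:v", "libx264", "-b:v", "200k",  # low-bit-rate H.264 encoding
     "low_bitrate.mp4"],
    check=True,
)
subprocess.run(  # decode the low-bit-rate stream back into frames
    ["ffmpeg", "-y", "-i", "low_bitrate.mp4", "first_noisy_%05d.png"],
    check=True,
)
```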
In one example, the first noisy images are used to train noise models. A noise model is composed of multiple convolutional residual blocks and a downsampling module; the downsampling method and factor in the noise model are the same as those in the video conferencing system, so that the downsampling the video image undergoes through the video conferencing system can be simulated. In addition, when the video conferencing system encodes, transmits, and decodes video images, the video images acquire blur, mosaic, ringing, and other noise; this noise can be regarded as filters with different weight parameters applied to the image, and such filters can be simulated by convolutional layers, so a noise model composed of convolutional layers and a downsampling module can learn the various complex changes a video image undergoes in the video conferencing system, thereby simulating the noise characteristics of the video conferencing system.
In the embodiment of the present application, a set of noise models is obtained based on the first noisy images, where each first noisy image in the first noisy image set corresponds to one noise model.
Because values such as the bit rate, frame rate, and image resolution in a video conferencing system change dynamically with the actual network conditions and the configurations of the sending and receiving ends, the first noisy images in the set collected from the receiving end have different noise characteristics. To cover the diverse image changes in video conferencing scenarios, one noise model is trained for each first noisy image in the collected first noisy image set, yielding a set of noise models.
In the embodiment of the present application, the first noisy image is input into the initial noise model to obtain a third noisy image, where, when the first noisy image is input, whether to superimpose sinc-filter noise is selected at random; the first noisy image and the third noisy image are input into a conventional convolutional neural network model to obtain a first set of feature maps of the first noisy image and a second set of feature maps of the third noisy image of identical size, and the initial noise model is trained according to the first and second sets of feature maps.
In one example, to strengthen ringing noise, whether to superimpose sinc-filter noise is selected at random when the first noisy image is input into the initial noise model; the kernel-size range of the sinc filter is configured according to the actual situation of the video conferencing system, and the size of the randomly superimposed sinc-filter noise is randomly selected within the kernel-size range of the sinc filter.
In addition, because the original image before encoding at the sending end and the decoded image at the receiving end of the video conferencing system are not necessarily scaled proportionally while their pixel distributions are largely consistent, and because the noise model learns the various complex changes the video image undergoes in the video conferencing system, the first noisy image input to the noise model and the third noisy image output by the noise model are likewise not necessarily proportionally scaled, while their pixel distributions remain largely consistent. A convolutional neural network model is therefore used to convert the first noisy image and the third noisy image into fixed-size feature maps; the feature map of the first noisy image is taken as the ground-truth value of the noise model and the feature map of the third noisy image as the predicted value of the noise model; the loss function of the noise model is constructed from the predicted and ground-truth values, and the noise model is trained iteratively according to the loss function.
In one example, when training the noise model, a first noisy image is divided into multiple image blocks of equal size, and one first noisy image block is input into the noise model to obtain a third noisy image block; then an image classification model (i.e., a conventional convolutional neural network model), such as ResNet (Residual Network) or the efficient neural network (EfficientNet) proposed by Google, is selected to extract the features of the first noisy image block and the third noisy image block, with the global average pooling layer, the fully connected layer, and other layers that contain no spatial features removed, retaining only the feature-extraction part. After feature extraction from the first and third noisy image blocks, a feature-size normalization method such as ROI Pooling (region-of-interest pooling) or ROI Align (region-of-interest alignment) is added to convert the extracted feature map of the first noisy image block and the feature map of the third noisy image block into two fixed-size sets of feature maps; based on the distance between the two sets of feature maps, iterative training continues with other first noisy image blocks until the noise model training is complete.
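The feature-map distance described above can be sketched as follows. The backbone choice (ResNet-18), adaptive average pooling standing in for ROI Pooling/Align as the fixed-size normalization, and L1 as the distance are assumptions of this example; the text only fixes the general family of components.

```python
# A minimal sketch of the feature-map distance used to train the noise model.
import torch
import torch.nn as nn
import torchvision.models as models

class FeatureDistance(nn.Module):
    def __init__(self, out_size=(8, 8)):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Keep only the spatial feature extractor: drop avgpool and fc.
        self.features = nn.Sequential(*list(backbone.children())[:-2]).eval()
        for p in self.features.parameters():
            p.requires_grad_(False)  # the extractor itself is not trained
        self.pool = nn.AdaptiveAvgPool2d(out_size)  # fixed-size feature maps

    def forward(self, real_block, generated_block):
        f_real = self.pool(self.features(real_block))      # "ground truth"
        f_gen = self.pool(self.features(generated_block))  # "prediction"
        return nn.functional.l1_loss(f_gen, f_real)

# Usage: loss = FeatureDistance()(first_noisy_block, third_noisy_block),
# backpropagated to update the noise model's parameters.
```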
In step 102, a noise-free image set is obtained, and the noise-free images in the noise-free image set are input into the noise model to obtain second noisy images; each noise-free image and its second noisy image form a training data pair.
In the embodiment of the present application, for each noise-free image in the acquired noise-free image set, the noise-free image is input into a noise model randomly selected from the set of noise models to obtain a second noisy image.
In one example, noise-free images from public datasets are obtained, for example by downloading and collecting public video image datasets such as BVI-DVC, LDV, DIV2K, and Flickr2K, to build the noise-free image set. Each noise-free image in the noise-free image set is traversed: a noise model is randomly selected from the set of trained noise models, the noise-free image is input into the noise model, and a second noisy image with video conferencing noise characteristics is obtained; the input noise-free image and the obtained second noisy image form a training data pair, used as training data for training the image quality improvement model.
Because the video image data in a video conferencing system generally must be forwarded through a multipoint control unit, it is difficult to achieve a frame-by-frame correspondence between the video images at the sending end and the receiving end, and because the video conferencing system generally adjusts the resolution of the video images dynamically according to the network conditions and the receiving-end configuration information, it cannot be guaranteed that the magnification between the high-resolution images and the low-resolution images in the collected data is consistent. In addition, most of the high-resolution images captured by the sending-end video acquisition device have already been pre-processed by the sending-end equipment and are not unprocessed original images; noise information has already been introduced into them. It is therefore difficult to obtain high-definition, noise-free images directly from a video conferencing system. Furthermore, after the bit rate and resolution are reduced, the video image additionally acquires blur noise caused by downsampling as well as mosaic, ringing, and other noise caused by the low bit rate; blur noise can be regarded as Gaussian filtering, mosaic noise as a convolution operation, and ringing noise as sinc filtering, and all of these filters can be simulated with a convolutional network. Based on these image characteristics, the embodiment of the present application proposes a method for generating the data required for image quality improvement network learning: there is no need to obtain high-definition, noise-free images from the sending end of the video conferencing system; instead, the respective image change processes caused by low-bit-rate, low-resolution encoding, transmission, and decoding in the video conferencing system are simulated directly to obtain a noise model with the noise characteristics of the video conferencing system; the noise model is used to add noise to the high-resolution, noise-free images of the public datasets to obtain second noisy images, which then form training data pairs with the high-resolution, noise-free images for image quality improvement model learning.
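A sketch of the resulting pair-generation loop is shown below; the noise-model interface, file layout, and image I/O are illustrative assumptions, since the text only fixes the logic of picking a random trained noise model for each clean image.

```python
# Minimal sketch: degrade each clean image with a randomly chosen noise model.
import random
from pathlib import Path

import torch
import torchvision.io as tvio

@torch.no_grad()
def build_training_pairs(clean_dir, noisy_dir, noise_models):
    out = Path(noisy_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(clean_dir).glob("*.png")):
        clean = tvio.read_image(str(path)).float() / 255.0   # CHW in [0, 1]
        model = random.choice(noise_models)                  # per-image choice
        noisy = model(clean.unsqueeze(0)).squeeze(0).clamp(0, 1)
        tvio.write_png((noisy * 255).byte(), str(out / path.name))
        # (clean image, written noisy image) form one training data pair.
```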
In step 103, the training data pairs are used to train the initial image quality improvement model, and after training is complete, convolution folding is applied to the trained image quality improvement model to obtain the final image quality improvement model.
In the embodiment of the present application, before the training data pairs are used to train the initial image quality improvement model, the training data pairs are converted into binary data format or lmdb (Lightning Memory-Mapped Database) format; patches of a specified size are randomly extracted from the second noisy image of each converted training data pair, and corresponding patches are extracted from the noise-free image of the converted pair; using the training data pairs to train the initial image quality improvement model then includes training the model with the patches extracted from the training data pairs.
In one example, the data of the training data pairs is converted into binary format or lmdb database format to increase data-reading access speed during training of the image quality improvement model.
In one example, a group of training data pairs is randomly selected from the training set in binary or lmdb database format, and patches of a specified size, for example 64x64, are randomly cropped from the second noisy image; the patch size can be configured flexibly according to the hardware resources used for training, and if GPU memory or system memory is ample, sizes such as 96x96, 128x128, or 192x192 can be set. Let the scaling ratio be s: after the 64x64 patch is cropped from the second noisy image, a 64s x 64s patch is cropped at the corresponding position of the noise-free image. Each training iteration samples multiple groups of data for model training; the exact number of sampled data groups is determined by the actual training situation.
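The paired cropping can be sketched as below, assuming the second noisy image is the clean image downscaled by the factor s, so that a 64x64 noisy patch corresponds to a 64s x 64s clean patch; function and variable names are illustrative.

```python
import random

def paired_random_crop(noisy, clean, patch=64, s=2):
    """noisy: C x H x W; clean: C x sH x sW; returns aligned patches."""
    _, h, w = noisy.shape
    top = random.randint(0, h - patch)
    left = random.randint(0, w - patch)
    noisy_patch = noisy[:, top:top + patch, left:left + patch]
    clean_patch = clean[:, s * top:s * (top + patch),
                        s * left:s * (left + patch)]
    return noisy_patch, clean_patch
```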
In one example, the base training loss of the image quality improvement model is the L1 loss. To enhance the high-frequency texture of the images displayed on the video conferencing terminal, the Canny edge-detection maps of the video image output by the image quality improvement model and of the label video image are computed separately, and the L1 loss between the edge maps is then computed; the weighted sum of the two losses is used as the final loss, with an edge-loss weight of 0.5 and a base L1-loss weight of 1.0.
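One way to realize this combined loss is sketched below. Canny itself is not differentiable, so the sketch substitutes a Sobel gradient magnitude as the edge map; treating Sobel as a stand-in for the Canny maps named above is an assumption of this example, as are the function names.

```python
import torch
import torch.nn.functional as F

def edge_map(x):
    """Per-channel Sobel gradient magnitude of an NCHW image tensor."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    c = x.shape[1]
    gx = F.conv2d(x, kx.expand(c, 1, 3, 3), padding=1, groups=c)
    gy = F.conv2d(x, ky.expand(c, 1, 3, 3), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def total_loss(output, label):
    base = F.l1_loss(output, label)                      # base loss, weight 1.0
    edge = F.l1_loss(edge_map(output), edge_map(label))  # edge loss, weight 0.5
    return 1.0 * base + 0.5 * edge
```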
In one example, the image quality improvement model is composed of multiple convolutional residual blocks and an upsampling module. The model is trained for multiple iterations under the above training configuration so that its loss function decreases gradually; after convergence, a trained image quality improvement model with fixed weight parameters is obtained.
In this embodiment of the present application, applying convolution folding to the trained image quality improvement model comprises: traversing all convolutional residual blocks in the model and folding each convolutional residual block into a single convolution; and copying the weights of the single convolutions into the convolution-folded image quality improvement model.
In one example, convolution folding is applied to the trained image quality improvement model based on the properties of convolutional layers: the convolutional residual blocks in the model are traversed, convolution folding is applied to every foldable block, all foldable convolutional residual blocks are folded into single convolutions, and the folded convolution weights are copied into a new image quality improvement model composed of multiple convolutions and an upsampling module, which is the final model used for image quality improvement.
In one example, a convolutional residual block consists of several sequentially connected convolutional layers and several residual connections. The first convolutional layer has a small number of input channels and a large number of output channels, the last convolutional layer has a large number of input channels and a small number of output channels, and the intermediate convolutional layers have large channel counts on both input and output. One residual connection links the input of the convolutional residual block to the block's output; the other residual connections link the inputs of intermediate convolutional layers to those layers' outputs. Taking a block of 3 convolutional layers and 2 residual connections as an example, the layer configurations may be set to 8x256x1x1, 256x256x3x3 and 256x8x1x1, where the first number is the number of input channels, the second the number of output channels, and the third the kernel size; no convolutional residual block performs upsampling or downsampling, and all convolution strides are 1. The first residual connection links the block's input to its output, and the second links the input of the 3x3 convolution to its output. By fusing the first and second convolutional layers and then fusing the result with the third, the block can be folded into a single 8x8x3x3 convolution.
In one example, a residual connection can be regarded as a convolutional layer whose weight parameters form an identity matrix; by the additivity of convolution it can be added to the weights and bias of the corresponding convolutional layer, fusing the two into a single convolutional layer. Taking 3 convolutions as an example, let each convolution compute y = w*x + b, where w is the convolution weight, b is the bias, and * denotes the convolution of w with x. The three convolutions then compute y3 = w3*(w2*(w1*x + b1) + b2) + b3, which expands to y3 = (w3*w2*w1)*x + w3*w2*b1 + w3*b2 + b3, so the folded convolution weight is w3*w2*w1 and the folded bias is w3*w2*b1 + w3*b2 + b3. The folded weight w3*w2*w1 is computed by taking an identity input, convolving it successively with the weight matrices w1, w2 and w3, then flipping the result and rearranging the data order to match the convolution kernel parameter format. The bias term w3*w2*b1 is obtained by expanding b1 into a kxk kernel and applying matrix operations with w2 and then w3, and the bias term w3*b2 by expanding b2 into a kxk kernel and applying a matrix operation with w3. After the serial convolutions are fused into a single convolution, the outermost residual connection is fused with that convolution into a single convolution, completing the convolution folding of the residual block.
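The folding algebra above can be checked numerically. The sketch below, assuming PyTorch, builds the 8x256x1x1 / 256x256x3x3 / 256x8x1x1 block with its two residual connections, folds it into a single 8x8x3x3 convolution, and verifies that the outputs agree (exactly in the interior; a one-pixel border differs because the constant b1 map meets the 3x3 convolution's zero padding):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
torch.set_default_dtype(torch.float64)   # double precision for an exact check
cin, mid, k = 8, 256, 3                  # 8x256x1x1, 256x256x3x3, 256x8x1x1
w1, b1 = torch.randn(mid, cin, 1, 1), torch.randn(mid)
w2, b2 = torch.randn(mid, mid, k, k) * 0.05, torch.randn(mid)
w3, b3 = torch.randn(cin, mid, 1, 1), torch.randn(cin)

def block(x):
    y = F.conv2d(x, w1, b1)                 # 1x1: small -> large channels
    y = F.conv2d(y, w2, b2, padding=1) + y  # inner residual over the 3x3 conv
    return F.conv2d(y, w3, b3) + x          # outer residual over the whole block

# inner residual == adding an identity (center-delta) kernel to the 3x3 conv
w2f = w2.clone()
w2f[torch.arange(mid), torch.arange(mid), k // 2, k // 2] += 1.0

# fold the three convolutions: K[o,i,u,v] = sum_{m,n} w3[o,m] w2f[m,n,u,v] w1[n,i]
wf = torch.einsum('om,mnuv,ni->oiuv', w3[:, :, 0, 0], w2f, w1[:, :, 0, 0])
# folded bias: w3 w2 b1 + w3 b2 + b3 (b1 enters the 3x3 conv as a constant map)
bf = w3[:, :, 0, 0] @ (w2f.sum(dim=(2, 3)) @ b1 + b2) + b3
# outer residual == identity kernel on the block's own input channels
wf[torch.arange(cin), torch.arange(cin), k // 2, k // 2] += 1.0

x = torch.randn(1, cin, 32, 32)
y_ref, y_fold = block(x), F.conv2d(x, wf, bf, padding=1)
# exact in the interior; the 1-pixel border differs due to b1 meeting zero padding
assert torch.allclose(y_ref[..., 1:-1, 1:-1], y_fold[..., 1:-1, 1:-1], atol=1e-8)
```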
By applying convolution folding to the convolutional residual blocks, the number of convolutional layers in the image quality improvement convolutional neural network (the image quality improvement model) is reduced, the number of feature channels is reduced, and the residual connections are removed, thereby reducing the model's parameter count and number of memory accesses. This further increases the inference speed of the image quality improvement method and reduces resource consumption without changing the accuracy of the image quality improvement.
In this embodiment of the present application, int8 quantization is applied to the image quality improvement model, and the int8-quantized model is taken as the final image quality improvement model. Moving the pre- and post-processing operations of model inference, such as mean addition/subtraction, dimension permutation and color-space conversion, inside the structure of the image quality improvement model can effectively increase the speed of image quality improvement on video images.
In this embodiment of the present application, the image format of the input and output images of the image quality improvement model is RGB (Red Green Blue); adopting RGB as the model's image format yields better inference results.
In this embodiment of the present application, a noise model simulating the noise characteristics of the video conference system is trained on first noisy images; the noise model is used to add noise to noise-free images to obtain second noisy images carrying the system's noise characteristics; the image quality improvement model is trained on the noise-free images and the second noisy images; and finally convolution folding is applied to the trained model. This improves the image quality enhancement effect while preserving the real-time performance of the system. An image quality improvement model trained with the training method provided by this embodiment can perform super-resolution reconstruction and enhancement/denoising at the same time, reducing the processing time of image quality improvement, improving the quality of video images, and solving the problems existing in conventional techniques.
In the training method for the image quality improvement model proposed in this embodiment of the present application, based on the noise characteristics caused by low resolution and low bitrate and on the fact that the pixel distributions of the original image and the terminal-decoded image are largely consistent, convolutional neural networks are used to simulate the various complex image transformation processes caused by low-bitrate, low-resolution encoding, transmission and decoding in the video conference system, yielding a noise model; the noise model is then used to add noise to noise-free images, yielding second noisy images that match the system's noise characteristics; and finally these, together with the corresponding noise-free images, form training data pairs on which the image quality improvement model is trained. Furthermore, the convolutional neural network features and weight parameters learned for the two tasks of super-resolution reconstruction and enhancement/denoising exhibit clear similarity; exploiting this property, this embodiment proposes an image quality improvement convolutional neural network (image quality improvement model) that performs super-resolution reconstruction and enhancement/denoising simultaneously. Unlike the learning methods of related techniques, this embodiment constructs mixed super-resolution/denoising data for the network to learn from, so that a single single-branch convolutional neural network accomplishes both tasks at once, enhancing image quality while reducing the processing time of the enhancement and greatly improving the result. Moreover, after training of the model is complete, convolution folding is applied to the convolutional residual blocks, reducing the number of convolutional layers and feature channels of the network and removing the residual connections, thereby reducing the model's parameter count and number of memory accesses; this further increases the inference speed of the image quality improvement method and reduces resource consumption without changing the accuracy of the image quality improvement.
An embodiment of the present application further relates to a method for improving the image quality of a video conference system, comprising: acquiring a video image transmitted by the video conference system; and applying an image quality improvement model to the video image of the system to obtain a quality-improved video image, where the image quality improvement model is obtained by the above training method for the image quality improvement model.
The implementation details of the method for improving the image quality of the video conference system in this embodiment are described below. The following content is provided only to facilitate understanding of the implementation details of this solution and is not required for implementing it. The specific flow is shown in Fig. 2 and may include the following steps:
In step 201, a video image transmitted by the video conference system is acquired.
In one example, before the video conference system encodes the video image, the resolution of the video image is reduced and the bitrate of the system's encoder is lowered; the video image that has passed through the system's low-bitrate encoding, transmission and decoding is then collected at the receiving end of the video conference system.
In step 202, an image quality improvement model is applied to the video image of the video conference system to obtain a quality-improved video image; the image quality improvement model is obtained by the above training method for the image quality improvement model.
In one example, for each frame of the collected video images, the image quality improvement model is used to raise the resolution of the video image and to enhance and denoise it, and the quality-improved video image is sent to a display device for display.
In this embodiment of the present application, before the image quality improvement model is applied to the video image of the video conference system, the type of the image quality improvement model is converted into the type required by the video conference system terminal.
In one example, the image quality improvement model is converted into the engine type required for deployment on the video conference terminal, such as MNN (Mobile Neural Network), TNN (Tencent Neural Network), TFLITE (TensorFlow Lite) or ONNX (Open Neural Network Exchange), and model quantization is applied.
In this embodiment of the present application, before the image quality improvement model is applied to the video image of the video conference system, the method further comprises: splitting the video image into N video image blocks of identical size, with an overlap of M pixels between adjacent blocks; inputting the N identically sized video image blocks into the image quality improvement model at the same time to obtain N high-definition video image blocks; and fusing the N high-definition blocks into the quality-improved video image, where M and N are both integers greater than 1.
In one example, before the video image is input into the image quality improvement model, it is split top-to-bottom into four blocks of equal size overlapping by 2 pixels; 4 threads are then started to enhance the blocks simultaneously, yielding four high-definition blocks, which are fused according to the magnification factor to produce the final quality-improved video image sent to the display device for display. Splitting the video image and improving its quality in parallel improves the actual display result while also reducing inference time.
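A minimal sketch of this split-enhance-fuse flow, assuming PyTorch and a model mapping a (1, 3, h, w) tensor to a (1, 3, h*s, w*s) tensor; the strip geometry (height divisible by n) and the naive seam handling (trimming the scaled overlap from all but the last strip) are illustrative assumptions:

```python
import torch
from concurrent.futures import ThreadPoolExecutor

def enhance_tiled(model, frame, n=4, overlap=2, scale=2):
    """frame: (3, H, W). Split top-to-bottom into n strips overlapping by
    `overlap` pixels, enhance them concurrently, then stitch at scale s."""
    _, h, _ = frame.shape
    step = h // n                       # assumes h is divisible by n
    tiles = []
    for i in range(n):
        top = i * step
        bot = h if i == n - 1 else (i + 1) * step + overlap
        tiles.append(frame[:, top:bot, :])
    with ThreadPoolExecutor(max_workers=n) as pool:
        outs = list(pool.map(lambda t: model(t.unsqueeze(0)).squeeze(0), tiles))
    # naive fusion: trim the scaled overlap from the bottom of all but the last
    rows = [o if i == n - 1 else o[:, :-overlap * scale, :]
            for i, o in enumerate(outs)]
    return torch.cat(rows, dim=1)

model = torch.nn.Upsample(scale_factor=2, mode="bilinear")   # stand-in IQ model
out = enhance_tiled(model, torch.rand(3, 360, 640))
assert out.shape == (3, 720, 1280)
```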
It should be noted that this embodiment of the present application does not restrict the specific number of splits or the size of the overlap in pixels; the specific values depend on the actual conditions of the terminal.
In this embodiment of the present application, after the quality-improved video image is obtained, the image format of the high-resolution video image is converted into the image format of the video conference system, its data format is converted into the data format of the system, and its data type is converted into the data type of the system.
In one example, the video image format of the video conference system is YUV, where Y denotes luminance and U and V denote chrominance; the meaning and importance of the three components differ. A convolutional neural network, however, treats the different channels of its input image with equal weight, so when YUV is used as the input/output image format of the image quality improvement network, the improvement effect is clearly inferior to using RGB. This embodiment therefore adopts RGB as the image format of the input and output images, NCHW as the data format, and float as the data type. Since the image format of the video conference system is YUV, its data type is unsigned char, and its data format is NHWC, the video image output by the image quality improvement model must be converted into formats suitable for the system. Methods based on GPU (Graphics Processing Unit) operators are provided for color-space conversion between YUV and RGB, data-format conversion between NHWC and NCHW, and data-type conversion between unsigned char and float, thereby reducing CPU (Central Processing Unit) computation and lowering the CPU consumption of video conference terminal deployment.
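As a CPU reference for the conversions described above (the application performs them with GPU operators), the sketch below converts one YUV444 frame stored as unsigned 8-bit HWC data into a float NCHW RGB tensor, using BT.601 full-range coefficients; the planar layout and value ranges are assumptions:

```python
import numpy as np

def yuv444_to_rgb_nchw(yuv):
    """yuv: (H, W, 3) uint8, NHWC-style layout for one frame (YUV444).
    Returns (1, 3, H, W) float32 RGB in [0, 1] - BT.601 full-range math."""
    y = yuv[..., 0].astype(np.float32)
    u = yuv[..., 1].astype(np.float32) - 128.0
    v = yuv[..., 2].astype(np.float32) - 128.0
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    rgb = np.clip(np.stack([r, g, b], axis=-1), 0, 255) / 255.0  # HWC float
    return rgb.transpose(2, 0, 1)[None]         # HWC -> CHW -> NCHW

frame = np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8)  # stand-in frame
tensor = yuv444_to_rgb_nchw(frame)
assert tensor.shape == (1, 3, 360, 640) and tensor.dtype == np.float32
```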
With the method for improving the image quality of a video conference system provided by this embodiment of the present application, a single image quality improvement model accomplishes both the super-resolution reconstruction task and the enhancement/denoising task, enhancing image quality while reducing the processing time of the enhancement. In addition, based on the requirements of the conference system, the type of the image quality improvement model and the image format, data format and data type of its output images are modified, so that the method is highly general and, while improving image quality markedly and quickly, can also meet deployment requirements in practical scenarios.
The division of the steps of the above methods is made only for clarity of description; in implementation, steps may be merged into one step, or a step may be split into multiple steps, and all such variants fall within the protection scope of this patent as long as they include the same logical relationships. Adding insignificant modifications to an algorithm or flow, or introducing insignificant designs, without changing the core design of the algorithm and flow also falls within the protection scope of this patent.
An embodiment of the present application further relates to a training apparatus for an image quality improvement model. As shown in Fig. 3, it comprises an acquisition module 301, a noise-adding module 302 and a training module 303.
Specifically, the acquisition module 301 is configured to acquire at least one noise model, where the noise model is trained on first noisy images and is used to simulate the noise characteristics of a video conference system; the noise-adding module 302 is configured to acquire a noise-free image set and input the noise-free images in the set into the noise model to obtain second noisy images, a noise-free image and its second noisy image forming a training data pair; and the training module 303 is configured to train the initial image quality improvement model with the training data pairs and, after training is complete, apply convolution folding to the trained image quality improvement model to obtain the final image quality improvement model.
In one example, the acquisition module 301 trains the noise model with the first noisy images. The noise model consists of multiple convolutional residual blocks and a downsampling module, and the downsampling method and factor in the noise model are the same as those in the video conference system, so it can simulate the downsampling that video images undergo in the system. In addition, encoding, transmitting and decoding video images in the video conference system introduces noise such as blur, blocking and ringing; such noise can be regarded as filters with different weight parameters applied to the image, such filters can be simulated by convolutional layers, and a noise model composed of convolutional layers and a downsampling module can therefore learn the various complex transformations that video images undergo in the system, thereby simulating its noise characteristics.
In one example, the noise-adding module 302 obtains noise-free images from public datasets - for example, public video image datasets such as BVI-DVC, LDV, DIV2K and Flickr2K are downloaded and collected - to build the noise-free image set. It traverses each noise-free image in the set, randomly selects a noise model from the collection of trained noise models, and inputs the noise-free image into it to obtain a second noisy image carrying the noise characteristics of the video conference system; the input noise-free image and the resulting second noisy image form a training data pair used as training data for the image quality improvement model.
In one example, the training module 303 randomly selects a training data pair from the training set in binary or lmdb format and randomly crops a patch of a specified size, for example 64x64, from the second noisy image. The patch size can be configured flexibly according to the hardware resources used for training; with more GPU or main memory it can be set to 96x96, 128x128, 192x192 and so on. With scale factor s, after the patch is cropped from the second noisy image, a 64s x 64s patch is cropped at the corresponding position of the noise-free image. Multiple groups of data are sampled per iteration for model training, the exact number depending on the actual training conditions.
In one example, the training module 303 applies convolution folding to the trained image quality improvement model based on the properties of convolutional layers: it traverses the convolutional residual blocks in the model, applies convolution folding to every foldable block, folds all foldable convolutional residual blocks into single convolutions, and copies the folded convolution weights into a new image quality improvement model composed of multiple convolutions and an upsampling module, which is the final model used for image quality improvement.
In the training apparatus for the image quality improvement model proposed in this embodiment of the present application, based on the noise characteristics caused by low resolution and low bitrate and on the largely consistent pixel distributions of the original image and the terminal-decoded image, convolutional neural networks are used to simulate the various complex image transformation processes caused by low-bitrate, low-resolution encoding, transmission and decoding in the video conference system, yielding a noise model; the noise model is then used to add noise to noise-free images, yielding second noisy images that match the system's noise characteristics; finally these, together with the corresponding noise-free images, form training data pairs on which the image quality improvement model is trained, greatly improving the image quality enhancement result. After training of the model is complete, convolution folding is applied to the convolutional residual blocks, reducing the number of convolutional layers and feature channels of the image quality improvement convolutional neural network and removing the residual connections, thereby reducing the network's parameter count and number of memory accesses; this further increases the inference speed of the image quality improvement method and reduces resource consumption without changing the accuracy of the image quality improvement.
It is evident that this implementation is an apparatus embodiment corresponding to the above embodiment of the training method for the image quality improvement model, and that the two can be implemented in cooperation with each other. The related technical details mentioned in the method embodiment remain valid in this implementation and, to reduce repetition, are not restated here; correspondingly, the related technical details mentioned in this implementation can also be applied in the method embodiment.
An embodiment of the present application further relates to an apparatus for improving the image quality of a video conference system. As shown in Fig. 4, it comprises an acquisition module 401 and an image quality improvement module 402.
Specifically, the acquisition module 401 is configured to acquire a video image transmitted by the video conference system, and the image quality improvement module 402 is configured to apply an image quality improvement model to the video image of the system to obtain a quality-improved video image, where the image quality improvement model is obtained by the above training method for the image quality improvement model.
In one example, before the video conference system encodes the video image, the acquisition module 401 reduces the resolution of the video image and lowers the bitrate of the system's encoder, and collects, at the receiving end of the video conference system, the video image that has passed through the system's low-bitrate encoding, transmission and decoding.
In one example, for each frame of the collected video images, the image quality improvement module 402 uses the image quality improvement model to raise the resolution of the video image and to enhance and denoise it, obtains the quality-improved video image, and sends it to a display device for display.
In one example, before inputting the video image into the image quality improvement model, the image quality improvement module 402 splits it top-to-bottom into four blocks of equal size overlapping by 2 pixels, then starts 4 threads to enhance the blocks simultaneously, yielding four high-definition blocks, which are fused according to the magnification factor to produce the final quality-improved video image sent to the display device for display. Splitting the video image and improving its quality in parallel improves the actual display result while also reducing inference time.
With the apparatus for improving the image quality of a video conference system provided by this embodiment of the present application, a single image quality improvement model accomplishes both the super-resolution reconstruction task and the enhancement/denoising task, enhancing image quality while reducing the processing time of the enhancement. In addition, based on the requirements of the conference system, the type of the image quality improvement model and the image format, data format and data type of its output images are modified, so that the image quality improvement is highly general and, while improving image quality markedly and quickly, can also meet deployment requirements in practical scenarios.
It is evident that this implementation is an apparatus embodiment corresponding to the above embodiment of the method for improving the image quality of the video conference system, and that the two can be implemented in cooperation with each other. The related technical details mentioned in the method embodiment remain valid in this implementation and, to reduce repetition, are not restated here; correspondingly, the related technical details mentioned in this implementation can also be applied in the method embodiment.
It is worth mentioning that the modules involved in the above two implementations of the present application are all logical modules. In practical applications, a logical unit may be one physical unit, part of one physical unit, or a combination of multiple physical units. Furthermore, to highlight the innovative parts of the present application, units not closely related to solving the technical problem posed by the present application are not introduced in these implementations, which does not mean that no other units exist therein.
An embodiment of the present application further provides an electronic device. As shown in Fig. 5, it comprises at least one processor 501 and a memory 502 communicatively connected to the at least one processor 501, where the memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501 to enable the at least one processor to perform the above training method for the image quality improvement model or the above method for improving the image quality of the video conference system.
The memory and the processor are connected by a bus, which may comprise any number of interconnected buses and bridges linking the various circuits of the one or more processors and the memory. The bus may also link various other circuits such as peripherals, voltage regulators and power management circuits, all of which are well known in the art and therefore not described further herein. A bus interface provides the interface between the bus and the transceiver. The transceiver may be one element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor is transmitted over a wireless medium via an antenna; the antenna further receives data and forwards it to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions, including timing, peripheral interfacing, voltage regulation, power management and other control functions, while the memory may be used to store data used by the processor when performing operations.
The above products can perform the methods provided by the embodiments of the present application and possess the functional modules and beneficial effects corresponding to performing those methods; for technical details not described exhaustively in this embodiment, refer to the methods provided by the embodiments of the present application.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the above method embodiments.
Those skilled in the art will appreciate that all or some of the steps of the methods in the above embodiments can be accomplished by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions for causing a device (which may be a microcontroller, a chip, etc.) or a processor to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above embodiments are provided for those of ordinary skill in the art to implement and use the present application. Those of ordinary skill in the art may make various modifications or changes to the above embodiments without departing from the inventive idea of the present application; the protection scope of the present application is therefore not limited by the above embodiments but should conform to the maximum scope of the innovative features mentioned in the claims.

Claims (13)

  1. A training method for an image quality improvement model, wherein the image quality improvement model is used to improve the image quality of video images of a video conference system, the method comprising:
    acquiring at least one noise model, wherein the noise model is trained on a first noisy image and is used to simulate noise characteristics of the video conference system;
    acquiring a noise-free image set, and inputting a noise-free image in the noise-free image set into the noise model to obtain a second noisy image, the noise-free image and the second noisy image forming a training data pair; and
    training an initial image quality improvement model with the training data pair and, after training is complete, applying convolution folding to the trained image quality improvement model to obtain a final image quality improvement model.
  2. The training method for an image quality improvement model according to claim 1, wherein training the noise model comprises:
    inputting the first noisy image into an initial noise model to obtain a third noisy image, wherein, when the first noisy image is input, whether to superimpose sinc-filter noise is chosen at random; and
    inputting the first noisy image and the third noisy image into a convolutional neural network model to obtain a first group of feature maps of the first noisy image and a second group of feature maps of the third noisy image with identical sizes, and training the initial noise model according to the first group of feature maps and the second group of feature maps.
  3. The training method for an image quality improvement model according to claim 1, further comprising, before acquiring the at least one noise model:
    acquiring a first noisy image set from a receiving end of the video conference system, wherein the resolution of each first noisy image in the first noisy image set before encoding is below a first threshold, and the bitrate at which each first noisy image is encoded is below a second threshold;
    wherein acquiring the at least one noise model comprises:
    acquiring a collection of noise models based on the first noisy images, wherein each first noisy image in the first noisy image set corresponds to one noise model;
    and wherein inputting a noise-free image in the noise-free image set into the noise model to obtain a second noisy image comprises:
    for each noise-free image in the acquired noise-free image set, inputting the noise-free image into a noise model randomly selected from the collection of noise models to obtain a second noisy image.
  4. The training method for an image quality improvement model according to claim 1, further comprising, before training the initial image quality improvement model with the training data pair:
    converting the training data pair into a binary data format or the Lightning Memory-Mapped Database (lmdb) format; and
    randomly extracting patches of a specified size from the second noisy image of the converted training data pair, and extracting corresponding patches from the noise-free image of the converted data pair;
    wherein training the initial image quality improvement model with the training data pair comprises:
    training the image quality improvement model with the patches extracted from the training data pair.
  5. The training method for an image quality improvement model according to claim 1, wherein applying convolution folding to the trained image quality improvement model comprises:
    traversing all convolutional residual blocks in the image quality improvement model and, for each convolutional residual block, folding the convolutional residual block into a single convolution; and
    copying the weights of the single convolution into the convolution-folded image quality improvement model.
  6. The training method for an image quality improvement model according to any one of claims 1-5, further comprising, after applying convolution folding to the trained image quality improvement model:
    applying int8 quantization to the image quality improvement model, and taking the int8-quantized model as the final image quality improvement model.
  7. The training method for an image quality improvement model according to any one of claims 1-5, wherein the image format of the input and output images of the image quality improvement model is the RGB format.
  8. A method for improving the image quality of a video conference system, comprising:
    acquiring a video image transmitted by the video conference system; and
    applying an image quality improvement model to the video image of the video conference system to obtain a quality-improved video image, wherein the image quality improvement model is obtained by the training method for an image quality improvement model according to any one of claims 1 to 7.
  9. The method for improving the image quality of a video conference system according to claim 8, further comprising, before applying the image quality improvement model to the video image of the video conference system:
    converting the type of the image quality improvement model into a type required by a terminal of the video conference system.
  10. The method for improving the image quality of a video conference system according to claim 8, further comprising, before applying the image quality improvement model to the video image of the video conference system:
    splitting the video image into N video image blocks of identical size, with an overlap of M pixels between two adjacent video image blocks;
    wherein applying the image quality improvement model to the video image of the video conference system to obtain a quality-improved video image comprises:
    inputting the N identically sized video image blocks into the image quality improvement model at the same time to obtain N high-definition video image blocks, and fusing the N high-definition video image blocks into the quality-improved video image, wherein M and N are both integers greater than 1.
  11. The method for improving the image quality of a video conference system according to any one of claims 8-10, further comprising, after obtaining the quality-improved video image:
    converting the image format of the quality-improved video image into an image format of the video conference system;
    converting the data format of the quality-improved video image into a data format of the video conference system; and
    converting the data type of the quality-improved video image into a data type of the video conference system.
  12. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the training method for an image quality improvement model according to any one of claims 1 to 7 or the method for improving the image quality of a video conference system according to any one of claims 8 to 11.
  13. A computer-readable storage medium storing a computer program, wherein, when executed by a processor, the computer program implements the training method for an image quality improvement model according to any one of claims 1 to 7 or the method for improving the image quality of a video conference system according to any one of claims 8 to 11.
PCT/CN2023/087910 2022-04-20 2023-04-12 Training method for image quality improvement model and method for improving image quality of video conference system WO2023202447A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210420820.0A CN116977191A (zh) 2022-04-20 2022-04-20 Training method for image quality improvement model and method for improving image quality of video conference system
CN202210420820.0 2022-04-20

Publications (1)

Publication Number Publication Date
WO2023202447A1 true WO2023202447A1 (zh) 2023-10-26

Family

ID=88419132

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/087910 WO2023202447A1 (zh) 2022-04-20 2023-04-12 画质提升模型的训练方法和视频会议系统画质的提升方法

Country Status (2)

Country Link
CN (1) CN116977191A (zh)
WO (1) WO2023202447A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229525A (zh) * 2017-05-31 2018-06-29 Sensetime Group Ltd. Neural network training and image processing method and apparatus, electronic device and storage medium
CN107993200A (zh) * 2017-11-02 2018-05-04 Tianjin University Image noise level estimation method based on deep learning
WO2022012888A1 (en) * 2020-07-14 2022-01-20 Asml Netherlands B.V. Apparatus and methods for generating denoising model
CN113822289A (zh) * 2021-06-15 2021-12-21 Tencent Technology (Shenzhen) Co., Ltd. Training method, apparatus, device and storage medium for image noise reduction model
CN113628146A (zh) * 2021-08-30 2021-11-09 National University of Defense Technology Image denoising method based on deep convolutional network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649362A (zh) * 2024-01-29 2024-03-05 Shandong Normal University Laparoscopic image smoke removal method, system and device based on conditional diffusion model
CN117649362B (zh) * 2024-01-29 2024-04-26 Shandong Normal University Laparoscopic image smoke removal method, system and device based on conditional diffusion model

Also Published As

Publication number Publication date
CN116977191A (zh) 2023-10-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23791102

Country of ref document: EP

Kind code of ref document: A1